# Assignment2 - Supervised Learning flow

# Part 1 - Student details:
* Please write the First Name and last 4 digits of the i.d. for each student. For example:
<pre>Israel 9812</pre>

In [None]:
# student 1: Alina 5247

## Part 2 - Initial Preparations 

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Suppress warnings for clean output
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the datasets
wine_train = pd.read_csv('./wine_train.csv')
wine_test = pd.read_csv('./wine_test.csv')

In [None]:
# First five rows of the training set
print("First five rows of the training set:")
display(wine_train.head())

# First five rows of the test set
print("First five rows of the test set:")
display(wine_test.head())

#### Dataset Characteristics
An overview of the dataset’s structure and contents, highlighting key features and patterns to guide further analysis.

In [None]:
# Check for missing values in training data
print("Missing values in training data:")
print(wine_train.isna().sum())

In [None]:
wine_train.shape

In [None]:
wine_train.info()

In [None]:
# Count Unique Values per Feature
for col in wine_train.columns:
    print(f"{col}: {wine_train[col].nunique()} unique values")

In [None]:
# Check for Duplicate Records
duplicate_rows = wine_train.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

#### Summary
* <b>Training and Test Sets:</b> Both sets contain 14 features related to the chemical properties of wines and a target variable indicating the wine class.
* The first five rows show diverse values across features like alcohol, malic acid, ash, and proline.
* Significant range and variation in feature values suggest a rich dataset suitable for modeling.
* No duplicate rows found in the training set, ensuring data uniqueness and integrity.
#### Key Features: 
* <b>Chemical Properties:</b> Features such as alcohol content, malic acid, ash etc.
* <b>Target Variable:</b> The target column classifies the wine into different classes, essential for classification tasks.

## EDA

#### Basic Statistics

In [None]:
# Get basic statistics of the training data
print("Basic statistics of the training data:")
display(wine_train.describe())

In [None]:
# Get basic statistics of the testing data
print("Basic statistics of the testing data:")
display(wine_test.describe())

#### Conclusion
While most features exhibit stable distributions, certain variables like proline and malic acid indicate potential skewness and outliers due to their wide ranges and mean values significantly higher than the median.
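
The mean-versus-median gap mentioned above can be checked numerically. The following is a minimal sketch on synthetic data; the helper `mean_median_gap` and the column names are illustrative assumptions, not part of the original notebook. On the notebook's data the helper could be applied to `wine_train.drop('target', axis=1)`.

```python
import numpy as np
import pandas as pd

def mean_median_gap(df: pd.DataFrame) -> pd.Series:
    """Return (mean - median) / std per column; larger positive values hint at right skew."""
    return ((df.mean() - df.median()) / df.std()).sort_values(ascending=False)

# Synthetic stand-in data (assumption): one symmetric and one right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'symmetric_feature': rng.normal(loc=10.0, scale=1.0, size=500),
    'skewed_feature': rng.lognormal(mean=0.0, sigma=1.0, size=500),
})

gap = mean_median_gap(demo)
print(gap)
```

Dividing by the standard deviation makes the gap comparable across features with very different units.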

#### Class Distribution
A necessary initial check to confirm that the target classes are reasonably balanced before modeling.
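
Alongside the count plot, the balance can be summarized as numbers. This is a small sketch with hypothetical labels; `class_balance` and `demo_target` are assumed names for illustration, and the counts are not taken from the wine data.

```python
import pandas as pd

def class_balance(y: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, sorted by class label."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({'count': counts, 'share': counts / counts.sum()})

# Hypothetical labels (assumption): three classes with mild imbalance
demo_target = pd.Series([0] * 40 + [1] * 50 + [2] * 30, name='target')
balance = class_balance(demo_target)
print(balance)

# Ratio of the largest to the smallest class; values near 1 mean balanced classes
imbalance_ratio = balance['count'].max() / balance['count'].min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```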

In [None]:
# Plot the distribution of the target variable
plt.figure(figsize=(8,6))
sns.countplot(x='target', data=wine_train)
plt.title('Distribution of Wine Classes')
plt.xlabel('Wine Class')
plt.ylabel('Count')
plt.show()

### Features Correlation Matrix
Identifying how features are linearly related to each other

In [None]:
# Exclude 'target' from correlation matrix
features = wine_train.drop('target', axis=1)
corr_matrix = features.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(14, 12))

# Draw the heatmap with the mask 
sns.heatmap(
    corr_matrix,
    mask=mask,
    cmap='coolwarm',
    annot=True,
    fmt='.2f',
    linewidths=0.5,
    cbar_kws={"shrink": .5},
    xticklabels=corr_matrix.columns,
    yticklabels=corr_matrix.columns
)

# Add title and labels
plt.title('Correlation Heatmap')
plt.show()

#### Heatmap Overview:
* High correlation between total_phenols and flavanoids (0.87), indicating redundancy.
* Correlation between od280/od315_of_diluted_wines and flavanoids (0.78) suggests similar information is captured by both features.
* Low correlation features like malic_acid and nonflavanoid_phenols could provide unique information, warranting further investigation.
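
If the redundancy noted above were to be acted on, one common approach is to flag one feature from each highly correlated pair. The sketch below assumes a threshold of 0.75 and uses synthetic columns; `find_redundant_features` is a hypothetical helper, and whether dropping features actually helps would still need to be checked with cross-validation.

```python
import numpy as np
import pandas as pd

def find_redundant_features(df: pd.DataFrame, threshold: float = 0.75) -> list:
    """List columns whose absolute correlation with an earlier column exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
base = rng.normal(size=300)
demo = pd.DataFrame({
    'feature_a': base,
    'feature_b': base + rng.normal(scale=0.1, size=300),  # near-duplicate of feature_a
    'feature_c': rng.normal(size=300),                    # independent of the others
})

to_drop = find_redundant_features(demo, threshold=0.75)
print(to_drop)
reduced = demo.drop(columns=to_drop)
```

The first column of each correlated pair is kept, so the result depends on column order.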

#### Further Features Analysis

In [None]:
# Set a correlation threshold (e.g., 0.75) to detect highly correlated features
correlation_threshold = 0.75

# Mask the upper triangle to avoid duplicate pairs
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Apply the mask to the correlation matrix
masked_corr_matrix = corr_matrix.mask(mask)

# Filter for pairs with correlation higher than the threshold
high_corr_pairs = masked_corr_matrix.stack().reset_index()
high_corr_pairs.columns = ['Feature 1', 'Feature 2', 'Correlation']
high_corr_pairs = high_corr_pairs[high_corr_pairs['Correlation'].abs() > correlation_threshold]

# Print the high correlation pairs
print(high_corr_pairs)

In [None]:
# Pairplot for highly correlated features (adjust based on findings from heatmap)
sns.pairplot(wine_train, vars=['total_phenols', 'flavanoids', 'od280/od315_of_diluted_wines'], hue='target')
plt.show()

#### Pairplot Overview
<b> Separation of Classes:</b>
* Class 0 (light color) has generally lower values for total phenols and flavanoids compared to Class 1 and Class 2.
* Class 2 (dark color) tends to have higher values in both total phenols and od280/od315_of_diluted_wines.

<b> High Correlation:</b>
* The scatterplots show a very strong linear relationship between total phenols and flavanoids (high positive correlation), as well as between flavanoids and od280/od315_of_diluted_wines.
* This confirms what we saw in the heatmap: these features are highly correlated and may be redundant if used together in a model. This could lead to multicollinearity, which might cause issues in linear models like Logistic Regression.

<b> Class Overlap:</b>
* There is some overlap between the classes, especially between Class 1 and Class 2 for the feature flavanoids. This might mean that some models could have difficulty separating these classes purely based on this feature.

In [None]:
# Checking distribution and skewness for important features
wine_train[features.columns].hist(bins=15, figsize=(15, 10), layout=(4, 4))
plt.show()

# Calculate skewness to see if any feature is heavily skewed
skewness = wine_train[features.columns].skew()
print("Skewness:\n", skewness)

#### <b>Overview:</b>

* <b> Normal Distribution:</b>

Features like alcohol, ash, total phenols, and od280/od315_of_diluted_wines have relatively symmetric distributions, indicating they are approximately normally distributed.

* <b> Skewed Distributions:</b>

Malic acid and proline are positively skewed (right-skewed). The malic acid distribution is heavily concentrated around lower values, while proline has a long tail, meaning most of the data is concentrated on lower values but with some extreme higher values.
Magnesium, color intensity, and proanthocyanins also show positive skewness.
Hue is slightly negatively skewed, but its distribution is more balanced compared to the others.

* <b> Most Skewed Features: </b>

Proline has the highest skewness (0.81), followed by magnesium (0.79) and color intensity (0.78). By the common rule of thumb that treats absolute skewness above 1 as heavy, these values indicate moderate right skew: most of the data is concentrated at lower values, with a tail of higher values.
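
A log transform is one common way to reduce right skew in strictly positive features. The sketch below is illustrative only: the data is synthetic, and the original notebook does not apply this transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive values (assumption; stands in for a feature such as proline)
skewed = pd.Series(rng.lognormal(mean=6.0, sigma=0.6, size=500), name='skewed_feature')

# log1p(x) = log(1 + x) compresses the long right tail
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Tree-based models such as Random Forest are insensitive to monotonic transforms like this one, so it matters mainly for linear and distance-based models.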

In [None]:
# Boxplot to identify outliers
plt.figure(figsize=(10, 6))
sns.boxplot(data=wine_train[features.columns])
plt.xticks(rotation=90)
plt.show()


#### Conclusion:

Proline stands out with a much larger scale compared to the other features. This difference in scale could potentially affect the model's performance, particularly in distance-based models like KNN or in gradient-based algorithms (like logistic regression or neural networks).
Other features, such as magnesium, have values that are in a more reasonable range but also have outliers, which might affect the model's robustness.
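
To illustrate why scale matters for distance-based models, the sketch below measures how much each feature contributes to the average squared Euclidean distance before and after standardization. The two synthetic features and their magnitudes are assumptions chosen for illustration, and `distance_share` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (assumption)
large_scale = rng.normal(loc=750.0, scale=300.0, size=200)
small_scale = rng.normal(loc=1.0, scale=0.2, size=200)
X_demo = np.column_stack([large_scale, small_scale])

def distance_share(X: np.ndarray) -> np.ndarray:
    """Share of the average squared Euclidean distance between rows contributed by each feature."""
    # The mean squared difference over all pairs of rows is proportional to the feature's variance
    variances = X.var(axis=0)
    return variances / variances.sum()

raw_share = distance_share(X_demo)
scaled_share = distance_share(StandardScaler().fit_transform(X_demo))
print("Raw share per feature:   ", np.round(raw_share, 4))
print("Scaled share per feature:", np.round(scaled_share, 4))
```

Before scaling, the large-scale feature accounts for almost all of the distance; after standardization both features contribute equally.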

## Part 3 - Experiments
You could add as many code cells as needed

In [None]:
from sklearn.ensemble import RandomForestClassifier

# After finding high correlation between some features,
# check the importance of each one to consider removing some later on

# Prepare Data
X = wine_train.drop('target', axis=1)
y = wine_train['target']

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Get feature importances
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances.sort_values(ascending=False, inplace=True)
print(feature_importances)

#### Data Preparation


In [None]:
# Separate features and target variable from training data
X_train = wine_train.drop('target', axis=1)
y_train = wine_train['target']

#### Feature Engineering


In [None]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Create a pipeline for combining different feature engineering techniques
# The pipeline allows us to apply multiple transformations sequentially

# 1. StandardScaler for normalization
# 2. PolynomialFeatures to add interactions and higher-degree terms (degree=2 in this case)
# 3. PCA for dimensionality reduction, keeping 95% of the variance
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Normalization step
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),  # Adding polynomial features
    ('pca', PCA(n_components=0.95))  # Dimensionality reduction step, keeping 95% of variance
])

# Apply the pipeline to the training data
X_transformed = pipeline.fit_transform(X_train)
print(f'Transformed shape: {X_transformed.shape}')
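
To make the effect of each step visible, the sketch below runs the same three steps on synthetic data and reports the column count after the polynomial expansion and after PCA. The input shape (200 rows, 5 features) is an assumption for illustration. With `degree=2` and `include_bias=False`, `n` input features expand to `n + n(n+1)/2` columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in (assumption): 200 rows, 5 numeric features
X_demo = rng.normal(size=(200, 5))

steps = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('pca', PCA(n_components=0.95)),
])
X_out = steps.fit_transform(X_demo)

n_poly = steps.named_steps['poly'].n_output_features_
n_pca = steps.named_steps['pca'].n_components_
kept_variance = steps.named_steps['pca'].explained_variance_ratio_.sum()

print(f"After PolynomialFeatures: {n_poly} columns")
print(f"After PCA: {n_pca} components, {kept_variance:.3f} of variance kept")
```

Note that the polynomial terms are not re-standardized before PCA, so squared and interaction terms with larger variance weigh more in the components.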


#### Grid Search with Cross-Validation

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.neighbors import KNeighborsClassifier

# Define models and parameters

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
}

# Define hyperparameters for each model
param_grid = {
    # Logistic Regression hyperparameters
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],  # Regularization strength: smaller values specify stronger regularization
        'solver': ['liblinear', 'lbfgs', 'saga'],  # Different solvers to optimize the logistic function
        'max_iter': [10000]  # Maximum number of iterations for solver convergence
    },
    
    # Random Forest hyperparameters
    'Random Forest': {
        'n_estimators': [50, 100, 200, 500],  # Number of trees in the forest
        'max_depth': [None, 10, 20, 30],  # Maximum depth of each tree
        'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
        'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required to be at a leaf node
        'bootstrap': [True, False]  # Whether to use bootstrap samples when building trees
    },
    
    # K-Nearest Neighbors hyperparameters
    'KNN': {
        'n_neighbors': [3, 5, 7, 9, 11],  # Number of neighbors to consider (k)
        'weights': ['uniform', 'distance'],  # Weight function used in prediction (uniform: equal, distance: weighted)
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],  # Algorithm to compute the nearest neighbors
        'p': [1, 2]  # Power parameter for the Minkowski metric (p=1: Manhattan, p=2: Euclidean)
    }
}

In [None]:
# Define a scorer using macro-average F1 score
scorer = make_scorer(f1_score, average='macro')

# Tune on the engineered features (X_transformed) so that the search uses the same
# representation the final model is trained on in Part 4.
# Caveat: the feature pipeline was fit on the full training set, so these CV scores
# are slightly optimistic.

# Run Grid Search with 5-fold cross-validation for Random Forest
grid_search_rf = GridSearchCV(estimator=models['Random Forest'],
                              param_grid=param_grid['Random Forest'],
                              scoring=scorer,
                              cv=5)

grid_search_rf.fit(X_transformed, y_train)

# Run Grid Search for Logistic Regression
grid_search_lr = GridSearchCV(estimator=models['Logistic Regression'],
                              param_grid=param_grid['Logistic Regression'],
                              scoring=scorer,
                              cv=5)

grid_search_lr.fit(X_transformed, y_train)

# Run Grid Search for KNN
grid_search_knn = GridSearchCV(estimator=models['KNN'],
                               param_grid=param_grid['KNN'],
                               scoring=scorer,
                               cv=5)

grid_search_knn.fit(X_transformed, y_train)

# Collect the results into a DataFrame
results_df = pd.DataFrame({
    'Model': ['Random Forest', 'Logistic Regression', 'KNN'],
    'Best Macro F1 Score': [
        grid_search_rf.best_score_,
        grid_search_lr.best_score_,
        grid_search_knn.best_score_
    ],
    'Best Params': [
        grid_search_rf.best_params_,
        grid_search_lr.best_params_,
        grid_search_knn.best_params_
    ]
})

# Display the results DataFrame
print("All Model Results:")
print(results_df)

## Part 4 - Training 
Use the best combination of feature engineering, model (algorithm and hyperparameters) from the experiment part (part 3)
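
One way to select feature engineering and model settings together is to put the preprocessing steps and the classifier into a single `Pipeline` and pass it to `GridSearchCV`. Each fold then fits the scaler and PCA on its own training portion only. The sketch below is an assumption-laden illustration on synthetic data from `make_classification`, with a deliberately small grid; it is not the search used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (assumption; stands in for the training features and labels)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=5, n_redundant=1,
    n_classes=3, random_state=0,
)

full_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', KNeighborsClassifier()),
])

# Step parameters are addressed as <step name>__<parameter name>
pipeline_param_grid = {
    'pca__n_components': [0.90, 0.95],
    'clf__n_neighbors': [3, 5, 7],
}

cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(full_pipeline, pipeline_param_grid, scoring='f1_macro', cv=cv_splitter)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best CV macro F1: {search.best_score_:.3f}")
```

With the default `refit=True`, `search.best_estimator_` is already refit on all of the data passed to `fit`, so it can be used for prediction directly.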

In [None]:
# Find the model with the highest score
best_model_idx = results_df['Best Macro F1 Score'].idxmax()
best_model_name = results_df.loc[best_model_idx, 'Model']
best_model_params = results_df.loc[best_model_idx, 'Best Params']

print(f"Best Model: {best_model_name}")
print(f"Best Parameters: {best_model_params}")

#### Retrain With The Best Model

In [None]:
X = wine_train.drop(columns=['target'])
y = wine_train.target
X_test = wine_test.drop(columns=['target'])
y_test = wine_test.target

# Use the pipeline from before
X_train_transformed = pipeline.fit_transform(X)  
X_test_transformed = pipeline.transform(X_test) 

print(f"Transformed train shape: {X_train_transformed.shape}")

In [None]:
# Retrain the best model on the full training set
if best_model_name == 'Random Forest':
    # Fix the seed so the retrained forest is reproducible
    best_model = RandomForestClassifier(random_state=42, **best_model_params)
elif best_model_name == 'Logistic Regression':
    best_model = LogisticRegression(**best_model_params)
elif best_model_name == 'KNN':
    best_model = KNeighborsClassifier(**best_model_params)
else:
    raise ValueError(f"Unexpected model name: {best_model_name}")

# Train the best model on the transformed train data
best_model.fit(X_train_transformed, y)

## Part 5 - Apply on test and show model performance estimation

In [None]:
# Predict on the transformed test set
y_test_pred = best_model.predict(X_test_transformed)

# Show the first 5 predictions along with actual labels
print("First 5 Predictions vs Actual Labels:")
for i in range(5):
    print(f"Prediction: {y_test_pred[i]}, Actual: {y_test.iloc[i]}")

# Evaluate the model on the test set using the macro-average F1 score
f1_test = f1_score(y_test, y_test_pred, average='macro')
print(f'\nMacro F1 Score on the Test Set: {f1_test:.2f}')
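
For reference, the macro F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The sketch below computes it by hand and compares it with `f1_score`; the labels and predictions are hypothetical and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predictions (assumption)
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted mean of per-class F1 scores over the classes present in y_true."""
    scores = []
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

manual = macro_f1(y_true_demo, y_pred_demo)
reference = f1_score(y_true_demo, y_pred_demo, average='macro')
print(f"manual={manual:.4f}, sklearn={reference:.4f}")
```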

In [None]:
# Generate a classification report
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))

In [None]:
# Generate and plot the confusion matrix
# (seaborn and matplotlib were already imported in Part 2)
from sklearn.metrics import confusion_matrix

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_test_pred)

# Plot the confusion matrix using seaborn
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix Heatmap')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

### <b> Assignment Conclusion: </b>

The project classified wine types with high accuracy using a Random Forest model trained on standardized, polynomial-expanded, PCA-reduced features, with hyperparameters selected by grid search. The high macro F1 score and the classification report on the held-out test set suggest that the model generalizes well to unseen data.