# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Jauhar Fathima

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [136]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected.

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [137]:
# Import dataset (1 mark)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

# Specify the correct delimiter and skip the header
wine_data = pd.read_csv(url, delimiter=';', skiprows=1, header=None)
wine_data



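|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |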


In [138]:
# Set the column names
column_names = [
    "fixed acidity", "volatile acidity", "citric acid",
    "residual sugar", "chlorides", "free sulfur dioxide",
    "total sulfur dioxide", "density", "pH", "sulphates",
    "alcohol", "quality"
]

wine_data.columns = column_names

# Separate features and target variable
X = wine_data.drop('quality', axis=1)
y = wine_data['quality']
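The loading and renaming cells above could probably be collapsed into one call. The sketch below assumes the file's first row holds quoted column names (which the `skiprows=1` call implies but the notebook does not show). It uses a two-row inline sample instead of the real URL so it runs offline; `sample_csv`, `wine_sample`, `X_sample` and `y_sample` are illustrative names.

```python
import io
import pandas as pd

# Two-row stand-in for winequality-white.csv (values copied from the output above).
# The quoted header row is an assumption about the file layout.
sample_csv = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    "7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6\n"
    "6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6\n"
)

# sep=';' handles the delimiter; the default header=0 picks up the column names
wine_sample = pd.read_csv(sample_csv, sep=";")

# Separate features and target without renaming columns by hand
X_sample = wine_sample.drop(columns="quality")
y_sample = wine_sample["quality"]
print(wine_sample.columns.tolist())
```

Reading the header directly avoids keeping a hand-typed column list in sync with the file.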

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

Answer1: The dataset is sourced from the UCI Machine Learning Repository.

Answer2: I selected this dataset for its mix of numerical features and the interesting task of predicting wine quality based on chemical properties.

Answer3: While finding a dataset was not challenging, ensuring it had a good balance of features and instances required some consideration.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [139]:
# Clean data (if needed)
# Checking for missing values
missing_values = wine_data.isnull().sum()
missing_values

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [140]:
# Check the data types of features
data_types = wine_data.dtypes
data_types

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

In [141]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed.
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), X.columns)  # Standard scaling for numerical features
    ])

# Fit the scaler on the training set only, then apply the same scaling to the test set
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
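If the quality classes turn out to be imbalanced (this notebook does not check), a stratified split may be worth considering. The sketch below uses synthetic data rather than the wine dataset; the class labels and proportions are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 1000 rows, 11 features, three imbalanced classes
X_demo = rng.normal(size=(1000, 11))
y_demo = rng.choice([5, 6, 7], size=1000, p=[0.3, 0.6, 0.1])

# stratify=y_demo keeps the class proportions roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare the share of each class in the training and testing sets
for label in np.unique(y_demo):
    print(label, round(np.mean(y_tr == label), 3), round(np.mean(y_te == label), 3))
```

Without `stratify`, a rare class can end up under-represented in the test set purely by chance.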


### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

Answer1: No, there were no missing values in this dataset. If missing values had been present, I would have used mean imputation for the numerical features, which fills the gaps without discarding any rows.

Answer2: The dataset consists of numerical features. I applied standard scaling using `StandardScaler` to ensure all features contribute equally to the models. This is particularly important for models sensitive to feature scales, such as SVM.
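The mean imputation described in Answer1 was not needed here, but a minimal sketch of how it could be chained with scaling is shown below. The toy frame and its missing value are invented for illustration; `SimpleImputer` and `StandardScaler` are the standard scikit-learn classes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame with two of the wine columns and one artificially missing value
toy = pd.DataFrame({
    "pH": [3.00, 3.30, np.nan, 3.19],
    "alcohol": [8.8, 9.5, 10.1, 9.9],
})

# Impute first, then scale, so the scaler never sees NaN
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

toy_processed = numeric_steps.fit_transform(toy)
print(toy_processed)
```

Because the missing pH is replaced by the column mean, its scaled value comes out as zero.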

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [142]:
# Implement pipeline and grid search here. Can add more code blocks if necessary


In [128]:
# Pipeline and grid search for Random Forest
pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])

param_grid_rf = {
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [None, 10, 20],
    'rf__min_samples_split': [2, 5, 10]
}

grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring='accuracy')
grid_rf.fit(X_train, y_train)

In [129]:
# Print results for Random Forest
print("Random Forest - Best Parameters:", grid_rf.best_params_)
print("Random Forest - Best Cross-validated Score:", grid_rf.best_score_)

Random Forest - Best Parameters: {'rf__max_depth': None, 'rf__min_samples_split': 2, 'rf__n_estimators': 200}
Random Forest - Best Cross-validated Score: 0.6633454531237782


In [130]:
# Pipeline and grid search for Logistic Regression
pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])

param_grid_lr = {
    'lr__C': [0.001, 0.01, 0.1, 1, 10, 100]
}

grid_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
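The convergence warning above suggests raising `max_iter`. A minimal sketch of that change on synthetic data follows; the value 1000 is an illustrative choice rather than a tuned one, and the results reported in this notebook were produced without it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multiclass data standing in for the wine features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

pipeline_lr_demo = Pipeline([
    ("scaler", StandardScaler()),
    # max_iter=1000 is an illustrative value, up from the default of 100
    ("lr", LogisticRegression(max_iter=1000)),
])
pipeline_lr_demo.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(pipeline_lr_demo.named_steps["lr"].n_iter_)
```

If the solver stops well short of the limit, the higher cap costs nothing and the warning should no longer appear.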

In [131]:
# Print results for Logistic Regression
print("\nLogistic Regression - Best Parameters:", grid_lr.best_params_)
print("Logistic Regression - Best Cross-validated Score:", grid_lr.best_score_)


Logistic Regression - Best Parameters: {'lr__C': 10}
Logistic Regression - Best Cross-validated Score: 0.5431308155446086


In [132]:
# Pipeline and grid search for Support Vector Machine
pipeline_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

param_grid_svm = {
    'svm__C': [0.001, 0.01, 0.1, 1, 10],
    'svm__kernel': ['linear', 'rbf']
}

grid_svm = GridSearchCV(pipeline_svm, param_grid_svm, cv=5, scoring='accuracy')
grid_svm.fit(X_train, y_train)

In [133]:
# Print results for Support Vector Machine
print("\nSupport Vector Machine - Best Parameters:", grid_svm.best_params_)
print("Support Vector Machine - Best Cross-validated Score:", grid_svm.best_score_)


Support Vector Machine - Best Parameters: {'svm__C': 10, 'svm__kernel': 'rbf'}
Support Vector Machine - Best Cross-validated Score: 0.5786088956655459


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

Answer1: Classification models are needed, since the task is to predict wine quality, a discrete integer score that I treat as a categorical class label.

Answer2: I selected Logistic Regression, Random Forest, and Support Vector Machine (SVM). Logistic Regression serves as a baseline, while Random Forest and SVM can capture non-linear relationships often present in quality prediction.

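Answer3: After grid search, Random Forest performed the best, with a cross-validated accuracy of about 0.663, compared with about 0.579 for SVM and about 0.543 for Logistic Regression. This aligns with expectations, as Random Forest is adept at capturing complex, non-linear relationships, which could exist in the context of wine quality prediction.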
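The three separate grid searches above could also be expressed as a single search that treats the estimator itself as a hyperparameter. The sketch below runs on synthetic data with reduced grids to keep it quick; the step name `clf` and the grid values are illustrative, not the ones used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic multiclass data standing in for the wine features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)

# 'clf' is a placeholder step that each grid below replaces
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# One dict per model: the 'clf' entry swaps the estimator, the rest tune it
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1, 10]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [50, 100]},
    {"clf": [SVC()], "clf__C": [1, 10], "clf__kernel": ["linear", "rbf"]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best estimator:", type(search.best_params_["clf"]).__name__)
print("Best cross-validated score:", search.best_score_)
```

A single search puts every candidate in one `cv_results_` table, which makes the model comparison easier to report.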

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [134]:
# Calculate testing accuracy (1 mark)
# Calculate testing accuracy for the best model (Random Forest in this case)
best_model = grid_rf.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

In [135]:
# Print the testing accuracy
print("Testing Accuracy for the Best Model:", test_accuracy)

Testing Accuracy for the Best Model: 0.6979591836734694



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose?
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

Answer1: I chose accuracy as the metric for evaluating model performance.

Answer2: The testing accuracy (about 0.698) is slightly higher than the best cross-validated accuracy from Part 3 (about 0.663). The two scores are close, which suggests the model generalized well to new data.

Answer3: With a testing accuracy of about 0.70, the Random Forest model performed well enough for real-world use in the context of predicting wine quality. Further analysis and feature engineering could improve its robustness. It is also important to consider domain-specific factors and to explore more advanced techniques.
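One way to carry out the further analysis mentioned in Answer3 is to look beyond a single accuracy number. The sketch below computes a confusion matrix, balanced accuracy and a per-class report on synthetic imbalanced data; it is not run on the wine dataset, and the class weights are invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced three-class data standing in for the wine dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=11, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Balanced accuracy averages recall over classes, so rare classes count equally
print("Balanced accuracy:", balanced_accuracy_score(y_te, y_pred))
print(classification_report(y_te, y_pred, zero_division=0))
```

Per-class recall would show whether the model is accurate mainly on the most common quality scores.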


## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

Answer1: I sourced code from class examples, scikit-learn documentation, and the UCI Machine Learning Repository.

Answer2: I completed the steps in order: importing data, processing, implementing models, validating the best model, and finally reflecting on the process.

Answer3: I used generative AI tools for this assignment. The task was fairly complex, so I relied on AI assistance to help resolve errors in my code.

Answer4: Tuning hyperparameters for multiple models was challenging and time-consuming, requiring careful consideration of each model's specific parameters. However, the systematic approach of using pipelines and grid search helped streamline the process.

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

Answer1: I enjoyed the hands-on application of pipeline construction, hyperparameter tuning, and model evaluation. It reinforced my understanding of these crucial concepts.

Answer2: Tuning hyperparameters for multiple models was challenging and time-consuming. It required careful consideration of each model's specific parameters.