**1. Importing necessary libraries for data manipulation, visualization, and modeling**

In [None]:
# For handling data in DataFrame format
import pandas as pd

# For numerical computations
import numpy as np                

# For creating visualizations
import matplotlib.pyplot as plt

# For advanced visualizations
import seaborn as sns

# For splitting data into training and testing sets
from sklearn.model_selection import train_test_split

# For K-Nearest Neighbors regression
from sklearn.neighbors import KNeighborsRegressor

# For data normalization or standardization
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# For hyperparameter tuning using grid search
from sklearn.model_selection import GridSearchCV

# For Gradient Boosting regression
from sklearn.ensemble import GradientBoostingRegressor

# For performing cross-validation
from sklearn.model_selection import cross_val_score

**2. Loading Data and Ensuring Data Quality**

In [None]:
# Load the dataset
df = pd.read_csv("data/winequality-red.csv")

In [None]:
# Display the first 5 rows of the dataset
df.head(5)

In [None]:
df['quality'].value_counts()

In [None]:
# Display the shape of the dataset (number of rows and columns)
df.shape  

In [None]:
# Display the data types of each column in the dataset
df.dtypes

In [None]:
# Check for missing values in the dataset
df.isnull().sum() 

**3. Modelling**

In [None]:
# Split the dataset into features (independent variables) and target (dependent variable)

features = df.drop(columns=["quality"])
target = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.30, random_state=0)

In [None]:
# Initialize the KNN regressor with 10 neighbors
knn = KNeighborsRegressor(n_neighbors=10)

In [None]:
# Train the model
knn.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
pred = knn.predict(X_test)

In [None]:
# Calculate the R-squared score on the test set
knn.score(X_test, y_test)

R-squared (coefficient of determination) measures the proportion of the variance in the dependent variable (target) that is predictable from the independent variables (features) in the model.
In this case, an R-squared of 0.1221 indicates that approximately 12.21% of the variance in the wine quality can be explained by the features included in the model.
This value is low. KNN relies on distances between samples, and the features here are on very different scales, so features with large numeric ranges dominate the distance; scaling the features is a natural next step.
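To make the definition concrete, the following minimal sketch computes R-squared by hand as 1 - SS_res / SS_tot. The ratings are hypothetical values chosen for illustration, not rows from the dataset, and `r_squared` is a helper name introduced here.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical quality ratings (illustration only)
y_true = np.array([5, 6, 5, 7, 6, 4])

perfect = y_true.copy()                              # exact predictions
mean_only = np.full(y_true.shape, y_true.mean())     # always predict the mean
bad = np.array([7, 4, 7, 4, 4, 7])                   # systematically wrong

r2_perfect = r_squared(y_true, perfect)
r2_mean = r_squared(y_true, mean_only)
r2_bad = r_squared(y_true, bad)

print(r2_perfect, r2_mean, r2_bad)
```

Predicting the mean for every sample gives a score of zero, so any positive score means the model does better than that baseline, and a negative score means it does worse.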

In [None]:
# Feature scaling using MinMaxScaler
normalizer = MinMaxScaler()
normalizer.fit(X_train)

In [None]:
# Transform the training and testing sets
X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [None]:
# Convert the scaled arrays back to dataframes
X_train_norm = pd.DataFrame(X_train_norm, columns = X_train.columns)
X_test_norm = pd.DataFrame(X_test_norm, columns = X_test.columns)

In [None]:
# Initialize the KNN regressor with 10 neighbors
knn = KNeighborsRegressor(n_neighbors=10)

In [None]:
# Train the model on the scaled data
knn.fit(X_train_norm, y_train)

In [None]:
# Calculate the R-squared score on the scaled test set
knn.score(X_test_norm, y_test)

The score is again the coefficient of determination (R-squared) on the test set.
A score of 0.1984 indicates that approximately 19.84% of the variance in wine quality is explained by the KNN model after min-max scaling, up from 12.21% on the unscaled features.
An R-squared closer to 1 would indicate a better fit, so there is still considerable room for improvement.
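The effect of scaling on a distance-based model can be sketched with plain NumPy. The two columns below are hypothetical values on very different scales, used only for illustration. The scaling statistics are learned from the training rows and reused for the test row, mirroring the fit/transform pattern above.

```python
import numpy as np

# Hypothetical values for two features on very different scales (illustration only)
X_train = np.array([[34.0, 0.9978],
                    [67.0, 0.9968],
                    [54.0, 0.9970],
                    [60.0, 0.9980]])
X_test = np.array([[40.0, 0.9975]])

# Learn the per-column minimum and range from the training data only
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min

X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range   # reuse the training statistics

# Euclidean distance from the test point to each training point
dist_raw = np.linalg.norm(X_train - X_test, axis=1)
dist_scaled = np.linalg.norm(X_train_scaled - X_test_scaled, axis=1)

print("raw distances:   ", dist_raw)
print("scaled distances:", dist_scaled)
```

Before scaling, the distances are almost entirely determined by the first column; after scaling, both columns contribute, which can change which neighbours are considered nearest.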

**4. Model Development and Initial Tuning**

In [None]:
# Correlation heatmap to identify highly correlated features
corr = np.abs(df.corr())

# Set up a mask that hides the upper triangle (and the diagonal)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask, the colormap, and a square aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, square=True, linewidths=.5,
            cbar_kws={"shrink": .5}, annot=True, fmt=".2f", ax=ax)

plt.show()

Fixed acidity and pH (corr. = 0.68): Both features are related to the acidity of the wine. Including both may lead to redundancy of information. It may be sufficient to keep only one of them.

Free sulfur dioxide and total sulfur dioxide (corr. = 0.67): Both features pertain to the sulfur dioxide content in wine, where the total content includes the free form. Keeping only one of them could suffice.

Density and residual sugar: High sugar content may affect the density of wine. If information about sugar content (residual sugar) is available, density may not be as crucial a feature.

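Volatile acidity and sulphates (corr. = 0.26): High volatile acidity levels may be associated with increased sulphur dioxide levels, so one of these features could be excluded, although at 0.26 the correlation is weak and the redundancy argument is much less compelling than for the pairs above.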
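Reading pairs off a heatmap by eye is error-prone, so the same check can be done programmatically. The sketch below uses synthetic columns `a`, `b`, `c` and an assumed cutoff of 0.6; both the data and the threshold are illustrative choices, not values from the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features (illustration only): b is built from a, c is independent
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)
c = rng.normal(size=n)
names = ["a", "b", "c"]
X = np.column_stack([a, b, c])

# Absolute Pearson correlation between columns
corr = np.abs(np.corrcoef(X, rowvar=False))

threshold = 0.6  # assumed cutoff for "highly correlated"

# Walk the upper triangle (k=1 skips the diagonal) so each pair appears once
rows, cols = np.triu_indices_from(corr, k=1)
pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i, j in zip(rows, cols)
    if corr[i, j] > threshold
]

print(pairs)
```

With a DataFrame, the same idea would apply to `np.abs(df.corr()).to_numpy()` with `df.columns` as the names.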

To build a wine quality prediction model, it's crucial to select informative features relevant to the target variable.
Features such as the following can contribute significantly to predicting wine quality:
- citric acid
- density
- total sulfur dioxide
- fixed acidity
- volatile acidity
- alcohol

These attributes impact the taste, freshness, acidity, and preservation of wine, making them relevant for modeling. Additionally, some of these features are interrelated or share similarities. For example, both free and total sulfur dioxide levels are relevant for preserving wine, while pH and acidity are closely related. 

fixed acidity
most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

volatile acidity
the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

citric acid
found in small quantities, citric acid can add 'freshness' and flavor to wines

residual sugar
the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

chlorides
the amount of salt in the wine

free sulfur dioxide
the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

total sulfur dioxide
amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

density
the density of wine is close to that of water, depending on the percent alcohol and sugar content

pH
describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

sulphates
a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

In [None]:
# Feature selection based on correlation analysis
# Drop highly correlated features to reduce redundancy
X_train_reduced = X_train_norm.drop(columns = ["residual sugar", "chlorides", "free sulfur dioxide","pH", "sulphates"])
X_test_reduced = X_test_norm.drop(columns = ["residual sugar", "chlorides", "free sulfur dioxide","pH", "sulphates"])

In [None]:
# Initialize the KNN regressor with 10 neighbors
knn = KNeighborsRegressor(n_neighbors=10)

In [None]:
# Train the model on reduced features
knn.fit(X_train_reduced, y_train)

In [None]:
# Make predictions on the reduced feature test set
pred_new = knn.predict(X_test_reduced)

In [None]:
# Calculate the R-squared score on the reduced feature test set
knn.score(X_test_reduced, y_test)

In [None]:
# Initialize the KNN regressor with 30 neighbors
knn = KNeighborsRegressor(n_neighbors=30)
knn.fit(X_train_reduced, y_train)
knn.score(X_test_reduced, y_test)

**5. Advanced Modeling**

Experiment with more powerful models, such as Ensemble models.

In [None]:
# Advanced modeling using ensemble techniques
# (GradientBoostingRegressor and the scalers are already imported in section 1)
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor

# Evaluation metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [None]:
# Bagging regressor
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=30),
                               n_estimators=100,
                               max_samples = 1000)

In [None]:
bagging_reg.fit(X_train_reduced, y_train)

In [None]:
pred = bagging_reg.predict(X_test_reduced)

In [None]:
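# Evaluation metrics (arguments in the conventional order: y_true, y_pred)
print("MAE", mean_absolute_error(y_test, pred))
# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_reduced, y_test))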

The Mean Absolute Error (MAE) is approximately 0.472, indicating the average absolute difference between the predicted and true wine quality ratings.
The Root Mean Squared Error (RMSE) is approximately 0.638, representing the square root of the average squared differences between the predicted and true wine quality ratings.
The R-squared score is approximately 0.311, which indicates that around 31.1% of the variance in the wine quality ratings can be explained by the features included in the model.
These evaluation metrics provide insights into the performance of the model in predicting wine quality, with the R-squared score suggesting a moderate level of predictive capability.
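The two error metrics can be reproduced by hand, which makes the difference between them easier to see. The ratings below are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical true and predicted quality ratings (illustration only)
y_true = np.array([5.0, 6.0, 5.0, 7.0])
y_pred = np.array([5.5, 5.0, 5.0, 6.0])

errors = y_true - y_pred

# MAE: average size of the error, every error weighted equally
mae = np.mean(np.abs(errors))

# RMSE: square, average, then take the root; larger errors weigh more
rmse = np.sqrt(np.mean(errors ** 2))

print("MAE ", mae)
print("RMSE", rmse)
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far off.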

In [None]:
# Decision tree regressor
tree = DecisionTreeRegressor(max_depth  = 5)

In [None]:
tree.fit(X_train_reduced, y_train)

In [None]:
tree.score(X_test_reduced, y_test)

The decision tree regressor with a maximum depth of 5 achieves an R-squared score of approximately 0.1564.
This score indicates that around 15.64% of the variance in the wine quality ratings can be explained by the features included in the model.
Despite having a limited depth, the decision tree shows some predictive capability, albeit modest, in capturing the relationship between the features and the wine quality ratings.

In [None]:
# Random forest regressor
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [None]:
forest.fit(X_train_reduced, y_train)

In [None]:
predictions = forest.predict(X_test_reduced)
forest.score(X_test_reduced, y_test)

The random forest regressor achieves an R-squared score of approximately 0.3077 on the test set.
Compared to the decision tree regressor with a maximum depth of 5, which had an R-squared score of approximately 0.1564, the random forest regressor demonstrates better predictive performance, explaining around 30.77% of the variance in the wine quality ratings based on the features included in the model.
This improvement in predictive capability suggests that the ensemble of decision trees in the random forest model is better able to capture the complex relationships between the features and the wine quality ratings.
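One reason averaging many trees helps is that it reduces the variance of the prediction. The simulation below is a simplified sketch under a strong assumption: each "estimator" is the true value plus independent noise. Trees in a real forest are correlated, so the reduction in practice is smaller than shown here.

```python
import numpy as np

rng = np.random.default_rng(42)

true_value = 5.6          # hypothetical target value
n_trials = 2000           # repeated experiments
n_estimators = 100        # estimators averaged per experiment

# Each row is one experiment; each column is one noisy, independent estimator
preds = true_value + rng.normal(scale=1.0, size=(n_trials, n_estimators))

single_var = preds[:, 0].var()            # variance of one estimator alone
ensemble_var = preds.mean(axis=1).var()   # variance of the averaged prediction

print("single estimator variance:", single_var)
print("ensemble (mean) variance: ", ensemble_var)
```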

In [None]:
# AdaBoost regressor
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [None]:
ada_reg.fit(X_train_reduced, y_train)

In [None]:
pred = ada_reg.predict(X_test_reduced)

# Evaluation metrics
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_reduced, y_test))

The AdaBoostRegressor achieves an R-squared score of approximately 0.1501 on the test set, explaining about 15.01% of the variance in the wine quality ratings.
This is well below the Random Forest Regressor (approximately 0.3077) and marginally below the Decision Tree Regressor (approximately 0.1564).
In this configuration, AdaBoost is therefore less effective than the Random Forest at capturing the relationship between the features and the wine quality ratings, and no better than a single shallow tree.

In [None]:
# Gradient boosting regressor
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [None]:
gb_reg.fit(X_train_reduced, y_train)

In [None]:
pred = gb_reg.predict(X_test_reduced)

# Evaluation metrics
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_reduced, y_test))

The GradientBoostingRegressor achieves an R-squared score of approximately -0.238 on the test set.
Compared to previous models:
- AdaBoostRegressor with an R-squared score of approximately 0.1501,
- Random Forest Regressor with an R-squared score of approximately 0.3077,
- Decision Tree Regressor with an R-squared score of approximately 0.1564,

the GradientBoostingRegressor demonstrates the weakest predictive performance, with a negative R-squared score.
An R-squared score below zero indicates that the model performs worse than a model that simply predicts the mean of the target variable.
This suggests that the GradientBoostingRegressor with these settings (max_depth=20, n_estimators=100) is not suitable for this dataset; such deep trees are prone to overfitting the training data.

**6. Hyperparameter Tuning and Model Optimization**

In [None]:
# Define the parameter grid
param_grid = {
    'max_depth': [5, 10, 15, 20],
    'n_estimators': [50, 100, 150, 200]
}

# Initialize the GradientBoostingRegressor
gb_reg = GradientBoostingRegressor()

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=gb_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')

# Perform grid search
grid_search.fit(X_train_reduced, y_train)

# Print the best parameters
print("Best parameters found:", grid_search.best_params_)
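To get a feel for the cost of this search, the grid can be enumerated with the standard library. This is only a sketch of the bookkeeping, not of what GridSearchCV does internally; `combinations` and `n_fits` are names introduced here.

```python
from itertools import product

# Same grid and fold count as in the search above
param_grid = {
    "max_depth": [5, 10, 15, 20],
    "n_estimators": [50, 100, 150, 200],
}
cv = 5

# Every combination of one value per parameter
keys = list(param_grid)
combinations = [
    dict(zip(keys, values))
    for values in product(*(param_grid[k] for k in keys))
]

# Each combination is fitted once per fold
n_fits = len(combinations) * cv

print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

With the default `refit=True`, GridSearchCV also fits the best combination once more on the full training set.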

In [None]:
# Initialize a new GradientBoostingRegressor with the best parameters
best_gb_reg = GradientBoostingRegressor(max_depth=5, n_estimators=150)

# Fit the model to the training data
best_gb_reg.fit(X_train_reduced, y_train)

# Make predictions on the test data
predictions = best_gb_reg.predict(X_test_reduced)

In [None]:
# Print the evaluation metrics for the tuned model (use `predictions`, not the stale `pred`)
print("MAE", mean_absolute_error(y_test, predictions))
print("RMSE", np.sqrt(mean_squared_error(y_test, predictions)))
print("R2 score", r2_score(y_test, predictions))

The GradientBoostingRegressor model with tuned hyperparameters (max_depth=5, n_estimators=150) achieves an R-squared score of approximately 0.215 on the test set.
The MAE of approximately 0.526 and RMSE of approximately 0.850 recorded for this step were computed from `pred`, the predictions of the earlier max_depth=20 model, rather than from `predictions`, so they describe that earlier model and not the tuned one.
Compared to the earlier GradientBoostingRegressor (max_depth=20, n_estimators=100, R-squared of approximately -0.238), the tuned model is a clear improvement, moving from worse than a mean-only baseline to explaining about 21.5% of the variance.

In [None]:
# Create a GradientBoostingRegressor model with optimized hyperparameters
best_gb_reg = GradientBoostingRegressor(max_depth=5, n_estimators=150)

In [None]:
# Perform cross-validation with 5 folds
scores = cross_val_score(best_gb_reg, X_train_reduced, y_train, cv=5, scoring='r2')

In [None]:
# Calculate the mean R2 score across all folds
print("Mean R2 Score:", scores.mean())

The mean R2 score from 5-fold cross-validation on the training set is approximately 0.3525.
This is higher than the single-split test score of 0.215 for the same configuration and than the AdaBoostRegressor test score of about 0.1501, although cross-validated scores and test-set scores are computed on different data and are not directly comparable.
The result suggests that the model captures a meaningful share of the target variable's variance, though further refinements may be beneficial.
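The mechanics of k-fold cross-validation can be sketched by hand. The example below makes several assumptions for the sake of a self-contained illustration: synthetic data, an ordinary least-squares model in place of gradient boosting, and shuffled folds (for a regressor with `cv=5`, `cross_val_score` uses unshuffled folds by default). `k_fold_r2` is a helper name introduced here.

```python
import numpy as np

def k_fold_r2(X, y, k=5, seed=0):
    """Return the R-squared score on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

        # Fit least squares (with intercept) on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

        # Score on the held-out fold
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        pred = A_val @ coef
        ss_res = np.sum((y[val_idx] - pred) ** 2)
        ss_tot = np.sum((y[val_idx] - y[val_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic data (illustration only): linear signal plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=100)

scores = k_fold_r2(X, y)
print("per-fold R2:", scores)
print("mean R2:    ", scores.mean())
```

Every sample is used for validation exactly once, so the mean score is less sensitive to one lucky or unlucky split than a single train/test evaluation.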

**7. The outcome**

Let's compare the performance metrics of the machine learning models provided in the code and draw a general conclusion:

K-Nearest Neighbors (KNN):

R-squared: 0.1984 (after scaling)
MAE and RMSE not provided.

BaggingRegressor (with Decision Trees):

R-squared: 0.311
MAE: 0.472
RMSE: 0.638

DecisionTreeRegressor:

R-squared: 0.1564
MAE and RMSE not provided.

RandomForestRegressor:

R-squared: 0.3077
MAE and RMSE not provided.

AdaBoostRegressor (with Decision Trees):

R-squared: 0.1501
MAE and RMSE not provided.

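GradientBoostingRegressor (max_depth=20, n_estimators=100):

R-squared: -0.238
MAE: 0.526
RMSE: 0.850
(These MAE and RMSE values were printed in the tuning section's evaluation cell, which used this model's predictions, `pred`.)

GradientBoostingRegressor with Hyperparameter Optimization:

R-squared: 0.215
MAE and RMSE not provided.

Cross-validation score (with Optimized GradientBoostingRegressor):

Mean R-squared: 0.3525

General Conclusion: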

The BaggingRegressor and RandomForestRegressor models show comparable and relatively good quality with an R-squared around 0.31, indicating that about 31% of the variance in the target variable is explained by the model.
The GradientBoostingRegressor with optimized hyperparameters also demonstrates acceptable quality with an R-squared around 0.215 and a mean R-squared during cross-validation of about 0.3525.
Other models like KNN, DecisionTreeRegressor, and AdaBoostRegressor exhibit lower prediction quality.
GradientBoostingRegressor without hyperparameter optimization shows poor quality with a negative R-squared, indicating that the model performs worse than simply predicting the mean.

***Thus, the most effective models for predicting wine quality on this dataset are BaggingRegressor, RandomForestRegressor, and GradientBoostingRegressor with optimized hyperparameters.***
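The ranking behind this conclusion can be reproduced from the test-set R-squared scores reported above. The sketch below only sorts those reported numbers. It does not re-run any model, and because the ensembles were trained without a fixed `random_state`, a fresh run may give slightly different values.

```python
# Test-set R-squared scores as reported in the sections above
r2_scores = {
    "KNN (scaled)": 0.1984,
    "BaggingRegressor": 0.311,
    "DecisionTreeRegressor": 0.1564,
    "RandomForestRegressor": 0.3077,
    "AdaBoostRegressor": 0.1501,
    "GradientBoostingRegressor (max_depth=20)": -0.238,
    "GradientBoostingRegressor (tuned)": 0.215,
}

# Sort from best to worst
ranking = sorted(r2_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:<45} {score:>7.4f}")
```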