## Wine Quality Testing

MLP Week 1 Lec 1-5

The Excellent wine company wants to develop an ML model that predicts wine
quality from certain physicochemical characteristics, in order to replace
an expensive quality sensor.

# Step 1 :- Get the Big Picture

1. Frame the problem
2. Select a performance measure
3. List and check the assumptions

1.1 Frame the problem
* What are the input and output?
* What is the business objective? How does the
  company expect to use and benefit from the model?
  * Useful in problem framing
  * Algorithm and performance measure selection
  * Overall effort estimation
* What is the current solution (if any)?
  * Provides a useful baseline
* Design considerations in problem framing
  * Is this a supervised, unsupervised or reinforcement learning (RL) problem?
  * Is this a classification, regression or some other task?
  * What is the nature of the output: single or multiple outputs?
  * Does the system need continuous learning or periodic updates?
  * What would be the learning style: batch or online?

1.2 Selection of performance measure
* Regression
  * Mean Squared Error (MSE) or
  * Mean Absolute Error (MAE)
* Classification
  * Precision
  * Recall
  * F1-score
  * Accuracy
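
The measures listed above can be computed by hand on a few values. The following is a minimal sketch with made-up numbers (not the wine data), written with plain NumPy so that each formula is visible; all variable names are illustrative assumptions.

```python
import numpy as np

# Toy regression targets and predictions (made-up numbers, not wine data).
y_true = np.array([5.0, 6.0, 5.0, 7.0])
y_pred = np.array([5.5, 6.0, 4.0, 7.5])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # squares the errors, so large errors dominate
mae = np.mean(np.abs(errors))   # averages absolute errors, less sensitive to outliers

# Toy binary labels (1 = positive class) and predictions.
c_true = np.array([1, 0, 1, 1, 0, 1])
c_pred = np.array([1, 0, 0, 1, 1, 1])

tp = np.sum((c_true == 1) & (c_pred == 1))  # true positives
fp = np.sum((c_true == 0) & (c_pred == 1))  # false positives
fn = np.sum((c_true == 1) & (c_pred == 0))  # false negatives

precision = tp / (tp + fp)      # fraction of predicted positives that are correct
recall = tp / (tp + fn)         # fraction of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.mean(c_true == c_pred)

print(f"MSE={mse:.3f} MAE={mae:.3f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In practice the equivalent functions in `sklearn.metrics` (for example `mean_squared_error`, which this notebook uses later) would normally be preferred over hand-written formulas.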

1.3 Check the assumptions
* List the various assumptions about the task.
* Review them with domain experts and other teams that plan to consume the ML output.
* Make sure all assumptions are reviewed and approved before coding!

# Step 2 :- Get the Data

In [None]:
# Load basic libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
# Get the Data
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(data_url, sep=";")

In [None]:
# Check data samples
data.head()

In [None]:
# Understanding the Features
feature_list = data.columns[:-1].values
label = [data.columns[-1]]

print("Feature list:", feature_list)
print("Label:", label)

In [None]:
# Data Statistics
data.info()

In [None]:
# Data Statistics
data.describe()

In [None]:
# Distribution of Wine Quality
data['quality'].value_counts()

In [None]:
# Distribution of Wine Quality
sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')

In [None]:
# In a similar manner, we can plot all numerical attributes with histogram plot for quick examination.
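
One way to do this is `DataFrame.hist`, which draws one histogram per numerical column. The sketch below uses a small synthetic DataFrame (`demo`, with made-up values) as a stand-in so that it runs without downloading the dataset; the bin count and figure size are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the wine DataFrame (values are made up).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 200),
    "pH": rng.normal(3.3, 0.15, 200),
    "quality": rng.integers(3, 9, 200),
})

# One histogram per numerical column, arranged on a grid of subplots.
axes = demo.hist(bins=20, figsize=(10, 6))
plt.tight_layout()
```

On the notebook's own DataFrame the equivalent call would be along the lines of `data.hist(bins=50, figsize=(15, 10))`.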

In [None]:
# Split the Test and Train Datasets
def split_train_test(data, test_ratio):
    # set the random seed.
    np.random.seed(42)
    # shuffle the dataset.
    shuffled_indices = np.random.permutation(len(data))
    # calculate the size of the test set.
    test_set_size = int(len(data) * test_ratio)
    # split dataset to get training and test sets.
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

In [None]:
train_set, test_set = split_train_test(data, 0.2)

In [None]:
# Another Approach to Splitting the test and train dataset :- Random Sampling
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
# Another Approach to Splitting the test and train dataset :- Stratified Sampling
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["quality"]):
    # split() yields positional indices, so select rows with iloc.
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]
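
To see what stratification buys, one can compare the label proportions in the test set against those of the full dataset. The sketch below does this on synthetic, imbalanced labels (the DataFrame `demo` and its class probabilities are made up); it uses the `stratify` argument of `train_test_split` as a compact way to obtain a stratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced quality labels (made up for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 1000),
    "quality": rng.choice([4, 5, 6, 7], size=1000, p=[0.05, 0.45, 0.40, 0.10]),
})

# Purely random split versus a split stratified on the label.
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)
_, strat_test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["quality"]
)

def proportions(frame):
    """Share of each quality value in the given frame."""
    return frame["quality"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(demo),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
comparison["random_abs_error"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_abs_error"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column can drift, especially for rare quality values.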

# Step 3 :- Data Visualization

In [None]:
# Create a copy of the training dataset so that we can manipulate it freely
# without worrying about changing the original set.

exploration_set = strat_train_set.copy()

In [None]:
# Scatter Visualization
sns.scatterplot(x='fixed acidity', y='density', hue='quality', data=exploration_set)

In [None]:
exploration_set.plot(kind='scatter', x='fixed acidity', y='density', alpha=0.5, c="quality", cmap=plt.get_cmap("jet"))

In [None]:
# Correlation
corr_matrix = exploration_set.corr()

In [None]:
corr_matrix['quality']

In [None]:
plt.figure(figsize=(14,7))
sns.heatmap(corr_matrix, annot=True)

In [None]:
# Scatter Matrix
from pandas.plotting import scatter_matrix
attribute_list = ['citric acid', 'pH', 'alcohol', 'sulphates', 'quality']
scatter_matrix(exploration_set[attribute_list])

Note of wisdom

1. Visualization and data exploration do not have to be absolutely thorough.
2. The objective is to get quick insight into the features and their relationships with other features
and the label.
3. Exploration is an iterative process: once we build a model and obtain more insights,
we can come back to this step.

# Step 4 :- Prepare Data for ML Algorithm

In [None]:
# Separate features and labels from the training set.
# Copy all features leaving aside the label.
wine_features = strat_train_set.drop("quality", axis=1)
# Copy the label list
wine_labels = strat_train_set['quality'].copy()

In [None]:
# Data Cleaning

# Check if there are any missing values
wine_features.isna().sum()

# Replacing missing values with Median value
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

imputer.fit(wine_features)

imputer.statistics_

wine_features.median()

# Use the fitted imputer to transform the training set so that missing values are replaced by the medians:
tr_features = imputer.transform(wine_features)
tr_features.shape
wine_features_tr = pd.DataFrame(tr_features, columns=wine_features.columns)



In [None]:
# Handling text and categorical attributes

#1 Converting categories to numbers:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

#2 Using one hot encoding
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
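
The red-wine dataset contains only numerical columns, so these encoders are not needed here. As a minimal sketch of how they behave, the example below applies both to a hypothetical `wine_type` column (the column and its values are assumptions made purely for illustration).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column; the red-wine dataset itself has none.
wine_type = pd.DataFrame({"wine_type": ["red", "white", "rose", "red"]})

# 1. Ordinal encoding: each category is mapped to an integer code.
ordinal_encoder = OrdinalEncoder()
type_ordinal = ordinal_encoder.fit_transform(wine_type)
print(ordinal_encoder.categories_)   # categories learned during fit, sorted
print(type_ordinal.ravel())

# 2. One-hot encoding: one binary column per category.
cat_encoder = OneHotEncoder()
type_1hot = cat_encoder.fit_transform(wine_type)  # SciPy sparse matrix by default
print(type_1hot.toarray())
```

Ordinal codes imply an ordering between categories, which is often not meaningful; one-hot encoding avoids that at the cost of extra columns.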

In [None]:
# Feature Scaling
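
A common way to scale features is standardization, chained after imputation in a scikit-learn `Pipeline`. The following is a sketch on synthetic data, not the notebook's own preprocessing: the name `example_pipeline` and the arrays `X_train` and `X_test` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made-up values).
rng = np.random.default_rng(42)
X_train = np.column_stack([rng.normal(10.0, 1.0, 100), rng.normal(50.0, 20.0, 100)])
X_test = np.column_stack([rng.normal(10.0, 1.0, 20), rng.normal(50.0, 20.0, 20)])
X_train[0, 1] = np.nan  # inject one missing value for the imputer to fill

# Hypothetical pipeline: median imputation followed by standardization.
example_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Learn medians, means and standard deviations from the training set only.
X_train_tr = example_pipeline.fit_transform(X_train)
# Reuse the learned statistics on the test set; do not refit on test data.
X_test_tr = example_pipeline.transform(X_test)

print(X_train_tr.mean(axis=0).round(3), X_train_tr.std(axis=0).round(3))
```

After the transformation each training column has zero mean and unit standard deviation. The test columns are only approximately standardized, which is expected because they are scaled with the training statistics.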

# Step 5 :- Select and Train ML Model

In [None]:
# Train Model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(wine_features_tr, wine_labels)

In [None]:
# Evaluate the model on the training set first. For regression models, we use mean squared error (MSE) as the evaluation measure.
from sklearn.metrics import mean_squared_error
quality_predictions = lin_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)

In [None]:
# Let's evaluate performance on the test set. We first apply the transformation fitted
# on the training set to the test set, then call the model's predict function.
# Copy all features, leaving aside the label.
wine_features_test = strat_test_set.drop("quality", axis=1)
# Copy the label list.
wine_labels_test = strat_test_set['quality'].copy()
# Apply the imputer fitted on the training set. Only transform() is called here:
# refitting on the test set would leak test statistics into preprocessing.
wine_features_test_tr = pd.DataFrame(imputer.transform(wine_features_test), columns=wine_features_test.columns)
# Call the predict function and calculate the MSE.
quality_test_predictions = lin_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

In [None]:
# Let's visualize the error between the actual and predicted values.
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

In [None]:
# Let's try another model: DecisionTreeRegressor.
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(wine_features_tr, wine_labels)

In [None]:
quality_predictions = tree_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)

In [None]:
quality_test_predictions = tree_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

In [None]:
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

In [None]:
# We can use cross-validation (CV) for robust evaluation of model performance.
from sklearn.model_selection import cross_val_score

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
# Linear Regression CV
scores = cross_val_score(lin_reg, wine_features_tr, wine_labels, scoring="neg_mean_squared_error", cv=10)
lin_reg_mse_scores = -scores
display_scores(lin_reg_mse_scores)

In [None]:
# Decision tree CV
scores = cross_val_score(tree_reg, wine_features_tr, wine_labels, scoring="neg_mean_squared_error", cv=10)
tree_mse_scores = -scores
display_scores(tree_mse_scores)

In [None]:
# Random forest CV
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(wine_features_tr, wine_labels)
scores = cross_val_score(forest_reg, wine_features_tr, wine_labels, scoring="neg_mean_squared_error", cv=10)
forest_mse_scores = -scores
display_scores(forest_mse_scores)

In [None]:
quality_test_predictions = forest_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

In [None]:
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

# Step 6 :- Fine-tune Your Model

GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

In [None]:
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

In [None]:
grid_search.fit(wine_features_tr, wine_labels)

In [None]:
grid_search.best_params_

In [None]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(-mean_score, params)

In [None]:
grid_search.best_estimator_

Randomized Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV
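
Instead of trying every combination in a grid, `RandomizedSearchCV` samples a fixed number of hyperparameter combinations from given distributions. A minimal sketch follows, assuming synthetic regression data from `make_regression` in place of the wine features; the parameter ranges, `n_iter` and `cv` values are illustrative choices, not tuned settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data with 11 features, standing in for the wine features.
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=42)

# Distributions to sample from (ranges are illustrative assumptions).
param_distributions = {
    "n_estimators": randint(low=5, high=50),   # integers in [5, 50)
    "max_features": randint(low=2, high=9),    # integers in [2, 9)
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                                  # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

print(rnd_search.best_params_)
print("Best CV MSE:", -rnd_search.best_score_)
```

With the notebook's data, the same search would be fitted on `wine_features_tr` and `wine_labels` instead of `X` and `y`.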

In [None]:
# Feature importances of the best model found by the grid search
feature_importances = grid_search.best_estimator_.feature_importances_

In [None]:
sorted(zip(feature_importances, feature_list), reverse=True)

Evaluation on Test Set :- Once we are satisfied with the model's performance, we evaluate it on the test set.

In [None]:
# 1. Transform the test features.

# Copy all features, leaving aside the label.
wine_features_test = strat_test_set.drop("quality", axis=1)
# Copy the label list.
wine_labels_test = strat_test_set['quality'].copy()
# Apply the imputer fitted on the training set (transform only, no refitting on test data).
wine_features_test_tr = pd.DataFrame(imputer.transform(wine_features_test), columns=wine_features_test.columns)

In [None]:
# 2. Use the predict method with the trained model and the test set.
quality_test_predictions = grid_search.best_estimator_.predict(wine_features_test_tr)

In [None]:
# 3. Compare the predicted labels with the actual ones and report the evaluation metric.
mean_squared_error(wine_labels_test, quality_test_predictions)

In [None]:
# 4. It's a good idea to get a 95% confidence interval for the evaluation metric (here, the MSE). It can be obtained with the following code:
from scipy import stats
confidence = 0.95
squared_errors = (quality_test_predictions - wine_labels_test) ** 2
stats.t.interval(confidence, len(squared_errors) - 1, loc=squared_errors.mean(), scale=stats.sem(squared_errors))

# Step 7 :- Present your solution

# Step 8 :- Launch, monitor and maintain your system