<h1>Introduction to classification</h1>

In this notebook we will be using the wine quality dataset (available at [Kaggle](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset)) to explore how machine learning can be used to infer the quality of a wine based on its attributes.

So, first let us explore our dataset using pandas and matplotlib.

In [None]:
# Importing the holy trinity of data science.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# getting seaborn online to give our graphics a kick.
import seaborn as sns
sns.set()

<h1>Summary</h1>

Before we start let us....kind of jump to the end!

In [None]:
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_absolute_error

from sklearn.ensemble import RandomForestClassifier

In [None]:
# Reloading the dataset.
wine_df = pd.read_csv("Datasets\\WineQT.csv")

# Separating independent and dependent variables.
X = wine_df.drop(['quality','Id'], axis=1)
Y = wine_df['quality']

# Splitting.
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=Y)

# Creating the classifier object.
RFR = RandomForestClassifier(n_estimators=10,
                             max_depth=50, 
                             random_state=42)

# Fitting the train dataset.
RFR.fit(X_train,Y_train)

# Creating the predictions.
Y_pred = RFR.predict(X_test)

# Calculating error
print(mean_absolute_error(Y_test,Y_pred))

<h1>Now we start the lecture</h1>

In [None]:
# Loading the dataset.
wine_df = pd.read_csv("WineQT.csv")
wine_df.head()

In [None]:
wine_df.keys()

Question: in the context of our problem, what are the independent variables and what is the dependent variable? (or, what is even all of this?)

Exercise: how could you check which variables seem better correlated with the dependent variable?

In [None]:
# Scattering
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14,4))

for xcol, ax in zip(['fixed acidity', 'volatile acidity', 'residual sugar'], axes):
    wine_df.plot(kind='scatter', x=xcol, y='quality', ax=ax, alpha=0.5, color='r')

In [None]:
# Scattering
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14,4))

for xcol, ax in zip(['fixed acidity', 'volatile acidity', 'alcohol'], axes):
    wine_df.plot(kind='scatter', x=xcol, y='quality', ax=ax, alpha=0.5, color='r')

In [None]:
corr = wine_df.corr()
sns.heatmap(corr)
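
A heatmap is nice to look at, but a sorted list is easier to read. The sketch below ranks features by the strength of their correlation with the target. It runs on a small synthetic frame (`toy_df` is made up for illustration and is not the WineQT data); on the real data the same chain could presumably be applied to `wine_df.corr()['quality']`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wine_df: synthetic values, NOT the real WineQT data.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.0, 1.0, size=200)
toy_df = pd.DataFrame({
    "alcohol": alcohol,
    "residual sugar": rng.normal(2.5, 1.0, size=200),
    # quality is built to depend on alcohol so the ranking has something to find.
    "quality": alcohol + rng.normal(0.0, 0.5, size=200),
})

# Correlation of every feature with the target, strongest (by absolute value) first.
target_corr = (
    toy_df.corr()["quality"]
    .drop("quality")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.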

Exercise: What would you consider a wine with high acidity content?

Exercise: How many rows of missing data do we have? What should we do about them?
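
One possible way to approach the exercise above, sketched on a tiny hand-made frame (the values in `toy_missing_df` are illustrative only). Dropping incomplete rows is just one option; whether it is the right one depends on how much data would be lost.

```python
import numpy as np
import pandas as pd

# Toy frame with two gaps; values are illustrative, not taken from WineQT.
toy_missing_df = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.1, 11.2],
    "pH": [3.5, 3.2, np.nan, 3.3],
    "quality": [5, 6, 5, 7],
})

# Missing values per column.
missing_per_column = toy_missing_df.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy_missing_df.isna().any(axis=1).sum())
print(rows_with_missing)

# One option among several: drop the incomplete rows.
cleaned_df = toy_missing_df.dropna()
```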

Exercise: are there any columns that we can assume are irrelevant for this exercise? Are there any....potentially dangerous columns out there?

In [None]:
wine_df = wine_df.drop(['Id'], axis=1)

Exercise: Check if you have [scikit-learn](https://scikit-learn.org/stable/install.html) installed and install it if you don't.

In [None]:
import sklearn

Now we will manually encode the quality column to labels.

In [None]:
wine_df.quality.value_counts()

In [None]:
wine_df.quality.describe()

In [None]:
def WineSnob(x):
    if x<=4:
        return('Nope')
    elif x<7:
        return('Ok')
    elif x>=7:
        return('More')

In [None]:
wine_df.quality.apply(lambda x: WineSnob(x))

In [None]:
wine_df['quality_str'] = wine_df.quality.apply(lambda x: WineSnob(x))

In [None]:
wine_df['quality_str'].value_counts()
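
As an aside, the same binning can be sketched without a custom function by using `pd.cut`. The example below assumes integer quality scores (for a non-integer score such as 6.5 the bin edges would not match `WineSnob` exactly) and uses a made-up series rather than the real column.

```python
import numpy as np
import pandas as pd

# Made-up integer scores covering every branch of WineSnob.
toy_quality = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Bins are right-inclusive: (-inf, 4] -> Nope, (4, 6] -> Ok, (6, inf) -> More.
toy_quality_str = pd.cut(
    toy_quality,
    bins=[-np.inf, 4, 6, np.inf],
    labels=["Nope", "Ok", "More"],
)
print(toy_quality_str.tolist())
# ['Nope', 'Nope', 'Ok', 'Ok', 'More', 'More']
```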

Question: how do we handle non-numerical variables? What steps can we take to ensure the machine can make sense of them?

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
le.fit_transform(wine_df['quality_str'])

In [None]:
wine_df['quality_enc'] = le.fit_transform(wine_df['quality_str'])

In [None]:
wine_df['quality_str'].value_counts()

In [None]:
wine_df['quality_enc'].value_counts()
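
One thing worth checking is which integer each label received. `LabelEncoder` sorts the classes alphabetically, so the codes do not follow the Nope < Ok < More ordering. The sketch below shows this on a short made-up list, together with an explicit mapping (`snob_order` is a hypothetical name) that would preserve the order if it matters, for example when distances between labels are computed later.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["Nope", "Ok", "More", "Ok"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# Classes are stored in sorted (alphabetical) order.
print(list(toy_encoder.classes_))  # ['More', 'Nope', 'Ok']
print(toy_encoded.tolist())        # [1, 2, 0, 2]

# Hypothetical explicit mapping that keeps the natural ordering instead.
snob_order = {"Nope": 0, "Ok": 1, "More": 2}
toy_ordinal = [snob_order[label] for label in toy_labels]
print(toy_ordinal)                 # [0, 1, 2, 1]
```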

Question: in this specific case was all of this largely unnecessary?

Now how can we use this data to make predictions? What are the steps involved?

<h1>Splitting the dataset</h1>

We will first split the dataset into a training set and a test set. But why would we do this in the first place?

In [None]:
# Notice that I will slightly change how I import libraries from now on to be more specific.
from sklearn.model_selection import train_test_split

In [None]:
# Separating the train and test.
train, test = train_test_split(wine_df, random_state=42) #train_size=0.25

In [None]:
len(wine_df)

In [None]:
print(len(train))

In [None]:
len(train)/len(wine_df)

In [None]:
train.head()

In [None]:
print(len(test))

In [None]:
test.head()

Exercise: Check the train_test_split [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and try to figure out how to change the proportion of train and test set.

In [None]:
#train, test = train_test_split(wine_df, train_size=0.25, random_state=42)

Question: what proportion of train and test set should we use? Would it make a difference?

Question: if you and your colleagues run the same line of code to separate the datasets, should you always find the same results?
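
A quick experiment can help with the two questions above. The sketch below uses a made-up array of 20 numbers in place of the dataset and checks that a fixed `random_state` reproduces the split, and that `test_size` changes the proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

toy_data = np.arange(20)

# Same seed twice: the two splits should be identical.
train_a, test_a = train_test_split(toy_data, random_state=42)
train_b, test_b = train_test_split(toy_data, random_state=42)
same_seed_matches = np.array_equal(test_a, test_b)
print(same_seed_matches)

# By default 25% of the rows go to the test set.
print(len(train_a), len(test_a))

# Asking for a 20% test set instead.
train_c, test_c = train_test_split(toy_data, test_size=0.2, random_state=42)
print(len(train_c), len(test_c))
```

Without `random_state`, each call shuffles differently, so two colleagues would generally get different rows in their train and test sets.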

After my incredibly interesting and engaging monologue about randomness we will separate our dependent and independent variables.

In [None]:
X_train = train.drop(['quality','quality_str','quality_enc'], axis=1)
Y_train = train['quality_enc']

In [None]:
X_test = test.drop(['quality','quality_str','quality_enc'], axis=1)
Y_test = test['quality_enc']

<h1>Feature scaling</h1>

It is a common practice in machine learning to scale features. We will check how to do this process in the next few lines.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exercise: print the first 5 rows of one of the objects defined above!

In [None]:
# Raises AttributeError: after scaling, X_train is a NumPy array, which has no .head().
X_train.head()

In [None]:
X_train
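
To make it less of a black box: standardization replaces each value x with (x - mean) / std, computed per column. The sketch below checks this against `StandardScaler` on a small made-up array (`toy_features` is illustrative only) and also shows that slicing is the NumPy way to look at the first rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
toy_features = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

toy_scaled = StandardScaler().fit_transform(toy_features)

# The same thing by hand: subtract the column mean, divide by the column std.
toy_manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)
print(np.allclose(toy_scaled, toy_manual))

# NumPy arrays have no .head(); slicing gives the first rows instead.
print(toy_scaled[:2])

# After scaling every column has mean ~0 and standard deviation ~1.
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```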

Should we feature scale Y?

In [None]:
# Raises ValueError: StandardScaler expects a 2D array, but Y_train is a 1D Series.
Y_train = StandardScaler().fit_transform(Y_train)
Y_test = StandardScaler().fit_transform(Y_test)

In [None]:
Y_train_nope = Y_train.to_numpy()

In [None]:
Y_train.to_numpy()

In [None]:
Y_train_nope.reshape(-1,1)

In [None]:
Y_train_nope = StandardScaler().fit_transform(Y_train_nope.reshape(-1,1))

In [None]:
Y_train_nope

But really....what do we stand to gain by doing all this? Will the model still run later on if you don't apply feature scaling?

In [None]:
# Loading the dataset.
wine_df = pd.read_csv("WineQT.csv")

# Separating the train and test.
train, test = train_test_split(wine_df, random_state=42)

X_train = train.drop(['quality','Id'], axis=1)
Y_train = train['quality']
X_test = test.drop(['quality','Id'], axis=1)
Y_test = test['quality']

# Scaling: fit on the training set only, then apply the same transform to the test set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

<h1>The decision tree classifier</h1>
In the next few blocks we will be exploring the use of [decision trees](https://scikit-learn.org/stable/modules/tree.html) to classify our data.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Initializing the model.
DTR = DecisionTreeClassifier(random_state=42)

# fit and batch of predictions.
DTR.fit(X_train,Y_train)
Y_pred = DTR.predict(X_test)

Exercise: how can you visually check how good the performance was?

In [None]:
Y_pred

In [None]:
fig, ax = plt.subplots(1,1,figsize=(14,6))
x_axis = np.arange(0,len(Y_pred))

ax.scatter(x_axis,Y_test, s=10, label='Test')
ax.scatter(x_axis,Y_pred, s=25, label='Pred', alpha=0.25)

ax.set_title('The wine snob classifier!')

ax.legend()
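
Besides the scatter plot, a confusion matrix is another common way to eyeball a classifier. The sketch below uses short made-up label arrays (`toy_true` and `toy_pred` are illustrative, not model output); rows are the true classes and columns are the predicted ones, so the diagonal counts the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted quality scores.
toy_true = np.array([5, 5, 6, 6, 6, 7])
toy_pred = np.array([5, 6, 6, 6, 5, 7])

toy_cm = confusion_matrix(toy_true, toy_pred, labels=[5, 6, 7])
print(toy_cm)
# [[1 1 0]
#  [1 2 0]
#  [0 0 1]]
```

The resulting array could then be passed to something like `sns.heatmap(toy_cm, annot=True)` for a visual version.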

<h1>Measuring accuracy</h1>
There are many metrics that can be used to measure how well a model performs. Over the next few blocks we will see how to use both the [mean absolute error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) and [mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [None]:
ys_mae = mean_absolute_error(Y_test,Y_pred)
ys_mse = mean_squared_error(Y_test,Y_pred)

In [None]:
print(f"MAE {ys_mae}, MSE {ys_mse}")
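
MAE and MSE treat the labels as numbers, so they measure how far off a prediction is rather than how often it is right. For comparison, the sketch below also computes `accuracy_score` (the fraction of exact matches) on a small made-up example. This is an addition for illustration, not one of the metrics used in the cells above.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Made-up labels: two exact hits, one miss by 1, one miss by 2.
toy_y_true = [5, 6, 7, 5]
toy_y_pred = [5, 5, 7, 7]

toy_mae = mean_absolute_error(toy_y_true, toy_y_pred)  # (0 + 1 + 0 + 2) / 4
toy_mse = mean_squared_error(toy_y_true, toy_y_pred)   # (0 + 1 + 0 + 4) / 4
toy_acc = accuracy_score(toy_y_true, toy_y_pred)       # 2 correct out of 4

print(f"MAE {toy_mae}, MSE {toy_mse}, accuracy {toy_acc}")
# MAE 0.75, MSE 1.25, accuracy 0.5
```

The distance-based metrics only make sense here because quality scores are ordered; for unordered labels accuracy would be the safer choice.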

Exercise: check the [DTC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and run the model for a different depth.

In [None]:
# Initializing the model.
DTR = DecisionTreeClassifier(max_depth = 15, random_state=42)

# Fit and batch of predictions.
DTR.fit(X_train,Y_train)
Y_pred = DTR.predict(X_test)

In [None]:
ys_mae = mean_absolute_error(Y_test,Y_pred)
ys_mse = mean_squared_error(Y_test,Y_pred)

fig, ax = plt.subplots(1,1,figsize=(14,6))
x_axis = np.arange(0,len(Y_pred))

ax.scatter(x_axis,Y_test, s=10, label='Test')
ax.scatter(x_axis,Y_pred, s=25, label='Pred', alpha=0.25)

ax.set_title(f"MAE {ys_mae:.2f}, MSE {ys_mse:.2f}")

ax.legend()

Question: If the results depend on the parameters we set, how could you determine the optimal value for a parameter such as max_depth? Exercise: do it!

In [None]:
results = []
for i in range(1,100):
    # Initializing the model.
    DTR = DecisionTreeClassifier(max_depth=i, 
                                 random_state=42)

    # Fit and batch of predictions.
    DTR.fit(X_train,Y_train)
    Y_pred = DTR.predict(X_test)

    ys_mae = mean_absolute_error(Y_test,Y_pred)

    results.append(ys_mae)


In [None]:
min(results)

In [None]:
plt.plot(results)
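
A caveat about the sweep above: choosing the depth that scores best on the test set means the test set is no longer an unbiased check. A common alternative is to pick the parameter with cross-validation on the training data. The sketch below does this on synthetic data from `make_classification` (a stand-in, not the wine data) and scores by accuracy, the default for classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

toy_depths = range(1, 11)
toy_mean_scores = []
for depth in toy_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # 5-fold cross-validation: each fold is held out once for scoring.
    fold_scores = cross_val_score(model, X_toy, y_toy, cv=5)
    toy_mean_scores.append(fold_scores.mean())

# Depth with the highest mean cross-validated accuracy.
toy_best_depth = toy_depths[toy_mean_scores.index(max(toy_mean_scores))]
print(toy_best_depth)
```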

<h1>The random forest classifier</h1>
The implementation of the random forest classifier is quite similar to what we have seen previously, although the rationale behind how it works differs significantly. Details can be found in the [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Creating the classifier object.
RFR = RandomForestClassifier(n_estimators=10,
                             max_depth=50, 
                             random_state=42)

# Fitting the train dataset.
RFR.fit(X_train,Y_train)

# Creating the predictions.
Y_pred = RFR.predict(X_test)

In [None]:
ys_mae = mean_absolute_error(Y_test,Y_pred)
ys_mse = mean_squared_error(Y_test,Y_pred)

fig, ax = plt.subplots(1,1,figsize=(14,6))
x_axis = np.arange(0,len(Y_pred))

ax.scatter(x_axis,Y_test, s=10, label='Test')
ax.scatter(x_axis,Y_pred, s=25, label='Pred', alpha=0.25)

ax.set_title(f"MAE {ys_mae:.2f}, MSE {ys_mse:.2f}")

ax.legend()

Exercise: The accuracy obtained is currently dreadful. Improve on that!

In [None]:
results = []
for i in range(1,100):
    # Creating the classifier object.
    RFR = RandomForestClassifier(n_estimators=i,
                                 max_depth=10, 
                                 random_state=42)

    # Fitting the train dataset.
    RFR.fit(X_train,Y_train)

    # Creating the predictions.
    Y_pred = RFR.predict(X_test)

    # Calculate error
    ys_mae = mean_absolute_error(Y_test,Y_pred)
    
    # Appending to the list.
    results.append(ys_mae)
    
print(min(results))

In [None]:
plt.plot(results)

In [None]:
index_min = min(range(len(results)), key=results.__getitem__)

In [None]:
results[index_min]

In [None]:
index_min
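
A small detail that is easy to miss: the loop started at 1, so position `i` in the list corresponds to `n_estimators = i + 1`. The sketch below shows the bookkeeping on a made-up list of errors (`toy_results` is not real output), using `np.argmin` as an alternative way to find the index.

```python
import numpy as np

# Made-up errors for n_estimators = 1, 2, 3, 4.
toy_results = [0.52, 0.47, 0.45, 0.49]

# Position of the smallest error in the list (0-based).
toy_index_min = int(np.argmin(toy_results))

# The sweep started at 1, so shift by one to recover the parameter value.
toy_best_n_estimators = toy_index_min + 1

print(toy_index_min, toy_best_n_estimators)
# 2 3
```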

<h1>Support Vector Machines</h1>

The last technique we will see today is Support Vector Machines (or SVMs for short).

[Documentation](https://scikit-learn.org/stable/modules/svm.html)

In [None]:
from sklearn import svm

In [None]:
# Initializing the model. Note: `degree` only applies to the 'poly' kernel and is ignored here.
SVC = svm.SVC(kernel='linear', degree=1)

# Fit and batch of predictions.
SVC.fit(X_train,Y_train)

# Creating the predictions.
Y_pred = SVC.predict(X_test)

In [None]:
ys_mae = mean_absolute_error(Y_test,Y_pred)
ys_mse = mean_squared_error(Y_test,Y_pred)

fig, ax = plt.subplots(1,1,figsize=(14,6))
x_axis = np.arange(0,len(Y_pred))

ax.scatter(x_axis,Y_test, s=10, label='Test')
ax.scatter(x_axis,Y_pred, s=25, label='Pred', alpha=0.25)

ax.set_title(f"MAE {ys_mae:.2f}, MSE {ys_mse:.2f}")

ax.legend()

Exercise: improve the accuracy.

<h1>Pipelining</h1>

Hint: check this [post](https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976)

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [None]:
# Reloading the dataset.
wine_df = pd.read_csv("WineQT.csv")
wine_df.head()

# Separating independent and dependent variables.
X = wine_df.drop(['quality','Id'], axis=1)
Y = wine_df['quality']

In [None]:
# Splitting.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=Y)
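
What does `stratify` buy us? It keeps the class proportions the same in the train and test sets, which matters when some classes are rare. The sketch below shows this on made-up, imbalanced labels (80 zeros and 20 ones); the `toy_` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80% class 0, 20% class 1.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,
)

# Share of class 1 in each part; both should stay at 20%.
print(toy_y_train.mean(), toy_y_test.mean())
```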

In [None]:
# Defining the pipeline.
steps = [('scaler', StandardScaler()), ('SVM', svm.SVC())]
pipeline = Pipeline(steps)

In [None]:
parameters = {'SVM__C':[0.001,0.1,10,100,10e5], 'SVM__gamma':[0.1,0.01]}

In [None]:
# Defining the grid search.
grid = GridSearchCV(pipeline, 
                    param_grid=parameters, 
                    cv=5)

In [None]:
grid.fit(X_train, y_train)
print("score = %3.2f" %(grid.score(X_test,y_test)))
print(grid.best_params_)
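
Beyond the best parameters, a fitted grid search keeps the score of every combination it tried. The sketch below rebuilds a similar pipeline on synthetic data (`make_classification` stands in for the wine data, and the smaller parameter grid is an assumption made to keep it quick) and turns `cv_results_` into a table sorted by rank.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

toy_pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])
toy_grid = GridSearchCV(
    toy_pipeline,
    param_grid={"SVM__C": [0.1, 1, 10], "SVM__gamma": [0.1, 0.01]},
    cv=5,
)
toy_grid.fit(X_tr, y_tr)

# One row per parameter combination, best first.
cv_table = pd.DataFrame(toy_grid.cv_results_)[
    ["param_SVM__C", "param_SVM__gamma", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(cv_table)

# Mean cross-validated score of the winner, and its score on the held-out test set.
print(toy_grid.best_score_, toy_grid.score(X_te, y_te))
```

Because the scaler lives inside the pipeline, it is refit on the training part of every fold, so no information from the validation fold leaks into the scaling.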

Exercise: re-run the model without applying feature scaling!