# Sklearn Pipelines Exercise
*Made by viga@itu.dk and thso@itu.dk*

## Introduction

In this exercise you'll be working with the [Wine Quality Dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the UCI Machine Learning Repository. The dataset consists of 11 features and a quality score for 4898 white wine samples and 1599 red wine samples. The goal is to predict the quality of the wine based on the features.

The datasets are located in the `data` folder. The `winequality-red.csv` file contains the red wine samples and the `winequality-white.csv` file contains the white wine samples. Lastly, the `winequality.names` file contains a description of the dataset.

The goal of this exercise is to get you familiar with the [Scikit-learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) API. You'll be using pipelines to perform missing-value imputation and feature scaling before training a model.

## Load in the data

You can either load the red-wine dataset or the white-wine dataset. You can also load both datasets and combine them if you want.

Both datasets are available in the `data` folder, and are called `winequality-red.csv` and `winequality-white.csv`.

Hint: You can use the `pd.read_csv()` function to load in the data (remember to check the delimiter!). You can find the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).


In [92]:
import pandas as pd

df = pd.read_csv("data/winequality-white.csv", delimiter=";")
# if you want to you can combine the two datasets into one - but this is not necessary

# check a few rows of the data - hint: use .head()
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
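
If you do decide to combine the two files, one possible approach is to tag each row with its origin before concatenating, so that the wine type stays available as a feature. The sketch below is only an illustration: the helper name `load_wines` and the `is_red` column are assumptions that are not part of the exercise, and the two tiny in-memory CSV snippets (made-up values) stand in for the real files so that the block runs on its own.

```python
import io

import pandas as pd


def load_wines(red_source, white_source):
    """Read both semicolon-delimited CSVs and stack them into one frame.

    `is_red` is a hypothetical helper column (1 = red, 0 = white).
    """
    red = pd.read_csv(red_source, delimiter=";").assign(is_red=1)
    white = pd.read_csv(white_source, delimiter=";").assign(is_red=0)
    # ignore_index=True renumbers the rows 0..n-1 in the combined frame
    return pd.concat([red, white], ignore_index=True)


# Tiny stand-ins for the real files (made-up values, for illustration only)
red_csv = io.StringIO('"alcohol";"quality"\n9.4;5\n9.8;5\n')
white_csv = io.StringIO('"alcohol";"quality"\n8.8;6\n')

combined = load_wines(red_csv, white_csv)
print(combined)
```

With the real data, the same helper could be called as `load_wines("data/winequality-red.csv", "data/winequality-white.csv")`.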


## Data Exploration

### Check the number of missing values in the dataset.

Hint: `.isnull()`

Don't worry if there are missing values, we'll handle them later in our pipeline!

In [93]:
df.isnull().sum()

fixed acidity           3
volatile acidity        3
citric acid             5
residual sugar          6
chlorides               6
free sulfur dioxide     3
total sulfur dioxide    4
density                 1
pH                      3
sulphates               3
alcohol                 3
quality                 0
dtype: int64
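
Besides the raw counts, it can help to look at the share of missing values per column and at how many rows are affected at all. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions for illustration); the same calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Made-up toy frame standing in for the wine data
toy = pd.DataFrame({
    "pH": [3.0, np.nan, 3.3, 3.2],
    "alcohol": [8.8, 9.5, np.nan, 9.9],
    "quality": [6, 6, 5, 6],
})

missing_per_column = toy.isnull().sum()             # count of NaNs in each column
missing_share = toy.isnull().mean()                 # fraction of NaNs in each column
rows_with_missing = toy.isnull().any(axis=1).sum()  # rows with at least one NaN

print(missing_per_column)
print(missing_share)
print("rows with at least one missing value:", rows_with_missing)
```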

### Check some basic statistics

We want to know the mean, standard deviation, minimum, maximum and quartiles of each feature.
This will give us a good idea of the distribution of the data, and also tell us if we need to do any scaling.

Hint: use `.describe()`. If the output is hard to read, you can use `.T` to transpose the dataframe, i.e., swap the rows and columns.

Do you notice anything strange about the data? Is there anything that stands out to you?

In [94]:
# check some basic statistics
df.describe().T

# The mean of total sulfur dioxide is far higher than the other means,
# and in general the values in the total sulfur dioxide row are much larger than the others.


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fixed acidity | 4895.0 | 6.855148 | 0.843928 | 3.8 | 6.3 | 6.8 | 7.3 | 14.2 |
| volatile acidity | 4895.0 | 0.278252 | 0.100782 | 0.08 | 0.21 | 0.26 | 0.32 | 1.1 |
| citric acid | 4893.0 | 0.334224 | 0.120905 | 0.0 | 0.27 | 0.32 | 0.39 | 1.66 |
| residual sugar | 4892.0 | 6.391977 | 5.073363 | 0.6 | 1.7 | 5.2 | 9.9 | 65.8 |
| chlorides | 4892.0 | 0.045781 | 0.021859 | 0.009 | 0.036 | 0.043 | 0.05 | 0.346 |
| free sulfur dioxide | 4895.0 | 35.306844 | 17.004703 | 2.0 | 23.0 | 34.0 | 46.0 | 289.0 |
| total sulfur dioxide | 4894.0 | 138.394463 | 42.490153 | 9.0 | 108.0 | 134.0 | 167.0 | 440.0 |
| density | 4897.0 | 0.994027 | 0.002991 | 0.98711 | 0.99172 | 0.99374 | 0.9961 | 1.03898 |
| pH | 4895.0 | 3.188319 | 0.151024 | 2.72 | 3.09 | 3.18 | 3.28 | 3.82 |
| sulphates | 4895.0 | 0.489841 | 0.114137 | 0.22 | 0.41 | 0.47 | 0.55 | 1.08 |


We saw that there were some missing values in the dataset. We can fix this in the pipeline using the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) from sklearn.

We also saw that there were some differences in the scale of the different variables, so we will use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from sklearn to scale the data. This makes it easier for the model to learn the patterns in the data. Scaling is especially important for the KNN algorithm (which we'll use), because it is based on distances between samples.
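
To see what these two transformers do on their own, here is a small sketch on a made-up array (`X_toy` and its values are assumptions for illustration, not wine data). It fills each gap with the column mean and then standardizes each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on different scales, each with one gap
X_toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")  # "mean" is the default strategy
X_imputed = imputer.fit_transform(X_toy)  # NaNs become the column means (2.0 and 15.0)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)  # each column now has mean 0 and unit variance

print(X_imputed)
print(X_scaled.round(3))
```

After scaling, both columns contribute on a comparable scale to a distance computation, which is the property KNN relies on.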

If you think of other transformations that might be useful for this dataset, feel free to try them out!

**Take a look at the [sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) module for some inspiration.**

## Splitting the data

Before we build and train the pipeline, we need to set aside data for evaluation.

First we need to split the data into a training set and a test set. We will use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from sklearn to do this. But first we need to split the data into features and labels.

The features are all the columns in the dataset except for the `quality` column, which holds the labels.

We will use the default split of 75% training data, and 25% test data.

Hint: You can use the `random_state` parameter to make sure that the data is split the same way every time you run the code.

The `train_test_split` function returns four values: the training and test features, followed by the training and test labels.

In [100]:
from sklearn.model_selection import train_test_split
# split the data into X and y
X = df.drop(columns="quality")
y = df["quality"]

# now split the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42,
)
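
The default split size and the effect of `random_state` can be checked on a small synthetic example. The arrays `X_demo` and `y_demo` below are made-up stand-ins, so treat this as a sketch and not as part of the exercise solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 3 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# test_size defaults to 0.25 when neither test_size nor train_size is given
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_te, X_te2))
```

For a target with uneven class sizes, passing `stratify=y` to `train_test_split` is an option worth considering, since it keeps the class proportions similar in both parts.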

## Creating the pipeline

We will now create a pipeline that will handle the missing values and scaling for us, and finally train a KNN model on the data.

We will use the [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class from sklearn to create our pipeline.

The pipeline will consist of three steps: the first imputes the missing values, the second scales the data, and the third trains the model.

The pipeline format is a list of tuples, where the first element in the tuple is the name of the step, and the second element is the step itself, e.g.:

```python
# Placeholder sketch: each step needs its own unique name
pipeline = Pipeline([
    ('first_step_name', first_step()),
    ('second_step_name', second_step()),
    ('third_step_name', third_step()),
])
```

Where the `step_name` is a unique string, and the `step` is a sklearn object. Every step except the last must be a "Transformer" object (like `SimpleImputer` and `StandardScaler`), while the last step can be any "Estimator" object, typically a model (like `KNeighborsClassifier` or `LinearRegression`).
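
The step names are more than labels: they let you look up a step and change its parameters later. The sketch below uses arbitrary names (`imputer`, `scaler`, `knn`) chosen for illustration and shows `named_steps` together with the `<step_name>__<parameter>` syntax of `set_params`.

```python
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names are arbitrary labels chosen for this sketch
demo_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Look up a step by its name
print(demo_pipe.named_steps["knn"])

# Change a parameter of one step with the <step_name>__<parameter> syntax
demo_pipe.set_params(knn__n_neighbors=3)
print(demo_pipe.named_steps["knn"].n_neighbors)
```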

In [104]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

# create your pipeline
pipe = Pipeline([
    ("simple_imputer", SimpleImputer()),
    ("standard_scaler", StandardScaler()),
    ("k_neighbors_classifier", KNeighborsClassifier())
])



## Evaluating the model

Now that we have defined the pipeline, we can train it and evaluate how well it performs.

We will use the [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function from sklearn to calculate the accuracy of the model.

Since we have created a pipeline, we can simply call the `.fit()` and `.predict()` methods on the pipeline object, and it will handle the preprocessing for us - and importantly in the correct order.

Remember to only call `.fit()` on the training data. Fitting on the test data leaks information about the test set into the imputer, the scaler, and the model, which gives you an overly optimistic accuracy score.

* **`.fit(X_train, y_train)` will train the model on the training data.**
* **`.predict(X_train)` will return the predicted labels for the training data, which you can then pass to the `accuracy_score` function, along with the true labels (`y_train`).**
* **`.predict(X_test)` will return the predicted labels for the test data, which you can then pass to the `accuracy_score` function, along with the true labels (y_test).**
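
The three calls above can be sketched end to end on synthetic data. The dataset from `make_classification` and the names `eval_pipe`, `train_acc`, and `test_acc` are assumptions for illustration. The last line checks that the scaler inside the pipeline learned its statistics from the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

eval_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
eval_pipe.fit(X_tr, y_tr)  # fit on the training data only

train_acc = accuracy_score(y_tr, eval_pipe.predict(X_tr))
test_acc = accuracy_score(y_te, eval_pipe.predict(X_te))
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)

# The scaler's mean comes from the training rows alone
print(np.allclose(eval_pipe.named_steps["scaler"].mean_, X_tr.mean(axis=0)))
```

A large gap between the two accuracies is a common sign of overfitting.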

In [105]:
from sklearn.metrics import accuracy_score
# fit the pipeline
pipe.fit(X_train, y_train)

# evaluate the pipeline
y_pred = pipe.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("model prediction accuracy:", accuracy)

model prediction accuracy: 0.5395918367346939


## Further testing to try and improve the accuracy

I want to further my understanding of how pipelines work, so I will try a few different tweaks to the pipeline to see whether I can get better prediction accuracy.

In [106]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# More configuration in KNeighborsClassifier
pipe = Pipeline([
    ("simple_imputer", SimpleImputer()),
    ("standard_scaler", StandardScaler()),
    ("k_neighbors_classifier", KNeighborsClassifier(
        n_neighbors=7,
        weights="distance",
    ))
])

pipe.fit(X_train, y_train)

# evaluate the pipeline
y_pred = pipe.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("model prediction accuracy:", accuracy)

model prediction accuracy: 0.6620408163265306
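
Instead of changing the KNN settings by hand, the search can be automated with `GridSearchCV`, which accepts a pipeline and addresses step parameters through the `<step_name>__<parameter>` syntax. This is a hedged sketch on synthetic data: the step names, the candidate values in `param_grid`, and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

search_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are arbitrary choices for this sketch
param_grid = {
    "knn__n_neighbors": [3, 7, 11],
    "knn__weights": ["uniform", "distance"],
}

# 5-fold cross-validation refits the whole pipeline for every combination
search = GridSearchCV(search_pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```

In the exercise itself, the search would be fitted on `X_train` and `y_train` so that the test set stays untouched until the final evaluation.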


In [107]:
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ("simple_imputer", SimpleImputer()),
    ("standard_scaler", StandardScaler()),  # optional for trees
    ("rf_classifier", RandomForestClassifier(
        n_estimators=200,
        class_weight='balanced'
    ))
])

pipe.fit(X_train, y_train)

# evaluate the pipeline
y_pred = pipe.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("model prediction accuracy:", accuracy)

model prediction accuracy: 0.6955102040816327
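
A single train/test split gives only one accuracy number per model, and a random forest without a fixed `random_state` can give slightly different results on each run. One way to compare pipelines more robustly is cross-validation with `cross_val_score`. The sketch below uses synthetic data, and the names `candidates` and `cv_results` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    "knn": Pipeline([
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=7, weights="distance")),
    ]),
    "random_forest": Pipeline([
        ("imputer", SimpleImputer()),
        # random_state fixed so repeated runs give the same forest
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]),
}

cv_results = {}
for name, candidate in candidates.items():
    # Each of the 5 folds refits the full pipeline on its own training part
    scores = cross_val_score(candidate, X_syn, y_syn, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between two mean scores is smaller than their spread across folds, it is hard to say that one model is really better.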
