# Wine quality analysis with decision trees

The file `wine_quality.csv` contains the chemical properties of a set of wines. Let's see whether what we have learned so far can help us predict, from those properties, whether a wine is good.

## Load, examine, clean, prepare

In [23]:
# Read and parse the wine_quality.csv file.

# The file is in CSV format. The pandas library is well
# suited to read and parse each field.
import pandas
data = pandas.read_csv("wine_quality.csv")

# We can take a look at the dataset to see what it contains.
data.head()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,white,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,white,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,white,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [24]:
# How many rows and columns does the dataset have?
n_rows, n_cols = data.shape
print("The dataset has {} rows and {} columns.".format(
    n_rows, n_cols))

The dataset has 6497 rows and 13 columns.


In [25]:
# List all columns of this dataset: the chemical properties,
# plus the wine `type` and its `quality`.
print("The chemical properties of each wine are:")
for col_name in data.columns:
    print("  -", col_name)

The chemical properties of each wine are:
  - type
  - fixed acidity
  - volatile acidity
  - citric acid
  - residual sugar
  - chlorides
  - free sulfur dioxide
  - total sulfur dioxide
  - density
  - pH
  - sulphates
  - alcohol
  - quality


In [26]:
print(list(data.columns))

['type', 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']


In [27]:
# What kinds of wine are present in this dataset?

# From the previous question, we can see that the column
# `type` indicates the kind of each wine.
print("The different kinds of wine in this dataset are:")
for kind in data.type.unique():
    print("  -", kind)

The different kinds of wine in this dataset are:
  - white
  - red


In [28]:
# Find the right method to get the average/minimum/maximum value
# of each column (and only these 3 statistics per column).

#data.describe().loc[["mean", "min", "max"]]
data.agg(['mean', 'min', 'max'])

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
min,red,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0
max,white,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0
mean,,7.216579,0.339691,0.318722,5.444326,0.056042,30.525319,115.744574,0.994697,3.218395,0.531215,10.491801,5.818378


In [29]:
# Does this dataset have any missing information?
# How many missing values are there?
# Which column has the most missing values?
print("There are {} missing values in the dataframe.".format(
    data.isna().sum().sum()))

There are 38 missing values in the dataframe.


In [30]:
# Which column has the most missing values?
print("The column with the most missing values is '{}'.".format(
    data.isna().sum().idxmax()))

The column with the most missing values is 'fixed acidity'.


In [31]:
# Remove the rows which have at least 1 missing value.
n_before = len(data)
data = data.dropna(axis='index')  # axis='index' drops rows, not columns

# How many rows have been removed?

n_after = len(data)
print("{} rows have been removed because they contained at "
      "least 1 missing value.".format(n_before - n_after))

34 rows have been removed because they contained at least 1 missing value.


In [17]:
# Use a histogram to see the distribution of
# the wine quality.
# Note: the TypeError recorded under this cell is commonly reported when
# an old pandas release runs against NumPy >= 1.20, which would point to
# the environment rather than to this code.
import matplotlib.pyplot as plt
%matplotlib inline

quality_histogram = data.quality.value_counts().sort_index()
quality_histogram.plot(kind="bar", colormap="summer")
plt.title("Distribution of wine quality")
plt.show()

TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

In [32]:
# Let's consider that a wine is good if its quality is
# at least 7. Replace the values in the "quality" column
# with "good" if quality >= 7 and with "not good" otherwise.
import numpy as np

data.quality = np.where(data.quality >= 7, "good", "not good")

In [None]:
# Create the input data (i.e. the properties) and the
# label (i.e. the quality of wine) and assign them
# to 2 different variables X and y. Our machine learning
# algorithm needs to have both input and output data.
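One possible way to do this split is sketched below. The small DataFrame is a hypothetical stand-in for the cleaned `data` (made-up values, only a few of the real columns); with the real dataset, only the last two assignments would be needed.

```python
import pandas as pd

# Tiny hypothetical stand-in for the cleaned wine DataFrame
# (same kind of columns, made-up values).
data = pd.DataFrame({
    "type": ["white", "red", "white", "red"],
    "alcohol": [8.8, 9.5, 12.1, 10.4],
    "pH": [3.00, 3.30, 3.26, 3.19],
    "quality": ["not good", "not good", "good", "not good"],
})

# X holds every column except the label; y holds only the label.
X = data.drop(columns=["quality"])
y = data["quality"]

print(X.shape, y.shape)
```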

In [None]:
# Separate your data into a training and a test set
# with 80% for the training set.
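A minimal sketch of an 80/20 split with scikit-learn's `train_test_split`. The `X` and `y` below are hypothetical stand-ins with made-up values, and `random_state=42` is an arbitrary seed chosen here only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X and y (10 wines, made-up values).
X = pd.DataFrame({
    "type": ["white", "red"] * 5,
    "alcohol": [8.8, 9.5, 12.1, 10.4, 11.0, 9.1, 12.8, 9.9, 10.2, 13.0],
})
y = pd.Series(["not good", "not good", "good", "not good", "good",
               "not good", "good", "not good", "not good", "good"],
              name="quality")

# test_size=0.2 keeps 80% of the rows for training.
# random_state is an arbitrary seed that makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), "training rows,", len(X_test), "test rows")
```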

## Predicting wine quality with a decision tree

In [None]:
# Is this a classification or a regression problem?
# Import the appropriate version of DecisionTree, then
# train it with your training data.

In [None]:
# Oops, it seems that there is a problem! Indeed, most
# machine learning algorithms only work with numerical vectors.
# And our current training data still have some string values
# (like the type or the quality). We need to transform them before
# training our model.

# sklearn comes with tools to transform non-numerical values.
# In our case, we are going to use a LabelEncoder. Look at the
# documentation to learn what it does.

from sklearn.preprocessing import LabelEncoder

# Now create two encoders: one for the `type` in X, the other
# for the `quality` in y. Fit each encoder, then use it to transform
# X_train, X_test, y_train and y_test.
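The encoding step could look like the sketch below. The four variables are hypothetical stand-ins with made-up values, and the names `type_encoder` and `quality_encoder` are chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the split data (made-up values).
X_train = pd.DataFrame({"type": ["white", "red", "white"],
                        "alcohol": [8.8, 9.5, 12.1]})
X_test = pd.DataFrame({"type": ["red", "white"],
                       "alcohol": [10.4, 11.0]})
y_train = pd.Series(["not good", "not good", "good"])
y_test = pd.Series(["good", "not good"])

# One encoder per column: each learns its own string -> integer mapping.
type_encoder = LabelEncoder().fit(X_train["type"])
quality_encoder = LabelEncoder().fit(y_train)

# Fit on the training data only, then apply the same mapping to both sets.
X_train["type"] = type_encoder.transform(X_train["type"])
X_test["type"] = type_encoder.transform(X_test["type"])
y_train = quality_encoder.transform(y_train)
y_test = quality_encoder.transform(y_test)

# classes_ lists the original labels in the order of their integer codes.
print(list(type_encoder.classes_))
print(list(quality_encoder.classes_))
```

The scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` or `OneHotEncoder` is the usual recommendation. For a column with only two values, such as `type`, the resulting codes are the same.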

In [None]:
# Now train your decision tree again.

In [None]:
# What is the accuracy of your model (both on the training
# and test sets)? Do you think we are underfitting? Overfitting?
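The sketch below shows one way to train a tree and compare both accuracies. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), so the numbers it prints say nothing about the wine dataset. A large gap between training and test accuracy is the usual sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic numeric data standing in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Predicting "good" / "not good" is a classification problem.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print("train accuracy: {:.3f}".format(train_accuracy))
print("test accuracy:  {:.3f}".format(test_accuracy))
```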

In [None]:
# Look at the documentation of your DecisionTree model
# and try to tune the hyperparameters: create other models
# with different values for max_depth, min_samples_split, max_features...
# Train them and evaluate their accuracy. What is the best accuracy
# you obtain?
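A simple way to explore hyperparameters is a pair of nested loops, sketched here on synthetic data (an assumption, as are the candidate values). Picking the best model by its test accuracy gives an optimistic estimate; a separate validation set or cross-validation (for example with `GridSearchCV`) is the cleaner approach.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic numeric data standing in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Try every combination of the candidate values and record its accuracy.
results = {}
for max_depth in [2, 4, 8, None]:
    for min_samples_split in [2, 10, 50]:
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=0)
        model.fit(X_train, y_train)
        results[(max_depth, min_samples_split)] = model.score(X_test, y_test)

best_params = max(results, key=results.get)
print("best (max_depth, min_samples_split):", best_params)
print("best test accuracy: {:.3f}".format(results[best_params]))
```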

In [None]:
# Use the feature_importances_ attribute of your best model. What are
# the three most important features to evaluate the quality of a wine?
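One way to rank features is to wrap `feature_importances_` in a pandas Series indexed by column name, as sketched below. The data is synthetic and the column names `feature_0` ... `feature_5` are hypothetical; `best_model` stands for whichever tree performed best in the previous step.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names.
feature_names = ["feature_{}".format(i) for i in range(6)]
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=feature_names)

best_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Pair each importance with its column name, then keep the 3 largest.
importances = pd.Series(best_model.feature_importances_, index=X.columns)
top_three = importances.sort_values(ascending=False).head(3)
print(top_three)
```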

## Predicting wine quality with random forests

We saw in the course (and in this example) that decision trees can easily overfit. To reduce this, we can use a random forest instead. A random forest is a collection of decision trees, each trained differently (typically on a different random sample of the training data, with a random subset of features considered at each split). The forest's prediction is then the average of the trees' predictions (for regression) or their most frequent prediction (for classification).
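The "most frequent prediction" rule can be illustrated in plain Python. The tree predictions below are made up, and `majority_vote` is a hypothetical helper written for this illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter

# Hypothetical predictions of 5 trees for 3 wines (one row per tree).
tree_predictions = [
    ["good", "not good", "not good"],
    ["good", "not good", "good"],
    ["not good", "not good", "good"],
    ["good", "not good", "not good"],
    ["good", "good", "not good"],
]

def majority_vote(predictions_per_tree):
    """Return, for each sample, the label predicted by the most trees."""
    votes_per_sample = zip(*predictions_per_tree)  # regroup by wine
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]

forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # ['good', 'not good', 'not good']
```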

In [None]:
# Use a RandomForest composed of 20 decision trees and
# train it on your data. Evaluate its accuracy. Do you see
# an improvement?
from sklearn.ensemble import RandomForestClassifier
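A minimal sketch of a 20-tree forest compared with a single tree. It uses synthetic data (an assumption), so whether the forest wins here does not tell you what happens on the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic numeric data standing in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# n_estimators is the number of decision trees in the forest.
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

# A single tree trained on the same data, for comparison.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

forest_accuracy = forest.score(X_test, y_test)
tree_accuracy = single_tree.score(X_test, y_test)
print("forest test accuracy:      {:.3f}".format(forest_accuracy))
print("single tree test accuracy: {:.3f}".format(tree_accuracy))
```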

In [None]:
# Train other random forest classifiers with different
# hyperparameters (n_estimators, max_features). Can you beat
# the best accuracy you obtained with a single decision tree?
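The same looping approach used for a single tree can be applied to the forest, as sketched below on synthetic data. The candidate values for `n_estimators` and `max_features` are arbitrary choices made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic numeric data standing in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# max_features: "sqrt" uses sqrt(n_features) per split, None uses all
# features, and a float such as 0.5 uses that fraction of the features.
forest_results = {}
for n_estimators in [10, 50, 100]:
    for max_features in ["sqrt", 0.5, None]:
        forest = RandomForestClassifier(
            n_estimators=n_estimators,
            max_features=max_features,
            random_state=0)
        forest.fit(X_train, y_train)
        forest_results[(n_estimators, max_features)] = forest.score(
            X_test, y_test)

best_forest_params = max(forest_results, key=forest_results.get)
print("best (n_estimators, max_features):", best_forest_params)
print("best test accuracy: {:.3f}".format(forest_results[best_forest_params]))
```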