# Wine chemicals dataset

In this notebook we shall use Python to create some plots and visualisations for the ["Wine" dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html).

The data are the results of a chemical analysis of wines grown in the same
region of Italy by three different cultivators. Thirteen measurements were
taken of different constituents found in the three types of wine.

![](images/wine.jpeg)

## Loading a dataset

The dataset is loaded from the `data/` folder.

In [None]:
import pandas as pd
wine = pd.read_csv('data/wine.csv')
wine.head()

## Pandas

Use Pandas to get a better understanding of the data we will be working with.

1. How many rows and columns are present in the data?

2. Which data types are in each column?

3. Verify that there are no missing values.

4. Rename `'od280/od315_of_diluted_wines'` to something more meaningful, e.g. protein concentration.

5. How many wines are there in each class?

6. What is the median color intensity for class 0?

7. Create a separate DataFrame for each of the different classes.

In [None]:
class_one = wine.loc[wine['class']==0]
class_two = ...
class_three = ...

In [None]:
# %load answers/classes.py
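Questions 1 to 6 above can each be approached with a single Pandas call. The sketch below runs on a tiny hand-made DataFrame rather than on `wine`: the values are invented for illustration, and only the column names are taken from this notebook. The new column name `protein_concentration` simply follows the suggestion in question 4.

```python
import pandas as pd

# Invented values for illustration only; the real data lives in data/wine.csv.
toy = pd.DataFrame({
    "alcohol": [14.2, 13.2, 12.4, 12.3, 13.7, 12.9],
    "color_intensity": [5.6, 4.4, 1.9, 3.3, 7.7, 9.9],
    "od280/od315_of_diluted_wines": [3.9, 3.4, 2.8, 2.9, 1.6, 1.7],
    "class": [0, 0, 1, 1, 2, 2],
})

n_rows, n_cols = toy.shape               # 1. number of rows and columns
column_types = toy.dtypes                # 2. data type of each column
missing_per_column = toy.isna().sum()    # 3. missing values per column

# 4. rename returns a new DataFrame, so assign the result back
toy = toy.rename(
    columns={"od280/od315_of_diluted_wines": "protein_concentration"}
)

class_counts = toy["class"].value_counts()   # 5. number of rows per class

# 6. median of one column, restricted to the rows of one class
median_class_0 = toy.loc[toy["class"] == 0, "color_intensity"].median()

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
print(class_counts)
print(median_class_0)
```

The same calls should work unchanged on `wine`, assuming it has the column names used in this notebook.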

## Pandas plotting

As well as being a great tool for data wrangling, we can also use Pandas to create plots directly from a DataFrame.
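As a warm-up before the exercises, here is a minimal sketch of the `.plot` accessor on a plot type that is not asked for below: a bar chart of how many wines fall in each class. The small DataFrame is invented for illustration; in the notebook you would use `wine` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({"class": [0, 0, 1, 1, 1, 2]})

# value_counts gives one number per class; sort_index orders the bars 0, 1, 2
counts = toy["class"].value_counts().sort_index()

# Series.plot returns a Matplotlib Axes, which can then be customised further
ax = counts.plot(kind="bar", title="Wines per class")
ax.set_xlabel("class")
ax.set_ylabel("number of wines")
```

Other plot types are selected in the same way through the `kind` argument, for example `kind="box"`, `kind="hist"` or `kind="scatter"`.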

a. Create a boxplot for color intensity

In [None]:
# %load answers/pandas_a.py

b. Create a histogram for color intensity

In [None]:
# %load answers/pandas_b.py

c. Create a scatter plot of alcohol against color intensity

In [None]:
# %load answers/pandas_c.py

## Matplotlib

In [None]:
import matplotlib.pyplot as plt

a. Change the style of the plot below so that it has red crosses as markers. Add a title, x label and y label too.

In [None]:
plt.plot(wine['alcohol'], wine['color_intensity'], 'o')

In [None]:
# %load answers/matplotlib_a.py

b. Create a figure with *three* subplots, each containing a scatter plot of color intensity against alcohol percentage for one class. Control the figure size so that the plots are displayed nicely together.

In [None]:
# %load answers/matplotlib_b.py

c. Plot *three* scatterplots of color intensity against alcohol percentage, one for each class, on the same Matplotlib axes.

In [None]:
# %load answers/matplotlib_c.py
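One possible approach to drawing several classes on shared axes is to loop over the groups returned by `groupby` and call `scatter` once per group. This is a sketch on invented data, not necessarily the solution stored in `answers/`; only the column names come from this notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "alcohol": [14.2, 13.2, 12.4, 12.3, 13.7, 12.9],
    "color_intensity": [5.6, 4.4, 1.9, 3.3, 7.7, 9.9],
    "class": [0, 0, 1, 1, 2, 2],
})

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per class, so each class gets its own colour and legend entry
for label, group in toy.groupby("class"):
    ax.scatter(group["alcohol"], group["color_intensity"], label=f"class {label}")

ax.set_xlabel("alcohol")
ax.set_ylabel("color intensity")
ax.legend()
```

For side-by-side panels, `plt.subplots(1, 3, figsize=...)` returns an array of three axes, and the same loop can draw each class on its own axes.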

## Machine Learning

We shall train a Machine Learning model to predict which cultivator each wine comes from, i.e. its class.

Before a Scikit-learn model is trained, a DataFrame is commonly split into a 2-dimensional feature matrix *X*, with one row per sample and one column per feature, and a 1-dimensional target vector *y*.


In [None]:
y = wine['class']
X = wine.drop(['class'], axis=1)
print(X.shape, y.shape)

### Train-test split

We then perform a train-test split so that we can evaluate the trained model on data it has not seen during training.

In [None]:
# %load answers/ML_a.py
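A minimal sketch of such a split is shown below. To keep it self-contained it loads the data with `load_wine` from Scikit-learn instead of reading `data/wine.csv`, on the assumption that the two hold the same data. The `test_size`, `random_state` and `stratify` settings are illustrative choices, not values prescribed by this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Features as a DataFrame, target as a Series (assumed to match data/wine.csv)
X, y = load_wine(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,    # hold out a quarter of the rows for evaluation
    random_state=0,    # fixed seed so the split is reproducible
    stratify=y,        # keep class proportions similar in both parts
)

print(X_train.shape, X_test.shape)
```

Stratifying on `y` is optional, but with only three classes and a small dataset it helps to avoid a test set in which one class is under-represented.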

### Training a model

Using the structure you have seen previously, what is the best performance you can get on this dataset?

- step 1: import a classifier
- step 2: instantiate the classifier
- step 3: fit the model on the training data
- step 4: make predictions on the test data
- step 5: evaluate your predictions

In [None]:
# %load answers/ML_b.py
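The five steps above can be sketched as follows. The random forest is only one of many classifiers you could try and is not claimed to be the best choice for this dataset. The data is loaded with `load_wine` so that the example runs on its own, on the assumption that it matches `data/wine.csv`.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier  # step 1: import a classifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# step 2: instantiate the classifier (hyperparameters here are illustrative)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# step 3: fit the model on the training data
clf.fit(X_train, y_train)

# step 4: make predictions on the test data
y_pred = clf.predict(X_test)

# step 5: evaluate the predictions against the true test labels
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
```

Swapping in another estimator only changes steps 1 and 2; the `fit` and `predict` calls stay the same.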

What other interesting things can you do with your model?

In [None]:
# %load answers/ML_c.py
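The question is open-ended, so the following is only one suggestion and not necessarily what `answers/ML_c.py` contains: inspect which classes get confused with each other, and which features a tree-based model relies on most. As before, `load_wine` stands in for `data/wine.csv` and the random forest is an illustrative choice.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))

# Impurity-based importances, one per feature, labelled with the column names
importances = pd.Series(
    clf.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(cm)
print(importances.head())
```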