# Classification Challenge

Wine experts can identify wines from specific vineyards through smell and taste, but the factors that give different wines their individual characteristics come down to their chemical composition.

In this challenge, you must train a classification model to analyze the chemical and visual features of wine samples and classify them based on their cultivar (grape variety).

> **Citation**: The data used in this exercise was originally collected by Forina, M. et al.
>
> PARVUS - An Extendible Package for Data Exploration, Classification and Correlation.
> Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno,
> 16147 Genoa, Italy.
>
> It can be downloaded from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Explore the data

Run the following cell to load a CSV file of wine data, which consists of 13 numeric features and a classification label with the following classes:

- **0** (*variety A*)
- **1** (*variety B*)
- **2** (*variety C*)

In [16]:
import pandas as pd
import numpy as np

# load the training dataset
data = pd.read_csv('data/wine.csv')
data.sample(10)

Unnamed: 0,Alcohol,Malic_acid,Ash,Alcalinity,Magnesium,Phenols,Flavanoids,Nonflavanoids,Proanthocyanins,Color_intensity,Hue,OD280_315_of_diluted_wines,Proline,WineVariety
8,14.83,1.64,2.17,14.0,97,2.8,2.98,0.29,1.98,5.2,1.08,2.85,1045,0
36,13.28,1.64,2.84,15.5,110,2.6,2.68,0.34,1.36,4.6,1.09,2.78,880,0
37,13.05,1.65,2.55,18.0,98,2.45,2.43,0.29,1.44,4.25,1.12,2.51,1105,0
78,12.33,0.99,1.95,14.8,136,1.9,1.85,0.35,2.76,3.4,1.06,2.31,750,1
119,12.0,3.43,2.0,19.0,87,2.0,1.64,0.37,1.87,1.28,0.93,3.05,564,1
54,13.74,1.67,2.25,16.4,118,2.6,2.9,0.21,1.62,5.85,0.92,3.2,1060,0
90,12.08,1.83,2.32,18.5,81,1.6,1.5,0.52,1.64,2.4,1.08,2.27,480,1
127,11.79,2.13,2.78,28.5,92,2.13,2.24,0.58,1.76,3.0,0.97,2.44,466,1
5,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450,0
114,12.08,1.39,2.5,22.5,84,2.56,2.29,0.43,1.04,2.9,0.93,3.19,385,1
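A first look at data like this usually covers the class balance and how the features differ between classes. The sketch below assumes that scikit-learn's bundled `load_wine` dataset is a reasonable stand-in for `data/wine.csv`; its column names differ from the CSV, and the label column is called `target` rather than `WineVariety`.

```python
from sklearn.datasets import load_wine

# Assumption: load_wine stands in for data/wine.csv (the column names differ)
wine = load_wine(as_frame=True)
frame = wine.frame  # 13 feature columns plus a 'target' label column

# Number of samples per class, ordered by class label
class_counts = frame['target'].value_counts().sort_index()

# Mean of every feature within each class, to spot features that separate the classes
class_means = frame.groupby('target').mean()

print(class_counts)
print(class_means.round(2))
```

Features whose per-class means are far apart are likely to be useful to a classifier.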


Your challenge is to explore the data and train a classification model that achieves an overall *Recall* metric of over 0.95 (95%).
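Recall for a class is the fraction of that class's true samples that the model labels correctly. The sketch below assumes that "overall" recall means the macro average (the unweighted mean of the per-class recalls); the labels are made up for illustration and do not come from the wine data.

```python
from sklearn.metrics import recall_score

# Hypothetical true and predicted labels for six samples (not from the wine data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Per-class recall: correctly predicted samples of a class / all true samples of that class
per_class = recall_score(y_true, y_pred, average=None)

# Macro average: unweighted mean of the per-class recalls
macro = recall_score(y_true, y_pred, average='macro')

print(per_class)
print(macro)
```

Here one of the two class 1 samples is misclassified, so class 1 has a recall of 0.5 and pulls the macro average below the 0.95 target.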

> **Note**: There is no single "correct" solution. A sample solution is provided in [03 - Wine Classification Solution.ipynb](03%20-%20Wine%20Classification%20Solution.ipynb).

## Train and evaluate a model

Add markdown and code cells as required to explore the data, train a model, and evaluate the model's predictive performance.

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer

In [4]:
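# Scale every feature column (all but the last, which holds the label) to the [0, 1] range
column_numbers = list(range(data.shape[1] - 1))
scaler = MinMaxScaler()
# remainder="passthrough" keeps the unscaled label as the last column
data = ColumnTransformer(transformers=[('preprocessing', scaler, column_numbers)], remainder="passthrough").fit_transform(data)
# fit_transform returns a NumPy array, so the column names become integer positions
data = pd.DataFrame(data)
data.head()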

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.842105,0.1917,0.572193,0.257732,0.619565,0.627586,0.57384,0.283019,0.59306,0.372014,0.455285,0.970696,0.561341,0.0
1,0.571053,0.205534,0.417112,0.030928,0.326087,0.575862,0.510549,0.245283,0.274448,0.264505,0.463415,0.78022,0.550642,0.0
2,0.560526,0.320158,0.700535,0.412371,0.336957,0.627586,0.611814,0.320755,0.757098,0.375427,0.447154,0.695971,0.646933,0.0
3,0.878947,0.23913,0.609626,0.319588,0.467391,0.989655,0.664557,0.207547,0.55836,0.556314,0.308943,0.798535,0.857347,0.0
4,0.581579,0.365613,0.807487,0.536082,0.521739,0.627586,0.495781,0.490566,0.444795,0.259386,0.455285,0.608059,0.325963,0.0
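The cell above fits the scaler on every row, before any train/test split. A common variation fits `MinMaxScaler` on the training rows only and reuses it for the test rows, so that the test data has no influence on the scaling. The sketch below assumes scikit-learn's bundled `load_wine` dataset as a stand-in for `data/wine.csv`, and the `stratify=y` argument is an addition that the notebook does not use.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: load_wine stands in for data/wine.csv
X, y = load_wine(return_X_y=True)

# Split first, keeping the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=100, stratify=y
)

train_scaler = MinMaxScaler()
X_train_scaled = train_scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = train_scaler.transform(X_test)        # reuse the training min/max

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.shape)
```

Test values can fall slightly outside the [0, 1] range with this approach, which is expected because the minimum and maximum come from the training rows alone.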


In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score

In [6]:
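# Hold out 15% of the rows as a test set; the features are all columns but the last
x_train, x_test, y_train, y_test = train_test_split(
    data.iloc[:,:-1].to_numpy(),
    data.iloc[:,-1].to_numpy(),
    test_size=0.15,
    random_state=100
)
# 5-fold cross-validation (the KFold default) with shuffling
kfold = KFold(shuffle=True, random_state=100)
# Hyperparameter grid for the random forest search
parameters = {
    'n_estimators': [100, 200, 500, 1000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 50, 100]
}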

In [7]:
# Your code to evaluate data, and train and evaluate a classification model
model = RandomForestClassifier()
# Pass a scorer name: a bare metric function such as recall_score does not match the
# (estimator, X, y) signature that GridSearchCV expects from a callable.
# 'recall_macro' averages recall across the three classes.
grid_searchCV = GridSearchCV(model, param_grid=parameters, scoring='recall_macro', cv=kfold)
grid_searchCV.fit(x_train, y_train)

GridSearchCV(cv=KFold(n_splits=5, random_state=100, shuffle=True),
             estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [10, 20, 50, 100],
                         'n_estimators': [100, 200, 500, 1000]},
             scoring='recall_macro')

In [8]:
best_model = grid_searchCV.best_estimator_
print(best_model)

RandomForestClassifier(max_depth=10)
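Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_` and `best_score_`, which show the winning combination and its mean cross-validated score. A `best_score_` of NaN would suggest that the scorer failed on every fold. The sketch below uses a deliberately small grid and assumes scikit-learn's bundled `load_wine` dataset as a stand-in for the training data.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Assumption: load_wine stands in for the notebook's training data
X, y = load_wine(return_X_y=True)

# A small illustrative grid so the search finishes quickly
small_grid = {'n_estimators': [20, 50], 'max_depth': [5, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=100),
    param_grid=small_grid,
    scoring='recall_macro',  # built-in scorer name for macro-averaged recall
    cv=KFold(shuffle=True, random_state=100),
)
search.fit(X, y)

print(search.best_params_)  # the winning hyperparameter combination
print(search.best_score_)   # its mean macro recall across the folds
```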


In [9]:
y_pred = best_model.predict(x_test)
y_pred

array([1., 2., 0., 1., 2., 2., 1., 1., 1., 1., 2., 1., 2., 2., 2., 0., 2.,
       0., 1., 0., 2., 0., 1., 1., 0., 0., 1.])

In [12]:
y_test

array([1., 2., 0., 1., 2., 2., 1., 1., 1., 1., 2., 1., 2., 2., 2., 0., 2.,
       0., 1., 0., 2., 0., 1., 1., 0., 0., 1.])

In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

1.0

In [10]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00         7
         1.0       1.00      1.00      1.00        11
         2.0       1.00      1.00      1.00         9

    accuracy                           1.00        27
   macro avg       1.00      1.00      1.00        27
weighted avg       1.00      1.00      1.00        27



## Use the model with new data observations

When you're happy with your model's predictive performance, save it and then use it to predict classes for the following two new wine samples:

- \[13.72,1.43,2.5,16.7,108,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285\]
- \[12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520\]
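Saving a fitted scikit-learn model is commonly done with `joblib`. The sketch below is a minimal example under stated assumptions: the file name `wine_model.pkl` is hypothetical, a temporary directory is used to keep the example self-contained, and scikit-learn's bundled `load_wine` dataset stands in for the training data.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: load_wine stands in for the notebook's training data
X, y = load_wine(return_X_y=True)
demo_model = RandomForestClassifier(n_estimators=50, random_state=100).fit(X, y)

# Hypothetical file name inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), 'wine_model.pkl')
joblib.dump(demo_model, model_path)

# Load the model back and check that it predicts the same labels
loaded_model = joblib.load(model_path)
same = bool((loaded_model.predict(X) == demo_model.predict(X)).all())
print(same)
```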


In [17]:
# Your code to predict classes for the two new samples
unobserved_sample_1 = np.array([13.72,1.43,2.5,16.7,108,3.4,3.67,0.19,2.04,6.8,0.89,2.87,1285]).reshape(1,-1)
unobserved_sample_2 = np.array([12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520]).reshape(1,-1)

In [18]:
best_model.predict(unobserved_sample_1)

array([0.])

In [19]:
best_model.predict(unobserved_sample_2)

array([0.])
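The two samples are given in raw measurement units, while the model above was trained on min-max scaled features, so new inputs need the same scaling before prediction. One way to keep the two steps together is a `Pipeline` that holds both the scaler and the classifier. The sketch below assumes scikit-learn's bundled `load_wine` dataset as a stand-in for `data/wine.csv`; the step names `scale` and `classify` are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Assumption: load_wine stands in for data/wine.csv (same 13 features, same order)
X, y = load_wine(return_X_y=True)

# The pipeline applies the fitted scaler automatically before every prediction
wine_pipeline = Pipeline([
    ('scale', MinMaxScaler()),
    ('classify', RandomForestClassifier(n_estimators=100, random_state=100)),
])
wine_pipeline.fit(X, y)

# The two new samples from the challenge, in raw units
new_samples = np.array([
    [13.72, 1.43, 2.5, 16.7, 108, 3.4, 3.67, 0.19, 2.04, 6.8, 0.89, 2.87, 1285],
    [12.37, 0.94, 1.36, 10.6, 88, 1.98, 0.57, 0.28, 0.42, 1.95, 1.05, 1.82, 520],
])
predictions = wine_pipeline.predict(new_samples)
print(predictions)
```

Because the scaler travels with the classifier, saving the pipeline also saves the scaling parameters.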