# Practical example using scikit-learn

Scikit-learn is a go-to library for implementing classification. Here we will go through some very basic commands, just to get a grasp of how things work.

## 1. Let's import the tools we will need to visualize our dataset

In [1]:
#import of libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

### Import the dataset

In [2]:
#load the data
mushrooms = pd.read_csv('/home/regis/Desktop/ANT-Theano-2-27_regression_classification/4.machine_learning/2.Classification/assets/mushrooms.csv')

## 2. Let's explore our data

Let's check the shape of our data

In [3]:
#get the shape of the dataframe
mushrooms.shape

(8124, 23)

In [4]:
mushrooms.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


Check that there are no null values.

In [5]:
#find null values in the dataset
print(mushrooms.isnull().sum())

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64


In [6]:
mushrooms.dtypes

class                       object
cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object

In [8]:
for col in mushrooms:
    print(col)
    print(f'number of unique values: {mushrooms[col].nunique()}')
    print(f'unique values: {mushrooms[col].unique()}')
    print('==========')

class
number of unique values: 2
unique values: ['p' 'e']
cap-shape
number of unique values: 6
unique values: ['x' 'b' 's' 'f' 'k' 'c']
cap-surface
number of unique values: 4
unique values: ['s' 'y' 'f' 'g']
cap-color
number of unique values: 10
unique values: ['n' 'y' 'w' 'g' 'e' 'p' 'b' 'u' 'c' 'r']
bruises
number of unique values: 2
unique values: ['t' 'f']
odor
number of unique values: 9
unique values: ['p' 'a' 'l' 'n' 'f' 'c' 'y' 's' 'm']
gill-attachment
number of unique values: 2
unique values: ['f' 'a']
gill-spacing
number of unique values: 2
unique values: ['c' 'w']
gill-size
number of unique values: 2
unique values: ['n' 'b']
gill-color
number of unique values: 12
unique values: ['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o']
stalk-shape
number of unique values: 2
unique values: ['e' 't']
stalk-root
number of unique values: 5
unique values: ['e' 'c' 'b' 'r' '?']
stalk-surface-above-ring
number of unique values: 4
unique values: ['s' 'f' 'k' 'y']
stalk-surface-below-ring
[output truncated]
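
Note that `isnull()` only catches real `NaN` values. The `stalk-root` column above contains a `'?'`, which looks like a placeholder for a missing value. Here is a minimal sketch of how such placeholders could be counted, using a small hypothetical frame (`toy`) as a stand-in for `mushrooms`:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the mushrooms dataframe
toy = pd.DataFrame({
    "stalk-root": ["e", "c", "?", "b", "?"],
    "odor": ["p", "a", "l", "n", "f"],
})

# isnull() reports nothing, because '?' is an ordinary string
nan_counts = toy.isnull().sum()

# Count the '?' placeholder per column instead
placeholder_counts = (toy == "?").sum()

print(nan_counts)
print(placeholder_counts)
```

The same comparison on the real dataframe, `(mushrooms == '?').sum()`, would show which columns are affected.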

### Apply your data visualisation skills to explore and understand the data!

In [9]:
# to be checked later
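
One possible starting point, sketched here on a small hypothetical sample (`toy`) rather than the real data, is to cross-tabulate a feature against `class` and see how well it separates the two classes. The choice of `odor` is only an illustration.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e", "p"],
    "odor":  ["p", "a", "l", "p", "n", "f"],
})

# Rows: odor values, columns: class values, cells: number of mushrooms
odor_vs_class = pd.crosstab(toy["odor"], toy["class"])
print(odor_vs_class)
```

The resulting table can be plotted directly, for example with `odor_vs_class.plot(kind='bar')`.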

## 3. Format the dataset for machine learning
Let's isolate our `X` and `y`:

In [11]:
X = mushrooms.drop('class', axis=1)  # drop returns a new dataframe, so mushrooms stays intact
y = mushrooms['class']

We will train our model based on data `X` and try to predict the class `y`.

As we can see below, all of our features are stored as text (`object` dtype).

In [12]:
print(X.dtypes)
X.head()

cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,x,s,n,t,p,f,c,n,k,e,...,s,w,w,p,w,o,p,k,s,u
1,x,s,y,t,a,f,c,b,k,e,...,s,w,w,p,w,o,p,n,n,g
2,b,s,w,t,l,f,c,b,n,e,...,s,w,w,p,w,o,p,n,n,m
3,x,y,w,t,p,f,c,n,n,e,...,s,w,w,p,w,o,p,k,s,u
4,x,s,g,f,n,f,w,b,k,t,...,s,w,w,p,w,o,e,n,a,g


Our model cannot work with text directly, so let's convert these categorical columns to numeric indicator (dummy) variables with pandas `get_dummies`. In our case, it's super easy because all our columns are categorical text.

So we can grab the names of all the columns and apply `get_dummies` to them.

In [15]:
# Get the column names
X_columns_name = list(X.columns)
# One-hot encode them
# get_dummies: convert categorical variables into dummy/indicator variables.
# drop_first=True drops the first level of each column to avoid a redundant indicator.
X = pd.get_dummies(X, columns=X_columns_name, prefix=X_columns_name, drop_first=True)
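
To make the effect of `get_dummies` concrete, here is a small sketch on a hypothetical two-column frame (`toy`, not part of the dataset). With `drop_first=True`, the first level of each column (in sorted order) is dropped, so a column with k distinct values yields k - 1 indicator columns. `dtype=int` is set here only to get 0/1 output regardless of the pandas version.

```python
import pandas as pd

# Hypothetical frame with two categorical columns
toy = pd.DataFrame({
    "bruises": ["t", "f", "t"],
    "cap-surface": ["s", "y", "f"],
})

# One indicator column per level, minus the first level of each column
encoded = pd.get_dummies(toy, columns=list(toy.columns), drop_first=True, dtype=int)
print(encoded.columns.tolist())

# Encoding the already-encoded frame again re-encodes the 0/1 columns
twice = pd.get_dummies(encoded, columns=list(encoded.columns), drop_first=True, dtype=int)
print(twice.columns.tolist())
```

The second call shows what happens when the encoding is run twice on the same frame: each 0/1 column gets a `_1` suffix. This may be why the column names displayed for `X` end in `_1`. Rebuilding `X` from `mushrooms` before re-running the cell avoids it.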

In [16]:
X

Unnamed: 0,cap-shape_c_1,cap-shape_f_1,cap-shape_k_1,cap-shape_s_1,cap-shape_x_1,cap-surface_g_1,cap-surface_s_1,cap-surface_y_1,cap-color_c_1,cap-color_e_1,...,population_n_1,population_s_1,population_v_1,population_y_1,habitat_g_1,habitat_l_1,habitat_m_1,habitat_p_1,habitat_u_1,habitat_w_1
0,0,0,0,0,1,0,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,1,0,1,0,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
3,0,0,0,0,1,0,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
8120,0,0,0,0,1,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,0
8121,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
8122,0,0,1,0,0,0,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0


In [17]:
y

0       p
1       e
2       e
3       p
4       e
       ..
8119    e
8120    e
8121    e
8122    p
8123    e
Name: class, Length: 8124, dtype: object

## 4. Split the dataset

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
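
As a quick sanity check of what `test_size=0.3` does, the sketch below splits a hypothetical 10-row array and inspects the resulting sizes. `random_state` fixes the shuffle so the split is reproducible. `stratify` is an optional argument that is not used in the cell above; it is included here as an assumption about what one might want, since it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, balanced classes
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["p", "e"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# 30% of the rows go to the test set, the rest to the training set
print(X_tr.shape, X_te.shape)
```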

We also standardize `X`, which puts all features on the same scale and generally helps the solver converge. You can find more information in the [official scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

### StandardScaler explained

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
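
The formula can be checked by hand. The sketch below computes `z` with NumPy on a hypothetical one-feature array and compares it with the output of `StandardScaler`. It assumes the default settings (`with_mean=True`, `with_std=True`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 4 samples, 1 feature
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])

u = x_train.mean(axis=0)   # mean of the training samples
s = x_train.std(axis=0)    # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (x_train - u) / s

z_sklearn = StandardScaler().fit_transform(x_train)

# Both computations should agree
print(np.allclose(z_manual, z_sklearn))
```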

In [20]:
#scale the features
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# we only standardize X, not y
X_train = sc.fit_transform(X_train)  # learn mean and std from the training set, then transform it

X_test = sc.transform(X_test)  # reuse the training mean and std; never fit on the test set
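
The reason `fit_transform` is called on the training set and only `transform` on the test set is that the test data must not influence the statistics used for scaling. A small sketch on hypothetical arrays (`train` and `test` are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test sets
train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[2.0], [6.0]])

scaler = StandardScaler()
scaler.fit(train)                     # statistics come from the training data only
test_scaled = scaler.transform(test)  # test data is scaled with those statistics

print(scaler.mean_)         # mean learned from train
print(test_scaled.ravel())  # test values expressed in training-set units
```

A test value equal to the training mean maps to 0, whatever the test set's own mean is.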

## 5. Fit the model
We give `X_train` and `y_train` to our model so it can learn from our training data.

For this example, we will use a model called [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

But there are way more classification models!

### Internet explanation
A classification problem is one in which you try to predict discrete outcomes, such as whether someone has a disease. In contrast, a regression problem is one in which you are trying to predict the value of a continuous variable, such as the sale price of a home. Although logistic regression has regression in its name, it’s an algorithm for classification problems.
Logistic regression is one of the most widely used supervised classification methods. It’s a fast, versatile generalized linear model.
Logistic regression makes an excellent baseline algorithm. It works well when the relationship between the features and the target isn’t too complex.
Logistic regression produces feature weights that are generally interpretable, which makes it especially useful when you need to be able to explain the reasons for a decision. This interpretability often comes in handy, for example with lenders who need to justify their loan decisions.


### solver explained
lbfgs: stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the second derivative matrix updates with gradient evaluations. It stores only the last few updates, so it saves memory. It isn't super fast with large data sets. It has been the default solver since scikit-learn version 0.22.

In [21]:
# Logistic regression is a statistical model that, in its basic form, uses a logistic
# function to model a binary dependent variable.

# Import the model
from sklearn.linear_model import LogisticRegression
# Declare an instance of it
classifier = LogisticRegression(solver='lbfgs')
# Fit the model
classifier.fit(X_train,y_train)

LogisticRegression()
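
Once fitted, the model exposes the learned feature weights and can output class probabilities as well as labels. The sketch below uses a tiny hypothetical dataset (`X_toy`, `y_toy`) where only the first feature is informative. On the real data, the same attributes are available on `classifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: feature 0 determines the class, feature 1 is uninformative
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array(["e", "e", "p", "p"])

clf = LogisticRegression(solver="lbfgs")
clf.fit(X_toy, y_toy)

print(clf.classes_)                  # class labels, in sorted order
print(clf.coef_)                     # one weight per feature, relative to classes_[1]
print(clf.predict([[1, 1]]))         # predicted label for a new sample
print(clf.predict_proba([[1, 1]]))   # probability of each class, in classes_ order
```

A positive weight pushes the prediction towards the second class in `classes_`, and a negative weight towards the first.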

## 6. Evaluate the model

In [22]:
# Evaluate the model
classifier.score(X_test, y_test)

1.0
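
For a classifier, `score` returns the mean accuracy on the data it is given. Accuracy alone does not say which kind of mistake is made, and here mistaking a poisonous mushroom for an edible one is the costly error. A confusion matrix makes this visible. The sketch below uses hypothetical label lists (`y_true`, `y_pred`) with one deliberate mistake, not the output of the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (one edible mushroom predicted as poisonous)
y_true = ["p", "e", "e", "p", "e"]
y_pred = ["p", "e", "p", "p", "e"]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, both in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=["e", "p"])

print(acc)
print(cm)
```

With the real model, the equivalent call would be `confusion_matrix(y_test, classifier.predict(X_test))`.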

## Conclusion
WOW!!! You achieved the optimal score!! 

Your model is fully trained! 

Yeah, no... It usually does not work that well.

This dataset is almost perfectly clean and its features separate the two classes very well, which is why it was so easy to achieve such a high score. Let's now classify "dirtier" datasets.

## A final word of encouragement
![dive](https://media2.giphy.com/media/PiQejEf31116URju4V/giphy.gif?cid=ecf05e471j3r2pzkzjka1xpiizye06uoazl22r8lfdcfvic0&rid=giphy.gif)
