# Practical example using scikit-learn

Scikit-learn is an essential library for implementing classification in Python. Here we will go through some very basic commands, just to get a feel for how things work.

## 1. Let's import the tools we will need to visualize our dataset

In [None]:
#import of libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

### Import the dataset

In [None]:
#load the data
mushrooms = pd.read_csv('assets/mushrooms.csv')

## 2. Let's explore our data

Let's check the shape of our data, that is, its number of rows and columns.

In [None]:
#get the shape of the dataframe
mushrooms.shape

Check that there are no null values.

In [None]:
#find null values in the dataset
print(mushrooms.isnull().sum())
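A small caveat on the check above: `isnull()` only counts real missing values (`None`/`NaN`). Some datasets encode missing entries with a placeholder string instead, which this check will not catch. Here is a minimal sketch on a made-up toy frame; the column names, the values, and the `"?"` placeholder are assumptions for illustration, not facts about `mushrooms.csv`.

```python
import pandas as pd

# Hypothetical toy frame: one real missing value and two "?" placeholders.
toy = pd.DataFrame({
    "cap_shape": ["x", "b", None, "x"],
    "stalk_root": ["e", "?", "c", "?"],
})

# Counts of real missing values per column.
null_counts = toy.isnull().sum()

# Counts of the assumed "?" placeholder per column.
placeholder_counts = (toy == "?").sum()

print(null_counts)
print(placeholder_counts)
```

If a dataset turns out to use such a placeholder, counting it separately like this tells you how many entries are affected before you decide how to handle them.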

### Apply your data visualisation skills to explore and understand the data!

In [None]:
# Add your code to analyse the data here:






## 3. Format the dataset for machine learning
Let's isolate our `X` and `y`:

In [None]:
X = mushrooms.drop('class', axis=1)
y = mushrooms['class']

We will train our model based on data `X` and try to predict the class `y`.

As we can see below, our features are stored as text.

In [None]:
print(X.dtypes)
X.head()

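Our model cannot work with text directly, so let's convert these columns into numeric indicator (dummy) columns with pandas `get_dummies`, a technique known as one-hot encoding. In our case, it's super easy because all our columns are text data. Passing `drop_first=True` drops the first category of each feature, since it is implied by the remaining ones.

So we can grab the names of all the columns and apply `get_dummies` to them.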

In [None]:
# Get the column names
X_columns_name = list(X.columns)
# One-hot encode them
X = pd.get_dummies(X, columns=X_columns_name, prefix=X_columns_name, drop_first=True)
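To see what this encoding does, here is a minimal sketch on a tiny made-up frame. The column names and values are hypothetical and only serve as an illustration; the call mirrors the one above.

```python
import pandas as pd

# Hypothetical toy frame; names and values are made up for illustration.
toy = pd.DataFrame({
    "cap_color": ["red", "brown", "red", "white"],
    "odor": ["none", "foul", "none", "none"],
})

# One indicator column per category, minus the first category of each feature
# (categories are sorted, so "brown" and "foul" are the ones dropped).
toy_encoded = pd.get_dummies(toy, columns=list(toy.columns), drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

Three categories in `cap_color` become two columns and two categories in `odor` become one, so the encoded frame has three columns. A row whose indicators are all zero for a feature belongs to the dropped category.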

In [None]:
X

In [None]:
y

## 4. Split the dataset

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
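The two arguments used above are worth a quick look: `test_size=0.3` keeps 30% of the rows aside for testing, and `random_state=42` makes the shuffle reproducible. A minimal sketch on made-up arrays (the data and the `"e"`/`"p"` labels are placeholders, not the mushroom dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and alternating placeholder labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["e", "p"] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Same inputs and same random_state: the split is identical.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

With ten samples, seven rows go to training and three to testing, and running the split again with the same `random_state` returns exactly the same rows.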

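We also standardize `X` to get better results. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training. You can find more information in the [scikit-learn official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).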

In [None]:
#scale the features
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
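To make the `fit_transform` / `transform` distinction concrete, here is a minimal sketch on made-up numbers (the arrays are placeholders for illustration). The scaler learns the mean and standard deviation from the training rows only, then reuses them on the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test data with two features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the training mean/std

print(scaler.mean_)                                       # per-feature training means
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1. The test row is not guaranteed to have those properties, because it is expressed relative to the training statistics.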

## 5. Fit the model
We give `X_train` and `y_train` to our model so it can learn from our training data.

For this example, we will use a model called [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

But there are many more classification models!

In [None]:
# Import the model
from sklearn.linear_model import LogisticRegression
# Declare an instance of it
classifier = LogisticRegression(solver='lbfgs')
# Fit the model
classifier.fit(X_train, y_train)
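As mentioned, logistic regression is only one option. Scikit-learn classifiers share the same `fit` / `score` interface, so swapping models is mostly a matter of changing one line. The sketch below is only an illustration: it uses synthetic data from `make_classification` rather than the mushroom dataset, and the three models and their parameters are arbitrary picks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mushroom dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Arbitrary selection of classifiers, all exposing fit() and score().
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)                    # same training call for every model
    scores[name] = model.score(X_te, y_te)   # mean accuracy on the held-out data
    print(f"{name}: {scores[name]:.3f}")
```

Which model does best depends on the data, so treat the printed numbers as specific to this synthetic example.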

## 6. Evaluate the model

In [None]:
# Evaluate the model: score() returns the mean accuracy on the test set
classifier.score(X_test, y_test)
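For a classifier, `score` returns the mean accuracy, that is, the fraction of test samples predicted correctly. A single number hides which kind of mistake the model makes, so a confusion matrix is a common complement. Here is a minimal sketch on hand-written labels; the `"e"`/`"p"` values and the predictions are made up for illustration and are not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions.
y_true = np.array(["e", "e", "p", "p", "e", "p"])
y_pred = np.array(["e", "p", "p", "p", "e", "e"])

# Fraction of correct predictions (4 out of 6 here).
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=["e", "p"])

print(acc)
print(cm)
```

The diagonal of the matrix counts correct predictions per class, and the off-diagonal cells show which class was mistaken for which.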

## Conclusion
WOW!!! You achieved the optimal score!! 

Your model is fully trained! 

Yeah, no... It usually does not work that well.

This dataset is exceptionally clean, which is why it was so easy to achieve such a high score. Let's now classify some "dirtier" datasets.

## A final word of encouragement
![dive](https://media2.giphy.com/media/PiQejEf31116URju4V/giphy.gif?cid=ecf05e471j3r2pzkzjka1xpiizye06uoazl22r8lfdcfvic0&rid=giphy.gif)
