## Scikit-Learn

<div class="alert alert-success">
<a href="https://scikit-learn.org/stable/index.html" class="alert-link">Scikit-Learn</a> is a Python library that provides simple tools for data analysis, including implementations of most common machine learning algorithms.
</div>

<center><img src="../media/ml_map.png" width="1200px"></center>

A typical workflow follows four steps:

1. Encode  
2. Fit  
3. Predict  
4. Score  

#### 1. Encode
We typically don't want strings in our data when training ML models.  
We could use the pandas `DataFrame.replace` method to turn categorical data into numerical data, but it is tedious to keep track of the mapping and change the values back.  
Scikit-Learn provides an encoder and decoder for pre- and post-processing: `preprocessing.LabelEncoder()`.  
Use `transform` to convert categories to numerical values and `inverse_transform` to convert them back to the original labels.

In [9]:
import pandas as pd
df = pd.read_csv('../datasets/liver_dataset.csv', names=['age','gender','tb','db','aap','sgpt','sgot','tp','alb','ag','selector'])
df.dropna(inplace=True)
df.head()

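| | age | gender | tb | db | aap | sgpt | sgot | tp | alb | ag | selector |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |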


In [10]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(df['gender'])
# print(le.classes_)
df['gender'] = le.transform(df['gender'])        # 0 is Female, 1 is Male
le2 = preprocessing.LabelEncoder()
le2.fit(df['selector'])
# print(le2.classes_)
df['selector'] = le2.transform(df['selector'])   # 0 is healthy liver, 1 is unhealthy liver
df.head()

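| | age | gender | tb | db | aap | sgpt | sgot | tp | alb | ag | selector |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | 0 | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 0 |
| 1 | 62 | 1 | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 0 |
| 2 | 62 | 1 | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 0 |
| 3 | 58 | 1 | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 0 |
| 4 | 72 | 1 | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 0 |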
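
The cell above only uses `transform`. As a minimal sketch of the full round trip, the example below encodes a small made-up column (not the liver dataset) and decodes it again with `inverse_transform`. The variable names `genders`, `codes`, and `restored` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration; these values are not taken from the liver dataset
genders = ["Female", "Male", "Male", "Female"]

le = LabelEncoder()
codes = le.fit_transform(genders)        # fit and transform in a single call

# classes_ holds the sorted labels; a label's position is its integer code
print(le.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(restored)
```

Because `classes_` is sorted, printing it is a quick way to check which integer each label received instead of relying on a comment.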


In [11]:
df = df.sample(frac=1)                        # random shuffle our data
row, col = df.shape
split = 0.75                                  # 75% training data 25% testing data
train_data = df.iloc[:int(row*split), :-1]
train_label = df.iloc[:int(row*split), -1:]
test_data = df.iloc[int(row*split):, :-1]
test_label = df.iloc[int(row*split):, -1:]
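
The manual shuffle and slice above can also be written with Scikit-Learn's `train_test_split` helper. The sketch below is one possible equivalent, run on a synthetic stand-in DataFrame (`toy`, with made-up values) rather than the liver data; the `random_state` value is an arbitrary choice to make the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the liver DataFrame: two features, label in the last column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(20, 80, size=100),
    "tb": rng.random(100),
    "selector": rng.integers(0, 2, size=100),
})

X = toy.iloc[:, :-1]     # every column except the last is a feature
y = toy.iloc[:, -1]      # the last column is the label

# Shuffle, then hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the same rows in each split between runs, which makes training and testing scores comparable across experiments.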

#### 2. Fit
For the classifier to learn from our data, the model must first be "fitted" to the data.  
This process trains the model on the training data and its labels.  
`clf.fit(X, y)`

In [12]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0) # don't worry about the parameters just yet
clf.fit(train_data.values, train_label.values.ravel())

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

#### 3. Predict
After the model is fitted, we can use the trained classifier to predict results for a set of samples.  
`predict` returns an array of predicted classes.  
`clf.predict(X)`  

In [13]:
clf.predict(train_data.values)       # pass a NumPy array, matching how the model was fitted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,

In [14]:
clf.predict(test_data.values)        # pass a NumPy array, matching how the model was fitted

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
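
The predictions are integer class codes. As a rough sketch of the post-processing step mentioned in the Encode section, the example below maps predicted codes back to readable labels with `inverse_transform`. It uses synthetic one-feature data and made-up labels (`"healthy"`, `"unhealthy"`), not the liver dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic data: one feature, and a label that depends on a 0.5 threshold
rng = np.random.default_rng(0)
X = rng.random((200, 1))
labels = np.where(X[:, 0] > 0.5, "unhealthy", "healthy")

le = LabelEncoder()
y = le.fit_transform(labels)             # 'healthy' -> 0, 'unhealthy' -> 1

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

X_new = np.array([[0.1], [0.9]])         # two unseen samples, far from the threshold
pred = clf.predict(X_new)                # integer class codes
decoded = le.inverse_transform(pred)     # original string labels
print(pred, decoded)
```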

#### 4. Score
`score` returns the accuracy of the model: the fraction of predictions that match the true labels.  
This is very important because we want to compare the training accuracy with the testing accuracy, and obtaining good testing accuracy is one of the main objectives in ML.  
`clf.score(X, y)`

In [15]:
clf.score(train_data.values, train_label.values.ravel())

0.9746543778801844

In [16]:
clf.score(test_data.values, test_label.values.ravel())     # Test accuracy is not great, but we can improve it by tuning parameters

0.7241379310344828
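
To make the meaning of `score` concrete, the sketch below checks on synthetic data (not the liver dataset) that a classifier's `score` matches both a manual comparison of predictions against true labels and `sklearn.metrics.accuracy_score`. The data and the 150/50 split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: the label depends on the first feature plus some noise
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + 0.3 * rng.standard_normal(200) > 0.5).astype(int)

# Simple split: first 150 rows for training, last 50 for testing
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)                     # built-in accuracy
manual = np.mean(clf.predict(X_test) == y_test)       # fraction of correct predictions
library = accuracy_score(y_test, clf.predict(X_test)) # same metric from sklearn.metrics
print(score, manual, library)
```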