
### Supervised ML - (Penguin species prediction)

Guillermo Altesor

## Phase 1

## Data readiness

In [3]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc

import matplotlib.pyplot as plt
import seaborn as sns


In [4]:
df = pd.read_csv('penguins.csv')
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
df.isnull().values.any()

True

Removal of **null values**:

In [8]:
df = pd.read_csv('penguins.csv').dropna()
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male
...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male


In [9]:
# This statement helps to check if a DataFrame contains null values in any attribute
df.isnull().values.any()

False

In [10]:
from sklearn.model_selection import train_test_split

X = df.drop('species',axis=1)
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(
                        X, y, test_size=.2, random_state=42)

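Imputation of **missing values** in numeric attributes. Since the rows with nulls were already dropped, the imputer changes nothing here; it is kept so the same transformation can be applied to data that does contain gaps: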

In [None]:
X_train_num = X_train.drop(["island", "sex"], axis=1) # Only numeric attributes to fill in missing values

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="median") # Fill missing values of each numeric attribute with that attribute's median

X_train_num_array = num_imputer.fit_transform(X_train_num)
# The imputer returns a NumPy array, so the DataFrame (column names and index) is rebuilt
X_train_num = pd.DataFrame(X_train_num_array, columns=X_train_num.columns, index=X_train_num.index)
X_train_num.head()


Handling the categorical attributes *island* and *sex* using **OneHotEncoder**:

In [12]:
X_train[["island", 'sex']].head()

Unnamed: 0,island,sex
232,Biscoe,female
84,Dream,female
306,Dream,female
22,Biscoe,female
29,Biscoe,male


In [13]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

encoder.fit(X_train[["island", 'sex']])
X_train_encoded = encoder.transform(X_train[["island",'sex']]).toarray()

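# get_feature_names_out replaces the older get_feature_names, which recent scikit-learn versions no longer provide
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out(["island", "sex"]))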

X_train_encoded_df.head()



Unnamed: 0,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
0,1.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,1.0


Once the **numeric attributes** and the **categorical attributes** have been processed separately, we combine them into a new version of the training data. The DataFrame built from the *OneHotEncoder* output gets a fresh 0..n-1 index, while the numeric DataFrame keeps the original row labels, so **both indices must be reset** before merging:

In [14]:
X_train_num.reset_index(drop=True, inplace=True) # Reset the row index so both DataFrames line up when merged

X_train_encoded_df.reset_index(drop=True, inplace=True) # Reset the row index so both DataFrames line up when merged

X_train_prepared = pd.concat([X_train_num, X_train_encoded_df], axis=1) # Merge all necessary attributes

X_train_prepared

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
0,49.1,14.5,212.0,4625.0,1.0,0.0,0.0,1.0,0.0
1,37.3,17.8,191.0,3350.0,0.0,1.0,0.0,1.0,0.0
2,40.9,16.6,187.0,3200.0,0.0,1.0,0.0,1.0,0.0
3,35.9,19.2,189.0,3800.0,1.0,0.0,0.0,1.0,0.0
4,40.5,18.9,180.0,3950.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
261,49.6,15.0,216.0,4750.0,1.0,0.0,0.0,0.0,1.0
262,37.2,19.4,184.0,3900.0,0.0,0.0,1.0,0.0,1.0
263,39.7,17.7,193.0,3200.0,1.0,0.0,0.0,1.0,0.0
264,45.2,17.8,198.0,3950.0,0.0,1.0,0.0,1.0,0.0
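
The need for the index reset can be illustrated on a tiny example. This is a minimal sketch with made-up frames (`num` and `enc` are hypothetical names, not the notebook's variables): `pd.concat` with `axis=1` aligns rows by index label, so mismatched indices produce extra rows filled with NaN.

```python
import pandas as pd

# Hypothetical toy frames: `num` keeps shuffled row labels (as after train_test_split),
# `enc` has a fresh 0..n-1 index (as when built from a NumPy array)
num = pd.DataFrame({"bill_length_mm": [49.1, 37.3, 40.9]}, index=[232, 84, 306])
enc = pd.DataFrame({"sex_female": [1.0, 1.0, 0.0]})

# Without a reset, rows are matched by label: no label is shared, so NaNs appear
misaligned = pd.concat([num, enc], axis=1)

# After resetting the index, rows are matched by position
aligned = pd.concat([num.reset_index(drop=True), enc], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

An alternative that avoids the reset is to build the encoded DataFrame with `index=X_train.index`, so both parts share the original labels.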


To make predictions on the **test data** we will need to apply the same sequence of transformations that we originally applied to the training data.

On the **test** data we reuse the *OneHotEncoder* already fitted on the training data and call its *transform()* method directly (the same applies to the imputer). *fit()* must not be called again, because the data should be ***transformed as learned from the training data***.

In [15]:
X_test_encoded = encoder.transform(X_test[["island",'sex']]).toarray() # Apply OneHotEncoder transformer

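# get_feature_names_out replaces the older get_feature_names, which recent scikit-learn versions no longer provide
X_test_encoded_df = pd.DataFrame(X_test_encoded,
                                 columns=encoder.get_feature_names_out(["island", "sex"]))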

X_test_num = X_test.drop(["island","sex"], axis=1)

X_test_num_array = num_imputer.transform(X_test_num) # Apply transformer to impute missing numeric values
X_test_num = pd.DataFrame(X_test_num_array, columns=X_test_num.columns, index=X_test_num.index) 
# When applying an imputation the DataFrame structure is lost, but it can be recreated

# Reset indexes on numeric and binary attributes (derived from the categorical attribute), before merging them.
X_test_num.reset_index(drop=True, inplace=True)
X_test_encoded_df.reset_index(drop=True, inplace=True)

X_test_prepared = pd.concat([X_test_num,X_test_encoded_df], axis=1) # Merge all necessary attributes
X_test_prepared



Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
0,39.5,16.7,178.0,3250.0,0.0,1.0,0.0,1.0,0.0
1,50.9,17.9,196.0,3675.0,0.0,1.0,0.0,1.0,0.0
2,42.1,19.1,195.0,4000.0,0.0,0.0,1.0,0.0,1.0
3,46.6,14.2,210.0,4850.0,1.0,0.0,0.0,1.0,0.0
4,41.1,18.2,192.0,4050.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
62,50.5,19.6,201.0,4050.0,0.0,1.0,0.0,0.0,1.0
63,36.7,19.3,193.0,3450.0,0.0,0.0,1.0,1.0,0.0
64,35.1,19.4,193.0,4200.0,0.0,0.0,1.0,0.0,1.0
65,50.1,17.9,190.0,3400.0,0.0,1.0,0.0,1.0,0.0
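
The same fit-on-train, transform-on-test sequence can also be bundled in a single object. The following is a minimal sketch, not the notebook's method: it uses scikit-learn's `ColumnTransformer` on made-up data (the `train` and `test` frames are hypothetical), with `sparse_threshold=0` so the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing numeric value in each set
train = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 46.5, 50.0],
    "island": ["Torgersen", "Dream", "Biscoe", "Dream"],
    "sex": ["male", "female", "female", "male"],
})
test = pd.DataFrame({
    "bill_length_mm": [np.nan, 41.0],
    "island": ["Biscoe", "Dream"],
    "sex": ["male", "female"],
})

preprocess = ColumnTransformer(
    [
        ("num", SimpleImputer(strategy="median"), ["bill_length_mm"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["island", "sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics (median, categories) are learned from the training data only
train_prepared = preprocess.fit_transform(train)
# The test data is transformed with what was learned above; fit is not called again
test_prepared = preprocess.transform(test)

print(train_prepared.shape, test_prepared.shape)
```

The missing test value is filled with the training median (46.5), not with a statistic computed from the test set.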


## SOFTMAX

In *scikit-learn*, a **softmax regression model** is trained with the same logistic regression class, *LogisticRegression*, with two simple settings:

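1. Setting the *multi_class="multinomial"* hyperparameter (recent scikit-learn releases deprecate this parameter and use the multinomial formulation by default for multiclass problems).
2. Using an underlying optimization approach (*solver*) compatible with softmax regression, for example "*lbfgs*" (the default).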



In [16]:
from sklearn.linear_model import LogisticRegression

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
softmax_reg.fit(X_train_prepared, y_train)

LogisticRegression(max_iter=1000, multi_class='multinomial')

In [17]:
X_test_prepared.describe()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_Biscoe,island_Dream,island_Torgersen,sex_female,sex_male
count,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0,67.0
mean,43.649254,17.531343,199.208955,4105.597015,0.41791,0.373134,0.208955,0.537313,0.462687
std,5.685958,1.941197,12.911749,779.056112,0.496938,0.487288,0.409631,0.502369,0.502369
min,34.6,13.2,178.0,2900.0,0.0,0.0,0.0,0.0,0.0
25%,38.4,16.25,190.0,3475.0,0.0,0.0,0.0,0.0,0.0
50%,42.9,17.9,196.0,3900.0,0.0,0.0,0.0,1.0,0.0
75%,49.05,18.95,210.0,4475.0,1.0,1.0,0.0,1.0,1.0
max,54.2,21.1,230.0,6300.0,1.0,1.0,1.0,1.0,1.0


In [18]:
mean = {'bill_length_mm': [43.649254],
        'bill_depth_mm': [17.531343],
        'flipper_length_mm': [199.208955],
        'body_mass_g': [4105.597015],
        'island_Biscoe': [0.417910],
        'island_Dream': [0.373134],
        'island_Torgersen': [0.208955],
        'sex_female':[0.537313],
        'sex_male':[0.462687]}

mean = pd.DataFrame(mean)

softmax_reg.predict(mean)


array(['Adelie'], dtype=object)

In [19]:
softmax_reg.predict_proba(mean)


array([[0.88941569, 0.03811566, 0.07246865]])
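
The probabilities above are the softmax of the per-class linear scores. A minimal sketch of that relationship follows, using the iris dataset bundled with scikit-learn as stand-in data (an assumption, not the penguin data) and assuming a scikit-learn version in which the default *lbfgs* solver fits a multinomial model for multiclass targets. The `softmax` helper is written here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Stand-in data (assumption): iris has three classes, like the penguin species
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One linear score per class for each sample
scores = clf.decision_function(X[:5])
manual_proba = softmax(scores)

# Should agree with the probabilities reported by the model
print(np.allclose(manual_proba, clf.predict_proba(X[:5])))
```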

In [20]:
# Accuracy on the training set (not on the test set)
score = softmax_reg.score(X_train_prepared, y_train)
score

0.9962406015037594

## SVM (Support Vector Classification)

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC # SVC: Support Vector Classification
from sklearn.metrics import accuracy_score

In [22]:
svm_classifier = Pipeline([
                           ("scaler", StandardScaler()),
                           ("linear_svc", LinearSVC(C=0.1, loss="hinge", max_iter=10000)),
])
# Define a simple sequence of actions (pipeline), consisting of training an SVM model preceded by scaling the attributes.

svm_classifier.fit(X_train_prepared, y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=0.1, loss='hinge', max_iter=10000))])

In [23]:
y_pred = svm_classifier.predict(X_test_prepared)
accuracy_score(y_test, y_pred)

1.0

Both SVM and softmax regression give excellent results. Much of this comes from the data preparation steps:
* Removing null values
* Imputing missing values
* Encoding categorical variables with OneHotEncoder
* Resetting indexes
* Merging the data with pandas

Note that the softmax score (0.996) was measured on the training set, while the SVM accuracy (1.0) was measured on the test set, so the two figures are not directly comparable. Based on the numbers available, we would choose SVM, which reaches 100% accuracy on the test set.
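
A single train/test split gives only one estimate per model. One way to compare the two classifiers on equal terms is cross-validation. The sketch below is an illustration under stated assumptions: it uses the bundled iris dataset as stand-in data rather than the prepared penguin features, and the `models` dictionary is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Stand-in data (assumption): iris replaces the prepared penguin features
X, y = load_iris(return_X_y=True)

models = {
    "softmax": LogisticRegression(max_iter=1000),
    "linear_svm": Pipeline([
        ("scaler", StandardScaler()),
        # dual=True is stated explicitly because the hinge loss requires the dual formulation
        ("linear_svc", LinearSVC(C=0.1, loss="hinge", dual=True, max_iter=10000)),
    ]),
}

# Mean accuracy over 5 folds for each model, all evaluated on the same folds
cv_means = {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}

for name, mean_acc in cv_means.items():
    print(f"{name}: {mean_acc:.3f}")
```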


# Phase 2: Decision Trees

In [24]:
from sklearn.model_selection import train_test_split

X = df.select_dtypes(exclude=['object'])
y = df.species

X_train, X_test, y_train, y_test = train_test_split(
                        X, y, test_size=.2, random_state=42)

In [25]:
from sklearn.tree import DecisionTreeClassifier
tree_1_noHip = DecisionTreeClassifier(random_state=42)
tree_1_noHip.fit(X_train, y_train)
tree_2_split = DecisionTreeClassifier(min_samples_split=12, random_state=42)
tree_2_split.fit(X_train, y_train)
tree_3_leaf = DecisionTreeClassifier(min_samples_leaf=6, random_state=42)
tree_3_leaf.fit(X_train, y_train)
tree_4_features = DecisionTreeClassifier(max_features=2, random_state=42)
tree_4_features.fit(X_train, y_train)
tree_5_depth = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_5_depth.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3, random_state=42)

### Decision Tree 1 

In [26]:
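from sklearn.metrics import accuracy_score
# Training accuracy: computed but not displayed, since a cell only shows its last expression
y_pred1_tr = tree_1_noHip.predict(X_train)
accuracy_score(y_train, y_pred1_tr)
# Test accuracy: this is the value displayed below
y_pred1 = tree_1_noHip.predict(X_test)
accuracy_score(y_test, y_pred1)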

1.0

### Decision Tree 2

In [27]:
y_pred2_tr = tree_2_split.predict(X_train)
accuracy_score(y_pred2_tr, y_train)
y_pred2 = tree_2_split.predict(X_test)
accuracy_score(y_pred2, y_test)

0.9701492537313433

### Decision Tree 3

In [28]:
y_pred3_tr = tree_3_leaf.predict(X_train)
accuracy_score(y_pred3_tr, y_train)
y_pred3 = tree_3_leaf.predict(X_test)
accuracy_score(y_pred3, y_test)


0.9701492537313433

### Decision Tree 4

In [29]:
y_pred4_tr = tree_4_features.predict(X_train)
accuracy_score(y_pred4_tr, y_train)
y_pred4 = tree_4_features.predict(X_test)
accuracy_score(y_pred4, y_test)

0.9402985074626866

### Decision Tree 5

In [30]:
y_pred5_tr = tree_5_depth.predict(X_train)
accuracy_score(y_pred5_tr, y_train)
y_pred5 = tree_5_depth.predict(X_test)
accuracy_score(y_pred5, y_test)

0.9701492537313433

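The trees constrained by *max_depth*, *min_samples_leaf* and *min_samples_split* all reach the same test accuracy (0.970), which suggests that these constraints lead to trees that classify the test set in much the same way. The tree limited with *max_features* scores lowest (0.940). The Decision Tree with the best test accuracy (1.0) is the one trained with default hyperparameters.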
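
Each tree cell above computes a training accuracy that is never shown, because a notebook cell only displays its last expression. A minimal sketch of reporting both figures side by side follows. It uses the bundled iris dataset as stand-in data (an assumption), and the `settings` and `results` dictionaries are hypothetical helpers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces the numeric penguin features
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Same hyperparameter variations as in the five trees above
settings = {
    "default": {},
    "min_samples_split=12": {"min_samples_split": 12},
    "min_samples_leaf=6": {"min_samples_leaf": 6},
    "max_features=2": {"max_features": 2},
    "max_depth=3": {"max_depth": 3},
}

results = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each configuration
    results[name] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"{name}: train={results[name][0]:.3f} test={results[name][1]:.3f}")
```

A large gap between the two numbers for a given tree would point to overfitting, which is what the constraints are meant to limit.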

# Phase 3:

scikit-learn ensemble methods (those marked with (x) are applied below):

* Bagging Classifier (x)
* AdaBoost Classifier
* Gradient Boosting Classifier 
* Random Forest Classifier (x)

## BaggingClassifier

In [55]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)

bagging_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(), max_samples=100,
                  n_estimators=500, n_jobs=-1)

In [56]:
y_pred10 = bagging_clf.predict(X_test[:10])
y_pred10

array(['Adelie', 'Chinstrap', 'Adelie', 'Gentoo', 'Adelie', 'Chinstrap',
       'Chinstrap', 'Gentoo', 'Gentoo', 'Gentoo'], dtype=object)

In [33]:
y_test[:10]

30        Adelie
320    Chinstrap
79        Adelie
202       Gentoo
63        Adelie
307    Chinstrap
292    Chinstrap
187       Gentoo
219       Gentoo
204       Gentoo
Name: species, dtype: object

In [34]:
from sklearn.metrics import confusion_matrix

y_pred = bagging_clf.predict(X_test)

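# Note: with the arguments passed as (y_pred, y_test), rows are predicted labels and columns are true labels;
# scikit-learn's documented order is confusion_matrix(y_true, y_pred)
confusion_matrix(y_pred, y_test)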

array([[31,  2,  0],
       [ 0, 16,  0],
       [ 0,  0, 18]])
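
The call above passes the predictions first, whereas scikit-learn documents the order as `confusion_matrix(y_true, y_pred)`. Swapping the arguments transposes the matrix, which changes how the off-diagonal cells are read. A minimal sketch with made-up labels (the `y_true` and `y_pred` lists are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: one Chinstrap is misclassified as Adelie
y_true = ["Adelie", "Chinstrap", "Chinstrap", "Gentoo"]
y_pred = ["Adelie", "Adelie", "Chinstrap", "Gentoo"]

# Documented order: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)

# Swapped order: rows are predicted labels, columns are true labels
swapped = confusion_matrix(y_pred, y_true)

print(cm)
print(swapped)
```

In the documented order the error appears in the Chinstrap row and Adelie column. In the swapped order it appears in the Adelie row and Chinstrap column, which is the layout of the matrix printed above.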

In [35]:
from sklearn.metrics import accuracy_score, classification_report

accuracy_score(y_pred, y_test)

0.9701492537313433

## Random Forest Classifier

In [41]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male


In [49]:
y_train.head()

232       Gentoo
84        Adelie
306    Chinstrap
22        Adelie
29        Adelie
Name: species, dtype: object

In [52]:
from sklearn.ensemble import RandomForestClassifier

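rf_clf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
# X_train_prepared comes from Phase 1; its rows match y_train because both splits use the same data, test_size and random_state
rf_clf.fit(X_train_prepared, y_train)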

RandomForestClassifier(n_estimators=50, n_jobs=-1)

In [60]:
from sklearn.metrics import accuracy_score, confusion_matrix

predictions_tr = rf_clf.predict(X_test_prepared)
accuracy_score(y_test, predictions_tr)

1.0

Several models scored well throughout the exercise. The ones that reached perfect accuracy on the test set were:
- Decision Tree 1
- RandomForest
- SVM