### Classification example with EigenSample
The code below shows how to use the EigenSample module. The dataset can be found [here](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5).

In [18]:
import pandas as pd
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from EigenSamplerClassifier import *

In [2]:
df = pd.read_csv('titanic.csv')
df.drop(['Name', 'Parch', 'Ticket', 'Cabin', 'PassengerId'], axis = 1, inplace=True)
df.head(5)

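|   | Survived | Pclass | Sex | Age | SibSp | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.05 | S |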


###### Feature Engineering
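One-hot encode the categorical variables. By default `pd.get_dummies` only expands string-typed columns, so `Sex` and `Embarked` are encoded while the integer columns `Pclass` and `SibSp` pass through unchanged: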

In [3]:
one_hot = pd.get_dummies(df[['Pclass', 'Sex', 'Embarked', 'SibSp']])
df.drop(['Pclass', 'Sex', 'Embarked', 'SibSp'], axis = 1, inplace=True)
df = df.join(one_hot)
df.head()

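|   | Survived | Age | Fare | Pclass | SibSp | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 3 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 3 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |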
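
In the output above, `Pclass` and `SibSp` are still plain integer columns. If they should be treated as categories too, one option is the `columns=` argument of `pd.get_dummies`. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data) to contrast the two behaviours:

```python
import pandas as pd

# Small made-up frame mirroring the column types used in this example.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],                   # integer-coded category
    "Sex": ["male", "female", "female"],   # string category
})

# Default behaviour: only string/category columns are expanded,
# so the integer column Pclass is left as it is.
default_encoded = pd.get_dummies(toy)

# Listing the columns explicitly forces the integer column to be encoded too.
forced_encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])

print(list(default_encoded.columns))
print(list(forced_encoded.columns))
```

Whether encoding `Pclass` this way helps depends on the model; tree-based classifiers such as the random forest used here usually cope well with integer-coded classes.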


In [12]:
df.dropna(inplace = True)
X = df.drop('Survived', axis = 1)
y = df[['Survived']]

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size=0.3, random_state=42)

We pass an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) to the sampler for the data augmentation step. Any other scikit-learn classifier can be used instead.

In [30]:
# Here we perform data augmentation on the X_train and y_train datasets
model = AdaBoostClassifier()
augmented_X_train, augmented_y_train = EigenSamplerClassifier(X_train, y_train.values, model)



In [31]:
# Classification on the augmented dataset
clf1 = RandomForestClassifier()
clf1.fit(augmented_X_train, augmented_y_train)
y_pred1 = clf1.predict(X_test)
# Print the test-set metrics
print(classification_report(y_test, y_pred1))

              precision    recall  f1-score   support

           0       0.50      0.34      0.41       126
           1       0.36      0.52      0.42        89

    accuracy                           0.41       215
   macro avg       0.43      0.43      0.41       215
weighted avg       0.44      0.41      0.41       215



In [32]:
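# Classification on the original dataset
clf2 = RandomForestClassifier()
# ravel() turns the single-column y_train DataFrame into the 1-D array scikit-learn expects
clf2.fit(X_train, y_train.values.ravel())
y_pred2 = clf2.predict(X_test)
# Print the test-set metrics
print(classification_report(y_test, y_pred2))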

              precision    recall  f1-score   support

           0       0.79      0.84      0.81       126
           1       0.75      0.67      0.71        89

    accuracy                           0.77       215
   macro avg       0.77      0.76      0.76       215
weighted avg       0.77      0.77      0.77       215





Performing data augmentation is a delicate task. In this example, the classification metrics show that the synthetic samples add a lot of noise: test accuracy falls from 0.77 with the original training data to 0.41 with the augmented data.
Hence, it is always good to compare the metrics obtained with the original dataset against those obtained with the augmented one.
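
One way to make that comparison routine is to wrap it in a small helper that fits the same classifier on both training sets and scores them on the same test set. The sketch below is only an illustration: `compare_augmentation` is a hypothetical helper, the data comes from `make_classification` rather than the Titanic file, and the "augmentation" is a simple Gaussian jitter used as a stand-in, not the EigenSample procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_augmentation(X_train, y_train, X_aug, y_aug, X_test, y_test, seed=0):
    """Fit the same classifier on both training sets and return test accuracies."""
    training_sets = {
        "original": (X_train, y_train),
        "augmented": (X_aug, y_aug),
    }
    scores = {}
    for name, (X_fit, y_fit) in training_sets.items():
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X_fit, np.ravel(y_fit))  # ravel() accepts column-vector labels
        scores[name] = accuracy_score(np.ravel(y_test), clf.predict(X_test))
    return scores


# Synthetic stand-in data (assumption: not the dataset used in this example).
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Hypothetical augmentation: append jittered copies of the training rows.
rng = np.random.default_rng(0)
X_aug = np.vstack([X_train, X_train + rng.normal(scale=0.1, size=X_train.shape)])
y_aug = np.concatenate([y_train, y_train])

scores = compare_augmentation(X_train, y_train, X_aug, y_aug, X_test, y_test)
print(scores)
```

With the notebook's own variables, the call would be `compare_augmentation(X_train, y_train, augmented_X_train, augmented_y_train, X_test, y_test)`. A large gap between the two scores is a signal to inspect the augmented samples before training on them.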