### Classification example with EigenSample
The code below shows how to use the EigenSample module. The dataset can be found [here](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5).

In [54]:
import numpy as np  # used later to concatenate the generated samples
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from EigenSamplerClassifier import *

In [2]:
df = pd.read_csv('titanic.csv')
df.drop(['Name', 'Parch', 'Ticket', 'Cabin', 'PassengerId'], axis = 1, inplace=True)
df.head(5)

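| | Survived | Pclass | Sex | Age | SibSp | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 8.05 | S |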


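#### Feature Engineering
One-hot encode the categorical variables. Because `Pclass` and `SibSp` are numeric, `pd.get_dummies` leaves them unchanged and only expands `Sex` and `Embarked`: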

In [3]:
one_hot = pd.get_dummies(df[['Pclass', 'Sex', 'Embarked', 'SibSp']])
df.drop(['Pclass', 'Sex', 'Embarked', 'SibSp'], axis = 1, inplace=True)
df = df.join(one_hot)
df.head()

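| | Survived | Age | Fare | Pclass | SibSp | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 3 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 26.0 | 7.925 | 3 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 53.1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 0 | 35.0 | 8.05 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |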
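
By default `pd.get_dummies` only expands object, string, and category columns, which is why the numeric `Pclass` and `SibSp` columns pass through unchanged in the output above. The sketch below uses a small made-up frame (the values are illustrative, not rows from the dataset) to show that default, and how the `columns` argument could be used if the integer-coded columns should be encoded as well. This is an optional variation, not what the notebook does.

```python
import pandas as pd

# Toy frame with one integer-coded and one string-valued categorical column
# (illustrative values, not rows from titanic.csv).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Default behaviour: only the string column is expanded.
default_encoded = pd.get_dummies(toy)
print(list(default_encoded.columns))

# Naming the columns explicitly also expands the integer-coded one.
explicit_encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])
print(list(explicit_encoded.columns))
```

Whether to encode `Pclass` as categories or keep it as an ordinal number is a modelling choice; tree-based classifiers such as the ones used here usually work with either.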


In [55]:
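# Drop rows with missing values, then separate the features from the target
df.dropna(inplace=True)
X = df.drop('Survived', axis=1)
y = df[['Survived']]
# Standardize every feature to zero mean and unit variance
X = StandardScaler().fit_transform(X)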

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size=0.3, random_state=42)
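
The cells above scale `X` before splitting, so the test rows also contribute to the scaler's mean and variance. A common alternative is to fit the scaler on the training split only and reuse it for the test split. The sketch below illustrates that ordering on synthetic data; the names `X_demo`, `y_demo`, and `scaler` are placeholders introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (not the Titanic data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # generally not exactly 0
```

For a dataset of this size the effect on the reported metrics is likely small, but this ordering keeps the test set fully unseen during preprocessing.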

We pass an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) to EigenSample for the data augmentation step. Any other scikit-learn classifier can be used instead.

In [59]:
# Generate new points from X_train and y_train
model = AdaBoostClassifier()
augmented_X_train, augmented_y_train = EigenSamplerClassifier(X_train, y_train.values, model, report=True)

EigenSample - Fraction of new labels: 0.41282565130260523




In [60]:
# Concatenate the generated data points with the original training set
X_train_aug = np.concatenate((X_train, augmented_X_train))
y_train_aug = np.concatenate((y_train.values.ravel(), augmented_y_train))

The augmented training set has twice as many instances as the original one:

In [62]:
print(f"Dimensions X_train (row x col): {X_train.shape}")
print(f"Dimensions X_train_aug (row x col): {X_train_aug.shape}")

Dimensions X_train (row x col): (499, 9)
Dimensions X_train_aug (row x col): (998, 9)
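
Besides the shapes, it can be worth checking how the labels are distributed after augmentation. The sketch below uses small hypothetical arrays (`X_orig`, `y_orig`, `X_new`, `y_new` are placeholders, not the notebook's variables) to show the flattening of a single-column label frame and a per-class count with `np.bincount`.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-ins for the original and generated training data.
X_orig = np.zeros((4, 2))
y_orig = pd.DataFrame({"Survived": [0, 1, 0, 1]})  # shape (4, 1), like y_train
X_new = np.ones((4, 2))
y_new = np.array([1, 1, 0, 1])

# Flatten the single-column frame to 1-D before concatenating the labels.
y_flat = y_orig.to_numpy().ravel()

X_aug = np.concatenate((X_orig, X_new))
y_aug = np.concatenate((y_flat, y_new))

print(X_aug.shape, y_aug.shape)
print(np.bincount(y_aug))  # number of samples per class after augmentation
```

A strong shift in the class counts after augmentation would be one possible reason for a change in per-class precision or recall.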


In [65]:
# Train a classifier on the augmented dataset
clf1 = RandomForestClassifier()
clf1.fit(X_train_aug, y_train_aug)
# Predict on the test set
y_pred1 = clf1.predict(X_test)
# Print the test metrics
print("\t Metrics with augmented datasets:")
print(classification_report(y_test, y_pred1))

	 Metrics with augmented datasets:
              precision    recall  f1-score   support

           0       0.79      0.84      0.82       126
           1       0.75      0.69      0.72        89

    accuracy                           0.78       215
   macro avg       0.77      0.76      0.77       215
weighted avg       0.78      0.78      0.77       215



In [66]:
# Train a classifier on the original dataset
clf2 = RandomForestClassifier()
# Pass the labels as a 1-D array to avoid a column-vector warning
clf2.fit(X_train, y_train.values.ravel())
# Predict on the test set
y_pred2 = clf2.predict(X_test)
# Print the test metrics
print("\t Metrics with original datasets:")
print(classification_report(y_test, y_pred2))

	 Metrics with original datasets:
              precision    recall  f1-score   support

           0       0.79      0.83      0.81       126
           1       0.74      0.70      0.72        89

    accuracy                           0.77       215
   macro avg       0.77      0.76      0.76       215
weighted avg       0.77      0.77      0.77       215





#### Conclusion
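Performing data augmentation is a delicate task. In this example the classification metrics of the two models are nearly identical: the model trained on the augmented dataset reaches 0.78 accuracy and 0.77 macro-averaged F1, against 0.77 and 0.76 for the model trained on the original dataset. A gap of 0.01 may well come from the randomness of the unseeded random forest alone, so the generated points neither clearly helped nor clearly hurt here.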

It is always good practice to compare the metrics obtained with the original dataset and with the augmented one. Trying different classifiers inside the EigenSampler module may also improve its performance.
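
Because an unseeded `RandomForestClassifier` gives slightly different results on each run, one way to make the comparison more robust is to repeat the training over several seeds and compare average scores. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification`, and the "augmentation" is a placeholder (noisy copies of the training rows), not EigenSample. The helper `mean_macro_f1` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features (assumption).
X_syn, y_syn = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# Placeholder augmentation: noisy copies of the training rows with their
# original labels. This is NOT EigenSample, only a stand-in for its output.
rng = np.random.default_rng(0)
X_tr_aug = np.concatenate((X_tr, X_tr + rng.normal(scale=0.1, size=X_tr.shape)))
y_tr_aug = np.concatenate((y_tr, y_tr))

def mean_macro_f1(X_fit, y_fit, seeds=range(5)):
    """Average macro F1 on the held-out split over several forest seeds."""
    scores = []
    for seed in seeds:
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X_fit, y_fit)
        scores.append(f1_score(y_te, clf.predict(X_te), average="macro"))
    return np.mean(scores), np.std(scores)

f1_orig = mean_macro_f1(X_tr, y_tr)
f1_aug = mean_macro_f1(X_tr_aug, y_tr_aug)
print(f"original : {f1_orig[0]:.3f} +/- {f1_orig[1]:.3f}")
print(f"augmented: {f1_aug[0]:.3f} +/- {f1_aug[1]:.3f}")
```

If the difference between the two means is smaller than the spread across seeds, the augmentation has probably not changed the model's performance in a meaningful way.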