## Active Learning

Download the Titanic dataset here: https://drive.google.com/file/d/0Bz9_0VdXvv9bbVhpOEMwUDJ2elU/view?usp=sharing

In this exercise, we will simulate active learning. We hold out a small sample of observations for testing and track how the quality of the model improves when active learning chooses which observations to label next.

In [1]:
# Load the Data into variable df

In [109]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [125]:
df = pd.read_csv('titanic_dataset.csv')

In [126]:
# TEST SAMPLE
# USE THIS SAMPLE ONLY FOR TESTING
test_df = df.sample(n=100, random_state=42)
# KEEP ONLY THOSE WHO ARE NOT IN THE TEST SET
df = df[~df.PassengerId.isin(test_df.PassengerId.tolist())]

In [127]:
# FIT THE FIRST MODEL ONLY ON THE DATAFRAME START_DF
start_df = df.sample(n=100, random_state=42)
# DROP OBS FROM START_DF FROM DF
df = df[~df.PassengerId.isin(start_df.PassengerId.tolist())]
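
Because the test sample, the starting sample and the remaining pool are all carved out of the same DataFrame, it can be worth checking that they do not overlap. The following is a minimal sketch on a toy frame; the `toy` data, the sample sizes and the variable names are made up for illustration, and only the `PassengerId`-based filtering mirrors the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic frame: 30 rows with unique PassengerId values.
toy = pd.DataFrame({"PassengerId": range(1, 31), "Survived": [0, 1] * 15})

# Same three-way split as above, with smaller sample sizes.
toy_test = toy.sample(n=5, random_state=42)
pool = toy[~toy.PassengerId.isin(toy_test.PassengerId)]
toy_start = pool.sample(n=5, random_state=42)
pool = pool[~pool.PassengerId.isin(toy_start.PassengerId)]

# The three parts should be pairwise disjoint and together cover every row.
ids_test = set(toy_test.PassengerId)
ids_start = set(toy_start.PassengerId)
ids_pool = set(pool.PassengerId)
no_overlap = not (ids_test & ids_start or ids_test & ids_pool or ids_start & ids_pool)
covers_all = len(ids_test) + len(ids_start) + len(ids_pool) == len(toy)
print(no_overlap, covers_all)
```

If either flag came out `False`, test rows would be leaking into training and the evaluation would look better than it should.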

### Tasks

1. Fit the first model only on **start_df** using **SVM** and evaluate accuracy, precision and recall on **test_df**.
2. In each iteration, add 10 observations from **df** to your training set, choosing them with an active learning approach:
    - score all observations in df and take the 10 the model is least sure about, i.e. those whose predicted probability of surviving is closest to 50%
3. Refit the model and evaluate on **test_df** again.
4. The goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration.
5. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.
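
The tasks above can be sketched end to end as a loop. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the Titanic file, plain NumPy arrays instead of DataFrames, five iterations, and default `SVC` settings. None of those choices come from the exercise itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic data (an assumption made for this sketch).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_test, y_test = X[:100], y[:100]            # held-out test sample
X_train, y_train = X[100:200], y[100:200]    # plays the role of start_df
X_pool, y_pool = X[200:], y[200:]            # plays the role of the remaining df

history = []
for iteration in range(5):
    # Refit on the current training set and evaluate on the fixed test sample.
    model = SVC(probability=True, random_state=0).fit(X_train, y_train)
    preds = model.predict(X_test)
    history.append({
        "iteration": iteration,
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds),
        "recall": recall_score(y_test, preds),
    })

    # Uncertainty sampling: pick the 10 pool rows whose P(class 1) is closest to 0.5.
    p1 = model.predict_proba(X_pool)[:, 1]
    picked = np.argsort(np.abs(p1 - 0.5))[:10]

    # Move the picked rows from the pool into the training set.
    X_train = np.vstack([X_train, X_pool[picked]])
    y_train = np.concatenate([y_train, y_pool[picked]])
    X_pool = np.delete(X_pool, picked, axis=0)
    y_pool = np.delete(y_pool, picked)

for row in history:
    print(row)
```

Each entry in `history` holds one point per metric, so plotting `iteration` against `accuracy`, `precision` and `recall` gives the curves the last task asks for.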

In [113]:
y = start_df.Survived
X = start_df.drop(['Survived'], axis=1)

In [114]:
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [115]:
my_cols

['Sex',
 'Cabin',
 'Embarked',
 'PassengerId',
 'Pclass',
 'Age',
 'SibSp',
 'Parch',
 'Fare']

In [116]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [172]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

model = RandomForestClassifier()
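
The tasks ask for an SVM, while the cell above fits a random forest. A minimal sketch of an SVM-based pipeline is shown below. The toy frame, the `svm_`-prefixed names, the median imputation and the `StandardScaler` step are assumptions for illustration rather than part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Tiny made-up frame with Titanic-like columns, for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"] * 3,
    "Age": [22.0, 38.0, None, 35.0, 27.0, 54.0, 2.0, 14.0] * 3,
    "Fare": [7.25, 71.28, 7.92, 8.05, 11.13, 51.86, 21.08, 30.07] * 3,
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1] * 3,
})
X_toy, y_toy = toy.drop(columns="Survived"), toy["Survived"]

svm_preprocessor = ColumnTransformer(transformers=[
    # SVMs are sensitive to feature scale, so numeric columns are standardized here.
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

svm_pipeline = Pipeline(steps=[
    ("preprocessor", svm_preprocessor),
    # probability=True enables predict_proba, which uncertainty sampling relies on.
    ("model", SVC(probability=True, random_state=0)),
])
svm_pipeline.fit(X_toy, y_toy)
proba = svm_pipeline.predict_proba(X_toy)
print(proba.shape)
```

Note that `SVC` has no `feature_importances_` attribute, so the feature-importance inspection further down would not carry over to this model.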

In [173]:
from sklearn.metrics import mean_absolute_error, f1_score, accuracy_score, roc_auc_score, classification_report

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = f1_score(y_valid, preds)
print('f1_score:', score)
print(classification_report(y_valid, preds))

f1_score: 0.8181818181818182
              precision    recall  f1-score   support

           0       0.78      0.78      0.78         9
           1       0.82      0.82      0.82        11

    accuracy                           0.80        20
   macro avg       0.80      0.80      0.80        20
weighted avg       0.80      0.80      0.80        20



In [175]:
my_pipeline.named_steps['model'].feature_importances_

array([0.1586335 , 0.04325159, 0.10075011, 0.05318166, 0.05493281,
       0.17610446, 0.11483511, 0.16138958, 0.04605794, 0.00056524,
       0.00483148, 0.00211161, 0.00361104, 0.00104185, 0.00400778,
       0.00245905, 0.00239306, 0.02641228, 0.01216763, 0.03126221])

In [64]:
preprocessor.fit_transform(X_train).shape

(80, 20)

In [153]:
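# Unlabeled pool: the remaining rows of df with the target dropped
# (despite its name, df_test is not the held-out test_df)
df_test = df.drop(['Survived'], axis=1)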

In [155]:
df_test = df_test[['Sex',
 'Cabin',
 'Embarked',
 'PassengerId',
 'Pclass',
 'Age',
 'SibSp',
 'Parch',
 'Fare']]

In [156]:
df_test_preds = my_pipeline.predict_proba(df_test)

In [132]:
# Column 0 is P(Survived = 0); for a binary target it is as far from 0.5 as P(Survived = 1)
df_preds = df_test_preds[:,0]

In [133]:
import numpy as np
def find_nearest(array, value):
    """Return the positions of array, ordered by increasing distance from value."""
    array = np.asarray(array)
    idx = (np.abs(array - value)).argsort()

    return idx
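
To make the ordering concrete, here is a small worked example of the same idea. The probabilities are made up, and the function is restated so that the block runs on its own.

```python
import numpy as np

def find_nearest(array, value):
    # Positions ordered from closest to farthest from `value`.
    array = np.asarray(array)
    return np.abs(array - value).argsort()

# Made-up predicted probabilities for five observations.
probs = [0.90, 0.52, 0.10, 0.45, 0.50]
order = find_nearest(probs, 0.5)
top3 = order[:3]
print(top3)  # positions of the three least certain predictions: 4, 1, 3
```

The returned values are positions in the array, not DataFrame index labels, which matters when they are used to select rows.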

In [134]:
sorted_index = find_nearest(df_preds, 0.5)

In [135]:
sorted_index[:10]

array([339, 252, 465, 331, 446, 337, 346, 423, 144, 328], dtype=int64)

In [140]:
df_preds[446]

0.5

In [142]:
df.iloc[list(sorted_index[:10]), :]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 452 | 453 | 0 | 1 | Foreman, Mr. Benjamin Laventall | male | 30.0 | 0 | 0 | 113051 | 27.75 | C111 | C |
| 331 | 332 | 0 | 1 | Partner, Mr. Austen | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S |
| 607 | 608 | 1 | 1 | Daniel, Mr. Robert Williams | male | 27.0 | 0 | 0 | 113804 | 30.5 | | S |
| 440 | 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.25 | | S |
| 580 | 581 | 1 | 2 | Christy, Miss. Julie Rachel | female | 25.0 | 1 | 1 | 237789 | 30.0 | | S |
| 450 | 451 | 0 | 2 | West, Mr. Edwy Arthur | male | 36.0 | 1 | 2 | C.A. 34651 | 27.75 | | S |
| 460 | 461 | 1 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | E12 | S |
| 555 | 556 | 0 | 1 | Wright, Mr. George | male | 62.0 | 0 | 0 | 113807 | 26.55 | | S |
| 183 | 184 | 1 | 2 | Becker, Master. Richard F | male | 1.0 | 2 | 1 | 230136 | 39.0 | F4 | S |
| 432 | 433 | 1 | 2 | Louch, Mrs. Charles Alexander (Alice Adelaide ... | female | 42.0 | 1 | 0 | SC/AH 3085 | 26.0 | | S |


In [159]:
df.loc[452]

PassengerId                                453
Survived                                     0
Pclass                                       1
Name           Foreman, Mr. Benjamin Laventall
Sex                                       male
Age                                       30.0
SibSp                                        0
Parch                                        0
Ticket                                  113051
Fare                                     27.75
Cabin                                     C111
Embarked                                     C
Name: 452, dtype: object
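
The outputs above show that a position from the sorted array (for example 339) and the index label of the selected row (452) differ, so the selected rows are best looked up by position and then removed from the pool by label. A minimal sketch of that step follows, using made-up frames and hypothetical names (`pool`, `train`, `picked_positions`) that do not appear in the notebook.

```python
import pandas as pd

# Made-up pool and training frames; the index labels are deliberately non-contiguous.
pool = pd.DataFrame({"PassengerId": [11, 12, 13, 14, 15], "Survived": [0, 1, 1, 0, 1]},
                    index=[3, 7, 8, 20, 42])
train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [1, 0]}, index=[0, 1])

# Positions as returned by an argsort over the pool's predictions (hypothetical values).
picked_positions = [4, 1]

# iloc takes positions; the resulting rows keep their original index labels.
picked_rows = pool.iloc[picked_positions]

# Append the picked rows to the training set and drop them from the pool by label.
train = pd.concat([train, picked_rows])
pool = pool.drop(index=picked_rows.index)

print(train.index.tolist())
print(pool.index.tolist())
```

After this step the model would be refit on `train` and the shrunken `pool` scored again for the next iteration.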