# Example Evaluation Code

This notebook is very __similar__ to the code I use to evaluate your results. It is provided for __your convenience__ so that you can evaluate your preprocessing results at any time before your __final submission__.

Please note that the results here will __NOT__ be the same as my evaluation results.

Let's start with loading the required packages.

In [11]:
# import required package for data handling
import pandas as pd
import numpy as np

# import required packages for splitting data
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# import `logistic regression` model
from sklearn.linear_model import LogisticRegression

Next you should load __your__ data. In this case, I am using a sample dataset (`GroupX.csv`) which contains 6 predictors (`X1 - X6`) and two target variables (`Y1, Y2`).

Please make sure you change the data to your __OWN__ dataset when using this code.

__NOTE__:
1. Your dataset may be very different from the sample dataset.
2. Please follow this structure when submitting your dataset.

In [12]:
data = pd.read_csv('out.csv', header=0)
data.head()

Unnamed: 0,C1,C7,C5_D,C6_D,T1_D,T3_D,T5_D,S1_D,S2_D,S3_D,C2,C3_D,Class_1.0,Class_2.0,Class_3.0,Y1,Y2
0,11.045361,3.716773,1.965794,0.00111,0.410145,0.826056,0.002943,0.069818,0.095911,0.10454,True,True,1.0,0.0,0.0,False,True
1,16.093477,2.960063,3.468261,0.0,0.415707,0.807705,0.002641,0.05708,0.10538,0.104286,False,False,1.0,0.0,0.0,True,False
2,9.486833,1.946762,1.83552,0.0,0.405535,0.826918,0.003815,0.107669,0.079192,0.106783,True,False,1.0,0.0,0.0,True,False
3,14.456832,2.042906,1.816507,0.0,0.291205,0.840999,0.003798,0.09842,0.084521,0.115407,True,False,1.0,0.0,0.0,True,True
4,9.69536,5.824461,2.392571,0.0,0.413626,0.599315,0.00205,0.058691,0.073032,0.10028,True,True,0.0,0.0,1.0,False,True


Checking that your data types follow the data dictionary is an important step; you can do that using the `.dtypes` attribute.

__NOTE__: all __continuous__ features will be in `float64` data type, and all __categorical__ features will be in `int64` data type (provided you have already coded them, per __suggested task \#6__ in the competition document).

In [13]:
data.dtypes
data.describe()

Unnamed: 0,C1,C7,C5_D,C6_D,T1_D,T3_D,T5_D,S1_D,S2_D,S3_D,Class_1.0,Class_2.0,Class_3.0
count,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0,449.0
mean,10.598179,5.082318,1.992324,0.000553,0.397619,0.79492,0.002661766,0.070765,0.092399,0.105829,0.501114,0.164811,0.327394
std,2.581888,2.171957,0.643733,0.000946,0.09093,0.067242,0.0006669395,0.011048,0.013168,0.011923,0.500556,0.371423,0.469786
min,3.162278,0.419834,0.668679,0.0,0.0,0.0,7.935055e-09,0.037268,0.046394,0.074628,0.0,0.0,0.0
25%,9.165151,3.440068,1.684725,0.0,0.336129,0.779506,0.00219129,0.063434,0.082768,0.097482,0.0,0.0,0.0
50%,10.29563,4.722347,1.926198,0.0,0.395638,0.803006,0.002607353,0.06971,0.092194,0.105541,1.0,0.0,0.0
75%,12.041595,6.845115,2.193096,0.000869,0.452084,0.827029,0.003082233,0.077908,0.100119,0.112762,1.0,0.0,1.0
max,17.804494,10.508235,9.989357,0.0079,0.68285,0.893935,0.004724773,0.111814,0.130974,0.171456,1.0,1.0,1.0
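
The data type check described above can also be automated. The following is a minimal sketch on a toy frame, not part of the evaluation code: the helper name `find_dtype_mismatches` and the toy values are assumptions made for illustration, and you would pass your own lists of continuous and categorical columns.

```python
import pandas as pd


def find_dtype_mismatches(df, continuous_cols, categorical_cols):
    """Return {column: actual dtype} for columns that break the expected convention.

    Hypothetical helper: continuous columns are expected to be float64,
    categorical (coded) columns are expected to be int64.
    """
    mismatches = {}
    for col in continuous_cols:
        if str(df[col].dtype) != 'float64':
            mismatches[col] = str(df[col].dtype)
    for col in categorical_cols:
        if str(df[col].dtype) != 'int64':
            mismatches[col] = str(df[col].dtype)
    return mismatches


# Toy data (made up): one continuous column, two categorical columns
# that are stored as bool and float instead of int64.
toy = pd.DataFrame({
    'C1': [11.0, 16.1, 9.5],
    'C2': [True, False, True],
    'Class_1.0': [1.0, 0.0, 1.0],
})

mismatches = find_dtype_mismatches(toy, ['C1'], ['C2', 'Class_1.0'])
print(mismatches)

# Cast the offending columns so they follow the int64 convention.
fixed = toy.astype({'C2': 'int64', 'Class_1.0': 'int64'})
print(find_dtype_mismatches(fixed, ['C1'], ['C2', 'Class_1.0']))
```

An empty dictionary from the second call indicates that every listed column now has the expected data type.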


Now you need to specify your targets and predictors. __NOTE__ we have two targets here (`Y1, Y2`).

In [14]:
y1 = data.Y1
y2 = data.Y2

Check the shape of the data.

In [15]:
data.shape

(449, 17)

It is very possible that you will use different sets of the predictors for `Y1` and `Y2`. Now let's define them.

First, let's define predictors for `Y1`. Start by listing the candidate feature columns, i.e. every column in `data` except the two targets.

In [16]:
cols = list(data.columns)
# all columns except the last two (the targets Y1 and Y2)
cols[:-2]

['C1',
 'C7',
 'C5_D',
 'C6_D',
 'T1_D',
 'T3_D',
 'T5_D',
 'S1_D',
 'S2_D',
 'S3_D',
 'C2',
 'C3_D',
 'Class_1.0',
 'Class_2.0',
 'Class_3.0']

Use the code below to keep only the selected features as predictors for `Y1` by dropping every other column.

In [17]:
# Drop columns other than the Feature Selected Columns for Y1
predictors_y1 = data.drop(columns=['C1','C7', 'C6_D', 'T3_D', 'T5_D','S3_D','C3_D','Class_3.0', 'Y1','Y2'])
predictors_y1.head()

Unnamed: 0,C5_D,T1_D,S1_D,S2_D,C2,Class_1.0,Class_2.0
0,1.965794,0.410145,0.069818,0.095911,True,1.0,0.0
1,3.468261,0.415707,0.05708,0.10538,False,1.0,0.0
2,1.83552,0.405535,0.107669,0.079192,True,1.0,0.0
3,1.816507,0.291205,0.09842,0.084521,True,1.0,0.0
4,2.392571,0.413626,0.058691,0.073032,True,0.0,0.0
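
Dropping unwanted columns and selecting wanted columns by name give the same predictor set. The sketch below uses a small made-up frame (the values and the variable names `kept_by_drop`, `kept_by_name`, and `y1_features` are assumptions for illustration) to show the equivalence and a quick guard against leaving a target among the predictors.

```python
import pandas as pd

# Toy frame (made-up values) with two features to keep, one to drop, and both targets.
toy = pd.DataFrame({
    'C5_D': [1.97, 3.47],
    'T1_D': [0.41, 0.42],
    'C1': [11.0, 16.1],
    'Y1': [False, True],
    'Y2': [True, False],
})

# Option 1: drop everything that is not a selected feature (as in the notebook).
kept_by_drop = toy.drop(columns=['C1', 'Y1', 'Y2'])

# Option 2: name the selected features explicitly.
y1_features = ['C5_D', 'T1_D']
kept_by_name = toy[y1_features]

# Both approaches yield the same predictors.
print(kept_by_drop.equals(kept_by_name))

# Guard: no target column should remain among the predictors.
leaked = {'Y1', 'Y2'} & set(kept_by_name.columns)
print(leaked)
```

Listing the kept columns explicitly can be easier to review, since a newly added column is never included by accident.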


Upon investigation of the data, we know we have __six__ features (`X1 - X6`) predicting `Y2`. Use similar code (as below) to select them.

In [18]:
# Drop columns other than the Feature Selected Columns for Y2
predictors_y2 = data.drop(columns=['C1','S1_D','C7','C6_D', 'C3_D', 'T5_D','Y1','Y2'])

#predictors_y2 = data[cols[:-1]]
predictors_y2.head()

Unnamed: 0,C5_D,T1_D,T3_D,S2_D,S3_D,C2,Class_1.0,Class_2.0,Class_3.0
0,1.965794,0.410145,0.826056,0.095911,0.10454,True,1.0,0.0,0.0
1,3.468261,0.415707,0.807705,0.10538,0.104286,False,1.0,0.0,0.0
2,1.83552,0.405535,0.826918,0.079192,0.106783,True,1.0,0.0,0.0
3,1.816507,0.291205,0.840999,0.084521,0.115407,True,1.0,0.0,0.0
4,2.392571,0.413626,0.599315,0.073032,0.10028,True,0.0,0.0,1.0


Below is the key part of this notebook - which generates a `logistic regression` model to predict `Y1`/`Y2`.

The code works this way:

1. We generate two lists `f1_score_lst` and `auc_lst` to store f1_score and AUC from each of the `10` runs of the model;
2. Define model:
    1. We define a `LogisticRegression()` model;
    
    2. We split predictors (`predictors_y1`) and target `y1` to training (80%) and testing (20%);
    
    3. We fit the model `clf` to the training data, then use it to predict on the testing data;
    
    4. We also define a `10-fold cross validation` to check that our model does not overfit - see [here](https://scikit-learn.org/stable/modules/cross_validation.html) for more info;
    
    5. We append the f1_score and AUC of current model to the lists (`f1_score_lst` and `auc_lst`) we defined earlier.
  
3. Print out average f1_score and AUC for all 10 runs;
4. Print out the average accuracy from cross validation;
5. Print out confusion matrix and classification report for the __last__ model.

__NOTE__: Step 3 provides the evaluation results we need; steps 4 and 5 can be used to verify the results from step 3.
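
The steps above can be sketched as one reusable function. This is an illustration under stated assumptions, not the grading code: the data is synthetic (from `make_classification`), the function name `evaluate_logistic` is hypothetical, `max_iter=1000` is added only to avoid convergence warnings, and the seed changes on every run so that the 10 runs actually differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split


def evaluate_logistic(X, y, n_runs=10, test_size=0.2):
    """Hypothetical helper: average weighted F1, AUC and CV accuracy over n_runs."""
    f1_scores, auc_scores, cv_scores = [], [], []
    for run in range(n_runs):
        # 80/20 split; the seed varies per run (an assumption of this sketch).
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=run)

        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Weighted F1 and AUC computed from the predicted labels.
        f1_scores.append(f1_score(y_test, y_pred, average='weighted'))
        auc_scores.append(roc_auc_score(y_test, y_pred))

        # 10-fold cross validation on the training part only.
        kfold = KFold(n_splits=10, shuffle=True, random_state=run)
        cv_scores.append(cross_val_score(
            model, X_train, y_train, cv=kfold, scoring='accuracy').mean())

    return (float(np.mean(f1_scores)),
            float(np.mean(auc_scores)),
            float(np.mean(cv_scores)))


# Synthetic stand-in for the predictors and a binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
mean_f1, mean_auc, mean_cv_acc = evaluate_logistic(X_demo, y_demo)
print('F1 {:.4f}; AUC {:.4f}; CV accuracy {:.4f}'.format(mean_f1, mean_auc, mean_cv_acc))
```

Wrapping the loop in a function makes it easy to run the same evaluation for both targets by passing different predictors and labels.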

In [19]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):  # range(1, 10) would only give 9 runs
    #Model building
    clf = LogisticRegression()
    X1_train, X1_test, y1_train, y1_test = train_test_split(predictors_y1, y1, test_size=0.2, random_state=123)
    clf.fit(X1_train, y1_train)

    y1_pred = clf.predict(X1_test)

    
    #10-fold cross validation
    # no shuffling, so random_state is not needed (recent scikit-learn versions raise an error if it is set)
    kfold = model_selection.KFold(n_splits=10)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, X1_train, y1_train, cv=kfold, scoring=scoring)

    

    
    #calculate f1-score and AUC
    
    clf_roc_auc = roc_auc_score(y1_test, y1_pred)
    f1_score_lst.append(precision_recall_fscore_support(y1_test, y1_pred, average='weighted')[2])
    auc_lst.append(clf_roc_auc)


print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

#result=logit_model.fit()
confusion_matrix_y1 = confusion_matrix(y1_test, y1_pred)


#print(result.summary())
print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X1_test, y1_test)))

print("10-fold cross validation average accuracy of classifier: %.3f" % (results.mean()))

print('Confusion Matrix for Logistic Regression Classifier:')
print(confusion_matrix_y1)

print('Classification Report for Logistic Regression Classifier:')
print(classification_report(y1_test, y1_pred))


F1 0.5187; AUC 0.5618 
Accuracy of classifier on test set: 0.56
10-fold cross validation average accuracy of classifier: 0.510
Confusion Matrix for Logistic Regression Classifier:
[[37  7]
 [33 13]]
Classification Report for Logistic Regression Classifier:
             precision    recall  f1-score   support

      False       0.53      0.84      0.65        44
       True       0.65      0.28      0.39        46

avg / total       0.59      0.56      0.52        90
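
The printed F1 and AUC can be cross-checked by hand from the confusion matrix above. Because every run uses the same `random_state=123` split, the averaged scores should coincide with those of the last model. The sketch below is plain arithmetic with variable names chosen for illustration; it relies on the fact that an AUC computed from hard class predictions (rather than probabilities) reduces to the mean of the two per-class recalls.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 37, 7    # true False: predicted False / predicted True
fn, tp = 33, 13   # true True:  predicted False / predicted True

# With hard 0/1 predictions, ROC AUC equals the mean of the two recalls.
recall_false = tn / (tn + fp)
recall_true = tp / (tp + fn)
auc_from_labels = (recall_false + recall_true) / 2

# Per-class F1 = 2*TP / (2*TP + FP + FN), treating each class as "positive" in turn.
f1_false = 2 * tn / (2 * tn + fn + fp)
f1_true = 2 * tp / (2 * tp + fp + fn)

# Weighted F1: average of the per-class F1 scores, weighted by class support.
support_false = tn + fp
support_true = fn + tp
weighted_f1 = (support_false * f1_false + support_true * f1_true) / (support_false + support_true)

print('F1 {:.4f}; AUC {:.4f}'.format(weighted_f1, auc_from_labels))
```

This reproduces `F1 0.5187; AUC 0.5618`, matching the first line of the output.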



The code below is used to evaluate the model for `Y2`. It is very similar to the code above; the key difference is that `Y2` is imbalanced, so I wrote some code (under `# Begin oversampling`) to deal with that.
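
To see what random oversampling does in isolation, here is a minimal sketch on a toy training frame. The function name `oversample_to_balance`, the toy values, and the fixed `random_state` are assumptions made for illustration; the idea is the same as in the cell that follows, namely resampling the smaller class with replacement until both classes have the same number of rows.

```python
import pandas as pd


def oversample_to_balance(train_df, target_col, random_state=0):
    """Hypothetical helper: duplicate minority-class rows until classes are balanced."""
    max_size = train_df[target_col].value_counts().max()
    parts = [train_df]
    for _, group in train_df.groupby(target_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # Sample with replacement so a small class can be drawn many times.
            parts.append(group.sample(n_extra, replace=True, random_state=random_state))
    return pd.concat(parts)


# Toy training data (made up): four True rows and two False rows.
toy_train = pd.DataFrame({
    'X1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'Y2': [True, True, True, True, False, False],
})

balanced = oversample_to_balance(toy_train, 'Y2')
print(balanced['Y2'].value_counts())
```

Only the training split should be oversampled; the test split keeps its original class balance so the scores reflect the real distribution.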

In [20]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):  # range(1, 10) would only give 9 runs
    #Model building
    clf1 = LogisticRegression()

    
    # Splitting data into testing and training
    X2_train, X2_test, y2_train, y2_test = train_test_split(predictors_y2, y2, test_size=0.2, random_state=123)
    
    # Begin oversampling
    oversample = pd.concat([X2_train,y2_train],axis=1)
    max_size = oversample['Y2'].value_counts().max()
    lst = [oversample]
    for class_index, group in oversample.groupby('Y2'):
        lst.append(group.sample(max_size-len(group), replace=True))
    X2_train = pd.concat(lst)
    y2_train = X2_train['Y2'].copy()
    X2_train = X2_train.drop(columns='Y2')
    
    # fitting model on oversampled data
    clf1.fit(X2_train, y2_train)
    
    y2_pred = clf1.predict(X2_test)
    
    
    #10-fold cross validation
    # no shuffling, so random_state is not needed (recent scikit-learn versions raise an error if it is set)
    kfold = model_selection.KFold(n_splits=10)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf1, X2_train, y2_train, cv=kfold, scoring=scoring)
    
    #calculate f1-score and AUC
    
    clf1_roc_auc = roc_auc_score(y2_test, y2_pred)
    
    
    #store f1-score and AUC of the current run
    f1_score_lst.append(precision_recall_fscore_support(y2_test, y2_pred, average='weighted')[2])
    auc_lst.append(clf1_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

confusion_matrix_y2 = confusion_matrix(y2_test, y2_pred)


print('Accuracy of classifier on test set: {:.3f}'.format(clf1.score(X2_test, y2_test)))

print("10-fold cross validation average accuracy of clf1: %.3f" % (results.mean()))

print('Confusion Matrix for Classifier:')
print(confusion_matrix_y2)

print('Classification Report for Classifier:')
print(classification_report(y2_test, y2_pred))


F1 0.6242; AUC 0.5692 
Accuracy of classifier on test set: 0.611
10-fold cross validation average accuracy of clf1: 0.409
Confusion Matrix for Classifier:
[[16  5]
 [30 39]]
Classification Report for Classifier:
             precision    recall  f1-score   support

      False       0.35      0.76      0.48        21
       True       0.89      0.57      0.69        69

avg / total       0.76      0.61      0.64        90

