# Example Evaluation Code

This notebook will be very __similar__ to the code I use to evaluate your results - it is provided for __your convenience__ so that you can use it to evaluate your preprocessing results at any time before your __final submission__.

Please note that the results here will __NOT__ be the same as my evaluation results.

Let's start with loading the required packages.

In [1]:
# import required packages for data handling
import pandas as pd
import numpy as np

# import required packages for splitting data
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# import `logistic regression` model
from sklearn.linear_model import LogisticRegression

Next you should load __your__ data. In this case, I am using a sample dataset (`GroupX.csv`) which contains 6 predictors (`X1 - X6`) and two target variables (`Y1, Y2`).

Please make sure you change the data to your __OWN__ dataset when using this code.

__NOTE__:
1. Your dataset may be very different from the sample dataset.
2. Please follow this structure when submitting your dataset.

In [2]:
data = pd.read_csv('evaltest.csv', header=0)
data.head()

Unnamed: 0.1,Unnamed: 0,I1,I2,I3,P(IPO),P(H),P(L),P(1Day),C1,C2,...,Pos_to_Total_Words,Neg_to_Total_Words,Pos_Neg_Words,Long_to_Total_Words,Real_to_Total_Words,C3P,C5P,C6P,Y1,Y2
0,0,AATI,ADVANCED ANALOGIC TECHNOLOGIES INC,3674,10.0,9.5,8.5,11.87,122.0,1.0,...,0.004875,0.009199,0.529915,0.05425,0.908876,1,3.864345,11.111111,0,1
1,1,ABPI,ACCENTIA BIOPHARMACEUTICALS INC,2834,8.0,10.0,8.0,7.25,259.0,0.0,...,0.003258,0.011105,0.293388,0.051395,0.898724,0,12.028832,0.0,1,0
2,2,ACAD,ACADIA PHARMACEUTICALS INC,2834,7.0,14.0,12.0,6.7,90.0,1.0,...,0.011593,0.006271,1.848485,0.061764,0.90935,0,3.369134,0.0,1,0
3,3,ACHN,ACHILLION PHARMACEUTICALS INC,2834,11.5,16.0,14.0,12.39,209.0,1.0,...,0.009686,0.007144,1.355932,0.06163,0.91706,0,3.299697,0.0,1,1
4,4,ACLI,AMERICAN COMMERCIAL LINES INC.,4492,21.0,21.0,19.0,56.599998,80.0,1.0,...,0.004518,0.010047,0.449664,0.04855,0.888469,1,3.726269,5.0,0,1


Checking your data types to make sure they follow the data dictionary is an important step; you can do that using the `.dtypes` attribute.

__NOTE__: all __continuous__ features will be in `float64` data type, and all __categorical__ features will be in `int64` data type (provided you have already coded them, per __suggested task \#6__ in the competition document).

In [3]:
data.dtypes

Unnamed: 0                   int64
I1                          object
I2                          object
I3                           int64
P(IPO)                     float64
P(H)                       float64
P(L)                       float64
P(1Day)                    float64
C1                         float64
C2                         float64
C3                         float64
C4                         float64
C5                         float64
C6                         float64
C7                         float64
T1                         float64
T2                         float64
T3                         float64
T4                         float64
T5                         float64
S1                         float64
S2                         float64
S3                         float64
P(mid)                     float64
Long_to_Total_Sentences    float64
Pos_to_Total_Words         float64
Neg_to_Total_Words         float64
Pos_Neg_Words              float64
Long_to_Total_Words 
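
The dtype check described above can also be done programmatically. The following is a minimal sketch, assuming a small made-up DataFrame named `sample` (a hypothetical stand-in for `data`); it lists every column whose dtype is neither `float64` nor `int64`.

```python
import pandas as pd

# Hypothetical miniature dataset; names and values are illustrative only.
sample = pd.DataFrame({
    "I1": ["AATI", "ABPI", "ACAD"],   # ticker symbol (text)
    "P(IPO)": [10.0, 8.0, 7.0],       # continuous feature
    "C2": [1, 0, 1],                  # coded categorical feature
})

# Collect every column whose dtype is neither float64 nor int64.
allowed = {"float64", "int64"}
unexpected = [col for col, dtype in sample.dtypes.items() if str(dtype) not in allowed]

print(unexpected)
```

Any column reported here would need to be encoded or excluded before it is passed to a model.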

Now you need to specify your targets and predictors. __NOTE__ we have two targets here (`Y1, Y2`).

In [4]:
y1 = data.Y1
y2 = data.Y2
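
Since class imbalance matters later in this notebook, it may help to look at how a target is distributed at this point. The sketch below uses a made-up Series named `y` (a hypothetical stand-in for `data.Y1` or `data.Y2`) and `value_counts(normalize=True)` to report class shares.

```python
import pandas as pd

# Made-up target values, for illustration only.
y = pd.Series([0, 0, 0, 1], name="Y2")

# Share of each class; a strongly uneven split suggests an imbalanced target.
shares = y.value_counts(normalize=True)

print(shares)
```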

Check the shape of the data.

In [5]:
data.shape

(682, 35)

It is very possible that you will use different sets of the predictors for `Y1` and `Y2`. Now let's define them.

First, let's define the predictors for `Y1` - which will be every column in `data` except the last three (`C6P`, `Y1`, `Y2`).

In [6]:
cols = list(data.columns)
# all columns except the last three (C6P, Y1, Y2)
cols[:-3]

['Unnamed: 0',
 'I1',
 'I2',
 'I3',
 'P(IPO)',
 'P(H)',
 'P(L)',
 'P(1Day)',
 'C1',
 'C2',
 'C3',
 'C4',
 'C5',
 'C6',
 'C7',
 'T1',
 'T2',
 'T3',
 'T4',
 'T5',
 'S1',
 'S2',
 'S3',
 'P(mid)',
 'Long_to_Total_Sentences',
 'Pos_to_Total_Words',
 'Neg_to_Total_Words',
 'Pos_Neg_Words',
 'Long_to_Total_Words',
 'Real_to_Total_Words',
 'C3P',
 'C5P']
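
Note that a negative stop index drops columns from the end rather than keeping a fixed number from the start. A minimal sketch, assuming a shortened, hypothetical column list named `example_cols`:

```python
# Hypothetical, shortened column list; the real `cols` has many more entries.
example_cols = ["I1", "I2", "P(IPO)", "C1", "C3P", "C5P", "C6P", "Y1", "Y2"]

first_five = example_cols[:5]           # keeps exactly five names from the start
all_but_last_three = example_cols[:-3]  # drops the last three names

print(first_five)
print(all_but_last_three)
```

The two slices only coincide when the list happens to have exactly eight entries.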

Use the code below to select every column except the last three as predictors for `Y1`.

In [7]:
predictors_y1 = data[cols[:-3]]
predictors_y1.head()

Unnamed: 0.1,Unnamed: 0,I1,I2,I3,P(IPO),P(H),P(L),P(1Day),C1,C2,...,S3,P(mid),Long_to_Total_Sentences,Pos_to_Total_Words,Neg_to_Total_Words,Pos_Neg_Words,Long_to_Total_Words,Real_to_Total_Words,C3P,C5P
0,0,AATI,ADVANCED ANALOGIC TECHNOLOGIES INC,3674,10.0,9.5,8.5,11.87,122.0,1.0,...,139.0,9.0,0.640426,0.004875,0.009199,0.529915,0.05425,0.908876,1,3.864345
1,1,ABPI,ACCENTIA BIOPHARMACEUTICALS INC,2834,8.0,10.0,8.0,7.25,259.0,0.0,...,237.0,9.0,0.644753,0.003258,0.011105,0.293388,0.051395,0.898724,0,12.028832
2,2,ACAD,ACADIA PHARMACEUTICALS INC,2834,7.0,14.0,12.0,6.7,90.0,1.0,...,60.0,13.0,0.636816,0.011593,0.006271,1.848485,0.061764,0.90935,0,3.369134
3,3,ACHN,ACHILLION PHARMACEUTICALS INC,2834,11.5,16.0,14.0,12.39,209.0,1.0,...,110.0,15.0,0.539634,0.009686,0.007144,1.355932,0.06163,0.91706,0,3.299697
4,4,ACLI,AMERICAN COMMERCIAL LINES INC.,4492,21.0,21.0,19.0,56.599998,80.0,1.0,...,167.0,20.0,0.587413,0.004518,0.010047,0.449664,0.04855,0.888469,1,3.726269


For `Y2`, similar code (as below) selects every column except the two targets (`Y1`, `Y2`), so `C6P` is included as a predictor.

In [8]:
predictors_y2 = data[cols[:-2]]
predictors_y2.head()

Unnamed: 0.1,Unnamed: 0,I1,I2,I3,P(IPO),P(H),P(L),P(1Day),C1,C2,...,P(mid),Long_to_Total_Sentences,Pos_to_Total_Words,Neg_to_Total_Words,Pos_Neg_Words,Long_to_Total_Words,Real_to_Total_Words,C3P,C5P,C6P
0,0,AATI,ADVANCED ANALOGIC TECHNOLOGIES INC,3674,10.0,9.5,8.5,11.87,122.0,1.0,...,9.0,0.640426,0.004875,0.009199,0.529915,0.05425,0.908876,1,3.864345,11.111111
1,1,ABPI,ACCENTIA BIOPHARMACEUTICALS INC,2834,8.0,10.0,8.0,7.25,259.0,0.0,...,9.0,0.644753,0.003258,0.011105,0.293388,0.051395,0.898724,0,12.028832,0.0
2,2,ACAD,ACADIA PHARMACEUTICALS INC,2834,7.0,14.0,12.0,6.7,90.0,1.0,...,13.0,0.636816,0.011593,0.006271,1.848485,0.061764,0.90935,0,3.369134,0.0
3,3,ACHN,ACHILLION PHARMACEUTICALS INC,2834,11.5,16.0,14.0,12.39,209.0,1.0,...,15.0,0.539634,0.009686,0.007144,1.355932,0.06163,0.91706,0,3.299697,0.0
4,4,ACLI,AMERICAN COMMERCIAL LINES INC.,4492,21.0,21.0,19.0,56.599998,80.0,1.0,...,20.0,0.587413,0.004518,0.010047,0.449664,0.04855,0.888469,1,3.726269,5.0


Below is the key part of this notebook - which generates a `logistic regression` model to predict `Y1`/`Y2`.

The code works this way:

1. We generate two lists `f1_score_lst` and `auc_lst` to store f1_score and AUC from each of the `10` runs of the model;
2. Define model:
    1. We define a `LogisticRegression()` model;
    
    2. We split predictors (`predictors_y1`) and target `y1` to training (80%) and testing (20%);
    
    3. We fit the model `clf` to the training data, then use it to predict on the testing data;
    
    4. We also define `10-fold cross validation` to check that our model does not overfit - see [here](https://scikit-learn.org/stable/modules/cross_validation.html) for more info;
    
    5. We append the f1_score and AUC of current model to the lists (`f1_score_lst` and `auc_lst`) we defined earlier.
  
3. Print out average f1_score and AUC for all 10 runs;
4. Print out the average accuracy from cross validation;
5. Print out confusion matrix and classification report for the __last__ model.

__NOTE__: Step 3 provides the evaluation results we need; steps 4 and 5 can be used to verify the results from step 3.
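
The two metrics collected in step 2.5 can be illustrated on a handful of made-up labels. This is only a sketch: `y_true`, `y_pred` and `y_score` are hypothetical values, not results from this dataset. It shows that element `[2]` of `precision_recall_fscore_support` is the weighted F1, and that `roc_auc_score` accepts either hard labels (as this notebook passes) or probability scores. Using scores is an alternative I am assuming here for comparison, not the method used in this notebook.

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support, roc_auc_score

# Made-up labels and scores, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]               # hard labels, as returned by predict()
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]  # hypothetical positive-class probabilities

# Element [2] of the returned tuple is the F-score.
f1_weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")[2]
f1_direct = f1_score(y_true, y_pred, average="weighted")

# AUC computed from hard labels versus from continuous scores.
auc_from_labels = roc_auc_score(y_true, y_pred)
auc_from_scores = roc_auc_score(y_true, y_score)

print(f"weighted F1: {f1_weighted:.4f} (f1_score gives {f1_direct:.4f})")
print(f"AUC from labels: {auc_from_labels:.4f}; AUC from scores: {auc_from_scores:.4f}")
```

AUC computed from hard labels only reflects a single decision threshold, which is why the two AUC values can differ.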

In [9]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):
    #Model building
    clf = LogisticRegression()
    X1_train, X1_test, y1_train, y1_test = train_test_split(predictors_y1, y1, test_size=0.2, random_state=123)
    clf.fit(X1_train, y1_train)

    y1_pred = clf.predict(X1_test)

    
    #10-fold cross validation
    # random_state only takes effect when shuffle=True
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, X1_train, y1_train, cv=kfold, scoring=scoring)

    

    
    #calculate f1-score and AUC
    
    clf_roc_auc = roc_auc_score(y1_test, y1_pred)
    f1_score_lst.append(precision_recall_fscore_support(y1_test, y1_pred, average='weighted')[2])
    auc_lst.append(clf_roc_auc)


print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

confusion_matrix_y1 = confusion_matrix(y1_test, y1_pred)

print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X1_test, y1_test)))

print("10-fold cross validation average accuracy of classifier: %.3f" % (results.mean()))

print('Confusion Matrix for Logistic Regression Classifier:')
print(confusion_matrix_y1)

print('Classification Report for Logistic Regression Classifier:')
print(classification_report(y1_test, y1_pred))


ValueError: could not convert string to float: 'RSC Holdings Inc.'
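
The `ValueError` above is raised because the predictors still contain text columns (the `object` columns seen in the `.dtypes` output), which `LogisticRegression` cannot convert to numbers. One possible way around it, sketched here under the assumption that the text columns are identifiers rather than features, is to keep only numeric columns with `select_dtypes`. The DataFrame `predictors` and Series `target` below are made-up stand-ins, not the real data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature stand-in for `predictors_y1`; values are made up.
predictors = pd.DataFrame({
    "I2": ["ALPHA INC", "BETA CORP", "GAMMA LLC", "DELTA INC"],  # text column
    "P(IPO)": [10.0, 8.0, 7.0, 11.5],
    "C2": [1, 0, 1, 1],
})
target = pd.Series([0, 1, 1, 0], name="Y1")

# Keep only numeric columns; text columns are left out of the model input.
numeric_predictors = predictors.select_dtypes(include="number")
dropped = [c for c in predictors.columns if c not in numeric_predictors.columns]

# Fitting now succeeds because every remaining column is numeric.
clf = LogisticRegression()
clf.fit(numeric_predictors, target)

print("dropped columns:", dropped)
```

Whether a text column should be dropped or encoded depends on the data dictionary; this sketch only shows the mechanics of excluding it.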

The code below evaluates the model for `Y2`. It is very similar to the code above - the key difference is that `Y2` is imbalanced - so I wrote some code (under `# Begin oversampling`) to deal with that.

In [10]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


#loop to calculate f1 and auc scores and present averages after 10 runs
for count in range(10):
    #Model building
    clf1 = LogisticRegression()

    
    # Splitting data into testing and training
    X2_train, X2_test, y2_train, y2_test = train_test_split(predictors_y2, y2, test_size=0.2, random_state=123)
    
    # Begin oversampling
    oversample = pd.concat([X2_train,y2_train],axis=1)
    max_size = oversample['Y2'].value_counts().max()
    lst = [oversample]
    for class_index, group in oversample.groupby('Y2'):
        lst.append(group.sample(max_size-len(group), replace=True))
    X2_train = pd.concat(lst)
    y2_train = X2_train['Y2'].copy()
    del X2_train['Y2']
    
    # fitting model on oversampled data
    clf1.fit(X2_train, y2_train)
    
    y2_pred = clf1.predict(X2_test)
    
    
    #10-fold cross validation
    # random_state only takes effect when shuffle=True
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=123)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf1, X2_train, y2_train, cv=kfold, scoring=scoring)
    
    #calculate f1-score and AUC
    
    clf1_roc_auc = roc_auc_score(y2_test, y2_pred)
    
    
    #calculate average f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y2_test, y2_pred, average='weighted')[2])
    auc_lst.append(clf1_roc_auc)
    
    
print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

confusion_matrix_y2 = confusion_matrix(y2_test, y2_pred)


print('Accuracy of classifier on test set: {:.3f}'.format(clf1.score(X2_test, y2_test)))

print("10-fold cross validation average accuracy of clf1: %.3f" % (results.mean()))

print('Confusion Matrix for Classifier:')
print(confusion_matrix_y2)

print('Classification Report for Classifier:')
print(classification_report(y2_test, y2_pred))


ValueError: could not convert string to float: 'FriendFinder Networks Inc.\xa0'
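
The oversampling step in the cell above can be looked at in isolation. The sketch below wraps the same idea (resample each smaller class with replacement until it matches the largest class) in a helper named `oversample_to_majority`. The helper name, its `seed` parameter and the toy `train` frame are my own assumptions for illustration.

```python
import pandas as pd

def oversample_to_majority(train, target_col, seed=0):
    """Resample each smaller class with replacement up to the largest class size."""
    max_size = train[target_col].value_counts().max()
    parts = [train]
    for _, group in train.groupby(target_col):
        extra = max_size - len(group)
        if extra > 0:
            # Draw the missing rows at random from this class.
            parts.append(group.sample(extra, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Made-up training data with four rows of class 0 and two rows of class 1.
train = pd.DataFrame({
    "X1": [0.1, 0.4, 0.3, 0.9, 0.8, 0.7],
    "Y2": [0, 0, 0, 0, 1, 1],
})

balanced = oversample_to_majority(train, "Y2")
counts = balanced["Y2"].value_counts()

print(counts)
```

Only the training split should be oversampled. Because duplicated rows can land in both the training and validation folds, cross-validation accuracy computed on already oversampled data may look optimistic.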