# Example Evaluation Code

This notebook will be very __similar__ to the code I use to evaluate your results - it is provided for __your convenience__ so that you can use it to evaluate your preprocessing results at any time before your __final submission__.

Please note that the results here will __NOT__ be the same as my evaluation results.

Let's start with loading the required packages.

In [10]:
# import required package for data handling
import pandas as pd
import numpy as np

# import required packages for splitting data
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# import required packages for evaluating models
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

# import `logistic regression` model
from sklearn.linear_model import LogisticRegression

Next you should load __your__ data. In this case, I am using a sample dataset (`GroupX.csv`) which contains 6 predictors (`X1 - X6`) and two target variables (`Y1, Y2`).

Please make sure you change the data to your __OWN__ dataset when using this code.

__NOTE__:
1. Your dataset may be very different from the sample dataset.
2. Please follow this structure when submitting your dataset.

In [19]:
data = pd.read_csv('median.csv', header=0)
data.head()

,Unnamed: 0,ticker,company_name,offer_price,price_range_higher_bound,price_range_lower_bound,first_day_trading_price,days,top_tier_dummy,positive_eps_dummy,...,number_of_sentences,number_of_words,number_of_real_words,number_of_long_sentences,number_of_long_words,number_of_positive_words,number_of_negative_words,number_of_uncertain_words,pre_IPO_price_revision,post_IPO_initial_return
0,0,AATI,ADVANCED ANALOGIC TECHNOLOGIES INC,-0.636323,-0.905198,-0.865043,-0.190333,-0.175193,0.39736,0.010968,...,0.025058,-0.007086,0.034154,0.054951,0.022999,-0.164165,-0.036299,-0.083026,0,1
1,1,ABPI,ACCENTIA BIOPHARMACEUTICALS INC,-0.968157,-0.829439,-0.951417,-0.254485,0.735662,-2.516611,-0.020955,...,1.854267,1.660219,1.694742,1.777161,0.933582,0.066365,1.439359,1.333653,1,1
2,2,ACAD,ACADIA PHARMACEUTICALS INC,-1.134074,-0.22336,-0.260419,-0.262122,-0.387947,0.39736,-0.018553,...,-1.50783,-1.377426,-1.367776,-1.37061,-0.749937,-0.18978,-1.027941,-1.225043,1,1
3,3,ACHN,ACHILLION PHARMACEUTICALS INC,-0.387448,0.079679,0.08508,-0.183112,0.403233,0.39736,-0.016467,...,-0.784124,-0.82668,-0.790657,-0.966838,-0.360293,0.296895,-0.721004,-0.502248,1,1
4,4,ACLI,AMERICAN COMMERCIAL LINES INC.,1.188763,0.837277,0.948829,0.430778,-0.454433,0.39736,-0.010272,...,0.606302,0.380843,0.368547,0.343359,0.086528,-0.036093,0.34147,0.321739,0,0


Checking that your data types follow the data dictionary is an important step; you can do that using the `.dtypes` attribute.

__NOTE__: all __continuous__ features will be in `float64` data type, and all __categorical__ features will be in `int64` data type (provided you have already coded them, per __suggested task \#6__ in the competition document).

In [20]:
data.dtypes

Unnamed: 0                      int64
ticker                         object
company_name                   object
offer_price                   float64
price_range_higher_bound      float64
price_range_lower_bound       float64
first_day_trading_price       float64
days                          float64
top_tier_dummy                float64
positive_eps_dummy            float64
prior_nasdaq_15day_returns    float64
share_overhang                float64
up_revision                   float64
sales                         float64
number_of_sentences           float64
number_of_words               float64
number_of_real_words          float64
number_of_long_sentences      float64
number_of_long_words          float64
number_of_positive_words      float64
number_of_negative_words      float64
number_of_uncertain_words     float64
pre_IPO_price_revision          int64
post_IPO_initial_return         int64
dtype: object
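
If you want to check the `float64`/`int64` rule from the note above automatically, something like the sketch below may help. The helper `check_dtypes` and the toy frame are my own illustration (not part of the evaluation code), and you would need to supply the lists of continuous and categorical columns from your own data dictionary.

```python
import pandas as pd


def check_dtypes(df, continuous, categorical):
    """Return {column: dtype} for columns that break the float64/int64 rule."""
    problems = {}
    for col in continuous:
        if df[col].dtype != "float64":
            problems[col] = str(df[col].dtype)
    for col in categorical:
        if df[col].dtype != "int64":
            problems[col] = str(df[col].dtype)
    return problems


# Toy frame standing in for your own data (values are made up).
toy = pd.DataFrame({
    "offer_price": [0.5, -1.2, 0.3],      # continuous, stored as float64
    "top_tier_dummy": [1.0, 0.0, 1.0],    # categorical, but stored as float64
    "pre_IPO_price_revision": [0, 1, 1],  # categorical, stored as int64
})

problems = check_dtypes(
    toy,
    continuous=["offer_price"],
    categorical=["top_tier_dummy", "pre_IPO_price_revision"],
)
print(problems)
```

An empty dictionary means every listed column has the expected type; any column that shows up can usually be converted with `.astype('int64')` or `.astype('float64')` once you have confirmed its values are what you expect.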

Now you need to specify your targets and predictors. __NOTE__: we have two targets here (`Y1, Y2`).

In [21]:
y1 = data.pre_IPO_price_revision
y2 = data.post_IPO_initial_return
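
Before modelling, it can be useful to see how balanced each target is, since an imbalanced target may need extra handling. The sketch below is only an illustration on a made-up series; `class_shares` is a hypothetical helper name, not something from the evaluation code.

```python
import pandas as pd


def class_shares(target):
    """Share of rows in each class, ordered by class label."""
    return target.value_counts(normalize=True).sort_index()


# Made-up target: three rows of class 1 and one row of class 0.
toy_target = pd.Series([1, 1, 1, 0], name="post_IPO_initial_return")
shares = class_shares(toy_target)
print(shares)
```

On your own data you would call `class_shares(y1)` and `class_shares(y2)`; shares far from an even split suggest the target is imbalanced.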

Check the shape of the data.

In [22]:
data.shape

(682, 24)

It is very possible that you will use different sets of the predictors for `Y1` and `Y2`. Now let's define them.

First, let's define the predictors for `Y1`. We start from every column in `data` except the last three (`number_of_uncertain_words` and the two targets).

In [23]:
cols = list(data.columns)
# every column except the last three
cols[:-3]

['Unnamed: 0',
 'ticker',
 'company_name',
 'offer_price',
 'price_range_higher_bound',
 'price_range_lower_bound',
 'first_day_trading_price',
 'days',
 'top_tier_dummy',
 'positive_eps_dummy',
 'prior_nasdaq_15day_returns',
 'share_overhang',
 'up_revision',
 'sales',
 'number_of_sentences',
 'number_of_words',
 'number_of_real_words',
 'number_of_long_sentences',
 'number_of_long_words',
 'number_of_positive_words',
 'number_of_negative_words']

Use the code below to select the predictors for `Y1`. It drops the identifier columns along with `offer_price` and `first_day_trading_price`.

In [24]:
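# name the columns to drop explicitly; a positional axis argument is not accepted by recent pandas
predictors_y1 = data[cols[:-3]].drop(columns=['Unnamed: 0', 'ticker', 'company_name', 'offer_price', 'first_day_trading_price'])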
predictors_y1.head()

,price_range_higher_bound,price_range_lower_bound,days,top_tier_dummy,positive_eps_dummy,prior_nasdaq_15day_returns,share_overhang,up_revision,sales,number_of_sentences,number_of_words,number_of_real_words,number_of_long_sentences,number_of_long_words,number_of_positive_words,number_of_negative_words
0,-0.905198,-0.865043,-0.175193,0.39736,0.010968,0.663539,-0.078977,-0.070832,-0.260654,0.025058,-0.007086,0.034154,0.054951,0.022999,-0.164165,-0.036299
1,-0.829439,-0.951417,0.735662,-2.516611,-0.020955,-0.631805,-0.195412,-0.398784,-0.276917,1.854267,1.660219,1.694742,1.777161,0.933582,0.066365,1.439359
2,-0.22336,-0.260419,-0.387947,0.39736,-0.018553,0.408314,-0.311179,-0.294799,-0.288795,-1.50783,-1.377426,-1.367776,-1.37061,-0.749937,-0.18978,-1.027941
3,0.079679,0.08508,0.403233,0.39736,-0.016467,0.38719,-0.330407,-0.314796,-0.28806,-0.784124,-0.82668,-0.790657,-0.966838,-0.360293,0.296895,-0.721004
4,0.837277,0.948829,-0.454433,0.39736,-0.010272,-1.289555,-0.177382,-0.164818,0.111183,0.606302,0.380843,0.368547,0.343359,0.086528,-0.036093,0.34147


Upon investigation of the data, we know which features predict `Y2` (in the sample dataset, the __six__ features `X1 - X6`). Use similar code, shown below, to select them.

In [25]:
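# name the columns to drop explicitly; a positional axis argument is not accepted by recent pandas
predictors_y2 = data[cols[:-2]].drop(columns=['Unnamed: 0', 'ticker', 'company_name', 'first_day_trading_price'])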
predictors_y2.head()

,offer_price,price_range_higher_bound,price_range_lower_bound,days,top_tier_dummy,positive_eps_dummy,prior_nasdaq_15day_returns,share_overhang,up_revision,sales,number_of_sentences,number_of_words,number_of_real_words,number_of_long_sentences,number_of_long_words,number_of_positive_words,number_of_negative_words,number_of_uncertain_words
0,-0.636323,-0.905198,-0.865043,-0.175193,0.39736,0.010968,0.663539,-0.078977,-0.070832,-0.260654,0.025058,-0.007086,0.034154,0.054951,0.022999,-0.164165,-0.036299,-0.083026
1,-0.968157,-0.829439,-0.951417,0.735662,-2.516611,-0.020955,-0.631805,-0.195412,-0.398784,-0.276917,1.854267,1.660219,1.694742,1.777161,0.933582,0.066365,1.439359,1.333653
2,-1.134074,-0.22336,-0.260419,-0.387947,0.39736,-0.018553,0.408314,-0.311179,-0.294799,-0.288795,-1.50783,-1.377426,-1.367776,-1.37061,-0.749937,-0.18978,-1.027941,-1.225043
3,-0.387448,0.079679,0.08508,0.403233,0.39736,-0.016467,0.38719,-0.330407,-0.314796,-0.28806,-0.784124,-0.82668,-0.790657,-0.966838,-0.360293,0.296895,-0.721004,-0.502248
4,1.188763,0.837277,0.948829,-0.454433,0.39736,-0.010272,-1.289555,-0.177382,-0.164818,0.111183,0.606302,0.380843,0.368547,0.343359,0.086528,-0.036093,0.34147,0.321739
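
A quick sanity check on a predictor frame can catch two common slips: leaving a target column in the predictors, and leaving a text column (such as a ticker) that the model cannot use. The sketch below shows one way to do this on a made-up frame; `check_predictors` is a hypothetical helper of mine, not part of the evaluation code.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_predictors(predictors, target_names):
    """Return (leaked target columns, non-numeric columns) found in predictors."""
    leaked = [c for c in predictors.columns if c in target_names]
    non_numeric = [
        c for c in predictors.columns if not is_numeric_dtype(predictors[c])
    ]
    return leaked, non_numeric


# Made-up frame that deliberately contains both kinds of mistake.
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "offer_price": [0.5, -1.2],
    "post_IPO_initial_return": [1, 0],
})

leaked, non_numeric = check_predictors(
    toy, ["pre_IPO_price_revision", "post_IPO_initial_return"]
)
print(leaked, non_numeric)
```

For the frames defined above you would call it as `check_predictors(predictors_y1, ['pre_IPO_price_revision', 'post_IPO_initial_return'])`, and likewise for `predictors_y2`; both returned lists should be empty.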


Below is the key part of this notebook - which generates a `logistic regression` model to predict `Y1`/`Y2`.

The code works this way:

1. We generate two lists, `f1_score_lst` and `auc_lst`, to store the f1_score and AUC from each of the `10` runs of the model;
2. Define model:
    1. We define a `LogisticRegression()` model;

    2. We split the predictors (`predictors_y1`) and target (`y1`) into training (80%) and testing (20%) sets;

    3. We fit the model `clf` to the training data, then use it to predict on the testing data;

    4. We also define a `10-fold cross validation` to make sure our model does not overfit - see [here](https://scikit-learn.org/stable/modules/cross_validation.html) for more info;

    5. We append the f1_score and AUC of the current model to the lists (`f1_score_lst` and `auc_lst`) we defined earlier.

3. Print out the average f1_score and AUC for all 10 runs;
4. Print out the average accuracy from cross validation;
5. Print out the confusion matrix and classification report for the __last__ model.

__NOTE__: Step 3 provides the evaluation results we need; step 4 - 5 can be used to verify the results from step 3.

In [26]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


# loop to calculate F1 and AUC scores; averages are reported after 10 runs
for count in range(10):
    # model building
    clf = LogisticRegression()
    X1_train, X1_test, y1_train, y1_test = train_test_split(predictors_y1, y1, test_size=0.2, random_state=123)
    clf.fit(X1_train, y1_train)

    y1_pred = clf.predict(X1_test)

    # 10-fold cross validation (folds are not shuffled, so KFold takes no random_state)
    kfold = model_selection.KFold(n_splits=10)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf, X1_train, y1_train, cv=kfold, scoring=scoring)

    # calculate f1-score and AUC for this run
    clf_roc_auc = roc_auc_score(y1_test, y1_pred)
    f1_score_lst.append(precision_recall_fscore_support(y1_test, y1_pred, average='weighted')[2])
    auc_lst.append(clf_roc_auc)


print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

# confusion matrix for the last model
confusion_matrix_y1 = confusion_matrix(y1_test, y1_pred)

print('Accuracy of classifier on test set: {:.2f}'.format(clf.score(X1_test, y1_test)))

print("10-fold cross validation average accuracy of classifier: %.3f" % (results.mean()))

print('Confusion Matrix for Logistic Regression Classifier:')
print(confusion_matrix_y1)

print('Classification Report for Logistic Regression Classifier:')
print(classification_report(y1_test, y1_pred))


F1 0.6182; AUC 0.6199 
Accuracy of classifier on test set: 0.62
10-fold cross validation average accuracy of classifier: 0.575
Confusion Matrix for Logistic Regression Classifier:
[[48 21]
 [31 37]]
             precision    recall  f1-score   support

          0       0.61      0.70      0.65        69
          1       0.64      0.54      0.59        68

avg / total       0.62      0.62      0.62       137
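
The steps listed earlier can also be wrapped in a small function, which makes it easier to rerun the evaluation on different predictor sets. The sketch below is a minimal illustration on synthetic data, not the code I use for grading. Two things in it are my own assumptions: it uses a different `random_state` for each run (so the runs use different splits, whereas the cell above fixes `random_state=123`), and it sets `max_iter=1000` to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(X, y, n_runs=10, test_size=0.2):
    """Average weighted F1 and AUC of a logistic regression over n_runs splits."""
    f1_scores, aucs = [], []
    for seed in range(n_runs):
        # Assumption: one seed per run, so each run sees a different split.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        # Assumption: max_iter raised from the default to help convergence.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        f1_scores.append(f1_score(y_test, y_pred, average="weighted"))
        # AUC is computed from predicted class labels, as in the cell above.
        aucs.append(roc_auc_score(y_test, y_pred))
    return np.mean(f1_scores), np.mean(aucs)


# Synthetic stand-in data; replace with your own predictors and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
mean_f1, mean_auc = evaluate(X_demo, y_demo)
print("F1 {:.4f}; AUC {:.4f}".format(mean_f1, mean_auc))
```

With your own data the call would look like `evaluate(predictors_y1, y1)`. Because the splits differ from the cell above, the numbers will not match it exactly.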



The code below evaluates the model for `Y2`. It is very similar to the code above; the key difference is that `Y2` is imbalanced, so I wrote some code (under `# Begin oversampling`) to deal with that.

In [40]:
# lists for f1-score and AUC
f1_score_lst = []
auc_lst = []


# loop to calculate F1 and AUC scores; averages are reported after 10 runs
for count in range(10):
    # model building
    clf1 = LogisticRegression()

    # splitting data into testing and training
    X2_train, X2_test, y2_train, y2_test = train_test_split(predictors_y2, y2, test_size=0.2, random_state=123)

    # Begin oversampling
    oversample = pd.concat([X2_train, y2_train], axis=1)
    max_size = oversample['post_IPO_initial_return'].value_counts().max()
    lst = [oversample]
    for class_index, group in oversample.groupby('post_IPO_initial_return'):
        # top up each class with resampled rows until it matches the largest class
        lst.append(group.sample(max_size-len(group), replace=True))
    X2_train = pd.concat(lst)
    y2_train = X2_train['post_IPO_initial_return'].copy()
    del X2_train['post_IPO_initial_return']

    # fitting model on oversampled data
    clf1.fit(X2_train, y2_train)

    y2_pred = clf1.predict(X2_test)

    # 10-fold cross validation (folds are not shuffled, so KFold takes no random_state)
    kfold = model_selection.KFold(n_splits=10)
    scoring = 'accuracy'
    results = model_selection.cross_val_score(clf1, X2_train, y2_train, cv=kfold, scoring=scoring)

    # calculate AUC for this run
    clf1_roc_auc = roc_auc_score(y2_test, y2_pred)

    # store this run's f1-score and AUC
    f1_score_lst.append(precision_recall_fscore_support(y2_test, y2_pred, average='weighted')[2])
    auc_lst.append(clf1_roc_auc)


print('F1 {:.4f}; AUC {:.4f} '.format(np.mean(f1_score_lst),np.mean(auc_lst)))

# confusion matrix for the last model
confusion_matrix_y2 = confusion_matrix(y2_test, y2_pred)

print('Accuracy of classifier on test set: {:.3f}'.format(clf1.score(X2_test, y2_test)))

print("10-fold cross validation average accuracy of clf1: %.3f" % (results.mean()))

print('Confusion Matrix for Classifier:')
print(confusion_matrix_y2)

print('Classification Report for Classifier:')
print(classification_report(y2_test, y2_pred))


F1 0.4139; AUC 0.5549 
Accuracy of classifier on test set: 0.467
10-fold cross validation average accuracy of clf1: 0.271
Confusion Matrix for Classifier:
[[35  9]
 [64 29]]
Classification Report for Classfier:
             precision    recall  f1-score   support

          0       0.35      0.80      0.49        44
          1       0.76      0.31      0.44        93

avg / total       0.63      0.47      0.46       137
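
The oversampling step in the cell above can be pulled out into a small helper so that it is easier to test on its own. The sketch below follows the same idea (resample each smaller class with replacement until it matches the largest class), but the function name `oversample_minority` and the optional `random_state` argument are my own additions for illustration. Note that only the training data is resampled; the test data is left untouched.

```python
import pandas as pd


def oversample_minority(X_train, y_train, random_state=None):
    """Resample smaller classes with replacement until all classes are equal in size.

    y_train must be a named Series; its name is used as the target column.
    """
    target = y_train.name
    combined = pd.concat([X_train, y_train], axis=1)
    max_size = combined[target].value_counts().max()
    parts = [combined]
    for _, group in combined.groupby(target):
        extra = max_size - len(group)
        if extra > 0:
            # Draw the missing rows from this class, with replacement.
            parts.append(
                group.sample(extra, replace=True, random_state=random_state)
            )
    balanced = pd.concat(parts)
    return balanced.drop(columns=target), balanced[target]


# Made-up training data: four rows of class 1 and one row of class 0.
toy_X = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4, 0.5]})
toy_y = pd.Series([1, 1, 1, 1, 0], name="post_IPO_initial_return")

X_bal, y_bal = oversample_minority(toy_X, toy_y, random_state=0)
print(y_bal.value_counts())
```

In the loop above, the equivalent call would be `X2_train, y2_train = oversample_minority(X2_train, y2_train)` right after the train/test split.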

