<div class="clearfix" style="padding: 2px; padding-left: 0px">
<img src="http://alpinedata.com/wp-content/themes/alpine/library/images/logo.png" width="250px" style="display: inline-block; margin-top: 2px;">
</div>

# Institutional Holdings Analysis - Model Tuning 

This notebook can be used for quick discovery and model tuning (finding best tuning parameters with cross-validation).

### Libraries
- Seaborn/matplotlib for visualizations and heatmaps.
- Scikit-learn for cross-validation and grid search tuning of Elastic Net Logistic Regression.

### Instructions

1) To run Jupyter notebooks within Chorus, you need to set up a dedicated server and make all the needed configurations. See [our installation instructions](https://alpine.atlassian.net/wiki/display/V6/How+to+Install+Jupyter+Notebook).<br>

2) <i>(Once 1 is completed)</i> DO NOT modify/run this script in the current workspace. You should copy it to your own workspace (using the Copy button after closing the notebook).


In [None]:
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# pandas.tools.plotting was moved to pandas.plotting in pandas 0.20
from pandas.plotting import scatter_matrix
from mpl_toolkits.mplot3d import Axes3D

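from sklearn import metrics, linear_model
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# alias model_selection so the cells below can keep using the cross_validation name
from sklearn import model_selection as cross_validation
from sklearn.model_selection import GridSearchCV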

import seaborn as sns
#from mpltools import style
if sys.version_info.major < 3:
    import compiler
%pylab inline
plt.style.use('ggplot')
#input_buy_in_or_sell_outs

# Data import and discovery 

In [None]:
cc.datasource_name = 'Postgres_Demo'
df_input_buyin_or_sellouts = cc.read_table(table_name='input_buyin_or_sellouts', 
                                           schema_name='public', database_name='demo')

In [None]:
data = df_input_buyin_or_sellouts

In [None]:
data.head()

In [None]:
summary = data.describe() # 10k rows
summary.transpose().head()

In [None]:
data.buyin_or_sellou.value_counts().plot(kind = 'bar', color = '#728FCE', title = 
                                          "Distribution of the classes in the training set")
# Classes are pretty balanced 

In [None]:
print(data.dtypes[0:10])

## Filtering variables and null value checks  

In [None]:
# Filtering irrelevant columns for classification based on our knowledge of the data

input_filtered = data.drop(['ticker_QuandlInstitutionalShareholderMetrics',
                            'comp_name_QuandlInstitutionalShareholderMetrics',
                            'exchange_QuandlInstitutionalShareholderMetrics',
                            'pct_total_portfolio_q0_q1_change','q0_q1_change','q0_q1_pct_change',
                            'inst_nbr','inst_name','mgr_nbr','inst_loc','inst_addr',
                            'inst_city_state','inst_post_code','inst_phone_nbr','last_change_date',
                            'fund_ticker','inst_url','fund_mgmt_city','fund_mgmt_state','fund_mgmt_name',
                            'shares_held_q0','fqe_date_q0','fqe_date_q1','fqe_date_q2','fqe_date_q3',
                            'last_close_date','last_sec_filing_date','last_sec_filing_type',
                            'shares_held_pct','shares_change',
                            'shares_change_pct','inst_cik'], inplace = False, axis = 1)

In [None]:
print("Total null values: {0}".format(input_filtered.isnull().values.sum()))
print("Columns containing null values are: ")

# dtypes and isnull().any() share the same column index, so a plain concat aligns them
# (the join_axes argument was removed in pandas 1.0)
null_values = pd.concat([input_filtered.dtypes, input_filtered.isnull().any()], axis=1)
NA_DF = null_values[null_values[1]]
NA_DF

In [None]:
print("Null values in last_sec_filing_shares {0}".format(input_filtered[['last_sec_filing_shares']]
                                                         .isnull().values.sum()))

Removing rows with null values:

In [None]:
input_filtered = input_filtered[pd.notnull(input_filtered['last_sec_filing_shares'])]

In [None]:
print("Remaining null values? {0}".format(input_filtered.isnull().any().any()))

## Converting categorical column to binary features 

In [None]:
# Check the distribution of column dtypes among the selected features
input_filtered.dtypes.value_counts().plot(kind = 'barh', color = '#728FCE', title = 'Distribution of feature types')

Object columns need to be converted to binary features to train the Elastic Net Logistic Regression with scikit-learn.
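
As a minimal sketch of what this conversion does, the toy frame below (hypothetical values, not taken from the dataset) is passed through `pd.get_dummies`: each object column becomes one indicator column per category, and numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one object (categorical) column
toy = pd.DataFrame({
    "shares": [100, 250, 75],
    "size_": ["large", "small", "large"],
})

# Object columns become one indicator column per category; numeric columns are kept as-is
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
```

The indicator columns are named `<column>_<category>`, so a source column that already ends in an underscore, like `size_`, yields names with a double underscore.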

In [None]:
#Check categorical columns in our features

types_df = pd.DataFrame({'dtypes': input_filtered.dtypes, 'names': input_filtered.columns})
types_df[types_df['dtypes'] == 'object']

In [None]:
# Look at categorical columns to make sure they do not represent IDs

test = input_filtered[['size_','code_13f_flag','fund_flag','mgr_flag',
                       'etf_flag','invst_style']]
test.head()

In [None]:
# Converting categorical var to dummy variables

final_input = pd.get_dummies(input_filtered)
final_input.head()

## Feature correlations

In [None]:
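# Just taking the first 21 variables (columns) as an example to plot correlations
# iloc[:, :21] selects columns; a plain [:21] slice would select the first 21 rows instead
cor = final_input.iloc[:, :21].corr()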

s = cor.unstack()
so = s.sort_values(kind="quicksort")
so.head()

In [None]:
#Plotting the correlation heatmap

f, ax = plt.subplots(figsize=(13, 11))
ax.set_title("Correlations between First 20 features and dependent variable")
sns.heatmap(cor, vmax=.3,
            square=True)

Some independent features seem to be strongly correlated, which might impact the performance of the classification model.
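
To see which pairs drive this, the correlation matrix can be filtered for pairs above a cutoff. The sketch below is illustrative only: the helper name `top_correlated_pairs`, the 0.8 threshold, and the toy data are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once and self-pairs are dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never satisfy the comparison, so they are filtered out here
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Hypothetical toy data: b is an exact multiple of a, c is only weakly related to both
toy_corr = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                         "b": [2, 4, 6, 8, 10],
                         "c": [5, 1, 4, 2, 3]})
strong_pairs = top_correlated_pairs(toy_corr)
print(strong_pairs)
```

On the notebook's data, the same helper could be called on `final_input` or on a column subset of it.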

# Model Tuning 

## Data preparation

In [None]:
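# Create arrays X_features and Y_status to train the model

columns_all = list(final_input.columns)
result = 'buyin_or_sellou'
# Compare with != : ('buyin_or_sellou') is a plain string, not a tuple,
# so `col not in (...)` would perform a substring test against the target name
columns = [col for col in columns_all if col != result]
# convert to np array
X_features = final_input[columns].values
Y_status = final_input[result].values

# both shapes must be wrapped in one tuple for the % formatting
print("X_shape: %s, Y_shape: %s" % (X_features.shape, Y_status.shape)) # check #rows, columns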

## Grid search on alphas and l1_ratios 

In [None]:
# Training the Elastic Net algorithm with different tuning parameters in a grid search
# that returns the best tuning parameters.
# Grid search configured here with 3-fold cross-validation for each pair of parameters.
# Note: linear_model.ElasticNet is a regressor; roc_auc ranks its continuous
# predictions against the binary target.

lambdas = list(np.logspace(-5,2,5)) # we test 5 values from 1E-5 to 1E2
l1_ratios = [0, 0.1, 0.3, 0.5, 0.7, 1]

tuned_parameters = {'alpha': lambdas, 'l1_ratio': l1_ratios}

# normalize=False was the default and the argument was removed in scikit-learn 1.2
grid = GridSearchCV(linear_model.ElasticNet(max_iter = 100000),
                    tuned_parameters, scoring='roc_auc', cv=3, n_jobs=1, verbose=2)
grid.fit(X_features, Y_status)
print(grid.best_params_)
print(grid.best_score_)

In [None]:
# Computing AUC manually for different tuning parameters, then drawing a 3D chart
# to better visualize the results

# taking AUC as evaluation metric - we could also pick accuracy

lambdas1 = list(np.logspace(-5,2,5)) # 5 values from 1E-5 to 1E2
l1ratios1 = [1e-5, 1e-2, 0.1, 0.4, 0.7, 1]
alphas = lambdas1
l1_ratios = l1ratios1
AUC = [[] for i in range(0,len(alphas))]

for a in alphas:
    for l1_rat in l1_ratios:
        # Create elastic net instance:
        lor = linear_model.ElasticNet(alpha=a, l1_ratio = l1_rat, max_iter = 10000)
        # this returns the k AUCs of the k-fold cross-validation
        # (cv=3 is set explicitly because the default number of folds differs across versions)
        CVauc = cross_validation.cross_val_score(lor, X_features, Y_status, cv=3,
                                                 scoring='roc_auc', n_jobs = 1, verbose =2 )
        # taking the average of auc (for 3-fold cv)
        CV_AUC = np.mean(CVauc)
        AUC[alphas.index(a)].append(CV_AUC)

max_AUCs = [] # max AUC for each alpha (i.e. the AUC at that alpha's best l1_ratio)

for i in range(0,len(alphas)):
    max_AUCs.append(max(AUC[i]))
    
    
for a in alphas:
    print("AUCs for alpha = {alpha}: {AUC_vals}".format(alpha = a, AUC_vals = AUC[alphas.index(a)]))
        

In [None]:
best_alpha = alphas[max_AUCs.index(max(max_AUCs))]
best_l1_ratio = l1_ratios[AUC[alphas.index(best_alpha)].index(max(AUC[alphas.index(best_alpha)]))]
best_AUC = max_AUCs[alphas.index(best_alpha)]

print('Best alpha: {0}'.format(best_alpha))
print('Best l1_ratio: {0}'.format(best_l1_ratio))
print("\nBest AUC: {0}".format(best_AUC))
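
The nested `index`/`max` lookups above can also be written as a single argmax over a 2-D array. A minimal sketch, assuming the AUC values are stored with one row per alpha and one column per l1_ratio, as in the `AUC` list. The grid and the numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical grid and AUC values (rows: alphas, columns: l1_ratios)
alphas_demo = [1e-5, 1e-2, 1e2]
l1_ratios_demo = [0.1, 0.5, 1.0]
auc_demo = np.array([[0.61, 0.63, 0.62],
                     [0.66, 0.71, 0.69],
                     [0.50, 0.50, 0.50]])

# argmax gives the flat index of the maximum; unravel_index converts it to (row, column)
row, col = np.unravel_index(np.argmax(auc_demo), auc_demo.shape)
best = (alphas_demo[row], l1_ratios_demo[col], auc_demo[row, col])
print("Best alpha: {0}, best l1_ratio: {1}, best AUC: {2}".format(*best))
```

With the notebook's variables, `np.array(AUC)` would play the role of `auc_demo`.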

## Plot: Tuning parameters alpha and l1_ratio vs AUC

In [None]:
colors = plt.cm.rainbow(np.linspace(0, 1, len(alphas)))

# figsize must be passed as a keyword argument
fig = plt.figure(figsize=(15, 12))
ax = fig.add_subplot(111, projection='3d')

for alpha, c in zip(alphas, colors):
    # one point per l1_ratio at this alpha; list repetition avoids the Python 2-only xrange
    ax.scatter(l1_ratios, [alpha] * len(l1_ratios), AUC[alphas.index(alpha)],
               color = c, marker = "o", s=60)

ax.set_xlabel('l1_ratio')
ax.set_ylabel('alpha')
ax.set_zlabel('AUC')
plt.title("AUC vs shrinkage parameters (l1_ratio,alpha)", fontsize=20)
plt.show()

<b>Results:</b> We should set alpha = 1E2 and l1_ratio = 1 in our Alpine workflow to train the Elastic Net Logistic Regression.
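
For reference, scikit-learn can also fit an elastic net penalized logistic regression directly, rather than scoring `ElasticNet` regression outputs with AUC as the cells above do. The following is a minimal sketch on synthetic data, assuming scikit-learn 0.21 or later (parameter handling may differ in other versions). Note that `LogisticRegression` takes `C`, the inverse of the regularization strength, so its values are not interchangeable with the `alpha` values above. The data, the grid values, and the pipeline are illustrative assumptions, not the Alpine operator's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for X_features / Y_status
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=4, random_state=0)

# The saga solver supports the elasticnet penalty; scaling the features helps it converge
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, max_iter=5000),
)

# make_pipeline names each step after its lowercased class name
param_grid = {
    "logisticregression__C": [0.01, 1.0, 100.0],
    "logisticregression__l1_ratio": [0.0, 0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```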