# Build a simple model

When approaching a machine learning problem, it's a good idea to build a simple model first. This gives us an idea of how challenging the problem is.

We'll start with a model that just uses the numeric data columns.

We'll use **multi-class logistic regression**, treating each label column independently. We'll train a classifier on each label separately and then use those models to predict whether each label appears (or not) in any given row.

We can then compute the **log loss** on these predictions.
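
For reference, the log loss of a single binary label column over $N$ rows is usually written as below, where $y_i \in \{0, 1\}$ is the true indicator and $p_i$ is the predicted probability for row $i$. This is the generic binary form; how the competition aggregates it across all label columns is not shown in this notebook, so the symbols here are illustrative.

```latex
\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
```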

**NOTE**:

Because of the nature of the dataset, we can't use sklearn's `train_test_split` function to divide our data into training and test sets. Some labels only appear in a small fraction of the dataset. If we split the dataset randomly, we can end up with labels in the test set that DO NOT appear in the training set.

One solution is to use sklearn's **StratifiedShuffleSplit**. However, it only works with a single target variable. Since we have many, we'll use the utility function `multilabel_train_test_split()`, which ensures that all of the classes are represented in both the training and test data.
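
The source of `multilabel_train_test_split()` is not shown here, so the following is only a rough sketch of how such a split could work, not the utility's actual implementation. The function name `sketch_multilabel_split`, its `test_frac` parameter and the toy data are all hypothetical. The idea is to reserve one positive row per label column for each side first, then assign the remaining rows at random.

```python
import numpy as np
import pandas as pd


def sketch_multilabel_split(X, Y, test_frac=0.2, seed=None):
    """Hypothetical split keeping a positive row for every label on both sides.

    Assumes X and Y share a unique index and that every label column has
    enough positive rows to place one in train and one in test.
    """
    rng = np.random.RandomState(seed)
    train_idx, test_idx = set(), set()

    for col in Y.columns:
        positives = rng.permutation(Y.index[Y[col] > 0].to_numpy())
        for own, other in ((test_idx, train_idx), (train_idx, test_idx)):
            if any(i in own for i in positives):
                continue  # this side already holds a positive row for the label
            candidates = [i for i in positives if i not in other]
            if not candidates:
                raise ValueError(f"not enough positive rows for label {col!r}")
            own.add(candidates[0])

    # Assign the remaining rows at random until the test side reaches its size.
    remaining = [i for i in rng.permutation(Y.index.to_numpy())
                 if i not in train_idx and i not in test_idx]
    n_more_test = max(int(round(test_frac * len(Y))) - len(test_idx), 0)
    test_idx.update(remaining[:n_more_test])
    train_idx.update(remaining[n_more_test:])

    train_rows, test_rows = sorted(train_idx), sorted(test_idx)
    return X.loc[train_rows], X.loc[test_rows], Y.loc[train_rows], Y.loc[test_rows]


# Toy data: the "rare" label has only two positive rows out of fifty.
Y = pd.DataFrame({"common": [1] * 25 + [0] * 25, "rare": [1, 1] + [0] * 48})
X = pd.DataFrame({"x": np.arange(50)})
X_tr, X_te, Y_tr, Y_te = sketch_multilabel_split(X, Y, test_frac=0.2, seed=0)
print(Y_tr.sum().to_dict(), Y_te.sum().to_dict())
```

With a purely random split, both positive rows of the rare label could easily land on the same side. The reservation step rules that out.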

In [4]:
# silence all warnings (including sklearn's deprecation warnings)
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# import our custom train_test split function
from multilabel import multilabel_train_test_split

# set seed for reproducibility
np.random.seed(0)

df = pd.read_csv('../data/TrainingData.csv',index_col=0)

NUMERIC_COLUMNS = ['FTE', 'Total']
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']

First we'll subset the data to just the numeric columns, filling any `NaN` values with `-1000` (this allows our algorithm to respond to `NaN` differently from `0`). This creates a new dataframe, `numeric_data_only`.

In [6]:
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)
numeric_data_only.sample(5)

Unnamed: 0,FTE,Total
335532,-1000.0,181.97
85737,-1000.0,3771.45
74664,-1000.0,225.95
215396,0.92,22961.12333
351352,-1000.0,-260.55
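
As a small illustration of why a sentinel is used instead of `0`, the toy frame below (made-up values, not rows from the dataset) contains both a real zero and a missing value in `FTE`. Filling with `0` merges the two cases, while filling with `-1000` keeps them apart.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({"FTE": [0.0, np.nan, 0.92], "Total": [181.97, np.nan, 0.0]})

as_zero = toy.fillna(0)          # missing rows become indistinguishable from real zeros
as_sentinel = toy.fillna(-1000)  # missing rows keep a distinct value

print((as_zero["FTE"] == 0).sum())          # real zero and missing row counted together
print((as_sentinel["FTE"] == -1000).sum())  # only the missing row
```

A sentinel only works if no genuine value equals it. `Total` does contain negative amounts (for example `-260.55` in the sample above), so the sentinel should lie well outside the observed range.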


Convert the columns in the `LABELS` list into dummy variables using the pandas `get_dummies()` function, which creates a binary indicator column for each of our target values.

In [7]:
label_dummies = pd.get_dummies(df[LABELS])
label_dummies.sample(5)

Unnamed: 0,Function_Aides Compensation,Function_Career & Academic Counseling,Function_Communications,Function_Curriculum Development,Function_Data Processing & Information Services,Function_Development & Fundraising,Function_Enrichment,Function_Extended Time & Tutoring,Function_Facilities & Maintenance,Function_Facilities Planning,...,Object_Type_Rent/Utilities,Object_Type_Substitute Compensation,Object_Type_Supplies/Materials,Object_Type_Travel & Conferences,Pre_K_NO_LABEL,Pre_K_Non PreK,Pre_K_PreK,Operating_Status_Non-Operating,"Operating_Status_Operating, Not PreK-12",Operating_Status_PreK-12 Operating
443816,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
99119,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
205188,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
307454,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
111102,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
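
To see the naming scheme on a smaller scale, here is a toy example with two label columns; the rows are invented, though the category values are ones that appear in the output above. Each source column expands into one indicator column per distinct value, named `<column>_<value>`. Passing `dtype=int` is an assumption made here to get `0`/`1` output regardless of pandas version, since newer releases default to booleans.

```python
import pandas as pd

# Invented rows using category values seen in the notebook output.
toy = pd.DataFrame({
    "Pre_K": ["PreK", "Non PreK", "NO_LABEL"],
    "Operating_Status": ["PreK-12 Operating", "Non-Operating", "PreK-12 Operating"],
})

dummies = pd.get_dummies(toy, dtype=int)
print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())  # one active indicator per source column in every row
```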


We'll use our custom `multilabel_train_test_split()` function to split our data into training and test sets.

In [8]:
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    numeric_data_only,
    label_dummies,
    size=0.2,
    seed=123
)

In [10]:
# Print the info
print("X_train info:")
print(X_train.info())
print("\nX_test info:")  
print(X_test.info())
print("\ny_train info:")  
print(y_train.info())
print("\ny_test info:")  
print(y_test.info()) 

X_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Data columns (total 2 columns):
FTE      320222 non-null float64
Total    320222 non-null float64
dtypes: float64(2)
memory usage: 7.3 MB
None

X_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80055 entries, 206341 to 72072
Data columns (total 2 columns):
FTE      80055 non-null float64
Total    80055 non-null float64
dtypes: float64(2)
memory usage: 1.8 MB
None

y_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 320222 entries, 134338 to 415831
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 34.2 MB
None

y_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80055 entries, 206341 to 72072
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 8.6 MB
None


Sklearn's `OneVsRestClassifier` lets us treat each label (y) column independently, fitting a separate classifier for each of the columns.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(LogisticRegression())

# train our model
clf.fit(X_train, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)
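
To make "a separate classifier for each column" concrete, the sketch below fits one `LogisticRegression` per column of a small synthetic indicator matrix by hand and compares the result with `OneVsRestClassifier`. The data are random toy values, and the comparison assumes the default solver behaves deterministically on identical inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
# Two noisy binary label columns, each loosely tied to one feature.
Y = np.column_stack([
    (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int),
    (X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(int),
])

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Manual equivalent: one independent binary model per label column.
manual = np.column_stack([
    LogisticRegression().fit(X, Y[:, j]).predict_proba(X)[:, 1]
    for j in range(Y.shape[1])
])

print(np.allclose(ovr.predict_proba(X), manual))
```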

In [11]:
# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))

Accuracy: 0.0
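
An accuracy of exactly 0.0 looks alarming, but for multilabel targets scikit-learn's `score` reports subset accuracy: a row counts as correct only if every one of its label columns is predicted exactly. With 104 columns and a model built on two numeric features, it is plausible that no row is completely right. The toy arrays below (not model output) show how subset accuracy can be zero while most individual labels are correct.

```python
import numpy as np

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],   # one of three labels wrong
                   [0, 1, 1],   # one of three labels wrong
                   [1, 0, 0]])  # one of three labels wrong

# A row is counted only when all of its labels match.
subset_accuracy = np.all(y_true == y_pred, axis=1).mean()
# Fraction of individual label cells that match.
per_label_accuracy = (y_true == y_pred).mean()

print(subset_accuracy, per_label_accuracy)
```

This is one reason accuracy is a poor yardstick here and why log loss is used for evaluation.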


### Making predictions

When making predictions we could use the test set we've generated, but in this case we'll use the test set provided by the competition site, `TestData.csv`.

Load `TestData.csv` and perform the same simple preprocessing used earlier.

In [18]:
holdout = pd.read_csv('../data/TestData.csv', index_col=0)
holdout = holdout[NUMERIC_COLUMNS].fillna(-1000)
print(type(holdout), holdout.shape)
holdout.head()

<class 'pandas.core.frame.DataFrame'> (50064, 2)


Unnamed: 0,FTE,Total
180042,-1000.0,3999.91
28872,-1000.0,3447.320213
186915,1.0,52738.780869
412396,1.0,69729.263191
427740,1.0,29492.834215


We'll use sklearn's `.predict_proba()` method to calculate prediction probabilities instead of making hard predictions with `.predict()`. `.predict()` would return `1` or `0` depending on whether the label is predicted to appear or not. Because **log loss** penalizes you for being **confident and wrong**, the performance would be significantly worse using `.predict()` as opposed to `.predict_proba()`.

Also, the original goal is to predict the probability of each label.
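
The effect of being confident and wrong can be seen with a small hand-rolled binary log loss. The helper `binary_log_loss`, its clipping constant `eps` and the toy predictions are assumptions made for this illustration; they are not the competition's scoring code.

```python
import numpy as np


def binary_log_loss(y_true, p, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


y_true = [1, 0, 1, 0]
soft = [0.7, 0.3, 0.4, 0.2]  # probabilities; the third is on the wrong side of 0.5
hard = [1, 0, 0, 0]          # the same predictions thresholded to 0/1

soft_loss = binary_log_loss(y_true, soft)
hard_loss = binary_log_loss(y_true, hard)
print(soft_loss, hard_loss)
```

The single wrong hard prediction dominates the average, because its clipped probability of `eps` contributes a very large penalty, while the hedged `0.4` costs far less.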

In [14]:
predictions = clf.predict_proba(holdout)
predictions

array([[3.58422797e-02, 6.46624377e-03, 8.29891300e-04, ...,
        1.69612500e-01, 1.99296715e-02, 8.10543000e-01],
       [3.58482728e-02, 6.46610320e-03, 8.29902557e-04, ...,
        1.69607057e-01, 1.99300220e-02, 8.10552551e-01],
       [1.20946821e-01, 9.06528221e-03, 1.53268023e-03, ...,
        9.59263311e-02, 5.10388015e-02, 9.28396081e-01],
       ...,
       [1.22222570e-01, 9.05175340e-03, 1.53411191e-03, ...,
        9.56957120e-02, 5.10986918e-02, 9.28680377e-01],
       [1.22275131e-01, 9.04893421e-03, 1.53377914e-03, ...,
        9.57019699e-02, 5.10808744e-02, 9.28670860e-01],
       [1.22159718e-01, 9.05015147e-03, 1.53365017e-03, ...,
        9.57227211e-02, 5.10754795e-02, 9.28645295e-01]])

### Submitting results

It is standard practice to submit predictions in CSV format, with each of the labels as a column heading and the probabilities for each as the values.

The probabilities returned by `.predict_proba()` are an array of values, without column headings or an index. We can generate those:

In [22]:
columns = pd.get_dummies(df[LABELS], prefix_sep='__').columns
columns

Index(['Function__Aides Compensation',
       'Function__Career & Academic Counseling', 'Function__Communications',
       'Function__Curriculum Development',
       'Function__Data Processing & Information Services',
       'Function__Development & Fundraising', 'Function__Enrichment',
       'Function__Extended Time & Tutoring',
       'Function__Facilities & Maintenance', 'Function__Facilities Planning',
       ...
       'Object_Type__Rent/Utilities', 'Object_Type__Substitute Compensation',
       'Object_Type__Supplies/Materials', 'Object_Type__Travel & Conferences',
       'Pre_K__NO_LABEL', 'Pre_K__Non PreK', 'Pre_K__PreK',
       'Operating_Status__Non-Operating',
       'Operating_Status__Operating, Not PreK-12',
       'Operating_Status__PreK-12 Operating'],
      dtype='object', length=104)

In [24]:
index = holdout.index
index

Int64Index([180042,  28872, 186915, 412396, 427740,  69847, 358824, 254148,
               296, 416755,
            ...
            356796, 130696, 287341, 345215, 113795, 169063, 433255, 232204,
            171685, 249087],
           dtype='int64', length=50064)

We can format our results and generate the CSV using the pandas `.to_csv()` method.

In [30]:
prediction_df = pd.DataFrame(columns=columns, index=index, data=predictions)
print(prediction_df.shape)
prediction_df.head()

(50064, 104)


Unnamed: 0,Function__Aides Compensation,Function__Career & Academic Counseling,Function__Communications,Function__Curriculum Development,Function__Data Processing & Information Services,Function__Development & Fundraising,Function__Enrichment,Function__Extended Time & Tutoring,Function__Facilities & Maintenance,Function__Facilities Planning,...,Object_Type__Rent/Utilities,Object_Type__Substitute Compensation,Object_Type__Supplies/Materials,Object_Type__Travel & Conferences,Pre_K__NO_LABEL,Pre_K__Non PreK,Pre_K__PreK,Operating_Status__Non-Operating,"Operating_Status__Operating, Not PreK-12",Operating_Status__PreK-12 Operating
180042,0.035842,0.006466,0.00083,0.023918,0.008916,0.000173,0.032077,0.024406,0.052099,4.8e-05,...,0.010729,0.036846,0.116126,0.01736,0.831241,0.141031,0.027749,0.169612,0.01993,0.810543
28872,0.035848,0.006466,0.00083,0.023919,0.008916,0.000173,0.032078,0.024406,0.052102,4.8e-05,...,0.010728,0.036959,0.116164,0.017361,0.831233,0.141041,0.027751,0.169607,0.01993,0.810553
186915,0.120947,0.009065,0.001533,0.028599,0.016042,0.01815,0.043858,0.031715,0.113907,0.017293,...,0.005622,0.136221,0.135391,0.016041,0.501655,0.472173,0.098601,0.095926,0.051039,0.928396
412396,0.120381,0.009071,0.001532,0.028573,0.016044,0.01812,0.043808,0.031688,0.113723,0.017261,...,0.00563,0.125189,0.134056,0.016029,0.502143,0.471525,0.098399,0.096029,0.051012,0.928269
427740,0.121725,0.009057,0.001534,0.028634,0.016038,0.01819,0.043926,0.031752,0.114158,0.017338,...,0.005612,0.152629,0.137236,0.016059,0.500987,0.473061,0.098879,0.095785,0.051075,0.92857
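
Because each column comes from its own independent classifier, the probabilities within one category (for example the three `Pre_K__*` columns) are not forced to sum to 1. The sketch below checks this using two rows copied from the `Pre_K` and `Operating_Status` columns of the output above; grouping by the text before `__` is a convenience chosen for this example.

```python
import pandas as pd

# Values copied from rows 180042 and 186915 of the displayed predictions.
toy = pd.DataFrame(
    {
        "Pre_K__NO_LABEL": [0.831241, 0.501655],
        "Pre_K__Non PreK": [0.141031, 0.472173],
        "Pre_K__PreK": [0.027749, 0.098601],
        "Operating_Status__Non-Operating": [0.169612, 0.095926],
        "Operating_Status__Operating, Not PreK-12": [0.019930, 0.051039],
        "Operating_Status__PreK-12 Operating": [0.810543, 0.928396],
    },
    index=[180042, 186915],
)

# Collect the columns belonging to each category prefix.
groups = {}
for col in toy.columns:
    groups.setdefault(col.split("__")[0], []).append(col)

category_sums = pd.DataFrame(
    {name: toy[cols].sum(axis=1) for name, cols in groups.items()}
)
print(category_sums.round(3))
```

Row 186915 sums to roughly 1.07 in both categories. Whether that matters depends on how the scoring handles unnormalized rows, which this notebook does not cover.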


In [29]:
prediction_df.to_csv('../predictions/predictions.csv')
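
Before uploading, it can be worth confirming that the file reads back as expected. The sketch below round-trips a toy frame through an in-memory buffer rather than a file on disk; the column names and values are placeholders. `.to_csv()` writes the index as an unnamed first column, so `index_col=0` is needed to restore it.

```python
import io

import pandas as pd

# Placeholder predictions for two row ids.
toy = pd.DataFrame(
    {"Pre_K__NO_LABEL": [0.75, 0.5], "Pre_K__PreK": [0.25, 0.5]},
    index=[180042, 28872],
)

buffer = io.StringIO()
toy.to_csv(buffer)  # the index becomes an unnamed first column
header = buffer.getvalue().splitlines()[0]
print(header)

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.shape)
```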

The DrivenData benchmark model, which merely submits uniform probabilities for each class, achieves a **log loss** of 2.0455.

The exercise model, trained with numeric data only, yields a **log loss** of 1.9067227623381413.