In this notebook I take what is probably the most naive approach to the Rainfall prediction Kaggle competition: no feature engineering and no exploration of time series features. There are only ~2000 observations in the training data, so heavy feature engineering risks overfitting. With hyperparameter tuning alone, the following code achieved a private score of 0.89879, up from a public score of 0.85867, placing me 451st out of 4382 (an improvement of more than 1000 positions compared to the public leaderboard). I take this as evidence that simple models resist overfitting well, particularly when there is little data to train on. To improve this model further, I think stacking would be a good option, since it lets one combine the different strengths of different models.

In [7]:
import pandas as pd
import warnings
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

train = pd.read_csv("train.csv")

test = pd.read_csv("test.csv")

# Drop the target ("rainfall") along with id and day. Ignoring day is very naive, since it is the
# feature with which I could index a time series. Feature engineering, most notably lag and seasonal
# features, could predict rain better, but this notebook deliberately does the simplest thing possible.
features = [c for c in train.columns if c not in ["rainfall", "id", "day"]]
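The comment above mentions lag and seasonal features. A minimal sketch of what those might look like, using a few values from the first five training rows. The `add_time_features` helper, the choice of columns and lags, and the 365-day period are all assumptions for illustration and are not used by the models in this notebook.

```python
import numpy as np
import pandas as pd

def add_time_features(df, cols=("humidity", "cloud"), lags=(1, 2), period=365):
    """Hypothetical helper: add lagged copies of `cols` and a cyclical encoding of `day`."""
    out = df.sort_values("id").copy()  # assumption: rows ordered by id are consecutive days
    for col in cols:
        for lag in lags:
            # Value of `col` from `lag` rows earlier; the first `lag` rows become NaN.
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    # Encode day on a circle so that the last day of the year sits next to the first.
    angle = 2 * np.pi * out["day"] / period
    out["day_sin"] = np.sin(angle)
    out["day_cos"] = np.cos(angle)
    return out

toy = pd.DataFrame({
    "id": range(5),
    "day": [1, 2, 3, 4, 5],
    "humidity": [87.0, 95.0, 75.0, 95.0, 52.0],
    "cloud": [88.0, 91.0, 47.0, 95.0, 45.0],
})
toy_features = add_time_features(toy)
print(toy_features[["humidity", "humidity_lag1", "day_sin"]])
```

The NaN values that the lags introduce at the start of the series would need imputing or dropping before modelling.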



In [6]:
train.head()

   id  day  pressure  maxtemp  temparature  mintemp  dewpoint  humidity  cloud  sunshine  winddirection  windspeed  rainfall
0   0    1    1017.4     21.2         20.6     19.9      19.4      87.0   88.0       1.1           60.0       17.2         1
1   1    2    1019.5     16.2         16.9     15.8      15.4      95.0   91.0       0.0           50.0       21.9         1
2   2    3    1024.1     19.4         16.1     14.6       9.3      75.0   47.0       8.3           70.0       18.1         1
3   3    4    1013.4     18.1         17.8     16.9      16.8      95.0   95.0       0.0           60.0       35.6         1
4   4    5    1021.8     21.3         18.4     15.2       9.6      52.0   45.0       3.6           40.0       24.8         0


In [66]:
num_pipeline = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy = 'mean')),
        ("scaler", StandardScaler())
    ]
)
# There is no missing data in the training set, but there is in the test set, so I need to include an imputer in my pipeline. 
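To illustrate the comment above: the imputer learns the column means when the pipeline is fitted on the training data and reuses those means to fill gaps at prediction time. A small sketch with made-up numbers (the arrays and the `demo_pipeline` name are assumptions, not competition data):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data with no missing values (two features).
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# Made-up test row with one missing value.
X_new = np.array([[2.0, np.nan]])

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
demo_pipeline.fit(X_fit)

# Column means learned from the training data.
learned_means = demo_pipeline.named_steps["imputer"].statistics_
# The NaN is filled with the training mean of its column before scaling.
X_new_transformed = demo_pipeline.transform(X_new)
print(learned_means)
print(X_new_transformed)
```

Because the missing value is replaced by the training mean, it lands at zero after standardisation, which is a neutral value for a linear model.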

In [67]:
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(train[features], train["rainfall"], train_size = 0.8, random_state = 42)


parameters = {
    'model__C': np.logspace(-4,0,100),    
    'model__solver' : ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
}

# This is almost the most naive model one could make. Search over the space of solvers and regularisation parameters to find a good 
# combination and then train the model on all of the data once this has been found. Use this model to make predictions.


model = LogisticRegression(random_state=42)

my_pipeline = Pipeline(steps=[('preprocessor', num_pipeline),
                              ('model', model)
                             ])

gs = RandomizedSearchCV(my_pipeline, param_distributions = parameters, n_iter = 100, scoring = 'roc_auc', cv=5)
gs.fit(X_train, y_train)


best_model = gs.best_estimator_
print(gs.best_params_)
y_pred_proba = best_model.predict_proba(X_train)[:, 1] 
# Take the second column since this is the probability of rain. The first column is 
# just the complement of this. 
roc_score = roc_auc_score(y_train, y_pred_proba)
print(f"ROC-AUC Score: {roc_score}")
    

{'model__solver': 'lbfgs', 'model__C': 0.015199110829529346}
ROC-AUC Score: 0.9022541228622775
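Taking column 1 of `predict_proba` works because scikit-learn orders the probability columns to match the fitted model's `classes_` attribute. A small check on made-up data (the toy arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data: label 1 (rain) for the larger values.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)

# Look up which column belongs to class 1 rather than hard-coding the index.
rain_column = list(clf.classes_).index(1)
p_rain = proba[:, rain_column]

print(clf.classes_)       # order of the probability columns
print(proba.sum(axis=1))  # each row sums to 1, so the other column is the complement
```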


In [68]:
# Score the tuned model on the held-out 20% of the training data (the dev set).
y_probs_test = best_model.predict_proba(X_test)[:, 1]
roc_score = roc_auc_score(y_test, y_probs_test)

print(f"ROC-AUC Score: {roc_score}")

ROC-AUC Score: 0.8699718131766814


The ROC-AUC scores on the train set (0.902) and the dev set (0.870) are relatively close, which is good. This is perhaps not surprising, since the model is extremely simple and so has less propensity to overfit than more complex ones.
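The training score is measured on the same rows the model was fitted to, so the mean cross-validated score stored by the search (`best_score_`) is arguably a fairer number to set against the dev score. A minimal sketch on synthetic data. The dataset, the smaller grid, and the names `search`, `cv_auc`, and `dev_auc` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, train_size=0.8, random_state=42)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("model", LogisticRegression(max_iter=1000))])
search = RandomizedSearchCV(
    pipe,
    param_distributions={"model__C": np.logspace(-4, 0, 20)},
    n_iter=10, scoring="roc_auc", cv=5, random_state=42,
)
search.fit(X_tr, y_tr)

train_auc = roc_auc_score(y_tr, search.predict_proba(X_tr)[:, 1])
cv_auc = search.best_score_  # mean ROC-AUC over the 5 validation folds
dev_auc = roc_auc_score(y_dev, search.predict_proba(X_dev)[:, 1])
print(f"train: {train_auc:.3f}  cv: {cv_auc:.3f}  dev: {dev_auc:.3f}")
```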

In [69]:
# Strip the "model__" prefix from the tuned parameters, then refit on all of the training data.
best_params = {key.replace("model__", ""): value for key, value in gs.best_params_.items()}
final_model = LogisticRegression(**best_params)
final_pipeline = Pipeline(steps=[('preprocessor', num_pipeline),
                                 ('model', final_model)
                                ])
final_pipeline.fit(train[features], train["rainfall"])

y_pred_proba = final_pipeline.predict_proba(train[features])[:, 1]
roc_score = roc_auc_score(train["rainfall"], y_pred_proba)
print(f"ROC-AUC Score: {roc_score}")

ROC-AUC Score: 0.8956240179573511


In [70]:
predictions = final_pipeline.predict_proba(test[features])[:, 1]
my_submission = pd.DataFrame({'id': test.id, 'rainfall': predictions})
my_submission.to_csv('RefinedLogitSubmission.csv', index=False)

In [71]:
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(train[features], train["rainfall"], train_size = 0.8, random_state = 42)

parameters = {
    'model__n_estimators': np.arange(50, 300, 50),
    'model__max_depth': np.arange(3, 10, 1),
    'model__max_samples': np.linspace(0.4, 0.8, 5),
    'model__max_features': [0.2, 0.4, 'sqrt', 'log2'],
    'model__min_samples_split': np.arange(5, 20, 2),
    'model__min_samples_leaf': np.arange(3, 20, 2),
    'model__criterion': ['gini', 'entropy']
}


model = RandomForestClassifier(random_state=42)
my_pipeline = Pipeline(steps=[('preprocessor', num_pipeline),
                              ('model', model)
                             ])
gs = RandomizedSearchCV(my_pipeline, param_distributions = parameters, n_iter = 100, scoring = 'roc_auc', cv=5, n_jobs = -1)
gs.fit(X_train,y_train)

In [72]:

best_model = gs.best_estimator_
print(gs.best_params_)
y_pred_proba = best_model.predict_proba(X_train)[:, 1]
roc_score = roc_auc_score(y_train, y_pred_proba)
print(f"ROC-AUC Score: {roc_score}")

{'model__n_estimators': 150, 'model__min_samples_split': 19, 'model__min_samples_leaf': 15, 'model__max_samples': 0.7000000000000001, 'model__max_features': 0.4, 'model__max_depth': 5, 'model__criterion': 'gini'}
ROC-AUC Score: 0.9292033029297709


In [73]:
y_pred_test = best_model.predict(X_test)
y_probs_test = best_model.predict_proba(X_test)[:, 1]  

roc_score = roc_auc_score(y_test, y_probs_test)

print(f"ROC-AUC Score: {roc_score}")

ROC-AUC Score: 0.8692078712362689


The gap between the train (0.929) and dev (0.869) ROC-AUC scores indicates some overfitting, which suggests that the model could benefit from further regularisation.
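One way to act on this is to compare the train/dev gap under looser and tighter tree constraints. A minimal sketch on noisy synthetic data. The dataset and the two settings are assumptions for illustration, not tuned values for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Noisy synthetic stand-in for the competition data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, train_size=0.8, random_state=42)

# Two hypothetical settings: loosely and tightly constrained trees.
settings = {
    "loose": {"max_depth": None, "min_samples_leaf": 1},
    "tight": {"max_depth": 3, "min_samples_leaf": 30},
}

scores = {}
for name, params in settings.items():
    forest = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    forest.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, forest.predict_proba(X_tr)[:, 1])
    dev_auc = roc_auc_score(y_dev, forest.predict_proba(X_dev)[:, 1])
    scores[name] = (train_auc, dev_auc)
    print(f"{name}: train {train_auc:.3f}, dev {dev_auc:.3f}, gap {train_auc - dev_auc:.3f}")
```

Shallower trees with larger leaves fit the training rows less closely. Whether the dev score also improves depends on the data, so it has to be checked rather than assumed.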

In [74]:
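# Strip the "model__" prefix from the tuned parameters, then refit on all of the training data.
best_params = {key.replace("model__", ""): value for key, value in gs.best_params_.items()}
# Seed the final forest so that the refit is reproducible.
final_model = RandomForestClassifier(random_state=42, **best_params)
final_pipeline = Pipeline(steps=[('preprocessor', num_pipeline),
                                 ('model', final_model)
                                ])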
final_pipeline.fit(train[features], train["rainfall"])

y_pred_proba = final_pipeline.predict_proba(train[features])[:, 1]
roc_score = roc_auc_score(train["rainfall"], y_pred_proba)
print(f"ROC-AUC Score: {roc_score}")

ROC-AUC Score: 0.9191964085297419


In [75]:
predictions = final_pipeline.predict_proba(test[features])[:, 1]
my_submission = pd.DataFrame({'id': test.id, 'rainfall': predictions})
my_submission.to_csv('RandomForestClassifier.csv', index=False)
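The introduction suggests stacking as a possible next step. A minimal sketch of how the two model families used here could be stacked with scikit-learn's `StackingClassifier`. It runs on synthetic data, and the hyperparameters only loosely echo the tuned values above. None of this was part of the submitted models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, train_size=0.8, random_state=42)

# Base models: a scaled logistic regression and a constrained random forest.
base_models = [
    ("logit", Pipeline(steps=[("scaler", StandardScaler()),
                              ("model", LogisticRegression(C=0.015, max_iter=1000))])),
    ("forest", RandomForestClassifier(n_estimators=150, max_depth=5,
                                      min_samples_leaf=15, random_state=42)),
]

# The final estimator is trained on out-of-fold probabilities from the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_dev_proba = stack.predict_proba(X_dev)
stack_dev_auc = roc_auc_score(y_dev, stack_dev_proba[:, 1])
print(f"Stacked dev ROC-AUC: {stack_dev_auc:.3f}")
```

The internal cross-validation matters with so few rows: the final estimator only ever sees base-model predictions on rows those models were not fitted to.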