## Hackathon notebook

#### Many of my experiments did not improve the score, and the notebook became really messy, so in the end I deleted everything that was not helping and kept only what worked

Things that I tried:
- Dropping all the rows with null values
- Imputing null values with mean, mode, median
- Scaling the dataset
- Removing the outliers
- Manually mapping the categories instead of using OneHotEncoder or get_dummies
- Trying with and without SMOTE
- Trying with and without stratify in train_test_split
- Trying different algorithms
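
The null-handling options in the list above can be sketched on a toy frame. This is only an illustration: the column names `age` and `seat_class` and all values are made up, not taken from the hackathon data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical columns and values).
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "seat_class": ["Eco", "Business", None, "Eco"],
})

# Option 1: drop every row that contains at least one null.
dropped = toy.dropna()

# Option 2: impute numeric columns with the median and categoricals with the mode.
imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["seat_class"] = imputed["seat_class"].fillna(imputed["seat_class"].mode()[0])

# Option 3: replace nulls in a numeric column with a constant sentinel.
sentinel = toy["age"].fillna(-999)

print(len(dropped), imputed["age"].tolist(), sentinel.tolist())
```

Dropping rows loses data, while imputing or using a sentinel keeps every row, which matters when the test set also contains nulls and every test row needs a prediction.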

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 100
pd.options.display.max_rows = 1000
pd.options.display.max_colwidth = 1000
np.set_printoptions(linewidth=500)

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In [None]:
# Loaded the dataset
survey_train = pd.read_csv(r"D:\Jupyter\Shinkansen\Surveydata_train.csv")
travel_train = pd.read_csv(r"D:\Jupyter\Shinkansen\Traveldata_train.csv")
survey_test = pd.read_csv(r"D:\Jupyter\Shinkansen\Surveydata_test.csv")
travel_test = pd.read_csv(r"D:\Jupyter\Shinkansen\Traveldata_test.csv")

# Merged the two sets
train1 = pd.merge(travel_train, survey_train, on="ID")
test1 = pd.merge(travel_test, survey_test, on="ID")

# Copied train and test so the original merged sets stay available for reference
train = train1.copy()
test = test1.copy()
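
A small sanity check on a merge like the one above, sketched on toy frames. The columns `Travel_Distance` and `Seat_Comfort` are hypothetical stand-ins. The `validate` argument of `pd.merge` raises an error if an `ID` appears more than once on either side, which would otherwise silently duplicate rows.

```python
import pandas as pd

# Toy stand-ins for the travel and survey files (hypothetical columns).
travel = pd.DataFrame({"ID": [1, 2, 3], "Travel_Distance": [100, 250, 80]})
survey = pd.DataFrame({"ID": [1, 2, 3], "Seat_Comfort": ["Good", "Poor", "Good"]})

# validate="one_to_one" fails loudly if either frame has duplicate IDs.
merged = pd.merge(travel, survey, on="ID", how="inner", validate="one_to_one")

# An inner merge keeps only IDs present in both frames, so compare the row counts.
print(len(travel), len(survey), len(merged))
```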

In [None]:
dataset = [train,test]
for df in dataset:   
    df.columns = df.columns.str.lower()    # converted all the column names to lower case (optional)
    df.fillna(-999,axis=0,inplace=True)    # filled all null values with the constant -999

In [None]:
# Used get_dummies
train = pd.get_dummies(train,drop_first=True)
test =  pd.get_dummies(test,drop_first=True)

# train and test had different number of columns after get_dummies so used reindex
test = test.reindex(columns = train.columns, fill_value=-999)
test.drop("overall_experience", axis=1, inplace=True)
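
The column alignment above can be sketched on toy data. The frames and the `seat_class` column are made up for illustration. One assumption differs from the cell above: the sketch fills missing dummy columns with 0 instead of -999, since a dummy column absent from the test set simply means that category never occurs there.

```python
import pandas as pd

# Toy data: the test set lacks the "First" category (hypothetical values).
toy_train = pd.DataFrame({"seat_class": ["Eco", "Business", "First"], "target": [1, 0, 1]})
toy_test = pd.DataFrame({"seat_class": ["Eco", "Business"]})

train_d = pd.get_dummies(toy_train, drop_first=True)
test_d = pd.get_dummies(toy_test, drop_first=True)

# Align the test columns to the training feature columns (target excluded).
feature_cols = train_d.columns.drop("target")
test_aligned = test_d.reindex(columns=feature_cols, fill_value=0)

print(list(test_aligned.columns))
```

Reindexing against the feature columns only also avoids creating and then dropping a placeholder target column in the test frame.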

In [None]:
# Model building
y = train["overall_experience"]
X = train.drop(["id","overall_experience"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y)

clf = XGBClassifier(max_depth=21, learning_rate=0.3, subsample=0.9,
                    colsample_bytree=0.8, colsample_bylevel=0.9,
                    n_estimators=250)
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)       # prediction on train set
y_pred_test = clf.predict(X_test)         # prediction on test set

# printed accuracy for both
print(accuracy_score(y_train,y_pred_train)," -",accuracy_score(y_test,y_pred_test))
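
A single train/test split gives a noisy accuracy estimate. A minimal sketch of stratified cross-validation follows, using `cross_val_score` on synthetic data. The data from `make_classification` and the `RandomForestClassifier` settings are assumptions for illustration, not the hackathon data or the XGBoost model used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for X and y.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough idea of how much a difference between two runs is just noise.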

In [None]:
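# predicted the test file and saved the output for submission
# the model was trained without "id", so drop it before predicting
predictions = clf.predict(test.drop("id", axis=1))
predictions

# a bare `id` is the Python builtin, so take the IDs from the test frame instead
test2 = pd.DataFrame({"ID": test["id"], "Overall_Experience": predictions})

test2.to_csv("latest.csv",index=False)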

### Tuned the models

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)

# max_features takes the Python value None (use all features), not the string "None"
params = {'max_features':[None, "log2", "sqrt"]}

# only 3 candidates exist here, so the search evaluates all 3 even though n_iter=25
clf = RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions = params, scoring='accuracy', n_iter=25, n_jobs=-1, verbose=1)

clf.fit(x_train, y_train)
best_combination = clf.best_params_

print("Best hyperparameter combination: ", best_combination)
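
Besides `best_params_`, a fitted search also exposes `best_score_` and `cv_results_`, which help compare candidates. A minimal sketch on synthetic data follows. The extra `max_depth` entry and all settings are assumptions added for illustration, not part of the search above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X and y.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# 3 x 3 = 9 possible combinations; n_iter=5 samples five of them.
param_distributions = {
    "max_features": [None, "log2", "sqrt"],
    "max_depth": [3, 5, None],  # hypothetical extra parameter
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions=param_distributions,
    n_iter=5, cv=3, scoring="accuracy", random_state=0,
)
search.fit(X_toy, y_toy)

# One row per sampled candidate, best first.
results = (
    pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(search.best_params_, round(search.best_score_, 3))
```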