## Simple Baseline script
* Uses CatBoost (has built-in handling of categorical features, such as the string columns).
* Not yet compared with one-hot encoding of the string/categorical columns, or with XGBoost or LightGBM (the latter can also handle categoricals natively).

*Good luck!*

* Target: *is_female*

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from subprocess import check_output
# print(check_output(["ls", "../input"]).decode("utf8"))
# Any results you write to the current directory are saved as output.

In [3]:
from catboost import CatBoostClassifier
# from xgboost import XGBClassifier
import lightgbm as lgb

TARGET = "is_female"

ModuleNotFoundError: No module named 'catboost'

#### Lots of columns!
* Some are strings, some are boolean or of very low cardinality (3-7 unique values).
* Lots of NaNs.

In [None]:
df = pd.read_csv("../input/train.csv",low_memory=False)
print(df.shape)
df.head()

### Note: Test and train have different ID columns!
* The ID numbering restarts at 0 in each file.
* Best to drop them unless a useful leak is identified (though that makes writing test-set predictions more annoying if train then has a different number of columns).
    * Ignore for now.

In [None]:
test = pd.read_csv("../input/test.csv",low_memory=False)
print(test.shape)

#### Lots of columns look to be based on survey responses/multiple-choice questions.
* In this case, nulls may be the result of picking a particular answer choice rather than of the question going unanswered. Working out how to treat them requires digging into the data case by case.
    * In short: missing value imputation may be damaging!
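
One way to start that digging, sketched on a toy frame: compare the target rate between rows where a column is missing and rows where it is present. The column names `q1` and `q2` and all values are made up for illustration; only `is_female` comes from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (hypothetical columns q1, q2).
toy = pd.DataFrame({
    "q1": [1.0, np.nan, 2.0, np.nan, 1.0, np.nan],
    "q2": [np.nan, 3.0, 3.0, 4.0, np.nan, 4.0],
    "is_female": [0, 1, 0, 1, 0, 1],
})

def target_rate_by_missingness(frame, target):
    """For each feature, mean target where the feature is NaN vs. present."""
    rows = {}
    for col in frame.columns.drop(target):
        is_null = frame[col].isnull()
        rows[col] = {
            "rate_when_missing": frame.loc[is_null, target].mean(),
            "rate_when_present": frame.loc[~is_null, target].mean(),
            "n_missing": int(is_null.sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

missing_report = target_rate_by_missingness(toy, "is_female")
print(missing_report)
```

A large gap between the two rates suggests the NaN itself carries signal, in which case imputing it away would discard information.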

In [None]:
df.isnull().sum()

## Examine non-numeric columns:
* Clear possibilities here to get less sparse features: "column is not NaN", or a per-row count of non-NaN values across a group of columns (e.g. the columns beginning with `DL[0-9]`).
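
A minimal sketch of both ideas on a toy frame. The `DL0`, `DL1`, `DL2` and `other` column names are assumptions chosen to match the `DL[0-9]` pattern; substitute the real column names.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical column names.
toy = pd.DataFrame({
    "DL0": ["a", None, "b"],
    "DL1": [None, None, "c"],
    "DL2": [1.0, np.nan, np.nan],
    "other": [np.nan, 5.0, 6.0],
})

# Columns whose name starts with "DL" followed by a digit.
dl_cols = toy.filter(regex=r"^DL[0-9]").columns.tolist()

features = pd.DataFrame(index=toy.index)
# "Column is not NaN" indicator for a single column.
features["other_answered"] = toy["other"].notnull().astype(int)
# Number of answered (non-NaN) DL questions per row.
features["DL_answered_count"] = toy[dl_cols].notnull().sum(axis=1)

print(features)
```

Both features are dense integers, so they can be fed to any of the models without further encoding.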

In [None]:
# https://stackoverflow.com/questions/25039626/how-do-i-find-numeric-columns-in-pandas
df.select_dtypes(exclude=[np.number])

In [None]:
category_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
print(len(category_cols))

### Convert categorical columns to integers
* Our test set has categoricals not seen in train, and this must be handled. For now, we'll take the categoricals from train and test together, ensuring we "see" and encode all of them.
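
To illustrate why a shared encoding matters, here is a small sketch with made-up data (the `color` column is hypothetical). Encoding train and test separately can assign the same integer to different strings, while building one category list from both frames keeps the mapping consistent.

```python
import pandas as pd

train_toy = pd.DataFrame({"color": ["red", "green", "red"]})
test_toy = pd.DataFrame({"color": ["blue", "red"]})  # "blue" is unseen in train

# Separate encoding: codes are assigned per frame, so they can disagree.
separate_train = train_toy["color"].astype("category").cat.codes.tolist()
separate_test = test_toy["color"].astype("category").cat.codes.tolist()

# Shared encoding: one sorted category list built from both frames.
# (This toy data has no NaNs; real data would need them filtered out first.)
all_categories = sorted(set(train_toy["color"]) | set(test_toy["color"]))
shared_dtype = pd.CategoricalDtype(categories=all_categories)
shared_train = train_toy["color"].astype(shared_dtype).cat.codes.tolist()
shared_test = test_toy["color"].astype(shared_dtype).cat.codes.tolist()

print(separate_train, separate_test)
print(shared_train, shared_test)
```

With separate encoding, code 0 means "green" in train but "blue" in test; with the shared dtype every string has one code in both frames.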

In [None]:
# Save # rows in train/test. (We could do this directly, but this is easier to debug if needed)
TR_ROWS = df.shape[0]
print(df.shape)
print(test.shape)

In [None]:
df_all = pd.concat([df,test])
df_all.shape

## I get errors with LightGBM and others when using the Categorical datatype, so we'll turn it back into integers to avoid this

In [None]:
for header in category_cols:
    # Encode on the combined train+test frame so both share a single mapping.
    # df and test are re-created from df_all in the split below, so encoding
    # them separately here is unnecessary (and would give inconsistent codes).
    # Note: cat.codes encodes NaN as -1.
    df_all[header] = df_all[header].astype('category').cat.codes.astype('int')

In [None]:
df.shape

#### Split back into train and test

In [None]:
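# .copy() gives independent frames, so the in-place edits below do not
# trigger SettingWithCopyWarning.
df = df_all.iloc[:TR_ROWS].copy()
test = df_all.iloc[TR_ROWS:].copy()

df.drop(['test_id'], axis=1, inplace=True)
test.drop(['train_id', "is_female"], axis=1, inplace=True)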

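### Naive initial feature engineering:
* NaNs per row (could also be done for groups of columns, or to sum 0/1 flags).
* Note: the categorical columns were already integer-encoded above, where NaN became -1, so their missing values are no longer counted here.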

In [None]:
df["row_nulls"] = df.isnull().sum(axis=1)
test["row_nulls"] = test.isnull().sum(axis=1)

## Build a model
* Note: CatBoost in particular has a LOT of hyperparameters (it's even worse than LightGBM in this regard). It's essential to experiment with them if you want decent results.
* This is my first time using it, so assume my hyperparameters are terrible.
* Tuning should first use a separate train/validation split to select hyperparameters.

* For low-cardinality categoricals (e.g. <20 unique values), there's likely little benefit in native categorical handling (CatBoost/LightGBM) vs. simply leaving the column as a number or one-hot encoding it.

In [None]:
X = df.drop([TARGET],axis=1)# .select_dtypes(include=[np.number]) #.values
Y = df[TARGET]

In [None]:
# Optional train/validation split for tuning hyperparameters (X_test/y_test act as the validation set).
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [None]:
cat_dims = [X.columns.get_loc(i) for i in category_cols[:-1]]  # categorical columns indexes

# Train a CatBoost classifier. The default loss is Logloss (lower is better);
# eval_metric here is set to AUC (higher is better).
clf = CatBoostClassifier(eval_metric="AUC", one_hot_max_size=3, iterations=2)

clf.fit(X,Y , cat_features=cat_dims)

## Currently there's an error when creating predictions - need to debug

In [None]:
res = clf.predict_proba(test)

## Additional models
* Train performance is dangerously misleading without a held-out validation set.
* This is just a starter for models.
* Should also check for overfitting (requires a validation split).
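
A small sketch of the train-versus-validation gap on synthetic data. It uses scikit-learn's `DecisionTreeClassifier` as a stand-in, not the CatBoost or LightGBM models from this notebook, purely because an unpruned tree overfits visibly; the data and all variable names are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X_toy = rng.normal(size=(1000, 10))
# Target depends weakly on the first feature, plus a lot of noise.
y_toy = (X_toy[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Unpruned tree: it memorises the training rows.
model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
valid_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"train AUC: {train_auc:.3f}  validation AUC: {valid_auc:.3f}")
```

The same two-number comparison applies to the boosted models here: score the fitted model on both the training rows and the held-out rows, and treat a large gap as a sign of overfitting.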

In [None]:
# # increase # iterations when model debugged
# clf2 = CatBoostClassifier(eval_metric="AUC", one_hot_max_size=6,
#                           iterations=50,depth=8,learning_rate=0.04, rsm=0.8)
# clf2.fit(X,Y , cat_features=cat_dims)
# res2 = clf2.predict_proba(test)

## Simple LightGBM model:
* https://github.com/Microsoft/LightGBM/issues/1096

In [None]:
# we still have object datatype columns..
df.dtypes.value_counts()

In [None]:
# categorical/objects cols:
print(df.select_dtypes(exclude=[np.number]).columns.tolist())
# df.select_dtypes(exclude=[np.number]).value_counts()

In [None]:
lgb_train = lgb.Dataset(
    # Numeric columns only; any remaining object columns are dropped.
    data=df.drop([TARGET], axis=1).select_dtypes(include=[np.number]).values,
    label=df[TARGET],
    # categorical_feature=cat_dims
)

In [None]:
# https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
t4_params = {
    'boosting_type': 'gbdt', 'objective': 'binary', 'nthread': -1, 'silent': False,
    'num_leaves': 2**4, 'learning_rate': 0.05, 'max_depth': 11,
    'max_bin': 255, 
    'subsample': 0.8, 'subsample_freq': 1, 'colsample_bytree': 0.75, 
#     'early_stopping_round' : 10,
    'min_split_gain': 0.5, 'min_child_samples': 4}

clf_lgb = lgb.train(t4_params,lgb_train)

In [None]:
res_lgbm = clf_lgb.predict(test.select_dtypes(include=[np.number]).values)

In [None]:
len(res_lgbm)

#### Once more models' predictions work, we can join them and take the mean of the predictions: a simple blending ensemble
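
A minimal sketch of that blend, using made-up probability arrays in place of real model outputs. Note that `predict_proba` on a scikit-learn-style classifier such as `CatBoostClassifier` returns one column per class, so the positive-class column has to be selected first, while a LightGBM booster's `predict` returns a 1-D array for a binary objective.

```python
import numpy as np

# Stand-ins for model outputs (values are invented for illustration).
proba_two_columns = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])  # shape (n, 2)
proba_one_dim = np.array([0.4, 0.9, 0.1])                            # shape (n,)

# Take P(class 1) from the two-column output, then average the two models.
positive_proba = proba_two_columns[:, 1]
blended = np.mean([positive_proba, proba_one_dim], axis=0)

print(blended)
```

An unweighted mean is the simplest choice; weights per model could be picked on a validation split instead.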

In [None]:
# test["lgbm_preds"]=res_lgbm
test["is_female"]=res_lgbm

In [None]:
test["test_id"] = test["test_id"].astype(int)

In [None]:
test[["test_id","is_female"]].to_csv("submission.csv",index=False)

In [None]:
# preds = pd.DataFrame(columns=[test["test_id"].copy(),res_lgbm]
# preds = test["test_id"].copy()
# preds["is_female"] = res_lgbm# ensemble/mean of others at this points

In [None]:
# preds.head()

In [None]:
# preds.to_csv("submission.csv.gz",index=False,compression="gzip")