## Simple Baseline script
* Uses CatBoost (has built-in handling for categorical features, such as the string columns).
* Not yet compared with one-hot encoding of the string/categorical columns, or with XGBoost or LightGBM (the latter can also handle categoricals natively).

Good luck!

* Target: *is_female*

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from subprocess import check_output
# print(check_output(["ls", "../input"]).decode("utf8"))
# Any results you write to the current directory are saved as output.

In [7]:
from catboost import CatBoostClassifier
# from xgboost import XGBClassifier
# import lightgbm as lgb

TARGET = "is_female"

#### Lots of columns!
* Some are strings, some are boolean or of very low cardinality (3-7 unique values).
* Lots of NaNs.

In [35]:
df = pd.read_csv("train.csv",low_memory=False)
print(df.shape)
df.head()

(18255, 1235)


Unnamed: 0,train_id,AA3,AA4,AA5,AA6,AA7,AA14,AA15,DG1,is_female,...,GN1,GN1_OTHERS,GN2,GN2_OTHERS,GN3,GN3_OTHERS,GN4,GN4_OTHERS,GN5,GN5_OTHERS
0,0,3,32,3.0,,323011,3854,481,1975,1,...,99.0,,99,,99,,99,,99,
1,1,2,26,,8.0,268131,2441,344,1981,1,...,,,1,,2,,2,,2,
2,2,1,16,,7.0,167581,754,143,1995,1,...,1.0,,2,,2,,2,,2,
3,3,4,44,5.0,,445071,5705,604,1980,1,...,,,2,,2,,99,,99,
4,4,4,43,,6.0,436161,5645,592,1958,1,...,,,1,,1,,1,,1,


### Note: train and test have different ID columns (train_id, test_id)!
* Ordering resets to 0 in each file.
* Best to drop them unless a useful leak is identified (but that makes it more annoying to output test set predictions, since train and test then have different columns).
    * Ignore for now

In [10]:
test = pd.read_csv("test.csv",low_memory=False)
print(test.shape)

(27285, 1234)


#### Lots of columns look to be based on survey responses/multiple choice questions.
* In this case, nulls may be the result of picking a particular answer choice, rather than of a question not being answered. Working out how to handle them requires digging into the data case by case.
    * In short: missing value imputation may be damaging!
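One way to check whether missingness carries signal, before deciding on any imputation, is to compare the target rate for rows where a column is null against rows where it is present. This is a minimal sketch on a toy frame; the column name `Q1` and the values are made up for illustration, and the helper `target_rate_by_missingness` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; "Q1" and its values are hypothetical.
toy = pd.DataFrame({
    "Q1": [1.0, np.nan, 2.0, np.nan, 1.0, np.nan],
    "is_female": [0, 1, 0, 1, 0, 1],
})

def target_rate_by_missingness(frame, col, target="is_female"):
    """Return (target mean where col is null, target mean where col is present)."""
    missing = frame[col].isnull()
    return frame.loc[missing, target].mean(), frame.loc[~missing, target].mean()

rate_missing, rate_present = target_rate_by_missingness(toy, "Q1")
print(rate_missing, rate_present)
```

A large gap between the two rates suggests the null itself is informative, in which case filling it in with a mean or mode would throw that information away.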

In [11]:
df.isnull().sum()

train_id                0
AA3                     0
AA4                     0
AA5                 12602
AA6                  5653
AA7                     0
AA14                    0
AA15                    0
DG1                     0
is_female               0
DG3                     0
DG3A                    0
DG3A_OTHERS         18205
DG4                     0
DG4_OTHERS          18255
DG5_1                   0
DG5_2                   0
DG5_3                   0
DG5_4                   0
DG5_5                   0
DG5_6                   0
DG5_7                   0
DG5_8                   0
DG5_9                   0
DG5_10                  0
DG5_11                  0
DG5_96                  0
DG6                     0
DG8a                    0
DG8b                    0
                    ...  
FB28_2_OTHERS       18253
FB28_3_OTHERS       18255
FB28_4_OTHERS       18253
FB28_96_OTHERS      18254
FB29_1                  0
FB29_2                  0
FB29_3                  0
FB29_4      

## Examine non-numeric columns:
* Clear possibilities here for less sparse features: "column is not NaN", or a count of non-NaN values across a group of columns (e.g. the columns beginning with DL[0-9]).
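The two feature ideas above could look roughly like this. The toy column names only mimic the `DL*` naming pattern seen in the data, and the new feature names (`DL2_OTHERS_notnull`, `DL_answered`) are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names only mimic the DL* naming pattern.
toy = pd.DataFrame({
    "DL1": [1.0, np.nan, 3.0],
    "DL2_OTHERS": [np.nan, np.nan, "farming"],
    "GN1": [2.0, 2.0, np.nan],
})

# Select the group first, so the new feature columns are not picked up by the regex.
dl_cols = toy.filter(regex=r"^DL[0-9]").columns

# Indicator feature: 1 where the free-text column was filled in, else 0.
toy["DL2_OTHERS_notnull"] = toy["DL2_OTHERS"].notnull().astype(int)

# Count feature: number of non-NaN values per row across the DL* group.
toy["DL_answered"] = toy[dl_cols].notnull().sum(axis=1)

print(toy[["DL2_OTHERS_notnull", "DL_answered"]])
```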

In [12]:
# https://stackoverflow.com/questions/25039626/how-do-i-find-numeric-columns-in-pandas
df.select_dtypes(exclude=[np.number])

Unnamed: 0,DG3A_OTHERS,DG13_OTHERS,DG14_OTHERS,DL1_OTHERS,DL2_23_OTHERS,DL2_96_OTHERS,DL4_OTHERS,DL12_OTHERS,DL28_OTHERS,G2P1_OTHERS,...,FB28_4_OTHERS,FB28_96_OTHERS,FB29_OTHERS,LN2_RIndLngBEOth,LN2_WIndLngBEOth,GN1_OTHERS,GN2_OTHERS,GN3_OTHERS,GN4_OTHERS,GN5_OTHERS
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,Bengali,Bengali,,,,,
2,,,,,,,,,,Samajvadi pension,...,,,,Hindi,Hindi,,,,,
3,,,,,,,,,,,...,,,,Tamil,Tamil,,,,,
4,,,,,,,,,,,...,,,,Malayalam,Malayalam,,,,,
5,,,,,,,,,,,...,,,,Chattisgari,Chattisgari,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,Telugu,Telugu,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,Marathi,Marathi,,,,,


In [13]:
category_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()
print(len(category_cols))

96


In [32]:
category_cols = test.select_dtypes(exclude=[np.number]).columns.tolist()
print(len(category_cols))

0


In [31]:
for header in category_cols:
#     df[header] = df[header].astype('category').cat.codes
#     test[header] = test[header].astype('category').cat.codes
#     df_all[header] = df_all[header].astype('category').cat.codes.astype('int')
#     df_all[header] = pd.to_numeric(df_all[header])
    
#     df[header] = df[header].astype('category').cat.codes
#     df[header] = pd.to_numeric(df[header])
    test[header] = test[header].astype('category').cat.codes
    test[header] = pd.to_numeric(test[header])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.


### convert categorical columns to integers
* Our test set has categoricals not seen in train; this must be handled. For now, we'll combine train and test before encoding, ensuring we "see" and encode all the categories.
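A minimal sketch of the combined encoding on toy frames (the column names `LANG` and `NOTE` are hypothetical). It also detects the string columns on the combined frame rather than on train alone: a column that is entirely NaN in train is read as float there, so it would be missed if only train were inspected.

```python
import numpy as np
import pandas as pd

# Toy frames; "LANG" and "NOTE" are hypothetical column names.
train_toy = pd.DataFrame({"LANG": ["Hindi", "Tamil", np.nan], "NOTE": [np.nan] * 3})
test_toy = pd.DataFrame({"LANG": ["Tamil", "Telugu"], "NOTE": ["pension", np.nan]})

n_train = len(train_toy)
both = pd.concat([train_toy, test_toy], ignore_index=True)

# "NOTE" is all-NaN (float) in train but holds strings in test, so it only
# shows up as non-numeric once the two frames are combined.
cat_cols = both.select_dtypes(exclude=[np.number]).columns.tolist()

for col in cat_cols:
    # One shared mapping for train and test; cat.codes maps NaN to -1.
    both[col] = both[col].astype("category").cat.codes

# .copy() gives independent frames instead of slices of `both`.
train_enc = both.iloc[:n_train].copy()
test_enc = both.iloc[n_train:].copy()
print(cat_cols)
```

Because the codes come from one shared category list, the same string gets the same integer in both frames, and a category that appears only in test still receives its own code.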

In [14]:
# Save # rows in train/test. (We could do this directly, but this is easier to debug if needed)
TR_ROWS = df.shape[0]
print(df.shape)
print(test.shape)

(18255, 1235)
(27285, 1234)


In [15]:
df_all = pd.concat([df,test])
df_all.shape

(45540, 1236)

## I get errors with LightGBM and others when using the Categorical dtype, so we'll turn the columns back into integers to avoid this

In [16]:
for header in category_cols:
    # Encode on the combined frame so train and test share one code mapping.
    # Note: cat.codes maps NaN to -1.
    df_all[header] = df_all[header].astype('category').cat.codes.astype('int')

    # Codes computed per frame are not comparable between train and test;
    # these two frames are overwritten when df_all is split back further down.
    df[header] = df[header].astype('category').cat.codes
    test[header] = test[header].astype('category').cat.codes

In [29]:
test.select_dtypes(exclude=[np.number])

Unnamed: 0,DG4_OTHERS,FB28_3_OTHERS,G2P2_10_OTHERS,G2P2_12_OTHERS,G2P2_15_OTHERS,G2P2_2_OTHERS,MM11_11_OTHERS,MM11_5_OTHERS,MM15_OTHERS,MM38_OTHERS,MT13_4_OTHERS,MT13_96_OTHERS,MT14_3_OTHERS,MT14_5_OTHERS,MT14_7_OTHERS
0,,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,,
7,,,,,,,,,,,,,,,
8,,,,,,,,,,,,,,,
9,,,,,,,,,,,,,,,


In [17]:
df.shape

(18255, 1235)

#### Split back into train and test

In [18]:
df =df_all.iloc[0:TR_ROWS]
test =df_all.iloc[TR_ROWS:]

df.drop(['test_id'],axis=1,inplace=True)
test.drop(['train_id',"is_female"],axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


### Naive initial feature engineering:
* NaNs per row (could also be done for groups of columns, or by summing 0/1 columns).

In [19]:
df["row_nulls"] = df.isnull().sum(axis=1)
test["row_nulls"] = test.isnull().sum(axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
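The SettingWithCopyWarning messages above appear because `df` and `test` are `iloc` slices of `df_all`, and pandas cannot tell whether the later in-place drops and column assignments are meant to affect `df_all` as well. A common way to avoid this is to take an explicit `.copy()` when splitting. A minimal sketch on a toy frame (the names `combined`, `train_part` and `test_part` are illustrative):

```python
import numpy as np
import pandas as pd

combined = pd.DataFrame({"a": [1, 2, 3, 4], "test_id": [np.nan, np.nan, 0, 1]})
n_train = 2

# .copy() makes each part an independent frame, so later assignments and
# drops act on it directly rather than on a slice of `combined`.
train_part = combined.iloc[:n_train].copy()
test_part = combined.iloc[n_train:].copy()

# Non-inplace drop, then a new column; `combined` is left unchanged.
train_part = train_part.drop(columns=["test_id"])
train_part["row_nulls"] = train_part.isnull().sum(axis=1)
print(train_part)
```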
  


## Build a model
* Note: CatBoost in particular has a LOT of hyperparameters (it's even worse than LightGBM in this regard). It's essential to experiment with them if you want to get decent results.
* This is my first time using it, so assume my hyperparameters are terrible.
* Tuning should use a separate train/validation split to select hyperparameters.

* For low-cardinality categoricals (e.g. <20 unique values), there is probably little benefit in the native categorical handling (CatBoost/LightGBM) vs simply leaving the column as a number or one-hot encoding it.
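The validation-split advice above can be sketched as follows. To keep the example runnable without extra dependencies it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in model; both are assumptions for illustration, and in the notebook itself the model would be the CatBoost classifier.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (illustration only): the label mostly depends on feature 0.
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(400, 5))
y_toy = (X_toy[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

# Hold out 30% of the rows; stratify keeps the class balance in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Stand-in model, not the notebook's CatBoost classifier.
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

# Compare AUC on the rows the model saw with AUC on the held-out rows.
train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"train AUC={train_auc:.3f}  validation AUC={val_auc:.3f}")
```

Hyperparameters would then be chosen by the validation AUC; a train AUC far above the validation AUC is a sign of overfitting.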

In [20]:
X = df.drop([TARGET],axis=1)# .select_dtypes(include=[np.number]) #.values
Y = df[TARGET]

In [21]:
### Optional train/validation split for tuning hyperparameters.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

In [22]:
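# Column indexes of the categorical features.
# Note: category_cols[:-1] leaves out the last categorical column.
cat_dims = [X.columns.get_loc(i) for i in category_cols[:-1]]

# Train a CatBoost classifier. With eval_metric="AUC" the "learn" values printed
# below are AUC (higher is better); the default metric would be Logloss.
# iterations=2 is only enough for a quick debugging run.
clf = CatBoostClassifier(eval_metric="AUC", one_hot_max_size=3, iterations=2)

clf.fit(X, Y, cat_features=cat_dims)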

0:	learn: 0.9344356	total: 604ms	remaining: 604ms
1:	learn: 0.9345540	total: 1.19s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x107a2a5c0>

## Currently there's an error when creating predictions - need to debug

In [33]:
res = clf.predict_proba(test)

In [34]:
res

array([[ 0.44440478,  0.55559522],
       [ 0.51215848,  0.48784152],
       [ 0.44241195,  0.55758805],
       ..., 
       [ 0.55068984,  0.44931016],
       [ 0.50294107,  0.49705893],
       [ 0.48998629,  0.51001371]])
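One thing worth checking when debugging predictions: the model was fit on `X`, which still contains `train_id`, while `test` contains `test_id` instead, so the two frames may not line up column for column. The sketch below, on toy frames, drops the ID columns and reorders test to match train. The helper `align_features` is hypothetical, not part of the notebook or of any library.

```python
import pandas as pd

ID_COLS = ["train_id", "test_id"]  # the ID column names used in this notebook

# Toy frames with the same features in a different order.
X_toy = pd.DataFrame({"train_id": [0, 1], "AA3": [3, 2], "row_nulls": [5, 7]})
test_toy = pd.DataFrame({"row_nulls": [4, 6], "test_id": [0, 1], "AA3": [1, 4]})

def align_features(train_X, test_X, id_cols=ID_COLS):
    """Drop ID columns and give test the same column order as train."""
    train_feats = train_X.drop(columns=id_cols, errors="ignore")
    test_feats = test_X.drop(columns=id_cols, errors="ignore")
    missing = set(train_feats.columns) - set(test_feats.columns)
    if missing:
        raise ValueError(f"test is missing columns: {sorted(missing)}")
    return train_feats, test_feats[train_feats.columns]

X_aligned, test_aligned = align_features(X_toy, test_toy)
print(list(test_aligned.columns))
```

If the model is refit on the aligned training frame, the categorical column indexes need to be recomputed against its columns. Column 1 of the `predict_proba` output is then the probability of the positive class (`is_female == 1`).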

## Additional models
* Train performance is dangerously misleading without a held-out validation set.
* This is just a starter for models.
* Should also check for overfitting (requires a validation set split).

In [None]:
# # increase # iterations when model debugged
# clf2 = CatBoostClassifier(eval_metric="AUC", one_hot_max_size=6,
#                           iterations=50,depth=8,learning_rate=0.04, rsm=0.8)
# clf2.fit(X,Y , cat_features=cat_dims)
# res2 = clf2.predict_proba(test)

## Simple LightGBM model:
* https://github.com/Microsoft/LightGBM/issues/1096

In [None]:
# we still have object dtype columns..
df.dtypes.value_counts()

In [None]:
# categorical/objects cols:
print(df.select_dtypes(exclude=[np.number]).columns.tolist())
# df.select_dtypes(exclude=[np.number]).value_counts()

In [None]:
import lightgbm as lgb  # the import near the top of the notebook is commented out

# Train on the numeric columns only; remaining object columns are skipped.
lgb_train = lgb.Dataset(
    data=df.drop([TARGET], axis=1).select_dtypes(include=[np.number]).values,
    label=df[TARGET],
    # categorical_feature=cat_dims
)

In [None]:
# https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
t4_params = {
    'boosting_type': 'gbdt', 'objective': 'binary', 'nthread': -1, 'silent': False,
    'num_leaves': 2**4, 'learning_rate': 0.05, 'max_depth': 11,
    'max_bin': 255, 
    'subsample': 0.8, 'subsample_freq': 1, 'colsample_bytree': 0.75, 
#     'early_stopping_round' : 10,
    'min_split_gain': 0.5, 'min_child_samples': 4}

clf_lgb = lgb.train(t4_params,lgb_train)

In [None]:
res_lgbm = clf_lgb.predict(test.select_dtypes(include=[np.number]).values)

In [None]:
len(res_lgbm)

#### Once more models produce predictions, we can join them and take the mean of the predictions = simple blending ensemble
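A minimal sketch of such a blend, using made-up probability vectors in place of real model output. For CatBoost the positive-class probabilities would be the second column of the `predict_proba` result; LightGBM's `predict` with a binary objective already returns them as a 1-D array.

```python
import numpy as np

# Hypothetical positive-class probabilities from two models on the same rows.
catboost_preds = np.array([0.56, 0.49, 0.56])
lgbm_preds = np.array([0.60, 0.41, 0.50])

# Simple blend: unweighted mean of the two prediction vectors.
blend = np.mean([catboost_preds, lgbm_preds], axis=0)
print(blend)
```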

In [None]:
# test["lgbm_preds"]=res_lgbm
test["is_female"]=res_lgbm

In [None]:
test["test_id"] = test["test_id"].astype(int)

In [None]:
test[["test_id","is_female"]].to_csv("submission.csv",index=False)

In [None]:
# preds = pd.DataFrame(columns=[test["test_id"].copy(),res_lgbm]
# preds = test["test_id"].copy()
# preds["is_female"] = res_lgbm# ensemble/mean of others at this points

In [None]:
# preds.head()

In [None]:
# preds.to_csv("submission.csv.gz",index=False,compression="gzip")