# WiDS Datathon
_____
<img src="./img/wids-logo.jpg" width=200>

This is my python solution for the WiDS Datathon, which is a new feature of the WiDS conference 2018. You can read about the datathon [Here](http://www.widsconference.org/datathon.html). This competition was launched by [kaggle](https://www.kaggle.com/c/wids2018datathon).

The dataset includes demographic and behavioral information from a sample of survey from India, related to their usage of financial services. The goal of this project is to predict gender, exploring the key differences in behavior patterns between men and women. By doing so, this competition seeks to encourage female data scientists to engage in social impact solutions and to help the poor people. Especially, InterMedia, where provide the dataset, seeks to help the world's poorest people take advantage of mobile phones/digital technoglogy to participate fully in their local economies.

### Data
- train.csv : training data (18255 rows, 1235 columns)
- test.csv : test data (27285 rows, 1234 columns)
- WiDS data dictionary v2.xlsx.html : feature descriptions dictionary 

### Target variable (Classification)
- <b>is_female</b> : is_female=1 for female and is_female=0 for male (while is_female=2 for female and is_female=1 for male in the dictionary table)

### Overview

0. File Load
1. Drop the columns with ONLY missing values (50 columns)
2. Data Type Exploration
3. Data Preparation
4. Data Type change : float -> object
5. Make New Features
6. One-Hot Encoding
7. Find overlapping columns
8. Model Fitting 
9. Make a submission file

In [74]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn import *
import xgboost as xgb
import scipy.stats as st
%matplotlib inline
matplotlib.style.use('ggplot')

from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

Exception ignored in: <bound method DMatrix.__del__ of <xgboost.core.DMatrix object at 0x1a20289e10>>
Traceback (most recent call last):
  File "/Users/ryan/anaconda3/lib/python3.6/site-packages/xgboost/core.py", line 368, in __del__
    if self.handle is not None:
AttributeError: 'DMatrix' object has no attribute 'handle'


## 1. File Load
_________________

In [23]:
train=pd.read_csv("train.csv",low_memory=False)
test=pd.read_csv("test.csv",low_memory=False)
df = pd.read_excel('WiDS data dictionary v2.xlsx')

In [24]:
print(train.shape, test.shape)

(18255, 1235) (27285, 1234)


## 2. Drop the columns with ONLY missing values (50 columns)
---------------

| Data | # of Original columns | Only missing values in train set| result
| :- |:-------------: | :-:| :-:
|Train Data| 1235  | 50| 1185
|Test Data| 1234 | 50| 1184

In [25]:
all_nan = train.isnull().sum()[train.isnull().sum(axis=0)==18255].index.values.tolist()
train.drop(all_nan,axis=1,inplace=True)
test.drop(all_nan,axis=1,inplace=True)

print(train.shape, test.shape)

(18255, 1185)
(27285, 1184)


## 3. Data Type Exploration
---------------
Originally, the data type is:

| Data | Numeric Data | String Data | Total
| :- |:-------------: | :-:| :-:
|Train Data| 1089  | 96| 1185
|Test Data| 1097 | 87| 1184

I changed the data type of 9 features in test data into string data (The pandas considered them as numeric data because the 9 features only included np.nan value.)

In [26]:
def dist_number(df):
    return df.select_dtypes(include=[np.number]), df.select_dtypes(exclude=[np.number])

num_train, st_train = dist_number(train)
num_test, st_test = dist_number(test)
print("Train: Numeric columns: {0}, Non-Numeric columns :{1}".format(num_train.shape[1],st_train.shape[1]))
print("Test: Numeric columns: {0}, Non-Numeric columns :{1}".format(num_test.shape[1],st_test.shape[1]))

Train: Numeric columns: 1089, Non-Numeric columns :96
Test: Numeric columns: 1097, Non-Numeric columns :87


In [27]:
def change_format(st_train,st_test):
    df1=pd.DataFrame(st_train.columns.values,columns=['c1'])
    df2=pd.DataFrame(st_test.columns.values,columns=['c2'])
    new = df1.merge(df2,left_on='c1',right_on='c2',how='outer')
    for i in range(len(new[new.c2.isnull()]['c1'].values)):
        test[test.columns[i]]=test[test.columns[i]].astype(object) 
    return test

test = change_format(st_train,st_test)

In [28]:
num_train, st_train = dist_number(train)
num_test, st_test = dist_number(test)
print("Train: Numeric columns: {0}, Non-Numeric columns :{1}".format(num_train.shape[1],st_train.shape[1]))
print("Test: Numeric columns: {0}, Non-Numeric columns :{1}".format(num_test.shape[1],st_test.shape[1]))

Train: Numeric columns: 1089, Non-Numeric columns :96
Test: Numeric columns: 1088, Non-Numeric columns :96


The final data type is: 

  | Data | Numeric Data | String Data | Total
| :- |:-------------: | :-:| :-:
|Train Data| 1089  | 96| 1185
|Test Data| 1088 | 96| 1184

## 4. Data Preparation
---------------
There are several features engineering for this data. I only applied 1
1. At least 99% non-nan values
2. AA7 - pick only Top20  
3. DG1 - birth year to age   
4. Delete - 'G2P3_1', 'G2P3_7','G2P3_8','G2P3_9','G2P3_11','G2P3_13'
5.  five continuous columns     

In [76]:
## string at least 99.5% non-nan values
def most_nan(df,num,fraction,want_nan):
    missing_df = df.isnull().sum(axis=0).reset_index()
    missing_df.columns = ['column_name','missing_count']
    missing_df = missing_df.ix[missing_df['missing_count']>0]
    missing_df = missing_df.sort_values(by='missing_count',ascending=False)
    if most_nan:
        return missing_df[missing_df['missing_count']>=num*fraction]['column_name'].values
    else:
        return missing_df[missing_df['missing_count']<num*fraction]['column_name'].values   
#not_m_nan = most_nan(st_train,18255,0.8,False)
## 1. AT LEAST 99 % NON-NAN VALUES

def delete_columns(df,num,fraction,want_nan):
    m_nan = most_nan(df,num,fraction,want_nan)
    col = df.columns.values.tolist()
    remove_list=['train_id','is_female','G2P3_1','G2P3_7','G2P3_8','G2P3_9','G2P3_11','G2P3_13']
    conti=['AA14','AA15','DL8','MT6C']
    for i in remove_list:
        col.remove(i)
    col = [x for x in col if x not in m_nan]
    print(len(col))

    ## (7)
    for x in conti:
        col.remove(x)
    print(len(col))
    return col
    
col = delete_columns(num_train,18255,0.8,True)

def other_preprocessing(train,test):
    aa7_sub_list = train['AA7'].value_counts()[:20].index.values.tolist()
    train['AA7']=train.AA7.apply(lambda x:x if x in aa7_sub_list else 1)
    test['AA7']=test.AA7.apply(lambda x:x if x in aa7_sub_list else 1)
    
    ## (3) DG1->AGE
    train['DG1']=2018-train['DG1']
    test['DG1']=2018-test['DG1']
    for i in range(33):
        train.loc[(train['MT11']>3*i) & (train['MT11']<=3*(i+1)),'MT11']=i+1
        test.loc[(train['MT11']>3*i) & (test['MT11']<=3*(i+1)),'MT11']=i+1
    ### Continuous features
    ### 'AA14' has 99999 -> NaN
    ### 'MT6C' has 99 -> NaN
    ### 'MT11' has 99 ->NaN
    ### 'MT11' has 98 -> 0 
    ### 'DL8' has outliers... larger than 100
    train.loc[train['AA14']==99999,'AA14']=np.nan
    train.loc[train['MT6C']==99,'MT6C']=np.nan
#    train.loc[train['MT11']==99,'MT11']=np.nan
#    train.loc[train['MT11']==98,'MT11']=0.
    train.loc[train['DL8']>=100,'DL8']=np.nan
    return train,test
    
train,test = other_preprocessing(train,test)

440
436


## 5. Data Type change : float -> object
____________________

In [50]:
def dtype_float_object(df,col):
    for i in tqdm(range(len(df[col].columns))):
        df[df[col].columns[i]]=train[df[col].columns[i]].astype(object)
    return df

train = dtype_float_object(train,col)
test = dtype_float_object(test,col)


  0%|          | 0/435 [00:00<?, ?it/s][A
  0%|          | 1/435 [00:00<01:16,  5.70it/s][A
  0%|          | 2/435 [00:00<01:10,  6.14it/s][A
  1%|          | 3/435 [00:00<01:04,  6.66it/s][A
  1%|          | 4/435 [00:00<01:02,  6.86it/s][A
  1%|          | 5/435 [00:00<01:02,  6.87it/s][A
  1%|▏         | 6/435 [00:00<01:02,  6.90it/s][A
  2%|▏         | 7/435 [00:01<01:01,  6.94it/s][A
  2%|▏         | 8/435 [00:01<01:00,  7.03it/s][A
  2%|▏         | 9/435 [00:01<01:00,  7.08it/s][A
  2%|▏         | 10/435 [00:01<00:59,  7.09it/s][A
  3%|▎         | 11/435 [00:01<00:59,  7.12it/s][A
  3%|▎         | 12/435 [00:01<00:59,  7.09it/s][A
  3%|▎         | 13/435 [00:01<01:00,  7.01it/s][A
  3%|▎         | 14/435 [00:02<01:00,  6.92it/s][A
  3%|▎         | 15/435 [00:02<01:01,  6.83it/s][A
  4%|▎         | 16/435 [00:02<01:01,  6.84it/s][A
  4%|▍         | 17/435 [00:02<01:00,  6.87it/s][A
  4%|▍         | 18/435 [00:02<01:00,  6.88it/s][A
  4%|▍         | 19/435 [00:0

In [52]:
train_temp=train.drop(['is_female','train_id'],axis=1)
test_temp=test.drop(['test_id'],axis=1)

## 6. Make New Features
__________

In [62]:
# I saved new dictionary to "my_file.npy"
# np.save('my_file.npy', add_columns) 

## add features
## sum -> object
## mean -> continous data
## std -> continuous data

add_columns = np.load('my_file.npy').item()
def add_new_features(df,add_columns):
    add=['add'+str(i+1) for i in range(len(add_columns.keys()))]
    for i in add_columns.keys():
        df[i]=df.loc[:,add_columns[i]].sum(axis=1)
        df[i+"m"]=df.loc[:,add_columns[i]].mean(axis=1)
        df[i+"s"]=df.loc[:,add_columns[i]].std(axis=1)
        df[i]=df[i].astype(object)
    return(df)

train_temp = add_new_features(train_temp,add_columns)
test_temp = add_new_features(test_temp,add_columns)

## 7. One-Hot Encoding
__________________
| Data | Total # (Before ONE-HOT coding)| Total # (After ONE-HOT coding)
| :- |:-------------: | :-:| :-:
|Train Data|676 | 4271
|Test Data| 676 | 5676

In [64]:
final_list=col.copy()
add=['add'+str(i+1) for i in range(len(add_columns.keys()))]
for i in conti:
    final_list.append(i)
for j in not_most_nan:
    final_list.append(j)
for k in add:
    final_list.append(k)
    final_list.append(k+'m')
    final_list.append(k+'s')
final_list.remove('add33')
final_list.remove('add33m')
final_list.remove('add33s')

print(train_temp[final_list].shape,test_temp[final_list].shape)

train_final = pd.get_dummies(train_temp[final_list], dummy_na=True)
test_final = pd.get_dummies(test_temp[final_list], dummy_na=True)
print(train_final.shape,test_final.shape)

## 8. Find overlapping columns
__________________
I will only use the columns in Both train & test data set

| Data |  Total # (After ONE-HOT coding)
| :- | :-:| :-:
|Train Data| 4271
|Test Data| 5676

-> <b>Total 4091 features</b> overlapped between train set and test set.

In [68]:
train_columns = train_final.columns.values
test_columns = test_final.columns.values

n=0
new_columns=[]
for i in range(len(test_columns)):
    if test_columns[i] in train_columns:
        n+=1
        new_columns.append(test_columns[i])
print(len(train_columns), len(test_columns), len(new_columns))

4271 5676 4091


## 9. Model Fitting 
________________
###  (1) Cross validation

In [73]:
y = train.is_female.values
X = train_final
X = X[new_columns]

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0, test_size=.10)
#X_val, X_test, y_val, y_test = train_test_split(X_val, y_val, random_state=0, test_size=.50)

dtrain_all = xgb.DMatrix(X.values, y,feature_names=X.columns.values)
dtrain = xgb.DMatrix(X_train, y_train, feature_names=X.columns.values)
dval = xgb.DMatrix(X_val, y_val, feature_names=X.columns.values)
#dtest = xgb.DMatrix(X_test, feature_names=X.columns.values)

xgb_params = {
    'eta': 0.05,
    'min_child_weight': 6,
    'n_trees': 10000, 
    'max_depth': 100,
    'subsample': 0.85,
    'colsample_bytree': 0.5,
    'objective': "binary:logistic",
    'base_score': y.mean(),
    'silent': 1,
    'eval_metric':"auc"
}

cv_results = xgb.cv(
    xgb_params,
    dtrain_all,
    num_boost_round=1000,
    seed=42,
    nfold=5,
    early_stopping_rounds=20
)

cv_results

Unnamed: 0,test-auc-mean,test-auc-std,train-auc-mean,train-auc-std
0,0.950329,0.002732,0.961995,0.000408
1,0.953348,0.001772,0.968216,0.000688
2,0.956560,0.002022,0.971998,0.000825
3,0.958809,0.001277,0.974170,0.000557
4,0.959265,0.002072,0.975554,0.000361
5,0.960311,0.001635,0.976704,0.000468
6,0.960869,0.001571,0.977491,0.000571
7,0.961448,0.001509,0.978358,0.000589
8,0.961878,0.001699,0.979061,0.000346
9,0.962304,0.001889,0.979486,0.000367


###  (2) Final model training

In [36]:
dtrain_all = xgb.DMatrix(X.values, y, feature_names=X.columns.values)
xg_model = xgb.train(dict(xgb_params, silent=0), dtrain_all, num_boost_round=500)

## 10. Make a submission file
_____

In [37]:
final_test = test_final[new_columns]
dtest = xgb.DMatrix(final_test, feature_names=X.columns.values)
preds = xg_model.predict(dtest)
sub = pd.read_csv("sample_submission.csv")
sub['is_female']=preds
sub.to_csv("final.csv",index=False)

Done!!!!! :D