# Gender Prediction for E-Commerce

With the evolution of information and communication technologies and the rapid growth of the Internet as a channel for exchanging and distributing information, electronic commerce (e-commerce) has gained massive momentum globally and attracts a growing number of users worldwide by removing constraints of time and distance.

Data-driven analytics can give in-depth insight into e-commerce: which factors affect product sales, and how customer characteristics shape purchase habits.

Gender is one such characteristic. Knowing it helps e-commerce companies understand customers' demands, habits, concerns, perceptions, and interests.

However, users' genders are generally unavailable on e-commerce platforms. To address this gap, the aim here is to predict the gender of e-commerce users from their product viewing records.



About Data Source:
PAKDD 2015 Conference


Problem Statement: Predict the gender of e-commerce users from their product viewing records.

In [0]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline  

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [0]:
import os, sys

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

In [0]:
import plotly.graph_objs as go
import seaborn as sns
from plotly.offline import init_notebook_mode, iplot, plot

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
PATH = '/content/drive/My Drive/Kaggle/janata'

In [0]:
os.getcwd()

'/content'

In [0]:
os.chdir(PATH)
os.getcwd()

'/content/drive/My Drive/Kaggle/janata'

In [0]:
view_data = pd.read_csv("train_8wry4cB.csv")

In [0]:
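# Without parentheses this displays the bound method (and the frame's repr);
# view_data.head() would return the first five rows.
view_data.head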

<bound method NDFrame.head of       session_id  ...  gender
0         u16159  ...  female
1         u10253  ...    male
2         u19037  ...  female
3         u14556  ...  female
4         u24295  ...    male
...          ...  ...     ...
10495     u15442  ...  female
10496     u17986  ...  female
10497     u22508  ...  female
10498     u17087  ...  female
10499     u23137  ...  female

[10500 rows x 5 columns]>

In [0]:
view_data.dtypes

session_id     object
startTime      object
endTime        object
ProductList    object
gender         object
dtype: object

In [0]:


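# Split each session's ';'-separated product list and store the resulting
# Python list as its string representation, ready for the text vectorizer.
view_data['New ProductList'] = view_data['ProductList'].apply(
    lambda line: str(line.split(';'))
)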


In [0]:
view_data['New ProductList'].head()

0    ['A00002/B00003/C00006/D28435/', 'A00002/B0000...
1    ['A00001/B00009/C00031/D29404/', 'A00001/B0000...
2                     ['A00002/B00001/C00020/D16944/']
3    ['A00002/B00004/C00018/D10284/', 'A00002/B0000...
4    ['A00001/B00001/C00012/D30805/', 'A00001/B0000...
Name: New ProductList, dtype: object
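
To see what a text vectorizer makes of one of these strings, the sketch below runs scikit-learn's analyzer on a single session built from two product paths shown in the output above. The default analyzer lowercases the text and keeps runs of two or more word characters, so every category ID becomes one token. The `ngram_range=(2, 3)` variant is an assumption, included only because the `Tfidf_Mat` column names printed later in this notebook look like 2-grams and 3-grams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One session string in the shape of the 'New ProductList' column,
# combining two product paths that appear in the head() output above.
session = "['A00002/B00001/C00020/D16944/', 'A00002/B00004/C00018/D10284/']"

# Default settings: lowercase, tokens are runs of 2+ word characters.
unigram_analyzer = TfidfVectorizer().build_analyzer()
unigrams = unigram_analyzer(session)

# Assumed setting: word 2-grams and 3-grams instead of single tokens.
ngram_analyzer = TfidfVectorizer(ngram_range=(2, 3)).build_analyzer()
ngrams = ngram_analyzer(session)

print(unigrams)
print(ngrams[:4])
```

Note that n-grams are built over the flat token stream, so some of them (for example `d16944 a00002`) span the boundary between two products.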

In [0]:
view_data['gender'] = view_data['gender'].astype("category")

In [0]:
view_data.dtypes

session_id           object
startTime            object
endTime              object
ProductList          object
gender             category
New ProductList      object
dtype: object

In [0]:
view_data['gender'].value_counts()

female    8192
male      2308
Name: gender, dtype: int64

In [0]:
train_X, test_X, train_y, test_y = train_test_split(view_data['New ProductList'],
                                                    view_data['gender'],
                                                    test_size = 0.3,
                                                    random_state = 123)
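
Because the classes are imbalanced (8192 female against 2308 male sessions), a stratified split is one way to keep the same class ratio in both parts. The sketch below is not what the cell above does; it only illustrates the `stratify` argument of `train_test_split` on placeholder features with the reported class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported by value_counts() above.
y = np.array(['female'] * 8192 + ['male'] * 2308)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, not real data

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

share_all = np.mean(y == 'female')
share_tr = np.mean(y_tr == 'female')
share_va = np.mean(y_va == 'female')
print(round(share_all, 4), round(share_tr, 4), round(share_va, 4))
```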

In [0]:
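from sklearn.feature_extraction.text import TfidfVectorizer
# Trial 3
Tfidf_vect = TfidfVectorizer()
# Trial 1
#Tfidf_vect = TfidfVectorizer(ngram_range=(4), max_features=5000)
# Trial 2
#Tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(4), max_features=5000)
# Note: ngram_range expects a (min_n, max_n) tuple, so (4) in Trials 1 and 2 is not valid as written.
# The vectorizer is fitted on every labelled row, validation rows included;
# fitting on train_X alone would keep validation data out of the vocabulary and IDF weights.
Tfidf_vect.fit(view_data['New ProductList'])
train_X_Tfidf = Tfidf_vect.transform(train_X)
test_X_Tfidf = Tfidf_vect.transform(test_X)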
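
One way to keep validation rows out of the vocabulary and IDF weights is to wrap the vectorizer and the classifier in a single pipeline and fit it on training rows only. The sketch below is a minimal illustration, not the notebook's method: it uses the first product of the five sessions shown in the earlier outputs, with their listed genders, and holds the last one out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# First product path of rows 0-3 and their genders, as shown in the outputs above.
train_docs = [
    "['A00002/B00003/C00006/D28435/']",
    "['A00001/B00009/C00031/D29404/']",
    "['A00002/B00001/C00020/D16944/']",
    "['A00002/B00004/C00018/D10284/']",
]
train_labels = ['female', 'male', 'female', 'female']

# Row 4 plays the role of a held-out validation row.
valid_docs = ["['A00001/B00001/C00012/D30805/']"]

# The pipeline fits TF-IDF on the training documents only, then the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

vocab = model.named_steps['tfidfvectorizer'].vocabulary_
pred = model.predict(valid_docs)
print(sorted(vocab))
print(pred)
```

Tokens that occur only in the held-out row, such as `d30805`, never enter the vocabulary, which is the property a leak-free evaluation needs.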

In [0]:
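# Densify only for inspection; the models below use the sparse matrix directly.
Dense_mat = train_X_Tfidf.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
Tfidf_Mat = pd.DataFrame(Dense_mat, columns=Tfidf_vect.get_feature_names_out())
Tfidf_Mat.head()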

   a00001 b00001  a00001 b00001 c00001  ...  d33879 a00003  d33879 a00003 b00026
0            0.0                   0.0  ...            0.0                   0.0
1            0.0                   0.0  ...            0.0                   0.0
2            0.0                   0.0  ...            0.0                   0.0
3            0.0                   0.0  ...            0.0                   0.0
4            0.0                   0.0  ...            0.0                   0.0


In [0]:
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score, recall_score, precision_score
# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(train_X_Tfidf,train_y)

# predict the labels on train dataset
pred_train = Naive.predict(train_X_Tfidf)

# predict the labels on validation dataset
pred_test = Naive.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score on Train set -> ", accuracy_score(train_y, pred_train)*100)
print("Naive Bayes Accuracy Score on Validation set -> ", accuracy_score(test_y, pred_test)*100)


rec = recall_score(test_y, pred_test, pos_label='female')

prec = precision_score(test_y, pred_test, pos_label='female')

print("Recall Score on Validation set:", rec)

print("Precision Score on Validation set:", prec)

Naive Bayes Accuracy Score on Train set ->  89.48299319727892
Naive Bayes Accuracy Score on Validation set ->  85.3015873015873
Recall Score on Validation set: 0.974485596707819
Precision Score on Validation set: 0.8551823763091368
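
The recall and precision above are for the majority class (`pos_label='female'`), so they say little about how well male sessions are recognised. The three printed figures pin down the whole validation confusion matrix if the 30% split holds 3150 of the 10500 rows, which is an assumption here. The sketch below rebuilds label arrays from those inferred counts and reads off the male recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Counts inferred from the printed accuracy, recall and precision,
# assuming 3150 validation rows (30% of 10500).
f_as_f, f_as_m = 2368, 62    # true female predicted female / male
m_as_f, m_as_m = 401, 319    # true male predicted female / male

y_true = np.array(['female'] * (f_as_f + f_as_m) + ['male'] * (m_as_f + m_as_m))
y_pred = np.array(['female'] * f_as_f + ['male'] * f_as_m
                  + ['female'] * m_as_f + ['male'] * m_as_m)

cm = confusion_matrix(y_true, y_pred, labels=['female', 'male'])
acc = accuracy_score(y_true, y_pred)
female_recall = recall_score(y_true, y_pred, pos_label='female')
male_recall = recall_score(y_true, y_pred, pos_label='male')

print(cm)
print(acc, female_recall, male_recall)
```

Under these assumptions fewer than half of the male sessions are labelled correctly, even though accuracy is about 85%.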


In [0]:
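# GridSearchCV must be imported before use
from sklearn.model_selection import GridSearchCV

# Candidate values for the RBF-kernel SVM's C and gamma
Cs = [0.001, 0.01, 0.1, 1, 10]
gammas = [ 0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma' : gammas}
grid_search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=2)
grid_search.fit(train_X_Tfidf,train_y)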


GridSearchCV(cv=2, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10],
                         'gamma': [0.001, 0.01, 0.1, 1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
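
The repr above does not show which `C` and `gamma` were selected. A fitted `GridSearchCV` exposes that through `best_params_`, `best_score_` and `cv_results_`. The sketch below shows those attributes on synthetic stand-in data (the `X_toy`, `y_toy` and `toy_search` names are illustrative, not part of this notebook), using the same parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, not the competition data.
X_toy, y_toy = make_classification(n_samples=80, n_features=10, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
toy_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=2)
toy_search.fit(X_toy, y_toy)

best_params = toy_search.best_params_      # chosen C and gamma
best_score = toy_search.best_score_        # mean cross-validated accuracy
n_candidates = len(toy_search.cv_results_['params'])

print(best_params, round(best_score, 3), n_candidates)
```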

In [0]:
# Classifier - Algorithm - SVM

# Trial 1: RBF-kernel SVM with default hyperparameters
#SVM = svm.SVC(kernel='rbf')
#SVM.fit(train_X_Tfidf,train_y)

# Trial 2: grid search (refit=True, so predict() uses the best estimator found)
# predict the labels on train dataset
pred_train1 = grid_search.predict(train_X_Tfidf)

# predict the labels on validation dataset
pred_test1 = grid_search.predict(test_X_Tfidf)

# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score on Train set -> ", accuracy_score(train_y, pred_train1)*100)
print("SVM Accuracy Score on Validation set -> ",accuracy_score(test_y, pred_test1)*100)

rec1 = recall_score(test_y, pred_test1, pos_label='female')

prec1 = precision_score(test_y, pred_test1, pos_label='female')

print("Recall Score on Validation set:", rec1)

print("Precision Score on Validation set:", prec1)

SVM Accuracy Score on Train set ->  90.0952380952381
SVM Accuracy Score on Validation set ->  87.42857142857143
Recall Score on Validation set: 0.9596707818930041
Precision Score on Validation set: 0.8866920152091254


In [0]:
from xgboost import XGBClassifier

In [0]:
#Trial 1
XGB_model = XGBClassifier(n_estimators=1500, gamma=0.25,learning_rate=0.2)
%time XGB_model.fit(train_X_Tfidf, train_y)

# predict the labels on train dataset
pred_train_xg = XGB_model.predict(train_X_Tfidf)

# predict the labels on validation dataset
pred_test_xg = XGB_model.predict(test_X_Tfidf)


print("XGB Accuracy Score on Train set -> ", accuracy_score(train_y, pred_train_xg)*100)
print("XGB Accuracy Score on Validation set -> ",accuracy_score(test_y, pred_test_xg)*100)
rec1 = recall_score(test_y, pred_test_xg, pos_label='female')

prec1 = precision_score(test_y, pred_test_xg, pos_label='female')

print("Recall Score on Validation set:", rec1)

print("Precision Score on Validation set:", prec1)


CPU times: user 29.7 s, sys: 16.4 ms, total: 29.7 s
Wall time: 29.8 s
XGB Accuracy Score on Train set ->  90.70748299319727
XGB Accuracy Score on Validation set ->  86.63492063492063
Recall Score on Validation set: 0.9497942386831276
Precision Score on Validation set: 0.8853087840429612


In [0]:
import xgboost as xgb
clf = xgb.XGBClassifier()
parameters = {
     "eta"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)



In [0]:
grid.fit(train_X_Tfidf, train_y)
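
Before launching an exhaustive search it helps to count how many model fits it implies. The sketch below counts the combinations in the grid defined above with `ParameterGrid`, then draws a small random subset with `ParameterSampler`. The choice of 20 samples is an arbitrary illustration, not a recommendation from this notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as in the cell above.
parameters = {
    "eta": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
}

n_combos = len(ParameterGrid(parameters))  # 6 * 8 * 4 * 5 * 4
n_fits = n_combos * 3                      # cv=3 folds per combination

# A random subset of the grid (20 is an arbitrary illustrative budget).
sampled = list(ParameterSampler(parameters, n_iter=20, random_state=0))

print(n_combos, n_fits, len(sampled))
```

With 3840 combinations and three folds each, the exhaustive search requires 11520 fits plus the final refit. `RandomizedSearchCV` with an `n_iter` budget is the usual way to search a sample of such a grid.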

In [0]:
# clf itself is never fitted; the refitted best estimator is reached through grid
# predict the labels on train dataset
pred_train_xg = grid.predict(train_X_Tfidf)

# predict the labels on validation dataset
pred_test_xg = grid.predict(test_X_Tfidf)


print("XGB Accuracy Score on Train set -> ", accuracy_score(train_y, pred_train_xg)*100)
print("XGB Accuracy Score on Validation set -> ",accuracy_score(test_y, pred_test_xg)*100)
rec1 = recall_score(test_y, pred_test_xg, pos_label='female')

prec1 = precision_score(test_y, pred_test_xg, pos_label='female')

print("Recall Score on Validation set:", rec1)

print("Precision Score on Validation set:", prec1)

In [0]:
from sklearn.ensemble import RandomForestClassifier

In [0]:
clf1 = RandomForestClassifier(n_estimators=1000,max_depth=5)
clf1.fit(X=train_X_Tfidf, y=train_y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [0]:
pred_train_rf = clf1.predict(train_X_Tfidf)
pred_test_rf = clf1.predict(test_X_Tfidf)

In [0]:
print("RF Accuracy Score on Train set -> ", accuracy_score(train_y, pred_train_rf)*100)
print("RF Accuracy Score on Validation set -> ",accuracy_score(test_y, pred_test_rf)*100)
rec1 = recall_score(test_y, pred_test_rf, pos_label='female')

prec1 = precision_score(test_y, pred_test_rf, pos_label='female')

print("Recall Score on Validation set:", rec1)

print("Precision Score on Validation set:", prec1)

RF Accuracy Score on Train set ->  78.39455782312925
RF Accuracy Score on Validation set ->  77.14285714285715
Recall Score on Validation set: 1.0
Precision Score on Validation set: 0.7714285714285715
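
A recall of 1.0 combined with a precision equal to the accuracy is the signature of a model that predicts `female` for every validation row. The sketch below checks this with a constant predictor, assuming the validation split holds 2430 female and 720 male sessions. Those counts are inferred from the reported precision (2430/3150) and are not printed anywhere in the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Assumed validation class counts, inferred from the reported precision.
y_true = np.array(['female'] * 2430 + ['male'] * 720)

# A constant predictor that always answers 'female'.
y_pred = np.array(['female'] * len(y_true))

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label='female')
prec = precision_score(y_true, y_pred, pos_label='female')

print(acc * 100, rec, prec)
```

The constant predictor gives the same validation accuracy, recall and precision as the random forest above, so with these settings the forest appears to add nothing over the majority-class baseline.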


In [0]:
view_data_test = pd.read_csv("test_Yix80N0.csv")

In [0]:
view_data_test.dtypes

session_id     object
startTime      object
endTime        object
ProductList    object
dtype: object

In [0]:
view_data_test.shape

(4500, 4)

In [0]:
# Same preprocessing as for the training data: split on ';' and keep the list's string form.
view_data_test['New ProductList'] = view_data_test['ProductList'].apply(
    lambda line: str(line.split(';'))
)

In [0]:

test_file = Tfidf_vect.transform(view_data_test['New ProductList'])


In [0]:


# predict the labels for the submission (test) file
# Trial 1
#pred_subm = Naive.predict(test_file)
# Trial 3
pred_subm = grid_search.predict(test_file)
# Trial 2
#pred_subm = XGB_model.predict(test_file)


In [0]:
pred_subm.shape

(4500,)

In [0]:
sub_data = pd.read_csv("Submission_v1.csv")

In [0]:
sub_data.shape

(4500, 4)

In [0]:
sub_data.dtypes

session_id     object
startTime      object
endTime        object
ProductList    object
dtype: object

In [0]:
sub_data.drop(['startTime', 'endTime', 'ProductList'], axis=1, inplace=True)

In [0]:
sub_data.shape

(4500, 1)

In [0]:
pred_subm[2]

'female'

In [0]:
sub_data['gender'] = pred_subm

In [0]:
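# Without parentheses this displays the bound method (and the frame's repr);
# sub_data.head() would return the first five rows.
sub_data.head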

<bound method NDFrame.head of      session_id  gender
0        u12112  female
1        u19725  female
2        u11795  female
3        u22639  female
4        u18034  female
...         ...     ...
4495     u23966    male
4496     u20527  female
4497     u13253  female
4498     u17094    male
4499     u24310  female

[4500 rows x 2 columns]>

In [0]:
# index=False keeps the row index out of the file, leaving only session_id and gender
sub_data.to_csv("Submission_v6.csv", index=False)