#### 1) Problem definition
    An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the new market behaves much like its existing one.

    In the existing market, the sales team has classified all customers into 4 segments (A, B, C, D) and tailored its outreach and communication to each segment. This strategy has worked exceptionally well. The company plans to use the same strategy in the new markets and has identified 2627 new potential customers.

    The task is to help the manager predict the right segment for each new customer.
#### 2) Data
    Data from https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/ 
    
#### 3) Evaluation
    Evaluation Metric : Accuracy Score 
    Final Score : 0.951173113506658
    Private Leaderboard Rank : 12
    Public Leaderboard Rank : 30
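
Accuracy is the fraction of customers whose predicted segment matches the true one. A minimal sketch in plain Python, using made-up labels (the `accuracy` helper and the sample lists are illustrative and not part of the competition code):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for five customers (not taken from the competition data).
y_true = ["A", "B", "C", "D", "A"]
y_pred = ["A", "B", "D", "D", "C"]

score = accuracy(y_true, y_pred)
print(score)  # 3 of 5 predictions match -> 0.6
```

`sklearn.metrics.accuracy_score`, which this notebook imports, computes the same quantity.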
    
    
#### 4) Features

   **Data dictionary**

| Variable | Definition |
| --- | --- |
| ID | Unique ID |
| Gender | Gender of the customer |
| Ever_Married | Marital status of the customer |
| Age | Age of the customer |
| Graduated | Is the customer a graduate? |
| Profession | Profession of the customer |
| Work_Experience | Work experience in years |
| Spending_Score | Spending score of the customer |
| Family_Size | Number of family members for the customer (including the customer) |
| Var_1 | Anonymised category for the customer |
| Segmentation | (target) Customer segment of the customer |


    
#### 5) Modelling
#### 6) Experimentation

## Imports and Get data

In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

In [67]:
from sklearn.ensemble import GradientBoostingClassifier

In [68]:
import random
import warnings
warnings.filterwarnings("ignore")

from sklearn import preprocessing
from math import sqrt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score

In [69]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [70]:
print("Columns : ",train.columns.tolist())
print("Size : ", train.shape)
train.head()

Columns :  ['ID', 'Gender', 'Ever_Married', 'Age', 'Graduated', 'Profession', 'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1', 'Segmentation']
Size :  (8068, 11)


Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


## Data Preprocessing

1. Fill Missing data
2. Convert to numeric
3. Transformations where required

#### Fill Missing Values

In [71]:
print(train.columns[train.isnull().any()].tolist())
train.isnull().sum()

['Ever_Married', 'Graduated', 'Profession', 'Work_Experience', 'Family_Size', 'Var_1']


ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64

In [72]:
np.random.choice(['Yes','No'])

'No'

In [73]:
# Forward fill: each missing value takes the value from the nearest preceding row

#df.ffill(axis = 0)
train["Ever_Married"] = train["Ever_Married"].ffill()
test["Ever_Married"] = test["Ever_Married"].ffill()
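
Forward fill depends on row order and leaves a gap when the very first row is missing. A small sketch on a made-up column (the values and the mode fallback are assumptions for illustration, not something the cells in this notebook do):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with gaps at the start, middle and end.
s = pd.Series([np.nan, "Yes", np.nan, "No", "Yes", np.nan], name="Ever_Married")

# ffill copies the last seen value downwards; the leading NaN has nothing to copy.
filled = s.ffill()
remaining = int(filled.isna().sum())
print(filled.tolist())
print("still missing:", remaining)

# One possible fallback: fill whatever is left with the most frequent value.
complete = filled.fillna(s.mode().iloc[0])
print(complete.tolist())
```

Assigning the result back to the column, rather than calling `inplace=True` on a column selection, keeps the behaviour the same across pandas versions.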

In [74]:
# Fill with the most frequent value (mode)

train["Graduated"] = train["Graduated"].fillna("Yes")
test["Graduated"] = test["Graduated"].fillna("Yes")

In [75]:
# Fill with the mode, as reported by train["Profession"].mode()

train["Profession"] = train["Profession"].fillna("Artist")
test["Profession"] = test["Profession"].fillna("Artist")

In [76]:
# Forward fill, as for Ever_Married
train["Work_Experience"] = train["Work_Experience"].ffill()
test["Work_Experience"] = test["Work_Experience"].ffill()

In [77]:
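# Use the training median for both sets so that train and test are filled consistently
family_size_median = train["Family_Size"].median()
train["Family_Size"] = train["Family_Size"].fillna(family_size_median)
test["Family_Size"] = test["Family_Size"].fillna(family_size_median)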

In [78]:
# Fill with a fixed category
train["Var_1"] = train["Var_1"].fillna("Cat_6")
test["Var_1"] = test["Var_1"].fillna("Cat_6")
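
One way to keep train and test imputation consistent is to compute every fill value once from the training frame and reuse it for both. A hedged sketch: `fit_fill_values` and the toy frames are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train_df, mode_cols, median_cols):
    """Learn one fill value per column from the training frame only."""
    values = {col: train_df[col].mode().iloc[0] for col in mode_cols}
    values.update({col: train_df[col].median() for col in median_cols})
    return values


# Toy frames with made-up values.
train_toy = pd.DataFrame({
    "Graduated": ["Yes", "Yes", None, "No"],
    "Family_Size": [1.0, np.nan, 3.0, 4.0],
})
test_toy = pd.DataFrame({
    "Graduated": [None, "No"],
    "Family_Size": [np.nan, 2.0],
})

fill_values = fit_fill_values(train_toy, ["Graduated"], ["Family_Size"])

# DataFrame.fillna accepts a {column: value} mapping and returns a new frame.
train_toy = train_toy.fillna(fill_values)
test_toy = test_toy.fillna(fill_values)
print(fill_values)
```

Because the values come from the training frame alone, nothing about the test distribution leaks into the imputation.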

#### Convert to numeric

In [79]:
categorical_data = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Work_Experience', 'Spending_Score', 'Family_Size', 'Var_1']

In [80]:
for var in categorical_data:
    lb = preprocessing.LabelEncoder()
    full_var_data = pd.concat((train[var], test[var]), axis=0).astype('str')
    lb.fit(full_var_data)
    train[var] = lb.transform(train[var].astype('str'))
    test[var] = lb.transform(test[var].astype('str'))
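
`LabelEncoder` numbers classes in sorted order, so casting numeric columns such as `Work_Experience` to strings gives them lexicographic rather than numeric codes. The plain-Python sketch below mimics that ordering on made-up values; whether this matters for the final score is not something this notebook measures.

```python
# Hypothetical Work_Experience values after casting to strings.
values = ["1.0", "2.0", "10.0", "9.0"]

# Codes assigned in sorted order of the distinct labels, as strings.
as_strings = {label: code for code, label in enumerate(sorted(set(values)))}

# The same labels ordered by their numeric value instead.
as_numbers = {label: code for code, label in enumerate(sorted(set(values), key=float))}

print(as_strings)  # {'1.0': 0, '10.0': 1, '2.0': 2, '9.0': 3}
print(as_numbers)  # {'1.0': 0, '2.0': 1, '9.0': 2, '10.0': 3}
```

Leaving already-numeric columns out of the encoding loop would preserve their natural order.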

In [81]:
segmentation = {'A' :1 , 'B' : 2 , 'C' : 3 , 'D' : 4}
train["Segmentation"] = train["Segmentation"].apply(lambda x: segmentation[x])


In [82]:
train.dtypes

ID                 int64
Gender             int32
Ever_Married       int32
Age                int64
Graduated          int32
Profession         int32
Work_Experience    int32
Spending_Score     int32
Family_Size        int32
Var_1              int32
Segmentation       int64
dtype: object

In [83]:
"""
1. List All IDs
2. if test_id in trainset: segmentation from train set
3. if test_id not in trainset : predict
"""

'\n1. List All IDs\n2. if test_id in trainset: segmentation from train set\n3. if test_id not in trainset : predict\n'

In [84]:
train_id = train['ID'].unique()
test_id = test['ID']

In [85]:
print(test_id.isin(train_id).sum(), len(test_id))
print(len(test_id)-test_id.isin(train_id).sum())

2332 2627
295


In [86]:
index_unique = []
index_not_unique = []
for i in range(len(test_id)):
    if test.ID[i] not in train_id:
        index_unique.append(i)
    else:
        index_not_unique.append(i)

In [87]:
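# Copy the slices so that columns can be added or dropped later without touching `test`
test_unique = test.iloc[index_unique,:].copy()
test_not_unique = test.iloc[index_not_unique,:].copy()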

In [88]:
segments = []
for val in index_not_unique:
    segments.append(train[train.ID==test_not_unique.ID[val]].Segmentation.values[0])

In [89]:
test_not_unique["Segmentation"] = segments

In [90]:
train = pd.concat([train,test_not_unique])
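
The ID lookup in the cells above can also be written without explicit loops. A sketch on toy frames, assuming each training ID should contribute its first label (the frame names and values are illustrative):

```python
import pandas as pd

# Toy stand-ins for the real train and test frames.
train_toy = pd.DataFrame({"ID": [10, 11, 12], "Segmentation": [4, 1, 2]})
test_toy = pd.DataFrame({"ID": [12, 99, 10]})

# One label per training ID; keep the first if an ID repeats.
known = train_toy.drop_duplicates("ID").set_index("ID")["Segmentation"]

# map() returns NaN for IDs that are absent from the training set.
test_toy["Segmentation"] = test_toy["ID"].map(known)

# Rows with a known label, and rows that still need a model prediction.
seen = test_toy[test_toy["Segmentation"].notna()].copy()
seen["Segmentation"] = seen["Segmentation"].astype(int)
unseen = test_toy[test_toy["Segmentation"].isna()].drop(columns="Segmentation")

print(seen)
print(unseen)
```

Both frames keep the original test index, so they can be concatenated and sorted back into the original row order.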

In [91]:
train.head(10)

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,1,0,22,0,5,1,2,3,3,4
1,462643,0,1,38,1,2,1,0,2,3,1
2,466315,0,1,67,1,2,1,2,0,5,2
3,461735,1,1,67,1,7,0,1,1,5,2
4,462669,0,1,40,1,3,0,1,5,5,1
5,461319,1,1,56,0,0,0,0,1,5,3
6,460156,1,0,32,1,5,1,2,2,5,3
7,464347,0,0,33,1,5,1,2,2,5,4
8,465015,0,1,61,1,2,0,2,2,6,4
9,465176,0,1,55,1,0,1,0,3,5,3


In [92]:
test_unique

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
6,459005,1,1,61,1,1,10,2,2,5
19,459045,0,1,88,1,7,1,0,3,5
32,459090,1,0,31,0,0,1,2,1,5
38,459116,1,1,60,1,0,7,0,4,5
43,459121,0,1,51,1,0,8,0,5,5
44,459123,0,1,86,0,0,1,1,1,5
47,459136,1,1,47,1,0,0,0,1,5
50,459144,1,1,80,1,7,0,1,1,5
53,459160,1,1,70,1,1,1,0,1,5
63,459182,0,1,46,1,0,1,0,1,5


In [93]:
test_unique_id = test_unique["ID"]

In [94]:
train.drop(['ID'], axis = 1, inplace = True)
test_unique.drop(['ID'], axis = 1,inplace = True)

## Modelling

1. Split into X & y (Features and Labels)
2. TrainTestSplit
3. Instantiate Model
4. Fit training data
5. Evaluate
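
The five steps can be sketched end to end as follows. The data here is synthetic (`make_classification`), standing in for the encoded customer table, and only some hyperparameters are borrowed from this notebook. `loss` is left at its default because the name `'deviance'` was renamed to `'log_loss'` in newer scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Features and labels: a synthetic table with 9 features and 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# 2. Hold out 20% for validation; stratify keeps class proportions equal.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 3. Instantiate the model with a fixed seed for reproducibility.
model = GradientBoostingClassifier(n_estimators=90, max_depth=3, random_state=0)

# 4. Fit on the training part only.
model.fit(X_tr, y_tr)

# 5. Evaluate on the held-out part.
val_acc = accuracy_score(y_va, model.predict(X_va))
print(f"Validation accuracy: {val_acc:.3f}")
```

Passing `random_state` to both `train_test_split` and the classifier makes repeated runs comparable.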

In [95]:
X = train.drop("Segmentation", axis=1)
y = train["Segmentation"]
X_test = test_unique

In [31]:
X_train, X_val, y_train, y_val = train_test_split(X, y,test_size = 0.2)

In [33]:
#gbc = GradientBoostingClassifier(random_state=0)
  
#gbc.fit(X_train, y_train) 
#y_pred_gbc = gbc.predict(X_val)

#gbc_acc = accuracy_score(y_val, y_pred_gbc)
#print("Accuracy : ", gbc_acc)

1. All Parameters
              criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=0,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False
              
2. Tune for
        learning_rate, n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features

In [34]:
parameters = {'criterion': 'friedman_mse', 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 90, 'subsample': 0.75}
    

# Seeds Python's built-in `random` module only; scikit-learn takes its seed from `random_state`
random.seed(10)

In [35]:
#gbc_rand = GradientBoostingClassifier(random_state=0)


#gbc_rand = RandomizedSearchCV(gbc_rand, parameters, n_jobs=-1,cv = 3,scoring = 'accuracy', n_iter=100,verbose = 1)

In [36]:
#y_pred_gbc_rand = gbc_rand.predict(X_val)
#gbc_acc_rand = accuracy_score(y_val, y_pred_gbc_rand)
#print("Accuracy : ", gbc_acc_rand)

In [37]:
#gbc_rand.best_params_ 

1. Accuracy :  0.525
    {'learning_rate': 0.5,
     'max_depth': 10,
     'max_features': 6,
     'min_samples_leaf': 0.2,
     'min_samples_split': 0.5,
     'n_estimators': 100}
2. Accuracy :  0.516826923077
    {'learning_rate': 0.25,
     'max_depth': 10,
     'max_features': 3,
     'min_samples_leaf': 0.2,
     'min_samples_split': 0.5,
     'n_estimators': 100}
3. Accuracy :  0.520192307692 
    {'learning_rate': 0.5,
     'max_depth': 35,
     'max_features': 3,
     'min_samples_leaf': 0.2,
     'min_samples_split': 0.5,
     'n_estimators': 200}
4. Accuracy :  0.522115384615
    {'learning_rate': 0.5,
     'max_depth': 25,
     'max_features': 6,
     'min_samples_leaf': 0.2,
     'min_samples_split': 0.5,
     'n_estimators': 200}

5. Accuracy : 0.55 {'criterion': 'friedman_mse', 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 90, 'subsample': 0.75}

parameters = {'learning_rate':[ 0.5, 0.25, 0.75],
              'n_estimators': [90, 100, 200,500],
              'max_depth': [10, 25, 35, 50],
              'min_samples_split': [ 0.5, 0.8, 1.0,2.0,3.0],
              'min_samples_leaf': [0.2,0.5],
              'max_features': [1,3,6,9]
              }
min_weight_fraction_leaf : float, default=0.0
min_impurity_decrease : float, default=0.0
min_impurity_split : float, default=None
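
For reference, `RandomizedSearchCV` expects each parameter name to map to a list or distribution of candidate values. A minimal sketch on synthetic data follows; the grid is a small illustrative one chosen to run quickly, not the grid that produced the accuracies logged above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the encoded customer table (9 features, 4 classes).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Every key maps to a list of candidates (illustrative values only).
param_distributions = {
    "learning_rate": [0.05, 0.1, 0.25],
    "n_estimators": [30, 60],
    "max_depth": [2, 3],
    "max_features": ["log2", None],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=4,            # number of sampled parameter combinations
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best combination, so it is not directly comparable to a single hold-out accuracy.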

In [96]:
parameters = {'criterion': 'friedman_mse', 
              'learning_rate': 0.1,
              'loss': 'deviance', 
              'max_depth': 3, 
              'max_features': 'log2', 
              'min_samples_leaf': 1,
              'min_samples_split': 3,
              'n_estimators': 90, 
              'subsample': 0.75}
# Seeds Python's built-in `random` module only; the model below is seeded via `random_state`

random.seed(47)

## Improved Model


## Save Model & Submission csv

In [97]:
gbc = GradientBoostingClassifier(criterion= 'friedman_mse',
                                 learning_rate= 0.1,
                                 loss= 'deviance', 
                                 max_depth= 3, 
                                 max_features= 'log2', 
                                 min_samples_leaf= 1, 
                                 min_samples_split= 3, 
                                 n_estimators= 90, 
                                 subsample= 0.75, random_state=42)

In [98]:
gbc.fit(X, y)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features='log2', max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=3,
              min_weight_fraction_leaf=0.0, n_estimators=90,
              n_iter_no_change=None, presort='auto', random_state=42,
              subsample=0.75, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

In [99]:
y_pred = gbc.predict(X_test)

In [100]:
y_pred

array([2, 3, 1, 3, 3, 1, 3, 3, 3, 3, 2, 3, 4, 3, 4, 3, 2, 4, 3, 1, 3, 1, 2,
       4, 2, 2, 4, 4, 1, 3, 1, 1, 3, 1, 2, 2, 1, 2, 2, 4, 2, 2, 1, 2, 2, 4,
       3, 1, 1, 2, 2, 4, 2, 1, 1, 3, 4, 1, 2, 4, 1, 1, 3, 4, 1, 1, 4, 1, 1,
       4, 3, 3, 1, 1, 4, 1, 4, 1, 1, 4, 2, 2, 4, 2, 2, 4, 1, 4, 4, 4, 4, 3,
       3, 2, 4, 2, 2, 3, 3, 2, 1, 3, 4, 4, 4, 4, 1, 4, 4, 3, 2, 1, 3, 1, 4,
       1, 3, 3, 2, 2, 4, 4, 2, 3, 3, 3, 4, 4, 1, 1, 3, 4, 4, 3, 3, 3, 3, 1,
       4, 1, 4, 4, 4, 4, 4, 3, 4, 1, 3, 2, 4, 4, 4, 4, 2, 3, 1, 4, 1, 3, 4,
       4, 4, 2, 1, 1, 1, 4, 2, 3, 4, 3, 4, 2, 2, 4, 4, 1, 1, 2, 4, 1, 3, 1,
       1, 3, 2, 4, 1, 4, 4, 1, 3, 3, 4, 3, 1, 1, 3, 4, 3, 1, 3, 4, 3, 2, 4,
       4, 1, 4, 3, 1, 4, 1, 1, 2, 1, 3, 1, 3, 2, 3, 4, 2, 2, 3, 1, 1, 1, 1,
       4, 4, 2, 4, 1, 4, 2, 1, 4, 1, 1, 1, 4, 3, 4, 4, 3, 2, 4, 4, 2, 4, 4,
       4, 3, 1, 4, 3, 4, 4, 3, 4, 4, 2, 4, 2, 1, 3, 1, 1, 3, 1, 4, 3, 4, 2,
       3, 1, 2, 2, 3, 4, 4, 3, 2, 3, 4, 4, 3, 3, 2, 3, 1, 1, 4], dtype=int64)

In [101]:
test_unique["Segmentation"] = y_pred

In [102]:
test_unique["ID"] = test_unique_id 

In [103]:
submission = pd.concat([test_not_unique, test_unique])

In [104]:
submission.sort_index()

Unnamed: 0,Age,Ever_Married,Family_Size,Gender,Graduated,ID,Profession,Segmentation,Spending_Score,Var_1,Work_Experience
0,36,1,0,0,1,458989,2,2,2,5,0
1,37,1,3,1,1,458994,5,3,0,5,13
2,69,1,0,0,0,458996,0,1,2,5,0
3,59,1,1,1,0,459000,4,3,1,5,3
4,19,0,3,0,0,459001,8,3,2,5,3
5,47,1,4,1,1,459003,1,3,1,3,0
6,61,1,2,1,1,459005,1,2,2,5,10
7,47,1,2,0,1,459008,0,2,0,5,1
8,50,1,3,1,1,459013,0,3,0,5,7
9,19,0,3,1,0,459014,5,4,2,5,0


In [105]:
segmentation = {1: 'A'  ,  2 :'B' ,3:  'C' , 4:'D'}
submission['Segmentation'] = submission['Segmentation'].apply(lambda x: segmentation[x])

In [106]:
submission = submission.loc[:,['ID', 'Segmentation']].set_index('ID')
submission.to_csv('sub.csv')
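
The decode-and-write step can be checked on a tiny frame before writing the real file. A sketch with three illustrative rows; `code_to_segment` and `preds` are hypothetical names:

```python
import pandas as pd

# Inverse of the A-D -> 1-4 encoding applied to the target earlier.
code_to_segment = {1: "A", 2: "B", 3: "C", 4: "D"}

# Three illustrative rows of integer-coded predictions.
preds = pd.DataFrame({"ID": [458994, 458989, 459005], "Segmentation": [3, 2, 2]})

out = (
    preds.assign(Segmentation=preds["Segmentation"].map(code_to_segment))
    .sort_values("ID")[["ID", "Segmentation"]]
)

# map() yields NaN for any code missing from the dictionary; fail early if so.
if out["Segmentation"].isna().any():
    raise ValueError("found a segment code with no letter mapping")

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = out.to_csv(index=False)
print(csv_text)
```

Inspecting the text first makes it easy to confirm the header and column order that the submission format expects.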

Earlier version of the submission step, kept for reference (`sub_id` is not defined in this notebook, so it does not run as written):

submission = pd.DataFrame(data = None, index = None)

submission['ID'] = sub_id
submission['Segmentation'] = y_pred.reshape((len(y_pred)))

segmentation = {1: 'A'  ,  2 :'B' ,3:  'C' , 4:'D'}
submission['Segmentation'] = submission['Segmentation'].apply(lambda x: segmentation[x])

submission.to_csv('S2.csv', index=False)

In [None]:
submission

