<img src='https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/Customer_Segmentation-thumbnail-1200x1200-90.jpg'>

# Problem Statement 


An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the behavior of the new market is similar to that of its existing market.

In the existing market, the sales team has classified all customers into 4 segments (A, B, C, D) and then performed segmented outreach and communication for each segment. This strategy has worked exceptionally well. The company plans to use the same strategy in the new markets and has identified 2627 new potential customers.

You are required to help the manager predict the right segment for each of the new customers.


#### Public and Private split:
The public leaderboard is based on 40% of the test data, while the final rank is decided on the remaining 60% (the private leaderboard).

# Loading Data

In [None]:
#for data processing
import numpy as np 
import pandas as pd

#for visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [None]:
train= pd.read_csv("../input/customer-segmentation/train.csv")
test= pd.read_csv("../input/customer-segmentation/test.csv")

In [None]:
train.head(2)

In [None]:
test.head(2)

In [None]:
#Flag the source of each row so train and test can be combined now and separated again later
train['train_y_n']=1
test['train_y_n']=0
all=pd.concat([train,test])

In [None]:
all.head()

In [None]:
#Visualization to check for missing values
sns.heatmap(all.isna())
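
The heatmap gives a visual impression of where values are missing; a per-column table can make the same information easier to compare. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for the combined data, not the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the combined train/test data
toy = pd.DataFrame({
    "Ever_Married": ["Yes", None, "No", "Yes"],
    "Work_Experience": [1.0, np.nan, np.nan, 4.0],
    "Age": [22, 38, 67, 40],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (100 * toy.isna().mean()).round(2),
}).sort_values("n_missing", ascending=False)

print(missing)
```

Applied to the notebook's combined frame, the same two expressions would show which columns need imputation and how much of each column is affected.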

# Exploratory Data Analysis

Target variable: Segmentation (A, B, C, D). Potential predictors: all other columns.

## Univariate Analysis

In [None]:
all.info()

In [None]:
all.describe()

In [None]:
#ID
all['ID'].value_counts()>1

In [None]:
sum(all['ID'].value_counts()>1)

In [None]:
all[all['train_y_n']==0]['ID'].nunique()

In [None]:
#Share of test IDs that also appear in train (overlapping IDs / test rows)
2332/2627

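###### Interestingly, about 89% of the test IDs (2332 of 2627) also appear in the train data, so reusing the train labels for those IDs should give close to 89% accuracy, provided each customer's segment is the same in both files :)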
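
The overlap share can also be computed directly from the two ID columns instead of typing the counts by hand. A small sketch with made-up IDs (`train_ids` and `test_ids` are hypothetical; in the notebook they would be the `ID` columns of the train and test frames):

```python
import pandas as pd

# Hypothetical IDs standing in for train['ID'] and test['ID']
train_ids = pd.Series([1, 2, 3, 4, 5])
test_ids = pd.Series([4, 5, 6, 7])

# Boolean mask: which test IDs were already seen in train
seen_in_train = test_ids.isin(train_ids)

# The mean of a boolean mask is the share of True values
overlap_share = seen_in_train.mean()

print(f"{seen_in_train.sum()} of {len(test_ids)} test IDs overlap ({overlap_share:.1%})")
```

Note that this only measures ID overlap; whether the segment label is actually identical for an overlapping customer cannot be checked from the test file alone.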

In [None]:
all[all['ID']==462826]

Sample ID available in both train & test

In [None]:
sum(all.groupby(['ID','train_y_n'])['ID'].count()>1)

In [None]:
#Gender
sns.countplot(x='Gender',hue='Segmentation',data=all)

In [None]:
groupby_df = all[all['train_y_n']==1].groupby(['Gender', 'Segmentation']).agg({'Segmentation': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Segment D has a slightly higher share of males; otherwise there is no significant difference by gender.
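
A count table and its row percentages can also be produced with `pd.crosstab`, which may be easier to read than the nested `groupby`. This is a sketch on made-up rows (the `toy` frame is hypothetical), not the notebook's actual numbers:

```python
import pandas as pd

# Hypothetical rows standing in for the labelled train data
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Segmentation": ["A", "D", "A", "B", "D", "A"],
})

# Raw counts of each segment within each gender
counts = pd.crosstab(toy["Gender"], toy["Segmentation"])

# normalize="index" divides every row by its own total, giving row percentages
row_pcts = (pd.crosstab(toy["Gender"], toy["Segmentation"], normalize="index") * 100).round(2)

print(counts)
print(row_pcts)
```

The same two calls would work for `Ever_Married`, `Graduated` or `Spending_Score` by swapping the first argument.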

In [None]:
#Ever_Married
sns.countplot(x='Ever_Married',hue='Segmentation',data=all)

In [None]:
groupby_df = all[all['train_y_n']==1].groupby(['Ever_Married', 'Segmentation']).agg({'Segmentation': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Segments A, B and C have more married customers, while segment D has more single customers.

In [None]:
sum(all['Ever_Married'].isnull())

In [None]:
#Age
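#distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(all['Age'],kde=True)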

In [None]:
sns.set_style('whitegrid')
#Overlay the age distribution of each segment (histplot replaces the deprecated distplot)
for seg,color in zip(['A','B','C','D'],['blue','red','green','black']):
    sns.histplot(all.loc[all['Segmentation']==seg,'Age'],bins=30,color=color,kde=True,stat='density',label='Seg='+seg)
plt.legend()

In [None]:
#Graduated
sns.countplot(x='Graduated',hue='Segmentation',data=all)

In [None]:
groupby_df = all[all['train_y_n']==1].groupby(['Graduated', 'Segmentation']).agg({'Segmentation': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

Graduates are concentrated in segments A, B and C, while segment D has fewer graduates.

In [None]:
#Profession
plt.rcParams['figure.figsize'] = (10, 6)
sns.countplot(x='Profession',hue='Segmentation',data=all)

In [None]:
#Work_Experience
sns.countplot(x='Work_Experience',data=all)

In [None]:
#Spending_Score
sns.countplot(x='Spending_Score',hue='Segmentation',data=all)

In [None]:
groupby_df = all[all['train_y_n']==1].groupby(['Spending_Score', 'Segmentation']).agg({'Segmentation': 'count'})
groupby_pcts = groupby_df.groupby(level=0).apply(lambda x:round(100 * x / x.sum(),2))
groupby_df,groupby_pcts

In [None]:
#Family_Size
sns.countplot(x='Family_Size',hue='Segmentation',data=all)

In [None]:
#Var_1
sns.countplot(x='Var_1',hue='Segmentation',data=all)

## Bivariate Analysis

In [None]:
all.dtypes

In [None]:
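#Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(all.select_dtypes(include='number').corr(),annot=True)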

# Feature Engineering & Missing Value Treatment

In [None]:
feature_cols = all.columns.tolist()
feature_cols.remove('ID')
feature_cols.remove('Segmentation')
feature_cols.remove('train_y_n')
label_col = 'Segmentation'
print(feature_cols)

In [None]:
all.isnull().sum()

In [None]:
#Gender
all=pd.get_dummies(all,prefix='Gender',columns=['Gender'],drop_first=True)

In [None]:
all.head(2)

In [None]:
#Ever_Married
sns.countplot(x='Ever_Married',hue='Family_Size',data=all)

In [None]:
all[all['Ever_Married'].isnull()]['Family_Size'].value_counts()

In [None]:
all['Ever_Married']=all['Ever_Married'].fillna('Yes')

In [None]:
all=pd.get_dummies(all,prefix='Married',columns=['Ever_Married'],drop_first=True)

In [None]:
all.head(2)

In [None]:
#Graduated
sns.countplot(x='Graduated',data=all)

In [None]:
all['Graduated']=all['Graduated'].fillna('Yes')

In [None]:
all=pd.get_dummies(all,prefix='Graduated',columns=['Graduated'],drop_first=True)
all.head(2)

In [None]:
#Profession
#Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
all['Profession']=all['Profession'].fillna('Unknown')

In [None]:
all['Profession']=all['Profession'].astype('str')

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
all['Profession_en']=le.fit_transform(all['Profession'])

In [None]:
sns.countplot(x='Profession_en',hue='Profession',data=all)

In [None]:
all['Profession_en'].value_counts()

In [None]:
all.drop('Profession',axis=1,inplace=True)

In [None]:
#Work_Experience
#Fill missing values with the column mean (assigned back instead of inplace)
all['Work_Experience']=all['Work_Experience'].fillna(all['Work_Experience'].mean())

In [None]:
#Spending_Score
all.loc[all['Spending_Score']=='Low','Spending_Score']=1
all.loc[all['Spending_Score']=='Average','Spending_Score']=2
all.loc[all['Spending_Score']=='High','Spending_Score']=3
all['Spending_Score']=all['Spending_Score'].astype('int')
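
An equivalent way to express this ordinal encoding is a single dictionary lookup, which also makes it easy to detect categories that were not anticipated. A minimal sketch, assuming the only levels are Low, Average and High (the `scores` series and the name `SPENDING_ORDER` are hypothetical):

```python
import pandas as pd

# Ordered mapping from category to rank (assumed levels: Low < Average < High)
SPENDING_ORDER = {"Low": 1, "Average": 2, "High": 3}

# Hypothetical values standing in for the Spending_Score column
scores = pd.Series(["Low", "High", "Average", "Low"], name="Spending_Score")

# Fail loudly if a category appears that the mapping does not cover
unexpected = set(scores.dropna().unique()) - set(SPENDING_ORDER)
if unexpected:
    raise ValueError(f"Unmapped Spending_Score values: {unexpected}")

# map() looks each value up in the dictionary
encoded = scores.map(SPENDING_ORDER).astype("int")

print(encoded.tolist())
```

Keeping the order in one dictionary means the ranking Low < Average < High is stated in a single place rather than spread over three assignments.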

In [None]:
#Family_Size
#Fill missing values with the rounded column mean (assigned back instead of inplace)
all['Family_Size']=all['Family_Size'].fillna(round(all['Family_Size'].mean()))

In [None]:
#Var_1: fill missing values with 'Cat_6', then keep only the trailing digit ('Cat_6' -> 6)
all['Var_1']=all['Var_1'].fillna('Cat_6')
all['Var_1']=all['Var_1'].apply(lambda x:x[-1])
all['Var_1']=all['Var_1'].astype('int')

In [None]:
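#Train & Test Split
from sklearn.model_selection import train_test_split
train_rows = all[all['train_y_n']==1]
#60/40 stratified split so that each segment keeps its share in both parts
df_train, df_eval = train_test_split(train_rows, test_size=0.40, random_state=101, shuffle=True, stratify=train_rows[label_col])
df_train, df_eval = df_train.copy(), df_eval.copy()  #explicit copies, safe to modify later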
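
The `stratify` argument is what keeps the segment proportions the same in both parts of the split. The sketch below illustrates this on a made-up, imbalanced frame (`toy` and its class sizes are hypothetical) by comparing the class shares before and after splitting:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 20 A, 10 B, 5 C, 5 D
toy = pd.DataFrame({
    "Age": range(40),
    "Segmentation": ["A"] * 20 + ["B"] * 10 + ["C"] * 5 + ["D"] * 5,
})

# Same settings as in the notebook: 40% held out, stratified on the label
toy_train, toy_eval = train_test_split(
    toy, test_size=0.40, random_state=101, shuffle=True, stratify=toy["Segmentation"]
)

# Class shares in the full data and in each part, side by side
shares = pd.DataFrame({
    "full": toy["Segmentation"].value_counts(normalize=True),
    "train": toy_train["Segmentation"].value_counts(normalize=True),
    "eval": toy_eval["Segmentation"].value_counts(normalize=True),
}).sort_index()

print(shares)
```

Without `stratify`, small segments could by chance be under-represented in the evaluation part, which would make the evaluation accuracy less reliable.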

In [None]:
le = preprocessing.LabelEncoder()
#Fit the encoder on the training labels only, then reuse the same mapping for the eval labels
df_train['Segmentation']=le.fit_transform(df_train['Segmentation'])
df_eval['Segmentation']=le.transform(df_eval['Segmentation'])
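
A label encoder is normally fitted once and then reused, so that an integer code means the same segment in every split and predictions can be decoded afterwards. A minimal sketch with made-up labels (`train_labels` and `eval_labels` are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the Segmentation column of each split
train_labels = ["D", "A", "C", "B", "A"]
eval_labels = ["B", "D", "A"]

encoder = LabelEncoder()

# fit_transform learns the mapping (classes are sorted: A=0, B=1, C=2, D=3)
train_encoded = encoder.fit_transform(train_labels)

# transform reuses the learned mapping without refitting
eval_encoded = encoder.transform(eval_labels)

# inverse_transform turns integer codes back into the original labels
decoded = encoder.inverse_transform(eval_encoded)

print(list(encoder.classes_), list(eval_encoded), list(decoded))
```

The same `inverse_transform` call is what converts the classifier's integer predictions back to A, B, C and D for the submission file.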

In [None]:
df_train.info()

In [None]:
df_eval.info()

# Model Building

In [None]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score

In [None]:
params = {}
params['learning_rate'] = 0.04
params['max_depth'] = 18
params['n_estimators'] = 3000
params['objective'] = 'multiclass'
params['boosting_type'] = 'gbdt'
params['subsample'] = 0.7
params['random_state'] = 42
params['colsample_bytree']=0.7
params['min_data_in_leaf'] = 55
params['reg_alpha'] = 1.7
params['reg_lambda'] = 1.11
#params['class_weight']: {0: 0.44, 1: 0.4, 2: 0.37}

In [None]:
feature_cols = df_train.columns.tolist()
feature_cols.remove('ID')
feature_cols.remove('Segmentation')
feature_cols.remove('train_y_n')
label_col = 'Segmentation'
print(feature_cols)

In [None]:
cat_cols=['Spending_Score','Family_Size','Var_1','Gender_Male','Married_Yes','Graduated_Yes','Profession_en']

In [None]:
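clf = lgb.LGBMClassifier(**params)

#Early stopping and logging are passed as callbacks; the early_stopping_rounds and verbose arguments of fit() were removed in LightGBM 4.0
clf.fit(
    df_train[feature_cols], df_train[label_col],
    eval_set=[(df_train[feature_cols], df_train[label_col]), (df_eval[feature_cols], df_eval[label_col])],
    eval_metric='multi_error',
    categorical_feature=cat_cols,
    callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=1)],
)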

eval_score = accuracy_score(df_eval[label_col], clf.predict(df_eval[feature_cols]))

print('Eval ACC: {}'.format(eval_score))
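
A single accuracy number hides which segments the model confuses with each other. The sketch below shows one way to break the score down per segment using only pandas; the `y_true` and `y_pred` values are made up and do not come from the trained model:

```python
import pandas as pd

# Hypothetical true and predicted segments for six customers
y_true = pd.Series(["A", "A", "B", "C", "D", "D"], name="actual")
y_pred = pd.Series(["A", "B", "B", "C", "D", "A"], name="predicted")

# Confusion matrix: rows are actual segments, columns are predicted segments
confusion = pd.crosstab(y_true, y_pred)

# Overall accuracy is the share of exact matches
correct = y_true == y_pred
accuracy = correct.mean()

# Share of each actual segment that was predicted correctly (per-class recall)
per_class_recall = correct.groupby(y_true).mean()

print(confusion)
print(accuracy)
print(per_class_recall)
```

In the notebook, `y_true` would correspond to the evaluation labels and `y_pred` to the classifier's predictions on the evaluation features.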

In [None]:
test=all[all['train_y_n']==0]
train=all[all['train_y_n']==1]

In [None]:
#Since there is big overlap between test and train, using train data for all the overlapping IDs
sub=pd.merge(left=test['ID'],right=train[['ID','Segmentation']],how='left',on='ID')

In [None]:
#Test rows whose ID never appears in train; copy() so that a prediction column can be added safely
actual_test=test[~test['ID'].isin(train['ID'])].copy()

In [None]:
actual_test.shape

In [None]:
pred=clf.predict(actual_test[feature_cols])

In [None]:
pred=le.inverse_transform(pred)
actual_test['Segmentation']=pred

In [None]:
#Combine model predictions (IDs unseen in train) with the labels looked up from train
sub=pd.concat([actual_test[['ID','Segmentation']],sub[sub['Segmentation'].notnull()]])

In [None]:
sub[['ID','Segmentation']].to_csv('submission.csv',index = False)
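
The submission logic above (known train label where the ID overlaps, model prediction otherwise) can also be written as a lookup followed by a fill. This is a sketch on made-up frames; `train_toy`, `test_toy` and `model_pred` are hypothetical, and it assumes each ID occurs at most once in the train data so the merge does not duplicate rows:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test frames
train_toy = pd.DataFrame({"ID": [1, 2, 3], "Segmentation": ["A", "B", "C"]})
test_toy = pd.DataFrame({"ID": [2, 3, 4, 5]})

# Stand-in for model predictions made on every test row
model_pred = pd.Series(["D", "D", "A", "B"], index=test_toy.index)

# Left merge keeps the test row order; IDs missing from train get NaN
known = test_toy.merge(train_toy, on="ID", how="left")["Segmentation"]

# Use the known label where available, otherwise fall back to the prediction
submission = test_toy[["ID"]].copy()
submission["Segmentation"] = known.fillna(model_pred)

print(submission)
```

One property of this variant is that the submission keeps the original test row order and row count, which is easy to verify before writing the CSV.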

# References:
1. EDA - https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
2. Label Encoder - https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621
