# Objective

An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the new market behaves similarly to its existing market.

Content
In the existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication for each customer segment. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers.

You are required to help the manager predict the right segment for each of the new customers.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data=pd.read_csv('/kaggle/input/customer/Train.csv')

#### The ID column is an arbitrary identifier with no relationship to the segmentation, so we can drop it.

In [None]:
data.drop(['ID'],inplace=True,axis=1)
data.head()

Number of non-null values in each attribute

In [None]:
data.count()

In [None]:
data.Segmentation.value_counts()

Converting the labels to integer category codes (A=0, B=1, C=2, D=3)

In [None]:
label=pd.Categorical(data.Segmentation,categories=['A','B','C','D']).codes
data.drop(['Segmentation'],axis=1,inplace=True)
label
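
As a minimal sketch of what this encoding produces, the toy example below (the `segments` values are made up for illustration and are not taken from Train.csv) shows how `pd.Categorical(...).codes` maps letters to integers and how `pd.Categorical.from_codes` can map predictions back to segment letters later.

```python
import pandas as pd

# Toy segment labels, illustrative only (not read from Train.csv)
segments = pd.Series(['D', 'A', 'C', 'B', 'A'])

# The position of each letter in `categories` becomes its integer code
cat = pd.Categorical(segments, categories=['A', 'B', 'C', 'D'])
codes = cat.codes

# Map integer codes back to the original letters
decoded = pd.Categorical.from_codes(codes, categories=['A', 'B', 'C', 'D'])

print(list(codes))
print(list(decoded))
```

Keeping the `categories` list explicit means the mapping does not depend on which letters happen to appear in the data.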

Train Test split of data and label

## Gender 

### Analysing how the segmentation relates to gender

In [None]:
data.Gender.isnull().sum()

In [None]:
data.Gender.value_counts()

In [None]:
# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=data.Gender,hue=label,palette='Dark2')
plt.show()

As there are no null values in the Gender column, we can directly encode
Male as 0 and
Female as 1

In [None]:
data.Gender=pd.Categorical(data.Gender,categories=['Male','Female'],ordered=True).codes

## Marital status

In [None]:
data.Ever_Married.isnull().sum()

There are 140 missing values, so we can assign them the most common marital status

In [None]:
data.Ever_Married.value_counts()

As most of the customers are married, we fill the missing values with 'Yes'

In [None]:
# Assign the result back instead of relying on inplace=True on a single column
data['Ever_Married']=data['Ever_Married'].fillna('Yes')
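
The fill value above is hard-coded after reading `value_counts()`. A possible alternative, sketched here on a made-up `toy` frame (an assumption for illustration, not the notebook's data), is to derive the most frequent value with `Series.mode()` so the same cell works for any categorical column.

```python
import numpy as np
import pandas as pd

# Toy column with missing entries, illustrative only
toy = pd.DataFrame({'Ever_Married': ['Yes', 'No', np.nan, 'Yes', np.nan]})

# mode() ignores NaN by default; [0] picks the first most frequent value
most_common = toy['Ever_Married'].mode()[0]

# Fill the gaps with that value and assign the result back
toy['Ever_Married'] = toy['Ever_Married'].fillna(most_common)

print(most_common)
print(toy['Ever_Married'].tolist())
```

The same pattern could be reused for the Graduated, Profession and Var_1 columns further down.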

In [None]:
sns.countplot(x=data.Ever_Married,hue=label,palette='PuBuGn')

In [None]:
data.Ever_Married=pd.Categorical(data.Ever_Married,categories=['No','Yes'],ordered=True).codes

## Graduated

In [None]:
data.Graduated.isnull().sum()

78 values in the Graduated column are missing, so we apply the same strategy again and fill them with the most frequent value of the column

In [None]:
data.Graduated.value_counts()

In [None]:
data['Graduated']=data['Graduated'].fillna('Yes')

In [None]:
sns.countplot(x=data.Graduated,hue=label,palette='winter_r')

The graph above shows that non-graduated customers are most often assigned to category D, while graduated customers are most likely to be assigned to category C.

In [None]:
data.Graduated=pd.Categorical(data.Graduated,categories=['No','Yes'],ordered=True).codes

##  Profession

In [None]:
data.Profession.isnull().sum()

In [None]:
data.Profession.value_counts()

The table above shows that the most common professions are Artist and Healthcare. Missing values are filled with 'Artist'.

In [None]:
data['Profession']=data['Profession'].fillna('Artist')

In [None]:
plt.figure(figsize=(20,15))
sns.countplot(x=data.Profession,hue=label,palette='twilight_r')

The graph above indicates that Artist and Executive customers are generally assigned to category C, while Healthcare workers and the other professions are generally assigned to category D.

In [None]:
profession=pd.get_dummies(data.Profession)
data.drop(['Profession'],axis=1,inplace=True)

In [None]:
profession

In [None]:
data=data.join(profession)
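
The three cells above one-hot encode Profession and join the result back onto the frame. A minimal sketch of the same steps on a made-up `toy` frame (illustrative values only) shows the resulting column layout. Passing `dtype=int` is an assumption made here so that the indicator columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy frame, illustrative only
toy = pd.DataFrame({
    'Age': [22, 38, 67, 40],
    'Profession': ['Healthcare', 'Artist', 'Doctor', 'Artist'],
})

# One indicator column per profession, in alphabetical order
dummies = pd.get_dummies(toy['Profession'], dtype=int)

# Replace the text column with its indicator columns
encoded = toy.drop(columns=['Profession']).join(dummies)

print(encoded)
```

Each row has exactly one indicator set to 1, so no artificial ordering between professions is introduced.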

## Spending score

In [None]:
data.Spending_Score.isnull().sum()

In [None]:
sns.countplot(x=data.Spending_Score,hue=label)

The graph above shows that customers with a low spending score are mostly classified in category D

In [None]:
data.Spending_Score=pd.Categorical(data.Spending_Score,categories=['Low','Average','High'],ordered=True).codes
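
Because Spending_Score has a natural order, the position in the `categories` list sets the integer code. The sketch below uses made-up values (including a hypothetical `'Unknown'` entry that is not claimed to occur in the dataset) to show that any value missing from the list is encoded as -1, which is worth checking for before training.

```python
import pandas as pd

# Toy scores, illustrative only; 'Unknown' is a hypothetical unexpected value
scores = pd.Series(['Low', 'High', 'Average', 'Low', 'Unknown'])

# Codes follow the order of `categories`: Low=0, Average=1, High=2
codes = pd.Categorical(
    scores, categories=['Low', 'Average', 'High'], ordered=True
).codes

# Values outside `categories` become NaN internally and get code -1
n_unmapped = int((codes == -1).sum())

print(list(codes))
print(n_unmapped)
```

The same behaviour applies to the Gender, Ever_Married and Graduated encodings above, which use the same `pd.Categorical(...).codes` pattern.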

## Var_1

In [None]:
data.Var_1.isnull().sum()

In [None]:
data.Var_1.value_counts()

In [None]:
data['Var_1']=data['Var_1'].fillna('Cat_6')

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x=data.Var_1,hue=label)

In [None]:
data.Var_1=pd.Categorical(data.Var_1).codes

## Work experience

In [None]:
data.Work_Experience.isnull().sum()

In [None]:
data.Work_Experience.value_counts()

Missing Work_Experience values can be treated as zero experience

In [None]:
data['Work_Experience']=data['Work_Experience'].fillna(0)

In [None]:
plt.figure(figsize=(8,6))
plt.hist(data.Work_Experience)

## Family size 

In [None]:
data.Family_Size.isnull().sum()

In [None]:
data.Family_Size.value_counts()

Filling each missing value with the value from the previous row (forward fill)

In [None]:
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
data['Family_Size']=data['Family_Size'].ffill()
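
Forward fill copies whatever value sits in the preceding row, so the result depends on row order. The sketch below compares it with a median fill on a made-up series (illustrative values only). The median fill is shown as one possible alternative, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy family sizes with gaps, illustrative only
sizes = pd.Series([2.0, np.nan, 5.0, np.nan, np.nan, 1.0])

# Forward fill: each gap takes the last seen value above it
ffilled = sizes.ffill()

# Median fill: every gap takes the same order-independent value
median_filled = sizes.fillna(sizes.median())

print(ffilled.tolist())
print(median_filled.tolist())
```

Note that forward fill leaves a gap in place if the very first row is missing, since there is no earlier value to copy.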

In [None]:
plt.figure(figsize=(8,6))
plt.hist(data.Family_Size)

## Age 

In [None]:
data.Age.isnull().sum()

In [None]:
print("the max age of the customer is {0} \n the minimum age of the customer is {1}".format(max(data.Age),min(data.Age)))

In [None]:
plt.scatter(data.Age,label)

In [None]:
plt.figure(figsize=(8,6))
plt.hist(data.Age)

The histograms above show that the distributions of Age, Work_Experience and Family_Size are skewed,
so we could use MinMaxScaler to rescale these columns to the range 0 to 1.
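
The scaling step is only suggested here and is not applied in the cells that follow. A minimal sketch of how it could look, on a made-up `toy` frame (illustrative values, not the dataset's), is given below. Min-max scaling changes the range of each column but not the shape of its distribution, so the skew itself remains.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric columns, illustrative only
toy = pd.DataFrame({
    'Age': [18, 40, 89],
    'Work_Experience': [0, 1, 14],
    'Family_Size': [1, 3, 9],
})

# Each column is mapped to (x - min) / (max - min)
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(scaled)
```

In a train/validation setting the scaler would normally be fitted on the training part only and then applied to the validation part with `scaler.transform`.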

In [None]:
# data=pd.DataFrame([data],columns=data.columns)
data.head()

In [None]:
correlation_data=pd.DataFrame(label,columns=['label'])
correlation_data=correlation_data.join(data)
correlation_data

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(correlation_data.corr(),annot=True)

# Train-validation splitting of dataset

In [None]:
train_data,val_data,train_label,val_label=train_test_split(data,label,test_size=0.2,random_state=40)
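
The split above is purely random. Since the segments are not equally frequent, one option (an assumption, not what the cell above does) is to pass `stratify=` so that both parts keep the same class proportions. A small sketch on synthetic arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and four balanced classes, illustrative only
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3] * 5)

# stratify=y keeps the class proportions equal in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y
)

print(np.bincount(y_train))
print(np.bincount(y_val))
```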

## Model Selection 

We select DecisionTreeClassifier as the first classifier, with a max_depth of 10 to avoid overfitting the data.

### DecisionTreeClassifier 

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier1=DecisionTreeClassifier(max_depth=10)
classifier1.fit(train_data,train_label)

In [None]:
print("Performance of the model on the training data \n",classification_report(train_label,classifier1.predict(train_data)))
print("Performance of the model on the validation data \n",classification_report(val_label,classifier1.predict(val_data)))

Although the overall accuracy of the classifier is not great, it identifies category 'D' customers reasonably well, with an accuracy of about 65%. The likely reason is the amount of available data: category 'D' has considerably more samples than the other categories and is therefore predicted a little more accurately.
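
`classification_report` lists precision, recall and F1 per class, plus one overall accuracy. To read a single class's figures programmatically, the report can be requested as a dictionary. The sketch below uses made-up labels and predictions (not the notebook's results); the `target_names` assume the A=0 to D=3 coding used earlier.

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions, illustrative only
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [0, 1, 1, 0, 2, 3, 3, 3, 3, 2]

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(
    y_true, y_pred, target_names=['A', 'B', 'C', 'D'], output_dict=True
)

recall_d = report['D']['recall']        # share of true D rows predicted as D
precision_d = report['D']['precision']  # share of predicted D rows that are D
overall_accuracy = report['accuracy']

print(recall_d, precision_d, overall_accuracy)
```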

### RandomForestClassifier 

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier2=RandomForestClassifier(max_depth=9)
classifier2.fit(train_data,train_label)

In [None]:
print("Performance of the model on the training data \n",classification_report(train_label,classifier2.predict(train_data)))
print("Performance of the model on the validation data \n",classification_report(val_label,classifier2.predict(val_data)))

RandomForestClassifier is the better model here, as it classifies each category with noticeably better accuracy.

### LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier3=LogisticRegression(tol=0.01,max_iter=1000)
classifier3.fit(train_data,train_label)

In [None]:
print("Performance of the model on the training data \n",classification_report(train_label,classifier3.predict(train_data)))
print("Performance of the model on the validation data \n",classification_report(val_label,classifier3.predict(val_data)))

#### Although accuracy is not the only metric for judging a model, here it gives a fairly clear indication that RandomForestClassifier (with an accuracy of 54%) is the better algorithm for this dataset.
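
To put the three models side by side, their validation accuracy can be collected in one loop. This is a sketch on synthetic data from `make_classification` (an assumption for self-containment), so the numbers it prints say nothing about the customer dataset. The hyperparameters mirror the ones used above, with `random_state` added only to make the sketch repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem, illustrative only
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y
)

models = {
    'DecisionTree': DecisionTreeClassifier(max_depth=10, random_state=0),
    'RandomForest': RandomForestClassifier(max_depth=9, random_state=0),
    'LogisticRegression': LogisticRegression(tol=0.01, max_iter=1000),
}

# Fit each model on the same split and record its validation accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, model.predict(X_va))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```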