### Problem Statement
An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the behavior of the new market is similar to that of its existing market. 
<br><br>
In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then performed segmented outreach and communication for the different customer segments. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers. 
<br><br>
You are required to help the manager predict the right segment for each of the new customers.<br><br>
You can check this link: https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/

## Algorithms Covered

1. In this notebook, we use the **One vs Rest (OvR)** and **One vs One (OvO)** strategies to solve the problem.

2. In the second notebook, we use the **Decision Tree** and **Random Forest** algorithms to solve the same problem; [click here to see it](https://www.kaggle.com/mittalvasu95/multi-class-classification-c102?scriptVersionId=43468547)

3. In the last notebook, we use the **k-NN** and **Naive Bayes** algorithms to solve the same problem; [click here to see it](https://www.kaggle.com/mittalvasu95/multi-class-classification-c103?scriptVersionId=43468336)

**Note**: The EDA process is the same in all three notebooks. Only the algorithm used to solve the problem changes.

### Variables Description

           
| Variable	            | Definition                                                        |
|---------------------- |-------------------------------------------------------------------|
| ID	                | Unique ID                                                         |
| Gender	            | Gender of the customer                                            |
| Ever_Married	        | Marital status of the customer                                    |
| Age	                | Age of the customer                                               |
| Graduated	            | Is the customer a graduate?                                       |
| Profession	        | Profession of the customer                                        |
| Work_Experience	    | Work Experience in years                                          |
| Spending_Score	    | Spending score of the customer                                    |
| Family_Size	        | Number of family members for the customer(including the customer) |
| Var_1	                | Anonymised Category for the customer                              |
| Segmentation(target)  | Customer Segment of the customer                                  |

## Multi Class Classification 
- A classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multi-class classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time. 
- Common examples include image classification (is it a cat, dog, human, etc) or handwritten digit recognition (classifying an image of a handwritten number into a digit from 0 to 9).
- In machine learning, multiclass or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification).
- Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance.
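
To make the distinction between binary, multi-class and multi-label targets concrete, here is a minimal sketch using scikit-learn's `type_of_target` helper. The toy arrays are made up for illustration and are not taken from the customer data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Two classes only: a binary problem
binary_y = np.array(["A", "B", "A", "B"])

# Exactly one of four segments per customer: a multi-class problem
multiclass_y = np.array(["A", "B", "C", "D", "A"])

# Several labels may be switched on for the same row: a multi-label problem
multilabel_y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])

kinds = {
    "binary_y": type_of_target(binary_y),
    "multiclass_y": type_of_target(multiclass_y),
    "multilabel_y": type_of_target(multilabel_y),
}
print(kinds)
```

The `Segmentation` column in this problem has the second shape: one label out of A, B, C, D per customer.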

## <font color = 'blue'>Topics Covered in this notebook</font>
1. Basic cleaning and EDA
2. One vs Rest Classifier
    - Model Building with two different dataframes
    - Model Evaluation
    - Final comment on which dataframe is good for this algorithm
3. One vs One Classifier
    - Model Building with two different dataframes
    - Model Evaluation
    - Final comment on which dataframe is good for this algorithm

---
## <font color='orange'>Step I: Importing, Cleaning and EDA

In [None]:
# Importing libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

In [None]:
import warnings
warnings.filterwarnings('ignore')

#### Training data

In [None]:
# Loading the train data
df = pd.read_csv('/kaggle/input/customer/Train.csv')

# Looking at the top 10 rows
df.head(10)

In [None]:
# Looking at the bigger picture: dtypes and non-null counts
df.info()

<font color='blue'>1. We have seen that there are `missing values` in the dataset, so we will work on data cleaning.<br>
<font color='blue'>2. `Create some new attributes` based on the given data, domain knowledge and prior experience.<br>
<font color='blue'>3. Create graphs, `perform EDA` and write down observations.

In [None]:
# Checking the number of missing values in each column
df.isnull().sum()

In [None]:
# Removing all those rows that have 3 or more missing values
df = df.loc[df.isnull().sum(axis=1)<3]
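
As a small illustration of the row filter above, the sketch below counts missing values per row on a made-up four-row frame and keeps only the rows with fewer than 3 gaps. The toy values are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real Train.csv
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", None],
    "Age": [25, np.nan, 40, np.nan],
    "Profession": ["Artist", None, None, "Doctor"],
    "Family_Size": [2, 3, np.nan, np.nan],
})

# axis=1 sums across the columns, giving one count per row
missing_per_row = toy.isnull().sum(axis=1)

# Keep rows with 0, 1 or 2 missing values; drop rows with 3 or more
kept = toy.loc[missing_per_row < 3]
print(missing_per_row.tolist())
print(kept)
```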

In [None]:
# Looking at 10 random rows of the data
df.sample(10)

###### Var_1

In [None]:
print('The count of each category\n',df.Var_1.value_counts())

In [None]:
# Checking the count of null values
df.Var_1.isnull().sum()

In [None]:
# Filling the missing values based on patterns in the other attributes
df.loc[ (pd.isnull(df['Var_1'])) & (df['Graduated'] == 'Yes'),"Var_1"] = 'Cat_6'
df.loc[ (pd.isnull(df['Var_1'])) & (df['Graduated'] == 'No'),"Var_1"] = 'Cat_4'
df.loc[ (pd.isnull(df["Var_1"])) & ((df['Profession'] == 'Lawyer') | (df['Profession'] == 'Artist')),"Var_1"] = 'Cat_6'
df.loc[ (pd.isnull(df["Var_1"])) & (df['Age'] > 40),"Var_1"] = 'Cat_6'

**Ways to treat missing values**<br>
Check here: https://www.datasciencenovice.com/2020/08/5-ways-to-treat-missing-values.html
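
The rule-based filling used in this notebook applies its rules one after another, and each rule only touches rows that are still missing. The order of the rules therefore matters. A minimal sketch on made-up data (the `rules` list and the toy values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame in which Var_1 is missing everywhere
toy = pd.DataFrame({
    "Graduated": ["Yes", "No", None, None],
    "Age": [30, 50, 55, 25],
    "Var_1": [None, None, None, None],
})

# (condition, fill value) pairs, applied in this order
rules = [
    (toy["Graduated"] == "Yes", "Cat_6"),
    (toy["Graduated"] == "No", "Cat_4"),
    (toy["Age"] > 40, "Cat_6"),
]

for condition, value in rules:
    # Recomputed every time, so earlier rules take priority over later ones
    still_missing = toy["Var_1"].isnull()
    toy.loc[still_missing & condition, "Var_1"] = value

print(toy)
```

The second row is over 40 but ends up as `Cat_4`, because the `Graduated == 'No'` rule fired first. The last row matches no rule and stays missing, so it is worth re-checking `isnull().sum()` after such a block.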

In [None]:
# Counting Var_1 in each segment
ax1 = df.groupby(["Segmentation"])["Var_1"].value_counts().unstack().round(3)

# Percentage of category of Var_1 in each segment
ax2 = df.pivot_table(columns='Var_1',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

<font color='blue'>In every segment, both the count and the proportion of Cat_6 are very high, i.e. most of the entries in the given data belong to Cat_6.
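
The per-segment proportions plotted above come from a `pivot_table` count followed by a row-wise division. As a sketch of an alternative, `pd.crosstab` with `normalize='index'` should give the same row shares in one call. The toy frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook
toy = pd.DataFrame({
    "ID": range(1, 9),
    "Segmentation": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Var_1": ["Cat_6", "Cat_6", "Cat_6", "Cat_4",
              "Cat_6", "Cat_6", "Cat_4", "Cat_4"],
})

# The notebook's approach: count per cell, then divide each row by its total
counts = toy.pivot_table(columns="Var_1", index="Segmentation",
                         values="ID", aggfunc="count")
shares_pivot = counts.div(counts.sum(axis=1), axis=0)

# Alternative: crosstab normalised over each row (index)
shares_crosstab = pd.crosstab(toy["Segmentation"], toy["Var_1"],
                              normalize="index")

print(shares_pivot)
print(shares_crosstab)
```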

###### Gender

In [None]:
print('The count of gender\n',df.Gender.value_counts())

In [None]:
# Checking the count of missing values
df.Gender.isnull().sum()

In [None]:
# Counting male-female in each segment
ax1 = df.groupby(["Segmentation"])["Gender"].value_counts().unstack().round(3)

# Percentage of male-female in each segment
ax2 = df.pivot_table(columns='Gender',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

<font color='blue'>All 4 segments have a similar male-female distribution, and in every segment there are more males than females. <br> 
<font color='blue'>Segment D, however, has the highest percentage of males of all the segments.

###### Ever Married

In [None]:
print('Count of married vs not married\n',df.Ever_Married.value_counts())

In [None]:
# Checking the count of missing values
df.Ever_Married.isnull().sum()

In [None]:
# Filling the missing values based on patterns in the other attributes
df.loc[ (pd.isnull(df["Ever_Married"])) & ((df['Spending_Score'] == 'Average') | (df['Spending_Score'] == 'High')),"Ever_Married"] = 'Yes'
df.loc[ (pd.isnull(df["Ever_Married"])) & (df['Spending_Score'] == 'Low'),"Ever_Married"] = 'No'
df.loc[ (pd.isnull(df["Ever_Married"])) & (df['Age'] > 40),"Ever_Married"] = 'Yes'
df.loc[ (pd.isnull(df["Ever_Married"])) & (df['Profession'] == 'Healthcare'),"Ever_Married"] = 'No'

In [None]:
# Counting married and non-married in each segment
ax1 = df.groupby(["Segmentation"])["Ever_Married"].value_counts().unstack().round(3)

# Percentage of married and non-married in each segment
ax2 = df.pivot_table(columns='Ever_Married',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

<font color='blue'>We see that most of the customers in segment C are married, while segment D has the fewest married customers. This suggests that segment D is a group of customers who are mostly single and possibly younger. 

###### Age

In [None]:
df.Age.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Checking the count of missing values
df.Age.isnull().sum()

In [None]:
# Looking at the distribution of the column Age
plt.figure(figsize=(10,5))

skewness = round(df.Age.skew(),2)
kurtosis = round(df.Age.kurtosis(),2)
mean = round(np.mean(df.Age),0)
median = np.median(df.Age)

plt.subplot(1,2,1)
sns.boxplot(y=df.Age)
plt.title('Boxplot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.subplot(1,2,2)
sns.distplot(df.Age)
plt.title('Distribution Plot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.show()

In [None]:
# Looking at the distribution of the column Age within each segment
a = df[df.Segmentation =='A']["Age"]
b = df[df.Segmentation =='B']["Age"]
c = df[df.Segmentation =='C']["Age"]
d = df[df.Segmentation =='D']["Age"]

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(data = df, x = "Segmentation", y="Age")
plt.title('Boxplot')

plt.subplot(1,2,2)
sns.kdeplot(a,shade= False, label = 'A')
sns.kdeplot(b,shade= False, label = 'B')
sns.kdeplot(c,shade= False, label = 'C')
sns.kdeplot(d,shade= False, label = 'D')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title("Mean\n A: {}\n B: {}\n C: {}\n D: {}".format(round(a.mean(),0),round(b.mean(),0),round(c.mean(),0),round(d.mean(),0)))

plt.show()

<font color='blue'>The mean age of segment D is 33, so customers in this segment are typically in their 30s, i.e. they are younger. The 'Ever_Married' distribution also showed that segment D has the largest number of single customers, which is consistent with a younger group.<br>
<font color='blue'>Segment C has a mean age of 49, and we also saw that most customers in this segment are married. 

In [None]:
# Converting the datatype from float to int
df['Age'] = df['Age'].astype(int)

In [None]:
df.Age.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Divide people into 4 age groups
df['Age_Bin'] = pd.cut(df.Age,bins=[17,30,45,60,90],labels=['17-30','31-45','46-60','60+'])
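
By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins above are (17, 30], (30, 45], (45, 60] and (60, 90]. The sketch below checks how boundary ages are assigned. The sample ages are made up for illustration.

```python
import pandas as pd

bins = [17, 30, 45, 60, 90]
labels = ["17-30", "31-45", "46-60", "60+"]

# Ages chosen to sit on or next to the bin edges
ages = pd.Series([18, 30, 31, 45, 46, 60, 61, 89])
age_bin = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "Age_Bin": age_bin}))

# Values outside (17, 90] fall into no bin and become NaN
outside = pd.cut(pd.Series([17, 95]), bins=bins, labels=labels)
print(outside)
```

An age of exactly 17 would therefore not land in the '17-30' bin, and anything above 90 would be left unassigned, so it is worth confirming the minimum and maximum of `Age` before binning.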

In [None]:
# Counting different age group in each segment
ax1 = df.groupby(["Segmentation"])["Age_Bin"].value_counts().unstack().round(3)

# Percentage of age bins in each segment
ax2 = df.pivot_table(columns='Age_Bin',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(3)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

###### Graduated

In [None]:
print('Count of each graduate and non-graduate\n',df.Graduated.value_counts())

In [None]:
# Checking the count of missing values
df.Graduated.isnull().sum()

In [None]:
# Filling the missing values based on patterns in the other attributes
df.loc[ (pd.isnull(df["Graduated"])) & (df['Spending_Score'] == 'Average'),"Graduated"] = 'Yes'
df.loc[ (pd.isnull(df["Graduated"])) & (df['Profession'] == 'Artist'),"Graduated"] = 'Yes'
df.loc[ (pd.isnull(df["Graduated"])) & (df['Age'] > 49),"Graduated"] = 'Yes'
df.loc[ (pd.isnull(df["Graduated"])) & (df['Var_1'] == 'Cat_4'),"Graduated"] = 'No'
df.loc[ (pd.isnull(df["Graduated"])) & (df['Ever_Married'] == 'Yes'),"Graduated"] = 'Yes'

# Replacing remaining NaN with the previous row's value (forward fill);
# .ffill() is the non-deprecated equivalent of fillna(method='pad')
df['Graduated'] = df['Graduated'].ffill()
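
Forward filling (also called `pad`) copies the last seen value down into the gaps. A minimal sketch on a made-up series shows two properties worth keeping in mind: the result depends on the row order, and a gap at the very start has nothing to copy from.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps, including one at the very start
s = pd.Series([np.nan, "Yes", np.nan, np.nan, "No", np.nan])

filled = s.ffill()
print(filled.tolist())
```

The first entry stays missing, so a final `isnull().sum()` check after forward filling is a reasonable safeguard.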

In [None]:
# Counting graduate and non-graduate in each segment
ax1 = df.groupby(["Segmentation"])["Graduated"].value_counts().unstack().round(3)

# Percentage of graduate and non-graduate in each segment
ax2 = df.pivot_table(columns='Graduated',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

<font color='blue'>Segment C has the largest number of graduate customers, while segment D has the smallest.

###### Profession

In [None]:
print('Count of each profession\n',df.Profession.value_counts())

In [None]:
# Checking the count of missing values
df.Profession.isnull().sum()

In [None]:
# Filling the missing values based on patterns in the other attributes
df.loc[ (pd.isnull(df["Profession"])) & (df['Work_Experience'] > 8),"Profession"] = 'Homemaker'
df.loc[ (pd.isnull(df["Profession"])) & (df['Age'] > 70),"Profession"] = 'Lawyer'
df.loc[ (pd.isnull(df["Profession"])) & (df['Family_Size'] < 3),"Profession"] = 'Lawyer'
df.loc[ (pd.isnull(df["Profession"])) & (df['Spending_Score'] == 'Average'),"Profession"] = 'Artist'
df.loc[ (pd.isnull(df["Profession"])) & (df['Graduated'] == 'Yes'),"Profession"] = 'Artist'
df.loc[ (pd.isnull(df["Profession"])) & (df['Ever_Married'] == 'Yes'),"Profession"] = 'Artist'
df.loc[ (pd.isnull(df["Profession"])) & (df['Ever_Married'] == 'No'),"Profession"] = 'Healthcare'
df.loc[ (pd.isnull(df["Profession"])) & (df['Spending_Score'] == 'High'),"Profession"] = 'Executives'

In [None]:
# Count of segments in each profession
ax1 = df.groupby(["Profession"])["Segmentation"].value_counts().unstack().round(3)

# Percentage of segments in each profession
ax2 = df.pivot_table(columns='Segmentation',index='Profession',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (16,5))
label = ['Artist','Doctor','Engineer','Entertainment','Executives','Healthcare','Homemaker','Lawyer','Marketing']
ax[0].set_xticklabels(labels = label,rotation = 45)

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (16,5))
ax[1].set_xticklabels(labels = label,rotation = 45)

plt.show()

<font color='blue'>Segments A, B and C draw most of their customers from the profession **Artist**, while Segment D draws most of its customers from **Healthcare**. <br>
**Homemaker** is the least common profession in all four segments.

###### Work Experience

In [None]:
df.Work_Experience.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Checking the count of missing values
df.Work_Experience.isnull().sum()

In [None]:
# Filling NaN with the previous row's value (forward fill);
# .ffill() is the non-deprecated equivalent of fillna(method='pad')
df['Work_Experience'] = df['Work_Experience'].ffill()

In [None]:
# Looking at the distribution of the column Work_Experience
plt.figure(figsize=(15,10))

skewness = round(df.Work_Experience.skew(),2)
kurtosis = round(df.Work_Experience.kurtosis(),2)
mean = round(np.mean(df.Work_Experience),0)
median = np.median(df.Work_Experience)

plt.subplot(1,2,1)
sns.boxplot(y=df.Work_Experience)
plt.title('Boxplot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.subplot(2,2,2)
sns.distplot(df.Work_Experience)
plt.title('Distribution Plot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.show()

In [None]:
# Looking at the distribution of the column Work_Experience within each segment
a = df[df.Segmentation =='A']["Work_Experience"]
b = df[df.Segmentation =='B']["Work_Experience"]
c = df[df.Segmentation =='C']["Work_Experience"]
d = df[df.Segmentation =='D']["Work_Experience"]

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(data = df, x = "Segmentation", y="Work_Experience")
plt.title('Boxplot')

plt.subplot(1,2,2)
sns.kdeplot(a,shade= False, label = 'A')
sns.kdeplot(b,shade= False, label = 'B')
sns.kdeplot(c,shade= False, label = 'C')
sns.kdeplot(d,shade= False, label = 'D')
plt.xlabel('Work Experience')
plt.ylabel('Density')
plt.title("Mean\n A: {}\n B: {}\n C: {}\n D: {}".format(round(a.mean(),0),round(b.mean(),0),round(c.mean(),0),round(d.mean(),0)))

plt.show()

<font color='blue'>Segment D has relatively more experienced people than the other segments, while Segment C has people with low experience.

In [None]:
# Changing the data type
df['Work_Experience'] = df['Work_Experience'].astype(int)

In [None]:
df.Work_Experience.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Dividing the people into 3 categories of work experience
df['Work_Exp_Category'] = pd.cut(df.Work_Experience,bins=[-1,1,7,15],labels=['Low Experience','Medium Experience','High Experience'])

In [None]:
# Counting different category of work experience in each segment
ax1 = df.groupby(["Segmentation"])["Work_Exp_Category"].value_counts().unstack().round(3)

# Percentage of work experience in each segment
ax2 = df.pivot_table(columns='Work_Exp_Category',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(3)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

###### Spending Score

In [None]:
print('Count of spending score\n',df.Spending_Score.value_counts())

In [None]:
# Checking the count of missing values
df.Spending_Score.isnull().sum()

In [None]:
# Counting different category of spending score in each segment
ax1 = df.groupby(["Segmentation"])["Spending_Score"].value_counts().unstack().round(3)

# Percentage of spending score in each segment
ax2 = df.pivot_table(columns='Spending_Score',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(3)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

<font color='blue'>Segment D has the largest number of people with a low spending score, while Segment C has more people with an average spending score.

###### Family Size

In [None]:
df.Family_Size.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Checking the count of missing values
df.Family_Size.isnull().sum()

In [None]:
# Filling the missing values based on patterns in the other attributes
df.loc[ (pd.isnull(df["Family_Size"])) & (df['Ever_Married'] == 'Yes'),"Family_Size"] = 2.0
df.loc[ (pd.isnull(df["Family_Size"])) & (df['Var_1'] == 'Cat_6'),"Family_Size"] = 2.0
df.loc[ (pd.isnull(df["Family_Size"])) & (df['Graduated'] == 'Yes'),"Family_Size"] = 2.0

# Fill remaining NaN with the previous row's value (forward fill);
# .ffill() is the non-deprecated equivalent of fillna(method='pad')
df['Family_Size'] = df['Family_Size'].ffill()

In [None]:
# Looking at the distribution of the column Family_Size
plt.figure(figsize=(15,10))

skewness = round(df.Family_Size.skew(),2)
kurtosis = round(df.Family_Size.kurtosis(),2)
mean = round(np.mean(df.Family_Size),0)
median = np.median(df.Family_Size)

plt.subplot(1,2,1)
sns.boxplot(y=df.Family_Size)
plt.title('Boxplot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.subplot(2,2,2)
sns.distplot(df.Family_Size)
plt.title('Distribution Plot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.show()

In [None]:
# Looking at the distribution of the column Family_Size within each segment
a = df[df.Segmentation =='A']["Family_Size"]
b = df[df.Segmentation =='B']["Family_Size"]
c = df[df.Segmentation =='C']["Family_Size"]
d = df[df.Segmentation =='D']["Family_Size"]

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(data = df, x = "Segmentation", y="Family_Size")
plt.title('Boxplot')

plt.subplot(1,2,2)
sns.kdeplot(a,shade= False, label = 'A')
sns.kdeplot(b,shade= False, label = 'B')
sns.kdeplot(c,shade= False, label = 'C')
sns.kdeplot(d,shade= False, label = 'D')
plt.xlabel('Family Size')
plt.ylabel('Density')
plt.title("Mean\n A: {}\n B: {}\n C: {}\n D: {}".format(round(a.mean(),0),round(b.mean(),0),round(c.mean(),0),round(d.mean(),0)))

plt.show()

In [None]:
# Changing the data type
df['Family_Size'] = df['Family_Size'].astype(int)

In [None]:
df.Family_Size.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Divide family size into 3 categories
df['Family_Size_Category'] = pd.cut(df.Family_Size,bins=[0,4,6,10],labels=['Small Family','Big Family','Joint Family'])

In [None]:
# Counting different category of family size in each segment
ax1 = df.groupby(["Segmentation"])["Family_Size_Category"].value_counts().unstack().round(3)

# Percentage of family size in each segment
ax2 = df.pivot_table(columns='Family_Size_Category',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

<font color='blue'>In the given data, most people have a family size of 1 or 2 (i.e. they have a small family).<br> Segment D, however, has more big families than the other segments. 

###### Segmentation

In [None]:
print('Count of each category of segmentation\n',df.Segmentation.value_counts())

In [None]:
segments = df.loc[:,"Segmentation"].value_counts()
plt.xlabel("Segment")
plt.ylabel('Count')
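# Keyword arguments are used because recent seaborn versions no longer accept x and y positionally
sns.barplot(x=segments.index, y=segments.values).set_title('Segments')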
plt.show()

In [None]:
df.reset_index(drop=True, inplace=True)
df.info()

In [None]:
# number of unique ids
df.ID.nunique()

<font color='blue'>All the data has now been cleaned: there are no missing values and the columns are in the right format. <br>
<font color='blue'>All the IDs are unique, i.e. there are no duplicate entries.<br>
<font color='blue'>Created new columns: 'Age_Bin', 'Work_Exp_Category' and 'Family_Size_Category'. <br> 
<font color='blue'>Only 0.2% of the rows were deleted. 

In [None]:
df.describe(include='all')

In [None]:
df = df[['ID','Gender', 'Ever_Married', 'Age', 'Age_Bin', 'Graduated', 'Profession', 'Work_Experience', 'Work_Exp_Category',
         'Spending_Score', 'Family_Size', 'Family_Size_Category','Var_1', 'Segmentation']]
df.head(10)

In [None]:
# Saving the file
#df.to_csv('cleaned_train.csv')

#### <font color='red'>Making two different dataframes
<font color='red'>We now build two different dataframes from the main dataframe above (`df`): <br>
- `df1`: Spending_Score (ranked), Age (normalised), Work_Experience (normalised), Family_Size (normalised)
- `df2`: Spending_Score (dummy variables), Age_Bin (dummy variables), Work_Exp_Category (dummy variables), Family_Size_Category (dummy variables)
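
The difference between the two encodings can be sketched on a tiny made-up `Spending_Score` column. The first variant keeps the order Low < Average < High in a single numeric column (the `df1` idea). The second creates one indicator column per category and carries no order (the `df2` idea). The `rank_map` dictionary is an illustrative stand-in for the `replace` calls used in the notebook.

```python
import pandas as pd

scores = pd.Series(["Low", "High", "Average", "Low"], name="Spending_Score")

# df1-style: a single ordered numeric column
rank_map = {"Low": 1, "Average": 2, "High": 3}
ranked = scores.map(rank_map)

# df2-style: one indicator column per category
dummies = pd.get_dummies(scores, prefix="Spending")

print(ranked.tolist())
print(dummies)
```

The ranked form lets a linear model use "higher score" as one direction, while the dummy form lets it learn a separate weight per level at the cost of more columns.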


In [None]:
df1 = df.copy()
df1.head()

In [None]:
# Separating dependent-independent variables
X = df1.drop('Segmentation',axis=1)
y = df1['Segmentation']

In [None]:
# import the train-test split
from sklearn.model_selection import train_test_split

# divide into train and test sets
df1_trainX, df1_testX, df1_trainY, df1_testY = train_test_split(X,y, train_size = 0.7, random_state = 101, stratify=y)
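
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` roughly the same in both splits. The sketch below checks this on a made-up, deliberately imbalanced target. The class sizes and the `feature` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 40% A, 30% B, 20% C, 10% D
y = pd.Series(["A"] * 40 + ["B"] * 30 + ["C"] * 20 + ["D"] * 10)
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=101, stratify=y
)

# Compare the class shares in the full data and in each split
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).sort_index()
print(proportions)
```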

###### Preprocessing on train data

In [None]:
# converting binary variables to numeric
df1_trainX['Gender'] = df1_trainX['Gender'].replace(('Male','Female'),(1,0))
df1_trainX['Ever_Married'] = df1_trainX['Ever_Married'].replace(('Yes','No'),(1,0))
df1_trainX['Graduated'] = df1_trainX['Graduated'].replace(('Yes','No'),(1,0))
df1_trainX['Spending_Score'] = df1_trainX['Spending_Score'].replace(('High','Average','Low'),(3,2,1))

# converting nominal variables into dummy variables
pf = pd.get_dummies(df1_trainX.Profession,prefix='Profession')
df1_trainX = pd.concat([df1_trainX,pf],axis=1)

vr = pd.get_dummies(df1_trainX.Var_1,prefix='Var_1')
df1_trainX = pd.concat([df1_trainX,vr],axis=1)

# scaling continuous variables
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df1_trainX[['Age','Work_Experience','Family_Size']] = scaler.fit_transform(df1_trainX[['Age','Work_Experience','Family_Size']])

df1_trainX.drop(['ID','Age_Bin','Profession','Work_Exp_Category','Family_Size_Category','Var_1'], axis=1, inplace=True)

###### Preprocessing on test data

In [None]:
# converting binary variables to numeric
df1_testX['Gender'] = df1_testX['Gender'].replace(('Male','Female'),(1,0))
df1_testX['Ever_Married'] = df1_testX['Ever_Married'].replace(('Yes','No'),(1,0))
df1_testX['Graduated'] = df1_testX['Graduated'].replace(('Yes','No'),(1,0))
df1_testX['Spending_Score'] = df1_testX['Spending_Score'].replace(('High','Average','Low'),(3,2,1))

# converting nominal variables into dummy variables
pf = pd.get_dummies(df1_testX.Profession,prefix='Profession')
df1_testX = pd.concat([df1_testX,pf],axis=1)

vr = pd.get_dummies(df1_testX.Var_1,prefix='Var_1')
df1_testX = pd.concat([df1_testX,vr],axis=1)

# scaling continuous variables
df1_testX[['Age','Work_Experience','Family_Size']] = scaler.transform(df1_testX[['Age','Work_Experience','Family_Size']])

df1_testX.drop(['ID','Age_Bin','Profession','Work_Exp_Category','Family_Size_Category','Var_1'], axis=1, inplace=True)
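
The scaler above is fitted on the training split only and then reused on the test split, which keeps information from the test data out of the preprocessing. One consequence, sketched below on made-up ages, is that scaled test values can fall slightly outside the [0, 1] range when the test split contains values beyond the training minimum or maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data
train_age = np.array([[20.0], [40.0], [60.0]])
test_age = np.array([[18.0], [50.0], [70.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_age)  # learns min=20 and max=60
test_scaled = scaler.transform(test_age)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```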

In [None]:
df1_trainX.shape, df1_trainY.shape, df1_testX.shape, df1_testY.shape

In [None]:
# Correlation matrix
plt.figure(figsize=(17,10))
sns.heatmap(df1_trainX.corr(method='spearman').round(2),linewidth = 0.5,annot=True,cmap="YlGnBu")
plt.show()

##### Why Spearman?
Check this: https://idkwhoneedstohearthis.blogspot.com/2020/05/correlation-why-spearmans.html
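
As a rough illustration of the difference, Spearman correlation works on ranks and so captures any monotonic relationship, while Pearson correlation measures only linear association. The sketch below uses a made-up cubic relationship, which is perfectly monotonic but not linear.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but non-linearly with x
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("Pearson :", round(pearson, 3))
print("Spearman:", round(spearman, 3))
```

Spearman reports a perfect association here because the ranks of `x` and `y` agree exactly, whereas Pearson comes out lower. Rank-based correlation is also a natural fit for ordinal columns such as the ranked `Spending_Score`.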

##### Observation:
1. `Age` and `Ever_Married` have a positive correlation of 0.6, which suggests that married customers tend to be older than unmarried ones.
2. `Age` and `Profession_Healthcare` have a correlation of -0.5, which suggests that customers working in healthcare tend to be younger than those in other professions.
3. `Profession_Healthcare` and `Ever_Married` have a correlation of -0.42, which suggests that healthcare professionals are mostly unmarried (only 13% of healthcare professionals are married).
4. `Age` and `Profession_Lawyer` have a positive correlation of 0.42, which suggests that lawyers tend to be older than customers in other professions.
5. `Ever_Married` and `Spending_Average` have a positive correlation, which suggests that married customers often have an average spending score (around 42% of married customers spend at an average level).
6. `Ever_Married` and `Spending_High` have a slight positive correlation, which suggests that some married customers are high spenders (around 25% of married customers spend high).
7. `Ever_Married` and `Spending_Low` have a correlation of -0.67, which suggests that unmarried customers are mostly low spenders (around 99% of unmarried customers spend low).
8. `Age` and `Spending_Score` have a positive correlation of 0.42, which suggests that spending tends to increase with age.
9. `Profession_Executives` and `Spending_High` have a positive correlation of 0.40, which suggests that executives tend to be high spenders (around 66% of executives spend high).

In [None]:
df2 = df.copy()
df2.head()

In [None]:
# Separating dependent-independent variables
X = df2.drop('Segmentation',axis=1)
y = df2['Segmentation']

In [None]:
# import the train-test split
from sklearn.model_selection import train_test_split

# divide into train and test sets
df2_trainX, df2_testX, df2_trainY, df2_testY = train_test_split(X,y, train_size = 0.7, random_state = 101, stratify=y)

###### Preprocessing on train data

In [None]:
# Converting binary to numeric
df2_trainX['Gender'] = df2_trainX['Gender'].replace(('Male','Female'),(1,0))
df2_trainX['Ever_Married'] = df2_trainX['Ever_Married'].replace(('Yes','No'),(1,0))
df2_trainX['Graduated'] = df2_trainX['Graduated'].replace(('Yes','No'),(1,0))

# Converting nominal variables to dummy variables
ab = pd.get_dummies(df2_trainX.Age_Bin,prefix='Age_Bin')
df2_trainX = pd.concat([df2_trainX,ab],axis=1)

pf = pd.get_dummies(df2_trainX.Profession,prefix='Profession')
df2_trainX = pd.concat([df2_trainX,pf],axis=1)

we = pd.get_dummies(df2_trainX.Work_Exp_Category,prefix='WorkExp')
df2_trainX = pd.concat([df2_trainX,we],axis=1)

sc = pd.get_dummies(df2_trainX.Spending_Score,prefix='Spending')
df2_trainX = pd.concat([df2_trainX,sc],axis=1)

fs = pd.get_dummies(df2_trainX.Family_Size_Category,prefix='FamilySize')
df2_trainX = pd.concat([df2_trainX,fs],axis=1)

vr = pd.get_dummies(df2_trainX.Var_1,prefix='Var_1')
df2_trainX = pd.concat([df2_trainX,vr],axis=1)

df2_trainX.drop(['ID','Age','Age_Bin','Profession','Work_Experience','Work_Exp_Category','Spending_Score',
               'Family_Size','Family_Size_Category','Var_1'],axis=1,inplace=True)

###### Preprocessing on test data

In [None]:
# Converting binary to numeric
df2_testX['Gender'] = df2_testX['Gender'].replace(('Male','Female'),(1,0))
df2_testX['Ever_Married'] = df2_testX['Ever_Married'].replace(('Yes','No'),(1,0))
df2_testX['Graduated'] = df2_testX['Graduated'].replace(('Yes','No'),(1,0))

# Converting nominal variables to dummy variables
ab = pd.get_dummies(df2_testX.Age_Bin,prefix='Age_Bin')
df2_testX = pd.concat([df2_testX,ab],axis=1)

pf = pd.get_dummies(df2_testX.Profession,prefix='Profession')
df2_testX = pd.concat([df2_testX,pf],axis=1)

we = pd.get_dummies(df2_testX.Work_Exp_Category,prefix='WorkExp')
df2_testX = pd.concat([df2_testX,we],axis=1)

sc = pd.get_dummies(df2_testX.Spending_Score,prefix='Spending')
df2_testX = pd.concat([df2_testX,sc],axis=1)

fs = pd.get_dummies(df2_testX.Family_Size_Category,prefix='FamilySize')
df2_testX = pd.concat([df2_testX,fs],axis=1)

vr = pd.get_dummies(df2_testX.Var_1,prefix='Var_1')
df2_testX = pd.concat([df2_testX,vr],axis=1)

df2_testX.drop(['ID','Age','Age_Bin','Profession','Work_Experience','Work_Exp_Category','Spending_Score',
               'Family_Size','Family_Size_Category','Var_1'],axis=1,inplace=True)

In [None]:
df2_trainX.shape, df2_trainY.shape, df2_testX.shape, df2_testY.shape
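
Because `get_dummies` is called separately on the train and test splits, the two frames only end up with identical columns if every category appears in both splits. The shape check above is a quick way to spot a mismatch. If one does occur, a possible fix is to reindex the test dummies to the training columns, as sketched below on made-up data.

```python
import pandas as pd

# Hypothetical splits: the test split happens to contain no 'Lawyer'
train = pd.DataFrame({"Profession": ["Artist", "Doctor", "Lawyer", "Artist"]})
test = pd.DataFrame({"Profession": ["Artist", "Doctor"]})

train_d = pd.get_dummies(train["Profession"], prefix="Profession")
test_d = pd.get_dummies(test["Profession"], prefix="Profession")

# Give the test frame exactly the training columns, in the same order;
# columns absent from the test split are filled with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(test_d.columns))
print(list(test_aligned.columns))
```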

In [None]:
# Correlation matrix
plt.figure(figsize=(17,10))
sns.heatmap(df2_trainX.corr(method='spearman').round(2),linewidth = 0.5,annot=True,cmap="YlGnBu")
plt.show()

---
## <font color='orange'>Step II: Model Building

### I. One-Vs-Rest

![image.png](attachment:image.png)

- One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification.
- It involves splitting the multi-class dataset into multiple binary classification problems. A binary classifier is then trained on each binary classification problem and predictions are made using the model that is the most confident.
- For example, here we are given a multi-class classification problem with examples for each class 'A', 'B', 'C' and 'D'. This could be divided into four binary classification datasets as follows:
    - Binary Classification Problem 1: A vs [B,C,D]
    - Binary Classification Problem 2: B vs [A,C,D]
    - Binary Classification Problem 3: C vs [A,B,D]
    - Binary Classification Problem 4: D vs [A,B,C]
- A possible downside of this approach is that it requires one model to be created for each class. For example, the four segments here require four models. This could be an issue for large datasets (e.g. millions of rows), slow models (e.g. neural networks), or very large numbers of classes (e.g. hundreds of classes).
- This approach requires that each model predicts a class membership probability or a probability-like score. The argmax of these scores (class index with the largest score) is then used to predict a class. This approach is commonly used for algorithms that naturally predict numerical class membership probability or score, such as:
    - Logistic Regression
    - Perceptron
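
The steps above can be sketched by hand: train one "class vs rest" model per class, then pick the class whose model gives the largest score. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), and `LogisticRegression` is swapped in for the `LinearSVC` used later in this notebook because the text names it as a typical OvR base model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 4-class data (an assumption; it only stands in for segments A-D)
X, y_idx = make_classification(n_samples=400, n_features=8, n_informative=5,
                               n_classes=4, n_clusters_per_class=1,
                               random_state=0)
classes = np.array(["A", "B", "C", "D"])
y = classes[y_idx]

# Train one binary "class vs rest" model per class
scores = []
for c in classes:
    binary_target = (y == c).astype(int)      # 1 for class c, 0 for the rest
    clf = LogisticRegression(max_iter=1000).fit(X, binary_target)
    scores.append(clf.decision_function(X))   # confidence that a row is class c

# Predict the class whose model is the most confident (argmax over 4 scores)
score_matrix = np.column_stack(scores)        # shape: (n_samples, 4)
manual_pred = classes[np.argmax(score_matrix, axis=1)]

# Compare with scikit-learn's wrapper around the same base model
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
agreement = np.mean(manual_pred == ovr.predict(X))
print("number of binary models:", len(ovr.estimators_))
print("agreement with OneVsRestClassifier:", agreement)
```

The manual loop and `OneVsRestClassifier` should agree on essentially every row, since the wrapper does the same thing: one fitted estimator per class and an argmax over their scores.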

#### Building the model with `first type` of dataframe (df_type1)

In [None]:
train_ovr1_x = df1_trainX.copy()
train_ovr1_x.head()

In [None]:
train_ovr1_y = df1_trainY.copy()
train_ovr1_y.head()

In [None]:
# Importing the library
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Creating OvR object
ovr1 = OneVsRestClassifier(LinearSVC(random_state=0))

# Train model
model_ovr1 = ovr1.fit(train_ovr1_x, train_ovr1_y)

# Predicting the classes
yhat1 = ovr1.predict(train_ovr1_x)

# Looking at the coefficients of variables 
#print('-------Coefficient of variables obtained from each of the 4 models ------')
#print(model_ovr1.coef_)

# Looking at the intercepts 
#print('-------Intercept of each of the 4 models ------')
#print(model_ovr1.intercept_)

from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(train_ovr1_y.values, yhat1, labels=["A","B","C","D"])
print('\n\n-------The confusion matrix for this model is-------')
print(cm1)

from sklearn.metrics import classification_report
print('\n\n-------Printing the report of the model-------')
print(classification_report(train_ovr1_y.values, yhat1))

<font color='blue'>`Recall` - what proportion of actual positives is correctly classified <br>
`Precision` - what proportion of predicted positives is truly positive <br>
`Accuracy` - what proportion of all samples (positive and negative) is correctly classified <br>
`f1-score` - Imagine two classifiers, each with its own precision and recall: one has better recall, the other better precision. To compare them we want to summarize each model's performance in a single metric. That is where the F1-score is used: it combines precision and recall into one number (their harmonic mean). In the report above, precision, recall and f1-score are computed per segment, treating that segment as the positive class. <br>
    **F1-score = 2 × (precision × recall)/(precision + recall)**
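
As a worked example of these definitions, the sketch below counts true positives, false positives and false negatives for one segment and applies the F1 formula. The label lists and the helper name `precision_recall_f1` are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels, not taken from the notebook's data
y_true = ["A", "A", "A", "A", "B", "B", "C", "D"]
y_pred = ["A", "A", "B", "B", "A", "B", "C", "D"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="A")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For segment A there are 2 true positives, 1 false positive and 2 false negatives, so precision is 2/3, recall is 1/2 and the F1-score is 4/7, which lies between the two but closer to the lower value.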

#### Predicting on test set

In [None]:
test_ovr1_x = df1_testX.copy()
test_ovr1_x.head()

In [None]:
test_ovr1_y = df1_testY.copy()
test_ovr1_y.head()

In [None]:
y_ovr1 = ovr1.predict(test_ovr1_x)
y_ovr1

In [None]:
from sklearn.metrics import confusion_matrix
print('-------The confusion matrix for test data is-------\n')
print(confusion_matrix(test_ovr1_y.values, y_ovr1, labels=["A","B","C","D"]))

from sklearn.metrics import classification_report
print('\n\n-------Printing the report of test data-------\n')
print(classification_report(test_ovr1_y.values, y_ovr1))

In [None]:
pd.Series(y_ovr1).value_counts()

#### Building the model with `second type` of dataframe (df_type2)

In [None]:
train_ovr2_x = df2_trainX.copy()
train_ovr2_x.head()

In [None]:
train_ovr2_y = df2_trainY.copy()
train_ovr2_y.head()

In [None]:
# Importing the library
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Creating OvR object
ovr2 = OneVsRestClassifier(LinearSVC(random_state=0))

# Train model
model_ovr2 = ovr2.fit(train_ovr2_x, train_ovr2_y)

# Predicting the classes
yhat2 = ovr2.predict(train_ovr2_x)

# Looking at the coefficients of variables 
#print('-------Coefficient of variables obtained from each of the 4 models------')
#print(model_ovr2.coef_)

# Looking at the intercepts 
#print('\n-------Intercept of each of the 4 models ------')
#print(model_ovr2.intercept_)

from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(train_ovr2_y.values, yhat2, labels=["A","B","C","D"])
print('\n\n-------The confusion matrix for this model is-------')
print(cm2)

from sklearn.metrics import classification_report
print('\n\n-------Printing the whole report of the model-------')
print(classification_report(train_ovr2_y.values, yhat2))

#### Predicting on test set

In [None]:
test_ovr2_x = df2_testX.copy()
test_ovr2_x.head()

In [None]:
test_ovr2_y = df2_testY.copy()
test_ovr2_y.head()

In [None]:
y_ovr2 = ovr2.predict(test_ovr2_x)
y_ovr2

In [None]:
from sklearn.metrics import confusion_matrix
print('-------The confusion matrix for test data is-------\n')
print(confusion_matrix(test_ovr2_y.values, y_ovr2, labels=["A","B","C","D"]))

from sklearn.metrics import classification_report
print('\n\n-------Printing the report of test data-------\n')
print(classification_report(test_ovr2_y.values, y_ovr2))

In [None]:
pd.Series(y_ovr2).value_counts()

---
## <font color='orange'>Step III: Model Evaluation

In [None]:
print('************************  MODEL-1 REPORT  *********************************\n')
print('Train data')
print(classification_report(train_ovr1_y.values, yhat1))
print('\nTest data')
print(classification_report(test_ovr1_y.values, y_ovr1))

#### Observation:
<font color='blue'>1. On the training data the model reaches an accuracy of 0.50, and on the test data 0.49, so there is hardly any difference.<br>
<font color='blue'>2. This model is good only for segments C and D, as their recall is high and close between train and test.

In [None]:
print('************************  MODEL-2 REPORT  *********************************\n')
print('Train data')
print(classification_report(train_ovr2_y.values, yhat2))
print('\nTest data')
print(classification_report(test_ovr2_y.values, y_ovr2))

#### Observation:
<font color='blue'>1. On the training data the model reaches an accuracy of 0.52, and on the test data 0.50, so there is hardly any difference.<br>
<font color='blue'>2. This model is good only for segments C and D, as their recall is high and close between train and test.<br>
<font color='blue'>3. However, this second model is better than the previous one, as it gives better results for segment B.<br>
<font color='blue'>4. So we can say that `model-2 is better` than model-1 with the OvR technique.

#### Is ACCURACY everything? 
There is no single best measure in general; the best measure follows from your needs. `In a sense, it is not a machine learning question, but a business question`. Two people commonly use the same dataset but choose different metrics because their goals differ.
<br><br>
Accuracy is a great metric. In fact, most metrics are useful, and I like to evaluate several of them. At some point, however, you have to decide between model A and model B, and for that you should pick the single metric that best fits your need.<br><br>
Read more: https://towardsdatascience.com/is-accuracy-everything-96da9afd540d
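
To make the point concrete, here is a small sketch with hypothetical, deliberately imbalanced labels (not the notebook's data): a "model" that always predicts the majority segment scores well on accuracy yet never finds the minority segment.

```python
# Hypothetical imbalanced labels: 90 customers in segment A, 10 in segment B
y_true = ["A"] * 90 + ["B"] * 10

# A trivial "model" that always predicts the majority segment
y_pred = ["A"] * 100

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(y_true)
recall_b = sum(t == "B" and p == "B" for t, p in pairs) / y_true.count("B")

print(f"accuracy     = {accuracy:.2f}")
print(f"recall for B = {recall_b:.2f}")
```

Accuracy is 0.90 while recall for segment B is 0.00. If reaching segment B customers is what the business cares about, accuracy is the wrong metric to optimize here.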

---
---
### II. One-Vs-One 

![image.png](attachment:image.png)

- One-vs-One (OvO for short) is another heuristic method for using binary classification algorithms for multi-class classification. Like one-vs-rest, one-vs-one splits a multi-class classification dataset into binary classification problems. Unlike one-vs-rest, which creates one binary dataset per class, one-vs-one creates one binary dataset for each pair of classes.
- For example, this problem is a multi-class classification problem with four classes: A, B, C and D. This could be divided into six binary classification datasets as follows:
    - Binary Classification Problem 1: A vs B
    - Binary Classification Problem 2: A vs C
    - Binary Classification Problem 3: A vs D
    - Binary Classification Problem 4: B vs C
    - Binary Classification Problem 5: B vs D
    - Binary Classification Problem 6: C vs D
- When predicting new points, each classifier votes on the class of the point, and the class with the most votes is chosen as the winner. In the event of a tie, you may select the class with the highest aggregate classification confidence by summing over the pair-wise classification confidence levels computed by the underlying binary classifiers. 
- The formula for calculating the number of binary datasets, and in turn, models, is as follows: `(NumClasses * (NumClasses - 1)) / 2`. For the four segments here, that is `(4 * 3) / 2 = 6`.
- One-versus-one classifiers are both more computationally expensive, requiring far more classifiers to be trained, and less immediately interpretable. However, each pairwise classifier is trained only on the rows of its two classes. So if the base classifier scales poorly with the number of samples and the dataset is sufficiently large, training this many two-class classifiers may be faster or provide better results than the one-versus-rest scheme, where every classifier is trained on all the rows.
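
The pairing and voting described above can be sketched in plain Python. The pairwise winners below are hypothetical outputs for a single new customer; in practice each one would come from a trained binary classifier.

```python
from collections import Counter
from itertools import combinations

classes = ["A", "B", "C", "D"]

# One binary problem per unordered pair of classes
pairs = list(combinations(classes, 2))
n = len(classes)
print(pairs)
print("number of pairwise models:", len(pairs), "formula:", n * (n - 1) // 2)

# Hypothetical winner of each pairwise classifier for one new customer
pairwise_winners = {
    ("A", "B"): "B", ("A", "C"): "C", ("A", "D"): "D",
    ("B", "C"): "C", ("B", "D"): "B", ("C", "D"): "C",
}

# Each classifier casts one vote; the class with the most votes wins
votes = Counter(pairwise_winners[pair] for pair in pairs)
predicted = votes.most_common(1)[0][0]
print(dict(votes), "->", predicted)
```

Here C collects three votes, B two and D one, so the customer is assigned to segment C. This sketch does not handle ties; as noted above, a tie would need the summed pairwise confidence scores to break it.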

## <font color='orange'>Step II: Model Building

#### Building the model with `first type` of dataframe (df_type1)

In [None]:
train_ovo1_x = df1_trainX.copy()
train_ovo1_x.head()

In [None]:
train_ovo1_y = df1_trainY.copy()
train_ovo1_y.head()

In [None]:
# Importing the library
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

# Creating OvO object
ovo1 = OneVsOneClassifier(LinearSVC(random_state=0))

# Train model
model_ovo1 = ovo1.fit(train_ovo1_x, train_ovo1_y)

# Predicting the classes
yhat3 = ovo1.predict(train_ovo1_x)

from sklearn.metrics import confusion_matrix
cm3 = confusion_matrix(train_ovo1_y.values, yhat3, labels=["A","B","C","D"])
print('-------The confusion matrix for this model is-------')
print(cm3)

from sklearn.metrics import classification_report
print('\n\n-------Printing the whole report of the model-------')
print(classification_report(train_ovo1_y.values, yhat3))

#### Predicting on test set

In [None]:
test_ovo1_x = df1_testX.copy()
test_ovo1_x.head()

In [None]:
test_ovo1_y = df1_testY.copy()
test_ovo1_y.head()

In [None]:
y_ovo1 = ovo1.predict(test_ovo1_x)
y_ovo1

In [None]:
from sklearn.metrics import confusion_matrix
print('-------The confusion matrix for test data is-------\n')
print(confusion_matrix(test_ovo1_y.values, y_ovo1, labels=["A","B","C","D"]))

from sklearn.metrics import classification_report
print('\n\n-------Printing the report of test data-------\n')
print(classification_report(test_ovo1_y.values, y_ovo1))

In [None]:
pd.Series(y_ovo1).value_counts()

#### Building the model with `second type` of dataframe (df_type2)

In [None]:
train_ovo2_x = df2_trainX.copy()
train_ovo2_x.head()

In [None]:
train_ovo2_y = df2_trainY.copy()
train_ovo2_y.head()

In [None]:
# Importing the library
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

# Creating OvO object
ovo2 = OneVsOneClassifier(LinearSVC(random_state=0))

# Train model
model_ovo2 = ovo2.fit(train_ovo2_x, train_ovo2_y)

# Predicting the classes
yhat4 = ovo2.predict(train_ovo2_x)

from sklearn.metrics import confusion_matrix
cm4 = confusion_matrix(train_ovo2_y.values, yhat4, labels=["A","B","C","D"])
print('-------The confusion matrix for this model is-------')
print(cm4)

from sklearn.metrics import classification_report
print('\n\n-------Printing the whole report of the model-------')
print(classification_report(train_ovo2_y.values, yhat4))

#### Predicting on test set

In [None]:
test_ovo2_x = df2_testX.copy()
test_ovo2_x.head()

In [None]:
test_ovo2_y = df2_testY.copy()
test_ovo2_y.head()

In [None]:
y_ovo2 = ovo2.predict(test_ovo2_x)
y_ovo2

In [None]:
from sklearn.metrics import confusion_matrix
print('-------The confusion matrix for test data is-------\n')
print(confusion_matrix(test_ovo2_y.values, y_ovo2, labels=["A","B","C","D"]))

from sklearn.metrics import classification_report
print('\n\n-------Printing the report of test data-------\n')
print(classification_report(test_ovo2_y.values, y_ovo2))

In [None]:
pd.Series(y_ovo2).value_counts()

---
## <font color='orange'>Step III: Model Evaluation

In [None]:
print('************************  MODEL-1 REPORT  *********************************\n')
print('Train data')
print(classification_report(train_ovo1_y.values, yhat3))
print('\nTest data')
print(classification_report(test_ovo1_y.values, y_ovo1))

In [None]:
print('************************  MODEL-2 REPORT  *********************************\n')
print('Train data')
print(classification_report(train_ovo2_y.values, yhat4))
print('\nTest data')
print(classification_report(test_ovo2_y.values, y_ovo2))

<font color='blue'>1. Both reports are very similar in terms of accuracy, precision, recall and f1-score.<br>
<font color='blue'>2. `Model-2 is better` than model-1 for segment A, while `model-1 is better` than model-2 for segment B.<br>
<font color='blue'>3. So we can't really say which model is better with the OvO technique.
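
When per-segment results pull in different directions like this, one possible tie-breaker is to compare the models on a single summary metric, for example the macro-averaged F1-score. The sketch below uses hypothetical labels and predictions (not this notebook's results) in which one model is better on segment B and the other on segment A.

```python
from sklearn.metrics import classification_report, f1_score

segments = ["A", "B", "C", "D"]

# Hypothetical true labels and predictions from two models
y_true = ["A", "A", "B", "B", "C", "C", "D", "D"]
predictions = {
    "model-1": ["A", "B", "B", "B", "C", "C", "D", "D"],  # better on B
    "model-2": ["A", "A", "A", "C", "C", "C", "D", "D"],  # better on A
}

macro_f1 = {}
for name, y_pred in predictions.items():
    report = classification_report(y_true, y_pred, labels=segments,
                                   output_dict=True, zero_division=0)
    recalls = {s: round(report[s]["recall"], 2) for s in segments}
    macro_f1[name] = f1_score(y_true, y_pred, labels=segments,
                              average="macro", zero_division=0)
    print(name, "recall per segment:", recalls,
          "macro F1:", round(macro_f1[name], 3))

best = max(macro_f1, key=macro_f1.get)
print("higher macro F1:", best)
```

Macro averaging weights every segment equally, so it suits the case where all four segments matter the same. If one segment matters more to the business, that segment's recall or precision would be the better deciding metric.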