An automobile company has plans to enter new markets with their existing products (P1, P2, P3, P4, and P5). After intensive market research, they’ve deduced that the behavior of the new market is similar to their existing market.
In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D ). Then, they performed segmented outreach and communication for a different segment of customers. This strategy has work e exceptionally well for them. They plan to use the same strategy for the new markets and have identified 2627 new potential customers.
You are required to help the manager to predict the right group of the new customers.

    Variable - Definition
    ID - Unique ID
    Gender - Gender of the customer
    Ever_Married - Marital status of the customer
    Age - Age of the customer
    Graduated - Is the customer a graduate?
    Profession - Profession of the customer
    Work_Experience - Work Experience in years
    Spending_Score - Spending score of the customer
    Family_Size - Number of family members for the customer (including the customer)
    Var_1 - Anonymised Category for the customer
    Segmentation - (target) Customer Segment of the customer

Note:There is a lot of null data in the training base, if we delete it it will decrease the database a lot.
If it was a company and we had contact with the database supplier, we would try to fill in the missing data.
In order not to lag the database too much, let's fill in the average of each missing value.
There are some categorical data that could not be filled with the mean value, so you may delete them,

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
df=pd.read_csv("Test.csv")
df.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,458989,Female,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6,B
1,458994,Male,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6,A
2,458996,Female,Yes,69,No,,0.0,Low,1.0,Cat_6,A
3,459000,Male,Yes,59,No,Executive,11.0,High,2.0,Cat_6,B
4,459001,Female,No,19,No,Marketing,,Low,4.0,Cat_6,A


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2627 entries, 0 to 2626
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               2627 non-null   int64  
 1   Gender           2627 non-null   object 
 2   Ever_Married     2577 non-null   object 
 3   Age              2627 non-null   int64  
 4   Graduated        2603 non-null   object 
 5   Profession       2589 non-null   object 
 6   Work_Experience  2358 non-null   float64
 7   Spending_Score   2627 non-null   object 
 8   Family_Size      2514 non-null   float64
 9   Var_1            2595 non-null   object 
 10  Segmentation     2627 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 225.9+ KB


In [4]:
df.isnull().sum()

ID                   0
Gender               0
Ever_Married        50
Age                  0
Graduated           24
Profession          38
Work_Experience    269
Spending_Score       0
Family_Size        113
Var_1               32
Segmentation         0
dtype: int64

In [5]:
df.Ever_Married.value_counts()

Ever_Married
Yes    1520
No     1057
Name: count, dtype: int64

In [6]:
df=df.drop(["ID","Gender","Var_1"],axis=1)
df.head(2)

Unnamed: 0,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Segmentation
0,Yes,36,Yes,Engineer,0.0,Low,1.0,B
1,Yes,37,Yes,Healthcare,8.0,Average,4.0,A


In [7]:
num_Columns=list(df.select_dtypes(include=np.number))
num_Columns

['Age', 'Work_Experience', 'Family_Size']

In [8]:
def fillna_mean(df,Col):
    for i in Col:
        mean=df[i].mean()
        df.fillna({i:mean},inplace=True)       
fillna_mean(df,num_Columns)

In [9]:
df.isnull().sum()

Ever_Married       50
Age                 0
Graduated          24
Profession         38
Work_Experience     0
Spending_Score      0
Family_Size         0
Segmentation        0
dtype: int64

In [10]:
#removed outliers
for i in num_Columns:
        Q1=np.quantile(df[i],0.25)
        Q3=np.quantile(df[i],0.75)
        IQR=Q3-Q1
        df=df[(df[i]<(Q3+1.5*IQR)) & (df[i]>(Q1-1.5*IQR))]

In [11]:
cat_Columns=list(df.select_dtypes(exclude=np.number))
cat_Columns

['Ever_Married', 'Graduated', 'Profession', 'Spending_Score', 'Segmentation']

In [12]:
for i in cat_Columns:
    df.dropna(subset=[i],inplace=True)

In [13]:
df.isnull().sum()

Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Segmentation       0
dtype: int64

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2105 entries, 0 to 2625
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ever_Married     2105 non-null   object 
 1   Age              2105 non-null   int64  
 2   Graduated        2105 non-null   object 
 3   Profession       2105 non-null   object 
 4   Work_Experience  2105 non-null   float64
 5   Spending_Score   2105 non-null   object 
 6   Family_Size      2105 non-null   float64
 7   Segmentation     2105 non-null   object 
dtypes: float64(2), int64(1), object(5)
memory usage: 148.0+ KB


In [15]:
from sklearn.preprocessing import LabelEncoder
lenc=LabelEncoder()
for i in cat_Columns:
    if(len(df[i].unique())==2):
        df[i]=lenc.fit_transform(df[i])

In [16]:
# df.Segmentation=lenc.fit_transform(df.Segmentation)

In [17]:
df.head()

Unnamed: 0,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Segmentation
0,1,36,1,Engineer,0.0,Low,1.0,B
4,0,19,0,Marketing,2.552587,Low,4.0,A
5,1,47,1,Doctor,0.0,High,5.0,C
6,1,61,1,Doctor,5.0,Low,3.0,D
7,1,47,1,Artist,1.0,Average,3.0,D


In [18]:
df=df.replace("Low",0)
df=df.replace("Average",1)
df=df.replace("High",2)

In [19]:
df.Spending_Score=df["Spending_Score"].astype(int)

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2105 entries, 0 to 2625
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ever_Married     2105 non-null   int32  
 1   Age              2105 non-null   int64  
 2   Graduated        2105 non-null   int32  
 3   Profession       2105 non-null   object 
 4   Work_Experience  2105 non-null   float64
 5   Spending_Score   2105 non-null   int32  
 6   Family_Size      2105 non-null   float64
 7   Segmentation     2105 non-null   object 
dtypes: float64(2), int32(3), int64(1), object(2)
memory usage: 123.3+ KB


In [21]:
# df1=pd.get_dummies(df["Profession"],drop_first=True,dtype=int)
# df1.info()

In [22]:
# df=pd.concat([df,df1],axis=1)
df=df.drop("Profession",axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2105 entries, 0 to 2625
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ever_Married     2105 non-null   int32  
 1   Age              2105 non-null   int64  
 2   Graduated        2105 non-null   int32  
 3   Work_Experience  2105 non-null   float64
 4   Spending_Score   2105 non-null   int32  
 5   Family_Size      2105 non-null   float64
 6   Segmentation     2105 non-null   object 
dtypes: float64(2), int32(3), int64(1), object(1)
memory usage: 106.9+ KB


In [23]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score, ConfusionMatrixDisplay,confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
X=df.drop("Segmentation",axis=1)
y=df["Segmentation"]

In [26]:
y

0       B
4       A
5       C
6       D
7       D
       ..
2620    D
2621    D
2623    A
2624    C
2625    C
Name: Segmentation, Length: 2105, dtype: object

In [27]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2,stratify=y,random_state=0)

In [28]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1684 entries, 392 to 1404
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Ever_Married     1684 non-null   int32  
 1   Age              1684 non-null   int64  
 2   Graduated        1684 non-null   int32  
 3   Work_Experience  1684 non-null   float64
 4   Spending_Score   1684 non-null   int32  
 5   Family_Size      1684 non-null   float64
dtypes: float64(2), int32(3), int64(1)
memory usage: 72.4 KB


In [50]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [51]:
cv_param={
            'max_depth':[2,3,4],
          'min_samples_leaf':[1,2,3],
          'min_samples_split':[2,4],
          'max_features':[1,2],
          'n_estimators':[10,20]
         }

In [52]:
rf=RandomForestClassifier(random_state=0)

In [53]:
scoring=('accuracy','precision','recall','f1')

In [54]:
rf_cv=GridSearchCV(rf,cv_param,scoring=scoring,cv=5,refit='f1')

In [55]:
rf_cv.fit(X_train,y_train)

In [56]:
y_pred=rf_cv.predict(X_test)

In [57]:
accuracy_score(y_test,y_pred)

0.37292161520190026

In [58]:
rf_cv.best_estimator_

In [41]:
cv_param={'max_depth': [2],
 'max_features': [1],
 'min_samples_leaf': [1],
 'min_samples_split': [2],
 'n_estimators': [10]}