# MSA 2023 Phase 2 - Part 1: Classification Dataset

# Market Segmentation (market_segmentation.csv)

A car company wants to enter new markets with its existing models. Its market research shows that customers in its current markets fall into one of four groups (A, B, C, D) based on shared characteristics. Each group has been shown targeted ads and messaging, which has worked exceptionally well for the company. The company now wants to apply the same strategy in new markets, using data it has collected (via ethical means, of course) about potential customers to segment them into the same four groups, i.e. a multi-class classification problem.

In [49]:
# Takes around 45 secs to load in all libraries
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Find all variables and understand them

In [50]:
# Load market data into notebook
market_data = pd.read_csv('market_segmentation.csv', delimiter=',', header='infer')

In [51]:
# Display insight into variable types
market_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


In [52]:
# Show first ten instances
market_data.head(10)

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
| 5 | 461319 | Male | Yes | 56 | No | Artist | 0.0 | Average | 2.0 | Cat_6 | C |
| 6 | 460156 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | 3.0 | Cat_6 | C |
| 7 | 464347 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 3.0 | Cat_6 | D |
| 8 | 465015 | Female | Yes | 61 | Yes | Engineer | 0.0 | Low | 3.0 | Cat_7 | D |
| 9 | 465176 | Female | Yes | 55 | Yes | Artist | 1.0 | Average | 4.0 | Cat_6 | C |


In [53]:
# Summary statistics for the numeric columns
market_data.select_dtypes(include="number").describe()

| | ID | Age | Work_Experience | Family_Size |
|---|---|---|---|---|
| count | 8068.0 | 8068.0 | 7239.0 | 7733.0 |
| mean | 463479.214551 | 43.466906 | 2.641663 | 2.850123 |
| std | 2595.381232 | 16.711696 | 3.406763 | 1.531413 |
| min | 458982.0 | 18.0 | 0.0 | 1.0 |
| 25% | 461240.75 | 30.0 | 0.0 | 2.0 |
| 50% | 463472.5 | 40.0 | 1.0 | 3.0 |
| 75% | 465744.25 | 53.0 | 4.0 | 4.0 |
| max | 467974.0 | 89.0 | 14.0 | 9.0 |


In [54]:
# Summary statistics for the categorical (text) columns
market_data.select_dtypes(include="object").describe()

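| | Gender | Ever_Married | Graduated | Profession | Spending_Score | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|
| count | 8068 | 7928 | 7990 | 7944 | 8068 | 7992 | 8068 |
| unique | 2 | 2 | 2 | 9 | 3 | 7 | 4 |
| top | Male | Yes | Yes | Artist | Low | Cat_6 | D |
| freq | 4417 | 4643 | 4968 | 2516 | 4878 | 5238 | 2268 |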


## 2. Clean data

In [55]:
# Count missing values in each column
market_data.isnull().sum(axis=0)


ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64
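
The counts above show gaps in six columns, with Work_Experience and Family_Size missing the most values. One common way to handle this, sketched here on a small hypothetical frame rather than on `market_data` itself, is to fill numeric gaps with the column median and text gaps with the most frequent value. The helper name `impute_simple` and the toy rows are assumptions made for illustration; the notebook does not state which strategy it intends to use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that reuse three column names from market_data.
toy = pd.DataFrame({
    "Ever_Married": ["Yes", "No", np.nan, "Yes"],
    "Work_Experience": [1.0, np.nan, 0.0, 4.0],
    "Family_Size": [4.0, 3.0, np.nan, 2.0],
})

def impute_simple(df):
    """Fill numeric NaNs with the column median and text NaNs with the mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() can return several values on ties; take the first one
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

toy_filled = impute_simple(toy)
print(toy_filled.isnull().sum())
```

Dropping incomplete rows with `dropna()` is the simpler alternative, at the cost of discarding every row that has a gap in any column.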

In [56]:
from sklearn import preprocessing


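# Convert the three categorical text columns that have no missing values into numbers.
# Ever_Married, Graduated and Var_1 still contain missing values and are left commented out.
# Note: LabelEncoder assigns codes in alphabetical order (Spending_Score: Average=0, High=1, Low=2).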
label_encoder = preprocessing.LabelEncoder()

market_data['Gender'] = label_encoder.fit_transform(market_data['Gender'])
# market_data['Ever_Married'] = label_encoder.fit_transform(market_data['Ever_Married'])
# market_data['Graduated'] = label_encoder.fit_transform(market_data['Graduated'])
market_data['Spending_Score'] = label_encoder.fit_transform(market_data['Spending_Score'])
# market_data['Var_1'] = label_encoder.fit_transform(market_data['Var_1'])
market_data['Segmentation'] = label_encoder.fit_transform(market_data['Segmentation'])

market_data.head()

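| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | 1 | No | 22 | No | Healthcare | 1.0 | 2 | 4.0 | Cat_4 | 3 |
| 1 | 462643 | 0 | Yes | 38 | Yes | Engineer | NaN | 0 | 3.0 | Cat_4 | 0 |
| 2 | 466315 | 0 | Yes | 67 | Yes | Engineer | 1.0 | 2 | 1.0 | Cat_6 | 1 |
| 3 | 461735 | 1 | Yes | 67 | Yes | Lawyer | 0.0 | 1 | 2.0 | Cat_6 | 1 |
| 4 | 462669 | 0 | Yes | 40 | Yes | Entertainment | NaN | 1 | 6.0 | Cat_6 | 0 |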
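
In the encoded output above, Spending_Score maps Low to 2, Average to 0 and High to 1, because `LabelEncoder` sorts the class labels before numbering them. If the natural order of the scores matters to a model, an explicit mapping is one alternative. The sketch below assumes the intended order is Low < Average < High, which the dataset does not state, and uses a hypothetical toy series rather than `market_data`.

```python
import pandas as pd

# Hypothetical toy values using the three Spending_Score labels.
spending = pd.Series(["Low", "Average", "Low", "High", "High"], name="Spending_Score")

# Codes in sorted (alphabetical) label order, which is how LabelEncoder numbers classes.
alphabetical = {label: code for code, label in enumerate(sorted(set(spending)))}

# Assumed ordinal mapping that preserves Low < Average < High.
ordinal = {"Low": 0, "Average": 1, "High": 2}
encoded = spending.map(ordinal)

print(alphabetical)
print(encoded.tolist())
```

For nominal columns such as Gender the alphabetical codes are harmless, so the explicit mapping is mainly relevant to ordered categories.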


## 3. Visualise data

In [57]:
#

## 4. Identify correlated variables

In [58]:
#
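
This cell is still empty. As a minimal sketch of one way to approach the step, the example below computes pairwise Pearson correlations between numeric columns and lists the pairs above a cut-off. The toy values, the derived `Age_in_months` column and the 0.8 threshold are all assumptions made for illustration, not results from `market_segmentation.csv`.

```python
import pandas as pd

# Hypothetical toy frame; Age_in_months is derived from Age so that one
# pair of columns is perfectly correlated by construction.
toy = pd.DataFrame({
    "Age": [22, 38, 67, 40, 56],
    "Family_Size": [2.0, 5.0, 3.0, 1.0, 4.0],
    "Profession": ["Artist", "Engineer", "Lawyer", "Artist", "Doctor"],
})
toy["Age_in_months"] = toy["Age"] * 12

# Keep numeric columns only; corr() uses Pearson correlation by default.
corr = toy.select_dtypes(include="number").corr()

THRESHOLD = 0.8  # arbitrary cut-off chosen for this illustration
strong_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= THRESHOLD
]

print(corr.round(2))
print(strong_pairs)
```

On the real data the same two lines (`select_dtypes` followed by `corr`) would apply to `market_data`, and the resulting matrix could be passed to `sns.heatmap` for a visual check.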

## 5. Summary