# Data Exploration




In this notebook we explore the customer segmentation dataset for this Kaggle project. Our goal is to understand the features and their relationships with the target variable, 'Segmentation'. We start by loading and inspecting the data, running the usual sanity checks, and visualizing the feature distributions. We then perform exploratory data analysis to gain insight into the correlations between the features and the target variable; this part may include handling missing values. Finally, we summarize our findings and identify any patterns or trends in the data that could inform our modeling approach.

The training set consists of 8068 instances and 11 columns, which are mapped below, whereas the test set consists of 2627 instances and 10 columns (all of the above except 'Segmentation').


## Column mapping

| Variable              | Definition                                                         |
|-----------------------|--------------------------------------------------------------------|
| ID                    | Unique ID                                                          |
| Gender                | Gender of the customer                                             |
| Ever_Married          | Marital status of the customer                                     |
| Age                   | Age of the customer                                                |
| Graduated             | Is the customer a graduate?                                        |
| Profession            | Profession of the customer                                         |
| Work_Experience       | Work experience in years                                           |
| Spending_Score        | Spending score of the customer                                     |
| Family_Size           | Number of family members for the customer (including the customer) |
| Var_1                 | Anonymised category for the customer                               |
| Segmentation (target) | Customer segment of the customer                                   |

## Import libraries

In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [18]:
#Import data
df_train = pd.read_csv("../input/Train.csv")
df_test = pd.read_csv("../input/Test.csv")
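
The shapes quoted in the introduction (8068 rows by 11 columns for train, 2627 by 10 for test) can be confirmed with `.shape`, and the column difference with a simple comparison. The sketch below is only an illustration: `toy_train`, `toy_test`, and `compare_columns` are hypothetical names, and the tiny frames stand in for `df_train` and `df_test` so that the example runs without the CSV files.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (made-up values, not the real data).
toy_train = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [22, 38, 67],
    "Segmentation": ["D", "A", "B"],
})
toy_test = pd.DataFrame({
    "ID": [4, 5],
    "Age": [40, 56],
})


def compare_columns(train, test):
    """Return (columns only in train, columns only in test)."""
    only_train = [c for c in train.columns if c not in test.columns]
    only_test = [c for c in test.columns if c not in train.columns]
    return only_train, only_test


only_train, only_test = compare_columns(toy_train, toy_test)
print(toy_train.shape, toy_test.shape)
print(only_train, only_test)
```

Applied to the real data, `compare_columns(df_train, df_test)` would be expected to report 'Segmentation' as the only column missing from the test set.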

In [19]:
#Get top 10 rows
df_train.head(10)

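|   | ID     | Gender | Ever_Married | Age | Graduated | Profession    | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|--------|--------|--------------|-----|-----------|---------------|-----------------|----------------|-------------|-------|--------------|
| 0 | 462809 | Male   | No           | 22  | No        | Healthcare    | 1.0             | Low            | 4.0         | Cat_4 | D            |
| 1 | 462643 | Female | Yes          | 38  | Yes       | Engineer      | NaN             | Average        | 3.0         | Cat_4 | A            |
| 2 | 466315 | Female | Yes          | 67  | Yes       | Engineer      | 1.0             | Low            | 1.0         | Cat_6 | B            |
| 3 | 461735 | Male   | Yes          | 67  | Yes       | Lawyer        | 0.0             | High           | 2.0         | Cat_6 | B            |
| 4 | 462669 | Female | Yes          | 40  | Yes       | Entertainment | NaN             | High           | 6.0         | Cat_6 | A            |
| 5 | 461319 | Male   | Yes          | 56  | No        | Artist        | 0.0             | Average        | 2.0         | Cat_6 | C            |
| 6 | 460156 | Male   | No           | 32  | Yes       | Healthcare    | 1.0             | Low            | 3.0         | Cat_6 | C            |
| 7 | 464347 | Female | No           | 33  | Yes       | Healthcare    | 1.0             | Low            | 3.0         | Cat_6 | D            |
| 8 | 465015 | Female | Yes          | 61  | Yes       | Engineer      | 0.0             | Low            | 3.0         | Cat_7 | D            |
| 9 | 465176 | Female | Yes          | 55  | Yes       | Artist        | 1.0             | Average        | 4.0         | Cat_6 | C            |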


In [21]:
#Get more info
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


Several columns have a 'Non-Null Count' below the 8068 total rows, which means they contain missing values. Let's take a closer look at them.

In [24]:
#Missing values summary (per column)
df_train.isnull().sum()

ID                   0
Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64
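
Raw counts are easier to judge as a share of the rows. From the numbers above, 'Work_Experience' is missing in 829 of 8068 rows (about 10.3%) and 'Family_Size' in 335 (about 4.2%). A minimal sketch of such a summary follows; `missing_summary` and the `toy` frame are hypothetical and only stand in for `df_train` so that the example is self-contained.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; in the notebook this would be df_train.
toy = pd.DataFrame({
    "Work_Experience": [1.0, np.nan, 0.0, np.nan],
    "Family_Size": [4.0, 3.0, np.nan, 2.0],
    "Age": [22, 38, 67, 40],
})


def missing_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    # The mean of a boolean mask is the fraction of True values per column.
    percent = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df_train)` would list only the six columns with gaps, ordered from most to least affected.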

In [27]:
#Missing values summary (per row)
row_missing_values = df_train.isnull().sum(axis=1)
row_missing_values_sorted = row_missing_values.sort_values(ascending=False)
row_missing_values_sorted

3728    4
4782    3
2833    3
6558    3
2336    3
       ..
2944    0
2943    0
2942    0
2940    0
8067    0
Length: 8068, dtype: int64

In [35]:
#Count the rows that have missing values in at least 2 columns
rows_at_least_two_na = (row_missing_values_sorted >= 2).sum()
print(rows_at_least_two_na)

159
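
So 159 of the 8068 rows (roughly 2%) are missing two or more values. To see the whole picture rather than a single threshold, the per-row counts can be tabulated with `value_counts()`, and the affected rows selected with a boolean mask. The sketch below uses a hypothetical `toy` frame in place of `df_train`; whether such rows should be dropped or imputed is a separate decision that this example does not make.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; in the notebook this would be df_train.
toy = pd.DataFrame({
    "Profession": ["Artist", None, "Lawyer", None],
    "Work_Experience": [1.0, np.nan, np.nan, np.nan],
    "Family_Size": [4.0, np.nan, 2.0, 3.0],
})

# Number of missing values in each row.
missing_per_row = toy.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing values.
distribution = missing_per_row.value_counts().sort_index()

# Rows with missing values in at least 2 columns.
rows_at_least_two = toy[missing_per_row >= 2]

print(distribution)
print(rows_at_least_two)
```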
