## CUSTOMER SEGMENTATION WITH K-NEAREST NEIGHBOUR (KNN)

#### **ABSTRACT**
A customer-centric experience is one of the foundations of a successful company. Building an excellent customer experience requires understanding what each customer needs, using the information, insights, and feedback collected over the years of business. Proper analysis of this data leads to customer segmentation: grouping a wide range of customers by their varying needs, which supports efficient service delivery in place of an individual approach that would be cumbersome and not cost-effective. In this project, we combine two data sets and use the KNN classification algorithm to segment our customers into groups and test our model for accuracy. KNN is a supervised learning approach that uses proximity to classify or predict the grouping of an individual data point. While it is most commonly used for classification problems, it can also be used for regression; this project addresses a classification problem. To classify a data point, KNN assigns it to the group that is most common among its nearest neighbors, so the result depends on the number of neighbors selected. In our dataset, customers are segmented into 4 categories based on features such as gender, marital status, age, and work experience. The datasets were obtained from Kaggle.
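
The neighbor-voting rule described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the model built later in this notebook: the function name `knn_predict`, the two features (age and work experience), and the six toy points are all assumptions made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: (age, work experience) -> segment
points = [(22, 1), (25, 0), (27, 1), (60, 1), (67, 0), (65, 2)]
labels = ["D", "D", "D", "B", "B", "B"]

young = knn_predict(points, labels, (24, 1), k=3)
older = knn_predict(points, labels, (66, 1), k=3)
print(young, older)
```

Because the vote is taken over the `k` closest points, changing `k` can change the predicted group. In practice, features measured on different scales are usually rescaled first so that no single feature dominates the distance.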


## INTRODUCTION

Customer segmentation is the process of dividing clients into segments based on shared characteristics. It relies on specific variables such as demographic data (age, race, sex) and behavioral, psychographic, and geographical data. Segmenting customers helps identify the needs of each segment and deliver the appropriate message to each one. It also reduces risk by showing which products are most likely to earn a share of a target market and the best ways to market and deliver those products. Combining two data sets and applying the KNN classification algorithm lets us test our model and make predictions for potential customers. It helps find similarities between people who are currently customers and people who are not. This information is important for identifying groups of potential new customers: people who are not currently customers but closely resemble people who are. It also gives direction for choosing the right marketing campaign.

#### DATASET
The dataset used for this project was obtained from Kaggle [here](https://www.kaggle.com/datasets/abisheksudarshan/customer-segmentation?select=train.csv). It is titled 'Customer Segmentation' and poses a multiclass classification problem for an automobile company that plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). The data set identifies 2627 new potential customers.

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
#importing dataset
Existing_df = pd.read_csv('train.csv')
Existing_df

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
| 3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 | D |
| 8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D |
| 8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D |
| 8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B |
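
Before modeling, it can help to check how the segments are distributed and where values are missing, since distance-based methods such as KNN need complete numeric features. The sketch below uses only the five leading rows displayed above, re-entered by hand; the variable names `sample`, `segment_counts`, and `missing` are illustrative assumptions, and on the real data the same calls would be made on `Existing_df`.

```python
import pandas as pd

# The five leading rows displayed above, re-entered by hand for illustration
sample = pd.DataFrame({
    "ID": [462809, 462643, 466315, 461735, 462669],
    "Age": [22, 38, 67, 67, 40],
    "Work_Experience": [1.0, None, 1.0, 0.0, None],
    "Segmentation": ["D", "A", "B", "B", "A"],
})

# Number of customers in each segment, ordered by segment label
segment_counts = sample["Segmentation"].value_counts().sort_index()

# Number of missing entries per column
missing = sample.isna().sum()

print(segment_counts)
print(missing)
```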


In [7]:
#importing dataset ii
New_df = pd.read_csv('test.csv')
New_df

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 458989 | Female | Yes | 36 | Yes | Engineer | 0.0 | Low | 1.0 | Cat_6 |
| 1 | 458994 | Male | Yes | 37 | Yes | Healthcare | 8.0 | Average | 4.0 | Cat_6 |
| 2 | 458996 | Female | Yes | 69 | No | NaN | 0.0 | Low | 1.0 | Cat_6 |
| 3 | 459000 | Male | Yes | 59 | No | Executive | 11.0 | High | 2.0 | Cat_6 |
| 4 | 459001 | Female | No | 19 | No | Marketing | NaN | Low | 4.0 | Cat_6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2622 | 467954 | Male | No | 29 | No | Healthcare | 9.0 | Low | 4.0 | Cat_6 |
| 2623 | 467958 | Female | No | 35 | Yes | Doctor | 1.0 | Low | 1.0 | Cat_6 |
| 2624 | 467960 | Female | No | 53 | Yes | Entertainment | NaN | Low | 2.0 | Cat_6 |
| 2625 | 467961 | Male | Yes | 47 | Yes | Executive | 1.0 | High | 5.0 | Cat_4 |
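
The two frames share every column except `Segmentation`, which exists only in the training data. One possible way to combine them, assuming simple row-wise stacking, is sketched below on a few rows copied from the outputs above. The `source` column and the names `existing`, `new`, and `combined` are hypothetical additions for the example and are not part of the Kaggle data.

```python
import pandas as pd

# Two rows from each displayed frame, with a subset of columns for brevity
existing = pd.DataFrame({
    "ID": [462809, 462643],
    "Gender": ["Male", "Female"],
    "Age": [22, 38],
    "Segmentation": ["D", "A"],
})
new = pd.DataFrame({
    "ID": [458989, 458994],
    "Gender": ["Female", "Male"],
    "Age": [36, 37],
})

# Tag each frame with its origin so the rows can be separated again later
combined = pd.concat(
    [existing.assign(source="train"), new.assign(source="test")],
    ignore_index=True,
)
print(combined)
```

Rows that come from the new customers have no `Segmentation` value after stacking, so they appear as `NaN` in that column until a model assigns them a segment.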


In [8]:
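# Without parentheses, `info` is not called: the cell displays the bound method
# together with the DataFrame repr. Use Existing_df.info() to get the column
# dtypes and non-null counts.
Existing_df.info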

<bound method DataFrame.info of           ID  Gender Ever_Married  Age Graduated     Profession  \
0     462809    Male           No   22        No     Healthcare   
1     462643  Female          Yes   38       Yes       Engineer   
2     466315  Female          Yes   67       Yes       Engineer   
3     461735    Male          Yes   67       Yes         Lawyer   
4     462669  Female          Yes   40       Yes  Entertainment   
...      ...     ...          ...  ...       ...            ...   
8063  464018    Male           No   22        No            NaN   
8064  464685    Male           No   35        No      Executive   
8065  465406  Female           No   33       Yes     Healthcare   
8066  467299  Female           No   27       Yes     Healthcare   
8067  461879    Male          Yes   37       Yes      Executive   

      Work_Experience Spending_Score  Family_Size  Var_1 Segmentation  
0                 1.0            Low          4.0  Cat_4            D  
1                 N