# Problem Statement:

Customer segmentation is a popular application of unsupervised learning. Using clustering, we identify segments of customers in order to target the potential user base. Businesses divide customers into groups according to common characteristics such as gender, age, interests, and spending habits so that they can market to each group effectively.

Use K-means clustering and also visualize the gender and age distributions. Then analyze their annual incomes and spending scores.

### -------------------------------------------------------------------------------------------------------------------------------------

**Data Set Information**
The dataset can be downloaded from: https://drive.google.com/file/d/19BOhwz52NUY3dg8XErVYglctpr5sjTy4/view

The columns of the dataset are described below:

**Features**

|Column Name |Data Type|Description|
|-----|-----|-----|
|CustomerID|Numerical|It contains the unique ID of each customer (dropped before model building)|
|Gender|Categorical|It has 2 categories, i.e., 'Male' and 'Female'|
|Age|Numerical|It contains the age of the customer (in years)|
|Annual Income (k$)|Numerical|It contains the annual income of the customer (in thousands of dollars)|
|Spending Score (1-100)|Numerical|This contains the score between 1-100 for a customer, known as spending score|

### ------------------------------------------------------------------------------------------------------------------------------------

###  Importing necessary libraries

The following code is written in Python 3.x. Libraries provide pre-written functionality to perform necessary tasks.

In [None]:
import pandas as pd                 #Importing Pandas
import numpy as np                  #Importing NumPy
import matplotlib.pyplot as plt     #Importing Matplotlib 
import seaborn as sns               #Importing Seaborn
from sklearn.cluster import KMeans  #Importing the K-Means Clustering Algorithm
# Render plots inline in the notebook instead of opening a separate window.
%matplotlib inline

In [None]:
# loading the dataset in the variable called 'df'

df= pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

df.head()          # To view top 5 entries in the dataset.

# in order to view bottom 5 entries, we can do
#df.tail()

#in order to view more than 5 entries, we can enter any integer value into '()'.
#Ex: df.head(10) or df.tail(15), etc

In [None]:
# Renaming the columns as per our convenience!!

df.rename(columns={'Annual Income (k$)':'Annual Income', 'Spending Score (1-100)':'Spending Score'},inplace=True)

In [None]:
# Now, let us see all the column names.

df.columns

In [None]:
# We can see that the column named 'CustomerID' can be removed from the dataset as it is unique for each customer and cannot be
#used further for any predictions.

df.drop(['CustomerID'],axis=1,inplace=True)

# axis=0 refers to a row operation. Since we have to remove a column, we use axis=1.

In [None]:
df.info() # This gives the information like no. of non- null values and data types of the columns.

In [None]:
df.describe() # To obtain the descriptive analysis of the numerical columns in the dataset

In [None]:
# Now let us check the shape of the dataset and whether it contains any null values.

print('shape of the dataset=', df.shape)

print(' \nThe null count of each column of the dataset are as follows:')
df.isnull().sum()

##### Observation:
- From the above cell, we can see that there are no null values present in the dataset.

In [None]:
# Function to identify numeric features:
def numeric_features(dataset):
    numeric_col = dataset.select_dtypes(include=np.number).columns.tolist()
    return dataset[numeric_col].head()
    
numeric_columns = numeric_features(df)
print("Numerical Features:")
print(numeric_columns)

print("===="*20)

# Function to identify categorical features:
def categorical_features(dataset):
    categorical_col = dataset.select_dtypes(exclude=np.number).columns.tolist()
    return dataset[categorical_col].head()

categorical_columns = categorical_features(df)
print("Categorical Features:")
print(categorical_columns)

print("===="*20)

# Function to check the datatypes of all the columns:
def check_datatypes(dataset):
    return dataset.dtypes

print("Datatypes of all the columns:")
check_datatypes(df)

### Detect outliers in the continuous columns

Outliers are observations that lie far away from the majority of observations in the dataset. They can be defined mathematically in different ways.

One common definition: outliers are data points lying above **(third quartile + 1.5 x IQR)** or below **(first quartile - 1.5 x IQR)**, where IQR is the interquartile range (third quartile minus first quartile).

- The function below takes a dataframe and outputs the number of outliers in every numeric feature based on the above *IQR* rule.

You can modify the function below to capture outliers according to other definitions.
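To make the rule concrete, here is a small worked sketch on made-up numbers (the `values` list is purely illustrative and is not taken from the dataset). It uses only the standard library; `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should agree with the default behaviour of pandas' `quantile`.

```python
import statistics

def iqr_fences(values):
    """Return (fence_low, fence_high) using the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_fences(values)
    return [v for v in values if v < low or v > high]

# Illustrative data only: nine ordinary values and one extreme value
values = [15, 16, 17, 18, 19, 20, 21, 22, 23, 137]

fences = iqr_fences(values)      # q1 = 17.25, q3 = 21.75, IQR = 4.5
outliers = find_outliers(values)
print("fences:", fences)         # (10.5, 28.5)
print("outliers:", outliers)     # [137]
```

Here the quartiles are 17.25 and 21.75, so anything below 10.5 or above 28.5 counts as an outlier, and only 137 qualifies.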

In [None]:
# Function to detect outliers in every numeric feature
def detect_outliers(df):
    rows = []
    for column in df.select_dtypes(include=np.number).columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        fence_low = q1 - (1.5*iqr)
        fence_high = q3 + (1.5*iqr)
        # Count the values that fall outside the IQR fences
        n_outliers = ((df[column] < fence_low) | (df[column] > fence_high)).sum()
        rows.append({'Feature': column, 'Number of Outliers': n_outliers})
    # DataFrame.append was removed in pandas 2.0, so rows are collected in a list first
    return pd.DataFrame(rows, columns=['Feature', 'Number of Outliers'])

detect_outliers(df)

#####  Observation:
- From the above output, it is clear that there are almost no outliers present in the dataset.

### ------------------------------------------------------------------------------------------------------------------------------------

## EDA & Data Visualizations

Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics with visualizations. The EDA process is a crucial step prior to building a model in order to unravel various insights that later become important in developing a robust algorithmic model.

### Univariate analysis

Univariate analysis means the analysis of a single variable. It mainly describes the characteristics of that variable.

In [None]:
sns.countplot(x='Gender', data=df)  # Passing x and data by keyword, as required by recent seaborn versions

##### Observation:
- From the above countplot, we can say that there are more female entries in the dataset when compared to male.

In [None]:
sns.histplot(df['Age'], bins=30, kde=True)  # distplot is deprecated in recent seaborn releases; histplot with kde=True replaces it

##### Observation:
- From the histogram and the descriptive analysis run on the dataset, the values in the Age column appear to be roughly normally distributed. The KDE curve on the plot resembles a bell curve.

### Bivariate Analysis 

Bivariate analysis involves checking the relationship between two variables simultaneously.

In [None]:
sns.boxplot(x='Gender', y='Age', data=df)  # Passing x and y by keyword, as required by recent seaborn versions

##### Observation:
- From the boxplot, we can see that the average age of females in the dataset is higher than that of males (note that the line inside each box marks the median). Also, there are no outliers in the Age column for either gender.

**Seaborn Pairplot**

A Seaborn pairplot shows the relationship between every pair of variables in a pandas DataFrame. A single scatter plot shows only two variables, whereas `sns.pairplot` draws the pairwise plots of multiple features in a grid.

In [None]:
sns.pairplot(df)

In [None]:
df.head()

### --------------------------------------------------------------------------------------------------------------------------------------

## Treating the categorical feature:
- Most ML algorithms, including K-Means, do not work with text values, so we need to convert these values to numerical data.
- In this dataset, the only text column is the categorical column called `Gender`.
- It contains the gender of the customer as **Male** or **Female**.
- So, we can `map` the male and female entries in the dataset to numbers as shown below.

In [None]:
gender= {'Male':0, 'Female':1}
df['Gender']= df['Gender'].map(gender)

In [None]:
# Looking at the head of the dataset to see if the mapping worked.

df.head()

#### Checking the correlation of the features with the help of 'Heatmap'

A **correlation** between two random variables describes a statistical association, that is, how close the two variables are to having a linear relationship. The correlation can range between -1 and 1:

- A correlation of 1 means the variables are perfectly positively correlated.
- A correlation of 0 means there is no linear correlation between the variables.
- A correlation of -1 means the variables are perfectly negatively correlated.
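The three cases above can be checked with a small from-scratch sketch of the Pearson correlation coefficient, which is the default method used by pandas' `DataFrame.corr()`. The helper name `pearson` and the sample lists are illustrative assumptions, not part of the notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # y rises linearly with x
r_neg = pearson(x, [10, 8, 6, 4, 2])   # y falls linearly with x
r_zero = pearson(x, [1, 3, 5, 3, 1])   # symmetric shape, no linear trend

print(round(r_pos, 6), round(r_neg, 6), round(abs(r_zero), 6))
```

Note that the last pair is clearly related (the values rise and then fall), yet its correlation is zero: the coefficient only measures linear association.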

In [None]:
sns.heatmap(df.corr(), annot=True, cmap='magma')

##### Observation:

- From the above heatmap, we can see that 'Age' is negatively correlated with both 'Annual Income' and 'Spending Score'.
- 'Annual Income' has almost no correlation with 'Spending Score'.
- 'Gender' is only weakly correlated with 'Spending Score', although this correlation is stronger than the one between 'Gender' and 'Annual Income'.

### ------------------------------------------------------------------------------------------------------------------------------------

In [None]:
df.columns

In [None]:
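# Separating the two variables used for plotting the clusters.
# K-Means is unsupervised, so there is no true target here; 'y' is simply the second axis.

x= df['Annual Income']   # Plotted on the x-axis
y= df['Spending Score']  # Plotted on the y-axis

In [None]:
df.drop('Spending Score', axis=1, inplace=True)  # Dropped from df here; its values are kept in 'y' and added back when the clustering data is built.
df.head()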

# Model training:
### Model used: **KMeans Clustering**

- K-Means clustering is an unsupervised learning algorithm that groups data points based on their similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm only tries to find patterns in the data.
- In K-Means clustering, we have to specify the number of clusters we want the data to be grouped into.
- The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:
    1. Reassign each data point to the cluster whose centroid is closest.
    2. Calculate the new centroid of each cluster.
- These two steps are repeated until the within-cluster variation cannot be reduced any further.
- The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
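The two steps above can be sketched from scratch in plain Python. This is only an illustration of the idea, not how scikit-learn implements `KMeans`: the function name `kmeans`, the toy `points`, and the starting centroids are all assumptions, and the starting centroids are passed in explicitly (instead of being chosen randomly) so that the result is reproducible. The returned `wcss` is the sum of squared distances to the assigned centroids, the same quantity scikit-learn exposes as `inertia_`.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, init_centroids, max_iter=100):
    """Alternate the assignment and update steps until the labels stop changing."""
    centroids = list(init_centroids)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:   # converged: no point changed cluster
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its assigned points
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:            # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squares (WCSS)
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

# Toy data: two visually obvious groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids, wcss = kmeans(points, init_centroids=[(1, 1), (2, 1)])
print(labels)      # [0, 0, 0, 1, 1, 1]
print(centroids)
print(wcss)
```

Even with both starting centroids placed inside the same group, the second centroid is pulled towards the far group after the first update, and the labels settle after two reassignments.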

In [None]:
# As a first guess, let us assume that the customers can be divided into 2 clusters (groups).

km= KMeans(n_clusters=2, n_init=10, random_state=42)  # random_state makes the result reproducible

In [None]:
data=pd.DataFrame({'x1':df['Annual Income'],'x2':df['Gender'],'y': y})

In [None]:
km.fit(data)    # Fitting the K-Means algorithm on the dataset.

In [None]:
yp=km.predict(data)

##### Let us plot the `Scatter Plot` to view the 2 clusters obtained.

In [None]:
plt.scatter(x,y,c=yp)

### ------------------------------------------------------------------------------------------------------------------------

## Finding correct number of clusters:

### `The Elbow Method`

- The Elbow Method is one of the most popular methods to determine the optimal value of K.
- The scatter plot above used K = 2 only as an initial assumption; the elbow plot lets us check which value of K the data actually supports.
- The code below fits K-Means for K = 1 to 10 and records the within-cluster sum of squares (WCSS) for each K, so that the "elbow" can be read from the plot.
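The elbow is usually read off the plot by eye. As a rough complement, the sketch below picks the K at which the WCSS curve bends most sharply, measured by the second difference of the curve. This is a simple heuristic and an assumption of this sketch, not part of the Elbow Method itself; the function name and the `hypothetical_wcss` values are made up to resemble a typical elbow curve and are not the notebook's output. In the notebook, the computed `wcss` list would be passed in instead.

```python
def elbow_by_second_difference(ks, wcss):
    """Return the k where the WCSS curve bends most sharply.

    The bend at position i is wcss[i-1] - 2*wcss[i] + wcss[i+1];
    a large positive value means the curve flattens abruptly after that k.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(wcss) - 1):
        bend = wcss[i - 1] - 2 * wcss[i] + wcss[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = list(range(1, 11))
# Hypothetical values for illustration only
hypothetical_wcss = [1000, 700, 450, 280, 150, 130, 115, 103, 94, 87]

elbow_k = elbow_by_second_difference(ks, hypothetical_wcss)
print("Suggested number of clusters:", elbow_k)   # 5 for these made-up values
```

The heuristic can be misled by noisy curves, so it is best treated as a sanity check on the visual reading rather than a replacement for it.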

In [None]:
wcss=[]

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 300, n_init = 10, random_state = 42)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
    
# Plotting the results onto a line graph, 
# allowing us to observe 'The elbow'
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss)
plt.title('The elbow method', fontweight="bold")
plt.xlabel('Number of clusters(K)')
plt.ylabel('within Clusters Sum of Squares(WCSS)') # Within cluster sum of squares

##### Observation:

- From the above graph, the elbow appears at K = 5, so we can say that the data is best clustered into 5 groups.

In [None]:
# Applying K-Means to the dataset / Creating the K-Means model with 5 clusters.
km= KMeans(n_clusters=5, n_init=10, random_state=42)  # random_state makes the result reproducible
km.fit(data)

In [None]:
yp=km.predict(data)

In [None]:
plt.scatter(data['x1'],data['y'],c=yp)
plt.title("Clustering customers based on Annual Income and Spending score", fontsize=15,fontweight="bold")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")

##### Observation:

- On the basis of the above graph, we can say that the clusters represent the following groups:
    1. HI----HS (High annual income, High Spending score)
    2. LI----HS (Low annual income, High spending score)
    3. HI----LS (High annual income, Low spending score)
    4. LI----LS (Low annual income, Low spending score)
    5. II----IS (Intermediate annual income, Intermediate spending score)
    
            where the second letter gives the variable, I: income and S: spending score,
            and the first letter gives the level,
                   H: high,
                   L: low,
                   I: intermediate
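K-Means cluster numbers are arbitrary, so the numbering in the list above need not match the labels in `yp`. One way to attach the names to the fitted clusters is to look at each cluster's centroid. The sketch below is a minimal illustration under stated assumptions: the helper names, the cut points, and the `hypothetical_centroids` are all made up for the example, and each centroid is given as an (annual income, spending score) pair. In the notebook, the centroids would come from `km.cluster_centers_`, which also contains the gender column.

```python
def level(value, low_cut, high_cut):
    """Classify a value as Low, Intermediate or High using two cut points."""
    if value < low_cut:
        return "L"
    if value > high_cut:
        return "H"
    return "I"

def name_cluster(centroid, income_cuts, score_cuts):
    """Build a name such as 'HI----LS' from an (income, score) centroid."""
    income, score = centroid
    return f"{level(income, *income_cuts)}I----{level(score, *score_cuts)}S"

# Hypothetical centroids and cut points, for illustration only
hypothetical_centroids = [(88, 80), (25, 80), (88, 17), (26, 20), (55, 50)]
names = [
    name_cluster(c, income_cuts=(40, 70), score_cuts=(40, 60))
    for c in hypothetical_centroids
]
print(names)
# ['HI----HS', 'LI----HS', 'HI----LS', 'LI----LS', 'II----IS']
```

The cut points decide where "low" ends and "high" begins, so they should be chosen by looking at the actual income and spending score ranges in the data.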