<h1> Customer Segmentation </h1>

You are the owner of a shop. It doesn't matter whether you run an e-commerce site or a supermarket, or whether it is a small shop or a huge company such as Amazon or Netflix: it pays to know your customers.

You were able to collect basic data about the customers who hold a membership card: customer ID, age, gender, annual income, and spending score. The spending score is based on customer behavior and purchasing data.
There are some new products on the market that you are interested in selling, but you want to target a specific type of customer for each product.

Machine learning comes in handy for this task. In particular, clustering, one of the most common unsupervised learning tasks, groups similar individuals into categories.
These categories are called clusters. A cluster is a collection of points in a dataset that are more similar to each other than they are to points belonging to other clusters.
Distance-based clustering groups the points into some number of clusters such that distances within a cluster are small while distances between clusters are large.
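
To make the distance idea concrete, here is a minimal sketch in plain Python. The two groups of points and the helper names `mean_within` and `mean_between` are made up for illustration and are not taken from the customer data. It compares the average distance inside each group with the average distance across the two groups.

```python
from itertools import combinations, product
from math import dist

# Two hypothetical groups of (income, spending score) points, invented for illustration.
cluster_a = [(15, 39), (16, 42), (17, 40)]
cluster_b = [(78, 88), (80, 90), (79, 85)]

def mean_within(points):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_between(points_a, points_b):
    """Average Euclidean distance over all pairs that span the two clusters."""
    pairs = list(product(points_a, points_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

within_a = mean_within(cluster_a)
within_b = mean_within(cluster_b)
between = mean_between(cluster_a, cluster_b)

print(f"within A: {within_a:.2f}, within B: {within_b:.2f}, between: {between:.2f}")
```

For a good grouping, both within-cluster averages should come out much smaller than the between-cluster average.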

In [3]:
import pandas as pd
import numpy as np

In [4]:
customer = pd.read_csv("customers.csv")

We check the first five rows of the DataFrame. As expected, we have CustomerID, Gender, Age, Annual Income expressed in thousands of dollars (k$), and the Spending Score on a scale from 1 to 100.

In [5]:
customer.head(5)

Unnamed: 0,CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


<h2> 1. Exploring Data </h2>

Now it's time to explore the data to check its quality and the distribution of the variables.

<h3> Missing Data </h3>

First, we check whether there are any missing values in the dataset. The K-means algorithm cannot deal with missing values.

In [6]:
customer.isnull().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
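
This dataset has no missing values, so nothing needs fixing. If it did, two common options would be to drop the incomplete rows or to fill the gaps with a column statistic. The sketch below shows both on a small made-up DataFrame named `toy` (an assumption for illustration, not the customer data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the real customer data has none.
toy = pd.DataFrame({
    "Age": [19, 21, np.nan, 23],
    "Annual Income (k$)": [15, np.nan, 16, 16],
})

# Count missing values per column, the same check used on the customer data.
missing_per_column = toy.isnull().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill each gap with the median of its own column.
filled = toy.fillna(toy.median())

print(missing_per_column)
print(len(dropped), "rows remain after dropna")
print(filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of inventing values, so the choice depends on how many rows are affected.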

<h3> Duplicate Data </h3>

Next, we check for duplicated rows. There are none.

In [7]:
customer.duplicated().sum()

0
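
Note that `duplicated()` only flags rows that match an earlier row in every column. As a hedged sketch on a made-up DataFrame (`toy` and its values are assumptions), the example below also checks for repeated IDs, which can exist even when no full row is duplicated, and removes exact duplicates with `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row, plus an ID reused with a different age.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 3],
    "Age": [19, 21, 21, 20, 25],
})

# Rows identical to an earlier row in every column.
n_full_duplicates = toy.duplicated().sum()

# Rows whose CustomerID already appeared, regardless of the other columns.
n_repeated_ids = toy["CustomerID"].duplicated().sum()

# Keep the first occurrence of each fully duplicated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_full_duplicates, n_repeated_ids, len(deduplicated))
```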

<h3> Data type </h3>

Finally, we check how each variable is represented in the DataFrame. Because K-means is based on distances, categorical variables cannot be handled directly and must first be converted to numbers. The approach for converting them depends on the type of categorical variable (for example, nominal or ordinal).

In [8]:
customer.dtypes

CustomerID                 int64
Gender                    object
Age                        int64
Annual Income (k$)         int64
Spending Score (1-100)     int64
dtype: object
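
`Gender` is the only non-numeric column here. As a minimal sketch, a nominal variable with two levels can be mapped to 0/1, and `pd.get_dummies` generalises this to more levels with one indicator column per category. The toy DataFrame and the choice to map `Male` to 0 and `Female` to 1 are assumptions made for illustration; the notebook itself does not say which encoding it uses.

```python
import pandas as pd

# Hypothetical toy data mirroring the first rows shown earlier.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
})

# Binary mapping for a two-level nominal variable (the 0/1 assignment is arbitrary).
toy["Gender_code"] = toy["Gender"].map({"Male": 0, "Female": 1})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(toy["Gender"], prefix="Gender")

print(toy)
print(one_hot)
```

An ordinal variable (for example, small/medium/large) would instead be mapped to integers that preserve its order.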

<h2> 2. Descriptive Statistics and Distribution </h2>

For the descriptive statistics, we'll get the minimum, mean, maximum, quartiles (the second quartile is the median), standard deviation, and variance. If the variable is not numeric, we'll get the counts in each category.

After that, we can start observing the distribution of the variables. Here, we'll define two functions. The first one will retrieve descriptive statistics of the variables. The second one will help us graph the variable distribution.

In [57]:
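def statistics(variable):
    """Return descriptive statistics for a numeric Series, or category counts otherwise."""
    if variable.dtype == 'int64' or variable.dtype == 'float64':
        # Trailing underscores avoid shadowing the built-ins min() and max().
        min_ = variable.min()
        mean = np.mean(variable)
        max_ = variable.max()
        # Quartiles; q2 is the median.
        q1 = np.quantile(variable, 0.25)
        q2 = np.quantile(variable, 0.50)
        q3 = np.quantile(variable, 0.75)
        # np.std and np.var default to ddof=0 (population formulas).
        std = np.std(variable)
        var = np.var(variable)
        columns = ['variable', 'min', 'mean', 'max', 'q1', 'q2', 'q3', 'std', 'var']
        return pd.DataFrame([[variable.name, min_, mean, max_, q1, q2, q3, std, var]], columns=columns)
    else:
        # Non-numeric variable: count the observations in each category.
        return pd.DataFrame(variable.value_counts()).reset_index()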

In [None]:
def distribution(variable):
    pass  # plotting logic not shown
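
The body of `distribution` is not shown here. One possible shape for such a helper, offered purely as an assumption, is a histogram for numeric variables and a bar chart of category counts otherwise. The name `plot_distribution`, the `bins` parameter, and the use of matplotlib are all choices made for this sketch, not the notebook's own implementation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_distribution(variable, bins=10):
    """Hypothetical helper: histogram for a numeric Series, bar chart of counts otherwise."""
    fig, ax = plt.subplots()
    if pd.api.types.is_numeric_dtype(variable):
        ax.hist(variable.dropna(), bins=bins)
        ax.set_ylabel("Frequency")
    else:
        counts = variable.value_counts()
        ax.bar(counts.index.astype(str).tolist(), counts.values)
        ax.set_ylabel("Count")
    ax.set_xlabel(variable.name)
    ax.set_title(f"Distribution of {variable.name}")
    return ax

# Made-up values that mirror the first rows of the customer data.
ages = pd.Series([19, 21, 20, 23, 31], name="Age")
genders = pd.Series(["Male", "Male", "Female", "Female", "Female"], name="Gender")

ax_age = plot_distribution(ages, bins=5)
ax_gender = plot_distribution(genders)
```

Returning the `Axes` object lets the caller adjust labels or save the figure afterwards.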

In [58]:
statistics(customer['CustomerID'])

Unnamed: 0,variable,min,mean,max,q1,q2,q3,std,var
0,CustomerID,1,100.5,200,50.75,100.5,150.25,57.734305,3333.25
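
One detail worth knowing when reading the `std` and `var` columns: `np.std` and `np.var` default to `ddof=0` (population formulas), whereas the pandas methods `Series.std()` and `Series.var()` default to `ddof=1` (sample formulas). The sketch below rebuilds the IDs as the integers 1 to 200, which is an assumption consistent with the min, max, and mean above, and compares the two conventions.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the ID column: consecutive integers from 1 to 200.
ids = pd.Series(np.arange(1, 201), name="CustomerID")

# Population variance (divides by n), the NumPy default.
population_var = np.var(ids.to_numpy())

# Sample variance (divides by n - 1), the pandas default.
sample_var = ids.var()

print(population_var, sample_var)
```

With 200 observations the two values differ by a factor of 200/199, so the gap is small here, but it matters when comparing this output against `customer.describe()`, which reports the sample standard deviation.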


200