# Problem Statement:

Suppose **Decathlon** wants to send advertisements and promotions for its sports products. Instead of spending money on mass advertising, Decathlon wants to identify potential customers who are likely to buy sports products and target its ads at those groups. This is likely to increase the hit ratio of the advertisements and promotions. The task is therefore to find groups of teenagers based on their interest in sports.

**Steps:**

1. Acquire data

2. Clean data

3. Standardise data

4. Apply K-Means Clustering

# Data Source:

**snsdata.csv**

# Data Dictionary:

The snsdata dataset has **30000 rows** (samples/observations) and **40 columns** (attributes).

These are the attributes in the dataset:

'gradyear', 'gender', 'age', 'friends', 'basketball', 'football',
'soccer', 'softball', 'volleyball', 'swimming', 'cheerleading',
'baseball', 'tennis', 'sports', 'cute', 'sex', 'sexy', 'hot', 'kissed',
'dance', 'band', 'marching', 'music', 'rock', 'god', 'church', 'jesus',
'bible', 'hair', 'dress', 'blonde', 'mall', 'shopping', 'clothes',
'hollister', 'abercrombie', 'die', 'death', 'drunk', 'drugs'

# Import Libraries

In [None]:
# Pandas is a package for data manipulation and analysis
import pandas as pd

# Numpy is a package for scientific computing (multi-dimensional arrays, matrices, mathematical functions)
import numpy as np

# Data visualization library
import matplotlib.pyplot as plt

# Seaborn is a Python data visualization library for statistical graphics
import seaborn as sns

# Magic function which displays plots directly below the code cell in jupyter notebook
%matplotlib inline

# K-Means Clustering algorithm
from sklearn.cluster import KMeans


# Acquire Data

In [None]:
teens_df = pd.read_csv("snsdata.csv")

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teens_df.head())

In [None]:
# To display number of rows and columns in the teens_df

teens_df.shape

In [None]:
teens_df.info()

# Summary Statistics

In [None]:
teens_df.drop(['gradyear'], axis=1).describe()

In [None]:
# Check for missing values
print('Number of missing values across columns-\n', teens_df.isnull().sum())

In [None]:
# Missing value imputation using the 'mean' age

teens_df = teens_df.fillna({'age': teens_df.age.mean()})

# Alternative way of handling missing values in 'age': group by graduation year,
# calculate the mean age for that graduation year, and impute the missing values with it.
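
The group-wise alternative mentioned above could look roughly like the sketch below. It runs on a small made-up frame (`toy_df` and its values are hypothetical, not taken from snsdata.csv) and only assumes the same column names, `gradyear` and `age`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for teens_df (values are made up)
toy_df = pd.DataFrame({
    "gradyear": [2006, 2006, 2006, 2007, 2007, 2007],
    "age": [18.0, 19.0, np.nan, 17.0, np.nan, 18.0],
})

# Mean age per graduation year, broadcast back to every row of that year
# (missing ages are skipped when the mean is computed)
group_mean_age = toy_df.groupby("gradyear")["age"].transform("mean")

# Fill only the missing ages with the mean of the matching graduation year
toy_df["age"] = toy_df["age"].fillna(group_mean_age)

print(toy_df)
```

Compared with a single global mean, this keeps the imputed age consistent with the teenager's graduation year.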


The teenagers' interest columns are extracted and stored in a separate dataframe called interest_df. These columns form the dimensions for the cluster analysis. The interest columns appear to be some kind of relative weights, and since we don't know the range of those weights, we scale all the interest columns.

In [None]:
# loc - label based indexing
# For clustering based on interests, we have not considered 4 variables 'gradyear', 'gender', 'age', 'friends' 

interest_df = teens_df.loc[:, 'basketball':'drugs']

In [None]:
# No. of rows and columns

interest_df.shape

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
interest_std = scaler.fit_transform(interest_df)
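
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes, checked against the same calculation done by hand. The matrix `toy_counts` is a made-up stand-in for `interest_df`, not real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 interest columns (values are made up)
toy_counts = np.array([[0.0, 1.0],
                       [2.0, 1.0],
                       [4.0, 3.0],
                       [6.0, 3.0]])

toy_std = StandardScaler().fit_transform(toy_counts)

# Same result by hand: subtract the column mean, divide by the population
# standard deviation (ddof=0, which is NumPy's default)
manual_std = (toy_counts - toy_counts.mean(axis=0)) / toy_counts.std(axis=0)

print(np.allclose(toy_std, manual_std))
print(toy_std.mean(axis=0), toy_std.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single interest dominates the Euclidean distances that K-Means relies on.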

In [None]:
# matplotlib.pyplot is already imported above as plt; cm provides matplotlib's colormaps
from matplotlib import cm

In [None]:
#The built-in range function in Python is very useful to generate sequences of numbers 

list(range(1, 15))

In [None]:
cluster_range = range(1, 15)
cluster_errors = []

for num_clusters in cluster_range:
    # random_state is fixed so that the inertia values are reproducible
    clusters = KMeans(n_clusters=num_clusters, random_state=7)
    clusters.fit(interest_std)
    # inertia_ is the within-cluster sum of squared distances for this K
    cluster_errors.append(clusters.inertia_)

clusters_df = pd.DataFrame({"num_clusters": list(cluster_range), "cluster_errors": cluster_errors})

# The total sum of squared distances of every data point from its respective centroid
# is called inertia (cluster_error). Let us print the inertia value for all K values.
# The K at which the inertia stops dropping significantly (elbow method) is taken as the best K.

clusters_df[0:15]
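
As an illustration of the inertia definition above, the following sketch recomputes it by hand and compares it with `inertia_`. The data `toy_X` is randomly generated for the example and has nothing to do with snsdata.csv; `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs (randomly generated)
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                   rng.normal(5.0, 0.5, size=(20, 2))])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=7).fit(toy_X)

# Look up the centroid assigned to each point, then sum the squared
# Euclidean distances between each point and its centroid
assigned_centroids = toy_km.cluster_centers_[toy_km.labels_]
manual_inertia = ((toy_X - assigned_centroids) ** 2).sum()

print(manual_inertia, toy_km.inertia_)
```

The two printed values should agree up to floating-point error.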

In [None]:
# Elbow plot

plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
plt.xlabel("num of clusters(k)")
plt.ylabel("cluster errors")

In [None]:
kmeans = KMeans(n_clusters=6, random_state=7) #set random_state to some number for reproducibility

In [None]:
kmeans.fit(interest_std)
centroids = kmeans.cluster_centers_

In [None]:
centroids
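
The raw centroid array is in standardised units and has no column names, which makes it hard to read. One possible way to make it more interpretable, sketched here on made-up data (`toy_interest`, `toy_scaler` and `toy_kmeans` are hypothetical stand-ins for `interest_df`, `scaler` and `kmeans`), is to label the columns and map the centroids back to the original units with `inverse_transform`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy interest data (made-up values, two columns only)
toy_interest = pd.DataFrame({
    "basketball": [0, 1, 0, 5, 6, 4],
    "music":      [3, 4, 5, 0, 1, 0],
})

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_interest)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=7).fit(toy_scaled)

# Centroids in standardised units, labelled with the original column names
centroids_std = pd.DataFrame(toy_kmeans.cluster_centers_,
                             columns=toy_interest.columns)

# The same centroids mapped back to the original units
centroids_orig = pd.DataFrame(
    toy_scaler.inverse_transform(toy_kmeans.cluster_centers_),
    columns=toy_interest.columns,
)

print(centroids_std.round(2))
print(centroids_orig.round(2))
```

Each row of `centroids_orig` is the average profile of one cluster, which is usually easier to describe to a marketing audience than z-scores.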

In [None]:
# Create a new dataframe holding only the cluster labels and convert them to a categorical variable

teens_labels = pd.DataFrame(kmeans.labels_, columns=['cluster_labels'])

teens_labels['cluster_labels'] = teens_labels['cluster_labels'].astype('category')

In [None]:
teens_labels.head(10)

In [None]:
# Join the 'teens_labels' dataframe with the teens dataframe to create teens_df_labeled.
# Note: the labels could instead be added as a new column of the original dataframe

teens_df_labeled = teens_df.join(teens_labels)

In [None]:
teens_df_labeled.head(10)

In [None]:
# Dataset for cluster-1

teen_cluster1 = teens_df_labeled.loc[teens_df_labeled.cluster_labels == 0]
print('\nCluster-1 dataset shape:', teen_cluster1.shape)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teen_cluster1.head())

In [None]:
# Dataset for cluster-2

teen_cluster2 = teens_df_labeled.loc[teens_df_labeled.cluster_labels == 1]
print('\nCluster-2 dataset shape:', teen_cluster2.shape)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teen_cluster2.head())

### Clustering helps in identifying outliers: cluster-3 (with cluster_label=2) has one teenager who is an outlier

In [None]:
# Dataset for cluster-3

teen_cluster3 = teens_df_labeled.loc[teens_df_labeled.cluster_labels == 2]
print('\nCluster-3 dataset shape:', teen_cluster3.shape)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teen_cluster3.head())

In [None]:
# Dataset for cluster-4

teen_cluster4 = teens_df_labeled.loc[teens_df_labeled.cluster_labels == 3]
print('\nCluster-4 dataset shape:', teen_cluster4.shape)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teen_cluster4.head())

In [None]:
# Dataset for cluster-5

teen_cluster5 = teens_df_labeled.loc[teens_df_labeled.cluster_labels == 4]
print('\nCluster-5 dataset shape:', teen_cluster5.shape)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teen_cluster5.head())

In [None]:
# Dataset for cluster-6

teen_cluster6 = teens_df_labeled.loc[teens_df_labeled.cluster_labels == 5]
print('\nCluster-6 dataset shape:', teen_cluster6.shape)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(teen_cluster6.head())
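
The six per-cluster cells above can also be summarised in one step. The sketch below uses a small made-up frame (`toy_labeled` is a hypothetical stand-in for `teens_df_labeled`) to show cluster sizes and the mean interest profile per cluster.

```python
import pandas as pd

# Hypothetical labelled frame standing in for teens_df_labeled (made-up values)
toy_labeled = pd.DataFrame({
    "basketball": [0, 1, 5, 6, 0, 40],
    "music": [3, 4, 0, 1, 5, 0],
    "cluster_labels": pd.Categorical([0, 0, 1, 1, 0, 2]),
})

# Number of teenagers per cluster; a cluster of size 1 suggests an outlier
cluster_sizes = toy_labeled["cluster_labels"].value_counts().sort_index()

# Mean of each interest column within each cluster
cluster_profile = toy_labeled.groupby("cluster_labels", observed=True).mean()

print(cluster_sizes)
print(cluster_profile)
```

A cluster that contains a single row, like label 2 in this toy frame, stands out immediately in `cluster_sizes`.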


Silhouette refers to a method of interpretation and validation of consistency within clusters of data. 
The technique provides a graphical representation of how well each object lies within its cluster.
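
A minimal sketch of how silhouette values might be computed with scikit-learn follows. It uses randomly generated data (`toy_X` is hypothetical), not the scaled interest columns of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two well-separated blobs (randomly generated)
rng = np.random.default_rng(0)
toy_X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)),
                   rng.normal(5.0, 0.5, size=(30, 2))])

toy_labels = KMeans(n_clusters=2, n_init=10, random_state=7).fit_predict(toy_X)

# Mean silhouette coefficient over all points; values lie in [-1, 1],
# and higher values mean tighter, better-separated clusters
avg_silhouette = silhouette_score(toy_X, toy_labels)

# Per-point coefficients, the values a silhouette plot is drawn from
per_point = silhouette_samples(toy_X, toy_labels)

print(round(float(avg_silhouette), 3))
```

On a dataset with 30000 rows the pairwise distance computation can be expensive; the `sample_size` argument of `silhouette_score` is one way to evaluate the score on a random subset instead.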