<a href="https://colab.research.google.com/github/Lara237/MasterLAB/blob/master/9_Finding_Groups_of_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finding Groups of Data – Clustering with k-means

This notebook incorporates the code samples from chapter 9 of [Machine Learning with R](https://www.amazon.co.uk/Machine-Learning-techniques-predictive-modeling/dp/1788295862/ref=dp_ob_title_bk). The original chapter offers more content and background information; this notebook only presents the source code with related comments.

This notebook explains clustering by way of an example. The goal of clustering is to derive new information from relationships within the dataset itself: it tells you which data points are related to each other and can therefore form a group of similar data points. Let's start with the example:
It is no secret that teenagers use social media to keep in touch, interact and share their preferences. It is also common knowledge that companies use this information for online marketing. Why? Because by processing the information found in social media accounts they can define target groups, select the ones relevant to them and create individualized marketing content. The following example applies exactly this idea: clustering students into groups based on their social media profiles.


The dataset used for this example dates from 2006 and contains 30,000 social-network-service profiles of US high school students. Because this service was widely used among US students in 2006 and the profiles span four high school graduation years, it can reasonably be assumed that the sample represents American teenagers and adolescents of that time. The dataset includes age, gender, year of graduation and number of social-network friends, as well as counts of how many times each word from a preselected word list occurred in the profile. This list consists of 36 words chosen to represent five areas of interest: fashion, extracurricular activities, romance, religion and antisocial behavior. The dataset therefore has 30,000 data points and 40 variables.

In [8]:
teens <- read.csv("Chapter09_snsdata.csv")
str(teens)

'data.frame':	30000 obs. of  40 variables:
 $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
 $ age         : num  19 18.8 18.3 18.9 19 ...
 $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
 $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
 $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
 $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
 $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
 $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
 $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
 $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
 $ dance       : int

As you may have noticed, the gender variable contains an *NA*, which indicates a missing value. Because missing values can affect the outcome of the analysis, the number of NAs in the dataset has to be checked. By default *table()* omits them; the additional parameter *useNA = "ifany"* includes them in the counts.

In [0]:
table(teens$gender)
table(teens$gender, useNA = "ifany")


    F     M 
22054  5222 


    F     M  <NA> 
22054  5222  2724 
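
The difference between the two calls can be reproduced on a small vector. The following is a minimal sketch; the `gender` vector below is made-up illustration data, not taken from the teens dataset.

```r
# Toy vector standing in for teens$gender (values are illustrative only)
gender <- factor(c("F", "M", NA, "F", NA, "F"))

# Default behaviour: NA values are dropped from the counts
counts_default <- table(gender)

# useNA = "ifany": an extra <NA> entry appears when missing values exist
counts_with_na <- table(gender, useNA = "ifany")

# Direct count of the missing values
n_missing <- sum(is.na(gender))

print(counts_default)
print(counts_with_na)
print(n_missing)
```

Comparing the totals of both tables is a quick way to see how many records the default call leaves out.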

Checking the other variables shows that NAs also exist in the *age* variable. Because this variable is numeric, the *summary()* command is used to report the number of missing values.

In [0]:
summary(teens$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.086  16.312  17.287  17.994  18.259 106.927    5086 

This summary shows not only that a large share of profiles (~17 %) lack age information, but also that the dataset includes unreasonable age values, as the *Min.* and *Max.* entries indicate. To achieve meaningful results such extreme values have to be removed. Therefore, any *age* value below 13 or of 20 and above is set to NA using the *ifelse()* function. This yields a more sensible distribution, as running *summary()* again shows (the maximum of 20.00 in the output is a value just under 20, rounded for display).

In [0]:
teens$age <- ifelse(teens$age >= 13 & teens$age < 20,
                     teens$age, NA)

In [0]:
summary(teens$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  13.03   16.30   17.27   17.25   18.22   20.00    5523 
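
How the condition treats boundary and missing values can be checked on a few hand-picked numbers. This is a small sketch; the `age` vector is hypothetical and only borrows the minimum and maximum from the summary above.

```r
# Hypothetical ages, including boundary cases and an existing NA
age <- c(3.086, 12.9, 13, 17.3, 19.99, 20, 106.927, NA)

# Keep values in [13, 20); everything else becomes NA.
# An NA input yields an NA condition, so ifelse() returns NA for it as well.
age_clean <- ifelse(age >= 13 & age < 20, age, NA)

print(age_clean)
print(sum(is.na(age_clean)))  # number of values now missing
```

Note that the lower bound is inclusive and the upper bound exclusive, so 13 is kept while exactly 20 is dropped.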

Excluding all records with missing values would remove a large number of data points from the dataset, which could in turn distort the outcome. The data points with missing values could, for example, represent a specific type of student. Therefore, dummy variables are created to represent the groups with missing gender (missing ages are handled by imputation further below). This is done with separate binary (1 or 0) dummy variables and the *is.na()* function, which tests whether a value is NA.

In [0]:
teens$female <- ifelse(teens$gender == "F" &
                         !is.na(teens$gender), 1, 0)
teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)

The first line assigns the value 1 to teens$female if the gender is "F" and not NA; otherwise it assigns 0. The second line assigns the value 1 to teens$no_gender if the gender is missing. The following tables confirm that the dummy variables work as intended: their counts match those of the original variable.

In [0]:
table(teens$gender, useNA = "ifany")
table(teens$female, useNA = "ifany")
table(teens$no_gender, useNA = "ifany")


    F     M  <NA> 
22054  5222  2724 


    0     1 
 7946 22054 


    0     1 
27276  2724 
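
Why the `!is.na()` guard matters becomes visible on a tiny example. The sketch below uses a made-up four-element `gender` vector and compares the guarded and unguarded versions; the name `female_unguarded` is introduced here for illustration only.

```r
# Toy gender vector with one missing value (illustrative data)
gender <- factor(c("F", "M", NA, "F"))

# Without the guard, the comparison with NA yields NA, not 0
female_unguarded <- ifelse(gender == "F", 1, 0)

# With the guard, NA & FALSE evaluates to FALSE, so the result is 0
female    <- ifelse(gender == "F" & !is.na(gender), 1, 0)
no_gender <- ifelse(is.na(gender), 1, 0)

print(female_unguarded)
print(female)
print(no_gender)
```

Male students are the reference group here: they are the rows where both dummy variables are 0.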

To fill in the missing *age* values an imputation strategy is used, which estimates the true values from the year of graduation. Because *mean()* returns NA when the vector contains missing values, the additional parameter *na.rm = TRUE* is needed to calculate the average age of the students.

In [0]:
mean(teens$age)
mean(teens$age, na.rm = TRUE)
aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)

gradyear,age
<int>,<dbl>
2006,18.65586
2007,17.70617
2008,16.7677
2009,15.81957


The last line produces the desired output (as a data frame): the average age for each of the four graduation years. To get these means back as a vector of the same length as the original one, the *ave()* function is used. The group means are then inserted in place of the NA values according to each student's year of graduation. The summary shows that no NA values remain in the *age* variable.

In [0]:
ave_age <- ave(teens$age, teens$gradyear,
                 FUN = function(x) mean(x, na.rm = TRUE))
teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)
summary(teens$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  13.03   16.28   17.24   17.24   18.21   20.00 
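
The imputation step can be traced by hand on a few rows. The following sketch uses a hypothetical six-row data frame `df`; only the pattern of `ave()` followed by `ifelse()` is taken from the cell above.

```r
# Hypothetical data: two graduation years, one missing age in each
df <- data.frame(
  gradyear = c(2006, 2006, 2006, 2007, 2007, 2007),
  age      = c(18, NA, 19, 17, 18, NA)
)

# ave() returns one value per row: the mean age of that row's gradyear group
group_mean <- ave(df$age, df$gradyear,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing ages with their group mean
df$age_imputed <- ifelse(is.na(df$age), group_mean, df$age)

print(group_mean)
print(df)
```

Observed ages are left unchanged; only the two missing entries receive the mean of their own graduation year (18.5 and 17.5 in this toy data).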

To cluster the students into groups, the *kmeans()* function from the *stats* package is used. The art is to choose the right combination of data and number of clusters, which can be a process of trial and error. As a first step, only the 36 word features are considered.

In [0]:
interests <- teens[5:40]

Because k-means relies on distance calculations, the features are standardized so that they are on a comparable scale: each feature is converted to z-scores with a mean of zero and a standard deviation of one. This is done with the *scale()* and *lapply()* functions. The result can be checked by comparing the summary tables; the newly created data frame should display a mean of zero.

In [0]:
interests_z <- as.data.frame(lapply(interests, scale))
summary(interests$basketball)
summary(interests_z$basketball)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.2673  0.0000 24.0000 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3322 -0.3322 -0.3322  0.0000 -0.3322 29.4923 
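
What *scale()* computes with its default settings can be verified against the z-score formula. This is a minimal sketch on a made-up count vector `x`, not on the teens data.

```r
# Toy word counts for one feature (illustrative values)
x <- c(0, 0, 0, 1, 4)

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops that structure
z_scale <- as.numeric(scale(x))

print(round(z_manual, 4))
print(isTRUE(all.equal(z_manual, z_scale)))
print(c(mean = mean(z_scale), sd = sd(z_scale)))
```

As in the basketball summary above, all zero counts map to the same negative z-score, because they lie equally far below the feature's mean.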

If the standardized value is below/above zero, the student has fewer/more mentions of basketball in their profile than average.
Next, the number of clusters (k) has to be determined. Choosing too many clusters yields very homogeneous groups but risks overfitting the data; choosing too few results in heterogeneous groups. The number of clusters should be varied to check how this affects the outcome. Here, five clusters are used as the starting value for k.
The *kmeans()* function mentioned above is applied to the standardized interests data frame. Because this function starts from random initial cluster centers, *set.seed()* is called first so that the results are reproducible and match the outputs shown below.
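
One possible way to explore the effect of k, and to confirm that a fixed seed makes the result reproducible, is sketched below. It runs on synthetic two-dimensional data (the `toy` matrix is an assumption made for illustration, not the teens data) and compares the total within-cluster sum of squares, which *kmeans()* returns as `tot.withinss`. The `nstart = 10` argument is an extra choice made here, not something the notebook uses.

```r
# Synthetic 2-D data with three well-separated groups (NOT the teens data)
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for several candidate values of k
ks <- 2:6
wss <- sapply(ks, function(k) {
  set.seed(2345)  # same seed for every k
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})
names(wss) <- ks
print(round(wss, 1))

# Same seed and same call give identical cluster assignments
set.seed(2345); run_a <- kmeans(toy, centers = 3)$cluster
set.seed(2345); run_b <- kmeans(toy, centers = 3)$cluster
print(identical(run_a, run_b))
```

A sharp drop in the within-cluster sum of squares followed by only small improvements hints at a reasonable k; for real data such as the teens profiles the picture is usually less clear-cut, so interpretability of the clusters remains an important criterion.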

In [0]:
set.seed(2345)
teen_clusters <- kmeans(interests_z, 5)
teen_clusters$size

The last line, teen_clusters$size, reports the number of students assigned to each cluster. If a cluster is far too large or too small, the result is unlikely to be very useful. In any case the homogeneity of each cluster has to be examined. To do so, the cluster centroids, which are the average positions of all the points in a cluster, are studied.

In [0]:
teen_clusters$centers

cluster,basketball,football,soccer,softball,volleyball,swimming,cheerleading,baseball,tennis,sports,⋯,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs
1,0.36216073,0.37985213,0.13734997,0.1272107,0.09247518,0.26180286,0.2159945,0.25312305,0.11991682,0.77040675,⋯,0.36322464,0.622896285,0.2760755,1.245121599,0.31525537,0.413156,1.712160983,0.94713629,1.83371069,2.73878856
2,-0.094426312,0.06691768,-0.09956009,-0.0379725,-0.07286202,0.04578401,-0.107037,-0.11182941,0.04027335,-0.10638613,⋯,-0.01238629,-0.087713363,-0.03710273,-0.004395251,-0.16788599,-0.1413652,0.008941101,0.05464759,-0.08699556,-0.06414588
3,0.003980104,0.09524062,0.05342109,-0.0496864,-0.01459648,0.32944934,0.5142451,-0.04933628,0.06703386,-0.05435093,⋯,0.03301526,0.808620531,1.07073115,0.61620736,0.85951603,0.793506,0.062399295,0.12642222,0.03594162,-0.05888141
4,1.372334818,1.19570343,0.55621097,1.1304527,1.07177211,0.0851321,0.0400367,1.09279737,0.13887184,1.08316097,⋯,0.03690938,-0.004723697,0.03497875,0.016201064,-0.08381546,-0.0861708,-0.067312427,-0.01611162,-0.06891763,-0.08795059
5,-0.186822093,-0.18729427,-0.08331351,-0.1368072,-0.13344819,-0.08650052,-0.1092056,-0.13616893,-0.03683671,-0.15903307,⋯,-0.02793327,-0.179127117,-0.2181658,-0.177738408,-0.16182051,-0.154543,-0.085876102,-0.06882571,-0.08386703,-0.10777278
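
A centroid matrix like the one above can be hard to scan by eye. One way to summarize it is to list, per cluster, the features with the largest centroid values. The sketch below does this for a small hypothetical `centers` matrix (three clusters, four features, invented numbers); with the notebook's objects one would presumably pass `teen_clusters$centers` instead.

```r
# Hypothetical centroid matrix: 3 clusters x 4 standardized features
centers <- matrix(
  c( 1.2,  0.9, -0.1, -0.2,
    -0.1, -0.2,  0.8,  1.1,
    -0.3, -0.1, -0.2, -0.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(as.character(1:3),
                  c("basketball", "football", "shopping", "mall"))
)

# For each cluster (row), take the names of the two largest centroid values
top_features <- apply(centers, 1, function(row) {
  names(sort(row, decreasing = TRUE))[1:2]
})

# One column per cluster; transpose so that each row describes one cluster
print(t(top_features))
```

Values near or below zero in the top positions, as for cluster 3 in this toy matrix, indicate a group without any distinctive interest.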


The numbers in each row are the cluster's average value for the respective feature. By checking whether the values are above or below zero, patterns of interest that distinguish one group from another can be detected. Cluster 4, for example, has the highest value in the basketball column, which means that this group of students mentions basketball most often on average. By examining these values it is possible to characterize groups, for example the athletes (cluster 4) with the highest values in most of the sports disciplines or the princesses (cluster 3) with the highest values in cheerleading and hot. And that's it: the marketing director now has the students grouped by interest and can start individual marketing campaigns. It is nevertheless worth checking how well these groups hold up before doing so.
Therefore, the cluster assignments are applied to the full dataset by adding the *cluster* component created by the *kmeans()* function as a column to the *teens* data. Now it can be studied how the cluster assignment relates to the individual students' characteristics.

In [0]:
teens$cluster <- teen_clusters$cluster
teens[1:5, c("cluster", "gender", "age", "friends")]

,cluster,gender,age,friends
,<int>,<fct>,<dbl>,<int>
1,5,M,18.982,7
2,3,F,18.801,0
3,5,M,18.335,69
4,5,F,18.875,0
5,1,NA,18.995,10


Using the *aggregate()* function, the demographic characteristics can be studied as well. The mean age does not vary much between the clusters, but the proportion of females does. This suggests that females and males differ in what they mention on social media and therefore in what they are interested in. The number of friends also appears to be related to the cluster: the princesses and athletes (clusters 3 and 4) have more friends on average than the other groups.

In [0]:
aggregate(data = teens, age ~ cluster, mean)
aggregate(data = teens, female ~ cluster, mean)
aggregate(data = teens, friends ~ cluster, mean)

cluster,age
<int>,<dbl>
1,17.09319
2,17.38488
3,17.03773
4,17.03759
5,17.30265


cluster,female
<int>,<dbl>
1,0.8025048
2,0.7237937
3,0.8866208
4,0.6984421
5,0.7082735


cluster,friends
<int>,<dbl>
1,30.6657
2,32.79368
3,38.54575
4,35.91728
5,27.79221
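
The reason the mean of the *female* column can be read as a proportion is that it is a 0/1 dummy variable. The sketch below illustrates this on a hypothetical five-row data frame `toy`, and also shows that several columns can be summarized in one call by using *cbind()* on the left-hand side of the formula.

```r
# Hypothetical data: cluster label, 0/1 female dummy, number of friends
toy <- data.frame(
  cluster = c(1, 1, 2, 2, 2),
  female  = c(1, 0, 1, 1, 1),
  friends = c(10, 30, 40, 50, 60)
)

# Mean of a 0/1 dummy per cluster equals the share of females in that cluster
female_share <- aggregate(female ~ cluster, data = toy, FUN = mean)

# Average number of friends per cluster
friends_mean <- aggregate(friends ~ cluster, data = toy, FUN = mean)

# Both summaries in a single call
both <- aggregate(cbind(female, friends) ~ cluster, data = toy, FUN = mean)

print(female_share)
print(friends_mean)
print(both)
```

In this toy data half of cluster 1 and all of cluster 2 are female, which is exactly what the per-cluster means report.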


These numbers suggest that grouping students into clusters with different interests and characteristics based on their social-networking profiles worked quite well. The marketing director can now produce targeted content specifically for one or more of these groups.