## Market Segmentation & Clustering Algorithms ##

Market segmentation divides a broad target market into smaller cohorts of similar customers. Companies then design a specific marketing strategy for each group. Clustering is commonly used for market segmentation because it finds groups of similar observations in a data set.

Here, we see how clustering is applied to customers who are members of an airline's frequent flyer program. The company's objective is to learn more about this group so that it can target sub-segments, within what appears to be a homogeneous population, with different mileage-based offers.

The data come from the textbook, "Data Mining for Business Intelligence," by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce.

(source: MITx)

**The Variables**

There are seven different variables in the dataset:

***Balance*** = number of miles eligible for award travel

***QualMiles*** = number of miles qualifying for TopFlight status

***BonusMiles*** = number of miles earned from non-flight bonus transactions in the past 12 months

***BonusTrans*** = number of non-flight bonus transactions in the past 12 months

***FlightMiles*** = number of flight miles in the past 12 months

***FlightTrans*** = number of flight transactions in the past 12 months

***DaysSinceEnroll*** = number of days since enrolled in the frequent flyer program

### Data Preparation ###

In [2]:
airlines = read.csv("AirlinesCluster.csv")

In [3]:
str(airlines)

'data.frame':	3999 obs. of  7 variables:
 $ Balance        : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
 $ QualMiles      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ BonusMiles     : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
 $ BonusTrans     : int  1 2 4 1 26 0 25 4 43 28 ...
 $ FlightMiles    : int  0 0 0 0 2077 0 0 250 3850 1150 ...
 $ FlightTrans    : int  0 0 0 0 4 0 0 1 12 3 ...
 $ DaysSinceEnroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...


In [4]:
summary(airlines)

    Balance          QualMiles         BonusMiles       BonusTrans  
 Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
 1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
 Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
 Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
 3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23800   3rd Qu.:17.0  
 Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  
  FlightMiles       FlightTrans     DaysSinceEnroll
 Min.   :    0.0   Min.   : 0.000   Min.   :   2   
 1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
 Median :    0.0   Median : 0.000   Median :4096   
 Mean   :  460.1   Mean   : 1.374   Mean   :4119   
 3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
 Max.   :30817.0   Max.   :53.000   Max.   :8296   

### Exploratory Data Analysis ###

Locating the predictors that have the maximum and minimum values is generally a fruitful exercise.

It looks like the **Balance** feature contains the largest value in our data set.

In [79]:
sort(sapply(airlines,max), decreasing=TRUE, method='radix')

Based on the output below, we see that the **FlightTrans** predictor has the smallest average value in our data set.

In [81]:
# sort the column means in ascending order
sort(sapply(airlines, mean), decreasing=FALSE, method='radix')
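
The same comparison can be sketched in one pass. The snippet below is a minimal illustration on a made-up data frame (`toy` and its values are assumptions, not rows from `AirlinesCluster.csv`). It collects the minimum, mean, and maximum of each column and then picks out the column names of interest.

```r
# Toy data frame standing in for `airlines` (values are made up)
toy <- data.frame(
  Balance     = c(28143, 19244, 41354),
  FlightTrans = c(0, 4, 1),
  BonusTrans  = c(1, 26, 4)
)

# One row per statistic, one column per variable
stats <- sapply(toy, function(x) c(min = min(x), mean = mean(x), max = max(x)))
print(stats)

# Names of the variables with the largest maximum and the smallest mean
largest_max   <- names(which.max(stats["max", ]))
smallest_mean <- names(which.min(stats["mean", ]))
```

Keeping all three statistics in one matrix makes it easy to see at a glance which columns live on very different scales.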

A cursory review of our lists of maximum and minimum values above reveals a vast discrepancy in scale among our predictors. This is problematic in clustering, where distances between observations and cluster averages determine the placement of observations within clusters: a variable measured in the hundreds of thousands, such as **Balance**, would swamp one measured in single digits, such as **FlightTrans**. Below, we normalize the data to prevent larger values from dominating our calculations.
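
To see why scale matters, consider a small sketch with three hypothetical customers (the `toy` data frame and its values are invented for illustration). On the raw scale, Euclidean distance is driven almost entirely by the column with the largest numbers. After centering and scaling, both columns contribute.

```r
# Three hypothetical customers: A and B have similar balances
# but very different flight activity; C has a much larger balance
toy <- data.frame(
  Balance     = c(50000, 51000, 90000),
  FlightTrans = c(0, 30, 1)
)
rownames(toy) <- c("A", "B", "C")

# Euclidean distances on the raw scale: Balance dominates
raw_dist <- as.matrix(dist(toy))

# Euclidean distances after centering and scaling each column
scaled_dist <- as.matrix(dist(scale(toy)))

print(round(raw_dist, 1))
print(round(scaled_dist, 3))
```

On the raw data, A and B look nearly identical because a gap of 30 flight transactions is negligible next to balances in the tens of thousands. On the scaled data, the A-B distance is comparable to the A-C distance.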

In [5]:
install.packages("caret")
library(caret)

“installation of package ‘caret’ had non-zero exit status”
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Loading required package: lattice
Loading required package: ggplot2
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang


In [6]:
# compute the centering and scaling parameters (caret's default methods)
preproc = preProcess(airlines)
# apply the transformation to obtain the normalized data
airlinesNorm = predict(preproc, airlines)
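
As a rough sketch of what this step amounts to, assuming `preProcess` is left at its default of centering and scaling, the same transformation can be written in base R. The `toy` data frame and the names `centers`, `scales`, and `toyNorm` are illustrative assumptions, not part of the original analysis.

```r
# Small illustrative data frame (values are made up)
toy <- data.frame(
  Balance     = c(10000, 20000, 40000, 90000),
  FlightTrans = c(0, 0, 4, 1)
)

# Store the centering and scaling parameters, much as a fitted
# preprocessing object would
centers <- sapply(toy, mean)
scales  <- sapply(toy, sd)

# Apply them column by column: z = (x - mean) / sd
toyNorm <- as.data.frame(Map(function(x, m, s) (x - m) / s, toy, centers, scales))

# Base R's scale() should give the same numbers
stopifnot(isTRUE(all.equal(as.matrix(toyNorm), scale(toy), check.attributes = FALSE)))
```

Separating the parameter computation from its application mirrors the `preProcess` / `predict` split: the stored means and standard deviations could later be applied to new observations.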

In [8]:
# verify normalization has occurred (mean = 0, sd = 1)
summary(airlinesNorm)

    Balance          QualMiles         BonusMiles        BonusTrans      
 Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
 1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
 Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
 Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  
  FlightMiles       FlightTrans       DaysSinceEnroll   
 Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
 1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
 Median :-0.3286   Median :-0.36212   Median :-0.01092  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
 3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
 Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284  

In [11]:
# retrieve standard deviations for normalized features
lapply(airlinesNorm, sd)

The mean value of every predictor is zero, and each standard deviation is one. We can therefore conclude that data normalization was successful.

Interestingly, prior to normalization, the variable with the maximum value was **Balance**. After normalization, **FlightMiles** takes the top spot.

In [78]:
sort(sapply(airlinesNorm,max), decreasing=TRUE, method='radix')
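
One way to see why the ranking changes: the normalized maximum of a column is (max - mean) / sd, so a column with modest values and a single extreme outlier can outrank a column with much larger raw values. A minimal sketch on invented data (`Big` and `Spiky` are hypothetical columns):

```r
# "Big" has large values but also a large spread;
# "Spiky" has small values with a single outlier
toy <- data.frame(
  Big   = c(100000, 200000, 300000, 400000, 500000),
  Spiky = c(0, 0, 0, 0, 50)
)

# Largest value per column before and after normalization
raw_max  <- sapply(toy, max)
norm_max <- sapply(toy, function(x) (max(x) - mean(x)) / sd(x))

print(raw_max)
print(round(norm_max, 3))
```

Here `Big` has the larger raw maximum, while `Spiky` has the larger normalized maximum because its outlier sits further from the mean in standard-deviation units.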

Additionally, before normalization, the **FlightTrans** predictor had the smallest average value in our data set. Because every mean is now zero, we compare minimum values instead, and **DaysSinceEnroll** has the smallest.

In [72]:
sort(sapply(airlinesNorm,min), decreasing=FALSE, method='radix')

## Hierarchical Clustering Model ##