# Customer Segmentation - Preprocessing and Training

This notebook prepares the data set for clustering and builds a baseline clustering model. All data cleaning was done in [this](https://github.com/NickD-Dean/Springboard/blob/ce7358bfc84ceee14d287000eb132e8ff06a944b/Capstone%20Project%203/Code/03%20-%20Customer%20Segmentation%20Data%20Wrangling.ipynb) notebook, and the exploratory analysis can be found [here](https://github.com/NickD-Dean/Springboard/blob/ce7358bfc84ceee14d287000eb132e8ff06a944b/Capstone%20Project%203/Code/04%20-%20Exploratory%20Data%20Analysis.ipynb).

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm
from sklearn.decomposition import PCA
import matplotlib.colors

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

In [2]:
data = pd.read_csv('../Data/Trimmed_Data.csv', index_col = 0)

In [3]:
data.head()

| | Age | Marital_status | Income | Homeowner_status | Household_comp | Household_size | Kids | CampaignsRedeemed | CampaignsSent | Percent_CampaignRedeemed | ... | Num_stores_visited | Most_freq_store | Most_freq_time | First_active_day | Last_active_day | Recency | Frequency | Monetary | Avg_trips_week | Avg_shopping_lag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 2 | 8 | 0.25 | ... | 2 | 436.0 | 1456.0 | 51 | 706 | 5 | 0.120956 | 4330.16 | 1.264706 | 7.705882 |
| 7 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | 0 | 4 | 0.0 | ... | 2 | 359.0 | 1711.0 | 23 | 709 | 2 | 0.082982 | 3400.05 | 1.18 | 11.827586 |
| 8 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 | 1 | 10 | 0.1 | ... | 3 | 321.0 | 2149.0 | 65 | 706 | 5 | 0.158931 | 5534.97 | 1.569444 | 5.723214 |
| 13 | 25-34 | U | 75-99K | Homeowner | 2 Adults Kids | 4 | 2 | 7 | 10 | 0.7 | ... | 1 | 323.0 | 1130.0 | 101 | 709 | 2 | 0.386779 | 13190.92 | 3.16092 | 2.218978 |
| 16 | 45-54 | B | 50-74K | Homeowner | Single Female | 1 | None/Unknown | 0 | 2 | 0.0 | ... | 2 | 3316.0 | 657.0 | 98 | 690 | 21 | 0.137834 | 1512.02 | 1.689655 | 6.103093 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 801 entries, 1 to 2499
Data columns (total 58 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            801 non-null    object 
 1   Marital_status                 801 non-null    object 
 2   Income                         801 non-null    object 
 3   Homeowner_status               801 non-null    object 
 4   Household_comp                 801 non-null    object 
 5   Household_size                 801 non-null    object 
 6   Kids                           801 non-null    object 
 7   CampaignsRedeemed              801 non-null    int64  
 8   CampaignsSent                  801 non-null    int64  
 9   Percent_CampaignRedeemed       801 non-null    float64
 10  CouponRedeemed_Count           801 non-null    int64  
 11  CouponSent_Count               801 non-null    int64  
 12  Percent_CouponsRedeemed        801 non-null    fl

### Pre-processing steps:

1. Remove the collinear features identified in the previous EDA notebook.

2. Separate out the categorical data and one-hot encode it.

3. Remove outliers from the data set (initially, values that lie more than 3 standard deviations from the mean).

4. Apply a log transform to skewed features.

5. Scale all numerical features with `MinMaxScaler` so they share the [0, 1] range of the one-hot encoded data.
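
Steps 3 to 5 can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's actual code: the `toy` data, the planted outlier, and the choice of `np.log1p` for the log transform are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the numeric columns; the values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Monetary": rng.lognormal(mean=8, sigma=0.5, size=200),  # right-skewed
    "Recency": rng.integers(1, 30, size=200).astype(float),
})
toy.loc[0, "Monetary"] = 1e6  # plant one obvious outlier

# Step 3: keep rows where every column lies within 3 standard deviations of its mean.
z = (toy - toy.mean()) / toy.std()
trimmed = toy[(z.abs() <= 3).all(axis=1)]

# Step 4: log-transform the skewed column (log1p is an assumed choice; it is safe at zero).
transformed = trimmed.copy()
transformed["Monetary"] = np.log1p(transformed["Monetary"])

# Step 5: rescale every column to [0, 1], the same range as one-hot encoded columns.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(transformed),
    columns=transformed.columns,
    index=transformed.index,
)

print(len(toy), len(trimmed))
print(scaled.describe().loc[["min", "max"]])
```

Note that the order matters: trimming on raw values, as listed above, treats the long right tail of a skewed feature as outliers, whereas trimming after the log transform would keep more of those rows.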

#### Removing collinear features

In [5]:
# In the previous notebook I created a data frame that logged whether each pairwise feature
# correlation was 'high', 'mid', or 'low'. I'll use it as a reference later on to decide
# whether additional features should be dropped.

data = data.drop(['%_baskets_product_mailer', '%_baskets_product_displayed', 'Last_active_day'], axis=1)
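
For readers without the EDA notebook at hand, one way to flag candidate collinear pairs is sketched below. The helper `high_corr_pairs` and its `threshold=0.8` cut-off are hypothetical; the cut-offs behind the 'high', 'mid', and 'low' labels in the previous notebook are not stated here, so treat the value as an assumption.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return (col_a, col_b, |r|) for numeric column pairs with |r| >= threshold."""
    corr = frame.select_dtypes(include=np.number).corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if any survive stacking) fail the comparison and are dropped.
    strong = pairs[pairs >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in strong.items()]


# Toy data: x_doubled is an exact multiple of x, noise is only weakly related.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})
print(high_corr_pairs(toy))
```

For each flagged pair, one of the two columns is a candidate for `data.drop(...)`; which one to keep is a judgment call based on interpretability.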

#### Separating out categorical data and one-hot encoding it

In [6]:
cat = data.select_dtypes(exclude=np.number)

In [7]:
# This does not include certain columns which ARE categorical but are not stored as strings at this time.
# I need to include most frequent store, most/least frequent product, high volume product, and most frequent time.
cat

| | Age | Marital_status | Income | Homeowner_status | Household_comp | Household_size | Kids | Most_freq_dept | Least_freq_dept |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 65+ | A | 35-49K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | GROCERY | FLORAL |
| 7 | 45-54 | A | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | GROCERY | FLORAL |
| 8 | 25-34 | U | 25-34K | Unknown | 2 Adults Kids | 3 | 1 | GROCERY | DAIRY DELI |
| 13 | 25-34 | U | 75-99K | Homeowner | 2 Adults Kids | 4 | 2 | GROCERY | AUTOMOTIVE |
| 16 | 45-54 | B | 50-74K | Homeowner | Single Female | 1 | None/Unknown | GROCERY | SALAD BAR |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2494 | 35-44 | U | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | GROCERY | AUTOMOTIVE |
| 2496 | 45-54 | A | 75-99K | Homeowner | Unknown | 3 | 1 | GROCERY | FROZEN GROCERY |
| 2497 | 45-54 | U | 35-49K | Unknown | Single Male | 1 | None/Unknown | GROCERY | RESTAURANT |
| 2498 | 25-34 | U | 50-74K | Homeowner | 2 Adults No Kids | 2 | None/Unknown | GROCERY | COSMETICS |
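
A possible way to pull in the numerically coded categoricals before encoding is sketched below on a toy frame. The rows copy a few values from the outputs above, but the frame itself, the cast through `int`, and the use of `pd.get_dummies` are assumptions for illustration; the real columns may contain missing values, in which case the `int` cast would fail and would need handling first.

```python
import pandas as pd

# Toy rows mirroring a few columns from the outputs above.
toy = pd.DataFrame(
    {
        "Age": ["65+", "45-54", "25-34"],
        "Most_freq_dept": ["GROCERY", "GROCERY", "GROCERY"],
        "Most_freq_store": [436.0, 359.0, 321.0],  # numeric dtype, but really a store ID
        "Monetary": [4330.16, 3400.05, 5534.97],
    },
    index=[1, 7, 8],
)

# Cast the ID column to strings so it is treated as a category, not a quantity.
# Going through int first drops the trailing ".0" from the labels.
toy["Most_freq_store"] = toy["Most_freq_store"].astype(int).astype(str)

# Select every non-numeric column, then one-hot encode it.
cat_cols = toy.select_dtypes(exclude="number").columns
encoded = pd.get_dummies(toy[cat_cols], dtype=int)

print(encoded.columns.tolist())
```

With a column such as a store ID, one-hot encoding can produce many sparse columns, so it may be worth checking the number of distinct values before encoding.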
