***
***
***

<br><h2>Session 8b | Principal Component Analysis</h2>
<h4>DAT-5303 | Machine Learning</h4>
Chase Kusterer - Faculty of Analytics<br>
Hult International Business School<br><br><br>

***
***
***

<h3>Part I: Introduction and Preparation</h3><br>

<strong>A Note on the Dataset</strong><br>
The dataset in this script represents the annual spending of a subset of the top customers for Apprentice Chef, Inc. The monetary units are unknown, and the demographic information related to each client is as follows:<br><br><br>
<u>Channel</u><br>

1. Online
2. Mobile App

<br>
<u>Region</u><br>

1. Alameda
2. San Francisco
3. Contra Costa

<br><br>
Run the following code to import necessary packages, load data, and set display options. 

In [None]:
########################################
# importing packages
########################################
import pandas            as pd                   # data science essentials
import matplotlib.pyplot as plt                  # fundamental data visualization
import seaborn           as sns                  # enhanced visualization
from sklearn.preprocessing import StandardScaler # standard scaler
from sklearn.decomposition import PCA            # pca


########################################
# loading data and setting display options
########################################
# loading data
customers_df = pd.read_excel('top_customers_subset.xlsx')


# setting print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)

***
***

<br>
<strong>User-Defined Functions</strong><br>
Run the following code to load the user-defined functions used throughout this Notebook.

In [None]:
########################################
# scree_plot
########################################
def scree_plot(pca_object, export = False):
    """
    Builds a scree plot from a fitted PCA object.

    pca_object : fitted PCA object
    export     : bool, saves the plot as a .png file if True
    """

    # setting plot size
    fig, ax = plt.subplots(figsize=(10, 8))

    # numbering components from 1 to match the "PC 1", "PC 2", ... labels
    features = range(1, pca_object.n_components_ + 1)


    # developing a scree plot
    plt.plot(features,
             pca_object.explained_variance_ratio_,
             linewidth = 2,
             marker = 'o',
             markersize = 10,
             markeredgecolor = 'black',
             markerfacecolor = 'grey')


    # setting more plot options
    plt.title('Scree Plot')
    plt.xlabel('Principal Component')
    plt.ylabel('Explained Variance Ratio')
    plt.xticks(features)

    if export:

        # exporting the plot
        plt.savefig('top_customers_correlation_scree_plot.png')

    # displaying the plot
    plt.show()

***
***

<br>
<strong>Challenge 1</strong><br>
Write code to check information about non-missing values and data types for each column.

In [None]:
# checking information about each column
_____

In [None]:
# checking information about each column
customers_df.info()

***
***

<br>
<strong>Challenge 2</strong><br>
Write code to create a summary of descriptive statistics for each column, rounded to two decimal places.

In [None]:
# summary of descriptive statistics


In [None]:
# summary of descriptive statistics
customers_df.describe().round(2)

***
***

<br>
<strong>Challenge 3</strong><br>
Write code to print the value counts for channel and region.

In [None]:
# value counts for channel



print("\n\n")



# value counts for region




In [None]:
# value counts for channel
print(customers_df['Channel'].value_counts())


print("\n\n")


# value counts for region
print(customers_df['Region'].value_counts())

***
***

<br>
<strong>Challenge 4</strong><br>
Write code to display the first ten rows of <strong>customers_df</strong>.

In [None]:
# displaying first ten rows of the dataset


In [None]:
# displaying first ten rows of the dataset
customers_df.head(n = 10)

***
***

<br>
<strong>Datasets with Features for Different Purposes</strong><br>
Notice from the outputs above that the dataset contains demographic data (channel and region) as well as purchasing data (spending per category). In unsupervised learning, feature types such as these should not be used together in the same algorithm. Demographic data is extremely different from purchasing data, and combining the two would bias the results of an analysis. Instead, if a problem requires unsupervised learning and demographic data is present in the dataset, a best practice is to remove the demographic data before building an algorithm. Later, demographic data can be used to compare results.<br><br><br>
<strong>PCA and Scaling</strong><br>
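As with KNN, explanatory variables should be scaled before developing a principal component analysis algorithm. PCA looks for directions of maximum variance, so features measured on larger scales would otherwise dominate the principal components.<br><br><br>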
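To illustrate why scaling matters, the following sketch runs PCA on a small synthetic dataset (made up for this illustration, not the Apprentice Chef data) in which two unrelated features sit on very different scales. The names <strong>X_demo</strong>, <strong>pca_unscaled</strong>, and <strong>pca_scaled</strong> are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic data (hypothetical): two independent features on very different scales
rng = np.random.default_rng(seed = 802)
small_scale = rng.normal(loc = 0, scale = 1,    size = 500)
large_scale = rng.normal(loc = 0, scale = 1000, size = 500)
X_demo = np.column_stack([small_scale, large_scale])

# PCA on unscaled data: the large-scale feature dominates the first component
pca_unscaled = PCA(n_components = None, random_state = 802).fit(X_demo)

# PCA on standardized data: both features contribute comparably
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_scaled = PCA(n_components = None, random_state = 802).fit(X_demo_scaled)

print("Unscaled:", pca_unscaled.explained_variance_ratio_.round(3))
print("Scaled  :", pca_scaled.explained_variance_ratio_.round(3))
```

Without scaling, nearly all of the explained variance should land on the first component simply because one feature has larger units. After scaling, the two components should split the variance roughly evenly.<br><br>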
<strong>Challenge 5</strong><br>
Complete the following in the code below:<br>

* drop demographic data and store the result as purchase_behavior
* instantiate a StandardScaler() object
* fit the scaler object to purchase_behavior
* transform purchase_behavior using the scaler object

In [None]:
# scaling (standardizing) variables before correlation analysis and PCA

# dropping demographic information
purchase_behavior = customers_df._____(_____,
                                      axis = 1)


# INSTANTIATING a StandardScaler() object
scaler = _____


# FITTING the scaler with the data
scaler._____(_____)


# TRANSFORMING our data after fit
X_scaled = scaler._____(_____)


# converting scaled data into a DataFrame
purchases_scaled = pd.DataFrame(X_scaled)


# reattaching column names
purchases_scaled.columns = purchase_behavior.columns


# checking pre- and post-scaling variance
# (pd.np was removed from pandas; ddof = 0 matches numpy's default variance)
print(purchase_behavior.var(ddof = 0), '\n\n')
print(purchases_scaled.var(ddof = 0))

In [None]:
# scaling (standardizing) variables before correlation analysis and PCA

# dropping demographic information
purchase_behavior = customers_df.drop(['Channel', 'Region'],
                                      axis = 1)


# INSTANTIATING a StandardScaler() object
scaler = StandardScaler()


# FITTING the scaler with the data
scaler.fit(purchase_behavior)


# TRANSFORMING our data after fit
X_scaled = scaler.transform(purchase_behavior)


# converting scaled data into a DataFrame
purchases_scaled = pd.DataFrame(X_scaled)


# reattaching column names
purchases_scaled.columns = purchase_behavior.columns


# checking pre- and post-scaling variance
# (pd.np was removed from pandas; ddof = 0 matches numpy's default variance)
print(purchase_behavior.var(ddof = 0), '\n\n')
print(purchases_scaled.var(ddof = 0))

***
***

<br>
<h3>Part II: Exploratory Data Analysis</h3><br>
Run the following code to produce histograms for all features related to purchasing behavior.

In [None]:
# setting figure size
fig, ax = plt.subplots(figsize = (12, 8))


# initializing a counter
count = 0


# looping to create visualizations
for col in purchases_scaled:

    # condition to break
    if count == 6:
        break
    
    # increasing count
    count += 1
    
    # preparing histograms (histplot replaces the deprecated distplot)
    plt.subplot(2, 3, count)
    sns.histplot(x = purchases_scaled[col],
                 stat = 'density',
                 kde = True)
    
plt.tight_layout()
plt.savefig('purchases_scaled_plots.png')
plt.show()

***
***

<br>
<strong>Challenge 6</strong><br>
Fill in the blanks below to develop a correlation heatmap of the scaled purchasing features.

In [None]:
# setting plot size
fig, ax = plt.subplots(figsize = (8, 8))


# developing a correlation matrix object
df_corr = _____._____.round(2)


# creating a correlation heatmap
sns.heatmap(_____,
            cmap = 'coolwarm',
            square = True,
            annot = True)


# saving and displaying the heatmap
plt.savefig('top_customers_correlation_heatmap.png')
plt.show()

In [None]:
# setting plot size
fig, ax = plt.subplots(figsize = (8, 8))


# developing a correlation matrix object
df_corr = purchases_scaled.corr().round(2)


# creating a correlation heatmap
sns.heatmap(df_corr,
            cmap = 'coolwarm',
            square = True,
            annot = True)


# saving and displaying the heatmap
plt.savefig('top_customers_correlation_heatmap.png')
plt.show()

***
***

<br>
Notice that a few (Pearson) correlations have an absolute value above 0.50. Correlated features share variance, which makes the dataset a reasonable candidate for PCA: we may be able to explain a high degree of variance with a small number of principal components.<br><br>
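As a rough illustration of the link between correlation and explained variance, the sketch below compares two synthetic datasets (both made up for this example): one where three features follow a shared signal, and one where three features are independent. All variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed = 802)
n_obs = 1000

# hypothetical data: three features driven by one shared signal
shared_signal = rng.normal(size = n_obs)
X_correlated = np.column_stack([shared_signal + rng.normal(scale = 0.3, size = n_obs)
                                for _ in range(3)])

# hypothetical data: three features with no relationship to each other
X_independent = rng.normal(size = (n_obs, 3))

# explained variance ratios after standardizing each dataset
ratio_correlated = PCA().fit(
    StandardScaler().fit_transform(X_correlated)).explained_variance_ratio_
ratio_independent = PCA().fit(
    StandardScaler().fit_transform(X_independent)).explained_variance_ratio_

print("Correlated features :", ratio_correlated.round(3))
print("Independent features:", ratio_independent.round(3))
```

With strongly correlated features, the first component should capture most of the variance. With independent features, each component should explain roughly one third, so little is gained by dropping any of them.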

<h3>Part III: Principal Component Analysis</h3><br>
Principal component analysis is primarily conducted in three situations:<br>

<u>Correlated Explanatory Variables</u><br>
Model building with correlated explanatory variables (i.e. <a href="https://en.wikipedia.org/wiki/Multicollinearity">multicollinearity</a>) is a violation of one of the key assumptions of generalized linear models.<br><br>

<u>Dimensionality Reduction</u><br>
This is commonly conducted when a dataset has a large number of explanatory variables (e.g., every unique click a user has made on a website). Techniques like PCA allow features to be transformed into principal components, (potentially) reducing the number of features needed to explain a high degree of variance.<br><br>

<u>Latent Trait Exploration</u><br>
Understanding factors that cannot be measured directly by studying them indirectly through measurable constructs.<br><br>
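For the first situation, it may help to see that principal component scores are uncorrelated with one another even when the original features are not. The following is a minimal sketch on synthetic data (hypothetical, not the course dataset), where <strong>x2</strong> is built as a noisy copy of <strong>x1</strong>.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed = 802)

# hypothetical data: x2 is nearly a copy of x1, so the two are highly correlated
x1 = rng.normal(size = 500)
x2 = x1 + rng.normal(scale = 0.2, size = 500)
x3 = rng.normal(size = 500)
X_demo = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))

# correlation between the original features
feature_corr = np.corrcoef(X_demo, rowvar = False)

# correlation between the principal component scores
scores = PCA(n_components = None, random_state = 802).fit_transform(X_demo)
score_corr = np.corrcoef(scores, rowvar = False)

print("Feature correlations:\n", feature_corr.round(2), "\n")
print("Component correlations:\n", score_corr.round(2))
```

The second matrix should be (numerically) an identity matrix, which is why principal components are sometimes used in place of correlated explanatory variables.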

<strong>Challenge 7</strong><br>
Complete the code to instantiate, fit, and transform a PCA model with no limits to its number of principal components. Make sure to use the scaled dataset for this task.

In [None]:
# INSTANTIATING a PCA object with no limit to principal components
pca = _____(n_components = None,
            random_state = 802)


# FITTING and TRANSFORMING the scaled data
customer_pca = _____._____(_____)


# comparing dimensions of each array
print("Original shape:", X_scaled.shape)
print("PCA shape     :",  customer_pca.shape)

In [None]:
# INSTANTIATING a PCA object with no limit to principal components
pca = PCA(n_components = None,
          random_state = 802)


# FITTING and TRANSFORMING the scaled data
customer_pca = pca.fit_transform(X_scaled)


# comparing dimensions of each array
print("Original shape:", X_scaled.shape)
print("PCA shape     :",  customer_pca.shape)

***
***

<br>
<h3>Part IV: Evaluating PCA Algorithms</h3><br>
As can be observed from above, the shape of the data did not change. However, the original array contains features, whereas the new array contains principal components. Before analyzing the factor loadings of each principal component, it is important to check each component's explained variance ratio. Also note that, because no limit was placed on the number of components, the explained variance ratios should sum to 1.0.<br><br><br>
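To make the explained variance ratio less of a black box, the sketch below recomputes it by hand on synthetic data (hypothetical, not the course dataset): each ratio is an eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The names <strong>manual_ratio</strong> and <strong>sklearn_ratio</strong> are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed = 802)

# hypothetical data: 200 observations of 4 correlated features
X_raw = rng.normal(size = (200, 4)) @ rng.normal(size = (4, 4))
X_demo = StandardScaler().fit_transform(X_raw)

# eigenvalues of the covariance matrix, sorted from largest to smallest
cov_matrix = np.cov(X_demo, rowvar = False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov_matrix))[::-1]

# each eigenvalue's share of the total variance
manual_ratio = eigenvalues / eigenvalues.sum()

# the same quantity as reported by scikit-learn
sklearn_ratio = PCA(n_components = None).fit(X_demo).explained_variance_ratio_

print("Manual :", manual_ratio.round(3))
print("sklearn:", sklearn_ratio.round(3))
```

Since each ratio is a share of the same total, the ratios add up to 1.0 whenever every component is kept.<br><br>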
<strong>Challenge 8</strong><br>
Write code to loop over each principal component, printing its component number as well as its <strong>explained_variance_ratio_</strong>.

In [None]:
# component number counter
component_number = 0

# looping over each principal component
_____ variance _____ pca._____:
    component_number += _____
    
    print(f"PC {_____} : {_____.round(3)}")

In [None]:
# component number counter
component_number = 0


# looping over each principal component
for variance in pca.explained_variance_ratio_:
    component_number += 1
    print(f"PC {component_number} : {variance.round(3)}")

***
***

<br>
<strong>Challenge 9</strong><br>
Write code to print the sum of all explained variance ratios.

In [None]:
# printing the sum of all explained variance ratios
_____

In [None]:
# printing the sum of all explained variance ratios
print(pca.explained_variance_ratio_.sum())

***
***

<br>
<h3>Scree Plots</h3><br>
One useful tool to visualize the explained variance of each principal component is the scree plot. Our goal in analyzing this plot is to look for a point where there is a drop in the marginal return of explained variance. In other words, we are looking for an "elbow" in the plot, where the line connecting each principal component becomes less steep.<br><br><br>
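A numeric complement to reading the elbow by eye is the cumulative explained variance. The sketch below, on synthetic data (hypothetical), finds the smallest number of components whose cumulative ratio exceeds a cutoff. The 0.80 cutoff is an arbitrary choice for illustration, not a rule from this course.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed = 802)

# hypothetical data: six features built from two underlying signals plus noise
signals = rng.normal(size = (300, 2))
X_raw = signals @ rng.normal(size = (2, 6)) + rng.normal(scale = 0.3, size = (300, 6))
X_demo = StandardScaler().fit_transform(X_raw)

# cumulative explained variance across all components
pca_full = PCA(n_components = None, random_state = 802).fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components whose cumulative ratio exceeds the cutoff
threshold = 0.80   # illustrative cutoff (an assumption)
n_needed = int(np.argmax(cumulative > threshold)) + 1

# scikit-learn applies the same rule when n_components is a float in (0, 1)
pca_80 = PCA(n_components = threshold, svd_solver = 'full').fit(X_demo)

print("Cumulative ratios :", cumulative.round(3))
print("Components needed :", n_needed)
print("sklearn kept      :", pca_80.n_components_)
```

A cutoff like this is best treated as a cross-check on the scree plot rather than a replacement for it, since the right number of components also depends on how interpretable each one is.<br><br>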
<strong>Challenge 10</strong><br>
Call the scree_plot function on the PCA object.

In [None]:
# calling the scree_plot function
_____

In [None]:
# calling the scree_plot function
scree_plot(pca_object = pca)

***
***

<br>
<h3>Part V: Interpreting Principal Components and Persona Development</h3><br>
Principal components are essentially "bundles" of various parts of the explanatory variables that were used when building an algorithm. Note that each principal component is not directly measurable, but can be measured indirectly by analyzing its <strong>factor loadings</strong>. In other words, we can interpret the meaning of each principal component by looking into which features are strongly correlated with it.<br><br>
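A note on terminology, offered as a sketch rather than a definitive treatment: scikit-learn's <strong>components_</strong> attribute holds unit-length eigenvectors, and some texts reserve the word "loadings" for those eigenvectors rescaled by each component's standard deviation. On standardized data, the rescaled values equal the correlations between features and components. The example below checks this on synthetic data (hypothetical); all variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed = 802)

# hypothetical data: 300 observations of 3 correlated features
X_raw = rng.normal(size = (300, 3)) @ rng.normal(size = (3, 3))
X_demo = StandardScaler().fit_transform(X_raw)

pca_demo = PCA(n_components = None, random_state = 802)
scores = pca_demo.fit_transform(X_demo)

# components_ holds unit-length eigenvectors (one row per component)
eigenvectors = pca_demo.components_

# rescaling by each component's standard deviation (and each feature's)
# puts the values on the correlation scale
feature_sd = X_demo.std(axis = 0, ddof = 1)
loadings = (eigenvectors.T * np.sqrt(pca_demo.explained_variance_)) / feature_sd[:, np.newaxis]

# direct check: correlation of every feature with every component's scores
n_features = X_demo.shape[1]
direct_corr = np.corrcoef(X_demo, scores, rowvar = False)[:n_features, n_features:]

print(loadings.round(3))
```

Either version can be used to name personas, since rescaling changes the magnitude of a column but not which features stand out within it.<br><br>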
Run the following code and analyze the resulting correlation map between the original features and the principal components.

In [None]:
# setting plot size
fig, ax = plt.subplots(figsize = (12, 12))


# developing a PC to feature heatmap
# (labels are passed to heatmap so they sit at the center of each cell)
sns.heatmap(pca.components_,
            cmap = 'coolwarm',
            square = True,
            annot = True,
            linewidths = 0.1,
            linecolor = 'black',
            xticklabels = list(purchases_scaled.columns),
            yticklabels = [f"PC {i + 1}" for i in range(pca.n_components_)])


# setting more plot options
plt.xticks(rotation = 60, ha = 'right')
plt.yticks(rotation = 0)

plt.xlabel("Feature")
plt.ylabel("Principal Component")


# displaying the plot
plt.show()

***
***

<br>
Each observation in the dataset is a customer of Apprentice Chef, Inc. Therefore, each principal component can be thought of as a <a href="https://www.lexico.com/en/definition/persona">persona</a> to aid in interpretation. Naming personas is subjective and often benefits from working with others.<br><br><br>
<strong>Challenge 11</strong><br>
Run the following code. With your team, analyze the factor loadings and develop a persona for each principal component. When finished, rename the columns of the table with your team's persona names.

In [None]:
# transposing pca components
factor_loadings_df = pd.DataFrame(pca.components_.T)


# naming rows as original features
factor_loadings_df = factor_loadings_df.set_index(purchases_scaled.columns)


# checking the result
print(factor_loadings_df)


# saving to Excel
factor_loadings_df.to_excel('customer_factor_loadings.xlsx')

***

In [None]:
# naming each principal component
factor_loadings_df._____ = _____


# checking the result
factor_loadings_df

In [None]:
# naming each principal component
factor_loadings_df.columns = ['Herbivores',
                              'Fancy Diners',
                              'Winers',
                              'Traditionalists',
                              'Vegans',
                              'Veggie Lovers']


# checking the result
factor_loadings_df

***
***

<br>
<strong>Customer-Level Personas</strong><br>
Earlier in this script, we instantiated a PCA object, fit it, and used it to transform the dataset's original features into principal components:<br><br>

~~~
# FITTING and TRANSFORMING the scaled data
customer_pca = pca.fit_transform(X_scaled)
~~~

<br>
Now that we have developed personas, we can analyze how strongly each customer fits each one. Run the following code to view each customer's score on every persona (principal component).
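For intuition about where these customer-level values come from, here is a minimal sketch on synthetic data (hypothetical): each score is the dot product of a centered observation with a component's weights. The names <strong>scores</strong> and <strong>manual_scores</strong> are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed = 802)

# hypothetical data: 100 observations of 4 features
X_demo = StandardScaler().fit_transform(rng.normal(size = (100, 4)))

pca_demo = PCA(n_components = None, random_state = 802)
scores = pca_demo.fit_transform(X_demo)

# projecting the centered data onto each component by hand
manual_scores = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

# scores are centered at zero, and their variance matches explained_variance_
print("Mean of scores    :", scores.mean(axis = 0).round(3))
print("Variance of scores:", scores.var(axis = 0, ddof = 1).round(3))
print("explained_variance:", pca_demo.explained_variance_.round(3))
```

A large positive score means a customer's spending pattern lines up with that component's weights, and a large negative score means it runs opposite to them.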

In [None]:
# converting into a DataFrame 
customer_pca = pd.DataFrame(customer_pca)


# renaming columns
customer_pca.columns = factor_loadings_df.columns


# checking results
customer_pca

***
***

<br>
Digging deeper into the DataFrame above can unearth key findings and market opportunities. <strong>This is something expected of you on your final.</strong> As an example, if we were exploring the market potential for customers who score above 1.0 on the Vegans persona, we could do so through subsetting, as in the following code. Note that component scores are centered at zero but are not rescaled to unit variance, so a score of 1.0 is not exactly one standard deviation. Try this on other personas and enjoy the exploration :)

In [None]:
customer_pca.loc[customer_pca['Vegans'] > 1.0, 'Vegans']

***
***

<br>
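One possible next step, following the earlier note that demographic data can be used to compare results, is to reattach the demographic columns to the persona scores and summarize by group. The sketch below uses small made-up tables as stand-ins for <strong>customer_pca</strong> and the demographic columns of <strong>customers_df</strong>; the persona names and values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed = 802)

# hypothetical stand-in for persona scores (one row per customer)
demo_scores = pd.DataFrame(rng.normal(size = (8, 2)),
                           columns = ['Persona A', 'Persona B'])

# hypothetical stand-in for the demographic columns set aside earlier
demo_demographics = pd.DataFrame({'Channel': [1, 2, 1, 2, 1, 2, 1, 2],
                                  'Region' : [1, 1, 2, 2, 3, 3, 1, 2]})

# reattaching demographics by row position (both share the default index)
demo_combined = pd.concat([demo_demographics, demo_scores], axis = 1)

# average persona score per channel
channel_means = demo_combined.groupby('Channel')[['Persona A', 'Persona B']].mean()

print(channel_means.round(2))
```

This only works if the rows are still in their original order, which is why the demographic columns are joined by position before any filtering or sorting takes place.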

~~~
 ,--.-,,-,--,             .-._            _,---.                                        
/==/  /|=|  |.--.-. .-.-./==/ \  .-._ _.='.'-,  \  .-.,.---.  ,--.-.  .-,--.            
|==|_ ||=|, /==/ -|/=/  ||==|, \/ /, /==.'-     / /==/  `   \/==/- / /=/_ /             
|==| ,|/=| _|==| ,||=| -||==|-  \|  /==/ -   .-' |==|-, .=., \==\, \/=/. /              
|==|- `-' _ |==|- | =/  ||==| ,  | -|==|_   /_,-.|==|   '='  /\==\  \/ -/               
|==|  _     |==|,  \/ - ||==| -   _ |==|  , \_.' )==|- ,   .'  |==|  ,_/                
|==|   .-. ,\==|-   ,   /|==|  /\ , \==\-  ,    (|==|_  . ,'.  \==\-, /                 
/==/, //=/  /==/ , _  .' /==/, | |- |/==/ _  ,  //==/  /\ ,  ) /==/._/                  
`--`-' `-`--`--`..---'   `--`./  `--``--`------' `--`-`--`--'  `--`-`                   
     _,---.     _,.---._                                                                
  .-`.' ,  \  ,-.' , -  `.   .-.,.---.                                                  
 /==/_  _.-' /==/_,  ,  - \ /==/  `   \                                                 
/==/-  '..-.|==|   .=.     |==|-, .=., |                                                
|==|_ ,    /|==|_ : ;=:  - |==|   '='  /                                                
|==|   .--' |==| , '='     |==|- ,   .'                                                 
|==|-  |     \==\ -    ,_ /|==|_  . ,'.                                                 
/==/   \      '.='. -   .' /==/  /\ ,  )                                                
`--`---'        `--`--''   `--`-`--`--'                                                 
   ,-,--.                _,.----.    _,.----.       ,----.    ,-,--.    ,-,--.   .=-.-. 
 ,-.'-  _\ .--.-. .-.-..' .' -   \ .' .' -   \   ,-.--` , \ ,-.'-  _\ ,-.'-  _\ /==/_ / 
/==/_ ,_.'/==/ -|/=/  /==/  ,  ,-'/==/  ,  ,-'  |==|-  _.-`/==/_ ,_.'/==/_ ,_.'|==|, |  
\==\  \   |==| ,||=| -|==|-   |  .|==|-   |  .  |==|   `.-.\==\  \   \==\  \   |==|  |  
 \==\ -\  |==|- | =/  |==|_   `-' \==|_   `-' \/==/_ ,    / \==\ -\   \==\ -\  /==/. /  
 _\==\ ,\ |==|,  \/ - |==|   _  , |==|   _  , ||==|    .-'  _\==\ ,\  _\==\ ,\ `--`-`   
/==/\/ _ ||==|-   ,   |==\.       |==\.       /|==|_  ,`-._/==/\/ _ |/==/\/ _ | .=.     
\==\ - , //==/ , _  .' `-.`.___.-' `-.`.___.-' /==/ ,     /\==\ - , /\==\ - , /:=; :    
 `--`---' `--`..---'                           `--`-----``  `--`---'  `--`---'  `=`                                                                      
~~~