**Dimensionality Reduction for Customer Profile Analysis**


In this notebook, I will work with dimensionality reduction methods on a dataset for customer personality analysis. I will use clustering in conjunction with different dimensionality reduction techniques to visualize the results.

Customer Personality Analysis is an analysis of various segments of a company's customers. This analysis helps businesses better understand their clients and facilitates the process of tailoring products to meet the specific needs, behaviors, and interests of different customer types.

Customer profile analysis enables businesses to adjust their products based on the target audience, divided into different segments. For example, instead of spending money marketing a new product to all customers in the company's database, the business can analyze which customer segment is most likely to purchase the product and then focus its marketing efforts solely on that segment.

Input Data

I have been provided with a dataset containing the following attributes:

User Characteristics:
ID: Unique identifier for the customer
Year_Birth: Year of birth of the customer
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Annual household income of the customer
Kidhome: Number of children in the customer's household
Teenhome: Number of teenagers in the customer's household
Dt_Customer: Date when the customer registered with the company
Recency: Number of days since the customer's last purchase
Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products:
MntWines: Amount spent on wine in the last 2 years
MntFruits: Amount spent on fruits in the last 2 years
MntMeatProducts: Amount spent on meat products in the last 2 years
MntFishProducts: Amount spent on fish products in the last 2 years
MntSweetProducts: Amount spent on sweets in the last 2 years
MntGoldProds: Amount spent on gold in the last 2 years

Promotions:
NumDealsPurchases: Number of purchases made using discounts
AcceptedCmp1: 1 if the customer accepted the offer in the first campaign, 0 otherwise
AcceptedCmp2: 1 if the customer accepted the offer in the second campaign, 0 otherwise
AcceptedCmp3: 1 if the customer accepted the offer in the third campaign, 0 otherwise
AcceptedCmp4: 1 if the customer accepted the offer in the fourth campaign, 0 otherwise
AcceptedCmp5: 1 if the customer accepted the offer in the fifth campaign, 0 otherwise
Response: 1 if the customer accepted the offer in the last campaign, 0 otherwise

Interaction with the Company:
NumWebPurchases: Number of purchases made through the company website
NumCatalogPurchases: Number of purchases made via catalog
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to the company's website in the last month

Goals
Preprocess the data: I will clean and prepare the dataset for analysis.
Apply dimensionality reduction techniques: Will use methods such as PCA, t-SNE, or UMAP to reduce the dimensions of the data.
Clustering: Will perform clustering on the reduced dataset.
Visualization: Will visualize the results of the clustering using the reduced dimensions.
Analysis: Will analyze the results and determine if the clustering approach provides meaningful insights into customer segments.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/marketing_campaign.csv', delimiter=',')

In [3]:
# Display basic information about the data
print("Number of rows and columns:", df.shape)
print("\nData types of each column:\n", df.dtypes)
print("\nNumber of missing values in each column:\n", df.isnull().sum())

Number of rows and columns: (2240, 29)

Data types of each column:
 ID                       int64
Year_Birth               int64
Education               object
Marital_Status          object
Income                 float64
Kidhome                  int64
Teenhome                 int64
Dt_Customer             object
Recency                  int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Z_CostContact            int64
Z_Revenue                int64
Response                 int64
dtype: object

Number of missing 

In [4]:
# 2. Handling missing values
df['Income_not_filled'] = df.Income.isna()
df.Income = df.Income.fillna(-1)

# 3. Processing the registration date
df.Dt_Customer = pd.to_datetime(df.Dt_Customer, format='%d/%m/%Y', dayfirst=True)
today = df.Dt_Customer.max()
df['days_lifetime'] = (today - df.Dt_Customer).dt.days
df['years_customer'] = df.Year_Birth.apply(lambda x: today.year - x)

# 4. Categorizing education level
df_education = pd.get_dummies(df.Education, prefix='education').astype(int)
df = pd.concat([df, df_education], axis=1)

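# 5. Cleaning marital status
# Note: Series.map returns NaN for every status that is not a key of the dictionary,
# so only the remapped values ('Single', 'Else') survive in Marital_Status_clean.
marital_status_map = {'Alone': 'Single', 'Absurd': 'Else', 'YOLO': 'Else'}
df['Marital_Status_clean'] = df.Marital_Status.map(marital_status_map)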
df_ms = pd.get_dummies(df.Marital_Status_clean, prefix='marital').astype(int)
df = pd.concat([df, df_ms], axis=1)

# 6. Formatting income and removing outliers
df.Income = df.Income.astype(int)
df = df[df.Income != 666666]

# 7. Creating the final dataset
X = df.drop(['ID', 'Dt_Customer', 'Education', 'Marital_Status', 'Marital_Status_clean'], axis=1)
X.reset_index(drop=True, inplace=True)

In [5]:
df.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,days_lifetime,years_customer,education_2n Cycle,education_Basic,education_Graduation,education_Master,education_PhD,Marital_Status_clean,marital_Else,marital_Single
0,5524,1957,Graduation,Single,58138,0,0,2012-09-04,58,635,...,663,57,0,0,1,0,0,,0,0
1,2174,1954,Graduation,Single,46344,1,1,2014-03-08,38,11,...,113,60,0,0,1,0,0,,0,0
2,4141,1965,Graduation,Together,71613,0,0,2013-08-21,26,426,...,312,49,0,0,1,0,0,,0,0
3,6182,1984,Graduation,Together,26646,1,0,2014-02-10,26,11,...,139,30,0,0,1,0,0,,0,0
4,5324,1981,PhD,Married,58293,1,0,2014-01-19,94,173,...,161,33,0,0,0,0,1,,0,0
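
The empty Marital_Status_clean values in the table above follow from how `Series.map` treats values that are missing from the mapping dictionary. The sketch below uses a small illustrative Series (toy values, not the real column) to compare `map` with `replace`, which keeps unmapped values unchanged. Whether `replace` is the intended behaviour for this notebook is an assumption.

```python
import pandas as pd

# Toy statuses for illustration only (not the real Marital_Status column)
toy_status = pd.Series(['Single', 'Alone', 'Married', 'YOLO'])
toy_map = {'Alone': 'Single', 'Absurd': 'Else', 'YOLO': 'Else'}

# map: values that are not dictionary keys become NaN
mapped = toy_status.map(toy_map)

# replace: values that are not dictionary keys are kept as they are
replaced = toy_status.replace(toy_map)

print(mapped.tolist())
print(replaced.tolist())
```

With `replace`, the one-hot encoding step would also produce columns for the untouched statuses such as Married or Together.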


**Performing Clustering and Dimensionality Reduction for Visualizing Results**

Now my task is to cluster customers and visualize the clustering results using Principal Component Analysis (PCA) for dimensionality reduction.

Let's limit ourselves to the following characteristics for clustering this time:

Income: Annual household income of the client
Recency: Number of days since the client's last purchase
NumStorePurchases: Number of purchases made directly in stores
NumDealsPurchases: Number of purchases made using discounts
days_lifetime: Number of days since the client registered with the company
years_customer: Age of the client
NumWebVisitsMonth: Number of visits to the company's website in the last month

Select only these features from the dataset X.

Data Normalization: Use the MinMaxScaler method to normalize the values of the selected features.

Clustering: I will perform customer clustering using the KMeans method with three clusters.

Dimensionality Reduction: Will use Principal Component Analysis (PCA) to reduce the dimensionality of the data to three components.

Visualizing Results: Using Plotly Express, I will create a 3D scatter plot of the distribution of customers in the space of the three principal components, where the colors indicate the clusters.

Next, I will interpret the results of the visualization and dimensionality reduction in more detail.

In [6]:
import pandas as pd

# Selecting only the necessary columns for clustering
X = df[['Income', 'Recency', 'NumStorePurchases', 'NumDealsPurchases', 'days_lifetime', 'years_customer', 'NumWebVisitsMonth']]

In [7]:
from sklearn.preprocessing import MinMaxScaler

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Normalize the data
X_scaled = scaler.fit_transform(X)
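
As a quick check of what the normalization step does, the sketch below applies MinMaxScaler to a tiny made-up array (the numbers are illustrative, not taken from the dataset) and compares the result with the formula (x - min) / (max - min) computed per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: first column on an income-like scale, second on a count-like scale
toy = np.array([[20000.0, 1.0],
                [50000.0, 5.0],
                [80000.0, 9.0]])

scaled_toy = MinMaxScaler().fit_transform(toy)

# Same transformation written out by hand, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual_toy = (toy - col_min) / (col_max - col_min)

print(scaled_toy)
print(np.allclose(scaled_toy, manual_toy))
```

After scaling, every feature lies in [0, 1], so a feature with large raw values such as Income no longer dominates the Euclidean distances that KMeans relies on.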

In [8]:
from sklearn.cluster import KMeans

# Initialize KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)

# Perform clustering
clusters = kmeans.fit_predict(X_scaled)

# Add the obtained clusters to the DataFrame.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when a column is set on a slice of df.
X = X.assign(Cluster=clusters)


In [9]:
from sklearn.decomposition import PCA

# Initialize PCA with 3 components
pca = PCA(n_components=3)

# Apply PCA to the normalized data
X_pca = pca.fit_transform(X_scaled)

# Convert PCA components into a DataFrame for convenience
X_pca_df = pd.DataFrame(X_pca, columns=['PCA1', 'PCA2', 'PCA3'])
X_pca_df['Cluster'] = clusters

In [10]:
import plotly.express as px

# Create a 3D scatter plot
fig = px.scatter_3d(X_pca_df, x='PCA1', y='PCA2', z='PCA3', color='Cluster', title='Customer Clustering in PCA Space')

# Display the plot
fig.show()

In the plot, the three clusters appear clearly separated in the space of the principal components, with only a few outliers.
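
A visual impression of cluster separation can be backed by a number. One common option is the silhouette score, which ranges from -1 to 1, with higher values meaning tighter, better separated clusters. The sketch below is only an illustration on synthetic blobs (`make_blobs` data, not the customer dataset), and `n_init=10` is an assumed setting. In the notebook, the same call could be applied to `X_scaled` and `clusters`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the scaled customer features (7 features, 3 groups)
blobs, _ = make_blobs(n_samples=300, centers=3, n_features=7, random_state=42)
blobs_scaled = MinMaxScaler().fit_transform(blobs)

# Same clustering setup as in the notebook, on the synthetic data
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo_labels = demo_kmeans.fit_predict(blobs_scaled)

# Mean silhouette coefficient over all points
demo_score = silhouette_score(blobs_scaled, demo_labels)
print(f"Silhouette score: {demo_score:.3f}")
```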

**Analyzing Dimensionality Reduction Results**

Now I will perform the following steps:

Calculating the Explained Variance Ratio: Determine the proportion of total variation in the data explained by each of the three principal components (PC1, PC2, PC3) using the explained_variance_ratio_ attribute of the PCA object. Display the result.

Calculating the Cumulative Explained Variance: Compute the cumulative explained variance for the three principal components to understand how much variation in the data is explained by the first few components.

In [11]:
# Explained variance ratio for each component
explained_variance_ratio = pca.explained_variance_ratio_

# Displaying the explained variance ratio for each component
for i, variance in enumerate(explained_variance_ratio):
    print(f"Explained variance ratio for component PC{i + 1}: {variance:.4f}")

# Calculating the cumulative explained variance
cumulative_explained_variance = explained_variance_ratio.cumsum()

# Displaying the cumulative explained variance
for i, cumulative_variance in enumerate(cumulative_explained_variance):
    print(f"Cumulative explained variance up to component PC{i + 1}: {cumulative_variance:.4f}")

Explained variance ratio for component PC1: 0.3020
Explained variance ratio for component PC2: 0.2867
Explained variance ratio for component PC3: 0.2512
Cumulative explained variance up to component PC1: 0.3020
Cumulative explained variance up to component PC2: 0.5887
Cumulative explained variance up to component PC3: 0.8399


The first three components (PC1, PC2, PC3) together explain 83.99% of the variation in the data. This is a good result for PCA: the dimensionality drops from seven features to three components while most of the information is retained. PC1 explains the largest portion of the variance (30.20%), but PC2 (28.67%) and PC3 (25.12%) contribute almost as much, so no single component dominates. The remaining 16.01% of the variance, which these three components do not explain, may contain less significant or noisy information.
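
To make the meaning of the explained variance ratio concrete, the sketch below recomputes it by hand on random synthetic data (not the customer dataset): each ratio is an eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The per-column scales are arbitrary assumptions chosen only to give the components different sizes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 7 independent features with different spreads (arbitrary choice)
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(200, 7)) * np.array([3.0, 2.5, 2.0, 1.0, 0.5, 0.3, 0.1])

demo_pca = PCA(n_components=3).fit(demo_data)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(demo_data, rowvar=False))[::-1]

# Share of the total variance carried by the three largest eigenvalues
manual_ratio = eigenvalues[:3] / eigenvalues.sum()

print(demo_pca.explained_variance_ratio_)
print(manual_ratio)
```

The cumulative values are the running sum of these ratios, which is why they can never exceed 1.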

**Interpretation of "Loadings"**
I will continue interpreting the results of PCA and get acquainted with the new concept of loadings, which will help me find the relationship between the principal components and the original features in the dataset.

I have now built a visualization of cluster points in the space of three principal components. However, I want to find the relationship between the principal components and the original features. To understand which initial characteristics of the data have the most significant influence on these principal components, I can use the components_ attribute of the fitted PCA object.

What is pca.components_?
pca.components_ is an array that contains the coefficients (or "weights") showing the contribution of each original feature to each of the principal components. Each row corresponds to one principal component and each column to one original feature. These coefficients are also known as "loadings".

Loadings reflect the importance of each variable (feature) to the corresponding principal component. They indicate how the variables combine to form new, reduced dimensions.

If a coefficient has a high absolute value (either positive or negative), it indicates that the corresponding variable has a strong influence on the principal component.

My task is to calculate the "loadings" for each principal component and interpret the results.

Calculate Loadings for Components: Use the components_ attribute of the PCA object to create a DataFrame that displays the contribution of each original feature to each principal component.

Interpret the Results: Output the values of the "loadings" and analyze which features have the most significant impact on each principal component.
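
Before computing the loadings, it may help to see how they are used. In scikit-learn, each principal component score is the dot product of the centered data with one row of `components_`. The sketch below checks this on random synthetic data (not the customer dataset) and also confirms that every row of loadings has unit length.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a scaled feature matrix with 7 features
rng = np.random.default_rng(1)
demo_features = rng.random((100, 7))

demo_pca = PCA(n_components=3)
demo_scores = demo_pca.fit_transform(demo_features)

# Scores rebuilt by hand: center the data, then project onto the loadings
manual_scores = (demo_features - demo_pca.mean_) @ demo_pca.components_.T

# Length of each loading vector (one per component)
row_norms = np.linalg.norm(demo_pca.components_, axis=1)

print(np.allclose(demo_scores, manual_scores))
print(row_norms)
```

Because each loading vector has unit length, the squared loadings within a component sum to 1, which makes the values comparable across features. The sign of a whole component is arbitrary, so only relative signs within a component carry meaning.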

In [12]:
# Exclude the 'Cluster' column before creating the DataFrame for loadings
X_without_cluster = X.drop(columns=['Cluster'])

# Create a DataFrame to display the "loadings"
loadings = pd.DataFrame(pca.components_, columns=X_without_cluster.columns)

# Add indices for the components
loadings.index = [f'PC{i+1}' for i in range(pca.n_components_)]

# Output the result
print(loadings)

       Income   Recency  NumStorePurchases  NumDealsPurchases  days_lifetime  \
PC1  0.063557  0.475786           0.284282           0.103049       0.821912   
PC2 -0.047082  0.878876          -0.187195          -0.059324      -0.431061   
PC3  0.373826  0.029910           0.816668          -0.050687      -0.305490   

     years_customer  NumWebVisitsMonth  
PC1        0.012658           0.049530  
PC2        0.001022          -0.031377  
PC3        0.080411          -0.300089  


PC1 (First Principal Component):
Income: 0.063557
The influence of annual income on PC1 is small but positive.

Recency: 0.475786
The influence of the number of days since the last purchase on PC1 is moderate and positive.

NumStorePurchases: 0.284282
The influence of the number of store purchases is moderate and positive.

NumDealsPurchases: 0.103049
The influence of the number of purchases made with discounts is small and positive.

days_lifetime: 0.821912
There is a large positive influence of the number of days since the customer registered.

years_customer: 0.012658
The influence of the customer's age on PC1 is very small and positive.

NumWebVisitsMonth: 0.049530
The influence of the number of website visits is also small and positive.

Interpretation: PC1 is primarily determined by the value of days_lifetime, which has the highest positive loading. This indicates that this component may reflect the overall duration the customer has been registered with the company, with less emphasis on the other features.

PC2 (Second Principal Component):
Income: -0.047082
There is a small negative influence of income on PC2.

Recency: 0.878876
There is a large positive influence of the number of days since the last purchase. This indicates that this component is predominantly defined by how recently the customer made a purchase.

NumStorePurchases: -0.187195
There is a small negative influence of the number of store purchases.

NumDealsPurchases: -0.059324
There is a small negative influence of the number of purchases made with discounts.

days_lifetime: -0.431061
There is a moderate negative influence of the number of days since registration.

years_customer: 0.001022
There is a very small positive influence of the customer's age.

NumWebVisitsMonth: -0.031377
There is a small negative influence of the number of website visits.

Interpretation: PC2 is mainly determined by Recency, which means this component reflects how recently the customer made a purchase: a higher PC2 score corresponds to a longer time since the last purchase. This can be useful for analyzing customer activity.

PC3 (Third Principal Component):
Income: 0.373826
There is a moderate positive influence of income on PC3.

Recency: 0.029910
There is a very small positive influence of the number of days since the last purchase.

NumStorePurchases: 0.816668
There is a large positive influence of the number of store purchases.

NumDealsPurchases: -0.050687
There is a small negative influence of the number of purchases made with discounts.

days_lifetime: -0.305490
There is a moderate negative influence of the number of days since registration.

years_customer: 0.080411
There is a small positive influence of the customer's age.

NumWebVisitsMonth: -0.300089
There is a moderate negative influence of the number of website visits.

Interpretation: PC3 is mainly determined by NumStorePurchases, supported by Income, and it is opposed by NumWebVisitsMonth and days_lifetime. This component therefore contrasts higher-income customers who buy in stores with customers who visit the website more often.
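
The feature-by-feature reading above can be condensed programmatically. The sketch below rebuilds the loadings table from the values printed in the output of cell 12 (copied by hand, so `loadings_printed` and `feature_names` are illustrative names rather than notebook variables) and picks the feature with the largest absolute loading for each component.

```python
import pandas as pd

feature_names = ['Income', 'Recency', 'NumStorePurchases', 'NumDealsPurchases',
                 'days_lifetime', 'years_customer', 'NumWebVisitsMonth']

# Values copied from the printed loadings table
loadings_printed = pd.DataFrame(
    [[0.063557, 0.475786, 0.284282, 0.103049, 0.821912, 0.012658, 0.049530],
     [-0.047082, 0.878876, -0.187195, -0.059324, -0.431061, 0.001022, -0.031377],
     [0.373826, 0.029910, 0.816668, -0.050687, -0.305490, 0.080411, -0.300089]],
    index=['PC1', 'PC2', 'PC3'],
    columns=feature_names,
)

# Feature with the largest absolute loading per component
dominant = loadings_printed.abs().idxmax(axis=1)
print(dominant)

# Three strongest features per component, by absolute loading
for pc in loadings_printed.index:
    top3 = loadings_printed.loc[pc].abs().nlargest(3).index.tolist()
    print(pc, top3)
```

In the notebook itself, the same two calls could be applied directly to the `loadings` DataFrame.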

**Analyzing Loadings After Removing the Income Feature**

Let's analyze the loadings for the three principal components after removing the Income feature. This will help me understand how the importance of other features changed for each principal component when one of the key features (Income) was removed.

Steps for Conducting the Analysis:
Remove the Income Feature: Delete the Income feature from our dataset X and rerun PCA to obtain new loadings.

Calculate New Loadings: Compute the new loadings for the three principal components using the dataset without Income.

Analyze Feature Impact: Analyze which features have the greatest influence on each principal component after removing Income.

Examine Explained Variance: Review how much variance each principal component explains in the data without the Income feature.

In [13]:
# Removing the 'Income' feature
X_no_income = X.drop(columns=['Income'])

In [14]:
from sklearn.decomposition import PCA

# Initializing PCA
pca_no_income = PCA(n_components=3)

# Fitting PCA to the data without Income
pca_no_income.fit(X_no_income)

In [15]:
# Obtaining new loadings
loadings_no_income = pd.DataFrame(pca_no_income.components_, columns=X_no_income.columns)

# Adding indices for components
loadings_no_income.index = [f'PC{i+1}' for i in range(pca_no_income.n_components_)]

# Displaying the new loadings
print(loadings_no_income)

      Recency  NumStorePurchases  NumDealsPurchases  days_lifetime  \
PC1  0.003598           0.001780           0.002089       0.999984   
PC2  0.999676          -0.000195          -0.000403      -0.003574   
PC3 -0.010170           0.038806           0.010755       0.001356   

     years_customer  NumWebVisitsMonth   Cluster  
PC1       -0.001331           0.003271 -0.000020  
PC2        0.010089          -0.002416  0.022981  
PC3        0.998824          -0.024906  0.002000  


In [16]:
# Proportion of explained variance for the new components
explained_variance_ratio_no_income = pca_no_income.explained_variance_ratio_

# Cumulative proportion of explained variance
cumulative_variance_no_income = explained_variance_ratio_no_income.cumsum()

# Displaying results
print("Proportion of explained variance for the new components:")
for i, variance in enumerate(explained_variance_ratio_no_income):
    print(f"PC{i+1}: {variance:.4f}")

print("\nCumulative proportion of explained variance for each component:")
for i, cumulative_variance in enumerate(cumulative_variance_no_income):
    print(f"PC{i+1}: {cumulative_variance:.4f}")

Proportion of explained variance for the new components:
PC1: 0.9761
PC2: 0.0200
PC3: 0.0034

Cumulative proportion of explained variance for each component:
PC1: 0.9761
PC2: 0.9961
PC3: 0.9995


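Without Income, the principal components look very different: each one is dominated by a single feature (days_lifetime for PC1, Recency for PC2, years_customer for PC3), and PC1 alone explains 97.61% of the variance. However, this PCA was fitted on X_no_income, which is not scaled and still contains the Cluster column, whereas the earlier PCA used the MinMax-scaled features. On unscaled data, PCA is dominated by the features with the largest numeric range (here days_lifetime), so this change cannot be attributed to removing Income alone. It is essential to keep this in mind in further analysis and decision-making based on these results.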
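
The effect of feature scale on PCA can be illustrated in isolation. The sketch below uses independent uniform random columns whose ranges are only loosely modelled on the real features (the ranges are assumptions, and this is not the customer data), then compares the explained variance ratios with and without MinMax scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
n_rows = 500

# Synthetic columns with very different ranges (assumed, for illustration)
toy_features = np.column_stack([
    rng.uniform(0, 700, n_rows),  # range similar to a days-since-registration feature
    rng.uniform(0, 100, n_rows),  # range similar to a recency feature
    rng.uniform(18, 80, n_rows),  # range similar to an age feature
    rng.uniform(0, 13, n_rows),   # range similar to a purchase-count feature
])

# PCA on raw values: the widest column dominates the first component
ratio_raw = PCA(n_components=3).fit(toy_features).explained_variance_ratio_

# PCA on MinMax-scaled values: the variance is spread across components
toy_scaled = MinMaxScaler().fit_transform(toy_features)
ratio_scaled = PCA(n_components=3).fit(toy_scaled).explained_variance_ratio_

print("Raw:   ", ratio_raw.round(4))
print("Scaled:", ratio_scaled.round(4))
```

For a like-for-like comparison with the first PCA, the notebook's features without Income and without Cluster would need to be scaled before fitting `pca_no_income`.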

**Visualization of Clustering using t-SNE**

Now my task is to use the t-SNE method to visualize the results of clustering clients in a two-dimensional space. The t-SNE method helps to reduce the dimensionality of data while preserving local structures, making it effective for visualizing high-dimensional data. We will also be able to compare this method's results with PCA.

I will use the t-SNE method to reduce the dimensionality of the data to 2 dimensions. The input contains the features from the first task except Income, scaled with MinMaxScaler prior to dimensionality reduction.

I will create a new DataFrame with the coordinates obtained after applying t-SNE, and add the cluster labels to it.

I will build an interactive 2D scatter plot of client distribution, where different clusters are marked by color, and analyze the plot.

In [17]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
import plotly.express as px

# `X` holds the unscaled features plus the `Cluster` labels from KMeans.
# Here we select the features without `Income` and normalize them before applying t-SNE.
features = ['Recency', 'NumStorePurchases', 'NumDealsPurchases', 'days_lifetime', 'years_customer', 'NumWebVisitsMonth']

# Normalize the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X[features])

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

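# Create a DataFrame with the t-SNE results
df_tsne = pd.DataFrame(X_tsne, columns=['TSNE1', 'TSNE2'])
# Use the raw values: X keeps the index of df (one row was dropped earlier),
# so assigning the Series directly would misalign labels with the new 0..n-1 index.
df_tsne['Cluster'] = X['Cluster'].to_numpy()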

# Visualize the results
fig = px.scatter(df_tsne, x='TSNE1', y='TSNE2', color='Cluster',
                 title='t-SNE Clustering Visualization',
                 labels={'TSNE1': 't-SNE Component 1', 'TSNE2': 't-SNE Component 2'},
                 color_continuous_scale=px.colors.qualitative.Set1)

fig.show()
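
The layout produced by t-SNE depends noticeably on the `perplexity` parameter, which roughly controls how many neighbours each point takes into account. The sketch below runs t-SNE with a few perplexity values on random synthetic data (not the customer dataset). The values 5, 30, and 50 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled feature matrix (150 rows, 6 features)
rng = np.random.default_rng(0)
toy_points = rng.random((150, 6))

# One 2D embedding per perplexity value; perplexity must be below the number of samples
embeddings = {}
for perplexity in (5, 30, 50):
    demo_tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = demo_tsne.fit_transform(toy_points)

for perplexity, embedding in embeddings.items():
    print(perplexity, embedding.shape)
```

Plotting each embedding with the cluster labels as colors shows how stable the apparent separation is across settings.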

In [18]:
import pandas as pd

# Assume that X contains all features along with cluster labels.
# Group the data by clusters and calculate the mean for each feature
cluster_means = X.groupby('Cluster').mean()

# Output the results
print(cluster_means)

               Income    Recency  NumStorePurchases  NumDealsPurchases  \
Cluster                                                                  
0        44990.324561  21.352130           4.264411           2.327068   
1        69681.154622  51.163025          10.275630           2.273950   
2        44622.521277  73.878251           4.078014           2.356974   

         days_lifetime  years_customer  NumWebVisitsMonth  
Cluster                                                    
0           337.989975       43.987469           5.849624  
1           398.705882       46.610084           3.880672  
2           336.508274       45.346336           5.822695  


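In my opinion, the clusters are less clearly separated for t-SNE than for PCA; it may be necessary to adjust the t-SNE parameters. There are many overlaps and more outliers. According to the cluster means above, Clusters 0 and 2 are similar in most features and differ mainly in Recency.

Cluster 0 consists of recently active customers with the lowest Recency (about 21 days), a moderate income (about 45,000), few store purchases (about 4.3), and frequent website visits (about 5.8 per month).

Cluster 1 contains the customers with the highest income (about 69,700), the most store purchases (about 10.3), the highest number of days since registration (about 399), and the fewest website visits (about 3.9 per month).

Cluster 2 resembles Cluster 0 in income, store purchases, and website visits, but has the highest Recency (about 74 days) and the fewest days since registration (about 337): these customers have not purchased for a long time.

The average number of discount purchases is almost identical across the clusters (2.27 to 2.36), so the cluster means do not indicate a difference in sensitivity to discounts or promotions.

Online Activity:

Customers in Clusters 0 and 2 exhibit higher activity on the website (about 5.8 visits per month versus 3.9 in Cluster 1), which could be beneficial for planning online marketing campaigns.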