# Customer Segmentation with Machine Learning.

In the realm of data science, your mission is to unlock the potential hidden within a vast trove of e-commerce sales data using Python and the scikit-learn library. The business case at hand is to better understand the customer base of an e-commerce company, and as the custodian of data, you're entrusted with the task of transforming this information into actionable insights.
Your journey begins with data preparation, where you meticulously clean, format, and structure the raw data. This behind-the-scenes work may often go unnoticed, but it forms the bedrock of your mission's success, ensuring that the data is in a state that can be effectively analyzed.

The true artistry of your work comes to light in the segmentation phase. With Python and scikit-learn, you employ the K-means clustering algorithm to partition the customers into distinct segments. These segments aren't arbitrary divisions but rather the keys to understanding customer behavior, preferences, and needs. Your clustering model paints a vivid picture of the customer landscape, allowing the e-commerce company to tailor its strategies, products, and interactions to cater to each segment's unique characteristics.

But your journey doesn't end there. You recognize the importance of precision, and so you delve into hyperparameter tuning: you refit the clustering model across a range of cluster counts and compare the inertia of each fit to decide how many segments to keep. This step ensures that the segments aren't just loosely defined groups but defensible reflections of customer behavior.

Your dedication to precision results in a model that effectively and accurately segments the customer base. It equips the e-commerce company with the insights needed to make data-driven decisions, enhance customer satisfaction, boost sales, and optimize marketing efforts.

In this data-driven quest, you're the unsung hero, silently transforming raw data into actionable insights. While your work may often go unnoticed by the world, its impact reverberates within the e-commerce company. Your dedication to data and your ability to shape it into meaningful customer segments contribute to the ongoing story of e-commerce success, making every customer's journey towards better shopping experiences that much more extraordinary.

# Module 1

# Task 1: Unlocking Sales Secrets.

You've just loaded an e-commerce dataset using Python, and your task is clear: to delve into the depths of this data and reveal the hidden sales secrets it holds. With pandas by your side, you're ready to explore trends, customer behaviors, and product insights that will not only drive sales but rewrite the success story of this e-commerce business. Your journey begins, armed with data and determination, to unearth the treasure trove that is "Orders_Analysis.csv."

In [29]:
#--- Import Pandas ---
import pandas as pd
#--- Read in dataset ----
df = pd.read_csv("./Orders_Analysis.csv")

# ---WRITE YOUR CODE FOR TASK 1 ---
df

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
0,DPR,DPR,100,AD-982-708-895-F-6C894FB,52039657,1312378,83290718932496,04/12/2018,2,200.0,-200.00,0.00,0.00,0.00,0.0,0,2
1,RJF,Product P,28 / A / MTM,83-490-E49-8C8-8-3B100BC,56914686,3715657,36253792848113,01/04/2019,2,190.0,-190.00,0.00,0.00,0.00,0.0,0,2
2,CLH,Product B,32 / B / FtO,68-ECA-BC7-3B2-A-E73DE1B,24064862,9533448,73094559597229,05/11/2018,0,164.8,-156.56,-8.24,0.00,0.00,0.0,-2,2
3,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,19/02/2019,1,119.0,-119.00,0.00,0.00,0.00,0.0,0,1
4,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,29263220319421,19/02/2019,1,119.0,-119.00,0.00,0.00,0.00,0.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70047,NSP,Product P,33 / A / FtO,AC-BB8-B86-8DD-4-E487272,22559066,1527995,86214274161928,22/02/2019,-3,0.0,0.00,-213.75,-213.75,-42.75,-256.5,-3,0
70048,AYN,Product H,33 / B / FtO,B2-6F1-C7D-824-4-5AD849C,57555781,8030551,75341769882681,14/03/2019,-3,0.0,0.00,-222.49,-222.49,-44.51,-267.0,-3,0
70049,QID,Product H,33 / C / FtO,84-EB3-E68-8BF-1-F2EE65C,29857030,1201357,26287500138156,19/11/2018,-3,0.0,0.00,-222.50,-222.50,-44.50,-267.0,-3,0
70050,KNB,Product P,40 / B / FtO,DB-5D5-1F5-964-6-F33469E,81507405,9368488,59112081344038,08/04/2019,-3,0.0,0.00,-237.49,-237.49,-47.50,-285.0,-3,0


# Task 2: Quantifying Success.

In your relentless pursuit of data-driven excellence, you now filter the dataset to retain only records with a positive ordered item quantity. This is not just a routine task but a crucial step: the rows at the end of the table above have an ordered item quantity of 0 and a negative net quantity, which means they record returns rather than purchases. Removing them leaves only the line items where something was actually ordered, so every later count and sum describes real purchases. Your journey continues, as you sift through the data, separating purchase records from return-only records.

In [30]:
#--- WRITE YOUR CODE FOR TASK 2 ---
df = df[df['ordered_item_quantity'] > 0]

#--- Inspect data ---
df

Unnamed: 0,product_title,product_type,variant_title,variant_sku,variant_id,customer_id,order_id,day,net_quantity,gross_sales,discounts,returns,net_sales,taxes,total_sales,returned_item_quantity,ordered_item_quantity
0,DPR,DPR,100,AD-982-708-895-F-6C894FB,52039657,1312378,83290718932496,04/12/2018,2,200.00,-200.00,0.00,0.0,0.0,0.0,0,2
1,RJF,Product P,28 / A / MTM,83-490-E49-8C8-8-3B100BC,56914686,3715657,36253792848113,01/04/2019,2,190.00,-190.00,0.00,0.0,0.0,0.0,0,2
2,CLH,Product B,32 / B / FtO,68-ECA-BC7-3B2-A-E73DE1B,24064862,9533448,73094559597229,05/11/2018,0,164.80,-156.56,-8.24,0.0,0.0,0.0,-2,2
3,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,53616575668264,19/02/2019,1,119.00,-119.00,0.00,0.0,0.0,0.0,0,1
4,NMA,Product F,40 / B / FtO,6C-1F1-226-1B3-2-3542B41,43823868,4121004,29263220319421,19/02/2019,1,119.00,-119.00,0.00,0.0,0.0,0.0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59337,ZQV,Product W,30 / B / FtO,DB-F64-786-53D-E-2AA01BE,44945143,8064801,62354909703252,09/01/2019,0,57.50,0.00,-57.50,0.0,0.0,0.0,-1,1
59338,FTU,Product N,XL / Sl / Re,D7-8AB-C32-494-A-CF1BD1B,54388513,2953344,41328715058313,10/02/2019,0,49.17,0.00,-49.17,0.0,0.0,0.0,-1,1
59339,FTU,Product N,L / Sl / Re,22-81B-50C-90D-3-6818F85,54388513,6347235,94824655653983,14/01/2019,0,59.00,0.00,-59.00,0.0,0.0,0.0,-1,1
59340,QID,Product H,32 / A / FtO,04-C33-7C3-878-A-F7EC808,17627416,8165442,14619721496901,22/03/2019,0,74.17,0.00,-74.17,0.0,0.0,0.0,-1,1
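
To see what the filter keeps and what it drops, here is a minimal sketch on a hypothetical three-row frame. The values are made up for illustration and are not taken from "Orders_Analysis.csv"; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "ordered_item_quantity": [2, 0, 1],
    "returned_item_quantity": [0, -3, -1],
})

# Same boolean mask as in the cell above: keep rows where something was ordered.
purchases = toy[toy["ordered_item_quantity"] > 0]

# Rows the filter drops: return-only records with nothing ordered.
dropped = toy[toy["ordered_item_quantity"] <= 0]

print(purchases)
print(dropped)
```

Order 3 stays in the data even though one item was returned, because the filter looks only at the ordered quantity.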


# Task 3: Customer Code: Deciphering Buying Patterns.

You now delve deeper into the world of data manipulation. Your mission is crystal clear: decode the buying patterns of e-commerce customers. You group the orders by customer and product type, encode each pair as 1 if the customer ordered that product type and 0 otherwise, and then sum the flags per customer. The resulting "products_ordered" column holds the number of distinct product types each customer has bought. With each line of code, you illuminate the path towards a deeper understanding of customer preferences, and in doing so, you're empowering the e-commerce business to tailor its offerings more precisely to its audience.

In [31]:

def encode_column(column):
    """Return 1 if the aggregated value is positive, otherwise 0."""
    return 1 if column > 0 else 0

# Assuming 'df' is your DataFrame
# Define the list of columns to group by
column_list = ['customer_id', 'product_type']

# Step 1: Create 'aggregated_dataframe' (number of order lines per customer and product type)
aggregated_dataframe = df.groupby(column_list)['ordered_item_quantity'].count().reset_index()

# Step 2: Add a new column "products_ordered" using the encode_column function
aggregated_dataframe['products_ordered'] = aggregated_dataframe['ordered_item_quantity'].apply(encode_column)

# Step 3: Create 'customers_orders' DataFrame
customers_orders = aggregated_dataframe.groupby('customer_id')['products_ordered'].sum().reset_index()

# Store the result in a variable for inspection
result = customers_orders.head()
result

Unnamed: 0,customer_id,products_ordered
0,1000661,1
1,1001914,1
2,1002167,3
3,1002387,1
4,1002419,2
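
Since the encode-and-sum steps amount to counting distinct product types per customer, the result can be cross-checked with pandas' `nunique`. The sketch below compares both routes on a hypothetical toy frame (invented values, same column names). It is an alternative for verification, not the approach the task asks for.

```python
import pandas as pd

# Hypothetical toy orders: customer 1 buys two product types, customer 2 buys one.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "product_type": ["A", "A", "B", "A"],
    "ordered_item_quantity": [1, 2, 1, 3],
})

# Two-step route, as in the cell above: flag each (customer, product type) pair, then sum.
pairs = toy.groupby(["customer_id", "product_type"])["ordered_item_quantity"].count().reset_index()
pairs["products_ordered"] = (pairs["ordered_item_quantity"] > 0).astype(int)
two_step = pairs.groupby("customer_id")["products_ordered"].sum()

# One-step route: count distinct product types per customer.
one_step = toy.groupby("customer_id")["product_type"].nunique()

print(two_step.tolist(), one_step.tolist())
```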


# Task 4: Unveiling Return Rate Insights.

Your journey now takes you into the world of order and return dynamics. As you calculate the sum of ordered and returned items by customer and order, you aim to unravel the balance between what's bought and what's sent back. The "average return rate" is the number of returned items divided by the number of ordered items in each order; because "returned_item_quantity" is stored as a negative number, the ratio is negated so that the rate comes out positive. It offers a fresh perspective on customer behaviors and product quality, guiding the e-commerce business toward strategies that keep sales and returns in balance.

In [32]:

# Assuming 'df' is your DataFrame

# Step 1: Create 'ordered_sum_by_customer_order'
ordered_sum_by_customer_order = df.groupby(['customer_id', 'order_id'])['ordered_item_quantity'].sum().reset_index()

# Step 2: Create 'returned_sum_by_customer_order'
returned_sum_by_customer_order = df.groupby(['customer_id', 'order_id'])['returned_item_quantity'].sum().reset_index()

# Step 3: Merge the DataFrames
ordered_returned_sums = pd.merge(ordered_sum_by_customer_order, returned_sum_by_customer_order, on=['customer_id', 'order_id'], how='outer')

# Step 4: Calculate "average_return_rate"
ordered_returned_sums['average_return_rate'] = -ordered_returned_sums['returned_item_quantity'] / ordered_returned_sums['ordered_item_quantity']

# Store the result in a variable for inspection
result1 = ordered_returned_sums.head()
result1

Unnamed: 0,customer_id,order_id,ordered_item_quantity,returned_item_quantity,average_return_rate
0,1000661,99119989117212,3,0,0.0
1,1001914,79758569034715,1,0,0.0
2,1002167,38156088848638,1,0,0.0
3,1002167,57440147820257,1,0,0.0
4,1002167,58825523953710,1,0,0.0
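
The two groupby calls and the merge can also be collapsed into a single aggregation. The following is a sketch on hypothetical toy data (invented values); it assumes both quantity columns are summed over the same customer and order keys, as they are in the cell above.

```python
import pandas as pd

# Hypothetical line items: order 10 has two lines, one item of which was returned.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_id": [10, 10, 20],
    "ordered_item_quantity": [2, 2, 1],
    "returned_item_quantity": [0, -1, 0],
})

# Sum both quantity columns per (customer, order) in one pass.
sums = toy.groupby(["customer_id", "order_id"], as_index=False)[
    ["ordered_item_quantity", "returned_item_quantity"]
].sum()

# Returned quantities are negative, so negate the ratio to get a positive rate.
sums["average_return_rate"] = -sums["returned_item_quantity"] / sums["ordered_item_quantity"]

print(sums)
```

Because both sums come from the same grouping, no merge is needed and the row order of the two columns cannot drift apart.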


# Module 2

# Task 1: Charting the Path to Customer Satisfaction.

This data-driven pursuit doesn't just stop at numbers; it's about revealing the stories behind each return. You average the per-order return rates for each customer, count how many customers share each rate, and merge the customer-level rate with the "products_ordered" counts from Module 1. As you merge and reshape the data, you're paving the way for the e-commerce business to craft strategies that enhance customer satisfaction.

In [33]:


# Assuming 'ordered_returned_sums' and 'customers_orders' DataFrames are already created

# Step 1: Calculate 'customer_return_rate'
customer_return_rate = ordered_returned_sums.groupby('customer_id')['average_return_rate'].mean().reset_index()

# Step 2: Create 'return_rates' DataFrame (how many customers share each average return rate)
return_rates = customer_return_rate['average_return_rate'].value_counts().reset_index()

# Step 3: Rename columns in 'return_rates' DataFrame
return_rates.columns = ['average return rate', 'count of unit return rate']

# Step 4: Merge 'customers_orders' with 'customer_return_rate' to create 'customers' DataFrame
customers = pd.merge(customers_orders, customer_return_rate, on='customer_id', how='left')

# Store the results in variables for inspection
result_customer_return_rate = customer_return_rate.head()
result_return_rates = return_rates.head()
result_customers = customers.head()
result_customers

Unnamed: 0,customer_id,products_ordered,average_return_rate
0,1000661,1,0.0
1,1001914,1,0.0
2,1002167,3,0.0
3,1002387,1,0.0
4,1002419,2,0.0
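
Note that the customer-level figure is a mean of per-order rates, which is not the same as total returned items divided by total ordered items. The sketch below contrasts the two on a hypothetical customer with one small, fully returned order and one large order that was kept. The "pooled" variant is shown only for comparison and is not part of the original task.

```python
import pandas as pd

# Hypothetical per-order sums for one customer (invented values).
orders = pd.DataFrame({
    "customer_id": [1, 1],
    "ordered_item_quantity": [1, 9],
    "returned_item_quantity": [-1, 0],
})
orders["average_return_rate"] = -orders["returned_item_quantity"] / orders["ordered_item_quantity"]

# Approach used in the cell above: average the per-order rates.
mean_of_rates = orders.groupby("customer_id")["average_return_rate"].mean()

# Alternative for comparison: total returned items over total ordered items.
totals = orders.groupby("customer_id")[["ordered_item_quantity", "returned_item_quantity"]].sum()
pooled_rate = -totals["returned_item_quantity"] / totals["ordered_item_quantity"]

print(mean_of_rates.loc[1], pooled_rate.loc[1])
```

The mean of rates weights every order equally, so a single small returned order can dominate; which definition suits the business question is a judgment call.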


# Task 2: The Currency of Customer Loyalty.

You sum "total_sales" for each customer and rename the result to "total_spending" for clarity, preparing a roadmap for the e-commerce company to comprehend customer spending patterns and to identify the high-value customers. With each line of code, you're crafting a narrative of how loyalty is reflected in the currency customers invest, further empowering the business to tailor strategies that nurture and retain its most valuable assets.

In [34]:


# Assuming 'df' is your DataFrame

# Step 1: Calculate 'customer_total_spending'
customer_total_spending = df.groupby('customer_id')['total_sales'].sum().reset_index()

# Step 2: Rename the column in 'customer_total_spending' DataFrame
customer_total_spending.rename(columns={'total_sales': 'total_spending'}, inplace=True)

# Inspect the resulting DataFrame (without using print)
result_customer_total_spending = customer_total_spending.head()


# Task 3: Customer Chronicles: Weaving a Tapestry of Insights.

With precision, you've shaped this data tapestry: you merge the total spending into the customer table, which now holds the products ordered, the average return rate, and the total spending for each customer. You then drop the "customer_id" column, since an identifier carries no behavioral information and should not influence the clustering. Your journey continues, and with each line of code, you're arming the e-commerce business with a holistic view of its customers.

In [35]:
# Assuming 'customers' and 'customer_total_spending' DataFrames are already created

# Step 1: Merge 'customers' with 'customer_total_spending'
customers = pd.merge(customers, customer_total_spending, on='customer_id', how='left')

# Step 2: Drop the "customer_id" column from the 'customers' DataFrame
customers.drop(columns='customer_id', inplace=True)

# Inspect the resulting DataFrame (without using print)
result_customers_after_merge_and_cleaning = customers.head()


# Task 4: Transformed Insights.

By applying the log(1 + x) transformation (NumPy's log1p) and rounding to two decimal places, you compress the long right tails that counts and spending totals tend to have, so that a handful of very large values do not dominate the distance calculations in the clustering step. The columns "products_ordered," "average_return_rate," and "total_spending" are kept as they are, and their transformed versions are added as new columns prefixed with "log_". With each line of code, you're turning the e-commerce data into a canvas that's ready for more profound analysis and strategic decision-making.

In [36]:
import numpy as np

# Assuming 'customers' DataFrame is already created

# Step 1: Define a list of columns to transform
columns_to_transform = ["products_ordered", "average_return_rate", "total_spending"]

# Step 2: Iterate over the columns and apply transformations
for column in columns_to_transform:
    # Apply log(1 + x) via np.log1p, which is still defined when a value is 0
    transformed_column = np.log1p(customers[column])
    
    # Round the values to two decimal places
    rounded_column = transformed_column.round(2)
    
    # Create new columns in the 'customers' DataFrame
    customers[f"log_{column}"] = rounded_column

# Inspect the resulting DataFrame (without using print)
result_customers_after_transformation = customers.head()
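
As a quick illustration of why log1p is used instead of a plain logarithm, and of how the transformation is undone later, here is a minimal sketch on a few hypothetical values (not taken from the dataset).

```python
import numpy as np

# Hypothetical raw values, including a zero such as a customer with no returns.
values = np.array([0.0, 1.0, 9.0, 99.0])

# np.log1p(x) computes log(1 + x), so 0 maps to 0 instead of -inf.
logged = np.log1p(values)

# np.expm1(y) computes exp(y) - 1, the inverse of log1p.
restored = np.expm1(logged)

# Rounding the logged values to two decimals loses a little precision on the way back.
restored_from_rounded = np.expm1(logged.round(2))

print(logged)
print(restored)
print(restored_from_rounded)
```

The last line shows that values recovered from rounded logs are close to, but not exactly, the originals.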



# Module 3

# Task 1: Cluster Quest: Unveiling the Essence of Customer Segmentation.

In your enthralling journey, the stage is set for the "Cluster Quest." With the power of scikit-learn's K-means clustering, you're poised to unveil the essence of customer segmentation. The code you've crafted launches a sophisticated algorithm to partition customers into distinct groups based on their log-transformed metrics.

The K-means model's score, rounded to two decimal places, is the inertia of the clusters: the sum of squared distances between each customer and the center of its assigned cluster. Lower inertia means tighter clusters. Because no number of clusters is specified in this first fit, scikit-learn falls back to its default of 8; choosing a better value is the subject of the next task. As you press forward, you're about to discover the clusters that define customer segments, empowering the e-commerce business to tailor its strategies with newfound precision.

In [37]:
from sklearn.cluster import KMeans

# Assuming 'customers' DataFrame is already created

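# Step 1: Initialize the K-Means model (n_clusters is not set, so scikit-learn's default of 8 is used)
kmeans_model = KMeans(init='k-means++', max_iter=500, random_state=42, n_init=10)

# Step 2: Fit the K-Means model to the log-transformed columns (positions 3 onward)
kmeans_model.fit(customers.iloc[:, 3:])

# Step 3: Take the inertia (within-cluster sum of squared distances) as the K-Means score
kmeans_score = kmeans_model.inertia_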

# Round the score to two decimal places
kmeans_score = round(kmeans_score, 2)

# Inspect the resulting score (without using print)
result_kmeans_score = kmeans_score
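
To make the meaning of inertia concrete, the sketch below recomputes it by hand and compares it with the fitted model's `inertia_` attribute. The data are hypothetical: two synthetic blobs stand in for the customer columns.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: two loose blobs in two dimensions.
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(points)

# Look up the center assigned to every point, then sum the squared distances.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = float(((points - assigned_centers) ** 2).sum())

print(manual_inertia, model.inertia_)
```

The two numbers should agree up to floating-point error, which confirms that the score stored in the cell above measures how tightly customers sit around their cluster centers.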



# Task 2: Finding the Sweet Spot: The Clusters' Hidden Harmony.

In your data-driven journey, you venture into a quest to determine the optimal number of clusters that will reveal the hidden harmony within the data. Your code expertly explores a range of cluster values, from 1 to 15, using the K-means algorithm.

As you iterate through each cluster value, you record the inertia. Inertia generally keeps falling as clusters are added, so the smallest value is not the goal; instead you look for the "elbow," the point after which adding another cluster reduces inertia only slightly. This insight holds the key to shaping tailored strategies that align with the nature of the customer base. Your journey continues, as you prepare to unveil the clusters' hidden harmony, setting the stage for a new level of data-driven success in the e-commerce realm.

In [38]:
# Assuming 'customers' DataFrame is already created

# Step 1: Create 'dataframe' by selecting columns for clustering
dataframe = customers.iloc[:, 3:]

# Step 2: Define the number of clusters (K) as 15
K = 15

# Step 3: Create a list of cluster values from 1 to K
cluster_values = list(range(1, K + 1))

# Step 4: Initialize an empty list to store inertia values
inertia_values = []

# Step 5: Iterate over cluster values
for c in cluster_values:
    # Create a K-Means model
    kmeans_model = KMeans(n_clusters=c, init='k-means++', max_iter=500, random_state=42, n_init=10)
    
    # Fit the model to the data and calculate inertia
    kmeans_model.fit(dataframe)
    inertia_value = round(kmeans_model.inertia_, 2)
    
    # Append inertia value to the list
    inertia_values.append(inertia_value)

# Inspect the resulting inertia values (without using print)
result_inertia_values = inertia_values
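
The list of inertia values does not pick a number of clusters by itself; it is usually plotted and read by eye. As a complementary check that is not part of the original notebook, the sketch below also computes scikit-learn's silhouette score for each candidate. It runs on hypothetical synthetic data with four well-separated blobs, so the "right" answer is known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: four blobs standing in for the log-transformed customer columns.
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.6,
    random_state=42,
)

inertias = {}
silhouettes = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = model.fit_predict(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, labels)

# Inertia keeps shrinking with k; the silhouette score peaks at a specific k.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

On real customer data the clusters are rarely this clean, so the silhouette peak and the elbow may disagree and the final choice remains a judgment call.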


# Task 3: Cluster Symphony: The Grand Unveiling.

With the insights gained from the previous steps, you've settled on four clusters, and the updated K-means model is poised to perform its magic.

As the model fitting and prediction unfold, you're on the verge of unveiling the refined customer segments. These segments, meticulously crafted through data alchemy, logarithmic transformations, and precision clustering, represent the heart of customer insights. With each line of code, you're on the cusp of discovering the harmonic clusters that will empower the e-commerce business to tailor its strategies with unparalleled precision. Your journey continues, as you're about to reveal the grand symphony of customer segmentation, setting the stage for a new era of data-driven success in the e-commerce landscape.

In [39]:
from sklearn.cluster import KMeans

# Assuming 'customers' DataFrame is already created

# Step 1: Create 'dataframe' by selecting columns for clustering
dataframe = customers.iloc[:, 3:]

# Step 2: Create an updated K-Means model with optimized K=4
updated_kmeans_model = KMeans(n_clusters=4, init='k-means++', max_iter=500, random_state=42, n_init=10)

# Step 3: Apply K-Means clustering using .fit_predict() method
res = updated_kmeans_model.fit_predict(dataframe)

# Inspect the resulting cluster assignments (without using print)
result_cluster_assignments = res


# Task 4: Cluster Insights Unleashed: The Symphony Resonates.

In your data-driven odyssey, you've now reached the pinnacle with "Cluster Insights Unleashed." Your code takes the cluster centers, which live in log-transformed space, and applies np.expm1, the inverse of np.log1p, to bring them back to their original, interpretable units. These centers represent the essence of each customer segment, reflecting the number of product types ordered, the average return rate, and total spending.

As you align the clusters and round the values to two decimal places, you're preparing to reveal the symphony of insights hidden within these customer segments. The clusters are no longer abstract; they're tangible profiles that will guide the e-commerce business towards more precise strategies. With each line of code, you're about to unveil the grand symphony of customer segmentation, setting the stage for a new era of data-driven success in the e-commerce landscape. Your journey has reached its crescendo, and the insights within the clusters are ready to reshape the future of the business.

In [40]:
import numpy as np

# Assuming 'customers' DataFrame is already created

# Step 1: Calculate cluster centers
cluster_centers = updated_kmeans_model.cluster_centers_

# Step 2: Assign cluster labels to each customer and add a new column named "clusters"
customers['clusters'] = updated_kmeans_model.labels_

# Step 3: Apply inverse transformation to cluster centers
actual_data = np.expm1(cluster_centers)

# Step 4: Place the back-transformed centers side by side with the log-scale centers
add_points = np.append(actual_data, cluster_centers, axis=1)

# Append the cluster labels 0-3 as a final column
add_points = np.append(add_points, [[0], [1], [2], [3]], axis=1)

# Step 5: Build a new DataFrame named 'centers_df'
centers_df = pd.DataFrame(add_points, columns=["products_ordered", "average_return_rate", "total_spending",
                                               "log_products_ordered", "log_average_return_rate", "log_total_spending", "clusters"])

# Step 6: Convert the "clusters" column to integer data type
centers_df['clusters'] = centers_df['clusters'].astype("int")

# Step 7: Round all values in 'centers_df' to two decimal places
rounded_centers_df = centers_df.round(2)

# Step 8: Copy 'customers' DataFrame to 'customers_final'
customers_final = customers.copy()

# Inspect the resulting rounded centers DataFrame and the final customer DataFrame
result_rounded_centers_df = rounded_centers_df
result_customers_final = customers_final.head()
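
The cell above hardcodes the labels 0 to 3, which only works for four clusters. A sketch of the same table built for any number of clusters follows. It assumes a hypothetical six-customer frame with invented log-scale values, and it derives the raw column names by stripping the "log_" prefix.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical log-scale features for six customers (invented values).
log_cols = ["log_products_ordered", "log_average_return_rate", "log_total_spending"]
toy = pd.DataFrame(
    [[0.69, 0.0, 4.0], [0.70, 0.0, 4.1], [0.71, 0.0, 4.2],
     [1.60, 0.4, 6.0], [1.61, 0.4, 6.1], [1.62, 0.4, 6.2]],
    columns=log_cols,
)

model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(toy)
centers = model.cluster_centers_

# Names for the back-transformed columns: drop the leading "log_".
raw_cols = [c.replace("log_", "", 1) for c in log_cols]

# Back-transformed centers next to log-scale centers, rounded to two decimals.
centers_df = pd.DataFrame(
    np.hstack([np.expm1(centers), centers]),
    columns=raw_cols + log_cols,
).round(2)

# Row i of cluster_centers_ belongs to label i, so the labels are just 0..k-1.
centers_df["clusters"] = np.arange(model.n_clusters)

print(centers_df)
```

Keep in mind that applying expm1 to a center gives the back-transformed mean of the logged values, which is generally lower than the plain average of the raw values in that cluster.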


# Task 5: Convergence of Insights: Merging the Customer Tapestry.

You're at the juncture of bringing together the customer data and the refined cluster centers. As you weave this tapestry, you're not only uniting customer profiles with cluster centers, but you're also assigning the identity of "center" to the cluster points.

The merger of these datasets creates a comprehensive view, fusing customer behaviors with the essence of each cluster. This unified data set is a roadmap for the e-commerce business, offering insights into the unique characteristics of each segment. With each line of code, you're guiding the business towards a new era of personalized strategies and data-driven success. Your journey reaches a pivotal moment, as the convergence of insights promises to reshape the future of the e-commerce landscape.

In [41]:
# Assuming 'customers_final' and 'rounded_centers_df' DataFrames are already created

# Step 1: Add a new column "is_center" to 'customers_final' and initialize with zeros
customers_final['is_center'] = 0

# Step 2: Add a new column "is_center" to 'rounded_centers_df' and set the value to 1 for all cluster center rows
rounded_centers_df['is_center'] = 1

# Step 3: Concatenate the contents of 'rounded_centers_df' to 'customers_final'
customers = pd.concat([customers_final, rounded_centers_df], ignore_index=True)

# Inspect the resulting DataFrame
result_customers = customers.head()


# Task 6: The Tapestry of Segmentation: Magnitude Unveiled.

In the ever-evolving realm of data analysis, your journey reaches an intriguing chapter titled "The Tapestry of Segmentation." With your clusters and customer data now harmoniously combined, you're about to unveil the magnitude of each customer group.

As you convert cluster labels to strings and meticulously record the cardinality of each cluster, you're preparing to paint a vivid picture of the customer landscape. The "Customer Group Magnitude" holds the key to understanding the size and significance of each segment.

With each line of code, you're providing the e-commerce business with insights that go beyond the clusters themselves, offering a deeper understanding of customer behavior. Your journey continues, as you set the stage for a new era of data-driven success, where customer segmentation becomes a fundamental part of business strategy.

In [42]:
# Assuming 'customers' DataFrame is already created

# Step 1: Create a new column "cluster_name" and convert cluster labels to strings
customers['cluster_name'] = customers['clusters'].astype(str)

# Step 2: Create a new variable 'final_result' to summarize customer clusters
# (note: 'customers' also contains the cluster-center rows added in Task 5, so each count includes one center row)
final_result = customers['cluster_name'].value_counts()

# Step 3: Name the columns in 'final_result': cluster label and number of rows per cluster
final_result = final_result.rename_axis('Customer Groups').reset_index(name='Customer Group Magnitude')

# Inspect the resulting summary DataFrame
result_final_result = final_result
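
The summary above counts every row of `customers`, which at this point also contains the cluster-center rows appended in Task 5. If only real customers should be counted, one option is to filter on "is_center" first. The sketch below shows the difference on a hypothetical frame with invented values: five customers plus two center rows.

```python
import pandas as pd

# Hypothetical combined frame: five customers and two center rows (is_center == 1).
toy = pd.DataFrame({
    "total_spending": [100.0, 120.0, 900.0, 950.0, 1000.0, 110.0, 950.0],
    "clusters": [0, 0, 1, 1, 1, 0, 1],
    "is_center": [0, 0, 0, 0, 0, 1, 1],
})

# Counting every row includes the appended center rows.
with_centers = toy["clusters"].astype(str).value_counts()

# Filtering on is_center counts actual customers only.
customers_only = toy[toy["is_center"] == 0]
sizes = (
    customers_only["clusters"].astype(str)
    .value_counts()
    .rename_axis("Customer Groups")
    .reset_index(name="Customer Group Magnitude")
)

print(with_centers)
print(sizes)
```

With only one extra row per cluster the distortion is small on the full dataset, but excluding the centers keeps the magnitudes equal to the true number of customers in each group.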
