# Report on Customer Segmentation

![Banner](Images/Banner.jpg)


## The following report is divided into 6 stages:
* __Background__ - Gives context to this report
* __The Data__ - The data available for analysis
    *  Doctors contains information on doctors. Each row represents one doctor.
    *  Orders contains details on orders. Each row represents one order; a doctor can place multiple orders.
    *  Complaints collects information on doctor complaints.
    *  Instructions has information on whether the doctor includes special instructions on their orders.
* __Methodology__ - Explains methods used to achieve results
* __What the managers want to know?__ - Answers to specific questions through statistical analysis and Machine Learning
    * How many doctors are there in each region? What is the average number of purchases per region?
    * Can you find a relationship between purchases and complaints?
    * Define new doctor segments that help the company improve marketing efforts and customer service.
    * Identify which features impact the new segmentation strategy the most.
    * Your team will need to explain the new segments to the rest of the company. Describe which characteristics distinguish the newly defined segments.
* __Recommendations__ - My interpretation of the personas, with insights for the marketing team
* __Appendix__ - Overall code used in the tasks
    * Code for questions of analysis
    * Code of ML pipeline for repeatability

# Can you find a better way to segment your customers?

## 📖 Background
I work as a Data Scientist at Johansson & Johansson, a medical device manufacturer in Switzerland that makes orthopedic devices and sells them worldwide, directly to individual doctors who use them on rehabilitation and physical therapy patients.

Historically, the sales and customer support departments have grouped doctors by geography. However, the region is not a good predictor of the number of purchases a doctor will make or their support needs.

My team wants to use a data-centric approach to segmenting doctors to improve marketing, customer service, and product planning.

## 💾 The Data

The company stores the information you need in the following four tables. Some of the fields are anonymized to comply with privacy regulations.

#### Doctors contains information on doctors. Each row represents one doctor.
- "DoctorID" - is a unique identifier for each doctor.
- "Region" - the current geographical region of the doctor.
- "Category" - the type of doctor, either 'Specialist' or 'General Practitioner.'
- "Rank" - is an internal ranking system. It is an ordered variable: The highest level is Ambassadors, followed by Titanium Plus, Titanium, Platinum Plus, Platinum, Gold Plus, Gold, Silver Plus, and the lowest level is Silver.
- "Incidence rate"  and "R rate" - relate to the amount of re-work each doctor generates.
- "Satisfaction" - measures doctors' satisfaction with the company.
- "Experience" - relates to the doctor's experience with the company.
- "Purchases" - purchases over the last year.

#### Orders contains details on orders. Each row represents one order; a doctor can place multiple orders.
- "DoctorID" - doctor id (matches the other tables).
- "OrderID" - order identifier.
- "OrderNum" - order number.
- "Conditions A through J" - map the different settings of the devices in each order. Each order goes to an individual patient.

#### Complaints collects information on doctor complaints.
- "DoctorID" - doctor id (matches the other tables).
- "Complaint Type" - the company's classification of the complaints.
- "Qty" - number of complaints per complaint type per doctor.

#### Instructions has information on whether the doctor includes special instructions on their orders.
- "DoctorID" - doctor id (matches the other tables).
- "Instructions" - 'Yes' when the doctor includes special instructions, 'No' when they do not.

## ⚙️ Methodology

To reach a conclusion on how customers are segmented in our data, I used an unsupervised machine learning approach, as it is easy to interpret and not computationally heavy.

For the algorithm to be able to process the data correctly:
1. After cleaning the data, a set of ordinal and continuous variables was selected for processing
2. The ordinal variable (Rank) was label-encoded so that it represents a numerical rank and can therefore be processed
3. Since the K-means algorithm is sensitive to distances (variables with large numerical ranges dominate), the selected variables needed to be standardized
4. In an attempt to reduce the dimensionality (number of variables) of the data, a Principal Component Analysis was performed on the already-scaled variables
5. After observing that 5 out of 7 Principal Components explain more than 80% of the variance in the data, the remaining 2 Principal Components were discarded
6. The reduced data was then fed to K-means, with a chosen "k" of 4 clusters
7. The resulting labels were then added to the data, and the features of analysis were grouped by label and averaged to help interpret the clusters, thus creating the segments
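
The steps above can be sketched end to end as follows. This is a minimal sketch on synthetic data, not the pipeline used for the report (that code is in the Appendix): the `toy` table and all of its values are hypothetical, and chaining the steps with `make_pipeline`, as well as the `n_init` and `random_state` values, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned doctor table (hypothetical values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Rank": rng.integers(0, 9, n).astype(float),  # label-encoded ordinal, 0 to 8
    "Incidence rate": rng.normal(5, 2, n),
    "R rate": rng.normal(1, 0.5, n),
    "Satisfaction": rng.normal(30, 40, n),
    "Experience": rng.normal(0.5, 0.3, n),
    "Purchases": rng.normal(10, 5, n),
    "Qty": rng.poisson(1, n).astype(float),
})

# Steps 3 to 6: standardize, keep 5 principal components, cluster with k = 4.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KMeans(n_clusters=4, n_init=10, random_state=9),
)
toy["persona"] = pipeline.fit_predict(toy)

# Step 7: mean of each feature per cluster, to help interpret the segments.
persona_means = toy.groupby("persona").mean()
print(persona_means.round(2))
```

Fixing `random_state` keeps the cluster numbering stable between runs, which matters when names are later attached to the numeric labels.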

## 📖 What the managers want to know?

### 👉 How many doctors are there in each region? What is the average number of purchases per region?

Due to demographic factors, doctors are not evenly distributed across regions. Even so, some regions show particularly high average purchases and others lower ones, as shown below:

![Average purchases per region](Images/Average_purchases_per_region.jpg)
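
Both figures per region can be obtained with a single grouped aggregation. The snippet below is a small sketch on made-up rows: the region names and purchase values are hypothetical, and only the column names follow the Doctors table.

```python
import pandas as pd

# Toy rows standing in for the Doctors table (hypothetical values).
toy_doctors = pd.DataFrame({
    "DoctorID": ["A", "B", "C", "D", "E"],
    "Region": ["North", "North", "South", "South", "South"],
    "Purchases": [10.0, 20.0, 3.0, 6.0, 9.0],
})

# One row per region: number of distinct doctors and mean purchases.
region_summary = (
    toy_doctors.groupby("Region")
    .agg(doctors=("DoctorID", "nunique"), avg_purchases=("Purchases", "mean"))
    .sort_values("doctors", ascending=False)
)
print(region_summary)
```

Named aggregation keeps the two measures in one table indexed by region, so they cannot drift out of alignment.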

### 👉 Can you find a relationship between purchases and complaints?

Naturally, as a business scales and makes more sales, complaints tend to rise roughly in proportion (assuming "business as usual" conditions).
A correlation test between the two variables tells us how strongly they are related and whether that relationship is statistically significant. In the scatterplot below, with complaints as a function of purchases, we see a __weak correlation__ of 0.16 and a __p-value__ of less than 0.05.

This means __the variables are indeed correlated__: the higher the purchases, the higher the complaints tend to be. Outliers are dealt with later in the analysis; here they are kept, since they appear on both axes and roughly balance each other. Nonetheless, the correlation is weak. Something else may be having a higher impact on the complaints variable.

![Correlation between Purchases and Complaints](Images/Correlation_between_Purchases_and_Complaints.png)
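
To make "weak" more concrete: squaring a Pearson correlation gives the share of variance in one variable that a linear relationship with the other accounts for, so a correlation of 0.16 corresponds to about 2.6%. The sketch below runs the same kind of test on synthetic data; the generated `purchases` and `complaints` arrays are hypothetical and are not the report's data.

```python
import numpy as np
from scipy import stats

# Synthetic data with a mild positive dependence (hypothetical values).
rng = np.random.default_rng(42)
purchases = rng.gamma(shape=2.0, scale=5.0, size=400)
complaints = rng.poisson(lam=0.5 + 0.02 * purchases)

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(purchases, complaints)
print(f"r = {r:.2f}, r^2 = {r**2:.3f}, p-value = {p_value:.4f}")

# The coefficient reported in this section, squared.
reported_r = 0.16
print(f"Share of variance explained: {reported_r**2:.1%}")
```

A small p-value only says the correlation is unlikely to be exactly zero; it says nothing about how strong the relationship is.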

### 👉 Define new doctor segments that help the company improve marketing efforts and customer service.

After segmenting the data through unsupervised machine learning, I've decided to highlight 4 customer segments with defined characteristics:

![Personas distribution](Images/Personas_distribution.jpg)


### 👉 Identify which features impact the new segmentation strategy the most.

![Main features](Images/Main_features.jpg)

![Personas average values](Images/Personas_average_values.jpg)

Of all the features present in the 4 data sets, only a few were selected to cluster our customers.
As shown in the table above, these are the features used to model the segmentation strategy, and they all play an important role in shaping the personas' characteristics and in understanding their behavior.

The Incidence rate and R rate are interpreted together, which I refer to as "re-work", and the variable Qty refers to Complaints.

* 0/ __Millennial__
* 1/ __Fan__
* 2/ __Bootstrapper__
* 3/ __Conservative__
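
One possible way to rank how strongly each feature separates the personas is to compare the cluster means in standardized units and look at the gap between the highest and lowest cluster. This is an illustrative sketch only, not the method used to produce the figures above; the `toy` data, its two features, and the `spread` measure are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: Purchases differs by cluster, Satisfaction does not (hypothetical).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "persona": np.repeat([0, 1, 2, 3], 50),
    "Purchases": np.concatenate([rng.normal(m, 1.0, 50) for m in (5, 20, 8, 12)]),
    "Satisfaction": rng.normal(0.0, 1.0, 200),
})

features = ["Purchases", "Satisfaction"]

# Standardize so that features on different scales are comparable.
z = (toy[features] - toy[features].mean()) / toy[features].std()

# Mean z-score per persona, then the max-min gap across personas per feature.
cluster_means = z.groupby(toy["persona"]).mean()
spread = (cluster_means.max() - cluster_means.min()).sort_values(ascending=False)
print(spread)
```

Features with a larger spread differ more between personas and are therefore more useful when describing what sets each segment apart.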

### 👉 Your team will need to explain the new segments to the rest of the company. Describe which characteristics distinguish the newly defined segments.

Different combinations of these variables reveal behavioral traits that distinguish each segment:

![Persona's characteristics](Images/Personas_characteristics.jpg)


## ✍🏽 Recommendations
We should maintain the good service provided to the __Fan__ and, above all, improve the quality of service for the __Bootstrapper__. Since the latter is the persona that represents our data the most, a good first impression may play a role in their retention levels. In turn, this may increase their interest and curiosity, leading to an upgrade of the subscription plan and improving Johansson & Johansson's brand image and profits.

Regarding our __Conservative__, they seem to be an example of how customers below the higher rankings behave once they reach maturity. Effort should go into keeping them engaged with the company for longer.
Now, about the __Millennial__: although fewer in number, their purchasing behaviour does not suggest cost saving, yet they complain a lot. To uncover the reason for this behavior, the complaints must be analyzed. For example, DoctorID #FAICB appears 5 times in the top 10 by number of complaints per order, and this top 10 consists entirely of the __Millennial__ persona.

## 📄 Appendix

### Code for the questions of analysis

In [None]:
#Imports libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

In [None]:
#Import and inspect data
doctors = pd.read_csv("Data/doctors.csv")
display(doctors)

In [None]:
#Import and inspect data
orders = pd.read_csv("Data/orders.csv")
display(orders)

In [None]:
#Import and inspect data
complaints = pd.read_csv("Data/complaints.csv")
display(complaints)

In [None]:
#Import and inspect data
instructions = pd.read_csv("Data/instructions.csv")
display(instructions)

In [None]:
#Obtaining number of doctors and average purchases per region, ordered by doctors
purchase_per_region = doctors.groupby("Region")["Purchases"].mean()
dr_per_region = doctors.Region.value_counts().sort_index()
dr_per_region = dr_per_region.reset_index(drop = False).set_index("Region")
dr_per_region["avg_purchases"] = purchase_per_region.values

#Separating the two series for plotting
dr = dr_per_region["count"]
pur = dr_per_region["avg_purchases"]

#Plotting findings
plt.figure(figsize = (10, 8))
dr.plot(kind = "barh", width= 1 , color = "green", edgecolor = "black", alpha = 0.8, label = "Doctors")
pur.plot(kind = "barh", width= 1, color = "white", edgecolor = "black", alpha = 0.8, label = "Purchases")
plt.gca().invert_yaxis()
plt.xticks(range(0, 131, 10))
plt.grid(True, axis = "x")
plt.title("Graph #1 - Number of Doctors and Average Purchases per Region")
plt.legend()
plt.show()

In [None]:
#Obtaining number of complaints per Doctor
complaints_per_doctor = complaints.groupby("DoctorID", as_index = False)["Qty"]\
                                    .sum()\
                                    .sort_values("Qty", ascending = False)

#Merging findings with the "doctors" data set
doctors_complaints_merged = doctors.merge(complaints_per_doctor, on = "DoctorID", how = "left", validate = "one_to_one")

#Filling missing values with 0, meaning no complaints were made
doctors_complaints_merged["Qty"] = doctors_complaints_merged["Qty"].fillna(0)

#Checking for relationship between purchases and complaints
y = doctors_complaints_merged.Qty
x = doctors_complaints_merged.Purchases
statistic, pvalue = stats.pearsonr(x, y)

#Plotting findings
sns.lmplot(x = "Purchases", y = "Qty", data = doctors_complaints_merged, ci = None,\
           y_jitter = True, scatter_kws = {"alpha" : 0.3},\
           line_kws = {"color" : "g",
                    "linewidth" : 0.8})
plt.title("Graph #2 - Correlation between Purchases and Complaints")
#pearsonr returns the correlation coefficient r (not R squared)
plt.annotate(f"r = {round(statistic, 2)}", (80.0, 17.5))
plt.annotate(f"p-value = {round(pvalue, 4)}", (80.0, 16.5))
plt.ylabel("Complaints")
plt.show()

In [None]:
#Filling two Rank missing values with the mode (most frequent rank)
mode = doctors_complaints_merged["Rank"].mode()[0]
doctors_complaints_merged["Rank"] = doctors_complaints_merged["Rank"].fillna(mode)

#Changing Rank ordinal variable to numeric
doctors_complaints_merged["Rank"] = doctors_complaints_merged.Rank.map({"Silver": 0,
                               "Silver Plus": 1,
                               "Gold": 2,
                               "Gold Plus": 3,
                               "Platinum": 4,
                               "Platinum Plus": 5,
                               "Titanium": 6,
                               "Titanium Plus": 7,
                               "Ambassador": 8})
doctors_complaints_merged["Rank"] = doctors_complaints_merged.Rank.astype("float")

#Filling Satisfaction missing values ("--") with the median of the valid values
median = doctors_complaints_merged.Satisfaction[doctors_complaints_merged.Satisfaction != "--"].astype("float").median()
doctors_complaints_merged["Satisfaction"] = doctors_complaints_merged["Satisfaction"].replace({"--" : median}).astype("float")

In [None]:
#Removing outliers for more accuracy
columns = ["Rank", "Incidence rate", "R rate", "Satisfaction", "Experience", "Purchases", "Qty"]

#Looping through columns to remove outliers
for column in columns:
    doctors_complaints_merged = doctors_complaints_merged\
    [(np.abs(stats.zscore(doctors_complaints_merged[column])) < 3)]
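
The loop above filters one column at a time, so each later column is judged against data that has already been trimmed. For comparison, a single-pass variant computes every z-score once on the full table and keeps the rows that pass in all columns. This is a hedged alternative, not the approach used in this report, and it can keep or drop slightly different rows; the `toy` table below is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy table with one extreme purchase value (hypothetical values).
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "Purchases": np.append(rng.uniform(0, 10, 100), 1000.0),
    "Qty": rng.uniform(0, 5, 101),
})

# Z-scores computed once, column by column, on the full table.
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep a row only if it is within 3 standard deviations in every column.
keep = (z < 3).all(axis=1)
filtered = toy[keep]
print(len(toy), "->", len(filtered))
```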

In [None]:
#Leaving only numeric variables for ML tasks
samples = doctors_complaints_merged.drop(["DoctorID", "Region", "Category"], axis = 1)

In [None]:
#Performing Standardization
scaled_samples = StandardScaler().fit_transform(samples)

In [None]:
#Performing Principal Component Analysis
pca = PCA().fit(scaled_samples)

#Check how many relevant components there are
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_)
plt.xticks(features)
plt.xlabel("PCA feature #")
plt.ylabel("Explained variance on data %")

#Drawing a line where values above make more than 80% of the data
plt.axhline(pca.explained_variance_ratio_[np.cumsum(pca.explained_variance_ratio_) >= 0.8][0]\
            , color = "g", lw = 0.8, ls = "--")
plt.show()

print(f"5 PCA features cumulatively explain {round(sum(pca.explained_variance_ratio_[:5]) * 100, 1)}% of the variance in the data.")

In [None]:
#Reduce the dimensionality of the data
pca2 = PCA(n_components = 5).fit(scaled_samples)
samples_reduced = pca2.transform(scaled_samples)
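
Instead of reading the number of components off the chart, scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components` (with `svd_solver="full"`) and keeps just enough components to exceed that share of explained variance. The sketch below checks this against a manual cumulative sum on synthetic data; the matrix `X` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 7-feature matrix with some correlation (hypothetical values).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 7))
X[:, 1] += 2 * X[:, 0]
X_scaled = StandardScaler().fit_transform(X)

# Manual route: smallest number of components reaching 80% cumulative variance.
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.8)) + 1

# Same threshold passed directly to PCA.
pca_auto = PCA(n_components=0.8, svd_solver="full").fit(X_scaled)
print(n_needed, pca_auto.n_components_)
```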

In [None]:
pca2_df = pd.DataFrame(abs(pca2.components_), columns= samples.columns, index = ["PC_1", "PC_2", "PC_3", "PC_4", "PC_5"])
pca2_df[pca2_df >= 0.5]

In [None]:
print("Using a threshold of 0.5, the most important values of the Principal Components are:", "\n\n", "Component_1: Purchases + Rank", "\n", "Component_2: R rate + Experience", "\n", "Component_3: Satisfaction + Complaints", "\n", "Component_4: Satisfaction + Complaints", "\n", "Component_5: R rate + Experience")

In [None]:
#Performing "elbow" method to understand how many clusters to divide the data into
inertia_values = []
for k in range(1, 7):
    KM = KMeans(n_clusters = k, random_state=9).fit(samples_reduced)
    inertia_values.append(KM.inertia_)
plt.plot(range(1, 7), inertia_values, marker = "o")
plt.xlabel("k clusters")
plt.ylabel("inertia")
plt.show()

In [None]:
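#Fitting a KMeans model with optimal clusters
#A fixed random_state keeps the numeric labels stable between runs, since persona names are mapped to them below
KM2 = KMeans(n_clusters = 4, random_state = 9).fit(samples_reduced)
labels = KM2.predict(samples_reduced)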

In [None]:
#Adding a new feature - Complaints to Purchase Ratio
doctors_complaints_merged["complaints_to_purchaces_ratio"] = doctors_complaints_merged["Qty"]/ doctors_complaints_merged["Purchases"]

#Adding the labels to the data set
doctors_complaints_merged["persona"] = labels

print("Table #1 - Personas average values for each variable")
display(doctors_complaints_merged.iloc[:, 3:].groupby("persona").mean())
print(doctors_complaints_merged.persona.value_counts(normalize = True))

In [None]:
#Mapping numeric cluster labels to persona names
doctors_complaints_merged["persona"] = doctors_complaints_merged["persona"].astype("object")
doctors_complaints_merged["persona"] = doctors_complaints_merged.persona.map({0 : "Millennial",
                                                                              1 : "Fan",
                                                                              2 : "Bootstrapper",
                                                                              3 : "Conservative"})

dr_persona = doctors_complaints_merged[["DoctorID", "persona"]]

In [None]:
#Investigating trained labels in orders made for patterns
All = doctors.merge(complaints_per_doctor, on = "DoctorID", how = "left", validate = "one_to_one")\
                .merge(instructions, on = "DoctorID", how = "left", validate = "one_to_one")\
                .merge(orders, on = "DoctorID", how = "left", validate = "one_to_many")\
                .merge(dr_persona, on = "DoctorID", how = "right", validate = "many_to_one")

#Taking an explicit copy so the assignments below do not act on a view of "All"
All_conditions = All[All["Condition C"].notna()].copy()
All_conditions["Condition J"] = All_conditions["Condition J"].fillna("Before")
All_conditions["Instructions"] = All_conditions["Instructions"].fillna("No")
All_conditions.replace({True : 1,
                        False : 0,
                        "Before" : 0,
                        "After" : 1,
                        "--" : 0}, inplace = True)

#Keeping the numeric columns plus the persona label
columns = list(All_conditions.describe().columns) + ["persona"]

display(All_conditions[columns].groupby("persona").mean())
print("There are no evident patterns in the Conditions of the orders that would justify a high level of complaints")

In [None]:
#Quantity of complaints and their type within known labels
complaints_personas = dr_persona.merge(complaints, on = "DoctorID", how = "left", validate = "one_to_many")
complaints_personas.groupby(["persona", "Complaint Type"])["Qty"].count()

In [None]:
#Percentage of incorrect complaints per persona
complaints_personas_filtered = complaints_personas[complaints_personas["Complaint Type"] == "Incorrect"]
complaints_personas_filtered.persona.value_counts(normalize = True).sort_values().plot(kind = "barh", color = "#0d5060")
plt.title("Percentage of incorrect complaints per persona")
plt.show()

In [None]:
#Top 10 rows by number of complaints
display(All.sort_values("Qty", ascending = False).head(10))
print("Among the top 10 complaint placers, Doctor ID 'FAICB' placed an unusual number of complaints, 6 per order")