

This cell mounts your Google Drive to the Colab environment. This allows you to access files stored in your Drive directly from your notebook.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Importing Python Libraries and Loading Data

The cell below imports the Python libraries needed for data manipulation, numerical operations, and visualization.
*   `pandas` is imported as `pd` for data manipulation and analysis.
*   `numpy` is imported as `np` for numerical operations.
*   `matplotlib.pyplot` is imported as `plt` for creating static, interactive, and animated visualizations in Python.
*   `seaborn` is imported as `sns` for creating informative statistical graphics.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

The cell below loads the four Excel files containing the project's data into pandas DataFrames.
*   `product_df`: Contains information about the insurance products.
*   `user_behaviour`: Contains data on how users interact with the platform and products.
*   `user_profiles`: Contains demographic and financial information about users.
*   `claims`: Contains data on claims by job and region.

The files are read from your Google Drive using the paths in the cell below.

In [4]:
product_df = pd.read_excel('/content/drive/MyDrive/My Projects/AXA Recommender/product_catalog.xlsx')
user_behaviour = pd.read_excel('/content/drive/MyDrive/My Projects/AXA Recommender/user_behavioral_data.xlsx')
user_profiles = pd.read_excel('/content/drive/MyDrive/My Projects/AXA Recommender/user_profiles.xlsx')
claims = pd.read_excel('/content/drive/MyDrive/My Projects/AXA Recommender/claims_by_job_and_region.xlsx')

## Understanding the Product Table

In [5]:
product_df.head()

Unnamed: 0,Product_ID,Product_Name,Monthly_Premium,Benefits
0,P001,Personal Accident Cover,800,Covers accidents and emergency hospitalization
1,P002,Fire & Burglary Insurance,1200,Protects against fire and theft
2,P003,Health Micro Plan,1500,Access to basic healthcare and drugs
3,P004,Device Protection Plan,1000,Covers mobile phone and gadgets
4,P005,Life Starter Plan,2000,Basic life insurance for policyholder


In [6]:
product_df.shape

(5, 4)

In [7]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Product_ID       5 non-null      object
 1   Product_Name     5 non-null      object
 2   Monthly_Premium  5 non-null      int64 
 3   Benefits         5 non-null      object
dtypes: int64(1), object(3)
memory usage: 292.0+ bytes


In [8]:
product_df.describe()

Unnamed: 0,Monthly_Premium
count,5.0
mean,1300.0
std,469.041576
min,800.0
25%,1000.0
50%,1200.0
75%,1500.0
max,2000.0


## Understanding the User Behavioral Table

In [9]:
user_behaviour.head()

Unnamed: 0,User_ID,Browsed_Products,Rejected_Product,Time_Spent_on_Platform_Min
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,['P001'],,27.75
1,c0c07501-5f27-4c6a-9646-196e1dc43294,"['P003', 'P004', 'P001']",,17.12
2,6a1387d8-41f9-4311-8f8c-da9d41104124,"['P001', 'P003', 'P004']",,5.95
3,8c5441f4-69d3-4844-be93-1be3df327416,"['P004', 'P002', 'P005']",P002,4.52
4,60fa690a-8921-4506-8690-5658292bb338,"['P003', 'P001', 'P002']",,4.21


In [10]:
user_behaviour.shape

(500, 4)

In [11]:
user_behaviour.describe()

Unnamed: 0,Time_Spent_on_Platform_Min
count,500.0
mean,15.20626
std,8.488494
min,1.05
25%,7.7675
50%,15.26
75%,22.5375
max,29.97


## Understanding the User Profile Data

In [12]:
user_profiles.head()

Unnamed: 0,User_ID,Name,Job,Monthly_Income,Number_of_Dependents,Region
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,Tyler Vaughan,Tailor,129920,5,Lagos
1,c0c07501-5f27-4c6a-9646-196e1dc43294,Shaun Murphy,Driver,98536,0,Kano
2,6a1387d8-41f9-4311-8f8c-da9d41104124,Michael Garcia,Driver,148701,3,Ibadan
3,8c5441f4-69d3-4844-be93-1be3df327416,Whitney Allison,Vendor,55491,3,Abuja
4,60fa690a-8921-4506-8690-5658292bb338,Cassandra Hall,POS Agent,136276,3,Lagos


In [13]:
user_profiles.shape

(500, 6)

In [14]:
user_profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   User_ID               500 non-null    object
 1   Name                  500 non-null    object
 2   Job                   500 non-null    object
 3   Monthly_Income        500 non-null    int64 
 4   Number_of_Dependents  500 non-null    int64 
 5   Region                500 non-null    object
dtypes: int64(2), object(4)
memory usage: 23.6+ KB


In [15]:
user_profiles.describe()

Unnamed: 0,Monthly_Income,Number_of_Dependents
count,500.0,500.0
mean,90987.978,2.464
std,34105.493937,1.715398
min,30083.0,0.0
25%,61914.25,1.0
50%,91367.5,2.0
75%,121456.5,4.0
max,149560.0,5.0


## Understanding the Claims Data

In [16]:
claims.head()

Unnamed: 0,Job,Region,Average_Claim_Amount,Claim_Frequency
0,Vendor,Lagos,31524.82,14
1,Vendor,Abuja,94389.23,18
2,Vendor,Port Harcourt,10656.56,20
3,Vendor,Ibadan,57659.9,18
4,Vendor,Kano,60866.61,17


In [17]:
claims.shape

(35, 4)

In [18]:
claims.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Job                   35 non-null     object 
 1   Region                35 non-null     object 
 2   Average_Claim_Amount  35 non-null     float64
 3   Claim_Frequency       35 non-null     int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 1.2+ KB


In [19]:
claims.describe()

Unnamed: 0,Average_Claim_Amount,Claim_Frequency
count,35.0,35.0
mean,61910.346857,11.885714
std,25647.017507,6.234493
min,10656.56,1.0
25%,43039.035,5.5
50%,63342.67,14.0
75%,81427.985,17.0
max,98157.45,20.0


In [20]:
# Build user_risk_profile by attaching claims statistics to each user, matched on Job and Region

user_risk_profile = pd.merge(user_profiles, claims, on=['Job', 'Region'], how='left')

# Calculate the affordability threshold for each user:
# a premium should cost at most 10% of monthly income.
user_risk_profile['Max_Affordable_Premium'] = user_risk_profile['Monthly_Income'] * 0.10

# Display the first 5 rows
print("--- Resulting User_Risk_Profile DataFrame ---")
print(user_risk_profile.head())
print("\n")
print(user_risk_profile.info())

--- Resulting User_Risk_Profile DataFrame ---
                                User_ID             Name        Job  \
0  8d223d85-eb1a-40f8-a3c8-1b3be116184c    Tyler Vaughan     Tailor   
1  c0c07501-5f27-4c6a-9646-196e1dc43294     Shaun Murphy     Driver   
2  6a1387d8-41f9-4311-8f8c-da9d41104124   Michael Garcia     Driver   
3  8c5441f4-69d3-4844-be93-1be3df327416  Whitney Allison     Vendor   
4  60fa690a-8921-4506-8690-5658292bb338   Cassandra Hall  POS Agent   

   Monthly_Income  Number_of_Dependents  Region  Average_Claim_Amount  \
0          129920                     5   Lagos              80801.66   
1           98536                     0    Kano              94185.38   
2          148701                     3  Ibadan              34566.61   
3           55491                     3   Abuja              94389.23   
4          136276                     3   Lagos              97461.15   

   Claim_Frequency  Max_Affordable_Premium  
0                5                 12992.0 

In [21]:
user_risk_profile.head()

Unnamed: 0,User_ID,Name,Job,Monthly_Income,Number_of_Dependents,Region,Average_Claim_Amount,Claim_Frequency,Max_Affordable_Premium
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,Tyler Vaughan,Tailor,129920,5,Lagos,80801.66,5,12992.0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,Shaun Murphy,Driver,98536,0,Kano,94185.38,14,9853.6
2,6a1387d8-41f9-4311-8f8c-da9d41104124,Michael Garcia,Driver,148701,3,Ibadan,34566.61,20,14870.1
3,8c5441f4-69d3-4844-be93-1be3df327416,Whitney Allison,Vendor,55491,3,Abuja,94389.23,18,5549.1
4,60fa690a-8921-4506-8690-5658292bb338,Cassandra Hall,POS Agent,136276,3,Lagos,97461.15,12,13627.6
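
A left merge on `Job` and `Region` silently leaves `NaN` in the claims columns for any user whose job/region pair is missing from `claims`. The sketch below, on toy data that is not taken from the project files, shows one way to make that visible with the `validate` and `indicator` options of `DataFrame.merge`. The frame names `users` and `claims_toy` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for user_profiles and claims (values are illustrative only)
users = pd.DataFrame({
    "User_ID": ["u1", "u2", "u3"],
    "Job": ["Tailor", "Driver", "Card Maker"],
    "Region": ["Lagos", "Kano", "Lagos"],
    "Monthly_Income": [129920, 98536, 30000],
})
claims_toy = pd.DataFrame({
    "Job": ["Tailor", "Driver"],
    "Region": ["Lagos", "Kano"],
    "Claim_Frequency": [5, 14],
})

merged = users.merge(
    claims_toy,
    on=["Job", "Region"],
    how="left",
    validate="many_to_one",  # raises MergeError if claims has duplicate (Job, Region) keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Users that found no matching claims row
unmatched = merged[merged["_merge"] == "left_only"]

# Same 10% affordability rule as in the notebook
merged["Max_Affordable_Premium"] = merged["Monthly_Income"] * 0.10

print(len(merged) == len(users))      # a many-to-one left merge keeps the row count
print(unmatched["User_ID"].tolist())  # users without claims statistics
```

Checking `unmatched` before modelling is a cheap guard: later integer casts such as `astype(int)` fail on `NaN`, so it helps to know early whether every user found a claims row.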


In [22]:
final = pd.merge(user_behaviour, user_risk_profile, on=['User_ID'], how='left')

In [23]:
final.head()

Unnamed: 0,User_ID,Browsed_Products,Rejected_Product,Time_Spent_on_Platform_Min,Name,Job,Monthly_Income,Number_of_Dependents,Region,Average_Claim_Amount,Claim_Frequency,Max_Affordable_Premium
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,['P001'],,27.75,Tyler Vaughan,Tailor,129920,5,Lagos,80801.66,5,12992.0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,"['P003', 'P004', 'P001']",,17.12,Shaun Murphy,Driver,98536,0,Kano,94185.38,14,9853.6
2,6a1387d8-41f9-4311-8f8c-da9d41104124,"['P001', 'P003', 'P004']",,5.95,Michael Garcia,Driver,148701,3,Ibadan,34566.61,20,14870.1
3,8c5441f4-69d3-4844-be93-1be3df327416,"['P004', 'P002', 'P005']",P002,4.52,Whitney Allison,Vendor,55491,3,Abuja,94389.23,18,5549.1
4,60fa690a-8921-4506-8690-5658292bb338,"['P003', 'P001', 'P002']",,4.21,Cassandra Hall,POS Agent,136276,3,Lagos,97461.15,12,13627.6


In [25]:
import ast

In [26]:
final['Browsed_Products'] = final['Browsed_Products'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
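
`Browsed_Products` arrives from Excel as text such as `"['P003', 'P004']"`, not as a Python list, which is why it is parsed with `ast.literal_eval`. A minimal sketch of the same idea as a named helper follows; `parse_product_list` is a hypothetical name, not part of the notebook.

```python
import ast

def parse_product_list(value):
    """Turn a stringified list such as "['P003', 'P004']" into a real list."""
    if not isinstance(value, str):
        return []  # NaN, None and other non-string cells become an empty list
    # literal_eval only evaluates Python literals, never arbitrary expressions
    parsed = ast.literal_eval(value)
    return list(parsed)

print(parse_product_list("['P003', 'P004', 'P001']"))
print(parse_product_list(float("nan")))
```

Note that `ast.literal_eval` raises `ValueError` or `SyntaxError` on malformed text, so a production version would probably want to catch those and decide how to treat the row.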

In [27]:
final

Unnamed: 0,User_ID,Browsed_Products,Rejected_Product,Time_Spent_on_Platform_Min,Name,Job,Monthly_Income,Number_of_Dependents,Region,Average_Claim_Amount,Claim_Frequency,Max_Affordable_Premium
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,[P001],,27.75,Tyler Vaughan,Tailor,129920,5,Lagos,80801.66,5,12992.0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,"[P003, P004, P001]",,17.12,Shaun Murphy,Driver,98536,0,Kano,94185.38,14,9853.6
2,6a1387d8-41f9-4311-8f8c-da9d41104124,"[P001, P003, P004]",,5.95,Michael Garcia,Driver,148701,3,Ibadan,34566.61,20,14870.1
3,8c5441f4-69d3-4844-be93-1be3df327416,"[P004, P002, P005]",P002,4.52,Whitney Allison,Vendor,55491,3,Abuja,94389.23,18,5549.1
4,60fa690a-8921-4506-8690-5658292bb338,"[P003, P001, P002]",,4.21,Cassandra Hall,POS Agent,136276,3,Lagos,97461.15,12,13627.6
...,...,...,...,...,...,...,...,...,...,...,...,...
495,06e9fa6c-a8e0-4152-a6be-efbe25f074f5,[P005],,1.37,Stephanie Reyes DVM,Mechanic,72618,0,Kano,23762.49,13,7261.8
496,6885db2c-0adf-4c51-a63f-742c5b151b7f,"[P005, P003, P004]",P005,14.13,Carolyn Montoya,Mechanic,68326,5,Ibadan,22735.15,4,6832.6
497,f219a33e-adba-447b-9799-dfee7aadedf1,"[P005, P002]",P005,24.92,Taylor Summers,Hairdresser,42722,3,Ibadan,47311.25,5,4272.2
498,99e8fc86-587a-4bd2-9813-86f3f7e6f30f,"[P001, P002]",,1.47,Mary Brown,POS Agent,32189,1,Ibadan,55030.95,18,3218.9


In [28]:
# Replace NaN in Rejected_Product with 0, used here as a "no rejection" sentinel
final['Rejected_Product'] = final['Rejected_Product'].fillna(value=0)

# Fill other missing values if necessary
final.fillna({'Time_Spent_on_Platform_Min': 0}, inplace=True)
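
Filling missing rejections with `0` changes what later null checks see: once a sentinel is in place, `pd.notna` reports every row as populated. The toy example below (made-up values) illustrates the difference and shows a sentinel-aware test instead.

```python
import numpy as np
import pandas as pd

rejected = pd.Series(["P002", np.nan, "P005"])
filled = rejected.fillna(0)  # 0 acts as the "no rejection" sentinel

# Before filling, notna() separates real rejections from missing ones
print(pd.notna(rejected).tolist())

# After filling, every value counts as "not null"
print(pd.notna(filled).tolist())

# A sentinel-aware check has to compare against the sentinel itself
has_rejection = filled != 0
print(has_rejection.tolist())
```

Any code that runs after the fill and still relies on `pd.notna(row['Rejected_Product'])` will therefore treat the `0` sentinel as if it were a product ID.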

In [29]:
# Convert numeric columns to appropriate types
final['Monthly_Income'] = final['Monthly_Income'].astype(float)
final['Number_of_Dependents'] = final['Number_of_Dependents'].astype(int)
final['Max_Affordable_Premium'] = final['Max_Affordable_Premium'].astype(float)
final['Claim_Frequency'] = final['Claim_Frequency'].astype(int)
final['Average_Claim_Amount'] = final['Average_Claim_Amount'].astype(float)

In [30]:
final

Unnamed: 0,User_ID,Browsed_Products,Rejected_Product,Time_Spent_on_Platform_Min,Name,Job,Monthly_Income,Number_of_Dependents,Region,Average_Claim_Amount,Claim_Frequency,Max_Affordable_Premium
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,[P001],0,27.75,Tyler Vaughan,Tailor,129920.0,5,Lagos,80801.66,5,12992.0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,"[P003, P004, P001]",0,17.12,Shaun Murphy,Driver,98536.0,0,Kano,94185.38,14,9853.6
2,6a1387d8-41f9-4311-8f8c-da9d41104124,"[P001, P003, P004]",0,5.95,Michael Garcia,Driver,148701.0,3,Ibadan,34566.61,20,14870.1
3,8c5441f4-69d3-4844-be93-1be3df327416,"[P004, P002, P005]",P002,4.52,Whitney Allison,Vendor,55491.0,3,Abuja,94389.23,18,5549.1
4,60fa690a-8921-4506-8690-5658292bb338,"[P003, P001, P002]",0,4.21,Cassandra Hall,POS Agent,136276.0,3,Lagos,97461.15,12,13627.6
...,...,...,...,...,...,...,...,...,...,...,...,...
495,06e9fa6c-a8e0-4152-a6be-efbe25f074f5,[P005],0,1.37,Stephanie Reyes DVM,Mechanic,72618.0,0,Kano,23762.49,13,7261.8
496,6885db2c-0adf-4c51-a63f-742c5b151b7f,"[P005, P003, P004]",P005,14.13,Carolyn Montoya,Mechanic,68326.0,5,Ibadan,22735.15,4,6832.6
497,f219a33e-adba-447b-9799-dfee7aadedf1,"[P005, P002]",P005,24.92,Taylor Summers,Hairdresser,42722.0,3,Ibadan,47311.25,5,4272.2
498,99e8fc86-587a-4bd2-9813-86f3f7e6f30f,"[P001, P002]",0,1.47,Mary Brown,POS Agent,32189.0,1,Ibadan,55030.95,18,3218.9


In [31]:
p_id = product_df.Product_ID

In [32]:
# Start from an all-zero user x product matrix (rows: users, columns: product IDs)
interaction_matrix = pd.DataFrame(0, index=final['User_ID'], columns=p_id)

In [33]:
interaction_matrix

Product_ID,P001,P002,P003,P004,P005
User_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
8d223d85-eb1a-40f8-a3c8-1b3be116184c,0,0,0,0,0
c0c07501-5f27-4c6a-9646-196e1dc43294,0,0,0,0,0
6a1387d8-41f9-4311-8f8c-da9d41104124,0,0,0,0,0
8c5441f4-69d3-4844-be93-1be3df327416,0,0,0,0,0
60fa690a-8921-4506-8690-5658292bb338,0,0,0,0,0
...,...,...,...,...,...
06e9fa6c-a8e0-4152-a6be-efbe25f074f5,0,0,0,0,0
6885db2c-0adf-4c51-a63f-742c5b151b7f,0,0,0,0,0
f219a33e-adba-447b-9799-dfee7aadedf1,0,0,0,0,0
99e8fc86-587a-4bd2-9813-86f3f7e6f30f,0,0,0,0,0


In [None]:
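# Populate the matrix: 1 = browsed, -1 = rejected (a rejection overrides a browse)
for idx, row in final.iterrows():
    for product in row['Browsed_Products']:
        interaction_matrix.loc[row['User_ID'], product] = 1
    # Rejected_Product NaNs were filled with 0 earlier, so pd.notna() is always True here;
    # users with no rejection therefore write -1 into a spurious column labelled 0.
    if pd.notna(row['Rejected_Product']):
        interaction_matrix.loc[row['User_ID'], row['Rejected_Product']] = -1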

In [None]:
# Drop the spurious column 0 created by the "no rejection" sentinel
interaction_matrix.drop(columns=0, inplace=True)

In [None]:
interaction_matrix.reset_index()

Product_ID,User_ID,P001,P002,P003,P004,P005
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,1,0,0,0,0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,1,0,1,1,0
2,6a1387d8-41f9-4311-8f8c-da9d41104124,1,0,1,1,0
3,8c5441f4-69d3-4844-be93-1be3df327416,0,-1,0,1,1
4,60fa690a-8921-4506-8690-5658292bb338,1,1,1,0,0
...,...,...,...,...,...,...
495,06e9fa6c-a8e0-4152-a6be-efbe25f074f5,0,0,0,0,1
496,6885db2c-0adf-4c51-a63f-742c5b151b7f,0,0,1,1,-1
497,f219a33e-adba-447b-9799-dfee7aadedf1,0,1,0,0,-1
498,99e8fc86-587a-4bd2-9813-86f3f7e6f30f,1,1,0,0,0
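
The same 1 / 0 / -1 interaction matrix can also be built without a row-by-row `iterrows` loop. The sketch below is one possible alternative on toy data, assuming missing rejections are kept as `None` rather than a `0` sentinel; the frame `toy` and its values are illustrative only.

```python
import pandas as pd

# Toy data; a missing rejection is kept as None rather than a 0 sentinel
toy = pd.DataFrame({
    "User_ID": ["u1", "u2", "u3"],
    "Browsed_Products": [["P001"], ["P004", "P002", "P005"], ["P005", "P003"]],
    "Rejected_Product": [None, "P002", "P005"],
})
product_ids = ["P001", "P002", "P003", "P004", "P005"]

# One row per (user, browsed product), then a 0/1 table of browses
browsed = toy[["User_ID", "Browsed_Products"]].explode("Browsed_Products")
matrix = (
    pd.crosstab(browsed["User_ID"], browsed["Browsed_Products"])
    .clip(upper=1)  # repeated browses of the same product still count as 1
    .reindex(index=toy["User_ID"], columns=product_ids, fill_value=0)
)

# Overwrite with -1 only where a rejection was actually recorded
rejections = toy.dropna(subset=["Rejected_Product"])
for user_id, product_id in zip(rejections["User_ID"], rejections["Rejected_Product"]):
    matrix.loc[user_id, product_id] = -1

print(matrix)
```

Because only real rejections are written, no extra column appears and nothing has to be dropped afterwards. The `reindex` call also guarantees a column for every product, including ones nobody browsed.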


In [None]:
final

Unnamed: 0,User_ID,Browsed_Products,Rejected_Product,Time_Spent_on_Platform_Min,Name,Job,Monthly_Income,Number_of_Dependents,Region,Average_Claim_Amount,Claim_Frequency,Max_Affordable_Premium
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,[P001],0,27.75,Tyler Vaughan,Tailor,129920.0,5,Lagos,80801.66,5,12992.0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,"[P003, P004, P001]",0,17.12,Shaun Murphy,Driver,98536.0,0,Kano,94185.38,14,9853.6
2,6a1387d8-41f9-4311-8f8c-da9d41104124,"[P001, P003, P004]",0,5.95,Michael Garcia,Driver,148701.0,3,Ibadan,34566.61,20,14870.1
3,8c5441f4-69d3-4844-be93-1be3df327416,"[P004, P002, P005]",P002,4.52,Whitney Allison,Vendor,55491.0,3,Abuja,94389.23,18,5549.1
4,60fa690a-8921-4506-8690-5658292bb338,"[P003, P001, P002]",0,4.21,Cassandra Hall,POS Agent,136276.0,3,Lagos,97461.15,12,13627.6
...,...,...,...,...,...,...,...,...,...,...,...,...
495,06e9fa6c-a8e0-4152-a6be-efbe25f074f5,[P005],0,1.37,Stephanie Reyes DVM,Mechanic,72618.0,0,Kano,23762.49,13,7261.8
496,6885db2c-0adf-4c51-a63f-742c5b151b7f,"[P005, P003, P004]",P005,14.13,Carolyn Montoya,Mechanic,68326.0,5,Ibadan,22735.15,4,6832.6
497,f219a33e-adba-447b-9799-dfee7aadedf1,"[P005, P002]",P005,24.92,Taylor Summers,Hairdresser,42722.0,3,Ibadan,47311.25,5,4272.2
498,99e8fc86-587a-4bd2-9813-86f3f7e6f30f,"[P001, P002]",0,1.47,Mary Brown,POS Agent,32189.0,1,Ibadan,55030.95,18,3218.9


In [None]:
profile_df = final[[
    'User_ID',
    'Job',
    'Monthly_Income',
    'Number_of_Dependents',
    'Region',
    'Max_Affordable_Premium'
]].copy()

In [None]:
from sklearn.preprocessing import LabelEncoder

le_job = LabelEncoder()
le_region = LabelEncoder()

le_job.fit(profile_df['Job'])
profile_df['Job_Encoded'] = le_job.transform(profile_df['Job'])

le_region.fit(profile_df['Region'])
profile_df['Region_Encoded'] = le_region.transform(profile_df['Region'])

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerical_cols = ['Monthly_Income', 'Number_of_Dependents', 'Max_Affordable_Premium']

profile_df[numerical_cols] = scaler.fit_transform(profile_df[numerical_cols])

In [None]:
features = profile_df[[
    'Job_Encoded',
    'Region_Encoded',
    'Monthly_Income',
    'Number_of_Dependents',
    'Max_Affordable_Premium'
]]

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42)
profile_df['Cluster'] = kmeans.fit_predict(features)

In [None]:
# Merge profile_df with interaction_matrix on User_ID
merged_df = profile_df[['User_ID', 'Cluster']].merge(interaction_matrix, on='User_ID')

In [None]:
cluster_product_popularity = merged_df.groupby('Cluster')[['P001', 'P002', 'P003', 'P004', 'P005']].sum()

In [None]:
# Merge profile features with interaction matrix
full_df = profile_df.merge(interaction_matrix, on='User_ID')

In [None]:
from sklearn.preprocessing import LabelEncoder

le_job = LabelEncoder()
le_region = LabelEncoder()

full_df['Job_Encoded'] = le_job.fit_transform(full_df['Job'])
full_df['Region_Encoded'] = le_region.fit_transform(full_df['Region'])

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerical_cols = ['Monthly_Income', 'Number_of_Dependents','Max_Affordable_Premium']
full_df[numerical_cols] = scaler.fit_transform(full_df[numerical_cols])

In [None]:
full_df.head()

Unnamed: 0,User_ID,Job,Monthly_Income,Number_of_Dependents,Region,Max_Affordable_Premium,Job_Encoded,Region_Encoded,Cluster,P001,P002,P003,P004,P005
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,Tailor,1.142661,1.479855,Lagos,1.142661,5,3,3,1,0,0,0,0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,Driver,0.221536,-1.43784,Kano,0.221536,0,2,2,1,0,1,1,0
2,6a1387d8-41f9-4311-8f8c-da9d41104124,Driver,1.693886,0.312777,Ibadan,1.693886,0,1,2,1,0,1,1,0
3,8c5441f4-69d3-4844-be93-1be3df327416,Vendor,-1.041842,0.312777,Abuja,-1.041842,6,0,1,0,-1,0,1,1
4,60fa690a-8921-4506-8690-5658292bb338,POS Agent,1.32921,0.312777,Lagos,1.32921,4,3,3,1,1,1,0,0
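
As a sanity check on the scaled columns, the z-score of the first user's income can be recomputed by hand from the `user_profiles.describe()` figures. The only subtlety is that `describe()` reports the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`). This is a worked example, not part of the pipeline.

```python
import math

n = 500                    # rows in user_profiles
mean_income = 90987.978    # mean from user_profiles.describe()
sample_std = 34105.493937  # std from describe(), computed with ddof=1

# Convert the sample std to the population std that StandardScaler uses
population_std = sample_std * math.sqrt((n - 1) / n)

# z-score for the first user's Monthly_Income of 129920
z = (129920 - mean_income) / population_std
print(round(z, 4))
```

The result should land very close to the `1.142661` shown for that user. Since `Max_Affordable_Premium` is exactly 10% of `Monthly_Income`, its scaled values are identical to the scaled income, so the two columns carry the same information into the clustering step.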


In [None]:
full_df

Unnamed: 0,User_ID,Job,Monthly_Income,Number_of_Dependents,Region,Max_Affordable_Premium,Job_Encoded,Region_Encoded,Cluster,P001,P002,P003,P004,P005
0,8d223d85-eb1a-40f8-a3c8-1b3be116184c,Tailor,1.142661,1.479855,Lagos,1.142661,5,3,3,1,0,0,0,0
1,c0c07501-5f27-4c6a-9646-196e1dc43294,Driver,0.221536,-1.437840,Kano,0.221536,0,2,2,1,0,1,1,0
2,6a1387d8-41f9-4311-8f8c-da9d41104124,Driver,1.693886,0.312777,Ibadan,1.693886,0,1,2,1,0,1,1,0
3,8c5441f4-69d3-4844-be93-1be3df327416,Vendor,-1.041842,0.312777,Abuja,-1.041842,6,0,1,0,-1,0,1,1
4,60fa690a-8921-4506-8690-5658292bb338,POS Agent,1.329210,0.312777,Lagos,1.329210,4,3,3,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,06e9fa6c-a8e0-4152-a6be-efbe25f074f5,Mechanic,-0.539162,-1.437840,Kano,-0.539162,3,2,0,0,0,0,0,1
496,6885db2c-0adf-4c51-a63f-742c5b151b7f,Mechanic,-0.665133,1.479855,Ibadan,-0.665133,3,1,0,0,0,1,1,-1
497,f219a33e-adba-447b-9799-dfee7aadedf1,Hairdresser,-1.416614,0.312777,Ibadan,-1.416614,2,1,0,0,1,0,0,-1
498,99e8fc86-587a-4bd2-9813-86f3f7e6f30f,POS Agent,-1.725759,-0.854301,Ibadan,-1.725759,4,1,1,1,1,0,0,0


In [None]:
# Drop the profile-only cluster labels before re-clustering on profile + interaction features
full_df.drop(columns='Cluster', inplace=True)

In [None]:
feature_cols = [
    'Job_Encoded', 'Region_Encoded',
    'Monthly_Income', 'Number_of_Dependents',
     'Max_Affordable_Premium',
    'P001', 'P002', 'P003', 'P004', 'P005'
]

X = full_df[feature_cols]

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=6, random_state=42)
full_df['Cluster'] = kmeans.fit_predict(X)

In [None]:
cluster_product_map = full_df.groupby('Cluster')[['P001', 'P002', 'P003', 'P004', 'P005']].mean()
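
Averaging the 1 / 0 / -1 interaction values per cluster gives a net preference score for each product: browses push it up, rejections pull it down. A toy illustration of that group-by and of reading off a cluster's top products follows (values are made up, with three products instead of five).

```python
import pandas as pd

toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1],
    "P001": [1, 1, 0, 0],
    "P002": [0, -1, 1, 1],
    "P003": [1, 0, 0, 1],
})

# Mean interaction per product within each cluster
cluster_map = toy.groupby("Cluster")[["P001", "P002", "P003"]].mean()
print(cluster_map)

# Highest-scoring products for cluster 1
top_two = cluster_map.loc[1].sort_values(ascending=False).head(2).index.tolist()
print(top_two)
```

In cluster 0 the single rejection drags `P002` down to -0.5, which is the kind of signal a plain browse count would hide.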

In [None]:
new_user = {
    'Job': 'Card Maker',
    'Region': 'Lagos',
    'Monthly_Income': 30000,
    'Number_of_Dependents': 2,
    'Max_Affordable_Premium': 3000
}

In [None]:
new_user['Job_Encoded'] = le_job.transform([new_user['Job']])[0]
new_user['Region_Encoded'] = le_region.transform([new_user['Region']])[0]

# Normalize numerical features
new_user_scaled = scaler.transform([[new_user['Monthly_Income'], new_user['Number_of_Dependents'],
                                      new_user['Max_Affordable_Premium']]])

ValueError: y contains previously unseen labels: 'Card Maker'
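
The error arises because `LabelEncoder.transform` rejects any label it did not see during `fit`. One common workaround, sketched here in plain Python, is to look labels up in a dictionary and fall back to a sentinel. The sorted list stands in for `le_job.classes_` (a fitted `LabelEncoder` stores the sorted unique training labels), and the `UNSEEN = -1` sentinel is an assumption for illustration.

```python
# Stand-in for le_job.classes_; job titles taken from the job_keywords mapping
known_jobs = sorted({
    "Tailor", "Driver", "Vendor", "POS Agent",
    "Mechanic", "Hairdresser", "Electrician",
})
job_to_code = {job: code for code, job in enumerate(known_jobs)}

UNSEEN = -1  # hypothetical sentinel for labels absent from the training data

def encode_job(job):
    """Return the integer code for a known job, or UNSEEN otherwise."""
    return job_to_code.get(job, UNSEEN)

print(encode_job("Tailor"))
print(encode_job("Card Maker"))
```

A sentinel avoids the exception, but it is still fed to KMeans as an ordinary number, so an unseen job is simply treated as one unit below the first encoded class. Whether that is acceptable depends on how much the job feature should influence the cluster assignment.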

In [None]:
new_vector = [new_user['Job_Encoded'], new_user['Region_Encoded']] + list(new_user_scaled[0]) + [0, 0, 0, 0, 0]

KeyError: 'Job_Encoded'

In [None]:
cluster_id = kmeans.predict([new_vector])[0]

NameError: name 'new_vector' is not defined

In [None]:
recommended_products = cluster_product_map.loc[cluster_id].sort_values(ascending=False).head(3).index.tolist()

NameError: name 'cluster_id' is not defined

In [None]:
recommended_products

NameError: name 'recommended_products' is not defined

In [None]:
cluster_product_map.loc[cluster_id].sort_values(ascending=False)

NameError: name 'cluster_id' is not defined

In [None]:
def score_product(user, product):
    score = 0
    reasons = []

    # Affordability check: the premium is at most 30% of monthly income
    if product['Monthly_Premium'] <= user['Monthly_Income'] * 0.3:
        score += 1
        reasons.append("Affordable based on your income")

    # Premium within budget
    if product['Monthly_Premium'] <= user['Max_Affordable_Premium']:
        score += 1
        reasons.append("Fits within your premium budget")

    # Job relevance mapping
    job_keywords = {
        'Tailor': ['Accident', 'Health'],
        'Driver': ['Accident', 'Life', 'Health'],
        'Vendor': ['Burglary', 'Fire', 'Health'],
        'POS Agent': ['Device', 'Health'],
        'Mechanic': ['Accident', 'Device'],
        'Hairdresser': ['Health', 'Life'],
        'Electrician': ['Accident', 'Fire', 'Device'],
        'Card Maker' : ['Health', 'Life']
    }

    relevant_keywords = job_keywords.get(user['Job'], [])
    if any(keyword.lower() in product['Product_Name'].lower() for keyword in relevant_keywords):
        score += 1
        reasons.append(f"Relevant to your job as a {user['Job']}")

    return score, reasons

In [None]:
user_profile = {
    'Job': 'Driver',
    'Monthly_Income': 30000,
    'Max_Affordable_Premium': 3000,
    'Region': 'Lagos'
}

# Loop through product DataFrame
recommendations = []
for _, product in product_df.iterrows():
    score, reasons = score_product(user_profile, product)
    recommendations.append({
        'Product_ID': product['Product_ID'],
        'Product_Name': product['Product_Name'],
        'Score': score,
        'Reasons': reasons
    })

# Sort and display top recommendations
recommendations = sorted(recommendations, key=lambda x: x['Score'], reverse=True)

In [None]:
recommendations

[{'Product_ID': 'P001',
  'Product_Name': 'Personal Accident Cover',
  'Score': 3,
  'Reasons': ['Affordable based on your income',
   'Fits within your premium budget',
   'Relevant to your job as a Driver']},
 {'Product_ID': 'P003',
  'Product_Name': 'Health Micro Plan',
  'Score': 3,
  'Reasons': ['Affordable based on your income',
   'Fits within your premium budget',
   'Relevant to your job as a Driver']},
 {'Product_ID': 'P005',
  'Product_Name': 'Life Starter Plan',
  'Score': 3,
  'Reasons': ['Affordable based on your income',
   'Fits within your premium budget',
   'Relevant to your job as a Driver']},
 {'Product_ID': 'P002',
  'Product_Name': 'Fire & Burglary Insurance',
  'Score': 2,
  'Reasons': ['Affordable based on your income',
   'Fits within your premium budget']},
 {'Product_ID': 'P004',
  'Product_Name': 'Device Protection Plan',
  'Score': 2,
  'Reasons': ['Affordable based on your income',
   'Fits within your premium budget']}]

In [None]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

def recommend_products_for_user(user_profile, product_df, kmeans_model, scaler, le_job, le_region, cluster_product_map, job_keywords):
    """
    Recommend products for a new user based on profile and product features.

    Parameters:
    - user_profile: dict with keys 'Job', 'Region', 'Monthly_Income', 'Number_of_Dependents', 'Max_Affordable_Premium'
    - product_df: DataFrame with product features
    - kmeans_model: trained KMeans model from joint clustering
    - scaler: fitted StandardScaler for numerical features
    - le_job, le_region: fitted LabelEncoders for categorical features
    - cluster_product_map: DataFrame with product scores per cluster
    - job_keywords: Dictionary mapping job titles to relevant product keywords

    Returns:
    - List of recommended products with percentage scores and explanations
    """
    max_score = 4  # Denominator for the percentage score; only three criteria are scored below, so scores top out at 75%

    # Create manual mappings for job and region based on the fitted encoders' classes
    job_mapping = {cls: idx for idx, cls in enumerate(le_job.classes_)}
    region_mapping = {cls: idx for idx, cls in enumerate(le_region.classes_)}

    # Get encoded job and region, using -1 for unseen values
    job_encoded = job_mapping.get(user_profile['Job'], -1)
    region_encoded = region_mapping.get(user_profile['Region'], -1)


    # Scale numerical features
    # Create a DataFrame with feature names for scaling
    numerical_data = pd.DataFrame([[user_profile['Monthly_Income'],
                                    user_profile['Number_of_Dependents'],
                                    user_profile['Max_Affordable_Premium']]],
                                  columns=['Monthly_Income', 'Number_of_Dependents', 'Max_Affordable_Premium'])
    scaled_values = scaler.transform(numerical_data)[0]

    # Create full feature vector as a DataFrame with feature names
    feature_cols = [
        'Job_Encoded', 'Region_Encoded',
        'Monthly_Income', 'Number_of_Dependents',
         'Max_Affordable_Premium',
        'P001', 'P002', 'P003', 'P004', 'P005'
    ]
    user_data_df = pd.DataFrame([[job_encoded, region_encoded] + list(scaled_values) + [0, 0, 0, 0, 0]],
                                columns=feature_cols)


    # Predict cluster
    cluster_id = kmeans_model.predict(user_data_df)[0]

    # Rank every product by its mean interaction score within the predicted cluster
    top_products = cluster_product_map.loc[cluster_id].sort_values(ascending=False).index.tolist()

    # Job relevance mapping is now passed as an argument
    relevant_keywords = job_keywords.get(user_profile['Job'], [])

    # Score and explain each product
    recommendations = []
    for _, product in product_df.iterrows():
        # Filter out products not relevant to the user's job
        if not any(keyword.lower() in product['Product_Name'].lower() for keyword in relevant_keywords):
            continue  # Skip irrelevant products

        score = 0
        reasons = []

        # Affordability
        if product['Monthly_Premium'] <= user_profile['Monthly_Income'] * 0.1:
            score += 1
            reasons.append("Affordable based on your income")

        if product['Monthly_Premium'] <= user_profile['Max_Affordable_Premium']:
            score += 1
            reasons.append("Fits within your premium budget")

        # Job relevance (already passed filter)
        score += 1
        reasons.append(f"Relevant to your job as a {user_profile['Job']}")


        # Keep products that appear in the cluster ranking
        # (top_products holds every product ID, so this check currently always passes)
        if product['Product_ID'] in top_products:
            recommendations.append({
                'Product_ID': product['Product_ID'],
                'Product_Name': product['Product_Name'],
                'Score': int(round((score / max_score) * 100, 0)),  # Percentage score with 0 decimals
                'Monthly_Premium': product['Monthly_Premium'],
                'Reasons': reasons
            })

    # Sort by score
    recommendations = sorted(recommendations, key=lambda x: x['Score'], reverse=True)

    # Bundle logic: if exactly two products are recommended
    if len(recommendations) == 2:
        premium_1 = recommendations[0]['Monthly_Premium']
        premium_2 = recommendations[1]['Monthly_Premium']
        # Bundle price: the first product at full premium plus 20% of the second
        bundle_premium = round(premium_1 * 1.0 + premium_2 * 0.2, 2)

        # Check affordability
        if bundle_premium <= user_profile['Max_Affordable_Premium']:
            bundle_name = f"{recommendations[0]['Product_Name']} + {recommendations[1]['Product_Name']} Bundle"
            bundle_score = int(round((recommendations[0]['Score'] + recommendations[1]['Score']) / 2, 0))
            bundle_reasons = list(set(recommendations[0]['Reasons'] + recommendations[1]['Reasons']))
            bundle_id = f"{recommendations[0]['Product_ID']}_{recommendations[1]['Product_ID']}"

            bundled_product = {
                'Product_ID': bundle_id,
                'Product_Name': bundle_name,
                'Score': bundle_score,
                'Monthly_Premium': bundle_premium,
                'Reasons': bundle_reasons
            }

            recommendations = [bundled_product]
        else:
            # Bundle not affordable—return only the first product
            recommendations = [recommendations[0]]


    return recommendations
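
To make the bundle rule concrete, here is the pricing arithmetic on its own, using the catalogue premiums of the Health Micro Plan (1500) and the Life Starter Plan (2000) and a budget of 3000. The helper name `bundle_premium` and its `second_rate` parameter are hypothetical and only mirror the expression used in the function above.

```python
def bundle_premium(first_premium, second_premium, second_rate=0.2):
    """Full price for the first product plus a fraction of the second."""
    return round(first_premium * 1.0 + second_premium * second_rate, 2)

health, life = 1500, 2000  # Monthly_Premium values from the product catalogue
budget = 3000              # Max_Affordable_Premium of the example user

price = bundle_premium(health, life)
fits_budget = price <= budget
print(price, fits_budget)
```

With these numbers the bundle costs 1500 + 0.2 × 2000 = 1900, which is within the 3000 budget, so the two products are replaced by a single bundled recommendation.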

In [None]:
user_profile = {
    'Job': 'Card Maker',
    'Monthly_Income': 35000,
    'Max_Affordable_Premium': 3000,
    'Number_of_Dependents': 2,
    'Region': 'Osun'
}

recommended = recommend_products_for_user(user_profile, product_df, kmeans, scaler, le_job, le_region, cluster_product_map)

for rec in recommended:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")


✅ Health Micro Plan (Score: 75%)
 - Affordable based on your income
 - Fits within your premium budget
 - Relevant to your job as a Card Maker

✅ Life Starter Plan (Score: 75%)
 - Affordable based on your income
 - Fits within your premium budget
 - Relevant to your job as a Card Maker




In [None]:
import joblib

# Save LabelEncoders
joblib.dump(le_job, 'le_job.pkl')
joblib.dump(le_region, 'le_region.pkl')

# Save StandardScaler
joblib.dump(scaler, 'scaler.pkl')

# Save KMeans model
joblib.dump(kmeans, 'kmeans_model.pkl')

# Save cluster_product_map (it's a DataFrame)
joblib.dump(cluster_product_map, 'cluster_product_map.pkl')

# Save interaction_matrix if needed
joblib.dump(interaction_matrix, 'interaction_matrix.pkl')


['interaction_matrix.pkl']
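
Saved artifacts are restored with `joblib.load`. The round trip below uses a temporary directory and a plain dictionary as a stand-in object, so it does not touch the `.pkl` files written above.

```python
import os
import tempfile

import joblib

artifact = {"n_clusters": 6, "products": ["P001", "P002"]}  # stand-in object

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "artifact.pkl")
    joblib.dump(artifact, path)    # returns the list of file names written
    restored = joblib.load(path)   # rebuilds the original Python object

print(restored == artifact)
```

The fitted encoders, scaler and KMeans model can be reloaded the same way, for example `kmeans = joblib.load('kmeans_model.pkl')`. It is generally safest to load them with the same scikit-learn version that saved them.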

In [None]:
user_profile = {
    'Job': 'Card Maker',
    'Monthly_Income': 35000,
    'Max_Affordable_Premium': 3000,
    'Number_of_Dependents': 2,
    'Region': 'Lagos'
}

recommended = recommend_products_for_user(user_profile, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)

for rec in recommended:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")


✅ Health Micro Plan + Life Starter Plan Bundle (Score: 75%)
 - Affordable based on your income
 - Relevant to your job as a Card Maker
 - Fits within your premium budget





## Install fuzzy matching library

### Subtask:
Install a library like `fuzzywuzzy` or `rapidfuzz` to perform fuzzy string matching.


**Reasoning**:
Fuzzy string matching needs the `fuzzywuzzy` library; the optional `python-Levenshtein` package speeds up its comparisons. Both are installed with pip.



In [None]:
!pip install fuzzywuzzy python-Levenshtein

Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-Levenshtein
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-Levenshtein)
  Downloading rapidfuzz-3.14.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (12 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Downloading python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.27.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (159 kB)
Downloading rapidfuzz-3.14.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (3.3 MB)

## Create a function for fuzzy matching

### Subtask:
Define a Python function that takes a user-provided job string and a list of known job titles as input.


**Reasoning**:
Define a function for fuzzy matching and import necessary fuzzy matching modules within the function as requested by the instructions.



In [None]:
def find_closest_job(user_job, known_jobs):
    """
    Finds the closest job title in a list of known jobs using fuzzy matching.

    Args:
        user_job: The job title provided by the user (string).
        known_jobs: A list of known job titles from the job_keywords dictionary (list of strings).

    Returns:
        The closest matching job title from known_jobs, or None if no good match is found.
    """
    from fuzzywuzzy import fuzz
    pass  # Placeholder: the matching logic is filled in at the next step

## Define a similarity threshold

### Subtask:
Determine a suitable threshold for the similarity score to consider a match valid.


**Reasoning**:
Implement the similarity threshold check within the `find_closest_job` function as instructed.
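
To illustrate how a threshold separates usable matches from weak ones, here is a minimal sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer. This is an assumption made so the example runs without extra installs; it is not `fuzz.token_set_ratio`, and its scores will differ. The job list and the helper name `closest_with_threshold` are hypothetical.

```python
from difflib import SequenceMatcher

def closest_with_threshold(user_job, known_jobs, threshold=75):
    """Return (best_match, score), or (None, score) if the best score is below threshold."""
    best_match, best_score = None, -1.0
    for known_job in known_jobs:
        # ratio() is in [0, 1]; scale to 0-100 so it is comparable to the threshold
        score = 100 * SequenceMatcher(None, user_job.lower(), known_job.lower()).ratio()
        if score > best_score:
            best_match, best_score = known_job, score
    if best_score < threshold:
        return None, best_score
    return best_match, best_score

known_jobs = ["Dry Cleaner", "Gardener", "Mechanic"]  # illustrative list only
for job in ["Dry Cleanr", "Grdener", "Astronaut"]:
    match, score = closest_with_threshold(job, known_jobs)
    print(f"{job!r} -> {match!r} (score {score:.0f})")
```

Raising `threshold` makes matching stricter; lowering it tolerates more typos but increases the risk of pairing unrelated job titles.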



In [None]:
from fuzzywuzzy import fuzz

def find_closest_job(user_job, known_jobs):
    """
    Finds the closest job title in a list of known jobs using fuzzy matching.

    Args:
        user_job: The job title provided by the user (string).
        known_jobs: A list of known job titles from the job_keywords dictionary (list of strings).

    Returns:
        The closest matching job title from known_jobs, or None if no good match is found.
    """
    best_match = None
    highest_score = -1

    # Define the similarity threshold
    similarity_threshold = 75 # This threshold can be adjusted based on desired strictness

    for known_job in known_jobs:
        score = fuzz.token_set_ratio(user_job.lower(), known_job.lower())
        if score > highest_score:
            highest_score = score
            best_match = known_job

    # Check if the highest score is below the threshold
    if highest_score < similarity_threshold:
        return None # No sufficiently close match found

    return best_match

## Integrate fuzzy matching into recommendation function

### Subtask:
Modify the `recommend_products_for_user` function to use the fuzzy matching function to find the closest known job title in the `job_keywords` dictionary based on the user's input.


**Reasoning**: Call `find_closest_job` on the user's job before encoding it, so that typos and near-synonyms resolve to a known key in `job_keywords`. In the version below, the original input is used when no match clears the threshold, and an empty list is returned if the job or region is unknown to the fitted label encoders.

**Summary:** A later revision of `recommend_products_for_user` (see "Handle cases with no close match") returns generic recommendations (Health Micro Plan and Life Starter Plan, each with a default score of 50%) when no job match clears the fuzzy matching threshold. It also changes region handling to print a warning and still generate recommendations when the region is unseen.
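
The percentage scores printed earlier (for example 75%) come from counting how many criteria a product meets and dividing by `max_score`. The sketch below reproduces that arithmetic for a single product, using the Health Micro Plan premium (1500) from the product table and the Card Maker profile. The helper name `score_product` is hypothetical and not part of the notebook.

```python
def score_product(monthly_premium, monthly_income, max_affordable_premium, max_score=4):
    """Count satisfied criteria and convert the count to a percentage."""
    score = 0
    if monthly_premium <= monthly_income * 0.1:    # premium within 10% of income
        score += 1
    if monthly_premium <= max_affordable_premium:  # premium within the stated budget
        score += 1
    score += 1  # job relevance: only products that passed the keyword filter are scored
    return int(round((score / max_score) * 100, 0))

print(score_product(1500, 35000, 3000))
```

Only three criteria add to the score while `max_score` is 4, so 75% is the highest value this scheme can produce.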


In [None]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

def recommend_products_for_user(user_profile, product_df, kmeans_model, scaler, le_job, le_region, cluster_product_map, job_keywords):
    """
    Recommend products for a new user based on profile and product features.

    Parameters:
    - user_profile: dict with keys 'Job', 'Monthly_Income', 'Max_Affordable_Premium', 'Number_of_Dependents', 'Region'
    - product_df: DataFrame with product features
    - kmeans_model: trained KMeans model from joint clustering
    - scaler: fitted StandardScaler for numerical features
    - le_job, le_region: fitted LabelEncoders for categorical features
    - cluster_product_map: DataFrame with product scores per cluster
    - job_keywords: Dictionary mapping job titles to relevant product keywords

    Returns:
    - List of recommended products with percentage scores and explanations
    """
    max_score = 4  # Denominator for the percentage score (three criteria are scored below)

    # Find the closest job using fuzzy matching
    known_jobs = list(job_keywords.keys())
    fuzzy_matched_job = find_closest_job(user_profile['Job'], known_jobs)

    # Use the fuzzy matched job if found, otherwise use the original job
    current_job = fuzzy_matched_job if fuzzy_matched_job else user_profile['Job']

    # Create manual mappings for job and region based on the fitted encoders' classes
    job_mapping = {cls: idx for idx, cls in enumerate(le_job.classes_)}
    region_mapping = {cls: idx for idx, cls in enumerate(le_region.classes_)}

    # Get encoded job and region, using -1 for unseen values
    job_encoded = job_mapping.get(current_job, -1)
    region_encoded = region_mapping.get(user_profile['Region'], -1)

    # Handle unseen job or region if necessary (e.g., return empty list or use a default)
    if job_encoded == -1 or region_encoded == -1:
        print(f"Warning: Job '{current_job}' or Region '{user_profile['Region']}' not seen in training data. Cannot generate recommendations.")
        return []


    # Scale numerical features
    # Create a DataFrame with feature names for scaling
    numerical_data = pd.DataFrame([[user_profile['Monthly_Income'],
                                    user_profile['Number_of_Dependents'],
                                    user_profile['Max_Affordable_Premium']]],
                                  columns=['Monthly_Income', 'Number_of_Dependents', 'Max_Affordable_Premium'])
    scaled_values = scaler.transform(numerical_data)[0]

    # Create full feature vector as a DataFrame with feature names
    feature_cols = [
        'Job_Encoded', 'Region_Encoded',
        'Monthly_Income', 'Number_of_Dependents',
         'Max_Affordable_Premium',
        'P001', 'P002', 'P003', 'P004', 'P005'
    ]
    user_data_df = pd.DataFrame([[job_encoded, region_encoded] + list(scaled_values) + [0, 0, 0, 0, 0]],
                                columns=feature_cols)


    # Predict cluster
    cluster_id = kmeans_model.predict(user_data_df)[0]

    # Get top products from cluster
    top_products = cluster_product_map.loc[cluster_id].sort_values(ascending=False).index.tolist()

    # Job relevance mapping is now passed as an argument
    relevant_keywords = job_keywords.get(current_job, [])

    # Score and explain each product
    recommendations = []
    for _, product in product_df.iterrows():
        # Filter out products not relevant to the user's job
        if not any(keyword.lower() in product['Product_Name'].lower() for keyword in relevant_keywords):
            continue  # Skip irrelevant products

        score = 0
        reasons = []

        # Affordability
        if product['Monthly_Premium'] <= user_profile['Monthly_Income'] * 0.1:
            score += 1
            reasons.append("Affordable based on your income")

        if product['Monthly_Premium'] <= user_profile['Max_Affordable_Premium']:
            score += 1
            reasons.append("Fits within your premium budget")

        # Job relevance (already passed filter)
        score += 1
        reasons.append(f"Relevant to your job as a {current_job}")


        # Only include products from top cluster preferences
        if product['Product_ID'] in top_products:
            recommendations.append({
                'Product_ID': product['Product_ID'],
                'Product_Name': product['Product_Name'],
                'Score': int(round((score / max_score) * 100, 0)),  # Percentage score with 0 decimals
                'Monthly_Premium': product['Monthly_Premium'],
                'Reasons': reasons
            })

    # Sort by score
    recommendations = sorted(recommendations, key=lambda x: x['Score'], reverse=True)

    # Bundle logic: if exactly two products are recommended
    if len(recommendations) == 2:
        premium_1 = recommendations[0]['Monthly_Premium']
        premium_2 = recommendations[1]['Monthly_Premium']
        bundle_premium = round(premium_1 * 1.0 + premium_2 * 0.2, 2)

        # Check affordability
        if bundle_premium <= user_profile['Max_Affordable_Premium']:
            bundle_name = f"{recommendations[0]['Product_Name']} + {recommendations[1]['Product_Name']} Bundle"
            bundle_score = int(round((recommendations[0]['Score'] + recommendations[1]['Score']) / 2, 0))
            bundle_reasons = list(set(recommendations[0]['Reasons'] + recommendations[1]['Reasons']))
            bundle_id = f"{recommendations[0]['Product_ID']}_{recommendations[1]['Product_ID']}"

            bundled_product = {
                'Product_ID': bundle_id,
                'Product_Name': bundle_name,
                'Score': bundle_score,
                'Monthly_Premium': bundle_premium,
                'Reasons': bundle_reasons
            }

            recommendations = [bundled_product]
        else:
            # Bundle not affordable—return only the first product
            recommendations = [recommendations[0]]


    return recommendations

**Reasoning**:
Test the modified `recommend_products_for_user` function with a user profile that has a job title that will not find a close match above the threshold to confirm the new logic is working correctly.



In [None]:
user_profile_no_match = {
    'Job': 'Astronaut', # This job should not have a close match
    'Monthly_Income': 100000,
    'Max_Affordable_Premium': 10000,
    'Number_of_Dependents': 1,
    'Region': 'Lagos'
}

recommended_no_match = recommend_products_for_user(user_profile_no_match, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)

if not recommended_no_match:
    print("\nAs expected, no recommendations were generated.")

No close job match found for 'Astronaut'. Cannot generate job-relevant recommendations.

As expected, no recommendations were generated.


## Summary:

### Data Analysis Key Findings

*   The `fuzzywuzzy` and `python-Levenshtein` libraries were successfully installed to enable fuzzy string matching.
*   A `find_closest_job` function was created and implemented using `fuzzywuzzy.fuzz.token_set_ratio` to calculate similarity scores between a user's job input and a list of known jobs.
*   A `similarity_threshold` of 75 was defined within the `find_closest_job` function to filter out weak matches, returning `None` if no match meets this threshold.
*   The `recommend_products_for_user` function was updated to use the `find_closest_job` function, falling back to the original user input if no fuzzy match is found (although subsequent steps ensured it returns an empty list if no match is found).
*   Logic was added to `recommend_products_for_user` to return an empty list and print a warning if `find_closest_job` returns `None` (no close match found) or if the matched job or region is not found in the training data's label encoders.
*   Testing demonstrated that the fuzzy matching correctly identified typos ("Dry Cleanr" matched to "Dry Cleaner", "Grdener" matched to "Gardener") and synonyms ("Auto Mechanic" matched to "Mechanic").
*   The system correctly handled a job not in the known list ("Software Developer") by not finding a close match and thus not generating recommendations.
*   Some test cases did not result in recommendations due to the matched job or region not being present in the training data used by the label encoders, highlighting a dependency on comprehensive training data.
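
The dependency on training data comes from how jobs and regions are encoded: any value missing from the fitted encoder's classes maps to -1. A minimal sketch of that lookup follows, using a plain list in place of `le_region.classes_`. The region names here are illustrative assumptions, not the notebook's actual classes.

```python
# Stand-in for le_region.classes_; the real classes come from the fitted LabelEncoder
region_classes = ["Abuja", "Kano", "Lagos"]

# Same dictionary-based lookup pattern used in recommend_products_for_user
region_mapping = {cls: idx for idx, cls in enumerate(region_classes)}

def encode_region(region):
    """Return the integer code for a known region, or -1 if it was never seen."""
    return region_mapping.get(region, -1)

for region in ["Lagos", "Ibadan", "lagos"]:
    print(region, "->", encode_region(region))
```

The lookup is exact and case-sensitive, so a region that differs only in capitalization is also treated as unseen.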

### Insights or Next Steps

*   Enhance the error handling in `recommend_products_for_user` to distinguish between no fuzzy match found and a fuzzy matched job/region not being in the training data, providing more specific user feedback.
*   Consider strategies for handling jobs or regions not seen in training data, such as using default cluster assignments or keyword mappings, to provide some recommendations even when exact matches are missing.
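
One way to separate the two failure modes named in the first point is to return a status alongside the resolved job. This is a sketch only: `resolve_job`, the status strings, and the injected `matcher` callable are hypothetical and not part of the notebook.

```python
def resolve_job(user_job, known_jobs, encoder_classes, matcher):
    """Return (job, status), where status explains why resolution succeeded or failed.

    matcher is any callable with the shape of find_closest_job(user_job, known_jobs)
    that returns a known job title or None.
    """
    matched = matcher(user_job, known_jobs)
    if matched is None:
        return None, "no_fuzzy_match"
    if matched not in encoder_classes:
        return matched, "not_in_training_data"
    return matched, "ok"

# Toy matcher for demonstration: exact, case-insensitive match only
def exact_matcher(user_job, known_jobs):
    lookup = {job.lower(): job for job in known_jobs}
    return lookup.get(user_job.lower())

known = ["Mechanic", "Gardener"]
trained = ["Mechanic"]  # pretend the encoder only saw this job
for job in ["mechanic", "gardener", "Astronaut"]:
    print(job, "->", resolve_job(job, known, trained, exact_matcher))
```

The caller can then choose a message per status instead of printing one combined warning.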


## Handle cases with no close match

### Subtask:
Implement logic to handle scenarios where the fuzzy matching does not find a sufficiently close match above the defined threshold.


**Reasoning**:
Modify the recommendation function to handle cases where the fuzzy matching returns None, indicating no close job match was found above the threshold.



In [None]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

def recommend_products_for_user(user_profile, product_df, kmeans_model, scaler, le_job, le_region, cluster_product_map, job_keywords):
    """
    Recommend products for a new user based on profile and product features.

    Parameters:
    - user_profile: dict with keys 'Job', 'Monthly_Income', 'Max_Affordable_Premium', 'Number_of_Dependents', 'Region'
    - product_df: DataFrame with product features
    - kmeans_model: trained KMeans model from joint clustering
    - scaler: fitted StandardScaler for numerical features
    - le_job, le_region: fitted LabelEncoders for categorical features
    - cluster_product_map: DataFrame with product scores per cluster
    - job_keywords: Dictionary mapping job titles to relevant product keywords

    Returns:
    - List of recommended products with percentage scores and explanations
    """
    max_score = 4

    # Find the closest job using fuzzy matching
    known_jobs = list(job_keywords.keys())
    fuzzy_matched_job = find_closest_job(user_profile['Job'], known_jobs)

    # Handle case where no close job match is found - provide generic recommendations
    if fuzzy_matched_job is None:
        print(f"No close job match found for '{user_profile['Job']}' above the threshold. Providing generic recommendations.")
        # Return generic recommendations: Health Micro Plan and Life Starter Plan
        generic_recommendations = []
        for _, product in product_df.iterrows():
            if product['Product_ID'] in ['P003', 'P005']:
                 # Assign a default score and reasons for generic recommendations
                generic_recommendations.append({
                    'Product_ID': product['Product_ID'],
                    'Product_Name': product['Product_Name'],
                    'Score': 50, # Default score for generic recommendations
                    'Monthly_Premium': product['Monthly_Premium'],
                    'Reasons': ["Generic recommendation: No specific job match found above threshold."]
                })
        return generic_recommendations


    current_job = fuzzy_matched_job

    # Create manual mappings for job and region based on the fitted encoders' classes
    job_mapping = {cls: idx for idx, cls in enumerate(le_job.classes_)}
    region_mapping = {cls: idx for idx, cls in enumerate(le_region.classes_)}

    # Get encoded job and region, using -1 for unseen values in the original training data
    job_encoded = job_mapping.get(current_job, -1)
    region_encoded = region_mapping.get(user_profile['Region'], -1)

    # Warning for region not in training data, but proceed with recommendations
    if region_encoded == -1:
        print(f"Warning: Region '{user_profile['Region']}' was not present in the original training data. Proceeding with job-based recommendations but region features will be encoded as -1.")


    # Scale numerical features
    # Create a DataFrame with feature names for scaling
    numerical_data = pd.DataFrame([[user_profile['Monthly_Income'],
                                    user_profile['Number_of_Dependents'],
                                    user_profile['Max_Affordable_Premium']]],
                                  columns=['Monthly_Income', 'Number_of_Dependents', 'Max_Affordable_Premium'])
    scaled_values = scaler.transform(numerical_data)[0]

    # Create full feature vector as a DataFrame with feature names
    # Include region_encoded even if -1, as the model was trained with encoded regions
    feature_cols = [
        'Job_Encoded', 'Region_Encoded',
        'Monthly_Income', 'Number_of_Dependents',
         'Max_Affordable_Premium',
        'P001', 'P002', 'P003', 'P004', 'P005'
    ]
    user_data_df = pd.DataFrame([[job_encoded, region_encoded] + list(scaled_values) + [0, 0, 0, 0, 0]],
                                columns=feature_cols)


    # Predict cluster
    cluster_id = kmeans_model.predict(user_data_df)[0]

    # Get top products from cluster
    top_products = cluster_product_map.loc[cluster_id].sort_values(ascending=False).index.tolist()

    # Job relevance mapping
    relevant_keywords = job_keywords.get(current_job, [])

    # Score and explain each product
    recommendations = []
    for _, product in product_df.iterrows():
        # Filter out products not relevant to the user's job
        if not any(keyword.lower() in product['Product_Name'].lower() for keyword in relevant_keywords):
            continue

        score = 0
        reasons = []

        # Affordability
        if product['Monthly_Premium'] <= user_profile['Monthly_Income'] * 0.1:
            score += 1
            reasons.append("Affordable based on your income")

        if product['Monthly_Premium'] <= user_profile['Max_Affordable_Premium']:
            score += 1
            reasons.append("Fits within your premium budget")

        # Job relevance (already passed filter)
        score += 1
        reasons.append(f"Relevant to your job as a {current_job}")


        # Only include products from top cluster preferences
        if product['Product_ID'] in top_products:
            recommendations.append({
                'Product_ID': product['Product_ID'],
                'Product_Name': product['Product_Name'],
                'Score': int(round((score / max_score) * 100, 0)),  # Percentage score with 0 decimals
                'Monthly_Premium': product['Monthly_Premium'],
                'Reasons': reasons
            })

    # Sort by score
    recommendations = sorted(recommendations, key=lambda x: x['Score'], reverse=True)

    # # Bundle logic: if exactly two products are recommended
    # if len(recommendations) == 2:
    #     premium_1 = recommendations[0]['Monthly_Premium']
    #     premium_2 = recommendations[1]['Monthly_Premium']
    #     bundle_premium = round(premium_1 * 1.0 + premium_2 * 0.2, 2)

    #     # Check affordability
    #     if bundle_premium <= user_profile['Max_Affordable_Premium']:
    #         bundle_name = f"{recommendations[0]['Product_Name']} + {recommendations[1]['Product_Name']} Bundle"
    #         bundle_score = int(round((recommendations[0]['Score'] + recommendations[1]['Score']) / 2, 0))
    #         bundle_reasons = list(set(recommendations[0]['Reasons'] + recommendations[1]['Reasons']))
    #         bundle_id = f"{recommendations[0]['Product_ID']}_{recommendations[1]['Product_ID']}"

    #         bundled_product = {
    #             'Product_ID': bundle_id,
    #             'Product_Name': bundle_name,
    #             'Score': bundle_score,
    #             'Monthly_Premium': bundle_premium,
    #             'Reasons': bundle_reasons
    #         }

    #         recommendations = [bundled_product]
    #     else:
    #         # Bundle not affordable—return only the first product
    #         recommendations = [recommendations[0]]


    return recommendations


## Test the updated system

### Subtask:
Test the recommendation system with various job title inputs, including typos, synonyms, and entirely new jobs, to verify that the fuzzy matching is working correctly and recommendations are provided appropriately.

**Reasoning**:
Create test user profiles with variations in job titles and call the recommendation function for each profile to test the fuzzy matching.
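
The same checks can also be written as a loop over labelled profiles, which keeps the printing logic in one place. The sketch below assumes a stub recommender (`fake_recommend`, hypothetical) in place of `recommend_products_for_user` so that it runs on its own; the helper `run_cases` is likewise an illustration, not part of the notebook.

```python
def fake_recommend(profile):
    """Hypothetical stand-in that returns one canned recommendation for a known job."""
    if profile["Job"] == "Mechanic":
        return [{"Product_Name": "Personal Accident Cover", "Score": 75,
                 "Reasons": ["Fits within your premium budget"]}]
    return []

def run_cases(cases, recommend):
    """Call recommend on each (label, profile) pair and collect printable lines."""
    lines = []
    for label, profile in cases:
        lines.append(f"--- {label} ---")
        recs = recommend(profile)
        for rec in recs:
            lines.append(f"{rec['Product_Name']} (Score: {rec['Score']}%)")
            lines.extend(f" - {reason}" for reason in rec["Reasons"])
        if not recs:
            lines.append("No recommendations were generated for this user.")
    return lines

cases = [
    ("Known job", {"Job": "Mechanic", "Region": "Kano"}),
    ("Unknown job", {"Job": "Astronaut", "Region": "Lagos"}),
]
print("\n".join(run_cases(cases, fake_recommend)))
```

To run it against the real function, pass a small wrapper such as `lambda p: recommend_products_for_user(p, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)` as `recommend`.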

In [None]:
# Test case 1: Typo in existing job
user_profile_typo = {
    'Job': 'Dry Cleanr', # Typo of 'Dry Cleaner'
    'Monthly_Income': 40000,
    'Max_Affordable_Premium': 4000,
    'Number_of_Dependents': 1,
    'Region': 'Lagos'
}

print("--- Test Case 1: Typo in Job ---")
recommended_typo = recommend_products_for_user(user_profile_typo, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)
for rec in recommended_typo:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")
if not recommended_typo:
    print("No recommendations were generated for this user.")


# Test case 2: Synonym of existing job
user_profile_synonym = {
    'Job': 'Auto Mechanic', # Synonym for 'Mechanic'
    'Monthly_Income': 60000,
    'Max_Affordable_Premium': 6000,
    'Number_of_Dependents': 2,
    'Region': 'Kano'
}

print("\n--- Test Case 2: Synonym of Job ---")
recommended_synonym = recommend_products_for_user(user_profile_synonym, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)
for rec in recommended_synonym:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")
if not recommended_synonym:
    print("No recommendations were generated for this user.")

# Test case 3: Job not in the list (should not match)
user_profile_new = {
    'Job': 'Software Developer', # New job
    'Monthly_Income': 200000,
    'Max_Affordable_Premium': 20000,
    'Number_of_Dependents': 0,
    'Region': 'Abuja'
}

print("\n--- Test Case 3: New Job (No Close Match Expected) ---")
recommended_new = recommend_products_for_user(user_profile_new, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)
for rec in recommended_new:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")
if not recommended_new:
    print("As expected, no recommendations were generated for this user.")

# Test case 4: Another typo
user_profile_typo2 = {
    'Job': 'Grdener', # Typo of 'Gardener'
    'Monthly_Income': 30000,
    'Max_Affordable_Premium': 3000,
    'Number_of_Dependents': 1,
    'Region': 'Ibadan'
}

print("\n--- Test Case 4: Another Typo ---")
recommended_typo2 = recommend_products_for_user(user_profile_typo2, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)
for rec in recommended_typo2:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")
if not recommended_typo2:
    print("No recommendations were generated for this user.")

# Test case 5: Similar but different job
user_profile_similar = {
    'Job': 'Shop Owner', # Similar to 'Vendor' or 'Kiosk Owner' but not an exact match
    'Monthly_Income': 70000,
    'Max_Affordable_Premium': 7000,
    'Number_of_Dependents': 3,
    'Region': 'Port Harcourt'
}

print("\n--- Test Case 5: Similar but Different Job ---")
recommended_similar = recommend_products_for_user(user_profile_similar, product_df, kmeans, scaler, le_job, le_region, cluster_product_map, job_keywords)
for rec in recommended_similar:
    print(f"\n✅ {rec['Product_Name']} (Score: {rec['Score']}%)")
    for reason in rec['Reasons']:
        print(f" - {reason}")
if not recommended_similar:
    print("No recommendations were generated for this user.")

--- Test Case 1: Typo in Job ---

✅ Fire & Burglary Insurance (Score: 75%)
 - Affordable based on your income
 - Fits within your premium budget
 - Relevant to your job as a Dry Cleaner

✅ Health Micro Plan (Score: 75%)
 - Affordable based on your income
 - Fits within your premium budget
 - Relevant to your job as a Dry Cleaner

✅ Device Protection Plan (Score: 75%)
 - Affordable based on your income
 - Fits within your premium budget
 - Relevant to your job as a Dry Cleaner

--- Test Case 2: Synonym of Job ---

✅ Personal Accident Cover + Device Protection Plan Bundle (Score: 75%)
 - Affordable based on your income
 - Fits within your premium budget
 - Relevant to your job as a Mechanic

--- Test Case 3: New Job (No Close Match Expected) ---
No close job match found for 'Software Developer' above the threshold. Providing generic recommendations.

✅ Health Micro Plan (Score: 50%)
 - Generic recommendation: No specific job match found above threshold.

✅ Life Starter Plan (Score: 50%)
