# Investment Type Recommender System

## Business Understanding

### Overview
Many Kenyans, especially young adults and first-time investors, struggle to identify investment avenues that align with their financial goals, risk appetite, and income level. The lack of personalized financial guidance often leads to poor or delayed investment decisions.

### Challenges
- **Low financial literacy and accessibility to advisory services**  
  Many potential investors lack foundational knowledge or support systems to understand available investment options or evaluate their suitability.

- **Overwhelming investment options**  
  The abundance of options—such as SACCOs, stocks, real estate, government bonds, and money market funds—can be confusing and lead to decision fatigue.

- **One-size-fits-all investment marketing**  
  Most financial institutions promote products generically, failing to account for individual goals, income, and risk profiles.

- **Lack of data-driven tools for personalized investment planning**  
  There is limited availability of intelligent systems to assist users in navigating investments based on their unique profiles.

- **Distrust and fear of loss**  
  Without adequate knowledge or guidance, potential investors may fear financial loss or fall victim to scams, leading to investment hesitation.

### Proposed Solution
A **machine learning-based recommender system** that suggests ideal investment types based on a user's financial profile, risk tolerance, and goals. This system can help both fintech platforms and financial institutions deliver personalized advisory services at scale.

### Brief Conclusion
By guiding users toward the most suitable investment types, this solution aims to enhance financial inclusion and support smarter, confidence-driven investment decisions.


## Problem Statement
Many individuals, especially in emerging markets, face significant challenges in making informed investment decisions due to limited financial literacy and lack of personalized advisory services. The wide range of available investment options—SACCOs, stocks, real estate, government bonds, and money market funds—can be overwhelming without guidance. Additionally, the generic approach in investment marketing overlooks the diverse financial goals, income levels, and risk appetites of potential investors, leading to poor financial outcomes and disengagement from long-term wealth-building.


## Objectives

- **Analysis-Based**  
  Understand investment behaviors among Kenyan users and segment them based on patterns.

- **Feature Engineering-Based**  
  Create user profiles using financial behavior indicators such as:
  - Income level
  - Savings rate
  - Age
  - Financial goals

- **Modeling-Based**  
  Build and evaluate recommender models, including:
  - Content-based filtering
  - Hybrid approaches (clustering + classification)


## Data Understanding

### Data Sourcing
- Publicly available financial survey data (e.g., **FinAccess Kenya survey**)
- Simulated user profiles or anonymized fintech customer data
- Investment platform usage data (e.g., user interest in asset types)
- Economic indicators (e.g., interest rates, inflation)

### Features and Relevance
- Demographics: Age, gender, location  
- Financials: Income, expenses, debt levels  
- Profile: Risk profile (low/medium/high)  
- Preferences: Investment goals (short-term/long-term, passive/active)  
- History: Past investment experience


## Data Preparation

### Format
- Data will be collected and processed in **Excel format**

### Actions
- Handle missing values
- Encode categorical variables
- Normalize numeric fields
- Create derived features (e.g., savings rate, risk-adjusted return scores)
- Segment data by user type or financial tier


## Deployment

### API
- **Framework:** FastAPI  
- **Endpoints:** Accept user profile data and return recommended investment type(s)  
- **Model Storage:** Serialized using `.pkl` or `.joblib`

### UI
- **Framework:** Streamlit  
- **Function:** Allows users to input financial info and receive personalized investment suggestions

### Prototypes / Mockups
- **Key Screens:**
  - Welcome Screen
  - Financial Profile Input
  - Investment Suggestions


In cell below we import libraries

In [1]:
# import libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# load data

file_path = "data/Refined_Finaccess.xlsx"

invest_df = pd.read_excel(file_path)
invest_df.head()

Unnamed: 0,HHid,county,A08,A13,A18,NHM,livelihoodcat,Quintiles,Education,Marital,...,sacco_redress,mobilemoney_redress,mobilebank_redress,not_registered_mmoney_24,using_someone_acc,insurance_including_NHIF_use,All_Insurance_excluding_NHIF_use,PWD,Latitude,Longitude
0,107141431,Garissa,Urban,Male,29,5,Dependent,Fourth,Tertiary,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,-0.435423,39.636586
1,10712933,Garissa,Urban,Male,60,11,Other,Second,,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,0.058794,40.305006
2,140173183,Busia,Urban,Female,35,2,Casual Worker,Fourth,Primary,Divorced/separated,...,,,,No,,Never used,Never used,Without Disability,0.636836,34.27739
3,122137153,Kiambu,Urban,Male,24,1,Casual Worker,Middle,Secondary,Single/Never Married,...,,,,No,,Never used,Never used,Without Disability,-1.251917,36.719076
4,121193116,Murang'a,Urban,Female,20,1,Dependent,Highest,Secondary,Single/Never Married,...,,,,No,Yes,Never used,Never used,Without Disability,-0.79582,37.131085


In [3]:
# check shape of data

invest_df.shape

(20871, 433)

The output above shows that dataset contains **20871 entries and 3816 columns**. 

In cell below we check for metadata summary and numeric summary

In [4]:
# check metadata summary & numeric

def data_summary(df):
    print("-----info-----")
    df.info()

    print("-----describe-----")
    df.describe()

    return df


data_summary(invest_df)



-----info-----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20871 entries, 0 to 20870
Columns: 433 entries, HHid to Longitude
dtypes: float64(59), int64(23), object(351)
memory usage: 68.9+ MB
-----describe-----


Unnamed: 0,HHid,county,A08,A13,A18,NHM,livelihoodcat,Quintiles,Education,Marital,...,sacco_redress,mobilemoney_redress,mobilebank_redress,not_registered_mmoney_24,using_someone_acc,insurance_including_NHIF_use,All_Insurance_excluding_NHIF_use,PWD,Latitude,Longitude
0,107141431,Garissa,Urban,Male,29,5,Dependent,Fourth,Tertiary,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,-0.435423,39.636586
1,10712933,Garissa,Urban,Male,60,11,Other,Second,,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,0.058794,40.305006
2,140173183,Busia,Urban,Female,35,2,Casual Worker,Fourth,Primary,Divorced/separated,...,,,,No,,Never used,Never used,Without Disability,0.636836,34.277390
3,122137153,Kiambu,Urban,Male,24,1,Casual Worker,Middle,Secondary,Single/Never Married,...,,,,No,,Never used,Never used,Without Disability,-1.251917,36.719076
4,121193116,Murang'a,Urban,Female,20,1,Dependent,Highest,Secondary,Single/Never Married,...,,,,No,Yes,Never used,Never used,Without Disability,-0.795820,37.131085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20866,208078401,Wajir,Rural,Male,39,8,Dependent,Lowest,,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,3.328605,39.788873
20867,147225325,Nairobi City,Urban,Female,24,4,Employed,Middle,Secondary,Married/Living with partner,...,,,,No,,Used to use,Never used,Without Disability,-1.263889,36.853431
20868,114061351,Embu,Rural,Male,73,6,Dependent,Middle,Primary,Widowed,...,,,,No,,Used to use,Never used,Without Disability,-0.521351,37.568370
20869,106181991,Taita-Taveta,Urban,Female,32,7,Own Business,Fourth,Primary,Single/Never Married,...,,,,No,,Never used,Never used,Without Disability,-3.378276,38.564794


The ouput above shows dataset contains total `20871 entries`and  `3816 columns` in which `2303 columns are float`, `185 integers` and `1328 object` 

### Basic Data cleaning

In cell below we inspect data and drop all columns not needed based on domain knowledge and data description, then check for missing values, duplicates, standadize all categorical data to lower case and remove white, and impute for the missing values

In [5]:
# Rename columns for readability 

def rename_columns(df):
    """
    Renames coded or unclear column headers to more readable ones.
    Modify the mapping below based on actual data.
    """
    column_mapping = {
        "HHid": "householdid",
        "A08": "area_type",
        "A13": "gender",
        "A18": "age_of_respondent",
        "NHM": "no_of_household_mebers",
        "nC1_1a": "save_bank",
        "nC1_1b": "save_microfinance",
        "nC1_2": "save_mobile_money",
        "nC1_3": "save_credit_cop",
        "nC1_4":"save_sacco",
        "nC1_5":"save_chama",
        "nC1_6":"save_friends",
        "nC1_9":"save_digitalapp",
        "C1_19" : "digital loans",
        "U23" : "total_monthly_expenditure"
        'B3Ii',
        
    }

    return df.rename(columns=column_mapping)

In [6]:
invest_df_copy = rename_columns(invest_df)
invest_df_copy.head()

Unnamed: 0,householdid,county,area_type,gender,age_of_respondent,no_of_household_mebers,livelihoodcat,Quintiles,Education,Marital,...,sacco_redress,mobilemoney_redress,mobilebank_redress,not_registered_mmoney_24,using_someone_acc,insurance_including_NHIF_use,All_Insurance_excluding_NHIF_use,PWD,Latitude,Longitude
0,107141431,Garissa,Urban,Male,29,5,Dependent,Fourth,Tertiary,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,-0.435423,39.636586
1,10712933,Garissa,Urban,Male,60,11,Other,Second,,Married/Living with partner,...,,,,No,,Never used,Never used,Without Disability,0.058794,40.305006
2,140173183,Busia,Urban,Female,35,2,Casual Worker,Fourth,Primary,Divorced/separated,...,,,,No,,Never used,Never used,Without Disability,0.636836,34.27739
3,122137153,Kiambu,Urban,Male,24,1,Casual Worker,Middle,Secondary,Single/Never Married,...,,,,No,,Never used,Never used,Without Disability,-1.251917,36.719076
4,121193116,Murang'a,Urban,Female,20,1,Dependent,Highest,Secondary,Single/Never Married,...,,,,No,Yes,Never used,Never used,Without Disability,-0.79582,37.131085


In cell below we check for the missing values and impute them for efficient feature engineering

In [7]:
# check for missing values

def missing_values(df):
    missing = df.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
   
    return missing

missing_values(invest_df)


investment_redress    20853
pension_redress       20818
mobilebank_redress    20796
sacco_redress         20778
creditonly_redress    20743
                      ...  
mobile_bank_use           7
nC1_11                    7
bank_use                  7
mobile_money_use          6
nC1_10                    6
Length: 159, dtype: int64

for `numeric_values` we impute with **median** and for `categorical` we use **mode** or **unknown**. we drop columns we much missing values.


In [8]:
# make copy of original data

invest_df_copy = invest_df.copy()

In [9]:
# Drop columns with much missing values
def drop_columns(df, columns_missing):
    df_cleaned = df.drop(columns=columns_missing, axis=1, errors='ignore')  # errors='ignore' handles any misnamed columns gracefully
    print(f"Dropped {len(columns_missing)} columns.")
    return df_cleaned

entries = [
    'investment_redress', 'pension_redress', 'mobilebank_redress', 'sacco_redress', 'creditonly_redress',
    'trad_mfi_satisfaction', 'bank_redress', 'insurance_redress', 'trad_mfi_moneylost', 'trad_mfi_unsolicited',
    'trad_mfi_downtime', 'trad_mfi_unexpectedcharges', 'trad_mfi_unethicalrecovery', 'digital_app_satisfaction',
    'digital_app_moneylost', 'digital_app_unsolicited', 'digitalapps_unexpectedcharges', 'digital_app_downtime',
    'digital_apps_unethicalrecovery', 'traditional_mfi_issues', 'investment_unexpectedcharges_fnl',
    'investment_sold_fnl', 'investment_lostmoney_fnl', 'investment_downtime_fnl', 'investment_issues_fnl',
    'investment_satisfaction', 'digital_issues', 'hirepurchase_satisfaction', 'hirepurchase_moneylost',
    'hirepurchase_downtime', 'hirepurchase_unexpectedcharges', 'hirepurchase_unsolicited',
    'hirepurchase_unethicalrecovery', 'hirepurchase_issues', 'mobilemoney_redress',
    'creditonlyagree_satisfactionl', 'creditonly_mfi_moneylost', 'creditonly_mfi_downtime',
    'creditonly_mfi_unsolicited', 'creditonly_mfi_unexpectedcharges', 'creditonly_mfi_unethicalrecovery',
    'creditonly_mfi_issues', 'pension_unethical_fnl', 'pension_underpayment_fnl', 'pension_attachment',
    'pension_delayed_fnl', 'pension_issues_fnl', 'pension_satisfaction', 'pension_lostmoney_fnl',
    'sacco_unexpectedcharges'
]


invest_df_copy = drop_columns(invest_df_copy, entries)



Dropped 50 columns.


In cell above we drop all columns with much missing value. Below we proceed to impute missing for both `numeric_values` and `categorical_values`.

In [10]:
# impute missing values for both numeric_values and categorical values

def impute_missing_values(df):
    """Impute missing values: numeric with median, categorical with mode or 'unknown'."""
    for col in df.columns:
        if df[col].isnull().sum() == 0:
            continue
        if df[col].dtype in ['float64', 'int64']:
            median_val = df[col].median()
            df[col].fillna(median_val, inplace=True)
        else:
            mode_val = df[col].mode()
            if not mode_val.empty:
                df[col].fillna(mode_val[0], inplace=True)
            else:
                df[col].fillna('unknown', inplace=True)
    print("Imputed missing values for all applicable columns.")
    return df

invest_df_copy = impute_missing_values(invest_df_copy)


Imputed missing values for all applicable columns.


In [11]:
# check for missing values after imputing

invest_df_copy.isnull().sum()

HHid                                0
county                              0
A08                                 0
A13                                 0
A18                                 0
                                   ..
insurance_including_NHIF_use        0
All_Insurance_excluding_NHIF_use    0
PWD                                 0
Latitude                            0
Longitude                           0
Length: 384, dtype: int64

In [14]:
# check for duplicates

invest_df_copy.duplicated().sum()

np.int64(0)

**No duplicates** found. Below we `standardize categorical data` by standardizing all letters to lower, and removing whitespace  

In [16]:
# standardize categorical data

def clean_cat_text(df):
    cat_cols = df.select_dtypes(include="object").columns
    for col in cat_cols:
        df[col] = df[col].astype(str).str.strip().str.lower()
    print("standardized all categorical text to lowercase and stripped whitespace")

    return df

invest_df_copy = clean_cat_text(invest_df_copy)
invest_df_copy.head()

standardized all categorical text to lowercase and stripped whitespace


Unnamed: 0,HHid,county,A08,A13,A18,NHM,livelihoodcat,Quintiles,Education,Marital,...,sacco_redress,mobilemoney_redress,mobilebank_redress,not_registered_mmoney_24,using_someone_acc,insurance_including_NHIF_use,All_Insurance_excluding_NHIF_use,PWD,Latitude,Longitude
0,107141431,garissa,urban,male,29,5,dependent,fourth,tertiary,married/living with partner,...,,,,no,,never used,never used,without disability,-0.435423,39.636586
1,10712933,garissa,urban,male,60,11,other,second,,married/living with partner,...,,,,no,,never used,never used,without disability,0.058794,40.305006
2,140173183,busia,urban,female,35,2,casual worker,fourth,primary,divorced/separated,...,,,,no,,never used,never used,without disability,0.636836,34.27739
3,122137153,kiambu,urban,male,24,1,casual worker,middle,secondary,single/never married,...,,,,no,,never used,never used,without disability,-1.251917,36.719076
4,121193116,murang'a,urban,female,20,1,dependent,highest,secondary,single/never married,...,,,,no,yes,never used,never used,without disability,-0.79582,37.131085


### **Preparing data for recommeder system & Feature engineering**

This project focuses on developing a **machine learning-based recommender system designed to suggest appropriate investment products tailored to a user's financial profile, risk tolerance, and personal goals**. To support this, we begun by preparing the dataset through a structured process that includes selecting relevant features, renaming columns for readability, and performing necessary cleaning and exploratory analysis.

We have chosen to implement a **content-based filtering** approach for the recommendation engine. **This decision is guided by the nature of the FinAccess 2024 dataset, which consists of aggregated survey responses rather than individual user-product interaction data**. Collaborative filtering methods such as ALS or SVD typically rely on user-level histories and are therefore not applicable in this context.

Instead, content-based filtering allows us to leverage the dataset's rich demographic and behavioral attributes such as age, gender, region, education level, trust in financial institutions, and satisfaction with financial products to make informed recommendations. By aligning a new user’s profile with patterns observed in similar demographic segments, we can suggest investment options that are likely to be suitable and relevant.

The recommender system works by capturing key user inputs (such as age group, gender, education, region), identifying similar profiles within the dataset, analyzing common or highly rated financial products within those segments, and recommending those products to the user. This personalized, profile-driven approach offers a practical and data-informed way to support individuals in making better investment decisions.


We start by `feature engineering`

In [12]:
# make copy of original data

invest_df_copy = invest_df.copy()

In cell below we choose important features based on the domain knowledge and goal of the project for implementation of the model. **goal is to build financial product recommender system** the features we select are based on

- **demographics & preferences**

- **current or past product usage**

- **user preferences**

- **Trust & digital access** 