marketing_df : bank client data: 
https://www.kaggle.com/datasets/adityamhaske/bank-marketing-dataset?resource=download

1 - age (numeric)

2 - job : type of job (categorical:
"admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

* ##### related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

* ##### other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

* ##### Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: "yes","no")

Missing Attribute Values: None
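
Although the description reports no missing values, several categorical columns carry an "unknown" label and `pdays` uses -1 as a sentinel, so both behave like missing data in practice. A minimal sketch of how these placeholders could be counted, assuming a toy frame that only mimics the column names documented above (the rows are invented for illustration):

```python
import pandas as pd

# Toy rows mimicking the documented columns; values are illustrative only.
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician", "retired"],
    "education": ["tertiary", "unknown", "secondary", "primary"],
    "contact": ["unknown", "unknown", "cellular", "telephone"],
    "poutcome": ["unknown", "unknown", "success", "unknown"],
    "pdays": [-1, -1, 184, -1],
})

# Share of "unknown" labels per categorical column.
categorical = ["job", "education", "contact", "poutcome"]
unknown_share = (toy[categorical] == "unknown").mean()

# Share of clients never contacted before (pdays sentinel of -1).
never_contacted_share = (toy["pdays"] == -1).mean()

print(unknown_share)
print(never_contacted_share)
```

Running the same two expressions on the real frame would show how much of each column is effectively unknown before deciding whether to impute, flag, or drop.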



loan_df : Features:
https://www.kaggle.com/datasets/sahideseker/loan-default-prediction-dataset?utm_source=copilot.com&select=loan_default_prediction.csv

loan_id: Unique loan identifier

income: Monthly income of the applicant

loan_amount: Total amount of the loan

employment_status: Employment status (Employed / Unemployed)

default: Whether the loan was defaulted (1 = Yes, 0 = No)


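INSEE API: selected datasets

1. Population (values of `POPREF_MEASURE`)

    - 'PMUN': municipal population (population municipale), i.e. people whose usual residence is in the territory,

    - 'PCAP': population counted separately (population comptée à part), i.e. people whose usual residence is elsewhere but who keep a residence in the territory,

    - 'PTOT': total legal population of the département (PTOT = PMUN + PCAP).

Selected dataset IDs:

| # | Theme | Dataset ID | Title |
|---|-------|------------|-------|
| 2 | Income | DS_ERFS_MENAGE_SL | Revenu disponible et pauvreté (disposable income and poverty) |
| 3 | Unemployment | DS_RP_EMPLOI_LR_PRINC | Population active et chômage (labour force and unemployment) |
| 4 | Housing | DS_RP_LOGEMENT_PRINC | Résidences principales (main residences) |
| 5 | Age structure | DS_BTS_SAL_EQTP_SEX_AGE | Salaires par sexe et âge (wages by sex and age) |
| 6 | Education | DS_RP_DIPLOMES_PRINC | Diplômes de la population (qualifications of the population) |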

In [None]:
%load_ext autoreload
%autoreload 2
    
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import json
import importlib
import insee_api_functions
import functions

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

importlib.reload(insee_api_functions)
importlib.reload(functions)

df_uci_clean = pd.read_csv("../data/clean/load_and_clean_uci_data_davy.csv")
df_loan = pd.read_csv("../data/raw/loan_default.csv")


In [None]:
df_uci_clean.head()

## 1. The UCI dataset

### 1.1. Explore and clean the UCI dataset

In [3]:
from functions import explore_dataset

explore_dataset(df_uci)


=== Dataset ===

Shape: (45211, 17)

Columns:
 ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']

Info:
<class 'pandas.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        45211 non-null  int64
 1   job        45211 non-null  str  
 2   marital    45211 non-null  str  
 3   education  45211 non-null  str  
 4   default    45211 non-null  str  
 5   balance    45211 non-null  int64
 6   housing    45211 non-null  str  
 7   loan       45211 non-null  str  
 8   contact    45211 non-null  str  
 9   day        45211 non-null  int64
 10  month      45211 non-null  str  
 11  duration   45211 non-null  int64
 12  campaign   45211 non-null  int64
 13  pdays      45211 non-null  int64
 14  previous   45211 non-null  int64
 15  poutcome   45211 non-null  s

Conclusion 
The UCI dataset looks clean and well-structured:

45,211 rows and 17 columns

No missing values

Columns are correctly typed (int and str)

In [4]:
df_uci.describe()


Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [6]:
# Columns renaming 
from functions import rename_uci_columns
df_uci = rename_uci_columns(df_uci)
df_uci

Unnamed: 0,age,job_type,marital_status,education_level,credit_default,account_balance,housing_loan,personal_loan,contact_type,contact_day,contact_month,call_duration_sec,num_contacts_current_campaign,days_since_last_contact,num_previous_contacts,previous_outcome,subscribed
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [9]:
from functions import create_age_groups, create_campaign_groups, create_contact_missing_flag

# Group clients by age band
df_uci = create_age_groups(df_uci)
# Group clients by number of contacts in the current campaign
df_uci = create_campaign_groups(df_uci)

# Flag rows where the contact type is missing
df_uci = create_contact_missing_flag(df_uci)
df_uci

Unnamed: 0,age,job_type,marital_status,education_level,credit_default,account_balance,housing_loan,personal_loan,contact_type,contact_day,contact_month,call_duration_sec,num_contacts_current_campaign,days_since_last_contact,num_previous_contacts,previous_outcome,subscribed,age_group_bin,campaign_group,contact_missing_flag
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,56-65,1 contact,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,36-45,1 contact,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,26-35,1 contact,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,46-55,1 contact,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,26-35,1 contact,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes,46-55,2–3 contacts,0
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes,65+,2–3 contacts,0
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes,65+,4–5 contacts,0
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no,56-65,4–5 contacts,0


In [20]:
from functions import clean_uci_dataset
df_uci_clean = clean_uci_dataset(df_uci)
df_uci_clean.head()


Unnamed: 0,age,job_type,marital_status,education_level,credit_default,account_balance,housing_loan,personal_loan,contact_type,contact_day,contact_month,call_duration_sec,num_contacts_current_campaign,days_since_last_contact,num_previous_contacts,previous_outcome,subscribed,age_group_bin,campaign_group,contact_missing_flag
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,56-65,1 contact,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,36-45,1 contact,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,26-35,1 contact,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,46-55,1 contact,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,26-35,1 contact,0


In [21]:
from functions import detect_outliers_iqr

# Deal with the outliers
df_uci_clean_outliers = detect_outliers_iqr(
    df_uci_clean,
    columns=['age', 'account_balance', 'call_duration_sec', 'num_contacts_current_campaign'],
    method="flag"
)
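
The body of `detect_outliers_iqr` lives in `functions` and is not shown here. As a rough sketch of what IQR-based flagging usually looks like, the hypothetical helper below (`flag_outliers_iqr`, with an assumed multiplier `k=1.5`) adds one 0/1 `<column>_outlier_flag` column per input column, matching the flag column names visible in the outputs of this notebook:

```python
import pandas as pd

def flag_outliers_iqr(df, columns, k=1.5):
    """Hypothetical helper: add a 0/1 `<col>_outlier_flag` column per input column."""
    out = df.copy()  # leave the caller's frame untouched
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # 1 when the value falls outside [lower, upper], else 0
        out[f"{col}_outlier_flag"] = (~out[col].between(lower, upper)).astype(int)
    return out

# Toy data: one obviously extreme balance.
toy = pd.DataFrame({"account_balance": [0, 10, 20, 30, 40, 1000]})
flagged = flag_outliers_iqr(toy, ["account_balance"])
print(flagged)
```

Flagging rather than dropping keeps all rows available, so the decision to exclude extreme balances or call durations can be made later, per model.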


In [18]:
detect_outliers_iqr(df_clean, ['age', 'account_balance']).head()


Unnamed: 0,age,job_type,marital_status,education_level,credit_default,account_balance,housing_loan,personal_loan,contact_type,contact_day,...,num_contacts_current_campaign,days_since_last_contact,num_previous_contacts,previous_outcome,subscribed,age_group_bin,campaign_group,contact_missing_flag,age_outlier_flag,account_balance_outlier_flag
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,1,-1,0,unknown,no,56-65,1 contact,0,0,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,1,-1,0,unknown,no,36-45,1 contact,0,0,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,1,-1,0,unknown,no,26-35,1 contact,0,0,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,1,-1,0,unknown,no,46-55,1 contact,0,0,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,1,-1,0,unknown,no,26-35,1 contact,0,0,0


In [22]:
df_uci_clean.to_csv("load_and_clean_uci_data_davy.csv", index=False)

In [24]:
df_uci_clean

Unnamed: 0,age,job_type,marital_status,education_level,credit_default,account_balance,housing_loan,personal_loan,contact_type,contact_day,contact_month,call_duration_sec,num_contacts_current_campaign,days_since_last_contact,num_previous_contacts,previous_outcome,subscribed,age_group_bin,campaign_group,contact_missing_flag
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no,56-65,1 contact,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no,36-45,1 contact,0
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no,26-35,1 contact,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no,46-55,1 contact,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no,26-35,1 contact,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes,46-55,2–3 contacts,0
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes,65+,2–3 contacts,0
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes,65+,4–5 contacts,0
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no,56-65,4–5 contacts,0


In [26]:
df_uci_clean.isna().value_counts()

age    job_type  marital_status  education_level  credit_default  account_balance  housing_loan  personal_loan  contact_type  contact_day  contact_month  call_duration_sec  num_contacts_current_campaign  days_since_last_contact  num_previous_contacts  previous_outcome  subscribed  age_group_bin  campaign_group  contact_missing_flag
False  False     False           False            False           False            False         False          False         False        False          False              False                          False                    False                  False             False       False          False           False                   45211
Name: count, dtype: int64

In [None]:
from functions import handle_missing_values_uci

df_uci_final = handle_missing_values_uci(df_uci_clean)

df_uci_final.y.isna().value_counts()


In [None]:
# Save cleaned and explored uci dataset
df_uci_final.to_csv("load_and_clean_uci_data_davy.csv", index=False)


### 1.2. EDA

In [None]:
from functions import eda_uci_dataset
eda_uci_dataset(df_uci_final, target_col='y')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot: Duration vs Subscription
plt.figure(figsize=(8,5))
sns.boxplot(x=df_uci_final['y'], y=df_uci_final['duration'])
plt.title("Call Duration by Subscription Outcome")
plt.xlabel("Subscribed (0 = No, 1 = Yes)")
plt.ylabel("Call Duration (seconds)")
plt.show()


In [None]:
import pandas as pd
import seaborn as sns

#Bar Chart: Subscription Rate by Number of Contacts (campaign)
campaign_rates = df_uci_final.groupby('campaign')['y'].mean().reset_index()

plt.figure(figsize=(10,5))
sns.barplot(data=campaign_rates, x='campaign', y='y')
plt.title("Subscription Rate by Number of Contacts")
plt.xlabel("Number of Contacts (campaign)")
plt.ylabel("Subscription Rate")
plt.show()
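
The group mean above is only a subscription rate if `y` is numeric. The dataset description lists `y` as "yes"/"no", so if the cleaning step has not already encoded it, a mapping like the one below would be needed first. This is a sketch on toy data; the `y_num` column name is an assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "campaign": [1, 1, 2, 2, 2],
    "y": ["yes", "no", "no", "no", "yes"],
})

# Map the documented labels to 1/0 so that the group mean is a rate.
toy["y_num"] = toy["y"].map({"yes": 1, "no": 0})
rates = toy.groupby("campaign")["y_num"].mean()
print(rates)
```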


In [None]:
# Scatterplot with Trendline: Duration vs Probability of Subscription
sns.lmplot(data=df_uci_final, x='duration', y='y', logistic=True, height=5, aspect=1.5)
plt.title("Probability of Subscription vs Call Duration")
plt.xlabel("Call Duration (seconds)")
plt.ylabel("Probability of Subscription")
plt.show()


In [None]:
df_uci_final['contact'].value_counts()

In [None]:
df_uci_final

In [None]:
df_uci_final[df_uci_final['contact'] == 'telephone'].sort_values(by='campaign', ascending=False).head()


In [None]:
df_uci_final[df_uci_final['contact'] == 'cellular'].value_counts()

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def analyze_age_campaign(df, age_col='age', campaign_col='campaign', target_col='y', show_plot=True):
    """
    Groups age into bins, combines with campaign count, 
    and computes subscription rate for each combination.
    """

    # 1. Create age groups
    df['age_group'] = pd.cut(
        df[age_col],
        bins=[0, 25, 35, 45, 55, 65, 100],
        labels=['18–25', '26–35', '36–45', '46–55', '56–65', '65+']
    )

    # 2. Compute subscription rate by age group + campaign
    grouped = df.groupby(['age_group', campaign_col])[target_col].mean().reset_index()

    # 3. Sort by highest subscription rate
    grouped_sorted = grouped.sort_values(by=target_col, ascending=False)

    # 4. Optional heatmap
    if show_plot:
        pivot = grouped.pivot(index='age_group', columns=campaign_col, values=target_col)

        plt.figure(figsize=(12,6))
        sns.heatmap(pivot, annot=True, cmap='Blues')
        plt.title("Subscription Rate by Age Group and Campaign Count")
        plt.xlabel("Number of Contacts (campaign)")
        plt.ylabel("Age Group")
        plt.show()

    return grouped_sorted


results = analyze_age_campaign(df_uci_final)

### 1.3. Feature Engineering

In [None]:
# Load cleaned dataset
df_uci_feat = pd.read_csv("load_and_clean_uci_data_davy.csv")
df_uci_feat

In [None]:
df_uci_feat.columns

## * H1: Can we predict whether a client will subscribe to a term deposit based on their profile and campaign data?

Our goal is to build a model that helps the marketing team target the right clients.

Steps:
First, we define our features and target. The target is `y`, which is categorical (binary: subscribed or not), so this is a supervised classification problem.

### Preprocessing: Encode + Scale + Concatenate

In [None]:
# 2. Separate features and target
X = df_uci_feat.drop(columns=['y', 'day', 'default', 'contact'])
y = df_uci_feat['y']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)

In [None]:
X_train

In [None]:
# 3. Identify column types
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numerical_cols = X.select_dtypes(include='number').columns.tolist()


#### Encode categorical features as numerical columns

In [None]:
print(type(categorical_cols))
print(categorical_cols)


In [None]:
# Ask for a dense NumPy array instead of a sparse matrix; drop='first' removes one redundant dummy column per feature
ohe = OneHotEncoder(sparse_output=False, drop='first')
ohe.fit(X_train[['job', 'marital', 'education', 'housing', 'loan', 'month']])  # .fit() learns the unique categories of each column

In [None]:
# Transform train and test sets
X_train_trans_np = ohe.transform(X_train[['job', 'marital', 'education', 'housing', 'loan', 'month']])
X_test_trans_np = ohe.transform(X_test[['job', 'marital', 'education', 'housing', 'loan', 'month']])
X_test_trans_np

In [None]:
# Convert to DataFrames
X_train_trans_df = pd.DataFrame(X_train_trans_np, columns=ohe.get_feature_names_out(), index=X_train.index)
X_test_trans_df = pd.DataFrame(X_test_trans_np, columns=ohe.get_feature_names_out(), index=X_test.index)
X_train_trans_df

#### Scale numerical features (standardization)

In [None]:
available_numerical_cols = [col for col in numerical_cols if col in X_train.columns]

scaler = StandardScaler()
scaler.fit(X_train[available_numerical_cols])

# Transform train and test sets
X_train_scaled_np = scaler.transform(X_train[available_numerical_cols])
X_test_scaled_np  = scaler.transform(X_test[available_numerical_cols])

# Convert to DataFrames
X_train_standarized = pd.DataFrame(X_train_scaled_np, columns=scaler.get_feature_names_out(), index=X_train.index)
X_test_standarized  = pd.DataFrame(X_test_scaled_np, columns=scaler.get_feature_names_out(), index=X_test.index)
X_test_standarized

#### Concatenate transformed features

In [None]:
# Final model-ready datasets
X_train_full = pd.concat([X_train_standarized,X_train_trans_df], axis=1)
X_test_full = pd.concat([X_test_standarized,X_test_trans_df], axis=1)

### Train and evaluate baseline model on our dataset

Logistic Regression (baseline)

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train_full, y_train)

# Predictions
y_pred = log_reg_model.predict(X_test_full)
y_proba = log_reg_model.predict_proba(X_test_full)[:, 1]


In [None]:
from functions import evaluate_model
evaluate_model(
    model=log_reg_model,
    X_test=X_test_full,
    y_test=y_test,
    threshold=0.5,
    title="Logistic Regression Evaluation"
)


In [None]:
from functions import evaluate_model
# Logistic Regression Evaluation (Threshold = 0.3)
evaluate_model(
    model=log_reg_model,
    X_test=X_test_full,
    y_test=y_test,
    threshold=0.3,
    title="Logistic Regression (Threshold = 0.3)"
)
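
The `threshold` argument presumably converts predicted probabilities into class labels; the internals of `evaluate_model` are not shown, so the helper below (`predict_with_threshold`) is a hypothetical stand-in that illustrates the idea on made-up probabilities:

```python
import numpy as np

def predict_with_threshold(probabilities, threshold=0.5):
    """Hypothetical helper: label as positive when P(class 1) >= threshold."""
    return (np.asarray(probabilities) >= threshold).astype(int)

proba = np.array([0.10, 0.35, 0.48, 0.72])
print(predict_with_threshold(proba, 0.5))  # [0 0 0 1]
print(predict_with_threshold(proba, 0.3))  # [0 1 1 1]
```

Lowering the threshold from 0.5 to 0.3 labels more clients as likely subscribers, which tends to raise recall at the cost of precision.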


### Logistic Regression with class weighting

In [None]:
from sklearn.linear_model import LogisticRegression
from functions import classification_diagnostic_plot, evaluate_model

# 1. Train on preprocessed features
log_reg_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
log_reg_balanced.fit(X_train_full, y_train)

# 2. Evaluate on preprocessed test features
evaluate_model(
    model=log_reg_balanced,
    X_test=X_test_full,
    y_test=y_test,
    threshold=0.5,
    title="Logistic Regression balanced Evaluation"
)


### Interpretations:

- 471 true positives: subscribers the model correctly identified.
- 517 false negatives: actual subscribers the model failed to identify.
- 442 false positives: clients the model predicted as "yes" who did not subscribe.
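
As a worked check, precision, recall, and F1 for the positive class follow directly from the three counts quoted above (true negatives are not needed for these metrics):

```python
# Counts taken from the confusion matrix discussed above.
tp, fp, fn = 471, 442, 517

precision = tp / (tp + fp)   # share of predicted subscribers who really subscribed
recall = tp / (tp + fn)      # share of real subscribers the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.516 recall=0.477 f1=0.496
```

In other words, roughly half of the clients flagged by the model subscribe, and a little under half of all subscribers are caught.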

### Train a Random Forest baseline

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=300, class_weight='balanced', random_state=0)
rf_model.fit(X_train_full, y_train)

# Predictions
y_pred = rf_model.predict(X_test_full)
y_proba = rf_model.predict_proba(X_test_full)[:, 1]

In [None]:
from functions import evaluate_model
evaluate_model(
    model=rf_model,
    X_test=X_test_full,
    y_test=y_test,
    threshold=0.5,
    title="Random Forest Evaluation"
)


### Train Random Forest on SMOTE-balanced data
SMOTE creates synthetic minority-class examples rather than duplicating existing minority rows (random oversampling) or deleting majority rows (undersampling).
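
To make the idea of a synthetic example concrete: SMOTE picks a minority sample and one of its nearest minority neighbours, then places a new point at a random position on the segment between them. The sketch below shows only that interpolation step with NumPy; it is illustrative, not the `imblearn` implementation, and `smote_like_sample` is a hypothetical name:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(x, neighbor, rng):
    """Illustrative only: one synthetic point on the segment between two minority samples."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)

x = np.array([1.0, 2.0])          # a minority-class sample
neighbor = np.array([3.0, 6.0])   # one of its minority-class neighbours
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Because new points are interpolated in feature space, SMOTE should be applied to the training split only, after encoding and scaling, so the test set contains real observations only.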

In [None]:
from imblearn.over_sampling import SMOTE
#  Apply SMOTE to the training data
sm = SMOTE(random_state=0)
X_train_sm, y_train_sm = sm.fit_resample(X_train_full, y_train) # X_train_sm → oversampled feature matrix and y_train_sm → oversampled target vector

rf_smote_model = RandomForestClassifier(n_estimators=300, random_state=0) 
rf_smote_model.fit(X_train_sm, y_train_sm)

In [None]:
from functions import evaluate_model
evaluate_model(
    model=rf_smote_model,
    X_test=X_test_full,
    y_test=y_test,
    threshold=0.5,
    title="Random Forest + SMOTE Evaluation"
)


### XGBoost (with scale_pos_weight)

In [None]:
from xgboost import XGBClassifier

# Estimate class imbalance ratio
imbalance_ratio = y_train.value_counts()[0] / y_train.value_counts()[1]

xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=imbalance_ratio,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=0
)

xgb_model.fit(X_train_full, y_train)
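
`scale_pos_weight` is commonly set to the ratio of negative to positive training examples. The `value_counts()[0]` lookup above is label-based, so it relies on the target being encoded as integers 0 and 1. A small sketch of a more explicit computation, on a toy target that is assumed to be 0/1 encoded:

```python
import pandas as pd

# Toy target: 8 negatives and 2 positives.
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

n_negative = int((y_toy == 0).sum())
n_positive = int((y_toy == 1).sum())
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)  # 4.0
```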


In [None]:
from functions import evaluate_model
evaluate_model(
    model=xgb_model,
    X_test=X_test_full,
    y_test=y_test,
    threshold=0.5,
    title="XGBoost Evaluation"
)


### Business interpretation of the hypothesis: can we predict who will subscribe?

In [None]:

answer :
metrics results: precision, recall, F1, AUC
confusion matrix and ROC curve : 

### Combining data again


In [None]:
# Percentage of missing values per column (numerator and denominator from the same frame)
df_uci_clean.isna().sum() * 100 / len(df_uci_clean)

In [None]:
df_uci_clean["y"].value_counts(dropna=False)

## 2. Explore and clean the loan dataset

In [None]:
from functions import explore_dataset

explore_dataset(df_loan)

In [None]:
from functions import clean_loan_dataset

df_loan_clean = clean_loan_dataset(df_loan)
df_loan_clean.info()
df_loan_clean.head()


### Preprocessing: Encode + Scale + Concatenate

In [None]:
from functions import preprocess_data

categorical_cols = ['job', 'marital', 'education', 'housing', 'loan', 'month']
numerical_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous', 'contact_missing']

X_train_full, X_test_full, ohe, scaler = preprocess_data(
    X_train, X_test, categorical_cols, numerical_cols
)


## 3. Extract INSEE API data

In [None]:
from insee_api_functions import fetch_melodi_dataset
df_insee = fetch_melodi_dataset("DS_POPULATIONS_REFERENCE", "DEP-44")
df_insee

In [None]:
from insee_api_functions import fetch_melodi_dataset

# Test extraction for Île-de-France (example: Paris)
df_test = fetch_melodi_dataset("DS_POPULATIONS_REFERENCE", "DEP-75")
df_test.shape


In [None]:
from insee_api_functions import fetch_dep_dataset

df_france = fetch_dep_dataset("DS_POPULATIONS_REFERENCE")
df_france.shape



In [None]:
df_france.info()

In [None]:
# Mapping dictionary for readable département names

departement_name_map = {
    "DEP-01": "Ain", "DEP-02": "Aisne", "DEP-03": "Allier", "DEP-04": "Alpes-de-Haute-Provence",
    "DEP-05": "Hautes-Alpes", "DEP-06": "Alpes-Maritimes", "DEP-07": "Ardèche", "DEP-08": "Ardennes",
    "DEP-09": "Ariège", "DEP-10": "Aube", "DEP-11": "Aude", "DEP-12": "Aveyron", "DEP-13": "Bouches-du-Rhône",
    "DEP-14": "Calvados", "DEP-15": "Cantal", "DEP-16": "Charente", "DEP-17": "Charente-Maritime",
    "DEP-18": "Cher", "DEP-19": "Corrèze", "DEP-2A": "Corse-du-Sud", "DEP-2B": "Haute-Corse",
    "DEP-21": "Côte-d'Or", "DEP-22": "Côtes-d'Armor", "DEP-23": "Creuse", "DEP-24": "Dordogne",
    "DEP-25": "Doubs", "DEP-26": "Drôme", "DEP-27": "Eure", "DEP-28": "Eure-et-Loir", "DEP-29": "Finistère",
    "DEP-30": "Gard", "DEP-31": "Haute-Garonne", "DEP-32": "Gers", "DEP-33": "Gironde", "DEP-34": "Hérault",
    "DEP-35": "Ille-et-Vilaine", "DEP-36": "Indre", "DEP-37": "Indre-et-Loire", "DEP-38": "Isère",
    "DEP-39": "Jura", "DEP-40": "Landes", "DEP-41": "Loir-et-Cher", "DEP-42": "Loire", "DEP-43": "Haute-Loire",
    "DEP-44": "Loire-Atlantique", "DEP-45": "Loiret", "DEP-46": "Lot", "DEP-47": "Lot-et-Garonne",
    "DEP-48": "Lozère", "DEP-49": "Maine-et-Loire", "DEP-50": "Manche", "DEP-51": "Marne", "DEP-52": "Haute-Marne",
    "DEP-53": "Mayenne", "DEP-54": "Meurthe-et-Moselle", "DEP-55": "Meuse", "DEP-56": "Morbihan",
    "DEP-57": "Moselle", "DEP-58": "Nièvre", "DEP-59": "Nord", "DEP-60": "Oise", "DEP-61": "Orne",
    "DEP-62": "Pas-de-Calais", "DEP-63": "Puy-de-Dôme", "DEP-64": "Pyrénées-Atlantiques",
    "DEP-65": "Hautes-Pyrénées", "DEP-66": "Pyrénées-Orientales", "DEP-67": "Bas-Rhin", "DEP-68": "Haut-Rhin",
    "DEP-69": "Rhône", "DEP-70": "Haute-Saône", "DEP-71": "Saône-et-Loire", "DEP-72": "Sarthe",
    "DEP-73": "Savoie", "DEP-74": "Haute-Savoie", "DEP-75": "Paris", "DEP-76": "Seine-Maritime",
    "DEP-77": "Seine-et-Marne", "DEP-78": "Yvelines", "DEP-79": "Deux-Sèvres", "DEP-80": "Somme",
    "DEP-81": "Tarn", "DEP-82": "Tarn-et-Garonne", "DEP-83": "Var", "DEP-84": "Vaucluse", "DEP-85": "Vendée",
    "DEP-86": "Vienne", "DEP-87": "Haute-Vienne", "DEP-88": "Vosges", "DEP-89": "Yonne", "DEP-90": "Territoire de Belfort",
    "DEP-91": "Essonne", "DEP-92": "Hauts-de-Seine", "DEP-93": "Seine-Saint-Denis", "DEP-94": "Val-de-Marne",
    "DEP-95": "Val-d’Oise", "DEP-971": "Guadeloupe", "DEP-972": "Martinique", "DEP-973": "Guyane",
    "DEP-974": "La Réunion", "DEP-976": "Mayotte"
}

df_france["departement_name"] = df_france["departement_code"].map(departement_name_map)
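
`Series.map` with a dictionary returns a missing value for any code that is absent from the dictionary, so it is worth checking for unmapped codes after this step. A minimal sketch on toy data, where `DEP-999` is a made-up code used only to trigger the case:

```python
import pandas as pd

name_map = {"DEP-44": "Loire-Atlantique", "DEP-75": "Paris"}
toy = pd.DataFrame({"departement_code": ["DEP-44", "DEP-75", "DEP-999"]})
toy["departement_name"] = toy["departement_code"].map(name_map)

# Codes with no entry in the mapping end up with a missing name.
unmapped = toy.loc[toy["departement_name"].isna(), "departement_code"].unique().tolist()
print(unmapped)  # ['DEP-999']
```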


In [None]:
# Rename all columns to match the format
df_france.rename(columns={
    "GEO": "geo_code",
    "FREQ": "frequency",
    "TIME_PERIOD": "year",
    "POPREF_MEASURE": "population_type",
    "OBS_VALUE_NIVEAU": "population_value",
    "departement_code": "departement_code",  # already correct
    "departement_name": "departement_name"   # newly added
}, inplace=True)


In [None]:
list(df_france['population_type'].unique())

In [None]:
#df_france.to_csv("insee_population_by_departement.csv", index=False)


In [None]:
import requests

url = "https://api.insee.fr/melodi/catalog/all"
response = requests.get(url, timeout=30)  # avoid hanging on a stalled connection
response.raise_for_status()               # fail early on HTTP errors
raw_data = response.json()

print(type(raw_data))
print(len(raw_data))
print(raw_data[0])


In [None]:
response.json()


In [None]:
from insee_api_functions import list_melodi_datasets

catalog = list_melodi_datasets()
catalog.head()



In [None]:
#Income datasets
revenu_df = catalog[catalog["title_fr"].str.contains("revenu", case=False, na=False)]
revenu_df

In [None]:
# Unemployment datasets
chomage_df = catalog[catalog["title_fr"].str.contains("chômage", case=False, na=False)]
chomage_df

In [None]:
# Housing datasets
logement_df = catalog[catalog["title_fr"].str.contains("logement", case=False, na=False)]
logement_df

In [None]:
# Age datasets
age_df = catalog[catalog["title_fr"].str.contains("âge", case=False, na=False)]
age_df

In [None]:
# Education datasets
education_df = catalog[catalog["title_fr"].str.contains("diplôme|scolarité|éducation", case=False, na=False)]
education_df

In [None]:
# Save each DataFrame
revenu_df.to_csv("indicators_revenu.csv", index=False)
chomage_df.to_csv("indicators_chomage.csv", index=False)
logement_df.to_csv("indicators_logement.csv", index=False)
age_df.to_csv("indicators_age.csv", index=False)
education_df.to_csv("indicators_education.csv", index=False)


In [None]:
from insee_api_functions import fetch_indicator_for_all_departements

df_income = fetch_indicator_for_all_departements("DS_ERFS_MENAGE_SL")
df_income.head()


In [None]:
catalog.columns