## Milestone 2: Predictive Modeling
In this milestone, we merge the feature and target files, preprocess the result, and train and tune classification models. Each step below includes a short description and the code that implements it.

### 1. Load and Inspect the Data
#### What We'll Do
We will load the files into DataFrames and inspect their structure. This helps verify that features (df_X.csv) and the target variable (df_y.csv) are correctly aligned. We'll also reference the info.csv metadata to guide preprocessing decisions.

In [2]:
import pandas as pd

# Load features and target variable
df_X = pd.read_csv("df_X.csv")  # Replace with correct path
df_y = pd.read_csv("df_y.csv")  # Replace with correct path
info = pd.read_csv("info.csv")  # Metadata for understanding features

# Inspect the data
print(df_X.head())  # First rows of features
print(df_y.head())  # First rows of target variable
print(info.head())  # Feature metadata
df_X.info()  # Summary of df_X (info() prints directly and returns None, so no print() is needed)
df_y.info()  # Summary of df_y

   Unnamed: 0  age           job  marital  education default  balance housing  \
0           0   58    management  married   tertiary      no     2143     yes   
1           1   44    technician   single  secondary      no       29     yes   
2           2   33  entrepreneur  married  secondary      no        2     yes   
3           3   47   blue-collar  married        NaN      no     1506     yes   
4           4   33           NaN   single        NaN      no        1      no   

  loan contact  day_of_week month  duration  campaign  pdays  previous  \
0   no     NaN            5   may       261         1     -1         0   
1   no     NaN            5   may       151         1     -1         0   
2  yes     NaN            5   may        76         1     -1         0   
3   no     NaN            5   may        92         1     -1         0   
4   no     NaN            5   may       198         1     -1         0   

  poutcome  
0      NaN  
1      NaN  
2      NaN  
3      NaN  
4  
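
The alignment check described in this step can be made explicit rather than read off the printed rows. The following is a minimal sketch, assuming both files share the row identifier column Unnamed: 0 seen in the output; the helper names check_alignment and missing_summary and the toy frames are hypothetical and stand in for the real data.

```python
import pandas as pd


def check_alignment(features: pd.DataFrame, target: pd.DataFrame, key: str = "Unnamed: 0") -> bool:
    """Return True when both frames list the same key values in the same order."""
    if len(features) != len(target):
        return False
    return bool(features[key].equals(target[key]))


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy stand-ins for df_X and df_y (hypothetical values, not the real data)
toy_X = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "age": [58, 44, 33],
    "job": ["management", None, "entrepreneur"],
    "contact": [None, None, None],
})
toy_y = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "y": ["no", "no", "yes"]})

aligned = check_alignment(toy_X, toy_y)
missing = missing_summary(toy_X)
print("Aligned:", aligned)
print(missing)
```

Running the same two helpers on the real df_X and df_y would show up front which columns need imputation in the preprocessing step.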

### 2. Merge Features and Target Variable
#### What We'll Do
We will combine df_X and df_y into a single DataFrame by joining on their shared row identifier column (Unnamed: 0). This keeps each target value (y) aligned with its feature row. The merged result is stored back in df_X.

In [3]:
# Inspect df_X
print(df_X.head())
print(df_X.columns)

# Inspect df_y
print(df_y.head())
print(df_y.columns)



In [4]:
import pandas as pd

# Load df_X and df_y
df_X = pd.read_csv("df_X.csv")
df_y = pd.read_csv("df_y.csv")

# Confirm DataFrames are loaded correctly
print(df_X.shape)  # Should show non-zero rows and columns
print(df_y.shape)  # Should show non-zero rows and columns


(45211, 17)
(45211, 2)


In [5]:
# Merge DataFrames on Unnamed: 0
df_X = df_X.merge(df_y, on='Unnamed: 0', how='inner')

# Inspect the merged DataFrame
print(df_X.shape)  # Check the number of rows and columns after the merge
print(df_X.head())  # Display the first few rows


(45211, 18)
   Unnamed: 0  age           job  marital  education default  balance housing  \
0           0   58    management  married   tertiary      no     2143     yes   
1           1   44    technician   single  secondary      no       29     yes   
2           2   33  entrepreneur  married  secondary      no        2     yes   
3           3   47   blue-collar  married        NaN      no     1506     yes   
4           4   33           NaN   single        NaN      no        1      no   

  loan contact  day_of_week month  duration  campaign  pdays  previous  \
0   no     NaN            5   may       261         1     -1         0   
1   no     NaN            5   may       151         1     -1         0   
2  yes     NaN            5   may        76         1     -1         0   
3   no     NaN            5   may        92         1     -1         0   
4   no     NaN            5   may       198         1     -1         0   

  poutcome   y  
0      NaN  no  
1      NaN  no  
2    
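
An inner merge silently drops rows whose key is missing on either side, so it can help to have pandas verify the join. The sketch below is one way to do that, assuming the key column Unnamed: 0 is unique in both files; the toy frames are hypothetical, and dropping the identifier afterwards is an optional choice rather than something the notebook does.

```python
import pandas as pd

# Toy stand-ins for df_X and df_y (hypothetical values)
toy_X = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "age": [58, 44, 33]})
toy_y = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "y": ["no", "no", "yes"]})

merged = toy_X.merge(
    toy_y,
    on="Unnamed: 0",
    how="outer",            # keep unmatched rows so they can be counted
    validate="one_to_one",  # raise MergeError if a key is duplicated on either side
    indicator=True,         # adds a "_merge" column: both / left_only / right_only
)

unmatched = int((merged["_merge"] != "both").sum())
print("Unmatched rows:", unmatched)

# Keep matched rows only and drop the helper columns before modeling
merged = merged[merged["_merge"] == "both"].drop(columns=["_merge", "Unnamed: 0"])
print(merged.shape)
```

If unmatched is zero, the result is equivalent to the inner merge used in this step, with the added guarantee that no key was duplicated.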

### 3. Preprocess the Data
#### What We'll Do

**1. Convert the target variable (y) to binary format** (no = 0, yes = 1).

**2. Handle Missing Values:**

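- Replace missing values in contact and poutcome with "unknown". The job and education columns also contain missing values (visible in the first rows printed above); they are not filled at this step, and the one-hot encoder used in the next step treats them as a separate category.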

- Impute missing values in pdays with -1, which indicates "not previously contacted."

**3. Exclude Irrelevant Features:**

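- Drop the duration column: the length of a call is only known once the call has ended, at which point the outcome is known as well, so it cannot be used to predict the outcome in advance.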

**4. Separate Numerical and Categorical Features:**

- Divide the dataset into numerical (e.g., age, balance) and categorical (e.g., job, education) features so that each group can receive the appropriate transformation (scaling or encoding).

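Why? Proper preprocessing ensures the dataset is clean, consistent, and suitable for machine learning models. Categorical features must be encoded as numbers, and numerical features should be scaled for models that are sensitive to feature magnitude, such as logistic regression. Tree-based models such as Random Forest and XGBoost are largely unaffected by scaling.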

In [6]:
# Convert 'y' to binary format (no=0, yes=1)
df_X['y'] = df_X['y'].map({'no': 0, 'yes': 1})
print(df_X['y'].unique())  # Verify 'y' contains only [0, 1]

# Handle missing values
df_X['contact'] = df_X['contact'].fillna("unknown")  # Fill missing 'contact' with "unknown"
df_X['poutcome'] = df_X['poutcome'].fillna("unknown")  # Fill missing 'poutcome' with "unknown"
df_X['pdays'] = df_X['pdays'].fillna(-1)  # Replace missing 'pdays' with -1

# Check if 'duration' exists before dropping it
if 'duration' in df_X.columns:
    df_X.drop(columns=['duration'], inplace=True)
    print("'duration' column dropped.")
else:
    print("'duration' column not found in the dataset.")

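# Separate numerical and categorical features
# Note: default, housing, loan, and day_of_week appear in neither list, so they are left out of the model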
numerical_features = ['age', 'balance', 'campaign', 'pdays', 'previous']
categorical_features = ['job', 'marital', 'education', 'contact', 'month', 'poutcome']

# Extract numerical and categorical features
X_num = df_X[numerical_features]
X_cat = df_X[categorical_features]
y = df_X['y']  # Extract target variable


[0 1]
'duration' column dropped.
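
Two quick checks can make the preprocessing above more robust: confirming that every target label was covered by the mapping, and confirming that no categorical gaps remain. The sketch below uses a hypothetical toy frame; filling job and education with "unknown" is an assumption shown for illustration, not what the cell above does.

```python
import pandas as pd

# Toy frame with an unexpected label and leftover gaps (hypothetical values)
toy = pd.DataFrame({
    "job": ["management", None, "technician", "services"],
    "education": ["tertiary", None, None, "secondary"],
    "contact": [None, "cellular", None, "telephone"],
    "y": ["no", "yes", "Yes", "no"],  # "Yes" is not covered by the mapping
})

# Series.map returns NaN for any value missing from the dictionary
mapped = toy["y"].map({"no": 0, "yes": 1})
unmapped_labels = toy.loc[mapped.isna(), "y"].unique().tolist()
print("Labels without a mapping:", unmapped_labels)

# Fill every categorical gap with an explicit "unknown" level
categorical_columns = ["job", "education", "contact"]
filled = toy[categorical_columns].fillna("unknown")
remaining_missing = int(filled.isna().sum().sum())
print("Missing categorical values after filling:", remaining_missing)
```

An empty list of unmapped labels and a remaining count of zero indicate that the target and the categorical columns are ready for encoding.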


### 4. Transform Features
#### What We'll Do
1. **Scale Numerical Features:**

- Use StandardScaler to standardize each numerical feature to zero mean and unit variance.

2. **Encode Categorical Features:**

- Convert categorical data into numerical format using one-hot encoding, dropping the first level of each feature to avoid redundant columns.

3. **Combine Processed Features:**

- Merge scaled numerical features and encoded categorical features into a single feature matrix.

In [7]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import numpy as np

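# Scale numerical features
# Note: fitting on the full dataset before the train/test split lets test-set statistics
# influence the scaler; fitting on the training rows only would avoid this leakage
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)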

# One-hot encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop='first')  # Updated to avoid deprecated 'sparse'
X_cat_encoded = encoder.fit_transform(X_cat)

# Combine scaled numerical and encoded categorical features
X_final = np.concatenate([X_num_scaled, X_cat_encoded], axis=1)

# Verify the shape of the final dataset
print("Shape of X_final:", X_final.shape)


Shape of X_final: (45211, 37)


### 5. Split the Data
#### What We'll Do
Divide the dataset into training (80%) and testing (20%) sets to evaluate model performance on unseen data.

In [8]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.2, random_state=42)

# Verify the sizes of the splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (36168, 37)
X_test shape: (9043, 37)
y_train shape: (36168,)
y_test shape: (9043,)


#### What the Code Does
1. train_test_split():

- Takes the full dataset (X_final and y) and splits it into X_train, X_test, y_train, and y_test.

- test_size=0.2 means 20% of the data is allocated for testing.

2. random_state=42:

- Ensures reproducibility. Using the same random seed will result in consistent splits every time.

3. Shapes of the Splits:

- Print the shapes to confirm the data is split correctly.
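
When the target classes are unbalanced, a purely random split can leave the two sets with slightly different class shares. A minimal sketch of a stratified split follows, assuming this behavior is wanted; the toy arrays and the 10% positive rate are hypothetical and do not reflect the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 10% positives (hypothetical, not the real class ratio)
toy_features = np.arange(1000).reshape(-1, 1)
toy_target = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_target,
    test_size=0.2,
    random_state=42,
    stratify=toy_target,  # keep the positive share equal in both splits
)

train_share = y_tr.mean()
test_share = y_te.mean()
print("Positive share in train:", train_share)
print("Positive share in test:", test_share)
```

Passing stratify=y to the call in this step would apply the same idea to the real data without changing the split sizes.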

### 6. Train Machine Learning Models
#### Objective
In this step, we’ll train various machine learning models to predict whether a client will subscribe to a term deposit (y) based on the preprocessed features. We’ll implement multiple models to compare their performance in subsequent steps.

#### Models to Train

1. **Logistic Regression**: A simple, interpretable baseline model for binary classification.

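2. **Random Forest**: An ensemble of decision trees trained on bootstrap samples; it captures non-linear relationships and feature interactions.

3. **Gradient Boosting (e.g., XGBoost)**: Builds trees sequentially, each one correcting the errors of the trees before it; often reaches high accuracy on tabular data.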

##### 6.1 Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression

# Train Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred_lr = log_reg.predict(X_test)

##### 6.2 Random Forest

In [10]:
from sklearn.ensemble import RandomForestClassifier

# Train Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf.predict(X_test)

##### 6.3 Gradient Boosting (XGBoost)

In [13]:
from xgboost import XGBClassifier

# Train Gradient Boosting (XGBoost)
xgb = XGBClassifier(random_state=42)
xgb.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb.predict(X_test)

#### What the Code Does
1. Initializes each model with default hyperparameters.

2. Fits each model to the training data (X_train, y_train).

3. Makes predictions on the test set (X_test).
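
The predictions produced above still need a reference point before they can be judged. The sketch below is one possible check, assuming accuracy, ROC AUC, and a per-class report are the metrics of interest: it compares a model with a baseline that always predicts the most frequent class. The data is synthetic and hypothetical; on the real data, y_test and the y_pred_* arrays would take the place of the toy variables.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced problem (hypothetical data): one informative feature, minority of positives
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(2000, 3))
toy_target = (toy_features[:, 0] + rng.normal(scale=0.5, size=2000) > 1.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
model_accuracy = accuracy_score(y_te, model.predict(X_te))
model_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"Model ROC AUC:     {model_auc:.3f}")
print(classification_report(y_te, model.predict(X_te), zero_division=0))
```

A model whose accuracy is only slightly above the baseline may still be useful if its ROC AUC and minority-class recall are clearly better, which is why more than one metric is printed.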

### 7. Perform Basic Hyperparameter Tuning
#### Objective
Improve model performance by adjusting key parameters such as:

1. **Tree Depth (max_depth)**: Controls the maximum depth of the decision trees.

2. **Number of Trees (n_estimators)**: Determines how many trees the model will build.

3. **Learning Rate (learning_rate)**: Applies to gradient boosting models (e.g., XGBoost) and controls the contribution of each tree.

#### 7.1 Hyperparameter Tuning for Random Forest
We’ll run a grid search over n_estimators (number of trees), max_depth (tree depth), and min_samples_split (minimum samples required to split a node):

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid
param_grid_rf = {
    'n_estimators': [100, 200, 300], # Number of trees
    'max_depth': [10, 20, None], # Tree depth
    'min_samples_split': [2, 5, 10] # Minimum samples to split a node
}

# Perform Grid Search
grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid_rf,
    scoring='accuracy',
    cv=5,
    verbose=1
)

# Fit on training data
grid_search_rf.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters for Random Forest:", grid_search_rf.best_params_)
print("Best Score for Random Forest:", grid_search_rf.best_score_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best Parameters for Random Forest: {'max_depth': None, 'min_samples_split': 10, 'n_estimators': 200}
Best Score for Random Forest: 0.8948517303556132


#### 7.2 Hyperparameter Tuning for XGBoost
We’ll run the same kind of grid search over learning_rate, n_estimators, and max_depth:

In [14]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid for XGBoost
param_grid_xgb = {
    'n_estimators': [100, 200, 300],    # Number of trees
    'max_depth': [3, 5, 7],             # Tree depth
    'learning_rate': [0.01, 0.1, 0.2]   # Learning rate
}

# Grid search for XGBoost
grid_search_xgb = GridSearchCV(
    estimator=XGBClassifier(random_state=42),
    param_grid=param_grid_xgb,
    scoring='accuracy',  # You can also use 'roc_auc'
    cv=5,                # 5-fold cross-validation
    verbose=1
)

# Fit the grid search on the training data
grid_search_xgb.fit(X_train, y_train)

# Print best parameters and the best score
print("Best XGBoost Parameters:", grid_search_xgb.best_params_)
print("Best XGBoost Score:", grid_search_xgb.best_score_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Best XGBoost Parameters: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 300}
Best XGBoost Score: 0.8948517265332405


#### Why Perform This Tuning?
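- Random Forest: Increasing n_estimators usually stabilizes predictions but lengthens training time, with diminishing returns. Limiting max_depth or raising min_samples_split constrains tree growth and helps control overfitting.

- XGBoost: learning_rate scales the contribution of each tree, so a smaller rate generally needs more trees (n_estimators) to reach the same fit. In the search above, the best configuration pairs the smallest learning rate (0.01) with the largest n_estimators (300), both at the edge of the grid, so a wider grid may be worth trying.

- Both searches reach essentially the same cross-validated accuracy (about 0.895), so accuracy alone does not separate the two models here.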
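
The grid searches above rank candidates by accuracy. As a sketch of the alternative already hinted at in the XGBoost cell, the example below ranks by ROC AUC and then scores the refitted best model on held-out rows. The data is synthetic and hypothetical, and the grid is deliberately tiny so the example runs quickly; it is not a recommended search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy imbalanced problem (hypothetical data)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(600, 4))
toy_target = (toy_features[:, 0] + rng.normal(scale=0.5, size=600) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=42, stratify=toy_target
)

# Deliberately tiny grid so the sketch runs in seconds
small_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    scoring="roc_auc",  # ranks candidates by how well probabilities order the classes
    cv=3,
)
search.fit(X_tr, y_tr)

# refit=True (the default) retrains the best configuration on all training rows,
# and score() applies the same ROC AUC scorer to the held-out rows
test_auc = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print(f"Cross-validated ROC AUC: {search.best_score_:.3f}")
print(f"Held-out ROC AUC:        {test_auc:.3f}")
```

Comparing the cross-validated score with the held-out score gives a rough indication of whether the tuned model generalizes beyond the folds it was selected on.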