# Telephone Subscription Prediction

This project requires you to

* analyze and clean data
* identify predictive features and test hypotheses
* choose between fundamental classification metrics
* fit and fine-tune a logreg and xgboost model for prediction

The task we are solving is to predict whether a customer will subscribe to a telephone service or not.

The project is organized in several Modules. Each Module has a set of tasks for you to complete. <br>
Please make sure to complete one task before moving on to the next.

In [91]:
# These are the packages to be loaded
# Do not alter

%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

from scipy.stats import chi2_contingency
from scipy.stats import f_oneway


# TODO this is path to data folder, please remove if not required
data_folder = "telephonic_term/"

## 1. Analyze and Clean Data

You start your data project by analyzing the data <br> <br>

In [21]:
train_df = pd.read_csv(data_folder + "train.csv")
feats = list(train_df.columns[:-1])
label = train_df.columns[-1]

In [None]:
# Task 1
# For this task, simply have a look at the data
# Load the training data, look at the feature values, and answer these questions
# - Is each feature continuous or categorical?
# - For continuous features, what is the range of values?
# - For categorical features, what are the possible unique values?

# TODO: Solution below, please remove 
train_df.head()

In [None]:
# TODO: Solution below, please remove
train_df.describe()

In [None]:
num_cols = ['age','balance','duration']
cat_cols = ['job','marital','education','default','housing','loan','contact','month','day','campaign','pdays','previous','poutcome']
label = 'subscribed'

for col in cat_cols:
    print(train_df[col].value_counts())

# TODO: Understand features - 'campaign','pdays','previous','poutcome' (treating as categorical for now, since they are listed in cat_cols)

In [None]:
train_df.info()

In [None]:
# Task 2
# Remove any duplicate rows from the training data
# Ask ChatGPT! : How does duplicate data impact performance of a Logistic Regression model

def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame :
    '''
    Complete this function to return a de-duplicated dataframe
    '''
    
    # TODO: remove the rest of code in this function
    df = df.drop_duplicates()
    return df


# Do not change this code
row_count = remove_duplicates(train_df).shape[0]
print(row_count)
remove_duplicates(train_df.copy())

In [None]:
# Task 3
# Fill any missing values in the data: column means for numeric features, modes for categorical ones
# (the function runs even if there are no missing values)
# Ask ChatGPT! : How do missing values impact performance of a Logistic Regression model

train_col_miss = {}
def fill_missing_value(df: pd.DataFrame, train=False) -> pd.DataFrame:
    '''
    Complete this function to fill missing values (if there are any)
    with the mean value of the column for numerical features,
    and the mode for categorical features

    `train_col_miss` is a dictionary where keys are features
    and values are the fill value (mean or mode) learned from the train set

    Hint: Use feats to iterate through columns
    '''

    # TODO: Solution below, please remove

    for col in num_cols:
        if train:
            train_col_miss[col] = df[col].mean()
        
        df[col] = df[col].fillna(train_col_miss[col])

    for col in cat_cols:
        if train:
            # mode() returns a Series; keep its first entry as the scalar fill value
            train_col_miss[col] = df[col].mode()[0]

        df[col] = df[col].fillna(train_col_miss[col])

    return df

# Do not change this line of code
fill_missing_value(train_df.copy(), train=True)

In [None]:
# Task 4
# Identify whether there are any outlier values in each of the features
# Ask ChatGPT! : How do outliers impact performance of a Logistic Regression Model


train_col_bounds = {}
def clip_outliers(df: pd.DataFrame, train=False) -> pd.DataFrame:
    '''
    Complete this function to get the lower and upper bounds of each column
    Replace values below the lower bound or above the upper bound with the column mean

    `train_col_bounds` is a dictionary where keys are features
    and values are tuples (x, y), x being the lower bound and y being the upper bound

    Hint: Use feats to iterate through columns
    '''

    # TODO: solution below, please remove
    for col in num_cols:
        if train:
            # Bounds sit 1.5 * IQR beyond the first and third quartiles
            p25, p75 = df[col].quantile([.25, .75])
            iqr = p75 - p25
            train_col_bounds[col] = (p25 - 1.5 * iqr, p75 + 1.5 * iqr)

        low, high = train_col_bounds[col]
        # Replace out-of-bound values with the train mean stored by fill_missing_value
        out_of_bounds = (df[col] < low) | (df[col] > high)
        df[col] = df[col].mask(out_of_bounds, train_col_miss[col])
    print(train_col_bounds)
    return df


# Do not change this code
clip_outliers(train_df.copy(), train=True)

In [None]:
# Task 5
# Identify the amount of imbalance in the data
# Ask ChatGPT! : How does imbalance impact performance of a Logistic Regression Model
# Knowing this - what should you do when you build the model?


def test_imbalance(df: pd.DataFrame) -> float:
    '''
    Complete this function to return the percentage of 0 labels in the data
    '''

    # TODO: Solution below, please remove
    # Negative labels are 'no' before encoding and 0 after encoding
    return df[label].isin(['no', 0]).mean() * 100

test_imbalance(train_df.copy())

## Feature Engineering

First, let's work on encoding the categorical variables.
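
As a minimal sketch of the idea (not the task solution), the snippet below fits a `LabelEncoder` on a toy training column and reuses the stored encoder on a toy test column. The frames `toy_train` and `toy_test` and the dictionary `encoders` are illustrative assumptions, not part of the project data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames; the values are made up for illustration only.
toy_train = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})
toy_test = pd.DataFrame({"marital": ["single", "married"]})

encoders = {}  # hypothetical store, one fitted encoder per feature

# Fit on the training column only, then keep the fitted encoder.
le = LabelEncoder()
le.fit(toy_train["marital"])
encoders["marital"] = le

# Reuse the stored encoder so train and test share the same mapping.
train_codes = encoders["marital"].transform(toy_train["marital"])
test_codes = encoders["marital"].transform(toy_test["marital"])

# LabelEncoder assigns integer codes in sorted order of the class names.
print(dict(zip(le.classes_, range(len(le.classes_)))))
print(train_codes, test_codes)
```

Note that `transform` raises an error for a category that was not seen during `fit`, so a test set with new categories needs separate handling.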

In [None]:
# Task 6
# Encoding categorical variables 
# Ask ChatGPT: What are the different types of categorical encoding and advantages of each

cat_enc = {}
def encode_cat(train_df: pd.DataFrame, train=False):
    '''
    Use 'LabelEncoder' to encode categorical values
    The cat_enc dictionary can be used to store the encoders for each feature
    This dictionary can then be used to transform the test set

    Also encode the label with no being 0, yes being 1
    '''

    # TODO: Solution below, remove

    for col in cat_cols:

        if train:
            le = LabelEncoder()
            le.fit(train_df[col])
            cat_enc[col] = le

        # Use the stored encoder so this also works when train=False
        train_df[col] = cat_enc[col].transform(train_df[col])


    train_df[label] = train_df[label].apply(lambda x: 0 if x == 'no' else 1)

    return train_df


# Do not change this code
train_df = pd.read_csv(data_folder + "train.csv")
train_df = encode_cat(train_df.copy(), train=True)
train_df

In [None]:
train_df[label].value_counts()

For a feature to be useful it must have some predictive power. <br>
In a classification problem the label is a `categorical` value, and the numeric features we have are `continuous` valued.

In this case the statistical test we use to check whether a feature is useful is called `Student's t-test`. <br>
This test is used when the `categorical` label has only two values and the feature is `continuous` valued. 

Some of the other tests you might need to know are - <br>
https://medium.com/towards-data-science/every-statistical-test-to-check-feature-dependence-773a21cd6722


Now, one of the assumptions of the Student's t-test is `Normality`, i.e. the feature values should follow a normal distribution for each value of the label. <br>
The test is robust enough that with more than about 30 samples per class the results usually still hold, but let's still have a look at the features and see if any of them are normally distributed. <br>
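
To make the test concrete, here is a small sketch on synthetic data (not the project data). It compares a feature whose mean differs between the two label groups with a feature that has the same distribution in both. All array names and distribution parameters are assumptions chosen for illustration. It uses `equal_var=False`, which selects Welch's variant of the t-test and does not assume equal variances across groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic feature whose mean differs between the two label groups.
shift_no = rng.normal(loc=0.0, scale=1.0, size=500)
shift_yes = rng.normal(loc=1.0, scale=2.0, size=200)

# Synthetic feature drawn from the same distribution in both groups.
noise_no = rng.normal(size=500)
noise_yes = rng.normal(size=200)

# equal_var=False runs Welch's t-test instead of Student's t-test.
t_shift, p_shift = stats.ttest_ind(shift_no, shift_yes, equal_var=False)
t_noise, p_noise = stats.ttest_ind(noise_no, noise_yes, equal_var=False)

print("shifted feature:", t_shift, p_shift)
print("noise feature:  ", t_noise, p_noise)
```

A small p-value for the shifted feature suggests the group means differ, which is the sense in which the feature carries predictive signal on its own.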

In [None]:
# Task 7
# Understand feature distributions - plot histogram of feature values for each class
# This will help you understand if the feature values overlap or not
# Ask ChatGPT: How does feature value overlap influence a Logistic Regression model


# TODO: Solution below, please remove
feats = num_cols
print(num_cols)
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

for i, feat in enumerate(feats):
    row = i // 3
    col = i % 3
    train_df[train_df[label]==0][feat].hist(ax=axes[row, col])
    train_df[train_df[label]==1][feat].hist(ax=axes[row+1, col])
    axes[row, col].set_title(feat)

plt.tight_layout()
plt.show()


In [None]:
# Task 7.a
# Check if any pair of numeric features are correlated
# Since all features are continuous you can use the pandas default correlation (Pearson Corr) 

# Ask ChatGPT! : How do correlated feature impact performance of a Logistic Regression Model. 
# Knowing this - what should you do when you build the model?

def calc_corr_num(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Complete the function to calculate all pairwise correlation
    From the output Identify the pair of features that are highly correlated.
    '''
    
    # TODO: Solution below, please remove
    return df[num_cols].corr()


# Do not change this code
calc_corr_num(train_df[feats])


In [None]:
# Task 7.b
# Check if any pair of categorical features are correlated
# Since all features are categorical you can use Chi-Square

# Ask ChatGPT: What is the downside of using p-values when doing multiple hypothesis testing?

def calc_corr_cat(df: pd.DataFrame) -> None:
    '''
    Complete the function to calculate all pairwise correlation
    From the output Identify the pair of features that are highly correlated.
    '''

    # TODO: Solution below, please remove
    # Visit each unordered pair of distinct features exactly once
    for i, feat1 in enumerate(cat_cols):
        for feat2 in cat_cols[i + 1:]:

            # Create a contingency table
            contingency_table = pd.crosstab(df[feat1], df[feat2])

            # Perform the Chi-Square test
            chi2, p_value, dof, expected = chi2_contingency(contingency_table)
            
            print(f"{feat1} {feat2} Chi-Square value: {chi2} P-value: {p_value}")


# Do not change this code
calc_corr_cat(train_df)


In [None]:
# Task 7.c
# Check if pairs of categorical - numeric features are correlated
# Since one feature is categorical and the other numeric, you can use one-way ANOVA (f_oneway)

def calc_corr_cat_num(df: pd.DataFrame) -> None:
    '''
    Complete the function to calculate all pairwise correlation
    From the output Identify the pair of features that are highly correlated.
    '''
    
    # TODO: Solution below, please remove
    for feat1 in num_cols:
        for feat2 in cat_cols:
            groups = df.groupby(feat2)[feat1].apply(list)
            f_statistic, p_value = f_oneway(*groups)

            print(feat1, feat2, f_statistic, p_value)


# Do not change this code
calc_corr_cat_num(train_df)


In [None]:
# Task 8.a
# Check which of the features are predictive (i.e. will a customer subscribe)
# For this you can use a specific statistical test called 'Welch's t-test'
# Which of the features are not predictive, assuming significance level alpha = 0.01?

# Ask ChatGPT: What are Welch's t-test and Student's t-test, and how do they help determine important features?

def run_student_ttest(df, col) -> tuple[str, float]:
    '''
    Write a function to return p-values from the Welch's t-test
    for feature passed into the function
    '''

    # TODO: Solution below, please remove
    t_stat, p_val = stats.ttest_ind(df[df[label]==0][col], df[df[label]==1][col], 
                                        equal_var=False)  # Welch's t-test
    
    print(col, t_stat, p_val)
    return col, p_val

for feat in num_cols:
    run_student_ttest(train_df, feat)

In [None]:
# Task 8.b
# Check which of the features are predictive (i.e. will a customer subscribe)
# For this you can use a specific statistical test called the 'Chi-square test'
# Which of the features are not predictive, assuming significance level alpha = 0.01?

def run_chi_test(df, col) -> tuple[str, float]:
    '''
    Write a function to return p-values from the Chi-Square test
    for feature passed into the function
    '''

    # Create a contingency table
    contingency_table = pd.crosstab(df[col], df[label])

    # Perform the Chi-Square test
    chi2, p_value, dof, expected = chi2_contingency(contingency_table)
    
    print(f"{col} {label} Chi-Square value: {chi2} P-value: {p_value}")
    return col, p_value

for feat in cat_cols:
    run_chi_test(train_df, feat)

In [None]:
# TODO: PENDING COMPLETION - FIGURE OUT FEATURES

# Task 10
# Let's try removing irrelevant features and one feature from each correlated pair

def drop_feature(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    '''
    Create a function to remove the columns passed to this function
    '''

    # TODO: Solution below, please remove
    # errors='ignore' skips any names that are not present in df
    df = df.drop(columns=cols, errors='ignore')

    return df


# Do not change this code
proc_df = drop_feature(train_df.copy(), cols = ['Total Volume Donated (c.c.)', 'Months since First Donation'])
proc_df.columns


We see that one of the features is not relevant. Can you think of a way to convert that feature to one that is more likely to predict if a person will donate blood in the upcoming month (i.e. March 2007)? 

Hint: Think about using the feature `Number of Dontations` with the irrelevant feature to create a feature that indicates how many donations the donor makes per month.

## Understanding Classification Metrics

The most common classification metrics are - 
* Accuracy
* Precision
* Recall
* F1-Score

Let's ask ChatGPT what these are - <br>


Now having analyzed the data (from Task 1), choose the best metrics for your task. <br>
<br>

Assume you got the following information from business - 
* If you predict someone is going to subscribe, but they don't - this is a huge concern. You want to reduce such `false positives` as much as possible.
* If you predict someone is not going to subscribe, and they do - it is ok.

Knowing the above - decide which metric to use. <br>
Irrespective of what you use evaluate performance using F1 as well.
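
To see how these metrics react to false positives and false negatives, the sketch below scores two hypothetical sets of predictions against the same toy labels. The arrays `y_true`, `cautious_pred` and `eager_pred` are invented for illustration. The cautious predictions make no false positives but miss two subscribers. The eager predictions catch every subscriber at the cost of three false positives.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = subscribed, 0 = did not subscribe (illustrative values).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]

# Predicts "subscribe" rarely: 1 true positive, 0 false positives, 2 false negatives.
cautious_pred = [1, 0, 0, 0, 0, 0, 0, 0]
# Predicts "subscribe" often: 3 true positives, 3 false positives, 0 false negatives.
eager_pred = [1, 1, 1, 1, 0, 1, 1, 0]

cautious_scores = (
    precision_score(y_true, cautious_pred),
    recall_score(y_true, cautious_pred),
    f1_score(y_true, cautious_pred),
)
eager_scores = (
    precision_score(y_true, eager_pred),
    recall_score(y_true, eager_pred),
    f1_score(y_true, eager_pred),
)

print("cautious (precision, recall, f1):", cautious_scores)
print("eager    (precision, recall, f1):", eager_scores)
```

Precision only drops when false positives appear, while recall only drops when subscribers are missed. F1 balances the two.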

In [98]:
# Task 11
# Write the function to calculate the metric you have chosen

def calc_perf(y_act: list, y_pred: list) -> float:
    '''
    Complete this function to calculate the metric
    you have chosen
    '''

    # TODO: Solution below, please remove
    val = precision_score(y_act, y_pred)
    return val

## Training the Model

Now, finally, we can start training the model. When training an ML model it's important to have three datasets
* Train dataset - which you use to train the model and learn its parameters
* Validation dataset - the dataset you use to figure out which parameters are best
* Test dataset - the hidden dataset that you DO NOT look at. It's only used to estimate performance on future unseen datasets.

Let's start by creating these datasets - 
1. Load the train dataset 
2. ONLY run the de-duplication function on the train set (let's see what performance we get without outlier removal and feature engineering)
3. Split the train dataset 80:20 to create a new train dataset and a validation set
4. Load the test dataset
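
The 80:20 split in step 3 can be sketched on synthetic, imbalanced data as follows. The arrays `X_toy` and `y_toy` are assumptions made up for illustration. The `stratify` argument asks `train_test_split` to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 10% positive labels (illustrative only).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

# 80:20 split that preserves the share of positives in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=100
)

print("train size:", len(y_tr), "positive share:", y_tr.mean())
print("valid size:", len(y_va), "positive share:", y_va.mean())
```

Without `stratify`, a random split of a small or heavily imbalanced dataset can leave the validation set with noticeably more or fewer positives than the training set, which makes validation metrics harder to trust.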

In [101]:
# Task 12
# Preprocess the train data to create training and validation data


def create_dataset(df: pd.DataFrame) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    '''
    Remove duplicate data from the train file alone (using `remove_duplicates` from earlier)
    Encode the categorical values using the `encode_cat` function
    Split the train file data into train and valid sets (keep in mind what we learned about imbalance in Task 5)
        Hint: use train_test_split (set seed to 100) with the `stratify` argument
        Ask ChatGPT: Why is it important to stratify when creating training and validation sets for imbalanced datasets
    
    Return np.ndarrays for train features, train labels, valid features, valid labels

    '''

    # TODO: Solution below, please remove
    df = remove_duplicates(df)
    df = encode_cat(df, train=True)

    feats = num_cols + cat_cols

    X = df[feats].values
    y = df[label].values

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, 
                                                          stratify=y, random_state=100)

    return X_train, y_train, X_valid, y_valid


# Do not change this
X_train, y_train, X_valid, y_valid = create_dataset(pd.read_csv(data_folder + 'train.csv'))

Now, let's build a logistic regression model. Extract the features and labels from the train dataset.

In [None]:
# Task 13
# Train a baseline Logistic Regression model with default parameters

def train_base(X_train: np.array, y_train: np.array) -> LogisticRegression:
    '''
    Complete this function to
    Train a baseline LogisticRegression Model with default parameter
    Use random_state = 100 to keep results consistent
    '''

    # TODO: Solution below, please remove
    model = LogisticRegression(random_state=100)
    model.fit(X_train, y_train)

    return model


# Do not change this
model = train_base(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))

## Improving baseline Model and Analyzing Design Choices

In [None]:
# Task 14
# Now lets try fixing the imbalance we saw in task 5 - Does it improve performance?

def rebalance_df(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Write a function to 
    (a) remove duplicate rows in data
    (b) balance the number of positive and negative samples in train_data
    Hint: Use downsampling, use random_state = 100

    Return the balanced dataframe
    '''


    # TODO: Solution below, please remove
    # De-duplicate the dataframe passed in, not the global train_df
    df = remove_duplicates(df)

    pos_df = df[df[label] == 'yes']
    neg_df = df[df[label] == 'no']

    df = pd.concat([neg_df.sample(frac=0.4, random_state=100), pos_df])
    return df


# Do not change the following code
# load the data
train_df = pd.read_csv(data_folder + 'train.csv')
print(train_df.shape, train_df[label].value_counts(normalize=True))

print("\n\n")

# balance the data
train_df = rebalance_df(train_df)
print(train_df.shape, train_df[label].value_counts(normalize=True))

# check the performance with rebalance dataset
print("\n\n")
X_train, y_train, X_valid, y_valid = create_dataset(train_df)
model = train_base(X_train, y_train)
print(X_train.shape)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))

In [None]:
# Task 15
# Another way to combat data imbalance is configuring class weights in the Logistic Regression model 

def train_tune_model(X_train: np.array, y_train: np.array):
    '''
    Write the function to train a Logistic Regression model
    and use `class_weight` parameter
    Use random_state = 100 to keep results consistent
    '''

    # TODO: Solution below, please remove code
    model = LogisticRegression(class_weight={0:0.25, 1:0.75 }, random_state=100)
    model.fit(X_train, y_train)

    return model


# Do not change the following code
train_df = pd.read_csv(data_folder + 'train.csv')
X_train, y_train, X_valid, y_valid = create_dataset(train_df)

model = train_tune_model(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))


# Ask ChatGPT: Why does one of the methods perform better than the other
# Hint: It could be related to how the feature distributions for both classes overlap, as seen in Task 7

In [None]:
# TODO: PENDING COMPLETION - FIGURE OUT FEATURES

# Task 16
# Let's try 
# (1) dropping the irrelevant feature and one of the correlated features
# (2) adding the new features
# How does all this influence performance - which feature set do you finally keep?

# Do not change this code
train_df = pd.read_csv(data_folder + 'training_data.csv')
train_df = rebalance_df(train_df)

# ADD YOUR CODE HERE
# Hint: use `create_new_feature` and `drop_feature` functions mentioned before
# TODO: Solution below, remove it
train_df = create_new_feature(train_df)
train_df = drop_feature(train_df, cols = ['Total Volume Donated (c.c.)'])


# Do not change this code
X_train, y_train, X_valid, y_valid = create_dataset(train_df)
model = train_base(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))


# Did the performance drop? If so, why? 
# When you add which feature back in is the performance coming back up?

# Does it make sense that adding a feature that failed student t-test helped improve performance of Logistic Regression?
# Ask ChatGPT why this could happen (Hint it could be related to how features work together)

In [None]:
# Task 17
# Normalizing features to see how it impacts performance


def train_tune_model(X_train: np.array, y_train: np.array, 
                     scaler):
    '''
    Complete this function to normalize features 
    X_train: is the train features values
    y_train: is train labels
    scaler: The scaler you have chosen

    Bonus [Optional] Task : 
    Fine-tune the Logistic Regression Model.
    Some of the parameters you may want to experiment with are - solver, penalty and C
    '''

    # Hint: Use StandardScaler to normalize features
    # Ask ChatGPT what type of feature scaling is best for Logistic Regression and why

    # TODO: Solution below, remove this code
    X_train = scaler.fit_transform(X_train)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    return (model, scaler)

scaler = StandardScaler()

# Do not change this code
train_df = pd.read_csv(data_folder + "train.csv")
train_df = rebalance_df(train_df)
# train_df = drop_feature(train_df, cols = ['Total Volume Donated (c.c.)'])

X_train, y_train, X_valid, y_valid = create_dataset(train_df)
model, scaler = train_tune_model(X_train, y_train, scaler)

X_valid = scaler.transform(X_valid) # we use the same scaler you have used earlier
pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))

In [None]:
# Task 18
# Understand the coefficient from the Logistic Regression model

def get_model_coeff(model : LogisticRegression, feats: list):
    '''
    Complete this function to print pair of values
    (feature name, coeff value)
    '''

    # TODO: Remove this code
    coeffs = list(model.coef_)[0]
    print(coeffs)

    for i in range(len(feats)):
        print(feats[i], coeffs[i])
    
    print("Intercept", model.intercept_)


get_model_coeff(model, num_cols + cat_cols)

# Ask ChatGPT: How do you interpret the coefficients of a Logistic Regression model

In [None]:
# Task 19
# Finally apply all the transformation you deem best and get prediction on test datasets

test_data = pd.read_csv(data_folder + 'test_data.csv')

# TODO: Build Pipeline for test data