# Telephone Subscription Prediction

This project requires you to:

* Analyze and clean data
* Identify predictive features and test hypotheses
* Choose between fundamental classification metrics
* Fit and fine-tune a Logistic Regression model and a Random Forest model for prediction

The task is to predict whether a customer will subscribe to a telephone service.

The project is organized into several modules. Each module has a set of tasks for you to complete. <br>
Please make sure to complete one task before moving on to the next.

In [2]:
# These are the packages to load
# Do not alter

%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from scipy.stats import chi2_contingency
from scipy.stats import f_oneway

## Module 1

### Task 1-1 : Load the data

- **Description**: Load the data from the file `train.csv` and assign it to the variable `train_df`
- **Code Instruction**: 
    1. Import the dataset using the path `train.csv` and assign it to `train_df`
    2. From the column names, get all columns but the last one and assign them to `feats` as a list
    3. Get the last column name and assign it to `label`

In [4]:
train_df = pd.read_csv("train.csv")
feats = list(train_df.columns[:-1])
label = train_df.columns[-1]

### Task 1-2: Understand the data

- **Description**: In this task, look at a sample of the train data and understand what the feature values look like
- **Code Instruction**: 
    1. Take a look at the top n rows of data
    2. Get the summary statistics of the numeric features in the data
    3. Get the distribution of feature values for categorical features in the data

In [None]:
# TODO: Solution below, please remove 
train_df.head()

In [None]:
# TODO: Solution below, please remove

train_df.describe()

In [None]:
num_cols = ['age','balance','duration']
cat_cols = ['job','marital','education','default','housing','loan','contact','month','day','campaign','pdays','previous','poutcome']
label = 'subscribed'

for col in cat_cols:
    print(train_df[col].value_counts())

### Task 1-3: Remove any duplicate rows

- **Description**: In this task, remove any duplicate rows in the data
- **Code Instruction**: Complete the function to return a de-duplicated dataframe

`Ask ChatGPT! : How does duplicate data impact the performance of a Logistic Regression model?`

In [None]:
def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame :
    '''
    Complete this function to return a de-duplicated dataframe
    '''
    
    # TODO: remove the rest of code in this function
    df = df.drop_duplicates()
    return df


# Do not change this code
row_count = remove_duplicates(train_df).shape[0]
print(row_count)
remove_duplicates(train_df.copy())
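
As a minimal sketch of what de-duplication does, the toy example below drops exact duplicate rows from a small dataframe. The frame `toy_df` and its values are made up for illustration, and resetting the index is an optional extra step that the task itself does not require.

```python
import pandas as pd

# Toy data: the second and third rows are exact duplicates
toy_df = pd.DataFrame({
    "age": [30, 41, 41, 52],
    "job": ["admin", "technician", "technician", "retired"],
})

# keep="first" (the default) retains the first occurrence of each duplicated row
deduped = toy_df.drop_duplicates().reset_index(drop=True)

print(len(toy_df), "->", len(deduped))
```

Without `reset_index(drop=True)`, the surviving rows keep their original index labels, which matters later if the frame is concatenated with another frame by index.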

### Task 1-4: Fill Missing Values

- **Description**: Fill any missing values in the data with the column mean for numerical columns and the column mode for categorical columns
- **Code Instruction**: 
    1. Use `train_col_miss` to store the mean/mode value for each numerical/categorical column
    2. Fill missing values with the mean/mode stored in `train_col_miss`
    3. Return a dataframe with the missing values filled in

`Ask ChatGPT! : How do missing values impact the performance of a Logistic Regression model?`

In [None]:
train_col_miss = {}
def fill_missing_value(df: pd.DataFrame, train=False) -> pd.DataFrame:
    '''
    Complete this function to fill missing values (if there are any)
    with the mean value of the column for numerical features,
    and the mode for categorical features

    `train_col_miss` is a dictionary where keys are features
    and values are the mean (or mode) of that feature

    Hint: Use num_cols and cat_cols to iterate through columns
    '''

    # TODO: Solution below, please remove

    for col in num_cols:
        if train:
            train_col_miss[col] = df[col].mean()

        df[col] = df[col].fillna(train_col_miss[col])

    for col in cat_cols:
        if train:
            # mode() returns a Series, so keep only its first value
            train_col_miss[col] = df[col].mode()[0]

        df[col] = df[col].fillna(train_col_miss[col])

    return df

# Do not change this line of code
fill_missing_value(train_df.copy(), train=True)
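
The idea of learning fill values on the training data and reusing them elsewhere can be sketched on toy data as follows. The frames `toy_train` and `toy_valid` and the dictionary `fill_values` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "balance": [100.0, np.nan, 300.0],
    "job": ["admin", "admin", None],
})
toy_valid = pd.DataFrame({
    "balance": [np.nan, 50.0],
    "job": [None, "retired"],
})

# Learn the fill values from the training split only
fill_values = {
    "balance": toy_train["balance"].mean(),  # mean() skips NaN by default
    "job": toy_train["job"].mode()[0],       # mode() returns a Series; take the first value
}

# Reuse the same values on any later split
train_filled = toy_train.fillna(fill_values)
valid_filled = toy_valid.fillna(fill_values)
print(valid_filled)
```

Passing a dictionary to `fillna` fills each column with its own value, so the validation rows are filled with training statistics rather than their own.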

### Task 1-5: Identify outliers

- **Description**: Complete the function below to handle outliers using the `Tukey Outlier method`. Replace each outlier with the mean value calculated earlier.
- **Code Instruction**:
    1. Identify the numerical columns
    2. Identify the upper and lower bounds using the `Tukey Outlier method`
    3. Replace outlier values using the mean from the `train_col_miss` dict defined earlier

`Ask ChatGPT! : How do outliers impact the performance of a Logistic Regression model?`


In [None]:
train_col_bounds = {}
def clip_outliers(df: pd.DataFrame, train=False) -> pd.DataFrame:
    '''
    Complete this function to get the lower and upper bounds of each column
    and replace values outside those bounds with the column mean

    `train_col_bounds` is a dictionary where keys are features
    and values are tuples (x, y), x being the lower bound and y the upper bound

    Hint: Use num_cols to iterate through columns
    '''

    # TODO: solution below, please remove
    for col in num_cols:
        if train:
            p25, p75 = df[col].quantile([.25, .75])
            iqr = p75 - p25
            train_col_bounds[col] = (p25 - 1.5 * iqr, p75 + 1.5 * iqr)

        low, high = train_col_bounds[col]
        # Replace values outside the Tukey fences with the stored training mean
        df[col] = df[col].apply(lambda x: train_col_miss[col] if (x < low or x > high) else x)
    print(train_col_bounds)
    return df


# Do not change this code
clip_outliers(train_df.copy(), train=True)
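
For reference, the Tukey fences are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, where IQR = Q3 - Q1. A small worked example using only the standard library is sketched below; the list `values` is made up, and `method="inclusive"` is chosen here because it interpolates linearly between data points.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# quantiles(..., n=4) returns the three quartile cut points
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [v for v in values if v < lower or v > upper]
print(f"Q1={q1}, Q3={q3}, fences=({lower}, {upper}), outliers={outliers}")
```

Here Q1 is 3 and Q3 is 7, so the fences are -3 and 13 and only the value 100 falls outside them.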

### Task 1-6: Identify Class Imbalance

- **Description**: Complete this function to return the percentage of 0 labels in the data
- **Code Instruction**:
    1. Complete this function to return the percentage of 0-valued labels in the data

`Ask ChatGPT! : How does class imbalance impact the performance of a Logistic Regression model?`

In [None]:
def test_imbalance(df: pd.DataFrame) -> float:
    '''
    Complete this function to return the percentage of 0 labels in the data
    '''

    # TODO: Solution below, please remove
    # The negative class is 'no' in the raw data and 0 after encoding
    return df[label].isin([0, 'no']).mean() * 100

test_imbalance(train_df.copy())

## 2. Feature Engineering

A logistic regression model needs all of its features to be numeric. As such, categorical values need to be transformed. 
A common transformation is `One Hot Encoding`.

### Task 2-1: Encoding categorical variables 

- **Description**: Use OneHotEncoder to encode categorical values
- **Code Instruction**: 
    1. Create a one-hot encoder for the categorical variables, and store the encoder in `cat_enc`
    2. Transform the categorical variables using the one-hot encoder
    3. For the `label` field, convert 'yes' to 1 and 'no' to 0.
    4. Return a dataframe with only the transformed categorical columns (don't keep the old ones)

`Ask ChatGPT: What are the different types of categorical encoding and the advantages of each?`

In [11]:
cat_enc = None
def encode_cat(df: pd.DataFrame, train=False):
    '''
    Use 'OneHotEncoder' to encode categorical values
    Remember that the encoder must be saved in `cat_enc` 
    so that it can be used on a test set later

    '''

    global cat_enc

    # TODO: Solution below, remove
    if train:
        le = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        le = le.fit(df[cat_cols])
        cat_enc = le

    le = cat_enc
    encoded_df = le.transform(df[cat_cols])

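    # Reuse df's index so rows stay aligned in the concat below (e.g. after dropping duplicates)
    encoded_df = pd.DataFrame(encoded_df, columns=le.get_feature_names_out(cat_cols), index=df.index)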

    df = pd.concat([df.drop(columns=cat_cols), encoded_df], axis=1)

    if train:
        df[label] = df[label].apply(lambda x: 0 if x == 'no' else 1)

    return df


# Do not change this code
train_df = pd.read_csv("train.csv")
train_df = encode_cat(train_df.copy(), train=True)

### Task 2-2: Feature Distribution

 - **Description**: If the values of a feature overlap heavily across classes, this tends to reduce a Logistic Regression model's predictive power. As such, it's a good idea to look at the distribution of feature values.
 - **Code Instruction**: 
    1. Get the column names for the numeric fields in `train_df`
    2. For each field, plot one histogram per label value, showing the spread of the values.
    3. See whether any of the distributions overlap substantially.

In [None]:
# TODO: Solution below, please remove
feats = num_cols
print(num_cols)
fig, axes = plt.subplots(2, 3, figsize=(12, 6))

for i, feat in enumerate(feats):
    row = i // 3
    col = i % 3
    train_df[train_df[label]==0][feat].hist(ax=axes[row, col])
    train_df[train_df[label]==1][feat].hist(ax=axes[row+1, col])
    axes[row, col].set_title(feat)

plt.tight_layout()
plt.show()


### Task 2-3: Feature Correlations for Numerical Features

- **Description**: In this task, calculate the pairwise correlation between features
- **Code Instruction**:
    1. Complete the function to calculate all pairwise correlations 
    (Hint: There is an in-built function in `pandas` for this)

`Ask ChatGPT! : How do correlated features impact the performance of a Logistic Regression model?`

In [None]:
def calc_corr_num(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Complete the function to calculate all pairwise correlation
    '''
    
    # TODO: Solution below, please remove
    return df[num_cols].corr()


# Do not change this code
calc_corr_num(train_df[feats])


### Task 2-4: Feature Correlation for Categorical Features

- **Description**: In this task, check whether any pair of categorical features is correlated. Since these features are all categorical, you can use the Chi-Square test
- **Code Instruction**: 
    1. Complete the function to print the correlation between all pairs of categorical features


`Ask ChatGPT: Is Chi-square reliable when fields have high cardinality?` <br>
`Ask ChatGPT: What is the downside of using p-values when doing multiple hypothesis testing?` <br>

Based on the answers above, and to keep the complexity of the project low, we will skip checking correlation for the remaining feature pairs
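
To make the multiple-testing concern concrete, here is a rough back-of-the-envelope sketch. It assumes a significance level of 0.05 and treats the tests as independent, which is a simplification that will not hold exactly for overlapping feature pairs.

```python
from math import comb

n_features = 13   # number of entries in cat_cols in this notebook
alpha = 0.05      # per-test significance level (an assumed, conventional choice)

n_tests = comb(n_features, 2)        # number of unordered feature pairs
fwer = 1 - (1 - alpha) ** n_tests    # chance of at least one false positive if tests were independent
bonferroni_alpha = alpha / n_tests   # stricter per-test threshold under a Bonferroni correction

print(n_tests, round(fwer, 3), bonferroni_alpha)
```

With 78 pairwise tests, at least one "significant" result is very likely to appear by chance alone, which is why a corrected threshold (or a different measure of association) is usually preferred.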

In [None]:
def calc_corr_cat(df: pd.DataFrame) -> None:
    '''
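    Complete the function to run a Chi-Square test on every pair of categorical features
    From the output, identify the pairs of features that are highly correlated.
    '''

    # TODO: Solution below, please remove
    # Visit each unordered pair once and skip pairing a feature with itself
    for i, feat1 in enumerate(cat_cols[:-1]):
        for feat2 in cat_cols[i + 1:]: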

            # Create a contingency table
            contingency_table = pd.crosstab(df[feat1], df[feat2])

            # Perform the Chi-Square test
            chi2, p_value, dof, expected = chi2_contingency(contingency_table)
            
            print(f"{feat1} {feat2} Chi-Square value: {chi2} P-value: {p_value}")


# Do not change this code
temp_df = pd.read_csv('train.csv')
calc_corr_cat(temp_df)


## 3. Understanding Classification Metrics

The most common classification metrics are:
* Accuracy
* Precision
* Recall
* F1-Score

Ask ChatGPT what each of these measures. <br>

### Task 3-1: Choosing the right metric

- **Description**: Having analyzed the data (in Task 1-6), choose the best metric for your task. 
Assume you got the following information from the business:
* If you predict that someone is going to subscribe, but they don't, this is a huge concern. You want to reduce such `false positives` as much as possible.
* If you predict that someone is not going to subscribe, and they do, it is OK. The business can manage.

Knowing the above, decide which metric to use. <br>
Irrespective of what you choose, evaluate performance using F1 as well.

- **Code Instruction**: 
    1. Complete this function to calculate the metric you have chosen

In [17]:
def calc_perf(y_act: list, y_pred: list) -> float:
    '''
    Complete this function to calculate the metric you have chosen
    '''

    # TODO: Solution below, please remove
    val = precision_score(y_act, y_pred)
    return val
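
As a quick reference for how the four metrics relate, the sketch below computes them by hand from confusion-matrix counts. The counts are hypothetical and chosen to mimic an imbalanced dataset.

```python
# Hypothetical confusion-matrix counts for the positive ("subscribed") class
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy comes out at 0.97 even though 40% of the actual positives are missed, which illustrates why accuracy alone can be misleading on imbalanced data. Precision is the metric that directly penalizes false positives.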

## 4. Build baseline model

Now, finally, we can start training the model. When training an ML model, it's important to have three datasets:
* Train dataset - which you use to train the model and learn its parameters
* Validation dataset - which you use to figure out which parameters are best
* Test dataset - the hidden dataset that you DO NOT look at. It is only used to estimate performance on future unseen data.

### Task 4-1: Preprocess the train data to create training and validation data

- **Description**: In this task, we will create the datasets required for training the model
- **Code Instruction**: 
    1. Load the train dataset 
    2. Run the de-duplication function on the train set ONLY
    3. Encode categorical values
    4. Use a `stratified` split of the train dataset (80:20) to create a new train dataset and a validation set

`Ask ChatGPT: Why is it important to stratify when creating training and validation sets for an imbalanced dataset?`

In [None]:
def create_dataset(df: pd.DataFrame) -> tuple[np.array, np.array, np.array, np.array, list]:
    '''
    Remove duplicate data from the train file alone
    Encode the categorical values
    Split the train file data into train and validation sets (keep in mind what we learned about imbalance in Task 1-6)
        Hint: use train_test_split (set the seed to 100) with the `stratify` parameter

    Return np.arrays for train features, train labels, valid features, valid labels, and the list of feature names

    '''

    # TODO: Solution below, please remove
    df = remove_duplicates(df)
    df = encode_cat(df, train=True)

    feats = list(df.columns)
    feats.remove(label)
    feats.remove('ID')

    X = df[feats].values
    y = df[label].values

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, 
                                                          stratify=y, random_state=100)

    return X_train, y_train, X_valid, y_valid, feats


# Do not change this
X_train, y_train, X_valid, y_valid, feats = create_dataset(pd.read_csv('train.csv'))
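
The effect of `stratify` can be checked on synthetic labels, as in the sketch below. The arrays `X` and `y` are placeholders invented for this example, with a 10% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=100
)

# Both splits keep the 10% positive rate of the full data
print(y_tr.mean(), y_va.mean())
```

Without `stratify=y`, a random 20-row validation set drawn from this data could easily contain zero or one positive, which would make precision and F1 on the validation set very unstable.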

### Task 4-2: Build Model

- **Description**: Train a baseline Logistic Regression model with default parameters
- **Code Instruction**: 
    1.  Complete this function to train a baseline LogisticRegression model with default parameters

In [None]:
def train_base(X_train: np.array, y_train: np.array) -> LogisticRegression:
    '''
    Complete this function to
    Train a baseline LogisticRegression Model with default parameter
    Use random_state = 100 to keep results consistent
    '''

    # TODO: Solution below, please remove
    model = LogisticRegression(random_state=100)
    model.fit(X_train, y_train)

    return model


# Do not change this
model = train_base(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))

## 5. Improving the baseline model

### Task 5-1: Fix imbalance to improve performance

- **Description**: Use sampling to balance the number of positive and negative samples in the data
- **Code Instruction**: Complete the following function to
    1. Remove duplicate rows in data
    2. Balance the number of positive and negative samples in train_data


In [None]:
def rebalance_df(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Write a function to 
    (a) remove duplicate rows in data
    (b) balance the number of positive and negative samples in train_data
    Hint: Use downsampling, use random_state = 100
    Hint: Don't forget to reset the index after creating a new dataframe

    Return the balanced dataframe
    '''

    # TODO: Solution below, please remove
    df = remove_duplicates(df)

    pos_df = df[df[label] == 'yes']
    neg_df = df[df[label] == 'no']

    df = pd.concat([neg_df.sample(frac=0.4, random_state=100), pos_df]).reset_index(drop=True)
    return df


# Do not change the following code
# load the data
train_df = pd.read_csv('train.csv')
print(train_df.shape, train_df[label].value_counts(normalize=True))

print("\n\n")

# balance the data
train_df = rebalance_df(train_df)
print("Balanced", train_df.shape, train_df[label].value_counts(normalize=True))

# check the performance with rebalance dataset
print("\n\n")
X_train, y_train, X_valid, y_valid, feats = create_dataset(train_df)
model = train_base(X_train, y_train)
print(X_train.shape)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))

### Task 5-2: Improve the model using class weights

- **Description**: Another way to combat data imbalance is to configure class weights in the Logistic Regression model <br>
- **Code Instruction**: Write the function to train a Logistic Regression model
    1. Use the built-in parameters of Logistic Regression to weight mistakes on the minority class more heavily

`Ask ChatGPT: Why does one of the methods perform better than the other?`
It could be related to how the feature distributions of the two classes overlap, as seen in Task 2-2

In [None]:
def train_tune_model(X_train: np.array, y_train: np.array):
    '''
    Write the function to train a Logistic Regression model
    and use `class_weight` parameter
    Use random_state = 100 to keep results consistent
    '''

    # TODO: Solution below, please remove code
    model = LogisticRegression(class_weight={0:0.25, 1:0.75 }, random_state=100)
    model.fit(X_train, y_train)

    return model


# Do not change the following code
train_df = pd.read_csv('train.csv')
X_train, y_train, X_valid, y_valid, feats = create_dataset(train_df)

model = train_tune_model(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))
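
Besides hand-picked weights, scikit-learn also accepts `class_weight='balanced'`. The sketch below shows what those weights work out to on synthetic labels, using the formula n_samples / (n_classes * count_per_class). The label array `y` is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)
classes = np.array([0, 1])

# "balanced" weights follow n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
```

On a 90:10 split the minority class gets a weight of 5.0 and the majority class about 0.56, so each class contributes the same total weight to the loss. Whether this or a hand-tuned ratio works better for precision is something to check on the validation set.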


### Task 5-3: Normalizing Features

- **Description**: Normalize the features to see how this impacts performance of the Logistic Regression model from the previous task
- **Code Instruction**: Complete the function to
    1. Use standard scaling to normalize features 

`Ask ChatGPT what type of feature scaling is best for Logistic Regression and why`

### Bonus [Optional] Task
Fine-tune the Logistic Regression Model.
Some of the parameters you may want to experiment with are solver, penalty and C

In [None]:
def train_tune_model(X_train: np.array, y_train: np.array, scaler):
    '''
    Complete this function to normalize features 
    X_train: is the train features values
    y_train: is train labels
    scaler: The scaler you have chosen
    '''

    # TODO: Solution below, remove this code
    X_train = scaler.fit_transform(X_train)
    model = LogisticRegression(solver='liblinear', penalty='l1', C=0.5, random_state=100)
    model.fit(X_train, y_train)

    return (model, scaler)

scaler = StandardScaler()

# Do not change this code
train_df = pd.read_csv("train.csv")
train_df = rebalance_df(train_df)

X_train, y_train, X_valid, y_valid, feats = create_dataset(train_df)
model, scaler = train_tune_model(X_train, y_train, scaler)

X_valid = scaler.transform(X_valid) # we use the same scaler you have used earlier
pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))
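
The pattern of fitting the scaler on training data and only transforming validation data can be sketched on toy arrays as follows. The names `X_tr`, `X_va` and `scaler_demo` are hypothetical and separate from the notebook's own variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])
X_va = np.array([[2.0], [5.0]])

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learns the mean and std from training data
X_va_scaled = scaler_demo.transform(X_va)      # reuses those statistics, no refit

print(scaler_demo.mean_, X_va_scaled.ravel())
```

The validation value 2.0 maps to 0 because it equals the training mean. Calling `fit_transform` on the validation data instead would leak its statistics into preprocessing and make the scores less trustworthy.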

### Task 5-4: Model Interpretability

- **Description**: Extract the coefficients from the model by completing this function
- **Code Instruction**: Complete the following function to 
    1. Print the coefficient for each feature
    2. Print the intercept for the Logistic Regression model as well

`Ask ChatGPT: How do you interpret the coefficients of a Logistic Regression model?`

In [None]:
def get_model_coeff(model : LogisticRegression, feats: list):
    '''
    Complete this function to print pair of values
    (feature name, coeff value)
    '''

    # TODO: Remove this code
    coeffs = list(model.coef_)[0]
    print(coeffs)

    for i in range(len(feats)):
        print(feats[i], coeffs[i])
    
    print("Intercept", model.intercept_)


get_model_coeff(model, feats)
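
One common way to read a logistic regression coefficient is as a log-odds change, so exponentiating it gives an odds ratio. The sketch below uses made-up coefficient values, not values from this model. Since the features were standardized in Task 5-3, a "one-unit increase" here would mean one standard deviation.

```python
import math

# Hypothetical coefficients for two standardized features
coefs = {"duration": 0.9, "campaign": -0.3}

# exp(beta) is the multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: coef={coefs[name]:+.2f}, odds ratio={ratio:.2f}")
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and a ratio below 1 means it pushes them toward the negative class, holding the other features fixed.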

## 6. Error Analysis and Improving Model

This is crucial for identifying what went wrong and how it could be improved. Use a confusion matrix to figure out whether the model is making more false positives or false negatives. <br> Identify whether certain subsets of the data are more susceptible to error (such as certain jobs or durations). Based on these insights, figure out how the model could be improved. Think first about feature engineering, then about improving the Logistic Regression fit, and then about experimenting with other models. 

### Task 6-1: Understand model error
- **Description**: Get the false positives, false negatives, etc. from the model predictions
- **Code Instruction**: 
    1. Calculate the true negatives, false positives, false negatives and true positives, and assign them to tn, fp, fn, tp respectively.

In [None]:
actual = y_valid
pred = model.predict(X_valid)

# TODO: Remove code below
tn, fp, fn, tp = confusion_matrix(actual, pred).ravel()
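
One possible way to look for error-prone subsets is to group the prediction errors by a raw feature. The sketch below uses a made-up `results` frame. In this notebook the validation features are already encoded arrays, so doing the same here would require keeping the original rows alongside the predictions, which is an assumption of this example.

```python
import pandas as pd

# Hypothetical validation results; `job` stands in for any raw categorical feature
results = pd.DataFrame({
    "job":    ["admin", "admin", "retired", "retired", "student", "student"],
    "actual": [0, 1, 0, 1, 1, 1],
    "pred":   [0, 1, 1, 1, 0, 0],
})

results["error"] = (results["actual"] != results["pred"]).astype(int)

# The mean of the 0/1 error flag is the error rate within each group
error_by_job = (
    results.groupby("job")["error"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(error_by_job)
```

Including the `count` column helps avoid over-reading a high error rate that comes from only a handful of rows.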

### Task 6-2: Improve model using feature engineering [Optional]
- **Description**: See what new features you can create from existing features to improve model performance
- **Code Instruction**: This is an open-ended task

In [25]:
# update predictions based on insights from error-analysis
# TODO: FOR TESTING - SIMPLY CHECK IF PREDICTIONS ARE BETTER THAN THAT FROM TASK.18
pred = model.predict(X_valid)

### Task 6-3: Model Experimentation

- **Description**: Now we can try out other models. Here we will experiment with `RandomForestClassifier`. It's common in industry to spend 2-3 days on model experimentation. During this time you can experiment with as many models and fine-tuning techniques as possible
- **Code Instruction**:
    1. Load the train data
    2. Rebalance it using `rebalance_df`
    3. Create train and validation data using `create_dataset`
    4. Fit a Random Forest model and experiment with parameters to get the best performance on validation data

In [None]:
# TODO: Remove code below
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv("train.csv")
train_df = rebalance_df(train_df)

X_train, y_train, X_valid, y_valid, feats = create_dataset(train_df)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
pred = rf_classifier.predict(X_valid)

# Evaluate the model
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))


### Task 6-4: Final Prediction

- **Description**: Finally, apply all the transformations you deem best to the data, and use the best model you found to get predictions on the test dataset
- **Code Instruction**:
    1. Load the test data
    2. Look at the code written so far (missing values, outliers, encoding, scaling) and apply all of it
    3. Predict the class using the best model trained in the earlier tasks

In [None]:

test_df = pd.read_csv('test.csv')

# TODO: Remove Solution below
test_df = encode_cat(test_df, train=False)
test_df = test_df.drop(columns=['ID'])
test_arr = scaler.transform(test_df[feats].values)  # select columns in the same order used for training
pred = model.predict(test_arr)
print(pred)