# Blood Donor Identification

This project will teach you:

* how to analyze and clean data
* how to identify predictive features and test hypotheses
* fundamental classification metrics
* the training cycle of an ML model
* how to fine-tune a Logistic Regression model

The task is to predict whether a blood donor will make a donation in the upcoming month (here, March 2007) based on that donor's history of blood donations.

The project is organized into several sections. Each section has a set of tasks for you to complete. <br>
Please make sure to complete one task before moving on to the next.


In [31]:
# These are the packages to be loaded
# Do not alter

%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from scipy import stats

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


## 1. Data Cleaning

### Task 1-1 : Load the data

- **Description**: Load the data from file `train.csv` and assign it to variable `train_df`
- **Code Instruction**: 
    1. Import dataset using the path `train.csv` and assign to `train_df`
    2. From the column names, get all columns but the last one and assign them to `feats` as a list
    3. Get the last column name and assign it to `label`

In [32]:
train_df = pd.read_csv(data_folder + "training_data.csv")
feats = list(train_df.columns[:-1])
label = train_df.columns[-1]

### Task 1-2: Understand the data

- **Description**: In this task, have a look at a sample of the train data and understand what the feature values look like
- **Code Instruction**: 
    1. Take a look at the top n rows of data
    2. Get the summary statistics of the numeric features in the data
    3. Get the distribution of feature values for categorical features in the data

In [None]:
# TODO: Solution below, please remove 
train_df.head()

In [None]:
# TODO: Solution below, please remove
train_df.describe()

### Task 1-3: Remove any duplicate rows

- **Description**: In this task, remove any duplicate rows in the data
- **Code Instruction**: Complete the function to return a de-duplicated dataframe

`Ask ChatGPT! : How does duplicate data impact performance of a Logistic Regression model`

In [None]:
def remove_duplicates(df: pd.DataFrame) -> pd.DataFrame :
    '''
    Complete this function to return a de-duplicated dataframe
    '''
    
    # TODO: remove the rest of code in this function
    df = df.drop_duplicates()
    return df


# Do not change this code
row_count = remove_duplicates(train_df).shape[0]
print(row_count)
remove_duplicates(train_df.copy())
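
Before dropping duplicates, it can help to count them. The following is a minimal sketch on a toy dataframe with made-up values (the frame `toy` is hypothetical and is not the project data); it only illustrates what `duplicated` and `drop_duplicates` do.

```python
import pandas as pd

# Toy frame (hypothetical values) containing one exact duplicate row
toy = pd.DataFrame({
    "Number of Donations": [2, 2, 5, 1],
    "Made Donation in March 2007": [0, 0, 1, 0],
})

# duplicated() flags rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print("duplicates:", n_dupes, "rows before:", len(toy), "rows after:", len(deduped))
```

Duplicated rows effectively give some observations extra weight during training, which is why the count is worth checking before fitting.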

### Task 1-4: Fill Missing Values

- **Description**: Fill any missing values in the data with column means for numerical columns and column modes for categorical columns
- **Code Instruction**: 
    1. Use `train_col_means` to store the mean/mode value for each numerical/categorical column
    2. Fill missing values with the mean/mode stored in `train_col_means`
    3. Return a dataframe with missing values filled in

`Ask ChatGPT! : How do missing values impact performance of a Logistic Regression model`

In [None]:


train_col_means = {}
def fill_missing_value(df: pd.DataFrame, train=False) -> pd.DataFrame:
    '''
    Complete this function to fill missing values (if there are any)
    with the mean value of the column

    `train_col_means` is a dictionary where keys are features
    and values are the column means computed on the train data

    Hint: Use feats to iterate through columns
    '''


    # TODO: Solution below, please remove
    for col in feats:
        if train:
            train_col_means[col] = df[col].mean()
        
        df[col] = df[col].fillna(train_col_means[col])

    return df


# Do not change this line of code
fill_missing_value(train_df.copy(), train=True)
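
The `train` flag exists so that statistics are learned once on the train data and then reused everywhere else. A minimal sketch of that idea, using hypothetical toy frames (`train_toy`, `valid_toy`) rather than the project data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in each frame
train_toy = pd.DataFrame({"Number of Donations": [1.0, 3.0, np.nan, 8.0]})
valid_toy = pd.DataFrame({"Number of Donations": [np.nan, 2.0]})

# Means are computed on the train frame only (NaN is skipped): (1 + 3 + 8) / 3 = 4.0
train_means = train_toy.mean()

# The same train means fill gaps in both frames
train_filled = train_toy.fillna(train_means)
valid_filled = valid_toy.fillna(train_means)

print(train_means)
print(valid_filled)
```

Filling the validation frame with its own mean would leak information about data the model is not supposed to have seen.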

### Task 1-5: Identify outliers

- **Description**: Complete the function below to handle outliers using the `Tukey Outlier method`. Replace each outlier with the column mean calculated before.
- **Code Instruction**:
    1. Identify the numerical columns
    2. Identify the upper and lower bounds using the `Tukey Outlier method`
    3. Replace outlier values using the mean/mode from the `train_col_means` dict defined earlier

`Ask ChatGPT! : How do outliers impact performance of a Logistic Regression Model`


In [None]:
train_col_bounds = {}
def clip_outliers(df: pd.DataFrame, train=False) -> pd.DataFrame:
    '''
    Complete this function to get lower, upper bounds of each col
    Replace low and high with mean values

    `train_col_bounds` is a dictionary where key are features
    and values are tuple (x,y) x being lower bound and y being higher bound

    Hint: Use feats to iterate through columns
    '''

    # TODO: solution below, please remove
    for col in feats:
        if train:
            p25, p75 = df[col].quantile([.25,.75])
            iqr = p75 - p25
            train_col_bounds[col] = (p25 - 1.5 * iqr, p75 + 1.5 * iqr)

        low, high = train_col_bounds[col]
        # Replace values outside the Tukey fences with the train mean
        df[col] = df[col].apply(lambda x: train_col_means[col] if (x < low or x > high) else x)
    print(train_col_bounds)
    return df


# Do not change this code
clip_outliers(train_df.copy(), train=True)
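
As a worked example of the Tukey fences, here is a small sketch on a hypothetical series with one extreme value. It shows two common treatments side by side: clipping to the fence, and replacing with the mean as this task asks. The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (50)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 50.0])

q1, q3 = s.quantile([0.25, 0.75])   # 2.75 and 6.25 with linear interpolation
iqr = q3 - q1                       # 3.5
lower = q1 - 1.5 * iqr              # -2.5
upper = q3 + 1.5 * iqr              # 11.5

is_outlier = (s < lower) | (s > upper)

# Option A: clip to the fences
clipped = s.clip(lower=lower, upper=upper)

# Option B: replace outliers with the column mean
replaced = s.where(~is_outlier, s.mean())

print(lower, upper)
print(clipped.tolist())
print(replaced.tolist())
```

Note that the mean used in option B is itself pulled upward by the outlier, which is one reason to compute it carefully (or to prefer the median) on skewed columns.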

### Task 1-6: Identify Class Imbalance

- **Description**: Complete this function to return the percentage of 0 labels in the data
- **Code Instruction**:
    1. Complete this function to return the percentage of 0-valued labels in the data

`Ask ChatGPT! : How does imbalance impact performance of a Logistic Regression Model`

In [None]:
def test_imbalance(df: pd.DataFrame) -> float:
    '''
    Return the percentage (0-100) of rows whose label is 0
    '''

    # TODO: Solution below, please remove
    return (df[label] == 0).mean() * 100

test_imbalance(train_df.copy())
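
To see why this percentage matters, consider a hypothetical label column (values invented for illustration). The share of the majority class is also the accuracy of a model that always predicts that class, so it acts as a floor that any useful model has to beat.

```python
import pandas as pd

# Hypothetical labels: 6 non-donors, 2 donors
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Made Donation in March 2007")

share = labels.value_counts(normalize=True) * 100
pct_zero = share.loc[0]   # 6 / 8 = 75.0
pct_one = share.loc[1]    # 2 / 8 = 25.0

# A "model" that always predicts 0 is right exactly pct_zero percent of the time
always_zero_accuracy = (labels == 0).mean() * 100

print(pct_zero, pct_one, always_zero_accuracy)
```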

## 2. Feature Engineering

### Task 2-1: Understand feature distributions

 - **Description**: If a feature's values overlap heavily across classes, that feature gives a Logistic Regression model little power to separate them. It is therefore a good idea to look at the distribution of feature values per class.<br>
 - **Code Instruction**: 
    1. Get the column names for numeric fields in `train_df`
    2. For each field, plot one histogram per label value, showing the spread of the values.
    3. Check whether any of the distributions overlap heavily.

`Ask ChatGPT: How does feature value overlap influence a Logistic Regression model`

In [None]:
# TODO: Solution below, please remove
# One column of plots per feature: top row is label 0, bottom row is label 1
fig, axes = plt.subplots(2, len(feats), figsize=(12, 6), squeeze=False)

for i, feat in enumerate(feats):
    train_df[train_df[label]==0][feat].hist(ax=axes[0, i])
    train_df[train_df[label]==1][feat].hist(ax=axes[1, i])
    axes[0, i].set_title(f"{feat} (label 0)")
    axes[1, i].set_title(f"{feat} (label 1)")

plt.tight_layout()
plt.show()

### Task 2-2: Feature Correlations for Numerical Features

- **Description**: In this task, calculate the pairwise correlation between features
- **Code Instruction**:
    1. Complete the function to calculate all pairwise correlations
    (Hint: There is an in-built function in `pandas` for this)

`Ask ChatGPT! : How do correlated features impact performance of a Logistic Regression Model.`

In [None]:

def calc_corr(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Complete the function to calculate all pairwise correlations.
    From the output, identify the pairs of features that are highly correlated.
    '''
    
    # TODO: Solution below, please remove
    return df.corr()


# Do not change this code
calc_corr(train_df[feats])
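
Reading a correlation matrix by eye gets tedious, so here is a small sketch that lists only the pairs above a threshold. The toy frame and the threshold of 0.95 are assumptions for illustration; in particular, the toy volume column is built as a fixed multiple of the donation count, which is what produces the perfect correlation.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Number of Donations": [1, 2, 5, 10],
    "Months since First Donation": [4, 30, 20, 60],
})
# Assumption for this toy: volume is a fixed amount per donation
toy["Total Volume Donated (c.c.)"] = toy["Number of Donations"] * 250

corr = toy.corr()   # Pearson correlation by default
threshold = 0.95    # hypothetical cut-off

# Walk the upper triangle so each pair is reported once
cols = list(corr.columns)
pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(pairs)
```

When two features are this strongly correlated, keeping both adds no information and makes the individual coefficients of a linear model unstable.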


### Task 2-3: Identify Predictive Features

- **Description**: Check which of the features are predictive (i.e. help tell whether a donor will donate blood). For this you can use a statistical test called Welch's t-test, with significance level alpha = 0.01
- **Code Instruction**: Complete the following function
    1. Return the feature name and p-value from the t-test


`Ask ChatGPT:  What are Welch's t-test and the t-test - how do they help determine important features`

In [41]:
def run_student_ttest(df, col) -> tuple[str, float]:
    '''
    Write a function to return p-values from the Welch's t-test
    for feature passed into the function
    '''

    # TODO: Solution below, please remove
    t_stat, p_val = stats.ttest_ind(df[df[label]==0][col], df[df[label]==1][col], 
                                        equal_var=False)  # Welch's t-test
    
    return (col, p_val)

for feat in feats:
    print(run_student_ttest(train_df, feat))
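
To make the test less of a black box, the sketch below computes Welch's t statistic by hand on synthetic data and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. The two groups are randomly generated (they are not the project data) and deliberately have different means and variances.

```python
import numpy as np
from scipy import stats

# Synthetic groups with different means and different variances
rng = np.random.default_rng(0)
group0 = rng.normal(loc=0.0, scale=1.0, size=200)
group1 = rng.normal(loc=2.0, scale=3.0, size=200)

# Welch's t-test: does not assume equal variances
t_stat, p_val = stats.ttest_ind(group0, group1, equal_var=False)

# Same statistic by hand: difference in means over the unpooled standard error
se = np.sqrt(group0.var(ddof=1) / len(group0) + group1.var(ddof=1) / len(group1))
t_manual = (group0.mean() - group1.mean()) / se

alpha = 0.01
print(t_stat, t_manual, p_val, p_val < alpha)
```

A p-value below alpha only says the class means differ for that single feature; it says nothing about how features behave in combination.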

### Task 2-4: Feature Engineering

- **Description**: Let's see if there are any new features we can engineer. Think about the features 'Months since First Donation' and 'Number of Donations'.
Can you derive a sort of 'rate' feature from these two? Would it be helpful in predicting who donates next month?
- **Code Instruction**: Complete the following function to
    1. Create a rate-based feature and call it `Avg.Dontation Per Month`
    2. Return the dataframe with the rate-based feature added

In [42]:
def create_new_feature(df) -> pd.DataFrame:
    '''
    Complete this function to create the 'rate' feature
    '''

    # TODO: Solution below, remove code 
    df['Avg.Dontation Per Month'] = df['Number of Donations']/(df['Months since First Donation'] - df['Months since Last Donation'] + 1)
    return df


# Do not change this code
proc_df = create_new_feature(train_df.copy())
feats_new = list(proc_df.columns)
feats_new.remove(label)

# Let's have a look at the new feature set: is the new feature relevant?
for feat in feats_new:
    print(run_student_ttest(proc_df, feat))
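
A quick worked example of the rate used in the solution above, on two hypothetical donors. The column name `rate` is a placeholder for this sketch only.

```python
import pandas as pd

# Two hypothetical donors: a long-term regular donor and a one-time donor
toy = pd.DataFrame({
    "Number of Donations": [10, 1],
    "Months since First Donation": [50, 4],
    "Months since Last Donation": [2, 4],
})

# Months between first and last donation; the +1 avoids dividing by zero
# when a donor has donated only once (first == last)
active_months = toy["Months since First Donation"] - toy["Months since Last Donation"] + 1

toy["rate"] = toy["Number of Donations"] / active_months   # 10/49 and 1/1

print(toy)
```

The one-time donor ends up with the highest rate here, which is worth keeping in mind when judging how informative this feature really is.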

### Task 2-5:  Removing irrelevant features

- **Description**: Let's try removing irrelevant features and one feature from each highly correlated pair
- **Code Instruction**: Complete the following function to
    1. Drop the features you deem irrelevant based on the code above


In [None]:
def drop_feature(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    '''
    Create a function to remove the columns passed to this function
    '''

    # TODO: Solution below, please remove
    df = df.drop(columns=cols)

    return df


# Do not change this code
proc_df = drop_feature(train_df.copy(), cols = ['Total Volume Donated (c.c.)', 'Months since First Donation'])
proc_df.columns


## 3. Understanding Classification Metrics

The most common classification metrics are:
* Accuracy
* Precision
* Recall
* F1-Score

Let's ask ChatGPT what they are. <br>




### Task 3-1: Choosing the right metric

- **Description**: Having analyzed the data (see Task 1-6), choose the best metric for your task. 
Assume you got the following information from the business:
* If you predict someone is going to donate, but they don't, this is a huge concern. You want to reduce such `false positives` as much as possible.
* If you predict someone is not going to donate, and they do come, it is OK. The blood donation camp can manage.

Knowing the above, decide which metric to use. <br>
Irrespective of what you choose, evaluate performance using F1 as well.

- **Code Instruction**: 
    1. Complete this function to calculate the metric you have chosen

In [44]:
def calc_perf(y_act: list, y_pred: list) -> float:
    '''
    Complete this function to calculate the metric
    you have chosen
    '''

    # TODO: Solution below, please remove
    val = precision_score(y_act, y_pred)
    return val
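
To connect the metric names to the business statement above, the sketch below derives precision, recall and F1 from the four confusion-matrix counts and checks them against scikit-learn. The label vectors are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical actual and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of those predicted to donate, how many did
recall = tp / (tp + fn)      # of those who donated, how many were predicted
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

False positives appear only in the precision denominator, so a metric that should punish false positives is one that drops when `fp` grows.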

## 4. Build baseline model

Now, finally, we can start training the model. When training an ML model, it's important to have three datasets:
* Train dataset: used to train the model and learn its parameters
* Validation dataset: used to figure out which model settings (hyperparameters) work best
* Test dataset: the hidden dataset that you DO NOT look at. It's only used to estimate performance on future unseen data.

### Task 4-1: Preprocess the train data to create training and validation data

- **Description**: In this task, we will create the datasets required for training the model
- **Code Instruction**: 
    1. Load the train dataset 
    2. Run the de-duplication function ONLY on the train set
    3. Encode categorical values
    4. Split the train dataset 80:20 with `stratify` to create a new train dataset and a validation set

`Ask ChatGPT: Why is it important to stratify when creating training and validation sets for an imbalanced dataset`

In [45]:
def create_dataset(df: pd.DataFrame) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    '''
    Remove duplicate data from train file alone (using `remove_duplicates` defined earlier)
    Split train file data into train and valid set (keep in mind what we learned about imbalance in Task 1-6)
        Hint use: train_test_split (set seed to 100), and use the `stratify` field
        Ask ChatGPT: Why is it important to stratify when creating training and validation sets for imbalanced datasets
    
    Return np arrays for train features, train labels, valid features, valid labels

    '''

    # TODO: Solution below, please remove
    df = remove_duplicates(df)

    label = 'Made Donation in March 2007'
    feats = list(df.columns)
    feats.remove(label)

    X = df[feats].values
    y = df[label].values

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, 
                                                          stratify=y, random_state=100)

    return X_train, y_train, X_valid, y_valid


# Do not change this
X_train, y_train, X_valid, y_valid = create_dataset(pd.read_csv(data_folder + 'training_data.csv'))
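
The effect of `stratify` is easiest to see on a small synthetic example. The arrays below are invented (80 negatives, 20 positives); the point is only that both splits keep the same 20% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 zeros and 20 ones
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=100
)

# Both splits preserve the original class ratio
print("train positives:", y_tr.sum(), "of", len(y_tr))
print("valid positives:", y_va.sum(), "of", len(y_va))
```

Without `stratify`, a small validation set can end up with very few positives by chance, which makes precision and F1 on it noisy.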

### Task 4-2: Build Model

- **Description**: Train a baseline Logistic Regression model with default parameters
- **Code Instruction**: 
    1.  Complete this function to train a baseline LogisticRegression model with default parameters

In [None]:
def train_base(X_train: np.array, y_train: np.array) -> LogisticRegression:
    '''
    Complete this function to
    Train a baseline LogisticRegression Model with default parameter
    Use random_state = 100 to keep results consistent
    '''

    # TODO: Solution below, please remove
    model = LogisticRegression(random_state=100)
    model.fit(X_train, y_train)

    return model


# Do not change this
model = train_base(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))

## 5. Improving baseline model

### Task 5-1: Fix imbalance to improve performance

- **Description**: Use sampling to balance the number of positive and negative samples in the data
- **Code Instruction**: Complete the following function to
    1. Remove duplicate rows in data
    2. Balance the number of positive and negative samples in train_data

In [None]:
def rebalance_df(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Write a function to 
    (a) remove duplicate rows in data
    (b) balance the number of positive and negative samples in train_data
    Hint: Use downsampling, use random_state = 100

    Return the balanced dataframe
    '''


    # TODO: Solution below, please remove
    df = remove_duplicates(df)

    pos_df = df[df[label] == 1]
    neg_df = df[df[label] == 0]

    df = pd.concat([neg_df.sample(frac=0.4, random_state=100), pos_df])
    return df


# Do not change the following code
# load the data
train_df = pd.read_csv(data_folder + 'training_data.csv')
print(train_df.shape, train_df[label].value_counts(normalize=True))

print("\n\n")

# balance the data
train_df = rebalance_df(train_df)
print(train_df.shape, train_df[label].value_counts(normalize=True))

# check the performance with rebalance dataset
print("\n\n")
X_train, y_train, X_valid, y_valid = create_dataset(train_df)
model = train_base(X_train, y_train)
print(X_train.shape)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))
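
The solution above keeps a fixed fraction of the negatives. An alternative, sketched here on a hypothetical toy frame, is to sample exactly as many negatives as there are positives so the two classes end up the same size. This is one possible variant, not the required approach.

```python
import pandas as pd

target = "Made Donation in March 2007"

# Hypothetical toy data: 7 negatives and 3 positives
toy = pd.DataFrame({
    "Number of Donations": range(10),
    target: [0] * 7 + [1] * 3,
})

pos = toy[toy[target] == 1]
neg = toy[toy[target] == 0]

# Downsample the majority class to the size of the minority class
neg_down = neg.sample(n=len(pos), random_state=100)

# Concatenate and shuffle so the classes are not grouped together
balanced = pd.concat([neg_down, pos]).sample(frac=1, random_state=100)

print(balanced[target].value_counts())
```

Downsampling throws data away, so on a small dataset it trades class balance for fewer training rows.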

### Task 5-2: Improve model using Class Weights

- **Description**: Another way to combat data imbalance is to configure class weights in the Logistic Regression model <br>
- **Code Instruction**: Write the function to train a Logistic Regression model
    1. Use the in-built parameters of Logistic Regression to weight mistakes on the minority class more heavily

`Ask ChatGPT: Why does one of the methods perform better than the other.`
It could be related to how the feature distributions of the two classes overlap, as seen in Task 2-1

In [None]:
def train_tune_model(X_train: np.array, y_train: np.array):
    '''
    Write the function to train a Logistic Regression model
    and use `class_weight` parameter
    Use random_state = 100 to keep results consistent
    '''

    # TODO: Solution below, please remove code
    model = LogisticRegression(class_weight={0:0.25, 1:0.75 }, random_state=100)
    model.fit(X_train, y_train)

    return model


# Do not change the following code
train_df = pd.read_csv(data_folder + 'training_data.csv')
X_train, y_train, X_valid, y_valid = create_dataset(train_df)

model = train_tune_model(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))
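
Instead of hand-picking weights, scikit-learn also accepts `class_weight="balanced"`. The sketch below shows what that setting computes, using `compute_class_weight` on a synthetic label vector and the documented formula `n_samples / (n_classes * np.bincount(y))`. The labels are invented for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 75 negatives and 25 positives
y = np.array([0] * 75 + [1] * 25)
classes = np.unique(y)

# Weights scikit-learn would use for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same numbers from the formula: n_samples / (n_classes * count per class)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(manual)
```

The rarer class gets the larger weight, so each mistake on a positive example costs the model more during training.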

### Task 5-3: Focus on important features

- **Description**: Play around with the feature set passed to the model to find the optimal one.
- **Code Instruction**: 
    1. Use `create_new_feature` to create the rate feature.
    2. Use `drop_feature` to remove features you deem unnecessary

How does all this influence performance? Which feature set do you finally keep?

In [None]:
# Do not change this code
train_df = pd.read_csv(data_folder + 'training_data.csv')
train_df = rebalance_df(train_df)

# ADD YOUR CODE HERE
# Hint: use `create_new_feature` and `drop_feature` functions mentioned before
# TODO: Solution below, remove it
train_df = create_new_feature(train_df)
train_df = drop_feature(train_df, cols = ['Total Volume Donated (c.c.)'])


# Do not change this code
X_train, y_train, X_valid, y_valid = create_dataset(train_df)
model = train_base(X_train, y_train)

pred = model.predict(X_train)
print("Train Performance")
print("Selected Metric: ", calc_perf(y_train, pred), "F1-Score: ", f1_score(y_train, pred))

print("\n")

pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))


Did the performance drop? If so, why? <br>
When you add a feature back in, which one brings the performance back up? <br>

Does it make sense that adding back a feature that failed the t-test helped improve the performance of Logistic Regression? <br>

`Ask ChatGPT why this could happen (Hint: it could be related to how features work together)`

### Task 5-4: Normalizing Features

- **Description**: Normalize features to see how it impacts performance of the Logistic Regression model from the previous task
- **Code Instruction**: Complete the function to
    1. Use standard scaling to normalize features 

`Ask ChatGPT what type of feature scaling is best for Logistic Regression and why`

### Bonus [Optional] Task
Fine-tune the Logistic Regression Model.
Some of the parameters you may want to experiment with are solver, penalty and C

In [None]:


def train_tune_model(X_train: np.array, y_train: np.array, 
                     scaler):
    '''
    Complete this function to normalize features 
    X_train: is the train features values
    y_train: is train labels
    scaler: The scaler you have chosen

    Bonus [Optional] Task : 
    Fine-tune the Logistic Regression Model.
    Some of the parameters you may want to experiment with are solver, penalty and C
    '''

    # Hint: Use StandardScaler to normalize features
    # Ask ChatGPT what type of feature scaling is best for Logistic Regression and why

    # TODO: Solution below, remove this code
    X_train = scaler.fit_transform(X_train)

    model = LogisticRegression()
    model.fit(X_train, y_train)

    return (model, scaler)

scaler = StandardScaler()


# Do not change this code
train_df = pd.read_csv(data_folder + "training_data.csv")
train_df = rebalance_df(train_df)
train_df = drop_feature(train_df, cols = ['Total Volume Donated (c.c.)'])

X_train, y_train, X_valid, y_valid = create_dataset(train_df)
model, scaler = train_tune_model(X_train, y_train, scaler)

X_valid = scaler.transform(X_valid) # we use the same scaler you have used earlier
pred = model.predict(X_valid)
print("Validation Performance")
print("Selected Metric: ", calc_perf(y_valid, pred),"F1-Score: ", f1_score(y_valid, pred))
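
The reason the same `scaler` is reused for the validation set is that the mean and standard deviation must come from the train data only. A minimal sketch with hypothetical arrays (`scaler_demo` and the values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and validation features on very different scales
X_tr = np.array([[1.0, 250.0], [3.0, 750.0], [5.0, 1250.0]])
X_va = np.array([[3.0, 2000.0]])

# Learn mean and standard deviation from the train data only
scaler_demo = StandardScaler().fit(X_tr)

X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)   # reuses the train statistics

print(scaler_demo.mean_, scaler_demo.scale_)
print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print(X_va_s)
```

After scaling, the train columns have mean 0 and standard deviation 1, while validation values are simply expressed in those train units and need not be centered.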

### Task 5-5: Model Interpretability

- **Description**: Extract the coefficients from the model by completing this function
- **Code Instruction**: Complete the following function to 
    1. Print the coefficient for each feature
    2. Print the intercept for the Log.Reg model as well

`Ask ChatGPT: How do you interpret the coefficients of a Logistic Regression model`

In [None]:
def get_model_coeff(model : LogisticRegression, feats: list):
    '''
    Complete this function to print pair of values
    (feature name, coeff value)
    '''

    # TODO: Remove this code
    coeffs = model.coef_[0]  # one row of coefficients for a binary model
    print(coeffs)

    for feat, coeff in zip(feats, coeffs):
        print(feat, coeff)
    
    print("Intercept", model.intercept_)


get_model_coeff(model, list(train_df.columns)[:-1])
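
One way to read the coefficients: the model's log-odds are `intercept + X @ coef`, and `exp(coef)` is the multiplicative change in the odds for a one-unit increase in that feature. The sketch below checks this on synthetic data; the data and the name `clf` are assumptions for illustration and are unrelated to the project model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends mostly on the first feature
rng = np.random.default_rng(100)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(random_state=100).fit(X, y)

# Log-odds are a linear function of the features
log_odds = clf.intercept_[0] + X @ clf.coef_[0]

# The sigmoid turns log-odds into the probability of class 1
proba = 1.0 / (1.0 + np.exp(-log_odds))

# exp(coefficient) = odds ratio for a one-unit increase in that feature
odds_ratios = np.exp(clf.coef_[0])

print(clf.coef_[0], clf.intercept_[0])
print(odds_ratios)
```

If the features were standardized first, "one unit" means one standard deviation, which makes coefficient sizes comparable across features.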

### Task 5-6: Final Prediction

- **Description**: Finally, apply all the transformations you deem best to the data and use the best model you found to get predictions on the test dataset
- **Code Instruction**:
    1. Load test data
    2. Look at the code written so far (missing values, outliers, encoding, scaling) and apply all of it
    3. Predict the class using the best model trained in earlier tasks

In [None]:
test_data = pd.read_csv(data_folder + 'test_data.csv')
test_data.head()
print(test_data.columns)
print(feats)

def get_test_prediction(test_df: pd.DataFrame) -> list:
    '''
    Apply all the data transformation you deem necessary
    Note: 
    Missing value must be replaced by mean value from train set NOT test set

    Fit model on train data and get prediction on test set
    '''

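    # Same features, in the same order, that the scaler and model were fitted on in Task 5-4
    feats = ['Months since Last Donation', 'Number of Donations', 'Months since First Donation']
    X_test = test_df[feats].values  # plain array, matching how the scaler was fitted
    X_test = scaler.transform(X_test)
    val = list(model.predict(X_test))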

    return val

get_test_prediction(test_data)