# Problem Background: The Great Migration

In our ongoing efforts to ensure the safety of the migration to Earth Junior, we have developed a linear regression model that produces a human-zombie score ranging from 0 to 100. This score is designed to assess the likelihood of individuals being human or zombie based on various features collected during screening.

To enhance our security measures, the spaceship station has deployed a specialized automated barrier system that utilizes the human-zombie scores to classify individuals into three distinct categories:

- **Class 0: Score Range 0-33**: **Most Likely Human**  
  Individuals in this range will be directed straight to the spaceship for immediate boarding.

- **Class 1: Score Range 33-66**: **Need Further Tests**  
  Those with scores in this range will be redirected to a testing facility for additional examinations to confirm their identity. They will be quarantined for a two-week observational period to ensure they do not pose a risk.

- **Class 2: Score Range 66-100**: **Most Likely Zombies**  
  Those scoring in this highest range will be denied entry to the spaceship, as they are deemed a significant threat to the safety of the remaining human population.

This classification system aims to maximize the chances of a successful migration while ensuring that the risk of zombie infiltration is minimized.


# Programming Assignment 2: Task 2 -- Logistic Regression  [80 Marks]

### Introduction

In this task, you will implement Logistic Regression models from scratch for the provided dataset. A description of the problem is given at the start of this part. It is important that you display the output where asked. If a part has no output, you will get a 0 for that part.

After this notebook you should be able to:

- Implement a classifier using Logistic Regression.

- Create a Logistic Regression model using simple `numpy`.

Have fun!

### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions and Plagiarism Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span> We will be using the `print` statements to grade your assignment.

- <span style="color: red;">You must attempt all parts.</span> Do not assume that because something is for 0 marks, you can leave it - it will definitely be used in later parts.

- <span style="color: red;">Do not use unauthorized libraries.</span> You are not allowed to use `sklearn` in Part A of both tasks. Failure to follow these instructions will result in a serious penalty.

<center>
  <img src = "https://miro.medium.com/v2/resize:fit:1100/format:webp/1*RElrybCZ4WPsUfRwDl7fqA.png">
</center>

One vs All (OvA) is a common technique for extending binary classifiers, such as logistic regression, to multiclass classification tasks. For each class in the dataset, a logistic regression model is trained to distinguish that class from all other classes. For instance, an `m`-class classification problem requires `m` logistic regression classifiers in the pipeline. When making a prediction, each model outputs the probability that the instance belongs to its target class. The class with the highest probability across all models is chosen as the final prediction.
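
As a rough illustration of the final prediction step described above, the sketch below stacks the outputs of three binary classifiers and picks the most probable class per sample. The probability values and the names `probs_per_class`, `combined`, and `ova_pred` are made up for illustration; they are not part of the provided dataset or starter code.

```python
import numpy as np

# Hypothetical probabilities from m = 3 binary classifiers.
# Each array holds one probability per sample (values are invented).
probs_per_class = [
    np.array([0.9, 0.2, 0.1]),  # classifier for class 0 vs rest
    np.array([0.3, 0.7, 0.2]),  # classifier for class 1 vs rest
    np.array([0.1, 0.4, 0.8]),  # classifier for class 2 vs rest
]

# Shape (n_samples, m): column j holds classifier j's probabilities.
combined = np.column_stack(probs_per_class)

# For each row, choose the class whose classifier is most confident.
ova_pred = np.argmax(combined, axis=1)
print(ova_pred)
```

Note that the per-classifier probabilities need not sum to 1 across classes, since each model is trained independently.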

In this part, we will be going over how to implement a Multiclass Logistic Regression (OvA) model from scratch. For a review of this concept, you can go over the course slides or go over this [resource](https://www.cs.rice.edu/~as143/COMP642_Spring22/Scribes/Lect5).

## Import Libraries

In [169]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

#### Dataset

You will use the same dataset as Part A.

Load the Dataset and other necessary files.

# Preprocessing   [20 Marks]

In this step, you will need to make several changes to the dataset before we can proceed with the analysis. Follow the guidelines below:

1. **Transform Labels**:  
   Convert the labels from continuous scores to categorical labels based on the class descriptions provided earlier.  
   This transformation is crucial for training the classifier effectively. **[5 Points]**

2. **Perform Train-Test Split**:  
   Split the dataset into training and testing sets (8:2), and then check the sizes of both.  
   This step ensures that you have the right distribution of data for training and evaluation. **[5 Points]**

3. **Normalize Data**:  
   Utilize the `Scaler` class that you created in Part 1 to normalize the features of the dataset. **[10 Points]**
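
One possible way to approach the label transformation in step 1 is sketched below using `np.digitize`. The helper name `scores_to_classes` is hypothetical. The class ranges given earlier share their boundary values (33 and 66), so the sketch assumes that a score exactly on a boundary falls into the higher class; adjust this if your reading of the ranges differs.

```python
import numpy as np

def scores_to_classes(scores, boundaries=(33, 66)):
    """Map continuous scores in [0, 100] to class indices 0, 1, 2.

    Assumption: a score equal to a boundary is assigned to the higher class,
    i.e. class 0 is [0, 33), class 1 is [33, 66), class 2 is [66, 100].
    """
    scores = np.asarray(scores, dtype=float)
    # np.digitize returns how many boundaries each score has reached.
    return np.digitize(scores, bins=boundaries)

# Made-up scores for illustration only.
example_scores = [5.0, 32.9, 33.0, 50.0, 66.0, 99.0]
example_classes = scores_to_classes(example_scores)
print(example_classes)
```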


In [171]:
# Transform Labels

In [1]:
# Train Test Split


In [174]:
# Normalize


## Part A: Implementation from Scratch  [25 Marks]

Create a class, and implement the functionality described below to create a fully fledged **Regularized Logistic Regression model.**

* `sigmoid(x)`: This is the non-linear "activation" function that differentiates Logistic from plain-old Linear Regression. Refer to the formula from the slides. [5 Points]

* `cross_entropy_loss(y_true, y_pred)`: This is the loss function that will help you calculate the gradients for updating your model. Note that this is a Binary Classification task so you can use the Binary Cross Entropy function mentioned in the slides. [5 Points]

* `fit(x_train, y_train)`: This will be where you implement the Gradient Descent algorithm again, keeping in mind the differences between Linear and Logistic Regression. [5 points]

* `predict(x_test)`: Predict whether the label is 0 or 1 for each test sample using the learned logistic regression model (use a decision threshold of 0.5). **Note: this function must return both the predicted label and the probability.** [5 Points]

* `evaluate(y_true, y_pred)` function that calculates classification accuracy, F1 Score and confusion matrix. [5 Points]
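
To make the pieces above concrete, here is a minimal sketch of the sigmoid, the binary cross-entropy loss, and a single gradient descent update. It is not the reference solution. The helper `gradient_step`, the clipping constants, the L2 penalty form (lam/2)·||w||², and the toy data are all assumptions made for illustration; use the formulas from the slides for your actual implementation.

```python
import numpy as np

def sigmoid(x):
    # Clip the input to avoid overflow in np.exp for large magnitudes.
    x = np.clip(x, -500, 500)
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # Clip predictions so that log(0) never occurs.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def gradient_step(X, y, w, b, lr=0.1, lam=0.0):
    """One gradient descent update (hypothetical helper).

    Assumes an L2 penalty of (lam / 2) * ||w||^2, whose gradient is lam * w.
    The bias is not regularized.
    """
    n = X.shape[0]
    p = sigmoid(X @ w + b)
    grad_w = X.T @ (p - y) / n + lam * w
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b

# Toy, linearly separable data (made up for illustration).
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

w, b = np.zeros(1), 0.0
toy_losses = []
for _ in range(200):
    toy_losses.append(cross_entropy_loss(y_toy, sigmoid(X_toy @ w + b)))
    w, b = gradient_step(X_toy, y_toy, w, b)

print(toy_losses[0], toy_losses[-1])
```

With zero-initialized weights every prediction starts at 0.5, so the first recorded loss equals ln 2, and the recorded losses should fall from there.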

# Implement One vs All Classification  [10 marks]

You need to build three classifiers, one for each class, and perform the following steps for each:

1. Create a plot with the number of iterations/epochs on the x-axis and the training/validation loss on the y-axis, using the evaluation dataset that we separated previously for the validation loss.

2. Tune the hyperparameters, i.e., learning rate and number of epochs, to minimize the validation loss.

**Please note that the plot reflects the correctness of the functions you created previously. The curve should show a steady decrease, eventually reaching a plateau.**


In [3]:
# One-vs-Rest Classifiers
classifiers = {}
losses = {}  # To store losses for each classifier

for i in range(3):
    y_binary = (train_labels == i).astype(int)  # Current positive class, use this while fitting to train data
    classifiers[i] = None       # declare your logistic regression model here 
    cost = None                 # fit on your training data and store the cost. You will need to pass y_binary along with the train data
    losses[i] = cost            # Save the cost values for plotting

# Plot training loss for each classifier


# Evaluate  [15 Marks]

It's time to run your logistic regression model on the test dataset!

- Report the accuracy, F1 score and confusion matrix for each binary classifier [10 Points]
- Perform multiclass evaluation and report macro F1, accuracy and confusion matrix [5 marks]
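
For the multiclass metrics, the following sketch shows one way to compute a confusion matrix, accuracy, and macro F1 with plain `numpy`. The helper names `confusion_matrix_np` and `macro_f1_from_cm` and the demo labels are hypothetical. The sketch assumes rows are true classes and columns are predicted classes, and that a class with no true or predicted samples gets an F1 of 0.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    # Rows index the true class, columns index the predicted class.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def macro_f1_from_cm(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp  # actually the class, but predicted as another
    denom = 2 * tp + fp + fn
    # Per-class F1 = 2TP / (2TP + FP + FN); 0 where the class never appears.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()  # macro average: unweighted mean over classes

# Made-up labels for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])

cm_demo = confusion_matrix_np(y_true_demo, y_pred_demo, 3)
acc_demo = np.trace(cm_demo) / cm_demo.sum()
macro_f1_demo = macro_f1_from_cm(cm_demo)
print(cm_demo, acc_demo, macro_f1_demo)
```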


In [178]:
# Evaluate each binary classifier
results = {
    'Class': [],
    'Probs':[],
    'Accuracy': [],
    'F1 Score': [],
    'Confusion Matrix': []
}

for i in range(3):  
    predicted_class, probability = None, None     # predict on your test data (replace with your model's predict call)
    accuracy = None
    f1 = None
    cm = None
    
    results['Class'].append(i)
    results['Probs'].append(probability)
    results['Accuracy'].append(accuracy)
    results['F1 Score'].append(f1)
    results['Confusion Matrix'].append(cm)

results_df = pd.DataFrame(results)

In [None]:
results_df.drop('Probs',axis=1)

In [None]:
# Multi class evaluations.
# Combine the probabilities of the classifiers calculated above and assign the label of the classifier with the highest probability
class_labels = ['Class 0: Most Likely Human', 
                'Class 1: Further Testing', 
                'Class 2: Most Likely Zombie']

In [None]:
# Calculate the macro f1, accuracy and confusion matrix for multiclass classification

# Part B: Use Scikit-learn  [10 Marks]

In this part, use scikit-learn’s [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) implementation to train and test the logistic regression on the provided dataset.

Use scikit-learn’s `accuracy_score` function to calculate the [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), the `f1_score` function to calculate the F1 score, and the [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function to calculate the confusion matrix on the test set.

Finally, plot the confusion matrix.

# Part C: Are You a Zombie?  [5 marks]
Use your multiclass classifier to predict whether you are a zombie.

In [183]:
# Fill in these values (honestly)
height = None               # Height in cm
weight = None               # Weight in kg
screen_time = None          # Screen time in hours per day
junk_food_days = None       # Junk food consumption in days per week
physical_activity = None    # Physical activity in hours per week
task_completion = None      # Task completion on a scale (example range: 1-10)

In [184]:
test_point = np.array([height, weight, screen_time, junk_food_days, physical_activity, task_completion])
test_point = stdscaler.transform(test_point)  # transform using your standard scaler instance

In [None]:

labels = {0: "Human", 1: "Needs Further Testing", 2: "Zombie"}
probs=[]
for i in range(3):  
    y_pred_class, prob = classifiers[i].predict(test_point.reshape(1,-1))    
    probs.append(prob)
combined_probs = np.column_stack([p for p in probs])
multi_class_pred = np.argmax(combined_probs, axis=1)
print("Prediction:", labels[multi_class_pred[0]])