<div style="text-align: center; font-size: 30px; font-weight: bold;">
    Assignment/Lab 2: Winter 2025 Group 2
    <br>***
</div>

<h1>Team members</h1>
<b>
    
- Minh Le Nguyen
- Liam Knapp
- Gautam Singh
- Gleb Ignatov

</b>
<br>

---

<div style="text-align: center; font-size: 24px; font-weight: bold;">
    Building Linear and Logistic Regression Models from Scratch
</div>

## I. Objectives

<b>
    
- Implement Linear Regression and Logistic Regression from scratch without using machine learning 
libraries.  
- Understand and apply gradient descent for optimizing model parameters. 
- Evaluate model performance using appropriate performance measures. 
- Use your implementation to perform regression and classification on the datasets provided in separate files. 
- Compare your custom implementations with scikit-learn’s built-in models. 
- Reflect on challenges encountered and key takeaways from implementing regression models 
manually.
    
</b>

**Note: Detailed instructions are at the bottom.**

### Formulas

<b>

1. [Linear Regression](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning#linear-models)

2. [Logistic Regression](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning#linear-models)

</b>

---

## II. Implementation

### Set up your Python development environment

In [72]:
pip install numpy pandas scipy matplotlib seaborn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import randint

### Step 1: Implement the Linear Regression Algorithm        

In [64]:
class LinearRegression:
    """
    A custom implementation of the LinearRegression algorithm for regression tasks.
    
    This class supports training on input features and corresponding target values, 
    and predicting target values for new input data.

    Attributes:
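        TrainX (numpy.ndarray): Training input features, set by fit().
        TrainY (numpy.ndarray): Training target values, set by fit().
        slope: Fitted slope of the regression line, set by fit().
        intercept: Fitted intercept of the regression line, set by fit().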

    Methods:
        fit(x, y):
            Trains the regressor using the provided input features (x) and target values (y).
            Calculates the Slope and Intercept of the training data.
        
        predict(x):
            Predicts target values for a given set of input features (x)
            Returns the predictions as a numpy array.
    """
    def fit(self, x, y):
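        # Flatten to 1-D so that a column-shaped input of shape (n, 1) yields a scalar slope
        self.TrainX = np.array(x, dtype=float).ravel()
        self.TrainY = np.array(y, dtype=float).ravel()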

        # Compute means
        xmean = np.mean(self.TrainX)
        ymean = np.mean(self.TrainY)

        # Compute numerator and denominator
        numerator = sum((xi - xmean) * (yi - ymean) for xi, yi in zip(self.TrainX, self.TrainY))
        denominator = sum((xi - xmean) ** 2 for xi in self.TrainX)

        # Compute slope and intercept
        self.slope = numerator / denominator
        self.intercept = ymean - (self.slope * xmean)

    
    def predict(self, x):
        # Input feature values to predict for, flattened to 1-D
        x = np.array(x, dtype=float).ravel()

        # Apply the fitted line to every input value at once
        return self.slope * x + self.intercept

### Stub to test the Linear Regression implementation

In [70]:
# Training data (features and target values)
TrainX = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]  # Training X values
TrainY = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]  # Training Y values

# Test data
test_x = [10, 15, 20]  # Test X values

# Initialize and train the model
model = LinearRegression()
model.fit(TrainX, TrainY)

# Make predictions
predictions = model.predict(test_x)
print("Predicted Y values:", predictions)

Predicted Y values: [85.59308315 76.83664459 68.08020603]


### Step 2: Load the Dataset 
(training_dataset_linear.csv and validation_dataset_linear.csv)    

In [62]:
def step_2_load_dataset(file_path="dataset/training_dataset_linear.csv", validation_file_path="dataset/validation_dataset_linear.csv"):
    """
    Load the dataset and display useful information about the dataset
    
    Args:
        file_path (str): Path to the CSV file containing the training dataset.
        validation_file_path (str): Path to the CSV file containing the validation dataset.
        
    Returns:
        trainData (DataFrame):  Pandas dataframe containing the training dataset.
        validationData (DataFrame): Pandas dataframe containing the validation dataset.
    """
    
    
    trainData = pd.read_csv(file_path)
    validationData = pd.read_csv(validation_file_path)

    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.max_colwidth')

    print("Training Data Preview:")
    print(trainData.head())
    
    print("\nValidation Data Preview:")
    print(validationData.head())

    print("\nTraining Data Info:")
    print(trainData.info())
    
    print("\nValidation Data Info:")
    print(validationData.info())
    
    print("\nTraining Data Summary Statistics:")
    print(trainData.describe())
    
    print("\nValidation Data Summary Statistics:")
    print(validationData.describe())

    print("\nMissing Values in Training Data:")
    print(trainData.isnull().sum())
    
    print("\nMissing Values in Validation Data:")
    print(validationData.isnull().sum())

    print("\nDuplicate Rows in Training Data:", trainData.duplicated().sum())
    print("Duplicate Rows in Validation Data:", validationData.duplicated().sum())

    print("\nTraining Data Types:")
    print(trainData.dtypes)
    
    print("\nValidation Data Types:")
    print(validationData.dtypes)

    return trainData, validationData

Train_Data, Validation_Data = step_2_load_dataset()

Training Data Preview:
       x       y
0  1.730  72.851
1  1.184  35.511
2  0.169   0.525
3  0.355   9.269
4  0.250  13.250

Validation Data Preview:
       x       y
0  2.443  89.705
1  0.603  20.943
2  1.137  38.205
3  0.156  14.009
4  0.163   8.761

Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       80 non-null     float64
 1   y       80 non-null     float64
dtypes: float64(2)
memory usage: 1.4 KB
None

Validation Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       20 non-null     float64
 1   y       20 non-null     float64
dtypes: float64(2)
memory usage: 452.0 bytes
None

Training Data Summary Statistics:
               x          y
count  80.000000  80.000000
mean    0.821150 

### Step 3: Train the Linear Regression Model

In [60]:
def step_3_train_linear_model(TrainData):
    """
    Prepare the features and target values from an already loaded dataset, and train a LinearRegression model.
    
    Args:
        TrainData (pandas dataframe): Pandas dataframe containing the dataset.
        
    Returns:
        linear_model (LinearRegression): Trained LinearRegression model.
        TrainX (numpy.ndarray): Input features (Train x) used for training.
        TrainY (numpy.ndarray): Target values (Train Y) used for training.
    """
    
    # Separate the data into features (X) and target values (Y)
    TrainX = TrainData["x"].values.reshape(-1, 1)  # Convert to numpy array and reshape for the model
    TrainY = TrainData["y"].values  # Training Y values

    # Instantiate the LinearRegression class
    linear_model = LinearRegression()

    # Fit the model with the training data
    linear_model.fit(TrainX, TrainY)
    
    
    return linear_model, TrainX, TrainY



linear_model, TrainX, TrainY = step_3_train_linear_model(Train_Data)

### Step 4: Test and Evaluate the Model

### Step 5: Implement the Logistic Regression Algorithm

### Step 6: Load the Dataset 
(training_dataset_logistic.csv and validation_dataset_logistic.csv)    

### Step 7: Train the Logistic Regression Model

### Step 8: Test and Evaluate the Model

---

## III. Instructions

### Step 1: Implement the Linear Regression Algorithm
Your task is to implement the Linear Regression algorithm from scratch without using any machine 
learning libraries like scikit-learn for the core functionality. Follow these steps:

<b>

1. Create a LinearRegression class with the following methods:
    - fit(X, y): Train the model using the given input features X and target values y.
    - predict(X): Predict the target values for a given set of examples.
    - You may add other methods or modify the input arguments as needed
2. Use the Mean Squared Error (MSE) as the loss function
3. Implement gradient descent to optimize the model parameters. Allow the learning rate (lr) and 
the number of iterations to be adjustable
4. Ensure your implementation supports multiple features.

</b>
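
The requirements above (MSE loss, gradient descent, adjustable learning rate and iteration count, multiple features) could be sketched roughly as follows. This is only one possible shape, not the group's implementation: the class name `GDLinearRegression` and the parameter names `lr` and `n_iters` are assumptions chosen for illustration.

```python
import numpy as np


class GDLinearRegression:
    """Hypothetical sketch: linear regression fitted by batch gradient descent on MSE."""

    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr            # learning rate (step size)
        self.n_iters = n_iters  # number of gradient descent iterations
        self.weights = None     # one weight per feature, set by fit()
        self.bias = 0.0

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)  # treat a 1-D input as a single feature
        y = np.asarray(y, dtype=float).ravel()
        n_samples, n_features = X.shape

        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for _ in range(self.n_iters):
            # Residuals of the current predictions
            error = X @ self.weights + self.bias - y
            # Gradients of MSE = mean(error ** 2) with respect to weights and bias
            grad_w = (2.0 / n_samples) * (X.T @ error)
            grad_b = (2.0 / n_samples) * error.sum()
            self.weights -= self.lr * grad_w
            self.bias -= self.lr * grad_b
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        return X @ self.weights + self.bias
```

Whether a given `lr` converges depends on the scale of the features, so the learning rate and iteration count usually need some tuning per dataset.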

---

### Step 2: Load the Dataset 

You will receive files named training_dataset_linear.csv and validation_dataset_linear.csv containing the 
datasets. Perform the following:

<b>

1. Load the data from the provided CSV files.
2. Understand the dataset using visualizations and basic statistical summaries.
3. Preprocess the data if necessary (e.g., handle missing values, normalize features if needed). 

</b>
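
If normalization turns out to be needed, one common option is z-score standardization, with the mean and standard deviation computed on the training set only and then reused for the validation set. The helper name `standardize` below is an assumption for illustration, and the sample values are made up.

```python
import numpy as np


def standardize(train, other):
    """Hypothetical helper: z-score both arrays using the training statistics only."""
    train = np.asarray(train, dtype=float)
    other = np.asarray(other, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero
    std = np.where(std == 0, 1.0, std)
    return (train - mean) / std, (other - mean) / std


# Made-up single-feature data, shaped (n_samples, n_features)
train_scaled, val_scaled = standardize([[1.0], [2.0], [3.0]], [[2.0]])
```

Reusing the training statistics keeps information from the validation set out of the preprocessing step.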

---

### Step 3: Train the Linear Regression Model         

<b>

1. Initialize your LinearRegression model with a learning rate of lr and iter iterations. 
2. Train the model using the fit method with the provided training dataset.

</b>

---

### Step 4: Test and Evaluate the Model 

<b>

1. Use the predict method to make predictions on the validation dataset.
2. Compute the Mean Squared Error (MSE) and R-squared score to evaluate performance.
3. Plot the regression line generated by the model along with the training data on a single graph.
4. Compare your implementation with the result of LinearRegression from scikit-learn.

</b>
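
The two metrics named above can be computed directly with NumPy. The following is a minimal sketch; the function names `mean_squared_error` and `r2_score` are assumptions that mirror common naming rather than anything defined in this notebook.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    # ravel() avoids accidental (n,) - (n, 1) broadcasting into an n x n matrix
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

An R-squared of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of `y_true`.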

---

### Step 5: Implement the Logistic Regression Algorithm  

<b>

1. Create a LogisticRegression class with the following methods:
    - fit(X, y): Train the model using the given input features X and target values y. 
    - predict(X): Predict the class labels for a given set of examples. 
    - predict_proba(X): Return the probability scores for each class. 
    - You may add other methods or modify the input arguments as needed.

2. Use the Binary Cross-Entropy (Log Loss) as the loss function.
3. Implement gradient descent to optimize the model parameters. Allow the learning rate (lr) and 
the number of iterations to be adjustable.
4. Ensure your implementation supports multiple features.
5. Use the sigmoid function to map predictions to probabilities. 
   
</b>
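
A rough sketch of how these requirements might fit together is shown below. The names `GDLogisticRegression`, `sigmoid`, `binary_cross_entropy`, `lr`, `n_iters`, and `threshold` are assumptions for illustration, and the sketch covers the binary case only.

```python
import numpy as np


def sigmoid(z):
    # Clip to avoid overflow in exp() for very large |z|
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float).ravel()
    p = np.clip(np.asarray(p, dtype=float).ravel(), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


class GDLogisticRegression:
    """Hypothetical sketch: binary logistic regression fitted by batch gradient descent."""

    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = 0.0

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for _ in range(self.n_iters):
            # Gradient of the mean log loss with respect to z is (p - y)
            error = sigmoid(X @ self.weights + self.bias) - y
            self.weights -= self.lr * (X.T @ error) / n_samples
            self.bias -= self.lr * error.mean()
        return self

    def predict_proba(self, X):
        p1 = sigmoid(np.asarray(X, dtype=float) @ self.weights + self.bias)
        # Columns are P(class 0) and P(class 1)
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)
```

The loss function is kept separate from the class so that it can be logged during training or reused on the validation set.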

---

### Step 6: Load the Dataset    

You will receive files named training_dataset_logistic.csv and validation_dataset_logistic.csv containing 
the datasets. Perform the following: 

<b>

1. Load the data from the provided CSV files.
2. Understand the dataset using visualizations and basic statistical summaries.
3. Preprocess the data if necessary (e.g., handle missing values, normalize features if needed). 

</b>

---

### Step 7: Train the Logistic Regression Model        

<b>

1. Initialize your LogisticRegression model with a learning rate of lr and iter iterations.
2. Train the model using the fit method with the provided training dataset.

</b>

---


### Step 8: Test and Evaluate the Model

<b>

1. Use the predict method to classify examples from the validation dataset.
2. Compute the accuracy, precision, recall, and F1-score to evaluate the model.
3. Plot the decision boundary along with the training data on a single graph.
4. What is the equation of the decision boundary?
5. Compare your implementation with the result of LogisticRegression from scikit-learn.
   
</b>
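
As a minimal sketch of the evaluation steps above, the four metrics can be derived from confusion-matrix counts. For a model that predicts class 1 when the sigmoid output is at least 0.5, the decision boundary is the set of points where `w . x + b = 0`; with two features this can be solved for the second feature. The helper names `classification_metrics` and `boundary_x2` are assumptions for illustration, and labels are assumed to be 0 and 1.

```python
import numpy as np


def classification_metrics(y_true, y_pred):
    """Hypothetical helper: accuracy, precision, recall and F1 for 0/1 labels."""
    y_true = np.asarray(y_true).ravel().astype(int)
    y_pred = np.asarray(y_pred).ravel().astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    accuracy = (tp + tn) / len(y_true)
    # Fall back to 0.0 when a denominator is zero (no positive predictions or labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def boundary_x2(x1, weights, bias):
    """For two features, solve w1*x1 + w2*x2 + b = 0 for x2 (assumes w2 != 0)."""
    w1, w2 = weights
    return -(w1 * np.asarray(x1, dtype=float) + bias) / w2
```

Evaluating `boundary_x2` over a range of `x1` values gives the line to draw on top of a scatter plot of the training data.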