# **HW1: Regression**
In *assignment 1*, you need to finish:

1.  Basic Part: Implement two regression models to predict the Systolic blood pressure (SBP) of a patient. You will need to implement **both Matrix Inversion and Gradient Descent**.


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement one regression model to predict the SBP of multiple patients in a different way than the basic part. You can choose **either** of the two methods for this part.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **1. Basic Part (55%)**
In the first part, you need to implement the regression to predict SBP from the given DBP


## 1.1 Matrix Inversion Method (25%)


*   Save the prediction result in a csv file **hw1_basic_mi.csv**
*   Print your coefficient


### *Import Packages*

> Note: You **cannot** import any other package

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

### *Global attributes*
Define the global attributes

In [2]:
training_dataroot = 'hw1_basic_training.csv' # Training data file file named as 'hw1_basic_training.csv'
testing_dataroot = 'hw1_basic_testing.csv'   # Testing data file named as 'hw1_basic_training.csv'
output_dataroot = 'hw1_basic_mi.csv' # Output file will be named as 'hw1_basic.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 3 matrix and saved as numpy array
                      # The format of each row should be ['subject_id', 'charttime', 'sbp']

You can add your own global attributes here


In [25]:
training_set =  []
validation_set =  []
testing_set =  []

### *Load the Input File*
First, load the basic input file **hw1_basic_training.csv** and **hw1_basic_testing.csv**

Input data would be stored in *training_datalist* and *testing_datalist*

In [26]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
  training_datalist = np.array(list(csv.reader(csvfile)))

with open(testing_dataroot, newline='') as csvfile:
  testing_datalist = np.array(list(csv.reader(csvfile)))

### *Implement the Regression Model*

> Note: It is recommended to use the functions we defined, you can also define your own functions


#### Step 1: Split Data
Split data in *training_datalist* into training dataset and validation dataset
* Validation dataset is used to validate your own model without the testing data



In [34]:
def SplitData():
  #print(len(training_datalist[1:]))
  spliting_index = int(len(training_datalist[1:])*0.9)
  #print(spliting_index)
  global training_set, validation_set, testing_set
  global output_datalist, testing_datalist
  training_set = training_datalist[1:spliting_index].astype(float)
  validation_set = training_datalist[spliting_index:].astype(float)
  testing_set = testing_datalist[1:].astype(float)
  output_datalist = testing_datalist
  #print(len(training_set))
  #print(len(validation_set))

#### Step 2: Preprocess Data
Handle the unreasonable data
> Hint: Outlier and missing data can be handled by removing the data or adding the values with the help of statistics  

In [18]:
def PreprocessData():
  # Preprocess the training data
  global training_set
  global validation_set

  q1_dbp = np.percentile(training_set[:, 0], 25)
  q3_dbp = np.percentile(training_set[:, 0], 75)
  iqr_dbp = q3_dbp - q1_dbp
  threshold_lower_dbp = q1_dbp - 1.5 * iqr_dbp
  threshold_upper_dbp = q3_dbp + 1.5 * iqr_dbp

  q1_sbp = np.percentile(training_set[:, 1], 25)
  q3_sbp = np.percentile(training_set[:, 1], 75)
  iqr_sbp = q3_sbp - q1_sbp
  threshold_lower_sbp = q1_sbp - 1.5 * iqr_sbp
  threshold_upper_sbp = q3_sbp + 1.5 * iqr_sbp

  outliers_dbp = np.where((training_set[:, 0] < threshold_lower_dbp) | (training_set[:, 0] > threshold_upper_dbp))
  outliers_sbp = np.where((training_set[:, 1] < threshold_lower_sbp) | (training_set[:, 1] > threshold_upper_sbp))

  #print(outliers_dbp)
  #print(outliers_sbp)
  #print(len(training_set))
  training_set = np.delete(training_set, outliers_dbp, axis=0)
  training_set = np.delete(training_set, outliers_sbp, axis=0)
  #print(training_set)
  #print(len(training_set))

#### Step 3: Implement Regression
> use Matrix Inversion to finish this part




In [46]:
def MatrixInversion():
  #print(training_set)

  # To do linear regression Y=(bias+X)W (in matrix form)

  X = training_set[:, 0].reshape(-1, 1)
  Y = training_set[:, 1]

  X = np.column_stack((np.ones(X.shape[0]), X))
  W = np.linalg.inv(X.T @ X) @ X.T @ Y

  # W[0] = bias
  # W[1] = X (as a scalar here)


  # Validate W with validation set
  #dbp_validation_values = np.array([row[0] for row in validation_set])
  #X_validation = np.column_stack((np.ones(dbp_validation_values.shape[0]), dbp_validation_values))
  #predicted_sbp = X_validation @ W

  #actual_sbp_values = np.array([row[1] for row in validation_set])
  #ape = np.abs((actual_sbp_values - predicted_sbp) / actual_sbp_values)
  #validation_mape = np.mean(ape) * 100

  #print("MAPE from validation_datalist:", validation_mape, " %")

  return W

#### Step 4: Make Prediction
Make prediction of testing dataset and store the value in *output_datalist*
The final *output_datalist* should look something like this
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

In [20]:
def MakePrediction(new_dbp, W):
  new_X = np.array([1, new_dbp])
  predicted_Y = new_X @ W

  return predicted_Y

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points would be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```





In [47]:
SplitData()
PreprocessData()
W = MatrixInversion()

print(' '.join(map(str, W.flatten())))

# Start to prediction on the testing_datalist
dbp_testing_values = np.array([row[0] for row in testing_set])
for i in range(1, len(output_datalist)):
    output_datalist[i][1] = MakePrediction(dbp_testing_values[i-1], W)

MAPE from validation_datalist: 6.385628268018219  %
32.45498431593448 1.2050287215220754


### *Write the Output File*
Write the prediction to output csv
> Format: 'sbp'




In [44]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

## 1.2 Gradient Descent Method (30%)


*   Save the prediction result in a csv file **hw1_basic_gd.csv**
*   Output your coefficient update in a csv file **hw1_basic_coefficient.csv**
*   Print your coefficient





### *Global attributes*

In [None]:
output_dataroot = 'hw1_basic_gd.csv' # Output file will be named as 'hw1_basic.csv'
coefficient_output_dataroot = 'hw1_basic_coefficient.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

coefficient_output = [] # Your coefficient update during gradient descent
                   # Should be a (number of iterations * number_of coefficient) matrix
                   # The format of each row should be ['w0', 'w1', ...., 'wn']

Your own global attributes

### *Implement the Regression Model*


#### Step 1: Split Data

In [None]:
def SplitData():

#### Step 2: Preprocess Data

In [None]:
def PreprocessData():


#### Step 3: Implement Regression
> use Gradient Descent to finish this part

In [None]:
def GradientDescent():

#### Step 4: Make Prediction

Make prediction of testing dataset and store the values in *output_datalist*
The final *output_datalist* should look something like this
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

Remember to also store your coefficient update in *coefficient_output*
The final *coefficient_output* should look something like this
> [ [1, 0, 3, 5], ... , [0.1, 0.3, 0.2, 0.5] ] where each row contains the [w0, w1, ..., wn] of your coefficient





In [None]:
def MakePrediction():

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points would be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```



### *Write the Output File*

Write the prediction to output csv
> Format: 'sbp'

**Write the coefficient update to csv**
> Format: 'w0', 'w1', ..., 'wn'
>*   The number of columns is based on your number of coefficient
>*   The number of row is based on your number of iterations

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

with open(coefficient_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in coefficient_output:
    writer.writerow(row)

# **2. Advanced Part (40%)**
In the second part, you need to implement the regression in a different way than the basic part to help your predictions of multiple patients SBP.

You can choose **either** Matrix Inversion or Gradient Descent method.

The training data will be in **hw1_advanced_training.csv** and the testing data will be in **hw1_advanced_testing.csv**.

Output your prediction in **hw1_advanced.csv**

Notice:
> You cannot import any other package other than those given



### Input the training and testing dataset

In [None]:
training_dataroot = 'hw1_advanced_training.csv' # Training data file file named as 'hw1_basic_training.csv'
testing_dataroot = 'hw1_advanced_testing.csv'   # Testing data file named as 'hw1_basic_training.csv'
output_dataroot = 'hw1_advanced.csv' # Output file will be named as 'hw1_basic.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 220 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

### Your Implementation

### Output your Prediction

> your filename should be **hw1_advanced.csv**

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulty you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)