# **HW1: Regression**
In *assignment 1*, you need to finish:

1.  Basic Part: Implement two regression models to predict the Systolic blood pressure (SBP) of a patient. You will need to implement **both Matrix Inversion and Gradient Descent**.


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement one regression model to predict the SBP of multiple patients in a different way than the basic part. You can choose **either** of the two methods for this part.

# **1. Basic Part (55%)**
In the first part, you need to implement regression to predict SBP from the given DBP.


## 1.1 Matrix Inversion Method (25%)


*   Save the prediction result in a csv file **hw1_basic_mi.csv**
*   Print your coefficient


### *Import Packages*

> Note: You **cannot** import any other package

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

### *Global attributes*
Define the global attributes

In [None]:
training_dataroot = 'hw1_basic_training.csv' # Training data file, named 'hw1_basic_training.csv'
testing_dataroot = 'hw1_basic_testing.csv'   # Testing data file, named 'hw1_basic_testing.csv'
output_dataroot = 'hw1_basic_mi.csv' # Output file, named 'hw1_basic_mi.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your predictions, should be a 20 * 1 matrix saved as a numpy array
                      # The format of each row should be ['sbp']

You can add your own global attributes here


### *Load the Input File*
First, load the basic input files **hw1_basic_training.csv** and **hw1_basic_testing.csv**.

The input data will be stored in *training_datalist* and *testing_datalist*.

In [None]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
  training_datalist = np.array(list(csv.reader(csvfile)))

with open(testing_dataroot, newline='') as csvfile:
  testing_datalist = np.array(list(csv.reader(csvfile)))

### *Implement the Regression Model*

> Note: We recommend using the functions we defined, but you can also define your own functions.


#### Step 1: Split Data
Split the data in *training_datalist* into a training dataset and a validation dataset
* Validation dataset is used to validate your own model without the testing data
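
As one possible illustration of this step, the sketch below shuffles the rows and holds out a fraction for validation. The function name `split_data`, the 80/20 ratio, the fixed seed, and the `dbp`/`sbp` header in the demo data are assumptions made for the example, not requirements of the assignment.

```python
import numpy as np

def split_data(datalist, ratio=0.8, seed=0):
    """Shuffle the rows (header excluded) and split them into training and validation sets."""
    rows = np.asarray(datalist[1:])        # assumes row 0 is the CSV header
    rng = np.random.default_rng(seed)      # fixed seed so the split is reproducible
    idx = rng.permutation(len(rows))
    cut = int(len(rows) * ratio)
    return rows[idx[:cut]], rows[idx[cut:]]

# Hypothetical stand-in for training_datalist: a header row plus 10 records
demo = np.array([["dbp", "sbp"]] + [[str(60 + i), str(100 + i)] for i in range(10)])
train, valid = split_data(demo)
print(train.shape, valid.shape)
```

Shuffling suits independent records; if the row order carries meaning (for example, measurements over time), a plain head/tail split may be the safer choice.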



In [None]:
def SplitData():


#### Step 2: Preprocess Data
Handle unreasonable data such as outliers and missing values
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values derived from statistics (for example, the column mean)
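
The hint above could be sketched as follows: rows with missing or non-numeric fields are dropped, then rows whose z-score exceeds a threshold are removed. The function name `preprocess`, the threshold values, and the demo rows are assumptions for illustration; filling in the column mean instead of dropping would be an equally valid reading of the hint.

```python
import numpy as np

def preprocess(rows, z_max=3.0):
    """Drop rows with missing values, then drop rows with any |z-score| above z_max."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append([float(v) for v in row])
        except ValueError:                 # empty or non-numeric field
            continue
    data = np.array(cleaned)
    data = data[~np.isnan(data).any(axis=1)]   # also drop literal NaN entries
    # Note: a column with zero variance would need special handling here
    z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))
    return data[(z <= z_max).all(axis=1)]

# Hypothetical rows: one missing DBP and one implausible SBP of 900
rows = [["80", "120"], ["82", "124"], ["", "130"], ["81", "122"], ["79", "900"]]
clean = preprocess(rows, z_max=1.5)        # small threshold because the demo set is tiny
print(clean)
```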

In [None]:
def PreprocessData():


#### Step 3: Implement Regression
> use Matrix Inversion to finish this part
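
For reference, the matrix inversion approach solves the normal equation w = (XᵀX)⁻¹Xᵀy. The sketch below is a minimal version under assumed names (`matrix_inversion`, `degree`) and made-up demo data; it returns coefficients ordered from the highest power down to the constant term.

```python
import numpy as np

def matrix_inversion(x, y, degree=1):
    """Least-squares weights from the normal equation w = (X^T X)^-1 X^T y."""
    X = np.vander(x, degree + 1)           # columns: x^degree, ..., x^1, x^0
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Made-up data lying exactly on y = 1.5 * x + 10
x = np.array([60.0, 70.0, 80.0, 90.0])
y = 1.5 * x + 10.0
w = matrix_inversion(x, y)
print(*w)                                  # coefficients, highest degree first
```

`np.linalg.inv` raises an error when XᵀX is singular, which can happen with duplicated or constant columns, so cleaning the data first matters.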




In [None]:
def MatrixInversion():


#### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*.
The final *output_datalist* should look something like this:
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP
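
A minimal sketch of producing that shape, assuming the coefficients are ordered from the highest degree down (as in the printed format required in Step 5). The name `make_prediction` and the sample values are hypothetical.

```python
import numpy as np

def make_prediction(w, x):
    """Evaluate the polynomial with coefficients w (highest degree first) at each x."""
    return [[float(v)] for v in np.polyval(w, x)]

# Hypothetical coefficients for 1.5 * x + 10 and two test DBP values
output = make_prediction(np.array([1.5, 10.0]), np.array([60.0, 80.0]))
print(output)
```

Each inner list holds one prediction, so the result can be passed row by row to `csv.writer.writerow`.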

In [None]:
def MakePrediction():


#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```





### *Write the Output File*
Write the prediction to output csv
> Format: 'sbp'




In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

## 1.2 Gradient Descent Method (30%)


*   Save the prediction result in a csv file **hw1_basic_gd.csv**
*   Output your coefficient updates in a csv file **hw1_basic_coefficient.csv**
*   Print your coefficients





### *Global attributes*

In [None]:
output_dataroot = 'hw1_basic_gd.csv' # Output file, named 'hw1_basic_gd.csv'
coefficient_output_dataroot = 'hw1_basic_coefficient.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 20 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

coefficient_output = [] # Your coefficient updates during gradient descent
                   # Should be a (number of iterations * number of coefficients) matrix
                   # The format of each row should be ['w0', 'w1', ...., 'wn']

Your own global attributes

### *Implement the Regression Model*


#### Step 1: Split Data

In [None]:
def SplitData():

#### Step 2: Preprocess Data

In [None]:
def PreprocessData():


#### Step 3: Implement Regression
> use Gradient Descent to finish this part
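
One way this step might look is batch gradient descent on the mean squared error, recording the weights after every update. The sketch below fits a straight line only; the function name, learning rate, iteration count, and the decision to standardise `x` are all assumptions for illustration. Standardising lets a single learning rate work for both weights, at the cost of converting them back to the original scale afterwards.

```python
import numpy as np

def gradient_descent(x, y, lr=0.1, iterations=500):
    """Fit y ~ w0 + w1 * x by batch gradient descent on mean squared error."""
    mean, std = x.mean(), x.std()
    xs = (x - mean) / std                          # standardised input
    X = np.column_stack([np.ones_like(xs), xs])    # columns: bias, xs
    w = np.zeros(2)
    history = []                                   # weights after each update (standardised scale)
    for _ in range(iterations):
        grad = (-2.0 / len(y)) * X.T @ (y - X @ w)
        w = w - lr * grad
        history.append(w.copy())
    w1 = w[1] / std                                # convert back to the original x scale
    w0 = w[0] - w1 * mean
    return np.array([w0, w1]), np.array(history)

# Made-up data lying exactly on y = 1.5 * x + 10
x = np.array([60.0, 70.0, 80.0, 90.0])
y = 1.5 * x + 10.0
w, history = gradient_descent(x, y)
print(w, history.shape)
```

Here `history` has one row per iteration, matching the layout described for *coefficient_output*, although its values are in the standardised scale rather than the original one.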

In [None]:
def GradientDescent():

#### Step 4: Make Prediction

Make predictions for the testing dataset and store the values in *output_datalist*.
The final *output_datalist* should look something like this:
> [ [100], [80], ... , [90] ] where each row contains the predicted SBP

Remember to also store your coefficient updates in *coefficient_output*.
The final *coefficient_output* should look something like this:
> [ [1, 0, 3, 5], ... , [0.1, 0.3, 0.2, 0.5] ] where each row contains the coefficients [w0, w1, ..., wn] after one update





In [None]:
def MakePrediction():

#### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be:
```
3 2 1
```



### *Write the Output File*

Write the prediction to output csv
> Format: 'sbp'

**Write the coefficient updates to csv**
> Format: 'w0', 'w1', ..., 'wn'
>*   The number of columns equals your number of coefficients
>*   The number of rows equals your number of iterations

In [None]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

with open(coefficient_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in coefficient_output:
    writer.writerow(row)

# **2. Advanced Part (40%)**
In the second part, you need to implement the regression in a different way than in the basic part to predict the SBP of multiple patients.

You can choose **either** Matrix Inversion or Gradient Descent method.

The training data will be in **hw1_advanced_training.csv** and the testing data will be in **hw1_advanced_testing.csv**.

Output your prediction in **hw1_advanced.csv**

Notice:
> You cannot import any packages other than those given



### Input the training and testing dataset

In [3]:
training_dataroot = 'hw1_advanced_training.csv' # Training data file, named 'hw1_advanced_training.csv'
testing_dataroot = 'hw1_advanced_testing.csv'   # Testing data file, named 'hw1_advanced_testing.csv'
output_dataroot = 'hw1_advanced.csv' # Output file, named 'hw1_advanced.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be 220 * 1 matrix and saved as numpy array
                      # The format of each row should be ['sbp']

### Your Implementation

In [87]:
uni_k = 5
order = 1
z_max = 4
iterations = 1000
learning_rate = []

print("Using", uni_k, "data before")
print("Polynomial order", order)
print("Filter data using Z value", z_max, "\n")

# filter data function
def cutData(data):
    data = data.drop(["charttime"], axis=1)
    return data.to_numpy()

def makeLearningRate():
    c = 0.000001
    learning_rate.clear()
    learning_rate.append(0.01)
    for i in range(4):
        for j in range(order):
            learning_rate.append(0.1 * (c ** (j+1)))
    for i in range(uni_k):
        learning_rate.append(c)

def MakeBatch(data, split):
    batch = round(len(data) / split)
    ret = []
    
    for i in range(split):
        bot = i * batch
        top = (i+1) * batch
        if i == split-1:
            top = len(data)
        ret.append(data[bot:top])
        
    return ret

def FilterData(data):
    ret_data = dict()
    
    print("BEFORE =", len(data))
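    # Fill missing temperatures with the column mean (assignment form avoids chained in-place fillna)
    data['temperature'] = data['temperature'].fillna(round(data['temperature'].mean(), 2))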
    data = data.dropna()
    print("AFTER DROP NAN =", len(data))
    
    topic = ["temperature", "heartrate", "resprate", "o2sat", "sbp"]
    gd_mean = data[topic].mean()
    gd_std = data[topic].std()
    
    for i in topic:
        data = data[abs(data[i] - gd_mean[i]) / gd_std[i] <= z_max]
    print("AFTER Z VALUE =", len(data))
    print("============================================")
        
    
    data = data.groupby("subject_id")
    for group_name, group_data in data:
        group_data = group_data.drop(["subject_id", "charttime"], axis = 1)
        ret_data[group_name] = group_data.to_numpy()
#         print("SUBJECT", group_name, "=====")
#         for i in topic:
#             print(i, "===")
#             print(group_data[i].min())
#             print(group_data[i].mean())
#             print(group_data[i].max())
        
    return ret_data

# split data function
def Split_Data(data, num):
    size = len(data)
    split_size = round(size * num)
    data1 = np.array(data[1:split_size]).astype('float64') # since index 0 is string
    data2 = np.array(data[split_size:size]).astype('float64')
    return data1, data2

# training function
def modif_x(data, k, x):
    ret = [1]
    n1 = len(data)
    n2 = len(data[0])

    for i in range(4):
        for j in range(order):
            ret.append(x[i] ** (j+1))
    for i in range(k):
        ret.append(data[n1-1-i][n2-1])
        
    return ret
    
def make_matrix(data, k):
    new_data = []
    for i in range(k, len(data)):
        new_data.append([1])

        for j in range(4):
            for p in range(order):
                new_data[i-k].append(data[i][j] ** (p+1))
        for j in range(k):
            new_data[i-k].append(data[i-1-j][4])
        new_data[i-k].append(data[i][4])
    
    new_data = np.array(new_data)
    return new_data

def gradient_descent(data, weight):
    n = len(weight)
    w_new = np.zeros(n)
    norm = -2/len(data)
    
    for row in data:
        predict = 0
        y = row[n]
        for i in range(n):
            predict += row[i] * weight[i]
        for i in range(n):
            w_new[i] += (norm) * (y - predict) * row[i]
            
    for i in range(n):
        w_new[i] = weight[i] - learning_rate[i] * w_new[i]
    return w_new
        

def matrix_inversion(data):
    n = len(data[0])
    X = np.array(data[:,0:n-1])
    Y = np.array(data[:,n-1])

    ret = np.linalg.inv(X.T @ X) @ X.T @ Y
    return ret

def train_func(data):
    n = len(data[0])
    grad = np.zeros(n-1)
    #data = MakeBatch(data, 10)
    #for batch in data:
    for i in range(iterations):
        grad = gradient_descent(data, grad)
    #grad = matrix_inversion(data)
    return grad

def doPrediction(b, X):
    ret = 0
    for i in range(len(X)):
        ret += b[i] * X[i]
    return ret
    
# read files
adv_training_data = pd.read_csv(training_dataroot)
adv_testing_data = pd.read_csv(testing_dataroot)
makeLearningRate()

data = FilterData(adv_training_data)  # per-subject arrays; test_MAPE() reads this global

# train every data
def test_MAPE():
    grad = dict()
    mape = 0
    for idx in data:
        # for calculating MAPE
        train_data, test_data = Split_Data(data[idx], 0.9)
        train_data = make_matrix(train_data, uni_k)
        grad[idx] = train_func(train_data)
        
        msum = 0
        for row in test_data:
            X = modif_x(train_data, uni_k, row[0:4])
            Y = doPrediction(grad[idx], X)
            msum += abs(row[4] - Y) / row[4]
            
            X.append(Y)
            X = np.array(X)
            train_data = np.vstack((train_data, X))
        mape = 100 / len(test_data) * msum
        print(mape)
    #print(mape/11, "%")

def final():
    grad = dict()
    train_data = FilterData(adv_training_data)
    test_data = cutData(adv_testing_data)
    
    for topic in train_data:
        train_data[topic] = make_matrix(train_data[topic], uni_k)
        grad[topic] = train_func(train_data[topic])
    
    for row in test_data:
        sub_id = row[0]
        X = row[1:5]
        X = modif_x(train_data[sub_id], uni_k, X)
        row[5] = doPrediction(grad[sub_id], X)
        
        X.append(row[5])
        X = np.array(X)
        train_data[sub_id] = np.vstack((train_data[sub_id], X))
    
    return test_data

makeLearningRate()

print("UNI_K = ", uni_k)
test_MAPE()
print()

# out = final()
# output_datalist = []

# for row in out:
#     output_datalist.append([row[5]])

# print(" === FINISH ===")

Using 5 data before
Polynomial order 1
Filter data using Z value 4 

BEFORE = 5696
AFTER DROP NAN = 5436
AFTER Z VALUE = 5328
 === FINISH ===


### Output your Prediction

> your filename should be **hw1_advanced.csv**

In [88]:
for row in output_datalist:
    print(row)

with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)

[138.16051185061664]
[136.41031755649922]
[139.39119575108776]
[139.7411636004514]
[140.64602106343008]
[141.82817813748477]
[142.3672038789362]
[143.69560549407203]
[143.92100712008394]
[145.20329185924228]
[145.76569550375714]
[146.17519235046785]
[147.01576184839996]
[147.64754590890885]
[147.8256394108033]
[148.014561817419]
[148.8441045616383]
[149.26419218696353]
[148.9719025163484]
[149.43048362296298]
[150.54793705180563]
[149.98147813965082]
[147.15486401657145]
[146.55194656044696]
[147.23125579263197]
[145.85263646568964]
[145.4556945992056]
[144.4172653597532]
[144.23200017109187]
[143.28423098330603]
[142.9411923943421]
[142.46727610099273]
[141.8091432212671]
[141.44472196705522]
[142.10891206213017]
[141.28731639860158]
[140.8697416089043]
[140.4182668664029]
[140.91884213456032]
[140.17744469622858]
[122.75753185455687]
[122.27643328353453]
[123.13378064948809]
[124.11276398846334]
[122.59083801287639]
[121.5738165524446]
[121.31691934292252]
[120.90699073042921]
[120.7

# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)