# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases in a different way than the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction result in a csv file **hw1_basic.csv**


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [150]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [151]:
input_dataroot = 'hw1_basic_input.csv'  # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv'       # Output file will be named as 'hw1_basic.csv'

input_datalist =  []                    # Initial datalist, saved as numpy array
output_datalist =  []                   # Your prediction, should be 10 * 4 matrix and saved as numpy array
                                        # The format of each row should be ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [152]:
# Dataset and datacount
training_data = []
validation_data = []
prediction_data = []
training_data_count = 84
test_data_count = 10

# Cleaning parameters
upper_percent = 90
lower_percent = 10

# Regression Parameters
autoregression = 0
max_degree = 2

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

Input data would be stored in *input_datalist*

In [153]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
    input_datalist = np.array(list(csv.reader(csvfile)))
input_datalist = input_datalist[1: ]

## Implement the Regression Model

> Note: We recommend using the functions defined below, but you may also define your own functions


### Step 1: Split Data
Split the data in *input_datalist* into a training dataset and a validation dataset



In [154]:
def SplitData():
    global training_data, validation_data, prediction_data
    
    # Reset datalists
    training_data = []
    validation_data = []
    prediction_data = []
    
    # Assign training data
    for data in input_datalist[0: training_data_count]:
        entry = [int(data[0])]
        for cell in data[1: ]:
            if cell == "":
                cell = 0
            entry.append(float(cell))
        training_data.append(entry)

    # Assign validation data
    for data in input_datalist[training_data_count: training_data_count + test_data_count]:
        entry = [int(data[0])]
        for cell in data[1: ]:
            if cell == "":
                cell = 0
            entry.append(float(cell))
        validation_data.append(entry)
    
    # Assign prediction data
    for data in input_datalist[training_data_count + test_data_count: training_data_count + 2 * test_data_count]:
        entry = [int(data[0])]
        for cell in data[1: ]:
            if cell == "":
                cell = 0
            entry.append(float(cell))
        prediction_data.append(entry)

    # Transpose data
    training_data = np.transpose(np.asarray(training_data, dtype = object))
    validation_data = np.transpose(np.asarray(validation_data, dtype = object))
    prediction_data = np.transpose(np.asarray(prediction_data, dtype = object))
SplitData()
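
As a compact illustration of the same chronological split, here is a minimal sketch that uses NumPy slicing. The name `split_rows` and its parameters are hypothetical and not part of the assignment template. It assumes every non-empty cell parses as a number, fills empty cells with 0 as `SplitData` does, and stores the epiweek as a float rather than an int.

```python
import numpy as np

def split_rows(rows, n_train=84, n_test=10):
    """Split CSV rows (lists of strings) into training, validation and prediction blocks."""
    # Empty cells become 0.0, mirroring the behaviour of SplitData above
    data = np.array([[float(cell) if cell != "" else 0.0 for cell in row] for row in rows])
    # Transpose so that each row of the result is one column of the CSV
    train = data[:n_train].T
    valid = data[n_train:n_train + n_test].T
    predict = data[n_train + n_test:n_train + 2 * n_test].T
    return train, valid, predict
```

With 104 input rows this yields blocks of 84, 10 and 10 weeks, in the original order.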

### Step 2: Preprocess Data
Handle unreasonable values in the data
> Hint: Outliers and missing data can be handled by removing the affected entries or by replacing them with values derived from statistics

In [155]:
def PreprocessData():
    global training_data

    # Clip outliers to the IQR fences [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    # Row 0 holds the epiweek, so cleaning starts at row 1
    for i in range(1, len(training_data)):
        # percentile_lower = np.percentile(training_data[i], lower_percent)
        # percentile_upper = np.percentile(training_data[i], upper_percent)
        Q1 = np.percentile(training_data[i], 25)
        Q3 = np.percentile(training_data[i], 75)
        IQR = Q3 - Q1
        percentile_lower = Q1 - 1.5 * IQR
        percentile_upper = Q3 + 1.5 * IQR
        for j in range(0, len(training_data[i])):
            if training_data[i][j] < percentile_lower:
                training_data[i][j] = percentile_lower
            elif training_data[i][j] >= percentile_upper:
                training_data[i][j] = percentile_upper
PreprocessData()
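
The clipping rule can also be written without the inner loop. The following is a minimal sketch, assuming the values of one column fit in a float array; `clip_outliers_iqr` and its parameter `k` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def clip_outliers_iqr(values, k=1.5):
    """Clip values to the fences Q1 - k * IQR and Q3 + k * IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # np.clip replaces anything outside the fences with the nearest fence
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)
```

For example, for the values 1, 2, 3, 4, 100 the quartiles are 2 and 4, so the fences are -1 and 7 and only the last value changes (to 7).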

### Step 3: Implement Regression
> Hint: You can use Matrix Inversion, or Gradient Descent to finish this part




In [156]:
def Regression(temp, case):
    global autoregression, max_degree

    M = []
    # Matrix Inversion
    if autoregression != 0:
        for i in range(autoregression, len(temp)):
            phi = [1, temp[i]]
            for j in range(1, autoregression + 1):
                phi.append(case[i - j])             # phi = [1, temp[i], case[i - 1], ...]
            M.append(phi)
        coef = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.transpose(M), M)), np.transpose(M)), case[autoregression: ])
    else:
        for cell in temp:
            phi = []
            for i in range(0, max_degree + 1):
                phi.append(cell ** i)               # phi = [1, temp, temp^2, ...]
            M.append(phi)
        # w = (phi^T * phi)^-1 * phi^T * y
        coef = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.transpose(M), M)), np.transpose(M)), case)
    return coef
coef = Regression(training_data[1], training_data[4])
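
The hint for this step also mentions Gradient Descent. Below is a minimal sketch of that alternative for the polynomial case, not the method used in the cell above. The function names, the learning rate `lr`, the iteration count `epochs`, and the standardization of the input are all assumptions made for this illustration; standardizing is a common way to keep the polynomial features on comparable scales so that a fixed learning rate converges.

```python
import numpy as np

def gradient_descent_fit(x, y, degree=2, lr=0.1, epochs=5000):
    """Fit polynomial weights by batch gradient descent on the mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize x (assumption: x is not constant, so std > 0)
    mean, std = x.mean(), x.std()
    z = (x - mean) / std
    phi = np.vander(z, degree + 1, increasing=True)    # columns: 1, z, z^2, ...
    w = np.zeros(degree + 1)
    for _ in range(epochs):
        grad = 2 * phi.T @ (phi @ w - y) / len(y)      # gradient of the MSE
        w -= lr * grad
    return w, mean, std

def gradient_descent_predict(w, mean, std, x):
    """Evaluate the fitted polynomial on new inputs, reusing the training scaling."""
    z = (np.asarray(x, dtype=float) - mean) / std
    return np.vander(z, len(w), increasing=True) @ w
```

Note that the returned weights refer to the standardized variable `z`, so they differ from the coefficients that `Regression` returns for the raw temperature.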

### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [157]:
def MakePrediction(coef, temp, case):
    global autoregression, max_degree

    result = []
    for i in range(0, len(temp)):
        if autoregression != 0:
            predict = coef[0] + coef[1] * temp[i]
            # TODO: start i at the first entry that has enough history (i >= autoregression),
            # and use result[i - j] instead of case[i - j] to forecast recursively
            for j in range(1, autoregression + 1):
                predict += coef[j + 1] * case[i - j]    # coef = [bias, temp, lag 1, lag 2, ...]
        else:
            predict = 0
            for j in range(0, max_degree + 1):
                predict += coef[j] * (temp[i] ** j)
        result.append(predict)
    return result[autoregression: ]
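
When `autoregression` is non-zero and the true case counts of the forecast weeks are unknown, one option is to feed each prediction back in as the lagged value for the following weeks. The sketch below illustrates that idea under the coefficient layout `[bias, temp, lag 1, lag 2, ...]` produced by `Regression`. The name `recursive_forecast` and its parameters are hypothetical, and `history` is assumed to hold at least `lags` known case counts, oldest first.

```python
def recursive_forecast(coef, temp, history, lags):
    """Forecast one value per entry in temp, reusing earlier forecasts as lagged inputs."""
    cases = list(history)              # known case counts, oldest first
    forecast = []
    for t in temp:
        predict = coef[0] + coef[1] * t
        for j in range(1, lags + 1):
            # cases[-j] is the value j steps back, which may itself be a forecast
            predict += coef[j + 1] * cases[-j]
        cases.append(predict)
        forecast.append(predict)
    return forecast
```

Because errors in early forecasts propagate into later ones, it is worth comparing this against the non-autoregressive model on the validation weeks.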

### Step 4.1: Utility Functions
Utility functions for visualizations and monitoring errors

In [159]:
def MAPE(coef, id):
    global autoregression

    if autoregression != 0:
        # Prepend the last `autoregression` training weeks, in chronological order,
        # so that the first validation week has the lagged values it needs
        temp = list(training_data[id][-autoregression: ]) + list(validation_data[id])
        case = list(training_data[id + 3][-autoregression: ]) + list(validation_data[id + 3])
    else:
        temp = validation_data[id]
        case = validation_data[id + 3]
    predict = MakePrediction(coef, temp, case)
    actual = case[autoregression: ]     # Drop the warm-up entries so actual aligns with predict

    # Mean absolute percentage error (undefined when an actual value is 0)
    error = 0
    for i in range(0, len(predict)):
        error = error + abs((actual[i] - predict[i]) / actual[i])
    return error / len(predict) * 100
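
For reference, the error metric itself can be checked in isolation. This is a minimal vectorized sketch of the mean absolute percentage error; the lowercase name `mape` is hypothetical, and it assumes that no actual value is 0.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)
```

For actual values 100 and 200 with predictions 110 and 180, both relative errors are 10%, so the result is approximately 10.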

### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**; otherwise, 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be: 
```
3 2 1
```
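
One possible shape for this step is sketched below, under several assumptions: a polynomial model per city, a least-squares fit through `np.linalg.lstsq` in place of the explicit matrix inversion, and hypothetical helper names (`fit_polynomial`, `format_coefficients`, `build_output`) that are not part of the template. It prints coefficients from the highest degree down, matching the format described above, and assembles rows of the form `[epiweek, CityA, CityB, CityC]`.

```python
import numpy as np

def fit_polynomial(temp, case, degree=2):
    """Least-squares polynomial fit; returns coefficients, lowest degree first."""
    phi = np.vander(np.asarray(temp, dtype=float), degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(phi, np.asarray(case, dtype=float), rcond=None)
    return coef

def format_coefficients(coef):
    """Format coefficients from the highest degree to the constant term."""
    return " ".join(f"{c:g}" for c in coef[::-1])

def build_output(epiweeks, temps_by_city, coefs_by_city):
    """Build one row per week: the epiweek followed by one prediction per city."""
    rows = []
    for k, week in enumerate(epiweeks):
        row = [int(week)]
        for temp, coef in zip(temps_by_city, coefs_by_city):
            row.append(sum(c * temp[k] ** p for p, c in enumerate(coef)))
        rows.append(row)
    return rows
```

In the notebook, the resulting rows would be assigned to `output_datalist` before the output file is written.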





## Write the Output File
Write the prediction to output csv
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [158]:
with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    for row in output_datalist:
        writer.writerow(row)
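
Since the expected output is a 10 * 4 matrix, a quick shape check before writing can catch an unfilled or malformed `output_datalist`. The helper below is a hypothetical sketch and not part of the template.

```python
def validate_output(rows, n_rows=10, n_cols=4):
    """Raise ValueError unless rows has n_rows rows of n_cols cells each."""
    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, got {len(rows)}")
    for index, row in enumerate(rows):
        if len(row) != n_cols:
            raise ValueError(f"row {index} has {len(row)} cells, expected {n_cols}")
    return True
```

Calling `validate_output(output_datalist)` right before the `with open(...)` block would fail fast on an empty list.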

# 2. Advanced Part (35%)
In the second part, you need to implement the regression in a different way than in the basic part to improve your predictions of the number of dengue cases

We provide you with two files **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv** that can help you in this part

Please save the prediction result in a csv file **hw1_advanced.csv** 


# Report *(5%)*

Report should be submitted as a pdf file **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections 
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file! (**hw1.ipynb**)