# **HW1: Regression** 
In *assignment 1*, you need to finish:

1.  Basic Part: Implement the regression model to predict the number of dengue cases


> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts the number of dengue cases in a different way from the basic part

# 1. Basic Part (60%)
In the first part, you need to implement a regression model to predict the number of dengue cases.

Please save the prediction result in a csv file named **hw1_basic.csv**.


## Import Packages

> Note: You **cannot** import any other package in the basic part

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

## Global attributes
Define the global attributes

In [2]:
input_dataroot = 'hw1_basic_input.csv' # Input file named as 'hw1_basic_input.csv'
output_dataroot = 'hw1_basic.csv' # Output file will be named as 'hw1_basic.csv'

input_datalist = [] # Initial datalist, saved as numpy array
output_datalist = [] # Your predictions: a 10 x 4 matrix saved as a numpy array
             # Each row has the format ['epiweek', 'CityA', 'CityB', 'CityC']

You can add your own global attributes here


In [3]:
# Tag
CityA = 1
CityB = 2
CityC = 3

In [4]:
# Training dataset
training_index = [3, 84]
training_dataset = []

# Validation dataset
validation_index = [85, 94]
validation_dataset = []

# Testing dataset
testing_index = [95, 104]
testing_dataset = []

In [5]:
# Empty data: 02, 17, 37, 39, 83
empty_set = [2, 17, 37, 39, 83]

# Unreasonable data:
# A: 7, 17, 21, 56, (84)
invalid_A = [7, 17, 21, 56]
# B: 27, 50, 76
invalid_B = [27, 50, 76]
# C: (1), 23, 36, 55, 72, (82)
invalid_C = [23, 36, 55, 72]

In [6]:
# Coefficient for regression model
W = []
MAPE = []

## Load the Input File
First, load the basic input file **hw1_basic_input.csv**

Input data would be stored in *input_datalist*

In [7]:
# Read input csv to datalist
with open(input_dataroot, newline='') as csvfile:
  input_datalist = np.array(list(csv.reader(csvfile)))

## Implement the Regression Model

> Note: We recommend using the functions defined below, but you can also define your own functions


### Step 1: Split Data
Split the data in *input_datalist* into training, validation, and testing datasets



In [8]:
def SplitData():
  global empty_set, input_datalist
  global training_dataset, validation_dataset, testing_dataset

  # Handle null data
  for i in empty_set:
    input_datalist[i][1] = '0'
    input_datalist[i][2] = '0'
    input_datalist[i][3] = '0'

  # Split training dataset and validation dataset
  training_dataset = []
  for i in range(training_index[0]-2, training_index[1]+1):
    training_dataset.append(input_datalist[i])
 
  validation_dataset = []
  for i in range(validation_index[0]-2, validation_index[1]+1):
    validation_dataset.append(input_datalist[i])

  testing_dataset = []
  for i in range(testing_index[0]-2, testing_index[1]+1):
    testing_dataset.append(input_datalist[i])

  training_dataset = np.array(training_dataset).astype(np.double)
  validation_dataset = np.array(validation_dataset).astype(np.double)
  testing_dataset = np.array(testing_dataset).astype(np.double)

SplitData()
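
The index arithmetic in `SplitData` can be illustrated on a toy array. This is a minimal sketch, assuming (as the code above implies) that row 0 of the input is the header and that each `[first, last]` pair is inclusive. Starting two rows early keeps the two lag rows that the regression needs. The `split_rows` helper and the toy table are hypothetical and not part of the assignment template.

```python
import numpy as np

def split_rows(data, index_pair, lag=2):
    """Return rows (first - lag) .. last, inclusive, of `data`."""
    first, last = index_pair
    return data[first - lag : last + 1]

# Toy table: row 0 stands in for the header, rows 1..104 for the weekly records.
toy = np.arange(105).reshape(-1, 1)

training = split_rows(toy, [3, 84])      # rows 1..84
validation = split_rows(toy, [85, 94])   # rows 83..94 (two lag rows + ten weeks)
testing = split_rows(toy, [95, 104])     # rows 93..104 (two lag rows + ten weeks)

print(training.shape, validation.shape, testing.shape)
```

With this layout, the validation and testing slices each carry twelve rows, of which the first two only supply lag values.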

### Step 2: Preprocess Data
Handle the unreasonable data
> Hint: Outliers and missing data can be handled by removing the affected rows or by filling in values with the help of statistics

In [9]:
def PreprocessData():
  global training_dataset, validation_dataset

  # Handle missing data
  for i in empty_set:
    for j in range(1, 4):
      training_dataset[i-1][j] = (training_dataset[i-1-1][j] + training_dataset[i-1+1][j]) / 2

  # Handle unreasonable data
  for i in invalid_A:
    training_dataset[i-1][1] = (training_dataset[i-1-1][1] + training_dataset[i-1+1][1]) / 2
  
  for i in invalid_B:
    training_dataset[i-1][2] = (training_dataset[i-1-1][2] + training_dataset[i-1+1][2]) / 2

  for i in invalid_C:
    training_dataset[i-1][3] = (training_dataset[i-1-1][3] + training_dataset[i-1+1][3]) / 2

  # Special case
  # A: 83, 84
  training_dataset[83-1][1] = training_dataset[83-1-1][1]
  training_dataset[84-1][1] = training_dataset[83-1-1][1]

  # C: 1, 82, 83
  training_dataset[1-1][3] = training_dataset[2-1][3]
  training_dataset[82-1][3] = (training_dataset[84-1][3] + training_dataset[81-1][3]) / 2
  training_dataset[83-1][3] = (training_dataset[84-1][3] + training_dataset[81-1][3]) / 2

PreprocessData()
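
The neighbor-average rule used in `PreprocessData` can be sketched on a small array. The `fill_with_neighbor_mean` helper and the sample temperatures are hypothetical. The sketch assumes every flagged row is an interior row whose two neighbors hold valid values, which is why runs of adjacent bad rows (such as 83 and 84 above) need separate handling.

```python
import numpy as np

def fill_with_neighbor_mean(column, bad_rows):
    """Replace each flagged entry with the mean of the rows directly before and after it."""
    fixed = np.asarray(column, dtype=float).copy()
    for r in bad_rows:
        fixed[r] = (fixed[r - 1] + fixed[r + 1]) / 2
    return fixed

# Made-up temperatures: index 1 is a missing value stored as 0, index 4 is an outlier.
temps = np.array([20.0, 0.0, 22.0, 23.0, 99.0, 25.0])
cleaned = fill_with_neighbor_mean(temps, bad_rows=[1, 4])
print(cleaned)  # [20. 21. 22. 23. 24. 25.]
```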

### Step 3: Implement Regression
> Hint: You can use Matrix Inversion, or Gradient Descent to finish this part




In [10]:
def Regression(tag): # A: tag = 1; B: tag = 2; C: tag = 3
  Phi = []
  Y = []

  # Each row of Phi is (1, X(n), Y(n-1), Y(n-2)); no X^2 term is used
  for i in range(2, len(training_dataset)):
    Phi.append(np.array([1, training_dataset[i][tag], training_dataset[i-1][tag+3], training_dataset[i-2][tag+3]]))
    Y.append(training_dataset[i][tag+3])
  
  Phi = np.array(Phi)
  Y = np.array(Y)
  Phi_t = Phi.transpose()

  W = np.matmul(Phi_t, Phi)
  W = np.linalg.inv(W)
  W = np.matmul(W, Phi_t)
  W = np.matmul(W, Y)

  return W

for i in range(1, 4):
  W.append(Regression(i))
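
The matrix-inversion route named in the hint solves the normal equation W = (Phi^T Phi)^-1 Phi^T Y. As a sanity check, the sketch below fits made-up noiseless data (not the assignment data) both with the explicit inverse and with `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)

# Design matrix with an intercept column and one feature.
Phi = np.column_stack([np.ones(50), x])
y = 2.0 + 3.0 * x  # noiseless target, so the true weights are (2, 3)

# Normal equation, as in Regression() above.
w_normal = np.linalg.inv(Phi.T @ Phi) @ Phi.T @ y

# Least-squares solver; rcond=None selects NumPy's current default cutoff.
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Both routes should agree when `Phi` has full column rank; the solver is usually the safer choice when columns are nearly collinear.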

### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*

In [11]:
def MakePrediction(): # A: tag = 1; B: tag = 2; C: tag = 3
  global testing_dataset

  for i in range(2, len(testing_dataset)):
    for tag in range(1, 4):
      testing_dataset[i][tag+3] = W[tag-1][0] + W[tag-1][1] * testing_dataset[i][tag] + W[tag-1][2] * testing_dataset[i-1][tag+3] + W[tag-1][3] * testing_dataset[i-2][tag+3]

MakePrediction()
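
Because the model uses the two previous case counts as features, `MakePrediction` is recursive: after the first predicted row, the lag terms come from earlier predictions rather than observed values. A minimal sketch of that roll-forward, with a hypothetical `forecast` helper and made-up weights:

```python
import numpy as np

def forecast(w, x, y_prev1, y_prev2):
    """Roll y_t = w0 + w1*x_t + w2*y_{t-1} + w3*y_{t-2} forward over the sequence x."""
    preds = []
    for x_t in x:
        y_t = w[0] + w[1] * x_t + w[2] * y_prev1 + w[3] * y_prev2
        preds.append(y_t)
        # Shift the lags: the new prediction becomes y_{t-1} for the next step.
        y_prev2, y_prev1 = y_prev1, y_t
    return np.array(preds)

w = np.array([1.0, 0.0, 0.5, 0.0])  # made-up weights
preds = forecast(w, x=[0, 0, 0], y_prev1=4.0, y_prev2=0.0)
print(preds)
```

One consequence of this design is that errors in early predictions propagate into later ones, so accuracy tends to degrade over the forecast horizon.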

### Step 5: Train Model and Generate Result

> Notice: **Remember to output the coefficients of the model here**, otherwise 5 points will be deducted
* If your regression model is *3x^2 + 2x^1 + 1*, your output would be: 
```
3 2 1
```





In [12]:
def Validation(tag):
  # A: tag = 1
  # B: tag = 2
  # C: tag = 3
  global validation_dataset

  predict = []
  for i in range(2, len(validation_dataset)): 
    predict.append(W[tag-1][0] + W[tag-1][1] * validation_dataset[i][tag] + W[tag-1][2] * validation_dataset[i-1][tag+3] + W[tag-1][3] * validation_dataset[i-2][tag+3])
  predict = np.array(predict)

  # Local accumulator, named differently from the global MAPE list to avoid confusion
  mape = 0
  for i in range(2, len(validation_dataset)):
    mape += abs((validation_dataset[i][tag+3] - predict[i-2]) / validation_dataset[i][tag+3])
  mape /= len(validation_dataset) - 2

  return mape
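
The validation metric is the mean absolute percentage error (MAPE): the average of |actual - predicted| / |actual|. A vectorized sketch with a hypothetical `mape` helper and made-up numbers follows. Note that the metric is undefined when an actual value is zero.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error; assumes no entry of `actual` is zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual))

# Both predictions are off by 10% of the actual value.
score = mape([100, 200], [110, 180])
print(score)
```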

In [13]:
for i in range(1, 4):
  MAPE.append(Validation(i))

print("MAPE for City A: %f" % MAPE[CityA-1])
print("MAPE for City B: %f" % MAPE[CityB-1])
print("MAPE for City C: %f" % MAPE[CityC-1])
print("-----")
print("Average MAPE: %f" %((MAPE[CityA-1] + MAPE[CityB-1] + MAPE[CityC-1]) / 3))

MAPE for City A: 0.146862
MAPE for City B: 0.224406
MAPE for City C: 0.161006
-----
Average MAPE: 0.177425


In [14]:
print("Regression model: Y_t = W_0 + W_1 * X_t + W_2 * Y_t-1 + W_3 * Y_t-2\n")

print("Coefficients for regression model of City A:")
for i in W[CityA-1]:
  print(i, end = "\t")

print("\n")

print("Coefficients for regression model of City B:")
for i in W[CityB-1]:
  print(i, end = "\t")

print("\n")

print("Coefficients for regression model of City C:")
for i in W[CityC-1]:
  print(i, end = "\t")

Regression model: Y_t = W_0 + W_1 * X_t + W_2 * Y_t-1 + W_3 * Y_t-2

Coefficients for regression model of City A:
23.342572396786203	-0.7234345235786304	0.6925538719235256	0.18574610317073295	

Coefficients for regression model of City B:
25.332426171345144	-0.826650663174446	0.4373583768396194	0.3262980794417698	

Coefficients for regression model of City C:
0.5821133183862983	0.039517616080825574	0.9069971855186248	0.053177339810176125	

## Write the Output File
Write the prediction to output csv
> Format: 'epiweek', 'CityA', 'CityB', 'CityC'

In [15]:
output_datalist = []
for i in range(2, len(testing_dataset)):
  output_datalist.append(np.array([int(testing_dataset[i][0]), testing_dataset[i][4], testing_dataset[i][5], testing_dataset[i][6]], dtype = '<U12'))

with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in output_datalist:
    writer.writerow(row)
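
The cell above converts each row with `dtype = '<U12'`, which cuts every value to at most 12 characters. An alternative sketch formats the numbers explicitly before writing. The four-decimal format and the sample row are assumptions, since the assignment does not state a required precision, and the sketch writes to an in-memory buffer instead of a file so it can run anywhere.

```python
import csv
import io

# Made-up row in the required order: epiweek, CityA, CityB, CityC.
rows = [(202101, 12.3456789, 7.0, 0.5)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for epiweek, a, b, c in rows:
    # Keep the epiweek as an integer and round predictions to four decimals.
    writer.writerow([int(epiweek), f"{a:.4f}", f"{b:.4f}", f"{c:.4f}"])

print(buffer.getvalue())
```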

# 2. Advanced Part (35%)
In the second part, you need to implement the regression in a different way from the basic part to improve your predictions of the number of dengue cases

We provide you with two files **hw1_advanced_input1.csv** and **hw1_advanced_input2.csv** that can help you in this part

Please save the prediction result in a csv file **hw1_advanced.csv** 


In [16]:
advanced_input1_dataroot = 'hw1_advanced_input1.csv'
advanced_input1_datalist = []

advanced_input2_dataroot = 'hw1_advanced_input2.csv'
advanced_input2_datalist = []

advanced_output_dataroot = 'hw1_advanced.csv'
advanced_output_datalist = []

In [17]:
# Coefficient for advanced regression model
W_Advanced = []
MAPE_Advanced = []

num_of_factor = 0
factor_set = []

In [18]:
# Read input csv to datalist
with open(advanced_input1_dataroot, newline='') as csvfile:
  advanced_input1_datalist = np.array(list(csv.reader(csvfile)))

with open(advanced_input2_dataroot, newline='') as csvfile:
  advanced_input2_datalist = np.array(list(csv.reader(csvfile)))

In [19]:
advanced_factor1 = []
for i in range(1, len(advanced_input1_datalist)):
  advanced_factor1.append([advanced_input1_datalist[i][0], advanced_input1_datalist[i][1], advanced_input1_datalist[i][2], advanced_input1_datalist[i][3]])
advanced_factor1 = np.array(advanced_factor1).astype(np.double)

advanced_factor2 = []
for i in range(1, 4):
  tmp = []
  for j in range(1, 26):
    tmp.append(advanced_input2_datalist[i][j])
  advanced_factor2.append(tmp)
advanced_factor2 = np.array(advanced_factor2).astype(np.double)

In [20]:
def RegressionAdvanced(): 
  Phi = []
  Y = []

  # Phi (1, X, Y(n-1), Y(n-2), ...)
  for i in range(2, len(training_dataset)):
    for j in range(1, 4):
      tmp = []
      tmp.append(1); tmp.append(training_dataset[i][j])
      tmp.append(training_dataset[i-1][j+3]); tmp.append(training_dataset[i-2][j+3]); tmp.append(advanced_factor1[i+training_index[0]-2-1][j])

      for k in factor_set:
        tmp.append(advanced_factor2[j-1][k])

      Phi.append(np.array(tmp))
      Y.append(training_dataset[i][j+3])
  
  Phi = np.array(Phi)
  Y = np.array(Y)
  Phi_t = Phi.transpose()

  W = np.matmul(Phi_t, Phi)
  W = np.linalg.inv(W)
  W = np.matmul(W, Phi_t)
  W = np.matmul(W, Y)

  return W
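
One property of this pooled design is worth checking. The columns taken from `advanced_factor2` depend only on the city, so together with the intercept they span at most three dimensions when there are three cities. Selecting three or more such factors therefore appears to make `Phi` rank-deficient, and inverting `Phi_t @ Phi` becomes numerically unreliable. The sketch below uses made-up numbers (not the assignment data) to show how `np.linalg.matrix_rank` exposes the problem and how `np.linalg.lstsq` still returns a minimum-norm fit.

```python
import numpy as np

n_weeks, n_cities = 4, 3
# Made-up city-level factors: one row per city, one column per factor.
city_factors = np.array([[1.0, 4.0, 7.0],
                         [2.0, 6.0, 1.0],
                         [3.0, 5.0, 2.0]])

rows = []
for week in range(n_weeks):
    for city in range(n_cities):
        x = float(week * n_cities + city)  # a feature that varies by week
        rows.append([1.0, x, *city_factors[city]])
Phi = np.array(rows)
y = Phi @ np.array([1.0, 0.5, 0.0, 0.0, 0.0])  # made-up target: 1 + 0.5 * x

# Five columns, but the intercept and three city-constant factors overlap.
rank = np.linalg.matrix_rank(Phi)
print(Phi.shape[1], rank)

# lstsq tolerates the rank deficiency and reports the effective rank.
w, residuals, lstsq_rank, _ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(Phi @ w, y))
```

If the rank is lower than the number of columns, the individual coefficients are not uniquely determined even when the fitted values are.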

In [21]:
def AdvancedValidation(printMAPE = False):
  global validation_dataset

  predict = []
  for i in range(2, len(validation_dataset)):
    tmp = []
    for j in range(1, 4):
      value = W_Advanced[0] + W_Advanced[1] * validation_dataset[i][j] + \
            W_Advanced[2] * validation_dataset[i-1][j+3] + W_Advanced[3] * validation_dataset[i-2][j+3] + W_Advanced[4] * advanced_factor1[i+validation_index[0]-2-1][j]

      for k in range(num_of_factor):
        value += W_Advanced[5+k] * advanced_factor2[j-1][factor_set[k]]

      tmp.append(value)

    predict.append(tmp)

  predict = np.array(predict)

  MAPE_A = 0
  for i in range(2, len(validation_dataset)):
    MAPE_A += abs((validation_dataset[i][4] - predict[i-2][0]) / validation_dataset[i][4])

  MAPE_A /= len(validation_dataset) - 2

  MAPE_B = 0
  for i in range(2, len(validation_dataset)):
    MAPE_B += abs((validation_dataset[i][5] - predict[i-2][1]) / validation_dataset[i][5])

  MAPE_B /= len(validation_dataset) - 2

  MAPE_C = 0
  for i in range(2, len(validation_dataset)):
    MAPE_C += abs((validation_dataset[i][6] - predict[i-2][2]) / validation_dataset[i][6])

  MAPE_C /= len(validation_dataset) - 2

  if printMAPE:
    print("MAPE for City A: %f" % MAPE_A)
    print("MAPE for City B: %f" % MAPE_B)
    print("MAPE for City C: %f" % MAPE_C)
    print("-----")
    print("Average MAPE: %f" %((MAPE_A + MAPE_B + MAPE_C) / 3))

  return (MAPE_A + MAPE_B + MAPE_C) / 3

In [22]:
# Train model
# Model with advanced_input1 only
W_Advanced = RegressionAdvanced()
MAPE_prev = AdvancedValidation()

# Model with advanced_input1 and advanced_input2
num_of_factor = 1
factor_set = [0]

W_Advanced = RegressionAdvanced()
MAPE_prev = AdvancedValidation()

for i in range(1, 25):
  num_of_factor += 1
  factor_set.append(i)

  W_Advanced = RegressionAdvanced()

  MAPE_curr = AdvancedValidation()

  if MAPE_curr < MAPE_prev:
    MAPE_prev = MAPE_curr
  else:
    num_of_factor -= 1
    factor_set.pop()  


print("Regression model")  

W_Advanced = RegressionAdvanced()
MAPE_prev = AdvancedValidation(True)

Regression model
MAPE for City A: 0.143974
MAPE for City B: 0.239281
MAPE for City C: 0.134014
-----
Average MAPE: 0.172423
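
The training loop above is a greedy forward selection: each candidate factor is added, the model is refit, and the factor is kept only if the validation MAPE drops. The sketch below shows the same pattern with a hypothetical `forward_select` helper and a toy scoring function. Unlike the cell above, which always keeps factor 0, this sketch starts from an empty set.

```python
def forward_select(candidates, score, initial=()):
    """Greedy forward selection: keep a candidate only if it lowers the score."""
    selected = list(initial)
    best = score(selected)
    for c in candidates:
        trial = selected + [c]
        s = score(trial)
        if s < best:
            selected, best = trial, s
    return selected, best

def toy_score(subset):
    """Hypothetical score: distance between the subset sum and a target of 10."""
    return abs(10 - sum(subset))

chosen, final = forward_select([4, 7, 5, 1], toy_score)
print(chosen, final)
```

Greedy selection depends on the order in which candidates are tried and is not guaranteed to find the best subset. Because the score here is the validation error, the selected factors are also tuned to the validation weeks.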


In [23]:
# Factor title
factor_title = advanced_input2_datalist[0]

print("Advanced regression model: Y_t = W_0 + W_1 * X_t + W_2 * Y_t-1 + W_3 * Y_t-2 + W_4 * Precipitation")
index = 5
for i in factor_set:
  print(" + W_%d * %s" % (index, factor_title[i+1]), end = '')
  index += 1
print()

for i in W_Advanced:
  print(i, end = '\t')

Advanced regression model: Y_t = W_0 + W_1 * X_t + W_2 * Y_t-1 + W_3 * Y_t-2 + W_4 * Precipitation
 + W_5 * Population + W_6 * Age0-4(%) + W_7 * Age15-29(%) + W_8 * Peopledoinghousework(%)
83.22373383723895	-0.734040786536104	0.686752431335002	0.18762552098816643	0.08068486948831305	-3.431517918796645e-06	-1.6811586299782215	-1.9463103259693504	0.3213397684964263	

In [24]:
def AdvancedMakePrediction():
  global testing_dataset

  for i in range(2, len(testing_dataset)):
    for j in range(1, 4):
      testing_dataset[i][j+3] = W_Advanced[0] + W_Advanced[1] * testing_dataset[i][j] + \
                      W_Advanced[2] * testing_dataset[i-1][j+3] + W_Advanced[3] * testing_dataset[i-2][j+3] + W_Advanced[4] * advanced_factor1[i+testing_index[0]-2-1][j]
      for k in range(num_of_factor):
          testing_dataset[i][j+3] += W_Advanced[5+k] * advanced_factor2[j-1][factor_set[k]]

AdvancedMakePrediction()

In [25]:
advanced_output_datalist = []
for i in range(2, len(testing_dataset)):
  advanced_output_datalist.append(np.array([int(testing_dataset[i][0]), testing_dataset[i][4], testing_dataset[i][5], testing_dataset[i][6]], dtype = '<U12'))

with open(advanced_output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  for row in advanced_output_datalist:
    writer.writerow(row)

# Report *(5%)*

The report should be submitted as a PDF file, **hw1_report.pdf**

*   Briefly describe the difficulties you encountered
*   Summarize your work and your reflections
*   No more than one page






# Save the Code File
Please save your code and submit it as an ipynb file (**hw1.ipynb**).