# **Lab1: Regression**
In *lab 1*, you need to finish:

1.  Basic Part: Implement the gradient descent regression model to predict people's grip force from their weight.



> *   Step 1: Split Data
> *   Step 2: Preprocess Data
> *   Step 3: Implement Regression
> *   Step 4: Make Prediction
> *   Step 5: Train Model and Generate Result

2.  Advanced Part: Implement a regression model that predicts grip force in a different way from the basic part (for example, with more variables).




---
# 1. Basic Part (50%)
In the first part, you need to implement a regression model that predicts grip force.

Please save the prediction result in a CSV file and submit it to Kaggle

### Import Packages

> Note: You **cannot** import any other package


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import csv
import math
import random

### Global attributes
Define the global attributes\
You can also add your own global attributes here

In [2]:
training_dataroot = 'lab1_basic_training.csv' # Training data file named as 'lab1_basic_training.csv'
testing_dataroot = 'lab1_basic_testing.csv'   # Testing data file named as 'lab1_basic_testing.csv'
output_dataroot = 'lab1_basic.csv' # Output file will be named as 'lab1_basic.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be a list with 100 elements

### Load the Input File
First, load the basic input files **lab1_basic_training.csv** and **lab1_basic_testing.csv**.

The input data is stored in *training_datalist* and *testing_datalist*.

In [3]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
  training_datalist = pd.read_csv(training_dataroot).to_numpy()

with open(testing_dataroot, newline='') as csvfile:
  testing_datalist = pd.read_csv(testing_dataroot).to_numpy()

### Implement the Regression Model

> Note: We recommend using the functions defined below, but you may also define your own.

#### Step 1: Split Data
Split data in *training_datalist* into training dataset and validation dataset
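A plain slice keeps the rows in file order, which can bias the split if the file is sorted. As a minimal sketch, assuming that shuffling is acceptable for this lab, the rows can be permuted before slicing. The function name `shuffled_split`, the `seed` parameter, and the toy array are hypothetical and not part of the lab template.

```python
import numpy as np

def shuffled_split(data, split_ratio, seed=0):
    """Hypothetical variant of SplitData that shuffles rows before splitting."""
    rng = np.random.default_rng(seed)      # fixed seed keeps the split reproducible
    indices = rng.permutation(len(data))   # random ordering of the row indices
    cut = int(len(data) * split_ratio)     # number of rows assigned to training
    return data[indices[:cut]], data[indices[cut:]]

toy = np.arange(20, dtype=float).reshape(10, 2)  # 10 made-up rows, 2 columns
train, val = shuffled_split(toy, 0.8)
print(train.shape, val.shape)
```

With 10 rows and a ratio of 0.8, the training part receives 8 rows and the validation part receives the remaining 2.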


In [4]:
def SplitData(data, split_ratio):
    """
    Splits the given dataset into training and validation sets based on the specified split ratio.

    Parameters:
    - data (numpy.ndarray): The dataset to be split. It is expected to be a 2D array where each row represents a data point and each column represents a feature.
    - split_ratio (float): The ratio of the data to be used for training. For example, a value of 0.8 means 80% of the data will be used for training and the remaining 20% for validation.

    Returns:
    - training_data (numpy.ndarray): The portion of the dataset used for training.
    - validation_data (numpy.ndarray): The portion of the dataset used for validation.

    """
    training_data = []
    validation_data = []

    # TODO
    training_data = data[:int(len(data)*float(split_ratio))]
    validation_data = data[int(len(data)*float(split_ratio)):]

    return training_data, validation_data



#### Step 2: Preprocess Data
Handle unreasonable data and missing data

> Hint 1: Outliers and missing data can be addressed by either removing them or replacing them using statistical methods (e.g., the mean of all data).

> Hint 2: Missing data are represented as `np.nan`, so functions like `np.isnan()` can be used to detect them.

> Hint 3: Methods such as the Interquartile Range (IQR) can help detect outliers
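
The hints above can be combined in more than one order. The sketch below is one possible ordering on made-up single-column data: it computes the IQR fences while ignoring NaN, fills the missing value with the mean of the inliers only, and then drops the outlier. This way an extreme value does not inflate the imputed mean. The variable names and the sample values are assumptions for illustration.

```python
import numpy as np

# Made-up column with one missing value and one obvious outlier (500.0)
values = np.array([60.0, 62.0, np.nan, 65.0, 61.0, 500.0, 63.0, 64.0])

# IQR fences computed on the observed (non-NaN) values only
q1, q3 = np.nanpercentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_nan = np.isnan(values)
in_range = (values >= lower) & (values <= upper)  # NaN compares as False here

# Fill missing entries with the mean of the inliers, then drop the outliers
inlier_mean = values[in_range].mean()
cleaned = np.where(is_nan, inlier_mean, values)
cleaned = cleaned[is_nan | in_range]
print(cleaned)
```

Here the fences are 57 and 69, so 500.0 is removed and the missing entry becomes 62.5, the mean of the six inliers.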

In [82]:
def PreprocessData(data):
    """
    Preprocess the given dataset and return the result.

    Parameters:
    - data (numpy.ndarray): The dataset to preprocess. It is expected to be a 2D array where each row represents a data point and each column represents a feature.

    Returns:
    - preprocessedData (numpy.ndarray): Preprocessed data.
    """
    preprocessedData = np.copy(data)
    print(type(preprocessedData))
    # TODO
    # Handle missing data by replacing NaN values with the mean of the respective column
    col_means = np.nanmean(preprocessedData, axis=0)
    print(col_means)
    inds = np.where(np.isnan(preprocessedData))
    preprocessedData[inds] = np.take(col_means, inds[1])

    # Detect and remove outliers using the Interquartile Range (IQR) method
    Q1 = np.percentile(preprocessedData, 25, axis=0)
    Q3 = np.percentile(preprocessedData, 75, axis=0)
    IQR = Q3 - Q1

    # Define bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # print('=====================')
    # print(lower_bound)
    # print(upper_bound)

    # Mask to filter out outliers
    mask = np.all((preprocessedData >= lower_bound) & (preprocessedData <= upper_bound), axis=1)

    # Only keep data that is not an outlier
    preprocessedData = preprocessedData[mask]
    # print(preprocessedData)

    return preprocessedData


#### Step 3: Implement Regression
You have to use Gradient Descent to finish this part
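Gradient descent on raw features often needs a very small learning rate and many iterations. As a hedged illustration (not the required solution), the sketch below standardizes a single made-up feature first, runs batch gradient descent with a moderate learning rate, and then converts the weights back to the original units. The synthetic data, the learning rate of 0.1, and the 500 iterations are assumptions chosen for this toy case.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(40, 100, size=200)                 # made-up "weight" values
y = 5.0 + 0.5 * x + rng.normal(0, 1.0, size=200)   # made-up linear target with noise

# Standardize the feature so that a moderate learning rate converges quickly
mu, sigma = x.mean(), x.std()
X = np.column_stack((np.ones_like(x), (x - mu) / sigma))

w = np.zeros(2)
learning_rate = 0.1
for _ in range(500):
    error = X @ w - y                  # prediction minus target
    gradient = X.T @ error / len(y)    # gradient of the half mean squared error
    w -= learning_rate * gradient

# Convert the weights back to the original (unscaled) units
slope = w[1] / sigma
intercept = w[0] - slope * mu
print(intercept, slope)
```

On this toy data the recovered slope and intercept agree with the closed-form least-squares fit, which is a useful sanity check before tuning hyperparameters on the real dataset.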

In [215]:
def Regression(dataset):
    """
    Performs regression on the given dataset and return the coefficients.

    Parameters:
    - dataset (numpy.ndarray): A 2D array where each row represents a data point.

    Returns:
    - w (numpy.ndarray): The coefficients of the regression model. For example, y = w[0] + w[1] * x + w[2] * x^2 + ...
    """

    X = dataset[:, :-1]
    y = dataset[:, -1]

    # TODO: Decide on the degree of the polynomial
    # `degree` is global so that MakePredictionAdvanced can build the same features
    global degree
    degree = 2  # Quadratic regression
    # Add polynomial features to X
    X_poly = np.ones((X.shape[0], 1))  # Add intercept term (column of ones)
    for d in range(1, degree + 1):
        X_poly = np.hstack((X_poly, X ** d))  # Append the d-th power of every feature
    # Resulting column order: 1, x1, x2, x3, ..., x1^2, x2^2, x3^2, ...

    # Initialize coefficients (weights) to zero
    num_dimensions = X_poly.shape[1]  # Number of features (including intercept and polynomial terms)
    w = np.zeros(num_dimensions)

    # TODO: Set hyperparameters
    num_iteration = 2000000  # Not used by the while loop below; kept from the earlier for-loop version (basic: 1000000, advanced: 2000000)
    learning_rate = 1.6e-9  # Needs to be small enough because the features are not scaled (basic: 1e-8, advanced: 1e-9 to 1.6e-9)
    # With degree = 3 and learning_rate = 1e-14, performance decreased


    # Gradient Descent
    m = len(y)  # Number of data points
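    cost = 100  # Initial value above the threshold so that the loop starts
    iteration = 0
    # for iteration in range(num_iteration):
    # Stop once the cost drops below the target; the loop does not end if the target is unreachable
    while cost >= 23: # 23.5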
        # TODO: Prediction using current weights and compute error
        # print(X_poly.shape)
        y_pred = X_poly.dot(w)
        error = y - y_pred  # y_pred = X_poly.dot(w)
        # print(error.shape)
        # print(error)
        # TODO: Compute gradient
        gradient = (-1 / m) * X_poly.T.dot(error)  # 1/m
        # print(X_poly.T.dot(error))
        # print(gradient)
        # TODO: Update the weights
        w -= learning_rate * gradient

        # TODO: Optionally, print the cost every 200 iterations
        if iteration % 200 == 0:
            cost = (1 / (2 * m)) * np.sum(error ** 2)  # Half the mean squared error
            print(f"Iteration {iteration}, Cost: {cost}")
        iteration += 1

    return w


#### Step 4: Make Prediction
Make predictions for the testing dataset and store the values in *output_datalist*
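For a single input feature, the polynomial y = w[0] + w[1] * x + w[2] * x^2 + ... can be evaluated for all inputs at once instead of looping. The sketch below uses `np.vander` with `increasing=True` to build the columns x^0, x^1, x^2 and multiplies by the coefficients. The coefficients and inputs are made up, and this shortcut only applies to the one-feature case.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])   # made-up coefficients: y = 1 + 2x + 3x^2
x = np.array([0.0, 1.0, 2.0])   # made-up inputs

# Columns are x^0, x^1, x^2, matching the order of the coefficients in w
powers = np.vander(x, N=len(w), increasing=True)
prediction = powers @ w
print(prediction)
```

The three predictions are 1, 6 and 17, the same values a per-element loop over the powers of x would give.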

In [13]:
def MakePrediction(w, test_dataset):
    """
    Predicts the output for a given test dataset using a regression model.

    Parameters:
    - w (numpy.ndarray): The coefficients of the model, where each element corresponds to
                               a coefficient for the respective power of the independent variable.
    - test_dataset (numpy.ndarray): A 2D array whose first column contains the input values (independent variable)
                                          for which predictions are to be made.

    Returns:
    - list/numpy.ndarray: A list or 1d array of predicted values corresponding to each input value in the test dataset.
    """
    prediction = []

    # TODO
    # Iterate over each value in the test dataset
    for x in test_dataset[:, 0]:
        # Initialize the predicted value with the intercept term (w[0])
        pred_value = w[0]

        # Add the contribution of each polynomial term w[i] * x^i (i=1, 2, 3, ...)
        for i in range(1, len(w)):
            pred_value += w[i] * (x ** i)

        # Append the predicted value to the list
        prediction.append(pred_value)

    # Return the predictions as a NumPy array
    # print(np.array(prediction))
    return np.array(prediction)


#### Step 5: Train Model and Generate Result

Use the above functions to train your model on the training dataset and predict the answers for the testing dataset.

Save your predicted values in `output_datalist`
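Before predicting the testing dataset, the validation predictions are compared with the ground truth using the mean absolute percentage error (MAPE). A minimal helper, assuming that no ground-truth value is zero, might look like this. The function name `mape` and the sample numbers are hypothetical.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes y_true has no zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Both made-up predictions are off by 10% of the true value
score = mape([50.0, 40.0], [45.0, 44.0])
print(score)
```

A MAPE far above 100 usually points to a mismatch between the features used for training and those used for prediction rather than to a weak model.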

> Notice: **Remember to include the coefficients of your model in the report**



In [212]:
# TODO

# (1) Split data
training_data, validation_data = SplitData(training_datalist, 0.7)
# (2) Preprocess data
training_data = training_data.astype(float)
processed_training_data = PreprocessData(training_data)
validation_data = validation_data.astype(float)
processed_validation_data = PreprocessData(validation_data)
# print(processed_training_data)
# (3) Train regression model
w = Regression(processed_training_data)

# (4) Predict validation dataset's answer, calculate MAPE comparing to the ground truth
validation_data_prediction = MakePrediction(w, processed_validation_data)
y = processed_validation_data[:, -1]
# print(y)
predicted = validation_data_prediction
# print(y.shape)
# print(predicted.shape)
mape = np.mean(np.abs((y - predicted) / y)) * 100
print(mape)

# (5) Make prediction of testing dataset and store the values in output_datalist
output_datalist = MakePrediction(w, testing_datalist)

<class 'numpy.ndarray'>
[ 36.81475622   0.62214286 168.57006333  68.66897148  23.22284721
  78.81020731 130.30244604  49.23635826]
<class 'numpy.ndarray'>
[ 36.90618695   0.59966667 168.30996979  68.51182184  23.44415427
  78.67538565 130.38012089  49.7166716 ]
Iteration 0, Cost: 1063.887218810601
Iteration 200, Cost: 45.45024333894041
Iteration 400, Cost: 45.076642518819625
Iteration 600, Cost: 44.868384944612416
Iteration 800, Cost: 44.724898598142225
Iteration 1000, Cost: 44.61592765490965
Iteration 1200, Cost: 44.52834483282187
Iteration 1400, Cost: 44.45528373834482
Iteration 1600, Cost: 44.39270964837677
Iteration 1800, Cost: 44.33805397736137
1.5149026377781758e+21


### *Write the Output File*

Write the prediction to output csv and upload the file to Kaggle
> Format: 'Id', 'gripForce'


In [9]:
# Assume that output_datalist is a list (or 1d array) with length = 100

with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['Id', 'gripForce'])
  for i in range(len(output_datalist)):
    writer.writerow([i,output_datalist[i]])


# 2. Advanced Part (45%)
In the second part, you need to implement regression differently from the basic part to improve your grip force predictions. You must use more than two features.

You can choose either matrix inversion or gradient descent for this part
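As a rough sketch of the closed-form route, the weights that minimize the squared error can be obtained by solving the least-squares problem directly. The example below uses `np.linalg.lstsq` rather than explicitly inverting X^T X, which gives the same solution when the matrix is well conditioned and is numerically safer. The synthetic features and the `true_w` vector are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                  # made-up feature matrix, 3 features
true_w = np.array([2.0, -1.0, 0.5, 3.0])       # made-up intercept and coefficients

A = np.column_stack((np.ones(len(X)), X))      # prepend the intercept column
y = A @ true_w                                 # noise-free target for the demo

# Solve min ||A w - y||^2 without forming an explicit matrix inverse
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)
```

Because the toy target has no noise, the solved weights match `true_w` up to floating-point error.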

We have provided `lab1_advanced_training.csv` for your training

> Notice: Be cautious of the "gender" attribute, as it is represented by "F"/"M" rather than a numerical value.
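
One way to handle the "F"/"M" values is to map them to numbers while the data is still a pandas DataFrame, before converting to NumPy. The sketch below assumes a column named `gender`; the other column names and all values are made up. Rows with a missing gender are dropped here, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the "gender" column name is assumed to match the real file
df = pd.DataFrame({
    "age": [25, 31, 40, 22],
    "gender": ["F", "M", np.nan, "M"],
    "gripForce": [30.1, 45.2, 38.0, 50.3],
})

df["gender"] = df["gender"].map({"F": 0, "M": 1})  # missing values stay NaN
df = df.dropna(subset=["gender"])                  # drop rows without a gender
data = df.to_numpy(dtype=float)                    # fully numeric array
print(data.shape)
```

After encoding, the array is purely numeric, so it can be passed to the same preprocessing and regression functions as the basic part.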

Please save the prediction result in a CSV file and submit it to Kaggle

In [159]:
training_dataroot = 'lab1_advanced_training.csv' # Training data file named as 'lab1_advanced_training.csv'
testing_dataroot = 'lab1_advanced_testing.csv'   # Testing data file named as 'lab1_advanced_testing.csv'
output_dataroot = 'lab1_advanced.csv' # Output file will be named as 'lab1_advanced.csv'

training_datalist =  [] # Training datalist, saved as numpy array
testing_datalist =  [] # Testing datalist, saved as numpy array

output_datalist =  [] # Your prediction, should be a list with 3000 elements

In [160]:
# Read input csv to datalist
with open(training_dataroot, newline='') as csvfile:
  training_datalist = pd.read_csv(training_dataroot).to_numpy()

with open(testing_dataroot, newline='') as csvfile:
  testing_datalist = pd.read_csv(testing_dataroot).to_numpy()

In [161]:
def EncodeGender(data):
    # Drop rows where gender is NaN
    # data.dropna(subset=['gender'], inplace=True)
    # data = data[~np.isnan(data).any(axis=1)]
    # data['gender'] = data['gender'].map({'F': 0, 'M': 1})
    # data = data[~np.isnan(data).any(axis=1)]
    # return data

    # Assuming gender is in the second column (index 1)
    gender_column = data[:, 1]

    # Replace 'F' with 0, 'M' with 1
    gender_column = np.where(gender_column == 'M', 1, gender_column)
    gender_column = np.where(gender_column == 'F', 0, gender_column)

    # If there are missing values (e.g., None or 'nan'), you can handle them like this
    gender_column = np.where(np.isnan(gender_column.astype(float)), -1, gender_column)
    # gender_column = np.where(gender_column == -1, 0.5, gender_column)
    print(gender_column)

    # Assign the modified gender column back to the dataset
    data[:, 1] = gender_column.astype(float)  # Make sure it's numeric

    # Create a mask to filter out rows where gender is -1
    mask = gender_column != -1

    # Use the mask to filter the data
    data = data[mask]
    # print(data)
    data = data.astype(float)

    return data

In [162]:
def MakePredictionAdvanced(w, test_dataset):
    """
    Predicts the output for a given test dataset using a regression model.

    Parameters:
    - w (numpy.ndarray): The coefficients of the model, ordered as the intercept followed by
                               every feature raised to the powers 1 through `degree`.
    - test_dataset (numpy.ndarray): A 2D array of input features (one row per sample, without the target column)
                                          for which predictions are to be made.

    Returns:
    - list/numpy.ndarray: A list or 1d array of predicted values corresponding to each input value in the test dataset.
    """
    prediction = []

    # TODO
    X_poly = np.ones((test_dataset.shape[0], 1))  # Add intercept term (column of ones)
    for d in range(1, degree + 1):  # `degree` is the global set in Regression
        X_poly = np.hstack((X_poly, test_dataset ** d))  # Append the d-th power of every feature
    # Column order matches training: 1, x1, x2, x3, ..., x1^2, x2^2, x3^2, ...
    prediction = X_poly.dot(w)

    return prediction

In [216]:
# TODO

# (1) Split data
training_data, validation_data = SplitData(training_datalist, 0.9)
# (2) Preprocess data
training_data = EncodeGender(training_data)
validation_data = EncodeGender(validation_data)
# print(training_data)
processed_training_data = PreprocessData(training_data)
processed_validation_data = PreprocessData(validation_data)
print(processed_training_data)
# (3) Train regression model
w = Regression(processed_training_data)
# (4) Predict validation dataset's answer, calculate MAPE comparing to the ground truth
validation_data_prediction = MakePredictionAdvanced(w, processed_validation_data[:, :-1])
y = processed_validation_data[:, -1]
predicted = validation_data_prediction
mape = np.mean(np.abs((y - predicted) / y)) * 100
print(mape)
# (5) Make prediction of testing dataset and store the values in output_datalist
testing_datalist = EncodeGender(testing_datalist)
output_datalist = MakePredictionAdvanced(w, testing_datalist)

Streaming output truncated to the last 5000 lines.
Iteration 29538400, Cost: 23.011433779921543
Iteration 29538600, Cost: 23.01143145751203
Iteration 29538800, Cost: 23.011429135116654
Iteration 29539000, Cost: 23.011426812735436
Iteration 29539200, Cost: 23.01142449036836
Iteration 29539400, Cost: 23.011422168015436
Iteration 29539600, Cost: 23.011419845676656
Iteration 29539800, Cost: 23.01141752335202
Iteration 29540000, Cost: 23.011415201041537
Iteration 29540200, Cost: 23.011412878745197
Iteration 29540400, Cost: 23.011410556463
Iteration 29540600, Cost: 23.01140823419495
Iteration 29540800, Cost: 23.01140591194104
Iteration 29541000, Cost: 23.011403589701285
Iteration 29541200, Cost: 23.011401267475666
Iteration 29541400, Cost: 23.011398945264194
Iteration 29541600, Cost: 23.01139662306686
Iteration 29541800, Cost: 23.01139430088368
Iteration 29542000, Cost: 23.011391978714634
Iteration 29542200, Cost: 23.011389656559736
Iteration 29542400, Cost: 23.011387334418973
Iteration 29542600, Cost: 23.0

In [185]:
# (4) Predict validation dataset's answer, calculate MAPE comparing to the ground truth
validation_data_prediction = MakePredictionAdvanced(w, processed_validation_data[:, :-1])
y = processed_validation_data[:, -1]
predicted = validation_data_prediction
mape = np.mean(np.abs((y - predicted) / y)) * 100
print(mape)
# (5) Make prediction of testing dataset and store the values in output_datalist
testing_datalist = EncodeGender(testing_datalist)
output_datalist = MakePredictionAdvanced(w, testing_datalist)

13.127928711816258
[1. 1. 0. ... 1. 0. 1.]


In [218]:
# Assume that output_datalist is a list (or 1d array) with length = 3000

with open(output_dataroot, 'w', newline='', encoding="utf-8") as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['Id', 'gripForce'])
  for i in range(len(output_datalist)):
    writer.writerow([i,output_datalist[i]])

# Save the Code File
Please save your code and submit it as an ipynb file! (**Lab1.ipynb**)