## Readme

In this lab, we use a simple example, a linear regression model, to see how noise in the data affects model accuracy.

Usually, a supervised machine learning model can be described as:


$y = f(\mathbf{x} \mid \Theta) + \epsilon$

where $\epsilon \sim N(0, \sigma^2)$, that is, the noise is normally distributed with mean 0 and standard deviation $\sigma$.

In this lab, we will see how increasing $\sigma$ affects model accuracy, measured by how close the estimated parameters $\hat{\Theta}$ are to the true $\Theta$.

You will also see that, even when the noise is large, increasing the number of observations still reduces the impact of the noise on model accuracy.
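For the linear regression used in this lab, both effects can be written down directly. As a sketch, assuming the feature values are treated as fixed and the noise terms are independent with standard deviation $\sigma$ (the symbols $X$, $\hat{\beta}$, $\bar{x}$ and $s_x$ are introduced here for illustration and are not part of the lab code), the least-squares estimate and its covariance are:

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad \mathrm{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}
$$

With a single feature $x$, the standard error of the slope becomes:

$$
\mathrm{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \approx \frac{\sigma}{s_x \sqrt{n}}
$$

where $s_x$ is the standard deviation of the feature and $n$ is the number of observations. Under these assumptions, the error grows linearly with $\sigma$ but shrinks only with the square root of $n$.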

Think: What does this mean for computational cost? Do you see the connection to LLMs, and why we need to improve data quality?

## Step 1. Let's First Simulate the Data for Linear Regression

### Simulate Data y = 2 + 5x + noise

In [None]:
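import numpy as np

num_obs = 30           # number of observations
stdev_of_noise = 2     # standard deviation of the noise (sigma)
np.random.seed(98053)  # fix the seed so the results are reproducible

# Feature: standard normal draws scaled to a standard deviation of 5
x = np.random.randn(num_obs) * 5
# Design matrix: a column of ones (intercept) next to the feature column
x = np.vstack((np.ones(num_obs), x)).T
noise = np.random.randn(num_obs) * stdev_of_noise

coefficients = [2, 5]          # true intercept and slope
y_truth = x.dot(coefficients)  # noise-free response
y = y_truth + noise            # observed response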

### Let's Visualize the Data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'x': x[:, 1], 'y': y})
bplot = sns.scatterplot(x='x', y='y', data=df)
bplot.set_title("x vs y: Scatter Plot", fontsize=16)
bplot.set_ylabel("y", fontsize=16)
bplot.set_xlabel("x", fontsize=16)
# Overlay the true, noise-free line y = 2 + 5x
bplot.plot(x[:, 1], y_truth, color='green')

## Step 2. Let's Fit Linear Regression Model and Visualize It

In [None]:
# Ordinary least squares via the normal equations: solve (X^T X) beta = X^T y
beta_hat = np.linalg.solve(x.T.dot(x), x.T.dot(y))
print(beta_hat)
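The closed-form estimate above comes from the normal equations. As a cross-check, the sketch below rebuilds the Step 1 data and compares the explicit-inverse form with `np.linalg.lstsq`, NumPy's least-squares solver. The names `beta_normal` and `beta_lstsq` are introduced here for illustration; the two estimates should agree up to floating-point error.

```python
import numpy as np

# Rebuild the Step 1 data so this cell runs on its own.
num_obs = 30
stdev_of_noise = 2
np.random.seed(98053)
x = np.random.randn(num_obs) * 5
x = np.vstack((np.ones(num_obs), x)).T
noise = np.random.randn(num_obs) * stdev_of_noise
y = x.dot([2, 5]) + noise

# Normal-equations estimate using an explicit matrix inverse.
beta_normal = np.linalg.inv(x.T.dot(x)).dot(x.T).dot(y)

# Least-squares solver; rcond=None selects NumPy's default cutoff.
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(x, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

For a small, well-conditioned problem like this one, either approach is fine. A dedicated solver is generally the safer choice when the columns of `x` are nearly collinear, because it avoids forming and inverting `x.T.dot(x)`.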

In [None]:
yhat = x.dot(beta_hat)
plt.scatter(x[:, 1], y)
plt.plot(x[:, 1], yhat, color='red', label='fitted line')
plt.plot(x[:, 1], y_truth, color='green', label='actual line')
plt.legend(loc='upper left')

## Step 3. Let's Play Around to See Impact of Noise and Number of Observations on Model Fitting

### Increase the standard deviation of the noise from 2 to 50.

Set `stdev_of_noise = 50` and run Steps 1 and 2 again. What do you see?

### Keep the standard deviation of the noise at 50, but increase the number of observations from 30 to 3000.

Set `num_obs = 3000` and run Steps 1 and 2 again. What do you see?
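A single run of Steps 1 and 2 shows only one random draw of the noise. To compare the three settings more systematically, the sketch below repeats the simulation many times and reports the root-mean-square error of the fitted slope. The helper `slope_rmse` and its `num_trials` parameter are hypothetical names introduced here, and it uses `np.random.default_rng` instead of `np.random.seed`, so the individual draws differ from those in Step 1.

```python
import numpy as np


def slope_rmse(num_obs, stdev_of_noise, num_trials=200, seed=98053):
    """Hypothetical helper: RMSE of the fitted slope over repeated simulations."""
    rng = np.random.default_rng(seed)
    coefficients = np.array([2.0, 5.0])  # true intercept and slope, as in Step 1
    squared_errors = []
    for _ in range(num_trials):
        # Same data-generating process as Step 1: y = 2 + 5x + noise
        feature = rng.standard_normal(num_obs) * 5
        x = np.column_stack((np.ones(num_obs), feature))
        y = x @ coefficients + rng.standard_normal(num_obs) * stdev_of_noise
        # Fit by least squares and record the squared slope error
        beta_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
        squared_errors.append((beta_hat[1] - coefficients[1]) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))


# (num_obs, stdev_of_noise) for the baseline and the two Step 3 variations
settings = [(30, 2), (30, 50), (3000, 50)]
results = {s: slope_rmse(*s) for s in settings}
for (n, sd), rmse in results.items():
    print(f"num_obs={n:5d}  stdev_of_noise={sd:3d}  slope RMSE={rmse:.3f}")
```

Compare the three printed values with what you observed in the plots: how much of the accuracy lost to the extra noise is recovered by using 100 times as many observations?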

## Discussion

What have you seen?

What have you learnt?