# Simple Linear Regression: Least-squares adjustment
M2U1 - Exercise 1

## What are we going to do?
- Import datasets into the working environment
- Manually fit a simple least squares linear regression
- Solve said regression using NumPy mathematical functions
- Visualise the regression with Matplotlib

Remember to follow the instructions for the submission of assignments indicated in [Submission Instructions](https://github.com/Tokio-School/Machine-Learning-EN/blob/main/Submission_instructions.md).

## Task 1: Import datasets

For this exercise we must import the following dataset into the local environment. We used it in a previous exercise, and it is available in that unit:
- [M1U1-2-dataset_tarea2.csv](https://github.com/Tokio-School/Machine-Learning/blob/main/M01-Introducci%C3%B3n_al_Machine_Learning/M1U1-Introducci%C3%B3n_al_big_data_y_ML/M1U1-2-dataset_tarea2.csv)

Depending on your work environment, you will need to follow different steps to import the datasets. You can import them either in Google Colab or in your VM or local environment, using the JupyterLab interface or using your environment's functionalities.
Because the environments are so different, we don't include step-by-step instructions for doing this, but you shouldn't have any difficulty importing them :).

Take this time to familiarise yourself with your working environment and explore the options for importing datasets locally.

## Task 2: Fitting simple linear regression using least squares

For this task, we will fit the regression step by step, calculating each value with NumPy to familiarise ourselves with its functions.

**Note:** In this task we will only use NumPy's sum function. In the next task we will use the NumPy functions that directly calculate the mean, standard deviation, or covariance of an array.

In [None]:
import numpy as np

### Import the datasets into NumPy

Execute the following cell to import the dataset as a NumPy array, making sure that the dataset name is correct and that the file is in the same directory as the notebook.

*NOTE:* If you use Google Colab, use these methods to upload the file from your local machine or Google Drive: [External data: Local Files, Drive, Sheets, and Cloud Storage](https://colab.research.google.com/notebooks/io.ipynb)

In [None]:
import csv

with open('M1U1-2-dataset_tarea2.csv') as csvfile:
    read_csv = list(csv.reader(csvfile))
    
# Delete header
read_csv = read_csv[1:]

# Change the decimal comma characters to periods
for line in read_csv:
    for i in [0, 1]:
        line[i] = line[i].replace(',', '.')
    
    
# Load as a NumPy array
# (the np.float alias was removed in NumPy 1.24; the built-in float is equivalent)
dataset = np.asarray(read_csv).astype(float)

print(dataset)

We already have the data in a 2D NumPy array.

Now, fill in the code in the following cells to fit the linear regression:

In [None]:
## TODO: Create 2 1D arrays from the imported dataset corresponding to the X and Y columns of the CSV file

X = [...]
Y = [...]

In [None]:
## TODO: Before training the model, plot the data on a Matplotlib scatter plot.

import matplotlib.pyplot as plt

# You can use the plt.scatter() function

plt.show()

Recall the linear regression equations:

$$Y=m \times X + b$$

$$m=\frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum X^2-\frac{(\sum X)^2}{n}}$$

$$b=\overline{Y} - m \times \overline{X}$$
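As a hedged illustration of the equations above, here is a minimal sketch on a small made-up dataset, using only `np.sum`. The arrays `x_toy` and `y_toy` and every variable name are hypothetical and are not the exercise's CSV data. It shows one possible way to lay out the calculation, not the expected solution.

```python
import numpy as np

# Hypothetical toy data, NOT the exercise dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n_toy = x_toy.shape[0]

# Sums that appear in the formula for m
sum_x = np.sum(x_toy)
sum_y = np.sum(y_toy)
sum_xy = np.sum(np.multiply(x_toy, y_toy))  # element-wise product, then sum
sum_x2 = np.sum(np.multiply(x_toy, x_toy))  # sum of the squared X values

# Slope and intercept from the least-squares equations
m_toy = (sum_xy - sum_x * sum_y / n_toy) / (sum_x2 - sum_x ** 2 / n_toy)
b_toy = sum_y / n_toy - m_toy * sum_x / n_toy

print(m_toy, b_toy)
```

Comparing the result against `np.polyfit(x_toy, y_toy, 1)` is a quick way to check that the sums were assembled correctly.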

In [None]:
## TODO: Calculate m using the function np.sum(ndarray) or ndarray.sum(), where ndarray is the array to be summed
n = [...]

# Remember, it's an element-wise vector multiplication. Use the np.multiply() function
# In other exercises in the course we will use np.matmul() to multiply 2D matrices instead
XY = [...]

X2 = [...]    # Array X-squared

m = [...]

In [None]:
## TODO: Calculate b

# TODO: Replace "sum_y" and "sum_x" with the corresponding code or variables
y_avg = sum_y / n
x_avg = sum_x / n

b = [...]

Evaluate the model by calculating its R<sup>2</sup>.

Recall the equations for calculating the correlation coefficient. In simple linear regression, R<sup>2</sup> is the square of that coefficient:

$$R^2 = \left(\frac{\sigma_{XY}}{\sigma_X \cdot \sigma_Y}\right)^2;$$

$$\sigma_{XY} = \frac{1}{n} \sum_{i = 1}^{n}{x_i y_i} - \bar{x}\bar{y};$$

$$\sigma_X = \sqrt{\frac{\sum X^2}{n} - \bar{X}^2};$$

$$\sigma_Y = \sqrt{\frac{\sum Y^2}{n} - \bar{Y}^2}$$

*Note:* We will use a slightly different formula for covariance than the one used in previous exercises.
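The following sketch works through these quantities on a small made-up dataset (hypothetical `x_toy` and `y_toy`, not the exercise's CSV). It assumes the same 1/n normalisation for the covariance as for the standard deviations, so that their ratio is the Pearson correlation coefficient and stays within [-1, 1]; R<sup>2</sup> is then taken as its square.

```python
import numpy as np

# Hypothetical toy data, NOT the exercise dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n_toy = x_toy.shape[0]

x_mean = np.sum(x_toy) / n_toy
y_mean = np.sum(y_toy) / n_toy

# Population (1/n) standard deviations
x_sd = np.sqrt(np.sum(x_toy ** 2) / n_toy - x_mean ** 2)
y_sd = np.sqrt(np.sum(y_toy ** 2) / n_toy - y_mean ** 2)

# Population (1/n) covariance, same normalisation as the standard deviations
cov_toy = np.sum(x_toy * y_toy) / n_toy - x_mean * y_mean

# Correlation coefficient and its square
r_toy = cov_toy / (x_sd * y_sd)
r2_toy = r_toy ** 2

print(r2_toy)
```

Mixing a 1/(n - 1) covariance with 1/n standard deviations would inflate the ratio by a factor of n/(n - 1), which is why the sketch keeps both on the same footing.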

In [None]:
## TODO: Calculate R**2

x_std = [...]
y_std = [...]
cov_xy = [...]

r2 = [...]

Calculate the predictions of Y as *y_pred* for the original X values, with the coefficients of the fitted model:

$y\_pred = m \times X + b$
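A minimal sketch of this step, together with the kind of two-series chart it feeds into, could look like the following. The data and the coefficients `m_toy` and `b_toy` are hypothetical placeholders; in the exercise they come from your own fit.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical toy data and coefficients, NOT the exercise values
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
m_toy, b_toy = 1.99, 0.05

# Broadcasting applies the line equation to every element of x_toy
y_pred_toy = m_toy * x_toy + b_toy

fig, ax = plt.subplots()
ax.scatter(x_toy, y_toy, color="tab:blue", label="Y (observed)")
ax.plot(x_toy, y_pred_toy, color="tab:red", label="y_pred (fitted line)")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend()
# In a notebook the figure renders when the cell finishes;
# in a plain script, call plt.show() here.
```

The scatter series shows the observed points and the line series shows the model, so any systematic gap between the two is visible at a glance.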

In [None]:
## TODO: Calculate y_pred
y_pred = [...]

In [None]:
# TODO: Using Matplotlib, plot a graph with 2 series in different colours: Y vs X, y_pred vs X
# Use a scatter plot for Y vs X and a line chart for y_pred vs X

[...]

## Task 3: Fitting linear regression using NumPy's mathematical functions

Now, repeat the steps above to fit the linear regression, taking full advantage of NumPy's functions for calculating the sum, mean, standard deviation, and covariance of arrays.
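As a rough sketch of what this can look like on made-up data (hypothetical `x_toy` and `y_toy`, with illustrative variable names), note one detail that is easy to miss: `np.std` and `np.var` divide by n by default, whereas `np.cov` divides by n - 1 by default. The example passes `ddof=0` to `np.cov` so that all quantities share the same normalisation; this is an assumption of the sketch, not a requirement stated in the exercise.

```python
import numpy as np

# Hypothetical toy data, NOT the exercise dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_avg_toy = np.mean(x_toy)
y_avg_toy = np.mean(y_toy)

# np.std defaults to ddof=0 (divide by n)
x_std_toy = np.std(x_toy)
y_std_toy = np.std(y_toy)

# np.cov returns a 2x2 matrix; element [0, 1] is the covariance of X and Y.
# ddof=0 makes it divide by n, matching np.std and np.var above.
cov_toy = np.cov(x_toy, y_toy, ddof=0)[0, 1]

# The slope is the covariance divided by the variance of X
m_toy = cov_toy / np.var(x_toy)
b_toy = y_avg_toy - m_toy * x_avg_toy
r2_toy = (cov_toy / (x_std_toy * y_std_toy)) ** 2

print(m_toy, b_toy, r2_toy)
```

The slope and intercept should match those obtained with the manual sums, which makes this a convenient cross-check between the two tasks.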

In [None]:
## TODO: Solve the linear regression using NumPy's advanced functions
## Use new variable names such as np_x_avg, np_x_std, np_r2, etc.

np_m = [...]
np_b = [...]
np_r2 = [...]

## Task 4: Calculate the residuals and make predictions

Calculate the residuals of your model:

$residuals = Y - y\_pred$
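A small sketch of the residual calculation on made-up data follows. Here `np.polyfit` with degree 1 is used only as a stand-in for your own fitted `m` and `b`; the data and names are hypothetical.

```python
import numpy as np

# Hypothetical toy data, NOT the exercise dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Stand-in for your own fit: polyfit returns [slope, intercept] for degree 1
m_toy, b_toy = np.polyfit(x_toy, y_toy, 1)

y_pred_toy = m_toy * x_toy + b_toy
res_toy = y_toy - y_pred_toy

# For a least-squares line with an intercept, the residuals sum to
# zero up to floating-point error
print(res_toy)
print(np.sum(res_toy))
```

A residual plot with no visible trend against X is a sign that a straight line is a reasonable model for the data.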

In [None]:
## TODO: Calculate the residuals and plot them with Matplotlib on a scatter plot vs X

res = [...]

# Matplotlib graphs

Make predictions for 2 (or more) new values of X: at least 1 value for interpolation (inside the range of the training X values) and 1 value for extrapolation (outside that range).
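To make the distinction concrete, here is a minimal sketch with hypothetical coefficients and made-up X values (none of them taken from the exercise data).

```python
import numpy as np

# Hypothetical training X values and coefficients, NOT the exercise values
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
m_toy, b_toy = 2.0, 1.0

# Interpolation: a new X inside the range covered by the training data
x_interpol_toy = 2.5
# Extrapolation: a new X outside that range
x_extrapol_toy = 10.0

y_interpol_toy = m_toy * x_interpol_toy + b_toy
y_extrapol_toy = m_toy * x_extrapol_toy + b_toy

print(y_interpol_toy, y_extrapol_toy)
```

Extrapolated predictions rely on the linear trend continuing beyond the observed data, so they generally deserve less trust than interpolated ones.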

In [None]:
# TODO: Make predictions with the fitted model

x_interpol = [...]
y_interpol = [...]

x_extrapol = [...]
y_extrapol = [...]

Plot the predictions together with the training values.

In [None]:
# TODO: Plot the predictions as points of a different series on the training Y vs X point cloud


## Task 5: Resolution with Scikit-learn

*Are you up for solving a simple linear regression with Scikit-learn, then evaluating it and making predictions?*

Review the code from this example and adapt it to use our data: [https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)
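As a starting point, the sketch below fits a line with `LinearRegression` on made-up data. It is not a copy of the linked example: the toy arrays and variable names are hypothetical, and it only illustrates the fit, evaluate, and predict steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy data, NOT the exercise dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Scikit-learn expects a 2D feature matrix of shape (n_samples, n_features)
x_2d = x_toy.reshape(-1, 1)

model = LinearRegression()
model.fit(x_2d, y_toy)

m_sk = model.coef_[0]      # slope
b_sk = model.intercept_    # intercept
r2_sk = r2_score(y_toy, model.predict(x_2d))

# Predictions for new X values, also passed as a 2D array
y_new = model.predict(np.array([[2.5], [10.0]]))

print(m_sk, b_sk, r2_sk, y_new)
```

The reshape step is the most common stumbling block: passing a 1D array as the feature matrix raises an error, because Scikit-learn cannot tell samples from features.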

In [None]:
# TODO: Solve the simple linear regression from that example using Scikit-learn