## Import the necessary libraries

In [None]:
import ... as pd
# Plotting
import ... as sns
import matplotlib.pyplot as plt

# sklearn environment
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

--------------------------------

# Linear regression



### 1. What is Linear Regression?

- models a linear relationship between at least one explanatory variable $x$ (feature) and an outcome variable $y$ (target):
$\newline$
$$
y = w_0 + w_1x + \epsilon
$$
$\newline$
where
>- $w_0$ is the intercept or bias
>- $w_1$ is the coefficient or slope (how much $y$ changes for each unit change in $x$)
>- $\epsilon$ is a randomly distributed error term
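
To make the equation concrete, here is a minimal sketch on synthetic data. The parameter values (`w0_true`, `w1_true`), the noise level, and the sample size are arbitrary assumptions chosen for illustration, not values from the penguins data. If the data really follow $y = w_0 + w_1x + \epsilon$, a fitted `LinearRegression` should recover values close to the true intercept and slope.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for illustration
w0_true, w1_true = 2.0, 0.5

x = rng.uniform(0, 10, size=200)
epsilon = rng.normal(0, 0.1, size=200)  # random error term
y = w0_true + w1_true * x + epsilon

# scikit-learn expects a 2-D feature matrix, hence the reshape
model = LinearRegression().fit(x.reshape(-1, 1), y)

w0_hat = model.intercept_  # estimate of w_0
w1_hat = model.coef_[0]    # estimate of w_1
print(f"intercept: {w0_hat:.2f}, slope: {w1_hat:.2f}")
```

The estimates differ slightly from the true values because of the noise term; with less noise or more data they would typically move closer.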

### Why is this useful?
* Predicting metric variables is commonplace in data science
* The results are easily interpretable


--------------------------------------

## Linear regression with scikit-learn and the Penguins

## Define Business Goal
Train a linear regression model to predict `Body Mass (g)` from `Flipper Length (mm)`.


## Read the Data as a Dataframe

In [None]:
penguins = pd.....("./data/penguins_simple.csv", sep=";")


## Check the info of the data

## Do you have any null values? Any duplicates?
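
As a hedged sketch of the checks asked for above, the snippet below runs them on a small made-up DataFrame (`toy`, a hypothetical stand-in, not the real penguins file). It uses `DataFrame.info()`, `isna().sum()`, and `duplicated().sum()`; the same calls should work on `penguins` once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins data; the values are made up for illustration
toy = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, np.nan, 181.0],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, 3750.0],
})

toy.info()                             # column dtypes and non-null counts
n_missing = toy.isna().sum()           # missing values per column
n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row

print(n_missing)
print(n_duplicates)
```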

## Train/Test split

### Define X and y

In [None]:
X = penguins[["Flipper Length (mm)"]] # independent variable
X

In [None]:
y = penguins["Body Mass (g)"] #dependent variable, target
y

In [None]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)  # split the data into train and test sets

## Exploratory Data Analysis
We need to recombine `Xtrain` and `ytrain` to explore the training data as a single DataFrame.
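
One possible way to recombine the training features and target is `pd.concat` along the columns. The sketch below uses made-up values and hypothetical names (`toy`, `X_tr`, `y_tr`, `train_df`) so that it runs on its own; with the real data, `Xtrain` and `ytrain` would take their place.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up values standing in for the penguins data
toy = pd.DataFrame({
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 193.0, 190.0],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0],
})
X_toy = toy[["Flipper Length (mm)"]]
y_toy = toy["Body Mass (g)"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Both pieces keep the original row index, so concat aligns them row by row
train_df = pd.concat([X_tr, y_tr], axis=1)
print(train_df.corr())  # correlation between feature and target on the training rows
```

Exploring only the training rows keeps the test set unseen until evaluation.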

In [None]:
#let's check the flipper length vs body mass
plt.scatter(Xtrain["Flipper Length (mm)"], ytrain);

## Feature Engineering


+ No missing values allowed --> impute
+ For categorical variables: one-hot-encoding
+ For numerical variables: better if scaled
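
The three points above can be sketched with scikit-learn's preprocessing tools. This is only one possible setup, under stated assumptions: the rows are made up, the `Species` column is a hypothetical categorical feature, and mean imputation is just one of several strategies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; "Species" is a hypothetical categorical feature
toy = pd.DataFrame({
    "Flipper Length (mm)": [181.0, np.nan, 195.0, 210.0],
    "Species": ["Adelie", "Adelie", "Chinstrap", "Gentoo"],
})

# Numerical column: fill missing values with the mean, then scale
numeric = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

preprocess = ColumnTransformer([
    ("num", numeric, ["Flipper Length (mm)"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Species"]),  # one-hot-encoding
])

features = preprocess.fit_transform(toy)
print(features.shape)  # one scaled numeric column plus one column per species
```

In a train/test setting, the transformer would be fitted on the training data only and then applied to the test data with `transform`.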

## Train Model(s)


In [None]:
#define your model as Linear regression
...

In [None]:
# "Training" the model and "fitting" the model mean the same thing

model.fit(Xtrain, ytrain)  # the model learns its parameters: slope and intercept

## Evaluate the model(s)
The `.score()` method of the linear regression model returns R-squared.


#### **R-squared** 
R-squared (R², or the coefficient of determination) is a statistical measure that gives the proportion of variance in the dependent variable that is explained by the independent variable(s) of a regression model. In other words, R-squared shows how well the data fit the regression model (the goodness of fit).
An R-squared value of 1.0 means your model fits perfectly: no errors in your predictions!
An R-squared value of 0.0 means your model is no better than always predicting the average of all target values. On unseen data the value can even be negative, which means the model does worse than that average.

$$ R^2 = 1 - \frac{\sum_i(\hat{y}^{(i)} - y^{(i)}_{true})^2}{\sum_{i}(y^{(i)}_{true} - \bar{y})^2}$$
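
The formula can be checked by hand. The sketch below computes R² term by term on a few made-up values (`y_true` and `y_hat` are illustrative assumptions, not model output) and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values, for illustration only
y_true = np.array([3750.0, 3800.0, 3250.0, 3450.0])
y_hat = np.array([3700.0, 3850.0, 3300.0, 3400.0])

ss_res = np.sum((y_hat - y_true) ** 2)          # numerator: squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator: squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean makes numerator and denominator equal, so R² is 0
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_hat), r2_baseline)
```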





In [None]:
model.score(Xtrain, ytrain), model.score(Xtest, ytest)  # R² on train and test; 1.0 is a perfect fit (the line goes through all the data points)

In [None]:
ypred = model.predict(Xtest)  # predicted body mass of the penguins

actual_vs_predicted = pd.DataFrame({'Actual': ytest, 'Predicted':ypred })
actual_vs_predicted.head()

In [None]:
ytest  # true values from the dataset: body mass of the penguins

In [None]:
#run the code below and check the visual!

plt.plot(Xtrain["Flipper Length (mm)"], ytrain, 'bs', label="train")
plt.plot(Xtest["Flipper Length (mm)"], ytest, 'ro', label="test")
plt.plot(Xtest["Flipper Length (mm)"], ypred, 'c', label="fit")
plt.legend()
