<a href="https://colab.research.google.com/github/D4ve39/pythonProg/blob/master/MachineLearningBasics_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# **The Dataset**
Each entry in the dataset consists of a measurement of a person's height and weight.

# Importing Data using Pandas 
Pandas is a Python library that provides tools for data manipulation. It is built around the concept of a dataframe, which you can imagine as a table indexed by columns and rows. For this tutorial, we do not need much of Pandas' functionality. But if you are interested, you can read this quick introduction to Pandas: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
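
As a small illustration of the "table indexed by columns and rows" idea, the sketch below builds a tiny dataframe by hand. The values are hypothetical and are not taken from the workshop dataset; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy values, for illustration only.
toy = pd.DataFrame({"Height": [160.0, 172.5, 181.0],
                    "Weight": [55.0, 68.0, 80.5]})

# Select one column by its name (this returns a pandas Series).
heights = toy["Height"]

# Select one row by its integer position.
first_row = toy.iloc[0]

print(toy.shape)            # (number of rows, number of columns)
print(heights.mean())       # average of the Height column
print(first_row["Weight"])  # Weight value stored in the first row
```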

In [None]:
url= "https://gitlab.com/analytics-club/ml-workshop/ml-workshop/-/raw/master/data/data_weight.csv?inline=false"

# Let's load the data from the csv file to a Pandas dataframe!
data = pd.read_csv(url)

# Now let's have a look at the first rows of our dataset.
data.head()

In [None]:
# Now we select the labels (what we want to predict), and the features of our problem.
labels = data['Weight']
features = data['Height'].values.reshape(-1,1)

train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.3, random_state=42)

# Visualise the data

In [None]:
plt.plot(train_features, train_labels, "bo")
plt.xlabel("Height")
plt.ylabel("Weight")

# **Linear Regression**


# `TODO` Why?
Linear regression is a well-studied problem in machine learning with strong **theoretical guarantees** and an intuitive interpretation, and it is easy to use. It belongs to the **supervised learning** category: as we will see, we need training samples as well as **labels**. In this notebook we will use it to predict people's weight based on their height. Check the linear regression documentation at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html and complete the following cell!

In [None]:
# Insert here your code! You should create and train your first linear regression model!
regression_model = ...
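
To build some intuition for what fitting a line means, here is a minimal sketch of ordinary least squares for a single feature, written with NumPy only. The toy data and the helper `predict` are hypothetical and chosen so that the points lie exactly on a known line; this is not the scikit-learn API you are asked to use in the cell above.

```python
import numpy as np

# Hypothetical toy data lying exactly on weight = 0.5 * height - 20.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight = 0.5 * height - 20.0

# Closed-form least-squares estimates for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = height.mean(), weight.mean()
slope = np.sum((height - x_mean) * (weight - y_mean)) / np.sum((height - x_mean) ** 2)
intercept = y_mean - slope * x_mean

def predict(h):
    """Hypothetical helper: evaluate the fitted line at height h."""
    return slope * h + intercept

print(slope, intercept)
print(predict(175.0))
```

Because the toy points are perfectly linear, the recovered slope and intercept match the line used to generate them. With real, noisy data the fitted line only approximates the trend.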

# `TODO` Predicting the weight
Now let's try to use our model for prediction! Given the test dataset, try to predict the corresponding weights and check how your model performs on unseen data points!

In [None]:
# Insert your code here! Try to predict the weight for the test data points!
predictions = ...
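
One possible way to check performance on unseen data is to compare predictions with the true labels using a numeric score. The sketch below uses `mean_squared_error` and `r2_score` from `sklearn.metrics` on hypothetical numbers; the arrays are made up for illustration and the choice of these two metrics is an assumption, not something the workshop prescribes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true weights and model predictions, for illustration only.
true_weights = np.array([60.0, 70.0, 80.0, 90.0])
predicted_weights = np.array([62.0, 69.0, 78.0, 93.0])

# Mean squared error: average of the squared differences (lower is better).
mse = mean_squared_error(true_weights, predicted_weights)

# R^2: fraction of the variance in the labels explained by the predictions
# (1.0 would be a perfect fit).
r2 = r2_score(true_weights, predicted_weights)

print("MSE:", mse)
print("R^2:", r2)
```

In this notebook, the same calls would presumably take `test_labels` and `predictions` as arguments once the cell above has been completed.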

In [None]:
# Plot the data again!
h = np.linspace(140, 210, num=300).reshape(-1, 1)

plt.plot(train_features, train_labels, "bo", label="Train datapoints")
plt.xlabel("Height")
plt.ylabel("Weight")

plt.plot(test_features, test_labels, "ro", label="Test datapoints")
plt.plot(h, regression_model.predict(h), label="Linear regression")

plt.legend(loc="best")

## `TODO` Outliers 

Suppose we have a data point that is very far off, for example due to a measurement error.

In [None]:
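# Add outlier to data
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
data_outlier = pd.concat([data, pd.DataFrame({'Height': [207], 'Weight': [41]})], ignore_index=True)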

# Extract the labels (weight) and features (height)
labels = data_outlier['Weight']
features = data_outlier['Height'].values.reshape(-1,1)

# Split data into training and test set
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.3, random_state=42)

# Plot the new data
plt.plot(train_features, train_labels, "bo")
plt.xlabel("Height")
plt.ylabel("Weight")

Where do you expect the regression line to lie?

In [None]:
# Define a linear regression model
regression_model_outlier = LinearRegression()

# Fit the model
regression_model_outlier.fit(train_features,train_labels)

# Predict on the test set
predictions_outlier = regression_model_outlier.predict(test_features)

In [None]:
# Plot the data again!
h = np.linspace(140, 210, num=300).reshape(-1, 1)

plt.plot(train_features, train_labels, "bo", label="Train datapoints")
plt.xlabel("Height")
plt.ylabel("Weight")

plt.plot(test_features, test_labels, "ro", label="Test datapoints")
plt.plot(h, regression_model_outlier.predict(h), label="Linear regression")

plt.legend(loc="best")

Linear regression is sensitive to outliers. Even a single sample that is very far off can have a huge influence on the prediction.
There are two main extensions of linear regression that add regularisation and can thereby dampen the influence of such points: **Lasso regression** and **Ridge regression**.
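
The sensitivity to a single far-off sample can be illustrated on a small synthetic dataset. The sketch below uses hypothetical, perfectly linear toy data and then adds one point at the same coordinates as the outlier introduced earlier (height 207, weight 41); it is an illustration under these assumptions, not a result on the workshop data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical clean data on the line weight = 0.5 * height - 20.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0]).reshape(-1, 1)
weight = 0.5 * height.ravel() - 20.0

# Fit on the clean data only.
clean_model = LinearRegression().fit(height, weight)

# Add a single point far from the trend and fit again.
height_out = np.vstack([height, [[207.0]]])
weight_out = np.append(weight, 41.0)
outlier_model = LinearRegression().fit(height_out, weight_out)

print("slope without outlier:", clean_model.coef_[0])
print("slope with outlier:   ", outlier_model.coef_[0])
```

With so few samples, the one extra point is enough to pull the fitted slope well below the slope of the clean data.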

– Implement Lasso regression and try out different values of the regularisation parameter "alpha".

– What do you observe? 

– What happens if alpha is very large or if alpha is very small (zero)?

In [None]:
# Define a Lasso regression model
lasso = ...

# Fit the model
...

# Predict on the test set
predictions_lasso = ...

In [None]:
# Plot the data again
h = np.linspace(140, 210, num=300).reshape(-1, 1)

plt.plot(train_features, train_labels, "bo", label="Train datapoints")
plt.xlabel("Height")
plt.ylabel("Weight")

plt.plot(test_features, test_labels, "ro", label="Test datapoints")
plt.plot(h, lasso.predict(h), label="Lasso regression")

plt.legend(loc="best")

## Solution

In [None]:
# Create and train the machine learning model
regression_model = LinearRegression()
regression_model.fit(train_features,train_labels)

# Predicting the weight
predictions = regression_model.predict(test_features)

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso
# Define the model
lasso = Lasso(alpha=10)
# Fit the model
lasso.fit(train_features,train_labels)
# Predict the weight
predictions_lasso = lasso.predict(test_features)
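
To explore the questions about the regularisation parameter, one option is to fit Lasso for several values of `alpha` and compare the resulting slopes. The sketch below does this on hypothetical toy data; the data and the particular `alpha` values are assumptions chosen for illustration. Very small positive values stand in for "alpha close to zero", since the scikit-learn documentation advises against using `alpha=0` with `Lasso`.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical clean data on the line weight = 0.5 * height - 20.
height = np.array([150.0, 160.0, 170.0, 180.0, 190.0]).reshape(-1, 1)
weight = 0.5 * height.ravel() - 20.0

# Fit one Lasso model per alpha and record the learned slope.
slopes = {}
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    model = Lasso(alpha=alpha).fit(height, weight)
    slopes[alpha] = model.coef_[0]
    print(f"alpha={alpha}: slope={slopes[alpha]:.3f}")
```

On this toy data the slope stays close to the least-squares value for small `alpha`, shrinks as `alpha` grows, and is driven to zero for very large `alpha`, which leaves a horizontal line at the mean weight.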