# Exercise 2: Linear Regression
----------
In this exercise, you will implement a first machine learning model and learn about the *pandas* and *scikit-learn* libraries.

## Dataset
We will use a dataset originally published here: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/9/auto+mpg)

Download the dataset from Moodle. The dataset consists of two files:
- auto-mpg.data: contains the data
- auto-mpg.names: contains information about the dataset

The dataset contains data from 398 different car models. In addition to the car name, it includes information about:
- fuel consumption in miles per gallon
- cylinders
- engine displacement
- horsepower
- weight
- acceleration
- model year
- origin

The goal of this exercise is to predict the fuel consumption of cars using the other available attributes as input to a linear regression algorithm.

## Importing Data with *pandas*

The *pandas* library is widely used in data science for handling datasets. It includes functions for analyzing, exploring, and manipulating data.
You can find information about *pandas* on their website: [https://pandas.pydata.org/docs/index.html](https://pandas.pydata.org/docs/index.html)

When working with datasets in *pandas*, the data is loaded into a DataFrame, a two-dimensional structure similar to a table. In general, the columns of the DataFrame correspond to the features of the dataset, while the rows represent the individual instances. *pandas* provides many ways to manipulate and analyze the data in a DataFrame, e.g. for calculating statistical properties or cleaning the data.


In [1]:
import pandas as pd

data = {
    "Height": [180, 165, 172, 201, 177],
    "Weight": [80, 56, 105, 102, 68],
    "Name": ['Jack', 'John', 'Oliver', 'George', 'William']
}

# load the data into a data frame
dataframe = pd.DataFrame(data)

print(dataframe)


   Height  Weight     Name
0     180      80     Jack
1     165      56     John
2     172     105   Oliver
3     201     102   George
4     177      68  William


In [7]:
# The info() method gives a first overview of the data, such as the number of rows and columns and the data types.
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Height  5 non-null      int64 
 1   Weight  5 non-null      int64 
 2   Name    5 non-null      object
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


In [2]:
# Select a column of the dataframe
print(dataframe['Height'])
# The result of the selection is a Pandas Series, which is a one-dimensional array
print(type(dataframe['Height']))

0    180
1    165
2    172
3    201
4    177
Name: Height, dtype: int64
<class 'pandas.core.series.Series'>


In [9]:
# Select rows and columns of the dataframe using loc
# loc selects by label; a label slice such as 0:2 includes the end label 2
print(dataframe.loc[0:2,['Weight', 'Name']])

   Weight    Name
0      80    Jack
1      56    John
2     105  Oliver


In [10]:
# Select rows and columns of the dataframe using iloc
# iloc selects by integer position; the end of a slice is excluded, so 0:3 covers rows 0, 1 and 2
print(dataframe.iloc[0:3,1:3])
# this gives the same output as the code block above

   Weight    Name
0      80    Jack
1      56    John
2     105  Oliver
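
The introduction above also mentions calculating statistical properties and cleaning data. A minimal sketch of both follows, on a hypothetical toy DataFrame (the `people` frame and its missing weight are made up for illustration and are not part of the exercise data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing weight (np.nan)
people = pd.DataFrame({
    "Height": [180, 165, 172, 201, 177],
    "Weight": [80, 56, np.nan, 102, 68],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column
print(people.describe())

# isna() marks missing entries; summing the booleans counts them per column
missing_per_column = people.isna().sum()
print(missing_per_column)

# dropna() removes the rows with a missing value in the given column(s)
cleaned = people.dropna(subset=["Weight"]).reset_index(drop=True)
print(cleaned.shape)

# statistics on a single column are available as Series methods
mean_height = cleaned["Height"].mean()
print(mean_height)
```

The same pattern (inspect with `isna()`, then remove with `dropna()`) is one possible way to approach data cleaning on larger datasets.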


------

## Task 1: Load data
Load the car dataset for this exercise using the read_csv function from *pandas* and take a first look at the data to make sure it loaded properly.

In [63]:
file_path_1 = 'auto-mpg.data'
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model year', 'origin', 'car name']

# The file has no header row and mixes spaces and tabs as separators, so split on
# any whitespace and pass the column names directly; '?' marks a missing value.
df = pd.read_csv(file_path_1, sep=r'\s+', header=None, names=column_names, na_values='?')

# first look at the data: expect 398 rows and 9 columns
print(df.shape)
print(df.head())
df.info()

## Task 2: Clean data

As described in the auto-mpg.names file, there are six missing horsepower values in the data set. For this exercise, we will ignore the six cars with this missing information. Use *pandas* to find and delete the six instances with missing horsepower information.

In [23]:
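# '?' marks a missing horsepower value in the raw file; coerce to NaN, then list and drop those rows
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
print(df[df['horsepower'].isna()])
df = df.dropna(subset=['horsepower']).reset_index(drop=True)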


## Task 3: Linear Regression

Your task is to predict the fuel consumption of the cars in miles per gallon. Use the linear regression formula $\beta = (X^{T}X)^{-1}X^{T}y$ from the lecture to complete this task. Select the available numeric features cylinders, displacement, horsepower, weight, acceleration, model year and origin as input features.

Calculate the root mean square error of your prediction.
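
The formula can be sketched with NumPy as follows. This is an illustration on a small toy dataset, not a solution for the car data; the variable names are assumptions. Note that a column of ones is prepended to $X$ so that the first entry of $\beta$ acts as the intercept, and that `np.linalg.solve` is used in place of an explicit matrix inverse, which gives the same $\beta$ but is usually numerically more stable.

```python
import numpy as np

# Toy data (one feature, eight samples); NOT the car dataset
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 4.5, 5.8, 7, 10, 13, 14.6, 16])

# Design matrix: a column of ones (intercept) followed by the feature column
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y, which is equivalent to beta = (X^T X)^{-1} X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [intercept, slope]

# Predictions and root mean square error
y_pred = X @ beta
rmse = np.sqrt(np.mean((y - y_pred) ** 2))
print('The root mean squared error is {:.2f}'.format(rmse))
```

For the car data, `X` would instead hold the seven selected feature columns (plus the column of ones) and `y` the mpg column.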

-------------

## Scikit-Learn

As seen in Task 3, a linear regression model is straightforward to implement in Python. For more complex algorithms, it makes sense to use existing libraries. *scikit-learn* is a widely used machine learning library that provides ready-made implementations of many models. It also includes methods to transform and pre-process data before applying a learning algorithm, as well as tools for evaluation.

For example, a linear regression model can be implemented as shown in the following code block.

In [11]:
# import scikit-learn
from sklearn.linear_model import LinearRegression
import numpy as np

# data set
x = np.transpose(np.array([[1,2,3,4,5,6,7,8]]))
y = np.transpose(np.array([3,4.5,5.8,7,10,13,14.6,16]))

# define a linear model and fit it to the given data
lm = LinearRegression(fit_intercept=True).fit(x,y)

# print the coefficients of the linear model
print(lm.intercept_, lm.coef_)

# calculate the predicted values
y_pred = lm.predict(x)

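# calculate the root mean square error
# (taking np.sqrt of the MSE works across scikit-learn versions;
# the squared=False argument was deprecated and removed in newer releases)
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y, y_pred))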
print('The root mean squared error is {:.2f}'.format(rmse))

0.3392857142857153 [1.97738095]
The root mean squared error is 0.63


--------------
## Task 4: Optimization and Evaluation

In this task, we want to compare the results of the classical linear regression that we implemented above with the results of a lasso approach. In order to compare the results on unseen data, we have to define a training and a test data set.

a) Use the *scikit-learn* function "train_test_split" to split the data into the two sets. Choose a size of 70% for the training data and 30% for the test data.

b) Learn the classic linear regression model on the training dataset and evaluate its performance on the training and test datasets using the root mean square error as the evaluation metric.

c) Learn a linear regression model with lasso regularization on the training dataset and evaluate its performance on the training and test datasets using the root mean square error as the evaluation metric. Use the "Lasso" class from *scikit-learn* to perform this task and set the alpha value to 1. Compare the model coefficients and the performance with the classic linear regression approach from above. Which of the models seems to be better suited for the given task?

d) Perform a hyperparameter optimization for the Lasso approach by trying different values for the hyperparameter alpha. Evaluate the performance and find the alpha value with the best performance.
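
The workflow of steps a) to d) can be sketched as follows. This is only an outline on synthetic stand-in data, not a solution for the car dataset: the data, the `rmse` helper, the list of alpha values and the fixed `random_state` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the car dataset):
# five features, two of which have no influence on the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + 4.0 + rng.normal(scale=0.5, size=100)

# a) 70% training / 30% test split; random_state is fixed only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)


def rmse(model, X_part, y_part):
    """Hypothetical helper: root mean square error of a fitted model."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))


# b) classic linear regression
ols = LinearRegression().fit(X_train, y_train)
print('OLS   train/test RMSE:', rmse(ols, X_train, y_train), rmse(ols, X_test, y_test))

# c) lasso regression with alpha = 1
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print('Lasso train/test RMSE:', rmse(lasso, X_train, y_train), rmse(lasso, X_test, y_test))
print('OLS coefficients:  ', ols.coef_)
print('Lasso coefficients:', lasso.coef_)

# d) try several alpha values and keep the one with the lowest test RMSE
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
test_rmse = {a: rmse(Lasso(alpha=a).fit(X_train, y_train), X_test, y_test)
             for a in alphas}
best_alpha = min(test_rmse, key=test_rmse.get)
print('best alpha:', best_alpha)
```

Selecting alpha directly on the test set keeps the sketch short, but it makes the reported test error optimistic; a separate validation set or cross-validation would be a cleaner choice.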