# Linear Regression Exercise - April 3
*Attribution: NYU Tandon Machine Learning course taught by Peng Liu*


In this exercise, you will fit a multiple-variable linear model to a civil engineering dataset.  Along the way, you will learn to:

* Load data from a `csv` file using the `pandas` package.
* Visualize relations between different variables with a scatter plot.
* Fit a multiple-variable linear model using the `sklearn` package.
* Evaluate the fit.

We begin by loading the packages we will need.

In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model
import pandas as pd

## Download Data

Concrete is one of the most basic construction materials.  In this exercise, you will download a simple dataset for predicting the strength of concrete from its attributes.  The dataset comes from this
[Kaggle dataset page](https://www.kaggle.com/maajdl/yeh-concret-data).  Kaggle hosts many datasets that may be useful for your projects.
You can download the data with the following command.  After running it, you should have the file `data.csv` in your local folder.

In [3]:
fn_src = 'https://raw.githubusercontent.com/sdrangan/introml/master/unit03_mult_lin_reg/Concrete_Data_Yeh.csv'
fn_dst = 'data.csv'

import os
import urllib.request

if os.path.isfile(fn_dst):
    print('File %s is already downloaded' % fn_dst)
else:
    urllib.request.urlretrieve(fn_src, fn_dst)
    print('File %s downloaded' % fn_dst)

File data.csv is already downloaded


The `pandas` package has excellent methods for loading `csv` files.  The following command loads the `csv` file into a dataframe `df`.

In [4]:
df = pd.read_csv('data.csv')

Use `df.head()` to print the first few rows of the dataframe.

In [6]:
# TODO

In this exercise, the target variable will be the concrete strength in Megapascals, `csMPa`.  We will use the other 8 attributes as predictors to predict the strength.  

Create a list called `xnames` of the 8 names of the predictors.  You can do this as follows:
* Get the list of names of the columns from `df.columns.tolist()`.  
* Remove the last item (the target `csMPa`) from the list using indexing.

Print the `xnames`.
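
As a reminder of how list slicing works, here is a minimal sketch on a made-up dataframe. The column names `a`, `b`, `c`, `target` and the variable `toy` are hypothetical and unrelated to the concrete data.

```python
import pandas as pd

# Hypothetical toy dataframe; the last column plays the role of the target
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'target': [7, 8]})

cols = toy.columns.tolist()   # all column names as a plain Python list
features = cols[:-1]          # slice that keeps everything except the last item

print(features)
```

The slice `[:-1]` returns a new list and leaves `cols` unchanged.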

In [None]:
# TODO
#   xnames = ...

Get the data matrix `X` and target vector `y` from the dataframe `df`.  

Recall that to get the items from a dataframe, you can use syntax such as

    s = np.array(df['slag'])  
        
which gets the data in the column `slag` and puts it into an array `s`.  You can also get multiple columns by passing a list of column names (note the double brackets):

    X12 = np.array(df[['cement', 'slag']])
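
To see the difference between single-column and multi-column selection, the sketch below uses a made-up three-row dataframe named `toy` (an assumption for illustration, not the downloaded file). A single column name gives a 1-D array, while a list of names inside the brackets gives a 2-D array.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    'cement': [540.0, 332.5, 198.6],
    'slag':   [0.0, 142.5, 132.4],
    'csMPa':  [79.99, 40.27, 44.30],
})

s = np.array(toy['slag'])                 # one column  -> shape (3,)
X12 = np.array(toy[['cement', 'slag']])   # two columns -> shape (3, 2)

print(s.shape, X12.shape)
```

Passing two names without the inner list, as in `toy['cement', 'slag']`, raises a `KeyError` because pandas looks for a single column labelled by that tuple.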


In [None]:
# TODO
#    X = ...
#    y = ...

You can use `matplotlib` for data visualization. A particularly useful command for plotting side by side is `plt.subplot`, which selects one panel within a grid of plots in the same figure. The general syntax is:

```python
plt.subplot(rows, columns, panel_number)
```

For example, to select the third panel in a 3-by-4 grid (panels are numbered from 1, row by row), run:

```python
plt.subplot(3, 4, 3)
```
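
The following sketch shows the general pattern for two side-by-side scatter plots. It uses synthetic data, and the names `u`, `v1`, and `v2` are placeholders for illustration rather than columns of the concrete dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only
rng = np.random.default_rng(0)
u = rng.uniform(0, 1, 50)
v1 = 2.0 * u + rng.normal(0, 0.1, 50)
v2 = -1.0 * u + rng.normal(0, 0.1, 50)

fig = plt.figure(figsize=(8, 3))

plt.subplot(1, 2, 1)   # 1 row, 2 columns, select the left panel
plt.scatter(u, v1)
plt.xlabel('u')
plt.ylabel('v1')

plt.subplot(1, 2, 2)   # select the right panel
plt.scatter(u, v2)
plt.xlabel('u')
plt.ylabel('v2')

plt.tight_layout()     # adjust spacing so the labels do not overlap
```

Each call to `plt.subplot` makes the chosen panel current, so the plotting and labelling commands that follow apply to that panel only.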


Using the `subplot` and `scatter` commands, create two plots side by side:
* `y` vs. `cement` on the left (attribute 0)
* `y` vs. `water` on the right (attribute 3)

Label the axes and call `plt.tight_layout()` at the end to adjust the spacing between the plots.

In [8]:
# TODO

## Split the Data into Training and Test

Split the data into training and test sets.  Use 30% for test and 70% for training.  You can do the splitting manually or use the `train_test_split` function from `sklearn`.  Store the training data in `Xtr,ytr` and the test data in `Xts,yts`.
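
As a hedged illustration of how `train_test_split` behaves, the sketch below splits a small made-up array. The names `Xtoy`, `ytoy`, `Xa`, `Xb`, `ya`, `yb` and the choice `random_state=0` are assumptions for this example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features each
Xtoy = np.arange(40).reshape(20, 2)
ytoy = np.arange(20)

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
Xa, Xb, ya, yb = train_test_split(Xtoy, ytoy, test_size=0.3, random_state=0)

print(Xa.shape, Xb.shape)   # 14 training rows, 6 test rows
```

The rows are shuffled before splitting, and each row of `Xa` or `Xb` stays paired with its entry in `ya` or `yb`.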


In [None]:
from sklearn.model_selection import train_test_split

# TODO
#  Xtr,Xts,ytr,yts = train_test_split(...)

## Fit a Linear Model

Create a linear regression model object `reg` and fit the model on the training data.


In [None]:
# TODO
#   reg = ...
#   reg.fit(...)

Compute the predicted values `yhat_tr` on the training data and print the `R^2` value on the training data.
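
For reference, `R^2` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on synthetic data and compares it with `sklearn`'s `r2_score`. The data, the true coefficients 3 and -2, and the names `model`, `Xtoy`, `ytoy` are assumptions for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

# Synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
Xtoy = rng.normal(size=(100, 2))
ytoy = 3.0 * Xtoy[:, 0] - 2.0 * Xtoy[:, 1] + rng.normal(scale=0.5, size=100)

model = linear_model.LinearRegression()
model.fit(Xtoy, ytoy)
yhat = model.predict(Xtoy)

rss = np.sum((ytoy - yhat) ** 2)            # residual sum of squares
tss = np.sum((ytoy - np.mean(ytoy)) ** 2)   # total sum of squares
rsq_manual = 1 - rss / tss

rsq_sklearn = r2_score(ytoy, yhat)          # should agree with rsq_manual
print(rsq_manual, rsq_sklearn)
```

The `score` method of a fitted `LinearRegression` object returns the same quantity, so `model.score(Xtoy, ytoy)` is a third way to obtain it.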

In [None]:
# TODO
#    yhat_tr = ...
#    rsq_tr = ...

Now compute the predicted values `yhat_ts` on the test data and print the `R^2` value on the test data.

In [10]:
# TODO
#    yhat_ts = ...
#    rsq_ts = ...

Create a scatter plot of the actual vs. predicted values of `y` on the test data.

In [None]:
# TODO

## Evaluating Different Variables

One way to see the importance of different variables is to compute the *normalized* coefficients:

    coeff_norm[j]  = reg.coef_[j] * std(Xtr[:,j]) / std(ytr) 
    
which represents the change in the target, measured in units of the target's own standard deviation, for a change of one standard deviation in attribute `j`.  Because both sides are scaled this way, the values can be compared across attributes with different units.
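
To illustrate why normalization matters, the sketch below fits a model on synthetic data with two features on very different scales. The data and the names `Xtoy`, `ytoy`, `model`, and `coef_normalized` are assumptions for this example, not part of the exercise dataset.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Synthetic data: feature 0 has a large scale, feature 1 a small one
rng = np.random.default_rng(1)
n = 200
Xtoy = np.column_stack([rng.normal(scale=100.0, size=n),
                        rng.normal(scale=1.0, size=n)])
ytoy = 0.05 * Xtoy[:, 0] + 1.0 * Xtoy[:, 1] + rng.normal(scale=0.5, size=n)

model = linear_model.LinearRegression().fit(Xtoy, ytoy)

# Scale each coefficient by std of its feature, then divide by std of the target
coef_normalized = model.coef_ * np.std(Xtoy, axis=0) / np.std(ytoy)

print('raw coefficients:       ', model.coef_)
print('normalized coefficients:', coef_normalized)

# Stem plot with one stem per feature
plt.stem(np.arange(len(coef_normalized)), coef_normalized)
plt.xlabel('feature index')
plt.ylabel('normalized coefficient')
```

In this toy setup the raw coefficient of feature 1 is larger, yet feature 0 has the larger normalized coefficient because it varies over a much wider range.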

Compute the `coeff_norm` for the 8 attributes and plot the values using a `plt.stem()` plot.

In [None]:
# TODO
#  coeff_norm = ...

Which variable has the highest normalized coefficient, and hence the most influence on the concrete strength?

In [None]:
# TODO