### Lab B: Matrix Regression   Due: 16 March ###

In this lab you will carry out regression fits to data using the matrix approach outlined in Lecture 7. You will use the approach in two ways:
1. Carry out a simple linear regression to fit $(x,y)$ data and analyse the result.
2. Transform the independent or the dependent variable to find a suitable regression model for the data.

The matrix regression equation is:

$Y = X \beta + \epsilon$

where


 $X = \begin{bmatrix}
         1 & x_1 \\
         1 & x_2 \\
         \ldots & \ldots \\
         1 & x_n \\
        \end{bmatrix} \ $  $Y = \begin{bmatrix} y_1 \\ y_2 \\ \ldots \\ y_n \end{bmatrix}\ $ 
$\beta = \begin{bmatrix} \beta_0\\ \beta_1\\ \end{bmatrix}$

To solve for $\beta$, multiply both sides by the transpose of $X$ (the transpose of a matrix is obtained by interchanging its rows and columns). The error term drops out of the least-squares solution ($X^T \epsilon = 0$), leaving:

$X^T Y = X^T X \beta$

$\beta = \left( X^T X \right)^{-1} X^T Y$

The steps required to create the $X$ matrix are outlined in the lecture 7 notes.
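
As a minimal sketch of the equations above, the following builds $X$ and solves for $\beta$ with NumPy. The data values are made up for illustration and are not the lab data; `np.linalg.solve` is used in place of an explicit inverse, which is a numerical choice rather than something the lecture notes prescribe.

```python
import numpy as np

# Made-up illustrative data (NOT the lab data): roughly y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix X: a column of ones (intercept) next to the x values.
X = np.column_stack((np.ones_like(x), x))

# Normal equations: (X^T X) beta = X^T y.
# Solving the linear system gives the same beta as
# inv(X.T @ X) @ X.T @ y, without forming the inverse explicitly.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values from the model y_hat = beta_0 + beta_1 * x.
y_hat = X @ beta

print("beta_0, beta_1 =", beta)
```

The explicit form `np.linalg.inv(X.T @ X) @ X.T @ y` mirrors the formula more literally and should give the same result for a well-conditioned problem such as this one.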

**Problem 1.** The Excel file *IrisData_slr10.xls* contains data for *Iris Setosa* in 3 columns: entry number, sepal width, sepal length. Using matrix regression as shown above, fit a model $y = \beta_0 + \beta_1 x$, where $x$ is the sepal width and $y$ is the sepal length. **Plot the data with the regression line superposed upon it**. Comment on the fit of the model.
 
 To read an Excel file into Python, use the *Pandas* module. 
 * import numpy as np
 * import pandas as pd
 * dataXY = pd.read_excel(excelFileName)
 * dataXY.head()                     # will display the first few lines of *dataXY*.
 * myDataArray = np.array(dataXY)    # will create a **2-D numpy array** that you can use.
 * x = myDataArray[:,1]              # given the column order listed above, the $1^{st}$ column (index 0) is the entry number, the $2^{nd}$ column is $x$ (and the $3^{rd}$ column is $y$).
 
Plot the *residuals* $e_i = y_i - \hat{y}_i$ versus $y$, where $y_i$ is the data value corresponding to $x_i$ and $\hat{y}_i = b_0 + b_1\cdot x_i$ is the value given by the model at the point $x_i$ (here $b_0$ and $b_1$ are the fitted estimates of $\beta_0$ and $\beta_1$). 

The plot of $e_i$ versus $x$ and the plot of $e_i$ versus $y$ should not show any pattern. You only need to make one of the two plots. 
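
The residual calculation can be sketched as follows. The helper names `fit_line`, `residuals` and `plot_residuals` are hypothetical, and the data are made up rather than taken from the Iris file. The plotting helper assumes Matplotlib is installed and imports it only when called, so the numerical part runs on its own.

```python
import numpy as np

def fit_line(x, y):
    """Return (b0, b1) from the matrix regression b = (X^T X)^{-1} X^T y."""
    X = np.column_stack((np.ones_like(x, dtype=float), x))
    return np.linalg.solve(X.T @ X, X.T @ y)

def residuals(x, y, b):
    """Residuals e_i = y_i - (b0 + b1 * x_i)."""
    return y - (b[0] + b[1] * x)

def plot_residuals(x, e):
    """Scatter plot of the residuals against x with a zero reference line."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it
    plt.scatter(x, e)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("x")
    plt.ylabel("residual $e_i$")
    plt.show()

# Made-up illustrative data (NOT the Iris data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8, 12.1])

b = fit_line(x, y)
e = residuals(x, y, b)
print("residuals:", np.round(e, 3))
```

Calling `plot_residuals(x, e)` then draws the residual plot. A useful sanity check is that, for a model with an intercept, the residuals sum to zero and are uncorrelated with $x$.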

#### Regression with Transformations to a Linear Model ####
**Problem 2.** The data file *Boyle_P-V.dat* contains the air pressure measurements (in inches of Hg) made in a variable-volume cylinder by Robert Boyle in 1660. The aim of the exercise is to obtain the correct relation between pressure and volume, using linear regression with *transformed* variables. 
1. Make a scatter plot of pressure versus volume. Does the data show a **linear** association?
2. Using the matrix regression method (outlined above), fit the model $\hat{P} = b_0 + b_1\cdot V$.
3. Compute the residuals $e_i = P_i - \hat{P}_i$ and make a scatter plot of $e_i$ versus $\hat{P}_i$. Is there a pattern? 

When the scatter plot of $y$ versus $x$ does not show a linear association, a common remedy is to transform the $y$ variable according to a *ladder of transformations*:

$-1/y ; -1/\sqrt{y} ; \log_{10}(y) ; \sqrt{y} ; y ; y^2 \ldots$

The transformations can be summarized as $y^p$, with $p = 0$ corresponding to $\log_{10}(y)$. (Note that $p$ does not need to be an integer.)

The residual plot is a better guide than the scatter plot, because it *amplifies* any curvature in the original scatter plot. If your scatter (and residual) plots are:

* Convex-up (cup open upwards), move to a lower power.
* Convex-down (cup open downwards), move to a higher power.

This can be summarized by the *bulge plot*, which suggests transformations for either variable: 

![tranformations.jpg](attachment:tranformations.jpg)
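
One way to step along the ladder of transformations described above is a small helper that maps a power $p$ to the corresponding transformation. The function name `ladder_transform` is hypothetical, and the sketch assumes strictly positive data so that the logarithm and the reciprocals are defined.

```python
import numpy as np

def ladder_transform(y, p):
    """Apply one rung of the ladder to strictly positive data y.

    p = 0 is taken as log10(y). Negative powers carry a minus sign
    (for example p = -1 gives -1/y), matching the ladder in the text,
    so that the ordering of the data is preserved.
    """
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("this sketch assumes strictly positive data")
    if p == 0:
        return np.log10(y)
    return np.sign(p) * y ** p

# Made-up values, chosen only to make the output easy to read.
y = np.array([1.0, 4.0, 100.0])
for p in (-1, -0.5, 0, 0.5, 1, 2):
    print(p, ladder_transform(y, p))
```

Moving to a lower power means choosing a smaller $p$ in this function, and moving to a higher power means choosing a larger one.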



Transform the **dependent variable** ($y$) in this problem to obtain a suitable fit. With a suitable fit, the residual plot shows **no pattern**. 
Show residual plots for *each step* in the transformation. 
Finally, identify (and eliminate) any points that may be causing the fit to deteriorate. 
Fit your final model without the point(s). Does your model conform to expectations?
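
The lab does not prescribe how to identify troublesome points. One possible approach, sketched here as an assumption, is to flag points whose residual is more than two residual standard deviations from zero and then refit without them. The threshold of 2 and the made-up data are illustrative only; inspecting the residual plot by eye is an equally valid route.

```python
import numpy as np

def fit_line(x, y):
    """Return (b0, b1) from the matrix regression b = (X^T X)^{-1} X^T y."""
    X = np.column_stack((np.ones_like(x), x))
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up data on the line y = 1 + 2x, with one corrupted value at index 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = 1.0 + 2.0 * x
y[3] += 5.0

# Fit with all points and compute the residuals.
b_all = fit_line(x, y)
e = y - (b_all[0] + b_all[1] * x)

# Assumed rule of thumb: flag residuals larger than 2 standard deviations.
s = np.std(e, ddof=2)  # ddof=2 because two parameters were fitted
suspect = np.abs(e) > 2.0 * s

# Refit without the flagged point(s).
b_clean = fit_line(x[~suspect], y[~suspect])

print("flagged indices:", np.flatnonzero(suspect))
print("fit with all points:", b_all)
print("fit without flagged points:", b_clean)
```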

**Problem 3.** In 1989, Soviet scientists released data for nuclear weapons tests. Western scientists had previously estimated the size of the explosions using **seismic data**. 

(*Aside: Seismic data is measured on the Richter scale, which is used for earthquakes. The Richter magnitude of an earthquake is determined from the **logarithm** of the amplitude of waves recorded by seismographs. Because of the logarithmic basis of the scale, each whole number increase in magnitude represents a tenfold increase in measured amplitude; in terms of energy, each whole number increase corresponds to an increase of about 31.6 times the amount of energy released, and each increase of 0.2 corresponds to approximately a doubling of the energy released.* (wiki: Richter_magnitude_scale) This may help explain the model you obtain.)
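
The two energy factors quoted in the aside can be checked with a short calculation. It assumes the commonly used relation in which the released energy $E$ scales as $10^{1.5M}$ with magnitude $M$; the exponent 1.5 comes from that standard relation, not from the lab text.

$$\frac{E(M+\Delta M)}{E(M)} = 10^{1.5\,\Delta M}, \qquad 10^{1.5 \times 1} \approx 31.6, \qquad 10^{1.5 \times 0.2} = 10^{0.3} \approx 2.0$$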


The data is available in *sovietNuclearExplosion.dat*; the columns are: *date*, *Western magnitude est.*, *Soviet reported yield (kilotons)*.

Make a scatter plot and a simple linear regression fit for *est. magnitude* ($y$) versus *yield* ($x$). 

Make a plot of the residuals $e_i = y_i - \hat{y}_i$ versus $y_i$.

Do you see any pattern in these curves? 

For this problem, transform the **independent** variable $x$ (according to the ladder of transformations). 

Make a scatter plot of the data in each step. 

Pick a suitable transformation and obtain the model for the nuclear test magnitude as a function of the (transformed) yield.

This is a way to obtain a model for the data using linear regression. 
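
As a hedged sketch of this workflow, the following fits $y = b_0 + b_1\, t(x)$ for several rungs of the ladder applied to $x$. The helper names `transform_x` and `fit_transformed` are hypothetical. The data are synthetic, generated exactly from $y = 2 + 3\sqrt{x}$, and are unrelated to the Soviet test data, so the best power found here says nothing about the answer to Problem 3.

```python
import numpy as np

def transform_x(x, p):
    """x**p for p != 0 (sign flipped for negative p), log10(x) for p == 0."""
    x = np.asarray(x, dtype=float)
    return np.log10(x) if p == 0 else np.sign(p) * x ** p

def fit_transformed(x, y, p):
    """Fit y = b0 + b1 * t(x) by matrix regression; return (b, residuals)."""
    t = transform_x(x, p)
    X = np.column_stack((np.ones_like(t), t))
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return b, y - X @ b

# Synthetic data, exactly y = 2 + 3*sqrt(x); NOT the Soviet test data.
x = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])
y = 2.0 + 3.0 * np.sqrt(x)

# Compare the rungs of the ladder by their sum of squared residuals.
for p in (-1, 0, 0.5, 1, 2):
    b, e = fit_transformed(x, y, p)
    print(f"p = {p:>4}: sum of squared residuals = {np.sum(e**2):.4f}")
```

Because only $x$ is transformed and $y$ is left unchanged, the sums of squared residuals are directly comparable between rungs. The residual plot for each step remains the primary check for any remaining pattern.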