# Training session

When a forester evaluates the vigor of a forest, he often considers
the height of its trees. The taller the trees, the more the forest or plantation
produces. To quantify production in terms of wood volume, we need
the height of each tree, which enters volume formulas such as that of
a truncated cone. However, measuring the height
of a twenty-meter tree is not easy. It is therefore
useful to estimate the height from a simpler measurement: the
circumference at 1.30 meters above the ground.
 
The data consists of $n=1429$ circumference-height pairs,
obtained on a plot of 6-year-old eucalyptus (rotation age before cutting).
These data are in a file called ``eucalyptus.txt``.

The aim is to find the relationship between circumference and height
in order to predict a tree's height from its circumference.

1. Import the data into a DataFrame.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("eucalyptus.txt", sep=" ")
df.head()

2. Draw the scatter plot of height versus circumference.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))

# Height (ht) against circumference (circ), one point per tree
df.plot.scatter("circ", "ht", ax=ax)

ax.set_xlabel("Circumference")
ax.set_ylabel("Height")
ax.set_title("Height versus circumference")

ax.grid()
plt.show()

3. First, we propose to model the relationship between height and circumference using a simple linear regression model.
   Recall the model and associated assumptions.
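
As a reminder, one common way to write the simple linear regression model, with $x_i$ the circumference and $y_i$ the height of tree $i$, is:

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \ldots, n
$$

The usual assumptions are that the errors are centered, $\mathbb{E}[\epsilon_i] = 0$, have a common variance, $\mathrm{Var}(\epsilon_i) = \sigma^2$, and are uncorrelated, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$. Gaussian errors are additionally assumed when tests and confidence intervals are needed.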

4. Compute the estimates of the coefficients of the regression line using the Python library ``statsmodels``.

   Note: create the design matrix $X$ and the response vector $Y$, and don't forget the `intercept` (``sm.OLS`` does not add one automatically).

In [None]:
import statsmodels.api as sm

5. Compute the predicted values and the residuals, then plot the values predicted by the linear model and the observed values on the same graph. Analyse the residuals and comment.

You should get the following plot:

<img title="Result" alt="Alt text" src="Eucalyptus.png">

5. We now seek to model the relationship with a linear model relating the height to the square root of the circumference. Explain the model and implement it.
$$
y_i = \beta_0 + \beta_1 \sqrt{x_i} +\epsilon_i
$$

6. We now consider a multiple linear model involving two explanatory variables: the square root of the circumference and the circumference itself. Explain and implement the model. What about the residuals?

$$
y_i = \beta_0 + \beta_1 \sqrt{x_i} + \beta_2 x_i +\epsilon_i
$$

7. Let's now consider multiple linear regression models involving integer powers of the circumference, up to a chosen degree $p$. Write the associated models and implement them.
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2+\ldots+\beta_p x_i^p +\epsilon_i
$$


11. For each of the models proposed above, plot the values predicted by the model together with the observed values. Comment.

13. Compare the models using the BIC criterion.