<center>
    <img src="https://upload.wikimedia.org/wikipedia/commons/6/6f/Dauphine_logo_2019_-_Bleu.png" style="width: 600px;"/> 
</center>  

<div align="center"><span style="font-family:Arial Black;font-size:33px;color:darkblue"> Master Economie Finance </span></div>

<div align="center"><span style="font-family:Arial Black;font-size:27px;color:darkblue">Application Lab – Portfolio Management</span></div>

<div align="center"><span style="font-family:Arial Black;font-size:20px;color:darkblue">Time series data modelling
</span></div>

# Linear regression with Python

## Data import for the examples

We use the Boston real estate price data.

The first 401 observations are used to estimate a linear regression. The remaining observations are used to compute forecasts and to measure their accuracy with several criteria.

We will estimate two regressions:

* regression of 'MEDV' (denoted y) on 'RM' (denoted x)
* regression of 'MEDV' on all the other variables (denoted X)

In [1]:
!pip install pandas > /dev/null 2>&1
!pip install numpy > /dev/null 2>&1
!pip install matplotlib > /dev/null 2>&1
!pip install seaborn > /dev/null 2>&1
!pip install scikit-learn > /dev/null 2>&1
!pip install statsmodels > /dev/null 2>&1
!pip install scipy > /dev/null 2>&1

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

In [3]:
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

data = pd.read_csv('housing.csv', header=None, delimiter=r"\s+", names=column_names)
# Separate the data into two sub-samples
# sample 1: up to index 400 inclusive (401 observations)
# sample 2: from index 401 onwards
y1 = data.loc[:400,'MEDV']  # equivalent to data.iloc[:401,-1]
y2 = data.loc[401:,'MEDV']  # equivalent to data.iloc[401:,-1]
x1 = data.loc[:400,'RM']  
x2 = data.loc[401:,'RM'] 
X1 = data.loc[:400,data.columns!='MEDV'] 
X2 = data.loc[401:,data.columns!='MEDV'] 
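The split relies on the fact that label-based slicing with `.loc` includes the end label, whereas position-based slicing with `.iloc` excludes the end position. A minimal sketch on a toy DataFrame (the frame `df` and its values are made up for illustration):

```python
import pandas as pd

# Toy frame with a default RangeIndex 0..9 (illustrative values only)
df = pd.DataFrame({"MEDV": range(10)})

head_loc = df.loc[:3, "MEDV"]    # label-based: rows 0, 1, 2, 3 (end label included)
head_iloc = df.iloc[:4, 0]       # position-based: rows 0..3 (end position excluded)
tail_loc = df.loc[4:, "MEDV"]    # everything from label 4 onwards

print(len(head_loc), len(head_iloc), len(tail_loc))
```

With a default integer index, `.loc[:400]` therefore selects 401 rows, and `.loc[401:]` selects the rest without overlap.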

## Simple linear regression with scikit-learn

Estimating a simple linear regression with **scikit-learn** involves 5 steps:

1. Import the necessary packages and classes
2. Import the data and apply the appropriate transformations
3. Create and estimate a regression model
4. Check the estimation results
5. Forecast with the estimated model

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [4]:
# First, we convert the series y and x into NumPy arrays
x=np.array(x1)
y=np.array(y1)

print(f"x.shape {x.shape}, y.shape {y.shape}")

# x must be 2-dimensional: scikit-learn's LinearRegression expects its input X
# to have shape (n_samples, n_features)
x=x[:,np.newaxis]
print(f"x.shape {x.shape}")

x.shape (401,), y.shape (401,)
x.shape (401, 1)
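The reshaping step can be written in several equivalent ways. A minimal sketch on a small made-up array (the name `x_demo` and its values are illustrative, not taken from the housing data):

```python
import numpy as np

# Illustrative 1-D array standing in for a single regressor
x_demo = np.array([6.5, 6.4, 7.1, 6.9])

# Three equivalent ways to obtain shape (n_samples, 1)
a = x_demo[:, np.newaxis]
b = x_demo.reshape(-1, 1)
c = x_demo[:, None]          # np.newaxis is an alias for None

print(a.shape, b.shape, c.shape)
```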


### Creating a model and estimating with existing data

In [5]:
model = LinearRegression()

In [6]:
model

In [7]:
model.fit(x,y) # equivalent to model = LinearRegression().fit(x,y)

### Displaying the results of a regression

#### Estimated coefficients

In [8]:
print(model.intercept_, model.coef_)  # note the trailing underscore on fitted attributes

-35.45813178943158 [9.43054496]
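With a single regressor, the two estimates have a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept is the mean of y minus the slope times the mean of x. The sketch below checks this against `LinearRegression` on simulated data (the data and the names `x_sim`, `y_sim` are assumptions made for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: one regressor, linear relation plus noise
rng = np.random.default_rng(0)
x_sim = rng.normal(6.0, 0.7, size=200)
y_sim = -30.0 + 9.0 * x_sim + rng.normal(0.0, 3.0, size=200)

# Closed-form OLS estimates
slope = np.cov(x_sim, y_sim, ddof=1)[0, 1] / np.var(x_sim, ddof=1)
intercept = y_sim.mean() - slope * x_sim.mean()

# Same estimates from scikit-learn
fitted = LinearRegression().fit(x_sim[:, np.newaxis], y_sim)

print(intercept, slope)
print(fitted.intercept_, fitted.coef_[0])
```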


#### $R^{2}$

In [9]:
r_sq = model.score(x,y)
print(f"coefficient of determination: {r_sq}")

coefficient of determination: 0.5642745121062778
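For a regressor, `score` returns $R^{2} = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of y from its mean. A small sketch on simulated data (all names and values are illustrative assumptions) that recomputes it by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data with one regressor
rng = np.random.default_rng(1)
x_sim = rng.normal(size=(150, 1))
y_sim = 2.0 + 1.5 * x_sim[:, 0] + rng.normal(size=150)

fitted = LinearRegression().fit(x_sim, y_sim)
resid = y_sim - fitted.predict(x_sim)

ss_res = np.sum(resid ** 2)                    # unexplained variation
ss_tot = np.sum((y_sim - y_sim.mean()) ** 2)   # total variation
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, fitted.score(x_sim, y_sim))
```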


### Forecast: computing predicted values

In [10]:
x_pred =np.array(x2)
x_pred =x_pred[:,np.newaxis]

y_pred = model.predict(x_pred) # alternative: model.intercept_ + model.coef_*x_pred (same values, but with shape (n, 1) instead of (n,))
print(f"predicted variable:\n{y_pred}")

predicted variable:
[24.35981486 24.9350781  14.98585318 16.70221236 18.13565519  3.56546323
 17.42836432 17.51323922 29.15996224 18.83351552 27.32100598  8.18643026
 13.15632745  7.15850086 25.21799445 28.4998241  14.56147865 20.71962451
 28.89590698 25.00109192 21.18172121 17.80558612 22.09648407 17.02285089
 20.14436127 19.58795911 23.03010802 22.94523312 24.70874502 24.40696759
 28.98078189 25.13311955 25.23685554 23.08669129 27.05695072 25.47261917
 22.55858077 20.51215252 17.60754467 19.40877876 24.95393919 23.19042729
 25.69895224 19.74827838 25.45375808 24.34095377 23.49220472 22.86978876
 25.05767519 28.18861611 27.30214489 23.92600979 34.26188706 27.99057467
 26.07617404 20.89880486 20.52158306 23.96373197 21.88901208 27.73594995
 24.67102284 24.11462069 25.9630075  23.09612184 18.85237661 20.67247178
 21.15342958 20.42727761 18.41857154 22.70003895 23.28473274 25.24628609
 30.367072   15.72143568 22.65288622 25.6895217  14.56147865 22.86978876
 23.28473274 23.40732982 28.198
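The manual formula gives the same numbers as `predict`, but the shape of the result depends on how the product is written. A short sketch on simulated data (the names `m`, `x_tr`, `x_new` are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated training data and a few new points to forecast
rng = np.random.default_rng(4)
x_tr = rng.normal(size=(50, 1))
y_tr = 1.0 + 2.0 * x_tr[:, 0] + rng.normal(scale=0.5, size=50)
x_new = rng.normal(size=(5, 1))

m = LinearRegression().fit(x_tr, y_tr)

p1 = m.predict(x_new)                   # shape (5,)
p2 = m.intercept_ + m.coef_ * x_new     # broadcasting keeps the column: shape (5, 1)
p3 = m.intercept_ + x_new @ m.coef_     # matrix product: shape (5,)

print(p1.shape, p2.shape, p3.shape)
```

The matrix-product form also carries over unchanged to the case with several regressors.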

### Calculation of forecast accuracy indicators

MSE:
- Measures the average squared difference between actual (y2) and predicted (y_pred) values.
- Penalizes larger errors more because of the squaring.
- Useful when large errors are particularly undesirable.

MAE: 
- Measures the average absolute difference (without squaring) between actual and predicted values.
- More robust to outliers than MSE.
- Useful when you want a direct, interpretable measure of average error in the same units as the data.

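MAPE:
- Measures the average absolute difference between actual and predicted values, relative to the actual values.
- Scale-independent, so it can be compared across series with different units. Note that scikit-learn returns it as a fraction (0.61 means 61%), not as a percentage.
- Unstable when actual values are close to zero.

Rule of Thumb:
- Outliers present: MAE
- Penalize large errors: MSE
- Want scale-independent insights: MAPE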
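The three criteria are simple averages of the forecast errors. A worked sketch on four made-up observations (the arrays `y_true` and `y_hat` are illustrative), compared with the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

# Made-up actual values and forecasts
y_true = np.array([20.0, 25.0, 10.0, 50.0])
y_hat = np.array([22.0, 24.0, 15.0, 40.0])

err = y_true - y_hat                    # [-2, 1, -5, 10]
mse = np.mean(err ** 2)                 # (4 + 1 + 25 + 100) / 4 = 32.5
mae = np.mean(np.abs(err))              # (2 + 1 + 5 + 10) / 4 = 4.5
mape = np.mean(np.abs(err / y_true))    # (0.10 + 0.04 + 0.50 + 0.20) / 4 = 0.21

print(mse, mean_squared_error(y_true, y_hat))
print(mae, mean_absolute_error(y_true, y_hat))
print(mape, mean_absolute_percentage_error(y_true, y_hat))
```

The single error of 10 accounts for most of the MSE but only about half of the MAE, which illustrates why MSE is the more sensitive of the two to large errors.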


In [11]:
print("Mean squared error: %.2f" %mean_squared_error(y2,y_pred))
print("Mean absolute error: %.2f" %mean_absolute_error(y2,y_pred))
print("Mean absolute percentage error: %.2f" %mean_absolute_percentage_error(y2,y_pred))

Mean squared error: 77.66
Mean absolute error: 7.27
Mean absolute percentage error: 0.61


## Multiple linear regression with scikit-learn

In [12]:
X = np.array(X1)
print(f"X.shape {X.shape}")
model_all = LinearRegression().fit(X,y)

print(model_all.intercept_)
print(model_all.coef_)

r_sq = model_all.score(X,y)
print(r_sq)

x_pred = np.array(X2)
y_all_pred = model_all.predict(x_pred)

print(y_all_pred)

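# Evaluate the multiple-regression forecasts (y_all_pred), not the simple-regression ones (y_pred)
print("Mean squared error: %.2f" %mean_squared_error(y2,y_all_pred))
print("Mean absolute error: %.2f" %mean_absolute_error(y2,y_all_pred))
print("Mean absolute percentage error: %.2f" %mean_absolute_percentage_error(y2,y_all_pred))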


29.00580915942632
[-1.99873534e-01  4.42763141e-02  5.55407781e-02  1.73791950e+00
 -1.49715808e+01  4.85503348e+00  2.90568657e-03 -1.29621067e+00
  4.83051120e-01 -1.54201512e-02 -8.08704936e-01 -1.49681762e-03
 -5.23714094e-01]
0.7352124274349188
[19.58764209 20.76354217 12.75957269  6.50814359  4.47071337  7.06776448
 21.74565968 15.95336962 24.20492324 16.95074641 22.71715509  5.00314396
 12.75920033 -3.93805933 15.11334105 19.96083573  9.38022995  6.18728639
 20.93395476 22.69635683 21.1505423  20.72383184 19.452712   20.05422324
 15.3052598  21.47638814 17.59772885 19.73056663 18.96613662 23.60388292
 24.55210924 27.07912512 23.06445028 20.9832873  19.042609   20.58685111
 14.69751333  9.97599707 14.70273344 13.44309661 19.88244938 21.39182263
 20.68780896 14.78663845 18.22094972 21.25012539 20.52500832 19.75392824
 20.85956883 23.87099178 22.96188233 21.45478615 26.34835019 22.31633319
 22.5760245  19.34334812 19.25826873 21.09442051 20.91811464 23.80887488
 23.21251458 22.4389

## Linear regressions with statsmodels

In [13]:
# statsmodels does not add a constant by default, so it must be added explicitly
X1=sm.add_constant(X1)
reg_all=sm.OLS(y1,X1).fit()
reg_all.summary()

# Display the estimated coefficients
reg_all.params

# Display the t-statistics of the estimated coefficients
reg_all.tvalues

# Display the p-value of the F-statistic
reg_all.f_pvalue

# To test whether a correlation is significantly different from 0, we can use
# scipy's pearsonr function, which returns the coefficient and its p-value
from scipy import stats
stats.pearsonr(data['AGE'].to_numpy(),data['DIS'].to_numpy())

# Display R² and adjusted R²
reg_all.rsquared, reg_all.rsquared_adj

# Display the residuals
reg_all.resid

dir(reg_all) # list of all attributes and methods available on the results object

['HC0_se',
 'HC1_se',
 'HC2_se',
 'HC3_se',
 '_HCCM',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abat_diagonal',
 '_cache',
 '_data_attr',
 '_data_in_cache',
 '_get_robustcov_results',
 '_get_wald_nonlinear',
 '_is_nested',
 '_transform_predict_exog',
 '_use_t',
 '_wexog_singular_values',
 'aic',
 'bic',
 'bse',
 'centered_tss',
 'compare_f_test',
 'compare_lm_test',
 'compare_lr_test',
 'condition_number',
 'conf_int',
 'conf_int_el',
 'cov_HC0',
 'cov_HC1',
 'cov_HC2',
 'cov_HC3',
 'cov_kwds',
 'cov_params',
 'cov_type',
 'df_model',
 'df_resid',
 'diagn',
 'eigenvals',
 'el_test',
 'ess',
 'f_pvalue',
 'f_test',
 'fittedvalues',
 'fvalue',
 '

# Exercise

###  Stability of $ \beta_i $

We consider the following decomposition of the return of an asset $ i $:

$$
r_{it} = \alpha_i + \beta_i r_{mt} + e_{it}
$$

where $ r_{it} $ is the return on asset $ i $ and $ r_{mt} $ is the return on the stock market index. The term $ \beta_i r_{mt} $ captures the variation common to all assets, while the residual $ e_{it} $ represents the risk specific to asset $ i $.

The value of $ \beta_i $ determines the risk of asset $ i $ relative to the market portfolio:

- $ \beta_i = 1 $: asset $ i $ is as risky as the market portfolio.
- $ \beta_i < 1 $: asset $ i $ is less risky than the market portfolio.
- $ \beta_i > 1 $: asset $ i $ is more risky than the market portfolio.
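The decomposition can be illustrated numerically: with OLS estimates, the sample variance of the asset return splits exactly into a systematic part $ \hat{\beta}_i^2 \, \mathrm{Var}(r_m) $ and a specific part $ \mathrm{Var}(\hat{e}_i) $. A sketch on simulated returns (the true beta of 1.3 and all volatilities are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
r_m = rng.normal(0.0, 0.02, size=n)    # simulated market returns
e_i = rng.normal(0.0, 0.01, size=n)    # simulated asset-specific shocks
r_i = 0.001 + 1.3 * r_m + e_i          # asset with an assumed true beta of 1.3

# OLS estimates of the market model
beta_hat = np.cov(r_i, r_m, ddof=1)[0, 1] / np.var(r_m, ddof=1)
alpha_hat = r_i.mean() - beta_hat * r_m.mean()
resid = r_i - alpha_hat - beta_hat * r_m

# In-sample variance split: total = systematic + specific
systematic = beta_hat ** 2 * np.var(r_m, ddof=1)
specific = np.var(resid, ddof=1)

print(beta_hat)
print(np.var(r_i, ddof=1), systematic + specific)
```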


---
1. **Import Excercise.csv**

- Import the data into a DataFrame called `data`.
- Set the `Dates` column as the index (this should become automatic whenever you handle a DataFrame that represents time series).
- Create a new DataFrame based on `data` that stores the return time series of each column.
- Drop the NaN values.

2. **Select 10 securities at random.** For each of the $ n = 10 $ securities, estimate by **OLS** the regression $ r_{it} = \alpha_i + \beta_i r_{mt} + e_{it} $ on the first 40 observations, $ t = 1, \dots, 40 $. The estimated model is written

    $$
    r_{it} = \hat{\alpha}_i + \hat{\beta}_i r_{mt} + \hat{e}_{it}, \quad t = 1, \dots, 40
    $$

    We denote by $ \beta_{i,40} $ the $ \beta $ of asset $ i $ obtained from the first 40 observations.

    The market return is given by the EUROSTOXX50 return.

    Hint: to select securities at random, see the numpy.random.choice() function.

3. **Estimate the same regression** for the equally weighted portfolio of the $ n = 10 $ securities, $ r_{pt} = \frac{1}{n} \sum_{i=1}^{n} r_{it} $:

    $$
    r_{pt} = \alpha_p + \beta_p r_{mt} + e_{pt}, \quad t = 1, \dots, 40
    $$

    Because the portfolio return is a linear combination of the asset returns, it can be shown that $ \alpha_p = \frac{1}{n} \sum_{i=1}^{n} \alpha_i $ and $ \beta_p = \frac{1}{n} \sum_{i=1}^{n} \beta_i $.

4. **Re-estimate the previous regressions for each security and the portfolio**, adding the subsequent observations one at a time to the first 40. This yields, for each asset, a sequence $ \beta_{i,40}, \beta_{i,41}, \dots, \beta_{i,T} $, where $ T $ is the end date of the sample. For each sequence, calculate:

    - the mean,
    - the minimum, the maximum,
    - the standard deviation

Plot the beta time series for five different stocks, the portfolio and the EUROSTOXX50.