### Student names and ID:
### Group ID: 

# Linear Regression using the life expectancy dataset. 

Linear regression is a very common technique to link a set of features of real valued variables $\mathbf{x}=(x_1, \ldots, x_d)$ to a real value outcome $y$. The hypothesis with the linear regression model is that the outcome variable is a linear combination of the features, to with a gaussian noise is added:

$$
\begin{align}
Y  & = w_0 + \sum_{i=1}^{d} w_i \cdot x_i  + \varepsilon, & \varepsilon &\sim \mathcal{N}(0, \sigma^2)\\
Y & = \mathbf{\tilde{X}}^{T} \cdot \mathbf{W}  & \mathbf{\tilde{X}}^T &= (1, X_1, X_2, \ldots, X_d)
\end{align}
$$

The goal of this project is to explore the different aspects of linear regression on a traditional multivariate dataset from the Global Health Observatory (GPO) from the World Health Organization (WHO) which keeps track of the health status as well as many other related factors for all countries. The data-set relates life expectancy, to different health factors for 193 countries and as been collected from the same WHO data repository website and its corresponding economic data was collected from the United Nation website. The different datasets were merged and made available on the [Kaggle platform](https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who) which is what we use now. 

After a little of data cleaning and exploiration, we will estimate parameters of the model and evaluate its predictive power. We will use linear regression and regularized (ridge) regression.


The project is divided into the following tasks:
  1. Data loading, cleaning.
  2. Short dive into data exploration.
  3. Implementation of linear regression, relationship with correlation, and interpretation of the coefficients.
  4. Regularization using ridge regression and the effect of the regularization parameter on the fit.
  5. Comparison of the different models by assessing the quality of their prediction.
  6. (optional) Reproduction of the analysis within the `sklearn` framework and extension of the results.

## Task 1: Data loading and cleaning (10 pts.)

### Data loading
We start by reading in the data table, have a look at the first rows, and clean up possible mistyping in column names.

In [9]:
import pandas as pd 
import numpy as np  


In [10]:

# Use pandas.read_csv() method
data = pd.read_csv('data.csv')

## 2938 observations, 22 features
print("Size of the table: ", data.shape)
# display first 20 rows
data.head(20)

Size of the table:  (2938, 22)


Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
5,Afghanistan,2010,Developing,58.8,279.0,74,0.01,79.679367,66.0,1989,...,66.0,9.2,66.0,0.1,553.32894,2883167.0,18.4,18.4,0.448,9.2
6,Afghanistan,2009,Developing,58.6,281.0,77,0.01,56.762217,63.0,2861,...,63.0,9.42,63.0,0.1,445.893298,284331.0,18.6,18.7,0.434,8.9
7,Afghanistan,2008,Developing,58.1,287.0,80,0.03,25.873925,64.0,1599,...,64.0,8.33,64.0,0.1,373.361116,2729431.0,18.8,18.9,0.433,8.7
8,Afghanistan,2007,Developing,57.5,295.0,82,0.02,10.910156,63.0,1141,...,63.0,6.73,63.0,0.1,369.835796,26616792.0,19.0,19.1,0.415,8.4
9,Afghanistan,2006,Developing,57.3,295.0,84,0.03,17.171518,64.0,1990,...,58.0,7.43,58.0,0.1,272.56377,2589345.0,19.2,19.3,0.405,8.1


Some columns are hidden due to their number. We can print all of the columns with ```pandas.DataFrame.columns```:

In [11]:
data.columns

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

After taking a good look at the column labels, you may notice that some trailing whitespace are remaining on the column names (_e. g._ `'Measles '`). <br>
Use the  [```.strip()``` method](https://docs.python.org/3.4/library/stdtypes.html#str.strip) to clean up column names

In [12]:
print(data.columns)

##You code here
data.columns = data.columns.str.strip()

print(f'\n Columns stripped: {data.columns}')

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

 Columns stripped: Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')


### Data Cleaning

We will perform an additional cleaning step on the dataset by handling missing values. 
Proper preprocessing of the columns ensure that we will be able to make the most of the data (even if it comes at the cost of altering it a little). We will identify columns with missing data, and apply a simple technique of filling the real missing values with the mean of the same countries (e.g., mean imputation, forward fill, or regression-based imputation). 

We are using a simple method, but multiple other techniques exist and  can be applied (check the method [`pandas.DataFrame.fillna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)):
  - mean imputation, we fill the missing values with the means of the corresponding columns. We will use this method after a grouping by country.
  - forward fill, we fill each missing value based on the last valid observation. This is a general technique that works with any data type. 
  - regression based imputation is directly related to the theme of this project, so we will avoid it at this step.

Here, we will only work with columns containing real values and we will use the mean of each country to fill the missing value.

Are there any countries for which this technique is not applicable? If yes, figure out how you could deal with the remaining missing values. #

#### Data Cleaning Answer
This mehtod does not work if all values for a country in a column are na, because then we can not compute a mean. In this case we impute the global mean. This is probably quite inacurate, because for example countries that do not record their GDP probably have a smaller GDP. With domain knowlage for the different columns, we might be able to impute values more accuratly, for example by imputing the values for GDP with the lower 90th percentile of the global GDP data. 

In [13]:
# Identify columns with missing values and report their count

data.isna().sum()[data.isna().sum() > 0]

#print(missing_data)

Life expectancy                     10
Adult Mortality                     10
Alcohol                            194
Hepatitis B                        553
BMI                                 34
Polio                               19
Total expenditure                  226
Diphtheria                          19
GDP                                448
Population                         652
thinness  1-19 years                34
thinness 5-9 years                  34
Income composition of resources    167
Schooling                          163
dtype: int64

In [14]:
# Fill missing values with mean for numeric columns
# Compute a mean for each country to fill the data

#
data_filled = data.copy()

numeric_columns = [col for col in data.columns if data.dtypes[col] in [int, float]]
numeric_col_and_country = numeric_columns.copy()
numeric_col_and_country.append("Country")
d = data[numeric_col_and_country]
means = d.groupby(["Country"]).mean()

for col in numeric_columns:
    data_filled[col] = data_filled.groupby("Country")[col].transform(lambda x: x.fillna(x.mean()))

# for columns that have no values for a specific country, we fill with global mean
for col in numeric_columns:
    mean = data[col].mean()
    data_filled[col] = data_filled[col].fillna(value=mean)

data = data_filled

In [15]:
normalized_data = data.copy()
for col in [col for col in numeric_columns if col != "Year"]:
    col_mean = data[col].mean()
    col_std = data[col].std()
    normalized_data[col] = (data[col] - col_mean) / col_std

data = normalized_data

## Task 2: Data visualization and exploration (15 pts.)

It is also important to visualize first the basic properties of the data by simply looking at it. (It is crucial for accurate modeling and analysis).
Here, we will simply produce boxplots as one dimensional summary plots of a few variables, and then assess some of the interaction between the variables with a pair plot. Exploratory data analysis of such a data set could be possible but would require a complete separate project. 

### Boxplots as a visualization tool

Use boxplots to summarize the distributions for the variables `Life Expectancy`, `BMI`, and `GDP`. 

You should observe a few number of outliers for `Life Expectancy` and `GDP`. Have a look at the properties of the outliers and characterize them rapidly.

In [16]:
# Boxplot to detect outliers
import plotly.express as px

fig = px.box(data_filled, x="BMI")
fig.show()

### Expected boxplot

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [17]:
import plotly.express as px

for i in ['GDP', 'BMI', 'Life expectancy']:
    fig = px.box(data_filled, x=i)
    fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [18]:
data_filled[data_filled['Country'] == 'Sierra Leone']

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
2297,Sierra Leone,2015,Developing,51.0,413.0,22,3.154667,0.0,86.0,607,...,86.0,9.218,86.0,0.5,587.538233,723725.0,7.4,7.3,0.431,9.5
2298,Sierra Leone,2014,Developing,48.1,463.0,23,0.01,1.443286,83.0,1006,...,83.0,11.9,83.0,0.6,78.439476,779162.0,7.5,7.4,0.426,9.5
2299,Sierra Leone,2013,Developing,54.0,47.0,23,0.01,1.321464,92.0,15,...,92.0,11.59,92.0,0.8,71.8187,692279.0,7.7,7.6,0.413,9.3
2300,Sierra Leone,2012,Developing,49.7,411.0,25,0.01,54.560337,91.0,678,...,91.0,11.24,91.0,0.9,561.898424,676613.0,7.9,7.8,0.401,9.1
2301,Sierra Leone,2011,Developing,48.9,418.0,26,3.78,54.665917,89.0,1865,...,88.0,11.98,89.0,1.3,445.525,6611692.0,8.1,8.0,0.392,8.9
2302,Sierra Leone,2010,Developing,48.1,424.0,27,3.84,5.347718,86.0,1089,...,84.0,1.32,86.0,1.6,45.128418,645872.0,8.3,8.2,0.384,8.7
2303,Sierra Leone,2009,Developing,47.1,433.0,28,3.97,49.837127,84.0,31,...,81.0,13.13,84.0,1.7,394.593244,63126.0,8.5,8.4,0.375,8.5
2304,Sierra Leone,2008,Developing,46.2,441.0,29,3.91,5.379606,77.0,44,...,75.0,1.29,77.0,1.9,46.375918,6165372.0,8.7,8.7,0.367,8.3
2305,Sierra Leone,2007,Developing,45.3,45.0,29,3.86,45.571089,63.0,0,...,63.0,1.12,64.0,2.2,358.827472,615417.0,8.9,8.9,0.357,8.2
2306,Sierra Leone,2006,Developing,44.3,464.0,30,3.8,38.000758,83.444444,33,...,65.0,1.68,64.0,2.2,322.313468,5848692.0,9.1,9.1,0.348,8.0


In [19]:
gdp_upper_fence = data_filled['GDP'].quantile(0.75) + 1.5 * (data_filled['GDP'].quantile(0.75) - data_filled['GDP'].quantile(0.25))
gdp_outliers = data_filled[data_filled['GDP'] > gdp_upper_fence]
gdp_outliers

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
112,Australia,2015,Developed,82.8,59.0,1,10.155333,0.000000,93.0,74,...,93.0,8.836667,93.0,0.1,56554.38760,2.378934e+07,0.6,0.6,0.937,20.4
113,Australia,2014,Developed,82.7,6.0,1,9.710000,10769.363050,91.0,340,...,92.0,9.420000,92.0,0.1,62214.69120,2.346694e+06,0.6,0.6,0.936,20.4
114,Australia,2013,Developed,82.5,61.0,1,9.870000,11734.853810,91.0,158,...,91.0,9.360000,91.0,0.1,67792.33860,2.311735e+07,0.6,0.6,0.933,20.3
115,Australia,2012,Developed,82.3,61.0,1,10.030000,11714.998580,91.0,199,...,92.0,9.360000,92.0,0.1,67677.63477,2.272825e+07,0.6,0.6,0.930,20.1
116,Australia,2011,Developed,82.0,63.0,1,10.300000,10986.265270,92.0,190,...,92.0,9.200000,92.0,0.1,62245.12900,2.234240e+05,0.6,0.6,0.927,19.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2754,United Arab Emirates,2007,Developing,75.6,87.0,1,1.690000,3759.457226,92.0,0,...,94.0,2.570000,92.0,0.1,42672.61323,1.275338e+07,5.1,4.9,0.826,12.9
2755,United Arab Emirates,2006,Developing,75.4,89.0,1,1.740000,3749.941617,92.0,0,...,94.0,2.330000,92.0,0.1,42372.22166,1.275338e+07,5.1,4.9,0.823,12.8
2756,United Arab Emirates,2005,Developing,75.3,92.0,1,1.790000,3427.320332,92.0,29,...,94.0,2.320000,94.0,0.1,39439.81970,1.275338e+07,5.1,4.9,0.818,12.6
2757,United Arab Emirates,2004,Developing,75.1,95.0,1,1.770000,2972.448675,92.0,22,...,94.0,2.460000,94.0,0.1,36161.17610,1.275338e+07,5.2,4.9,0.813,12.4


In [20]:
gdp_outliers['Year'].mean()

np.float64(2007.6733333333334)

In [21]:
gdp_outliers[gdp_outliers['Status'] == 'Developed'].__len__() / gdp_outliers.__len__()

0.6366666666666667

In [22]:
life_lower_fence = data_filled['Life expectancy'].quantile(0.25) - 1.5 * (data_filled['Life expectancy'].quantile(0.75) - data_filled['Life expectancy'].quantile(0.25))
life_outliers = data_filled[data_filled['Life expectancy'] < life_lower_fence]
life_outliers

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
1127,Haiti,2010,Developing,36.3,682.0,23,5.76,36.292918,40.666667,0,...,66.0,8.9,66.0,1.9,662.279518,9999617.0,4.0,4.0,0.47,8.6
1484,Lesotho,2005,Developing,44.5,675.0,5,2.67,57.903698,87.0,0,...,88.0,6.3,89.0,34.8,862.946312,1949543.0,9.3,9.2,0.437,10.7
1582,Malawi,2003,Developing,44.6,613.0,43,1.08,4.375316,84.0,167,...,85.0,6.35,84.0,24.2,26.152517,12336687.0,7.6,7.5,0.362,10.3
1583,Malawi,2002,Developing,44.0,67.0,46,1.1,3.885395,64.0,92,...,79.0,4.82,64.0,24.7,29.979898,1213711.0,7.7,7.6,0.388,10.4
1584,Malawi,2001,Developing,43.5,599.0,48,1.15,12.797606,89.571429,150,...,86.0,5.7,9.0,25.1,146.76154,11695863.0,7.9,7.7,0.387,10.1
1585,Malawi,2000,Developing,43.1,588.0,51,1.18,13.762702,89.571429,304,...,73.0,6.7,75.0,25.5,153.259487,11376172.0,8.0,7.9,0.391,10.7
2306,Sierra Leone,2006,Developing,44.3,464.0,30,3.8,38.000758,83.444444,33,...,65.0,1.68,64.0,2.2,322.313468,5848692.0,9.1,9.1,0.348,8.0
2307,Sierra Leone,2005,Developing,43.3,48.0,30,3.83,42.088929,83.444444,29,...,67.0,12.25,65.0,2.2,287.689194,5658379.0,9.3,9.3,0.341,7.8
2308,Sierra Leone,2004,Developing,42.3,496.0,30,3.99,38.524548,83.444444,7,...,69.0,11.66,65.0,2.1,263.145817,5439695.0,9.5,9.5,0.332,7.6
2309,Sierra Leone,2003,Developing,41.5,57.0,30,4.07,38.614732,83.444444,586,...,66.0,11.69,73.0,1.9,263.761831,5199549.0,9.7,9.8,0.322,7.4


In [23]:
life_outliers['Year'].mean()

np.float64(2003.1176470588234)

TOCHECK
The GDP has a lot of outliers on the upper end. Suggesting, that a few countries have a very high GDP compared to the rest of the world. These outliers are mostly developed countries, although the split is surprisingly equal with 64% being developed. The outliers are spread quite evenly over the years 2000 to 2015.
The life expectancy has a few outliers on the lower end. These outliers are entirely developing countries. However, they are concentrated in the earlier years, with an average year of 2003. Life expectancy in these countries has slightly increased over the years, lifting them just outside of the range considered an outlier. 

### line plots and Pair plots

- First check how the life expectancy evolved for a few countries with a line plot.
- Produce a pair plot of the variables `Life expectancy`, `BMI`, `GDP`, `Alcohol`, and `Schooling`. You can consider using the `Status` of the country or the `Year` as a way to color the plot. (plotly.express function `scatter_matrix`).

- What do you observe at a first visualization on the data? For instance you can comment on:
  - The relationship between `Life expectancy` and `Status`, 
  - The relationship between `Life expectancy` and `Schooling`,
  - The distribution of `BMI`.


In [24]:
import plotly.express as px
countries = data.Country.unique()
### Example of countries but you can select an other set
cchoice = ['Cuba','Germany', 'Afghanistan', 'Japan', 'United States of America' ,
           'France', 'Portugal', 'Nigeria', 'Algeria', 'Republic of Korea' ]

df = data_filled.query(f"Country in {cchoice}")
fig = px.line(df, x="Year", y="Life expectancy", color='Country')
fig.show()


ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [25]:
variables = ['Life expectancy', 'BMI', 'GDP', 'Alcohol', 'Schooling', 'Status', 'Year']
import seaborn as sns 
## pairplot code here
sns.pairplot(data_filled, vars=["Life expectancy", "BMI", "GDP", "Alcohol", "Schooling", "Year"],hue="Status", diag_kind="hist")


ModuleNotFoundError: No module named 'seaborn'

In [123]:
data_filled[data_filled['Status'] == 'Developed']['Life expectancy'].min()

np.float64(69.9)

- What do you observe at a first visualization on the data? For instance you can comment on:
  - The relationship between `Life expectancy` and `Status`, 
  - The relationship between `Life expectancy` and `Schooling`,
  - The distribution of `BMI`.

TOCHECK Comment on relationship of life expectancy and status, schooling and BMI

Developed countries unsurprisingly have a higher life expectancy. Eaven though there is some overlap (a lot of developing countries with high life expectancy) there are no developed counties with a lif expectancy under 69. Where as there are a the distribution has a verry heavy tale in this direction for developing countries.

Schooling has the most clear positive effect on life expectancy. 

In the distribution of BMI, there is a notable gap, with almost no countries falling into the range between 8 and 12 (probably a scaled value). This food gap might be the distinction between countries suffering from hunger and thoughs that generally have food securety. The fact that there are developed countries in the lower part of this devide would dispute this theory, however after the year 2008 there are no more developed countries in this area. This gap also notably becomes larger over time.


## Task 3: Correlation and Its Interpretation (10 pts.)

### Description
This task focuses on working first the computation of the variable correlations and understanding how it relates to regression coefficients. The standard correlation measure can identify key relationships in the data but must be interpreted with care. 

### Question:
   - Compute the correlation between `Life Expectancy` and the following variables. Is positive or negative?
     - `Alcohol` consumption.
     - `Schooling`.
     - `BMI`.
   - Do you find some of the correlation values to be unusual? Again, split the data by some characteristic, such as the variable `Status` and explain the results obtained. #

In [26]:
variables = ['Alcohol', 'Schooling', 'BMI']

for var in variables:
    print(f'Correlation between Life expectancy and {var} is: {np.corrcoef(data_filled["Life expectancy"],data_filled[var])[0,1]}' )

Correlation between Life expectancy and Alcohol is: 0.40415061946866593
Correlation between Life expectancy and Schooling is: 0.7150663398620062
Correlation between Life expectancy and BMI is: 0.5592553046406492


It is unusual that life expectancy positively correlates with alcohol consumption. Normally, one would expect life expectancy to decrease as alcohol consumption increases. The high positive correlation with BMI is also surprising. We would have expected a value closer to 0, since life expectancy should decrease at both the lower and upper extremes of BMI.

In [27]:
## 
developed = data[data['Status'] == 'Developed']
developing = data[data['Status'] == 'Developing']

## report the correlations after stratification

print('Developed')
for var in variables:
    print(f'Correlation between Life expectancy and {var} is: {np.corrcoef(developed["Life expectancy"],developed[var])[0,1]}' )


print('Developing')
for var in variables:
    print(f'Correlation between Life expectancy and {var} is: {np.corrcoef(developing["Life expectancy"],developing[var])[0,1]}' )

Developed
Correlation between Life expectancy and Alcohol is: -0.2803813946888532
Correlation between Life expectancy and Schooling is: 0.3514715561045634
Correlation between Life expectancy and BMI is: -0.04396245854927888
Developing
Correlation between Life expectancy and Alcohol is: 0.20060621619146532
Correlation between Life expectancy and Schooling is: 0.6474122699851881
Correlation between Life expectancy and BMI is: 0.5454188506671711


Thease results can explain the unusual results form before. First, in developed Alcohole consumption is negativeliy correlated with life expectancy, as expected and in developing countrys the correlation is still posivie, but only half as much. The plot above, that plots data points with respect to alcohol and life expectancy, shows that developped countrys just have both a high alcohol consumtion and life expectancy, leading to a positive correlation over all.

Simmilarly the BMI is positivly correlated in developing countrys. This might be, because a low BMI might be due to male nutrition in these countrys, where as in developed countrys, where food is usually secure, obecety is a larger problem than malenutrition, the BMI is slightly negatively correlated.

## Task 4: Simple Regression Implementation (15 pts.)

### Description
This task involves implementing a simple linear regression model. To simplify the application of the model we will implement it as a regression class that contains one attribute `self.coefficients` that corresponds to the vector $\mathbf{W}$ (with the term for the bias $w_0$ being the first value).

You are required to complete the code for a regression class that includes methods for fitting, predicting, and evaluating the model.

We make the hypothesis that the matrix of correlations is of full rank.  



### Exercise
#### **Complete the Class**:
   - Fill in the missing methods (`fit`, `predict`, and `evaluate`) in the provided `LinRegModel` class (the `__str__` method is provided). Note that the solution of the linear regression is provided in the lecture slides. 
   - For the evaluate method, it will compute two indicators:
     - the Mean Squared Error `MSE`: 
      $$
      \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{X}^T_i\cdot \mathbf{W})^2
      $$ 
     - the R squared `rsquared` ($\bar{y}$ is the sample mean $\bar{y} = \frac{1}{n}\sum_i y_i$): 
      $$
      R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \mathbf{X}^T_i\cdot \mathbf{W})^2 }{\sum_i{(y_i - \bar{y})^2} }
      $$
   - Ensure the implementation follows the principles of linear regression using the closed-form solution.


In [28]:


class LinRegModel:
    def __init__(self):
        """Initialize the Linear Regression model."""
        self.coefficients = None
        self.variables = None

    def __str__(self):
        """ summary of the coefficients """
        return "Coefficients:\n" + str(self.coefficients) + "\nVariables: " + ", ".join(self.variables)
    
    def copy(self):
        """ Create a copy of the model instance. """
        new_model = LinRegModel()
        new_model.coefficients = np.copy(self.coefficients) if self.coefficients is not None else None
        new_model.variables = self.variables.copy() if self.variables is not None else None
        return new_model

    def fit(self, X, y, variblenames = None):
        """
        Train the linear regression model using the closed-form solution.

        Parameters:
        X: np.ndarray
            The feature matrix (shape: (n_samples, n_features)).
        y: np.ndarray
            The target vector (shape: (n_samples)).
        """
        # The following code prepares the matrix X and the varnames
        # Be careful with the automatic type change between pandas and numpy arrays
        Xtilde = np.hstack((np.ones((X.shape[0], 1)), X))
        self.variables = variblenames
        self.coefficients = np.linalg.inv(Xtilde.T @  Xtilde) @ Xtilde.T @ y
        

    def predict(self, X):
        """
        Make predictions using the trained model.

        Parameters:
        X: np.ndarray
            The feature matrix (shape: [n_samples, n_features]).

        Returns:
        np.ndarray
            The predicted values (shape: [n_samples]).
        """
        Xtilde = np.hstack((np.ones((X.shape[0], 1)), X))
        return Xtilde @ self.coefficients

    def evaluate(self, X, y):
        """
        Evaluate the model using Mean Squared Error and R-squared metrics.

        Parameters:
        X: np.ndarray
            The feature matrix (shape: [n_samples, n_features]).
        y: np.ndarray
            The true target values (shape: [n_samples]).

        Returns:
        dict
            A dictionary with MSE and R-squared metrics.
        """
        y_ = np.asarray(y).reshape(-1)
        y_pred= np.asarray(self.predict(X)).reshape( -1) 

        squared_errors = np.square(y_ - y_pred)
        ytilde =  np.mean(y_)

        MSE = np.mean(squared_errors)
        rsquared = 1 - (np.sum(squared_errors) / np.sum(np.square( y_ - ytilde)))
        
        return {'MSE': MSE, 'rsquared': rsquared}



In [29]:
### You can test the estimation of the model
variables = ['Alcohol', 'Schooling', 'BMI']
vout = ['Life expectancy']
Xtest = data[variables]

y = data[vout]

lm = LinRegModel()

lm.fit(Xtest, y, variables)

eval = lm.evaluate(Xtest, y)
print(eval)
## Tests
print(lm)


{'MSE': np.float64(0.43622459177517126), 'rsquared': np.float64(0.5636268809549021)}
Coefficients:
   Life expectancy
0     4.627375e-16
1     2.767546e-02
2     5.679788e-01
3     2.615966e-01
Variables: Alcohol, Schooling, BMI


## Task 5: Regression Analysis and Evaluation (20 pts.)

#### 1. Spliting  the Dataset
- We will divide the dataset into training and testing sets:
  - Use the first 7 years 2000 to 2006 as the **training set**.
  - Use the next three year 2007-2009 as the **testing set**.



In [30]:
## Your code here
train_data = data[data['Year'] <= 2006]
test_data = data[(data['Year'] >= 2007) & (data['Year'] <= 2009) ]

print(f"Training set size: {train_data.shape}")
print(f"Testing set size: {test_data.shape}")

Training set size: (1281, 22)
Testing set size: (549, 22)


In [31]:
data['Year']

0       2015
1       2014
2       2013
3       2012
4       2011
        ... 
2933    2004
2934    2003
2935    2002
2936    2001
2937    2000
Name: Year, Length: 2938, dtype: int64

#### 2. Experiment with Predictor Sets
- Use different combinations of predictors to train and evaluate the regression model. Below are some suggested sets of predictors:

  **Set A (Healthcare and Mortality)**:
  - `Adult Mortality`, `Infant deaths`, `Total expenditure`

  **Set B (Lifestyle and Education)**:
  - `Alcohol`, `BMI`, `Schooling`

  **Set C (Economic Factors)**:
  - `GDP`, `Income composition of resources`, `Population`

  **Set D**
  - Union of set A, set B, and set C


- Train the model on each predictor set using the **training data** and evaluate its performance on the **testing data**.
- **Deliverable**: For each predictor set, report:
  - Mean Squared Error (MSE)
  - $R^2$ score
  - A brief explanation of the observed performance and an analysis of the coefficients



In [32]:
## Predictor variables

predictors_sets = {
    "Set A": ["Adult Mortality", "infant deaths", "Total expenditure"],
    "Set B": ["Alcohol", "BMI", "Schooling"],
    "Set C": ["GDP", "Income composition of resources", "Population"]
}

predictors_sets["Set D"] = [x for l_var in predictors_sets.values() for x in l_var]

In [33]:
import copy

def train_modells_on_predictors_sets(train_data, predictors_sets, model_object=LinRegModel(), vout=['Life expectancy']):
    models = {}
    for set in predictors_sets:
        variables = predictors_sets[set]
        Xtrain = train_data[variables] 
        y_train = train_data[vout]

        #model = model_object.copy()
        model = copy.deepcopy(model_object)
        model.fit(Xtrain, y_train)
        models[set] = model
        
    return models

def evaluate_models(models, test_data, predictors_sets, vout=['Life expectancy']):
    result_set = {}
    for set in models.keys():
        variables = predictors_sets[set]
        Xtest = test_data[variables]   
        y_test = test_data[vout]
        result_set[set] = models[set].evaluate(Xtest, y_test)

    return result_set

def print_evaluation_results(evaluation_results):
    for set in evaluation_results:
        print(f"{set}")
        print(f"MSE: {evaluation_results[set]['MSE']}")
        print(f"R^2: {evaluation_results[set]['rsquared']}")

models = train_modells_on_predictors_sets(train_data, predictors_sets)
evaluation_results = evaluate_models(models, test_data, predictors_sets)

print_evaluation_results(evaluation_results)
for set in models:
    print(predictors_sets[set])
    print(models[set].coefficients)

Set A
MSE: 0.4428305750141345
R^2: 0.5327428112369803
Set B
MSE: 0.41205071348346833
R^2: 0.5652204954368011
Set C
MSE: 0.47571759949807146
R^2: 0.4980417325983735
Set D
MSE: 0.24360960196061995
R^2: 0.7429528488086798
['Adult Mortality', 'infant deaths', 'Total expenditure']
   Life expectancy
0        -0.107903
1        -0.602218
2        -0.119937
3         0.144240
['Alcohol', 'BMI', 'Schooling']
   Life expectancy
0        -0.038038
1         0.024980
2         0.356707
3         0.480104
['GDP', 'Income composition of resources', 'Population']
   Life expectancy
0        -0.028502
1         0.309800
2         0.454030
3        -0.011545
['Adult Mortality', 'infant deaths', 'Total expenditure', 'Alcohol', 'BMI', 'Schooling', 'GDP', 'Income composition of resources', 'Population']
   Life expectancy
0        -0.021306
1        -0.390198
2        -0.051654
3         0.002612
4         0.009820
5         0.218570
6         0.278795
7         0.129827
8         0.101416
9         0.03

The union of all variables yields by far the best MSE and $R^2$ on the test data. This indicates that using all features together reliably predicts unseen data. The resulting regressor can account for 
$0.74$ of the variance in the target variable.

The model that depends exclusively on Health Care and Mortality outperforms the other reduced models. The economics-based model performs worst. The economic variables contained many missing values, and the imputed values may have reduced correlations between variables. Therefore, the large amount of imputed data for economic factors may artificially weaken linear dependencies.

Since the input variables have different scales, the coefficients are difficult to compare by magnitude. However, regardless of scale, the sign still indicates whether a metric has a positive or negative effect on life expectancy. Unsurprisingly, higher total expenditure, schooling, and GDP contribute to longer life expectancy. Furthermore, reduced alcohol consumption and lower infant deaths correspond to higher life expectancy across the dataset. BMI and schooling have similar ranges, yet the coefficient for schooling is larger. This can be explained by the fact that both very high and very low BMI values are associated with unhealthy habits; because the relationship between BMI and life expectancy is not strictly linear across all data, its coefficient should be smaller.

TOCHECK analyze coefficients
Most coefficients are slightly scaled down, when comparing the model with all variables to the individual models. This is expected, since in the combined model, the different variables can explain overlapping parts of the variance in life expectancy. The only coefficients that chainge meaningfully beyond this scaling are Total expenditure, which looses almost all of its effect in the combined model, likely because its effect is already captured by other variables, like GDP and Income composition of resources. Even more notably, the coefficient for Population changes sign in the combined model. This indicates that the effect of population on life expectancy is not direct, but mediated by other variables.

Across both the combined and the individual models, Adult Mortality has the largest negative effect on life expectancy by a large margin. This is expected, since it directly measures mortality. Schooling has the largest positive effect on life expectancy, followed by BMI. Since adult mortality can not be directly effected by policymakers, this suggest, that the most effective way to increase life expectancy is to provide access to education and food security.

In [34]:
#TOCHECK ich glaub, das brauchen wir nicht mehr wenn wir normalisierte daten nehmen

for var in predictors_sets["Set D"]:
    print(f"{var}-min: {data[var].min()}")
    print(f"{var}-max: {data[var].max()}")
    print(f"{var}-scale: {1/ (data[var].max() - data[var].min())}")

Adult Mortality-min: -1.3200842196588716
Adult Mortality-max: 4.498728201016799
Adult Mortality-scale: 0.17185637338071844
infant deaths-min: -0.25697318182611134
infant deaths-max: 15.006771438361989
infant deaths-scale: 0.06551472295188837
Total expenditure-min: -2.2604657892834803
Total expenditure-max: 4.7557017452122965
Total expenditure-scale: 0.1425279534850597
Alcohol-min: -1.1399410694150123
Alcohol-max: 3.294814877985676
Alcohol-scale: 0.22549155170221324
BMI-min: -1.8728348252299294
BMI-max: 2.4578255982576396
BMI-scale: 0.23091166293631485
Schooling-min: -3.673833337991314
Schooling-max: 2.6673376811663205
Schooling-scale: 0.1576995789860972
GDP-min: -0.5616916201481591
GDP-max: 8.494715662035853
GDP-scale: 0.11041906231041804
Income composition of resources-min: -3.0639188518881335
Income composition of resources-max: 1.5645412614770768
Income composition of resources-scale: 0.21605457873826872
Population-min: -0.23670626923485638
Population-max: 23.8051682545515
Populatio

#### 3. Analyze the Best Model

From the results you obtained above you can now answer the following questions:

  1. Identify which predictor set yielded the best performance.
  2.  Answer the following questions:
  - What are the most significant predictors based on this analysis?
  - How does the inclusion or exclusion of certain variables affect model performance?
  - For countries with low life expectancy (<65), based on the coefficients estimated for each variable prediction, comment the effect of their change on lifespan?


1. The best results stem from set D, the combination of all other sets, which is to be expected, since it contains the most information, and  on a trainingset with 1281 data points, 9 dimensioons are not enough to for overfitting to be a problem. The menainig sets are all quite similar, with Lifestyle and Education beeing the strongest predictor. 

In [35]:
set = 'Set D'
predictor_subsets = {}
for variable in predictors_sets[set]:
    predictor_subsets[variable] = [x for x in predictors_sets[set] if x != variable]
predictor_subsets['No varriable excluded'] = predictors_sets[set]

models_subset = train_modells_on_predictors_sets(train_data, predictor_subsets)
evaluation_results_subset = evaluate_models(models_subset, test_data, predictor_subsets)

for variable in predictor_subsets.keys():
    print(variable, evaluation_results_subset[variable]) 

Adult Mortality {'MSE': np.float64(0.3863538735532599), 'rsquared': np.float64(0.5923347776552301)}
infant deaths {'MSE': np.float64(0.2467233388252576), 'rsquared': np.float64(0.73966735766148)}
Total expenditure {'MSE': np.float64(0.24364779732835992), 'rsquared': np.float64(0.7429125465776214)}
Alcohol {'MSE': np.float64(0.24377574714977873), 'rsquared': np.float64(0.7427775390211604)}
BMI {'MSE': np.float64(0.2435540761848086), 'rsquared': np.float64(0.7430114373962196)}
Schooling {'MSE': np.float64(0.26458469371813453), 'rsquared': np.float64(0.720820767236964)}
GDP {'MSE': np.float64(0.23919948269678062), 'rsquared': np.float64(0.7476062310401697)}
Income composition of resources {'MSE': np.float64(0.2569982380144043), 'rsquared': np.float64(0.7288256931946782)}
Population {'MSE': np.float64(0.24400818374782068), 'rsquared': np.float64(0.7425322811787829)}
No varriable excluded {'MSE': np.float64(0.24360960196061995), 'rsquared': np.float64(0.7429528488086798)}


In [36]:
#exclude $Population', 'BMI', 'Alcohol', 'total expenditure$ and $infant deaths$ to check if leaving one of them out has no effect because they are low impact variables or because their effect is equivalent with each other

low_impact_variables = ['Population', 'BMI', 'Alcohol', 'Total expenditure', 'infant deaths']
predictor_sets_one_in = {'none': [x for x in predictors_sets[set] if x not in low_impact_variables]}
print(predictor_sets_one_in['none'])
for var in low_impact_variables:
    predictor_sets_one_in[var] = predictor_sets_one_in['none'].copy()
    predictor_sets_one_in[var].append(var)

['Adult Mortality', 'Schooling', 'GDP', 'Income composition of resources']


In [37]:
models_subset = train_modells_on_predictors_sets(train_data, predictor_sets_one_in)
evaluation_results_subset = evaluate_models(models_subset, test_data, predictor_sets_one_in)

for variable in predictor_sets_one_in.keys():
    print(variable, evaluation_results_subset[variable]) 

none {'MSE': np.float64(0.25029129842024656), 'rsquared': np.float64(0.7359025887768532)}
Population {'MSE': np.float64(0.24984850025212949), 'rsquared': np.float64(0.7363698117711489)}
BMI {'MSE': np.float64(0.24703758726430314), 'rsquared': np.float64(0.7393357752223114)}
Alcohol {'MSE': np.float64(0.24929156899036228), 'rsquared': np.float64(0.7369574634609612)}
Total expenditure {'MSE': np.float64(0.24922542472947845), 'rsquared': np.float64(0.7370272562511099)}
infant deaths {'MSE': np.float64(0.24541069896053117), 'rsquared': np.float64(0.7410524029760017)}


In [38]:
#Print correlation
for var in low_impact_variables:
    var_data = train_data[var]
    life_data = train_data['Life expectancy']
    print(f'Correlation between Life expectancy and {var} is: {np.corrcoef(life_data,var_data)[0,1]}' )

Correlation between Life expectancy and Population is: -0.015054535376574358
Correlation between Life expectancy and BMI is: 0.5863042663564922
Correlation between Life expectancy and Alcohol is: 0.39414535885523044
Correlation between Life expectancy and Total expenditure is: 0.199948685466985
Correlation between Life expectancy and infant deaths is: -0.18000952861361882


TOCHECK
2. Notably almost all predictor subsets perform roughly equally well. For the variables $Population$, $BMI$, $Alcohol$, $Total expenditure$ and $infant deaths$, leaving them out has almost no effect on performance. Even leaving all of them out does not significantly reduce performance. For some of these variables this might be, because they realy do not have a significant effect on life expectancy, for example $Population$, has a very low correlation with life expectancy and might not be a significant predictor. For others this might be, because their effect is already captured by other variables. For example $BMI$ and $Alcohol$ have a significant correlation with life expectancy, but leaving them out does not reduce performance. 
Excluding the variable GDP even has a positive effect on the model performance. 
The only variable, whose exclusion significantly reduces performance is $Adult Mortality$. This indicates, that this variable captures a significant amount of information that is not captured by any other variable, which is expected, since it directly measures mortality.

In [39]:
unnormalized_train_data = data_filled[data_filled['Year'] <= 2006]
unnormalized_test_data = data_filled[(data_filled['Year'] >= 2007) & (data_filled['Year'] <= 2009) ]
train_data_low_life_exp = train_data[unnormalized_train_data['Life expectancy'] <= 65]
test_data_low_life_exp = test_data[unnormalized_test_data['Life expectancy'] <= 65]

models_low_life_exp = train_modells_on_predictors_sets(train_data_low_life_exp, predictors_sets)
evaluation_results_low_life_exp = evaluate_models(models_low_life_exp, test_data_low_life_exp, predictors_sets)

for set in predictors_sets.keys():
    print(set)
    print(predictors_sets[set]) 
    print(models_low_life_exp[set].coefficients)
    print(models[set].coefficients)

Set A
['Adult Mortality', 'infant deaths', 'Total expenditure']
   Life expectancy
0        -1.356463
1        -0.143784
2         0.010393
3        -0.165506
   Life expectancy
0        -0.107903
1        -0.602218
2        -0.119937
3         0.144240
Set B
['Alcohol', 'BMI', 'Schooling']
   Life expectancy
0        -1.137810
1        -0.276689
2         0.323600
3         0.122301
   Life expectancy
0        -0.038038
1         0.024980
2         0.356707
3         0.480104
Set C
['GDP', 'Income composition of resources', 'Population']
   Life expectancy
0        -1.344186
1         0.019470
2         0.075334
3         0.043247
   Life expectancy
0        -0.028502
1         0.309800
2         0.454030
3        -0.011545
Set D
['Adult Mortality', 'infant deaths', 'Total expenditure', 'Alcohol', 'BMI', 'Schooling', 'GDP', 'Income composition of resources', 'Population']
   Life expectancy
0        -1.051726
1        -0.149288
2         0.005056
3        -0.159079
4        -0.238369


TOCHECK
2. On this new data set a lot of the known patterns disapear. Infant deaths now has a slightly positive coefficient, Total expenditure a negative on and so dose GDP. The only coefficient that remains strongly positive is BMI, which suggests that food shortages have are a big cause of the low life expectancy in these countries. In these countries food scarcity is probably a mor pressing problem than education, which was more important on the full dataset.

#### 4. Residual analysis

Residual analysis is an essential part of validating and understanding your linear regression model. Residuals 
(the difference between observed and predicted values) help you assess how well your model fits the data, 
identify patterns in the errors, and detect outliers or influential points. 

For this part it is enough to consider only the model built with the set A of variables. 

Residuals $e_i$ are defined in the following way:
$$
e_i = Y - \mathbf{W}\cdot \tilde{\mathbf{X}} 
$$

Many diagnostic plots can be made from the study of the residuals, we will only concentrate visualising if the gaussian assumption for the noise $e_i \sim \mathcal{N}(0, \sigma^2)$ is fullfilled. 

#### Questions 

   1. Plot the predicted values $\hat{y_i}$ against the square-root of the residuals $\sqrt{e_i}$. What would you expect if the residuals are normally distributed?
   2.  An other way to assess the agreement with the normal distribution is to perform a [quantile-quantile plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot) (or Q-Q plot). The quantile-quantile plot reports as a scatter plot the quantiles of your scaled residuals against the quantiles of a normal distribution. Do a Q-Q plot of the data points.


In [40]:
## Your code here
variables = predictors_sets["Set A"]
Xtest = test_data[variables]   
y_test = test_data['Life expectancy']

y_pred = np.asarray(models["Set A"].predict(Xtest)).reshape(-1)
residuals = np.asarray(y_test).reshape(-1) - y_pred

In [41]:
px.scatter(x = y_pred, y = np.sqrt(np.abs(residuals)),labels={
        "x": "Predicted values (ŷ)",
        "y": "|Residuals|^0.5"
    }, title="predicted vs transformed residuals")
# px.scatter(y_pred, residuals)
#px.scatter(y_pred, y_test)

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [42]:

N = len(y_pred)
control_residuals = np.random.normal(size=N,scale=np.std(residuals))
px.scatter(x = y_pred, y = np.sqrt(np.abs(control_residuals)),labels={
        "x": "Predicted values (ŷ)",
        "y": "|Residuals|^0.5"
    }, title= "prediction vs normal distributed residuals")

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [43]:
print("correlation matrix residuals")
print(np.corrcoef( np.sqrt(np.abs(residuals)), y_pred))

correlation matrix residuals
[[1.         0.28590129]
 [0.28590129 1.        ]]


In [44]:
print("correlation matrix random residuals")
print(np.corrcoef( np.sqrt(np.abs(control_residuals)), y_pred))

correlation matrix random residuals
[[1.         0.06969496]
 [0.06969496 1.        ]]


If the residuals were normally distributed independent of the prediction $\hat{y_i}$ we would expect the root of the residuals $\sqrt{e_i}$ to also be distributed independent of $\hat{y_i}$. Howevere the distribution of the $\sqrt{e_i}$ clearly changes depending on $\hat{y_i}$. For $\hat{y_i}$ between -2.5 and -0.2 the $\sqrt{e_i}$ spread around a mean of 0.5, for $\hat{y_i}$ between -0.2 and 0.2 it clumps around a mean of 0.75 and for $\hat{y_i}$ between 0.2 and 1.5 the $\sqrt{e_i}$ spread around a mean of 0.75 with a much larger varriance. 

There also is a significant correlation of $0.28590129$ between $\hat{y_i}$ and $\sqrt{e_i}$.

All the distribution of residuals differs significantly form residuals that were sampled form a normal distribution $e_i \sim \mathcal{N}(0, \sigma^2)$ (that can be seen in the plot 'prediction vs normal distributed residuals')

In [None]:
from scipy import stats
theoretical_quantiles, sample_quantiles = stats.probplot(residuals, dist="norm", fit=False)

fig = px.scatter(
    x=theoretical_quantiles,
    y=sample_quantiles,
    labels={"x": "Theoretical Quantiles", "y": "Sample Quantiles"},
    title="Q–Q Plot for Residual Normality"
)

fig.add_shape(
    type="line",
    x0=min(theoretical_quantiles), y0=min(sample_quantiles),
    x1=max(theoretical_quantiles), y1=max(sample_quantiles)
)

fig.show()


ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

TOCHECK Interpretation of the Q-Q-Plot
The Q-Q plot shows a significant deviation from the expected straight line. Significantly, it is not just scattered randomly (towards the tales), but follows a clear and pretty consistent curve. This indicates, that the deviation is not caused by noise or outliers, but by a systematic deviation from normality. Thus the residuals are probably not normally distributed.

## Task 6: Regularization and Ridge Regression Implementation (25 pts.)

### Description
This task focuses on implementing Ridge regression and analyzing the effects of regularization on model performance. You will compare Ridge regression results with standard linear regression and evaluate its impact on different predictor sets.

`Note:` If you didn't do it for the classical linear model, you need here to normalize your dataset column wise before applying ridge regression (why is that so?).

### Exercises

#### 1. **Implement Ridge Regression**:
   - Write a class `RidgeRegModel` that includes the same methods as the `LinRegModel` class.
     - `fit`: fit a Ridge regression model using the closed-form solution (see lecture slides).
     - `predict`: predict outcomes for a given dataset.
     - `evaluate`: evaluate the model performance using MSE and $R^2$ metrics.

  The class will have contain a regularization parameter `plambda` as an attribute. Note that we make use of class heritage to define `RidgeRegModel` so we do not need to redefine `predict` and `evaluate`.



TOCHECK why normalization is important here 
Normalization is important for Ridge regression because the regularization term penalizes large coefficients. If the features are on different scales, the penalty will affect them unevenly, leading to biased coefficient estimates. Normalizing the features ensures that each feature contributes equally to the regularization term, allowing Ridge regression to effectively shrink coefficients and improve model performance.
(We chose to normalize the Data in the beginning, because it makes the coefficients easier to compare and interpret)

In [45]:
class RidgeRegModel(LinRegModel):
    def __init__(self):
        """Initialize the Linear Regression model."""
        super().__init__()
        self.plambda = None

    def __init__(self, plambda=1.0):
        """Initialize the Ridge Regression model with regularization parameter."""
        super().__init__()
        self.plambda = plambda

    def __str__(self):
        """
        output
        """
        return f"Lambda: {self.plambda}\n" + super().__str__()
    
    def copy(self):
        """ Create a copy of the model instance. """
        new_model = RidgeRegModel()
        new_model.coefficients = np.copy(self.coefficients) if self.coefficients is not None else None
        new_model.variables = self.variables.copy() if self.variables is not None else None
        new_model.plambda = self.plambda
        return new_model

    def fit(self, X, y):
        """
        Train the linear regression model using the closed-form solution.

        Parameters:
        X: np.ndarray
            The feature matrix (shape: (n_samples, n_features)).
        y: np.ndarray
            The target vector (shape: (n_samples)).
        """
        varnames = ["bias"] + list(X.columns)
        self.variables = varnames
        Xtilde = np.hstack((np.ones((X.shape[0], 1)), X))
        regularization_matrix = self.plambda * np.diag(np.concatenate(([0], np.ones(Xtilde.shape[1]-1))))
        self.coefficients = np.linalg.inv(Xtilde.T @ Xtilde + regularization_matrix) @ Xtilde.T @ y



#### 2. **Train and Evaluate on Predictor Sets**:
   - Use the same predictor sets as in Task 4:
     - **Set A (Healthcare and Mortality)**
     - **Set B (Lifestyle and Education)**
     - **Set C (Economic Factors)**
     - **Set D (Union of A, B and C)**
   - Train and evaluate the `RidgeRegModel` on the training and testing sets for each predictor set for a value of `plambda`


In [46]:
## Your code here

model_object = RidgeRegModel(5)
models = train_modells_on_predictors_sets(train_data, predictors_sets, model_object=model_object)
evaluation_results = evaluate_models(models, test_data, predictors_sets)

print_evaluation_results(evaluation_results)


Set A
MSE: 0.443207944074247
R^2: 0.5323446264319933
Set B
MSE: 0.4122135394932295
R^2: 0.5650486879152087
Set C
MSE: 0.47559115428908544
R^2: 0.4981751525477106
Set D
MSE: 0.24346772711184803
R^2: 0.7431025494994943



#### 3. **Comparison**:
   1. Compare the performance of Ridge regression with standard linear regression for each predictor set, you can try first with parameters $\lambda=5$ and $\lambda =20$. What is the impact of increasing $\lambda$?
   2. For set D, Plot the values of the coefficients estimated, as well as the MSE, as a function of $\lambda$ from 0 to 40. What do you observe? 
   3. Evaluate the effect of tuning the $\lambda$ parameter on model coefficients and metrics.
      - How does regularization impact the coefficients of highly correlated predictors?
      - Which model (standard linear regression or Ridge regression) performs better on the testing data? Why?
      - How does the choice of $\lambda$ influence the balance between bias and variance in the model?



In [47]:
for i in [5, 20]:
    print(f"Ridge Regression with lambda={i}")
    model_object = RidgeRegModel(i)
    models = train_modells_on_predictors_sets(train_data, predictors_sets, model_object=model_object)
    evaluation_results = evaluate_models(models, test_data, predictors_sets)

    print_evaluation_results(evaluation_results)


Ridge Regression with lambda=5
Set A
MSE: 0.443207944074247
R^2: 0.5323446264319933
Set B
MSE: 0.4122135394932295
R^2: 0.5650486879152087
Set C
MSE: 0.47559115428908544
R^2: 0.4981751525477106
Set D
MSE: 0.24346772711184803
R^2: 0.7431025494994943
Ridge Regression with lambda=20
Set A
MSE: 0.4443726563010223
R^2: 0.531115668470399
Set B
MSE: 0.412728806067171
R^2: 0.5645049991448572
Set C
MSE: 0.47529145356934316
R^2: 0.49849138481277633
Set D
MSE: 0.24308983501772954
R^2: 0.7435012861891366


TOCHECK
Increasing lambda has a tiny positive effect on the performance of the model. This might be, because there is the regularization is ment to prevent overfitting, but with only 9 variables and more than 1200 data points, overfitting is not a big problem in this case. 

In [48]:
import plotly.express as px

df = {
    "Lambda": [],
    "MSE": [],
    "R^2": []
}
for var in predictors_sets['Set D']:
    df[var] = []
df['bias'] = []
    
#Since regularization parameters between 0 and 20 did not have a big effect, we increase the range to 250
for i in range(250):
    model_object = RidgeRegModel(i)#pow(2,i))
    models = train_modells_on_predictors_sets(train_data, {'Set D': predictors_sets['Set D']}, model_object=model_object)
    evaluation_results = evaluate_models(models, test_data, predictors_sets)
    
    df['Lambda'].append(i)
    df['MSE'].append(evaluation_results['Set D']['MSE'])
    df['R^2'].append(evaluation_results['Set D']['rsquared'])
    
    for var in predictors_sets['Set D']:
        df[var].append((np.array)(models['Set D'].coefficients)[predictors_sets['Set D'].index(var)+1][0])
    df['bias'].append((np.array)(models['Set D'].coefficients)[0][0])

fig = px.line(df, x='Lambda', y='MSE', title='Model Performance vs Lambda')
fig.show()
fig = px.line(df, x='Lambda', y='R^2', title='Model Performance vs Lambda')
fig.show()
fig = px.line(df, x='Lambda', y=[var for var in predictors_sets['Set D']], title='Model Coefficients vs Lambda')
fig.show()
fig = px.line(df, x='Lambda', y='bias', title='Model Coefficients vs Lambda')
fig.show()
df = pd.DataFrame(df)
df

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

TODO Evaluate



### Summary. 


- After those different analysis, provide a short summary on the data and explain which sets of variables are providing a good prediction of life expectancy and what are their relative importance (based on the coefficients). 

- Based on those correlation, could you make some recommendation on public policies that could improve life expectancy?


TODO Summary

## Task 7: Comparison with scikit-learn  (optional) (5 + 5 + 10 bonus points)

### Description
This task involves comparing the performance of the manually implemented models (`LinRegModel` and `RidgeRegModel`) with their counterparts from the `scikit-learn` library.
You will see how it is possible to perform classical steps of the analysis and also model selection very easily using this generic framework.

### Exercises

1. Run again the estimation within the corresponding sklearn modules.
2. Use the function of scikit-learn `model_selection` to perform feature selection and parameter tuning automatically. You can perform leave-one-out or fold cross validation. 
3. (optional question within the optional task): Perform polynomial regression of degree $p$ for $p$ ranging from 1 to 3 and select the best set of parameter. Does the accuracy improves?


In [49]:
### 
### The set of modules you can use from scikit-learn
###
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, GridSearchCV, KFold, ShuffleSplit
from sklearn.preprocessing import StandardScaler

In [None]:
sk_linreg_object = LinearRegression()
sk_ridge_object = Ridge(alpha=20)
models_linear_sklearn = train_modells_on_predictors_sets(train_data, predictors_sets, model_object=sk_linreg_object)
models_ridge_sklearn = train_modells_on_predictors_sets(train_data, predictors_sets, model_object=sk_ridge_object)

AttributeError: 'LinearRegression' object has no attribute 'coef_'

In [97]:
models_linear_sklearn['Set D'].coef_

array([[-0.3901982 , -0.05165374,  0.00261153,  0.00982027,  0.21856996,
         0.27879538,  0.12982654,  0.10141638,  0.03032703]])

In [None]:
index=['bias'] + predictors_sets['Set D']

sk_linreg_coefficients = pd.Series(models_linear_sklearn['Set D'].coef_[0])
sk_linreg_coefficients = pd.concat([
    pd.Series(models_linear_sklearn['Set D'].intercept_),
    sk_linreg_coefficients
])
sk_linreg_coefficients.index = index

sk_ridge_coefficients = pd.Series(models_ridge_sklearn['Set D'].coef_)
sk_ridge_coefficients = pd.concat([
    pd.Series(models_ridge_sklearn['Set D'].intercept_),
    sk_ridge_coefficients
])
sk_ridge_coefficients.index = index

our_linreg_coefficients = pd.Series(
    models['Set D'].coefficients['Life expectancy'],
)
our_linreg_coefficients.index = index

our_ridge_coefficients = pd.Series(
    models['Set D'].coefficients['Life expectancy'],
)
our_ridge_coefficients.index = index

complete_coefficients = pd.DataFrame({
    'Our Linear Regression': our_linreg_coefficients,
    'Sklearn Linear Regression': sk_linreg_coefficients,
    'Our Ridge Regression': our_ridge_coefficients,
    'Sklearn Ridge Regression': sk_ridge_coefficients
}, index=index)

bias                              -0.021306
Adult Mortality                   -0.390198
infant deaths                     -0.051654
Total expenditure                  0.002612
Alcohol                            0.009820
BMI                                0.218570
Schooling                          0.278795
GDP                                0.129827
Income composition of resources    0.101416
Population                         0.030327
dtype: float64

In [114]:
complete_coefficients

Unnamed: 0,Our Linear Regression,Sklearn Linear Regression,Our Ridge Regression,Sklearn Ridge Regression
bias,-0.030561,-0.021306,-0.030561,-0.022139
Adult Mortality,-0.345153,-0.390198,-0.345153,-0.386112
infant deaths,-0.048858,-0.051654,-0.048858,-0.051447
Total expenditure,0.012395,0.002612,0.012395,0.003719
Alcohol,0.032878,0.00982,0.032878,0.012472
BMI,0.19862,0.21857,0.19862,0.216824
Schooling,0.23404,0.278795,0.23404,0.273303
GDP,0.114937,0.129827,0.114937,0.128252
Income composition of resources,0.130417,0.101416,0.130417,0.105339
Population,0.024031,0.030327,0.024031,0.029739


In [None]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import Pipeline

# Load sample data
X = train_data[numeric_columns]
y = train_data['Life expectancy']

# Define the model
model = RandomForestClassifier(random_state=42)

# Create a pipeline: feature selection + model
pipeline = Pipeline([
    ('feature_selection', RFECV(estimator=model, step=1, cv=LeaveOneOut(), scoring='accuracy')),
    ('classification', model)
])

# Define hyperparameters for tuning
param_grid = {
    'classification__n_estimators': [50, 100, 200],
    'classification__max_depth': [None, 3, 5, 10],
    'classification__min_samples_split': [2, 5, 10]
}

# Grid search with Leave-One-Out CV
grid_search = GridSearchCV(pipeline, param_grid, cv=LeaveOneOut(), scoring='accuracy', n_jobs=-1)

# Fit the model
grid_search.fit(X, y)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)

KeyboardInterrupt: 

In [None]:
# Step 1: RFECV only (once)
X = train_data[predictors_sets['Set D']]
y = train_data['Life expectancy']

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=2,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
X_reduced = rfecv.fit_transform(X, y)

In [None]:
# Step 2: Hyperparameter tuning on reduced features
search.fit(X_reduced, y)

In [None]:
# Boolean mask of selected features
mask = grid_search.best_estimator_.named_steps['feature_selection'].support_
print("Feature mask:", mask)

# Names of selected features (if you have feature names)
import numpy as np
selected_features = np.array(data.feature_names)[mask]
print("Selected features:", selected_features)

{'classification__max_depth': 10,
 'classification__min_samples_split': 16,
 'classification__n_estimators': 156}

In [None]:
def polynomial_of_features(X, p):
    df_polynomes_of_X = pd.DataFrame()
    for key in X.keys():
        for i in range(1,p+1):
            df_polynomes_of_X[f'{key}^{i}'] = pd.Series(X[key].copy().pow(i))
    return pd.DataFrame(df_polynomes_of_X)

In [None]:
for i in range(1,5):
    train_X_polynomial = (polynomial_of_features(train_data[predictors_sets['Set D']], i))
    train_y = train_data['Life expectancy']

    test_X_polynomial = (polynomial_of_features(test_data[predictors_sets['Set D']], i))
    test_y = test_data['Life expectancy']

    model = LinRegModel()
    model.fit(train_X_polynomial, train_y)
    evaluation = model.evaluate(test_X_polynomial, test_y)

    print(evaluation)