# Questions

This test exercise is theoretical in nature. In our discussion of the F-test, the total set of explanatory factors was split into two parts. The factors in $X_1$ are always included in the model, whereas those in $X_2$ are possibly removed.

In questions (a), (b), and (c) you derive relations between the two OLS estimates of the effects of $X_1$ on y, one in the large model and the other in the small model. In parts (d), (e), and (f), you check the relation of question (c) numerically for the wage data of our lectures.

We use the notation of Lecture 2.4.2 and assume that the standard regression assumptions A1-A6 are satisfied for the unrestricted model. The restricted model is obtained by deleting the set of $g$ explanatory factors collected in the last $g$ columns $X_2$ of $X$. We wrote the model with $X = (X_1 \; X_2)$ and the corresponding partitioning of the OLS estimator $b$ into $b_1$ and $b_2$ as $y = X_1\beta_1 + X_2\beta_2 + \varepsilon = X_1b_1 + X_2b_2 + e$. We denote by $b_R$ the OLS estimator of $\beta_1$ obtained by regressing $y$ on $X_1$, so that $b_R = (X_1'X_1)^{-1}X_1'y$. Further, let $P = (X_1'X_1)^{-1}X_1'X_2$.


(a) Prove that $E(b_R) = \beta_1 + P\beta_2$.

(b) Prove that $\mathrm{var}(b_R) = \sigma^2 (X_1'X_1)^{-1}$.

(c) Prove that $b_R = b_1 + Pb_2$.

Now consider the wage data of Lectures 2.1 and 2.5. Let y be log-wage (500×1 vector), and let $X_1$ be the (500×2)
matrix for the constant term and the variable ‘Female’. Further let $X_2$ be the (500 × 3) matrix with observations
of the variables ‘Age’, ‘Educ’ and ‘Parttime’. The values of $b_R$ were given in Lecture 2.1, and those of $b$ in Lecture
2.5.

(d) Argue that the columns of the $(2 × 3)$ matrix P are obtained by regressing each of the variables ‘Age’, ‘Educ’,
and ‘Parttime’ on a constant term and the variable ‘Female’.

(e) Determine the values of P from the results in Lecture 2.1.

(f) Check the numerical validity of the result in part (c). Note: This equation will not hold exactly because the
coefficients have been rounded to two or three decimals; more precise results would be obtained with
higher-precision coefficients.
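
The following is a sketch of one possible route to parts (a), (b), and (c), using only the definitions above. It assumes, as in A1-A6, that $X$ is non-random, $E(\varepsilon) = 0$, and $\mathrm{var}(\varepsilon) = \sigma^2 I$; it is meant as a guide rather than the official solution.

$$
\begin{aligned}
\text{(a)}\quad b_R &= (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon) \\
&= \beta_1 + P\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon, \\
E(b_R) &= \beta_1 + P\beta_2 + (X_1'X_1)^{-1}X_1'E(\varepsilon) = \beta_1 + P\beta_2. \\[6pt]
\text{(b)}\quad b_R - E(b_R) &= (X_1'X_1)^{-1}X_1'\varepsilon, \\
\mathrm{var}(b_R) &= (X_1'X_1)^{-1}X_1'\,(\sigma^2 I)\,X_1(X_1'X_1)^{-1} = \sigma^2 (X_1'X_1)^{-1}. \\[6pt]
\text{(c)}\quad b_R &= (X_1'X_1)^{-1}X_1'(X_1b_1 + X_2b_2 + e) \\
&= b_1 + Pb_2 + (X_1'X_1)^{-1}X_1'e = b_1 + Pb_2,
\end{aligned}
$$

where the last step uses $X_1'e = 0$: the OLS residuals of the unrestricted model are orthogonal to every column of $X$, and hence to the columns of $X_1$.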


In [122]:
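import os
import statistics

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
# scikit-learn pieces used by the multiple-regression cells further down
from sklearn import linear_model
from sklearn.model_selection import train_test_split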

path = os.getcwd()
train_exercise=os.path.join(path, 'prueba')
#os.listdir()

In [123]:
train_exercise

'C:\\Users\\hp-Omen 15\\Documents\\virtual_env\\covid\\Scripts\\Aplied_data_science\\prueba'

In [130]:
Dataset1_path=os.path.join(path, 'TrainExer21.xls' )
Dataset1 = pd.read_excel(Dataset1_path)
Dataset1.tail(3)

Unnamed: 0,Observation,Wage,LogWage,Female,Age,Educ,Parttime
497,498,173,5.153292,0,38,4,0
498,499,154,5.036953,0,27,4,0
499,500,141,4.94876,0,29,4,0


In [125]:
def regresion(df, x, y):
    """Simple OLS of df[y] on a constant and df[x].

    Returns [a, b, e, e_hist]: intercept, slope, sum of residuals, list of residuals.
    """
    x = df[x]
    y = df[y]
    x_mean = x.mean()
    y_mean = y.mean()
    b_num = 0  # sum of x_i * (y_i - mean(y))
    b_den = 0  # sum of x_i * (x_i - mean(x))
    for i in range(len(df)):
        b_num += x[i] * (y[i] - y_mean)
        b_den += x[i] * (x[i] - x_mean)
    b = b_num / b_den          # slope
    a = y_mean - b * x_mean    # intercept

    # Residuals e_i = y_i - a - b * x_i; with a constant term their sum is ~0.
    e_hist = []
    e = 0
    for i in range(len(df)):
        e_hist.append(y[i] - a - b * x[i])
        e += e_hist[i]
    return [a, b, e, e_hist]
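
The loop above implements the closed-form slope $b = \sum x_i (y_i - \bar{y}) / \sum x_i (x_i - \bar{x})$ and intercept $a = \bar{y} - b\bar{x}$. As a rough cross-check, the sketch below computes the same quantities with vectorized NumPy and compares them with `np.polyfit`. The helper name `simple_ols` and the synthetic data are assumptions made for illustration; they are not part of the original notebook or the wage data.

```python
import numpy as np

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.sum(x * (y - y.mean())) / np.sum(x * (x - x.mean()))
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Synthetic stand-in for a 0/1 regressor and a log-wage-like response.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=200).astype(float)
y = 4.7 - 0.25 * x + rng.normal(scale=0.3, size=200)

a, b = simple_ols(x, y)
slope_ref, intercept_ref = np.polyfit(x, y, 1)  # highest degree first
residual_sum = np.sum(y - a - b * x)            # ~0 because a constant is included

print(a, b, residual_sum)
```

The residual sum being numerically zero mirrors the third value returned by `regresion`.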

In [128]:
fem_reg=regresion(Dataset1,'Female','LogWage' )
Dataset1['e_Fem']=fem_reg[3]
#Dataset1
fem_reg[0:3]

[4.733644338226111, -0.2506425317585474, -2.2382096176443156e-13]

In [129]:
r_educ_on_efm=regresion(Dataset1,'Educ','e_Fem' )
r_educ_on_efm[0:3]

[-0.4526497818548166, 0.21782953890992124, -5.551115123125783e-15]

### Multiple regression with scikit-learn

In [276]:
#s_fem=Dataset1.Female.values
#s_educ=Dataset1.Educ.values
#s_age=Dataset1.Age.values
#s_ptime=Dataset1.Parttime.values

data_wage=Dataset1[['Female','Age','Educ','Parttime']].values
s_logw=Dataset1.LogWage.values

In [277]:
X_train, X_test, y_train, y_test = train_test_split(data_wage, s_logw, test_size=0.01)
lr_multiple = linear_model.LinearRegression()
lr_multiple.fit(X_train, y_train)
Y_pred_multiple = lr_multiple.predict(X_test)
#print(Y_pred_multiple)
print('Coefficients: ', lr_multiple.coef_)    # slopes: b1[1] (Female) and b2[:] (Age, Educ, Parttime)
print('Intercept: ', lr_multiple.intercept_)  # constant term: b1[0]
print('R^2 (train): ', lr_multiple.score(X_train, y_train))

Coefficients:  [-0.04010888  0.03068269  0.23148215 -0.36556383]
Intercept:  3.0511532254843696
R^2 (train):  0.7038627637971012


### Testing the model

In [114]:
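# Each row is [intercept, slope] from regressing one X2 variable on a constant
# and 'Female', i.e. one column of the (2 x 3) matrix P stored as a row.
P=[regresion(Dataset1,'Female','Age' )[0:2],
   regresion(Dataset1,'Female','Educ' )[0:2],
   regresion(Dataset1,'Female','Parttime' )[0:2]]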

In [115]:
P

[[40.0506329113924, -0.11041552008804748],
 [2.2594936708860764, -0.49318932305999136],
 [0.19620253164557, 0.2494496422674728]]

In [119]:
# Rounded coefficients of the large model
b1 = [3.053, -0.041]         # constant, Female
b2 = [0.031, 0.233, -0.365]  # Age, Educ, Parttime

In [120]:
b1[0]+list(np.dot(b2,P))[0]

4.749417721518987

In [121]:
b1[1]+list(np.dot(b2,P))[1]

-0.25038511282333503
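
The two values above differ slightly from the small-model estimates because `b1` and `b2` were rounded. The sketch below illustrates that the identity $b_R = b_1 + Pb_2$ holds to machine precision when unrounded coefficients are used, and that the columns of $P$ coincide with the column-by-column regressions of part (d). It runs on synthetic data generated inside the block, not on the wage data; the data-generating numbers and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins: X1 = (constant, 0/1 dummy), X2 = three further regressors.
female = rng.integers(0, 2, size=n).astype(float)
X1 = np.column_stack([np.ones(n), female])
X2 = np.column_stack([
    40 + 10 * rng.normal(size=n) - 1.0 * female,           # age-like
    2 + rng.normal(size=n) - 0.5 * female,                 # education-like
    (rng.random(n) < 0.2 + 0.25 * female).astype(float),   # part-time-like dummy
])
X = np.hstack([X1, X2])
beta = np.array([3.0, -0.04, 0.03, 0.23, -0.37])
y = X @ beta + rng.normal(scale=0.25, size=n)

# Large model: y on (X1 X2); small model: y on X1 only.
b = np.linalg.lstsq(X, y, rcond=None)[0]
b1, b2 = b[:2], b[2:]
b_R = np.linalg.lstsq(X1, y, rcond=None)[0]

# P = (X1'X1)^{-1} X1'X2, a (2 x 3) matrix.
P = np.linalg.solve(X1.T @ X1, X1.T @ X2)

# Part (d): each column of P equals the OLS fit of one X2 column on X1.
P_cols = np.column_stack(
    [np.linalg.lstsq(X1, X2[:, j], rcond=None)[0] for j in range(X2.shape[1])]
)

# Part (c): the gap should be at the level of floating-point error.
gap = b_R - (b1 + P @ b2)
print(np.max(np.abs(gap)))
```

With full-precision coefficients the printed gap is at the level of floating-point error, whereas the rounded wage coefficients above leave a discrepancy in the second or third decimal.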

## Linear Regression with scikit-learn

In [136]:
from sklearn import datasets, linear_model
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
boston=datasets.load_boston()
print(boston.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [141]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [144]:
y_multiple=boston.target
X_multiple=boston.data[:,5:8]
X_multiple

array([[ 6.575 , 65.2   ,  4.09  ],
       [ 6.421 , 78.9   ,  4.9671],
       [ 7.185 , 61.1   ,  4.9671],
       ...,
       [ 6.976 , 91.    ,  2.1675],
       [ 6.794 , 89.3   ,  2.3889],
       [ 6.03  , 80.8   ,  2.505 ]])

In [161]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X_multiple, y_multiple, test_size=0.2)

In [188]:
X_train, X_test, y_train, y_test= train_test_split(X_multiple, y_multiple, test_size=0.2)
lr_multiple= linear_model.LinearRegression()
lr_multiple.fit(X_train, y_train)
Y_pred_multiple= lr_multiple.predict(X_test)
#print(Y_pred_multiple)
print('Coefficients: ',lr_multiple.coef_)    # slopes for RM, AGE, DIS
print('Intercept: ',lr_multiple.intercept_)  # constant term
print('R^2 (train): ',lr_multiple.score(X_train, y_train))

Coefficients:  [ 8.43732272 -0.10963969 -0.53161225]
Intercept:  -21.039193630491877
R^2 (train):  0.5374835450608988
