# Chapter 4, inference / hypothesis testing

## C1

In [None]:
import wooldridge
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt


# Load data
data = wooldridge.data('vote1')

vote = pd.DataFrame(data)


In [16]:
# Build the model
model = ols('voteA ~ lexpendA + lexpendB + prtystrA', data=vote).fit()

In [18]:
# Print model summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  voteA   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     215.2
Date:                Mon, 11 Nov 2024   Prob (F-statistic):           1.76e-57
Time:                        17:27:40   Log-Likelihood:                -596.86
No. Observations:                 173   AIC:                             1202.
Df Residuals:                     169   BIC:                             1214.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     45.0789      3.926     11.481      0.0

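(i) **Interpretation of β₁**
The equation is a level-log relationship, and voteA is measured in percentage points. Holding the other factors fixed, a 1% increase in expendA changes predicted voteA by about β₁/100 percentage points. With an estimate of about 6.08, that is roughly 0.06 percentage points, in the same direction as the change in spending.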
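
As a quick arithmetic check of the level-log interpretation, the sketch below compares the exact change β₁·log(1.01) with the usual β₁/100 approximation. It uses the rounded estimate 6.08 quoted in part (iii); the function name `level_log_effect` is illustrative and not part of the notebook.

```python
import math

def level_log_effect(beta, pct_change):
    """Change in y (in y's own units) when x rises by pct_change percent,
    for a level-log model y = b0 + beta*log(x)."""
    return beta * math.log(1 + pct_change / 100)

beta_lexpendA = 6.08  # rounded estimate quoted in part (iii)

exact = level_log_effect(beta_lexpendA, 1.0)   # exact change for a 1% increase
approx = beta_lexpendA / 100                   # textbook approximation

print(f"exact: {exact:.4f} percentage points")
print(f"approximation: {approx:.4f} percentage points")
```

Both numbers round to about 0.06 percentage points, so the approximation is harmless for small percentage changes.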

(ii) **In terms of the parameters, state the null hypothesis that a 1% increase in A’s expenditures is offset by a 1% increase in B’s expenditures.** H0: β₁ = -β₂, or equivalently H0: β₁ + β₂ = 0.

(iii) **Estimate the given model using the data in VOTE1.RAW and report the results in usual form. Do A’s expenditures affect the outcome? What about B’s expenditures? Can you use these results to test the hypothesis in part (ii)?** lexpendA has a coefficient of 6.08 and a standard error of .382, which gives a t statistic of 15.9, so it is very significant both practically and statistically. Similarly, lexpendB has a coefficient of -6.61, a standard error of .379 and a t statistic of -17.463, so it is also practically and statistically significant. The two estimates are similar in magnitude and opposite in sign, which is expected given the zero-sum nature of an election. These results alone are not enough to test the hypothesis in part (ii): the sum of the estimates is 6.08 - 6.61 = -0.53, but its standard error depends on the covariance between the two estimators, which the summary does not report. We need either to compute the standard error of the estimate of β₁ + β₂ directly or to reparametrize the model so that β₁ + β₂ appears as a single coefficient with its own t statistic.
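
The two routes to a t statistic for a sum of coefficients (combining variances and the covariance, or reparametrizing the model) should give the same number. The sketch below checks this on synthetic data generated with NumPy. It is not the VOTE1 data, and `ols_fit` is a small helper written for this illustration rather than a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Synthetic outcome whose true coefficients satisfy b1 + b2 = 0
y = 1.0 + 2.0 * x1 - 2.0 * x2 + rng.normal(size=n)

def ols_fit(X, y):
    """Return OLS coefficients and their estimated covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, sigma2 * XtX_inv

# Approach 1: original model, se(b1 + b2) = sqrt(var1 + var2 + 2*cov12)
X = np.column_stack([np.ones(n), x1, x2])
beta, cov = ols_fit(X, y)
theta_hat = beta[1] + beta[2]
se_theta = np.sqrt(cov[1, 1] + cov[2, 2] + 2 * cov[1, 2])
t_direct = theta_hat / se_theta

# Approach 2: reparametrized model y = b0 + theta*x1 + b2*(x2 - x1) + u
X_rep = np.column_stack([np.ones(n), x1, x2 - x1])
beta_rep, cov_rep = ols_fit(X_rep, y)
t_rep = beta_rep[1] / np.sqrt(cov_rep[1, 1])

print(f"direct t: {t_direct:.6f}, reparametrized t: {t_rep:.6f}")
```

The reparametrization is usually more convenient because any regression routine then reports the standard error and t statistic directly.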

(iv) **Estimate a model that directly gives the t statistic for testing the hypothesis in part (ii). What do you conclude? (Use a two-sided alternative.)**




voteA = β₀ + β₁lexpendA + β₂lexpendB + β₃prtystrA + u

Hypothesis: H0: β₁ = -β₂, that is, β₁ + β₂ = 0

We add and subtract β₂lexpendA so that β₁ + β₂ appears as a single coefficient, and then test whether it is 0:

voteA = β₀ + β₁lexpendA + β₂lexpendA - β₂lexpendA + β₂lexpendB + β₃prtystrA + u

voteA = β₀ + (β₁ + β₂)lexpendA + β₂(lexpendB - lexpendA) + β₃prtystrA + u

θ = β₁ + β₂ (coefficient on lexpendA)
β₂ (coefficient on diff_expend = lexpendB - lexpendA)

voteA = β₀ + θ(lexpendA) + β₂(lexpendB - lexpendA) + β₃prtystrA + u

Note that regressing on the sum lexpendA + lexpendB instead would leave β₁ - β₂ as the coefficient on lexpendA, which tests β₁ = β₂, a different hypothesis.



In [23]:
# Answer to (iv):
from scipy import stats

# New variable: difference of the log expenditures
vote['diff_expend'] = vote['lexpendB'] - vote['lexpendA']

# Reparametrized model
# voteA = β₀ + θ(lexpendA) + β₂(lexpendB - lexpendA) + β₃prtystrA + u
# where θ = β₁ + β₂ is the coefficient we want to test against 0
modelo_rep = ols('voteA ~ lexpendA + diff_expend + prtystrA', data=vote).fit()

print("Results of the reparametrized model:")
print("=====================================")
print(modelo_rep.summary())

# Two-sided p-value for H0: θ = 0
t_stat = modelo_rep.tvalues['lexpendA']
p_valor = 2 * stats.t.sf(abs(t_stat), df=modelo_rep.df_resid)

print("\nHypothesis test H₀: β₁ = -β₂")
print("===================================")
print(f"t statistic: {t_stat:.4f}")
print(f"p-value: {p_valor:.4e}")

Results of the reparametrized model:
                            OLS Regression Results                            
Dep. Variable:                  voteA   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     215.2
Date:                Mon, 11 Nov 2024   Prob (F-statistic):           1.76e-57
Time:                        17:58:27   Log-Likelihood:                -596.86
No. Observations:                 173   AIC:                             1202.
Df Residuals:                     169   BIC:                             1214.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    

(iv) The coefficient on lexpendA in the reparametrized model estimates θ = β₁ + β₂, which is about 6.08 - 6.61 = -0.53, with a standard error of about 0.53. The t statistic is therefore about -1 (two-sided p-value of roughly 0.32), so we fail to reject the null hypothesis that β₁ = -β₂ at any conventional significance level. The data are consistent with a 1% increase in A's expenditures being offset by a 1% increase in B's expenditures. For comparison, a regression on the sum lexpendA + lexpendB gives t = 23.38 on lexpendA, but that statistic tests β₁ = β₂, which is clearly rejected because the two effects have opposite signs.
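
The two-sided p-value computed in the cell above is P(|T| > |t|). As a rough cross-check that needs only the standard library, the sketch below replaces the t distribution with a standard normal. This is an approximation, assumed reasonable here because the model has 169 residual degrees of freedom. The function name and the t values in the loop are illustrative, not taken from the regression output.

```python
import math

def two_sided_p_normal(t_stat):
    """Two-sided p-value under a standard normal approximation: P(|Z| > |t|).

    Uses the identity P(|Z| > z) = erfc(z / sqrt(2)), which stays accurate
    for large |t| where 1 - cdf would round to zero.
    """
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Illustrative t statistics (hypothetical values)
for t in (1.0, 1.96, 5.0):
    print(f"t = {t:5.2f} -> p = {two_sided_p_normal(t):.4g}")
```

With small degrees of freedom the normal approximation understates the p-value, so the t distribution should be used for the reported result.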

## C2 
LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).

In [29]:
import wooldridge as woo
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
import seaborn as sns

data = woo.data('lawsch85')



In [30]:
modelo = ols('lsalary ~ LSAT + GPA + llibvol + lcost + rank', data=data)

resultados = modelo.fit()


print("\nRegression results:")
print(resultados.summary())


Regression results:
                            OLS Regression Results                            
Dep. Variable:                lsalary   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     138.2
Date:                Mon, 11 Nov 2024   Prob (F-statistic):           2.93e-50
Time:                        18:28:35   Log-Likelihood:                 107.33
No. Observations:                 136   AIC:                            -202.7
Df Residuals:                     130   BIC:                            -185.2
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      8.3432  

(i) **State and test the null hypothesis that the rank of law schools has no ceteris paribus effect on median starting salary.**

H0: β_rank = 0

The t statistic on rank is -9.541, so we reject H0 at any conventional significance level: rank has a statistically significant effect on median starting salary. Because this is a log-level relationship, each one-place worsening in rank (a higher rank number) is associated with a median starting salary about 0.33% lower, holding the other factors fixed. The per-place effect is small, but it accumulates: a difference of 20 places corresponds to roughly 6.6%.
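
In a log-level model, 100·β is only an approximation to the percentage change. The sketch below compares it with the exact value 100·(exp(β·Δx) - 1). It assumes a rank coefficient of -0.0033, which is the rounded value implied by the 0.33% figure above and by the negative t statistic; the function name is illustrative.

```python
import math

def pct_change_log_level(beta, delta_x):
    """Exact percentage change in y for a change delta_x in x
    when log(y) = b0 + beta*x."""
    return 100 * (math.exp(beta * delta_x) - 1)

beta_rank = -0.0033  # assumed rounded coefficient (implied by the 0.33% figure)

for places in (1, 20):
    approx = 100 * beta_rank * places
    exact = pct_change_log_level(beta_rank, places)
    print(f"{places:>2} places: approx {approx:.2f}%, exact {exact:.2f}%")
```

For a single place the two figures are practically identical; the gap only becomes visible for large differences in rank.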

(ii) Are features of the incoming class of students—namely, LSAT and GPA—
individually or jointly significant for explaining salary? (Be sure to account for 
missing data on LSAT and GPA.)

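LSAT does not appear to be individually significant. To assess the joint significance of LSAT and GPA, we use an F test that compares the full model with a restricted model excluding both variables, with both models estimated on the same observations.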
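
For reference, the R-squared form of the F statistic for this kind of exclusion test can be written as follows. The subscripts are notation chosen here rather than taken from the notebook: ur and r denote the unrestricted and restricted models, q is the number of excluded variables, n is the number of observations, and k is the number of regressors in the unrestricted model. The form is valid only when both models share the same dependent variable and sample.

```latex
F = \frac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/(n - k - 1)}
\sim F_{q,\; n-k-1} \quad \text{under } H_0
```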

In [33]:
import wooldridge as woo
import numpy as np
from statsmodels.formula.api import ols
import scipy.stats as stats

# Load and clean data
data = woo.data('lawsch85')
data_clean = data.dropna(subset=['LSAT', 'GPA', 'lsalary', 'llibvol', 'lcost', 'rank'])

# Estimate models
full_model = ols('lsalary ~ LSAT + GPA + llibvol + lcost + rank', data=data_clean).fit()
restricted_model = ols('lsalary ~ llibvol + lcost + rank', data=data_clean).fit()

# F-test for joint significance
n = full_model.nobs
k_full = full_model.df_model
k_restricted = restricted_model.df_model
q = k_full - k_restricted  # number of restrictions

F = ((full_model.rsquared - restricted_model.rsquared) / q) / \
    ((1 - full_model.rsquared) / (n - k_full - 1))
p_value = 1 - stats.f.cdf(F, q, n - k_full - 1)

# Print results
print(f"Sample size: {n}")
print(f"\nF-statistic: {F:.4f}")
print(f"p-value: {p_value:.4f}")

Sample size: 136.0

F-statistic: 9.9518
p-value: 0.0001


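LSAT and GPA are jointly significant: with F ≈ 9.95 and a p-value of about .0001, we reject the null hypothesis that both coefficients are zero at the 1% level.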

(iii) **Test whether the size of the entering class (clsize) or the size of the faculty (faculty) needs to be added to this equation; carry out a single test. (Be careful to account for missing data on clsize and faculty.)**

H0: β_clsize = 0 and β_faculty = 0
H1: at least one of β_clsize, β_faculty is different from 0
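
The warning about missing data matters because the F statistic compares two models that must be estimated on the same observations. This is why the sample falls from 136 to 131 once clsize and faculty are required. A minimal sketch with a made-up DataFrame (hypothetical values, not LAWSCH85) shows how `dropna(subset=...)` pins down the common sample:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the LAWSCH85 data set
df = pd.DataFrame({
    "lsalary": [10.5, 10.8, 11.0, 10.6, 10.9],
    "rank":    [120, 40, 5, 90, 30],
    "clsize":  [300, np.nan, 180, 250, 210],
    "faculty": [40, 55, np.nan, 35, 50],
})

base_cols = ["lsalary", "rank"]
extended_cols = base_cols + ["clsize", "faculty"]

# Rows usable by the base model alone
n_base = len(df.dropna(subset=base_cols))

# Rows usable by BOTH models: drop any row missing an extended-model variable
common = df.dropna(subset=extended_cols)

print(f"base-only sample: {n_base}, common sample: {len(common)}")
```

Fitting the base model on all of its available rows and the extended model on the smaller sample would make the two R-squared values incomparable, so both fits should use the common sample.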



In [36]:
import wooldridge as woo
from statsmodels.formula.api import ols
import scipy.stats as stats

# Load and clean data
data = woo.data('lawsch85')
data_clean = data.dropna(subset=['LSAT', 'GPA', 'lsalary', 'llibvol', 'lcost', 'rank', 'clsize', 'faculty'])

# Extended model
extended_model = ols('lsalary ~ LSAT + GPA + llibvol + lcost + rank + clsize + faculty', 
                    data=data_clean).fit()

# Base model
base_model = ols('lsalary ~ LSAT + GPA + llibvol + lcost + rank', 
                data=data_clean).fit()

# F-test
n = base_model.nobs
k_extended = extended_model.df_model
k_base = base_model.df_model
q = k_extended - k_base

F = ((extended_model.rsquared - base_model.rsquared) / q) / \
    ((1 - extended_model.rsquared) / (n - k_extended - 1))
p_value = 1 - stats.f.cdf(F, q, n - k_extended - 1)

# Print results
print(f"Sample size: {n}")
print(f"\nF-test for joint significance of class size and faculty:")
print(f"F-statistic: {F:.4f}")
print(f"p-value: {p_value:.4f}")

print("\nCoefficients for added variables:")
print(f"{'Variable':<10} {'Coefficient':>12} {'Std Error':>12} {'t-stat':>10} {'p-value':>10}")
print("-" * 55)
print(f"{'clsize':<10} {extended_model.params['clsize']:12.4f} "
      f"{extended_model.bse['clsize']:12.4f} "
      f"{(extended_model.params['clsize']/extended_model.bse['clsize']):10.4f} "
      f"{extended_model.pvalues['clsize']:10.4f}")
print(f"{'faculty':<10} {extended_model.params['faculty']:12.4f} "
      f"{extended_model.bse['faculty']:12.4f} "
      f"{(extended_model.params['faculty']/extended_model.bse['faculty']):10.4f} "
      f"{extended_model.pvalues['faculty']:10.4f}")

Sample size: 131.0

F-test for joint significance of class size and faculty:
F-statistic: 0.9484
p-value: 0.3902

Coefficients for added variables:
Variable    Coefficient    Std Error     t-stat    p-value
-------------------------------------------------------
clsize           0.0001       0.0002     0.8741     0.3838
faculty          0.0001       0.0004     0.1687     0.8663


The F statistic is .9484 and the p-value is .3902. There is very little evidence that clsize and faculty improve the model, so we fail to reject H0 and neither variable needs to be added.


(iv) What factors might influence the rank of the law school that are not included in the model?

- Bar passage rate
- Acceptance rate (selectivity of admissions)
- Number and quality of faculty publications
- Employment rate after graduation

## C3