#### Part A

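First, fit a linear regression model that predicts `SilveradoSales` from all four independent variables: `Unemployment`, `SilveradoQueries`, `CPI_Energy`, and `CPI_All`. The model is trained on the data before 2016; the data from 2016 onward is held out as the test set.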

In [1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('Silverado242-Fall2023.csv')
df = df.rename(columns={'CPI.Energy': 'CPI_Energy', 'CPI.All': 'CPI_All'})

df_train = df[df['Year'] < 2016] 
df_test = df[df['Year'] >= 2016] 

model1 = smf.ols(formula = 'SilveradoSales ~ Unemployment + SilveradoQueries + CPI_Energy + CPI_All', 
                 data = df_train).fit()

print(model1.summary()) 

                            OLS Regression Results                            
Dep. Variable:         SilveradoSales   R-squared:                       0.557
Model:                            OLS   Adj. R-squared:                  0.525
Method:                 Least Squares   F-statistic:                     17.61
Date:                Wed, 20 Sep 2023   Prob (F-statistic):           2.08e-09
Time:                        01:50:05   Log-Likelihood:                -615.04
No. Observations:                  61   AIC:                             1240.
Df Residuals:                      56   BIC:                             1251.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept        -1.401e+05   2.19e+05  

In this model, `Unemployment` has a p-value of 0.990, so its coefficient is not statistically significant. I therefore drop `Unemployment` and refit the model with the remaining three variables.

In [15]:
model2 = smf.ols(formula = 'SilveradoSales ~ SilveradoQueries + CPI_Energy + CPI_All', 
                 data = df_train).fit()
print(model2.summary())

                            OLS Regression Results                            
Dep. Variable:         SilveradoSales   R-squared:                       0.557
Model:                            OLS   Adj. R-squared:                  0.534
Method:                 Least Squares   F-statistic:                     23.89
Date:                Wed, 20 Sep 2023   Prob (F-statistic):           3.85e-10
Time:                        02:13:13   Log-Likelihood:                -615.04
No. Observations:                  61   AIC:                             1238.
Df Residuals:                      57   BIC:                             1247.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept        -1.428e+05   4.55e+04  

In this model, all remaining variables are statistically significant (p-values < 0.05). The fitted equation is: $\text{SilveradoSales} = -142800 + 95.93 \times \text{SilveradoQueries} - 119.56 \times \text{CPI_Energy} + 870.62 \times \text{CPI_All}$. Holding the other variables fixed, each one-unit increase in `SilveradoQueries` is associated with about 96 more monthly Silverado sales; each one-unit increase in the CPI for energy is associated with about 120 fewer; and each one-unit increase in the overall CPI is associated with about 871 more. The sign of `SilveradoQueries` is positive, which makes sense since more online interest in and searches for the Silverado should go with higher sales. The sign of `CPI_Energy` is negative, which makes sense since higher fuel prices may discourage consumers from buying cars. The sign of `CPI_All` is positive, which does not make sense, since an overall increase in prices should reduce consumers' desire to buy a car. The $R^2$ is 0.557, so the model explains only about 56% of the variation in training-set sales and does not do a very good job.
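To make the coefficient interpretation concrete, here is a minimal sketch that evaluates the fitted equation above by hand. The function name and the input values (70, 200, 235) are made up for illustration and are not taken from the dataset; the coefficients are the rounded values from the equation.

```python
def predict_sales(queries, cpi_energy, cpi_all):
    """Evaluate the fitted equation from model2 using its rounded coefficients."""
    return -142800 + 95.93 * queries - 119.56 * cpi_energy + 870.62 * cpi_all

# Hypothetical inputs, chosen only to illustrate the arithmetic.
base = predict_sales(queries=70, cpi_energy=200, cpi_all=235)

# Raising one variable by one unit while holding the others fixed
# changes the prediction by exactly that variable's coefficient.
effect_queries = predict_sales(71, 200, 235) - base
effect_energy = predict_sales(70, 201, 235) - base
effect_all = predict_sales(70, 200, 236) - base

print(round(effect_queries, 2), round(effect_energy, 2), round(effect_all, 2))
```

Each printed difference matches one coefficient, which is what "holding the other variables fixed" means in a linear model.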

#### Part B

In [3]:
model3 = smf.ols(formula = 'SilveradoSales ~ Unemployment + SilveradoQueries + CPI_Energy + CPI_All + C(MonthFactor)', 
                 data=df_train).fit()
print(model3.summary()) 

                            OLS Regression Results                            
Dep. Variable:         SilveradoSales   R-squared:                       0.770
Model:                            OLS   Adj. R-squared:                  0.693
Method:                 Least Squares   F-statistic:                     10.03
Date:                Wed, 20 Sep 2023   Prob (F-statistic):           7.82e-10
Time:                        01:50:05   Log-Likelihood:                -595.08
No. Observations:                  61   AIC:                             1222.
Df Residuals:                      45   BIC:                             1256.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept             -1.108e+

The fitted regression equation is: $\text{SilveradoSales} = -110800 - 592.43 \times \text{Unemployment} + 88.43 \times \text{SilveradoQueries} - 103.14 \times \text{CPI_Energy} + 727.04 \times \text{CPI_All} + 4910.44 \times \text{T.Aug} + 9225.00 \times \text{T.Dec} + 6078.67 \times \text{T.Feb} - 7181.30 \times \text{T.Jan} + 4298.90 \times \text{T.Jul} + 3808.09 \times \text{T.Jun} + 1884.91 \times \text{T.Mar} + 3633.82 \times \text{T.May} - 1463.36 \times \text{T.Nov} + 1148.95 \times \text{T.Oct} + 1960.32 \times \text{T.Sep}$

The coefficient for each `MonthFactor` level is the difference in expected Silverado sales between that month and the reference month (April, the one level without a term in the equation), holding the other variables fixed. A positive sign means sales in that month tend to be higher than in the reference month, and a negative sign means they tend to be lower. The larger the absolute value of the coefficient, the greater the difference. Comparing the p-values, the `T.Dec` factor, the `T.Jan` factor, and the `SilveradoQueries` variable are significant.

The $R^2$ of this model is 0.770, compared with 0.557 for the same model without `MonthFactor`. Since $R^2$ cannot decrease when variables are added, the adjusted $R^2$ is the fairer comparison, and it also rises, from 0.525 to 0.693. So adding the independent variable `MonthFactor` improves the quality of the model. Another way I can think of to model seasonality with the given data is to replace `MonthFactor` with the numerical month number and run the regression again, and it might be the best way to improve the model so far.
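The adjusted $R^2$ values printed in the two summaries can be reproduced from $R^2$, the number of observations, and the model degrees of freedom using the standard formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$. The sketch below does this; the helper name is illustrative, and the inputs are the rounded values from the summaries above.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the summaries: 61 observations in both models,
# 4 predictors without MonthFactor and 15 with it.
without_month = adjusted_r2(0.557, 61, 4)
with_month = adjusted_r2(0.770, 61, 15)

print(round(without_month, 3), round(with_month, 3))  # 0.525 0.693
```

Even after paying for the 11 extra dummy variables, the model with `MonthFactor` keeps a clearly higher adjusted $R^2$.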

#### Part C

In [13]:
model4 = smf.ols(formula = 'SilveradoSales ~ Unemployment + SilveradoQueries + CPI_Energy + MonthNumeric', 
                 data=df_train).fit()
print(model4.summary())

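def OSR2(model, df_train, df_test, dependent_var):
    """Out-of-sample R^2: 1 - SSE/SST on the test set, with the training mean as baseline."""
    y_test = df_test[dependent_var]
    y_pred = model.predict(df_test)
    # Squared error of the model's predictions on the test set
    SSE = np.sum((y_test - y_pred)**2)
    # Squared error of a baseline that always predicts the training-set mean
    SST = np.sum((y_test - np.mean(df_train[dependent_var]))**2)
    return 1 - SSE/SST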

print(OSR2(model4, df_train, df_test, 'SilveradoSales'))

                            OLS Regression Results                            
Dep. Variable:         SilveradoSales   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.558
Method:                 Least Squares   F-statistic:                     19.97
Date:                Wed, 20 Sep 2023   Prob (F-statistic):           2.90e-10
Time:                        02:05:26   Log-Likelihood:                -612.83
No. Observations:                  61   AIC:                             1236.
Df Residuals:                      56   BIC:                             1246.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept         6.116e+04   1.12e+04  

From part a), the coefficient for `CPI_All` does not make sense, so I drop it. I keep `Unemployment`, since when the unemployment rate rises, fewer people may want to buy a new car. From part b), `MonthFactor` appears to affect sales, so I include the numerical month number, `MonthNumeric`. The training-set $R^2$ of this model is 0.588, and the $OSR^2$ is 0.1840. This model may not be useful to Chevrolet, since both $R^2$ and $OSR^2$ are low: the model fits the training data only moderately and predicts the held-out test data poorly.

#### Part D

The loss function can be:
$$l(\hat{y}, y) = - y \times 3000 + (\hat{y} - y) \times 500$$
Since Chevrolet will stock exactly the predicted number of units each month, we only need to consider the loss that comes from the gap between actual sales and the model's prediction. The $y \times 3000$ term is the profit from actual sales, so it is subtracted from the loss. The $(\hat{y} - y) \times 500$ term is the cost of the unsold units carried over to the next month, so it is added to the loss.
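As a quick numeric check of the loss above, here is a minimal Python sketch. The function name, parameter names, and example numbers are illustrative assumptions; 3000 and 500 are the per-unit figures from the formula. The sketch evaluates the formula exactly as written, which implicitly covers the case $\hat{y} \ge y$ and does not separately model a month in which demand exceeds the stocked units.

```python
def inventory_loss(y_hat, y, profit_per_unit=3000, carry_cost_per_unit=500):
    """Loss l(y_hat, y) = -y * 3000 + (y_hat - y) * 500, as in the formula above."""
    return -y * profit_per_unit + (y_hat - y) * carry_cost_per_unit

# Hypothetical month: 110 units predicted (and stocked), 100 units sold.
over_forecast = inventory_loss(110, 100)

# Hypothetical month with a perfect forecast: no carryover cost.
exact_forecast = inventory_loss(100, 100)

print(over_forecast, exact_forecast)
```

A more negative value is better here, so the perfect forecast gives the lower loss, and each extra stocked unit raises the loss by 500.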