# Emilio Flores - DSC 510 Exercise - Week 9

## Exercise 12.1
The linear model I used in this chapter has the obvious drawback that it is linear, and there is no reason to expect prices to change linearly over time. We can add flexibility to the model by adding a quadratic term, as we did in "Nonlinear Relationships" on page 133.

Use a quadratic model to fit the time series of daily prices, and use the model to generate predictions. You will have to write a version of RunLinearModel that runs that quadratic model, but after that you should be able to reuse code in timeseries.py to generate predictions.
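
The fit-then-predict step described above can be sketched without the course libraries. This is a minimal illustration, not the notebook's method: it solves the least-squares normal equations directly instead of calling statsmodels, the names `fit_quadratic` and `predict_quadratic` are hypothetical, and the data are made up so that the recovered coefficients can be checked by eye.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    # Build the 3x3 system (X^T X) b = X^T y for the columns [1, x, x^2].
    rows = [[1.0, x, x * x] for x in xs]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]

    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, 3):
            factor = a[k][col] / a[col][col]
            for c in range(col, 3):
                a[k][c] -= factor * a[col][c]
            b[k] -= factor * b[col]

    # Back substitution.
    coefs = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        s = sum(a[i][j] * coefs[j] for j in range(i + 1, 3))
        coefs[i] = (b[i] - s) / a[i][i]
    return coefs


def predict_quadratic(coefs, xs):
    """Evaluate the fitted quadratic at each x."""
    b0, b1, b2 = coefs
    return [b0 + b1 * x + b2 * x * x for x in xs]


# Made-up, noise-free data: four "years" of prices on an exact quadratic.
years = [i / 10 for i in range(40)]
ppg = [13.7 - 1.1 * t + 0.1 * t * t for t in years]

coefs = fit_quadratic(years, ppg)
future = [4.0, 4.5, 5.0]  # years beyond the observed range
predictions = predict_quadratic(coefs, future)
print(coefs)
print(predictions)
```

Because the toy data lie exactly on a quadratic, the fit recovers the generating coefficients up to rounding. With real prices the residuals would be nonzero, and any prediction interval would also have to account for them.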

In [129]:
import numpy as np
import pandas as pd
import random
import thinkstats2
import thinkplot
import statsmodels.formula.api as smf

In [131]:
transactions = pd.read_csv("mj-clean.csv", parse_dates=[5])

In [191]:
def GroupByDay(transactions, func=np.mean):
    """Groups transactions by day and computes the daily mean ppg.

    transactions: DataFrame of transactions

    returns: DataFrame of daily prices
    """
    grouped = transactions[["date", "ppg"]].groupby("date")
    daily = grouped.aggregate(func)

    daily["date"] = daily.index
    start = daily.date.iloc[0]  # positional lookup; the index holds dates, not integers
    one_day = np.timedelta64(1, "D")
    daily["years"] = (daily.date - start) / (one_day * 365.25)  # Convert days to years

    # print("Daily DataFrame after adding 'years':")
    # print(daily.head())

    return daily

In [193]:
def GroupByQualityAndDay(transactions):
    """Divides transactions by quality and computes mean daily price.

    transactions: DataFrame of transactions

    returns: map from quality to time series of ppg
    """
    groups = transactions.groupby("quality")
    dailies = {}
    for name, group in groups:
        dailies[name] = GroupByDay(group)
        # print(f"Processed quality group: {name}")

    return dailies

In [195]:
def RunQuadraticModel(daily):
    """Runs a quadratic model of prices versus years.

    daily: DataFrame of daily prices

    returns: model, results
    """
    # print("Daily DataFrame before adding 'years2':")
    # print(daily.head())

    daily["years2"] = daily.years**2
    model = smf.ols("ppg ~ years + years2", data=daily)
    results = model.fit()
    
    return model, results

In [None]:
dailies = GroupByQualityAndDay(transactions)

In [199]:
name = "high"
daily = dailies[name]

# Run the quadratic model and print the summary
model, results = RunQuadraticModel(daily)
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                    ppg   R-squared:                       0.455
Model:                            OLS   Adj. R-squared:                  0.454
Method:                 Least Squares   F-statistic:                     517.5
Date:                Thu, 01 Aug 2024   Prob (F-statistic):          4.57e-164
Time:                        19:15:59   Log-Likelihood:                -1497.4
No. Observations:                1241   AIC:                             3001.
Df Residuals:                    1238   BIC:                             3016.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.6980      0.067    205.757      0.0

## Exercise 12.2
Write a definition for a class named SerialCorrelationTest that extends HypothesisTest from "HypothesisTest" on page 102. It should take a series and a lag as data, compute the serial correlation of the series with the given lag, and then compute the p-value of the observed correlation.

Use this class to test whether the serial correlation in raw price data is statistically significant. Also test the residuals of the linear model and (if you did the previous exercise), the quadratic model.
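
The permutation logic behind such a test can be sketched with the standard library alone. This is an illustrative version under stated assumptions, not the `thinkstats2` implementation: the function names (`pearson`, `serial_corr`, `permutation_pvalue`) are hypothetical, the series is synthetic, and the p-value is simply the fraction of shuffled series whose absolute serial correlation is at least as large as the observed one.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def serial_corr(values, lag=1):
    """Correlation between the series and itself shifted by lag (lag >= 1)."""
    return pearson(values[lag:], values[:len(values) - lag])


def permutation_pvalue(values, lag=1, iters=1000, seed=17):
    """Returns (observed |serial correlation|, permutation p-value)."""
    rng = random.Random(seed)
    actual = abs(serial_corr(values, lag))
    shuffled = list(values)
    count = 0
    for _ in range(iters):
        # Shuffling destroys any ordering, which is the null hypothesis.
        rng.shuffle(shuffled)
        if abs(serial_corr(shuffled, lag)) >= actual:
            count += 1
    return actual, count / iters


# Synthetic series: a linear trend plus Gaussian noise (made-up data).
noise = random.Random(1)
values = [0.1 * i + noise.gauss(0, 1) for i in range(200)]

actual, pvalue = permutation_pvalue(values)
print(actual, pvalue)
```

A trending series like this one has strong serial correlation simply because neighbouring values share the trend, which is why the exercise also tests the residuals after the trend has been modelled.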

In [169]:
class SerialCorrelationTest(thinkstats2.HypothesisTest):
    """Tests serial correlations by permutation."""

    def TestStatistic(self, data):
        """Computes the test statistic.

        data: tuple of (series, lag)
        """
        series, lag = data
        test_stat = abs(SerialCorr(series, lag))
        return test_stat

    def RunModel(self):
        """Run the model of the null hypothesis.

        returns: simulated data
        """
        series, lag = self.data
        permutation = series.reindex(np.random.permutation(series.index))
        return permutation, lag

In [173]:
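def SerialCorr(series, lag=1):
    """Computes the correlation between a series and a copy of itself shifted by lag."""
    xs = series[lag:]
    ys = series.shift(lag)[lag:]  # drop the leading NaNs introduced by the shift
    corr = thinkstats2.Corr(xs, ys)
    return corr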

In [179]:
def RunLinearModel(daily):
    """Runs a linear model of prices versus years and returns (model, results)."""
    model = smf.ols("ppg ~ years", data=daily)
    results = model.fit()
    return model, results

In [185]:
# test the correlation between consecutive prices

name = "high"
daily = dailies[name]

series = daily.ppg
test = SerialCorrelationTest((series, 1))
pvalue = test.PValue()
print(test.actual, pvalue)

0.4852293761947381 0.0


In [187]:
# test for serial correlation in residuals of the linear model

_, results = RunLinearModel(daily)
series = results.resid
test = SerialCorrelationTest((series, 1))
pvalue = test.PValue()
print(test.actual, pvalue)

0.07570473767506262 0.01


In [189]:
# test for serial correlation in residuals of the quadratic model

_, results = RunQuadraticModel(daily)
series = results.resid
test = SerialCorrelationTest((series, 1))
pvalue = test.PValue()
print(test.actual, pvalue)

0.05607308161289919 0.045
