In [1]:
import seaborn as sns
sns.set()

In [2]:
import numpy as np
import pandas as pd
import datetime as dt
from static_grader import grader

# Time Series Data: Predict Temperature

Time series prediction presents challenges that differ from those of other machine-learning problems.  As with many other classes of problems, though, these predictions share a number of common features.


## A note on scoring

It **is** possible to score >1 on these questions. A score above 1 indicates that you've beaten our reference model: we compare our model's score on a test set to your score on a test set. See how high you can go!


## Fetch the data:

In [3]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'train.v2.csv.gz'

download: s3://dataincubator-course/mldata/train.v2.csv.gz to ./train.v2.csv.gz


The data can be loaded into pandas easily:

In [4]:
df = pd.read_csv('train.v2.csv.gz')
df.head()

| Unnamed: 0 | station | time | temp | dew_point | pressure | wind_speed | wind_direction | precip_hour | weather_codes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PHX | 2010-01-01 00:51 | 62.06 | 15.98 | 1024.9 | 3.0 | 20.0 | M | M |
| 1 | PHX | 2010-01-01 01:51 | 60.08 | 17.96 | 1025.3 | 4.0 | 50.0 | M | M |
| 2 | PHX | 2010-01-01 02:51 | 59.0 | 17.96 | 1025.6 | 4.0 | 30.0 | M | M |
| 3 | PHX | 2010-01-01 03:51 | 53.96 | 21.92 | 1026.0 | 0.0 | 0.0 | M | M |
| 4 | PHX | 2010-01-01 04:51 | 55.94 | 17.06 | 1026.2 | 5.0 | 40.0 | M | M |


The `station` column indicates the city.  The `time` is measured in UTC.  Both `temp` and `dew_point` are measured in degrees Fahrenheit.  The `wind_speed` is in knots, and the `precip_hour` measures the hourly precipitation in inches.

Missing values are indicated by a flag value, such as the `M` entries in the rows shown above.  Remove rows without valid temperature measurements.  You may also want to change some data types. (But keep in mind that the data provided by the grader will have the same data types that `pd.read_csv` produced.)

In [5]:
df["temp"] = pd.to_numeric(df["temp"], errors="coerce")

df.dropna(subset=["temp"], inplace=True)
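
As one possible data-type change, the `time` strings can be parsed into datetimes, which exposes the month and hour through the `.dt` accessor. The sketch below is only an illustration on a tiny hand-made frame (`sample`, with one made-up `M` temperature), not the real data. Because the grader passes the raw strings, any such conversion would need to happen inside the model rather than on `df` alone.

```python
import pandas as pd

# Tiny illustrative frame in the same shape as the CSV; the 'M' row is made up
sample = pd.DataFrame({
    "station": ["PHX", "PHX", "PHX"],
    "time": ["2010-01-01 00:51", "2010-01-01 01:51", "2010-01-01 02:51"],
    "temp": ["62.06", "M", "59.0"],
})

# Flag values such as 'M' become NaN, then those rows are dropped
sample["temp"] = pd.to_numeric(sample["temp"], errors="coerce")
sample = sample.dropna(subset=["temp"])

# Parse the timestamp strings so calendar fields are easy to extract
parsed = pd.to_datetime(sample["time"], format="%Y-%m-%d %H:%M")
months = parsed.dt.month.tolist()
hours = parsed.dt.hour.tolist()
```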

We will focus on using the temporal elements to predict the temperature.


# Questions


For each question, build a model to predict the temperature in a given city at a given time.  You will be given a DataFrame, as we got from `pd.read_csv`.  (As you can imagine, the temperature values will be nonsensical in the DataFrame you are given.)  Return a collection of predicted temperatures, one for each incoming row in the DataFrame.  

## One-city model

As you may have noticed, the data contains rows for multiple cities.  We'll deal with all of them soon, but for this first question, we'll focus on only the data from New York (`"NYC"`).  Start by isolating only those rows.

In [11]:
df_nyc = df.loc[df['station']=='NYC']

Seasonal features are nice because they are relatively safe to extrapolate into the future. There are two ways to handle seasonality.  

The simplest (and perhaps most robust) is to have a set of indicator variables. That is, make the assumption that the temperature at any given time is a function of only the month of the year and the hour of the day, and use that to predict the temperature value.

**Question**: Should month be a continuous or categorical variable?  (Recall that [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is useful to deal with categorical variables.)

Build a model to predict the temperature for a given hour in a given month in New York.
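
To make the continuous-versus-categorical question concrete, here is a small sketch (not part of the assignment solution) that one-hot encodes the twelve months. Treated as a number, December and January sit 11 units apart even though they are neighbours on the calendar; with indicator columns, every pair of months is equally far apart and the regression learns one offset per month.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Months 1..12 as a single column
months = np.arange(1, 13).reshape(-1, 1)

# Categorical treatment: one indicator column per month
encoder = OneHotEncoder(handle_unknown="ignore")
month_indicators = encoder.fit_transform(months).toarray()

# Each row has exactly one active indicator
row_sums = month_indicators.sum(axis=1)

# As a continuous value, December (12) and January (1) are 11 units apart,
# while the indicator encoding places every pair of months at the same distance
continuous_gap = abs(12 - 1)
dec_jan = np.linalg.norm(month_indicators[11] - month_indicators[0])
jun_jul = np.linalg.norm(month_indicators[5] - month_indicators[6])
```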

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

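def date2monthHr(dateStr):
    # Split 'YYYY-MM-DD HH:MM' into its date and time parts
    date, time=dateStr.split()
    res={}
    # One indicator key per month, e.g. 'month_01'
    res['month_'+date.split('-')[1]]=1
    # Hour of day (0-23), kept as a single numeric feature
    res['hr']=int(time.split(':')[0])
    return res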
    

class DateTransformer(BaseEstimator, TransformerMixin):
    # Convert 'time' strings into feature dictionaries for DictVectorizer
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X will be a pandas series. Return a pandas series of dictionaries
        X_out=[]
        for indx, date in X.items():
            X_out.append(date2monthHr(date))
        return pd.Series(X_out)

In [22]:
from sklearn.linear_model import Ridge

nyc_model=Pipeline([('select', 
                          ColumnTransformer([('time-select', 
                                          Pipeline([('transformer', DateTransformer()), 
                                                    ('vectorizer', DictVectorizer())]), 'time')])),
                    ('grid-search_Ridge', GridSearchCV(Ridge(), {'alpha':[0.01, 0.05, 0.1, 0.5, 1, 5, 10]}, cv=10))])

nyc_model.fit(df_nyc[['time']], df_nyc['temp'])

The grader will provide a DataFrame in the same format as `pd.read_csv` provided.  All of the temperature data will be redacted.  As long as your model accepts DataFrame input, you should be able to run the grader line below as-is.  If your model is expecting a different input, you will need to write an adapter function.
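
If a model were trained on something other than the raw DataFrame, say a bare Series of timestamp strings, a thin wrapper can convert the grader's input before predicting. The following is a minimal sketch of that idea; `MeanModel` and `make_adapter` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

class MeanModel:
    """Hypothetical stand-in model that expects a Series of time strings."""
    def fit(self, times, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, times):
        if not isinstance(times, pd.Series):
            raise TypeError("expected a pandas Series")
        return np.full(len(times), self.mean_)

def make_adapter(model, column="time"):
    """Return a function that takes a DataFrame and hands the model one column."""
    def predict_from_frame(frame):
        return model.predict(frame[column])
    return predict_from_frame

frame = pd.DataFrame({
    "station": ["NYC", "NYC"],
    "time": ["2010-01-01 00:51", "2010-01-01 01:51"],
    "temp": [30.0, 32.0],
})
model = MeanModel().fit(frame["time"], frame["temp"])
adapted_predict = make_adapter(model)
predictions = adapted_predict(frame)
```

The adapted function would then be passed to the grader in place of `model.predict`.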

In [23]:
grader.score('ts__one_city_model', nyc_model.predict)

Your score: 0.9784


## Per-city model

Now we want to extend this same model to handle all of the cities in our data set.  Rather than adding features to the existing model to handle this, we'll just make a new copy of the model for each city.

If your model is a single class, then this is easy&mdash;you can just instantiate your class once per city.  But it's more likely your model was a particular instance of a Pipeline.  If that's the case, make a **factory function** that returns a new copy of that Pipeline each time it's called.

In [38]:
def season_factory():
    return Pipeline([('select', 
                          ColumnTransformer([('time-select', 
                                          Pipeline([('transformer', DateTransformer()), 
                                                    ('vectorizer', DictVectorizer())]), 'time')])),
                    ('grid-search_Ridge', GridSearchCV(Ridge(), {'alpha':[0.01, 0.1, 1, 10]}, cv=5))])

Calling this function should give a new copy of the Pipeline.  If we train that new copy on the New York data, it should give us the same model as before.  (You might check this by submitting such a model to the previous `grader.score` call.)
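
A quick way to convince yourself that a factory behaves as intended is to call it twice and confirm that the results are separate objects with identical settings. The sketch below uses a hypothetical `toy_factory` in place of `season_factory` so that it runs on its own.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def toy_factory():
    # Hypothetical stand-in for season_factory: builds a fresh, unfitted pipeline
    return Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])

first = toy_factory()
second = toy_factory()

# Separate objects, so fitting one leaves the other untouched
distinct = (first is not second
            and first.named_steps["ridge"] is not second.named_steps["ridge"])

# Same configuration: compare the final estimator's hyperparameters
same_params = (first.named_steps["ridge"].get_params()
               == second.named_steps["ridge"].get_params())
```

`sklearn.base.clone` is another way to obtain an unfitted copy from a template estimator, should you prefer to keep a single prototype instead of a factory.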

While we could manually call this function for each city in our dataset, let's build a "group-by" estimator that does this for us.  This estimator should take a column name and a factory function as arguments.  The `fit` method will group the incoming data by that column, and for each group it will call the factory to create a new instance to be trained on that group.  Then, the `predict` method should look up the corresponding model for each row and perform a predict using that model.

In [91]:
from sklearn import base
import numpy as np

class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, column, estimator_factory):
        # column is the value to group by; estimator_factory can be
        # called to produce estimators
        self.models={}
        self.column, self.estimator_factory=column, estimator_factory
    
    def fit(self, X, y):
        # Create an estimator and fit it with the portion in each group
        city_list = list(X[self.column].unique())
        for city in city_list:
            data=X.loc[X[self.column]==city]
            data_y=y[X[self.column]==city]
            model=self.estimator_factory()
            self.models[city]=model.fit(data, data_y)
        return self

    def predict(self, X):
        # Call the appropriate predict method for each row of X
        y=[]
        for i in range(len(X)):
            x=X.iloc[i:i+1]
            city=list(x[self.column])[0]
            y.append(self.models[city].predict(x)[0])
        return np.array(y)

Now, we should be able to build an equivalent model for each city:

In [92]:
season_model = GroupbyEstimator('station', season_factory).fit(df, df['temp'])

Again, as long as this model accepts a DataFrame as input, you should be able to pass the `predict` method to the grader.

In [93]:
grader.score('ts__month_hour_model', season_model.predict)

Your score: 0.9570


## Handling data in arbitrary order

Submit the same model again to the following scorer:

In [94]:
grader.score('ts__shuffled_model', season_model.predict)

Your score: 0.9570


If you passed, congratulations&mdash;you avoided a common pitfall!  Move on to the next question.

But if your model suddenly behaved worse: In the previous question, we provided each city's rows in contiguous groups.  In this question, the rows were all shuffled together.  If you were predicting for a group at a time and just appending those grouped predictions for the final output, it'll be in the wrong order.

There are two ways to fix this:
1. Predict for each row individually.  This is straightforward, but very, _very_ slow.
2. Predict for each group, and then reorder the predictions to match the input order.  A common way to do this is to attach the index of the feature matrix to the predictions, and then order the full prediction series by the index of the feature matrix.

Once you've fixed your `GroupbyEstimator.predict` method, resubmit to this question.
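
The second option can be sketched as follows. This is one possible approach under simplifying assumptions, not the reference solution: `ConstantModel` is a hypothetical stand-in for a fitted per-city pipeline, and `grouped_predict` is an illustrative helper rather than a method of the class above.

```python
import numpy as np
import pandas as pd

class ConstantModel:
    """Hypothetical per-group model: always predicts one fixed value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

def grouped_predict(models, X, column):
    """Predict one group at a time, then restore the input row order."""
    # Work on a positional index so duplicate or unsorted labels cause no trouble
    X_pos = X.reset_index(drop=True)
    pieces = []
    for key, group in X_pos.groupby(column):
        preds = models[key].predict(group)
        # Keep each prediction attached to the position of its source row
        pieces.append(pd.Series(preds, index=group.index))
    # Sorting by position puts predictions back in the order of X
    return pd.concat(pieces).sort_index().to_numpy()

models = {"NYC": ConstantModel(40.0), "PHX": ConstantModel(70.0)}
shuffled = pd.DataFrame({"station": ["PHX", "NYC", "NYC", "PHX", "NYC"]})
predictions = grouped_predict(models, shuffled, "station")
```

Each model's `predict` is called once per group instead of once per row, which is typically much faster on a dataset of this size.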

## Fourier model

Let's consider another way to deal with the seasonal terms.  Since we know that temperature is roughly sinusoidal, we know that a reasonable model might be

$$ y_t = y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) + \epsilon $$

where $y_0$ and $t_0$ are parameters to be learned and $T$ is the period (one year for seasonal variation, one day for daily variation, and so on).  While this is linear in $y_0$, it is not linear in $t_0$. However, we know from Fourier analysis that the above is equivalent to

$$ y_t = A \sin\left(2\pi\frac{t}{T}\right) + B \cos\left(2\pi\frac{t}{T}\right) + \epsilon $$

which is linear in $A$ and $B$.
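
The equivalence follows from the angle-subtraction identity $\sin(a - b) = \sin a \cos b - \cos a \sin b$. Applying it with $a = 2\pi t/T$ and $b = 2\pi t_0/T$ gives

$$ y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) = y_0 \cos\left(2\pi\frac{t_0}{T}\right) \sin\left(2\pi\frac{t}{T}\right) - y_0 \sin\left(2\pi\frac{t_0}{T}\right) \cos\left(2\pi\frac{t}{T}\right) $$

so that

$$ A = y_0 \cos\left(2\pi\frac{t_0}{T}\right), \qquad B = -y_0 \sin\left(2\pi\frac{t_0}{T}\right), \qquad y_0 = \sqrt{A^2 + B^2} $$

The nonlinear parameter $t_0$ is absorbed into the two linear coefficients, which is why an ordinary linear regression can fit the phase.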

Create a model containing sinusoidal terms on one or more time scales, and fit it to the data using a linear regression.  Build a `fourier_factory` function that will return instances of this model.
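
Before building the full pipeline, it can help to check the idea on synthetic data. The sketch below uses made-up values for $y_0$, $t_0$ and $T$, fits a plain `LinearRegression` on sine and cosine features, and then recovers the amplitude and offset from the two coefficients. It illustrates the technique only and is not the model used for the scores in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free signal with made-up parameters
y0, t0, T = 10.0, 3.0, 24.0
t = np.arange(0, 240, dtype=float)
y = y0 * np.sin(2 * np.pi * (t - t0) / T)

# Sine and cosine features at the known period make the fit linear
features = np.column_stack([
    np.sin(2 * np.pi * t / T),
    np.cos(2 * np.pi * t / T),
])
reg = LinearRegression().fit(features, y)
A, B = reg.coef_

# With A = y0*cos(phi) and B = -y0*sin(phi), where phi = 2*pi*t0/T,
# the amplitude is hypot(A, B) and the phase is atan2(-B, A)
amplitude = np.hypot(A, B)
offset = (np.arctan2(-B, A) * T / (2 * np.pi)) % T
```

Adding a second pair of features with a different period (for example, one year alongside one day) extends the same linear fit to several time scales.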

In [106]:
def modifiedDate2monthHr(dateStr):
    date, time=dateStr.split()
    res={}
    month=int(date.split('-')[1])
    res['month_A']=np.sin(2*np.pi*month/12)
    res['month_B']=np.cos(2*np.pi*month/12)
    hr, minute=time.split(':')
    time=int(hr)*60+int(minute)
    res['time_A']=np.sin(2*np.pi*time/1440)
    res['time_B']=np.cos(2*np.pi*time/1440)
    return res
    
class ModifiedDateTransformer(BaseEstimator, TransformerMixin):
    # Convert 'time' strings into dictionaries of sine/cosine features
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # X will be a pandas series. Return a pandas series of dictionaries
        X_out=[]
        for indx, date in X.items():
            X_out.append(modifiedDate2monthHr(date))
        return pd.Series(X_out)
    

In [110]:
def fourier_factory():
    return Pipeline([('select', 
                          ColumnTransformer([('time-select', 
                                          Pipeline([('transformer', ModifiedDateTransformer()), 
                                                    ('vectorizer', DictVectorizer())]), 'time')])),
                    ('grid-search_Ridge', GridSearchCV(Ridge(), {'alpha':[0.001, 0.01, 0.1, 1, 10]}, cv=5))])  

A general `GroupbyEstimator` should be able to take the new factory function and build a model for each city.

In [111]:
fourier_model = GroupbyEstimator('station', fourier_factory).fit(df, df['temp'])

Submit this model to the grader.

In [112]:
grader.score('ts__fourier_model', fourier_model.predict)

Your score: 0.9898


*Copyright &copy; 2022 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*