In [None]:
import seaborn as sns
sns.set()

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import gzip
from static_grader import grader

# Time Series Data: Predict Temperature

Time series prediction presents its own challenges, which differ from those of other machine-learning problems.  As with many other classes of problems, these predictions share a number of common features.


## A note on scoring

It **is** possible to score >1 on these questions. This indicates that you've beaten our reference model - we compare our model's score on a test set to your score on a test set. See how high you can go!


## Fetch the data:

In [None]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'train.txt.gz'

The columns of the data correspond to the following fields:
  - year
  - month
  - day
  - hour
  - temp
  - dew_temp
  - pressure
  - wind_angle
  - wind_speed
  - sky_code
  - rain_hour
  - rain_6hour
  - city

This function will read the data from a file handle into a Pandas DataFrame.  The grader will pass you DataFrames in this same format.

In [None]:
def load_stream(stream):
    return pd.read_csv(stream, sep=' +', engine='python',
                         names=['year', 'month', 'day', 'hour', 'temp',
                                'dew_temp', 'pressure', 'wind_angle', 
                                'wind_speed', 'sky_code', 'rain_hour',
                                'rain_6hour', 'city'])
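
To try the parsing step without downloading the file, the same `pd.read_csv` call can be run on an in-memory string. This is only a sketch: the second record is made up for illustration, and `COLUMNS` is a name introduced here, not part of the course code.

```python
import io
import pandas as pd

# Same column order as in load_stream above
COLUMNS = ['year', 'month', 'day', 'hour', 'temp', 'dew_temp', 'pressure',
           'wind_angle', 'wind_speed', 'sky_code', 'rain_hour', 'rain_6hour',
           'city']

# Two space-separated records; the second one is invented and uses the
# -9999 placeholder in the temperature field
text = ('2011 12 30 20 -11 -72 10197 220 26 4 0 0 nyc\n'
        '2011 12 30 21 -9999 -80 10200 230 31 4 0 0 bos\n')

# sep=' +' is a regular expression, so runs of spaces count as one separator
sample = pd.read_csv(io.StringIO(text), sep=' +', engine='python',
                     names=COLUMNS)
print(sample[['year', 'month', 'day', 'hour', 'temp', 'city']])
```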

In [None]:
df = load_stream(gzip.open('train.txt.gz', 'rt'))

In [None]:
df['temp'] = df['temp'].replace(-9999, np.nan)

In [None]:
df.dropna(inplace=True)

The temperature is reported in tenths of a degree Celsius.  However, not all the values are valid: in the cells above, readings of -9999 are treated as missing.  Examine the data, and remove the invalid rows.
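
A minimal sketch of this cleaning step on a toy DataFrame, assuming -9999 is the only invalid marker (the real data may contain other problems worth checking). The `temp_c` column is added here only to show the unit conversion.

```python
import numpy as np
import pandas as pd

# Toy data: temperatures in tenths of a degree Celsius, one invalid reading
toy = pd.DataFrame({'temp': [-11, -9999, 250],
                    'city': ['nyc', 'nyc', 'bos']})

# Mark the placeholder as missing, then drop rows without a temperature
toy['temp'] = toy['temp'].replace(-9999, np.nan)
clean = toy.dropna(subset=['temp'])

# Convert tenths of a degree to degrees Celsius for inspection
clean = clean.assign(temp_c=clean['temp'] / 10)
print(clean)
```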

We will focus on using the temporal elements to predict the temperature.


## Per city model


It makes sense for each city to have its own model.  Build a "group-by" estimator that takes an estimator factory as an argument and builds the resulting "group-by" estimator on each city.  That is, `fit` should create and fit a model per city, while the `predict` method should look up the corresponding model and perform a predict on each.  An estimator factory is something that returns an estimator each time it is called.  It could be a function or a class.
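
As a small illustration of the factory idea, both a function and a class can be called to get a new, unfitted estimator. The names `ridge_factory` and `factories` are hypothetical and used only for this sketch.

```python
from sklearn.linear_model import LinearRegression, Ridge

def ridge_factory():
    # A function used as a factory: returns a new, unfitted Ridge on every call
    return Ridge(alpha=1.0)

# A class is also a factory, since calling it constructs a new instance
factories = [LinearRegression, ridge_factory]

# Each city gets its own independent estimator object
models = {city: factory() for city, factory in zip(['nyc', 'bos'], factories)}

# Two calls to the same factory give two distinct objects
first, second = ridge_factory(), ridge_factory()
print(first is second)
```

Creating one estimator per group matters because fitting the same object twice overwrites the earlier fit.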

In [None]:
from sklearn import base

class GroupbyEstimator(base.BaseEstimator, base.RegressorMixin):
    
    def __init__(self, column, estimator_factory):
        # column is the value to group by; estimator_factory can be
        # called to produce estimators
        self.column = column
        self.estimator_factory = estimator_factory
    
    def fit(self, X, y):
        # Create a new estimator for each group and fit it on that group's rows
        self.est_ = {}
        for item in X[self.column].unique():
            mask = (X[self.column] == item).to_numpy()
            self.est_[item] = self.estimator_factory().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # Call the appropriate predict method for each row of X,
        # keeping the predictions in the original row order
        predictions = np.empty(len(X))
        for item in X[self.column].unique():
            mask = (X[self.column] == item).to_numpy()
            predictions[mask] = self.est_[item].predict(X[mask])
        return predictions

# Questions


For each question, build a model to predict the temperature in a given city at a given time.  You will be given a DataFrame of records in the same format as the training data, as returned by `load_stream`.  Return a list of predicted temperatures, one for each incoming record.  (As you can imagine, the temperature values will be redacted in the records you receive.)


## Month/hour model

Seasonal features are nice because they are relatively safe to extrapolate into the future. There are two ways to handle seasonality.  

The simplest (and perhaps most robust) is to have a set of indicator variables. That is, make the assumption that the temperature at any given time is a function of only the month of the year and the hour of the day, and use that to predict the temperature value.

**Question**: Should month be a continuous or categorical variable?  (Recall that [one-hot encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is useful to deal with categorical variables.)
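
One way to explore the question is to look at what one-hot encoding does to a few made-up month and hour values. This sketch assumes scikit-learn's `OneHotEncoder` with `handle_unknown='ignore'`, so that a month absent from the training rows encodes as all zeros rather than raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows: three distinct months and three distinct hours
toy = pd.DataFrame({'month': [1, 1, 7, 12], 'hour': [0, 12, 12, 23]})

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy[['month', 'hour']])

# One indicator column per distinct month plus one per distinct hour
print(encoded.shape)

# Month 3 was never seen during fit, so only the hour indicator is set
unseen = encoder.transform(pd.DataFrame({'month': [3], 'hour': [12]}))
print(unseen.toarray())
```

Note that the encoder is fitted once and then reused, so the indicator columns line up between training and prediction.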

In [None]:
from sklearn import base
from sklearn.preprocessing import OneHotEncoder
class ColumnTransformer(base.BaseEstimator, base.TransformerMixin):
    
    def __init__(self, col_names):
        self.col_names = col_names
        
    def fit(self, X, y=None):
        # Learn the categories once, on the training data
        self.encoder_ = OneHotEncoder(handle_unknown='ignore')
        self.encoder_.fit(X[self.col_names])
        return self
    
    def transform(self, X):
        # Reuse the fitted encoder so the columns match between fit and predict
        return self.encoder_.transform(X[self.col_names])

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
def season_factory():
    # One-hot encode month and hour, then fit a linear regression
    pipe = Pipeline([
        ('Transformer', ColumnTransformer(['month', 'hour'])),
        ('Regressor', LinearRegression())
    ])
    return pipe # A single estimator or a pipeline


In [None]:
season_model = GroupbyEstimator('city', season_factory).fit(df, df['temp'])

The grader will provide a DataFrame in the same format that `load_stream` produces.  All of the temperature data will be redacted.  As long as your model accepts DataFrame input, you should be able to run the grader line below as-is.  If your model is expecting a different input, you will need to write an adapter function.
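
As a rough sketch of what such an adapter could look like, suppose a model was trained on a bare NumPy array of `[month, hour]` pairs. Both `array_model` and `predict_from_frame` are hypothetical names, and the three training points are made up; treating month and hour as plain numbers here is only to keep the toy short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical model that expects a NumPy array of [month, hour] rows
array_model = LinearRegression().fit(
    np.array([[1, 0], [7, 12], [12, 23]]), [0.0, 250.0, 10.0])

def predict_from_frame(X):
    # Adapter: pick the columns the model needs and convert to an array
    return array_model.predict(X[['month', 'hour']].to_numpy())

# A one-row frame with some of the columns the grader would supply
sample = pd.DataFrame({'year': [2011], 'month': [7], 'day': [1],
                       'hour': [12], 'city': ['nyc']})
predictions = predict_from_frame(sample)
print(predictions)
```

The adapter, rather than the model's own `predict`, is then what would be passed to `grader.score`.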

In [None]:
grader.score('ts__month_hour_model', season_model.predict)

## Fourier model

Since we know that temperature is roughly sinusoidal, we know that a reasonable model might be

$$ y_t = y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) + \epsilon $$

where $y_0$ and $t_0$ are parameters to be learned and $T$ is the period: one year for seasonal variation, one day for daily, etc.  While this is linear in $y_0$, it is not linear in $t_0$. However, we know from Fourier analysis that the above is equivalent to

$$ y_t = A \sin\left(2\pi\frac{t}{T}\right) + B \cos\left(2\pi\frac{t}{T}\right) + \epsilon $$

which is linear in $A$ and $B$.
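
As a sketch of why the two forms agree, expanding the sine with the angle-difference identity gives

$$ y_0 \sin\left(2\pi\frac{t - t_0}{T}\right) = y_0 \cos\left(2\pi\frac{t_0}{T}\right) \sin\left(2\pi\frac{t}{T}\right) - y_0 \sin\left(2\pi\frac{t_0}{T}\right) \cos\left(2\pi\frac{t}{T}\right) $$

so that $A = y_0 \cos(2\pi t_0 / T)$ and $B = -y_0 \sin(2\pi t_0 / T)$. Conversely, $y_0 = \sqrt{A^2 + B^2}$ and, taking $y_0 > 0$, $t_0 = -\frac{T}{2\pi}\operatorname{atan2}(B, A)$ up to a multiple of $T$.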

Create a model containing sinusoidal terms on one or more time scales, and fit it to the data using a linear regression.
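
The idea can be checked on synthetic data before touching the real temperatures. The sketch below generates a noisy yearly sinusoid with a made-up amplitude and phase shift, fits the sine and cosine coefficients with a linear regression, and converts them back to an amplitude and phase. All numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

T = 365.25                      # period in days (yearly cycle)
y0_true, t0_true = 10.0, 100.0  # made-up amplitude and phase shift

rng = np.random.default_rng(0)
t = np.arange(3 * 365)          # three years of daily samples
y = y0_true * np.sin(2 * np.pi * (t - t0_true) / T) + rng.normal(0, 1, t.size)

# Features that make the model linear in its coefficients A and B
X = np.column_stack([np.sin(2 * np.pi * t / T), np.cos(2 * np.pi * t / T)])
model = LinearRegression().fit(X, y)
A, B = model.coef_

# Recover amplitude and phase shift from the fitted coefficients
y0_fit = np.hypot(A, B)
t0_fit = (-np.arctan2(B, A) * T / (2 * np.pi)) % T
print(y0_fit, t0_fit)
```

The recovered values should land close to the made-up ones, with the gap set by the noise level and the number of samples.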

In [None]:
q2_data_list = []
with gzip.open('train.txt.gz', 'rt') as f:
    for line in f:
        fields = line.split()
        date_str = ' '.join(fields[0:4])   # 'year month day hour'
        city = fields[12]
        temp = float(fields[4])
        if temp == -9999:                  # placeholder for a missing reading
            temp = np.nan
        q2_data_list.append([city, date_str, temp])

df_q2 = pd.DataFrame(q2_data_list, columns=['city', 'date', 'temp'])

In [None]:
df_q2.head()

In [None]:
# The date strings look like 'year month day hour', so give the format explicitly
df_q2['date'] = pd.to_datetime(df_q2['date'], format='%Y %m %d %H')
temps_df = pd.DataFrame()
temps_df['city'] = df_q2['city']
# Julian date is measured in days, so a yearly cycle has period 365.25
temps_df['Julian'] = pd.DatetimeIndex(df_q2['date']).to_julian_date()
temps_df['temp'] = df_q2['temp']
temps_df['sin(year)'] = np.sin(temps_df['Julian'] / 365.25 * 2 * np.pi)
temps_df['cos(year)'] = np.cos(temps_df['Julian'] / 365.25 * 2 * np.pi)

temps_df = temps_df.dropna(how='any')
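
The cell above builds yearly terms only. A minimal sketch of extending the same construction to several time scales follows; `fourier_features` and its `periods_in_days` parameter are hypothetical names, and the timestamps are made up. Because the Julian date is measured in days, a daily cycle simply has period 1.

```python
import numpy as np
import pandas as pd

def fourier_features(dates, periods_in_days=(365.25, 1.0)):
    # One sine and one cosine column for each requested period
    julian = np.asarray(pd.DatetimeIndex(dates).to_julian_date())
    columns = {}
    for period in periods_in_days:
        angle = 2 * np.pi * julian / period
        columns[f'sin({period})'] = np.sin(angle)
        columns[f'cos({period})'] = np.cos(angle)
    return pd.DataFrame(columns)

# Same hour on consecutive days, then the opposite hour of the first day
dates = pd.to_datetime(['2011-01-01 06:00', '2011-01-02 06:00',
                        '2011-01-01 18:00'])
features = fourier_features(dates)
print(features)
```

The daily columns repeat for the first two rows and flip sign for the third, while the yearly columns barely move over a single day.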

In [None]:
from sklearn.linear_model import LinearRegression
class fftestimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, city):
        self.city = city

    def fit(self, temps_df, y=None):
        # Fit on this city's rows only, using the precomputed yearly terms
        city_df = temps_df.loc[temps_df['city'] == self.city]
        X_train = city_df[['sin(year)', 'cos(year)']].to_numpy()
        Y_train = city_df['temp'].astype(float).to_numpy()
        self.clf_ = LinearRegression().fit(X_train, Y_train)
        return self

    def predict(self, X):
        # X is a DataFrame in the load_stream format; rebuild the same
        # yearly sine and cosine terms from its year/month/day/hour columns
        dates = pd.to_datetime(X[['year', 'month', 'day', 'hour']])
        julian = np.asarray(pd.DatetimeIndex(dates).to_julian_date())
        angle = julian / 365.25 * 2 * np.pi
        return self.clf_.predict(np.column_stack([np.sin(angle), np.cos(angle)]))

In [None]:
city_list = ['bos', 'bal', 'chi', 'nyc', 'phi']
estimator_dict = {city: fftestimator(city).fit(temps_df) for city in city_list}

In [None]:
import io

# Check a single record by parsing it into a one-row DataFrame first
record = '2011 12 30 20 -11 -72 10197 220 26 4 0 0 nyc'
record_df = load_stream(io.StringIO(record))
city = record_df['city'].iloc[0]
estimator_dict[city].predict(record_df)

In [None]:
def fourier_predict(X):
    # Route each row to the model fitted for its city
    predictions = np.full(len(X), np.nan)
    for city, model in estimator_dict.items():
        mask = (X['city'] == city).to_numpy()
        if mask.any():
            predictions[mask] = model.predict(X[mask])
    return predictions

fourier_predict(df.head())

In [None]:
grader.score('ts__fourier_model', fourier_predict)


In [None]:
# grader.score('ts__fourier_model', lambda estimator_dict: [0] * len(estimator_dict))

*Copyright &copy; 2021 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*