
* Computing Platforms: Set up the Workspace for Machine Learning Projects. https://ms.pubpub.org/pub/computing
* Machine Learning for Predictions. https://ms.pubpub.org/pub/ml-prediction
* Machine Learning Packages: https://scikit-learn.org/stable/


# Part I: Import and Inspect Data

In [25]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [26]:
df = pd.read_csv('https://raw.githubusercontent.com/Rising-Stars-by-Sunshine/stats201-PS2-Yiyang/main/data/Processed_Data/CHL.csv')
df.head()

Unnamed: 0,Year,Life expectancy at birth (historical)
0,1930,32.0
1,1934,34.0
2,1936,35.0
3,1942,37.0
4,1949,41.0


# Part II: Prepare the Y variable for Regression

## 2.1. Write functions to calculate the Y variable for Regression

*(Skip this step if the Y variable already exists.)*

In [27]:
df['theta'] = df['Year']/df['Life expectancy at birth (historical)']
df.head()

Unnamed: 0,Year,Life expectancy at birth (historical),theta
0,1930,32.0,60.3125
1,1934,34.0,56.882353
2,1936,35.0,55.314286
3,1942,37.0,52.486486
4,1949,41.0,47.536585


## 2.2. Make Sure that the Data Type of Y is "numeric"

In [28]:
df.dtypes

Year                                       int64
Life expectancy at birth (historical)    float64
theta                                    float64
dtype: object

In [29]:
df['theta'] = pd.to_numeric(df['theta'])
df.dtypes

Year                                       int64
Life expectancy at birth (historical)    float64
theta                                    float64
dtype: object
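
Here `theta` is already `float64`, so `pd.to_numeric` leaves it unchanged. The conversion matters when a column was read in as text. The following is a minimal sketch on a hypothetical series (the values and the `"n/a"` entry are made up for illustration), showing how `errors="coerce"` handles entries that cannot be parsed.

```python
import pandas as pd

# Hypothetical column whose numbers were read in as strings (not from CHL.csv).
raw = pd.Series(["60.3", "56.9", "n/a"], name="theta")

# errors="coerce" replaces unparseable entries with NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)         # float64
print(clean.isna().sum())  # 1 (the "n/a" entry)
```
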

# Part IV: Create the X variables

## 4.1. Shift the Y to get past values

reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

In [30]:
# generate a new variable holding the previous observation of the Y variable
df['theta_past'] = df['theta'].shift(1)
df.head()

Unnamed: 0,Year,Life expectancy at birth (historical),theta,theta_past
0,1930,32.0,60.3125,
1,1934,34.0,56.882353,60.3125
2,1936,35.0,55.314286,56.882353
3,1942,37.0,52.486486,55.314286
4,1949,41.0,47.536585,52.486486
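
Note that `shift(1)` moves values down by one row, not by one year. The early rows of this data set are several years apart (1930, 1934, 1936, ...), so `theta_past` is the previous available observation. A small sketch on a toy frame (the numbers are illustrative, not taken from the data) shows the behavior:

```python
import pandas as pd

# Toy frame with made-up values, used only to illustrate shift().
toy = pd.DataFrame({"theta": [60.0, 57.0, 55.0, 52.0]})

# Each row receives the value from the row directly above it;
# the first row has no predecessor and becomes NaN.
toy["theta_past"] = toy["theta"].shift(1)

print(toy)
```
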


## 4.2. Calculate the Moving Averages

references: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

https://towardsdatascience.com/moving-averages-in-python-16170e20f6c

In [31]:
#@title Define the Window
window = 10 #@param {type:"number"}


In [32]:
df['theta_past_ma10'] = df['theta_past'].rolling(window=window, min_periods=1).mean()
df.head(20)

Unnamed: 0,Year,Life expectancy at birth (historical),theta,theta_past,theta_past_ma10
0,1930,32.0,60.3125,,
1,1934,34.0,56.882353,60.3125,60.3125
2,1936,35.0,55.314286,56.882353,58.597426
3,1942,37.0,52.486486,55.314286,57.503046
4,1949,41.0,47.536585,52.486486,56.248906
5,1950,43.7,44.622426,47.536585,54.506442
6,1951,44.6,43.744395,44.622426,52.859106
7,1952,45.4,42.995595,43.744395,51.557004
8,1953,46.0,42.456522,42.995595,50.486828
9,1954,46.8,41.752137,42.456522,49.594572
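
With `min_periods=1`, the rolling mean is computed as soon as one non-missing value falls inside the window, which is why rows 1 to 9 above already have an average even though fewer than 10 past values exist. The sketch below uses a toy series and a window of 3 (both chosen for illustration) to compare this with the default, where `min_periods` equals the window size.

```python
import numpy as np
import pandas as pd

# Toy series with made-up values; the leading NaN mimics theta_past in row 0.
past = pd.Series([np.nan, 60.0, 58.0, 56.0, 54.0])

# Averages over whatever non-missing values are available in the window.
ma_partial = past.rolling(window=3, min_periods=1).mean()

# Default behavior: requires 3 non-missing values before producing a result.
ma_full = past.rolling(window=3).mean()

print(pd.DataFrame({"past": past, "min_periods=1": ma_partial, "default": ma_full}))
```

The early averages produced with `min_periods=1` are based on fewer observations than the later ones, which may be worth keeping in mind when interpreting them.
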


# Part V: Train and Test Split

*reference*:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

In [33]:
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit()
print(tss)

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)


In [34]:
# change the split parameters: use 2 splits instead of the default 5
tss = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=2, test_size=None)

In [35]:
for train_idx, test_idx in tss.split(df):
    print("TRAIN:", train_idx, "TEST:", test_idx)

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26] TEST: [27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
 51]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51] TEST: [52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
 76]
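
The fold sizes printed above follow from the number of rows. With `test_size=None`, each test fold has `n_samples // (n_splits + 1)` rows, and every training fold contains all rows that precede its test fold. A short sketch, assuming 77 rows as implied by the indices 0 to 76 (the name `toy_tss` is only for this example):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 77  # assumed from the printed indices 0..76
X = np.arange(n_samples).reshape(-1, 1)

toy_tss = TimeSeriesSplit(n_splits=2)

# Collect (train size, test size) for each split.
sizes = [(len(train), len(test)) for train, test in toy_tss.split(X)]

print(sizes)  # [(27, 25), (52, 25)]
```
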


In [36]:
# after the loop, train_idx and test_idx hold the indices of the last split
train_idx

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51])

In [37]:
test_idx

array([52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76])

In [38]:
train_df = df.filter(items=train_idx, axis=0)
test_df = df.filter(items=test_idx, axis=0)
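
`TimeSeriesSplit` returns row positions, while `df.filter(items=..., axis=0)` matches index labels. The two coincide here because `df` has the default index 0, 1, 2, .... If the index were different, positional selection with `.iloc` would be the safer choice. The sketch below uses a toy frame with a hypothetical non-default index to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels (10..13) differ from the row positions (0..3).
toy = pd.DataFrame({"theta": [5.0, 4.0, 3.0, 2.0]}, index=[10, 11, 12, 13])
positions = np.array([2, 3])

# .iloc selects by position: the third and fourth rows.
by_position = toy.iloc[positions]

# filter() looks for the labels 2 and 3, which do not exist in this index.
by_label = toy.filter(items=positions, axis=0)

print(len(by_position), len(by_label))  # 2 0
```
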

In [39]:
train_df.head()

Unnamed: 0,Year,Life expectancy at birth (historical),theta,theta_past,theta_past_ma10
0,1930,32.0,60.3125,,
1,1934,34.0,56.882353,60.3125,60.3125
2,1936,35.0,55.314286,56.882353,58.597426
3,1942,37.0,52.486486,55.314286,57.503046
4,1949,41.0,47.536585,52.486486,56.248906


In [40]:
test_df.head()

Unnamed: 0,Year,Life expectancy at birth (historical),theta,theta_past,theta_past_ma10
52,1997,70.7,28.24611,28.392603,29.01068
53,1998,71.2,28.061798,28.24611,28.882839
54,1999,71.4,27.997199,28.061798,28.748191
55,2000,71.9,27.816412,27.997199,28.614282
56,2001,72.6,27.561983,27.816412,28.469453


# Part VI: Prepare the Train and Test Data for Classification and Regression

## 6.2. Regression

### 6.2.1. Define the columns (Y, X) for Regression

In [41]:
# the Y column comes first, followed by the X column
cols_R = ['theta', 'theta_past_ma10']

### 6.2.2. Define the Data Frame of Train and Test Data for Regression

In [42]:
df_R_train = train_df[cols_R]
df_R_test = test_df[cols_R]

### 6.2.3. Export the Train and Test Data for Regression

In [43]:
df_R_train.head()

Unnamed: 0,theta,theta_past_ma10
0,60.3125,
1,56.882353,60.3125
2,55.314286,58.597426
3,52.486486,57.503046
4,47.536585,56.248906
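
The first training row has no past value, so `theta_past_ma10` is missing there. Many scikit-learn estimators, `LinearRegression` among them, raise an error when the input contains missing values. One option, shown here on a toy frame with made-up numbers, is to drop such rows before fitting; whether to do this before or after exporting is a design choice left open in this notebook.

```python
import numpy as np
import pandas as pd

# Toy training frame mimicking the missing first value of theta_past_ma10.
toy_train = pd.DataFrame({
    "theta": [60.0, 57.0, 55.0],
    "theta_past_ma10": [np.nan, 60.0, 58.5],
})

# Keep only the rows where the X column is available.
toy_ready = toy_train.dropna(subset=["theta_past_ma10"])

print(toy_ready)
```
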


In [44]:
df_R_train.to_csv('Regression_Train.csv')

In [45]:
df_R_test.head()

Unnamed: 0,theta,theta_past_ma10
52,28.24611,29.01068
53,28.061798,28.882839
54,27.997199,28.748191
55,27.816412,28.614282
56,27.561983,28.469453


In [46]:
df_R_test.to_csv('Regression_Test.csv')
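
By default, `to_csv` writes the row index as an unnamed first column. When the files are read back later, passing `index_col=0` restores that column as the index instead of loading it as an extra `Unnamed: 0` column. A minimal round-trip sketch, using an in-memory buffer in place of the exported files and a toy frame with illustrative values:

```python
import io

import pandas as pd

# Toy frame standing in for df_R_test (values are illustrative).
toy = pd.DataFrame(
    {"theta": [28.2, 28.1], "theta_past_ma10": [29.0, 28.9]},
    index=[52, 53],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns the first CSV column back into the index.
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```
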