
* Computing Platforms: Set up the Workspace for Machine Learning Projects. https://ms.pubpub.org/pub/computing
* Machine Learning for Predictions. https://ms.pubpub.org/pub/ml-prediction
* Machine Learning Packages: https://scikit-learn.org/stable/


# Part I: Import and Inspect Data

In [1]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [4]:
df = pd.read_csv('/content/Processed_Data.csv')
df.head()

Unnamed: 0,Year,Country,Emigrants
0,1990,Latin America and the Caribbean,15273399
1,1995,Latin America and the Caribbean,19669704
2,2000,Latin America and the Caribbean,24628700
3,2005,Latin America and the Caribbean,29338206
4,2010,Latin America and the Caribbean,34637650


# Part II: Prepare the Y Variable for Regression

## 2.1. Write Functions to Calculate the Y Variable for Regression

*(skip the step if the Y variable already exists)*

In [6]:
df['theta'] = df['Emigrants']
df.head()

Unnamed: 0,Year,Country,Emigrants,theta
0,1990,Latin America and the Caribbean,15273399,15273399
1,1995,Latin America and the Caribbean,19669704,19669704
2,2000,Latin America and the Caribbean,24628700,24628700
3,2005,Latin America and the Caribbean,29338206,29338206
4,2010,Latin America and the Caribbean,34637650,34637650


## 2.2. Make Sure that the Data Type of Y is "numeric"

In [7]:
df.dtypes

Year          int64
Country      object
Emigrants     int64
theta         int64
dtype: object

In [8]:
df['theta'] = pd.to_numeric(df['theta'])
df.dtypes

Year          int64
Country      object
Emigrants     int64
theta         int64
dtype: object
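
In this dataset `theta` is already `int64`, so `pd.to_numeric` leaves it unchanged. The conversion matters when a column has been read as text. The following is a minimal sketch on a made-up column (hypothetical values with a thousands separator and a missing-value marker, not taken from `Processed_Data.csv`):

```python
import pandas as pd

# Hypothetical raw column that was read as strings (object dtype)
raw = pd.Series(["15273399", "19,669,704", "n/a"])

# Remove thousands separators, then convert to numbers.
# errors="coerce" turns entries that cannot be parsed into NaN
# instead of raising an error.
theta = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(theta.dtype)         # NaN forces a floating-point column
print(theta.isna().sum())  # number of entries that could not be parsed
```

Checking `isna().sum()` after the conversion is one way to see how many values were lost to coercion before moving on.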

# Part III: Prepare the Y variable for Classification

reference:

https://datatofish.com/if-condition-in-pandas-dataframe/

In [9]:
#@title Define the Congestion Threshold
cut = 15000000 #@param {type:"number"}


## 3.1. Method 1: If function

In [10]:
df['congested'] = df['theta'] >= cut
df.head()

Unnamed: 0,Year,Country,Emigrants,theta,congested
0,1990,Latin America and the Caribbean,15273399,15273399,True
1,1995,Latin America and the Caribbean,19669704,19669704,True
2,2000,Latin America and the Caribbean,24628700,24628700,True
3,2005,Latin America and the Caribbean,29338206,29338206,True
4,2010,Latin America and the Caribbean,34637650,34637650,True


In [11]:
# Start from an integer column of zeros, then flag rows at or above the threshold
df['congested'] = 0
df.loc[df['theta'] >= cut, 'congested'] = 1
df.head()

Unnamed: 0,Year,Country,Emigrants,theta,congested
0,1990,Latin America and the Caribbean,15273399,15273399,1
1,1995,Latin America and the Caribbean,19669704,19669704,1
2,2000,Latin America and the Caribbean,24628700,24628700,1
3,2005,Latin America and the Caribbean,29338206,29338206,1
4,2010,Latin America and the Caribbean,34637650,34637650,1
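
The same 0/1 flag can also be built in a single vectorized step. The sketch below uses a small made-up frame (the `demo` name and its values are illustrative assumptions) so that both sides of the threshold appear:

```python
import numpy as np
import pandas as pd

cut = 15_000_000  # same threshold as defined above

# Hypothetical values: one below, one equal to, one above the threshold
demo = pd.DataFrame({"theta": [14_000_000, 15_000_000, 19_669_704]})

# Option A: cast the boolean comparison to integers
demo["congested_a"] = (demo["theta"] >= cut).astype(int)

# Option B: np.where picks 1 where the condition holds and 0 elsewhere
demo["congested_b"] = np.where(demo["theta"] >= cut, 1, 0)

print(demo)
```

Both options give the same column. Note that a value exactly equal to `cut` is flagged as 1 because the comparison is `>=`.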


## 3.2. Method 2: Lambda function

notes: this is the method I recommend

In [None]:
df['congested'] = df['theta'].apply(lambda x: 1 if x >= cut else 0)
df.head()

Unnamed: 0,Entity,Code,Year,Low-carbon energy (TWh - equivalent),Primary energy consumption (TWh),theta,congested
0,Africa,,1965,41.118813,715.421448,0.054351,0
1,Africa,,1966,45.862911,749.141602,0.057689,0
2,Africa,,1967,47.875538,756.838013,0.059494,0
3,Africa,,1968,56.000469,799.402222,0.065467,0
4,Africa,,1969,65.352089,821.409851,0.073697,0


## 3.3. Method 3: Cut function

reference: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

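notes: I do not recommend this method if you are new to data science. The outputs in this subsection come from a dataset in which theta is a ratio between 0 and 1 (gas_used / gas_limit), which is why the bins are cut at 0.95.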

In [None]:
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,29999943,0.519957,0


In [None]:
# Bins are right-closed by default: (0, 0.95] and (0.95, 1].
# theta == 0 falls outside the first bin and becomes NaN.
congested = pd.cut(df['theta'], bins=[0, 0.95, 1], labels=[0, 1])
df.insert(3, 'congested2', congested)
df.head()

Unnamed: 0,number,timestamp,gas_used,congested2,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,0.0,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,1.0,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,1.0,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,0.0,29999943,0.519957,0


In [None]:
# Widen the outer edges to (-1, 0.95] and (0.95, 2] so that theta == 0 gets a label
congested = pd.cut(df['theta'], bins=[-1, 0.95, 2], labels=[0, 1])
df.insert(3, 'congested3', congested)
df.head()

Unnamed: 0,number,timestamp,gas_used,congested3,congested2,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,0,,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,0,0.0,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,1,1.0,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,1,1.0,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,0,0.0,29999943,0.519957,0
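
The boundary behavior of `pd.cut` can be checked on a few hand-picked values. The sketch below compares the default call with two documented options, `include_lowest` and `right`. The test values and variable names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hand-picked values that sit on or between the bin edges
theta = pd.Series([0.0, 0.5, 0.95, 1.0])

# Default: right-closed bins (0, 0.95] and (0.95, 1]; 0.0 lies outside -> NaN
default = pd.cut(theta, bins=[0, 0.95, 1], labels=[0, 1])

# include_lowest=True closes the first bin on the left: [0, 0.95]
lowest = pd.cut(theta, bins=[0, 0.95, 1], labels=[0, 1], include_lowest=True)

# right=False gives [0, 0.95) and [0.95, inf), which matches `theta >= 0.95`
left_closed = pd.cut(theta, bins=[0, 0.95, np.inf], labels=[0, 1], right=False)

print(pd.DataFrame({
    "theta": theta,
    "default": default,
    "include_lowest": lowest,
    "right=False": left_closed,
}))
```

With right-closed bins a value exactly equal to 0.95 is labeled 0, whereas the `>=` comparison used in Methods 1 and 2 would label it 1. The `right=False` variant is the one that reproduces the `>=` rule.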


# Part IV: Create the X variables

## 4.1. Shift the Y to get past values

reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

In [12]:
# create a new variable holding the previous observation of the Y variable for regression
df['theta_past'] = df['theta'].shift(1)
df.head()

Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past
0,1990,Latin America and the Caribbean,15273399,15273399,1,
1,1995,Latin America and the Caribbean,19669704,19669704,1,15273399.0
2,2000,Latin America and the Caribbean,24628700,24628700,1,19669704.0
3,2005,Latin America and the Caribbean,29338206,29338206,1,24628700.0
4,2010,Latin America and the Caribbean,34637650,34637650,1,29338206.0


## 4.2. Calculate the Moving Averages

references: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

https://towardsdatascience.com/moving-averages-in-python-16170e20f6c

In [15]:
#@title Define the Window
window = 2 #@param {type:"number"}


In [16]:
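# Moving average of the lagged Y; the "ma10" suffix is only a label, the actual length is `window`
df['theta_past_ma10'] = df['theta_past'].rolling(window=window, min_periods=1).mean()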
df.head(20)

Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past,theta_past_ma10
0,1990,Latin America and the Caribbean,15273399,15273399,1,,
1,1995,Latin America and the Caribbean,19669704,19669704,1,15273399.0,15273399.0
2,2000,Latin America and the Caribbean,24628700,24628700,1,19669704.0,17471551.5
3,2005,Latin America and the Caribbean,29338206,29338206,1,24628700.0,22149202.0
4,2010,Latin America and the Caribbean,34637650,34637650,1,29338206.0,26983453.0
5,2015,Latin America and the Caribbean,36206000,36206000,1,34637650.0,31987928.0
6,2020,Latin America and the Caribbean,42890481,42890481,1,36206000.0,35421825.0


# Part V: Train and Test Split

*reference*:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

In [17]:
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit()
print(tss)

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)


In [18]:
# change the train and test split parameters 
tss = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=2, test_size=None)

In [19]:
for train_idx, test_idx in tss.split(df):
    print("TRAIN:", train_idx, "TEST:", test_idx)

TRAIN: [0 1 2] TEST: [3 4]
TRAIN: [0 1 2 3 4] TEST: [5 6]


In [20]:
train_idx

array([0, 1, 2, 3, 4])

In [21]:
test_idx

array([5, 6])
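
After the loop finishes, `train_idx` and `test_idx` hold only the indices of the last fold. If every fold is needed later, one option is to store the splits in a list. A minimal sketch, assuming a stand-in array of 7 rows in place of `df` (the `rows` and `folds` names are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rows = np.arange(7)  # stand-in for the 7 rows of df
tss = TimeSeriesSplit(n_splits=2)

# Materialize the generator so that each fold can be accessed by position
folds = list(tss.split(rows))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} test={test_idx}")
```

`folds[-1]` is then the split used in the rest of this notebook, and `folds[0]` remains available for comparison.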

In [22]:
# tss.split returns positional indices, so select rows by position
train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

In [23]:
train_df.head()

Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past,theta_past_ma10
0,1990,Latin America and the Caribbean,15273399,15273399,1,,
1,1995,Latin America and the Caribbean,19669704,19669704,1,15273399.0,15273399.0
2,2000,Latin America and the Caribbean,24628700,24628700,1,19669704.0,17471551.5
3,2005,Latin America and the Caribbean,29338206,29338206,1,24628700.0,22149202.0
4,2010,Latin America and the Caribbean,34637650,34637650,1,29338206.0,26983453.0


In [24]:
test_df.head()

Unnamed: 0,Year,Country,Emigrants,theta,congested,theta_past,theta_past_ma10
5,2015,Latin America and the Caribbean,36206000,36206000,1,34637650.0,31987928.0
6,2020,Latin America and the Caribbean,42890481,42890481,1,36206000.0,35421825.0


# Part VI: Prepare the Train and Test Data for Classification and Regression

## 6.1. Classification

### 6.1.1. Define the columns (Y, X) for Classification

In [25]:
cols_C = ['congested','theta_past_ma10']

### 6.1.2. Define the Data Frame of Train and Test Data for Classification

In [26]:
df_C_train = train_df[cols_C]
df_C_test = test_df[cols_C]

### 6.1.3. Export the Train and Test Data for Classification

In [27]:
df_C_train.head()

Unnamed: 0,congested,theta_past_ma10
0,1,
1,1,15273399.0
2,1,17471551.5
3,1,22149202.0
4,1,26983453.0


In [28]:
df_C_train.to_csv('Classification_Train.csv')

In [29]:
df_C_test.head()

Unnamed: 0,congested,theta_past_ma10
5,1,31987928.0
6,1,35421825.0


In [30]:
df_C_test.to_csv('Classification_Test.csv')

## 6.2. Regression

### 6.2.1. Define the columns (Y, X) for Regression

In [31]:
cols_R = ['theta','theta_past_ma10']

### 6.2.2. Define the Data Frame of Train and Test Data for Regression

In [32]:
df_R_train = train_df[cols_R]
df_R_test = test_df[cols_R]

### 6.2.3. Export the Train and Test Data for Regression

In [33]:
df_R_train.head()

Unnamed: 0,theta,theta_past_ma10
0,15273399,
1,19669704,15273399.0
2,24628700,17471551.5
3,29338206,22149202.0
4,34637650,26983453.0


In [34]:
df_R_train.to_csv('Regression_Train.csv')

In [35]:
df_R_test.head()

Unnamed: 0,theta,theta_past_ma10
5,36206000,31987928.0
6,42890481,35421825.0


In [36]:
df_R_test.to_csv('Regression_Test.csv')
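
The first training row has no value for `theta_past_ma10` because no earlier observation exists. Many scikit-learn estimators reject missing values, so one option is to drop such rows before exporting. The sketch below also shows how to read an exported file back without creating an extra `Unnamed: 0` column. It rebuilds the first rows of the regression training data by hand and writes to an in-memory buffer instead of a file (both are assumptions made to keep the example self-contained):

```python
import io

import numpy as np
import pandas as pd

# First rows of the regression training data shown above
df_R_train = pd.DataFrame({
    "theta": [15273399, 19669704, 24628700],
    "theta_past_ma10": [np.nan, 15273399.0, 17471551.5],
})

# Drop rows in which the feature is undefined
clean = df_R_train.dropna(subset=["theta_past_ma10"])

# Write to an in-memory buffer; a file path works the same way
buffer = io.StringIO()
clean.to_csv(buffer)  # the row index is written as the first column
buffer.seek(0)

# index_col=0 restores the index instead of creating "Unnamed: 0"
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Whether to drop or impute the missing row depends on the model. With only a handful of observations, as here, dropping one row is a noticeable share of the training data.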