<a href="https://colab.research.google.com/github/Coyote-Schmoyote/currency-exchange-prediction/blob/main/currency_exchange.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3. Currency exchange rate prediction model

This notebook uses Python machine learning and data science libraries to analyze time series data and build a model that predicts two exchange rates for a given day: Japanese yen per U.S. dollar (FRED series `DEXJPUS`) and U.S. dollars per euro (FRED series `DEXUSEU`).

## 1. Problem Definition

#### Problem 1 
Fill in the missing NaN values with the data from the most recent previous day. 
If there is missing data about the year or the month, ignore the data.

#### Problem 2 
With the above data, visualize each statistic and the time series.

#### Problem 3

Display a histogram of the daily change in the exchange rate, computed as the difference between each day and the previous day (day - previous day).
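
As a rough illustration of the day-over-day difference, here is a minimal sketch on a hypothetical toy series (the values and the name `rates` are made up for the example, not taken from the FRED data). It assumes `Series.diff()` is an acceptable way to compute "day - previous day".

```python
import pandas as pd

# Hypothetical toy rates, not values from the FRED dataset
rates = pd.Series(
    [100.0, 101.5, 101.0, 103.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
    name="rate",
)

# diff() computes value[t] - value[t-1]; the first entry has no
# predecessor, so it is NaN and we drop it
daily_change = rates.diff().dropna()
print(daily_change.tolist())  # [1.5, -0.5, 2.0]
```

Calling `daily_change.plot.hist()` on the result would then draw the histogram.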

#### Problem 4
Build a linear regression model to predict future prices (e.g., next day), using November 2016 as training data.
Use the price of the day as the target variable, and build a model that predicts the price of the day based on the prices from the previous days. Use December 2016 as test data.
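
One possible way to frame "predict the price of the day from the previous days" is to build lagged copies of the series as features. The sketch below is only an illustration under assumptions: the helper `make_lagged`, the choice of three lags, and the synthetic random-walk series are all hypothetical and not part of this notebook's data or final model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def make_lagged(series: pd.Series, n_lags: int = 3) -> pd.DataFrame:
    """Return the target plus its n_lags previous values as feature columns."""
    frame = pd.DataFrame({"target": series})
    for lag in range(1, n_lags + 1):
        frame[f"lag_{lag}"] = series.shift(lag)
    # The first n_lags rows have no complete history, so drop them
    return frame.dropna()


# Hypothetical synthetic random walk standing in for an exchange rate
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=40, freq="D")
series = pd.Series(100 + rng.normal(0, 0.5, size=40).cumsum(), index=dates)

data = make_lagged(series, n_lags=3)
features = ["lag_1", "lag_2", "lag_3"]

# Split chronologically (earlier rows train, later rows test), never at random
train, test = data.iloc[:30], data.iloc[30:]

model = LinearRegression().fit(train[features], train["target"])
predictions = model.predict(test[features])
```

With the real data, the chronological split would correspond to November 2016 for training and December 2016 for testing.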

## 2. Data

For this project, we load the data with `pandas_datareader.data`, a module from the `pandas-datareader` package (a separate package from `pandas`). It pulls data from various Internet sources and returns it as a pandas DataFrame. We use data from Federal Reserve Economic Data (FRED), from January 2, 2001 through December 30, 2016.

## 3. Evaluation
For evaluation of our Linear Regression model, we are going to use 3 metrics:
* Mean Squared Error (MSE)
* Root Mean Squared Error (RMSE)
* Mean Absolute Error (MAE)

> We already used these metrics in the Abalone Age Prediction Project: https://colab.research.google.com/drive/1LaPYv6-9fyaSoHuqYVK3uzLYx2CFqKgt?usp=sharing 
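
To make the three metrics concrete, here is a small sketch on hypothetical values (`y_true` and `y_pred` are made-up arrays, not model output from this notebook). It assumes scikit-learn's `mean_squared_error` and `mean_absolute_error`, with RMSE taken as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: 0.5625
rmse = np.sqrt(mse)                        # same units as the target: 0.75
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors: 0.625

print(mse, rmse, mae)
```

RMSE and MAE are expressed in the same units as the exchange rate, which makes them easier to interpret than MSE; RMSE penalizes large errors more heavily than MAE.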

In [72]:
# Import the tools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

# Import linear regression model
from sklearn.linear_model import LinearRegression

# Import model evaluation tools
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_squared_log_error

# Download the data from FRED
import pandas_datareader.data as pdr
start_date = dt.datetime(2001, 1, 2)
end_date = dt.datetime(2016, 12, 30)

# DEXJPUS: Japanese yen per U.S. dollar; DEXUSEU: U.S. dollars per euro
df = pdr.DataReader(["DEXJPUS", "DEXUSEU"], data_source="fred", start=start_date, end=end_date)


## Problem 1: Deal with missing data
In our previous projects, we didn't have any missing data. In the real world, however, most datasets have missing values. Even a small amount of missing data can cause major problems for analysis and for the machine learning process, so one of the first things to do when starting a new data science project is to make sure no missing values remain. The most common ways to handle missing data are:
* Imputation 
>Imputation fills in missing values with reasonable guesses. It is useful when only a small share of the data is missing. If the proportion of missing data is too high, the imputed values might distort the results of the machine learning model.
* Removal
>We can remove the rows that contain missing data. If the dataset is small, removal is not advisable, because there might not be enough data left to make reliable observations and produce trustworthy results.
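
The two approaches can be compared on a hypothetical toy frame (the values and the name `toy` are made up for the example). This sketch uses forward fill as the imputation strategy, which is one option among several, and `dropna()` for removal.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, for illustration only
toy = pd.DataFrame(
    {"rate": [1.10, np.nan, 1.12, np.nan, np.nan, 1.15]},
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

# Imputation: carry the last known value forward into each gap
imputed = toy.ffill()

# Removal: drop every row that contains a missing value
removed = toy.dropna()

print(len(imputed), len(removed))  # 6 3
```

Imputation keeps all six rows, while removal keeps only the three rows that were originally observed.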



In [73]:
df

DATE,DEXJPUS,DEXUSEU
2001-01-02,114.73,0.9465
2001-01-03,114.26,0.9473
2001-01-04,115.47,0.9448
2001-01-05,116.19,0.9535
2001-01-08,115.97,0.9486
...,...,...
2016-12-26,,
2016-12-27,117.52,1.0458
2016-12-28,117.66,1.0389
2016-12-29,116.32,1.0486


In [74]:
df.index

DatetimeIndex(['2001-01-02', '2001-01-03', '2001-01-04', '2001-01-05',
               '2001-01-08', '2001-01-09', '2001-01-10', '2001-01-11',
               '2001-01-12', '2001-01-15',
               ...
               '2016-12-19', '2016-12-20', '2016-12-21', '2016-12-22',
               '2016-12-23', '2016-12-26', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', name='DATE', length=4174, freq=None)

In [76]:
df.dtypes

DEXJPUS    float64
DEXUSEU    float64
dtype: object

In [79]:
# Move the DATE index into a regular column. The index is named "DATE",
# so the new column is already called "DATE" and no rename is needed.
df = df.reset_index()

In [80]:
df

,DATE,DEXJPUS,DEXUSEU
0,2001-01-02,114.73,0.9465
1,2001-01-03,114.26,0.9473
2,2001-01-04,115.47,0.9448
3,2001-01-05,116.19,0.9535
4,2001-01-08,115.97,0.9486
...,...,...,...
4169,2016-12-26,,
4170,2016-12-27,117.52,1.0458
4171,2016-12-28,117.66,1.0389
4172,2016-12-29,116.32,1.0486


In [82]:
df.columns

Index(['DATE', 'DEXJPUS', 'DEXUSEU'], dtype='object')

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4174 entries, 0 to 4173
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   DATE     4174 non-null   datetime64[ns]
 1   DEXJPUS  4020 non-null   float64       
 2   DEXUSEU  4020 non-null   float64       
dtypes: datetime64[ns](1), float64(2)
memory usage: 98.0 KB


In [86]:
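# Forward-fill: replace each NaN with the most recent previous observation.
# fillna(method="ffill") is deprecated in recent pandas; ffill() is equivalent.
df = df.ffill()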

In [88]:
df

,DATE,DEXJPUS,DEXUSEU
0,2001-01-02,114.73,0.9465
1,2001-01-03,114.26,0.9473
2,2001-01-04,115.47,0.9448
3,2001-01-05,116.19,0.9535
4,2001-01-08,115.97,0.9486
...,...,...,...
4169,2016-12-26,117.22,1.0449
4170,2016-12-27,117.52,1.0458
4171,2016-12-28,117.66,1.0389
4172,2016-12-29,116.32,1.0486


In [87]:
df.isna().sum()

DATE       0
DEXJPUS    0
DEXUSEU    0
dtype: int64