<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

**<center><h3>Time Series Assignment</h3></center>**

---
# **Table of Contents**
---

**1.** [**Problem Statement**](#Section1)<br>
**2.** [**Objective**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)

**5.** [**Data Pre-processing**](#Section5)<br>
  - **5.1** [**Identification & Handling of Missing Data**](#Section51)<br>
  - **5.2** [**Identification & Handling of Redundant Data**](#Section52)<br>
  - **5.3** [**Identification & Handling of Inconsistent Data Types**](#Section53)<br>

**6.** [**Time Series Analysis**](#Section6)<br>
**7.** [**Time Series Forecasting**](#Section7)<br>




---
<a name = Section1></a>
# **1. Problem Statement**
---

- In late 2010, onion prices shot through the roof and caused a grave crisis.

- This was caused by a lack of rainfall in major onion-producing regions such as Maharashtra and Karnataka.

- The crisis led to political tension and large-scale hoarding by traders in the country.

- Former Prime Minister Manmohan Singh described it as "a grave concern".

- **Further Information**:
  - BBC Article in Dec 2010 - [**Stink over onion crisis is enough to make you cry**](http://www.bbc.co.uk/blogs/thereporters/soutikbiswas/2010/12/indias_onion_crisis.html)

  - Hindu OpEd in Dec 2010 - [**The political price of onions**](http://www.thehindu.com/opinion/editorial/article977100.ece)

---
<a name = Section2></a>
# **2. Objective**
---

- The objective of this assignment is to predict the price of onions in Bangalore using ARIMA.

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** to bring the libraries in sync.

- After restarting the runtime, do not re-run the install cell above (3.1) or the upgrade cell below (3.2).

In [None]:
!pip install -q --upgrade pandas-profiling
!pip install -q --upgrade statsmodels 

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high      
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
from datetime import datetime                                       # Importing datetime for datetime manipulation
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
from matplotlib.pylab import rcParams                               # Runtime configuration parameters for plot styling
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from statsmodels.tsa.seasonal import seasonal_decompose             # Seasonal decomposition using moving averages
from statsmodels.tsa.stattools import adfuller                      # Augmented Dickey-Fuller unit root test
from statsmodels.tsa.stattools import acf, pacf                     # Importing Autocorrelation and Partial Autocorrelation
from statsmodels.graphics.tsaplots import plot_acf                  # To plot Autocorrelation Function
from statsmodels.graphics.tsaplots import plot_pacf                 # To plot Partial Autocorrelation Function
#-------------------------------------------------------------------------------------------------------------------------------
from statsmodels.tsa.arima.model import ARIMA
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- The data set is based on the quantity of onions sold between 1996-01 and 2016-12, and it can be retrieved from the attached <a href = "https://raw.githubusercontent.com/insaid2018/Term-3/master/Data/Assignment/MonthWiseMarketArrivals_Clean.csv">**link**</a>.

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 10227 | 10 | 658 KB| 

|Id|Feature|Description|
|:--|:--|:--|
|01|**market**|The place where onions are sold.|
|02|**month**|Month in which onions were sold.|
|03|**year**|Year in which onions were sold.|
|04|**quantity**|Quantity of onions sold.|
|05|**priceMin**|Minimum price of onions.|
|06|**priceMax**|Maximum price of onions.|
|07|**priceMod**|Modal price of onions.|
|08|**state**|The name of the state where onions were sold.|
|09|**city**|The name of the city where onions were sold.|
|10|**date**|Date on which onions were sold.|


In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-3/master/Data/Assignment/MonthWiseMarketArrivals_Clean.csv')
print('Data Shape:', data.shape)
data.head()

<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [None]:
print('Described Column Length:', len(data.describe().columns))
data.describe().transpose()

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Identification & Handling of Missing Data**

In [None]:
missing_frame = pd.DataFrame(index = data.columns.values)
# Count and percentage of null values per feature
missing_frame['Null Frequency'] = data.isnull().sum().values
null_percent = data.isnull().sum().values / data.shape[0]
missing_frame['Null %age'] = np.round(null_percent, decimals = 4) * 100
# Count and percentage of zero values per feature
missing_frame['Zero Frequency'] = data[data == 0].count().values
zero_percent = data[data == 0].count().values / data.shape[0]
missing_frame['Zero %age'] = np.round(zero_percent, decimals = 4) * 100
missing_frame.transpose()

<a name = Section52></a>
### **5.2 Identification & Handling of Redundant Data**

- In this section **we will identify redundant rows and columns** in our data, if present.

- To handle duplicate features, we have created a custom function that identifies features with different names but identical values:

In [None]:
def duplicate_cols(dataframe):
  ls1 = []
  ls2 = []

  columns = dataframe.columns.values
  for i in range(0, len(columns)):
    for j in range(i+1, len(columns)):
      if (np.where(dataframe[columns[i]] == dataframe[columns[j]], True, False).all() == True):
        ls1.append(columns[i])
        ls2.append(columns[j])

  if ((len(ls1) == 0) & (len(ls2) == 0)):
    return None
  else:
    duplicate_frame = pd.DataFrame()
    duplicate_frame['Feature 1'] = ls1
    duplicate_frame['Feature 2'] = ls2
    return duplicate_frame

In [None]:
print('Contains Redundant Records?:', data.duplicated().any())
print('Duplicate Count:', data.duplicated().sum())
print('-----------------------------------------------------------------------')
print('Contains Redundant Features?:', duplicate_cols(data))

<a name = Section53></a>
### **5.3 Identification & Handling of Inconsistent Data Types**

- In this section we will **identify** and **handle** the **features** that may **contain inconsistent data types**.

**Before Identification & Handling of Inconsistent Data Types**

In [None]:
data.head(2)

In [None]:
type_frame = pd.DataFrame(data = data.dtypes, columns = ['Type'])
type_frame.transpose()

**Performing Operations**

In [None]:
data['date'] = pd.to_datetime(data['date'])                         # Parse date strings into datetime values
print('Success!')

**After Identification & Handling of Inconsistent Data Types**

In [None]:
type_frame = pd.DataFrame(data = data.dtypes, columns = ['Type'])
type_frame.transpose()

**Observation:**

- We have successfully handled inconsistent data types.

<a name = Section6></a>

---
# **6. Time Series Analysis**
---

- A time series deals with two columns: the temporal column (predictor) and the forecast column (prediction).

  - **Temporal:** The time, which in our case is **date**.

  - **Forecast:** The price of onions, i.e. **priceMod**.

In [None]:
def plot_trend():
  fig = plt.figure(figsize = [15, 7])
  sns.lineplot(x = 'date', y = 'priceMod', data = data, ci = None)
  plt.xlabel('Date', size = 14)
  plt.ylabel('Price of Onions', size = 14)
  plt.title('Trend in Price of Onions (1996 - 2016)', size = 16)
  plt.show()

In [None]:
plot_trend()

- Before going further, let's set date as our index, which will simplify the analysis that follows.

In [None]:
indexed_data =  data.set_index(['date'])
print('Success!')

In [None]:
indexed_data.head()

---
**<h4>Question 1:** Plot the trend and distribution of priceMod using the following instructions:</h4>

---

- Create a new dataframe named indexed_data_BANGLORE that contains data for Bangalore city only.

- Remove the features market, month, year, priceMin, priceMax, state, city and quantity.

- Sort the dataframe index using the sort_index() function.

In [None]:
def index_frame():
  # Insert your code here...
  return indexed_data_BANGLORE

In [None]:
indexed_data_BANGLORE = index_frame()
indexed_data_BANGLORE.head()

- Create a function to plot the trend and distribution of onion prices side by side.

In [None]:
def trend_dist():
  # Insert your code here...

In [None]:
trend_dist()

---
**<h4>Question 2:** Generate all seasonal components of time series by performing seasonal decomposition.</h4>

---

- Use the seasonal_decompose() function from statsmodels on the indexed_data_BANGLORE dataframe created earlier.


In [None]:
def seasonalDecompose():
  # Insert your code here...

In [None]:
seasonalDecompose()

---
**<h4>Question 3:** Create a function named <ins>rolling_means</ins> to estimate rolling statistics using the following instructions.</h4>

---

- Calculate the rolling mean using a rolling window of size 12 and store it in a roll_mean variable.

- Calculate the rolling standard deviation using a rolling window of size 12 and store it in a roll_std variable.

- Create line plots for the original data (Bangalore), the rolling mean and the rolling standard deviation.
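
As a rough illustration of how rolling statistics behave in pandas, the sketch below uses a small synthetic monthly series rather than the onion data. The names `demo_series`, `demo_mean` and `demo_std` are made up for this example and are not part of the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series used only for illustration (not the assignment data)
dates = pd.date_range(start = '2010-01-01', periods = 36, freq = 'MS')
demo_series = pd.Series(np.arange(36, dtype = float), index = dates)

# A window of 12 summarises the last 12 monthly observations at each date
demo_mean = demo_series.rolling(window = 12).mean()
demo_std = demo_series.rolling(window = 12).std()

# The first 11 entries are NaN because a full window is not yet available
print(demo_mean.dropna().head(3))
print(demo_std.dropna().head(3))
```

In this toy series the rolling mean keeps rising while the rolling standard deviation stays flat, which is the pattern a steady upward trend would be expected to produce.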


In [None]:
def rolling_means(data, feature, title_add = ''):
  # Insert your code here...

---
**<h4>Question 4:** Create a function named <ins>ADFTest</ins> to perform the Augmented Dickey–Fuller test using the following instructions.</h4>

---

- Call the adfuller function on the data column with autolag set to 'AIC' and store the result.

- Create a dataframe that reports all the metrics returned by adfuller.

- Return the dataframe object.

In [None]:
def ADFTest(data, feature, test_label = 'Original'):
  # Insert your code here...

---
**<h4>Question 5:** Check the stationarity of the priceMod feature by calling the rolling_means() and ADFTest() functions.</h4>

---

In [None]:
rolling_means(data = indexed_data_BANGLORE, feature = 'priceMod')

In [None]:
ADFTest(data = indexed_data_BANGLORE, feature = 'priceMod')

---

**<h4>Question 6:** Perform a log transformation on the priceMod feature and plot the trend as well as the distribution.</h4>

---

- Create a log_indexed_data variable by applying a log transformation to the indexed_data_BANGLORE dataframe.

In [None]:
def createFeature():
  # Insert your code here...
  return log_indexed_data

In [None]:
log_indexed_data = createFeature()

- Create a function that displays the trend as well as the distribution of the onion prices.

In [None]:
def trend_dist():
  # Insert your code here...

In [None]:
trend_dist()

---

**<h4>Question 7:** Compute rolling statistics and run the Augmented Dickey-Fuller test on the priceMod feature of both the original data and log_indexed_data.</h4>

---

- Create a function that shows a side-by-side comparison of the rolling statistics of priceMod.

In [None]:
def roll_stats():
  # Insert your code here...

In [None]:
roll_stats()

- Call the Augmented Dickey-Fuller test function created earlier on the original indexed_data_BANGLORE dataframe.

In [None]:
ADFTest(data = indexed_data_BANGLORE, feature = 'priceMod')

- Call the Augmented Dickey-Fuller test function created earlier on the log-transformed dataframe, log_indexed_data.

In [None]:
ADFTest(data = log_indexed_data, feature = 'priceMod', test_label = 'Log Transformed')

---

**<h4>Question 8:** Perform a time shift transformation on the log_indexed_data dataframe using the following operations.</h4>

---

- Take the difference between log_indexed_data and log_indexed_data shifted by periods = 1, and store it in shift_indexed_data.

- Drop all the null values using the dropna function.
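
To make the shift transformation concrete, here is a minimal sketch on four synthetic prices (not the assignment data). The names `demo_prices`, `demo_log` and `demo_shift` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic prices used only for illustration (not the assignment data)
demo_prices = pd.DataFrame({'priceMod': [100.0, 110.0, 121.0, 133.1]})

# Log transform, then subtract the series shifted by one period
demo_log = np.log(demo_prices)
demo_shift = (demo_log - demo_log.shift(periods = 1)).dropna()

# Each value is log(p_t / p_{t-1}); here every step is a 10% rise
print(demo_shift)
```

The first row is dropped because the shifted series has no predecessor for it, so the result is one observation shorter than the input.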

In [None]:
def shiftTransform():
  # Insert your code here...
  return shift_indexed_data

In [None]:
shift_indexed_data = shiftTransform()
shift_indexed_data.head()

- Plot a scatterplot of log_indexed_data against log_indexed_data shifted by periods = 1.

In [None]:
def plot_scatter():
  # Insert your code here...

In [None]:
plot_scatter()

---

**<h4>Question 9:** Compute rolling statistics and run the Augmented Dickey-Fuller test on the priceMod feature of both the original data and shift_indexed_data.</h4>

---

- Create a function that shows a side-by-side comparison of the rolling statistics of priceMod.

In [None]:
def roll_stats():
  # Insert your code here...

In [None]:
roll_stats()

- Call the Augmented Dickey-Fuller test function created earlier on the original indexed_data_BANGLORE dataframe.

In [None]:
ADFTest(data = indexed_data_BANGLORE, feature = 'priceMod')

- Call the Augmented Dickey-Fuller test function created earlier on the shift_indexed_data dataframe.

In [None]:
ADFTest(data = shift_indexed_data, feature = 'priceMod', test_label = 'Shift Transformed')

| |ADF_Test_Statistics|p-value|Used_Lags|Number_Of_Observations|Critical_Value (1%)|Critical_Value (5%)|Critical_Value (10%)|
|:--|:--|:--|:--|:--|:--|:--|:--|
|**Shift Transformed**|-7.29325|0.0|7|137|-3.47901|-2.88288|-2.57815|


<a name = Section7></a>

---
# **7. Time Series Forecasting**
---

---

**<h4>Question 10:** Estimate p and q from the autocorrelation and partial autocorrelation functions defined in statsmodels.</h4>

---

- Create two variables named <ins>ACF</ins> and <ins>PACF</ins> that hold the results of acf() and pacf() with nlags = 20.

- Create a dataframe named <ins>corrFrame</ins> from the ACF and PACF variables.
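
For intuition about what an autocorrelation value measures, the sketch below computes the standard sample autocorrelation by hand with NumPy on a synthetic alternating series. The helper `sample_acf` is a hypothetical name introduced here; it is not the statsmodels implementation, although acf() is expected to return comparable values for the same input.

```python
import numpy as np

def sample_acf(values, nlags):
  # Lag-k sample autocorrelation: sum of lagged products of deviations
  # from the mean, divided by the lag-0 sum of squares
  x = np.asarray(values, dtype = float)
  dev = x - x.mean()
  denom = np.sum(dev ** 2)
  return np.array([np.sum(dev[k:] * dev[:len(x) - k]) / denom for k in range(nlags + 1)])

# Synthetic alternating series used only for illustration
demo_acf = sample_acf([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], nlags = 2)
print(demo_acf)
```

The lag-0 value is always 1, and the alternating series gives a strongly negative lag-1 value followed by a positive lag-2 value.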

In [None]:
def autoPartialCorrFrame():
  # Estimating Autocorrelation Function
  ACF = acf(shift_indexed_data, nlags = 20)

  # Estimating Partial Autocorrelation Function
  PACF = pacf(shift_indexed_data, nlags = 20)

  # Preparing a dataframe out of Correlation Arrays
  corrFrame = pd.DataFrame(data = {'ACF': ACF, 'PACF': PACF})

  return corrFrame

In [None]:
corrFrame = autoPartialCorrFrame()

- Create two subplots and plot autocorrelation and partial autocorrelation functions simultaneously.

In [None]:
def autoCorr_partialAutoCorr():
  # Insert your code here...

In [None]:
autoCorr_partialAutoCorr()

- From the plots above, we conclude that p = 2, d = 0 and q = 1 are suitable (d = 0 because shift_indexed_data has already been differenced once).

---

**<h4>Question 11:** Create a function named <ins>actual_vs_predicted</ins> to evaluate the ARIMA model later on.</h4>

---

- Draw a line plot and a scatterplot from the input parameters.

- Add a model evaluation metric that computes the residual sum of squares (RSS) over the data.

- Show the RSS in the plot title to aid the comparison.
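
The RSS is the sum of squared differences between actual and predicted values. A minimal sketch on made-up numbers follows; the helper name `residual_sum_of_squares` and the sample values are assumptions for illustration only.

```python
import numpy as np

def residual_sum_of_squares(actual, predicted):
  # RSS = sum over all points of (actual - predicted)^2
  actual = np.asarray(actual, dtype = float)
  predicted = np.asarray(predicted, dtype = float)
  return float(np.sum((actual - predicted) ** 2))

# Made-up values: the residuals are -0.5, 0.0 and 1.0
demo_rss = residual_sum_of_squares([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print('RSS: %.4f' % demo_rss)
```

A formatted string like the one printed here can be passed to plt.title() so the metric appears above the plot.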

In [None]:
def actual_vs_predicted(actual_data, predicted_data, title):
  # Insert your code here...

---

**<h4>Question 12:** Fit an ARIMA model with the p, d and q values identified in the earlier analysis and evaluate the model.</h4>

---

- Create a function that returns the ARIMA model and the fitted model.

- Create an ARIMA model object inside the function by passing shift_indexed_data with order = [2, 0, 1].

In [None]:
def arimaModel():
  # Insert your code here...

- Print the model parameters and plot the actual vs. predicted values over the data.

In [None]:
model, model_fit = arimaModel()

- Evaluate the model by plotting a side-by-side comparison of actual and predicted data.

In [None]:
def evalModel():
  # Insert your code here...

In [None]:
evalModel()

---

**<h4>Question 13:** Perform the reverse transformation on the estimated values from the model.</h4>

---

- Create a dataframe named <ins>reverse_diff_data</ins> holding the first value of the original column, i.e. priceMod.

- Add the cumulative sum of the model_fit fitted values to the <ins>reverse_diff_data</ins> dataframe.

- Then assign the first value of the original feature back to the first value of reverse_diff_data.

- Finally, return the reverse_diff_data dataframe.
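
The idea behind these steps is that a cumulative sum undoes differencing and an exponential undoes the log. The sketch below demonstrates the round trip on four synthetic prices using the exact differences; all names (`demo_prices`, `rebuilt_log`, `rebuilt_prices`) are hypothetical. In the assignment the differences would be model-fitted values, so the reconstruction would only approximate the original series.

```python
import numpy as np
import pandas as pd

# Synthetic prices used only for illustration (not the assignment data)
demo_prices = pd.Series([100.0, 110.0, 99.0, 120.0])
demo_log = np.log(demo_prices)
demo_diff = demo_log.diff().dropna()

# Cumulative sum of the differences, added to the first log value,
# rebuilds the log series from the second point onwards
rebuilt_log = demo_log.iloc[0] + demo_diff.cumsum()

# Put the first value back, then undo the log with an exponential
rebuilt_log = pd.concat([demo_log.iloc[:1], rebuilt_log])
rebuilt_prices = np.exp(rebuilt_log)
print(rebuilt_prices)
```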

In [None]:
def reverseTransform():
  # Insert your code here...

- Call reverseTransform(), store the result in a reverse_diff_data variable and display the first five values using .head().

In [None]:
reverse_diff_data = reverseTransform()
reverse_diff_data.head()

---

**<h4>Question 14:** Perform an exponential transformation on reverse_diff_data and view the first five data points.</h4>

---

In [None]:
def inverseShift():
  # Insert your code here...

In [None]:
inverse_log_data = inverseShift()
inverse_log_data.head()


- Plot the actual values and predicted values on the original scale.

In [None]:
def plotReverseTransData():
  # Insert your code here...

In [None]:
plotReverseTransData()