# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Exploratory Data Analysis Timeseries

## Problem Statement

Perform Exploratory Data Analysis (EDA) of Retail Sales time series data using visualizations and statistical methods.

## Learning Objectives

At the end of the mini-project, you will be able to:

* perform exploratory data analysis (EDA) of a time series
* analyze time series behaviour in qualitative and quantitative terms
* summarize the findings based on the EDA

## Dataset

The dataset contains the quarterly sales of a French retail company and has been made available through Prof. Rob Hyndman's ["Forecasting Methods & Applications"](https://robjhyndman.com/forecasting/) book. There are 24 entries, one per quarter, from 2012-03-31 to 2017-12-31.
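
To make the shape of the data concrete, the sketch below builds a synthetic stand-in with the same index: 24 quarter-end dates from 2012-03-31 to 2017-12-31. The values and the column name `Sales` are assumptions made for illustration; they are not the real data.

```python
import numpy as np
import pandas as pd

# Quarter-end dates for 2012Q1..2017Q4 (24 quarters), matching the dataset's range
dates = (
    pd.period_range("2012Q1", "2017Q4", freq="Q")
    .to_timestamp(how="end")
    .normalize()
)

# Synthetic values (NOT the real sales): linear trend plus a repeating quarterly pattern
trend = 300_000 + 15_000 * np.arange(len(dates))
season = np.tile([-40_000, 10_000, -20_000, 50_000], len(dates) // 4)

# The column name "Sales" is an assumption about the CSV layout
demo = pd.DataFrame({"Sales": trend + season}, index=dates)
demo.index.name = "Date"

print(demo.shape)
print(demo.index.min().date(), demo.index.max().date())
```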

## Introduction

Exploratory data analysis of time series data starts with data visualization, which helps answer questions such as:

- Are there consistent patterns? 
- Is there a significant trend? 
- Is seasonality important? 
- Is there evidence of the presence of business cycles? 
- Are there any outliers in the data that need to be explained by those with expert knowledge? 
- How strong are the relationships among the variables available for analysis?

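Various tools have been developed to help with these analyses, including time plots, seasonal plots, decomposition, and lag plots, all of which are used in this mini-project.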

## Grading = 10 Points

In [None]:
#@title Download Dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/ts_frenchretail.csv

### Importing libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
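# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')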
%matplotlib inline
import scipy
from pandas.plotting import lag_plot
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy import signal
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

### Import the Data

In [None]:
# Read dataset with 'Date' as index
# YOUR CODE HERE

## Exploratory Data Analysis

### **Preprocessing** (1 point)

#### Divide the sales by 1000

The sales values are in the thousands, so divide them by 1000 to make the numbers easier to work with.
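
As a minimal sketch of this rescaling, the snippet below divides a hypothetical `Sales` column by 1000 on a copy, so the raw values stay available. The numbers and the column name are assumptions, not the real data.

```python
import pandas as pd

# Hypothetical raw values; the real ones come from the CSV
raw = pd.DataFrame({"Sales": [362000.0, 385000.0, 432000.0, 341000.0]})

# Work on a copy so the original values are not overwritten
scaled = raw.copy()
scaled["Sales"] = scaled["Sales"] / 1000

print(scaled["Sales"].tolist())
```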

In [None]:
# YOUR CODE HERE

#### Check for missing values     
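
In a time series, "missing" can mean either a `NaN` value or a timestamp that is absent altogether. The sketch below checks for both on a small synthetic quarterly series; the data and the name `Sales` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series with one NaN and one quarter dropped entirely
periods = pd.period_range("2012Q1", "2013Q4", freq="Q")
s = pd.Series(
    [10.0, 12.0, np.nan, 15.0, 11.0, 13.0, 12.5, 16.0],
    index=periods,
    name="Sales",
)
s = s.drop(pd.Period("2013Q2", freq="Q"))

# 1) Missing values among the observations that are present
n_nan = int(s.isna().sum())

# 2) Quarters missing from the index, found by comparing with the full range
expected = pd.period_range(s.index.min(), s.index.max(), freq="Q")
missing_quarters = expected.difference(s.index)

print(n_nan, [str(p) for p in missing_quarters])
```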

In [None]:
# YOUR CODE HERE

### **Visualization**

#### Visualize the time series (2012 to 2017) (1 point)

In [None]:
# YOUR CODE HERE

#### Visualize the data year-wise and quarter-wise (2 points)

- Box plot to see distribution of sales in each year
- Create year-wise subplots to visualize the quarterly Sales per year
- Compute Percentage growth each year

Make a report of your observations.
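
For the growth computation, one possible approach is to aggregate the quarterly values per year and then apply `pct_change()`. The sketch below uses a synthetic series (a linear trend plus a fixed quarterly pattern), not the real sales, and the use of yearly totals rather than yearly means is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series for 2012Q1..2017Q4 (NOT the real data)
periods = pd.period_range("2012Q1", "2017Q4", freq="Q")
values = 300.0 + 15.0 * np.arange(24) + np.tile([-40.0, 10.0, -20.0, 50.0], 6)
sales = pd.Series(values, index=periods, name="Sales")

# Total sales per calendar year
yearly_total = sales.groupby(sales.index.year).sum()

# Year-over-year growth in percent; the first year has no predecessor, so it is NaN
yearly_growth_pct = yearly_total.pct_change() * 100

print(yearly_growth_pct.round(2))
```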

In [None]:
# Box plot to see distribution of sales in each year
# YOUR CODE HERE

In [None]:
# Create year-wise subplots to visualize the quarterly Sales per year
# YOUR CODE HERE

In [None]:
# Percentage growth each year
# YOUR CODE HERE

#### Visualize the distribution of the Sales (0.5 point)

Normally distributed data is not a requirement for forecasting and does not necessarily improve point forecast accuracy, but it can help stabilize the variance and narrow the prediction intervals.

Report your observations.

Hint: `sns.distplot()` (deprecated since seaborn 0.11; `sns.histplot(kde=True)` is the current equivalent)
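
A histogram can be complemented with a numeric summary such as the sample skewness. The following is a sketch on synthetic, randomly generated values (not the real sales), using plain matplotlib instead of seaborn; the bin count of 8 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for the Sales column (NOT the real data)
sales = pd.Series(rng.lognormal(mean=6.0, sigma=0.4, size=24), name="Sales")

fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(sales, bins=8, edgecolor="black")
# Mean and median lines: a gap between them hints at asymmetry
ax.axvline(sales.mean(), linestyle="--", label="mean")
ax.axvline(sales.median(), linestyle=":", label="median")
ax.set_xlabel("Sales")
ax.set_ylabel("Count")
ax.legend()

# Sample skewness: values near 0 suggest a roughly symmetric distribution
skewness = sales.skew()
print(round(skewness, 3))
```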

In [None]:
# YOUR CODE HERE

#### Visualize Quarterly trends (1 point)

Create quarterly subplots to visualize the data in each quarter across all years.

Hint: statsmodels' `quarter_plot()` function
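
The grouping behind such a plot can also be inspected numerically: group the observations by quarter and compare the per-quarter averages across years. This is a sketch with pandas only, on a synthetic series rather than the real data.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series for 2012Q1..2017Q4 (NOT the real data)
periods = pd.period_range("2012Q1", "2017Q4", freq="Q")
values = 300.0 + 15.0 * np.arange(24) + np.tile([-40.0, 10.0, -20.0, 50.0], 6)
sales = pd.Series(values, index=periods, name="Sales")

# Average of each quarter (1..4) across all years
quarter_mean = sales.groupby(sales.index.quarter).mean()

# Quarter with the highest average sales
strongest_quarter = quarter_mean.idxmax()

print(quarter_mean)
print("Strongest quarter:", strongest_quarter)
```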

In [None]:
# YOUR CODE HERE

#### Visualize the distribution of Sales in each year within a single plot (1 point)

- Do the distribution peaks shift to the right from 2012 to 2017? What does this indicate?
- Is there a change in the width of the distributions from 2012 to 2017? What does it signify?

Hint: `sns.distplot(hist=False)` (deprecated since seaborn 0.11; `sns.kdeplot()` is the current equivalent)
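
The two questions above can be cross-checked numerically: the location of each year's distribution is summarized by its mean, and its width by its standard deviation. The sketch below uses a synthetic series whose seasonal swings are assumed to grow over time; it is not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series (NOT the real data): rising trend, growing seasonal swings
periods = pd.period_range("2012Q1", "2017Q4", freq="Q")
t = np.arange(24)
amplitude = 1 + 0.1 * (t // 4)  # seasonal swings get wider each year
season = np.tile([-40.0, 10.0, -20.0, 50.0], 6) * amplitude
sales = pd.Series(300.0 + 15.0 * t + season, index=periods, name="Sales")

# Per-year location (mean) and width (standard deviation)
summary = sales.groupby(sales.index.year).agg(["mean", "std"])
print(summary.round(1))
```

A mean that rises from year to year corresponds to peaks shifting to the right, and a rising standard deviation corresponds to wider distributions.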

In [None]:
# Distribution plot of each year
# YOUR CODE HERE

#### Visualize the quarterly sales for each year using a stacked bar plot (1 point)
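
One way to prepare data for a stacked bar plot is to pivot it into a year-by-quarter table and let pandas stack the columns. The sketch below uses a synthetic series; the column names `Year`, `Quarter`, and `Sales` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic quarterly series for 2012Q1..2017Q4 (NOT the real data)
periods = pd.period_range("2012Q1", "2017Q4", freq="Q")
values = 300.0 + 15.0 * np.arange(24) + np.tile([-40.0, 10.0, -20.0, 50.0], 6)
df = pd.DataFrame({"Sales": values, "Year": periods.year, "Quarter": periods.quarter})

# One row per year, one column per quarter
table = df.pivot(index="Year", columns="Quarter", values="Sales")
table.columns = [f"Q{q}" for q in table.columns]

# Each bar is one year; the four segments are that year's quarters
ax = table.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_ylabel("Sales")
ax.legend(title="Quarter")
```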

In [None]:
# Plot stacked bar plot
# YOUR CODE HERE

#### Check if the time series data is stationary (1 point)

Hint:

For the series to be (weakly) stationary, it must have:
 - a constant mean
 - a constant variance
 - an autocovariance that depends only on the lag between observations, not on time

Visualize whether the mean is constant.

Hint: [Rolling mean](https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.mean.html)

In [None]:
# Visualize Rolling mean
# YOUR CODE HERE

Visualize whether the variance is constant.

Hint: [Rolling standard deviation](https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.std.html)

In [None]:
# Rolling standard deviation
# YOUR CODE HERE

Based on these observations, report whether the series is stationary or not.
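
As a sketch of the rolling statistics mentioned above, the snippet below computes a rolling mean and a rolling standard deviation on a synthetic series (not the real data). The window of 4 quarters is an assumption: it spans one full year, so each window contains every quarter exactly once.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series for 2012Q1..2017Q4 (NOT the real data)
periods = pd.period_range("2012Q1", "2017Q4", freq="Q")
values = 300.0 + 15.0 * np.arange(24) + np.tile([-40.0, 10.0, -20.0, 50.0], 6)
sales = pd.Series(values, index=periods, name="Sales")

window = 4  # assumption: one year of quarterly data per window
rolling_mean = sales.rolling(window=window).mean()
rolling_std = sales.rolling(window=window).std()

# The first window - 1 entries are NaN because the window is not yet full
print(rolling_mean.dropna().head())
print(rolling_std.dropna().head())
```

A rolling mean that keeps rising, as it does for this synthetic series, is evidence against a constant mean.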

#### Visualize the patterns in time series - trend, seasonality, residuals (1 point)

Hint: See Module 6 - AST3 EDA > Patterns in a time series

In [None]:
# Applying seasonal decompose
# YOUR CODE HERE

# Plotting trend, seasonality
# YOUR CODE HERE

Report whether there are any observable patterns in terms of trend, seasonality, or cyclic behavior.

### **Detrending**

#### Detrend the time series (0.5 point)

Detrending removes the trend component from a time series, leaving the fluctuations around that trend.

Hint:
- Subtract the line of best fit using `scipy.signal.detrend()`
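
To illustrate what subtracting the line of best fit means, the sketch below detrends a synthetic series (not the real data) with `scipy.signal.detrend()`, whose default is a linear fit, and compares the result with a manual least-squares line from `np.polyfit`.

```python
import numpy as np
from scipy import signal

# Synthetic quarterly values (NOT the real data): linear trend plus seasonal pattern
t = np.arange(24, dtype=float)
y = 300.0 + 15.0 * t + np.tile([-40.0, 10.0, -20.0, 50.0], 6)

# Default type="linear": fit a least-squares line and subtract it
detrended = signal.detrend(y)

# The same operation done by hand for comparison
slope, intercept = np.polyfit(t, y, deg=1)
manual = y - (slope * t + intercept)

print(np.allclose(detrended, manual))
```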

In [None]:
# YOUR CODE HERE

### **Lag Plots** (Optional)

#### Visualize the Lag plots

A lag plot is a scatter plot of a time series against a lagged copy of itself. It is commonly used to check for autocorrelation: a visible pattern, such as points clustering along a diagonal, indicates that the series is autocorrelated, while a structureless cloud suggests that the series is random white noise.

For reference, see Module 6 - AST3 > Lag Plots 

Hint: `pandas.plotting.lag_plot()`
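
The visual impression from a lag plot can be backed by a number: the correlation between y(t) and y(t + lag). The sketch below builds those pairs by hand for a synthetic, noise-free series (not the real data), so the lag-4 correlation comes out essentially perfect; real data will be noisier.

```python
import numpy as np
import pandas as pd

# Synthetic quarterly series for 2012Q1..2017Q4 (NOT the real data)
periods = pd.period_range("2012Q1", "2017Q4", freq="Q")
values = 300.0 + 15.0 * np.arange(24) + np.tile([-40.0, 10.0, -20.0, 50.0], 6)
sales = pd.Series(values, index=periods, name="Sales")

lag = 4  # one year for quarterly data

# The point pairs a lag plot would show: y(t) on the x-axis, y(t + lag) on the y-axis
pairs = pd.DataFrame({"y(t)": sales, f"y(t+{lag})": sales.shift(-lag)}).dropna()

# Pearson correlation of the pairs; Series.autocorr gives the same quantity
r_manual = pairs.corr().iloc[0, 1]
r_lag4 = sales.autocorr(lag=lag)
r_lag1 = sales.autocorr(lag=1)

print(round(r_lag1, 3), round(r_lag4, 3))
```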

In [None]:
# Visualize lag plots
# YOUR CODE HERE

### Report Analysis

- Summarize your findings about this time series.