<a href="https://colab.research.google.com/github/JainAnki/ADSMI-Notebooks/blob/main/Copy_of_M3_MP8_NB_SALES_TSA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd

a = [0, 11]
b = ['This is a good dog', 'Climate is extreme']
df = pd.DataFrame(list(zip(a,b)), columns = ['col_a', 'col_b'])
df

|   | col_a | col_b |
|---|-------|-------|
| 0 | 0 | This is a good dog |
| 1 | 11 | Climate is extreme |


In [None]:
# Plain (non-regex) substring replacement
df['col_b'] = df['col_b'].str.replace('dog', 'cat', regex=False)
df

|   | col_a | col_b |
|---|-------|-------|
| 0 | 0 | This is a good cat |
| 1 | 11 | Climate is extreme |


# Applied Data Science and Machine Intelligence
## A program by IIT Madras and TalentSprint
### Mini Project: Store Sales Analysis and Prediction

## Description

Time-series analysis is an integral part of many financial and non-financial applications. It helps us understand the trends in an underlying phenomenon and forecast its future values.

## Learning Objectives

At the end of the mini project, you will be able to:
 
* Perform exhaustive Exploratory-Data-Analysis (EDA)
* Perform Data Engineering to convert the raw data into time series dataset
* Perform Time-Series-Analysis (TSA)
* Predict the sales of products

## About the Dataset

The dataset is adapted from an **ongoing** Kaggle competition, [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting/data).
However, the end goals differ slightly from those of the competition.

The description of each file is reproduced here for your convenience.

* train.csv
  - The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.
  - **store_nbr** identifies the store at which the products are sold.
  - **family** identifies the type of product sold.
  - **sales** gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
  - **onpromotion** gives the total number of items in a product family that were being promoted at a store at a given date
  
* test.csv
  - The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
  - The dates in the test data are for the 15 days after the last date in the training data.

* stores.csv
  - Store metadata, including city, state, type, and cluster.
  - cluster is a grouping of similar stores.

* oil.csv
  - Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

* holidays_events.csv
  - Holidays and Events, with metadata
  - NOTE: Pay special attention to the `transferred` column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day on which it was actually celebrated, look for the corresponding row where `type` is `Transfer`. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days of type `Bridge` are extra days added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up for by days of type `Work Day`, which are days not normally scheduled for work (e.g., Saturday) that are meant to pay back the `Bridge`.
  - `Additional` holidays are days added to a regular calendar holiday, as typically happens around Christmas (making Christmas Eve a holiday).

* transactions.csv
  - Lists the date, the store, and the number of transactions that took place

* Additional Notes:
  - Wages in the public sector are paid every two weeks, on the 15th and on the last day of the month. Supermarket sales could be affected by this.
  - A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts, donating water and other first-need products, which greatly affected supermarket sales for several weeks after the earthquake.
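
The holiday rules described above can be sketched in a few lines of `pandas`. This is a minimal illustration, not the required solution: the column names (`date`, `type`, `description`, `transferred`) are assumed to match `holidays_events.csv`, and every row except the Independencia de Guayaquil pair is made up for the example.

```python
import pandas as pd

# Toy stand-in for holidays_events.csv. Only the Guayaquil rows come from the
# description above; the other two rows are hypothetical.
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2012-10-09", "2012-10-12", "2012-12-24", "2013-01-05"]),
    "type": ["Holiday", "Transfer", "Additional", "Work Day"],
    "description": [
        "Independencia de Guayaquil",
        "Independencia de Guayaquil (moved)",
        "Example additional day",
        "Example make-up working day",
    ],
    "transferred": [True, False, False, False],
})

# A transferred holiday behaves like a normal day, and a "Work Day" row is a
# working day by definition, so neither counts as a day off.
is_day_off = (~holidays["transferred"]) & (holidays["type"] != "Work Day")
celebrated = holidays.loc[is_day_off, ["date", "type", "description"]].reset_index(drop=True)

print(celebrated)
```

The resulting `celebrated` frame could then be merged onto the sales data by `date` to build a holiday indicator.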


**Python Packages used:**  

* [`google.colab`](https://colab.research.google.com/notebooks/io.ipynb) for linking the notebook to your Google Drive
* [`pandas`](https://pandas.pydata.org/docs/reference/index.html) for DataFrames and easy reading of CSV files
* [`numpy`](https://numpy.org/doc/stable/reference/index.html#reference) for array and matrix mathematics functions
* [`sklearn`](https://scikit-learn.org/stable/user_guide.html) for data pre-processing, building ML models, and performance metrics
* [`seaborn`](https://seaborn.pydata.org/) and [`matplotlib`](https://matplotlib.org/) for plotting
* [`statsmodels`](https://www.statsmodels.org/dev/index.html) for time-series analysis
* [`datetime`](https://docs.python.org/3/library/datetime.html) for converting strings to datetime objects and vice versa


In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from datetime import datetime, timedelta, date
import statsmodels.api as sm
from google.colab import drive
from sklearn.metrics import mean_squared_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf



## Graded Exercise (10 Marks)

### Exercise 1 (2 points): Data Acquisition

- Load all the provided files into separate `pandas` DataFrames
- Check the shape of the data
- Check the datatypes
- Check the Summary
- Check the nulls present in each field
- Discard/Fill with appropriate value, if any
- Check the unique number of entries per field
- Drop the features that are either redundant or that do not help in modelling

**Hint:** Use `pandas` module
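
As a starting point, the inspection steps listed above might look like the sketch below. It reads a tiny in-memory stand-in for `oil.csv` so that it runs anywhere. The column name `dcoilwtico`, the values, and the helper `summarize` are assumptions made for illustration; in the notebook you would pass the real file path to `pd.read_csv`.

```python
import io
import pandas as pd

# In-memory stand-in for oil.csv (made-up values, two missing prices).
csv_text = """date,dcoilwtico
2013-01-01,
2013-01-02,93.1
2013-01-03,93.0
2013-01-04,
2013-01-07,93.2
"""
oil = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

def summarize(df):
    """Hypothetical helper: dtype, null count, and unique count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),
    })

print(oil.shape)       # rows, columns
print(oil.describe())  # numeric summary
summary = summarize(oil)
print(summary)

# One possible fill strategy for a price series: carry the last known value
# forward, then back-fill any gap at the very start.
oil["dcoilwtico"] = oil["dcoilwtico"].ffill().bfill()
```

Forward-filling is only one reasonable choice here; interpolation or dropping rows may suit other files better.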

In [None]:
# YOUR CODE HERE

### Exercise 2 (1 point): Basic EDA

* Visualize Sales volume based on 
	- Date
	- Location
	- Store Number
	- Product under Promotion
	- Product family

* Report in a sentence or two on:
	- Correlation of holidays and sales
	- Correlation of oil price and sales
	- Other correlated variables that the data suggests

**Hint**: 
	- Choose the appropriate DataFrames based on the data files
	- Choose appropriate charting/plotting tools from your learning experience with previous Mini-Projects

Use `pandas`, `seaborn`, `matplotlib` modules
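
One way to quantify the relationship between oil price and sales is to aggregate sales per day, join the oil prices on the date, and compute a correlation coefficient. The sketch below uses randomly generated data, so the printed value is meaningless; the column names `sales`, `store_nbr`, and `dcoilwtico` are assumed to match the real files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2016-01-01", periods=60, freq="D")

# Synthetic stand-ins for train.csv (two stores) and oil.csv.
train = pd.DataFrame({
    "date": dates.repeat(2),
    "store_nbr": np.tile([1, 2], len(dates)),
    "sales": rng.gamma(2.0, 50.0, size=2 * len(dates)),
})
oil = pd.DataFrame({
    "date": dates,
    "dcoilwtico": 40 + rng.normal(0, 1, len(dates)).cumsum(),
})

# Total sales per day across all stores, joined with that day's oil price.
daily_sales = train.groupby("date", as_index=False)["sales"].sum()
merged = daily_sales.merge(oil, on="date", how="left")

corr = merged["sales"].corr(merged["dcoilwtico"])  # Pearson by default
print(f"Pearson correlation between daily sales and oil price: {corr:.3f}")
```

The same merge-then-correlate pattern applies to a holiday indicator or to the transaction counts.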

In [None]:
# YOUR CODE HERE

### Exercise 3 (1 point): Data Engineering

* Perform the following
  - Create new features from date - Day of the month, Week of the year, Month of the year, and Year using `datetime` module
  - Check Feature Correlations
  - Remove Redundant Data columns
  - Scale the data points of the numerical features
  - Discretize the categorical features
  - Perform Label Encoding of the discretized features

**Hint**: Use `pandas` module
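
A minimal sketch of the date-derived features and label encoding, on a four-row toy frame. It uses the `pandas` `.dt` accessor rather than calling `datetime` directly, and the `is_payday` flag is an extra feature suggested by the wage note in the dataset description, not something the exercise requires.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; the family names are only examples.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-04-15", "2016-04-16", "2016-04-30", "2017-01-01"]),
    "family": ["DAIRY", "BEVERAGES", "DAIRY", "PRODUCE"],
})

# Calendar features derived from the date column.
df["day"] = df["date"].dt.day
df["week"] = df["date"].dt.isocalendar().week.astype(int)  # ISO week number
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Public-sector wages are paid on the 15th and on the last day of the month.
df["is_payday"] = ((df["day"] == 15) | df["date"].dt.is_month_end).astype(int)

# Label-encode the categorical column (classes are sorted alphabetically).
df["family_code"] = LabelEncoder().fit_transform(df["family"])

print(df)
```

Note that ISO week numbers can belong to the previous year: 2017-01-01 falls in the last ISO week of 2016.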

In [None]:
# YOUR CODE HERE

### Exercise 4 (1 point): Advanced EDA

* After Data Preparation, Visualize Sales volume based on 
	- Date
	- Location
	- Store Number
	- Product under Promotion
	- Product family

**Hint**: Use `pandas`, `seaborn`, `matplotlib` modules

In [None]:
# YOUR CODE HERE

### Exercise 5 (1 point): Time-Series-Analysis

* Convert the date into `index`, using `pandas` module
* Plot the Components:
  - Trend
  - Seasonality
  - Randomness

  **Hint**: Use `statsmodels`' [`seasonal_decompose`](https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html) function

In [None]:
# YOUR CODE HERE

### Exercise 6 (1 point): Stationarity Check:

* Check stationarity using the Augmented Dickey-Fuller (ADF) test described in this [statsmodels example](https://www.statsmodels.org/devel/examples/notebooks/generated/stationarity_detrending_adf_kpss.html)
	- If the series is stationary, proceed.
	- If not, transform it into a stationary series (for example, by differencing)

In [None]:
# YOUR CODE HERE

### Exercise 7 (1 point): Time-Series Modelling (Part-1):

* Perform the following actions for **EACH** of the following models

Actions:
 - Instantiate the model
 - Train the model
 - Fit the model
 - Predict the sales
 - Compute the Metrics - RMSE, MAPE
 - Use the provided Test Data for Testing (Prediction)   
 - **DO NOT USE `train_test_split`**: this is time-series data, so the chronological order must be preserved.
 - Visualise the train data, test data, and predicted data with different colors on the same plot

Models:

  - Moving-Average(MA) Model
  - Autoregressive(AR) Model
  - Autoregressive Moving-Average(ARMA) Model


**Hint**: 
  - Use `statsmodels` for the models
  - Use `sklearn` for the metrics
  - You may refer to this [link](https://towardsdatascience.com/how-to-use-an-autoregressive-ar-model-for-time-series-analysis-bb12b7831024)

In [None]:
# YOUR CODE HERE

### Exercise 8 (2 points): Time-Series Modelling (Part-2):

* Perform the following actions for **EACH** of the following models

Actions:
 - Instantiate the model
 - Train the model
 - Fit the model
 - Predict the sales
 - Compute the Metrics - RMSE, MAPE

Plot the following for ARIMA, SARIMA, and SARIMAX models:
  - autocorrelation function (ACF) and 
  - partial autocorrelation (PACF) plots

Models:

  - Autoregressive Integrated Moving-Average(ARIMA) Model
  - Seasonal Autoregressive Integrated Moving Average (SARIMA) Model
  - Seasonal Autoregressive Integrated Moving Average with eXogenous regressors (SARIMAX) Model

**Hint**: 
  - Use `statsmodels` for the models
  - Use `sklearn` for the metrics
  - Use `plot_acf`, `plot_pacf` for plotting the functions
  - You may refer to this [link](https://towardsdatascience.com/how-to-use-an-autoregressive-ar-model-for-time-series-analysis-bb12b7831024)


In [None]:
# YOUR CODE HERE

### Discussion and Food for Thought:

- What is the learning outcome?
- How is the data preprocessing different from the previous ML projects?
- How do the integrated models behave compared to MA, AR, and ARMA models?
- What could be other alternate/supporting metrics to determine the model's performance?
- **EXPLORE** the `VAR` (Vector Autoregression) model and repeat the above process on your own