# P5 Starter - Time Series Analysis 

### Statistical Modeling to Deep Learning

##  Imports & Sanity Check (Do NOT Change)

In [None]:
import numpy as np 
import pandas as pd 
import os
from tqdm.notebook import tqdm
import statsmodels.api as sm # PACF, ACF
from typing import Tuple, List

# Viz:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import importlib.util
import sys

file_path = '/kaggle/input/helper/helper.py'  # full path to the file

spec = importlib.util.spec_from_file_location("helper", file_path)
helper = importlib.util.module_from_spec(spec)
sys.modules["helper"] = helper
spec.loader.exec_module(helper)

## Helper Utilities (read at least the function names so that you do not rewrite existing code)

* **make_submission**: Helps you convert your predictions to competition submission ready files.
* **rmsle**: Implementation of the metric used to evaluate your score on the leaderboard.
* **lgbm_rmsle**: Definition that can be used to do train-val type training while printing metric scores.
* **data_import**: Imports the necessary files into the notebook.
* **preprocess_holidays**: Performs some necessary cleaning on the holiday dataset.
* **preprocess_test_train**: Performs some necessary cleaning on the test and train datasets.
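
For orientation, RMSLE is commonly defined as the square root of the mean squared difference between `log1p` of the predictions and `log1p` of the targets. The sketch below is an assumption about what such a metric computes, not a copy of `helper.rmsle`; the name `rmsle_sketch` and the clipping of negative predictions are hypothetical choices.

```python
import numpy as np

def rmsle_sketch(y_true, y_pred):
    """Root mean squared logarithmic error (illustrative, not the helper's code)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: negative predictions are clipped to 0 so that log1p stays defined.
    y_pred = np.clip(y_pred, 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
```

Because the error is computed on a log scale, under-predicting a large sales value is penalized in relative rather than absolute terms.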

## Load the data (Do NOT Change)

In [None]:
#########################
# DO NOT CHANGE
#########################
train, test, stores, transactions, oil, holidays = helper.data_import()
holidays, regional, national, local, events, work_day, _, _, _ = helper.preprocess_holidays(holidays)

## Section 1: EDA & Feature Engineering

### Q1 Left join transactions to train, then print the Spearman correlation between total sales and transactions.
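
As a rough illustration of the join-then-correlate pattern, here is a sketch on toy data. The column names (`date`, `store_nbr`, `sales`, `transactions`) and the aggregation to daily totals per store are assumptions for the example, not a statement about the actual dataset schema.

```python
import pandas as pd

# Toy stand-ins for the train and transactions tables (hypothetical values).
toy_train = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02", "2017-01-03"],
    "store_nbr": [1, 1, 1, 1, 1],
    "sales": [10.0, 5.0, 20.0, 10.0, 40.0],
})
toy_transactions = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-03"],
    "store_nbr": [1, 1, 1],
    "transactions": [100, 150, 300],
})

# Total sales per store and day, then a left join that keeps every sales row.
daily_sales = toy_train.groupby(["date", "store_nbr"], as_index=False)["sales"].sum()
toy_merged = daily_sales.merge(toy_transactions, on=["date", "store_nbr"], how="left")

# Spearman correlation is rank-based, so it measures monotonic association.
spearman = toy_merged[["sales", "transactions"]].corr(method="spearman").loc["sales", "transactions"]
```

In this toy data, sales and transactions rise together on every day, so the rank correlation is 1.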

In [None]:
# TODO - q1
merged_df = pd.DataFrame()
print(f"Spearman Correlation between Total Sales and Transactions:")

### Q2 Plot an ordinary least squares (OLS) trendline between transactions and sales to verify the Spearman correlation value from Q1. [0.1 Points]

In [4]:
# TODO - q2

### Q3 Plot these line charts in the notebook:

A) Transactions vs Date (all stores color coded in the same plot) 

B) Average monthly transactions

C) Average transactions by day of the week


In [5]:
# TODO - q3 - Plot A

In [6]:
# TODO - q3 - Plot B


In [7]:
# TODO - q3 - Plot C

### Q4 Use pandas' built-in (linear) interpolation to impute the missing oil values, then overlay the imputed feature on the original.

Your new feature column should be called: `dcoilwtico_interpolated`
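
A minimal sketch of linear interpolation on a toy series follows. It assumes the raw oil price column is named `dcoilwtico` (inferred from the required output name) and uses made-up values.

```python
import numpy as np
import pandas as pd

# Toy oil prices with gaps (hypothetical values).
toy_oil = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=6, freq="D"),
    "dcoilwtico": [50.0, np.nan, 52.0, np.nan, np.nan, 58.0],
})

# Linear interpolation fills each gap along a straight line between its neighbors.
toy_oil["dcoilwtico_interpolated"] = toy_oil["dcoilwtico"].interpolate(method="linear")
```

Note that with the default settings, a missing value at the very start of the series has no earlier neighbor and stays missing, so it may need separate handling (for example a backfill).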

In [8]:
# Interpolate. 

# Plot

### Q5 Left join oil onto the dataframe above, then report the Spearman correlation between oil and sales and between oil and transactions.

In [None]:
# Find correlation with sales & transactions

print("Correlation Between Oil and Sales:")
print("Correlation Between Oil and Transactions:")

### Q6 Report the three strongest negative correlations between oil and the sales of individual product families. Then consider: should oil be discarded as a feature?

In [10]:
# Calculate all correlations

# Report the top 3

### Q7. Implement the one-hot encoder function

For this question, you only need to complete the `one_hot_encoder` function definition.
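
The building blocks can be tried on a toy frame first. This is a sketch under assumptions: the column `store type` is hypothetical, and passing the categorical columns explicitly is only one way to select them.

```python
import pandas as pd

# Toy frame with one categorical column containing a missing value.
toy = pd.DataFrame({"store type": ["A", "B", None], "sales": [1.0, 2.0, 3.0]})
original_columns = list(toy.columns)

# dummy_na=True adds an indicator column for missing values.
encoded = pd.get_dummies(toy, columns=["store type"], dummy_na=True)

# Replace spaces with underscores in the column names.
encoded.columns = [c.replace(" ", "_") for c in encoded.columns]

# Columns that did not exist before encoding.
new_columns = [c for c in encoded.columns if c not in original_columns]
```

Here `new_columns` holds the indicator columns created by the encoding, which is the kind of list the function is expected to return alongside the dataframe.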

In [11]:
def one_hot_encoder(df, nan_as_category=True) -> Tuple[pd.DataFrame, List[str]]:
    # One hot encoding (pandas can do it on 1 line!) 
    
    # Store the new columns in a list
    
    # Replace " " with "_" in column names.
    
    # Return the new dataframe and all the columns (as a list)
    pass

In [None]:
#########################
# DO NOT CHANGE. 
# NOTE: Run this after you have implemented the one_hot_encoder function above.
#########################

# train, test = helper.preprocess_test_train(merged_df, one_hot_encoder, stores)

## Section 2

### Q8. EMA

Forecast window should be >=15 days since the test set is 15 days. **For this question use 16 as the forecast window**
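
A sketch of a per-group exponential moving average on toy data is shown below. The column names are assumptions, and using `span=16` is only one possible reading of the forecast window; the flat forecast that repeats the last EMA value is likewise an illustrative choice.

```python
import pandas as pd

# Toy sales for two stores and one family (hypothetical values).
toy = pd.DataFrame({
    "store_nbr": [1] * 4 + [2] * 4,
    "family": ["GROCERY"] * 8,
    "sales": [10.0, 12.0, 11.0, 13.0, 5.0, 7.0, 6.0, 8.0],
})

# EMA computed separately for each (store, family) series.
toy["ema"] = toy.groupby(["store_nbr", "family"])["sales"].transform(
    lambda s: s.ewm(span=16, adjust=False).mean()
)

# A simple flat forecast: the last EMA value of each series, repeated over the horizon.
last_ema = toy.groupby(["store_nbr", "family"])["ema"].last()
```

With `adjust=False`, the EMA starts at the first observation and each later value blends the previous EMA with the new observation using weight `2 / (span + 1)`.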

In [14]:
# Train EMAs for each family per store (pandas has a built-in EMA function!)

In [None]:
# Make the predictions

# Use the make_submission utility function provided to save a submission CSV. 

# Submit to competition and note your RMSLE score somewhere for this model type.

# NOTE - 1: You still need to go on the right panel and click submit 
# (make_submission will NOT submit to competition -> It just makes a submission ready file)
# NOTE - 2: Ensure that you are not overwriting your submission.csv file in subsequent cells.

# helper.make_submission(test_preds=[], file_name="EMA_results.csv")

### Q9. PACF and ACF

Use `statsmodels` (`statsmodels.api` is already imported as `sm`).

In [17]:
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf

In [18]:
# 1. Group by date

### Q10. ADF Test -> ARIMA

#### Differencing technique
Differencing transforms a time series toward stationarity. This matters because the ARMA component of an ARIMA model assumes a stationary series; the `d` parameter is the number of times the series is differenced.
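
A minimal sketch of first-order differencing with pandas, on made-up values:

```python
import pandas as pd

# A toy upward-trending series (hypothetical values).
series = pd.Series([100.0, 103.0, 107.0, 112.0, 118.0])

# diff() subtracts the previous value; the first entry has no predecessor and becomes NaN.
diff_series = series.diff().dropna()
```

The differenced series contains the day-to-day changes (3, 4, 5, 6) rather than the levels, which removes a linear trend in the level.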

In [19]:
# 1. Compute and store the diff series

# 2. Drop NA or any other erroneous values.


In [None]:

# Plot the PACF
fig, ax = plt.subplots(figsize=(10, 6))

##########
# TODO: Your plot code goes here:
##########

##########
plt.xlabel('Lag')
plt.ylabel('Partial Autocorrelation')
plt.title('Partial Autocorrelation Function (PACF)')

plt.show()

### Augmented Dickey-Fuller (ADF) test

The Augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a time series is stationary or non-stationary. Stationarity is an important assumption in many time series analysis models.

The ADF test evaluates the null hypothesis that the time series has a unit root, indicating non-stationarity. The alternative hypothesis is that the time series is stationary.

When performing the ADF test, we obtain the ADF statistic and the p-value. The more negative the ADF statistic, the stronger the evidence against the null hypothesis. The p-value is the probability of observing a statistic at least as extreme as the one computed, assuming the null hypothesis is true. A low p-value (below a chosen significance level, typically 0.05) is strong evidence against the null hypothesis and suggests that the time series is stationary.

In [21]:
from statsmodels.tsa.stattools import adfuller

In [None]:
# 1. Perform the ADF test

# 2. Extract the test statistics and p-value

# 3. Print these values
print("ADF Statistic:")
print("p-value:")

The ADF statistic is approximately -11.4, which is more negative than the critical values at common significance levels. This is strong evidence against the null hypothesis of a unit root, indicating that the time series is stationary.

The p-value is approximately 6.76e-21, which is extremely close to zero. Since it is far below a typical significance level (e.g., 0.05), we reject the null hypothesis of a unit root, which supports the stationarity of the time series.

**TODO** Choose the right p, q and d values for your ARIMA model

In [23]:
# TODO: Replace with appropriate p,d,q values for ARIMA
p_arima = None

d_arima = None

q_arima = None

In [24]:
# 1. Get sales series as training data (np array with appropriate dtype)

# 2. Using the statsmodels.tsa lib, initialize an ARIMA model with the p,d,q params you defined.

# 3. Fit the model

In [25]:
# Print the post model fitting summary

In [26]:
# Make predictions & submit to competition using your best model

## Section 3

### Q11 Define a validation set. What will be the most appropriate time period for this validation set?
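
One common approach for time series is to hold out the most recent block of dates rather than a random sample, so that validation mimics forecasting the future. The sketch below assumes a `date` column and a 16-day horizon (mirroring the forecast window in Q8); both are assumptions, not requirements of the assignment.

```python
import pandas as pd

# Toy daily data covering 2017-07-01 through 2017-08-15 (hypothetical values).
toy = pd.DataFrame({
    "date": pd.date_range("2017-07-01", periods=46, freq="D"),
    "sales": range(46),
})

horizon_days = 16  # assumed validation length
cutoff = toy["date"].max() - pd.Timedelta(days=horizon_days)

# Everything up to the cutoff is training data; the final block is validation.
train_part = toy[toy["date"] <= cutoff]
val_part = toy[toy["date"] > cutoff]
```

Splitting by date keeps future observations out of the training data, which a random split would not guarantee.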

In [None]:
# Get the val set:

### Q12. LightGBM

In [None]:
import lightgbm as lgb

In [None]:
# Process your data to the appropriate dtypes, vars, etc.

In [None]:
# Use the lgb.Dataset method to initialize your dataset iterables.

# 1. Make one for the train set:

# 2. Make another for the val set you defined in Q11:


In [None]:
# Fill the dict with appropriate params (replace each None):
lgb_params = {'num_leaves': None,             # TODO
              'learning_rate': None,          # TODO
              'feature_fraction': None,       # TODO
              'max_depth': None,              # TODO
              'verbose': 20,
              'num_boost_round': None,        # TODO
              'early_stopping_rounds': None,  # TODO
              'nthread': -1}

In [None]:
# Complete the model initialization/training params
model = lgb.train(lgb_params, ...)

In [None]:
# 1. Predict the sales value on your val set using the best_iteration recorded by the LGBM
# 2. Compute and print the RMSLE on this val set.

In [None]:
# 1. Pre-process your test set to appropriate format.
# 2. Predict -> Save using make_submission -> Submit to competition
# 3. Note your RMSLE for LGBM

### Q13. CatBoost

In [None]:
from catboost import Pool, CatBoostRegressor

In [None]:
# Fill out the missing params for CatBoost here (replace each None):
catboost_params = {
    'iterations': None,             # TODO: Number of boosting rounds
    'learning_rate': None,          # TODO: Learning rate for gradient boosting
    'depth': None,                  # TODO: Depth of each tree
    'loss_function': 'RMSLE',       # Loss function (root mean squared log error)
    'eval_metric': 'RMSLE',         # Evaluation metric
    'random_seed': 42,              # Ensures reproducibility
    'early_stopping_rounds': None,  # TODO: Stops training if no improvement after this many rounds
    'verbose': 100                  # Prints training progress every 100 rounds
}

In [None]:
# 1. Define the model

# 2. Fit


In [None]:
# 3. Preprocess your test data appropriately

# 4. Make Predictions

In [None]:
# 5. Use make_submission -> Submit to competition

# 6. Note your RMSLE for this model

### Q14. XGBoost

In [None]:
from xgboost import XGBRegressor

In [None]:
# 1. Initialize model with random state = 42 to be consistent with CatBoost

# 2. Fit


In [None]:
# 3. Make Predictions.

In [None]:
# 4. make_submission -> Submit to competition 

# 5. Note your RMSLE 

### Q15. Optuna for automatic hyperparameter optimization

In [None]:
import optuna
import time

In [None]:
def objective_lgb(trial):
    # 1. Define the parameter search space
    
    # 2. Create datasets (train, val) for LightGBM

    # 3. Train the model

    # 4. Evaluate on the validation set

    # 5. Return the metric score
    
    pass

# Create Optuna study to minimize the objective function

start = time.time()
# 1. Create the optuna study and specify appropriate direction

# 2. Optimize (pay attention to recommended trials; 50 takes too long)

# 3. Get the best parameters

# 4. Print them.
# print("Best parameters:", best_params)

print("Took:", time.time() - start, "seconds")

In [None]:
# Make a competition submission using these parameters
# Note these values.

In [None]:
# Do the same for CatBoost

In [None]:
# Do the same for XGBoost

### Q16. Which of the three models (CatBoost, LightGBM, or XGBoost) gives the best score? Why do you think this model is better suited to this dataset/problem?

In [None]:
print("<Your answer goes here>")

## Optional Extra Credit Section - Achieve the lowest RMSLE score

### Cross Validation Strategies & Ensembling

In [None]:
# 1. Try different Validation sets 
# 2. Try ensembling different methods used in this assignment together