# BLU04 - Time Series Concepts: Exercise notebook

Yay! Exercises! 

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
import hashlib # for grading purposes
import json
import utils
from sklearn.impute import KNNImputer

You've been hired (once again :) your career is going pretty well!) as a data scientist for a supermarket chain that wants to start extracting insights from their data.

You start by analyzing customer flow in one of the stores. Let's get our data:

In [None]:
store = utils.get_store_data()
store.head()

## Exercise 1 - Date to index

Convert the `date` column to `datetime` and set it as the index of the `store` dataframe.

Don't forget the best practice for datetime indexes!
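
As a hint rather than a solution, here is a minimal sketch on a made-up frame. The name `toy` and its values are hypothetical, and the sketch assumes the best practice in question is keeping the datetime index sorted.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of chronological order.
toy = pd.DataFrame({
    "date": ["2017-01-03", "2017-01-01", "2017-01-02"],
    "customers": [30, 10, 20],
})

toy["date"] = pd.to_datetime(toy["date"])   # strings -> datetime64 values
toy = toy.set_index("date").sort_index()    # datetime index, sorted chronologically
```

A sorted `DatetimeIndex` makes date-based slicing such as `toy.loc["2017-01-01":"2017-01-02"]` behave predictably.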

In [None]:
# store = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(''.join([str(i) for i in store.columns.to_numpy()])).encode()).hexdigest() == \
'055ab105b7156113b812cde9f7e935768c6d5899150523cd269ba767fb4f8d7f', 'Did you set the date column as index?'
assert isinstance(store.index,pd.DatetimeIndex), 'Did you convert the index to datetime?'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in store.index])).encode()).hexdigest() == \
'6b85e1eefc04198fc82eb98f0256ff132bd99170a60f4393f4288a8d128313dc', 'Did you follow the best practices for datetime indexes?'

## Exercise 2 - Time series preprocessing

### Exercise 2.1 - Look out for duplicate timestamps (ungraded)

When working with time series, we should make sure that we don't have more than one value for the same timestamp. How would you check if there are any duplicates?
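
One possible check, sketched on a hypothetical series `toy` (not the store data), relies on `Index.duplicated`:

```python
import pandas as pd

# Hypothetical toy series in which 2017-01-02 appears twice.
idx = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-02"])
toy = pd.Series([10, 20, 25], index=idx)

has_duplicates = toy.index.duplicated().any()       # True if any timestamp repeats
repeated = toy[toy.index.duplicated(keep=False)]    # every row that shares a timestamp
```

`toy.index.is_unique` gives the same yes/no answer in a single attribute.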

In [None]:
## UNGRADED CELL
# use this cell to write your code
# ...

### Exercise 2.2 - Missing days

Datasets sometimes lack rows for some timestamps, and as a data scientist you should check whether this is the case. Copy `store` to a new variable called `store_complete` that has a row for every day between the first and last dates, with no gaps. Fill the values for the added days with nulls.
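
A minimal sketch of one way to expose gap days, assuming a daily series and using a hypothetical frame `toy`; `DataFrame.asfreq` is one option, and reindexing against `pd.date_range` is another.

```python
import pandas as pd

# Hypothetical toy frame with 2017-01-03 and 2017-01-04 missing.
idx = pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"])
toy = pd.DataFrame({"customers": [10, 20, 50]}, index=idx)

# Conform to a daily frequency; days with no original row are filled with NaN.
toy_complete = toy.asfreq("D")
```

Counting the nulls with `toy_complete.isnull().sum()` then tells you how many days were missing.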

In [None]:
store_complete = store.copy()
# store_complete = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert store_complete.shape == (1672,1), 'Did you add the missing timestamps?'
assert store_complete.isnull().sum().iloc[0] == 11, "Did you fill in the missing values with nulls?"
assert hashlib.sha256(json.dumps(''.join([str(i) for i in store_complete.loc[store_complete.isnull().customers].index])
).encode()).hexdigest() == '02e28bdebe0533aee335a1cabf9d5126924e254d1739eb7796040e7add8c0a0f', \
"Did you fill in the correct missing days?"
assert hashlib.sha256(json.dumps(''.join([str(i) for i in store_complete.index])).encode()).hexdigest() == \
'f3feff5cbde9ed0ac3a05893cbaac73001c1f5554ae9627bb68cf595b5399056', 'Is the index ordered?'

## Exercise 3 - Working with timestamps

### Exercise 3.1 - Worst day in 2016

What was the worst day in terms of customers in 2016? Find the timestamp of the day with the fewest customers in 2016 and assign it to the variable `worst_day_2016`.
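
As a hint, partial string indexing can select a whole year and `idxmin` returns the index label of the minimum. The sketch below uses a hypothetical frame `toy`, not the store data.

```python
import pandas as pd

# Hypothetical toy frame spanning the end of 2015 and the start of 2016.
idx = pd.date_range("2015-12-30", periods=5, freq="D")
toy = pd.DataFrame({"customers": [5, 40, 30, 12, 25]}, index=idx)

# Select only the 2016 rows, then take the timestamp of the smallest value.
lowest_2016 = toy.loc["2016", "customers"].idxmin()
```

Here the overall minimum (5) falls in 2015, so restricting to 2016 first changes the answer.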

In [None]:
#worst_day_2016 =

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(worst_day_2016, pd.Timestamp), 'The answer should be a timestamp.'
assert hashlib.sha256(json.dumps(str(worst_day_2016)).encode()).hexdigest() == \
'e8c2fa791b4ca2388706faceea087d1b380f8697281caa58b468a6f0092e328b', 'Not correct, try again.'
print(f"The worst day in 2016 was {worst_day_2016.day} of {worst_day_2016.month_name()}. Talk about new year's blues!")

### Exercise 3.2 - Best Friday

Last Friday there were 3000 customers, and your boss said they've never seen such a high count of customers on a Friday. To check if your boss is correct, can you find the maximum number of customers that we've ever had on a Friday? Assign the answer to the variable `max_customers_friday`. The answer should be a number.
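
A small sketch of filtering by weekday, assuming a datetime index; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy series: 14 consecutive days starting on Monday 2017-01-02.
idx = pd.date_range("2017-01-02", periods=14, freq="D")
toy = pd.Series(range(14), index=idx)

# dayofweek counts from Monday=0, so Friday is 4.
fridays = toy[toy.index.dayofweek == 4]
max_on_friday = fridays.max()
```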

In [None]:
# max_customers_friday = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(str(int(max_customers_friday))).encode()).hexdigest() == \
'3d08b44ebd3b36f488ee9ed3f608fff67f88dbcfefa3e4d3c7ca46244e26d670', 'Not correct, try again.'
print(f"Yep! The highest count we ever had on a Friday was {int(max_customers_friday)} customers. Don't tell your boss.")

## Exercise 4 - Time series methods

### Exercise 4.1 - Shopping rush

A new pandemic has started, and everyone came to buy soap and isopropyl alcohol. Your boss swears to have never seen such an absolute increase in customers from one day to the next - "Yesterday there were 100 customers, today there were 5000."

To confirm if that's true, can you find the maximum increase in customers from one day to the next? Assign the answer to the variable `max_increase`. The answer should be a number.
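
Day-to-day changes can be computed with `diff`. This is a sketch on a hypothetical series `toy`; it assumes the rows are consecutive days, since `diff` compares adjacent rows regardless of the gap between their dates.

```python
import pandas as pd

# Hypothetical toy series of four consecutive days.
idx = pd.date_range("2017-01-01", periods=4, freq="D")
toy = pd.Series([100, 150, 120, 400], index=idx)

daily_change = toy.diff()          # each value minus the previous row; first entry is NaN
largest_rise = daily_change.max()  # NaN is skipped by default
```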

In [None]:
# max_increase = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(str(int(max_increase))).encode()).hexdigest() == \
'e4c02700f464096a5eafc2286588982c0f5905a92fd6c1c3c3d73908669b1fcd', 'Not correct, try again.'

### Exercise 4.2 - Bad month

Despite the shopping rush of the last few days, we had a bad month, with a monthly sum of customers below 45000. What was the last month in which we had fewer than 45000 customers? Assign the answer to the variable `last_bad_month`. The answer should be a timestamp, not the name of the month.
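
Monthly totals can be obtained with `resample`. The sketch below uses a hypothetical series `toy` and an arbitrary threshold of 100; the `"MS"` alias labels each month by its first day, whereas a month-end alias would label it by its last day.

```python
import pandas as pd

# Hypothetical toy series: 10 customers a day, except 1 a day in February and April.
idx = pd.date_range("2017-01-01", "2017-04-30", freq="D")
toy = pd.Series(10, index=idx)
toy.loc["2017-02-01":"2017-02-28"] = 1
toy.loc["2017-04-01":"2017-04-30"] = 1

monthly = toy.resample("MS").sum()     # one row per month, labelled by month start
low_months = monthly[monthly < 100]    # months below the (arbitrary) threshold
last_low_month = low_months.index[-1]  # most recent such month, as a Timestamp
```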

In [None]:
# last_bad_month = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(last_bad_month,pd.Timestamp), 'The answer should be a timestamp.'
assert hashlib.sha256(json.dumps(str(last_bad_month.year)).encode()).hexdigest() == \
'0fe9d210e9a2b48d40af858d5edbc92e1fcc022110a68a19b4fc55377e6a9de2', 'Not correct, try again.'
assert hashlib.sha256(json.dumps(str(last_bad_month.month)).encode()).hexdigest() == \
'2bf175f9655e7bb7357b9f0a7c6051465a5ae701104ffe741b98e852c0e4d460', 'Not correct, try again.'
assert hashlib.sha256(json.dumps(str(last_bad_month.day)).encode()).hexdigest() == \
'391552c099c101b131feaf24c5795a6a15bc8ec82015424e0d2b4274a369a0bf' or \
hashlib.sha256(json.dumps(str(last_bad_month.day)).encode()).hexdigest() == \
'f20e4586c63ba3b2c06a97c4e585acea4a2977c3d8d81dc2d1f2275439ad90a7', 'Not correct, try again.'

**Congrats!**

Your work is proving useful, so your boss has asked to expand your analysis to the whole chain. 

Let's get the new data:

In [None]:
chain = utils.get_stores_data()
chain.head()

In [None]:
print('We now have %0.0f data points. Wooooow!' % len(chain))

The thing is, we can't just set the index to be the day, as we now have multiple stores on the same day. 

Looks like we have to go into multi-indexing...

## Exercise 5 - Multi-indexing

### Exercise 5.1 - Set the multi-index
Convert the date to a datetime, then set the index of the `chain` dataframe to `(date, store_nbr)`.
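
A minimal sketch of building a two-level index on a hypothetical frame `toy`. It assumes that, as with a single datetime index, sorting is the good practice to follow.

```python
import pandas as pd

# Hypothetical toy frame with two stores and unsorted rows.
toy = pd.DataFrame({
    "date": ["2017-01-02", "2017-01-01", "2017-01-01"],
    "store_nbr": [1, 2, 1],
    "customers": [20, 15, 10],
})

toy["date"] = pd.to_datetime(toy["date"])
# Passing a list of columns creates a MultiIndex; sorting orders by date, then store.
toy = toy.set_index(["date", "store_nbr"]).sort_index()
```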

In [None]:
# chain = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(chain.index.dtypes.index) ==  2, 'The index should have two levels.'
assert chain.index.dtypes.index[0] == 'date', 'The first index should be date.'
assert chain.index.dtypes.index[1] == 'store_nbr', 'The second index should be store_nbr.'
assert isinstance(chain.index.get_level_values(0), pd.DatetimeIndex), 'The date should be converted to datetime.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in chain.index.get_level_values(0)])).encode()).hexdigest() == \
'924abf6d787da3822a81e4318962ecc23f52e208095113b1a083111e0a12936a', 'Did you follow the good practices for multi-index?'

### Exercise 5.2 - Store 10 in April 2016

What was the maximum daily number of customers in store 10 in April 2016? Assign the answer to the variable `max_store10`. The answer should be an integer.
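
One way to narrow a multi-indexed frame to a single store and month is a cross section followed by partial string indexing. This is a sketch on a hypothetical frame `toy`, not the chain data.

```python
import pandas as pd

# Hypothetical toy frame: four days, two stores.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2017-01-30", periods=4, freq="D"), [1, 2]],
    names=["date", "store_nbr"],
)
toy = pd.DataFrame({"customers": [5, 50, 6, 60, 7, 70, 8, 80]}, index=idx)

store_1 = toy.xs(1, level="store_nbr")               # drops the store level, leaving dates
feb_max = store_1.loc["2017-02", "customers"].max()  # February rows only
```

The result is a NumPy integer; wrapping it in `int(...)` gives a plain Python integer.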

In [None]:
# max_store10 = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(str(max_store10)).encode()).hexdigest() == \
'60bf6d3dab326daeabf97b4471f0c8f40a8d61091daf45554f074da18d715526', 'Not correct, try again.'
print(f"Correct! The maximum daily number of customers in store 10 in April 2016 was {max_store10}.")

### Exercise 5.3 - Stores in 2015 vs. 2014
How did the number of stores change from 2014 to 2015? Find the difference between the number of stores in 2015 and the number of stores in 2014, and assign the answer to the variable `nr_stores_2014_2015`.
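
Counting distinct stores per year can be sketched as follows. The frame `toy` is hypothetical, and flattening the index with `reset_index` is just one of several possible routes.

```python
import pandas as pd

# Hypothetical toy frame: two stores in 2014, three in 2015.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-05-01", "2014-05-01", "2015-05-01", "2015-05-01", "2015-05-02"]
    ),
    "store_nbr": [1, 2, 1, 2, 3],
    "customers": [10, 20, 11, 21, 31],
}).set_index(["date", "store_nbr"]).sort_index()

flat = toy.reset_index()
# Group rows by calendar year and count the distinct store numbers in each.
stores_per_year = flat.groupby(flat["date"].dt.year)["store_nbr"].nunique()
change = stores_per_year.loc[2015] - stores_per_year.loc[2014]
```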

In [None]:
# nr_stores_2014_2015 = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(str(abs(nr_stores_2014_2015))).encode()).hexdigest() == \
'd10a4bc9e0c1fa4e8f3d7ce2512b8756e47ca5fa451f373c39a1431bb88db49f', 'Not correct, try again.'

### Exercise 5.4 - Customers in 2015

Find the total number of customers in each store in 2015. The result should be a pandas dataframe called `sum_per_store_2015` where the index is the `store_nbr` and the values are the customer sums.
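
Aggregating over one index level can be sketched with `groupby(level=...)`. The frame `toy` below is hypothetical, and the year filter uses a boolean mask on the `date` level.

```python
import pandas as pd

# Hypothetical toy frame indexed by (date, store_nbr).
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-05-01", "2015-05-01", "2015-05-01", "2015-05-02"]
    ),
    "store_nbr": [1, 1, 2, 1],
    "customers": [10, 11, 21, 12],
}).set_index(["date", "store_nbr"]).sort_index()

in_2015 = toy.index.get_level_values("date").year == 2015  # boolean mask per row
totals = toy[in_2015].groupby(level="store_nbr").sum()     # DataFrame indexed by store
```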

In [None]:
# sum_per_store_2015 = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(sum_per_store_2015, pd.DataFrame), 'The result should be a dataframe.'
sum_per_store_2015 = sum_per_store_2015.sort_index()
assert sum_per_store_2015.shape[0] == 53, "There should be 53 stores."
assert hashlib.sha256(json.dumps(''.join([str(i) for i in sum_per_store_2015.to_numpy().flatten()])).encode()).hexdigest() == \
'45c1fea0de4ddcc33b2809df0e20e0c11b98053222590d9cec285ae7d3d98d92', 'Not correct, try again.'

print(f"""Good job!!
Also, the store with the highest count of customers in 2015 was store {sum_per_store_2015.idxmax().iloc[0]}, \
with a total of {sum_per_store_2015.max().iloc[0]} customers.
Now that's a lot of customers!...""")

## Exercise 6 - Time series modelling concepts

### Exercise 6.1 - Store 10

You've been asked to analyze store 10. You remember what you learned about time series at that awesome Academy in 2024. Let's impress your boss!

### 6.1 - Get the data

Using cross sections, select all data for store 10. Make sure that the time series has all the consecutive days between the start and end dates. Fill the missing values with nulls. The result should be a dataframe called `store_10` with an index corresponding to the dates and the `customers` column.

In [None]:
# store_10 = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert store_10.shape == (1672, 1), 'Did you fill in the missing dates?'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in store_10.to_numpy().flatten()])).encode()).hexdigest() == \
'9853ad11d73360f8e8404f2e21e1160c9b9e8b3ca0d05fb36a85aca3124dcc89', 'Did you fill in the null values?'
assert store_10.isnull().sum().iloc[0] == 12, "The values for missing days should be filled as nulls."

### 6.2 - Seasonality in store 10

Find the weekly and biweekly seasonality in customer numbers in store 10. Use the correlation as a proxy for seasonality. Do this in two steps.

First, calculate the weekly and biweekly shifts for the customers column and store the results in the pandas series `customers_week_before` and `customers_2_weeks_before`. Use a negative shift.

Then, use the pandas `corr` method to calculate the correlations between the original and shifted data. Store the results in the float variables `weekly_corr` and `biweekly_corr`.
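
To see what a negative shift does, here is a sketch on a hypothetical series `toy` with a perfect 7-day pattern; real data will of course correlate less cleanly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: the same weekly pattern repeated four times.
idx = pd.date_range("2017-01-02", periods=28, freq="D")
toy = pd.Series(np.tile([10, 12, 11, 13, 20, 30, 25], 4), index=idx, dtype=float)

shifted = toy.shift(-7)        # each date now holds the value observed 7 days later
lag7_corr = toy.corr(shifted)  # pairs containing NaN (the last 7 rows) are ignored
```

Because the pattern repeats exactly every 7 days, the correlation here is 1 up to floating-point error.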

In [None]:
# customers_week_before = 
# customers_2_weeks_before = 

# weekly_corr = 
# biweekly_corr = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(customers_week_before, pd.Series) and isinstance(customers_2_weeks_before, pd.Series), \
'The results should be a series.'
assert int(customers_week_before.loc['2016-01-09']) == int(customers_2_weeks_before.loc['2016-01-02']), \
"Did you shift the data correctly?"
np.testing.assert_almost_equal(weekly_corr, 0.66, decimal=2, err_msg="Not correct, try again.")
np.testing.assert_almost_equal(biweekly_corr, 0.56, decimal=2, err_msg="Not correct, try again.")

### 6.3 - Negative trends

You were very fast and impressed your boss. Now you should dive deeper and analyze data from all stores. Find out which stores have a negative trend in customer numbers. The answer should be a list of the store numbers (integers) called `stores_neg_trend`. 

This one is a bit harder and there are several ways to solve it. A few clues:
- Assume linear trends where the trend is characterized by the slope of the linear regression.
- This is just an EDA, we're not building a predictive model, so you don't need to split your dataset.
- For each store time series, don't forget to add the missing dates and impute their values using a `KNNImputer`.
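
The clues above can be sketched for a single hypothetical series `toy`. Using the time step as a second feature for the imputer is an assumption of this sketch, not something the exercise prescribes, and the number of neighbors is arbitrary. The graded answer may depend on how you configure the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

# Hypothetical toy series: steadily declining, with two missing days.
idx = pd.date_range("2017-01-01", periods=10, freq="D")
toy = pd.DataFrame(
    {"customers": [100, 98, np.nan, 94, 92, 90, np.nan, 86, 84, 82]},
    index=idx,
    dtype=float,
)

t = np.arange(len(toy)).reshape(-1, 1)  # time step 0, 1, 2, ... as the regressor

# Assumption: include the time step as a feature so neighbors are nearby days.
features = np.column_stack([t.ravel(), toy["customers"].to_numpy()])
imputed = KNNImputer(n_neighbors=2).fit_transform(features)
y = imputed[:, 1]                       # customers with the gaps filled

slope = LinearRegression().fit(t, y).coef_[0]  # slope of the fitted line
has_negative_trend = slope < 0
```

Repeating this per store and collecting the store numbers whose slope is negative yields a list of the kind the exercise asks for.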

In [None]:
# stores_neg_trend = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(stores_neg_trend,list), 'The result should be a list.'
assert len(stores_neg_trend) == 32, 'The length of the list is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in sorted(stores_neg_trend)])).encode()).hexdigest() == \
'da7dbee9e8e950abcbe18f37a82b783a75df8a7e979d33112fbd85f7ee45e947', 'Not correct, try again.'
print("Correct! Your boss is ecstatic about your work and decides to give you a 200% raise!...")