# Preprocessing time-series data using Pandas

This notebook introduces the basic Pandas DateTime functionality.  
The following five points will be covered:

## A. Content:

1. [Parsing DateTime](#task1)
2. [Aggregating columns](#task2)
3. [Extracting DateTime properties](#task3)
4. [Filtering and selecting specific durations](#task4)
5. [Changing the granularity of the time series](#task5)

## B. Prerequisites: 
1. Basic knowledge of Python and Pandas
2. Kaggle account

## C. Code along:
**Option 1. On Kaggle (recommended):**   
1. Sign in to www.kaggle.com 
2. Copy this notebook: https://www.kaggle.com/deenagergis/20201019-pandas-timeseries-tutorial-blank
3. Have fun coding! 

**Option 2. On your local machine:**
1. Clone the following Repository: https://github.com/Deena-Gergis/pandas_timeseries_tutorial
2. *(Optional) Create a new virtual environment*  
3. Install the requirements with `pip install -r <your_local_path>/requirements.txt` or `conda install --yes --file <your_local_path>/requirements.txt`
4. Connect to Kaggle from your local machine: https://github.com/Kaggle/kaggle-api#api-credentials
5. Open the `blank_pandas_timeseries.ipynb` notebook and code along
6. Have fun coding! 


## D. References: 

* *A 3 minute guide:* https://www.linkedin.com/pulse/your-3-minute-guide-pandas-timeseries-deena-gergis-msc-/
* *Official documentation:* https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

__________

# I. Prepare

In [None]:
## Configs and Constants
DATASET_NAME = 'shivamb/netflix-shows'
FILE_NAME = 'netflix_titles.csv'
DATA_DIR  = './data'

In [None]:
import os

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

from kaggle import KaggleApi

In [None]:
# Set default properties for plotting 
plt.rcParams['figure.figsize'] = [11, 4]
plt.rcParams['figure.dpi'] = 100 

### Download and read data from Kaggle 

In [None]:
# Download data
kaggle = KaggleApi()
kaggle.authenticate()
kaggle.dataset_download_files(DATASET_NAME, path=DATA_DIR, unzip=True)

In [None]:
# Read and show data
raw_path = os.path.join(DATA_DIR, FILE_NAME)
raw_df = pd.read_csv(raw_path)

In [None]:
raw_df.sample(5)

In [None]:
raw_df.dtypes

____________

# II. Preprocess


## Task 1: Count the number of shows added per day     

In the following section, we will parse the raw date strings into pandas datetime values and count the total number of shows added on each day.

### Parse timestamp into datetime column <a id='task1'></a>

Change the raw format to a pandas datetime format.  
Once the column has a datetime dtype, we can apply the date-specific functionality illustrated in the following tasks.
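
As a minimal sketch of this step, the snippet below parses a text column with `pd.to_datetime`. The frame `toy_raw_df` is a hypothetical stand-in for `raw_df`, and the date format `"%B %d, %Y"` (for example `September 9, 2019`) is an assumption about how `date_added` is written in the dataset; adjust it if your copy of the file differs.

```python
import pandas as pd

# Hypothetical stand-in for raw_df (the real data comes from netflix_titles.csv)
toy_raw_df = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C", "Show D"],
    "date_added": ["September 9, 2019", " January 1, 2016", "September 9, 2019", None],
})

# Strip stray whitespace, then parse the strings into datetime values.
# errors="coerce" turns missing or unparseable entries into NaT instead of raising.
toy_raw_df["date_added"] = pd.to_datetime(
    toy_raw_df["date_added"].str.strip(),
    format="%B %d, %Y",  # assumed format, see the note above
    errors="coerce",
)

print(toy_raw_df.dtypes)
```

Passing an explicit `format` is optional, but it avoids relying on format inference when the column contains inconsistent entries.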


### Count shows added  per date <a id='task2'></a>
All the shows have been listed in the original dataframe.     
Now let's count the total number of shows added per day
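
One possible way to do the counting is a `groupby` on the parsed date column. The sketch below uses a small hypothetical frame (`toy_df`) rather than the real dataset, and the names `show_id` and `shows_added` are chosen for illustration only.

```python
import pandas as pd

# Hypothetical data with an already parsed date_added column
toy_df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "date_added": pd.to_datetime(
        ["2019-09-09", "2016-01-01", "2019-09-09", "2020-03-15"]
    ),
})

# Group the rows by date and count how many shows fall on each day.
# The grouping column becomes the (sorted) index of the result.
shows_per_day = (
    toy_df.groupby("date_added")["show_id"]
    .count()
    .rename("shows_added")
)

print(shows_per_day)
```

Because the grouping key is a datetime column, the result carries a `DatetimeIndex`, which the later tasks build on.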

______

## Task 2: Extract the day name and sum-up the shows added 
<a id='task3'></a>

In the last step, we used the `date_added` column to count the number of shows.  
Since we used `groupby` to do the counting,  
that column is now the index of the result.

We can use this new index directly to extract attributes of the timestamps.  
One example is the day name, which is returned by the `day_name()` method.  
Check out the [full list of the attributes here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html).
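
A short sketch of the idea, assuming a daily count series with a `DatetimeIndex` (the values in `shows_per_day` here are made up for illustration):

```python
import pandas as pd

# Hypothetical daily counts indexed by date
shows_per_day = pd.Series(
    [1, 2, 1, 3],
    index=pd.DatetimeIndex(
        ["2016-01-01", "2019-09-09", "2019-09-16", "2020-03-15"],
        name="date_added",
    ),
    name="shows_added",
)

# day_name() returns the weekday name for every timestamp in the index
day_names = shows_per_day.index.day_name()

# Group the daily counts by weekday name and add them up
shows_per_weekday = shows_per_day.groupby(day_names).sum()

print(shows_per_weekday)
```

The result is ordered alphabetically by day name; reindexing with an explicit weekday list is one way to restore calendar order.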

______

## Task 3: Select data from 2016 onwards 
<a id='task4'></a>
You can also use regular boolean masking to select and filter entries.  
The syntax is simpler than one might expect: you don't need to convert  
your filter criteria to `datetime`. A plain string in `YYYY-MM-DD` format  
will do the job.
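
The sketch below shows the masking approach on a hypothetical daily count series, comparing the `DatetimeIndex` against a plain date string. The label-based slice at the end is an alternative that assumes the index is sorted.

```python
import pandas as pd

# Hypothetical daily counts indexed by date
shows_per_day = pd.Series(
    [4, 1, 2],
    index=pd.DatetimeIndex(["2015-12-31", "2016-01-01", "2019-09-09"]),
    name="shows_added",
)

# Boolean mask: the string is compared against the datetime index directly
mask = shows_per_day.index >= "2016-01-01"
from_2016 = shows_per_day[mask]

# Alternative for a sorted index: slice by label with a date string
from_2016_sliced = shows_per_day.loc["2016-01-01":]

print(from_2016)
```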


______

## Task 4: Sum up weekly data 
<a id='task5'></a>

It is possible to change the granularity of your time series directly using the Pandas datetime functionality.

To do that, you need to specify two things: 
- Your new granularity, passed as an argument to the `resample` function. [Read more details](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)
- The aggregation function that combines the values falling into each new period (for example, a sum)
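
Both ingredients can be sketched as follows, assuming a daily count series with a `DatetimeIndex` (the values are hypothetical). The offset alias `"W"` requests weekly bins, which by default end on Sunday and are labelled with that end date; `.sum()` is the aggregation function.

```python
import pandas as pd

# Hypothetical daily counts: Monday, Wednesday, Sunday, then the following Monday
shows_per_day = pd.Series(
    [1, 2, 3, 4],
    index=pd.DatetimeIndex(
        ["2019-09-09", "2019-09-11", "2019-09-15", "2019-09-16"]
    ),
    name="shows_added",
)

# "W" is the new granularity, sum() is the function applied within each week
shows_per_week = shows_per_day.resample("W").sum()

print(shows_per_week)
```

Swapping `.sum()` for another aggregation such as `.mean()` or `.max()` changes how the values inside each week are combined, while the bins stay the same.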

_________