# T-DAB Data Science Notebook Template

### Objectives
1. Make it easy for you and your colleagues to reuse your code
2. Make your code look really polished and professional
3. Save time!

### Principles
- 📦 Modular: Code is broken into small, independent parts (like functions) that each do one thing. Code you’re reusing lives in a single central place.
- ✔️ Correct: Your code does what you say/think it does.
- 📖 Readable: It’s easy to read the code and understand what it does. Variable names are informative and code has up-to-date comments and docstrings.
- 💅 Stylish: Code follows a single, consistent style (e.g. the Tidyverse style guide for R, PEP 8 for Python code)
- 🛠️ Versatile: Solves a problem that will happen more than once and anticipates variation in the data.
- 💡 Creative: Solves a problem that hasn’t already been solved or is a clear improvement over an existing solution.

### General Guidelines

- Structure your Notebook: give your notebook a title (H1 header) and a meaningful preamble to describe its purpose and contents.
- Use headings (H2, H3...) and documentation in Markdown cells to structure your notebook and explain your workflow steps.
- Not every section described here is required. For example, an EDA notebook may only include Import data, Data cleaning & pre-processing and Exploratory Data Analysis.

#### To further discuss
- Use of **notebook template** with general section guidelines (somewhat flexible) and common imports.
- Define **naming convention** for all T-DAB notebooks:

  - Current Jupyter Notebooks naming convention is a number (for ordering), the creator's initials, and a short `-` delimited description. For example: `1.0-jqp-initial-data-exploration`
  - Discuss ordering: **should numbering be independent for each author, to indicate reading order?**
  - Suggestion: **if ordering is per author, we might prefer to start convention with initials and then include sorting numbers**
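
If the current convention is kept, it could be checked automatically. The snippet below is a minimal sketch, not an agreed T-DAB rule: the regular expression, the 2 to 4 lowercase letters allowed for initials, and the helper name `parse_notebook_name` are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for names such as "1.0-jqp-initial-data-exploration":
# an ordering number, lowercase initials, then a hyphen-delimited description.
NOTEBOOK_NAME = re.compile(
    r"^(?P<order>\d+(?:\.\d+)?)"
    r"-(?P<initials>[a-z]{2,4})"
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)$"
)


def parse_notebook_name(name):
    """Return the parts of a notebook name, or None if it breaks the convention."""
    # Accept names with or without the .ipynb extension
    stem = name[:-len(".ipynb")] if name.endswith(".ipynb") else name
    match = NOTEBOOK_NAME.match(stem)
    return match.groupdict() if match else None


print(parse_notebook_name("1.0-jqp-initial-data-exploration.ipynb"))
print(parse_notebook_name("Untitled1.ipynb"))
```

If the initials-first ordering suggested above were adopted, only the order of the first two groups in the pattern would need to change.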



#### Potential future additions

- The **toc2** extension can automatically create heading numbers and a Table of Contents, both in a sidebar (optionally a floating window) and in a markdown cell. The highlighting indicates your current position in the document, which helps you stay oriented in long notebooks.
- The **Collapsible Headings** extension allows you to hide entire sections of code, thereby letting you focus on your current workflow stage.
- The **Jupyter Snippets** extension allows you to conveniently insert often needed code blocks, e.g. your typical import statements. Or, as an even simpler approach, we could just provide a generic import cell as in the template below.



# Beginning of Notebook Template...

# Notebook Title

## Introduction

Short description of the notebook's contents. It should match the one included in the repository's `README.md` file.

In [None]:
# Load the "autoreload" extension so that code can change
%load_ext autoreload
# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2

# Import all relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

## Import data

- Import all relevant files into DataFrames
- Store raw files in the repository's `data` folder unless their size is an impediment (more than a few MB)
- Use descriptive yet concise names to define DataFrame objects
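
When a project has several raw files, they can be read in a single pass. The sketch below assumes the files are CSVs; the helper name `load_raw_files` and the throwaway folder used for the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_raw_files(data_dir, pattern="*.csv"):
    """Read every file matching pattern in data_dir into a dict keyed by file stem."""
    paths = sorted(Path(data_dir).glob(pattern))
    return {path.stem: pd.read_csv(path) for path in paths}


# Demonstration on a throwaway folder; in a project, pass the repository's data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = "DateTime,col_1\n2021-01-01,1.5\n2021-01-02,2.5\n"
    Path(tmp_dir, "sample_data.csv").write_text(sample)
    raw = load_raw_files(tmp_dir)

df_raw_sample = raw["sample_data"]
print(df_raw_sample.head())
```

Keying the dictionary by file stem keeps each DataFrame traceable to its source file.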


In [None]:
# Read data
# df_raw_1 = pd.read_csv('./data/sample_data.csv')
# df_raw_2 = pd.read_csv('./data/sample_data.csv')

# Preview data
# df_raw_1.head()

## Data cleaning & pre-processing

- Include all cleaning and pre-processing steps required before modeling
- If pre-processing is to be performed on several datasets, try to wrap the reusable code into a function
- Include **docstrings** describing each function's purpose as well as its `input` and `output` parameters
- If possible, merge all raw files into a single clean DataFrame at the end of the cleaning process

In [None]:
# Data cleaning and pre-processing
def clean_data(df, param1, param2, *args, **kwargs):

    """
    This function wraps all the required cleaning and pre-processing steps

    Args:
        df: pandas DataFrame containing raw data
        param1: The first parameter.
        param2: The second parameter.

    Returns:
        clean_df: a DataFrame containing cleaned data

    """

    # Work on a copy so the raw DataFrame passed in is left unchanged
    clean_df = df.copy()

    # Include all relevant data wrangling functions/steps
    clean_df['DateTime'] = pd.to_datetime(clean_df['DateTime'])
    clean_df = clean_df.drop_duplicates()
    ...
    # Assign results back instead of using inplace=True on a column selection
    clean_df['col_1'] = clean_df['col_1'].ffill()
    clean_df['col_2'] = clean_df['col_2'].interpolate(method='linear')
    # Keep rows with at least half of their values present (thresh expects an integer)
    clean_df = clean_df.dropna(axis=0, thresh=(len(clean_df.columns) + 1) // 2)

    return clean_df

In [None]:
# If more than one DataFrame is provided we can try merging them after cleaning
# df = pd.merge(df_1, df_2, how="left", on='index', sort=True).set_index('index')
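
As an illustration of the merge step, the sketch below joins two small, made-up DataFrames on a shared `index` column. The data and the use of `validate` and `indicator` are assumptions added for illustration; they are optional `pd.merge` arguments that help catch duplicated keys and unmatched rows.

```python
import pandas as pd

# Two small hypothetical DataFrames sharing an 'index' key column
df_1 = pd.DataFrame({"index": [1, 2, 3], "col_1": [10.0, 20.0, 30.0]})
df_2 = pd.DataFrame({"index": [1, 3, 4], "col_2": ["a", "c", "d"]})

# A left join keeps every row of df_1. validate= raises if a key is duplicated,
# and indicator= adds a '_merge' column showing where each row came from.
df = pd.merge(
    df_1,
    df_2,
    how="left",
    on="index",
    sort=True,
    validate="one_to_one",
    indicator=True,
).set_index("index")

# Rows found only in df_1 have NaN in the columns coming from df_2
print(df)
```

Checking the `_merge` column before dropping it is a cheap way to confirm that the join matched the rows you expected.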

## Exploratory Data Analysis

- In this sub-section we obtain all required summary statistics and visualizations 
- Ideally, we should build reusable functions both for plotting and obtaining summary statistics
- Once the functions have been defined, the cells that call them should follow below

In [None]:
# Example function to obtain summary statistics
def get_summary(df):

    """
    This function takes a DataFrame with data and returns summary statistics on the columns it contains

    Args:
        df: pandas DataFrame containing data to analyze

    Returns:
        summary_df: a DataFrame containing summary statistics on df columns

    """

    # Get the standard descriptive statistics, one row per column
    summary_df = df.describe().transpose()

    # Count NaNs
    summary_df['number_nan'] = df.shape[0] - summary_df['count']

    # Count unique values (NaN counts as a value)
    summary_df['number_distinct'] = df.nunique(dropna=False)

    # Add the median of each numeric column
    summary_df['median'] = df.median(numeric_only=True)

    # Print DateTime information if the column exists
    # (the datetime_is_numeric argument was removed in pandas 2.0)
    if 'DateTime' in df.columns:
        print(df['DateTime'].describe())

    return summary_df


In [None]:
# Example function to plot a Histogram + Boxplot 
def plot_metrics(df, metric='metric_name', bins=30, title='Distribution', xlabel='x', ylabel='y'):

    """
    This function takes a clean DataFrame and outputs a Histogram and a Boxplot of a selected metric

    Args:
        df (DataFrame): pandas DataFrame containing clean data
        metric (str): name of the column to plot
        bins (int): number of histogram bins
        title (str): plot title
        xlabel (str): x-axis label of the boxplot
        ylabel (str): y-axis label of the boxplot

    Returns:
        None

    """

    # Extract the selected metric
    metrics = df[metric]

    # Create a figure for 2 subplots
    fig, ax = plt.subplots(2, 1, figsize=(12, 12))
    # Plot histogram with dashed lines at the mean (magenta) and median (cyan)
    ax[0].hist(metrics, bins=bins)
    ax[0].axvline(metrics.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax[0].axvline(metrics.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax[0].set_xlabel(metric, fontsize=16)
    ax[0].set_ylabel('counts', fontsize=16)
    # Plot boxplot
    ax[1].boxplot(metrics, vert=False)
    ax[1].set_xlabel(xlabel, fontsize=16)
    ax[1].set_ylabel(ylabel, fontsize=16)

    # Add title
    fig.suptitle(title)

    # Show figure
    plt.show()

## Modeling
- This section will contain all the models we would like to train using our pre-processed data
- If several models are produced, we should include a subsection per model identified by a markdown (H3) sub-header
- Typical subsections that we want to include here:
    - Import all ML/model relevant modules
    - Perform Feature Engineering and Feature Selection
    - Proper Train, Validation & (if required) Test split 
    - All `seed` parameters to allow experiment reproducibility should be set here as well
- As usual, code should be modularized into functions when possible
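
The split and seed guidelines above can be sketched as follows. This is one possible approach using only NumPy, not a prescribed method: the seed value, the split fractions and the helper name `split_indices` are assumptions.

```python
import numpy as np

SEED = 42  # assumed value; any fixed integer gives a reproducible split


def split_indices(n_rows, val_frac=0.2, test_frac=0.2, seed=SEED):
    """Shuffle row positions once and cut them into train/validation/test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned positions can be passed to `df.iloc[...]` to build each subset. A random shuffle is usually not appropriate for time-ordered data, where a chronological cut is the safer choice.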


## Evaluation
- This section will contain all the code required to evaluate our models 
- Results could be presented in the form of plots and/or figures 
- Ideally, one should provide functions that take all parameters required by the model to allow comparison between different model configurations while minimizing the amount of code
- All plotting, evaluation and auxiliary functions should be included at the top of the section. Evaluation cells should follow and include as little code as possible
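
A minimal sketch of this layout, assuming a regression task: the metrics (MAE and RMSE), the function names and the sample predictions are all hypothetical and stand in for whatever the project actually evaluates.

```python
import numpy as np


def evaluate_predictions(y_true, y_pred):
    """Return MAE and RMSE for one set of predictions."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
    }


def compare_models(y_true, predictions):
    """Evaluate several named prediction sets and return them sorted by RMSE."""
    results = {
        name: evaluate_predictions(y_true, y_pred)
        for name, y_pred in predictions.items()
    }
    return sorted(results.items(), key=lambda item: item[1]["rmse"])


# Hypothetical targets and predictions from two model configurations
y_true = [3.0, 5.0, 7.0]
predictions = {
    "baseline": [4.0, 4.0, 4.0],
    "tuned": [3.0, 5.0, 8.0],
}
ranked = compare_models(y_true, predictions)
for name, metrics in ranked:
    print(name, metrics)
```

With the helpers defined once at the top of the section, each evaluation cell reduces to a single call to `compare_models`.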


## References for best Software Engineering Practices for Jupyter Notebooks

1. [Manage your Data Science project structure in early stage](https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600)
2. [Six steps to more professional data science code](https://www.kaggle.com/rtatman/six-steps-to-more-professional-data-science-code)
3. [Jupyter Notebook Best Practices](https://towardsdatascience.com/jupyter-notebook-best-practices-f430a6ba8c69)
4. [Example Google Style Python Docstring](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html)