# Data Challenge Notebook

This notebook serves as a repertoire of Python functions for solving data challenges. It also includes 10 tips for success and guidance on developing metrics for analytics-based projects.

Most data challenges evaluate coding ability and the ability to inspect data (e.g. identifying paradoxes and errors), perform data wrangling, choose appropriate visualizations, apply statistics/ML, and answer questions concisely and effectively.

This notebook includes functions for high-level processes that may be necessary for a given data challenge and is meant to be customized for your particular challenge. The sections include:

- Initial Data Analysis
- Data Wrangling
- Exploratory Data Analysis
- Statistical Analysis [Optional]
- Machine Learning [Optional]

### Problem Definition

1. Download the data and load it into your favorite statistical programming software or database. Report the number of rows and columns that you've loaded.

2. Visualize trip distance by time of day in any way you see fit. Do you have any observations?

3. Build a model to forecast the number of trips by hour for the next 12 hours after Feb 12th 10:00 am. How well did you do?

4. I want to know where I can most easily get a cab. Recommend a pick up spot where I can find a cab given my lat long and a time of day.

#### Tips & Tricks:
1. Create a notebook template of custom functions for analysis, visualizations and modeling BEFORE starting the challenge!
2. Research the company/role. (Many companies will utilize their own data for challenges)
3. Develop a plan of action prior to coding and ask for clarification on the prompt and data, if necessary.
4. Keep track of time (Some challenges allot only a few hours!)
5. Use functional programming whenever possible for conciseness. (Show all your work!)
6. Use intuitive dataframe names, not "df".
7. State all assumptions! (Prompts are purposely vague, you will have to make decisions based on the information provided).
8. Explain all values/visualizations/errors/results as it pertains to the prompt provided. (Think storytelling)
9. Outline next steps and actionable insights derived from analyses.
10. Add comments and utilize markdown to explain functions (input/output) and results.

#### Developing metrics for analysis
Consider what 'health' and success mean to the particular business and what information would provide useful insights.
- What is their business model?
- Who is their target audience?
- What would they want to optimize?

A good metric should be:
- Comparative
- Understandable
- Ratio/Rate
- Changes the way you behave
    
Types of metrics include:
- Qualitative/Quantitative
- Vanity/Actionable
- Exploratory/Reporting
- Leading/Lagging
- Correlated/Causal
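
As an illustration of a ratio metric that is comparable across groups, the sketch below computes tips as a share of fares per pickup hour. The three rows are made up for illustration only; the column names mirror the taxi data used later in this notebook, and `tip_rate_by_hour` is a hypothetical helper name.

```python
import pandas as pd

def tip_rate_by_hour(trips: pd.DataFrame) -> pd.Series:
    """Total tips divided by total fares, per pickup hour (hypothetical helper)."""
    hours = pd.to_datetime(trips["lpep_pickup_datetime"]).dt.hour
    totals = trips.groupby(hours)[["Tip_amount", "Fare_amount"]].sum()
    return (totals["Tip_amount"] / totals["Fare_amount"]).rename("tip_rate")

# Made-up rows for illustration only
toy_trips = pd.DataFrame({
    "lpep_pickup_datetime": ["2016-02-01 00:05:00", "2016-02-01 00:40:00", "2016-02-01 08:15:00"],
    "Fare_amount": [10.0, 30.0, 20.0],
    "Tip_amount": [1.0, 3.0, 5.0],
})
tip_rates = tip_rate_by_hour(toy_trips)
print(tip_rates)
```

Dividing the summed tips by the summed fares (rather than averaging per-trip ratios) keeps very small fares from dominating the metric; which variant is appropriate depends on the question being asked.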

## Amazon Mock Data Challenge

### Data Challenge
### Data & Features

## Table of Contents
- Initial Data Analysis
- Data Wrangling
- Exploratory Data Analysis
- Statistical Analysis
- Machine Learning

In [29]:
# Import libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import tensorflow as tf
import seaborn as sns
import pandas_profiling as pp
import etsy_py
from scipy.stats import shapiro
from scipy.stats import skew
from scipy.stats import kurtosis

sns.set(style='dark',context='poster')

### 1. Exploratory Data Analysis

Firstly, we need to understand the data we are working with. Therefore, I am starting off by checking the dimensions of the data, the data types, missingness, etc.

In [10]:
# we start off by loading the data
tripdata_df = pd.read_csv("../data/tripdata.csv") 

##### Question 1

Number of rows and columns

In [12]:
print(f'>>> Number of rows: {tripdata_df.shape[0]}\n>>> Number of columns: {tripdata_df.shape[1]}')

>>> Number of rows: 1510722
>>> Number of columns: 21


In [14]:
# let's take a look at a sample of the data
tripdata_df.head(10)

| | VendorID | lpep_pickup_datetime | Lpep_dropoff_datetime | Store_and_fwd_flag | RateCodeID | Pickup_longitude | Pickup_latitude | Dropoff_longitude | Dropoff_latitude | Passenger_count | ... | Fare_amount | Extra | MTA_tax | Tip_amount | Tolls_amount | Ehail_fee | improvement_surcharge | Total_amount | Payment_type | Trip_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2016-02-01 00:00:01 | 2016-02-01 00:10:06 | N | 1 | -73.939018 | 40.805214 | -73.972534 | 40.785885 | 1 | ... | 10.5 | 0.5 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 11.8 | 2 | 1.0 |
| 1 | 2 | 2016-02-01 00:01:33 | 2016-02-01 00:20:13 | N | 1 | -73.891495 | 40.746651 | -73.890877 | 40.743896 | 1 | ... | 13.0 | 0.5 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 14.3 | 2 | 1.0 |
| 2 | 2 | 2016-02-01 00:03:46 | 2016-02-01 00:21:04 | N | 1 | -73.98378 | 40.676132 | -73.956978 | 40.718327 | 1 | ... | 17.5 | 0.5 | 0.5 | 3.76 | 0.0 | NaN | 0.3 | 22.56 | 1 | 1.0 |
| 3 | 2 | 2016-02-01 00:00:05 | 2016-02-01 00:06:48 | N | 1 | -73.807518 | 40.700375 | -73.831657 | 40.705978 | 1 | ... | 8.0 | 0.5 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 9.3 | 2 | 1.0 |
| 4 | 2 | 2016-02-01 00:06:20 | 2016-02-01 00:08:47 | N | 1 | -73.903961 | 40.744934 | -73.900009 | 40.733601 | 5 | ... | 5.0 | 0.5 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 6.3 | 2 | 1.0 |
| 5 | 2 | 2016-02-01 00:00:42 | 2016-02-01 00:27:25 | N | 1 | -73.978615 | 40.670422 | -73.986893 | 40.748573 | 1 | ... | 22.5 | 0.5 | 0.5 | 4.76 | 0.0 | NaN | 0.3 | 28.56 | 1 | 1.0 |
| 6 | 2 | 2016-02-01 00:00:59 | 2016-02-01 00:07:25 | N | 1 | -73.829826 | 40.759659 | -73.809982 | 40.75116 | 5 | ... | 7.0 | 0.5 | 0.5 | 0.0 | 0.0 | NaN | 0.3 | 8.3 | 2 | 1.0 |
| 7 | 2 | 2016-02-01 00:00:56 | 2016-02-01 00:15:31 | N | 1 | -73.955017 | 40.735298 | -73.939865 | 40.793724 | 3 | ... | 17.5 | 0.5 | 0.5 | 4.7 | 0.0 | NaN | 0.3 | 23.5 | 1 | 1.0 |
| 8 | 2 | 2016-02-01 00:02:45 | 2016-02-01 00:09:19 | N | 1 | -73.959496 | 40.718136 | -73.948105 | 40.709122 | 1 | ... | 6.5 | 0.5 | 0.5 | 1.95 | 0.0 | NaN | 0.3 | 9.75 | 1 | 1.0 |
| 9 | 2 | 2016-02-01 00:04:12 | 2016-02-01 00:19:36 | N | 1 | -73.944565 | 40.714516 | -73.918343 | 40.689548 | 1 | ... | 11.5 | 0.5 | 0.5 | 3.84 | 0.0 | NaN | 0.3 | 16.64 | 1 | 1.0 |


It is a good idea to know the data types of the features we have in the data. From the table above, the features seem to be a mixture of categorical and numerical data. The Ehail_fee seems to have lots of missing values too but we will get to that in a jiffy.

In [15]:
tripdata_df.dtypes

VendorID                   int64
lpep_pickup_datetime      object
Lpep_dropoff_datetime     object
Store_and_fwd_flag        object
RateCodeID                 int64
Pickup_longitude         float64
Pickup_latitude          float64
Dropoff_longitude        float64
Dropoff_latitude         float64
Passenger_count            int64
Trip_distance            float64
Fare_amount              float64
Extra                    float64
MTA_tax                  float64
Tip_amount               float64
Tolls_amount             float64
Ehail_fee                float64
improvement_surcharge    float64
Total_amount             float64
Payment_type               int64
Trip_type                float64
dtype: object
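
The two datetime columns are loaded as `object` (strings), so time-based grouping needs a conversion first. The following is a minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the sample rows; the small `sample_trips` frame is a stand-in for `tripdata_df`.

```python
import pandas as pd

# Stand-in for tripdata_df, using two rows from the sample above
sample_trips = pd.DataFrame({
    "lpep_pickup_datetime": ["2016-02-01 00:00:01", "2016-02-01 00:01:33"],
    "Lpep_dropoff_datetime": ["2016-02-01 00:10:06", "2016-02-01 00:20:13"],
})

datetime_format = "%Y-%m-%d %H:%M:%S"  # assumed from the sample rows
for column in ["lpep_pickup_datetime", "Lpep_dropoff_datetime"]:
    # errors="coerce" turns unparseable strings into NaT instead of raising
    sample_trips[column] = pd.to_datetime(
        sample_trips[column], format=datetime_format, errors="coerce"
    )

print(sample_trips.dtypes)
```

Passing `parse_dates=[...]` to `pd.read_csv` is an alternative that does the conversion at load time.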

##### Missing values

Before I proceed further, it is important to find out which features have missing values so that we can devise a means of handling them.

In [16]:
# get the total number of missing values per column and print it
missing_val_col = tripdata_df.isna().sum()
print(f'>>> Missing data by column details:\n{missing_val_col[missing_val_col > 0]}')

>>> Missing data by column details:
Ehail_fee     1510722
Trip_type           2
dtype: int64


Luckily, only two features, **Ehail_fee** and **Trip_type**, have missing values. The data dictionary for the green taxi has no feature named **Ehail_fee**, suggesting that it may be an upcoming feature. Since every value in that column is missing, we will drop it. The **Trip_type** feature has only two missing values, so we can simply drop those two rows.

In [17]:
# drop the all-missing column first, then drop the 2 rows with a missing trip type
tripdata_df.dropna(axis=1,how='all',inplace=True)
tripdata_df.dropna(axis=0,inplace=True)
print(f'>>> New shape: {tripdata_df.shape}')

>>> New shape: (1510720, 20)


### 2. Data Visualization and Cleaning

To avoid my decisions being influenced by theory alone, let us visualize some of the features and confirm whether our assumptions agree with insights from the data. First, we check some of the coordinates to see if they are valid.

In [34]:
# let us view a snapshot of the pickups to check for data anomalies
import folium

def plot_map(location, dataframe, action):
    """
    Given a default location and dataframe, it plots the data on the map
    Params:
        - location: default location of the city of interest
        - dataframe: dataframe that has latitude and longitude
        - action: 'Pickup' or 'Dropoff'; selects which coordinate columns to plot
    Returns:
        - a map object
    """
    lat = f'{action}_latitude'
    lon = f'{action}_longitude'
    trip_map = folium.Map(location=location)
    for idx, data in dataframe.iterrows():
        if data[lat] != 0: # skip zero-latitude placeholders so we don't land in the Gulf of Guinea
            folium.Marker(location=[data[lat],data[lon]]).add_to(trip_map)
    return trip_map

In [35]:
location = [40.730610, -73.935242]
sample_data = tripdata_df.head(5000)
action = 'Pickup'
# plot the first 5000 pickups around New York City
pickup_map = plot_map(location,sample_data,action)
pickup_map

From the pickup plot, we see an outlier that is practically impossible: the intersection of zero degrees latitude and zero degrees longitude falls in the ocean (the Gulf of Guinea), south of Ghana and west of Gabon. This suggests that even when we can't find missing values, our data is far from clean. Therefore, we will concentrate only on data from New York City, keeping latitudes between **40.5774** and **40.9176** and longitudes between **-74.15** and **-73.7004**, which are the bounding coordinates of New York City.

##### Pickup coordinates cleaning

In [None]:
# index of pickups outside the NYC bounding box (latitude 40.5774..40.9176, longitude -74.15..-73.7004)
lat, lon = tripdata_df.Pickup_latitude, tripdata_df.Pickup_longitude
pickups_outside_NYC = tripdata_df.index[(lat < 40.5774) | (lat > 40.9176) | (lon < -74.15) | (lon > -73.7004)]
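
The same bounding-box check applies to dropoff coordinates, so it can be wrapped in a small helper. This is a sketch only: `outside_nyc` and the constant names are hypothetical, the bounds are the ones quoted in the text above, and the toy rows (one real sample pickup, one at zero coordinates, one made up) stand in for `tripdata_df`.

```python
import pandas as pd

# Bounding box quoted in the text above
NYC_LAT_RANGE = (40.5774, 40.9176)
NYC_LON_RANGE = (-74.15, -73.7004)

def outside_nyc(trips: pd.DataFrame, action: str) -> pd.Series:
    """Boolean mask: True where the `action` ('Pickup' or 'Dropoff') point is outside the box."""
    lat = trips[f"{action}_latitude"]
    lon = trips[f"{action}_longitude"]
    # Series.between is inclusive on both ends by default
    inside = lat.between(*NYC_LAT_RANGE) & lon.between(*NYC_LON_RANGE)
    return ~inside

toy_trips = pd.DataFrame({
    "Pickup_latitude": [40.805214, 0.0, 41.2],
    "Pickup_longitude": [-73.939018, 0.0, -73.9],
})
mask = outside_nyc(toy_trips, "Pickup")
clean_trips = toy_trips[~mask]
print(clean_trips)
```

Note that missing coordinates compare as "not inside" here, so they would be flagged as outside the box along with genuinely out-of-range points.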

In [None]:
def initial_analysis(df):
    """
    Given a dataframe produces a simple report on initial data analytics
    Params:
        - df 
    Returns:
        - Shape of dataframe records and columns
        - Columns and data types
    """
    print('>>> Report of Initial Data Analysis:\n')
    print(f'>>> Number of rows: {df.shape[0]}\n>>> Number of columns: {df.shape[1]}')
    print(f'>>> Features and Data Types: \n {df.dtypes}')

In [None]:
def percent_missing(df):
    """
    Given a dataframe it calculates the percentage of missing records per column
    Params:
        - df
    Returns:
        - Dictionary of column name and percentage of missing records
    """
    col=list(df.columns)
    perc=[round(df[c].isna().mean()*100,2) for c in col]
    miss_dict=dict(zip(col,perc))
    return miss_dict

In [None]:
def normality_test(df,col_list):
    """
    Given a dataframe determines whether each numerical column is Gaussian (Shapiro-Wilk test)
    Ho = Assumes distribution is Gaussian
    Ha = Assumes distribution is not Gaussian
    Params:
        - df
        - col_list: list of numerical column names to test
    Returns:
        - Dictionary of columns that do not have gaussian distribution and their W statistic
    """
    non_gauss=[]
    w_stat=[]
    # Determine if each sample of numerical feature is gaussian
    alpha = 0.05
    for n in col_list:
        stat,p=shapiro(df[n])
        sns.histplot(df[n],kde=True) # histplot replaces the deprecated distplot
        plt.show()
        print((skew(df[n]),kurtosis(df[n])))

        if p <= alpha: # Reject Ho -- Distribution is not normal
            non_gauss.append(n)
            w_stat.append(stat)
    # Dictionary of numerical features not gaussian and W-Statistic        
    norm_dict=dict(zip(non_gauss,w_stat))
    return norm_dict
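
One caveat for a dataset of this size (about 1.5 million rows): SciPy's documentation notes that for more than 5000 observations the Shapiro-Wilk p-value may not be accurate. A possible workaround, sketched below under that assumption, is to test a random sample instead. `shapiro_on_sample` is a hypothetical helper and the skewed data is synthetic.

```python
import numpy as np
from scipy.stats import shapiro

def shapiro_on_sample(values, max_n=5000, seed=0):
    """Run Shapiro-Wilk on at most `max_n` randomly chosen non-missing values."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # shapiro does not handle NaN
    if len(values) > max_n:
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=max_n, replace=False)
    stat, p_value = shapiro(values)
    return stat, p_value, len(values)

# Synthetic, clearly skewed data for illustration only
rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=20_000)
stat, p_value, n_used = shapiro_on_sample(skewed)
print(f"W={stat:.3f}, p={p_value:.3g}, n={n_used}")
```

With samples this large, even small departures from normality are flagged as significant, so the histogram, skew and kurtosis remain useful complements to the p-value.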

In [None]:
# Outliers
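
One common starting point for this placeholder is the interquartile-range (IQR) rule. The sketch below is an assumption about what could go here, not a method prescribed by the challenge; `iqr_outlier_mask` is a hypothetical helper and the distances are made up.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up trip distances for illustration only
toy_distances = pd.Series([1.0, 1.2, 0.9, 1.1, 1.3, 50.0], name="Trip_distance")
outliers = toy_distances[iqr_outlier_mask(toy_distances)]
print(outliers)
```

Whether flagged values should be dropped, capped, or kept depends on the prompt; state the choice as an assumption.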

### Data Wrangling

In [None]:
# Impute missing values
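
When dropping rows is not acceptable, a simple alternative is to fill numeric columns with the median and other columns with the most frequent value. This is a minimal sketch of that idea; `impute_simple` is a hypothetical helper and the rows are made up.

```python
import numpy as np
import pandas as pd

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if filled[column].isna().all():
            continue  # nothing to learn from an empty column; leave it for dropping
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled

# Made-up rows for illustration only
toy_trips = pd.DataFrame({
    "Tip_amount": [0.0, 2.0, np.nan, 4.0],
    "Store_and_fwd_flag": ["N", None, "N", "Y"],
})
imputed = impute_simple(toy_trips)
print(imputed)
```

The function works on a copy, so the original frame keeps its missing values for comparison.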

In [None]:
# Feature Engineering
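
For the question about trip distance by time of day, two derived features are likely to be useful: the pickup hour and the trip duration. The sketch below assumes the column names shown in the data types listing; `add_time_features`, `pickup_hour` and `trip_minutes` are hypothetical names. The first two timestamp pairs come from the sample rows, while the third row and all distances are made up.

```python
import pandas as pd

def add_time_features(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the pickup hour and the trip duration in minutes."""
    out = trips.copy()
    pickup = pd.to_datetime(out["lpep_pickup_datetime"])
    dropoff = pd.to_datetime(out["Lpep_dropoff_datetime"])
    out["pickup_hour"] = pickup.dt.hour
    out["trip_minutes"] = (dropoff - pickup).dt.total_seconds() / 60
    return out

toy_trips = pd.DataFrame({
    "lpep_pickup_datetime": ["2016-02-01 00:00:01", "2016-02-01 00:01:33", "2016-02-01 08:15:00"],
    "Lpep_dropoff_datetime": ["2016-02-01 00:10:06", "2016-02-01 00:20:13", "2016-02-01 08:45:00"],
    "Trip_distance": [2.0, 4.0, 6.5],  # made-up values
})
featured = add_time_features(toy_trips)

# Summary of trip distance per pickup hour, ready for a bar or box plot
distance_by_hour = featured.groupby("pickup_hour")["Trip_distance"].agg(["mean", "median", "count"])
print(distance_by_hour)
```

Reporting the median next to the mean helps when a few very long trips skew the hourly averages.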

In [None]:
# Data Formatting

### Exploratory Data Analysis

In [None]:
pp.ProfileReport(df)

In [None]:
# Seasonality
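
For the forecasting question, a useful first step is to count trips per hour and compare any model against a seasonal-naive baseline, which repeats the count from the same hour one day earlier. This is a baseline sketch swapped in for illustration, not the model the challenge expects; both function names are hypothetical and the history is synthetic.

```python
import pandas as pd

HOUR = pd.offsets.Hour(1)

def hourly_trip_counts(pickup_times: pd.Series) -> pd.Series:
    """Count trips per clock hour; hours without trips get a count of 0."""
    stamps = pd.DatetimeIndex(pd.to_datetime(pickup_times))
    return pd.Series(1, index=stamps).sort_index().resample(HOUR).sum()

def seasonal_naive_forecast(hourly_counts: pd.Series, horizon: int = 12, season: int = 24) -> pd.Series:
    """Forecast each future hour with the count observed `season` hours earlier."""
    n = len(hourly_counts)
    if n < season or horizon > season:
        raise ValueError("need at least one full season of history and horizon <= season")
    start = hourly_counts.index[-1] + pd.Timedelta(hours=1)
    future_index = pd.date_range(start, periods=horizon, freq=HOUR)
    past_values = hourly_counts.iloc[n - season : n - season + horizon].to_numpy()
    return pd.Series(past_values, index=future_index, name="forecast")

# Two pickup times from the sample rows plus one made-up time
toy_counts = hourly_trip_counts(pd.Series([
    "2016-02-01 00:00:01", "2016-02-01 00:01:33", "2016-02-01 02:10:00",
]))
print(toy_counts)

# Synthetic history: 48 hours where the count equals the hour of day
history_index = pd.date_range("2016-02-01 00:00:00", periods=48, freq=HOUR)
history = pd.Series(history_index.hour.to_numpy(), index=history_index)
forecast = seasonal_naive_forecast(history, horizon=12)
print(forecast)
```

If a more elaborate model cannot beat this baseline on held-out hours, that is worth reporting as a finding in itself.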

### Statistical Analysis

In [None]:
# Hypothesis Testing

In [None]:
# Anomaly Detection

### Machine Learning

In [None]:
# Data preprocessing

In [None]:
# PCA

#### Classification

In [None]:
# Instantiate classifier

In [None]:
# Hyperparameter Tuning

In [None]:
# Evaluation

#### Regression

In [None]:
# Instantiate regressor

In [None]:
# Hyperparameter Tuning

In [None]:
# Evaluation
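
To answer "How well did you do?" for a forecast, two commonly reported error measures are the mean absolute error (MAE) and the root mean squared error (RMSE). The sketch below computes both with NumPy; `regression_errors` is a hypothetical helper and the hourly counts are made up.

```python
import numpy as np

def regression_errors(actual, predicted):
    """Return MAE and RMSE between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - actual
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
    }

# Made-up hourly trip counts for illustration only
scores = regression_errors(actual=[100, 120, 80, 90], predicted=[110, 115, 80, 70])
print(scores)
```

RMSE penalizes large misses more heavily than MAE, so reporting both gives a sense of whether the error is spread evenly or driven by a few hours.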