# Template for Data Exploration

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Code-to-run-the-notebook" data-toc-modified-id="Code-to-run-the-notebook-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Code to run the notebook</a></span><ul class="toc-item"><li><span><a href="#Code-running-locally" data-toc-modified-id="Code-running-locally-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Code running locally</a></span></li><li><span><a href="#Code-running-on-the-cluster" data-toc-modified-id="Code-running-on-the-cluster-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Code running on the cluster</a></span></li><li><span><a href="#Loading-variables" data-toc-modified-id="Loading-variables-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Loading variables</a></span></li><li><span><a href="#Downloading-values" data-toc-modified-id="Downloading-values-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Downloading values</a></span></li></ul></li><li><span><a href="#Presentation-of-the-dataset" data-toc-modified-id="Presentation-of-the-dataset-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Presentation of the dataset</a></span><ul class="toc-item"><li><span><a href="#Data-sources" data-toc-modified-id="Data-sources-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data sources</a></span></li><li><span><a href="#Dataset-contents" data-toc-modified-id="Dataset-contents-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Dataset contents</a></span><ul class="toc-item"><li><span><a href="#Description-of-the-data" data-toc-modified-id="Description-of-the-data-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Description of the data</a></span><ul class="toc-item"><li><span><a href="#Some-stats-on-the-dataset" data-toc-modified-id="Some-stats-on-the-dataset-3.2.1.1"><span 
class="toc-item-num">3.2.1.1&nbsp;&nbsp;</span>Some stats on the dataset</a></span></li></ul></li><li><span><a href="#Profiling-report" data-toc-modified-id="Profiling-report-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Profiling report</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Data-preprocessing-and-cleaning" data-toc-modified-id="Data-preprocessing-and-cleaning-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data preprocessing and cleaning</a></span><ul class="toc-item"><li><span><a href="#Missing-values" data-toc-modified-id="Missing-values-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Missing values</a></span></li><li><span><a href="#Duplicates" data-toc-modified-id="Duplicates-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Duplicates</a></span></li><li><span><a href="#Wrong-values" data-toc-modified-id="Wrong-values-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Wrong values</a></span></li><li><span><a href="#Data-standardization" data-toc-modified-id="Data-standardization-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Data standardization</a></span></li><li><span><a href="#Remove-or-fix-outliers" data-toc-modified-id="Remove-or-fix-outliers-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Remove or fix outliers</a></span></li><li><span><a href="#Some-stats-of-the-dataset-after-data-preprossing-and-data-cleaning" data-toc-modified-id="Some-stats-of-the-dataset-after-data-preprossing-and-data-cleaning-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Some stats of the dataset after data preprossing and data cleaning</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Dataset-exploration" 
data-toc-modified-id="Dataset-exploration-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Dataset exploration</a></span><ul class="toc-item"><li><span><a href="#Exploration-of-target-variables" data-toc-modified-id="Exploration-of-target-variables-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Exploration of target variables</a></span></li><li><span><a href="#Exploration-of-relationship-between-input-variables" data-toc-modified-id="Exploration-of-relationship-between-input-variables-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Exploration of relationship between input variables</a></span></li><li><span><a href="#Outlier-detection" data-toc-modified-id="Outlier-detection-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Outlier detection</a></span></li><li><span><a href="#More-data-exploration" data-toc-modified-id="More-data-exploration-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>More data exploration</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

## Introduction
In this section we state the goals of this notebook: what kind of information we are trying to extract from the datasets being explored.

## Code to run the notebook
In this section we provide the code needed to run the notebook. The goal is to define every function used in the notebook here, so that the sections that follow stay as clean as possible, with as little code as possible.

### Code running locally
In this section we add the code that runs locally. Here we put functions that we believe can be useful for any notebook.

In [1]:
%%local
import pandas as pd
import plotly
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.tools as tls
from IPython.display import HTML, display
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot

# Load plotly.js from the CDN so that figures render inside the notebook
init_notebook_mode(connected=True)


def build_table(df):
    # Build a plotly table trace from a pandas dataframe
    trace = go.Table(
        header=dict(values=list(df.columns),
                    fill=dict(color='#C2D4FF'),
                    align=['left'] * len(df.columns),
                    font=dict(size=10)),
        cells=dict(values=[df[k].tolist() for k in df.columns],
                   fill=dict(color='#F5F8FF'),
                   font=dict(size=10)))
    return [trace]

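def add_percentage_of_runs_out_of_total_runs_to_data_frame(df):
    # Work on a copy so the caller's dataframe is left untouched
    df = df.copy()
    # Share of each row's count in the total, as a fraction between 0 and 1
    df['percentage_runs'] = df['count'] / df['count'].sum()
    df = df.sort_values(by='percentage_runs', ascending=False)
    df['run_id'] = df['run_id'].apply(str)
    return df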

def overlay_histograms(first_distribution_values, second_distribution_values, first_distribution_name, second_distribution_name):
    data = [go.Histogram(x=first_distribution_values, opacity=1, histnorm='probability', name = first_distribution_name, marker=dict(
        color='rgb(0, 0, 100)'
    ))]
    data += [go.Histogram(x=second_distribution_values, opacity=0.75, histnorm='probability', name = second_distribution_name, marker=dict(
        color='#EB89B5'
    ))]
    return data

def overlay_line_graphs(first_distribution_values, second_distribution_values, 
                        first_distribution_name, second_distribution_name):
    # Group data together
    hist_data = [first_distribution_values, second_distribution_values]

    group_labels = [first_distribution_name, second_distribution_name]
    colors = ['rgb(0, 0, 100)', '#EB89B5']
    # Create distplot with custom bin_size
    fig = ff.create_distplot(hist_data, group_labels, bin_size=[1, .5], colors=colors, show_hist=False)
    return fig

def count_na(df):
    sum_series = df.isnull().sum(axis=0)
    return pd.DataFrame({'column':sum_series.index, 'sum':sum_series.values})

### Code running on the cluster
Here we add all the functions needed to run code on the cluster.

In [3]:
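from pyspark.mllib.linalg import Matrices, Vectors
from pyspark.mllib.stat import Statistics
# col, count and when are used by count_nan below
from pyspark.sql.functions import col, count, when
from pyspark.sql.types import StringType, StructType, StructField, IntegerType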

def get_dataset_stats(df):
    return df.describe()

def get_dataset_without_duplicates(df):
    return df.drop_duplicates()

def count_duplicates(df):
    duplicates_schema = StructType([
        StructField("Count", StringType(), False)
     ])

    dup_data = sc.parallelize([str(df.count() - df.drop_duplicates().count())])
    dup_data = dup_data.map(lambda x: (x,))
    return spark.createDataFrame(dup_data, schema=duplicates_schema)

def count_nan(df):
    return df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])

def compute_statistics(df, col_name):
    stats_schema = StructType([
        StructField("Metric", StringType(), True),
        StructField("Value", StringType(), True)
     ])
    
    x = df.select(col_name)
    rdd = x.rdd.map(lambda data: Vectors.dense([c for c in data]))
    summary = Statistics.colStats(rdd)
    median = x.approxQuantile(col_name, [0.5], 0.01)[0]
    counts = x.groupBy(col_name).count().sort("count", ascending = False)
    mode = counts.first()[col_name]
    mean = summary.mean()[0]
    variance = summary.variance()[0]
    stats_data = [
        ('Mode', str(mode)),
        ('Median', str(median)),
        ('Mean', str(mean)),
        ('Variance', str(variance)),
        # Coefficient of variation = standard deviation / mean
        ('Coefficient of variation', str(variance ** 0.5 / mean)),
        ('Max value', str(summary.max()[0])),
        ('Min value', str(summary.min()[0])),
        ('Total value count', str(summary.count())),
        ('Number of non-zero values', str(summary.numNonzeros()[0])),
    ]
    stats_df = spark.createDataFrame(stats_data, schema=stats_schema)
    return stats_df

### Loading variables
In this section we load all the variables that need to be downloaded locally for visualization.

### Downloading values
In this section we download the variables loaded above into our local environment.

Example:
<code>%%spark -o df_hotel_events -m sample --maxrows 10</code>

## Presentation of the dataset
In this section we give a quick overview of the dataset.

We will describe:
<ul>
    <li>The sources: where the data comes from</li>
    <li>The content: display of the dataset contents in a table</li>
    <li>The columns: description of each column</li>
    <li>Some stats: count, median of values, number of unique values, NA values, etc.</li>
</ul>


### Data sources
Here, if possible, we add a diagram explaining where the data comes from.

### Dataset contents
The contents of the dataset are described here.

In [2]:
%%local
# Table with the contents

#### Description of the data
Here we describe each column of the dataset.


In [3]:
%%local
# Use the count_na function here to get a dataframe with the number of missing values in each column
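
As an illustration of the cell above, here is a minimal sketch of calling `count_na` on a small pandas dataframe. The sample data and the name `df_sample` are hypothetical and only stand in for a dataframe downloaded from the cluster; the helper is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def count_na(df):
    # Same helper as defined in "Code running locally"
    sum_series = df.isnull().sum(axis=0)
    return pd.DataFrame({'column': sum_series.index, 'sum': sum_series.values})


# Hypothetical sample standing in for a dataframe downloaded from the cluster
df_sample = pd.DataFrame({
    'hotel_id': [1, 2, 3, 4],
    'price': [120.0, np.nan, 95.5, np.nan],
    'city': ['Paris', None, 'Lyon', 'Nice'],
})

# One row per column of df_sample, with its number of missing values
na_counts = count_na(df_sample)
print(na_counts)
```

Both `NaN` and `None` are counted as missing, so numeric and text columns can be checked in the same pass.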

##### Some stats on the dataset
We can use this section to give some basic statistics on each column of the dataset
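
A possible starting point for these statistics, assuming the data is available locally as a pandas dataframe (the column names and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample; replace with the dataframe downloaded from the cluster
df_sample = pd.DataFrame({
    'price': [100.0, 150.0, 150.0, 200.0],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
})

# Count, mean, std, min, quartiles and max for each numeric column
# (the '50%' row is the median)
numeric_stats = df_sample.describe()

# Number of distinct values in each column
unique_counts = df_sample.nunique()

print(numeric_stats)
print(unique_counts)
```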

#### Profiling report
A profiling framework can be used to generate this report.

### Conclusion
A brief conclusion describing the findings

## Data preprocessing and cleaning
In this section we will:
<ul>
    <li>Remove or fix missing values</li>
    <li>Remove or fix duplicates</li>
    <li>Standardize the data: e.g. time zone conversion, currency conversion</li>
    <li>Remove or fix wrong values: e.g. negative prices</li>
    <li>Remove or fix outliers</li>
</ul>

### Missing values
In this section we remove or fix missing values in the dataset.
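
The two options, removing or fixing, could look like the sketch below. It assumes a local pandas dataframe; the columns are hypothetical, and filling with the median is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing price and one missing city
df_sample = pd.DataFrame({
    'booking_id': [1, 2, 3, 4],
    'price': [120.0, np.nan, 80.0, 100.0],
    'city': ['Paris', 'Lyon', None, 'Nice'],
})

# Option 1 (remove): drop the rows where a required column is missing
df_dropped = df_sample.dropna(subset=['city'])

# Option 2 (fix): fill missing numeric values with the column median
median_price = df_sample['price'].median()
df_filled = df_sample.assign(price=df_sample['price'].fillna(median_price))
```

Which option fits depends on the column: dropping loses rows, while filling keeps them at the cost of introducing estimated values.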

### Duplicates
In this section we remove or fix duplicates in the dataset.
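
For data that has been downloaded locally, a pandas counterpart of the cluster helpers `count_duplicates` and `get_dataset_without_duplicates` might look like this (hypothetical sample data):

```python
import pandas as pd

# Hypothetical sample in which the second booking appears twice
df_sample = pd.DataFrame({
    'booking_id': [1, 2, 2, 3],
    'price': [120.0, 80.0, 80.0, 100.0],
})

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df_sample.duplicated().sum())

# Keep only the first occurrence of each row
df_deduplicated = df_sample.drop_duplicates().reset_index(drop=True)
```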

### Wrong values
Here we remove or fix wrong values, for example negative prices.
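
Taking the negative price example, a minimal sketch on a local pandas dataframe (the column names are assumptions):

```python
import pandas as pd

# Hypothetical sample containing one negative price
df_sample = pd.DataFrame({
    'booking_id': [1, 2, 3],
    'price': [120.0, -15.0, 80.0],
})

# Flag the rows that break the rule "a price cannot be negative"
is_wrong = df_sample['price'] < 0
n_wrong = int(is_wrong.sum())

# Remove the flagged rows
df_valid = df_sample[~is_wrong].reset_index(drop=True)
```

Keeping the boolean mask separate from the filtering step makes it easy to report how many rows were affected before they are dropped.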

### Data standardization
Here we standardize the data: convert all values to the same currency, the same time zone, etc.
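
The sketch below shows one way to do both conversions with pandas. Everything in it is illustrative: the column names are hypothetical and the exchange rates in `RATES_TO_EUR` are made-up numbers, not real rates.

```python
import pandas as pd

# Hypothetical sample: local timestamps with their time zone, prices with their currency
df_sample = pd.DataFrame({
    'booked_at': ['2019-03-01 09:00:00', '2019-03-01 18:30:00'],
    'timezone': ['Europe/Paris', 'America/New_York'],
    'price': [100.0, 200.0],
    'currency': ['EUR', 'USD'],
})

# Made-up exchange rates to EUR, for illustration only
RATES_TO_EUR = {'EUR': 1.0, 'USD': 0.5}


def to_utc(row):
    # Attach the row's local time zone, then convert to UTC
    local_time = pd.Timestamp(row['booked_at']).tz_localize(row['timezone'])
    return local_time.tz_convert('UTC')


# Same time zone for every row
df_sample['booked_at_utc'] = df_sample.apply(to_utc, axis=1)

# Same currency for every row
df_sample['price_eur'] = df_sample['price'] * df_sample['currency'].map(RATES_TO_EUR)
```

Keeping the original columns next to the standardized ones makes the conversion easy to audit.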

### Remove or fix outliers
We do a first search for outliers and remove or fix them.
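
One common choice for such a first pass is the interquartile range (IQR) rule, sketched below on made-up prices. The 1.5 multiplier is the conventional default, not a value prescribed by this template.

```python
import pandas as pd

# Hypothetical prices with one suspiciously large value
prices = pd.Series([90.0, 100.0, 105.0, 110.0, 120.0, 1000.0], name='price')

# IQR rule: values far outside the middle 50% of the data are treated as outliers
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (prices < lower_bound) | (prices > upper_bound)
prices_without_outliers = prices[~is_outlier]
```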

### Some stats of the dataset after data preprocessing and data cleaning
We will display some basic statistics of the dataset after data preprocessing and data cleaning

### Conclusion
Quick conclusion of the findings

## Dataset exploration
In this section we go deeper into the dataset.

Here we will:

<ul>
    <li>Explore target variables if applicable</li>
    <li>Explore relationships between variables</li>
    <li>Find outliers</li>
    <li><b>Do any kind of exploration allowing us to reach the goals of the notebook</b></li>
</ul>

### Exploration of target variables
In this section we see how the target variables are distributed.
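
For a categorical target, the distribution can be summarized with counts and shares per class. A minimal sketch, assuming a hypothetical binary target named `is_cancelled`:

```python
import pandas as pd

# Hypothetical binary target: 1 if the booking was cancelled, 0 otherwise
target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name='is_cancelled')

# Number of rows in each class
class_counts = target.value_counts()

# Share of each class, useful to spot class imbalance
class_shares = target.value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

The values passed to `overlay_histograms` or `overlay_line_graphs` from the local helpers can then be used to compare a numeric variable across the classes.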

### Exploration of relationship between input variables
In this section we see how the input variables correlate with each other.
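
A correlation matrix is a simple first look at these relationships. The sketch below uses Pearson correlation on made-up numeric columns; other methods (for example Spearman) may suit non-linear relationships better.

```python
import pandas as pd

# Hypothetical input variables
df_sample = pd.DataFrame({
    'nights': [1, 2, 3, 4, 5],
    'price': [100.0, 200.0, 300.0, 400.0, 500.0],
    'discount': [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Pearson correlation between every pair of numeric columns
correlation_matrix = df_sample.corr(method='pearson')

print(correlation_matrix)
```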

### Outlier detection
Here we detect and handle outliers.
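
During exploration it can be useful to flag outliers rather than drop them, so they can be inspected. The sketch below uses z-scores on hypothetical data; the threshold of 2.0 is an assumption chosen for this tiny sample (3.0 is a more common choice on larger datasets).

```python
import pandas as pd

# Hypothetical sample with one unusually long stay
df_sample = pd.DataFrame({
    'booking_id': range(1, 11),
    'nights': [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 50.0],
})

# Assumed cut-off for this small sample
Z_SCORE_THRESHOLD = 2.0

# z-score: distance from the mean, measured in standard deviations
mean_nights = df_sample['nights'].mean()
std_nights = df_sample['nights'].std()
df_sample['z_score'] = (df_sample['nights'] - mean_nights) / std_nights
df_sample['is_outlier'] = df_sample['z_score'].abs() > Z_SCORE_THRESHOLD

# Rows to inspect manually
flagged = df_sample[df_sample['is_outlier']]
```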

### More data exploration
<ul>
    <li>How much money can we make from this project?</li>
    <li>How much can Egencia save?</li>
    <li>Any other exploration that is still missing, in accordance with the goals of the notebook</li>
</ul>


## Conclusion
Conclusions for the notebook. They must be aligned with the goals described at the beginning.