# AI Skunkworks Project - Hyperparameters Database

# Contents

- <a href='#1'>1. Project Description</a>
- <a href='#2'>2. Exploratory Data Analysis</a>
    - <a href='#2.1'>2.1 Auditing and cleansing the loaded data</a>
    - <a href='#2.2'>2.2 Exploring the data</a>
    - <a href='#2.3'>2.3 Data visualization</a>
- <a href='#3'>3. </a>
- <a href='#4'>4. </a>
- <a href='#4'>4. Conclusion</a>
- <a href='#5'>5. Contributions statement</a>
- <a href='#6'>6. Citations</a>
- <a href='#7'>7. License</a>

# <a id='1'>1. Project Description</a>

#### Goals and Objectives:
<ul>
<li>In statistics, a hyperparameter is a parameter of a prior distribution; it captures the prior belief before data is observed.</li>
<li>In any machine learning algorithm, these parameters need to be set before training a model.</li>
<li>Hyperparameters are important because they directly control the behaviour of the training algorithm and have a significant impact on the performance of the model being trained.</li>
<li>Our aim is to find well-tuned hyperparameters for our dataset, which will help the database team model the database schema efficiently. We will build H2O models on this dataset to obtain those hyperparameters.</li>
</ul>

#### Background Research:
Hyperparameters are variables that must be set before a learning algorithm is applied to a dataset. In machine learning, a significant part of model performance depends on the hyperparameter values selected. The goal of hyperparameter exploration is to search across hyperparameter configurations to find the one that gives the best performance. The challenge is that no single set of values works everywhere: the best values depend on the task and the dataset. The hyperparameter database to be developed as part of this project is an open resource with algorithms, tools, and data that lets users visualize and understand how to choose hyperparameters that maximize the predictive power of their models. Phase I of the project involves selecting a unique dataset and running different models (with varying hyperparameters) on it using H2O to collect the predicted target variables, hyperparameters, metadata, etc.<br>
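
The search across hyperparameter configurations described above can be sketched with a small, self-contained example. This is only an illustration under stated assumptions: the toy k-nearest-neighbour regressor, the synthetic data, and the names `knn_predict`, `validation_mse`, and `grid` are hypothetical and are not part of the project's H2O workflow.

```python
import itertools
import random


def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training points (toy 1-D model)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if weighted:
        # Closer neighbours get larger weights; the small constant avoids division by zero
        weights = [1.0 / (abs(px - x) + 1e-9) for px, _ in neighbours]
    else:
        weights = [1.0] * len(neighbours)
    return sum(w * py for w, (_, py) in zip(weights, neighbours)) / sum(weights)


def validation_mse(train, valid, k, weighted):
    """Mean squared error of the toy model on the validation points."""
    errors = [(knn_predict(train, x, k, weighted) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)


# Synthetic data (an assumption for illustration): y = x^2 plus Gaussian noise
rng = random.Random(0)
points = [(i / 10.0, (i / 10.0) ** 2 + rng.gauss(0, 0.5)) for i in range(100)]
rng.shuffle(points)
train, valid = points[:70], points[70:]

# Hypothetical hyperparameter grid: every combination is one configuration
grid = {"k": [1, 3, 5, 9, 15], "weighted": [False, True]}

results = []
for k, weighted in itertools.product(grid["k"], grid["weighted"]):
    mse = validation_mse(train, valid, k, weighted)
    results.append({"k": k, "weighted": weighted, "mse": mse})

# Keep every result for later reference and pick the best configuration
best = min(results, key=lambda r: r["mse"])
print("best configuration:", best)
```

Every configuration is kept in `results`, not just the winner, so the full record of what was tried remains available for later comparison.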

**Hyperparameters can be divided into 2 categories:** <br>
1. Optimizer hyperparameters
<ul>
<li>They relate to the optimization and training process rather than to the model itself. The learning rate is a typical example.</li>
<li>If the learning rate is much smaller than the optimal value, training takes much longer (hundreds or thousands of epochs) to reach the ideal state</li>
<li>If the learning rate is much larger than the optimal value, the updates overshoot the ideal state and the algorithm might not converge</li>
</ul><br>

2. Model Specific hyperparameters
<ul>
<li>They are more involved in the structure of the model</li>
</ul>
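
The learning-rate behaviour described for optimizer hyperparameters can be illustrated with plain gradient descent on a one-dimensional quadratic. This is a minimal sketch, assuming the toy objective f(w) = w² and hypothetical learning-rate values chosen only to show the three regimes.

```python
def gradient_descent(learning_rate, steps=100, start=1.0):
    """Minimise f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w          # derivative of w**2
        w -= learning_rate * gradient
    return w


too_small = gradient_descent(0.001)   # barely moves from the start in 100 steps
reasonable = gradient_descent(0.1)    # gets very close to the minimum at w = 0
too_large = gradient_descent(1.1)     # each step overshoots and |w| keeps growing

print("too small :", too_small)
print("reasonable:", reasonable)
print("too large :", too_large)
```

With the small rate the iterate is still far from the minimum after 100 steps, while the large rate flips the sign of `w` on every update and increases its magnitude, so the run diverges.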

Currently, the hyperparameter database analyzes the effect of hyperparameters on the following algorithms: Distributed Random Forest (DRF), Generalized Linear Model (GLM), Gradient Boosting Machine (GBM), Naïve Bayes Classifier, Stacked Ensembles, XGBoost, and Deep Learning Models (Neural Networks).


**Algorithms and code sources:**
<ul>
<li>The algorithms are chosen based on the type of problem the dataset poses (classification or regression).</li>
<li>We will use the H2O Python module.</li>
<li>This Python module provides access to the H2O JVM, as well as its extensions, objects, machine-learning algorithms, and modeling support capabilities, such as basic munging and feature generation. </li>
</ul>
    
**What the team will work on, and how:**
<ul>
<li>Find a dataset.</li>
<li>Determine which algorithms suit that dataset, i.e. whether it calls for classification or regression algorithms.</li>
<li>Run H2O in Python to generate models for several different run times.</li>
<li>If, for example, five algorithms are used, the best model for each of those algorithms is selected to obtain its hyperparameters.</li>
<li>All models are stored for reference, whether they are the best or the worst.</li>
<li>Share insights about the data and the conclusions drawn from the H2O analysis with the database team. This will help them design the ER diagrams, data models, and schema.</li>
<li>We will try to follow this process on as many datasets as possible within the given time constraint.</li>
</ul>
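
The selection step in the list above (keep every model, but pick the best one per algorithm) can be sketched as follows. This is an illustration only: the run records, model ids, hyperparameter values, and `logloss` figures are hypothetical placeholders, not results from this project, and the record layout is an assumption rather than the schema used by the database team.

```python
import json
from collections import defaultdict

# Hypothetical run records: all ids, hyperparameters and metric values below
# are placeholders for illustration, not output from an actual H2O run.
runs = [
    {"algorithm": "GBM", "model_id": "gbm_1",
     "hyperparameters": {"ntrees": 50, "max_depth": 5}, "logloss": 0.31},
    {"algorithm": "GBM", "model_id": "gbm_2",
     "hyperparameters": {"ntrees": 100, "max_depth": 3}, "logloss": 0.28},
    {"algorithm": "DRF", "model_id": "drf_1",
     "hyperparameters": {"ntrees": 50, "max_depth": 20}, "logloss": 0.35},
    {"algorithm": "GLM", "model_id": "glm_1",
     "hyperparameters": {"alpha": 0.5, "lambda": 0.01}, "logloss": 0.40},
    {"algorithm": "GLM", "model_id": "glm_2",
     "hyperparameters": {"alpha": 0.0, "lambda": 0.1}, "logloss": 0.42},
]

# Group the runs by algorithm
by_algorithm = defaultdict(list)
for run in runs:
    by_algorithm[run["algorithm"]].append(run)

# Best model per algorithm: lowest logloss wins
best_per_algorithm = {
    algorithm: min(candidates, key=lambda r: r["logloss"])
    for algorithm, candidates in by_algorithm.items()
}

# Every run is serialised, best or worst, so nothing is lost for later reference
all_runs_json = json.dumps(runs, indent=2)

for algorithm, best in sorted(best_per_algorithm.items()):
    print(algorithm, best["model_id"], best["hyperparameters"])
```

Serialising every run to JSON is one simple way to hand the full set of models and hyperparameters to the database team; any other storage format would serve the same purpose.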
    
**Challenges:**
<ul>
<li>Implementing the H2O analysis.</li>
<li>Obtaining optimized hyperparameters.</li>
<li>The model might overfit the training dataset and then perform poorly on the test dataset.</li>
<li>Tuning the model and feature parameters well.</li>
</ul>

# <a id='2'>2. Exploratory Data Analysis</a>

### <a id="2.1">2.1 Auditing and cleansing the loaded data</a>
In this task, we inspect and audit the data to identify data problems and then fix them. Common data problems include:
* Lexical errors, e.g., typos and spelling mistakes
* Irregularities, e.g., abnormal data values and data formats
* Violations of integrity constraints
* Outliers
* Duplications
* Missing values
* Inconsistency, e.g., inhomogeneity in values and types in representing the same data
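
Several of the checks listed above can be sketched with pandas. This is a minimal example on a small hypothetical frame: the column names are borrowed from the dataset, but the rows are invented for illustration, and treating the string `unknown` as a placeholder for missing information is an assumption.

```python
import pandas as pd

# Hypothetical sample rows (invented for illustration, not taken from the data file)
sample = pd.DataFrame({
    "age": [56, 57, 37, 37, 240],
    "job": ["admin.", "services", "services", "services", "Admin."],
    "education": ["basic.4y", "unknown", "high.school", "high.school", None],
})

# Missing values per column
missing_per_column = sample.isna().sum()

# Fully duplicated rows (every repeat after the first occurrence is counted)
duplicate_rows = int(sample.duplicated().sum())

# Placeholder strings that hide missing information (assumed convention)
unknown_per_column = sample[["job", "education"]].eq("unknown").sum()

# Outliers in a numeric column using the 1.5 * IQR rule
q1, q3 = sample["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sample[(sample["age"] < q1 - 1.5 * iqr) | (sample["age"] > q3 + 1.5 * iqr)]

# Inconsistent labels: the same category written with different casing
raw_labels = sample["job"].nunique()
normalised_labels = sample["job"].str.lower().nunique()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(unknown_per_column)
print(outliers)
print("job labels before/after lower-casing:", raw_labels, normalised_labels)
```

The 1.5 × IQR rule is only one common convention for flagging outliers; the appropriate threshold depends on the column being audited.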

### The following libraries are used throughout this notebook for data visualization and more


<table>
<thead><tr>
<th style="text-align:center">Serial No.</th>
<th style="text-align:center">Package</th>
<th style="text-align:center">Plots used in this kernel</th>
<th style="text-align:center">Remark</th>
<th style="text-align:center">Nature of plots</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">1</td>
<td style="text-align:center"><strong>Matplotlib</strong></td>
<td style="text-align:center">1. vendor_id histogram, 2. store and fwd flag histogram</td>
<td style="text-align:center">Matplotlib is the oldest and most widely used Python visualization package. It is more than a decade old, but it is still the first name that comes to mind when it comes to plotting. Many libraries are built on top of it and use its functions in the backend. Its style is very simple, which is why plotting with it is fast. It is also used to create axes and design the layout for plots made with other libraries such as seaborn.</td>
</tr>
<tr>
<td style="text-align:center">2</td>
<td style="text-align:center"><strong>Seaborn</strong></td>
<td style="text-align:center">1. Violin plot (passenger count vs trip duration), 2. Boxplots (weekday vs trip duration), 3. tsplot (hours, weekday vs avg trip duration), 4. distplots of lat-long and trip_duration</td>
<td style="text-align:center">Seaborn is my favorite plotting library (not at all a fan of House Greyjoy though :P). Plots from this package are soothing to the eyes. It is built as a wrapper around matplotlib, and many matplotlib functions also work with it. The colors in this package's plots are amazing.</td>
</tr>
<tr>
<td style="text-align:center">3</td>
<td style="text-align:center"><strong>Pandas</strong></td>
<td style="text-align:center">1. Parallel coordinates (for cluster characteristics)</td>
<td style="text-align:center">Pandas also offers many plotting functions. It is built on matplotlib too, so you need to know matplotlib to tweak the pandas defaults. It offers parallel-coordinates plots (which are nowhere near what R offers as alluvial plots), and these are used in this notebook to show cluster characteristics.</td>
</tr>
<tr>
<td style="text-align:center">4</td>
<td style="text-align:center"><strong>Bokeh</strong></td>
<td style="text-align:center">1. Time series plot (day of the year vs avg trip duration)</td>
<td style="text-align:center">Bokeh is a great package that offers interactive plots. You can use Bokeh with other libraries such as seaborn, Datashader, or HoloViews, but Bokeh itself offers many different kinds of plots. Zoom, axis controls, and interactive legends set Bokeh apart from the others.</td>
<td style="text-align:center"><strong>Interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">5</td>
<td style="text-align:center"><strong>Folium</strong></td>
<td style="text-align:center">1. Pickup locations in Manhattan, 2. Cluster locations in the USA, 3. Cluster locations in Manhattan</td>
<td style="text-align:center">This package offers geographical maps that are also interactive. It supports different map tiles, Stamen Terrain and OpenStreetMap to name a few. You can place bubbles at locations, change the zoom, scroll the plot in any direction, and add interactions. For example, the cluster plots shown in this notebook offer information about clusters such as the number of vehicles going out and the most frequently visited clusters. <em>Kaggle started supporting this package only during this competition.</em></td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">6</td>
<td style="text-align:center"><strong>Pygmaps</strong></td>
<td style="text-align:center">1. location visualizations 2. cluster visualizations</td>
<td style="text-align:center">Pygmaps is available only as an archived package and cannot be installed with the pip install command. It was the predecessor of the gmaps package, yet it offers a few great interactions that even gmaps does not. For example, the scatter of a cluster can be plotted better with this package than with gmaps. It was underdeveloped, and its developed version is known as gmaps, yet it was able to generate beautiful visualizations. Plots made by this package are best viewed in a browser.</td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">7</td>
<td style="text-align:center"><strong>Plotly</strong></td>
<td style="text-align:center">1. Bubble plot</td>
<td style="text-align:center">This is another great package that offers colorful visualizations, but some of these plots require interaction with the Plotly server, so you need an API key and the package will call the API.</td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">8</td>
<td style="text-align:center"><strong>Gmaps</strong></td>
<td style="text-align:center"><em>To be updated</em></td>
<td style="text-align:center">gmaps provides great features such as routes, and we are all so used to Google Maps that it feels like home.</td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">9</td>
<td style="text-align:center"><strong>Ggplot2</strong></td>
<td style="text-align:center">1. Weather plots of NYC for given period</td>
<td style="text-align:center">ggplots are now available in Python as well. The package is still under development and sparsely documented, which makes it a little difficult to use, but it produces very beautiful plots, so there is a tradeoff ;)</td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">10</td>
<td style="text-align:center"><strong>Basemaps</strong></td>
<td style="text-align:center"><em>Will not be added in this kernel</em></td>
<td style="text-align:center">Unless you are developing a maps-related library, there is little benefit in using basemaps. They offer many options, but using them is difficult because of the large number of arguments, the different style, and the sparse documentation, and plotting a map can take many lines of code.</td>
</tr>
<tr>
<td style="text-align:center">11</td>
<td style="text-align:center"><strong>No package</strong></td>
<td style="text-align:center">1. heatmaps of NYC taxi traffic</td>
<td style="text-align:center">Instead of depending on Datashader, I tried plotting the heatmap of traffic data from a raw image. You will get to know the basics of image processing (reading an image, color schemes, that's all :P) and how such a basic exercise can produce a traffic heatmap.</td>
</tr>
<tr>
<td style="text-align:center">12</td>
<td style="text-align:center"><strong>Datashader</strong></td>
<td style="text-align:center">1. Location heatmap</td>
<td style="text-align:center">If you have a very large amount of data to plot on a map, Datashader is one of the best and easiest options available on the market. But I found that raw image processing and generating plots with scipy.misc or cv2 is considerably faster than using this package.</td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
<tr>
<td style="text-align:center">13</td>
<td style="text-align:center"><strong>Holoviews</strong></td>
<td style="text-align:center">1. Pairplot for feature interaction.</td>
<td style="text-align:center">Holoviews is another alternative for making interactive visualizations, and it offers artistic plots. But you need plenty of RAM to use this package, otherwise the notebook will hang. Plots exported to HTML work perfectly fine.</td>
<td style="text-align:center"><strong>interactive</strong></td>
</tr>
</tbody>
</table>

### Loading required libraries

In [2]:
import pandas as pd  #pandas for using dataframe and reading csv 
import numpy as np   #numpy for vector operations and basic maths 
import urllib        #for url stuff
import re            #for processing regular expressions
import datetime      #for datetime operations
import calendar      #calendar helpers for datetime operations
import time          #to get the system time
import scipy         #for other dependencies
from scipy.misc import imread, imresize, imsave  # for plots; removed from recent SciPy releases, so this line needs an older SciPy
import seaborn as sns #for making plots
import matplotlib.pyplot as plt # for plotting
import plotly.plotly as py      # moved to the separate chart_studio package in plotly 4, so this line needs an older plotly
import plotly.graph_objs as go
from matplotlib.pyplot import *
from matplotlib import cm

import warnings                 # Ignore  Warnings
warnings.filterwarnings("ignore")

### <a id="2.2">2.2 Exploring the data</a>

### Importing the data set and exploring it with the pandas head function

In [3]:
data = pd.read_csv('bank-additional-full.csv')

In [4]:
data.head()

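<table>
<thead><tr>
<th style="text-align:center"></th>
<th style="text-align:center">age;"job";"marital";"education";"default";"housing";"loan";"contact";"month";"day_of_week";"duration";"campaign";"pdays";"previous";"poutcome";"emp.var.rate";"cons.price.idx";"cons.conf.idx";"euribor3m";"nr.employed";"y"</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">0</td>
<td style="text-align:center">56;"housemaid";"married";"basic.4y";"no";"no";...</td>
</tr>
<tr>
<td style="text-align:center">1</td>
<td style="text-align:center">57;"services";"married";"high.school";"unknown...</td>
</tr>
<tr>
<td style="text-align:center">2</td>
<td style="text-align:center">37;"services";"married";"high.school";"no";"ye...</td>
</tr>
<tr>
<td style="text-align:center">3</td>
<td style="text-align:center">40;"admin.";"married";"basic.6y";"no";"no";"no...</td>
</tr>
<tr>
<td style="text-align:center">4</td>
<td style="text-align:center">56;"services";"married";"high.school";"no";"no...</td>
</tr>
</tbody>
</table>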
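
The preview above shows every field packed into a single column, which suggests that the file is semicolon-delimited while `pd.read_csv` splits on commas by default. The sketch below illustrates the difference on a small in-memory sample that reuses only the first few fields visible in the preview; the `raw` string and the variable names are assumptions made for illustration, and the same `sep=';'` argument would presumably apply when reading `bank-additional-full.csv` itself.

```python
import io

import pandas as pd

# In-memory sample built from the first four fields visible in the preview
raw = (
    'age;"job";"marital";"education"\n'
    '56;"housemaid";"married";"basic.4y"\n'
    '57;"services";"married";"high.school"\n'
)

# Default comma separator: the whole line ends up in one column
single_column = pd.read_csv(io.StringIO(raw))

# Semicolon separator: each field becomes its own column
parsed = pd.read_csv(io.StringIO(raw), sep=";")

print(single_column.shape)
print(parsed.shape)
print(parsed.head())
```

Checking `DataFrame.shape` right after loading is a quick way to catch a wrong delimiter before any further analysis.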
