Data Engineer Final Project
----

<center><img src="https://dzpp79ucibp5a.cloudfront.net/events_banners/12768_normal_1403512002_api-meetup-logo-600x200.png" height="500"/></center>

# 3NF DataFrames
-----

<center><img src="http://www.ibm.com/developerworks/library/ba-augment-data-warehouse1/fig2.png" height="500"/></center>

Groups 
----
<center><img src="groups.png" height="200"/></center>

Event
----
<center><img src="event.png" height="200"/></center>

Topic
----
<center><img src="topics.png" height="200"/></center>

RSVP
----
<center><img src="rsvp.png" height="200"/></center>


Insights:
----

- Explore Tech communities in different cities/countries
- Identify the rise of new technology
- Understand how smaller tech topics are interrelated

Get a _feel_ for the data

What is exploratory data analysis (EDA)?
-----

<center><img src="DataWarehouseWithMDMDQS2.png" height="300"/></center>

- An approach/philosophy about how data analysis should be carried out
- Summarizes the main characteristics of the data (often with visual methods)
- Allow the data to reveal its underlying structure and inspire models
 
"Exploratory Data Analysis" by John Tukey 1977

Interactive analysis is the best way to really figure out what is going on in a data set.

You need to make tables, plots, identify outliers, find missing data, and identify problems with the data. 

To do this you need to interact with the data quickly and easily,

that is why we use Python (a dynamic language).

What are the general steps for EDA?
-----

1. Load data  
2. Look at univariate variables  
    - Visualizations
    - Summary statistics
3. Look at bivariate variables  
4. Look at n-variate variables  

5\. Advanced attribute understanding

- Unsupervised  
    - Dimension reduction  
    - Protip: Use Principal component analysis (PCA)
- Supervised  
    - Regression  
    - Decision trees
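
The first three steps can be sketched with only the standard library. This is a minimal illustration, not a prescribed workflow: the rows are copied from the `brain_size.csv` sample shown in the next section, only a few columns are kept, and names such as `rows` and `by_gender` are made up for the example.

```python
import csv
import io
from statistics import mean, median

# A few rows copied from the brain_size.csv sample (inlined so the sketch is self-contained)
raw = '''"id","gender","fsiq","height"
"1","Female",133,"64.5"
"2","Male",139,"73.3"
"3","Male",140,"72.5"
"4","Male",133,"68.8"
"5","Female",137,"65.0"
'''

# 1. Load data
rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Univariate: summary statistics for a single column
fsiq = [int(row["fsiq"]) for row in rows]
print("fsiq mean:", mean(fsiq), "median:", median(fsiq))

# 3. Bivariate: one column broken down by another
by_gender = {}
for row in rows:
    by_gender.setdefault(row["gender"], []).append(float(row["height"]))
for gender, heights in by_gender.items():
    print(gender, "mean height:", round(mean(heights), 2))
```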

Data as a table
-----
Data should be formatted as a two-dimensional table (aka a matrix) with the rows as observations and the columns as attributes. 

If the data is not in a matrix, or is in a poorly organized one, your first task is to __jam it into a matrix.__
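
As one hedged illustration of "jamming" data into a matrix, the sketch below flattens nested records into rows and columns. The records, field names, and the `to_row` helper are invented for the example; missing values are filled with `None` so that every row has the same columns.

```python
# Hypothetical nested records, e.g. parsed from JSON
records = [
    {"id": 1, "name": "Ada", "location": {"city": "London", "country": "GB"}},
    {"id": 2, "name": "Grace", "location": {"city": "New York"}},
]

columns = ["id", "name", "city", "country"]  # the attributes we want as columns

def to_row(record):
    """Flatten one nested record into a flat list ordered by `columns`."""
    flat = {"id": record.get("id"), "name": record.get("name")}
    flat.update(record.get("location", {}))
    return [flat.get(column) for column in columns]  # None where a value is missing

table = [to_row(record) for record in records]  # each row is one observation
for row in table:
    print(row)
```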

In [1]:
! head ../../data/brain_size.csv

"id","gender","fsiq","viq","piq","weight","height","mri_count"
"1","Female",133,132,124,"118","64.5",816932
"2","Male",139,123,150,"143","73.3",1038437
"3","Male",140,150,124,".","72.5",1001121
"4","Male",133,129,128,"172","68.8",965353
"5","Female",137,132,134,"147","65.0",951545
"6","Female",99,90,110,"146","69.0",928799
"7","Female",138,136,131,"138","64.5",991305
"8","Female",92,90,98,"175","66.0",854258
"9","Male",89,93,84,"134","66.3",904858


The Pandas DataFrame
------

We will store and manipulate this data in a pandas.DataFrame

pandas is Excel on steroids.

pandas.DataFrame is the Python equivalent of a spreadsheet table. 


In [2]:
import pandas as pd

brain_df = pd.read_csv('../../data/brain_size.csv')

In [3]:
brain_df.head(n=2)

   id  gender  fsiq  viq  piq weight  height  mri_count
0   1  Female   133  132  124    118    64.5     816932
1   2    Male   139  123  150    143    73.3    1038437


Instance
------

A dataset is composed of a set of instances. The rows of the table are the instances.  

An instance is the thing you want to understand and/or make a prediction about. 

Statistics __assumes__: each instance is an individual and independent sample.

Features / Attributes
----

A property of an instance. 

The columns of the data table should be the features.

Label / Target
-----

The target value is a special feature, which is to be predicted or classified.

Example
-----

An instance (with its features) and a label.
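
A minimal sketch of these terms, using the first row of the brain-size table. Treating `fsiq` as the target is an assumption made only for this illustration; the notebook does not designate a label for this dataset.

```python
# One instance (row) from the brain-size table
instance = {"gender": "Female", "viq": 132, "piq": 124, "height": 64.5, "fsiq": 133}

target_name = "fsiq"  # assumed target, chosen only for illustration

# Features are every attribute except the target
features = {name: value for name, value in instance.items() if name != target_name}
label = instance[target_name]

example = (features, label)  # an example = an instance's features plus its label
print(example)
```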

Check for understanding
-----

What data is not inherently in table format?


- Images, Sound, and Videos
- Text
- Human thoughts and emotions
- Tensors

Data types (aka, levels of measurement)
-----

- Numerical
- Nominal


What are each?

Numeric attributes are either real or integer typed numbers.

For example - temperature

----

Nominal attributes take on values in a finite set of possibilities.

For example - Sunny ☀️, Overcast 🌥, and Rainy 🌧
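
A small sketch of why the distinction matters: arithmetic is meaningful for numeric attributes, while nominal attributes can only be counted and compared for equality. The values below are made up.

```python
from collections import Counter
from statistics import mean

temperatures = [18.5, 21.0, 19.5, 25.0]            # numeric: real-valued
outlook = ["Sunny", "Rainy", "Sunny", "Overcast"]  # nominal: a finite set of labels

print(mean(temperatures))  # averaging numbers is meaningful
print(Counter(outlook))    # nominal values can be counted, not averaged
```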

Types of Nominal
----

- Binary. What is an example of Binary typed data?

- Categorical / Nominal. What is an example of nominal typed data?

- Ordinal. What is an example of ordinal typed data?

Levels of Measurement are a spectrum
-----

Unmeasurable --> Qualitative --> Binary/Nominal --> Ordinal --> Integer --> Real valued --> "Ratio"

EDA: Techniques
----

- Descriptive Statistics
- Tables
- Visualizations

Descriptive Statistics
-----

- Measures of overall-ness
- Measures of location / center
- Measures of variation
- Measures of the shape of the distribution
- Measures of statistical dependence between variables

Measures of overall-ness
-----

- Count
- Ratio

Data Science is mostly counting
-----

<center><img src="http://static.deathandtaxesmag.com/uploads/2014/12/the-count-pi.jpg" height="500"/></center>

Count is the number of observations (Protip: You want a lot of counts)

What is a ratio?
------

One count divided by another count, like 0.515

It is very important to define your numerator and denominator. This is elementary-school math, but it is critical for businesses.
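
A minimal sketch of making the numerator and denominator explicit. The counts and variable names are invented, chosen so that the ratio matches the 0.515 example above.

```python
# Hypothetical counts; the point is that each one is named and defined
yes_rsvps = 515      # numerator: RSVPs that answered "yes"
total_rsvps = 1000   # denominator: all RSVPs, "yes" or "no"

yes_ratio = yes_rsvps / total_rsvps
print(yes_ratio)
```

Changing the denominator (for example, to all group members rather than all RSVPs) would give a different ratio that answers a different question.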

Summary Statistics
-----

Summary statistics are used for summarizing a sample. The most commonly used summary statistics describe the following characteristics of the data.

- Measure of center
- Measure of spread

Measures of center (aka central tendencies)?
-----

Can you guess a couple?

Measures of center (aka central tendencies)
-----

- Mean
- Median 
- Mode

<center><img src="images/average.jpg" height="800"/></center>

Mean
----

The arithmetic average of the data values

$$ \bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1 + x_2 + \ldots + x_n}{n} $$
    where n is the size of the data
    
<center><img src="http://1.bp.blogspot.com/-so2EWw4rnaY/U2_aqMV4ZeI/AAAAAAAAAII/TRTXyEgTJO8/s1600/MeanVisual.gif" height="300"/></center>

Mean
----

- The most common measure of center
- Often does __not__ summarize data very well

<center><img src="http://web.stanford.edu/~savage/faculty/savage/FOA%20Index_files/image001.jpg" height="500"/></center>

- Can be affected by extreme data values (outliers)

<center><img src="images/mean.png" height="500"/></center>

> Let’s say there are 7 people sitting in a bar. And they start talking about net worth....

<center><img src="http://i0.wp.com/michelbaudin.com/wp-content/uploads/2014/03/Bill-Gates-in-terroir_bar-SF.png?resize=492%2C426" height="500"/></center>

[Source](https://introductorystats.wordpress.com/2011/09/04/when-bill-gates-walks-into-a-bar/)
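
The effect is easy to reproduce. The net worth figures below are invented for illustration, and the newcomer's figure is a placeholder for "a very large number", not a real estimate.

```python
from statistics import mean

# Seven hypothetical patrons, net worth in dollars
net_worths = [40_000, 55_000, 60_000, 75_000, 80_000, 95_000, 120_000]
print(mean(net_worths))  # 75000

# A billionaire walks in (placeholder value)
net_worths_with_outlier = net_worths + [80_000_000_000]
print(mean(net_worths_with_outlier))  # now roughly 10 billion
```

One extreme value moves the mean far away from every other person in the room.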

In [4]:
data = list(range(100))

def calculate_mean(data):
    """Calculate the arithmetic average of a list"""
    pass

In [5]:
from statistics import mean

assert calculate_mean(data) == mean(data)

AssertionError: 

In [None]:
def calculate_mean(data):
    """Calculate the arithmetic average of a list"""
    return sum(data) / len(data)

Median
----

The middle number when the data values are put in order

<center><img src="http://3.bp.blogspot.com/-G43KSyi6Ch0/U29KLv6pDeI/AAAAAAAAAHo/c-kAVEDXZQk/s1600/MedianVisual.gif" height="500"/></center>

<center><img src="images/median.png" height="500"/></center>

Largely unaffected by extreme values (outliers)
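
A quick check of that robustness on invented values: adding one extreme value shifts the median only slightly, while the mean is dominated by it.

```python
from statistics import mean, median

incomes = [40_000, 55_000, 60_000, 75_000, 80_000, 95_000, 120_000]  # invented values
with_outlier = incomes + [80_000_000_000]                            # one extreme value added

print(median(incomes))       # 75000
print(median(with_outlier))  # 77500.0, barely moves
print(mean(with_outlier))    # dominated by the single extreme value
```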

Check for understanding
-----

What is the median if n is odd? What is the median if n is even?

<center><img src="http://askinmask.com/wp-content/uploads/2011/3/kak-vychislit-srednee-znachenie-medianu-modu_5_1.jpg" height="250"/></center>
If n is odd, the median is exactly the middle number. 

If n is even, the median is the average of the two middle numbers.

In [None]:
data_even = list(range(10))
data_odd = list(range(11))

def calculate_median(data):
    """Calculate the middle value of a list"""
    pass

# HINT:
3//2

In [None]:
from statistics import median

assert calculate_median(data_even) == median(data_even)
assert calculate_median(data_odd) == median(data_odd)

In [None]:
def calculate_median(data):
    """Calculate the middle value of a list"""
    data = sorted(data)          # Sort data
    index = (len(data) - 1) // 2 # Location of middle value
    
    if (len(data) % 2):          # If odd, use middle value
        return data[index]
    else:                        # If even, use average of middle values
        return (data[index] + data[index + 1]) / 2

Mode
----

The most frequently occurring value

<center><img src="http://3.bp.blogspot.com/-b9EBpSbyRJk/U25QMGl8D0I/AAAAAAAAAG8/GU_EOKvBExc/s1600/Centr-ModeVisual.gif" height="500"/></center>
 
[Source](http://statisticsbypeter.blogspot.com/2014/05/mode.html)

Mode is __NOT__ affected by extreme values (outliers)

<center><img src="images/mode_nope.png" height="500"/></center>

There may be __no__ mode or several modes
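
Both cases can be seen with `statistics.multimode` (available in Python 3.8+), which returns every value tied for most frequent. The sample lists are invented.

```python
from statistics import multimode

print(multimode([2, 1, 0, 1, 1, 0]))  # [1]: a single mode
print(multimode([1, 1, 2, 2, 3]))     # [1, 2]: several modes
print(multimode([1, 2, 3]))           # [1, 2, 3]: every value ties, so no meaningful mode
```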

In [None]:
data = [2, 1, 0, 1, 1, 0]

def calculate_mode(data):
    """Calculate the most common value of a list"""
    pass

from collections import Counter

In [None]:
from statistics import mode

assert calculate_mode(data) == mode(data)

In [None]:
from collections import Counter

def calculate_mode(data):
    """Calculate the most common value of a list"""
    return Counter(data).most_common(n=1)[0][0]

<center><img src="https://upload.wikimedia.org/wikipedia/en/2/2a/Wikipedia_Edit_Frequency.png" height="350"/></center>

What do these measures of central tendency tell us about Wikipedia edits? 

<center><img src="http://assets.amuniversal.com/a7479dd06cb801301d46001dd8b71c47" height="600"/></center>

Measures of Spread
-----

Can you guess a couple?

Measures of Spread
------

- Range
- Median Absolute Deviation (MAD)
- Variance
- Standard Deviation

Range
-----

$$ \text{Range} = x_{\text{maximum}} - x_{\text{minimum}} $$
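
In the same style as the other exercises, a minimal sketch of the range. The function name `calculate_range` and the sample values are chosen for this illustration.

```python
data = [2, 2, 3, 4, 14]  # invented sample values

def calculate_range(data):
    """Calculate the range (maximum minus minimum) of a list"""
    return max(data) - min(data)

print(calculate_range(data))  # 12
```

Because it depends only on the two most extreme values, the range is very sensitive to outliers.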

Median Absolute Deviation (MAD)
-------

A robust measure of the variability of a univariate sample of quantitative data. 

<center><img src="images/mad.png" height="500"/></center>

[Why we should retire Standard Deviation](https://www.edge.org/response-detail/25401)

(The mean can be used in place of the median, which gives the mean absolute deviation.)

In [None]:
data = [2, 2, 3, 4, 14]

def calculate_mad(data):
    """Calculate the Median Absolute Deviation of a list"""
    pass

# Hint
abs?

In [None]:
assert calculate_mad(data) == 1

In [None]:
from statistics import median

def calculate_mad(data):
    """Calculate the Median Absolute Deviation of a list"""
    median_value = median(data)

    diff = []
    for item in data:
        diff.append(abs(item-median_value))

    return median(diff)

In [None]:
from statistics import median

def calculate_mad(data):
    """Calculate the Median Absolute Deviation of a list"""
    median_value = median(data)  # Compute once instead of once per item
    return median(abs(x - median_value) for x in data)

Variance & Standard Deviation
-----

Variance & Standard Deviation are cousins

Standard deviation roughly describes how far away the typical observation is from the mean.

Variance
-----

$$ s^2_x = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $$


----
Standard Deviation
----

$$ s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} $$

The population versions of these formulas are slightly different (they divide by $n$ instead of $n-1$). You don't have to worry about them because you typically observe only samples.
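
For reference, the standard library exposes both versions: `statistics.variance` and `statistics.stdev` divide by $n-1$ (sample), while `statistics.pvariance` and `statistics.pstdev` divide by $n$ (population). A quick comparison on invented data:

```python
from statistics import pvariance, variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # invented values; mean is 5, squared deviations sum to 32

print(variance(sample))   # sample variance: 32 / 7
print(pvariance(sample))  # population variance: 32 / 8
```

The gap between the two shrinks as $n$ grows.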

In [None]:
data = list(range(100))

def calculate_stdev(data):
    """Calculate the Standard Deviation of a list"""
    pass

In [None]:
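from math import isclose
from statistics import stdev

# Compare with a tolerance: exact float equality can fail on rounding differences
assert isclose(calculate_stdev(data), stdev(data))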

In [None]:
from statistics import mean

def calculate_stdev(data):
    """Calculate the sample Standard Deviation of a list"""
    mean_value = mean(data)  # Compute once instead of once per item
    squared_diffs = sum((x - mean_value) ** 2 for x in data)
    return (squared_diffs / (len(data) - 1)) ** 0.5

Summary
-----

- EDA is the process of understanding your data.
- Data should be represented as a matrix or table.
- The most important descriptive statistics measure center and variation.
- Choose your descriptive statistics wisely

<br>
<br> 
<br>

----

Bonus Material
-----

Geometric mean
----

Indicates the central tendency or typical value of a set of numbers by using the $n^{th}$ root of the product of their values 
(as opposed to the arithmetic mean, which uses their sum)

<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/78e45b0076a1c2c6935c0d1ddb849000afe2b5f5" height="500"/></center>

Useful when comparing different items on different numerical scales
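
`statistics.geometric_mean` (available in Python 3.8+) computes this directly. Growth factors are a common use case; the figures below are invented.

```python
from statistics import geometric_mean, mean

growth_factors = [1.10, 0.90, 1.20]  # e.g. +10%, -10%, +20% in successive periods

print(geometric_mean(growth_factors))  # typical per-period factor
print(mean(growth_factors))            # arithmetic mean, slightly higher
```

Applying the geometric mean three times in a row reproduces the overall growth of $1.10 \times 0.90 \times 1.20$, which the arithmetic mean does not.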

Winsorized mean
-------

The mean calculated after replacing a given portion of a probability distribution or sample at the high and low ends with the most extreme remaining values.

<center><img src="images/win.png" height="500"/></center>

The same amount is typically replaced at both extremes, often 10 to 25 percent of the ends.
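
A minimal sketch of the idea, under stated assumptions: `winsorized_mean` is a name chosen here, `proportion` is the fraction replaced at each end, and the number of replaced values is rounded down. Other conventions exist.

```python
from statistics import mean

def winsorized_mean(data, proportion=0.1):
    """Mean after clamping the lowest and highest `proportion` of values.

    `proportion` is the fraction replaced at EACH end (an assumed convention).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)       # number of values replaced per end
    if k == 0:
        return mean(ordered)
    low, high = ordered[k], ordered[-k - 1]  # most extreme remaining values
    clamped = [min(max(x, low), high) for x in ordered]
    return mean(clamped)

data = [1, 5, 6, 7, 8, 9, 10, 11, 12, 100]  # invented values with two outliers
print(mean(data))                  # 16.9
print(winsorized_mean(data, 0.1))  # 8.5
```

Here the 1 is replaced by 5 and the 100 by 12 before averaging.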
    

Percentile
-----

The $p^{th}$ percentile - $p\%$ of the values in the data are less than or equal to this value ($0 \leq p \leq 100$)
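
Several conventions exist for computing percentiles. The sketch below uses the nearest-rank method, which matches the definition above; the function name `percentile` and the sample values are chosen for this illustration.

```python
import math

def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p% of the data <= it."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(data)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]
print(percentile(data, 40))   # 20
print(percentile(data, 50))   # 35
print(percentile(data, 100))  # 50
```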

<center><img src="https://s-media-cache-ak0.pinimg.com/564x/94/39/fb/9439fb82998c209a35146093422007e6.jpg" height="500"/></center>

Quartile
-----
- $1^{st}$ quartile = $25^{th}$ percentile
- $2^{nd}$ quartile = $50^{th}$ percentile = __median__
- $3^{rd}$ quartile = $75^{th}$ percentile
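
The three quartiles can be computed with `statistics.quantiles` (available in Python 3.8+). The `method="inclusive"` setting is a choice made for this sketch; the default `"exclusive"` method interpolates differently and can give other values for Q1 and Q3.

```python
from statistics import median, quantiles

data = list(range(1, 10))  # 1, 2, ..., 9

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q2, q3)          # 3.0 5.0 7.0
print(q2 == median(data))  # True: the 2nd quartile is the median
```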

<img src="images/quartile.png" width="500">

<br>
<br> 
<br>

----