# Exploratory Data Analysis (EDA)

### What is EDA?
Exploratory Data Analysis (EDA) is an approach to analyzing data that relies heavily on visual methods. It was championed by John W. Tukey in his 1977 book *Exploratory Data Analysis*, where he argued that too much emphasis was being placed on confirmatory data analysis (statistical hypothesis testing). Put simply, he wanted analysts to look for questions to ask rather than only for answers to questions already posed.

### Objectives of EDA
- Suggest hypotheses about the causes of observed phenomena
- Assess assumptions on which statistical inference will be based
- Support the selection of appropriate statistical tools and techniques
- Provide a basis for further data collection

### Graphical Techniques
Some examples of graphical techniques used in EDA include:
- Box (+whisker) Plot
- Histogram
- Run Chart / Time Series plot
- Scatter Plot
- Principal Component Analysis (PCA)

Some of these techniques are covered in the graphing overview presentation in the PyData Fort Wayne GitHub repository: https://github.com/PyDataFortWayne/GraphingMatplotlibSeaborn

Before we get into how to visualize data, let's look at Tidy Data and how to make the data easy to work with.

In [2]:
# Import the necessary packages
import pandas as pd
import seaborn as sns
import numpy as np

# Tidy Data
Hadley Wickham published an article called *Tidy Data* in the Journal of Statistical Software. In it he outlines how to structure data so that it is easy to clean and analyze. Most of an analyst's time is spent cleaning data, and he wanted to make that cleaning as easy and effective as possible. A consistent format also makes it easier to develop tools that work with the data.

Tidy Data is defined as:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table

Let's look at an example. Suppose a researcher is trying to determine how effective a treatment is. The DataFrame may look like this:

In [3]:
df = pd.DataFrame([['John Smith', None, 2.0], ['Jane Doe', 16.0, 11.0], ['Mary Johnson', 3.0, 1.0]], columns=['Patient', 'Treatment A', 'Treatment B'])
df

Unnamed: 0,Patient,Treatment A,Treatment B
0,John Smith,,2.0
1,Jane Doe,16.0,11.0
2,Mary Johnson,3.0,1.0


However, the same data could be laid out differently and still convey the same information. For example:

In [4]:
pd.DataFrame([['Treatment A', None, 16.0, 3.0], ['Treatment B', 2.0, 11.0, 1.0]], columns=['Treatment', 'John Smith', 'Jane Doe', 'Mary Johnson'])

Unnamed: 0,Treatment,John Smith,Jane Doe,Mary Johnson
0,Treatment A,,16.0,3.0
1,Treatment B,2.0,11.0,1.0
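
As a rough sketch of why the two layouts carry the same information, the second table can be derived from the first by transposing it. The variable names `wide_by_patient` and `wide_by_treatment` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as the first table above: one row per patient
wide_by_patient = pd.DataFrame(
    [['John Smith', None, 2.0], ['Jane Doe', 16.0, 11.0], ['Mary Johnson', 3.0, 1.0]],
    columns=['Patient', 'Treatment A', 'Treatment B'])

# Transpose so that each treatment becomes a row and each patient a column
wide_by_treatment = wide_by_patient.set_index('Patient').T

# Tidy up the axis labels left over from the transpose
wide_by_treatment.index.name = 'Treatment'
wide_by_treatment.columns.name = None
wide_by_treatment = wide_by_treatment.reset_index()

print(wide_by_treatment)
```

Neither layout is wrong, but code written for one will not work on the other, which is the inconsistency a single agreed format avoids.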


This ambiguity is what Hadley Wickham's Tidy Data sets out to resolve. The same data in tidy format would look like this:

In [5]:
df_tidy = pd.DataFrame([['John Smith', 'a', None],
                        ['Jane Doe', 'a', 16.0],
                        ['Mary Johnson', 'a', 3.0],
                        ['John Smith', 'b', 2.0],
                        ['Jane Doe', 'b', 11.0],
                        ['Mary Johnson', 'b', 1.0]
                       ], 
                       columns=['Patient Name', 'Treatment', 'Result']
                      )
df_tidy

Unnamed: 0,Patient Name,Treatment,Result
0,John Smith,a,
1,Jane Doe,a,16.0
2,Mary Johnson,a,3.0
3,John Smith,b,2.0
4,Jane Doe,b,11.0
5,Mary Johnson,b,1.0
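
One practical payoff of the tidy layout is that summaries and filters become one-liners, because every variable is already a column. A minimal sketch, rebuilding the tidy table above under the illustrative name `tidy`:

```python
import pandas as pd

tidy = pd.DataFrame(
    [['John Smith', 'a', None], ['Jane Doe', 'a', 16.0], ['Mary Johnson', 'a', 3.0],
     ['John Smith', 'b', 2.0], ['Jane Doe', 'b', 11.0], ['Mary Johnson', 'b', 1.0]],
    columns=['Patient Name', 'Treatment', 'Result'])

# Mean result per treatment; missing values are skipped by default
mean_by_treatment = tidy.groupby('Treatment')['Result'].mean()

# Selecting one treatment is a plain boolean mask on a single column
treatment_b = tidy[tidy['Treatment'] == 'b']

print(mean_by_treatment)
print(treatment_b)
```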


# Cleaning Data
## Melt
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.melt.html


`melt` “unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set.

As a reminder, the treatment table from earlier looked like this:

In [6]:
df

Unnamed: 0,Patient,Treatment A,Treatment B
0,John Smith,,2.0
1,Jane Doe,16.0,11.0
2,Mary Johnson,3.0,1.0


If we apply the melt method to this DataFrame we can make it Tidy:

In [7]:
melted_df = df.melt(id_vars='Patient', value_name='Result', var_name='Treatment')
melted_df

Unnamed: 0,Patient,Treatment,Result
0,John Smith,Treatment A,
1,Jane Doe,Treatment A,16.0
2,Mary Johnson,Treatment A,3.0
3,John Smith,Treatment B,2.0
4,Jane Doe,Treatment B,11.0
5,Mary Johnson,Treatment B,1.0
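
`melt` also accepts a `value_vars` argument to unpivot only some of the columns, and rows with missing results can be dropped afterwards if the analysis does not need them. A small sketch with illustrative variable names (`wide`, `only_a`, `long_complete`):

```python
import pandas as pd

wide = pd.DataFrame(
    [['John Smith', None, 2.0], ['Jane Doe', 16.0, 11.0], ['Mary Johnson', 3.0, 1.0]],
    columns=['Patient', 'Treatment A', 'Treatment B'])

# Unpivot only the listed column; 'Patient' stays as the identifier
only_a = wide.melt(id_vars='Patient', value_vars=['Treatment A'],
                   var_name='Treatment', value_name='Result')

# Unpivot everything, then drop observations that were never recorded
long_complete = (wide.melt(id_vars='Patient', var_name='Treatment', value_name='Result')
                     .dropna(subset=['Result'])
                     .reset_index(drop=True))

print(only_a)
print(long_complete)
```

Whether to drop missing rows is a judgment call: an explicit missing value can itself be informative, so this step is optional.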


## Pivot
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html

`pivot` reshapes data (produces a “pivot” table) based on column values. It uses the unique values from the chosen index and columns to form the axes of the resulting DataFrame.

In [8]:
melted_df

Unnamed: 0,Patient,Treatment,Result
0,John Smith,Treatment A,
1,Jane Doe,Treatment A,16.0
2,Mary Johnson,Treatment A,3.0
3,John Smith,Treatment B,2.0
4,Jane Doe,Treatment B,11.0
5,Mary Johnson,Treatment B,1.0


In [9]:
melted_df.pivot(index='Patient', columns='Treatment', values='Result').reset_index()

Treatment,Patient,Treatment A,Treatment B
0,Jane Doe,16.0,11.0
1,John Smith,,2.0
2,Mary Johnson,3.0,1.0

Pivoting recovers the original wide table, apart from row order (patients are now sorted alphabetically). For comparison, here is the original DataFrame again:


In [10]:
df

Unnamed: 0,Patient,Treatment A,Treatment B
0,John Smith,,2.0
1,Jane Doe,16.0,11.0
2,Mary Johnson,3.0,1.0
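
`pivot` needs each index/column pair to appear only once; with repeated pairs it raises a `ValueError`. In that case `pivot_table`, which aggregates the duplicates, is a common alternative. A sketch with a made-up repeated measurement (the second `Jane Doe` / `Treatment A` row is hypothetical, added only to trigger the duplicate):

```python
import pandas as pd

long_df = pd.DataFrame(
    [['Jane Doe', 'Treatment A', 16.0],
     ['Jane Doe', 'Treatment A', 18.0],   # hypothetical repeat measurement
     ['Jane Doe', 'Treatment B', 11.0],
     ['Mary Johnson', 'Treatment A', 3.0],
     ['Mary Johnson', 'Treatment B', 1.0]],
    columns=['Patient', 'Treatment', 'Result'])

# pivot cannot decide which of the two values to keep, so it fails
try:
    long_df.pivot(index='Patient', columns='Treatment', values='Result')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table combines duplicates with an aggregation function (mean here)
wide_mean = long_df.pivot_table(index='Patient', columns='Treatment',
                                values='Result', aggfunc='mean')

print(pivot_failed)
print(wide_mean)
```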


## Splitting Fields
Sometimes it's necessary to split a field into separate columns. Luckily this is fairly easy with pandas. Let's clean up our melted dataset by splitting the Treatment field and keeping just the treatment letter.

In [11]:
# pat - The string to split on
# expand=True - Expand the list of pieces into a DataFrame, one column per piece
# [1] - Select the second column of that DataFrame
melted_df['Treatment'] = melted_df['Treatment'].str.split(pat=' ', expand=True)[1]
melted_df

Unnamed: 0,Patient,Treatment,Result
0,John Smith,A,
1,Jane Doe,A,16.0
2,Mary Johnson,A,3.0
3,John Smith,B,2.0
4,Jane Doe,B,11.0
5,Mary Johnson,B,1.0
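
The same `str.split` call can also populate several new columns at once. As a sketch, the patient name is split into first and last name below; the column names `First Name` and `Last Name` are illustrative. Passing `n=1` limits the split to the first space, so any further spaces stay in the last name.

```python
import pandas as pd

patients = pd.DataFrame({'Patient': ['John Smith', 'Jane Doe', 'Mary Johnson']})

# expand=True returns a DataFrame with one column per piece,
# which can be assigned to two new columns in one step
patients[['First Name', 'Last Name']] = patients['Patient'].str.split(pat=' ', n=1, expand=True)

print(patients)
```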


A more complex example is an Apache access log, which isn't recorded in a simple CSV format but in its own log format. We can use a regular expression to split each line into fields in a DataFrame, which eases parsing and analysis.

In [53]:
import os.path
import re

# Read in sample log file
df_apache = pd.read_csv(os.path.join('data', 'apache_access.log'), 
                        header=None, 
                        names=['RAW'])

# Display entire contents of cells
pd.set_option('display.max_colwidth', 2000)
df_apache.head()

Unnamed: 0,RAW
0,"127.0.0.1 - - [07/Mar/2004:16:05:49 -0800] ""POST /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1"" 401 12846"
1,"127.0.0.1 - - [07/Mar/2004:16:06:51 -0800] ""POST /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1"" 200 4523"
2,"127.0.0.1 - - [07/Mar/2004:16:10:02 -0800] ""POST /mailman/listinfo/hsdivision HTTP/1.1"" 200 6291"
3,"127.0.0.1 - - [07/Mar/2004:16:11:58 -0800] ""GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1"" 200 7352"
4,"127.0.0.1 - - [07/Mar/2004:16:20:55 -0800] ""GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1"" 200 5253"


In [54]:
# Define a regular expression with one named group per log field
apache_regex = r'^(?P<ip_address>(?:\d{1,3}\.){3}\d{1,3}) ' + \
               r'[^ ]* [^ ]* \[(?P<request_time>[^\]]*)\] ' + \
               r'"(?P<method>[^ ]*) ?(?P<url>[^ ]*) ' + \
               r'(?P<http_version>HTTP\/\d\.\d)" ' + \
               r'(?P<status_code>\d+) ' + \
               r'(?P<response_bytes>\d+)$'  # last field is the response size in bytes
apache_pattern = re.compile(apache_regex)

# Split on the whole pattern: each capture group becomes its own column
df_split = df_apache['RAW'].str.split(apache_regex, expand=True)
# Remove the empty columns produced before and after the match
del df_split[0]
del df_split[8]

# Label the columns with the group names from the regular expression
df_split.columns = list(apache_pattern.groupindex.keys())

df_split.head()

Unnamed: 0,ip_address,request_time,method,url,http_version,status_code,response_bytes
0,127.0.0.1,07/Mar/2004:16:05:49 -0800,POST,/twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables,HTTP/1.1,401,12846
1,127.0.0.1,07/Mar/2004:16:06:51 -0800,POST,/twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2,HTTP/1.1,200,4523
2,127.0.0.1,07/Mar/2004:16:10:02 -0800,POST,/mailman/listinfo/hsdivision,HTTP/1.1,200,6291
3,127.0.0.1,07/Mar/2004:16:11:58 -0800,GET,/twiki/bin/view/TWiki/WikiSyntax,HTTP/1.1,200,7352
4,127.0.0.1,07/Mar/2004:16:20:55 -0800,GET,/twiki/bin/view/Main/DCCAndPostFix,HTTP/1.1,200,5253
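
An alternative worth knowing is `Series.str.extract`, which uses the named groups as column names directly, so no empty columns need to be removed. The sketch below runs on two log lines copied from the sample above and then converts the extracted strings to proper types. The shortened pattern and the names `log_lines`, `pattern`, and `parsed` are illustrative assumptions, not the notebook's own code; the last field is called `response_bytes` here because in Apache's Common Log Format it holds the response size in bytes.

```python
import pandas as pd

log_lines = pd.Series([
    '127.0.0.1 - - [07/Mar/2004:16:10:02 -0800] "POST /mailman/listinfo/hsdivision HTTP/1.1" 200 6291',
    '127.0.0.1 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352',
])

# One named group per field; group names become the column names
pattern = (r'^(?P<ip_address>\S+) \S+ \S+ \[(?P<request_time>[^\]]*)\] '
           r'"(?P<method>\S+) (?P<url>\S+) (?P<http_version>HTTP/\d\.\d)" '
           r'(?P<status_code>\d+) (?P<response_bytes>\d+)$')

parsed = log_lines.str.extract(pattern)

# Every extracted field is a string, so convert where a real type helps
parsed['request_time'] = pd.to_datetime(parsed['request_time'],
                                        format='%d/%b/%Y:%H:%M:%S %z')
parsed['status_code'] = parsed['status_code'].astype(int)
parsed['response_bytes'] = parsed['response_bytes'].astype(int)

print(parsed.dtypes)
```

Lines that do not match the pattern come back as rows of missing values with `str.extract`, which makes malformed log entries easy to spot.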


## Categorical Data
https://pandas.pydata.org/pandas-docs/stable/categorical.html

pandas provides a Categorical data type for columns that take only a limited set of values. Each value is stored as a small integer code that refers to the list of categories, which can use less memory than storing every string separately.

Let's take a look at the method column from our Apache example above.

In [55]:
df_split['method'].value_counts()

GET     24
POST     3
Name: method, dtype: int64

In [57]:
df_split['method'].memory_usage()

296

In [59]:
df_split['method'] = df_split['method'].astype('category')
df_split['method'].memory_usage()

203
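
By default, `memory_usage()` on an object-dtype column counts only the pointers to the strings, not the strings themselves; passing `deep=True` includes the string contents and usually shows a larger saving. A sketch on synthetic data (the `methods` Series is made up for illustration, and exact byte counts depend on the pandas and Python versions, so none are quoted here):

```python
import pandas as pd

# Synthetic column with few distinct values repeated many times
methods = pd.Series(['GET', 'POST', 'GET', 'GET'] * 2500, dtype=object)
methods_cat = methods.astype('category')

# deep=True also measures the Python string objects
object_bytes = methods.memory_usage(deep=True)
category_bytes = methods_cat.memory_usage(deep=True)

# Each value is stored as a small integer code into the category list
codes = methods_cat.cat.codes
categories = list(methods_cat.cat.categories)

print(object_bytes, category_bytes)
print(categories, codes.head().tolist())
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for columns with many repeats.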