## Course Announcements

- Due today (11:59 PM):
    - D3
    - Q3
    - Project Proposal
    - [Weekly Project Survey](https://docs.google.com/forms/d/e/1FAIpQLSf_ZLgO19fSP8CnUTzmOgDo6Es0Rj5frpP_OMSscfqrF0Nseg/viewform?usp=sf_link) (*optional*; link also on Canvas Assignments)

# EDA: Exploratory Data Analysis

- tools: `pandas`, `seaborn` & `matplotlib`
- concepts:
    - structure (including joins)
    - granularity
    - scope 
    - temporality
    - faithfulness
- EDA Case Studies: Student pre-course survey data

<div class="alert alert-success">
The examples and data in this notebook are largely adapted from two places: (1) Jake VanderPlas' <a href="https://github.com/jakevdp/PythonDataScienceHandbook" class="alert-link">Python Data Science Handbook</a> and (2) Berkeley's <a href="https://www.textbook.ds100.org/ch/06/viz_intro.html" class="alert-link">Data 100 Textbook</a>. Feel free to check out these resources for more!
</div>

## Exploratory data analysis (EDA)

> an approach to completely and fully understand your dataset.

It requires a state of flexibility and a willingness to look for both:
- artifacts in the data we anticipate exist 
- artifacts that we don't expect / believe are there



![eda](img/exploratory.png)


## Setup

In [None]:
# Import seaborn and apply its plotting styles
import seaborn as sns
sns.set(style="white", font_scale=1.5)

# import matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt
# set plotting size parameter
plt.rcParams['figure.figsize'] = (17, 7)

# make sure pandas & numpy are imported
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

#improve resolution
#comment this line if erroring on your machine/screen
%config InlineBackend.figure_format ='retina'

# EDA

In the [Data 100 Textbook](https://www.textbook.ds100.org/ch/05/eda_intro.html), the authors state that with EDA, "We seek to understand the following properties about our data":

- Structure: the format of our data file.
- Granularity: how fine or coarse each row and column is.
- Scope: how (in)complete our data are.
- Temporality: how the data are situated in time.
- Faithfulness: how well the data captures "reality".

# Structure

What is the format of our data file?

<div class="alert alert-info">
The <b>structure</b> of your data describes the "shape" of the dataset. This refers to the format in which the data are stored.
</div>

#### Questions to be able to answer about the structure of your dataset:
1. What format are the data in?
    - tabular (CSV, TSV, Excel, SQL)
    - nested (XML, JSON)   

2. Is each observation in a separate row? 
    - If not, can we get it into that format?
    - If the data are nested, how would we unnest the data?

3. What variables (columns) do we have information about?
    - what `type` of information is in each column
    - Do we have all of the variables we need?    

4. Are the data spread across multiple tables? 
    - How do we want to join the data?
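
To make questions 2 and 4 concrete, here is a minimal sketch of unnesting and joining. The records, column names, and the second table are all made up for illustration; they are not part of the course data.

```python
import pandas as pd

# Hypothetical nested records, as they might arrive from a JSON file
records = [
    {"id": 1, "name": "Joey", "contact": {"city": "San Diego", "zip": "92093"}},
    {"id": 2, "name": "Weiwei", "contact": {"city": "Berkeley", "zip": "94720"}},
]

# Flatten nested fields into columns such as "contact.city"
people = pd.json_normalize(records)

# A second hypothetical table that shares the "id" key
calls_by_id = pd.DataFrame({
    "id": [1, 1, 3],
    "complaint": ["chemical in eye", "suspected poisoning", "ingested substance"],
})

# A left join keeps every person; indicator=True adds a "_merge" column
# showing which rows found a match in the second table
joined = people.merge(calls_by_id, on="id", how="left", indicator=True)
print(joined)
```

Looking at the `_merge` column is a quick way to see how many rows had no partner in the other table before deciding which kind of join you actually want.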
    

# Granularity
How fine or coarse is the data stored in each row and column?

<div class="alert alert-info">
The <b>granularity</b> of your data is what each observation in your data represents.
</div>

Often, the observation will be a **single individual** (e.g. information about each person who has called into a call center). 

Other times, it will be data collected about a **single individual at a particular time** (e.g. information about each call received on a given day). Note that the same individual could be in this dataset multiple times.

Other times, each row will contain a **summary about a number of individuals** (e.g. total number of calls received at a call center each day). Here, each row would contain summarized information about a whole bunch of people.

## Individual Level Granularity

In this example, we see that each observation represents a single individual.

In [None]:
calls = pd.DataFrame(
    [["Joey",      "suspected poisoning",  42,  "M"],
     ["Weiwei",    "ingested substance",   50,  "F"],
     ["Joey",      "chemical in eye",       8,  "M"],
     ["Karina",    "ingested substance",    7,  "F"],
     ["Nhi",       "ingested substance",    3,  "F"],
     ["Sam",       "chemical on skin",     42,  "M"]], 
    columns = ["Name", "Complaint", "Age", "Gender"])

calls

## Group-Level Granularity

However, the data we're handed often have observations (rows) that summarize information across many individuals.

In [None]:
calls_total = pd.DataFrame(
    [["2019-08-29", 100],
    ["2019-08-30",  212],
    ["2019-08-31",  88],
    ["2019-09-01",  160],
    ["2019-09-02",  122],
    ], 
    columns = ["Date", "Calls"])

calls_total

#### Questions to be able to answer about the granularity of your dataset:
1. What does each record (row) represent?

2. Do all records uniformly represent the same level of granularity?
   - Are some rows individual-level, while others are summaries of the group? How will this be handled?

3. Were the data summarized or aggregated?
    - How were they grouped & summarized?
    - What metrics were used for summarization? (Means and medians are common.)

4. What aggregations/summarizations do we plan to do with these data?
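
As a rough sketch of moving between levels of granularity, the example below goes from one row per call to one row per day. The `call_log` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical call-level records: one row per call
call_log = pd.DataFrame({
    "Date": ["2019-08-29", "2019-08-29", "2019-08-30", "2019-08-30", "2019-08-30"],
    "Name": ["Joey", "Weiwei", "Joey", "Karina", "Nhi"],
    "Age": [42, 50, 42, 7, 3],
})

# Does any individual appear more than once?
name_counts = call_log["Name"].value_counts()
repeat_callers = name_counts[name_counts > 1]

# Coarsen to one row per day: number of calls and median caller age
daily = (call_log.groupby("Date")
         .agg(Calls=("Name", "size"), median_age=("Age", "median"))
         .reset_index())
print(daily)
```

Note that aggregation only goes one way: once the data are summarized per day, the individual callers can no longer be recovered.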
    

# Scope

How (in)complete are our data?

<div class="alert alert-info">
The <b>scope</b> of your data describes how helpful these data are with respect to our data science question.
</div>

At this point, **descriptive statistics** & **exploratory visualizations** become pretty essential.

- Descriptive Statistics - help to summarize typical values and ranges for variables of interest
- Exploratory Visualizations - help us understand the distributions of individual variables and relationships between variables in our dataset

We'll also want to start to determine how frequently, for what variables, and why data are **missing**.
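
A minimal sketch of a first pass at scope, assuming a small made-up `survey` table (the column names mirror the case study later on, but the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses with some missing values
survey = pd.DataFrame({
    "statistics": [7, 5, np.nan, 9, 6],
    "programming": [8, np.nan, np.nan, 4, 7],
    "timezone": [-8, -8, -5, np.nan, -8],
})

# Typical values and ranges for each numeric column
summary = survey.describe()

# Share of missing values per column, most-missing first
missing_share = survey.isnull().mean().sort_values(ascending=False)

# Number of rows where every field is missing
all_missing_rows = survey.isnull().all(axis=1).sum()

print(summary)
print(missing_share)
```

The per-column shares tell you *how frequently* and *for what variables* data are missing; the *why* usually requires going back to how the data were collected.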

# Temporality
How are these data situated in time?

<div class="alert alert-info">
The <b>temporality</b> refers to how the data are situated in time. Specifically, we're interested in what date and time information are included in the dataset.
</div>


#### Questions to be able to answer about the temporality of your dataset:
1. What do reported times represent?
    - What does each date and time in the dataset mean? Time event occurred? Time reported?
    - Note that **timezones** & **Daylight Saving Time** are always important to consider    

2. How are the dates/times represented? What format are they in?
   - YYYY-MM-DD? Year? Time? Date & Time?

3. How are null timestamps represented? 
    - was a "random" date picked?
        - `Jan 1st, 1900` <- Excel's default date
        - `Jan 1st, 1904` <- Excel's default date *on Mac*
        - `12:00am Jan 1st, 1970`<- [Unix Epoch for Timestamps](https://www.wikiwand.com/en/Unix_time#/Encoding_time_as_a_number)
    - `-999`, `NaN`, `NA`

4. What aggregations/summarizations do we plan to do with these data?
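
Here is one way the timestamp checks above might look in `pandas`. The raw strings and the expected format are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical raw timestamps, including placeholder dates and a non-date
raw = pd.Series([
    "2021-03-29 09:15",
    "1970-01-01 00:00",
    "1900-01-01 00:00",
    "not recorded",
    "2021-04-02 17:40",
])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M", errors="coerce")

# Common "default" dates that often stand in for a missing timestamp
sentinels = pd.to_datetime(["1900-01-01", "1904-01-01", "1970-01-01"])
is_placeholder = parsed.isin(sentinels)

# Rows that need a closer look: unparseable or suspicious placeholder dates
suspect = parsed.isna() | is_placeholder
print(raw[suspect])
```

A placeholder date is not always wrong (something really could have happened on Jan 1st, 1970), so treat these as rows to inspect rather than rows to drop automatically.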

# Faithfulness
How well do these data reflect reality?


<div class="alert alert-info">
The <b>faithfulness</b> of your data is a determination of how trustworthy the data are.
</div>

#### Questions to be able to answer about the faithfulness of your dataset:
1. Are the values reasonable / what we expect? 
    - Unreasonable values examples: dates in the future, locations that don't exist, negative counts, wild outliers

2. Are there inconsistencies across tables?
    - Identifiers that don't match?
    - Dates of birth that differ between two tables?
    - Any inconsistencies between values stored in more than one table?


3. Are there data entered by hand? These often contain inconsistencies.


4. Any obvious signs of falsification?
    - examples: repeated names, fake looking email addresses, repeated uncommon names or fields
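
A few of these checks can be sketched in code. Everything below (the tables, the column names, and the assumed collection date `as_of`) is hypothetical and only meant to illustrate the idea.

```python
import pandas as pd

# Hypothetical records with a few planted problems
records = pd.DataFrame({
    "id": [101, 102, 103, 103],
    "visit_date": pd.to_datetime(["2021-03-01", "2035-01-01", "2021-03-04", "2021-03-04"]),
    "n_calls": [3, 5, -2, -2],
})
as_of = pd.Timestamp("2021-04-01")  # assumed date the data were collected

# Unreasonable values: dates in the future, negative counts, repeated IDs
future_dates = records["visit_date"] > as_of
negative_counts = records["n_calls"] < 0
repeated_ids = records["id"].duplicated(keep=False)

# Inconsistencies across tables: the same ID with different dates of birth
table_a = pd.DataFrame({"id": [1, 2], "dob": ["1999-01-01", "2000-05-05"]})
table_b = pd.DataFrame({"id": [1, 2], "dob": ["1999-01-01", "2000-05-15"]})
both = table_a.merge(table_b, on="id", suffixes=("_a", "_b"))
dob_mismatch = both[both["dob_a"] != both["dob_b"]]

print(future_dates.sum(), negative_counts.sum(), repeated_ids.sum())
print(dob_mismatch)
```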

# EDA Case Study: Pre-Course Surveys

Typically you won't be handed a dataset without information about where the data came from or what question you want to ask. But, for the sake of this exercise, let's look at a dataset where you don't have many specifics beyond what's in the dataset itself and see what we can figure out through EDA.

## The Data

We'll start with the student pre-course survey responses, which are stored in a CSV file.

In [None]:
## load dataset in
df = pd.read_csv('data/pre_survey_data_sp21.csv')
df.head()

## Structure

<div class="alert alert-info">
The <b>structure</b> of your data describes the "shape" of the dataset. This refers to the format in which the data are stored.
</div>

Let's first clean up those column names for working with these data...

In [None]:
df.columns = ['ID', 'enrolled', 'tech_access', 'timezone', 'objectives_pre', 
              'language', 'topics', 'statistics', 'programming']
df.head()

At a glance and without any additional information, we can tell this is **tabular data** with *observations in rows* and *variables in columns*.

### Clicker Question #1

Should we change the way these data are stored and why?

- A) Yes, nested XML is easier to work with
- B) Yes, observations should be in columns 
- C) Yes, nested JSON is easier to work with
- D) No, observations in rows is best
- E) Some other answer...

Often the first thing you'll want to know is how big the dataset is.

In [None]:
## determine shape of dataset
df.shape

### What type of variable is in each column?

In [None]:
df.dtypes 

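Ok, the output lists the type `pandas` inferred for each column: numeric types (such as `int64` or `float64`) for columns of numbers, and `object` for columns that contain some type of string data.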

## Granularity

<div class="alert alert-info">
The <b>granularity</b> of your data is what each observation in your data represents.
</div>

In [None]:
df.head()

### Clicker Question #2

What's the granularity of this dataset? 

- A) Individual-Level
- B) Group-level
- C) Individual-level over time
- D) Group-level over time

## Scope

<div class="alert alert-info">
The <b>scope</b> of your data describes how helpful these data are with respect to our data science question.
</div>

In [None]:
df.shape

### Clicker Question #3

If we wanted to understand something about all COGS 108 students this quarter, could we do that with these data? 

- A) Yes
- B) No
- C) ¯\\\_(ツ)\_/¯

Note: There are currently 315 students enrolled in COGS 108.

## Temporality?

<div class="alert alert-info">
The <b>temporality</b> refers to how the data are situated in time. Specifically, we're interested in what date and time information are included in the dataset.
</div>


Note: The original data included timestamps, but Prof Ellis deleted them so that no one could identify the data from any one student (for example, students who were chatting with one another while filling out the survey, but not sharing their answers, could otherwise match a submission time to a person).

## Faithfulness?

<div class="alert alert-info">
The <b>faithfulness</b> of your data is a determination of how trustworthy the data are.
</div>

First, the data are self-reported. With surveys, there are always concerns about faithfulness.

In [None]:
df.head()

### Clicker Question #4

But wait...let's think about the information in this dataset...what does `enrolled` mean? And why are there a bunch of `NaN` values? Have any idea?

- A) I've got some thoughts.
- B) No ideas here...
- C) What am I supposed to be thinking about?

## Observations to include?

In [None]:
## breakdown of enrolled
df['enrolled'].value_counts()

In [None]:
# only students currently enrolled
df = df[df['enrolled'] == 1]
df.shape

### Clicker Question #5

What? Prof Ellis told me there were 315 students in this class....why do we have *more* observations than students enrolled?

- A) I've got some thoughts.
- B) No ideas here...
- C) What am I supposed to be thinking about?



In [None]:
# remove duplicate values
# you would have to figure out which to drop in the real world
# and check for consistency of responses
df = df.drop_duplicates(subset='ID', keep='last')
df.shape
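
Before dropping duplicates in a real analysis, it helps to look at the repeated IDs and see whether their responses actually agree. The sketch below does this on a made-up `toy` table; the column names mirror the survey, but the values are invented.

```python
import pandas as pd

# Hypothetical responses where some IDs submitted more than once
toy = pd.DataFrame({
    "ID": ["a1", "b2", "a1", "c3", "b2"],
    "statistics": [7, 5, 7, 9, 6],
    "programming": [8, 4, 8, 3, 4],
})

# All rows belonging to an ID that appears more than once
dupes = toy[toy.duplicated(subset="ID", keep=False)].sort_values("ID")

# Number of distinct answers per ID and column; more than 1 means a conflict
n_distinct = dupes.groupby("ID").nunique()
conflicting_ids = n_distinct[(n_distinct > 1).any(axis=1)].index.tolist()

# Keeping the last submission is one possible rule, not the only one
deduped = toy.drop_duplicates(subset="ID", keep="last")

print(dupes)
print(conflicting_ids)
```

IDs with identical repeated rows are safe to collapse. IDs with conflicting answers need a decision you can justify, such as keeping the most recent response.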

## Missing values?

In [None]:
null_rows = df.isnull().any(axis=1)
df[null_rows]

### Clicker Question #6

Why are there so many `NaN`s?

- A) I've got some thoughts.
- B) No ideas here...
- C) What am I supposed to be thinking about?

In [None]:
# we'd expect to drop 36 rows
sum(null_rows)

In [None]:
# drop those who didn't respond to the survey
df = df.dropna().reset_index(drop=True)
df.shape

## Expected values? Wild outliers?

We'll look at each of the variables in the dataset to see if anything looks *off*.

In [None]:
df.describe()

In [None]:
plt.subplot(1, 3, 1)
sns.histplot(df['statistics'], kde=True, color='#DE2D26')
plt.subplot(1, 3, 2)
sns.histplot(df['programming'], kde=True, color='#0CBD18')
plt.subplot(1, 3, 3)
# pass the column as x= (recent seaborn versions treat the first positional argument as `data`)
sns.countplot(x=df['timezone'], color='#940CE8');

## Categorical Columns

Note that we're only focusing on the three columns above, and ignoring `tech_access`, `objectives_pre`, `language`, and `topics`. More cleaning/wrangling would need to be done...I'll show an example of one approach to this for the `language` and `topics` columns now. The idea here is called **one hot encoding** (AKA creating **dummy variables**): each possible response becomes its own column, with a 1 if the student selected it and a 0 otherwise.

In [None]:
lang = (df.language.str.split(r'\s*,\s*', expand=True) # split on commas, trimming whitespace
       .stack() # reshape to one row per (student, language)
       .str.get_dummies() # one-hot encode
       .groupby(level=0).sum()) # collapse back to one row per student

In [None]:
lang
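
To see what each step of that chain does, here is the same idea on three made-up responses. The `responses` Series is hypothetical, and this sketch collapses the stacked rows with `groupby(level=0).sum()`.

```python
import pandas as pd

# Hypothetical comma-separated survey responses
responses = pd.Series(["Python, R", "Java", "Python,Java , SQL"])

# 1. split each response into separate columns, trimming spaces around commas
pieces = responses.str.split(r"\s*,\s*", expand=True)

# 2. stack into one long Series indexed by (respondent, position)
stacked = pieces.stack()

# 3. one indicator column per distinct value
# 4. add the indicators back up within each respondent
one_hot = stacked.str.get_dummies().groupby(level=0).sum()
print(one_hot)
```

Each respondent ends up with a single row of 0s and 1s, one column per language mentioned anywhere in the data.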

In [None]:
# one hot encode topics
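topics = (df.topics.str.split(r'\s*,\s*', expand=True) # split on commas, trimming whitespace
          .stack() # reshape to one row per (student, topic)
          .str.get_dummies() # one-hot encode
          .groupby(level=0).sum()) # collapse back to one row per student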

In [None]:
topics

In [None]:
# join on index (merge on column/series)
# note only joining lang here, not topics
df = df.join(lang, how='outer').reset_index(drop=True)
df

In [None]:
# plot one-hot encoded data
plt.subplot(2, 1, 1)
lang.columns = lang.columns.str.wrap(12) # wrap column names
a = lang.sum()/len(df)
a = a.sort_values(axis=0, ascending=False)
a.plot.bar(color='#686868', rot=0)

plt.subplot(2, 1, 2)
topics.columns = topics.columns.str.wrap(12) # wrap column names
a = topics.sum()/len(df)
a = a.sort_values(axis=0, ascending=False)
a.plot.bar(color='#686868', rot=0)
plt.xticks(size = 8);

### Relationship between these variables?

We can use scatterplots to start to get an understanding for how these values are related to one another.

In [None]:
# scatterplot
sns.lmplot(x='statistics', y='programming', 
           data=df, fit_reg=False, 
           height=6, aspect=2, 
           x_jitter=.5, y_jitter=.5);

So we see a relationship...but there's a lot of variability here. And maybe we're most interested in understanding those who fall further away from the group.

In [None]:
# check for multiple conditions
# be sure each condition in parentheses
df[(df['statistics'] > 9) & (df['programming'] < 6)]

## Relationships Across Multiple Variables

We've looked at histograms of individual variables and at scatterplots of the relationships between them. It's also possible to combine the two in a single matrix plot.

In [None]:
# generate scatter matrix
pd.plotting.scatter_matrix(df[['timezone','statistics','programming']], figsize=(15, 10));

In [None]:
# cooccurrence matrix
lang.T.dot(lang)

In [None]:
# cooccurrence matrix
topics.T.dot(topics)
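
Why does multiplying a one-hot table by its own transpose give co-occurrence counts? A small sketch with a made-up `one_hot` table (three respondents, three languages) shows the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot table: rows are respondents, columns are languages
one_hot = pd.DataFrame({
    "Python": [1, 1, 0],
    "R":      [1, 0, 0],
    "Java":   [0, 1, 1],
})

# Entry (i, j) counts the respondents who selected both i and j
cooc = one_hot.T.dot(one_hot)
print(cooc)

# The diagonal is just how many respondents selected each language
diagonal = np.diag(cooc)
```

Each off-diagonal entry is a sum over respondents of the product of two 0/1 indicators, which is 1 only when both were selected. The matrix is symmetric, so you only need to read one triangle of it.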