### tools

folium, pivottablejs, missingno, pandas_profiling, mpld3

I. Why EDA

A. For the analyst

* Lets us identify patterns and develop hypotheses
* Tests technical assumptions
* Builds an intuition for the data
    * So that when you have an idea about the data, you can tell whether it is a mistake

B. For the consumer

* Ensures the right question is being asked
    * Tests business assumptions
* Leads to insights that otherwise would not be found
* Provides the context needed to get the most value from the results

II. Themes

* You are never done with EDA. Every iteration of the analysis revisits it.
* Stay open-minded: test the assumptions you might have
* Redo EDA even on datasets you have already explored

III. Process

A. Prepare

1. Form hypotheses
2. Wrangle data
3. Assess data quality and profile
4. Explore each variable in the dataset

B. Analyze

5. Assess the relationship between each variable and the target (the thing of interest, e.g. why people churn on your website)
6. Assess interactions between variables
7. Explore data across many dimensions

This feels linear, but each of these steps has branches, so expect to come back to earlier steps.

### approaches

* Before plotting or joining, have a question or hypothesis in mind.  Write it down.  
* Draw a plot of what you want to see on paper to sketch the idea.
* How do you know you are not fooling yourself?
    * What else can you check that it's actually true?
    * What evidence could there be that it's wrong?
    
* What is the unit economics of the question? JK

How do we answer the question of whether referees give more red cards to dark skinned players?
  > More likely to give a yellow or no card for the same offense under the same conditions?
  
Potential issues: 

* Conflicting data: how should the skin tone ratings from rater 1 and rater 2 be combined?
* Imbalanced data: are red cards very rare?
* Biased data: do players have different amounts of playing time?
    * Perhaps unit economics helps with this
* How do I know if I accounted for all forms of confounding?
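
One way to read the "unit economics" idea is to compare rates rather than raw counts, so that players with more playing time do not dominate. The sketch below assumes hypothetical column names (`skintone_group`, `games`, `red_cards`) and made-up numbers; it is not the tutorial's own analysis.

```python
import pandas as pd

# Hypothetical player-level totals; column names and values are made up for illustration.
players = pd.DataFrame({
    "skintone_group": ["light", "light", "dark", "dark"],
    "games": [100, 20, 50, 10],
    "red_cards": [2, 0, 1, 1],
})

# Raw counts favour whoever played the most, so sum first and then compare rates.
totals = players.groupby("skintone_group")[["games", "red_cards"]].sum()
totals["red_cards_per_game"] = totals["red_cards"] / totals["games"]
print(totals)
```

A rate like this only adjusts for exposure; it does not by itself rule out other confounders.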

Pandas

1. Look at the shape of the data.
    * Is there a data dictionary? Start by understanding it.
    * **Pandas commands**
        * `df.shape`: the number of rows and columns (an attribute, so no parentheses)
        * `df.columns.to_list()`: the column names
        * `df.head()`: the first few rows
        * `df.describe().T`: count, mean, std, min, etc.; `.T` transposes the result so each column becomes a row
        * `df.dtypes`: the data type of each column
        * `df.info()`: the column types together with non-null counts
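
The first-look commands above can be tried on a toy frame. The columns here are made up for illustration; any DataFrame works the same way.

```python
import pandas as pd

# Toy stand-in for a real dataset; the columns are made up for illustration.
df = pd.DataFrame({
    "player": ["a", "b", "c"],
    "height": [180.0, 175.5, None],
    "games": [10, 12, 7],
})

n_rows, n_cols = df.shape      # attribute, no parentheses
print(n_rows, n_cols)
print(df.columns.to_list())
print(df.head(2))              # first two rows
print(df.describe().T)         # one row per numeric column
print(df.dtypes)
df.info()                      # prints dtypes and non-null counts
```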

### Tidy data

* Each variable forms its own column
* Each observation forms a row
* Each type of observational unit forms a table

1. Make a players table
    * The presenter does this with a `groupby` on the index, such as the player's name
        * Check that the data is consistent across each player's rows with `agg('unique')`
    * Keep what is unique to each player
        * Height, weight, name, position, rater1, rater2

* Once we have the players table, save it with `to_csv` and confirm it saved correctly by loading it back with `read_csv`

* Rename columns with `countries.rename(columns={'old_name': 'new_name'})`, which returns a new DataFrame rather than modifying `countries` in place
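
A minimal sketch of splitting out a players table, assuming a hypothetical `dyads` frame with one row per player-referee pair. It uses `nunique()` for the consistency check instead of `agg('unique')`, and an in-memory buffer stands in for the CSV file.

```python
import io

import pandas as pd

# Hypothetical player-referee rows: player attributes repeat on every row.
dyads = pd.DataFrame({
    "player": ["ann", "ann", "bob", "bob"],
    "height": [170, 170, 182, 182],
    "position": ["GK", "GK", "CB", "CB"],
    "referee": [1, 2, 1, 3],
    "red_cards": [0, 1, 0, 0],
})
player_cols = ["height", "position"]

# Consistency check: each player should have exactly one value per attribute.
distinct_per_player = dyads.groupby("player")[player_cols].nunique()
assert (distinct_per_player == 1).all().all()

# One row per player, keeping only the player-level columns.
players = dyads.groupby("player")[player_cols].first()

# Round-trip through CSV text to confirm the table survives saving and loading.
buffer = io.StringIO()
players.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="player")
print(reloaded)
```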

### Now working with one table

1. Handle Missing Data
2. Combining/removing columns
3. Distributions 
4. Correlations

    * Is the data imbalanced, i.e. is one outcome (such as a red card) far rarer than the others?

### Duplicate data 

* `nunique()`: the number of distinct values in each column
* `drop_duplicates()`: removes rows that repeat an earlier row
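
A small sketch of both calls on made-up data, with `duplicated()` added to count the repeated rows before dropping them:

```python
import pandas as pd

# Made-up rows; the second row repeats the first.
df = pd.DataFrame({
    "player": ["ann", "ann", "bob", "cat"],
    "club": ["x", "x", "y", "y"],
})

print(df.nunique())                # distinct values per column
n_dupes = df.duplicated().sum()    # rows identical to an earlier row
deduped = df.drop_duplicates()     # returns a new frame; df is unchanged
print(n_dupes, len(df), len(deduped))
```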

### Explore missing data 

Is there a systematic reason the data is missing, or is it missing at random?

* E.g. are certain fields always missing at the same time?
* E.g. is data missing for a whole group of people?

* The `missingno` library can help
    * To see the amount of data missing in each column
    * To see the correlation of missingness between columns
    * To draw a heat map of how correlated the missing data is

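A. Start with a count by column

* Why are some entire categories empty?
* Why do some columns have far more missing data than others?

```python
import missingno as msno

# time_slice is not a pandas or missingno function; presumably a helper from the tutorial
recent = time_slice(data, '2013-2017')
msno.bar(recent, labels=True)
```

B. Then move on to the matrix: `msno.matrix(recent, labels=True)`

* The matrix shows where each column is missing row by row, so columns that go missing together stand out; `msno.heatmap(recent)` puts a number on that correlation
* Is there a pattern of missingness, as opposed to data missing at random?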

Some categories are missing entirely, so look into them. Did that data get mixed into other categories or datasets?

* Select, say, all countries in North America, then call `msno.nullity_sort` to sort the rows by how much data is missing.
  * Take a look at what Downey says about removing missing data
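
The same questions can be checked in plain pandas when the plots are not available. This is a sketch on made-up data, not the `missingno` implementation: it counts missing values per column and correlates the missingness indicators between columns.

```python
import numpy as np
import pandas as pd

# Made-up data: "gdp" and "debt" are missing on the same rows, "pop" is complete.
data = pd.DataFrame({
    "gdp": [1.0, np.nan, 3.0, np.nan, 5.0],
    "debt": [0.5, np.nan, 0.7, np.nan, 0.9],
    "pop": [10, 20, 30, 40, 50],
})

is_missing = data.isnull()

# Number of missing values in each column.
missing_per_column = is_missing.sum()
print(missing_per_column)

# Correlate missingness only for columns that are partly (not never, not always) missing.
partial = is_missing.loc[:, is_missing.any() & ~is_missing.all()]
nullity_corr = partial.astype(int).corr()
print(nullity_corr)
```

A correlation near 1 between two columns means they tend to be missing on the same rows, which points to a systematic cause.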

### Backlog for missing data 

* Assess the prevalence of missing data across all data fields, assess whether it is missing at random or systematically, and identify patterns in when such data is missing
* Identify any default values that imply missing data for a given field
* Determine sampling strategy for quality assessment and initial EDA
* For datetime data types, ensure consistent formatting and granularity of data, and perform sanity checks on all dates present in the data.
* In cases where multiple fields capture the same or similar information, understand the relationships between them and assess the most effective field to use
* Assess data type of each field
* For discrete value types, ensure data formats are consistent
* For discrete value types, assess number of distinct values and percent unique and do sanity check on types of answers
* For continuous data types, assess descriptive statistics and perform sanity check on values 
* Understand relationships between timestamps and assess which to use in analysis
* Slice data by device type, operating system, software version and ensure consistency in data across slices
* For device or app data, identify version release dates and assess data for any changes in format or value around those dates
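
Two of the backlog items (default values that imply missing data, and sanity checks on dates) can be sketched as follows. The placeholder values `-999` and `"unknown"` and all column names are assumptions for illustration.

```python
import pandas as pd

# Made-up records; -999 and "unknown" are assumed placeholders for missing values.
raw = pd.DataFrame({
    "age": [34, -999, 27, -999],
    "device": ["ios", "unknown", "android", "ios"],
    "signup": ["2017-01-05", "2017-02-30", "2016-12-31", "2017-03-01"],
})

# How often does each suspected default value occur?
print((raw["age"] == -999).sum(), (raw["device"] == "unknown").sum())

# Turn the placeholders into real missing values so they stop skewing statistics.
clean = raw.copy()
clean["age"] = raw["age"].mask(raw["age"] == -999)
clean["device"] = raw["device"].mask(raw["device"] == "unknown")

# Parse dates; impossible dates such as 2017-02-30 become NaT instead of raising.
clean["signup"] = pd.to_datetime(clean["signup"], format="%Y-%m-%d", errors="coerce")
print(clean["signup"].min(), clean["signup"].max(), clean["signup"].isnull().sum())
```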

#### Taking action on missing data 

* Count the missing values: `data.isnull().values.sum()`

1. Ignore the missing data
2. Remove the missing data
    - Drop every row that contains at least one NaN: `data.dropna()`
    - Drop only the rows where every value is NaN: `data.dropna(how="all")`
    - Use a threshold: `data.dropna(thresh=n)` keeps only rows with at least `n` non-NaN values
3. Fill in the missing data from neighbouring values
    - Forward fill carries the last valid value forward: `data.ffill()` (older code uses `fillna(method="ffill")`, which recent pandas versions deprecate)
    - Back fill pulls the next valid value backward: `data.bfill()` (older code uses `fillna(method="bfill")`)
4. Replace missing data with a static number: `data.fillna(value=9999)`
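
The options above, run side by side on a small made-up frame so the differences are visible:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 2 is entirely missing, row 1 is partly missing.
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

print(data.isnull().values.sum())      # total number of missing cells

dropped_any = data.dropna()            # rows with any NaN removed
dropped_all = data.dropna(how="all")   # only fully empty rows removed
kept_thresh = data.dropna(thresh=2)    # rows need at least 2 non-NaN values

forward = data.ffill()                 # carry the last valid value forward
backward = data.bfill()                # pull the next valid value backward
static = data.fillna(value=9999)       # replace missing cells with a fixed number
print(forward, backward, static, sep="\n")
```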



### Slicing data

1. Cross section - one time period across all of the dataset 
2. Time series - a single country over time 
3. Geospatial - grouping similar data (by feature), eg. grouping by region
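
The three kinds of slice can be sketched on a made-up country-year table; the column names are assumptions for illustration.

```python
import pandas as pd

# Made-up country-year panel.
data = pd.DataFrame({
    "country": ["US", "US", "CA", "CA", "FR", "FR"],
    "region": ["North America"] * 4 + ["Europe"] * 2,
    "year": [2016, 2017, 2016, 2017, 2016, 2017],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

cross_section = data[data["year"] == 2017]                       # all countries, one year
time_series = data[data["country"] == "US"].sort_values("year")  # one country, all years
by_region = data.groupby("region")["value"].mean()               # grouped by a shared feature
print(cross_section, time_series, by_region, sep="\n")
```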

### Histogram of data

* `Series.value_counts()` gives the count of each distinct value, which is a histogram in table form
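* Create bins for continuous variables
    * `pd.qcut(players['weight'], len(weight_categories), weight_categories)`, where `weight_categories` is the list of bin labels

To get a distribution:

* `sns.histplot(players.age_years, kde=True)` (the older `sns.distplot` is deprecated in recent seaborn versions)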
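
A sketch of counting and binning without any plotting, using made-up players and hypothetical label names:

```python
import pandas as pd

# Made-up players.
players = pd.DataFrame({
    "position": ["GK", "CB", "CB", "ST", "ST", "ST", "CM", "CM"],
    "weight": [60, 65, 70, 75, 80, 85, 90, 95],
})

# Frequency of each category: a histogram in table form.
position_counts = players["position"].value_counts()
print(position_counts)

# Quantile bins: each label receives roughly the same number of players.
weight_categories = ["low", "mid_low", "mid_high", "high"]
players["weightclass"] = pd.qcut(
    players["weight"], len(weight_categories), labels=weight_categories
)
class_counts = players["weightclass"].value_counts(sort=False)
print(class_counts)
```

`pd.qcut` splits on quantiles, so bin widths vary; `pd.cut` is the alternative when equal-width bins are wanted.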

### See correlation between data

For example, how much do the two ratings differ?

```python
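import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
fig, ax = plt.subplots()
# Cross-tabulate the two ratings: each cell counts how often that pair of scores occurs
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()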
```

We can see that the two ratings never differ by more than 2 categories.

Since the ratings are so similar, we can combine the two columns by taking their mean.
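
The agreement check and the averaging step can be sketched without plotting. The rating values below are made up, and `skintone` is used as the name of the combined column.

```python
import pandas as pd

# Made-up ratings from two raters for five players.
players = pd.DataFrame({
    "rater1": [0.0, 0.25, 0.5, 0.25, 1.0],
    "rater2": [0.0, 0.25, 0.75, 0.25, 1.0],
})

# Counts of each (rater1, rater2) pair; agreement sits on the diagonal.
agreement = pd.crosstab(players.rater1, players.rater2)
print(agreement)

# Largest disagreement between the two raters.
max_gap = (players.rater1 - players.rater2).abs().max()
print(max_gap)

# Combine the two ratings into one column by averaging them row by row.
players["skintone"] = players[["rater1", "rater2"]].mean(axis=1)
```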

### Now that we have cleaned data, time to explore

* We can now look at the distribution of red cards across skin tone

Check whether there are any lurking variables:

1. Is skintone correlated with anything else?
2. Are red cards correlated with anything else?

Scatter matrix: plots every pair of the selected variables against each other.

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # moved here from pandas.tools.plotting

fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);
```

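* Note that on the diagonal each variable would be plotted against itself, which is always a 1-to-1 line, so the presenter passes `diagonal='hist'` to show each variable's histogram there instead.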

We can also compare just two of the attributes.


```python
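import matplotlib.pyplot as plt
import seaborn as sns
MIDSIZE = (12, 8)  # assumed value; MIDSIZE is not defined in these notes
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot(x='weight', y='height', data=players, ax=ax)  # recent seaborn needs x= and y= keywords
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()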
```

For correlation, `df.corr()` gives the pairwise correlation between all numeric columns (recent pandas versions need `numeric_only=True` if the frame also has non-numeric columns).
Because `corr()` returns a DataFrame, we can call `describe()` on it to see where most of the correlation occurs.
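
A sketch of both calls on made-up data, plus one way (an addition, not from the tutorial) to rank the variable pairs by strength of correlation:

```python
import pandas as pd

# Made-up players with one non-numeric column.
players = pd.DataFrame({
    "height": [170, 175, 180, 185, 190],
    "weight": [65, 70, 76, 80, 88],
    "skintone": [0.25, 0.0, 0.5, 0.25, 0.75],
    "position": ["GK", "CB", "ST", "CM", "CB"],
})

corr = players.corr(numeric_only=True)   # skips the non-numeric "position" column
print(corr)
print(corr.describe())                   # summary of each column of correlations

# Rank variable pairs by absolute correlation, keeping each pair once
# and dropping the self-correlations on the diagonal.
pairs = corr.abs().unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second]
ranked = pairs.sort_values(ascending=False)
print(ranked)
```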

* `pandas_profiling` gives a high-level overview of a whole table: `pandas_profiling.ProfileReport(players)`
    * https://github.com/JosPolfliet/pandas-profiling

### Further cleaning
* Keep a list of the subset of columns you want; once the table is cleaned, profile it again

### Next Steps
* Perform the same tasks on the referee, club, and country dataframes
    1. Handle missing data
    2. Combine or remove columns
    3. Distributions
    4. Correlations
* Do the final red card joins to run the analysis

### Resources
> The dataset accompanies the FiveThirtyEight article "Science Isn't Broken"
> * [EDA](https://www.youtube.com/watch?v=W5WE9Db2RLU)
> * [Handling Missing Data](https://www.youtube.com/watch?v=O5v4NrSCw_A&list=PLQVvvaa0QuDc-3szzjeP6N6b0aDrrKyL-&index=10)
> * [Missing data](https://github.com/ResidentMario/missingno)
> * [Quandl dataset](https://www.quandl.com/docs-and-help)