## Exploratory Analysis using Pandas, Matplotlib, and Seaborn

In this notebook, we'll start munging some real data and exploring it with various Pandas methods and data visualizations.  


Imports:
```python
import pandas as pd
import numpy as np
import os
```

Paths:
```python
# avoid naming this variable dir: that would shadow a Python built-in
base_dir = r'C:/Users/phwh9568/Workshops/Python_Data_Camp/'
data_dir = os.path.join(base_dir, 'data')
```

Let's read in some data from the EPA's [Environmental Justice Screening and Mapping Tool](https://www.epa.gov/ejscreen).  ```EJscreen_Colorado.csv``` is in the data directory. This is an extract of a nationwide dataset available here: https://www.epa.gov/ejscreen/download-ejscreen-data.  

We're working with data on the census tract level. You will also find a data dictionary in the data folder that explains the variables: ```EJScreen_2024_Tract_Percentiles_Columns.xlsx```

We'll use this data to make use of Pandas' data munging/manipulating/analyzing capabilities.  

Read it in as a variable:  
```python
data = pd.read_csv(os.path.join(data_dir,'EJscreen_Colorado.csv'))
```

Quick note on field data types... Pandas will guess/assume column data types, and usually this is fine. But not always!  

You can explicitly declare a column data type on read using the ```dtype``` parameter of ```.read_csv()```.  

This is me telling you there is a messed-up column here: ```ID```. Reimport the data, setting ```ID``` to string.
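Here is one possible way to do it, sketched on a tiny made-up CSV held in memory rather than the real file. It assumes the trouble with ```ID``` is a lost leading zero (Colorado tract IDs begin with the state FIPS code ```08```); check the actual column before relying on that.

```python
import io
import pandas as pd

# Made-up two-row sample standing in for EJscreen_Colorado.csv
sample_csv = "ID,OZONE\n08001007801,61.2\n08001007802,60.8\n"

# Default read: pandas infers ID as an integer, so the leading zero is lost
guessed = pd.read_csv(io.StringIO(sample_csv))

# Explicit dtype: ID stays a string, exactly as written in the file
typed = pd.read_csv(io.StringIO(sample_csv), dtype={'ID': str})

print(guessed['ID'].iloc[0])
print(typed['ID'].iloc[0])
```

For the real file, the same ```dtype={'ID': str}``` argument would go into the ```pd.read_csv()``` call shown earlier.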

It's too big to view the whole thing, but we can use various methods to get a feel for the data set...  

Let's explore:  

```python
data.head()
data.tail()
data.columns
data.describe()
```

```python
data.columns
```

There's a fair amount of data we don't need right now, so let's split off the demographic information we're interested in along with the environmental variables.  

Let's make a list of the columns we want:
```python
environmental = list(data.columns[-14:])
```

And, we'll manually create a list of the demographic variables we want to include:  
```python
demographics = ['ID','PEOPCOLORPCT', 'LOWINCPCT', 'LIFEEXPPCT', 'LINGISOPCT', 'DISABILITYPCT']
```

Now, combine these two lists, then use the combined list to split off a new dataframe, ```ejdata```, containing only those columns...

How do we do this?  
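If you get stuck, here is a minimal sketch of one approach. It runs on a toy frame with made-up values and shortened column lists; the name ```ejdata``` is taken from the merge step later in this notebook.

```python
import pandas as pd

# Toy frame standing in for the full EJScreen table (values are made up)
data = pd.DataFrame({
    'ID': ['08001007801', '08001007802'],
    'STATE_NAME': ['Colorado', 'Colorado'],
    'LOWINCPCT': [0.31, 0.12],
    'OZONE': [61.2, 60.8],
    'PM25': [7.1, 6.9],
})

# Shortened versions of the two lists built above
demographics = ['ID', 'LOWINCPCT']
environmental = ['OZONE', 'PM25']

# + concatenates two lists into one list of column names
columns = demographics + environmental

# Indexing with a list of names keeps only those columns;
# .copy() makes ejdata independent of the original frame
ejdata = data[columns].copy()
print(ejdata.columns.tolist())
```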

Let's join another dataset to the ejscreen data that will tell us if an area is urban or rural.  

```python
urban = pd.read_csv(os.path.join(data_dir,'Colorado_Tracts_Urban.csv'), dtype={'GEOID':str})
```

Check values of the ```UATYPE20``` column using ```.unique()```

Now, replace NaN values in ```UATYPE20``` with 'R' for rural. Do you remember how?
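As a reminder, the two steps might look something like this. The frame below is a made-up stand-in for ```Colorado_Tracts_Urban.csv```, and the 'U' code is an assumption; check what ```.unique()``` reports on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Colorado_Tracts_Urban.csv (values are made up)
urban = pd.DataFrame({
    'GEOID': ['08001007801', '08001007802', '08003960100'],
    'UATYPE20': ['U', np.nan, np.nan],
})

# .unique() lists each distinct value in the column, including NaN
print(urban['UATYPE20'].unique())

# .fillna() swaps every missing value for 'R'
urban['UATYPE20'] = urban['UATYPE20'].fillna('R')
print(urban['UATYPE20'].unique())
```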

Now merge:  
```python
ejdata = ejdata.merge(urban, left_on='ID', right_on='GEOID')
```

Great! Now, export this data to a csv:  

```python
ejdata.to_csv(os.path.join(data_dir,'ejdata_urban.csv'))
```

Let's explore this data some. **MAYBE** we can find some relationships?  

Start by using ```.sort_values()``` to see census tracts with high ozone levels. 
```python
ejdata.sort_values(by=['OZONE'], ascending=False, inplace=True)
```

You might find it useful to add some color to your table... 
```python
ejdata[['LIFEEXPPCT','OZONE', 'PTRAF', 'PM25']].style.background_gradient()
```

Here's some more info on styling Pandas tables: https://pandas.pydata.org/docs/user_guide/style.html

### Let's get into visualization.  

There are a TON of Python data visualization packages. We'll touch on three big ones:  
1. matplotlib
2. seaborn
3. plotly  

We'll start with matplotlib:

```python
import matplotlib.pyplot as plt
```

Let's start by generating some scatterplots.  

We'll start with the standard matplotlib approach:  

Here are the docs: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html. 

```python
fig, ax = plt.subplots()

x = ejdata['LIFEEXPPCT']
y = ejdata['PTRAF']

ax.scatter(x,y)
```

Check the docs and modify... Change the figure size, marker color, marker size, set axis labels. 
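One way those tweaks could look is sketched below, using a few made-up values in place of the real ```ejdata```. The particular size, color, and label text are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values standing in for ejdata (made up)
ejdata = pd.DataFrame({
    'LIFEEXPPCT': [0.18, 0.21, 0.25, 0.16],
    'PTRAF': [120.0, 340.0, 560.0, 90.0],
})

# figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(8, 5))

# c sets the marker color, s the marker size (in points squared)
ax.scatter(ejdata['LIFEEXPPCT'], ejdata['PTRAF'], c='darkorange', s=40)

# Axis labels are set on the Axes object
ax.set_xlabel('LIFEEXPPCT')
ax.set_ylabel('PTRAF')
```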

Pandas X matplotlib...  

Pandas does have some built in matplotlib functionality.  For example, https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html. 

```python
ejdata.plot.scatter(x='LIFEEXPPCT',y='PTRAF')
```

Definitely easier! But sometimes not as fully customizable.  

Although matplotlib is (maybe?) the standard bearer, there are several other very good visualization libraries that can be used as alternatives or alongside matplotlib.  

### Let's check out [Seaborn](https://seaborn.pydata.org/)  

```python
import seaborn as sns
```

Let's start again with our scatterplot.  

```python
sns.scatterplot(x=ejdata['LIFEEXPPCT'], y=ejdata['PTRAF'])
```

Check the docs, then recreate the above plot using the ```hue``` parameter to style it according to urban and rural areas:
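A rough sketch of the idea, on made-up values. It assumes ```UATYPE20``` holds 'U' for urban and 'R' for rural after the fill step earlier, and it passes the frame through ```data=``` so columns can be referred to by name.

```python
import pandas as pd
import seaborn as sns

# Toy values standing in for ejdata (made up)
ejdata = pd.DataFrame({
    'LIFEEXPPCT': [0.18, 0.21, 0.25, 0.16],
    'PTRAF': [120.0, 340.0, 60.0, 90.0],
    'UATYPE20': ['U', 'U', 'R', 'R'],
})

# hue colors each point by the category in that column
# and adds a legend mapping colors to categories
ax = sns.scatterplot(data=ejdata, x='LIFEEXPPCT', y='PTRAF', hue='UATYPE20')
```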

Is there something here? Worth exploring? Maybe...  

Let's split off just the urban census tracts. Do you remember how to use ```.loc```?
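In case a refresher helps, here is a minimal sketch on made-up rows. It assumes urban tracts are coded 'U' in ```UATYPE20```; the name ```ejUrban``` matches the variable used in the PairGrid cell that follows.

```python
import pandas as pd

# Toy values standing in for ejdata (made up)
ejdata = pd.DataFrame({
    'ID': ['08001007801', '08003960100', '08005004951'],
    'OZONE': [61.2, 58.4, 62.0],
    'UATYPE20': ['U', 'R', 'U'],
})

# Boolean mask: True where the tract is urban
is_urban = ejdata['UATYPE20'] == 'U'

# .loc[mask] keeps only the rows where the mask is True
ejUrban = ejdata.loc[is_urban].copy()
print(len(ejUrban))
```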

I guess we could keep futzing about plugging in various variables into our scatterplots...  

But why not produce them all at once? 

Let's check out ```PairGrid```:
```python
g = sns.PairGrid(ejUrban) #or sns.pairplot(ejUrban)
g.map(sns.scatterplot)
```

Okay, well, we have some unclear results... let's run a correlation:  
```python
ejdata.corr(numeric_only=True)
```

Let's visualize that using [```sns.heatmap()```](https://seaborn.pydata.org/generated/seaborn.heatmap.html).  

Check the docs.
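As a starting point, a heatmap of a correlation matrix might look like the sketch below. The data is made up, and the ```annot```, ```cmap```, ```vmin```, and ```vmax``` settings are just one reasonable set of choices.

```python
import pandas as pd
import seaborn as sns

# Toy values standing in for ejdata (made up)
ejdata = pd.DataFrame({
    'LIFEEXPPCT': [0.18, 0.21, 0.25, 0.16, 0.20],
    'OZONE': [61.2, 60.8, 58.4, 62.0, 59.9],
    'PM25': [7.1, 6.9, 5.2, 7.4, 6.0],
})

# Pairwise correlation between the numeric columns
corr = ejdata.corr(numeric_only=True)

# annot writes each coefficient in its cell;
# vmin/vmax pin the color scale to the full -1 to 1 range
ax = sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
```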

Oh well, I tried! Let's get some different data!  

There is a built-in dataset in Seaborn:
```python
iris = sns.load_dataset('iris')
```

Take a peek... 

```python
iris.shape
```

```python
iris.columns
```

Start with a basic scatterplot using petal length on the x axis and petal width on the y axis.  

Once you've got that, color the marker points by species.  

Try to make a [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html)!

Make a [barplot](https://seaborn.pydata.org/generated/seaborn.barplot) that shows petal length by species! Make them different colors!
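If you want something to compare against, here is a hedged sketch of both plots. To keep it runnable offline it uses a few made-up rows with the same column names as ```iris``` instead of calling ```sns.load_dataset()```; swap in the real ```iris``` frame when you try it.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that mimic the iris column names
iris_like = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.5, 4.5, 4.7, 5.8, 6.0],
    'petal_width': [0.2, 0.2, 1.5, 1.4, 2.2, 2.5],
})

# Boxplot: distribution of petal length within each species
fig1, ax1 = plt.subplots()
sns.boxplot(data=iris_like, x='species', y='petal_length', ax=ax1)

# Barplot: mean petal length per species;
# hue='species' gives each species its own color
fig2, ax2 = plt.subplots()
sns.barplot(data=iris_like, x='species', y='petal_length',
            hue='species', ax=ax2)
```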

Let's do some more advanced plotting. Say you want to make a figure for your paper... 

We'll combine charts using matplotlib.  Let's stack these three on top of one another...  

Start by reviewing the ```pyplot.subplots()``` documentation: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html 

We'll start with this basic code then modify:
```python
fig, ax = plt.subplots()
```

Okay, if we want these stacked vertically, we need to set the ```nrows``` parameter.  
How many rows do we need?

Okay, now we've got 3 empty charts. Let's start by populating the charts.  

Review the sns.barplot docs: https://seaborn.pydata.org/generated/seaborn.barplot 

What does the ax parameter do?  

What happens if we just run our ```ax``` variable? 

Okay, there are 3 things there... how do we select one of them?  
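A quick sketch of what is going on: with more than one subplot, ```plt.subplots()``` returns the axes as a NumPy array, so the usual square-bracket indexing applies. The title text here is only for illustration.

```python
import matplotlib.pyplot as plt

# Three rows, one column of empty charts
fig, ax = plt.subplots(nrows=3)

# ax is a NumPy array holding one Axes object per chart
print(type(ax), ax.shape)

# Index it like any array: ax[0] is the top chart, ax[2] the bottom
top = ax[0]
top.set_title('top chart')
```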

Okay, how can we use this info and apply it to the ```sns.barplot()``` ```ax``` parameter?

Populate the remaining axes with the boxplot and scatterplot.
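Putting the pieces together, the stacked figure could be sketched roughly as follows. It again uses made-up rows with iris-style column names so it runs without a download, and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that mimic the iris column names
iris_like = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor',
                'versicolor', 'virginica', 'virginica'],
    'petal_length': [1.4, 1.5, 4.5, 4.7, 5.8, 6.0],
    'petal_width': [0.2, 0.2, 1.5, 1.4, 2.2, 2.5],
})

fig, ax = plt.subplots(nrows=3, figsize=(6, 10))

# The ax= argument tells each seaborn function which chart to draw on
sns.barplot(data=iris_like, x='species', y='petal_length', ax=ax[0])
sns.boxplot(data=iris_like, x='species', y='petal_length', ax=ax[1])
sns.scatterplot(data=iris_like, x='petal_length', y='petal_width',
                hue='species', ax=ax[2])

# tight_layout adjusts spacing so stacked labels do not overlap
fig.tight_layout()
```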

Make it pretty.