---
Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk  
Spring 2023

---
# 03: Data Exploration - Part 2 (in Python)
*Environmental Data Analytics / John Fay*<br>
*Spring 2021*

## Objectives
1. Import and explore datasets in Python
2. Graphically explore datasets in Python
3. Apply data exploration skills to a real-world example dataset

## Import and View Summaries

As in R, we often begin our scripts by importing whatever packages we need. The "Pandas" package is a Python data analytics library. 

In [None]:
#Load packages
import pandas as pd 

In [None]:
#Read the USGS dataset into a dataframe object
USGSFlowData = pd.read_csv("./data/Processed/USGS_Site02085000_Flow_Processed.csv")

#View the first 5 records
USGSFlowData.head()

Examine characteristics of our dataframe...

In [None]:
#Display the data type of our USGSFlowData object
type(USGSFlowData) #"class" in R

In [None]:
#Reveal the column names 
USGSFlowData.columns #"colnames" in R

In [None]:
#Reveal the structure of our dataframe
USGSFlowData.info() #"str" in R

In [None]:
#Reveal the dimensions of our dataframe
USGSFlowData.shape #"dim" in R

In [None]:
#Count the number of records in each approval category
USGSFlowData['discharge_mean_approval'].value_counts() #"table" in R
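
For a numeric overview comparable to R's `summary`, pandas offers `describe()`. The sketch below is only an illustration: the `toy` dataframe and its values are made up and stand in for the USGS data. It also shows that `value_counts()` skips missing values unless `dropna=False` is passed.

```python
import pandas as pd

# Hypothetical stand-in for the USGS data (values are made up for illustration)
toy = pd.DataFrame({
    "discharge_mean": [10.0, 12.5, None, 30.0],
    "discharge_mean_approval": ["A", "A", "P", None],
})

# Numeric summary (count, mean, std, quartiles), similar in spirit to R's summary()
numeric_summary = toy["discharge_mean"].describe()

# Counts per category; dropna=False also counts the missing values
approval_counts = toy["discharge_mean_approval"].value_counts(dropna=False)

print(numeric_summary)
print(approval_counts)
```

Note that `describe()` ignores missing values, so its `count` can be smaller than the number of rows.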

Check our date column and set it to a datetime object

In [None]:
#Reveal datatype of the datetime column
USGSFlowData['datetime'].dtype

In [None]:
#Change it to a proper datetime object
USGSFlowData['datetime'] = pd.to_datetime(USGSFlowData['datetime'])

In [None]:
#Confirm the new datatype of the datetime column
USGSFlowData['datetime'].dtype
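
The conversion above can be tried on a small example. This is a minimal sketch using a made-up `raw` dataframe, not the USGS file; the `errors="coerce"` argument is an addition that is not used in the lesson code, included here to show one way of handling entries that cannot be parsed.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately malformed
raw = pd.DataFrame({"datetime": ["2010-01-01", "2010-01-02", "not a date"]})

# Before conversion, the column holds plain strings, not datetimes
print(raw["datetime"].dtype)

# errors="coerce" turns unparseable entries into NaT instead of raising an error
raw["datetime"] = pd.to_datetime(raw["datetime"], errors="coerce")
print(raw["datetime"].dtype)

# Once converted, the .dt accessor exposes date parts such as the year
raw["year"] = raw["datetime"].dt.year

# Count how many entries failed to parse
n_bad = raw["datetime"].isna().sum()
print(n_bad)
```

Without `errors="coerce"`, `pd.to_datetime` raises an error on the malformed entry, which is often the safer default when the data are expected to be clean.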

## Data Visualization

### Data visualization packages
Python has many data visualization packages. `Pandas` itself can produce quick visualizations. `Matplotlib` has perhaps been around the longest and mimics the syntax used in MATLAB. More powerful options include `Plotly`, `Dash`, and `Bokeh`, which support interactive plots. For those with experience in R's *ggplot*, however, a package called `plotnine` lets us mimic that syntax.

### Plotting with Pandas

In [None]:
#Bar plot - need to run the "value_counts" to compute # of records in each group
USGSFlowData['discharge_mean_approval'].value_counts().plot(kind='bar');

In [None]:
#Histogram of mean discharge
USGSFlowData['discharge_mean'].plot(kind='hist',bins=10);

In [None]:
#Line plot
date_mask = USGSFlowData['datetime']>pd.to_datetime("2010/01/01")
USGSFlowData[date_mask].plot(
    kind='line',
    x='datetime',
    y='gage_height_mean',
    legend=False,
    figsize=(20,3)
);
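
The `date_mask` used in the line plot is a boolean Series with one True/False value per row, and indexing the dataframe with it keeps only the True rows. The sketch below shows the idea on a made-up `toy` dataframe (hypothetical values, no plotting) and how two conditions might be combined to select a single year.

```python
import pandas as pd

# Hypothetical daily records (values are made up for illustration)
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2009-12-30", "2010-01-01", "2010-01-02", "2011-06-15"]
    ),
    "gage_height_mean": [2.1, 2.4, 2.3, 3.0],
})

# A boolean mask: True where the date is strictly after 1 Jan 2010
date_mask = toy["datetime"] > pd.to_datetime("2010/01/01")

# Masks combine with & (and) and | (or); each condition needs parentheses
year_2010 = (toy["datetime"] >= "2010-01-01") & (toy["datetime"] < "2011-01-01")

# Keep only the rows where the mask is True
after_start = toy[date_mask]
only_2010 = toy.loc[year_2010]

print(after_start)
print(only_2010)
```

Because the first mask uses `>` rather than `>=`, the record dated exactly 2010-01-01 is excluded from `after_start`.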

### Plotting with Plotnine/ggplot

In [None]:
#Import plotnine (install it first if needed)
try: 
    from plotnine import *
except ImportError:
    !pip install plotnine
    from plotnine import *

In [None]:
#Bar plot of the number of records in each approval category
(   ggplot(USGSFlowData, aes(x='discharge_mean_approval')) + 
    geom_bar()
)

In [None]:
#Histogram of mean discharge, in bins 10 units wide
(ggplot(USGSFlowData) +
  geom_histogram(aes(x = 'discharge_mean'), binwidth = 10))

In [None]:
#Frequency polygons of mean, min, and max gage height, limited to 0-10 on the x axis
(ggplot(USGSFlowData) +
  geom_freqpoly(aes(x = 'gage_height_mean'), bins = 50) +
  geom_freqpoly(aes(x = 'gage_height_min'), bins = 50, color = "darkgray") +
  geom_freqpoly(aes(x = 'gage_height_max'), bins = 50, linetype = 'dashed') +
  scale_x_continuous(limits = (0, 10)))

In [None]:
#Line plot of mean discharge over time, saved to a variable and then displayed
myPlot = (
    ggplot(USGSFlowData, aes(x='datetime',y='discharge_mean')) + 
    geom_line(color='blue',size=0.25) +
    theme_xkcd()
)
myPlot