---
title: "3: Data Exploration (1)"  
author: "Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk"   
date: "Spring 2021"   
---

# 3. Data Exploration

## LESSON OBJECTIVES
1. Set up a data analysis session in Jupyter
2. Import and explore datasets in Python
3. Apply data exploration skills to a real-world example dataset

A handy reference comparing Pandas with R: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html

## BEST PRACTICES FOR PYTHON/JUPYTER

In many situations in data analytics, you may be expected to work from multiple computers or share projects among multiple users. A few general best practices will avoid common pitfalls related to collaborative work. 

### Relative paths in Jupyter notebooks

Jupyter notebooks can use absolute or relative paths, but relative paths are more robust and should be used where possible. Relative paths are resolved against the notebook's working directory, which by default is the folder where the notebook lives, and OS commands can navigate up or down the directory structure from there.

#### Listing contents of folders using OS commands preceded by `!`

OS-specific commands can be called within Jupyter by preceding them with a "`!`". For example, in Windows you can list the contents of the current working directory (by default, the folder containing the notebook) using "`! dir`". On Unix machines, this would be "`! ls`".

In [None]:
#OS specific command for showing the current working directory
!pwd #for mac/linux based machines (!cd #for PCs)

In [None]:
#List the contents of the current directory ("!ls" also works)
!dir 

In [None]:
#List the contents of the "data" sub directory 
!dir data

In [None]:
#List the contents of the parent directory (one level up from the current directory)
!dir ..

#### Navigating folders using Python's built-in `os` module

In [None]:
#Import the os module
import os

In [None]:
#Create a variable holding the current working directory
projectDir = os.getcwd()
#Display the current working directory
projectDir

In [None]:
#Change the directory to the data folder
os.chdir('data')
os.getcwd()

In [None]:
#Go back to the original working directory (stored in the "projectDir" variable above)
os.chdir(projectDir)
os.listdir()
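
Paths can also be assembled piece by piece so that they work on any operating system. The following is a minimal sketch using Python's standard `pathlib` module; the folder and file names (`data`, `Raw`, `example.csv`) are placeholders for illustration, not files this lesson requires.

```python
from pathlib import Path

# Hypothetical folder and file names, used only to illustrate path building
rel_path = Path("data") / "Raw" / "example.csv"

# A relative path is resolved against the current working directory
abs_path = Path.cwd() / rel_path

print(rel_path.is_absolute())   # False
print(abs_path.is_absolute())   # True
print(rel_path.name)            # example.csv
```

Because the separators are added by `pathlib`, the same code runs unchanged on Windows, macOS, and Linux.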

### Load your packages
As in R, packages should be loaded early in the script. 

In [1]:
import pandas as pd #Import pandas, referring to it as "pd"

### Import your data
The easiest way to import CSV data for data analysis is using Pandas' [`read_csv()` function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), which reads data directly into a Pandas dataframe object.

As in R, we supply the path to the CSV file, using relative path conventions. 

In [34]:
df_USGS = pd.read_csv('./data/Raw/USGS_Site02085000_Flow_Raw.csv')
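
To see what `read_csv()` does without needing the data file, here is a small sketch that reads CSV text held in memory. The three records and the name `df_demo` are made up for illustration; only the call itself mirrors the cell above.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """site_no,datetime,discharge_mean
2085000,10/1/27,39.0
2085000,10/2/27,
2085000,10/3/27,41.5
"""

# read_csv accepts a path or a file-like object; empty fields become NaN
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)
print(df_demo["discharge_mean"].isna().sum())
```

The first row is taken as the column headers, and the empty field in the second record is read as a missing value.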

## EXPLORE YOUR DATASET
Take a moment to read through the README file associated with the USGS dataset on discharge at the Eno River. Where can you find this file? How does the placement and information found in this file relate to the best practices for reproducible data analysis?

In [3]:
#View all records
df_USGS

Unnamed: 0,agency_cd,site_no,datetime,165986_00060_00001,165986_00060_00001_cd,165987_00060_00002,165987_00060_00002_cd,84936_00060_00003,84936_00060_00003_cd,84937_00065_00001,84937_00065_00001_cd,84938_00065_00002,84938_00065_00002_cd,84939_00065_00003,84939_00065_00003_cd
0,USGS,2085000,1/1/28,74.0,A,,,,,,,,,,
1,USGS,2085000,1/2/28,61.0,A,,,,,,,,,,
2,USGS,2085000,1/3/28,56.0,A,,,,,,,,,,
3,USGS,2085000,1/4/28,54.0,A,,,,,,,,,,
4,USGS,2085000,1/5/28,48.0,A,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33211,USGS,2085000,12/5/18,76.7,P,68.9,P,73.7,P,2.55,P,2.49,P,2.53,P
33212,USGS,2085000,12/6/18,68.9,P,62.8,P,66.2,P,2.49,P,2.44,P,2.47,P
33213,USGS,2085000,12/7/18,65.2,P,60.4,P,63.2,P,2.46,P,2.42,P,2.44,P
33214,USGS,2085000,12/8/18,64.0,P,60.4,P,61.5,P,2.45,P,2.42,P,2.43,P


#### Viewing properties of your dataset

In [4]:
#Confirm the data type -- R: class(df_USGS)
type(df_USGS)

pandas.core.frame.DataFrame

In [35]:
#Display the column names -- R: colnames(df_USGS)
df_USGS.columns

Index(['agency_cd', 'site_no', 'datetime', '165986_00060_00001',
       '165986_00060_00001_cd', '165987_00060_00002', '165987_00060_00002_cd',
       '84936_00060_00003', '84936_00060_00003_cd', '84937_00065_00001',
       '84937_00065_00001_cd', '84938_00065_00002', '84938_00065_00002_cd',
       '84939_00065_00003', '84939_00065_00003_cd'],
      dtype='object')

In [36]:
#Rename columns -- R: colnames(df_USGS) <- c(...)
df_USGS.columns = ("agency_cd", "site_no", "datetime", 
                   "discharge_max", "discharge_max_approval", 
                   "discharge_min", "discharge_min_approval", 
                   "discharge_mean", "discharge_mean_approval", 
                   "gage_height_max", "gage_height_max_approval", 
                   "gage_height_min", "gage_height_min-approval", 
                   "gage_height_mean", "gage_height_mean_approval")

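Assigning to `columns` replaces every name by position, so the list must match the column order exactly. As a hedged alternative, `DataFrame.rename` maps old names to new ones by name. The sketch below uses a toy dataframe (`df_demo`) with two of the USGS-style codes and invented values.

```python
import pandas as pd

# Toy dataframe with two USGS-style column codes (values are made up)
df_demo = pd.DataFrame({
    "84936_00060_00003": [39.0, 41.5],
    "84936_00060_00003_cd": ["A", "A"],
})

# Map old names to new names; columns not listed would be left unchanged
df_demo = df_demo.rename(columns={
    "84936_00060_00003": "discharge_mean",
    "84936_00060_00003_cd": "discharge_mean_approval",
})

print(list(df_demo.columns))
```
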
In [37]:
#Display the structure of the dataframe -- R: str(df_USGS)
df_USGS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33690 entries, 0 to 33689
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   agency_cd                  33690 non-null  object 
 1   site_no                    33690 non-null  int64  
 2   datetime                   33690 non-null  object 
 3   discharge_max              5348 non-null   float64
 4   discharge_max_approval     5348 non-null   object 
 5   discharge_min              5348 non-null   float64
 6   discharge_min_approval     5348 non-null   object 
 7   discharge_mean             28582 non-null  float64
 8   discharge_mean_approval    28582 non-null  object 
 9   gage_height_max            5461 non-null   float64
 10  gage_height_max_approval   5461 non-null   object 
 11  gage_height_min            5461 non-null   float64
 12  gage_height_min-approval   5461 non-null   object 
 13  gage_height_mean           8820 non-null   float64

In [38]:
#Display the dimensions
df_USGS.shape

(33690, 15)

In [9]:
#Display the total number of values (rows x columns)
df_USGS.size

498240

#### Viewing records in a dataframe

In [39]:
#View the head (first 5 records) of the dataset
df_USGS.head()

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
0,USGS,2085000,10/1/27,,,,,39.0,A,,,,,,
1,USGS,2085000,10/2/27,,,,,39.0,A,,,,,,
2,USGS,2085000,10/3/27,,,,,39.0,A,,,,,,
3,USGS,2085000,10/4/27,,,,,39.0,A,,,,,,
4,USGS,2085000,10/5/27,,,,,39.0,A,,,,,,


In [40]:
#Alternatively, view the first 9 records
df_USGS.head(9)

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
0,USGS,2085000,10/1/27,,,,,39.0,A,,,,,,
1,USGS,2085000,10/2/27,,,,,39.0,A,,,,,,
2,USGS,2085000,10/3/27,,,,,39.0,A,,,,,,
3,USGS,2085000,10/4/27,,,,,39.0,A,,,,,,
4,USGS,2085000,10/5/27,,,,,39.0,A,,,,,,
5,USGS,2085000,10/6/27,,,,,39.0,A,,,,,,
6,USGS,2085000,10/7/27,,,,,39.0,A,,,,,,
7,USGS,2085000,10/8/27,,,,,39.0,A,,,,,,
8,USGS,2085000,10/9/27,,,,,39.0,A,,,,,,


In [12]:
#Or 6 records, selected at random
df_USGS.sample(6)

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
21757,USGS,2085000,7/27/87,0.39,A,,,,,,,,,,
521,USGS,2085000,6/5/29,95.0,A,,,,,,,,,,
26808,USGS,2085000,5/25/01,9.9,A,1.6,A,,,,,,,,
12175,USGS,2085000,5/2/61,105.0,A,,,,,,,,,,
18161,USGS,2085000,9/21/77,,,,,,,,,,,,
31899,USGS,2085000,5/3/15,92.6,A,58.9,A,74.1,A,2.7,A,2.39,A,2.54,A


In [13]:
#Or, the last 3 records
df_USGS.tail(3)

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
33213,USGS,2085000,12/7/18,65.2,P,60.4,P,63.2,P,2.46,P,2.42,P,2.44,P
33214,USGS,2085000,12/8/18,64.0,P,60.4,P,61.5,P,2.45,P,2.42,P,2.43,P
33215,USGS,2085000,12/9/18,149.0,P,60.4,P,91.6,P,2.97,P,2.42,P,2.64,P


In [14]:
#View records 30000 to 30004 (positions 29999-30003, since Python counts from 0), columns 3, 8, and 14
df_USGS.iloc[29999:30004,[2,7,13]]

Unnamed: 0,datetime,discharge_mean,gage_height_mean
29999,2/18/10,63.4,2.15
30000,2/19/10,56.9,2.08
30001,2/20/10,53.1,2.03
30002,2/21/10,50.4,1.99
30003,2/22/10,60.5,2.11
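
Position-based selection with `iloc` is easy to confuse with label-based selection with `loc`. The sketch below contrasts the two on a small toy dataframe (`df_demo`, with illustrative values); note that an `iloc` slice excludes its end position while a `loc` slice includes its end label.

```python
import pandas as pd

# Small toy dataframe with the default 0, 1, 2, ... index
df_demo = pd.DataFrame({
    "datetime": ["2/18/10", "2/19/10", "2/20/10", "2/21/10"],
    "discharge_mean": [63.4, 56.9, 53.1, 50.4],
    "gage_height_mean": [2.15, 2.08, 2.03, 1.99],
})

# iloc: integer positions, end of slice excluded -> positions 1 and 2
by_position = df_demo.iloc[1:3, [0, 1]]

# loc: index labels and column names, end of slice included -> labels 1, 2, and 3
by_label = df_demo.loc[1:3, ["datetime", "discharge_mean"]]

print(len(by_position), len(by_label))  # 2 3
```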


In [41]:
#Show the count of values in the discharge_max_approval category
df_USGS['discharge_max_approval'].value_counts()

A    5347
P       1
Name: discharge_max_approval, dtype: int64
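
By default `value_counts()` leaves missing values out of the tally. A minimal sketch on a made-up series (`approval`) shows how `dropna=False` includes them:

```python
import numpy as np
import pandas as pd

# Made-up approval codes, half of them missing
approval = pd.Series(["A", "A", "P", np.nan, np.nan, np.nan], name="approval_demo")

# Default: missing values are ignored
counts = approval.value_counts()

# dropna=False adds a row counting the missing values
counts_with_na = approval.value_counts(dropna=False)

print(counts_with_na)
```

Comparing the two totals is a quick way to see how much of a column is empty.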

In [42]:
#Show the data type of the 'datetime' column
df_USGS['datetime'].dtype

dtype('O')

In [43]:
#Show the data type of all columns
df_USGS.dtypes

agency_cd                     object
site_no                        int64
datetime                      object
discharge_max                float64
discharge_max_approval        object
discharge_min                float64
discharge_min_approval        object
discharge_mean               float64
discharge_mean_approval       object
gage_height_max              float64
gage_height_max_approval      object
gage_height_min              float64
gage_height_min-approval      object
gage_height_mean             float64
gage_height_mean_approval     object
dtype: object

In [44]:
#Summary of all data
df_USGS.describe()

Unnamed: 0,site_no,discharge_max,discharge_min,discharge_mean,gage_height_max,gage_height_min,gage_height_mean
count,33690.0,5348.0,5348.0,28582.0,5461.0,5461.0,8820.0
mean,2085000.0,88.147554,30.462292,59.477284,2.123613,1.735719,1.951908
std,0.0,282.8018,61.329289,152.627024,1.416646,0.58774,0.949163
min,2085000.0,0.26,0.09,0.02,0.89,0.84,0.87
25%,2085000.0,7.23,4.38,9.3,1.49,1.38,1.45
50%,2085000.0,21.15,12.6,24.0,1.83,1.65,1.77
75%,2085000.0,59.8,34.8,54.0,2.31,2.03,2.2
max,2085000.0,4730.0,1460.0,4600.0,17.02,9.19,15.04


In [45]:
#Summary of a specific column
df_USGS['discharge_mean'].describe()

count    28582.000000
mean        59.477284
std        152.627024
min          0.020000
25%          9.300000
50%         24.000000
75%         54.000000
max       4600.000000
Name: discharge_mean, dtype: float64
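
`describe()` adapts to the kind of data it is given. As a small sketch on a toy dataframe (`df_demo`, invented values), numeric columns get the statistics shown above, while a text column gets a count, the number of unique values, the most common value, and its frequency.

```python
import pandas as pd

# Toy dataframe with one numeric and one text column
df_demo = pd.DataFrame({
    "discharge": [39.0, 41.5, 56.9, 63.4],
    "approval": ["A", "A", "A", "P"],
})

# Numeric summary (count, mean, std, min, quartiles, max)
numeric_summary = df_demo.describe()

# Text summary (count, unique, top, freq)
text_summary = df_demo["approval"].describe()

print(text_summary)
```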

### Formatting dates
Yep, as in R, dates can be a pain. By default they are imported as generic, non-numeric "objects" (hence the dtype of "O" above). 

The Pandas `to_datetime` function ([link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)) works like R's `as.Date` function to help convert dates, in various formats, into actual date objects (called "timestamps" in Pandas lingo).

In [46]:
#Create an example of a date, in string format
exampleDate = "2021-04-14"
#Convert to a Pandas "timestamp" object
dateObj = pd.to_datetime(exampleDate)
dateObj

Timestamp('2021-04-14 00:00:00')

If the date is in a non-standard format, we pass a `format` string that tells `to_datetime` how to read it...

In [47]:
#Create a date string in a non-standard format
exampleDate2 = "Wednesday, 14 Apr. 2021"
dateObj2 = pd.to_datetime(exampleDate2,format = '%A, %d %b. %Y')
dateObj2

Timestamp('2021-04-14 00:00:00')
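
The same `format` argument works on a whole column of strings. The sketch below uses made-up date strings, one of them deliberately invalid, and adds `errors="coerce"` so that unparseable entries become `NaT` (the missing value for dates) instead of raising an error.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
raw_dates = pd.Series(["14 Apr. 2021", "02 May. 2021", "not a date"])

# errors="coerce" turns strings that do not match the format into NaT
parsed = pd.to_datetime(raw_dates, format="%d %b. %Y", errors="coerce")

print(parsed)
```

Counting the `NaT` values afterwards (`parsed.isna().sum()`) shows how many strings failed to parse.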

Timestamp objects can be displayed in various other date formats using the `strftime` function. See http://strftime.org/ for all the formatting options and try a few yourself. 

In [48]:
#Display the timestamp objects in various formats using "strftime"
print(dateObj.strftime('%m/%d/%Y'))

04/14/2021


<details>
    <summary><b>See if you can get the date to read:</b> <code>Wednesday, Apr. 14, 2021</code></summary>
    <code>print(dateObj.strftime('%A, %b. %d, %Y'))</code>
</details>

In [None]:
print(dateObj.)

<details>
    <summary><b>What number day of the year is this date?</b></summary>
    <code>print(dateObj.strftime('%j'))</code>
</details>

In [None]:
print(dateObj.)

#### Convert our dataframe's `datetime` values to timestamps
We can apply the `pd.to_datetime()` function to our datetime column. 

In [49]:
#Update the datetime column to be dates, not strings
df_USGS['datetime'] = pd.to_datetime(df_USGS['datetime'])

In [50]:
#Display a few samples
df_USGS.head()

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
0,USGS,2085000,2027-10-01,,,,,39.0,A,,,,,,
1,USGS,2085000,2027-10-02,,,,,39.0,A,,,,,,
2,USGS,2085000,2027-10-03,,,,,39.0,A,,,,,,
3,USGS,2085000,2027-10-04,,,,,39.0,A,,,,,,
4,USGS,2085000,2027-10-05,,,,,39.0,A,,,,,,
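
The 2027 dates above come from how two-digit years are assigned a century. As a sketch, when a format containing `%y` is given explicitly, years 00 to 68 are read as 2000 to 2068 and years 69 to 99 as 1969 to 1999. The strings below are made up to straddle that boundary; parsing without an explicit format may follow a different rule depending on the Pandas version.

```python
import pandas as pd

# Made-up date strings with two-digit years on either side of the 68/69 boundary
two_digit = pd.Series(["10/1/27", "12/31/68", "1/1/69"])

# %y reads a two-digit year: 00-68 -> 2000-2068, 69-99 -> 1969-1999
parsed = pd.to_datetime(two_digit, format="%m/%d/%y")

years = parsed.dt.year.tolist()
print(years)  # [2027, 2068, 1969]
```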


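As in our R example, the 2-digit years in our raw data file are mistakenly assumed to be in the 21st century: the record starts in 1927, yet its first dates are read as 2027. Any date that falls after the end of the record (December 2018) really belongs to the 20th century and needs to be shifted back 100 years. As we did in R, we'll apply a function to find and fix these dates...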

In [25]:
#Check whether the last record's date falls after 2019-01-10 (it should not)
df_USGS.iloc[-1,2] > pd.to_datetime('2019-01-10')

False

In [51]:
#Create a function called "fixDate" that corrects date values
def fixDate(d):
    if d > pd.to_datetime('2019-01-10'):
        return d - pd.DateOffset(years=100)
    else:
        return d

In [52]:
#Apply the function to the datetime values
df_USGS['datetime'] = df_USGS['datetime'].apply(fixDate)

In [53]:
#View the result
df_USGS.head()

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
0,USGS,2085000,1927-10-01,,,,,39.0,A,,,,,,
1,USGS,2085000,1927-10-02,,,,,39.0,A,,,,,,
2,USGS,2085000,1927-10-03,,,,,39.0,A,,,,,,
3,USGS,2085000,1927-10-04,,,,,39.0,A,,,,,,
4,USGS,2085000,1927-10-05,,,,,39.0,A,,,,,,
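
Applying a function value by value is easy to read but can be slow on long records. A hedged alternative is to do the same correction on the whole column at once with a boolean mask. The sketch below uses three made-up dates and the same cutoff as `fixDate`; the names `dates`, `cutoff`, and `fixed` are illustrative.

```python
import pandas as pd

# Made-up dates: only the first one lies after the cutoff
dates = pd.Series(pd.to_datetime(["2027-10-01", "2018-12-09", "1995-06-15"]))
cutoff = pd.Timestamp("2019-01-10")

# True where a date is later than the cutoff and so needs correcting
too_late = dates > cutoff

# Keep dates that are fine; replace the others with the date 100 years earlier
fixed = dates.where(~too_late, dates - pd.DateOffset(years=100))

print(fixed.dt.year.tolist())  # [1927, 2018, 1995]
```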


## ADJUSTING DATASETS

### Removing NAs

Notice in our dataset that our discharge and gage height observations have many NAs, meaning no measurement was recorded for a specific day. In some cases, it might be in our best interest to remove NAs from a dataset. Whether to remove NAs depends on your research question.

In [None]:
#List the number of missing values in each column (sum across rows)
df_USGS.isna().sum(axis='rows')

In [30]:
#Show NAs in just one variable
df_USGS['discharge_mean'].isna().sum()

28049
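
Counts of missing values are easier to judge as a share of the record. A small sketch on a toy dataframe (`df_demo`, invented values): because `True` counts as 1, the mean of `isna()` gives the fraction of missing values in each column.

```python
import numpy as np
import pandas as pd

# Toy dataframe with gaps in both columns
df_demo = pd.DataFrame({
    "discharge_mean": [39.0, np.nan, 41.5, np.nan],
    "gage_height_mean": [np.nan, np.nan, np.nan, 2.1],
})

# Number of missing values per column
na_counts = df_demo.isna().sum()

# Fraction of missing values per column
na_fraction = df_demo.isna().mean()

print(na_fraction)
```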

In [60]:
#Drop rows that have missing data in any column -- R: na.omit(df_USGS)
df_USGS_cleaned = df_USGS.dropna()
df_USGS_cleaned.shape

(5342, 15)
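
Dropping every row with any NA can discard far more data than a question requires. As a sketch on a toy dataframe (`df_demo`, invented values), the `subset` and `how` arguments of `dropna()` give finer control:

```python
import numpy as np
import pandas as pd

# Toy dataframe: only the third row is complete, the second is entirely empty
df_demo = pd.DataFrame({
    "discharge_mean": [39.0, np.nan, 41.5, 50.0],
    "gage_height_mean": [np.nan, np.nan, 2.0, np.nan],
})

# Default: drop rows with a missing value in any column
complete_rows = df_demo.dropna()

# Only require that discharge_mean is present
has_discharge = df_demo.dropna(subset=["discharge_mean"])

# Only drop rows where every column is missing
not_all_missing = df_demo.dropna(how="all")

print(len(complete_rows), len(has_discharge), len(not_all_missing))  # 1 3 3
```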

## SAVING DATASETS
We just edited our raw dataset into a processed form. We may want to return to this processed dataset later, which will be easier to do if we save it as a CSV file in a separate "Processed" folder, leaving the raw data untouched. 

In [62]:
#Save the file
df_USGS_cleaned.to_csv("./data/Processed/USGS_Site02085000_Flow_Processed.csv", index=False)
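
A CSV stores everything as text, so date columns come back as strings unless told otherwise. The sketch below writes a toy dataframe to a temporary folder and reads it back with `parse_dates`; the file name `demo_processed.csv` and the values are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe with a proper datetime column
df_demo = pd.DataFrame({
    "datetime": pd.to_datetime(["1927-10-01", "1927-10-02"]),
    "discharge_mean": [39.0, 41.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "demo_processed.csv"  # hypothetical file name
    # index=False keeps the row index out of the file
    df_demo.to_csv(out_path, index=False)
    # Ask read_csv to convert the datetime column back into timestamps
    df_back = pd.read_csv(out_path, parse_dates=["datetime"])

print(df_back.dtypes)
```

Without `index=False`, the row index would be written as an extra unnamed first column and read back as `Unnamed: 0`.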

## TIPS AND TRICKS: SPREADSHEETS

* Files should be saved as .csv or .txt for easy import into Pandas. Note that complex formatting, including formulas in Excel, is not saved when spreadsheets are converted to comma-separated or text formats (i.e., values alone are saved).


* The first row is reserved for column headers.


* A second row of column headers (e.g., units) should not be used if data are being imported into Pandas. Incorporate units into the first row column headers if necessary.


* Short names are preferred for column headers, to the extent they are informative. Additional information can be stored in comments within Python scripts and/or in README files.


* Spaces in column names are allowed in Pandas, but should be replaced with underscores ("`_`") to avoid issues. 


* Avoid symbols in column headers. This can cause issues when importing into Pandas.
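
The naming tips above can be applied after import as well. The following is one possible clean-up sketch, not a standard recipe: it lowercases the names, replaces runs of spaces and symbols with underscores, and trims stray underscores. The column names in `df_demo` are made up for illustration.

```python
import pandas as pd

# Empty toy dataframe with awkward column names
df_demo = pd.DataFrame(columns=["Site No", "gage_height_min-approval", "Discharge (cfs)"])

clean_names = (
    df_demo.columns
    .str.strip()                                   # remove leading/trailing spaces
    .str.lower()                                   # use lowercase throughout
    .str.replace(r"[^0-9a-z]+", "_", regex=True)   # spaces and symbols -> underscore
    .str.strip("_")                                # drop underscores left at the ends
)
df_demo.columns = clean_names

print(list(df_demo.columns))
# ['site_no', 'gage_height_min_approval', 'discharge_cfs']
```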