---
title: "3: Data Exploration"
author: "Environmental Data Analytics | John Fay and Luana Lima | Developed by Kateri Salk"   
date: "Spring 2021"   
---

# 3. Data Exploration

## LESSON OBJECTIVES
1. Set up a data analysis session in Jupyter
2. Import and explore datasets in Python
3. Apply data exploration skills to a real-world example dataset

## BEST PRACTICES FOR PYTHON/JUPYTER

In many situations in data analytics, you may be expected to work from multiple computers or share projects among multiple users. A few general best practices help avoid common pitfalls related to collaborative work.

### Relative paths in Jupyter notebooks

Jupyter notebooks can use absolute or relative paths, but relative paths are more robust and should be used where possible. A relative path is resolved from the notebook's working directory (by default, the folder where the notebook lives), and OS commands can navigate up or down the directory structure from there.
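
As a minimal sketch of the difference, the snippet below builds a relative path and the matching absolute path with Python's `pathlib` module. The folder names `data` and `Raw` mirror this project's layout; nothing is read from disk, so the code runs even where those folders do not exist.

```python
from pathlib import Path

# A relative path: interpreted from the current working directory
rel_path = Path("data") / "Raw"

# The same location as an absolute path, which differs from machine to machine
abs_path = Path.cwd() / rel_path

print(rel_path.is_absolute())  # False
print(abs_path.is_absolute())  # True
```

Only the relative form stays valid when the project folder is copied to another computer or cloned by a collaborator.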

#### Listing contents of folders using OS commands followed by `!`

OS-specific commands can be called within Jupyter by preceding them with a "`!`". For example, on Windows you can list the contents of the folder containing the script you are running using "`! dir`". On Unix machines, this would be "`! ls`".

In [7]:
#OS specific command for showing the current working directory
!pwd #for mac/linux based machines (!cd #for PCs)

/home/jovyan/work/PythonForRUsers


In [13]:
#List the contents of the current directory ("!ls" also works)
!dir 

01-Getting-Started.ipynb	       A-Basic-Python.ipynb
02-ReproducibilityCoding-Basics.ipynb  B-Web-Services-APIs-Python.ipynb
03-Data-Exploration.ipynb	       data
03-DataExploration_Part2.ipynb	       LICENSE
06-Data-Exploration.ipynb	       README.md
07-Data-Wrangling.ipynb		       requirements.txt
08-Data-Wrangling.ipynb		       Untitled.ipynb


In [16]:
#List the contents of the "data" sub directory 
!dir data

Raw


In [17]:
#List the contents of the parent directory (one level above the current directory)
!dir ..

bokeh-notebooks       PythonForRUsers	       restore-my-notebook.ipynb
ggplot-python.ipynb   python-matplotlib.ipynb
mysql-savefile.ipynb  r-demo.ipynb


#### Navigating folders using Python's built-in `os` module

In [18]:
#Import the os module
import os

In [20]:
#Create a variable holding the current working directory
projectDir = os.getcwd()
#Display the current working directory
projectDir

'/home/jovyan/work/PythonForRUsers'

In [21]:
#Change the directory to the data folder
os.chdir('data')
os.getcwd()

'/home/jovyan/work/PythonForRUsers/data'

In [23]:
#Go back to the project directory (stored in the "projectDir" variable above)
os.chdir(projectDir)
os.listdir()

['data',
 '06-Data-Exploration.ipynb',
 '.ipynb_checkpoints',
 '03-Data-Exploration.ipynb',
 '.gitignore',
 'LICENSE',
 '08-Data-Wrangling.ipynb',
 '02-ReproducibilityCoding-Basics.ipynb',
 '03-DataExploration_Part2.ipynb',
 'A-Basic-Python.ipynb',
 '07-Data-Wrangling.ipynb',
 'README.md',
 'requirements.txt',
 '.git',
 '01-Getting-Started.ipynb',
 'B-Web-Services-APIs-Python.ipynb',
 'Untitled.ipynb']
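
Changing directories by hand works, but it is easy to forget the step that changes back. One possible way to make the round trip automatic is a small context manager. This is a sketch only: `working_directory` is a hypothetical helper written for this example, not part of the `os` module, and a temporary folder stands in for the `data` folder so the code runs anywhere.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Hypothetical helper: switch to `path`, then always switch back."""
    original = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the code inside the "with" block raises an error
        os.chdir(original)

start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:   # stand-in for the "data" folder
    with working_directory(tmp):
        inside = os.getcwd()                 # we are inside the temporary folder here
after = os.getcwd()                          # back where we started

print(start == after)  # True
```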

### Load your packages
As in R, packages should be loaded early in the script. 

In [24]:
import pandas as pd #Import pandas, referring to it as "pd"

### Import your data
The easiest way to import CSV data for data analysis is using pandas' [`read_csv()` function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), which reads data directly into a pandas dataframe object.

As in R, we supply the path to the CSV file, using relative path conventions. 

In [25]:
df_USGS = pd.read_csv('./data/Raw/USGS_Site02085000_Flow_Raw.csv')
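
By default `read_csv()` guesses each column's data type, which can change identifiers such as site numbers. The sketch below shows the effect on a tiny in-memory stand-in for the raw file. The leading zero in `site_no` is an assumption based on the file name, and the `flow` column name is made up for the example.

```python
import io
import pandas as pd

# Hypothetical two-record stand-in for the raw CSV file
csv_text = """agency_cd,site_no,datetime,flow
USGS,02085000,1/1/28,74.0
USGS,02085000,1/2/28,61.0
"""

# Default import: site_no is read as an integer, so the leading zero is dropped
df_default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: site_no is kept as text, exactly as written in the file
df_text = pd.read_csv(io.StringIO(csv_text), dtype={"site_no": str})

print(df_default["site_no"].iloc[0])  # 2085000
print(df_text["site_no"].iloc[0])     # 02085000
```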

## EXPLORE YOUR DATASET
Take a moment to read through the README file associated with the USGS dataset on discharge at the Eno River. Where can you find this file? How does the placement and information found in this file relate to the best practices for reproducible data analysis?

In [26]:
#View all records
df_USGS

Unnamed: 0,agency_cd,site_no,datetime,165986_00060_00001,165986_00060_00001_cd,165987_00060_00002,165987_00060_00002_cd,84936_00060_00003,84936_00060_00003_cd,84937_00065_00001,84937_00065_00001_cd,84938_00065_00002,84938_00065_00002_cd,84939_00065_00003,84939_00065_00003_cd
0,USGS,2085000,1/1/28,74.0,A,,,,,,,,,,
1,USGS,2085000,1/2/28,61.0,A,,,,,,,,,,
2,USGS,2085000,1/3/28,56.0,A,,,,,,,,,,
3,USGS,2085000,1/4/28,54.0,A,,,,,,,,,,
4,USGS,2085000,1/5/28,48.0,A,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33211,USGS,2085000,12/5/18,76.7,P,68.9,P,73.7,P,2.55,P,2.49,P,2.53,P
33212,USGS,2085000,12/6/18,68.9,P,62.8,P,66.2,P,2.49,P,2.44,P,2.47,P
33213,USGS,2085000,12/7/18,65.2,P,60.4,P,63.2,P,2.46,P,2.42,P,2.44,P
33214,USGS,2085000,12/8/18,64.0,P,60.4,P,61.5,P,2.45,P,2.42,P,2.43,P


#### Viewing properties of your dataset

In [27]:
#Confirm the data type -- R: class(df_USGS)
type(df_USGS)

pandas.core.frame.DataFrame

In [28]:
#Display the column names -- R: colnames(df_USGS)
df_USGS.columns

Index(['agency_cd', 'site_no', 'datetime', '165986_00060_00001',
       '165986_00060_00001_cd', '165987_00060_00002', '165987_00060_00002_cd',
       '84936_00060_00003', '84936_00060_00003_cd', '84937_00065_00001',
       '84937_00065_00001_cd', '84938_00065_00002', '84938_00065_00002_cd',
       '84939_00065_00003', '84939_00065_00003_cd'],
      dtype='object')

In [29]:
#Rename columns -- R: colnames(df_USGS) <- c(...)
df_USGS.columns = ("agency_cd", "site_no", "datetime", 
                   "discharge_max", "discharge_max_approval", 
                   "discharge_min", "discharge_min_approval", 
                   "discharge_mean", "discharge_mean_approval", 
                   "gage_height_max", "gage_height_max_approval", 
                   "gage_height_min", "gage_height_min-approval", 
                   "gage_height_mean", "gage_height_mean_approval")
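
Assigning a full tuple of names relies on getting every position right. As an alternative sketch, `rename()` with a dictionary maps old names to new ones explicitly and leaves unlisted columns untouched. The small dataframe below is a stand-in that reuses three of the original column names.

```python
import pandas as pd

# Stand-in dataframe with a few of the raw USGS column names
df = pd.DataFrame({
    "datetime": ["1/1/28", "1/2/28"],
    "165986_00060_00001": [74.0, 61.0],
    "165986_00060_00001_cd": ["A", "A"],
})

# Map each old name to its new name; order does not matter
new_names = {
    "165986_00060_00001": "discharge_max",
    "165986_00060_00001_cd": "discharge_max_approval",
}
df = df.rename(columns=new_names)

print(list(df.columns))
# ['datetime', 'discharge_max', 'discharge_max_approval']
```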

In [30]:
#Display the structure of the dataframe -- R: str(df_USGS)
df_USGS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33216 entries, 0 to 33215
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   agency_cd                  33216 non-null  object 
 1   site_no                    33216 non-null  int64  
 2   datetime                   33216 non-null  object 
 3   discharge_max              28103 non-null  float64
 4   discharge_max_approval     28103 non-null  object 
 5   discharge_min              8439 non-null   float64
 6   discharge_min_approval     8439 non-null   object 
 7   discharge_mean             5167 non-null   float64
 8   discharge_mean_approval    5167 non-null   object 
 9   gage_height_max            5164 non-null   float64
 10  gage_height_max_approval   5164 non-null   object 
 11  gage_height_min            5045 non-null   float64
 12  gage_height_min-approval   5045 non-null   object 
 13  gage_height_mean           5045 non-null   float64

In [31]:
#Display the dimensions
df_USGS.shape

(33216, 15)

In [32]:
#Display the total number of values (rows x columns)
df_USGS.size

498240

##### Viewing records in a dataframe

In [33]:
#View the head (first 5 records) of the dataset
df_USGS.head()

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
0,USGS,2085000,1/1/28,74.0,A,,,,,,,,,,
1,USGS,2085000,1/2/28,61.0,A,,,,,,,,,,
2,USGS,2085000,1/3/28,56.0,A,,,,,,,,,,
3,USGS,2085000,1/4/28,54.0,A,,,,,,,,,,
4,USGS,2085000,1/5/28,48.0,A,,,,,,,,,,


In [34]:
#Alternatively, view the first 9 records
df_USGS.head(9)

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
0,USGS,2085000,1/1/28,74.0,A,,,,,,,,,,
1,USGS,2085000,1/2/28,61.0,A,,,,,,,,,,
2,USGS,2085000,1/3/28,56.0,A,,,,,,,,,,
3,USGS,2085000,1/4/28,54.0,A,,,,,,,,,,
4,USGS,2085000,1/5/28,48.0,A,,,,,,,,,,
5,USGS,2085000,1/6/28,47.0,A,,,,,,,,,,
6,USGS,2085000,1/7/28,44.0,A,,,,,,,,,,
7,USGS,2085000,1/8/28,41.0,A,,,,,,,,,,
8,USGS,2085000,1/9/28,44.0,A,,,,,,,,,,


In [35]:
#Or 6 records, selected at random
df_USGS.sample(6)

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
13638,USGS,2085000,5/4/65,37.0,A,,,,,,,,,,
2100,USGS,2085000,10/1/33,2.4,A,,,,,,,,,,
1681,USGS,2085000,8/8/32,13.5,A,,,,,,,,,,
27917,USGS,2085000,6/7/04,8.73,A,1.33,A,,,,,,,,
28432,USGS,2085000,11/4/05,0.7,A:e,1.1,A,1.01,A,1.08,A,,,,
17291,USGS,2085000,5/5/75,,,,,,,,,,,,


In [36]:
#Or, the last 3 records
df_USGS.tail(3)

Unnamed: 0,agency_cd,site_no,datetime,discharge_max,discharge_max_approval,discharge_min,discharge_min_approval,discharge_mean,discharge_mean_approval,gage_height_max,gage_height_max_approval,gage_height_min,gage_height_min-approval,gage_height_mean,gage_height_mean_approval
33213,USGS,2085000,12/7/18,65.2,P,60.4,P,63.2,P,2.46,P,2.42,P,2.44,P
33214,USGS,2085000,12/8/18,64.0,P,60.4,P,61.5,P,2.45,P,2.42,P,2.43,P
33215,USGS,2085000,12/9/18,149.0,P,60.4,P,91.6,P,2.97,P,2.42,P,2.64,P


In [37]:
#View records 30000 to 30004, columns 3, 8, and 14 (iloc counts from 0 and excludes the end position)
df_USGS.iloc[29999:30004,[2,7,13]]

Unnamed: 0,datetime,discharge_mean,gage_height_mean
29999,2/18/10,63.4,2.15
30000,2/19/10,56.9,2.08
30001,2/20/10,53.1,2.03
30002,2/21/10,50.4,1.99
30003,2/22/10,60.5,2.11
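
Coming from R, the slicing rules are worth a closer look. The sketch below, on a made-up five-row dataframe, contrasts position-based `iloc` (starts at 0, end excluded) with label-based `loc` (end included).

```python
import pandas as pd

# Made-up dataframe with the default index labels 0 through 4
df = pd.DataFrame({"flow": [10.0, 20.0, 30.0, 40.0, 50.0]})

by_position = df.iloc[1:3]  # positions 1 and 2; position 3 is excluded
by_label = df.loc[1:3]      # labels 1, 2, and 3; the end label is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```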


In [38]:
#Show the data type of the 'datetime' column
df_USGS['datetime'].dtype

dtype('O')

In [39]:
#Show the data type of all columns
df_USGS.dtypes

agency_cd                     object
site_no                        int64
datetime                      object
discharge_max                float64
discharge_max_approval        object
discharge_min                float64
discharge_min_approval        object
discharge_mean               float64
discharge_mean_approval       object
gage_height_max              float64
gage_height_max_approval      object
gage_height_min              float64
gage_height_min-approval      object
gage_height_mean             float64
gage_height_mean_approval     object
dtype: object
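
The `object` dtype of the `datetime` column means the dates are stored as plain text. The following is a minimal sketch of one way to convert them, assuming the month/day/two-digit-year layout seen in the records above. Two-digit years from 00 to 68 are read as 2000 to 2068, so early records such as `1/1/28` would land in the future. The sketch shifts any date later than the last record back by 100 years; the cutoff of 9 December 2018 is an assumption taken from the tail of the dataset.

```python
import pandas as pd

# Sample date strings copied from records shown in this lesson
dates = pd.Series(["1/1/28", "5/4/65", "5/5/75", "12/9/18"])

# Parse the text as dates; "%y" is a two-digit year
parsed = pd.to_datetime(dates, format="%m/%d/%y")

# Assumed date of the last record; anything later must belong to the 1900s
cutoff = pd.Timestamp("2018-12-09")
fixed = parsed.where(parsed <= cutoff, parsed - pd.DateOffset(years=100))

print(parsed.dt.year.tolist())  # [2028, 2065, 1975, 2018]
print(fixed.dt.year.tolist())   # [1928, 1965, 1975, 2018]
```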

In [48]:
#Summary statistics for all numeric columns -- R: summary(df_USGS)
df_USGS.describe()

Unnamed: 0,site_no,discharge_max,discharge_min,discharge_mean,gage_height_max,gage_height_min,gage_height_mean
count,33216.0,28103.0,8439.0,5167.0,5164.0,5045.0,5045.0
mean,2085000.0,64.659659,16.816984,44.597676,2.062227,1.70733,1.855641
std,0.0,179.307854,40.29165,123.156452,1.326113,0.538284,0.809897
min,2085000.0,0.02,0.09,0.22,0.89,0.84,0.87
25%,2085000.0,9.8,1.9,5.005,1.47,1.38,1.43
50%,2085000.0,25.0,3.62,15.2,1.8,1.64,1.72
75%,2085000.0,55.0,16.5,40.6,2.25,2.0,2.12
max,2085000.0,4730.0,1200.0,3270.0,17.0,7.93,14.47


In [49]:
#Summary of a specific column
df_USGS['discharge_mean'].describe()

count    5167.000000
mean       44.597676
std       123.156452
min         0.220000
25%         5.005000
50%        15.200000
75%        40.600000
max      3270.000000
Name: discharge_mean, dtype: float64
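
The `count` row above differs widely between columns because many values are missing. A short sketch of how the gaps could be tallied directly, using a made-up four-row dataframe with two of the lesson's column names:

```python
import numpy as np
import pandas as pd

# Made-up values; NaN marks a missing measurement
df = pd.DataFrame({
    "discharge_max": [74.0, 61.0, np.nan, 65.2],
    "discharge_mean": [np.nan, np.nan, np.nan, 63.2],
})

missing_count = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()  # fraction of missing values per column

print(missing_count["discharge_mean"])  # 3
print(missing_share["discharge_max"])   # 0.25
```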

## TIPS AND TRICKS: SPREADSHEETS

* Files should be saved as .csv or .txt for easy import into Pandas. Note that complex formatting, including formulas in Excel, is not saved when spreadsheets are converted to comma-separated or text formats (i.e., values alone are saved).


* The first row is reserved for column headers.


* A second row of column headers (e.g., units) should not be used if data are being imported into Pandas. Incorporate units into the first-row column headers if necessary.


* Short names are preferred for column headers, to the extent they are informative. Additional information can be stored in comments within Python scripts and/or in README files.


* Spaces in column names are allowed in Pandas, but should be replaced with underscores ("`_`") to avoid issues. 


* Avoid symbols in column headers. Symbols can cause issues when importing into Pandas and when referring to columns later.
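
When a spreadsheet arrives with headers that break these tips, the names can be tidied after import. The sketch below shows one possible cleanup; the three header names are invented for the example, and the exact rules (lowercase, underscores for anything other than letters and digits) are a choice, not a requirement.

```python
import pandas as pd

# Empty dataframe with invented, messy column headers
df = pd.DataFrame(columns=["Site No", "Discharge (cfs)", "gage-height"])

df.columns = (
    df.columns.str.strip()                        # drop leading/trailing spaces
    .str.lower()                                  # use lowercase throughout
    .str.replace(r"[^0-9a-z]+", "_", regex=True)  # spaces and symbols become "_"
    .str.strip("_")                               # drop underscores left at the ends
)

print(list(df.columns))
# ['site_no', 'discharge_cfs', 'gage_height']
```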