## 7: Data Wrangling (in Python & Pandas)
*Environmental Data Analytics | John Fay*<br>
*Spring 2019*

## LESSON OBJECTIVES
1. Wrangle datasets with Python and Pandas functions
2. Compare R's `dplyr` to Python's `pandas` package with respect to wrangling data.
3. Apply data wrangling skills to a real-world example dataset

## SET UP YOUR DATA ANALYSIS SESSION
Import and explore the `NTL-LTER_Lake_ChemistryPhysics_Raw.csv` dataset. ([info on Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html))

In [None]:
#Import pandas
import pandas as pd

In [None]:
#Read our data into a dataframe
NTL_phys_data = pd.read_csv("../Data/Raw/NTL-LTER_Lake_ChemistryPhysics_Raw.csv")

In [None]:
#First few rows...
NTL_phys_data.head()

In [None]:
#Structure...
NTL_phys_data.info()

In [None]:
#Summary...
NTL_phys_data.describe(include='all')

In [None]:
#Dimensions...
NTL_phys_data.shape

---
## DATA WRANGLING

###  ♦ Filtering
Filtering allows us to choose certain rows (observations) in our dataset.

In [None]:
#Data type of the lakename variable
NTL_phys_data['lakename'].dtype

In [None]:
#Data type of the depth variable
NTL_phys_data['depth'].dtype

<font color='blue'>► _How might the data type of a column affect how we go about subsetting records?_</font>

#### Binary mask filtering
Similar to R's matrix filtering, we can create and apply a binary mask to subset records from our dataframe.

In [None]:
#Construct a binary mask where records with zero depths are set as True
zeroDepthMask = NTL_phys_data['depth'] == 0

In [None]:
#Display the mask
zeroDepthMask

In [None]:
#Apply the mask to the original dataframe, which returns only rows where the mask is True
NTL_Surface1 = NTL_phys_data[zeroDepthMask]
NTL_Surface1.shape

We can create and apply a mask all in one line...

In [None]:
#Binary mask filtering, in one line
NTL_Surface1 = NTL_phys_data[NTL_phys_data['depth'] == 0]
NTL_Surface1.shape
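
As a self-contained illustration of the masking steps above, here is a minimal sketch on a small made-up dataframe (the `toy` data and its values are hypothetical, not drawn from the NTL-LTER file). It shows that a mask is simply a Series of True/False values, one per row of the dataframe.

```python
import pandas as pd

# Hypothetical toy data standing in for the lake dataset
toy = pd.DataFrame({
    "lakename": ["Paul Lake", "Paul Lake", "Peter Lake", "Peter Lake"],
    "depth": [0.0, 1.5, 0.0, 2.0],
})

# The mask is a boolean Series with one True/False value per row
surface_mask = toy["depth"] == 0

# Indexing with the mask keeps only the rows where the mask is True
surface = toy[surface_mask]

print(surface_mask.tolist())  # [True, False, True, False]
print(surface.shape)          # (2, 2)
```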

#### Query filtering
Similar to dplyr's `filter` verb, we can supply a query string to select rows.

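_<mark>→ → Note: the query statement will NOT work as written if you have a space in your field name ← ←</mark>
If that's the case, you should use binary mask filtering (or, in Pandas 0.25 and later, wrap the field name in backticks inside the query string)..._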
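
A minimal sketch, on a made-up dataframe with a hypothetical column named `lake name`, of two ways to handle a field name containing a space: the binary mask recommended above and, assuming Pandas 0.25 or later, backtick quoting inside the query string.

```python
import pandas as pd

# Hypothetical data with a space in one column name
toy = pd.DataFrame({
    "lake name": ["Paul Lake", "Peter Lake", "Paul Lake"],
    "depth": [0.0, 0.0, 1.5],
})

# Option 1: binary mask; works regardless of the column name
paul_mask = toy[toy["lake name"] == "Paul Lake"]

# Option 2: backtick-quote the column name inside the query string
# (assumes pandas 0.25 or later)
paul_query = toy.query("`lake name` == 'Paul Lake'")

# Both approaches should return the same rows
print(paul_mask.equals(paul_query))
```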

In [None]:
#Query filtering
NTL_Surface2 = NTL_phys_data.query("depth == 0")
NTL_Surface2.shape

In [None]:
NTL_Surface3 = NTL_phys_data.query("depth < 0.25")
NTL_Surface3.shape

<font color='blue'>► _How might `<` or `>` work with nominal data (e.g. `lakename`)?_</font>

<font color="blue">► *Have a look at the output of your queries. Are the indices sequential? What might this say about our dataframe subsets?*</font>

In [None]:
#Check the index values: are they sequential? 
NTL_Surface2.head()
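
One way to explore the question above: a filtered dataframe keeps the index labels of the rows it was drawn from. The sketch below uses made-up data (the `toy` dataframe is hypothetical) and shows how `reset_index(drop=True)` renumbers the rows if sequential labels are wanted.

```python
import pandas as pd

# Hypothetical depths; rows 0, 2, and 4 are surface samples
toy = pd.DataFrame({"depth": [0.0, 1.0, 0.0, 2.0, 0.0]})

# The subset keeps the original row labels
surface = toy.query("depth == 0")
print(surface.index.tolist())     # [0, 2, 4]

# drop=True discards the old labels instead of saving them as a column
renumbered = surface.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```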

##### Multiple filters...
How do we filter records using multiple criteria? Here are examples using the binary masking and query approaches.

In [None]:
#First, generate a list of lake names
NTL_phys_data['lakename'].unique()

In [None]:
#Filter for just Peter Lake and Paul Lake using binary masks
PeterMask = NTL_phys_data['lakename'] == 'Peter Lake' #Mask 1
PaulMask = NTL_phys_data['lakename'] == 'Paul Lake'   #Mask 2
PeterPaul1 = NTL_phys_data[PeterMask | PaulMask]      #Combine masks, using the or (|) operator
PeterPaul1.shape                                      #Show the dimensions of the result

In [None]:
#Same as above, but in one line; note the need for parens around each mask criterion
PeterPaul2 = NTL_phys_data[(NTL_phys_data['lakename'] == 'Peter Lake') | (NTL_phys_data['lakename'] == 'Paul Lake')]
PeterPaul2.shape

In [None]:
#Filter for just Peter Lake and Paul Lake, using a query statement
PeterPaul1 = NTL_phys_data.query("lakename == 'Peter Lake' | lakename == 'Paul Lake'")
PeterPaul1.shape

<details>
    <summary><b><font color="blue">► EXERCISE: Confirm that our filter worked by listing the unique lake names in the <code>PeterPaul1</code> dataframe</font></b><br>Hint: see 4 code cells up...</summary>
    <code>PeterPaul1['lakename'].unique()</code>
</details>

In [None]:
#List the unique values in the 'lakename' field of PeterPaul1


---
<font color='darkgreen'>**_\*A note on writing "tidy" Python code_**<br>
_To span a Python statement across multiple lines, we can either use the `\` character, which tells Python to ignore the line break..._</font>

In [None]:
#Filter for just Peter Lake and Paul Lake by eliminating others
PeterPaul2 = NTL_phys_data.query("lakename != 'Tuesday Lake' & " + \
                                 "lakename != 'East Long Lake' &  " + \
                                 "lakename != 'West Long Lake' &  " + \
                                 "lakename != 'Central Long Lake' &  " + \
                                 "lakename != 'Hummingbird Lake' & " + \
                                 "lakename != 'Crampton Lake' &  " + \
                                 "lakename != 'Ward Lake'")
                                  
PeterPaul2.shape

<font color='darkgreen'>_...Or, we can wrap code within parentheses which allows us to continue writing single commands on multiple lines._</font>

In [None]:
#Filter for just Peter Lake and Paul Lake
PeterPaul3 = (NTL_phys_data.query("lakename != 'Tuesday Lake' & " + 
                                  "lakename != 'East Long Lake' &  " + 
                                  "lakename != 'West Long Lake' &  " + 
                                  "lakename != 'Central Long Lake' &  " + 
                                  "lakename != 'Hummingbird Lake' & " + 
                                  "lakename != 'Crampton Lake' &  " + 
                                  "lakename != 'Ward Lake'")
             )
PeterPaul3.shape

---
More examples...

In [None]:
#Better format with masks, notice the "~" that negates the compound mask
PeterPaul4 = NTL_phys_data[~NTL_phys_data['lakename'].isin(['Tuesday Lake', 
                                                             'East Long Lake',
                                                             'West Long Lake', 
                                                             'Central Long Lake',                                                           
                                                             'Hummingbird Lake',
                                                             'Crampton Lake', 
                                                             'Ward Lake'])]
PeterPaul4.shape

In [None]:
#Using a query and some of the more flexible query string operators
PeterPaul3 = NTL_phys_data.query('lakename in ["Peter Lake","Paul Lake"]')
PeterPaul3.shape

##### Querying a range from continuous values

In [None]:
#Three ways to select days 151 to 305; note how each treats the endpoints
JuneOct_exclusive = NTL_phys_data.query('daynum > 151 and daynum < 305')    #Excludes 151 and 305
JuneOct_inclusive = NTL_phys_data.query('daynum >= 151 and daynum <= 305')  #Includes 151 and 305
JuneOct_range = NTL_phys_data[NTL_phys_data['daynum'].isin(range(151,305))] #Includes 151, excludes 305
print(JuneOct_exclusive.shape, JuneOct_inclusive.shape, JuneOct_range.shape)
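
To make the endpoint behavior concrete, here is a small sketch on made-up `daynum` values that sit on and around the boundaries (the `toy` data is hypothetical). It also shows `Series.between()`, which is not used in the lesson but includes both endpoints by default.

```python
import pandas as pd

# Hypothetical day numbers straddling the 151 and 305 boundaries
toy = pd.DataFrame({"daynum": [150, 151, 200, 304, 305, 306]})

exclusive = toy.query("daynum > 151 and daynum < 305")    # keeps 200, 304
inclusive = toy.query("daynum >= 151 and daynum <= 305")  # keeps 151, 200, 304, 305
half_open = toy[toy["daynum"].isin(range(151, 305))]      # keeps 151, 200, 304
between = toy[toy["daynum"].between(151, 305)]            # same rows as inclusive

print(len(exclusive), len(inclusive), len(half_open), len(between))  # 2 4 3 4
```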

<details>
    <summary><b><font color="blue">► EXERCISE: Filter the <code>NTL_phys_data</code> for the year 1999</font></b></summary>
    <code>NTL_1999 = NTL_phys_data.query('year4 == 1999')</code> Using "query", OR<br>
    <code>NTL_1999 = NTL_phys_data[NTL_phys_data['year4'] == 1999]</code> Using a binary mask...
</details>

In [None]:
NTL_1999 = 
NTL_1999.shape

<details>
    <summary><b><font color="blue">► EXERCISE: Filter the <code>NTL_phys_data</code> for Tuesday Lake for the years 1990 thru 1999</font></b></summary>
    <code>NTL_90to99 = NTL_phys_data.query('lakename == "Tuesday Lake" and year4 >= 1990 and year4 <= 1999')</code>
</details>

In [None]:
NTL_90to99 = 
NTL_90to99.shape

### Another alternative: using Pandas `iloc[]` and `loc[]` indexers
While definitely not as user friendly as dplyr's `filter` (and `select`) verbs, Pandas' indexing capabilities are quite powerful, particularly when you get into multiple and hierarchical indices (which we won't discuss here). Pandas has two ways of referring to rows and columns, by integer position and by label, and an indexer for each...

#### Using `iloc[]` to filter records
`iloc[]` is used to extract records by their `i`nteger `loc`ations, i.e. the row number. *[Recall that in Pandas, index values begin at zero, not one!]*

In [None]:
#Show the 20th row of data; a single row is returned as a series...
NTL_phys_data.iloc[19]

In [None]:
#Show rows 3, 1, and 5 - in that order; multiple rows return a dataframe
NTL_phys_data.iloc[[2,0,4]]

In [None]:
#Show rows 10 thru (and including) 14
NTL_phys_data.iloc[9:14]

<details>
    <summary><b><font color="blue">► EXERCISE: Display the 101st thru (and including) the 105th rows in the <code>NTL_phys_data</code> dataframe.</font></b></summary>
    <code>NTL_phys_data.iloc[100:105]</code>
</details>

In [None]:
NTL_phys_data.iloc[100:105]

#### Using `loc[]` to filter records
`loc` is used to extract records by an index *label* that we assign. We've not explicitly set a row index in our `NTL_phys_data` dataframe, so Pandas has assigned sequential values, which we can see via the `index` property:

In [None]:
#Display the index of our dataframe
NTL_phys_data.index

As such, each row's index label is the same as its row number, so `iloc` and `loc` behave *almost* exactly the same. The key difference is that, when we specify a slice of data, `loc` includes the end value while `iloc` does not.

In [None]:
#Select data from the row with the index = 19
NTL_phys_data.loc[19]

In [None]:
#Return rows 10 through (and including) 14
NTL_phys_data.loc[9:13] #<- Note that loc returns the last value

In [None]:
NTL_phys_data.iloc[9:14] #<- iloc returns up to, but not including, the upper index in a slice
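
The difference is easier to see when the index labels do not match the row positions. Below is a minimal sketch on made-up data (the `toy` dataframe and its labels 10 through 50 are hypothetical).

```python
import pandas as pd

# Hypothetical data whose index labels differ from the row positions
toy = pd.DataFrame({"depth": [0.0, 0.5, 1.0, 1.5, 2.0]},
                   index=[10, 20, 30, 40, 50])

# loc slices by label and includes the end label
by_label = toy.loc[20:40]

# iloc slices by position and stops before the end position
by_position = toy.iloc[1:3]

print(by_label.index.tolist())     # [20, 30, 40]
print(by_position.index.tolist())  # [20, 30]
```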

#### Changing our index and using `loc`
Just a quick glimpse of how we might change our default index and use it with `loc`...

In [None]:
#Set the index to values in the sampledate column
NTL_phys_data_idx = NTL_phys_data.set_index('sampledate')

In [None]:
#Select rows for a given index
NTL_phys_data_idx.loc['8/17/16']

---
### Sorting (_Arrange_)
In Pandas, we use the `sort_values()` method to sort ("arrange") our records.

In [None]:
NTL_phys_data_depth_ascending = NTL_phys_data.sort_values("depth")
NTL_phys_data_depth_ascending.head()

In [None]:
NTL_phys_data_depth_ascending = NTL_phys_data.sort_values("depth",ascending=False)
NTL_phys_data_depth_ascending.head()
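
`sort_values()` also accepts a list of columns, with a matching list of sort directions. A minimal sketch on made-up data (the `toy` dataframe is hypothetical): sort by lake name A to Z, then by depth from deepest to shallowest within each lake.

```python
import pandas as pd

# Hypothetical records for two lakes
toy = pd.DataFrame({
    "lakename": ["Peter Lake", "Paul Lake", "Peter Lake", "Paul Lake"],
    "depth": [1.0, 0.0, 3.0, 2.0],
})

# One ascending flag per sort column: lakename ascending, depth descending
sorted_toy = toy.sort_values(["lakename", "depth"], ascending=[True, False])

print(sorted_toy["lakename"].tolist())  # ['Paul Lake', 'Paul Lake', 'Peter Lake', 'Peter Lake']
print(sorted_toy["depth"].tolist())     # [2.0, 0.0, 3.0, 1.0]
```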

<details>
    <summary><b><font color="blue">► EXERCISE: Sort the <code>NTL_phys_data</code> dataframe by <code>temperature_C</code>, in descending order. 
Which dates, lakes, and depths have the highest temperatures?</font></b></summary>
    <code>NTL_phys_data_temps_desc = NTL_phys_data.sort_values('temperature_C',ascending=False)</code>
</details>

In [None]:
#Sort the NTL_phys_data dataframe by temperature, in descending order
NTL_phys_data_temps_desc = 
NTL_phys_data_temps_desc.head()

### Selecting Columns (_Select_)
<font color='red'>Pandas doesn't have a good analog to dplyr's `select` verb, but we have a few workarounds...</font>

#### Select by listing column names
First, we can select specific columns easily enough by <u>supplying a list of the column names</u> we want in the output:

In [None]:
#Extract a subset of columns by naming them
NTL_phys_data_temps = NTL_phys_data[['lakename','sampledate','depth','temperature_C']]
NTL_phys_data_temps.head()
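
Two other methods can stand in for parts of `select`, though neither is covered in this lesson: `DataFrame.filter()`, which (despite its name) picks columns by name pattern, and `DataFrame.drop()`, which removes named columns. A minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical data with two temperature columns
toy = pd.DataFrame({
    "lakename": ["Paul Lake", "Peter Lake"],
    "depth": [0.0, 1.0],
    "temperature_C": [20.0, 18.5],
    "temperature_F": [68.0, 65.3],
})

# Keep only the columns whose names contain "temperature"
temps = toy.filter(like="temperature")

# Keep every column except the ones listed
no_f = toy.drop(columns=["temperature_F"])

print(temps.columns.tolist())  # ['temperature_C', 'temperature_F']
print(no_f.columns.tolist())   # ['lakename', 'depth', 'temperature_C']
```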

---
<details>
    <summary><b><font color="blue">► EXERCISE: Subset the <code>NTL_phys_data</code> dataframe for the <code>depth</code> and <code>lakename</code> columns, in that order. In what order do the columns appear in the result?</font></b></summary>
    <code>NTL_ColSelect = NTL_phys_data[['depth','lakename']]</code>
</details>

In [None]:
NTL_ColSelect = NTL_phys_data[['depth','lakename']]
NTL_ColSelect.head()

#### Selecting columns using `.iloc[]`
As we saw above, `.iloc` can subset rows by row number. We can also use it to select columns by their column number. We just need to specify rows before columns, separated by a comma. To select all rows, we can just insert a colon, indicating a slice that includes everything (`[:,10:20]`).

In [None]:
#Retrieve all records in the 4th column
NTL_phys_data.iloc[:,3]
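
A minimal sketch of the `[rows, columns]` pattern on made-up data (the `toy` dataframe and its column names `a`, `b`, `c` are hypothetical), combining slices, lists, and negative positions:

```python
import pandas as pd

# Hypothetical three-column dataframe
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
    "c": [100, 200, 300, 400],
})

col_b = toy.iloc[:, 1]            # all rows, 2nd column (returned as a Series)
first_two = toy.iloc[:2, [0, 2]]  # first two rows, 1st and 3rd columns
last_two = toy.iloc[-2:, 1:]      # last two rows, 2nd column onward

print(col_b.tolist())               # [10, 20, 30, 40]
print(first_two.columns.tolist())   # ['a', 'c']
print(last_two.values.tolist())     # [[30, 300], [40, 400]]
```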

<details>
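    <summary><b><font color="blue">► EXERCISE: How would you retrieve the first 3 records in the 4th column? The last 3 records?</font></b></summary>
    <code>NTL_phys_data.iloc[:3,3]</code> First 3 records in the 4th column<br>
    <code>NTL_phys_data.iloc[-3:,3]</code> Last 3 records in the 4th column<br>
    <code>NTL_phys_data.iloc[:10,3]</code> First 10 records in the 4th column<br>
    <code>NTL_phys_data.iloc[:,[1,3,4,5]]</code> Columns 2, 4, 5, and 6 (all rows)<br>
    <code>NTL_phys_data.iloc[:8,3:6]</code> Columns 4 thru 6 (first 8 rows)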
</details>

In [None]:
#Retrieve the first 10 records in the 4th column


In [None]:
#Retrieve the columns 2, 4, 5, and 6 of data (all rows)


In [None]:
#Retrieve the columns 4 thru 6 of data (first 8 rows)


---
<details>
    <summary><b><font color="blue">► EXERCISE: Use <code>.iloc[]</code> to subset the <code>NTL_phys_data</code> dataframe for the 5th column (<code>depth</code>) and 2nd column (<code>lakename</code>), in that order. In what order do the columns appear in the result?</font></b></summary>
    <code>NTL_phys_data_temps1 = NTL_phys_data.iloc[:,[4,1]]</code>
</details>

In [None]:
#Extract a subset of columns via their integer indices
NTL_phys_data_temps1 = 
NTL_phys_data_temps1.head()

#### Selecting columns using `.loc[]`
As with rows, `.loc` allows us to use the column names to select specific columns. It also lets us return slices of columns. _(However, we still can't pull columns using a mix of single column names and slices, as dplyr's `select` verb can.)_

In [None]:
#Extract columns using a list of column names (rows with an index of 0 thru 5)
NTL_phys_data.loc[:5,['lakename','sampledate','depth','temperature_C']]

In [None]:
#Extract columns by slice of column names (rows with an index of 100 thru 105)
NTL_phys_data.loc[100:105,'sampledate':'temperature_C']

---
<details>
    <summary><b><font color="blue">► EXERCISE: Use <code>.loc[]</code> to subset the <code>NTL_phys_data</code> dataframe for the <code>depth</code> and <code>lakename</code> columns, in that order. In what order do the columns appear in the result?</font></b></summary>
    <code>NTL_phys_data_temps1 = NTL_phys_data.loc[:,['depth','lakename']]</code>
</details>

In [None]:
#Extract a subset of columns by name
NTL_phys_data_temps1 = 
NTL_phys_data_temps1.head()

 → _See this [link](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) for a more complete tutorial on selecting rows and columns in Pandas._  

---
### Adding/updating column values (_Mutate_)
In Pandas, we can create a new column by applying simple functions to existing columns.

In [None]:
#Compute values into a new column
NTL_phys_data["temperature_F"] = NTL_phys_data["temperature_C"] * 9/5 + 32
NTL_phys_data.head()
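
`DataFrame.assign()` is another option, and arguably the closer match to `mutate`: it returns a new dataframe with the added column rather than modifying the original in place. A minimal sketch on made-up temperatures (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical Celsius readings
toy = pd.DataFrame({"temperature_C": [0.0, 20.0, 100.0]})

# assign() returns a copy with the new column; toy itself is unchanged
with_f = toy.assign(temperature_F=toy["temperature_C"] * 9 / 5 + 32)

print(with_f["temperature_F"].tolist())  # [32.0, 68.0, 212.0]
print("temperature_F" in toy.columns)    # False
```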

Alternatively, Pandas' `apply` function allows us more flexibility. We can define a function, then apply it to our data.

In [None]:
#Create a function that computes season from the day of the year (daynum)
def getSeasonFromDay(day):
    if day <= 81: return "winter"
    elif day <= 173: return "spring"
    elif day <= 265: return "summer"
    elif day <= 356: return "fall"
    else: return "winter"

In [None]:
#Apply that function to create a new column in our dataset
NTL_phys_data["season"] = NTL_phys_data["daynum"].apply(getSeasonFromDay)
#Display 10 random records
NTL_phys_data.sample(10)
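
`apply` calls the Python function once for every value, which can be slow on large dataframes. As an illustration only (this is a swapped-in technique, not part of the lesson), the same season logic can be written in vectorized form with NumPy's `select`, which assigns the choice for the first condition that is True. The thresholds copy those in `getSeasonFromDay`; the `toy` day numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical day numbers, including the boundary values
toy = pd.DataFrame({"daynum": [1, 81, 82, 173, 200, 265, 300, 356, 357]})

days = toy["daynum"]

# Conditions are checked in order; the first True one wins
conditions = [days <= 81, days <= 173, days <= 265, days <= 356]
choices = ["winter", "spring", "summer", "fall"]

# Days after 356 fall through to the default
toy["season"] = np.select(conditions, choices, default="winter")
print(toy)
```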

### Pipes
dplyr's pipes (`%>%`) don't translate directly to Python or Pandas. Instead of chaining commands with pipes, we either write our commands on sequential lines or chain Pandas methods together, wrapping the chain in parentheses so it can span multiple lines.

The objective remains to make the code readable. 

In [None]:
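#Filter for Peter Lake and Paul Lake, then select columns
NTL_processed = (NTL_phys_data
                 .query('lakename == "Peter Lake" | lakename == "Paul Lake"')
                 .loc[:,['lakename','sampledate','depth','temperature_C']]
                )
#Add a Fahrenheit column computed from the processed data's own Celsius column
NTL_processed['temperature_F'] = NTL_processed['temperature_C'] * 9/5 + 32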
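
The whole sequence can also be written as a single method chain, with `assign()` standing in for `mutate` and `sort_values()` for `arrange`. This is a sketch on made-up data (the `toy` dataframe is hypothetical), not the lesson's own pipeline; the `lambda` lets `assign()` refer to the dataframe as it exists at that point in the chain.

```python
import pandas as pd

# Hypothetical records for three lakes
toy = pd.DataFrame({
    "lakename": ["Paul Lake", "Peter Lake", "Tuesday Lake", "Paul Lake"],
    "depth": [0.0, 1.0, 0.0, 2.0],
    "temperature_C": [20.0, 15.0, 22.0, 10.0],
})

processed = (
    toy
    .query('lakename in ["Peter Lake", "Paul Lake"]')    # filter rows
    .loc[:, ["lakename", "depth", "temperature_C"]]      # select columns
    .assign(temperature_F=lambda df: df["temperature_C"] * 9 / 5 + 32)  # add column
    .sort_values("depth")                                # arrange rows
)

print(processed)
```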

### Saving processed datasets
Pandas has a `to_csv()` method that is similar to R's `write.csv()`.

In [None]:
NTL_processed.to_csv('../Data/Processed/NTL-LTER_Lake_ChemistryPhysics_PeterPaul_Processed2.csv',
                     index=False, #no row names
                    )
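
To check what `index=False` does, we can write a small dataframe and read it back. The sketch below uses made-up data and a temporary folder as a stand-in for `../Data/Processed` (the `toy` dataframe and the file name `toy_processed.csv` are hypothetical).

```python
import os
import tempfile

import pandas as pd

# Hypothetical processed data
toy = pd.DataFrame({
    "lakename": ["Paul Lake", "Peter Lake"],
    "temperature_C": [20.0, 18.5],
})

# Write to a temporary folder, then read the file back in
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_processed.csv")
    toy.to_csv(out_path, index=False)  # no row labels written to the file
    reloaded = pd.read_csv(out_path)

# With index=False, no extra unnamed index column appears on re-import
print(reloaded.columns.tolist())  # ['lakename', 'temperature_C']
```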