# CME538 - Introduction to Data Science
## Lecture 2.3 - Pandas III
### New Concepts
* Lambda functions.
* Different ways to iterate through a DataFrame and their performance considerations.
* Different methods for combining multiple DataFrames (merge, append, concatenate, join).
* Working with time series data in Pandas.

### Lecture Structure
1. [Lambda functions](#section1)
2. [Iterating over DataFrames](#section2)
3. [Combining DataFrames](#section3)

## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [os](https://docs.python.org/3/library/os.html) - Built in miscellaneous operating system interfaces.
* [time](https://docs.python.org/3/library/time.html) - This module provides various time-related functions.
* [NumPy](https://numpy.org/) - A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy was introduced in Lecture 3 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Lecture 4, 5, and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`.
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout CIV1498 for data visualization. It is customary to `import seaborn as sns`.
* [Matplotlib](https://matplotlib.org//) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout CIV1498 for data visualization. It is customary to `import matplotlib.pyplot as plt`.

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [None]:
# Import standard library and 3rd party libraries
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")

### Install descartes package

In [None]:
!pip install descartes

<a id='section1'></a>
## 1. Lambda Functions
As we learned in Week 1 - Lecture 3, you can write your very own Python functions using the `def` keyword. 

For example:

In [None]:
def raise_to_power(number, power):
    return number ** power

# Raise the number 2 to the power of 5
raise_to_power(number=2, power=5)

However, simpler function definitions such as `raise_to_power` can be converted to a `lambda` function. A `lambda` function is a small anonymous function that can take any number of arguments, but can only have one expression. The benefits of using `lambda` functions are (1) you will write fewer lines of code and (2) you can create functions on the fly without assigning them a name.

See the cell below where we've converted the `raise_to_power()` function to a `lambda` function.

In [None]:
raise_to_power = lambda number, power: number ** power

raise_to_power(number=2, power=5)

The structure of a `lambda` function is:

```python
lambda arguments : expression
```

For example,

```python
lambda argument1, argument2: (argument1 + argument2) / 2
```

We'll see later how lambda functions can be used with the **Pandas** `.apply()` method.
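As a quick preview, here is a minimal sketch of a `lambda` created on the fly, first passed to `.apply()` on a Series and then used as a sort key. The `scores` and `words` data are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical example data (not from the lecture)
scores = pd.Series([1, 2, 3, 4])

# Square every value with a lambda defined on the fly
squared = scores.apply(lambda x: x ** 2)

# A lambda also works anywhere a function is expected, e.g. as a sort key
words = ['pandas', 'numpy', 'os']
by_length = sorted(words, key=lambda word: len(word))

print(squared.tolist())  # [1, 4, 9, 16]
print(by_length)         # ['os', 'numpy', 'pandas']
```

Neither `lambda` is ever given a name. Each exists only for the duration of the call that uses it.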

<a id='section2'></a>
## 2. Iterating over DataFrames
There are multiple ways to iterate through DataFrames, and when those DataFrames become large and the desired computations become complex, these different methods can have major impacts on compute times. In some cases, it could mean the difference between seconds, tens of minutes and even hours.

Let's start by creating a dataset of random latitude and longitude values.
* Latitude : max/min +90 to -90
* Longitude : max/min +180 to -180

In [None]:
num_locations = 100000
locations = pd.DataFrame({'lat': np.random.uniform(-90, 90, num_locations),
                          'lon': np.random.uniform(-180, 180, num_locations),
                          'distance': np.zeros(num_locations)})
locations.head()

We want to calculate the distance between two points on the earth's surface, each given by a latitude and longitude.

We will calculate the distance between Toronto **(lat: 43.651070 lon: -79.347015)** and every point in `locations`.

In [None]:
toronto_lat = 43.651070
toronto_lon = -79.347015

Create a geographic plot of Toronto using **GeoPandas** (we'll get into GeoPandas later on in the course).

In [None]:
# Create a geographic plot of Toronto using GeoPandas
df = pd.DataFrame({'City': ['Toronto'], 'Country': ['Canada'], 
                    'Latitude': [toronto_lat], 'Longitude': [toronto_lon]})

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
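# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0,
# so this line requires an older GeoPandas version.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))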

# We restrict to North America.
ax = world[world.continent == 'North America'].plot(color='white', edgecolor='black')

# We can now plot our ``GeoDataFrame``.
gdf.plot(ax=ax, color='red')

plt.show()

We’ll use the Haversine (or Great Circle) distance formula, which takes the latitude and longitude of two points, adjusts for Earth’s curvature, and calculates the shortest distance between them along the surface.
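Written out, the formula has two steps. The symbols are a notational assumption chosen to mirror the code in the next cell: $\varphi$ is latitude and $\lambda$ is longitude, both in radians, and $R$ is Earth's radius.

$$a = \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)$$

$$d = 2R \arcsin\left(\sqrt{a}\right)$$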

In [None]:
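def haversine(lat1, lon1, lat2, lon2):
    """Return the Haversine (great-circle) distance in miles between two points."""
    MILES = 3959  # Approximate radius of the Earth in miles
    # Convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    # Central angle in radians
    c = 2 * np.arcsin(np.sqrt(a))
    total_miles = MILES * c
    return total_miles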
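Before using the function at scale, it is worth a quick sanity check against cases where the answer is known. The sketch below repeats the formula so that it runs on its own. The two test cases (a point compared with itself, and the equator to the North Pole) are illustrative choices, not part of the lecture.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Same formula as the cell above, repeated so this check runs on its own."""
    MILES = 3959  # Approximate radius of the Earth in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# A point is zero miles from itself
same_point = haversine(43.651070, -79.347015, 43.651070, -79.347015)

# Equator to North Pole along a meridian is a quarter of the circumference
quarter = haversine(0, 0, 90, 0)
expected_quarter = 3959 * np.pi / 2

print(same_point)
print(np.isclose(quarter, expected_quarter))
```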

### Simple `for` loop over `range()`
Just about every Pandas beginner I’ve ever worked with (including myself) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects. However, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient by comparison.

If you're only looping over a few rows, then perhaps this approach will suffice. However, it's a good idea to understand the limitations.

In [None]:
def simple_for_loop_method(locations):

    distance_list = []

    # Loop through rows in locations DataFrame
    for row_index in range(locations.shape[0]):

        # Get lat and lon at row_index
        lat = locations.loc[row_index, 'lat']
        lon = locations.loc[row_index, 'lon']

        # Compute Haversine distance
        distance = haversine(lat1=toronto_lat, 
                             lon1=toronto_lon, 
                             lat2=lat, 
                             lon2=lon)

        # Collect distance
        distance_list.append(distance)

    # Add distance values
    locations['distance'] = distance_list
    
    return locations

Now let's use some Jupyter `%magic` to time the function.

In [None]:
%timeit simple_for_loop_method(locations)

Using a simple for loop and the `range()` function, it took **3.6 seconds** to iterate through **100,000 rows**.
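`%timeit` only exists inside IPython and Jupyter. Outside a notebook, a rough equivalent is the standard-library `timeit` module. The sketch below uses a hypothetical `slow_sum` function as a stand-in for the function being timed.

```python
import timeit

def slow_sum(n):
    """Hypothetical stand-in for the function being timed."""
    total = 0
    for i in range(n):
        total += i
    return total

# Run the call 5 times per measurement and repeat the measurement 3 times
timings = timeit.repeat(lambda: slow_sum(10_000), number=5, repeat=3)

# Best time per call, in seconds
best = min(timings) / 5
print(f"best of 3: {best:.6f} s per call")
```

Taking the minimum rather than the mean is a common choice because slower runs usually reflect interference from other processes, not the code itself.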

In [None]:
locations = simple_for_loop_method(locations)
locations.head()

### Simple for loop using `.iterrows()`
`.iterrows()` is a generator that iterates over the rows of the DataFrame and returns the index of each row, in addition to an object containing the row itself. `.iterrows()` is designed to work with Pandas DataFrames. However, it’s often the least efficient way to run most standard functions.
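To see what the loop receives on each iteration, here is a small sketch on a hypothetical two-row DataFrame. Each item is an `(index, row)` pair where the row is a full Pandas Series, and building that Series for every row is part of the overhead.

```python
import pandas as pd

# Hypothetical two-row DataFrame
df = pd.DataFrame({'lat': [10.0, 20.0], 'lon': [30.0, 40.0]}, index=['a', 'b'])

# Materialize the generator so we can inspect the first item
pairs = list(df.iterrows())
index, row = pairs[0]

print(index)               # a
print(type(row).__name__)  # Series
print(row['lat'])          # 10.0
```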

In [None]:
def iterrows_method(locations):
    
    distance_list = []
    # Loop through rows in locations DataFrame
    for index, row in locations.iterrows():

        # Get lat and lon from the row
        lat = row['lat']
        lon = row['lon']

        # Compute Haversine distance
        distance = haversine(lat1=toronto_lat, 
                             lon1=toronto_lon, 
                             lat2=lat, 
                             lon2=lon)

        # Collect distance
        distance_list.append(distance)

    # Add distance values
    locations['distance'] = distance_list

    return locations

Now let's use some Jupyter `%magic` to time the function.

In [None]:
%timeit iterrows_method(locations)

Using the Pandas `.iterrows()` method, it took **9.45 seconds** to iterate through **100,000 rows**, which is over twice as long as the simple `for` loop over `range()`.

### Simple for loop using `.to_dict()`
The Pandas `.to_dict()` method converts a DataFrame to a dictionary. Below, we specify `orient='records'`, which returns a list of dictionaries where each dictionary corresponds to a row.

In [None]:
def to_dict_for_loop_method(locations):

    distance_list = []
    # Loop through rows in locations DataFrame
    for row in locations.to_dict(orient='records'):

        # Get lat and lon from the row dictionary
        lat = row['lat']
        lon = row['lon']

        # Compute Haversine distance
        distance = haversine(lat1=toronto_lat, 
                             lon1=toronto_lon, 
                             lat2=lat, 
                             lon2=lon)

        # Collect distance
        distance_list.append(distance)

    # Add distance values
    locations['distance'] = distance_list

    return locations

Now let's use some Jupyter `%magic` to time the function.

In [None]:
%timeit to_dict_for_loop_method(locations)

Using the Pandas `.to_dict()` method, it took **2.32 seconds** to iterate through **100,000 rows**, which is roughly four times faster than `.iterrows()`.
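As a small illustration of what this loop iterates over, `to_dict(orient='records')` on a hypothetical two-row DataFrame returns a plain Python list of dictionaries, with no Pandas objects involved:

```python
import pandas as pd

# Hypothetical two-row DataFrame
df = pd.DataFrame({'lat': [10.0, 20.0], 'lon': [30.0, 40.0]})

# One dictionary per row, keyed by column name
records = df.to_dict(orient='records')
print(records)
# [{'lat': 10.0, 'lon': 30.0}, {'lat': 20.0, 'lon': 40.0}]
```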

### Using Pandas `.apply()`
The `.apply()` method applies a function along a specific axis (meaning, either rows or columns) of a DataFrame.
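The effect of the `axis` argument is easiest to see on a tiny example. In this sketch the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
column_totals = df.apply(lambda column: column.sum(), axis=0)

# axis=1: the function receives each row as a Series
row_totals = df.apply(lambda row: row['a'] + row['b'], axis=1)

print(column_totals.to_dict())  # {'a': 6, 'b': 60}
print(row_totals.tolist())      # [11, 22, 33]
```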

In [None]:
def apply_method(locations):

    locations['distance'] = locations.apply(
        lambda row: haversine(lat1=toronto_lat, 
                              lon1=toronto_lon, 
                              lat2=row['lat'], 
                              lon2=row['lon']), 
        axis=1
    )
    
    return locations

Now let's use some Jupyter `%magic` to time the function.

In [None]:
%timeit apply_method(locations)

Using the Pandas `.apply()` method takes roughly the same amount of time as the simple loop, but the code is more compact.

### Vectorization over Pandas `Series`
Vectorization is the process of executing operations on entire arrays rather than by iterating over individual units. Recall that the fundamental units of Pandas, DataFrames and Series, are both based on NumPy arrays. 
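A minimal sketch of the idea, using a made-up Series of angles: the looped version calls `np.deg2rad` once per value, while the vectorized version makes a single call on the whole Series and gets the same numbers back.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
degrees = pd.Series([0.0, 90.0, 180.0])

# Element by element: one Python-level call per value
looped = [np.deg2rad(value) for value in degrees]

# Vectorized: one call on the whole Series
vectorized = np.deg2rad(degrees)

print(np.allclose(looped, vectorized))  # True
```

This is why the `haversine()` function defined earlier can be reused unchanged: every operation inside it is a NumPy function or arithmetic operator that accepts whole arrays.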

In [None]:
def vectorized_series_method(locations):

    locations['distance'] = haversine(lat1=toronto_lat, 
                                      lon1=toronto_lon, 
                                      lat2=locations['lat'], 
                                      lon2=locations['lon'])
    
    return locations

Now let's use some Jupyter `%magic` to time the function.

In [None]:
%timeit vectorized_series_method(locations)

By vectorizing over Pandas Series, we see a **165x** improvement over the `.to_dict()` method.

### Vectorization over NumPy `Array`

In [None]:
def vectorized_array_method(locations):

    locations['distance'] = haversine(lat1=toronto_lat, 
                                      lon1=toronto_lon, 
                                      lat2=locations['lat'].to_numpy(), 
                                      lon2=locations['lon'].to_numpy())
    
    return locations

Now let's use some Jupyter `%magic` to time the function.

In [None]:
%timeit vectorized_array_method(locations)

By vectorizing over NumPy arrays, we see a **252x** improvement over the `.to_dict()` method.
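One plausible reason for the extra speed is that a NumPy array carries no index, so operations on it skip the index handling that a Series requires. The sketch below, on made-up data, only shows what `.to_numpy()` hands back and does not measure anything.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a labelled index
lat = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# The values come back as a bare NumPy array, without the labels
as_array = lat.to_numpy()

print(type(as_array).__name__)  # ndarray
print(as_array.tolist())        # [10.0, 20.0, 30.0]
```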

**Note:** These times are from my computer and only a single execution. They will vary from machine to machine and slightly from run to run.

From this quick exercise, you should know that there are many different ways to iterate through a DataFrame and that the different methods have very different performance considerations.

<a id='section3'></a>
## 3. Combining DataFrames

When conducting exploratory data analysis (EDA), it's common that the data we want to use comes in multiple files and will need to be combined. Let's look at an example of this.

In the **Lecture 6** folder, there are six `.csv` files from **Uber** showing monthly ridership numbers from April 2014 to September 2014.

Let's take a look at these files.

In [None]:
os.listdir()

Given what we've learned already in Lectures 4 and 5, we know how to import these `.csv` files to **Pandas** DataFrames. Let's try that.

In [None]:
april_data = pd.read_csv('uber-raw-data-apr14.csv')
may_data = pd.read_csv('uber-raw-data-may14.csv')
june_data = pd.read_csv('uber-raw-data-jun14.csv')
july_data = pd.read_csv('uber-raw-data-jul14.csv')
aug_data = pd.read_csv('uber-raw-data-aug14.csv')
sept_data = pd.read_csv('uber-raw-data-sep14.csv')

Let's see what the April data looks like.

In [None]:
april_data.head()

In this DataFrame, each row is an Uber trip.

Suppose we're asked to plot the number of trips per hour from April 2014 to September 2014. To tackle this problem, it would be much easier if all the data was in one DataFrame.

In the following section, you'll be introduced to two Pandas functions for combining DataFrames: `.concat()` and `.merge()`. The figure below is helpful for figuring out which one to use.

<br>
<img src="images/merging_dataframes.png" alt="drawing" width="750"/>
<br>

### Concatenate

We use the `.concat()` function to append either columns or rows from one DataFrame to another. This happens to be the functionality we need to handle the Uber data we imported above.

[`pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) has many features, which you're encouraged to explore, but the basic function is demonstrated below.

Following the flow diagram above, if we want to stack multiple DataFrames side-by-side, we set `axis=1`.

In [None]:
# Stack the DataFrames side-by-side
uber_data = pd.concat([april_data, 
                       may_data, 
                       june_data, 
                       july_data, 
                       aug_data, 
                       sept_data], axis=1)

# View combined DataFrame
uber_data.head()

However, this is not what we want to do with the Uber data. We'd like to stack the data from each month, one on top of another, from April to September. To accomplish this, we need to set `axis=0`. Note that the order of the months in the DataFrame (top to bottom) follows the order of months in the list passed to `.concat()`, left to right.
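The difference between the two axes is easier to see on tiny DataFrames than on the full Uber data. In this sketch, `first` and `second` are made-up stand-ins for two monthly files.

```python
import pandas as pd

# Hypothetical stand-ins for two monthly DataFrames
first = pd.DataFrame({'trips': [1, 2]})
second = pd.DataFrame({'trips': [3, 4]})

# axis=1 places the DataFrames next to each other (more columns)
side_by_side = pd.concat([first, second], axis=1)

# axis=0 places them on top of each other (more rows)
stacked = pd.concat([first, second], axis=0)

print(side_by_side.shape)  # (2, 2)
print(stacked.shape)       # (4, 1)
```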

In [None]:
# Stack the DataFrames on top of each other
uber_data = pd.concat([april_data, 
                       may_data, 
                       june_data, 
                       july_data, 
                       aug_data, 
                       sept_data], axis=0)

# View combined DataFrame
uber_data.head()

We can see April data at the top of the DataFrame

In [None]:
uber_data.tail()

and September data at the bottom.

Next, let's plot the index of our new DataFrame `uber_data` and inspect.

In [None]:
plt.plot(uber_data.index)
plt.show()

We can clearly see from the plot that when concatenating the DataFrames, the original indexes have been preserved, meaning that we have duplicates, which will be an issue moving forward. 

To adjust the row index automatically, we have to set the argument `ignore_index` as `True` while calling the `.concat()` function.
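On two made-up two-row DataFrames, the duplicated index and the effect of `ignore_index=True` look like this:

```python
import pandas as pd

# Hypothetical stand-ins for two monthly DataFrames
first = pd.DataFrame({'trips': [1, 2]})
second = pd.DataFrame({'trips': [3, 4]})

# Default behaviour keeps each DataFrame's original index
kept = pd.concat([first, second], axis=0)

# ignore_index=True builds a fresh index running from 0 to n - 1
reset = pd.concat([first, second], axis=0, ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(reset.index.tolist())  # [0, 1, 2, 3]
```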

In [None]:
uber_data = pd.concat([april_data, 
                       may_data, 
                       june_data, 
                       july_data, 
                       aug_data, 
                       sept_data], 
                      axis=0,
                      ignore_index=True)

plt.plot(uber_data.index)
plt.show()

Now each row has a unique index!

The `.concat()` function has many more features that you should definitely [check out](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).
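When there are many files, the separate `pd.read_csv()` calls can be collapsed into a loop that feeds `pd.concat()` directly. The sketch below writes three tiny CSV files to a temporary folder so that it runs anywhere. The `demo-...` file names and the `month` column are hypothetical and are not the Uber files.

```python
import tempfile
from pathlib import Path

import pandas as pd

months = ['apr', 'may', 'jun']

# Write three tiny hypothetical CSV files so the example is self-contained
folder = Path(tempfile.mkdtemp())
for month in months:
    demo = pd.DataFrame({'month': [month, month]})
    demo.to_csv(folder / f'demo-{month}14.csv', index=False)

# Read the files in a fixed order and stack them in one step
frames = [pd.read_csv(folder / f'demo-{month}14.csv') for month in months]
combined = pd.concat(frames, axis=0, ignore_index=True)

print(combined.shape)  # (6, 1)
print(combined['month'].tolist())
```

Listing the months explicitly keeps the row order under your control, which matters because `.concat()` stacks the DataFrames in the order they appear in the list.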


### Merge
`.merge()` is another **Pandas** function for combining DataFrames. Generally speaking, the difference between `.merge()` and `.concat()` is that `.merge()` is used to combine two DataFrames on the basis of values in common columns, while `.concat()` is used to append one (or more) DataFrames one below the other or one next to the other.

<br>
<img src="images/joins.jpg" alt="drawing" width="550"/>
<br>

`.merge()` allows you to execute the following join operations: inner join, full outer join, left outer join, right outer join (see Figure above).
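As a compact preview of the four joins, the sketch below merges two made-up DataFrames that share keys 2 and 3 and records which keys survive each value of `how`. The names `left`, `right`, and `key` are illustrative only.

```python
import pandas as pd

# Hypothetical DataFrames: keys 2 and 3 appear in both
left = pd.DataFrame({'key': [1, 2, 3], 'left_value': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_value': ['x', 'y', 'z']})

# Collect the keys that remain after each type of join
keys_by_join = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = left.merge(right=right, how=how, on='key')
    keys_by_join[how] = sorted(merged['key'].tolist())
    print(how, keys_by_join[how])
# inner [2, 3]
# outer [1, 2, 3, 4]
# left [1, 2, 3]
# right [2, 3, 4]
```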

Before learning more about `.merge()`, let's create some test data. First, let's create a DataFrame with the names of participants in a test and their participant id number.

In [None]:
df1 = pd.DataFrame({'participant_id': ['1', '6', '33', '42', '65', '8', '20', '13', '14'],
                    'first_name': ['Shoshanna', 'Marianne', 'Karl', 'Brent', 'John', 'Marcus', 
                                   'Bruce', 'Judi', 'Denzel'], 
                    'last_name': ['Saxe', 'Touchie', 'Peterson', 'Sleep', 'Harrison', 'Aurelius', 
                                  'Wayne', 'Dench', 'Washington']})
df1

Next, let's create another DataFrame showing the participant's test scores.

In [None]:
df2 = pd.DataFrame({'participant_id': ['22', '98', '71', '33', '42', '65', '8', '20', '13', 
                                       '14', '34', '54'],
                    'score': [80, 76, 72, 66, 77, 64, 59, 60, 62, 89, 67, 58]})
df2

First, use the `.merge()` method to work through the four paths outlined in the figure below. In this case, we're using `.merge()` because the contents of the DataFrames are required for combining them. The common column **participant_id** will be used to merge **df1** and **df2**.
<br>
<img src="images/merging_dataframes.png" alt="drawing" width="750"/>
<br>

#### Path 1: 
* I want to keep the full content of both DataFrames. 
* Solution: Outer Join

The merge operation takes the form of `left_df.merge(right=right_df)`

In [None]:
df_outer_join = df1.merge(right=df2, 
                          how='outer', 
                          on='participant_id')
df_outer_join

#### Path 2: 
* I want to keep the full content of the left DataFrame and merge any matching data from the right DataFrame. 
* Solution: Left Outer Join

Practically, this means that we want a DataFrame with all of the participants from **df1** and we want to merge their scores from **df2**. We do not want any participants from **df2** that are not in **df1**.

In [None]:
df_left_outer_join = df1.merge(right=df2, 
                               how='left', 
                               on='participant_id')
df_left_outer_join

#### Path 3: 
* I want to keep the full content of the right DataFrame and merge any matching data from the left DataFrame. 
* Solution: Right Outer Join

Practically, this means that we want a DataFrame with all of the scores from **df2** and we want to merge their names from **df1**. We do not want any participants from **df1** that are not in **df2**.

In [None]:
df_right_outer_join = df1.merge(right=df2, 
                                how='right', 
                                on='participant_id')
df_right_outer_join

#### Path 4: 
* I want to keep only the contents of the right and left DataFrames where they overlap.
* Solution: Inner Join

Practically, this means that we want a DataFrame with all of the scores from **df2** and names from **df1** where they have **participant_id** in common.

In [None]:
df_inner_join = df1.merge(right=df2, 
                          how='inner', 
                          on='participant_id')
df_inner_join
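When it is unclear which rows came from which DataFrame, `.merge()` accepts an `indicator=True` argument that adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`. The sketch below uses small made-up DataFrames, `names` and `scores`, in place of **df1** and **df2**.

```python
import pandas as pd

# Hypothetical miniature versions of the participant tables
names = pd.DataFrame({'participant_id': ['1', '33'],
                      'first_name': ['Shoshanna', 'Karl']})
scores = pd.DataFrame({'participant_id': ['33', '22'],
                       'score': [66, 80]})

# Outer join with a column recording where each row came from
checked = names.merge(right=scores, how='outer',
                      on='participant_id', indicator=True)

# Map each participant_id to its source label
source = checked.set_index('participant_id')['_merge'].astype(str).to_dict()
print(source)
```

Here participant `1` appears only in `names`, `22` only in `scores`, and `33` in both.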

Check out Quercus for more LinkedIn Learning Boosters and Pandas documentation.