# <span style="color:blue">Exploring Pandas</span>
Now that we're familiar with Pandas Series and DataFrames, let's delve further into Pandas' more complex capabilities: **grouping**, **sorting**, **merging**, and **binning**. These build on the concepts we've covered so far.

---
## <span style="color:blue">Grouping</span>
Pandas can group a dataframe by the unique values in one or more columns, which makes it possible to build informative summaries out of large datasets. For example, if you were analyzing raw census data, grouping would let you summarize the data by *gender* or *age* group, or by *city* and *state*.

In [1]:
import pandas as pd

In [2]:
# Create a dataframe to demonstrate grouping

raw_dict = {"Vehicle Type": ["Car", "Car", "SUV", "Car", "Truck", "Truck", "SUV", "Truck", "Car", "SUV", "SUV", "Truck"],
            "Manufacturer": ["Ford", "GM", "Ford", "Chevy", "Ford", "Chevy", "GM", "GM", "Ford", "Chevy", "GM", "Chevy"],
            "Owner": ["Bob", "Andrew", "Sally", "Amanda", "Bill", "Mike", "Lindsey", "Kristen", "Matt", "Anna", "Jon", "Erin"],
            "Horsepower": [265, 190, 240, 350, 365, 400, 275, 300, 185, 280, 310, 240],
            "Torque (lb-ft)": [270, 215, 203, 350, 415, 275, 290, 190, 280, 305, 250, 290],
            "Fuel Economy (mpg)": [25, 27, 22, 31, 19, 17, 23, 18, 27, 21, 19, 18]}
vehicles_df = pd.DataFrame(raw_dict)
vehicles_df

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
0,25,265,Ford,Bob,270,Car
1,27,190,GM,Andrew,215,Car
2,22,240,Ford,Sally,203,SUV
3,31,350,Chevy,Amanda,350,Car
4,19,365,Ford,Bill,415,Truck
5,17,400,Chevy,Mike,275,Truck
6,23,275,GM,Lindsey,290,SUV
7,18,300,GM,Kristen,190,Truck
8,27,185,Ford,Matt,280,Car
9,21,280,Chevy,Anna,305,SUV


To start grouping a dataframe, run the following:
~~~~{.python}
grouped_df = original_df.groupby('column1')
~~~~

<code class="python">.groupby()</code> returns a "GroupBy" object. In the above line of code, <code class="python">grouped_df</code> holds that "GroupBy" object. The column name you want to group by is passed into <code class="python">.groupby()</code>; here, we are grouping <code class="python">original_df</code> by the unique values in the column named <code class="python">'column1'</code>.

This "GroupBy" object is not very useful on its own. However, running data functions on it (e.g. <code class="python">.sum()</code> or <code class="python">.mean()</code>) lets you extract summary statistics for each group.

In [3]:
# Group vehicles_df by the "Vehicle Type" column

grouped_vehicles_df = vehicles_df.groupby("Vehicle Type")

# Note that the "GroupBy" object by itself is not particularly useful

print(grouped_vehicles_df)

<pandas.core.groupby.DataFrameGroupBy object at 0x109cdbc50>


Our `vehicles_df` dataframe contains 12 rows of data. Each row represents a specific vehicle, and there are 6 columns, or variables, that describe each one (vehicle type, manufacturer, owner, horsepower, torque, and fuel economy). While this data is useful as is, grouping it lets us extract more insights from it.

So, we've grouped our `vehicles_df` dataframe by `"Vehicle Type"` and assigned the resulting "GroupBy" object to `grouped_vehicles_df`. While printing this object offers no useful information, we can run data functions on it to obtain summary statistics for each group (i.e. Car, SUV, and Truck).
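
If you'd like to peek inside a "GroupBy" object before summarizing it, one option is to loop over it. The snippet below is a minimal sketch that rebuilds only two columns of the example data; the variable names `grouped` and `group_sizes` are illustrative and not part of the notebook above.

```python
import pandas as pd

# A reduced copy of the tutorial's data (only the columns needed here)
vehicles_df = pd.DataFrame({
    "Vehicle Type": ["Car", "Car", "SUV", "Car", "Truck", "Truck",
                     "SUV", "Truck", "Car", "SUV", "SUV", "Truck"],
    "Owner": ["Bob", "Andrew", "Sally", "Amanda", "Bill", "Mike",
              "Lindsey", "Kristen", "Matt", "Anna", "Jon", "Erin"],
})

grouped = vehicles_df.groupby("Vehicle Type")

# Iterating over a GroupBy object yields (group label, sub-dataframe) pairs
for vehicle_type, group in grouped:
    print(vehicle_type, list(group["Owner"]))

# .size() reports the number of rows in each group
group_sizes = grouped.size()
print(group_sizes)
```

Each sub-dataframe is an ordinary dataframe holding only the rows for that group, which is what the data functions operate on.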

First, let's obtain a count of all values per vehicle type, across all columns:

In [4]:
grouped_vehicles_df.count()

Vehicle Type,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft)
Car,4,4,4,4,4
SUV,4,4,4,4,4
Truck,4,4,4,4,4


In the above line, we simply ran <code class="python">.count()</code> on our entire "GroupBy" object.

We can also run these data functions on one column, or a select few columns. Now, let's find out the average fuel economy per group:

In [5]:
grouped_vehicles_df["Fuel Economy (mpg)"].mean()

Vehicle Type
Car      27.50
SUV      21.25
Truck    18.00
Name: Fuel Economy (mpg), dtype: float64

How about the maximum horsepower and torque per group?

In [6]:
grouped_vehicles_df[["Horsepower", "Torque (lb-ft)"]].max()

Vehicle Type,Horsepower,Torque (lb-ft)
Car,350,350
SUV,310,305
Truck,400,415


Note: running a data function on a "GroupBy" object returns a DataFrame (if the result is 2-D) or a Series (if it is 1-D). Thus, you can create new dataframes from the results of these data functions.

In [7]:
min_power_tq = grouped_vehicles_df[["Horsepower", "Torque (lb-ft)"]].min()
min_power_tq

Vehicle Type,Horsepower,Torque (lb-ft)
Car,185,215
SUV,240,203
Truck,240,190


In [8]:
# Proving that the result of running data functions on "GroupBy" objects is a dataframe

type(min_power_tq)

pandas.core.frame.DataFrame
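
If you need several summary statistics at once, `.agg()` accepts a list of function names and returns one column per statistic. This is a minimal sketch on a reduced copy of the example data; `hp_summary` is an illustrative name.

```python
import pandas as pd

# A reduced copy of the tutorial's data (only the columns needed here)
vehicles_df = pd.DataFrame({
    "Vehicle Type": ["Car", "Car", "SUV", "Car", "Truck", "Truck",
                     "SUV", "Truck", "Car", "SUV", "SUV", "Truck"],
    "Horsepower": [265, 190, 240, 350, 365, 400, 275, 300, 185, 280, 310, 240],
})

# .agg() applies each listed function to every group
hp_summary = vehicles_df.groupby("Vehicle Type")["Horsepower"].agg(["min", "max", "mean"])
print(hp_summary)
```

The result is a regular dataframe indexed by vehicle type, with `min`, `max`, and `mean` as its columns.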

One last thing about grouping: **you can group by multiple columns**.

This is useful if you're dealing with demographic data and want to group by city AND state. You can do this by passing in a list that contains the columns you want to group by.

In [9]:
# Group vehicles_df by both Manufacturer AND Vehicle Type

grouped_vehicles_df_2 = vehicles_df.groupby(["Manufacturer", "Vehicle Type"])

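# Get averages
# numeric_only=True skips the text "Owner" column, which recent Pandas versions refuse to average

grouped_vehicles_df_2.mean(numeric_only=True)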

Manufacturer,Vehicle Type,Fuel Economy (mpg),Horsepower,Torque (lb-ft)
Chevy,Car,31.0,350.0,350.0
Chevy,SUV,21.0,280.0,305.0
Chevy,Truck,17.5,320.0,282.5
Ford,Car,26.0,225.0,275.0
Ford,SUV,22.0,240.0,203.0
Ford,Truck,19.0,365.0,415.0
GM,Car,27.0,190.0,215.0
GM,SUV,21.0,292.5,270.0
GM,Truck,18.0,300.0,190.0
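
Grouping by several columns produces a result whose row labels are a combination of those columns. If you'd rather have them back as ordinary columns, one option is to call `.reset_index()` on the result. The sketch below uses a reduced copy of the example data, and `group_means` is an illustrative name.

```python
import pandas as pd

# A reduced copy of the tutorial's data (only the columns needed here)
vehicles_df = pd.DataFrame({
    "Vehicle Type": ["Car", "Car", "SUV", "Car", "Truck", "Truck",
                     "SUV", "Truck", "Car", "SUV", "SUV", "Truck"],
    "Manufacturer": ["Ford", "GM", "Ford", "Chevy", "Ford", "Chevy",
                     "GM", "GM", "Ford", "Chevy", "GM", "Chevy"],
    "Horsepower": [265, 190, 240, 350, 365, 400, 275, 300, 185, 280, 310, 240],
})

# Average horsepower per (Manufacturer, Vehicle Type) pair,
# with the grouping columns turned back into regular columns
group_means = (
    vehicles_df.groupby(["Manufacturer", "Vehicle Type"])["Horsepower"]
    .mean()
    .reset_index()
)
print(group_means)
```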


---
## <span style="color:blue">Sorting</span>
Pandas makes it possible to sort by the values in different columns in any dataframe. The main function for this is **<code class="python">.sort_values()</code>**.

Let's sort <code class="python">vehicles_df</code>, from our previous example, by fuel economy:

In [10]:
# First, let's see what the unaltered vehicles_df looks like

vehicles_df

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
0,25,265,Ford,Bob,270,Car
1,27,190,GM,Andrew,215,Car
2,22,240,Ford,Sally,203,SUV
3,31,350,Chevy,Amanda,350,Car
4,19,365,Ford,Bill,415,Truck
5,17,400,Chevy,Mike,275,Truck
6,23,275,GM,Lindsey,290,SUV
7,18,300,GM,Kristen,190,Truck
8,27,185,Ford,Matt,280,Car
9,21,280,Chevy,Anna,305,SUV


In [11]:
# Now, let's sort this dataframe by the "Fuel Economy (mpg)" column

vehicles_df.sort_values("Fuel Economy (mpg)")

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
5,17,400,Chevy,Mike,275,Truck
7,18,300,GM,Kristen,190,Truck
11,18,240,Chevy,Erin,290,Truck
4,19,365,Ford,Bill,415,Truck
10,19,310,GM,Jon,250,SUV
9,21,280,Chevy,Anna,305,SUV
2,22,240,Ford,Sally,203,SUV
6,23,275,GM,Lindsey,290,SUV
0,25,265,Ford,Bob,270,Car
1,27,190,GM,Andrew,215,Car


Note that <code class="python">.sort_values()</code> sorts ascending (from smallest to largest) by default. We can also sort descending by passing <code class="python">ascending=False</code> to <code class="python">.sort_values()</code>.

In [12]:
# Sort DESCENDING by the "Fuel Economy (mpg)" column

vehicles_df.sort_values("Fuel Economy (mpg)", ascending=False)

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
3,31,350,Chevy,Amanda,350,Car
1,27,190,GM,Andrew,215,Car
8,27,185,Ford,Matt,280,Car
0,25,265,Ford,Bob,270,Car
6,23,275,GM,Lindsey,290,SUV
2,22,240,Ford,Sally,203,SUV
9,21,280,Chevy,Anna,305,SUV
4,19,365,Ford,Bill,415,Truck
10,19,310,GM,Jon,250,SUV
7,18,300,GM,Kristen,190,Truck


In [13]:
# Creating a new dataframe from the result of running .sort_values()

sorted_df = vehicles_df.sort_values("Fuel Economy (mpg)")
sorted_df.head()

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
5,17,400,Chevy,Mike,275,Truck
7,18,300,GM,Kristen,190,Truck
11,18,240,Chevy,Erin,290,Truck
4,19,365,Ford,Bill,415,Truck
10,19,310,GM,Jon,250,SUV


Note that sorting a dataframe does not reset the row indexes: each row keeps its original index label. If you'd rather have the index follow the new order, use the **<code class="python">.reset_index()</code>** function to renumber it.

In [14]:
# Run .reset_index() on the sorted dataframe
# Pass in drop=True to avoid adding a new column that holds the old indexes

sorted_df = sorted_df.reset_index(drop=True)
sorted_df.head()

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
0,17,400,Chevy,Mike,275,Truck
1,18,300,GM,Kristen,190,Truck
2,18,240,Chevy,Erin,290,Truck
3,19,365,Ford,Bill,415,Truck
4,19,310,GM,Jon,250,SUV
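
As an alternative, assuming you're running Pandas 1.0 or later, `.sort_values()` takes an `ignore_index=True` argument that sorts and renumbers in one step. A minimal sketch on a reduced copy of the example data:

```python
import pandas as pd

# A reduced copy of the tutorial's data (only the columns needed here)
vehicles_df = pd.DataFrame({
    "Owner": ["Bob", "Andrew", "Sally", "Amanda", "Bill", "Mike",
              "Lindsey", "Kristen", "Matt", "Anna", "Jon", "Erin"],
    "Fuel Economy (mpg)": [25, 27, 22, 31, 19, 17, 23, 18, 27, 21, 19, 18],
})

# ignore_index=True relabels the rows 0, 1, 2, ... after sorting
sorted_df = vehicles_df.sort_values("Fuel Economy (mpg)", ignore_index=True)
print(sorted_df.head())
```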


---
## <span style="color:blue">Merging</span>
Often, the data you want to analyze is spread across several tables. Pandas provides a function called **<code class="python">merge()</code>** that combines two dataframes.

In order to merge two dataframes successfully, you'll need to specify a column to merge on. This column should be the link that connects the two dataframes (e.g. some sort of ID column).

Let's go through the different types of merges using two dataframes - one that contains an online retailer's purchase data for the past week, and another one that contains data on customers who've signed up for the email list.

In [15]:
# Create the dataframe containing purchase data

purchase_dict = {"Customer ID": [123, 757, 985, 907, 642, 754, 396, 278],
                 "Order Price": [9.99, 15.00, 7.99, 25.00, 18.00, 31.00, 29.99, 17.99],
                 "Category": ["Clothes", "Electronics", "Toys", "Toys", "Clothes", "Toys", "Electronics", "Clothes"]}
purchase_df = pd.DataFrame(purchase_dict)
purchase_df

Unnamed: 0,Category,Customer ID,Order Price
0,Clothes,123,9.99
1,Electronics,757,15.0
2,Toys,985,7.99
3,Toys,907,25.0
4,Clothes,642,18.0
5,Toys,754,31.0
6,Electronics,396,29.99
7,Clothes,278,17.99


In [16]:
# Create the dataframe containing email list data

email_dict = {"Customer ID": [123, 147, 278, 396, 421, 642, 754],
              "Name": ["Jill", "Tony", "Sarah", "Bill", "Erin", "Tyler", "Amanda"],
              "Email Address": ["jill29@gmail.com", "tony@yahoo.com", "sarah.b@gmail.com", "bill93@gmail.com", "erinm@yahoo.com", "tyler@aol.com", "amanda72@gmail.com"]}
email_df = pd.DataFrame(email_dict)
email_df

Unnamed: 0,Customer ID,Email Address,Name
0,123,jill29@gmail.com,Jill
1,147,tony@yahoo.com,Tony
2,278,sarah.b@gmail.com,Sarah
3,396,bill93@gmail.com,Bill
4,421,erinm@yahoo.com,Erin
5,642,tyler@aol.com,Tyler
6,754,amanda72@gmail.com,Amanda


Note that both dataframes contain a "Customer ID" column. This shared column is what makes the merge possible.

An **inner merge** joins the two dataframes only where the values in the joining column match. By default, <code class="python">merge()</code> performs an inner merge.

In [17]:
# Performing an inner merge on the two dataframes using the "Customer ID" field
# Pass into merge(): the two dataframes to merge together, as well as the column to merge on
# Note that it only returns the rows where the value in "Customer ID" matched between the two dataframes
# If a row did not contain a match in "Customer ID", it is not returned in the merged dataframe

inner_merge_df = pd.merge(purchase_df, email_df, on="Customer ID")
inner_merge_df

Unnamed: 0,Category,Customer ID,Order Price,Email Address,Name
0,Clothes,123,9.99,jill29@gmail.com,Jill
1,Clothes,642,18.0,tyler@aol.com,Tyler
2,Toys,754,31.0,amanda72@gmail.com,Amanda
3,Electronics,396,29.99,bill93@gmail.com,Bill
4,Clothes,278,17.99,sarah.b@gmail.com,Sarah


You can think of the inner merge we just performed as returning only the orders from customers who've also signed up for the email list. We merge the two dataframes to return only rows where the "Customer ID" value matches (and therefore appears in both dataframes). Also, the merged dataframe contains all columns from both dataframes.

On the other hand, an **outer merge** joins the two dataframes, even where there isn't a match in the joining column. To perform an outer merge, you must specify it by passing in <code class="python">how="outer"</code> to <code class="python">merge()</code>.

In [18]:
# Performing an outer merge on the two dataframes, again using the "Customer ID" field
# Note that it returns all rows from both dataframes

outer_merge_df = pd.merge(purchase_df, email_df, on="Customer ID", how="outer")
outer_merge_df

Unnamed: 0,Category,Customer ID,Order Price,Email Address,Name
0,Clothes,123,9.99,jill29@gmail.com,Jill
1,Electronics,757,15.0,,
2,Toys,985,7.99,,
3,Toys,907,25.0,,
4,Clothes,642,18.0,tyler@aol.com,Tyler
5,Toys,754,31.0,amanda72@gmail.com,Amanda
6,Electronics,396,29.99,bill93@gmail.com,Bill
7,Clothes,278,17.99,sarah.b@gmail.com,Sarah
8,,147,,tony@yahoo.com,Tony
9,,421,,erinm@yahoo.com,Erin


The outer merge we just performed returns all rows from both dataframes. If a customer who placed an order in the past week *has not* signed up for the email list (i.e. they are in the <code class="python">purchase_df</code> dataframe, but not the <code class="python">email_df</code> dataframe), Pandas fills in the "Name" and "Email Address" fields with null values (<code class="python">NaN</code>). If someone who has signed up for the email list *has not* placed an order in the past week (i.e. they are in the <code class="python">email_df</code> dataframe, but not the <code class="python">purchase_df</code> dataframe), Pandas fills in the "Category" and "Order Price" with null values (<code class="python">NaN</code>).
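
To see which dataframe each row of an outer merge came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The following is a minimal sketch on reduced copies of the two dataframes; `audit_df` and `not_subscribed` are illustrative names.

```python
import pandas as pd

# Reduced copies of the tutorial's dataframes (only the columns needed here)
purchase_df = pd.DataFrame({
    "Customer ID": [123, 757, 985, 907, 642, 754, 396, 278],
    "Order Price": [9.99, 15.00, 7.99, 25.00, 18.00, 31.00, 29.99, 17.99],
})
email_df = pd.DataFrame({
    "Customer ID": [123, 147, 278, 396, 421, 642, 754],
    "Name": ["Jill", "Tony", "Sarah", "Bill", "Erin", "Tyler", "Amanda"],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
audit_df = pd.merge(purchase_df, email_df, on="Customer ID", how="outer", indicator=True)

# Customers who placed an order but are not on the email list
not_subscribed = audit_df[audit_df["_merge"] == "left_only"]
print(not_subscribed[["Customer ID", "Order Price"]])
```

Filtering on `_merge` is a handy way to check how well two tables line up before deciding which type of merge to keep.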

Lastly, a **right** or **left merge** keeps every row from one of the two dataframes and adds matching data from the other. In other words, it's like an outer merge that only applies to one dataframe: rows in the *other* dataframe without a match in the joining column are dropped.

With Pandas, the **first** dataframe you pass into <code class="python">merge()</code> is considered the **left** dataframe, and the **second** dataframe you pass into <code class="python">merge()</code> is considered the **right** dataframe.

In [19]:
# Performing a left merge on the two dataframes
# Here, purchase_df is the left dataframe, and email_df is the right dataframe

left_merge_df = pd.merge(purchase_df, email_df, on="Customer ID", how="left")
left_merge_df

Unnamed: 0,Category,Customer ID,Order Price,Email Address,Name
0,Clothes,123,9.99,jill29@gmail.com,Jill
1,Electronics,757,15.0,,
2,Toys,985,7.99,,
3,Toys,907,25.0,,
4,Clothes,642,18.0,tyler@aol.com,Tyler
5,Toys,754,31.0,amanda72@gmail.com,Amanda
6,Electronics,396,29.99,bill93@gmail.com,Bill
7,Clothes,278,17.99,sarah.b@gmail.com,Sarah


In [20]:
# Performing a right merge on the two dataframes
# Here, email_df is the left dataframe and purchase_df is the right dataframe

right_merge_df = pd.merge(email_df, purchase_df, on="Customer ID", how="right")
right_merge_df

Unnamed: 0,Customer ID,Email Address,Name,Category,Order Price
0,123,jill29@gmail.com,Jill,Clothes,9.99
1,278,sarah.b@gmail.com,Sarah,Clothes,17.99
2,396,bill93@gmail.com,Bill,Electronics,29.99
3,642,tyler@aol.com,Tyler,Clothes,18.0
4,754,amanda72@gmail.com,Amanda,Toys,31.0
5,757,,,Electronics,15.0
6,985,,,Toys,7.99
7,907,,,Toys,25.0
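
The two merges above hold the same information: a left merge with `purchase_df` first matches a right merge with `purchase_df` second, apart from column and row order. The sketch below checks this on reduced copies of the two dataframes; the variable names are illustrative.

```python
import pandas as pd

# Reduced copies of the tutorial's dataframes (only the columns needed here)
purchase_df = pd.DataFrame({
    "Customer ID": [123, 757, 985, 907, 642, 754, 396, 278],
    "Order Price": [9.99, 15.00, 7.99, 25.00, 18.00, 31.00, 29.99, 17.99],
})
email_df = pd.DataFrame({
    "Customer ID": [123, 147, 278, 396, 421, 642, 754],
    "Name": ["Jill", "Tony", "Sarah", "Bill", "Erin", "Tyler", "Amanda"],
})

left_version = pd.merge(purchase_df, email_df, on="Customer ID", how="left")
right_version = pd.merge(email_df, purchase_df, on="Customer ID", how="right")

# Put both results in the same column order and row order before comparing
cols = list(left_version.columns)
left_sorted = left_version.sort_values("Customer ID").reset_index(drop=True)
right_sorted = right_version[cols].sort_values("Customer ID").reset_index(drop=True)

print(left_sorted.equals(right_sorted))
```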


---
## <span style="color:blue">Binning</span>
Often, it's useful to put numeric data into bins. This allows for more rigorous and customizable analysis of datasets. `cut()` is a function in Pandas that lets you bin data.

Let's work on binning the "Horsepower" values in `vehicles_df`.

In [21]:
# First, call vehicles_df to examine it

vehicles_df

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type
0,25,265,Ford,Bob,270,Car
1,27,190,GM,Andrew,215,Car
2,22,240,Ford,Sally,203,SUV
3,31,350,Chevy,Amanda,350,Car
4,19,365,Ford,Bill,415,Truck
5,17,400,Chevy,Mike,275,Truck
6,23,275,GM,Lindsey,290,SUV
7,18,300,GM,Kristen,190,Truck
8,27,185,Ford,Matt,280,Car
9,21,280,Chevy,Anna,305,SUV


In [22]:
# Create bins and bin labels for the Horsepower column

hp_bins = [180, 200, 350, 400]
hp_labels = ["Slow", "Decent", "Fast"]

# Bin the Horsepower column
# cut() returns a Pandas Series containing each of the binned column's values translated into their corresponding bins

pd.cut(vehicles_df["Horsepower"], hp_bins, labels=hp_labels)

0     Decent
1       Slow
2     Decent
3     Decent
4       Fast
5       Fast
6     Decent
7     Decent
8       Slow
9     Decent
10    Decent
11    Decent
Name: Horsepower, dtype: category
Categories (3, object): [Slow < Decent < Fast]

The syntax for creating the bins may be a bit confusing at first. 

The first argument passed into `cut()` is the dataframe column you'd like to bin.

The second argument passed into `cut()` is a list that represents the bins you'd like to create. In the above lines of code, the list `hp_bins` contains four numbers, which are our **bin edges**. When it's passed into `cut()`, Pandas creates **three bins**. By default, each bin excludes its lower edge and includes its upper edge, which is why a horsepower of exactly 350 lands in the second bin:
- the first one ranging from **180 to 200**
- the second one ranging from **200 to 350**
- the third one ranging from **350 to 400**.

The third argument passed into `cut()` is a list that contains your bin labels. The list `hp_labels` contains 3 strings. When this list is passed into `cut()`: 
- the label **`"Slow"`** is mapped to the bin ranging from **180 to 200**
- the label **`"Decent"`** is mapped to the bin ranging from **200 to 350**
- the label **`"Fast"`** is mapped to the bin ranging from **350 to 400**.
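
To see the bin edges in action, it can help to call `cut()` without labels, so that it reports the interval each value fell into. The sketch below uses a handful of made-up horsepower values (not the tutorial's data) chosen to sit on or outside the bin edges.

```python
import pandas as pd

# Hypothetical values chosen to sit on or outside the bin edges
horsepower = pd.Series([185, 200, 350, 400, 450], name="Horsepower")
hp_bins = [180, 200, 350, 400]

# Without labels, cut() shows the interval for each value.
# By default the upper edge is included and the lower edge is not.
intervals = pd.cut(horsepower, hp_bins)
print(intervals)

# right=False flips this: the lower edge is included, the upper edge is not
left_closed = pd.cut(horsepower, hp_bins, right=False,
                     labels=["Slow", "Decent", "Fast"])
print(left_closed)
```

Values that fall outside every bin, such as 450 here, come back as `NaN`, so it's worth choosing edges that cover the full range of your data.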

In [23]:
# We can append our bins to vehicles_df

vehicles_df["Speed"] = pd.cut(vehicles_df["Horsepower"], hp_bins, labels=hp_labels)
vehicles_df

Unnamed: 0,Fuel Economy (mpg),Horsepower,Manufacturer,Owner,Torque (lb-ft),Vehicle Type,Speed
0,25,265,Ford,Bob,270,Car,Decent
1,27,190,GM,Andrew,215,Car,Slow
2,22,240,Ford,Sally,203,SUV,Decent
3,31,350,Chevy,Amanda,350,Car,Decent
4,19,365,Ford,Bill,415,Truck,Fast
5,17,400,Chevy,Mike,275,Truck,Fast
6,23,275,GM,Lindsey,290,SUV,Decent
7,18,300,GM,Kristen,190,Truck,Decent
8,27,185,Ford,Matt,280,Car,Slow
9,21,280,Chevy,Anna,305,SUV,Decent


In [24]:
# Binning adds a new wrinkle to our data, allowing for more rigorous analysis
# For example, we can now group by our bins

grouped_speed_vehicles_df = vehicles_df.groupby("Speed")
grouped_speed_vehicles_df[["Horsepower", "Torque (lb-ft)"]].mean()

Speed,Horsepower,Torque (lb-ft)
Slow,187.5,247.5
Decent,282.5,268.5
Fast,382.5,345.0
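
Another quick check on binned data is to count how many rows landed in each bin. This is a minimal sketch that rebuilds only the "Horsepower" values from the example; `speed` and `speed_counts` are illustrative names.

```python
import pandas as pd

# The tutorial's horsepower values, bins, and labels
horsepower = pd.Series([265, 190, 240, 350, 365, 400, 275, 300, 185, 280, 310, 240],
                       name="Horsepower")
hp_bins = [180, 200, 350, 400]
hp_labels = ["Slow", "Decent", "Fast"]

speed = pd.cut(horsepower, hp_bins, labels=hp_labels)

# Count how many vehicles landed in each bin
speed_counts = speed.value_counts()
print(speed_counts)
```

Uneven counts like these suggest the bin edges may be worth revisiting if you want groups of comparable size.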
