# APS106 - Fundamentals of Computer Programming
## Week 12 | Lecture 2 (12.2) - More Pandas, Data Visualization

### This Week
| Lecture | Topics |
| --- | --- |
| 12.1 | Pandas |
| **12.2** | **More Pandas, Data Visualization** | 
| 12.3 | Design Problem: Stock Market, Part 1 |

### Lecture Structure
1. [Conditional Selection](#section1)
2. [Breakout Session 1](#section2)
3. [Adding, Removing, and Modifying Columns](#section3)
4. [Utility Methods](#section4)
5. [String Methods](#section5)
6. [Breakout Session 2](#section6)
7. [Data Visualization](#section7)

<a id='section1'></a>
## 1. Conditional Selection
Let's start by importing `pandas`.

In [None]:
import pandas as pd

Now, let's import a new dataset of baby names in New York.

In [None]:
babynames = pd.read_csv('new_york_baby_names.csv')
babynames.head()

The way to read the table is as follows. In the state of New York in 1910, 1923 babies of Female sex were born and given the name Mary.

Let's start off by grabbing a smaller sample of our dataset.

In [None]:
babynames_first_10_rows = babynames.loc[0:9, :]
babynames_first_10_rows

By passing in a sequence (`list` or `Series`) of `boolean` values, we can extract a subset of the rows in a `DataFrame`. We will keep only the rows that correspond to a `boolean` value of `True`.

Let's first create a list of `booleans`. One requirement is that if we create a `list` to filter the rows of a `DataFrame`, then the `list` must have as many items in it as there are rows in the `DataFrame` we want to filter. In this case, the `DataFrame` has **10** rows so the list must have **10** booleans.

In [None]:
boolean_list = [True, False, True, False, True, False, True, False, True, False]

Now, let's pass that list in as the first argument (row selection) to `.loc`.

In [None]:
babynames_first_10_rows.loc[boolean_list, :]

As you can see above, we had placed `True` at the position of all even indices in the `list` and therefore, only rows with an even index are returned.
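
As a minimal sketch of the same idea on made-up data (the `toy` `DataFrame` and its values are hypothetical and are not taken from `new_york_baby_names.csv`), the cell below filters a four-row `DataFrame` with a list of four booleans. It also shows one possible way to build an "every other row" list of the right length instead of typing it by hand.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Name": ["Mary", "Helen", "Rose", "Anna"],
                    "Count": [1923, 1290, 990, 951]})

# One boolean per row: True keeps the row, False drops it
keep = [True, False, True, False]
kept_rows = toy.loc[keep, :]
print(kept_rows)

# Build the same pattern for any number of rows
every_other = [i % 2 == 0 for i in range(toy.shape[0])]
print(every_other == keep)
```

Generating the list from `toy.shape[0]` guarantees that it has exactly one boolean per row, which is the requirement described above.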

Oftentimes, we'll use boolean selection to check for entries in a `DataFrame` that meet a particular condition. In the code below, we first select a column, which returns a `Series`.

In [None]:
babynames.loc[:, 'Sex']

In the code below, we are using a logical condition to generate a boolean `Series`. 

In [None]:
babynames.loc[:, 'Sex'] == "F"

Let's save this as a variable.

In [None]:
boolean_series = babynames.loc[:, 'Sex'] == "F"

Now we can pass this `boolean` `Series` into `.loc` as the first argument (row selection). This returns a `DataFrame` containing only the Female baby names.

In [None]:
babynames.loc[boolean_series, :]

As you can see from the output above, only Female baby names are present in the resulting `DataFrame`. Let's compare the size of the original `DataFrame` with the one we get after selecting only the rows where `Sex == "F"`.

In [None]:
total_rows = babynames.shape[0]
female_rows = babynames.loc[boolean_series, :].shape[0]

print("The original DataFrame has: ", total_rows, " rows.", sep='')
print("The 'Female' filtered DataFrame has: ", female_rows, " rows.", sep='')

Rather than creating a separate variable `boolean_series` for the `boolean` `Series`, we can pass the logical condition into `.loc` as the first argument (row selection).

In [None]:
babynames.loc[babynames.loc[:, 'Sex'] == "F", :]

Lastly, let's show that the number of rows in the `Sex == "F"` `DataFrame` and the `Sex == "M"` `DataFrame` add up to the original size of the `babynames` `DataFrame`.

In [None]:
total_rows = babynames.shape[0]
female_rows = babynames.loc[babynames.loc[:, 'Sex'] == "F", :].shape[0] 
male_rows = babynames.loc[babynames.loc[:, 'Sex'] == "M", :].shape[0] 

print("The original DataFrame has: ", total_rows, " rows.", sep='')
print("The 'Female' filtered DataFrame has: ", female_rows, " rows.", sep='')
print("The 'Male' filtered DataFrame has: ", male_rows, " rows.", sep='')
print(female_rows, " + ", male_rows, " = ", total_rows, sep='')

### Multiple Conditions
To filter on multiple conditions, we combine boolean conditions using the bitwise operators below and use brackets `()` to separate the conditions.

| Symbol | Usage | Meaning |
| --- | --- | --- |
| ~ | ~p | not p |
| &#124; | p &#124; q | p or q |
| & | p & q | p and q |

The code below is filtering the `babynames` `DataFrame` to only include **Female** baby names for all years before the year **2000**.

In [None]:
babynames.loc[(babynames.loc[:, "Sex"] == "F") & (babynames.loc[:, "Year"] < 2000), :]

The code below is filtering the `babynames` `DataFrame` to include all baby names that are either **Female** or from before the year **2000**.

In [None]:
babynames.loc[(babynames.loc[:, "Sex"] == "F") | (babynames.loc[:, "Year"] < 2000), :]

The code below is filtering the `babynames` `DataFrame` to include **Male** baby names from the year **2020** that were given to fewer than **6** babies that year.

In [None]:
babynames.loc[(babynames.loc[:, "Sex"] == "M") & 
              (babynames.loc[:, "Year"] == 2020) & 
              (babynames.loc[:, "Count"] < 6), :]

The code below is filtering the `babynames` `DataFrame` to include **Male** baby names from the year **2020** that were given to more than **700** babies that year.

In [None]:
babynames.loc[(babynames.loc[:, "Sex"] == "M") & 
              (babynames.loc[:, "Year"] == 2020) & 
              (babynames.loc[:, "Count"] > 700), :]
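
The brackets around each condition matter because `&` and `|` bind more tightly than comparison operators such as `==` and `<`. Without them, Python would try to evaluate something like `"F" & (a Series)` first, which is not what we want. The sketch below uses a small made-up `DataFrame` (hypothetical values, not the course dataset) to show the three operators from the table, with each condition saved to its own variable for readability.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Sex": ["F", "F", "M", "M"],
                    "Year": [1990, 2010, 1990, 2010],
                    "Name": ["Mary", "Emma", "John", "Liam"]})

# Each comparison produces a boolean Series
is_female = toy.loc[:, "Sex"] == "F"
before_2000 = toy.loc[:, "Year"] < 2000

both = toy.loc[is_female & before_2000, :]     # p and q
either = toy.loc[is_female | before_2000, :]   # p or q
not_female = toy.loc[~is_female, :]            # not p

print(both)
print(either)
print(not_female)
```

Naming the intermediate conditions is optional; it is simply one way to keep long filters readable.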

### Membership Condition
We can use `.isin` for selection based on a `list` or `Series`. For example, let's say we want to create a `DataFrame` that only contains the names `"Sebastian"`, `"Ben"`, `"Joseph"`, `"Katia"`, and `"Tamara"`. Based on what we've learned so far, we could do the following.

In [None]:
babynames.loc[(babynames.loc[:, "Name"] == "Sebastian") | 
              (babynames.loc[:, "Name"] == "Ben") | 
              (babynames.loc[:, "Name"] == "Joseph") | 
              (babynames.loc[:, "Name"] == "Katia") | 
              (babynames.loc[:, "Name"] == "Tamara"), :]

A more concise way to achieve the above is to use the `.isin` method. The `.isin` method in Pandas checks whether each element in a `DataFrame` or `Series` is contained in a sequence of values. Here's how it works:

In [None]:
names = ["Sebastian", "Ben", "Joseph", "Katia", "Tamara"]
babynames.loc[:, "Name"].isin(names)

We get back a `Series` where the value is `True` if the `"Name"` is in the list `names` and `False` if it's not. We can take this `boolean` `Series` and pass it as the first argument to `.loc` (row selection).

In [None]:
names = ["Sebastian", "Ben", "Joseph", "Katia", "Tamara"]
babynames.loc[babynames.loc[:, "Name"].isin(names), :]
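
To check that the two approaches really select the same rows, here is a small sketch on made-up data (the `toy` `DataFrame` and the `wanted` list are hypothetical). It also shows that `~` can be combined with `.isin` to keep the rows whose value is not in the list.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Name": ["Ben", "Mary", "Katia", "John", "Ben"],
                    "Year": [1990, 1990, 2001, 2001, 2015]})

wanted = ["Ben", "Katia"]

# Chained equality checks joined with |
chained = toy.loc[(toy.loc[:, "Name"] == "Ben") |
                  (toy.loc[:, "Name"] == "Katia"), :]

# The same selection with .isin
with_isin = toy.loc[toy.loc[:, "Name"].isin(wanted), :]

# ~ inverts the membership test: rows whose name is NOT in the list
excluded = toy.loc[~toy.loc[:, "Name"].isin(wanted), :]

print(chained.equals(with_isin))
print(excluded)
```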

<a id='section2'></a>
## 2. Breakout Session 1
Let's import the `elections` and `babynames` datasets.

In [None]:
elections = pd.read_csv('elections.csv')
elections.head()

In [None]:
babynames = pd.read_csv('new_york_baby_names.csv')
babynames.head()

#### Question 1
Display a `DataFrame` showing all **Republican** candidates who won the presidential election with less than **50%** of the popular vote.

In [None]:
elections.loc[
    (elections.loc[:, 'Party'] == 'Republican') & \
    (elections.loc[:, 'Result'] == 'win') & \
    (elections.loc[:, '%'] < 50)
, :]

#### Question 2
Display a `DataFrame` showing all years where the following parties ran: `'Green'`, `'Union Labor'`, and `'Free Soil'`.

In [None]:
elections.loc[elections.loc[:, 'Party'].isin(["Green", "Union Labor", "Free Soil"]), :]

#### Question 3
The `elections` `DataFrame` is in chronological order when imported. Building on the `DataFrame` from **Question 1**, print the name of the first **Republican** candidate to win the presidential election with less than **50%** of the popular vote.

In [None]:
df_temp = elections.loc[(elections.loc[:, 'Party'] == 'Republican') & \
                        (elections.loc[:, 'Result'] == 'win') & \
                        (elections.loc[:, '%'] < 50), :]

candidates_name = df_temp.loc[:, 'Candidate'].iloc[0]

print(candidates_name, 'was the first Republican candidate to win the presidential election with less than 50% of the popular vote.')

#### Question 4
Display a `DataFrame` showing all **Republican** and **Democratic** candidates who won the presidential election with less than **50%** of the popular vote.

In [None]:
elections.loc[
    ((elections.loc[:, 'Party'] == 'Republican') | (elections.loc[:, 'Party'] == 'Democratic')) & \
    (elections.loc[:, 'Result'] == 'win') & \
    (elections.loc[:, '%'] < 50)
, :]

#### Question 5
Display a `DataFrame` showing all years where the name **Sebastian** was given to at least one **Male** baby.

In [None]:
babynames.loc[(babynames.loc[:, 'Name'] == 'Sebastian') & (babynames.loc[:, 'Sex'] == 'M'), :]

#### Bonus
We'll discuss plotting a bit later in the lecture but for now, check this out.

In [None]:
babynames.loc[(babynames.loc[:, 'Name'] == 'Sebastian') & (babynames.loc[:, 'Sex'] == 'M'), :].plot('Year', 'Count')

<a id='section3'></a>
## 3. Adding, Removing, and Modifying Columns
### Add a Column
To add a column, use `.loc[:, column-name]` to reference the desired new column, then assign it a `Series` or `list` of appropriate length (or a single value, which is repeated for every row). Let's add a new column called `"Dummy"` to the `babynames` `DataFrame` and assign zeros to it.

In [None]:
babynames.loc[:,  "Dummy"] = 0
babynames

Now, let's try creating a new column called `"Count_Squared"` and assign to it the square of the `"Count"` column.

In [None]:
babynames.loc[:,  "Count_Squared"] = babynames.loc[:,  "Count"] * babynames.loc[:,  "Count"]
babynames

### Modify a Column
To modify a column, use `.loc[:, column-name]` to access the desired column, then re-assign it to a new `list` or `Series`.

In [None]:
babynames.loc[:, "Count"] = babynames.loc[:, "Count"] - 10000
babynames

### Rename a Column Name
Rename a column using the `.rename(columns={old-column-name: new-column-name})` method.

In [None]:
babynames = babynames.rename(columns={"Name":"First-Name"})
babynames

### Delete a Column
Remove a column using `.drop()`.

In [None]:
babynames = babynames.drop(["Dummy", "Count_Squared"], axis="columns")
babynames
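
The four operations above can be sketched end to end on a tiny made-up `DataFrame` (the `toy` data and the `"Rank"` column are hypothetical). One detail worth noticing is that `.rename()` and `.drop()` return a new `DataFrame` rather than changing the original, which is why the cells above assign the result back to `babynames`.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Name": ["Mary", "John"], "Count": [10, 20]})

# Add: a list must have one value per row
toy.loc[:, "Rank"] = [2, 1]

# Modify: overwrite an existing column with a new Series
toy.loc[:, "Count"] = toy.loc[:, "Count"] * 2

# Rename and drop return NEW DataFrames; toy itself is unchanged
renamed = toy.rename(columns={"Name": "First-Name"})
dropped = renamed.drop(["Rank"], axis="columns")

print(list(toy.columns))
print(list(dropped.columns))
```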

<a id='section4'></a>
## 4. Utility Methods
There are many, many utility methods built into `Pandas`, far more than we can possibly cover in `APS106`. You are encouraged to explore all the functionality outlined in the pandas documentation. For `APS106`, you will only be required to know the utility methods covered in this lecture.

In [None]:
babynames = pd.read_csv('new_york_baby_names.csv')
babynames.head()

### `.max()`
This function returns the maximum value along the specified axis. As shown below, `axis=0` is column-wise and `axis=1` is row-wise.

<br>
<img src="images/DataFrame.png" alt="drawing" width="450"/>
<br>

For a `Series`:
```python
df['column_name'].max()
```

For a `DataFrame`, for a column-wise calculation:
```python
df.max(axis=0)
```
or, for a row-wise calculation:
```python
df.max(axis=1)
```

First, let's try calculating the maximum baby name count. As you can see, because a `Series` is 1-D data, we do not need to specify the axis.

In [None]:
babynames.loc[:, 'Count'].max()

For a `DataFrame`, if we use a column-wise operation, Pandas will return the maximum value for each column.

In [None]:
babynames.max(axis=0)

If we use a row-wise operation, Pandas will try to return the maximum value for each row.

In [None]:
babynames.max(axis=1)

As you can see, we get an error because Pandas cannot compare numeric and string data to determine which is larger. If we filter to only the numeric columns (`"Year"` and `"Count"`), then this should work. You can see that the `"Year"` is usually returned because it's typically larger than the name count.

In [None]:
babynames.loc[:, ['Year', 'Count']].max(axis=1)
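
To make the `axis` argument concrete, here is a minimal sketch on made-up numbers (the `toy` values are hypothetical). With `axis=0` we get one result per column, and with `axis=1` we get one result per row.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Year": [1910, 1911, 1912],
                    "Count": [1923, 45, 3000]})

col_max = toy.max(axis=0)   # one value per column, indexed by column name
row_max = toy.max(axis=1)   # one value per row, indexed by row label

print(col_max)
print(row_max)
```

In this toy data the first and last rows have a `"Count"` larger than the `"Year"`, so the row-wise maximum is not always the year.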

### `.min()`
This function returns the minimum value along the specified axis.

In [None]:
babynames.loc[:, ['Year', 'Count']].min(axis=1)

### `.mean()`
This function computes the arithmetic mean along the specified axis.

In [None]:
babynames.loc[:, 'Count'].mean()

### `.value_counts()`
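Count the number of times each unique value occurs in a `Series`. The output below is a `Series` where the `index` contains all the unique names in the `babynames` `DataFrame` and the `values` are the number of rows in which each name appears. Note that this is a count of rows, not the total number of babies given each name (that would require adding up the `"Count"` column).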

In [None]:
babynames.loc[:, "Name"].value_counts()
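
The difference between counting rows and totalling babies can be sketched on made-up data (the `toy` values are hypothetical). The `.sum()` method used here is not one of the utility methods covered in this lecture; it is included only to illustrate the contrast.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Name": ["Mary", "Mary", "John"],
                    "Year": [1910, 1911, 1910],
                    "Count": [100, 120, 80]})

# How many ROWS mention each name
rows_per_name = toy.loc[:, "Name"].value_counts()

# How many BABIES were named Mary: filter the rows, then add up "Count"
mary_total = toy.loc[toy.loc[:, "Name"] == "Mary", "Count"].sum()

print(rows_per_name)
print(mary_total)
```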

### `.unique()`
Return an array of all unique values in a `Series`. Below, we apply the `.unique()` method to the `Name` `Series` and get an array of all unique baby names in the `babynames` dataset. For the purpose of `APS106`, you can treat this array as a list-like object. It is mutable, indexable, and iterable. 

In [None]:
babynames.loc[:, "Name"].unique()
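
Here is a short sketch, on a made-up `Series` (hypothetical values), of treating the result of `.unique()` as a list-like object: taking its length, indexing it, looping over it, and converting it to a real `list` if one is needed.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
names = pd.Series(["Mary", "John", "Mary", "Anna", "John"])

unique_names = names.unique()   # values in order of first appearance

print(len(unique_names))        # number of distinct names
print(unique_names[0])          # indexable like a list
for name in unique_names:       # iterable like a list
    print(name)

as_list = list(unique_names)    # convert if a real list is needed
```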

### `.sort_values()`
The `.sort_values()` function in `Pandas` is used to sort a `DataFrame` or `Series` by one or more columns. It allows you to specify the column(s) by which you want to sort the data, as well as the order of sorting (ascending or descending).

The code below sorts the `Name` `Series` in ascending order.

In [None]:
babynames.loc[:, "Name"].sort_values()

The code below sorts a `DataFrame` by the `Count` column in descending order.

In [None]:
babynames.sort_values(by=["Count"], ascending=[False]).head()

The code below sorts a `DataFrame` first by the `"Year"` in ascending order and then by the `"Count"` in descending order.

In [None]:
babynames.sort_values(by=["Year", "Count"], ascending=[True, False]).head()
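
A minimal sketch of a two-key sort on made-up data (the `toy` values are hypothetical). It shows that the original index labels travel with the rows and that `.sort_values()` returns a new `DataFrame`, leaving the original untouched.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Year": [1911, 1910, 1910, 1911],
                    "Name": ["Anna", "Mary", "John", "Rose"],
                    "Count": [50, 100, 80, 70]})

# Earliest year first; within a year, largest count first
ordered = toy.sort_values(by=["Year", "Count"], ascending=[True, False])
print(ordered)

# The original index labels stay attached to their rows
print(list(ordered.index))

# Position 0 of the sorted result is the most common name of the earliest year
print(ordered.loc[:, "Name"].iloc[0])
```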

### `.astype()`
The `.astype()` function is used to cast a pandas object (like a `DataFrame` or a `Series`) to a specified dtype (data type). It allows you to explicitly convert the data to a desired data type.

Let's say you have a `DataFrame` where one of the columns represents numerical values as strings, and you want to convert this column to numeric data type for numerical operations. Here's a simple example of how you can use `.astype()` to achieve this.

First, let's create the `DataFrame`.

In [None]:
data = {'NumericString': ['1', '2', '3', '4', '5']}
df = pd.DataFrame(data)
df

Now let's check the data type of the `'NumericString'` column.

In [None]:
print("Data types before conversion:")
print(df.dtypes)

Lastly, let's convert the data type of the `'NumericString'` column from a `string` to an `int`.

In [None]:
# Assign with df[...] so the column is replaced and takes the new int dtype
df['NumericString'] = df.loc[:, 'NumericString'].astype(int)
print("\nData types after conversion:")
print(df.dtypes)
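
To see why the conversion matters, the sketch below (made-up values) adds a `Series` to itself before and after `.astype(int)`. With strings, `+` concatenates; with integers, it adds.

```python
import pandas as pd

# Hypothetical toy data
numeric_strings = pd.Series(['1', '2', '3'])

# With strings, + concatenates
print((numeric_strings + numeric_strings).tolist())

as_ints = numeric_strings.astype(int)

# With integers, + adds
print((as_ints + as_ints).tolist())
print(as_ints.dtype)
```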

<a id='section5'></a>
## 5. String Methods
The Pandas library provides a set of handy string methods that apply an operation to every `string` in a `Series` at once, which saves us from writing loops. The following are the `string` methods we will cover in APS106.
1. `.upper()`
2. `.lower()`
3. `.len()`
4. `.startswith()` / `.endswith()`
5. `.replace()`

### `.upper()`
This `string` method converts a `string` into uppercase as you can see from the example below.

In [None]:
babynames.loc[:, 'Name'].str.upper()

`string` methods operate on `Series`, NOT `DataFrames`. Notice the error when I try to call a `string` method from a `DataFrame`.

In [None]:
babynames.str.upper()

Additionally, we need to access `string` methods via the `.str` accessor.

1. `Series.str.upper()`
2. `Series.str.lower()`
3. `Series.str.len()`
4. `Series.str.startswith()` / `Series.str.endswith()`
5. `Series.str.replace()`

### `.lower()`
This `string` method converts a `string` into lowercase as you can see from the example below.

In [None]:
babynames.loc[:, 'Name'].str.lower()

### `.len()`
This `string` method returns the length (number of characters) of each `string`. This would be similar to:

```python
>>> var = 'seb'
>>> print(len(var))
3
```

In [None]:
babynames.loc[:, 'Name'].str.len()

### `.startswith()` / `.endswith()`
`.startswith()` returns a `boolean` `Series` whose value is `True` wherever the `string` in the `Series` starts with the `string` passed as an argument to `.startswith()`. Likewise, `.endswith()` returns a `boolean` `Series` whose value is `True` wherever the `string` in the `Series` ends with the `string` passed as an argument to `.endswith()`.

Let's check out the first 5 rows of our `babynames` `DataFrame`.

In [None]:
babynames.head()

We can see that the first name is `Mary` and the fifth row is `Margaret`. Therefore, if we use `.startswith('M')`, we should get `True` for rows 0 and 4 and `False` for rows 1 - 3.

In [None]:
babynames.loc[:, 'Name'].str.startswith('M')
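
Because the result is a `boolean` `Series`, it can be passed straight into `.loc` or combined with another condition. The sketch below uses made-up names (hypothetical data) and also shows that the match is case-sensitive.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Name": ["Mary", "Helen", "Margaret", "Jeanette"]})

starts_m = toy.loc[:, "Name"].str.startswith("M")
ends_ette = toy.loc[:, "Name"].str.endswith("ette")

print(starts_m.tolist())

# Keep names that start with "M" OR end with "ette"
print(toy.loc[starts_m | ends_ette, :])

# The match is case-sensitive: a lowercase "m" matches none of these names
print(toy.loc[:, "Name"].str.startswith("m").tolist())
```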

### `.replace()`
Replaces a part of the string with another one. This works in a similar way to the Python string method `.replace()`. The `.replace()` takes two arguments: (1) The `string` to replace and (2) What `string` to replace it with.

Let's consider a scenario where you have a `DataFrame` containing a column with strings representing monetary values, but some of these values are formatted with commas as thousands separators. You want to clean up the data by removing the commas from these values so that you can convert them to numerical values for analysis.

Here's how you can use `str.replace()` from Pandas to achieve this. First, let's create the `DataFrame`.

In [None]:
data = {'MonetaryValue': ['$1,000', '$2,500', '$3,750', '$4,500', '$5,250']}
df = pd.DataFrame(data)
df

Now, let's use `.replace()` to replace the comma `","` with an empty string `""`.

In [None]:
df.loc[:, 'MonetaryValue'] = df.loc[:, 'MonetaryValue'].str.replace(',', '')
df

Next, let's use `.replace()` to replace the dollar sign `"$"` with an empty string `""`.

In [None]:
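# regex=False treats '$' as a literal character rather than a regular-expression anchor
df.loc[:, 'MonetaryValue'] = df.loc[:, 'MonetaryValue'].str.replace('$', '', regex=False)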
df

Now we can use `.astype()` to convert from a `string` to a `float` so we can apply mathematical operations to the column `"MonetaryValue"`.

In [None]:
# Assign with df[...] so the column is replaced and takes the new float dtype
df['MonetaryValue'] = df.loc[:, 'MonetaryValue'].astype(float)
df
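
The three cleaning steps can also be chained into a single expression. The sketch below uses a made-up `Series` (hypothetical values) and passes `regex=False` explicitly so that `','` and `'$'` are treated as plain characters; this is one reasonable way to write it, not the only one.

```python
import pandas as pd

# Hypothetical toy data
prices = pd.Series(['$1,000', '$2,500', '$3,750'])

# Remove commas, remove dollar signs, then convert to float
cleaned = (prices.str.replace(',', '', regex=False)
                 .str.replace('$', '', regex=False)
                 .astype(float))

print(cleaned.tolist())
print(cleaned.mean())
```

Once the values are floats, utility methods such as `.mean()`, `.min()`, and `.max()` behave numerically.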

<a id='section6'></a>
## 6. Breakout Session 2
For this Breakout Session, we will be working with the `babynames` dataset so let's load it again.

In [None]:
babynames = pd.read_csv('new_york_baby_names.csv')
babynames.head()

### Question 1
Create a new `DataFrame` called `babynames_1985` that only includes baby names from the year 1985.

In [None]:
babynames_1985 = babynames.loc[babynames.loc[:, 'Year'] == 1985, :]
babynames_1985.head()

### Question 2
Remove all names from `babynames_1985` that do not start with the letter `"J"`.

In [None]:
babynames_1985 = babynames_1985.loc[babynames_1985.loc[:, "Name"].str.startswith('J'), :]
babynames_1985.head()

### Question 3
Add a column to `babynames_1985` called `"length"` that contains the length of `"Name"` (the number of letters in the name). For example, the length of `"Seb"` is **3**.

In [None]:
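# Take an explicit copy first so the new column is not added to a slice of babynames
babynames_1985 = babynames_1985.copy()
babynames_1985['length'] = babynames_1985.loc[:, 'Name'].str.len()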
babynames_1985.head()

### Question 4
Sort `babynames_1985` by the `"length"` column so that the first row contains the longest name starting with `"J"` from 1985.

In [None]:
babynames_1985 = babynames_1985.sort_values(['length'], ascending=False)
babynames_1985.head()

### Question 5
Filter `babynames_1985` to only include names that end with `"ette"` and have **8** or more letters.

In [None]:
babynames_1985 = babynames_1985.loc[(babynames_1985.loc[:, "Name"].str.endswith('ette')) & 
                                    (babynames_1985.loc[:, "Name"].str.len() >= 8), :]
babynames_1985.head()

### Question 6 (Bonus)
Starting with `babynames_1985` below.

In [None]:
babynames_1985 = babynames.loc[babynames.loc[:, 'Year'] == 1985, :]
babynames_1985.head()

Remove all names from `babynames_1985` that do not start with the letter `"J"` WITHOUT using Pandas `.str` methods.

In [None]:
starts_with_j = []

for index in babynames_1985.index:
    
    name = babynames_1985.loc[index, 'Name']
    
    if name[0] == 'J':
        starts_with_j.append(True)
    else:
        starts_with_j.append(False)
        
babynames_1985 = babynames_1985.loc[starts_with_j, :]
babynames_1985.head()
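
The same boolean list can be built more compactly with a list comprehension. This is a sketch on made-up data (the `toy` `DataFrame` is hypothetical) and assumes every name has at least one character.

```python
import pandas as pd

# Hypothetical toy data, not from the course dataset
toy = pd.DataFrame({"Name": ["John", "Mary", "Jane"],
                    "Year": [1985, 1985, 1985]})

# One boolean per row, built by iterating over the "Name" Series
starts_with_j = [name[0] == 'J' for name in toy.loc[:, 'Name']]

j_names = toy.loc[starts_with_j, :]
print(j_names)
```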

<a id='section7'></a>
## 7. Data Visualization
There are MANY plotting libraries available in the Python ecosystem, and doing a deep dive into them is way beyond the scope of APS106. However, Pandas has some basic plotting functionality we will cover. In particular, we will cover `line plots`, `scatter plots`, and `bar plots`.

First let's create a dummy dataset to demonstrate these plotting functions.

In [None]:
data = {'A': [1, 2, 3, 4, 5],
        'B': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)
df

### Line Plot

In [None]:
df.plot(kind='line', x='A', y='B')

### Scatter Plot

In [None]:
df.plot(kind='scatter', x='A', y='B')

### Bar Plot

In [None]:
df.plot(kind='bar', x='A', y='B')
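
In all three cases, `x` and `y` name the columns to draw on each axis and `kind` selects the plot type. As a small sketch (assuming `matplotlib` is installed, which Pandas plotting relies on), `.plot` returns a matplotlib `Axes` object that can be used to adjust the figure afterwards. The title and axis labels below are made up for illustration.

```python
import pandas as pd

# Same dummy data as above
toy = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': [5, 4, 3, 2, 1]})

# .plot returns a matplotlib Axes object that can be customised afterwards
ax = toy.plot(kind='line', x='A', y='B', title='B versus A', figsize=(6, 3))
ax.set_xlabel('A (hypothetical units)')
ax.set_ylabel('B (hypothetical units)')
```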

### Example 1: Uber Rides

In [None]:
combined_uber_data = pd.read_csv('combined_uber_data.csv')
combined_uber_data.head()

In [None]:
combined_uber_data.plot(kind='bar', x='Date/Time', y='count', figsize=(15, 5), title='Uber Rides Per Week, 2014')

### Example 2: Toronto Weather

In [None]:
toronto_weather_data = pd.read_csv('toronto_weather_data.csv')
toronto_weather_data.head()

In [None]:
toronto_weather_data.plot(kind='line', y='Temp (°C)', figsize=(15, 5), title='Temperature in Toronto')