# SandBox Notebook


### Introduction to Pandas
This notebook provides an opportunity to practice what was covered in class, along with a high-level overview of the [Pandas](https://pandas.pydata.org) library. Please see the sixth chapter of the [Learning Data Science](https://learningds.org/intro.html) text for a narrative explanation of what is going on in this notebook.

In [None]:
import plotly.express as px
import numpy as np
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

## Series, DataFrames, and Indices 

Series, DataFrames, and Indices are fundamental Pandas data structures for storing tabular data and processing the data using vectorized operations.

### Series

A Series is a 1-D labeled array of data. We can think of it as columnar data.

#### Creating a new `Series` object
Below we create a `Series` object and look at its two components: 1) the array and 2) the index.

In [None]:
s = pd.Series([-1, 10, 2])
print("Series Object:", s, sep='\n')

# Data contained within the Series
print("Array Object:", s.array, sep='\n')

# The Index of the Series
print("Index Object:", s.index, sep='\n')

We can create a `Series` object by providing a custom index.

In [None]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
print("Series Object:", s, sep='\n')
print("Array Object:", s.array, sep='\n')
print("Index Object:", s.index, sep='\n')

We can reassign the index of a `Series` to a new index.

In [None]:
s.index = ["first", "second", "third"]
s.index

#### Selection in Series
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [None]:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])

# Selection using a single label
# Notice how the return value is a single array element
print(s["a"])

# Selection using a list of labels
# Notice how the return value is another Series
print(s[["a", "c"]])

In [None]:
# Filter condition: select all elements greater than 0
print(s>0)

# Selection using a filtering condition
print(s[s>0])
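As a small sketch of what the filtering condition does (the names `toy`, `mask`, and `positives` are made up for illustration): the comparison produces one boolean per element, and indexing with that boolean Series keeps only the entries marked `True`.

```python
import pandas as pd

# Hypothetical toy Series, mirroring the one used above
toy = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# The comparison is vectorized: it yields one boolean per element
mask = toy > 0
print(mask.tolist())              # [True, False, False, True]

# Indexing with the mask keeps only the entries where the mask is True
positives = toy[mask]
print(positives.index.tolist())   # ['a', 'd']
```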

### DataFrame

A DataFrame is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a DataFrame can be created from scratch or loaded from a file.

#### Loading data from a file into a `DataFrame`
`Pandas` has a number of very useful file-reading tools for loading data into a `DataFrame`. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object.

In [None]:
elections = pd.read_csv("data/elections.csv")
elections

#### Creating a new `DataFrame` object
We can also create a `DataFrame` in a variety of ways. Here we cover the following:
1. Using a list and column names
2. From a dictionary
3. From a Series

In [None]:
# Creating a DataFrame using a list and column name(s)
df_list_1 = pd.DataFrame([1, 2, 3], columns = ["Number"])
print(df_list_1)

print()

df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
print(df_list_2)

In [None]:
# Creating a DataFrame from a dictionary
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], "Price":[5.49, 3.99]})
print(df_dict_1)

print()

df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                   {"Fruit":"Orange", "Price":3.99}])
print(df_dict_2)

In [None]:
# Creating a DataFrame from a Series

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])

# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
print(df_ser)

print()

# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
print(df_ser)

print()

# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
print(ser_to_df)


In [None]:
# Creating a DataFrame from a csv file and specifying the index column
mottos = pd.read_csv("data/mottos.csv", index_col = "State")
mottos

In [None]:
elections.set_index("Candidate", inplace=True) # This sets the index to the "Candidate" column

In [None]:
elections

In [None]:
elections.reset_index(inplace=True)
elections
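A minimal sketch of the same index round trip on a made-up table (the `toy` DataFrame and its values are hypothetical, not taken from `elections.csv`). Without `inplace=True`, `set_index` and `reset_index` return new DataFrames and leave the original untouched.

```python
import pandas as pd

# Hypothetical two-row table used only for illustration
toy = pd.DataFrame({"Candidate": ["A", "B"], "Year": [2000, 2004]})

# set_index returns a new DataFrame whose row labels come from "Candidate"
by_candidate = toy.set_index("Candidate")
print(by_candidate.index.tolist())    # ['A', 'B']

# The original still has "Candidate" as an ordinary column
print(toy.columns.tolist())           # ['Candidate', 'Year']

# reset_index moves the labels back into a column and restores 0, 1, ...
restored = by_candidate.reset_index()
print(restored.columns.tolist())      # ['Candidate', 'Year']
```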

## Slicing in `DataFrame`

As a simple slicing example, consider the code below, which returns the first 5 rows of the `DataFrame`.

In [None]:
elections.loc[0:4]

We can also use the `head` method to return only the first few rows of a `DataFrame`.

In [None]:
elections.head()

In [None]:
elections.head(3)

Or we can use the `tail` method to get the last few rows.

In [None]:
elections.tail(5)

If we want a subset of the columns, we can also use `loc` to ask for just those.

In [None]:
elections.loc[0:4, "Year":"Party"]

### `loc`

`loc` selects items by row and column label.

In [None]:
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

In [None]:
elections.loc[[87, 25, 179], "Popular vote":"%"]

In [None]:
elections.loc[[87, 25, 179], "Popular vote"]

In [None]:
elections.loc[0, "Candidate"]

In [None]:
elections.loc[:, ["Year", "Candidate", "Result"]]

### `iloc`

`iloc` selects items by integer row and column position.

In [None]:
elections.iloc[[1, 2, 3], [0, 1, 2]]

In [None]:
elections.iloc[[1, 2, 3], 0:3]

In [None]:
elections.iloc[[1, 2, 3], 1]

In [None]:
elections.iloc[:, 0:3]
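One difference between the two accessors is easy to miss: a `loc` slice is label-based and includes both endpoints, while an `iloc` slice is position-based and excludes the stop, like ordinary Python slicing. The sketch below uses a made-up `toy` DataFrame to show this.

```python
import pandas as pd

# Hypothetical table with the default integer index 0..4
toy = pd.DataFrame({"x": [10, 20, 30, 40, 50], "y": list("abcde")})

# loc slices by label and includes both endpoints: labels 0, 1, 2
by_label = toy.loc[0:2]

# iloc slices by position and excludes the stop: positions 0, 1
by_position = toy.iloc[0:2]

print(len(by_label), len(by_position))   # 3 2
```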

### `[]`

We could technically do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it does essentially the same thing.

If we provide a slice of row numbers, we get the numbered rows.

In [None]:
elections[3:7]

If we provide a list of column names, we get the listed columns.

In [None]:
elections[["Year", "Candidate", "Result"]].tail(5)

And if we provide a single column name we get back just that column.

In [None]:
elections["Candidate"].tail(5)
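The three behaviors of `[]` can be compared side by side on a small made-up table (the `toy` DataFrame is hypothetical). Note that a single column name gives back a `Series`, while a one-element list of names gives back a `DataFrame`.

```python
import pandas as pd

# Hypothetical two-row table used only for illustration
toy = pd.DataFrame({"Year": [2016, 2020], "Candidate": ["A", "B"]})

single = toy["Candidate"]      # a column name gives a Series
listed = toy[["Candidate"]]    # a list of names gives a DataFrame
rows = toy[0:1]                # a slice selects rows by position

print(type(single).__name__, type(listed).__name__)   # Series DataFrame
print(rows.shape)                                     # (1, 2)
```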

## An annoying little puzzle

In [None]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

In [None]:
# weird[1] #try to predict the output

In [None]:
# weird["1"] #try to predict the output

In [None]:
# weird[1:] #try to predict the output

In [None]:
mottos.index

In [None]:
mottos.columns

## Dataset - California baby names

Let's load the California baby names again.

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

## Conditional Selection

In [None]:
# Ask yourself: why is :9 the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows

In [None]:
# Notice how we have exactly 10 elements in our boolean array argument
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

In [None]:
# First, use a logical condition to generate a boolean array
logical_operator = (babynames["Sex"] == "F")
logical_operator

In [None]:
# Then, use this boolean array to filter the DataFrame
babynames[logical_operator]

Boolean array selection also works with `loc`!

In [None]:
babynames.loc[babynames["Sex"] == "F"]

In [None]:
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
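A short sketch of how conditions combine, using a made-up `toy` table (its values are hypothetical). The element-wise operators are `&` (and), `|` (or), and `~` (not). Because `&` and `|` bind more tightly than `==` and `<`, each comparison needs its own parentheses when written inline; naming the masks first, as below, avoids the issue.

```python
import pandas as pd

# Hypothetical rows, not taken from the baby names data
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "F"],
    "Year": [1995, 2001, 2005, 1999],
})

is_female = toy["Sex"] == "F"
before_2000 = toy["Year"] < 2000

# Rows where both conditions hold
both = toy[is_female & before_2000]
print(both.index.tolist())     # [0, 3]

# Rows where at least one condition holds
either = toy[is_female | before_2000]
print(either.index.tolist())   # [0, 2, 3]
```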

In [None]:
(
    babynames[(babynames["Name"] == "Bella") | 
              (babynames["Name"] == "Alex") |
              (babynames["Name"] == "Ani") |
              (babynames["Name"] == "Lisa")]
)
# Note: The parentheses surrounding the code make it possible to break the code onto multiple lines for readability

In [None]:
names = ["Bella", "Alex", "Ani", "Lisa"]
babynames[babynames["Name"].isin(names)]
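To check that `isin` is a shorthand for the chained `|` comparisons, the sketch below builds both masks on a made-up `toy` table (hypothetical names) and compares them.

```python
import pandas as pd

# Hypothetical names used only for illustration
toy = pd.DataFrame({"Name": ["Bella", "Noah", "Alex", "Lisa", "Nina"]})
names = ["Bella", "Alex", "Ani", "Lisa"]

# One comparison per name, combined with element-wise "or"
chained = (
    (toy["Name"] == "Bella")
    | (toy["Name"] == "Alex")
    | (toy["Name"] == "Ani")
    | (toy["Name"] == "Lisa")
)

# The same mask in a single call
with_isin = toy["Name"].isin(names)

print(chained.tolist() == with_isin.tolist())   # True
print(toy[with_isin]["Name"].tolist())          # ['Bella', 'Alex', 'Lisa']
```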

In [None]:
babynames[babynames["Name"].str.startswith("N")]

In [None]:
bella_counts = babynames[babynames["Name"] == "Bella"]["Count"]
bella_counts

In [None]:
# Average number of babies named Bella each year

np.mean(bella_counts)

In [None]:
# Max number of babies named Bella born in a given year

max(bella_counts)
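The same summaries are also available as `Series` methods. The sketch below uses a made-up `counts` Series (hypothetical values) to show that the method and function forms agree.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly counts used only for illustration
counts = pd.Series([5, 9, 7])

# Series methods and the NumPy / built-in functions give the same results
print(counts.mean(), np.mean(counts))   # 7.0 7.0
print(counts.max(), max(counts))        # 9 9
```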

In [None]:
babynames

In [None]:
babynames.shape

In [None]:
babynames.size
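For reference, `shape` is a `(rows, columns)` tuple and `size` is the total number of cells, which is rows times columns. A minimal sketch on a made-up `toy` table:

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

print(toy.shape)   # (3, 2)
print(toy.size)    # 6
```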

In [None]:
babynames.describe()

In [None]:
babynames["Sex"].describe()

In [None]:
babynames.sample()

In [None]:
babynames.sample(5).iloc[:, 2:]

In [None]:
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:, 2:]

In [None]:
babynames["Name"].value_counts()

In [None]:
babynames["Name"].unique()
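A small sketch contrasting the two calls above on a made-up `names` Series (hypothetical values): `value_counts` counts how many rows hold each distinct value, while `unique` lists the distinct values in order of first appearance.

```python
import pandas as pd

# Hypothetical names used only for illustration
names = pd.Series(["Ava", "Mia", "Ava", "Zoe", "Ava"])

# Number of rows per distinct value, most frequent first
counts = names.value_counts()
print(counts["Ava"])              # 3

# Distinct values, in order of first appearance
print(names.unique().tolist())    # ['Ava', 'Mia', 'Zoe']
```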

In [None]:
babynames["Name"].sort_values()

In [None]:
babynames.sort_values(by = "Count", ascending = False)

_Note:_ the outer parentheses in the code below aren't strictly necessary, but they make it valid syntax to break the chained method calls across separate lines, which helps readability. The example below finds the top 5 most popular names in California in 2021.

In [None]:
# Sort names by count in year 2021
(
    babynames[babynames["Year"] == 2021]
    .sort_values("Count", ascending = False)
    .head()
)

In [None]:
babynames.sort_values("Name", ascending = False)

In [None]:
# Here, the `key` lambda receives the whole "Name" column as a Series, `x`, and returns the length of each name

babynames.sort_values("Name", key=lambda x: x.str.len(), ascending = False).head(5)
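A minimal sketch of the `key` argument on a made-up `toy` table (hypothetical names and counts). The function passed as `key` is called once with the entire column and must return a Series of the same length; the rows are then ordered by those returned values.

```python
import pandas as pd

# Hypothetical rows used only for illustration
toy = pd.DataFrame({
    "Name": ["Al", "Christopher", "Maria"],
    "Count": [3, 1, 2],
})

# Sort by name length (2, 11, 5) rather than alphabetically
by_length = toy.sort_values(
    "Name", key=lambda col: col.str.len(), ascending=False
)
print(by_length["Name"].tolist())   # ['Christopher', 'Maria', 'Al']
```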

---

### An alternate approach is to create a temporary column corresponding to the length

In [None]:
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)

In [None]:
# Sort by the temporary column
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
babynames.head(5)

In [None]:
# Drop the `name_lengths` column
babynames = babynames.drop("name_lengths", axis = 'columns')
babynames.head(5)

We can also use the `Series.map` method if we want to apply an arbitrary function to each value. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurrences of "ea".

In [None]:
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
babynames.head()

In [None]:
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis = 'columns')
babynames.head(5)
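The `map` step can be checked by hand on a tiny made-up Series (the three names below are hypothetical inputs). `map` calls the function once per element, and `str.count` is case-sensitive, so only lowercase "dr" and "ea" are counted.

```python
import pandas as pd

def dr_ea_count(string):
    # Count non-overlapping occurrences of "dr" and "ea" in one name
    return string.count("dr") + string.count("ea")

# Hypothetical names used only for illustration
names = pd.Series(["Deandrea", "Andrea", "Lisa"])

# map applies dr_ea_count to each element and returns a new Series
counts = names.map(dr_ea_count)
print(counts.tolist())   # [3, 2, 0]
```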

---

## Female name whose popularity has dropped the most

In this exercise, let's find the female name whose popularity has dropped the most since its peak. As an example of a name that has fallen into disfavor, consider "Jennifer", visualized below.

Note: We won't cover plotly in lecture until after Lisa covers EDA and Regex.

Since we're only working with female names, let's create a DataFrame with only female names to simplify our later code.

In [None]:
female_babynames = babynames[babynames["Sex"] == "F"]
female_babynames

In [None]:
fig = px.line(female_babynames[female_babynames["Name"] == "Jennifer"],
              x = "Year", y = "Count")
fig.update_layout(font_size = 18)

In [None]:
female_babynames = female_babynames.sort_values(["Year", "Count"])
female_babynames.head()

In [None]:
fig = px.line(female_babynames[female_babynames["Name"] == "Jennifer"],
              x = "Year", y = "Count")
fig.update_layout(font_size = 18)

To answer this question, we'll need a mathematical definition for the change in popularity of a name.

For the purposes of lecture, let’s use the RTP, or ratio_to_peak. This is the current count of the name divided by its maximum ever count.

Getting the max Jennifer is easy enough.

In [None]:
max_jenn = max(female_babynames[female_babynames["Name"] == "Jennifer"]["Count"])
max_jenn

And we can get the most recent Jennifer count with `iloc[-1]`, which works because `female_babynames` was sorted by year above.

In [None]:
curr_jenn = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
curr_jenn

In [None]:
curr_jenn / max_jenn

We can also write a function that produces the ratio_to_peak for a given series.

Here, for clarity, let's regenerate the Jennifer counts as a Series, `jenn_counts_ser`, and pass it to the function.

In [None]:
def ratio_to_peak(series):
    return series.iloc[-1] / max(series)

In [None]:
jenn_counts_ser = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)

We can try out various names below: 

In [None]:
ratio_to_peak(female_babynames[female_babynames["Name"] == "Jessica"]["Count"])
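To check the arithmetic by hand, here is a sketch of `ratio_to_peak` on a made-up Series indexed by year (the counts are hypothetical). The function assumes the values are in chronological order, so that `iloc[-1]` is the most recent count.

```python
import pandas as pd

def ratio_to_peak(series):
    # Most recent value divided by the largest value ever seen
    return series.iloc[-1] / max(series)

# Hypothetical counts, ordered by year
counts = pd.Series([10, 40, 20, 4], index=[1990, 2000, 2010, 2020])

# Peak is 40 and the latest count is 4, so the ratio is 4 / 40
print(ratio_to_peak(counts))   # 0.1
```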

### Approach 1: Naive For Loop

As a first approach, we can try to use a for loop.

In [None]:
%%time
# Build a dictionary mapping each name to its ratio to peak (RTP)
# e.g. rtps["Jennifer"] should be 0.01500
rtps = {}
for name in female_babynames["Name"].unique()[0:100]:
    counts_of_current_name = female_babynames[female_babynames["Name"] == name]["Count"]
    if counts_of_current_name.size > 0:
        rtps[name] = ratio_to_peak(counts_of_current_name)
    
# Convert to series
rtps = pd.Series(rtps) 
rtps

In [None]:
rtps.sort_values()