## Discussion 3: Pandas

In this discussion, we will cover some common operations that you will use in Pandas, including `groupby`, aggregates, `filter`, and `apply`. 

For a more detailed version of the questions in this section, please see the [discussion worksheet](http://www.ds100.org/sp20/resources/assets/discussions/disc03.pdf).

The two cells below download the relevant data and load it into two `DataFrame`s, `babynames` and `elections`.

In [None]:
import os.path
import urllib.request
import zipfile

import pandas as pd

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # skip the download if the file is already present
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
# Read the California file straight out of the archive, closing both handles afterwards
with zipfile.ZipFile(local_filename, 'r') as zf, zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head(5)

In [None]:
elections = pd.read_csv("elections.csv")
elections.head(5)

### Elections Data

In this first sequence of questions, we'll analyze the elections data. These first three exercises correspond to questions 4a-4c on the discussion worksheet.

#### Question 4a

Using `groupby.agg` or one of the shorthand methods (`groupby.min`, `groupby.first`, etc.), create a `Series` named `best_result_by_party` that gives the highest percentage of the vote ever attained by each party. For example, `best_result_by_party['Libertarian']` should return 3.3. The order of your `Series` does not matter.
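As a hint, here is the general pattern on a small made-up table. The party labels, the `Pct` column name, and all of the numbers are hypothetical, so substitute the column names that actually appear in `elections`. Selecting one column after `groupby` and then aggregating gives a `Series` indexed by the group key.

```python
import pandas as pd

# Hypothetical toy data; this is not the real elections table.
toy = pd.DataFrame({
    "Party": ["A", "B", "A", "B", "C"],
    "Year": [2000, 2000, 2004, 2004, 2004],
    "Pct": [48.0, 51.0, 52.5, 46.0, 1.5],
})

# Group by the key, pick one column, then aggregate with a shorthand method.
best = toy.groupby("Party")["Pct"].max()

# The same aggregation spelled with agg.
best_agg = toy.groupby("Party")["Pct"].agg("max")

print(best)
```

Both spellings produce the same `Series`, so `best["A"]` and `best_agg["A"]` are each the largest `Pct` value among the rows for party `A`.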

In [None]:
best_result_by_party = elections
... # FILL ME IN
best_result_by_party

#### Question 4b

Again using `groupby.agg` or one of its shorthand methods, create a `DataFrame` named `last_year_by_party` that gives each party's result in its most recent year of participation, with `Party` as its index. For example, `last_year_by_party.query("Party == 'Whig'")` should give you a row showing that the Whigs last participated in an election in 1852 with Winfield Scott as their candidate, earning 44% of the vote. This might take more than one line of code. Write your answer below.
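One possible approach, sketched on made-up data, is to sort the rows so that the most recent year comes first and then take the first row of each group. The column names `Candidate` and `Pct` and every value below are hypothetical. Note that `groupby.first` returns the first non-null entry in each column, so this sketch assumes the table has no missing values.

```python
import pandas as pd

# Hypothetical toy data; this is not the real elections table.
toy = pd.DataFrame({
    "Party": ["A", "B", "A", "B", "C"],
    "Year": [2000, 2000, 2004, 2004, 1996],
    "Candidate": ["a1", "b1", "a2", "b2", "c1"],
    "Pct": [48.0, 51.0, 52.5, 46.0, 1.5],
})

# groupby keeps the row order within each group, so after sorting by Year
# in descending order, first() picks each party's most recent row.
latest = toy.sort_values("Year", ascending=False).groupby("Party").first()

print(latest)
```

The result is a `DataFrame` whose index is the grouping column, which is why no separate `set_index` call is needed here.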

In [None]:
last_year_by_party = elections
... # FILL ME IN
last_year_by_party

#### Question 4c

Using `filter`, create a `DataFrame` named `major_party_results_since_1988` that includes election results from 1988 onward, but keeps a row only if the Party it belongs to has earned at least 1% of the popular vote in ***any*** election since 1988. For example, you should not include the 1988 "New Alliance" candidate, since this party has not earned 1% of the vote in any election since 1988. However, you should include the 1988 "Libertarian" candidate, who earned only 0.47% of the vote that year, because in 2016 the Libertarian candidate Gary Johnson had 3.3% of the vote.
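To see how `groupby.filter` behaves, consider this sketch on made-up data (the `Pct` column name and all values are hypothetical). The function passed to `filter` receives each group as a `DataFrame` and returns a single boolean; when it returns `True`, every row of that group is kept, including rows that would fail the test on their own.

```python
import pandas as pd

# Hypothetical toy data; this is not the real elections table.
toy = pd.DataFrame({
    "Party": ["A", "A", "B", "B", "B", "C"],
    "Year": [1992, 2016, 1980, 1992, 2016, 1992],
    "Pct": [0.5, 3.0, 30.0, 0.2, 0.4, 45.0],
})

# Restrict to the years of interest first, so that older results
# (such as party B's 1980 row) cannot influence the group test.
recent = toy[toy["Year"] >= 1988]

# Keep every row of a party if any of its remaining rows reaches 1%.
kept = recent.groupby("Party").filter(lambda group: group["Pct"].max() >= 1.0)

print(kept)
```

Party `A` keeps its 0.5% row because its 2016 row passes the threshold, while party `B` is dropped entirely because its only strong result predates the cutoff.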

In [None]:
major_party_results_since_1988 = elections
... # FILL ME IN
major_party_results_since_1988

### Baby Names Data

Now we'll turn our attention to the baby names dataset (the `babynames` `DataFrame` we loaded above). This section corresponds to exercises 4d-4e on the discussion worksheet.

#### Question 4d

Create a `Series` named `female_names_since_2000_count` which gives the total number of occurrences of each name for female babies born in California in the year 2000 or later. The index should be the name, and the value should be the total number of births. Your `Series` should be sorted in decreasing order of count. For example, your first row should have the index "Emily" and the value 49605, because 49,605 Emilys have been born in California since the year 2000.
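The pattern here combines a boolean mask, a grouped sum, and a sort. The sketch below uses the same column names as `babynames`, but the rows themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows using the babynames column names.
toy = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "F"],
    "Year": [1999, 2001, 2005, 2005, 2005],
    "Name": ["Ann", "Ann", "Bea", "Ann", "Ann"],
    "Count": [10, 7, 20, 3, 5],
})

# Combine two conditions with &; each needs its own parentheses.
mask = (toy["Sex"] == "F") & (toy["Year"] >= 2000)

# Sum the counts per name, then sort from largest to smallest.
totals = toy[mask].groupby("Name")["Count"].sum().sort_values(ascending=False)

print(totals)
```

The 1999 row and the male row are excluded by the mask before any grouping happens, so they do not contribute to the totals.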

In [None]:
female_names_since_2000_count = babynames
... # FILL ME IN
female_names_since_2000_count

#### Question 4e

Using `groupby`, create a `Series` called `count_for_names_2018` listing all baby names from 2018 in decreasing order of popularity. The result should not be broken down by gender! If a name is used by both male and female babies, the number you provide should be the total across both genders. For example, `count_for_names_2018["Noah"]` should be the number 2567 because in 2018 there were 2567 Noahs born (12 female and 2555 male).
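The choice of grouping key decides whether the sexes are merged. The sketch below, on made-up rows that reuse the `babynames` column names, contrasts grouping by `Name` alone with grouping by both `Name` and `Sex`.

```python
import pandas as pd

# Hypothetical toy rows using the babynames column names.
toy = pd.DataFrame({
    "Sex": ["F", "M", "M", "F"],
    "Year": [2018, 2018, 2018, 2017],
    "Name": ["Sam", "Sam", "Lee", "Sam"],
    "Count": [4, 9, 6, 100],
})

in_2018 = toy[toy["Year"] == 2018]

# Grouping by Name alone adds the female and male rows together.
by_name = in_2018.groupby("Name")["Count"].sum().sort_values(ascending=False)

# Grouping by Name and Sex keeps them apart (not what this question asks for).
by_name_and_sex = in_2018.groupby(["Name", "Sex"])["Count"].sum()

print(by_name)
print(by_name_and_sex)
```

With a single key the result has one entry per name, while the two-key version has a `MultiIndex` with one entry per name and sex pair.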

In [None]:
count_for_names_2018 = babynames
... # FILL ME IN
count_for_names_2018