# Intermediate Project - SAAS Career Exploration - Part 1

For this project, you will be using the tools that you have learned so far on a real data science problem. You will be cleaning and analyzing a dataset to answer a research question, something that you will be doing for your entire career if you continue down this path.

This project will be done **in groups of 2-3**. Modern research is collaborative, so get used to it! If you are having trouble finding a partner (or partners), please message your family channels. The first part of the project is due **this Sunday (March 1st) at 3pm**.

Parts that you need to complete <span style="color:blue">will be written in blue</span> and have `#TODO` next to them.

The final product for this project will be a statistical model that answers a question posed about the data, along with a short description of how your model works and what its limitations are.

## 1. The Dataset

### 2016 US Election from Kaggle

https://www.kaggle.com/benhamner/2016-us-election/data

This dataset gives county-level results from the 2016 US presidential primaries. The results themselves are stored in the ``primaries`` dataframe, while information about each county and state is stored in the ``general`` dataframe.


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import random
import numpy as np
import pandas as pd

general = pd.read_csv("2016-us-election/county_facts.csv")
column_dict_df = pd.read_csv("2016-us-election/county_facts_dictionary.csv")
primaries = pd.read_csv("2016-us-election/primary_results.csv")

### 1.1. Cleaning general

First, we have to clean the data to make it a little bit easier for us to use. What's wrong with it now? <span style="color:blue">Well, let's start off by examining the `general` dataframe using the `.head()` method.</span>

In [None]:
# TODO: Your code here

Oh no! What are those column names? Why do we have states and counties in the same table? Why don't the states have abbreviations? Dealing with problems like these is called *data cleaning*, and is frequently one of the most important and most time-intensive parts of data science. Lucky for you, I've done the data cleaning already! <span style="color:blue">Skim through the code below just to get a general idea of what's going on, but please don't worry about every last detail of what it does.</span>

In [None]:
# Turn the county_facts_dictionary.csv file into a dictionary
column_dict = column_dict_df.set_index("column_name").to_dict()['description']

# Use that dictionary to rename the columns of general
general.columns = general.columns.to_series().map(lambda x: column_dict.get(x,x))

# Extract the rows corresponding to states from general (note that these are the rows with NaN in the 
# state_abbreviation column, minus the first row which is the whole US)
states = general[general['state_abbreviation'].isnull()][1:].reset_index(drop=True)

# Attach the state abbreviations to the states dataframe
states["state_abbreviation"] = general["state_abbreviation"].unique()[1:]

# Extract the rows corresponding to counties from general
counties = general[~general['state_abbreviation'].isnull()].reset_index(drop=True)
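
If you'd like to see the same idea on a smaller scale, here is a minimal sketch on a made-up table. The names `toy_general`, `toy_dict`, and `POP_CODE` are invented for illustration and are not part of the real dataset. It also swaps in `DataFrame.rename` for the `map` call above; both leave any column name that is missing from the dictionary unchanged.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for county_facts.csv: rows for the whole US and for whole
# states have no state abbreviation, while county rows do.
toy_general = pd.DataFrame({
    "area_name": ["United States", "Alabama", "Autauga County", "Alaska", "Juneau"],
    "state_abbreviation": [np.nan, np.nan, "AL", np.nan, "AK"],
    "POP_CODE": [1000, 300, 100, 200, 50],  # hypothetical coded column name
})
toy_dict = {"POP_CODE": "Population"}  # hypothetical code -> description

# Rename coded columns; names not in the dictionary are left alone.
toy_general = toy_general.rename(columns=toy_dict)

# True for the US row and the state rows, False for county rows.
is_summary_row = toy_general["state_abbreviation"].isnull()

# Skip the first summary row (the whole US) to keep only the states.
toy_states = toy_general[is_summary_row][1:].reset_index(drop=True)

# unique() lists values in order of first appearance, starting with NaN,
# so dropping the first entry leaves one abbreviation per state, in order.
toy_states["state_abbreviation"] = toy_general["state_abbreviation"].unique()[1:]

# Everything that is not a summary row is a county.
toy_counties = toy_general[~is_summary_row].reset_index(drop=True)
```

Note that attaching the abbreviations this way only works because each state's row comes right before that state's counties, so the two orderings line up.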

In [None]:
states.head()

In [None]:
counties.head()

### 1.2. Cleaning primaries

Next, let's look at the `primaries` dataframe.

In [None]:
primaries.head()

For future parts of this project, it will be much easier if we have only one row per county, with that row containing the votes for all the candidates. <span style="color:blue">Do this using a pivot table where the index is the state *and the county* (you can pass a list of multiple column names into the `index` argument), the columns are the candidates, and the values are the fraction of votes that each candidate received.</span> Hint: You can use the same function from lecture; just replace the arguments appropriately.
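
As a reminder of how a two-level index works, here is a small sketch on made-up data. It is not the answer to the TODO: the table `toy_votes` and its columns `region`, `town`, `flavor`, and `share` are invented for illustration, and it assumes the function from lecture was pandas' `pivot_table`.

```python
import pandas as pd

# Made-up long-format table: one row per (region, town, flavor).
toy_votes = pd.DataFrame({
    "region": ["North", "North", "North", "North", "South", "South"],
    "town": ["Springfield", "Springfield", "Shelbyville", "Shelbyville",
             "Springfield", "Springfield"],
    "flavor": ["vanilla", "chocolate", "vanilla", "chocolate",
               "vanilla", "chocolate"],
    "share": [0.6, 0.4, 0.3, 0.7, 0.5, 0.5],
})

# Passing a list to `index` gives one row per (region, town) pair, so the
# two towns named Springfield stay separate.
toy_pivot = toy_votes.pivot_table(
    index=["region", "town"],
    columns="flavor",
    values="share",
)
```

By default `pivot_table` averages the values that fall into the same cell. Here each (region, town, flavor) combination appears exactly once, so the average is just the original value.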

In [None]:
# TODO: Create the pivot table as described above.
pivot_table = #YOUR CODE HERE

In [None]:
# TODO: Run this cell after filling out the above to make the indices into columns.
pivot_table.reset_index(inplace = True)
pivot_table.head()

### 1.3. Unique counties

Why do we need to index by both state and county? Well, it turns out that there are tons of counties with duplicate names! We wouldn't want to accidentally combine all of the counties named "Calhoun County" across every state that has one. (There's a "Calhoun County" in eleven states!!) <span style="color:blue">Let's try to figure out how many unique county names there are; use the `.unique()` method on the `county` column to find the number of unique county names, and compare that with the number of rows in the pivot table.</span>
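
To see why the two counts can differ, here is a tiny sketch on a made-up table (the names `toy_towns`, `region`, and `town` are invented for illustration):

```python
import pandas as pd

# Made-up table: the same town name shows up in three different regions.
toy_towns = pd.DataFrame({
    "region": ["North", "North", "South", "West"],
    "town": ["Springfield", "Shelbyville", "Springfield", "Springfield"],
})

# .unique() returns an array of the distinct values in a column.
unique_names = toy_towns["town"].unique()

n_unique_names = len(unique_names)  # distinct town names
n_rows = len(toy_towns)             # distinct (region, town) rows
```

Here there are 2 distinct names but 4 rows, so grouping by name alone would merge three different towns into one.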

In [None]:
# TODO: Find the number of unique county names.
uniques =  # YOUR CODE HERE
len(uniques)

In [None]:
# TODO: Find the number of rows in pivot_table to see how many different counties there actually are.


### 1.4. Lake County

More than a third of the counties don't have a unique name! <span style="color:blue"> The county I'm from is called "Lake County" - how many different states could I be from?</span>
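
One possible approach is a boolean mask. The sketch below uses a made-up table again (`toy_towns`, `region`, and `town` are invented names), so you will need to adapt it to the real column names.

```python
import pandas as pd

# Made-up table: the same town name shows up in three different regions.
toy_towns = pd.DataFrame({
    "region": ["North", "North", "South", "West"],
    "town": ["Springfield", "Shelbyville", "Springfield", "Springfield"],
})

# Boolean mask: True for the rows whose town is named "Springfield".
is_springfield = toy_towns["town"] == "Springfield"

# Keep the region of each matching row, then count the distinct regions.
regions_with_springfield = toy_towns.loc[is_springfield, "region"]
n_regions = regions_with_springfield.nunique()
```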

In [None]:
# TODO: Find the number of states with a county named "Lake" (or the county you're from if you'd prefer).


### 1.5. Challenge problem! (Optional)

CHALLENGE PROBLEM: Find the county name that is duplicated the most often. If you can, try doing this without any loops!

In [None]:
# TODO: Your code here


## 2. Submission

**To submit, first save this file as a PDF by going to the top left and clicking File -> Download as -> PDF via LaTeX (.pdf), then fill out this form!**

https://docs.google.com/forms/d/e/1FAIpQLSfC959ud0v9C9vZBEM2U41ryAQA5DsgU4d56_BnSVCfYEsTZw/viewform?fbclid=IwAR0s64G8p3U2NNFgzNHTV6vuAxCLd1redLc6SEgW77gCNfx39p8VORhPkt4