# Data Wrangling

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

## Merging Data

Suppose we have data from two different sources and want to combine them into a more comprehensive dataset so we can look at relationships between variables. To do this, we **merge** the two datasets. First, we match observations on characteristics that identify the same entity (such as an individual, county, or state). Then, we bring the variables from both datasets into one combined table, so that each observation carries the information from both sources.

With all the different methods of data collection available nowadays, this is becoming more and more common. In this section, we'll go over how to handle data from multiple sources, and why we might want to use them together. 

In [None]:
from requests import get
import numpy as np 
import pandas as pd 
import yaml

In [None]:
# reading in our keys
with open('../../keys.yml', 'r') as file:
    keys = yaml.safe_load(file)

census_key = keys['census_api_key']

# Data

We want to combine two different data sources to explore the relationship between Trump's vote share in 2024 and other state-level demographic variables. To do this, we need to combine the following:
- State-level demographic data from the U.S. Census API.
- Data on the 2024 election from FiveThirtyEight.com. This data source has already been provided for you in the `pres_2024.csv` file.

Let's start by using the Census API to bring in a set of variables from the 2023 American Community Survey (ACS) 1-year Data Profile tables. We'll look at some employment and education characteristics of people by state. The variables we'll want to pull are:
- `NAME`: State name
- `DP02_0001E`: Total number of households in the state
- `DP03_0087E`: Mean family income in the state (dollars)
- `DP03_0002PE`: Percent of people 16 years and older in labor force
- `DP02_0068PE`: Percent of people 25 years and older with a Bachelor's Degree or higher
- `DP02_0066PE`: Percent of people 25 years and older with a Graduate or Professional degree

Note that we are using the Data Profile tables (every variable name starts with `DP`), so the base URL needs to point to that particular type of table.

In [None]:
year = 2023
census_base_url = f'https://api.census.gov/data/{year}/acs/acs1/profile'

census_params = {'get':'NAME,DP02_0001E,DP03_0087E,DP03_0002PE,DP02_0068PE,DP02_0066PE',
                 'for':'state:*',
                 'key':census_key}

r = get(census_base_url, params = census_params)
# Removing Puerto Rico (the last row of the response) due to lack of data.
people_by_state = r.json()[:-1]

In [None]:
# Make a data frame, then convert the appropriate columns to numeric
colnames = ['state', 'num_households','mean_income','percent_employed','percent_bachelors','percent_graduate', "stateid"]
census_df = pd.DataFrame(people_by_state[1:],  columns = colnames)
census_df[colnames[1:]] =  census_df[colnames[1:]].apply(pd.to_numeric)
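
To see why the cell above skips the first row and then converts columns, here is a minimal offline sketch. It assumes the API response is a list of lists whose first row holds the API's own column names and whose values all arrive as strings. The `sample_response` values are made up for illustration and are not real ACS figures.

```python
import pandas as pd

# Hypothetical stand-in for r.json(): a header row followed by one row per state.
# The numbers are invented for illustration only.
sample_response = [
    ["NAME", "DP02_0001E", "DP02_0068PE", "state"],
    ["Alabama", "2000000", "28.5", "01"],
    ["Alaska", "270000", "31.0", "02"],
]

# Skip the header row and supply friendlier column names.
sample_cols = ["state", "num_households", "percent_bachelors", "stateid"]
sample_df = pd.DataFrame(sample_response[1:], columns=sample_cols)

# Convert only the measurement columns; the name and the ID stay as strings.
numeric_cols = ["num_households", "percent_bachelors"]
sample_df[numeric_cols] = sample_df[numeric_cols].apply(pd.to_numeric)

print(sample_df.dtypes)
```

This sketch leaves `stateid` as a string so that leading zeros such as `"01"` survive. The cell above converts it to a number as well, which is also workable as long as you don't need the zero-padded form later.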


In [None]:
# Alternative method for the same result:
# keycols = ['state', 'num_households','mean_income','percent_employed','percent_bachelors','percent_graduate', 'stateid']
# census_dict = {key: [float(state[keycols.index(key)]) for state in people_by_state[1:]] for key in keycols[1:]}
# census_dict['state'] = [state[0] for state in people_by_state[1:]]
# census_df = pd.DataFrame(census_dict)

Now we'll import the state data and the election data.

In [None]:
states = pd.read_csv('states.csv')
pres_2024 = pd.read_csv("pres_2024.csv")


And take a peek at our three data sets:

In [None]:
states.head()

In [None]:
pres_2024.head()


In [None]:
census_df.head()

Note here that our end goal is to combine the `pres_2024` data and the `census_df` data, but the `pres_2024` dataset has a state abbreviation instead of the full name of each state. In order to combine these data sets, we'll need to merge one of the tables with the `states` data first and then merge using the abbreviation. So our steps are:

1. Create a merged data frame by combining `census_df` with `states` using the full name of each state.
2. Merge the data from step 1 with `pres_2024` using the state abbreviation.

## Merging Data

To use the information in these data sets together, we need to **merge** them using the shared values from each data frame.

The basic syntax here will be:
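
A minimal sketch of `pd.merge`, using two small toy tables rather than the assignment data. The names `left_df`, `right_df`, `toy_merged`, and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; names and values are made up for illustration.
left_df = pd.DataFrame({"name": ["Ohio", "Utah", "Iowa"],
                        "abbr": ["OH", "UT", "IA"]})
right_df = pd.DataFrame({"name": ["Ohio", "Utah", "Maine"],
                         "population_m": [11.8, 3.4, 1.4]})

# `on` names the shared key column; how="inner" keeps only the keys
# that appear in both tables.
toy_merged = pd.merge(left_df, right_df, on="name", how="inner")
print(toy_merged)
```

When the key column has a different name in each table, `pd.merge` accepts `left_on=` and `right_on=` in place of `on=`.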


<font color ='red'>**Q1: Use an inner join to combine `states` and `census_df` to create a new data frame called `merged_data`**</font>

<font color ='red'>**Q2: Merge the `pres_2024` data with `merged_data` to create a new data frame called `pres_merged`**</font>

<font color ='red'>**Q3: Use `pres_merged` to create a scatter plot showing the relationship between `percent_bachelors` and `percent_trump2024`**</font>

[The syntax for a scatter plot is](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html):
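
As a rough illustration of the linked method, the sketch below calls `DataFrame.plot.scatter` on a small toy data frame. The data frame `toy_df` and its columns are hypothetical, and the example assumes matplotlib is installed, since pandas uses it for plotting.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up.
toy_df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 60, 71, 80, 88],
})

# x and y are column names; the call returns a matplotlib Axes object
# that can be customized further.
ax = toy_df.plot.scatter(x="hours_studied", y="exam_score")
ax.set_title("Exam score by hours studied")
```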

<font color ='red'>**Q4: Do the same merging process you did in Q1 and Q2, but this time use an `outer` join. Call your results `pres_full`. How does this change the resulting data set?**</font>
> Remember that you can use `dataframe.shape` to get the dimensions of a pandas data frame.
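
One way to compare join types is to run both on the same inputs and look at the shapes and the missing values. The sketch below does this on toy data; `left_df`, `right_df`, `toy_inner`, and `toy_outer` are hypothetical names, not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy tables that share only some of their keys.
left_df = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right_df = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

toy_inner = left_df.merge(right_df, on="key", how="inner")
toy_outer = left_df.merge(right_df, on="key", how="outer")

# Compare the dimensions of the two results.
print(toy_inner.shape)
print(toy_outer.shape)

# Count missing values per column in the outer result.
print(toy_outer.isna().sum())
```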

### Group Operations

We'll use `groupby` to create a grouped data frame. Grouping by a variable causes aggregate operations like `describe` and `mean` to be computed separately within each group instead of over the entire data frame.

In [None]:
pres_group_regions = pres_full.groupby('Region')

This gives us descriptive statistics on Trump's 2024 vote percentage within each major geographic region:

In [124]:
pres_group_regions['percent_trump2024'].describe()

| Region | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| North Central | 12.0 | 55.356722 | 6.878321 | 43.474955 | 49.6943 | 56.441564 | 58.765581 | 66.956852 |
| Northeast | 9.0 | 42.787636 | 5.687365 | 32.319407 | 41.763118 | 43.314991 | 46.064631 | 50.372537 |
| South | 16.0 | 56.791832 | 9.833829 | 34.082823 | 50.822217 | 59.221939 | 64.265691 | 69.96685 |
| West | 13.0 | 50.644301 | 11.141164 | 37.480284 | 40.966044 | 50.591646 | 58.389443 | 71.598005 |


<font color ='red'>**Q5. Use `groupby` and `describe` to compare the `percent_bachelors` variable in states that Trump won in 2024 with states he lost.**</font>
> Note that you might need to create a new column to indicate Trump winning in 2024 before you can use `groupby` here
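
The general pattern of building an indicator column and then grouping on it can be sketched with unrelated toy data. Every name below (`toy_df`, `won`, `summary`, and the columns) is hypothetical.

```python
import pandas as pd

# Hypothetical toy data about four teams; values are made up.
toy_df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points_for": [30, 14, 21, 10],
    "points_against": [20, 17, 7, 24],
    "attendance": [500, 420, 610, 380],
})

# Comparing two columns yields a Boolean Series that can be stored as a flag.
toy_df["won"] = toy_df["points_for"] > toy_df["points_against"]

# Grouping on the flag gives one row of summary statistics per group.
summary = toy_df.groupby("won")["attendance"].describe()
print(summary)
```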

<font color ='red'>**Q6. Which party had a harder Senate map in 2024? Use the `senate_races_2024.csv` file to compare the states that had seats up for re-election to states that didn't have seats up for re-election.**</font>

> Note that the `senate_races_2024.csv` file just has a list of states where there was a Senate race, so think carefully about how you should join this with your existing data.

In [None]:
senate = pd.read_csv('senate_races_2024.csv')
senate.head()
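
When one table is only a list of keys, one possible approach (an assumption here, not the only valid one) is a left join with `indicator=True`, which records whether each row found a match. The sketch uses toy data; `all_items`, `subset`, and `flagged` are hypothetical names.

```python
import pandas as pd

# Hypothetical full table and a key-only table covering part of it.
all_items = pd.DataFrame({"item": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})
subset = pd.DataFrame({"item": ["b", "d"]})

# how="left" keeps every row of all_items; indicator=True adds a `_merge`
# column saying whether each row also appeared in subset.
flagged = all_items.merge(subset, on="item", how="left", indicator=True)

# Turn the indicator into a simple Boolean flag for later grouping.
flagged["in_subset"] = flagged["_merge"] == "both"
print(flagged)
```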