# More Data Wrangling

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get
from datetime import datetime
from dotenv import dotenv_values, find_dotenv
import numpy as np 
import pandas as pd 

## Working with Data

Many times, when we are working with DataFrames and tabular data, each row represents a person or entity. That is, the **unit of observation** is a person and the variables all describe that person. We've also looked at data where the unit of observation was a US state. However, this is not always the case. You need to be careful and think through exactly what each row represents, and transform your data if it needs to be in another form. Let's take the merged anxiety and Census data as an example. This is available in the `anxiety_census_data.csv` file, and contains the proportion of people in each state who reported feeling anxious all or most of the last 7 days, for a series of days from May 1, 2022 to June 25, 2022.

In [None]:
anxiety_census = pd.read_csv('anxiety_census_data.csv')
anxiety_census.head()

In this dataset, the unit of observation is a **state-date** combination. A unique row is represented by a state on a particular date. 
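
One way to check a claim like this is to confirm that no state-date pair appears more than once. The following is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are assumptions for illustration, not the real data); the same `duplicated` call could be pointed at `anxiety_census`.

```python
import pandas as pd

# Toy stand-in for anxiety_census; the values are made up for illustration.
toy = pd.DataFrame({
    "state": ["Maryland", "Maryland", "Virginia", "Virginia"],
    "time_value": ["2022-05-01", "2022-05-02", "2022-05-01", "2022-05-02"],
    "value": [0.15, 0.16, 0.14, 0.13],
})

# If state-date is the unit of observation, no (state, time_value) pair repeats.
has_duplicates = toy.duplicated(subset=["state", "time_value"]).any()

# Each state should then contribute one row per date.
rows_per_state = toy.groupby("state").size()

print(has_duplicates)
print(rows_per_state)
```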

<h4 style="color:red;font-weight:bold">Question 1: Create a subset of the <code>anxiety_census</code> so that it only contains Maryland and Virginia. Include only the variables <code>state</code>, <code>time_value</code>, and <code>value</code>. Call this new DataFrame <code>md_va</code>.</h4>

## Reshaping Data

Suppose we want to graph the data for these two states together, so that we can see how the trends differ between them. To plot both states with the `plot.line` method, we actually need them in two different columns. That is, we need one variable to be the value for Maryland on a given day and one variable to be the value for Virginia on that day. 

In other words, right now, each row represents a **state-date pair**. For example, the first row is Maryland on May 1, 2022, the second row is Maryland on May 2, 2022, and so on, eventually reaching Virginia on May 1, 2022, then Virginia on May 2, 2022, and so on. Instead, we want data in which each row represents a **date**, with separate columns for the values of Maryland and Virginia on that day. 

In [None]:
wide_data = md_va.pivot(index = 'time_value', columns = 'state', values = 'value')
wide_data.head()
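
To see what `pivot` does on a table small enough to read in full, here is a sketch using made-up values (the `toy_long` DataFrame is an assumption for illustration). It also shows that `pivot` refuses to reshape when an index/column pair is repeated, which is one reason to confirm the unit of observation first.

```python
import pandas as pd

# Made-up long-format data: one row per state-date pair.
toy_long = pd.DataFrame({
    "state": ["Maryland", "Maryland", "Virginia", "Virginia"],
    "time_value": ["2022-05-01", "2022-05-02", "2022-05-01", "2022-05-02"],
    "value": [0.15, 0.16, 0.14, 0.13],
})

# One row per date, one column per state.
toy_wide = toy_long.pivot(index="time_value", columns="state", values="value")
print(toy_wide)

# Repeating a (time_value, state) pair makes the reshape ambiguous,
# so pivot raises a ValueError instead of guessing which value to keep.
toy_dupes = pd.concat([toy_long, toy_long.iloc[[0]]], ignore_index=True)
try:
    toy_dupes.pivot(index="time_value", columns="state", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```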


<details>
<summary><b>Side note on indices in Pandas</b></summary>

<br>
<div>
In Pandas DataFrames, the <code>index</code> attribute assigns a special label to each row in the DataFrame. Most of the time, this index will just be a unique row number (0, 1, 2, 3...), but it might also be a date or a list of categories.

You can access a DataFrame index like this:

``` python
DataFrame.index
```

...And you can turn an index into a regular column by running <code>reset_index()</code>:

```python
DataFrame.reset_index()
```
</div>

</details>



Note that the dates have been turned into the indices. If we wanted to include them as a variable instead, we could have used `reset_index` to reset the index back to incrementing up from 0, but we'll leave it as it is for now because it is useful for graphing purposes.

In [None]:
# If we had wanted to reset the index.
wide_data.reset_index().head()

Now that we've gotten our data into this format, we can create our line plot and assign a color to each state. We use a dictionary to specify which column corresponds to which color, and we adjust the `figsize` argument to make our plot a little bit wider than it is tall. Finally, we add the `xlabel` and `ylabel` arguments to give descriptive labels to the x and y axes of the plot:

In [None]:
wide_data.plot.line(color = {'Maryland':'red', 'Virginia':'blue'}, 
                    figsize=(10, 5),
                    xlabel='Date', 
                    ylabel='% Anxiety'
                    )


## Time and Date conversion

Depending on how wide your display is, you might notice that the `time_value` labels appear to run together a bit on the x-axis. This is because Python is interpreting these index values as strings instead of as dates:

In [None]:
anxiety_census.dtypes

To get more sensible date handling, we'll need to explicitly declare that `time_value` contains dates. We can do that in Pandas with the `pd.to_datetime` function:

In [None]:
anxiety_census['time_value'] = pd.to_datetime(anxiety_census['time_value'])
anxiety_census.dtypes
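
As a small illustration of what the conversion buys us, the sketch below parses a few made-up date strings (the `dates_as_text` Series is an assumption for illustration). Once the values are real dates, date-aware operations such as extracting the month or measuring a span in days become available.

```python
import pandas as pd

# Dates stored as plain text, as they are right after reading a CSV.
dates_as_text = pd.Series(["2022-05-01", "2022-05-02", "2022-06-25"])
dates_parsed = pd.to_datetime(dates_as_text)

# Compare the two data types.
print(dates_as_text.dtype)
print(dates_parsed.dtype)

# Date-aware operations only work after conversion.
months = dates_parsed.dt.month.tolist()
span_in_days = (dates_parsed.max() - dates_parsed.min()).days
print(months)
print(span_in_days)
```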

From here, I'll run the same code we executed above to subset the data for Maryland and Virginia and make a line plot, but notice the difference in the axes now that we've clarified the data type for the date column:

In [None]:
md_va = anxiety_census.loc[anxiety_census['state'].isin(["Maryland", "Virginia"]) ,['state','time_value','value']]

wide_data = md_va.pivot(index = 'time_value', columns = 'state', values = 'value')

wide_data.plot.line(color = {'Maryland':'red', 'Virginia':'slateblue'}, 
                    figsize=(10, 5),
                    xlabel='Date', 
                    ylabel='% Anxiety',
                    )



In general, you'll find you get better behavior from Pandas if you do this kind of date conversion before you start your analyses.

<h4 style="color:red;font-weight:bold">Question 2: Create a line graph comparing the trends for Maryland, Virginia, and New York.</h4>

Note: you can use any of the colors in [this list](https://matplotlib.org/stable/gallery/color/named_colors.html#css-colors) for your plot. You can also use a [hexadecimal color value](https://en.wikipedia.org/wiki/Web_colors#Extended_colors) to create your own palette, and there are some [online sources](https://imagecolorpicker.com/) that will automatically give you the hex code to recreate the color from an image. You can try using the palette below in a plot to see some custom colors:


In [None]:
custom_pal = {"Maryland": "#8EF84C", 
         "New York" : "#604CF8",
         "Virginia" : "#F84CE4"
         }


## Going from Wide to Long

We can also go the opposite way for data that might require long format.

In [None]:
long_data = wide_data.melt(ignore_index = False)
long_data
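
The sketch below runs `melt` on a tiny made-up wide table (the `toy_wide` DataFrame is an assumption for illustration) and then pivots the result back, to show that the two operations undo each other. Note that `melt` names the new label column after the name of the columns index (`state` here) and falls back to `variable` when that name is missing.

```python
import pandas as pd

# Made-up wide-format data: one row per date, one column per state.
toy_wide = pd.DataFrame(
    {"Maryland": [0.15, 0.16], "Virginia": [0.14, 0.13]},
    index=pd.Index(["2022-05-01", "2022-05-02"], name="time_value"),
)
toy_wide.columns.name = "state"

# Wide to long: keep the dates in the index, stack the state columns.
toy_long = toy_wide.melt(ignore_index=False)
print(toy_long)

# Long back to wide: move the dates into a column, then pivot.
round_trip = toy_long.reset_index().pivot(
    index="time_value", columns="state", values="value"
)
print(round_trip)
```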

If we pair the plot command with `groupby`, we can generate a line plot from the long-formatted data just like we can with wide data:

In [None]:
long_data.groupby('state')['value'].plot.line(legend=True,
                                              color = custom_pal, 
                                              xlabel = "Date",
                                              ylabel = "Anxiety",
                                              figsize = (10, 3)
                                              )

## More Merging Practice

Let's try doing another merge. Let's say we want to take a look at anxiety trends for the month of May 2022, and try to see if there are any current events that may correlate with changes as the month goes on. In order to look at this, we might want to combine data from the NYT Archives with the data on anxiety trends.

Recall that we can pull all articles from the NYT Archives API for a given month. So, we'll start by getting all articles from May 2022.

In [None]:
# reading in our keys
keys = dotenv_values(find_dotenv('keys.env', usecwd=True))

nyt_key = keys['nyt_api_key']

In [None]:


base_url = "https://api.nytimes.com/svc/archive/v1/2022/5.json"
r = get(base_url, params= {'api-key':nyt_key}) 
archive_2022_05 = r.json()['response']['docs']


In [None]:
archive_2022_05[0].keys()

Next, we'll get the data that we want as a DataFrame. We want to retain some basic information like abstract and type of material and word count, as well as the publication date so that we can match on the anxiety data. 

In [None]:
variables = ['abstract', 'web_url','pub_date', 'type_of_material','word_count']
nyt_dict = {var:[article[var] for article in archive_2022_05] for var in variables}
nyt_df = pd.DataFrame(nyt_dict)
nyt_df.head()
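
The dictionary comprehension above can be easier to follow on a couple of records. This is a minimal sketch with hypothetical records (the `articles` list and its keys are made up, not real NYT data). It uses `.get` rather than square brackets, which is one possible choice: a missing key then becomes `None` instead of raising an error.

```python
import pandas as pd

# Hypothetical records shaped like API results; not real NYT data.
articles = [
    {"abstract": "First story", "word_count": 500, "extra": "ignored"},
    {"abstract": "Second story", "word_count": 1200, "extra": "ignored"},
]
variables = ["abstract", "word_count"]

# Outer loop: one dictionary entry per variable.
# Inner loop: one list element per article.
columns = {var: [article.get(var) for article in articles] for var in variables}

toy_df = pd.DataFrame(columns)
print(toy_df)
```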

Note that the `pub_date` variable is not in the same format as the date in the anxiety data. Notably, it also includes the specific time at which the article was published. Let's add a new variable called `date` that contains just the date information in the same format as in the anxiety dataset (namely, YYYY-MM-DD).

<h4 style="color:red;font-weight:bold">Question 3: Add a column to <code>nyt_df</code> called <code>date</code> that contains the date in the same format as it is in the anxiety dataset. That is, it should be in YYYY-MM-DD format.</h4>


> Hint: You can just extract the first 10 characters of the `pub_date` column to get the date without the time

Once you've created the `date` column, you'll want to convert it to a datetime using `pd.to_datetime`:

In [None]:
nyt_df['date'] = pd.to_datetime(nyt_df['date'])


<h4 style="color:red;font-weight:bold">Question 4: What does each row in the <code>nyt_df</code> DataFrame represent?</h4>

Answer:

Next, let's get the relevant data for just New York in the month of May. 

<h4 style="color:red;font-weight:bold">Question 5: Create a DataFrame called <code>ny_anxiety_census</code> that contains just the rows of <code>anxiety_census</code> that are for New York state.</h4>

Now that we've gotten a dataset with only values from New York, we'll further subset the data to only include the dates in May. 

In [None]:
ny_may = ny_anxiety_census[ny_anxiety_census.time_value <= '2022-05-31']
ny_may.head()

In [None]:
ny_may.shape

<h4 style="color:red;font-weight:bold">Question 6: What does each row in the <code>ny_may</code> DataFrame represent?</h4>

Answer: 

The `nyt_df` DataFrame and the `ny_may` DataFrame have different **units of observation**. So, if we want to combine them together, we have to be careful about how to do it. For example, if we want to make sure we only have one row per day (since we want to look at the change over the days), then we might want to first make sure that both DataFrames have the day as the unit of observation. To do this, we'll aggregate using `groupby` to get some summary measures that we'll track over the course of the month.
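
To illustrate the idea of changing the unit of observation, here is a sketch on made-up data (the `toy_articles` DataFrame is an assumption for illustration). It starts with one row per article and ends with one row per day, using `size` to count the rows in each group.

```python
import pandas as pd

# Made-up data with one row per article.
toy_articles = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-01", "2022-05-01", "2022-05-02"]),
    "type_of_material": ["News", "Op-Ed", "News"],
})

# Collapse to one row per day by counting the articles in each group.
per_day = toy_articles.groupby("date").size().reset_index(name="n_articles")
print(per_day)
```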

<h4 style="color:red;font-weight:bold">Question 7: Using <code>apply</code>, find the total number of News articles there were in each day of May.</h4>

We can also create a custom function to do more than one computation on each group at once.

In [None]:
def nyt_summary(x):
    return pd.Series([sum(x['type_of_material'] == 'News'), x.count().iloc[0]])

nyt_by_day = nyt_df.groupby('date')[['date','type_of_material']].apply(nyt_summary).reset_index()

nyt_by_day.columns = ['date', 'anxiety_news', 'total']

nyt_by_day.head()
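
An alternative to writing a custom function is the named aggregation form of `agg`, which names each output column as it is computed. The sketch below is one possible way to do this, shown on made-up data (the `toy_articles` DataFrame and the output names `news` and `total` are assumptions for illustration).

```python
import pandas as pd

# Made-up data with one row per article.
toy_articles = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-01", "2022-05-01", "2022-05-02"]),
    "type_of_material": ["News", "Op-Ed", "News"],
})

# Each keyword argument is an output column: (input column, function).
toy_by_day = (
    toy_articles.groupby("date")
    .agg(
        news=("type_of_material", lambda s: (s == "News").sum()),
        total=("type_of_material", "size"),
    )
    .reset_index()
)
print(toy_by_day)
```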


In [None]:
nyt_by_day.shape

Finally, we'll merge `ny_may` with `nyt_by_day`

In [None]:
ny_merged = pd.merge(ny_may, nyt_by_day, how='outer', left_on = 'time_value', right_on ='date' )
ny_merged.head()
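
With an outer merge, rows that appear in only one of the two DataFrames are kept and filled with missing values. A quick way to see how many rows matched is the `indicator` argument of `pd.merge`. The sketch below uses two small made-up tables (`left` and `right` are assumptions for illustration) whose dates only partly overlap.

```python
import pandas as pd

# Made-up tables whose dates only partly overlap.
left = pd.DataFrame({
    "time_value": pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-03"]),
    "value": [0.15, 0.16, 0.17],
})
right = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-02", "2022-05-03", "2022-05-04"]),
    "total": [120, 95, 110],
})

# indicator=True adds a _merge column saying where each row came from.
merged = pd.merge(left, right, how="outer",
                  left_on="time_value", right_on="date", indicator=True)
match_counts = merged["_merge"].value_counts()
print(merged)
print(match_counts)
```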

<h4 style="color:red;font-weight:bold">Question 8: In the month of May, did the number of News articles each day in the New York Times have any relationship with the reported anxiety on that day? Use a scatterplot to show the relationship.</h4>

*Note:* Remember, the method to create a scatterplot from a DataFrame is `.plot.scatter()` with two arguments: the x variable and y variable for the scatterplot.

> This isn't exactly a particularly interesting thing to look at because we wouldn't necessarily expect there to be a relationship on that day. Remember, the question asked whether the respondent had felt anxious in the past 7 days. We also aren't looking at the topics of the News articles. To do more sophisticated analyses, we would want to do some more cleaning and try thinking about extracting more information from things like the abstract.

## Overview

Steps to preparing data for analysis (particularly from APIs).

1. Obtain data using API or reading in from a CSV file. Helpful functions: `pd.read_csv`, `get`. 
2. Identify the type of data that you have. Is it a dictionary? A list? What does each item within the list or dictionary represent?
3. Develop a plan to extract the data that you want. Try getting just one, then think about how you might generalize it to be able to use list comprehension or dictionary comprehension. 
4. Create a DataFrame and do some cleanup of the data. Make sure the column names are meaningful and the types of variables are what you need them to be (e.g., numeric if they are numeric variables). Make sure you know what the unit of observation is.
5. Identify any additional data wrangling steps you might need to take. Do you need to join datasets together? Do you need to group and aggregate data?
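
The five steps above can be sketched end to end on a tiny example. Everything here is made up for illustration: the `payload` dictionary stands in for a parsed API response, and its keys and values are assumptions rather than the output of any real API.

```python
import pandas as pd

# Step 1: obtain the data. This hypothetical payload stands in for r.json().
payload = {"response": {"docs": [
    {"headline": "A", "date": "2022-05-01", "word_count": "500"},
    {"headline": "B", "date": "2022-05-01", "word_count": "700"},
    {"headline": "C", "date": "2022-05-02", "word_count": "300"},
]}}

# Step 2: identify the types. Here it is a list of dictionaries.
docs = payload["response"]["docs"]
print(type(docs), type(docs[0]))

# Step 3: extract one value first, then generalize with a comprehension.
first_headline = docs[0]["headline"]
variables = ["headline", "date", "word_count"]
records = {var: [doc[var] for doc in docs] for var in variables}

# Step 4: build the DataFrame and fix the variable types.
df = pd.DataFrame(records)
df["word_count"] = pd.to_numeric(df["word_count"])
df["date"] = pd.to_datetime(df["date"])

# Step 5: aggregate so that the unit of observation is a day.
words_per_day = df.groupby("date")["word_count"].sum()
print(words_per_day)
```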

# Extra code

Here's an example of reshaping FiveThirtyEight's state-level presidential election data to get a two-party vote share. 

The original data is in long format, with one row per candidate per state per cycle. We want to use this to calculate just Trump's and Biden's shares of the two-party vote in 2020. 

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/election-results/refs/heads/main/election_results_presidential.csv')

After reading in the data, our first step is to filter it to remove the following:

- rows other than the 2020 election
- any data on primary races
- data for candidates other than Trump or Biden
- any cases where `state_abbrev` is null (these are results for the entire country)
- the district-level results for Maine and Nebraska, which allocate some of their electoral votes by congressional district


In [None]:
data_2020 = data[(data['cycle'] == 2020) &  # 2020 cycle only
                 (data['stage'] == 'general') &  # general election only
                 (data['candidate_name'].isin(["Donald Trump", "Joe Biden"])) &
                 (data['state_abbrev'].notnull()) & # only state level results
                 (~data['state_abbrev'].str.contains('[0-9]', na=False)) # removing special Maine and Nebraska district-level results
                        ]
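
The last two conditions in the filter are easier to see on a short list. In this sketch the abbreviations are made up for illustration (the real district codes in the file may look different); the point is only that a regular expression such as `[0-9]` matches any value containing a digit, and that missing values need to be handled explicitly.

```python
import pandas as pd

# Hypothetical abbreviations; real district-level codes may differ.
abbrevs = pd.Series(["MD", "ME", "ME-1", "NE-2", None, "VA"])

# True wherever the text contains a digit; missing values count as False.
has_digit = abbrevs.str.contains("[0-9]", na=False)

# Keep rows that are not missing and do not contain a digit.
statewide = abbrevs[abbrevs.notnull() & ~has_digit]
print(statewide.tolist())
```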

Next, we'll remove some extraneous columns:

In [None]:
pres_vote = data_2020.loc[:, ['state','state_abbrev', 'candidate_name', 'votes']]

In [None]:
pres_vote.head()

Next, we'll want to get the total votes for each candidate in each state. We can do this using `groupby` and `sum`:

In [None]:
pres_vote_total = pres_vote.groupby(['candidate_name', 'state', 'state_abbrev']).sum()


pres_vote_total.head()

Next, we'll pivot the results from long to wide:

In [None]:
presvote_wide =pres_vote_total.reset_index().pivot(index = 'state', columns = 'candidate_name', values = 'votes')
presvote_wide.head()

Finally, we'll calculate the total number of two-party votes cast in each state by `sum`-ing along the row axis, then we'll create a variable called `trumpshare` that contains Donald Trump's proportion of the two-party vote in each state:

In [None]:
presvote_wide.loc[:, 'total'] = presvote_wide[["Donald Trump", "Joe Biden"]].sum(axis=1)
presvote_wide.loc[:, 'trumpshare'] = presvote_wide['Donald Trump'].div(presvote_wide['total'])

In [None]:
presvote_wide.head()
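
As a check on the arithmetic, here is the same calculation on made-up vote counts for two hypothetical states (the `toy_votes` DataFrame and the extra `bidenshare` column are assumptions for illustration). Because only two candidates are included, the two shares should add up to 1 in every state.

```python
import pandas as pd

# Made-up vote counts for two hypothetical states.
toy_votes = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "candidate_name": ["Donald Trump", "Joe Biden", "Donald Trump", "Joe Biden"],
    "votes": [40, 60, 75, 25],
})

# Long to wide: one row per state, one column per candidate.
toy_wide = toy_votes.pivot(index="state", columns="candidate_name", values="votes")

# Two-party total and each candidate's share of it.
toy_wide["total"] = toy_wide[["Donald Trump", "Joe Biden"]].sum(axis=1)
toy_wide["trumpshare"] = toy_wide["Donald Trump"] / toy_wide["total"]
toy_wide["bidenshare"] = toy_wide["Joe Biden"] / toy_wide["total"]
print(toy_wide)
```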