# Data cleaning


Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get
import re

import numpy as np 
import pandas as pd 

The following sections look at some strategies for cleaning real-world data. We'll use selected variables from the 2020 round of the <a href="https://electionstudies.org/data-center/2020-time-series-study/">American National Election Study</a>, which surveys eligible voters before and after each presidential election.

This subset of the data contains the following variables: 

| Variable       | Description                                                                  |
|----------------|------------------------------------------------------------------------------|
| partyid        | R's preferred party (7 category)                                             |
| id             | R's unique identifier                                                        |
| libcon         | R's liberal/conservative self placement (7 category)                         |
| biden_likes    | Open ended: what does R like about Joe Biden?                                |
| biden_dislikes | Open ended: what does R dislike about Joe Biden?                             |
| trump_likes    | Open ended: what  does R like about Donald Trump?                            |
| trump_dislikes | Open ended: what does R dislike about Donald Trump?                          |
| mipp_1         | Open ended: what is the most important political problem facing our country? |
| age_group      | Age (5 categories)                                                           |

In [None]:
data_file = "https://raw.githubusercontent.com/Neilblund/APAN/refs/heads/main/anes_selected.csv"
df = pd.read_csv(data_file)
df.head()

The first thing you might notice here is that many of the responses have `NaN`, which indicates missing or invalid responses. This is pretty common with survey data: some questions aren't asked because they're not relevant to all respondents, or respondents might hang up the phone, or just refuse to give a valid answer.

You can get a sense of the number of `NaN` values using the `.info` method: 

In [None]:
df.info()

Alternatively, using `isnull()` gives us a data frame of the same size with `True` and `False` values depending on whether it was a missing value or not. Then, `mean()` calculates the arithmetic mean for each column. Since Python treats `True` as `1` and `False` as `0`, the mean of all of these values is the same as the proportion of missing data for each column:

In [None]:
df.isnull().mean()
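
To see why the mean of a True/False frame gives a proportion, here is a minimal sketch on a small made-up data frame. The `toy` frame and its values are invented for illustration and are not part of the ANES file:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 4 rows, with 1 missing value in "a" and 2 in "b"
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", None, None, "y"],
})

# True where a value is missing, False otherwise
missing_flags = toy.isnull()

# the column means are the share of missing values in each column
missing_share = missing_flags.mean()
print(missing_share)

# the same idea as a count instead of a proportion
missing_count = missing_flags.sum()
print(missing_count)
```

Using `sum()` instead of `mean()` can be handy when you care about the number of missing responses rather than the share.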


<font color ='red'>**Question 1: the "like/dislikes" questions were left blank if a respondent said they couldn't identify anything they liked about a candidate. How would you show the relationship between party ID and being able to say anything positive about Trump?**</font>

You might notice that no `NaN` values were reported for the `partyid` and `libcon` variables. In reality, there are still non-responses for these questions; they're just given text labels instead of being left blank. We can see this by using the `value_counts` method.

(Note: we can use the `sort_index` method to sort these alphabetically instead of in order of the most frequent category. For ordinal variables like this one, which have a clear ordering, that is often the better option.)

In [None]:
df.partyid.value_counts().sort_index()

In [None]:
df.libcon.value_counts().sort_index()

In many cases, we will want to convert this kind of labelled missing data to an explicit `NaN` instead. This will prevent these responses from showing up in our tables and graphs. We can replace a value with `NaN` using the `replace` method:

In [None]:
missing_values = ["-9. Refused", "99. Haven't thought much about this", "-8. Don't know"]

df['libcon_replaced'] = df['libcon'].replace(missing_values, np.nan)

In [None]:
# check the new column: the labelled non-responses should no longer appear
df.libcon_replaced.value_counts().sort_index(ascending=True)
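
As a quick illustration of what `replace` does to the counts, here is a sketch using a small invented series. The labels below are assumptions made up for this example, not the actual ANES codes:

```python
import numpy as np
import pandas as pd

# hypothetical responses that use text labels for non-response
responses = pd.Series(["1. Yes", "-9. Refused", "2. No", "-8. Don't know", "1. Yes"])

# labels we want to treat as missing (assumed for this toy example)
missing_labels = ["-9. Refused", "-8. Don't know"]

cleaned = responses.replace(missing_labels, np.nan)

# value_counts() skips NaN by default, so the labels disappear from the table
print(cleaned.value_counts().sort_index())

# pass dropna=False to confirm how many values were turned into NaN
print(cleaned.value_counts(dropna=False))
```

Checking with `dropna=False` is one way to confirm that the non-responses were converted rather than silently lost.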

<font color ='red'>**Question 2: Replace the refused/don't know values in the party ID variable with `NaN`**</font>

In [None]:
# when you're done, re-run the function to count nulls to make sure your code worked:
df.isnull().mean()


### Recoding data

In some cases it can make sense to collapse or rearrange one or more categories to simplify our data or capture a particular subset of respondents. For instance, there's generally not a significant difference in the voting behavior of people who report that they are "weak" or "independent" partisans compared to people who self-describe as "strong" partisans, so for the sake of simplicity we often collapse these categories into "Democrats", "Independents" and "Republicans".

When we want to take an existing variable and combine responses to create something new, we can use the `map` method to apply a dictionary object that will re-map our original values onto new ones. The dictionary should take the form of `{[oldvalue]:[newvalue]}`.


In [None]:
partyid_map = {"1. Strong Democrat": "Democrat",
               "2. Not very strong Democrat": "Democrat",
               "3. Independent-Democrat": "Democrat",
               "4. Independent": "Independent", 
               "5. Independent-Republican": "Republican",
               "6. Not very strong Republican": "Republican",
               "7. Strong Republican": "Republican"
}

# using the assign function to create a new variable:
df = df.assign(partyid_3cat= df.partyid.map(partyid_map))

df.partyid_3cat.value_counts()

Since recoding is easy to mess up, I think it's a good idea to always:
1. Preserve the old variable and give your new variable a different name
2. Check your results by comparing the old variable to the new one.

You can check your results by selecting the old values and the new values, dropping duplicates, and then checking to see how the unique labels match up:

In [None]:
df[['partyid', 'partyid_3cat']].drop_duplicates().sort_values('partyid')
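
One thing worth checking when you recode with `map`: any value that is not a key in the dictionary becomes `NaN`. The sketch below shows this on an invented series (the labels and the `recode` dictionary are hypothetical) and uses `pd.crosstab` as an alternative way to compare old and new values:

```python
import pandas as pd

# hypothetical variable with one label left out of the dictionary
original = pd.Series(["1. Low", "2. Medium", "3. High", "-9. Refused", "1. Low"])

# assumed recode for this toy example; "-9. Refused" is deliberately omitted
recode = {"1. Low": "Low or medium", "2. Medium": "Low or medium", "3. High": "High"}

# values that are not keys in the dictionary come back as NaN
recoded = original.map(recode)

# label the NaN values so they stay visible in the cross-tabulation
check = pd.crosstab(original, recoded.fillna("(missing)"))
print(check)
```

Each row of the table should have counts in exactly one column, which makes an accidental mismatch easy to spot.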

### Dropping missing data

In some cases, we will want to just drop rows or columns that have missing data. We can do this with the pandas `.dropna()` method. `axis=1` will drop columns with missing information:

In [None]:
df_no_missing_cols = df.dropna(axis = 1)
df_no_missing_cols.shape

`axis=0` will drop all rows with missing information:

In [None]:
df_no_missing_rows = df.dropna(axis = 0)
df_no_missing_rows.shape

We want to be careful here: surveys like the ANES rarely ask every respondent to answer every single question, and some respondents will inevitably skip or refuse to answer certain items. If we drop every row with any missing data, we end up tossing out a huge amount of information: from 8280 responses to just 312. This is probably not a great idea, so more often we'll want to drop rows only when some crucial bit of information is missing. For instance, we could do something like this to drop only the rows where `age_group` or `partyid_3cat` is missing:

In [None]:
df_no_missing_pid_or_age = df.dropna(axis=0, subset=["partyid_3cat", "age_group"])

df_no_missing_pid_or_age.shape

Or we could keep only the cases that have non-missing data for at least two columns (`thresh` sets the minimum number of non-missing values a row needs in order to be kept):

In [None]:
df_min_2 = df.dropna(thresh=2, axis=0)

df_min_2.shape
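
To make the differences between these `dropna` options concrete, here is a minimal sketch on an invented four-row data frame. The column names and values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with different amounts of missingness per row
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age_group": ["18-29", np.nan, "30-44", np.nan],
    "party": ["Democrat", "Republican", np.nan, np.nan],
})

# drop a row if ANY column is missing (the default, how="any")
complete_rows = toy.dropna(axis=0)

# drop a row if either of the listed columns is missing
needs_both = toy.dropna(axis=0, subset=["age_group", "party"])

# drop a row only if ALL of the listed columns are missing
needs_one = toy.dropna(axis=0, subset=["age_group", "party"], how="all")

# keep a row only if it has at least 2 non-missing values
at_least_two = toy.dropna(axis=0, thresh=2)

print(len(complete_rows), len(needs_both), len(needs_one), len(at_least_two))
```

Comparing the row counts before and after each call is a cheap way to see how much data a given rule throws away.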

As with the previous examples, you should always be careful to avoid throwing out useful data if possible. 

## String Data

String data, and the need to clean it up, are very common in social science research. Whether it's coding open response questions in surveys or parsing social media posts, lots of data in the social sciences are text data, and it's important to understand how to deal with them. This is particularly true when using web scraping techniques. The data that you get from web scraping will generally at least start out as string data, which you can then transform into numerical variables as needed.



### Regular Expressions

A **regular expression, or regex**, is a sequence of characters that is used to match patterns within text. These can be extremely useful for searching and matching complicated string patterns. Regexes use specific rules and formatting guidelines to specify various patterns and are implemented in Python via the `re` package. Let's take a look at a quick example.

In [None]:
text = 'this is some text'
re.split(r"\s+", text)

The regular expression `\s+` refers to one or more whitespace characters. This uses `\s` as the regex formatting for a whitespace character, as well as the `+` to indicate one or more. Note that `\s` is used because `s` by itself would refer to the letter `s`. The backslash `\` is an escape character that tells the regex engine to read `\s` as a special pattern rather than a literal letter. The `r` prefix on the string makes it a raw string, so Python passes the backslash through to `re` unchanged.

We could have done this without the `+`, but then the pattern would split on each individual whitespace character instead. In the above example there would be no difference, but if we add extra spaces, the result includes an empty string between each pair of consecutive spaces.

In [None]:
text = 'this is some    text'
re.split(r"\s", text)

You can also first compile a regular expression, then use it multiple times. 

In [None]:
regex = re.compile(r"\s+")
regex.split(text)

In [None]:
regex.split('some other  text')

Other regex functions, like `findall`, can be used to extract bits of text that match some pattern. This is an example of a pattern that could be used to identify and capture a price from a string of text.

In [None]:
text = "33lbs of bananas cost $90.00" 

# \$ is a literal dollar sign; (?:\.[0-9]{1,2})? optionally matches a decimal point and cents
re.findall(r"(\$[0-9]+(?:\.[0-9]{1,2})?)", text)

### Regex methods

|Method|Description|
|-|-|
|`findall`|Return all nonoverlapping matching patterns in a string as a list|
|`match`|Match pattern at start of string and optionally segment pattern components into groups; if the pattern matches, return a match object, and otherwise None|
|`search`|Scan string for match to pattern, returning a match object if so; unlike match, the match can be anywhere in the string as opposed to only at the beginning|
|`split`|Break string into pieces at each occurrence of pattern|
|`sub`|Replace all occurrences of pattern in string with replacement expression|
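
The methods in this table can be compared side by side on one string. The sketch below uses an invented sentence and a hypothetical price pattern (a dollar sign, digits, and optional cents); it is only meant to show how the return values differ:

```python
import re

text = "Apples cost $1.50 and pears cost $2"

# hypothetical price pattern: a dollar sign, digits, and optional cents
price = re.compile(r"\$[0-9]+(?:\.[0-9]{1,2})?")

all_prices = price.findall(text)          # every match, as a list of strings
at_start = price.match(text)              # None: the text does not START with a price
first_price = price.search(text).group()  # first match anywhere in the text
pieces = price.split(text)                # text between the matches
masked = price.sub("[price]", text)       # replace each match

print(all_prices, at_start, first_price, pieces, masked, sep="\n")
```

Note that `match` returns `None` here even though the string contains prices, because it only looks at the start of the string, while `search` scans the whole string.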

### Working with regex

Unless you work with regular expressions frequently, it will be very hard to get good enough at regex to write it out on the fly. Most of the time, you only need regex occasionally for one or two tasks, and it can be hard to remember what all of the patterns and syntax are. **You do not need to try to memorize all of the regex syntax!** There are plenty of tools available to help you out whenever you need to come back to it. Understanding some of the basics should help you get started, and you can use existing cheatsheets as well as online regex builder tools to do the rest. 

Regex 101 (https://regex101.com) is a website that helps you build your regexes. This website allows you to paste in the text you want to search, as well as type in a regex that you want to build. The text selected by the regex will be highlighted as you go, helping you build the exact regex that you need to get what you want. 

### Regex Cheatsheet

Feel free to download this cheatsheet to more easily read it. This is also available on the course Canvas website.

![Regex Cheatsheet](Regular_Expressions_Cheat_Sheet.pdf)

## Strings in DataFrames

Many times, we'll have to work with strings that are in DataFrames. These are a bit trickier than individual strings, because we want to be efficient about how we do this. Luckily for us, the creators of pandas recognized that string manipulation would be fairly common and included tools to help make it easier.


We can use the `extract` method to capture text that fits a regex pattern. The parentheses in the expression below are the "capture group" for our regex: this is the part of the pattern that we want to extract. The code below will extract the portion of the text that matches the string "liberal", "conservative", or "moderate". This is a way to recode the `libcon` variable to a 3-category measure without using the `map` function:


In [None]:
# first lower case, then match:
libcon_lowercase = df.libcon.str.lower()

# expand=False returns a Series rather than a one-column DataFrame
df['libcon_3cat'] = libcon_lowercase.str.extract(r'(liberal|conservative|moderate)', flags=re.IGNORECASE, expand=False)


We can compare our new variables against the old ones like this: 

In [None]:
df[['libcon', 'libcon_3cat']].drop_duplicates().sort_values('libcon')
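
Here is a small self-contained sketch of `str.extract` on invented labels written in the style of a 7-point ideology scale. The labels are assumptions for illustration and may not match the ANES wording exactly:

```python
import re
import pandas as pd

# hypothetical labels in the style of a 7-point ideology scale
labels = pd.Series([
    "1. Extremely liberal",
    "4. Moderate; middle of the road",
    "6. Conservative",
    "-9. Refused",
])

# one capture group + expand=False returns a Series; non-matching rows become NaN
three_cat = labels.str.extract(r"(liberal|conservative|moderate)",
                               flags=re.IGNORECASE, expand=False)

# extract() returns the text as it appears, so lower-case it for consistent labels
three_cat = three_cat.str.lower()
print(three_cat.tolist())
```

Rows that match none of the alternatives come back as `NaN`, so labelled non-responses are converted to missing values as a side effect.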

`str.contains` returns a `True` if the string contains the pattern given as its argument and `False` otherwise. One use for the `contains` function is to identify cases where a piece of text shows up so that you can filter or count the number of occurrences of that string. So if we wanted to know how many people mentioned climate change or global warming in their response to the "most important political problem" question, we could use a regular expression like `"climate change|global warming"` to count these cases:

In [None]:
df = df.assign(climate_change=df['mipp_1'].str.contains('climate change|global warming', case=False, regex=True))

# get the proportion who mentioned this: 
df.climate_change.mean()


Then we can use this indicator to get the proportion of people giving this response at each level of another variable, like party ID:

In [None]:
df.pivot_table(index='partyid_3cat', values='climate_change', aggfunc='mean')
# or : 
# df.groupby('partyid_3cat')['climate_change'].mean()
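
The same pattern can be tried on a tiny invented data frame. In this sketch the `group` and `answer` columns are hypothetical, and `na=False` is an assumption that counts a missing answer as "did not mention" rather than leaving it as `NaN`:

```python
import numpy as np
import pandas as pd

# hypothetical open-ended answers with a grouping variable
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "answer": ["Climate change", "the economy", "global warming!", np.nan, "jobs"],
})

# na=False treats a missing answer as "did not mention" rather than leaving NaN
toy["mentions_climate"] = toy["answer"].str.contains(
    "climate change|global warming", case=False, regex=True, na=False
)

# the mean of a True/False column within each group is the share who mentioned it
share_by_group = toy.groupby("group")["mentions_climate"].mean()
print(share_by_group)
```

Whether missing answers belong in the denominator is a judgment call: leaving them as `NaN` gives the share among people who answered, while `na=False` gives the share among everyone.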

<font color ='red'>**Question 3: How would I show the relationship between `age_group` and the likelihood of mentioning Covid-19 as a response to the `mipp_1` question? Can you write your expression in a way that allows you to capture differences in spelling or punctuation?**</font>

Another common use for the `str.contains` function is to subset a data set. Since it returns a boolean value, we can use it with `.loc` to get rows where a pattern is matched. For instance, here's how I would get a data frame containing only people who mentioned age, dementia, or cognitive ability in their reasons for disliking Joe Biden. (The `\b` in this expression represents a word boundary, so this avoids capturing terms that just happen to begin or end with "age", like "baggage" or "mortgage".)

In [None]:
# "cogniti" matches any word containing that stem, e.g. "cognitive" or "cognition"
df_age = df.loc[df.biden_dislikes.str.contains(r'\bold\b|\bage\b|dementia|cogniti', case=False, na=False)]

# how many rows?
df_age.shape[0]
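
To see what the word boundaries change, the sketch below compares a pattern with and without `\b` on a few invented comments (the text is made up for illustration):

```python
import pandas as pd

# hypothetical comments used to compare a bare pattern with a bounded one
comments = pd.Series(["His age worries me", "too much baggage", "He is old", "bold ideas"])

# without \b, "age" and "old" also match inside longer words
loose = comments.str.contains(r"old|age", case=False)

# with \b on both sides, only the standalone words match
strict = comments.str.contains(r"\bold\b|\bage\b", case=False)

print(pd.DataFrame({"comment": comments, "loose": loose, "strict": strict}))
```

Printing the two indicators next to the original text is a simple way to spot false positives before using a pattern to filter real data.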

<font color ='red'>**Question 4: How many Democrats mentioned Donald Trump when asked to name a thing they liked about Joe Biden? How many Republicans mentioned Biden when asked to list something they liked about Trump?**</font>