# Week 2: Data Collection 1 - Working with APIs

**Author:** Minjae Yun

In today's digital world, APIs (Application Programming Interfaces) have become an essential part of modern software systems. In this session, we'll explore what an API is and how to effectively use it.

## US Census Data
Since 2020, the US Census Bureau has shifted from manual bulk downloads toward API-based data retrieval ([Census Data API](https://www.census.gov/newsroom/press-releases/2020/american-factfinder-retiring.html)). This change aims to make data access more efficient. To get started:

- Visit the [US Census API Key Signup](https://api.census.gov/data/key_signup.html) page and sign up for an API key.
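
Once you have a key, it is usually safer to keep it out of the notebook itself. The sketch below assumes the key has been saved in an environment variable. The variable name `CENSUS_API_KEY` and the helper `load_api_key` are hypothetical choices for illustration, not something the Census Bureau prescribes.

```python
import os

def load_api_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name you chose when saving your key.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key first.")
    return key
```

The returned string can then be used wherever an API key is expected, so the key never has to be typed into a shared notebook.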

### Side Note: Understanding the `format()` Function
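Python offers several ways to insert values into placeholders within a string. The `str.format()` method is one; formatted string literals (f-strings, available since Python 3.6) are another, and they are what the examples below use. Both are powerful tools for creating dynamic and readable output.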

In [None]:
# assign string values to objects
text1 = "Enter"
text2 = "Any"
text3 = "Strings"
print(text1+text2+text3) # we can concatenate strings using "+"

- Manually add spaces between strings

In [None]:
print(text1 + " " + text2 + " " +text3)

- Instead, use an f-string (formatted string literal)

In [None]:
print(f"{text1} {text2} {text3}")

- We can also evaluate expressions inside the placeholders (here, `text3*3` repeats the string three times)

In [None]:
print(f"{text1} {text2} {text3*3}")
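
For comparison, here is a minimal sketch of the same output written once with the `str.format()` method and once with an f-string. The variable names `with_format` and `with_fstring` are introduced only for this illustration.

```python
text1, text2, text3 = "Enter", "Any", "Strings"

# str.format() fills the {} placeholders in order
with_format = "{} {} {}".format(text1, text2, text3)

# an f-string evaluates the expressions inside the braces directly
with_fstring = f"{text1} {text2} {text3}"

print(with_format)
print(with_fstring)
print(with_format == with_fstring)
```

Both approaches build the same string; f-strings are usually shorter to write when the values are already stored in variables.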

### Start Pulling Census Data

Now that we have a basic understanding of APIs and their significance, let's delve into the process of effectively working with APIs for data collection. This step is essential for harnessing the power of available data sources.

For more in-depth information and examples, you can refer to the [Census Data API User Guide](https://www.census.gov/data/developers/guidance/api-user-guide.Example_API_Queries.html#list-tab-2080675447).

To begin our journey, we will use the following packages:

- **Requests:** This package will enable us to make HTTP requests to the Census API and retrieve data.
- **Pandas:** Pandas is an indispensable tool for data manipulation and analysis, and we will use it to handle the data obtained from the Census API.
- **Census:** The Census package provides an interface to interact with the Census API and simplifies the data retrieval process.

By combining these tools, we can access and process Census data with ease. Let's proceed with hands-on examples of working with the Census API.

In [None]:
# import the packages used throughout this session
import requests
import pandas as pd
from census import Census

# create year, url, and variable list objects
year = 2018
url = f'https://api.census.gov/data/{year}/acs/acs5/variables.json' # the year can now be changed programmatically
response = requests.get(url) # intermediate step - use the requests.get() function to load the url
data = response.json() # intermediate step - use the .json() method to parse the response into a dictionary
var_list = list(data['variables'].keys()) # final list (dict keys must be converted to a list before slicing)
var_list[:10]

### Side Note: Dictionaries

Dictionaries are a fundamental data structure in Python, offering comparable capabilities to lists and tuples. However, dictionaries provide distinct advantages, especially in terms of key-value pair storage. 

Key points about dictionaries:

1. **Key-Value Pairs:** Unlike lists or tuples, dictionaries use a key-value pairing system, allowing you to store and access data based on meaningful labels (keys).
2. **Data Retrieval:** Accessing values from dictionaries is rapid and doesn't require iterating through the entire collection.
3. **Data Transformation:** Dictionaries can be converted into dataframes, making it easier to perform data analysis using tools like Pandas.

Let's start exploring dictionaries with a hands-on example:

#### Create a Simple Dictionary

To create a simple dictionary, you can use curly braces `{}` and specify key-value pairs separated by colons. Here's an example:

In [None]:
student = {
    'name': 'Alice',
    'age': 20,
    'major': 'Computer Science',
    'gpa': 3.8
}

In [None]:
# Access dictionary values using keys
student['name']

In [None]:
# Combine with an f-string
print(f"Student Name: {student['name']}")
print(f"Student Age: {student['age']}")
print(f"Student Major: {student['major']}")
print(f"Student GPA: {student['gpa']}")

In [None]:
# Update existing values and add a new key-value pair
student['age'] = 21
student['gpa'] = 3.9
student['university'] = 'XYZ University'

# Display the updated dictionary
print("\nUpdated Student Information:")
print(student)

- We can also insert dictionaries into a dictionary!

In [None]:
population_data = {
    'USA': {'population': 331002651, 'capital': 'Washington, D.C.'},
    'China': {'population': 1444216107, 'capital': 'Beijing'},
    'India': {'population': 1380004385, 'capital': 'New Delhi'},
    'Brazil': {'population': 212559417, 'capital': 'Brasília'},
    'Russia': {'population': 145934462, 'capital': 'Moscow'},
    'Nigeria': {'population': 206139587, 'capital': 'Abuja'},
    'France': {'population': 65273511, 'capital': 'Paris'}  # New country
}
print(population_data['USA'])
print(f"The data type of a value inside a dictionary can itself be {type(population_data['USA'])}")
print(population_data['USA']['capital'])

In [None]:
# Add a new country to the dictionary
new_country_name = 'Spain'
new_country_info = {'population': 46754783, 'capital': 'Madrid'}
population_data[new_country_name] = new_country_info
pd.DataFrame(population_data)

In the DataFrame above, each country becomes a column. To store the country name as its own column instead, we can restructure the data into a list of dictionaries, one per country:

In [None]:
# Restructure the data
restructured_data = [{'Country': country, **info} for country, info in population_data.items()]

# Convert to pandas DataFrame
df = pd.DataFrame(restructured_data)

# Display the DataFrame
print(df)

### Side Notes: `for` and `while` Loops

Let's take a quick look at for and while loops, two of the most frequently used tools in your Python toolbox.

1. **`for` Loop:** If you're dealing with a specific range of elements, the `for` loop is your go-to. It's like your trusty map for exploring a set of items step by step.

2. **`while` Loop:** Now, when things are more about logic and less about a predefined range, the `while` loop steps in. It's like having a sentinel guarding the gate until your condition is met.

These loops are your dynamic duo for tackling repetitive tasks in style. Let's dive in and see them in action!

In [None]:
# Loop through dictionary keys and values
print("Looping through Dictionary:")
for key, value in student.items():
    print(f"{key}: {value}")

In [None]:
# Numeric operation
number = 5
factorial = 1

for i in range(1, number + 1):
    factorial *= i

    print(f"The factorial of {i} is {factorial}")

In [None]:
# Logical expression
import time

seconds = 10

while seconds > 0:
    print(f"Time left: {seconds} seconds")
    time.sleep(1)
    seconds -= 1

print("Time's up!")

- Create a loop that generates API URLs for each year from 2010 to 2020.

In [None]:
# Answer
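
One possible approach, offered as a sketch rather than the only answer: reuse the `variables.json` URL pattern from earlier and loop over the years. Whether every year in this range is actually published on the ACS 5-year endpoint is not checked here.

```python
# Build one variables.json URL per year from 2010 through 2020
urls = []
for year in range(2010, 2021):  # the upper bound of range() is exclusive
    urls.append(f"https://api.census.gov/data/{year}/acs/acs5/variables.json")

for url in urls:
    print(url)
```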


### Census Data: Browsing and Pulling Variables

When working with Census data, it's essential to select the specific variables that are relevant to your analysis. Here's a step-by-step guide on how to achieve this:

1. **Read Variable Descriptions:** Start by visiting the [Census Data API Documentation](https://api.census.gov/data/2018/acs/acs5.html) page. This page provides detailed descriptions of available variables, allowing you to understand their meanings and relevance.

2. **Choose Relevant Variables:** Carefully examine the variable descriptions to identify the ones that match your research goals. Think about the information you need and the variables that provide that information. Make note of the variable names that you'll be using in your data collection process.

3. **Pull Designated Variables:** Once you've decided on the variables you need, you can use Python to interact with the Census API and retrieve data. You'll use your chosen variable names in your API requests to specifically request the data you're interested in.

Remember, your goal is to efficiently gather the data that aligns with your research objectives. By selecting the right variables, you can ensure that your analysis is accurate and insightful. Let's dive into the process and start pulling the data!

- Read Variable Descriptions

In [None]:
year = 2018
url = f'https://api.census.gov/data/{year}/acs/acs5/variables.html'
var = pd.read_html(url)[0] # read the first HTML table on the page into a DataFrame
len(var) # number of available variables - so many!

- Choose Relevant Variables
- Let's say we want the total population, `B02001_001E`

In [None]:
from census import Census # wrapper around the Census API (install with `pip install census`)

mykey = "INPUT_YOUR_API_KEY"
c = Census(mykey)
year = 2017

geography = "county"
geo_params = {'for': f'{geography}:*', 'in': 'state:06'}

# Make the API call
variables = c.acs5.get(('B02001_001E',), geo_params, year=year)
variables
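
Behind the scenes, a call like this becomes an HTTP request to the Census API. The sketch below shows roughly what such a request URL looks like for the ACS 5-year endpoint. The helper `build_acs5_url` is hypothetical, and exactly how the `census` package assembles its requests internally is an assumption here.

```python
from urllib.parse import urlencode

def build_acs5_url(year, variables, geo_for, geo_in=None, key=None):
    """Assemble an ACS 5-year query URL (hypothetical helper for illustration)."""
    params = {"get": ",".join(variables), "for": geo_for}
    if geo_in:
        params["in"] = geo_in
    if key:
        params["key"] = key
    # keep ':', '*' and ',' readable instead of percent-encoding them
    query = urlencode(params, safe=":*,")
    return f"https://api.census.gov/data/{year}/acs/acs5?{query}"

print(build_acs5_url(2017, ["B02001_001E"], "county:*", "state:06"))
```

Seeing the raw URL can help when debugging, because it can be pasted into a browser to inspect the response directly.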

- Separate variable names with commas to request multiple variables in one call.
- Use `",".join()` to build that comma-separated string from a list of variable names.

In [None]:
variables = c.acs5.get(('B02001_001E,B02001_005E',), geo_params, year=year)
variables

In [None]:
# total population, white, black, asian, and hispanic populations, and median income
list_of_vars = ['B02001_001E','B02001_002E','B02001_003E','B02001_005E','B03001_003E', 'B06011_001E']
print(",".join(list_of_vars))

- We can also retrieve a ZIP-code-level observation!

In [None]:
df = c.acs5.zipcode('B01003_001E,B02001_005E', state_fips="06", zcta="91711", year=year)
df

### Constructing Data Across Time

Analyzing data across specific time periods can provide valuable insights into trends and changes. To accomplish this, follow these steps:

1. **Get the Common List of Variables:** Variable lists change from year to year, so the first step is to identify the set of variables that is available in every period you want to compare. From that common set, choose the variables relevant to your research objectives.

By establishing a common ground for variables, you ensure that your analysis remains focused and consistent as you delve into different time frames. This foundational step sets the stage for robust and insightful data analysis. Let's move forward and explore how to gather this common list of variables!

In [None]:
import requests

# Years to consider
years = list(range(2015, 2021)) # 2015 through 2020

# Dictionary to store variables for each year
variables_by_year = {}

# Loop through each year
for year in years:
    url = f'https://api.census.gov/data/{year}/acs/acs5/variables.json'
    response = requests.get(url)

    if response.status_code == 200:
        data = response.json()
        variables_by_year[year] = list(data['variables'].keys())
    else:
        # Skip failed years; never remove items from a list while looping over it
        print(f'Error retrieving data for year {year}')

# Keep only the years that were retrieved successfully
years = sorted(variables_by_year)

# Identify consistently present variables
all_variables = set(variables_by_year[years[0]])
for year in years[1:]:
    all_variables.intersection_update(variables_by_year[year])

print(f"Number of Common Variables: {len(all_variables)}") # Still So Many!
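
To see the intersection logic in isolation, here is a small offline sketch. The dictionary `toy_variables_by_year` and its variable names are made up for illustration; they are not real Census variables.

```python
# Toy stand-in for variables_by_year; the variable names are made up
toy_variables_by_year = {
    2015: ["A_001E", "A_002E", "B_001E"],
    2016: ["A_001E", "B_001E", "C_001E"],
    2017: ["A_001E", "B_001E"],
}

# Intersect the per-year sets in a single call
common = set.intersection(*(set(v) for v in toy_variables_by_year.values()))

print(sorted(common))
```

Only the names that appear in every year survive, which is the same idea as applying `intersection_update` year by year.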


- Use the `random` module to randomly choose elements

In [None]:
import random

# Create a list of 50 elements
elements = list(range(1, 51))

# Randomly choose 5 elements from the list
random_elements = random.sample(elements, 5)

print(f"Randomly chosen elements: {random_elements}")


- Randomly select 30 variables from the common US Census variable list using the `random` module

In [None]:
# Answer:
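
One possible approach, sketched with made-up data so it runs offline. The helper `sample_variables` and the set `toy_variables` are hypothetical; in the notebook you would pass the set of common variables instead.

```python
import random

def sample_variables(variable_names, k=30, seed=None):
    """Randomly pick k names; `seed` makes the draw reproducible."""
    # random.sample() needs a sequence, so sort the set into a list first
    pool = sorted(variable_names)
    rng = random.Random(seed)
    return rng.sample(pool, k)

# Stand-in for the common variable set; the names are made up for this sketch
toy_variables = {f"X{i:05d}_001E" for i in range(200)}
chosen = sample_variables(toy_variables, k=30, seed=42)
print(len(chosen))
```

Sorting before sampling matters because a set has no fixed order, so a seed alone would not guarantee the same draw across runs.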

### Systematically Loop Through Other States

In the previous sections, we learned how to pull variables across different years. Now, let's explore data from multiple states across the United States by systematically looping through data from various states.

#### Utilizing External Sources

To achieve this, we'll practice leveraging an external [source](https://www.mercercountypa.gov/dps/state_fips_code_listing.htm) and integrating it seamlessly within the Python environment.

By combining our knowledge of data extraction, variable selection, and external data sources, we're equipped to embark on a comprehensive exploration of data from across the United States. Let's dive into the next phase of our analysis and start systematically looping through different states!

- Load the external table and clean it

In [None]:
url="https://www.mercercountypa.gov/dps/state_fips_code_listing.htm"
states = pd.read_html(url)[0]
states

In [None]:
states.columns = states.iloc[0] # Use the first row as the column names
states = states[1:] # Keep observations from the second row onward
cols =  [x.lower().replace(" ","_") for x in list(states.columns)]
col_names = list(range(len(cols)))
states.columns = [x+str(y) for x,y in zip(cols,col_names)]
states
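
The renaming step can be easier to follow on a plain list. In the sketch below, the header names in `raw_columns` are an assumption about what the external table looks like (a layout that repeats the same three headers side by side); check them against the actual output before relying on them.

```python
# Header row as it might appear in a table with repeated column names
raw_columns = ["State Abbreviation", "FIPS Code", "State Name",
               "State Abbreviation", "FIPS Code", "State Name"]

# lower-case and replace spaces, then append the position to make names unique
cleaned = [name.lower().replace(" ", "_") + str(i)
           for i, name in enumerate(raw_columns)]

print(cleaned)
```

Appending the position is what keeps repeated headers apart, so each column can be selected unambiguously afterward.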

### Side Notes: Enhancing Various Data Cleaning Operations
- `replace()` replaces a designated substring with a new one.
- `split()` splits a string into a list of substrings at a designated separator.
- Regular expressions (the `re` module) allow pattern-based operations.
- Lists can be used for simple data merging.

In [None]:
# replace function
text = "Hello, World! Hello!"
new_text = text.replace("Hello", "Hi")
print(new_text)  

In [None]:
# split function
text = "Hello, World! Hello!"
new_text = text.split(", ")[0]
print(new_text)  
new_text = text.split(", ")[1]
print(new_text)  

In [None]:
# Regular Expression
import re
text = "Hello, World! Hello!"
new_text = re.sub(r"Hello", "Hi", text)
print(new_text) 

In [None]:
# Regular Expression for changing the order
text = "John Smith, Jane Doe"
new_text = re.sub(r"(\w+) (\w+)", r"\2, \1", text)
print(new_text)  

In [None]:
# List operation to append data
list1 = [1, 2, 3]
list2 = [4, 5, 6]
result = list1 + list2
print(result) 

In [None]:
# List operation for combining values
list1 = [1, 2, 3]
list2 = [4, 5, 6]
result = [x + y for x,y in zip(list1,list2)]
print(result) 

In [None]:
# More on zip operation
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 22]

zipped = zip(names, ages)

# Convert the zip object to a list of tuples
zipped_list = list(zipped)

print(zipped_list)

- Find the list of State FIPS Code

In [None]:
state_fips = list(states.fips_code1) + list(states.fips_code4)
state_fips = [x for x in state_fips if type(x)==str] # Why did I add the condition?
len(state_fips) # 50 States, D.C., Puerto Rico, U.S. Virgin Islands, American Samoa, Northern Mariana Islands, and Guam

- Set up a loop to get the total population by state

In [None]:
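records = []
for fips in state_fips:
    geo_params = {'for': f'state:{fips}'}

    # request the state name and total population for one state
    d = c.acs5.get('NAME,B02001_001E', geo_params, year=year)
    records.extend(d)
data = pd.DataFrame(records) # keep the combined result as a DataFrame
data

In [None]:
data['year'] = year # record which year the observations come from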
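
When looping over many states, a single failed request can stop the whole loop. The sketch below shows one way to keep going and to tag each record with its year. Both `collect_by_state` and `fake_fetch` are hypothetical names; `fake_fetch` merely stands in for a real API call so that the example runs offline.

```python
def collect_by_state(fetch, fips_codes, year):
    """Call fetch(fips, year) for each state and tag every record with the year.

    `fetch` is a hypothetical callable that returns a list of dictionaries.
    """
    records, failed = [], []
    for fips in fips_codes:
        try:
            rows = fetch(fips, year)
        except Exception as err:  # keep going if one state fails
            print(f"Skipping state {fips}: {err}")
            failed.append(fips)
            continue
        for row in rows:
            records.append({**row, "year": year})
    return records, failed

# Fake fetch function standing in for a real API call
def fake_fetch(fips, year):
    if fips == "99":
        raise ValueError("unknown FIPS code")
    return [{"state": fips, "B02001_001E": 1000.0}]

records, failed = collect_by_state(fake_fetch, ["01", "06", "99"], 2017)
print(records)
print(failed)
```

Returning the failed codes separately makes it easy to retry them later or to check whether a code is simply not covered by the dataset.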

## UCR Crime Data
- General information [link](https://www.justice.gov/developer#:~:text=The%20FBI%20Crime%20Data%20API,uses%20and%20their%20related%20entities.)
- Sign up for Data.gov API Key [link](https://api.data.gov/signup/)

In [None]:
key="Input_your_key"
state="CA"
url = f"https://api.usa.gov/crime/fbi/cde/agency/byStateAbbr/{state}?API_KEY={key}"
response = requests.get(url)
data = response.json()
data

## Twitter Practice

As of now, the Twitter API is offered as a paid service, and accessing Reddit data involves lengthy waiting periods. However, we have an alternative avenue to refine our skills: working with [Publicly Available Data](https://www.thetrumparchive.com/)!

Let's engage in practical exercises using this accessible resource!

In [13]:
import pandas as pd

# Google Drive sharing link for the CSV file
url = "https://drive.google.com/file/d/1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6/view?usp=sharing"

# Extract the file id and build a direct-download URL
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)
print(df.head())

                    id                                               text  \
0    98454970654916608  Republicans and Democrats have both created ou...   
1  1234653427789070336  I was thrilled to be back in the Great city of...   
2  1218010753434820614  RT @CBS_Herridge: READ: Letter to surveillance...   
3  1304875170860015617  The Unsolicited Mail In Ballot Scam is a major...   
4  1218159531554897920  RT @MZHemingway: Very friendly telling of even...   

  isRetweet isDeleted              device  favorites  retweets  \
0         f         f           TweetDeck         49       255   
1         f         f  Twitter for iPhone      73748     17404   
2         t         f  Twitter for iPhone          0      7396   
3         f         f  Twitter for iPhone      80527     23502   
4         t         f  Twitter for iPhone          0      9081   

                  date isFlagged  
0  2011-08-02 18:07:48         f  
1  2020-03-03 01:34:50         f  
2  2020-01-17 03:22:47         f  
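
The link conversion used above can be wrapped in a small helper. This is a sketch under the assumption that the sharing link has the form `.../file/d/<id>/view`; the function name `drive_download_url` is hypothetical.

```python
def drive_download_url(share_url):
    """Turn a '.../file/d/<id>/view' sharing link into a direct-download URL."""
    parts = share_url.split("/")
    # the file id is the path segment right after 'd'
    file_id = parts[parts.index("d") + 1]
    return "https://drive.google.com/uc?id=" + file_id

share_url = "https://drive.google.com/file/d/1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6/view?usp=sharing"
print(drive_download_url(share_url))
```

Looking up the segment after `d` avoids depending on how many path segments follow the file id.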


In [None]:
df.text[:5]

In [None]:
from textblob import TextBlob

# Perform sentiment analysis using TextBlob
for tweet in df.text[:5]: # change or remove the [:5] slice to analyze more tweets
    analysis = TextBlob(tweet)
    sentiment = 'Positive' if analysis.sentiment.polarity > 0 else 'Negative' if analysis.sentiment.polarity < 0 else 'Neutral'
    
    print(f'Tweet: {tweet}')
    print(f'Sentiment: {sentiment}')
    print('---')


In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

# Initialize the SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Perform sentiment analysis using VADER
for tweet in df.text[:5]:
    sentiment_scores = sia.polarity_scores(tweet)
    compound_score = sentiment_scores['compound']
    sentiment = 'Positive' if compound_score > 0 else 'Negative' if compound_score < 0 else 'Neutral'
    
    print(f'Tweet: {tweet}')
    print(f'Sentiment: {sentiment}')
    print('---')
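
Both loops above turn a numeric score into a label with the same nested conditional expression. A small helper can make that rule easier to read and reuse. The function `label_sentiment` and its `threshold` parameter are hypothetical additions for this sketch.

```python
def label_sentiment(score, threshold=0.0):
    """Map a numeric sentiment score to 'Positive', 'Negative', or 'Neutral'."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"

for score in (0.6, -0.3, 0.0):
    print(score, label_sentiment(score))
```

With the default threshold of 0 this reproduces the rule used above. Passing a small nonzero threshold (0.05 is a commonly cited choice for VADER's compound score) treats scores close to zero as neutral.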

[Free API List](https://github.com/public-apis/public-apis)