# Final Project Demo
## Yiqi Ling
## GR 5072, Modern Data Structures
---

## 1. Setup and Import

In [1]:
import acs_explorer as ac
import pandas as pd

The listing below shows the functions the package currently exposes.

In [2]:
dir(ac)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'acs_explorer',
 'acsexplorer_analyze_trends',
 'acsexplorer_generate_report',
 'acsexplorer_get_data',
 'acsexplorer_get_geo_info',
 'acsexplorer_pipeline_by_keyword',
 'acsexplorer_pipeline_by_location',
 'acsexplorer_topic_search',
 'acsexplorer_topic_search_shortlist',
 'version']

## 2. Function Demonstration
This guide demonstrates the main functionalities of the `acs_explorer` package, which includes:

- **Pre-Data Fetching Process**: Prepares for data collection through geocoding and keyword-based variable searching.
- **Data-Fetching Process**: Retrieves data from the Census API based on variables, geographic resolutions, and filters.
- **Data Report Process**: Generates comprehensive reports summarizing the retrieved data, trends, and visualizations.
- **Overall Pipeline**: Combines the above processes into streamlined workflows for geocoding-based and keyword-based analyses.

### 2.1 Pre-Data Fetching Process
This step involves geocoding to retrieve geographic information and searching Census variables based on specific keywords.

#### 2.1.1 Geocoding Geographic Information
`acsexplorer_get_geo_info()` converts a physical address into Census geographic identifiers (state, county, tract). It lets users quickly locate one or more places of interest by combining the `nominatim` API provided by OpenStreetMap with the `geocoder` API provided by the US Census Bureau.

In [3]:
# single address
address = "1600 Amphitheatre Parkway, Mountain View, CA"
ac.acsexplorer_get_geo_info(address)

[{'state': '06',
  'county': '085',
  'tract': '504601',
  'address': 'Google Building 41, 1600, Amphitheatre Parkway, Mountain View, Santa Clara County, California, 94043, United States',
  'lng': -122.08558456613565,
  'lat': 37.42248575}]

In [4]:
# multiple addresses
addresses = ["1600 Amphitheatre Parkway, Mountain View, CA", "535 West 116th Street, New York"]
geo_info = ac.acsexplorer_get_geo_info(addresses)
pd.DataFrame(geo_info)

Unnamed: 0,state,county,tract,address,lng,lat
0,6,85,504601,"Google Building 41, 1600, Amphitheatre Parkway...",-122.085585,37.422486
1,36,61,20300,"Low Memorial Library, 535, College Walk, Manha...",-73.961835,40.808223
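The geocoding result can feed the `geo_filter` argument used later in this demo. Census geography codes are fixed-width FIPS strings (2-digit state, 3-digit county, 6-digit tract), so if the codes are ever stored as integers (for example after a CSV round trip) the leading zeros need to be restored. The sketch below is a hypothetical helper, not part of `acs_explorer`; the name `to_geo_filter` and its `level` parameter are assumptions made for illustration.

```python
# Hypothetical helper (not part of acs_explorer): rebuild a geo_filter dict
# from one geocoding record, restoring fixed-width FIPS codes.
FIPS_WIDTHS = {"state": 2, "county": 3, "tract": 6}
LEVEL_ORDER = ["state", "county", "tract"]


def to_geo_filter(record, level="tract"):
    """Return a geo_filter dict containing every level down to `level`."""
    if level not in LEVEL_ORDER:
        raise ValueError(f"unsupported level: {level!r}")
    keys = LEVEL_ORDER[: LEVEL_ORDER.index(level) + 1]
    # str() accepts both ints and strings; zfill pads with leading zeros.
    return {key: str(record[key]).zfill(FIPS_WIDTHS[key]) for key in keys}


# Codes for the second address, written as integers to show the padding.
record = {"state": 36, "county": 61, "tract": 20300}
print(to_geo_filter(record))
print(to_geo_filter(record, level="state"))
```

The padded result matches the string codes that `acsexplorer_get_geo_info` returns directly, so the helper is only needed when the codes have lost their zeros along the way.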


#### 2.1.2 Searching Variables by Keyword
`acsexplorer_topic_search` and `acsexplorer_topic_search_shortlist` identify Census variables related to a keyword and organize them into detailed search results and a shortlist. The search first iterates over all available Census data shells (i.e., codebooks) through API calls and downloads them into a cache. It also uses the `nltk` package to support fuzzy matching and synonym lookup.
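To give a feel for what fuzzy keyword matching means here, the sketch below uses the standard library's `difflib` as a stand-in. It is not the package's implementation (which relies on `nltk`), it does not handle synonyms, and the function name `matches_keyword`, the `cutoff` value, and the sample descriptions are all illustrative assumptions.

```python
import re
from difflib import get_close_matches


def matches_keyword(description, keyword, cutoff=0.8):
    """Return True if any word in the description is close to the keyword."""
    tokens = re.findall(r"[a-z]+", description.lower())
    # get_close_matches returns the tokens whose similarity ratio >= cutoff.
    return bool(get_close_matches(keyword.lower(), tokens, n=1, cutoff=cutoff))


# Illustrative descriptions, not actual Census labels.
descriptions = [
    "Households with an internet subscription",
    "Median household income",
]

for text in descriptions:
    # "internt" is a deliberate typo to exercise the fuzzy match.
    print(text, "->", matches_keyword(text, "internt"))
```

A similarity cutoff tolerates small typos while still rejecting unrelated words; lowering it widens the search at the cost of more false positives.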

In [26]:
keyword = "internet"
df, df_shortlist = ac.acsexplorer_topic_search(keyword, include_shortlist=True)

Request failed for acs/acs1, 2020 with status code 404


In [24]:
df.head()

Unnamed: 0,Concept,Variable Name,Group,Description,Dataset,Year
0,Sex by Industry for the Civilian Employed Popu...,B24030_081E,B24030,"Male:: Information (6471:6781):: Broadcasting,...",[acs1],[2023]
1,Sex by Industry for the Civilian Employed Popu...,B24030_083E,B24030,"Male:: Information:: Broadcasting, internet pu...",[acs1],"[2008, 2009, 2010, 2011, 2012, 2013, 2014, 201..."
2,Sex by Industry for the Civilian Employed Popu...,B24030_084E,B24030,Male: Information: Internet publishing and int...,[acs1],"[2005, 2006, 2007]"
3,Sex by Industry for the Civilian Employed Popu...,B24030_183E,B24030,Female:: Information (6471:6781):: Broadcastin...,[acs1],[2023]
4,Sex by Industry for the Civilian Employed Popu...,B24030_187E,B24030,"Female:: Information:: Broadcasting, internet ...",[acs1],"[2008, 2009, 2010, 2011, 2012, 2013, 2014, 201..."


In [28]:
df_shortlist.head()

Unnamed: 0,Concept,Group,Variable Name,Dataset,Year
0,Sex by Industry for the Civilian Employed Popu...,B24030,"[B24030_081E, B24030_083E, B24030_084E, B24030...",[acs1],"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
1,"Sex by Industry for the Full-Time, Year-Round ...",B24040,"[B24040_081E, B24040_083E, B24040_084E, B24040...",[acs1],"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
2,Detailed Industry for the Civilian Employed Po...,B24134,"[B24134_172E, B24134_173E, B24134_178E]","[acs5, acs1]","[2020, 2021, 2022]"
3,Detailed Industry for the Civilian Employed Ma...,B24135,"[B24135_172E, B24135_173E, B24135_178E]","[acs5, acs1]","[2020, 2021, 2022]"
4,Detailed Industry for the Civilian Employed Fe...,B24136,"[B24136_172E, B24136_173E, B24136_178E]","[acs5, acs1]","[2020, 2021, 2022]"
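Comparing the two tables suggests that the shortlist collapses the detailed results into one row per table group, collecting the variable names and merging the datasets and years. The sketch below reproduces that grouping on plain dictionaries. It is an assumption about the relationship between the two outputs, not the package's code, and `build_shortlist` is a hypothetical name.

```python
def build_shortlist(rows):
    """Collapse detailed search rows into one entry per table group."""
    shortlist = {}
    for row in rows:
        entry = shortlist.setdefault(
            row["Group"],
            {
                "Concept": row["Concept"],
                "Group": row["Group"],
                "Variable Name": [],
                "Dataset": set(),
                "Year": set(),
            },
        )
        entry["Variable Name"].append(row["Variable Name"])
        entry["Dataset"].update(row["Dataset"])
        entry["Year"].update(row["Year"])
    # Convert the sets to sorted lists so the result is stable and printable.
    return [
        {**entry, "Dataset": sorted(entry["Dataset"]), "Year": sorted(entry["Year"])}
        for entry in shortlist.values()
    ]


# Two rows taken from the detailed results above (concept left truncated as displayed).
detailed = [
    {"Concept": "Sex by Industry for the Civilian Employed Popu...", "Group": "B24030",
     "Variable Name": "B24030_081E", "Dataset": ["acs1"], "Year": [2023]},
    {"Concept": "Sex by Industry for the Civilian Employed Popu...", "Group": "B24030",
     "Variable Name": "B24030_084E", "Dataset": ["acs1"], "Year": [2005, 2006, 2007]},
]
print(build_shortlist(detailed))
```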


Additionally, to get only the brief information, users can call `acsexplorer_topic_search_shortlist`.

In [6]:
ac.acsexplorer_topic_search_shortlist(keyword).head()

Request failed for acs/acs1, 2020 with status code 404


Unnamed: 0,Concept,Group,Variable Name,Dataset,Year
0,Sex by Industry for the Civilian Employed Popu...,B24030,"[B24030_081E, B24030_083E, B24030_084E, B24030...",[acs1],"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
1,"Sex by Industry for the Full-Time, Year-Round ...",B24040,"[B24040_081E, B24040_083E, B24040_084E, B24040...",[acs1],"[2005, 2006, 2007, 2008, 2009, 2010, 2011, 201..."
2,Detailed Industry for the Civilian Employed Po...,B24134,"[B24134_172E, B24134_173E, B24134_178E]","[acs5, acs1]","[2020, 2021, 2022]"
3,Detailed Industry for the Civilian Employed Ma...,B24135,"[B24135_172E, B24135_173E, B24135_178E]","[acs5, acs1]","[2020, 2021, 2022]"
4,Detailed Industry for the Civilian Employed Fe...,B24136,"[B24136_172E, B24136_173E, B24136_178E]","[acs5, acs1]","[2020, 2021, 2022]"


### 2.2 Data-Fetching Process
This is the core functionality of the package. This step retrieves Census data based on specified variables, geographic levels, and time ranges.

The main function, `acsexplorer_get_data`, interacts with the Census API to:

- Retrieve data for user-defined variables.
- Support multiple geographic resolutions, such as state, county, and Census Tract levels.
- Fetch data for specific years or time ranges from various Census datasets (e.g., ACS 1-year estimates, ACS 5-year estimates).

**Parameters**
- variables (list of str): A list of Census variable codes to retrieve (e.g., "B19054_002E"), covering topics such as population or median income.
- geography (str): The desired geographic resolution. Supported values include:
    - "state"
    - "county"
    - "tract"
- year (int): The year for which data is requested (e.g., 2020, 2023).
- dataset (str): The type of dataset, either:
    - "acs1": 1-year estimates, available for larger geographic areas.
    - "acs5": 5-year estimates, offering data for all geographic levels.
- geo_filter (dict, optional): Restricts the query to specific areas, e.g. {"state": "06", "county": "085", "tract": "504601"}.

**Returns**  
A Pandas DataFrame containing the requested Census data.
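For readers unfamiliar with the Census Data API, the sketch below shows how these parameters could map onto a request URL and how the JSON response (a list of rows whose first row is the header) could be turned into records. The URL pattern follows the public Census Data API; whether `acs_explorer` builds its requests exactly this way is an assumption, and `build_census_url` and `rows_from_response` are hypothetical names. No network call is made.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def build_census_url(variables, geography, year, dataset, geo_filter=None):
    """Assemble a Census Data API URL for ACS estimates (sketch only)."""
    geo_filter = geo_filter or {}
    params = [
        ("get", ",".join(["NAME", *variables])),
        # Use the filter value for the target level, or "*" for all areas.
        ("for", f"{geography}:{geo_filter.get(geography, '*')}"),
    ]
    # Remaining filter keys become parent geographies ("in" clauses).
    params += [
        ("in", f"{level}:{code}")
        for level, code in geo_filter.items()
        if level != geography
    ]
    return f"{BASE_URL}/{year}/acs/{dataset}?{urlencode(params, safe=':*,')}"


def rows_from_response(payload):
    """Convert the API's list-of-lists payload into a list of dicts."""
    header, *records = payload
    return [dict(zip(header, record)) for record in records]


print(build_census_url(["B19054_002E", "B19054_003E"], "state", 2021, "acs5"))
print(build_census_url(
    ["B19054_002E", "B19054_003E"], "tract", 2020, "acs5",
    {"state": "06", "county": "085", "tract": "504601"},
))
```

The list returned by `rows_from_response` can be passed straight to `pd.DataFrame` to obtain a table like the ones shown in this section.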

---
In the first example, we retrieve data for two variables, `B19054_002E` and `B19054_003E`, at the state level for the year 2021 using the ACS 5-year dataset (`acs5`).

In [7]:
variables = ["B19054_002E", "B19054_003E"] 
geography = "state"
year = 2021
dataset = "acs5"

# Fetch state-level data
ac.acsexplorer_get_data(variables, geography, year, dataset).head()

Unnamed: 0,NAME,B19054_002E,B19054_003E,state
0,Alabama,294130,1608853,1
1,Alaska,114928,145633,2
2,Arizona,510471,2173086,4
3,Arkansas,181095,977365,5
4,California,2878020,10339566,6


This function call retrieves and displays state-level data in a clean tabular format.

---
To fetch data at a more granular level, such as a Census Tract, users can use the `geo_filter` parameter. This allows you to specify precise geographic boundaries. In this example, we query data for a specific Census Tract in Santa Clara County, California, for the year 2020.

In [8]:
geography = "tract"
geo_filter = {"state": "06", "county": "085", "tract": "504601"}
ac.acsexplorer_get_data(variables, geography, 2020, dataset, geo_filter).head(5)

Unnamed: 0,NAME,B19054_002E,B19054_003E,state,county,tract
0,"Census Tract 5046.01, Santa Clara County, Cali...",95,329,6,85,504601


### 2.3 Data Report Process

The `acs_explorer` package provides functionalities for analyzing data trends and generating comprehensive reports, which include data summaries, trend tables, and simple visualizations. Two key functions are highlighted below:

- **`acsexplorer_analyze_trends`**: Analyzes Census data trends over a specified time range.
- **`acsexplorer_generate_report`**: Creates a comprehensive report that includes visualizations and summary statistics.

---

#### 2.3.1 **Analyzing Data Trends**

The `acsexplorer_analyze_trends` function retrieves and summarizes data trends for specific variables across multiple years.

In [9]:
geography = "state"
year_range = (2017,2021)
geo_filter = {"state": "06"}
ac.acsexplorer_analyze_trends(variables, geography, year_range, dataset, geo_filter)

Unnamed: 0,NAME,B19054_002E,B19054_003E,state,Year
0,California,2828125,10060003,6,2017
1,California,2848830,10116605,6,2018
2,California,2893599,10150667,6,2019
3,California,2899112,10204002,6,2020
4,California,2878020,10339566,6,2021


This function allows users to identify changes in Census variables over time within a defined geographic scope.
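The output above has one row per year from 2017 through 2021, so `year_range` appears to be inclusive on both ends. The sketch below shows one way such a per-year loop could be written, including skipping years for which a dataset is unavailable (as with the `acs/acs1, 2020` request that fails elsewhere in this demo). It is a hedged illustration, not the package's code: `collect_trend` and `fake_fetch` are hypothetical, and the stand-in fetcher only knows three of the California values shown above.

```python
def collect_trend(fetch_year, year_range):
    """Call fetch_year for every year in the inclusive range and tag rows with Year."""
    start, end = year_range
    rows, skipped = [], []
    for year in range(start, end + 1):  # end + 1 makes the range inclusive
        try:
            year_rows = fetch_year(year)
        except LookupError:
            # Remember the missing year instead of aborting the whole run.
            skipped.append(year)
            continue
        rows.extend({**row, "Year": year} for row in year_rows)
    return rows, skipped


# Stand-in for a real data fetch, using three B19054_002E values from the table above.
KNOWN = {2017: 2828125, 2018: 2848830, 2019: 2893599}


def fake_fetch(year):
    if year not in KNOWN:
        raise LookupError(f"no data for {year}")
    return [{"NAME": "California", "B19054_002E": KNOWN[year]}]


# 2020 is deliberately absent from the stand-in to exercise the skip path.
rows, skipped = collect_trend(fake_fetch, (2017, 2020))
print(rows)
print("skipped:", skipped)
```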


---

#### 2.3.2 **Generating Comprehensive Reports**
The `acsexplorer_generate_report` function compiles a detailed report, including:

- Data trends table.
- Interactive visualizations.
- An HTML report summarizing the results.


In [10]:
ac.acsexplorer_generate_report(variables, geography, year_range, dataset, geo_filter)

Interactive plot saved to reports/report_B19054_002E_plot.png
Interactive plot saved to reports/report_B19054_003E_plot.png
Report generated at reports/report.html


'reports/report.html'
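The layout of the generated `report.html` is not shown in this demo. As a rough idea of how a trend table could be turned into an HTML page with only the standard library, here is a minimal sketch. The function name `render_report` and the page structure are assumptions for illustration and do not describe the file that `acsexplorer_generate_report` actually writes.

```python
from html import escape


def render_report(title, rows):
    """Render rows (a list of dicts sharing the same keys) as a small HTML page."""
    if not rows:
        return f"<html><body><h1>{escape(title)}</h1><p>No data.</p></body></html>"
    columns = list(rows[0])
    head = "".join(f"<th>{escape(str(col))}</th>" for col in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[col]))}</td>" for col in columns) + "</tr>"
        for row in rows
    )
    return (
        f"<html><body><h1>{escape(title)}</h1>"
        f"<table><tr>{head}</tr>{body}</table></body></html>"
    )


# One row taken from the trend table in section 2.3.1.
page = render_report(
    "B19054_002E trend",
    [{"NAME": "California", "Year": 2017, "B19054_002E": 2828125}],
)
print(page)
```

Escaping every cell with `html.escape` keeps place names that contain characters such as `&` or `<` from breaking the markup.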


### 2.4 Overall Pipelines
In a nutshell, `acsexplorer_pipeline_by_location` and `acsexplorer_pipeline_by_keyword` are ready-to-use workflows that combine geocoding, data fetching, and reporting.

---

#### 2.4.1 **Pipeline 1: Data Retrieval by Location**

The `acsexplorer_pipeline_by_location` function takes a list of addresses and retrieves Census data for specified variables across geographic levels and time ranges.

In [11]:
addresses = ["1600 Amphitheatre Parkway, Mountain View, CA", "535 West 116th Street, New York"]
variables = ["B19054_002E", "B19054_003E"] 
geography = "state"
year_range = (2017,2019)
dataset = "acs5"
output_path = "reports"
ac.acsexplorer_pipeline_by_location(addresses, geography, variables, year_range, dataset, output_path)

Step 1: Getting geographic information for addresses: ['1600 Amphitheatre Parkway, Mountain View, CA', '535 West 116th Street, New York']...
Geographic information: {'state': '06', 'county': '085', 'tract': '504601', 'address': 'Google Building 41, 1600, Amphitheatre Parkway, Mountain View, Santa Clara County, California, 94043, United States', 'lng': -122.08558456613565, 'lat': 37.42248575}
Step 2: Analyzing trends for variable ['B19054_002E', 'B19054_003E']...
Geographic information: {'state': '36', 'county': '061', 'tract': '020300', 'address': 'Low Memorial Library, 535, College Walk, Manhattan Community Board 9, Manhattan, New York County, City of New York, New York, 10027, United States', 'lng': -73.9618350613562, 'lat': 40.808222549999996}
Step 2: Analyzing trends for variable ['B19054_002E', 'B19054_003E']...
Step 3: Combined trend data for all addresses:
Data saved to reports/data.csv


Unnamed: 0,NAME,B19054_002E,B19054_003E,state,Year,Address
0,California,2828125,10060003,6,2017,"Google Building 41, 1600, Amphitheatre Parkway..."
1,California,2848830,10116605,6,2018,"Google Building 41, 1600, Amphitheatre Parkway..."
2,California,2893599,10150667,6,2019,"Google Building 41, 1600, Amphitheatre Parkway..."
3,New York,1594038,5708672,36,2017,"Low Memorial Library, 535, College Walk, Manha..."
4,New York,1577898,5738639,36,2018,"Low Memorial Library, 535, College Walk, Manha..."
5,New York,1576000,5767234,36,2019,"Low Memorial Library, 535, College Walk, Manha..."


#### 2.4.2 **Pipeline 2: Data Retrieval by Keyword**
The `acsexplorer_pipeline_by_keyword` function searches for Census variables related to a specified keyword, retrieves data for those variables, and generates a report.

In [54]:
keyword = "population"
geography = "state"
year_range = (2021,2022)
dataset = "acs1"
top_search = 1
output_path = "reports"
ac.acsexplorer_pipeline_by_keyword(keyword, geography, year_range, dataset, top_search, output_path)

Step 1: Searching for variables related to 'population'...
Request failed for acs/acs1, 2020 with status code 404
Request failed for group B26109PR with status code 404
Step 2: Analyzing trends for the selected variables...
Analyzing trends for variable: B02003_002E
Analyzing trends for variable: B02003_003E
Analyzing trends for variable: B02003_004E
Analyzing trends for variable: B02003_005E
Analyzing trends for variable: B02003_006E
Analyzing trends for variable: B02003_007E
Analyzing trends for variable: B02003_008E
Analyzing trends for variable: B02003_009E
Analyzing trends for variable: B02003_010E
Analyzing trends for variable: B02003_011E
Analyzing trends for variable: B02003_012E
Analyzing trends for variable: B02003_013E
Analyzing trends for variable: B02003_014E
Analyzing trends for variable: B02003_015E
Analyzing trends for variable: B02003_016E
Analyzing trends for variable: B02003_017E
Analyzing trends for variable: B02003_018E
Analyzing trends for variable: B02003_019E
An

MergeError: Passing 'suffixes' which cause duplicate columns {'Value_x', 'Variable Name_x'} is not allowed.
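The run stops with a pandas `MergeError`. This message typically appears when frames that share non-key column names (here `Value` and `Variable Name`) are merged repeatedly with the default suffixes `("_x", "_y")`: a later merge tries to create `Value_x` again. One way to avoid the collision, assuming the pipeline produces one long-format frame per variable, is to stack the frames and pivot once instead of chaining merges. The sketch below illustrates that idea; `combine_variable_frames` is a hypothetical name, the column layout is inferred from the error message, and the numbers are placeholders rather than Census figures.

```python
import pandas as pd


def combine_variable_frames(frames):
    """Stack per-variable long frames, then pivot so each variable is one column."""
    long_df = pd.concat(frames, ignore_index=True)
    wide = long_df.pivot_table(
        index=["NAME", "Year"],
        columns="Variable Name",
        values="Value",
        aggfunc="first",
    )
    # Flatten the index and drop the leftover "Variable Name" axis label.
    return wide.reset_index().rename_axis(columns=None)


# Placeholder values only; they are not Census figures.
frames = [
    pd.DataFrame({
        "NAME": ["A", "A"],
        "Year": [2021, 2022],
        "Variable Name": [code, code],
        "Value": [i, i + 10],
    })
    for i, code in enumerate(["B02003_002E", "B02003_003E", "B02003_004E"])
]
wide = combine_variable_frames(frames)
print(list(wide.columns))
```

Because no merge is involved, the number of variables no longer matters, and each variable code ends up as its own column.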