<a href="https://colab.research.google.com/github/Sai23d/CS345_FinalProject/blob/main/DSCI521_Final_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### College of Computing and Informatics, Drexel University
### DSCI 521: Data Analysis and Interpretation
---

## Final Project Report

## Project Title:

## Student(s):

## Date:
---

## Your project should include the following components:
- Problem Definition: Define a clear problem or task to solve using data analysis techniques.

- Dataset Selection: You must choose a dataset relevant to your interests or a specific domain. The dataset should be of sufficient size and complexity to demonstrate various data analysis techniques.

- Exploratory Data Analysis (EDA): You need to perform thorough EDA on the dataset to understand its characteristics, identify patterns, missing values, outliers, and potential features for modeling.

- Feature Engineering: Implement feature engineering for creating new features, transforming existing ones, or selecting relevant features.

- Model Evaluation and Selection: Experiment with different data analysis algorithms and techniques. Evaluate the models using appropriate evaluation metrics.

- Conclusion: Discuss your findings and future work.

## You should write the report with the following characteristics to ensure effective communication:
- Visualization and Interpretation: You should use visualization throughout EDA, modeling, and evaluation to illustrate any insights.

- Code and Implementation: Throughout the notebook, you must write well-structured and commented code.

- Documentation and Presentation: Throughout the notebook, you must add comprehensive commentary text to explain plots and code snippets. Always provide interpretations and explanations to document your analyses and results.

### 1. Problem Definition
---
*(Define the problem that will be solved in this data analytics project.)*

This project focuses on the key question:

How do socioeconomic factors affect access to the internet and digital devices across U.S. communities? Digital access is known to be worse in disadvantaged areas; this project uses recent data from the American Community Survey (ACS) 5-Year Estimates (2018-2022) to measure these patterns more precisely.

We will:

- Identify areas (census tracts) with low digital access, such as those where many households lack internet or computers.
- Explore how factors like income, education, and housing costs relate to digital access.
- Build models to predict which communities are most likely to face digital exclusion and which factors matter most.

Our goal is to better understand how inequality impacts digital access and to provide useful insights for decision-makers working to close the digital gap.

### 2. Data Sets
---
*(Describe the origin of the data sources. What is the format of the original data? How to access the data?)*

The primary dataset used in this project comes from the American Community Survey (ACS) 5-Year Estimates, conducted by the U.S. Census Bureau. The ACS is an ongoing nationwide survey that collects detailed population and housing information from approximately 3.5 million addresses each year. The 5-year estimates aggregate data over five years to provide reliable statistics for small geographic areas such as census tracts.

In this project, we focus on variables related to:

- Digital access (e.g., internet subscriptions, device availability)
- Socioeconomic status (e.g., household income, education level, rent burden)
- Demographic context (e.g., total population, age, race)

These data are widely used in policy planning, academic research, and social service allocation.

Format of the Data: The ACS data are provided in multiple formats:

- CSV (Comma-Separated Values) files for individual tables (e.g., B28002, B19013)
- API access via the U.S. Census Bureau API for automated querying

Each dataset includes:

- A GEOID column (unique geographic identifier)
- Estimate and Margin of Error (MOE) columns for each variable
- Metadata, including variable labels and geographic definitions
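
As a minimal sketch of how these columns can be used, the helpers below split a tract-level GEOID into its parts and convert a margin of error into a coefficient of variation. It assumes the 11-digit GEOID form (2-digit state, 3-digit county, 6-digit tract) and relies on ACS margins of error being published at the 90% confidence level. The function names and the example values are hypothetical and not taken from the project data.

```python
import math


def split_tract_geoid(geoid: str) -> dict:
    """Split an 11-digit tract GEOID into state, county, and tract codes."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}


def moe_to_cv(estimate: float, moe: float) -> float:
    """Convert a 90% margin of error into a coefficient of variation (percent)."""
    if estimate == 0:
        return math.nan  # the ratio is undefined for a zero estimate
    standard_error = moe / 1.645  # z-value for a 90% confidence level
    return 100 * standard_error / estimate


# Illustrative inputs only
parts = split_tract_geoid("42101000100")
cv = moe_to_cv(estimate=200, moe=66)
print(parts, round(cv, 1))
```

A high coefficient of variation signals that a tract-level estimate is noisy and may deserve less weight in later analysis.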

There are several ways to access ACS 5-Year Estimates:

- Census Bureau website (https://www2.census.gov/programs-surveys/acs/summary_file/): use the “Advanced Search” to filter by year, geographic level, and table ID (e.g., B28002 for internet access), then download selected tables as CSV files

- U.S. Census Bureau API: register for a free API key at https://api.census.gov/data/key_signup.html
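
For readers who prefer to query the API directly, the sketch below builds a request URL for tract-level ACS 5-Year data. The helper name `build_acs5_url` is hypothetical, and the state FIPS code and variable list are only examples. The block constructs the URL without sending a request, so no key is needed to run it.

```python
from urllib.parse import urlencode


def build_acs5_url(year, variables, state_fips, api_key=None):
    """Build a Census API URL for all tracts in one state (ACS 5-Year)."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = [
        ("get", ",".join(["NAME", *variables])),
        ("for", "tract:*"),             # every tract ...
        ("in", f"state:{state_fips}"),  # ... within this state
        ("in", "county:*"),             # ... across all of its counties
    ]
    if api_key:
        params.append(("key", api_key))
    # Keep ':', '*' and ',' unescaped so the URL stays readable
    return f"{base}?{urlencode(params, safe=':*,')}"


url = build_acs5_url(2022, ["B28002_013E", "B19013_001E"], "42")
print(url)
```

The resulting URL can be opened in a browser or passed to an HTTP client; the response is a JSON array whose first row holds the column names.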

In [1]:
# ======================================
# American Community Survey (ACS) 5-Year Estimates
# Project: Digital Divide in U.S. Communities
# Year: 2018-2022
# ======================================

# -----------------------------
# 0. Install required packages
# -----------------------------
!pip install census us pandas matplotlib seaborn scikit-learn

# -----------------------------
# 1. Import libraries
# -----------------------------
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from census import Census
from us import states
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np

# -----------------------------
# 2. Connect to the Census API
# -----------------------------
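import os

# Read the key from the environment rather than hard-coding a credential.
# CENSUS_API_KEY is an assumed variable name; set it before running this cell.
API_KEY = os.environ["CENSUS_API_KEY"]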
c = Census(API_KEY)

year = 2022  # ACS 5-Year Estimates (2018-2022)

# -----------------------------
# 3. Define variables to pull
# -----------------------------
variables = {
    "B28002_013E": "no_internet",        # Households without internet
    "B28001_002E": "with_computer",      # Households with computers
    "B19013_001E": "median_income",      # Median household income
    "B15003_017E": "high_school",        # High school grads
    "B15003_022E": "bachelors",          # Bachelor's degree
    "B25070_001E": "gross_rent",         # Rent burden (total households paying rent)
    "B01003_001E": "total_population",   # Total population
}

# -----------------------------
# 4. Pull data for all states and tracts
# -----------------------------
all_data = []

for state in states.STATES:
    print(f"Fetching data for {state.name}...")
    try:
        data = c.acs5.state_county_tract(
            list(variables.keys()),
            state.fips,
            "*",  # all counties
            "*",  # all tracts
            year=year
        )
        all_data.extend(data)
    except Exception as e:
        print(f"Error fetching {state.name}: {e}")

# Convert to DataFrame and rename the ACS variable codes to readable column names
df = pd.DataFrame(all_data).rename(columns=variables)

# -----------------------------
# 5. Create GEOID20 for unique tract ID
# -----------------------------
df['GEOID20'] = df['state'] + df['county'] + df['tract'].str.zfill(6)

Fetching data for Alabama...
Fetching data for Alaska...
Fetching data for Arizona...
Fetching data for Arkansas...
Fetching data for California...
Fetching data for Colorado...
Fetching data for Connecticut...
Fetching data for Delaware...
Fetching data for Florida...
Fetchi
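
Before exploring the data, it may help to screen out placeholder values. The Census API reports estimates that cannot be computed (median income in very small tracts, for example) as large negative numbers such as -666666666. The sketch below shows one possible cleaning step on toy rows. The helper `clean_acs` and the sample values are hypothetical, and dropping zero-population tracts is an assumption rather than a step taken in this project.

```python
import pandas as pd

# Toy rows standing in for the API result; the values are made up.
raw = pd.DataFrame({
    "GEOID20": ["01001020100", "01001020200", "01001020300"],
    "median_income": [60000, -666666666, 42000],
    "no_internet": [50, 120, 0],
    "total_population": [1800, 2100, 0],
})


def clean_acs(frame, value_cols):
    """Mask negative placeholder values and drop tracts with no residents."""
    cleaned = frame.copy()
    # Counts and dollar amounts cannot be negative, so treat negatives as missing
    cleaned[value_cols] = cleaned[value_cols].where(cleaned[value_cols] >= 0)
    # Rates are undefined for unpopulated tracts (assumed filter)
    cleaned = cleaned[cleaned["total_population"] > 0]
    return cleaned.reset_index(drop=True)


clean = clean_acs(raw, ["median_income", "no_internet", "total_population"])
print(clean)
```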

### 3. Exploration and Feature Engineering
---
*(Describe and present any code and methods used for exploring and visualizing the data, including statistical analysis and examination of correlations between features. Perform feature engineering to support model development.)*

In [None]:
# -----------------------------
# State-level digital access summary
# -----------------------------

# Aggregate by state
state_summary = df.groupby('state').agg(
    total_no_internet=('no_internet', 'sum'),
    total_with_computer=('with_computer', 'sum'),
    total_population=('total_population', 'sum')
).reset_index()

# Merge FIPS to get state names
state_summary['state_name'] = state_summary['state'].map(lambda x: states.lookup(x).name)

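# Calculate the approximate share of households without internet.
# Note: the two counts come from different tables (B28002 and B28001) and can
# overlap, so their sum is only a rough stand-in for total households; the
# table total B28002_001E would give the exact denominator.
state_summary['pct_no_internet'] = state_summary['total_no_internet'] / (
    state_summary['total_no_internet'] + state_summary['total_with_computer']
)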

# Optional: households without internet per 100 households
state_summary['No_Internet_per_100_Households'] = (state_summary['pct_no_internet'] * 100).round(1)

# Rename columns for readability
state_summary = state_summary.rename(columns={
    "total_no_internet": "Households_No_Internet",
    "total_with_computer": "Households_With_Computer",
    "pct_no_internet": "Pct_No_Internet"
})

# Format percentage column
state_summary['Pct_No_Internet'] = (state_summary['Pct_No_Internet'] * 100).round(1)

# Sort states by % households without internet
state_summary = state_summary.sort_values('Pct_No_Internet', ascending=False)

# Display top 5 states with highest % of households without internet
top5 = state_summary.head(5)
print("Top 5 States with Highest % of Households Without Internet:")
print(top5[['state_name', 'Pct_No_Internet', 'Households_No_Internet', 'Households_With_Computer']])

# Display bottom 5 states with lowest % of households without internet
bottom5 = state_summary.tail(5)
print("\n5 States with Lowest % of Households Without Internet:")
print(bottom5[['state_name', 'Pct_No_Internet', 'Households_No_Internet', 'Households_With_Computer']])
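
The state-level summary above could be complemented by tract-level features. The following is a minimal sketch on toy data, not the project's actual results. The `households` column is hypothetical (it would correspond to a total such as B28002_001E, which is not among the variables pulled above), and dividing bachelor's degree counts by total population is only an approximation because the education table covers residents aged 25 and over.

```python
import pandas as pd

# Toy tract rows; all numbers are invented for illustration.
tracts = pd.DataFrame({
    "no_internet": [120, 40, 15, 200, 10],
    "households": [800, 900, 700, 1000, 600],  # hypothetical denominator column
    "median_income": [38000, 61000, 92000, 30000, 105000],
    "bachelors": [90, 210, 330, 60, 350],
    "total_population": [2100, 2400, 1900, 2600, 1500],
})

# Engineered features: rates per 100 rather than raw counts
tracts["pct_no_internet"] = 100 * tracts["no_internet"] / tracts["households"]
tracts["bachelors_per_100"] = 100 * tracts["bachelors"] / tracts["total_population"]

# Pairwise Pearson correlations between the engineered features and income
features = ["pct_no_internet", "median_income", "bachelors_per_100"]
corr = tracts[features].corr()
print(corr.round(2))
```

Working with rates instead of counts keeps large tracts from dominating the comparison, and the correlation matrix can be passed to `sns.heatmap` for a visual summary.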






### 4. Modeling and Evaluation
---
*(Describe and present the analytic models built on the data and evaluate the performance of the models for solving the problem)*
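
One way the modeling step could look is sketched below, using the `LogisticRegression`, `train_test_split`, and `classification_report` imports from the data-loading cell. Everything here runs on synthetic data generated inside the block. The feature names, the top-quartile definition of a "low access" tract, and the scaling step are all assumptions made for illustration, so the output says nothing about the real ACS data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic tracts: the no-internet share falls as income and education rise
rng = np.random.default_rng(0)
n = 500
income = rng.normal(60000, 20000, n).clip(15000, None)
pct_bachelors = rng.uniform(5, 60, n)
noise = rng.normal(0, 2, n)
pct_no_internet = (25 - 0.0002 * income - 0.1 * pct_bachelors + noise).clip(0, 100)

data = pd.DataFrame({
    "median_income": income,
    "pct_bachelors": pct_bachelors,
    "pct_no_internet": pct_no_internet,
})

# Assumed target: tracts in the top quartile of the no-internet share
threshold = data["pct_no_internet"].quantile(0.75)
data["low_access"] = (data["pct_no_internet"] >= threshold).astype(int)

X = data[["median_income", "pct_bachelors"]]
y = data["low_access"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standardize features so the coefficients are comparable across predictors
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(classification_report(y_test, model.predict(X_test)))

# Coefficient signs and sizes hint at which factors matter most
coefficients = dict(zip(X.columns, model.named_steps["logisticregression"].coef_[0]))
print(coefficients)
```

Because a top-quartile target is imbalanced, precision and recall for the low-access class are more informative than overall accuracy.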

### 5. Conclusion
---
*(Briefly describe what you have done and what you discovered. Discuss any shortcomings of the process and results. Propose future work. **Finally, discuss the lessons learned from doing the project**.)*

### 6. References

---
# Use the following requirements for preparing your project:

## DO NOT DELETE THE CELLS BELOW

# Project Requirements

This final project assesses the knowledge students have gained from the course. The following course outcomes will be checked against the content of the report:

Upon successful completion of this course, a student will be able to:
* observe and explore a variety of quantitative methods for data analysis;
* understand methods’ evaluation techniques to interpret their output;
* implement and evaluate methods to gain technical experience with data; and
* reproducibly execute an analytic project and represent/communicate its results faithfully.

**Marking will be focused on both presentation and content.**

## Written Presentation Requirements
The report will be judged on the basis of visual appearance, grammatical correctness, and quality of writing, as well as its contents. Please make sure that the text of your report is well-structured, using paragraphs, full sentences, and other features of well-written presentation.

## Technical Content:
* Is the problem well defined and described thoroughly?
* Is the size and complexity of the data set used in this project comparable to that of the example data sets used in the lectures and assignments?
* Did the report describe the characteristics of the data?
* Did the report describe the goals of the data analysis?
* Did the analysis conduct exploratory analyses on the data?
* Did the analysis build models of the data and evaluate their performance?
* Overall, what is the rating of this project?