# Step 4: Exploratory Data Analysis

The goal of this step is to get a feel for the data set: what may be relevant and where connections may lie. As a first pass, I focus mostly on article counts and how they relate to race, topics, and census tracts. This closely mirrors what the website currently offers.

In [176]:
import pandas as pd
import textwrap # for wrapping output text
from datetime import datetime # for converting and using dates
from dateutil import parser
import pytz

In [186]:
# Load geocoded articles
articles = pd.read_csv("../data/processed/gbh_geocoded_output.csv")

articles.head(3)

Unnamed: 0,Index,ID,Coordinates,Block,Census Tract,Neighborhood,County,Closest Topic,Publication Date
0,0,65adafef8d9d92f2327ea8ff,"[-71.0201972, 42.3665992]",1044,981300,East Boston,25,Weather,Mon Dec 18 12:34:19 EST 2023
1,1,65adafef8d9d92f2327ea923,"[-71.06939, 42.3561948]",1000,981700,Downtown,25,State Politics,Thu Nov 30 14:12:21 EST 2023
2,2,65adafef8d9d92f2327ea8fd,"[-71.148182, 42.357122]",2032,102,Allston,25,GBH,Mon Dec 18 16:00:52 EST 2023


In [187]:
# Currently, not all census tracts are related to a neighborhood. So, there are some NaN values in the neighborhood column.
# Only Suffolk County (25) has tracked neighborhoods.
articles.notna().sum()

Index               1056
ID                  1056
Coordinates         1056
Block               1056
Census Tract        1056
Neighborhood         925
County              1056
Closest Topic       1056
Publication Date    1056
dtype: int64
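
To check which counties account for the missing neighborhoods, the null counts can be broken down by county. This is a minimal sketch on made-up rows: the toy values and the county code 17 are assumptions for illustration, not taken from the data set.

```python
import pandas as pd

# Made-up rows; only the column names mirror the articles table
toy = pd.DataFrame({
    "County": [25, 25, 17, 17, 25],
    "Neighborhood": ["Downtown", "Fenway", None, None, "Allston"],
})

# Count rows with a missing neighborhood in each county
missing_by_county = toy["Neighborhood"].isna().groupby(toy["County"]).sum()
print(missing_by_county)
```

Run against the real `articles` table before dropping rows, the same expression should show zero missing values for county 25 if the comment above holds.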

In [256]:
# Drop rows without neighborhoods. Later on this will not be needed.
articles.dropna(subset=["Neighborhood"], inplace=True)

First, I parse the publication dates so that every later step can also be filtered by date.

In [189]:
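# Dictionary to map timezone abbreviations to a tzinfo object.
# dateutil attaches the zone with datetime.replace(), which can leave pytz zones
# on their local-mean-time offset, so dateutil's own tz.gettz is used here instead.
from dateutil import tz

eastern_tz = tz.gettz('US/Eastern')
tzinfos = {
    'EST': eastern_tz,
    'EDT': eastern_tz
}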

# Function to convert date strings to datetime objects
def convert_to_datetime(date_str):
    return parser.parse(date_str, tzinfos=tzinfos)

In [190]:
# Apply the conversion function to the 'Publication Date' column
articles['Publication Date'] = articles['Publication Date'].apply(convert_to_datetime)

# Ensure the Publication Date column is timezone-aware
articles['Publication Date'] = articles['Publication Date'].dt.tz_convert(pytz.timezone('US/Eastern'))
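
A quick way to sanity-check the parsing is to run it on one of the sample strings shown earlier and look at the resulting UTC offset. This is only a sketch: it uses `dateutil.tz.gettz` with the zone name `America/New_York`, which is a choice made for this example.

```python
from datetime import timedelta
from dateutil import parser, tz

# Map both abbreviations to the same zone; the offset is then derived from the date
eastern = tz.gettz("America/New_York")
tzinfos = {"EST": eastern, "EDT": eastern}

sample = "Mon Dec 18 12:34:19 EST 2023"  # same format as the Publication Date column
parsed = parser.parse(sample, tzinfos=tzinfos)

print(parsed.isoformat())
# A December date should carry the standard-time offset of UTC-5
print(parsed.utcoffset() == timedelta(hours=-5))
```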


In [240]:
# Define the date range (timezone-aware).
# The end bound is the last second of the day, so a single-day range
# covers every article published on that day rather than only midnight.
start_date = pytz.timezone('US/Eastern').localize(datetime(2023, 12, 18))
end_date = pytz.timezone('US/Eastern').localize(datetime(2023, 12, 18, 23, 59, 59))


# Filter the DataFrame for the date range
filtered_articles = articles[(articles['Publication Date'] >= start_date) & (articles['Publication Date'] <= end_date)]

print(f"Using articles from {start_date.date()} to {end_date.date()}")

Using articles from 2023-12-18 to 2023-12-18
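
When filtering a single day, a half-open interval (from the start of the day up to, but not including, the start of the next day) is one way to cover the whole day without picking an end-of-day timestamp. A sketch on made-up timestamps, where the toy table and its values are assumptions:

```python
import pandas as pd

# Toy data standing in for the real articles table (values are made up)
toy = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "Publication Date": pd.to_datetime(
        ["2023-12-17 23:59:00", "2023-12-18 12:34:19", "2023-12-18 16:00:52"]
    ).tz_localize("US/Eastern"),
})

# Half-open interval [start, end): covers the whole of 2023-12-18
start = pd.Timestamp("2023-12-18", tz="US/Eastern")
end = start + pd.Timedelta(days=1)

mask = (toy["Publication Date"] >= start) & (toy["Publication Date"] < end)
one_day = toy[mask]
print(one_day["ID"].tolist())
```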


In [193]:
# To analyze all articles instead of the date-filtered subset, skip this step
all_articles = articles.copy()
articles = filtered_articles

In [195]:
# To go back to using all articles, run this step
articles = all_articles.copy()

We can now begin some basic analysis.

In [197]:
# Get the total count of articles

total_articles = articles["ID"].count()
print(f"Total articles: {total_articles}")

Total articles: 925


In [198]:
# Function to get the top 5 categories in a column by number of articles
def get_article_count(data, column):
    count = data[column].value_counts()
    top_count = count.head(5)
    top_count_df = top_count.reset_index()
    top_count_df.columns = [column, "Article Count"]

    return top_count_df

In [199]:
# Get the count of articles for a given category
def get_specific_count(data, column, category):
    return len(data[data[column] == category])

We can now get article counts, and rank them, for a given demographic unit (neighborhood, census tract, or county).

In [200]:
# Get all Neighborhoods
neighborhoods = articles["Neighborhood"].unique()
neighborhood_count = len(neighborhoods)

neighborhoods_str = ', '.join(neighborhoods)
wrapped_neighborhoods = textwrap.fill(neighborhoods_str, width=100)
print(f"The {neighborhood_count} available neighborhoods are: \n{wrapped_neighborhoods}")

The 22 available neighborhoods are: 
East Boston, Downtown, Allston, Roxbury, Fenway, Beacon Hill, Mission Hill, Back Bay, South Boston
Waterfront, Longwood, South Boston, West Roxbury, South End, Hyde Park, West End, Dorchester, North
End, Brighton, Jamaica Plain, Charlestown, Mattapan, Roslindale


In [201]:
# Count articles by Neighborhood
neighborhood_count = get_article_count(articles, "Neighborhood")

print("Top 5 Most Frequented Neighborhoods:\n", neighborhood_count.to_string(index=False))

Top 5 Most Frequented Neighborhoods:
 Neighborhood  Article Count
    Downtown            409
      Fenway            145
 Beacon Hill             84
     Roxbury             50
    Back Bay             36
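
Raw counts are easier to compare as a share of the total. The sketch below reuses the top-5 counts and the total of 925 articles printed above; the `Share (%)` column name is my own.

```python
import pandas as pd

# Top-5 counts copied from the output above; 925 is the total article count
top5 = pd.DataFrame({
    "Neighborhood": ["Downtown", "Fenway", "Beacon Hill", "Roxbury", "Back Bay"],
    "Article Count": [409, 145, 84, 50, 36],
})
total_articles = 925

# Share of all articles, as a percentage rounded to one decimal
top5["Share (%)"] = (top5["Article Count"] / total_articles * 100).round(1)
print(top5.to_string(index=False))
```

By this measure Downtown alone accounts for roughly 44% of the geocoded articles.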


In [202]:
# Example: Get the count of articles for a specific neighborhood
neighborhood = "Downtown"

neighborhood_specific_count = get_specific_count(articles, "Neighborhood", neighborhood)
print(f"Number of articles for {neighborhood}: {neighborhood_specific_count}")

Number of articles for Downtown: 409


In [237]:
# Get the census tracts for a specific neighborhood
def get_tracts_for_neighborhood(data, neighborhood):
    return data[data["Neighborhood"] == neighborhood]["Census Tract"].unique()

In [239]:
# Example: Get the census tracts for a specific neighborhood
neighborhood = "Downtown"
tracts = get_tracts_for_neighborhood(articles, neighborhood)

tracts_str = ', '.join(map(str, tracts))
wrapped_tracts = textwrap.fill(tracts_str, width=100)
print(f"The {neighborhood} neighborhood has the following census tracts: \n{wrapped_tracts}")

The Downtown neighborhood has the following census tracts: 
981700, 70201, 70102, 30302, 30301, 70202, 70103, 70302


In [225]:
# Get all Census Tracts
tracts = articles["Census Tract"].unique()
tracts_count = len(tracts)

tracts_str = ', '.join(map(str, tracts))
print(f"There are {tracts_count} census tracts available")

There are 104 census tracts available


In [204]:
# Count articles by Census Tract

tract_count = get_article_count(articles, "Census Tract")

# Drop duplicates to ensure unique Census Tract-Neighborhood pairs
unique_tract_neighborhood = articles[['Census Tract', 'Neighborhood']].drop_duplicates()

# Merge with the unique Census Tract-Neighborhood pairs to get the neighborhood information
tract_count_df = tract_count.merge(unique_tract_neighborhood, on='Census Tract', how='left')

tract_count_df.insert(1, "Neighborhood", tract_count_df.pop("Neighborhood"))

print("Top 5 Most Frequented Census Tracts:\n", tract_count_df.to_string(index=False))

Top 5 Most Frequented Census Tracts:
  Census Tract Neighborhood  Article Count
        30302     Downtown            319
        10103       Fenway            104
        20302  Beacon Hill             70
       981700     Downtown             37
        70201     Downtown             30
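
Dropping duplicate pairs does not by itself guarantee one neighborhood per tract: if a tract were listed under two neighborhoods, the left merge would repeat its row. A small check for that case, on made-up pairs (tract 999 is hypothetical and deliberately ambiguous):

```python
import pandas as pd

# Made-up pairs; tract 999 is deliberately listed under two neighborhoods
pairs = pd.DataFrame({
    "Census Tract": [30302, 10103, 999, 999],
    "Neighborhood": ["Downtown", "Fenway", "Roxbury", "South End"],
}).drop_duplicates()

# Distinct neighborhoods per tract; anything above 1 would duplicate rows in a left merge
per_tract = pairs.groupby("Census Tract")["Neighborhood"].nunique()
ambiguous = per_tract[per_tract > 1].index.tolist()
print(ambiguous)
```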


In [205]:
# Example: Get the count of articles for a specific Census Tract
tract = 30302

tract_specific_count = get_specific_count(articles, "Census Tract", tract)

print(f"Number of articles for Census Tract {tract}: {tract_specific_count}")

Number of articles for Census Tract 30302: 319


In [209]:
# Get all Counties. For now it's only Suffolk
counties = articles["County"].unique()
counties_count = len(counties)

counties_str = ', '.join(map(str,counties))
print(f"There are {counties_count} counties available")

There are 1 counties available


In [210]:
# Count articles by County
county_count = get_article_count(articles, "County")

# Other counties have no neighborhoods assigned in our database, so only one county shows up
print("Top 5 Most Frequented Counties:\n", county_count.to_string(index=False))

Top 5 Most Frequented Counties:
  County  Article Count
     25            925


In [224]:
# Example: Get the count of articles for a specific County
county = 25

county_specific_count = get_specific_count(articles, "County", county)
print(f"Number of articles for County #{county}: {county_specific_count}")

Number of articles for County #25: 925


Similarly, the same counts can be computed for topics.

In [227]:
# Get the total topics and their count

total_topics = articles["Closest Topic"].unique()
topics_count = len(total_topics)

topics_str = ', '.join(map(str, total_topics))
wrapped_topics = textwrap.fill(topics_str, width=100)
print(f"There are {topics_count} topics available: \n{wrapped_topics}")

There are 52 topics available: 
Weather, State Politics, GBH, Aging/Seniors, Guns, Infrastructure, Education, Other, Public Health,
Arts & Culture, Obituaries, Local Politics, Law Enforcement, Mental Health, Housing/Homelessness,
Labor/Workforce, Gentrification, Shopping, Addiction/Substance Use, Immigration, Equity & Justice,
Taxes, Philanthropy/Nonprofits, Small Business, Sports, Higher Education, Research & Development,
Youth, Accessiblity/Disablity, Civil Rights, Poverty & hunger, Population & Demographics, Childcare,
Homeland Security, Gender issues, LGBTQ+, Technology, Healthcare, Politics/Elections, Crime, Native
Americans, Veterans Affairs, Government, Race, Natural Disasters, Extremism, Families, Terrorism,
Construction, Environment, Religion, International Affairs


In [228]:
# Count articles of a Topic
topic_count = get_article_count(articles, "Closest Topic")

print("Top 5 Most Frequented Topics:\n", topic_count.to_string(index=False))

Top 5 Most Frequented Topics:
        Closest Topic  Article Count
               Other            136
      Local Politics            126
Housing/Homelessness             81
  Politics/Elections             69
      State Politics             35


In [229]:
# Example: Get the count of articles for a specific Topic
topic = "Local Politics"

topic_specific_count = get_specific_count(articles, "Closest Topic", topic)
print(f"Number of articles for {topic}: {topic_specific_count}")

Number of articles for Local Politics: 126


Next, we can get the topic distribution for any given tract or neighborhood.

In [261]:
# Get all topics and the 3 most frequent topics for each value of a column
# (e.g. each census tract or each neighborhood)

def get_all_topics(data, column):
    # Count articles per (column value, topic) pair
    topic_counts = data.groupby([column,'Closest Topic']).size().reset_index(name='counts')

    # Collect all topics for each column value
    all_topics = topic_counts.groupby(column)['Closest Topic'].apply(list).reset_index(name='All Topics')

    # Keep the (up to) 3 most frequent topics for each column value
    top_3_topics = topic_counts.sort_values([column, 'counts'], ascending=[True, False])
    top_3_topics = top_3_topics.groupby(column).head(3).reset_index(drop=True)
    main_3_topics = top_3_topics.groupby(column)['Closest Topic'].apply(list).reset_index(name='Main 3 Topics')

    # Merge the results
    result = pd.merge(all_topics, main_3_topics, on=column)
    return result
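
For comparison, the "main topics" part can also be written more compactly with `value_counts`. This is an alternative sketch on made-up rows, not the function above; only the column names are taken from the articles table.

```python
import pandas as pd

# Made-up rows; only the column names mirror the articles table
toy = pd.DataFrame({
    "Neighborhood": ["Downtown"] * 6 + ["Allston"],
    "Closest Topic": ["Local Politics"] * 3 + ["Other"] * 2 + ["Weather", "GBH"],
})

# For each neighborhood, rank topics by frequency and keep up to three
main_topics = toy.groupby("Neighborhood")["Closest Topic"].apply(
    lambda s: s.value_counts().head(3).index.tolist()
)
print(main_topics)
```

Groups with fewer than three topics simply return a shorter list, which matters for tracts that only have one article.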

In [262]:
# Get the main topics (up to 3) for a specific tract or neighborhood
def get_main_topics_from_demo(data, column, demo):
    topics = data[data[column] == demo]
    main_topics = topics["Main 3 Topics"].values[0]
    # Join whatever is available: some tracts have fewer than 3 topics
    return ", ".join(map(str, main_topics))


In [265]:
# Get all topics for the neighborhoods
neighborhood_topics = get_all_topics(articles, "Neighborhood")
neighborhood_topics.head(3)

Unnamed: 0,Neighborhood,All Topics,Main 3 Topics
0,Allston,"[Accessiblity/Disablity, Construction, GBH, Hi...","[Other, GBH, Local Politics]"
1,Back Bay,"[Arts & Culture, Civil Rights, Equity & Justic...","[Other, Politics/Elections, Local Politics]"
2,Beacon Hill,"[Accessiblity/Disablity, Addiction/Substance U...","[State Politics, Housing/Homelessness, Politic..."


In [266]:
# Example: Get the topics for a specific neighborhood
neighborhood = "Downtown"
topics =  get_main_topics_from_demo(neighborhood_topics, "Neighborhood", neighborhood)
print(f"The main 3 topics for the {neighborhood} neighborhood are: {topics}")

The main 3 topics for the Downtown neighborhood are: Local Politics, Other, Politics/Elections


In [267]:
# Get all topics for the census tracts
tract_topics = get_all_topics(articles, "Census Tract")
tract_topics.head(3)

Unnamed: 0,Census Tract,All Topics,Main 3 Topics
0,102,"[GBH, Higher Education, Immigration, Labor/Wor...","[GBH, Other, Higher Education]"
1,302,[Housing/Homelessness],[Housing/Homelessness]
2,402,[Other],[Other]


In [268]:
# Example: Get the topics for a specific tract
tract = 30302
topics =  get_main_topics_from_demo(tract_topics, "Census Tract", tract)
print(f"The main topics in the tract {str(tract)} are: " + str(topics))

The main topics in the tract 30302 are: Local Politics, Other, Politics/Elections



We can also count the articles for a given topic within a given demographic.

In [231]:
# Count articles of a topic on a given demographic

def get_topic_count(data, column, demographic, topic):
    filtered_articles = data[data[column] == demographic]
    topic_count = get_specific_count(filtered_articles, "Closest Topic", topic)
    return topic_count


In [232]:
# Example: Get the count of articles for a specific Topic in a specific Neighborhood
topic = "Local Politics"
neighborhood = "Downtown"

topic_specific_count = get_topic_count(articles, "Neighborhood", neighborhood, topic)
print(f"Number of articles for {topic} in {neighborhood}: {topic_specific_count}")

Number of articles for Local Politics in Downtown: 68


In [233]:
# Example: Get the count of articles for a specific Topic in a specific Census Tract
topic = "Local Politics"
tract = 30302

topic_specific_count = get_topic_count(articles, "Census Tract", tract, topic)
print(f"Number of articles for {topic} in Census Tract {tract}: {topic_specific_count}")

Number of articles for Local Politics in Census Tract 30302: 52


Combining these two gives a more useful data set: each topic together with the number of articles it has.
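
One way to build such a combined table is to group by both the demographic column and the topic. The sketch below runs on made-up rows; the table layout and the `Article Count` column name are my assumptions about what a useful result would look like.

```python
import pandas as pd

# Made-up rows; only the column names mirror the articles table
toy = pd.DataFrame({
    "Neighborhood": ["Downtown", "Downtown", "Downtown", "Fenway", "Fenway"],
    "Closest Topic": ["Local Politics", "Local Politics", "Other", "Other", "Sports"],
})

# Long-format table: one row per (neighborhood, topic) pair with its article count
topic_counts = (
    toy.groupby(["Neighborhood", "Closest Topic"])
    .size()
    .reset_index(name="Article Count")
    .sort_values(["Neighborhood", "Article Count"], ascending=[True, False])
    .reset_index(drop=True)
)
print(topic_counts.to_string(index=False))
```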

In [230]:
# Count articles by Date
date_counts = articles['Publication Date'].dt.date.value_counts().reset_index()

date_counts.columns = ['Publication Date', 'Article Count']

top_5 = date_counts.head(5)

print("\nTop 5 Dates:\n", top_5.to_string(index=False))


Top 5 Dates:
 Publication Date  Article Count
      2023-07-19             20
      2023-07-28             11
      2021-01-26             10
      2023-03-10              9
      2020-11-03              8
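
Daily counts can be noisy, so grouping by month may give a clearer picture. A sketch on made-up timestamps, where the monthly label format is my own choice:

```python
import pandas as pd

# Made-up publication timestamps in US/Eastern
dates = pd.to_datetime(
    ["2023-07-19 09:00", "2023-07-28 10:30", "2023-03-10 08:15"]
).tz_localize("US/Eastern")
toy = pd.DataFrame({"Publication Date": dates})

# Format each date as a "YYYY-MM" label, then count articles per label
monthly = toy["Publication Date"].dt.strftime("%Y-%m").value_counts().sort_index()
print(monthly)
```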
