# Step 1: Preprocessing data

This notebook loads the cleaned data, organizes it, and distills it into a new dataframe containing only the fields pertinent to the analysis.

In [11]:
# Import necessary libraries

import pandas as pd # For data manipulation
import ast # For converting string to list


In [12]:
# Load cleaned data 

data = pd.read_csv("../data/processed/gbh_output_cleaned.csv")

data.head(3)

Unnamed: 0.1,Unnamed: 0,_id,neighborhoods[0],neighborhoods[1],neighborhoods[2],neighborhoods[3],neighborhoods[4],neighborhoods[5],position_section,tracts[0],...,NER_Pass_1,NER_Pass_1_Sorted,NER_Pass_1_Coordinates,NER_prediction,NER_Sorted,NER_Sorted_Coordinates,Tracts,topic_model_body,tokens,closest_topic_client
0,0,65adafef8d9d92f2327ea8ff,East Boston,,,,,,Local News,981300,...,"[(the East Coast, 'LOC'), (Massachusetts, 'GPE...","[(Logan Airport, 'FAC'), (Logan, 'FAC'), (Loga...","[-71.0201972, 42.3665992]",,[],,['981300'],storm making its way up the East Coast brought...,"['storm', 'making', 'its', 'way', 'up', 'the',...",Weather
1,1,65adafef8d9d92f2327ea923,Downtown,,,,,,Local News,981700,...,,,,,[],,['981700'],House and Senate Democrats said Thursday they ...,"['House', 'and', 'Senate', 'Democrats', 'said'...",State Politics
2,2,65adafef8d9d92f2327ea8fd,Downtown,Back Bay,,,,,Local News,30301,...,,,,,[],,['000102'],GBH News the fastest growing local newsroom in...,"['GBH', 'News', 'the', 'fastest', 'growing', '...",GBH


Initially, the data contains a lot of columns, which makes it hard to focus on what really matters. So I will combine the coordinate columns into one and delete most of the columns that aren't going to be used.

In [13]:
# Get coordinates for each article

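# Extract the coordinates from the "Explicit_Pass_1" column.
# A value is kept only when the cell parses to a two-item list; its second item is used.
def extract_explicit_coordinates(value):
    if pd.isna(value):
        return None
    parsed = ast.literal_eval(value)  # Parse the string once
    return parsed[1] if isinstance(parsed, list) and len(parsed) == 2 else None

data["Explicit_Pass_1"] = data["Explicit_Pass_1"].apply(extract_explicit_coordinates)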

# Combine the coordinates columns into a single column
coordinates = data["Explicit_Pass_1"].combine_first(data["NER_Pass_1_Coordinates"]).combine_first(data["NER_Sorted_Coordinates"])
data.insert(2, "Coordinates", coordinates)
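
To illustrate how the coordinate fallback behaves, here is a minimal sketch on a toy dataframe. The values are made up for illustration; only the column names mirror the ones used above. `combine_first` keeps the caller's value where it is non-null and otherwise takes the value from its argument, so the priority order is `Explicit_Pass_1`, then `NER_Pass_1_Coordinates`, then `NER_Sorted_Coordinates`.

```python
import pandas as pd

# Toy data (hypothetical values); the column names mirror the notebook's
toy = pd.DataFrame({
    "Explicit_Pass_1": [[-71.06, 42.36], None, None, None],
    "NER_Pass_1_Coordinates": ["[-71.02, 42.37]", "[-71.15, 42.36]", None, None],
    "NER_Sorted_Coordinates": [None, None, "[-71.09, 42.33]", None],
})

# Each combine_first call fills only the gaps left by the columns before it
merged = (
    toy["Explicit_Pass_1"]
    .combine_first(toy["NER_Pass_1_Coordinates"])
    .combine_first(toy["NER_Sorted_Coordinates"])
)

print(merged.tolist())
```

Row 0 keeps the explicit value even though an NER value exists, rows 1 and 2 fall back to the first non-null NER column, and row 3 stays missing because no column has a value.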


In [14]:
# Clean Data

# Remove unnecessary columns for easier readability
for i in range(6):
    data.drop(columns=[f"neighborhoods[{i}]", f"tracts[{i}]"], inplace=True)

# Some columns also contain values that may not be correct, so we remove them and recollect them
columns_to_drop = [
    "Explicit_Pass_1", "NER_Pass_1_Coordinates", "NER_Sorted_Coordinates", "Tracts", "tracts[6]",
    "position_section", "author", "body_x", "content_id", "hl1_x", "hl2", "link", "userID",
    "uploadID", "dateSum", "hl1_y", "body_y", "llama_prediction", "NER_Pass_1",
    "NER_Pass_1_Sorted", "NER_prediction", "NER_Sorted", "topic_model_body", "tokens",
]
data.drop(columns=columns_to_drop, inplace=True)



# Move the publication date to the end
data.insert(4, "Publication Date", data.pop("pub_date"))

# Rename columns
data.rename(columns={"Unnamed: 0": "Index", "_id": "ID", "closest_topic_client": "Closest Topic"}, inplace=True)

# Save processed data
data.to_csv("../data/processed/gbh_processed_output.csv", index=False)

# Display the first few rows
data.head(5)

Unnamed: 0,Index,ID,Coordinates,author,Publication Date,body_x,hl1_x,link,Closest Topic
0,0,65adafef8d9d92f2327ea8ff,"[-71.0201972, 42.3665992]",Lisa Wardle,Mon Dec 18 12:34:19 EST 2023,A storm making its way up the East Coast broug...,"Winter storm snarls travel, knocks out power f...",https://www.wgbh.org/news/local/2023-12-18/win...,Weather
1,1,65adafef8d9d92f2327ea923,"[-71.06939, 42.3561948]",Chris Lisinski | State House News Service,Thu Nov 30 14:12:21 EST 2023,House and Senate Democrats said Thursday they ...,"Spending bill deal reached on Beacon Hill, Dem...",https://www.wgbh.org/news/politics/2023-11-30/...,State Politics
2,2,65adafef8d9d92f2327ea8fd,"[-71.148182, 42.357122]",WGBH Staff,Mon Dec 18 16:00:52 EST 2023,"GBH News, the fastest-growing local newsroom i...",GBH News to Establish \nEquity and Justice Rep...,https://www.wgbh.org/foundation/press/press-re...,GBH
3,3,65adafef8d9d92f2327ea90c,"[-71.0872662, 42.3276283]",Elena Eberwein,Wed Dec 13 05:00:00 EST 2023,"On a crisp fall day in Roxbury, Lana Andrews s...",From Google Docs to FaceTime: How a digital li...,https://www.wgbh.org/news/local/2023-12-13/fro...,Aging/Seniors
4,4,65adafef8d9d92f2327ea90b,"[-71.1084303, 42.3503433]",Jeremy Siegel,Wed Dec 13 12:16:10 EST 2023,<i>Researchers in Boston are examining a brain...,BU's CTE lab to look at Lewiston shooter's bra...,https://www.wgbh.org/news/local/2023-12-13/bus...,Guns
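
As a possible alternative to listing every column to drop, the clean-up could be expressed as a keep-list that selects, orders, and renames the wanted columns in one step. This is only a sketch of that variant, not what the cell above does; the toy values and the `keep` mapping are hypothetical, while the column names are taken from the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with a few of the notebook's column names
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "_id": ["a1", "b2"],
    "Coordinates": ["[-71.02, 42.37]", "[-71.07, 42.36]"],
    "pub_date": ["Mon Dec 18 12:34:19 EST 2023", "Thu Nov 30 14:12:21 EST 2023"],
    "closest_topic_client": ["Weather", "State Politics"],
    "tokens": ["['storm']", "['House']"],  # A column we do not need
})

# Keep-list: source column name -> final column name, in the desired order
keep = {
    "Unnamed: 0": "Index",
    "_id": "ID",
    "Coordinates": "Coordinates",
    "closest_topic_client": "Closest Topic",
    "pub_date": "Publication Date",
}

# Select only the wanted columns, then rename them
slim = toy[list(keep)].rename(columns=keep)
print(slim.columns.tolist())
```

With a keep-list, a column added upstream is ignored by default rather than silently carried through, at the cost of a `KeyError` if one of the wanted columns is ever missing.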


We can now view information about the data to be used and make sure everything is in order. As the output shows, no column has missing values and only the columns we want remain.

In [10]:
# Display basic information about the data
data.info()

# Check for missing values
data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064 entries, 0 to 1063
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Index             1064 non-null   int64 
 1   ID                1064 non-null   object
 2   Coordinates       1064 non-null   object
 3   Closest Topic     1064 non-null   object
 4   Publication Date  1064 non-null   object
dtypes: int64(1), object(4)
memory usage: 41.7+ KB


Index               0
ID                  0
Coordinates         0
Closest Topic       0
Publication Date    0
dtype: int64
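
The visual check above could also be turned into explicit assertions, so that a later run fails loudly if a column goes missing or null values appear. The sketch below assumes the five column names shown in the `info()` output; the helper name `check_processed` and the toy values are hypothetical.

```python
import pandas as pd

# Column names as listed in the info() output
EXPECTED_COLUMNS = ["Index", "ID", "Coordinates", "Closest Topic", "Publication Date"]

def check_processed(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame has unexpected columns or missing values."""
    assert list(df.columns) == EXPECTED_COLUMNS, f"unexpected columns: {list(df.columns)}"
    missing = df.isnull().sum()
    assert (missing == 0).all(), f"missing values found:\n{missing[missing > 0]}"

# Toy frame (hypothetical values) standing in for the processed data
toy = pd.DataFrame({
    "Index": [0, 1],
    "ID": ["a1", "b2"],
    "Coordinates": ["[-71.02, 42.37]", "[-71.07, 42.36]"],
    "Closest Topic": ["Weather", "State Politics"],
    "Publication Date": ["Mon Dec 18 12:34:19 EST 2023", "Thu Nov 30 14:12:21 EST 2023"],
})

check_processed(toy)  # Passes silently when everything is in order
```

In the notebook itself, the call would presumably be `check_processed(data)` right after the processed file is saved.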