# %% [markdown]

 # Data Exploration and Initial Analysis

 This notebook explores the raw dataset: its structure, the format of the text and entity columns, and the data quality issues that need to be addressed before model training.

 ## Setup: initialize environment and import required libraries

# %%

from notebook_config import DATASETS_DIR, FILES_DIR
import pandas as pd
import json

# %% [markdown]

 ## Load and Examine Raw Dataset

 Load the full dataset from `full_data.csv` into a DataFrame. All later cells inspect this `df`.

# %%

df = pd.read_csv(DATASETS_DIR / 'full_data.csv')

# %% [markdown]

 ## Initial Data Preview

 Display the first five rows of the DataFrame to get a quick overview of the data structure, column names, and typical content patterns. This helps identify the format of text fields and entity annotations.

# %%

df.head()

,Unnamed: 0,urls,text,persons,organizations,themes,locations
0,0,https://edition.cnn.com/2022/08/01/politics/go...,"""articleBody"":""A federal judge has ruled again...","Louie Gohmert,161;Timothy Kelly,493;Andrew Cly...","Dc District Court,439","TAX_POLITICAL_PARTY_REPUBLICANS,56;USPEC_POLIT...","Georgia,Pennsylvania,Texas"
1,1,https://www.cnn.com/2022/08/01/media/texas-dpr...,"""articleBody"":""More than a dozen major news or...","Laura Lee Prather,1177;Haynes Boone,1220;Nicol...","Texas Department Of Public Safety,128;Texas De...","TRIAL,82;TRIAL,551;DELAY,477;DELAY,2393;EDUCAT...","Robb Elementary School,Texas Department Of Pub..."
2,2,https://www.cnn.com/2022/08/01/politics/pact-a...,"""articleBody"":""Comedian Jon Stewart and vetera...","Pat Toomey,1802;Kate Bolduan,1114;Matt Zeller,...","Senate Majority Leader Chuck Schumer,1635;Whil...","TAX_POLITICAL_PARTY_REPUBLICANS,742;TAX_POLITI...","Iraq,America,Pennsylvania"
3,3,https://www.cnn.com/2022/08/01/politics/gop-la...,"""articleBody"":""A federal judge has ruled again...","Louie Gohmert,161;Timothy Kelly,493;Andrew Cly...","Dc District Court,439","TAX_POLITICAL_PARTY_REPUBLICANS,56;USPEC_POLIT...","Georgia,Pennsylvania,Texas"
4,4,https://www.cnn.com/2022/08/01/politics/vetera...,"""articleBody"":""A version of this story appears...","Pat Toomey,1342;Joe Manchin,4990;Paul Leblanc,...","Union On,2356;Senate Republicans,370;Veterans ...","EPU_POLICY_GOVERNMENT,344;EPU_POLICY_GOVERNMEN...","Pennsylvania,Capitol Hill,West Virginia,Americ..."
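
# %% [markdown]

 In the preview above, rows 0 and 3 show the same truncated text and entity fields under two different URLs (`edition.cnn.com` and `www.cnn.com`), which hints at duplicated articles. The sketch below shows one way to flag such rows with `DataFrame.duplicated`. The small frame and its shortened URLs are illustrative stand-ins, not rows read from `full_data.csv`.

```python
import pandas as pd

# Illustrative stand-in data: two rows share the same text under different URLs.
sample = pd.DataFrame({
    'urls': [
        'https://edition.cnn.com/a',  # hypothetical shortened URL
        'https://www.cnn.com/b',      # hypothetical shortened URL
        'https://www.cnn.com/a',      # hypothetical shortened URL
    ],
    'text': [
        '"articleBody":"A federal judge has ruled.",',
        '"articleBody":"More than a dozen major news outlets.",',
        '"articleBody":"A federal judge has ruled.",',
    ],
})

# keep=False marks every member of a duplicate group, not only the later copies.
dup_mask = sample.duplicated(subset=['text'], keep=False)
print(sample.loc[dup_mask, 'urls'].tolist())
```

 Comparing on `text` alone is an assumption; depending on the goal, the entity columns could be added to `subset`.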


# %% [markdown]

 ## End of Dataset Preview

 Display the last five rows to check that the end of the dataset follows the same layout as the beginning. This is a spot check only; it does not guarantee that every row in between is consistent.

# %%

df.tail()

,Unnamed: 0,urls,text,persons,organizations,themes,locations
687,687,https://www.cnn.com/2022/08/08/tech/baidu-robo...,"""articleBody"":""Tech giant Baidu announced Mond...",,"Google,2158;Traffic Safety Administration,1999...","TRAFFIC,1977;SOC_EMERGINGTECH,767;WB_168_ROADS...","San Francisco,Wuhan,Chongqing,United States,Be..."
688,688,https://www.cnn.com/2022/08/08/politics/alex-j...,"""articleBody"":""Approximately two years’ worth ...","Mark Bankston,303;Zoe Lofgren,2005;Federico An...","Justice Department,2482;Justice Department,255...","USPEC_POLITICS_GENERAL1,152;USPEC_POLITICS_GEN...","California,Sandy Hook,Texas"
689,689,https://www.cnn.com/cnn-underscored/health-fit...,"""articleBody"":""Some people run hot all the tim...","Napper Chelsea,6905;Angela Ballard,2050;Honeyw...","Walmart,8293","TAX_ECON_PRICE,2484;TAX_ECON_PRICE,10630;TAX_F...","Arizona,California,Leisure Town,Portugal"
690,690,https://www.cnn.com/2022/08/08/asia/taiwan-jos...,"""articleBody"":""China’s threat to Taiwan is “mo...","Joseph Wu,203;Joseph Wu,3082;Nancy Pelosi,403;...","Taiwan Defense Ministry,5967;Cnn,212;Cnn,3844;...","EPU_POLICY_CONGRESSIONAL,2806;UNGP_FORESTS_RIV...","Taiwan,Taiwan Strait,United States,America,Chi..."
691,691,https://www.cnn.com/2022/08/08/politics/aborti...,"""articleBody"":""Just how far people in the Sout...","Jenny Ma,6502;Julie Kaye,7147;Jeff Landry,7787...","Jackson Women Health Organization,4566;Georgia...","TAX_WORLDLANGUAGES_ALABAMA,2253;TAX_WORLDLANGU...","Mississippi,Texas,Arkansas,North Carolina,Sout..."


# %% [markdown]

 ## Dataset Structure Analysis

 Summarize the DataFrame structure and data types with `df.info()`. This reveals:
 - Total number of rows and columns
 - Memory usage
 - Data type of each column (helpful for separating string from numeric fields)
 - Non-null count per column, from which the number of missing values follows

# %%

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 692 entries, 0 to 691
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     692 non-null    int64 
 1   urls           692 non-null    object
 2   text           692 non-null    object
 3   persons        628 non-null    object
 4   organizations  673 non-null    object
 5   themes         667 non-null    object
 6   locations      692 non-null    object
dtypes: int64(1), object(6)
memory usage: 38.0+ KB
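
# %% [markdown]

 The `info()` output reports non-null counts, so with 692 rows the missing values are 64 for `persons`, 19 for `organizations`, and 25 for `themes`. A minimal sketch of computing these counts directly is shown below. It runs on a small stand-in frame with illustrative values; on the real data the same two lines would be applied to `df`.

```python
import pandas as pd

# Small stand-in frame; the cell values are illustrative only.
sample = pd.DataFrame({
    'persons': ['Louie Gohmert,161', None, 'Pat Toomey,1802'],
    'organizations': ['Dc District Court,439', 'Walmart,8293', None],
    'locations': ['Georgia,Pennsylvania,Texas', 'Arizona,California', 'Iraq,America'],
})

# Missing entries per column, and the same figure as a share of all rows.
missing = sample.isna().sum()
missing_share = (missing / len(sample)).round(3)
print(pd.DataFrame({'missing': missing, 'share': missing_share}))
```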


# %% [markdown]

 ## Statistical Summary

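 Compute summary statistics for the numeric columns with `df.describe()`. This reports:
 - Central tendency (the mean, and the median as the 50% row)
 - Dispersion (std, min, max, quartiles)

 Here the only numeric column is `Unnamed: 0`, a leftover row index running from 0 to 691, so the statistics describe row numbering rather than any feature of the articles.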

# %%

df.describe()

,Unnamed: 0
count,692.0
mean,345.5
std,199.907479
min,0.0
25%,172.75
50%,345.5
75%,518.25
max,691.0
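
# %% [markdown]

 Because the only numeric column is a row index, a more informative summary needs a derived numeric feature. One option, sketched here as an assumption rather than a step taken in this notebook, is the character length of the `text` column. The stand-in frame uses illustrative strings.

```python
import pandas as pd

# Stand-in frame with illustrative text values in the two formats seen in the data.
sample = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'text': [
        '"articleBody":"A federal judge has ruled.",',
        '<html lang="ar"></html>',
        'short',
    ],
})

# Derive a numeric feature from the text column, then summarise it.
text_length = sample['text'].str.len()
print(text_length.describe())
```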


# %% [markdown]

 ## Column Inventory

 List all column names in the DataFrame to understand the complete feature set available for analysis and model training.

# %%

df.columns

Index(['Unnamed: 0', 'urls', 'text', 'persons', 'organizations', 'themes',
       'locations'],
      dtype='object')

# %% [markdown]

 ## Detailed Row Inspection - First Sample

 Inspect all values in the first row to understand the complete structure of a single data point, including how entities are formatted and what metadata is available.

# %%

df.iloc[0]

Unnamed: 0                                                       0
urls             https://edition.cnn.com/2022/08/01/politics/go...
text             "articleBody":"A federal judge has ruled again...
persons          Louie Gohmert,161;Timothy Kelly,493;Andrew Cly...
organizations                                Dc District Court,439
themes           TAX_POLITICAL_PARTY_REPUBLICANS,56;USPEC_POLIT...
locations                               Georgia,Pennsylvania,Texas
Name: 0, dtype: object

# %% [markdown]

 ## Text Content Analysis - First Sample

 View the text field of the first row to understand the content format, length, and style. This helps determine if text preprocessing will be needed (HTML removal, JSON parsing, etc.).

# %%

df.iloc[0]['text']

'"articleBody":"A federal judge has ruled against House Republicans who tried to challenge security screening on Capitol Hill for members of Congress. Reps. Louie Gohmert of Texas, Andrew Clyde of Georgia and Lloyd Smucker of Pennsylvania were fined thousands of dollars each by the sergeant at arms after they skipped security screenings outside the House chamber that were put in place following the January 6, 2021, attack on the US Capitol. The trio then sued in DC District Court to challenge the House rules. But Judge Timothy Kelly, a Trump appointee, on Monday dismissed the case, saying he did not have jurisdiction. The House sergeant at arms and the House’s top administrator were protected from the court wading into the House rules because of the Constitution’s Speech or Debate Clause, the judge determined.  “Here, each challenged act of the House Officers qualifies as a legislative act,” Kelly wrote. “Thus, the Speech or Debate Clause bars the Members’ claims.”",'

# %% [markdown]

 ## Entity Analysis - Person Entities

 Examine the person entities in the first row to understand:
 - How person names are formatted
 - How each name is paired with a number (treated as an entity ID in this notebook)
 - Which delimiters are used: `;` separates entries and `,` separates a name from its number

# %%

df.iloc[0]['persons']

'Louie Gohmert,161;Timothy Kelly,493;Andrew Clyde,185;Lloyd Smucker,212'
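
# %% [markdown]

 The string above follows a `name,number;name,number` layout. A minimal parsing sketch is shown below; `parse_mentions` is a hypothetical helper name, and splitting on the last comma is a defensive assumption in case a name itself contains a comma.

```python
from typing import List, Tuple


def parse_mentions(field: str) -> List[Tuple[str, int]]:
    """Split 'name,number;name,number' into (name, number) pairs."""
    pairs = []
    for item in field.split(';'):
        if not item:  # tolerate a trailing ';'
            continue
        # Split on the last comma only, so the number is always the final part.
        name, number = item.rsplit(',', 1)
        pairs.append((name, int(number)))
    return pairs


persons = 'Louie Gohmert,161;Timothy Kelly,493;Andrew Clyde,185;Lloyd Smucker,212'
print(parse_mentions(persons))
```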

# %% [markdown]

 ## Entity Analysis - Organization Entities

 Examine organization entities and their IDs in the first row to understand the same formatting patterns for organizational entities, which may differ from person entities.

# %%

df.iloc[0]['organizations']

'Dc District Court,439'

# %% [markdown]

 ## Entity Analysis - Theme Labels

 View theme labels in the first row to understand how thematic categorization is implemented and what types of themes are present in the dataset.

# %%

df.iloc[0]['themes']

'TAX_POLITICAL_PARTY_REPUBLICANS,56;USPEC_POLITICS_GENERAL1,56;TAX_MILITARY_TITLE_SERGEANT_AT_ARMS,270;TAX_MILITARY_TITLE_SERGEANT_AT_ARMS,610;TAX_FNCACT_SERGEANT_AT_ARMS,270;TAX_FNCACT_SERGEANT_AT_ARMS,610;TAX_FNCACT_ADMINISTRATOR,640;CRISISLEX_CRISISLEXREC,380;TAX_MILITARY_TITLE_OFFICERS,824;TAX_FNCACT_OFFICERS,824;TAX_MILITARY_TITLE_SERGEANT,262;TAX_MILITARY_TITLE_SERGEANT,602;TAX_FNCACT_SERGEANT,262;TAX_FNCACT_SERGEANT,602;TAX_FNCACT_JUDGE,20;TAX_FNCACT_JUDGE,479;TAX_FNCACT_JUDGE,761;CRISISLEX_C07_SAFETY,88;CRISISLEX_C07_SAFETY,298;TRIAL,418;CONSTITUTIONAL,726;GENERAL_GOVERNMENT,138;EPU_POLICY_CONGRESS,138;TAX_FNCACT_APPOINTEE,513;'
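
# %% [markdown]

 The themes string repeats the same label with different numbers and ends with a trailing `;`. A small sketch of counting how often each label occurs in one row is given below. It uses a shortened excerpt of the string above.

```python
from collections import Counter

# Shortened excerpt of the first row's themes, including the trailing ';'.
themes = (
    'TAX_FNCACT_JUDGE,20;TAX_FNCACT_JUDGE,479;TAX_FNCACT_JUDGE,761;'
    'TRIAL,418;CONSTITUTIONAL,726;'
)

# Drop the empty piece left by the trailing ';' and keep only the label part.
labels = [item.rsplit(',', 1)[0] for item in themes.split(';') if item]
theme_counts = Counter(labels)
print(theme_counts.most_common())
```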

# %% [markdown]

 ## Entity Analysis - Location Entities

 Inspect location labels in the first row to understand how geographical entities are annotated and whether they follow the same format as other entity types.

# %%

df.iloc[0]['locations']

'Georgia,Pennsylvania,Texas'

# %% [markdown]

 ## Detailed Row Inspection - Second Sample

 Inspect all values in the 11th row to compare with the first sample and identify any variations in data format or content patterns across different articles.

# %%

df.iloc[10]

Unnamed: 0                                                      10
urls             https://arabic.cnn.com/middle-east/article/202...
text             <html lang="ar" dir="rtl" data-reactroot="" cl...
persons          Karen Jan Pierre,1533;Muhammad Ben Salman,706;...
organizations    White House,305;White House,1501;United Nation...
themes           TAX_FNCACT_ENVOY,1629;MOVEMENT_GENERAL,1181;SE...
locations        White House,Yemeni,United States,Yemen,Saudi A...
Name: 10, dtype: object

# %% [markdown]

 ## Text Content Analysis - Second Sample

 View the text field of the 11th row to compare content format and identify any differences in text structure, length, or formatting that might require different preprocessing approaches.

# %%

df.iloc[10]['text']

'<html lang="ar" dir="rtl" data-reactroot="" class="userconsent-cntry-ca userconsent-reg-global" data-rh="lang,dir"><head><script type="text/javascript" src="https://consumer.krxd.net/consent/set/f3b6d00d-676f-48d8-80ef-2b48af61105e?idt=device&amp;dt=kxcookie&amp;dc=1&amp;al=1&amp;tg=1&amp;cd=1&amp;sh=1&amp;re=1&amp;callback=Krux.ns._default.kxjsonp_consent_set_1"></script><script type="text/javascript" src="https://cdn.krxd.net/userdata/get?pub=f3b6d00d-676f-48d8-80ef-2b48af61105e&amp;technographics=1&amp;callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="https://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script type="text/javascript" async="" src="https://static.criteo.net/js/ld/publishertag.prebid.130.js"></script><script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/jF-AgDWy8ih0GfLx4Semh9UK/recaptcha__ar.js" crossorigin="anonymous" integrity="sha384-vWopNeBx+2UW4RMHgAX

# %% [markdown]

 ## Data Quality Assessment - JSON Wrapper Detection

 Count the rows whose text starts with `"articleBody"`, i.e. articles stored as a JSON key/value fragment rather than plain text. The count sets the scope of JSON unwrapping needed during preprocessing.

# %%

df[df['text'].str.startswith('"articleBody"')].shape[0]

574
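
# %% [markdown]

 One possible way to unwrap these rows is sketched below. It assumes the fragment becomes valid JSON once the trailing comma is dropped and the text is wrapped in braces, which holds for the first sample but is not verified for all 574 rows. `extract_article_body` is a hypothetical helper, and it returns the input unchanged when parsing fails.

```python
import json


def extract_article_body(raw: str) -> str:
    """Return the article text from a '"articleBody":"...",' fragment."""
    fragment = raw.strip().rstrip(',')
    try:
        # Wrapping the fragment in braces turns it into a one-key JSON object.
        return json.loads('{' + fragment + '}')['articleBody']
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not a parseable fragment (for example an HTML page): leave it as is.
        return raw


wrapped = '"articleBody":"A federal judge has ruled.",'
print(extract_article_body(wrapped))
```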

# %% [markdown]

 ## Data Quality Assessment - HTML Content Detection

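 Count the rows whose text starts with `<html`, i.e. articles stored as a full HTML page. The count sets the scope of HTML cleaning needed during preprocessing. Together with the 574 JSON-fragment rows, these 118 rows account for all 692 rows, so every article falls into one of the two formats.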

# %%

df[df['text'].str.startswith('<html')].shape[0]

118
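
# %% [markdown]

 The HTML sample in row 10 opens with a series of `<script>` tags, so any cleaning step has to drop script content as well as markup. The sketch below uses only the standard library `html.parser` module; `html_to_text` is a hypothetical helper, and a dedicated HTML library may be a better fit for production preprocessing.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(raw_html: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return ' '.join(parser.chunks)


page = (
    '<html lang="ar"><head><script type="text/javascript">var x = 1;</script>'
    '</head><body><p>Hello</p><p>world</p></body></html>'
)
print(html_to_text(page))
```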

# %% [markdown]

 ## Entity Mapping Construction

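 Build a dictionary that maps each entity ID (the number that follows a name) to the list of names seen with it, for persons and organizations. Locations carry no number in the data, so their key is derived by hashing the location name. The resulting lookup table:
 - Collects every name recorded under a given ID
 - Helps identify entity disambiguation patterns
 - Provides insights into entity frequency and distribution
 - Serves as the basis for a reverse lookup from names to IDs

 In the output, a single ID can list unrelated names (for example `161` covers Louie Gohmert, Brittney Griner, and Jacquelyne Germain), so the number should not be assumed to identify one entity across the dataset; it may be a per-article value such as a position in the text.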

# %%

import hashlib

entities_dict = {
    'persons': {},
    'organizations': {},
    'locations': {}
}


def stable_key(name):
    # Built-in hash() is salted per process for strings, so its values change
    # between runs; a digest keeps the keys in the saved JSON reproducible.
    return hashlib.sha256(name.encode('utf-8')).hexdigest()


for row in df[['persons', 'organizations', 'locations']].itertuples(index=False):
    # persons and organizations share the 'name,number;name,number' layout
    for _type in ('persons', 'organizations'):
        value = getattr(row, _type)
        if pd.isna(value):
            continue
        for item in value.split(';'):
            if not item:  # tolerate a trailing ';'
                continue
            # split on the last comma so a comma inside a name does not break parsing
            name, _id = item.rsplit(',', 1)
            entities_dict[_type].setdefault(_id, []).append(name)
    # locations are plain comma-separated names without a number
    if not pd.isna(row.locations):
        for location in row.locations.split(','):
            entities_dict['locations'].setdefault(stable_key(location), []).append(location)

print(json.dumps(entities_dict, indent=4))
with open(FILES_DIR / 'misc' / 'entities_dict.json', 'w') as f:
    json.dump(entities_dict, f, indent=4)

{
    "persons": {
        "161": [
            "Louie Gohmert",
            "Louie Gohmert",
            "Brittney Griner",
            "Jacquelyne Germain"
        ],
        "493": [
            "Timothy Kelly",
            "Timothy Kelly",
            "Taymoor Atighetchi",
            "Antonio Gutteres"
        ],
        "185": [
            "Andrew Clyde",
            "Andrew Clyde",
            "Nyota Uhura",
            "Nyota Uhura",
            "Madonna Louise Veronica Ciccone",
            "Joe Biden"
        ],
        "212": [
            "Lloyd Smucker",
            "Lloyd Smucker",
            "Blake Masters",
            "Suzanne Malveaux",
            "Suzanne Malveaux"
        ],
        "1177": [
            "Laura Lee Prather"
        ],
        "1220": [
            "Haynes Boone"
        ],
        "2136": [
            "Nicole Carroll"
        ],
        "1802": [
            "Pat Toomey",
            "Claudia Rebaza",
            "Sara Murray"
        ],
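
# %% [markdown]

 The visible part of the output already shows IDs that list more than one distinct name. A short sketch of how such IDs could be collected is given below. The `entities` dictionary is a hand-copied excerpt of the persons output above, not the full mapping.

```python
# Excerpt of the persons mapping shown in the output above.
entities = {
    '161': ['Louie Gohmert', 'Louie Gohmert', 'Brittney Griner', 'Jacquelyne Germain'],
    '1177': ['Laura Lee Prather'],
    '1802': ['Pat Toomey', 'Claudia Rebaza', 'Sara Murray'],
}

# Keep only the IDs recorded with more than one distinct name.
shared = {
    entity_id: sorted(set(names))
    for entity_id, names in entities.items()
    if len(set(names)) > 1
}
print(shared)
```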
       

# %% [markdown]

 ## Reverse Entity Mapping Construction

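 Build the reverse mapping from entity names to the IDs recorded with them. This complementary lookup:
 - Lists all IDs associated with a given entity name
 - Helps identify potential entity linking issues
 - Supports entity normalization and deduplication
 - Provides validation capabilities for entity extraction models

 IDs are appended once per occurrence, so repeats are kept (for example `Louie Gohmert` maps to `161` twice). A single name can also map to many IDs: Brittney Griner appears with 27 of them in the output.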

# %%

entities_reversed_dict = {
    'persons': {},
    'organizations': {},
    'locations': {}
}
for _type, elements in entities_dict.items():
    for k, list_of_entities in elements.items():
        for name in list_of_entities:
            entities_reversed_dict[_type].setdefault(name, []).append(k)

print(json.dumps(entities_reversed_dict, indent=4))
with open(FILES_DIR / 'misc' / 'entities_reversed_dict.json', 'w') as f:
    json.dump(entities_reversed_dict, f, indent=4)

{
    "persons": {
        "Louie Gohmert": [
            "161",
            "161"
        ],
        "Brittney Griner": [
            "161",
            "866",
            "758",
            "73",
            "1352",
            "1459",
            "616",
            "62",
            "2925",
            "51",
            "3405",
            "3459",
            "3524",
            "7495",
            "7602",
            "3068",
            "129",
            "93",
            "128",
            "541",
            "508",
            "1901",
            "3175",
            "4339",
            "4393",
            "989",
            "4048"
        ],
        "Jacquelyne Germain": [
            "161",
            "158"
        ],
        "Timothy Kelly": [
            "493",
            "493"
        ],
        "Taymoor Atighetchi": [
            "493"
        ],
        "Antonio Gutteres": [
            "493"
        ],
        "Andrew Clyde": [
            "185",
            "185"
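
# %% [markdown]

 Since the reverse mapping keeps one entry per occurrence, the same ID can appear several times under a name. If only the distinct IDs are needed, they can be deduplicated while preserving first-seen order, as in this sketch. `reversed_entities` is a hand-copied excerpt of the output above.

```python
# Excerpt of the reverse persons mapping shown in the output above.
reversed_entities = {
    'Louie Gohmert': ['161', '161'],
    'Jacquelyne Germain': ['161', '158'],
}

# dict.fromkeys keeps the first occurrence of each ID and preserves order.
deduped = {
    name: list(dict.fromkeys(ids))
    for name, ids in reversed_entities.items()
}
print(deduped)
```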
      
