Goal: understand how raw `categories` relate to `broad_category`, so the mapping can be refined later.

In [1]:
%load_ext autoreload
%autoreload 2

Cell 1 – imports & config

In [27]:
import sys
from pathlib import Path

import pandas as pd
import yaml

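# The project root is the parent of the current working directory,
# so the notebook must run from a directory one level below it.
PROJECT_ROOT = Path().resolve().parent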
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from src.data.load import load_all_sources, add_broad_category, load_category_mapping

CONFIG_PATH = PROJECT_ROOT / 'configs' / 'base.yaml'


Cell 2 – load config + data + broad categories

In [41]:
with CONFIG_PATH.open('r', encoding='utf-8') as f:
    cfg = yaml.safe_load(f)
print('Config loaded.')

Config loaded.


In [28]:
df_raw = load_all_sources(cfg, root=PROJECT_ROOT)
print('Data loaded:', df_raw.shape)

In [42]:
df = add_broad_category(df_raw, cfg, root=PROJECT_ROOT)

df[['source', 'categories', 'broad_category']].head()

        source       categories              broad_category
235575  Daily Times  World                   World
311984  Daily Times  Pakistan                Pakistan
257006  Daily Times  Pakistan                Pakistan
299841  Daily Times  Pakistan & Top Stories  Pakistan
230900  Daily Times  World                   World


Cell 3 – top raw categories overall

In [43]:
df['categories'].value_counts().head(50)


categories
Pakistan                                9927
World                                   6265
Business                                5594
National                                3530
Sports                                  3501
Lifestyle                               1211
Sport                                    923
Technology                               581
National & Top Headlines                 529
Comment & Opinion                        485
E-Papers & Pakistan Today                432
Letters & Opinion                        396
Op-Ed                                    376
Pakistan & Top Stories                   363
Punjab                                   333
Sindh                                    308
Reviews                                  292
Cricket                                  272
Editorials & Opinion                     263
Headlines & National                     254
Editorial                                240
Film                                     206

Cell 4 – top raw categories per source

In [44]:
for src in df['source'].unique():
    print(f'=== {src} ===')
    print(df[df['source'] == src]['categories'].value_counts().head(20))


=== Daily Times ===
categories
Pakistan                   2978
Sports                     1367
Lifestyle                  1211
World                      1160
Business                   1015
Op-Ed                       376
Pakistan & Top Stories      363
Reviews                     292
Editorial                   157
Infotainment                123
Arts, Culture & Books       104
Cartoons                     85
Perspectives                 81
Commentary / Insight         64
Uncategorized                63
Top Stories & World          57
Infotainment & Trending      43
Pakistan & World             32
Top Stories                  30
Blogs                        23
Name: count, dtype: int64
=== Dawn ===
categories
Pakistan    5277
World       2429
Business    1361
Sport        923
Prism         10
Name: count, dtype: int64
=== Pakistan Today ===
categories
National                                3530
Business                                2007
World                                    844

Cell 5 – confusion between raw and broad (crosstab)

In [45]:
ct = pd.crosstab(df['categories'], df['broad_category'])

# Report how many distinct raw and broad categories appear in the data
print(f'Number of unique raw categories: {ct.shape[0]}')
print(f'Number of unique broad categories: {ct.shape[1]}')
ct.head(20)

Number of unique raw categories: 370
Number of unique broad categories: 8


broad_category                        Business  Lifestyle  Opinion  Other  Pakistan  Sports  Technology  World
categories
Analysis & Headlines                         0          0        1      0         0       0           0      0
Analysis & National                          0          0        1      0         0       0           0      0
Art & Books                                  0         20        0      0         0       0           0      0
Arts, Culture & Books                        0        104        0      0         0       0           0      0
Arts, Culture & Books & Trending             0          1        0      0         0       0           0      0
Azad Jammu & Kashmir                         0          0        0      0         3       0           0      0
Balochistan                                  0          0        0      0        88       0           0      0
Balochistan & Pakistan                       0          0        0      0         6       0           0      0
Balochistan & Pakistan & Top Stories         0          0        0      0         1       0           0      0
Balochistan, Business                        1          0        0      0         0       0           0      0


Use this crosstab to spot:
- Raw categories that map to `Other`
- Raw categories that map to more than one broad category (possibly noisy labels)
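
Both checks can be read straight off a crosstab rather than by scanning it by eye. The sketch below is one possible way to do that; it runs on a small made-up frame (`toy`, with illustrative rows that are not taken from the dataset) so that it is self-contained, and the same expressions should apply to the notebook's `ct` as long as an `Other` column is present.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; these rows are illustrative only.
toy = pd.DataFrame({
    'categories': ['World', 'World', 'Top Stories', 'Top Stories', 'Fashion'],
    'broad_category': ['World', 'World', 'Pakistan', 'Other', 'Other'],
})

ct_toy = pd.crosstab(toy['categories'], toy['broad_category'])

# Raw categories with at least one row mapped to `Other`
# (raises KeyError if no row at all is mapped to `Other`).
to_other = ct_toy.index[ct_toy['Other'] > 0].tolist()

# Raw categories whose rows land in more than one broad category.
n_targets = (ct_toy > 0).sum(axis=1)
ambiguous = n_targets[n_targets > 1].index.tolist()

print('Maps to Other:', to_other)
print('Maps to several broad categories:', ambiguous)
```

Counting non-zero cells per row keeps the check independent of how many articles each raw category has, so rare labels are flagged as readily as frequent ones.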

Cell 6 – focus on `Other` to fix mapping

In [46]:
other_df = df[df['broad_category'] == 'Other']
len(other_df), other_df['categories'].nunique()


(499, 15)

In [47]:
other_df['categories'].value_counts().head(50)


categories
E-Papers & Pakistan Today     432
Headlines                      20
Uncategorized                  10
Fashion                         8
Trending                        6
Letters                         4
Top Stories                     4
Entertainment                   3
Magazine                        3
Rawalpindi                      2
E-Papers & Profit Magazine      2
Food                            2
Off-Beat                        1
Health                          1
Featured                        1
Name: count, dtype: int64

Based on these counts, we can later update `configs/category_mapping_v1.yaml` so that fewer raw categories fall into `Other`.
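
To decide which raw categories to remap first, it may help to look at each one's share of the `Other` rows. The sketch below re-enters the counts printed by the `value_counts()` output above as a literal Series (in the notebook, `other_df['categories'].value_counts()` would give the same thing) and adds share and cumulative-share columns. The names `other_counts` and `summary` are illustrative, not part of the project code.

```python
import pandas as pd

# Counts copied from the `Other` value_counts output above.
other_counts = pd.Series({
    'E-Papers & Pakistan Today': 432,
    'Headlines': 20,
    'Uncategorized': 10,
    'Fashion': 8,
    'Trending': 6,
    'Letters': 4,
    'Top Stories': 4,
    'Entertainment': 3,
    'Magazine': 3,
    'Rawalpindi': 2,
    'E-Papers & Profit Magazine': 2,
    'Food': 2,
    'Off-Beat': 1,
    'Health': 1,
    'Featured': 1,
}, name='count')

# Share of all `Other` rows per raw category, plus a running total.
summary = other_counts.to_frame()
summary['share'] = summary['count'] / summary['count'].sum()
summary['cum_share'] = summary['share'].cumsum()

print(summary.head().round(3))
```

With these counts, `E-Papers & Pakistan Today` alone accounts for 432 of the 499 `Other` rows (about 87%), so a single mapping decision for that label would remove most of the bucket.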