Goal: understand how the raw `categories` labels relate to the derived `broad_category`, so the mapping can be refined later.

Cell 1 – imports & config

In [3]:
import sys
from pathlib import Path

import pandas as pd
import yaml

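# Treat the parent of the current working directory as the project root,
# and make it importable so that `src.*` resolves.
PROJECT_ROOT = Path().resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))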

from src.data.load import load_all_sources, add_broad_category, load_category_mapping

CONFIG_PATH = PROJECT_ROOT / 'configs' / 'base.yaml'


Cell 2 – load config + data + broad categories

In [4]:
with CONFIG_PATH.open('r', encoding='utf-8') as f:
    cfg = yaml.safe_load(f)

df = load_all_sources(cfg, root=PROJECT_ROOT)
df = add_broad_category(df, cfg, root=PROJECT_ROOT)

df[['source', 'categories', 'broad_category']].head()


INFO:src.data.load:Loading source pakistan_today from /home/spark/NUST/Semester 5/Data Mining/Project/data/raw/pakistan_today(full-data).csv (encoding=utf-8)
INFO:src.data.load:Loading source tribune from /home/spark/NUST/Semester 5/Data Mining/Project/data/raw/tribune(full-data).csv (encoding=latin1)
INFO:src.data.load:Loading source dawn from /home/spark/NUST/Semester 5/Data Mining/Project/data/raw/dawn (full-data).csv (encoding=latin1)
INFO:src.data.load:Loading source daily_times from /home/spark/NUST/Semester 5/Data Mining/Project/data/raw/daily_times(full-data).csv (encoding=utf-8)
INFO:src.data.load:Loading preprocessed business_reorder from /home/spark/NUST/Semester 5/Data Mining/Project/data/interim/business_reorder_clean.parquet
INFO:src.data.load:Filtered invalid sources: (625905, 7) -> (624642, 7)
INFO:src.data.load:Combined dataset shape: (624642, 7)
INFO:src.data.load:Sampling up to 10000 rows per source (__file__ column).
  .apply(lambda g: g.sample(min(len(g), per_sourc

        source       categories              broad_category
235575  Daily Times  World                   World
311984  Daily Times  Pakistan                Pakistan
257006  Daily Times  Pakistan                Pakistan
299841  Daily Times  Pakistan & Top Stories  Pakistan
230900  Daily Times  World                   World


Cell 3 – top raw categories overall

In [5]:
df['categories'].value_counts().head(50)


categories
Pakistan                                5277
Pakistan                                4650
Business                                4233
World                                   3836
NATIONAL                                3530
Sports                                  3501
World                                   2429
Business                                1361
Lifestyle                               1211
Sport                                    923
Technology                               581
NATIONAL & Top Headlines                 529
Comment & Opinion                        485
E-papers & Pakistan Today                432
Letters & Opinion                        396
Op-Ed                                    376
Pakistan & Top Stories                   363
Punjab                                   333
Sindh                                    308
Reviews                                  292
Cricket                                  272
Editorials & Opinion                     263
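
Several labels in the counts above appear twice with different totals (`Pakistan`, `World`, `Business`), and later outputs show variants such as `Islamabad` / `ISLAMABAD` and the HTML entity `&amp;`. One possible explanation, not verified here, is that the raw strings differ only in whitespace, case, or escaping. A minimal sketch of a normalization step follows; `normalize_category` is a hypothetical helper, not part of `src.data.load`, and the sample data is made up for illustration.

```python
import html
import re

import pandas as pd


def normalize_category(raw: str) -> str:
    """Collapse cosmetic variants of a raw category label (hypothetical helper)."""
    text = html.unescape(str(raw))            # "&amp;" -> "&"
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs, trim ends
    return text.casefold()                    # make comparison case-insensitive


# Toy data standing in for df['categories']; not taken from the real dataset.
sample = pd.Series(
    ["Pakistan", "Pakistan ", "Arts, Culture &amp; Books", "ISLAMABAD", "Islamabad"]
)
normalized = sample.map(normalize_category)
print(normalized.value_counts())
```

Comparing `df['categories'].nunique()` before and after such a step would show how many of the apparent duplicates are purely cosmetic.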

Cell 4 – top raw categories per source

In [18]:
for src in df['source'].unique():
    print(f'=== {src} ===')
    print(df[df['source'] == src]['categories'].value_counts().head(20))


=== Daily Times ===
categories
Pakistan                     2978
Sports                       1367
Lifestyle                    1211
World                        1160
Business                     1015
Op-Ed                         376
Pakistan & Top Stories        363
Reviews                       292
Editorial                     157
Infotainment                  123
Arts, Culture &amp; Books     104
Cartoons                       85
Perspectives                   81
Commentary / Insight           64
Uncategorized                  63
Top Stories & World            57
Infotainment & Trending        43
Pakistan & World               32
Top Stories                    30
Blogs                          23
Name: count, dtype: int64
=== Dawn ===
categories
Pakistan     5277
World        2429
Business     1361
Sport         923
Prism          10
Name: count, dtype: int64
=== Pakistan Today ===
categories
NATIONAL                                3530
Business                                2007

Cell 5 – confusion between raw and broad (crosstab)

In [7]:
ct = pd.crosstab(df['categories'], df['broad_category'])
ct.head(20)


broad_category                        Business  Lifestyle  Opinion  Other  Pakistan  Sports  Technology  World
categories
Analysis & HEADLINES                         0          0        0      1         0       0           0      0
Analysis & NATIONAL                          0          0        0      0         1       0           0      0
Art and Books                                4          0        1      4         0       2           2      7
Arts, Culture &amp; Books                   23          0        1     12         3       4           6     55
Arts, Culture &amp; Books & Trending         0          0        0      0         0       0           0      1
Azad Jammu & Kashmir                         1          0        0      0         2       0           0      0
Balochistan                                  0          0        0      0        88       0           0      0
Balochistan & Pakistan                       0          0        0      0         6       0           0      0
Balochistan & Pakistan & Top Stories         0          0        0      0         1       0           0      0
Balochistan, Business                        0          0        0      0         1       0           0      0


Use this crosstab to spot:
- Raw categories that are mapped to `Other`
- Raw categories that map to more than one broad category, which may indicate noisy or inconsistent labels
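
Both checks can be read off the crosstab programmatically rather than by eye. The sketch below uses a small toy frame in place of `df` (the rows are illustrative, not real counts) and only standard pandas operations.

```python
import pandas as pd

# Toy frame standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "categories": ["Balochistan", "Balochistan", "Art and Books", "Art and Books", "Cartoons"],
    "broad_category": ["Pakistan", "Pakistan", "World", "Business", "Other"],
})
ct = pd.crosstab(toy["categories"], toy["broad_category"])

# Raw categories with at least one row mapped to Other.
maps_to_other = ct.index[ct["Other"] > 0].tolist()

# Raw categories whose rows are spread over more than one broad category.
n_targets = (ct > 0).sum(axis=1)
ambiguous = n_targets[n_targets > 1].index.tolist()

print("Mapped to Other:", maps_to_other)
print("Multiple broad categories:", ambiguous)
```

On the real data, sorting `n_targets` in descending order would put the most scattered raw labels first.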

Cell 6 – focus on `Other` to fix mapping

In [8]:
other_df = df[df['broad_category'] == 'Other']
len(other_df), other_df['categories'].nunique()


(815, 45)

In [9]:
other_df['categories'].value_counts().head(50)


categories
E-papers & Pakistan Today       432
Cartoons                         85
Life & Style                     32
TV                               24
Islamabad                        23
HEADLINES                        20
K-P                              20
Spotlight                        19
Music                            15
KARACHI                          12
Arts, Culture &amp; Books        12
CITY                             12
LAHORE                           11
Uncategorized                    10
Life & Style, TV                  8
Fashion                           8
life and style                    7
Trending                          6
ISLAMABAD                         6
Top Stories                       4
Letters                           4
Art and Books                     4
GOVERNANCE & HEADLINES            3
Life & Style, Fashion             3
Life & Style, Art and Books       3
Magazine                          3
Entertainment                     3
PESHAWAR         

Based on this, we can later update `configs/category_mapping_v1.yaml` to reduce the number of rows that fall into `Other`.
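
One way to prepare that update is to turn the most frequent `Other` labels into a reviewable list of candidate entries. The sketch below is only an illustration: the schema of `category_mapping_v1.yaml` is not shown in this notebook, so the flat `raw label: broad category` layout, the `proposals` dict, and the `TODO` placeholder are all assumptions, and the sample labels are toy data.

```python
import pandas as pd
import yaml

# Toy stand-in for other_df['categories']; counts are illustrative.
other_labels = pd.Series(["Cartoons"] * 3 + ["Life & Style"] * 2 + ["TV"])

# Hypothetical hand-made proposals; not the project's actual mapping.
proposals = {"Life & Style": "Lifestyle", "TV": "Lifestyle"}

# Walk labels from most to least frequent; mark undecided ones as TODO.
counts = other_labels.value_counts()
candidates = {label: proposals.get(label, "TODO") for label in counts.index}

# Emit a YAML fragment for manual review (assumed flat layout).
print(yaml.safe_dump(candidates, sort_keys=False, allow_unicode=True))
```

The printed fragment is meant to be reviewed by hand and adapted to whatever structure the real mapping file uses before anything is merged.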