# Author Augmentation & Normalization

**Goal:** Create a consistent Author-Article Graph.

**Methodology:**
1. **Preservation:** Extract valid authors from the heuristic extraction results. Normalize names (remove titles, standardize formats) and assign unique IDs. These links are preserved exactly as found.
2. **Augmentation:** For articles missing author metadata (empty fields), we impute authors by sampling from the identified pool of real authors.
3. **Statistical Targets:**
   - **Authors per Article:** Target mean of $\approx 1.5$.
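   - **Articles per Author:** Target mean of $\approx 5.0$. Because imputed authors are drawn only from the extracted pool, the achieved mean depends on the pool size: it equals the total number of author-article links divided by the number of authors.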

In [21]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [22]:
import pandas as pd
import re
import numpy as np
import ast
from tqdm import tqdm

tqdm.pandas()

In [23]:
# Load Data
data = pd.read_csv("05-heuristic_extraction.csv")

# Ensure list columns are parsed correctly from strings
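def safe_eval(x):
    """Parse a stringified list; return [] for NaN, non-lists, or unparsable text."""
    try:
        if pd.isna(x):
            return []
        val = ast.literal_eval(x)
    except (ValueError, SyntaxError, TypeError):
        # Catch only parsing errors instead of using a bare except
        return []
    return val if isinstance(val, list) else []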

data['authors_list_ar'] = data['authors'].apply(safe_eval)
data['authors_list_en'] = data['authors_en'].apply(safe_eval)
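
As a small illustration of why this parsing step is needed: list-valued columns written to CSV come back as plain strings, and `ast.literal_eval` turns them back into Python lists. The sample strings and the `parse_list` helper below are made up for demonstration and are not taken from the dataset.

```python
import ast

# Hypothetical cell values as they might look after a CSV round trip
samples = ["['Ahmed Ali', 'Sara Omar']", "[]", "not a list", "'single name'"]

def parse_list(text):
    """Return the list encoded in `text`, or [] if it is not a valid list literal."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []

parsed = [parse_list(s) for s in samples]
print(parsed)
```

Only the first sample yields a non-empty list; malformed text and non-list literals fall back to an empty list.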

## Name Normalization
Author names are standardized by stripping academic and honorific titles (Dr., Prof., etc.), digits, and stray punctuation, so that formatting variants of the same name resolve to a single author entity.

In [24]:
def clean_author_name(name):
    if not isinstance(name, str):
        return None
    
    name = name.strip()
    
    # --- Arabic Cleanups ---
    # Remove titles like د. (Dr), أ. (Prof/Mr), م. (Eng), etc.
    name = re.sub(r'^\s*(أ\.د\.|أ\. د\.|د\.|أ\.|م\.|الاستاذ|الدكتور|الباحث|الطالب|الشيخ)\s+', '', name)
    # Remove non-name characters (underscores, asterisks, brackets, digits)
    name = re.sub(r'[_\*\(\)\[\]\d]', '', name)

    # --- English Cleanups ---
    name = re.sub(r'^\s*(Dr\.|Prof\.|Mr\.|Ms\.|Mrs\.|Eng\.|PhD)\s+', '', name, flags=re.IGNORECASE)
    
    # Normalize whitespace
    name = re.sub(r'\s+', ' ', name).strip()
    
    # Filter out garbage (too short)
    if len(name) < 3:
        return None
        
    return name
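
One property of the title patterns above is worth seeing in isolation: each requires whitespace after the title and is applied once, so `Dr.Jane Doe` keeps its title and only the first of two stacked titles is removed. The sketch below contrasts the English pattern with a hypothetical relaxed variant (`RELAXED` and `strip_titles` are illustrative names, and the sample names are invented). The relaxed form could over-strip unusual names, so treat it as an option to evaluate rather than a drop-in replacement.

```python
import re

# English title pattern as used in clean_author_name (whitespace required after the title)
ORIGINAL = re.compile(r'^\s*(Dr\.|Prof\.|Mr\.|Ms\.|Mrs\.|Eng\.|PhD)\s+', re.IGNORECASE)
# Hypothetical variant: whitespace after the title is optional
RELAXED = re.compile(r'^\s*(Dr\.|Prof\.|Mr\.|Ms\.|Mrs\.|Eng\.)\s*', re.IGNORECASE)

def strip_titles(name, pattern=RELAXED):
    """Strip leading titles repeatedly until the name stops changing."""
    previous = None
    while previous != name:
        previous = name
        name = pattern.sub('', name, count=1)
    return name.strip()

for raw in ["Dr. Jane Doe", "Dr.Jane Doe", "Prof. Dr. Jane Doe"]:
    print(repr(raw), '->', repr(ORIGINAL.sub('', raw)), '|', repr(strip_titles(raw)))
```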

## Indexing Existing Authors

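We iterate through the dataset to build a registry of the real authors found in the extraction phase. Each new normalized name is assigned a sequential ID of the form `AUTH_00001`, in order of first appearance, and every later occurrence of that name reuses the same ID.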

In [25]:
# Registry: Name -> ID
author_registry = {}
next_auth_id = 1

# List to store the IDs for each article (preserving order)
article_author_ids = []

for idx, row in tqdm(data.iterrows(), total=len(data)):
    raw_authors = row['authors_list_ar'] + row['authors_list_en']
    
    current_ids = []
    
    if raw_authors:
        for raw_name in raw_authors:
            clean_name = clean_author_name(raw_name)
            if clean_name:
                # Register new author if not found
                if clean_name not in author_registry:
                    author_registry[clean_name] = f"AUTH_{next_auth_id:05d}"
                    next_auth_id += 1
                
                current_ids.append(author_registry[clean_name])
        
        # Deduplicate authors within the same article, keeping first-seen order
        # (a set would not guarantee a stable order across runs)
        current_ids = list(dict.fromkeys(current_ids))
    
    article_author_ids.append(current_ids)

data['author_ids'] = article_author_ids

print(f"Total Unique Authors Extracted: {len(author_registry)}")

# Create pool for sampling
all_author_ids_pool = list(author_registry.values())

100%|██████████| 1452/1452 [00:00<00:00, 29254.01it/s]

Total Unique Authors Extracted: 227
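
The registry logic can be sketched on a toy input to make the ID assignment and per-article deduplication easier to follow. The names below are invented, and `dict.setdefault` and `dict.fromkeys` are used here as one compact way to express the same idea, not necessarily the form used in the notebook.

```python
# Hypothetical cleaned names for two articles; the second repeats an author
articles = [["Jane Doe", "Ali Hasan"], ["Ali Hasan", "Ali Hasan", "Omar Said"]]

registry = {}      # name -> ID, grows as new names appear
article_ids = []   # one ID list per article, in article order

for names in articles:
    ids = []
    for name in names:
        # setdefault stores the next sequential ID only the first time a name is seen
        ids.append(registry.setdefault(name, f"AUTH_{len(registry) + 1:05d}"))
    # dict.fromkeys drops duplicates while keeping first-seen order
    article_ids.append(list(dict.fromkeys(ids)))

print(registry)
print(article_ids)
```

Because dictionaries preserve insertion order, the resulting ID lists are stable from run to run.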





## Imputing Missing Data

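Articles with no extracted authors are assigned authors sampled uniformly at random, without replacement, from the existing author pool. Articles that already have authors are left unchanged.
The number of authors per imputed article is drawn from a weighted distribution chosen to achieve a mean of $\approx 1.5$.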

**Distribution Weights:**
- 1 Author: 60%
- 2 Authors: 30%
- 3 Authors: 8%
- 4 Authors: 2%
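
As a quick arithmetic check, the expected number of authors implied by these weights can be computed directly. Note that this applies to imputed articles only; the overall mean also depends on the articles whose extracted authors are preserved.

```python
# Distribution weights from the list above: author count -> probability
weights = {1: 0.60, 2: 0.30, 3: 0.08, 4: 0.02}

# The probabilities should sum to 1 (up to floating-point error)
total_probability = sum(weights.values())

# Expected value: sum of count * probability
expected_authors = sum(k * p for k, p in weights.items())
print(f"Expected authors per imputed article: {expected_authors:.2f}")  # 1.52
```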

In [26]:
np.random.seed(42)

# Weighted probabilities to target ~1.5 authors/article
probs = [0.60, 0.30, 0.08, 0.02]

def fill_empty_authors(ids_list):
    # PRESERVE existing data
    if len(ids_list) > 0:
        return ids_list
    
    # FILL missing data from pool
    if not all_author_ids_pool:
        return []
    
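    # Draw the author count, capped so sampling without replacement cannot exceed the pool
    k = min(np.random.choice([1, 2, 3, 4], p=probs), len(all_author_ids_pool))
    chosen = np.random.choice(all_author_ids_pool, size=k, replace=False)
    # tolist() returns plain Python str values rather than NumPy string scalars
    return chosen.tolist()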

data['author_ids'] = data['author_ids'].apply(fill_empty_authors)
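
For comparison, here is a minimal sketch of the same imputation step written against NumPy's `Generator` API (`np.random.default_rng`) rather than the global `np.random.seed` state used in the notebook. The function name `impute_authors`, the three-element pool, and the seed are assumptions for illustration. The sketch caps `k` at the pool size and converts the sample to plain Python strings.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator (seed chosen for illustration)
COUNTS = [1, 2, 3, 4]
PROBS = [0.60, 0.30, 0.08, 0.02]

def impute_authors(existing_ids, pool, rng=rng):
    """Keep existing authors; otherwise sample k distinct IDs from the pool."""
    if existing_ids or not pool:
        return list(existing_ids)
    # Never request more distinct authors than the pool contains
    k = min(int(rng.choice(COUNTS, p=PROBS)), len(pool))
    # tolist() converts NumPy strings back to plain Python str
    return rng.choice(pool, size=k, replace=False).tolist()

pool = ["AUTH_00001", "AUTH_00002", "AUTH_00003"]   # hypothetical small pool
print(impute_authors(["AUTH_00002"], pool))         # existing link is preserved
print(impute_authors([], pool))                     # between 1 and 3 sampled IDs
```

Passing the generator explicitly keeps the sampling reproducible without relying on global random state.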

## Statistical Aggregation

In [27]:
# 1. Create ID -> Name Map
id_to_name = {v: k for k, v in author_registry.items()}

# 2. Explode (Article -> AuthorID) to (AuthorID -> Article)
exploded = data[['article_id', 'author_ids']].explode('author_ids')
exploded = exploded.dropna(subset=['author_ids'])

# 3. Group by Author
authors_stats = exploded.groupby('author_ids').agg(
    articles_count=('article_id', 'count'),
    articles_ids=('article_id', list)
).reset_index()

# 4. Attach Names
authors_stats['name'] = authors_stats['author_ids'].map(id_to_name)
authors_stats.rename(columns={'author_ids': 'id'}, inplace=True)

authors_stats = authors_stats[['id', 'name', 'articles_count', 'articles_ids']]

print("--- Statistics ---")
mean_authors_per_article = data['author_ids'].apply(len).mean()
mean_articles_per_author = authors_stats['articles_count'].mean()

print(f"Mean Authors per Article: {mean_authors_per_article:.2f}")
print(f"Mean Articles per Author: {mean_articles_per_author:.2f}")
print("\n--- Top 5 Authors ---")
authors_stats.sort_values(by='articles_count', ascending=False).head()

--- Statistics ---
Mean Authors per Article: 1.54
Mean Articles per Author: 9.88

--- Top 5 Authors ---


| | id | name | articles_count | articles_ids |
|---|---|---|---|---|
| 165 | AUTH_00166 | Awadh Ahmed Hasan | 21 | [140, 211, 235, 325, 340, 373, 489, 636, 640, ... |
| 42 | AUTH_00043 | Abdelouahed Bouberria | 19 | [49, 54, 60, 244, 308, 449, 474, 687, 725, 823... |
| 30 | AUTH_00031 | Ashwaq Abdulhakeem Alsabaan | 17 | [42, 473, 521, 550, 604, 740, 762, 764, 851, 9... |
| 145 | AUTH_00146 | Nader Mahmoud Taffach | 17 | [114, 272, 297, 407, 412, 466, 587, 605, 699, ... |
| 175 | AUTH_00176 | Elshahat Anwar Barakat | 17 | [153, 366, 456, 524, 607, 609, 661, 684, 763, ... |


## Output Generation

In [28]:
authors_stats.describe()

| | articles_count |
|---|---|
| count | 227.0 |
| mean | 9.881057 |
| std | 3.26268 |
| min | 3.0 |
| 25% | 7.0 |
| 50% | 10.0 |
| 75% | 12.0 |
| max | 21.0 |


In [36]:
final_data.head()

| | article_id | abstract_ar | source | path | keywords | author_ids | author_count |
|---|---|---|---|---|---|---|---|
| 0 | 1 | هدفت هذه الدراسه الا الكشف عن درحه اكتساب طلبه... | AJP | finalpdfs/AJP_1337_1.pdf | ['كليه التربيه البحث الاجرائي جامعه السلطان قا... | [AUTH_00001] | 1 |
| 1 | 2 | هدفت هذه الدراسه التعرف الا days ممارسه القياد... | AJP | finalpdfs/AJP_1338_2.pdf | ['القياده الاستباقيه اداره bE البيئه المدرسيه ... | [AUTH_00002, AUTH_00003] | 2 |
| 2 | 3 | هدفت الدراسه التعرف الا درحه ممارسه معلمي اللغ... | AJP | finalpdfs/AJP_1339_3.pdf | ['مهارات التذوق SLA النصوص الشعريه معلمي اللغه... | [AUTH_00004] | 1 |
| 3 | 4 | هدف البحث الا تحليل محتوا منهاج الرياضيات للصف... | AJP | finalpdfs/AJP_1340_4.pdf | ['تحليل cst محتوا منهاج الرياضيات ples التمثيل... | [AUTH_00005] | 1 |
| 4 | 5 | هدف البحث الا تعرف اثر تصميم تعليمي وفق التعلم... | AJP | finalpdfs/AJP_1341_5.pdf | ['تصميم تعليميء التعلم الخبراق التحصيلء كفاءه ... | [AUTH_00006] | 1 |


In [37]:
final_data["author_count"]=final_data["author_ids"].apply(len)
final_data["author_count"].describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_data["author_count"]=final_data["author_ids"].apply(len)


count    1452.000000
mean        1.544766
std         0.807481
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        11.000000
Name: author_count, dtype: float64
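
The reported figures can be cross-checked against each other: the total number of author-article links is the article count times the mean authors per article, and dividing that by the number of authors should reproduce the mean articles per author. The short check below uses only numbers printed in this notebook and assumes every registered author appears in at least one article (consistent with the `count` of 227 above).

```python
n_articles = 1452          # count from author_count.describe()
mean_authors = 1.544766    # mean authors per article
n_authors = 227            # unique authors extracted

# Total author-article links (edges in the graph)
total_links = round(n_articles * mean_authors)
mean_articles_per_author = total_links / n_authors

print(total_links)                         # 2243
print(f"{mean_articles_per_author:.2f}")   # 9.88
```

This explains why the achieved mean of about 9.88 articles per author sits above the $\approx 5.0$ target: with 1452 articles and a pool of only 227 authors, reaching 5.0 would require roughly twice as many authors or about half as many links.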

In [40]:
final_data.columns

Index(['article_id', 'abstract_ar', 'source', 'path', 'keywords', 'author_ids',
       'author_count'],
      dtype='object')

In [41]:
# Save Authors Database
authors_stats.to_csv("06-authors.csv", index=False)

# Save Enriched Dataset
# Retaining core columns and the new ID linkages
final_cols = [c for c in data.columns if c not in ['authors', 'authors_en', 'authors_list_ar', 'authors_list_en']]
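# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
final_data = data[final_cols].copy()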

final_data.to_csv("06-data-extraction-final.csv", index=False)

print("Process Complete. Files saved.")

Process Complete. Files saved.
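
One practical note for downstream use: list-valued columns such as `author_ids` are stored in CSV as their string representation, so they need to be parsed again after loading. The sketch below demonstrates the round trip on a small made-up frame held in memory (an `io.StringIO` buffer stands in for the saved file).

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame with a list-valued column
df = pd.DataFrame({
    "article_id": [1, 2],
    "author_ids": [["AUTH_00001"], ["AUTH_00002", "AUTH_00003"]],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "['AUTH_00001']"
was_string = isinstance(reloaded.loc[0, "author_ids"], str)

# Parse the string representation back into Python lists
reloaded["author_ids"] = reloaded["author_ids"].apply(ast.literal_eval)
print(was_string, reloaded.loc[1, "author_ids"])
```

This round trip only works if the list elements are plain Python strings when the file is written.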
