### 🧾 Project Overview

This notebook outlines a complete web crawling and data structuring pipeline for gathering and organizing information about Persian scientists and philosophers. Using **Selenium**, the crawler loads web pages and extracts detailed biographical content. The extracted text is then converted to a **predefined JSON schema** with the help of two **language models** (`deepseek-reasoner`, called through an API, and a local `Llama-3.2-3B-Instruct`), which interpret the raw text and fill the schema fields.

Additionally, for geographic data enrichment, the pipeline calls an external API from [**Nominatim (OpenStreetMap)**](https://nominatim.openstreetmap.org/search) to fetch accurate latitude and longitude coordinates for location names mentioned in the profiles.

This notebook demonstrates a blend of web automation, natural language processing, and geocoding to produce structured, enriched datasets ready for downstream use.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Utils

In [None]:
import json

# Pretty-print a JSON-serializable record, keeping non-ASCII (Persian) text readable.
def prettyPrint(json_record):
  print(json.dumps(json_record, indent=2, ensure_ascii=False))

### 📑 Paragraph Selection Strategy

Due to the context window limitations of the language model and the high cost associated with each API call to the **DeepSeek Reasoner**, it is not practical to send all crawled paragraphs for schema conversion. To address this, a rule-based filtering approach is used to pre-select the most relevant paragraphs before passing them to the language model.

The strategy involves defining a list of **keywords and phrases** that are likely to appear in paragraphs containing information relevant to the target JSON schema fields—such as names, birth and death details, professions, works, honors, and geographical data.

The selection function iterates over the crawled paragraphs and checks whether any keyword appears in each one. A paragraph that contains at least one keyword is added to the list of selected paragraphs. To increase variety and avoid repetition, matched keywords are removed from the list once used, so a later paragraph is selected only if it matches a keyword that has not been consumed yet.

This lightweight, interpretable method aims to pass only paragraphs that are likely to carry schema-relevant information to the language model, which reduces both prompt size and API cost.
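
The effect of consuming keywords can be seen on a toy input. The sketch below is a simplified, self-contained version of the idea; `select_paragraphs`, the three-keyword list, and the sample sentences are hypothetical and stand in for the notebook's much longer list.

```python
def select_paragraphs(paragraphs, keywords):
    """Keep paragraphs that contain at least one keyword not used so far."""
    remaining = list(keywords)  # copy so the caller's list is untouched
    selected = []
    for p in paragraphs:
        found = [kw for kw in remaining if kw in p]
        if found:
            # Matched keywords are consumed; later paragraphs need new ones.
            remaining = [kw for kw in remaining if kw not in found]
            selected.append(p)
    return selected

# Hypothetical sample data: one keyword each for birth, works, and death.
sample_keywords = ["زاده", "کتاب", "درگذشت"]
sample_paragraphs = [
    # matches the birth keyword
    "او در ری زاده شد.",
    # matches only the birth keyword again, which is already used
    "وی در نیشابور زاده شد.",
    # matches the works keyword
    "مهم‌ترین کتاب او الحاوی است.",
    # matches no keyword
    "این جمله کلیدواژه‌ای ندارد.",
]
picked = select_paragraphs(sample_paragraphs, sample_keywords)
print(len(picked))
```

Only the first and third sample paragraphs are kept: the second repeats a keyword that was already consumed, and the fourth matches nothing.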


In [None]:
def select_top_k(k, texts):
  """Select paragraphs that contain at least one not-yet-used keyword.

  Note: `k` is accepted but not used here; the caller truncates the result.
  """
  # Every line must end with a comma; otherwise Python silently concatenates
  # adjacent string literals and the merged keyword never matches.
  keywords = [
    "نام", "نام کامل", "نام مستعار", "القاب", "معروف به", "شخصیت", "دانشمند", "شاعر", "نویسنده","فیزیکدان","ریاضیدان","شیمیدان","محقق","نویسنده","شاعر",
    "تولد", "متولد", "زاده", "زادگاه", "محل تولد", "محل زاده شدن", "دیده به جهان گشود", "در سال ... به دنیا آمد", "سال تولد",
    "درگذشت", "وفات", "فوت", "مرد", "جان سپرد", "در تاریخ ... درگذشت", "زمان مرگ", "سال مرگ",
    "دفن", "محل دفن", "مدفون", "قبر", "آرامگاه", "مزار", "خاک‌سپاری", "به خاک سپرده شد",
    "شهر", "استان", "منطقه", "روستا", "محل اقامت", "محل زندگی", "مکان","نهاد","بازنشسته", "عضو",
    "شغل", "حرفه", "فعالیت", "سمت", "عنوان", "استاد", "وزیر", "پژوهشگر", "هنرمند", "دانشمند",
    "آثار", "کتاب‌ها", "نوشته‌ها", "تالیفات", "نوشته", "مقاله", "کتاب", "آثار علمی", "مهم‌ترین آثار","جایزه", "مدال","افتخارات","تحصیلات","مطالعات","دانشگاه",
    "دوره", "دوران", "عصر", "زمانه", "سده", "قرن","سلسله",
    "رویداد", "شرکت کرد در", "نقش داشت در", "مشارکت", "جنگ", "قیام", "نهضت", "انقلاب", "حضور در", "فعالیت در",
    "تصویر", "عکس", "پرتره", "جوانی", "دوران پیری", "عکس آرامگاه", "تصویر قبر", "چهره"
  ]
  selected = []
  for p in texts:
    found = [keyword for keyword in keywords if keyword in p]
    if found:
      # Drop matched keywords so later paragraphs must match new ones.
      keywords = [kw for kw in keywords if kw not in found]
      selected.append(p)
  return selected

### 🧠 Language Model Integration

This section defines two methods for interacting with language models used in the pipeline to transform and structure crawled data:

---

🔹 `send_to_llm()` — Using DeepSeek Reasoner (via OpenAI-compatible API)

This function sends a user-defined prompt to a hosted version of the `deepseek-reasoner` model using OpenAI’s API interface. Key features include:

- **Custom API Endpoint**: The function supports custom `api_key` and `base_url`, allowing access to privately hosted models or third-party LLM providers.
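- **Retry Logic**: On rate limits or other errors, it retries up to `retries` times, sleeping a fixed `wait_seconds` after a rate-limit error and 2 seconds after any other exception. It returns `None` if every attempt fails.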

---

🔹 `send_to_llama()` — Using a Local LLaMA Model (Unsloth LLaMA-3)

This function runs `unsloth/Llama-3.2-3B-Instruct` locally through Hugging Face’s `transformers` pipeline. Highlights include:

- **Efficient Inference**: It uses `torch_dtype=torch.bfloat16` and `device_map="auto"` to reduce memory use and place the model on the available hardware.
- **Chat-Style Input**: The prompt is wrapped in a conversation-like structure (using roles such as `"user"`), making it compatible with instruction-tuned models.
- **Bounded Generation**: Each call generates at most `1024` new tokens, which is enough for summarization or partial schema filling. The output is returned in one piece, not streamed.

This model offers a more cost-effective alternative for tasks that don't require the reasoning depth of a hosted API model.

---

Together, these two functions enable flexible use of both cloud-hosted and local language models, balancing quality, performance, and cost depending on the task at hand.
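
The retry loop in `send_to_llm` sleeps a fixed interval between attempts. A wrapper that doubles the wait after each failure could look like the sketch below. It is an illustration only: `call_with_backoff`, `base_delay`, and the injected `sleep` parameter are hypothetical names, and `flaky` merely simulates a transient failure without contacting any API.

```python
import time

def call_with_backoff(fn, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:  # in real use, catch narrower error types
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1}/{retries} failed: {exc}; waiting {delay}s")
            sleep(delay)
    return None  # all attempts failed

# Simulated transient failure: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary error")
    return "ok"

waits = []
# Record the requested waits instead of actually sleeping.
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the sketch testable; in the notebook the default `time.sleep` would be used.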


In [None]:
import openai
import requests
import time

OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"  
OPENAI_BASE_URL = "YOUR_OPENAI_BASE_URL"

client = openai.OpenAI(api_key=OPENAI_API_KEY,base_url=OPENAI_BASE_URL)

def send_to_llm(prompt, model_name="deepseek-reasoner", retries=5, wait_seconds=1):
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.2,
            )
            print("✅ Response received")
            return response.choices[0].message.content.strip()

        except openai.RateLimitError as e:
            print(f"⚠️ Rate limit hit. Waiting {wait_seconds} seconds... (attempt {attempt+1}/{retries})")
            time.sleep(wait_seconds)

        except Exception as e:
            print(f"❌ Unexpected error: {e}")
            time.sleep(2)

    print("❌ Failed after retries.")
    return None

In [7]:
import torch
from transformers import pipeline
import random

model_id = "unsloth/Llama-3.2-3B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


Device set to use cuda:0


In [8]:
def send_to_llama(prompt):
  messages = [
    # {"role": "system", "content": "you are a good summerizer, summerize the following text"},
   {"role": "user", "content": prompt},
  ]
  outputs = pipe(
    messages,
    max_new_tokens=1024,
  )
  return outputs[0]["generated_text"][-1]['content']

### JSON Extraction from Text with LLM

This function, `extract_json_from_text()`, is designed to extract structured data from raw Wikipedia-style Persian biographies using prompt-based interaction with a reasoning-capable language model.

- **Prompt Engineering**: The prompt includes a detailed instruction along with a well-structured example (`Example 1`) to guide the LLM in producing JSON output that matches a predefined schema. This example-driven approach improves accuracy by showing the model what a correct output looks like.
  
- **DeepSeek Reasoner Usage**: The function sends the prompt to a high-level reasoning model (e.g., `deepseek-reasoner`) capable of understanding and interpreting complex, unstructured text. This is critical for extracting fields like names, birthplaces, dates, and works from loosely formatted biography paragraphs.

- **JSON Parsing**: The function trims the model output so that it starts at the first `{` character and then evaluates the result with Python's `eval()`. This works for the dict-like output seen in this controlled setting, but `eval()` executes arbitrary expressions and fails on JSON literals such as `true`, `false`, and `null`, so it is not safe for untrusted output.
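
A stricter alternative is to parse the reply with the standard `json` module. The sketch below is one possible approach, not the notebook's method: `parse_first_json_object` is a hypothetical helper, and the sample reply is made up. It uses `json.JSONDecoder().raw_decode`, which parses one JSON value and tolerates trailing text.

```python
import json

def parse_first_json_object(text):
    """Return the first JSON object found in text, or None if there is none."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        obj, _end = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return obj

# Made-up model reply with text before and after the JSON object.
reply = 'Here is the result:\n{"name": "زکریای رازی", "works": ["الحاوی"], "sex": null}\nDone.'
parsed = parse_first_json_object(reply)
print(parsed["name"], parsed["sex"])
```

Unlike `eval()`, this never executes model output as code, and it returns `None` for unparseable replies so the caller can retry or skip the record.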


In [None]:
def extract_json_from_text(text, schema):
    prompt = f"""
              You are an intelligent extractor. Given a short biography, extract structured information based on the following schema:\n {schema}
              Only return valid JSON, using double quotes.


              Example 1:
              Text:
              زکریای رازی، از برجسته‌ترین دانشمندان و پزشکان دوران اسلامی، در ۲۷ اوت سال ۸۶۵ میلادی در شهر ری، واقع در استان ری (در جنوب‌شرق تهران امروزی)، چشم به جهان گشود. این دانشمند بزرگ که با لقب‌هایی چون «پدر شیمی نوین» و «رازس» نیز شناخته می‌شود، یکی از چهره‌های درخشان عصر عباسیان بود که آثار و دستاوردهای او تا قرن‌ها پس از مرگش الهام‌بخش دانشمندان و پزشکان باقی ماند.
تحصیلات و مشاغل علمی
رازی در طول زندگی خود، به فعالیت در حوزه‌های گوناگون علمی چون پزشکی، فلسفه، و شیمی پرداخت. او نبوغ بی‌مانند خود را نه تنها در نگارش کتاب‌ها و رسالات متعدد، بلکه در خدمت به جامعه نیز به نمایش گذاشت. یکی از برجسته‌ترین دوره‌های زندگی علمی او، مربوط به ریاست بیمارستان معتضدی بغداد در حوالی سال ۹۰۰ میلادی است. رازی در این سمت، که به فرمان خلیفه معتضد عباسی به او سپرده شده بود، با تدبیر و ابتکار عمل، گام‌های بزرگی در جهت ساماندهی و ارتقاء سطح خدمات پزشکی برداشت.
دستاوردهای علمی
از دیگر افتخارات بزرگ زکریای رازی، می‌توان به کشف الکل و اسید سولفوریک اشاره کرد؛ کشف‌هایی که در بازه‌ای میان سال‌های ۹۱۰ تا ۹۲۰ میلادی، از دل آزمایش‌های پرتکرار و علمی او در زادگاهش ری به دست آمدند. این اکتشافات نه تنها مسیر دانش شیمی را در دوران اسلامی دگرگون کردند، بلکه پایه‌گذار بخشی از اصول شیمی مدرن نیز شدند.
آثار مکتوب
رازی در کنار فعالیت‌های بالینی و آزمایشگاهی، از نویسندگان پرکار زمان خود نیز بود. از میان آثار مهم او می‌توان به کتاب‌های زیر اشاره کرد:
الحاوی: یکی از جامع‌ترین دایرةالمعارف‌های پزشکی دوران اسلامی.
المنصوری: کتابی در پزشکی که برای شاه منصور سامانی نگاشته شد.
کتاب الاسرار: اثری در زمینهٔ شیمی تجربی که بسیاری از تجربیات او را در خود جای داده است.
وفات
زکریای رازی در تاریخ ۳۱ اکتبر ۹۲۵ میلادی، در همان شهر زادگاهش، ری، درگذشت. نویسندگانی مانند بیژن شهرامی معتقد بودند که احتمالاً محوطهٔ تاریخی برج طغرل آرامگاه وی است.
احمد ابوحمزه، ری‌شناس، محل دفن رازی را در امامزاده شعیب واقع در فیروزآباد شهرری می‌داند.


              Output:
              {{
    "name": "زکریای رازی",
    "sex": "مرد",
    "nick-names": [
      "پدر شیمی نوین",
      "رازس"
    ],
    "birth": {{
      "date": "27 اوت 865 میلادی",
      "location": {{
        "province": "ایران",
        "city": "ری",
        "coordinates": {{
          "latitude": "",
          "longitude": ""
        }}
      }}
    }},
    "death": {{
      "date": "31 اکتبر 925 میلادی",
      "location": {{
        "province": "ایران",
        "city": "ری",
        "coordinates": {{
          "latitude": "",
          "longitude": ""
        }}
      }},
      "tomb_location": {{
        "province": "ایران",
        "city": "ری",
        "coordinates": {{
          "latitude": "",
          "longitude": ""
        }}
      }}
    }},
    "era": "عباسیان",
    "occupation": [
      "پزشک",
      "فیلسوف",
      "شیمی‌دان"
    ],
    "works": [
      "الحاوی",
      "المنصوری",
      "کتاب الاسرار"
    ],
    "events": [
      {{
        "title": "ریاست بیمارستان معتضدی بغداد",
        "start_date": "حدود 900 میلادی",
        "end_date": "حدود 910 میلادی",
        "location": {{
          "province": "بغداد",
          "city": "بغداد",
          "coordinates": {{
            "latitude": "",
            "longitude": ""
          }}
        }},
        "related_people": [
          "خلیفه معتضد عباسی"
        ],
        "description": "رازی به عنوان رئیس بیمارستان معتضدی بغداد، به پیشرفت‌های مهمی در پزشکی دست یافت."
      }},
      {{
        "title": "کشف الکل و اسید سولفوریک",
        "start_date": "حدود 910 میلادی",
        "end_date": "حدود 920 میلادی",
        "location": {{
          "province": "ایران",
          "city": "ری",
          "coordinates": {{
            "latitude": "",
            "longitude": ""
          }}
        }},
        "related_people": [],
        "description": "رازی با انجام آزمایش‌های متعدد، موفق به کشف الکل و اسید سولفوریک شد."
      }}
    ],
    "image": {{
      "young": "",
      "adult": "",
      "tomb": ""
    }}
  }}


              Example 2:
              Text: \"{text}\"
              Output:
            """

    res = send_to_llm(prompt)
    # res = send_to_llama(prompt)
    res = res[res.find("{"):]
    print(res)
    return eval(res)


In [10]:
schema = {
  "name": "",
  "sex": "",
  "nick-names": [
    ""
  ],
  "birth": {
    "date": "",
    "location": {
      "province": "",
      "city": "",
      "coordinates": {
        "latitude": 0.0,
        "longitude": 0.0
      }
    }
  },
  "death": {
    "date": "",
    "location": {
      "province": "",
      "city": "",
      "coordinates": {
        "latitude": 0.0,
        "longitude": 0.0
      }
    },
    "tomb_location": {
      "province": "",
      "city": "",
      "coordinates": {
        "latitude": 0.0,
        "longitude": 0.0
      }
    }
  },
  "era": "",
  "occupation": [
    ""
  ],
  "works": [
    ""
  ],
  "events": [
    {
      "title": "",
      "start_date": "",
      "end_date": "",
      "location": {
        "province": "",
        "city": "",
        "coordinates": {
          "latitude": 0.0,
          "longitude": 0.0
        }
      },
      "related_people": [
        ""
      ],
      "description": ""
    }
  ],
  "image": {
    "young": "",
    "adult": "",
    "tomb": ""
  }
}

### 🗺️ Find location lat, lon


`get_coordinates(province, city)`
This function retrieves the latitude and longitude of a given city and province using the Nominatim API from OpenStreetMap. It constructs a query string with the city and province, makes a GET request to the API, and extracts the coordinates from the response if available.

- **Input**: `province` (str), `city` (str)
- **Output**: Tuple `(latitude, longitude)` or `(None, None)` if coordinates are not found.

`normalize_cities(city)`
This function normalizes a Persian place name into a standard Iranian city or province name by sending a prompt to the local Llama model. It is used when the city/province name is unclear or non-standard.

- **Input**: `city` (str)
- **Output**: Normalized city/province name (str).

`update_locations(data)`
This recursive function traverses a given data structure (list or dictionary) and fills in latitude and longitude under every "location" key that names a province or city. It first checks `lat_lon_archive`, then calls `get_coordinates`. If the direct lookup fails, it normalizes the place name with `normalize_cities` and retries with "Iran" as the province.

- **Input**: `data` (list or dict) containing location information.
- **Output**: None; the coordinates are written in place under each "location" key.


- `lat_lon_archive`: A cache used to store previously fetched coordinates to reduce redundant API calls.
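
The public Nominatim service asks clients to stay at or below one request per second, so caching and pacing both matter. The sketch below shows one way to combine them under stated assumptions: `make_cached_geocoder`, `min_interval`, and `fake_lookup` are hypothetical names, and the lookup is a stand-in that returns placeholder coordinates instead of calling the API.

```python
import time

def make_cached_geocoder(lookup, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap lookup(province, city) with a cache and a minimum delay between calls."""
    cache = {}
    last_call = [None]  # time of the previous real lookup

    def geocode(province, city):
        key = f"{province}/{city}"
        if key in cache:
            return cache[key]
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        result = lookup(province, city)
        if result != (None, None):
            cache[key] = result  # only successful lookups are cached
        return result

    return geocode

# Stand-in for a real geocoding call; the coordinates are placeholders.
requests_made = []

def fake_lookup(province, city):
    requests_made.append((province, city))
    return (1.0, 2.0) if city == "known" else (None, None)

geocode = make_cached_geocoder(fake_lookup, min_interval=0.0)
first = geocode("p", "known")
second = geocode("p", "known")    # served from the cache
missing = geocode("p", "unknown")
print(first, second, missing, len(requests_made))
```

In the notebook, `get_coordinates` could be passed as `lookup` with `min_interval=1.0`.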

In [None]:
import requests

GEO_API_URL = 'https://nominatim.openstreetmap.org/search'
lat_lon_archive = {}

In [None]:
def get_coordinates(province, city):
    """Get lat/lon from Nominatim (can swap with Google API)."""
    query = f"{city}, {province}"
    # Reuse the endpoint constant defined above instead of repeating the URL.
    res = requests.get(GEO_API_URL, params={
        "q": query,
        "format": "json"
    }, headers={
        # Nominatim asks for a User-Agent that identifies the application.
        "User-Agent": "MyApp (you@example.com)"
    }, timeout=10)

    if res.status_code == 200:
        data = res.json()
        if data:
            # Take the first result returned.
            return float(data[0]['lat']), float(data[0]['lon'])
    return None, None

In [15]:
get_coordinates("ایران","بلخ")

(35.7038211, 51.3997796)

In [None]:
import requests
import time  # for respecting rate limits

def normalize_cities(city):

    prompt = f"""
Normalize the following Persian place name into a standard Iranian province or city.
Only return the normalized name. If it's unclear, return the best guess. No explanation.

Examples:
"نزدیکی تهران" → "تهران"
"شمال بلوچستان و سند" → "بلوچستان"
"سواحل خلیج فارس" → "هرمزگان"
"غرب ایران" → "کرمانشاه"
"استان خراسان بزرگ" → "خراسان رضوی"

Now normalize:
"{city}" →
"""
    return send_to_llama(prompt)

def update_locations(data):
    """Recursively find and update all 'location' keys with lat/lon (in place)."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == "location" and isinstance(value, dict):
                province = value.get("province", "")
                city = value.get("city", "")
                if province or city:
                    cache_key = province + "/" + city
                    prev = lat_lon_archive.get(cache_key)
                    if prev:
                        lat, lon = prev
                    else:
                        lat, lon = get_coordinates(province, city)
                        if lat is None:
                            # Fall back to an LLM-normalized place name.
                            city = normalize_cities(f'{province}-{city}')
                            lat, lon = get_coordinates("Iran", city)
                        if lat is not None and lon is not None:
                            # Cache every successful lookup, not only fallbacks.
                            lat_lon_archive[cache_key] = [lat, lon]
                    if lat is not None and lon is not None:
                        value.setdefault("coordinates", {})
                        value["coordinates"]["latitude"] = lat
                        value["coordinates"]["longitude"] = lon
                        print(province, city, lon, lat)

            else:
                update_locations(value)
    elif isinstance(data, list):
        for item in data:
            update_locations(item)

### Preprocess

`preprocess(text)`
This function processes a given text string by performing two main operations:

1. **Remove references (e.g., [1], [2], etc.)**: 
   It uses a regular expression (`r'\[\d+\]'`) to match and remove any text that is enclosed within square brackets and contains one or more digits (commonly used for citations or references).

2. **Remove non-Persian characters**:
   It then uses another regular expression (`r"[^\u0600-\u06FF\u06F0-\u06F9\s،.؟]"`) to remove every character that is not in the Arabic Unicode block, whitespace, or one of the listed punctuation marks (Persian comma, period, and Persian question mark). The range `\u0600-\u06FF` covers Persian letters, and it already includes the Persian digits `\u06F0-\u06F9`, so the second range is redundant but harmless. Latin letters, ASCII digits, and the zero-width non-joiner (U+200C) are removed as well.
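
A quick check on a sample sentence shows what the two substitutions keep and drop. The function body below repeats the notebook's two regular expressions; the sample sentence and the `ZWNJ` constant are made up for illustration.

```python
import re

def preprocess(text):
    # Citation markers such as [1]; \d also matches Persian digits.
    text = re.sub(r'\[\d+\]', '', text)
    # Keep Arabic-block characters, whitespace, and the listed punctuation.
    text = re.sub(r"[^\u0600-\u06FF\u06F0-\u06F9\s،.؟]", "", text)
    return text

ZWNJ = "\u200c"  # zero-width non-joiner, common inside Persian words
sample = "ابن سینا[۱] (Avicenna) کتاب" + ZWNJ + "های بسیاری نوشت."
cleaned = preprocess(sample)
print(cleaned)
print(ZWNJ in cleaned)
```

The citation marker, the parenthesized Latin name, and the zero-width non-joiner are all removed. The last point matters for keyword matching: a keyword that is written with a non-joiner cannot match text that has passed through `preprocess`.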

In [18]:
import re

def preprocess(text):
  text = re.sub(r'\[\d+\]', '', text)
  text = re.sub(r"[^\u0600-\u06FF\u06F0-\u06F9\s،.؟]", "", text)
  return text

### Crawling Wikipedia

In [None]:
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


### 📄 Extract Scientists

*Scraping Iranian Scientists from Persian Wikipedia*

This script uses **Selenium**, a browser automation tool, to scrape a list of Iranian philosophers from a Persian Wikipedia page. It launches a headless (invisible) instance of Google Chrome, navigates to the target page, and searches specific sections of the HTML structure where the names are listed.

It collects all list items (`<li>` elements) that contain philosopher names, filters them to ensure the links lead to valid Wikipedia pages (ignoring non-existent ones), and extracts both the visible name and the corresponding link. The resulting data is stored in a list, where each entry contains the philosopher's name and their Wikipedia URL.

The browser is closed after the scraping is done. This approach is useful for extracting structured or semi-structured data from web pages that require JavaScript rendering, which simpler tools like `requests` and `BeautifulSoup` can't handle effectively.
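
The link-filtering rule described above (take the first link in each list item and skip links to non-existent pages) can be tried without a browser. The sketch below is a stand-in for the Selenium calls, not the notebook's code: it uses Python's standard `html.parser` on a made-up static HTML snippet, and `ListLinkParser` and `MISSING_PAGE_MARKER` are hypothetical names.

```python
from html.parser import HTMLParser

# Marker text found in the title attribute of links to missing pages.
MISSING_PAGE_MARKER = "صفحه وجود ندارد"

class ListLinkParser(HTMLParser):
    """Collect [title, href] for the first <a> inside each <li>."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.taken = False  # True once the current <li> has yielded a link
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li, self.taken = True, False
        elif tag == "a" and self.in_li and not self.taken:
            self.taken = True
            attr = dict(attrs)
            title = attr.get("title") or ""
            if MISSING_PAGE_MARKER not in title:
                self.links.append([title, attr.get("href")])

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

# Made-up snippet: one valid entry with two links, one missing-page entry.
sample_html = """
<ul>
  <li><a href="/wiki/A" title="ابن سینا">ابن سینا</a> <a href="/wiki/X" title="other">x</a></li>
  <li><a href="/wiki/B" title="فلان (صفحه وجود ندارد)">فلان</a></li>
</ul>
"""
parser = ListLinkParser()
parser.feed(sample_html)
print(parser.links)
```

Only the first link of the first item survives; the second item is dropped because its title carries the missing-page marker.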


In [20]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# def get_wiki_info(url):
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")


In [None]:
# List of Iranian philosophers (فهرست فیلسوفان ایرانی)

driver = webdriver.Chrome(options=chrome_options)
li_elements = []
scientists = []

url = 'https://fa.wikipedia.org/wiki/%D9%81%D9%87%D8%B1%D8%B3%D8%AA_%D9%81%DB%8C%D9%84%D8%B3%D9%88%D9%81%D8%A7%D9%86_%D8%A7%DB%8C%D8%B1%D8%A7%D9%86%DB%8C'
driver.get(url)

for i in [4,6,8]:
  li_elements += (driver.find_elements(
      By.XPATH,
      f'/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/div[{i}]//li'
  ))

for li in li_elements:
    try:
      a_tag = li.find_element(By.TAG_NAME, "a")
      title = a_tag.get_attribute("title")
      if "صفحه وجود ندارد" not in title:
        href = a_tag.get_attribute("href")
        scientists.append([title, href])
    except Exception:
      # Skip list items without a usable link.
      continue

driver.quit()


In [None]:
# List of contemporary Iranian scientists and engineers (فهرست دانشمندان و مهندسان معاصر ایران)

driver = webdriver.Chrome(options=chrome_options)
url = "https://fa.wikipedia.org/wiki/%D9%81%D9%87%D8%B1%D8%B3%D8%AA_%D8%AF%D8%A7%D9%86%D8%B4%D9%85%D9%86%D8%AF%D8%A7%D9%86_%D9%88_%D9%85%D9%87%D9%86%D8%AF%D8%B3%D8%A7%D9%86_%D9%85%D8%B9%D8%A7%D8%B5%D8%B1_%D8%A7%DB%8C%D8%B1%D8%A7%D9%86#%D8%AB"
driver.get(url)
scientists = []

# Wait for page to fully load
time.sleep(2)

# Get all <li> elements
li_elements = driver.find_elements(By.TAG_NAME, "li")

# Loop through each <li> and extract first <a> inside
for i, li in enumerate(li_elements[76:123]+li_elements[125:130]+li_elements[132:148]):
    try:
        a_tag = li.find_element(By.TAG_NAME, "a")
        href = a_tag.get_attribute("href")
        title = a_tag.get_attribute("title")
        scientists.append([title, href])
    except Exception as e:
        # print(e)
        continue

driver.quit()
len(scientists)

68

In [None]:
# List of Iranian scientists before the contemporary era (فهرست دانشمندان ایرانی پیش از دوران معاصر)

driver = webdriver.Chrome(options=chrome_options)
url = "https://fa.wikipedia.org/wiki/%D9%81%D9%87%D8%B1%D8%B3%D8%AA_%D8%AF%D8%A7%D9%86%D8%B4%D9%85%D9%86%D8%AF%D8%A7%D9%86_%D8%A7%DB%8C%D8%B1%D8%A7%D9%86%DB%8C_%D9%BE%DB%8C%D8%B4_%D8%A7%D8%B2_%D8%AF%D9%88%D8%B1%D8%A7%D9%86_%D9%85%D8%B9%D8%A7%D8%B5%D8%B1"
driver.get(url)
scientists = []

# Get all table rows (<tr> elements)
tr_elements = driver.find_elements(By.TAG_NAME, "tr")

# Loop through each row and extract the first <a> inside its second cell
for i, tr in enumerate(tr_elements):
    try:
        td = tr.find_elements(By.TAG_NAME, "td")[1]
        a_tag = td.find_element(By.TAG_NAME, "a")
        href = a_tag.get_attribute("href")
        title = a_tag.get_attribute("title")
        if "صفحه وجود ندارد" in title:
          continue
        print(i, title, href)
        scientists.append([title, href])
    except Exception as e:
        print(e)
        continue

driver.quit()
len(scientists)

0 زرتشت https://fa.wikipedia.org/wiki/%D8%B2%D8%B1%D8%AA%D8%B4%D8%AA
1 جاماسپ https://fa.wikipedia.org/wiki/%D8%AC%D8%A7%D9%85%D8%A7%D8%B3%D9%BE
2 جاماسپ (ساسانی) https://fa.wikipedia.org/wiki/%D8%AC%D8%A7%D9%85%D8%A7%D8%B3%D9%BE_(%D8%B3%D8%A7%D8%B3%D8%A7%D9%86%DB%8C)
3 هوشتانه https://fa.wikipedia.org/wiki/%D9%87%D9%88%D8%B4%D8%AA%D8%A7%D9%86%D9%87
4 آرتاخه https://fa.wikipedia.org/wiki/%D8%A2%D8%B1%D8%AA%D8%A7%D8%AE%D9%87
5 بوبراندا https://fa.wikipedia.org/wiki/%D8%A8%D9%88%D8%A8%D8%B1%D8%A7%D9%86%D8%AF%D8%A7
6 آزونکس https://fa.wikipedia.org/wiki/%D8%A2%D8%B2%D9%88%D9%86%DA%A9%D8%B3
7 مانی https://fa.wikipedia.org/wiki/%D9%85%D8%A7%D9%86%DB%8C
8 مولوی https://fa.wikipedia.org/wiki/%D9%85%D9%88%D9%84%D9%88%DB%8C
9 پولس ایرانی https://fa.wikipedia.org/wiki/%D9%BE%D9%88%D9%84%D8%B3_%D8%A7%DB%8C%D8%B1%D8%A7%D9%86%DB%8C
10 مزدک https://fa.wikipedia.org/wiki/%D9%85%D8%B2%D8%AF%DA%A9
11 برزویه پزشک https://fa.wikipedia.org/wiki/%D8%A8%D8%B1%D8%B2%D9%88%DB%8C%D9%87_%D9%BE%D8%B2%D8%B4%DA%A9

178

In [None]:
# List of distinguished professors at Iranian universities (فهرست استادان ممتاز در دانشگاه‌های ایران)

driver = webdriver.Chrome(options=chrome_options)
url = "https://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%D8%A7%D8%B3%D8%AA%D8%A7%D8%AF%D8%A7%D9%86_%D9%85%D9%85%D8%AA%D8%A7%D8%B2_%D8%AF%D8%B1_%D8%AF%D8%A7%D9%86%D8%B4%DA%AF%D8%A7%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%D8%A7%DB%8C%D8%B1%D8%A7%D9%86"
# url = "https://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%D8%A7%D8%B3%D8%AA%D8%A7%D8%AF%D8%A7%D9%86_%D8%AF%D8%A7%D9%86%D8%B4%DA%AF%D8%A7%D9%87_%D8%A7%D9%84%D8%B2%D9%87%D8%B1%D8%A7"
driver.get(url)
scientists = []

div_elements = driver.find_elements(By.CLASS_NAME, "mw-category-group")
i = 0

for div in div_elements:
  li_elements = div.find_elements(By.TAG_NAME, "li")
  for li in li_elements:
      try:
          a_tag = li.find_element(By.TAG_NAME, "a")
          href = a_tag.get_attribute("href")
          title = a_tag.get_attribute("title")
          if "صفحه وجود ندارد" in title:
            continue
          i += 1
          print(i, title, href)
          scientists.append([title, href])
      except Exception as e:
          print(e)
          continue

driver.quit()

1 مجیدرضا آیت‌اللهی https://fa.wikipedia.org/wiki/%D9%85%D8%AC%DB%8C%D8%AF%D8%B1%D8%B6%D8%A7_%D8%A2%DB%8C%D8%AA%E2%80%8C%D8%A7%D9%84%D9%84%D9%87%DB%8C
2 مهدی اخوان بهابادی https://fa.wikipedia.org/wiki/%D9%85%D9%87%D8%AF%DB%8C_%D8%A7%D8%AE%D9%88%D8%A7%D9%86_%D8%A8%D9%87%D8%A7%D8%A8%D8%A7%D8%AF%DB%8C
3 تقی اخوان نیاکی https://fa.wikipedia.org/wiki/%D8%AA%D9%82%DB%8C_%D8%A7%D8%AE%D9%88%D8%A7%D9%86_%D9%86%DB%8C%D8%A7%DA%A9%DB%8C
4 مجتبی ازهری https://fa.wikipedia.org/wiki/%D9%85%D8%AC%D8%AA%D8%A8%DB%8C_%D8%A7%D8%B2%D9%87%D8%B1%DB%8C
5 علیرضا اشرفی https://fa.wikipedia.org/wiki/%D8%B9%D9%84%DB%8C%D8%B1%D8%B6%D8%A7_%D8%A7%D8%B4%D8%B1%D9%81%DB%8C
6 غلامعلی افروز https://fa.wikipedia.org/wiki/%D8%BA%D9%84%D8%A7%D9%85%D8%B9%D9%84%DB%8C_%D8%A7%D9%81%D8%B1%D9%88%D8%B2
7 کرامت‌الله ایزدپناه https://fa.wikipedia.org/wiki/%DA%A9%D8%B1%D8%A7%D9%85%D8%AA%E2%80%8C%D8%A7%D9%84%D9%84%D9%87_%D8%A7%DB%8C%D8%B2%D8%AF%D9%BE%D9%86%D8%A7%D9%87
8 محمد برزگر جلالی https://fa.wikipedia.org/wiki/%D9%85%D8%AD%D9%8

In [52]:
len(scientists)

44

### Extracting and Enriching Iranian Scientists' Data from Wikipedia

This script automates the extraction of structured biographical data about Iranian scientists or philosophers from Persian Wikipedia and stores the results in a `.jsonl` file.

---

📂 Checking for Existing Entries

- The script starts by reading the existing JSONL file to build a list of names that have already been processed.
- It opens the file and reads each line as a JSON object, extracting the `name` field if present.
- This step helps avoid redundant processing.

---

🌐 Web Page Automation with Selenium

- A headless Chrome browser is initialized using Selenium.
- For each name and URL in the `scientists` list, the browser navigates to the Wikipedia page.
- If the name already exists in the previously read list, it skips to the next item.

---

📄 Extracting Paragraph Text

- The script extracts all `<p>` tags inside the content area of the page.
- It preprocesses these paragraphs using **`preprocess`**, which cleans the Persian text.
- It then filters and selects the most relevant content using **`select_top_k`** to summarize and reduce redundancy.

---

📋 Parsing Info Table

- The script attempts to locate the first HTML `<table>` on the page.
- It iterates over table rows to extract key-value pairs like birth dates and occupation.
- Keywords like "تولد", "پیشه", etc. help classify these values into the structured JSON.

**Functions used**:  
- **`preprocess`** (for cleaning text)  
- **`extract_json_from_text`** (for generating structured output)

---

🖼️ Extracting Image Data

- The script looks for a main image using a specific XPath.
- It also parses all `<figure>` elements on the page.
- Captions are checked for keywords like "مزار", "کودکی", etc. to classify the image as a tomb, childhood photo, or other types.
- Preprocessed captions are used as keys to store image URLs in the JSON.

---

🗺️ Adding Geolocation Data

- After extracting all data, the script calls **`update_locations`**.
- This function adds geographic coordinates (latitude and longitude) based on city/province information using external geolocation logic.

---

💾 Writing Output

- The final structured data is printed for review using **`prettyPrint`**.
- It is then appended to the existing `.jsonl` file.
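
The resume-and-append behaviour described above can be sketched in isolation. This is a minimal illustration under stated assumptions: `load_existing_names` and `append_record` are hypothetical helpers, the records are made up, and a temporary file stands in for the notebook's Drive path.

```python
import json
import os
import tempfile

def load_existing_names(path):
    """Return the set of 'name' values already stored in a JSONL file."""
    names = set()
    if not os.path.exists(path):
        return names
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting
            if "name" in obj:
                names.add(obj["name"])
    return names

def append_record(path, record):
    """Append one record as a single JSON line, keeping Persian text readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Demonstration on a temporary file with made-up records.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
for record in [{"name": "زرتشت"}, {"name": "مانی"}, {"name": "زرتشت"}]:
    if record["name"] not in load_existing_names(demo_path):
        append_record(demo_path, record)
append_record(demo_path, {"note": "no name field"})
print(sorted(load_existing_names(demo_path)))
```

Writing one JSON object per line means an interrupted run loses at most the record in progress, and the next run can skip every name that was already saved.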


In [None]:
file_path = "/content/drive/MyDrive/nlp_hw1_crawling/scientists_data.jsonl"

existing_names = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            obj = json.loads(line.strip())
            if 'name' in obj:
                existing_names.append(obj['name'])
        except Exception as e:
            print(e)
            print("can not load jsonl")
            break

In [54]:
len(existing_names)

290

In [55]:
names = ' '.join(existing_names)
print(names)

زرتشت بزرگمهر  مانی جاماسپ پاول پارسی آرتاخه برزویه میدیوماه فریدون بامشاد مزدک بامدادان ساتاسپ کرتیر آذرباد مهراسپندان نَرْسه مهرداد ششم اوستن فِرِیدون نرسه زکریای رازی سیما برزین ابوالعباس ایرانشهری ابونصر محمد فارابی فخرالدین رازی ابن سینا عمر خیام نیشابوری جلالالدین محمد بلخی ابوالحسن محمد بن یوسف عامری یحیی بن عدی محمد بن طاهر بن بهرام سجستانی سیستانی ابوحیان توحیدی ابوموسی جابر بن حیّان ابوریحان محمد بن احمد بیرونی ابن هیثم ابن مسکویه ابوجعفر محمد بن محمد بن حسن طوسی شمس الدین محمد بن محمود آملی محمد غزالی میر برهانالدّین محمّدباقر استرآبادی صدرالدین محمد بن ابراهیم قوام شیرازی ملا محسن فیض کاشانی شیخ حیدر آملی قطبالدین شیرازی شهابالدین سهروردی افضلالدین کاشانی هادی سبزواری میرزا جهانگیر خان قشقایی ملا عبدالله بهابادی محمد هیدجی محمد تقی شیخ شوشتری میرفندرسکی ابن ربن الطبری آرامش دوستدار غلامحسین ابراهیمی دینانی ابوالقاسم فنائی اسماعیل سعادت احسان طبری احسانالله نراقی تهرانی سیّد احمد فَردید انشاءالله رحمتی بابک احمدی زاده بیژن عبدالکریمی حسن رمضانی داریوش شایگان رامین جهانبگلو ر

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import os
import copy

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")


img_xpath = '/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[1]/tbody/tr[2]/td/span/a/img'
file_path = "/content/drive/MyDrive/nlp_hw1_crawling/scientists_data.jsonl"

existing_names = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            obj = json.loads(line.strip())
            if 'name' in obj:
                existing_names.append(obj['name'])
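        except (json.JSONDecodeError, TypeError):
            # skip lines that are not valid JSON objects
            print("could not parse a JSONL line; skipping it")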

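# Exact-match lookup: a substring test against the joined `names` string
# would also match fragments of longer names.
known_names = {n.strip() for n in existing_names}

index = 0
for i, (name, url) in enumerate(scientists[index:]):
  print("\n",index+i+1 ,"/", len(scientists))

  if name.strip() in known_names:
    print(name, " already exists!")
    continue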

  print("working on ", name)

  driver = webdriver.Chrome(options=chrome_options)
  driver.get(url)

  content_div = driver.find_element(By.ID, "mw-content-text")
  paragraphs = content_div.find_elements(By.TAG_NAME, "p")
  figures = driver.find_elements(By.TAG_NAME, "figure")
  print(len(paragraphs), len(figures))

  if len(paragraphs) < 2:
    driver.quit()  # release the browser before skipping this page
    continue

  table_data = []

  texts = [preprocess(p.text) for p in paragraphs if p.text.strip()]

  print("selecting...")
  print(len(texts))
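  refined_text = select_top_k(3, texts)
  if len(refined_text) > 3:
    # keep the first paragraph and the last two
    refined_text = refined_text[:1] + refined_text[-2:]
  # NOTE: table_data is still empty here; the infobox rows are only collected
  # after the LLM call below, so they do not reach the model.
  texts = table_data + refined_text
  print(table_data)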

  js = extract_json_from_text(' '.join(texts), schema)

  try:
    table = driver.find_element(By.TAG_NAME, 'table')
    rows = table.find_elements(By.TAG_NAME, 'tr')
    for row in rows:
        try:
          key_cell = row.find_element(By.TAG_NAME, 'th').text.strip().replace('\n', ',')
          value_cell = row.find_element(By.TAG_NAME, 'td').text.strip().replace('\n', ',')
          table_data.append(f" {key_cell}: {value_cell} ")
          if any(word in key_cell for word in ["زاده","متولد" ,"تولد" ]):
              js["birth"]["date"] = value_cell
          if any(word in key_cell for word in ["گرایش","شاخه" ,"تخصص" ,"پیشه"]):
              js["occupation"] = value_cell.split('،')
        except Exception:
          # rows without both a <th> and a <td> cell are skipped
          continue
  except Exception:
    print("table not found")

  try:
      img_element = driver.find_element(By.XPATH, img_xpath)
      img_src = img_element.get_attribute("src")
      js["image"]["adult"] = img_src
  except Exception:
      print("Image not found.")

  for fig in figures:
      try:
          img = fig.find_element(By.TAG_NAME, "a").get_attribute("href")
          caption = fig.find_element(By.TAG_NAME, "figcaption").text
          if img and caption:
            if any(word in caption for word in ["مزار","آرامگاه" ,"مقبره" ,"قبر"]):
              js["image"]["tomb"] = img
            elif any(word in caption for word in ["کودکی" ,"جوانی", "خردسالی","جوان"]):
              js["image"]["young"] = img
            else:
              js["image"][preprocess(caption)] = img
      except Exception:
          print("fig not found")

  driver.quit()

  update_locations(js)

  prettyPrint(js)

  with open(file_path, "a", encoding="utf-8") as f:
    f.write(json.dumps(js, ensure_ascii=False) + "\n")
    print(f"{name} added successfully!")



 1 / 44
working on  مجیدرضا آیت‌اللهی
1 0

 2 / 44
مهدی اخوان بهابادی  already exists!

 3 / 44
تقی اخوان نیاکی  already exists!

 4 / 44
مجتبی ازهری  already exists!

 5 / 44
علیرضا اشرفی  already exists!

 6 / 44
غلامعلی افروز  already exists!

 7 / 44
working on  کرامت‌الله ایزدپناه
3 0
selecting...
3
[]
Image not found.
{
  "name": "کرامت‌الله ایزدپناه",
  "sex": "",
  "nick-names": [
    ""
  ],
  "birth": {
    "date": "۱۳۲۶,فردوس",
    "location": {
      "province": "",
      "city": "",
      "coordinates": {
        "latitude": 0.0,
        "longitude": 0.0
      }
    }
  },
  "death": {
    "date": "",
    "location": {
      "province": "",
      "city": "",
      "coordinates": {
        "latitude": 0.0,
        "longitude": 0.0
      }
    },
    "tomb_location": {
      "province": "",
      "city": "",
      "coordinates": {
        "latitude": 0.0,
        "longitude": 0.0
      }
    }
  },
  "era": "معاصر",
  "occupation": [
    "پژوهشگر، نویسنده و استاد زبان و ادب

#### ✅ Summary

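This cell ties the pipeline together to build a dataset of Persian-language biographical records. Selenium fetches each page, the language model converts the selected paragraphs into the JSON schema, infobox rows override the birth date and occupation fields, figure captions decide which image slot a picture fills, and `update_locations` is called on the record before it is printed. Each finished record is appended to the JSONL file right away, so an interrupted crawl can resume without repeating names that are already stored. Because cleaning, selection, extraction, and geolocation live in separate functions, the loop stays modular and can be extended to larger crawls.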
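
The caption-based image classification in the crawl loop can be isolated as a pure function, which makes the rule easy to check without a browser. This is a sketch under assumptions: `image_key`, `TOMB_WORDS`, and `YOUNG_WORDS` are hypothetical names, the keyword lists are copied from the loop, and `caption.strip()` stands in for the notebook's `preprocess` helper.

```python
# Keyword lists copied from the crawl loop
TOMB_WORDS = ["مزار", "آرامگاه", "مقبره", "قبر"]
YOUNG_WORDS = ["کودکی", "جوانی", "خردسالی", "جوان"]

def image_key(caption):
    """Map a figure caption to the key used under js["image"]."""
    if any(word in caption for word in TOMB_WORDS):
        return "tomb"
    if any(word in caption for word in YOUNG_WORDS):
        return "young"
    # Fallback: use the cleaned caption itself as the key
    # (assumption: strip() replaces the notebook's preprocess()).
    return caption.strip()

for caption in ["آرامگاه ابن سینا در همدان", "تصویری از جوانی او", " نگاره "]:
    print(repr(image_key(caption)))
```

Tomb keywords are tested first, matching the order of the `if`/`elif` branches in the loop, so a caption that mentions both a tomb and youth is filed under `tomb`.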