# My School Examination Scraper

## Two-Layer Scraper Documentation

This document describes a two-layer scraper that extracts WAEC questions and their solutions from myschool.ng and saves them in JSON format.

## **Layer 1: URL Scraper**
The first layer scrapes all question and solution URLs and saves them into a CSV file.

### **Required Inputs:**
Specify the following parameters:

- `subject`: The subject you want to scrape (e.g., `mathematics`).
- `start_year`: The starting year for the scraping range (e.g., `1988`).
- `end_year`: The ending year for the scraping range (e.g., `2023`).
- `exam_type`: The type of exam (e.g., `waec`).
- `paper_type`: The type of paper (e.g., `obj`).
- `total_pages`: The number of listing pages to request for each year in the range (e.g., `20`).
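
To make the role of each parameter concrete, the sketch below builds the listing-page URLs the same way the Layer 1 script does. The URL pattern is copied from that script; the helper names `build_listing_url` and `listing_urls` are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://myschool.ng/classroom"  # host and path used by the Layer 1 script


def build_listing_url(subject, exam_type, exam_year, paper_type, page):
    """Return the listing-page URL for one subject, year, and page (hypothetical helper)."""
    query = urlencode({
        "exam_type": exam_type,
        "exam_year": exam_year,
        "type": paper_type,
        "topic": "",  # left empty, as in the Layer 1 f-string
        "page": page,
    })
    return f"{BASE_URL}/{subject}?{query}"


def listing_urls(subject, start_year, end_year, exam_type, paper_type, total_pages):
    """Yield every listing URL in the year range; end_year is inclusive."""
    for year in range(start_year, end_year + 1):
        for page in range(1, total_pages + 1):
            yield build_listing_url(subject, exam_type, year, paper_type, page)


urls = list(listing_urls("mathematics", 2022, 2023, "waec", "obj", 20))
print(len(urls))  # years x pages per year
print(urls[0])
```

Because `total_pages` applies to every year, the number of requests grows as `(end_year - start_year + 1) * total_pages`.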

### **Output:**
- A CSV file containing the following columns:
  - `id`: The question's serial number as shown on the listing page.
  - `question_url`: The URL linking to the question and solution.
  - `exam_year`: The exam year shown next to the question.
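
A quick sanity check on the CSV can catch problems before Layer 2 runs. The following is a minimal sketch using only the standard library. The function name `check_url_csv` is hypothetical, the sample rows are placeholders rather than real scraped data, and the check that URLs start with `http` is an assumption about how the site writes its links.

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "question_url"}


def check_url_csv(handle):
    """Return (row_count, problems) for a Layer 1 CSV opened as a text file."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return 0, [f"missing columns: {sorted(missing)}"]
    problems = []
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        count += 1
        url = row["question_url"] or ""
        if not url.startswith("http"):  # assumption: links are absolute URLs
            problems.append(f"line {line_no}: unexpected URL {url!r}")
    return count, problems


# Placeholder rows for illustration only; these are not real scraped URLs.
sample = io.StringIO(
    "id,question_url,exam_year\n"
    "1,https://example.com/q/1,2022\n"
    "2,,2022\n"
)
count, problems = check_url_csv(sample)
print(count, problems)
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.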

### **Code Usage:**
Set the parameters in the script's `if __name__ == "__main__":` block and run it. The script writes the extracted URLs to `{subject}_{exam_type}_{paper_type}_{start_year}-{end_year}_url.csv` in the working directory.

---

## **Layer 2: Question and Solution Extractor**
The second layer takes the CSV file generated by the first layer and extracts detailed question and solution information.

### **Required Input:**
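- The file path to the CSV generated by Layer 1. Keep the file name that Layer 1 produced: Layer 2 parses the subject, exam type, paper type, and year range from it to name the JSON output.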

### **Output:**
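- A JSON file with the extracted data for each question. Each entry contains:
  - `id`: The question's serial number, carried over from the CSV.
  - `subject_year`: A label built from the subject, the value of the URL's `type` parameter, and the exam year, all parsed from the question URL.
  - `topic_year`: Reserved for additional topic information; the current code leaves it empty.
  - `question`: The question text.
  - `question_diagram`: URL of the diagram or image attached to the question, or `null` if there is none.
  - `correct_answer`: A one-element list holding the text of the correct option.
  - `incorrect_answers`: A list holding the text of the remaining options.
  - `explanation`: The explanation or solution for the question, prefixed with `Hint: `.
  - `url`: The URL of the question page.

The `options` dictionary (e.g., `{ "A": "1.75", "B": "2" }`) is built internally but is commented out of the output in the current code.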
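
The shape of one entry can be checked with a few lines of Python. The key list below is taken from the dictionary that the Layer 2 function `extract_question_data` returns; the helper `empty_fields` is hypothetical, and the record values are placeholders, not scraped data.

```python
# Keys returned by extract_question_data in the Layer 2 code
EXPECTED_KEYS = (
    "id", "subject_year", "topic_year", "question", "question_diagram",
    "correct_answer", "incorrect_answers", "explanation", "url",
)


def empty_fields(record):
    """Return the expected keys that are absent or hold no value in a record."""
    return [key for key in EXPECTED_KEYS if record.get(key) in (None, "", [])]


# Placeholder record for illustration only
record = {
    "id": 1,
    "subject_year": "<subject> <type> <year>",
    "topic_year": "",
    "question": "<question text>",
    "question_diagram": None,
    "correct_answer": ["<text of the correct option>"],
    "incorrect_answers": ["<option>", "<option>", "<option>"],
    "explanation": "Hint: <worked solution>",
    "url": "<question page URL>",
}
print(empty_fields(record))  # ['topic_year', 'question_diagram']
```

Entries where `question` or `correct_answer` come back empty are worth inspecting, since the extractor falls back to empty defaults when a page cannot be parsed.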

### **Code Usage:**
Set `csv_file` in `main()` to the path of the Layer 1 CSV and run the script. It fetches the URLs concurrently and writes one indented JSON object per question to `{subject}_{exam_type}_{paper_type}_{start_year}-{end_year}.json`. The file is written as UTF-8 with `ensure_ascii=False`, so LaTeX markup and Unicode characters are stored unescaped.
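
The effect of `ensure_ascii=False`, which the Layer 2 function `save_to_pretty_json_row_by_row` passes to `json.dump`, can be seen in a small sketch. The record below is an illustrative placeholder containing a LaTeX fragment and one non-ASCII character.

```python
import json

# Illustrative record: a LaTeX fragment plus a non-ASCII degree sign
record = {"question": r"Evaluate \(\frac{1}{2}\) of 90°"}

escaped = json.dumps(record)                        # default: non-ASCII becomes \uXXXX
readable = json.dumps(record, ensure_ascii=False)   # keeps the degree sign as-is

print(escaped)
print(readable)

# Both forms decode to the same data; only the on-disk text differs
assert json.loads(escaped) == json.loads(readable) == record
```

In both forms the LaTeX backslashes are doubled in the file, as JSON requires, and are restored when the file is parsed.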

---

## **General Workflow:**
1. **Layer 1:**
   - Configure parameters (subject, start and end years, etc.).
   - Run the URL scraper script.
   - Verify that the CSV file is correctly generated.

2. **Layer 2:**
   - Provide the CSV file path as input.
   - Run the question and solution extractor script.
   - Verify that the JSON file is correctly generated and formatted.
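
The two layers are linked only by the CSV file name. The sketch below mirrors how Layer 1 names its output and how Layer 2 splits that name on `_` to derive the JSON name. The function names are hypothetical; the naming patterns themselves are taken from the two scripts.

```python
from pathlib import PurePosixPath


def layer1_csv_name(subject, exam_type, paper_type, start_year, end_year):
    """File name pattern used by Layer 1 when saving the URL list."""
    return f"{subject}_{exam_type}_{paper_type}_{start_year}-{end_year}_url.csv"


def layer2_json_name(csv_path):
    """Derive the JSON file name the way Layer 2 does, with an explicit check."""
    parts = PurePosixPath(csv_path).name.split("_")
    if len(parts) < 4 or "-" not in parts[3]:
        raise ValueError(f"unexpected file name: {csv_path!r}")
    subject, exam_type, paper_type = parts[:3]
    start_year, end_year = parts[3].split(".")[0].split("-")
    return f"{subject}_{exam_type}_{paper_type}_{start_year}-{end_year}.json"


csv_name = layer1_csv_name("mathematics", "waec", "obj", 2022, 2023)
print(csv_name)
print(layer2_json_name("/content/" + csv_name))
```

Because the name is split on `_`, a subject that itself contains an underscore would be parsed incorrectly, so renaming the CSV between layers is best avoided.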

---

## **Best Practices:**
- Ensure a stable internet connection to avoid request timeouts.
- Verify the output CSV after Layer 1 before proceeding to Layer 2.
- For large datasets, consider breaking the scraping process into smaller batches to reduce the risk of errors or interruptions.
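
The batching and error-handling advice above can be sketched with two small helpers. Neither is part of the original scripts: `batched` and `with_retries` are hypothetical names, and the exponential backoff schedule is one common choice rather than a requirement of the site.

```python
import time


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


urls = [f"url-{n}" for n in range(45)]  # stand-ins for real question URLs
print([len(batch) for batch in batched(urls, 20)])  # [20, 20, 5]
```

Each batch could be passed to the Layer 2 function `scrape_urls_in_batches` and saved separately, so an interruption costs at most one batch of work.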

---

### **Example Input for Layer 1:**
```python
subject = 'mathematics'
start_year = 1988
end_year = 2023
exam_type = 'waec'
paper_type = 'obj'
total_pages = 20
```

### **Example Input for Layer 2:**
```python
csv_file = 'mathematics_waec_obj_1988-2023_url.csv'
```

## Layer 1

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import time
import requests
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
import json

# Function to fetch and parse a single page
def fetch_subject_page(subject, exam_year, exam_type, paper_type, page):
    """
    Fetches the content of a single page and parses it with BeautifulSoup.
    """
    url = f'https://myschool.ng/classroom/{subject}?exam_type={exam_type}&exam_year={exam_year}&type={paper_type}&topic=&page={page}'
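    try:
        # A timeout keeps a stalled request from blocking a worker thread indefinitely
        response = requests.get(url, timeout=10)
    except requests.RequestException as e:
        print(f"Error fetching page {page} for year {exam_year}: {e}")
        return None, False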

    # Check if the page was fetched successfully
    if response.status_code != 200:
        print(f"Failed to fetch page {page} for year {exam_year}. Status code: {response.status_code}")
        return None, False

    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup, True

# Function to extract data from a page
def extract_data(soup):
    """
    Extracts question serial numbers, question/solution URLs, and exam years from the parsed page content.
    """
    # Extract exam years from the page
    exam_year_elements = soup.find_all(class_='ml-2 badge bg-success text-light')
    exam_year_texts = [element.text.strip() for element in exam_year_elements]

    # Extract solution URLs for the questions
    solution_elements = soup.find_all(class_='btn btn-sm btn-outline-danger')
    links = [link['href'] for link in solution_elements]

    # Extract the Serial Numbers of the Questions
    sn_elements = soup.find_all(class_='question_sn bg-danger mr-3')
    sn_texts = [element.text.strip() for element in sn_elements]

    # Return all extracted data in a dictionary
    return {
        "id": sn_texts,
        'question_url': links,
        'exam_year': exam_year_texts,
    }

# Function to scrape a single page and extract data
def scrape_page(subject, exam_year, exam_type, paper_type, page):
    """
    Scrapes a single page and extracts the data.
    """
    soup, success = fetch_subject_page(subject, exam_year, exam_type, paper_type, page)
    if success and soup:
        return extract_data(soup)
    return None

# Main scraping function using ThreadPoolExecutor for parallel processing
def scrape_all_pages_parallel(subject, exam_year, exam_type, paper_type, total_pages=20):
    """
    Scrapes all pages of data for a given subject, exam year, and exam type in parallel.
    """
    # Initialize an empty dictionary to store the scraped data
    all_data = {
        "id": [],
        'question_url': [],
        'exam_year': []
    }

    # Using ThreadPoolExecutor to scrape pages in parallel
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        # Submit each page to be scraped in parallel
        for page in range(1, total_pages + 1):
            futures.append(executor.submit(scrape_page, subject, exam_year, exam_type, paper_type, page))

        # Collect the results as they complete
        for future in as_completed(futures):
            data = future.result()
            if data:
                all_data['id'].extend(data['id'])
                all_data['question_url'].extend(data['question_url'])
                all_data['exam_year'].extend(data['exam_year'])

    # Convert the collected data to a DataFrame
    return pd.DataFrame(all_data)

# Loop through the years from start_year to end_year and scrape data for each year
def scrape_for_years(subject, start_year, end_year, exam_type, paper_type, total_pages=20):
    """
    Scrapes data for multiple years in the specified range.
    """
    all_exam_data = pd.DataFrame()

    # Loop through each year in the specified range
    for exam_year in range(start_year, end_year + 1):
        print(f"Scraping data for year: {exam_year}...")
        # Scrape the data for the current year
        examination_df = scrape_all_pages_parallel(subject, exam_year, exam_type, paper_type, total_pages)
        # Append the data to the combined DataFrame
        all_exam_data = pd.concat([all_exam_data, examination_df], ignore_index=True)

    return all_exam_data

# Run the scraper and measure execution time
if __name__ == "__main__":
    subject = 'mathematics'
    start_year = 2022
    end_year = 2023
    exam_type = 'waec'
    paper_type = 'obj'
    total_pages = 20  # Listing pages requested per year; raise or lower to scrape more or fewer pages

    # Start the timer to measure the execution time
    start_time = datetime.now()  # Start timer

    # Scrape data for all years in the specified range
    all_examination_df = scrape_for_years(subject, start_year, end_year, exam_type, paper_type, total_pages)

    # End the timer and calculate the elapsed time
    end_time = datetime.now()  # End timer
    elapsed_time = end_time - start_time

    # Report the runtime
    print(f"Scraping completed in {elapsed_time}.")

    # Save the scraped data to a CSV file
    all_examination_df.to_csv(f'{subject}_{exam_type}_{paper_type}_{start_year}-{end_year}_url.csv', index=False)
    print("URL scraped and saved successfully.")


Scraping data for year: 2022...
Scraping data for year: 2023...
Scraping completed in 0:00:05.247603.
URL scraped and saved successfully.
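
Layer 1 collects results with `as_completed`, which yields futures in the order they finish, so rows in the CSV are not guaranteed to follow page order. If page order matters, `ThreadPoolExecutor.map` returns results in submission order instead. The sketch below demonstrates this with a stand-in function, `fake_scrape_page`, which is hypothetical and makes no network requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scrape_page(page):
    """Stand-in for scrape_page: later pages finish sooner than earlier ones."""
    time.sleep(0.01 * (5 - page))
    return {"id": [page]}


with ThreadPoolExecutor(max_workers=5) as executor:
    # map() yields results in the order the pages were submitted
    results = list(executor.map(fake_scrape_page, range(1, 6)))

ordered_ids = [result["id"][0] for result in results]
print(ordered_ids)  # [1, 2, 3, 4, 5]
```

An alternative that keeps `as_completed` is to sort the final DataFrame by `exam_year` and `id` before saving.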


## Layer 2

In [None]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed
import json
import re
from urllib.parse import urljoin

def fetch_question_page(url):
    """
    Fetches and parses the page content for a given URL.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
            return None, False
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup, True
    except requests.RequestException as e:
        print(f"Error fetching URL {url}: {e}")
        return None, False

def extract_question_data(id, soup, base_url):
    """
    Extracts question data from the given BeautifulSoup object.
    """
    # Initialize default response dictionary
    default_response = {
        'id': id,
        'subject_year': None,
        'topic_year': None,
        'question': None,
        'question_diagram': None,
        # 'options': {},
        'correct_answer': None,
        'incorrect_answers': [],
        'explanation': None,
        'url': base_url
    }

    if not soup:
        return default_response

    try:
        # Extract question text
        parent_div = soup.find('div', class_='question-desc mb-3')
        question_text = None
        if parent_div:
            p_tags = parent_div.find_all('p')
            question_text = ' '.join([p.get_text(strip=True) for p in p_tags]) if p_tags else None

        # Extract question image URL
        img_tag = parent_div.find('img', class_='img-fluid') if parent_div else None
        img_url = urljoin(base_url, img_tag['src']) if img_tag and 'src' in img_tag.attrs else None

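        # Extract the correct option letter, e.g. "B" from text containing "Option B".
        # str.strip('Option') removes a set of characters rather than the word, so a regex is used.
        correct_answer_tag = soup.find_all(class_="text-success mb-3")
        answer_match = re.search(r'Option\s*([A-Za-z])', correct_answer_tag[0].get_text()) if correct_answer_tag else None
        correct_answer_key = answer_match.group(1).upper() if answer_match else None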

        # Extract options
        options_dict = {}
        options_container = soup.find('ul', class_='list-unstyled')
        if options_container:
            options = options_container.find_all('li')
            for option in options:
                key = option.find('strong')
                # Split only on the first "." so decimal values such as "1.75" stay intact
                value = option.text.split(".", 1)[-1].strip() if option.text else None
                if key and value:
                    key_text = key.text.strip('.').strip()
                    options_dict[key_text] = value

        # Calculate incorrect answers
        incorrect_answers = []
        if correct_answer_key and options_dict:
            correct_answer = options_dict.get(correct_answer_key)
            incorrect_answers = [value for key, value in options_dict.items()
                               if key != correct_answer_key]

        # Extract explanation
        explanation_texts = []
        explanation_tags = soup.find_all(class_="mb-4")
        for tag in explanation_tags:
            text = tag.get_text(strip=True)
            if "Contributions" in text:
                break
            explanation_texts.append(text)
        explanation_content = ' '.join(explanation_texts).split("Explanation", 1)[-1].strip() if "Explanation" in ' '.join(explanation_texts) else None

        # Extract subject and year from URL
        match = re.search(r'/classroom/([^/]+).*exam_year=(\d+).*type=([^&]+)', base_url)
        subject_year = f"{match.group(1).capitalize()} {match.group(3).capitalize()} {match.group(2)}" if match else None

        return {
            "id": id,
            "subject_year": subject_year,
            "topic_year": "",
            "question": question_text,
            "question_diagram": img_url,
            # "options": options_dict,
            "correct_answer": [options_dict.get(correct_answer_key)] if correct_answer_key else None,
            "incorrect_answers": incorrect_answers,
            "explanation": f"Hint: {explanation_content}" if explanation_content else None,
            "url": base_url
        }
    except Exception as e:
        print(f"Error extracting data for ID {id}: {e}")
        return default_response

def scrape_urls_in_batches(id_list, urls, batch_size=20):
    """
    Fetches the given URLs concurrently, using up to batch_size worker threads,
    and returns a DataFrame of results. URLs that fail to fetch are skipped.
    """
    results = []

    with ThreadPoolExecutor(max_workers=batch_size) as executor:
        future_to_url = {executor.submit(fetch_question_page, url): (id_list[i], url)
                        for i, url in enumerate(urls)}

        for future in as_completed(future_to_url):
            id_, url = future_to_url[future]
            try:
                soup, success = future.result()
                data = extract_question_data(id_, soup, url) if success else None
                if data:
                    results.append(data)
            except Exception as e:
                print(f"Error processing URL {url}: {e}")
                results.append({
                    'id': id_,
                    'subject_year': None,
                    'topic_year': None,
                    'question': None,
                    'question_diagram': None,
                    # 'options': {},
                    'correct_answer': None,
                    'incorrect_answers': [],
                    'explanation': None,
                    'url': url
                })

    return pd.DataFrame(results)

def save_to_pretty_json_row_by_row(df, filename):
    """
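    Writes each row of the DataFrame as an indented JSON object, one after another,
    to a UTF-8 file. ensure_ascii=False keeps LaTeX markup and Unicode characters
    unescaped. The result is a sequence of JSON objects, not a single JSON array.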
    """
    with open(filename, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            json.dump(row.to_dict(), f, indent=4, ensure_ascii=False)
            f.write('\n')

def main():
    """
    Main execution function with proper error handling and configuration.
    """
    try:
        # Configuration
        csv_file = '/content/mathematics_waec_obj_2022-2023_url.csv'
        batch_size = 20

        # Extract metadata from CSV filename
        filename_parts = csv_file.split('/')[-1].split('_')
        subject = filename_parts[0]
        exam_type = filename_parts[1]
        paper_type = filename_parts[2]
        years = filename_parts[3].split('.')[0]
        start_year, end_year = years.split('-')

        # Load URLs from the CSV file
        question_url_df = pd.read_csv(csv_file)
        urls = question_url_df['question_url'].tolist()
        ids = question_url_df['id'].tolist()

        start_time = datetime.now()
        print(f"Starting scraping at {start_time}")

        # Scrape all data
        all_examination_df = scrape_urls_in_batches(ids, urls, batch_size=batch_size)

        end_time = datetime.now()
        elapsed_time = end_time - start_time
        print(f"Scraping completed in {elapsed_time}.")
        print(f"Total questions scraped: {len(all_examination_df)}")

        # Save data to JSON
        output_filename = f'{subject}_{exam_type}_{paper_type}_{start_year}-{end_year}.json'
        save_to_pretty_json_row_by_row(all_examination_df, output_filename)
        print(f"Data successfully saved to {output_filename}")

    except Exception as e:
        print(f"An error occurred during execution: {e}")

if __name__ == "__main__":
    main()

Starting scraping at 2025-01-01 23:13:02.903776
Scraping completed in 0:00:07.141608.
Total questions scraped: 97
Data successfully saved to mathematics_waec_obj_2022-2023.json
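
Because Layer 2 writes one indented JSON object after another, the output file cannot be loaded with a single `json.load` call. A minimal sketch of a reader is shown below. It relies on the standard `json.JSONDecoder.raw_decode` method; the function name `iter_json_objects` is hypothetical and the sample records are placeholders.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Build a sample in the same layout Layer 2 produces (placeholder records)
sample = "".join(
    json.dumps(row, indent=4, ensure_ascii=False) + "\n"
    for row in [{"id": 1}, {"id": 2}]
)
records = list(iter_json_objects(sample))
print(records)  # [{'id': 1}, {'id': 2}]
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the text to `iter_json_objects`.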
