## Comprehensive Texas Academic Performance Report (TAPR) Data Scraper
This scraper allows users to select the level (Campus, District, Region, State) and type of data they would like to download from the TAPR data download on the TEA website. If the level is "D" for District, district type data will also be downloaded in addition to the TAPR data unless the user has indicated they do not want the data (set dist_type = False). 

If the files already exist, the scraper will not download new files. 

The scraper creates separate folders for each year of data and names the files with the appropriate year. 

Import Libraries

In [28]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import os
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

District Type Scraper (Helper Function)

In [29]:
def district_type_scraper(year):
   """
   Scrapes the Texas Education Agency (TEA) website for district type data of a given school year.

   Args:
      year (int): The ending year of the school year (e.g., 2024 for the 2023-24 school year).

   Returns:
      pd.DataFrame: A DataFrame containing data from the specified school year's district type Excel file.

    Raises:
        requests.exceptions.RequestException: If there's an issue fetching the webpage.
        ValueError: If no Excel file is found on the page.   
   """
   school_year = str(year-1)+"-"+str(year-2000) # Finds the academic school year 
   #Web scraping academic year for district data from website 
   url = f'https://tea.texas.gov/reports-and-data/school-data/district-type-data-search/district-type-{school_year}'
   grab = requests.get(url)
   soup = BeautifulSoup(grab.text, 'html.parser')
   xlsx = []
   for link in soup.find_all("a"):
      data = str(link.get('href'))
      if re.search(".xlsx$", data):
         return pd.read_excel(f"https://tea.texas.gov{data}", sheet_name= 2)

Detects if files download within a given time limit, otherwise timesout (Helper Function)

In [30]:
def wait_for_downloads(variables, year, directory, timeout=200):
    """
    Waits for the expected data files to be downloaded within a specified timeout period.

    Args:
        variables (list of str): A list of variable names that determine expected file names.
        year (int): The year of the dataset, affecting file format (.dat for <2021, .csv for >=2021).
        directory (str): The directory where the files are expected to be downloaded.
        timeout (int, optional): Maximum time in seconds to wait for all files to be downloaded. Defaults to 200.

    Returns:
        bool: True if all expected files are downloaded within the timeout period, False otherwise.

    Raises:
        FileNotFoundError: If the specified directory does not exist.

    Notes:
        - The function checks for `.crdownload` files to ensure downloads are complete.
        - It waits in 5-second intervals before checking again.
        - If downloads complete within the timeout, a success message is printed.
    """
    start_time = time.time()  # Record the start time
    expected_files = []  # List to store expected file names based on year and variables

    # Determine expected file names based on the year
    for var in variables:
        if year < 2021:
            expected_files.append(f"DIST{var}.dat" if var != "REF" else "DREF.dat")
        else:
            expected_files.append(f"DIST{var}.csv" if var != "REF" else "DREF.csv")

    check = 1  # Variable to print waiting message only once
    while time.time() - start_time < timeout:  # Continue checking until timeout is reached
        downloaded_files = os.listdir(directory)  # Get the list of files in the directory

        # Check if all expected files are present and not still downloading (.crdownload files)
        if all(file in downloaded_files and not file.endswith(".crdownload") for file in expected_files):
            print(f"All downloads for {year} completed successfully.\n")
            return True  # Return True if all files are found and fully downloaded
        
        # Print waiting message only once at the start
        if check == 1:
            print("Waiting for all files to download...")
        check += 1

        time.sleep(5)  # Wait for 5 seconds before checking again

    return False  # Return False if the timeout is reached before all files are downloaded


Renames files to include year in file name (Helper Function)

In [31]:
def file_renamer(directory, year, prefix, var, level):
    """
    Renames downloaded files in the specified directory based on naming conventions.

    Args:
        directory (str): The path to the directory containing the files.
        year (int): The year to be appended to the renamed files.
        prefix (str): The prefix used in some file names (e.g., 'DIST').
        var (str): The variable name (e.g., 'POP', 'ECON', 'REF').
        level (str): The level identifier (some files may use this instead of the prefix).

    Returns:
        None: The function renames files in place and does not return a value.

    Notes:
        - The function checks for `.csv` and `.dat` file extensions.
        - It looks for two possible naming patterns:
            1. `{prefix}{var}{ext}` (e.g., `DISTPOP.csv`)
            2. `{level}{var}{ext}` (e.g., `STATEPOP.dat`)
        - If a match is found, the file is renamed to:
            - `{level}{var}_{year}{ext}` for "REF" files.
            - `{prefix}{var}_{year}{ext}` for all other files.
        - The function **only renames the first matching file** and stops checking further.
    """
    for ext in ['.csv', '.dat']:  # Check both CSV and DAT file formats
        old_patterns = [
            f"{prefix}{var}{ext}",  # Pattern with prefix
            f"{level}{var}{ext}"     # Pattern with level (some files may not have prefix)
        ]

        for old_pattern in old_patterns:
            old_name = os.path.join(directory, old_pattern)  # Full path of the old file
            if os.path.exists(old_name):  # Check if file exists
                # Determine new name format
                if var == "REF":
                    new_name = os.path.join(directory, f"{level}{var}_{year}{ext}")
                else:
                    new_name = os.path.join(directory, f"{prefix}{var}_{year}{ext}")

                os.rename(old_name, new_name)  # Rename the file
                break  # Stop checking after renaming the first matching file


Converts .dat files to .csv files automatically (Helper Function)

In [32]:
#Helper function: Converts .dat files to .csv files 
def convert_dat_to_csv(directory):
    """
    Converts all .dat files in the specified directory to .csv files.
    
    Parameters:
        directory (str): Path to the directory containing .dat files.
    """
    if not os.path.exists(directory):
        print(f"Directory '{directory}' does not exist.")
        return

    # Iterate through files in the directory
    for file_name in os.listdir(directory):
        if file_name.endswith(".dat"):
            dat_file_path = os.path.join(directory, file_name)
            csv_file_path = os.path.join(directory, file_name.replace(".dat", ".csv"))

            try:
                # Read the .dat file with automatic delimiter detection
                df = pd.read_csv(dat_file_path, delimiter=None, engine='python')

                # Save as .csv
                df.to_csv(csv_file_path, index=False)
                print(f"Converted: {file_name} -> {csv_file_path}")

            except Exception as e:
                print(f"Error converting {file_name}: {e}")



### Scrape Data from TAPR Advanced Download (Master Function)

In [None]:
def tea_scraper(years, variables, level, dist_type = True):
    """
    Scrape all HERC data for specified years, variables, and level of data.
    
    Parameters:
        years (list): List of years to scrape data for (formatted YYYY)
        variables (list): List of variable codes to download (such as "GRAD")
        level (str): Administrative level to scrape. Options:
            'C' for Campus
            'D' for District
            'R' for Region
            'S' for State

    Returns: 
        Specified files stored in folders located in users current directory. 
    """
    directory_path_name = os.getcwd()
    # Validation for level parameter
    valid_levels = {
        'C': 'Campus',
        'D': 'District',
        'R': 'Region',
        'S': 'State'
    }
    
    if level not in valid_levels:
        raise ValueError(f"Invalid level. Must be one of: {', '.join(valid_levels.keys())}")
    
    # Create prefix for filenames based on level
    file_prefix = {
        'C': 'CAMP',
        'D': 'DIST',
        'R': 'REGN',
        'S': 'STATE'
    }[level]
    
    for year in years:
        ### TAPR DATA DOWNLOAD ###
        # Create full path for year directory
        dir_name = f"raw_data{year}"
        full_dir_path = os.path.join(directory_path_name, dir_name)
        os.makedirs(full_dir_path, exist_ok=True)
        
        # Configure Chrome options
        chrome_options = webdriver.ChromeOptions()
        absolute_download_path = os.path.abspath(full_dir_path)
        
        # Add additional Chrome preferences to prevent download prompts
        prefs = {
            "download.default_directory": absolute_download_path,
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True,
            "profile.default_content_settings.popups": 0
        }
        chrome_options.add_experimental_option("prefs", prefs)
        
        # Add additional Chrome arguments
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        
        driver = webdriver.Chrome(options=chrome_options)
        driver.get(f"https://rptsvr1.tea.texas.gov/perfreport/tapr/{year}/download/DownloadData.html")
        
        # Select appropriate level
        level_select = driver.find_element(By.XPATH, f"//input[@type='radio' and @name='sumlev' and @value='{level}']")
        level_select.click()
        
        unavailable = []
        print(f"Downloading {valid_levels[level]} Level TAPR Data for {year}...")
        
        for var in variables:
            print(f"Checking for {file_prefix}{var} data...")
            # Updated file patterns to include level prefix and year
            file_patterns = [
                f"{file_prefix}{var}_{year}.csv",
                f"{file_prefix}{var}_{year}.dat",
                f"{level}{var}_{year}.dat",  # Some files might not include the prefix
                f"{level}{var}_{year}.csv"
            ]
            
            if any(os.path.isfile(os.path.join(full_dir_path, file)) for file in file_patterns):
                print(f"{var}_{year} already exists")
                unavailable.append(var)
                continue
                
            try:
                select_data = driver.find_element(By.XPATH, f"//input[@type='radio' and @name='setpick' and @value='{var}']")
                select_data.click()
                
                # Add a small delay after clicking the radio button
                time.sleep(1)
                
                download = driver.find_element(By.XPATH, "//input[@type='submit' and @value='Continue']")
                download.click()
                print(f"Downloaded {level if var == 'REF' else file_prefix}{var} for {year}")
                
            except NoSuchElementException:
                print(f"{var} not found for {year}")
                unavailable.append(var)
                continue
        
        available_vars = set(variables) - set(unavailable)
        # do not shut down driver until time-out occurs or all available files have finished downloading
        if wait_for_downloads(variables = available_vars, year = year, directory = full_dir_path):
            for a_var in available_vars:
                file_renamer(directory = full_dir_path, year = year, prefix = file_prefix, var = a_var, level = level)  
        driver.quit()

        ### DISTRICT TYPE DATA DOWNLOAD ###
        if level == "D" and dist_type:
            print(f"Downloading District Type Data for {year}...")
            
            if os.path.isfile(os.path.join(full_dir_path, f"district_type{year}.csv")):
                print(f"District Type Data for {year} already exists") # don't download if it already exists
                print("")
                continue

            df = district_type_scraper(year) # get the dataframe with the sheet data
            df.to_csv(f"{dir_name}/district_type{year}.csv") # save it to the raw_data{year} folder

            print(f"Downloaded District Type Data for {year}")
            print("")

        #Last step: Checking if files are .dat and converting to .csv
        convert_dat_to_csv(full_dir_path)

    print("All Data Downloaded!")

        

In [34]:
# Run the function with all currently available years and all the TAPR datasets
tapr_2018_2023 = list(range(2018, 2024)) # all years with data that is currently available

data_acronyms = ['PROF', 'PERF1', 'GRAD', 'STAAR1', 'REF', 'PERF'] # all the measures located on the TAPR website

tea_scraper(years = tapr_2018_2023, variables = data_acronyms, level = "D")

Downloading District Level TAPR Data for 2018...
Checking for DISTPROF data...
PROF_2018 already exists
Checking for DISTPERF1 data...
PERF1 not found for 2018
Checking for DISTGRAD data...
GRAD_2018 already exists
Checking for DISTSTAAR1 data...
STAAR1_2018 already exists
Checking for DISTREF data...
REF_2018 already exists
Checking for DISTPERF data...
PERF_2018 already exists
All downloads for 2018 completed successfully.

Downloading District Type Data for 2018...
District Type Data for 2018 already exists

Downloading District Level TAPR Data for 2019...
Checking for DISTPROF data...
PROF_2019 already exists
Checking for DISTPERF1 data...
PERF1 not found for 2019
Checking for DISTGRAD data...
GRAD_2019 already exists
Checking for DISTSTAAR1 data...
STAAR1_2019 already exists
Checking for DISTREF data...
REF_2019 already exists
Checking for DISTPERF data...
PERF_2019 already exists
All downloads for 2019 completed successfully.

Downloading District Type Data for 2019...
District T