## Setup

Load Libraries

In [3]:
import os
import requests
import pandas as pd
from typing import Optional

## Data

The data we will use can be found [here](https://cogcomp.seas.upenn.edu/Data/QA/QC/). The dataset comprises three variables: `question`, `category`, and `sub-category`. Our analysis uses only two of them: `question` and `category`. Essentially, we will build a model that uses the content of a question to predict its category. For example, the input might be "Who was Abraham Lincoln?" and the output, or label, would be Human. There are six distinct categories in the dataset:

1. ENTY (Entity): Questions that seek specific entities as answers, like objects, organisms, or concepts.
2. HUM (Human): Questions about humans, individually or as a group.
3. DESC (Description): Questions asking for descriptions, explanations, or reasons.
4. NUM (Numeric): Questions expecting a numerical answer.
5. LOC (Location): Questions that are geographically oriented.
6. ABBR (Abbreviation): Questions seeking the extended form or explanation of abbreviations.
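
To make the three variables concrete, here is a minimal sketch of splitting one raw line into them. It assumes each line has the layout `CATEGORY:sub-category question text`; the helper name `parse_labeled_question` and the example line are hypothetical and only illustrate that assumed layout.

```python
from typing import Tuple


def parse_labeled_question(line: str) -> Tuple[str, str, str]:
    """Split one raw line into (category, sub_category, question)."""
    # Assumed layout: "<CATEGORY>:<sub-category> <question text>"
    label, _, question = line.strip().partition(" ")
    category, _, sub_category = label.partition(":")
    return category, sub_category, question


# Hypothetical example line written in the assumed layout
example = "HUM:ind Who was Abraham Lincoln ?"
category, sub_category, question = parse_labeled_question(example)
print(category, sub_category, question, sep=" | ")
```

Only `category` and `question` are needed for the model; the sub-category is returned so it can be inspected or discarded explicitly.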

For this work, we train the model on Training Set 5 (`train_5500.label`) and evaluate it on the single available test set (`TREC_10.label`).

We will now create the data download utility function.

In [None]:
dir_name = 'data'
os.makedirs(dir_name, exist_ok=True)

In [None]:
def download_data(dir_name: str, filename: str, url: str, expected_bytes: Optional[int]=None) -> str:
    """
    Download a file if not present, and make sure it's the right size if the expected size is provided.
    
    Args:
        dir_name (str): The directory where the data will be stored.
        filename (str): The filename under which the data will be stored.
        url (str): The URL from which to download the data.
        expected_bytes (Optional[int]): The expected size of the data in bytes. 
                                        If provided, the function will check if the downloaded file size matches this.
                                        If not, an exception will be raised.
                                        If not provided, no size check will be performed.
                                        
    Returns:
        str: The file path where the data is stored.
    """
    
    # Create the directory if it doesn't already exist
    os.makedirs(dir_name, exist_ok=True)
    filepath = os.path.join(dir_name, filename)

    # Download the file if it doesn't already exist
    if not os.path.exists(filepath):
        response = requests.get(url, timeout=30)
        # Fail early on HTTP errors instead of saving an error page to disk
        response.raise_for_status()
        with open(filepath, 'wb') as f:
            # Write the content of the response to a file in the directory
            f.write(response.content)

    # If an expected size is provided, verify the size of the downloaded file
    if expected_bytes is not None:   
        statinfo = os.stat(filepath)
        if statinfo.st_size == expected_bytes:
            print(f'Found and verified {filepath}')
        else:
            print(f'File size {statinfo.st_size} does not match expected size {expected_bytes}')
            raise Exception(
              f'Failed to verify {filepath}. Can you get to it with a browser?')
    
    # Return the filepath for use elsewhere
    return filepath


In [None]:
# Download the data.
url = 'http://cogcomp.org/Data/QA/QC/'
dir_name = 'data'
train_filename = download_data(dir_name, 'train_5500.label', url+'train_5500.label', 335858)
test_filename = download_data(dir_name, 'TREC_10.label', url+'TREC_10.label', 23354)
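
Once the files are on disk, they can be read into a table holding the variables described earlier. The following is a rough sketch, not a definitive loader: the function name `load_questions`, the column name `sub_category`, the `latin-1` encoding, and the `CATEGORY:sub-category question text` line layout are all assumptions. The two sample lines are hypothetical and exist only so the sketch runs without the downloaded files.

```python
import os
import tempfile

import pandas as pd


def load_questions(filepath: str, encoding: str = "latin-1") -> pd.DataFrame:
    """Read a label file into a DataFrame with one row per question."""
    rows = []
    with open(filepath, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumed layout: "<CATEGORY>:<sub-category> <question text>"
            label, _, question = line.partition(" ")
            category, _, sub_category = label.partition(":")
            rows.append(
                {"question": question, "category": category, "sub_category": sub_category}
            )
    return pd.DataFrame(rows, columns=["question", "category", "sub_category"])


# Hypothetical sample lines in the assumed layout
sample = "HUM:ind Who was Abraham Lincoln ?\nLOC:city What is the capital of France ?\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.label")
    with open(sample_path, "w", encoding="latin-1") as f:
        f.write(sample)
    demo_df = load_questions(sample_path)

print(demo_df)
```

With the real data, the same call would be `load_questions(train_filename)` and `load_questions(test_filename)`, provided the assumed layout and encoding hold for those files.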