# 1. Module for data acquisition

This module is responsible for capturing and loading text data from various sources such as text files, CSV files or APIs.

Import the required Python libraries:

In [2]:
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
from typing import List, Dict, Any
import logging

This module implements five functions for capturing text data from various sources.

- `load_data_from_file(file_path: str) -> List[str]`
- `load_data_from_csv(file_path: str, column_name: str) -> List[str]`
- `load_data_from_web(url: str) -> List[str]`
- `load_data_from_api(api_endpoint: str, params: Dict[str, Any]) -> List[str]`
- `load_data_from_pubmed(api_endpoint: str, params: Dict[str, Any]) -> List[str]`

An additional helper function:

- `convert_file_to_ascii_encoding(input_filename: str, output_filename: str) -> None`

converts text from other encodings to ASCII, which text-processing libraries such as `nltk` normally support.

The `load_data_from_file` function is designed to load text data from a file and return it as a list of strings, where each string represents a line from the file. This function is essential for working with textual data stored in files, which is a common scenario in data processing, machine learning, and natural language processing (NLP) tasks.

**Functionality**

1. *File existence check*: The function first checks if the specified file exists using the `os.path.exists()` method. If the file does not exist, it raises a `FileNotFoundError` to prevent further errors down the line.

2. *Reading the file*: If the file exists, it is opened in read mode and read with `file.readlines()`, which returns a list with one element per line. Line endings are kept, so most elements end with `\n`.

3. *Logging*: The function logs the number of lines loaded from the file using the `logging.info()` method. This is useful for tracking and debugging, especially when dealing with large files.

4. *Return data*: Finally, the list of strings (each string is a line from the file) is returned for further processing.
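One practical note on the logging step: Python's root logger uses the `WARNING` threshold by default, so the `logging.info()` message is discarded unless logging is configured first. A minimal sketch of enabling it, assuming Python 3.8+ for the `force` argument (the format string is an arbitrary choice, not something this module prescribes):

```python
import logging

# Configure the root logger so that INFO-level messages are emitted.
# force=True (Python 3.8+) replaces handlers that a notebook environment
# may already have installed; without it, basicConfig() can be a no-op.
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s: %(message)s",
    force=True,
)

logging.info("Logging is configured; loader messages will now be shown.")
```

Run this once, before calling any of the loader functions.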

**Use**

To use this function, simply pass the path of the text file you want to load:

```python
file_path = 'path/to/file.txt'
lines = load_data_from_file(file_path)
```

This will return a list of strings, each representing a line of text from the file. You can then proceed with your data processing, whether it involves parsing, analysis, or feeding it into a machine learning model.

**Considerations**

- *File encoding*: The function opens the file with the platform's default encoding, which varies between systems. If the file uses a specific encoding such as UTF-8, pass it to `open()` explicitly (for example `encoding='utf-8'`).
- *Error handling*: The function raises a `FileNotFoundError` if the file does not exist. Other errors, such as a `UnicodeDecodeError` caused by a mismatched encoding, propagate to the caller unchanged.
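The encoding consideration can be handled with a small variant of the loader. The following is a sketch only, not part of the module: the name `load_lines_with_encoding` and its `encoding` parameter are hypothetical, and UTF-8 is assumed as the default. It also strips the trailing newline that `readlines()` would keep on each line.

```python
import os
import tempfile
from typing import List


def load_lines_with_encoding(file_path: str, encoding: str = "utf-8") -> List[str]:
    """Hypothetical variant of load_data_from_file with an explicit encoding."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")

    # Decode with the given encoding instead of the platform default and
    # remove the trailing newline from every line.
    with open(file_path, "r", encoding=encoding) as file:
        return [line.rstrip("\n") for line in file]


# Demonstration with a temporary UTF-8 file containing non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as sample:
        sample.write("café\nnaïve\n")
    lines = load_lines_with_encoding(sample_path)

print(lines)  # ['café', 'naïve']
```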

In [3]:
def load_data_from_file(file_path: str) -> List[str]:
    """
    Load text data from a file.
    :param file_path: str, the path to the text file
    :return: List[str], a list of strings containing the text data, each string is one line 
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")
        
    with open(file_path, 'r') as file:
        data = file.readlines()

    logging.info(f'Loaded {len(data)} lines from "{file_path}".')

    return data

The `convert_file_to_ascii_encoding` function is designed to read the contents of a text file and convert it to ASCII encoding. The converted content is then saved to a new file. This function is particularly useful when dealing with text that may contain non-ASCII characters, which could cause compatibility issues in certain applications or systems.

In many data processing tasks, especially when dealing with legacy systems or specific text-based formats, ensuring that text data is in ASCII encoding is crucial. ASCII encoding is a character encoding standard that uses 7 bits to represent characters, which makes it highly compatible with older systems and simpler text processing pipelines.

**Functionality**

1. *Reading the input file*: The function first opens the specified input file in read mode and reads its entire content into a string variable.

2. *Converting to ASCII*: The content is encoded with `.encode('ascii', errors='replace')`, which replaces every character outside the ASCII range with `?`, and is then decoded back into a string that contains only ASCII characters.

3. *Saving the output*: Finally, the ASCII-encoded content is written to a new file, specified by the `output_filename` parameter, ensuring that the output is in ASCII format.

**Applications**

- *Data standardization*: Convert various text data sources to a uniform ASCII encoding, making it easier to process and analyze them together.
- *Legacy system integration*: Prepare text files for integration with systems that only support ASCII encoding.
- *Text processing*: Simplify the handling of text data by converting non-ASCII characters, which might otherwise cause errors or require complex handling.

**Use**

To use this function, provide the path to the input file (the file you want to convert) and the path to the output file (where you want to save the converted text):

```python
input_file = 'path/to/input_file.txt'
output_file = 'path/to/output_file.txt'
convert_file_to_ascii_encoding(input_file, output_file)
```

This will create a new file at `output_file` that contains the ASCII-encoded content of the `input_file`.

**Considerations**

- *Error handling*: The strategy `errors='replace'` replaces each non-ASCII character with a `?`. This is a safe option, but it loses information (e.g. special characters or diacritics). The `ignore` strategy loses the same information, because it drops those characters silently. If preserving them matters, consider `xmlcharrefreplace` or `backslashreplace`, which write each character as an ASCII escape sequence that can be decoded later.
  
- *Use cases for ASCII encoding*: While ASCII encoding is widely supported, it is limited in terms of characters that can be displayed. For text data containing international characters, other encodings such as UTF-8 are more suitable unless you have certain restrictions that require ASCII.
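To make the trade-off between strategies concrete, the sketch below applies several standard `errors` handlers of `str.encode()` to a sample string (the sample text is an arbitrary illustration). It also shows an alternative that this module does not use: Unicode NFKD normalization via the standard `unicodedata` module, which separates base letters from their combining accents so that the base letters survive the conversion.

```python
import unicodedata

text = "Žižek café"

# 'replace' substitutes one '?' for each character ASCII cannot represent.
replaced = text.encode("ascii", errors="replace").decode("ascii")

# 'ignore' silently drops those characters.
ignored = text.encode("ascii", errors="ignore").decode("ascii")

# 'xmlcharrefreplace' keeps the information as numeric character references.
xml_refs = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# NFKD splits an accented letter into base letter + combining mark; the
# combining marks are then dropped by 'ignore', leaving the base letters.
decomposed = unicodedata.normalize("NFKD", text)
transliterated = decomposed.encode("ascii", errors="ignore").decode("ascii")

print(replaced)        # ?i?ek caf?
print(ignored)         # iek caf
print(xml_refs)        # &#381;i&#382;ek caf&#233;
print(transliterated)  # Zizek cafe
```

Normalization only helps for characters that decompose into an ASCII base letter plus accents; letters such as `ø` or `ß` have no such decomposition and would still be dropped.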

In [4]:
def convert_file_to_ascii_encoding(input_filename: str, output_filename: str) -> None:
    """
    Read the contents of a file and save it with ASCII encoding.
    
    Parameters:
    - input_filename (str): The name of the file to be read.
    - output_filename (str): The name of the file where the ASCII-encoded content should be saved.
    """
    with open(input_filename, 'r') as file:
        contents = file.read()

    # Convert to ASCII and handle non-ASCII characters using 'replace' error strategy
    ascii_contents = contents.encode('ascii', errors='replace').decode('ascii')

    with open(output_filename, 'w', encoding='ascii') as file:
        file.write(ascii_contents)

The `load_data_from_csv` function is designed to load text data from a specific column in a CSV (Comma-Separated Values) file. This function reads the CSV file, extracts the data from the specified column, and returns it as a list of strings. This is particularly useful for working with structured data where text is organized under specific columns.

CSV files are one of the most common formats for storing structured data, and they are widely used across various domains such as data analysis, machine learning, and natural language processing. 

**How It Works**

1. *File existence check*: The function first checks whether the specified CSV file exists using `os.path.exists()`. If the file is not found, a `FileNotFoundError` is raised to alert the user.

2. *Reading the CSV file*: The function then reads the CSV file using the `pandas.read_csv()` function, which loads the file into a DataFrame. The file is read with UTF-8 encoding and uses `;` as the separator. This separator can be adjusted depending on the CSV file's format.

3. *Column existence check*: After loading the data, the function checks whether the specified column exists in the DataFrame. If the column is not found, a `ValueError` is raised, informing the user that the column name is incorrect or doesn't exist in the file.

4. *Extracting data*: If the column is found, the function extracts it and converts it to a Python list using the `.tolist()` method. The values keep the types that pandas inferred, so the result is a list of strings only when the column actually contains text. This list is then returned for further processing or analysis.

**How to Use This Function**

To use this function, specify the path to the CSV file and the name of the column from which you want to extract text data:

```python
file_path = 'path/to/your/data.csv'
column_name = 'TextColumn'
data = load_data_from_csv(file_path, column_name)
```

This will return a list of strings, where each string represents an entry from the specified column in the CSV file.

**Considerations**

- *CSV format*: Ensure that the separator (`sep`) used in `pd.read_csv()` matches the one used in your CSV file. The default here is `;`, which is common in some regions and formats, but many CSV files use `,` as the separator.

- *Error handling*: The function includes checks for both file existence and column existence, making it robust against common user errors. However, ensure that the CSV file is well-formed and that the column names are correctly specified.

- *Data types*: This function is specifically designed for loading text data. If the column contains other data types (e.g., numeric or mixed types), further processing might be required.
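If the separator is not known in advance, it can be guessed from a sample of the file. The following sketch uses the standard-library `csv.Sniffer` rather than pandas. The helper name `detect_separator` and the list of candidate characters are assumptions for illustration, and sniffing is a heuristic that can fail on unusual files.

```python
import csv


def detect_separator(sample: str, candidates: str = ";,\t") -> str:
    """Hypothetical helper: guess the delimiter from a text sample."""
    # Sniffer inspects how consistently each candidate character occurs
    # per line and raises csv.Error if it cannot decide.
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


sample = "id;text\n1;first entry\n2;second entry\n"
print(detect_separator(sample))  # ;
```

The detected character could then be passed as `sep=` to `pd.read_csv()`. pandas can also perform the detection itself when called with `sep=None, engine='python'`.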

In [11]:
def load_data_from_csv(file_path: str, column_name: str) -> List[str]:
    """
    Load text data from a specific column in a CSV file.
    :param file_path: str, the path to the CSV file
    :param column_name: str, the name of the column containing the text data
    :return: List[str], a list of strings containing the text data
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The file '{file_path}' does not exist.")
    
    df = pd.read_csv(file_path, encoding='utf-8', sep=';')
    
    if column_name not in df.columns:
        raise ValueError(f"The column '{column_name}' does not exist in the CSV file.")
        
    data = df[column_name].tolist()
    
    return data



The `load_data_from_web` function takes a URL and fetches the web page with `requests.get()`. If the request fails or returns an HTTP error status, it raises a `ValueError` carrying the underlying error message. Otherwise, it parses the HTML content with `BeautifulSoup`, using the `'html.parser'` parser.

The function then finds all paragraph elements (`<p>`) in the parsed HTML with `find_all()`, extracts the text of each one with `get_text()`, and returns the results as a list of strings.
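Text extracted from `<p>` elements often contains line breaks, runs of spaces, or paragraphs that are empty once whitespace is removed. The sketch below shows one possible post-processing step for the returned list. The helper name `clean_paragraphs` is hypothetical and not part of this module.

```python
from typing import List


def clean_paragraphs(paragraphs: List[str]) -> List[str]:
    """Hypothetical post-processing step for the output of load_data_from_web."""
    cleaned = []
    for text in paragraphs:
        # split() with no arguments splits on any run of whitespace, so
        # re-joining with one space collapses newlines, tabs and repeats.
        normalized = " ".join(text.split())
        if normalized:  # skip paragraphs that held only whitespace
            cleaned.append(normalized)
    return cleaned


raw = ["  First\n   paragraph. ", "\n\t", "Second paragraph."]
print(clean_paragraphs(raw))  # ['First paragraph.', 'Second paragraph.']
```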



In [12]:
# TODO
def load_data_from_web(url: str) -> List[str]:
    """
    Scrape text data from a web page.
    :param url: str, the URL of the web page
    :return: List[str], a list of strings containing the text data
    """
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        raise ValueError(f"Failed to load data from URL '{url}': {e}")
    
    soup = BeautifulSoup(response.content, 'html.parser')
    paragraphs = soup.find_all('p')
    
    data = [paragraph.get_text() for paragraph in paragraphs]
    
    return data



The `load_data_from_api` function takes an API endpoint URL and a dictionary of parameters. It sends a GET request with `requests.get()`, passing the parameters as the query string. If the request fails or returns an HTTP error status, it raises a `ValueError` carrying the underlying error message. Otherwise, it decodes the JSON body with `response.json()`.

The function then extracts the text from the decoded JSON. The processing required depends on the structure of the data the API returns. This example assumes a top-level key `"results"` that holds a list of dictionaries, each with a key `"text"`. The function collects those values into a list of strings and returns it. This is only an example; adapt the extraction logic to the API you are using.
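Because the extraction step indexes the response directly, a missing `"results"` or `"text"` key raises a `KeyError`. A more defensive extraction could look like the sketch below. The helper `extract_texts` and its `list_key` and `text_key` parameters are hypothetical, and the sample payload is invented for illustration.

```python
from typing import Any, Dict, List


def extract_texts(
    json_data: Dict[str, Any],
    list_key: str = "results",
    text_key: str = "text",
) -> List[str]:
    """Hypothetical helper: pull text fields out of a decoded JSON payload."""
    items = json_data.get(list_key, [])
    if not isinstance(items, list):
        raise ValueError(
            f"Expected a list under '{list_key}', got {type(items).__name__}."
        )
    # Keep only dictionary items whose text field is present and a string.
    return [
        item[text_key]
        for item in items
        if isinstance(item, dict) and isinstance(item.get(text_key), str)
    ]


payload = {"results": [{"text": "first"}, {"id": 2}, {"text": "second"}]}
print(extract_texts(payload))  # ['first', 'second']
print(extract_texts({}))       # []
```

Skipping malformed items keeps the loader running on partially valid responses. Raising an error instead would be the better choice when every item is expected to carry text.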



In [13]:
# TODO
def load_data_from_api(api_endpoint: str, params: Dict[str, Any]) -> List[str]:
    """
    Retrieve text data from an API endpoint.
    :param api_endpoint: str, the URL of the API endpoint
    :param params: Dict[str, Any], a dictionary of parameters to be sent in the API request
    :return: List[str], a list of strings containing the text data
    """
    try:
        response = requests.get(api_endpoint, params=params)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        raise ValueError(f"Failed to load data from API endpoint '{api_endpoint}': {e}")
    
    json_data = response.json()

    # Process the JSON data to extract the text data. The specific processing depends
    # on the structure of the JSON data returned by the API. This is just an example.
    data = [item["text"] for item in json_data["results"]]
    
    return data

The `load_data_from_pubmed` function takes an API endpoint URL and a dictionary of parameters. In its current form it follows the same logic as `load_data_from_api`: it sends a GET request with `requests.get()`, raises a `ValueError` if the request fails or returns an HTTP error status, and otherwise decodes the JSON body with `response.json()`.

As in `load_data_from_api`, the extraction step assumes a top-level key `"results"` that holds a list of dictionaries, each with a key `"text"`. This is only an example; the processing logic must be adapted to the structure of the responses that PubMed actually returns.


In [14]:
# TODO
def load_data_from_pubmed(api_endpoint: str, params: Dict[str, Any]) -> List[str]:
    """
    Retrieve text data from PubMed.
    :param api_endpoint: str, the URL of the API endpoint
    :param params: Dict[str, Any], a dictionary of parameters to be sent in the API request
    :return: List[str], a list of strings containing the text data
    """
    try:
        response = requests.get(api_endpoint, params=params)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        raise ValueError(f"Failed to load data from pubmed '{api_endpoint}': {e}")
    
    json_data = response.json()

    # Process the JSON data to extract the text data. The specific processing depends
    # on the structure of the JSON data returned by the API. This is just an example.
    data = [item["text"] for item in json_data["results"]]
    
    return data

In [30]:
testing = False
if testing:
    domainName = 'Autism'
    fileName = 'input/214Texts.txt'
    lines = load_data_from_file(fileName)
    print(lines[:7])  # a bare expression inside an if block is not displayed

In [31]:
if testing:
    domainName = 'Autism'
    fileName = 'input/214Texts_store.csv'
    lines = load_data_from_csv(fileName, 'text3')
    print(lines[:7])  # a bare expression inside an if block is not displayed

In [32]:
if testing:
    domainName = 'Autism'
    lines = load_data_from_web('https://www.temida.si/~bojan/')
    print(lines[:7])  # a bare expression inside an if block is not displayed

In [33]:
if testing:
    domainName = 'Autism'
    lines = load_data_from_api('https://www.temida.si/~bojan/', {})
    print(lines[:7])  # a bare expression inside an if block is not displayed

In [35]:
if testing:
    domainName = 'Autism'
    lines = load_data_from_pubmed('https://www.temida.si/~bojan/', {})
    print(lines[:7])  # a bare expression inside an if block is not displayed