In [1]:
import json
import os
import shutil
from ast import literal_eval
from urllib.request import Request, urlopen

import func_timeout
import pandas as pd
import requests
from tqdm import tqdm
from resiliparse.extract.html2text import extract_plain_text
from resiliparse.parse.encoding import detect_encoding
from resiliparse.parse.html import HTMLTree

# from chatnoir_api import Index
# from chatnoir_api.v1 import search

## Function Workflow

The `get_text_from_uri` function follows these steps:

1. **Accessing the URL**:
   - It attempts to fetch the web page at the provided `url` using the `urlopen` function from the `urllib` library.
   - If the page cannot be fetched (e.g., because of a network error), it returns the string `"not_accessible"`.

2. **Parsing HTML**:
   - Once the page has been fetched, the raw HTML bytes are parsed into an `HTMLTree` (from the Resiliparse library).
   - The encoding of the HTML content is detected with the `detect_encoding` function and passed to the parser so that the bytes are decoded correctly.

3. **Extracting Text Content**:
   - The function extracts the plain text from the parsed HTML with `extract_plain_text`. Several parameters control the extraction:
     - `main_content`: Whether to extract only the main content of the page (set to `True`).
     - `alt_texts`: Whether to include alternative texts such as image descriptions (set to `False`).
     - `preserve_formatting`: Whether to preserve text formatting such as line breaks (set to `False`).
     - `noscript`: Whether to extract content inside `<noscript>` tags (set to `True`).


In [2]:
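def get_text_from_uri(url):
    # Fetch the raw bytes of the page; any failure (network error, invalid URL,
    # HTTP error) is reported with the sentinel string "not_accessible".
    try:
        html = urlopen(url).read()
    except Exception:
        return "not_accessible"

    # Decode with the detected encoding and extract the main-content plain text.
    tree = HTMLTree.parse_from_bytes(html, detect_encoding(html))
    text = extract_plain_text(tree,
                              main_content=True,
                              alt_texts=False,
                              preserve_formatting=False,
                              noscript=True)
    return text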
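
The fetch step can be tried in isolation with the standard library alone. The following is a minimal sketch, not the notebook's own code: `fetch_bytes` is a hypothetical helper, the 5-second socket timeout is an assumed value, and it returns `None` on failure rather than the `"not_accessible"` string. A `data:` URL is used so the example runs without network access.

```python
import http.client
from urllib.request import urlopen


def fetch_bytes(url, timeout=5):
    """Return the raw response body, or None if the URL cannot be fetched.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        # The timeout applies to blocking socket operations such as connecting.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; a malformed URL
        # raises ValueError.
        return None


ok_body = fetch_bytes("data:text/plain,hello")  # served locally by urllib
bad_body = fetch_bytes("not-a-valid-url")       # no scheme, cannot be opened
```

Listing the expected exception types, instead of catching everything, keeps unrelated programming errors visible while still treating unreachable pages as a normal outcome.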
        

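The `ctrl_timeout` function wraps `get_text_from_uri` with a time limit so that a single slow page cannot hold up the retrieval of the remaining pages.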

- **Input**:
  - `url`: The URL of the web page from which text content is to be extracted.
  - `max_wait`: The maximum time (in seconds) to wait for the text retrieval process (default is 5 seconds).
  - `default_value`: The value to return if the retrieval process times out (default is "timeout").

- **Output**:
  - The function returns the text retrieved from the URL if the retrieval finishes within `max_wait` seconds. If the retrieval times out, it returns `default_value`.

## Function Workflow

The function follows these steps:

1. **Timeout Control**:
   - It uses the `func_timeout` library to control the execution time of the `get_text_from_uri` function, which retrieves text content from the provided `url`.
   - The `max_wait` parameter specifies the maximum waiting time for the retrieval process.

2. **Handling Timeout**:
   - If the retrieval process takes longer than the specified `max_wait` duration, the `func_timeout.FunctionTimedOut` exception is raised.
   - In this case, the function catches the exception and returns the `default_value`.

3. **Returning Text Content**:
   - If the retrieval process completes within the specified timeout, the function returns the retrieved text content. The timeout mechanism prevents long delays.


In [3]:
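def ctrl_timeout(url, max_wait=5, default_value="timeout"):
    # Run get_text_from_uri, giving up after max_wait seconds.
    try:
        return func_timeout.func_timeout(max_wait, get_text_from_uri, args=[url])
    except func_timeout.FunctionTimedOut:
        # The retrieval took too long; report it with the sentinel value.
        return default_value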
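
The same "return a default on timeout" pattern can be sketched with the standard library alone. This is an illustrative alternative built on `concurrent.futures`, not the `func_timeout` approach used above. `call_with_timeout` and `slow_echo` are hypothetical names, and in this variant the worker thread keeps running in the background after the timeout instead of being stopped.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(func, args=(), max_wait=5, default_value="timeout"):
    """Run func(*args) in a worker thread and wait at most max_wait seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=max_wait)
    except FutureTimeout:
        return default_value
    finally:
        # Do not block here waiting for a worker that may still be running.
        executor.shutdown(wait=False)


def slow_echo(value, delay):
    """Stand-in for a slow retrieval: sleep, then return the value."""
    time.sleep(delay)
    return value


fast_result = call_with_timeout(slow_echo, args=("done", 0.0), max_wait=1)
slow_result = call_with_timeout(slow_echo, args=("done", 0.5), max_wait=0.05)
```

Here `fast_result` holds the function's return value, while `slow_result` holds the default value because the call exceeded its time limit.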

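The `get_uri_content` function retrieves the text content for every URI listed in an Excel file and saves it to a directory.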

- **Input**:
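  - `path`: The path to an Excel file containing a list of URIs. The code reads the columns `index`, `uri`, `not_accessible`, `timeout`, and `no_text`.
  - `filepath`: The directory path where the retrieved content should be saved.

- **Output**:
  - The function retrieves content from the URIs and saves it to JSON files in the specified directory, one file per URI, named after the URI's `index` value.
  - The URI list, with updated status information for each URI (`not_accessible`, `timeout`, `no_text`), is saved as an Excel file. The output path is hardcoded in the function as `./data_created/CommonCrawl17/refined_uri_list.xlsx`.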

## Function Workflow

The function follows these steps:

1. **Loading URI List**:
   - It reads the Excel file at the specified `path` to obtain a list of URIs.

2. **Iterating Over URIs**:
   - For each URI in the list, it performs the following steps:
     - Checks if a JSON file with the same index exists in the `filepath` directory. If not, it proceeds with content retrieval.
     - If the URI is not already marked as `not_accessible`, `timeout`, or `no_text` in the Excel file, it attempts to retrieve the content using the `ctrl_timeout` function.
     - Handles different outcomes:
       - If the retrieval process returns "not_accessible," it updates the Excel file to mark the URI as not accessible.
       - If the retrieval process times out, it updates the Excel file to mark the URI as a timeout.
       - If the retrieval process returns no text content, it updates the Excel file to mark the URI as having no text.
       - If the retrieval is successful, it saves the retrieved text content to a JSON file in the `filepath` directory.

3. **Periodic Saving**:
   - The function periodically saves the updated Excel file (whenever a retrieval was attempted for a row whose index is a multiple of 100) to ensure that progress is recorded even in the event of an interruption.

4. **Final Saving**:
   - After processing all URIs, the function saves the final Excel file.
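
The outcome handling in step 2 amounts to mapping each retrieval result to at most one status column. A minimal sketch of that mapping follows. `classify_result` is a hypothetical helper introduced for illustration, and the sentinel strings are the ones used by the functions in this notebook.

```python
def classify_result(text):
    """Return the status column to flag for a retrieval result, or None on success."""
    if text == "not_accessible":
        return "not_accessible"
    if text == "timeout":
        return "timeout"
    if not text:
        # Empty string (or None): the page was fetched but yielded no text.
        return "no_text"
    return None


results = ["not_accessible", "timeout", "", "Some page text"]
statuses = [classify_result(text) for text in results]
```

One consequence of using sentinel strings is that a page whose extracted text is exactly `"timeout"` or `"not_accessible"` would be classified as a failure.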


In [4]:
def get_uri_content(path, filepath):
    # Note: the status table is always written to this fixed location.
    out_path = "./data_created/CommonCrawl17/refined_uri_list.xlsx"
    uri_list = pd.read_excel(path)

    for idx, row in tqdm(uri_list.iterrows()):
        json_path = os.path.join(filepath, str(row["index"]) + ".json")

        # Skip URIs whose text was already saved or that failed in an earlier run.
        if os.path.isfile(json_path):
            continue
        if row["not_accessible"] == 1 or row["timeout"] == 1 or row["no_text"] == 1:
            continue

        text = ctrl_timeout(row["uri"])

        if text == "not_accessible":
            uri_list.at[idx, "not_accessible"] = 1
        elif text == "timeout":
            uri_list.at[idx, "timeout"] = 1
        elif not text:
            uri_list.at[idx, "no_text"] = 1
        else:
            with open(json_path, "w") as f:
                json.dump(text, f, indent=2)

        # Checkpoint the status table so an interrupted run can be resumed.
        if idx % 100 == 0:
            uri_list.to_excel(out_path, index=False)

    uri_list.to_excel(out_path, index=False)
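
The skip-if-already-saved check is what makes the function safe to rerun after an interruption. A minimal standard-library sketch of that idea follows, using a temporary directory in place of the real output folder. `save_text_once` is a hypothetical helper, not part of the notebook.

```python
import json
import os
import tempfile


def save_text_once(directory, index, text):
    """Write text to <directory>/<index>.json unless that file already exists.

    Returns True if a new file was written, False if the index was skipped.
    """
    json_path = os.path.join(directory, f"{index}.json")
    if os.path.isfile(json_path):
        return False
    with open(json_path, "w") as f:
        json.dump(text, f, indent=2)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    first_call = save_text_once(tmp_dir, 7, "Example page text")
    second_call = save_text_once(tmp_dir, 7, "Different text")  # skipped
    with open(os.path.join(tmp_dir, "7.json")) as f:
        stored_text = json.load(f)
```

Because the second call is skipped, the file still holds the text from the first call. A rerun therefore never overwrites or refetches content that was already saved.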

In [18]:
get_uri_content("./data_created/ClueWeb22/combined_uri_list.xlsx", "./data_created/ClueWeb22/text_files")

13930it [35:33,  6.53it/s]
