### Notebook for downloading attachments using URLs from AI-related contracts extracted from [sam.gov](https://sam.gov/).
#### This notebook performs URL processing, link extraction, and attachment downloads. It uses a headless Firefox WebDriver to navigate URLs, handle pop-ups, extract and save download links, and download files to specified folders while managing server load and saving results to JSON files.
#### Note: To extract the URLs, the VPN needs to be turned off.
#### Files needed to run the notebook:
    -- 'AI_contracts.csv'
#### Files generated from the notebook:
    -- 'batchurl_0_15.json' (total 10 json files)
    -- Attachments ('.pdf', '.docx', and '.doc' files)

#### Importing necessary libraries for web scraping (selenium) and data manipulation (pandas), and webdriver_manager for managing browser drivers. The notebook uses Firefox as the browser.

In [None]:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from webdriver_manager.firefox import GeckoDriverManager
import time
import json

#### Reading the `AI_contracts.csv` file 

In [None]:
AI_contracts=pd.read_csv('AI_contracts.csv')
full_urls_list=AI_contracts['Link'].tolist()

##### The function below sets up a headless Firefox WebDriver to navigate to a given URL, handles pop-ups, and extracts download links from the attachments table on the page. It returns a list of dictionaries containing document names and their corresponding download links.

In [23]:
def get_url_links(url):
    """
    Function to extract download links from a given URL.

    This function performs the following steps:
    1. Sets up and configures a Firefox WebDriver instance with headless mode and specific dimensions.
    2. Navigates to the specified URL and scrolls to the end of the page to ensure all dynamic content is loaded.
    3. Attempts to handle any pop-ups by clicking an 'OK' button if present.
    4. Checks for the presence of a "Download All Attachments/Links" button and waits for it to become visible.
    5. If the button is found, processes the attachments table to extract document names and download links.
    6. Handles potential exceptions during page interactions and ensures the WebDriver instance is closed properly.

    Args:
        url (str): The URL of the page from which to extract download links.

    Returns:
        data_list (list): A list of dictionaries containing document names and download links.
    """
    # Set up Firefox driver
    options = Options()
    options.add_argument('--headless')  # Run in headless mode; comment out to watch the browser
    options.add_argument('--width=1920')
    options.add_argument('--height=1080')
    service = Service(GeckoDriverManager().install())
    driver = webdriver.Firefox(service=service, options=options)

    data_list = []  # Initialize data list to store row data

    try:
        driver.get(url)
        body = driver.find_element(By.TAG_NAME, "body")
        body.send_keys(Keys.CONTROL+Keys.END)

        # Handle the 'OK' button
        try:
            ok_button = WebDriverWait(driver, 60).until(
                EC.element_to_be_clickable((By.XPATH, '//button[text()="OK"]'))
            )
            driver.execute_script("arguments[0].scrollIntoView(true);", ok_button)
            ActionChains(driver).click(ok_button).perform()
            print("Clicked 'OK' button on popup.")
        except TimeoutException:
            print("No 'OK' button to click (not found within the timeout period).")
        except NoSuchElementException:
            print("No 'OK' button present on the page.")

        time.sleep(60)

        # Scroll down to the bottom of the page to ensure all dynamic content is loaded
        body = driver.find_element(By.TAG_NAME, "body")
        body.send_keys(Keys.CONTROL+Keys.END)

        # Check for the presence of the "Download All Attachments/Links" button
        try:
            download_button = WebDriverWait(driver, 30).until(
                EC.presence_of_element_located((By.XPATH, '//span[@class="download-button ng-star-inserted"]/a'))
            )
            print("Download button found, processing attachments...")

            WebDriverWait(driver, 120).until(
                EC.visibility_of_element_located((By.ID, "opp-view-attachments-tableId"))
            )

            table_body = driver.find_element(By.ID, 'opp-view-attachments-tableBodyId')
            rows = table_body.find_elements(By.TAG_NAME, 'tr')
            for row in rows:
                cells = row.find_elements(By.TAG_NAME, 'td')
                if len(cells) >= 4:
                    row_data = {
                        'Document': cells[0].text,
                        # 'File Size': cells[1].text,
                        # 'Access': cells[2].text,
                        # 'Updated Date': cells[3].text
                    }
                    link_element = cells[0].find_element(By.CLASS_NAME, 'file-link')
                    href_value = link_element.get_attribute('href')
                    row_data['Download Link'] = href_value
                    data_list.append(row_data)
        except TimeoutException:
            print("Download button not found, no attachments available.")
        except NoSuchElementException:
            print("No attachments found.")

    except Exception as e:
        print(f"An error occurred while processing the page: {e}")

    finally:
        driver.quit()
    return data_list

#### The `batch_url_proc` function processes a batch of URLs, extracting download links using the `get_url_links` function, and adds a delay between requests to manage server load. The `save_list_to_json` function saves a list of data to a specified JSON file for later use.

In [24]:
def batch_url_proc(start, end):
    """
    Processes a batch of URLs and extracts download links.

    This function iterates over a range of indices, processes each URL in the
    `full_urls_list` by calling the `get_url_links` function, and collects
    the results into a list. A delay is added between requests to manage
    server load and avoid potential rate limiting.

    Args:
        start (int): The starting index of the URL list to process.
        end (int): The ending index (exclusive) of the URL list to process.

    Returns:
        final_links_batch (list): A list of lists, where each inner list contains
                                   dictionaries with document names and download links.
    """
    final_links_batch = []
    for i in range(start, end):
        print(f"currently processing file no: {i+1}")
        # Pause between pages to limit the load on the server
        time.sleep(60)
        final_links_batch.append(get_url_links(full_urls_list[i]))
    return final_links_batch


def save_list_to_json(my_list, filename):
    """
    Saves a list of data to a JSON file.

    This function serializes the provided list to a JSON formatted string
    and writes it to a specified file. This is useful for saving the results
    of data processing for later use.

    Args:
        my_list (list): The list of data to be saved to the JSON file.
        filename (str): The name of the file to save the JSON data.

    Returns:
        None
    """
    with open(filename, "w") as file:
        json.dump(my_list, file)


##### Extracting the URLs of the attachments embedded in each contract on SAM.gov. The work is done in batches because of the VPN reconnection every two hours. Attachment names and URLs are saved to JSON files.

In [None]:
batchurl_0_15= batch_url_proc(0,15)
save_list_to_json(batchurl_0_15, "batchurl_0_15.json")

In [None]:
batchurl_15_30= batch_url_proc(15,30)
save_list_to_json(batchurl_15_30, "batchurl_15_30.json")

In [None]:
batchurl_30_45= batch_url_proc(30,45)
save_list_to_json(batchurl_30_45, "batchurl_30_45.json")


In [None]:
batchurl_45_60= batch_url_proc(45,60)
save_list_to_json(batchurl_45_60, "batchurl_45_60.json")


In [None]:
batchurl_60_75= batch_url_proc(60,75)
save_list_to_json(batchurl_60_75, "batchurl_60_75.json")

In [None]:
batchurl_75_90= batch_url_proc(75,90)
save_list_to_json(batchurl_75_90, "batchurl_75_90.json")

In [None]:
batchurl_90_105= batch_url_proc(90,105)
save_list_to_json(batchurl_90_105, "batchurl_90_105.json")

In [None]:
batchurl_105_120= batch_url_proc(105,120)
save_list_to_json(batchurl_105_120, "batchurl_105_120.json")

In [None]:
batchurl_120_135= batch_url_proc(120,135)
save_list_to_json(batchurl_120_135, "batchurl_120_135.json")

In [None]:
batchurl_135_156= batch_url_proc(135,156)
save_list_to_json(batchurl_135_156, "batchurl_135_156.json")
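
The ten batch cells above repeat the same two calls with different indices. As a minimal sketch, the index pairs and file names could be generated instead of typed by hand. The helpers `batch_ranges` and `batch_filename` are hypothetical names, not part of the original notebook. The even split shown here also ends with `(135, 150)` and `(150, 156)` rather than the single `(135, 156)` batch used above.

```python
def batch_ranges(total, size):
    """Return (start, end) index pairs that cover range(total) in steps of `size`.

    `end` is exclusive, matching the convention of batch_url_proc.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [(start, min(start + size, total)) for start in range(0, total, size)]


def batch_filename(start, end):
    """Build a file name following the 'batchurl_<start>_<end>.json' pattern."""
    return f"batchurl_{start}_{end}.json"


ranges = batch_ranges(156, 15)
print(ranges[0], ranges[-1])
print(batch_filename(*ranges[0]))

# Possible use with the notebook's own functions (not run here, since each
# call drives a browser and sleeps between pages):
# for start, end in ranges:
#     save_list_to_json(batch_url_proc(start, end), batch_filename(start, end))
```

Running one batch per cell, as the notebook does, still has the advantage that a failed batch can be retried on its own after the VPN reconnects.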

#### Reading the saved JSON files and combining the file names and URLs into one large list

In [None]:
import os
import json

def read_json_files(folder_name):
    current_dir = os.getcwd()
    parent_folder_path = os.path.join(current_dir, folder_name)
    parent_folder_path = os.path.abspath(parent_folder_path)
    
    json_files = [file for file in os.listdir(parent_folder_path) if file.endswith('.json')]
    json_contents = []

    for json_file in json_files:
        file_path = os.path.join(parent_folder_path, json_file)
        with open(file_path, 'r') as file:
            json_contents.append(json.load(file))

    return json_contents


folder_name = 'json_links_attachments'  
contents = read_json_files(folder_name)
for content in contents:
    print(content)


In [26]:
import itertools
contents_=list(itertools.chain.from_iterable(contents))
print(len(contents_))

156


In [27]:
contents_

[[],
 [{'Document': 'BLM_Awards_for_Closeout_04_04_24_Final_30Apr2024_Public_Notice_1.xlsx',
   'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/73486b1246654d5db15326e2047f801a/download?&status=archived&token='}],
 [{'Document': 'ATTACHMENT F - PAST PERFORMANCE QUESTIONNAIRE.pdf',
   'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/93742853c9b94e9f897f6e86066ef6e2/download?&status=archived&token='},
  {'Document': 'ATTACHMENT E - REPORTING TOOL - BNOE - NX EQ ORTHOPEDIC SURGICAL ROBOTICS SYSTEMS.xlsx',
   'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/e66496074912480e9fb7149f0e1323ad/download?&status=archived&token='},
  {'Document': 'ATTACHMENT D - SOLICITATION PROVISIONS - BNOE - NX EQ ORTHOPEDIC SURGICAL ROBOTICS SYSTEMS.pdf',
   'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/76b15d2c7ccf4f12885c8d8671bb43f9/download?&status=archived&token='},
  {'Docu

In [28]:
contents__=list(itertools.chain.from_iterable(contents_))
print(len(contents__))

399
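
The two `itertools.chain.from_iterable` calls above each remove one level of nesting: the first turns the per-file lists into one list per URL (156 entries), and the second turns those into one entry per attachment (399 entries). The sketch below illustrates this on a made-up miniature of `contents`. The names `sample_contents`, `per_url`, and `per_attachment` and the example.com links are illustrative assumptions only.

```python
import itertools

# Hypothetical miniature of `contents`: one list per JSON file, each holding
# one list per URL, each holding zero or more attachment dictionaries.
sample_contents = [
    [
        [],  # a page without attachments
        [{"Document": "a.pdf", "Download Link": "https://example.com/a"}],
    ],
    [
        [
            {"Document": "b.docx", "Download Link": "https://example.com/b"},
            {"Document": "c.xlsx", "Download Link": "https://example.com/c"},
        ],
    ],
]

# First pass: drop the per-file level, leaving one list per URL.
per_url = list(itertools.chain.from_iterable(sample_contents))

# Second pass: drop the per-URL level, leaving one dictionary per attachment.
per_attachment = list(itertools.chain.from_iterable(per_url))

print(len(per_url))
print(len(per_attachment))
```

Pages without attachments contribute an empty list to the first pass and nothing to the second, which is why the attachment count is not a multiple of the page count.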


#### Filtering the attachment URLs to avoid downloading '.xls', '.xlsx' and '.zip' files

In [29]:
def filter_documents(doc_list, extensions):
    filtered_docs = [
        doc for doc in doc_list
        if any(doc['Document'].endswith(ext) for ext in extensions) and doc['Download Link'] is not None
    ]
    return filtered_docs

extensions = ['.doc', '.docx', '.pdf']#, '.xls', '.xlsx', '.zip']
filtered_documents = filter_documents(contents__, extensions)
filtered_documents

[{'Document': 'ATTACHMENT F - PAST PERFORMANCE QUESTIONNAIRE.pdf',
  'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/93742853c9b94e9f897f6e86066ef6e2/download?&status=archived&token='},
 {'Document': 'ATTACHMENT D - SOLICITATION PROVISIONS - BNOE - NX EQ ORTHOPEDIC SURGICAL ROBOTICS SYSTEMS.pdf',
  'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/76b15d2c7ccf4f12885c8d8671bb43f9/download?&status=archived&token='},
 {'Document': 'ATTACHMENT C - CONTRACT CLAUSES - ORTHOPEDIC SURGICAL ROBOTICS SYSTEMS.pdf',
  'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/a87aa4c0d0d24c57b19b308f28711dd2/download?&status=archived&token='},
 {'Document': 'ATTACHMENT B - CONTRACT ADMINISTRATION - BNOE - NX EQ ORTHOPEDIC SURGICAL ROBOTICS SYSTEMS.pdf',
  'Download Link': 'https://sam.gov/api/prod/opps/v3/opportunities/resources/files/1142541dcea8415d8f68a1b2778bac9f/download?&status=archived&token='},
 {'Docume
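
The `filter_documents` function above compares extensions case-sensitively, so a file named `REPORT.PDF` would be skipped. Whether such names occur in this dataset is not known. The following is a minimal sketch of a case-insensitive variant, assuming that behavior is wanted. `filter_documents_ci` is a hypothetical name and the sample records are made up.

```python
def filter_documents_ci(doc_list, extensions):
    """Keep documents whose name ends with one of `extensions`, ignoring case,
    and whose download link is present."""
    # str.endswith accepts a tuple of suffixes, so no inner loop is needed.
    wanted = tuple(ext.lower() for ext in extensions)
    return [
        doc for doc in doc_list
        if doc.get("Download Link") is not None
        and (doc.get("Document") or "").lower().endswith(wanted)
    ]


# Made-up records for illustration only.
sample_docs = [
    {"Document": "REPORT.PDF", "Download Link": "https://example.com/1"},
    {"Document": "notes.docx", "Download Link": "https://example.com/2"},
    {"Document": "budget.xlsx", "Download Link": "https://example.com/3"},
    {"Document": "missing.pdf", "Download Link": None},
]

kept = filter_documents_ci(sample_docs, [".doc", ".docx", ".pdf"])
print([doc["Document"] for doc in kept])
```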

#### The function below downloads a file from a given URL, saves it to a specified folder with a cleaned filename, and prints a confirmation message upon success or an error message if the download fails.

In [36]:
import os
import requests

def download_file(url, save_folder, filename):
    """
    Downloads a file from a given URL and saves it to a specified folder with a given filename.

    This function performs the following steps:
    1. Removes spaces from the filename to ensure it is valid.
    2. Creates the target folder if it does not already exist.
    3. Sends a GET request to download the file from the provided URL.
    4. If the request is successful (status code 200), saves the file to the specified location.
    5. Prints a confirmation message with the file path upon successful download, or an error message if the download fails.

    Args:
        url (str): The URL from which to download the file.
        save_folder (str): The directory where the downloaded file will be saved.
        filename (str): The name to save the downloaded file as. If None, defaults to 'file.doc'.

    Returns:
        None
    """
    # Apply the default first; calling .replace() on None would raise AttributeError
    if filename is None:
        filename = 'file.doc'
    filename = filename.replace(" ", "")
    os.makedirs(save_folder, exist_ok=True)
    response = requests.get(url)
    if response.status_code == 200:
        file_path = os.path.join(save_folder, filename)
        with open(file_path, 'wb') as file:
            file.write(response.content)
        print(f"File saved: {file_path}")
    else:
        print(f"Failed to download file: {response.status_code}")
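
Because `download_file` always writes to `os.path.join(save_folder, filename)`, two attachments with the same cleaned name end up in the same file and the later download overwrites the earlier one. The download log in this notebook lists `36C24524Q0527_1.docx` twice, for example. The sketch below shows one possible way to avoid that by appending a counter to the name. `unique_path` is a hypothetical helper, not part of the original notebook.

```python
import os


def unique_path(save_folder, filename):
    """Return a path in `save_folder` that does not exist yet.

    If `filename` is already taken, '_1', '_2', ... is inserted before the
    extension until an unused name is found.
    """
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(save_folder, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(save_folder, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```

Inside `download_file`, the line that builds `file_path` could call `unique_path(save_folder, filename)` instead. Note that this keeps every copy, including true duplicates of the same attachment.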

#### Downloading the attachments to a specific folder

In [38]:
for i in filtered_documents:
    download_file(url=i['Download Link'], save_folder='downloads', filename=i['Document'])

File saved: downloads\ATTACHMENTF-PASTPERFORMANCEQUESTIONNAIRE.pdf
File saved: downloads\ATTACHMENTD-SOLICITATIONPROVISIONS-BNOE-NXEQORTHOPEDICSURGICALROBOTICSSYSTEMS.pdf
File saved: downloads\ATTACHMENTC-CONTRACTCLAUSES-ORTHOPEDICSURGICALROBOTICSSYSTEMS.pdf
File saved: downloads\ATTACHMENTB-CONTRACTADMINISTRATION-BNOE-NXEQORTHOPEDICSURGICALROBOTICSSYSTEMS.pdf
File saved: downloads\36C10G24R0012.docx
File saved: downloads\P03FAR13PharmacyRobotSingleSourceJustificationMPTSATCOBCSignedREDACT_Redacted.pdf
File saved: downloads\S0252.225-2BUYAMERICANCERTIFICATE.docx
File saved: downloads\36C26024Q0556.docx
File saved: downloads\36C26224Q1094_1.docx
File saved: downloads\IIRAdvisor-PSCPositionSynopsis7200AA24R00071.pdf
File saved: downloads\B6SOW.docx
File saved: downloads\36C24524Q0527_1.docx
File saved: downloads\36C24524Q0527_1.docx
File saved: downloads\II_01_RFP_2032H8-24-R-00005P000025.17.24.pdf
File saved: downloads\II_01_RFP_2032H8-R-24-00005GovernmentResponsestoQuestionsPart1(1).pd