Objective:
* Go through the list of the target corporation's partners
* Filter the sponsor list and get a list of unique sponsor names
* For each sponsor, get its homepage URL and, optionally, its short name (e.g. https://target.net/ -> target)
* Get the homepage corpus of each sponsor, and check whether the homepage URL (or short name) of any other sponsor appears in that corpus.
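
As a rough illustration of the short-name step, here is a minimal sketch that derives a short name from a homepage URL. The helper name `short_name_from_url` is hypothetical and not part of this notebook, which reads the short names from the `Sponsor_Domain` column instead. It assumes the short name is simply the first label of the host after an optional "www.".

```python
from urllib.parse import urlparse

def short_name_from_url(url):
    """Return the first label of the host, ignoring a leading 'www.'."""
    # .hostname is lowercased and excludes any port; it is None for malformed URLs
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Keep only the first label, e.g. "target.net" -> "target"
    return host.split(".")[0]

print(short_name_from_url("https://target.net/"))
print(short_name_from_url("https://www.prosourcewholesale.com/inspiration"))
```

Note that this simple rule returns the subdomain for hosts such as shop.example.com, so real data may need manual review.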

Mainly utilized library:
* google search through python: https://pypi.org/project/googlesearch-python/

# Load Libraries

In [1]:
# Python google search package
# !pip install googlesearch-python

# Install requests
!pip install requests

# Install requests-ip-rotator
!pip3 install requests-ip-rotator

Collecting requests-ip-rotator
  Downloading requests_ip_rotator-1.0.14-py3-none-any.whl.metadata (8.8 kB)
Collecting boto3 (from requests-ip-rotator)
  Downloading boto3-1.37.21-py3-none-any.whl.metadata (6.7 kB)
Collecting botocore<1.38.0,>=1.37.21 (from boto3->requests-ip-rotator)
  Downloading botocore-1.37.21-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3->requests-ip-rotator)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.12.0,>=0.11.0 (from boto3->requests-ip-rotator)
  Downloading s3transfer-0.11.4-py3-none-any.whl.metadata (1.7 kB)
Downloading requests_ip_rotator-1.0.14-py3-none-any.whl (20 kB)
Downloading boto3-1.37.21-py3-none-any.whl (139 kB)
Downloading botocore-1.37.21-py3-none-any.whl (13.4 MB)

In [2]:
# Standard Python data analysis libraries
import pandas as pd
import random

# Google search library
# reference: https://pypi.org/project/googlesearch-python/
# from googlesearch import search

# Web related library
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup

# Progress Bar
from tqdm import tqdm

# Requests-ip-rotator to rotate IP
# Aim to avoid 429 HTTP error
# Reference: https://github.com/Ge0rg3/requests-ip-rotator
from requests_ip_rotator import ApiGateway

# Add request error handler
from requests.exceptions import HTTPError, ReadTimeout

# Check nan
import math

# Load Data

In [None]:
# Load the CSV into the data pool DataFrame
# If reading fails with an "invalid continuation byte" error, pass encoding='latin1' to read_csv
df_pool = pd.read_csv('')

# Apply .strip() to all string values in the DataFrame using apply with a lambda function
df_pool = df_pool.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# Now all string values in df_pool will have leading/trailing spaces removed
print("Stripped all leading/trailing spaces from string attributes.")

Stripped all leading/trailing spaces from string attributes.


In [None]:
# Check dataframe
df_pool

Next, filter the data if the analysis should focus on only a specific subset of the dataset,
for example only the sponsors in the state "MN".

In [5]:
# Filter the df based on specific needs
# df_target = df_pool[df_pool['State'] == 'MN']
# df_target = df_pool[df_pool['Type'] == 'npo']

# Reset the index of the filtered DataFrame and drop the old index
# df_target = df_target.reset_index(drop=True)

# If wanting to examine whole data pool, use this line and comment out all previous lines in this cell
df_target = df_pool

# Drop duplicate sponsors, keeping the first occurrence
df_target = df_target.drop_duplicates(subset='Sponsors', keep='first')
df_target.reset_index(drop=True, inplace=True)

In [None]:
# Check target dataframe
df_target

In [7]:
# Create the lists of selected sponsors, URLs, and domains
# Sponsors are unique after the de-duplication above
Sponsor_list = df_target['Sponsors'].tolist()
Sponsor_URL_list = df_target['Sponsor_URL'].tolist()
Sponsor_domain_list = df_target['Sponsor_Domain'].tolist()
print("Unique number of Sponsors:", len(Sponsor_list))
print("Unique number of Sponsor URLs:", len(Sponsor_URL_list))
print("Unique number of Sponsor Domains:", len(Sponsor_domain_list))

Unique number of Sponsors: 14
Unique number of Sponsor URLs: 14
Unique number of Sponsor Domains: 14


# Get the external links on each sponsor's website and check whether they include any sponsor on the list

Issues:
* (Unsolved) LinkedIn uses an odd internal redirect link for some links that should be external; these cannot be detected manually, or at least not in a simple way.
* (Unsolved) Requests to some sites fail (mostly banks, government sites, or large corporations), even with headers set. This could potentially be fixed with Selenium in the future.
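
One possible partial workaround for the redirect issue is sketched below. It assumes the wrapper link carries the real destination in a query parameter. The parameter name `url`, the helper `unwrap_redirect`, and the example link are all hypothetical, and this has not been confirmed against actual LinkedIn links.

```python
from urllib.parse import urlparse, parse_qs

def unwrap_redirect(link, param="url"):
    """Return the destination embedded in a redirect-style link, else the link itself."""
    # parse_qs percent-decodes the values for us
    query = parse_qs(urlparse(link).query)
    targets = query.get(param)
    # Only unwrap when the parameter holds an absolute http(s) URL
    if targets and targets[0].startswith(("http://", "https://")):
        return targets[0]
    return link

# Hypothetical wrapped link (assumed format, for illustration only)
wrapped = "https://redirect.example.com/out?url=https%3A%2F%2Ftarget.net%2F"
print(unwrap_redirect(wrapped))
print(unwrap_redirect("https://target.net/about"))
```

Links that do not match the assumed pattern are returned unchanged, so the helper could be applied to every extracted link before matching.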

In [17]:
# User-Agent header to reduce the chance of being identified as a bot
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

In [18]:
MAX_PAGES = 1000
# Cap the queue size to avoid exceeding Google Colab memory
MAX_QUEUE = 20000
MAX_DEPTH = 2  # Crawl depths 0 through 2, i.e. three levels including the homepage

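# Function to extract all external links,
# crawling from the starting homepage.
# Relies on the module-level sets visited_urls and external_links,
# which are initialized in a later cell before this function is called.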
def extract_links(start_url):
  to_visit = [(start_url, 0)]  # Queue of (url, depth)

  # While queue is not empty
  while to_visit:
    # extract homepage link and current depth
    url, depth = to_visit.pop(0)

    # Check if we've hit the page limit
    if len(visited_urls) >= MAX_PAGES:
      break

    # Skip if the URL has already been visited or is deeper than MAX_DEPTH
    if url in visited_urls or depth > MAX_DEPTH:
      continue

    # First attempt with User-Agent header
    try:
      response = requests.get(url, headers=headers, timeout=10)
      response.encoding = 'utf-8'  # explicitly set the encoding
      soup = BeautifulSoup(response.content, 'html.parser')
    except requests.RequestException as e:
      print(f"Failed to access {url} with User-Agent: {e}, attempting without User-Agent")
      # Second attempt without User-Agent
      try:
        response = requests.get(url, timeout=10)
        response.encoding = 'utf-8'  # explicitly set the encoding
        soup = BeautifulSoup(response.content, 'html.parser')
      except requests.RequestException as e:
        print(f"Failed to access {url} without User-Agent: {e}, skipping this URL")
        continue

    # Mark the URL as visited
    visited_urls.add(url)
    # print(f"Visiting Internal Url (Depth {depth}): {url}")

    # Get the base domain and path of the URL
    base_domain = urlparse(url).netloc
    base_path = urlparse(url).path

    # Find all links on the page
    links = soup.find_all('a')

    # Go through all links in the list
    for link in links:
      href = link.get('href')
      if href:
        full_url = urljoin(url, href)
        parsed_url = urlparse(full_url)

        # If it's an external link, add to external_links set
        if parsed_url.netloc != base_domain:
          external_links.add(full_url)
        else:
          # If it's an internal link and not yet visited, add it to the queue
          # Make sure the base domain and base path both match
          if parsed_url.netloc == base_domain and parsed_url.path.startswith(base_path):
            if full_url not in visited_urls and not full_url.endswith((".pdf", ".jpg", ".png", ".gif", ".zip")):
              # Ensure the full_url is not already in the to_visit queue at any depth
              if not any(queued_url == full_url for queued_url, _ in to_visit):
                # Prioritize URLs that mention sponsors, partners, funders, or donors
                keywords = ("Sponsor", "sponsor", "Partner", "partner", "Funder", "funder", "Donor", "donor")
                if len(to_visit) < MAX_QUEUE and any(keyword in full_url for keyword in keywords):
                  to_visit.insert(0, (full_url, depth + 1))  # Add to the front of the queue
                elif len(to_visit) < MAX_QUEUE:
                  to_visit.append((full_url, depth + 1))  # Add to the end of the queue
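
The crawl log for this notebook shows the same page being requested once per `#fragment` (for example `.../inspiration#Explore Rooms` and `.../inspiration#Explore Styles`). A possible refinement, sketched here under assumptions rather than taken from the crawler above, is to normalize URLs before queueing them. The helper `normalize_url` and the example links are hypothetical. The sketch also uses `collections.deque` as one alternative to `list.pop(0)`.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

def normalize_url(url):
    """Strip the #fragment so page anchors map to one URL; return None for non-web links."""
    clean_url, _fragment = urldefrag(url)
    if urlparse(clean_url).scheme not in ("http", "https"):
        return None  # e.g. mailto:, tel:, javascript:
    return clean_url

# Hypothetical links, modeled on the fragment URLs seen in the crawl log
candidates = [
    "https://www.example.com/inspiration#Explore Rooms",
    "https://www.example.com/inspiration#Explore Styles",
    "mailto:info@example.com",
]

queue = deque()  # deque pops from the left in O(1), unlike list.pop(0)
seen = set()
for link in candidates:
    normalized = normalize_url(link)
    if normalized and normalized not in seen:
        seen.add(normalized)
        queue.append((normalized, 0))

print(list(queue))
```

With this normalization the two fragment variants collapse into a single queue entry, and the mailto link is dropped instead of being counted as an external link.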

In [19]:
# Helper function to check NAN
def is_nan(value):
  """Check if the given value is NaN."""
  return isinstance(value, float) and math.isnan(value)

In [20]:
# Initialize an empty list to store search results
search_results = []
# Track visited internal URLs and external links
visited_urls = set()
external_links = set()

# Start with the initial URL
for index in tqdm(range(len(Sponsor_list))):
  sponsor = Sponsor_list[index]
  # Locate the sponsor's homepage as starting url
  start_url = Sponsor_URL_list[index]
  # Reset the sets for this sponsor
  visited_urls.clear()
  external_links.clear()

  if not is_nan(start_url):
    # Get all the external links on the web of this sponsor
    extract_links(start_url)
    # boolean to check if we have found matching external link
    added_new = False

    if len(external_links) > 0:
      for target_index in range(len(Sponsor_list)):
        # The homepage sponsor and target sponsor must differ, and the target must have a URL and domain
        if target_index != index and not is_nan(Sponsor_URL_list[target_index]) and not is_nan(Sponsor_domain_list[target_index]):
          # If the sponsor's full homepage link is extracted, add the sponsor and the link
          if Sponsor_URL_list[target_index] in external_links:
            matching_link = next(link for link in external_links if Sponsor_URL_list[target_index] == link)
            search_results.append({
              'Homepage Sponsor': sponsor,
              'Homepage Url': start_url,
              'Target Sponsor': Sponsor_list[target_index],
              'Target Url': Sponsor_URL_list[target_index],
              'Matched External Link': matching_link
            })
            added_new = True
          # If the sponsor's shortname is mentioned in any hyperlink, add the sponsor and the link
          elif any(str(Sponsor_domain_list[target_index]) in link for link in external_links):
            # Find the matching link and store it
            matching_link = next(link for link in external_links if str(Sponsor_domain_list[target_index]) in link)
            search_results.append({
              'Homepage Sponsor': sponsor,
              'Homepage Url': start_url,
              'Target Sponsor': Sponsor_list[target_index],
              'Target Url': Sponsor_URL_list[target_index],
              'Matched External Link': matching_link
            })
            added_new = True
      if not added_new:
        # No matching external link found among the extracted links
        search_results.append({
          'Homepage Sponsor': sponsor,
          'Homepage Url': start_url,
          'Target Sponsor': "None",
          'Target Url': "None",
          'Matched External Link': "None"
        })
    else:
      # No external links found on this sponsor's website
      search_results.append({
        'Homepage Sponsor': sponsor,
        'Homepage Url': start_url,
        'Target Sponsor': "None",
        'Target Url': "None",
        'Matched External Link': "None"
      })

  else:
    # No homepage URL available for this sponsor
    search_results.append({
      'Homepage Sponsor': sponsor, # Sponsor that owns the homepage
      'Homepage Url': "None", # Homepage Sponsor's homepage URL
      'Target Sponsor': "None", # Target Sponsor name
      'Target Url': "None", # Target Sponsor's homepage URL
      'Matched External Link': "None" # External link found on the homepage site that supports the Target Sponsor match
    })

  7%|▋         | 1/14 [00:24<05:19, 24.60s/it]

Failed to access https://asd-inc.com/ with User-Agent: HTTPSConnectionPool(host='asd-inc.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7a9ffb5cfbd0>: Failed to establish a new connection: [Errno 111] Connection refused')), attempting without User-Agent
Failed to access https://asd-inc.com/ without User-Agent: HTTPSConnectionPool(host='asd-inc.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7a9ffb5cc5d0>: Failed to establish a new connection: [Errno 111] Connection refused')), skipping this URL


 71%|███████▏  | 10/14 [06:21<04:13, 63.34s/it]

Failed to access https://www.prosourcewholesale.com/inspiration#Explore Rooms with User-Agent: HTTPSConnectionPool(host='www.prosourcewholesale.com', port=443): Read timed out. (read timeout=10), attempting without User-Agent
Failed to access https://www.prosourcewholesale.com/inspiration#Explore Styles with User-Agent: HTTPSConnectionPool(host='www.prosourcewholesale.com', port=443): Read timed out. (read timeout=10), attempting without User-Agent



Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.




  soup = BeautifulSoup(response.content, 'html.parser')
100%|██████████| 14/14 [14:52<00:00, 63.72s/it]
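
The two-tier matching used in the loop above (exact homepage URL first, then short name as a substring of any link) can be illustrated in isolation. This is a simplified sketch on made-up data, not the notebook's actual results. The function `find_matches` and all sponsor names and URLs below are hypothetical.

```python
def find_matches(external_links, targets):
    """targets: list of (name, homepage_url, short_name). Returns (name, matched_link) pairs."""
    matches = []
    for name, homepage_url, short_name in targets:
        if homepage_url in external_links:
            # Tier 1: the exact homepage URL was linked
            matches.append((name, homepage_url))
            continue
        # Tier 2: the short name appears somewhere in a link (sorted for a stable pick)
        hit = next((link for link in sorted(external_links) if short_name in link), None)
        if hit is not None:
            matches.append((name, hit))
    return matches

# Hypothetical data for illustration
links = {"https://alpha.org/", "https://www.beta.com/partners", "https://gamma.io/"}
targets = [
    ("Alpha", "https://alpha.org/", "alpha"),
    ("Beta", "https://beta.com/", "beta"),
    ("Delta", "https://delta.net/", "delta"),
]
print(find_matches(links, targets))
```

Because the second tier is a plain substring test, very short names can match unrelated links, so matched rows are worth spot-checking by hand.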


In [21]:
external_links

{'http://bipcapital.com/',
 'http://crescerance.com/',
 'http://healthemed.net/',
 'http://korioclinical.com',
 'http://linkedin.com/in/jharris365',
 'http://resilia.com',
 'http://www.dropstat.com/',
 'http://www.mediafly.com',
 'http://www.opengenie.ai',
 'http://www.opyacare.com/',
 'http://www.peregrine-health.com',
 'http://www.therounds.com',
 'https://abstrakt.ai/',
 'https://acclivityhealth.com/',
 'https://acivilate.com',
 'https://adviserinfo.sec.gov/firm/summary/292983',
 'https://appsurify.com/',
 'https://basehq.com/',
 'https://bipventures.vc/about/#approach',
 'https://bipventures.vc/team/#network',
 'https://casestatus.com/',
 'https://cdn.prod.website-files.com/64601aeac004d574778d4339/6520d4d85467d373dc814ee9_Panoramic-Ventures-The-State-of-Startups-in-the-Southeast-2022-Report-min.pdf',
 'https://cdn.prod.website-files.com/64601aeac004d574778d4339/6520d4f5546f23989d49d729_The-State-of-Startups-2021-min.pdf',
 'https://cdn.prod.website-files.com/64601aeac004d574778d43

In [22]:
# Quick checkers
# Sponsor_urls["Viemo"]

In [23]:
# Convert the search results into a pandas DataFrame
df_results = pd.DataFrame(search_results)

In [None]:
# View results
df_results

In [25]:
# Export to csv file
df_results.to_csv('search_results_Hyperlink.csv', index=False)