Current Approach: TF-IDF classifier trained on a dataset built by recursively scraping all the text from the main Democratic and Republican party sites.

TODO:
- Before publishing, check each site's robots.txt file to confirm that scraping it is permitted
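
The robots.txt check from the TODO could be sketched with the standard library's `urllib.robotparser`. The rules below are a made-up example, not the real robots.txt of either site, and `build_parser` / `is_allowed` are hypothetical helper names. For a live site one would instead point the parser at the real file with `set_url(...)` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only (not copied from any real site).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(EXAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.org/news/"))
print(is_allowed(parser, "https://example.org/private/page"))
```

A crawler could call `is_allowed` before each request and skip any URL the rules disallow.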

# Get the dataset through webscraping

In [1]:
import requests
from bs4 import BeautifulSoup

Let's try scraping just this webpage: https://www.rnc.org/ for text

In [2]:
# initialize variables
link = 'https://www.rnc.org/'
text = ''

# open website as BeautifulSoup object
website = requests.get(link)
soup = BeautifulSoup(website.content, 'html.parser')

# let's assume all the text is wrapped around <p> tags.
paragraphElements = soup.find_all('p')
for element in paragraphElements:
  text += (element.text) + ' '


In [3]:
# this now gives us all the text (I think)
text

"RNC.ORG Initially united in 1854 by the promise to abolish slavery, the Republican Party has always stood for freedom, prosperity, and opportunity. Today, as those principles come under attack from the far-left, we are engaged in a national effort to fight for our proven agenda, take our message to every American, grow the party, promote election integrity, and elect Republicans up and down the ballot. The principles of the Republican Party recognize the God-given liberties while promoting opportunity for every American. .\xa0 train vote support connect Republicans believe in liberty, economic prosperity, preserving American values and traditions, and restoring the American dream for every citizen of this great nation. As a party, we support policies that seek to achieve those goals.\xa0 Our platform is centered on stimulating economic growth for all Americans, protecting constitutionally-guaranteed freedoms, ensuring the integrity of our elections, and maintaining our national securi

Now let's scrape the same web page for links

In [4]:
# initialize variables
link = 'https://www.rnc.org/'
links = []

# open website as BeautifulSoup object
website = requests.get(link)
soup = BeautifulSoup(website.content, 'html.parser')

# let's assume all the links are in blocks with href.
for a in soup.find_all('a', href=True):
    links.append(a['href'])

In [5]:
# this now gives us all the links (I think)
links

['',
 'https://gop.com/action-center/',
 'https://secure.winred.com/trump-national-committee-jfc/lp-hf-rnc-org-redirect?utm_medium=web&utm_source=web&utm_campaign=20240715_w000174_rnc-org-homepage_tnc_tnc&exitintent=true',
 'https://gop.com/about-our-party/rnc-leaders/',
 'https://gop.com/about-our-party/rnc-members/',
 'https://prod-static.gop.com/media/RNC2024-Platform.pdf?_gl=1*2k6fhh*_gcl_au*ODAxMDc0MzMyLjE3MjAxODY4MTc.&_ga=2.121209992.1645374160.1721087775-1997897851.1720186817',
 'https://gop.com/rules-and-resolutions/',
 'https://www.youtube.com/channel/UC3o7kbpTUQ5-0WTMIp8sVwA',
 'https://x.com/GOP',
 'https://www.instagram.com/gop/',
 'https://www.facebook.com/GOP',
 ' https://rumble.com/c/GOP',
 'https://x.com/chairmanwhatley',
 'https://www.facebook.com/chairmanwhatley',
 'https://www.instagram.com/ChairmanWhatley/',
 'https://x.com/kc4gop/',
 'https://www.facebook.com/KCCrosbie/',
 '../../contact-us/',
 'https://www.google.com/maps/place/310+First+St+SE,+Washington,+DC+2000

Now let's turn everything we have so far into a function

In [6]:
def web_scrape(link: str):

  # initialize variables
  text = ''
  links = []

  # open website as BeautifulSoup object
  website = requests.get(link)
  soup = BeautifulSoup(website.content, 'html.parser')

  # let's assume all the text is wrapped in <p> tags.
  for element in soup.find_all('p'):
    text += (element.text) + ' '

  # let's assume all the links are in <a> tags with an href attribute.
  for element in soup.find_all('a', href=True):
    links.append(element['href'])

  # Output ordered pair
  return text, links

In [7]:
text, links = web_scrape('https://democrats.org/who-we-are/')

In [8]:
print(text)

Copyright © 2025 DNC Services Corporation All rights reserved. Paid for by the DEMOCRATIC NATIONAL COMMITTEE (202) 863-8000 This communication is not authorized by any candidate or candidate's committee. 
              DEMOCRATIC NATIONAL COMMITTEE
              430 South Capitol Street Southeast
              Washington, DC 20003
           Proudly Powered by WordPress 
      The Democratic National Committee is committed to electing Democrats everywhere – from the school board to the Oval Office. We’re mobilizing voters across the country.
     The DNC’s officers, under the leadership of Chair Ken Martin, come from every walk of life and represent all that is best about the Democratic Party. Learn more about the team doing this work — because when we organize everywhere, we can win anywhere. Copyright © 2025 DNC Services Corporation All rights reserved. Paid for by the DEMOCRATIC NATIONAL COMMITTEE (202) 863-8000 This communication is not authorized by any candidate or candidate's co

In [9]:
links

['https://democrats.org',
 'https://democrats.org/who-we-are/',
 'https://democrats.org/where-we-stand/',
 'https://democrats.org/take-action/',
 'https://democrats.org/vote-2024/',
 'https://store.democrats.org/',
 'https://secure.actblue.com/donate/web-donate',
 'https://democrats.org/es',
 'https://democrats.org',
 'https://democrats.org/privacy-policy/',
 'https://democrats.org/terms-of-service/',
 'https://democrats.org/who-we-are/',
 'https://democrats.org/who-we-are/what-we-do/',
 'https://democrats.org/who-we-are/about-the-democratic-party/',
 'https://democrats.org/who-we-are/leadership-2-2/',
 'https://democrats.org/who-we-are/state-parties/',
 'https://democrats.org/where-we-stand/',
 'https://democrats.org/where-we-stand/party-platform/',
 'https://democrats.org/where-we-stand/issues-2024/',
 'https://democrats.org/volunteer-interest-form/',
 'https://democrats.org/volunteer-interest-form/trainings/',
 'https://democrats.org/volunteer-interest-form/work-with-us/',
 'https:/

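Problem: some pages can't be scraped with the requests + BeautifulSoup approach above (presumably because the content is rendered by JavaScript or the request is rejected), so let's load them in a headless browser with Selenium instead.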

In [11]:
!pip install selenium
!pip install webdriver-manager







Collecting webdriver-manager
  Using cached webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting python-dotenv (from webdriver-manager)
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Using cached webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, webdriver-manager
Successfully installed python-dotenv-1.1.1 webdriver-manager-4.0.2





In [46]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# # Options
# options = Options()
# options.add_experimental_option("detach", True)

# # Set up Chrome WebDriver
# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
# 						  options=options)

# Get page without opening chrome.exe
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
						  options=options)

# Navigate to URL
driver.get("https://gop.com/about-our-party/rnc-leaders/")

# Retrieve the page source
html = driver.page_source

# Parse the HTML with BeautifulSoup
soup1 = BeautifulSoup(html, 'html.parser')

# Navigate to URL
driver.get("https://news.ycombinator.com/")

# Retrieve the page source
html = driver.page_source

# Parse the HTML with BeautifulSoup
soup2 = BeautifulSoup(html, 'html.parser')

# webscrape (this is the soup-based web_scrape defined in In [102] below, which must be run first)
text1, links1 = web_scrape(soup=soup1)
text2, links2 = web_scrape(soup=soup2)

# debug
print(text1)
print(text2)
print(links1)
print(links2)
driver.quit()

The Republican National Committee RNC Leaders Chairman Originally from Watauga County, North Carolina, Michael Whatley serves as the Chairman of the Republican National Committee. Chairman Whatley previously served as the Chairman of the North Carolina Republican Party and the General Counsel for the Republican National Committee. 

Prior to his service with the Republican Party, Whatley served as a federal law clerk, senior official in the George W. Bush Administration and as the Chief of Staff for US Senator Elizabeth Dole, R-NC.  He also served as a Senior Advisor to the Bush-Cheney Campaign, Florida Recount and Transition Teams, as well as the Trump-Pence Campaign and Transition Teams. 

Whatley earned a History degree from the University of North Carolina at Charlotte and Masters in Religion from Wake Forest University before graduating from the University of Notre Dame with both a law degree and Masters in Theology. Co-Chair KC Crosbie was elected as the National Committeewoman f

In [102]:
def web_scrape(soup: BeautifulSoup):

  # initialize variables
  text = ''
  links = []

  # let's assume all the text is wrapped around <p> tags.
  for element in soup.find_all('p'):
    text += (element.text) + ' '
  
  # let's assume some of the text is wrapped around <span> tags.
  for element in soup.find_all('span'):
    text += (element.text) + ' '

  # let's assume some of the text is wrapped around <a> tags.
  for element in soup.find_all('a'):
    text += (element.text) + ' '

  # let's assume all the links are in blocks with href.
  for element in soup.find_all('a', href=True):
      if (".com" in element['href']):
        links.append(element['href'])
      elif (element['href'].startswith('/')):
        links.append("domain" + element['href'])
        
  # Output ordered pair
  return text, links
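
Because the function above walks `<p>`, `<span>`, and `<a>` in separate passes, text inside nested tags (for example an `<a>` inside a `<p>`) ends up in `text` more than once. One way around that is to visit every text node exactly once. The sketch below does this with the standard library's `html.parser` rather than BeautifulSoup, and `VisibleTextParser` and `extract_text` are hypothetical names, not part of the notebook.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect each text node once, skipping <script> and <style> content."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = "<p>Read <a href='/news/'>the <span>news</span></a></p><script>var x = 1;</script>"
print(extract_text(sample))
```

This trades the tag-specific selection for a single pass, so it would also pick up text from tags the notebook ignores (headings, list items, and so on).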

In [103]:
text, links = web_scrape(soup=soup)

In [104]:
links

['domain/about-our-party/',
 'domain/about-our-party/rnc-leaders/',
 'domain/about-our-party/rnc-members/',
 'https://protectthevote.com',
 'domain/news/',
 'domain/action-center/',
 'https://shop.gop.com/?utm_medium=web&utm_source=organic&utm_campaign=20210803_w000016_shopgop_gop_rnc&utm_content=homepage_menu',
 'https://secure.winred.com/rnc/will-you-support-gop-hp?utm_campaign=20250416_w000192_contribute-button?utm_source=web',
 'https://gop.com',
 'https://gop.com',
 'domain/action-center/',
 'https://secure.winred.com/rnc/will-you-support-gop-hp?utm_campaign=20250416_w000192_contribute-button?utm_source=web',
 'https://gop.com',
 'https://twitter.com/chairmanwhatley',
 'https://twitter.com/kc4gop',
 'https://twitter.com/JoeGruters',
 'https://www.youtube.com/channel/UC3o7kbpTUQ5-0WTMIp8sVwA',
 'https://x.com/GOP',
 'https://www.instagram.com/gop/',
 'https://www.facebook.com/GOP',
 ' https://rumble.com/c/GOP',
 'https://x.com/chairmanwhatley',
 'https://www.facebook.com/chairmanwh

In [None]:
link = 'https://gop.com'
token = link.split('https://')[1].split('/')[0]
top_level = token.split('.')[-2]+'.' + token.split('.')[-1]

In [59]:
top_level.split('.')[0]

'gop'

['/about-our-party/',
 '/about-our-party/rnc-leaders/',
 '/about-our-party/rnc-members/',
 'https://protectthevote.com',
 '/news/',
 '/action-center/',
 'https://shop.gop.com/?utm_medium=web&utm_source=organic&utm_campaign=20210803_w000016_shopgop_gop_rnc&utm_content=homepage_menu',
 'https://secure.winred.com/rnc/will-you-support-gop-hp?utm_campaign=20250416_w000192_contribute-button?utm_source=web',
 'https://gop.com',
 'https://gop.com',
 '/action-center/',
 'https://secure.winred.com/rnc/will-you-support-gop-hp?utm_campaign=20250416_w000192_contribute-button?utm_source=web',
 'https://gop.com',
 'javascript:;',
 'https://twitter.com/chairmanwhatley',
 'https://twitter.com/kc4gop',
 'https://twitter.com/JoeGruters',
 './rnc-members',
 'https://www.youtube.com/channel/UC3o7kbpTUQ5-0WTMIp8sVwA',
 'https://x.com/GOP',
 'https://www.instagram.com/gop/',
 'https://www.facebook.com/GOP',
 ' https://rumble.com/c/GOP',
 'https://x.com/chairmanwhatley',
 'https://www.facebook.com/chairmanwha
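
The raw hrefs above mix absolute URLs, root-relative paths ('/news/'), dot-relative paths ('./rnc-members'), an entry with a stray leading space, and a 'javascript:;' pseudo-link. A minimal sketch of normalizing them with `urllib.parse` follows. `normalize_href` and `second_level_label` are hypothetical helper names, and the domain helper naively assumes a one-part TLD such as .com or .org.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

def normalize_href(page_url: str, href: str) -> Optional[str]:
    """Resolve `href` against `page_url`; return None for non-http(s) links."""
    absolute = urljoin(page_url, href.strip())
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute

def second_level_label(url: str) -> str:
    """Return e.g. 'gop' for 'https://shop.gop.com/...' (assumes a one-part TLD)."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host

page = "https://gop.com/about-our-party/rnc-leaders/"
for href in ["/news/", "./rnc-members", " https://rumble.com/c/GOP", "javascript:;"]:
    print(repr(href), "->", normalize_href(page, href))
print(second_level_label("https://shop.gop.com/?utm_medium=web"))
```

Resolving against the page URL, rather than prefixing a fixed string, keeps root-relative and dot-relative links pointing at the right place.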

Let's put everything in a function once again

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# Note: This class needs some serious rewriting before general usage. 
# For the purposes of creating this corpus, it'll do...
class web_crawl():
	def __init__(self, allowed_domains: list[str]):
		
		# Set up chrome driver with options
		options = Options()
		options.add_argument("--headless")
		options.add_argument("--disable-gpu")
		driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),
								options=options)
		
		# Initialize variables
		self.driver = driver
		self.links = []
		self.directories = []
		self.result = []
		self.allowed_domains = allowed_domains
	
	def get_soup(self, link: str):

		# Navigate to URL
		self.driver.get(link)

		# Retrieve the page source
		html = self.driver.page_source

		# Parse the HTML with BeautifulSoup
		return BeautifulSoup(html, 'html.parser')

	def web_scrape(self, link: str): # User: Make sure link is in list of allowed domains
		
		# initialize variables
		text = ''
		links = []
		soup = self.get_soup(link)
		
		# let's assume all the text is wrapped around <p> tags.
		for element in soup.find_all('p'):
			text += (element.text) + ' '
		
		# let's assume some of the text is wrapped around <span> tags.
		for element in soup.find_all('span'):
			text += (element.text) + ' '

		# let's assume some of the text is wrapped around <a> tags.
		for element in soup.find_all('a'):
			text += (element.text) + ' '

		# let's assume all the links are in <a> tags with an href attribute.
		for element in soup.find_all('a', href=True):
			href = element['href']
			if (self.check_domain(href)) and (href not in self.links):
				links.append(href)
			# NOTE: root-relative hrefs are appended to the current page URL as-is, which
			# produces malformed URLs on nested pages (see the crawl log below);
			# urllib.parse.urljoin(link, href) would resolve them against the site root.
			if (href.startswith('/')) and (href not in self.directories):
				links.append(link + href)
				self.directories.append(href)

		# Hard-code GOP news navigation
		if (link == 'https://gop.com/press-releases/'):
			total_pages = 123 # As of 7/6/25
			for i in range(1, total_pages+1):
				links.append(link+"?page="+str(i))
				self.directories.append("?page="+str(i))

		# Output ordered pair
		return text, links
		
	def dfs(self, link: str): # User: Make sure link is in list of allowed domains
		
		# Web scrape
		text, links = self.web_scrape(link=link)
		if {link: text} not in self.result: # Skip pairs already stored; result holds {link: text} dicts
			self.result.append({link : text})
			print(f"Scraped {link}")
		
		# Mark link as explored
		if link not in self.links:
			self.links.append(link)
		
		# Visit neighbors
		for neighbor in links:
			if neighbor not in self.links:
				self.dfs(neighbor)

	def check_link(self, link: str):
		
		# Check for unallowed links
		unallowed_links = ['https://.com', 'https://.org']
		if (link in unallowed_links):
			return False 
		
		# Check for mailing links
		if ("mailto:" in link):
			return False
		
		# Check if it's a link
		return ("https://" in link and (".com" in link or ".org" in link))

	def check_domain(self, link: str):
		
		# Check if it's even a link
		if (not self.check_link(link)):
			return False
		
		# Get the second-level domain label, e.g. 'gop' from 'shop.gop.com'
		token = link.split('https://')[1].split('/')[0]
		domain = token.split('.')[-2]
		
		# Check for allowed domains
		return (domain in self.allowed_domains)

	
	def __del__(self):
		
		# Close the driver
		self.driver.quit()
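
`dfs` recurses once per page, so a long chain of links can run into Python's recursion limit. Below is a sketch of the same depth-first traversal with an explicit stack and a dictionary of visited pages. `fetch` is a hypothetical stand-in for `web_scrape` (it returns the page text and its outgoing links), the `max_pages` cap is an assumption added here, and the toy link graph is invented data.

```python
from typing import Callable, Dict, List, Optional, Tuple

def crawl_iterative(
    start: str,
    fetch: Callable[[str], Tuple[str, List[str]]],
    max_pages: Optional[int] = None,
) -> Dict[str, str]:
    """Depth-first crawl from `start`; returns {url: text} in visit order."""
    results: Dict[str, str] = {}
    stack = [start]
    while stack:
        url = stack.pop()
        if url in results:
            continue  # already scraped
        if max_pages is not None and len(results) >= max_pages:
            break
        text, links = fetch(url)
        results[url] = text
        # Push in reverse so the first link on the page is visited first.
        for neighbor in reversed(links):
            if neighbor not in results:
                stack.append(neighbor)
    return results

# Toy link graph standing in for real pages (hypothetical data).
SITE = {
    "a": ("home", ["b", "c"]),
    "b": ("about", ["a", "d"]),
    "c": ("news", []),
    "d": ("contact", ["c"]),
}
pages = crawl_iterative("a", lambda url: SITE[url])
print(list(pages))
```

Keeping results in a dict keyed by URL also makes the "already scraped?" check a constant-time lookup instead of a scan through a list.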
		

In [5]:
# Running on 'https://gop.com' should take about 2m 6.1s and produce an output of length 168
REPcrawl = web_crawl(allowed_domains=['gop']) # Initializes web crawler with list of allowed domains.
REPcrawl.dfs(link='https://gop.com') # Calls the crawler on a link. Note: Link MUST contain an allowed domain.

Scraped https://gop.com
Scraped https://gop.com/about-our-party/
Scraped https://shop.gop.com/?utm_medium=web&utm_source=organic&utm_campaign=20210803_w000016_shopgop_gop_rnc&utm_content=homepage_menu
Scraped https://shop.gop.com/?utm_medium=web&utm_source=organic&utm_campaign=20210803_w000016_shopgop_gop_rnc&utm_content=homepage_menu/search
Scraped https://shop.gop.com/collections/all-products
Scraped https://shop.gop.com/collections/all-products/collections/all-products?sort_by=best-selling
Scraped https://gop.com/
Scraped https://gop.com/privacy/
Scraped https://gop.com/terms-of-service/
Scraped https://www.gop.com/unsubscribe
Scraped https://gop.com/terms-of-service
Scraped https://shop.gop.com/pages/contribution-rules
Scraped https://shop.gop.com/collections/all-products/collections/all-products?sort_by=title-ascending
Scraped https://shop.gop.com/collections/all-products/collections/all-products?sort_by=title-descending
Scraped https://shop.gop.com/collections/all-products/collec

In [None]:
# Running on 'https://democrats.org' takes forever because of the 896 news pages, so I just stopped it arbitrarily at 5 minutes.
DEMcrawl = web_crawl(allowed_domains=['democrats']) # Initializes web crawler with list of allowed domains.
DEMcrawl.dfs(link='https://democrats.org') # Calls the crawler on a link. Note: Link MUST contain an allowed domain.


Now let's save our variables into a dataset

In [None]:
import pandas as pd

In [None]:
DEMlinks = []
DEMtext = []
REPlinks = []
REPtext = []
for dictionary in DEMcrawl.result:
	DEMlinks.append(list(dictionary.keys())[0])
	DEMtext.append(list(dictionary.values())[0])
for dictionary in REPcrawl.result:
	REPlinks.append(list(dictionary.keys())[0])
	REPtext.append(list(dictionary.values())[0])
DEMdf = pd.DataFrame({
	'links' : DEMlinks,
	'text' : DEMtext
})
REPdf = pd.DataFrame({
	'links' : REPlinks,
	'text' : REPtext
})

In [None]:
DEMdf.to_csv("democrat1.csv")
REPdf.to_csv("republican1.csv")

7/6/25: Ran the whole web crawl and saved the results as 'democrat1.csv' and 'republican1.csv'. 'republican1.csv' reflects the full crawl; 'democrat1.csv', however, stopped at around the 800-something news page. Either way, 'democrat1.csv' is the bigger file, so it kind of works out.

# Preprocessing the data

In [1]:
import pandas as pd
DEMdf = pd.read_csv('democrat1.csv')
REPdf = pd.read_csv('republican1.csv')

The goal is to highlight mostly news articles & party statements (i.e., what each party is about). Things like merchandise or email-list sign-ups aren't really important imo.

In [10]:
DEMdf.head(-5)

Unnamed: 0.1,Unnamed: 0,links,text
0,0,https://democrats.org,Copyright © 2025 DNC Services Corporation All ...
1,1,https://democrats.org/who-we-are/,Copyright © 2025 DNC Services Corporation All ...
2,2,https://democrats.org/where-we-stand/,Copyright © 2025 DNC Services Corporation All ...
3,3,https://democrats.org/take-action/,Copyright © 2025 DNC Services Corporation All ...
4,4,https://democrats.org/vote-2024/,"By texting VOTE to 70888, you are consenting t..."
...,...,...,...
1882,1882,https://store.democrats.org/stickers/?utm_sour...,\nShop All\n \n Apparel \n \n ...
1883,1883,https://store.democrats.org/friendship-bracele...,\nShop All\n \n Apparel \n \n ...
1884,1884,https://store.democrats.org/home-goods/?utm_so...,\nShop All\n \n Apparel \n \n ...
1885,1885,https://store.democrats.org/mugs-and-water-bot...,\nShop All\n \n Apparel \n \n ...


In [11]:
REPdf.head(-5)

Unnamed: 0.1,Unnamed: 0,links,text
0,0,https://gop.com,The Republican National Committee Join the mil...
1,1,https://gop.com/about-our-party/,The Republican National Committee Beginning in...
2,2,https://shop.gop.com/?utm_medium=web&utm_sourc...,100% MADE IN THE USA. NOT CHINA. Text STORE to...
3,3,https://shop.gop.com/?utm_medium=web&utm_sourc...,100% MADE IN THE USA. NOT CHINA. Text STORE to...
4,4,https://shop.gop.com/collections/all-products,100% MADE IN THE USA. NOT CHINA. By providing ...
...,...,...,...
1859,1859,https://gop.com/press-release/rnc-members-cond...,The Republican National Committee WASHINGTON –...
1860,1860,https://gop.com/press-release/rnc-statement-ce...,"The Republican National Committee WASHINGTON, ..."
1861,1861,https://gop.com/press-release/declaracion-del-...,The Republican National Committee Michael What...
1862,1862,https://gop.com/press-release/rnc-statement-on...,"The Republican National Committee WASHINGTON, ..."


For now, we'll have 4 main categories: news, statements, merchandise, and misc.

In [None]:
# Step 1: Automatically annotate by hard-coding key words.
DEMdf['labels'] = ''
REPdf['labels'] = ''
labels = ['news', 'statements', 'merchandise', 'misc']

for index, row in DEMdf.iterrows():
	if '/news/' in row['links']:
		DEMdf.loc[index, 'labels'] = labels[0]
	elif 'store.' in row['links']:
		DEMdf.loc[index, 'labels'] = labels[2]
	else:
		DEMdf.loc[index, 'labels'] = labels[3]
for index, row in REPdf.iterrows():
	if '/press-release/' in row['links']:
		REPdf.loc[index, 'labels'] = labels[0]
	elif '//shop.' in row['links']:
		REPdf.loc[index, 'labels'] = labels[2]
	else:
		REPdf.loc[index, 'labels'] = labels[3]

In [3]:
DEMdf['labels'].value_counts()

labels
news           1811
misc             57
merchandise      24
Name: count, dtype: int64

In [4]:
REPdf['labels'].value_counts()

labels
news           1253
misc            505
merchandise     111
Name: count, dtype: int64
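
The URL rules from Step 1 can also be written as one small function, which makes them easy to test without a DataFrame. This is only a sketch: `label_for_url`, `DEM_RULES`, and `REP_RULES` are hypothetical names, while the substrings and labels are the ones used in the cell above.

```python
# (substring, label) pairs checked in order; the first match wins.
DEM_RULES = [("/news/", "news"), ("store.", "merchandise")]
REP_RULES = [("/press-release/", "news"), ("//shop.", "merchandise")]

def label_for_url(url, rules, default="misc"):
    """Return the label of the first rule whose substring occurs in `url`."""
    for needle, label in rules:
        if needle in url:
            return label
    return default

print(label_for_url("https://store.democrats.org/", DEM_RULES))
print(label_for_url("https://shop.gop.com/collections/all-products", REP_RULES))
print(label_for_url("https://gop.com/press-releases/", REP_RULES))
```

The last call falls through to 'misc', because '/press-release/' does not occur in '/press-releases/'. With pandas, the function could presumably be applied as `REPdf['links'].map(lambda u: label_for_url(u, REP_RULES))`.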

Now we've hopefully labeled all the merch and news links, and labeled everything else as 'misc' temporarily. Now, let's manually annotate the "statements" by iterating through the misc category.

In [None]:
# Step 2: Manually annotate.
for index, row in REPdf.iterrows():
	if row['labels'] == labels[3]:
		user_input = input(f"Link: {row['links']}\t\nStatement? [y]")
		if user_input.lower() == 'y':
			REPdf.loc[index, 'labels'] = labels[1]
			print("Changed!")
		else:
			print("Not changed!")
print("REPdf done.")
for index, row in DEMdf.iterrows():
	if row['labels'] == labels[3]:
		user_input = input(f"Link: {row['links']}\t\nStatement? [y]")
		if user_input.lower() == 'y':
			DEMdf.loc[index, 'labels'] = labels[1]
			print("Changed!")
		else:
			print("Not changed!")
print("DEMdf done.")

# REP Notes: misc = press-releases, research, rapid-response, blog
# need to remove "blog" link: https://www.gop.gov/blog/
# Clean failed scrapes

Changed!      [repeated 10 times]
Not changed!  [repeated 70 times]


In [9]:
# Another hard-coded annotation block
for index, row in DEMdf.iterrows():
	if '/es/' in row['links']:
		DEMdf.loc[index, 'labels'] = 'spanish'
	if row['text'] == 'The requested URL was rejected. Access Forbidden ':
		DEMdf.loc[index, 'labels'] = 'none'
for index, row in REPdf.iterrows():
	if row['links'] == 'https://www.gop.gov/blog/': # Fixing my error
		REPdf.loc[index, 'labels'] = labels[3]
	if 'press-releases' in row['links']:
		REPdf.loc[index, 'labels'] = labels[0]
	if row['text'] == 'The requested URL was rejected. Access Forbidden ':
		REPdf.loc[index, 'labels'] = 'none'

Here are the final annotations: (TODO: Add definitions)
- DEM: news, statements, merchandise, misc, spanish, none
- REP: news, statements, merchandise, misc, none

In [10]:
# Save these dataframes. The naming conventions aren't the best tbh but we'll roll with it.
DEMdf.to_csv("democrat1_labeled.csv")
REPdf.to_csv("republican1_labeled.csv")

TEMP

In [49]:
democrat_path = 'democrat1_labeled.csv'
democrat_df = pd.read_csv(democrat_path)
democrat_df[democrat_df['labels'] == 'statements']

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,links,text,labels
0,0,0,https://democrats.org,Copyright © 2025 DNC Services Corporation All ...,statements
1,1,1,https://democrats.org/who-we-are/,Copyright © 2025 DNC Services Corporation All ...,statements
2,2,2,https://democrats.org/where-we-stand/,Copyright © 2025 DNC Services Corporation All ...,statements
3,3,3,https://democrats.org/take-action/,Copyright © 2025 DNC Services Corporation All ...,statements
4,4,4,https://democrats.org/vote-2024/,"By texting VOTE to 70888, you are consenting t...",statements
5,5,5,https://democrats.org/vote-2024//,"By texting VOTE to 70888, you are consenting t...",statements
6,6,6,https://www.democrats.org/terms-of-service,Copyright © 2025 DNC Services Corporation All ...,statements
10,10,10,https://democrats.org/es,Pagado por el Comité Nacional Demócrata (202) ...,statements
14,14,14,https://democrats.org/privacy-policy/,Copyright © 2025 DNC Services Corporation All ...,statements
15,15,15,https://democrats.org/terms-of-service/,Copyright © 2025 DNC Services Corporation All ...,statements


# Creating Bayesian Classifiers

It looks like the "news" labels and "statements" labels may be useful for us. Let's make classifiers on both.

In [1]:
!pip install nltk






In [50]:
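# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from collections import Counter
from math import log
import nltk

# nltk.sent_tokenize needs the Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') once ('punkt_tab' on newer NLTK releases).
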
# Load dataset 
class BipartisanDataset():
  def __init__(self, democrat_path, republican_path, mode='statements'):
    """
    Parameters:
    - Mode: 'statements', 'news' (Which subset of the website will the model train on.)
    """
    
    # Shape the data into our desired format
    democrat_df = pd.read_csv(democrat_path)
    republican_df = pd.read_csv(republican_path)
    # .copy() so that adding the 'class' column does not trigger SettingWithCopyWarning
    democrat_df = democrat_df[democrat_df['labels'] == mode].copy()
    republican_df = republican_df[republican_df['labels'] == mode].copy()
    democrat_df['class'] = 'democrat'
    republican_df['class'] = 'republican'

    # Split the data into 80/20 train-test.
    # No random_state is set, so the split (and the reported accuracy) varies between runs.
    df_total = pd.concat([democrat_df, republican_df])
    self.train_data, self.test_data = train_test_split(df_total, test_size=0.2)
    self.data = pd.concat([self.train_data, self.test_data])
    print(f"Initialized dataset: {len(self.data)}")

  def __len__(self):
    return len(self.data)

# Naive-Bayes Model
class BipartisanNaiveBayes():
  def __init__(self, dataset: BipartisanDataset, unit='article'):
    """
    Parameters:
    - Dataset: BipartisanDataset class 
    - Unit: 'article', 'sentence' (Which unit will the model predict.)
    """

    # Initialize dataset. NOTE: Must be BipartisanDataset structure
    self.dataset = dataset

    # Initialize prior variables
    self.democrat_fd = Counter()
    self.republican_fd = Counter()
    self.democrat_prior = 0
    self.republican_prior = 0
    democrat_count = 0
    republican_count = 0
    total_count = 0
    for index, row in self.dataset.train_data.iterrows():
      if row['class'] == 'democrat':
        democrat_count += 1
        total_count += 1
        for word in str(row['text']).split():
          self.democrat_fd[word] += 1
      if row['class'] == 'republican':
        republican_count += 1
        total_count += 1
        for word in str(row['text']).split():
          self.republican_fd[word] += 1
    self.democrat_prior = democrat_count / total_count
    self.republican_prior = republican_count / total_count

    # Test the model
    self.republican_total = self.republican_fd.total()
    self.democrat_total = self.democrat_fd.total()
    predictions = []

    # Loops through each row (webpage) in the test data
    for index, row in self.dataset.test_data.iterrows():

      # Handles articles
      if unit=='article':
        predicted_label, republican_score, democrat_score = self.predict(str(row['text']))
        predictions.append(predicted_label)

      # Handles sentences
      elif unit=='sentence':
        sentences = nltk.sent_tokenize(str(row['text']))
        sentence_predictions = []
        for sentence in sentences:
          predicted_label, republican_score, democrat_score = self.predict(sentence)
          sentence_predictions.append(predicted_label)
        predictions.append(sentence_predictions)

    # Compute accuracy of article prediction
    if unit=='article':
      true_prediction = 0
      for prediction, row in zip(predictions, self.dataset.test_data.iterrows()):
        if prediction == row[1]['class']:
          true_prediction += 1
      self.accuracy = (true_prediction / len(self.dataset.test_data)) * 100

    # Compute accuracy of sentence prediction
    elif unit=='sentence':
      true_prediction = 0
      total_sentences = 0
      for prediction, row in zip(predictions, self.dataset.test_data.iterrows()):
        answer = row[1]['class']
        for pred in prediction:
          total_sentences += 1
          if pred == answer:
            true_prediction += 1
      self.accuracy = (true_prediction / total_sentences) * 100

    # Done
    print(f"Initialized Naive-Bayes model with {self.accuracy:.2f} accuracy")

  def predict(self, text):

    # Uses log of prior score as base probability
    democrat_score = log(self.democrat_prior)
    republican_score = log(self.republican_prior)

    # Loops through each word in the text, adding its log-probability to the republican and democrat scores.
    # NOTE: the +1 is only applied to unseen words; seen words use the unsmoothed count/total,
    # so this is not standard add-one smoothing and the word probabilities do not sum to 1.
    for word in text.split():

      # Republican score (add-one fallback for unseen words)
      if self.republican_fd[word] == 0:
        republican_score += log((self.republican_fd[word]+1)/(self.republican_total+len(self.republican_fd)))
      else:
        republican_score += log(self.republican_fd[word]/self.republican_total)

      # Democrat score (add-one fallback for unseen words)
      if self.democrat_fd[word] == 0:
        democrat_score += log((self.democrat_fd[word]+1)/(self.democrat_total+len(self.democrat_fd)))
      else:
        democrat_score += log(self.democrat_fd[word]/self.democrat_total)

    # Choose which label to predict
    predicted_label = 'republican'
    if democrat_score > republican_score:
      predicted_label = 'democrat'
    return predicted_label, republican_score, democrat_score
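
In `predict` above, the +1 is only applied to words with a zero count, so seen and unseen words are scored with different denominators. The textbook add-one (Laplace) estimate is (count + 1) / (total + V) for every word, with V the size of the vocabulary shared by both classes. The sketch below applies it; the function names and toy counts are hypothetical, and this variant has not been run on the notebook's data, so its accuracy may differ from the figures the notebook reports.

```python
from collections import Counter
from math import log

def log_likelihood(words, counts: Counter, vocab_size: int) -> float:
    """Sum of log P(word | class) with add-one smoothing applied to every word."""
    total = sum(counts.values())
    return sum(log((counts[w] + 1) / (total + vocab_size)) for w in words)

def classify(text, class_counts, priors):
    """Return (best_label, {label: log_score}) for whitespace-tokenized `text`."""
    vocab = set()
    for counts in class_counts.values():
        vocab.update(counts)
    words = text.split()
    scores = {
        label: log(priors[label]) + log_likelihood(words, counts, len(vocab))
        for label, counts in class_counts.items()
    }
    return max(scores, key=scores.get), scores

# Toy word counts standing in for the per-party frequency distributions.
toy_counts = {
    "democrat": Counter("raise the minimum wage expand health care".split()),
    "republican": Counter("cut taxes secure the border cut spending".split()),
}
toy_priors = {"democrat": 0.5, "republican": 0.5}
label, scores = classify("raise the wage", toy_counts, toy_priors)
print(label, scores)
```

With a shared vocabulary and the same formula for every word, each class's word probabilities sum to 1, which keeps the two scores comparable.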


Let's test how well this model performs with real data.

In [51]:
# Tests
republican_tests = ["Trump is a good president.", "The government should increase military spending.", "We should oppose illegal immigration.", "In order to balance budget, we should limit tax."]
democrat_tests = ["Trump is a bad president.", "We should increase the minimum wage.", "The United States needs universal health care", "The government needs to create more jobs."]

In [52]:
statements_dataset = BipartisanDataset("bad_dataset/democrat1_labeled.csv", "bad_dataset/republican1_labeled.csv", mode='statements')
news_dataset = BipartisanDataset("bad_dataset/democrat1_labeled.csv", "bad_dataset/republican1_labeled.csv", mode='news')
statement_model = BipartisanNaiveBayes(statements_dataset, unit='sentence')
news_model = BipartisanNaiveBayes(news_dataset, unit='sentence')

Initialized dataset: 35
Initialized dataset: 3189
Initialized Naive-Bayes model with 89.52 accuracy
Initialized Naive-Bayes model with 92.89 accuracy


In [53]:
statement_model.predict("Trump does not want to defund healthcare")

('republican', -50.7310250441707, -53.38620401382196)

In [55]:
statement_model.predict("The government should spend more money on healthcare")

('democrat', -62.69124750688233, -62.572406708362934)

In [56]:
for test in republican_tests:
	print(f"{statement_model.predict(test)[0]}: {test}")

republican: Trump is a good president.
republican: The government should increase military spending.
democrat: We should oppose illegal immigration.
democrat: In order to balance budget, we should limit tax.


In [57]:
for test in democrat_tests:
	print(f"{statement_model.predict(test)[0]}: {test}")

republican: Trump is a bad president.
democrat: We should increase the minimum wage.
democrat: The United States needs universal health care
republican: The government needs to create more jobs.


The statement model is small: its dataset has only 35 webpages (before the 80/20 split). It does poorly on both test lists, getting 2 of 4 sentences right in each, which is no better than chance.

In [58]:
news_model.predict("Trump does not want to defund healthcare")

('republican', -50.64687798769201, -50.86374798427672)

In [59]:
news_model.predict("The government should spend more money on healthcare")

('democrat', -62.84837655331012, -60.512277607443345)

In [61]:
for test in republican_tests:
	print(f"{news_model.predict(test)[0]}: {test}")

democrat: Trump is a good president.
republican: The government should increase military spending.
republican: We should oppose illegal immigration.
democrat: In order to balance budget, we should limit tax.


In [62]:
for test in democrat_tests:
	print(f"{news_model.predict(test)[0]}: {test}")

democrat: Trump is a bad president.
democrat: We should increase the minimum wage.
democrat: The United States needs universal health care
democrat: The government needs to create more jobs.


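So it looks like the news_model is performing better on the Democrat sentences (4 of 4 correct, vs. 2 of 4 for the statement model), but it still gets only 2 of 4 Republican sentences right. With four sentences per class, this is weak evidence either way.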
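
To make comparisons like this less dependent on eyeballing printed output, the two hand-written test lists can be scored with a small helper. This is a sketch that assumes `predict` returns a tuple whose first element is the label, as `BipartisanNaiveBayes.predict` does. `hit_rate` is a hypothetical name, and `keyword_stub` is a toy stand-in so the block runs on its own.

```python
def hit_rate(predict, sentences, expected_label):
    """Fraction of `sentences` for which predict(sentence)[0] == expected_label."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if predict(s)[0] == expected_label)
    return hits / len(sentences)

def keyword_stub(text):
    # Hypothetical toy rule, only here to exercise hit_rate.
    label = "democrat" if "wage" in text or "health" in text else "republican"
    return label, 0.0, 0.0

democrat_examples = [
    "We should increase the minimum wage.",
    "The United States needs universal health care",
]
print(hit_rate(keyword_stub, democrat_examples, "democrat"))
```

In the notebook this would be called as, for example, `hit_rate(news_model.predict, democrat_tests, 'democrat')`.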

# Fine-tuning a model

In [65]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2





In [70]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores("Wow, this is great!")


{'neg': 0.0, 'neu': 0.196, 'pos': 0.804, 'compound': 0.8478}