## Web Scraping Topic of Choice

2025-04-11

This notebook uses Selenium to open a website of the user's choice, locate the site's search bar, search for a topic, and collect the results.

Author: Faline Rezvani

In [None]:
# Importing libraries
from selenium import webdriver # Web testing library
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep # Allow time for webpage to load
import pandas as pd
import re, string # Regular expressions for text cleaning
import csv

In [19]:
# Instantiating the browser driver object
driver = webdriver.Chrome()

# Assigning URL
driver.get("https://www.geeksforgeeks.org/")

# Finding the search bar
# In the browser, right-click the search bar > Inspect
# Right-click the highlighted element > Copy > Copy XPath, then paste it between the triple quotes
search = driver.find_element(By.XPATH, """//*[@id="comp"]/div[2]/div[1]/div[3]/input""")

# Clearing the search bar
#search.clear()
# Assigning keyword to use in search
search.send_keys("Reinforcement Learning")

# Pressing enter
search.send_keys(Keys.ENTER)
sleep(5)

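# Locating search results
# Inspect the results page to find the element that contains the results
articles = driver.find_elements(By.XPATH, """//*[@id="modal"]/a""")
# This pause runs after the results have been located, so it does not give them more time to load
sleep(20)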
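
A fixed `sleep` is either longer than needed or too short when the page loads slowly. Selenium provides `WebDriverWait` for this purpose; the sketch below shows the same polling idea in plain Python so it can be tried without a browser. The helper name `wait_until` and its parameters are illustrative assumptions, not part of the notebook or of Selenium.

```python
import time


def wait_until(fetch, timeout=20.0, poll=0.5):
    """Call fetch() repeatedly until it returns a non-empty result.

    Raises TimeoutError if nothing is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        time.sleep(poll)
```

With the driver from the cell above, it could be called as `articles = wait_until(lambda: driver.find_elements(By.XPATH, '//*[@id="modal"]/a'))`, which returns as soon as at least one result is present.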

First Option: Place Results in CSV File

In [None]:
# Creating csv file in which to save results
# newline='' prevents blank rows between records on Windows
filecsv = open('geeksRLDescriptions.csv', 'w', encoding='utf8', newline='')
csv_columns = ['description']
writer = csv.DictWriter(filecsv, fieldnames=csv_columns)
writer.writeheader()

In [None]:
# Placing results in csv file
# Helpful resource on XPath filtering: https://www.zenrows.com/blog/xpath-web-scraping#filter-html-extract-data
for article in articles:
    # The leading dot makes the XPath relative to the current result element
    description = article.find_element(By.XPATH, """.//div[2]""").text
    writer.writerow({'description': description})

# Closing csv file and ending the browser session
# quit() closes every window and stops the driver process
filecsv.close()
driver.quit()
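
Opening the file in a `with` block closes it even if a row fails to write. The sketch below wraps the CSV steps above in one function; the name `save_descriptions` and its parameters are hypothetical, and it expects the description strings to have been extracted already.

```python
import csv


def save_descriptions(descriptions, path):
    """Write one description per row under a 'description' header."""
    # newline='' lets the csv module control line endings itself
    with open(path, "w", encoding="utf8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["description"])
        writer.writeheader()
        for text in descriptions:
            writer.writerow({"description": text})
```

For example, `save_descriptions([a.find_element(By.XPATH, ".//div[2]").text for a in articles], "geeksRLDescriptions.csv")` would produce the same file as the two cells above.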

Second Option: Place Results in List of Strings

The first option closes the driver, so instantiate the browser and locate the results again before running this cell.

In [17]:
# Placing results in list of strings
# XPath to descriptions //*[@id="modal"]/a[1]/div[2]/div[2]
# Each element's .text holds the title and the description, separated by a newline
description = [article.text for article in articles]

driver.quit()

In [18]:
print(description)

['Reinforcement Learning\nReinforcement Learning (RL) is a branch of machine learning that focuses on how agents can learn to make decisions through trial and error to maximize cumulative rewards....', 'Dynamic Programming or DP\nDynamic Programming is an algorithmic technique with the following properties. It is mainly an optimization over plain recursion. Wherever we see a recursive solution that...', 'Types of Reinforcement Learning\nReinforcement Learning (RL) is a branch of machine learning that focuses on how agents should act in an environment to maximize cumulative rewards. It is inspired by behav...', 'Reinforcement learning from Human Feedback\nReinforcement Learning from Human Feedback (RLHF) is a method in machine learning where human input is utilized to enhance the training of an artificial intelligence (AI)...', 'The Role of Reinforcement Learning in Autonomous Systems\nModern te\xadch advances allow robots to operate inde\xadpendently. Reinforce\xadment learning makes t
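
Each string in the list above has the form title, newline, summary. If the two parts are needed separately, they can be split at the first newline. This is a minimal sketch; `split_result` is a hypothetical helper, and the sample is shortened from the output above.

```python
def split_result(text):
    """Split a result at its first newline into (title, summary)."""
    title, _, summary = text.partition("\n")
    return title.strip(), summary.strip()


sample = "Dynamic Programming or DP\nDynamic Programming is an algorithmic technique"
title, summary = split_result(sample)
print(title)    # Dynamic Programming or DP
print(summary)  # Dynamic Programming is an algorithmic technique
```

A string without a newline comes back as the title with an empty summary.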

In [None]:
# Read csv file from local location (replace the placeholder with the path to the saved csv file)
df = pd.read_csv('YourFilePathHere')

df.head()

In [20]:
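# Creating single string out of list of strings
# The empty separator fuses the last word of one description with the first word of the next
# (for example 'rewardsDynamic' in the tokens printed later); " ".join(description) would keep them apart
description_singlestring = "".join(description)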
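
The effect of the separator on token boundaries can be checked on a short sample. The two strings below are fragments taken from the printed descriptions; the variable names are illustrative.

```python
sample = ["to maximize cumulative rewards....", "Dynamic Programming or DP"]

# Empty separator: the boundary between the two descriptions disappears
fused = "".join(sample).split()
print(fused)   # ['to', 'maximize', 'cumulative', 'rewards....Dynamic', 'Programming', 'or', 'DP']

# Space separator: each description keeps its own first and last word
spaced = " ".join(sample).split()
print(spaced)  # ['to', 'maximize', 'cumulative', 'rewards....', 'Dynamic', 'Programming', 'or', 'DP']
```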

In [22]:
# Pre-processing with regular expressions
description_singlestring = re.sub(r'\d+', '', description_singlestring)  # Remove all digits from the string

In [23]:
description_singlestring = re.sub(r'[^\w\s]', '', description_singlestring) # Remove all punctuation from the string
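
The two substitutions above can be combined into one reusable function with precompiled patterns. This is a sketch under the same rules as the cells above; `clean_text` and the sample sentence are made up for illustration.

```python
import re

_DIGITS = re.compile(r"\d+")
# Matches anything that is not a letter, digit, underscore, or whitespace
_NON_WORD = re.compile(r"[^\w\s]")


def clean_text(text):
    """Remove digits first, then punctuation and other non-word characters."""
    text = _DIGITS.sub("", text)
    return _NON_WORD.sub("", text)


cleaned = clean_text("Modern te\xadch advances (RLHF) in 2025...")
print(cleaned.split())  # ['Modern', 'tech', 'advances', 'RLHF', 'in']
```

The second pattern also strips the soft hyphens (`\xad`) that appear in some of the scraped descriptions, because a soft hyphen is neither a word character nor whitespace.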

In [25]:
# The most basic tokenization
description_tokens = description_singlestring.split()

In [26]:
print(description_tokens)

['Reinforcement', 'Learning', 'Reinforcement', 'Learning', 'RL', 'is', 'a', 'branch', 'of', 'machine', 'learning', 'that', 'focuses', 'on', 'how', 'agents', 'can', 'learn', 'to', 'make', 'decisions', 'through', 'trial', 'and', 'error', 'to', 'maximize', 'cumulative', 'rewardsDynamic', 'Programming', 'or', 'DP', 'Dynamic', 'Programming', 'is', 'an', 'algorithmic', 'technique', 'with', 'the', 'following', 'properties', 'It', 'is', 'mainly', 'an', 'optimization', 'over', 'plain', 'recursion', 'Wherever', 'we', 'see', 'a', 'recursive', 'solution', 'thatTypes', 'of', 'Reinforcement', 'Learning', 'Reinforcement', 'Learning', 'RL', 'is', 'a', 'branch', 'of', 'machine', 'learning', 'that', 'focuses', 'on', 'how', 'agents', 'should', 'act', 'in', 'an', 'environment', 'to', 'maximize', 'cumulative', 'rewards', 'It', 'is', 'inspired', 'by', 'behavReinforcement', 'learning', 'from', 'Human', 'Feedback', 'Reinforcement', 'Learning', 'from', 'Human', 'Feedback', 'RLHF', 'is', 'a', 'method', 'in', 'm