# Part 1: Web Scraping

Web scraping is as much an art as a technique: each website has to be studied first, and the scraper is then built around the structure and behaviour of that particular site.

The most common Python tools for web scraping are listed below.

1. requests https://requests.readthedocs.io/en/latest/
2. beautiful soup https://beautiful-soup-4.readthedocs.io/en/latest/
3. Selenium https://selenium-python.readthedocs.io/
4. Scrapy https://docs.scrapy.org/en/latest/

We will work with the first three; the fourth, Scrapy, is left for the homework.

We will be scraping 4 websites today:

1. GeeksforGeeks
2. MarketWatch
3. CNBC
4. Hoopshype

Scraping a dynamic website calls for different techniques than scraping a static one; these are discussed in the coming sections.

Some websites expose open APIs, which can be used to fetch the data directly without scraping the HTML or XML pages.

In [1]:
# installing the libraries
!pip install requests
!pip install bs4
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 233 kB in 5s (48.4 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state informat

In [2]:
# importing the libraries
import requests
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import json
from google.colab import drive
import sys

In [3]:
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [4]:
# getting the first URL
# open the URL in parallel in other tab to check the information we are extracting
url = "https://www.geeksforgeeks.org/python-programming-language/"

In [5]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [6]:
# creating a soup object from the returned html page
sp = soup(res.text, "lxml")
sp

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#"><head><meta charset="utf-8"/><meta content="Data Structures,Algorithms,Python,Java,C,C++,JavaScript,Android Development,SQL,Data Science,Machine Learning,PHP,Web Development,System Design,Tutorial,Technical Blogs,Interview Experience,Interview Preparation,Programming,Competitive Programming,SDE Sheet,Job-a-thon,Coding Contests,GATE CSE,HTML,CSS,React,NodeJS,Placement,Aptitude,Quiz,Computer Science,Programming Examples,GeeksforGeeks Courses,Puzzles" name="keywords"/><meta content="width=device-width,initial-scale=1,minimum-scale=.5,maximum-scale=3" name="viewport"/><link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/><link href="https://fonts.googleapis.com" rel="preconnect"/><link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/><meta content="#308D46" name="theme-color"/><meta content="https://media.geeksforgeeks.org/wp-content/cdn-

In [7]:
# printing it in readable format
print(sp.prettify())

<!DOCTYPE html>
<html lang="en-us" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="Data Structures,Algorithms,Python,Java,C,C++,JavaScript,Android Development,SQL,Data Science,Machine Learning,PHP,Web Development,System Design,Tutorial,Technical Blogs,Interview Experience,Interview Preparation,Programming,Competitive Programming,SDE Sheet,Job-a-thon,Coding Contests,GATE CSE,HTML,CSS,React,NodeJS,Placement,Aptitude,Quiz,Computer Science,Programming Examples,GeeksforGeeks Courses,Puzzles" name="keywords"/>
  <meta content="width=device-width,initial-scale=1,minimum-scale=.5,maximum-scale=3" name="viewport"/>
  <link href="https://media.geeksforgeeks.org/wp-content/cdn-uploads/gfg_favicon.png" rel="shortcut icon" type="image/x-icon"/>
  <link href="https://fonts.googleapis.com" rel="preconnect"/>
  <link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
  <meta content="#308D46" name="theme-color"/>
  <meta content="https://media.geeksfo

In [8]:
# parsing title element of the page
print(sp.title)
print(sp.title.name)
print(sp.title.string)
print(sp.title.parent.name)

<title>Python Tutorial | Learn Python Programming</title>
title
Python Tutorial | Learn Python Programming
head


In [9]:
# extracting the title of the article
print(sp.find("h1", {"class" : "entry-title"}).text)

Python Tutorial


In [10]:
# extracting the date of the article
print(sp.find("div", {"class" : "meta"}).text)

Last Updated :
27 Sep, 2023


In [11]:
# extracting the content of the article
# this extracts everything together; a later cell shows how to extract the information paragraph by paragraph
print(sp.find("div", {"class" : "page_content"}).text)

This Python Tutorial is very well suited for Beginners, and also for experienced programmers with other programming languages like C++ and Java. This specially designed Python tutorial will help you learn Python Programming Language in the most efficient way, with topics from basics to advanced (like Web-scraping, Django, Deep-Learning, etc.) with examples.What is Python?Python is a high-level, general-purpose, and very popular programming language. Python programming language (latest Python 3) is being used in web development, Machine Learning applications, along with all cutting-edge technology in Software Industry.Python language is being used by almost all tech-giant companies like – Google, Amazon, Facebook, Instagram, Dropbox, Uber… etc.The biggest strength of Python is huge collection of standard library which can be used for the following:Machine LearningGUI Applications (like Kivy, Tkinter, PyQt etc. )Web frameworks like Django (used by YouTube, Instagram, Dropbox)Image proces

In [12]:
# extracting the links found at the bottom of the article for further reading
for tag in sp.find("div", {"class":"Basics"}).findAll("a", href = True): print(tag.text, "\n", tag["href"], "\n\n")

Python language introduction 
 https://www.geeksforgeeks.org/python-language-introduction/ 


Python 3 basics 
 https://www.geeksforgeeks.org/python-3-basics/ 


Python The new generation language 
 https://www.geeksforgeeks.org/python-the-new-generation-language/ 


Important difference between python 2.x and python 3.x with example 
 https://www.geeksforgeeks.org/important-differences-between-python-2-x-and-python-3-x-with-examples/ 


Keywords in Python | Set 1 
 https://www.geeksforgeeks.org/keywords-python-set-1/ 


Set 2 
 https://www.geeksforgeeks.org/keywords-python-set-2/ 


Namespaces and Scope in Python 
 https://www.geeksforgeeks.org/namespaces-and-scope-in-python/ 


Statement, Indentation and Comment in Python 
 https://www.geeksforgeeks.org/statement-indentation-and-comment-in-python/ 


Structuring Python Programs 
 https://www.geeksforgeeks.org/structuring-python-programs/ 


How to check if a string is a valid keyword in Python? 
 https://www.geeksforgeeks.org/check-s

In [13]:
# extracting information paragraph by paragraph
for tag in sp.find("div", {"class":"page_content"}).findAll("p"): print(tag.text)

This Python Tutorial is very well suited for Beginners, and also for experienced programmers with other programming languages like C++ and Java. This specially designed Python tutorial will help you learn Python Programming Language in the most efficient way, with topics from basics to advanced (like Web-scraping, Django, Deep-Learning, etc.) with examples.


Python is a high-level, general-purpose, and very popular programming language. Python programming language (latest Python 3) is being used in web development, Machine Learning applications, along with all cutting-edge technology in Software Industry.
Python language is being used by almost all tech-giant companies like – Google, Amazon, Facebook, Instagram, Dropbox, Uber… etc.
The biggest strength of Python is huge collection of standard library which can be used for the following:
Python is currently the most widely used multi-purpose, high-level programming language, which allows programming in Object-Oriented and Procedural pa

Almost all GeeksforGeeks articles share the same format, so all of them can be scraped with the same code. This repeatability is what makes web scraping useful: a single block of code can collect a large amount of information in a structured way.

In [14]:
# getting the second url
# again open the url in parallel to track the extracted information
url = "https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963?mod=home-page"

In [15]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [16]:
# creating a soup object from the returned html page
sp = soup(res.text, "html.parser")
sp

<!DOCTYPE html>

<html data-env="prod" data-site="marketwatch" lang="en-US">
<head>
<title>The next financial crisis may already be brewing, but not where many expect - MarketWatch</title>
<link href="https://www.marketwatch.com/story/the-next-financial-crisis-may-already-be-brewing-but-not-where-investors-might-expect-11663170963" rel="canonical"/>
<meta content="The next financial crisis may already be brewing --- but not where investors might expect" property="og:title"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon.png" rel="apple-touch-icon"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-167x167.png" rel="apple-touch-icon" sizes="167x167"/>
<link href="https://mw4.wsj.net/mw5/cont

In [17]:
# getting the title
print(sp.find("h1", {"class" : "article__headline"}).text)


  The next financial crisis may already be brewing — but not where investors might expect



In [18]:
# getting the time
print(sp.find("time", {"class" : "timestamp--pub"}).text)


  First Published: Sept. 14, 2022 at 11:56 a.m. ET



Other information can be extracted with the same methods used for GeeksforGeeks; this is left for the homework.

As with GeeksforGeeks, all MarketWatch articles share a similar structure, so the same code can be used to scrape every article on the website.
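
The headline and timestamp printed above come back padded with newlines and spaces, because `.text` returns the element's text exactly as it sits in the HTML. A minimal sketch of a cleanup helper follows; `clean_text` is a hypothetical name, not part of Beautiful Soup, and the sample string is copied from the timestamp output above.

```python
def clean_text(raw):
    """Collapse every run of whitespace into one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(raw.split())


sample = "\n  First Published: Sept. 14, 2022 at 11:56 a.m. ET\n"
print(clean_text(sample))
```

Applying the helper right after each `.text` call keeps the extracted fields ready for storage or comparison.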

Next, we work with a dynamic website that has an open API.

Open the CNBC website and search for any topic. If we search for SPORTS the URL looks like this: https://www.cnbc.com/search/?query=SPORTS&qsearchterm=SPORTS and if we search for POLITICS the URL looks like this: https://www.cnbc.com/search/?query=POLITICS&qsearchterm=POLITICS

Here we can observe a pattern and predict what the URL would look like for any other search term. This makes the code reusable and repeatable.
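
The pattern in the two search URLs can be sketched as a small function. This is an illustration only: `build_search_url` and `SEARCH_PAGE` are hypothetical names, and the only thing taken from the site is the observed `query`/`qsearchterm` pair.

```python
from urllib.parse import urlencode

SEARCH_PAGE = "https://www.cnbc.com/search/"


def build_search_url(term):
    """Return the CNBC search URL for a term, following the observed pattern."""
    # the same term is sent twice, once as query and once as qsearchterm
    return SEARCH_PAGE + "?" + urlencode({"query": term, "qsearchterm": term})


print(build_search_url("SPORTS"))
print(build_search_url("POLITICS"))
```

`urlencode` also escapes characters that are not safe in a URL, which matters for search terms that contain spaces or symbols.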

There are more than 50,000 results for the topic POLITICS, but only 10 are loaded at first. When we scroll down, the next 10 are loaded, and so on. Each time, the page calls an API to load the next batch of results; the link to that API can be found in the network section of the browser's inspect element tool: https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

By changing the endindex and batchsize parameters we can request different slices of the results.
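
A minimal sketch of stepping through the results by varying these two parameters. It assumes that `endindex` is the offset at which each batch starts (suggested by the scroll behaviour, but not documented here); `API_TEMPLATE` and `batch_urls` are hypothetical names, and the rest of the URL is copied unchanged from the link above.

```python
API_TEMPLATE = (
    "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3"
    "&query={query}&endindex={endindex}&batchsize={batchsize}"
    "&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats"
    "&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C"
    "&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"
)


def batch_urls(query, batches, batchsize=10):
    """Build one API URL per batch, assuming endindex is the offset of each batch."""
    return [
        API_TEMPLATE.format(query=query, endindex=i * batchsize, batchsize=batchsize)
        for i in range(batches)
    ]


for batch_url in batch_urls("POLITICS", batches=3):
    print(batch_url)
```

Each URL can then be passed to `requests.get` in a loop, ideally with a short pause between calls so the API is not flooded.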

The response of this API call is JSON, so when an open API is available there is no need to parse a web page at all.

In [19]:
# getting the third url
url = "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"

In [20]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [21]:
res.text

'{"metadata":{"q":"politics","totalresults":55485,"pagesize":10,"totalpage":5549,"pagerequested":2,"corrections":[],"stems":["politics"],"suggestions":["politics"],"facetsuggestions":[{"facet":"tags:show","suggestions":["Markets and Politics Digital Original Video"]},{"facet":"tags:topic","suggestions":["Politics"]}],"related":[],"resultgenerationtime":"31.254 ms"},"results":[{"description":"California Gov. Gavin Newsom has chosen Laphonza Butler, the president of EMILY\'s List, to fill the seat of the late Democratic Sen. Dianne Feinstein, the governor\'s office confirmed to NBC News.She w","cn:lastPubDate":"2023-10-02T04:47:46+0000","dateModified":"2023-10-02T04:47:46+0000","cn:dateline":"","cn:branding":"cnbc","section":"Politics","cn:type":"partnerstory","author":"Adam Edelman and Amanda Terkel","cn:source":[],"cn:subtype":"","duration":"","summary":"Butler is the head of the group EMILY\'s List, which works to elect Democratic women. She will be the third Black woman in history to

In [22]:
# getting the description of each news article
for description in json.loads(res.text)["results"]: print(description["description"], "\n")

California Gov. Gavin Newsom has chosen Laphonza Butler, the president of EMILY's List, to fill the seat of the late Democratic Sen. Dianne Feinstein, the governor's office confirmed to NBC News.She w 

Tony Ponturo, Ponturo Management CEO, and Americus Reed, Wharton School of Business professor, join 'Power Lunch' to discuss corporate social activism and how companies navigate social issues. 

Italy has a new face in its national politics that's being compared to Rep. Alexandria Ocasio-Cortez, the popular Democrat lawmaker stateside.Elly Schlein was elected as the center-left party Partito  

CNBC's Dan Murphy speaks to Barin Kayaoglu of the American University of Iraq, Sulaimani, in Istanbul for the latest updates following the Turkish election of May 28. 

'Mad Money' host Jim Cramer discusses risk involved with Taiwan-based companies and how geopolitics can impact your stock portfolio. 

Speaker of the House Kevin McCarthy and Senate Minority Leader Mitch McConnell speak after the 
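
Beyond printing the descriptions, the JSON can be turned into structured records. The sketch below works on a shortened sample in the shape of the response shown above (one result, description cut short), so it runs without a network call; `to_records` is a hypothetical helper and the choice of fields is only an example.

```python
import json
from datetime import datetime

# a shortened sample in the shape of the API response (one result only)
sample = '''{"metadata": {"totalresults": 55485, "pagesize": 10},
 "results": [{"description": "California Gov. Gavin Newsom has chosen Laphonza Butler",
              "cn:lastPubDate": "2023-10-02T04:47:46+0000",
              "section": "Politics",
              "author": "Adam Edelman and Amanda Terkel"}]}'''


def to_records(payload):
    """Keep a few fields per article; .get() tolerates results that lack a key."""
    records = []
    for item in json.loads(payload)["results"]:
        published = item.get("cn:lastPubDate")
        records.append({
            "description": item.get("description", ""),
            "section": item.get("section", ""),
            "author": item.get("author", ""),
            # timestamps look like 2023-10-02T04:47:46+0000
            "published": datetime.strptime(published, "%Y-%m-%dT%H:%M:%S%z") if published else None,
        })
    return records


records = to_records(sample)
print(records[0]["section"], records[0]["published"].date())
```

With the real response, `to_records(res.text)` gives a list of dictionaries that `pd.DataFrame(records)` can turn into a table.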

Finally, we work with Selenium.

In [23]:
!sudo add-apt-repository ppa:saiarcot895/chromium-beta
!sudo apt remove chromium-browser
!sudo snap remove chromium
!sudo apt install chromium-browser -qq

!pip3 install selenium --quiet
!apt-get update
!apt install chromium-chromedriver -qq
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


PPA publishes dbgsym, you may need to include 'main/debug' component
Repository: 'deb https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu/ jammy main'
Description:
This PPA contains the latest Chromium Beta builds, with hardware video decoding enabled (hidden behind a flag), and support for Widevine (needed for viewing many DRM-protected videos) enabled.

== Hardware Video Decoding ==

To enable hardware video decoding, start Chromium with the --enable-features=VaapiVideoDecoder argument. To make this persistent, create a file at /etc/chromium-browser/customizations/92-vaapi-hardware-decoding with the following contents:

CHROMIUM_FLAGS="${CHROMIUM_FLAGS} --enable-features=VaapiVideoDecoder"

See also https://wiki.archlinux.org/title/Chromium#Hardware_video_acceleration for more information on VAAPI video decoding support.

=== Widevine Support ===

The packages in this PPA have support for Widevine inside Chromium enabled. However, you still need to copy some files from 

In [24]:
!pip install --upgrade selenium



In [25]:
!pip install selenium
!apt-get update
!apt-get install -y chromium-chromedriver


Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-chromedriver is already the newest versi

In [26]:
import sys
from selenium.webdriver.chrome.service import Service as ChromeService
# chromedriver was installed by apt above; on a local machine, set executable_path below to your own chromedriver
# the browser runs in headless mode, so no Chrome window is opened
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # You can change the log path as needed
)
driver = webdriver.Chrome(service=chrome_service,options=chrome_options)

In [27]:
# this loads the hoopshype salaries page in the headless browser
driver.get('https://hoopshype.com/salaries/players/')

In [28]:
# getting players name list
players = driver.find_elements("xpath", '//td[@class="name"]')

In [29]:
players_list = [p.text for p in players]
players_list

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',


Other information, such as the players' salaries, can be extracted in the same way; this is left for the homework.
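
Every entry in the list above is an empty string. Selenium's `.text` returns only text that is rendered as visible, so one likely cause (an assumption, not verified against the site) is that the name cells are hidden in the headless browser's layout. A minimal sketch of a fallback that reads the `textContent` attribute when `.text` is empty; `cell_texts` is a hypothetical helper that relies only on the `.text` property and the `get_attribute` method of Selenium's `WebElement`.

```python
def cell_texts(elements):
    """Return the text of each element, falling back to textContent when .text is empty."""
    texts = []
    for element in elements:
        visible = element.text.strip()
        if not visible:
            # textContent holds the text of the DOM node even when it is not displayed
            visible = (element.get_attribute("textContent") or "").strip()
        texts.append(visible)
    return texts
```

In the notebook this would be called as `players_list = cell_texts(players)`.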

# Part 2: PPT Scraping

In [30]:
from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
!pip install python-pptx



Here we extract text from a PPT using the python-pptx library. The library is typically used for generating presentations from databases, but some of its features can also be used to extract text from existing presentations. This is a very basic example and can be explored further as needed. Documentation for the library: https://python-pptx.readthedocs.io/en/latest/

In [32]:
# importing library
from pptx import Presentation

In [33]:
# extracting texts slide wise and section wise
# open the PPT in parallel to check the outcome
prs = Presentation("/content/drive/MyDrive/CMPE259/Assignment 3/Web Scraping And Text Extraction/Your big idea.pptx")
for counter_slide, slide in enumerate(prs.slides, start=1):
    print("slide:", counter_slide, "\n")
    counter_content = 1
    for shape in slide.shapes:
        # shapes without a text frame (e.g. pictures) have no text to print
        if not shape.has_text_frame:
            continue
        print("content:", counter_content, shape.text, "\n")
        counter_content += 1
    print("\n\n")

slide: 1 

content: 1 Making Presentations That Stick 

content: 2 A guide by Chip Heath & Dan Heath 




slide: 2 

content: 1 Selling your idea 

content: 2 Created in partnership with Chip and Dan Heath, authors of the bestselling book Made To Stick, this template advises users on how to build and deliver a memorable presentation of a new product, service, or idea. 




slide: 3 

content: 1 1. Intro 

content: 2 Choose one approach to grab the audience’s attention right from the start: unexpected, emotional, or simple.
UnexpectedHighlight what’s new, unusual, or surprising.
EmotionalGive people a reason to care.
SimpleProvide a simple unifying message for what is to come 




slide: 4 

content: 1 How many languages do you need to know to communicate with the rest of the world? 




slide: 5 

content: 1 Just one! Your own.
(With a little help from your smart phone) 




slide: 6 

content: 1 The Google Translate app can repeat anything you say in up to NINETY LANGUAGES from G

Other components of the PPT can be extracted in a similar way by following the documentation as needed.

# Part 3: PDF Scraping


We use the PyPDF2 library: https://pypi.org/project/PyPDF2/
In this demo it is used only to extract text; tables and images require other methods.
Extracting text from PDFs is much more difficult than from web pages or PPTs, because a PDF has no inherent structure in which calling the right elements returns everything. A PDF describes where characters are drawn on the page rather than how the text is organized, so PyPDF2 has to reassemble the text from those drawing instructions. A scanned PDF contains only images of pages, and extracting its text requires optical character recognition (OCR), which PyPDF2 does not perform.

In [34]:
# installing the library
!pip install PyPDF2



In [35]:
# importing the library
import PyPDF2

In [36]:
# reading the file
pdfFileObj = open('/content/drive/MyDrive/CMPE259/Assignment 3/Web Scraping And Text Extraction/Evaluation_of_Sentiment_Analysis_in_Finance_From_Lexicons_to_Transformers.pdf', 'rb')


In [37]:
# passing the file to PyPDF
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [38]:
# getting the number of pages
print(len(pdfReader.pages))

21


In [39]:
# getting the second page (pages are zero-indexed, so index 1 is page 2)
pageObj = pdfReader.pages[1]

In [40]:
# extracting text from that page
print(pageObj.extract_text())

K. Mishev et al.: Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers
decisions. The sentiments expressed in news and tweets inu-
ence stock prices and brand reputation, hence, constant mea-
surement and tracking of these sentiments is becoming one of
the most important activities for investors. Studies have used
sentiment analysis based on nancial news to forecast stock
prices [6][8], foreign exchange and global nancial market
trends [9], [10] as well as to predict corporate earnings [11].
Given that the nancial sector uses its own jargon, it is
not suitable to apply generic sentiment analysis in nance
because many of the words differ from their general meaning.
For example, ``liability'' is generally a negative word, but
in the nancial domain it has a neutral meaning. The term
``share'' usually has a positive meaning, but in the nancial
domain, share represents a nancial asset or a stock, which
is a neutral word. Furthermore, ``bull'' is neutral in gen
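
The extracted text keeps the line breaks of the PDF layout, including words split by a hyphen at the end of a line. A minimal sketch of rejoining them with the standard `re` module; `rejoin_lines` is a hypothetical helper, and the sample is taken from the output above.

```python
import re


def rejoin_lines(page_text):
    """Undo the line layout of extracted PDF text."""
    # a hyphen directly before a line break is treated as a split word
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", page_text)
    # remaining line breaks are layout only, so turn them into spaces
    return re.sub(r"\s*\n\s*", " ", joined).strip()


sample = "hence, constant mea-\nsurement and tracking of these sentiments is becoming one of\nthe most important activities"
print(rejoin_lines(sample))
```

This is a heuristic: a genuinely hyphenated compound that happens to break at a line end loses its hyphen, and characters that were dropped during extraction (for example "nancial" for "financial", probably a lost ligature) cannot be restored this way.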

# Homework

1. Following the GeeksforGeeks example from the demo above, extract the information from MarketWatch articles other than the title and date, which are already demonstrated.

2. Following the demo's example of extracting the names, extract the salary of each player from the Hoopshype website.

3. In addition, choose any 2 websites and extract meaningful, structured information from them.

4. Explore the Scrapy library for web scraping, in addition to the three libraries used in the demo.

5. Pick a website that has tabular data (it can be one of the two selected above) and try to scrape it using the tools studied during the demo.

(The datasets you collect for the projects will come from text extraction, so make sure to extract usable, structured information.)

6. Explore the python-pptx library further and check how to differentiate between text coming from different components such as title, subtitle and paragraphs.

7. Extract a table from a PPT using the same library.

8. Research some more libraries for extracting text from PDFs and show a basic implementation of any one of them.


# Questions 1 to 5

Question 1: Extract the information from marketwatch articles apart from the title and date that is already demonstrated.



Using Requests and Beautiful Soup

In [57]:
url = "https://www.marketwatch.com/story/heres-a-roadmap-for-stocks-to-own-and-avoid-this-earnings-season-d1f468aa?mod=home-page"

In [58]:
response = requests.get(url)
print(response.status_code)

200


In [59]:
response.text



In [60]:
mw = soup(response.text, 'html.parser')
print(mw.prettify())

<!DOCTYPE html>
<html data-env="prod" data-site="marketwatch" lang="en-US">
 <head>
  <title>
   Opinion: Here's a roadmap for stocks to own and avoid this earnings season - MarketWatch
  </title>
  <link href="https://www.marketwatch.com/story/heres-a-roadmap-for-stocks-to-own-and-avoid-this-earnings-season-d1f468aa" rel="canonical"/>
  <meta content="Here's a roadmap for stocks to own and avoid this earnings season" property="og:title"/>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
  <link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon.png" rel="apple-touch-icon"/>
  <link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
  <link href="https://mw4.wsj.net/mw5/content/images/favicons/apple-touch-icon-167x167.png" rel="apple-touch-icon" sizes="167x167"/>
  <link href="https://mw4.wsj.net/mw5/content/images/favicons/appl

In [64]:
print(mw.find("h1", {"class" : "article__headline"}).text)


Opinion: Here’s a roadmap for stocks to own and avoid this earnings season



In [89]:
print(mw.find("time", {"class" : "timestamp--pub"}).text)


  First Published: Oct. 2, 2023 at 7:18 a.m. ET



In [110]:
article_content = ''
article_links = []
paragraphs = mw.find_all("p")
for paragraph in paragraphs:
    article_content += paragraph.get_text() + '\n'
    links = paragraph.find_all("a")
    for link in links:
        href = link.get("href")
        if href:  # Check if the link has an 'href' attribute
            article_links.append(href)

In [111]:
print("Article Content:",article_content.strip())

Article Content: With corporate earnings season just around the corner, Todd Gervasini, a portfolio manager at Wakefield Asset Management, is giving a lot of thought to potential stock-market winners and losers. 
Information technology and energy stocks should beat estimates and outperform, he says — especially Meta Platforms 
        META,
        +2.20%,
       Applied Materials 
        AMAT,
        +0.77%,
       Cadence Design Systems 
        CDNS,
        +0.93%,
       Fortinet 
        FTNT,
        -0.09%
       and Marathon Petroleum 
        MPC,
        -0.87%.
       
But be careful with consumer staples, industrials and communications services stocks like telecom and cable company names, he adds.  
Why should you listen to Gervasini? For starters, the firm he founded has a great investment track record. Since inception in April 1997 through June 2023,  the Wakefield Large-Cap Equity Portfolio gained 11.6% annualized vs. 8.8% for the S&P 500 
        SPX.

Gervasini also

In [112]:
print("Article Links:\n")
for link in article_links:
    print(link)

Article Links:

/investing/stock/META?mod=MW_story_quote
/investing/stock/AMAT?mod=MW_story_quote
/investing/stock/CDNS?mod=MW_story_quote
/investing/stock/FTNT?mod=MW_story_quote
/investing/stock/MPC?mod=MW_story_quote
/investing/index/SPX?mod=MW_story_quote
/investing/stock/ACGL?mod=MW_story_quote
/investing/stock/CAT?mod=MW_story_quote
/investing/stock/HSY?mod=MW_story_quote
/investing/stock/MSFT?mod=MW_story_quote
/investing/stock/GOOGL?mod=MW_story_quote
/investing/stock/NVDA?mod=MW_story_quote
/investing/stock/JPM?mod=MW_story_quote
/investing/stock/AFL?mod=MW_story_quote
https://www.marketwatch.com/story/why-stocks-are-likely-to-be-especially-volatile-this-october-be69ccc3?mod=article_inline
https://www.marketwatch.com/story/anxiety-high-as-stock-market-falls-bond-yields-rise-what-investors-need-to-know-after-s-p-500s-worst-month-of-2023-4196cd0e?mod=article_inline
http://www.uponstocks.com/
http://www.uponstocks.com/
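
Many of the links above are relative (they start with `/`) and one appears twice. A minimal sketch of resolving them against the article URL and dropping repeats, using only the standard library; `absolute_links` is a hypothetical helper and the three hrefs are a small sample of the output above.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.marketwatch.com/story/heres-a-roadmap-for-stocks-to-own-and-avoid-this-earnings-season-d1f468aa?mod=home-page"


def absolute_links(page_url, hrefs):
    """Resolve each href against the page URL and drop repeats, keeping first-seen order."""
    resolved = [urljoin(page_url, href) for href in hrefs]
    return list(dict.fromkeys(resolved))


links = absolute_links(PAGE_URL, [
    "/investing/stock/META?mod=MW_story_quote",
    "http://www.uponstocks.com/",
    "http://www.uponstocks.com/",
])
print(links)
```

In the notebook, `absolute_links(url, article_links)` would apply the same step to the full list.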


Question 2: As discussed in the demo above, extract the salaries of each of the players from the hoopshype website using the example of how to extract the names.

In [116]:
player_salaries = driver.find_elements("xpath", '//td[@class="hh-salaries-sorted"]')
player_salary_list = []

In [117]:
for salary_element in player_salaries:
    salary_text = salary_element.text
    player_salary_list.append(salary_text)

In [118]:
print(player_salary_list)

['2023/24', '$51,915,615', '$47,649,433', '$47,607,350', '$47,607,350', '$47,607,350', '$46,741,590', '$45,640,084', '$45,640,084', '$45,640,084', '$45,640,084', '$45,183,960', '$43,219,440', '$41,000,000', '$40,806,300', '$40,600,080', '$40,064,220', '$40,064,220', '$40,064,220', '$39,270,150', '$37,893,408', '$37,893,408', '$37,037,037', '$36,861,707', '$36,016,200', '$36,016,200', '$36,016,200', '$35,802,469', '$35,640,000', '$34,005,250', '$34,005,250', '$34,005,250', '$33,833,400', '$33,833,400', '$33,386,850', '$33,386,850', '$33,162,030', '$32,600,060', '$32,600,060', '$32,600,060', '$32,459,438', '$31,830,357', '$31,500,000', '$30,800,000', '$30,600,000', '$29,682,540', '$29,320,988', '$28,600,000', '$28,226,880', '$27,955,357', '$27,586,207', '$27,102,202', '$27,000,000', '$26,346,666', '$25,679,348', '$25,568,182', '$25,340,000', '$25,000,000', '$24,360,000', '$24,330,357', '$24,107,143', '$23,883,929', '$23,487,629', '$23,205,221', '$22,627,671', '$22,500,000', '$22,321,429'
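
The scraped values are strings, and the first entry is the column header rather than a salary. A minimal sketch of converting them to integers for analysis; `parse_salaries` is a hypothetical helper and the sample values are the first three entries of the list above.

```python
def parse_salaries(cells):
    """Convert strings like '$51,915,615' to integers, skipping non-salary cells."""
    salaries = []
    for cell in cells:
        digits = cell.replace("$", "").replace(",", "").strip()
        # the column header ('2023/24') and empty cells are not plain digits
        if digits.isdigit():
            salaries.append(int(digits))
    return salaries


print(parse_salaries(["2023/24", "$51,915,615", "$47,649,433"]))
```

Calling `parse_salaries(player_salary_list)` gives numbers that can be sorted, summed or loaded into a pandas column.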

Question 3: Apart from that choose any 2 websites of your choice and extract meaningful and structured information from there.

Wikipedia article of the day: 1919–20 Gillingham F.C. season

In [122]:
wiki = "https://en.wikipedia.org/wiki/1919%E2%80%9320_Gillingham_F.C._season"

In [123]:
response = requests.get(wiki)
print(response.status_code)

200


In [124]:
wiki_soup = soup(response.text, 'html.parser')
print(wiki_soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   1919–20 Gillingham F.C. season - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature

In [128]:
print("Title: ", wiki_soup.find("span", {"class" : "mw-page-title-main"}).text)

Title:  1919–20 Gillingham F.C. season


In [132]:
print("Subheadings")
headings = wiki_soup.find_all("span", {"class" : "mw-headline"})
for heading in headings:
  print(heading.text)

Subheadings
Background and pre-season
Southern League Division One
August–December
January–May
League match details
Partial league table
FA Cup
Cup match details
Players
Aftermath
References
Works cited
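
Class names such as `mw-headline` are tied to a site's current markup and can change without notice. As a hedged alternative, the sketch below selects headings by tag name instead of by class. It assumes `bs4` is installed, and the HTML snippet is a made-up stand-in for a downloaded page, not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration
html = """
<h1>Page title</h1>
<h2><span class="mw-headline">Background and pre-season</span></h2>
<h3>August–December</h3>
<h2>FA Cup</h2>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Collect section headings in document order, with their heading level
demo_headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in demo_soup.find_all(["h2", "h3"])
]
for level, text in demo_headings:
    print(level, text)
```

Keeping the tag name alongside the text preserves the section hierarchy, which a flat list of heading strings loses.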


In [133]:
wiki_paragraphs = wiki_soup.find_all("p")
for paragraph in wiki_paragraphs:
  print(paragraph.text.strip())


During the 1919–20 English football season, Gillingham F.C. played in the Southern League Division One. It was the 22nd season in which the club competed in the Southern League, and the 21st in Division One; prior to the season, the club had been inactive for over four years due to the First World War. George Collins was appointed as the club's new manager, and most of the players were new; the club struggled to find a settled team during the season, fielding nearly 40 players, including six goalkeepers. The team's results included a run of 14 league games, from October to February, without a win.  Gillingham finished bottom of the league table but nonetheless gained entry to the national Football League when it absorbed the entirety of the Southern League Division One.
Gillingham also competed in the FA Cup, requiring three replays to progress from the sixth qualifying round before losing in the first round proper. The team played 47 competitive matches, winning 11, drawing 10, and l

BBC Editor's pick https://www.bbc.com/future/article/20230929-the-seed-guardians-that-are-saving-our-crops

In [140]:
bbc_url = "https://www.bbc.com/future/article/20230929-the-seed-guardians-that-are-saving-our-crops"
response = requests.get(bbc_url)
print(response.status_code)

200


In [141]:
bbc_soup = soup(response.text, 'html.parser')
print(bbc_soup.prettify())

<!DOCTYPE html>
<html class="no-js b-pw-1280 b-reith-sans-font b-reith-serif-font b-reith-serif-loaded b-reith-sans-loaded" lang="en">
 <head>
  <meta content="IE=edge" data-rh="true" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-rh="true"/>
  <meta content="width=device-width, initial-scale=1" data-rh="true" name="viewport"/>
  <meta content="future,ARTICLE,story,food-tag/climatechange-tag/environment-tag/climate-tag/features-tag/nature-outdoors-tag/weather" data-rh="true" name="keywords"/>
  <meta content="From potatoes to quinoa, many of our favourite foods are at risk from threats like climate change and disease. The &quot;seed guardians&quot; of Peru's Potato Park are hoping to change that." data-rh="true" name="description"/>
  <meta content="Kelly Oakes" data-rh="true" name="author"/>
  <meta content="The seed guardians in the Andes trying to save the potato" data-rh="true" property="og:title"/>
  <meta content="article" data-rh="true" property="og:type"/>
  <meta 

In [142]:
print("Title: ", bbc_soup.find("h1").text)

Title:  The seed guardians in the Andes trying to save the potato


In [144]:
print("Authored By: ", bbc_soup.find("a", {"class" : "author-unit__text b-font-family-serif"}).text)

Authored By:  By Kelly Oakes


In [145]:
bbc_paragraphs = bbc_soup.find_all("p")
for paragraph in bbc_paragraphs:
  print(paragraph.text.strip())


What is BBC Future?
Future Planet
Lost Index
Immune Response
Future Now
Health Gap
The Next Giant Leap
Towards Net Zero
Best of BBC Future
Latest
More
The potatoes that grow in the Andes of South America are far more than a starchy staple of the local diet. They are a rich part of the culture.
"There's one really wonderfully beautiful potato, it looks almost like a rose. And the name of that one is 'the-one-that-makes-the-daughter-in-law-cry'," says Tammy Stenner, executive assistant at Asociación Andes, a non-profit organisation in Cusco, Peru, that works to protect biodiversity and indigenous rights in the region. "A potential mother-in-law would ask the young woman who wants to marry her son to peel this potato, but she has to peel it with care, so not wasting the flesh, not ruining the shape."
It is just one of more than 1,300 varieties of potato to be found growing in the mountains of the Andes, somewhere between 3,200m and 5,000m (10,500ft-16,500ft) above sea level. These are no
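
The output above starts with navigation labels ("Future Planet", "Latest", "More") because every `<p>` tag on the page is collected, not only those in the article body. One rough way to drop them is to keep only paragraphs with a minimum number of words. This is a heuristic sketch: the function name `keep_body_paragraphs` and the threshold of 8 words are assumptions chosen for illustration, and short genuine paragraphs would be dropped as well.

```python
def keep_body_paragraphs(paragraphs, min_words=8):
    """Keep paragraphs that look like prose rather than menu labels."""
    return [text for text in paragraphs if len(text.split()) >= min_words]


# Sample strings copied from the printed output
sample = [
    "What is BBC Future?",
    "Future Planet",
    "Latest",
    "More",
    "The potatoes that grow in the Andes of South America are far more "
    "than a starchy staple of the local diet. They are a rich part of the culture.",
]
body = keep_body_paragraphs(sample)
print(len(body), "paragraph(s) kept")
print(body[0])
```

A more targeted option, when the page structure allows it, is to search for paragraphs only inside the element that wraps the article text.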

Question 5: Pick a website that has tabular data (can be one of the two selected above) and try to scrape it using the tools studied during the demo.

We used the Wikipedia page for the 1919–20 Gillingham F.C. season, which contains a table of results.

In [146]:
table = wiki_soup.find('table', class_='wikitable')

In [147]:
data = []
rows = table.find('tbody').find_all('tr')
for row in rows:
    columns = row.find_all(['td', 'th'])
    row_data = [column.get_text(strip=True) for column in columns]
    data.append(row_data)

In [148]:
for row in data:
    print(row)

['Date', 'Opponents', 'Result', 'Goalscorers', 'Attendance']
['30 August 1919', 'Watford(H)', '0–0', '', '7,000']
['1 September 1919', 'Luton Town(A)', '0–2', '', '4,000']
['6 September 1919', 'Swansea Town(A)', '1–0', 'Savage', '2,304']
['10 September 1919', 'Luton Town(H)', '2–0', 'Wood(2, 1 pen.)', 'not recorded']
['13 September 1919', 'Exeter City(H)', '0–0', '', '7,808']
['17 September 1919', 'Southend United(H)', '0–1', '', '6,000']
['20 September 1919', 'Cardiff City(A)', '0–5', '', '10,000']
['27 September 1919', 'Queens Park Rangers(H)', '0–1', '', '7,432']
['4 October 1919', 'Swindon Town(A)', '2–5', 'Chalmers,Wood(pen.)', '5,000']
['6 October 1919', 'Plymouth Argyle(H)', '0–2', '', '2,500']
['11 October 1919', 'Millwall(H)', '2–0', 'Wood(2)', '7,000']
['18 October 1919', 'Brighton & Hove Albion(A)', '0–3', '', '8,000']
['25 October 1919', 'Newport County(H)', '1–3', 'John', '8,000']
['1 November 1919', 'Portsmouth(A)', '0–4', '', '10,000']
['8 November 1919', 'Northampton To
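
The rows above are plain lists of strings. Below is a minimal sketch of turning them into records with typed fields. It assumes the first row is the header, that attendance is either a number with commas or the text 'not recorded', and that the result lists Gillingham's goals first (an assumption based on the goalscorer column matching the first number). The names `rows_to_records` and `Score` are illustrative.

```python
def rows_to_records(rows):
    """Turn a list of rows (header first) into a list of dicts."""
    header, *body = rows
    records = []
    for row in body:
        # Skip rows whose cell count does not match the header
        if len(row) != len(header):
            continue
        record = dict(zip(header, row))
        # Attendance looks like '7,000' or 'not recorded'
        digits = record["Attendance"].replace(",", "")
        record["Attendance"] = int(digits) if digits.isdigit() else None
        # Result looks like '2–5', with an en dash between the two scores
        goals_for, _, goals_against = record["Result"].partition("–")
        if goals_for.isdigit() and goals_against.isdigit():
            record["Score"] = (int(goals_for), int(goals_against))
        else:
            record["Score"] = None
        records.append(record)
    return records


# Sample rows copied from the printed output
sample_rows = [
    ["Date", "Opponents", "Result", "Goalscorers", "Attendance"],
    ["30 August 1919", "Watford(H)", "0–0", "", "7,000"],
    ["10 September 1919", "Luton Town(H)", "2–0", "Wood(2, 1 pen.)", "not recorded"],
]
records = rows_to_records(sample_rows)
for record in records:
    print(record["Date"], record["Score"], record["Attendance"])
```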

Question 4: Also explore the Scrapy library to perform web scraping, apart from the three discussed above in the demo.

Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. We use it in a separate notebook.

In [45]:
!pip install scrapy



In [46]:
import scrapy
from scrapy.http import HtmlResponse

# Questions 6 to 8

Question 6: Explore further the python-pptx library and check how to differentiate between texts coming from different components such as title, subtitle and paragraphs.

In [47]:
# The presentation data is available in the prs object
counter_slide = 1
for slide in prs.slides:
    print("Slide ", counter_slide, "\n")
    # Print title, subtitle and paragraph information for each slide
    for shape in slide.shapes:
        if not shape.has_text_frame:
            continue
        # A shape exposes its text through text_frame; it has no
        # .title or .subtitle_text attribute
        if "Title" in shape.name:
            print("Title:", shape.text_frame.text)
        elif "Subtitle" in shape.name:
            print("Subtitle:", shape.text_frame.text)
        else:
            for paragraph in shape.text_frame.paragraphs:
                print(paragraph.text)
    print("\n")
    counter_slide += 1

Slide  1 

Making Presentations That Stick
A guide by Chip Heath & Dan Heath


Slide  2 

Selling your idea
Created in partnership with Chip and Dan Heath, authors of the bestselling book Made To Stick, this template advises users on how to build and deliver a memorable presentation of a new product, service, or idea.


Slide  3 

1. Intro
Choose one approach to grab the audience’s attention right from the start: unexpected, emotional, or simple.
UnexpectedHighlight what’s new, unusual, or surprising.
EmotionalGive people a reason to care.
SimpleProvide a simple unifying message for what is to come


Slide  4 

How many languages do you need to know to communicate with the rest of the world?


Slide  5 

Just one! Your own.
(With a little help from your smart phone)


Slide  6 

The Google Translate app can repeat anything you say in up to NINETY LANGUAGES from German and Japanese  to Czech and Zulu


Slide  7 

2. Examples
By the end of this section, your audience should be able 

We created our own presentation with structured information to demonstrate how to extract the various components.

In [48]:
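from pptx import Presentation
from pptx.util import Inches
prs = Presentation()
# Layout 0 of the default template is the "Title Slide" layout (title + subtitle)
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)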
shapes = slide.shapes
shapes.title.text = 'Slide with a Table'
subtitle = slide.placeholders[1]
subtitle.text = "Subtititle for slide with a Table"
left = top = width = height = Inches(1)
txBox = slide.shapes.add_textbox(left, top, width, height)
tf = txBox.text_frame
p = tf.add_paragraph()
p.text = 'This is a paragraph in a slide with a Table'
p.space_after = Inches(0.1)

p2 = tf.add_paragraph()
p2.text = "This is the second paragraph."
p2.space_after = Inches(0.1)

rows = cols = 2
left = top = Inches(2.0)
width = Inches(6.0)
height = Inches(0.8)

table = shapes.add_table(rows, cols, left, top, width, height).table

# set column widths
table.columns[0].width = Inches(2.0)
table.columns[1].width = Inches(4.0)

# write column headings
table.cell(0, 0).text = 'Foo'
table.cell(0, 1).text = 'Bar'

# write body cells
table.cell(1, 0).text = 'Baz'
table.cell(1, 1).text = 'Qux'

prs.save('test.pptx')

In [49]:
prs = Presentation('test.pptx')
slide = prs.slides[0]
for shape in slide.shapes:
    if shape.has_text_frame:
        if "Title" in shape.name:
            title_text = shape.text_frame.text
            print("Title:", title_text)
        elif "Subtitle" in shape.name:
            subtitle_text = shape.text_frame.text
            print("Subtitle:", subtitle_text)
        else:
            print("Paragraph: ")
            for paragraph in shape.text_frame.paragraphs:
                paragraph_text = paragraph.text
                print(paragraph_text)



Title: Slide with a Table
Subtitle: Subtititle for slide with a Table
Paragraph: 

This is a paragraph in a slide with a Table
This is the second paragraph.


Question 7: Extract table from a PPT using the same library.

In [50]:
# Function to extract table data from a shape
def extract_table_data(shape):
    if shape.has_table:
        table = shape.table
        rows = []
        for row in table.rows:
            cells = [cell.text for cell in row.cells]
            rows.append(cells)
        return rows
    else:
        return None
# Collect the table data from every slide
table_data = []
for slide in prs.slides:
    for shape in slide.shapes:
        data = extract_table_data(shape)
        if data:
            table_data.append(data)
# Convert each table to a DataFrame; df is overwritten on every pass,
# so it ends up holding the last table found
for data in table_data:
    df = pd.DataFrame(data)

In [51]:
df.head()

     0    1
0  Foo  Bar
1  Baz  Qux
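
In the output above the heading cells ('Foo', 'Bar') end up as the first data row, with 0 and 1 as column names. Below is a minimal sketch of promoting the first row to column names instead. It assumes `pandas` is installed and that the first row of the table is a header. The name `df_with_header` is illustrative.

```python
import pandas as pd

# Rows in the shape returned by extract_table_data for the test.pptx table
rows = [["Foo", "Bar"], ["Baz", "Qux"]]

# Treat the first row as column names and the remaining rows as data
header, *body = rows
df_with_header = pd.DataFrame(body, columns=header)
print(df_with_header)
```

This only makes sense for tables whose first row really is a header. For tables without one, the default integer column names are the safer choice.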


Question 8: Research and find some more libraries to extract text from PDFs and show basic implementation of any one of them.

pdfplumber is an open-source Python package that plumbs a PDF for detailed information about each text character, rectangle, and line. It also supports table extraction and visual debugging.
https://github.com/jsvine/pdfplumber


It works best on machine-generated, rather than scanned, PDFs.


https://medium.com/analytics-vidhya/how-to-easily-extract-text-from-any-pdf-with-python-fc6efd1dedbe

In [52]:
!pip install pdfplumber



In [53]:
import pdfplumber

with pdfplumber.open("/content/drive/MyDrive/CMPE259/Assignment 3/Web Scraping And Text Extraction/Evaluation_of_Sentiment_Analysis_in_Finance_From_Lexicons_to_Transformers.pdf") as pdf:
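    text = ""
    for page in pdf.pages:
        # Extract text from the page; fall back to an empty string because
        # extract_text() may return None for a page without extractable text
        text += page.extract_text() or ""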

print(text)


ReceivedJune13,2020,acceptedJuly1,2020,dateofpublicationJuly16,2020,dateofcurrentversionJuly29,2020.
DigitalObjectIdentifier10.1109/ACCESS.2020.3009626
Evaluation of Sentiment Analysis in Finance:
From Lexicons to Transformers
KOSTADINMISHEV 1,ANAGJORGJEVIKJ 1,IRENAVODENSKA2,LUBOMIRT.CHITKUSHEV2,
ANDDIMITARTRAJANOV 1,(Member,IEEE)
1FacultyofComputerScienceandEngineering,Ss.CyrilandMethodiusUniversity,1000Skopje,NorthMacedonia
2FinancialInformaticsLab,MetropolitanCollege,BostonUniversity,Boston,MA02215,USA
Correspondingauthor:KostadinMishev(kostadin.mishev@finki.ukim.mk)
ThisworkwassupportedinpartbytheFacultyofComputerScienceandEngineering,Ss.CyrilandMethodiusUniversity,Skopje.
ABSTRACT Financial and economic news is continuously monitored by financial market participants.
Accordingtotheefficientmarkethypothesis,allpastinformationisreflectedinstockpricesandnewinfor-
mationisinstantaneouslyabsorbedindeterminingfuturestockprices.Hence,promptextractionofpositive
or negative sentiments from
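
The extracted text keeps the PDF's line breaks, so words hyphenated at the end of a line come out split, as in "newinfor-" followed by "mation" above. Below is a minimal post-processing sketch that rejoins such words using only the standard library. It treats a hyphen followed by a newline and a lowercase letter as a soft line break, which is a heuristic: a genuine hyphenated compound that happens to break at the end of a line would lose its hyphen. The function name `join_hyphenated_lines` is illustrative.

```python
import re


def join_hyphenated_lines(text):
    """Rejoin words that were split across lines with a trailing hyphen."""
    # word character + hyphen + newline + lowercase letter -> joined word
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)


# Sample copied from the extracted output
sample = "newinfor-\nmationisinstantaneouslyabsorbed"
fixed = join_hyphenated_lines(sample)
print(fixed)
```

The missing spaces between words in the output are a separate issue that this helper does not address.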