# Web Scraping Activity
Automating bulk content retrieval from websites

(17th October 2023)


In [1]:
import requests
from bs4 import BeautifulSoup

I want to go through these search-result pages:

https://theprint.in/page/1/?s=israel

https://theprint.in/page/2/?s=israel

...

https://theprint.in/page/1/?s=hamas

https://theprint.in/page/2/?s=hamas

...

So step 1 is to identify the base URL.
Here, it is: https://theprint.in/page/

Next, I identify all the keywords I want to search for, and decide that I want only the first 4 pages of results for each keyword.
I then build the URLs with a nested for loop.

In [2]:
URL = "https://theprint.in/page/"
keyWords = ["gaza", "israel", "palestine", "hamas"]

myURL = []

# This loop is for the page number, from 1 to 4.
for i in range(4):

    # Before appending: https://theprint.in/page/
    newURL = URL + str(i+1) + "/?s="
    # After appending: https://theprint.in/page/1/?s=

    # This loop is for the keywords.
    for word in keyWords:
        newnewURL = newURL + word
        # After appending: https://theprint.in/page/1/?s=myKeyWord
        print(newnewURL)

        # Storing all the URLs in a list.
        myURL.append(newnewURL)

https://theprint.in/page/1/?s=gaza
https://theprint.in/page/1/?s=israel
https://theprint.in/page/1/?s=palestine
https://theprint.in/page/1/?s=hamas
https://theprint.in/page/2/?s=gaza
https://theprint.in/page/2/?s=israel
https://theprint.in/page/2/?s=palestine
https://theprint.in/page/2/?s=hamas
https://theprint.in/page/3/?s=gaza
https://theprint.in/page/3/?s=israel
https://theprint.in/page/3/?s=palestine
https://theprint.in/page/3/?s=hamas
https://theprint.in/page/4/?s=gaza
https://theprint.in/page/4/?s=israel
https://theprint.in/page/4/?s=palestine
https://theprint.in/page/4/?s=hamas
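
The same list of search URLs can also be built more compactly. The sketch below is one possible alternative, not the method used in this notebook: it uses `itertools.product` for the page/keyword combinations and `urllib.parse.quote_plus` so that a keyword containing a space would still give a valid URL. The names `BASE_URL`, `KEYWORDS`, `PAGES`, and `build_search_urls` are illustrative assumptions.

```python
from itertools import product
from urllib.parse import quote_plus

BASE_URL = "https://theprint.in/page/"
KEYWORDS = ["gaza", "israel", "palestine", "hamas"]
PAGES = range(1, 5)  # pages 1 to 4


def build_search_urls(base_url, pages, keywords):
    """Return one search URL per (page, keyword) pair, page by page."""
    return [
        f"{base_url}{page}/?s={quote_plus(keyword)}"
        for page, keyword in product(pages, keywords)
    ]


search_urls = build_search_urls(BASE_URL, PAGES, KEYWORDS)
print(len(search_urls))
print(search_urls[0])
print(search_urls[-1])
```

The order matches the nested loop above: all keywords for page 1, then all keywords for page 2, and so on.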


Please remember: do NOT send more requests to the website than you need. If it receives too many requests in a short time, the website might mistake you for an attacker.

The requests.get() function sends a request to the website. So once you call it, store its return value immediately, so that you can work on the stored data without making additional calls to the website.

In [3]:
# This is a list I am creating to store all the responses.
reqs = []

for url in myURL:
    # Sending a 'get' request to the URL. The website will send back the HTML code of the page.
    # I am then appending the response to my list.
    reqs.append(requests.get(url))
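
One way to follow the advice above is to keep every response in a dictionary keyed by URL, so that no page is requested twice, and to pause briefly between requests. This is only a sketch: `fetch_all`, `delay_seconds`, and the one-second default are assumptions, and the `fetch` argument is meant to be `requests.get` when used in this notebook.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch each distinct URL once and return {url: response}."""
    responses = {}
    for url in urls:
        if url in responses:
            continue  # already downloaded, do not ask the website again
        if responses:
            time.sleep(delay_seconds)  # pause between consecutive requests
        responses[url] = fetch(url)
    return responses


# Demonstration with a stand-in fetch function, so no real request is sent.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"


pages = fetch_all(["a", "b", "a"], fake_fetch, delay_seconds=0)
print(len(calls))  # 2, because the repeated URL is fetched only once
```

In the notebook this would be called as `fetch_all(myURL, requests.get)`. The stored responses can then be reused as often as needed without contacting the website again.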

In [4]:
print(reqs[0].content)



I am not writing the full code. Here, I have only written a sample that processes 1 page. You will have to call it in a for loop to access all the pages.

If you recall, the links in the myURL array point to pages on ThePrint website that do not really have any body content. Each one is basically a webpage containing more links. Our aim now is to go through each of these links and access the articles.

So here, I am calling those links secondary links.

For example:
https://theprint.in/page/1/?s=gaza is a primary link stored in the myURL array.

I need to go to this page, which contains many links.

https://theprint.in/world/unsc-rejects-russian-drafted-resolution-on-israel-hamas-war/1806723/ is a link on that page. I will call this a secondary link and store it in the secondaryLinks array.

My final aim is to get all these secondary links, go to those articles, and get the article content.


In [5]:
# page0 contains the parsed HTML code of the first page.
page0 = BeautifulSoup(reqs[0].content, 'html5lib')

# From the source code of the website, we noticed that the links to the articles are inside elements with the class 'td-module-thumb'. (This varies from website to website.)
links = page0.find_all(class_='td-module-thumb')

secondaryLinks = []

# This loop goes through each element of the list 'links' and gets the exact link of the article.
# URLs are generally found in the HTML <a> tag (called the anchor tag). The anchor tag has an attribute called 'href' which contains the URL.
for link in links:
    href = link.find('a').get('href')
    print(href)
    secondaryLinks.append(href)

https://theprint.in/world/unsc-rejects-russian-drafted-resolution-on-israel-hamas-war/1806723/
https://theprint.in/world/exclusive-senior-us-general-flies-into-israel-as-its-war-with-hamas-deepens/1806714/
https://theprint.in/world/biden-to-visit-israel-as-gaza-war-sparks-humanitarian-crisis/1806710/
https://theprint.in/world/israel-agrees-to-enable-humanitarian-aid-for-civilians-in-gaza-us-secy-blinken/1806694/
https://theprint.in/world/canada-pm-calls-for-immediate-humanitarian-corridor-into-gaza/1806692/
https://theprint.in/world/malaysia-pulls-out-of-frankfurt-book-fair-citing-organisers-pro-israel-stance/1806673/
https://theprint.in/world/hamas-releases-first-video-of-israeli-hostage/1806667/
https://theprint.in/world/israel-hamas-war-biden-to-travel-to-israel-jordan-on-wednesday/1806664/
https://theprint.in/world/us-president-joe-biden-will-visit-israel-tomorrow-says-blinken/1806655/
https://theprint.in/india/putin-speaks-to-netanyahu-about-gaza-conflict-promises-measures-to-prev
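
To process every search page rather than just the first one, the extraction step can be wrapped in a function and called once per stored response. The sketch below is an assumption about how that might look: `extract_article_links` is a hypothetical helper, the sample HTML is made up, and it uses Python's built-in `html.parser` instead of `html5lib` so the example needs no extra parser. It also skips blocks that have no `<a>` tag and drops duplicate links.

```python
from bs4 import BeautifulSoup


def extract_article_links(html, css_class="td-module-thumb"):
    """Return the unique article URLs found in elements with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for block in soup.find_all(class_=css_class):
        anchor = block.find("a")
        href = anchor.get("href") if anchor else None
        if href and href not in found:
            found.append(href)
    return found


# Made-up HTML that imitates the structure of a search-result page.
sample_html = """
<div class="td-module-thumb"><a href="https://example.com/a/">A</a></div>
<div class="td-module-thumb"><a href="https://example.com/a/">A again</a></div>
<div class="td-module-thumb"><span>no link here</span></div>
<div class="td-module-thumb"><a href="https://example.com/b/">B</a></div>
"""

print(extract_article_links(sample_html))
```

With the stored responses, the loop over all pages would be roughly `for r in reqs: secondaryLinks.extend(extract_article_links(r.content))`.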

So now I have the secondary links. Next, I need to visit these secondary links and get the article content.
Below is a demo for one article. (Again, you need to iterate over all the secondary links.)

In [6]:
req = requests.get(secondaryLinks[0])
page = BeautifulSoup(req.content, 'html5lib')

In the source code of the website, we noticed that all the body content is inside `<p>` HTML tags (`<p>` stands for paragraph). So we just search for all `<p>` tags.
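
As a small illustration of what searching for `<p>` tags returns, here is a sketch on a made-up HTML snippet. The helper name `paragraph_texts` and the sample HTML are assumptions, and the built-in `html.parser` is used here instead of `html5lib`. The helper also collapses runs of whitespace and skips empty paragraphs, which the notebook itself does not do.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two real paragraphs (one nested in a <div>) and one empty one.
sample_html = """
<html><body>
  <p>First   paragraph
     of the article.</p>
  <div><p>Second paragraph.</p></div>
  <p></p>
</body></html>
"""


def paragraph_texts(html):
    """Return the text of every non-empty <p> tag, with whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.find_all("p"):
        text = " ".join(p.get_text().split())
        if text:
            texts.append(text)
    return texts


print(paragraph_texts(sample_html))
```

Note that `find_all('p')` finds paragraphs at any depth, which is why it also picks up navigation or footer text on a real page.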

In [13]:
# Find all <p> tags in the HTML code.
bodyContent = page.find_all('p')

# Now, we noticed that the first line of the article and the last 10 lines are not part of the body. They have some other content which we do not need.
# Hence, we remove the first line and the last 10 lines.
# How did we arrive at this? Print the content without removing those lines and you will see.
# Then we used trial and error to see how many lines we need to discard.
# So now, this list called finalArticleBody contains all the lines of bodyContent except the 1st and the last 10 lines.
finalArticleBody = [body.text for body in bodyContent[1:-10]]

# convert list to string
finalArticleBody = " ".join(finalArticleBody)
print(finalArticleBody)

United Nations, Oct 17 (PTI) The UN Security Council has rejected a draft resolution proposed by Russia that called for a humanitarian ceasefire in Gaza but made no mention of Hamas’ attack on Israel, while a vote on a rival Brazilian text will be held on Tuesday. 
  The 15-nation Council met on Monday evening to vote on the Russian-led draft resolution, the first such text that was considered by the powerful UN body, amid an escalating war between Israel and the Palestinian militant group Hamas. 
  The one-page draft resolution failed to garner enough votes and was not adopted by the Council, which is expected to meet again on Tuesday to vote on a rival draft resolution circulated by Brazil, the Council President for the month of October. 
  The draft text that called for an “immediate, durable and fully respected humanitarian ceasefire”, the release of all hostages, aid access and safe evacuation of civilians received five votes in favour from China, Gabon, Mozambique, Russia, and th