# **Web Scraping using BeautifulSoup**




## <font color=purple>*A). Scraping from newspaper "The Asian Age"*</font>

This newspaper is scraped in depth: starting from the home page, the scraper walks through each news category page by page.

I have tried to automate the process as much as possible.

In [1]:
import requests
from bs4 import BeautifulSoup as bs

In [2]:
news_page2 = requests.get("https://www.asianage.com/india")
content2   = news_page2.content
soup2 = bs(content2,'html.parser')
list_soup_children2 = list(soup2.children)

In [3]:
tags = list_soup_children2[2]
pages = tags.select("li li")[:17]

# Getting links of news pages regarding different categories of news

link_category = []
for page in pages:
    link_category.append("https://www.asianage.com"+page.find('a').get('href'))
    
link_category

['https://www.asianage.com/india/politics',
 'https://www.asianage.com/india/crime',
 'https://www.asianage.com/india/all-india',
 'https://www.asianage.com/world/south-asia',
 'https://www.asianage.com/world/asia',
 'https://www.asianage.com/world/middle-east',
 'https://www.asianage.com/world/africa',
 'https://www.asianage.com/world/europe',
 'https://www.asianage.com/world/americas',
 'https://www.asianage.com/world/oceania',
 'https://www.asianage.com/metros/delhi',
 'https://www.asianage.com/metros/kolkata',
 'https://www.asianage.com/metros/mumbai',
 'https://www.asianage.com/metros/in-other-cities',
 'https://www.asianage.com/business/economy',
 'https://www.asianage.com/business/market',
 'https://www.asianage.com/business/companies']
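
The string concatenation above works as long as every `href` is site-relative. As a hedged alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The helper name `build_category_links` and the sample `href` values are assumptions made for illustration; they are not taken from the live site.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.asianage.com"


def build_category_links(hrefs, base_url=BASE_URL):
    """Turn a list of href values into absolute URLs (hypothetical helper)."""
    links = []
    for href in hrefs:
        if not href:
            # Skip 'a' tags that carry no usable href
            continue
        links.append(urljoin(base_url, href))
    return links


# Sample hrefs: one site-relative, one already absolute, one missing
sample_hrefs = ["/india/politics", "https://www.asianage.com/world/asia", None]
demo_links = build_category_links(sample_hrefs)
print(demo_links)
```

In the notebook, `hrefs` could be collected with `[page.find('a').get('href') for page in pages]`.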

In [4]:
page_no = list(range(1, 10))  # Pages to scan in each category (1 to 9).
                              # Some categories have very few pages, so passing a large value might raise an error.
dict_final = {"Title":[], "Content":[]}

for page_cat in link_category:
    for num in page_no:
        new_page = requests.get(page_cat+'?pg={}'.format(num))
        soup_np = bs(new_page.content,'html.parser')
        tags_np = list(soup_np.children)[2]
        articles = tags_np.select('h2.costly a') # articles list with 'a' tag containing their urls
        
        # WORKING WITH SPECIFIC STORY PAGE
    
        for story in articles:
            link_story = 'https://www.asianage.com'+story.get('href')
            story_page = requests.get(link_story)
            soup_story = bs(story_page.content,'html.parser')
            tag_story = list(soup_story.children)[2]
            
            # GETTING TITLE OF THE STORY
            title = tag_story.find('title').get_text()
            dict_final["Title"].append(title)
            
            # GETTING CONTENT OF THE STORY
            paras_list = tag_story.select('div.storyBody p')
            content=''
            for paras in paras_list:
                content += paras.get_text()
            dict_final["Content"].append(content)
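
Since some categories have only a few pages, one possible safeguard is to stop paging as soon as a page returns no article links. The sketch below shows that idea in isolation. The function `collect_article_links` and the stand-in `fake_fetch` are hypothetical names; in the notebook, the fetch function would wrap the `requests.get` and `select('h2.costly a')` calls used above.

```python
from typing import Callable, List


def collect_article_links(fetch_page: Callable[[int], List[str]],
                          max_pages: int = 9) -> List[str]:
    """Collect links page by page, stopping at the first empty page."""
    links: List[str] = []
    for page_number in range(1, max_pages + 1):
        page_links = fetch_page(page_number)
        if not page_links:
            # No articles found: assume the category has run out of pages
            break
        links.extend(page_links)
    return links


def fake_fetch(page_number: int) -> List[str]:
    """Stand-in for a real request: a category with only two pages."""
    fake_category = {1: ["/story-a", "/story-b"], 2: ["/story-c"]}
    return fake_category.get(page_number, [])


demo_story_links = collect_article_links(fake_fetch, max_pages=9)
print(demo_story_links)
```

Passing the fetch function as an argument keeps the stopping logic separate from the network code, so it can be checked without contacting the site.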

In [5]:
import pandas as pd
table = pd.DataFrame(dict_final)      # Creating Data Frame of resulting values
table

Unnamed: 0,Title,Content


## <font color=purple>*B). Scraping from newspaper "International Business Times"*</font>

The code for the newspaper above scrapes a large amount of data in one run. The program below can instead be run daily, and each run adds the latest stories to the existing dictionary. It extracts only the main stories from the front page rather than scraping in depth for older news as the program above does.

### Note:  Do not run the cell below if you want to add new data to the older data.

In [6]:
dict_main = {"Title":[],"Content":[]}
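
When the scraper is run repeatedly, the same front-page story may be collected more than once. A minimal sketch of one way to avoid that is shown below: it skips any title that is already stored. The helper `add_new_stories` is a hypothetical name, and matching on the exact title is an assumption; the original notebook does not deduplicate.

```python
def add_new_stories(store, titles, contents):
    """Append only stories whose title is not yet in the store.

    Returns the number of stories actually added.
    """
    seen_titles = set(store["Title"])
    added = 0
    for title, content in zip(titles, contents):
        if title in seen_titles:
            # Already collected on an earlier run
            continue
        store["Title"].append(title)
        store["Content"].append(content)
        seen_titles.add(title)
        added += 1
    return added


# Example: the second run repeats one story and brings one new story
demo_store = {"Title": [], "Content": []}
first_run = add_new_stories(demo_store, ["Story A", "Story B"], ["Text A", "Text B"])
second_run = add_new_stories(demo_store, ["Story B", "Story C"], ["Text B", "Text C"])
print(first_run, second_run, demo_store["Title"])
```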

In [7]:
news_page = requests.get("https://www.ibtimes.co.in/")
content   = news_page.content
soup = bs(content,'html.parser')
list_soup_children = list(soup.children)
[type(item) for item in list_soup_children]  # Checking for tags only in soup's children

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [8]:
import json
tags = list_soup_children[2]
article_list = tags.select('article h3 a')  # 'a' tags inside 'h3' headings that are themselves inside 'article' tags

url_list = []
for article in article_list:
    url_list.append(article.get('href'))            # Collecting URLs of news stories from the 'href' of the 'a' tags
    dict_main["Title"].append(article.get_text())   # Collecting news headings


for link in url_list:
    get_page = requests.get(link)
    soup = bs(get_page.content,'html.parser')                       # Creating soup for each news page collected from the main page.
    script = soup.find('script', {'type' : 'application/ld+json'})  # Content is inside a 'script' tag of type 'application/ld+json'.
    parsing = json.loads(script.string)                             # 'json.loads' parses the string and returns a dictionary
                                                                    # with the different elements from the 'script' tag on the website.
    dict_main["Content"].append(parsing['articleBody'])             # Content is the value of key 'articleBody' in the dictionary 'parsing'
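
The loop above assumes that every story page has a JSON-LD `script` tag whose parsed value is a dictionary with an `articleBody` key. As a hedged sketch, the helper below handles a missing tag, invalid JSON, or a JSON-LD list by returning `None` instead of raising. The name `article_body_from_json_ld` is hypothetical, and the sample strings are made up for illustration.

```python
import json


def article_body_from_json_ld(raw_json):
    """Return 'articleBody' from a JSON-LD string, or None if unavailable."""
    if not raw_json:
        # Covers a missing 'script' tag (None) or an empty string
        return None
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    # JSON-LD may hold a single object or a list of objects
    candidates = data if isinstance(data, list) else [data]
    for item in candidates:
        if isinstance(item, dict) and "articleBody" in item:
            return item["articleBody"]
    return None


demo_single = article_body_from_json_ld('{"@type": "NewsArticle", "articleBody": "Full text"}')
demo_list = article_body_from_json_ld('[{"@type": "WebSite"}, {"articleBody": "Listed text"}]')
demo_missing = article_body_from_json_ld(None)
demo_broken = article_body_from_json_ld("{not valid json")
print(demo_single, demo_list, demo_missing, demo_broken)
```

In the notebook this could be called as `article_body_from_json_ld(script.string if script else None)`. Appending the returned value even when it is `None` keeps the `Title` and `Content` lists the same length.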

In [9]:
table2 = pd.DataFrame(dict_main)      # Creating Data Frame of resulting values
table2

Unnamed: 0,Title,Content
0,"US Covid-19 deaths cross 600,000, full vaccina...",Six hundred thousand. Americas coronavirus de...
1,Covid vaccination in India without online regi...,Amid the slowing down of the Covid second wav...
2,Covid origin mystery: Live bats locked up in c...,It was in late 2019 that the first case of th...
3,"New imaging technique could help pilots, drive...",The researchers at Shiv Nadar University in G...
4,Galwan clash anniversary brings back painful m...,"Last year on this day, while an exhausted cou..."
5,Pune-based vocalist's music app among Apple De...,", a studio-quality iOS music app created by P..."
6,iPhones getting whole lot better with iOS 15; ...,"Apple has announced iOS 15, a major update wi..."
7,Saliva effective than nasal swabs for detectin...,The mention of COVID-19 testing is likely to ...
8,Hospital rush likely even after Covid as one l...,A huge backlog of surgeries has built up in t...
9,100 Not Out Campaign: Karnataka Congress leade...,Karnataka Pradesh Congress Committee (KPCC) p...


## Combining data from both newspapers

In [10]:
# ADDING BOTH DICTIONARIES 
dict_final["Title"].extend(dict_main["Title"])
dict_final["Content"].extend(dict_main["Content"])

new_dict = {"Title":dict_final["Title"],"Content":dict_final["Content"]} # Combined titles and contents from both newspapers

In [11]:
table_final = pd.DataFrame(new_dict)
table_final

Unnamed: 0,Title,Content
0,"US Covid-19 deaths cross 600,000, full vaccina...",Six hundred thousand. Americas coronavirus de...
1,Covid vaccination in India without online regi...,Amid the slowing down of the Covid second wav...
2,Covid origin mystery: Live bats locked up in c...,It was in late 2019 that the first case of th...
3,"New imaging technique could help pilots, drive...",The researchers at Shiv Nadar University in G...
4,Galwan clash anniversary brings back painful m...,"Last year on this day, while an exhausted cou..."
5,Pune-based vocalist's music app among Apple De...,", a studio-quality iOS music app created by P..."
6,iPhones getting whole lot better with iOS 15; ...,"Apple has announced iOS 15, a major update wi..."
7,Saliva effective than nasal swabs for detectin...,The mention of COVID-19 testing is likely to ...
8,Hospital rush likely even after Covid as one l...,A huge backlog of surgeries has built up in t...
9,100 Not Out Campaign: Karnataka Congress leade...,Karnataka Pradesh Congress Committee (KPCC) p...


In [17]:
from google.colab import files

table_final.to_csv("scraped_data.csv")
files.download('scraped_data.csv')
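
The `google.colab` import above is only available inside Google Colab. For other environments, a minimal sketch using only the standard-library `csv` module is shown below. It writes the `Title` and `Content` columns without an index column. The helper `write_stories_csv` and the sample row are assumptions for illustration.

```python
import csv
import io


def write_stories_csv(store, file_obj):
    """Write the Title/Content dictionary as CSV rows to an open text file."""
    writer = csv.writer(file_obj)
    writer.writerow(["Title", "Content"])
    for title, content in zip(store["Title"], store["Content"]):
        writer.writerow([title, content])


# Demonstration with an in-memory buffer instead of a file on disk
demo_buffer = io.StringIO()
write_stories_csv({"Title": ["Headline, with comma"], "Content": ["Body text"]}, demo_buffer)
demo_csv = demo_buffer.getvalue()
print(demo_csv)
```

To write to disk, the same function can be called inside `with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:`.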
