# Newspaper scraping

## Imposing structure on data

* Web scraping transforms unstructured data on the web, typically in HTML format, into structured data
* The structured data can then be stored and analyzed in a central local database or spreadsheet

![](../images/web2.jpg)

Here we use the UK BBC News website. We searched for "covid 19" in the search box at the top of the page, which leads to https://www.bbc.co.uk/search?q=covid+19&page=1, as shown in the picture.

![](../images/web3.png)



Let us write a Python script that scrapes the news for the keywords we give it.

## Import the packages

The BBC News website is written in HTML. To download and parse that HTML we need the requests and BeautifulSoup packages, so install them first if they are missing.

In [1]:
import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

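If this cell fails with "ModuleNotFoundError: No module named 'pandas'", install pandas (for example with pip install pandas) and run the cell again.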

## Pages to get
After a keyword search, the results are spread over a number of pages.

![](../images/web4.png)

So we initialize a variable that holds the number of result pages to request, and an empty list that collects the search results and is later turned into a DataFrame.



In [5]:
pagesToGet= 1
resultframe=[]

## Get the keyword to search

In [6]:
key=input("Enter the query: ")
key=key.replace(' ','+')
print(key)

Enter the query: covid 19 vaccination
covid+19+vaccination
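Replacing spaces by hand works for simple keywords, but characters such as `&` or `#` would break the URL. A minimal sketch, assuming the same `q` and `page` query parameters seen in the search URL above, lets the standard library do the encoding. The helper name `build_search_url` is hypothetical, and the extra keyword `article` mirrors what the scraping loop in this tutorial appends to the query.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.bbc.co.uk/search"  # search address used in this tutorial


def build_search_url(query, page=1, suffix="article"):
    """Return the search URL for `query`; `suffix` is appended as an extra keyword."""
    terms = f"{query} {suffix}".strip()
    # urlencode turns spaces into '+' and escapes characters such as '&'
    params = urlencode({"q": terms, "page": page})
    return f"{BASE_URL}?{params}"


print(build_search_url("covid 19 vaccination", page=1))
```

Called in a loop over `range(1, pagesToGet + 1)`, the same helper would produce one URL per result page.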


## Extracting the HTML
Next we write the code that pulls the relevant HTML elements out of the page, using the class names found in the BBC News markup.

![](../images/web5.png)

The result is stored in a CSV file named <b>"search_res.csv"</b>.
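The lookup-by-class idea can be tried offline first. The sketch below runs BeautifulSoup on a small hand-written HTML string; the markup is made up for illustration and only loosely imitates the real page. It matches on a fragment of the class name (`PromoHeadline`, `MetadataSnippet`) with a CSS attribute selector, on the assumption that this fragment is more stable than the full class string, and it skips list items that lack a headline instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (hypothetical, not copied from the BBC site)
SAMPLE_HTML = """
<ul>
  <li>
    <p class="css-abc-PromoHeadline x1"><a href="https://example.org/a"> First story </a></p>
    <span class="css-def-MetadataSnippet x2">9 Dec 2020</span>
  </li>
  <li><span class="css-def-MetadataSnippet x2">no headline here</span></li>
</ul>
"""


def extract_rows(html):
    """Return (statement, link, date) tuples for list items that have a headline link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li"):
        anchor = item.select_one('p[class*="PromoHeadline"] a')
        date = item.select_one('span[class*="MetadataSnippet"]')
        if anchor is None:
            continue  # not a result entry, skip it
        rows.append((
            anchor.get_text(strip=True),
            anchor.get("href", "").strip(),
            date.get_text(strip=True) if date else "",
        ))
    return rows


print(extract_rows(SAMPLE_HTML))
```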

In [7]:
filename="search_res.csv"
# Open the file once, before the loop, so later pages do not overwrite earlier ones
with open(filename,"w", encoding = 'utf-8') as f:
    headers="Statement,Link,Date\n"
    f.write(headers)
    for page in range(1,pagesToGet+1):
        print('processing page :', page)
        url = 'https://www.bbc.co.uk/search?q='+key+'+article&page='+str(page)
        print(url)
        response=requests.get(url)  # separate name, so the page counter is not overwritten
        soup=BeautifulSoup(response.text,'html.parser')
        frame=[]
        # The class names below were copied from the page source and may change over time
        links=soup.find('ul',attrs={'class':'css-1lb37cz-Stack e1y4nx260'}).find_all('li')
        print(len(links))

        for j in links:
            # Headline text, article URL and publication date of one search result
            Statement = j.find("div",attrs={'class':'css-l100ew-PromoContentSummary e1f5wbog1'}).find('p',attrs={'class':'css-1uw1j0b-PromoHeadline e1f5wbog2'}).find('a',attrs={'class':'css-vh7bxp-PromoLink e1f5wbog6'}).text.strip()
            Link = j.find("p",attrs={'class':'css-1uw1j0b-PromoHeadline e1f5wbog2'}).find('a')['href'].strip()
            Date = j.find('span',attrs={'class':'css-1hizfh0-MetadataSnippet ecn1o5v0'}).text[8:].strip()  # drop the first 8 characters
            frame.append((Statement,Link,Date))
            # Commas inside a field would break the CSV columns, so replace them with ^
            f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+"\n")
        resultframe.extend(frame)


processing page : 1
https://www.bbc.co.uk/search?q=covid+19+vaccination+article&page=1
2
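Swapping commas for `^` keeps the file parseable, but it alters the headlines. As an alternative sketch (not the approach used in the cell above), Python's `csv` module quotes fields that contain commas, so the text survives a round trip unchanged. The sample row is invented, the helper name `write_rows` is hypothetical, and an in-memory buffer stands in for `search_res.csv`.

```python
import csv
import io


def write_rows(handle, rows):
    """Write a header plus (statement, link, date) rows, quoting where needed."""
    writer = csv.writer(handle)
    writer.writerow(["Statement", "Link", "Date"])
    writer.writerows(rows)


# Invented example row with a comma in the headline
rows = [("Vaccines, tests and masks", "https://example.org/a", "9 Dec 2020")]

# For a real file: open("search_res.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_rows(buffer, rows)

# Read the data back to confirm the comma survived
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1][0])
```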


## Result CSV file

A results page on the site displays 10 links. For each link, the headline matching the keyword, its URL, and its publication date are stored in the CSV file.

In [8]:
data=pd.DataFrame(resultframe, columns=['Statement','Link','Date'])
print(data)

                                           Statement  \
0  Covid: Biden vows 100m vaccinations for US in ...   
1  Coronavirus in South Africa: Two-day-old baby ...   

                                                Link         Date  
0  https://www.bbc.co.uk/news/world-us-canada-552...   9 Dec 2020  
1   https://www.bbc.co.uk/news/world-africa-52752334  21 May 2020  
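The Date column holds plain strings such as `9 Dec 2020`. If the results need to be sorted or filtered by time, the strings can be converted to date objects. This is a minimal sketch that assumes every date follows the day, abbreviated English month, year pattern seen in the output above; other formats would need extra handling. The function name `parse_bbc_date` is hypothetical.

```python
from datetime import datetime


def parse_bbc_date(text):
    """Parse strings like '9 Dec 2020' into a date object."""
    # %b expects English month abbreviations under the default locale
    return datetime.strptime(text.strip(), "%d %b %Y").date()


# The two dates shown in the DataFrame output
dates = ["9 Dec 2020", "21 May 2020"]
parsed = sorted(parse_bbc_date(d) for d in dates)
print(parsed[0].isoformat(), parsed[-1].isoformat())
```

With pandas, `pd.to_datetime(data['Date'], format='%d %b %Y')` performs the same conversion on the whole column.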


## Using the Article package

With the Article class from the newspaper3k package, we can display various properties of a news article, such as its title, summary, and meta description. As input we take a link from the result above, i.e. the data DataFrame.

In [9]:
data['Link']

0    https://www.bbc.co.uk/news/world-us-canada-552...
1     https://www.bbc.co.uk/news/world-africa-52752334
Name: Link, dtype: object

### Installing the package

In [10]:
!pip install newspaper3k



### Importing the package

In [11]:
from newspaper import Article 

In [12]:
url = data['Link'][0]  # for example, take the first result link

### Parsing the article to read its properties easily

In [13]:
res_article = Article(url, language="en")  # "en" for English
res_article.download()  # download the article's HTML
res_article.parse()  # parse the HTML into title, text, and other fields
res_article.nlp()  # run natural language processing (keywords and summary)

### Displaying the title of the news article

In [23]:
res_article.title

'Covid: Biden vows 100m vaccinations for US in first 100 days'

### Displaying the text of the news article

In [24]:
res_article.text

'"My first 100 days won\'t end the Covid-19 virus. I can\'t promise that. But we did not get into this mess quickly. We\'re not going to get out of it quickly," he said at the event in Delaware, giving few details of how the largest vaccination programme in US history would be carried out.'
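Besides `title` and `text`, an `Article` object carries further attributes once `parse()` and `nlp()` have run, including the summary and meta description mentioned earlier. The sketch below gathers a few of them into a dictionary. The helper `article_properties` is hypothetical, and a `SimpleNamespace` with invented values stands in for a downloaded article so the example runs without network access.

```python
from types import SimpleNamespace


def article_properties(article):
    """Collect a few attributes of a parsed newspaper3k Article into a dict."""
    return {
        "title": article.title,
        "authors": article.authors,
        "publish_date": article.publish_date,
        "meta_description": article.meta_description,
        "summary": article.summary,    # populated by nlp()
        "keywords": article.keywords,  # populated by nlp()
    }


# Stand-in object with invented values; pass res_article here in the notebook
fake_article = SimpleNamespace(
    title="Example title", authors=["A. Writer"], publish_date=None,
    meta_description="Example description", summary="Example summary",
    keywords=["example", "news"],
)

for name, value in article_properties(fake_article).items():
    print(f"{name}: {value}")
```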

### Using BeautifulSoup to extract the entire article

In [25]:
url=data['Link'][2]
url

'https://www.bbc.co.uk/news/uk-england-leeds-53394717'

![](../images/web6.png)

In [26]:
page=requests.get(url)
soup=BeautifulSoup(page.text,'html.parser')
# Each block of article text sits in its own rich-text wrapper inside the <article> element
links=soup.find('article',attrs={'class':'css-5h7eao-ArticleWrapper e1nh2i2l0'}).find_all('div',attrs={'class':'css-uf6wea-RichTextComponentWrapper e1xue1i83'})
for i in links:
    # Print the text of one wrapper
    news=i.find('div',attrs={'class':'css-83cqas-RichTextContainer e5tfeyi2'}).text
    print(news)

A bed factory has seen eight workers test positive for coronavirus - the third in a series of outbreaks at similar sites in West Yorkshire.
Highgrove Beds in Liversedge ceased production as a safety precaution with all staff being offered tests.
The outbreak follows cases at Deep Sleep Beds in Ossett and Dura Beds in Batley over the past month.
There have also been cases of coronavirus reported at two meat factories in West Yorkshire.
Rachel Spencer-Henshall, director of public health at Kirklees Council, warned factory workers of the risk of car sharing.
She said: "With the bed factories, it's less about the industry itself and more about working in a factory setting.
"What interests me a lot more is how people get to and from work, because actually you find a lot of people are car-sharing and in those scenarios you're in quite close contact with others for quite a long period of time, dependent on the commute."
Latest news and stories from YorkshireDeep Sleep Beds in Ossett sees four
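The loop above stops with an AttributeError as soon as one wrapper lacks the inner container, because `find` returns `None` in that case. A hedged variant, shown here on invented markup rather than the real BBC page, collects the paragraphs into a single string and skips wrappers that do not match. The class fragments `RichTextComponentWrapper` and `RichTextContainer` come from the cell above; everything else in the sample is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented article markup that reuses the class-name fragments from the cell above
SAMPLE_ARTICLE = """
<article>
  <div class="css-1-RichTextComponentWrapper a"><div class="css-2-RichTextContainer b"><p>First paragraph.</p></div></div>
  <div class="css-1-RichTextComponentWrapper a"></div>
  <div class="css-1-RichTextComponentWrapper a"><div class="css-2-RichTextContainer b"><p>Second paragraph.</p></div></div>
</article>
"""


def article_text(html):
    """Join the text of every rich-text container, ignoring empty wrappers."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for wrapper in soup.select('article div[class*="RichTextComponentWrapper"]'):
        container = wrapper.select_one('div[class*="RichTextContainer"]')
        if container is None:
            continue  # wrapper without text content
        paragraphs.append(container.get_text(" ", strip=True))
    return "\n".join(paragraphs)


print(article_text(SAMPLE_ARTICLE))
```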