## Scraping News Headlines 

**This notebook scrapes news headlines about Apple from several news portals.**

**The scraped data is then loaded into a dataframe, and the sentiment of each headline is analysed using VADER.**

In [2]:
#importing necessary libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests
from datetime import datetime
#The dateutil module provides powerful extensions to the standard datetime module; its parser handles free-form date strings.
from dateutil import parser
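
The two date helpers imported above cover different input shapes: `parser.parse` handles free-form date strings, while `datetime.fromtimestamp` handles numeric Unix timestamps. A minimal sketch follows; the sample values are made up for illustration and are not taken from any of the scraped sites.

```python
from datetime import datetime, timezone
from dateutil import parser

# Free-form date string (hypothetical sample) -> naive datetime
parsed = parser.parse("30 Jul 2020 03:52 AM")

# Millisecond Unix timestamp (hypothetical sample) -> timezone-aware datetime.
# Passing tz explicitly avoids depending on the machine's local timezone.
millis = 1596081600000
from_epoch = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)

print(parsed)
print(from_epoch)
```

Without the `tz` argument, `fromtimestamp` returns a naive datetime in the local timezone, so the same timestamp can print differently on different machines.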

In [3]:
#list where content and date would be appended
articles = []

#function to scrape the economic times website
def scrap_economictimes(url):
    #sending a GET request to the url
    page = requests.get(url)
    #BeautifulSoup is used to parse the HTML document
    soup = BeautifulSoup(page.text, 'lxml')
    divs = soup.find_all(class_ = 'clr flt topicstry')
    for div in divs:
        data = {}
        #finding the h3 tag 
        a = div.find('h3')    
        if a:
            data['content'] = a.get_text().strip()
            #finding the time tag
            t = div.find('time')
            if t:
                d = t.get_text().strip()
                data['date'] = parser.parse(d)            
            articles.append(data)

#function to scrape the forbes website
def scrap_forbes(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    divs = soup.find_all(class_ = "stream-item__text")
    for div in divs:
        data = {}
        a = div.find(class_ = "stream-item__date")
        if a:
            b = div.find(class_="stream-item__description")
            if b:
                x = a.attrs.get('data-date') and int(a.attrs.get('data-date'))//1000
                if x:
                    #datetime.fromtimestamp() returns a datetime object (in local time) for a Unix timestamp given in seconds
                    data['date'] = datetime.fromtimestamp(x)
                data['content'] = b.get_text().strip()
                articles.append(data)

#function to scrape the india tv website
def scrap_indiatv(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    divs = soup.find_all(class_ = "content")
    for div in divs:
        data = {}
        a = div.find(class_ = "deskTime")
        if a:
            d = a.get_text().strip().split("|")
            #the date is expected after the "|" separator; skip it if the separator is missing
            if len(d) > 1:
                data['date'] = parser.parse(d[1])
            b = div.find(class_="dic")
            if b:
                data['content'] = b.get_text().strip()
            articles.append(data)
          
#list of dictionaries, one per site: URL template, site name and scraper function
url = [
    {
        'url':'https://economictimes.indiatimes.com/topic/{}/news', 
        'site':'The Economic Times',
        'scrapper': scrap_economictimes
    }, 
    {
        'url': 'https://www.forbes.com/search/?q={}#2b35cf38279f',
        'site':'Forbes',
        'scrapper': scrap_forbes
    },
    {
        'url':'https://www.indiatvnews.com/topic/{}/news',
        'site':'India TV',
        'scrapper': scrap_indiatv
    }
]

company = 'apple'

for i in url:
    i['scrapper'](i['url'].format(company))
    
# for article in articles:
#     print(article)
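
The scraper functions above can only be exercised against the live sites, whose markup may change. As a hedged sketch, the same parsing pattern can be tried offline on an inline HTML string. The markup in `SAMPLE_HTML` and the helper name `parse_stories` are hypothetical; only the class name `clr flt topicstry` and the `h3`/`time` lookups come from `scrap_economictimes`.

```python
from bs4 import BeautifulSoup
from dateutil import parser

# Hypothetical markup imitating the structure scrap_economictimes() looks for.
SAMPLE_HTML = """
<div class="clr flt topicstry">
  <h3>Example headline about Apple</h3>
  <time>30 Jul 2020 03:52 AM</time>
</div>
<div class="clr flt topicstry">
  <h3>Headline without a date</h3>
</div>
"""

def parse_stories(html):
    """Return a list of {'content', 'date'} dicts, one per story block."""
    # html.parser ships with Python, so this sketch does not need lxml
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.find_all(class_="clr flt topicstry"):
        heading = div.find("h3")
        if heading is None:
            continue  # no headline, nothing to record
        time_tag = div.find("time")
        rows.append({
            "content": heading.get_text().strip(),
            # keep a 'date' key in every row so the dataframe column is uniform
            "date": parser.parse(time_tag.get_text().strip()) if time_tag else None,
        })
    return rows

rows = parse_stories(SAMPLE_HTML)
for row in rows:
    print(row)
```

Returning a list instead of appending to a module-level `articles` list is a design choice for this sketch: it keeps the parsing step separate from the network request, so it can be checked without calling `requests.get`.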



In [4]:
#converting the list into a dataframe
df = pd.DataFrame(articles)
df

| | content | date |
|---|---|---|
| 0 | Facebook's Mark Zuckerberg skewered with damag... | 2020-07-30 03:52:00 |
| 1 | Eye phone or a smart idea for smartphone compa... | 2020-07-30 03:55:00 |
| 2 | Zuckerberg, Bezos, Sundar Pichai and Tim Cook ... | 2020-07-30 03:54:00 |
| 3 | View: How India can use the prevailing geopoli... | 2020-07-29 17:35:00 |
| 4 | Competition in under Rs 10,000 smartphone segm... | 2020-07-29 10:50:00 |
| 5 | OnePlus asks retailers to stop all online sale... | 2020-07-29 12:09:00 |
| 6 | Quarantine requirements keep many buyers away ... | 2020-07-27 04:10:00 |
| 7 | The Mac’s move to ARM is an exciting one for A... | 2020-07-28 04:53:03 |
| 8 | Apple's latest documents confirm a new MacBook... | 2020-07-29 05:22:40 |
| 9 | Will future AirPods supplement regular audio w... | 2020-07-30 04:00:00 |


In [5]:
#importing SentimentIntensityAnalyzer to analyse the sentiment of the above content
#it needs the VADER lexicon; run nltk.download('vader_lexicon') once if it is missing
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

**1. Analysing the sentiment of each headline**

In [6]:
results = []

for line in df['content']:
    pol_score = sia.polarity_scores(line) 
    pol_score['headline'] = line
    results.append(pol_score)

results[0:5]

[{'neg': 0.248,
  'neu': 0.752,
  'pos': 0.0,
  'compound': -0.5106,
  'headline': "Facebook's Mark Zuckerberg skewered with damaging internal emails during antitrust hearing"},
 {'neg': 0.0,
  'neu': 0.828,
  'pos': 0.172,
  'compound': 0.4019,
  'headline': 'Eye phone or a smart idea for smartphone companies, may be the next market disruptor'},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'headline': 'Zuckerberg, Bezos, Sundar Pichai and Tim Cook getting heat from US Cong on competition'},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'headline': 'View: How India can use the prevailing geopolitical situation to reset ties with US'},
 {'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'compound': 0.0,
  'headline': 'Competition in under Rs 10,000 smartphone segment hots up as Covid limits purse strings'}]
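
The `compound` value in each result is a normalised score between -1 and 1. A common convention from the VADER documentation treats scores of 0.05 and above as positive, -0.05 and below as negative, and anything in between as neutral. A minimal sketch of that labelling, where the helper name `label_sentiment` is hypothetical and the scores are the first three `compound` values shown above:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label using the usual cut-offs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# First three compound scores from the results printed above
scores = [-0.5106, 0.4019, 0.0]
labels = [label_sentiment(s) for s in scores]
print(labels)  # ['negative', 'positive', 'neutral']
```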

In [7]:
df['Compound_score'] = pd.DataFrame(results)['compound']

In [8]:
df.head()

| | content | date | Compound_score |
|---|---|---|---|
| 0 | Facebook's Mark Zuckerberg skewered with damag... | 2020-07-30 03:52:00 | -0.5106 |
| 1 | Eye phone or a smart idea for smartphone compa... | 2020-07-30 03:55:00 | 0.4019 |
| 2 | Zuckerberg, Bezos, Sundar Pichai and Tim Cook ... | 2020-07-30 03:54:00 | 0.0 |
| 3 | View: How India can use the prevailing geopoli... | 2020-07-29 17:35:00 | 0.0 |
| 4 | Competition in under Rs 10,000 smartphone segm... | 2020-07-29 10:50:00 | 0.0 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 3 columns):
content           47 non-null object
date              47 non-null datetime64[ns]
Compound_score    47 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 1.2+ KB


**2. Analysing the overall sentiment of the company**

In [10]:
df['Compound_score'].mean()

0.1668744680851064

**The overall sentiment score is about 0.17, which indicates mildly positive content.**
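
A single mean pools every headline regardless of when it was published. To see how sentiment moves over time, the scores can also be averaged per calendar day. The sketch below is one possible approach rather than part of the original analysis; it rebuilds only the five rows shown by `df.head()`, so its numbers do not describe the full 47-row dataframe.

```python
import pandas as pd

# The five rows shown by df.head(), re-entered by hand for a self-contained example
sample = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-07-30 03:52:00",
        "2020-07-30 03:55:00",
        "2020-07-30 03:54:00",
        "2020-07-29 17:35:00",
        "2020-07-29 10:50:00",
    ]),
    "Compound_score": [-0.5106, 0.4019, 0.0, 0.0, 0.0],
})

# Drop the time of day so that headlines from the same day share a group key
day = sample["date"].dt.strftime("%Y-%m-%d")
daily_mean = sample.groupby(day)["Compound_score"].mean()
print(daily_mean)
```

On the full dataframe the same two lines would apply with `df` in place of `sample`, assuming every row has a parsed `date`.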