# **WEB SCRAPING LIVE NEWS FROM YFINANCE**
The top 15 live news will be web scraped from the yahoo finance stocks news webpage in order to perform sentiment analysis on them.

In [1]:
#importing the necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from textblob import TextBlob
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [2]:
#Function definition to request the url page and extract the news tags
def request_and_extract(url):

  #Requesting using the yahoo finance ulr
  response = requests.get(url)

  #Checking for failure in fetching the page details
  if not response.ok:
    print('Status code : ', response.status_code)
    raise Exception('Failed to  load  page {}'.format(url))

  print('\n The contents of the page in HTML format is  : ',response.text)

  #Parsing the HTML page and printing the details
  doc = BeautifulSoup(response.text, 'html.parser')
  print('\n The HTML page in raw format : ', doc)
  print('\n The HTML page in a clean format  : ', doc.prettify())



  #Fetching the list of tags containing the stock news
  news_class = "Ov(h) Pend(44px) Pstart(25px)"
  news_list  = doc.find_all('div', {'class': news_class})

  #Printing the news list and returning the news_list
  return news_list

#Function definition to parse the page and web scrap the necessary information
def parse_news(news_tag):
  #Extracting the necessary information to a dictionary
  source = news_tag.find('div').text
  headline = news_tag.find('a').text
  news_url = news_tag.find('a')['href']
  content = news_tag.find('p').text
  image = news_tag.findParent().find('img')['src']
  return { 'source' : source,
            'headline' : headline,
            'url' : url + news_url,
            'content' : content,
            'image' : image
           }

def scrape_yahoo_news(url, path=None):

  #Function call Requesting the html page and extracting the news tags to scap the news
  print('Requesting the page and extracting the news tags')
  news_list = request_and_extract(url)

  #Function call parsing the news tags and extracting the necesary information.
  print('Parsing news tags')
  news_data = [parse_news(news_tag) for news_tag in news_list]

  #Converting the news_data to a dataframe
  print('Save the data to a CSV')
  df = pd.DataFrame(news_data)
  df.to_csv('/content/sample_data/df.csv', index=None)

  #Returning the webscrapped news as a dataframe.
  return df

#Providing the path for the yahoo finance news url and calling the function news_df
url = 'https://finance.yahoo.com/topic/stock-market-news/'
df = pd.DataFrame()
df = scrape_yahoo_news(url)
print('The columns in the dataframe is : ', df.columns)
print('\n The top 3 latest news article details are : \n', df.head(3))

The columns in the dataframe is :  Index(['source', 'headline', 'url', 'content', 'image'], dtype='object')

 The top 3 latest news article details are : 
              source                                           headline  \
0          TipRanks  American Express Stock (NYSE:AXP): Strong Earn...   
1         Bloomberg  Asian Stocks Gain After Fed, Yen Resumes Decli...   
2  Business Insider  AI stocks plunge after AMD gives weak forecast...   

                                                 url  \
0  https://finance.yahoo.com/topic/stock-market-n...   
1  https://finance.yahoo.com/topic/stock-market-n...   
2  https://finance.yahoo.com/topic/stock-market-n...   

                                             content  \
0  American Express (NYSE:AXP) is one of the few ...   
1  (Bloomberg) -- Most Asian equities rose after ...   
2  Shares of Super Micro Computer, AMD, and Nvidi...   

                                               image  
0  https://s.yimg.com/uu/api/res/1.2/fUMHJ

# **PERFORMING SENTIMENT ANALYSIS ON LIVE NEWS**
From the data that has been web scrapped from yahoo finance, sentiment analysis will be carried out using textblob and vader methods


### **Data Cleaning and Preprocessing of live news data**

In [3]:
#Function defining the necessary data preprocessing techniques for the df dataframe.
def data_cleaning_prepreproceesing(df):

  #Cleaning the headlines by removing special characters
  print('CLEANED THE HEADLINES COLUMN')
  df['headline_cleaned'] = df['headline']
  for i in df['headline_cleaned']:
    print('\n \n')
    print(i)
    i = re.sub(r'[^\w\s]','',i)
    print(i)

  #Cleaning the contents page by removing special character
  print('\n \n')
  print('CLEANED THE CONTENT COLUMN')
  df['content_cleaned'] = df['content']
  for i in df['content_cleaned']:
    print('\n \n')
    print(i)
    i = re.sub(r'[^\w\s]','',i)
    print(i)

  #Returing the cleaned df dataframe
  return df

#Function call for cleaning the preprocessing the df dataframe
df = data_cleaning_prepreproceesing(df)

CLEANED THE HEADLINES COLUMN

 

American Express Stock (NYSE:AXP): Strong Earnings Strengthen the Bull Case
American Express Stock NYSEAXP Strong Earnings Strengthen the Bull Case

 

Asian Stocks Gain After Fed, Yen Resumes Declines: Markets Wrap
Asian Stocks Gain After Fed Yen Resumes Declines Markets Wrap

 

AI stocks plunge after AMD gives weak forecast for chip sales
AI stocks plunge after AMD gives weak forecast for chip sales

 

A contrarian stock market indicator is on the verge of flashing a 'buy' signal, Bank of America says
A contrarian stock market indicator is on the verge of flashing a buy signal Bank of America says

 

Yen Swings Stir Talk That Japan Is in the FX Market Once Again
Yen Swings Stir Talk That Japan Is in the FX Market Once Again

 

Philip Morris Stock (NYSE:PM): Strong Q1 Results to Fuel Gains
Philip Morris Stock NYSEPM Strong Q1 Results to Fuel Gains

 

The Fed's new interest-rate outlook may roil markets
The Feds new interestrate outlook may roil ma

### **Implementing TEXTBLOB method**
TextBlob is a python library that uses Natural Language Toolkit to perform the tasks. Textblob will perform tasks such as processing textual data, common NLP tasks, sentiment analysis, classification. The two scores are the polarity and the subjectivity.

In [7]:
#Function definintion which calculates the texblob scores
def textblob_scores(df):

  #Creating a new dataframe called textblob scores and droppping unnecessaey columns
  df_textblob_scores = pd.DataFrame()
  df_textblob_scores = df
  df_textblob_scores = df_textblob_scores.drop(['source','url', 'image'], axis = 1)

  #Creating 4 lists which will store the polarity and subjectivity score.
  #Fetching the  polarity score of headline column
  li = []
  for i in df_textblob_scores['headline']:
    blob = TextBlob(i)
    li.append(blob.sentiment.polarity)
  df_textblob_scores['headline_score'] = li

  #Fetching the polarity score of cleaned_headline column
  li = []
  for i in df_textblob_scores['headline_cleaned']:
    blob = TextBlob(i)
    li.append(blob.sentiment.polarity)
  df_textblob_scores['headline_cleaned_score'] = li


  #Fetching the  polarity score of content column
  li = []
  for i in df_textblob_scores['content']:
    blob = TextBlob(i)
    li.append(blob.sentiment.polarity)
  df_textblob_scores['content_score'] = li


  #Fetching the  polarity score of content_cleaned column
  li = []
  for i in df_textblob_scores['content_cleaned']:
    blob = TextBlob(i)
    li.append(blob.sentiment.polarity)
  df_textblob_scores['content_cleaned_score'] = li


  #Calculating if the polarity score is greater than 0 or 1.
  df_textblob_scores['Sentiment_headline_cleaned'] = df_textblob_scores['headline_cleaned_score'].apply(lambda x : 'Positive' if x >0 else 'Negative' if x <0 else 'Neutral')
  df_textblob_scores['Sentiment_content_cleaned'] = df_textblob_scores['content_cleaned_score'].apply(lambda x : 'Positive' if x >0 else 'Negative' if x <0 else 'Neutral')
  df_textblob_scores['Overall_Sentiment'] = li

  #Calculating the overall sentiment of the news article
  for i in range(0, len(df_textblob_scores['Overall_Sentiment'])):
    if (df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Sentiment_headline_cleaned')]) == 'Positive' and (df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Sentiment_content_cleaned')]) == 'Positive':
      df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Overall_Sentiment')] = 'Positive'
    elif (df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Sentiment_headline_cleaned')]) == 'Negative' and (df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Sentiment_content_cleaned')]) == 'Negative':
      df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Overall_Sentiment')] = 'Negative'
    else :  df_textblob_scores.iloc[i, df_textblob_scores.columns.get_loc('Overall_Sentiment')] = 'Neutral'


  #Printing the textblob results
  print(df_textblob_scores[['headline', 'content',  'Overall_Sentiment']].head(10))

#Function call for textblob_scores
textblob_scores(df)



                                            headline  \
0  American Express Stock (NYSE:AXP): Strong Earn...   
1  Asian Stocks Gain After Fed, Yen Resumes Decli...   
2  AI stocks plunge after AMD gives weak forecast...   
3  A contrarian stock market indicator is on the ...   
4  Yen Swings Stir Talk That Japan Is in the FX M...   
5  Philip Morris Stock (NYSE:PM): Strong Q1 Resul...   
6  The Fed's new interest-rate outlook may roil m...   
7  Asia stocks rise as Fed tamps down hike fears;...   
8  Analysts revise SuperMicro stock price target ...   
9  C.H. Robinson stock blasts higher after some s...   

                                             content Overall_Sentiment  
0  American Express (NYSE:AXP) is one of the few ...          Positive  
1  (Bloomberg) -- Most Asian equities rose after ...           Neutral  
2  Shares of Super Micro Computer, AMD, and Nvidi...           Neutral  
3  Bank of America's contrarian stock market indi...           Neutral  
4  (Bloomberg) -- 

### **Implementing VADER Method**
(Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analyzer that has been trained on social media text. This is a module in nltk which has been trained on a lot of online articles. This is a module in nltk.

In [5]:
#Function definition to calculate vader scores
def vader_scores(df):

  #Initialising Vader
  vader = SentimentIntensityAnalyzer()

  #Creating a new dataframe called df_vader_scores
  df_vader_scores = pd.DataFrame()

  #Creating a lamda function to get the compound scores
  f = lambda i : vader.polarity_scores(i)['compound']

  #Fetching the compound scores on the cleaned headline column
  df_vader_scores['headline_cleaned'] = df['headline_cleaned']
  df_vader_scores['compound_cleaned_headline'] = df['headline_cleaned'].apply(f)


  #Fetching the compound scores on the uncleaned headline column
  df_vader_scores['headline'] = df['headline']
  df_vader_scores['compound_uncleaned_headline'] = df['headline'].apply(f)

  #Fetching the compound scores on the cleaned content column
  df_vader_scores['content_cleaned'] = df['content_cleaned']
  df_vader_scores['compound_cleaned_content'] = df['content_cleaned'].apply(f)

  #Fetching the comnpound scores on the uncleaned content column
  df_vader_scores['content'] = df['content']
  df_vader_scores['compound_uncleaned_content'] = df['content'].apply(f)


  #Printing the vader scores
  print(df_vader_scores.head(10))





#Funtion call for vader scores
vader_scores(df)



                                    headline_cleaned  \
0  American Express Stock (NYSE:AXP): Strong Earn...   
1  Asian Stocks Gain After Fed, Yen Resumes Decli...   
2  AI stocks plunge after AMD gives weak forecast...   
3  A contrarian stock market indicator is on the ...   
4  Yen Swings Stir Talk That Japan Is in the FX M...   
5  Philip Morris Stock (NYSE:PM): Strong Q1 Resul...   
6  The Fed's new interest-rate outlook may roil m...   
7  Asia stocks rise as Fed tamps down hike fears;...   
8  Analysts revise SuperMicro stock price target ...   
9  C.H. Robinson stock blasts higher after some s...   

   compound_cleaned_headline  \
0                     0.6808   
1                     0.5267   
2                    -0.4404   
3                     0.0000   
4                     0.0000   
5                     0.6908   
6                     0.0000   
7                    -0.4215   
8                     0.0000   
9                     0.5106   

                              