<a href="https://colab.research.google.com/github/Saifullah3711/Sentiment_analysis_hugging_face/blob/main/Sentiment_Analysis_Reviews_from_yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis on Data Scraped from Yelp and Twitter

This notebook contains the complete code for sentiment analysis of reviews scraped from Yelp with the BeautifulSoup library and of tweets scraped from Twitter with the snscrape library.
The pre-trained models are downloaded from the Hugging Face model hub.
The code in this notebook is commented throughout.

# Installing Dependencies and Libraries

In [2]:
!pip install transformers requests beautifulsoup4 pandas numpy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling 

# Importing Dependencies

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd

## Downloading the pre-trained model and tokenizer from Hugging Face

In [4]:
# Downloading the pre-trained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


### Testing the model and tokenizer

In [5]:
tokens = tokenizer.encode('Dont worry, things will be good and hopefully it was never that way', return_tensors = 'pt')
tokens

tensor([[  101, 11930, 12912, 60416,   117, 17994, 11229, 10346, 12050, 10110,
         18763, 46943, 10197, 10140, 13362, 10203, 12140,   102]])

In [6]:
result = model(tokens)
res = int(torch.argmax(result.logits)) + 1
res

3
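
The `+ 1` in the cell above turns the zero-based index of the largest logit into a star rating, since this model predicts a rating from 1 to 5 stars. The following is a minimal, dependency-free sketch of that mapping. The helper name `stars_from_logits` and the logit values are hypothetical and are not taken from the model's output.

```python
def stars_from_logits(logits):
    """Map five class logits (index 0 = 1 star) to a 1-5 star rating."""
    # Index of the largest logit, the same role torch.argmax plays in the notebook
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return best_index + 1

example_logits = [-1.2, 0.3, 1.9, 0.4, -1.5]  # hypothetical values
print(stars_from_logits(example_logits))
```

Here the largest value sits at index 2, so the sketch reports a 3-star rating.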

# Reviews From Yelp for Sentiment Analysis


* The `scrape_all()` function scrapes the reviews from all the requested pages of a listing.
* It takes three parameters: `base_url`, the URL of the first page; `pages`, the number of pages to scrape; and `increment`, the amount by which the offset at the end of the URL changes from one page to the next. The pattern depends on the website and can be found by watching how the URL changes when moving from one page to the next.
* For Yelp, each page URL is the `base_url` followed by `?start=<offset>`. The offset is 0 for the first page and 10, 20, 30, ... for the following pages.
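
The pagination pattern described above can be sketched without any network access. `page_urls` is a hypothetical helper introduced only for illustration. It lists the URLs that would be requested for a given number of pages.

```python
def page_urls(base_url, pages, increment):
    """Return the URL of each page, following the `?start=<offset>` pattern."""
    return [f"{base_url}?start={i * increment}" for i in range(pages)]

urls = page_urls("https://www.yelp.com/biz/social-brew-cafe-pyrmont", pages=3, increment=10)
for url in urls:
    print(url)
```

For three pages and an increment of 10, the offsets are 0, 10 and 20.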

In [7]:
#### Scraping tool for multiple pages
def scrape_all(base_url, pages, increment):
  # base_url: URL of the first page
  # pages: number of pages to scrape
  # increment: change in the `start` offset from one page to the next
  page = 0   # offset of the current page
  all_pages_reviews = []
  regex = re.compile('.*comment.*')   # match <p> tags whose class contains "comment"
  for counter_pages in range(1, pages + 1):   # pages counter starts from 1
    print(f'Page {counter_pages} scraping in progress...')
    prep_url = f'{base_url}?start={page}'    # as per the pattern observed in Yelp pages
    r = requests.get(prep_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('p', {'class': regex})
    reviews = [result.text for result in results]
    all_pages_reviews += reviews
    page += increment   # move to the offset of the next page
  print("Done Scraping. All reviews returned")
  return all_pages_reviews

In [8]:
base_url = 'https://www.yelp.com/biz/social-brew-cafe-pyrmont'
pages = 9
increment = 10   # Yelp's `start` offset grows by 10 per page
all_reviews_yelp = scrape_all(base_url, pages, increment)

Page 1 scraping in progress...
Page 2 scraping in progress...
Page 3 scraping in progress...
Page 4 scraping in progress...
Page 5 scraping in progress...
Page 6 scraping in progress...
Page 7 scraping in progress...
Page 8 scraping in progress...
Page 9 scraping in progress...
Done Scraping. All reviews returned


In [9]:
len(all_reviews_yelp)

94

In [10]:
all_reviews_yelp[0]

"It was ok. The coffee wasn't the best but it was fine. The relish on the breakfast roll was yum which did make it sing. So perhaps I just got a bad coffee but the food was good on my visit."

In [11]:
yelp_df = pd.DataFrame(all_reviews_yelp, columns = ['reviews'])

In [12]:
yelp_df.head()

|   | reviews |
|---|---|
| 0 | It was ok. The coffee wasn't the best but it w... |
| 1 | Great staff and food. Must try is the pan fri... |
| 2 | I went here a little while ago- a beautiful mo... |
| 3 | I came to Social brew cafe for brunch while ex... |
| 4 | Good coffee and toasts. Straight up and down -... |


In [13]:
yelp_df['reviews'][0]

"It was ok. The coffee wasn't the best but it was fine. The relish on the breakfast roll was yum which did make it sing. So perhaps I just got a bad coffee but the food was good on my visit."

In [14]:
def sentiment_score(review):
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    return int(torch.argmax(result.logits))+1

In [15]:
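# Score only the first 512 characters of each review, a rough guard for the model's 512-token input limit
yelp_df['pred_sentiment'] = yelp_df['reviews'].apply(lambda y: sentiment_score(y[:512]))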

In [16]:
yelp_df.head()

|   | reviews | pred_sentiment |
|---|---|---|
| 0 | It was ok. The coffee wasn't the best but it w... | 3 |
| 1 | Great staff and food. Must try is the pan fri... | 5 |
| 2 | I went here a little while ago- a beautiful mo... | 2 |
| 3 | I came to Social brew cafe for brunch while ex... | 5 |
| 4 | Good coffee and toasts. Straight up and down -... | 5 |


In [19]:
yelp_df['reviews'][4]

'Good coffee and toasts. Straight up and down - hits the spot with nothing mind blowing. Solid and tasty. \xa0Good work'

In [20]:
# Saving the data to csv file
yelp_df.to_csv('sentiments.csv')

# Collect/Scrape Tweets from Twitter

In [21]:
!pip install snscrape

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting snscrape
  Downloading snscrape-0.3.4-py3-none-any.whl (35 kB)
Installing collected packages: snscrape
Successfully installed snscrape-0.3.4


In [37]:
from scipy.special import softmax
import snscrape.modules.twitter as sntwitter

In [34]:


query = "(from:billgates) until:2022-01-01 since:2000-01-01"
tweets = []
limit = 5000   # maximum number of tweets to collect

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    # Stop once the limit is reached
    if len(tweets) == limit:
        break
    tweets.append([tweet.date, tweet.username, tweet.content])

tweets_df = pd.DataFrame(tweets, columns=['Date', 'User', 'Tweet'])
print(tweets_df)

                          Date       User  \
0    2021-12-30 01:20:15+00:00  BillGates   
1    2021-12-27 22:49:47+00:00  BillGates   
2    2021-12-26 18:34:19+00:00  BillGates   
3    2021-12-24 00:30:44+00:00  BillGates   
4    2021-12-23 01:06:40+00:00  BillGates   
...                        ...        ...   
3415 2010-01-20 21:16:36+00:00  BillGates   
3416 2010-01-20 20:06:30+00:00  BillGates   
3417 2010-01-20 18:59:40+00:00  BillGates   
3418 2010-01-20 00:59:32+00:00  BillGates   
3419 2010-01-19 22:50:41+00:00  BillGates   

                                                  Tweet  
0     Heroes like @PumlaNtlabati are spreading impor...  
1     We have some, but not all, of the tools we nee...  
2     The world has lost a hero. Archbishop Desmond ...  
3     One of my favorite holiday traditions is shari...  
4     Mamello Makhele is a hero from Lesotho who tra...  
...                                                 ...  
3415  From www.gatesnotes.com - one of our 2009 Indi.
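
The query string in the cell above combines Twitter search operators: `from:` selects the account, while `since:` and `until:` bound the date range. The following is a minimal sketch of assembling such a string. `build_query` is a hypothetical helper and not part of snscrape.

```python
def build_query(user, since, until):
    """Build a search query for one account within a date range (YYYY-MM-DD)."""
    return f"(from:{user}) until:{until} since:{since}"

query = build_query("billgates", since="2000-01-01", until="2022-01-01")
print(query)
```

With these arguments the helper reproduces the query used in the notebook.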

In [35]:
tweets_df.head()

|   | Date | User | Tweet |
|---|---|---|---|
| 0 | 2021-12-30 01:20:15+00:00 | BillGates | Heroes like @PumlaNtlabati are spreading impor... |
| 1 | 2021-12-27 22:49:47+00:00 | BillGates | We have some, but not all, of the tools we nee... |
| 2 | 2021-12-26 18:34:19+00:00 | BillGates | The world has lost a hero. Archbishop Desmond ... |
| 3 | 2021-12-24 00:30:44+00:00 | BillGates | One of my favorite holiday traditions is shari... |
| 4 | 2021-12-23 01:06:40+00:00 | BillGates | Mamello Makhele is a hero from Lesotho who tra... |


In [36]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3420 entries, 0 to 3419
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   Date    3420 non-null   datetime64[ns, UTC]
 1   User    3420 non-null   object             
 2   Tweet   3420 non-null   object             
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 80.3+ KB


## Preprocess tweets for model

In [40]:
tweets_df['Tweet'][0]

'Heroes like @PumlaNtlabati are spreading important information, and hope, across South Africa with the help of an unusual and innovative tool: https://t.co/vBSMvpv6Lt https://t.co/euBp7fM2PF'

In [44]:
# Preprocess a tweet: replace user mentions with '@user' and links with 'http'
def tweet_preprocess(tweet):
  tweet_words = []

  for word in tweet.split(' '):
      if word.startswith('@') and len(word) > 1:
          word = '@user'
      elif word.startswith('http'):
          word = "http"
      tweet_words.append(word)

  # Join the words once, after the loop has finished
  return " ".join(tweet_words)

In [46]:
tweet_preprocess(tweets_df['Tweet'][10])

'Omicron is spreading faster than any virus in history. It will soon be in every country in the world.'
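
The tweet shown above contains no mentions or links, so it comes back unchanged. To illustrate the two substitutions, here is a self-contained sketch that repeats the preprocessing logic and applies it to a made-up tweet. The sample text is an assumption and is not taken from the scraped data.

```python
def tweet_preprocess(tweet):
    """Replace user mentions with '@user' and links with 'http'."""
    tweet_words = []
    for word in tweet.split(' '):
        if word.startswith('@') and len(word) > 1:
            word = '@user'
        elif word.startswith('http'):
            word = 'http'
        tweet_words.append(word)
    return ' '.join(tweet_words)

sample = "Thanks @someone for sharing this: https://example.com/page"  # made-up tweet
print(tweet_preprocess(sample))
# → Thanks @user for sharing this: http
```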

## Load the model and the tokenizer

In [42]:
roberta = "cardiffnlp/twitter-roberta-base-sentiment"

model_tweets = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer_tweets = AutoTokenizer.from_pretrained(roberta)


## Sentiment Analysis on tweets

In [55]:
labels = ['Negative', 'Neutral', 'Positive']

# Encode the preprocessed tweet and run it through the model
encoded_tweet = tokenizer_tweets(tweet_preprocess(tweets_df['Tweet'][0]), return_tensors='pt')
output = model_tweets(**encoded_tweet)

# Convert the logits of the single input to probabilities
scores = output[0][0].detach().numpy()
scores = softmax(scores)

In [56]:
scores

array([0.00193649, 0.08247133, 0.91559225], dtype=float32)

In [58]:
ranking = np.argsort(scores)
ranking = ranking[::-1]   # highest score first
for i in range(scores.shape[0]):
    l = labels[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
print(np.argmax(scores))   # index of the top label

1) Positive 0.9156
2) Neutral 0.0825
3) Negative 0.0019
2
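
The scoring and ranking steps above can be sketched in plain Python to show what `softmax` and `np.argsort` contribute. The logits here are hypothetical values chosen for illustration, and this `softmax` is a hand-written stand-in for `scipy.special.softmax`.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ['Negative', 'Neutral', 'Positive']
logits = [-2.0, 1.0, 3.0]  # hypothetical model outputs

scores = softmax(logits)
# Indices sorted from the highest score to the lowest
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
for rank, i in enumerate(ranking, start=1):
    print(f"{rank}) {labels[i]} {scores[i]:.4f}")
```

Because softmax preserves the order of the logits, the top-ranked label is the same one `argmax` would pick on the raw logits. The probabilities are only needed when the confidence itself is reported.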


In [74]:
def tweets_sentiments(all_tweets):

  # pre-process all the tweets
  prep_tweets_sentiments = []
  for idx, tweet in enumerate(all_tweets):
    if idx % 500 == 0:
      print("Tweet Progress : ", idx )
    tweet_prep = tweet_preprocess(tweet)
    encoded_tweet = tokenizer_tweets(tweet_prep, return_tensors='pt')
    output = model_tweets(**encoded_tweet)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    index = np.argmax(scores)
    sentiment = labels[index]
    prep_tweets_sentiments.append(sentiment)
  return prep_tweets_sentiments

In [75]:
all_sentiments = tweets_sentiments(tweets_df['Tweet'])

Tweet Progress :  0
Tweet Progress :  500
Tweet Progress :  1000
Tweet Progress :  1500
Tweet Progress :  2000
Tweet Progress :  2500
Tweet Progress :  3000


In [76]:
all_sentiments[0:50]

['Positive',
 'Neutral',
 'Positive',
 'Positive',
 'Positive',
 'Negative',
 'Neutral',
 'Neutral',
 'Neutral',
 'Negative',
 'Negative',
 'Negative',
 'Neutral',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Neutral',
 'Positive',
 'Negative',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Neutral',
 'Positive',
 'Positive',
 'Neutral',
 'Positive',
 'Positive',
 'Positive',
 'Neutral',
 'Neutral',
 'Neutral',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Positive',
 'Neutral',
 'Positive']

In [77]:
tweets_df['sent_alpha'] = all_sentiments


In [78]:
tweets_df.head()

|   | Date | User | Tweet | sent_alpha |
|---|---|---|---|---|
| 0 | 2021-12-30 01:20:15+00:00 | BillGates | Heroes like @PumlaNtlabati are spreading impor... | Positive |
| 1 | 2021-12-27 22:49:47+00:00 | BillGates | We have some, but not all, of the tools we nee... | Neutral |
| 2 | 2021-12-26 18:34:19+00:00 | BillGates | The world has lost a hero. Archbishop Desmond ... | Positive |
| 3 | 2021-12-24 00:30:44+00:00 | BillGates | One of my favorite holiday traditions is shari... | Positive |
| 4 | 2021-12-23 01:06:40+00:00 | BillGates | Mamello Makhele is a hero from Lesotho who tra... | Positive |


In [80]:
# Assigning numerical values to positive, negative and neutral tweets
# Map each label to a number in one vectorized step
sentiment_to_num = {'Positive': 1, 'Neutral': 0, 'Negative': -1}
tweets_df['sentiment_num'] = tweets_df['sent_alpha'].map(sentiment_to_num)


In [81]:
tweets_df.head()

|   | Date | User | Tweet | sent_alpha | sentiment_num |
|---|---|---|---|---|---|
| 0 | 2021-12-30 01:20:15+00:00 | BillGates | Heroes like @PumlaNtlabati are spreading impor... | Positive | 1 |
| 1 | 2021-12-27 22:49:47+00:00 | BillGates | We have some, but not all, of the tools we nee... | Neutral | 0 |
| 2 | 2021-12-26 18:34:19+00:00 | BillGates | The world has lost a hero. Archbishop Desmond ... | Positive | 1 |
| 3 | 2021-12-24 00:30:44+00:00 | BillGates | One of my favorite holiday traditions is shari... | Positive | 1 |
| 4 | 2021-12-23 01:06:40+00:00 | BillGates | Mamello Makhele is a hero from Lesotho who tra... | Positive | 1 |


In [84]:
directory = r"/content/drive/MyDrive/NLP_Projects/Sentiment_analysis_Bert_Huggingface_Nick_tut/bill_gates.csv"
tweets_df.to_csv(directory)

In [85]:
bill_df = pd.read_csv(directory)

In [88]:
bill_df.head(20)

|   | Unnamed: 0 | Date | User | Tweet | sent_alpha | sentiment_num |
|---|---|---|---|---|---|---|
| 0 | 0 | 2021-12-30 01:20:15+00:00 | BillGates | Heroes like @PumlaNtlabati are spreading impor... | Positive | 1.0 |
| 1 | 1 | 2021-12-27 22:49:47+00:00 | BillGates | We have some, but not all, of the tools we nee... | Neutral | 0.0 |
| 2 | 2 | 2021-12-26 18:34:19+00:00 | BillGates | The world has lost a hero. Archbishop Desmond ... | Positive | 1.0 |
| 3 | 3 | 2021-12-24 00:30:44+00:00 | BillGates | One of my favorite holiday traditions is shari... | Positive | 1.0 |
| 4 | 4 | 2021-12-23 01:06:40+00:00 | BillGates | Mamello Makhele is a hero from Lesotho who tra... | Positive | 1.0 |
| 5 | 5 | 2021-12-21 16:46:14+00:00 | BillGates | I know it’s frustrating to go into another hol... | Negative | -1.0 |
| 6 | 6 | 2021-12-21 16:46:14+00:00 | BillGates | If there’s good news here, it’s that omicron m... | Neutral | 0.0 |
| 7 | 7 | 2021-12-21 16:46:13+00:00 | BillGates | There will be more breakthrough cases in peopl... | Neutral | 0.0 |
| 8 | 8 | 2021-12-21 16:46:13+00:00 | BillGates | In the meantime, we all have to look out for e... | Neutral | 0.0 |
| 9 | 9 | 2021-12-21 16:46:13+00:00 | BillGates | The big unknown is how sick omicron makes you.... | Negative | -1.0 |
