# Sentiment Analysis

Sentiment analysis is the process of identifying and extracting subjective information from text data, which can include opinions, attitudes, emotions, and other similar aspects of the writer's experience. The goal of sentiment analysis is to determine the overall sentiment of a piece of text, whether it is positive, negative, or neutral, and to identify the specific aspects of the text that contribute to that sentiment.

There are several ways to perform sentiment analysis, some of which include:

- Rule-based approach: This involves the use of a set of pre-defined rules to identify and classify sentiment in text. For example, a rule-based system might identify the presence of certain words or phrases that are indicative of positive or negative sentiment, and use those to assign a sentiment score to the text.

- Machine learning approach: This involves training a machine learning algorithm on a set of labeled data, where each piece of text is associated with a sentiment label (positive, negative, or neutral). The algorithm learns to identify patterns in the data that are indicative of each sentiment label and can then be used to classify new, unlabeled text data.

- Hybrid approach: This combines the rule-based and machine learning approaches, using a set of pre-defined rules to identify sentiment in text and then using a machine learning algorithm to refine and improve the sentiment analysis.

- Lexicon-based approach: This approach involves the use of sentiment lexicons or dictionaries, which are pre-built lists of words and phrases that are associated with positive or negative sentiment. The sentiment of a given text can then be determined by calculating the number and polarity of sentiment words present in the text.

- Deep learning approach: This involves the use of neural networks to learn and classify sentiment in text. Deep learning models can process large amounts of text data and identify complex patterns that may be difficult to identify using other methods.

Overall, sentiment analysis can be performed using a variety of methods, each with its own strengths and weaknesses. The choice of approach will depend on the specific needs of the application and the resources available for implementation.
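
As a rough illustration of the lexicon-based idea described above, the following sketch averages the polarity of known words in a text. The `TOY_LEXICON` dictionary and the `lexicon_score` function are hypothetical names invented for this example, and the word weights are made up; real lexicons are much larger and carefully curated.

```python
import re

# Hypothetical toy lexicon: word -> polarity. Real lexicons are far larger.
TOY_LEXICON = {
    "love": 1.0, "great": 1.0, "good": 0.5,
    "hate": -1.0, "terrible": -1.0, "bad": -0.5,
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Return the mean polarity of the lexicon words found in `text` (0.0 if none)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


print(lexicon_score("I love pizza"))           # one positive hit
print(lexicon_score("The service was bad"))    # one mildly negative hit
print(lexicon_score("It arrived on Tuesday"))  # no lexicon words at all
```

A scorer this simple ignores context such as negation ("not good"), which is one reason the other approaches in the list exist.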

Python provides several libraries for performing sentiment analysis. Some popular libraries and tools for sentiment analysis in Python include:

- TextBlob: TextBlob is a Python library that provides a simple API for common natural language processing (NLP) tasks such as sentiment analysis, part-of-speech tagging, and noun phrase extraction.

- NLTK: The Natural Language Toolkit (NLTK) is a widely used Python library for NLP. It provides various tools and methods for text processing, including sentiment analysis.

- VaderSentiment: VADER (Valence Aware Dictionary and sEntiment Reasoner), distributed as the `vaderSentiment` package, is designed specifically for sentiment analysis of social media text. It combines a sentiment lexicon with a set of rules and can handle emoticons and slang.

- Scikit-learn: Scikit-learn is a machine learning library for Python that can be used for various NLP tasks including sentiment analysis. It provides various algorithms for text classification such as Naive Bayes, Support Vector Machines, and Logistic Regression.

- Hugging Face: The Hugging Face Transformers library works with deep learning frameworks such as PyTorch and provides pre-trained models that can be used for sentiment analysis, such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), and DistilBERT, among others. These models can be fine-tuned on specific datasets to improve their performance on particular domains or languages.
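
To make the machine learning approach more concrete, here is a minimal sketch using scikit-learn. It assumes scikit-learn is installed; the six training sentences and their labels are hypothetical toy data, far too small for real use, and Naive Bayes on word counts is just one of the classifiers mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; a real one would hold thousands of labeled texts.
train_texts = [
    "I love this movie",
    "what a great day",
    "this is wonderful",
    "I hate this movie",
    "what a terrible day",
    "this is awful",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["I love it", "awful and terrible"])
print(list(predictions))
```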

To perform sentiment analysis with these libraries, you typically first preprocess your text data, for example by tokenizing it, removing stop words, and stemming. These steps matter most for lexicon-based and classical machine learning approaches; pre-trained transformer models come with their own tokenizers and usually take raw text. You can then use the sentiment analysis functions or methods of your chosen library to obtain a sentiment score or label for your text.
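
The three preprocessing steps named above can be sketched in plain Python. Everything here is an assumption made for illustration: `STOP_WORDS` is a deliberately tiny hypothetical list, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as the ones shipped with NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLP libraries ship fuller ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "i", "it", "of", "to", "and"}


def crude_stem(token):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase and tokenize, then drop stop words and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("The dogs were barking loudly"))  # ['dog', 'bark', 'loudly']
```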

For example, using TextBlob, you can perform sentiment analysis on a sentence as follows:

In [1]:
from textblob import TextBlob

text = "I love pizza"
blob = TextBlob(text)
sentiment_score = blob.sentiment.polarity
print(sentiment_score)

0.5


This code prints the polarity of the sentence as a floating-point value between -1 (most negative) and 1 (most positive). Here the score is 0.5, which indicates positive sentiment.
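
If a label is needed rather than a number, a polarity score can be bucketed with a cutoff. The sketch below is one possible way to do that; `polarity_to_label` and its `threshold` of 0.1 are hypothetical choices for this example, not part of TextBlob.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity in [-1, 1] to a coarse label.

    `threshold` is a hypothetical cutoff: scores within +/- threshold
    of zero are treated as neutral.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


print(polarity_to_label(0.5))  # the score reported for "I love pizza"
```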

Transformers, the open-source library from Hugging Face, is widely used for natural language processing (NLP) and provides easy-to-use interfaces to pre-trained transformer models such as BERT and RoBERTa. These models can be fine-tuned for specific NLP tasks, such as sentiment analysis, with relatively little code.

The `pipeline` function in Transformers is a simple and convenient way to perform a wide range of NLP tasks, including sentiment analysis, with pre-trained transformer models. It lets you run these tasks without fine-tuning a model or writing complex code.

`pip install -q transformers`

In [3]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998656511306763},
 {'label': 'NEGATIVE', 'score': 0.9991129040718079}]
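
Note that `score` here is the model's confidence in the predicted label, not a polarity: a `NEGATIVE` prediction with a score near 1 is strongly negative. To compare with polarity-style scores in the range -1 to 1, the output can be folded into a single signed number. The helper `signed_score` below is a hypothetical name for this sketch, and it assumes the two labels `POSITIVE` and `NEGATIVE` produced by the default model.

```python
def signed_score(prediction):
    """Turn {'label': ..., 'score': ...} into a value in [-1, 1]."""
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]


# Output copied from the pipeline call above.
predictions = [
    {"label": "POSITIVE", "score": 0.9998656511306763},
    {"label": "NEGATIVE", "score": 0.9991129040718079},
]
print([round(signed_score(p), 4) for p in predictions])  # [0.9999, -0.9991]
```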

Let's load some data.

The `tweets.csv` file contains tweets by Hillary Clinton and Donald Trump from the 2016 presidential election campaign.

In [8]:
import pandas as pd
df = pd.read_csv('data/tweets.csv')

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6444 entries, 0 to 6443
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               6444 non-null   int64  
 1   id                       6444 non-null   float64
 2   handle                   6444 non-null   object 
 3   text                     6444 non-null   object 
 4   is_retweet               6444 non-null   bool   
 5   original_author          722 non-null    object 
 6   time                     6444 non-null   object 
 7   in_reply_to_screen_name  208 non-null    object 
 8   in_reply_to_status_id    202 non-null    float64
 9   in_reply_to_user_id      208 non-null    float64
 10  is_quote_status          6444 non-null   bool   
 11  lang                     6444 non-null   object 
 12  retweet_count            6444 non-null   int64  
 13  favorite_count           6444 non-null   int64  
 14  longitude               

Transformer models can be quite heavy to run, so we will apply the pipeline only to a random sample of 100 tweets.

In [10]:
df_sample = df.sample(100, ignore_index=True)

In [12]:
df_sample.head()

|  | Unnamed: 0 | id | handle | text | is_retweet | original_author | time | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_user_id | ... | place_type | place_country_code | place_country | place_contained_within | place_attributes | place_bounding_box | source_url | truncated | entities | extended_entities |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3176 | 7.471575e+17 | HillaryClinton | .@HillaryClinton is a champion for LGBT equali... | True | jessetyler | 2016-06-26T19:59:59 |  |  |  | ... |  |  |  |  |  |  | https://about.twitter.com/products/tweetdeck | False | {'user_mentions': [{'id_str': '17350250', 'nam... |  |
| 1 | 4505 | 7.276378e+17 | HillaryClinton | Our teachers deserve more than just a pat on t... | False |  | 2016-05-03T23:15:54 |  |  |  | ... |  |  |  |  |  |  | https://about.twitter.com/products/tweetdeck | False | {'media': [{'display_url': 'pic.twitter.com/0s... | {'media': [{'display_url': 'pic.twitter.com/0s... |
| 2 | 4038 | 7.348127e+17 | HillaryClinton | "The lesson of our history, through good times... | False |  | 2016-05-23T18:26:08 |  |  |  | ... |  |  |  |  |  |  | https://about.twitter.com/products/tweetdeck | False | {'media': [{'display_url': 'pic.twitter.com/Cc... | {'media': [{'display_url': 'pic.twitter.com/Cc... |
| 3 | 4016 | 7.35181e+17 | HillaryClinton | How cruel do you have to be to actually root f... | False |  | 2016-05-24T18:49:32 | HillaryClinton | 7.35179e+17 | 1339836000.0 | ... |  |  |  |  |  |  | https://about.twitter.com/products/tweetdeck | False | {'user_mentions': [], 'symbols': [], 'urls': [... |  |
| 4 | 1377 | 7.65164e+17 | HillaryClinton | Trump says, "I know more about ISIS than the g... | False |  | 2016-08-15T12:31:18 |  |  |  | ... |  |  |  |  |  |  | http://twitter.com | False | {'user_mentions': [], 'symbols': [], 'urls': [... |  |


We'll use our transformer to perform sentiment analysis on the tweets.

In [13]:
sent = pd.DataFrame(sentiment_pipeline(list(df_sample.text)))

In [16]:
sent.head()

|  | label | score |
| --- | --- | --- |
| 0 | NEGATIVE | 0.660723 |
| 1 | NEGATIVE | 0.989819 |
| 2 | POSITIVE | 0.999174 |
| 3 | NEGATIVE | 0.999402 |
| 4 | NEGATIVE | 0.994164 |


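We then combine the sampled tweets and the predictions column-wise. The rows line up because both DataFrames share the same default 0 to 99 index: the sample was drawn with `ignore_index=True`, and `sent` was built from the list of texts in the same order.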

In [17]:
final = pd.concat([df_sample, sent], axis=1)
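
`pd.concat` with `axis=1` matches rows by index label rather than by position. The sketch below uses two small hypothetical frames, `texts` and `preds`, to show what would happen if a sample had kept its original index labels, and how resetting the index avoids the problem.

```python
import pandas as pd

# Hypothetical frames: `texts` keeps index labels from a larger frame,
# while `preds` has a fresh 0..2 index, like a frame built from a list.
texts = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 20, 30])
preds = pd.DataFrame({"label": ["POSITIVE", "NEGATIVE", "POSITIVE"]})

# Index labels do not overlap, so the result has 6 rows padded with NaN.
misaligned = pd.concat([texts, preds], axis=1)

# Resetting the index first makes the rows line up one-to-one.
aligned = pd.concat([texts.reset_index(drop=True), preds], axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```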

In [18]:
final[['text','label', 'score']].head(10)

|  | text | label | score |
| --- | --- | --- | --- |
| 0 | .@HillaryClinton is a champion for LGBT equali... | NEGATIVE | 0.660723 |
| 1 | Our teachers deserve more than just a pat on t... | NEGATIVE | 0.989819 |
| 2 | "The lesson of our history, through good times... | POSITIVE | 0.999174 |
| 3 | How cruel do you have to be to actually root f... | NEGATIVE | 0.999402 |
| 4 | Trump says, "I know more about ISIS than the g... | NEGATIVE | 0.994164 |
| 5 | “I love Hispanics!” —Trump, 52 minutes ago htt... | NEGATIVE | 0.955088 |
| 6 | .@realDonaldTrump doesn't want you to see what... | NEGATIVE | 0.998818 |
| 7 | America is a country of diverse beliefs and he... | POSITIVE | 0.999649 |
| 8 | So with all of the Obama tough talk on Russia ... | POSITIVE | 0.940436 |
| 9 | The only people who are not interested in bein... | NEGATIVE | 0.997979 |


In [19]:
final.text[0]

".@HillaryClinton is a champion for LGBT equality. That's why #ImWithHer. 💁  Text PRIDE to 47246 if you are too! 🌈 https://t.co/8k0SHzrfqi"

In [20]:
final.text[7]

'America is a country of diverse beliefs and heritages. That makes us strong, regardless of what Donald thinks.\nhttps://t.co/Nbyd4zSuyY'
