# Sentiment analysis

### Using the HuggingFace `pipeline` function

There are many different types of sentiment analysis, but most commonly we wish to classify data into positive, negative or neutral classes. 

We might use sentiment analysis to:

- analyse tweets or other social media mentions - e.g. to compare with competitors
- get insights into what customers do and don't like about a product or service
- detect negative reviews quickly so action can be taken

In this notebook, we'll use pre-trained models from HuggingFace to analyse sentiment in twitter data. 

Before we dive into the twitter data, let's see how easy it can be to use these models!

First we need to run some setup, to install the `transformers` library.

In [None]:
%pip install transformers

Next, we can import the `pipeline` function, which provides a super easy interface for making predictions using models from the HuggingFace hub. If we don't name a model, it picks the default one for the task, so let's start with the default for sentiment analysis.

[Pipelines documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)

In [None]:
from transformers import pipeline
sentiment_pipeline = pipeline('sentiment-analysis')

Now we have the model set up, we can go ahead and try it! Update the text below to try classifying different inputs.

In [None]:
data = ["I love NLP"]
sentiment_pipeline(data)

You can also pass a specific model to the pipeline if it suits your use case better. For instance:

- [`twitter-roberta-base-sentiment`](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) from `cardiffnlp` is trained on tweets and fine-tuned for sentiment analysis
- [`bert-base-multilingual-uncased-sentiment`](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) from `nlptown` is a BERT model fine-tuned for sentiment analysis on product reviews in English, Dutch, German, French, Spanish and Italian
- [`distilbert-base-uncased-emotion`](https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion) from `bhadresh-savani` is fine-tuned for detecting emotions in texts

Let's try the emotion model...

In [None]:
# top_k=None returns a score for every emotion (it replaces the deprecated return_all_scores=True)
emotion_pipeline = pipeline("text-classification", model="bhadresh-savani/distilbert-base-uncased-emotion", top_k=None)

Try it with some text of your choice. I've gone for __'Hope' is the thing with feathers__ by Emily Dickinson. 

In [None]:
data = ["‘Hope’ is the thing with feathers\
    That perches in the soul\
    And sings the tune without the words\
    And never stops – at all\
    And sweetest – in the Gale – is heard\
    And sore must be the storm\
    That could abash the little Bird\
    That kept so many warm\
    I’ve heard it in the chillest land\
    And on the strangest Sea\
    Yet, never, in Extremity,\
    It asked a crumb – of Me.\
"]
emotion_pipeline(data)
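
When every score is requested, the result is typically a nested list: one inner list per input text, holding one `{'label': ..., 'score': ...}` dictionary per emotion. As a rough sketch of how the strongest emotion could be picked out, the snippet below works on a hand-written stand-in for that structure. The numbers in `example_output` are made up for illustration and are not real model output, and `top_emotion` is a hypothetical helper, not part of `transformers`.

```python
# Hand-written stand-in for the pipeline's nested output (NOT real model scores).
example_output = [[
    {"label": "sadness", "score": 0.10},
    {"label": "joy", "score": 0.60},
    {"label": "love", "score": 0.15},
    {"label": "anger", "score": 0.05},
    {"label": "fear", "score": 0.07},
    {"label": "surprise", "score": 0.03},
]]


def top_emotion(scores_for_one_text):
    """Return the (label, score) pair with the highest score for one text."""
    best = max(scores_for_one_text, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# There is one inner list per input text, so loop over the outer list.
for scores in example_output:
    label, score = top_emotion(scores)
    print(f"{label}: {score:.2f}")
```

With the real pipeline, `example_output` would be replaced by the value returned from `emotion_pipeline(data)`.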

### A more flexible approach

Here, we will load in a dataset of tweets, run some preprocessing and use a pre-trained model to classify into positive, negative or neutral sentiments.

First, we need to import some packages.

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

Set the path to the data set you wish to use. There are some pre-downloaded data sets on the workshop GitHub in the [data folder](https://github.com/NICD-UK/IWD-twitterxhuggingface/tree/main/data), or feel free to try using your own data!

If you have an existing twitter developer account you can use the notebook [`get_tweets.ipynb`](https://github.com/NICD-UK/IWD-twitterxhuggingface/blob/main/get_tweets.ipynb) to get your own twitter data set. 

In [None]:
PATH = 'https://raw.githubusercontent.com/NICD-UK/IWD-twitterxhuggingface/main/data/elon_musk_tweets.csv'

We can then load the tweets into a pandas dataframe and take a look at the top five. 

In [None]:
tweets_df = pd.DataFrame(pd.read_csv(PATH)['text']) 
tweets_df.head(5)

Our twitter data contains lots of web links and usernames. We can write a simple preprocessing function that replaces all usernames with `@user` and all links with `http`. 

Have a look at the data and think about what other preprocessing might be useful. Remember that we want to avoid losing any information that might help when classifying sentiments. 


In [None]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
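
To see what this function does and does not catch, here is a small sketch that runs it on two example strings. Both tweets are made up for illustration and are not taken from the data set. The function is repeated so that the snippet runs on its own.

```python
def preprocess(text):
    # Same logic as the function above: tokens are split on single spaces only.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


# Made-up example tweets (not from the data set).
simple = "@nasa great launch today https://example.com/video"
tricky = "Thanks\n@nasa (https://example.com/video)"

# Mentions and links that stand alone between spaces are replaced.
print(preprocess(simple))

# A mention after a newline, or a link inside brackets, is left untouched,
# because the token no longer *starts* with '@' or 'http'.
print(repr(preprocess(tricky)))
```

The second case is one example of extra preprocessing you might consider, such as splitting on any whitespace or matching mentions and links with a regular expression.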

Next we can set up the model, tokeniser and config. We will use the [`twitter-roberta-base-sentiment`](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) model from `cardiffnlp` that we mentioned earlier, with a slightly different setup. 


In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)


The preprocessing step uses our custom preprocessing method, and we will use the model's default tokeniser. 

We will add both the preprocessed and encoded versions of each tweet to our dataframe, to allow further inspection. 

In [None]:
preprocessed_tweets = []
encoded_tweets = []

for tweet in tweets_df['text']:
    preprocessed = preprocess(tweet)
    encoded = tokenizer(preprocessed, return_tensors='pt')
    preprocessed_tweets.append(preprocessed)
    encoded_tweets.append(encoded)

tweets_df['preprocessed'] = preprocessed_tweets
tweets_df['encoded'] = encoded_tweets
tweets_df.reset_index(inplace=True, drop=True)

Next we can use the model to get classification scores, and use a softmax function to convert the scores into a vector of probabilities.

In [None]:
tweets_analysis = []
for item in tweets_df.encoded:

    output = model(**item)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    tweets_analysis.append(scores)
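
As a rough illustration of what the softmax step does, the sketch below implements it in plain Python and applies it to some made-up scores. The name `softmax_sketch` and the `logits` values are assumptions for this example only; the notebook itself uses `scipy.special.softmax`.

```python
import math


def softmax_sketch(logits):
    """Turn raw scores into positive values that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing and does not change the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores in the order (negative, neutral, positive).
logits = [-1.0, 0.5, 2.0]
probs = softmax_sketch(logits)

print([round(p, 3) for p in probs])
print(sum(probs))
```

The ordering of the scores is preserved, so the class with the largest raw score also has the largest probability.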

Finally, let's join up our analysis with the tweets dataframe.

In [None]:
tweets_df = pd.concat([tweets_df, pd.DataFrame(tweets_analysis)], axis = 1)

Rename the columns to make them more reader-friendly.

In [None]:
# pd.DataFrame(tweets_analysis) names its columns 0, 1, 2; for this model they mean negative, neutral, positive
tweets_df = tweets_df.rename(columns={0: 'negative', 1: 'neutral', 2: 'positive'})

We can add a column to our dataframe to specify the sentiment with the maximum probability.

In [None]:
tweets_df['sentiment'] = tweets_df[['negative','positive', 'neutral']].idxmax(axis=1)
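
To make the `idxmax` step concrete, here is a minimal sketch on a toy dataframe. The probabilities in `toy_df` are made up for illustration and do not come from the model.

```python
import pandas as pd

# Made-up probabilities for three tweets (each row sums to 1).
toy_df = pd.DataFrame(
    [[0.70, 0.20, 0.10],
     [0.10, 0.30, 0.60],
     [0.25, 0.50, 0.25]],
    columns=["negative", "neutral", "positive"],
)

# idxmax(axis=1) returns, for each row, the name of the column holding the largest value.
toy_df["sentiment"] = toy_df[["negative", "neutral", "positive"]].idxmax(axis=1)
print(toy_df)
```

Each row is labelled with the class that has the highest probability, which is the same rule applied to the tweets above.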

Let's take a look at a tweet for each sentiment.

In [None]:
pd.set_option('max_colwidth', None)
pd.set_option('display.width', 3000)
 
display(tweets_df[tweets_df["sentiment"] == 'positive'].head(1))
display(tweets_df[tweets_df["sentiment"] == 'neutral'].head(1))
display(tweets_df[tweets_df["sentiment"] == 'negative'].head(1))

In [None]:
tweets_df.head(5)

Having classified the tweets, what questions might we want to ask about our data? We might be interested in the distribution over the classes.

In [None]:
sentiment_counts = tweets_df.groupby(['sentiment']).size()
print(sentiment_counts)

We can make a quick plot to visualise this (for more beautiful plots, you need to be in the Visualisation masterclass with Louise!)

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(6,6), dpi=100)
ax = plt.subplot(111)
sentiment_counts.plot.pie(ax=ax, autopct='%1.1f%%', startangle=270, fontsize=12, label="")

#### Wordclouds

Whilst less technical, wordclouds can be a fun and popular way to visualise common terms in text data. We can make two wordclouds, one for positive and one for negative tweets. 

Remember we are working with a small data set!

First, import the libraries and get our positive and negative tweets. 

In [None]:
from wordcloud import WordCloud, STOPWORDS

positive_tweets = tweets_df['text'][tweets_df["sentiment"] == 'positive']
negative_tweets = tweets_df['text'][tweets_df["sentiment"] == 'negative']

Next, let's get a list of stopwords to remove. These are commonly used words that don't carry much meaning on their own. Be careful with these - some stopword lists are better quality than others! 

We will add `https`, `co` and `RT` to our list. You might want to add others after looking at the wordclouds.

In [None]:
stop_words = ["https", "co", "RT"] + list(STOPWORDS)
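
As a rough sketch of why stopwords matter, the snippet below counts words in a short string before plotting anything. The text and the tiny `toy_stop_words` set are made up for illustration; the notebook uses the much larger `STOPWORDS` set from `wordcloud`.

```python
from collections import Counter

# A tiny hypothetical stopword set, just for this example.
toy_stop_words = {"the", "is", "a", "rt", "https", "co"}

# Made-up text standing in for a handful of tweets.
text = "RT the rocket is a big rocket https co launch"

# Lower-case each word before comparing, so that "RT" matches "rt".
words = [w.lower() for w in text.split()]
kept = [w for w in words if w not in toy_stop_words]

# The remaining counts are what would drive the word sizes in a wordcloud.
counts = Counter(kept)
print(counts.most_common(3))
```

Without the filter, words such as "the" and "https" would compete with the terms we actually care about.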

The next two cells will create the positive and negative wordclouds. 

**A reminder that the data is real twitter data and has not been filtered for toxicity or profanity!**

In [None]:
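# Join the tweets into one string; str() on a Series would also include index numbers and dtype text
positive_wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white", stopwords=stop_words).generate(" ".join(positive_tweets))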
plt.figure()
plt.title("Positive Tweets")
plt.imshow(positive_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
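# Join the tweets into one string; str() on a Series would also include index numbers and dtype text
negative_wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white", stopwords=stop_words).generate(" ".join(negative_tweets))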
plt.figure()
plt.title("Negative Tweets")
plt.imshow(negative_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()