Resources:
- [Blog NLP](https://medium.com/@nikitasilaparasetty/twitter-sentiment-analysis-for-data-science-using-python-in-2022-6d5e43f6fa6e)

## Twint

Twint scrapes Twitter directly instead of calling the official Twitter API, so it needs no API credentials and is not subject to the API's rate limits.

In [1]:
import twint

In [2]:
import nest_asyncio
nest_asyncio.apply()

In [3]:
def scrape_company(company_name, amount):
    """
    Scrape tweets that mention a company and store them in a CSV file.

    - company_name: search term to look for
    - amount: maximum number of tweets to scrape
    - the results are written to "<company_name>_tweets.csv"
    """

    # Configure the search
    c = twint.Config()

    c.Search = company_name       # topic (Twint expects a string query)
    c.Limit = amount      # number of Tweets to scrape
    c.Store_csv = True       # store tweets in a csv file
    c.Output = f"{company_name}_tweets.csv"     # path to csv file

    twint.run.Search(c)

In [None]:
# Scrape 10 tweets about Amazon into amazon_tweets.csv
scrape_company("amazon", 10)

In [4]:
import pandas as pd
df = pd.read_csv("./csv/amazon_tweets.csv")

In [19]:
tweet = df["tweet"][12]

## Sentiment Analysis

### OpenAI

We can use an OpenAI model to run sentiment analysis on the tweets.

In [None]:
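import os
import openai

# Note: openai.Completion is the legacy interface of openai-python (< 1.0)
openai.api_key = os.getenv("OPENAI_API_KEY")

# The prompt has to ask for a classification; a bare tweet would just be continued
prompt = (
    "Decide whether a Tweet's sentiment is positive, neutral, or negative.\n\n"
    f'Tweet: "{tweet}"\n'
    "Sentiment:"
)

response = openai.Completion.create(
  model="text-davinci-002",
  prompt=prompt,
  temperature=0,
  max_tokens=60,
  top_p=1.0,
  frequency_penalty=0.0,
  presence_penalty=0.0
)

print(response["choices"][0]["text"].strip())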

However, it only works for English.

One point in its favour: if a tweet spans several lines, it evaluates each line separately.

### HuggingFace

We could look for a free option that also supports other languages (e.g. Chinese).
This matters given the distribution of tweet languages (from *Barbieri F., Espinosa L., Camacho-Collados J. - XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond*):

![Distribution of languages](./images/language_distribution_twitter.png)

The tweets need some cleaning first:
- Remove the name of the company?
- Remove URLs to avoid spam
- How can we filter out bot tweets?
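
A minimal sketch of these cleaning ideas, using only the standard library. The function names `clean_tweet` and `drop_duplicates` are hypothetical, and treating exact duplicates as likely bot or spam output is an assumption rather than a reliable bot detector.

```python
import re

# Matches http:// and https:// links up to the next whitespace
URL_RE = re.compile(r"https?://\S+")

def clean_tweet(text, company_name=None):
    """Strip URLs and, optionally, mentions of the company name."""
    text = URL_RE.sub("", text)
    if company_name:
        # Case-insensitive removal of the company name as a whole word
        pattern = rf"\b{re.escape(company_name)}\b"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

def drop_duplicates(tweets):
    """Keep the first occurrence of each tweet, compared case-insensitively."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = tweet.lower()
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

For example, `clean_tweet("I love Amazon https://t.co/abc", "amazon")` returns `"I love"`. Whether removing the company name helps or hurts the classifier is still an open question, so the argument is optional.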

To compute the overall sentiment about a company, we could consider:
- Should longer tweets carry more weight?
- Should we require a minimum confidence that a tweet is positive or negative?
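
One possible way to combine both ideas is sketched below. The function `company_sentiment`, the default threshold of 0.6, and the use of character count as the weight are all illustrative assumptions. Each input is a tweet's text plus its (negative, neutral, positive) probabilities.

```python
def company_sentiment(tweets, threshold=0.6):
    """
    Length-weighted average sentiment in [-1, 1].

    `tweets` is a list of (text, (negative, neutral, positive)) pairs.
    Tweets whose highest probability is below `threshold` are skipped.
    Returns None if no tweet passes the threshold.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for text, (negative, neutral, positive) in tweets:
        if max(negative, neutral, positive) < threshold:
            continue  # the model is not confident enough about this tweet
        weight = len(text)  # assumption: longer tweets count more
        weighted_sum += weight * (positive - negative)
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight
```

A result close to 1 would indicate mostly positive tweets and a result close to -1 mostly negative ones. Confident neutral tweets pull the value towards 0.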

In [6]:
import numpy as np
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
from scipy.special import softmax

# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        # Replace user mentions with a placeholder
        t = '@user' if t.startswith('@') and len(t) > 1 else t

        # Replace URLs with a placeholder
        t = 'http' if t.startswith('http') else t
        
        new_text.append(t)
        
    return " ".join(new_text)


In [7]:
# Load the tokenizer and config of the multilingual sentiment model
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

In [10]:
# Load the model and save a local copy under the same path
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

# Run the model on a sample text
text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()   # logits for the single input
scores = softmax(scores)                 # convert logits to probabilities

In [13]:
print("Negative: ", scores[0])
print("Neutral: ", scores[1])
print("Positive: ", scores[2])

Negative:  0.03125937
Neutral:  0.20148008
Positive:  0.76726055
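
The scores printed above follow the label indices used in the print statements (0 = negative, 1 = neutral, 2 = positive). A small sketch of turning such a score vector into a ranked list follows. `rank_labels` and the hard-coded `ID2LABEL` mapping are assumptions for illustration; in practice the mapping could be read from `config.id2label`.

```python
# Assumed mapping, matching the order of the print statements above
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def rank_labels(scores, id2label=ID2LABEL):
    """Return (label, score) pairs sorted from most to least likely."""
    pairs = [(id2label[i], float(s)) for i, s in enumerate(scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranked = rank_labels([0.03125937, 0.20148008, 0.76726055])
print(ranked[0][0])  # most likely label: positive
```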


In [None]:
from transformers import pipeline
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("Huggingface es lo mejor! Awesome library 🤗😎")