# Project Update 2

This is the second update for the Case Study project. It focuses on following up on the suggestions we received and on showing progress with the next steps listed in the previous update.

## Pending tasks

From the previous update the pending tasks were:

1. Classify more tweets: We were able to download more tweets from the Twitter API, but we are still behind on the manual classification, because we prioritized using the `transformers` library, which already provides pre-trained models.

2. Run an SVM classifier for the tweets: The SVM classifier was neither trained nor run on its own, as we believe it is better to invest time in the actual problem we want to solve: showing the sentiment analysis related to airlines in an easier and more meaningful way through dashboards.

3. First dashboard displaying the collected data: We were able to put a dashboard together with the new data.

## Required Setup

The following setup is needed to install the `transformers` library. Only macOS instructions are available at the moment.

### macOS

Install the Rust compiler, which is needed to build the `tokenizers` library:

```shell
brew install rustup-init
```

Add the Cargo environment to `~/.zshrc` (or the equivalent file for your shell):

```shell
echo "source ~/.cargo/env" >> ~/.zshrc
```

_Source: [Installation Error - Failed building wheel for tokenizers](https://github.com/huggingface/transformers/issues/2831#issuecomment-1001437376)_

Install TensorFlow on macOS by following the [official Apple documentation](https://developer.apple.com/metal/tensorflow-plugin/)[1].

Finish by installing the `transformers` library in the created virtual environment to get access to the models:

```shell
pip install transformers
```

[1]: [Could not find a version that satisfies the requirement tensorflow](https://stackoverflow.com/questions/48720833/could-not-find-a-version-that-satisfies-the-requirement-tensorflow)
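
Before moving on, it can help to confirm that the packages are visible from the active virtual environment. The snippet below is a minimal sketch using only the standard library. The helper name `missing_packages` and the package list are assumptions based on the imports used in this notebook, not part of the original setup.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Packages this notebook imports; adjust the list to your own setup.
required = ["transformers", "tensorflow", "pandas"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, the install steps above most likely ran in a different environment than the one the notebook kernel is using.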

In [1]:
from transformers import pipeline



In [6]:
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to introduce pipeline to the transformers repository.')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are 

[{'label': 'POSITIVE', 'score': 0.9996980428695679}]

With the above classification of a sample message taken from the Hugging Face site, we have confirmed that the classifier is configured correctly and can be run against the tweets we have available.

Below, some of the tweets in the original data set are run through the model to test it with data scoped to the airline domain.

In [4]:
import pandas as pd

In [5]:
old_tweets = pd.read_csv('../../archive/Tweets.csv')

In [11]:
row = old_tweets.iloc[0]
tfm_result = classifier(row.text)

In [13]:
tfm_result

[{'label': 'POSITIVE', 'score': 0.8633630275726318}]

In [16]:
print(f"Original confidence {row.airline_sentiment_confidence:f} vs {tfm_result[0]['score']:f}")

print(f"Original label {row.airline_sentiment} vs new label {tfm_result[0]['label'].lower()}")

Original confidence 1.000000 vs 0.863363
Original label neutral vs new label positive


We can now compare the existing labels with the new ones to see how the classifications from the `transformers` model differ from the labeled data we already have.

In [19]:
old_tweets.shape

(14640, 15)

In [7]:
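# Classify a random sample of 10 tweets (no random_state, so the sample changes on every run).
for index, row in old_tweets.sample(n=10).iterrows():
    tfm_result = classifier(row.text)[0]
    # Store the predicted label and score next to the original annotations.
    old_tweets.loc[index, 'tfm_classification'] = tfm_result['label']
    old_tweets.loc[index, 'tfm_score'] = tfm_result['score']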
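
The loop above calls the classifier once per row. Transformers pipelines also accept a list of strings and return one result dict per input, so the same work can be done in a single call. The sketch below illustrates that idea. `classify_texts` and `fake_classifier` are hypothetical names, and the stand-in classifier exists only so the example runs without downloading a model.

```python
def classify_texts(classifier, texts):
    """Run `classifier` once over all texts and return (label, score) pairs."""
    results = classifier(list(texts))
    return [(result["label"], result["score"]) for result in results]


# Stand-in with the same input/output shape as a sentiment pipeline.
# In the notebook, pass the `classifier` pipeline created earlier instead.
def fake_classifier(texts):
    return [
        {"label": "NEGATIVE" if "late" in text else "POSITIVE", "score": 1.0}
        for text in texts
    ]


pairs = classify_texts(fake_classifier, ["flight was late", "great crew"])
print(pairs)
```

With the real pipeline, the returned pairs could be assigned back to the sampled rows in one step rather than with repeated `.loc` writes.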


In [8]:
old_tweets[~old_tweets['tfm_classification'].isna()][['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence', 'tfm_classification', 'tfm_score']]

| | tweet_id | airline_sentiment | airline_sentiment_confidence | tfm_classification | tfm_score |
|---|---|---|---|---|---|
| 925 | 570009743160254464 | neutral | 1.0 | NEGATIVE | 0.998697 |
| 1157 | 569911515106582528 | negative | 1.0 | NEGATIVE | 0.998926 |
| 2680 | 568958107205783554 | negative | 1.0 | NEGATIVE | 0.999817 |
| 3423 | 568454386617356291 | negative | 1.0 | NEGATIVE | 0.999613 |
| 4877 | 569670671695011840 | neutral | 0.6801 | NEGATIVE | 0.991647 |
| 5409 | 569132950006185985 | negative | 1.0 | NEGATIVE | 0.997139 |
| 6673 | 567737625637687296 | neutral | 0.6619 | NEGATIVE | 0.984052 |
| 8107 | 568782407698149376 | positive | 1.0 | NEGATIVE | 0.68743 |
| 9960 | 569600720254541824 | negative | 1.0 | NEGATIVE | 0.998992 |
| 10086 | 569541291467522048 | negative | 1.0 | NEGATIVE | 0.999313 |


The default model in use is [`distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). As the sample above shows, it only produces `POSITIVE` and `NEGATIVE` labels: SST-2 is a binary sentiment task, so there is no neutral class, and every neutral tweet in the sample ends up labeled `NEGATIVE`.
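
To make the missing neutral class easier to see, the original and predicted labels can be tallied side by side. This is a small sketch using only the standard library, with the label pairs copied by hand from the ten sampled rows shown above. The variable names are illustrative.

```python
from collections import Counter

# (airline_sentiment, tfm_classification) pairs copied from the sample table above.
sample = [
    ("neutral", "NEGATIVE"), ("negative", "NEGATIVE"), ("negative", "NEGATIVE"),
    ("negative", "NEGATIVE"), ("neutral", "NEGATIVE"), ("negative", "NEGATIVE"),
    ("neutral", "NEGATIVE"), ("positive", "NEGATIVE"), ("negative", "NEGATIVE"),
    ("negative", "NEGATIVE"),
]

pair_counts = Counter(sample)
predicted_labels = {predicted for _, predicted in sample}

for (original, predicted), count in sorted(pair_counts.items()):
    print(f"{original:>8} -> {predicted}: {count}")
print("Labels produced by the model:", sorted(predicted_labels))
```

On the full data set the same tally could be produced with a pandas cross-tabulation of the two columns.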

Let's now try one of the most downloaded models available on Hugging Face: [`cardiffnlp/twitter-xlm-roberta-base-sentiment`](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment) (an XLM-RoBERTa model).

The next step requires the `sentencepiece` package. The commands below were used to build and install it on a Mac.

```shell
brew install cmake
wget https://files.pythonhosted.org/packages/aa/71/bb7d64dcd80a6506146397bca7310d5a8684f0f9ef035f03affb657f1aec/sentencepiece-0.1.96.tar.gz
brew install pkgconfig
pip -v install  sentencepiece-0.1.96.tar.gz
```

After the above steps, the kernel must be restarted for the model to load properly.

_Sources followed:_
* [Add Mac M1 Compatibility](https://github.com/google/sentencepiece/issues/608#issuecomment-1158367943)
* [ValueError: Couldn't instantiate the backend tokenizer while loading model tokenizer #9750](https://github.com/huggingface/transformers/issues/9750#issuecomment-766862107)

In [2]:
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
cdiff_model = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

2022-07-07 16:11:02.327695: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-07-07 16:11:02.327853: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Metal device set to: Apple M1 Pro

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB



All model checkpoint layers were used when initializing TFXLMRobertaForSequenceClassification.

All the layers of TFXLMRobertaForSequenceClassification were initialized from the model checkpoint at cardiffnlp/twitter-xlm-roberta-base-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForSequenceClassification for predictions without further training.


In [9]:
for index, row in old_tweets[~old_tweets['tfm_classification'].isna()].iterrows():
    rbt_result = cdiff_model(row.text)[0]
    old_tweets.loc[index, 'roberta_classification'] = rbt_result['label']
    old_tweets.loc[index, 'roberta_score'] = rbt_result['score']


In [10]:
old_tweets[~old_tweets['tfm_classification'].isna()][['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence', 'tfm_classification', 'tfm_score', 'roberta_classification', 'roberta_score']]

| | tweet_id | airline_sentiment | airline_sentiment_confidence | tfm_classification | tfm_score | roberta_classification | roberta_score |
|---|---|---|---|---|---|---|---|
| 925 | 570009743160254464 | neutral | 1.0 | NEGATIVE | 0.998697 | Negative | 0.515928 |
| 1157 | 569911515106582528 | negative | 1.0 | NEGATIVE | 0.998926 | Negative | 0.925066 |
| 2680 | 568958107205783554 | negative | 1.0 | NEGATIVE | 0.999817 | Negative | 0.940675 |
| 3423 | 568454386617356291 | negative | 1.0 | NEGATIVE | 0.999613 | Negative | 0.941453 |
| 4877 | 569670671695011840 | neutral | 0.6801 | NEGATIVE | 0.991647 | Neutral | 0.655823 |
| 5409 | 569132950006185985 | negative | 1.0 | NEGATIVE | 0.997139 | Negative | 0.83757 |
| 6673 | 567737625637687296 | neutral | 0.6619 | NEGATIVE | 0.984052 | Neutral | 0.737134 |
| 8107 | 568782407698149376 | positive | 1.0 | NEGATIVE | 0.68743 | Positive | 0.884003 |
| 9960 | 569600720254541824 | negative | 1.0 | NEGATIVE | 0.998992 | Negative | 0.839267 |
| 10086 | 569541291467522048 | negative | 1.0 | NEGATIVE | 0.999313 | Negative | 0.94151 |


The XLM-RoBERTa model closely matches the manually classified tweets above: 9 of the 10 sampled tweets receive the same label. In particular, it can output a neutral label, which the first model we tried could not produce.
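
The comparison can be put into numbers by computing how often each model agrees with the original label. The sketch below does this for the ten sampled rows, with the labels copied by hand from the table above. The `agreement` helper is a hypothetical name, and ten rows are far too few to draw conclusions about the full data set.

```python
# (airline_sentiment, tfm_classification, roberta_classification) from the table above.
rows = [
    ("neutral", "NEGATIVE", "Negative"),
    ("negative", "NEGATIVE", "Negative"),
    ("negative", "NEGATIVE", "Negative"),
    ("negative", "NEGATIVE", "Negative"),
    ("neutral", "NEGATIVE", "Neutral"),
    ("negative", "NEGATIVE", "Negative"),
    ("neutral", "NEGATIVE", "Neutral"),
    ("positive", "NEGATIVE", "Positive"),
    ("negative", "NEGATIVE", "Negative"),
    ("negative", "NEGATIVE", "Negative"),
]


def agreement(pairs):
    """Fraction of (original, predicted) pairs whose labels match, ignoring case."""
    pairs = list(pairs)
    return sum(original.lower() == predicted.lower() for original, predicted in pairs) / len(pairs)


distilbert_agreement = agreement((original, tfm) for original, tfm, _ in rows)
roberta_agreement = agreement((original, roberta) for original, _, roberta in rows)
print(f"DistilBERT: {distilbert_agreement:.0%}, XLM-RoBERTa: {roberta_agreement:.0%}")
```

Labels are compared in lowercase because the three sources use different casing (`negative`, `NEGATIVE`, `Negative`).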

## Sentiment Analysis on fresh tweets

Below, some sample tweets are obtained from the Twitter API and run through the classifier.

From the original data we can check which airlines are available to get an idea of how we should retrieve data from the API.

In [13]:
old_tweets[['airline', 'tweet_id']].groupby('airline', as_index=False).count().sort_values('tweet_id', ascending=False)

| | airline | tweet_id (count) |
|---|---|---|
| 4 | United | 3822 |
| 3 | US Airways | 2913 |
| 0 | American | 2759 |
| 2 | Southwest | 2420 |
| 1 | Delta | 2222 |
| 5 | Virgin America | 504 |
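
One possible way to use this list when retrieving data is to build one search query per airline. The sketch below only assembles query strings and makes no API call. The `build_query` helper is hypothetical, the search terms are assumed to be the plain airline names from the table, and `lang:` and `-is:retweet` are standard Twitter API v2 search operators.

```python
# Airline names taken from the counts table above.
airlines = ["United", "US Airways", "American", "Southwest", "Delta", "Virgin America"]


def build_query(airline, lang="en"):
    """Build a search query: exact airline name, a single language, no retweets."""
    return f'"{airline}" lang:{lang} -is:retweet'


queries = {airline: build_query(airline) for airline in airlines}
for airline, query in queries.items():
    print(f"{airline}: {query}")
```

Plain names such as "United" or "American" will match many unrelated tweets, so in practice the search terms would likely need to be narrowed before collecting data.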


## References

* [pandas create new column based on values from other columns / apply a function of multiple columns, row-wise](https://stackoverflow.com/a/46570641/3211335)
* [Creating an empty Pandas DataFrame, then filling it?](https://stackoverflow.com/a/56746204/3211335)
* [How to iterate over rows in a DataFrame in Pandas](https://stackoverflow.com/a/16476974/3211335)