# What to expect 🤔

In this notebook, we run inference on our tweet datasets with the sentiment analysis model trained [here](https://www.kaggle.com/code/ibrahimserouis99/twitter-sentiment-analysis), which reaches about 83% accuracy.

# Libraries 

In [1]:
import numpy as np 
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import load_model



# Exploring the datasets 🔬

## Messi

In [2]:
dataset_messi = pd.read_csv("../input/twitter-sentiment-analysis-and-word-embeddings/Cleaned_messi_tweets.csv", encoding="latin")
dataset_messi.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count
0,1523587143632703488,1292468372618326023,oh yeah and messi contributed so heavily to hi...,en,2022-05-09T08:54:16.000Z,Twitter for Android,-1,0,0,0
1,1523587126595825664,3985914141,this level of stretch is insane he literally t...,en,2022-05-09T08:54:12.000Z,Twitter for iPhone,-1,0,0,0
2,1523587117556715520,1480936861790924801,fred and messi tradeable thoughts,en,2022-05-09T08:54:10.000Z,Twitter for Android,-1,0,1,0
3,1523587113005948929,1424726769236529155,their is a reason most tickets sold in worldcu...,en,2022-05-09T08:54:09.000Z,Twitter Web App,-1,0,0,0
4,1523587088381136896,942073809464467456,ok sir i think you know messi better than mess...,en,2022-05-09T08:54:03.000Z,Twitter for Android,-1,0,0,0


### Check for duplicate tweets

In [3]:
assert len(np.unique(dataset_messi["tweet_id"])) == len(dataset_messi.index), "Duplicate IDs!"

### Drop N/A values

In [4]:
print(f"Number of N/A: \n{dataset_messi.isna().sum()}")
print("\nDropping N/A values...")
# inplace=True modifies the original DataFrame instead of returning a copy
dataset_messi.dropna(inplace=True)
print(f"\n\nNumber N/A after dropping: \n{dataset_messi.isna().sum()}")

Number of N/A: 
tweet_id         0
author_id        0
content          1
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64

Dropping N/A values...


Number N/A after dropping: 
tweet_id         0
author_id        0
content          0
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64
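
Note that `dropna` removes rows but keeps the original index labels, so the index is no longer a continuous range afterwards. This matters later when rows are matched with model outputs by position. A toy sketch of the effect, and of one way to close the gap with `reset_index` (the `toy` frame is invented for illustration and is not part of the datasets):

```python
import pandas as pd

# Toy frame standing in for the tweets; the second row has no content
toy = pd.DataFrame({"tweet_id": [1, 2, 3], "content": ["a", None, "c"]})

# Dropping the incomplete row leaves a gap in the index labels
toy = toy.dropna()
labels_after_drop = list(toy.index)
print(labels_after_drop)

# reset_index(drop=True) renumbers the rows from zero
toy = toy.reset_index(drop=True)
labels_after_reset = list(toy.index)
print(labels_after_reset)
```

Resetting the index is optional; positional access with `.iloc` is another way to stay aligned with an array of predictions.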


## Ronaldo

In [5]:
dataset_ronaldo = pd.read_csv("../input/twitter-sentiment-analysis-and-word-embeddings/Cleaned_ronaldo_tweets.csv", encoding="utf-8")
dataset_ronaldo.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count
0,1523587244572872704,1383801795823116291,where is the band leader ronaldo,en,2022-05-09T08:54:40.000Z,Twitter for Android,-1,0.0,0.0,0.0
1,1523587203351293953,1011901831411322880,all untradeable except alex telles what could ...,en,2022-05-09T08:54:30.000Z,Twitter Web App,-1,0.0,0.0,0.0
2,1523587136968003584,288628771,or cr,en,2022-05-09T08:54:14.000Z,Twitter for iPhone,-1,0.0,0.0,0.0
3,1523587113005948929,1424726769236529155,their is a reason most tickets sold in worldcu...,en,2022-05-09T08:54:09.000Z,Twitter Web App,-1,0.0,0.0,0.0
4,1523587106647560192,278038673,erik ten hag needs some constance says he want...,en,2022-05-09T08:54:07.000Z,Wildmoka,-1,5.0,19.0,0.0


### Check for duplicate tweets

In [6]:
assert len(np.unique(dataset_ronaldo["tweet_id"])) == len(dataset_ronaldo.index), "Duplicate IDs!"
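
The assertions above only look for duplicates inside each dataset. The two previews also share one tweet (ID 1523587113005948929, which mentions both players), so the datasets can overlap. A minimal sketch of how that overlap could be measured, using only the IDs visible in the two previews (`messi_ids`, `ronaldo_ids` and `shared_ids` are names introduced for illustration):

```python
# Tweet IDs copied from the two head(5) previews above
messi_ids = [
    1523587143632703488,
    1523587126595825664,
    1523587117556715520,
    1523587113005948929,
    1523587088381136896,
]
ronaldo_ids = [
    1523587244572872704,
    1523587203351293953,
    1523587136968003584,
    1523587113005948929,
    1523587106647560192,
]

# IDs that appear in both previews
shared_ids = set(messi_ids) & set(ronaldo_ids)
print(shared_ids)
```

On the full data, the same idea would be `set(dataset_messi["tweet_id"]) & set(dataset_ronaldo["tweet_id"])`.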

### Drop N/A values

In [7]:
print(f"Number of N/A: \n{dataset_ronaldo.isna().sum()}")
print("\nDropping N/A values...")
# inplace=True modifies the original DataFrame instead of returning a copy
dataset_ronaldo.dropna(inplace=True)
print(f"\n\nNumber N/A after dropping: \n{dataset_ronaldo.isna().sum()}")

Number of N/A: 
tweet_id          0
author_id         0
content          11
lang              0
date              0
source            0
geo               0
retweet_count     0
like_count        0
quote_count       0
dtype: int64

Dropping N/A values...


Number N/A after dropping: 
tweet_id         0
author_id        0
content          0
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64


# Load the model 🔌

In [8]:
model = load_model("../input/twitter-sentiment-analysis-and-word-embeddings/TSA_model_v4")
model.summary()


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 53)                0         
_________________________________________________________________
sequential (Sequential)      (None, 1)                 27859389  
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
Total params: 27,859,389
Trainable params: 27,859,389
Non-trainable params: 0
_________________________________________________________________


# Set the threshold 🛠

For details on this value, refer to [this notebook](https://github.com/Justsecret123/Twitter-sentiment-analysis/blob/main/Notebook/twitter-sentiment-analysis.ipynb).

In [9]:
threshold = 0.625

# Run inferences

## Messi dataset

In [10]:
X_messi = dataset_messi["content"]
X_messi.head(10)

0    oh yeah and messi contributed so heavily to hi...
1    this level of stretch is insane he literally t...
2                    fred and messi tradeable thoughts
3    their is a reason most tickets sold in worldcu...
4    ok sir i think you know messi better than mess...
5    so psg pays million a week to messi to keep do...
6    messi is a better goal scorer too its obvious ...
7    ronaldo for juventus games goals assists g a i...
8                messi is basically german for hoarder
9                               all these cos of messi
Name: content, dtype: object

### Perform inferences

In [11]:
predictions_messi = model.predict(X_messi)



### Display some results

In [12]:
# Use positional indexing: after dropna, the index labels may have gaps
for i in range(10):
    print(f"Tweet: {X_messi.iloc[i]} ||| Score : {predictions_messi[i]}")

Tweet: oh yeah and messi contributed so heavily to his trophies at psg ||| Score : [0.61182326]
Tweet: this level of stretch is insane he literally tried to force his way out of psg to go back to bar a and play with messi inevitably losing no even tried paying part of his own transfer fee it s a stupid premise to act like neymar doesn t respect messi ||| Score : [0.62912303]
Tweet: fred and messi tradeable thoughts ||| Score : [0.5103146]
Tweet: their is a reason most tickets sold in worldcup were that of argentina surely people dont consider watching portuguese merchant man you have no idea what messi stands for ronaldo stands for factos and two jealous ||| Score : [0.6450123]
Tweet: ok sir i think you know messi better than messi himself ||| Score : [0.57039213]
Tweet: so psg pays million a week to messi to keep doing what neymar has been doing for a while its easy to feed mbappe because he converts from nothing they brought him so they could have a suarez messi partnership but one g

## Ronaldo dataset

In [13]:
X_ronaldo = dataset_ronaldo["content"]
X_ronaldo.head(5)

0                     where is the band leader ronaldo
1    all untradeable except alex telles what could ...
2                                                or cr
3    their is a reason most tickets sold in worldcu...
4    erik ten hag needs some constance says he want...
Name: content, dtype: object

### Perform inferences

In [14]:
predictions_ronaldo = model.predict(X_ronaldo)

### Display some results

In [15]:
# Use positional indexing: after dropna, the index labels may have gaps
for i in range(10):
    print(f"Tweet: {X_ronaldo.iloc[i]} ||| Score : {predictions_ronaldo[i]}")

Tweet: where is the band leader ronaldo ||| Score : [0.7283744]
Tweet: all untradeable except alex telles what could i change in game with cr and bernardo as strikers ||| Score : [0.70669216]
Tweet: or cr ||| Score : [0.7226145]
Tweet: their is a reason most tickets sold in worldcup were that of argentina surely people dont consider watching portuguese merchant man you have no idea what messi stands for ronaldo stands for factos and two jealous ||| Score : [0.6450123]
Tweet: erik ten hag needs some constance says he wants cristiano ronaldo to stay at man united for at least one more season ||| Score : [0.5930576]
Tweet: if ronaldo leaves then we ll make it for sure ||| Score : [0.54073703]
Tweet: not cr his contract is this proof ||| Score : [0.691198]
Tweet: and don t forget ronaldo for new era with hag ||| Score : [0.7160235]
Tweet: messi is a better goal scorer too its obvious ronaldo has more goals since he played more matches ||| Score : [0.7271393]
Tweet: ronaldo for juventus gam

# Assign the predictions to the dataset 📝

## Messi

In [16]:
# model.predict returns an array of shape (n, 1); flatten it to one score per row
dataset_messi["prediction"] = predictions_messi.ravel()
dataset_messi.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction
0,1523587143632703488,1292468372618326023,oh yeah and messi contributed so heavily to hi...,en,2022-05-09T08:54:16.000Z,Twitter for Android,-1,0,0,0,0.611823
1,1523587126595825664,3985914141,this level of stretch is insane he literally t...,en,2022-05-09T08:54:12.000Z,Twitter for iPhone,-1,0,0,0,0.629123
2,1523587117556715520,1480936861790924801,fred and messi tradeable thoughts,en,2022-05-09T08:54:10.000Z,Twitter for Android,-1,0,1,0,0.510315
3,1523587113005948929,1424726769236529155,their is a reason most tickets sold in worldcu...,en,2022-05-09T08:54:09.000Z,Twitter Web App,-1,0,0,0,0.645012
4,1523587088381136896,942073809464467456,ok sir i think you know messi better than mess...,en,2022-05-09T08:54:03.000Z,Twitter for Android,-1,0,0,0,0.570392


## Ronaldo

In [17]:
# model.predict returns an array of shape (n, 1); flatten it to one score per row
dataset_ronaldo["prediction"] = predictions_ronaldo.ravel()
dataset_ronaldo.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction
0,1523587244572872704,1383801795823116291,where is the band leader ronaldo,en,2022-05-09T08:54:40.000Z,Twitter for Android,-1,0.0,0.0,0.0,0.728374
1,1523587203351293953,1011901831411322880,all untradeable except alex telles what could ...,en,2022-05-09T08:54:30.000Z,Twitter Web App,-1,0.0,0.0,0.0,0.706692
2,1523587136968003584,288628771,or cr,en,2022-05-09T08:54:14.000Z,Twitter for iPhone,-1,0.0,0.0,0.0,0.722615
3,1523587113005948929,1424726769236529155,their is a reason most tickets sold in worldcu...,en,2022-05-09T08:54:09.000Z,Twitter Web App,-1,0.0,0.0,0.0,0.645012
4,1523587106647560192,278038673,erik ten hag needs some constance says he want...,en,2022-05-09T08:54:07.000Z,Wildmoka,-1,5.0,19.0,0.0,0.593058


# Convert predictions (scores) to labels (positive or negative) ⚗️

## Define the processing function

In [18]:
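def assign_label(x):
    """
    Convert a prediction score into a sentiment label.

    Parameters
    ----------
    x : float
        The prediction score returned by the model.

    Returns
    -------
    label : str
        "Positive" if the rounded score reaches the threshold, "Negative" otherwise.
    """
    # Round to two decimals before comparing with the global threshold
    rounded_x = round(x, 2)
    if rounded_x >= threshold:
        return "Positive"
    return "Negative"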
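
As a quick sanity check, the rule can be applied to the first five Messi scores printed in the inference output. The sketch below repeats the threshold and a compact version of the function so that it runs on its own; the `scores` and `labels` lists are introduced only for illustration.

```python
threshold = 0.625  # same value as set earlier in the notebook


def assign_label(x):
    # Same rule as above: round to two decimals, then compare with the threshold
    return "Positive" if round(x, 2) >= threshold else "Negative"


# First five Messi scores from the inference output
scores = [0.61182326, 0.62912303, 0.5103146, 0.6450123, 0.57039213]
labels = [assign_label(s) for s in scores]
print(labels)
```

Only the second and fourth scores round to a value at or above 0.625, so only those two tweets are labelled positive.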

## Application : Messi Dataset

In [19]:
# Create a "label" column from the prediction scores
dataset_messi["label"] = dataset_messi["prediction"].apply(assign_label)
dataset_messi.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction,label
0,1523587143632703488,1292468372618326023,oh yeah and messi contributed so heavily to hi...,en,2022-05-09T08:54:16.000Z,Twitter for Android,-1,0,0,0,0.611823,Negative
1,1523587126595825664,3985914141,this level of stretch is insane he literally t...,en,2022-05-09T08:54:12.000Z,Twitter for iPhone,-1,0,0,0,0.629123,Positive
2,1523587117556715520,1480936861790924801,fred and messi tradeable thoughts,en,2022-05-09T08:54:10.000Z,Twitter for Android,-1,0,1,0,0.510315,Negative
3,1523587113005948929,1424726769236529155,their is a reason most tickets sold in worldcu...,en,2022-05-09T08:54:09.000Z,Twitter Web App,-1,0,0,0,0.645012,Positive
4,1523587088381136896,942073809464467456,ok sir i think you know messi better than mess...,en,2022-05-09T08:54:03.000Z,Twitter for Android,-1,0,0,0,0.570392,Negative


## Application : Ronaldo Dataset

In [20]:
# Create a "label" column from the prediction scores
dataset_ronaldo["label"] = dataset_ronaldo["prediction"].apply(assign_label)
dataset_ronaldo.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction,label
0,1523587244572872704,1383801795823116291,where is the band leader ronaldo,en,2022-05-09T08:54:40.000Z,Twitter for Android,-1,0.0,0.0,0.0,0.728374,Positive
1,1523587203351293953,1011901831411322880,all untradeable except alex telles what could ...,en,2022-05-09T08:54:30.000Z,Twitter Web App,-1,0.0,0.0,0.0,0.706692,Positive
2,1523587136968003584,288628771,or cr,en,2022-05-09T08:54:14.000Z,Twitter for iPhone,-1,0.0,0.0,0.0,0.722615,Positive
3,1523587113005948929,1424726769236529155,their is a reason most tickets sold in worldcu...,en,2022-05-09T08:54:09.000Z,Twitter Web App,-1,0.0,0.0,0.0,0.645012,Positive
4,1523587106647560192,278038673,erik ten hag needs some constance says he want...,en,2022-05-09T08:54:07.000Z,Wildmoka,-1,5.0,19.0,0.0,0.593058,Negative
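
Once every tweet has a label, one simple way to summarize a dataset is the share of positive tweets. The sketch below is only an illustration: it uses the five labels shown in each preview, which are far too few to say anything about the full datasets, and `positive_share` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Labels copied from the two head(5) previews above (five rows each)
messi_labels = ["Negative", "Positive", "Negative", "Positive", "Negative"]
ronaldo_labels = ["Positive", "Positive", "Positive", "Positive", "Negative"]


def positive_share(labels):
    """Return the fraction of labels equal to "Positive"."""
    counts = Counter(labels)
    return counts["Positive"] / len(labels)


print(positive_share(messi_labels))
print(positive_share(ronaldo_labels))
```

On the full frames, `dataset_messi["label"].value_counts(normalize=True)` gives the same kind of summary.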


# Save the results 💾
> Note: if you plan to use these datasets, feel free to check and adjust the labels manually to bring them closer to human-level accuracy.

## Messi 

In [21]:
dataset_messi.to_csv("Predictions_messi.csv", index=False)

## Ronaldo

In [22]:
dataset_ronaldo.to_csv("Predictions_ronaldo.csv", index=False)
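
Passing `index=False` keeps the row index out of the file, so reading the file back yields exactly the saved columns. A minimal round-trip sketch, assuming a two-row sample shaped like the saved predictions and an in-memory buffer instead of a real file (`sample`, `buffer` and `restored` are illustrative names):

```python
import io

import pandas as pd

# Two rows shaped like the saved predictions (values taken from the previews above)
sample = pd.DataFrame(
    {
        "tweet_id": [1523587143632703488, 1523587126595825664],
        "prediction": [0.611823, 0.629123],
        "label": ["Negative", "Positive"],
    }
)

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The columns and the large integer IDs should survive the round trip
print(list(restored.columns))
print(restored["tweet_id"].tolist() == sample["tweet_id"].tolist())
```

The same check could be run on `Predictions_messi.csv` and `Predictions_ronaldo.csv` with `pd.read_csv`.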

# Thank you for your time 😄