# What to expect 🤔

In this notebook, we run inference on our two tweet datasets (Messi and Ronaldo) with the sentiment analysis model trained [here](https://www.kaggle.com/code/ibrahimserouis99/twitter-sentiment-analysis).

# Libraries

In [1]:
import numpy as np 
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import load_model

2022-04-16 10:14:57.583733: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0


# Exploring the datasets

## Messi

In [2]:
dataset_messi = pd.read_csv("../input/twitter-sentiment-analysis-and-word-embeddings/Cleaned_messi_tweets.csv", encoding="latin")
dataset_messi.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count
0,1514503439291543554,1504674335533187078,when did this happened messi,en,2022-04-14T07:18:52.000Z,Twitter for Android,-1,1,1,0
1,1514503399462432770,1437183590781964290,highest ranking active players on my all time ...,en,2022-04-14T07:18:43.000Z,Twitter for iPhone,-1,0,0,0
2,1514503259762958336,819679110,don t get me wrong mate i know he was good but...,en,2022-04-14T07:18:09.000Z,Twitter for iPhone,-1,0,0,0
3,1514503228221702144,1012688452863909891,messi fans have no shame,en,2022-04-14T07:18:02.000Z,Twitter for Android,-1,0,0,0
4,1514503144478322688,1309496212345810944,messi has to pay now 50m to play,en,2022-04-14T07:17:42.000Z,Twitter for Android,-1,0,1,0


### Check for duplicate tweets

In [3]:
assert len(np.unique(dataset_messi["tweet_id"])) == len(dataset_messi.index), "Duplicate IDs !"
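
The assertion above stops the notebook if any `tweet_id` repeats, but it does not say which rows are involved. A minimal sketch, on hypothetical toy data (the `tweets` frame below is made up for illustration), of how pandas can list and drop such duplicates instead:

```python
import pandas as pd

# Hypothetical toy data: the tweet_id 2 appears twice.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "content": ["a", "b", "b again", "c"],
})

# Rows whose tweet_id was already seen earlier in the frame
duplicates = tweets[tweets.duplicated(subset="tweet_id", keep="first")]
print(duplicates)

# Keep only the first occurrence of every tweet_id
deduplicated = tweets.drop_duplicates(subset="tweet_id", keep="first")
assert deduplicated["tweet_id"].is_unique, "Duplicate IDs!"
```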

### Drop N/A values

In [4]:
print(f"Number of N/A: \n{dataset_messi.isna().sum()}")
print(f"\nDropping N/A values...")
# inplace=True modifies the original DataFrame instead of returning a copy
dataset_messi.dropna(inplace=True)
print(f"\n\nNumber N/A after dropping: \n{dataset_messi.isna().sum()}")

Number of N/A: 
tweet_id         0
author_id        0
content          0
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64

Dropping N/A values...


Number N/A after dropping: 
tweet_id         0
author_id        0
content          0
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64


## Ronaldo

In [5]:
dataset_ronaldo = pd.read_csv("../input/twitter-sentiment-analysis-and-word-embeddings/Cleaned_ronaldo_tweets.csv", encoding="latin")
dataset_ronaldo.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count
0,1514503408820051969,1406215235829182470,dressing room source at ajax there s been crit...,en,2022-04-14T07:18:45.000Z,Twitter Web App,-1,0.0,0.0,0.0
1,1514503399462432770,1437183590781964290,highest ranking active players on my all time ...,en,2022-04-14T07:18:43.000Z,Twitter for iPhone,-1,0.0,0.0,0.0
2,1514503356764418051,1215836950931705856,i personally feel that cdm is the most importa...,en,2022-04-14T07:18:33.000Z,Twitter for Android,-1,0.0,0.0,0.0
3,1514503333112864771,1201535456141168640,timber that s it these are the three essential...,en,2022-04-14T07:18:27.000Z,Twitter for Android,-1,0.0,0.0,0.0
4,1514503191563579394,1080591523715129344,a quick reminder that cristiano ronaldo is the...,en,2022-04-14T07:17:53.000Z,Twitter for iPhone,-1,0.0,0.0,0.0


### Check for duplicate tweets

In [6]:
assert len(np.unique(dataset_ronaldo["tweet_id"])) == len(dataset_ronaldo.index), "Duplicate IDs!"

### Drop N/A values

In [7]:
print(f"Number of N/A: \n{dataset_ronaldo.isna().sum()}")
print(f"\nDropping N/A values...")
# inplace=True modifies the original DataFrame instead of returning a copy
dataset_ronaldo.dropna(inplace=True)
print(f"\n\nNumber N/A after dropping: \n{dataset_ronaldo.isna().sum()}")

Number of N/A: 
tweet_id         0
author_id        0
content          6
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64

Dropping N/A values...


Number N/A after dropping: 
tweet_id         0
author_id        0
content          0
lang             0
date             0
source           0
geo              0
retweet_count    0
like_count       0
quote_count      0
dtype: int64
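
`dropna()` without arguments removes a row as soon as any column is missing. Here only `content` has gaps, so the result is the same, but the drop can also be restricted to the text column. A small sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one tweet has no text, another has no source.
tweets = pd.DataFrame({
    "content": ["great goal", np.nan, "what a miss"],
    "source": ["Twitter Web App", "Twitter for iPhone", np.nan],
})

# Drop only the rows where the tweet text itself is missing,
# then rebuild a contiguous 0..n-1 index.
cleaned = tweets.dropna(subset=["content"]).reset_index(drop=True)
print(cleaned)
```

Resetting the index also keeps label-based lookups such as `X[i]` in line with row positions.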


# Load the model

In [8]:
model = load_model("../input/twitter-sentiment-analysis-and-word-embeddings/TSA_model_v4")
model.summary()

2022-04-16 10:15:04.316640: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-04-16 10:15:04.319592: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-04-16 10:15:04.377923: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-04-16 10:15:04.378713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla P100-PCIE-16GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 15.90GiB deviceMemoryBandwidth: 681.88GiB/s
2022-04-16 10:15:04.378783: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-04-16 10:15:04.409277: I tensorflow/stream_executor/platform/def

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 53)                0         
_________________________________________________________________
sequential (Sequential)      (None, 1)                 27859389  
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
Total params: 27,859,389
Trainable params: 27,859,389
Non-trainable params: 0
_________________________________________________________________


# Set the threshold

For details on the threshold, refer to [this notebook](https://github.com/Justsecret123/Twitter-sentiment-analysis/blob/main/Notebook/twitter-sentiment-analysis.ipynb).

In [9]:
threshold = 0.625

# Run inferences

## Messi dataset

In [10]:
X_messi = dataset_messi["content"]
X_messi.head(10)

0                         when did this happened messi
1    highest ranking active players on my all time ...
2    don t get me wrong mate i know he was good but...
3                             messi fans have no shame
4                     messi has to pay now 50m to play
5                                         you re mines
6    nigeria must cease being a killing field peter...
7                    prime ronaldo and messi instantly
8    karim benzema joins lionel messi and cristiano...
9    how many goal does the finished messi have 3 d...
Name: content, dtype: object

### Predict the scores

In [11]:
predictions_messi = model.predict(X_messi)

2022-04-16 10:15:31.341168: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-04-16 10:15:32.060244: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-04-16 10:15:32.117746: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8


### Display some results

In [12]:
# .iloc selects by position, which matches the order of the predictions
for i in range(10):
    print(f"Tweet: {X_messi.iloc[i]} ||| Score : {predictions_messi[i]}")

Tweet: when did this happened messi ||| Score : [0.7044735]
Tweet: highest ranking active players on my all time top 75 for now 1 l messi 5th 2 c ronaldo 8th 3 a iniesta 31st 4 g buffon 34th 5 neymar jr 43rd 6 l modri 62nd 7 k de bruyne 67th 8 r lewandowski 68th 9 z ibrahimovi 70th 10 m neuer 72nd ||| Score : [0.5355405]
Tweet: don t get me wrong mate i know he was good but for me messi trumps everyone that fella is insane ||| Score : [0.619831]
Tweet: messi fans have no shame ||| Score : [0.7156492]
Tweet: messi has to pay now 50m to play ||| Score : [0.7179817]
Tweet: you re mines ||| Score : [0.71386266]
Tweet: nigeria must cease being a killing field peter obi plateau benue massacre news benuemassacre nigeria peterobi plateau bbnaija messi wizkid davido ||| Score : [0.5798119]
Tweet: prime ronaldo and messi instantly ||| Score : [0.7241524]
Tweet: karim benzema joins lionel messi and cristiano ronaldo in accomplishing champions league feat football ||| Score : [0.70077986]
Tweet: h

## Ronaldo dataset

In [13]:
X_ronaldo = dataset_ronaldo["content"]
X_ronaldo.head(5)

0    dressing room source at ajax there s been crit...
1    highest ranking active players on my all time ...
2    i personally feel that cdm is the most importa...
3    timber that s it these are the three essential...
4    a quick reminder that cristiano ronaldo is the...
Name: content, dtype: object

In [14]:
predictions_ronaldo = model.predict(X_ronaldo)

### Display some results

In [15]:
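# .iloc selects by position, which matches the order of the predictions
# (the index has gaps here because rows with N/A content were dropped)
for i in range(10):
    print(f"Tweet: {X_ronaldo.iloc[i]} ||| Score : {predictions_ronaldo[i]}")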

Tweet: dressing room source at ajax there s been criticism of him ten hag for playing blind just because of the surname so we re curious to see how he handles cristiano ronaldo and co mufc ||| Score : [0.68055373]
Tweet: highest ranking active players on my all time top 75 for now 1 l messi 5th 2 c ronaldo 8th 3 a iniesta 31st 4 g buffon 34th 5 neymar jr 43rd 6 l modri 62nd 7 k de bruyne 67th 8 r lewandowski 68th 9 z ibrahimovi 70th 10 m neuer 72nd ||| Score : [0.5355405]
Tweet: i personally feel that cdm is the most important position after we buy a cdm that will ultimately determine the budget for the rest of the positions ronaldo is unlikely to leave if cavani leaves we still can bring a youth into the system ||| Score : [0.5631666]
Tweet: timber that s it these are the three essential transfer mufc needs to pull off this summer to give a fight in domestic competitions and release players with huge salary amp free up funds for them cr7 should not be at ot next ssn ||| Score : [0.685

# Assign the predictions to the dataset

## Messi

In [16]:
dataset_messi["prediction"] = predictions_messi
dataset_messi.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction
0,1514503439291543554,1504674335533187078,when did this happened messi,en,2022-04-14T07:18:52.000Z,Twitter for Android,-1,1,1,0,0.704473
1,1514503399462432770,1437183590781964290,highest ranking active players on my all time ...,en,2022-04-14T07:18:43.000Z,Twitter for iPhone,-1,0,0,0,0.535541
2,1514503259762958336,819679110,don t get me wrong mate i know he was good but...,en,2022-04-14T07:18:09.000Z,Twitter for iPhone,-1,0,0,0,0.619831
3,1514503228221702144,1012688452863909891,messi fans have no shame,en,2022-04-14T07:18:02.000Z,Twitter for Android,-1,0,0,0,0.715649
4,1514503144478322688,1309496212345810944,messi has to pay now 50m to play,en,2022-04-14T07:17:42.000Z,Twitter for Android,-1,0,1,0,0.717982
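
The model outputs one score per tweet with shape `(None, 1)`, which is why the scores printed earlier appear wrapped in brackets. A small sketch, using made-up scores in place of real model output, of flattening such an array before storing it as a column:

```python
import numpy as np
import pandas as pd

# Made-up scores standing in for model.predict output of shape (n, 1)
fake_predictions = np.array([[0.70], [0.54], [0.62]], dtype=np.float32)

# Non-contiguous index, as left behind by dropna()
tweets = pd.DataFrame({"content": ["a", "b", "c"]}, index=[0, 2, 5])

# ravel() turns (n, 1) into (n,); a NumPy array is assigned by position,
# so the gaps in the index do not matter.
tweets["prediction"] = fake_predictions.ravel()
print(tweets)
```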


## Ronaldo

In [17]:
dataset_ronaldo["prediction"] = predictions_ronaldo
dataset_ronaldo.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction
0,1514503408820051969,1406215235829182470,dressing room source at ajax there s been crit...,en,2022-04-14T07:18:45.000Z,Twitter Web App,-1,0.0,0.0,0.0,0.680554
1,1514503399462432770,1437183590781964290,highest ranking active players on my all time ...,en,2022-04-14T07:18:43.000Z,Twitter for iPhone,-1,0.0,0.0,0.0,0.535541
2,1514503356764418051,1215836950931705856,i personally feel that cdm is the most importa...,en,2022-04-14T07:18:33.000Z,Twitter for Android,-1,0.0,0.0,0.0,0.563167
3,1514503333112864771,1201535456141168640,timber that s it these are the three essential...,en,2022-04-14T07:18:27.000Z,Twitter for Android,-1,0.0,0.0,0.0,0.68512
4,1514503191563579394,1080591523715129344,a quick reminder that cristiano ronaldo is the...,en,2022-04-14T07:17:53.000Z,Twitter for iPhone,-1,0.0,0.0,0.0,0.676029


# Convert predictions to labels

## Define the processing function

In [18]:
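def assign_label(x):
    """
    Convert a prediction score into a sentiment label.

    Parameters
    ----------
    x : float
        The prediction score.

    Returns
    -------
    label : str
        The sentiment of the tweet: "Positive" or "Negative".
    """
    # The score is rounded to 2 decimals before the comparison
    rounded_x = round(x, 2)
    # Note: this hard-coded cutoff (0.593) is what gets applied,
    # not the `threshold` variable (0.625) defined earlier
    if rounded_x >= 0.593:
        return "Positive"
    return "Negative"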
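
For larger datasets, the same rule can be applied to a whole column at once. A minimal sketch with NumPy, assuming the same 2-decimal rounding and the same 0.593 cutoff as `assign_label`; the scores are taken from the outputs shown earlier, and `CUTOFF` is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

CUTOFF = 0.593  # same cutoff as in assign_label

# Scores taken from the first rows of the outputs above
scores = pd.Series([0.704473, 0.535541, 0.619831, 0.563167])

# Round to 2 decimals, then compare every score with the cutoff in one pass
labels = np.where(scores.round(2) >= CUTOFF, "Positive", "Negative")
print(labels)
```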

## Application : Messi Dataset

In [19]:
# Create a column named "label" from the prediction scores
dataset_messi["label"] = dataset_messi["prediction"].apply(assign_label)
dataset_messi.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction,label
0,1514503439291543554,1504674335533187078,when did this happened messi,en,2022-04-14T07:18:52.000Z,Twitter for Android,-1,1,1,0,0.704473,Positive
1,1514503399462432770,1437183590781964290,highest ranking active players on my all time ...,en,2022-04-14T07:18:43.000Z,Twitter for iPhone,-1,0,0,0,0.535541,Negative
2,1514503259762958336,819679110,don t get me wrong mate i know he was good but...,en,2022-04-14T07:18:09.000Z,Twitter for iPhone,-1,0,0,0,0.619831,Positive
3,1514503228221702144,1012688452863909891,messi fans have no shame,en,2022-04-14T07:18:02.000Z,Twitter for Android,-1,0,0,0,0.715649,Positive
4,1514503144478322688,1309496212345810944,messi has to pay now 50m to play,en,2022-04-14T07:17:42.000Z,Twitter for Android,-1,0,1,0,0.717982,Positive


## Application : Ronaldo Dataset

In [20]:
# Create a column named "label" from the prediction scores
dataset_ronaldo["label"] = dataset_ronaldo["prediction"].apply(assign_label)
dataset_ronaldo.head(5)

Unnamed: 0,tweet_id,author_id,content,lang,date,source,geo,retweet_count,like_count,quote_count,prediction,label
0,1514503408820051969,1406215235829182470,dressing room source at ajax there s been crit...,en,2022-04-14T07:18:45.000Z,Twitter Web App,-1,0.0,0.0,0.0,0.680554,Positive
1,1514503399462432770,1437183590781964290,highest ranking active players on my all time ...,en,2022-04-14T07:18:43.000Z,Twitter for iPhone,-1,0.0,0.0,0.0,0.535541,Negative
2,1514503356764418051,1215836950931705856,i personally feel that cdm is the most importa...,en,2022-04-14T07:18:33.000Z,Twitter for Android,-1,0.0,0.0,0.0,0.563167,Negative
3,1514503333112864771,1201535456141168640,timber that s it these are the three essential...,en,2022-04-14T07:18:27.000Z,Twitter for Android,-1,0.0,0.0,0.0,0.68512,Positive
4,1514503191563579394,1080591523715129344,a quick reminder that cristiano ronaldo is the...,en,2022-04-14T07:17:53.000Z,Twitter for iPhone,-1,0.0,0.0,0.0,0.676029,Positive


# Save the results
> Note: if you plan to use the dataset, feel free to check and correct the labels manually to bring them closer to human-level accuracy.

## Messi 

In [21]:
dataset_messi.to_csv("Predictions_messi.csv", index=False)

## Ronaldo

In [22]:
dataset_ronaldo.to_csv("Predictions_ronaldo.csv", index=False)
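
If you do review the saved predictions by hand, one way to focus the effort is to start with the tweets whose score sits close to the cutoff, since those labels are the most likely to flip. A sketch on made-up data, assuming the 0.593 cutoff used in `assign_label` and a hypothetical margin of 0.03:

```python
import pandas as pd

CUTOFF = 0.593   # cutoff used in assign_label
MARGIN = 0.03    # hypothetical width of the "uncertain" band

# Made-up rows standing in for a predictions dataset
results = pd.DataFrame({
    "content": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "prediction": [0.72, 0.60, 0.58, 0.45],
})

# Keep the rows whose score lies within MARGIN of the cutoff
distance = (results["prediction"] - CUTOFF).abs()
to_review = results[distance <= MARGIN].sort_values("prediction")
print(to_review)
```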