<a href="https://colab.research.google.com/github/Giskard-AI/demo-notebooks/blob/main/Sentiment_Analysis_for_Twitter_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading tweets at runtime

# What is sentiment analysis?
Sentiment analysis is the technique of determining the sentiment expressed in a given text, that is, whether the text is 'Positive', 'Neutral', or 'Negative'.
Imagine you have released a product and want to monitor how it is received, based on reviews, feedback, or even Twitter posts about it. Sentiment analysis comes to the rescue! Let's see it in action!

In [None]:
!pip install -q transformers tweepy wordcloud matplotlib

## To train on existing data 
I used the annotated Twitter data from Kaggle: https://www.kaggle.com/datasets/kazanova/sentiment140

In [None]:
# Read the data 
import pandas as pd
data_full = pd.read_csv('twitter_data_revised.csv', encoding='ISO-8859-1', names=['target', 'ids', 'date', 'flag', 'user', 'text'])

In [None]:
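#data = data_full.head(5000)
# Check whether a header row was read in as data (column names are passed explicitly to read_csv)
data_full[data_full['target']=='target']
# List the distinct target values present in the file
data_full['target'].unique()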

array([0])

In [None]:
data = data_full.head(1000)

In [None]:
# Preprocess text (username and link placeholders)
# Replace the User name with @user and the URL in the tweet with http for better comprehension of data for the model
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
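As a quick check of the placeholder logic, the sketch below repeats `preprocess` so that it runs on its own and applies it to a made-up tweet. The handle and link are illustrative and are not taken from the dataset.

```python
def preprocess(text):
    """Replace user mentions with '@user' and links with 'http'."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical tweet, made up for illustration
sample = "@alice loving the new phone https://t.co/abc123 !"
cleaned = preprocess(sample)
print(cleaned)  # @user loving the new phone http !
```

A lone `@` is left untouched because of the `len(t) > 1` check.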

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
import torch.nn.functional as F
import urllib.request
import csv
np.random.seed(112)


# Define pretrained tokenizer and model
task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"
tokenizer = AutoTokenizer.from_pretrained(MODEL)


model = AutoModelForSequenceClassification.from_pretrained(MODEL)



In [None]:
text = "Good night 😊"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

In [None]:
scores

array([0.00760988, 0.14581235, 0.84657764], dtype=float32)
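The `softmax` call turns the model's raw output scores (logits) into probabilities that are positive and sum to 1. A minimal pure-Python sketch of the same computation follows. The logit values are made up for illustration and are not the model's actual output, and `softmax_list` is a name introduced here.

```python
import math

def softmax_list(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the three classes
example_logits = [-2.0, 1.0, 3.0]
probs = softmax_list(example_logits)
print([round(p, 4) for p in probs])
print(sum(probs))
```

The largest logit always receives the largest probability, so the predicted class does not change; softmax only rescales the scores.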

In [None]:
ranking = np.argsort(scores)

In [None]:
# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

In [None]:
labels

['negative', 'neutral', 'positive']
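With `scores` and `labels` in hand, the classes can be listed from most to least likely. The sketch below hard-codes the values printed above so that it runs without the model, and uses plain Python sorting in place of `np.argsort`.

```python
# Values copied from the outputs shown above
scores = [0.00760988, 0.14581235, 0.84657764]
labels = ['negative', 'neutral', 'positive']

# Indices sorted by descending score (for distinct scores, this matches
# np.argsort(scores) read in reverse)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

ranked = [(labels[i], round(scores[i], 4)) for i in ranking]
for position, (label, score) in enumerate(ranked, start=1):
    print(f"{position}) {label} {score}")
```

For "Good night 😊" the top-ranked label is `positive`, with a probability of about 0.85.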

In [None]:
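def predict(data):
    # Select the text column explicitly so that a one-row DataFrame is handled
    # the same way as a larger one; a Series of strings is used as-is
    test_dataset = data['text'] if isinstance(data, pd.DataFrame) else data
    X_test = list(test_dataset.apply(preprocess))
    X_test_tokenized = tokenizer(X_test, return_tensors='pt', padding=True, truncation=True)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        output = model(**X_test_tokenized)
        scores = torch.nn.functional.softmax(output.logits, dim=-1)
        scores = scores.cpu().numpy()
    return scores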

In [None]:
feature_names = ['text']
test_df = data[feature_names][:5]
predict(test_df)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


array([[0.6341271 , 0.31541577, 0.05045717],
       [0.9328357 , 0.06181807, 0.00534626],
       [0.17255476, 0.67783666, 0.14960855],
       [0.8134728 , 0.16300653, 0.02352061],
       [0.9398792 , 0.05530898, 0.00481181]], dtype=float32)
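Each row of this output is a probability distribution over `['negative', 'neutral', 'positive']`, so the predicted class is the position of the largest value. Every row loaded here has `target == 0`, which is the negative class in Sentiment140's coding, so a rough sanity check is the share of rows predicted negative. A sketch using the five rows printed above:

```python
# Probabilities copied from the predict(test_df) output above
probs = [
    [0.6341271, 0.31541577, 0.05045717],
    [0.9328357, 0.06181807, 0.00534626],
    [0.17255476, 0.67783666, 0.14960855],
    [0.8134728, 0.16300653, 0.02352061],
    [0.9398792, 0.05530898, 0.00481181],
]
labels = ['negative', 'neutral', 'positive']

# Index of the largest probability in each row, mapped to its label
predicted = [labels[max(range(len(row)), key=row.__getitem__)] for row in probs]
print(predicted)

# Assumption: all five ground-truth targets are 0 ('negative'), based on
# data_full['target'].unique() returning array([0])
share_negative = predicted.count('negative') / len(predicted)
print(share_negative)  # 0.8
```

Five rows are far too few to judge the model; the same calculation over all of `data` would give a more meaningful figure.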

In [None]:
pip install giskard

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:

from giskard.giskard_client import GiskardClient

url = "https://dev.giskard.ai" # URL of the Giskard instance to upload to (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # Alternative Giskard URL to upload to
token = "YOUR_API_TOKEN" # Placeholder: generate your own API token in the Admin tab of the Giskard application and never commit it to a public notebook

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
#senti = client.create_project("sentimental_analysis", "Sentimental Analysis for Twitter Data", "Sentimental Analysis for Twitter Data")

# If you've already created a project with the key "sentimental_analysis", use
senti = client.get_project("sentimental_analysis")

In [None]:
column_types={       
        'target': "category",
        #"ids": "numeric",
        #"date": "numeric",
        #"flag": "text",
        #"user": "text",
        "text": "text"
    }

In [None]:
senti.upload_model_and_df(
    prediction_function=predict, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=data[['text','target']], # The dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with column names of df as keys and column types (category, numeric, text) as values
    target='target', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=['text'], # List of the feature names of prediction_function
    model_name='senti_analysis', # Name of the model
    dataset_name='twitter_data_10000', # Name of the dataset
    classification_labels=[0,1,2] # List of the classification labels of your prediction
)

Hint: "Your target variable values are numeric. It is recommended to have Human readable string as your target values to make results more understandable in Giskard."
Dataset successfully uploaded to project key 'sentimental_analysis' and is available at https://dev.giskard.ai 


HTTPError: ignored
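The hint in the output above recommends human-readable target values. Sentiment140 encodes polarity as 0 (negative), 2 (neutral), and 4 (positive), so one option is to map those codes to the label strings the model uses before uploading. A minimal sketch follows; `TARGET_TO_LABEL` and `to_label` are names introduced here for illustration.

```python
# Sentiment140 polarity codes mapped to the model's label names
TARGET_TO_LABEL = {0: 'negative', 2: 'neutral', 4: 'positive'}

def to_label(target):
    """Convert a Sentiment140 polarity code to a readable label."""
    return TARGET_TO_LABEL[int(target)]

# Hypothetical target values, for illustration only
sample_targets = [0, 4, 0, 2]
readable = [to_label(t) for t in sample_targets]
print(readable)  # ['negative', 'positive', 'negative', 'neutral']
```

On the DataFrame used here, the same mapping could be applied with `data['target'].map(TARGET_TO_LABEL)`. The `classification_labels` argument would then presumably need to list the strings in the order in which `predict` returns its columns.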

In [None]:
import tweepy

# Add your Twitter API key and secret (placeholders shown; keep real credentials out of shared notebooks)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"

# Handling authentication with Twitter
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)

# Create a wrapper for the Twitter API
# (wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [None]:
import time

# Helper function for handling pagination in our search and handling rate limits
# (written against Tweepy 3.x; Tweepy 4 renames RateLimitError to TooManyRequests
# and api.search to api.search_tweets)
def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tweepy.RateLimitError:
            print('Reached rate limit. Sleeping for >15 minutes')
            time.sleep(15 * 61)
        except StopIteration:
            break

# Define the term we will be using for searching tweets
query = '#NFTs'
query = query + ' -filter:retweets'

# Define how many tweets to get from the Twitter API 
count = 1000

# Let's search for tweets using Tweepy 
search = limit_handled(tweepy.Cursor(api.search,
                        q=query,
                        tweet_mode='extended',
                        lang='en',
                        result_type="recent").items(count))
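The `limit_handled` helper wraps the cursor in a generator that pauses and retries when the API reports a rate limit, and stops cleanly when the results run out. The sketch below exercises that pattern without network access. `FakeRateLimitError` and `FlakyCursor` are hypothetical stand-ins for Tweepy's exception and cursor, and the sleep function is injected so that no real waiting occurs.

```python
import time

class FakeRateLimitError(Exception):
    """Hypothetical stand-in for Tweepy's rate-limit exception."""

class FlakyCursor:
    """Hypothetical cursor that raises one rate-limit error, then continues."""
    def __init__(self, items, fail_at):
        self._items = list(items)
        self._pos = 0
        self._fail_at = fail_at
        self._failed = False

    def next(self):
        if self._pos == self._fail_at and not self._failed:
            self._failed = True
            raise FakeRateLimitError()
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item

def limit_handled(cursor, sleep=time.sleep, wait_seconds=15 * 61):
    """Yield items from the cursor, sleeping and retrying on rate limits."""
    while True:
        try:
            yield cursor.next()
        except FakeRateLimitError:
            sleep(wait_seconds)
        except StopIteration:
            break

waits = []  # records requested sleep durations instead of sleeping
cursor = FlakyCursor(["a", "b", "c"], fail_at=1)
results = list(limit_handled(cursor, sleep=waits.append))
print(results)  # ['a', 'b', 'c']
print(waits)    # [915]
```

No item is lost when the limit is hit: the failed call is simply retried after the wait.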

In [None]:
from transformers import pipeline

# Set up the inference pipeline using a model from the 🤗 Hub
sentiment_analysis = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")

# Let's run the sentiment analysis on each tweet
tweets = []
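for tweet in search:
    try:
        content = tweet.full_text
        sentiment = sentiment_analysis(content)
        tweets.append({'tweet': content, 'sentiment': sentiment[0]['label']})
    except Exception:
        # Skip tweets the pipeline cannot process (for example, texts longer
        # than the model's maximum sequence length)
        pass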

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
Token indices sequence length is longer than the specified maximum sequence length for this model (132 > 128). Running this sequence through the model will result in indexing errors


In [None]:
import pandas as pd

# Load the data in a dataframe
pd.set_option('max_colwidth', None)
pd.set_option('display.width', 3000) 
df = pd.DataFrame(tweets)

# Show a tweet for each sentiment 
display(df[df["sentiment"] == 'POS'].head(1))
display(df[df["sentiment"] == 'NEU'].head(1))
display(df[df["sentiment"] == 'NEG'].head(1))

Unnamed: 0,tweet,sentiment
0,Gm ☀️ It is still possible to visit my solo exhibition with amazing catalogue presented by https://t.co/N1o8bWrsSN and curated by @elishafaei ☀️\n#nft #nftart #nfts #nftart #NFTphotographers #nftcollector #exhibition #show #energy #colors #photography #art #PositiveVibes #healing https://t.co/fkXs1tREaN,POS


Unnamed: 0,tweet,sentiment
2,The story of 5200 HanfuNFTs has begun.\nWho is your favourite among them? @Hanfu_NFT\n\n#NFT #NFTs #NFTCommunity #HanFuNFT #NFTReveal #NFTGiveaway #hanfu \n\nhttps://t.co/4jedOjU4R6 @ducbuom188 @ducbuom @lngcbch26795627,NEU


Unnamed: 0,tweet,sentiment
29,"Shiba Social Club is currently undergoing rug 2.0. Shiba holders are deluded by someone they ""think"" owns the project, when in fact the original owners are still behind it (they control it and nothing will ever change). Everyone should have listened to Z. #rugpullfinder #NFTs",NEG
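Once every tweet carries a label, a simple tally gives an overall picture of sentiment for the query. A minimal sketch, assuming a `tweets` list shaped like the one built above; the entries here are invented placeholders, not real results.

```python
from collections import Counter

# Hypothetical entries with the same keys as the `tweets` list built earlier
tweets = [
    {'tweet': 'example one', 'sentiment': 'POS'},
    {'tweet': 'example two', 'sentiment': 'NEU'},
    {'tweet': 'example three', 'sentiment': 'NEU'},
    {'tweet': 'example four', 'sentiment': 'NEG'},
]

# Count tweets per label, then convert the counts to shares of the total
counts = Counter(t['sentiment'] for t in tweets)
total = sum(counts.values())
shares = {label: counts[label] / total for label in ('POS', 'NEU', 'NEG')}
print(counts)
print(shares)  # {'POS': 0.25, 'NEU': 0.5, 'NEG': 0.25}
```

Tracking these shares over time is one way to monitor how a product or topic is being received.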
