<a href="https://colab.research.google.com/github/Snack-ary/Snack-ary/blob/main/Twitter_Sentiment_Analysis(ML).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import spacy
from transformers import AutoTokenizer
from scipy.special import softmax
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Stop words

In [None]:
# Print the English stop words
print(stopwords.words('english'))
# Stop words are common words in a language that carry little contextual meaning.

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
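
As a small illustration of what removing stop words does, the sketch below filters a sentence against a stop-word set. To stay self-contained it uses a tiny hand-picked subset (`SAMPLE_STOPWORDS`, a hypothetical name) of the list printed above; in the notebook one would presumably pass `set(stopwords.words('english'))` instead.

```python
# Hypothetical, hand-picked subset of the English stop words printed above.
SAMPLE_STOPWORDS = {"i", "you", "what", "is", "a", "an", "the", "with", "this"}

def remove_stopwords(sentence, stop_words):
    """Return the sentence with stop words dropped (case-insensitive match)."""
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    return " ".join(kept)

filtered = remove_stopwords("what you need right now", SAMPLE_STOPWORDS)
print(filtered)  # need right now
```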

# Data Processing

**DP method 1**

In [None]:
# Load data from csv url to Pandas Dataframe
url = 'https://yunhefeng.me/test/tweet_10000.csv'
#raw = pd.read_csv(url, encoding = "ISO-8859-1", names='Tweets', on_bad_lines='skip')
raw = pd.read_csv(url, header=None, on_bad_lines='skip')
df = pd.DataFrame(raw)
df.columns = ['Tweets']
# Dataset is now stored in a Pandas Dataframe

In [None]:
# First 12 rows in the dataframe
raw.head(12)

Unnamed: 0,Tweets
0,RT @ankitasood13: @Sonamshr1990 Wow with an em...
1,nah pelangi tanda menghargai pastu bagi emoji ...
2,RT @dreamyflames_: what you need right now ? ...
3,RT @Alana_Dream17: 🌿NEW PINNED TWEET!🌿 ☮️ RT ...
4,RT @Nikki_Squirtz: 💦NEW PINNED TWEET 💦 RT &am...
5,RT @JrWave19: We really turned “😭” into a laug...
6,How much you like these about yourself? 1. 7 ...
7,RT @SouthamptonFC: Describe this goal using an...
8,RT @rainygukkie: HES LITERALLY THE 🥺🥺 EMOJI ht...
9,How much do you like these about yourself? 1....


In [None]:
# Check number of rows and columns
raw.shape

(8278, 1)

In [None]:
# Check the type of data
raw.dtypes

0    object
dtype: object

*After this processing method we have 8279 rows of data out of 10,000. This may be unsatisfactory for our goals, since more than 10% of the data is unaccounted for.*

*Moreover, the data seems to have only one field, meaning that we still need to tokenize/parse each line into the appropriate number of fields.*
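
One way to see where the missing rows go is to count how many fields each raw line parses into. The sketch below does this with the standard `csv` module on a made-up three-line sample (not the real dataset). It assumes, without having verified it against the actual file, that the skipped lines are ones where unquoted commas produce more fields than the first line has.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the real CSV; the second line has an
# unquoted comma, so it parses into two fields instead of one.
sample = io.StringIO(
    "RT @user_a: hello world\n"
    "wow, what an emoji\n"
    '"quoted, so still one field"\n'
)

# Count how many lines parse into each number of fields.
field_counts = Counter(len(row) for row in csv.reader(sample))
print(field_counts)
```

If most lines report one field and a minority report more, those extra-field lines are likely the ones that `on_bad_lines='skip'` drops.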

**Tokenizing/Parsing**

The code below attempts to split the DataFrame into two columns, user and text (tweet). However, this is futile, because not all tweets contain the delimiter that marks the beginning of the tweet text. As a result, a tweet without a username is mistaken for a user.

The goal now is to remove "RT @-----..." entirely from the tweets that have it, leaving only the tweet text.

In [None]:
# Splitting the Tweets column into two separate columns using ': ' as the delimiter
#df[['User', 'Tweet']] = df['Tweets'].str.split(': ', n=1, expand=True)
# Drop the original Tweets column
#df = df.drop('Tweets', axis=1)
#df.head(50)
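
To illustrate the problem and the stated goal, the sketch below first shows how splitting on `': '` handles a tweet that has no `RT @user:` prefix, then strips the prefix with a regular expression only when it is present. The sample tweets and the pattern `RT_PREFIX` are illustrative assumptions, not the notebook's code.

```python
import re

# Illustrative tweets: one retweet and one plain tweet without a username.
tweets = [
    "RT @dreamyflames_: what you need right now ? 6th emoji",
    "HES LITERALLY THE EMOJI",
]

# Without the delimiter, the split returns a single piece, so the whole
# tweet would land in the first ("User") column.
print(tweets[1].split(": ", 1))  # ['HES LITERALLY THE EMOJI']

# Assumed pattern: a leading "RT @username:" followed by optional whitespace.
RT_PREFIX = re.compile(r"^RT @\w+:\s*")

# Tweets without the prefix pass through unchanged.
cleaned = [RT_PREFIX.sub("", tweet) for tweet in tweets]
print(cleaned)
```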

Below are lines that help view and analyze the current data.

In [None]:
# how to select a specific row in the current Dataframe
row_index = 1
specific_row_iloc = raw.iloc[row_index]
print("Using iloc:")
print(specific_row_iloc)
print()

Using iloc:
Tweets    RT @dreamyflames_: what you need right now ?  ...
User                                      RT @dreamyflames_
Tweet                  what you need right now ?  6th emoji
Name: 1, dtype: object



In [None]:
# Find the number of rows and columns
df.shape


(8280, 2)

In [None]:
# finding null items
np.sum(df.isnull().any(axis=1))
rows_with_null = df[df.isnull().any(axis=1)]
print(rows_with_null)

Empty DataFrame
Columns: [Tweets]
Index: []


In [None]:
df.head()

Unnamed: 0,0
0,RT @ankitasood13: @Sonamshr1990 Wow with an em...
1,nah pelangi tanda menghargai pastu bagi emoji ...
2,RT @dreamyflames_: what you need right now ? ...
3,RT @Alana_Dream17: 🌿NEW PINNED TWEET!🌿 ☮️ RT ...
4,RT @Nikki_Squirtz: 💦NEW PINNED TWEET 💦 RT &am...


Cleaning out the user name:

In [None]:
def delete_words_starting_with(sentence, start_character):
    # Split the sentence into words
    words = sentence.split()

    # Keep only the words that don't start with the specified character
    filtered_words = [word for word in words if not word.startswith(start_character)]

    # Join the remaining words back into a sentence
    result_sentence = ' '.join(filtered_words)

    return result_sentence

# Create a sample DataFrame
# data = {'Sentence': ["This is #Delete# a sample sentence.", "Remove words starting with #."]}
# df = pd.DataFrame(data)

# Prefixes to remove: user mentions, retweet markers, and URLs
# (str.startswith accepts a tuple of prefixes)
start_character = '@', 'RT', 'http'

# Apply the function to the 'Tweets' column
df['Tweets'] = df['Tweets'].apply(lambda x: delete_words_starting_with(x, start_character))

# Show the first 50 rows of the modified DataFrame
df.head(50)

Unnamed: 0,Tweets
0,Wow with an emoji !! Lovely 🥰
1,nah pelangi tanda menghargai pastu bagi emoji ...
2,what you need right now ? 6th emoji
3,🌿NEW PINNED TWEET!🌿 ☮️ my pinned + I'll urs! ☮...
4,💦NEW PINNED TWEET 💦 &amp; LIKE &amp;&amp; i wi...
5,We really turned “😭” into a laughing emoji
6,How much you like these about yourself? 1. 7 2...
7,Describe this goal using an emoji 😱 Looking ba...
8,HES LITERALLY THE 🥺🥺 EMOJI
9,How much do you like these about yourself? 1. ...


The code above has removed the user mentions, RT markers, and URL tokens from the tweets.
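
The output above still shows HTML entities such as `&amp;`. As a possible extra cleaning step (not part of the original notebook), the standard library's `html.unescape` converts them back to plain characters. The sample string below is illustrative.

```python
import html

# Illustrative string containing the same kind of entity seen in the output.
raw_text = "NEW PINNED TWEET &amp; LIKE &amp;&amp; follow"

# Convert HTML entities (e.g. &amp;) back to the characters they encode.
unescaped = html.unescape(raw_text)
print(unescaped)  # NEW PINNED TWEET & LIKE && follow
```

In the notebook this could be applied with `df['Tweets'].apply(html.unescape)`, assuming every value in the column is a string.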

In [None]:
df.shape

(8280, 1)


Translate the entire DataFrame to English.

In [None]:
import os

import pandas as pd
from google.cloud import translate_v2 as translate

# Path to the Google Cloud service-account key file
KEY_PATH = "/content/google_key.json"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = KEY_PATH

# Create the client once and reuse it, instead of once per translated row
client = translate.Client.from_service_account_json(KEY_PATH)

def translate_text(text, target_language="en"):
    # Translate a single string and return the translated text
    result = client.translate(text, target_language=target_language)
    return result["translatedText"]

def translate_dataframe(frame, output_csv):
    # Translate every text (object) column and replace the original column
    for column in frame.select_dtypes(include=['object']).columns:
        frame[column] = frame[column].apply(translate_text)

    # Save the translated data as a CSV file
    frame.to_csv(output_csv, index=False)

if __name__ == "__main__":
    # Output file for the translated data
    output_csv = "/content/output.csv"

    # Translate the already-cleaned DataFrame `df` and save the result
    translate_dataframe(df, output_csv)

In [None]:
df.head()


Unnamed: 0,Tweets
0,Wow with an emoji !! Lovely 🥰
1,"Well, the rainbow is a sign of appreciation fo..."
2,what you need right now ? 6th emoji
3,🌿NEW PINNED TWEET!🌿 ☮️ my pinned + I'll urs! ☮...
4,💦NEW PINNED TWEET 💦 &amp; LIKE &amp;&amp; i wi...
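
Because the translation function is called once per row, repeated tweets are sent to the API repeatedly. The sketch below shows one possible way to avoid that by caching results with `functools.lru_cache`. Here `fake_translate` is a hypothetical stand-in for the real translation call (it only upper-cases its input), so the example runs without credentials.

```python
from functools import lru_cache

calls = []  # records every string actually sent to the translator

def fake_translate(text):
    """Hypothetical stand-in for a real translation call; it only upper-cases."""
    calls.append(text)
    return text.upper()

@lru_cache(maxsize=None)
def translate_cached(text):
    # Identical strings hit the cache instead of calling the translator again.
    return fake_translate(text)

tweets = ["hola", "bonjour", "hola", "hola"]
translated = [translate_cached(t) for t in tweets]

print(translated)  # ['HOLA', 'BONJOUR', 'HOLA', 'HOLA']
print(len(calls))  # 2
```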


**Tokenizing/Parsing 2**

In [None]:
nlp = spacy.load('en_core_web_sm')

# Tokenize each row
df['tokenized_text'] = df['Tweets'].apply(lambda x: [token.text for token in nlp(x)])


Tokenized Tweets Column

In [None]:
df1 = df.iloc[:, [1]]
df1.head()

Unnamed: 0,tokenized_text
0,"[Wow, with, an, emoji, !, !, Lovely, 🥰]"
1,"[Well, ,, the, rainbow, is, a, sign, of, appre..."
2,"[what, you, need, right, now, ?, 6th, emoji]"
3,"[🌿, NEW, PINNED, TWEET, !, 🌿, ☮, ️, my, pinned..."
4,"[💦, NEW, PINNED, TWEET, 💦, &, amp, ;, LIKE, &,..."
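
To make the idea of tokenizing concrete without loading a spaCy model, the sketch below splits text into word and punctuation tokens with a plain regular expression. This is a rough approximation for illustration only, not spaCy's tokenizer, and the pattern `TOKEN_PATTERN` is an assumption of this example.

```python
import re

# Assumed pattern: runs of word characters, or any single character that is
# neither a word character nor whitespace (e.g. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Wow with an emoji !! Lovely")
print(tokens)  # ['Wow', 'with', 'an', 'emoji', '!', '!', 'Lovely']
```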


In [None]:
from textblob import TextBlob
def analyze_sentiment(tweet):
    analysis = TextBlob(tweet)
    polarity = analysis.sentiment.polarity

    if polarity > 0:
        return "positive"
    elif polarity == 0:
        return "neutral"
    else:
        return "negative"

# Apply sentiment analysis to the 'Tweets' column
df['sentiment'] = df['Tweets'].apply(analyze_sentiment)
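
The labeling rule above depends only on the sign of the polarity score, which TextBlob reports in the range -1.0 to 1.0. The sketch below isolates that rule so it can be tried on plain numbers without TextBlob installed. The optional `neutral_band` parameter is a hypothetical extension (not in the notebook) that treats scores very close to zero as neutral.

```python
def label_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    With neutral_band=0.0 this reproduces the sign-based rule used above.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"

labels = [label_from_polarity(p) for p in (0.5, 0.0, -0.6)]
print(labels)  # ['positive', 'neutral', 'negative']

# With a band of 0.1, a weakly positive score is treated as neutral.
weak = label_from_polarity(0.05, neutral_band=0.1)
print(weak)  # neutral
```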

In [None]:
# Display the result
print(df[['Tweets', 'sentiment']])
positive_tweets = sum(sentiment == "positive" for sentiment in df['sentiment'])
neutral_tweets = sum(sentiment == "neutral" for sentiment in df['sentiment'])
negative_tweets = sum(sentiment == "negative" for sentiment in df['sentiment'])

total_tweets = len(df['sentiment'])

print(f"Total Tweets: {total_tweets}")
print(f"Positive Tweets: {positive_tweets} ({(positive_tweets / total_tweets) * 100:.2f}%)")
print(f"Neutral Tweets: {neutral_tweets} ({(neutral_tweets / total_tweets) * 100:.2f}%)")
print(f"Negative Tweets: {negative_tweets} ({(negative_tweets / total_tweets) * 100:.2f}%)")

                                                 Tweets sentiment
0                         Wow with an emoji !! Lovely 🥰  positive
1     Well, the rainbow is a sign of appreciation fo...   neutral
2                   what you need right now ? 6th emoji  positive
3     🌿NEW PINNED TWEET!🌿 ☮️ my pinned + I'll urs! ☮...  positive
4     💦NEW PINNED TWEET 💦 &amp; LIKE &amp;&amp; i wi...  positive
...                                                 ...       ...
8275  #BLACKPINK #TwitterBlueRoom 📺 CLIP #1 : BLACKP...  positive
8276  🌞RT my pinned and I’ll yours🌞 🌹please this pos...   neutral
8277  Help me to click on this link. Like and drop e...   neutral
8278  Fuck I didn’t delete the emoji cause I just co...  negative
8279                           Which PAUDSON emoji? LOL  positive

[8280 rows x 2 columns]
Total Tweets: 8280
Positive Tweets: 3808 (45.99%)
Neutral Tweets: 3578 (43.21%)
Negative Tweets: 894 (10.80%)
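
The same tally can be written more compactly with `collections.Counter`. The sketch below rebuilds a label list from the counts printed above (rather than from the real DataFrame) and recomputes the percentages as a check. In the notebook, `df['sentiment'].value_counts(normalize=True)` would presumably give the same proportions.

```python
from collections import Counter

# Rebuild a label list from the counts reported in the output above.
labels = ["positive"] * 3808 + ["neutral"] * 3578 + ["negative"] * 894

counts = Counter(labels)
total = sum(counts.values())

# Percentage of each label, rounded to two decimals.
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)        # 8280
print(percentages)  # {'positive': 45.99, 'neutral': 43.21, 'negative': 10.8}
```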


At the end of the project, it seems that tokenizing was unnecessary for the sentiment analysis to work, since TextBlob was applied directly to the `Tweets` column rather than to `tokenized_text`.

Results:
Out of 10,000 data points, only 8,280 were taken into account; the rest were skipped as "bad lines".
The data cleaning/prepping process consisted of:
- creating a header and column name
- finding and removing word tokens starting with "RT", "http", and "@"
- translating foreign-language text using the Google Cloud Translation API

Finally, sentiment analysis was performed with the TextBlob package, which scores the polarity of each tweet so that it can be labeled as negative, neutral, or positive.

As stated in the last coding output above:
- [8280 rows x 2 columns]
- Total Tweets: 8280
- Positive Tweets: 3808 (45.99%)
- Neutral Tweets: 3578 (43.21%)
- Negative Tweets: 894 (10.80%)
