<a href="https://colab.research.google.com/github/1MuhammadFarhanAslam/ML-Projects/blob/main/Ukraine_Russia_War_Twitter_Sentiment_Analysis_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
**Many countries, including those in the West, are supporting Ukraine by imposing economic sanctions on Russia. Many tweets about the Ukraine and Russia war share updates from the ground, how people feel about it, and whom they support. In this notebook we analyze the sentiment of those tweets.**

# **Mounting Google Drive**

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# **Configure Google Colab to Kaggle through Kaggle API**

**To connect Kaggle datasets to Google Colab, you need to follow these steps:**

* Install the Kaggle library in Google Colab by running the following command:

In [None]:
!pip install kaggle

**Go to the Kaggle website (https://www.kaggle.com) and sign in to your account (or create a new account if you don't have one).**

*Navigate to the dataset you want to use in your Colab notebook.*

*Click on the "Copy API command" button below the dataset description. This will copy the command to download the dataset using the Kaggle API.*

*In your Colab notebook, import the necessary libraries and upload your Kaggle API key file (kaggle.json, which you can generate as an API token from your Kaggle account settings) by running the following code:*

In [None]:
import os
import json

# Upload your Kaggle API key file (kaggle.json) to Colab using the file upload feature
from google.colab import files
files.upload()

# Read the contents of the kaggle.json file
with open('kaggle.json', 'r') as file:
    kaggle_json = json.load(file)

# **Important about Kaggle API Security**

**The command `!chmod 600 ~/.kaggle/kaggle.json` restricts access to the kaggle.json file.**

*In Linux-based systems, including Google Colab, file permissions are commonly written as a three-digit octal number: the first digit represents the owner's permissions, the second digit represents the group's permissions, and the third digit represents other users' permissions.*

**Here's a breakdown of what chmod 600 does:**

* ***6 means the owner (the user who uploaded the kaggle.json file) has read and write permissions (4 for read plus 2 for write) but no execute permission (which would add 1). Each 0 means the group and other users have no permission to read, write, or execute the file.***

* ***Setting the permissions to 600 ensures that only the owner of the file (the user who uploaded the kaggle.json file) has read and write access, and no other users (group or others) can access or modify the file.***

* **This step is important to maintain the security of your Kaggle API key, as it contains sensitive information and should not be accessible to other users of the system.**
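
The breakdown above can be checked directly from Python. The following is a minimal sketch, assuming a POSIX system such as Colab; it uses a throwaway temporary file as a stand-in for kaggle.json (the name `key_path` is only illustrative) and decodes the permission bits with the standard `stat` module.

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for kaggle.json (illustrative stand-in only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    key_path = tmp.name

# Apply the same mode the notebook uses: owner read + write, nothing for group/others.
os.chmod(key_path, 0o600)

# Read the permission bits back and decode them.
mode = stat.S_IMODE(os.stat(key_path).st_mode)
owner_can_read = bool(mode & stat.S_IRUSR)
owner_can_write = bool(mode & stat.S_IWUSR)
others_have_access = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))

print(oct(mode), owner_can_read, owner_can_write, others_have_access)

# Clean up the temporary file.
os.remove(key_path)
```

The same check can be pointed at `~/.kaggle/kaggle.json` after the move step to confirm that the key is not readable by other users.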

In [None]:
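import shutil

# Move the saved kaggle.json file to the required directory.
# shutil.move also works when source and destination are on different filesystems.
os.makedirs('/root/.kaggle', exist_ok=True)
shutil.move('kaggle.json', '/root/.kaggle/kaggle.json')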

# Set the appropriate permissions for the Kaggle API key file
os.chmod('/root/.kaggle/kaggle.json', 0o600)

or

In [None]:
import os

# Specify the path to the kaggle.json file
kaggle_json_path = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")

# Check if the kaggle.json file already exists
if os.path.exists(kaggle_json_path):
    print("kaggle.json file already exists.")
else:
    # Move the uploaded Kaggle API key file to the required directory
    # Create the '.kaggle' directory inside the user's home directory (~).
    # The -p option creates missing parent directories and does not fail if the directory already exists.
    !mkdir -p ~/.kaggle
    # Move 'kaggle.json' from the current directory into ~/.kaggle/.
    !mv kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    print("kaggle.json file moved and permissions set successfully.")


**Verifying Kaggle API**

In [None]:
# Verify the Kaggle API is working
!kaggle datasets list

# **Downloading dataset from kaggle**

In [None]:
!kaggle datasets download --force towhidultonmoy/russia-vs-ukraine-tweets-datasetdaily-updated

**If the Kaggle API is working correctly, the copied API command (as in the cell above) downloads the dataset into the current working directory.**

* **The `-d` (`--dataset`) flag only names the dataset to download. By default, the Kaggle CLI skips the download when an up-to-date copy already exists locally. Because this dataset is updated daily, the `--force` flag is used here to download it again regardless, so that you always have the latest version.**

**The dataset will be downloaded as a ZIP file. You can unzip the file using the following code:**

In [None]:
import zipfile

# Specify the path to the ZIP file
zip_file_path = '/content/russia-vs-ukraine-tweets-datasetdaily-updated.zip'

# creating directory to unzip dataset
!mkdir -p /content/russia-vs-ukraine-tweets-datasetdaily-updated

# Specify the target directory to extract the files
target_directory = '/content/russia-vs-ukraine-tweets-datasetdaily-updated'

# Open the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the files to the target directory
    zip_ref.extractall(target_directory)

print("ZIP file extracted successfully.")

In [None]:
import os

# Specify the directory path
directory_path = '/content/russia-vs-ukraine-tweets-datasetdaily-updated'

# Create the directory if it doesn't already exist
if not os.path.exists(directory_path):
    os.makedirs(directory_path)
    print(f"Directory '{directory_path}' created successfully.")
else:
    print(f"Directory '{directory_path}' already exists.")
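
The name of the extracted CSV file is not stated at this point, so it can help to list what the archive actually contained before calling `pd.read_csv`. The sketch below is one way to do that with the standard library; `find_csv_files` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def find_csv_files(directory):
    """Return the CSV files under `directory`, sorted by path (hypothetical helper)."""
    root = Path(directory)
    # Return an empty list instead of failing when the directory is missing.
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.csv"))

# Directory used by the extraction step above.
target_directory = "/content/russia-vs-ukraine-tweets-datasetdaily-updated"

for csv_path in find_csv_files(target_directory):
    print(csv_path)
```

Any path printed here can then be passed to `pd.read_csv` in the reading step.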


Overview of the libraries used for sentiment analysis:

1. **pandas**: pandas is a powerful data manipulation and analysis library in Python. It is commonly used for tasks like data cleaning, transformation, and exploration.

2. **seaborn**: seaborn is a data visualization library built on top of matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics. Seaborn simplifies the process of creating visualizations such as scatter plots, line plots, bar plots, histograms, and more.

3. **matplotlib**: matplotlib is a widely used plotting library in Python. Matplotlib can be used to generate line plots, scatter plots, bar plots, histograms, pie charts, and many other types of visualizations.

4. **nltk.sentiment.vader**: NLTK (Natural Language Toolkit) is a popular library for natural language processing in Python. It provides various tools and resources for tasks like tokenization, stemming, tagging, parsing, and sentiment analysis. The `vader` module within NLTK implements the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis algorithm, which is specifically designed for analyzing sentiment in social media texts.

5. **wordcloud**: wordcloud is a library used for generating word clouds in Python. A word cloud is a visual representation of text data, where the size of each word corresponds to its frequency or importance. The `WordCloud` class in the wordcloud library allows you to create and customize word clouds based on your text data.

6. **nltk**: It is a comprehensive library for working with human language data and performing various natural language processing tasks. It provides a wide range of functionalities, including tokenization, stemming, tagging, parsing, and more.

7. **re**: re is the built-in regular expression module in Python. It provides functions and methods for working with regular expressions. The `re` module is often used for tasks like searching, extracting, and replacing specific patterns of text in strings.

8. **nltk.corpus.stopwords**: nltk.corpus.stopwords is a collection of commonly used stopwords (i.e., words that are considered irrelevant for text analysis) in different languages. The stopwords module from the NLTK corpus contains pre-defined lists of stopwords that can be used to filter out these words from your text data.

9. **string**: string is a built-in module in Python that provides helpers for working with strings. It includes constants for groups of ASCII characters, such as punctuation marks and whitespace, as well as utilities for string formatting and templating.

**Importing essential libraries**

In [None]:
!pip install textblob

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
import re
from nltk.corpus import stopwords
import string
from textblob import TextBlob

**Reading CSV data**

In [None]:
data = pd.read_csv("/content/russia-vs-ukraine-tweets-datasetdaily-updated/filename.csv")
print(data.head())

In [None]:
data.shape

In [None]:
print(data.describe())

In [None]:
print(data.info())

**Let’s have a quick look at all the column names of the dataset:**

In [None]:
print(data.columns)

**We only need three columns for this task (username, tweet, and language); I will only select these columns and move forward:**

In [None]:
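# Take a copy so that adding columns later does not trigger pandas' SettingWithCopyWarning.
data = data[['username', 'tweet', 'language']].copy()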
data

In [None]:
data["tweet"][0] # reading tweet having 0 index.

**Let’s have a look at whether any of these columns contains any null values or not:**

In [None]:
data.isnull().sum()

**So none of the columns has null values, let’s have a quick look at how many tweets are posted in which language:**

In [None]:
data['language'].value_counts()

In [None]:
data.language.value_counts().sort_values().plot(kind = 'pie')

**Count of unique languages present in the 'language' column of the DataFrame.**

In [None]:
total_languages = len(data['language'].value_counts())
print("Total number of languages used:", total_languages)

**Function to extract hashtags and generate barplot of the most frequent hashtags**

In [None]:
# Function to extract hashtags from a list of tweets
def hashtag_extract(text_list):
    # Create a list to store the hashtags
    hashtags = []

    # Loop over the tweets
    for text in text_list:
        # Use the `re` module to find all of the hashtags in the tweet
        ht = re.findall(r"#(\w+)", text)

        # Append the hashtags to the list
        hashtags.append(ht)

    # Return the list of hashtags
    return hashtags

# Function to generate a barplot of the most frequent hashtags
def generate_hashtag_freqdist(hashtags):
    # Create a frequency distribution of the hashtags
    a = nltk.FreqDist(hashtags)

    # Convert the frequency distribution to a Pandas DataFrame
    d = pd.DataFrame({'Hashtag': list(a.keys()),
                      'Count': list(a.values())})

    # Select the top 25 most frequent hashtags
    d = d.nlargest(columns="Count", n = 25)

    # Create a figure with the specified size
    plt.figure(figsize=(16, 7))

    # Create a barplot of the most frequent hashtags
    ax = sns.barplot(data=d, x= "Hashtag", y = "Count")

    # Rotate the x-ticks by 80 degrees
    plt.xticks(rotation=80)

    # Set the y-label
    ax.set(ylabel = 'Count')

    # Show the figure
    plt.show()


* *The first function, hashtag_extract(), takes a list of tweets as input and returns one list of hashtags per tweet (a list of lists). It loops over the tweets and uses the re module to find all of the hashtags in each tweet. The nested lists are flattened into a single list before plotting.*

* *The second function, generate_hashtag_freqdist(), takes a flat list of hashtags as input and displays a barplot of the most frequent hashtags; it does not return a value. It first creates a frequency distribution of the hashtags using nltk.FreqDist(). The frequency distribution is then converted into a Pandas DataFrame and the top 25 hashtags are selected. The barplot is then created using the Seaborn library.*
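
The extract, flatten, and count steps described above can be seen on a small scale in the following sketch. It uses made-up tweets and swaps in `collections.Counter` from the standard library in place of `nltk.FreqDist`, so it runs without NLTK.

```python
import re
from collections import Counter

# Made-up tweets for illustration only.
tweets = [
    "Stand with #Ukraine #StopWar",
    "Latest update from Kyiv #Ukraine",
    "No hashtags in this one",
]

# One list of hashtags per tweet, the same shape hashtag_extract() returns.
per_tweet = [re.findall(r"#(\w+)", tweet) for tweet in tweets]

# Flatten the nested lists into a single list of hashtags.
flat = [tag for tags in per_tweet for tag in tags]

# Count occurrences and keep the two most frequent hashtags.
top_hashtags = Counter(flat).most_common(2)

print(per_tweet)     # [['Ukraine', 'StopWar'], ['Ukraine'], []]
print(top_hashtags)  # [('Ukraine', 2), ('StopWar', 1)]
```

Note that the pattern is case-sensitive, so `#Ukraine` and `#ukraine` would be counted as different hashtags unless the text is lowercased first.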

In [None]:
hashtags = hashtag_extract(data["tweet"])
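# Flatten the per-tweet lists into one list (same result as sum(hashtags, []), but faster on large data).
hashtags = [tag for tags in hashtags for tag in tags]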

In [None]:
generate_hashtag_freqdist(hashtags)

# **Preparation of data**

**Let’s prepare this data for the task of sentiment analysis. Here I will remove links, punctuation, symbols, stopwords, and other noise from the tweets, using the following setup:**

*nltk.download('stopwords')*

*stemmer = nltk.SnowballStemmer("english")*

*stopword=set(stopwords.words('english'))*

* **The first line, nltk.download('stopwords'), downloads the list of stopwords from the Natural Language Toolkit (NLTK). Stopwords are words that are commonly used in a language, but that do not add much meaning to a sentence. For example, the words "the", "a", and "and" are all stopwords in English.**

* **The second line, stemmer = nltk.SnowballStemmer("english"), creates a Snowball stemmer object. A Snowball stemmer is a rule-based stemmer that strips suffixes from a word to reduce it to its stem or root; the English version is a refinement of the Porter stemmer. For example, the word "playing" would be stemmed to "play" by a Snowball stemmer.**

* **The third line, stopword=set(stopwords.words('english')), creates a set of stopwords. The set() function is used to convert the list into a set data structure. The stopword set will contain all of the stopwords that were downloaded in the first line.**

**These three lines of code are commonly used when processing text with NLTK. The stopwords are removed from the text to reduce the number of words that need to be processed, and the Snowball stemmer is used to reduce the words to their stems. This can make the text easier to analyze and can improve the performance of natural language processing algorithms.**
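
To make the stopword step concrete without downloading anything, here is a minimal sketch that filters a sentence against a tiny hand-written stopword set. The set `toy_stopwords` and the helper `remove_stopwords` are assumptions made for illustration; NLTK's English list is much longer.

```python
# Toy stopword set for illustration only; NLTK's English list is much longer.
toy_stopwords = {"the", "a", "and", "is", "in"}

def remove_stopwords(text, stopword_set):
    """Drop every token found in stopword_set (hypothetical helper)."""
    return " ".join(word for word in text.lower().split() if word not in stopword_set)

filtered = remove_stopwords("The war is a tragedy and the world is watching", toy_stopwords)
print(filtered)  # war tragedy world watching
```

Storing the stopwords in a set rather than a list keeps each `word not in stopword_set` check fast, which matters when the filter runs over every word of every tweet.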

In [None]:
# Download the stopwords from the NLTK library.
nltk.download('stopwords')

# Create a SnowballStemmer object for the English language.
stemmer = nltk.SnowballStemmer("english")

# Create a set of stopwords from the NLTK library.
stopword=set(stopwords.words('english'))


In [None]:
def clean(text):
    # Convert the text to lowercase
    text = str(text).lower()

    # Remove square brackets and their contents
    text = re.sub(r'\[.*?\]', '', text)

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>+', '', text)

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Replace newline characters with spaces so adjacent words are not fused together
    text = re.sub(r'\n', ' ', text)

    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove stopwords; split() also discards empty tokens left by the removals above
    text = [word for word in text.split() if word not in stopword]

    # Stem the remaining words
    text = [stemmer.stem(word) for word in text]

    # Join the stemmed words back together with spaces
    return " ".join(text)


def analyze_sentiment(tweet):
    """
    Analyze the sentiment of a tweet.

    Args:
        tweet: A string containing the tweet.

    Returns:
        An integer representing the sentiment of the tweet:
        - 1 for positive sentiment
        - 0 for neutral sentiment
        - -1 for negative sentiment
    """

    # Create a TextBlob object from the tweet.
    analysis = TextBlob(clean(tweet))

    # Get the sentiment polarity of the tweet.
    polarity = analysis.sentiment.polarity

    # Return the sentiment of the tweet.
    if polarity > 0:
        return 1
    elif polarity == 0:
        return 0
    else:
        return -1
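
To see what the regular-expression steps in clean() do on their own, the sketch below applies only those steps (no stopword removal and no stemming) to a made-up tweet. The helper name `strip_noise` and the sample text are assumptions for illustration.

```python
import re
import string

def strip_noise(text):
    """Apply only the regex-based cleaning steps (hypothetical helper)."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                 # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    # Collapse the leftover runs of whitespace into single spaces.
    return ' '.join(text.split())

sample = "Check THIS out: https://example.com/news [video] <b>Breaking</b> 2022 updates!!! #Ukraine"
print(strip_noise(sample))  # check this out breaking updates ukraine
```

The order matters: URLs and HTML tags are removed before punctuation, because stripping punctuation first would break the `://` and `<...>` patterns the earlier steps rely on.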


**Adding columns to dataframe**

In [None]:
# This line adds a sentiment column to the data frame, using the analyze_sentiment() function to determine the sentiment of each tweet.
data['Sentiment'] = data['tweet'].apply(lambda x: analyze_sentiment(x))

# This line adds a source column to the data frame, setting the source to "random_user" for all tweets.
data['Source'] = 'random_user'

# This line adds a length column to the data frame, counting the number of characters in each tweet.
data['Length'] = data['tweet'].apply(len)

# This line adds a word_counts column to the data frame, counting the number of words in each tweet.
data['Word_counts'] = data['tweet'].apply(lambda x: len(str(x).split()))

# This line adds a 'Clean tweet' column to the data frame, using the clean() function to clean each tweet.
data['Clean tweet'] = data['tweet'].apply(lambda x: clean(x))


In [None]:
data

In [None]:
# This line creates a new data frame called data2, which contains only the tweet, sentiment, source, length, and word_counts columns from the original data frame.
data2 = data[['tweet', 'Sentiment', 'Source', 'Length', 'Word_counts']]

# This line prints the first five rows of the data2 data frame.
data2.head()

**Creating dataframes containing neutral, positive, and negative sentiments**

In [None]:
neutral = data[data['Sentiment'] == 0]
positive = data[data['Sentiment'] == 1]
negative = data[data['Sentiment'] == -1]

In [None]:
neutral

In [None]:
positive

In [None]:
negative

**Visualization**

In [None]:
# This code creates a bar chart showing the distribution of sentiment in the dataset of tweets.
import plotly.graph_objs as go

# Create a list of x-axis labels.
x = ['Neutral', 'Positive', 'Negative']

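# Create a list of y-axis values.
y = [len(neutral), len(positive), len(negative)]

# Compute each class's share of all tweets for the hover text instead of hard-coding it.
total = sum(y)
hovertext = [f"{count / total:.0%} of tweets" for count in y]

# Create a bar chart object.
fig = go.Figure(data=[go.Bar(x=x, y=y, hovertext=hovertext)])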

# Customize the aspect of the bar chart.
fig.update_traces(marker_line_color='midnightblue', marker_line_width=1.)

# Set the title of the bar chart.
fig.update_layout(title_text='Distribution of sentiment')

# Display the bar chart.
fig.show()

In [None]:
# This code creates a pie chart showing the sentiment polarity of the invasion tweets dataset.
fig, ax = plt.subplots(figsize=(6, 6))

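# Get the sentiment counts (sorted from most to least frequent) and map the numeric labels to readable names
counts = data['Sentiment'].value_counts()
sizes = counts.values
labels = list(counts.index.map({1: 'Positive', 0: 'Neutral', -1: 'Negative'}))

# Set the pie chart properties; pull out the largest slice, whatever the number of classes
explode = [0.1] + [0] * (len(counts) - 1)
ax.pie(x=sizes, labels=labels, autopct='%1.1f%%', explode=explode, textprops={'fontsize': 14})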
ax.set_title('Sentiment Polarity on invasion Tweets Data \n (total = {})'.format(len(data)), fontsize=16, pad=20)

# Show the pie chart
plt.show()


In [None]:
# This code prints out three examples of tweets, one each for neutral, positive, and negative sentiment.

# Neutral tweet
print("Neutral tweet example  :", neutral['tweet'].values[15])
# Comment: This is an example of a neutral tweet. It does not express any strong positive or negative sentiment.

# Positive tweet
print("Positive Tweet example :", positive['tweet'].values[37])
# Comment: This is an example of a positive tweet. It expresses happiness, excitement, or some other positive emotion.

# Negative tweet
print("Negative Tweet example :", negative['tweet'].values[1])
# Comment: This is an example of a negative tweet. It expresses sadness, anger, or some other negative emotion.


In [None]:
# This code creates a histogram showing the distribution of tweet lengths.
x = data.Length.values

fig = go.Figure(data=[go.Histogram(x=x,
                                   marker_line_width=1, 
                                   marker_line_color="midnightblue", 
                                   xbins_size = 5)])

fig.update_layout(title_text='Distribution of tweet lengths')
fig.show()

In [None]:
# This code creates histograms showing the distribution of tweet lengths for neutral, positive, and negative tweets.

x1 = neutral.Length.values
x2 = positive.Length.values
x3 = negative.Length.values

fig1 = go.Figure(data=[go.Histogram(x=x1,
                                   marker_line_width=1, 
                                   marker_line_color="midnightblue", 
                                   xbins_size = 5,
                                   opacity = 1)])

fig1.update_layout(title_text='Distribution of neutral tweet lengths')
fig1.show()

fig2 = go.Figure(data=[go.Histogram(x=x2,
                                   marker_line_width=1, 
                                   marker_color='rgb(50,202,50)', 
                                   marker_line_color="midnightblue", 
                                   xbins_size = 5,
                                   opacity = 1)])

fig2.update_layout(title_text='Distribution of positive tweet lengths')
fig2.show()

fig3 = go.Figure(data=[go.Histogram(x=x3,
                                   marker_line_width=1, 
                                   marker_color='crimson', 
                                   marker_line_color="midnightblue", 
                                   xbins_size = 5,
                                   opacity = 1)])

fig3.update_layout(title_text='Distribution of negative tweet lengths')
fig3.show()

In [None]:
# This code creates a box plot showing the distribution of tweet lengths for neutral, positive, and negative tweets.

y1 = neutral.Length.values
y2 = positive.Length.values
y3 = negative.Length.values

fig = go.Figure()

fig.add_trace(go.Box(y=y1, 
                     name="Neutral", 
                     marker_line_width=1, 
                     marker_line_color="midnightblue"))

fig.add_trace(go.Box(y=y2, 
                     name="Positive", 
                     marker_line_width=1, 
                     marker_color = 'rgb(50,202,50)'))

fig.add_trace(go.Box(y=y3, 
                     name="Negative", 
                     marker_line_width=1, 
                     marker_color = 'crimson'))

fig.update_layout(title_text="Box Plot tweet lengths")

fig.show()
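
A box plot summarizes each group by its quartiles. As a rough numerical companion, the sketch below computes the quartiles of some made-up tweet lengths with the standard `statistics` module. The numbers in `lengths_by_sentiment` are invented for illustration, and Plotly may use a different interpolation method, so its box edges can differ slightly from these values.

```python
import statistics

# Made-up tweet lengths for illustration only.
lengths_by_sentiment = {
    "Neutral": [40, 55, 60, 72, 90, 120],
    "Positive": [35, 80, 95, 110, 140, 200],
    "Negative": [50, 65, 70, 100, 130, 180],
}

summary = {}
for name, lengths in lengths_by_sentiment.items():
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(lengths, n=4)
    summary[name] = (q1, median, q3)
    print(f"{name}: Q1={q1}, median={median}, Q3={q3}")
```

With the real data, the lists could be replaced by `neutral.Length`, `positive.Length`, and `negative.Length`.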

In [None]:
neutral

In [None]:
data['Clean tweet']

In [None]:
# This code creates a new pandas Series called `tokenized_tweet`, in which each entry is the list of tokens from the corresponding `Clean tweet` value.
tokenized_tweet = data['Clean tweet'].apply(lambda x: x.split())

tokenized_tweet.head()

* **PorterStemmer is a class in the Natural Language Toolkit (NLTK) that implements the Porter stemming algorithm. Stemming reduces a word to a root form by stripping suffixes according to fixed rules. For example, "running" is stemmed to "run" and "going" to "go"; irregular forms such as "ran" are left unchanged because the stemmer does not consult a dictionary.**

In [None]:
# defining simple stem_words function to reduce words to its root form.
# This code imports the PorterStemmer class from the nltk.stem.porter module.
from nltk.stem.porter import PorterStemmer

def stem_words(words):
  stemmer = PorterStemmer()
  stemmed_words = []
  for word in words:
    stemmed_word = stemmer.stem(word)
    stemmed_words.append(stemmed_word)
  return stemmed_words

words = ["running", "ran", "runner","going"]
stemmed_words = stem_words(words)
print(stemmed_words)

In [None]:
# This code imports the PorterStemmer class from the nltk.stem.porter module.
from nltk.stem.porter import PorterStemmer

# This creates a new instance of the PorterStemmer class.
stemmer = PorterStemmer()

# This code applies the `lambda` function to each entry in the `tokenized_tweet` Series. The `lambda` function stems each token in the tweet.
tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])

tokenized_tweet.head()

# **Word cloud of tweets**

In [None]:
# This code creates a word cloud from all of the cleaned tweets.
# WordCloud was already imported with the other libraries above.

all_words = ' '.join(data['Clean tweet'])
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

# This line creates a figure with a width of 10 inches and a height of 7 inches.
plt.figure(figsize=(10, 7))

# This line displays the word cloud.
plt.imshow(wordcloud, interpolation="bilinear")

# This line hides the axes.
plt.axis('off')

# This line shows the word cloud.
plt.show()

# **Most Frequently Used Words**

**Now let’s have a look at the wordcloud of the tweets, which will show the most frequently used words in the tweets by people sharing their feelings and updates about the Ukraine and Russia war:**

In [None]:
text = " ".join(i for i in data.tweet)

# Use a separate name so the `stopwords` module imported from nltk.corpus is not overwritten.
wc_stopwords = set(STOPWORDS)

# This line creates a word cloud from the text, with stopwords removed and a black background.
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110, stopwords=wc_stopwords, background_color="black").generate(text)
plt.figure( figsize=(10,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
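
A word cloud sizes each word by how often it occurs, so the underlying counts can also be inspected directly. The sketch below does this with `collections.Counter` on a few made-up tweets; the tweets and the small `stop` set are assumptions for illustration and stand in for `data.tweet` and the wordcloud stopword list.

```python
from collections import Counter

# Made-up tweets for illustration only.
tweets = [
    "peace talks continue in ukraine",
    "ukraine peace rally today",
    "sanctions on russia continue",
]

# Tiny stand-in for the stopword list used by the word cloud.
stop = {"in", "on"}

# Split every tweet into words, skip stopwords, and count what remains.
words = [word for tweet in tweets for word in tweet.split() if word not in stop]
counts = Counter(words)

# The words that would be drawn largest in a word cloud.
print(counts.most_common(3))
```

Printing the counts alongside the image makes it easier to compare words whose rendered sizes look similar.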


# **Positive Sentiments**

**Now let’s have a look at the most frequent words used by people with positive sentiments**

In [None]:
positive_words = ' '.join([text for text in data['Clean tweet'][data['Sentiment'] == 1]])

# This line creates a word cloud from the positive tweets
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(positive_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()


# **Negative Sentiments**

**Now let’s have a look at the most frequent words used by people with negative sentiments**

In [None]:
negative_words = ' '.join([text for text in data['Clean tweet'][data['Sentiment'] == -1]])

# This line creates a word cloud from the negative tweets
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(negative_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()


# **Neutral Sentiments**

**Now let’s have a look at the most frequent words used by people with neutral sentiments**

In [None]:
neutral_words =' '.join([text for text in data['Clean tweet'][data['Sentiment'] == 0]])

# This line creates a word cloud from the neutral tweets
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(neutral_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

# **Selecting top 10 most frequent hashtags**

In [None]:
HT = hashtag_extract(data['tweet'])
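# Flatten the per-tweet lists into one list (same result as sum(HT, []), but faster on large data).
HT = [tag for tags in HT for tag in tags]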

In [None]:
a = nltk.FreqDist(HT)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})
# selecting top 10 most frequent hashtags     
d = d.nlargest(columns="Count", n = 10) 
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()