<a href="https://www.kaggle.com/code/amirmotefaker/ukraine-russia-war-twitter-sentiment-analysis?scriptVersionId=127551144" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

- More than 400 days have passed since the war between Russia and Ukraine began. Many countries support Ukraine by imposing economic sanctions on Russia. Many tweets about the Ukraine-Russia war report developments on the ground, express how people feel about it, and show which side they support.

# Russia-Ukraine war at a glance: what we know on day 429 of the invasion

- Russia on Friday launched a wave of missile attacks across many of Ukraine’s biggest cities, killing a mother and young child in the port city of Dnipro, and three people at a high-rise apartment building in the central city of Uman. Air raid alarms were active across the country in the early hours of Friday morning, while explosions were heard in Kyiv, and southern Mykolaiv was targeted again.

- At least seven civilians were killed and 33 injured between Wednesday and Thursday, Ukraine’s presidential office said, including one person killed and 23 wounded when four Kalibr cruise missiles hit the southern city of Mykolaiv.

- The parliamentary assembly of the Council of Europe has voted that the forced detention and deportation of children from Russian-occupied territories of Ukraine is genocide.

- Source: [The Guardian](https://www.theguardian.com/world/2023/apr/28/russia-ukraine-war-at-a-glance-what-we-know-on-day-429-of-the-invasion)


# Import Libraries

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
import re
from nltk.corpus import stopwords
import string



# Read Data

In [2]:
data = pd.read_csv("/kaggle/input/russia-vs-ukraine-tweets-datasetdaily-updated/filename.csv")

In [3]:
print(data.head())

                    id      conversation_id               created_at  \
0  1630366235354451969  1630152070530576385  2023-02-28 00:36:15 UTC   
1  1630366226424778753  1630366226424778753  2023-02-28 00:36:13 UTC   
2  1630366225930027011  1630366225930027011  2023-02-28 00:36:13 UTC   
3  1630366223056662530  1630351686974992385  2023-02-28 00:36:12 UTC   
4  1630366221483884545  1629903982255644672  2023-02-28 00:36:12 UTC   

         date      time  timezone              user_id     username  \
0  2023-02-28  00:36:15         0  1493761817406894086  tomasliptai   
1  2023-02-28  00:36:13         0  1526694166662721536  paperfloure   
2  2023-02-28  00:36:13         0  1053018392939167746    katetbar1   
3  2023-02-28  00:36:12         0            602371247    jlhrdhmom   
4  2023-02-28  00:36:12         0  1053594763214184448    phemikali   

                  name place  ... geo source user_rt_id user_rt retweet_id  \
0         Tomas Liptai   NaN  ... NaN    NaN        NaN     Na

- Let’s have a quick look at all the column names of the dataset:

In [4]:
print(data.columns)

Index(['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
       'user_id', 'username', 'name', 'place', 'tweet', 'language', 'mentions',
       'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
       'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video',
       'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest'],
      dtype='object')


- We only need three columns for this task (username, tweet, and language), so I will select only these and move on:

In [5]:
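# .copy() makes an independent frame, so the later assignment to data["tweet"] does not raise a SettingWithCopyWarning
data = data[["username", "tweet", "language"]].copy()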
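
If load time or memory matters, the same three columns could instead be selected while reading the file. The following is a minimal sketch, not part of the original notebook: the in-memory `sample_csv` and its rows are made up for illustration, and only the column names are taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; the rows are invented,
# only the header names match the real dataset.
sample_csv = io.StringIO(
    "id,username,tweet,language,likes_count\n"
    "1,user_a,Example tweet one,en,3\n"
    "2,user_b,Exemplo de tweet,pt,0\n"
)

# usecols makes pandas parse only the listed columns.
subset = pd.read_csv(sample_csv, usecols=["username", "tweet", "language"])

print(subset.columns.tolist())
```

With the real file, the path passed to `pd.read_csv` would replace `sample_csv`.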

- Let’s check whether any of these columns contains null values:


In [6]:
data.isnull().sum()

username    0
tweet       0
language    0
dtype: int64

- None of the columns has null values. Next, let’s look at how many tweets are posted in each language:

In [7]:
data["language"].value_counts()

en     8858
pt      440
it      194
qme     105
und      60
in       47
ru       44
ja       42
es       36
ca       20
qht      20
th       19
fr       18
de       14
ko        9
vi        8
nl        8
ro        7
fi        7
ar        6
zxx       6
uk        6
cs        6
zh        5
pl        5
qam       4
tl        4
da        3
eu        2
no        2
hi        2
tr        2
hu        1
cy        1
lv        1
el        1
bn        1
Name: language, dtype: int64
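
Because the stopword list and stemmer used in this notebook are English-specific, one optional step is to keep only the rows tagged `en` before cleaning. The original notebook does not do this, so treat the following as a sketch under that assumption. The `tweets` frame and its rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from the CSV loaded earlier.
tweets = pd.DataFrame(
    {
        "username": ["user_a", "user_b", "user_c", "user_d"],
        "tweet": ["First tweet", "Segundo tweet", "Third tweet", "???"],
        "language": ["en", "pt", "en", "und"],
    }
)

# Relative share of each language tag (fractions that sum to 1).
language_share = tweets["language"].value_counts(normalize=True)
print(language_share)

# Keep only English rows; .copy() gives an independent frame to modify later.
english_tweets = tweets[tweets["language"] == "en"].copy()
print(len(english_tweets))
```

Whether to filter depends on the goal: dropping non-English rows loses some tweets but avoids applying English stopwords and stemming to text in other languages.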

- Most of the tweets are in English. Let’s prepare this data for sentiment analysis. Here I will lowercase the tweets, remove links, bracketed text, punctuation, and tokens containing digits, drop English stopwords, and stem the remaining words:

In [8]:
nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    """Lowercase a tweet, strip noise, drop English stopwords, and stem the rest."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', ' ', text)                    # newlines become spaces so adjacent words do not merge
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    # split() with no argument also discards the empty strings left by the removals above
    words = [stemmer.stem(word) for word in text.split() if word not in stopword]
    return " ".join(words)

data["tweet"] = data["tweet"].apply(clean)

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
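
To see what the regular-expression steps do to a single tweet, here is a simplified sketch that applies only those steps and skips stopword removal and stemming, so it runs without NLTK. The function name `clean_without_nltk` and the sample text are hypothetical. Unlike the original cell, it replaces newlines with a space and collapses repeated whitespace.

```python
import re
import string


def clean_without_nltk(text):
    """Apply only the regex steps; stopword removal and stemming are skipped."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>+", "", text)                 # HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\n", " ", text)                    # newlines become spaces
    text = re.sub(r"\w*\d\w*", "", text)               # tokens containing a digit
    return " ".join(text.split())                      # collapse leftover whitespace


# Made-up sample tweet for illustration.
sample = "Day 429: #Ukraine update [thread] https://example.com\nStay safe!"
print(clean_without_nltk(sample))  # prints: day ukraine update stay safe
```

Note that stripping punctuation also removes `#` and `@`, so hashtags and mentions survive as plain words, and the digit rule drops any token that contains a number, such as `429`.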
