# Exploratory Data Analysis of the *COVID_tweets* dataset

#### This notebook contains an EDA and brief preprocessing of the train part of the *COVID_tweets_dataset*

#### Importing libraries

In [37]:
import numpy as np
import pandas as pd
import plotly.express as px
import os

#### Building the path with the os library and loading one part of the dataset (`Corona_NLP_test.csv`) into the notebook as `train_df`

In [38]:
cur_path = os.getcwd()
df_folder_path = os.path.join(cur_path, "COVID_tweets_dataset")
df_path = os.path.join(df_folder_path, "Corona_NLP_test.csv")

train_df = pd.read_csv(df_path)
train_df.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 1 | 44953 | NYC | 02-03-2020 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative |
| 1 | 2 | 44954 | Seattle, WA | 02-03-2020 | When I couldn't find hand sanitizer at Fred Me... | Positive |
| 2 | 3 | 44955 | | 02-03-2020 | Find out how you can protect yourself and love... | Extremely Positive |
| 3 | 4 | 44956 | Chicagoland | 02-03-2020 | #Panic buying hits #NewYork City as anxious sh... | Negative |
| 4 | 5 | 44957 | Melbourne, Victoria | 03-03-2020 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral |


#### Dropping unnecessary columns

In [39]:
# `columns=` already selects the axis, so `axis=1` is redundant
train_df.drop(columns=['ScreenName', 'Location', 'TweetAt'], inplace=True)

In [40]:
train_df.shape

(3798, 3)

In [41]:
train_df.Sentiment.unique()

array(['Extremely Negative', 'Positive', 'Extremely Positive', 'Negative',
       'Neutral'], dtype=object)

#### Checking for empty (NaN) values

In [42]:
train_df.isna().any()

UserName         False
OriginalTweet    False
Sentiment        False
dtype: bool
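
`isna().any()` only reports whether a column contains at least one missing value. As a minimal sketch on a toy frame (the data below is made up for illustration and is not taken from the dataset), `isna().sum()` gives the count per column, which can be more informative when run before columns such as `Location` are dropped:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweets data (illustrative values only)
toy_df = pd.DataFrame({
    "Location": ["NYC", np.nan, "Chicagoland"],
    "Sentiment": ["Positive", "Neutral", "Negative"],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```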

#### Visualizing the distribution of Sentiment labels

In [43]:
fig = px.histogram(train_df, x='Sentiment', color_discrete_sequence=['skyblue'])
fig.update_layout(
    title='Distribution of Sentiment Values',
    xaxis_title='Sentiment',
    yaxis_title='Frequency',
    template='plotly_white'
)
fig.show()
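
The histogram shows the class balance visually. If exact numbers are needed, a small sketch with `value_counts` may help; the `toy_sentiment` series here is a hypothetical stand-in for `train_df['Sentiment']`:

```python
import pandas as pd

# Hypothetical stand-in for train_df['Sentiment']
toy_sentiment = pd.Series(
    ["Positive", "Negative", "Positive", "Neutral", "Positive"],
    name="Sentiment",
)

# Absolute counts per label
counts = toy_sentiment.value_counts()
# Relative frequencies per label (sum to 1)
shares = toy_sentiment.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```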

#### Reducing the dataset to a balanced subset: the first 400 records of each class, 2000 samples in total

In [44]:
samples_list = []

for emotion in train_df['Sentiment'].unique():
    # Select the first 400 samples for the current emotion
    samples = train_df[train_df['Sentiment'] == emotion].head(400)
    samples_list.append(samples)

# Concatenation
equal_distribution_df = pd.concat(samples_list, ignore_index=True)

fig = px.histogram(equal_distribution_df, x='Sentiment', color_discrete_sequence=['skyblue'])
fig.update_layout(
    title='Distribution of Sentiment Values',
    xaxis_title='Sentiment',
    yaxis_title='Frequency',
    template='plotly_white'
)
fig.show()

In [45]:
equal_distribution_df.shape

(2000, 3)
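
The loop above takes the first 400 rows of each class with `head(400)`, so the subset depends on the row order of the file. If a random draw per class is preferred, one possible alternative (not what this notebook does) is `DataFrameGroupBy.sample`, available in pandas 1.1 and later. The toy frame, `n_per_class`, and the seed below are assumptions for illustration:

```python
import pandas as pd

# Toy imbalanced frame (illustrative values only)
toy_df = pd.DataFrame({
    "OriginalTweet": [f"tweet {i}" for i in range(9)],
    "Sentiment": ["Positive"] * 4 + ["Negative"] * 3 + ["Neutral"] * 2,
})

n_per_class = 2  # hypothetical; the notebook keeps 400 per class

# Draw n_per_class random rows from every Sentiment group
balanced_df = (
    toy_df.groupby("Sentiment")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(balanced_df["Sentiment"].value_counts())
```

Note that `sample(n=...)` raises an error if a class has fewer than `n` rows, whereas `head(n)` silently returns what is available.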

In [46]:
# Shuffle the rows so the classes are no longer grouped together, then rebuild the index
equal_distribution_df = equal_distribution_df.sample(frac=1).reset_index(drop=True)
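
The shuffle above is unseeded, so each run yields a different row order. A minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable (the seed value and the toy frame are arbitrary choices for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({"x": range(10)})

# The same seed gives the same permutation on every run
shuffled_a = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)
shuffled_b = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```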

In [47]:
# Map the five labels to an ordinal scale from 1 (most negative) to 5 (most positive)
sentiment_mapping = {
    'Extremely Negative': 1,
    'Negative': 2,
    'Neutral': 3,
    'Positive': 4,
    'Extremely Positive': 5
}
equal_distribution_df['Sentiment_float'] = equal_distribution_df['Sentiment'].map(sentiment_mapping).astype(float)
# Optional: equal_distribution_df.to_csv(path) to save the frame with the Sentiment_float column
equal_distribution_df.head()

| | UserName | OriginalTweet | Sentiment | Sentiment_float |
|---|---|---|---|---|
| 0 | 2374 | Price gouging #California: illegal to charge m... | Extremely Negative | 1.0 |
| 1 | 727 | Well done to everybody who has finished most o... | Positive | 4.0 |
| 2 | 2041 | We're concerned...please help us support our l... | Extremely Positive | 5.0 |
| 3 | 1160 | If your grocery store is out of toilet paper, ... | Extremely Negative | 1.0 |
| 4 | 1308 | #shortages at the local supermarket #hoarding ... | Neutral | 3.0 |
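
`Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, and `astype(float)` will not flag it. A small sketch of a sanity check follows; the `toy_labels` series, including its deliberately miscased last entry, is hypothetical:

```python
import pandas as pd

sentiment_mapping = {
    'Extremely Negative': 1,
    'Negative': 2,
    'Neutral': 3,
    'Positive': 4,
    'Extremely Positive': 5
}

# Hypothetical labels; the last one has unexpected casing
toy_labels = pd.Series(["Positive", "Neutral", "positive"])

mapped = toy_labels.map(sentiment_mapping)

# Labels that did not match any key in the mapping
unmapped = toy_labels[mapped.isna()].unique()
print(unmapped)
```

In this notebook the five values returned by `unique()` all appear in the mapping, so no `NaN` is expected; the check is mainly useful if the labels ever change.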


#### Saving file into dataset folder

In [48]:
equal_distribution_df.to_csv(os.path.join(df_folder_path, "Corona_balanced.csv"))
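
By default `to_csv` also writes the index, and reading the file back with `pd.read_csv` then produces an extra `Unnamed: 0` column. A minimal sketch of the difference, using an in-memory buffer and a toy frame (both are assumptions for illustration, not part of the notebook):

```python
import io
import pandas as pd

toy_df = pd.DataFrame({"Sentiment": ["Positive", "Neutral"]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy_df.to_csv()))
# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy_df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving, or `index_col=0` when loading, are two common ways to avoid the extra column.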

#### Printing out 5 random samples for each Sentiment

In [49]:
df = equal_distribution_df.copy()

for emotion in df['Sentiment'].unique():
    print(f"\n5 random samples for Sentiment: {emotion} \n\n")
    random_samples = df[df['Sentiment'] == emotion]['OriginalTweet'].sample(5)
    for sample in random_samples:
        print(sample)
    print("\n\n")


5 random samples for Sentiment: Extremely Negative 


I went shopping today and the part of town the supermarket is in has a lot of SE Asian people living nearby, so obvs using the same shop. The racism and ingnrance I heard and witnessed is 1 of the byproducts of this bullshit I cant handle! #Covid_19 #coronavirus #CoronaOutbreak
Out shopping with my elderly folks at local supermarket this morning 

Fascinating observing behaviours

Panic buying in place. Not helpful for the vulnerable.

No toilet roll, no soap and no pasta left and its only 11am!

#coronavirus #Covid_19 https://t.co/694VHK5YAz
For a nation that coined #keepcalmandcarryon we seriously need to get a grip.Just back from hoilday.Went to buy food for tonight/tomorrow, and apparently the apocalypse has begun, and the #Covid_19 everyone has turned everyone into flapping, panic-buying morons.
who can't afford to stock up and elderly people are now stuck while you have 7 packets of toilet roll, 50 soaps and food rotting in