# Exploratory Data Analysis of *COVID_tweets* dataset

#### This notebook contains an EDA and brief preprocessing of the part of the *COVID_tweets_dataset* that is loaded as `train_df`

#### Importing libraries

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import os

#### Building the file path with the `os` library and loading one part of the dataset (`Corona_NLP_test.csv`) into `train_df`

In [3]:
cur_path = os.getcwd()
df_folder_path = os.path.join(cur_path, "COVID_tweets_dataset")
df_path = os.path.join(df_folder_path, "Corona_NLP_test.csv")

train_df = pd.read_csv(df_path)
train_df.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 1 | 44953 | NYC | 02-03-2020 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative |
| 1 | 2 | 44954 | Seattle, WA | 02-03-2020 | When I couldn't find hand sanitizer at Fred Me... | Positive |
| 2 | 3 | 44955 | | 02-03-2020 | Find out how you can protect yourself and love... | Extremely Positive |
| 3 | 4 | 44956 | Chicagoland | 02-03-2020 | #Panic buying hits #NewYork City as anxious sh... | Negative |
| 4 | 5 | 44957 | Melbourne, Victoria | 03-03-2020 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral |


#### Dropping unnecessary columns

In [4]:
# `columns=` already targets the column axis, so `axis=1` is redundant
train_df.drop(columns=['ScreenName', 'Location', 'TweetAt'], inplace=True)

In [5]:
train_df.shape

(3798, 3)

In [6]:
train_df.Sentiment.unique()

array(['Extremely Negative', 'Positive', 'Extremely Positive', 'Negative',
       'Neutral'], dtype=object)

#### Checking for empty (NaN) values

In [7]:
train_df.isna().any()

UserName         False
OriginalTweet    False
Sentiment        False
dtype: bool
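
`isna().any()` only reports whether a column has at least one missing value. When a column does contain gaps, per-column counts and shares are usually more informative. The sketch below illustrates this on a small toy frame; `toy_df` and its rows are hypothetical stand-ins, not data from the real dataset.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "UserName": [1, 2, 3, 4],
    "OriginalTweet": ["a", None, "c", "d"],
    "Sentiment": ["Positive", "Negative", None, "Neutral"],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()

# Fraction of missing values in each column (mean of a boolean mask)
missing_share = toy_df.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real frame, the same two calls on `train_df` would be expected to return zeros everywhere, given the all-`False` output above.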

#### Visualizing the distribution of Sentiment labels

In [8]:
fig = px.histogram(train_df, x='Sentiment', color_discrete_sequence=['skyblue'])
fig.update_layout(
    title='Distribution of Sentiment Values',
    xaxis_title='Sentiment',
    yaxis_title='Frequency',
    template='plotly_white'
)
fig.show()
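
The histogram shows the class balance visually. The exact numbers behind it can be read with `value_counts`, as in the minimal sketch below. It uses a hypothetical toy frame (`toy_df`) rather than the real data, so the counts shown are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical labels, not taken from the dataset)
toy_df = pd.DataFrame({
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral", "Positive"],
})

# Absolute number of rows per label
counts = toy_df["Sentiment"].value_counts()

# Relative share of each label (sums to 1.0)
shares = toy_df["Sentiment"].value_counts(normalize=True)

print(counts)
print(shares)
```

Running `train_df["Sentiment"].value_counts()` in the notebook should give the bar heights of the plot above.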

#### Downsampling to a balanced dataset: 400 records per class, 2000 samples in total

In [9]:
samples_list = []

for emotion in train_df['Sentiment'].unique():
    # Select the first 400 samples for the current emotion
    samples = train_df[train_df['Sentiment'] == emotion].head(400)
    samples_list.append(samples)

# Concatenation
equal_distribution_df = pd.concat(samples_list, ignore_index=True)

fig = px.histogram(equal_distribution_df, x='Sentiment', color_discrete_sequence=['skyblue'])
fig.update_layout(
    title='Distribution of Sentiment Values',
    xaxis_title='Sentiment',
    yaxis_title='Frequency',
    template='plotly_white'
)
fig.show()

In [10]:
equal_distribution_df.shape

(2000, 3)
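
The per-class loop above can also be written with `groupby`. The sketch below shows two variants on a hypothetical toy frame (`toy_df`, with `n_per_class = 2` instead of 400). `groupby(...).head(n)` keeps the first `n` rows of each class, like the loop, but leaves rows in their original order instead of grouping them by class. `groupby(...).sample(n=...)` draws rows at random and is an alternative, not what the notebook does.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical rows, not taken from the dataset)
toy_df = pd.DataFrame({
    "OriginalTweet": [f"tweet {i}" for i in range(12)],
    "Sentiment": ["Positive"] * 5 + ["Negative"] * 4 + ["Neutral"] * 3,
})
n_per_class = 2  # the notebook uses 400

# Variant 1: first n rows per class (same rows as the loop, original row order)
first_n = (
    toy_df.groupby("Sentiment", sort=False)
    .head(n_per_class)
    .reset_index(drop=True)
)

# Variant 2: n random rows per class; random_state makes the draw repeatable.
# sample() raises ValueError if a class has fewer than n rows.
random_n = (
    toy_df.groupby("Sentiment")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)

print(first_n["Sentiment"].value_counts())
print(random_n["Sentiment"].value_counts())
```

One difference worth noting: `head(n)` silently returns fewer rows for a class smaller than `n`, so checking the resulting shape, as the notebook does, remains a useful safeguard.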

#### Saving the file to the dataset folder

In [11]:
equal_distribution_df.to_csv(os.path.join(df_folder_path, "Corona_balanced.csv"))
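
By default, `to_csv` also writes the row index as the first, unnamed column, which `read_csv` later loads as `Unnamed: 0`. Whether that is wanted depends on how the file is consumed downstream. The sketch below compares both options on a hypothetical toy frame, using in-memory buffers instead of real files.

```python
import io
import pandas as pd

# Toy stand-in for equal_distribution_df (hypothetical rows)
toy_df = pd.DataFrame({"UserName": [1, 2], "Sentiment": ["Positive", "Neutral"]})

# Default behaviour: the index is written as an extra, unnamed column
buf_default = io.StringIO()
toy_df.to_csv(buf_default)
buf_default.seek(0)
reloaded_default = pd.read_csv(buf_default)

# index=False: only the data columns are written
buf_no_index = io.StringIO()
toy_df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_default.columns))   # includes 'Unnamed: 0'
print(list(reloaded_no_index.columns))  # data columns only
```

If the extra column is not needed, passing `index=False` to the `to_csv` call above would avoid it.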

#### Printing 5 random samples for each Sentiment

In [16]:
df = equal_distribution_df.copy()

for emotion in df['Sentiment'].unique():
    print(f"\n5 random samples for Sentiment: {emotion} \n\n")
    random_samples = df[df['Sentiment'] == emotion]['OriginalTweet'].sample(5)
    for sample in random_samples:
        print(sample)
    print("\n\n")


5 random samples for Sentiment: Extremely Negative 


US consumer prices unexpectedly rose in February but could drop in months ahead as COVID-19 depresses demand for some goods &amp; services, outweighing price increases related to shortages caused by disruptions to supply chain.
https://t.co/ta4GzWk3nx
I just went to grocery store for a bagel in greater L.A. area  &amp; the lines were VERY LONG ..shelves emptying quickly. People are PANIC buying &amp; hoarding. Some with 2 full carts. #coronapocalypse #coronaCrazy
#coronavirus #COVID2019 https://t.co/JZAI0bEYz7
Did anyone hear in the budget a pledge to help people on food banks? who are now facing no food as a result of panic buying that is 1,6 million people facing severe hardships in the coming months #BorisOut #UK #Covid_19 #SirPatrickVallance
@KyleBrandt The Corona Tough Guys will be the same ones trying to cut long lines at the ER and starting fights at the supermarket. 

An idea to limit hospital over crowding: Anyone still sp