# FIFA WORLD CUP 2022 SENTIMENT ANALYSIS

## Introduction
This project analyzes Twitter sentiment around the 2022 FIFA World Cup in Qatar. To find out what football fans think about the tournament, I'll conduct a Twitter sentiment analysis using several World Cup hashtags. The tweets were collected on different days of the World Cup and combined into one master dataset.

### Importing Useful libraries

In [14]:
#Import the useful libraries for project environment
import snscrape.modules.twitter as sntwitter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob  # to return the list of all CSV files located within the path
import time

### 1 - Web Scraping

In [15]:
#Scraping relevant tweets using snscrape from twitter

query = '((#fifaworldcup2022 OR #fifaworldcup OR #worldcup2022) lang:en until:2022-12-21 since:2022-11-14)'
tweets = []
limit = 1

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    else:
        tweets.append([tweet.date, tweet.user.username, tweet.user.location, 
                tweet.rawContent, tweet.retweetCount, tweet.likeCount])

#Save the data to a dataframe

df = pd.DataFrame(tweets, columns=["Date", "Username", "Locations", "Content", "Retweets", "Favorites"])



In [16]:
# Check the dataframe
df.head()

Unnamed: 0,Date,Username,Locations,Content,Retweets,Favorites
0,2022-12-14 23:59:49+00:00,anyway_football,,"BIG TEAM, UNDERDOG TEM, &amp; MAROCCO 🇲🇦\n\n#F...",2,1


In [12]:
#Save the data to csv file

df.to_csv("Dec_.csv", index=False)
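
Because the master dataset is assembled from tweets collected on different days, one way to organize the scraping is to build one query per day and save each day's results to its own CSV file. The sketch below only generates the query strings. `build_daily_queries` is a hypothetical helper (it is not part of snscrape), the hashtags and date range are copied from the query above, and it assumes `since:` is inclusive while `until:` is exclusive.

```python
from datetime import date, timedelta

# Hashtag group copied from the query used above
HASHTAGS = "(#fifaworldcup2022 OR #fifaworldcup OR #worldcup2022)"


def build_daily_queries(start, end, hashtags=HASHTAGS):
    """Return a list of (day, query) pairs, one per day in [start, end).

    Hypothetical helper for illustration only.
    """
    queries = []
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        # Assumption: since: is inclusive and until: is exclusive
        query = (
            f"({hashtags} lang:en "
            f"until:{next_day.isoformat()} since:{day.isoformat()})"
        )
        queries.append((day, query))
        day = next_day
    return queries


daily_queries = build_daily_queries(date(2022, 11, 14), date(2022, 12, 21))
print(len(daily_queries))
print(daily_queries[0][1])
```

Each query string could then be passed to the scraping loop above and the resulting dataframe written to a file named after the day.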

In [53]:
path = r"C:\Users\user\Documents\DA Projects\Python"
all_csv_files = glob.glob(path + "/*.csv")

tweets = []

# Convert each csv to a dataframe

for filename in all_csv_files:
    df = pd.read_csv(filename, index_col = None, header = 0) 
    tweets.append(df)
    
# Merge all dataframes

FIFA_df = pd.concat(tweets, axis=0, ignore_index = True) 

FIFA_df.head()

Unnamed: 0,Date,Username,Locations,Content,Retweets,Favorites
0,2022-12-10 23:59:57+00:00,OSUBlockie,"Columbus, OH",I just earned the 'World Pint (2022)' badge on...,0,0
1,2022-12-10 23:59:54+00:00,standardsport,"London, England",France 🆚 Morocco\n\n📆 Date\n⏰ Kick-off time\n📺...,0,5
2,2022-12-10 23:59:52+00:00,MpatrioticM,ⵜⴰⴳⵍⴷⵉⵜ ⵏ ⵍⵎⵖⵔⵉⴱ 🇲🇦,Glory to Africa ❤️\n#FIFAWorldCup2022 https://...,10,70
3,2022-12-10 23:59:52+00:00,masuma114,"London, England",Why do black England players always get so muc...,0,4
4,2022-12-10 23:59:46+00:00,Mexitly81,"California, USA","No surprise, classless organization. I guess i...",0,0
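
The merge step can also be wrapped in a small function, which makes it easier to skip files that should not be combined. If the merged file is later written to the same folder, a second run of `glob` would pick it up and double the data. The sketch below is one possible way to guard against that, demonstrated on two toy files in a temporary folder. `merge_csv_files`, its `exclude` parameter, and the `SourceFile` column are illustrative names, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_csv_files(folder, exclude=()):
    """Concatenate every CSV in `folder`, skipping file names in `exclude`."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        if csv_path.name in exclude:
            continue
        frame = pd.read_csv(csv_path)
        # Hypothetical extra column recording which file each row came from
        frame["SourceFile"] = csv_path.name
        frames.append(frame)
    return pd.concat(frames, axis=0, ignore_index=True)


# Toy demonstration with two small files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Content": ["a", "b"], "Retweets": [0, 2]}).to_csv(
        Path(tmp) / "day1.csv", index=False
    )
    pd.DataFrame({"Content": ["c"], "Retweets": [5]}).to_csv(
        Path(tmp) / "day2.csv", index=False
    )
    merged = merge_csv_files(tmp, exclude=("FIFA_World_Cup_2022.csv",))

print(merged)
```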


In [54]:
# Save df to one csv file

FIFA_df.to_csv("FIFA_World_Cup_2022.csv", index = False)

In [55]:
# Read the csv file into dataframe
df = pd.read_csv("FIFA_World_Cup_2022.csv")

In [57]:
df.tail()

Unnamed: 0,Date,Username,Locations,Content,Retweets,Favorites
617018,2022-11-20 15:07:31+00:00,sataekooklang,,Jungkookie! 😭 \n💜\n#DreamersJungkook \n#Dreame...,10,41
617019,2022-11-20 15:07:31+00:00,Sir_Stevensarp,AngelTown,The tune dey go on though. #FIFAWorldCup,0,1
617020,2022-11-20 15:07:31+00:00,kassimisola,"London, England",I knew the opening ceremony in Qatar will be a...,0,0
617021,2022-11-20 15:07:31+00:00,btslostacc,♡,Jungkook slayed #WorldCup2022,0,0
617022,2022-11-20 15:07:30+00:00,natasyaeffriani,Jakarta,not me crying every time i watch a sports fest...,0,0


### 2 - Data Cleaning

In [64]:
#Check null values
df.isnull().sum()

Date              0
Username          2
Locations    176454
Content           0
Retweets          0
Favorites         0
dtype: int64

###### The null locations are understandable, since not every Twitter user lists a location on their profile, but I will look further into the null values in the Username column.
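
Raw null counts are easier to judge as a share of all rows. A minimal sketch on toy data (the `toy` frame and its values are made up for illustration); the same expression could be applied to `df`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweet dataset
toy = pd.DataFrame(
    {
        "Username": ["a", None, "c", "d"],
        "Locations": ["Doha", np.nan, np.nan, "London"],
    }
)

# isnull() gives booleans, so the column mean is the fraction of nulls
null_share = toy.isnull().mean() * 100
print(null_share.round(1))
```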

In [65]:
df[df['Username'].isnull()]

Unnamed: 0,Date,Username,Locations,Content,Retweets,Favorites
452511,2022-11-15 00:22:14+00:00,,"San Antonio, TX",Shirts ordered for #WorldCup2022 https://t.co/...,0,3
528135,2022-11-20 19:06:54+00:00,,"San Antonio, TX",On the road to Doha starting at @SATairport #F...,0,11


In [72]:
# Drop the rows with null values in the Username column

df.dropna(subset=['Username'], inplace=True)

#Check again
df.isnull().sum()

Date              0
Username          0
Locations    617021
Content           0
Retweets          0
Favorites         0
dtype: int64

In [73]:
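# Replace NaN values in the Locations column with "Unknown"
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
df["Locations"] = df["Locations"].fillna("Unknown")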

In [75]:
#Check again
df.isnull().sum()

Date         0
Username     0
Locations    0
Content      0
Retweets     0
Favorites    0
dtype: int64

In [76]:
# Check for duplicates
df.duplicated().sum()

15

In [78]:
df[df.duplicated()]

Unnamed: 0,Date,Username,Locations,Content,Retweets,Favorites
68338,2022-12-10 18:28:56+00:00,shumanimutendi,Unknown,@blacklabelsa They will meet France \n#FIFAWor...,0,0
164370,2022-12-14 10:27:45+00:00,Jimbob210712,Unknown,@markgoldbridge Depends who the Ref is on the ...,0,0
216151,2022-12-13 12:58:11+00:00,SubramanyamJk,Unknown,@Hisense_IND @KMbappe\n going to win the Golde...,0,1
234444,2022-12-17 15:07:44+00:00,masagyai1,Unknown,@SSFootball @TheBold27 The Moroccan Goalkeeper...,0,1
237426,2022-12-17 12:42:36+00:00,TakiSAPP96,Unknown,@ViaWallet I think Argentina will win this mat...,0,0
305331,2022-12-18 18:59:03+00:00,BigBellaseason1,Unknown,Are you goin to organise a cup for them? 😅#Fif...,0,0
454848,2022-11-14 17:04:30+00:00,JkvRivai,Unknown,@Daily_JKUpdate OUR PRIDE JUNGKOOK\nFIFAKOOK I...,0,0
455424,2022-11-14 15:58:05+00:00,NareeSahl,Unknown,@ITZYTrendGlobal @ITZYofficial @McDonalds WANN...,0,0
473875,2022-11-19 16:59:58+00:00,seraapalestine,Unknown,who were punished solely for expressing solida...,0,0
474569,2022-11-19 16:41:18+00:00,Qwerty6137,Unknown,Dreamers 🌏🏆⚽\n#Dreamers2022 #FIFAWorldCup #Jun...,0,0


In [88]:
# Drop the duplicate rows
df.drop_duplicates(subset=None, keep="first", inplace=True)

In [89]:
#Check again for duplicates
df.duplicated().sum()

0
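
By default `duplicated()` flags only the second and later copies of a row, so the listing above shows the repeats but not the rows they duplicate. A small sketch on toy data (the `toy` frame is made up for illustration) shows how `keep=False` displays every member of a duplicate group, and how the index can be renumbered after dropping:

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical
toy = pd.DataFrame(
    {
        "Username": ["a", "b", "a", "c"],
        "Content": ["x", "y", "x", "z"],
    }
)

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# drop_duplicates returns a new frame; reset_index renumbers the rows 0..n-1
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(len(deduped))
```

Resetting the index is optional; without it the row labels keep their gaps, as in the `df.info()` output (617006 entries labelled 0 to 617022).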

In [90]:
#Check the description of the dataset
df.describe()

Unnamed: 0,Retweets,Favorites
count,617006.0,617006.0
mean,5.651269,33.05181
std,457.320311,1481.945343
min,0.0,0.0
25%,0.0,0.0
50%,0.0,1.0
75%,0.0,2.0
max,296698.0,793393.0


In [91]:
# Check the dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 617006 entries, 0 to 617022
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Date       617006 non-null  object
 1   Username   617006 non-null  object
 2   Locations  617006 non-null  object
 3   Content    617006 non-null  object
 4   Retweets   617006 non-null  int64 
 5   Favorites  617006 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 33.0+ MB


In [None]:
# Convert the date column to datetime type
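
One way to do this conversion is `pd.to_datetime`, sketched here on two timestamps copied from the outputs above. The `toy` frame and the `Day` column are illustrative names; for the real data the same call would be applied to `df["Date"]`.

```python
import pandas as pd

# Two timestamps in the same format as the Date column
toy = pd.DataFrame(
    {"Date": ["2022-12-10 23:59:57+00:00", "2022-11-20 15:07:31+00:00"]}
)

# utc=True yields a timezone-aware datetime column in UTC
toy["Date"] = pd.to_datetime(toy["Date"], utc=True)
print(toy["Date"].dtype)

# Hypothetical helper column: the calendar day, handy for per-day grouping
toy["Day"] = toy["Date"].dt.date
print(toy)
```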

