<br>
<br>
<h2 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 3px; background-color:#1DA1F2 ; color :#FFFFFF; border-radius: 5px 5px; padding:10px;text-align:center; font-weight: bold">Exploratory Data Analysis And Visualization on Squid Game Tweets</h2> 
<br> 
<br>

<div class="Column">
  <div class="row">
    <img src="https://i.postimg.cc/HxhJ902h/60f5172816321e9428ac1ede-twitter.gif" alt="Twitter animation" style="width:100%">
  </div>
</div>

# **Introduction**

**Squid Game (Korean: 오징어 게임; RR: Ojing-eo Geim) is a South Korean survival drama television series created by Hwang Dong-hyuk for Netflix. Its cast includes Lee Jung-jae, Park Hae-soo, Wi Ha-joon, HoYeon Jung, O Yeong-su, Heo Sung-tae, Anupam Tripathi, and Kim Joo-ryoung.**

**The series revolves around a contest where 456 players, all of whom are in deep financial debt, risk their lives to play a series of deadly children's games for the chance to win a ₩45.6 billion (US$38 million, €33 million, or GB£29 million as of broadcast) prize. The title of the series draws from a similarly named Korean children's game.**

<div class="Column">
  <div class="row">
    <img src="https://media.npr.org/assets/img/2021/10/15/squidgame_unit_101_577_wide-feafa98c140a6d814d3a600a52f75535bf1a2df4-s900-c85.webp" alt="Squid Game scene" style="width:100%">
  </div>
</div>

# **Importing the necessary libraries**:

In [None]:
import numpy as np 
import pandas as pd 
import os
import itertools

#plots
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.colors import n_colors
from plotly.subplots import make_subplots

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer

from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

from PIL import Image
import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
stop = set(stopwords.words('english'))
%matplotlib inline
import missingno as mno

import re
from collections import Counter

import requests
import json

import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})

import warnings
warnings.filterwarnings("ignore")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **Loading Dataset**

In [None]:
twitter_data = pd.read_csv("../input/squid-game-netflix-twitter-data/tweets_v8.csv")


# **Let's take a quick overview of how the data looks!**:

In [None]:
# Examining Data
twitter_data.head()

In [None]:
# Examining Data
twitter_data.tail()

In [None]:
print("This dataset has {} rows and {} columns.".format(twitter_data.shape[0], twitter_data.shape[1]))

In [None]:
# Examining statistics
twitter_data.describe()

In [None]:
# Let's get some information about the data types in our dataset by calling info()
twitter_data.info()

In [None]:
# The box plot below shows how the values of the int64 and bool columns are distributed
sns.boxplot(data=twitter_data, orient="h", palette="Set2")

# **Missing Values:**

In [None]:
#Let's find out about the missing values in the dataset by executing the code below:
mno.matrix(twitter_data)

In [None]:
missed = pd.DataFrame()
missed['column'] = twitter_data.columns

missed['percent'] = [round(100* twitter_data[col].isnull().sum() / len(twitter_data), 2) for col in twitter_data.columns]
missed = missed.sort_values('percent',ascending=False)
missed = missed[missed['percent']>0]

fig = sns.barplot(
    x=missed['percent'], 
    y=missed["column"], 
    orient='h'  # seaborn's documented parameter for horizontal bars
).set_title('Percentage of missing values per column')

In [None]:
twitter_data.isna().sum()

# Observation:
* **The bar plot above shows that only three columns have missing values.**
* **The plot also shows the percentage of missing values in each of these columns.**
* **A lot of data is missing in user_location and description, while very few user_name values are missing.**

# Reasons for missing values!

* **Some users leave the description in their bio empty, and many tweet without setting a user_location.**
* **A missing user_name is stranger: no account can exist without one, so this is most likely a data collection error.**
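
One possible way to deal with these gaps before further analysis is sketched below. It runs on a tiny hand-made frame, not on the real dataset. The helper name `fill_missing_text`, the column name `user_description`, and the placeholder strings are assumptions made for illustration only.

```python
import pandas as pd

def fill_missing_text(df, placeholders):
    """Return a copy of df where NaN is replaced per column by the given placeholder."""
    return df.fillna(value=placeholders)

# Made-up sample that mimics the three columns with missing values
sample = pd.DataFrame({
    "user_name": ["alice", None, "carol"],
    "user_location": [None, "Seoul", None],
    "user_description": ["fan", None, None],
})

# Hypothetical placeholder choices, one per column
placeholders = {
    "user_name": "Unknown",
    "user_location": "Unknown",
    "user_description": "",
}

filled = fill_missing_text(sample, placeholders)
print(filled.isna().sum())
```

Filling with a placeholder keeps every tweet in the data, which matters here because the tweet text itself is still usable even when the profile fields are empty.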

In [None]:
#converting date column to date format
twitter_data['date'] = pd.to_datetime(twitter_data['date']).dt.date
twitter_data.head()


# **Features exploration**

# **Most frequent values**

In [None]:
def most_frequent_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    items = []
    vals = []
    for col in data.columns:
        # Compute the counts once per column; the first entry is the most frequent value
        counts = data[col].value_counts()
        items.append(counts.index[0])
        vals.append(counts.values[0])
    tt['Most frequent item'] = items
    tt['Frequency'] = vals
    tt['Percent from total'] = np.round(vals / total * 100, 3)
    return(np.transpose(tt))

In [None]:
most_frequent_values(twitter_data)

# **Let's Look into the unique values**


In [None]:
def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    # Multiply by 100 so the column really holds a percentage, not a fraction
    tt['Percentage'] = tt['Uniques'] / tt['Total'] * 100
    return(np.transpose(tt))

In [None]:
unique_values(twitter_data)

# **Unique values in each column**

In [None]:
unique_df = pd.DataFrame()
unique_df['Features'] = twitter_data.columns
unique=[]
for i in twitter_data.columns:
    unique.append(twitter_data[i].nunique())
unique_df['Uniques'] = unique

f, ax = plt.subplots(1,1, figsize=(15,7))

splot = sns.barplot(x=unique_df['Features'], y=unique_df['Uniques'], alpha=0.8)
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center',
                   va = 'center', xytext = (0, 9), textcoords = 'offset points')
plt.title('Bar plot for number of unique values in each column',weight='bold', size=15)
plt.ylabel('#Unique values', size=12, weight='bold')
plt.xlabel('Features', size=12, weight='bold')
plt.xticks(rotation=90)
plt.show()

**Nearly 68% of the user names and 70% of the user descriptions are unique.**

# **Top 50 users**

In [None]:
# rename_axis + reset_index(name=...) gives the same column names on old and new pandas versions
twitter_data_username_count = twitter_data['user_name'].value_counts().rename_axis(
    'user_name').reset_index(name='tweet_count')

plt.figure(figsize=(15, 17))
sns.barplot(y='user_name',x='tweet_count',data=twitter_data_username_count.head(50))
y=twitter_data_username_count['tweet_count'].head(50)
for index, value in enumerate(y):
    plt.text(value, index, str(value),fontsize=12)
plt.title('Top 50 users by number of tweets',weight='bold', size=15)
plt.ylabel('User_name', size=12, weight='bold')
plt.xlabel('Tweet_count', size=12, weight='bold')
plt.show()

# **Visualizing Tweet Count vs Location**

In [None]:
plt.figure(figsize=(15,10))
twitter_data['user_location'].value_counts().nlargest(20).plot(kind='bar')
plt.xticks(rotation=60)

# **Tweet source distribution**

In [None]:
plt.figure(figsize=(15,10))
twitter_data['source'].value_counts().nlargest(6).plot(kind='bar')
plt.xticks(rotation=80)

# **User accounts created year by year**

In [None]:
twitter_data['Timestamp'] = pd.to_datetime(twitter_data.user_created, format="%d-%m-%Y %H:%M", errors='coerce')
mask = twitter_data.Timestamp.isnull()
twitter_data.loc[mask, 'Timestamp'] = pd.to_datetime(twitter_data[mask]['user_created'], format='%Y-%m-%d %H:%M:%S',
                                             errors='coerce')
twitter_data['year_created'] = twitter_data['Timestamp'].dt.year
data = twitter_data.drop_duplicates(subset='user_name', keep="first")
data = data[data['year_created']>1970]
data = data['year_created'].value_counts().reset_index()
data.columns = ['year', 'number']

fig = sns.barplot( 
    x=data["year"].astype(int), 
    y=data["number"], 
    orient='v'  # seaborn's documented parameter for vertical bars
).set_title('User accounts created year by year')

* **2021 has the highest number of newly created accounts, followed by 2009.**
* **After 2009 the number gradually decreases, but from 2019 onward it rises sharply.**

# **Timestamp Analysis of tweets:**

In [None]:
twitter_data['tweet_date'] = pd.to_datetime(twitter_data['date'], errors='coerce').dt.date
# rename_axis + reset_index(name=...) gives the same column names on old and new pandas versions
tweet_date = twitter_data['tweet_date'].value_counts().rename_axis('date').reset_index(name='count')
tweet_date['date'] = pd.to_datetime(tweet_date['date'])
tweet_date = tweet_date.sort_values('date', ascending=False)

fig = go.Figure(go.Scatter(x=tweet_date['date'],
              y=tweet_date['count'],
              mode='markers+lines',
              name="Tweets",
              marker_color='dodgerblue'))

# template and title_x are arguments of update_layout, not of str.format
fig.update_layout(
    title_text='Tweets per Day: ({} - {})'.format(
        tweet_date['date'].min().date(), tweet_date['date'].max().date()),
    template="plotly_dark",
    title_x=0.5)

fig.show()

In [None]:
stopwords = set(STOPWORDS)

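def show_wordcloud(data, title = None):
    # Join all tweets into one string; str(data) would only give the truncated Series repr
    text = ' '.join(data.dropna().astype(str))
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=50,
        max_font_size=40, 
        scale=5,
        random_state=1
    ).generate(text)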

    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

# **Text word clouds**

In [None]:
show_wordcloud(twitter_data['text'], title = 'Most frequent words in tweets')
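
A word cloud only conveys frequencies visually. If exact counts are wanted, a small counter such as the sketch below could complement it. The function name `top_words`, the sample tweets, and the tiny stop-word set are made up for illustration; on the real data the `text` column and a fuller stop-word list would be used instead.

```python
import re
from collections import Counter

def top_words(texts, stop_words, n=3):
    """Count lowercase word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Hypothetical sample tweets and stop words
sample_tweets = [
    "Squid Game is the best show",
    "I watched Squid Game all night",
    "The game was intense",
]
sample_stop_words = {"is", "the", "i", "all", "was"}

print(top_words(sample_tweets, sample_stop_words, n=2))
# [('game', 3), ('squid', 2)]
```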

In [None]:
CA_df = twitter_data.loc[twitter_data.user_location=="CA"]
show_wordcloud(CA_df['text'], title = 'Most frequent words in tweets from CA')

In [None]:
England_df = twitter_data.loc[twitter_data.user_location=="England"]
show_wordcloud(England_df['text'], title = 'Most frequent words in tweets from England')

In [None]:
us_df = twitter_data.loc[twitter_data.user_location=="United States"]
show_wordcloud(us_df['text'], title = 'Most frequent words in tweets from US')
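
The filters above compare `user_location` with an exact string, so free-text variants such as "London, England" are left out. A more tolerant selection could look like the sketch below, which uses a case-insensitive pattern match. The helper name `tweets_from` and the sample rows are assumptions for illustration, and whether looser matching is desirable depends on the analysis.

```python
import pandas as pd

def tweets_from(df, pattern):
    """Select rows whose user_location matches pattern (case-insensitive regex)."""
    mask = df["user_location"].str.contains(pattern, case=False, na=False, regex=True)
    return df.loc[mask]

# Made-up rows showing typical free-text location variants
sample = pd.DataFrame({
    "user_location": ["England", "London, England", "england, UK", None, "CA"],
    "text": ["t1", "t2", "t3", "t4", "t5"],
})

# \b keeps the match on whole words; na=False treats missing locations as no match
england = tweets_from(sample, r"\bengland\b")
print(england["text"].tolist())
# ['t1', 't2', 't3']
```

The resulting frame can be passed to `show_wordcloud` in the same way as the exact-match subsets.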