# Exploratory Data Analysis

In this notebook, I give an overview of the most common words, the word clouds generated from them, and the users.

The research addresses the following metrics and questions:
1. Tweet metrics:
    1. Tweet statistics (likes, retweets) (Tableau)
    2. What are the most common words in the tweets?
    3. How has the number of tweets changed over time? (Tableau)

2. User metrics:
    1. Which users are the most liked?
    2. Which users are the most active, by number of tweets?
    3. How many users in total tweet about the topic, and how has that number changed? (Tableau)

3. Content analysis:
    1. Sentiment analysis over time (Tableau)
    2. Topic modelling
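
The two user metrics above (most liked and most active users) can be sketched with a pandas group-by. This is only a minimal illustration on toy data: the frame `tweets` and its column names `user` and `likes` are assumptions, not the schema of the actual dataset.

```python
import pandas as pd

# Toy data; the frame and the column names `user` and `likes` are hypothetical.
tweets = pd.DataFrame({
    'user':  ['ana', 'ben', 'ana', 'cy', 'ben', 'ana'],
    'likes': [10, 3, 5, 40, 2, 1],
})

# Most liked users: total likes per user, largest first
most_liked = tweets.groupby('user')['likes'].sum().sort_values(ascending=False)

# Most active users: number of tweets per user, largest first
most_active = tweets['user'].value_counts()

print(most_liked.head(3))
print(most_active.head(3))
```

Summing likes favours prolific users; replacing `.sum()` with `.mean()` would rank by average likes per tweet instead.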

In [None]:
import pandas as pd
import numpy as np
import pickle
import feather
import matplotlib.pyplot as plt
import os

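# Name of the dataset folder under ./Datasets, typed in by the user
dataset = input('Dataset name: ')
# Keep the trailing separator: later cells build file names as path + dataset + suffix
path = os.path.join(os.getcwd(), 'Datasets', dataset, '')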

## 1) Most common words

In [None]:
dtm = pd.read_pickle(path+dataset+'_dtm.pkl')

In [None]:
dtm.sort_values(by='2020-12-31', ascending=False).head(10)
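
The cell above ranks words for a single hard-coded column. As a rough sketch, the same ranking can be repeated for every column at once. It assumes the layout implied by the `sort_values` call (one row per word, one column per period); `toy_dtm` and the helper `top_words` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy document-term matrix: one row per word, one column per period (assumed layout).
toy_dtm = pd.DataFrame(
    {'2019-12-31': [5, 1, 3], '2020-12-31': [2, 7, 4]},
    index=['alpha', 'beta', 'gamma'],
)

def top_words(dtm, n=10):
    """Return a dict mapping each column to its n highest-count words."""
    return {col: dtm[col].nlargest(n).index.tolist() for col in dtm.columns}

top = top_words(toy_dtm, n=2)
for period, words in top.items():
    print(period, words)
```

With the real matrix, `top_words(dtm)` would list the ten highest-count words per period without naming any column explicitly.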

## 2) Generate WordClouds

In [None]:
df_corpus = pd.read_pickle(path+dataset+'_corpus.pkl').to_frame().transpose()

In [None]:
from wordcloud import WordCloud

wc = WordCloud(width=800, height=800, collocations=False,
               max_font_size=150, random_state=42)

In [None]:
plt.rcParams['figure.figsize'] = [16,6]

In [None]:
for date in df_corpus.columns:
    year = str(date)[:4]
    # Join all the text for this year into a single string
    allwords = ' '.join(df_corpus[date])
    # Generate a word cloud
    cloud = wc.generate(allwords)

    # Draw the cloud on its own figure
    plt.figure(figsize=(20, 10))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(year)
    plt.tight_layout(pad=0)
    plt.savefig(path + dataset + year + '_wordcloud.pdf')
    plt.show()