<a href="https://colab.research.google.com/github/ThiagoFPMR/Discord-Analysis/blob/master/Discord_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Modules & Reading Data
Getting the imports out of the way first.

In [None]:
!pip install emoji



In [None]:
import re
import emoji
import numpy as np
import pandas as pd
import plotly.express as px

## Reading the data

In [None]:
data = pd.read_csv("msg_hist.csv", usecols=[1, 2, 3])

# Preparing The Data
Turning the raw data we extracted into something better to work with.

## Anonymizing The Data
It's not ethical (and probably not even legal) to make the account names of the people in a dataset public without their permission, so we'll anonymize them first.

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   content  97923 non-null   object
 1   time     100000 non-null  object
 2   author   100000 non-null  object
dtypes: object(3)
memory usage: 2.3+ MB


In [None]:
def anom_dict (names):
  anonymized = {}
  for index, name in enumerate(names.unique()):
    anonymized[name] = f"A{index + 1}"
  return anonymized

In [None]:
data.author = data.author.map(anom_dict(data.author))
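The same labelling can be sketched without an explicit loop. This is an alternative to the dictionary approach, not the method used above: it relies on `pd.factorize`, which numbers distinct values in order of first appearance, and the `authors` series below is made up for illustration.

```python
import pandas as pd

# Made-up account names, for illustration only.
authors = pd.Series(["bob", "alice", "bob", "carol"])

# factorize assigns 0, 1, 2, ... to each distinct value in order of first appearance.
codes, uniques = pd.factorize(authors)

# Build labels A1, A2, ... following the same naming scheme as anom_dict.
anonymized = pd.Series(["A" + str(code + 1) for code in codes], index=authors.index)
print(anonymized.tolist())
```

Either way, the mapping from real names to labels should not be published alongside the data.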

## Basic Cleaning
The dataset has missing values where the bot failed to read a message, and its time column is stored as strings, which we want as datetime objects.

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   content  97923 non-null   object
 1   time     100000 non-null  object
 2   author   100000 non-null  object
dtypes: object(3)
memory usage: 2.3+ MB


The time column has dtype object (plain strings), which drastically limits what we can do with it. To fix that, we'll convert it to datetime.

In [None]:
data.time = pd.to_datetime(data.time)
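As a rough sketch of what the conversion makes possible, the example below parses a few made-up timestamps in the same format as the time column and then uses the `.dt` accessor, which is only available on datetime data. The variable names are hypothetical.

```python
import pandas as pd

# Made-up timestamps in the same format as the dataset's time column.
times = pd.Series([
    "2020-10-31 23:44:09.770",
    "2020-10-31 23:29:08.644",
    "2020-11-01 08:15:00.000",
])

parsed = pd.to_datetime(times)

# Hour of day for each message.
hours = parsed.dt.hour.tolist()

# Number of messages per calendar day, in chronological order.
per_day = parsed.dt.date.value_counts().sort_index()
print(hours)
print(per_day)
```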

In [None]:
data.isnull().sum()

content    2077
time          0
author        0
dtype: int64

The missing values consist of messages the bot was unable to read, such as embeds and images. They're all unnecessary for our purposes, so we can just get rid of them.

In [None]:
data.dropna(inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97923 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   content  97923 non-null  object        
 1   time     97923 non-null  datetime64[ns]
 2   author   97923 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 3.0+ MB


## Preparing The Data For Analysis
We have dealt with missing values and incorrect value types, but we might still want to make a few changes to the dataframe to make getting info out of it easier.

In [None]:
data.head()

| | content | time | author |
|---|---|---|---|
| 0 | D:< i tried something different this year | 2020-10-31 23:44:09.770 | A1 |
| 1 | I like how u had to think about what would go ... | 2020-10-31 23:29:08.644 | A2 |
| 2 | Thx | 2020-10-31 23:20:24.956 | A3 |
| 3 | Nice costumes | 2020-10-31 23:08:03.268 | A1 |
| 4 | Ooo | 2020-10-31 23:07:49.062 | A1 |


### Emojis
We'll also keep track of the emojis sent in each message via an array.

In [None]:
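def emoji_list (msg):
  emote = np.array([])
  # Emoticons, symbols & pictographs, transport & map symbols, and regional-indicator flags
  pattern = "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+"
  if re.search(pattern, msg):
    for term in msg.split():
      # emoji.UNICODE_EMOJI was removed in emoji 2.0; is_emoji() assumes emoji >= 2.0
      if emoji.is_emoji(term):
        emote = np.append(emote, term)
  return emote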

In [None]:
# Use a separate name so the emoji_list function is not overwritten
emojis = data.content.apply(emoji_list)
data.insert(3, "emoji", emojis)

### Discord Emotes
Aside from standard emojis, Discord also has its own exclusive guild emotes, which we might want to keep track of.

Extracting Discord emotes into a separate column.

In [None]:
def emote_list (msg):
  emotes = np.array([])
  pattern = "<:(.*):[0-9]{18}>"  
  if re.search(pattern, msg):
    for term in msg.split():
      if term[:2] == "<:": # We only want to store the name of the emote
        emotes = np.append(emotes, term.split(":")[1]) 
  return emotes

In [None]:
discord_emotes = data.content.apply(emote_list)
data.insert(4, "discord_emotes", discord_emotes)

Extracting animated Discord emotes into a separate column.

In [None]:
def animated_emote_list (msg):
  emotes = np.array([])
  pattern = "<a:(.*):[0-9]{18}>"  
  if re.search(pattern, msg):
    for term in msg.split():
      if term[:3] == "<a:": # We only want to store the name of the emote
        emotes = np.append(emotes, term.split(":")[1])
  return emotes

In [None]:
animated_discord_emotes = data.content.apply(animated_emote_list)
data.insert(5, "a_discord_emotes", animated_discord_emotes)
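The two functions above split each message on whitespace, so an emote that touches other text (or another emote) would be missed. As a hedged alternative, the sketch below pulls both kinds of emote out with a single regular expression. It assumes the markup is `<:name:id>` for static and `<a:name:id>` for animated emotes, does not assume a fixed ID length, and uses the hypothetical names `EMOTE_PATTERN` and `split_emotes`.

```python
import re

# Assumed emote markup: <:name:id> (static) and <a:name:id> (animated),
# where id is a run of digits whose length is not fixed here.
EMOTE_PATTERN = re.compile(r"<(a?):(\w+):(\d+)>")

def split_emotes(msg):
    """Return (static_names, animated_names) found anywhere in msg."""
    static, animated = [], []
    for flag, name, _emote_id in EMOTE_PATTERN.findall(msg):
        # flag is "a" for animated emotes and "" for static ones.
        (animated if flag else static).append(name)
    return static, animated

# Made-up message with two emotes that are not separated by whitespace.
example = "gg <:pog:123456789012345678><a:dance:123456789012345678> nice"
print(split_emotes(example))
```

Applied with `data.content.apply(split_emotes)`, this would return a pair of lists per message rather than two separate columns, so it would need unpacking before being inserted into the dataframe.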

### Adding a Word Count Column
Adding a column with the word count of each message lets us later compute the average length of each user's messages.

In [None]:
def word_count (msg):
  return len(msg.split())

In [None]:
words = data.content.apply(word_count)
data.insert(6, "word_count", words)
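To illustrate how the new column could yield the average message length per user, here is a minimal sketch on a made-up dataframe. The `toy` frame and `avg_words` name are hypothetical, and the word count is computed with vectorized string methods rather than the `word_count` function above.

```python
import pandas as pd

# Made-up messages, for illustration only.
toy = pd.DataFrame({
    "author": ["A1", "A2", "A1", "A3"],
    "content": ["Nice costumes", "I like how u had to think", "Ooo", "Thx"],
})

# Split on whitespace and count the pieces, as word_count does.
toy["word_count"] = toy["content"].str.split().str.len()

# Mean words per message for each author.
avg_words = toy.groupby("author")["word_count"].mean()
print(avg_words)
```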

# Playing With The Data
With the preparation done, we can finally do some analysis work.

In [None]:
data.head()

| | content | time | author | emoji | discord_emotes | a_discord_emotes | word_count |
|---|---|---|---|---|---|---|---|
| 0 | D:< i tried something different this year | 2020-10-31 23:44:09.770 | A1 | [] | [] | [] | 7 |
| 1 | I like how u had to think about what would go ... | 2020-10-31 23:29:08.644 | A2 | [] | [] | [] | 22 |
| 2 | Thx | 2020-10-31 23:20:24.956 | A3 | [] | [] | [] | 1 |
| 3 | Nice costumes | 2020-10-31 23:08:03.268 | A1 | [] | [] | [] | 2 |
| 4 | Ooo | 2020-10-31 23:07:49.062 | A1 | [] | [] | [] | 1 |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97923 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   content           97923 non-null  object        
 1   time              97923 non-null  datetime64[ns]
 2   author            97923 non-null  object        
 3   emoji             97923 non-null  object        
 4   discord_emotes    97923 non-null  object        
 5   a_discord_emotes  97923 non-null  object        
 6   word_count        97923 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 6.0+ MB


## User Activity Over Time
Plotting the messages sent by the server members over the entire period covered by the dataset.

When we try to plot the user activity in the server, the plot gets visually crowded because of the number of authors, most of whom contribute very little to the data.

In [None]:
fig = px.histogram(data, x='time', color='author', opacity=0.5)
fig.show()

So many authors barely affect the dataset because the bot also reads messages sent by inactive users and by other bots, meaning there's data that, despite not being NaN, is unnecessary to the overall analysis.

In [None]:
fig = px.histogram(data, 
                   x='author',
                   title='Messages Sent Per Author',
                   labels={'author':'Author'})

fig.update_layout(title_font_size=30,
                  template='plotly_white')

fig.show()

By dropping all authors who have sent fewer messages than a minimum amount, we filter out those who don't significantly alter our data. You can execute the cell below and then re-run the cell above to see the difference in the graph for yourself.

In [None]:
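min_msgs = 2030
# Look up each row's author in the per-author message counts, then keep only
# the rows whose author has sent more than min_msgs messages
data = data[(data.author.value_counts()[data.author] > min_msgs).to_list()].copy()
data = data.dropna()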
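The filter above can be easier to follow on a small example. The sketch below applies the same idea to a made-up dataframe, using `Series.map` to attach each author's message count to every row; the `toy` frame and its threshold are hypothetical.

```python
import pandas as pd

# Made-up authors: A1 sent 3 messages, A2 sent 1, A3 sent 2.
toy = pd.DataFrame({"author": ["A1", "A1", "A1", "A2", "A3", "A3"]})
min_msgs = 1  # hypothetical threshold for the toy data

# Messages sent per author.
counts = toy["author"].value_counts()

# Map each row's author to that author's count, then keep the busy authors.
filtered = toy[toy["author"].map(counts) > min_msgs].copy()
print(filtered["author"].unique().tolist())
```

Here A2 is dropped because it has only one message, while every row from A1 and A3 is kept.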

With the crowding problem resolved, plotting the graph becomes very easy.

In [None]:
fig = px.histogram(data, x='time',
                         color='author', 
                         opacity=0.5,
                         title="User Activity Over Time",
                         labels={'time':'Date'})
fig.update_layout(barmode='overlay',
                  title_font_size=30,
                  template='plotly_white')
fig.show()

## Messages Sent per User
By using the ***value_counts()*** method, we can easily get the number of messages sent by each author. Filtering the data to a narrower time window before plotting limits the analysis to more recent messages.

In [None]:
fig = px.bar(x=data.author.value_counts().index,
             y=data.author.value_counts(),
             color=data.author.value_counts().index,
             title='Messages Sent per User', 
             labels={'x': 'Author', 'y': 'Messages Sent'})
fig.update_layout(title_font_size=30,
                  template='plotly_white')

fig.show()
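Restricting the counts to a time window could look like the sketch below. It works on a made-up dataframe with the same author and time columns, and the cutoff date `start` is an arbitrary assumption.

```python
import pandas as pd

# Made-up messages, for illustration only.
toy = pd.DataFrame({
    "author": ["A2", "A2", "A1", "A1"],
    "time": pd.to_datetime([
        "2020-08-01 10:00",
        "2020-10-30 12:00",
        "2020-10-31 23:08",
        "2020-10-31 23:20",
    ]),
})

start = pd.Timestamp("2020-10-01")  # hypothetical cutoff

# Keep only messages sent on or after the cutoff, then count per author.
recent = toy[toy["time"] >= start]
counts = recent["author"].value_counts()
print(counts)
```

The resulting `counts` series can be passed to `px.bar` in the same way as `data.author.value_counts()` above.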

### Messages Sent Per User Containing a Certain Term
By filtering the dataframe to keep only the messages that contain a certain term, using the vectorized string method ***contains()***, you can easily change the plot above to display how many messages containing that term each author has sent.

In [None]:
term = 'LOL'
term_data = data[data.content.str.contains(term)]

fig = px.bar(x=term_data.author.value_counts().index,
             y=term_data.author.value_counts(),
             color=term_data.author.value_counts().index,
             title=f'Messages Containing "{term}" Per User', 
             labels={'x': 'Author', 'y': 'Messages Sent'})
fig.update_layout(title_font_size=30,
                  template='plotly_white')

fig.show()
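Two defaults of ***contains()*** are worth keeping in mind: the match is case-sensitive, and the term is interpreted as a regular expression. The sketch below shows the effect of the `case` and `regex` parameters on a few made-up messages.

```python
import pandas as pd

# Made-up messages, for illustration only.
messages = pd.Series(["LOL nice", "lol", "that was (fun)", "ok"])

# Default behaviour: case-sensitive, term treated as a regular expression.
default_match = messages.str.contains("LOL")

# Case-insensitive, literal match.
loose_match = messages.str.contains("LOL", case=False, regex=False)

# regex=False lets terms with special characters be matched literally.
literal_match = messages.str.contains("(fun)", regex=False)

print(default_match.tolist())
print(loose_match.tolist())
print(literal_match.tolist())
```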

## Emotes Sent per User
Plotting the emotes sent by the server members over the entire period covered by the dataset while dividing the data by the type of the emote.

First we define a function that turns the **arrays** in the *emoji*, *discord_emotes* and *a_discord_emotes* columns into **int values** giving the number of elements each array contains.

In [None]:
def quantity (emote_list):
  return len(emote_list)

After copying only the columns we'll use (dropping time and word count) into a new variable, we apply the function we defined.

In [None]:
data_line = data[['author', 'emoji', 'discord_emotes', 'a_discord_emotes']].copy()

for column in ['emoji', 'discord_emotes', 'a_discord_emotes']:
  data_line[column] = data_line[column].apply(quantity) 

data_line = pd.melt(data_line, id_vars=['author'], value_vars=['emoji', 'discord_emotes', 'a_discord_emotes'])
data_line = data_line.groupby(by=['author', 'variable']).sum().reset_index()

The same cell then uses the ***pd.melt()*** function to reshape the data frame into the *'tidy'* (long) format and the ***groupby()*** method to sum the values for each author and variable type.

In [None]:
data_line.head()

| | author | variable | value |
|---|---|---|---|
| 0 | A1 | a_discord_emotes | 0 |
| 1 | A1 | discord_emotes | 46 |
| 2 | A1 | emoji | 50 |
| 3 | A10 | a_discord_emotes | 14 |
| 4 | A10 | discord_emotes | 72 |
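To make the reshaping step more concrete, here is a small sketch of the same melt-then-group pattern on a made-up wide table with two emote columns. The `wide`, `tidy` and `totals` names are hypothetical.

```python
import pandas as pd

# Made-up per-message emote counts, one row per message.
wide = pd.DataFrame({
    "author": ["A1", "A1", "A2"],
    "emoji": [1, 0, 2],
    "discord_emotes": [0, 3, 1],
})

# Wide to long: one row per (message, emote type) pair.
tidy = pd.melt(wide, id_vars=["author"], value_vars=["emoji", "discord_emotes"])

# Sum the counts for each author and emote type.
totals = tidy.groupby(["author", "variable"])["value"].sum().reset_index()
print(totals)
```

The long format has one row per author and emote type, which is the shape `px.bar` needs in order to colour the bars by the variable column.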


After filtering the dataset and transforming it into the appropriate format, plotting it becomes pretty easy.

In [None]:
fig = px.bar(data_line, x ='author',
                        y='value',
                        color='variable',
                        labels={'value':'Emotes Sent', 'author':'Author'},
                        title="Emotes Sent per User")
fig.update_layout(title_font_size=30,
                  template='plotly_white')
fig.show()