# Case Study: Reddit Social Network Analysis Against an Influence Operation
Adel Abu Hashim & Mahmoud Nagy - August 2021

## Table of Contents
<ul>
<li><a href="#intro"><b><mark>Introduction</mark></b></a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>This case study aims to help **Amber Heard** <br>
> 
> By analyzing new accounts posting/commenting against a victim of a Social Bot Disinformation/Influence Operation. 
> 
> **We have four main datasets**: <br>
>(The datasets were scraped from **Reddit**.)
> - 1- A dataset with submissions & comments data.
> - 2- Users data (from 2006 to 2021).
> - 3- A merged dataset (submissions & comments data, users data).
> - 4- Daily creation data 
> (# of accounts created per day from 2006 to 2021)

In [89]:
#import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import helpers
import matplotlib.dates as mdates
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
sb.set_style("darkgrid")
%matplotlib inline

# import plotly.io as pio
# pio.renderers.default = "svg"
# svg_renderer = pio.renderers["svg"]
# svg_renderer.width = 900
# svg_renderer.height = 500

In [90]:
# load data
df_users = pd.read_csv("cleaned_data/users_cleaned.csv")
df_creation = pd.read_csv("cleaned_data/creation_cleaned.csv")

In [91]:
# load data
df_21 = pd.read_csv("cleaned_data/reddit_merged_2021.csv")
df_20 = pd.read_csv("cleaned_data/reddit_merged_2020.csv")
df_19 = pd.read_csv("cleaned_data/reddit_merged_2019.csv")
df_18 = pd.read_csv("cleaned_data/reddit_merged_2018.csv")

# convert to datetime
df_21.created_at = pd.to_datetime(df_21.created_at)
df_20.created_at = pd.to_datetime(df_20.created_at)
df_19.created_at = pd.to_datetime(df_19.created_at)
df_18.created_at = pd.to_datetime(df_18.created_at)

# convert to datetime
df_21.user_created_at = pd.to_datetime(df_21.user_created_at)
df_20.user_created_at = pd.to_datetime(df_20.user_created_at)
df_19.user_created_at = pd.to_datetime(df_19.user_created_at)
df_18.user_created_at = pd.to_datetime(df_18.user_created_at)

In [92]:
# All Creations
df_creation.n_accounts.sum()

65548

The roughly 5K accounts missing from this total are banned accounts, for which we have no creation dates.

In [93]:
# All Users
df_users.user_name.nunique()

70573
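
As a quick sanity check on the two figures above, the gap between unique users and summed daily creations can be computed directly. The two constants below are copied from the cell outputs; the variable names are illustrative only.

```python
# Totals copied from the outputs of the cells above.
n_users = 70573        # unique user names in the users dataset
n_creations = 65548    # sum of n_accounts in the daily creation dataset

# Accounts present in the users data but absent from the creation counts.
n_missing = n_users - n_creations
print(n_missing)
```

The difference is 5,025, which is consistent with the roughly 5K banned accounts that have no creation date.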

In [94]:
# convert to datetime
df_users.user_created_at = pd.to_datetime(df_users.user_created_at)
df_creation.date = pd.to_datetime(df_creation.date)

<a id='eda'></a>
## Exploratory Data Analysis
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#eda"><b><mark>Exploratory Data Analysis</mark></b></a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<ul>
<li><a href="#explore_daily"><b><mark>Daily Creation Data</mark></b></a></li>  
<li><a href="#explore_users">User Characteristics</a></li>
<li><a href="#peak_contributions">Contributions of the accounts created on peak days</a></li>
</ul>

<a id='explore_daily'></a>
> ### Exploring Daily Creation Data

<ul>
<li><a href="#all"><b><mark>All Years</mark></b></a></li>
<li><a href="#2018">2018</a></li>
<li><a href="#2019">2019</a></li>
<li><a href="#2020">2020</a></li>
<li><a href="#2021">2021</a></li>
</ul>


<a id='all'></a>
>>### Estimation of the number of user accounts created in each year

Two follow-ups are possible from here: exploring the contributions of the accounts created on peak days, or exploring the creation dates of the users who were active on peak contribution days.

In [95]:
df_creation.groupby(df_creation['date'].dt.year).sum('n_accounts').reset_index();

In [96]:
fig = px.bar(df_creation.groupby(df_creation['date'].dt.year).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each year",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Year',
        tickmode = 'linear',
        dtick = 1
    )
)

clrs = ['red' if (y > 7000) else '#5296dd' for y in df_creation.groupby(df_creation['date'].dt.year)\
        .sum('n_accounts').reset_index()['n_accounts']] 

fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()
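
The cell above evaluates the same grouped sum twice, once for the bars and once for the colors. A minimal sketch of computing it once, on a small made-up frame that only shares the `date` and `n_accounts` column names with `df_creation`; the name `yearly` and the toy values are assumptions for illustration, and 7000 is the threshold used above.

```python
import pandas as pd

# Toy stand-in for df_creation (made-up values, same column names).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-03-01", "2018-07-15", "2019-01-13", "2020-02-04"]),
    "n_accounts": [3000, 2500, 8000, 9000],
})

# Select the column before summing so only n_accounts is aggregated.
yearly = (
    toy.groupby(toy["date"].dt.year)["n_accounts"]
    .sum()
    .rename_axis("year")
    .reset_index()
)

# Highlight years whose total exceeds the threshold used in the chart above.
clrs = ["red" if n > 7000 else "#5296dd" for n in yearly["n_accounts"]]
print(yearly)
print(clrs)
```

A frame like `yearly` could then be passed to `px.bar` with `x='year'` and `y='n_accounts'`, and `clrs` reused for `marker_color`.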

**NOTE:** <br>
Reddit was launched in June 2005. It began to gain notable popularity in mid-2010, has expanded its reach since, and had become “really popular” by early 2013.

##### The effect of the 2019 and 2020 data is obvious; what about 2021?

In [97]:
df_creation.tail(5)

Unnamed: 0,date,n_accounts
4709,2021-05-22,1
4710,2021-05-24,1
4711,2021-05-27,1
4712,2021-05-28,1
4713,2021-05-30,1


####  <font color='green'>Yes, as expected, the 2021 data is incomplete: it runs only through May.</font>
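
The coverage of each year can also be checked programmatically rather than by reading `tail()`. A hedged sketch that groups by year and takes the first and last recorded date; the toy frame is made up and only the column names match `df_creation`.

```python
import pandas as pd

# Toy stand-in for df_creation (made-up values).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-12-31", "2021-01-02", "2021-05-30"]),
    "n_accounts": [10, 12, 4, 1],
})

# First and last date recorded in each calendar year.
coverage = toy.groupby(toy["date"].dt.year)["date"].agg(["min", "max"])
coverage.index.name = "year"
print(coverage)
```

A year whose `max` falls well before December 31 is only partially covered.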

>>### newly created accounts in 2018, 2019, 2020, 2021

In [98]:
colors = px.colors.qualitative.T10

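# name the columns explicitly so the frame looks the same across pandas versions
fig = px.pie(df_users.creation_year.value_counts().rename_axis('index').reset_index(name='creation_year'),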
             values='creation_year', names='index', color_discrete_sequence = colors,
             title = 'newly created accounts')

fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

>>### user accounts created in each date

In [99]:
fig = px.bar(df_creation,
             x='date', y='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each date",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Year',
        tickmode = 'array',
        # on a date axis, tick positions must be dates rather than bare year numbers
        tickvals = ['{}-01-01'.format(y) for y in range(2006, 2022)],
        ticktext = list(range(2006, 2022))
    )
)


fig.update_traces(marker_color='#f69c56', marker_line_color='#5296dd',
                  opacity=1, textposition='auto')
fig.show()


### user accounts created in each date for (2018, 2019, 2020, 2021)

In [100]:
# filter on years with peaks/ of interest (2018, 2019, 2020, 2021)
df_creation_temp = df_creation[(df_creation['date'].dt.year == 2018) |
                                             (df_creation['date'].dt.year == 2019) |
                                             (df_creation['date'].dt.year == 2020) |
                                             (df_creation['date'].dt.year == 2021)][['date', 'n_accounts']]

df_creation_temp['date'] = df_creation_temp['date'].dt.date

In [101]:
# filtered dataframe on years with peaks (2018, 2019, 2020, 2021)
df_creation_temp;

In [102]:
fig = px.bar(df_creation_temp,
             x='date', 
             y='n_accounts', 
             title='The number of user accounts created in years with peaks/of interest (2018, 2019, 2020, 2021)')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Year',
    )
)

fig.update_traces(marker_color='#f69c56', marker_line_color='#5296dd',
                  opacity=1, textposition='auto')
fig.show()


In [103]:
'As we see from the graph {} has the highest number of accounts created on reddit: {} Accounts'\
.format(df_creation_temp.sort_values('n_accounts', ascending=False).head(1).date.values[0],
        df_creation_temp.sort_values('n_accounts', ascending=False).head(1).n_accounts.values[0])


'As we see from the graph 2020-02-04 has the highest number of accounts created on reddit: 65 Accounts'
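
The peak row can also be looked up without sorting the frame twice. A small sketch using `idxmax` on made-up data; the toy values are chosen to echo the output above and are not read from the datasets.

```python
import datetime as dt
import pandas as pd

# Toy stand-in for df_creation_temp (made-up values).
toy = pd.DataFrame({
    "date": [dt.date(2020, 2, 3), dt.date(2020, 2, 4), dt.date(2020, 2, 5)],
    "n_accounts": [12, 65, 20],
})

# idxmax returns the index label of the largest value; loc fetches that row.
peak = toy.loc[toy["n_accounts"].idxmax()]
message = "{} has the highest number of accounts created: {} accounts".format(
    peak["date"], peak["n_accounts"]
)
print(message)
```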

### Dates of Peaks in all Years

In [104]:
# filter on dates with peaks 
df_creation_peak = df_creation.sort_values('n_accounts', ascending=False).head(10)

fig = px.bar(df_creation_peak,
             x='date', 
             y='n_accounts',  
             title='Dates with the highest accounts creation')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Date',
        tickmode = 'array',
        tickvals = df_creation_peak.date,
    )
)

fig.update_traces(marker_color='red',
                  opacity=1, textposition='auto')

# , marker_line_color='#5296dd'

fig.show()



### Zoom in to see dates

In [105]:
# filter on dates with peaks 
df_creation_peak = df_creation.sort_values('n_accounts', ascending=False).head(10)
df_creation_peak.sort_values('date', inplace=True)

fig = px.bar(df_creation_peak.tail(4),
             x='date', 
             y='n_accounts',  
             title='Dates with the highest accounts creation')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Date',
        tickmode = 'array',
        tickvals = df_creation_peak.tail(4).date,
    )
)

fig.update_traces(marker_color='red',
                  opacity=1, textposition='auto')


fig.show()


In [106]:
fig = px.bar(df_creation.groupby(df_creation['date'].dt.month).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each month",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Month',
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        ticktext = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        )
)

clrs = ['red' if (y > 7000) else '#5296dd' for y in df_creation.groupby(df_creation['date'].dt.month)\
        .sum('n_accounts').reset_index()['n_accounts']] 

fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

In [107]:
fig = px.bar(df_creation.groupby(df_creation['date'].dt.day).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each day",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation in each day',
        tickmode = 'linear',
    )
)

clrs = ['red' if (y > 3000) else '#5296dd' for y in df_creation.groupby(df_creation['date'].dt.day)\
        .sum('n_accounts').reset_index()['n_accounts']] 

fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

<a id='2018'></a>
>>### Investigate Each year in more details

<ul>
<li><a href="#all">All Years</a></li>
<li><a href="#2018"><b><mark>2018</mark></b></a></li>
<li><a href="#2019">2019</a></li>
<li><a href="#2020">2020</a></li>
<li><a href="#2021">2021</a></li>
</ul>



### user accounts created in each month of 2018

In [108]:
# filter on 2018 
df_creation_18 = df_creation[(df_creation['date'].dt.year == 2018)][['date', 'n_accounts']]


fig = px.bar(df_creation_18.groupby(df_creation_18['date'].dt.month).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each month of 2018",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Month of 2018',
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        ticktext = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    )
)

clrs = ['red' if (y > 1000) else '#5296dd' for y in df_creation_18.groupby(df_creation_18['date'].dt.month)\
        .sum('n_accounts').reset_index()['n_accounts']] 

fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### user accounts created in each day of Dec, 2018

In [109]:
# filter on Dec, 2018
df_dec_18 = df_creation_18[(df_creation_18['date'].dt.month == 12)][['date', 'n_accounts']]


fig = px.bar(df_dec_18,
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each day of Dec, 2018",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation on Dec, 2018',
        tickmode = 'linear',

    )
)

clrs = ['red' if (y > 44) else '#5296dd' for y in df_dec_18['n_accounts']]
        
fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### Dates of peaks in 2018

In [110]:
# filter on dates of peaks in 2018

df_creation_peak_18 = df_creation_18.sort_values('n_accounts', ascending=False).head(3)

fig = px.bar(df_creation_peak_18,
             x='date', 
             y='n_accounts', text = 'n_accounts', title='Dates with the highest accounts creation in 2018')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Date',
        tickmode = 'array',
        tickvals = df_creation_peak_18.date,
    )
)

clrs = ['red' if (y > 40) else '#5296dd' for y in df_creation_peak_18['n_accounts']]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')

# , marker_line_color='#5296dd'

fig.show()




<a id='2019'></a>

<ul>
<li><a href="#all">All Years</a></li>
<li><a href="#2018">2018</a></li>
<li><a href="#2019"><b><mark>2019</mark></b></a></li>
<li><a href="#2020">2020</a></li>
<li><a href="#2021">2021</a></li>
</ul>


### user accounts created in each month of 2019

In [111]:
# filter on 2019
df_creation_19 = df_creation[(df_creation['date'].dt.year == 2019)][['date', 'n_accounts']]


fig = px.bar(df_creation_19.groupby(df_creation_19['date'].dt.month).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each month of 2019",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Months of 2019',
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        ticktext = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    )
)

clrs = ['red' if (y > 1200) else '#5296dd' for y in df_creation_19.groupby(df_creation_19['date'].dt.month)\
        .sum('n_accounts').reset_index()['n_accounts']] 

fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### user accounts created in each day of Jan, 2019

In [112]:
# filter on Jan, 2019 (despite its name, df_july_19 holds January data)
df_july_19 = df_creation_19[(df_creation_19['date'].dt.month == 1)][['date', 'n_accounts']]


fig = px.bar(df_july_19,
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each day of Jan, 2019",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation on Jan, 2019',
        tickmode = 'linear',

    )
)

clrs = ['red' if (y > 50) else '#5296dd' for y in df_july_19['n_accounts']]
        
fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### Dates of peaks in 2019

In [113]:
# filter on dates of peaks in 2019

df_creation_peak_19 = df_creation_19.sort_values('n_accounts', ascending=False).head(3)

fig = px.bar(df_creation_peak_19,
             x='date', 
             y='n_accounts', text = 'n_accounts', title='Dates with the highest accounts creation in 2019')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Date',
        tickmode = 'array',
        tickvals = df_creation_peak_19.date,
    )
)

clrs = ['red' if (y > 50) else '#5296dd' for y in df_creation_peak_19['n_accounts']]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')

# , marker_line_color='#5296dd'

fig.show()



<a id='2020'></a>

<ul>
<li><a href="#all">All Years</a></li>
<li><a href="#2018">2018</a></li>
<li><a href="#2019">2019</a></li>
<li><a href="#2020"><b><mark>2020</mark></b></a></li>
<li><a href="#2021">2021</a></li>
</ul>


### user accounts created in each month of 2020

In [114]:
# filter on 2020
df_creation_20 = df_creation[(df_creation['date'].dt.year == 2020)][['date', 'n_accounts']]


fig = px.bar(df_creation_20.groupby(df_creation_20['date'].dt.month).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each month of 2020",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Months of 2020',
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        ticktext = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    )
)

clrs = ['red' if (y > 1000) else '#5296dd' for y in df_creation_20.groupby(df_creation_20['date'].dt.month)\
        .sum('n_accounts').reset_index()['n_accounts']] 

fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

Note the peaks in Jan, 2020

### user accounts created in each day of Jan, 2020

In [115]:
# filter on Jan, 2020
df_jan_20 = df_creation_20[(df_creation_20['date'].dt.month == 1)][['date', 'n_accounts']]


fig = px.bar(df_jan_20,
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each day of Jan, 2020",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation on Jan, 2020',
        tickmode = 'linear',

    )
)

clrs = ['red' if (y > 45) else '#5296dd' for y in df_jan_20['n_accounts']]
        
fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### Dates of peaks in 2020

In [116]:
# filter on dates of peaks in 2020

df_creation_peak_20 = df_creation_20.sort_values('n_accounts', ascending=False).head(3)

fig = px.bar(df_creation_peak_20,
             x='date', 
             y='n_accounts', text = 'n_accounts', title='Dates with the highest accounts creation in 2020')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Date',
        tickmode = 'array',
        tickvals = df_creation_peak_20.date,
    )
)

clrs = ['red' if (y > 50) else '#5296dd' for y in df_creation_peak_20['n_accounts']]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')


fig.show()


Note that the peak dates are all at the beginning of each month!
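
One way to check a claim like this is to list the day of the month for the top dates. A minimal sketch on made-up data; `top_days` is a hypothetical helper name, and the "first five days" cutoff is an arbitrary assumption for what counts as the beginning of a month.

```python
import pandas as pd

def top_days(df, n=3):
    """Return the n dates with the most accounts, plus their day of month."""
    top = df.nlargest(n, "n_accounts").copy()
    top["day_of_month"] = top["date"].dt.day
    return top.sort_values("date").reset_index(drop=True)

# Toy stand-in for one year of daily creation data (made-up values).
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-02", "2020-01-20", "2020-02-04", "2020-03-01", "2020-03-25"]
    ),
    "n_accounts": [50, 10, 65, 48, 7],
})

peaks = top_days(toy)
print(peaks)

# True when every peak date falls within the first five days of its month.
print((peaks["day_of_month"] <= 5).all())
```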

<a id='2021'></a>

<ul>
<li><a href="#all">All Years</a></li>    
<li><a href="#2018">2018</a></li>
<li><a href="#2019">2019</a></li>
<li><a href="#2020">2020</a></li>
<li><a href="#2021"><b><mark>2021</mark></b></a></li>
</ul>


### user accounts created in each month of 2021

In [117]:
# filter on 2021 
df_creation_21 = df_creation[(df_creation['date'].dt.year == 2021)][['date', 'n_accounts']]


fig = px.bar(df_creation_21.groupby(df_creation_21['date'].dt.month).sum('n_accounts').reset_index(),
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each month of 2021",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Months of 2021',
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        ticktext = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### user accounts created in each day of Jan, 2021

In [118]:
# filter on Jan, 2021
df_jan_21 = df_creation_21[(df_creation_21['date'].dt.month == 1)][['date', 'n_accounts']]


fig = px.bar(df_jan_21,
             x='date', y='n_accounts', text='n_accounts')

fig.update_layout(
            title={
        'text': "Estimation of the number of user accounts created in each day of Jan, 2021",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation on Jan, 2021',
        tickmode = 'linear',

    )
)

clrs = ['red' if (y > 400) else '#5296dd' for y in df_jan_21['n_accounts']]
        
fig.update_traces(marker_color=clrs,
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### Dates of peaks in 2021

In [119]:
# filter on dates of peaks in 2021

df_creation_peak_21 = df_creation_21.sort_values('n_accounts', ascending=False).head(3)

fig = px.bar(df_creation_peak_21,
             x='date', 
             y='n_accounts', text = 'n_accounts', title='Dates with the highest accounts creation in 2021')

fig.update_layout(
    xaxis = dict(
        title='Users Creation Date',
        tickmode = 'array',
        tickvals = df_creation_peak_21.date,
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=2, opacity=1, textposition='auto')

# , marker_line_color='#5296dd'

fig.show()



<ul>
<li><a href="#explore_daily">Daily Creation Data</a></li>  
<li><a href="#explore_users"><b><mark>User Characteristics</mark></b></a></li>
<li><a href="#peak_contributions">Contributions of the accounts created on peak days</a></li>
</ul>

<a id='explore_users'></a>
>### User Characteristics

<a id='count'></a>
>>### Characteristics Count

<ul>
<li><a href="#count"><b><mark>Characteristics Count</mark></b></a></li>
<li><a href="#over_years">User Characteristics over years</a></li>
<li><a href="#check_hours">The hour in which user accounts were created</a></li>
<li><a href="#link_comment_karma">Users Link and Comment Karma</a></li>
</ul>

### Banned / Unverified / Others in all years

In [120]:
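# name the columns explicitly so the frame looks the same across pandas versions
df_banned_unverified = df_users.banned_unverified.value_counts().rename_axis('index').reset_index(name='banned_unverified')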

fig = px.pie(df_banned_unverified,
             values='banned_unverified', names='index', color_discrete_sequence = colors,
             title = 'banned / unverified /others in all years')

fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

In [121]:
fig = px.histogram(df_users, x='banned_unverified', color="banned_unverified", 
                color_discrete_sequence = colors, 
                   title = 'banned / unverified /others in all years')
fig.show()

### Banned Users

In [122]:
px.bar(data_frame=df_users['is_banned'].value_counts().rename_axis('index').reset_index(name='is_banned'),
       x="index", y="is_banned").update_layout(title='Is Banned?',
                   xaxis_title='True or False',
                   yaxis_title='number of users').update_traces(marker_color='#5296dd')

There are 5K banned users; we will further investigate their contributions in another notebook.

### Verified Mail

In [123]:
px.bar(data_frame=df_users['has_verified_email'].value_counts().rename_axis('index').reset_index(name='has_verified_email'),
       x="index", y="has_verified_email").update_layout(title='Does the user have a verified email?',
                   xaxis_title='True or False',
                   yaxis_title='number of users').update_traces(marker_color='#5296dd')

There are 11.4K users with no verified email; we will further investigate their contributions in another notebook.

##### A verified email is not a reliable indicator of whether an account is spam, since more expensive bots have verified emails.

### Is Gold?
Gold is a way to show appreciation for an exceptional contribution to Reddit.

In [124]:
px.bar(data_frame=df_users['is_gold'].value_counts().rename_axis('index').reset_index(name='is_gold'),
       x="index", y="is_gold").update_layout(title='A Gold user?',
                   xaxis_title='True or False',
                   yaxis_title='number of users').update_traces(marker_color='#5296dd')

### Is Mod? 
Mod -> Moderator

A moderator, or a mod for short, are redditors who volunteer their time to help guide and create Reddit's many communities. Each Reddit community has its own focus, look, and rules, including what posts are on-topic there and how users are expected to behave. ... Add other redditors as moderators.
https://www.reddit.com/r/help/comments/1f0zni/what_are_mods_for_and_what_are_their/

In [125]:
px.bar(data_frame=df_users['is_mod'].value_counts().rename_axis('index').reset_index(name='is_mod'),
       x="index", y="is_mod").update_layout(title='Is this user a moderator?',
                   xaxis_title='True or False',
                   yaxis_title='number of users').update_traces(marker_color='#5296dd')

<a id='over_years'></a>
>>### User Characteristics over years

<ul>
<li><a href="#count">Characteristics Count</a></li>
<li><a href="#over_years"><b><mark>User Characteristics over years</mark></b></a></li>
<li><a href="#check_hours">The hour in which user accounts were created</a></li>
<li><a href="#link_comment_karma">Users Link and Comment Karma</a></li>
</ul>


In [126]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70573 entries, 0 to 70572
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   user_name           70573 non-null  object        
 1   has_verified_email  70573 non-null  bool          
 2   is_mod              70573 non-null  bool          
 3   is_gold             70573 non-null  bool          
 4   is_banned           70573 non-null  bool          
 5   comment_karma       65548 non-null  float64       
 6   link_karma          65548 non-null  float64       
 7   user_created_at     65548 non-null  datetime64[ns]
 8   banned_unverified   70573 non-null  object        
 9   creation_year       70573 non-null  object        
dtypes: bool(4), datetime64[ns](1), float64(2), object(3)
memory usage: 3.5+ MB


In [127]:
df_users_chs = df_users.groupby(df_users['user_created_at'].dt.year).sum().reset_index()
df_users_chs.user_created_at = df_users_chs.user_created_at.astype(int)
df_users_chs;

No banned accounts are shown since we don't have their creation years.
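
This follows from how pandas groups rows: a row whose group key is missing is dropped by default, so accounts with no `user_created_at` never reach the yearly totals. A small sketch on made-up data, assuming pandas 1.1 or later for the `dropna` argument:

```python
import pandas as pd

# Toy stand-in for df_users: the banned rows have no creation date.
toy = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "is_banned": [False, False, True, True],
    "user_created_at": pd.to_datetime(["2019-01-13", "2020-02-04", None, None]),
})

year = toy["user_created_at"].dt.year  # NaN for rows without a date

# Default behaviour: rows with a missing key are excluded from every group.
default_counts = toy.groupby(year).size()

# dropna=False keeps them under a NaN key instead.
all_counts = toy.groupby(year, dropna=False).size()

print(default_counts.sum(), all_counts.sum())
```

With the default, only the two dated rows are counted; with `dropna=False`, all four rows appear.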

In [128]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=df_users_chs.user_created_at, y=df_users_chs.has_verified_email, name='has_verified_email',
                         line=dict(color='green', width=1.5)))
fig.add_trace(go.Scatter(x=df_users_chs.user_created_at, y=df_users_chs.is_mod, name='is_mod',
                         line=dict(color='royalblue', width=1.5)))
fig.add_trace(go.Scatter(x=df_users_chs.user_created_at, y=df_users_chs.is_gold, name='is_gold',
                         line=dict(color='orange', width=1.5)))


fig.update_layout(title='User Characteristics count',
                   xaxis_title='Year',
                   yaxis_title='Number of users')

fig.show()

https://stackoverflow.com/questions/36285155/pandas-get-dummies

In [129]:
df_total = df_users.groupby(df_users['user_created_at'].dt.year).size().reset_index(name='total_naccounts')
df_total.user_created_at = df_total.user_created_at.astype(int)

df_users_chs2 = pd.merge(df_users_chs, df_total, on='user_created_at')

df_users_chs2['unverified'] = df_users_chs2.total_naccounts - df_users_chs2.has_verified_email
df_users_chs2;

In [130]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=df_users_chs2.user_created_at, y=df_users_chs2.unverified, name='Unverified',
                         line=dict(color='red', width=1.5)))
fig.add_trace(go.Scatter(x=df_users_chs2.user_created_at, y=df_users_chs2.total_naccounts, name='Total Accounts',
                         line=dict(color='royalblue', width=1.5)))
# fig.add_trace(go.Scatter(x=df_users_chs.user_created_at, y=df_users_chs.is_gold, name='is_gold',
#                          line=dict(color='orange', width=1.5)))


fig.update_layout(title='Unverified VS Total Accounts in each year',
                   xaxis_title='Year',
                   yaxis_title='Number of users')

fig.show()

<a id='check_hours'></a>
>>### Check for the hour in which user accounts were created

<ul>
<li><a href="#count">Characteristics Count</a></li>
<li><a href="#over_years">User Characteristics over years</a></li>
<li><a href="#check_hours"><b><mark>The hour in which user accounts were created</mark></b></a></li>
<li><a href="#link_comment_karma">Users Link and Comment Karma</a></li>
</ul>


In [131]:
df_users_hours = df_users.groupby(df_users['user_created_at'].dt.hour).size().reset_index(name='n_accounts')
# df_users_hours.sort_values('n_accounts', ascending=False);

fig = px.bar(df_users_hours,
             x='user_created_at', y='n_accounts')

fig.update_layout(
            title={
        'text': "The number of user accounts created in each hour of the day",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Hour',
        tickmode = 'linear',
        dtick = 1
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

### Check for the hour in which user accounts were created on peak days

Note that we only have limited user data for each date.

In [132]:
# Filter the users data on peak days
# Dec 30,2018
mask = (df_users.user_created_at.dt.date.astype('str') == '2018-12-30')
df_peak_18 = df_users[mask]

df_hours_18 = df_peak_18.groupby(df_peak_18['user_created_at'].dt.hour).size().reset_index(name='n_accounts')
# df_users_hours.sort_values('n_accounts', ascending=False);

fig = px.bar(df_hours_18,
             x='user_created_at', y='n_accounts')

fig.update_layout(
            title={
        'text': "The number of user accounts created in each hour of the day on 2018-12-30",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Hour',
        tickmode = 'linear',
        dtick = 1
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

In [133]:
# Jan 13,2019
mask = (df_users.user_created_at.dt.date.astype('str') == '2019-01-13')
df_peak_19 = df_users[mask]

df_hours_19 = df_peak_19.groupby(df_peak_19['user_created_at'].dt.hour).size().reset_index(name='n_accounts')
# df_users_hours.sort_values('n_accounts', ascending=False);

fig = px.bar(df_hours_19,
             x='user_created_at', y='n_accounts')

fig.update_layout(
            title={
        'text': "The number of user accounts created in each hour of the day on 2019-01-13",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Hour',
        tickmode = 'linear',
        dtick = 1
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

In [134]:
# Feb 4,2020
mask = (df_users.user_created_at.dt.date.astype('str') == '2020-02-04')
df_peak_20 = df_users[mask]

df_hours_20 = df_peak_20.groupby(df_peak_20['user_created_at'].dt.hour).size().reset_index(name='n_accounts')
# df_users_hours.sort_values('n_accounts', ascending=False);

fig = px.bar(df_hours_20,
             x='user_created_at', y='n_accounts')

fig.update_layout(
            title={
        'text': "The number of user accounts created in each hour of the day on 2020-02-04",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Hour',
        tickmode = 'linear',
        dtick = 1
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()

In [135]:
# Feb 4, 2021
mask = (df_users.user_created_at.dt.date.astype('str') == '2021-02-04')
df_peak_21 = df_users[mask]

df_hours_21 = df_peak_21.groupby(df_peak_21['user_created_at'].dt.hour).size().reset_index(name='n_accounts')
# df_users_hours.sort_values('n_accounts', ascending=False);

fig = px.bar(df_hours_21,
             x='user_created_at', y='n_accounts')

fig.update_layout(
            title={
        'text': "The number of user accounts created in each hour of the day on 2021-02-04",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'
        })

fig.update_layout(
    xaxis = dict(
        title='Users Creation Hour',
        tickmode = 'linear',
        dtick = 1
    )
)

fig.update_traces(marker_color='#5296dd',
                  marker_line_width=1.5, opacity=1, textposition='auto')
fig.show()
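The four cells above differ only in the date they filter on. A minimal sketch of how the hourly count could be factored into one helper is shown here; the function name `accounts_per_hour` and the toy data are assumptions for illustration, not part of the original notebook. Unlike the cells above, it also fills hours without any sign-ups with 0, so every day ends up with 24 rows.

```python
import pandas as pd

def accounts_per_hour(users: pd.DataFrame, day: str) -> pd.DataFrame:
    """Count accounts created in each hour (0-23) of one calendar day."""
    created_day = users["user_created_at"].dt.normalize()
    on_day = users[created_day == pd.Timestamp(day)]
    counts = on_day.groupby(on_day["user_created_at"].dt.hour).size()
    # Reindex so hours without any sign-ups appear as 0 instead of vanishing.
    counts = counts.reindex(range(24), fill_value=0)
    return counts.rename_axis("hour").reset_index(name="n_accounts")

# Toy data standing in for df_users (hypothetical values).
toy_users = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "user_created_at": pd.to_datetime([
        "2018-12-30 01:26:28", "2018-12-30 01:59:00",
        "2018-12-30 16:34:11", "2019-01-13 08:00:00",
    ]),
})

hours_18 = accounts_per_hour(toy_users, "2018-12-30")
print(hours_18[hours_18.n_accounts > 0])
```

The returned frame can be passed to `px.bar(..., x='hour', y='n_accounts')` in the same way as `df_hours_18` above.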

<a id='link_comment_karma'></a>
>>### Users Link and Comment Karma

<ul>
<li><a href="#count">Characteristics Count</a></li>
<li><a href="#over_years">User Characteristics over years</a></li>
<li><a href="#check_hours">The hour in which user accounts were created</a></li>
<li><a href="#link_comment_karma"><b><mark>Users Link and Comment Karma</mark></b></a></li>
</ul>

<a id='largest_link_karma'></a>
### Accounts With The Largest Link Karma

<ul>
<li><a href="#largest_link_karma"><b><mark>Accounts With The Largest Link Karma</mark></b></a></li>   
<li><a href="#largest_comment_karma">Accounts With The Largest Comment Karma</a></li>
    <br>
<li><a href="#minimum_link_karma">Accounts With The Minimum Link Karma</a></li>
<li><a href="#minimum_comment_karma">Accounts With The Minimum Comment Karma</a></li>
</ul>

In [136]:
# Filter on largest link karma

df_link_high = df_users.sort_values('link_karma', ascending=False).head(10)

fig = px.bar(df_link_high,
             x='user_name', 
             y=df_link_high.link_karma, text = df_link_high.link_karma, title='Accounts with highest link karma')

fig.update_layout(
    xaxis = dict(
        title='user name',
        tickmode = 'array',
        tickvals = df_link_high.user_name,
    )
)

clrs = ['red' if (y > 7000000) else '#5296dd' for y in df_link_high.link_karma]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')

fig.show()

In [137]:
df_users[df_users.user_name == 'BunyipPouch']

Unnamed: 0,user_name,has_verified_email,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year
8253,BunyipPouch,True,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others


BunyipPouch is a gold user and a moderator.

In [138]:
df_users[df_users.user_name == 'ExpertAccident']

Unnamed: 0,user_name,has_verified_email,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year
42596,ExpertAccident,True,True,True,False,480814.0,10245202.0,2018-12-08 02:07:22,others,2018


ExpertAccident is also a gold user and a moderator.

### Check BunyipPouch contributions in each year <br>
Account Created on: 2012-12-30 <br>
<font color='green'> positive comments </font>

In [139]:
# In 2021 
df_BunyipPouch_21 = df_21[df_21.user_name == 'BunyipPouch']
print(df_BunyipPouch_21.shape)
df_BunyipPouch_21.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [140]:
# In 2020 
df_BunyipPouch_20 = df_20[df_20.user_name == 'BunyipPouch']
print(df_BunyipPouch_20.shape)
df_BunyipPouch_20.head()

(6, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
22168,t1_fg5ba0t,/r/movies/comments/ewz3p3/amber_heard_admits_t...,perezhilton.com should be a banned domain. tab...,t3_ewz3p3,r/movies,2020-02-01 02:09:38,Neutral,Negative,1.0,submission,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2588 days 04:23:01,2588.0
22169,t1_fgc9wd0,/r/movies/comments/ext54y/justiceforjohnnydepp...,imagine if all celebrity relationship drama/ru...,t1_fgc9745,r/movies,2020-02-02 19:11:10,Negative,Neutral,-2.0,comment,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2589 days 21:24:33,2589.0
22170,t1_fgkf0pv,/r/movies/comments/ez0m4r/audio_of_amber_heard...,so you gonna keep posting this every 10 minute...,t3_ez0m4r,r/movies,2020-02-05 00:53:44,Neutral,Neutral,10.0,submission,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2592 days 03:07:07,2592.0
22171,t1_fguruuz,/r/movies/comments/f0kyo1/its_insane_that_just...,"like, what do you want people to be doing? rio...",t3_f0kyo1,r/movies,2020-02-08 02:13:09,Negative,Neutral,47.0,submission,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2595 days 04:26:32,2595.0
22172,t1_fm1c92n,/r/movies/comments/fsh4kw/amber_heard_to_be_sa...,"ah, yes, the very reputable source indulgexpre...",t3_fsh4kw,r/movies,2020-03-31 17:30:06,Positive,Positive,3.0,submission,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2647 days 19:43:29,2647.0


In [141]:
# df_BunyipPouch_19.permalink[8859]

In [142]:
# In 2019
df_BunyipPouch_19 = df_19[df_19.user_name == 'BunyipPouch']
print(df_BunyipPouch_19.shape)
df_BunyipPouch_19.head()

(8, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
8857,t1_eho88bu,/r/movies/comments/awozzj/johnny_depp_suing_am...,"> Johnny Depp isnt hurting for money\n\nuhhh, ...",t1_eho841h,r/movies,2019-03-03 02:43:05,Positive,Positive,36.0,comment,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2253 days 04:56:28,2253.0
8858,t1_eho9zyn,/r/movies/comments/awozzj/johnny_depp_suing_am...,"> He's got a tattoo that says ""Wino Forever"".\...",t1_eho9s62,r/movies,2019-03-03 03:07:25,Positive,Positive,5.0,comment,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2253 days 05:20:48,2253.0
8859,t1_eikm3gp,/r/movies/comments/b1bdis/johnny_depp_was_abus...,"> geo.tv\n\nlol, what a shitpost.",t3_b1bdis,r/movies,2019-03-15 06:52:50,Positive,Positive,-34.0,submission,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2265 days 09:06:13,2265.0
8860,t1_eil9kyu,/r/movies/comments/b1bdis/johnny_depp_was_abus...,Imagine thinking that this is a quality post.\...,t1_eikn3h1,r/movies,2019-03-15 14:39:56,Positive,Neutral,-1.0,comment,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2265 days 16:53:19,2265.0
8861,t1_eld3bcs,/r/boxoffice/comments/bfeu79/other_johnny_depp...,Removed. Wrong sub. Not sure there is a right ...,t3_bfeu79,r/boxoffice,2019-04-20 17:47:58,Positive,Negative,1.0,submission,...,True,True,False,3566321.0,14864962.0,2012-12-30 21:46:37,others,others,2301 days 20:01:21,2301.0


In [143]:
# In 2018
df_BunyipPouch_18 = df_18[df_18.user_name == 'BunyipPouch']
print(df_BunyipPouch_18.shape)
df_BunyipPouch_18.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
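The per-year lookups above repeat the same filter on `df_18` through `df_21`. One possible way to collect a user's rows from all years at once is sketched below; `contributions_by_year`, the `data_year` column, and the toy frames are hypothetical names introduced only for this example.

```python
import pandas as pd

def contributions_by_year(frames: dict, user_name: str) -> pd.DataFrame:
    """Return one user's rows from every yearly frame, tagged with the year."""
    # Passing a dict to concat uses its keys as an outer index level.
    stacked = pd.concat(frames, names=["data_year"])
    stacked = stacked.reset_index(level="data_year")
    return stacked[stacked["user_name"] == user_name]

# Toy frames standing in for df_19, df_20 and df_21 (hypothetical values).
toy_frames = {
    2019: pd.DataFrame({"user_name": ["x", "y"], "score": [36.0, 5.0]}),
    2020: pd.DataFrame({"user_name": ["x", "x", "z"], "score": [1.0, -2.0, 10.0]}),
    2021: pd.DataFrame({"user_name": ["z"], "score": [3.0]}),
}

rows_x = contributions_by_year(toy_frames, "x")
print(rows_x.groupby("data_year").size())
```

With the real data the call would look like `contributions_by_year({2018: df_18, 2019: df_19, 2020: df_20, 2021: df_21}, 'BunyipPouch')`; years in which the user has no rows are simply absent from the result.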


### Check ExpertAccident contributions in each year <br>
Account Created on: 2018-12-08 <br>
<font color='red'> negative submission (2020-11-10) </font> <br>
<font color='orange'> gold user </font>

https://www.reddit.com/r/redditmoment/comments/jrr5po/well_guys_we_epic_redditors_singlehandedly_took/

In [144]:
# In 2021 
df_ExpertAccident_21 = df_21[df_21.user_name == 'ExpertAccident']
print(df_ExpertAccident_21.shape)
df_ExpertAccident_21.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [145]:
# df_ExpertAccident_20.permalink[106204]

In [146]:
# In 2020 
df_ExpertAccident_20 = df_20[df_20.user_name == 'ExpertAccident']
print(df_ExpertAccident_20.shape)
df_ExpertAccident_20.head()

(3, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
106202,t1_gbhgrjd,/r/memes/comments/jpwz5n/im_looking_at_you_amb...,"Any person, regardless of gender, should go to...",t3_jpwz5n,r/memes,2020-11-07 19:57:21,Negative,Negative,69.0,submission,...,True,True,False,480814.0,10245202.0,2018-12-08 02:07:22,others,2018,700 days 17:49:59,700.0
106203,t1_gbitdi9,/r/memes/comments/jpwz5n/im_looking_at_you_amb...,"TRUE, they have such small sentences sometimes...",t1_gbir7pd,r/memes,2020-11-08 00:27:55,Negative,Neutral,4.0,comment,...,True,True,False,480814.0,10245202.0,2018-12-08 02:07:22,others,2018,700 days 22:20:33,700.0
106204,t3_jrr5po,/r/redditmoment/comments/jrr5po/well_guys_we_e...,"Well guys, we, epic Redditors, singlehandedly ...",,r/redditmoment,2020-11-10 19:03:43,Negative,Positive,3782.0,,...,True,True,False,480814.0,10245202.0,2018-12-08 02:07:22,others,2018,703 days 16:56:21,703.0


In [147]:
# In 2019
df_ExpertAccident_19 = df_19[df_19.user_name == 'ExpertAccident']
print(df_ExpertAccident_19.shape)
df_ExpertAccident_19.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [148]:
# In 2018
df_ExpertAccident_18 = df_18[df_18.user_name == 'ExpertAccident']
print(df_ExpertAccident_18.shape)
df_ExpertAccident_18.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


<a id='largest_comment_karma'></a>
### Accounts With The Largest Comment Karma

<ul>
<li><a href="#largest_link_karma">Accounts With The Largest Link Karma</a></li>   
<li><a href="#largest_comment_karma"><b><mark>Accounts With The Largest Comment Karma</mark></b></a></li>
    <br>
<li><a href="#minimum_link_karma">Accounts With The Minimum Link Karma</a></li>
<li><a href="#minimum_comment_karma">Accounts With The Minimum Comment Karma</a></li>
</ul>

In [149]:
# Filter on largest comment karma

df_comment_high = df_users.sort_values('comment_karma', ascending=False).head(10)

fig = px.bar(df_comment_high,
             x='user_name', 
             y=df_comment_high.comment_karma, text = df_comment_high.comment_karma, title='Accounts with highest comment karma')

fig.update_layout(
    xaxis = dict(
        title='user name',
        tickmode = 'array',
        tickvals = df_comment_high.user_name,
    )
)

clrs = ['red' if (y > 8000000) else '#5296dd' for y in df_comment_high.comment_karma]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')

fig.show()

In [150]:
df_users[df_users.user_name == 'TooShiftyForYou']

Unnamed: 0,user_name,has_verified_email,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year
19113,TooShiftyForYou,True,True,True,False,23424765.0,2405575.0,2015-07-26 10:44:20,others,others


TooShiftyForYou is a gold user and a moderator.

### Check TooShiftyForYou contributions in each year <br>
Account Created on: 2015-07-26 <br>
<font color='red'> negative comments (May and July, 2020)</font> <br>
<font color='orange'> gold user </font>

https://www.reddit.com/r/movies/comments/gsu5su/warner_bros_fires_amber_heard_from_aquaman_2/fs7glje/

In [151]:
# In 2021 
df_TooShiftyForYou_21 = df_21[df_21.user_name == 'TooShiftyForYou']
print(df_TooShiftyForYou_21.shape)
df_TooShiftyForYou_21.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [152]:
# df_TooShiftyForYou_20.permalink[85870]

In [153]:
# In 2020 
df_TooShiftyForYou_20 = df_20[df_20.user_name == 'TooShiftyForYou']
print(df_TooShiftyForYou_20.shape)
df_TooShiftyForYou_20.head()

(2, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
85870,t1_fs7glje,/r/movies/comments/gsu5su/warner_bros_fires_am...,*Reports have it that the 34-year old Heard co...,t3_gsu5su,r/movies,2020-05-29 15:12:27,Negative,Negative,139.0,submission,...,True,True,False,23424765.0,2405575.0,2015-07-26 10:44:20,others,others,1769 days 04:28:07,1769.0
85871,t1_fx7cytn,/r/movies/comments/hmtfxc/johnny_depp_not_a_wi...,Amber Heard filed for divorce from Johnny Depp...,t3_hmtfxc,r/movies,2020-07-07 13:43:55,Positive,Neutral,1.0,submission,...,True,True,False,23424765.0,2405575.0,2015-07-26 10:44:20,others,others,1808 days 02:59:35,1808.0


In [154]:
# In 2019
df_TooShiftyForYou_19 = df_19[df_19.user_name == 'TooShiftyForYou']
print(df_TooShiftyForYou_19.shape)
df_TooShiftyForYou_19.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [155]:
# In 2018
df_TooShiftyForYou_18 = df_18[df_18.user_name == 'TooShiftyForYou']
print(df_TooShiftyForYou_18.shape)
df_TooShiftyForYou_18.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


<a id='minimum_link_karma'></a>
### Accounts With The Minimum Link Karma

<ul>
<li><a href="#largest_link_karma">Accounts With The Largest Link Karma</a></li>   
<li><a href="#largest_comment_karma">Accounts With The Largest Comment Karma</a></li>
    <br>
<li><a href="#minimum_link_karma"><b><mark>Accounts With The Minimum Link Karma</mark></b></a></li>
<li><a href="#minimum_comment_karma">Accounts With The Minimum Comment Karma</a></li>
</ul>

In [156]:
# Filter on minimum link karma

df_link_low = df_users.sort_values('link_karma').head(20)

fig = px.bar(df_link_low,
             x='user_name', 
             y=df_link_low.link_karma, 
             text = df_link_low.link_karma)

fig.update_layout(title_text='Accounts with minimum link karma', title_x=0.5)

fig.update_layout(
    xaxis = dict(
        title='user name',
        tickmode = 'array',
        tickvals = df_link_low.user_name,
    )
)

clrs = ['red' if (y < -60) else '#5296dd' for y in df_link_low.link_karma]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')


fig.show()

<a id='minimum_comment_karma'></a>
### Accounts With The Minimum Comment Karma

<ul>
<li><a href="#largest_link_karma">Accounts With The Largest Link Karma</a></li>   
<li><a href="#largest_comment_karma">Accounts With The Largest Comment Karma</a></li>
    <br>
<li><a href="#minimum_link_karma">Accounts With The Minimum Link Karma</a></li>
<li><a href="#minimum_comment_karma"><b><mark>Accounts With The Minimum Comment Karma</mark></b></a></li>
</ul>

In [157]:
# Filter on minimum comment karma

df_comment_low = df_users.sort_values('comment_karma').head(20)

fig = px.bar(df_comment_low,
             x='user_name', 
             y=df_comment_low.comment_karma, 
             text = df_comment_low.comment_karma)

fig.update_layout(title_text='Accounts with minimum comment karma', title_x=0.5, title_y=0.1)

fig.update_layout(
    xaxis = dict(
        side='top',
        title='user name',
        tickmode = 'array',
        tickvals = df_comment_low.user_name,
    )
)

clrs = ['red' if (y < -60) else '#5296dd' for y in df_comment_low.comment_karma]

fig.update_traces(marker_color=clrs,
                  marker_line_width=2, opacity=1, textposition='auto')


fig.show()
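`sort_values(...).head(20)` keeps exactly 20 rows, so if many accounts share the same lowest value the cut-off among them is arbitrary. Both accounts inspected below sit at exactly -100, which suggests such ties may exist. A hedged sketch of a tie-aware alternative using `DataFrame.nsmallest` with `keep="all"` follows; the toy data is made up for illustration.

```python
import pandas as pd

# Toy data standing in for df_users (hypothetical values).
toy_users = pd.DataFrame({
    "user_name": ["u1", "u2", "u3", "u4", "u5"],
    "comment_karma": [-100.0, 250.0, -100.0, -42.0, -100.0],
})

# keep="all" returns every row tied with the n-th smallest value,
# so the result can contain more than n rows.
lowest = toy_users.nsmallest(2, "comment_karma", keep="all")
print(lowest)

# Same highlight rule as the bar chart above: red below -60.
colors = ["red" if k < -60 else "#5296dd" for k in lowest.comment_karma]
```

Whether this matters for the real `df_users` depends on how many accounts share the minimum value, which the notebook output does not show.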

In [158]:
df_users[df_users.user_name == '1276810520']

Unnamed: 0,user_name,has_verified_email,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year
50584,1276810520,True,False,False,False,-100.0,17.0,2019-07-27 12:24:21,others,2019


In [159]:
df_users[df_users.user_name == 'CoolDownBot']

Unnamed: 0,user_name,has_verified_email,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year
56923,CoolDownBot,True,True,False,False,-100.0,182.0,2020-02-02 22:08:30,others,2020


### Check 1276810520 contributions in each year <br>
Account Created on: 2019-07-27

In [160]:
# In 2021 
df_num_21 = df_21[df_21.user_name == '1276810520']
print(df_num_21.shape)
df_num_21.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [161]:
# df_num_20.permalink[84961]

In [162]:
# In 2020 
df_num_20 = df_20[df_20.user_name == '1276810520']
print(df_num_20.shape)
df_num_20.head()

(3, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
84960,t1_fr5f3hw,/r/unpopularopinion/comments/gmk0zo/if_harvey_...,Weinstein did not deserve to go to prison. The...,t3_gmk0zo,r/unpopularopinion,2020-05-19 16:58:41,Positive,Negative,0.0,submission,...,False,False,False,-100.0,17.0,2019-07-27 12:24:21,others,2019,297 days 04:34:20,297.0
84961,t1_fr5o5f1,/r/unpopularopinion/comments/gmk0zo/if_harvey_...,I don’t agree. There is scant evidence of his ...,t1_fr5mwgz,r/unpopularopinion,2020-05-19 18:10:16,Positive,Neutral,0.0,comment,...,False,False,False,-100.0,17.0,2019-07-27 12:24:21,others,2019,297 days 05:45:55,297.0
84962,t1_fr74xq7,/r/unpopularopinion/comments/gmk0zo/if_harvey_...,"They were able to convince a jury, but with al...",t1_fr5r8ch,r/unpopularopinion,2020-05-20 01:54:13,Positive,Neutral,1.0,comment,...,False,False,False,-100.0,17.0,2019-07-27 12:24:21,others,2019,297 days 13:29:52,297.0


In [163]:
# In 2019
df_num_19 = df_19[df_19.user_name == '1276810520']
print(df_num_19.shape)
df_num_19.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [164]:
# In 2018
df_num_18 = df_18[df_18.user_name == '1276810520']
print(df_num_18.shape)
df_num_18.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


### Check CoolDownBot contributions in each year <br>
Account Created on: 2020-02-02 <br>
A bot that replies to comments containing the word "f*ck" three or more times with the following message: <br>

Hello. <br>
I noticed you dropped 3 f-bombs in this comment. This might be necessary, but using nicer language makes the whole world a better place.
Maybe you need to blow off some steam - in which case, go get a drink of water and come back later. This is just the internet and sometimes it can be helpful to cool down for a second.

In [165]:
# In 2021 
df_CoolDownBot_21 = df_21[df_21.user_name == 'CoolDownBot']
print(df_CoolDownBot_21.shape)
df_CoolDownBot_21.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [166]:
# df_CoolDownBot_20.permalink[67475]

In [167]:
# In 2020 
df_CoolDownBot_20 = df_20[df_20.user_name == 'CoolDownBot']
print(df_CoolDownBot_20.shape)
df_CoolDownBot_20.head()

(16, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
67475,t1_fharq9y,/r/MensRights/comments/f2596l/mainstream_media...,**Hello.**\n\nI noticed you dropped 3 f-bombs ...,t1_fharq11,r/MensRights,2020-02-11 12:39:57,Positive,Positive,-25.0,comment,...,True,False,False,-100.0,182.0,2020-02-02 22:08:30,others,2020,8 days 14:31:27,8.0
67476,t1_fhaur5x,/r/MensRights/comments/f2596l/mainstream_media...,**Hello.**\n\nI noticed you dropped 23 f-bombs...,t1_fhaur2h,r/MensRights,2020-02-11 13:26:19,Positive,Positive,-16.0,comment,...,True,False,False,-100.0,182.0,2020-02-02 22:08:30,others,2020,8 days 15:17:49,8.0
67477,t1_fna2foa,/r/JerkOffToCelebs/comments/g0is1q/which_celeb...,**Hello.**\n\nI noticed you dropped 3 f-bombs ...,t1_fna2elt,r/JerkOffToCelebs,2020-04-13 14:55:22,Positive,Positive,0.0,comment,...,True,False,False,-100.0,182.0,2020-02-02 22:08:30,others,2020,70 days 16:46:52,70.0
67478,t1_fna2g97,/r/JerkOffToCelebs/comments/g0is1q/which_celeb...,**Hello.**\n\nI noticed you dropped 11 f-bombs...,t1_fna2g3l,r/JerkOffToCelebs,2020-04-13 14:55:31,Positive,Positive,1.0,comment,...,True,False,False,-100.0,182.0,2020-02-02 22:08:30,others,2020,70 days 16:47:01,70.0
67479,t1_ghefwqw,/r/peachykeenocean/comments/kmbx33/do_you_thin...,**Hello.**\n\nI noticed you dropped 9 f-bombs ...,t1_ghefvxb,r/peachykeenocean,2020-12-29 15:52:29,Positive,Positive,1.0,comment,...,True,False,False,-100.0,182.0,2020-02-02 22:08:30,others,2020,330 days 17:43:59,330.0


In [168]:
# In 2019
df_CoolDownBot_19 = df_19[df_19.user_name == 'CoolDownBot']
print(df_CoolDownBot_19.shape)
df_CoolDownBot_19.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


In [169]:
# In 2018
df_CoolDownBot_18 = df_18[df_18.user_name == 'CoolDownBot']
print(df_CoolDownBot_18.shape)
df_CoolDownBot_18.head()

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation


<ul>
<li><a href="#explore_daily">Daily Creation Data</a></li>  
<li><a href="#explore_users">User Characteristics</a></li>
<li><a href="#peak_contributions"><b><mark>Contributions of the accounts created on peak days</mark></b></a></li>
</ul>

<a id='peak_contributions'></a>
>### Contributions of the accounts created on peak days
- 2021-02-04 --> 15 accounts
- 2020-02-04 --> 65 accounts
- 2019-01-13 --> 53 accounts
- 2018-12-30 --> 48 accounts

In [170]:
users_peak_18 = list(df_peak_18.user_name)
users_peak_19 = list(df_peak_19.user_name)
users_peak_20 = list(df_peak_20.user_name)
users_peak_21 = list(df_peak_21.user_name)

users_peak = users_peak_18 + users_peak_19 + users_peak_20 + users_peak_21
len(users_peak)

181
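The list of peak-day accounts can also be built in a single pass instead of concatenating four per-day lists. The sketch below assumes `user_created_at` is already a datetime column; `users_created_on` and `PEAK_DATES` are hypothetical names, and the toy rows are invented.

```python
import pandas as pd

PEAK_DATES = ["2018-12-30", "2019-01-13", "2020-02-04", "2021-02-04"]

def users_created_on(users: pd.DataFrame, dates) -> pd.DataFrame:
    """Keep accounts whose creation day is one of the given dates."""
    # normalize() drops the time of day so only the calendar date is compared.
    created_day = users["user_created_at"].dt.normalize()
    return users[created_day.isin(pd.to_datetime(dates))]

# Toy data standing in for df_users (hypothetical values).
toy_users = pd.DataFrame({
    "user_name": ["a", "b", "c", "d"],
    "user_created_at": pd.to_datetime([
        "2018-12-30 01:26:28", "2019-06-01 12:00:00",
        "2020-02-04 09:11:38", "2021-02-04 23:59:59",
    ]),
})

peak_users = users_created_on(toy_users, PEAK_DATES)
users_peak = peak_users["user_name"].tolist()
print(users_peak)
```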

### Check the contributions (in each year) of the accounts created on peak days.

In [171]:
# In 2021 
df_21_peak = df_21[df_21.user_name.isin(users_peak)]
print(df_21_peak.shape)
df_21_peak.head(3)

(84, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
2974,t1_ghtfd2j,/r/redditmoment/comments/ko1xfy/delete_tik_tok...,I remember when I came across a Scooby doo por...,t1_ghqe1s3,r/redditmoment,2021-01-02 10:41:12,Positive,Positive,2.0,comment,...,True,False,False,18044.0,36.0,2018-12-30 01:26:28,others,2018,734 days 09:14:44,734.0
2975,t1_ghtfoa5,/r/redditmoment/comments/ko1xfy/delete_tik_tok...,"I get what you're saying, but saying there are...",t1_ghpnq7e,r/redditmoment,2021-01-02 10:46:49,Positive,Negative,1.0,comment,...,True,False,False,18044.0,36.0,2018-12-30 01:26:28,others,2018,734 days 09:20:21,734.0
3232,t1_ghxocjp,/r/redditmoment/comments/ko1xfy/delete_tik_tok...,How the fuck does corona have to do with liter...,t1_ghrtpts,r/redditmoment,2021-01-03 10:51:34,Negative,Negative,2.0,comment,...,False,False,False,2437.0,5274.0,2020-02-04 09:11:38,others,2020,334 days 01:39:56,334.0


In [172]:
# In 2020 
df_20_peak = df_20[df_20.user_name.isin(users_peak)]
print(df_20_peak.shape)
df_20_peak.head(3)

(268, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
20236,t1_felccxh,/r/JerkOffToCelebs/comments/epqpa1/would_you_r...,I want the constant stimulation of Amber's lip...,t3_epqpa1,r/JerkOffToCelebs,2020-01-16 23:54:19,Neutral,Neutral,3.0,submission,...,False,False,False,2584.0,39.0,2018-12-30 09:20:29,others,2018,382 days 14:33:50,382.0
21564,t1_ffmcp94,/r/MensRights/comments/etxr1z/petition_to_remo...,Apparently there was cctv footage,t1_ffjyivg,r/MensRights,2020-01-26 15:27:37,Positive,Neutral,6.0,comment,...,True,False,False,17344.0,4616.0,2018-12-30 16:34:11,others,2018,391 days 22:53:26,391.0
21565,t1_fgeytvy,/r/MensRights/comments/exxkb4/damn_id_let_ambe...,People supporting Amber Heard,t1_fgea5ig,r/MensRights,2020-02-03 07:40:53,Positive,Positive,4.0,comment,...,True,False,False,17344.0,4616.0,2018-12-30 16:34:11,others,2018,399 days 15:06:42,399.0


In [173]:
# In 2019
df_19_peak = df_19[df_19.user_name.isin(users_peak)]
print(df_19_peak.shape)
df_19_peak.head(3)

(36, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
7054,t1_efm9ivw,/r/Celebs/comments/amivoi/round_1_amber_heard_...,Amber. Get her to wear her mera costume and ri...,t3_amivoi,r/Celebs,2019-02-02 22:05:09,Neutral,Neutral,1.0,submission,...,False,False,False,2584.0,39.0,2018-12-30 09:20:29,others,2018,34 days 12:44:40,34.0
7055,t1_eps1535,/r/CelebEconomy/comments/bvo3jz/my_first_post_...,"Gal cause shes a goddess and then cara, zenday...",t3_bvo3jz,r/CelebEconomy,2019-06-02 00:25:09,Neutral,Neutral,1.0,submission,...,False,False,False,2584.0,39.0,2018-12-30 09:20:29,others,2018,153 days 15:04:40,153.0
7056,t1_er4n0e3,/r/JerkOffToCelebs/comments/c0hfti/amber_heard...,She was pure sex in that film,t3_c0hfti,r/JerkOffToCelebs,2019-06-14 08:03:00,Positive,Neutral,15.0,submission,...,False,False,False,2584.0,39.0,2018-12-30 09:20:29,others,2018,165 days 22:42:31,165.0


In [174]:
# In 2018
df_18_peak = df_18[df_18.user_name.isin(users_peak)]
print(df_18_peak.shape)
df_18_peak.head(3)

(0, 24)


Unnamed: 0,child_id,permalink,text,parent_id,subreddit,created_at,sentiment_blob,sentiment_nltk,score,top_level,...,is_mod,is_gold,is_banned,comment_karma,link_karma,user_created_at,banned_unverified,creation_year,diff,days_after_creation
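The four outputs above give only row counts and the first three rows per year. A compact cross-tabulation of year against sentiment label is one way to summarize them; this is a sketch on invented data, and `data_year` is a hypothetical column name added for the example.

```python
import pandas as pd

# Toy frames standing in for df_19 and df_20 (hypothetical values).
toy_frames = {
    2019: pd.DataFrame({
        "user_name": ["a", "x"],
        "sentiment_nltk": ["Neutral", "Positive"],
    }),
    2020: pd.DataFrame({
        "user_name": ["a", "b", "b", "y"],
        "sentiment_nltk": ["Negative", "Negative", "Positive", "Neutral"],
    }),
}
users_peak = ["a", "b"]  # accounts created on peak days

# Stack the yearly frames, keeping the year as a regular column.
stacked = pd.concat(toy_frames, names=["data_year"]).reset_index(level="data_year")
peak_rows = stacked[stacked["user_name"].isin(users_peak)]

# Rows: year of the contribution; columns: sentiment label; cells: counts.
summary = pd.crosstab(peak_rows["data_year"], peak_rows["sentiment_nltk"])
print(summary)
```

The same pattern works with `sentiment_blob` in place of `sentiment_nltk`; the two labels disagree on several rows shown above, so comparing both tables may be worthwhile.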


<a id='conclusions'></a>
## Conclusions

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions"><b><mark>Conclusions</mark></b></a></li>
</ul>

>### Daily Creation Peaks

**PEAK YEARS:**
- 2018 --> 9567 accounts
- 2019 --> 12264 accounts
- 2020 --> 9348 accounts

**PEAK MONTHS:**
- 2018 --> Dec --> 1056 accounts
- 2019 --> Jan --> 1245 accounts
- 2020 --> Jan --> 1151 accounts
- 2021 --> Jan --> 197 accounts

**PEAK DATES:**
- 2021-02-04 --> 15 accounts
- 2020-02-04 --> 65 accounts
- 2019-01-13 --> 53 accounts
- 2018-12-30 --> 48 accounts
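Assuming each peak date listed here is the busiest creation day of its year, such dates could be derived from the creation timestamps roughly as follows. This is an illustrative sketch on made-up data, not the computation used earlier in the notebook.

```python
import pandas as pd

# Toy data standing in for df_users (hypothetical values).
toy_users = pd.DataFrame({"user_created_at": pd.to_datetime([
    "2018-12-30 01:00", "2018-12-30 05:00", "2018-11-02 10:00",
    "2019-01-13 08:00", "2019-01-13 09:00", "2019-03-01 12:00",
])})

# Number of accounts created on each calendar day.
daily = toy_users.groupby(toy_users["user_created_at"].dt.normalize()).size()

# Within each year, pick the day with the most new accounts.
by_year = daily.groupby(daily.index.year)
peaks = pd.DataFrame({
    "peak_day": by_year.idxmax(),
    "n_accounts": by_year.max(),
}).rename_axis("year")
print(peaks)
```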


>### Users Characteristics

- More than 50% of the Reddit accounts in the dataset were created in the last 4 years. <br>
- 16% (11.3K) of the accounts do not have a verified email. <br>
- 7.12% (5K) of the accounts are banned.
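Shares like the ones above can be computed directly from the boolean columns. A minimal sketch with invented data follows; it uses `.eq(False)` rather than `~` so that a missing value is not counted as "unverified", on the assumption that the real column may contain gaps.

```python
import pandas as pd

# Toy data standing in for df_users (hypothetical values).
toy_users = pd.DataFrame({
    "has_verified_email": [True, True, False, True],
    "is_banned": [False, True, False, False],
})

# The mean of a boolean Series is the fraction of True values.
share_unverified = toy_users["has_verified_email"].eq(False).mean()
share_banned = toy_users["is_banned"].eq(True).mean()

print(f"unverified: {share_unverified:.2%}, banned: {share_banned:.2%}")
```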

**ExpertAccident:**
- Gold user
- Created on 2018-12-08
- Has the second-largest link karma (more than 10 M)
- Posted a negative submission on 2020-11-10
- https://www.reddit.com/r/redditmoment/comments/jrr5po/well_guys_we_epic_redditors_singlehandedly_took/

**TooShiftyForYou:**
- Gold user
- Created on 2015-07-26
- Has the largest comment karma (more than 23.4 M)
- Posted negative comments (May and July 2020)

<a id='end'></a>
## END OF NOTEBOOK

In [175]:
# !jupyter nbconvert --to html user_creation_analysis.ipynb 

In [176]:
# !wkhtmltopdf user_creation_analysis.html user_creation_analysis.pdf