<p>&nbsp;</p>
<img src="https://1000logos.net/wp-content/uploads/2017/05/Reddit-logo.png" width=400>
<p>&nbsp;</p>

## Introduction

This is a brief exploratory data analysis, using Pandas, of a public sample of Reddit posts from the r/dataisbeautiful subreddit.
We will get a feel for the dataset and try to answer the following questions:
* What are the most popular posts? Which topics go viral?
* Which posts have been removed, and why?
* What percentage of removed posts were removed by moderators?
* Who are the most popular authors?
* Who are the biggest spammers on the Reddit platform?


In [2]:
#Getting all the packages we need: 

import numpy as np # linear algebra
import pandas as pd # data processing

# from wordcloud import WordCloud, STOPWORDS # optional to filter out the stopwords


## <a name="read"></a>Reading the dataset
Accessing Reddit dataset:

In [3]:
df = pd.read_csv('../input/r_dataisbeautiful_posts.csv')



In [4]:
df.sample(5)

| | id | title | score | author | author_flair_text | removed_by | total_awards_received | awarders | created_utc | full_link | num_comments | over_18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44413 | a5t9ed | Timeline Of All Known Exoplanets Being Discove... | 1 | AlanZucconi | OC: 6 | | | | 1544705810 | https://www.reddit.com/r/dataisbeautiful/comme... | 6 | False |
| 145326 | 2wjis2 | What are the Youngest (and Oldest) Counties in... | 2 | sympletic | | | | | 1424436204 | https://www.reddit.com/r/dataisbeautiful/comme... | 3 | False |
| 65106 | 7u3rqz | Data viz shows the diversity of Academy Award ... | 1 | savard1120 | | | | | 1517342037 | https://www.reddit.com/r/dataisbeautiful/comme... | 0 | False |
| 24252 | cy5ed5 | [OC] 16 Months of Searching for a Job | 74 | Sorry_Sorry_Im_Sorry | | | 0.0 | | 1567310448 | https://www.reddit.com/r/dataisbeautiful/comme... | 24 | False |
| 94986 | 5mxlwx | Path to the College Football National Champion... | 0 | DataVizWithTableau | | | | | 1483968064 | https://www.reddit.com/r/dataisbeautiful/comme... | 1 | False |


## <a name="feel"></a>Getting a feel of the dataset
Let's run some basic exploratory commands on the dataframe:

In [5]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173611 entries, 0 to 173610
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   id                     173611 non-null  object 
 1   title                  173610 non-null  object 
 2   score                  173611 non-null  int64  
 3   author                 173611 non-null  object 
 4   author_flair_text      22029 non-null   object 
 5   removed_by             6543 non-null    object 
 6   total_awards_received  33605 non-null   float64
 7   awarders               22930 non-null   object 
 8   created_utc            173611 non-null  int64  
 9   full_link              173611 non-null  object 
 10  num_comments           173611 non-null  int64  
 11  over_18                173611 non-null  bool   
dtypes: bool(1), float64(1), int64(3), object(7)
memory usage: 14.7+ MB


| | score | total_awards_received | created_utc | num_comments |
|---|---|---|---|---|
| count | 173611.0 | 33605.0 | 173611.0 | 173611.0 |
| mean | 193.861069 | 0.00128 | 1491547000.0 | 25.282436 |
| std | 2001.160875 | 0.070061 | 61371820.0 | 195.280094 |
| min | 0.0 | 0.0 | 1329263000.0 | 0.0 |
| 25% | 1.0 | 0.0 | 1445138000.0 | 1.0 |
| 50% | 1.0 | 0.0 | 1491085000.0 | 1.0 |
| 75% | 5.0 | 0.0 | 1546150000.0 | 4.0 |
| max | 116226.0 | 8.0 | 1586792000.0 | 18801.0 |


In [6]:
print("Data shape :",df.shape)

Data shape : (173611, 12)


In [7]:
#Empty values:

df.isnull().sum().sort_values(ascending = False)

removed_by               167068
author_flair_text        151582
awarders                 150681
total_awards_received    140006
title                         1
id                            0
score                         0
author                        0
created_utc                   0
full_link                     0
num_comments                  0
over_18                       0
dtype: int64

From the tables above we note:
- The dataset has `173,611` entries, but not every column is complete: `removed_by`, `author_flair_text`, `awarders` and `total_awards_received` are mostly empty, and one post has no `title`.
- The average post score is about `194`, but the median is only `1`: at least half of the posts have a score of `0` or `1`, and no more than about 25% of posts score above `5` (the 75th percentile).
- The most-commented post has `18,801` comments, while the average is about `25` and the median is `1`.
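
The quartile reading in the notes above can be checked directly with boolean masks. The following is a minimal sketch on a made-up `scores` series, not the real data; in the notebook the same expressions would be applied to `df["score"]`.

```python
import pandas as pd

# Hypothetical toy scores standing in for df["score"].
scores = pd.Series([0, 1, 1, 1, 1, 1, 1, 2, 5, 5, 40, 900], name="score")

median_score = scores.median()
q75 = scores.quantile(0.75)

# The mean of a boolean mask is the share of rows where the condition holds.
share_at_most_median = (scores <= median_score).mean()
share_above_q75 = (scores > q75).mean()

print(f"median = {median_score}, 75th percentile = {q75}")
print(f"share of posts at or below the median: {share_at_most_median:.1%}")
print(f"share of posts above the 75th percentile: {share_above_q75:.1%}")
```

A few very large scores are enough to pull the mean far above the median, which is the same pattern `describe()` shows for the real `score` column.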

## <a name="corr"></a>Removed posts deep dive

Let's see who removes posts and why:

>As we can see, most removed posts (68%) were removed by a moderator. Less than 1% were deleted by their authors.
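
The cell that produced this breakdown is not shown here. One way such shares could be computed is with `value_counts(normalize=True)` on the `removed_by` column. The sketch below uses a small made-up series (the values `moderator` and `deleted` do appear in the dataset, but the counts are invented), so its percentages are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for df["removed_by"].
removed_by = pd.Series(
    ["moderator", None, "moderator", "deleted", None, None, "moderator", None],
    name="removed_by",
)

# Posts that were never removed have a missing value, so dropna()
# restricts the calculation to removed posts only.
removed_only = removed_by.dropna()

# Share of each removal source among removed posts, in percent.
removal_share = removed_only.value_counts(normalize=True).mul(100).round(1)
print(removal_share)
```

Dropping the missing values first makes the denominator explicit: the percentages are shares of removed posts, not of all posts.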


## <a name="corr"></a>The most popular posts

## <a name="corr"></a>The most common words in post titles

Let's build a word cloud of the most commonly used words in post titles:

In [8]:
#To build a wordcloud, we have to remove NULL values first:
df["title"] = df["title"].fillna(value="")

In [9]:
#Join all lower-cased titles into a single string to feed the word cloud:
word_string=" ".join(df['title'].str.lower())

#word_string
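
Since the `wordcloud` import is commented out above, the joined string can also be summarized as plain word counts. This is a minimal standard-library sketch; the sample titles and the tiny stopword set are made up for illustration and are not the `STOPWORDS` list from the `wordcloud` package.

```python
from collections import Counter
import re

# Hypothetical titles; in the notebook, word_string is built from df["title"].
titles = ["[OC] Coronavirus tests per million", "Timeline of coronavirus cases", ""]
word_string = " ".join(t.lower() for t in titles)

# Illustrative stopword list (an assumption, kept deliberately small).
stopwords = {"of", "per", "the", "a", "in"}

# Keep alphabetic tokens only, then drop stopwords.
tokens = [w for w in re.findall(r"[a-z]+", word_string) if w not in stopwords]

# The three most frequent words with their counts.
top_words = Counter(tokens).most_common(3)
print(top_words)
```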

## <a name="corr"></a>Comments distribution


>The average post has about 25 comments. Let's look at the comment distribution for posts with fewer than 25 comments:

In [15]:
df

| | id | title | score | author | author_flair_text | removed_by | total_awards_received | awarders | created_utc | full_link | num_comments | over_18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | g0l1o6 | [OC] Website about covid-19 pandemic stats wit... | 1 | muddymind | | moderator | 0.0 | [] | 1586791506 | https://www.reddit.com/r/dataisbeautiful/comme... | 3 | False |
| 1 | g0kxzc | Dynamic timeline of the founding of major Euro... | 1 | [deleted] | | deleted | 0.0 | [] | 1586791184 | https://www.reddit.com/r/dataisbeautiful/comme... | 0 | False |
| 2 | g0kwbp | Despite more than four weeks to complete it, I... | 251 | jamaisvu99 | OC: 3 | | 0.0 | [] | 1586791045 | https://www.reddit.com/r/dataisbeautiful/comme... | 25 | False |
| 3 | g0ktji | [OC] Reported Coronavirus Tests per million as... | 24 | AAA786786 | OC: 2 | | 0.0 | [] | 1586790800 | https://www.reddit.com/r/dataisbeautiful/comme... | 18 | False |
| 4 | g0kiyr | [OC] House M.D.-IMDB rating of episodes | 9 | MrButterDucky | OC: 1 | moderator | 0.0 | [] | 1586789902 | https://www.reddit.com/r/dataisbeautiful/comme... | 8 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173606 | pqbdl | Infosthetics seems like it belongs here. | 15 | magiclamp | | | | | 1329282849 | https://www.reddit.com/r/dataisbeautiful/comme... | 0 | False |
| 173607 | pqav2 | Time lapse of every nuclear detonation from 19... | 9 | th3sousa | | | | | 1329282160 | https://www.reddit.com/r/dataisbeautiful/comme... | 0 | False |
| 173608 | pq922 | Wavii. | 13 | ddshroom | | | | | 1329279777 | https://www.reddit.com/r/dataisbeautiful/comme... | 2 | False |
| 173609 | ppx09 | An interactive representation of Pres. Obamas ... | 21 | zanycaswell | | | | | 1329265203 | https://www.reddit.com/r/dataisbeautiful/comme... | 0 | False |


>As we can see, most posts have fewer than 5 comments.
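
The cell above only displays the dataframe, so here is one possible way to tabulate the distribution for posts with fewer than 25 comments. It is a sketch on made-up counts; the bucket edges and labels are assumptions chosen for illustration, and in the notebook `num_comments` would be `df["num_comments"]`.

```python
import pandas as pd

# Hypothetical comment counts standing in for df["num_comments"].
num_comments = pd.Series([0, 0, 1, 1, 1, 3, 4, 8, 18, 24, 25, 300], name="num_comments")

# Keep only posts with fewer than 25 comments.
below_25 = num_comments[num_comments < 25]

# Bucket the counts; right=False makes each bin include its left edge,
# e.g. [1, 5) holds posts with 1 to 4 comments.
bins = [0, 1, 5, 10, 25]
labels = ["0", "1-4", "5-9", "10-24"]
distribution = pd.cut(below_25, bins=bins, labels=labels, right=False).value_counts(sort=False)
print(distribution)
```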

## <a name="corr"></a>Correlation between dataset variables

Now let's see how the dataset variables are correlated with each other:
* How are score and comments correlated?
* Do they increase and decrease together (positive correlation)?
* Does one of them increase when the other decreases, and vice versa (negative correlation)? Or are they not correlated?

Correlation is represented as a value between -1 and +1, where +1 denotes the strongest positive correlation, -1 denotes the strongest negative correlation, and 0 denotes that there is no linear correlation.
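
To make the -1 to +1 scale concrete, the sketch below computes the Pearson coefficient by hand (Pearson is the default method of `DataFrame.corr`). The helper `pearson` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the dataset.
score = [1, 2, 3, 4, 5]
comments_up = [2, 4, 6, 8, 10]    # rises in step with score
comments_down = [10, 8, 6, 4, 2]  # falls as score rises

print(pearson(score, comments_up))
print(pearson(score, comments_down))
```

The first pair moves together perfectly and the second moves in exact opposition, so they sit at the two ends of the scale.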

Let's look at the correlation table for our dataset variables (numerical and boolean variables only):

In [16]:
# df = pd.to_numeric(df[:], errors='coerce')

df.corr(numeric_only=True)

| | score | total_awards_received | created_utc | num_comments | over_18 |
|---|---|---|---|---|---|
| score | 1.0 | 0.222506 | 0.029288 | 0.637163 | 0.018861 |
| total_awards_received | 0.222506 | 1.0 | 0.015877 | 0.13504 | 0.008467 |
| created_utc | 0.029288 | 0.015877 | 1.0 | 0.024414 | 0.011568 |
| num_comments | 0.637163 | 0.13504 | 0.024414 | 1.0 | 0.028636 |
| over_18 | 0.018861 | 0.008467 | 0.011568 | 0.028636 | 1.0 |


We see that score and number of comments are clearly positively correlated, with a correlation value of about 0.64.

Total awards received is weakly positively correlated with score (0.22) and with num_comments (0.14). All other pairs are close to 0.

Now let's visualize the correlation table above using a heatmap:


In [18]:
h_labels = [x.replace('_', ' ').title() for x in 
            list(df.select_dtypes(include=['number', 'bool']).columns.values)]
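
The plotting cell itself is not included in this section, so the following is only a sketch of how such a heatmap could be drawn. It assumes matplotlib is installed (the original may well have used a different plotting library) and uses a small made-up frame, `toy`, in place of `df`; the labels are built the same way as `h_labels` above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "score": [1, 5, 20, 300],
    "num_comments": [0, 2, 9, 120],
    "over_18": [False, False, True, False],
})

# Correlation matrix of the numeric and boolean columns.
corr = toy.corr(numeric_only=True)
labels = [c.replace("_", " ").title() for c in corr.columns]

# Draw the matrix as a colour grid with a fixed -1..+1 colour scale.
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Fixing `vmin` and `vmax` at -1 and +1 keeps the colour scale comparable across datasets instead of stretching it to the observed range.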


## <a name="corr"></a>Score distribution


In [19]:
df.score.describe()

count    173611.000000
mean        193.861069
std        2001.160875
min           0.000000
25%           1.000000
50%           1.000000
75%           5.000000
max      116226.000000
Name: score, dtype: float64

In [20]:
df.score.median()

1.0