# Data Exploration

To see my methods for collecting the data, look at `scrape_top_subreddits.py` and then `scrape_comments.py` in the `scripts` folder.

In [1]:
import pandas as pd

In [2]:
top_10 = pd.read_json("../private/comment_data_10.json")
top_100 = pd.read_json("../private/comment_data_100.json")
top_1000 = pd.read_json("../private/comment_data_1000.json")
top_2500 = pd.read_json("../private/comment_data_2500.json")

I split my dataset into four different sizes. The number in the file name is how far down the top subreddit list I went when scraping. For example, `comment_data_10.json` contains every comment from the top ten posts of the past year from the top ten subreddits, while `comment_data_100.json` contains the same for the top 100 subreddits. `top_10` is the set I intend to use for basic experimentation, so I don't have to wait too long for operations to complete. Once I have my code worked out, I can scale it up to `top_100` for actual analysis. Then, if that goes well and I want to scale up further, I can switch to `top_1000` or even `top_2500`.
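Since the four files differ only by the number in their names, switching between sizes could be wrapped in a small helper. This is only a sketch: `load_comments`, `SIZES`, and the `data_dir` parameter are names made up for illustration, and it assumes every file follows the `comment_data_<size>.json` pattern used in the cell above.

```python
from pathlib import Path

import pandas as pd

# Dataset sizes that exist as files (taken from the loading cell above).
SIZES = (10, 100, 1000, 2500)


def load_comments(size, data_dir="../private"):
    """Load comment_data_<size>.json from data_dir (hypothetical helper)."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size}")
    # Same pd.read_json call as in the notebook, with the file name built from size.
    return pd.read_json(Path(data_dir) / f"comment_data_{size}.json")
```

With something like this, experimenting on `load_comments(10)` and later moving to `load_comments(100)` is a one-character change.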

In [3]:
top_10.head()

|   | subreddit | comment_id | text |
|---|---|---|---|
| 0 | funny | jg3d9yg | "Oh my god! That's awful!" Exactly how you wan... |
| 1 | funny | jg3af9r | Her eyes when he stood up. |
| 2 | funny | jg3apmv | [deleted] |
| 3 | funny | jg3782t | the reporter's name is Brad Blanks |
| 4 | funny | jg39vhb | \*\*Jennifer:\*\* What are you getting in the way-... |


Since I scraped the data myself, the format is already good for my purposes and I don't have to clean up much, other than removing the `[deleted]` and `[removed]` comments. Each file is structured like this, with the only difference being the number of rows. The `subreddit` column contains the name of the subreddit the comment was left on. The `comment_id` column contains an ID which can be used with the Reddit API to fetch more information about a comment, such as its number of upvotes, the post it was left on, or when it was made. Finally, the `text` column contains the text of the comment.
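The one cleanup step mentioned above, dropping `[deleted]` and `[removed]` comments, could look roughly like the sketch below. It runs on a toy frame with the same three columns; `drop_placeholder_comments` and the `id1`..`id3` values are made up for illustration, and stripping whitespace before matching is an assumption about the data rather than something I have checked.

```python
import pandas as pd

# Placeholder strings that stand in for deleted or removed comments.
PLACEHOLDERS = ["[deleted]", "[removed]"]


def drop_placeholder_comments(df):
    """Return a copy of df without rows whose text is only a placeholder."""
    # Strip whitespace first so a padded "[deleted] " is caught as well
    # (an assumption; the scraped data may never contain padded placeholders).
    is_placeholder = df["text"].str.strip().isin(PLACEHOLDERS)
    return df[~is_placeholder].reset_index(drop=True)


# Toy frame with the same columns as the real data (ids here are made up).
sample = pd.DataFrame(
    {
        "subreddit": ["funny", "funny", "aww"],
        "comment_id": ["id1", "id2", "id3"],
        "text": ["Her eyes when he stood up.", "[deleted]", "[removed]"],
    }
)
cleaned = drop_placeholder_comments(sample)
print(len(sample), "->", len(cleaned))
```

The same function should work unchanged on `top_10` or any of the larger frames, since it only relies on the `text` column.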

In [4]:
top_10.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9838 entries, 0 to 9837
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   subreddit   9838 non-null   object
 1   comment_id  9838 non-null   object
 2   text        9838 non-null   object
dtypes: object(3)
memory usage: 230.7+ KB


In [5]:
top_10.describe()

|   | subreddit | comment_id | text |
|---|---|---|---|
| count | 9838 | 9838 | 9838 |
| unique | 10 | 9838 | 9456 |
| top | aww | jg3d9yg | [deleted] |
| freq | 1962 | 1 | 164 |


In [6]:
top_100.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120856 entries, 0 to 120855
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   subreddit   120856 non-null  object
 1   comment_id  120856 non-null  object
 2   text        120856 non-null  object
dtypes: object(3)
memory usage: 2.8+ MB


In [7]:
top_100.describe()

|   | subreddit | comment_id | text |
|---|---|---|---|
| count | 120856 | 120856 | 120856 |
| unique | 100 | 120856 | 115028 |
| top | cats | jg3d9yg | [deleted] |
| freq | 2548 | 1 | 1342 |


In [8]:
top_1000.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132141 entries, 0 to 1132140
Data columns (total 3 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   subreddit   1132141 non-null  object
 1   comment_id  1132141 non-null  object
 2   text        1132141 non-null  object
dtypes: object(3)
memory usage: 25.9+ MB


In [9]:
top_1000.describe()

|   | subreddit | comment_id | text |
|---|---|---|---|
| count | 1132141 | 1132141 | 1132141 |
| unique | 1000 | 1132141 | 1056914 |
| top | painting | jg3d9yg | [deleted] |
| freq | 3851 | 1 | 9239 |


In [10]:
top_2500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2320731 entries, 0 to 2320730
Data columns (total 3 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   subreddit   object
 1   comment_id  object
 2   text        object
dtypes: object(3)
memory usage: 53.1+ MB


In [11]:
top_2500.describe()

|   | subreddit | comment_id | text |
|---|---|---|---|
| count | 2320731 | 2320731 | 2320731 |
| unique | 2492 | 2320731 | 2145538 |
| top | Faces | jg3d9yg | [deleted] |
| freq | 3858 | 1 | 17345 |
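
The `freq` row also gives a rough sense of how much cleaning each dataset needs. As a quick sketch, the share of rows whose text is exactly `[deleted]` can be computed from the counts printed above. The numbers are copied by hand from the `describe()` outputs; `[removed]` never shows up as the top value, so it is not counted here, and the variable names are just for illustration.

```python
# (total rows, rows whose text is "[deleted]"), copied from the describe() outputs above.
counts = {
    "top_10": (9838, 164),
    "top_100": (120856, 1342),
    "top_1000": (1132141, 9239),
    "top_2500": (2320731, 17345),
}

# Fraction of each dataset that is a "[deleted]" placeholder.
deleted_share = {name: deleted / total for name, (total, deleted) in counts.items()}

for name, share in deleted_share.items():
    print(f"{name}: {share:.2%} of comments are [deleted]")
```

In every dataset this comes out below 2%, so dropping these rows should not shrink the data by much.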


Almost all of this information seems to check out, and wow, those big datasets have a lot of rows! But `top_2500` only has 2492 unique subreddits, so what are the 8 missing ones? We can pull in the list I used in my scraper to figure this out.

In [12]:
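# Subreddit names from the list the scraper used (whitespace-separated).
with open("../private/top_subreddits.txt") as f:
    expected = set(f.read().split())
# Subreddits that actually appear in the scraped comments.
actual = set(top_2500["subreddit"])
# Names that are in the list but have no comments in the dataset.
expected.difference(actual)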

{'ForHire_GameDev',
 'ForHire_OG',
 'ForeverAloneDating',
 'GlamourSchool',
 'TheArtistStudio',
 'badtattoos',
 'recipegifs',
 'shortcircuit'}

Interesting. Looking at these subreddits, the problem for most of them seems to be that they haven't had any posts in the past year. Since I only scraped the top posts of the year, there were no posts to pull comments from.

In [13]:
# TODO: mess with the data a bit more