# Code Notebook 1: Data Collection, Initial Cleaning, and EDA #

This notebook contains the initial EDA of the subreddit data.
- Null values are identified and handled.
- Comment post times are converted to year, month, and day columns.
- Features with the most potential for this analysis are selected.  All available fields were collected originally, in case a feature I had not anticipated turned out to be useful.
- The most frequent comments in each subreddit are examined for common patterns that indicate a comment should be dropped.  Examples are [removed] comments and comments from moderator bots.

The aim of this project is to identify differences between the discourse among scientists and the discourse among members of the general public who have shown enthusiasm for human advancement.  NLP will be used to examine differences between comments on the Science and Futurology subreddits, and classification models will be used to attempt to identify which subreddit each comment came from.  The classification models will be evaluated primarily on F1 score, to ensure that they are not skewed towards one subreddit.  Other scores considered will be precision, recall, and ROC AUC.

Please see the README.md for a more in-depth discussion of the motivation for this project.

In [1]:
import pandas as pd

import datetime
import re

import cleantools as ct

### Section 1: Obtaining and Importing Data ###

*The script "get_reddit_data.py," in this folder, was used to collect the data for this project.*

In [2]:
science_full = pd.concat([pd.read_csv(f'../data/sci_{i}.csv') for i in range(53)])
future_full = pd.concat([pd.read_csv(f'../data/future_{i}.csv') for i in range(53)])

In [3]:
# Saving the lengths of the original dataframes for comparison after cleaning, so I will know how many comments are dropped.
science_startlen = len(science_full)
future_startlen = len(future_full)

In [4]:
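# drop=True discards the per-file row labels instead of keeping them as an extra 'index' column.
science_full.reset_index(drop=True, inplace=True)
future_full.reset_index(drop=True, inplace=True)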
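
The loading and index-reset cells above can also be combined into one step. The sketch below is one way to do that, assuming the chunks all share the same columns; `load_chunks` and the demo file names are hypothetical and are not part of this project's code.

```python
import os
import tempfile

import pandas as pd


def load_chunks(paths):
    """Read several CSV chunks and stack them under one fresh 0..n-1 index."""
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True discards each chunk's own row labels, so no separate
    # reset_index call is needed and no leftover 'index' column is created.
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two throwaway files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i, bodies in enumerate([["a", "b"], ["c"]]):
        path = os.path.join(tmp, f"demo_{i}.csv")
        pd.DataFrame({"body": bodies}).to_csv(path, index=False)
        demo_paths.append(path)
    demo = load_chunks(demo_paths)

print(demo)
```

With the real data, the call would look something like `load_chunks([f'../data/sci_{i}.csv' for i in range(53)])`.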

### Section 2: Identifying and Handling Null Values ###

In [5]:
science_full['created_utc'].isna().sum()

1

In [6]:
science_full[['body', 'created_utc']][science_full['created_utc'].isna()]

|        | body                      | created_utc |
|--------|---------------------------|-------------|
| 511129 | The dose makes the poison | NaN         |


In [7]:
science_full['body'].isna().sum()

1

In [8]:
science_full[['body', 'created_utc']][science_full['body'].isna()]

|        | body | created_utc                                       |
|--------|------|---------------------------------------------------|
| 511130 | NaN  | /r/science/comments/oturfg/caffeinated_bumbleb... |


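I'm not sure what caused these two rows to have NaN values for created date and body.  They are adjacent, and the second row holds a permalink in its created_utc column, so it looks as if a single record was split across two rows when the data was written or read.  Either way, two rows out of more than 500,000 is a negligible loss, so I am dropping them.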
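
The null check above can also be done for several required columns at once. This is a minimal sketch on made-up data, not the project's actual rows; the column names match the notebook, but the values and the `required` list are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the comment data, with two malformed rows.
demo = pd.DataFrame({
    "body": ["first comment", "The dose makes the poison", None, "last comment"],
    "created_utc": ["1634688000", None, "not-a-timestamp", "1634774400"],
})

required = ["body", "created_utc"]

# True for any row that is missing at least one required value.
has_null = demo[required].isna().any(axis=1)

bad_rows = demo[has_null]                         # inspect these before dropping
cleaned = demo[~has_null].reset_index(drop=True)  # keep only complete rows

print(bad_rows)
print(cleaned)
```

Selecting rows with a boolean mask avoids hard-coding index labels, so the cell keeps working if the data is collected again and the malformed rows move.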

In [9]:
future_full['created_utc'].isna().sum()

0

In [10]:
future_full['body'].isna().sum()

0

In [11]:
# Drop the two malformed rows found above, then renumber the index.
science_full.drop([511129, 511130], inplace=True)
science_full.reset_index(drop=True, inplace=True)

### Section 3: Converting UTC timestamps to more convenient post time information ###

In [12]:
# I learned how to convert the Unix timestamp here:
# https://stackoverflow.com/questions/3694487/in-python-how-do-you-convert-seconds-since-epoch-to-a-datetime-object
def convert_time(timestamp):
    # Note: fromtimestamp() returns a naive datetime in this machine's local time zone, not UTC.
    return datetime.datetime.fromtimestamp(timestamp)

In [13]:
def get_year(date):
    return date.year

In [14]:
def get_month(date):
    return date.month

In [15]:
def get_day(date):
    return date.day

In [16]:
science_full['created_utc'] = science_full['created_utc'].apply(ct.convert_to_float)
science_full['date'] = science_full['created_utc'].apply(convert_time)
science_full['year'] = science_full['date'].apply(get_year)
science_full['month'] = science_full['date'].apply(get_month)
science_full['day'] = science_full['date'].apply(get_day)

In [17]:
future_full['created_utc'] = future_full['created_utc'].apply(ct.convert_to_float)
future_full['date'] = future_full['created_utc'].apply(convert_time)
future_full['year'] = future_full['date'].apply(get_year)
future_full['month'] = future_full['date'].apply(get_month)
future_full['day'] = future_full['date'].apply(get_day)

The above code converts the UTC timestamps to dates, then adds year, month, and day columns to each dataframe.
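
The same conversion can be done without row-by-row `apply` calls. The sketch below uses `pd.to_datetime` with `unit="s"` and the `.dt` accessor on two made-up timestamps; it is an alternative, not the method used in the cells above. Because it passes `utc=True`, the dates are in UTC, whereas `datetime.datetime.fromtimestamp` uses the local time zone, so days can differ for comments posted near midnight.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds:
# 2021-01-01 00:00:00 UTC and 2021-10-20 00:00:00 UTC.
demo = pd.DataFrame({"created_utc": [1609459200.0, 1634688000.0]})

# Vectorized conversion of the whole column to timezone-aware UTC datetimes.
dates = pd.to_datetime(demo["created_utc"], unit="s", utc=True)

# The .dt accessor exposes the date parts for the whole column at once.
demo["year"] = dates.dt.year
demo["month"] = dates.dt.month
demo["day"] = dates.dt.day

print(demo)
```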

In [18]:
science_full['year'].value_counts()

2021    529873
Name: year, dtype: int64

In [19]:
science_full['month'].value_counts()

8     188779
9     155970
10    148766
7      36358
Name: month, dtype: int64

In [20]:
future_full['year'].value_counts()

2021    529866
Name: year, dtype: int64

In [21]:
future_full['month'].value_counts()

6     129348
8     112560
7     104622
9      90938
10     63712
5      28686
Name: month, dtype: int64

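The comments from the two subreddits cover similar, but not identical, time periods: the Science comments run from July through October 2021, and the Futurology comments run from May through October 2021.  This may be worth keeping in mind, but I don't see a big enough difference to be a problem right now.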

### Section 4: Selecting useful features and confirming they have correct data types ###

In [22]:
interesting_cols = ['subreddit', 'body', 'score', 'author', 'year', 'month', 'day']

At this point, I'm dropping the features that I don't expect to use.  The features I definitely need are subreddit and body.  The date columns are useful for ensuring that the comments I use are from a similar timeframe.  The score column could be interesting to include as a weight or additional feature, indicating the types of comments that are most successful on each subreddit.  Author will be interesting during EDA to examine differences in posting behavior, but it will not be used in predictive modeling because matching authors to subreddits is not the goal of this analysis.

In [23]:
science_df = science_full[interesting_cols].copy()
future_df = future_full[interesting_cols].copy()

In [24]:
science_df.isna().sum()

subreddit    0
body         0
score        0
author       0
year         0
month        0
day          0
dtype: int64

In [25]:
future_df.isna().sum()

subreddit    0
body         0
score        0
author       0
year         0
month        0
day          0
dtype: int64

In [26]:
science_df.dtypes

subreddit    object
body         object
score        object
author       object
year          int64
month         int64
day           int64
dtype: object

In [27]:
future_df.dtypes

subreddit    object
body         object
score         int64
author       object
year          int64
month         int64
day           int64
dtype: object

Focusing in on these columns of interest, there are no NaN values remaining.  Data types are also as expected, except that score should be numeric and is not for the science dataframe.

In [28]:
ct.find_non_numeric(science_df['score'])

[]

No problematic values remain.  The column was probably read as object dtype because of one of the malformed rows that I already removed, so the remaining values only need to be cast to int.
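
`ct.find_non_numeric` comes from this project's own `cleantools` module, which is not shown here. As a rough stand-in, the sketch below shows one way such a check could be written with `pd.to_numeric`; the sample scores are made up, and this is not necessarily how the project's helper works.

```python
import pandas as pd

# Hypothetical object-dtype score column with one value that is not a number.
scores = pd.Series(["12", "-3", "n/a", 7], dtype="object")

# errors="coerce" turns anything that cannot be parsed into NaN.
as_numbers = pd.to_numeric(scores, errors="coerce")

# Values that became NaN but were not missing to begin with are the offenders.
non_numeric = scores[as_numbers.isna() & scores.notna()]

# Once the offenders are dealt with, the rest can be cast to int.
converted = as_numbers.dropna().astype("int64")

print(non_numeric.tolist())
print(converted.tolist())
```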

In [29]:
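def convert_to_int(in_item):
    """
    Converts the input to int.
    Prints a warning and returns None if the input cannot be typecast as int.
    """
    try:
        return int(in_item)
    except (ValueError, TypeError):
        # Catch only conversion failures rather than every possible exception.
        print('Non-numeric value found.')
        return None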

In [30]:
science_df['score'] = science_df['score'].apply(convert_to_int)

In [31]:
science_df.dtypes

subreddit    object
body         object
score         int64
author       object
year          int64
month         int64
day           int64
dtype: object

The data types are now correct.

### Section 5: Identifying and removing comments that are irrelevant to this analysis ###

In [32]:
future_df['body'].value_counts()[future_df['body'].value_counts() > 10].sort_values(ascending=False)

[removed]    68337
[deleted]    11340
Hello, everyone! Want to help improve this community?\n    \nWe're looking for more moderators!\n     \n[If you're i

Checking for repeated comments in the Futurology subreddit, I see a few things that should be screened for:

1. [removed]
2. [deleted]
3. "Removed - Duplicate submission"
4. Any comment that includes the text "This is a bot. Replies will not receive responses" or "I am a bot, and this action was performed automatically".
5. Any post containing "could you please repost with a submission statement."
6. Any comment that contains the word "Rule" followed by a number.  This one may catch a few legitimate comments, but only 396 comments in total contain this pattern, so very few legitimate comments will be removed.

In [33]:
# Collect the index labels of every row (not column) that matches the list above.
lowered = future_df['body'].str.lower()
drop_cols = list(future_df[future_df['body'] == '[removed]'].index)
drop_cols += list(future_df[future_df['body'] == '[deleted]'].index)
drop_cols += list(future_df[future_df['body'] == 'Removed - Duplicate submission'].index)
# regex=False matches the phrase literally, and .index gives row labels rather than positions.
drop_cols += list(future_df[lowered.str.contains('this is a bot. replies will not receive responses', regex=False)].index)
drop_cols += list(future_df[lowered.str.contains('i am a bot, and this action was performed automatically', regex=False)].index)
drop_cols += list(future_df[lowered.str.contains('could you please repost with a submission statement', regex=False)].index)
# Raw string so that \d is passed to the regex engine unchanged.
drop_cols += list(future_df[future_df['body'].str.contains(r'Rule \d')].index)
drop_cols = set(drop_cols)

Identifies the index of any row that has a comment that should be removed, according to the list above.

In [34]:
future_df.drop(drop_cols, inplace=True)
# drop=True avoids keeping the old row labels as an extra 'index' column.
future_df.reset_index(drop=True, inplace=True)

Removes the indicated rows from the dataframe.
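
The screening rules listed above can also be expressed as a single boolean mask, which avoids collecting index labels by hand. The following is a sketch under that approach, run on a few made-up comments; `boilerplate_mask` and the constant names are hypothetical and do not appear elsewhere in this project.

```python
import pandas as pd

# Comments that are dropped only when the whole body matches exactly.
EXACT_MATCHES = {"[removed]", "[deleted]", "Removed - Duplicate submission"}

# Phrases that mark a comment as bot or moderator text (compared in lowercase).
PHRASES = [
    "this is a bot. replies will not receive responses",
    "i am a bot, and this action was performed automatically",
    "could you please repost with a submission statement",
]

# The word "Rule" followed by a digit, as in moderator removal notices.
RULE_PATTERN = r"Rule \d"


def boilerplate_mask(bodies):
    """Return a boolean Series that is True for comments that should be dropped."""
    mask = bodies.isin(EXACT_MATCHES)
    lowered = bodies.str.lower()
    for phrase in PHRASES:
        # regex=False treats the phrase as literal text, so '.' is not a wildcard.
        mask = mask | lowered.str.contains(phrase, regex=False)
    mask = mask | bodies.str.contains(RULE_PATTERN, regex=True)
    return mask


# Hypothetical comments with a non-default index, to show that the mask
# does not depend on row positions matching row labels.
demo = pd.DataFrame({"body": [
    "[removed]",
    "Fusion is always thirty years away.",
    "I am a bot, and this action was performed automatically.",
    "Removed per Rule 2.",
    "[deleted]",
]}, index=[10, 11, 12, 13, 14])

kept = demo[~boilerplate_mask(demo["body"])].reset_index(drop=True)
print(kept)
```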

In [35]:
science_df['body'].value_counts()[science_df['body'].value_counts() > 12].sort_values(ascending=False)

[removed]    170268
[deleted]

Checking for repeated comments in the Science subreddit, I see a few things that should be screened for:

1. [removed]
2. [deleted]
3. Any comment that includes the text "I am a bot, and this action was performed automatically".
4. Any comment that includes "[Submission Rule".

In [36]:
# Collect the index labels of every row (not column) that matches the list above.
lowered = science_df['body'].str.lower()
drop_cols = list(science_df[science_df['body'] == '[removed]'].index)
drop_cols += list(science_df[science_df['body'] == '[deleted]'].index)
# regex=False matches the phrase literally, so '[' is not treated as regex syntax.
drop_cols += list(science_df[lowered.str.contains('i am a bot, and this action was performed automatically', regex=False)].index)
drop_cols += list(science_df[lowered.str.contains('[submission rule', regex=False)].index)
drop_cols = set(drop_cols)

Identifies the index of any row that has a comment that should be removed, according to the list above.

In [37]:
science_df.drop(drop_cols, inplace=True)
# drop=True avoids keeping the old row labels as an extra 'index' column.
science_df.reset_index(drop=True, inplace=True)

Removes the indicated rows from the dataframe.

In [38]:
len(future_df) / future_startlen

0.845343917141315

In [39]:
len(science_df) / science_startlen

0.6441519226232602

After removing the above comments, 84.5% of the Futurology comments remain, and 64.4% of the Science comments remain.  This is not very surprising because the Science subreddit describes itself as heavily moderated.

In [40]:
sub_range = 25_000

*sub_range* controls the number of comments from each subreddit that will be included in the model.  This is included because the number of comments I have available, even after cleaning, is too large to be handled efficiently by my current hardware and likely overkill for addressing my problem statement.  After some experimentation, I have settled on 25,000 based on the time it takes my computer to process these models.

In [41]:
future_df['month'][:sub_range].value_counts()

10    25000
Name: month, dtype: int64

In [42]:
future_df['day'][:sub_range].value_counts()

14    4080
25    2977
26    2838
18    2244
19    2046
15    1715
23    1508
24    1327
17    1308
22    1296
16    1138
21    1046
20    1026
13     451
Name: day, dtype: int64

In [43]:
science_df['month'][:sub_range].value_counts()

10    25000
Name: month, dtype: int64

In [44]:
science_df['day'][:sub_range].value_counts()

25    6443
21    5096
23    3671
22    3580
24    3299
26    2870
20      41
Name: day, dtype: int64

The above cells show that all comments used in this analysis are from a narrow time range.  The comments from the Futurology subreddit cover about two weeks, and the comments from the science subreddit cover about one week.  The fact that they come from a similar time frame means that they should be a good representation of how the Science and Futurology discourse differed in that timeframe.

In [45]:
corpus = pd.concat([future_df[:sub_range], science_df[:sub_range]])
# drop=True keeps the old row labels out of the saved file.
corpus.reset_index(drop=True, inplace=True)

In [46]:
corpus.to_csv('../data/corpus.csv', index=False)

To reduce memory use, only the features and rows that will be used from here on are saved in the corpus.
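
Before saving, it can be worth confirming that the combined corpus is balanced and contains only the intended columns. The sketch below does this on two small made-up frames; `build_corpus` is a hypothetical helper, and the subreddit labels are illustrative values rather than ones confirmed from the data.

```python
import pandas as pd


def build_corpus(frames, n_per_source):
    """Take the first n_per_source rows of each frame and stack them."""
    subsets = [frame.head(n_per_source) for frame in frames]
    # ignore_index=True gives the result a clean 0..n-1 index.
    return pd.concat(subsets, ignore_index=True)


# Hypothetical cleaned frames standing in for future_df and science_df.
future_demo = pd.DataFrame({"subreddit": ["Futurology"] * 5, "body": list("abcde")})
science_demo = pd.DataFrame({"subreddit": ["science"] * 4, "body": list("wxyz")})

corpus_demo = build_corpus([future_demo, science_demo], n_per_source=3)

# Each source should contribute the same number of rows.
counts = corpus_demo["subreddit"].value_counts()
print(counts)
print(list(corpus_demo.columns))
```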

The data has now been cleaned, with null values and irrelevant comments removed.  I have selected only the rows and features that I expect to use and saved them in the file corpus.csv.