# Reddit Dataset - Extraction

This notebook extracts the raw data produced by the scripts in `/scripts/`. Because the data is raw, it also includes some EDA to make sure we end up with a curated, fairly balanced dataset.

In [1]:
import pandas as pd

Function to remove links and replace newline and tab characters with spaces.

Since punctuation, whitespace, and capitalization remain relevant, I am leaving those features in for now.

In [2]:
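import re

def preprocess_text(text):
    # Coerce to str so non-string cells (e.g. NaN) don't break re.sub
    text = str(text)
    # Remove URLs (everything from 'http' up to the next whitespace)
    text = re.sub(r'http\S+', '', text)
    # Replace newlines and tabs with spaces
    text = text.replace('\n', ' ')
    text = text.replace('\t', ' ')
    return text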
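
A quick sanity check of `preprocess_text` on a few hand-made inputs (the sample strings and the `example.com` URL are made up for illustration). It also shows two side effects worth knowing about: the `http\S+` pattern swallows the closing parenthesis of a markdown link, and a missing value becomes the literal string `'nan'`.

```python
import re

def preprocess_text(text):
    text = str(text)
    text = re.sub(r'http\S+', '', text)
    text = text.replace('\n', ' ')
    text = text.replace('\t', ' ')
    return text

# Newlines become spaces, so a blank line turns into two spaces
newline_example = preprocess_text("Nice username\n\nI too am from Finland")

# The URL and the trailing ')' are both removed, leaving a dangling '('
link_example = preprocess_text("[other post](https://example.com/a) lol")

# A float NaN (pandas' marker for a missing cell) is stringified, not dropped
nan_example = preprocess_text(float('nan'))

print(repr(newline_example))
print(repr(link_example))
print(repr(nan_example))
```

If the dangling `(` or the `'nan'` strings turn out to matter, they could be handled in a later cleaning pass; this sketch only documents the current behavior.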

The next function filters out moderator messages; I explain why further down.

In [3]:
def remove_mod_messages(texts):
    return [text for text in texts if 'mod' not in text.lower().split() and 'moderator' not in text.lower().split()]

In [4]:
filepath = '../../reddit_data/AskTeenBoys_comments.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0.1,Unnamed: 0,comments
0,0,Fax.
1,1,What happened to my comment....it was soo good...
2,2,"A shit ton of censorship. And I don't mean ""de..."
3,3,Wasn't aware of the drama between /r/askmen an...
4,4,Nice username\n\nI too am from Finland
...,...,...
14896,14896,Not gay if youre a girl
14897,14897,Masturbate and play with my temporary boobs
14898,14898,And rape him first
14899,14899,Prostitution.


In [5]:
teens_1 = df['comments'].apply(preprocess_text)

In [6]:
teens_1[90]

"#MOD MESSAGE    If anyone can help us get in touch with reddit admins, that would be great. To help us counter this situation.           If you see any pedos around here or on r/AskTeenGirls, report them to us AND directly to Reddit.          Edit: everyone who's confused, I'm a mod from ATG."

It looks like we randomly stumbled onto a moderator message in a teen subreddit, and its text would get assigned the 'teen' label. To me it reads as if an older person wrote it, but I could be wrong.

I can check for the word 'mod' and drop matching comments, but that removes every comment that mentions a mod, not just the moderators' own messages.

Either way, this is a good example of noise in a dataset. Hence the function (also repeated below):

In [7]:
def remove_mod_messages(texts):
    return [text for text in texts if 'mod' not in text.lower().split() and 'moderator' not in text.lower().split()]
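
To make the trade-off concrete, here is a small sketch of what the whitespace-token check does and does not catch. The sample comments are invented, apart from the shortened mod message.

```python
def remove_mod_messages(texts):
    return [text for text in texts
            if 'mod' not in text.lower().split()
            and 'moderator' not in text.lower().split()]

samples = [
    "#MOD MESSAGE  Edit: everyone who's confused, I'm a mod from ATG.",
    "I think a mod removed my post yesterday",  # ordinary comment, still dropped
    "This is a modern take on it",              # kept: 'modern' is not the token 'mod'
    "Ask the mod.",                             # kept: the token is 'mod.' with punctuation
]

kept = remove_mod_messages(samples)
for text in kept:
    print(text)
```

So the filter has both false positives (any comment that mentions a mod) and false negatives (tokens with attached punctuation such as `mod.` or `#mod`). For this dataset that seems like an acceptable amount of collateral loss.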

In [8]:
teens_1 = remove_mod_messages(teens_1)

In [9]:
teens_1[0:5]

['Fax.',
 'What happened to my comment....it was soo good....  so ima just repeat myself  U callin yer selfs ***D O G S*** eh?',
 'A shit ton of censorship. And I don\'t mean "dem libtards takin muh free spheech" kind of censorship. Basically if you disagree with a commenter, and reply stating a counter, your comments will be removed. Or even factual fucking errors are counted as "invalidating".  Extreme example:  "2+3 is 7 and you just got to accept that"  "It\'s actually 5 but i get what you\'re saying"  "Your comment was removed for Invalidation.',
 "Wasn't aware of the drama between /r/askmen and /r/AskWomen, and yet hearing about it now is completely unsurprising.",
 'Nice username  I too am from Finland']

In [10]:
teens_1[90]

'#Come again nigga?'

Looks like the data point we saw before is now gone.

In [11]:
filepath = '../../reddit_data/AskTeenGirls_comments.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0.1,Unnamed: 0,comments
0,0,you're damn right i gave you a kiss
1,1,how I missed you my comment
2,2,So are we married now? jk jk... unless😳
3,3,what if 😳😳😳
4,4,***I missed you too baby***
...,...,...
21296,21296,1) Legend\n\n2) In Italy we don’t have dress c...
21297,21297,i left during lunch w a few friends and caught...
21298,21298,I used to skip one of my classes. Then I’d lie...
21299,21299,Yea but I didn't want to write an essay


In [12]:
teens_2 = [preprocess_text(item) for item in list(df['comments'])]

In [13]:
teens_2 = remove_mod_messages(teens_2)

In [14]:
teens_2[0:5]

["you're damn right i gave you a kiss",
 'how I missed you my comment',
 'So are we married now? jk jk... unless😳',
 'what if 😳😳😳',
 '***I missed you too baby***']

In [15]:
teens_1.extend(teens_2)  # extend() mutates teens_1 in place and returns None
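
A note on `list.extend`: it modifies the list in place and returns `None`, so the combined data lives in `teens_1` rather than in whatever the call's result is assigned to. A minimal illustration with throwaway lists:

```python
a = ['x', 'y']
b = ['z']

result = a.extend(b)   # mutates a; the return value is None
print(result)
print(a)

# Concatenation with + builds a new list and leaves the inputs untouched
c = ['x', 'y']
combined = c + b
print(combined, c)
```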

In [16]:
len(teens_1)

36119

This is now our teen data.

In [17]:
teens = teens_1

In [18]:
len(teens)

36119

In [19]:
filepath = '../../reddit_data/college_comments.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0.1,Unnamed: 0,comments
0,0,I honestly teared up a bit reading this letter...
1,1,Yup. There's almost nothing better than a prof...
2,2,"My mom died Nov 11, just a few weeks ago. I am..."
3,3,That’s so touching that your professor cares s...
4,4,I fucking can’t. My mom died two years ago fro...
...,...,...
14696,14696,yes! with the past year being work from home a...
14697,14697,RIP for whichever lab hires me after grad lol
14698,14698,"Easier, but so boring"
14699,14699,"I love online classes too, it’s easier to lear..."


In [20]:
college = [preprocess_text(item) for item in list(df['comments'])]
college = remove_mod_messages(college)
len(college)

14700

In [21]:
filepath = '../../reddit_data/LifeAfterSchool_comments.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0.1,Unnamed: 0,comments
0,0,the agricultural revolution was a mistake
1,1,Why do people get degrees like engineering and...
2,2,Instead of staring at the one 5 steps from my ...
3,3,I came to this subreddit for advice and direct...
4,4,"Exactly, sitting 8 hours behind a desk 5 days ..."
...,...,...
10596,10596,thanks so much! it’s sad but your right about ...
10597,10597,Abbreviations mean Child Protective Services a...
10598,10598,"As the other commenter said, I'd consider grad..."
10599,10599,thanks for the tips!


In [22]:
life_after_school = [preprocess_text(item) for item in list(df['comments'])]
life_after_school = remove_mod_messages(life_after_school)
len(life_after_school)

10600

In [23]:
college.extend(life_after_school)

In [24]:
len(college)

25300

This is now our twenties data.

In [25]:
twenties = college

I hypothesize that these subreddits naturally differ in comment length. Let's look further.

The function below counts the words in each comment; we use it to get the average word count per comment for each data group.

In [26]:
def word_counts(text_list):
    return [len(text.split()) for text in text_list]
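
As a small worked example of `word_counts` on three short comments taken from the previews above. The median is included as an assumption on my part, not something the notebook computes: it is less sensitive than the mean to a few very long comments.

```python
from statistics import mean, median

def word_counts(text_list):
    return [len(text.split()) for text in text_list]

sample = ["Fax.", "Nice username  I too am from Finland", "what if 😳😳😳"]
counts = word_counts(sample)

print(counts)
print(mean(counts), median(counts))
```

Note that `split()` with no arguments collapses runs of whitespace, so the double space left behind by `preprocess_text` does not inflate the count.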

In [27]:
sum(word_counts(twenties)) / len(word_counts(twenties))

45.894268774703555

In [28]:
sum(word_counts(teens)) / len(word_counts(teens))

19.068578864309643

These are drastically different. I don't want text length to stand in for age: it could be subreddit-specific, and it is an easy differentiator that gives away where the data came from. Let's remove the teen comments whose word count falls below the first quartile.

This should hopefully make the dataset more difficult and interesting to work with.

In [29]:
import numpy as np
teens_words = [text.split() for text in teens]
teens_word_counts = [len(words) for words in teens_words]
q1 = np.percentile(teens_word_counts, 25)
teens_long_comments = [text for i, text in enumerate(teens) if teens_word_counts[i] >= q1]


In [30]:
sum(word_counts(teens_long_comments)) / len(word_counts(teens_long_comments))

23.6631641623761

In [31]:
len(teens_long_comments)

28551

Now we're up to about 23.7 words per comment, from about 19.1.

Let's also remove text at or above the third quartile in the `twenties` data to bring the word counts closer together.

In [32]:
import numpy as np
twenties_words = [text.split() for text in twenties]
twenties_word_counts = [len(words) for words in twenties_words]
q3 = np.percentile(twenties_word_counts, 75)
twenties_short_comments = [text for i, text in enumerate(twenties) if twenties_word_counts[i] < q3]
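
The two quartile filters follow the same pattern, so they could be folded into one helper. The sketch below is hypothetical (`trim_by_word_count` and its `drop` parameter are not part of the notebook) and uses the standard library's `statistics.quantiles` with `method='inclusive'`, which, as far as I know, uses the same linear interpolation as `np.percentile`'s default.

```python
from statistics import quantiles

def trim_by_word_count(texts, drop="short"):
    """Drop comments below Q1 (drop='short') or at/above Q3 (drop='long')."""
    counts = [len(text.split()) for text in texts]
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    if drop == "short":
        return [t for t, c in zip(texts, counts) if c >= q1]
    return [t for t, c in zip(texts, counts) if c < q3]

# Toy comments with 1 to 5 words each: Q1 is 2 words, Q3 is 4 words
toy = ["a", "a b", "a b c", "a b c d", "a b c d e"]
without_short = trim_by_word_count(toy, drop="short")
without_long = trim_by_word_count(toy, drop="long")
print(without_short)
print(without_long)
```

The comparisons mirror the cells above: `>=` keeps comments exactly at Q1, while the strict `<` drops comments exactly at Q3.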

In [33]:
sum(word_counts(twenties_short_comments)) / len(word_counts(twenties_short_comments))

20.002953898090517

Looks like the longest comments were pulling the average up considerably. Let's check how much data we have left.

In [34]:
len(twenties_short_comments)

18958

Looks like there is a pretty big difference in data size between the groups. I'll leave it for now and stratify when I make my test set. I might come back and change it.
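
For reference, a stratified split keeps each label's share the same in the train and test sets. Below is a minimal standard-library sketch under stated assumptions: `stratified_split`, `test_fraction`, and `seed` are illustrative names, the toy data is invented, and the `'te'`/`'tw'` labels are the ones assigned later in this notebook. A library routine (for example scikit-learn's `train_test_split` with its `stratify` argument) would be the more usual choice.

```python
import random
from collections import defaultdict

def stratified_split(pairs, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps its share in both parts."""
    by_label = defaultdict(list)
    for text, label in pairs:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Imbalanced toy data: 30 'te' comments and 20 'tw' comments
toy = [(f"teen {i}", "te") for i in range(30)] + \
      [(f"twenties {i}", "tw") for i in range(20)]
train, test = stratified_split(toy)
print(len(train), len(test))
print(sum(1 for _, label in test if label == "te"), "teen comments in test")
```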

Moving on to the older age group.

In [35]:
filepath = '../../reddit_data/AskWomenOver30_comments.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0.1,Unnamed: 0,comments
0,0,Definitely the best side of reddit in that pos...
1,1,Yessss this is the glow-up we want to see!
2,2,I remember reading your original post and bein...
3,3,I find it so satisfying when they come groveli...
4,4,Very happy for you! Funny that the greener pas...
...,...,...
14796,14796,Yes. Thank you for this response. I don’t view...
14797,14797,Better hope that you're contacted before someo...
14798,14798,Thank you for this question. I also find mysel...
14799,14799,Yes. Being able to consider working part-time ...


In [36]:
women_over_thirty = [preprocess_text(item) for item in list(df['comments'])]
women_over_thirty = remove_mod_messages(women_over_thirty)
len(women_over_thirty)

14794

In [37]:
filepath = '../../reddit_data/AskMenOver30_comments.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0.1,Unnamed: 0,comments
0,0,"It should never be ""You vs. Me"" but rather ""Us..."
1,1,To build off this...my Father mentioned once.....
2,2,If you’re arguing to win you’ve already lost.
3,3,"Just recently I heard “if he loves you, you wi..."
4,4,"""Hey, can you please shut the fuck up?""\n\nI'm..."
...,...,...
14196,14196,Washing your balls primarily
14197,14197,Yeah. I was born in 84 and really saw the adve...
14198,14198,"I was 31, I was massively overweight, smoked 2..."
14199,14199,"I think the decline started in my early 40s, b..."


In [38]:
men_over_thirty = [preprocess_text(item) for item in list(df['comments'])]
men_over_thirty = remove_mod_messages(men_over_thirty)
len(men_over_thirty)

14196

In [39]:
men_over_thirty.extend(women_over_thirty)
thirties = men_over_thirty

Now let's check the average word count per comment.

In [40]:
sum(word_counts(thirties)) / len(word_counts(thirties))

63.30131079682649

Wow, super high. Definitely some wise people here.

How much can we normalize?

In [41]:
import numpy as np
thirties_words = [text.split() for text in thirties]
thirties_word_counts = [len(words) for words in thirties_words]
q3 = np.percentile(thirties_word_counts, 75)
thirties_short_comments = [text for i, text in enumerate(thirties) if thirties_word_counts[i] < q3]

In [42]:
sum(word_counts(thirties_short_comments)) / len(word_counts(thirties_short_comments))

29.529132040627886

In [43]:
len(thirties_short_comments)

21660

Since we still have more data here than in the twenties group, let's also drop the comments at or above the 90th percentile of what remains, to bring the average word count closer to the other groups.

In [45]:
import numpy as np
thirties_words = [text.split() for text in thirties_short_comments]
thirties_word_counts = [len(words) for words in thirties_words]
p90 = np.percentile(thirties_word_counts, 90)
thirties_short_comments = [text for i, text in enumerate(thirties_short_comments) if thirties_word_counts[i] < p90]

In [46]:
sum(word_counts(thirties_short_comments)) / len(word_counts(thirties_short_comments))

24.883458260601703

In [47]:
len(thirties_short_comments)

19478
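
A quick arithmetic check of how much of the thirties data survives the two cuts, using only the lengths printed above (14196 + 14794 comments before trimming):

```python
total = 14196 + 14794   # men + women over 30, after the mod filter
after_q3 = 21660        # after dropping comments at or above Q3
after_p90 = 19478       # after also dropping the top 10% of the remainder

q3_share = after_q3 / total
p90_share = after_p90 / after_q3
overall_share = after_p90 / total

print(f"kept by Q3 cut:  {q3_share:.1%}")
print(f"kept by P90 cut: {p90_share:.1%}")
print(f"kept overall:    {overall_share:.1%}")
```

The shares land slightly under 75% and 90%, presumably because the strict `<` also drops every comment whose length ties with the threshold.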

Okay, much better. Putting it all together.

In [48]:
teen_labels = ['te'] * len(teens_long_comments)
twenties_labels = ['tw'] * len(twenties_short_comments)
thirties_labels = ['th'] * len(thirties_short_comments)

In [49]:
data = list(zip(teens_long_comments + twenties_short_comments + thirties_short_comments, teen_labels + twenties_labels + thirties_labels))
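
To see how balanced the three classes are at this point, here are the label shares computed from the list lengths reported above. These counts are taken before the `[deleted]` rows are dropped in the next step, so the final shares will differ slightly.

```python
counts = {'te': 28551, 'tw': 18958, 'th': 19478}
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {counts[label]:>6} comments ({share:.1%})")
```

The teen group is still the largest by a fair margin, which is the imbalance that stratifying the test set is meant to account for.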

In [50]:
data[-1]

("Yes. Being able to consider working part-time because I'll be on my partner's health insurance is super exciting for me. If I were to go part-time right now my health insurance costs would be astronomical.   Also, I absolutely want to be able to have my partner make medical decisions for me if something happens. He knows me best and knows what I want.",
 'th')

In [51]:
# Create a pandas DataFrame from the list of tuples
df = pd.DataFrame(data, columns=['text', 'age'])
df = df.loc[df['text'] != '[deleted]'].reset_index(drop=True)
df

Unnamed: 0,text,age
0,What happened to my comment....it was soo good...,te
1,"A shit ton of censorship. And I don't mean ""de...",te
2,Wasn't aware of the drama between /r/askmen an...,te
3,Nice username I too am from Finland,te
4,Your comment was on the [other post]( lol,te
...,...,...
64861,"And, after 10 years of marriage, you can get 5...",th
64862,Yes. Thank you for this response. I don’t view...,th
64863,Better hope that you're contacted before someo...,th
64864,Thank you for this question. I also find mysel...,th
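
The `[deleted]` filter above only matches that exact string. A slightly broader variant is sketched below as an assumption rather than something this notebook does: Reddit also shows `[removed]` for comments taken down by moderators, and surrounding whitespace would defeat an exact match. `PLACEHOLDERS` and `drop_placeholders` are hypothetical names.

```python
# '[removed]' is an addition of this sketch; the notebook filters only '[deleted]'
PLACEHOLDERS = {'[deleted]', '[removed]'}

def drop_placeholders(pairs):
    """Remove (text, label) pairs whose text is just a Reddit placeholder."""
    return [(text, label) for text, label in pairs
            if text.strip() not in PLACEHOLDERS]

toy = [
    ("[deleted]", "te"),
    ("  [removed] ", "tw"),          # padded placeholder, caught via strip()
    ("thanks for the tips!", "tw"),
]
cleaned = drop_placeholders(toy)
print(cleaned)
```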


In [52]:
# save df as a pickle file
df.to_pickle('../../data_samples/reddit_samples/all.pkl')

# load df from pickle file to make sure it works
loaded_df = pd.read_pickle('../../data_samples/reddit_samples/all.pkl')
loaded_df

Unnamed: 0,text,age
0,What happened to my comment....it was soo good...,te
1,"A shit ton of censorship. And I don't mean ""de...",te
2,Wasn't aware of the drama between /r/askmen an...,te
3,Nice username I too am from Finland,te
4,Your comment was on the [other post]( lol,te
...,...,...
64861,"And, after 10 years of marriage, you can get 5...",th
64862,Yes. Thank you for this response. I don’t view...,th
64863,Better hope that you're contacted before someo...,th
64864,Thank you for this question. I also find mysel...,th


The pickle round-trips correctly, so this is the mostly final form of the data.