In [1]:
# Import Libraries
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
pd.options.display.max_columns = 999

In [2]:
# Read in the two raw datasets
zone = pd.read_csv('../data/raw_data/twilight_zone_raw')
comics = pd.read_csv('../data/raw_data/comicbooks_raw')

In [3]:
# Drop the 'Unnamed: 0' column from each dataset and use the default pandas index instead
zone.drop(columns='Unnamed: 0', inplace=True)
comics.drop(columns='Unnamed: 0', inplace=True)
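
The 'Unnamed: 0' column dropped above is most likely the index that was written out when the raw data was first saved. A minimal sketch of that round trip, using a toy frame and an in-memory buffer rather than the real files (both are assumptions for illustration):

```python
import io
import pandas as pd

# Toy frame standing in for a scraped subreddit dump (illustrative data only).
df = pd.DataFrame({"title": ["first post", "second post"]})

# Writing with the default index=True stores the index as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))

# Dropping that column restores the original layout.
restored = reloaded.drop(columns="Unnamed: 0")
print(list(restored.columns))
```

Passing `index=False` to `to_csv` would avoid the extra column, but any later notebook that expects to drop 'Unnamed: 0' would then need to change as well.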

In [4]:
zone.shape

(1235, 8)

In [5]:
comics.shape

(1797, 8)

In [6]:
# How many of these columns are useful for modeling?
zone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1235 entries, 0 to 1234
Data columns (total 8 columns):
selftext        1116 non-null object
author          1235 non-null object
title           1235 non-null object
created_utc     1235 non-null int64
num_comments    1235 non-null int64
is_self         1235 non-null bool
subreddit       1235 non-null object
timestamp       1235 non-null object
dtypes: bool(1), int64(2), object(5)
memory usage: 68.9+ KB


In [7]:
# 'is_self' is True for every row, so it will not be a helpful feature for classification
zone['is_self'].value_counts()

True    1235
Name: is_self, dtype: int64

In [8]:
# 'is_self' is True for every row here as well, so it will not be a helpful feature for classification
comics['is_self'].value_counts()

True    1797
Name: is_self, dtype: int64

In [9]:
# Dropping the 'is_self' columns from each dataset
zone.drop(columns='is_self', inplace=True)
comics.drop(columns='is_self', inplace=True)
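
The same check can be run over every column at once instead of one `value_counts()` call per column. A small sketch, assuming a toy frame (the data and the `constant_cols` name are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy data (illustrative only): 'is_self' never varies, 'num_comments' does.
posts = pd.DataFrame({
    "is_self": [True, True, True],
    "num_comments": [4, 0, 12],
})

# A column with a single distinct value cannot help separate the classes.
constant_cols = [col for col in posts.columns if posts[col].nunique(dropna=False) == 1]
print(constant_cols)

posts = posts.drop(columns=constant_cols)
```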

In [10]:
# Dropping the 'created_utc' column from both datasets
zone.drop(columns='created_utc', inplace=True)
comics.drop(columns='created_utc', inplace=True)

In [11]:
# Convert the text columns to strings, filling missing values with '' first
# so that a missing post body does not turn into the literal token 'nan'
for col in ['selftext', 'author', 'title', 'subreddit']:
    zone[col] = zone[col].fillna('').astype(str)

In [12]:
for col in ['selftext', 'author', 'title', 'subreddit']:
    comics[col] = comics[col].fillna('').astype(str)
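
Missing values deserve some care in this step, since `zone.info()` shows only 1116 non-null 'selftext' values out of 1235. A minimal sketch of how a missing post body behaves under `astype(str)`, using a toy Series rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy column (illustrative only) with one missing post body.
selftext = pd.Series(["some post body", np.nan], dtype=object)

# Filling first guarantees an empty string rather than a placeholder token.
filled = selftext.fillna("").astype(str)
print(filled.tolist())

# Without the fill, the result for the missing row depends on the pandas version;
# in many versions it is the literal string 'nan', which would later be
# tokenized like any other word.
unfilled = selftext.astype(str)
print(unfilled.tolist())
```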

In [13]:
# Removing all columns that are not text-based
zone.drop(['num_comments','timestamp'], axis=1, inplace=True)

In [14]:
comics.drop(['num_comments','timestamp'], axis=1, inplace=True)

In [15]:
# Checking for null values in each dataframe 
comics.isna().sum()

selftext     0
author       0
title        0
subreddit    0
dtype: int64

In [16]:
zone.isna().sum()

selftext     0
author       0
title        0
subreddit    0
dtype: int64

In [17]:
comics.head(2)

Unnamed: 0,selftext,author,title,subreddit
0,"I feel like comic colorists don't get enough love, so let me know who your favorite colorists are. I'm looking for inspiration!",avidya1997,Who is your favorite comicbook colorist?,comicbooks
1,"What I mean is, outside those covers we know are instantly iconic (an example would be like Frank Miller’s Dark Knight Returns covers).\n\nMy personal favorites are both the standard and variant covers for Judas by Jeff Loveness. They’re stunning and invoke such a distinct emotion in regards to the story he’s telling.\n\nI’m trying to come up with ideas for a cover for the comic I’m writing and I’m hitting a wall creatively.\n\nThanks!",writingsupplies,What Are Your Favorite Covers (that aren’t iconic)?,comicbooks


In [18]:
zone.head(2)

Unnamed: 0,selftext,author,title,subreddit
0,"I'm a big fan of the Twilight Zone, I have the complete series on DVD I've seen every episode.\n\nRecently, someone told me about an episode of the Alfred Hitchcock Hour called The Jar, which they thought was an episode of the Twilight Zone.\n\nI watched it tonight, it's the first and only episode of the show I've seen. I somewhat enjoyed it, although it seemed a little ""darker"" than the Twilight Zone, and less ""sci-fi"" focused. I'm not sure if I should try to watch more episodes or not. Maybe I just chose the wrong episode to start with.\n\nHow does the show compare to the Twilight Zone overall? Are there good episodes that any Twilight Zone fan would enjoy?",GAMESHARQ,"Twilight Zone fans, what's your opinion on the Alfred Hitchcock Hour?",TwilightZone
1,"********SPOILERS AHEAD********\n\n\nJust wanted to warn anyone before reading all this, that there are spoilers in here for those who haven't see this episode yet. \n\n\nNow, on with the show!!\n\n\nOK, so let's first talk about the number 14. The number 14 is mentioned a total of 3 times throughout this episode. First, when the driver says ""Mister, that's a 14 year old bus, and business is lousy"". Not only that, but before he says this, he says ""What do you think I got parked out there, a 707?"" What is 7 + 0 + 7? **14**!!\n\nThe second time 14 is mentioned, the alien behind the counter says ""Nothin's come in here for 14 hours"". \n\nThe third mention of the number 14 is when the bus riders are checking out at the diner counter. The alien behind the counter says that the first alien had ""14 cups of coffee"". Damn, that's a lot of coffee...\n\n\nOK, so how many times is the number 14 mentioned? 3 times. How many arms did the first alien have? 3. How many eyes did the other alien have? 3. \n\nKeep that in mind - \n\nNow let's talk about the title of this episode - ""Will the real martian please stand up?"". Well, actually, let's hold off on that one, for just a second. \n\nLet's talk about the officers. How many were there? 2. How many aliens were there? 2. Now, let's multiply 14 by 2 to get 28. Now, let's add 3 to that number (3 for how many arms/eyes the aliens had, and how many times 14 was mentioned), and what do you get? 31. \n\nNow, how many letters are in the title of this episode? **31!!** \n\nPhew... man, I'm tired from all that math... \n\n\nBut wait, there's more! \n\nHow many times was 14 mentioned? 3. How many arms did the first alien have? 3. How many eyes did the second alien have? 3. Add those together, and you get 9. What time did the first alien say his meeting was in Boston? **9AM!!!** \n\nOkay, okay. 
I'm reaching with some of these, but I sat there last night re-watching this episode, and I couldn't get all these damn numbers out of my head. \n\nIf you made it this far, I commend you. I doubt any of this means anything, but I found it interesting, nonetheless. \n\nCheers!",Pockets_The_Paladin,"""Will the real martian please stand up?"" is one giant numbers game...",TwilightZone


### Lemmatizing Text Columns
The text columns are tokenized and lemmatized here; they are later joined into a new 'combined_text' column in the combined dataframe, comics_zone. The two dataframes are kept separate during preprocessing to avoid confusion before concatenation.

In [19]:
# setting up tokenizer and lemmatizer
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()
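
A rough illustration of what the `\w+` pattern does to contractions. This sketch uses the standard-library `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that the two behave the same way for this pattern; the `tokenize` helper is hypothetical:

```python
import re

WORD = re.compile(r"\w+")

def tokenize(text):
    # Stand-in for the notebook's tokenizer: keep runs of word characters only.
    return WORD.findall(text)

# Both straight and curly apostrophes split a contraction into two tokens.
print(tokenize("I'm sure they don't"))
print(tokenize("They’re stunning"))
```

The stray fragments ('m', 't', 're') are what the contraction clean-up later in this section removes.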

In [20]:
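# function to lemmatize (with help from Kate Dowdy)
def lemma(text):
    # split into word tokens, dropping punctuation
    tokens = tokenizer.tokenize(str(text))
    # lemmatize() defaults to pos='n', so every token is treated as a noun
    # (this is why 'was' shows up as 'wa' in the output below)
    lems = [lemmatizer.lemmatize(i) for i in tokens]

    return " ".join(lems)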

In [21]:
# create a column for lemmatized words
zone['lems'] = zone['selftext'].apply(lemma)

In [22]:
zone['title_lems'] = zone['title'].apply(lemma)

In [23]:
# create a column for lemmatized words
comics['lems'] = comics['selftext'].apply(lemma)

In [24]:
comics['title_lems'] = comics['title'].apply(lemma)

In [25]:
# function to remove hanging contraction leftovers (given to me by Kate Dowdy)
def nocontract(x):
    # e.g. "don't" is tokenized to "don t"; drop the stray fragment between spaces
    for frag in ['re', 've', 'll', 'd', 't', 'm', 's']:
        x = re.sub(f" {frag} ", " ", x)
    return x
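
Note that `nocontract` only removes a fragment that has a space on both sides, so a leftover at the very end of a string (for example "I couldn t") would survive. A sketch of one possible variant that also covers that case; the `nocontract_anywhere` name and the pattern are hypothetical, not the notebook's method:

```python
import re

# Match a space plus a fragment, but only when followed by a space or the end
# of the string. The lookahead leaves the following space in place.
LEFTOVER = re.compile(r" (?:re|ve|ll|d|t|m|s)(?= |$)")

def nocontract_anywhere(text):
    """Drop contraction fragments that follow a space, mid-string or at the end."""
    return LEFTOVER.sub("", text)

print(nocontract_anywhere("I m sure they re here"))  # fragments mid-string
print(nocontract_anywhere("I couldn t"))             # fragment at the end
```

Like the original, this also removes a genuine standalone word such as "d" or "s", so it is only appropriate after tokenization has already split the contractions.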

In [26]:
# applying no contractions function to both dataframes
zone['lems'] = zone['lems'].apply(nocontract)
zone['title_lems'] = zone['title_lems'].apply(nocontract)
comics['lems'] = comics['lems'].apply(nocontract)
comics['title_lems'] = comics['title_lems'].apply(nocontract)

In [27]:
zone.head(2)

Unnamed: 0,selftext,author,title,subreddit,lems,title_lems
0,"I'm a big fan of the Twilight Zone, I have the complete series on DVD I've seen every episode.\n\nRecently, someone told me about an episode of the Alfred Hitchcock Hour called The Jar, which they thought was an episode of the Twilight Zone.\n\nI watched it tonight, it's the first and only episode of the show I've seen. I somewhat enjoyed it, although it seemed a little ""darker"" than the Twilight Zone, and less ""sci-fi"" focused. I'm not sure if I should try to watch more episodes or not. Maybe I just chose the wrong episode to start with.\n\nHow does the show compare to the Twilight Zone overall? Are there good episodes that any Twilight Zone fan would enjoy?",GAMESHARQ,"Twilight Zone fans, what's your opinion on the Alfred Hitchcock Hour?",TwilightZone,I a big fan of the Twilight Zone I have the complete series on DVD I seen every episode Recently someone told me about an episode of the Alfred Hitchcock Hour called The Jar which they thought wa an episode of the Twilight Zone I watched it tonight it the first and only episode of the show I seen I somewhat enjoyed it although it seemed a little darker than the Twilight Zone and le sci fi focused I not sure if I should try to watch more episode or not Maybe I just chose the wrong episode to start with How doe the show compare to the Twilight Zone overall Are there good episode that any Twilight Zone fan would enjoy,Twilight Zone fan what your opinion on the Alfred Hitchcock Hour
1,"********SPOILERS AHEAD********\n\n\nJust wanted to warn anyone before reading all this, that there are spoilers in here for those who haven't see this episode yet. \n\n\nNow, on with the show!!\n\n\nOK, so let's first talk about the number 14. The number 14 is mentioned a total of 3 times throughout this episode. First, when the driver says ""Mister, that's a 14 year old bus, and business is lousy"". Not only that, but before he says this, he says ""What do you think I got parked out there, a 707?"" What is 7 + 0 + 7? **14**!!\n\nThe second time 14 is mentioned, the alien behind the counter says ""Nothin's come in here for 14 hours"". \n\nThe third mention of the number 14 is when the bus riders are checking out at the diner counter. The alien behind the counter says that the first alien had ""14 cups of coffee"". Damn, that's a lot of coffee...\n\n\nOK, so how many times is the number 14 mentioned? 3 times. How many arms did the first alien have? 3. How many eyes did the other alien have? 3. \n\nKeep that in mind - \n\nNow let's talk about the title of this episode - ""Will the real martian please stand up?"". Well, actually, let's hold off on that one, for just a second. \n\nLet's talk about the officers. How many were there? 2. How many aliens were there? 2. Now, let's multiply 14 by 2 to get 28. Now, let's add 3 to that number (3 for how many arms/eyes the aliens had, and how many times 14 was mentioned), and what do you get? 31. \n\nNow, how many letters are in the title of this episode? **31!!** \n\nPhew... man, I'm tired from all that math... \n\n\nBut wait, there's more! \n\nHow many times was 14 mentioned? 3. How many arms did the first alien have? 3. How many eyes did the second alien have? 3. Add those together, and you get 9. What time did the first alien say his meeting was in Boston? **9AM!!!** \n\nOkay, okay. 
I'm reaching with some of these, but I sat there last night re-watching this episode, and I couldn't get all these damn numbers out of my head. \n\nIf you made it this far, I commend you. I doubt any of this means anything, but I found it interesting, nonetheless. \n\nCheers!",Pockets_The_Paladin,"""Will the real martian please stand up?"" is one giant numbers game...",TwilightZone,SPOILERS AHEAD Just wanted to warn anyone before reading all this that there are spoiler in here for those who haven see this episode yet Now on with the show OK so let first talk about the number 14 The number 14 is mentioned a total of 3 time throughout this episode First when the driver say Mister that a 14 year old bus and business is lousy Not only that but before he say this he say What do you think I got parked out there a 707 What is 7 0 7 14 The second time 14 is mentioned the alien behind the counter say Nothin come in here for 14 hour The third mention of the number 14 is when the bus rider are checking out at the diner counter The alien behind the counter say that the first alien had 14 cup of coffee Damn that a lot of coffee OK so how many time is the number 14 mentioned 3 time How many arm did the first alien have 3 How many eye did the other alien have 3 Keep that in mind Now let talk about the title of this episode Will the real martian please stand up Well actually let hold off on that one for just a second Let talk about the officer How many were there 2 How many alien were there 2 Now let multiply 14 by 2 to get 28 Now let add 3 to that number 3 for how many arm eye the alien had and how many time 14 wa mentioned and what do you get 31 Now how many letter are in the title of this episode 31 Phew man I tired from all that math But wait there more How many time wa 14 mentioned 3 How many arm did the first alien have 3 How many eye did the second alien have 3 Add those together and you get 9 What time did the first alien say his meeting wa in Boston 9AM Okay okay I 
reaching with some of these but I sat there last night watching this episode and I couldn get all these damn number out of my head If you made it this far I commend you I doubt any of this mean anything but I found it interesting nonetheless Cheers,Will the real martian please stand up is one giant number game


In [28]:
comics.head(1)

Unnamed: 0,selftext,author,title,subreddit,lems,title_lems
0,"I feel like comic colorists don't get enough love, so let me know who your favorite colorists are. I'm looking for inspiration!",avidya1997,Who is your favorite comicbook colorist?,comicbooks,I feel like comic colorist don get enough love so let me know who your favorite colorist are I looking for inspiration,Who is your favorite comicbook colorist


In [29]:
# Dropping vestigial unprocessed text columns
zone.drop(columns=['selftext', 'title'], inplace=True)
comics.drop(columns=['selftext', 'title'], inplace=True)

In [30]:
comics.head(2)

Unnamed: 0,author,subreddit,lems,title_lems
0,avidya1997,comicbooks,I feel like comic colorist don get enough love so let me know who your favorite colorist are I looking for inspiration,Who is your favorite comicbook colorist
1,writingsupplies,comicbooks,What I mean is outside those cover we know are instantly iconic an example would be like Frank Miller Dark Knight Returns cover My personal favorite are both the standard and variant cover for Judas by Jeff Loveness They stunning and invoke such a distinct emotion in regard to the story he telling I trying to come up with idea for a cover for the comic I writing and I hitting a wall creatively Thanks,What Are Your Favorite Covers that aren iconic


In [31]:
zone.head(2)

Unnamed: 0,author,subreddit,lems,title_lems
0,GAMESHARQ,TwilightZone,I a big fan of the Twilight Zone I have the complete series on DVD I seen every episode Recently someone told me about an episode of the Alfred Hitchcock Hour called The Jar which they thought wa an episode of the Twilight Zone I watched it tonight it the first and only episode of the show I seen I somewhat enjoyed it although it seemed a little darker than the Twilight Zone and le sci fi focused I not sure if I should try to watch more episode or not Maybe I just chose the wrong episode to start with How doe the show compare to the Twilight Zone overall Are there good episode that any Twilight Zone fan would enjoy,Twilight Zone fan what your opinion on the Alfred Hitchcock Hour
1,Pockets_The_Paladin,TwilightZone,SPOILERS AHEAD Just wanted to warn anyone before reading all this that there are spoiler in here for those who haven see this episode yet Now on with the show OK so let first talk about the number 14 The number 14 is mentioned a total of 3 time throughout this episode First when the driver say Mister that a 14 year old bus and business is lousy Not only that but before he say this he say What do you think I got parked out there a 707 What is 7 0 7 14 The second time 14 is mentioned the alien behind the counter say Nothin come in here for 14 hour The third mention of the number 14 is when the bus rider are checking out at the diner counter The alien behind the counter say that the first alien had 14 cup of coffee Damn that a lot of coffee OK so how many time is the number 14 mentioned 3 time How many arm did the first alien have 3 How many eye did the other alien have 3 Keep that in mind Now let talk about the title of this episode Will the real martian please stand up Well actually let hold off on that one for just a second Let talk about the officer How many were there 2 How many alien were there 2 Now let multiply 14 by 2 to get 28 Now let add 3 to that number 3 for how many arm eye the alien had and how many time 14 wa mentioned and what do you get 31 Now how many letter are in the title of this episode 31 Phew man I tired from all that math But wait there more How many time wa 14 mentioned 3 How many arm did the first alien have 3 How many eye did the second alien have 3 Add those together and you get 9 What time did the first alien say his meeting wa in Boston 9AM Okay okay I reaching with some of these but I sat there last night watching this episode and I couldn get all these damn number out of my head If you made it this far I commend you I doubt any of this mean anything but I found it interesting nonetheless Cheers,Will the real martian please stand up is one giant number game


In [32]:
# Saving cleaned comics dataset to csv 
comics.to_csv('../data/cleaned_data/comics')

In [33]:
# Saving cleaned zone dataset to csv
zone.to_csv('../data/cleaned_data/zone')

In [34]:
# Concatenating dataframes; ignore_index=True gives the result a fresh, unique index
comics_zone = pd.concat([zone, comics], ignore_index=True)
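
A small sketch of how index labels behave when two frames are stacked, using toy frames in place of `zone` and `comics` (the `_demo` names are illustrative):

```python
import pandas as pd

# Toy frames (illustrative only), each with its own 0..n-1 index.
zone_demo = pd.DataFrame({"subreddit": ["TwilightZone", "TwilightZone"]})
comics_demo = pd.DataFrame({"subreddit": ["comicbooks", "comicbooks"]})

# By default the original labels are kept, so they repeat across the frames.
stacked = pd.concat([zone_demo, comics_demo])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([zone_demo, comics_demo], ignore_index=True)
print(renumbered.index.tolist())
```

Repeated labels matter mainly for label-based lookups such as `.loc[0]`, which would return one row from each source frame.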

In [35]:
comics_zone['subreddit'].value_counts()

comicbooks      1797
TwilightZone    1235
Name: subreddit, dtype: int64

In [36]:
# Changing 'subreddit' values into binary: 1 = TwilightZone, 0 = comicbooks
comics_zone['subreddit'] = comics_zone['subreddit'].map({'comicbooks': 0, 'TwilightZone': 1})
comics_zone.head(2)

Unnamed: 0,author,subreddit,lems,title_lems
0,GAMESHARQ,1,I a big fan of the Twilight Zone I have the complete series on DVD I seen every episode Recently someone told me about an episode of the Alfred Hitchcock Hour called The Jar which they thought wa an episode of the Twilight Zone I watched it tonight it the first and only episode of the show I seen I somewhat enjoyed it although it seemed a little darker than the Twilight Zone and le sci fi focused I not sure if I should try to watch more episode or not Maybe I just chose the wrong episode to start with How doe the show compare to the Twilight Zone overall Are there good episode that any Twilight Zone fan would enjoy,Twilight Zone fan what your opinion on the Alfred Hitchcock Hour
1,Pockets_The_Paladin,1,SPOILERS AHEAD Just wanted to warn anyone before reading all this that there are spoiler in here for those who haven see this episode yet Now on with the show OK so let first talk about the number 14 The number 14 is mentioned a total of 3 time throughout this episode First when the driver say Mister that a 14 year old bus and business is lousy Not only that but before he say this he say What do you think I got parked out there a 707 What is 7 0 7 14 The second time 14 is mentioned the alien behind the counter say Nothin come in here for 14 hour The third mention of the number 14 is when the bus rider are checking out at the diner counter The alien behind the counter say that the first alien had 14 cup of coffee Damn that a lot of coffee OK so how many time is the number 14 mentioned 3 time How many arm did the first alien have 3 How many eye did the other alien have 3 Keep that in mind Now let talk about the title of this episode Will the real martian please stand up Well actually let hold off on that one for just a second Let talk about the officer How many were there 2 How many alien were there 2 Now let multiply 14 by 2 to get 28 Now let add 3 to that number 3 for how many arm eye the alien had and how many time 14 wa mentioned and what do you get 31 Now how many letter are in the title of this episode 31 Phew man I tired from all that math But wait there more How many time wa 14 mentioned 3 How many arm did the first alien have 3 How many eye did the second alien have 3 Add those together and you get 9 What time did the first alien say his meeting wa in Boston 9AM Okay okay I reaching with some of these but I sat there last night watching this episode and I couldn get all these damn number out of my head If you made it this far I commend you I doubt any of this mean anything but I found it interesting nonetheless Cheers,Will the real martian please stand up is one giant number game
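
One thing worth checking after a dictionary-based `map` is that every label was covered, because values missing from the dictionary become NaN rather than raising an error. A minimal sketch with a toy Series (the stray 'scifi' label is added deliberately for illustration):

```python
import pandas as pd

# Toy labels (illustrative only); 'scifi' is not in the mapping below.
labels = pd.Series(["comicbooks", "TwilightZone", "scifi"])

encoded = labels.map({"comicbooks": 0, "TwilightZone": 1})

# Count the labels the mapping did not cover.
unmapped = int(encoded.isna().sum())
print(unmapped)
```

On the real data the count should be zero, since `value_counts()` shows only the two expected subreddit names.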


In [37]:
# Create a combined column for all of our text, with a space between body and title
comics_zone['combined_text'] = comics_zone['lems'] + ' ' + comics_zone['title_lems']

In [38]:
comics_zone.head(2)

Unnamed: 0,author,subreddit,lems,title_lems,combined_text
0,GAMESHARQ,1,I a big fan of the Twilight Zone I have the complete series on DVD I seen every episode Recently someone told me about an episode of the Alfred Hitchcock Hour called The Jar which they thought wa an episode of the Twilight Zone I watched it tonight it the first and only episode of the show I seen I somewhat enjoyed it although it seemed a little darker than the Twilight Zone and le sci fi focused I not sure if I should try to watch more episode or not Maybe I just chose the wrong episode to start with How doe the show compare to the Twilight Zone overall Are there good episode that any Twilight Zone fan would enjoy,Twilight Zone fan what your opinion on the Alfred Hitchcock Hour,I a big fan of the Twilight Zone I have the complete series on DVD I seen every episode Recently someone told me about an episode of the Alfred Hitchcock Hour called The Jar which they thought wa an episode of the Twilight Zone I watched it tonight it the first and only episode of the show I seen I somewhat enjoyed it although it seemed a little darker than the Twilight Zone and le sci fi focused I not sure if I should try to watch more episode or not Maybe I just chose the wrong episode to start with How doe the show compare to the Twilight Zone overall Are there good episode that any Twilight Zone fan would enjoy Twilight Zone fan what your opinion on the Alfred Hitchcock Hour
1,Pockets_The_Paladin,1,SPOILERS AHEAD Just wanted to warn anyone before reading all this that there are spoiler in here for those who haven see this episode yet Now on with the show OK so let first talk about the number 14 The number 14 is mentioned a total of 3 time throughout this episode First when the driver say Mister that a 14 year old bus and business is lousy Not only that but before he say this he say What do you think I got parked out there a 707 What is 7 0 7 14 The second time 14 is mentioned the alien behind the counter say Nothin come in here for 14 hour The third mention of the number 14 is when the bus rider are checking out at the diner counter The alien behind the counter say that the first alien had 14 cup of coffee Damn that a lot of coffee OK so how many time is the number 14 mentioned 3 time How many arm did the first alien have 3 How many eye did the other alien have 3 Keep that in mind Now let talk about the title of this episode Will the real martian please stand up Well actually let hold off on that one for just a second Let talk about the officer How many were there 2 How many alien were there 2 Now let multiply 14 by 2 to get 28 Now let add 3 to that number 3 for how many arm eye the alien had and how many time 14 wa mentioned and what do you get 31 Now how many letter are in the title of this episode 31 Phew man I tired from all that math But wait there more How many time wa 14 mentioned 3 How many arm did the first alien have 3 How many eye did the second alien have 3 Add those together and you get 9 What time did the first alien say his meeting wa in Boston 9AM Okay okay I reaching with some of these but I sat there last night watching this episode and I couldn get all these damn number out of my head If you made it this far I commend you I doubt any of this mean anything but I found it interesting nonetheless Cheers,Will the real martian please stand up is one giant number game,SPOILERS AHEAD Just wanted to warn anyone before reading all this that there are spoiler in here for those who haven see this episode yet Now on with the show OK so let first talk about the number 14 The number 14 is mentioned a total of 3 time throughout this episode First when the driver say Mister that a 14 year old bus and business is lousy Not only that but before he say this he say What do you think I got parked out there a 707 What is 7 0 7 14 The second time 14 is mentioned the alien behind the counter say Nothin come in here for 14 hour The third mention of the number 14 is when the bus rider are checking out at the diner counter The alien behind the counter say that the first alien had 14 cup of coffee Damn that a lot of coffee OK so how many time is the number 14 mentioned 3 time How many arm did the first alien have 3 How many eye did the other alien have 3 Keep that in mind Now let talk about the title of this episode Will the real martian please stand up Well actually let hold off on that one for just a second Let talk about the officer How many were there 2 How many alien were there 2 Now let multiply 14 by 2 to get 28 Now let add 3 to that number 3 for how many arm eye the alien had and how many time 14 wa mentioned and what do you get 31 Now how many letter are in the title of this episode 31 Phew man I tired from all that math But wait there more How many time wa 14 mentioned 3 How many arm did the first alien have 3 How many eye did the second alien have 3 Add those together and you get 9 What time did the first alien say his meeting wa in Boston 9AM Okay okay I reaching with some of these but I sat there last night watching this episode and I couldn get all these damn number out of my head If you made it this far I commend you I doubt any of this mean anything but I found it interesting nonetheless Cheers Will the real martian please stand up is one giant number game
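
A short sketch of why the separator between the two text columns matters, using a toy frame with the same column names (the data is illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({
    "lems": ["would enjoy", "nonetheless Cheers"],
    "title_lems": ["Twilight Zone fan", "Will the real martian"],
})

# Plain '+' glues the last body word to the first title word.
glued = demo["lems"] + demo["title_lems"]

# str.cat with a separator keeps them as distinct tokens.
spaced = demo["lems"].str.cat(demo["title_lems"], sep=" ")

print(glued[0])
print(spaced[0])
```

Without the space, a word-level vectorizer would see a single fused token such as "enjoyTwilight" instead of the two real words.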


In [39]:
# Save cleaned and lemmatized data to csv
comics_zone.to_csv('../data/cleaned_data/comics_zone')