# r/CasualConversation Database Creation
This Jupyter Notebook creates a database from submissions to r/CasualConversation. It first extracts submissions and then saves them into a database file so that features can be created from them. The notebook is part of a project that tries to predict the flair of a submission on r/CasualConversation.
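
The database format itself is not shown in this notebook. As a rough sketch only, assuming an SQLite file is acceptable, submissions could be stored as below. The table name, the column names, the example title and the timestamp are all placeholders chosen for illustration.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
def create_database(path=":memory:"):
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "id TEXT PRIMARY KEY, title TEXT, flair TEXT, created_utc REAL)"
    )
    return connection

def store_submission(connection, submission_id, title, flair, created_utc):
    # INSERT OR IGNORE skips rows whose ID is already stored
    connection.execute(
        "INSERT OR IGNORE INTO submissions VALUES (?, ?, ?, ?)",
        (submission_id, title, flair, created_utc),
    )
    connection.commit()

connection = create_database()
# Placeholder values; storing the same ID twice leaves a single row
store_submission(connection, "azxa8o", "Example title", ":film: Movies & Shows", 1552000000.0)
store_submission(connection, "azxa8o", "Example title", ":film: Movies & Shows", 1552000000.0)
row_count = connection.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
print(row_count)
```

Using the submission ID as primary key means that re-running the extraction would not create duplicate rows.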

In [1]:
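import os
import praw # Source code : https://github.com/praw-dev/praw
            # helpful page: https://praw.readthedocs.io/en/latest/code_overview/praw_models.html

# Initialising a connection to Reddit.
# Credentials are read from environment variables instead of being hard-coded;
# the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders.
reddit = praw.Reddit(user_agent='Doing some stuff',
                     client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'])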

In [2]:
# General modules and function(s)
import numpy as np
from datetime import datetime, timezone
import re
import os

def utc_time(timestamp):
    '''Converts a Unix timestamp to a UTC date-time string (YYYY-MM-DD HH:MM:SS)'''
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

## Extracting submissions from Reddit
First, the submissions have to be extracted from Reddit. However, as Reddit does not return more than about 1000 submissions from a subreddit listing, this notebook tries to 'brute-force' more submissions out of Reddit by repeatedly calling the 'subreddit.random()' method from the 'praw' module and keeping every submission that has not been seen before.
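
A minimal sketch of this general idea, with a made-up sampler standing in for Reddit: draw repeatedly, keep what is new, count what is a repeat. The names `collect_unique`, `sample_fn` and the ID pool are hypothetical and only serve the illustration.

```python
import random

def collect_unique(sample_fn, iterations, known_ids=None):
    """Call sample_fn repeatedly and keep only IDs not seen before."""
    known_ids = set(known_ids or [])
    new_ids = []
    duplicates = 0
    for _ in range(iterations):
        candidate = sample_fn()
        if candidate in known_ids:
            duplicates += 1
        else:
            known_ids.add(candidate)
            new_ids.append(candidate)
    return new_ids, duplicates

# Stand-in for 'subreddit.random()': draws from a small made-up pool of IDs
rng = random.Random(0)
pool = ['id%d' % i for i in range(50)]
new_ids, duplicates = collect_unique(lambda: rng.choice(pool), iterations=200)
print(len(new_ids), 'unique IDs,', duplicates, 'duplicates')
```

Because the draws are made with replacement, the share of duplicates grows as more of the pool has already been collected.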

### Set-up and Functions

In [3]:
if 'all_submission_ids' not in locals():
    all_submissions = []
    all_submission_ids = []
    print('"all_submissions" and "all_submission_ids" have been initialised.')
else:
    print('"all_submissions" and "all_submission_ids" already exist')

"all_submissions" and "all_submission_ids" have been initialised.


In [4]:
os.path.isfile('submissions_CasualConversations.txt')

True

In [16]:
def load_submission_ids(load_name='submissions_CasualConversations.txt', mistake_catch=0,
                        submissions_list=None, submission_ids_list=None):
    ''' Function for loading the submission IDs from a text file
    
    =========================== ===============================================
    Attribute                   Description
    =========================== ===============================================
    "load_name"                 Name of the file to load from.
    "mistake_catch"             To prevent accidentally discarding submissions,
                                the number of IDs in the text file is compared
                                with the number of IDs in submission_ids_list.
                                If the list holds more IDs than the file,
                                nothing is loaded.
                                To override this, mistake_catch has to be
                                equal to the number of IDs in
                                submission_ids_list minus the IDs in the file.
    "submissions_list"          Current submissions (default: all_submissions).
    "submission_ids_list"       Current IDs (default: all_submission_ids).
    =========================== ===============================================
    
    Always returns (submissions_list, submission_ids_list); the inputs are
    returned unchanged when nothing is loaded.
    '''
    # Defaults are resolved at call time, not when the function is defined
    if submissions_list is None:
        submissions_list = all_submissions.copy()
    if submission_ids_list is None:
        submission_ids_list = all_submission_ids.copy()
    
    if(load_name == '' and mistake_catch == -1):
        print('The submissions_list and submission_ids_list have been reset to empty lists.')
        return [], []
    
    if(not load_name.endswith('.txt')):
        load_name += '.txt'
        
    if(not os.path.isfile(load_name)):
        print('The file to load from (' + load_name + ') does not exist. \n' + 
              'You can try to load from a different file or start with an empty submission_ids_list. \n' +
              'To do the latter, you must set the "load_name" to an empty string and "mistake_catch" to -1'
             )
        return submissions_list, submission_ids_list
        
    with open(load_name, 'r') as load_file:
        # empty lines are skipped so a trailing newline does not produce an empty ID
        load_file_list = [line for line in load_file.read().split('\n') if line]
    
    # mistake prevention
    ids_difference = len(submission_ids_list) - len(load_file_list)
    if(ids_difference > 0 and mistake_catch != ids_difference):
        print('Are you sure that you want to overwrite your submission_ids_list that contains more IDs' +
              ' than the text file that you want to load? \n \n' +
              'If so, then set parameter "mistake_catch" to ' + str(ids_difference) +
              ' otherwise your mistake has luckily been prevented \n\n' +
              'IDs in text file : ' + str(len(load_file_list)) + '\n' +
              'IDs in submission_ids_list : ' + str(len(submission_ids_list))
             )
        return submissions_list, submission_ids_list
    
    # loading the submission_ids_list from the text file and the submissions from Reddit
    submission_ids_list = load_file_list
    submissions_list = [reddit.submission(ID) for ID in submission_ids_list]
    print('Loaded successfully from', load_name, '!')
    return submissions_list, submission_ids_list

In [6]:
def save_submission_ids(save_name='submissions_CasualConversations.txt', 
                        mistake_catch=0, submission_ids_list=None):
    ''' Function for saving the submission IDs to a text file
    
    =========================== ===============================================
    Attribute                   Description
    =========================== ===============================================
    "save_name"                 Under which name the file should be saved.
    "mistake_catch"             To prevent accidentally overwriting saved
                                submissions, the existing text file is checked
                                for the number of IDs in that file. If this
                                number is greater than the number of IDs in
                                the submission_ids_list, then
                                submission_ids_list won't be saved. 
                                To override this, mistake_catch has to be
                                equal to the number of IDs in the text file
                                minus the IDs in submission_ids_list
    "submission_ids_list"       From which list the IDs will be saved
                                (default: all_submission_ids).
    =========================== ===============================================
    '''
    # Default is resolved at call time, so the current global list is saved
    if submission_ids_list is None:
        submission_ids_list = all_submission_ids
    
    if(not save_name.endswith('.txt')):
        save_name += '.txt'
    
    # mistake prevention: the existing file is read *before* it is opened for
    # writing, because opening it in write mode empties it
    save_file_list = []
    if(os.path.isfile(save_name)):
        with open(save_name, 'r') as save_file:
            save_file_list = [line for line in save_file.read().split('\n') if line]
    ids_difference = len(save_file_list) - len(submission_ids_list)
    if(ids_difference > 0 and mistake_catch != ids_difference):
        print('Are you sure that you want to overwrite this text file that contains more IDs' +
              ' than the submission_ids_list that you want to save? \n \n' +
              'If so, then set parameter "mistake_catch" to ' + str(ids_difference) +
              ' otherwise your mistake has luckily been prevented \n\n' +
              'IDs in text file : ' + str(len(save_file_list)) + '\n' +
              'IDs in submission_ids_list : ' + str(len(submission_ids_list))
             )
        return
    
    # saving the submission_ids_list to the text file, one ID per line
    with open(save_name, 'w') as save_file:
        save_file.write('\n'.join(str(ID) for ID in submission_ids_list))
    print('Saved successfully to', save_name, '!')

In [13]:
def find_submissions(iterations, submissions_list=None, submission_ids_list=None, verbose=2):
    ''' Function for finding submissions within the CasualConversation subreddit using the 'random' method
    
    =========================== ===============================================
    Attribute                   Description
    =========================== ===============================================
    "iterations"                The number of iterations this function has to
                                be run.
    "submissions_list"          To which list the submissions have to be
                                appended (default: copy of all_submissions).
    "submission_ids_list"       To which list the submission IDs have to be
                                appended (default: copy of all_submission_ids).
    "verbose"                   The level of verbosity. 0 for no print 
                                statements; 1 for result and efficiency print 
                                statements; 2 also shows progress; 3 for most 
                                detail, but instead of progress bar, shows
                                iterations.
    =========================== ===============================================
    '''
    # Defaults are resolved at call time, not when the function is defined
    if submissions_list is None:
        submissions_list = all_submissions.copy()
    if submission_ids_list is None:
        submission_ids_list = all_submission_ids.copy()
    
    known_ids = set(submission_ids_list)  # set for fast duplicate checks
    submissions_found = 0
    duplicates_found = 0
    progress_readout_threshold = 0
    
    if(verbose==2 and iterations > 0): print('Progress : ')
    for iteration in range(iterations):
        if(verbose==2):
            if(iteration/float(iterations) * 100 >= progress_readout_threshold):
                print(str(round(iteration/float(iterations) * 100)) + '%')
                progress_readout_threshold += 4
        
        random_submission = r_CasualConversation.random()
        if(random_submission is None):
            # random() returns None on subreddits that do not support it
            continue
        
        if(random_submission.id not in known_ids):
            submissions_list.append(random_submission)
            submission_ids_list.append(random_submission.id)
            known_ids.add(random_submission.id)
            if(verbose>=3):print('ID ' + str(random_submission.id) + ' was found at iteration ' + str(iteration))
            submissions_found += 1
        else:
            if(verbose>=3):print('Duplicate ID ' + str(random_submission.id) + ' was found.')
            duplicates_found += 1
    
    if(duplicates_found==0): submission_percentage = 100
    else: submission_percentage = submissions_found/float(submissions_found + duplicates_found) * 100

    if(verbose>=2):print('\n\n')
    if(verbose>=1):print('------------------------------------------------Result and Efficiency'+ 
                         '----------------------------------------------------------\n'
                         'Total number of submissions found : ' + str(len(submission_ids_list)) + '\n\n' +
                         'Number of submissions found over the ' + str(iterations) + ' iterations of this run : ' + 
                         str(submissions_found) + '\n' +
                         'Number of duplicates found : ' + str(duplicates_found) + '\n' +
                         'Non-duplicate percentage : ' + str(submission_percentage) + '%'
                        )
    return submissions_list, submission_ids_list

In [8]:
r_CasualConversation = reddit.subreddit('CasualConversation')

In [12]:
# Verify that there are no duplicate IDs
print('IDs in all_submission_ids : ' + str(len(all_submission_ids)))
print('IDs in all_submission_ids without duplicates : ' + str(len(set(all_submission_ids))))
print('No duplicates =', len(all_submission_ids) == len(set(all_submission_ids)))

IDs in all_submission_ids : 254
IDs in all_submission_ids without duplicates : 254
No duplicates = True


### Implementation of functions

In [10]:
all_submissions, all_submission_ids = load_submission_ids()

Loaded successfully from submissions_CasualConversations.txt !


In [14]:
all_submissions, all_submission_ids = find_submissions(iterations=100)

Progress : 
0%
4%
8%
12%
16%
20%
24%
28%
32%
36%
40%
44%
48%
52%
56%
60%
64%
68%
72%
76%
80%
84%
88%
92%
96%



------------------------------------------------Result and Efficiency----------------------------------------------------------
Total amount of submissions found : 257

Amount of submissions found over the 100 iterations of this run : 3
Amount of duplicates found : 97
Non-duplicate percentage : 3.0%
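
One hedged way to read these numbers: if one assumes that 'random()' draws uniformly from a fixed pool of N submissions (an assumption, not something this notebook verifies), then the chance of drawing a duplicate is roughly the number of known IDs divided by N. Rearranged, N is about the number of known IDs divided by the observed duplicate rate. The helper name `estimate_pool_size` is made up for this sketch.

```python
def estimate_pool_size(known_before, duplicates, iterations):
    """Rough pool-size estimate under a uniform-draw assumption."""
    duplicate_rate = duplicates / iterations
    return known_before / duplicate_rate

# 254 IDs were known before the run; 97 of the 100 draws were duplicates
estimate = estimate_pool_size(known_before=254, duplicates=97, iterations=100)
print(round(estimate, 1))
```

Under that assumption the pool would hold only about 262 submissions, which would explain why further runs yield so few new IDs.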


In [15]:
save_submission_ids()

Saved successfully to submissions_CasualConversations.txt !


### Scratch cells (to be reviewed for usefulness)

In [108]:
save_submission_ids()

In [19]:
for submission in r_CasualConversation.stream.submissions():
    print(submission)

b0845d
b0857i
b08cbs
b08gqi
b08q1b
b08sq4
b08t79
b08vnb
b08yrb
b0975r
b09996
b09a2z
b09bma
b09cfs
b09j3p
b07oqv
b07rp5
b09mwr
b09ns2
b09o0s
b09oq0
b09pv3
b09unk
b09wgv
b09y65
b09yix
b0a0q8
b0a7qr
b0a9gj
b0aamk
b0acs5
b0adze
b0afvv
b0aidq
b0amjz
b0ana3
b0ar5v
b0arbm
b0aruo
b0asaq
b0atb6
b0atyd
b0avl1
b0avv6
b0aw76
b0ayko
b0ba0p
b0bcri
b0bf03
b0bngv
b0boa9
b0c16d
b0c176
b0c3ge
b0c5ri
b0ca6g
b0cddz
b0ce94
b0cgwv
b0ciyn
b0cjyp
b0cnz2
b0czpp
b0d3m8
b0d549
b0ddd1
b0dfjf
b0dik3
b0dklc
b0dm4h
b0dmjc
b0dqac
b0dsit
b0dt6n
b0du0v
b0dv4d
b0dypm
b0e5im
b0e9rw
b0ea66
b0ebf3
b0ecd6
b0eeq7
b0ef1v
b0em6t
b0emum
b0enu6
b0epw9
b0espp
b0evti
b0f2e5
b0fdo5
b0fdwx
b0fgai
b0fgtz
b0fhtj
b0fkn2
b0fl4u
b0fmui
b0ea8q


KeyboardInterrupt: 

In [43]:
all_submission_ids.count('b0d549')

2
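
A count of 2 means the ID list contains a repeat. A minimal sketch of removing repeats while keeping the original order; the helper name `remove_duplicates` is hypothetical and the example IDs are taken from the stream output above.

```python
def remove_duplicates(ids):
    """Return the IDs in their original order with repeats dropped."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(ids))

example_ids = ['b0d3m8', 'b0d549', 'b0ddd1', 'b0d549']
unique_ids = remove_duplicates(example_ids)
print(unique_ids)  # ['b0d3m8', 'b0d549', 'b0ddd1']
```

If a parallel list of submission objects is kept, it would need the same treatment so that both lists stay aligned.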

Extracting all submissions in r/CasualConversation that have a flair. Note that a limit has to be set (default value = 100), so we don't know whether all submissions are extracted. A separate list is created for the ID of each submission. This way the same posts can be used again later.

In [9]:
# When timing a cell with %%timeit, the variables assigned in it are not kept!!!
# %%timeit -r1 -n1


all_submissions = [submission for submission in r_CasualConversation.new(limit=10000) if submission.link_flair_text is not None]
all_submission_ids = [submission.id for submission in all_submissions]

TypeError: 'Submission' object is not iterable

In [None]:
# Adding more submissions

more_submissions = 

In [7]:
len(all_submissions)

167

In [105]:
%%timeit -r1 -n1

for i in range(20):
    print(reddit.submission(all_submission_ids[i]).link_flair_text)

:thinking: Thoughts & Ideas
:ididit: Made did it
:chat: Just Chatting
:chat: Just Chatting
:chat: Just Chatting
:question: Questions
:chat: Just Chatting
:question: Questions
:chat: Just Chatting
:story: Life Stories
:chat: Just Chatting
:question: Questions
:question: Questions
:chat: Just Chatting
:film: Movies & Shows
:question: Questions
:thinking: Thoughts & Ideas
:question: Questions
:question: Questions
:chat: Just Chatting
:thinking: Thoughts & Ideas
:ididit: Made did it
:chat: Just Chatting
:chat: Just Chatting
:chat: Just Chatting
:question: Questions
:chat: Just Chatting
:question: Questions
:chat: Just Chatting
:story: Life Stories
:chat: Just Chatting
:question: Questions
:question: Questions
:chat: Just Chatting
:film: Movies & Shows
:question: Questions
:thinking: Thoughts & Ideas
:question: Questions
:question: Questions
:chat: Just Chatting
13 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
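
The 13 s above is consistent with PRAW's lazy loading: 'reddit.submission(ID)' creates an unfetched object, and the first attribute access such as 'link_flair_text' triggers a request. A minimal sketch of caching looked-up flairs so that each ID is fetched only once. Here `fetch_flair` and `FAKE_REDDIT` are stand-ins that fake the network call; they are not PRAW functions, and the ID-to-flair pairs are made up.

```python
from functools import lru_cache

fetch_count = 0  # counts how often the (fake) network call happens

# Made-up lookup table standing in for Reddit
FAKE_REDDIT = {'b0d549': ':chat: Just Chatting', 'azxa8o': ':film: Movies & Shows'}

@lru_cache(maxsize=None)
def fetch_flair(submission_id):
    """Stand-in for reddit.submission(submission_id).link_flair_text."""
    global fetch_count
    fetch_count += 1
    return FAKE_REDDIT.get(submission_id)

flairs = [fetch_flair(ID) for ID in ['b0d549', 'azxa8o', 'b0d549', 'b0d549']]
print(flairs)
print('fetches:', fetch_count)  # 2, because repeated IDs are served from the cache
```

Reusing already fetched submission objects, as the next cell does, avoids the repeated requests in the same way.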


In [106]:
%%timeit -r1 -n1

for i in range(20):
    print(all_submissions[i].link_flair_text)

:thinking: Thoughts & Ideas
:ididit: Made did it
:chat: Just Chatting
:chat: Just Chatting
:chat: Just Chatting
:question: Questions
:chat: Just Chatting
:question: Questions
:chat: Just Chatting
:story: Life Stories
:chat: Just Chatting
:question: Questions
:question: Questions
:chat: Just Chatting
:film: Movies & Shows
:question: Questions
:thinking: Thoughts & Ideas
:question: Questions
:question: Questions
:chat: Just Chatting

In [101]:
# all_submissions already holds Submission objects, so the flair can be read directly
all_flairs = [submission.link_flair_text for submission in all_submissions[:20]]

## Basic Extraction

In [91]:
# Submission extraction

for submission in r_CasualConversation.new(limit=5):
    print('https://www.reddit.com/r/CasualConversation/comments/'+str(submission))
    submission_id = submission.id  # after the loop this holds the ID of the last listed submission
a_submission = reddit.submission(submission_id)
print('"a_submission" is "' + str(submission_id) + '".')

https://www.reddit.com/r/CasualConversation/comments/ayvu98
https://www.reddit.com/r/CasualConversation/comments/azs3bb
https://www.reddit.com/r/CasualConversation/comments/azw69v
https://www.reddit.com/r/CasualConversation/comments/azzm5r
https://www.reddit.com/r/CasualConversation/comments/azxa8o
"a_submission" is "azxa8o".


In [78]:
# Flair of the submission (note that there are different formats)
print(a_submission.link_flair_text)

:film: Movies & Shows
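
Since the flair text can come in different formats, it may help to separate the emoji shortcode from the label before using the flair as a prediction target. A minimal sketch, assuming the two formats are ':emoji: Label' (as in the output above) and a plain label without emoji; the second format and the helper name `split_flair` are assumptions.

```python
import re

def split_flair(flair_text):
    """Split ':emoji: Label' into (emoji_name, label); emoji_name is None if absent."""
    if flair_text is None:
        return None, None
    match = re.match(r'^:([^:\s]+):\s*(.*)$', flair_text)
    if match:
        return match.group(1), match.group(2)
    return None, flair_text.strip()

print(split_flair(':film: Movies & Shows'))  # ('film', 'Movies & Shows')
print(split_flair('Just Chatting'))          # (None, 'Just Chatting')
```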


In [94]:
# All top level comments on this submission
for top_level_comment in a_submission.comments:
    print(top_level_comment.body)
    print('\n-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -\n')

I totally understand that clarity. I smoked for years. Stopped cause the short term memory issues I developed really scared me. I regret smoking so heavily. My brain hasn't worked the same since. I was a teen and no idea how I kept my grades up through it all. I'm not anti pot, all for it being legal and such, but I'm very glad I stopped for good. I have no desire to do it again. Haven't for over a decade. 

-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -

That's awesome. My bf is an avid smoker as he just likes it and it helps his migraines and makes him feel normal. But he is very forgetful. I'd like to stop for a while so I can finally look for another job without the possibility of them drug testing. 

-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -

Nice. I wish most of my lifelong pothead friends would stop for a few weeks just 

In [96]:
# Score of the first comment on this submission
print(a_submission.comments[0].score)

16
