# Introduction to threading in Python 

In this tutorial we introduce the concept of threading, explain why it is useful, and demonstrate how to thread our [Reddit comment scraper](Scraping reddit using praw.ipynb).

### What is threading? 

A thread is a unit of execution within a computer process. A process can contain multiple threads that run concurrently and share memory.

If we write a Python script with no multiprocessing element, executing the script starts a single process. Within that process, however, the script may start multiple threads, which run concurrently and share the same executable code (the script) and data. Note that in standard CPython the global interpreter lock allows only one thread to execute Python bytecode at a time, so threads overlap mainly while they are waiting, for example on network or disk I/O.
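
As a minimal sketch of what "sharing memory within one process" means, the snippet below starts three threads that all append to the same list. The names `shared` and `record` are illustrative only and not part of the tutorial's code.

```python
import os
import threading

shared = []  # one list, visible to every thread in the process


def record(name):
    # Each thread appends its own name and the ID of the process it runs in.
    shared.append((name, os.getpid()))


workers = [threading.Thread(target=record, args=(f"thread-{i}",)) for i in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()  # wait for all three threads to finish

print(len(shared))                  # one entry per thread
print({pid for _, pid in shared})   # every entry carries the same process ID
```

All three threads write to the same `shared` list and report the same process ID, which is what distinguishes threads from separate processes.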

### Why use threading? 

If you have a process that repeats a task many times, the run time of the process can become long even when the task itself is not intensive, simply because the task is repeated.

For example, consider a function that takes a variable `x` and doubles it:

```python
def double(x):
    return 2*x
```

Now say we want to double all numbers from `1` to `100`. The natural thing to do is to call this function 100 times using a loop:

```python
for x in range(1,101):
    y = double(x)
    print(y)
```

This loop will not actually take a long time to execute, as the process of doubling a number is simple. However, if we were able to call the function `double` 100 times simultaneously, the run time could in principle decrease\*.

\* *Sadly not by a factor of 100, and for a task this cheap the overhead of starting threads can outweigh any gain*

In order to call the function 100 times simultaneously, we simply need 100 threads, each calling the function `double` with a different value of `x`.
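
The most direct version of this idea can be sketched as one thread per value. This is only an illustration: the `results` dictionary and the helper `double_and_store` are assumptions added here, needed because a `Thread` discards the return value of its target.

```python
import threading


def double(x):
    return 2 * x


results = {}  # hypothetical container mapping x to its doubled value


def double_and_store(x):
    # A Thread ignores its target's return value, so store it explicitly.
    results[x] = double(x)


threads = [threading.Thread(target=double_and_store, args=(x,)) for x in range(1, 101)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish

print(results[1], results[100])  # 2 200
```

Creating a new thread for every item does not scale well, which is one reason the example in the next section uses a fixed set of threads that pull work from a queue.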

### Simple example 

Let's use 100 threads to speed up the example above. We will use the `threading` and `queue` libraries in Python 3.

Here is the example code:

```python

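import threading
import queue

# Note: `double` (defined above) and `threader` (shown further down)
# must both be defined before the threads below are started.
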
q = queue.Queue()
threads=[]

for i in range(100):
    t = threading.Thread(target=threader)
    t.start()
    threads.append(t)

for x in range(1,101):
    q.put(x)
    
q.join()

for i in range(100):
    q.put(None)
    
for t in threads:
    t.join()
    
```

So what is going on?

After importing the required libraries, we initialise a `Queue` called `q`:

```python
q = queue.Queue()
```

We then use a for loop to initialise 100 threads, saving each to a list `threads`:

```python
threads=[]

for i in range(100):
    t = threading.Thread(target=threader)
    t.start()
    threads.append(t)
```
Next we put all values of `x`, from 1 to 100, into the Queue:

```python
for x in range(1,101):
    q.put(x)
```

We tell the process to wait until all items in the Queue have been processed:
```python
q.join()
```

Finally, we put one `None` on the Queue for each thread (the reason is explained below) and wait for all threads to shut down:
```python
for i in range(100):
    q.put(None)
    
for t in threads:
    t.join()
```

At this point, the process (or programme) will terminate.


What is missing from the above is the point at which the values of `x` are doubled. When each thread is initialised, it is passed a `target` argument, `threader`.

`threader` is a function that each thread will call:

```python
def threader():
    while True:
        x = q.get()
        if x is None:
            break
        y = double(x)
        print(y)
        q.task_done()       
```

This function takes a value of `x` from the Queue with `q.get()`. If `x` is `None` (i.e. all values of `x` from 1 to 100 have already been doubled), the loop ends, the function returns, and the thread terminates. If `x` is not `None`, `x` is doubled and the result printed.

The line `q.task_done()` corresponds to the thread telling the Queue that the processing of `x` is completed. Once `q.task_done()` has been called on all Queue items, `q.join()` will stop blocking the process and subsequent lines of code will be executed.
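
Putting the pieces together, a complete script might look like the sketch below. Two details are assumptions made for illustration: `double` and `threader` are defined before any thread is started (otherwise the `target` would not exist yet), and results are collected in a list called `results` rather than printed, so the outcome is easy to check.

```python
import queue
import threading


def double(x):
    return 2 * x


q = queue.Queue()
results = []  # assumption: collect results here instead of printing each one


def threader():
    while True:
        x = q.get()
        if x is None:  # sentinel value: no more work for this thread
            break
        results.append(double(x))
        q.task_done()


threads = []
for i in range(100):
    t = threading.Thread(target=threader)
    t.start()
    threads.append(t)

for x in range(1, 101):
    q.put(x)

q.join()  # blocks until task_done() has been called for every item

for i in range(100):
    q.put(None)  # one sentinel per thread

for t in threads:
    t.join()

print(sorted(results)[:5])  # [2, 4, 6, 8, 10]
```

The results are sorted before printing because the order in which threads finish is not guaranteed.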

## Threading the Reddit comment scraper 

The use of threads in the above example does not have a large effect on runtime. Threading is more useful when the task repeated within the loop is time consuming, particularly when most of that time is spent waiting, as with network requests.
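
To make the difference concrete, here is a rough timing sketch. It assumes `time.sleep(0.1)` as a stand-in for a slow network call. The function names `slow_task`, `run_sequential` and `run_threaded` are hypothetical, and the exact timings will vary from machine to machine.

```python
import threading
import time


def slow_task(x):
    time.sleep(0.1)  # stand-in for waiting on a network response
    return 2 * x


def run_sequential(values):
    start = time.perf_counter()
    for x in values:
        slow_task(x)
    return time.perf_counter() - start


def run_threaded(values):
    start = time.perf_counter()
    threads = [threading.Thread(target=slow_task, args=(x,)) for x in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


values = list(range(10))
sequential_time = run_sequential(values)
threaded_time = run_threaded(values)
print(f"sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
```

The sequential run has to wait for each call in turn, while the threaded run lets all ten waits overlap.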

Let's look at the Reddit comment scraper:

```python
import praw
import json

def get_comments(submission):

    comments=[]

    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        data={}
        data['body']=comment.body
        data['id']=comment.fullname

        try:
            data['author']=comment.author.name

        except AttributeError:
            data['author']=comment.author

        data['time']=comment.created_utc
        data['parent']=comment.parent_id

        comments.append(data)

    return comments


def get_metadata(submission):

    thread={}
    thread["title"]=submission.title
    thread["score"]=submission.score
    thread["id"]=submission.id
    thread["url"]=submission.url
    thread["comms_num"]=submission.num_comments
    thread["created"]=submission.created
    thread["body"]=submission.selftext

    thread['posts']= get_comments(submission)

    return thread



with open('reddit_credentials.json') as fin:

    creds = json.load(fin)


reddit = praw.Reddit(user_agent='Comment Extraction (by /u/{0})'.format(creds['username']),
                     client_id=creds['client_id'], client_secret=creds['client_secret'],
                     username=creds['username'], password=creds['password'])


subreddit = reddit.subreddit('AskReddit')

with open('askreddit.json', 'w') as fout:

    for submission in subreddit.hot(limit=10):

        thread = get_metadata(submission)
        json.dump(thread,fout)
        fout.write('\n')
```

Here, you can see that the last five lines:

```python
with open('askreddit.json', 'w') as fout:
    
    for submission in subreddit.hot(limit=10):
    
        thread = get_metadata(submission)
        json.dump(thread,fout)
        fout.write('\n')
```
are the time-consuming part of the code, where everything happens: data is extracted from each Reddit thread in turn and then written to a file.

To speed this up, we can use 10 Python threads, with each thread extracting data from a subreddit thread simultaneously. 

To ensure that all data is saved correctly, we can use an 11th Python thread and a second Queue: the 11th thread simply takes extracted data from the second Queue and saves it to the file, so only one thread ever writes to the file.
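
Before looking at the full scraper, the two-queue pattern can be sketched in isolation. Everything in this sketch is a stand-in chosen for illustration: the names `work_q`, `save_q`, `worker` and `writer` are hypothetical, the "scraped" data is a made-up dictionary, and a list called `saved` takes the place of the output file.

```python
import queue
import threading

work_q = queue.Queue()  # items waiting to be processed
save_q = queue.Queue()  # processed results waiting to be saved
saved = []              # stand-in for the output file


def worker():
    while True:
        item = work_q.get()
        if item is None:
            break
        # Fake "scraped" data; the result is queued before task_done() is called.
        save_q.put({"id": item, "title": f"post {item}"})
        work_q.task_done()


def writer():
    # Only this thread touches `saved`, so records are never interleaved.
    while True:
        record = save_q.get()
        if record is None:
            break
        saved.append(record)
        save_q.task_done()


workers = [threading.Thread(target=worker) for _ in range(3)]
writer_thread = threading.Thread(target=writer)
for t in workers + [writer_thread]:
    t.start()

for item in range(5):
    work_q.put(item)

work_q.join()          # every item processed, so every result is already in save_q
for _ in workers:
    work_q.put(None)   # stop the workers
save_q.join()          # every result saved
save_q.put(None)       # stop the writer

for t in workers + [writer_thread]:
    t.join()

print(len(saved))  # 5
```

Because each worker queues its result before calling `task_done()`, waiting on the first queue and then on the second guarantees that nothing is lost before the writer is told to stop.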

The code looks like this:


```python
import praw
import json
import threading
import queue

def get_comments(submission):

    comments=[]

    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        data={}
        data['body']=comment.body
        data['id']=comment.fullname

        try:
            data['author']=comment.author.name

        except AttributeError:
            data['author']=comment.author

        data['time']=comment.created_utc
        data['parent']=comment.parent_id

        comments.append(data)

    return comments


def get_metadata(submission):

    thread={}
    thread["title"]=submission.title
    thread["score"]=submission.score
    thread["id"]=submission.id
    thread["url"]=submission.url
    thread["comms_num"]=submission.num_comments
    thread["created"]=submission.created
    thread["body"]=submission.selftext

    thread['posts']= get_comments(submission)

    return thread


def scraper():
    while True:
        item = q.get()
        if item is None:
            break
        thread = get_metadata(item)
        # add the data to the second queue, ready to be saved
        q2.put(thread)
        q.task_done()


def saver(fname):
    while True:
        thread = q2.get()
        if thread is None:
            break
        print('Saving {0}'.format(thread['title']))
        with open(fname,'a') as f:
            json.dump(thread,f)
            f.write('\n')
        q2.task_done()



with open('reddit_credentials.json') as fin:

    creds = json.load(fin)


reddit = praw.Reddit(user_agent='Comment Extraction (by /u/{0})'.format(creds['username']),
                     client_id=creds['client_id'], client_secret=creds['client_secret'],
                     username=creds['username'], password=creds['password'])


subreddit = reddit.subreddit('AskReddit')

q = queue.Queue()
q2= queue.Queue()

threads = []

# create (or empty) the file the data will be appended to
with open('askreddit.json', 'w') as fout:
    fout.write('')

# start 10 scraper threads
for i in range(10):
    t = threading.Thread(target=scraper)
    t.start()
    threads.append(t)

# start the saver thread
t = threading.Thread(target=saver, args=('askreddit.json',))
t.start()
threads.append(t)


# add submissions (Reddit threads) to the first queue
for submission in subreddit.rising(limit=10):
    q.put(submission)

# block until every submission in the first queue has been scraped
q.join()

# stop the scraper threads
for i in range(10):
    q.put(None)

# block until all data in the second queue has been saved, then stop the saver
q2.join()
q2.put(None)

# wait for all threads to terminate
for t in threads:
    t.join()
```


```
Saving For those who had an unplanned threesome or foursome, how did it happen and what did you do? NSFW
Saving [Serious] Happily Married couples of Reddit-How do you successfully share the responsibilities of your home?
Saving What's the weirdest comment you ever posted when put completely out of context?
Saving You hear your parents having sex. Would it be ok to touch yourself? Why?
Saving What do guys generally think of girls playing video games?
Saving [Serious] Redditors who voted for Trump, do you regret your decision? Why or why not?
Saving If your murdered body could not be identified by the usual means such as face, teeth or fingerprints, what is some obscure way a loved one might identify your body?
Saving Redditers... why is your life a lie?
Saving What turns you on?
Saving Would you cold heartedly take the life of one stranger for five billion dollars? Why?/why not?
```
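
Since the saver writes one JSON object per line, the resulting file can be read back one line at a time. The sketch below assumes that layout. The helper `load_threads` is a hypothetical name, and a small temporary file with made-up records is used so the snippet runs without Reddit credentials.

```python
import json
import os
import tempfile


def load_threads(fname):
    # One JSON object per line, in the layout written by the saver thread.
    with open(fname) as f:
        return [json.loads(line) for line in f if line.strip()]


# Demonstrate on a small temporary file containing made-up records.
sample = [{"title": "first", "posts": []}, {"title": "second", "posts": []}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for record in sample:
        json.dump(record, tmp)
        tmp.write("\n")

loaded = load_threads(tmp.name)
os.remove(tmp.name)  # clean up the temporary file

print([t["title"] for t in loaded])  # ['first', 'second']
```

With the real output, `load_threads('askreddit.json')` would return one dictionary per saved submission.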
