# Web Scraping and Multithreading in Python 

Here is my first attempt at a Jupyter notebook. I'm using multithreading as part of a parallelism and concurrency class I'm taking this semester. This is a simple application of web scraping on the online textbook for my statistics class.

### Dependencies: 

In [1]:
import threading
from bs4 import BeautifulSoup
import requests

### List of Titles: 

In my MATH 221 textbook, the URLs use page titles instead of ID numbers. Here I read a text file of titles that I made by hand. I believe I could run a regex over the request's text to get the titles, but I didn't focus on that for the purposes of this assignment. I strip the single quotes from the file contents to get just the names, then split them on commas into a list. The start URL stays the same for the entire program.

In [2]:
# Read the hand-made title list, drop the single quotes, and split on commas
with open('threading_scraping/stats_titles.txt') as f:
    titles_list = f.read().replace('\'', '').split(',')

start_url = 'http://statistics.byuimath.com/index.php?title='
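Splitting on every comma also cuts apart any title that contains a comma itself, which appears to be why "Correlation" and "& Covariance" show up as separate pages in the output further down. Here is a minimal sketch of a regex-based alternative, assuming every title in the file is wrapped in single quotes and contains no apostrophes. The names `parse_titles` and `build_url` and the sample string are hypothetical, not part of the original notebook.

```python
import re
from urllib.parse import quote


def parse_titles(raw_text):
    """Return every single-quoted title in raw_text, keeping inner commas."""
    # Assumption: titles look like 'Lesson_1', 'Lesson_2' with no apostrophes inside
    return re.findall(r"'([^']*)'", raw_text)


def build_url(start_url, title):
    """Append a percent-encoded title so characters like '&' stay in the title."""
    return start_url + quote(title)


# Hypothetical sample in the assumed file format
sample = "'Lesson_1:_Course_Introduction','Scatterplots,_Correlation,_&_Covariance'"
titles = parse_titles(sample)
print(len(titles))  # 2
```

Encoding the title matters because an unescaped `&` in a query string starts a new parameter, so the server would only see the part of the title before it.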

### Key Functions: 

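- `parse_site` takes a URL string, fetches the page with `requests`, parses it with BeautifulSoup, and prints the page's `<title>` tag
- `without_threads` requests each page one after another; `with_threads` starts one thread per page and then waits for all of them to finish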

In [3]:
def parse_site(url_text):
    soup = BeautifulSoup(requests.get(url_text).text, 'html5lib')
    print(soup.title)

In [4]:
def without_threads(start_url, titles):
    for title in titles:
        parse_site(start_url + title)

In [5]:
def with_threads(start_url, titles):
    threads = []

    for title in titles:
        url = start_url + title # single string for argument

        t = threading.Thread(target=parse_site, args=(url,))
        t.start()
        threads.append(t)
        
    for th in threads:
        th.join()
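Starting one thread per title sends every request at once. As a sketch of an alternative, a bounded pool from the standard library's `concurrent.futures` caps how many requests run at the same time and returns results in input order instead of printing them from each thread. The names `with_thread_pool` and `fake_fetch` and the `example.com` URLs are hypothetical; `fake_fetch` only stands in for a real fetch function so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def with_thread_pool(fetch, urls, max_workers=8):
    """Apply fetch to each URL on a bounded pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of urls, not completion order
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in: returns the title part of the URL instead of a page
    return url.rsplit('=', 1)[-1]


urls = ['http://example.com/index.php?title=' + t for t in ('A', 'B', 'C')]
print(with_thread_pool(fake_fetch, urls))  # ['A', 'B', 'C']
```

Whether a cap would change the timings or the server's responses for this site is untested here; the pool size of 8 is an arbitrary choice.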


---
# Testing: 

Let's try calling both of the functions and compare the speeds with and without multithreading. 

### With Threads: 

In [6]:
%%timeit 
with_threads(start_url, titles_list)

<title>Lesson 1: Course Introduction - BYU-I Statistics Text</title>
<title>Bad title - BYU-I Statistics Text</title>
<title>Lesson 3: Describing Quantitative Data - BYU-I Statistics Text</title>
<title>Lesson 6: Distribution of Sample Means - BYU-I Statistics Text</title>
<title>Internal error - BYU-I Statistics Text</title>
<title>Internal error - BYU-I Statistics Text</title>
<title>Internal error - BYU-I Statistics Text</title>
<title>Lesson 10: Inference for One Mean: Sigma Known (Confidence Interval) - BYU-I Statistics Text</title>
<title>Lesson 5: Normal Distributions - BYU-I Statistics Text</title><title>Lesson 7: Probability Calculations involving a Mean Response - BYU-I Statistics Text</title><title>Lesson 2: The Statistical Process &amp; Design of Studies - BYU-I Statistics Text</title><title>Lesson 21: Describing Bivariate Data: Scatterplots - BYU-I Statistics Text</title><title>Lesson 15: Review for Exam 2 - BYU-I Statistics Text</title><title>&amp; Covariance - BYU-I Stat

<title>Lesson 24: Review for Exam 4 - BYU-I Statistics Text</title>
<title>Lesson 22: Simple Linear Regression - BYU-I Statistics Text</title>
<title>Lesson 20: Review for Exam 3 - BYU-I Statistics Text</title>
1 loop, best of 3: 4 s per loop
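Several titles in the threaded output run together on one line. A likely cause is that `print` writes the text and the trailing newline as separate steps, so another thread can write in between. A minimal sketch of one way to keep each line intact is to guard `print` with a `threading.Lock`; `safe_print` and the sample strings are hypothetical names for illustration.

```python
import io
import threading

print_lock = threading.Lock()


def safe_print(text, stream=None):
    """Print text while holding a lock so lines from threads do not interleave."""
    with print_lock:
        # stream=None makes print fall back to sys.stdout
        print(text, file=stream)


# Write from 20 threads into an in-memory buffer to check the lines stay whole
buffer = io.StringIO()
workers = [
    threading.Thread(target=safe_print, args=(f"title {i}", buffer))
    for i in range(20)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

lines = buffer.getvalue().splitlines()
print(len(lines))  # 20
```

The order of the lines still depends on which thread finishes first; the lock only keeps each line in one piece.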


### Without Threads:

In [7]:
%%timeit
without_threads(start_url, titles_list)

<title>Lesson 1: Course Introduction - BYU-I Statistics Text</title>
<title>Lesson 2: The Statistical Process &amp; Design of Studies - BYU-I Statistics Text</title>
<title>Lesson 3: Describing Quantitative Data - BYU-I Statistics Text</title>
<title>Lesson 4: Probability; Discrete Random Variables - BYU-I Statistics Text</title>
<title>Lesson 5: Normal Distributions - BYU-I Statistics Text</title>
<title>Lesson 6: Distribution of Sample Means - BYU-I Statistics Text</title>
<title>Lesson 7: Probability Calculations involving a Mean Response - BYU-I Statistics Text</title>
<title>Lesson 8: Review for Exam 1 - BYU-I Statistics Text</title>
<title>Lesson 9: Inference for One Mean: Sigma Known (Hypothesis Test) - BYU-I Statistics Text</title>
<title>Lesson 10: Inference for One Mean: Sigma Known (Confidence Interval) - BYU-I Statistics Text</title>
<title>Lesson 11: Inference for One Mean: Sigma Unknown - BYU-I Statistics Text</title>
<title>Lesson 12: Inference for Two Means: Paired Data

<title>Lesson 19: Inference for Independence of Categorical Data - BYU-I Statistics Text</title>
<title>Lesson 20: Review for Exam 3 - BYU-I Statistics Text</title>
<title>Lesson 21: Describing Bivariate Data: Scatterplots - BYU-I Statistics Text</title>
<title>Correlation - BYU-I Statistics Text</title>
<title>&amp; Covariance - BYU-I Statistics Text</title>
<title>Lesson 22: Simple Linear Regression - BYU-I Statistics Text</title>
<title>Lesson 23: Inference for Bivariate Data - BYU-I Statistics Text</title>
<title>Lesson 24: Review for Exam 4 - BYU-I Statistics Text</title>
<title>Bad title - BYU-I Statistics Text</title>
1 loop, best of 3: 34 s per loop


---
# Conclusion: 

* Without Threads: **34s**
* With Threads: **4s**

Each figure is the best of three `%%timeit` runs, and the results are pretty clear. Multithreading cut the run time by **88.24%**, which is about an 8.5x speedup. I will note, however, that my results varied with the testing method. For example, I got an 82% improvement by running the program in Atom. Not a huge difference, but definitely notable.
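The 88.24% figure is the share of run time saved, computed from the two timings above. The same numbers can also be expressed as a speedup factor. A small sketch of the arithmetic, with hypothetical helper names:

```python
def percent_time_saved(baseline_s, improved_s):
    """Share of the baseline run time that was eliminated, in percent."""
    return (baseline_s - improved_s) / baseline_s * 100


def speedup_factor(baseline_s, improved_s):
    """How many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


# Timings from the two %%timeit cells: 34 s without threads, 4 s with threads
print(round(percent_time_saved(34, 4), 2))  # 88.24
print(speedup_factor(34, 4))                # 8.5
```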