# Web scraping 

Web scraping of the series "The Big Bang Theory"

Katerina Kashchenko

The transcripts come from https://bigbangtrans.wordpress.com.  
We start by looking at the source of the page with the transcript of the first episode: https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/

The transcript itself sits inside a div tag with the attribute class="entrytext".

Pulling out the data

In [1]:
# The requests package is usually used to get data from web pages
# import it

import requests



In [None]:
# How do we use this package to send requests and receive data from web pages?
# Read the package documentation and find the right command: https://docs.python-requests.org/en/master/

# store the link to the page we need in a variable

url = "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/"

# create a request for this page, passing the link as a parameter
response = requests.get(url)


*Sometimes a site requires the request to identify a user agent, i.e. the browser the request comes from.*  
*In that case you can pass any plausible value, for example:*

response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})

In [None]:
# now the entire content of the page is written to the response variable
# we can view it using this command --
response.text 



We got roughly what we would see when viewing the page source in the browser.
In this raw form it is hard to make sense of.

We need a parser - a set of commands that can be used to separate the code (tags, attributes) from everything else, and get the necessary data.

We will use the html parser from the BeautifulSoup package. Documentation - https://www.crummy.com/software/BeautifulSoup/bs4/doc/



In [None]:
# Let's parse the downloaded page
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# The same page is now stored in the soup variable, but in a more structured form
# now we can get what we need by tag, using the method soup.findAll(tag)
# find, for example, all links by the tag "a"
soup.findAll('a')


In [None]:
# When we looked at the page source in the browser, we found that the text we need is inside the div tag with the attribute class="entrytext"
# We pass this tag and attribute to findAll
# the exact syntax for this comes from the BeautifulSoup documentation

soup.findAll('div', {'class': 'entrytext'})


We got a structure similar to a list. From each element we now need to get the text. We learned from the documentation that this can be done using the get_text() method

In [None]:
lines = []
for div in soup.findAll('div', {'class': 'entrytext'}):
    t = div.get_text()
    lines.append(t.strip())

lines

We already have almost what we need!
However, at the end there is some technical text that we don't need. Let's delete it. There are different ways to do this, but we will use regular expressions (https://docs.python.org/3/howto/regex.html), because this text probably looks a little different on different pages of the site. We need to delete everything that starts with "__ATA" and ends with ":Like Loading...". As a regular expression this can be written as: __ATA, then any character (.) repeated any number of times (*), then :Like Loading...

One thing to take into account is that this fragment contains newline characters (\n), which "." does not match by default. "Any character, including a newline" can be written not as "." but as (.|\s)

Therefore, the final regular expression is "__ATA(.|\s)*:Like Loading...". The unescaped dots at the end match any character, which includes the literal dots we are looking for.
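
The difference between "." and (.|\s) is easy to check on a small string. The sketch below uses an invented sample text, not the real page content. It also shows re.DOTALL, a standard flag of the re module that makes "." match newlines as well, as a possible alternative to (.|\s). Here the dots in "Loading..." are escaped so that they match only literal dots.

```python
import re

# Invented sample: one transcript line followed by a multi-line technical tail.
sample = "Sheldon: Hello.\n__ATA.cmd.push(...);\nShare this:\nLike this:Like Loading..."

# "." does not match "\n" by default, so this pattern cannot span the tail
# and nothing is removed.
no_newlines = re.sub(r"__ATA.*:Like Loading\.\.\.", "", sample)

# (.|\s) matches any character or any whitespace, including "\n".
with_group = re.sub(r"__ATA(.|\s)*:Like Loading\.\.\.", "", sample)

# re.DOTALL lets "." itself match "\n".
with_dotall = re.sub(r"__ATA.*:Like Loading\.\.\.", "", sample, flags=re.DOTALL)

print(repr(no_newlines))
print(repr(with_group))
print(repr(with_dotall))
```

Only the last two calls remove the tail; the first one returns the sample unchanged.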

In [None]:
import re

# raw string, so that \s reaches the regex engine unchanged
lines = re.sub(r"__ATA(.|\s)*:Like Loading...", "", lines[0])

# split the whole text into lines of dialogue
lines = lines.split('\n')
lines

In [None]:
# save the result to a txt file
with open("s1e1.txt", "w", encoding="utf-8") as text_file:
    for line in lines:
        # add a newline so that each line of dialogue ends up on its own line
        text_file.write(line + '\n')

Great, we downloaded one transcript. But we need all of them. To do this, we need to get the links to all the pages somehow. There is no universal method here: you need to look at what the links look like. With blog platforms like LiveJournal everything is simple, because the links have the form "https://username.livejournal.com/?skip=0", "https://username.livejournal.com/?skip=10", "https://username.livejournal.com/?skip=20" and so on, so you can generate them by changing only the number at the end. On our site the links look different and cannot be generated so easily (each one ends with the episode title). But the sidebar has a Pages section listing all the episodes with links to them. What interests us now is not the displayed text but the link itself.
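
For comparison, generating links for a LiveJournal-style site could look like the sketch below. The address is the placeholder from the text ("username" is not a real account), and the number of pages is an arbitrary assumption.

```python
# Placeholder blog address; "username" stands in for a real account name.
base = "https://username.livejournal.com/?skip={}"

# Offsets 0, 10, 20, 30, 40: one link per page of ten posts.
page_urls = [base.format(skip) for skip in range(0, 50, 10)]

for page_url in page_urls:
    print(page_url)
```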


In this case, the usual <a> tag for links is suitable. The link address itself is stored in the href attribute (you can learn about this by reading about HTML, for example at http://htmlbook.ru/, or by searching for how to get the address of a link).

In [None]:
# Download links

links_with_text = []

links = soup.findAll('a', href=True)
for link in links:
    if link.text: 
        links_with_text.append(link['href'])
            

links_with_text

In [None]:
# Let's see what we got: there are some extra entries. We don't need the first four links or the last two
# Delete them

del links_with_text[0:4]
del links_with_text[-2:]

links_with_text
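
Deleting by position depends on the exact order of the links on the page. A possible alternative is to keep only the links whose address looks like an episode page. The sketch below runs on an invented list of links. The pattern is an assumption based on the episode address shown earlier, and the second episode title is made up.

```python
import re

# Invented sample of what the list of links might contain.
sample_links = [
    "https://bigbangtrans.wordpress.com/",
    "https://bigbangtrans.wordpress.com/about/",
    "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/",
    "https://bigbangtrans.wordpress.com/series-06-episode-02-the-example-title/",
    "https://wordpress.com/",
]

# Assumed shape of an episode address: /series-<number>-episode-<number>...
episode_pattern = re.compile(r"/series-\d+-episode-\d+")

# Keep only the links that match the pattern.
episode_links = [link for link in sample_links if episode_pattern.search(link)]

for link in episode_links:
    print(link)
```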

In [None]:
# Extract the part of the link containing the season/episode number and the episode title, to use as file names
titles = []

for i in range(len(links_with_text)):
    a=links_with_text[i].replace('https://bigbangtrans.wordpress.com/',"")
    a=a.rstrip("/")
    titles.append(a)
titles
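
The same part of the link can also be taken with the standard urllib.parse module, which does not depend on the exact site prefix. This is a small sketch of that alternative, not the approach used in the rest of the notebook.

```python
from urllib.parse import urlparse

url = "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/"

# urlparse(url).path is "/series-1-episode-1-pilot-episode/";
# strip("/") removes the leading and trailing slashes.
title = urlparse(url).path.strip("/")

print(title)
```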

In [None]:
# Now everything is ready to repeat all the steps for every page

for i, url in enumerate(links_with_text):
    print(url)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
    soup = BeautifulSoup(response.text, 'html.parser')

    lines = []
    for div in soup.findAll('div', {'class': 'entrytext'}):
        t = div.get_text()
        lines.append(t.strip())

    lines = re.sub(r"__ATA(.|\s)*:Like Loading...", "", lines[0])
    lines = lines.split("\n")
    # save to a txt file
    with open(titles[i] + ".txt", "w", encoding="utf-8") as text_file:
        for line in lines:  # a separate name, so that the page index i is not overwritten
            text_file.write(line + '\n')
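
The cleaning steps are repeated for every page, so they could be moved into a helper function. The sketch below is one possible version. The name clean_transcript is hypothetical, the sample text is invented, and dropping empty lines is an extra step that the notebook itself does not perform.

```python
import re

# The notebook's pattern, written as a raw string and compiled once.
TAIL = re.compile(r"__ATA(.|\s)*:Like Loading...")


def clean_transcript(raw_text):
    """Remove the technical tail and return the non-empty lines."""
    text = TAIL.sub("", raw_text)
    # Extra step (not in the notebook): skip lines that are empty after stripping.
    return [line.strip() for line in text.split("\n") if line.strip()]


# Invented sample standing in for the text of one page.
sample = "Scene: A corridor.\n\nSheldon: Hello.\n__ATA.cmd.push();\nLike this:Like Loading..."
print(clean_transcript(sample))
```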
    
  

Finally, you can make a separate file for each season, so that it is convenient to analyze the transcript texts by season

In [None]:
seasons = {}

for i, url in enumerate(links_with_text):

    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"})
    soup = BeautifulSoup(response.text, 'html.parser')

    lines = []
    for div in soup.findAll('div', {'class': 'entrytext'}):
        t = div.get_text()
        lines.append(t.strip())

    lines = re.sub(r"__ATA(.|\s)*:Like Loading...", "", lines[0])
    lines = lines.split("\n")

    # now we check which season the page belongs to and add its lines to that season's list
    for j in range(1, 11):
        series = "series-" + str(j) + "-"
        # sometimes the links look like series-6-episode-,
        # and sometimes like series-06-episode-, so we check both forms
        series0 = "series-0" + str(j) + "-"
        if series in url or series0 in url:
            if j in seasons.keys():
                seasons[j].append(lines)
            else:
                seasons[j] = [lines]
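
The season number can also be read straight from the link with a regular expression, which covers both forms (series-6- and series-06-) without a loop over season numbers. This is only a sketch of that alternative. The first address is the real one used earlier, the other two are invented, and the addresses themselves are stored here in place of the transcript lines.

```python
import re
from collections import defaultdict

# One real address and two invented ones, with and without a leading zero.
sample_urls = [
    "https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/",
    "https://bigbangtrans.wordpress.com/series-06-episode-02-the-example-title/",
    "https://bigbangtrans.wordpress.com/series-10-episode-01-the-example-title/",
]

# Capture the digits between "series-" and "-episode-".
season_pattern = re.compile(r"/series-(\d+)-episode-")

# defaultdict creates an empty list the first time a season is seen.
seasons_by_number = defaultdict(list)
for sample_url in sample_urls:
    match = season_pattern.search(sample_url)
    if match:
        # int() turns both "6" and "06" into 6.
        seasons_by_number[int(match.group(1))].append(sample_url)

print(sorted(seasons_by_number))
```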
   


In [None]:
for i in range(1, 11):
    seasons[i] = [item for sublist in seasons[i] for item in sublist]  # flatten the list of lists into one list
    with open("season" + str(i) + ".txt", "w", encoding="utf-8") as text_file:
        for j in seasons[i]:
            text_file.write(j + '\n')  # add a newline so that each line of dialogue is on its own line

We open a few of the files to check. Everything worked, hooray!