This notebook covers:
* Regex basics
* Web scraping with BeautifulSoup
Reference links:
1. https://www.w3schools.com/python/python_regex.asp
2. https://regex101.com/
3. https://regexlearn.com/learn/regex101
4. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

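Note: on its own, `?` means "zero or one occurrence" of the preceding token, as described on w3schools. Placed directly after another quantifier (`*?`, `+?`, `{m,n}?`), it instead makes that quantifier lazy, so the shortest possible text is matched.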
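
A short sketch of the two roles of `?`. The sample strings are made up for illustration and do not come from the reference links.

```python
import re

# Role 1: on its own, `?` makes the preceding token optional (0 or 1 times).
optional = re.findall(r'colou?r', 'color colour')

# Role 2: after another quantifier, `?` makes that quantifier lazy,
# so `.+?` stops at the first possible '>' instead of the last one.
greedy = re.findall(r'<.+>', '<a><b>')
lazy = re.findall(r'<.+?>', '<a><b>')

print(optional)  # ['color', 'colour']
print(greedy)    # ['<a><b>']
print(lazy)      # ['<a>', '<b>']
```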

In [None]:
import pandas as pd
import numpy as np

## Regular expressions

In [None]:
from IPython.display import Image

The basic regex functionality covered below is taken from this excellent post:

https://www.machinelearningplus.com/python/python-regex-tutorial-examples/

In [1]:
import re

In [None]:
Image("../imgs/regex.png")

A regex pattern is written in a special language for describing generic text, numbers, or symbols, so that it can be used to extract text that conforms to the pattern.

Here `'\s'` matches any whitespace character. Adding `+` after it makes the pattern match one or more whitespace characters in a row. Because `\s` also covers tabs (`'\t'`) and newlines, the pattern matches those as well.

In [2]:
regex = re.compile(r'\s+')  # raw string keeps the backslash intact

**Splitting a string using regex**

In [3]:
text = "Hello World.   Regex is awesome"

In [4]:
" ".join(text.split())

'Hello World. Regex is awesome'

In [5]:
regex.split(text)

['Hello', 'World.', 'Regex', 'is', 'awesome']

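Another way is to call `re.split()` directly without compiling first. Both work, but a compiled regex is generally the better choice when the same pattern is used repeatedly.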

In [7]:
re.split(r'\s+', text)

['Hello', 'World.', 'Regex', 'is', 'awesome']

**re.findall**

The `findall` method extracts all occurrences of the pattern.

`'\d'` is a regular expression that matches any digit.

In [8]:
text = "101 howard street, 246 mcallister street"

In [12]:
regex_num = re.compile(r'\d+')  # one or more digits

In [13]:
regex_num.findall(text)

['101', '246']

In [14]:
regex_num.split(text)

['', ' howard street, ', ' mcallister street']

In [15]:
re.findall(r'\d+', text)

['101', '246']

**re.search() vs re.match()**

`regex.search()` returns a match object that contains the starting and ending positions of the **first occurrence of the pattern**.

Likewise, `regex.match()` also returns a match object. The difference is that it requires the pattern to be present at the **beginning of the text itself**. Both return `None` when there is no match, which is why calling `.group()` on the result can raise an `AttributeError`.

In [10]:
text2 = "MAT 20567567576  Mathematics 189"

In [16]:
m = regex_num.match(text2)
m

In [17]:
m.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [18]:
m.start()  # would return the start index of the match; fails here because m is None

AttributeError: 'NoneType' object has no attribute 'start'
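
The `AttributeError` above is raised because `match()` found no digits at the very start of the string and returned `None`. A minimal sketch of guarding against that, reusing the same sample string. The helper name `first_number` is hypothetical and not part of the `re` module.

```python
import re

regex_num = re.compile(r'\d+')
text2 = "MAT 20567567576  Mathematics 189"


def first_number(pattern, text, anchored=False):
    """Return the first matched text, or None when nothing matches."""
    # match() only succeeds at position 0; search() scans the whole string.
    m = pattern.match(text) if anchored else pattern.search(text)
    return m.group() if m is not None else None


anchored_result = first_number(regex_num, text2, anchored=True)
search_result = first_number(regex_num, text2)

print(anchored_result)  # None
print(search_result)    # 20567567576
```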

In [19]:
s = regex_num.search(text2)
s

<re.Match object; span=(4, 15), match='20567567576'>

In [None]:
s.group()

**Substituting one text by another using `regex.sub()`**

In [20]:
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""

In [21]:
regex = re.compile(r'\s+')

In [22]:
regex.sub(' ', text)  # replace every run of whitespace with a single space

'101 COM Computers 205 MAT Mathematics 189 ENG English'

In [23]:
# get rid of all extra spaces except newline
regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))

101 COM Computers
205 MAT Mathematics
189 ENG English
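
Note that `(?!\n)\s+` only checks that the first matched character is not a newline. If a line ended with trailing spaces, the `\s+` would still swallow the newline that follows them. A hedged alternative is the character class `[^\S\n]`, which reads as "whitespace other than newline". The sample string below is made up for illustration.

```python
import re

sample = "a  \t b  \nc"  # note the trailing spaces before the newline

lookahead_version = re.sub(r'(?!\n)\s+', ' ', sample)
char_class_version = re.sub(r'[^\S\n]+', ' ', sample)

print(repr(lookahead_version))   # 'a b c'    (newline lost)
print(repr(char_class_version))  # 'a b \nc'  (newline kept)
```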


**Combining regex patterns**

In [26]:
# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)

[('101', 'COM', 'Computers'), ('205', 'MAT', 'Mathematics'), ('189', 'ENG', 'English')]

**Greedy regex**

Regular expressions are greedy by default: a quantifier such as `*` or `+` consumes as much text as it can while still allowing the overall pattern to match, even when a shorter match would have been sufficient.

In [33]:
text = "< body>Regex Greedy Matching Example < /body>< /body>"
re.findall('<.*>', text)

['< body>Regex Greedy Matching Example < /body>< /body>']

The match should have stopped at the first `>`, but it ran on to the last one. To extract only the shorter portions, use lazy matching.

Lazy matching, on the other hand, ‘takes as little as possible’. It is enabled by adding a `?` directly after the quantifier, so `.*` becomes `.*?`.

In [34]:
re.findall('<.*?>', text)

['< body>', '< /body>', '< /body>']

In [29]:
s = re.search('<.*?>', text)  #getting only the first one

In [30]:
s.group()

'< body>'

In [31]:
text = '01, Jan 2015'

In [32]:
print(re.findall(r'\d{3}', text))  # exactly three consecutive digits

['201']


**Matching word boundaries**

The word boundary `\b` is commonly used to detect and match the beginning or end of a word. It matches a position where one side is a word character (`\w`) and the other side is a non-word character or the start or end of the string.

For example, the regex `\btoy` will match the ‘toy’ in ‘toy cat’ but not in ‘tolstoy’. To match the ‘toy’ in ‘tolstoy’, use `toy\b`.

Can you come up with a regex that will match only the first ‘toy’ in ‘play toy broke toys’? (Hint: `\b` on both sides.)

Likewise, `\B` matches any position that is not a word boundary.

For example, `\Btoy\B` will match a ‘toy’ that has word characters on both sides, as in ‘antoynet’.

In [None]:
re.findall(r'\btoy\b', 'play toy broke toys')

In [None]:
re.findall(r'\btoy', 'play toy broke toys')

In [None]:
re.findall(r'toy\b', 'play toy broke toys')

In [None]:
re.findall(r'\Btoy\b', 'playtoy broke toys')

In [None]:
re.findall(r'\Btoy\B', 'playtoybroke toys')

In [None]:
re.findall(r'\btoy', 'playtoybroke toys')
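
To see exactly which occurrence each pattern picks, one option is `re.finditer`, which yields match objects that carry their positions. A small sketch on the same sample sentence:

```python
import re

sentence = 'play toy broke toys'

patterns = [
    r'\btoy\b',  # whole word only: one match at (5, 8)
    r'\btoy',    # word start: (5, 8) and (15, 18)
    r'toy\b',    # word end: (5, 8)
    r'\Btoy',    # not at a word start: no match in this sentence
]

# Map each pattern to the (start, end) spans of its matches.
spans = {
    p: [m.span() for m in re.finditer(p, sentence)]
    for p in patterns
}

for p, found in spans.items():
    print(f'{p:<10} -> {found}')
```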

**Practice regex examples**

1. Extract the user ID, domain name, and suffix from the following email addresses.

In [None]:
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'), ('page33', 'google', 'com'),
                  ('jeff42', 'amazon', 'com')]

In [None]:
regex = re.compile(r'(\w+)@(\w+)\.(\w+)')  # escape the dot so it matches a literal '.'

In [None]:
regex.findall(emails)
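
A detail worth checking in patterns like this one: an unescaped `.` matches any character, so the separator between domain and suffix should be written `\.`. A quick sketch with a made-up string that contains no dot at all:

```python
import re

not_an_email = 'user@hostXcom'  # made-up input with no dot

unescaped = re.findall(r'(\w+)@(\w+).(\w+)', not_an_email)
escaped = re.findall(r'(\w+)@(\w+)\.(\w+)', not_an_email)

print(unescaped)  # [('user', 'hostXc', 'm')]
print(escaped)    # []
```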

2. Retrieve all the words starting with ‘b’ or ‘B’ from the following text.

In [None]:
text = """Betty bought a bit of butter, 
But the butter was so bitter, So she bought
some better butter, To make the bitter butter better."""

In [None]:
regex = re.compile(r'\b[bB]\w*')  # \b anchors the b/B to the start of a word

In [None]:
regex.findall(text)

3. Clean up the following sentence so that only the words remain, separated by single spaces.

In [None]:
sentence = """A, very   very; irregular_sentence"""
desired_output = "A very very irregular sentence"

In [None]:
regex = re.compile(r'[,\s;_]+')

In [None]:
' '.join(regex.split(sentence))

4. Clean up the following tweet so that only the user's message remains: remove URLs, hashtags, mentions, punctuation, and the RT and cc markers.

In [None]:
tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''

In [None]:
desired_output = 'Good advice What I would do differently if I was learning to code today'

In [None]:
import string

def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove RT and cc as whole words only
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape(string.punctuation),
                   '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse runs of whitespace
    return tweet.strip()  # drop leading and trailing spaces


print(clean_tweet(tweet))

# Web Scraping with BeautifulSoup

In [35]:
from bs4 import BeautifulSoup
import urllib.request
import requests

In [36]:
def download_html(url):
    # The with-block closes the response automatically, so no explicit close() is needed.
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')
    return html

In [37]:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=boxoffice_gross_us,desc&start=1'
html = download_html(url)

soup = BeautifulSoup(html, 'lxml') #html.parser

The code above downloads the whole first results page; the code below extracts only the movie blocks from it.

In [39]:
movie_blocks = soup.find_all('div', {'class': 'lister-item-content'})  # find_all is the current name for findAll

In [40]:
movie_blocks[0]

<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt1825683/">Chiến Binh Báo Đen</a>
<span class="lister-item-year text-muted unbold">(2018)</span>
</h3>
<p class="text-muted">
<span class="certificate">C16</span>
<span class="ghost">|</span>
<span class="runtime">134 min</span>
<span class="ghost">|</span>
<span class="genre">
Action, Adventure, Sci-Fi            </span>
</p>
<div class="ratings-bar">
<div class="inline-block ratings-imdb-rating" data-value="7.3" name="ir">
<span class="global-sprite rating-star imdb-rating"></span>
<strong>7.3</strong>
</div>
<div class="inline-block ratings-user-rating">
<span class="userRatingValue" data-tconst="tt1825683" id="urv_tt1825683">
<span class="global-sprite rating-star no-rating"></span>
<span class="rate" data-no-rating="Rate this" data-value="0" name="ur">Rate this</span>
</span>
<div class="starBarWidget" id="sb_tt1825683">
<div class="ratin

`find_all(arguments)` returns a list of all elements (`Tag` objects) matching the arguments. If there are no matches, it returns an empty list. It is useful when the target cannot be identified with a single lookup and you need to dig through several candidates before reaching the data you want.
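
A minimal sketch contrasting `find_all` with `find` on a small inline document, so it runs without any network access. The markup and class names below are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
html = """
<div class="item"><a href="/a">First</a></div>
<div class="item"><a href="/b">Second</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')

items = soup.find_all('div', {'class': 'item'})      # every match, in document order
missing = soup.find_all('div', {'class': 'absent'})  # no match -> empty result
first = soup.find('div', {'class': 'item'})          # only the first match
nothing = soup.find('div', {'class': 'absent'})      # no match -> None

print(len(items), len(missing))  # 2 0
print(first.a.get_text())        # First
print(nothing)                   # None
```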

> Let's examine one of the extracted blocks to identify the elements that we need to scrape.

In [41]:
movie_blocks[0].find('span', {'class': 'lister-item-year'}).contents

['(2018)']

In [42]:
import re
year = re.compile(r"\d+")
year.search(movie_blocks[0].find('span',{'class': 'lister-item-year'}).contents[0]).group()

'2018'

In [43]:
mname = movie_blocks[0].find('a').get_text() # Name of the movie

m_reyear = int(movie_blocks[0].find('span',{'class': 'lister-item-year'}).contents[0][1:-1]) # Release year

m_rating = float(movie_blocks[0].find('div',{'class':'inline-block ratings-imdb-rating'}).get('data-value')) #rating

m_mscore = float(movie_blocks[0].find('span',{'class':'metascore favorable'}).contents[0].strip()) #meta score

m_votes = int(movie_blocks[0].find('span',{'name':'nv'}).get('data-value')) # votes

print("Movie Name: " + mname,
      "\nRelease Year: " + str(m_reyear),
      "\nIMDb Rating: " + str(m_rating),
      "\nMeta score: " + str(m_mscore),
      "\nVotes: " + '{:,}'.format(m_votes)

)

Movie Name: Chiến Binh Báo Đen 
Release Year: 2018 
IMDb Rating: 7.3 
Meta score: 88.0 
Votes: 801,290


In [44]:
def scrape_mblock(movie_block):
    # Errors expected when an element is missing or its text is not numeric.
    # Catching only these avoids swallowing things like KeyboardInterrupt.
    parse_errors = (AttributeError, IndexError, TypeError, ValueError)

    movieb_data = {}

    try:
        movieb_data['name'] = movie_block.find('a').get_text()  # name of the movie
    except parse_errors:
        movieb_data['name'] = None

    try:
        movieb_data['year'] = str(movie_block.find('span', {'class': 'lister-item-year'}).contents[0][1:-1])  # release year
    except parse_errors:
        movieb_data['year'] = None

    try:
        movieb_data['rating'] = float(movie_block.find('div', {'class': 'inline-block ratings-imdb-rating'}).get('data-value'))  # rating
    except parse_errors:
        movieb_data['rating'] = None

    try:
        movieb_data['m_score'] = float(movie_block.find('span', {'class': 'metascore favorable'}).contents[0].strip())  # meta score
    except parse_errors:
        movieb_data['m_score'] = None

    try:
        movieb_data['votes'] = int(movie_block.find('span', {'name': 'nv'}).get('data-value'))  # votes
    except parse_errors:
        movieb_data['votes'] = None

    return movieb_data
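
One thing to be aware of in `scrape_mblock`: passing the two-class string `'metascore favorable'` matches only elements whose `class` attribute is exactly that string. If the site marks other scores with a different second class, those blocks fall into the `except` branch and get `None`. The sketch below illustrates the difference on made-up markup. The class name `mixed` and the value 51 are assumptions for illustration, not something confirmed by the page output shown earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the 'mixed' class and its value are assumed for illustration.
html = """
<span class="metascore favorable">88</span>
<span class="metascore mixed">51</span>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact two-class string: matches only the first span.
exact = soup.find_all('span', {'class': 'metascore favorable'})
# Single class name: matches any span that has 'metascore' among its classes.
any_metascore = soup.find_all('span', {'class': 'metascore'})

exact_text = [s.get_text() for s in exact]
any_text = [s.get_text() for s in any_metascore]

print(exact_text)  # ['88']
print(any_text)    # ['88', '51']
```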

> Next, the function below scrapes all movie blocks within a single search results page.

In [45]:
def scrape_m_page(movie_blocks):
    # Scrape every movie block on the page into a list of dicts.
    return [scrape_mblock(block) for block in movie_blocks]

> We now have functions to extract all movie data from a single page.

The next function applies them to each page of the search results in turn, until we have scraped data for the targeted number of movies.

In [47]:
import time     
import random as ran
def scrape_this(link,t_count):
    
    #from IPython.core.debugger import set_trace

    base_url = link
    target = t_count
    
    current_mcount_start = 0
    current_mcount_end = 0
    remaining_mcount = target - current_mcount_end 
    
    new_page_number = 1
    
    movie_data = []
    
    
    while remaining_mcount > 0:

        url = base_url + str(new_page_number)
        
        #set_trace()
        
        source = download_html(url)
        soup = BeautifulSoup(source, 'html.parser') # lxml
        
        movie_blocks = soup.findAll('div',{'class':'lister-item-content'})
        
        movie_data.extend(scrape_m_page(movie_blocks))   
        
        # Read the "<start>-<end> ..." range text from the nav bar once, then parse both ends.
        nav_text = soup.find("div", {"class":"nav"}).find("div", {"class": "desc"}).contents[1].get_text()
        range_parts = nav_text.split("-")

        current_mcount_start = int(range_parts[0])

        current_mcount_end = int(range_parts[1].split(" ")[0])

        remaining_mcount = target - current_mcount_end
        
        print('\r' + "currently scraping movies from: " + str(current_mcount_start) + " - "+str(current_mcount_end), "| remaining count: " + str(remaining_mcount), flush=True, end ="")
        
        new_page_number = current_mcount_end + 1
        
        time.sleep(ran.randint(0, 10))
    
    return movie_data
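
The loop advances by setting the next `start` value to one past the last movie on the current page. Assuming a fixed page size (50 per page is an assumption here), the sequence of `start` values can be sketched without any network access. The helper name `page_starts` is hypothetical.

```python
def page_starts(target, page_size=50):
    """Yield the 1-based `start` offsets needed to cover `target` results."""
    start = 1
    while start <= target:
        yield start
        start += page_size  # next page begins right after the current one


base = "https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=boxoffice_gross_us,desc&start="
starts = list(page_starts(150))
urls = [base + str(s) for s in starts]

print(starts)     # [1, 51, 101]
print(len(urls))  # 3
```

Unlike this sketch, `scrape_this` reads the actual range from each page, so it does not depend on the page size being constant.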

> Finally, we put together all the functions created above to scrape the top 150 movies on the list.

In [48]:
import pandas as pd
base_scraping_link = "https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=boxoffice_gross_us,desc&start="

top_movies = 150 #input("How many movies do you want to scrape?")

movies = scrape_this(base_scraping_link, int(top_movies))

# scrape_this leaves the cursor on its progress line, so start with a newline
# instead of '\r', which would overwrite part of that line.
print("\nList of top " + str(top_movies) + " movies:\n")
movies = pd.DataFrame(movies)
movies

currently scraping movies from: 101 - 150 | remaining count: 0
List of top 150 movies:



| | name | year | rating | m_score | votes |
|---|---|---|---|---|---|
| 0 | Chiến Binh Báo Đen | 2018 | 7.3 | 88.0 | 801290 |
| 1 | Avengers: Cuộc Chiến Vô Cực | 2018 | 8.4 | 68.0 | 1121664 |
| 2 | Gia Đình Siêu Nhân 2 | 2018 | 7.6 | 80.0 | 309100 |
| 3 | Thế Giới Khủng Long: Vương Quốc Sụp Đổ | 2018 | 6.1 | | 327060 |
| 4 | Aquaman: Đế Vương Atlantis | 2018 | 6.8 | | 490724 |
| ... | ... | ... | ... | ... | ... |
| 145 | Boy Erased | 2018 | 6.9 | 69.0 | 40286 |
| 146 | Khách Sạn Tội Phạm | 2018 | 6.1 | | 55576 |
| 147 | A-X-L Chú Chó Robot | 2018 | 5.3 | | 12545 |
| 148 | Run the Race | 2018 | 5.9 | | 1620 |


In [None]:
movies.to_csv('../datasets/movies.csv', index=False)