# HW04 IMDB Web Scraping with gazpacho

This Python notebook uses the **gazpacho** package to scrape https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc and obtain each film's title, rank, release year, and rating.

The scraped data is then processed with **string methods** and converted into a dataframe with **pandas**.


In [1]:
# install gazpacho
!pip install gazpacho

Collecting gazpacho
  Downloading gazpacho-1.1.tar.gz (7.9 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: gazpacho
  Building wheel for gazpacho (pyproject.toml) ... done
  Created wheel for gazpacho: filename=gazpacho-1.1-py3-none-any.whl size=7461 sha256=e4bc3479ec8835d7f1c745b1fe0e94a37357cdbd54cbdd55e1f44449464ae0a3
  Stored in directory: /root/.cache/pip/wheels/9b/bf/9f/8c8849499462415fa5cdf0d9edb1103c189bdbece90c51488e
Successfully built gazpacho
Installing collected packages: gazpacho
Successfully installed gazpacho-1.1


In [2]:
# import libraries
import requests
from gazpacho import Soup

In [3]:
# get the HTML of the IMDb search page
url = "https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc"
html = requests.get(url)
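
The search URL carries two query parameters, `groups` and `sort`. As a minimal sketch, the same URL can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`, which makes it easier to change the parameters later. The names `BASE_URL` and `build_search_url` are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

# Hypothetical constant: the search endpoint without its query string.
BASE_URL = "https://www.imdb.com/search/title/"

def build_search_url(params):
    """Join the base URL and the encoded query parameters."""
    # safe="," keeps the comma in "user_rating,desc" unescaped,
    # so the result matches the URL used in this notebook.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"

url = build_search_url({"groups": "top_100", "sort": "user_rating,desc"})
print(url)
```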

In [4]:
# convert html.text to Soup object
imdb = Soup(html.text)

In [5]:
#get titles and ratings
titles = imdb.find("h3",{"class":"lister-item-header"})
ratings = imdb.find("div",{"class":"ratings-imdb-rating"})

In [6]:
# strip the HTML tags, keeping only the text of each title and rating
clean_titles = [title.strip() for title in titles]
clean_ratings = [rating.strip() for rating in ratings]

In [7]:
#check titles list
clean_titles[0:5]

['1. The Shawshank Redemption (1994)',
 '2. The Godfather (1972)',
 '3. The Dark Knight (2008)',
 "4. Schindler's List (1993)",
 '5. The Lord of the Rings: The Return of the King (2003)']

In [8]:
#check ratings list
clean_ratings[0:5]

['9.3', '9.2', '9.0', '9.0', '9.0']
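
To illustrate what the find-then-strip step does conceptually, here is a rough standard-library sketch that pulls the text out of an `h3` header. It is not how gazpacho is implemented, and the HTML fragment is a simplified, assumed version of one list item rather than markup copied from the page. The class name `HeaderText` is hypothetical.

```python
from html.parser import HTMLParser

class HeaderText(HTMLParser):
    """Collect the visible text of every <h3 class="lister-item-header">."""

    def __init__(self):
        super().__init__()
        self.in_header = False
        self.parts = []
        self.headers = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h3" and ("class", "lister-item-header") in attrs:
            self.in_header = True
            self.parts = []

    def handle_data(self, data):
        # keep non-empty text fragments found inside the header
        if self.in_header and data.strip():
            self.parts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_header:
            self.headers.append(" ".join(self.parts))
            self.in_header = False

# Assumed, simplified fragment for illustration only.
sample_html = """
<h3 class="lister-item-header">
  <span>1.</span>
  <a href="#">The Shawshank Redemption</a>
  <span>(1994)</span>
</h3>
"""

parser = HeaderText()
parser.feed(sample_html)
print(parser.headers)
```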

In [9]:
# split the string of each clean title
split_titles = [title.split() for title in clean_titles]

In [10]:
# drop the rank (first token) and the year (last token) from each title,
# leaving split_titles itself unchanged
split_film_names = [split[1:-1] for split in split_titles]

#preview split strings
split_film_names[0:5]


[['The', 'Shawshank', 'Redemption'],
 ['The', 'Godfather'],
 ['The', 'Dark', 'Knight'],
 ["Schindler's", 'List'],
 ['The', 'Lord', 'of', 'the', 'Rings:', 'The', 'Return', 'of', 'the', 'King']]

In [11]:
# extract a list of film names
film_names = [" ".join(name) for name in split_film_names]

#preview film names
film_names[0:10]

['The Shawshank Redemption',
 'The Godfather',
 'The Dark Knight',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 '12 Angry Men',
 'The Godfather Part II',
 'Pulp Fiction',
 'Fight Club',
 'The Lord of the Rings: The Fellowship of the Ring']

In [12]:
# extract a list of ranks
ranks = [split[0].replace(".","") for split in split_titles]

#preview ranks
ranks[0:5]


['1', '2', '3', '4', '5']

In [13]:
# extract a list of years
import re
years = [re.sub("[()]","",split[-1]) for split in split_titles]

#preview years
years[0:5]

['1994', '1972', '2008', '1993', '2003']
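
The three string-processing steps above (rank, film name, year) could also be sketched as a single regular expression. This is an alternative to the split-and-join approach, not the method used in this notebook, and it assumes every entry follows the `<rank>. <title> (<year>)` layout seen in `clean_titles`. The names `TITLE_PATTERN` and `parse_title` are hypothetical.

```python
import re

# Assumed layout: "<rank>. <title> (<4-digit year>)"
TITLE_PATTERN = re.compile(r"^(\d+)\.\s+(.+?)\s+\((\d{4})\)$")

def parse_title(text):
    """Return (rank, film_name, year) as strings, or None if the layout differs."""
    match = TITLE_PATTERN.match(text)
    return match.groups() if match else None

sample = [
    "1. The Shawshank Redemption (1994)",
    "6. 12 Angry Men (1957)",
]
parsed = [parse_title(title) for title in sample]
print(parsed)
```

Returning `None` for entries that do not match makes unexpected formats visible instead of silently producing a wrong rank or year.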

In [14]:
# finally, combine all the lists into a dataframe
import pandas as pd

data = {
    "Film's name": film_names,
    "Released year": years,
    "Rating": clean_ratings,
}

df = pd.DataFrame(data,index = ranks)
df.index.name = "Rank"

display(df)

Rank,Film's name,Released year,Rating
1,The Shawshank Redemption,1994,9.3
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
4,Schindler's List,1993,9.0
5,The Lord of the Rings: The Return of the King,2003,9.0
6,12 Angry Men,1957,9.0
7,The Godfather Part II,1974,9.0
8,Pulp Fiction,1994,8.9
9,Fight Club,1999,8.8
10,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
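
Because every value was built with string methods, the ranks, years, and ratings in the dataframe are still text. A possible follow-up, sketched here on three rows copied from the output above, is to cast them to numeric types with `astype` so they can be sorted and aggregated. This step is an addition for illustration and is not part of the original notebook.

```python
import pandas as pd

# Three sample rows taken from the output above; all values start as strings.
df = pd.DataFrame(
    {
        "Film's name": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
        "Released year": ["1994", "1972", "2008"],
        "Rating": ["9.3", "9.2", "9.0"],
    },
    index=["1", "2", "3"],
)
df.index.name = "Rank"

# Cast the text columns and the index to numeric types.
df["Released year"] = df["Released year"].astype(int)
df["Rating"] = df["Rating"].astype(float)
df.index = df.index.astype(int)

print(df.dtypes)
print(df["Rating"].max())
```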
