## Assignment 1
### Author: Ilya Grigorev, DS-01

In this assignment, I demonstrate a web scraping pipeline that scrapes data from a specific website, cleans it, and stores it in a remote database. Additionally, a visualization frontend with customized features presents the captured information in a user-friendly format.

## Install necessary dependencies

In [1]:
!pip install bs4
!pip install pymongo

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2
Collecting pymongo
  Downloading pymongo-4.11.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Downloading pymongo-4.11.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Installing collected packages: pymongo
Successfully installed pymongo-4.11.1


## Import modules

In [88]:
from urllib.request import urlopen
from urllib.parse import urlparse
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
from typing import Tuple, List, TypedDict, Optional, Any, NamedTuple
import re
from pymongo import MongoClient
import json

## Setup for scraping and cleaning

In [None]:
TARGET_URL = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
BASE_URL = '://'.join(urlparse(TARGET_URL)[:2])


def extract_html(url: str) -> Any:
    """
    Retrieves html content of a webpage.

    Arguments:
        url (str): url of a webpage.
    
    Returns:
        HTTP response object that BeautifulSoup can read, or None if the request failed.
    """
    try:
        return urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("The server could not be found")
    return None


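def exclude_refs(name: str) -> str:
    """
    Removes trailing reference markers from a string, e.g. "New York [1]" -> "New York".

    Arguments:
        name (str): string that may end with reference markers.

    Returns:
        string cut at the first '[' with surrounding whitespace trimmed.
    """
    pos = name.find('[')
    # Cut at the first bracket and trim the whitespace left before it
    return name[:pos].strip() if pos > 0 else name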
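
`exclude_refs` cuts the string at the first bracket, which is enough when reference markers only appear at the end. As an alternative sketch, the hypothetical helper `strip_refs` below (not part of the original notebook) removes every bracketed marker with a regular expression, so any text that follows a marker is kept.

```python
import re


def strip_refs(text: str) -> str:
    """Remove every bracketed marker such as "[1]" or "[nb 2]" from a string."""
    without_refs = re.sub(r"\[[^\]]*\]", "", text)
    # Collapse the whitespace left behind by removed markers
    return re.sub(r"\s+", " ", without_refs).strip()


cleaned_city = strip_refs("New York [1]")
cleaned_name = strip_refs("First Director[2][3]")
cleaned_prefixed = strip_refs("[a] Title")
```

Unlike a cut at the first bracket, this variant also handles a marker at the very start of the string.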

### Extracting main page

In [None]:
if (html := extract_html(TARGET_URL)) is not None:
    main_page = BeautifulSoup(html, 'html.parser')

### Locating table with target information

In [None]:
highest_grossing_films = main_page.find('table', {'class': 'wikitable plainrowheaders sticky-header col4right col5center col6center'})
assert highest_grossing_films is not None   # actually present
assert len(highest_grossing_films.find_all('tr')) != 0  # has rows

### Data representation

In [None]:
class FilmRecord(TypedDict):
    """
    Typed dictionary for representing a film.

    Attributes:
        title (str): title of a film.
        release_year (int or None): release year of a film.
        director (str or None): director(s) of a film, separated by semicolons.
        box_office (float or None): box office revenue of a film.
        country (str or None): country or countries of origin of a film, separated by semicolons.
    """
    title: str
    release_year: Optional[int]
    director: Optional[str]
    box_office: Optional[float]
    country: Optional[str]
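
To illustrate how such a record behaves, here is a minimal sketch with made-up placeholder values (they are not scraped data). It assumes only the `FilmRecord` definition above, repeated so the snippet runs on its own. A `TypedDict` instance is a plain `dict` at runtime, which is why it can be passed to `json` or to a document database unchanged.

```python
import json
from typing import List, Optional, TypedDict


class FilmRecord(TypedDict):
    title: str
    release_year: Optional[int]
    director: Optional[str]
    box_office: Optional[float]
    country: Optional[str]


# Made-up placeholder values, used only for illustration
record = FilmRecord(
    title="Example Film",
    release_year=2000,
    director="First Director;Second Director",
    box_office=1080000.0,
    country=None,
)

is_plain_dict = type(record) is dict   # TypedDict adds no runtime wrapper
serialized = json.dumps(record)        # None values become JSON null
# Semicolon-separated fields can be split back into lists when needed
directors: List[str] = record["director"].split(";") if record["director"] else []
```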

## Main extracting and cleaning functions

In [None]:
def parse_revenue(revenue: str) -> Optional[float]:
    """
    Extracts the numeric value and the order of magnitude from the revenue string
    and combines them into a single number.

    Arguments:
        revenue (str): revenue in the format "{currency}{number} {order}", e.g. "$1.08 million".

    Returns:
        converted revenue value, e.g. 1080000.0, or None if the string cannot be parsed.
    """
    revenue = exclude_refs(revenue)  # clean the string
    # Match the number and the order, ignoring the currency sign and any trailing text
    match = re.search(r"(\d[\d.,]*)\s*(million|billion)", revenue)
    if match is None:
        print(f"Unresolved revenue format: {revenue}")
        return None
    value, order = match.groups()
    multiplier = 1_000_000 if order == 'million' else 1_000_000_000
    # Rounding removes floating-point noise introduced by the multiplication
    return round(float(value.replace(',', '')) * multiplier, 2)


def parse_film_page(film_url: str) -> Optional[Tuple[Optional[str], Optional[float], Optional[str]]]:
    """
    Parses information from a film page. Specifically, finds the directors, box office revenue, and countries of origin.

    Arguments:
        film_url (str): url of a film.

    Returns:
        tuple of directors string, box office revenue, and countries string (each may be None if not found),
        or None if the page or its infobox was not retrieved.
    """

    # Retrieving webpage
    if (html := extract_html(film_url)) is None:
        return None
    film_information = BeautifulSoup(html, 'html.parser').find('table', {'class': 'infobox vevent'})
    if film_information is None:
        print(f'Film (url={film_url}): infobox is not found')
        return None

    # Directors information retrieval
    directors_row_element = film_information.find(lambda tag: re.match(r'^\s*[Dd]irected\s+by', tag.text) is not None)
    if directors_row_element is not None:
        # Type 1 format
        if directors_row_element.find('div', {'class': 'plainlist'}) is not None:
            directors_list_element = directors_row_element.find_all('div', {'class': 'plainlist'})[-1]
            # organize into a single string separated by ';'
            directors = ';'.join(exclude_refs(director_element.text) for director_element in directors_list_element.find_all('li'))
        else:
            # Type 2 format
            annotated_text = directors_row_element.find_all()[1].get_text(strip=True, separator='\n')
            directors_list = re.split(r"\[[^\]]*\]|\n", annotated_text)  # split on bracketed references and newlines
            # drop empty fragments and organize into a single string separated by ';'
            directors = ';'.join(name.strip() for name in directors_list if name.strip())
    else:
        print(f'Film (url={film_url}): directors list is not found')
        directors = None

    # Box office revenue retrieval
    box_office_revenue_row_element = film_information.find(lambda tag: re.match(r'^\s*[Bb]ox\s+office', tag.text) is not None)
    if box_office_revenue_row_element is not None:
        if (box_office_revenue_element:=box_office_revenue_row_element.find('td', {'class': 'infobox-data'})) is not None:
            box_office_revenue = parse_revenue(box_office_revenue_element.text.strip())     # parsing revenue string
        else:
            print(f'Film (url={film_url}): box revenue is not found')
            box_office_revenue = None
    else:
        print(f'Film (url={film_url}): box revenue is not found')
        box_office_revenue = None

    # Countries information retrieval
    countries_row_element = film_information.find(lambda tag: re.match(r'^\s*(?:[Cc]ountry|[Cc]ountries)', tag.text) is not None)
    if countries_row_element is not None:
        # Type 1 format
        if countries_row_element.find('div', {'class': 'plainlist'}) is not None:
            country_list_element = countries_row_element.find_all('div', {'class': 'plainlist'})[-1]
            countries = ';'.join(exclude_refs(country_element.text) for country_element in country_list_element.find_all('li')) # organize into a single string separated by ';'
        else:
            # Type 2 format
            annotated_text = countries_row_element.find_all()[1].get_text(strip=True, separator='\n')
            countries_list = re.split(r"\[[^\]]*\]|\n", annotated_text)  # split on bracketed references and newlines
            # drop empty fragments and organize into a single string separated by ';'
            countries = ';'.join(name.strip() for name in countries_list if name.strip())
    else:
        print(f'Film (url={film_url}): countries list is not found')
        countries = None

    return (directors, box_office_revenue, countries)
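
The "Type 2" branches above split the cell text on bracketed references and newlines. The sketch below isolates that step in a hypothetical helper, `split_names` (the name and the sample strings are assumptions, not part of the notebook), to show why empty fragments have to be filtered out: a reference directly followed by a newline produces an empty string between the two separators.

```python
import re
from typing import List


def split_names(annotated_text: str) -> List[str]:
    """Split infobox text on bracketed references and newlines, dropping empty parts."""
    parts = re.split(r"\[[^\]]*\]|\n", annotated_text)
    return [part.strip() for part in parts if part.strip()]


# Placeholder text imitating get_text(strip=True, separator='\n') output
raw_parts = re.split(r"\[[^\]]*\]|\n", "First Director[1]\nSecond Director")
names = split_names("First Director[1]\nSecond Director")
joined = ";".join(names)
```

Here `raw_parts` contains an empty string between the two names, while `names` holds only the two names.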

## Parsing and cleaning process

In [None]:
films: List[FilmRecord] = []    # retrieved data list
film_rows = highest_grossing_films.find_all('tr')[1:]   # rows of the table with films, except the header row

In [None]:
for i, row in enumerate(film_rows):
    # print(f"Row {i}: Started processing")
    elements = row.find_all(recursive=False)
    assert len(elements) == 6

    # Title collection
    title_element = elements[2].find('a')
    if title_element is None:
        # print(f"Row {i}: title element was not found, the row is excluded")
        continue
    title_link = title_element.attrs['href']
    title = title_element.text.strip()

    # Year collection
    year_text = elements[4].text.strip()
    if len(year_text) == 0:
        release_year = None
        print(f"Row {i}: release year is missing")
    else:
        try:
            release_year = int(year_text)
        except ValueError:
            release_year = None
            print(f"Row {i}: invalid year format: {year_text}")

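    # Moving to film page
    film_url = BASE_URL + title_link
    film_details = parse_film_page(film_url)
    if film_details is None:
        # parse_film_page returns None when the page could not be retrieved
        print(f"Row {i}: film page was not parsed, the row is excluded")
        continue
    director, box_office, country = film_details
    print(f"Row {i}: parsed a film")
    films.append(FilmRecord(title=title, release_year=release_year, director=director, box_office=box_office, country=country))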


Row 0: Started processing
Row 0: parsed a film
Row 1: Started processing
Row 1: parsed a film
Row 2: Started processing
Row 2: parsed a film
Row 3: Started processing
Row 3: parsed a film
Row 4: Started processing
Row 4: parsed a film
Row 5: Started processing
Row 5: parsed a film
Row 6: Started processing
Row 6: parsed a film
Row 7: Started processing
Row 7: parsed a film
Row 8: Started processing
Row 8: parsed a film
Row 9: Started processing
Row 9: parsed a film
Row 10: Started processing
Row 10: parsed a film
Row 11: Started processing
Row 11: parsed a film
Row 12: Started processing
Row 12: parsed a film
Row 13: Started processing
Row 13: parsed a film
Row 14: Started processing
Row 14: parsed a film
Row 15: Started processing
Row 15: parsed a film
Row 16: Started processing
Row 16: parsed a film
Row 17: Started processing
Row 17: parsed a film
Row 18: Started processing
Row 18: parsed a film
Row 19: Started processing
Row 19: parsed a film
Row 20: Started processing
Row 20: parse

## Database creation

In this assignment, I use MongoDB as the database. First, the database server provides remote access to the data without the need to host a server myself. Second, a document database fits this data well, as Python dicts / JSON records can be stored directly without specifying a fixed schema.

In [73]:
# Replace the placeholders with your actual MongoDB Atlas credentials
username = "pyclient"
password = "admin"
URL = "@mycluster.hszuy.mongodb.net/?retryWrites=true&w=majority&appName=MyCluster"

# Construct the MongoDB URI with authentication details
mongo_uri = f"mongodb+srv://{username}:{password}{URL}"

# Create a MongoClient object with the URI
client = MongoClient(mongo_uri)
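
As a hedged alternative to hardcoding credentials in the notebook, the sketch below reads them from environment variables and percent-encodes them before building the URI, since special characters such as `@` or `/` in a password would otherwise break the connection string. The variable names `MONGO_USERNAME` and `MONGO_PASSWORD`, the helper `build_mongo_uri`, and the example host are assumptions made for illustration.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(username: str, password: str, host_suffix: str) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}{host_suffix}"


# Hypothetical environment variable names; the fallbacks are dummy values
username = os.environ.get("MONGO_USERNAME", "example_user")
password = os.environ.get("MONGO_PASSWORD", "p@ss/word")
mongo_uri = build_mongo_uri(username, password, "@cluster.example.net/?retryWrites=true&w=majority")
```

The resulting string can be passed to `MongoClient` in the same way as the URI constructed above.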

## Creating database and documents

In [None]:
db = client["wikipedia"]
collection = db["highest_grossing"]
# Populate the collection only if it does not exist yet and there is data to insert
if "highest_grossing" not in db.list_collection_names() and films:
    collection.insert_many(films)   # single batch insert instead of one request per film
# collection.drop()   # uncomment to drop the collection

## Exporting to JSON

In [None]:
# Fetch all documents without the MongoDB _id field (ObjectId is not JSON serializable)
cursor = collection.find({}, {'_id': 0})

# Converting cursor to the list of dictionaries
list_cur = list(cursor)

# Convert to json string
json_data = json.dumps(list_cur, indent=4, ensure_ascii=False)

# Save to a file
with open('data.json', 'w', encoding='utf-8') as f:
    f.write(json_data)
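
A quick way to check an export like this is to read the file back and compare it with the data that was written. The sketch below does this with made-up records and a temporary directory instead of the real `data.json`; it also shows that `ensure_ascii=False` keeps non-ASCII characters readable in the file.

```python
import json
import os
import tempfile

# Made-up records standing in for the exported film list
records = [
    {"title": "Example Film: Édition", "release_year": 2000, "box_office": 1080000.0},
    {"title": "Another Example", "release_year": None, "box_office": None},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")
    # Write the records the same way as the export cell
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)
    # Read the file back as text and parse it
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()
    loaded = json.loads(raw_text)

round_trip_ok = loaded == records       # nothing was lost or altered
keeps_unicode = "Édition" in raw_text   # stored as-is, not as \u escapes
```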