# Full Stack Data - Sens Critique - Prangon Ghose

### Data Scraping

#### Import Libraries

In [1]:
from bs4 import BeautifulSoup # pip install beautifulsoup4
import pandas as pd # pip install pandas
import numpy as np # pip install numpy
import time # standard library, no installation needed
import re # standard library -- used to remove unnecessary spaces in the movie titles

The data used in this project comes from the `Catalogue` page of the `Sens Critique` website, which lists 8000 films. The collected columns are `title`, `name of directors`, `genres`, and `Release Year`. Because the `Catalogue` page loads its data dynamically, `selenium` is used to simulate a browser environment.

#### Import Selenium

In [2]:
from selenium import webdriver # pip install selenium
from selenium.webdriver.chrome.options import Options # import options for chrome
from selenium.webdriver.common.by import By # import By
from selenium.webdriver.support.ui import WebDriverWait # import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC # import expected_conditions
from selenium.common.exceptions import TimeoutException # import TimeoutException
from selenium.common.exceptions import NoSuchElementException # import NoSuchElementException
from selenium.common.exceptions import ElementNotInteractableException # import ElementNotInteractableException
from selenium.webdriver.common.keys import Keys # import Keys
from selenium.webdriver.common.action_chains import ActionChains # import ActionChains

#### Initializing browser without GUI

In [3]:

# Set up the Chrome WebDriver with headless option
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode (no GUI)

prefs = {
  "profile.managed_default_content_settings.images": 2, 
  "profile.managed_default_content_settings.javascript": 2,
  "profile.managed_default_content_settings.stylesheet": 2
}

chrome_options.add_experimental_option("prefs", prefs) # apply the prefs above (value 2 = block images, JavaScript, and stylesheets)

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)


As the data for 8000 movies is dynamically loaded across 500 pages, looping through all the pages and waiting for the asynchronously loaded content is a must. To develop a machine learning model that determines movie popularity, the rating alone is not enough. The model must also know how many users rated the movie, how many bookmarked it, and how many loved it. Sometimes the duration of the movie also affects its popularity. That is why these features are necessary to predict a movie's popularity.

On the Sens Critique website, some of the necessary information is only available on the about page of each movie. Unfortunately, it is not possible to go back to the previous page on this website, and the later pages of the `catalogue` are not directly accessible. Therefore, each film's about page is opened in a new tab to collect this data.

**N.B. Collecting the data for 8000 movies and visiting each movie's about page might take a few minutes, but the result is worth it. The number of pages can be set anywhere between 1 and 500 to reduce the run time. Parallelization is not possible, as the catalogue pages are not directly accessible.**
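
The new-tab approach described above can be wrapped in a small helper. The following is only a sketch, not the code used in this notebook: `new_tab` is a hypothetical name, it relies on Selenium 4's `driver.switch_to.new_window('tab')`, and it assumes the about-page URL can be read from the title link's `href`, which has not been verified for this site.

```python
from contextlib import contextmanager

@contextmanager
def new_tab(driver, url):
    """Open `url` in a new tab, then always close it and return to the original tab."""
    main_handle = driver.current_window_handle  # remember the catalogue tab
    driver.switch_to.new_window('tab')          # Selenium 4: open and focus a new tab
    try:
        driver.get(url)
        yield driver
    finally:
        driver.close()                          # close the about-page tab
        driver.switch_to.window(main_handle)    # back to the catalogue tab

# Hypothetical usage, assuming `about_url` was read from the title link's href:
# with new_tab(driver, about_url) as tab:
#     film_about_soup = BeautifulSoup(tab.page_source, "html.parser")
```

Because the cleanup sits in `finally`, the catalogue tab is restored even if parsing the about page raises an exception.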

In [4]:
page = 1 # page number
df_films = pd.DataFrame() # initialize the dataframe

def remove_extra_spaces(input_string):
  # Using regular expression to replace multiple spaces with a single space as there are some titles with unnecessary spaces
  return re.sub(r'\s+', ' ', input_string).strip()

try: # try to load the page
  url_sc = 'https://www.senscritique.com/search?filters%5B0%5D%5Bidentifier%5D=universe&filters%5B0%5D%5Bvalue%5D=movie&size=16' # url of the page
  driver.get(url_sc) # load the page
  # Waiting for up to 10 seconds until the specific element is present in the DOM (the first page)
  element_present = EC.presence_of_element_located((By.CLASS_NAME, 'ExplorerProductCard__Container-sc-1l583m3-0')) # finding the element of film data holder
  WebDriverWait(driver, 10).until(element_present) # wait until the film data is present in the DOM
except TimeoutException: # if the page is not loaded
  print("Timed out waiting for page to load")

print('Scrapped Pages: ')

while page <= 50: # the total number of pages can be adjusted from 1-500
  # Parse the catalogue page that is currently loaded in the browser
  soup = BeautifulSoup(driver.page_source, "html.parser") # parse the page

  titles = []
  ratings = []
  infos = []
  genres = []
  users_rated = []
  users_bookmarked = []
  users_loved = []

  films = soup.find_all('div', attrs= {'data-testid': 'product-explorer-card'}) # find all the films

  for film in range(len(films)): # for each film
    if films[film].find('div', attrs={'data-testid': 'Rating'}) is not None:
      ratings.append(films[film].find('div', attrs={'data-testid': 'Rating'}).text) # find the rating
    else:
      ratings.append(np.nan) # if there is no rating, put NaN
    if films[film].find('a', attrs={'data-testid': 'product-explorer-title'}) is not None:
      title = films[film].find('a', attrs={'data-testid': 'product-explorer-title'}).text # find the title
      title = remove_extra_spaces(title) # remove unnecessary spaces
      titles.append(title) # append the title
    else:
      titles.append(np.nan) # if there is no title, put NaN
    if films[film].find('p', attrs={'data-testid': 'product-explorer-genre'}) is not None:
      genre = films[film].find('p', attrs={'data-testid': 'product-explorer-genre'}).text # find the genre
      genres.append(genre)
    else:
      genres.append(np.nan) # if there is no genre, put NaN

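    try:
      # Attempt to find the title link and open it in a new tab (Ctrl+click)
      current_title = titles[-1]
      if not isinstance(current_title, str): # the card had no title, so there is no link to click
        raise NoSuchElementException("no title to click")
      element = driver.find_element(By.LINK_TEXT, current_title)
      if not (element.is_displayed() and element.is_enabled()):
        raise ElementNotInteractableException(current_title)
      ActionChains(driver).key_down(Keys.CONTROL).click(element).key_up(Keys.CONTROL).perform()
      driver.switch_to.window(driver.window_handles[1]) # switch to the new tab (IndexError if no tab opened)
    except (NoSuchElementException, ElementNotInteractableException, IndexError):
      # The about page could not be opened: record NaN for the statistics and fall back to the catalogue card for the info
      users_rated.append(np.nan)
      users_bookmarked.append(np.nan)
      users_loved.append(np.nan)
      creator = films[film].find('p', attrs={'data-testid': 'product-explorer-creator'})
      infos.append(creator.text if creator is not None else np.nan) # if there is no info, put NaN
      continue # skip the about-page steps so nothing is appended twice and the main tab is not closed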

    try:
      # Waiting for up to 30 seconds until the statistics block is present in the DOM (the film's about page)
      element_present = EC.presence_of_element_located((By.CLASS_NAME, 'CoverProductInfos__WrapperStats-sc-1un0kh1-2')) # finding the element of film data holder
      WebDriverWait(driver, 30).until(element_present) # wait until the film data is present in the DOM
    except TimeoutException: # if the page is not loaded
      print("Timed out waiting for page to load")
    film_about_soup = BeautifulSoup(driver.page_source, "html.parser") # parse the page of the film
    users = [] # reset for every film so that counts from the previous film are never reused
    if film_about_soup.find('p', class_ = "Text__SCText-sc-1aoldkr-0") is not None:
      users = film_about_soup.find_all('p', class_ = "Stats__Text-sc-1u6v943-2") # find the number of users

    if len(users) == 3:
      users_rated.append(users[0].text) # append the number of users rated
      users_bookmarked.append(users[1].text) # append the number of users bookmarked
      users_loved.append(users[2].text) # append the number of users loved
    elif len(users) == 2:
      users_rated.append(users[0].text)
      users_bookmarked.append(users[1].text)
      users_loved.append(np.nan)
    elif len(users) == 1:
      users_rated.append(users[0].text)
      users_bookmarked.append(np.nan)
      users_loved.append(np.nan)
    else:
      users_rated.append(np.nan)
      users_bookmarked.append(np.nan)
      users_loved.append(np.nan)

    if film_about_soup.find('p', class_ = 'Creators__Text-sc-1ghc3q0-0') is not None:
      infos.append(film_about_soup.find('p', class_ = 'Creators__Text-sc-1ghc3q0-0').text) # find the info
    else:
      infos.append(np.nan) # if there is no info, put NaN

    driver.close() # close the tab
    driver.switch_to.window(driver.window_handles[0]) # switch to the main tab

  df_extracted_films = pd.DataFrame({'titles': titles, 'ratings': ratings, 'infos': infos, 'genres': genres, 'users_rated': users_rated, 'users_bookmarked': users_bookmarked, 'users_loved': users_loved}) # create a dataframe of the extracted films
  df_films = pd.concat([df_films, df_extracted_films], ignore_index=True) # concatenate the extracted films to the main dataframe
  if page % 50 != 0:
    print(page, end = ',')
  else:
    print(page, end = '\n')
  page += 1 # increment the page number
  try:
    # Move to the next catalogue page. The explicit wait below is left commented out; the fixed 0.5 s pause used instead adds more than 4 minutes over 500 pages.
    # element_present = EC.presence_of_element_located((By.CLASS_NAME, 'ExplorerProductCard__Container-sc-1l583m3-0'))
    # WebDriverWait(driver, 30).until(element_present)
    if page < 501: # the total number of pages is 500
      path = "//span[@data-testid='click-" + str(page) + "']" # find the next page path
      next_button = driver.find_element(By.XPATH, path) # find the next page button
      next_button.click() # simulate a click on the next page button
      time.sleep(0.5) # wait for 0.5 seconds
  except (NoSuchElementException, TimeoutException): # find_element raises NoSuchElementException, not TimeoutException
    print("Could not find the button for page", page)
    break # stop instead of scraping the same page again

Scrapped Pages: 
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
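
The `if`/`elif` chain in the scraping cell that handles one, two, or three statistics can also be expressed as padding a list to a fixed length. This is a minimal sketch: `pad_stats` is a hypothetical helper name, not part of the notebook.

```python
import math

def pad_stats(values, size=3, fill=math.nan):
    """Return exactly `size` items: the first `size` values, padded with `fill`."""
    values = list(values)[:size]
    return values + [fill] * (size - len(values))

# Only two statistics were found on the about page: the third becomes NaN
rated, bookmarked, loved = pad_stats(["246.8K", "23.9K"])
print(rated, bookmarked, loved)
```

Note that, unlike the original chain, this sketch keeps the first three values when more than three are found instead of recording NaN for all of them.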

### Data Cleaning

Drop duplicate rows and rows with missing values, then reset the index.

In [5]:
df_films = df_films.drop_duplicates() # drop the duplicates
df_films = df_films.dropna() # drop the NaN values
df_films = df_films.reset_index(drop=True) # reset the index
df_films

| | titles | ratings | infos | genres | users_rated | users_bookmarked | users_loved |
|---|---|---|---|---|---|---|---|
| 0 | Fight Club | 8.1 | Film de David Fincher · 2 h 19 min · 10 novemb... | Drame | 246.8K | 23.9K | 14.1K |
| 1 | Pulp Fiction | 8.3 | Film de Quentin Tarantino · 2 h 34 min · 26 oc... | Gangster, Comédie | 228.4K | 27.7K | 12.4K |
| 2 | Inception | 7.5 | Film de Christopher Nolan · 2 h 28 min · 21 ju... | Action, Thriller, Science-fiction | 219.3K | 22.6K | 8.9K |
| 3 | Inglourious Basterds | 7.4 | Film de Quentin Tarantino · 2 h 33 min · 19 ao... | Drame, Guerre | 180.6K | 15.8K | 6.4K |
| 4 | Avatar | 6.4 | Film de James Cameron · 2 h 42 min · 16 décemb... | Action, Aventure, Science-fiction | 157.1K | 11.8K | 3.4K |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | Les Triplettes de Belleville | 7 | Long-métrage d'animation de Sylvain Chomet · 1... | Animation, Comédie | 22.4K | 3.9K | 1.1K |
| 796 | The Host | 7 | Film de Bong Joon-Ho · 1 h 59 min · 22 novembr... | Action, Épouvante-Horreur, Science-fiction | 20.4K | 6.1K | 1.2K |
| 797 | The Dictator | 5.5 | Film de Larry Charles · 1 h 23 min · 20 juin 2... | Comédie | 25K | 1.7K | 277 |
| 798 | Valse avec Bachir | 7.7 | Documentaire d'animation de Ari Folman · 1 h 3... | Biopic, Drame, Guerre, Animation | 17.5K | 8.9K | 1.5K |


Extract `Directors`, `Duration`, and `Release Date` from the `infos` column.

In [6]:
df_films['Directors'] = df_films['infos'].str.extract(r'.+ de (.*?) ·') # extract the directors
df_films['Duration'] = df_films['infos'].str.extract(r'· (.*?) ·') # extract the duration
df_films['Release Date'] = df_films['infos'].str.extract(r'· \d+ h \d+ min · (.+?) \(') # extract the release date
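
To make the three patterns concrete, the sketch below applies them to single strings with the standard `re` module. The sample strings are assumptions: the first is reconstructed from the truncated `infos` value for Fight Club, with a hypothetical country in parentheses, and the second is an invented entry with no hour component.

```python
import re

DIRECTOR_RE = r'.+ de (.*?) ·'
DURATION_RE = r'· (.*?) ·'
RELEASE_RE = r'· \d+ h \d+ min · (.+?) \('

def first_group(pattern, text):
    """Return the first capture group, or None when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Reconstructed sample; the "(France)" suffix is hypothetical
sample = "Film de David Fincher · 2 h 19 min · 10 novembre 1999 (France)"
director = first_group(DIRECTOR_RE, sample)
duration = first_group(DURATION_RE, sample)
release = first_group(RELEASE_RE, sample)
print(director, duration, release, sep=" | ")

# A duration without hours still yields a Duration, but no Release Date,
# because the third pattern requires the "<n> h <n> min" form.
short = "Film de Jane Doe · 52 min · 3 mai 2001 (France)"  # invented entry
short_duration = first_group(DURATION_RE, short)
short_release = first_group(RELEASE_RE, short)
```

Rows whose duration has no hour component would therefore end up with a missing `Release Date`.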

Drop the `infos` column.

In [7]:
df_films.drop(columns=['infos'], inplace=True) # drop the infos column

Convert `Duration` into `Duration (min)` so that all durations share the same unit for the future machine learning model.

In [8]:
def convert_duration(x): # convert the duration to minutes
  if pd.isna(x): # missing value (the regex did not match): leave as is
    return x
  if 'h' in x and 'min' in x: # hours and minutes, e.g. "2 h 19 min"
    return int(x.split('h')[0]) * 60 + int(x.split('h')[1].split('min')[0])
  elif 'h' in x: # hours only
    return int(x.split('h')[0]) * 60
  elif 'min' in x: # minutes only
    return int(x.split('min')[0])
  else: # bare number
    return int(x)

df_films['Duration (min)'] = df_films['Duration'].apply(convert_duration) # apply the convert_duration function to the duration column
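
As an alternative to the chained `split` calls, the same conversion can be sketched with a single regular expression. `duration_to_minutes` is a hypothetical name, and the accepted formats (`2 h 19 min`, `2 h`, `52 min`) are assumed from the branches of `convert_duration`.

```python
import re

# Optional "<n> h" part followed by an optional "<n> min" part
DURATION_PATTERN = re.compile(r'^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*$')

def duration_to_minutes(text):
    """Return the duration in minutes, or None if the text has an unexpected form."""
    match = DURATION_PATTERN.match(text)
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes

fight_club = duration_to_minutes("2 h 19 min")
minutes_only = duration_to_minutes("52 min")
hours_only = duration_to_minutes("2 h")
print(fight_club, minutes_only, hours_only)
```

Returning `None` for unexpected input, rather than raising, keeps one malformed row from stopping the whole `apply` call.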

In [9]:
df_films.drop(columns=['Duration'], inplace=True) # drop the duration column

Convert the abbreviated user counts (e.g. `246.8K`) into numbers.

In [10]:
def convert_thousands(x):
  if 'K' in x:
    return float(x.split('K')[0]) * 1000
  else:
    return float(x)
  
df_films['users_rated'] = df_films['users_rated'].apply(convert_thousands) # apply the convert_thousands function to the users_rated column
df_films['users_bookmarked'] = df_films['users_bookmarked'].apply(convert_thousands) # apply the convert_thousands function to the users_bookmarked column
df_films['users_loved'] = df_films['users_loved'].apply(convert_thousands) # apply the convert_thousands function to the users_loved column
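
Multiplying a float by 1000 can introduce small rounding errors for some inputs. A possible variant, sketched below with the hypothetical name `parse_count`, uses `decimal.Decimal` and returns an integer. It only handles the `K` suffix seen in the data; any other suffix would raise an error.

```python
from decimal import Decimal

def parse_count(text):
    """Convert strings such as '246.8K' or '277' into an integer count."""
    text = text.strip()
    if text.endswith('K'):
        return int(Decimal(text[:-1]) * 1000)  # exact decimal arithmetic
    return int(Decimal(text))

rated = parse_count("246.8K")
bookmarked = parse_count("25K")
loved = parse_count("277")
print(rated, bookmarked, loved)
```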

Translate the month names in `Release Date` from French to English so that the column can be converted to `DateTime`.

In [11]:
def translate_month_from_french(x):
  month_dict = { # create a dictionary of months
    'janvier': 'January',
    'février': 'February',
    'mars': 'March',
    'avril': 'April',
    'mai': 'May',
    'juin': 'June',
    'juillet': 'July',
    'août': 'August',
    'septembre': 'September',
    'octobre': 'October',
    'novembre': 'November',
    'décembre': 'December'
  }

  if pd.isna(x): # missing value: leave as is
    return x
  month_list = x.split(' ')
  if len(month_list) == 3: # if variable has day, month, and year
    month_list[1] = month_dict.get(month_list[1], month_list[1]) # unknown month names are left unchanged
  elif len(month_list) == 2: # if variable has month and year
    month_list[0] = month_dict.get(month_list[0], month_list[0])
  elif len(month_list) == 1 and month_list[0] in month_dict: # if variable has only month
    month_list[0] = month_dict[month_list[0]]
  # otherwise the variable has only a year and is left unchanged
  return ' '.join(month_list)
  
df_films['Release Date'] = df_films['Release Date'].apply(translate_month_from_french) # apply the translate_month_from_french function to the Release Date column

In [12]:
df_films['Release Date'] = pd.to_datetime(df_films['Release Date'], format='%d %B %Y', errors='coerce') # convert the Release Date column to datetime
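
Parsing month names with `%B` generally depends on the active locale, so an alternative is to map the French month names straight to month numbers. The sketch below uses only the standard library; `parse_french_date` is a hypothetical helper, and, like the `format='%d %B %Y'` call with `errors='coerce'`, it yields no date for incomplete values.

```python
from datetime import date

FRENCH_MONTHS = {
    'janvier': 1, 'février': 2, 'mars': 3, 'avril': 4,
    'mai': 5, 'juin': 6, 'juillet': 7, 'août': 8,
    'septembre': 9, 'octobre': 10, 'novembre': 11, 'décembre': 12,
}

def parse_french_date(text):
    """Parse 'day month year' written in French; return None for anything else."""
    parts = text.split()
    if len(parts) != 3:
        return None
    day, month, year = parts
    if month not in FRENCH_MONTHS or not (day.isdigit() and year.isdigit()):
        return None
    try:
        return date(int(year), FRENCH_MONTHS[month], int(day))
    except ValueError:  # e.g. day 31 in a 30-day month
        return None

release = parse_french_date("10 novembre 1999")
partial = parse_french_date("août 2009")
print(release, partial)
```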

Display the cleaned dataframe.

In [13]:
df_films

| | titles | ratings | genres | users_rated | users_bookmarked | users_loved | Directors | Release Date | Duration (min) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Fight Club | 8.1 | Drame | 246800.0 | 23900.0 | 14100.0 | David Fincher | 1999-11-10 | 139 |
| 1 | Pulp Fiction | 8.3 | Gangster, Comédie | 228400.0 | 27700.0 | 12400.0 | Quentin Tarantino | 1994-10-26 | 154 |
| 2 | Inception | 7.5 | Action, Thriller, Science-fiction | 219300.0 | 22600.0 | 8900.0 | Christopher Nolan | 2010-07-21 | 148 |
| 3 | Inglourious Basterds | 7.4 | Drame, Guerre | 180600.0 | 15800.0 | 6400.0 | Quentin Tarantino | 2009-08-19 | 153 |
| 4 | Avatar | 6.4 | Action, Aventure, Science-fiction | 157100.0 | 11800.0 | 3400.0 | James Cameron | 2009-12-16 | 162 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | Les Triplettes de Belleville | 7 | Animation, Comédie | 22400.0 | 3900.0 | 1100.0 | Sylvain Chomet | 2003-06-11 | 80 |
| 796 | The Host | 7 | Action, Épouvante-Horreur, Science-fiction | 20400.0 | 6100.0 | 1200.0 | Bong Joon-Ho | 2006-11-22 | 119 |
| 797 | The Dictator | 5.5 | Comédie | 25000.0 | 1700.0 | 277.0 | Larry Charles | 2012-06-20 | 83 |
| 798 | Valse avec Bachir | 7.7 | Biopic, Drame, Guerre, Animation | 17500.0 | 8900.0 | 1500.0 | Ari Folman | 2008-06-25 | 90 |


The dataframe is now ready for analysis and for building a machine learning model. The next step is to determine which features to include in the model. In my opinion, the features that can be used are `ratings`, `users_rated`, `users_bookmarked`, `users_loved`, `Duration (min)`, and `Release Date`. To accommodate this many features, I will use the Random Forest algorithm.

In [14]:
df_films.to_json('sens-critique.json', orient='records') # save the dataframe to json file