# KinoPoisk Top 250 Movies Scraper

This notebook uses Selenium to scrape the top 250 movies from KinoPoisk, a popular Russian movie database. It first collects each movie's title and URL, then enriches every record with details from the movie page: ratings, cast, review counts, press ratings, and user reviews.

---

## Step 1: Import Required Libraries

We start by importing the necessary libraries for web scraping and data handling:
- **Selenium**: For automating the browser to scrape data.
- **time**: To add delays between actions to allow the page to load.
- **json**: To store the scraped data in a structured format.

In [2]:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
import json

## Step 2: Configure Selenium WebDriver

We configure the Chrome WebDriver to run in **headless mode**, which means the browser window will not open during the scraping process.

In [3]:
chrome_options = Options()
chrome_options.add_argument("--headless")

## Step 3: Open KinoPoisk's Top 250 Movies Page

We initialize the Chrome WebDriver and navigate to the KinoPoisk Top 250 movies page. The `time.sleep(5)` call gives the page time to load before we start scraping. It is a fixed delay, so it does not guarantee that the page has finished loading.

In [9]:
service = Service('/usr/local/bin/chromedriver')
driver = Chrome(service=service, options=chrome_options)

In [46]:
url = "https://www.kinopoisk.ru/lists/movies/top250/"
driver.get(url)

time.sleep(5)
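
A fixed delay either wastes time or turns out too short. As a hedged alternative, the sketch below polls a condition until it holds or a timeout expires. `wait_until` and its parameters are hypothetical names, not part of this notebook or of Selenium; Selenium ships its own `WebDriverWait` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper: returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Assumed usage with the `driver` and `By` objects from this notebook:
# links = wait_until(
#     lambda: driver.find_elements(By.CSS_SELECTOR, ".base-movie-main-info_link__YwtP1")
# )
```

Because `find_elements` returns an empty list when nothing matches, the lambda is falsy until at least one movie link is present.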

In [47]:
movies_data = []
movies_data_full = []

## Step 4: Extract Movie Data

We loop through the pages of the KinoPoisk Top 250 list and extract the following details for each movie:

- **Title**: The name of the movie.
- **URL**: The link to the movie's page on KinoPoisk.

The script looks for a "Next" button to navigate through the pages. The loop ends when the button is marked as disabled or when the button cannot be found at all. In the run shown below, it ended with a no-such-element error.

In [None]:
while True:
    elements = driver.find_elements(By.CSS_SELECTOR, ".base-movie-main-info_link__YwtP1")
    for element in elements:
        movie_dict = {}
        movie_dict['title'] = element.text.split('\n')[0]  # Extract the title
        movie_dict['url'] = element.get_attribute("href")  # Extract the URL
        movies_data.append(movie_dict)
        movies_data_full.append(movie_dict)

    # Check if there is a "Next" button and click it
    try:
        next_button = driver.find_element(By.CSS_SELECTOR, '.styles_end__aEsmB')
        if 'disabled' in next_button.get_attribute('class'):
            break 
        next_button.click()
        time.sleep(5)
    except Exception as e:
        print("No more pages or error navigating:", e)
        break

No more pages or error navigating: Message: no such element: Unable to locate element: {"method":"css selector","selector":".styles_end__aEsmB"}
  (Session info: chrome=133.0.6943.99); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

In [49]:
print(movies_data)

[{'title': '1+1', 'url': 'https://www.kinopoisk.ru/film/535341/'}, {'title': 'Интерстеллар', 'url': 'https://www.kinopoisk.ru/film/258687/'}, {'title': 'Побег из Шоушенка', 'url': 'https://www.kinopoisk.ru/film/326/'}, {'title': 'Остров проклятых', 'url': 'https://www.kinopoisk.ru/film/397667/'}, {'title': 'Зеленая миля', 'url': 'https://www.kinopoisk.ru/film/435/'}, {'title': 'Бойцовский клуб', 'url': 'https://www.kinopoisk.ru/film/361/'}, {'title': 'Джентльмены', 'url': 'https://www.kinopoisk.ru/film/1143242/'}, {'title': 'Властелин колец: Возвращение короля', 'url': 'https://www.kinopoisk.ru/film/3498/'}, {'title': 'Форрест Гамп', 'url': 'https://www.kinopoisk.ru/film/448/'}, {'title': 'Властелин колец: Братство кольца', 'url': 'https://www.kinopoisk.ru/film/328/'}, {'title': 'Унесённые призраками', 'url': 'https://www.kinopoisk.ru/film/370/'}, {'title': 'Терминатор 2: Судный день', 'url': 'https://www.kinopoisk.ru/film/444/'}, {'title': 'Зеленая книга', 'url': 'https://www.kinopois
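
Before saving, it can be worth checking the collected list for repeats, since a page that had not changed yet after a click would be scraped twice. The sketch below is one way to do that; `dedupe_by_url` is a hypothetical helper, and the duplicated entry in the sample is made up for illustration.

```python
def dedupe_by_url(movies):
    """Return the movies with duplicate URLs removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for movie in movies:
        if movie["url"] not in seen:
            seen.add(movie["url"])
            unique.append(movie)
    return unique


# Sample data: the third entry is an artificial duplicate of the first.
sample = [
    {"title": "1+1", "url": "https://www.kinopoisk.ru/film/535341/"},
    {"title": "Интерстеллар", "url": "https://www.kinopoisk.ru/film/258687/"},
    {"title": "1+1", "url": "https://www.kinopoisk.ru/film/535341/"},
]
unique = dedupe_by_url(sample)
print(len(sample), "->", len(unique))  # 3 -> 2
```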

## Step 5: Save Data to JSON

After extracting all the movie details, we save the data into a structured JSON file named `movies_links.json`. This file will contain the titles and URLs of all the movies.

In [50]:
with open("movies_links.json", "w", encoding="utf-8") as json_file:
    json.dump(movies_data, json_file, ensure_ascii=False, indent=4)
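
The same save-then-reload pattern recurs after every later step. A minimal sketch of how it could be wrapped in two helpers follows; `save_checkpoint` and `load_checkpoint` are hypothetical names, and the demo writes to a temporary directory rather than to the notebook's files.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(data, path):
    """Write `data` as UTF-8 JSON, keeping Cyrillic text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


def load_checkpoint(path):
    """Read back a file written by save_checkpoint."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round-trip demo in a throwaway directory.
sample = [{"title": "Интерстеллар", "url": "https://www.kinopoisk.ru/film/258687/"}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "movies_links.json"
    save_checkpoint(sample, path)
    restored = load_checkpoint(path)
```

Checkpointing after each step means a failed scraping pass can restart from the last saved file instead of from the beginning.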

## Step 6: Extract Additional Movie Details

- This cell enriches the movie data by extracting additional details (e.g., genre, director, release year, etc.) from each movie's individual page on KinoPoisk.
- The extracted details are stored in the `movies_data_full` list, which will later be saved to a JSON file for further analysis or use.

In [None]:
for movie in movies_data_full:
    url = movie['url']
    driver.get(url)
    time.sleep(10)

    try:
        data_elements = driver.find_elements(By.CSS_SELECTOR, ".styles_rowLight__P8Y_1")
        if len(data_elements) == 0:
            data_elements = driver.find_elements(By.CSS_SELECTOR, ".styles_rowDark__ucbcz")

        movie_details = {}

        # Process each data element
        for element in data_elements:
            text = element.text  # Get the text of the element
            parts = text.split('\n')  
            if len(parts) >= 2: 
                key = parts[0].strip()  
                value = parts[1].strip()
                movie_details[key] = value 

        movie['details'] = movie_details
    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")




In [52]:
with open("movies_details_full.json", "w", encoding="utf-8") as json_file:
    json.dump(movies_data_full, json_file, ensure_ascii=False, indent=4)
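
The key/value parsing used in Step 6 can be tried without a browser by feeding it plain strings. The sketch below mirrors that logic in a standalone function; `parse_detail_rows` is a hypothetical name and the row texts are invented examples, not scraped output.

```python
def parse_detail_rows(row_texts):
    """Turn row texts of the form 'key\\nvalue' into a dict.

    Rows with fewer than two lines are ignored, as in the Step 6 cell.
    """
    details = {}
    for text in row_texts:
        parts = text.split("\n")
        if len(parts) >= 2:
            details[parts[0].strip()] = parts[1].strip()
    return details


# Invented row texts for illustration only.
rows = ["Год производства\n2014", "Страна \n США", "строка без значения"]
details = parse_detail_rows(rows)
```

Here `details` holds two entries, and the single-line row is dropped.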

## Step 7: Extract Rating Information

We revisit each movie's individual page to extract rating details:

- **Rating**: The movie's rating on KinoPoisk.
- **Top 250 Position**: The movie's position in the Top 250 list.
- **Number of Ratings**: The total number of ratings the movie has received.

These details are stored in the `movies_data_full` list.

In [55]:
with open('movies_details_full.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
for movie in movies_data_full:
    url = movie['url']
    driver.get(url)
    time.sleep(10)

    try:
        data_elements = driver.find_elements(By.CSS_SELECTOR, ".styles_rootMSize__B8Ch0")

        # Only the first matched block is used, so no loop over the elements is needed
        if data_elements:
            if isinstance(movie.get('details'), dict):
                lines = data_elements[0].text.split('\n')  # Split the block's text once
                movie['details']['Рейтинг'] = lines[0]
                movie['details']['Топ 250'] = lines[3]
                movie['details']['Оценок'] = lines[4]
            else:
                # No details dict from Step 6; report the URL
                print(movie['url'])

    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")

In [57]:
with open('movies_with_rating.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)

## Step 8: Extract Actor Information

In this step, we further enrich the movie data by scraping the list of actors for each movie from its individual page on KinoPoisk. The cell below does the following:

1. **Iterate Through Movies**:
   - The loop iterates through each movie in the `movies_data_full` list, which already contains basic details (title, URL) and additional details (e.g., genre, director, etc.).

2. **Navigate to Movie Page**:
   - For each movie, the script navigates to its individual page using the `url` stored in the movie dictionary.
   - A delay of 10 seconds (`time.sleep(10)`) is added to ensure the page fully loads before scraping.

3. **Extract Actor Information**:
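   - The script attempts to locate elements on the page using the CSS selector `.styles_list___ufg4`.
   - The text of the first match is split into lines and stored under the `Актеры` key in the movie's `details`.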


In [60]:
with open('movies_with_rating.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
for movie in movies_data_full:
    url = movie['url']
    driver.get(url)
    time.sleep(10)

    try:
        # Extract the data block
        data_elements = driver.find_elements(By.CSS_SELECTOR, ".styles_list___ufg4")

        # Only the first matched block is used, so no loop over the elements is needed
        if data_elements:
            if isinstance(movie.get('details'), dict):
                movie['details']['Актеры'] = data_elements[0].text.split('\n')
            else:
                # No details dict from Step 6; report the URL
                print(movie['url'])

    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")

In [62]:
with open('movies_with_actors.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)

## Step 9: Extract Reviews

We navigate to the reviews section of each movie's page to extract:

- **User Reviews**: The total number of user reviews, plus the number of positive, negative, and neutral ones.

In [67]:
with open('movies_with_actors.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
for movie in movies_data_full:
    url = movie['url']
    new_url = url.rstrip('/') + '/reviews/'


    driver.get(new_url)
    time.sleep(10)

    try:
        # Extract the data block
        all_elements = driver.find_elements(By.CSS_SELECTOR, "ul.resp_type b")


        # Process each data element
        if isinstance(movie.get('details'), dict):
            movie['details']['Количество рецензий от зрителей'] = all_elements[0].text
            movie['details']['Количество положительных рецензий от зрителей'] = all_elements[1].text
            movie['details']['Количество отрицательных рецензий от зрителей'] = all_elements[2].text
            movie['details']['Количество нейтральных рецензий от зрителей'] = all_elements[4].text
        else:
            print(movie['url'])
            
        
    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")
    

In [75]:
with open('movies_with_rewiw_score.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)
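
The cell above picks values by position (0, 1, 2 and 4, skipping 3). As a sketch, that mapping can be kept in one table and applied with a bounds check so that a short list yields `None` instead of an exception. `REVIEW_COUNT_FIELDS` and `map_review_counts` are hypothetical names, and the sample values are made up.

```python
# Position of each value among the matched elements, as used in Step 9.
REVIEW_COUNT_FIELDS = {
    "Количество рецензий от зрителей": 0,
    "Количество положительных рецензий от зрителей": 1,
    "Количество отрицательных рецензий от зрителей": 2,
    "Количество нейтральных рецензий от зрителей": 4,
}


def map_review_counts(texts):
    """Map element texts to named fields; missing positions become None."""
    return {
        key: texts[index] if index < len(texts) else None
        for key, index in REVIEW_COUNT_FIELDS.items()
    }


# Made-up texts; index 3 is not used by the notebook, so it is a placeholder here.
sample_counts = map_review_counts(["120", "100", "5", "(unused)", "15"])
short_counts = map_review_counts(["120", "100"])
```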

## Step 10: Extract Press Ratings

We navigate to the press section of each movie's page to extract:

- **Press Ratings**: The percentage of positive reviews from international and Russian critics.

In [79]:
with open('movies_with_rewiw_score.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
for movie in movies_data_full:
    url = movie['url']
    new_url = url.rstrip('/') + '/press/'


    driver.get(new_url)
    time.sleep(10)

    try:
        # Extract the data block
        all_elements = driver.find_elements(By.CSS_SELECTOR, ".criticsRatingBlock")

        # Process each data element
        if isinstance(movie.get('details'), dict):
            movie['details']['Процент положительных рецензий международных критиков'] = all_elements[0].text.split('\n')[1]
            movie['details']['Процент положительных рецензий российских критиков'] = all_elements[0].text.split('\n')[5]

        else:
            print(movie['url'])
            
        
    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")
    

Error extracting data for Достучаться до небес: list index out of range
Error extracting data for Операция «Ы» и другие приключения Шурика: list index out of range
Error extracting data for Бриллиантовая рука: list index out of range
Error extracting data for Девчата: list index out of range
Error extracting data for Брат 2: list index out of range
Error extracting data for Иван Васильевич меняет профессию: list index out of range
Error extracting data for Собачье сердце: list index out of range
Error extracting data for Дангал: list index out of range
Error extracting data for Кавказская пленница, или Новые приключения Шурика: list index out of range
Error extracting data for Титаник: list index out of range
Error extracting data for ...А зори здесь тихие: list index out of range
Error extracting data for В августе 44-го: list index out of range
Error extracting data for Укрощение строптивого: list index out of range
Error extracting data for Джентльмены удачи: list index out of range

In [104]:
with open('movies_with_press.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)
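
The `list index out of range` messages above are what Python raises when `all_elements` is empty or when the first block's text has fewer than six lines; the output does not say which case applied to each movie. The sketch below shows one way to record `None` in both cases instead of raising. `parse_press_block` is a hypothetical helper and the block texts are invented.

```python
INTERNATIONAL = "Процент положительных рецензий международных критиков"
RUSSIAN = "Процент положительных рецензий российских критиков"


def parse_press_block(block_texts):
    """Read lines 1 and 5 of the first block; leave None where a line is absent.

    `block_texts` is assumed to be the list of `.text` values of the matched elements.
    """
    result = {INTERNATIONAL: None, RUSSIAN: None}
    if not block_texts:  # no critics block on the page
        return result
    lines = block_texts[0].split("\n")
    if len(lines) > 1:
        result[INTERNATIONAL] = lines[1]
    if len(lines) > 5:
        result[RUSSIAN] = lines[5]
    return result


# Invented block texts: one complete, one with only the first two lines.
full = parse_press_block(["a\n95%\nb\nc\nd\n80%"])
partial = parse_press_block(["a\n95%"])
empty = parse_press_block([])
```

With this approach a movie that lacks one of the two ratings still keeps the other one, whereas the original cell stores neither second value once an index fails.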

## Step 11: Extract 100 User Reviews

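Finally, we extract up to 100 user reviews for each movie and add them to the dataset under the `Рецензии 100 зрителей` key. The reviews page is requested sorted by date with 100 reviews per page.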

In [151]:
with open('movies_with_press.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
for movie in movies_data_full:
    url = movie['url']
    new_url = url.rstrip('/') + '/reviews/ord/date/status/all/perpage/100/'


    driver.get(new_url)
    time.sleep(15)

    try:
        # Extract the data block
        all_elements = driver.find_elements(By.CSS_SELECTOR, "div.response p")

        if isinstance(movie.get('details'), dict):
            if 'Рецензии 100 зрителей' not in movie['details']:
                movie['details']['Рецензии 100 зрителей'] = []
                
            for review in all_elements:

                review_text = review.text.split('\n')

                main_text = [line for line in review_text if line.strip() not in ["прямая ссылка", "+ комментарий"]]

                main_text = '\n'.join(main_text)

                if main_text.strip():
                    movie['details']['Рецензии 100 зрителей'].append(main_text)

        else:
            print(movie['url'])
            
        
    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")
    

In [153]:
with open('movies_all.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)
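
The review cleaning in the cell above drops two service lines that appear alongside the review text. Isolated as a pure function, it can be checked on a plain string; `clean_review` and `SERVICE_LINES` are hypothetical names and the sample review is invented.

```python
# Service lines filtered out in Step 11.
SERVICE_LINES = {"прямая ссылка", "+ комментарий"}


def clean_review(text):
    """Remove service lines from a review and return the remaining text."""
    kept = [line for line in text.split("\n") if line.strip() not in SERVICE_LINES]
    return "\n".join(kept)


# Invented review text for illustration.
raw = "Отличный фильм.\nпрямая ссылка\n+ комментарий"
print(clean_review(raw))  # Отличный фильм.
```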

The cells below repeat this step for movies whose review list is still empty, first with the same 15-second delay and then with a 20-second delay.

### **Purpose**:
- The retries fill in reviews that earlier passes missed, presumably because a page had not finished loading in time, so that the final dataset is as complete as possible.

In [5]:
with open('movies_all.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
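for movie in movies_data_full:
    url = movie['url']
    new_url = url.rstrip('/') + '/reviews/ord/date/status/all/perpage/100/'
    # Skip movies that already have reviews; .get() avoids a KeyError when the key is missing
    details = movie.get('details')
    if isinstance(details, dict) and details.get('Рецензии 100 зрителей'):
        continue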

    driver.get(new_url)
    time.sleep(15)

    try:
        # Extract the data block
        all_elements = driver.find_elements(By.CSS_SELECTOR, "div.response p")

        if isinstance(movie.get('details'), dict):
            if 'Рецензии 100 зрителей' not in movie['details']:
                movie['details']['Рецензии 100 зрителей'] = []
                
            for review in all_elements:

                review_text = review.text.split('\n')

                main_text = [line for line in review_text if line.strip() not in ["прямая ссылка", "+ комментарий"]]

                main_text = '\n'.join(main_text)

                if main_text.strip():
                    movie['details']['Рецензии 100 зрителей'].append(main_text)

        else:
            print(movie['url'])
            
        
    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")
    

In [7]:
with open('movies_all_final.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)

In [10]:
with open('movies_all_final.json', 'r', encoding='utf-8') as file:
    movies_data_full = json.load(file)

In [None]:
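for movie in movies_data_full:
    url = movie['url']
    new_url = url.rstrip('/') + '/reviews/ord/date/status/all/perpage/100/'
    # Skip movies that already have reviews; .get() avoids a KeyError when the key is missing
    details = movie.get('details')
    if isinstance(details, dict) and details.get('Рецензии 100 зрителей'):
        continue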

    driver.get(new_url)
    time.sleep(20)

    try:
        # Extract the data block
        all_elements = driver.find_elements(By.CSS_SELECTOR, "div.response p")

        if isinstance(movie.get('details'), dict):
            if 'Рецензии 100 зрителей' not in movie['details']:
                movie['details']['Рецензии 100 зрителей'] = []
                
            for review in all_elements:

                review_text = review.text.split('\n')

                main_text = [line for line in review_text if line.strip() not in ["прямая ссылка", "+ комментарий"]]

                main_text = '\n'.join(main_text)

                if main_text.strip():
                    movie['details']['Рецензии 100 зрителей'].append(main_text)

        else:
            print(movie['url'])
            
        
    except Exception as e:
        print(f"Error extracting data for {movie['title']}: {e}")
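
To decide whether another retry pass is needed, it helps to list the movies that still have no reviews. The sketch below does that on in-memory data; `missing_reviews` is a hypothetical helper and the sample records are invented.

```python
REVIEWS_KEY = "Рецензии 100 зрителей"


def missing_reviews(movies, key=REVIEWS_KEY):
    """Return the movies that have no details dict or an empty/missing review list."""
    return [
        movie
        for movie in movies
        if not (isinstance(movie.get("details"), dict) and movie["details"].get(key))
    ]


# Invented records: one complete, one with an empty list, one without details.
movies = [
    {"title": "A", "details": {REVIEWS_KEY: ["текст рецензии"]}},
    {"title": "B", "details": {REVIEWS_KEY: []}},
    {"title": "C"},
]
todo = missing_reviews(movies)
print([movie["title"] for movie in todo])  # ['B', 'C']
```

An empty result means every record has at least one review, so no further pass would change the dataset.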
    

## Step 12: Save Final Dataset to JSON

We save the final dataset, now including up to 100 user reviews per movie, into a JSON file named `movies_final.json`.

In [13]:
with open('movies_final.json', 'w', encoding='utf-8') as file:
    json.dump(movies_data_full, file, ensure_ascii=False, indent=4)

## Step 13: Close the WebDriver

After completing the scraping process, we close the WebDriver to free up system resources.

In [8]:
driver.quit()
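
If a scraping cell raises before this point, the browser process stays open. A common way to guard against that is a `try`/`finally` block, sketched below with a stand-in driver so that it runs without a browser. `FakeDriver` and `visit_all` are hypothetical names and the URLs are placeholders.

```python
class FakeDriver:
    """Stand-in for a Selenium driver; it only records calls (illustration only)."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        if "bad" in url:
            raise RuntimeError(f"cannot load {url}")

    def quit(self):
        self.closed = True


def visit_all(driver, urls):
    """Visit each URL and always quit the driver, even if a visit fails."""
    visited = []
    try:
        for url in urls:
            driver.get(url)
            visited.append(url)
    finally:
        driver.quit()  # runs on success and on error alike
    return visited


ok_driver = FakeDriver()
visited = visit_all(ok_driver, ["https://example.com/1", "https://example.com/2"])

failing_driver = FakeDriver()
try:
    visit_all(failing_driver, ["https://example.com/bad"])
except RuntimeError:
    pass  # the error still propagates, but the driver has been closed
```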