Tools for crawling data from metacritic.com (for educational purposes)
IMPORTANT NOTE:
- Is under your responsibility that you respect the Terms of Use of Metacritic, especially the point 11.13
- This script uses some outdated packages (check #34).
These tools are designed for creating a SQLite file with different kind of data that extracts from Metacritic. You won't find the result of the crawl, like a database, as this data is protected by copyright apart from that the content varies very frequently. For more information of how it works check out this.
You can install all this packages with pip install -r requirements.txt
or you can manually install them.
Python ≥3.6 (preferably ≥3.7)
Scrapy 1.8.0 pip install Scrapy==1.8.0
Tqdm ≥4.31.1 pip install tqdm
Scrapy has his own command line tool, you shouldn't use the default Python Shell. 0. Be sure that you have Python 3 installed.
- Download the repository and travel to the repository folder through your OS Command Line Tool
- Install all the requirements with
pip install -r requirements.txt
- Run the following command
scrapy runspider games.py -o gm.jl
which will create a file called gm.jl. This file will include the links to all the Metacritic games, the process of completing this file will take around 40-80 minutes. You can modify some parameters inline - Run the command
scrapy runspider analyze.py
which will create the database games.db. To complete this process, it will take around 2 hours. - Done! The file
games.db
includes all the information. Use your preferred SQLite reader.
These modifiers are for games.py
and should be placed after the command scrapy runspider games.py -o gm.jl
.
Options | Values | Description | Default value |
---|---|---|---|
-a start_page= | 0-161* | Page number to begin scrape | 0 |
-a delay= | 0-∞ | Delay between page scrapes | 3 |
-a items_per_page= | 1-100 | Number of games to scrape per page | 100 |
-s CLOSESPIDER_PAGECOUNT= | 1-161* | Number of pages to scrape | 161* |
*Represents the last game page. Verify the latest page number here.
The process consists of 2 files, the first, games.py
, runs through this page and collects all the data on a file called games.jl. After, analyze.py
uses this list of games and goes to every single page and gets the details of all the games. These details are converted to a SQLite database, this process occurs while new pages are being scraped, so do not hesitate about stopping the script (note: for how scrapy works it may take a while to stop, as it waits until the loaded pages are scrapped).
This is an example of the result of running these scripts. The first line is the variable names of analyze.py. The second line includes the information from the game Tetris DS.
title | platform | company | release | description | metascore | critics_desc | critics_count | user_score | user_desc | user_count | players | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|
t | p | c | r | d | cs | cd | cn | us | ud | un | pl | rt |
Tetris DS | DS | Nintendo | Mar 20, 2006 | 10 DS players can battle(...) | 84 | Generally favorable reviews | 56 | 8.0 | Generally favorable reviews | 54 Ratings | 4 Online | E |
Markel F. – @Markel_f
Distributed under the BSD license. See LICENSE
for more information.
https://github.com/MarkelFe