- Edoardo Zapata
- Juan Francisco Camberos
- Raúl Maya
- Ricardo Pérez
- Salvador del Cos Garza
The goal of this project was to extract data from different sources, clean and transform it, and load it into a database in order to answer specific queries. The project was developed using Python, Pandas, MongoDB and PyMongo. Data was retrieved from two sources: dataworld.com for console video games and kaggle.com for Steam video games.
Project Report
The datasets were downloaded in CSV format and then imported into Pandas DataFrames:
| Raw data for Console & Steam games |
|---|
| ![]() |
| ![]() |
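
The import step can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the in-memory CSV and its column names (`Name`, `Platform`, `Year`, `Genre`) are hypothetical stand-ins for the downloaded files.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file; the real column names may differ.
console_csv = io.StringIO(
    "Name,Platform,Year,Genre\n"
    "Game A,PS4,2015,Action\n"
    "Game B,Xbox One,,Sports\n"
)

# read_csv accepts a file path or any file-like object.
console_raw = pd.read_csv(console_csv)

# A first look at the raw data: size, inferred types, and missing values.
print(console_raw.shape)
print(console_raw.dtypes)
print(console_raw.isna().sum())
```

With real files, the `io.StringIO` object would simply be replaced by the path to the downloaded CSV.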
After exploring the raw data, the datasets were cleaned and formatted (selecting relevant fields, casting types, computing missing fields, renaming columns, etc.):
| Clean data for Console & Steam games |
|---|
| ![]() |
| ![]() |
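
The cleaning steps listed above might look something like the sketch below. All column names and sample rows are assumptions made for illustration, as is the idea that the Steam data needs a computed `platform` and `year` field; the actual transformations depend on the real datasets.

```python
import pandas as pd

# Hypothetical raw console data (column names are assumptions).
console_raw = pd.DataFrame({
    "Name": ["Game A", "Game B"],
    "Platform": ["PS4", "Xbox One"],
    "Year": [2015.0, None],
    "Genre": ["Action", "Sports"],
    "Rank": [1, 2],
})

# Select relevant fields, rename columns, and cast the year to a nullable integer.
console_clean = (
    console_raw[["Name", "Platform", "Year", "Genre"]]
    .rename(columns={"Name": "name", "Platform": "platform",
                     "Year": "year", "Genre": "genre"})
    .astype({"year": "Int64"})
)

# Hypothetical raw Steam data, assumed to lack platform and year columns.
steam_raw = pd.DataFrame({
    "name": ["Game C"],
    "release_date": ["2016-03-01"],
    "genres": ["Indie"],
})

# Compute the missing fields so both datasets share the same schema.
steam_clean = (
    steam_raw.assign(
        platform="Steam",
        year=pd.to_datetime(steam_raw["release_date"]).dt.year.astype("Int64"),
    )
    .rename(columns={"genres": "genre"})
    .loc[:, ["name", "platform", "year", "genre"]]
)

print(console_clean)
print(steam_clean)
```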
The team chose a NoSQL database (MongoDB) as the best fit because relevant fields in the gathered data had mixed types (int, NaN, str). Both datasets were loaded into the same collection so that the queries could run against them together.
| MongoDB unified collection |
|---|
| ![]() |
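
One possible way to load both DataFrames into a single collection is sketched below. The helper names `to_documents` and `load` are hypothetical, and dropping missing fields before insertion is an assumption about how the mixed-type data could be handled, not necessarily what the team did. The `collection` argument is expected to be a PyMongo `Collection`; only its `insert_many` method is used.

```python
import pandas as pd


def to_documents(df):
    """Convert rows to dicts, omitting fields whose value is missing."""
    return [
        {key: value for key, value in row.items() if not pd.isna(value)}
        for row in df.to_dict("records")
    ]


def load(collection, frames):
    """Insert the rows of every DataFrame into the same collection."""
    docs = [doc for df in frames for doc in to_documents(df)]
    if docs:  # insert_many rejects an empty list
        collection.insert_many(docs)
    return len(docs)


# Hypothetical cleaned data with a missing year.
console_clean = pd.DataFrame({
    "name": ["Game A", "Game B"],
    "platform": ["PS4", "Xbox One"],
    "year": [2015.0, None],
})
print(to_documents(console_clean))

# Usage with a local MongoDB server (database and collection names are assumptions):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# load(client["games_db"]["games"], [console_clean, steam_clean])
```

Because MongoDB documents in one collection need not share the same fields, omitting a missing value is one alternative to storing NaN.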
"Number of games for preferred platforms, grouped by genre"
- A query was built using PyMongo to find all the games for a defined set of preferred platforms (PS4, Xbox One and Steam). The query results were stored in a DataFrame.
- Considering PS4 and Xbox One release dates, the DataFrame was filtered for games released between 2014 and 2016.
- Finally, the number of games for each of the defined platforms, by genre, was obtained using `groupby`.
| Example Query (PS4 slice) |
|---|
| ![]() |
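
The three query steps can be sketched as follows. The field names (`platform`, `year`, `genre`), the platform labels, and the helper `count_games` are assumptions for illustration; the sample records stand in for what a PyMongo `find` call would return.

```python
import pandas as pd

# Assumed platform labels; the stored values may be spelled differently.
PREFERRED = ["PS4", "Xbox One", "Steam"]

# MongoDB filter document selecting the preferred platforms.
query = {"platform": {"$in": PREFERRED}}


def count_games(records, first_year=2014, last_year=2016):
    """Count games per platform and genre within an inclusive year range."""
    df = pd.DataFrame(records)
    in_range = df["year"].between(first_year, last_year)  # NaN years are excluded
    return df[in_range].groupby(["platform", "genre"]).size()


# With a live database: records = list(collection.find(query, {"_id": 0}))
# Hypothetical records used here instead:
records = [
    {"name": "A", "platform": "PS4", "year": 2015, "genre": "Action"},
    {"name": "B", "platform": "PS4", "year": 2014, "genre": "Action"},
    {"name": "C", "platform": "Xbox One", "year": 2017, "genre": "Sports"},
    {"name": "D", "platform": "Steam", "year": 2016, "genre": "Indie"},
    {"name": "E", "platform": "Steam", "genre": "Indie"},
]

counts = count_games(records)
print(counts)
```

Filtering by year in Pandas mirrors the steps described above; the same range could also be expressed in the MongoDB filter itself with `$gte` and `$lte`.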





