Project created for the "Visualizzazione Scientifica" course in 2022-2023.
- Various Jupyter notebooks and .py files (if you want to see the code)
- PowerPoint presentation (pdf)
- Power BI Report (pdf)
How I extracted the data
- Most of the data I found was in PDF format. Fortunately, the data itself was tabular, so I used Tabula, a tool written in Java, to extract it to CSV.
- I couldn't find a way to automate the process. There is also a Python version of Tabula, but in this case it was less effective than the original tool.
- Sometimes the tool missed some of the data, extracted it only partially, or merged multiple columns into one, so I had to restore the data manually. To make this task easier, I used a VS Code extension called Edit csv, which is also available on GitHub.
- Other data was in Excel format, but here too I couldn't find a way to automate the process, because the file structure was the same only for some years and the sheet names differed as well. This time there was less data, so copying and pasting it was easier.
- To build the Italian actors network graph, I used the TMDB API through tmdbsimple, a Python wrapper for the API, and saved the data in JSON format.
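
As a rough illustration of the last point, the sketch below shows one way co-star edges could be derived from saved JSON data. The JSON layout (a list of movies, each with a `title` and a `cast` list of names), the function name `build_costar_edges`, and the sample data are all assumptions made for this example, not the actual structure used in the notebooks.

```python
import json
from collections import Counter
from itertools import combinations


def build_costar_edges(movies):
    """Count how many films each pair of actors shares (hypothetical helper)."""
    edges = Counter()
    for movie in movies:
        # De-duplicate and sort names so (a, b) and (b, a) map to the same key.
        cast = sorted(set(movie["cast"]))
        for pair in combinations(cast, 2):
            edges[pair] += 1
    return edges


# Made-up sample in the assumed layout, standing in for the saved JSON file.
sample_json = """
[
  {"title": "Film A", "cast": ["Actor 1", "Actor 2", "Actor 3"]},
  {"title": "Film B", "cast": ["Actor 2", "Actor 1"]}
]
"""

edges = build_costar_edges(json.loads(sample_json))
for (a, b), weight in sorted(edges.items()):
    print(f"{a} / {b}: {weight} shared film(s)")
```

Each key in `edges` is an undirected pair of actors and each value is the edge weight, so the result can be passed to whatever graph library is used for drawing.
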
To re-run my code for the actors network, you have to:
- Get a TMDB API key by following the official documentation
- Rename the file .env_sample to .env
- Open the .env file
- Replace the placeholder with your API key
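
For reference, here is a minimal sketch of how a key stored in a .env file could be read using only the standard library. The variable name `TMDB_API_KEY`, the placeholder text `your-key-here`, and the helper `load_env` are assumptions for illustration; check .env_sample for the actual name used in this project.

```python
import tempfile
from pathlib import Path


def load_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo on a temporary file so the sketch runs without touching a real .env.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# TMDB credentials\nTMDB_API_KEY=your-key-here\n", encoding="utf-8"
    )
    config = load_env(env_path)

api_key = config.get("TMDB_API_KEY", "")
if not api_key or api_key == "your-key-here":
    print("Replace the placeholder in .env with your TMDB API key.")
```

With tmdbsimple, the loaded value would then be assigned to `tmdb.API_KEY` before making any request.
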
- to Cinetel/Anica, where you can find most of the data that I reworked
- to SIAE, where you can find other data that I reworked
- to Istat, where you can find the shapefiles of Italy
- to TMDB API, for the movies/actors data