These scripts require the following dependencies (random and pickle ship with the Python standard library; the others must be installed):
- numpy
- scipy
- random
- matplotlib
- plotly
- colorama
- pickle
- pandas
- requests
- bs4
- tqdm
- networkx
They have been tested with Python 3.7.
https://cosylab.iiitd.edu.in/recipedb/ can be scraped using the get_dataRDB.py script.
The URL indices start at 2610 and end at 149191.
Within this range, some URLs do not contain information.
These are filtered out later.
To use the script:
- Change the start and end variables such that: 2610 <= start <= end <= 149191
- Run the script
Copies of this script can be made and started independently to scrape the website faster. If the URL is found, the time taken by the request and processing of the data is printed. Otherwise, only the URL index is printed.
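When running several copies in parallel, each copy needs its own start and end values. The following is a minimal sketch of one way to split the index range into contiguous, non-overlapping chunks. The helper name split_index_range is hypothetical and is not part of get_dataRDB.py.

```python
FIRST_INDEX = 2610
LAST_INDEX = 149191


def split_index_range(start, end, n_parts):
    """Split the inclusive range [start, end] into n_parts contiguous chunks.

    Hypothetical helper: each returned (lo, hi) pair can be used as the
    start and end values of one independent copy of the scraping script.
    """
    if not (FIRST_INDEX <= start <= end <= LAST_INDEX):
        raise ValueError("expected 2610 <= start <= end <= 149191")
    total = end - start + 1
    if not (1 <= n_parts <= total):
        raise ValueError("n_parts must be between 1 and the range size")

    # Spread the remainder over the first chunks so sizes differ by at most 1.
    base, extra = divmod(total, n_parts)
    chunks = []
    lo = start
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        chunks.append((lo, hi))
        lo = hi + 1
    return chunks


if __name__ == "__main__":
    for lo, hi in split_index_range(FIRST_INDEX, LAST_INDEX, 4):
        print(f"start = {lo}, end = {hi}")
```

Each printed pair can then be copied into one instance of the script.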
Once the data is downloaded, the combine_pkl_files.py script can be customized and run to combine all data into a single pandas dataframe.
To customize the script:
- Edit the read_pickle statements to add all generated dataframes into a single list.
- Run the script
This will merge all dataframes into one and remove URLs that could not be reached. The resulting dataframe is saved in the current directory.
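The merge step can be sketched roughly as follows. This is an illustration rather than the contents of combine_pkl_files.py: the column name title and the convention that unreachable URLs leave it empty are assumptions, as are the function names.

```python
import pandas as pd


def load_frames(paths):
    """Read every pickled dataframe produced by the scraping runs."""
    return [pd.read_pickle(path) for path in paths]


def combine_frames(frames, required_column="title"):
    """Concatenate all frames and drop rows for URLs that had no data.

    Assumption: a row for an unreachable URL has a missing value in
    `required_column` (a hypothetical column name).
    """
    combined = pd.concat(frames, ignore_index=True)
    return combined.dropna(subset=[required_column]).reset_index(drop=True)


if __name__ == "__main__":
    # Small in-memory stand-ins for the pickled dataframes.
    part_a = pd.DataFrame({"url_index": [2610, 2611], "title": ["Soup", None]})
    part_b = pd.DataFrame({"url_index": [2612], "title": ["Stew"]})
    merged = combine_frames([part_a, part_b])
    print(merged)
    # merged.to_pickle("combined.pkl") would save it in the current directory.
```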
The process_data.py script provides:
- examples of reading the data and compiling some results
- data processing and filtering to prepare the data for building a graph
- saving of the data for specific locations that have a large enough number of recipes
The full filtered data gets saved to the same directory.
The country data gets saved to the country_data directory.
The region data gets saved to the region_data directory.
The continent data gets saved to the continent_data directory.
The world data gets saved to the root directory.
The script can be executed without any modifications.
The construct_graphs.py script is used to build two graphs:
- An unweighted undirected graph between recipes and ingredients with _ingredients appended to the original file name.
- A weighted undirected graph between recipes and nutrients (normalized by energy) with _nutrients appended to the original file name.
Two .gml files are saved to the same directory where the .pkl file resides.
The script can be executed without any modifications.
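As an illustration of the two graphs and the output naming, here is a minimal sketch using networkx. It is not the code of construct_graphs.py: the input dictionaries, the function names, the Energy key, and the exact normalization (nutrient amount divided by energy) are assumptions.

```python
from pathlib import Path

import networkx as nx


def output_path(pkl_path, suffix):
    """Return the .gml path next to the .pkl file, with `suffix` appended."""
    p = Path(pkl_path)
    return p.with_name(p.stem + suffix + ".gml")


def build_ingredients_graph(recipes):
    """Unweighted undirected graph: one edge per (recipe, ingredient) pair."""
    G = nx.Graph()
    for recipe, ingredients in recipes.items():
        G.add_node(recipe, kind="recipe")
        for ingredient in ingredients:
            G.add_node(ingredient, kind="ingredient")
            G.add_edge(recipe, ingredient)
    return G


def build_nutrients_graph(nutrition, energy_key="Energy"):
    """Weighted undirected graph: edge weight is amount / energy (assumed)."""
    G = nx.Graph()
    for recipe, amounts in nutrition.items():
        energy = amounts.get(energy_key, 0)
        if energy <= 0:
            continue  # cannot normalize without a positive energy value
        G.add_node(recipe, kind="recipe")
        for nutrient, amount in amounts.items():
            if nutrient == energy_key or amount <= 0:
                continue
            G.add_node(nutrient, kind="nutrient")
            G.add_edge(recipe, nutrient, weight=amount / energy)
    return G


if __name__ == "__main__":
    g_ing = build_ingredients_graph({"recipe_1": ["salt", "flour"]})
    g_nut = build_nutrients_graph({"recipe_1": {"Energy": 200, "Protein": 10}})
    print(g_ing.number_of_edges(), g_nut["recipe_1"]["Protein"]["weight"])
    print(output_path("country_data/example.pkl", "_ingredients"))
```

A graph built this way could then be written with nx.write_gml(graph, output_path(pkl_path, "_ingredients")).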
The ingredients_graph_processing.py script processes the graph by:
- Removing recipes with a small number of ingredients
- Removing ingredients that are used only a few times (and those whose names are shorter than 2 characters)
- Creating a bipartite graph for the remaining recipes and ingredients
- Creating the 1-mode projections onto the recipes and ingredients
The script prints out some data (number of nodes/edges before/after the removal of nodes).
The script saves three .gml files: one for the reduced graph, one for the projection on ingredients, and one for the projection on the recipes.
Simply update the list of locations and run the script.
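The processing steps above can be sketched as follows. This is a simplified illustration, not the script itself: the thresholds min_ingredients and min_uses, the input dictionary, and the function name are hypothetical.

```python
from collections import Counter

import networkx as nx
from networkx.algorithms import bipartite


def reduce_and_project(recipes, min_ingredients=2, min_uses=2):
    """Filter recipes and ingredients, then build the bipartite graph
    and its two 1-mode projections.

    `recipes` maps a recipe id to an iterable of ingredient names.
    Both thresholds are illustrative values, not the script's settings.
    """
    # 1. Drop recipes with too few ingredients.
    kept = {r: set(ings) for r, ings in recipes.items()
            if len(set(ings)) >= min_ingredients}

    # 2. Drop rarely used ingredients and names shorter than 2 characters.
    counts = Counter(i for ings in kept.values() for i in ings)
    good = {i for i, c in counts.items() if c >= min_uses and len(i) >= 2}

    # 3. Bipartite graph over what remains.
    B = nx.Graph()
    for recipe, ings in kept.items():
        used = ings & good
        if not used:
            continue
        B.add_node(recipe, bipartite=0)
        for ingredient in used:
            B.add_node(ingredient, bipartite=1)
            B.add_edge(recipe, ingredient)

    # 4. 1-mode projections onto each node set.
    recipe_nodes = {n for n, d in B.nodes(data=True) if d["bipartite"] == 0}
    ingredient_nodes = set(B) - recipe_nodes
    on_recipes = bipartite.projected_graph(B, recipe_nodes)
    on_ingredients = bipartite.projected_graph(B, ingredient_nodes)
    return B, on_recipes, on_ingredients


if __name__ == "__main__":
    demo = {"r1": ["salt", "flour", "egg"], "r2": ["salt", "egg"],
            "r3": ["x"], "r4": ["flour", "salt"]}
    B, on_recipes, on_ingredients = reduce_and_project(demo)
    print(B.number_of_nodes(), B.number_of_edges())
```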
The nutrients_graph_processing.py script processes the graph by:
- Using the recipes from the "Ingredients Graphs" section
- Only keeping five nutrients: Fats, Protein, Carbs, Sugars, Fiber
- Reweighting the edges so that each weight is the nutrient's percentage of the total weight of the five nutrients
- Removing edges with low weight
- Creating a bipartite graph for the remaining nutrients and recipes
- Creating the 1-mode projections onto the nutrients and recipes
The script prints out some data (number of nodes/edges before/after the removal of nodes).
The script saves three .gml files: one for the reduced graph, one for the projection on nutrients, and one for the projection on the recipes.
Simply update the list of locations and run the script.
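The reweighting and low-weight filtering for a single recipe can be sketched as below. The cutoff min_percent and the function name are hypothetical; the script's actual threshold is not stated here.

```python
NUTRIENTS = ("Fats", "Protein", "Carbs", "Sugars", "Fiber")


def percentage_weights(amounts, min_percent=5.0):
    """Convert one recipe's nutrient amounts to percentages of the five
    kept nutrients, then drop entries below `min_percent`.

    `min_percent` is an illustrative cutoff, not the script's value.
    """
    # Keep only the five nutrients; anything else is ignored.
    kept = {name: float(amounts.get(name, 0.0)) for name in NUTRIENTS}
    total = sum(kept.values())
    if total <= 0:
        return {}
    percents = {name: 100.0 * value / total for name, value in kept.items()}
    return {name: p for name, p in percents.items() if p >= min_percent}


if __name__ == "__main__":
    recipe = {"Fats": 10, "Protein": 20, "Carbs": 60,
              "Sugars": 8, "Fiber": 2, "Sodium": 500}
    print(percentage_weights(recipe))
```

In this example Sodium is ignored because it is not one of the five nutrients, and Fiber is dropped because its share falls below the cutoff.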
The ingredients_graphs_analysis.py script analyzes the ingredients graphs' 1-mode projection on the ingredients, generating details about:
- graph diameters
- degree centrality
- betweenness centrality
- degree distribution
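The quantities listed above map onto standard networkx calls. The sketch below is an assumption about how they could be computed, not the analysis script itself. In particular, taking the diameter of the largest connected component is a choice made here so that the call does not fail on disconnected graphs.

```python
import networkx as nx


def summarize_projection(G):
    """Compute basic statistics for a 1-mode projection graph."""
    if G.number_of_nodes() == 0:
        raise ValueError("graph has no nodes")

    # nx.diameter requires a connected graph, so restrict to the
    # largest connected component (a choice made for this sketch).
    largest = max(nx.connected_components(G), key=len)
    core = G.subgraph(largest)

    return {
        "diameter": nx.diameter(core),
        "degree_centrality": nx.degree_centrality(G),
        "betweenness_centrality": nx.betweenness_centrality(G),
        # degree_histogram[d] is the number of nodes with degree d
        "degree_histogram": nx.degree_histogram(G),
    }


if __name__ == "__main__":
    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])
    stats = summarize_projection(G)
    print(stats["diameter"], stats["degree_histogram"])
```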
Plots graphs and saves tables for results from ingredients_graphs_analysis.py.
Analyzes the nutrients graphs' 1-mode projection on the nutrients, generating details about:
- recipes that have only one main nutrient
- recipes that connect two nutrients
- radar and matrix graphs of the nutrition content of each graph (normalized)
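One possible reading of the first two items is that, after low-weight edges are removed, a recipe with a single remaining nutrient edge has one main nutrient, and a recipe with exactly two remaining edges connects those two nutrients. The sketch below follows that interpretation. The interpretation, the edge-list input, and the function name are all assumptions.

```python
from collections import defaultdict


def group_recipes_by_nutrients(edges):
    """Classify recipes by how many nutrients they remain linked to.

    `edges` is an iterable of (recipe, nutrient) pairs from the reduced
    bipartite graph. Returns two dicts: recipes with exactly one
    nutrient, and recipes with exactly two (as a sorted pair).
    """
    nutrients_of = defaultdict(set)
    for recipe, nutrient in edges:
        nutrients_of[recipe].add(nutrient)

    single = {r: next(iter(ns)) for r, ns in nutrients_of.items()
              if len(ns) == 1}
    pairs = {r: tuple(sorted(ns)) for r, ns in nutrients_of.items()
             if len(ns) == 2}
    return single, pairs


if __name__ == "__main__":
    demo_edges = [("r1", "Fats"), ("r2", "Protein"), ("r2", "Carbs"),
                  ("r3", "Fats"), ("r3", "Carbs"), ("r3", "Sugars")]
    single, pairs = group_recipes_by_nutrients(demo_edges)
    print(single, pairs)
```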
Looks at the assortativity within the 1-mode projection of the ingredients graphs on the ingredients. Each ingredient is labeled with its dominant macronutrient (fat, protein, or carb). Ingredients that cannot be labeled with one of these three are removed from the graph. NetworkX modularity is used to compute the modularity with respect to these three classes. We also perform a random sampling of the nodes and compute the modularity of the induced subgraph.
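A minimal sketch of this computation, using networkx's modularity function with the label classes as the partition, might look as follows. The labels dictionary, the function names, and the sampling parameters are hypothetical, and returning None for a graph without edges is a choice made here because modularity is undefined in that case.

```python
import random

import networkx as nx
from networkx.algorithms.community import modularity


def label_modularity(G, labels):
    """Modularity of G with respect to the classes given by `labels`.

    Nodes missing from `labels` are removed first. Returns None when
    the remaining graph has no edges.
    """
    H = G.subgraph([n for n in G if n in labels])
    if H.number_of_edges() == 0:
        return None
    groups = {}
    for node in H:
        groups.setdefault(labels[node], set()).add(node)
    return modularity(H, list(groups.values()))


def sampled_modularity(G, labels, k, seed=0):
    """Modularity of the subgraph induced by k randomly sampled labeled nodes."""
    labeled = [n for n in G if n in labels]
    rng = random.Random(seed)
    sample = rng.sample(labeled, min(k, len(labeled)))
    return label_modularity(G.subgraph(sample), labels)


if __name__ == "__main__":
    # Two triangles joined by one edge, plus one unlabeled node.
    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),
                      ("d", "e"), ("e", "f"), ("d", "f"),
                      ("c", "d"), ("f", "unlabeled")])
    labels = {"a": "fat", "b": "fat", "c": "fat",
              "d": "protein", "e": "protein", "f": "protein"}
    print(label_modularity(G, labels))
    print(sampled_modularity(G, labels, k=4, seed=1))
```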