This README outlines the steps needed to set up and run the project environment, including installing necessary libraries, processing data, and loading it into a Neo4j database.
Ensure you have Python 3.9 installed on your system. You can check your Python version by running:
python --version
First, install the required libraries listed in requirements.txt by running the following command:
pip install -r requirements.txt
Download the dataset from the provided hyperlink and save it in the project's root directory.
Run the notebooks/etl.ipynb Jupyter notebook to create the necessary files from the dataset. Initially, all_data.csv will be used.
The keyword creation part of the ETL process is time-consuming. If you prefer to skip this step, use all_data_with_keywords.csv in the notebook instead of all_data.csv, as shown in the sketch below.
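As a rough illustration, assuming the notebook reads its input file into a pandas DataFrame, switching between the two files could look like this (the variable names here are hypothetical, not taken from the notebook):

import pandas as pd

# Hypothetical input selection near the top of etl.ipynb.
# Set USE_PRECOMPUTED_KEYWORDS to True to skip the slow keyword-creation step.
USE_PRECOMPUTED_KEYWORDS = True

input_csv = "all_data_with_keywords.csv" if USE_PRECOMPUTED_KEYWORDS else "all_data.csv"
df = pd.read_csv(input_csv)
print(f"Loaded {len(df)} rows from {input_csv}")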
After processing the data, set up a Neo4j DBMS and obtain the path to its DBMS directory. On macOS, the path looks like this:
/Users/user/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/<dbms-id>/import/csv_path.txt
Replace <dbms-id> with the appropriate folder name for your DBMS.
Move all the CSV files generated by etl.ipynb to the directory path you obtained in the previous step. If you prefer to script this step, see the sketch below.
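A minimal sketch using Python's standard library is shown here; both paths are placeholders and should be adjusted to your project layout and your DBMS import directory:

import shutil
from pathlib import Path

# Placeholder paths: adjust to where etl.ipynb wrote its CSV files
# and to your own Neo4j DBMS import directory.
source_dir = Path(".")
import_dir = Path("/Users/user/Library/Application Support/Neo4j Desktop/Application/relate-data/dbmss/<dbms-id>/import")

for csv_file in source_dir.glob("*.csv"):
    shutil.copy(csv_file, import_dir / csv_file.name)
    print(f"Copied {csv_file.name} -> {import_dir}")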
Execute the data pipeline with the following bash script command. Replace --config with your configuration file path, if necessary. Also, update the username, password, and database in the config file.
bash run_loader.sh --config config.ini
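Before running the loader, you can optionally verify that the credentials in the config file work. The sketch below assumes config.ini has a [neo4j] section with uri, username, password, and database keys; the actual section and key names used by this project may differ, so adapt accordingly:

import configparser
from neo4j import GraphDatabase  # pip install neo4j

config = configparser.ConfigParser()
config.read("config.ini")
neo4j_cfg = config["neo4j"]  # assumed section name

# Open a driver with the configured credentials and check the connection.
driver = GraphDatabase.driver(
    neo4j_cfg["uri"],  # e.g. bolt://localhost:7687
    auth=(neo4j_cfg["username"], neo4j_cfg["password"]),
)
driver.verify_connectivity()
print("Connected to Neo4j successfully.")
driver.close()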
Below are the sample results we obtained from running the pipeline: