Welcome to the DevGPT Pipeline! This project extracts, processes, and analyzes data from the DevGPT repository through a streamlined pipeline.
- Download data from Zenodo.
- Organize data into designated folders.
- Extract diverse data types: Issues, Pull Requests, Discussions, Commits, Hacker News, Sharing, and Conversations.
- Hash the text data to create a unique ID for each row. (Because we extract from multiple snapshots of the repository, some rows are duplicated; hashing gives each row a unique ID based on its textual content. If a row's content changes, its hash changes, so the change is detectable. This ensures we neither double-count rows nor miss changed ones. A short hashing sketch follows this list.)
- Use a parser to extract and map data.
- Classify data into relevant tables: Issues, Pull Requests, Commits, Discussions, Hacker News, ChatGPT Sharing, and ChatGPT Conversation.
- Store data in an SQLite database (see the storage sketch after this list).
- Detect non-English content.
- Translate non-English content into English (sketched after this list).
- Implement comprehensive data cleaning using regex, NLTK, and spaCy.
- Convert text to lowercase and apply lemmatization (see the cleaning sketch after this list).
- Store tokenized data in CSV files.
- Optimize keyword searching with Dask parallelization (illustrated after this list).
- Process 60,000+ rows in under two hours.
- Identify successful matches bidirectionally.
- Store results in the database.
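
The content-hashing step can be pictured with a small sketch. This is illustrative only: the column names, the separator, and the `content_hash` helper are assumptions, not the pipeline's actual schema.

```python
import hashlib

import pandas as pd

def content_hash(row: pd.Series, columns) -> str:
    """Derive a stable ID from a row's textual content.

    Identical text across snapshots yields the same hash (duplicates collapse
    to one ID); any change in the text yields a different hash, so edits
    remain visible.
    """
    joined = "||".join(str(row[col]) for col in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical snapshot of issue rows.
issues = pd.DataFrame({
    "title": ["Fix startup crash", "Fix startup crash"],
    "body": ["Crashes on launch", "Crashes on launch"],
})
issues["row_id"] = issues.apply(content_hash, axis=1, args=(["title", "body"],))
issues = issues.drop_duplicates(subset="row_id")  # repeated rows from other snapshots drop out
```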
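
Classification and storage amount to writing each mapped table into SQLite. The frames and table names below are illustrative; only the database file name (`devgpt.sqlite`) comes from this README.

```python
import sqlite3

import pandas as pd

# Hypothetical frames already classified by the parser; the real table names may differ.
tables = {
    "issue": pd.DataFrame({"row_id": ["a1"], "title": ["Fix startup crash"]}),
    "pull_request": pd.DataFrame({"row_id": ["b2"], "title": ["Add parser tests"]}),
}

conn = sqlite3.connect("devgpt.sqlite")
for name, frame in tables.items():
    # Append rather than replace, so each processed snapshot only contributes new rows.
    frame.to_sql(name, conn, if_exists="append", index=False)
conn.close()
```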
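
Language handling can be sketched as detect-then-translate. Using `langdetect` here is an assumption (this README does not name the detection library), and `translate_to_english` is a placeholder for whichever translation backend the pipeline actually calls.

```python
from langdetect import LangDetectException, detect

def detect_language(text: str) -> str:
    """Return an ISO 639-1 code, or 'unknown' for text too short to classify."""
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

def translate_to_english(text: str) -> str:
    """Placeholder standing in for the pipeline's translation backend."""
    return text  # a real implementation would return the English translation

row_text = "Este error ocurre al iniciar la aplicación."
if detect_language(row_text) != "en":
    row_text = translate_to_english(row_text)
```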
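
Cleaning and lemmatization might look roughly like the sketch below, which combines a few regex passes with spaCy; the exact rules (and the NLTK steps) used by the pipeline may differ, and the model name is an assumption.

```python
import re

import spacy

# Small English model; the pipeline may load a different one.
nlp = spacy.load("en_core_web_sm")

def clean_and_lemmatize(text: str):
    """Lowercase, strip URLs and non-alphanumeric characters, then lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep letters, digits, spaces
    text = re.sub(r"\s+", " ", text).strip()
    doc = nlp(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]

tokens = clean_and_lemmatize("ChatGPT suggested refactoring the parser: https://example.com")
```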
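
The parallel keyword search can be sketched with `dask.dataframe`: partitions of the tokenized CSVs are scanned in parallel, and only matching rows are materialized. The file name, column name, and keywords are illustrative.

```python
import dask.dataframe as dd

# Hypothetical tokenized export produced by the cleaning step.
ddf = dd.read_csv("tokenized_conversations.csv")

keywords = ["refactor", "test", "bug"]
pattern = "|".join(keywords)

# Each partition is filtered in parallel; compute() triggers the actual work.
matches = ddf[ddf["tokens"].str.contains(pattern, na=False)]
result = matches.compute()
print(f"{len(result)} rows matched at least one keyword")
```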
- Docker and Docker Compose: Ensure Docker and Docker Compose are installed on your machine.
- Run Docker Compose:
  `docker-compose up --build`
  This will build the Docker image and run the pipeline. You can check the logs to see the progress of the pipeline.
- To get the devgpt.sqlite file, run the following command:
  `docker cp <container_id>:/app/devgpt.sqlite /path/on/local/machine`
  The container_id can be found by running `docker ps`.
  Sample usage: `docker cp devgpt-mining-controller-1:/app/devgpt.sqlite .`
- Ensure a stable internet connection for dataset downloads from Zenodo.