Welcome to the DevGPT Pipeline! This project extracts, processes, and analyzes data from the DevGPT repository through a streamlined pipeline.
- Download data from Zenodo.
- Organize data into designated folders.
- Extract diverse data types: Issues, Pull Requests, Discussions, Commits, Hacker News, Sharing, and Conversations.
- Hash the text data to create a unique ID for each row. (Because we extract from multiple snapshots of the repository, some rows are duplicated; hashing gives each row a unique ID based on its textual content. If a row's content changes, its hash changes, so the change is detectable. This ensures we neither double-count rows nor miss changed ones. A short hashing sketch follows this list.)
- Use a parser to extract and map data.
- Classify data into relevant tables: Issues, Pull Requests, Commits, Discussions, Hacker News, ChatGPT Sharing, and ChatGPT Conversation.
- Store data in an SQLite database (see the storage sketch after this list).
- Detect non-English content.
- Translate non-English content into English (sketched after this list).
- Implement comprehensive data cleaning using regex, NLTK, and spaCy.
- Convert text to lowercase and apply lemmatization (see the cleaning sketch after this list).
- Store tokenized data in CSV files.
- Optimize keyword searching with Dask parallelization (illustrated after this list).
- Process 60,000+ rows in under two hours.
- Identify successful matches bidirectionally.
- Store results in the database.
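
The content-hashing step can be pictured with a small sketch. This is illustrative only: the column names, the separator, and the `content_hash` helper are assumptions, not the pipeline's actual schema.

```python
import hashlib

import pandas as pd

def content_hash(row: pd.Series, columns) -> str:
    """Derive a stable ID from a row's textual content.

    Identical text across snapshots yields the same hash (duplicates collapse
    to one ID); any change in the text yields a different hash, so edits
    remain visible.
    """
    joined = "||".join(str(row[col]) for col in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical snapshot of issue rows.
issues = pd.DataFrame({
    "title": ["Fix startup crash", "Fix startup crash"],
    "body": ["Crashes on launch", "Crashes on launch"],
})
issues["row_id"] = issues.apply(content_hash, axis=1, args=(["title", "body"],))
issues = issues.drop_duplicates(subset="row_id")  # repeated rows from other snapshots drop out
```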
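
Classification and storage amount to writing each mapped table into SQLite. The frames and table names below are illustrative; only the database file name (`devgpt.sqlite`) comes from this README.

```python
import sqlite3

import pandas as pd

# Hypothetical frames already classified by the parser; the real table names may differ.
tables = {
    "issue": pd.DataFrame({"row_id": ["a1"], "title": ["Fix startup crash"]}),
    "pull_request": pd.DataFrame({"row_id": ["b2"], "title": ["Add parser tests"]}),
}

conn = sqlite3.connect("devgpt.sqlite")
for name, frame in tables.items():
    # Append rather than replace, so each processed snapshot only contributes new rows.
    frame.to_sql(name, conn, if_exists="append", index=False)
conn.close()
```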
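
Language handling can be sketched as detect-then-translate. Using `langdetect` here is an assumption (this README does not name the detection library), and `translate_to_english` is a placeholder for whichever translation backend the pipeline actually calls.

```python
from langdetect import LangDetectException, detect

def detect_language(text: str) -> str:
    """Return an ISO 639-1 code, or 'unknown' for text too short to classify."""
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

def translate_to_english(text: str) -> str:
    """Placeholder standing in for the pipeline's translation backend."""
    return text  # a real implementation would return the English translation

row_text = "Este error ocurre al iniciar la aplicación."
if detect_language(row_text) != "en":
    row_text = translate_to_english(row_text)
```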
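
Cleaning and lemmatization might look roughly like the sketch below, which combines a few regex passes with spaCy; the exact rules (and the NLTK steps) used by the pipeline may differ, and the model name is an assumption.

```python
import re

import spacy

# Small English model; the pipeline may load a different one.
nlp = spacy.load("en_core_web_sm")

def clean_and_lemmatize(text: str):
    """Lowercase, strip URLs and non-alphanumeric characters, then lemmatize."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep letters, digits, spaces
    text = re.sub(r"\s+", " ", text).strip()
    doc = nlp(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]

tokens = clean_and_lemmatize("ChatGPT suggested refactoring the parser: https://example.com")
```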
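
The parallel keyword search can be sketched with `dask.dataframe`: partitions of the tokenized CSVs are scanned in parallel, and only matching rows are materialized. The file name, column name, and keywords are illustrative.

```python
import dask.dataframe as dd

# Hypothetical tokenized export produced by the cleaning step.
ddf = dd.read_csv("tokenized_conversations.csv")

keywords = ["refactor", "test", "bug"]
pattern = "|".join(keywords)

# Each partition is filtered in parallel; compute() triggers the actual work.
matches = ddf[ddf["tokens"].str.contains(pattern, na=False)]
result = matches.compute()
print(f"{len(result)} rows matched at least one keyword")
```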
- Docker and Docker Compose: Ensure Docker and Docker Compose are installed on your machine.
- Run Docker Compose:
  `docker-compose up --build`
  This will build the Docker image and run the pipeline. You can check the logs to see the progress of the pipeline.
- To get the devgpt.sqlite file, run the following command:
  `docker cp <container_id>:/app/devgpt.sqlite /path/on/local/machine`
  The container_id can be found by running `docker ps`.
  Sample usage: `docker cp devgpt-mining-controller-1:/app/devgpt.sqlite .`
- Ensure a stable internet connection for dataset downloads from Zenodo.