# 📦 Creating Your Own Dataset with 🤗 Datasets

Not every dataset you need exists on the Hugging Face Hub!  
Let's go step by step through creating a custom NLP dataset:  
We'll fetch GitHub issues from the 🤗 Datasets repository, clean & augment them, and push the final result to the Hub for the world to use.



Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

You will need to set up Git. Replace the email and name in the following cell with your own.

In [None]:
!git config --global user.email "lakshmi.adhikari26@gmail.com"
!git config --global user.name "Lakshmi-Adhikari-AI"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

### Getting the data
Let's first see how to fetch issue data with the GitHub REST API by requesting a single issue.

In [None]:
# Install requests library(for making HTTP calls)
!pip install requests

In [None]:
import requests

url = "https://api.github.com/repos/huggingface/datasets/issues?page=1&per_page=1"
response = requests.get(url)

In [None]:
response.status_code

In [None]:
response.json()

## 1️⃣ Install and Import Requirements

We'll use `requests` to fetch data from the GitHub API, and `tqdm`, `math`, `pandas`, and `datasets` for data processing.


In [None]:
# Install requests if not present
!pip install requests tqdm

import requests
import time
import math
from pathlib import Path
import pandas as pd
import json
from tqdm.notebook import tqdm
from datasets import load_dataset



## 2️⃣ Getting Data from the GitHub API

We retrieve issues (and pull requests) from the 🤗 Datasets repo using the GitHub REST API.  
We recommend authenticating with a GitHub [Personal Access Token](https://github.com/settings/tokens), which raises the rate limit from 60 to 5,000 requests per hour.


In [None]:
# Fill in your GitHub token
GITHUB_TOKEN = "your GitHub token"
# The scheme ("token") and the token value must be separated by a space
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
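Hard-coding a token in a notebook makes it easy to leak by accident. As a minimal sketch, the headers could instead be built from an environment variable. The helper name `build_headers` and the variable name `GITHUB_TOKEN` are assumptions made for illustration and are not part of the original notebook.

```python
import os


def build_headers(token=None):
    """Return request headers for the GitHub REST API.

    `token` falls back to the GITHUB_TOKEN environment variable (an assumed
    name). Without a token the requests are unauthenticated and subject to
    the lower rate limit.
    """
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Note the space between the scheme and the token value
        headers["Authorization"] = f"token {token}"
    return headers


headers = build_headers()
```

With this in place, the token lives outside the notebook and the rest of the code can keep using the `headers` variable unchanged.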

## 3️⃣ Define a Function to Download All Issues (and Pull Requests)

This function paginates through the issues, pauses for an hour whenever the rate limit is reached, and saves the raw data to a JSON Lines file for later use.
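The pagination arithmetic can be looked at in isolation before reading the full function. The sketch below assumes GitHub's documented behaviour that the `page` parameter starts at 1; the helper names `page_numbers` and `page_url` are hypothetical and only serve the illustration.

```python
import math


def page_numbers(num_issues, per_page=100):
    """Return the 1-indexed page numbers needed to cover `num_issues`."""
    num_pages = math.ceil(num_issues / per_page)
    return list(range(1, num_pages + 1))


def page_url(owner, repo, page, per_page=100):
    """Build the URL for one page of issues (open and closed)."""
    base_url = "https://api.github.com/repos"
    return f"{base_url}/{owner}/{repo}/issues?page={page}&per_page={per_page}&state=all"


pages = page_numbers(num_issues=250)
print(pages)
print(page_url("huggingface", "datasets", pages[0]))
```

Rounding up with `math.ceil` ensures the final, partially filled page is still requested.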


In [None]:

def fetch_issues(
    owner="huggingface",
    repo="datasets",
    num_issues=10_000,
    rate_limit=5_000,
    issues_path=Path("."),
):
    if not issues_path.is_dir():
        issues_path.mkdir(exist_ok=True)

    batch = []
    all_issues = []
    per_page = 100  # Number of issues to return per page
    num_pages = math.ceil(num_issues / per_page)
    base_url = "https://api.github.com/repos"

    for page in tqdm(range(num_pages)):
        # Query with state=all to get both open and closed issues.
        # GitHub page numbers start at 1, so shift the zero-based loop index.
        query = f"issues?page={page + 1}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}", headers=headers)
        batch.extend(issues.json())

        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []  # Flush batch for next time period
            print("Reached GitHub rate limit. Sleeping for one hour ...")
            time.sleep(60 * 60 + 1)

    all_issues.extend(batch)
    df = pd.DataFrame.from_records(all_issues)
    df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
    print(
        f"Downloaded all the issues for {repo}! Dataset stored at {issues_path}/{repo}-issues.jsonl"
    )
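The function above sleeps for a fixed hour once it has collected more items than `rate_limit`. A possible alternative, sketched here under the assumption that the response carries GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers, is to compute the wait from those headers. The helper name `seconds_until_reset` is hypothetical.

```python
import time


def seconds_until_reset(response_headers, now=None):
    """Return how long to sleep before the rate limit resets (0 if not exhausted).

    `response_headers` is a mapping such as `response.headers` from requests.
    A plain dict is case-sensitive, so the keys must match exactly there.
    """
    now = time.time() if now is None else now
    remaining = int(response_headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # The reset time is given as seconds since the Unix epoch (UTC)
    reset_at = int(response_headers.get("X-RateLimit-Reset", now))
    # Add one second of slack so we do not wake up just before the reset
    return max(0.0, reset_at - now + 1)


# Made-up header values for illustration
example_headers = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"}
print(seconds_until_reset(example_headers, now=1000))
```

Inside the download loop this could be used as `time.sleep(seconds_until_reset(issues.headers))` after each request, which waits only as long as needed.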

## 4️⃣ Downloading All Issues from 🤗 Datasets

⚠️ Depending on your connection, this may take a few minutes.  
Afterwards, you'll have a `datasets-issues.jsonl` file.


In [None]:
fetch_issues()
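To make the JSON Lines format concrete, here is a small sketch that writes and reads such a file using only the standard library. The records and the file name `toy-issues.jsonl` are made up for illustration; real issues contain many more fields.

```python
import json
import tempfile
from pathlib import Path

# Toy records standing in for real GitHub issues (made-up values)
records = [
    {"number": 1, "title": "Example issue", "pull_request": None},
    {"number": 2, "title": "Example pull request", "pull_request": {"merged_at": None}},
]

path = Path(tempfile.mkdtemp()) / "toy-issues.jsonl"

# JSON Lines: one JSON object per line, no enclosing list
with path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```

Because each line is an independent JSON object, the file can be read line by line without loading everything into memory at once.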

## 5️⃣ Load the Issues Locally as a Hugging Face Dataset

We'll use `load_dataset` to read the JSON lines file as a Dataset object for further analysis.


In [None]:
issues_dataset = load_dataset("json", data_files="datasets-issues.jsonl", split="train")
print(issues_dataset)


## 6️⃣ Distinguishing Issues from Pull Requests

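The GitHub API returns pull requests alongside regular issues. In the loaded dataset, the `pull_request` column is `None` for regular issues and populated for pull requests.  
Let's add an `is_pull_request` boolean column.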


In [None]:
issues_dataset = issues_dataset.map(
    lambda x: {"is_pull_request": x["pull_request"] is not None}
)
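The same check can be tried on plain dictionaries without loading the dataset. This is a minimal sketch with made-up rows; the function name `add_is_pull_request` is hypothetical.

```python
def add_is_pull_request(example):
    """Return the new column for one row: True when `pull_request` is set."""
    return {"is_pull_request": example.get("pull_request") is not None}


# Toy rows standing in for dataset examples (made-up values)
toy_rows = [
    {"number": 1, "pull_request": None},
    {"number": 2, "pull_request": {"merged_at": None}},
]

flags = [add_is_pull_request(row)["is_pull_request"] for row in toy_rows]
print(flags)
```

A named function like this can be passed to `issues_dataset.map` in place of the lambda, which makes the logic easier to test on its own.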


## 7️⃣ Sampling and Inspecting the Structure

Shuffle and print three samples for manual inspection.


In [None]:
sample = issues_dataset.shuffle(seed=666).select(range(3))
# Print the URL and the pull request entry of each sampled row
for url, pr in zip(sample["html_url"], sample["pull_request"]):
    print(f">> URL: {url}")
    print(f">> Pull request: {pr}\n")

## 8️⃣ (Optional) Add Comments to Each Issue

Let's fetch all comments for each issue, and add them as a "comments" column.  
**This is network-intensive and can take a while!**


In [None]:
def get_comments(issue_number):
    """Return the body text of every comment on the given issue."""
    url = f"https://api.github.com/repos/huggingface/datasets/issues/{issue_number}/comments"
    response = requests.get(url, headers=headers)
    return [r["body"] for r in response.json()]


# Add comments to all issues
issues_with_comments_dataset = issues_dataset.map(
    lambda x: {"comments": get_comments(x["number"])}
)
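When a request fails (for example, a missing issue or an exhausted rate limit), the GitHub API returns a JSON object with a `message` field rather than a list of comments, and the list comprehension in `get_comments` would then raise an error. The sketch below shows one way to guard against that; the helper name `extract_comment_bodies` is hypothetical and the payloads are made up.

```python
def extract_comment_bodies(payload):
    """Return comment bodies from a decoded API response.

    A successful response is a list of comment objects. Anything else
    (such as an error object) yields an empty list.
    """
    if not isinstance(payload, list):
        return []
    return [
        comment["body"]
        for comment in payload
        if isinstance(comment, dict) and "body" in comment
    ]


# Made-up payloads for illustration
ok_payload = [{"id": 1, "body": "Thanks for the report!"}, {"id": 2, "body": "Fixed."}]
error_payload = {"message": "Not Found"}

print(extract_comment_bodies(ok_payload))
print(extract_comment_bodies(error_payload))
```

Returning an empty list keeps a long `map` run from crashing, at the cost of hiding failures. Raising an exception or logging the `message` field would be a reasonable alternative if silent gaps are not acceptable.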

## 9️⃣ Push Your Dataset to the Hugging Face Hub

Log in to Hugging Face in your notebook, then upload!


In [None]:
from huggingface_hub import notebook_login
notebook_login()
# Push to the Hub (replace "Lakshmi26/github-issues" with your own "<username>/<dataset_name>")
issues_with_comments_dataset.push_to_hub("Lakshmi26/github-issues")

In [None]:
# Test our function works as expected
get_comments(2792)

## 🔟 (Optional) Load Your Uploaded Dataset

Anyone can now load your dataset as follows:


In [None]:
# Use the same repo ID that you pushed to above
remote_dataset = load_dataset("Lakshmi26/github-issues")
print(remote_dataset)

## 🔖 Remember: Create a Dataset Card!

After pushing, add a `README.md` ("dataset card") on the Hub.  
Describe the dataset, how it was built, license, intended use, and fields.  
Follow the [Hugging Face guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) for a recommended template.
