**Initialization**
- I put these three lines at the top of every notebook. The first two enable autoreload, so edits to imported modules are picked up without restarting the kernel. The third renders Matplotlib plots inline in the notebook.

In [1]:
#@ INITIALIZATION: 
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Importing Libraries and Dependencies**
- I import all the libraries and dependencies required for the project in a single cell. Uncomment the `pip install` line if `transformers` is not installed yet.

In [2]:
#@ IMPORTING MODULES: UNCOMMENT BELOW:
# !pip install transformers
import time
import math
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm.auto import tqdm

#@ IGNORING WARNINGS: 
import warnings
warnings.filterwarnings("ignore")

**Getting Data**

In [3]:
#@ FUNCTION FOR FETCHING ISSUES DATA: 
def fetch_issues(owner="huggingface", repo="transformers", num_issues=10_000,
                 rate_limit=5_000):                                                 # Defining function.
    batch = []                                                                      # Initialization.
    all_issues = []                                                                 # Initialization.
    per_page = 100                                                                  # Initialization.
    num_pages = math.ceil(num_issues / per_page)                                    # Initialization.
    base_url = "https://api.github.com/repos"                                       # Initialization.
    for page in tqdm(range(1, num_pages + 1)):                                      # GitHub pages start at 1.
        query = f"issues?page={page}&per_page={per_page}&state=all"
        issues = requests.get(f"{base_url}/{owner}/{repo}/{query}")                 # Getting issues.
        batch.extend(issues.json())
        if len(batch) > rate_limit and len(all_issues) < num_issues:
            all_issues.extend(batch)
            batch = []
            print("Reached GitHub rate limit. Sleeping for one hour")
            time.sleep(60 * 60 + 1)
    all_issues.extend(batch)                                                        # Adding issues.
    df = pd.DataFrame.from_records(all_issues)                                      # Creating the dataframe.
    df.to_json(f"github-issues-{repo}.jsonl", orient="records", lines=True)         # Saving as JSON Lines.

#@ GETTING THE DATA: UNCOMMENT BELOW:
# fetch_issues()
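
The paging logic in `fetch_issues` can be checked without touching the network. The sketch below only builds the request URLs, assuming GitHub's REST API numbers pages from 1. `build_issue_urls` is a hypothetical helper written for illustration and is not part of the notebook.

```python
import math

def build_issue_urls(owner="huggingface", repo="transformers",
                     num_issues=10_000, per_page=100):
    """Return one issues-endpoint URL per page (hypothetical helper)."""
    base_url = "https://api.github.com/repos"
    num_pages = math.ceil(num_issues / per_page)
    # Assumption: pages are 1-indexed, so iterate 1..num_pages inclusive.
    return [
        f"{base_url}/{owner}/{repo}/issues?page={page}&per_page={per_page}&state=all"
        for page in range(1, num_pages + 1)
    ]

urls = build_issue_urls(num_issues=250)
print(len(urls))   # 250 issues at 100 per page need 3 requests
print(urls[0])
```

Unauthenticated requests to the GitHub API are limited to 60 per hour, so a full download will most likely need a personal access token sent in an `Authorization` header to reach the higher limit.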

In [5]:
#@ ALTERNATIVE: DOWNLOADING THE PREPARED DATASET: UNCOMMENT BELOW:
# !wget https://raw.githubusercontent.com/nlp-with-transformers/notebooks/main/data/github-issues-transformers.jsonl

**Preparing Data**

In [12]:
#@ PREPARING THE DATA:
dataset_url = "https://git.io/nlp-with-transformers"            # Initialization.
df_issues = pd.read_json(dataset_url, lines=True)               # Creating dataframe. 
print(f"DataFrame shape: {df_issues.shape}")                    # Inspecting shape.

DataFrame shape: (9930, 26)


In [13]:
#@ PREPARING THE DATA:
cols = ["url", "id", "title", "user", "labels", "state", "created_at", "body"]      # Initializing column names.
df_issues.loc[2, cols].to_frame()                                                   # Inspection.

| | 2 |
|---|---|
| url | https://api.github.com/repos/huggingface/trans... |
| id | 849529761 |
| title | [DeepSpeed] ZeRO stage 3 integration: getting ... |
| user | {'login': 'stas00', 'id': 10676103, 'node_id':... |
| labels | [{'id': 2659267025, 'node_id': 'MDU6TGFiZWwyNj... |
| state | open |
| created_at | 2021-04-02 23:40:42 |
| body | \*\*[This is not yet alive, preparing for the re... |


In [14]:
#@ INSPECTING THE DATAFRAME: LABELS:
df_issues["labels"] = df_issues["labels"].apply(lambda x: [meta["name"] for meta in x])     # Initializing labels.
df_issues[["labels"]].head()

| | labels |
|---|---|
| 0 | [] |
| 1 | [] |
| 2 | [DeepSpeed] |
| 3 | [] |
| 4 | [] |
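
What the `lambda` in this cell does to a single row can be sketched on plain Python data. The records below are hypothetical and only mimic the shape of the `labels` field. Each label is a dictionary of metadata, and only its `name` is kept.

```python
# Hypothetical label metadata, shaped like the "labels" field of one issue.
raw_labels = [
    {"id": 1, "name": "DeepSpeed", "color": "4D34F7"},
    {"id": 2, "name": "Core: Modeling", "color": "FF8446"},
]

# Same list comprehension as in the notebook cell, applied to one row.
names = [meta["name"] for meta in raw_labels]
print(names)

# An issue without labels has an empty list, which stays empty.
empty = [meta["name"] for meta in []]
print(empty)
```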


In [15]:
#@ INSPECTING THE LABELS:
df_issues["labels"].apply(lambda x: len(x)).value_counts().to_frame().T

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| labels | 6440 | 3057 | 305 | 100 | 25 | 3 |


In [17]:
#@ INSPECTING THE LABELS: FREQUENT:
df_counts = df_issues["labels"].explode().value_counts()        # Exploding the labels column.
print(f"Number of labels: {len(df_counts)}")                    # Inspection.
df_counts.to_frame().head(7).T                                  # Inspecting labels.

Number of labels: 65


| | wontfix | model card | Core: Tokenization | New model | Core: Modeling | Help wanted | Good First Issue |
|---|---|---|---|---|---|---|---|
| labels | 2284 | 649 | 106 | 98 | 64 | 52 | 50 |
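
As a rough plain-Python equivalent of `explode().value_counts()`, the sketch below flattens a toy list of per-issue labels and counts each name. The toy data is hypothetical, and `Counter` stands in for the pandas calls rather than reproducing them exactly.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-issue label lists; two issues have no labels at all.
labels_per_issue = [[], ["wontfix"], ["wontfix", "New model"], []]

# Flatten the nested lists (the "explode" step), then count each label.
label_counts = Counter(chain.from_iterable(labels_per_issue))
print(label_counts.most_common())
```

In pandas, `explode` turns an empty list into a missing value, which `value_counts` skips by default, so issues without labels do not add a row to the count there either.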


In [18]:
#@ PREPARING THE LABELS COLUMN:
label_map = {"Core: Tokenization": "tokenization", 
             "New model": "new model",
             "Core: Modeling": "model training",
             "Usage": "usage", 
             "Core: Pipeline": "pipeline",
             "TensorFlow": "tensorflow or tf",
             "PyTorch": "pytorch", 
             "Examples": "examples", 
             "Documentation": "documentation"}                          # Standardization of labels.
def filter_labels(x):                                                   # Defining function. 
    return [label_map[label] for label in x if label in label_map]      # Filtering labels.
df_issues["labels"] = df_issues["labels"].apply(filter_labels)          # Implementation of filtering.
all_labels = list(label_map.values())                                   # Initializing labels.
all_labels                                                              # Inspection.

['tokenization',
 'new model',
 'model training',
 'usage',
 'pipeline',
 'tensorflow or tf',
 'pytorch',
 'examples',
 'documentation']
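
To illustrate the filtering step, here is a minimal sketch that applies the same function to two hypothetical issues, using only a subset of the label map. Labels outside the map are dropped, so an issue that carries only unmapped labels ends up with an empty list.

```python
# Subset of the notebook's label map, kept short for illustration.
label_map = {"Core: Tokenization": "tokenization", "New model": "new model"}

def filter_labels(labels):
    """Keep only mapped labels and translate them to their standard names."""
    return [label_map[label] for label in labels if label in label_map]

kept = filter_labels(["wontfix", "New model"])
dropped = filter_labels(["wontfix"])
print(kept)
print(dropped)
```

This is why the number of labeled issues shrinks after filtering: any issue whose labels are all outside the map is treated as unlabeled from here on.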

In [19]:
#@ INSPECTING DISTRIBUTION OF LABELS:
df_counts = df_issues["labels"].explode().value_counts()        # Exploding the labels column.
df_counts.to_frame().T                                          # Inspecting labels.

| | tokenization | new model | model training | usage | pipeline | tensorflow or tf | pytorch | documentation | examples |
|---|---|---|---|---|---|---|---|---|---|
| labels | 106 | 98 | 64 | 46 | 42 | 41 | 37 | 28 | 24 |


In [20]:
#@ SPLITTING LABELED AND UNLABELED ISSUES:
df_issues["split"] = "unlabeled"                                # Initialization.
mask = df_issues["labels"].apply(lambda x: len(x)) > 0          # Initializing mask.
df_issues.loc[mask, "split"] = "labeled"                        # Initialization.
df_issues["split"].value_counts().to_frame().T                  # Inspection.

| | unlabeled | labeled |
|---|---|---|
| split | 9489 | 441 |


In [26]:
#@ INSPECTING AN EXAMPLE:
for column in ["title", "body", "labels"]:
    print(f"{column}: {df_issues[column].iloc[26][:500]}\n")

title: Add new CANINE model

body: # 🌟 New model addition

## Model description

Google recently proposed a new **C**haracter **A**rchitecture with **N**o tokenization **I**n **N**eural **E**ncoders architecture (CANINE). Not only the title is exciting:

> Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually en

labels: ['new model']

