# Dataset creation
This notebook will:
* Find github issues by label
* Remove Technical debt labels
* Make a test set
It will produce 4 different dataset:
1. High/med/priority dataset
2. Test dataset
3. Debt dataset
4. Full dataset with nothing removed

In [1]:
import pandas as pd
import re
import os

In [2]:
# folder path
dir_path = r'../../csv/all_issues' ## Point to the extracted folder containing all the issues csv files

# list to store files
res = []

# Iterate directory
for path in os.listdir(dir_path):
    # check if current path is a file
    if os.path.isfile(os.path.join(dir_path, path)):
        res.append(path)
print(res)

['issues_all_2020-02-03.csv', 'issues_all_2016-09-19.csv', 'issues_all_2023-09-11.csv', 'issues_all_2020-09-24.csv', 'issues_all_2023-05-07.csv', 'issues_all_2022-09-18.csv', 'issues_all_2018-02-13.csv', 'issues_all_2019-12-22.csv', 'issues_all_2016-04-20.csv', 'issues_all_2017-01-01.csv', 'issues_all_2017-08-27.csv', 'issues_all_2022-08-20.csv', 'issues_all_2016-11-23.csv', 'issues_all_2018-12-12.csv', 'issues_all_2019-06-06.csv', 'issues_all_2023-01-20.csv', 'issues_all_2022-04-13.csv', 'issues_all_2022-06-10.csv', 'issues_all_2021-06-12.csv', 'issues_all_2019-07-13.csv', 'issues_all_2019-01-24.csv', 'issues_all_2017-03-06.csv', 'issues_all_2015-01-01.csv', 'issues_all_2022-05-03.csv', 'issues_all_2020-02-16.csv', 'issues_all_2015-06-09.csv', 'issues_all_2022-11-23.csv', 'issues_all_2017-02-08.csv', 'issues_all_2020-11-26.csv', 'issues_all_2016-08-11.csv', 'issues_all_2023-04-24.csv', 'issues_all_2019-01-14.csv', 'issues_all_2017-04-30.csv', 'issues_all_2018-12-06.csv', 'issues_all_2

In [3]:
appended_data = []
# Regular expression to capture various variations of "high priority"
high_priority = r"\bhigh\W*p(?:ri(?:o(?:rity)?)?)?\b|\bp(?:ri(?:o(?:rity)?)?)?\W*high\b"

not_high_priority = r"\b(?:high\W*|critical\W*|severe\W*|important\W*|urgent\W*|essential\W*|imperative\W*|paramount\W*|pressing\W*|crucial\W*|vital\W*|mandatory\W*|top\W*priority\W*|compulsory\W*|expedient\W*)(?:p(?:ri(?:o(?:rity)?)?)?|\burgent\b|\bsevere\b)\b|\b(?:p(?:ri(?:o(?:rity)?)?)?|\burgent\b|\bsevere\b)\W*(?:high|critical|severe|important|urgent|essential|imperative|paramount|pressing|crucial|vital|mandatory|top\W*priority|compulsory|expedient)\b"
medium_priority = r"\b(?:medium|mid)\W*p(?:ri(?:o(?:rity)?)?)?\b|\bp(?:ri(?:o(?:rity)?)?)?\W*(?:medium|mid)\b"

low_priority = r"\blow\W*p(?:ri(?:o(?:rity)?)?)?\b|\bp(?:ri(?:o(?:rity)?)?)?\W*low\b"

pattern=not_high_priority
file_name = "not_high"
length_res = len(res)
for i, r in enumerate(res):
    if i > 200:
        break
    try:
        file_path = f"{dir_path}/{r}"
        df = pd.read_csv(file_path, index_col=0)
        
        # Make sure the dataframe is not empty 
        df = df[df['labels'].notnull() & df['labels'].str.strip().astype(bool)]


        # ~ Finds NOT in regex
        df = df[
           ~df["labels"].str.contains(pattern, case=False, na=False, regex=True)
        ]

        if not df.empty:  # Append non-empty dataframes to the list
           appended_data.append(df)


        print(f"{i}/{length_res} Processed: {r} ")
    except Exception as e:
        print(f"{i}/{length_res} Error processing {r}: {e}")

appended_data = pd.concat(appended_data, ignore_index=True)


0/3137 Processed: issues_all_2020-02-03.csv 
1/3137 Processed: issues_all_2016-09-19.csv 
2/3137 Error processing issues_all_2023-09-11.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

3/3137 Processed: issues_all_2020-09-24.csv 
4/3137 Processed: issues_all_2023-05-07.csv 
5/3137 Processed: issues_all_2022-09-18.csv 
6/3137 Processed: issues_all_2018-02-13.csv 
7/3137 Processed: issues_all_2019-12-22.csv 
8/3137 Processed: issues_all_2016-04-20.csv 
9/3137 Processed: issues_all_2017-01-01.csv 
10/3137 Processed: issues_all_2017-08-27.csv 
11/3137 Processed: issues_all_2022-08-20.csv 
12/3137 Processed: issues_all_2016-11-23.csv 
13/3137 Processed: issues_all_2018-12-12.csv 
14/3137 Processed: issues_all_2019-06-06.csv 


  df = pd.read_csv(file_path, index_col=0)


15/3137 Processed: issues_all_2023-01-20.csv 
16/3137 Processed: issues_all_2022-04-13.csv 
17/3137 Processed: issues_all_2022-06-10.csv 
18/3137 Processed: issues_all_2021-06-12.csv 
19/3137 Processed: issues_all_2019-07-13.csv 
20/3137 Processed: issues_all_2019-01-24.csv 
21/3137 Processed: issues_all_2017-03-06.csv 
22/3137 Processed: issues_all_2015-01-01.csv 
23/3137 Processed: issues_all_2022-05-03.csv 
24/3137 Processed: issues_all_2020-02-16.csv 
25/3137 Processed: issues_all_2015-06-09.csv 
26/3137 Error processing issues_all_2022-11-23.csv: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

27/3137 Processed: issues_all_2017-02-08.csv 
28/3137 Processed: issues_all_2020-11-26.csv 
29/3137 Processed: issues_all_2016-08-11.csv 
30/3137 Processed: issues_all_2023-04-24.csv 
31/3137 Processed: issues_all_2019-01-14.csv 
32/3137 Processed: issues_all_2017-04-30.csv 
33/3137 Processed: issues_all_2018-12-06.csv 
34/3137 Processed: issues_all_2

  df = pd.read_csv(file_path, index_col=0)


45/3137 Processed: issues_all_2021-04-23.csv 
46/3137 Processed: issues_all_2017-06-30.csv 
47/3137 Processed: issues_all_2018-07-08.csv 
48/3137 Processed: issues_all_2017-07-26.csv 
49/3137 Processed: issues_all_2017-05-20.csv 
50/3137 Processed: issues_all_2016-08-07.csv 
51/3137 Processed: issues_all_2021-05-05.csv 
52/3137 Processed: issues_all_2022-06-17.csv 
53/3137 Processed: issues_all_2019-08-21.csv 
54/3137 Processed: issues_all_2019-12-24.csv 
55/3137 Processed: issues_all_2020-01-29.csv 
56/3137 Processed: issues_all_2016-09-07.csv 
57/3137 Processed: issues_all_2017-12-17.csv 
58/3137 Processed: issues_all_2016-03-09.csv 
59/3137 Processed: issues_all_2023-07-25.csv 
60/3137 Processed: issues_all_2019-09-06.csv 
61/3137 Processed: issues_all_2015-04-26.csv 
62/3137 Processed: issues_all_2023-09-24.csv 
63/3137 Processed: issues_all_2015-07-23.csv 
64/3137 Processed: issues_all_2018-09-15.csv 
65/3137 Processed: issues_all_2017-11-19.csv 
66/3137 Processed: issues_all_2015

  df = pd.read_csv(file_path, index_col=0)


67/3137 Processed: issues_all_2023-01-24.csv 
68/3137 Processed: issues_all_2015-06-18.csv 
69/3137 Processed: issues_all_2018-12-30.csv 
70/3137 Processed: issues_all_2021-03-31.csv 
71/3137 Processed: issues_all_2017-03-19.csv 
72/3137 Processed: issues_all_2017-10-31.csv 
73/3137 Processed: issues_all_2016-03-31.csv 
74/3137 Processed: issues_all_2021-10-09.csv 
75/3137 Processed: issues_all_2019-06-08.csv 
76/3137 Processed: issues_all_2019-02-04.csv 
77/3137 Processed: issues_all_2018-06-28.csv 
78/3137 Processed: issues_all_2017-12-23.csv 


  df = pd.read_csv(file_path, index_col=0)


79/3137 Processed: issues_all_2023-03-15.csv 


  df = pd.read_csv(file_path, index_col=0)


80/3137 Processed: issues_all_2016-06-23.csv 
81/3137 Processed: issues_all_2018-04-08.csv 
82/3137 Processed: issues_all_2018-06-06.csv 
83/3137 Processed: issues_all_2016-09-09.csv 
84/3137 Processed: issues_all_2017-02-27.csv 
85/3137 Processed: issues_all_2018-12-28.csv 
86/3137 Processed: issues_all_2021-04-30.csv 
87/3137 Processed: issues_all_2015-12-07.csv 
88/3137 Processed: issues_all_2019-12-02.csv 
89/3137 Processed: issues_all_2015-03-27.csv 
90/3137 Processed: issues_all_2015-01-10.csv 
91/3137 Processed: issues_all_2018-02-26.csv 
92/3137 Processed: issues_all_2019-04-22.csv 
93/3137 Processed: issues_all_2021-12-24.csv 
94/3137 Processed: issues_all_2022-05-22.csv 
95/3137 Processed: issues_all_2016-06-15.csv 
96/3137 Processed: issues_all_2020-08-03.csv 
97/3137 Processed: issues_all_2015-02-14.csv 
98/3137 Processed: issues_all_2017-07-16.csv 


  df = pd.read_csv(file_path, index_col=0)


99/3137 Processed: issues_all_2023-02-10.csv 
100/3137 Processed: issues_all_2017-03-18.csv 
101/3137 Processed: issues_all_2016-01-17.csv 
102/3137 Processed: issues_all_2023-07-24.csv 
103/3137 Processed: issues_all_2018-11-19.csv 


  df = pd.read_csv(file_path, index_col=0)


104/3137 Processed: issues_all_2022-07-08.csv 
105/3137 Processed: issues_all_2022-05-30.csv 
106/3137 Processed: issues_all_2020-08-05.csv 
107/3137 Processed: issues_all_2020-06-04.csv 
108/3137 Processed: issues_all_2021-06-24.csv 
109/3137 Processed: issues_all_2020-12-31.csv 
110/3137 Processed: issues_all_2019-06-21.csv 
111/3137 Processed: issues_all_2021-06-06.csv 
112/3137 Processed: issues_all_2022-04-17.csv 
113/3137 Processed: issues_all_2022-08-12.csv 
114/3137 Processed: issues_all_2017-09-08.csv 
115/3137 Processed: issues_all_2021-01-09.csv 
116/3137 Processed: issues_all_2023-06-07.csv 
117/3137 Processed: issues_all_2020-11-09.csv 
118/3137 Processed: issues_all_2015-03-13.csv 
119/3137 Processed: issues_all_2017-03-08.csv 
120/3137 Processed: issues_all_2020-11-19.csv 
121/3137 Processed: issues_all_2021-08-12.csv 
122/3137 Processed: issues_all_2016-05-06.csv 
123/3137 Processed: issues_all_2020-02-15.csv 
124/3137 Processed: issues_all_2018-03-30.csv 
125/3137 Proc

  df = pd.read_csv(file_path, index_col=0)


146/3137 Processed: issues_all_2023-02-16.csv 
147/3137 Processed: issues_all_2021-05-16.csv 
148/3137 Processed: issues_all_2018-02-21.csv 
149/3137 Processed: issues_all_2022-10-27.csv 
150/3137 Processed: issues_all_2022-01-30.csv 
151/3137 Processed: issues_all_2022-01-09.csv 
152/3137 Processed: issues_all_2018-08-07.csv 
153/3137 Processed: issues_all_2023-08-18.csv 
154/3137 Processed: issues_all_2016-08-15.csv 
155/3137 Processed: issues_all_2018-08-25.csv 
156/3137 Processed: issues_all_2021-08-08.csv 
157/3137 Processed: issues_all_2021-02-12.csv 
158/3137 Processed: issues_all_2018-12-14.csv 
159/3137 Processed: issues_all_2015-08-05.csv 
160/3137 Processed: issues_all_2019-05-04.csv 
161/3137 Processed: issues_all_2015-06-19.csv 
162/3137 Processed: issues_all_2019-09-28.csv 
163/3137 Processed: issues_all_2017-08-29.csv 
164/3137 Processed: issues_all_2016-12-23.csv 
165/3137 Processed: issues_all_2018-10-16.csv 
166/3137 Processed: issues_all_2020-01-23.csv 
167/3137 Proc

In [4]:
appended_data = pd.DataFrame(appended_data)
appended_data

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,11417537389,IssuesEvent,2020-02-03 00:00:07,automationbs/testbugreporting,https://api.github.com/repos/automationbs/test...,opened,Default title,bug,default description\n\n|Property | Value|\n|--...
1,11417537804,IssuesEvent,2020-02-03 00:00:12,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Show alias,Issue-Question Resolution-Answered,"<!--\r\n\r\nFor Windows PowerShell 5.1 issues,..."
2,11417537835,IssuesEvent,2020-02-03 00:00:13,thadiun/hello-world,https://api.github.com/repos/thadiun/hello-world,closed,[Test] testing needs-triage (no delay),triage,
3,11417538003,IssuesEvent,2020-02-03 00:00:16,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,$ENV in using module,Area-Language Issue-Question Resolution-Answered,# Summary of the new feature/enhancement\r\n\r...
4,11417538116,IssuesEvent,2020-02-03 00:00:18,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Measure-Command does not measure its self,Area-Cmdlets-Utility Issue-Question Resolution...,## The Issue\r\nI'm pretty sure this was just ...
...,...,...,...,...,...,...,...,...,...
2029982,8109961054,IssuesEvent,2018-08-14 09:21:15,urbit/arvo,https://api.github.com/repos/urbit/arvo,opened,%bad-text trips up hall JSON conversion,:hall / :talk cause known marks web interface,"Haven't tested this in detail yet, but pretty ..."
2029983,8109961116,IssuesEvent,2018-08-14 09:21:16,highcharts/highcharts-react,https://api.github.com/repos/highcharts/highch...,closed,HighMaps mapBubble type,pending reply,"Hello,\r\nIt is possible to create also HighMa..."
2029984,8109981673,IssuesEvent,2018-08-14 09:25:07,Loriowar/comindivion,https://api.github.com/repos/Loriowar/comindivion,opened,Add a validation on a belonging of a predicate...,bug data manipulation,Now user can change html content of the intera...
2029985,8109983220,IssuesEvent,2018-08-14 09:25:24,TiiQu-Network/TQ-test-page,https://api.github.com/repos/TiiQu-Network/TQ-...,closed,Error on submit with no values,bug,TypeError: Too few arguments in function sum (...


In [5]:
# Remove medium priority issues
appended_data = appended_data[
   ~appended_data["labels"].str.contains(medium_priority, case=False, na=False, regex=True)
]

# Remove low priority issues
appended_data = appended_data[
   ~appended_data["labels"].str.contains(low_priority, case=False, na=False, regex=True)
]
appended_data.reset_index(drop=True, inplace=True)

appended_data

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,11417537389,IssuesEvent,2020-02-03 00:00:07,automationbs/testbugreporting,https://api.github.com/repos/automationbs/test...,opened,Default title,bug,default description\n\n|Property | Value|\n|--...
1,11417537804,IssuesEvent,2020-02-03 00:00:12,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Show alias,Issue-Question Resolution-Answered,"<!--\r\n\r\nFor Windows PowerShell 5.1 issues,..."
2,11417537835,IssuesEvent,2020-02-03 00:00:13,thadiun/hello-world,https://api.github.com/repos/thadiun/hello-world,closed,[Test] testing needs-triage (no delay),triage,
3,11417538003,IssuesEvent,2020-02-03 00:00:16,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,$ENV in using module,Area-Language Issue-Question Resolution-Answered,# Summary of the new feature/enhancement\r\n\r...
4,11417538116,IssuesEvent,2020-02-03 00:00:18,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Measure-Command does not measure its self,Area-Cmdlets-Utility Issue-Question Resolution...,## The Issue\r\nI'm pretty sure this was just ...
...,...,...,...,...,...,...,...,...,...
2003584,8109961054,IssuesEvent,2018-08-14 09:21:15,urbit/arvo,https://api.github.com/repos/urbit/arvo,opened,%bad-text trips up hall JSON conversion,:hall / :talk cause known marks web interface,"Haven't tested this in detail yet, but pretty ..."
2003585,8109961116,IssuesEvent,2018-08-14 09:21:16,highcharts/highcharts-react,https://api.github.com/repos/highcharts/highch...,closed,HighMaps mapBubble type,pending reply,"Hello,\r\nIt is possible to create also HighMa..."
2003586,8109981673,IssuesEvent,2018-08-14 09:25:07,Loriowar/comindivion,https://api.github.com/repos/Loriowar/comindivion,opened,Add a validation on a belonging of a predicate...,bug data manipulation,Now user can change html content of the intera...
2003587,8109983220,IssuesEvent,2018-08-14 09:25:24,TiiQu-Network/TQ-test-page,https://api.github.com/repos/TiiQu-Network/TQ-...,closed,Error on submit with no values,bug,TypeError: Too few arguments in function sum (...


In [6]:
# Create debt dataframe
technical_debt_regex = r'debt|\bTD\b'
contains_debt_high_priority = appended_data['labels'].str.contains(technical_debt_regex, case=False, na=False)
debt = appended_data[contains_debt_high_priority].reset_index(drop=True)
debt

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,11417553790,IssuesEvent,2020-02-03 00:05:55,voces/fixus,https://api.github.com/repos/voces/fixus,closed,Refactor into semantic TypeScript,techdebt,"The code is ported to TypeScript, but it does ..."
1,11418032700,IssuesEvent,2020-02-03 02:43:22,BeeStation/NSV13,https://api.github.com/repos/BeeStation/NSV13,opened,Snowflake engine depends on same area to conne...,National Debt,this is dumb. why do we do this? There's only ...
2,11418809688,IssuesEvent,2020-02-03 06:03:33,skbkontur/retail-ui,https://api.github.com/repos/skbkontur/retail-ui,closed,[retail-ui] Перевести сборку библиотеки на bab...,in progress technical debt,- [ ] TS → Babel@7 #1129\r\n - [ ] Remove all...
3,11419464215,IssuesEvent,2020-02-03 08:02:43,skbkontur/retail-ui,https://api.github.com/repos/skbkontur/retail-ui,closed,[retail-ui] warnings в компонентах,minor technical debt,Компоненты генерят много сообщений в консоль -...
4,11419744108,IssuesEvent,2020-02-03 08:42:02,monarc-project/MonarcAppFO,https://api.github.com/repos/monarc-project/Mo...,opened,Improve the import speed of analyses and insta...,Technical debt important-for-v3,Currently the medium size analyses import can ...
...,...,...,...,...,...,...,...,...,...
3715,8109276248,IssuesEvent,2018-08-14 06:53:54,spring-cloud/spring-cloud-dataflow,https://api.github.com/repos/spring-cloud/spri...,closed,Move stream aggregate state calculation logic ...,in pr technical-debt,"Currently, the stream aggregate state calculat..."
3716,8109490286,IssuesEvent,2018-08-14 07:45:55,tsoding/ray-tracer,https://api.github.com/repos/tsoding/ray-tracer,closed,CI is broken,techdebt,Caused by #43
3717,8109490292,IssuesEvent,2018-08-14 07:45:55,tsoding/ray-tracer,https://api.github.com/repos/tsoding/ray-tracer,closed,gcc is not supported on Travis CI,techdebt,introduced in #20
3718,8109508676,IssuesEvent,2018-08-14 07:49:59,tsoding/ray-tracer,https://api.github.com/repos/tsoding/ray-tracer,closed,LICENSE,techdebt,


In [7]:
# Dataset with debt and not priority
debt_file_name = f"csv/{file_name}_priority_with_debt_only.csv"
debt.to_csv(debt_file_name)

In [8]:
debt.labels.value_counts().to_frame()[:50]

Unnamed: 0_level_0,count
labels,Unnamed: 1_level_1
technical debt,593
tech debt,286
tech-debt,130
debt,111
techdebt,95
Technical Debt,90
Tech Debt,73
technical-debt,52
Technical debt,47
enhancement technical debt,38


In [9]:
# To labels
appended_data.labels.value_counts().to_frame()[:50] 

Unnamed: 0_level_0,count
labels,Unnamed: 1_level_1
enhancement,244118
bug,240687
security vulnerability,53086
question,34833
feature,20745
documentation,17938
Bug,12282
help wanted,10875
stale branch,8274
wontfix,7279


In [10]:
# Remove debt from the dataset
appended_data_no_debt = appended_data[~contains_debt_high_priority].reset_index(drop=True)
appended_data_no_debt.reset_index(drop=True, inplace=True)
appended_data_no_debt

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,11417537389,IssuesEvent,2020-02-03 00:00:07,automationbs/testbugreporting,https://api.github.com/repos/automationbs/test...,opened,Default title,bug,default description\n\n|Property | Value|\n|--...
1,11417537804,IssuesEvent,2020-02-03 00:00:12,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Show alias,Issue-Question Resolution-Answered,"<!--\r\n\r\nFor Windows PowerShell 5.1 issues,..."
2,11417537835,IssuesEvent,2020-02-03 00:00:13,thadiun/hello-world,https://api.github.com/repos/thadiun/hello-world,closed,[Test] testing needs-triage (no delay),triage,
3,11417538003,IssuesEvent,2020-02-03 00:00:16,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,$ENV in using module,Area-Language Issue-Question Resolution-Answered,# Summary of the new feature/enhancement\r\n\r...
4,11417538116,IssuesEvent,2020-02-03 00:00:18,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Measure-Command does not measure its self,Area-Cmdlets-Utility Issue-Question Resolution...,## The Issue\r\nI'm pretty sure this was just ...
...,...,...,...,...,...,...,...,...,...
1999864,8109961054,IssuesEvent,2018-08-14 09:21:15,urbit/arvo,https://api.github.com/repos/urbit/arvo,opened,%bad-text trips up hall JSON conversion,:hall / :talk cause known marks web interface,"Haven't tested this in detail yet, but pretty ..."
1999865,8109961116,IssuesEvent,2018-08-14 09:21:16,highcharts/highcharts-react,https://api.github.com/repos/highcharts/highch...,closed,HighMaps mapBubble type,pending reply,"Hello,\r\nIt is possible to create also HighMa..."
1999866,8109981673,IssuesEvent,2018-08-14 09:25:07,Loriowar/comindivion,https://api.github.com/repos/Loriowar/comindivion,opened,Add a validation on a belonging of a predicate...,bug data manipulation,Now user can change html content of the intera...
1999867,8109983220,IssuesEvent,2018-08-14 09:25:24,TiiQu-Network/TQ-test-page,https://api.github.com/repos/TiiQu-Network/TQ-...,closed,Error on submit with no values,bug,TypeError: Too few arguments in function sum (...


In [11]:
# Sanity check
should_be_empty=appended_data_no_debt[appended_data_no_debt['labels'].str.contains(technical_debt_regex, case=False, na=False)]
should_be_empty

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body


In [12]:

contains_stale = appended_data['labels'].str.contains("stale", case=False, na=False)
appended_data_no_debt = appended_data_no_debt[~contains_stale].reset_index(drop=True)
appended_data_no_debt.reset_index(drop=True, inplace=True)
appended_data_no_debt

  appended_data_no_debt = appended_data_no_debt[~contains_stale].reset_index(drop=True)


Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,11417537389,IssuesEvent,2020-02-03 00:00:07,automationbs/testbugreporting,https://api.github.com/repos/automationbs/test...,opened,Default title,bug,default description\n\n|Property | Value|\n|--...
1,11417537804,IssuesEvent,2020-02-03 00:00:12,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Show alias,Issue-Question Resolution-Answered,"<!--\r\n\r\nFor Windows PowerShell 5.1 issues,..."
2,11417537835,IssuesEvent,2020-02-03 00:00:13,thadiun/hello-world,https://api.github.com/repos/thadiun/hello-world,closed,[Test] testing needs-triage (no delay),triage,
3,11417538003,IssuesEvent,2020-02-03 00:00:16,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,$ENV in using module,Area-Language Issue-Question Resolution-Answered,# Summary of the new feature/enhancement\r\n\r...
4,11417538116,IssuesEvent,2020-02-03 00:00:18,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Measure-Command does not measure its self,Area-Cmdlets-Utility Issue-Question Resolution...,## The Issue\r\nI'm pretty sure this was just ...
...,...,...,...,...,...,...,...,...,...
1962483,8109961054,IssuesEvent,2018-08-14 09:21:15,urbit/arvo,https://api.github.com/repos/urbit/arvo,opened,%bad-text trips up hall JSON conversion,:hall / :talk cause known marks web interface,"Haven't tested this in detail yet, but pretty ..."
1962484,8109961116,IssuesEvent,2018-08-14 09:21:16,highcharts/highcharts-react,https://api.github.com/repos/highcharts/highch...,closed,HighMaps mapBubble type,pending reply,"Hello,\r\nIt is possible to create also HighMa..."
1962485,8109981673,IssuesEvent,2018-08-14 09:25:07,Loriowar/comindivion,https://api.github.com/repos/Loriowar/comindivion,opened,Add a validation on a belonging of a predicate...,bug data manipulation,Now user can change html content of the intera...
1962486,8109983220,IssuesEvent,2018-08-14 09:25:24,TiiQu-Network/TQ-test-page,https://api.github.com/repos/TiiQu-Network/TQ-...,closed,Error on submit with no values,bug,TypeError: Too few arguments in function sum (...


In [13]:
appended_data_no_debt.labels.value_counts().to_frame()[:50]

Unnamed: 0_level_0,count
labels,Unnamed: 1_level_1
enhancement,240489
bug,237012
security vulnerability,52497
question,34359
feature,20451
documentation,17632
Bug,12098
help wanted,10727
wontfix,7180
greenkeeper,6767


In [14]:
# Convert to csv
not_hp_no_debt_file_name = f"csv/{file_name}_no_td1.csv"
appended_data_no_debt.to_csv(not_hp_no_debt_file_name)

## Test if file is not corrupted

In [15]:
debt = pd.read_csv(debt_file_name, index_col=0)
debt

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,1.141755e+10,IssuesEvent,2020-02-03 00:05:55,voces/fixus,https://api.github.com/repos/voces/fixus,closed,Refactor into semantic TypeScript,techdebt,"The code is ported to TypeScript, but it does ..."
1,1.141803e+10,IssuesEvent,2020-02-03 02:43:22,BeeStation/NSV13,https://api.github.com/repos/BeeStation/NSV13,opened,Snowflake engine depends on same area to conne...,National Debt,this is dumb. why do we do this? There's only ...
2,1.141881e+10,IssuesEvent,2020-02-03 06:03:33,skbkontur/retail-ui,https://api.github.com/repos/skbkontur/retail-ui,closed,[retail-ui] Перевести сборку библиотеки на bab...,in progress technical debt,- [ ] TS → Babel@7 #1129\r\n - [ ] Remove all...
3,1.141946e+10,IssuesEvent,2020-02-03 08:02:43,skbkontur/retail-ui,https://api.github.com/repos/skbkontur/retail-ui,closed,[retail-ui] warnings в компонентах,minor technical debt,Компоненты генерят много сообщений в консоль -...
4,1.141974e+10,IssuesEvent,2020-02-03 08:42:02,monarc-project/MonarcAppFO,https://api.github.com/repos/monarc-project/Mo...,opened,Improve the import speed of analyses and insta...,Technical debt important-for-v3,Currently the medium size analyses import can ...
...,...,...,...,...,...,...,...,...,...
3715,8.109276e+09,IssuesEvent,2018-08-14 06:53:54,spring-cloud/spring-cloud-dataflow,https://api.github.com/repos/spring-cloud/spri...,closed,Move stream aggregate state calculation logic ...,in pr technical-debt,"Currently, the stream aggregate state calculat..."
3716,8.109490e+09,IssuesEvent,2018-08-14 07:45:55,tsoding/ray-tracer,https://api.github.com/repos/tsoding/ray-tracer,closed,CI is broken,techdebt,Caused by #43
3717,8.109490e+09,IssuesEvent,2018-08-14 07:45:55,tsoding/ray-tracer,https://api.github.com/repos/tsoding/ray-tracer,closed,gcc is not supported on Travis CI,techdebt,introduced in #20
3718,8.109509e+09,IssuesEvent,2018-08-14 07:49:59,tsoding/ray-tracer,https://api.github.com/repos/tsoding/ray-tracer,closed,LICENSE,techdebt,


In [16]:
not_hp_no_debt = pd.read_csv(not_hp_no_debt_file_name, index_col=0)
not_hp_no_debt

  not_hp_no_debt = pd.read_csv(not_hp_no_debt_file_name, index_col=0)


Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body
0,1.141754e+10,IssuesEvent,2020-02-03 00:00:07,automationbs/testbugreporting,https://api.github.com/repos/automationbs/test...,opened,Default title,bug,default description\n\n|Property | Value|\n|--...
1,1.141754e+10,IssuesEvent,2020-02-03 00:00:12,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Show alias,Issue-Question Resolution-Answered,"<!--\r\n\r\nFor Windows PowerShell 5.1 issues,..."
2,1.141754e+10,IssuesEvent,2020-02-03 00:00:13,thadiun/hello-world,https://api.github.com/repos/thadiun/hello-world,closed,[Test] testing needs-triage (no delay),triage,
3,1.141754e+10,IssuesEvent,2020-02-03 00:00:16,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,$ENV in using module,Area-Language Issue-Question Resolution-Answered,# Summary of the new feature/enhancement\r\n\r...
4,1.141754e+10,IssuesEvent,2020-02-03 00:00:18,PowerShell/PowerShell,https://api.github.com/repos/PowerShell/PowerS...,closed,Measure-Command does not measure its self,Area-Cmdlets-Utility Issue-Question Resolution...,## The Issue\r\nI'm pretty sure this was just ...
...,...,...,...,...,...,...,...,...,...
1962483,8.109961e+09,IssuesEvent,2018-08-14 09:21:15,urbit/arvo,https://api.github.com/repos/urbit/arvo,opened,%bad-text trips up hall JSON conversion,:hall / :talk cause known marks web interface,"Haven't tested this in detail yet, but pretty ..."
1962484,8.109961e+09,IssuesEvent,2018-08-14 09:21:16,highcharts/highcharts-react,https://api.github.com/repos/highcharts/highch...,closed,HighMaps mapBubble type,pending reply,"Hello,\r\nIt is possible to create also HighMa..."
1962485,8.109982e+09,IssuesEvent,2018-08-14 09:25:07,Loriowar/comindivion,https://api.github.com/repos/Loriowar/comindivion,opened,Add a validation on a belonging of a predicate...,bug data manipulation,Now user can change html content of the intera...
1962486,8.109983e+09,IssuesEvent,2018-08-14 09:25:24,TiiQu-Network/TQ-test-page,https://api.github.com/repos/TiiQu-Network/TQ-...,closed,Error on submit with no values,bug,TypeError: Too few arguments in function sum (...


In [17]:
not_hp_no_debt.labels.value_counts().to_frame()[:50]

Unnamed: 0_level_0,count
labels,Unnamed: 1_level_1
stale-branch,6827422
stale branch 🗑️,348533
enhancement,240489
bug,237012
security vulnerability,52497
question,34359
feature,20451
documentation,17632
Bug,12098
help wanted,10727
