# Data cleaning
The goal is a dataset in which each priority class is encoded as a numerical label and the issue text is cleaned.

This notebook will:
* remove duplicates
* remove NaN entries
* remove non-English text
* remove URLs and HTML
* convert the text to lowercase
* combine title and body
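
The `preprocess_text` helper imported below is not shown in this notebook, so the following is only a minimal sketch of the kind of cleaning listed above. The function name `clean_text_sketch` and its regular expressions are assumptions for illustration, not the project's implementation, and the non-English filtering step is left out because it needs a language detector.

```python
import html
import re

# Hypothetical patterns; the real preprocess_text may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def clean_text_sketch(title, body):
    """Illustrative stand-in for preprocess_text (assumption, not the original)."""
    text = f"{title} {body}"           # combine title and body
    text = html.unescape(text)         # decode entities such as &amp;
    text = URL_RE.sub(" ", text)       # remove URLs
    text = TAG_RE.sub(" ", text)       # remove HTML tags
    text = text.lower()                # make it lowercase
    text = NON_WORD_RE.sub("", text)   # drop punctuation and other symbols
    return " ".join(text.split())      # collapse whitespace


print(clean_text_sketch("Auto-log out", "See <b>docs</b> at https://example.com/page now!"))
# → autolog out see docs at now
```

URLs are removed before tags and punctuation so that the pieces of a link do not survive as stray tokens.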

In [1]:
import pandas as pd
import sys
import numpy as np
sys.path.append("../../../scripts_shared/")
from preprocess_text import preprocess_text


In [2]:
# Read CSV into a dataframe
high_priority = pd.read_csv("csv/high/high_priority_no_td.csv", index_col=0)
medium_priority = pd.read_csv("csv/medium/medium_priority_no_td.csv", index_col=0)
low_priority = pd.read_csv("csv/low/low_priority_no_td.csv", index_col=0)

In [3]:
# Count how often each label occurs (top 50)
high_priority.labels.value_counts().to_frame()[:50]

labels,count
bug high priority,10435
high priority,10360
High Priority,7484
enhancement high priority,5486
Priority: High,4751
priority.High,3694
priority.High type.Task,3685
priority.high type.task,3209
priority.High type.Story,2979
priority.high,2749


In [4]:
# Give each priority a numerical label ('label encoding').
# This makes it easier for machine learning models to work with categorical data.
high_priority["labels"] = 0
high_priority["class"] = "high_priority"
medium_priority["labels"] = 1
medium_priority["class"] = "medium_priority"
low_priority["labels"] = 2
low_priority["class"] = "low_priority"
high_priority.head()

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body,class
0,11417540000.0,IssuesEvent,2020-02-03 00:00:44,unitystation/unitystation,https://api.github.com/repos/unitystation/unit...,closed,Client breaking NRE when using edit field on C...,0,### Bug:\r\n\r\nIf you use the edit field of t...,high_priority
1,11417540000.0,IssuesEvent,2020-02-03 00:01:26,zowe/sample-spring-boot-api-service,https://api.github.com/repos/zowe/sample-sprin...,closed,The SDK provides a separate Java (no-Spring) l...,0,- The commons-spring library is split into:\r\...,high_priority
2,11417550000.0,IssuesEvent,2020-02-03 00:02:58,openmsupply/mobile,https://api.github.com/repos/openmsupply/mobile,closed,Auto-log out after some time frame,0,## Is your feature request related to a proble...,high_priority
3,11417550000.0,IssuesEvent,2020-02-03 00:04:18,UltimateCodeMonkeys/CodeMonkeysMVVM,https://api.github.com/repos/UltimateCodeMonke...,opened,Migrate: CodeMonkeys ViewModelNavigationServic...,0,Migrate the Xamarin.Forms navigation service i...,high_priority
4,11417560000.0,IssuesEvent,2020-02-03 00:08:03,wordpress-mobile/WordPress-Android,https://api.github.com/repos/wordpress-mobile/...,closed,IA Reader filter bottom sheet: manage untitled...,0,In the filter bottom sheet we introduced in th...,high_priority
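
The cell above overwrites the raw GitHub label strings in `labels` with one integer per priority. The same encoding can be expressed as an explicit mapping from class name to integer, which keeps the codes in one place. This is a minimal sketch on toy data; the name `CLASS_TO_LABEL` is an assumption introduced here.

```python
import pandas as pd

# Hypothetical mapping that mirrors the integer codes assigned in the cell above.
CLASS_TO_LABEL = {"high_priority": 0, "medium_priority": 1, "low_priority": 2}

toy = pd.DataFrame({"class": ["low_priority", "high_priority", "medium_priority"]})
toy["labels"] = toy["class"].map(CLASS_TO_LABEL)  # look up each class name
print(toy["labels"].tolist())
# → [2, 0, 1]
```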


In [5]:
priority = pd.concat([high_priority, medium_priority, low_priority] , ignore_index = True)

In [6]:
priority[priority["repo"] == "python/mypy"]

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body,class


In [7]:
# Remove mypy from the dataset
priority = priority[priority["repo"] != "python/mypy"]
priority

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body,class
0,1.141754e+10,IssuesEvent,2020-02-03 00:00:44,unitystation/unitystation,https://api.github.com/repos/unitystation/unit...,closed,Client breaking NRE when using edit field on C...,0,### Bug:\r\n\r\nIf you use the edit field of t...,high_priority
1,1.141754e+10,IssuesEvent,2020-02-03 00:01:26,zowe/sample-spring-boot-api-service,https://api.github.com/repos/zowe/sample-sprin...,closed,The SDK provides a separate Java (no-Spring) l...,0,- The commons-spring library is split into:\r\...,high_priority
2,1.141755e+10,IssuesEvent,2020-02-03 00:02:58,openmsupply/mobile,https://api.github.com/repos/openmsupply/mobile,closed,Auto-log out after some time frame,0,## Is your feature request related to a proble...,high_priority
3,1.141755e+10,IssuesEvent,2020-02-03 00:04:18,UltimateCodeMonkeys/CodeMonkeysMVVM,https://api.github.com/repos/UltimateCodeMonke...,opened,Migrate: CodeMonkeys ViewModelNavigationServic...,0,Migrate the Xamarin.Forms navigation service i...,high_priority
4,1.141756e+10,IssuesEvent,2020-02-03 00:08:03,wordpress-mobile/WordPress-Android,https://api.github.com/repos/wordpress-mobile/...,closed,IA Reader filter bottom sheet: manage untitled...,0,In the filter bottom sheet we introduced in th...,high_priority
...,...,...,...,...,...,...,...,...,...,...
821952,2.060586e+10,IssuesEvent,2022-03-06 23:45:23,bounswe/bounswe2022group1,https://api.github.com/repos/bounswe/bounswe20...,closed,Editing Navigator of Wiki,2,The navigator of the wiki should be edited and...,low_priority
821953,7.334740e+09,IssuesEvent,2018-03-06 00:10:40,hoodedice/notes,https://api.github.com/repos/hoodedice/notes,opened,Check if passwords match,2,JavaScript code (or see if possible without) t...,low_priority
821954,7.334876e+09,IssuesEvent,2018-03-06 00:53:49,zephyrproject-rtos/zephyr,https://api.github.com/repos/zephyrproject-rto...,closed,Add doc to samples/bluetooth/mesh & samples/bl...,2,We should document what exactly the sample is ...,low_priority
821955,7.335047e+09,IssuesEvent,2018-03-06 01:48:07,uwnrg/minotaur-cpp,https://api.github.com/repos/uwnrg/minotaur-cpp,closed,Fix the photo in about dialog xd,2,,low_priority


In [8]:
# Drop duplicates by title, then drop rows with missing values
priority = priority.drop_duplicates(subset=['title'], keep='last')
priority = priority.dropna().reset_index(drop=True)
priority["class"].value_counts()


class
high_priority      236610
medium_priority    160163
low_priority       143121
Name: count, dtype: int64

In [9]:
# Split the dataframe by class
high_priority = priority.loc[priority['class'] == 'high_priority'].copy()
medium_priority = priority.loc[priority['class'] == 'medium_priority'].copy()
low_priority = priority.loc[priority['class'] == 'low_priority'].copy()

In [10]:
high_label_counts = high_priority["class"].value_counts()
medium_label_counts = medium_priority["class"].value_counts()
low_label_counts = low_priority["class"].value_counts()
hp_count = high_label_counts["high_priority"]
print(hp_count)
mp_count = medium_label_counts["medium_priority"]
print(mp_count)
lp_count = low_label_counts["low_priority"]
lp_count

236610
160163


143121

In [11]:
medium_priority = medium_priority.sample(frac=lp_count/mp_count, random_state=42)
med_and_low_priority = pd.concat([medium_priority, low_priority] , ignore_index = True)
med_and_low_priority["class"].value_counts()

class
medium_priority    143121
low_priority       143121
Name: count, dtype: int64

In [12]:
med_and_low_priority["labels"] = 1
med_and_low_priority["class"] = "medium_and_low_priority"
med_and_low_priority["class"].value_counts()

class
medium_and_low_priority    286242
Name: count, dtype: int64

In [13]:
med_and_low_label_counts = med_and_low_priority["class"].value_counts()
ml_count = med_and_low_label_counts["medium_and_low_priority"]
ml_count

286242

In [14]:

med_and_low_priority = med_and_low_priority.sample(frac=hp_count/ml_count, random_state=42)
all_priority = pd.concat([high_priority, med_and_low_priority] , ignore_index = True)
all_priority["class"].value_counts()

class
high_priority              236610
medium_and_low_priority    236610
Name: count, dtype: int64
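
The balancing in cells 11 to 14 downsamples the larger group with `sample(frac=...)`. A minimal sketch of the same idea on toy data, using `sample(n=...)` so the target size is stated directly; the helper name `downsample_to` is an assumption and is not part of this notebook.

```python
import pandas as pd


def downsample_to(df, n, seed=42):
    """Return n randomly chosen rows of df (all rows if df has n rows or fewer)."""
    if len(df) <= n:
        return df.copy()
    return df.sample(n=n, random_state=seed)


# Toy data: 3 rows of one class, 10 rows of the other.
toy = pd.DataFrame({
    "class": ["high"] * 3 + ["other"] * 10,
    "value": range(13),
})
high = toy[toy["class"] == "high"]
other = toy[toy["class"] == "other"]

# Shrink the larger group to the size of the smaller one, then recombine.
balanced = pd.concat([high, downsample_to(other, len(high))], ignore_index=True)
print(balanced["class"].value_counts().to_dict())
```

Passing `n` avoids relying on how a fractional row count is rounded.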

In [15]:
print(all_priority["title"][0])
print(all_priority["body"][0])

Auto-log out after some time frame
## Is your feature request related to a problem? Please describe.

From @craigdrown 

IC Requirement: Automatically log a user out after a time frame of X for security reasons. Data should not be lost which was being worked on.

## Describe the solution you'd like

Log the user out after X time

## Implementation

Could use store custom data `logoutTimeoutPeriodInSecondsOrMinutes` to set the period..

- Use [AppState](https://facebook.github.io/react-native/docs/appstate#addeventlistener) to hook into the app going into the background - log the user out if so. Android would go to sleep after X amount of time, putting the app into the background (I THINK, UNTESTED)
- Use a package like https://www.npmjs.com/package/redux-idle-monitor to track activity, have a schedule that is reset each time an event happens.
- Use the above package but combine it with the current user authentication scheduler (tries to re-auth against the server every X minutes, or th

In [16]:
# Combine title and body into a new column named text
all_priority["text"] = all_priority["title"] + all_priority["body"]
all_priority.tail()

Unnamed: 0,id,type,created_at,repo,repo_url,action,title,labels,body,class,text
473215,5903893000.0,IssuesEvent,2017-05-19 08:18:32,TEAMMATES/teammates,https://api.github.com/repos/TEAMMATES/teammates,opened,Include MOTD feature in CI tests,1,Current:\r\n* Message of the Day feature is om...,medium_and_low_priority,Include MOTD feature in CI testsCurrent:\r\n* ...
473216,8817135000.0,IssuesEvent,2018-12-30 19:46:09,toobigtoignore/issf,https://api.github.com/repos/toobigtoignore/issf,closed,Checkbox to disable tweeting displayed for non...,1,When contributing Case Study or Capacity Devel...,medium_and_low_priority,Checkbox to disable tweeting displayed for non...
473217,12952080000.0,IssuesEvent,2020-07-19 19:10:37,banzaicloud/logging-operator,https://api.github.com/repos/banzaicloud/loggi...,closed,Add additional labels to service monitors,1,**Is your feature request related to a problem...,medium_and_low_priority,Add additional labels to service monitors**Is ...
473218,20968990000.0,IssuesEvent,2022-03-28 09:33:41,Igalia/wolvic,https://api.github.com/repos/Igalia/wolvic,closed,No mic button in the Keyboard for HVR build,1,## Configuration\r\n\r\n<!--- State the versio...,medium_and_low_priority,No mic button in the Keyboard for HVR build## ...
473219,4725469000.0,IssuesEvent,2016-10-18 06:43:30,pmem/issues,https://api.github.com/repos/pmem/issues,closed,rpmem: errno is not set when passing invalid l...,1,1) create local pool of **X** size via malloc(...,medium_and_low_priority,rpmem: errno is not set when passing invalid l...
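
Because the two columns are joined with `+`, the last word of the title runs directly into the first word of the body (for example `testsCurrent` in the first row above). One possible alternative, shown here only as a sketch on toy data, is to join with a separator using `Series.str.cat`. The `fillna("")` call is a safeguard for missing bodies; in this notebook such rows have already been dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["Auto-log out after some time frame", "Fix the photo"],
    "body": ["## Is your feature request related to a problem?", None],
})

# str.cat inserts the separator between title and body;
# str.strip removes the trailing space left when the body is empty.
toy["text"] = toy["title"].str.cat(toy["body"].fillna(""), sep=" ").str.strip()
print(toy["text"].tolist())
```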


In [17]:
all_priority["text"][0]

"Auto-log out after some time frame## Is your feature request related to a problem? Please describe.\r\n\r\nFrom @craigdrown \r\n\r\nIC Requirement: Automatically log a user out after a time frame of X for security reasons. Data should not be lost which was being worked on.\r\n\r\n## Describe the solution you'd like\r\n\r\nLog the user out after X time\r\n\r\n## Implementation\r\n\r\nCould use store custom data `logoutTimeoutPeriodInSecondsOrMinutes` to set the period..\r\n\r\n- Use [AppState](https://facebook.github.io/react-native/docs/appstate#addeventlistener) to hook into the app going into the background - log the user out if so. Android would go to sleep after X amount of time, putting the app into the background (I THINK, UNTESTED)\r\n- Use a package like https://www.npmjs.com/package/redux-idle-monitor to track activity, have a schedule that is reset each time an event happens.\r\n- Use the above package but combine it with the current user authentication scheduler (tries to r

In [18]:
# Make a new dataframe with only text, label and class cols.
all_priority_subset = all_priority[["text" , "labels" , "class"]]
all_priority_subset

Unnamed: 0,text,labels,class
0,Auto-log out after some time frame## Is your f...,0,high_priority
1,Image Picker for SourceImplement an Android Im...,0,high_priority
2,Fix Video Page ListItem Hovering Behaviour- Wh...,0,high_priority
3,Escape shuttle reaches ludicrous speed## Descr...,0,high_priority
4,constructing virtual router structure\r\n,0,high_priority
...,...,...,...
473215,Include MOTD feature in CI testsCurrent:\r\n* ...,1,medium_and_low_priority
473216,Checkbox to disable tweeting displayed for non...,1,medium_and_low_priority
473217,Add additional labels to service monitors**Is ...,1,medium_and_low_priority
473218,No mic button in the Keyboard for HVR build## ...,1,medium_and_low_priority


In [19]:
# Convert to string
all_priority_subset = all_priority_subset.assign(text_str=all_priority_subset["text"].astype(str))


In [20]:
all_priority_subset

Unnamed: 0,text,labels,class,text_str
0,Auto-log out after some time frame## Is your f...,0,high_priority,Auto-log out after some time frame## Is your f...
1,Image Picker for SourceImplement an Android Im...,0,high_priority,Image Picker for SourceImplement an Android Im...
2,Fix Video Page ListItem Hovering Behaviour- Wh...,0,high_priority,Fix Video Page ListItem Hovering Behaviour- Wh...
3,Escape shuttle reaches ludicrous speed## Descr...,0,high_priority,Escape shuttle reaches ludicrous speed## Descr...
4,constructing virtual router structure\r\n,0,high_priority,constructing virtual router structure\r\n
...,...,...,...,...
473215,Include MOTD feature in CI testsCurrent:\r\n* ...,1,medium_and_low_priority,Include MOTD feature in CI testsCurrent:\r\n* ...
473216,Checkbox to disable tweeting displayed for non...,1,medium_and_low_priority,Checkbox to disable tweeting displayed for non...
473217,Add additional labels to service monitors**Is ...,1,medium_and_low_priority,Add additional labels to service monitors**Is ...
473218,No mic button in the Keyboard for HVR build## ...,1,medium_and_low_priority,No mic button in the Keyboard for HVR build## ...


In [21]:
# Clean the data.
all_priority_subset = all_priority_subset.assign(text_clean=all_priority_subset["text_str"].map(preprocess_text))


In [22]:
# Make a subset with text_clean and label
priority_label_text = all_priority_subset[["text_clean" , "labels", "class"]]
priority_label_text

Unnamed: 0,text_clean,labels,class
0,autolog out after some time frame is your feat...,0,high_priority
1,image picker for sourceimplement an android im...,0,high_priority
2,fix video page listitem hovering behaviour whe...,0,high_priority
3,escape shuttle reaches ludicrous speed descrip...,0,high_priority
4,,0,high_priority
...,...,...,...
473215,include motd feature in ci testscurrent messag...,1,medium_and_low_priority
473216,checkbox to disable tweeting displayed for non...,1,medium_and_low_priority
473217,add additional labels to service monitorsis yo...,1,medium_and_low_priority
473218,no mic button in the keyboard for hvr build co...,1,medium_and_low_priority


In [23]:
# Drop NaN rows here, since the cleaning function returns NaN for non-English text.
priority_label_text = priority_label_text.dropna().reset_index(drop=True)

priority_label_text


Unnamed: 0,text_clean,labels,class
0,autolog out after some time frame is your feat...,0,high_priority
1,image picker for sourceimplement an android im...,0,high_priority
2,fix video page listitem hovering behaviour whe...,0,high_priority
3,escape shuttle reaches ludicrous speed descrip...,0,high_priority
4,binder doesnt load notebooks outside of the un...,0,high_priority
...,...,...,...
441934,include motd feature in ci testscurrent messag...,1,medium_and_low_priority
441935,checkbox to disable tweeting displayed for non...,1,medium_and_low_priority
441936,add additional labels to service monitorsis yo...,1,medium_and_low_priority
441937,no mic button in the keyboard for hvr build co...,1,medium_and_low_priority


In [24]:
# Hold out a test set; the remaining rows become the training data
test_df = priority_label_text.sample(frac=0.05, random_state=1)  # Select 5% of the data
test_file_name = "csv/clean_test_high_vs_med_low_priority.csv"
test_df.to_csv(test_file_name, index=False)
priority_df = priority_label_text.drop(test_df.index)
priority_df.reset_index(drop=True, inplace=True)
priority_df

Unnamed: 0,text_clean,labels,class
0,autolog out after some time frame is your feat...,0,high_priority
1,image picker for sourceimplement an android im...,0,high_priority
2,fix video page listitem hovering behaviour whe...,0,high_priority
3,escape shuttle reaches ludicrous speed descrip...,0,high_priority
4,binder doesnt load notebooks outside of the un...,0,high_priority
...,...,...,...
419837,include motd feature in ci testscurrent messag...,1,medium_and_low_priority
419838,checkbox to disable tweeting displayed for non...,1,medium_and_low_priority
419839,add additional labels to service monitorsis yo...,1,medium_and_low_priority
419840,no mic button in the keyboard for hvr build co...,1,medium_and_low_priority
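
The hold-out pattern in the cell above (sample a fraction, then drop those index values from the original) can be checked on toy data. This is a minimal sketch; the names `test_part` and `train_part` are assumptions. Note that `sample` draws rows at random without stratifying by class, so the class ratio in the test set is only approximately preserved.

```python
import pandas as pd

toy = pd.DataFrame({
    "text_clean": [f"issue {i}" for i in range(100)],
    "labels": [i % 2 for i in range(100)],
})

test_part = toy.sample(frac=0.05, random_state=1)  # 5% of the rows
train_part = toy.drop(test_part.index)             # everything else

# The two parts must not share rows and must cover the full dataset.
overlap = set(test_part.index) & set(train_part.index)
print(len(test_part), len(train_part), len(overlap))
# → 5 95 0
```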


In [25]:
# Clean dataset with clean text and labels.
# 0 = high priority, 1 = not high priority
file_name = "csv/clean_high_vs_med_and_low_priority.csv"
priority_df.to_csv(file_name, index=False)

In [26]:
pri = pd.read_csv(file_name)
pri

Unnamed: 0,text_clean,labels,class
0,autolog out after some time frame is your feat...,0,high_priority
1,image picker for sourceimplement an android im...,0,high_priority
2,fix video page listitem hovering behaviour whe...,0,high_priority
3,escape shuttle reaches ludicrous speed descrip...,0,high_priority
4,binder doesnt load notebooks outside of the un...,0,high_priority
...,...,...,...
419837,include motd feature in ci testscurrent messag...,1,medium_and_low_priority
419838,checkbox to disable tweeting displayed for non...,1,medium_and_low_priority
419839,add additional labels to service monitorsis yo...,1,medium_and_low_priority
419840,no mic button in the keyboard for hvr build co...,1,medium_and_low_priority


In [27]:
test_df = pd.read_csv(test_file_name)
test_df

Unnamed: 0,text_clean,labels,class
0,block sizesadjust block sizes to fit the provi...,1,medium_and_low_priority
1,ui style and widgets broken in the test enviro...,0,high_priority
2,this dropdown list should not work in this scr...,0,high_priority
3,e2e docker use mariadb instead of mysqllike is...,1,medium_and_low_priority
4,crosscompile toolchain variant doesnt working ...,1,medium_and_low_priority
...,...,...,...
22092,error handling sync error handling mechanismmw...,0,high_priority
22093,limit maximum requests in a given timethis is ...,0,high_priority
22094,looking for a volunteer officiel facebook page...,1,medium_and_low_priority
22095,add tests for add todo command its related tes...,0,high_priority
