# Data cleaning of test datasets
The goal is a dataset in which the class label is a numerical value and the body is clean text.

The cleaning removes:
* duplicates
* NaN entries
* non-English text
* URLs and HTML

It also:
* makes the text lowercase
* combines title and body
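
The shared `preprocess_text` helper imported below is not shown in this notebook. As a rough illustration of the steps listed above, here is a minimal sketch. Everything in it is an assumption: the name `preprocess_text_sketch`, the regular expressions, and especially `looks_english`, which is only a crude ASCII-letter heuristic standing in for real language detection.

```python
import html
import math
import re

# Hypothetical patterns; the real helper may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def looks_english(text: str, threshold: float = 0.9) -> bool:
    """Crude stand-in for language detection: share of ASCII letters among all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= threshold


def preprocess_text_sketch(text: str):
    """Return cleaned lowercase text, or NaN when nothing usable remains."""
    text = html.unescape(text)      # decode entities such as &amp;
    text = URL_RE.sub(" ", text)    # strip URLs
    text = TAG_RE.sub(" ", text)    # strip HTML tags
    if not looks_english(text):
        return math.nan             # dropped later with dropna()
    text = text.lower()
    text = NON_ALPHA_RE.sub("", text)  # drop digits and punctuation
    return " ".join(text.split())      # collapse whitespace


cleaned = preprocess_text_sketch('On windows, Sourcetree.exe will start "git.exe"')
```

Returning NaN rather than an empty string lets a later `dropna()` remove rows that have no usable text.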

In [1]:
import pandas as pd
import sys
sys.path.append("../../../scripts_shared/")
from preprocess_text import preprocess_text


In [2]:
file_name = "test_sets_projects.csv"
df = pd.read_csv(file_name)
df

Unnamed: 0,priority,description,project,labels,issuetype,collection
0,Low,some errors show up as shown in the screenshot...,Sourcetree for Windows,[],Bug,Jira
1,Low,I have been using Sourcetree 3.4.4. We use cu...,Sourcetree for Windows,[],Bug,Jira
2,Low,After installing SourceTree for Windows 10 64b...,Sourcetree for Windows,[],Bug,Jira
3,Low,"On windows, Sourcetree.exe will start ""git.exe...",Sourcetree for Windows,[],Bug,Jira
4,Low,"Hello,\r\n\r\nSourceTree 3.4.7.\r\n\r\nOS: Win...",Sourcetree for Windows,[],Bug,Jira
...,...,...,...,...,...,...
386200,1 - Blocker,I am attempting to to follow the guide found h...,Artifactory Binary Repository,[],Bug,JFrog
386201,4 - Normal,"In binarystore.xml, maxCacheSize is in bytes b...",Artifactory Binary Repository,[],Bug,JFrog
386202,4 - Normal,{color:#000000}We are using an artifact(folder...,Artifactory Binary Repository,[],New Feature,JFrog
386203,4 - Normal,Remote repositories created with the repo name...,Artifactory Binary Repository,[],Bug,JFrog


In [3]:
# Count per priority
df['priority'].value_counts()

priority
Major - P3                109573
Low                        76547
P2: Important              46926
Medium                     43672
P3: Somewhat important     28075
P1: Critical               20750
Minor - P4                 12964
4 - Normal                 11231
High                       10283
P4: Low                     7375
Highest                     5022
Critical - P2               3437
3 - High                    2509
Trivial - P5                2124
P0: Blocker                 2050
2 - Critical                1145
Blocker - P1                1044
P5: Not important            935
1 - Blocker                  498
5 - Minor                     37
6 - Trivial                    8
Name: count, dtype: int64
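
The counts above mix several labelling schemes (`P0`..`P5`, `1`..`6`, and plain words such as `Low`). Since the stated goal is a numerical class label, one possible first step is to pull out the digit where a label has one. This is only a sketch: `priority_digit` is a hypothetical helper, the digits are not comparable across schemes, and word-only labels would still need an explicit mapping that this notebook does not define.

```python
import re


def priority_digit(label: str):
    """Return the first digit in a priority label as an int, or None if there is none."""
    match = re.search(r"\d", label)
    return int(match.group()) if match else None


examples = ["Major - P3", "P0: Blocker", "1 - Blocker", "Low"]
digits = {label: priority_digit(label) for label in examples}
```

On the real data this could be applied with `df["priority"].map(priority_digit)`.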

In [4]:
df['issuetype'].value_counts().to_frame()[:50]

issuetype,count
Bug,264247
Task,45736
Improvement,31603
New Feature,11622
Suggestion,9543
Sub-task,6846
Technical task,3256
Build Failure,2491
Support Request,2415
User Story,2318


In [5]:
# Unique projects
df['project'].nunique()


88

In [6]:
# Unique collections
df['collection'].nunique()

4

In [7]:
# Count per collection
df['collection'].value_counts().to_frame()

collection,count
Jira,135524
MongoDB,129142
Qt,106111
JFrog,15428


In [8]:
# Drop duplicates by the content of the description, then drop rows with NaN
# and rebuild the index. Chaining without inplace=True returns a new DataFrame
# and avoids pandas' SettingWithCopyWarning.
df = (
    df.drop_duplicates(subset=['description'], keep='last')
      .dropna()
      .reset_index(drop=True)
)
df["priority"].value_counts()

priority
Major - P3                87427
Low                       73555
P2: Important             44433
Medium                    41241
P3: Somewhat important    27518
P1: Critical              18183
Minor - P4                12089
High                       9841
4 - Normal                 9171
P4: Low                    7203
Highest                    4750
Critical - P2              3051
3 - High                   2357
Trivial - P5               1979
P0: Blocker                1972
2 - Critical               1041
Blocker - P1                928
P5: Not important           910
1 - Blocker                 470
5 - Minor                    32
6 - Trivial                   8
Name: count, dtype: int64
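
The effect of the deduplication step can be checked on a small example. The sketch below uses an invented `toy` frame (not taken from the dataset) and shows that `keep='last'` retains the later of two identical descriptions, and that the row with a missing description is dropped.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "description": ["crash on start", "crash on start", None, "slow clone"],
    "priority": ["Low", "High", "Low", "Medium"],
})

clean = (
    toy.drop_duplicates(subset=["description"], keep="last")  # keep the later duplicate
       .dropna()                                              # drop the row with no description
       .reset_index(drop=True)                                # renumber rows from 0
)
```

Note that the surviving duplicate carries the priority of the last occurrence, so the order of rows in the input affects which label is kept.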

In [9]:
print(df["description"][0])

some errors show up as shown in the screenshot and when I try to clone the repo the sourcetree software crashes and closes.

!image-2022-01-04-12-38-58-699.png!


In [10]:
# Convert to string; assign() returns a new DataFrame, so no SettingWithCopyWarning is raised.
df = df.assign(text_str=df['description'].astype(str))


In [11]:
# Clean the data; assign() returns a new DataFrame, so no SettingWithCopyWarning is raised.
df = df.assign(text_clean=df["text_str"].map(preprocess_text))


In [12]:
# Rows with NaN
df[df.isna().any(axis=1)]

Unnamed: 0,priority,description,project,labels,issuetype,collection,text_str,text_clean
17,Low,[https://pasteboard.co/cngS3Gi4a3LS.png]\r\n\r...,Sourcetree for Windows,[],Bug,Jira,[https://pasteboard.co/cngS3Gi4a3LS.png]\r\n\r...,
18,Low,!image-2021-11-22-10-34-33-096.png!,Sourcetree for Windows,[],Bug,Jira,!image-2021-11-22-10-34-33-096.png!,
51,Low,\r\n\r\n!image-2021-08-12-21-48-12-368.png!\r...,Sourcetree for Windows,[],Bug,Jira,\r\n\r\n!image-2021-08-12-21-48-12-368.png!\r...,
52,Low,!image-2021-08-12-22-00-24-074.png!\r\n\r\n!im...,Sourcetree for Windows,[],Bug,Jira,!image-2021-08-12-22-00-24-074.png!\r\n\r\n!im...,
60,Low,Kinda annoying tbh:\r\n\r\n!https://i.postimg....,Sourcetree for Windows,[],Bug,Jira,Kinda annoying tbh:\r\n\r\n!https://i.postimg....,
...,...,...,...,...,...,...,...,...
347866,4 - Normal,Add all permissions actions for:\r\nGET ‘ui/bu...,Artifactory Binary Repository,[],Bug,JFrog,Add all permissions actions for:\r\nGET ‘ui/bu...,
347919,4 - Normal,[https://docs.google.com/document/d/1H3D8GgXMW...,Artifactory Binary Repository,[],New Feature,JFrog,[https://docs.google.com/document/d/1H3D8GgXMW...,
347923,4 - Normal,Ability to configure fine grain license polici...,Artifactory Binary Repository,[],Task,JFrog,Ability to configure fine grain license polici...,
348032,4 - Normal,RTFACT-18583 : not solved.\r\n\r\n[https://git...,Artifactory Binary Repository,"['artifactory', 'docker', 'repository']",Bug,JFrog,RTFACT-18583 : not solved.\r\n\r\n[https://git...,


In [13]:
# Drop NaN again here, since the cleaning function returns NaN for non-English text.
df = df.dropna().reset_index(drop=True)

df


Unnamed: 0,priority,description,project,labels,issuetype,collection,text_str,text_clean
0,Low,some errors show up as shown in the screenshot...,Sourcetree for Windows,[],Bug,Jira,some errors show up as shown in the screenshot...,some errors show up as shown in the screenshot...
1,Low,I have been using Sourcetree 3.4.4. We use cu...,Sourcetree for Windows,[],Bug,Jira,I have been using Sourcetree 3.4.4. We use cu...,i have been using sourcetree we use custom act...
2,Low,After installing SourceTree for Windows 10 64b...,Sourcetree for Windows,[],Bug,Jira,After installing SourceTree for Windows 10 64b...,after installing sourcetree for windows machin...
3,Low,"On windows, Sourcetree.exe will start ""git.exe...",Sourcetree for Windows,[],Bug,Jira,"On windows, Sourcetree.exe will start ""git.exe...",on windows sourcetreeexe will start gitexe fsm...
4,Low,"Hello,\r\n\r\nSourceTree 3.4.7.\r\n\r\nOS: Win...",Sourcetree for Windows,[],Bug,Jira,"Hello,\r\n\r\nSourceTree 3.4.7.\r\n\r\nOS: Win...",hello sourcetree os windows pro problem create...
...,...,...,...,...,...,...,...,...
343214,4 - Normal,"Hi,\r\n\r\nWhen using REST API to create repos...",Artifactory Binary Repository,"['PB_Done', 'QF', 'QF-P2', 'S-P1']",Bug,JFrog,"Hi,\r\n\r\nWhen using REST API to create repos...",hi when using rest api to create repository if...
343215,1 - Blocker,I am attempting to to follow the guide found h...,Artifactory Binary Repository,[],Bug,JFrog,I am attempting to to follow the guide found h...,i am attempting to to follow the guide found h...
343216,4 - Normal,"In binarystore.xml, maxCacheSize is in bytes b...",Artifactory Binary Repository,[],Bug,JFrog,"In binarystore.xml, maxCacheSize is in bytes b...",in binarystorexml maxcachesize is in bytes by ...
343217,4 - Normal,{color:#000000}We are using an artifact(folder...,Artifactory Binary Repository,[],New Feature,JFrog,{color:#000000}We are using an artifact(folder...,are using an artifactfolder promotion process ...


In [14]:
# Rows with NaN
df[df.isna().any(axis=1)]

Unnamed: 0,priority,description,project,labels,issuetype,collection,text_str,text_clean


In [15]:
null_rows = df[df['text_clean'].isnull()]
null_rows

Unnamed: 0,priority,description,project,labels,issuetype,collection,text_str,text_clean


In [16]:

# Save the full DataFrame to CSV
df.to_csv("jira_clean_testset_with_all_cols.csv", index=False)

In [17]:
import os
# Save each collection to a separate CSV
collections = ['Jira', 'MongoDB', 'JFrog', 'Qt']

# Create the output directory once, before the loop
os.makedirs('test_datasets', exist_ok=True)

for dataset in collections:
    try:
        # Rows that belong to this collection
        df_collection = df[df['collection'] == dataset]
        # Save to CSV
        df_collection.to_csv(f'test_datasets/clean_{dataset}.csv', index=False)
        print(f"Saved {dataset}.csv")
    except Exception as e:
        print(f"An error occurred for collection {dataset}: {str(e)}")

Saved Jira.csv
Saved MongoDB.csv
Saved JFrog.csv
Saved Qt.csv


In [18]:
# Read each CSV back to check that the file was saved correctly.
# A separate name is used so the full DataFrame in df is not overwritten.
for dataset in collections:
    try:
        df_check = pd.read_csv(f'test_datasets/clean_{dataset}.csv')
        print(f"Read {dataset}.csv")
    except Exception as e:
        print(f"An error occurred while reading {dataset}.csv: {str(e)}")

Read Jira.csv
Read MongoDB.csv
Read JFrog.csv
Read Qt.csv
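
Reading a file back only confirms that it parses. A slightly stronger check compares the reloaded columns and row count against what was written. The following is a minimal sketch on an invented `toy` frame, written to a temporary directory so it does not touch the real `test_datasets` folder. The data and variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "collection": ["Jira", "Qt", "Jira"],
    "text_clean": ["sourcetree crashes on clone", "widget not painted", "cannot push branch"],
})

row_counts = {}
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "test_datasets"
    out_dir.mkdir()
    for name, part in toy.groupby("collection"):
        path = out_dir / f"clean_{name}.csv"
        part.to_csv(path, index=False)
        reloaded = pd.read_csv(path)
        # The reloaded file should have the same columns and row count as what was written.
        assert list(reloaded.columns) == list(part.columns)
        assert len(reloaded) == len(part)
        row_counts[name] = len(reloaded)
```

Grouping with `groupby("collection")` also removes the need to keep a hand-written list of collection names in sync with the data.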
