# Data cleaning
We want two columns: the class, with its labels as numerical values, and the body, as clean text.

This will remove:
* duplicates
* NaN entries
* non-English text
* URLs and HTML

It will also:
* make the text lowercase
* combine title and body
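
The cleaning itself is delegated to `preprocess_text` from the shared scripts, whose source is not shown in this notebook. As a rough illustration only, the steps listed above could be sketched as follows. The function name `sketch_preprocess_text`, the regular expressions, and the ASCII-ratio check used as a stand-in for language detection are all assumptions, not the actual implementation; the rule that drops tokens containing digits is a guess based on the cleaned outputs shown further down.

```python
import html
import math
import re

# All patterns below are illustrative assumptions, not the real preprocess_text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
DIGIT_TOKEN_RE = re.compile(r"\S*\d\S*")   # whole tokens containing a digit, e.g. "1.3.2.RELEASE"
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def sketch_preprocess_text(text, min_ascii_ratio=0.8):
    """Hypothetical stand-in: return cleaned lowercase text, or NaN if it does not look English."""
    text = html.unescape(str(text))
    text = URL_RE.sub(" ", text)           # remove URLs
    text = TAG_RE.sub(" ", text)           # remove HTML tags

    # Crude proxy for language detection: share of ASCII characters among letters.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return math.nan
    ascii_ratio = sum(ch.isascii() for ch in letters) / len(letters)
    if ascii_ratio < min_ascii_ratio:
        return math.nan

    text = text.lower()
    text = DIGIT_TOKEN_RE.sub(" ", text)   # drop version numbers and similar tokens
    text = NON_LETTER_RE.sub("", text)     # drop punctuation ("spring-xd" becomes "springxd")
    return " ".join(text.split())          # collapse whitespace


print(sketch_preprocess_text("We tried upgrading from Spring Boot 2.0.6 to Spring 2.1.0"))
# we tried upgrading from spring boot to spring
```

A real pipeline would more likely use a dedicated language-detection library than an ASCII ratio; the heuristic here only keeps the sketch free of extra dependencies.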

In [1]:
import pandas as pd
import sys
sys.path.append("../../../scripts_shared/")
from preprocess_text import preprocess_text


In [2]:
file_name = "all_priority_group_in_classes.csv"
df = pd.read_csv(file_name)
df

Unnamed: 0,priority,description,project,labels,issuetype,collection,class
0,Blocker,We tried upgrading from Spring Boot 2.0.6 to S...,Spring XD,[],Bug,Spring,Highest
1,Major,The jobs that appear under Executions section ...,Spring XD,[],Bug,Spring,Medium
2,Trivial,Working with Spring-XD version 1.3.2.RELEASE\n...,Spring XD,[],Bug,Spring,Lowest
3,Major,My project 7 node cluster and in that 2 node a...,Spring XD,"['Spring', 'xd']",Bug,Spring,Medium
4,Minor,See https://github.com/spring-projects/spring-...,Spring XD,[],Story,Spring,Low
...,...,...,...,...,...,...,...
1611180,Major,it is very beautiful.,Community Support - Open Source Project Reposi...,[],New Project,Sonatype,Medium
1611181,Major,library,Community Support - Open Source Project Reposi...,[],New Project,Sonatype,Medium
1611182,Major,What is reactive-gremlin\r\n\r\nreactive-greml...,Community Support - Open Source Project Reposi...,[],New Project,Sonatype,Medium
1611183,Major,"Android view for a swipeable, weekly calendar.",Community Support - Open Source Project Reposi...,[],New Project,Sonatype,Medium


In [3]:
# Count per priority
df['class'].value_counts()

class
Medium     1118034
Low         299697
High         89619
Highest      61754
Lowest       36703
Name: count, dtype: int64

In [4]:
df['issuetype'].value_counts().to_frame()[:50]

issuetype,count
Bug,759733
Improvement,268685
Task,157186
Sub-task,118987
New Feature,71440
New Project,65487
Feature Request,42457
Story,28386
Enhancement,28333
Test,10147


In [5]:
# Unique projects
df['project'].nunique()


1157

In [6]:
# Unique collections
df['collection'].nunique()

10

In [7]:
# Count per collection
df['collection'].value_counts().to_frame()

collection,count
Apache,984269
RedHat,338623
Sonatype,87255
Spring,68556
Sakai,49820
JiraEcosystem,41484
Hyperledger,28144
IntelDAOS,9306
Mindville,2115
SecondLife,1613


In [8]:
# Drop rows whose description duplicates another row (keeping the last occurrence),
# then drop rows with missing values and rebuild the index.
# Chaining and reassigning avoids the pandas SettingWithCopyWarning raised by in-place edits.
df = (
    df.drop_duplicates(subset=['description'], keep='last')
      .dropna()
      .reset_index(drop=True)
)
df["class"].value_counts()


class
Medium     960459
Low        271629
High        82646
Highest     56220
Lowest      33291
Name: count, dtype: int64
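
One caveat of deduplicating on `description` alone is that identical descriptions may carry different class labels, in which case `keep='last'` silently decides which label survives. The sketch below, on a made-up toy frame (the rows are illustrative, not taken from the dataset), shows one way such conflicts could be counted before dropping duplicates.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "description": ["crash on start", "crash on start", "typo in docs", "typo in docs"],
    "class": ["High", "Medium", "Low", "Low"],
})

# Number of distinct labels attached to each description.
labels_per_text = toy.groupby("description")["class"].nunique()
conflicting = labels_per_text[labels_per_text > 1]
print(conflicting.index.tolist())

# Same call as in the notebook: the last occurrence wins.
deduped = toy.drop_duplicates(subset=["description"], keep="last")
print(deduped["class"].tolist())
```

If the number of conflicting descriptions is large, it may be worth deciding explicitly how to resolve them rather than relying on row order.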

In [9]:
print(df["description"][0])

We tried upgrading from Spring Boot 2.0.6 to Spring 2.1.0 and noticing a critical multi-threading bug in the WebClient.  We are using Spring WebFlux with Netty Embedded Server

In SpringBoot 2.0.6 you can see data received and published on happens on two different threads but in SpringBoot 2.1.0 all execution is happening on the same thread even data is published to same thread.  Any reason why the multi-threading behavior has changed in SpringBoot 2.1.0?  Seems like a major defect.  This flaw is preventing us from upgrading to Spring Boot 2.1.0.

It appears with Spring Boot 2.1.0 the default threading behavior of the WebClient has changed to where emissions on published on main thread as opposed to a different thread

See results below

I print the thread on the doOnRequest method of the WebClient

I also print the thread on the doOnNext of the Mono

For version 2.10 you can see emissions are published on the main thread but in earlier version

2.0.6 emissions published on a different

In [10]:
# Convert the description to string.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning.
df = df.assign(text_str=df['description'].astype(str))


In [11]:
# Clean the data.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning.
df = df.assign(text_clean=df["text_str"].map(preprocess_text))


In [12]:
# save df to csv
df.to_csv("jira_clean_with_all_cols.csv", index=False)

In [13]:
# Keep only the columns needed (copy so later in-place edits act on an independent frame)
df = df[["class", "text_clean"]].copy()

In [14]:
# Rows with NaN
df[df.isna().any(axis=1)]

Unnamed: 0,class,text_clean
4,Low,
22,Low,
34,Low,
43,Highest,
52,Low,
...,...,...
1404154,Medium,
1404188,Medium,
1404189,Medium,
1404201,Medium,


In [15]:
# dropna is needed here because the cleaning function returns NaN for non-English text.
df.dropna(inplace=True)
df.reset_index(drop=True,inplace=True)

df

Unnamed: 0,class,text_clean
0,Highest,we tried upgrading from spring boot to spring ...
1,Medium,the jobs that appear under executions section ...
2,Lowest,working with springxd version the server with ...
3,Medium,my project node cluster and in that node are a...
4,Medium,im trying to run a job on springxd and the job...
...,...,...
1370584,Medium,simple android lib for pushing little messages...
1370585,Medium,it is very beautiful
1370586,Medium,what is reactivegremlin reactivegremlin is a s...
1370587,Medium,android view for a swipeable weekly calendar


In [16]:
# Rows with NaN
df[df.isna().any(axis=1)]

Unnamed: 0,class,text_clean


In [17]:
null_rows = df[df['text_clean'].isnull()]
null_rows

Unnamed: 0,class,text_clean


In [18]:
# Clean dataset with clean text and class.
name = "jira_clean.csv"
df.to_csv(name, index=False)

In [19]:
import os
priority_levels = ['Highest', 'High', 'Medium', 'Low', 'Lowest']

for level in priority_levels:
    try:
        # Make dir with level
        os.makedirs(f'{level}', exist_ok=True)
        # df with level class
        df_level = df[df['class'] == level]
        # Save to csv
        df_level.to_csv(f'{level}/clean_{level}.csv', index=False)
        print(f"Saved {level}.csv")
    except Exception as e:
        print(f"An error occurred for level {level}: {str(e)}")

Saved Highest.csv
Saved High.csv
Saved Medium.csv
Saved Low.csv
Saved Lowest.csv


In [20]:
# Read each csv back to check that the file was saved correctly.
# A separate variable is used so the main df is not overwritten.
for level in priority_levels:
    try:
        df_check = pd.read_csv(f'{level}/clean_{level}.csv')
        print(f"Read {level}.csv")
    except Exception as e:
        print(f"An error occurred while reading {level}.csv: {str(e)}")

Read Highest.csv
Read High.csv
Read Medium.csv
Read Low.csv
Read Lowest.csv
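
Reading a file without an exception only shows that it exists and parses. A slightly stronger check, sketched below on a toy frame in a temporary directory (the data and the directory are illustrative assumptions), compares the number of rows in each per-level file with the class counts of the frame they were written from.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "class": ["High", "Low", "Low", "Medium"],
    "text_clean": ["a b", "c d", "e f", "g h"],
})
levels = ["High", "Medium", "Low"]

with tempfile.TemporaryDirectory() as root:
    # Write one file per level, mirroring the {level}/clean_{level}.csv layout.
    for level in levels:
        os.makedirs(os.path.join(root, level), exist_ok=True)
        path = os.path.join(root, level, f"clean_{level}.csv")
        toy[toy["class"] == level].to_csv(path, index=False)

    # Read each file back and record its row count.
    rows_on_disk = {
        level: len(pd.read_csv(os.path.join(root, level, f"clean_{level}.csv")))
        for level in levels
    }

expected = toy["class"].value_counts().to_dict()
print(rows_on_disk == expected)
```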


In [21]:
pri = pd.read_csv(name)
pri

Unnamed: 0,class,text_clean
0,Highest,we tried upgrading from spring boot to spring ...
1,Medium,the jobs that appear under executions section ...
2,Lowest,working with springxd version the server with ...
3,Medium,my project node cluster and in that node are a...
4,Medium,im trying to run a job on springxd and the job...
...,...,...
1370584,Medium,simple android lib for pushing little messages...
1370585,Medium,it is very beautiful
1370586,Medium,what is reactivegremlin reactivegremlin is a s...
1370587,Medium,android view for a swipeable weekly calendar


In [22]:
null_rows = pri[pri['text_clean'].isnull()]
null_rows

Unnamed: 0,class,text_clean
887,Low,
12981,Low,
14443,Medium,
14444,Medium,
18752,Low,
...,...,...
1361963,Medium,
1363716,Medium,
1364722,Medium,
1366970,Medium,
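
These NaN rows appear even though the frame contained none before it was saved (see cell 16). A likely explanation, stated here as an assumption rather than something verified on this dataset, is that some cleaned texts are empty strings or tokens such as `null` or `nan`, which `pd.read_csv` parses as missing values by default. The toy example below reproduces that round-trip effect and shows that `keep_default_na=False` disables it.

```python
import io

import pandas as pd

# Toy data for illustration only: no NaN before saving.
toy = pd.DataFrame({
    "class": ["Low", "Medium", "High"],
    "text_clean": ["", "null", "some text"],
})
nan_before = int(toy["text_clean"].isna().sum())

# Write to an in-memory CSV, then read it back twice.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                        # default NA parsing
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False)  # keep strings as written

nan_default = int(default_read["text_clean"].isna().sum())
nan_strict = int(strict_read["text_clean"].isna().sum())
print(nan_before, nan_default, nan_strict)
```

Either way, rows whose cleaned text is empty carry no information, so dropping them after the read, as the next cell does, is a reasonable choice.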


In [23]:
# Remove rows with NaN
pri = pri.dropna()
# Reset index
pri.reset_index(drop=True, inplace=True)
pri

Unnamed: 0,class,text_clean
0,Highest,we tried upgrading from spring boot to spring ...
1,Medium,the jobs that appear under executions section ...
2,Lowest,working with springxd version the server with ...
3,Medium,my project node cluster and in that node are a...
4,Medium,im trying to run a job on springxd and the job...
...,...,...
1369976,Medium,simple android lib for pushing little messages...
1369977,Medium,it is very beautiful
1369978,Medium,what is reactivegremlin reactivegremlin is a s...
1369979,Medium,android view for a swipeable weekly calendar
