The purpose of this cleaning is to make sure that the non-security dataset doesn't contain security-related issues.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [79]:
sec_df = pd.read_csv("data/security_issues.csv")
non_sec_df = pd.read_csv("data/non-security_issues.csv")

### Cyber security glossary

In [4]:
from tika import parser

# source: https://nvlpubs.nist.gov/nistpubs/ir/2013/nist.ir.7298r2.pdf
raw = parser.from_file('cybersec_glossary.pdf')



In [9]:
# save the raw text to a txt file
with open('cybersec_glossary.txt', 'w', encoding="utf-8") as f:
    f.write(raw['content'])

In [10]:
# read the txt file
with open('cybersec_glossary.txt', 'r', encoding="utf-8") as f:
    text = f.read()

In [34]:
# use regex to extract the terms and definitions
import re

# a term starts on a new line and is separated from its definition by " – " (en dash)
pattern = r'\n([\w\s]+) – '

terms = re.findall(pattern, text)

In [38]:
# remove leading digits, spaces, and newlines
terms = [term.lstrip('0123456789. \n') for term in terms]
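
To see what the pattern and the `lstrip` call do together, here is a minimal sketch on a made-up excerpt. The `sample_text` string is hypothetical and only mimics the "Term – definition" layout assumed for the glossary text; it is not taken from the PDF.

```python
import re

# Hypothetical excerpt mimicking the "Term – definition" layout of the glossary text.
sample_text = (
    "\nAccess – Ability to make use of any information system resource.\n"
    "\n12\nAccess Control – The process of granting or denying requests.\n"
)

# Same pattern as in the cell above: newline, word/space characters, then " – ".
pattern = r'\n([\w\s]+) – '

raw_terms = re.findall(pattern, sample_text)

# A page number that precedes a term is captured too, so strip leading digits,
# dots, spaces, and newlines.
clean_terms = [t.lstrip('0123456789. \n') for t in raw_terms]
print(clean_terms)
```

Because `[\w\s]+` also accepts digits and newlines, a stray page number on the preceding line ends up inside the captured group, which is why the stripping step is needed.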

In [40]:
terms_series = pd.Series(terms)
terms_series

0                         Access
1               Access Authority
2                 Access Control
3       Access Control Mechanism
4                   Access Level
                  ...           
1212                      Zombie
1213             Zone Of Control
1214                         III
1215                         III
1216              Subchapter III
Length: 1217, dtype: object

In [49]:
# replace newlines with spaces
terms_series = terms_series.str.replace('\n', ' ')

In [51]:
# remove repeating spaces
terms_series = terms_series.str.replace(' +', ' ', regex=True)


In [53]:
# lowercase all terms
terms_series = terms_series.str.lower()

In [55]:
# remove duplicate terms
terms_series = terms_series.drop_duplicates()

In [56]:
# save the terms to a csv file
terms_series.to_csv('cybersec_terms.csv', index=False, header=False)
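
The normalization steps above (newlines to spaces, collapsing repeated spaces, lowercasing, dropping duplicates) can also be sketched as one plain-Python helper. `normalize_terms` is a hypothetical name, and unlike the cells above it also trims surrounding whitespace and skips empty strings.

```python
import re

def normalize_terms(terms):
    """Collapse whitespace, lowercase, and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for term in terms:
        # \s+ covers newlines and runs of spaces in one substitution.
        cleaned = re.sub(r'\s+', ' ', term).strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

normalized = normalize_terms(["Access  Control", "access\ncontrol", "Zombie", "III", "III"])
print(normalized)
```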

### Keyword search

In [69]:
# read the terms from the csv file
terms_df = pd.read_csv('cybersec_terms.csv', header=None)[0]

In [70]:
terms_df

0                         access
1               access authority
2                 access control
3       access control mechanism
4                   access level
                  ...           
1133                   zero fill
1134                 zeroization
1135                     zeroize
1136                      zombie
1137             zone of control
Name: 0, Length: 1138, dtype: object

In [80]:
non_sec_df

Unnamed: 0,repository,number,title,description,comments,is_pr,labels,is_sec
0,matomo-org/matomo,19865,Make the copy the dashboard to user feature co...,"Hi all,\r\n\r\nnot sure where to put this tick...",,no,"Enhancement,",no
1,matomo-org/matomo,19864,Allow user to define a custom name for the tra...,A user suggests:\r\n\r\n\r\n\r\n_The stats tra...,@atom-box thank you for making useful ticketsH...,no,"Enhancement,",no
2,matomo-org/matomo,19862,Callback in sendRequest() / trackPageView() is...,## Expected Behavior\r\n\r\nThe callback in a ...,"For anyone having the same issue, this is my w...",no,"Potential Bug,",no
3,matomo-org/matomo,19861,Ensure password check can only throw wrong pas...,### Description:\r\n\r\n\r\n\r\nrefs #19857\r\...,,yes,"Needs Review,",no
4,matomo-org/matomo,19860,[automatic composer updates],composer update log:\r\n```\r\nLoading compose...,,yes,"not-in-changelog,",no
...,...,...,...,...,...,...,...,...
36086,ipython/ipython,12557,Use pathlib in page.py,"#12515 Change the tmpname to tmppath, and find...",Thanks !,yes,"8.0 what's new,",no
36087,ipython/ipython,12552,use pathlib in fixup_whats_new_pr.py,Related to https://github.com/ipython/ipython/...,Perfect ! Thanks !,yes,"8.0 what's new,",no
36088,ipython/ipython,12551,Invalid exec call crashed IPython shell,"The following code snippet, when typed into an...",This still happens with IPython 7.19.0.This st...,no,"bug,",no
36089,ipython/ipython,12545,more useful feedback on style checker,follow-up to #12502 \r\r\n\r\r\n- give copy-pa...,"I'm unsure about `$TRAVIS_COMMIT_RANGE`, with ...",yes,"8.0 what's new,",no


In [85]:
# concatenate titles and descriptions of non-security issues
title_desc_series = non_sec_df['title'] + ' ' + non_sec_df['description']

In [86]:
title_desc_series

0        Make the copy the dashboard to user feature co...
1        Allow user to define a custom name for the tra...
2        Callback in sendRequest() / trackPageView() is...
3        Ensure password check can only throw wrong pas...
4        [automatic composer updates] composer update l...
                               ...                        
36086    Use pathlib in page.py #12515 Change the tmpna...
36087    use pathlib in fixup_whats_new_pr.py Related t...
36088    Invalid exec call crashed IPython shell The fo...
36089    more useful feedback on style checker follow-u...
36090    Use pathlib for edit magic command Change 'wit...
Length: 36091, dtype: object

In [87]:
# search for the terms in title_desc_series
found_terms = title_desc_series.str.lower().str.extract('(' + '|'.join(terms_df) + ')', expand=False)
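
Note that the alternation above is built from raw terms, so it matches substrings: a short term such as "code" also matches inside "decoder". A possible stricter variant, sketched here with the standard `re` module, escapes each term and requires word boundaries. `build_term_pattern` and `first_term` are hypothetical helpers, not part of the notebook.

```python
import re

def build_term_pattern(terms):
    """Build one alternation that matches whole terms only, longest term first."""
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = '|'.join(re.escape(t) for t in ordered)
    return re.compile(r'\b(' + alternation + r')\b')

term_pattern = build_term_pattern(["access", "access control", "code"])

def first_term(text):
    """Return the first glossary term found in the lowercased text, or None."""
    match = term_pattern.search(text.lower())
    return match.group(1) if match else None

print(first_term("Fix access control check"))
print(first_term("Add Unicode decoder"))
```

Sorting the terms by length makes the regex prefer "access control" over "access" when both start at the same position.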

In [88]:
found_terms

0               user
1               user
2               test
3           password
4        information
            ...     
36086            NaN
36087            NaN
36088           code
36089           code
36090            NaN
Length: 36091, dtype: object

In [89]:
# count non-null values
found_terms.count()

30208

In [99]:
# show the non-null values
found_terms[found_terms.notnull()]

0               user
1               user
2               test
3           password
4        information
            ...     
36076           user
36077           code
36082           read
36088           code
36089           code
Length: 30208, dtype: object

In [97]:
# count null values
found_terms.isnull().sum()

5883

In [100]:
# show the null values
found_terms[found_terms.isnull()]

5        NaN
73       NaN
76       NaN
131      NaN
181      NaN
        ... 
36084    NaN
36085    NaN
36086    NaN
36087    NaN
36090    NaN
Length: 5883, dtype: object

In [103]:
print(title_desc_series[36076])

DOC: Define ipy directive as such in docs. By utilizing the directive `rst:directive::`, one can now link to the
documentation on the sphinx directive using an inline reference as well
as now being able to link to any of the options.

Put this early in the documentation as it's useful to quickly summarize
the options and decorators available to the user immediately.

The currently provided examples are phenomenal but having a thorough
reference and all of the options in one place and having them be easy to
create a hyperlink to is also very important.


In [104]:
non_sec_df[36076:36077]

Unnamed: 0,repository,number,title,description,comments,is_pr,labels,is_sec
36076,ipython/ipython,12580,DOC: Define ipy directive as such in docs.,"By utilizing the directive `rst:directive::`, ...",Thanks ! Looks great !,yes,"8.0 what's new,",no


In [102]:
print(title_desc_series[5])

[automatic submodule updates] Updated submodules:

- plugins/DeviceDetectorCache

- plugins/LoginLdap

- ~plugins/TagManager~




## Simple classifiers

### Naive Bayes

In [106]:
df = pd.concat([sec_df, non_sec_df]).reset_index(drop=True)
df['title_desc'] = df['title'] + ' ' + df['description'].fillna('')

In [111]:
# # split the dataset into training and testing sets
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(
#     df['title_desc'],
#     df['is_sec'], 
#     test_size=0.5,
#     random_state=42,
# )

In [119]:
# vectorize df['title_desc'] using count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X = count_vect.fit_transform(df['title_desc'])

In [120]:
X.shape

(45783, 221206)

In [121]:
# encode df['is_sec'] using label encoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['is_sec'])

In [122]:
y

array([1, 1, 1, ..., 0, 0, 0])

In [123]:
# train multinomial naive bayes classifier
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X, y)

In [124]:
clf

In [125]:
y_pred = clf.predict(X)

In [126]:
# show classification report
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.91      0.94     36091
           1       0.74      0.92      0.82      9692

    accuracy                           0.92     45783
   macro avg       0.86      0.92      0.88     45783
weighted avg       0.93      0.92      0.92     45783
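
The report above is computed on the same rows the classifier was fitted on, so it says little about generalization. If the goal is to flag suspicious labels, one alternative is out-of-fold prediction. The sketch below uses `cross_val_predict` on toy data; the texts, labels, and `cv=2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df['title_desc'] and the encoded is_sec labels (hypothetical data).
texts = [
    "xss vulnerability in login form",
    "sql injection in search endpoint",
    "password leak in debug log",
    "csrf token missing on settings page",
    "typo in readme",
    "update build script",
    "add dark mode to dashboard",
    "password field placeholder is misaligned",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# The vectorizer sits inside the pipeline so it is refit on each training fold.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Every row is predicted by a model that never saw it during fitting.
oof_pred = cross_val_predict(model, texts, labels, cv=2)

# Same comparison as y_pred > y: labelled non-security (0), predicted security (1).
suspicious_idx = np.flatnonzero(oof_pred > labels)
print(suspicious_idx)
```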



In [130]:
# show false positives
nb_false_positives = df[y_pred > y]['title_desc']
nb_false_positives

9693     Allow user to define a custom name for the tra...
9745     Is there a way to filter a specific user ID on...
9769     Bring back an option for Add Users / New user ...
9770     When a User without permission to any Measurab...
9786     2FA should require password confirmation Not a...
                               ...                        
45483    Enable `check_make_token_by_line_never_ends_em...
45484    Rewrite bunch of `raise AssertionError` and `a...
45496    Fix `history_manager.search(?, unique=True)` r...
45588    Backport PR #13044 on branch 7.x (Update refer...
45593    Don't access current_frame f_locals This shoul...
Name: title_desc, Length: 3101, dtype: object
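
The mask `y_pred > y` works because the labels are encoded as 0 ("no") and 1 ("yes"): the comparison is true only where the prediction is 1 and the label is 0. A small sketch with made-up arrays:

```python
import numpy as np

# Made-up labels and predictions, encoded like is_sec: 0 = "no", 1 = "yes".
y_true = np.array([0, 0, 1, 1])
y_hat = np.array([1, 0, 0, 1])

false_pos = y_hat > y_true   # predicted security, labelled non-security
false_neg = y_hat < y_true   # predicted non-security, labelled security

print(false_pos.tolist())
print(false_neg.tolist())
```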

In [131]:
nb_false_positives_list = nb_false_positives.tolist()

In [138]:
print(nb_false_positives_list[4])

2FA should require password confirmation Not an issue actually.


### Naive Bayes with Cybersec Terms

In [None]:
df = pd.concat([sec_df, non_sec_df]).reset_index(drop=True)
df['title_desc'] = df['title'] + ' ' + df['description'].fillna('')

In [139]:
# vectorize df['title_desc'] using count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(vocabulary=terms_df)
X = count_vect.fit_transform(df['title_desc'])

In [140]:
X.shape

(45783, 1138)
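
One thing to keep in mind with a fixed vocabulary: with the default `ngram_range=(1, 1)` the vectorizer only produces single-word tokens, so multi-word glossary terms such as "access control" get a column that stays at zero. The sketch below, on a made-up document and a three-term vocabulary, shows the effect and one possible way around it (a wider `ngram_range`).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-vocabulary and document.
vocab = ["access", "access control", "password"]
docs = ["Broken access control on the password reset page"]

# Default unigrams: the two-word term can never be produced as a token.
unigram_counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray()

# Allowing bigrams lets the two-word term be counted as well.
ngram_counts = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2)).fit_transform(docs).toarray()

print(unigram_counts)
print(ngram_counts)
```

For the full glossary, the upper bound of `ngram_range` would need to cover the longest term, which is an assumption to check against the term list.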

In [141]:
# encode df['is_sec'] using label encoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['is_sec'])

In [142]:
# train multinomial naive bayes classifier
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X, y)

In [143]:
y_pred = clf.predict(X)

In [144]:
# show classification report
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90     36091
           1       0.66      0.50      0.57      9692

    accuracy                           0.84     45783
   macro avg       0.77      0.71      0.73     45783
weighted avg       0.83      0.84      0.83     45783



In [145]:
# show false positives
nb_false_positives = df[y_pred > y]['title_desc']
nb_false_positives

9692     Make the copy the dashboard to user feature co...
9693     Allow user to define a custom name for the tra...
9696     [automatic composer updates] composer update l...
9711     Referrer Spam Blacklist scheduled task does no...
9733     Password Error Our current password error take...
                               ...                        
45591    Backport PR #13021 on branch 7.x (Don't access...
45643    Documentation style mismatch between official ...
45674    Expanding IPython LaTeX (blackslash) completio...
45681    Add completion type (_jupyter_types_experiment...
45695    Make better prediction of the virtualenv site-...
Name: title_desc, Length: 2470, dtype: object

In [157]:
print(nb_false_positives[45681])

Add completion type (_jupyter_types_experimental) for dictionary keys, file paths, etc I would very much like the IPython to return completion type for all completions, not just for the completions from Jedi. This would not only make it possible to display the type ot the user in frontends, but also allow users to create custom rules (e.g. show paths first or show paths last).  I am happy to work on a PR and maintain this part of the codebase afterwards. Would you consider a refactor of the current completions to allow for returning type in scope for IPython current plans?

Currently completions are being passed around in three forms:
- the new (unstable) [Completion](https://github.com/ipython/ipython/blob/167f683f56a900200f5bc13227639c2ebdfb1925/IPython/core/completer.py#L355) class mostly used downstream of Jedi
- the Jedi Completion class which is an implementation detail of Jedi
- the [match tuples](https://github.com/ipython/ipython/blob/167f683f56a900200f5bc13227639c2ebdfb1925/I

## Cleaning

In [282]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [283]:
sec_df = pd.read_csv("data/security_issues.csv")
non_sec_df = pd.read_csv("data/non-security_issues.csv")

In [289]:
# drop duplicates
sec_df = sec_df.drop_duplicates(subset=['title', 'description'])
non_sec_df = non_sec_df.drop_duplicates(subset=['title', 'description'])

In [290]:
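# undersample non-security issues to match the number of security issues
# (no random_state is set, so the sample differs between runs)
non_sec_df = non_sec_df.sample(n=len(sec_df))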

In [291]:
# read the terms from the csv file
terms_df = pd.read_csv('cybersec_terms.csv', header=None)[0]

In [292]:
df = pd.concat([sec_df, non_sec_df]).reset_index(drop=True)
df['title_desc'] = df['title'] + ' ' + df['description'].fillna('')

In [293]:
# encode df['is_sec'] using label encoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['is_sec'])

### Naive Bayes with CountVectorizer

In [294]:
# vectorize df['title_desc'] using count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(vocabulary=terms_df)
X_count = count_vect.fit_transform(df['title_desc'])

In [295]:
# train multinomial naive bayes classifier
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_count, y)
y_pred = clf.predict(X_count)

In [296]:
# show classification report
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.68      0.85      0.76      9639
           1       0.80      0.60      0.69      9639

    accuracy                           0.73     19278
   macro avg       0.74      0.73      0.72     19278
weighted avg       0.74      0.73      0.72     19278



In [297]:
# show false positives
countvec_false_positives = df[y_pred > y]['title_desc']
countvec_false_positives

9646     problem with chronology SMS/MMS in Gmail Hello...
9649     Simplify TLS logging for the modern pyOpenSSL....
9653     Fix DB2 tests failing on Travis with a "Connec...
9662     Netmiko Jump Server Friends,\r\r\nI need help ...
9663     Add tool for building DEB/RPM packages Add `ma...
                               ...                        
19232    Unable to join cluster ## Environment\r\r\n\r\...
19233    Export services to be exposed as pages We curr...
19234    Bump urllib3 from 1.7.1 to 1.24.2 in /deploy B...
19261    Crash after open aplication - After Update 1.7...
19266    Race condition(s) around handshake timeout fun...
Name: title_desc, Length: 1423, dtype: object

### Naive Bayes with TF-IDF

In [298]:
# vectorize df['title_desc'] using tfidf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(vocabulary=terms_df)
X_tfidf = tfidf_vect.fit_transform(df['title_desc'])

In [299]:
# train multinomial naive bayes classifier
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_tfidf, y)
y_pred = clf.predict(X_tfidf)

In [300]:
# show classification report
from sklearn.metrics import classification_report

print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.68      0.88      0.77      9639
           1       0.83      0.58      0.68      9639

    accuracy                           0.73     19278
   macro avg       0.75      0.73      0.73     19278
weighted avg       0.75      0.73      0.73     19278



In [301]:
# show false positives
tfidf_false_positives = df[y_pred > y]['title_desc']
tfidf_false_positives

9646     problem with chronology SMS/MMS in Gmail Hello...
9649     Simplify TLS logging for the modern pyOpenSSL....
9653     Fix DB2 tests failing on Travis with a "Connec...
9662     Netmiko Jump Server Friends,\r\r\nI need help ...
9674     Large form is slow in Form Builder in part due...
                               ...                        
19218    GitWrapper\GitException in App\Jobs\Publish ##...
19232    Unable to join cluster ## Environment\r\r\n\r\...
19234    Bump urllib3 from 1.7.1 to 1.24.2 in /deploy B...
19261    Crash after open aplication - After Update 1.7...
19266    Race condition(s) around handshake timeout fun...
Name: title_desc, Length: 1154, dtype: object

### Remove suspicious non-security issues

Drop the non-security issues flagged by either classifier and replace them with non-security issues that don't contain any cybersec terms.

In [302]:
# unite the false positives
false_positives = pd.concat([countvec_false_positives, tfidf_false_positives]).drop_duplicates()
false_positives

9646     problem with chronology SMS/MMS in Gmail Hello...
9649     Simplify TLS logging for the modern pyOpenSSL....
9653     Fix DB2 tests failing on Travis with a "Connec...
9662     Netmiko Jump Server Friends,\r\r\nI need help ...
9663     Add tool for building DEB/RPM packages Add `ma...
                               ...                        
18178    Templates can be used for "master" domains as ...
18209    [CI] InternalTestClusterIT testOperationsDurin...
18871    npm prune not work for extraneous error #### I...
18965    ClockMock autoregister on test subdirectories ...
19048    HTTP Request TimeOut errors for Elixir tests i...
Name: title_desc, Length: 1468, dtype: object
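
`drop_duplicates()` on the concatenated Series compares the text values, not the index labels. Since only the index is used afterwards, a possible alternative is to unite the two indexes directly. The Series below are made-up stand-ins for the two false-positive sets.

```python
import pandas as pd

# Made-up stand-ins for countvec_false_positives and tfidf_false_positives.
countvec_fp = pd.Series(["issue a", "issue b"], index=[10, 11])
tfidf_fp = pd.Series(["issue b", "issue c"], index=[11, 12])

# Union of row labels flagged by either classifier.
flagged_index = countvec_fp.index.union(tfidf_fp.index)
print(list(flagged_index))
```

The resulting index could then be passed to `df.drop(...)` in the same way as `false_positives.index`.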

In [366]:
# remove issues from df with indices in false_positives.index
new_df = df.drop(false_positives.index)

In [367]:
# show the new distribution of security and non-security issues
new_df_counts = new_df['is_sec'].value_counts()
new_df_counts

yes    9639
no     8171
Name: is_sec, dtype: int64

In [368]:
# read the terms from the csv file
terms_df = pd.read_csv('cybersec_terms.csv', header=None)[0]

In [369]:
non_sec_df = pd.read_csv("data/non-security_issues.csv")

In [370]:
# concatenate titles and descriptions of non-security issues
non_sec_df['title_desc'] = non_sec_df['title'] + ' ' + non_sec_df['description']

In [371]:
# find non-security issues that are not in new_df
new_non_sec = non_sec_df[~non_sec_df['title_desc'].isin(new_df['title_desc'])]

In [372]:
# find issues in new_non_sec that don't have any of the terms in terms_df
new_non_sec = new_non_sec[new_non_sec['title_desc'].str.contains('|'.join(terms_df)) == False]
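
The terms are stored in lowercase, while `title_desc` keeps its original casing, so the filter above is case-sensitive. A hedged sketch of a case-insensitive variant with escaped terms follows; the toy Series are assumptions, and rows with missing text are treated as matches so that they are excluded, which mirrors the effect of `== False` on NaN.

```python
import re
import pandas as pd

# Hypothetical terms and texts.
terms = pd.Series(["access control", "password"])
texts = pd.Series(["Password reset fails", "Fix typo in README", None])

# Escape each term so regex metacharacters are matched literally.
term_regex = '|'.join(re.escape(t) for t in terms)

# case=False ignores casing; na=True marks missing texts as "contains a term".
has_term = texts.str.contains(term_regex, case=False, regex=True, na=True)
clean = texts[~has_term]
print(clean.tolist())
```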

In [373]:
# delete title_desc duplicates from new_non_sec
new_non_sec = new_non_sec.drop_duplicates(subset=['title_desc'])

In [374]:
# sample as many replacement issues as are needed to rebalance the classes
new_non_sec = new_non_sec.sample(
    n=new_df_counts['yes']-new_df_counts['no'],
    random_state=0,
)

In [375]:
# add new_non_sec to new_df
new_df = pd.concat([new_df, new_non_sec]).reset_index(drop=True)
new_df

Unnamed: 0,repository,number,title,description,comments,is_pr,labels,is_sec,title_desc
0,matomo-org/matomo,10939,"In Personal settings page and API page, only s...",To prevent API token authentication data leaka...,,no,"c: Security,",yes,"In Personal settings page and API page, only s..."
1,matomo-org/matomo,6678,Task to automatically delete the scheduled rep...,The goal of this issue is to create a task to ...,Likely this is not needed now that reports are...,no,"Task,wontfix,c: Security,c: Usability,",yes,Task to automatically delete the scheduled rep...
2,matomo-org/matomo,11826,Check if prefixurl for api listing starts with...,,,yes,"c: Security,not-in-changelog,Needs Review,",yes,Check if prefixurl for api listing starts with...
3,matomo-org/matomo,17665,Disable logme functionality by default,### Description:\r\n\r\n\r\n\r\nLogme function...,@sgiehl should the developer changelog be upda...,yes,"c: Security,Needs Review,",yes,Disable logme functionality by default ### Des...
4,matomo-org/matomo,17545,Ensure login is set for brute force log when 2...,### Description:\r\n\r\n\r\n\r\nFollow up to #...,"Tested locally, works",yes,"c: Security,not-in-changelog,Needs Review,",yes,Ensure login is set for brute force log when 2...
...,...,...,...,...,...,...,...,...,...
19273,factor/factor,1339,nth is faster than nth-unsafe in some cases,I noticed that `nth` is faster than `nth-unsaf...,@bjourne or @ajvondrak might be interested in ...,no,"compiler,performance,sequences,bits,unsafe,",no,nth is faster than nth-unsafe in some cases I ...
19274,rails/rails,44292,Not depending on unmaintained dependencies,"Hello,\r\n\r\n\r\n\r\naccording to https://git...",Rails 7 doesn't have any hard dependency on sp...,no,"asset pipeline,",no,Not depending on unmaintained dependencies Hel...
19275,ohmyzsh/ohmyzsh,10979,I'm unable to run existing NPM packages even t...,### Describe the bug\r\n\r\nI've installed few...,Are you using `nvm` plugin? I don't use nvm ac...,no,"Resolution: not our issue,",no,I'm unable to run existing NPM packages even t...
19276,showdownjs/showdown,64,a bug about options！,call:\r\nvar converter = new Showdown.converte...,Should be fixed with the new extension loading...,yes,"bug,",no,a bug about options！ call:\r\nvar converter = ...


In [376]:
# show the new distribution of security and non-security issues
new_df['is_sec'].value_counts()

yes    9639
no     9639
Name: is_sec, dtype: int64

In [379]:
# delete every row whose title_desc occurs more than once (keep=False drops all copies)
new_df = new_df.drop_duplicates(subset=['title_desc'], keep=False)

In [380]:
# show the new distribution of security and non-security issues
new_df['is_sec'].value_counts()

yes    9638
no     9638
Name: is_sec, dtype: int64
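
Both counts drop by one here, which would be consistent with a single text appearing once under each label: `keep=False` removes every copy of a duplicated value rather than keeping the first. A toy frame (made-up values) shows the difference from the default behaviour:

```python
import pandas as pd

# Made-up frame: the text "a" appears once as security and once as non-security.
toy = pd.DataFrame({
    "title_desc": ["a", "b", "a", "c"],
    "is_sec": ["yes", "yes", "no", "no"],
})

kept_first = toy.drop_duplicates(subset=["title_desc"])              # keeps the first "a"
kept_none = toy.drop_duplicates(subset=["title_desc"], keep=False)   # drops both "a" rows

print(kept_first["is_sec"].value_counts().to_dict())
print(kept_none["is_sec"].value_counts().to_dict())
```

Dropping both copies keeps the classes balanced and removes texts whose label is ambiguous.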

In [381]:
# save new_df to csv
new_df.to_csv('data/new_issues.csv', index=False)