The content pre-processing step takes both normal text and code as input and performs the following:
* Tokenization: Tokenization breaks a paragraph into word tokens.
* Stop word removal: Stop word removal removes commonly used words such as is, are, I, and you.
* Stemming: Stemming reduces a word to its root form, e.g., "reading" to "read".

For code, we remove reserved keywords (if, while, etc.) and punctuation such as curly brackets,
then extract identifiers and comments. These are then also subjected to tokenization, stemming, and
stop word removal [1].

We also hope to apply additional preprocessing steps to code snippets, influenced by [2] and [3].
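
As an illustration of the kind of identifier tokenization referred to above, here is a minimal sketch that splits camelCase, snake_case, and hyphenated names into lowercase words. The helper name `split_identifier` and the regular expression are our own assumptions for illustration; they are not taken from [2] or [3].

```python
import re

# Assumed word pattern: an acronym run followed by a capitalized word,
# an optionally capitalized lowercase word, a run of capitals, or digits.
WORD_PATTERN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

def split_identifier(identifier):
    """Split an identifier into lowercase sub-words (hypothetical helper)."""
    parts = []
    # First break on separators such as '_' and '-', then on case changes.
    for chunk in re.split(r"[^A-Za-z0-9]+", identifier):
        parts.extend(WORD_PATTERN.findall(chunk))
    return [part.lower() for part in parts]

print(split_identifier("getUserName"))   # ['get', 'user', 'name']
print(split_identifier("user-svc"))      # ['user', 'svc']
print(split_identifier("HTTPResponse"))  # ['http', 'response']
```

The resulting sub-words could then go through the same stop word removal and stemming as normal text.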

References
1. Wang et al.
2. Identifier tokenization: Son Nguyen, Hung Phan, Trinh Le, and Tien N. Nguyen. 2020. Suggesting natural method names to check name consistencies. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20). Association for Computing Machinery, New York, NY, USA, 1372–1384. DOI:https://doi.org/10.1145/3377811.3380926
3. https://stackoverflow.com/questions/29916065/how-to-do-camelcase-split-in-python

In [1]:
import sklearn as sk
import numpy as np
import scipy
import nltk
from nltk.corpus import stopwords
import re
from nltk.stem import PorterStemmer 

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nimmi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
title = "How to escape - (hyphen) using the groovy language"
text = "I am trying to declare a variable that requires hyphen as part of the design spec. However, i am getting this error - https://www.tutorialspoint.com/execute_groovy_online.php"
codes = [
    "def user-svc = \"accounts\"",
    "$groovy main.groovy \n Hello world \n Caught: groovy.lang.MissingPropertyException: No such property: user for class: main \n groovy.lang.MissingPropertyException: No such property: user for class: main \n at main.run(main.groovy:3)"
]

Important:
* title could be an error message
* text could contain links
* code could be invalid; for example, it could contain runtime errors copied and pasted from the console
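
Since text can contain links, one option is to strip them before tokenization. The following is a minimal sketch under the assumption that a link is anything starting with http:// or https:// and running up to the next whitespace; `remove_links` is a hypothetical helper and not part of the pipeline below.

```python
import re

# Assumed URL pattern: scheme followed by any non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def remove_links(text):
    """Replace URLs with a space, then collapse repeated whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    return " ".join(without_links.split())

sample = "However, i am getting this error - https://www.tutorialspoint.com/execute_groovy_online.php"
cleaned = remove_links(sample)
print(cleaned)  # However, i am getting this error -
```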

Step 1: Tokenization

In [4]:
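def tockenize(text):
    # Split on single spaces and drop every empty token produced by
    # consecutive spaces (list.remove would only drop the first one).
    tokens = [token for token in text.split(" ") if token]
    return tokens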

In [5]:
tockenize(title)

['How', 'to', 'escape', '-', '(hyphen)', 'using', 'the', 'groovy', 'language']

In [6]:
tockenize(codes[0])

['def', 'user-svc', '=', '"accounts"']

Step 2: Remove non-alphabetic tokens

In [7]:
def remove_non_alphabetic_tokens(tokens):
    tokens = [word for word in tokens if word.isalpha()]
    return tokens

In [8]:
tokens = tockenize(title)
remove_non_alphabetic_tokens(tokens)

['How', 'to', 'escape', 'using', 'the', 'groovy', 'language']

In [9]:
tokens = tockenize(codes[0])
remove_non_alphabetic_tokens(tokens)

['def']

Removing non-alphabetic tokens does not work for code: tokens such as user-svc and "accounts" are dropped entirely, leaving only def. Code should be preprocessed before these steps are applied.
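
One possible pre-split for code, shown here only as a sketch, is to pull out every run of letters instead of splitting on spaces. The helper `split_code_tokens` is hypothetical, and the choice to ignore digits and underscores is an assumption made to keep the example small.

```python
import re

def split_code_tokens(code):
    """Return the maximal runs of ASCII letters found in a code snippet."""
    return re.findall(r"[A-Za-z]+", code)

print(split_code_tokens('def user-svc = "accounts"'))
# ['def', 'user', 'svc', 'accounts']
```

With this pre-split, the parts of user-svc and the string literal survive the alphabetic filter; reserved keywords such as def would still have to be removed separately.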

Step 3: Remove stop words

In [10]:
def remove_stop_words(tokens):
    # Use a set for fast membership checks; matching is case-sensitive.
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    return tokens

In [11]:
tokens = tockenize(title)
tokens = remove_non_alphabetic_tokens(tokens)
remove_stop_words(tokens)

['How', 'escape', 'using', 'groovy', 'language']
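
Note that 'How' survives because NLTK's English stop word list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows. To keep it runnable without NLTK data, it uses a tiny hypothetical stop word set (`SAMPLE_STOP_WORDS`) as a stand-in for `stopwords.words('english')`.

```python
# Hypothetical, deliberately small stand-in for the NLTK English stop word list.
SAMPLE_STOP_WORDS = {"how", "to", "the", "is", "are", "i", "you"}

def remove_stop_words_ignore_case(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word; keep original casing."""
    return [w for w in tokens if w.lower() not in stop_words]

title_tokens = ['How', 'to', 'escape', 'using', 'the', 'groovy', 'language']
print(remove_stop_words_ignore_case(title_tokens))
# ['escape', 'using', 'groovy', 'language']
```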

Step 4: Stemming

In [14]:
def stem(tokens):
    ps = PorterStemmer() 
    tokens = [ps.stem(w) for w in tokens]
    return tokens

In [15]:
tokens = tockenize(title)
tokens = remove_non_alphabetic_tokens(tokens)
tokens = remove_stop_words(tokens)
stem(tokens)

['how', 'escap', 'use', 'groovi', 'languag']

Step 5: Remove short tokens

In [16]:
def remove_short_tokens(tokens, length=1):
    # Keep only tokens longer than `length` characters.
    tokens = [word for word in tokens if len(word) > length]
    return tokens

In [17]:
tokens = tockenize(title)
tokens = remove_non_alphabetic_tokens(tokens)
tokens = remove_stop_words(tokens)
remove_short_tokens(tokens)

['How', 'escape', 'using', 'groovy', 'language']

Code preprocessing before tokenization

"For the code, we remove reserved keywords such as: if, while, etc., curly brackets, etc., and extract identifiers and comments. These are then subjected to tokenization, stemming, and stop word removal too" [1].

What we planned to do:
1. Extract Identifiers
2. Extract Comments
3. Tokenization, stemming, and stop word removal
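
The first two planned steps could look roughly like the sketch below. It rests on several assumptions: comments use C-style syntax (`//` and `/* */`, as in Groovy or Java), the keyword list `RESERVED_KEYWORDS` is a small hypothetical sample, and string literals are not handled specially, so their contents show up as identifiers and a `//` inside a string would be mistaken for a comment.

```python
import re

# Hypothetical, deliberately small keyword list; a real one depends on the language.
RESERVED_KEYWORDS = {"if", "else", "while", "for", "return", "def", "class", "new"}

# Assumed C-style comments: '// ...' to end of line, or '/* ... */' blocks.
COMMENT_PATTERN = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
IDENTIFIER_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def extract_comments(code):
    """Return the text of each comment with its markers stripped."""
    comments = []
    for match in COMMENT_PATTERN.findall(code):
        comments.append(re.sub(r"^(//|/\*)|\*/$", "", match).strip())
    return comments

def extract_identifiers(code):
    """Return identifier-like names outside comments, minus reserved keywords."""
    code_only = COMMENT_PATTERN.sub(" ", code)
    names = IDENTIFIER_PATTERN.findall(code_only)
    return [name for name in names if name not in RESERVED_KEYWORDS]

snippet = 'def userSvc = "accounts" // service name\nif (userSvc) { return userSvc }'
print(extract_comments(snippet))     # ['service name']
print(extract_identifiers(snippet))  # ['userSvc', 'accounts', 'userSvc', 'userSvc']
```

The extracted comments and identifiers could then be passed through the tokenization, stop word removal, and stemming steps defined earlier.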