# Cleaning the dataset

The goal of this project is to compare different language models. The [dataset](https://liveproject-resources.s3.amazonaws.com/116/other/stackexchange_812k.csv.gz) is composed of over 800k questions and answers extracted from Cross Validated, the statistics site of the Stack Exchange network.

The dataset has 5 variables: post_id, parent_id, comment_id, text, and category. In this notebook, the text column is cleaned.

The dataset is cleaned in the following way:

- Remove HTML tags
- Remove LaTeX expressions
- Remove rows with value NA in the text column
- Remove trailing whitespace
- Remove digits
- Convert text to lower case
- Tokenize the texts into words, using word_tokenize from the NLTK library
- Remove words containing characters other than the allowed ones (punctuation and letters)
- Remove very short and very long texts


In [83]:
import re
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize

In [84]:
data = pd.read_csv('D:/liveproject_manning/stackexchange_812k.csv')

In [85]:
print(data.shape)
print(data)

(812132, 5)
        post_id  parent_id  comment_id  \
0             1        NaN         NaN   
1             2        NaN         NaN   
2             3        NaN         NaN   
3             4        NaN         NaN   
4             6        NaN         NaN   
...         ...        ...         ...   
812127   279994        NaN    536471.0   
812128   279998        NaN    536439.0   
812129   279998        NaN    536514.0   
812130   279999        NaN    536802.0   
812131   279999        NaN    542550.0   

                                                     text category  
0                           Eliciting priors from experts    title  
1                                      What is normality?    title  
2       What are some valuable Statistical Analysis op...    title  
3       Assessing the significance of differences in d...    title  
4       The Two Cultures: statistics vs. machine learn...    title  
...                                    

In [86]:
data['text_clean'] = data['text']

Remove HTML tags.

In [87]:
re_html = re.compile('<.*?>')
data['text_clean']= data['text_clean'].str.replace(re_html,'', regex = True)
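
To make the effect of this pattern concrete, here is a minimal sketch on two made-up strings (the sample inputs are assumptions, not rows from the dataset). It also shows a possible side effect: if a text contains a literal `<` and a later `>`, everything between them is removed as if it were a tag.

```python
import re

# Same non-greedy pattern as in the cell above.
re_html = re.compile(r'<.*?>')

# Hypothetical input with ordinary markup.
sample = '<p>What is <em>normality</em>?</p>'
stripped = re_html.sub('', sample)
print(stripped)

# Hypothetical input with literal inequality signs instead of tags.
inequality = 'p < 0.05 and n > 30'
mangled = re_html.sub('', inequality)
print(repr(mangled))
```

Whether the second case matters depends on whether the raw texts contain unescaped `<` and `>` characters, which has not been checked here.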

Remove LaTeX expressions.

In [88]:
re_latex = re.compile('\\$.*?\\$')
data['text_clean']= data['text_clean'].str.replace(re_latex,'', regex = True)
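
The sketch below shows how this pattern behaves on inline math and on display math written as `$$...$$`. The sample strings and the name `re_latex_display` are assumptions for illustration. With the pattern above, each `$$` pair is matched on its own, so the formula between the pairs stays in the text; a variant that tries `$$...$$` first is one possible way to avoid this.

```python
import re

# Pattern from the cell above, written as a raw string.
re_latex = re.compile(r'\$.*?\$')

inline = r'the mean $\mu$ is zero'
display = r'we get $$x^2$$ here'

inline_clean = re_latex.sub('', inline)
display_clean = re_latex.sub('', display)
print(repr(inline_clean))
print(repr(display_clean))  # the formula body is still present

# Hypothetical variant: match $$...$$ before $...$.
# re.DOTALL lets '.' also match newlines inside a formula.
re_latex_display = re.compile(r'\$\$.*?\$\$|\$.*?\$', re.DOTALL)
display_clean_variant = re_latex_display.sub('', display)
print(repr(display_clean_variant))
```

The variant is not what was used to produce the results in this notebook, so the counts further down would probably change if it were applied.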

Remove rows with value NA in relevant column.

In [89]:
data = data.dropna(subset=['text_clean'])

Remove trailing whitespace.

In [90]:
data['text_clean']= data['text_clean'].str.rstrip()

Remove digits

In [91]:
re_digits = re.compile(r'\d+')
data['text_clean']= data['text_clean'].str.replace(re_digits,'', regex = True)

Convert to lower case.

In [92]:
data['text_clean']= data['text_clean'].str.lower()

Tokenize the texts.

In [93]:
data['text_clean'] = data['text_clean'].apply(word_tokenize)

In [94]:
print(data)

        post_id  parent_id  comment_id  \
0             1        NaN         NaN   
1             2        NaN         NaN   
2             3        NaN         NaN   
3             4        NaN         NaN   
4             6        NaN         NaN   
...         ...        ...         ...   
812127   279994        NaN    536471.0   
812128   279998        NaN    536439.0   
812129   279998        NaN    536514.0   
812130   279999        NaN    536802.0   
812131   279999        NaN    542550.0   

                                                     text category  \
0                           Eliciting priors from experts    title   
1                                      What is normality?    title   
2       What are some valuable Statistical Analysis op...    title   
3       Assessing the significance of differences in d...    title   
4       The Two Cultures: statistics vs. machine learn...    title   
...                                          

Now let's look at which characters occur in the dataset. The output shows many characters (emoji, non-Latin scripts, mathematical symbols) that need to be removed.

In [95]:
char_set = set()

all_strings = list(data['text_clean'])

for text in all_strings:
    for word in text:
        for char in word:
            char_set.add(char)
        
print(char_set)

{'🎁', '̂', 'م', 'מ', '枚', '¨', '╩', '⊆', 'ｗ', '⏜', '≪', '임', '𝑛', 'ṽ', '≥', '𝑣', '′', '；', '𝑀', '╯', 's', 'ｒ', '\u242a', '΄', 'ý', '▽', '😜', 'ｉ', 'ℙ', '𝐻', '六', '⎥', 'ú', 'v', '😔', 'β', '𝑋', '%', '𝑤', '½', 'ʼ', '回', 'し', 'ツ', '房', '𝛔', '║', 'щ', '₂', 'ل', 'æ', '¢', 'ʃ', '╤', '羲', '｛', '사', 'х', 'ר', '°', 'ｃ', 'α', '╝', '◊', '☃', 'ⁿ', '皮', '十', 'ώ', '�', '인', '˜', '⁻', '„', '宇', '\u200b', 'ø', '𝑄', '捕', '𝖳', 'ơ', 'ｎ', '№', '提', '𝜒', '≠', '¥', '\ue012', '≫', 'ῖ', '😊', '³', 'õ', '⅕', '[', '⋅', '␣', '🏼', '⁵', 'ă', '‘', '𝑚', 'ｈ', 'ℓ', '⁞', '𝑉', '►', 'ļ', '玉', '╘', 'ô', '𝑔', '😕', 'く', '个', '®', '을', '♂', 'ν', '😂', '~', 'ὶ', '𝐱', '😛', '\uf0e0', 'ϵ', 'з', 'g', 'ρ', '̃', 'ϐ', 'ａ', '»', '₹', 'ʞ', 'ḽ', '⁶', '≤', '天', '⊙', '┤', 'て', 'φ', 'ḇ', 'ύ', '♠', '├', '⁴', '∨', 'ﬁ', 'ἀ', '麻', '∃', '╮', '¬', 'ｇ', 'ò', 'ל', '∥', '海', '⠀', '𝐴', '😅', 'ö', 'ن', 'ｖ', '\U0010fc04', 'ц', 'ł', '⋮', '△', '湖', '˙', '•', '𝛂', '♢', 'ю', '؟', '\U0010fc06', '𝖯', '시', '╒', 'σ', 'ɛ', 'å', 'е', 'ɐ', 'ˇ', 'も', '︵', '▀', 'û', '

In [99]:
len(char_set)

870
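
Besides the set of distinct characters, it can help to know how often each unusual character occurs. The following is a minimal sketch using `collections.Counter` on a toy list of token lists; `tokenized_texts` stands in for `list(data['text_clean'])`, and the set of expected characters is an assumption made for this example.

```python
from collections import Counter

# Toy stand-in for list(data['text_clean']): one list of tokens per text.
tokenized_texts = [['what', 'is', 'β', '?'], ['naive', 'bayes', '😊', '?']]

# Count every character over all tokens of all texts.
char_counts = Counter(
    char for tokens in tokenized_texts for token in tokens for char in token
)

# Characters outside lowercase letters and basic punctuation, with their counts.
expected = set('abcdefghijklmnopqrstuvwxyz.,!:;?')
unusual = {char: n for char, n in char_counts.items() if char not in expected}
print(unusual)
```

Sorting `unusual` by count would show whether a few frequent symbols account for most of the unwanted characters.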

Now we clean the dataset radically by keeping, in each text, only the words that consist entirely of allowed characters (lowercase letters and '.', ',', '!', ':', ';', '?'). We can always decide later to make a different choice.

In [101]:
# define allowed characters, using the ord() value.
punctuation_and_letters = [33, 44, 46, 58, 59, 63, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 
                          110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122]
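
As a readability check, the same code points can be derived from the characters themselves. This sketch builds the allowed set with the standard `string` module and compares it to the list above; `allowed_chars` and `allowed_ords` are names introduced here for illustration.

```python
import string

# Allowed characters written out directly instead of as ord() values.
allowed_chars = set(string.ascii_lowercase) | set('.,!:;?')
allowed_ords = sorted(ord(char) for char in allowed_chars)

# The list from the cell above, rebuilt for comparison.
punctuation_and_letters = [33, 44, 46, 58, 59, 63] + list(range(97, 123))

same = allowed_ords == punctuation_and_letters
print(same)
```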

Now a list of lists is built, containing the tokens of each text that consist only of allowed characters. Of course, it would be more elegant to do this directly in the dataframe, but I could not get that to work efficiently.

In [102]:
new_texts = []

all_texts = list(data['text_clean'])

for text in all_texts:
    
    new_text = []
    for word in text:
        only_allowed = True
        for char in word:    
            if ord(char) not in punctuation_and_letters:
                only_allowed = False
                
        if only_allowed:
            new_text.append(word)
            
    new_texts.append(new_text)
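
The same filter can be sketched with a regular expression instead of a character loop. The helper `keep_allowed` and the sample tokens are hypothetical; the character class lists the same letters and punctuation marks as `punctuation_and_letters`.

```python
import re

# A token is kept only if every character is a lowercase letter
# or one of . , ! : ; ?
allowed_token = re.compile(r'[a-z.,!:;?]+')

def keep_allowed(tokens):
    """Return the tokens that consist entirely of allowed characters."""
    return [token for token in tokens if allowed_token.fullmatch(token)]

# Hypothetical tokens, similar to what word_tokenize might produce.
sample_tokens = ['the', 'p-value', 'is', 'β', 'small', '.', "n't", 'vs.']
filtered = keep_allowed(sample_tokens)
print(filtered)
```

Something like `data['text_clean'].apply(keep_allowed)` could then replace the loop, although its speed on the full dataset has not been measured here. One small difference: the loop above keeps an empty token, while `fullmatch` with `+` drops it.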

Now reconstruct complete texts by joining the tokens in each list with spaces.

In [103]:
cleaned_texts = [' '.join(tokenized_list) for tokenized_list in new_texts]

In [104]:
data = data.assign(cleaned_texts=cleaned_texts)

Remove the intermediate column 'text_clean'.

In [105]:
data = data.drop(columns=['text_clean'])

In [106]:
print(data)

        post_id  parent_id  comment_id  \
0             1        NaN         NaN   
1             2        NaN         NaN   
2             3        NaN         NaN   
3             4        NaN         NaN   
4             6        NaN         NaN   
...         ...        ...         ...   
812127   279994        NaN    536471.0   
812128   279998        NaN    536439.0   
812129   279998        NaN    536514.0   
812130   279999        NaN    536802.0   
812131   279999        NaN    542550.0   

                                                     text category  \
0                           Eliciting priors from experts    title   
1                                      What is normality?    title   
2       What are some valuable Statistical Analysis op...    title   
3       Assessing the significance of differences in d...    title   
4       The Two Cultures: statistics vs. machine learn...    title   
...                                          

Finally, remove very short and very long texts. This means that texts with fewer than 25 characters or more than 2000 characters are removed. If needed, these thresholds can be adjusted while training the models.

In [107]:
mask = (data['cleaned_texts'].str.len() >= 25) & (data['cleaned_texts'].str.len() <= 2000)
data = data.loc[mask]
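
To check that both thresholds are inclusive, here is a minimal sketch on a toy dataframe. It uses `Series.between`, which includes both endpoints by default, as an alternative way to write the mask; the dataframe `toy` is made up for this example.

```python
import pandas as pd

# Toy texts with lengths 9, 25, 2000 and 2001 characters.
toy = pd.DataFrame(
    {'cleaned_texts': ['too short', 'a' * 25, 'b' * 2000, 'c' * 2001]}
)

# Keep texts whose length lies in the closed interval [25, 2000].
lengths = toy['cleaned_texts'].str.len()
kept = toy.loc[lengths.between(25, 2000)]
kept_lengths = kept['cleaned_texts'].str.len().tolist()
print(kept_lengths)
```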

Check the result.

In [108]:
data.shape

(779478, 6)

In [109]:
print(data)

        post_id  parent_id  comment_id  \
0             1        NaN         NaN   
2             3        NaN         NaN   
3             4        NaN         NaN   
4             6        NaN         NaN   
5             7        NaN         NaN   
...         ...        ...         ...   
812127   279994        NaN    536471.0   
812128   279998        NaN    536439.0   
812129   279998        NaN    536514.0   
812130   279999        NaN    536802.0   
812131   279999        NaN    542550.0   

                                                     text category  \
0                           Eliciting priors from experts    title   
2       What are some valuable Statistical Analysis op...    title   
3       Assessing the significance of differences in d...    title   
4       The Two Cultures: statistics vs. machine learn...    title   
5                  Locating freely available data samples    title   
...                                          

In [111]:
data['cleaned_texts'].to_list()[0:100]

['eliciting priors from experts',
 'what are some valuable statistical analysis open source projects ?',
 'assessing the significance of differences in distributions',
 'the two cultures : statistics vs. machine learning ?',
 'locating freely available data samples',
 'so how many staticians it take to screw in a lightbulb ?',
 'under what conditions should likert scales be used as ordinal or interval data ?',
 'multivariate interpolation approaches',
 'forecasting demographic census',
 'bayesian and frequentist reasoning in plain english',
 'finding the pdf given the cdf',
 'tools for modeling financial time series',
 'what is a standard deviation ?',
 'testing random variate generation algorithms',
 'what is the meaning of p values and t values in statistical tests ?',
 'r packages for seasonality analysis',
 'examples for teaching : correlation does not mean causation',
 'number generation algorithms',
 'explain data visualization',
 'clustering of large , dataset',
 'pca on correla

Save the cleaned dataset as a CSV file.

In [113]:
data.to_csv('cleaned_dataset.csv', index=False)