# Feature Selection

This notebook helps to create the different feature selection patterns. We use Spacy to create a large document from the concatenated Training Dataset string. We are using Spacy because it is easy to filter out stuff using POS tags. 

In [1]:
from spacy.lang.en import English
import spacy
spacy.prefer_gpu()
from collections import Counter
import csv
import config
from tqdm import tqdm
import os
import nltk
from nltk.corpus import stopwords
import re
import string
import fnmatch
from src.config import config_io
from data_processing.preprocessing import get_dataset_text# get_word_index_list, write_index_to_file, read_index_as_string

At first we concatenate the paragraphs of the training dataset to create a very large string called **all_text**.
We can choose to save the file.

We are using Spacy to create the doc Object from the **all_text**. Now **all_text** is very large (about 196,000,000 characters or 200 million characters). Spacy has a max_length limit of 1,000,000 characters and we run into computational issues (kernel fails). As such, we must break down the document into smaller chunks. We saw that 4 million chars was manageable. 

Thus the idea was to create smaller documents of 4 million chars and iteratively pass them on to spacy. Spacy creates the document and we extract the tokens passed through desired filters. We could not take the naive approach to chop **all_text** into 4 M char pieces as it would lead to splitting up of terms.

To avoid it, we decided to make the end point of these documents flexible, i.e the document ends at either a period or a question mark or an exclamation only. If the character at position 4M is not a (.|?|!) then we continue to look forward or backward from that point until we enocunter the mentioned punctuations.
As a result we get slightly larger or smaller documents than 4M but no split words.

We note the start and end positions in a list of tuples called the **start_end**

In [2]:
def get_start_end(all_text):
    count = 0
    start_point = 0
    start_end = []
    if len (all_text)< 4000000:
        return [(0, len(all_text))]
    for i in range(4000000, len(all_text), 4000000):
        print(count, i, all_text[i], all_text[i-20:i+20])
        match = re.search("\.|\?|!", all_text[i-20:i+20])
        print(match)
        if match is not None:
            end_point = i-20+match.span()[0]
        else:
            j = 10
        while match is None: # if we dont find a match, keep increasing the scope of the search text until we find a match
            if i+j+20<len(all_text):
                match = re.search("\.|\?|!", all_text[i-j-20:i+j+20])
                print(j, match, all_text[i-j-20:i+j+20])
                if match is not None:
                    end_point = i-j-20+match.span()[0]
                j+=10
            else:
                end_point = i
                break
        print("\t end point:",end_point, all_text[end_point],"finished")
        start_end.append((start_point,end_point))
        start_point = end_point
        count+=1
    return start_end


Code to  view the different start and end points

In [3]:
#[(i,j,j-i) for i,j in start_end]

The function **get_spacy_doc** creates and returns an NLP document with the model **en_core_web_sm**

In [4]:
def get_spacy_doc(text):
    #nlp = English()
    #tkr = nlp.Defaults.create_tokenizer(nlp)
    #tkr(text.lower())
    nlp = spacy.load("en_core_web_sm") #English()
    nlp.max_length = 540000000 # needs to adjust
    doc = nlp(text.lower()) 
    return doc

Next, we iterate over **all_text** chunk by chunk according to the **start_end** list. For each iteration we create a spacy doc using the **get_spacy_doc** function. The filter condition is used while creating the **token_list** from the doc. The terms are updated in the **term_dict**.
We use **term_dict** to store the terms as keys and their respective counts as values in an ordered manner. 
**stopword_nltk** contains a set of stopwords from the nltk library. We compared the list of stopwords in nltk(179) and spacy(326). We found that nltk was a better choice as Spacy had too many words. We did not want to restrict so many words.


We can check the most common 50 terms using the code below

In [5]:
'''term_dict_mod1.most_common(50)'''

'term_dict_mod1.most_common(50)'

In [6]:
def write_vocab(term_dict, folder):
    feature_set_size = [50, 100, 150, 200, 250,300,350,400,450,500]
    for i in feature_set_size:
        #folder = '/home/sukanya/PycharmProjects/SiameseNetworkTensorflow/src/baseline_with_Spacy/word_index/opt4/'
        path = folder + str(i)+"word_list.txt"
        mf_token_list = term_dict.most_common(i)
        print(mf_token_list)
        word_index_str = " ".join([k for k,v in mf_token_list])
        with open(path, "w") as file:
            file.write(word_index_str)

In [14]:
# combining all steps in one function
def create_vocab(training_file_path, validation_file_path, vocab_path_list):
    all_text = get_dataset_text(training_file_path,  pattern = "*") + get_dataset_text(validation_file_path,  pattern = "*")
    print("all_text len:", len(all_text))
    start_end = get_start_end(all_text)
    print(start_end)
    term_dict_mod1 = Counter()
    term_dict_mod2 = Counter()
    term_dict_mod3 = Counter()
    term_dict_mod4 = Counter()
    stopword_nltk = set(stopwords.words('english'))
    
    for start,end in start_end:
        print("\n\n", start, end, end/len(all_text)) # all_text[start: end]
        doc = get_spacy_doc(all_text[start: end])
        #Model1 (stopwords = Yes, punct = No ), check for period again after filtering
        token_list_mod1 = [token.text for token in doc if token.pos_!="PUNCT"]

        #Model2 (stopwords = Yes, punct = Yes)
        token_list_mod2 = [token.text for token in doc if token.pos_!="PUNCT" or (token.pos_=="PUNCT"and (token.text=="." or token.text==","))]

        #Model3 (stopwords = No, punct = No), check for period again after filtering
        token_list_mod3 = [token.text for token in doc if token.pos_!="PUNCT" and token.text not in stopword_nltk]

        #Model4 (stopwords = No, punct = Yes)
        token_list_mod4 = [token.text for token in doc if (token.pos_!="PUNCT" or (token.pos_=="PUNCT"and (token.text=="." or token.text==","))) and token.text not in stopword_nltk]

        term_dict_mod1.update(token_list_mod1)
        term_dict_mod2.update(token_list_mod2)
        term_dict_mod3.update(token_list_mod3)
        term_dict_mod4.update(token_list_mod4)
        print("mod1 term dict len:", len(term_dict_mod1))
        print("mod2 term dict len:", len(term_dict_mod2))
        print("mod3 term dict len:", len(term_dict_mod3))
        print("mod4 term dict len:", len(term_dict_mod4))
    #Check and filter out periods in mod1 and mod3
    
    del term_dict_mod1['.']
    del term_dict_mod3['.']
    for vocab_path ,term_dict in zip(vocab_path_list, [term_dict_mod1, term_dict_mod2, term_dict_mod3, term_dict_mod4]):
        if not os.path.exists(vocab_path):
            os.makedirs(vocab_path)
        write_vocab(term_dict, vocab_path)
    print("Writing completed!")

In [8]:
'''vocab_path_list = ['/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt1/',
                  '/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt2/',
                  '/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt3/',
                  '/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt4/']
create_vocab(config.config_io.get('pan_20_processed_train_narrow'), vocab_path_list)'''

"vocab_path_list = ['/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt1/',\n                  '/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt2/',\n                  '/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt3/',\n                  '/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/vocab/opt4/']\ncreate_vocab(config.config_io.get('pan_20_processed_train_narrow'), vocab_path_list)"

In [15]:
training_file_path = config_io['pan_21_processed_train']
validation_file_path = config_io['pan_21_processed_test'] # these were validation files actually which were used for testing
vocab_base_folder_list = ['/home/sukanya/PhD/Results/012_28_Apr_PAN_21_dataset/baseline']

for vocab_base_folder in vocab_base_folder_list:
    print(training_file_path)
    vocab_path_list = [vocab_base_folder+"/"+opt for opt in ['opt1/', 'opt2/', 'opt3/', 'opt4/']]
    create_vocab(training_file_path,validation_file_path, vocab_path_list)

/home/sukanya/PhD/Datasets/PAN SCD/pan21-style-change-detection/processed/train.csv
/home/sukanya/PhD/Datasets/PAN SCD/pan21-style-change-detection/processed/train.csv
/home/sukanya/PhD/Datasets/PAN SCD/pan21-style-change-detection/processed/test.csv
all_text len: 40211155
0 4000000 i for me reliably in diverse scenarios, al
None
10 None as worked for me reliably in diverse scenarios, although I s
20 <re.Match object; span=(7, 8), match='.'> 't work. Has worked for me reliably in diverse scenarios, although I should note
	 end point: 3999967 . finished
1 8000000 s ere are several users with a large datab
None
10 None ver, if there are several users with a large database team a
20 <re.Match object; span=(3, 4), match='.'> ons.  However, if there are several users with a large database team and sensiti
	 end point: 7999963 . finished
2 12000000 o et to 1, and error code is set to 0. we 
<re.Match object; span=(35, 36), match='.'>
	 end point: 12000015 . finished
3 16000000 t ee different

[(' ', 67062), ("'s", 38447), ("n't", 38406), ('use', 26692), ('would', 26055), ('/', 25908), ('server', 25421), ('one', 24123), ('like', 18878), ('using', 18756), ('need', 17425), ('also', 15950), ('get', 14914), ('windows', 14820), ('data', 14805), ('want', 14167), ('time', 13774), ('could', 13687), ('file', 13681), ('way', 12667), ('system', 11227), ('network', 10978), ('make', 10966), ('work', 10657), ("'m", 10539), ('may', 10382), ('problem', 10122), ("'re", 9922), ('see', 9848), ('set', 9756), ('files', 9595), ('run', 9345), ('new', 9328), ("'ve", 9203), ('ip', 8843), ('-', 8791), ('used', 8777), ('different', 8736), ('even', 8650), ('user', 8504), ('know', 8438), ('might', 8231), ('running', 8095), ('first', 8087), ('two', 7903), ('something', 7761), ('access', 7738), ('drive', 7380), ('much', 7222), ('case', 7178), ('code', 7120), ('well', 7059), ('another', 6960), ('good', 6947), ('able', 6764), ('machine', 6691), ('find', 6674), ('really', 6600), ('sure', 6582), ('try', 6460)

Anything beyond this point is old code that is not required.

In [23]:
training_files = [config_io['pan_21_processed_train']]

vocab_base_folder_list = ['/home/sukanya/PhD/Results/012_28_Apr_PAN_21_dataset/baseline']


for training_file_path,vocab_base_folder in zip(training_files, vocab_base_folder_list):
    print(training_file_path)
    vocab_path_list = [vocab_base_folder+"/"+opt for opt in ['opt1/', 'opt2/', 'opt3/', 'opt4/']]
    create_vocab(training_file_path, vocab_path_list)

/home/sukanya/PhD/Datasets/PAN SCD/pan21-style-change-detection/processed/train.csv
/home/sukanya/PhD/Datasets/PAN SCD/pan21-style-change-detection/processed/train.csv
all_text len: 33110363
0 4000000 i for me reliably in diverse scenarios, al
None
10 None as worked for me reliably in diverse scenarios, although I s
20 <re.Match object; span=(7, 8), match='.'> 't work. Has worked for me reliably in diverse scenarios, although I should note
	 end point: 3999967 . finished
1 8000000 s ere are several users with a large datab
None
10 None ver, if there are several users with a large database team a
20 <re.Match object; span=(3, 4), match='.'> ons.  However, if there are several users with a large database team and sensiti
	 end point: 7999963 . finished
2 12000000 o et to 1, and error code is set to 0. we 
<re.Match object; span=(35, 36), match='.'>
	 end point: 12000015 . finished
3 16000000 t ee different Azure Storage blob containe
None
10 None I have three different Azure Storage blob

[('.', 252557), (',', 228021), (' ', 53602), ("n't", 30754), ("'s", 30413), ('use', 21328), ('would', 20930), ('/', 20865), ('server', 20461), ('one', 19287), ('like', 15100), ('using', 14958), ('need', 13970), ('also', 12680), ('get', 11872), ('windows', 11812), ('data', 11746), ('want', 11274), ('could', 10997), ('file', 10927), ('time', 10870), ('way', 10206), ('system', 8990), ('network', 8787), ('make', 8767), ('work', 8447), ("'m", 8408), ('may', 8380), ('problem', 8136), ('see', 7828), ("'re", 7816), ('set', 7816), ('files', 7635), ('new', 7518), ('run', 7474), ("'ve", 7318), ('different', 7101), ('-', 7076), ('ip', 7015), ('used', 6993), ('even', 6909), ('user', 6875), ('know', 6708), ('might', 6595), ('running', 6509), ('first', 6425), ('two', 6336), ('access', 6225), ('something', 6177), ('drive', 5798)]
[('.', 252557), (',', 228021), (' ', 53602), ("n't", 30754), ("'s", 30413), ('use', 21328), ('would', 20930), ('/', 20865), ('server', 20461), ('one', 19287), ('like', 15100)

In [11]:
for training_file_path,vocab_base_folder in zip(training_files, vocab_base_folder_list):
    print (training_file_path, vocab_base_folder)

/home/sukanya/PhD/Datasets/PAN SCD/pan21-style-change-detection/processed/train.csv /home/sukanya/PhD/Results/011_16_Apr_PAN_21_dataset/vocab/


In [19]:
training_files[-1:]

['/home/sukanya/PhD/Results/010_22_Feb_PAN_dataset_MPL_based/datasets/training_2000.csv']

Compare the saved models for 500 vocab sizes for respective models

In [3]:
mod1_pan = "/home/sukanya/PycharmProjects/SiameseNNPAN/src/word_index/opt1/500word_list.txt"
mod2_pan = "/home/sukanya/PycharmProjects/SiameseNNPAN/src/word_index/opt2/500word_list.txt"
mod3_pan = "/home/sukanya/PycharmProjects/SiameseNNPAN/src/word_index/opt3/500word_list.txt"
mod4_pan = "/home/sukanya/PycharmProjects/SiameseNNPAN/src/word_index/opt4/500word_list.txt"
mod1_koppel = "/home/sukanya/PycharmProjects/SiameseNetworkTensorflow/src/baseline_with_Spacy/word_index/opt1/500word_list.txt"
mod2_koppel = "/home/sukanya/PycharmProjects/SiameseNetworkTensorflow/src/baseline_with_Spacy/word_index/opt2/500word_list.txt"
mod3_koppel = "/home/sukanya/PycharmProjects/SiameseNetworkTensorflow/src/baseline_with_Spacy/word_index/opt3/500word_list.txt"
mod4_koppel = "/home/sukanya/PycharmProjects/SiameseNetworkTensorflow/src/baseline_with_Spacy/word_index/opt4/500word_list.txt"

In [13]:
def read_file(path):
    with open(path) as f:
        data = f.read()
    return data
text_mod1_pan = read_file(mod1_pan)
text_mod2_pan = read_file(mod2_pan)
text_mod3_pan = read_file(mod3_pan)
text_mod4_pan = read_file(mod4_pan)
text_mod1_koppel = read_file(mod1_koppel)
text_mod2_koppel = read_file(mod2_koppel)
text_mod3_koppel = read_file(mod3_koppel)
text_mod4_koppel = read_file(mod4_koppel)


In [26]:
def get_spacy_tokens(text):
    #nlp = English()
    #tkr = nlp.Defaults.create_tokenizer(nlp)
    #tkr(text.lower())
    nlp = spacy.load("en_core_web_sm") #English()
    nlp.max_length = 540000000 # needs to adjust
    doc = nlp(text.lower()) 
    return [token.text for token in doc]

tokens_mod1_pan = set(get_spacy_tokens(text_mod1_pan))
tokens_mod2_pan = set(get_spacy_tokens(text_mod2_pan))
tokens_mod3_pan = set(get_spacy_tokens(text_mod3_pan))
tokens_mod4_pan = set(get_spacy_tokens(text_mod4_pan))
tokens_mod1_koppel = set(get_spacy_tokens(text_mod1_koppel))
tokens_mod2_koppel = set(get_spacy_tokens(text_mod2_koppel))
tokens_mod3_koppel = set(get_spacy_tokens(text_mod3_koppel))
tokens_mod4_koppel = set(get_spacy_tokens(text_mod4_koppel))

In [38]:
def set1_common_set2(set1, set2):
    common = set1.intersection(set2)
    print("Number of common tokens", len(common))
    print("percentage of common", len(common)/len(set1))
    print("len of set1", len(set1))
    print("len of set2", len(set2))
    
set1_common_set2(tokens_mod4_pan, tokens_mod4_koppel)

Number of common tokens 263
percentage of common 0.526
len of set1 500
len of set2 500


In [28]:
len(tokens_mod2_pan.intersection(tokens_mod2_koppel))

261

In [29]:
len(tokens_mod3_pan)

500

In [30]:
len(tokens_mod3_koppel)

500