# Preprocessing Steps + Text Representation

In this notebook we create a CSV file containing the output of our preprocessing steps: removing stop words and punctuation, and lemmatizing the remaining tokens. We also add a word-vector representation of each comment, computed with spaCy's pre-trained models. We store these intermediate results so that the time and resources they require are spent only once, rather than every time we try a different model.

### Setup and data preparation

In [48]:
import pandas as pd
import spacy
from tqdm import tqdm

In [49]:
# register tqdm with pandas so that progress_apply shows a progress bar for DataFrame operations
tqdm.pandas()

In [34]:
# the data is the train.csv file from the Kaggle competition
data = pd.read_csv('data/train.csv')

In [35]:
# keep only the comment text and the target column, which holds the toxicity score
df_reduced = data[['comment_text','target']]
print(f'No of Columns : {len(df_reduced.columns)}')
print(f'No of Rows : {len(df_reduced)}')

No of Columns : 2
No of Rows : 1804874


In [36]:
# create a label column that is 1 when the toxicity score is at least 0.5 and 0 otherwise
df_reduced['toxic'] = (df_reduced['target'] >= 0.5).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_reduced['toxic'] = (df_reduced['target'] >= 0.5).astype(int)
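
The warning above appears because `df_reduced` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one common way to avoid it, assuming an explicit copy is acceptable memory-wise (the toy values below are invented for illustration):

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up.
data = pd.DataFrame({
    "comment_text": ["nice post", "you are awful", "thanks"],
    "target": [0.0, 0.8, 0.5],
    "other": [1, 2, 3],
})

# Taking an explicit copy makes df_reduced an independent DataFrame,
# so adding a column no longer triggers SettingWithCopyWarning.
df_reduced = data[["comment_text", "target"]].copy()
df_reduced["toxic"] = (df_reduced["target"] >= 0.5).astype(int)

print(df_reduced)
```

The original `data` frame is left untouched; only the copy gains the `toxic` column.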


In [37]:
# the toxicity score is no longer needed; keep only the label
df = df_reduced.drop('target', axis=1)

In [38]:
#check for any null values
df.isnull().sum()

comment_text    3
toxic           0
dtype: int64

In [39]:
# there are 3 missing values, which the following line removes
df_clean = df.dropna()

In [40]:
#double check for NAs 
df_clean.isnull().sum()

comment_text    0
toxic           0
dtype: int64

In [41]:
# we end up with roughly 1.8M rows
df_clean.shape

(1804871, 2)

# Preprocessing

### Preprocessing function: remove stop words and punctuation, convert to lemmas

In [43]:
# Load the small English language model and create the nlp object from it
nlp = spacy.load('en_core_web_sm')

In [44]:
# Preprocessing function
def preprocess(text):
    """Return the lemmas of `text`, skipping stop words and punctuation."""
    doc = nlp(text)

    filtered_tokens = []

    for token in doc:
        # skip stop words and punctuation
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    # join the list of lemmas into a single space-separated string
    return " ".join(filtered_tokens)

In [45]:
df = df_clean.copy()

In [96]:
# the line below applies the function above to all 1.8M rows. Be patient when running it!
#df['stopwords_punct_lemma'] = df['comment_text'].progress_apply(lambda text: preprocess(text))

In [97]:
# the line below was used to temporarily store the dataframe containing the preprocessed comment text

#df.to_csv('data/all_data_pp.csv', index=False)


### Vectorization

In [55]:
# the line below loads the preprocessed data from the temporary file (note: the cells that follow operate on df, not df1)
df1 = pd.read_csv('data/all_data_pp.csv')
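
One thing to watch when reloading this file: `preprocess` returns an empty string for a comment made up entirely of stop words and punctuation, and pandas reads empty CSV fields back as `NaN` by default. Passing `NaN` to `nlp` would likely raise an error. The sketch below reproduces the round trip on an in-memory buffer with invented values and fills the gaps with empty strings; whether to fill or drop such rows is a judgment call not made in this notebook.

```python
import io

import pandas as pd

# Toy frame: the first comment is assumed to have been reduced to an empty string.
demo = pd.DataFrame({
    "comment_text": ["This is it.", "Great idea!"],
    "stopwords_punct_lemma": ["", "great idea"],
})

# Round trip through CSV using an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The empty string has come back as NaN.
n_missing = int(restored["stopwords_punct_lemma"].isna().sum())
print(f"Missing after reload: {n_missing}")

# Restore empty strings so that every cell is a str again.
restored["stopwords_punct_lemma"] = restored["stopwords_punct_lemma"].fillna("")
```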

In [59]:
# we use spaCy's large model to represent each preprocessed text as a vector; this runs over all 1.8M rows again, so be careful!
nlp = spacy.load("en_core_web_lg") # we load the model
df['vector_spacy'] = df['stopwords_punct_lemma'].progress_apply(lambda text: nlp(text).vector)

100%|██████████| 1804871/1804871 [3:26:37<00:00, 145.58it/s]  
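
For reference, the spaCy documentation describes `Doc.vector` as defaulting to the average of the token vectors, so each comment ends up as a single fixed-length vector regardless of how many tokens it has. The NumPy sketch below illustrates that averaging with two invented 3-dimensional token vectors; the real model's vectors are much longer.

```python
import numpy as np

# Two made-up token vectors, one row per token.
token_vectors = np.array(
    [
        [1.0, 0.0, 2.0],
        [3.0, 2.0, 0.0],
    ],
    dtype=np.float32,
)

# Averaging over the token axis gives one vector for the whole text.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)
```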


In [67]:
# this line was used to temporarily save the data with the vectors and the preprocessing results

#df.to_csv('data/all_data_pp.csv', index=False)
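
A caveat when saving vectors this way: `to_csv` writes each NumPy array in `vector_spacy` as its printed string form, so after `read_csv` the column holds strings rather than arrays. The sketch below shows one possible way to parse them back; `parse_vector` is a hypothetical helper, not part of this notebook, and it assumes the arrays are short enough that NumPy does not abbreviate them with `...` when printing.

```python
import io

import numpy as np
import pandas as pd


def parse_vector(cell: str) -> np.ndarray:
    """Hypothetical helper: turn a string like '[ 0.5  1.  -2.25]' back into an array."""
    return np.array([float(x) for x in cell.strip("[]").split()], dtype=np.float32)


# Toy frame with invented vectors.
demo = pd.DataFrame({
    "toxic": [0, 1],
    "vector_spacy": [
        np.array([0.5, 1.0, -2.25], dtype=np.float32),
        np.array([1.5, -0.5, 0.0], dtype=np.float32),
    ],
})

# Round trip through CSV using an in-memory buffer.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The column now contains plain strings; parse each one back into an array.
restored["vector_spacy"] = restored["vector_spacy"].apply(parse_vector)
```

A binary format that preserves arrays would avoid the parsing step altogether, at the cost of a file that is no longer plain text.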

## Merge with the output of the POS-tagging preprocessing

This part merges the preprocessing results above with the output of the POS-tagging step, which was run in a separate workflow. That data is stored in a CSV file called pos_data.csv.

In [81]:
df_pos = pd.read_csv('data/pos_data.csv')

In [82]:
# the comment_text and toxic columns are in both datasets, so we remove them from one of them
df_pos = df_pos.drop(['comment_text','toxic'],axis=1)


In [83]:
df_pos

Unnamed: 0,pos_tags,pos_tags_str
0,"[('This', 'DT'), ('is', 'VBZ'), ('so', 'RB'), ...","DT VBZ RB JJ . PRP VBZ IN , FW PRP VBP PRP$ NN..."
1,"[('Thank', 'NNP'), ('you', 'PRP'), ('!', '.'),...",NNP PRP . . DT MD VB PRP$ NN DT NN JJR JJ . VB...
2,"[('This', 'DT'), ('is', 'VBZ'), ('such', 'JJ')...",DT VBZ JJ DT JJ NN NN : VB TO PRP IN VBG PRP I...
3,"[('Is', 'VBZ'), ('this', 'DT'), ('something', ...",VBZ DT NN PRP MD VB JJ TO VB IN PRP$ NN . WRB ...
4,"[('haha', 'NN'), ('you', 'PRP'), ('guys', 'NNS...",NN PRP NNS VBP DT NN IN NNS .
...,...,...
1804866,"[('Maybe', 'RB'), ('the', 'DT'), ('tax', 'NN')...",RB DT NN IN `` NNS '' MD VB VBN WRB DT NN VBZ ...
1804867,"[('What', 'WP'), ('do', 'VBP'), ('you', 'PRP')...",WP VBP PRP VB NNS WP VBP VBP DT NN VBD DT NN I...
1804868,"[('thank', 'NN'), ('you', 'PRP'), (',', ','), ...","NN PRP , , , RB CC JJ , , , VBP VBP VBG PRP$ NN"
1804869,"[('Anyone', 'NN'), ('who', 'WP'), ('is', 'VBZ'...","NN WP VBZ VBN IN VBG DT JJ NN , RB IN JJ , MD ..."


In [86]:

# Both dataframes must share the same index to be merged, so we reset the index of df and discard the old one.
df_to_merge = df.reset_index(drop=True)

In [93]:
# merge both dataframes on their index
merged_pp_df = pd.merge(df_to_merge, df_pos, left_index=True, right_index=True)
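
Because the merge above aligns rows purely by index, it relies on both frames having the same number of rows in the same order. A minimal sketch of how that assumption could be checked before merging, using toy frames with invented values (the `validate` argument is an optional safeguard that was not used in this notebook):

```python
import pandas as pd

# Toy frames; the gaps in the left index mimic rows removed by dropna().
left = pd.DataFrame(
    {"comment_text": ["a b", "c d", "e f"], "toxic": [0, 1, 0]},
    index=[0, 2, 5],
)
right = pd.DataFrame({"pos_tags_str": ["DT NN", "NN VB", "JJ NN"]})

# Renumber the rows 0..n-1 and discard the old index in one step.
left = left.reset_index(drop=True)

# With the default inner join, a row-count mismatch would silently drop rows.
assert len(left) == len(right), "row counts differ"

# validate="one_to_one" makes pandas raise if either index has duplicates.
merged = pd.merge(
    left, right, left_index=True, right_index=True, validate="one_to_one"
)
print(merged)
```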

In [94]:
merged_pp_df

Unnamed: 0,comment_text,toxic,stopwords_punct_lemma,vector_spacy,pos_tags,pos_tags_str
0,"This is so cool. It's like, 'would you want yo...",0,cool like want mother read great idea,"[0.57358134, 0.40742856, -2.652657, -2.6345057...","[('This', 'DT'), ('is', 'VBZ'), ('so', 'RB'), ...","DT VBZ RB JJ . PRP VBZ IN , FW PRP VBP PRP$ NN..."
1,Thank you!! This would make my life a lot less...,0,thank life lot anxiety inducing let way,"[2.3985057, 0.08947158, -3.6875572, -0.9417053...","[('Thank', 'NNP'), ('you', 'PRP'), ('!', '.'),...",NNP PRP . . DT MD VB PRP$ NN DT NN JJR JJ . VB...
2,This is such an urgent design problem; kudos t...,0,urgent design problem kudo take impressive,"[0.9049366, 1.0650175, -1.8506068, -0.6678533,...","[('This', 'DT'), ('is', 'VBZ'), ('such', 'JJ')...",DT VBZ JJ DT JJ NN NN : VB TO PRP IN VBG PRP I...
3,Is this something I'll be able to install on m...,0,able install site release,"[2.15365, 0.84712, -1.303075, -1.1850657, 1.32...","[('Is', 'VBZ'), ('this', 'DT'), ('something', ...",VBZ DT NN PRP MD VB JJ TO VB IN PRP$ NN . WRB ...
4,haha you guys are a bunch of losers.,1,haha guy bunch loser,"[-1.30565, -1.2035375, -1.544195, -0.53491247,...","[('haha', 'NN'), ('you', 'PRP'), ('guys', 'NNS...",NN PRP NNS VBP DT NN IN NNS .
...,...,...,...,...,...,...
1804866,"Maybe the tax on ""things"" would be collected w...",0,maybe tax thing collect product import registe...,"[-1.74022, -1.2526344, -1.8042428, -0.02307643...","[('Maybe', 'RB'), ('the', 'DT'), ('tax', 'NN')...",RB DT NN IN `` NNS '' MD VB VBN WRB DT NN VBZ ...
1804867,What do you call people who STILL think the di...,0,people think divine role creation,"[-1.804096, 1.24632, -1.8707399, -1.5715679, 2...","[('What', 'WP'), ('do', 'VBP'), ('you', 'PRP')...",WP VBP PRP VB NNS WP VBP VBP DT NN VBD DT NN I...
1804868,"thank you ,,,right or wrong,,, i am following ...",0,thank right wrong follow advice,"[1.060516, 0.04716005, -1.3738799, -1.42328, 0...","[('thank', 'NN'), ('you', 'PRP'), (',', ','), ...","NN PRP , , , RB CC JJ , , , VBP VBP VBG PRP$ NN"
1804869,Anyone who is quoted as having the following e...,1,quote have follow exchange apocryphal receive ...,"[-0.5423877, -0.25980785, -0.28140348, -0.7341...","[('Anyone', 'NN'), ('who', 'WP'), ('is', 'VBZ'...","NN WP VBZ VBN IN VBG DT JJ NN , RB IN JJ , MD ..."


In [95]:
# the line below is used to temporarily save the merged data in a CSV file

#merged_pp_df.to_csv('data/merged_pp_df.csv', index=False)