## Text preprocessing

This notebook covers text preprocessing of the comment bodies: lower-casing, punctuation removal, tokenization, stop-word removal, and lemmatization. These steps are required before running TF-IDF.

### Set-up

In [1]:
import pickle

import numpy as np
import pandas as pd

import string
import re

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\s1027177\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\s1027177\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Set your repository path here:

In [2]:
repsource = "C:/Users/s1027177/OneDrive - Syngenta/Documents/FOAD/au_secours/"

In [6]:
# Load the pickled DataFrame; the with-block closes the file automatically
with open(repsource + "new_df", "rb") as file:
    new_df = pickle.load(file)

In [11]:
new_df.columns

Index(['ups', 'link_id', 'name', 'id', 'author', 'body', 'parent_id',
       'popularity', 'day_week', 'monday', 'tuesday', 'wednesday', 'thursday',
       'friday', 'saturday', 'sunday', 'hour', 'ups_mean', 'active_user1',
       'active_user2', 'active_user3', 'active_user4', 'top_comment',
       'rank_comment', 'parent_ups', 'seconds_after_parent', 'parent_seg1',
       'parent_seg2', 'parent_seg3', 'parent_seg4', 'parent_ups_mean',
       'positive_com', 'neutral_com', 'negative_com'],
      dtype='object')

### Text Preprocessing

This part focuses on the textual data.
On `body`:
* Cleaning: lower-casing, punctuation removal, stop-word removal, lemmatization
* Tokenization: splitting comments into separate words
* Modelling: a simple TF-IDF first, with Word2vec as the next step
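
The cleaning and tokenization steps listed above can be sketched on a single string with the standard library alone. This is a minimal illustration, not the notebook's implementation: `clean_comment` and the tiny `TOY_STOP_WORDS` set are hypothetical names introduced here, the punctuation rule is a simplified one, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list)
TOY_STOP_WORDS = {"the", "is", "a", "this", "and"}

def clean_comment(text):
    """Lower-case, strip punctuation, tokenize and drop stop words."""
    text = text.lower()
    # Simplified rule: replace every character that is not a letter or digit with a space
    text = re.sub(r"[^a-z0-9]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

print(clean_comment("This is a GREAT comment, and the\nbest one!"))
```

Working on one string at a time keeps the order of the steps easy to follow; the notebook applies the same sequence column-wise on the DataFrame.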

In [12]:
def convert_text_to_lowercase(df, colname):
    df[colname] = df[colname].str.lower()
    return df
    
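def not_regex(pattern):
    # Match any single character at a position where `pattern` does not match
    return r"((?!{}).)".format(pattern)

def remove_punctuation(df, colname):
    df[colname] = df[colname].str.replace('\n', ' ', regex=False)
    df[colname] = df[colname].str.replace('\r', ' ', regex=False)
    # Keep letters, digits, and hyphens or slashes that sit between two word characters
    alphanumeric_characters_extended = '(\\b[-/]\\b|[a-zA-Z0-9])'
    # regex=True is needed: pandas >= 2.0 treats the pattern as a literal string by default
    df[colname] = df[colname].str.replace(not_regex(alphanumeric_characters_extended), ' ', regex=True)
    return df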

def tokenize_sentence(df, colname):
    df[colname] = df[colname].str.split()
    return df

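def remove_stop_words(df, colname):
    # A set gives constant-time membership tests, which matters on millions of comments
    stop_words = set(stopwords.words('english'))
    stop_words.update(['deleted', 'http'])
    for idx, val in df[colname].items():
        if isinstance(val, list):
            df.at[idx, colname] = [word for word in val if word not in stop_words]
        else:
            # Missing comments (e.g. NaN) become a single blank token
            df.at[idx, colname] = [' ']
    return df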

def lemmatize_words(df, colname):
    lemmatizer = WordNetLemmatizer()
    # lemmatize() defaults to the noun part of speech, so verb forms such as 'running' are not reduced
    df[colname] = df[colname].apply(lambda com: [lemmatizer.lemmatize(word) for word in com])
    return df

def reverse_tokenize_sentence(df, colname):
    df[colname] = df[colname].map(lambda word: ' '.join(word))
    return df


def text_cleaning(df, colname):
    """
    Cleans the text column `colname` of `df` (modified in place) and returns `df`:
    1. Convert text to lowercase
    2. Remove punctuation and newline characters
    3. Tokenize comments into words
    4. Remove all stop words
    5. Lemmatize the remaining words
    6. Join the tokens back into a single string
    """
    df = (
      df
      .pipe(convert_text_to_lowercase, colname)
      .pipe(remove_punctuation, colname)
      .pipe(tokenize_sentence, colname)
      .pipe(remove_stop_words, colname)
      .pipe(lemmatize_words, colname)
      .pipe(reverse_tokenize_sentence, colname)
    )
    return df
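
To see what the pattern built by `not_regex` keeps and discards, the same expression can be tried on a plain string with `re.sub`. This is a sketch for illustration only: the sample sentence is made up, and `re.sub` stands in for the pandas `str.replace` call used in `remove_punctuation`.

```python
import re

def not_regex(pattern):
    # Match any single character at a position where `pattern` does not match
    return r"((?!{}).)".format(pattern)

alphanumeric_characters_extended = '(\\b[-/]\\b|[a-zA-Z0-9])'
pattern = not_regex(alphanumeric_characters_extended)

# Made-up sample with inner hyphens, a slash, a free-standing "--" and punctuation
sample = "state-of-the-art, and/or -- done!"
cleaned = re.sub(pattern, " ", sample)
tokens = cleaned.split()
print(tokens)
```

Hyphens and slashes survive only when they sit between two word characters, so compound words stay whole while a free-standing `--` is replaced by spaces. An underscore is also replaced, because it is not part of `[a-zA-Z0-9]`.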

In [13]:
df_cleaned = text_cleaning(new_df, 'body')


In [7]:
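# text_cleaning modified new_df in place, so df_cleaned is the same object;
# this only removes the extra name and does not free any memory
del new_df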

In [14]:
df_cleaned.shape

(4234970, 34)

In [15]:
df_cleaned.ups.value_counts()

1.0       1676837
2.0        575082
3.0        219060
0.0        153372
4.0         75130
           ...   
2193.0          1
2942.0          1
3962.0          1
4559.0          1
3006.0          1
Name: ups, Length: 4023, dtype: int64

Before moving on to TF-IDF, we add a column holding the length, in characters, of each cleaned comment. It serves as an additional feature for the analysis.

In [16]:
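# Character count of the cleaned comment. 'deleted' is already dropped as a stop word
# during cleaning, so the comparisons below act only as a safeguard.
df_cleaned['comment_length'] = df_cleaned['body'].apply(
    lambda com: len(com) if pd.notnull(com) and com not in ('deleted', '[deleted]') else 0
)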
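
As a small illustration of what `comment_length` measures, the sketch below contrasts a character count with a token count on a cleaned string. `char_length` and `token_length` are hypothetical helpers, not part of the notebook. Unlike the cell above, `char_length` strips whitespace first, on the assumption that a blank placeholder such as `' '` (what the cleaning step appears to produce for a missing comment) should count as empty.

```python
def char_length(comment):
    """Number of characters in a cleaned comment, ignoring surrounding blanks."""
    return len(comment.strip()) if isinstance(comment, str) else 0

def token_length(comment):
    """Number of whitespace-separated tokens in a cleaned comment."""
    return len(comment.split()) if isinstance(comment, str) else 0

# A cleaned comment, a blank placeholder and a missing value
for comment in ["great comment best one", " ", None]:
    print(repr(comment), char_length(comment), token_length(comment))
```

Which of the two counts is more useful as a feature depends on the later analysis; the notebook itself stores the character count.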

Save the new dataset, with `body` cleaned and the new column `comment_length`.

In [17]:
# Write the cleaned DataFrame to disk; the with-block closes the file automatically
with open(repsource + "df_body_cleaned", "wb") as file1:
    pickle.dump(df_cleaned, file1)
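
A quick way to check that a saved file can be read back is a round trip through `pickle`. The sketch below is an assumption-laden stand-in: it uses a plain dictionary in place of the DataFrame and a temporary directory in place of `repsource`, so it runs without the real data.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the cleaned DataFrame; any picklable object behaves the same way
data = {"body": ["great comment", "best one"], "comment_length": [13, 8]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "df_body_cleaned"
    # The with-block closes the file even if pickling raises an error
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == data)
```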