# Text Pre-processing

In this notebook, the data is pre-processed:

* Section 1 - pre-processing
* Section 2 - text cleaning processes demonstrated on a toy dataset
* Section 3 - text cleaning against sampled train, validation and test
* Section 4 - text cleaning against full review dataset
* Section 5 - text cleaning against dataset exploded to individual sentences

Datasets required to run the notebook:

* sampled_data.csv
* all_reviews.csv

Processed data saved to:

* cleanedsampletext.csv   - cleaned sample text for model
* fulldatasetcleaned.csv  - full review dataset cleaned
* explodedsentencescleaned.csv - full reviews exploded to sentences and cleaned

## Import Libraries and Data

In [1]:
# Install language_check - note pyahocorasick had to also be installed using --add channels conda-forge and
# conda install pyahocorasick. Java also installed in the path.
# ! pip install --upgrade language-check

In [2]:
#! pip install contractions
#! pip install pyspellchecker 
#! pip install autocorrect
#!pip install Gensim
#! conda update pandas

In [1]:
import pandas as pd
import numpy as np
import pickle
import re
import contractions
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import RegexpTokenizer, word_tokenize,sent_tokenize
import string
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from scipy import stats
#from deepsegment import DeepSegment
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\imoge\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\imoge\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\imoge\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [4]:
# Read in balanced datasets from Post Attributes Base notebook
df = pd.read_csv("sampled_data.csv",index_col = 0)

In [5]:
df.head(2)

Unnamed: 0,Name,Category,Town,Type,Contributions,Title,Review,Rating,Date,LocCode,Cuisine,Score
0,The Spur,Food,Arundel,Pub/Bar,14,Very disappointing,Three of us ate on a quiet night. First of all...,2,4.0,2.0,British,1
1,Inglenook,Accommodation,Bognor,Hotel,10,Amazing place!!!,We had a lovely stay at the Inklenook ..room w...,5,1.0,1.0,0,0


# Section 1: Preprocessing the dataframe

Processing tasks:
* join title and review text together into one column
* set bad ratings of 1&2 to 1 and good of 4&5 to 0 and drop those rated 3
* split reviews in dataframes for each category
* sample 'good' reviews to match number of 'bad' to create balanced datasets
* split into train, validation and testing sets for each category stratifying the y values so the same proportions appear
  in each of train,val and test sets.
* recombine the category feature dataframes to create combined feature dataframes for training, validation and test
* concat the dataframes to create 3 balanced final dataframes with features and rating for accom, food and attractions
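
The labelling and balancing tasks in the list above can be sketched in plain Python. This is only an illustration under assumptions: the toy records, the helper names `to_score` and `label_and_balance`, and the fixed seed are hypothetical, and the balanced data actually used here is read in ready-made from an earlier notebook.

```python
import random

def to_score(rating):
    """Map a 1-5 star rating to the binary target: 1 = bad, 0 = good, None = dropped."""
    if rating in (1, 2):
        return 1
    if rating in (4, 5):
        return 0
    return None  # 3-star reviews are discarded

def label_and_balance(reviews, seed=0):
    """Label reviews, drop 3-star ones and downsample 'good' to match the number of 'bad'."""
    labelled = [dict(r, Score=to_score(r["Rating"])) for r in reviews]
    labelled = [r for r in labelled if r["Score"] is not None]
    bad = [r for r in labelled if r["Score"] == 1]
    good = [r for r in labelled if r["Score"] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    good = rng.sample(good, k=min(len(good), len(bad)))
    return bad + good

# Hypothetical toy records, not taken from the dataset
reviews = [{"Title": f"review {i}", "Rating": r}
           for i, r in enumerate([1, 2, 3, 4, 5, 5, 4, 5])]
balanced = label_and_balance(reviews)
```

Downsampling the majority class keeps the two labels in equal proportion, which is what allows the later stratified train, validation and test splits to stay balanced.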

In [6]:
# Function to select columns of interest, join title and review, drop unwanted columns and reset index
def proc(df):
    # Work on a copy so the column assignment below does not act on a slice of the original
    df = df[["Town","Category","Title","Review","Score"]].copy()
    df["all_text"] = df["Title"] + " " + df["Review"]
    df = df.drop(columns = ["Title","Review"])
    # reset_index keeps the original index as the first column, which is named OrgInd below
    df = df.reset_index()
    df.columns = ["OrgInd","Town","Category","Score","all_text"]
    return df

In [8]:
df=proc(df)

# Section 2: Text Cleaning

## a) Text Cleaning against a toy dataset

In [9]:
# Set up test dataframe (uncomment to run a new sample)
test = df.sample(10,random_state = 0)
test

Unnamed: 0,OrgInd,Town,Category,Score,all_text
1032,1032,Bognor,Food,0,An excellent pub and restaurant An excellent p...
2406,2406,Arundel,Food,0,Very good service Both my fiance and I have ha...
2591,2591,Arundel,Food,1,Not bad but wouldn’t return Over priced and f...
1185,1185,Littlehampton,Accommodation,0,Visit a friend We recently stayed at this hote...
1099,1099,Bognor,Food,0,wonderful Yet again beautiful carvery meat coo...
2488,2488,Arundel,Food,0,Birthday bash Tarrant Street can no longer be...
1018,1018,Arundel,Accommodation,0,"Wendy Excellent Hotel just outside Arundel, bu..."
2508,2508,Arundel,Food,0,Great friendly Place in Findon! Stopped for br...
778,778,Littlehampton,Food,1,Expensive Went for lunch with Grandson.\nLimit...
2313,2313,Littlehampton,Food,1,"It’s was,...ok We came for a family meal, for ..."


In [10]:
# Remove newlines, strip surrounding whitespace and set to lowercase
test['lower'] = test["all_text"].apply(lambda x: x.replace('\n',''))
test['lower'] = test["lower"].apply(lambda x: x.strip().lower())

In [11]:
# Sample review
test["lower"].iloc[1]

'very good service both my fiance and i have had breakfast here a few times now and a mate and i have had a few pints before. but tonight we got some dinner and our server lucy where really nice to us and the legend in the kitchen, damien, made sure we where happy. thanks!'

In [12]:
# Replace words, and remove the 'read more', 'read less' tags (not relevant to example review)
# Note: no backslash before 'read' - in a regex '\r' would be a carriage return
test['clean'] = test["lower"].replace({'xmas': 'christmas'}, regex=True)
test['clean'] = test.clean.str.replace(r'read less$', '', regex=True).str.strip()
test['clean'] = test.clean.str.replace(r'read more$', '', regex=True).str.strip()
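
One detail worth checking in tag patterns like these: inside a regular expression `\r` denotes a carriage return, so the tag has to be written without a leading backslash to match the literal words. A minimal check with the standard `re` module, using a made-up review string:

```python
import re

text = "lovely stay, would return read less"

# '\r' is a carriage return in a regex, so this pattern looks for
# a carriage return followed by 'ead less' and leaves the tag in place.
with_backslash = re.sub(r'\read less$', '', text).strip()

# Without the backslash the literal tag is matched at the end of the text.
without_backslash = re.sub(r'read less$', '', text).strip()
```

The `$` anchor restricts the match to the end of the review, so the words "read less" inside a sentence are left alone.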

In [13]:
# Remove pound signs, split words separated by a slash, remove digit ranges (e.g. 20-25),
# remove digits followed by two word characters (e.g. 7pm, 2nd) and remove runs of three or
# more full stops, which are not removed by the punctuation step

test['clean'] = test["clean"].replace({r'£':''}, regex = True)
test['clean'] = test["clean"].replace(r'/'," ", regex=True)
test['clean'] = test["clean"].replace({r'\d+-\d+':""}, regex = True)
test['clean'] = test["clean"].replace({r'\d+\w{2}':""}, regex = True)
test['clean'] = test["clean"].replace({r'\.{3,}':""}, regex = True)
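
To see what each substitution does, the same five patterns can be applied in order to a single string with the standard `re` module. The sample sentence and the `clean` helper are hypothetical, and the final whitespace collapse is an addition that the cell above does not perform.

```python
import re

# (pattern, replacement) pairs in the order used above
CLEANING_STEPS = [
    (r'£', ''),          # pound sign
    (r'/', ' '),         # split words joined by a slash
    (r'\d+-\d+', ''),    # digit ranges such as 20-25
    (r'\d+\w{2}', ''),   # digits plus two word characters: 7pm, 2nd (and also 2019)
    (r'\.{3,}', ''),     # runs of three or more full stops
]

def clean(text):
    for pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
    # Collapse the gaps left behind by the removals (not part of the original cell)
    return ' '.join(text.split())

sample = "dinner for £20-25 at 7pm on the 2nd... fish/chips"
cleaned = clean(sample)   # 'dinner for at on the fish chips'
```

Because digits count as word characters, `\d+\w{2}` also removes any number of three or more digits, such as a year. That may or may not be wanted, depending on the downstream model.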

In [14]:
# Example
test["clean"].iloc[0]

'an excellent pub and restaurant an excellent pub and restaurant not a five star restaurant but offering excellent value and really good food with a very good selection of beers keep at the right temperature i have visited many times and have last found time to share my experience lloyd'

In [15]:
# Expand contractions
test["contracts"] = test["clean"].apply(lambda x: contractions.fix(x))

In [16]:
test["contracts"].iloc[0]

'an excellent pub and restaurant an excellent pub and restaurant not a five star restaurant but offering excellent value and really good food with a very good selection of beers keep at the right temperature i have visited many times and have last found time to share my experience lloyd'

In [17]:
# Tokenize text
test["token"] = test["contracts"].apply(lambda x: nltk.word_tokenize(x))

In [18]:
# Example review
print(test["token"].iloc[0], end = '')

['an', 'excellent', 'pub', 'and', 'restaurant', 'an', 'excellent', 'pub', 'and', 'restaurant', 'not', 'a', 'five', 'star', 'restaurant', 'but', 'offering', 'excellent', 'value', 'and', 'really', 'good', 'food', 'with', 'a', 'very', 'good', 'selection', 'of', 'beers', 'keep', 'at', 'the', 'right', 'temperature', 'i', 'have', 'visited', 'many', 'times', 'and', 'have', 'last', 'found', 'time', 'to', 'share', 'my', 'experience', 'lloyd']

In [19]:
# Remove punctuation
punc = string.punctuation
test["punct"] = test["token"].apply(lambda x: [word for word in x if word not in punc])
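
Since `punc` is a string, `word not in punc` is a substring test, not a per-character test. The sketch below, on made-up tokens, shows the difference and why a token such as `'...'` survives this step. The character-level variant is a suggested alternative, not what the notebook runs.

```python
import string

tokens = ['good', ',', '!"', '...', 'food']

# Substring test: a token is dropped only if it appears verbatim inside string.punctuation
by_substring = [t for t in tokens if t not in string.punctuation]

# Character-level test: drop any token made up entirely of punctuation characters
punct_chars = set(string.punctuation)
by_chars = [t for t in tokens if not all(ch in punct_chars for ch in t)]
```

With the substring test `'...'` is kept because three consecutive full stops never occur in `string.punctuation`, which is consistent with the separate regex used earlier to remove runs of full stops.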

In [20]:
# Example review
print(test["punct"].iloc[0])

['an', 'excellent', 'pub', 'and', 'restaurant', 'an', 'excellent', 'pub', 'and', 'restaurant', 'not', 'a', 'five', 'star', 'restaurant', 'but', 'offering', 'excellent', 'value', 'and', 'really', 'good', 'food', 'with', 'a', 'very', 'good', 'selection', 'of', 'beers', 'keep', 'at', 'the', 'right', 'temperature', 'i', 'have', 'visited', 'many', 'times', 'and', 'have', 'last', 'found', 'time', 'to', 'share', 'my', 'experience', 'lloyd']


In [21]:
# Remove numbers, except words that contain numbers.
test["ex_num"] = test["punct"].apply(lambda x: [n for n in x if not n.isnumeric()])
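
A quick illustration of which tokens `str.isnumeric()` removes, using made-up tokens: only tokens consisting purely of numeric characters are dropped, so ordinals, decimals and mixed tokens are kept.

```python
tokens = ['3', '2nd', '3.5', 'room101', 'nights']

# '3' is removed; '3.5' stays because '.' is not a numeric character
kept = [t for t in tokens if not t.isnumeric()]
```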

In [22]:
# Drop tokens made up entirely of non-ASCII characters (e.g. emoji); other tokens are kept unchanged
test["ascii"]= test["ex_num"].apply(lambda x: [e for e in x if e.encode("ascii","ignore")])
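
The effect of this filter is easiest to see on a few made-up tokens: a token is kept whenever something is left after encoding to ASCII, so accented words pass through untouched. The second variant, which strips non-ASCII characters inside tokens, is shown only for comparison and is an assumption about one possible intent.

```python
tokens = ['nice', 'café', '😊', 'naïve']

# Filter used above: keep a token if any ASCII characters remain after encoding
kept = [t for t in tokens if t.encode("ascii", "ignore")]

# Alternative: strip non-ASCII characters inside tokens, then drop empty results
stripped = [t.encode("ascii", "ignore").decode("ascii") for t in tokens]
stripped = [t for t in stripped if t]
```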

In [23]:
# Example review
print(test["ascii"].iloc[0])

['an', 'excellent', 'pub', 'and', 'restaurant', 'an', 'excellent', 'pub', 'and', 'restaurant', 'not', 'a', 'five', 'star', 'restaurant', 'but', 'offering', 'excellent', 'value', 'and', 'really', 'good', 'food', 'with', 'a', 'very', 'good', 'selection', 'of', 'beers', 'keep', 'at', 'the', 'right', 'temperature', 'i', 'have', 'visited', 'many', 'times', 'and', 'have', 'last', 'found', 'time', 'to', 'share', 'my', 'experience', 'lloyd']


In [24]:
# Print stopwords list
stop_words = set(stopwords.words('english')) 
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [25]:
# Customise the stopwords list. Contractions have already been expanded (e.g. "didn't" becomes "did not"),
# so only 'not' needs to be taken out of the stopwords list to retain negation; 'etc' is added to the list.

stop = stopwords.words('english')
stop_remove = ["not"]
stop_left = [s for s in stop if s not in stop_remove]
newStopWords = ['etc']
stop_left.extend(newStopWords)
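
The same customisation can be expressed with set operations, which also makes the later membership tests faster than scanning a list. The short `base_stop` list is a hypothetical stand-in for `stopwords.words('english')`, so that the sketch runs without the NLTK data.

```python
# Small stand-in for stopwords.words('english'); the real list is much longer
base_stop = ['a', 'and', 'but', 'not', 'the', 'we', 'did']

keep_words = {'not'}    # negation word to retain in the text
extra_words = {'etc'}   # additional noise word to remove
stop_set = (set(base_stop) - keep_words) | extra_words

tokens = ['we', 'did', 'not', 'like', 'the', 'chips', 'etc']
filtered = [t for t in tokens if t not in stop_set]
```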

In [26]:
# Remove common/stopwords
test["no_stop"] = test["ascii"].apply(lambda x: [w for w in x if w not in stop_left])

In [27]:
# Example review
print(test["no_stop"].iloc[0])

['excellent', 'pub', 'restaurant', 'excellent', 'pub', 'restaurant', 'not', 'five', 'star', 'restaurant', 'offering', 'excellent', 'value', 'really', 'good', 'food', 'good', 'selection', 'beers', 'keep', 'right', 'temperature', 'visited', 'many', 'times', 'last', 'found', 'time', 'share', 'experience', 'lloyd']


In [28]:
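def spell_check(text_chunk):
    # Return the tokens with each misspelt word replaced by its most likely correction
    spell = SpellChecker()
    new_list = []
    for word in text_chunk:
        # correction() may return None in recent pyspellchecker releases, so fall back to the original word
        new_word = spell.correction(word)
        new_list.append(new_word if new_word else word)
    return new_list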

In [29]:
# Run with example text to show corrections
text_example_spelling = ["where", "is", "the", "best", "restarant"]
spell_check(text_example_spelling)

['where', 'is', 'the', 'best', 'restaurant']

In [30]:
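# Apply to test dataframe (result is displayed only, not stored). Note in the output that
# place names are altered, e.g. 'arundel' becomes 'asunder'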
test["no_stop"].apply(lambda x: spell_check(x))

1032    [excellent, pub, restaurant, excellent, pub, r...
2406    [good, service, fiance, breakfast, times, mate...
2591    [not, bad, would, not, return, priced, food, d...
1185    [visit, friend, recently, stayed, hotel, night...
1099    [wonderful, yet, beautiful, carvery, meat, coo...
2488    [birthday, bash, tarrant, street, longer, cons...
1018    [wendy, excellent, hotel, outside, asunder, st...
2508    [great, friendly, place, finden, stopped, brea...
778     [expensive, went, lunch, grandson.limited, chi...
2313    [ok, came, family, meal, time, day, week, woul...
Name: no_stop, dtype: object
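
The output above shows place names being "corrected" (`arundel` to `asunder`, `findon` to `finden`). One way to limit this is to skip a set of protected words, sketched below. `spell_check_protected`, `fake_correct` and the `place_names` set are hypothetical. In the notebook the `correct` argument would be `SpellChecker().correction`, and pyspellchecker's `word_frequency.load_words` is another route to the same effect.

```python
def spell_check_protected(tokens, correct, protected=frozenset()):
    """Correct tokens with `correct`, leaving protected words untouched."""
    result = []
    for token in tokens:
        if token in protected:
            result.append(token)
            continue
        suggestion = correct(token)
        # Fall back to the original token if the corrector returns nothing
        result.append(suggestion if suggestion else token)
    return result

# Hypothetical stand-in for SpellChecker().correction, for illustration only
fake_corrections = {'restarant': 'restaurant', 'arundel': 'asunder', 'findon': 'finden'}

def fake_correct(word):
    return fake_corrections.get(word, word)

place_names = {'arundel', 'findon', 'bognor', 'littlehampton'}
tokens = ['best', 'restarant', 'arundel']
fixed = spell_check_protected(tokens, fake_correct, place_names)
```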

In [31]:
print(test["no_stop"].iloc[0])

['excellent', 'pub', 'restaurant', 'excellent', 'pub', 'restaurant', 'not', 'five', 'star', 'restaurant', 'offering', 'excellent', 'value', 'really', 'good', 'food', 'good', 'selection', 'beers', 'keep', 'right', 'temperature', 'visited', 'many', 'times', 'last', 'found', 'time', 'share', 'experience', 'lloyd']


In [32]:
# Lemmatize to common root
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]

test['lemma'] = test.no_stop.apply(lemmatize_text)

In [33]:
# Get parts of speech using NLTK
test['pos_tags'] = test['no_stop'].apply(nltk.tag.pos_tag)
test.head(2)

Unnamed: 0,OrgInd,Town,Category,Score,all_text,lower,clean,contracts,token,punct,ex_num,ascii,no_stop,lemma,pos_tags
1032,1032,Bognor,Food,0,An excellent pub and restaurant An excellent p...,an excellent pub and restaurant an excellent p...,an excellent pub and restaurant an excellent p...,an excellent pub and restaurant an excellent p...,"[an, excellent, pub, and, restaurant, an, exce...","[an, excellent, pub, and, restaurant, an, exce...","[an, excellent, pub, and, restaurant, an, exce...","[an, excellent, pub, and, restaurant, an, exce...","[excellent, pub, restaurant, excellent, pub, r...","[excellent, pub, restaurant, excellent, pub, r...","[(excellent, JJ), (pub, NN), (restaurant, NN),..."
2406,2406,Arundel,Food,0,Very good service Both my fiance and I have ha...,very good service both my fiance and i have ha...,very good service both my fiance and i have ha...,very good service both my fiance and i have ha...,"[very, good, service, both, my, fiance, and, i...","[very, good, service, both, my, fiance, and, i...","[very, good, service, both, my, fiance, and, i...","[very, good, service, both, my, fiance, and, i...","[good, service, fiance, breakfast, times, mate...","[good, service, fiance, breakfast, time, mate,...","[(good, JJ), (service, NN), (fiance, NN), (bre..."


In [34]:
# Example item
print(test.pos_tags.iloc[0])

[('excellent', 'JJ'), ('pub', 'NN'), ('restaurant', 'NN'), ('excellent', 'JJ'), ('pub', 'NN'), ('restaurant', 'NN'), ('not', 'RB'), ('five', 'CD'), ('star', 'NN'), ('restaurant', 'NN'), ('offering', 'NN'), ('excellent', 'JJ'), ('value', 'NN'), ('really', 'RB'), ('good', 'JJ'), ('food', 'NN'), ('good', 'JJ'), ('selection', 'NN'), ('beers', 'NNS'), ('keep', 'VB'), ('right', 'JJ'), ('temperature', 'NN'), ('visited', 'VBD'), ('many', 'JJ'), ('times', 'NNS'), ('last', 'JJ'), ('found', 'VBD'), ('time', 'NN'), ('share', 'NN'), ('experience', 'NN'), ('lloyd', 'NN')]


In [35]:
print(test.all_text.iloc[0],"\n")
print(test.lemma.iloc[0],"\n")
print(test.pos_tags.iloc[0],"\n")

An excellent pub and restaurant An excellent pub and restaurant not a five star restaurant but offering excellent value and really good food with a very good selection of beers keep at the right temperature I have visited many times and have last found time to share my experience Lloyd 

['excellent', 'pub', 'restaurant', 'excellent', 'pub', 'restaurant', 'not', 'five', 'star', 'restaurant', 'offering', 'excellent', 'value', 'really', 'good', 'food', 'good', 'selection', 'beer', 'keep', 'right', 'temperature', 'visited', 'many', 'time', 'last', 'found', 'time', 'share', 'experience', 'lloyd'] 

[('excellent', 'JJ'), ('pub', 'NN'), ('restaurant', 'NN'), ('excellent', 'JJ'), ('pub', 'NN'), ('restaurant', 'NN'), ('not', 'RB'), ('five', 'CD'), ('star', 'NN'), ('restaurant', 'NN'), ('offering', 'NN'), ('excellent', 'JJ'), ('value', 'NN'), ('really', 'RB'), ('good', 'JJ'), ('food', 'NN'), ('good', 'JJ'), ('selection', 'NN'), ('beers', 'NNS'), ('keep', 'VB'), ('right', 'JJ'), ('temperature', 

## b) Text Cleaning Sentence for Demo in Appendix

In [36]:
example = "Nice spacious room, clean and cmfortable beds, we stayed 3 nights and I couldn't fault anything! 😊 read less"

In [37]:
example = example.strip().lower()
example

"nice spacious room, clean and cmfortable beds, we stayed 3 nights and i couldn't fault anything! 😊 read less"

In [38]:
example = example.replace('read less', '')
example

"nice spacious room, clean and cmfortable beds, we stayed 3 nights and i couldn't fault anything! 😊 "

In [39]:
example = contractions.fix(example)
example

'nice spacious room, clean and cmfortable beds, we stayed 3 nights and i could not fault anything! 😊 '

In [40]:
example = nltk.word_tokenize(example)
print(example)

['nice', 'spacious', 'room', ',', 'clean', 'and', 'cmfortable', 'beds', ',', 'we', 'stayed', '3', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything', '!', '😊']


In [41]:
example = [word for word in example if word not in punc]
print(example)

['nice', 'spacious', 'room', 'clean', 'and', 'cmfortable', 'beds', 'we', 'stayed', '3', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything', '😊']


In [42]:
example = [e for e in example if e.encode("ascii","ignore")]
print(example)

['nice', 'spacious', 'room', 'clean', 'and', 'cmfortable', 'beds', 'we', 'stayed', '3', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything']


In [43]:
example = [n for n in example if not n.isnumeric()]
print(example)

['nice', 'spacious', 'room', 'clean', 'and', 'cmfortable', 'beds', 'we', 'stayed', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything']


In [44]:
example = [w for w in example if w not in stop_left]
print(example)

['nice', 'spacious', 'room', 'clean', 'cmfortable', 'beds', 'stayed', 'nights', 'could', 'not', 'fault', 'anything']


In [46]:
example2 = spell_check(example)
print(example2)

['nice', 'spacious', 'room', 'clean', 'comfortable', 'beds', 'stayed', 'nights', 'could', 'not', 'fault', 'anything']


In [47]:
example = lemmatize_text(example2)
print(example)

['nice', 'spacious', 'room', 'clean', 'comfortable', 'bed', 'stayed', 'night', 'could', 'not', 'fault', 'anything']


In [48]:
print(nltk.tag.pos_tag(example))

[('nice', 'RB'), ('spacious', 'JJ'), ('room', 'NN'), ('clean', 'NN'), ('comfortable', 'JJ'), ('bed', 'NN'), ('stayed', 'VBD'), ('night', 'NN'), ('could', 'MD'), ('not', 'RB'), ('fault', 'VB'), ('anything', 'NN')]


# Section 3: Text cleaning function applied to the sample dataset for use in the text models

Build a pre-processing function based on the steps outlined above.

In [49]:
# General pre-processing function that applies the above tasks to clean text.
# Note: the spell checker can replace words with incorrect words that do not make sense,
# so the spelling step can be commented out if necessary

def process(text):
    
       
    # Replace newlines, strip whitespace and set to lowercase
    text = text.apply(lambda x: x.replace('\n',' '))
    text = text.apply(lambda x: x.strip().lower())
    
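    # Replace words, and remove the 'read more', 'read less' tags from the end of the text
    # (newlines were replaced with spaces above, so the tags are not matched with a leading '\n')
    text = text.apply(lambda x: x.replace('xmas','christmas'))
    text = text.str.replace(r'\s*read less$', "", regex=True)
    text = text.str.replace(r'\s*read more$', "", regex=True)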
       
                  
    # Clean other issues with text
    text = text.replace({r'£':""}, regex = True) # remove pound sign
    text = text.replace(r'/'," ", regex=True) # split words separated with slash
    text = text.replace({r'\d+-\d+':""}, regex = True) # remove digit ranges such as 27-30
    text = text.replace({r'\d+\w{2}':""}, regex = True) # remove number plus am, pm, th or nd
    text = text.replace({r'\.{3,}':""}, regex = True) # remove multiple full stops not removed by punctuation

    # Expand contractions
    text = text.apply(lambda x: contractions.fix(x))

    # Tokenize text
    text = text.apply(lambda x: nltk.word_tokenize(x))
    
    # Remove punctuation
    punc = string.punctuation
    text = text.apply(lambda x: [word for word in x if word not in punc])
             
    # Remove numbers, except words that contain numbers.
    text = text.apply(lambda x: [n for n in x if not n.isnumeric()])

    # Remove non ascii characters
    text = text.apply(lambda x: [e for e in x if e.encode("ascii","ignore")])

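    # Remove common/stopwords - extend and remove some words from the list as negation words might be important to retain
    # Note: tokens are compared one at a time, so the multi-word entries below (e.g. 'read less') never match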
    stop = stopwords.words('english')
    stop_remove =["not","don't","didn't","wasn't","won't","isn't"]
    stop1 = [elem for elem in stop if elem not in stop_remove] 
    add_stop = ['etc','read','read less','lot','butlins', 'bognor','regis','b',' i ','..','arundel castle','premier','inn','u',
                'castle',"year","hilton","time","day","shoreline","oyster","bay","church farm","hotham","hotham park",
                "hawk walk","hawk","arundel","littlehampton"]
    stop1.extend(add_stop)
    text = text.apply(lambda x: [w for w in x if w not in stop1])
    
    # Lemmatize to common root
    lemmatizer = WordNetLemmatizer()
    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in text]
       
    text = text.apply(lemmatize_text)
    
    # Spelling
    def spell_correction(text):
        spell = SpellChecker()
        word_list = []
        for word in text:
            # correction() may return None in recent pyspellchecker releases, so fall back to the original word
            new_word = spell.correction(word)
            word_list.append(new_word if new_word else word)
        return word_list

    text = text.apply(spell_correction)
    
    # Convert list to string 
    text = text.apply(lambda x: ' '.join(x))
    
    # Remove stray 'i' and 'le' tokens, replacing them with a space so neighbouring words are not joined
    text = text.apply(lambda x: x.replace(' i '," "))
    text = text.apply(lambda x: x.replace(' le '," "))
      
    return text
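
A note on the last step of `process`: a substring pattern with a space on either side, such as `' i '`, cannot match a token at the very start or end of the text. Filtering at token level avoids that edge case. The sketch below uses a hypothetical helper, `drop_stray_tokens`, and a made-up string.

```python
def drop_stray_tokens(text, stray=('i', 'le')):
    """Remove unwanted whole tokens from a space-separated string."""
    return ' '.join(t for t in text.split() if t not in stray)

sample = "i liked food le"

# Substring replacement misses tokens at either end of the string
by_replace = sample.replace(' i ', ' ').replace(' le ', ' ')

# Token-level filtering removes them wherever they occur
by_tokens = drop_stray_tokens(sample)
```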

In [51]:
# Run function (resource intensive)
df_cleaned = process(df["all_text"])

In [52]:
# Function to make dataframe, name column then find parts of speech for text in that column
def make_df(df):
    df = pd.DataFrame(df)
    df.columns = ["text_clean"]
    df['pos'] = df["text_clean"].apply(lambda x:nltk.tag.pos_tag(x.split()))
    return df

In [53]:
# Run function
df_cleaned = make_df(df_cleaned)

In [54]:
# Function to concat two dataframes and name columns
def convert(df,df2):
    df = pd.concat([df2,df],axis = 1)
    columns = "OrgInd","Town","Category","Score","Sent","Sent_clean","Pos"
    df.columns = (columns)
    return df

In [55]:
# Run function
df_cleaned = convert(df_cleaned,df)

In [56]:
df_cleaned.head(2)

Unnamed: 0,OrgInd,Town,Category,Score,Sent,Sent_clean,Pos
0,0,Arundel,Food,1,Very disappointing Three of us ate on a quiet ...,disappointing three u ate quiet night first go...,"[(disappointing, JJ), (three, CD), (u, JJ), (a..."
1,1,Bognor,Accommodation,0,Amazing place!!! We had a lovely stay at the ...,amazing place lovely stay inklenook room world...,"[(amazing, JJ), (place, NN), (lovely, RB), (st..."


In [65]:
# Send cleaned and combined dataframes to csv
df_cleaned.to_csv("cleanedsampletext.csv")

# Section 4: Text Pre-processing function applied to the total reviews dataset for use with topic models

In [67]:
# Bring in dataframe from Exploratory Data Analysis Notebook 1 - combined data (outliers not excluded)
df_combined = pd.read_csv("all_reviews.csv",index_col = 0)

In [68]:
df_combined.shape

(10407, 12)

In [69]:
df_combined.head(2)

Unnamed: 0,Name,Category,Town,Type,Location,Contributions,Title,Review,Rating,id,ReviewMonth,VisitMonth
0,Butlins,Accommodation,Bognor,Hotel,"Hitchin, United Kingdom",25,"Nice break, shame about the accommodation...",We booked our 3 night stay from 27-30 December...,4,6804,12,12
1,Butlins,Accommodation,Bognor,Hotel,"London, United Kingdom",69,Horrendous noise Oyster Bay,In Oyster Bay. Oh dear.\n\nVery poor sound ins...,1,1536,12,12


In [70]:
# Combine title and review
df_combined = df_combined[["Category","Town","Title","Review","Rating"]].copy()
df_combined["all_text"] = df_combined["Title"] +" "+ df_combined["Review"]
df_combined = df_combined.drop(columns = ["Title","Review"])
df_combined.head(2)

Unnamed: 0,Category,Town,Rating,all_text
0,Accommodation,Bognor,4,"Nice break, shame about the accommodation... W..."
1,Accommodation,Bognor,1,Horrendous noise Oyster Bay In Oyster Bay. Oh ...


In [71]:
# Run function against the dataframe for reviews - NOTE THIS TAKES A LONG TIME TO PROCESS
df_combined["cleaned"] = process(df_combined["all_text"])

In [72]:
df_combined.head(2)

Unnamed: 0,Category,Town,Rating,all_text,cleaned
0,Accommodation,Bognor,4,"Nice break, shame about the accommodation... W...",nice break shame accommodation booked night st...
1,Accommodation,Bognor,1,Horrendous noise Oyster Bay In Oyster Bay. Oh ...,horrendous noise oh dear poor sound insulation...


In [73]:
# Save to file - note this overwrites any existing file of the same name
df_combined.to_csv("fulldatasetcleaned.csv")

# Section 5: Text Pre-processing function applied to sentences

In [74]:
# Read in saved dataset of reviews
full_df = pd.read_csv("fulldatasetcleaned.csv")
full_df.columns = ["OrigInd","Category","Town","Rating","all_text","cleaned"]
full_df.head(2)

Unnamed: 0,OrigInd,Category,Town,Rating,all_text,cleaned
0,0,Accommodation,Bognor,4,"Nice break, shame about the accommodation... W...",nice break shame accommodation booked night st...
1,1,Accommodation,Bognor,1,Horrendous noise Oyster Bay In Oyster Bay. Oh ...,horrendous noise oh dear poor sound insulation...


In [75]:
# Drop cleaned column
full_df.drop(columns = ["cleaned"],axis = 1, inplace = True)

In [76]:
full_df.head(2)

Unnamed: 0,OrigInd,Category,Town,Rating,all_text
0,0,Accommodation,Bognor,4,"Nice break, shame about the accommodation... W..."
1,1,Accommodation,Bognor,1,Horrendous noise Oyster Bay In Oyster Bay. Oh ...


In [77]:
# Split each review into separate sentences, explode onto separate lines and check size
full_df["sentences"] = full_df["all_text"].apply(lambda x: nltk.sent_tokenize(x))

In [78]:
full_df_exploded  = full_df.explode("sentences")

In [79]:
# Remove sentences that consist only of the 'Read less' tag
full_df_exploded = full_df_exploded[full_df_exploded["sentences"] != "Read less"]
full_df_exploded.shape

(60231, 6)

In [80]:
# Get the length of sentences
full_df_exploded["len"] = full_df_exploded["sentences"].apply(lambda x: len(x))
full_df_exploded.describe()

Unnamed: 0,OrigInd,Rating,len
count,60231.0,60231.0,60231.0
mean,4631.481895,4.005612,84.022563
std,3135.820731,1.329433,66.467749
min,0.0,1.0,1.0
25%,1780.0,3.0,42.0
50%,4421.0,5.0,70.0
75%,7360.5,5.0,107.0
max,10406.0,5.0,2094.0


In [81]:
# Remove rows with just one character such as exclamation marks
full_df_exploded = full_df_exploded[full_df_exploded["len"] > 1]
full_df_exploded.shape

(59956, 7)

In [82]:
# Find how many outliers there are in terms of length of sentence
full_df_exploded[(np.abs(stats.zscore(full_df_exploded["len"])) > 3)].shape

(876, 7)
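
The outlier rule used here flags values whose z-score exceeds 3 in absolute value. A minimal standard-library sketch of the same calculation, on hypothetical sentence lengths, uses the population standard deviation, which matches the default `ddof=0` of `scipy.stats.zscore`:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    return [v for v in values if abs((v - mean) / std) > threshold]

# Hypothetical sentence lengths: twenty ordinary sentences and one very long one
lengths = [42, 70, 107, 84] * 5 + [2094]
outliers = zscore_outliers(lengths)
```

With very small samples no value can reach a z-score of 3, so the rule is only meaningful on reasonably large collections such as the exploded sentence set.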

In [83]:
# Look at review sentences over 1000 characters
full_df_exploded[full_df_exploded["len"] > 1000]

Unnamed: 0,OrigInd,Category,Town,Rating,all_text,sentences,len
78,78,Accommodation,Bognor,1,Boring boggy butlins Well silver accommodation...,It’s ridiculous...adults are kept in line with...,1475
98,98,Accommodation,Bognor,1,Angry We visited butlins bogner for my daughte...,Angry We visited butlins bogner for my daughte...,1693
115,115,Accommodation,Bognor,1,Don't waste your time or money I am so angry w...,Don't waste your time or money I am so angry w...,2094
314,314,Accommodation,Bognor,5,Tots week Just got back from our 2nd tots week...,Tots week Just got back from our 2nd tots week...,1143
665,665,Accommodation,Arundel,5,Brilliant ignore bad reviews Well after readi...,Brilliant ignore bad reviews Well after readi...,1096
999,999,Accommodation,Littlehampton,2,Half Term Break Site Complex not big enough fo...,Half Term Break Site Complex not big enough fo...,1959
1680,1680,Accommodation,Littlehampton,4,Lovely stay overall As we drove in the first a...,Lovely stay overall As we drove in the first a...,1098
2080,2080,Accommodation,Bognor,3,Easter Break few minor ussues went here for an...,Easter Break few minor ussues went here for an...,1344
8911,8911,Food,Bognor,1,Wost company iv ever ordered from worst delive...,Wost company iv ever ordered from worst delive...,1029


Almost 900 sentences (876) have a length more than three standard deviations from the mean and are treated as outliers. They are retained for aspect extraction analysis as they still contain valuable information. Some reviews are written without punctuation, which prevents the NLTK sentence tokenizer from splitting them properly, but this reflects the text input that is likely to be received in practice. If required, these rows can be removed using the code in the cell below.

In [84]:
# Drop long sentences that could not be tokenized properly from the dataset - uncomment to run, long sentences kept otherwise
#full_df_exploded = full_df_exploded[(np.abs(stats.zscore(full_df_exploded["len"])) <3)]
#full_df_exploded.shape

In [85]:
# Use alternative library to try to split the long reviews into shorter ones based on named entities
# https://pypi.org/project/deepsegment/ NEEDS TENSORFLOW LOADED TO RUN - LOOK AT LATER

#segmenter = DeepSegment('en')
#segmenter.segment(test)

In [86]:
# Run function against the dataframe for reviews - NOTE THIS TAKES OVER AN HOUR TO PROCESS
full_df_exploded["cleaned"] = process(full_df_exploded["sentences"])

In [91]:
full_df_exploded["cleaned"].head()

0    nice break shame accommodation booked night st...
0    would never not really sure expect review eith...
0    first impression not good we arrived parked ca...
0    bearing mind people apartment obviously self-c...
0    trip back forth one side resort balancing stuf...
Name: cleaned, dtype: object

In [88]:
full_df_exploded.to_csv("explodedsentencescleaned.csv")