# Text Pre-processing

This notebook pre-processes the review data in three stages:

* Section 1 - pre-processing the dataframe
* Section 2 - text cleaning steps demonstrated on a toy dataset
* Section 3 - text cleaning applied to the full dataset

## Import Libraries and Data

In [10]:
# Install language_check. Note: pyahocorasick also had to be installed from conda-forge
# (add the conda-forge channel, then conda install pyahocorasick), and Java must be on the PATH.
# ! pip install --upgrade language-check

In [11]:
#! pip install contractions
#! pip install pyspellchecker 
#! pip install autocorrect
#!pip install Gensim
#! conda update pandas

In [1]:
import pandas as pd
import numpy as np
import pickle
import re
import contractions
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import RegexpTokenizer, word_tokenize,sent_tokenize
import string
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\imoge\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\imoge\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\imoge\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [109]:
# Read in balanced datasets from Post Attributes Base notebook
train = pd.read_csv("combined_train.csv",index_col = 0)
val = pd.read_csv("combined_val.csv",index_col = 0)
test = pd.read_csv("combined_test.csv",index_col = 0)

In [110]:
train.head(2)

Unnamed: 0,Name,Category,Town,Type,Contributions,Title,Review,Date,LocCode,Cuisine,Score
245,Trevali Buest House,Accommodation,Bognor,B&B/Inn,315,Central B & B,"We had room 6, excellent view, we could see th...",1.0,2.0,0,0
209,Sea View,Accommodation,Littlehampton,Hotel,50,Not what it used to be...,"We lived in the area for 25 years and in fact,...",2.0,2.0,0,1


In [111]:
train.shape

(1453, 11)

# Section 1: Preprocessing the dataframe

Processing tasks:
* join title and review text together into one column
* label bad ratings (1 and 2) as 1 and good ratings (4 and 5) as 0, and drop reviews rated 3
* split reviews in dataframes for each category
* sample 'good' reviews to match number of 'bad' to create balanced datasets
* split into train, validation and testing sets for each category stratifying the y values so the same proportions appear
  in each of train,val and test sets.
* recombine the category feature dataframes to create combined feature dataframes for training, validation and test
* concat the dataframes to create 3 balanced final dataframes with features and rating for accom, food and attractions
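
The relabelling and balancing steps listed above can be sketched in plain Python. This is a minimal illustration only, not the code that produced the combined CSV files: the `reviews` list, the `to_label` and `balance` helpers and the fixed seed are all assumptions made for the example.

```python
import random

# Hypothetical (rating, text) pairs standing in for the review data
reviews = [(1, "awful"), (2, "poor"), (3, "ok"), (4, "good"),
           (5, "great"), (5, "superb"), (4, "lovely"), (5, "perfect")]

def to_label(rating):
    """Map ratings 1-2 to 1 (bad) and 4-5 to 0 (good); return None for 3."""
    if rating in (1, 2):
        return 1
    if rating in (4, 5):
        return 0
    return None

def balance(rows, seed=0):
    """Drop neutral reviews, then undersample 'good' to match the number of 'bad'."""
    labelled = [(to_label(r), t) for r, t in rows if to_label(r) is not None]
    bad = [row for row in labelled if row[0] == 1]
    good = [row for row in labelled if row[0] == 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    good_sample = rng.sample(good, k=min(len(bad), len(good)))
    return bad + good_sample

balanced = balance(reviews)
print(balanced)
```

Undersampling the larger class is the simplest way to balance; it discards data, which is acceptable here only on the assumption that 'good' reviews are plentiful.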

In [112]:
# Function to select columns of interest, join title and review, drop unwanted columns and reset index
def proc(df):
    # Copy the selection so the assignments below do not act on a view of the input frame
    df = df[["Town","Category","Title","Review","Score"]].copy()
    df["all_text"] = df["Title"] + " " + df["Review"]
    df = df.drop(columns = ["Title","Review"])
    # reset_index keeps the original index as the first column, renamed OrgInd below
    df = df.reset_index()
    df.columns = ["OrgInd","Town","Category","Score","all_text"]
    return df

In [113]:
df_train = proc(train)
df_val = proc(val)
df_test = proc(test)

In [114]:
df_train.head(2)

Unnamed: 0,OrgInd,Town,Category,Score,all_text
0,245,Bognor,Accommodation,0,"Central B & B We had room 6, excellent view, w..."
1,209,Littlehampton,Accommodation,1,Not what it used to be... We lived in the area...


# Section 2: Text Cleaning

## a) Text Cleaning against a toy dataset

In [115]:
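# Set up a 10-row toy dataframe (change random_state to draw a new sample).
# Note: this reuses the name `test`; df_test was already built above, so the held-out data is unaffected.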
test = df_train.sample(10,random_state = 0)
test

Unnamed: 0,OrgInd,Town,Category,Score,all_text
1352,564,Littlehampton,Food,1,Best avoided As we were told the café was full...
482,245,Bognor,Food,0,First class food in tasteful surroundings I ha...
1309,299,Littlehampton,Food,1,Lunchtime chaos Booked table for 4 several wee...
270,216,Bognor,Accommodation,1,Excessive cost Premier Inn charged me. I was c...
278,138,Arundel,Accommodation,0,"Charming hotel I booked a stay with dinner, sp..."
665,157,Littlehampton,Food,0,What a secret gem What a beautiful treat. We w...
1012,704,Littlehampton,Food,0,Family run restaurant with amazing food Amazin...
1166,992,Littlehampton,Food,1,Disappointing Where do i start\n\nThe so calle...
1099,1168,Arundel,Food,0,super brunch in lovely surroundings Popped in ...
1062,998,Littlehampton,Food,1,Probably the unfriendliest place I have ever v...


In [116]:
# Strip whitespace and set to lowercase
test['lower'] = test["all_text"].apply(lambda x: x.strip().lower())

In [117]:
# Sample review
test["lower"].iloc[1]

'first class food in tasteful surroundings i have lunched twice at mustards within the last few days.\n\nmy first lunch included last sunday was a delicious homemade broccoli soup and a generous, equally delicious, portion of roast lamb, yorkshire pudding, roast potatoes, gravy and two side dishes with freshly cooked vegetables. in addition there was warm, home baked bread and a carafe of water.\n\ntoday being less hungry i chose fried cod, chips and mashed peas; all cooked to perfection. once again there was freshly baked bread. afterwards i ate homemade lemon, lime and raspberry sorbets - so much better than thé mass produced variety.\n\non both occasions the service was efficient and friendly. mustards is stylishly furnished and attention has been paid to many details: for example: cloth serviettes and cloth towels in the immaculately clean bathroom.'

In [118]:
# Replace words, and remove the trailing 'read more' / 'read less' tags (not relevant to example review)
test['clean'] = test["lower"].replace({'xmas': 'christmas'}, regex=True)
# No backslash before 'read': in a regex, \r means carriage return
test['clean'] = test.clean.str.replace(r'\s*read less$', '', regex=True).str.strip()
test['clean'] = test.clean.str.replace(r'\s*read more$', '', regex=True).str.strip()
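
A small sketch of stripping a trailing 'read less' / 'read more' tag with an anchored regular expression. The helper name `strip_read_tag` and the sample string are made up for illustration. In a regex the escape `\r` means a carriage return, so the tag has to be written without a leading backslash.

```python
import re

def strip_read_tag(text):
    """Remove a trailing 'read less' / 'read more' tag and surrounding whitespace."""
    return re.sub(r'\s*read (?:less|more)$', '', text).strip()

sample = "lovely room, would stay again\nread less"
print(strip_read_tag(sample))

# With a leading backslash the pattern starts with a carriage return,
# so nothing in the sample matches and the text comes back unchanged
print(re.sub(r'\read less$', '', sample) == sample)
```

The `$` anchor restricts the match to the end of the review, so the phrase "read more" inside a sentence is left alone.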

In [119]:
# Remove other characters: pound signs, slashes between words, digit ranges (e.g. 10-12),
# digits followed by two word characters (am, pm, th, nd; note this also matches any number of three or more digits),
# and runs of three or more full stops, which the punctuation step does not remove

test['clean'] = test["clean"].replace({r'£':''}, regex = True)
test['clean'] = test["clean"].replace(r'/'," ", regex=True)
test['clean'] = test["clean"].replace({r'\d+-\d+':""}, regex = True)
test['clean'] = test["clean"].replace({r'\d+\w{2}':""}, regex = True)
test['clean'] = test["clean"].replace({r'\.{3,}':""}, regex = True)
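
The digit-plus-two-characters pattern is broader than its comment suggests, because `\w` also matches digits and any letters. The sketch below compares it with a stricter alternative; `NARROW` is a hypothetical pattern offered for comparison, not the one used in this notebook.

```python
import re

BROAD = r'\d+\w{2}'                      # pattern used in the cell above
NARROW = r'\d+(?:am|pm|st|nd|rd|th)\b'   # hypothetical stricter alternative

# Show what each pattern leaves behind for a few representative tokens
for token in ["10am", "25th", "101", "4star", "45"]:
    print(token, repr(re.sub(BROAD, "", token)), repr(re.sub(NARROW, "", token)))
```

With the broad pattern, "101" disappears entirely and "4star" is cut down to "ar". The narrow pattern touches only times and ordinals.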

In [120]:
# Example
test["clean"].iloc[0]

"best avoided as we were told the café was full (even though there were empty tables) we foolishly went to the kiosk outside to order food and drinks. it was barely organised chaos. the queue just didn't move, while they faffed about pretending to know how to prepare food and drinks. when we finally got to order they had run out of most things we wanted on the limited menu (sausage rolls and sandwiches) so we ended up waiting 20 mins for some chips to be cooked, which was only a small portion. the tables out on the beach were dirty, with rubbish underneath them. a poorly managed place, which clearly trades on its location."

In [121]:
# Expand contractions
test["contracts"] = test["clean"].apply(lambda x: contractions.fix(x))

In [122]:
test["contracts"].iloc[0]

'best avoided as we were told the café was full (even though there were empty tables) we foolishly went to the kiosk outside to order food and drinks. it was barely organised chaos. the queue just did not move, while they faffed about pretending to know how to prepare food and drinks. when we finally got to order they had run out of most things we wanted on the limited menu (sausage rolls and sandwiches) so we ended up waiting 20 mins for some chips to be cooked, which was only a small portion. the tables out on the beach were dirty, with rubbish underneath them. a poorly managed place, which clearly trades on its location.'

In [123]:
# Tokenize text
test["token"] = test["contracts"].apply(lambda x: nltk.word_tokenize(x))

In [124]:
# Example review
print(test["token"].iloc[0], end = '')

['best', 'avoided', 'as', 'we', 'were', 'told', 'the', 'café', 'was', 'full', '(', 'even', 'though', 'there', 'were', 'empty', 'tables', ')', 'we', 'foolishly', 'went', 'to', 'the', 'kiosk', 'outside', 'to', 'order', 'food', 'and', 'drinks', '.', 'it', 'was', 'barely', 'organised', 'chaos', '.', 'the', 'queue', 'just', 'did', 'not', 'move', ',', 'while', 'they', 'faffed', 'about', 'pretending', 'to', 'know', 'how', 'to', 'prepare', 'food', 'and', 'drinks', '.', 'when', 'we', 'finally', 'got', 'to', 'order', 'they', 'had', 'run', 'out', 'of', 'most', 'things', 'we', 'wanted', 'on', 'the', 'limited', 'menu', '(', 'sausage', 'rolls', 'and', 'sandwiches', ')', 'so', 'we', 'ended', 'up', 'waiting', '20', 'mins', 'for', 'some', 'chips', 'to', 'be', 'cooked', ',', 'which', 'was', 'only', 'a', 'small', 'portion', '.', 'the', 'tables', 'out', 'on', 'the', 'beach', 'were', 'dirty', ',', 'with', 'rubbish', 'underneath', 'them', '.', 'a', 'poorly', 'managed', 'place', ',', 'which', 'clearly', 'tra

In [125]:
# Remove punctuation
punc = string.punctuation
test["punct"] = test["token"].apply(lambda x: [word for word in x if word not in punc])
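
Because `string.punctuation` is a single string, `word not in punc` is a substring test. Single punctuation marks are dropped, but multi-character tokens such as `...` are kept. The sketch below shows the difference; the token list and the stricter `all(...)` check are illustrative assumptions, not part of the notebook's pipeline.

```python
import string

punc = string.punctuation
tokens = ["good", ",", "...", "``", "!", "''"]

# Substring test, as in the cell above
substring_filtered = [t for t in tokens if t not in punc]

# Hypothetical stricter check: drop a token if every character is punctuation
punct_chars = set(punc)
strict_filtered = [t for t in tokens if not all(ch in punct_chars for ch in t)]

print(substring_filtered)
print(strict_filtered)
```

This is why runs of full stops are removed with a regex before tokenizing.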

In [126]:
# Example review
print(test["punct"].iloc[0])

['best', 'avoided', 'as', 'we', 'were', 'told', 'the', 'café', 'was', 'full', 'even', 'though', 'there', 'were', 'empty', 'tables', 'we', 'foolishly', 'went', 'to', 'the', 'kiosk', 'outside', 'to', 'order', 'food', 'and', 'drinks', 'it', 'was', 'barely', 'organised', 'chaos', 'the', 'queue', 'just', 'did', 'not', 'move', 'while', 'they', 'faffed', 'about', 'pretending', 'to', 'know', 'how', 'to', 'prepare', 'food', 'and', 'drinks', 'when', 'we', 'finally', 'got', 'to', 'order', 'they', 'had', 'run', 'out', 'of', 'most', 'things', 'we', 'wanted', 'on', 'the', 'limited', 'menu', 'sausage', 'rolls', 'and', 'sandwiches', 'so', 'we', 'ended', 'up', 'waiting', '20', 'mins', 'for', 'some', 'chips', 'to', 'be', 'cooked', 'which', 'was', 'only', 'a', 'small', 'portion', 'the', 'tables', 'out', 'on', 'the', 'beach', 'were', 'dirty', 'with', 'rubbish', 'underneath', 'them', 'a', 'poorly', 'managed', 'place', 'which', 'clearly', 'trades', 'on', 'its', 'location']


In [127]:
# Remove numbers, except words that contain numbers.
test["ex_num"] = test["punct"].apply(lambda x: [n for n in x if not n.isnumeric()])

In [128]:
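# Drop tokens made up entirely of non-ASCII characters (e.g. emoji); tokens with some ASCII, such as 'café', are kept as they are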
test["ascii"]= test["ex_num"].apply(lambda x: [e for e in x if e.encode("ascii","ignore")])
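
The filter keeps a token whenever its ASCII-encoded form is non-empty, so accented words survive unchanged and only tokens with no ASCII characters at all (such as emoji) are dropped. A short sketch follows; the `to_ascii` helper is a hypothetical alternative that folds accents, and is not used in this notebook.

```python
import unicodedata

tokens = ["caf\u00e9", "nice", "\U0001F60A"]  # 'café', 'nice' and a smiley emoji

# Same test as the cell above: keep the token if anything is left after encoding
kept = [t for t in tokens if t.encode("ascii", "ignore")]
print(kept)

def to_ascii(token):
    """Hypothetical helper: strip accents and drop characters with no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

folded = [to_ascii(t) for t in tokens if to_ascii(t)]
print(folded)
```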

In [129]:
# Example review
print(test["ascii"].iloc[0])

['best', 'avoided', 'as', 'we', 'were', 'told', 'the', 'café', 'was', 'full', 'even', 'though', 'there', 'were', 'empty', 'tables', 'we', 'foolishly', 'went', 'to', 'the', 'kiosk', 'outside', 'to', 'order', 'food', 'and', 'drinks', 'it', 'was', 'barely', 'organised', 'chaos', 'the', 'queue', 'just', 'did', 'not', 'move', 'while', 'they', 'faffed', 'about', 'pretending', 'to', 'know', 'how', 'to', 'prepare', 'food', 'and', 'drinks', 'when', 'we', 'finally', 'got', 'to', 'order', 'they', 'had', 'run', 'out', 'of', 'most', 'things', 'we', 'wanted', 'on', 'the', 'limited', 'menu', 'sausage', 'rolls', 'and', 'sandwiches', 'so', 'we', 'ended', 'up', 'waiting', 'mins', 'for', 'some', 'chips', 'to', 'be', 'cooked', 'which', 'was', 'only', 'a', 'small', 'portion', 'the', 'tables', 'out', 'on', 'the', 'beach', 'were', 'dirty', 'with', 'rubbish', 'underneath', 'them', 'a', 'poorly', 'managed', 'place', 'which', 'clearly', 'trades', 'on', 'its', 'location']


In [130]:
# Print stopwords list
stop_words = set(stopwords.words('english')) 
stop_words

In [131]:
# Customise the stopword list: take 'not' out so that negation survives
# (expanding contractions turns "didn't" into "did not"), and add 'etc'.

stop = stopwords.words('english')
stop_remove = ["not"]
stop_left = [s for s in stop if s not in stop_remove]
newStopWords = ['etc']
stop_left.extend(newStopWords)
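
The effect of keeping 'not' can be seen on a short token list. The `base_stop` list below is a tiny stand-in for the NLTK English stopword list, assumed here so the sketch runs without any downloads.

```python
# Tiny stand-in for the NLTK English stopword list (assumption for this sketch)
base_stop = ["did", "the", "just", "not", "it", "was"]

tokens = ["the", "queue", "just", "did", "not", "move"]

# Default behaviour: 'not' is treated as a stopword and the negation is lost
with_not_removed = [t for t in tokens if t not in base_stop]

# Customised list: 'not' is kept as a token, 'etc' is added as a stopword
custom_stop = set(w for w in base_stop if w != "not") | {"etc"}
with_not_kept = [t for t in tokens if t not in custom_stop]

print(with_not_removed)
print(with_not_kept)
```

Using a set for the customised list also makes each membership test constant time, which helps when the filter runs over every token of every review.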

In [132]:
# Remove common/stopwords
test["no_stop"] = test["ascii"].apply(lambda x: [w for w in x if w not in stop_left])

In [133]:
# Example review
print(test["no_stop"].iloc[0])

['best', 'avoided', 'told', 'café', 'full', 'even', 'though', 'empty', 'tables', 'foolishly', 'went', 'kiosk', 'outside', 'order', 'food', 'drinks', 'barely', 'organised', 'chaos', 'queue', 'not', 'move', 'faffed', 'pretending', 'know', 'prepare', 'food', 'drinks', 'finally', 'got', 'order', 'run', 'things', 'wanted', 'limited', 'menu', 'sausage', 'rolls', 'sandwiches', 'ended', 'waiting', 'mins', 'chips', 'cooked', 'small', 'portion', 'tables', 'beach', 'dirty', 'rubbish', 'underneath', 'poorly', 'managed', 'place', 'clearly', 'trades', 'location']


In [134]:
def spell_check(text_chunk):
    spell = SpellChecker()
    new_list = []
    for word in text_chunk:
        # Look each word up once; recent pyspellchecker versions return None when no candidate is found
        new_word = spell.correction(word)
        new_list.append(new_word if new_word else word)
    return new_list
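
Spell checking is applied token by token, and reviews repeat many words, so caching corrections can save work. The sketch below shows the idea with the standard library only. `difflib` and the small `VOCAB` tuple are stand-ins for the spellchecker and its dictionary, and `correct` and `spell_check_cached` are hypothetical names.

```python
import difflib
from functools import lru_cache

# Hypothetical mini vocabulary; a real run would rely on the spellchecker's dictionary
VOCAB = ("where", "is", "the", "best", "restaurant", "comfortable")

@lru_cache(maxsize=None)
def correct(word):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    if word in VOCAB:
        return word
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

def spell_check_cached(tokens):
    """Correct each token, reusing cached results for repeated words."""
    return [correct(t) for t in tokens]

print(spell_check_cached(["where", "is", "the", "best", "restarant"]))
print(spell_check_cached(["kiosk", "cmfortable", "kiosk"]))
```

Falling back to the original word when no close match exists keeps unknown but valid words, such as place names, from being lost.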

In [135]:
# Run with example text to show corrections
text_example_spelling = ["where", "is", "the", "best", "restarant"]
spell_check(text_example_spelling)

['where', 'is', 'the', 'best', 'restaurant']

In [136]:
# Apply to test dataframe (the result is displayed only, not stored)
test["no_stop"].apply(lambda x: spell_check(x))

1352    [best, avoided, told, cafe, full, even, though...
482     [first, class, food, tasteful, surroundings, l...
1309    [lunchtime, chaos, booked, table, several, wee...
270     [excessive, cost, premier, inn, charged, charg...
278     [charming, hotel, booked, stay, dinner, spa, t...
665     [secret, gem, beautiful, treat, spa, day, bail...
1012    [family, run, restaurant, amazing, food, amazi...
1166    [disappointing, start, called, fresh, salad, b...
1099    [super, brunch, lovely, surroundings, popped, ...
1062    [probably, unfriendlies, place, ever, visited,...
Name: no_stop, dtype: object

In [137]:
print(test["no_stop"].iloc[0])

['best', 'avoided', 'told', 'café', 'full', 'even', 'though', 'empty', 'tables', 'foolishly', 'went', 'kiosk', 'outside', 'order', 'food', 'drinks', 'barely', 'organised', 'chaos', 'queue', 'not', 'move', 'faffed', 'pretending', 'know', 'prepare', 'food', 'drinks', 'finally', 'got', 'order', 'run', 'things', 'wanted', 'limited', 'menu', 'sausage', 'rolls', 'sandwiches', 'ended', 'waiting', 'mins', 'chips', 'cooked', 'small', 'portion', 'tables', 'beach', 'dirty', 'rubbish', 'underneath', 'poorly', 'managed', 'place', 'clearly', 'trades', 'location']


In [138]:
# Lemmatize to common root
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]

test['lemma'] = test.no_stop.apply(lemmatize_text)

In [139]:
# Get parts of speech using NLTK
test['pos_tags'] = test['no_stop'].apply(nltk.tag.pos_tag)
test.head(2)

Unnamed: 0,OrgInd,Town,Category,Score,all_text,lower,clean,contracts,token,punct,ex_num,ascii,no_stop,lemma,pos_tags
1352,564,Littlehampton,Food,1,Best avoided As we were told the café was full...,best avoided as we were told the café was full...,best avoided as we were told the café was full...,best avoided as we were told the café was full...,"[best, avoided, as, we, were, told, the, café,...","[best, avoided, as, we, were, told, the, café,...","[best, avoided, as, we, were, told, the, café,...","[best, avoided, as, we, were, told, the, café,...","[best, avoided, told, café, full, even, though...","[best, avoided, told, café, full, even, though...","[(best, RB), (avoided, VBN), (told, NN), (café..."
482,245,Bognor,Food,0,First class food in tasteful surroundings I ha...,first class food in tasteful surroundings i ha...,first class food in tasteful surroundings i ha...,first class food in tasteful surroundings i ha...,"[first, class, food, in, tasteful, surrounding...","[first, class, food, in, tasteful, surrounding...","[first, class, food, in, tasteful, surrounding...","[first, class, food, in, tasteful, surrounding...","[first, class, food, tasteful, surroundings, l...","[first, class, food, tasteful, surroundings, l...","[(first, RB), (class, NN), (food, NN), (tastef..."


In [140]:
# Example item
print(test.pos_tags.iloc[0])

[('best', 'RB'), ('avoided', 'VBN'), ('told', 'NN'), ('café', 'NN'), ('full', 'JJ'), ('even', 'RB'), ('though', 'IN'), ('empty', 'JJ'), ('tables', 'NNS'), ('foolishly', 'RB'), ('went', 'VBD'), ('kiosk', 'NNS'), ('outside', 'IN'), ('order', 'NN'), ('food', 'NN'), ('drinks', 'NNS'), ('barely', 'RB'), ('organised', 'VBD'), ('chaos', 'NN'), ('queue', 'NN'), ('not', 'RB'), ('move', 'VB'), ('faffed', 'RB'), ('pretending', 'VBG'), ('know', 'PRP'), ('prepare', 'JJ'), ('food', 'NN'), ('drinks', 'NNS'), ('finally', 'RB'), ('got', 'VBD'), ('order', 'NN'), ('run', 'VB'), ('things', 'NNS'), ('wanted', 'VBD'), ('limited', 'JJ'), ('menu', 'NN'), ('sausage', 'NN'), ('rolls', 'NNS'), ('sandwiches', 'NNS'), ('ended', 'VBD'), ('waiting', 'VBG'), ('mins', 'NNS'), ('chips', 'NNS'), ('cooked', 'VBD'), ('small', 'JJ'), ('portion', 'NN'), ('tables', 'NNS'), ('beach', 'VBP'), ('dirty', 'JJ'), ('rubbish', 'JJ'), ('underneath', 'NN'), ('poorly', 'RB'), ('managed', 'VBD'), ('place', 'NN'), ('clearly', 'RB'), ('tr

In [141]:
print(test.all_text.iloc[0],"\n")
print(test.lemma.iloc[0],"\n")
print(test.pos_tags.iloc[0],"\n")

Best avoided As we were told the café was full (even though there were empty tables) we foolishly went to the kiosk outside to order food and drinks. It was barely organised chaos. The queue just didn't move, while they faffed about pretending to know how to prepare food and drinks. When we finally got to order they had run out of most things we wanted on the limited menu (sausage rolls and sandwiches) so we ended up waiting 20 mins for some chips to be cooked, which was only a small portion. The tables out on the beach were dirty, with rubbish underneath them. A poorly managed place, which clearly trades on its location. 

['best', 'avoided', 'told', 'café', 'full', 'even', 'though', 'empty', 'table', 'foolishly', 'went', 'kiosk', 'outside', 'order', 'food', 'drink', 'barely', 'organised', 'chaos', 'queue', 'not', 'move', 'faffed', 'pretending', 'know', 'prepare', 'food', 'drink', 'finally', 'got', 'order', 'run', 'thing', 'wanted', 'limited', 'menu', 'sausage', 'roll', 'sandwich', 'e

## b) Text Cleaning Sentence for Demo in Appendix

In [142]:
example = "Nice spacious room, clean and cmfortable beds, we stayed 3 nights and I couldn't fault anything! 😊 read less"

In [143]:
example = example.strip().lower()
example

"nice spacious room, clean and cmfortable beds, we stayed 3 nights and i couldn't fault anything! 😊 read less"

In [144]:
example = example.replace('read less', '')
example

"nice spacious room, clean and cmfortable beds, we stayed 3 nights and i couldn't fault anything! 😊 "

In [145]:
example = contractions.fix(example)
example

'nice spacious room, clean and cmfortable beds, we stayed 3 nights and i could not fault anything! 😊 '

In [146]:
example = nltk.word_tokenize(example)
print(example)

['nice', 'spacious', 'room', ',', 'clean', 'and', 'cmfortable', 'beds', ',', 'we', 'stayed', '3', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything', '!', '😊']


In [147]:
example = [word for word in example if word not in punc]
print(example)

['nice', 'spacious', 'room', 'clean', 'and', 'cmfortable', 'beds', 'we', 'stayed', '3', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything', '😊']


In [148]:
example = [e for e in example if e.encode("ascii","ignore")]
print(example)

['nice', 'spacious', 'room', 'clean', 'and', 'cmfortable', 'beds', 'we', 'stayed', '3', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything']


In [149]:
example = [n for n in example if not n.isnumeric()]
print(example)

['nice', 'spacious', 'room', 'clean', 'and', 'cmfortable', 'beds', 'we', 'stayed', 'nights', 'and', 'i', 'could', 'not', 'fault', 'anything']


In [150]:
example = [w for w in example if w not in stop_left]
print(example)

['nice', 'spacious', 'room', 'clean', 'cmfortable', 'beds', 'stayed', 'nights', 'could', 'not', 'fault', 'anything']


In [152]:
example2 = spell_check(example)
print(example2)

['nice', 'spacious', 'room', 'clean', 'comfortable', 'beds', 'stayed', 'nights', 'could', 'not', 'fault', 'anything']


In [153]:
example = lemmatize_text(example2)
print(example)

['nice', 'spacious', 'room', 'clean', 'comfortable', 'bed', 'stayed', 'night', 'could', 'not', 'fault', 'anything']


In [154]:
print(nltk.tag.pos_tag(example))

[('nice', 'RB'), ('spacious', 'JJ'), ('room', 'NN'), ('clean', 'NN'), ('comfortable', 'JJ'), ('bed', 'NN'), ('stayed', 'VBD'), ('night', 'NN'), ('could', 'MD'), ('not', 'RB'), ('fault', 'VB'), ('anything', 'NN')]


# Section 3: Text cleaning function applied to the combined dataframes

Build a pre-processing function based on the steps outlined above

In [400]:
# General pre-processing function with the above tasks to clean text.
# The spellchecker can replace words with incorrect ones that do not make sense, so comment out that step if it is not wanted.

def process(text):
    
       
    # Strip whitespace and set to lowercase
    text = text.apply(lambda x: x.strip().lower())
    
    # Replace words, and remove the 'read more', 'read less' tags 
    text = text.apply(lambda x: x.replace('xmas','christmas'))
    text = text.apply(lambda x: x.replace('\nread less',""))
    text = text.apply(lambda x: x.replace('\nread more',""))
       
                  
    # Clean other issues with text
    text = text.replace({r'£':''}, regex = True) # remove pound sign
    text = text.replace(r'/'," ", regex=True) # split words separated with slash
    text = text.replace({r'\d+-\d+':""}, regex = True) # remove digit ranges such as 10-12
    text = text.replace({r'\d+\w{2}':""}, regex = True) # remove number plus am, pm, th or nd
    text = text.replace({r'\.{3,}':""}, regex = True) # remove multiple full stops not removed by punctuation
    
    # Expand contractions
    text = text.apply(lambda x: contractions.fix(x))

    # Tokenize text
    text = text.apply(lambda x: nltk.word_tokenize(x))
    
    # Remove punctuation
    punc = string.punctuation
    text = text.apply(lambda x: [word for word in x if word not in punc])
             
    # Remove numbers, except words that contain numbers.
    text = text.apply(lambda x: [n for n in x if not n.isnumeric()])

    # Remove non ascii characters
    text = text.apply(lambda x: [e for e in x if e.encode("ascii","ignore")])

    # Remove common/stopwords - extend and remove some words from the list as negation words might be important to retain
    stop = stopwords.words('english')
    stop_remove =["not","don't","didn't","wasn't","won't","isn't"]
    stop1 = [elem for elem in stop if elem not in stop_remove] 
    # Tokens are single words, so multi-word entries ('arundel castle', 'church farm') and ' i ' never match
    add_stop = ['etc','read','butlins', 'bognor','regis','b',' i ',
                '..','arundel castle','premier','inn','u','castle',
                "year","hilton","time","day","shoreline","oyster","bay","church farm"]
    stop1.extend(add_stop)
    text = text.apply(lambda x: [w for w in x if w not in stop1])
    
    # Run spellchecker using spell_check defined in Section 2 (slow; comment out the next line to skip)
    text = text.apply(spell_check)
             
    # Lemmatize to common root
    lemmatizer = WordNetLemmatizer()
    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in text]
    
    text = text.apply(lemmatize_text)
    
    # Convert list to string 
    text = text.apply(lambda x: ' '.join(x))
    
    # Remove any standalone 'i' left in the joined text, keeping a space so neighbouring words are not fused
    text = text.apply(lambda x: x.replace(' i '," "))
  
    return text
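
Two pitfalls are easy to miss in a long hand-written stopword list. Adjacent string literals are concatenated when a comma is missing, and a multi-word entry can never equal a single token. The lists below are illustrative stand-ins, not the full list used in `process`.

```python
# Missing comma: ' i ' and '..' are silently joined into the single string ' i ..'
joined = ['b', ' i ' '..', 'castle']
separate = ['b', ' i ', '..', 'castle']
print(len(joined), len(separate))

# Tokens are single words, so the two-word stopword never matches anything
tokens = ['arundel', 'castle', 'visit', '..']
stop = set(separate) | {'arundel castle'}
filtered = [t for t in tokens if t not in stop]
print(filtered)
```

To remove a phrase such as a venue name, either list each word separately or strip the phrase from the raw text before tokenizing.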

In [401]:
# Run function (resource intensive, so this can take a while)
df_train_cleaned = process(df_train["all_text"])
df_val_cleaned = process(df_val["all_text"])
df_test_cleaned = process(df_test["all_text"])

In [402]:
# Function to make dataframe, name column then find parts of speech for text in that column
def make_df(df):
    df = pd.DataFrame(df)
    df.columns = ["text_clean"]
    df['pos'] = df["text_clean"].apply(lambda x:nltk.tag.pos_tag(x.split()))
    return df

In [403]:
# Run function
df_train_cleaned = make_df(df_train_cleaned)
df_val_cleaned = make_df(df_val_cleaned)
df_test_cleaned = make_df(df_test_cleaned)

In [404]:
# Function to concat two dataframes and name columns
def convert(df,df2):
    df = pd.concat([df2,df],axis = 1)
    columns = "OrgInd","Town","Category","Score","Sent","Sent_clean","Pos"
    df.columns = (columns)
    return df

In [405]:
# Run function
train_cleaned = convert(df_train_cleaned,df_train)
val_cleaned = convert(df_val_cleaned,df_val)
test_cleaned = convert(df_test_cleaned,df_test)

In [406]:
train_cleaned.head(2)

Unnamed: 0,OrgInd,Town,Category,Score,Sent,Sent_clean,Pos
0,245,Bognor,Accommodation,0,"Central B & B We had room 6, excellent view, w...",central room excellent view could see sea room...,"[(central, JJ), (room, NN), (excellent, JJ), (..."
1,209,Littlehampton,Accommodation,1,Not what it used to be... We lived in the area...,not used lived area year fact daughter worked ...,"[(not, RB), (used, VBN), (lived, VBN), (area, ..."


In [407]:
# Send cleaned and combined dataframes to csv
train_cleaned.to_csv("train_cleaned.csv")
val_cleaned.to_csv("val_cleaned.csv")
test_cleaned.to_csv("test_cleaned.csv")
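
CSV has no list type, so when these files are read back the `Pos` column is expected to hold strings rather than lists of tuples. The sketch below uses the standard `csv` module as a stand-in for the pandas round trip, which is assumed to store the list's string representation in the same way. `ast.literal_eval` then rebuilds the original structure.

```python
import ast
import csv
import io

pos = [("central", "JJ"), ("room", "NN")]

# Write one row the way a CSV stores a Python list: as its string representation
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Sent_clean", "Pos"])
writer.writerow(["central room", str(pos)])

# Read it back: the Pos field is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["Pos"]
print(type(raw).__name__)

# literal_eval safely rebuilds the list of tuples from the string
restored = ast.literal_eval(raw)
print(restored == pos)
```

An alternative is to recompute the tags from `Sent_clean` after loading, which avoids parsing altogether.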