# Data Prep Inspection
In this notebook we look at the data preparation for the LDA and word2vec models. The process starts with the spaCy processing pipeline, which writes two files: one contains unigram sentences, one sentence per line, and the other contains unigram reviews, one review per line. Writing both at once means the text only has to pass through the spaCy pipeline once, which matters because that step is quite time consuming. Next, we train a bigram Phrases model and write its output to a file. That output is then used to train a trigram Phrases model, and the corpus is passed through it to produce output suitable for the word2vec model. Finally, we feed the unigram review corpus through the bigram and trigram Phrases models, remove stopwords, and write out the processed reviews, which are suitable for training the LDA model.

See prep_data.py for the exact process and all functions used.

In [1]:
import spacy
from itertools import islice
from typing import Generator, Any
from pathlib import Path

## Extracting Unigram Sentences and Reviews
First, we need to define a function that extracts a slice of our file without loading the full file into memory. When working with very large corpora, it is often necessary to stream data from disk.

In [2]:
def get_reviews_slice(file_path: str | Path, start: int, stop: int) -> Generator[str, None, None]:
    """Lazily yield a slice of lines from a text file.

    Args:
        file_path (str | Path): Path to the file
        start (int): zero-based index of the first line to yield
        stop (int): zero-based index at which to stop (exclusive)

    Yields:
        str: one review per line, with literal backslash-n sequences converted to newlines
    """
    with open(file_path, 'r', encoding='utf-8') as reviews_file:
        for review in islice(reviews_file, start, stop):
            yield review.replace('\\n', '\n')
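
As a quick check of the streaming behaviour, the sketch below runs the same logic on a throwaway file. The name `read_slice` and the demo file contents are made up for illustration; `read_slice` is only a compact copy of `get_reviews_slice` so that the block runs on its own.

```python
import tempfile
from itertools import islice
from pathlib import Path


def read_slice(file_path, start, stop):
    """Illustrative copy of get_reviews_slice, kept short for the demo."""
    with open(file_path, 'r', encoding='utf-8') as handle:
        # islice pulls lines one at a time, so the file is never fully loaded
        for line in islice(handle, start, stop):
            yield line.replace('\\n', '\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'demo_reviews.txt'
    # One review per line; line breaks inside a review are stored as a literal backslash-n
    demo_path.write_text('first review\nsecond\\nreview\nthird review\n', encoding='utf-8')
    demo_slice = list(read_slice(demo_path, 1, 3))

print(demo_slice)
```

Note that `islice` still has to read and discard the first `start` lines, so a large offset is not free, although memory use stays constant.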

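This step is the first in the pipeline, where the text is processed by the spaCy model. The model handles tokenization, sentence segmentation, lemmatization, and named entity recognition. Taking the lemma of each token also lowercases most words, although tokens such as "I" and proper nouns such as "NJ" keep their case, as the output below shows. We also remove punctuation and extra spaces.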

In [3]:
# en_core_web_trf: slower, more accurate, has a max length of ~500 tokens
# en_core_web_sm: faster, less accurate
spacy_model = spacy.load('en_core_web_sm')

for index, review in enumerate(get_reviews_slice('data/raw_reviews.txt', 0, 5), start=1):
    print(f'\nReview: {index}')
    print('\nOriginal Review:\n')
    print(review)
    print('\nLemmatized Review Sentences:\n')
    prep_revs = spacy_model(review)
    
    for sentence in prep_revs.sents:
        lemmatized_sentence = ' '.join([token.lemma_ for token in sentence if not token.is_punct and not token.is_space])
        print(lemmatized_sentence)
    print()
    print('='*100)


Review: 1

Original Review:

If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. 

The food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.


Lemmatized Review Sentences:

if you decide to eat here just be aware it be go to take about 2 hour from begin to end
we have try it multiple time because I want to like it
I have be to it be other location in NJ and never have a bad experience
the food be good but it take a very long time to come out
the waitstaff be very young but usually pleasant
we have just have too many experience where we spend way too long wait
we usually opt for another dine

### Inspect unigram sentences
Let's inspect the results of the first step: creating unigram sentences from the review text.

In [4]:
for line in get_reviews_slice(file_path='data/unigram_sents.txt', start=0, stop=20):
    print(line.rstrip('\n'))

if you decide to eat here just be aware it be go to take about hour from begin to end
we have try it multiple time because I want to like it
I have be to it be other location in NJ and never have a bad experience
the food be good but it take a very long time to come out
the waitstaff be very young but usually pleasant
we have just have too many experience where we spend way too long wait
we usually opt for another diner or restaurant on the weekend in order to be do quick
family diner
have the buffet
eclectic assortment a large chicken leg fried jalapeño tamale two roll grape leave fresh melon
all good
lot of mexican choice there
also have a menu with breakfast serve all day long
friendly attentive staff
good place for a casual relaxed meal with no expectation
next to the Clarion Hotel
wow Yummy different delicious
our favorite be the lamb curry and korma
with different kind of naan
do not let the outside deter you because we almost change our mind go in and try something new


### Inspect bigram sentences
Next, let's inspect the bigram sentences written out after training and saving the bigram Phrases model.

In [5]:
for line in get_reviews_slice(file_path='data/bigram_sents.txt', start=0, stop=20):
    print(line.rstrip('\n'))

if you decide to eat here just be aware it be go to take about hour from begin to end
we have try it multiple time because I want to like it
I have be to it be other location in NJ and never have a bad experience
the food be good but it take a very long time to come out
the waitstaff be very young but usually pleasant
we have just have too many experience where we spend way too long wait
we usually opt for another diner or restaurant on the weekend in order to be do quick
family diner
have the buffet
eclectic assortment a large chicken leg fried jalapeño tamale two roll grape_leave fresh melon
all good
lot of mexican choice there
also have a menu with breakfast serve all day long
friendly attentive staff
good place for a casual relaxed meal with no expectation
next to the Clarion_Hotel
wow Yummy different delicious
our favorite be the lamb curry and korma
with different kind of naan
do not let the outside deter you because we almost change our mind go in and try something new


Above, we can see that tokens like grape_leave were created because the words appear together frequently in the text. Named entities from the spaCy model, like Clarion_Hotel, are joined for the same reason: their parts also appear together in the text.
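
To make "appear together frequently" a little more concrete, here is a minimal pure-Python sketch of one common way to score adjacent word pairs, modelled on the collocation formula from Mikolov et al. (2013). The function name `score_bigrams`, the `min_count` default, and the toy sentences are assumptions for illustration; the actual Phrases settings used by the pipeline are not shown in this notebook.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; a higher score means the pair
    co-occurs more often than its individual word counts suggest."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))

    vocab_size = len(word_counts)
    scores = {}
    for (first, second), pair_count in pair_counts.items():
        # (count(a, b) - min_count) * vocab_size / (count(a) * count(b))
        scores[(first, second)] = (
            (pair_count - min_count) * vocab_size
            / (word_counts[first] * word_counts[second])
        )
    return scores


# Toy corpus, invented for this example
toy_sentences = [
    'two roll grape leave fresh melon'.split(),
    'stuffed grape leave be good'.split(),
    'the grape leave be fresh'.split(),
    'the melon be good'.split(),
]
scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

Pairs whose score clears a chosen threshold would be joined with an underscore, while pairs seen only `min_count` times score zero and stay separate.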

### Inspect trigram sentences
Next, we will look at the output of the trigram Phrases model.

In [6]:
for line in get_reviews_slice(file_path='data/trigram_sents.txt', start=60, stop=80):
    print(line.rstrip('\n'))

the bun make the Sonoran_Dog
it be like a snuggie for the pup
a first it seem ridiculous and almost like it be go to be too much exactly like everyone 's favorite blanket with sleeve
too much softness too much smush too indulgent
Wrong
it be warm soft chewy fragrant and it succeed where other famed Sonoran_Dogs fail
the hot_dog itself be flavorful but I would prefer that it or the bacon have a little more bite or snap to well hold their own against the dominant mustard and onion
I be with the masse on the carne_asada_caramelo
excellent tortilla salty melty_cheese and great carne
Super cheap and you can drive_through
great place for breakfast
I have the waffle which be fluffy and perfect and home fry which be nice and smash and crunchy
friendly waitstaff
will definitely be back
tremendous service Big shout_out to Douglas that complement the delicious food
pretty expensive establishment $ avg for your main_course but its definitely back that up with an atmosphere that be comparable with 

Above, we can see that terms like carne_asada_caramelo and across_the_country have been created.
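
Because the trigram model is trained on the bigram output, a second round of pair merging is enough to produce three-word tokens. The sketch below illustrates the idea with hand-written phrase sets; `merge_phrases` and both phrase sets are hypothetical stand-ins for what a trained model would learn from the corpus.

```python
def merge_phrases(tokens, phrases):
    """Join two adjacent tokens with '_' when the pair is in `phrases`."""
    merged = []
    index = 0
    while index < len(tokens):
        pair = tuple(tokens[index:index + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append('_'.join(pair))
            index += 2  # skip the token that was just absorbed
        else:
            merged.append(tokens[index])
            index += 1
    return merged


# Hypothetical phrase sets; a trained model would derive these from counts
first_pass_phrases = {('carne', 'asada')}
second_pass_phrases = {('carne_asada', 'caramelo')}

tokens = 'I be with the masse on the carne asada caramelo'.split()
bigram_tokens = merge_phrases(tokens, first_pass_phrases)
trigram_tokens = merge_phrases(bigram_tokens, second_pass_phrases)
print(trigram_tokens)
```

The second pass treats carne_asada as a single token, so pairing it with caramelo yields a three-word phrase.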

### Inspect trigram reviews

Finally, let's look at the full reviews that will be used to train the LDA model. After spaCy processing and phrase detection, we remove the remaining stopwords.
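
One reason to remove stopwords only after phrase detection is that a joined token is no longer a stopword, even if one of its parts is. A minimal sketch, assuming a deliberately tiny stopword set (the list actually used by the pipeline is not shown in this notebook, and `remove_stopwords` is a hypothetical helper):

```python
# Hypothetical stopword set, far smaller than any real list
STOPWORDS = {'the', 'be', 'a', 'to', 'that', 'and', 'out', 'it'}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


phrased = 'tremendous service Big shout_out to Douglas that complement the delicious food'.split()
filtered = remove_stopwords(phrased)
print(filtered)
```

Here shout_out survives even though "out" is in the stopword set, because filtering sees the phrase as a single token.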

In [7]:
start, stop = 500, 505
raw_revs_gen_slice = list(get_reviews_slice('data/raw_reviews.txt', start, stop))

trigram_revs_gen_slice = list(get_reviews_slice('data/trigram_reviews.txt', start, stop))

for rev_ind in range(stop-start):
    print('\nOriginal review:\n')
    print(raw_revs_gen_slice[rev_ind])
    print('\nTrigram Review:\n')
    print(trigram_revs_gen_slice[rev_ind])
    print()
    print('='*100)


Original review:

Tried this place out after many suggestions from friends.  We went in one night about an hour or so before closing just to find that they were only serving a partial menu - we're not talking that late, it was like, 9:30 on a weekend.  There was no mention of going down to a bar menu on the website so I was a bit annoyed by that.  Something about 'surprise disappointments that could be totally avoided' really bother me.  So, the dish I had picked out online wasn't available and I ordered a burger instead.  The server seemed to have a very hard time understanding my burger and suggested that modifications couldn't be made?  I told him I'm sure removing a single topping wouldn't cause the kitchen staff any issues...  The rest of the meal we were generally ignored and the service wasn't great.  I found the food to be overpriced for the quality.  Unless there is a special reason to attend, I don't think we'll be going back.


Trigram Review:

try place suggestion friend n