<a href="https://colab.research.google.com/github/FernandoBRdgz/inteligencia_artificial/blob/main/incrustaciones_de_palabras/word2vec_yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, released for personal, educational, and academic use. It is distributed as JSON files and can be used to teach students about databases, to learn NLP, or as sample production data while learning to build mobile apps.

Dataset link: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
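
The business and review files hold one JSON object per line, which is why the cells below read them line by line rather than with a single `json.load`. A minimal sketch of that pattern, using two made-up records held in memory (the values are hypothetical; only the field names come from the dataset):

```python
import io
import json

# Two hypothetical records in the one-object-per-line layout.
sample_file = io.StringIO(
    '{"business_id": "a1", "name": "Example Diner", "stars": 4.0}\n'
    '{"business_id": "b2", "name": "Example Cafe", "stars": 3.5}\n'
)

# Each line is a complete JSON document, so it can be parsed on its own.
records = [json.loads(line) for line in sample_file]

print(records[0]["name"])
print(len(records))
```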

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import os
import json
from pprint import pprint

In [4]:
main_path = '/content/drive/MyDrive/'

In [5]:
data_directory = os.path.join(main_path, 'data', 'yelp_dataset')

In [6]:
businesses_filepath = os.path.join(data_directory, 'yelp_academic_dataset_business.json')

In [7]:
with open(businesses_filepath) as f:
    first_business_record = f.readline() 

pprint(first_business_record)

('{"business_id":"Pns2l4eNsfO8kk83dixA6A","name":"Abby Rappoport, LAC, '
 'CMQ","address":"1616 Chapala St, Ste 2","city":"Santa '
 'Barbara","state":"CA","postal_code":"93101","latitude":34.4266787,"longitude":-119.7111968,"stars":5.0,"review_count":7,"is_open":0,"attributes":{"ByAppointmentOnly":"True"},"categories":"Doctors, '
 'Traditional Chinese Medicine, Naturopathic\\/Holistic, Acupuncture, Health & '
 'Medical, Nutritionists","hours":null}\n')


In [8]:
review_json_filepath = os.path.join(data_directory, 'yelp_academic_dataset_review.json')

In [9]:
with open(review_json_filepath) as f:
    first_review_record = f.readline()
    
pprint(first_review_record)

('{"review_id":"KU_O5udG6zpxOg-VcAEodg","user_id":"mh_-eMZ6K5RLWhZyISBhwA","business_id":"XQfwVwDr-v0ZS3_CbbE5Xw","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"If '
 'you decide to eat here, just be aware it is going to take about 2 hours from '
 'beginning to end. We have tried it multiple times, because I want to like '
 "it! I have been to it's other locations in NJ and never had a bad "
 'experience. \\n\\nThe food is good, but it takes a very long time to come '
 'out. The waitstaff is very young, but usually pleasant. We have just had too '
 'many experiences where we spent way too long waiting. We usually opt for '
 'another diner or restaurant on the weekends, in order to be done '
 'quicker.","date":"2018-07-07 22:09:11"}\n')


In [10]:
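# Collect the IDs of every business whose categories mention "Restaurants".
restaurant_ids = set()

with open(businesses_filepath) as f:
    for business_json in f:
        business = json.loads(business_json)
        # Skip businesses with no category information (null or empty).
        if not business.get('categories'):
            continue
        if 'Restaurants' not in business['categories']:
            continue
        restaurant_ids.add(business['business_id'])

# Freeze the set: from here on it is only used for membership tests.
restaurant_ids = frozenset(restaurant_ids)

pprint(f'{len(restaurant_ids):,} restaurants in the dataset.')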

'52,268 restaurants in the dataset.'
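
In the business records, `categories` is a single comma-separated string, so the `in` test above is a substring match. A stricter variant, sketched here with a hypothetical helper `has_category`, splits the string and compares whole category names. The category strings are illustrative, and this notebook does not check whether the stricter test would change the count on the real data.

```python
def has_category(categories, wanted):
    """Return True if `wanted` is one of the comma-separated categories."""
    if not categories:
        return False
    return wanted in {name.strip() for name in categories.split(',')}

# Illustrative category strings in the style of the business records.
print(has_category('Doctors, Acupuncture, Health & Medical', 'Restaurants'))
print(has_category('Restaurants, Pizza', 'Restaurants'))
# A substring test would accept this one; the exact-name test does not.
print(has_category('Pop-Up Restaurants', 'Restaurants'))
print(has_category(None, 'Restaurants'))
```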


In [11]:
scratch_directory = os.path.join(data_directory, 'scratch')

# Create the scratch directory if it does not exist yet.
os.makedirs(scratch_directory, exist_ok=True)

review_txt_filepath = os.path.join(scratch_directory, 'review_text_all.txt')

In [12]:
%%time
execute = False

if execute:
    review_count = 0
    with open(review_txt_filepath, 'w') as review_txt_file:
        with open(review_json_filepath) as review_json_file:
            for review_json in review_json_file:
                review = json.loads(review_json)
                # Keep only reviews that belong to a restaurant.
                if review['business_id'] not in restaurant_ids:
                    continue
                # Escape newlines so that each review occupies exactly one line.
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1
    print(f'Text from {review_count:,} restaurant reviews written to the new txt file.')

else:
    # Count the reviews written by a previous run (0 if the file is empty).
    with open(review_txt_filepath) as review_txt_file:
        review_count = sum(1 for _ in review_txt_file)

    print(f'Text from {review_count:,} restaurant reviews in the txt file.')

Text from 4,725,884 restaurant reviews in the txt file.
CPU times: user 7.6 s, sys: 1.4 s, total: 9 s
Wall time: 35.1 s
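
The review file keeps one review per line by replacing each real newline with the two characters `\n` before writing; later cells undo this with the reverse `replace`. A small sketch of the round trip on a made-up review:

```python
original = 'Great food.\n\nSlow service.'

# Escape: real newlines become a literal backslash followed by "n".
stored_line = original.replace('\n', '\\n')

# Unescape: the same replace in the other direction.
restored = stored_line.replace('\\n', '\n')

print(stored_line)
print(restored == original)
```

The round trip is exact as long as the review does not already contain a literal backslash followed by `n`; such a sequence would come back as a real newline.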


In [13]:
import spacy
from spacy import displacy
import pandas as pd
import itertools as it

In [14]:
!python -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [15]:
nlp = spacy.load('en_core_web_md')

In [16]:
review_num = 42

with open(review_txt_filepath) as f:
    sample_review = list(it.islice(f, review_num, review_num+1))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

Excellent service! Great diner food and breakfast is served all day. Came here for lunch- they were busy but very friendly and hospitable. Easy to get to off the 295.



In [17]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 19.5 ms, sys: 1.13 ms, total: 20.6 ms
Wall time: 35.8 ms


In [18]:
print(parsed_review)

Excellent service! Great diner food and breakfast is served all day. Came here for lunch- they were busy but very friendly and hospitable. Easy to get to off the 295.



In [19]:
displacy.render(parsed_review, style="ent", jupyter=True)

In [20]:
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence

In [21]:
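def punct_space(token):
    """Return True for tokens that are pure punctuation or whitespace."""
    return token.is_punct or token.is_space

def pronoun_lemmatize(token):
    """Return the lowercased lemma, keeping the surface form of pronouns."""
    # spaCy v2 lemmatized every pronoun to '-PRON-'. spaCy v3 no longer does,
    # so with the 3.5.0 model loaded above this branch is never taken.
    if token.lemma_ == '-PRON-':
        return token.lower_

    else:
        return token.lemma_.lower()

def line_review(filename):
    """Yield reviews one at a time, restoring the escaped newlines."""
    with open(filename) as f:
        for review in f:
            yield review.replace('\\n', '\n')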

In [22]:
review_lemmatized_filepath = os.path.join(scratch_directory, 'review_lemmatized_all.txt')
sentence_lemmatized_filepath = os.path.join(scratch_directory, 'sentence_lemmatized_all.txt')

In [23]:
%%time
execute = False

if execute:
    with open(review_lemmatized_filepath, 'w') as review_file:
        with open(sentence_lemmatized_filepath, 'w') as sentence_file:
            # Stream the reviews through spaCy in batches.
            pipe = nlp.pipe(
                line_review(review_txt_filepath),
                batch_size=5000
                )

            for parsed_review in pipe:
                # One line per review: lemmas only, no punctuation or whitespace.
                lemmatized_review = ' '.join([
                    pronoun_lemmatize(token)
                    for token in parsed_review
                    if not punct_space(token)
                    ])

                review_file.write(lemmatized_review + '\n')

                # One line per sentence, used later to train the phrase models.
                for sent in parsed_review.sents:
                    lemmatized_sentence = ' '.join([
                        pronoun_lemmatize(token)
                        for token in sent
                        if not punct_space(token)
                        ])

                    sentence_file.write(lemmatized_sentence + '\n')

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 8.58 µs


In [24]:
sentences_unigrams = LineSentence(sentence_lemmatized_filepath)

In [25]:
for sentence_unigrams in it.islice(sentences_unigrams, 60, 70):
    print(' '.join(sentence_unigrams))
    print('')

the bun make the sonoran dog

it be like a snuggie for the pup

a first it seem ridiculous and almost like it be go to be too much exactly like everyone 's favorite blanket with sleeve

too much softness too much smush too indulgent

wrong

it be warm soft chewy fragrant and it succeed where other famed sonoran dogs fail

the hot dog itself be flavorful but i would prefer that it or the bacon have a little more bite or snap to well hold their own against the dominant mustard and onion

i be with the masse on the carne asada caramelo

excellent tortilla salty melty cheese and great carne

super cheap and you can drive through



In [26]:
bigram_model_filepath = os.path.join(scratch_directory, 'bigram_phrase_model')

In [27]:
%%time
execute = False

if execute:

    bigram_phrases = Phrases(sentences_unigrams)
    bigram_phrases = Phraser(bigram_phrases)
    bigram_phrases.save(bigram_model_filepath)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.91 µs


In [28]:
bigram_phrases = Phraser.load(bigram_model_filepath)



In [29]:
sentences_bigrams_filepath = os.path.join(scratch_directory, 'sentence_bigram_phrases_all.txt')

In [30]:
%%time
execute = False
if execute:
    with open(sentences_bigrams_filepath, 'w') as f:
        for sentence_unigrams in sentences_unigrams:
            sentence_bigrams = ' '.join(bigram_phrases[sentence_unigrams])
            f.write(sentence_bigrams + '\n')

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 8.34 µs


In [31]:
sentences_bigrams = LineSentence(sentences_bigrams_filepath)

In [32]:
for sentence_bigrams in it.islice(sentences_bigrams, 60, 70):
    print(' '.join(sentence_bigrams))
    print('')

the bun make the sonoran_dog

it be like a snuggie for the pup

a first it seem ridiculous and almost like it be go to be too much exactly like everyone 's favorite blanket with sleeve

too much softness too much smush too indulgent

wrong

it be warm soft chewy fragrant and it succeed where other famed sonoran_dogs fail

the hot_dog itself be flavorful but i would prefer that it or the bacon have a little more bite or snap to well hold their own against the dominant mustard and onion

i be with the masse on the carne_asada caramelo

excellent tortilla salty melty_cheese and great carne

super cheap and you can drive_through
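
The joined tokens above (`sonoran_dog`, `hot_dog`, `carne_asada`) come from a collocation score: a pair is merged when the two words occur together more often than their individual counts would suggest. The sketch below is a simplified plain-Python version of that idea, modeled on the formula documented for gensim's default scorer, `(count(a, b) - min_count) * vocabulary_size / (count(a) * count(b))`. The helper `score_bigrams` and the toy sentences are hypothetical, and `min_count=1` is chosen only to suit the tiny corpus.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score every adjacent word pair; higher means a stronger collocation."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigram_counts)
    return {
        (a, b): (n_ab - min_count) / unigram_counts[a] / unigram_counts[b] * vocab_size
        for (a, b), n_ab in bigram_counts.items()
    }

# Hypothetical lemmatized sentences, in the style of the output above.
toy_sentences = [
    ['the', 'hot', 'dog', 'be', 'flavorful'],
    ['i', 'order', 'a', 'hot', 'dog'],
    ['the', 'hot', 'dog', 'be', 'good'],
    ['the', 'food', 'be', 'hot'],
]

scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

In gensim, a pair is merged only when its score exceeds the `threshold` argument of `Phrases` (10.0 by default, with `min_count=5`), so the absolute values from this toy corpus are not comparable to those of the real model.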



In [33]:
trigram_model_filepath = os.path.join(scratch_directory, 'trigram_phrase_model')

In [34]:
%%time
execute = False

if execute:

    trigram_phrases = Phrases(sentences_bigrams)
    trigram_phrases = Phraser(trigram_phrases)
    trigram_phrases.save(trigram_model_filepath)

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.63 µs


In [35]:
trigram_phrases = Phraser.load(trigram_model_filepath)



In [36]:
sentences_trigrams_filepath = os.path.join(scratch_directory, 'sentence_trigram_phrases_all.txt')

In [37]:
%%time
execute = False

if execute:
    with open(sentences_trigrams_filepath, 'w') as f:
        for sentence_bigrams in sentences_bigrams:
            sentence_trigrams = ' '.join(trigram_phrases[sentence_bigrams])
            f.write(sentence_trigrams + '\n')

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.58 µs


In [38]:
sentences_trigrams = LineSentence(sentences_trigrams_filepath)

In [39]:
for sentence_trigrams in it.islice(sentences_trigrams, 60, 70):
    print(' '.join(sentence_trigrams))
    print('')

the bun make the sonoran_dog

it be like a snuggie for the pup

a first it seem ridiculous and almost like it be go to be too much exactly like everyone 's favorite blanket with sleeve

too much softness too much smush too indulgent

wrong

it be warm soft chewy fragrant and it succeed where other famed sonoran_dogs fail

the hot_dog itself be flavorful but i would prefer that it or the bacon have a little more bite or snap to well hold their own against the dominant mustard and onion

i be with the masse on the carne_asada_caramelo

excellent tortilla salty melty_cheese and great carne

super cheap and you can drive_through



In [40]:
review_trigrams_filepath = os.path.join(scratch_directory, 'review_trigrams_all.txt')

In [41]:
%%time
execute = False

if execute:
    reviews_lemmatized = LineSentence(review_lemmatized_filepath)

    with open(review_trigrams_filepath, 'w') as f:

        for review_unigrams in reviews_lemmatized:
            # Apply the phrase models in order: unigrams -> bigrams -> trigrams.
            review_bigrams = bigram_phrases[review_unigrams]
            review_trigrams = trigram_phrases[review_bigrams]

            # Drop stop words only after phrase detection, so that phrases
            # containing a stop word can still be joined.
            review_trigrams = [
                term
                for term in review_trigrams
                if term not in nlp.Defaults.stop_words
                ]

            review_trigrams = ' '.join(review_trigrams)
            f.write(review_trigrams + '\n')

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 9.54 µs


In [42]:
review_num = 0

print('Original:' + '\n')

for review in it.islice(line_review(review_txt_filepath), review_num, review_num+1):
    print(review)

print('----' + '\n')
print('Transformed:' + '\n')

with open(review_trigrams_filepath) as f:
    for review in it.islice(f, review_num, review_num+1):
        print(review)

Original:

If you decide to eat here, just be aware it is going to take about 2 hours from beginning to end. We have tried it multiple times, because I want to like it! I have been to it's other locations in NJ and never had a bad experience. 

The food is good, but it takes a very long time to come out. The waitstaff is very young, but usually pleasant. We have just had too many experiences where we spent way too long waiting. We usually opt for another diner or restaurant on the weekends, in order to be done quicker.

----

Transformed:

decide eat aware 2 hour begin end try multiple time want like location nj bad experience food good long time come waitstaff young usually pleasant experience spend way long wait usually opt diner restaurant weekend order quick



In [43]:
!pip install pyLDAvis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Collecting numpy>=1.24.2 (from pyLDAvis)
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Downloading pandas-2.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting funcy (from pyLDAvis)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy, numpy, pandas, pyLDAvis
  Attempting uninstall: numpy
    Fou

In [44]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
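# In recent pyLDAvis releases (including the 3.4.1 installed above) the
# gensim helper module is named `gensim_models` rather than `gensim`.
import pyLDAvis.gensim_models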
import warnings
import pickle

In [45]:
dictionary_filepath = os.path.join(scratch_directory, 'trigram_dict_all.dict')



In [46]:
%%time
execute = False

if execute:
    reviews_trigrams = LineSentence(review_trigrams_filepath)
    dictionary_trigrams = Dictionary(reviews_trigrams)
    dictionary_trigrams.filter_extremes(no_below=20, no_above=0.4)
    dictionary_trigrams.compactify()
    dictionary_trigrams.save(dictionary_filepath)  

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 9.06 µs
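
`filter_extremes(no_below=20, no_above=0.4)` drops terms that appear in fewer than 20 reviews or in more than 40% of them. The following sketch reproduces that document-frequency rule in plain Python on a toy corpus. `filter_by_doc_frequency` is a hypothetical helper, not gensim's implementation, and it ignores gensim's additional `keep_n` cap on vocabulary size.

```python
from collections import Counter

def filter_by_doc_frequency(documents, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for tokens in documents:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(tokens))

    max_docs = no_above * len(documents)
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical preprocessed reviews.
toy_reviews = [
    ['food', 'good', 'pizza'],
    ['food', 'slow', 'service'],
    ['food', 'good', 'burger'],
    ['food', 'pizza', 'cheap'],
]

kept = filter_by_doc_frequency(toy_reviews, no_below=2, no_above=0.5)
print(sorted(kept))
```

Here `food` is removed for being in every review and the one-off terms for being too rare, which is the same trade-off the notebook's thresholds make at corpus scale.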




In [47]:
dictionary_trigrams = Dictionary.load(dictionary_filepath)



In [48]:
bow_corpus_filepath = os.path.join(scratch_directory, 'bow_trigrams_corpus_all.mm')



In [49]:
def bow_generator(filepath):
   
    for review in LineSentence(filepath):
        yield dictionary_trigrams.doc2bow(review)
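
`doc2bow` turns a tokenized review into a sparse bag of words: a list of `(token_id, count)` pairs for the tokens the dictionary knows, with everything else dropped. A plain-Python sketch of that conversion, using a hypothetical `token_to_id` mapping in place of the gensim `Dictionary`:

```python
from collections import Counter

def to_bow(tokens, token_to_id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

# Hypothetical vocabulary; 'waitstaff' is deliberately missing from it.
token_to_id = {'pizza': 0, 'good': 1, 'service': 2}

bow = to_bow(['good', 'pizza', 'good', 'waitstaff'], token_to_id)
print(bow)
```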



In [50]:
%%time
execute = False

if execute:
    MmCorpus.serialize(bow_corpus_filepath, bow_generator(review_trigrams_filepath))

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 7.63 µs




In [51]:
trigram_bow_corpus = MmCorpus(bow_corpus_filepath)



In [52]:
lda_model_filepath = os.path.join(scratch_directory, 'lda_model_all')



In [53]:
%%time
execute = False

if execute:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        lda = LdaMulticore(trigram_bow_corpus, num_topics=50, id2word=dictionary_trigrams, workers=7)
    
    lda.save(lda_model_filepath)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs




In [54]:
lda = LdaMulticore.load(lda_model_filepath)



In [60]:
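def explore_topic(topic_number, topn=25):
    """Print the `topn` highest-weighted terms of a topic with their weights."""
    print(f'{"term":20} {"frequency"}' + '\n')

    # show_topic returns (term, probability) pairs, highest probability first.
    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(f'{term:20} {frequency:.3f}')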



In [70]:
explore_topic(topic_number=10, topn=5)

term                 frequency

order                0.039
service              0.016
ask                  0.016
time                 0.015
come                 0.015




In [71]:
topic_names = {
    0: 'place1',
    1: 'sauce',
    2: 'place2',
    3: 'time',
    4: 'service',
    5: 'seafood',
    6: 'reservation',
    7: 'taste',
    8: 'donut',
    9: 'vietnam',
    10: 'order',
    11: 'nightlife',
    12: 'burger & fries',
    13: 'classy ambience', #
    14: 'long wait',
    15: 'chicken',
    16: 'sandwiches',
    17: 'good service',
    18: 'vegas hotel',
    19: 'pizza',
    20: 'salad',
    21: 'bar vibe', #
    22: 'meal experience', #
    23: 'slow service',
    24: 'brunch',
    25: 'portion sizes',
    26: 'beer, wings, sports',
    27: 'breakfast',
    28: 'miscellaneous',
    29: 'non-English',
    30: 'deli',
    31: 'barbecue',
    32: 'local business',
    33: 'miscellaneous',
    34: 'hole-in-the-wall',
    35: 'asian',
    36: 'specials',
    37: 'coffeeshop',
    38: 'prices',
    39: 'flavor & texture',
    40: 'noodles',
    41: 'canadian',
    42: 'highly recommended',
    43: 'sushi',
    44: 'ordering',
    45: 'mediterranean',
    46: 'decent value',
    47: 'cleanliness',
    48: 'lobster',
    49: 'seafood'
    }



In [72]:
topic_names_filepath = os.path.join(scratch_directory, 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)



In [73]:
LDAvis_data_filepath = os.path.join(scratch_directory, 'ldavis_prepared')



**To do**

* Add comments
* Word embeddings with Word2vec
* Visualizations
* Topic profiling
* Word algebra

**References**

* https://spacy.io/
* https://radimrehurek.com/gensim/
* https://github.com/pwharrison/modern-nlp-in-python-2019/blob/master/notebooks/Modern_NLP_in_Python.ipynb