<a href="https://colab.research.google.com/github/FernandoBRdgz/inteligencia_artificial/blob/main/incrustaciones_de_palabras/word2vec_yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, released for personal, educational, and academic use. It is distributed as JSON files and can be used to teach students about databases, to learn NLP, or as sample production data while learning to build mobile apps.

Dataset link: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import os
import json
from pprint import pprint

In [4]:
main_path = '/content/drive/MyDrive/'

In [5]:
data_directory = os.path.join(main_path, 'data', 'yelp_dataset')

In [6]:
businesses_filepath = os.path.join(data_directory, 'yelp_academic_dataset_business.json')

In [7]:
with open(businesses_filepath) as f:
    first_business_record = f.readline() 

pprint(first_business_record)

('{"business_id":"Pns2l4eNsfO8kk83dixA6A","name":"Abby Rappoport, LAC, '
 'CMQ","address":"1616 Chapala St, Ste 2","city":"Santa '
 'Barbara","state":"CA","postal_code":"93101","latitude":34.4266787,"longitude":-119.7111968,"stars":5.0,"review_count":7,"is_open":0,"attributes":{"ByAppointmentOnly":"True"},"categories":"Doctors, '
 'Traditional Chinese Medicine, Naturopathic\\/Holistic, Acupuncture, Health & '
 'Medical, Nutritionists","hours":null}\n')
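
The record above is printed as a raw JSON string. As a minimal sketch, `json.loads` turns one such line into a dictionary whose fields can be read by key. The sample string is abridged from the output above, and `sample_line` is a name introduced only for this illustration.

```python
import json

# One line of the business file, abridged from the record printed above.
sample_line = (
    '{"business_id":"Pns2l4eNsfO8kk83dixA6A",'
    '"name":"Abby Rappoport, LAC, CMQ",'
    '"city":"Santa Barbara","state":"CA","stars":5.0,"review_count":7,'
    '"categories":"Doctors, Traditional Chinese Medicine, '
    'Naturopathic\\/Holistic, Acupuncture, Health & Medical, Nutritionists",'
    '"hours":null}\n'
)

# json.loads accepts the trailing newline and maps JSON null to None.
business = json.loads(sample_line)

print(business['name'])
print(business['stars'], business['review_count'])
print(business['hours'])
```

Each file holds one JSON object per line, so records can be parsed one at a time without loading the whole file.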


In [8]:
review_json_filepath = os.path.join(data_directory, 'yelp_academic_dataset_review.json')

In [9]:
with open(review_json_filepath) as f:
    first_review_record = f.readline()
    
pprint(first_review_record)

('{"review_id":"KU_O5udG6zpxOg-VcAEodg","user_id":"mh_-eMZ6K5RLWhZyISBhwA","business_id":"XQfwVwDr-v0ZS3_CbbE5Xw","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"If '
 'you decide to eat here, just be aware it is going to take about 2 hours from '
 'beginning to end. We have tried it multiple times, because I want to like '
 "it! I have been to it's other locations in NJ and never had a bad "
 'experience. \\n\\nThe food is good, but it takes a very long time to come '
 'out. The waitstaff is very young, but usually pleasant. We have just had too '
 'many experiences where we spent way too long waiting. We usually opt for '
 'another diner or restaurant on the weekends, in order to be done '
 'quicker.","date":"2018-07-07 22:09:11"}\n')


In [10]:
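restaurant_ids = set()

# The business file has one JSON object per line, so it can be streamed.
with open(businesses_filepath) as f:
    for business_json in f:
        business = json.loads(business_json)
        # Some businesses have no categories (null or empty); skip them.
        if not business.get('categories'):
            continue
        # 'categories' is a comma-separated string, so this is a substring test.
        if 'Restaurants' not in business['categories']:
            continue
        restaurant_ids.add(business['business_id'])

# Freeze the set: it is only used for membership tests from here on.
restaurant_ids = frozenset(restaurant_ids)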

pprint(f'{len(restaurant_ids):,} restaurants in the dataset.')

'52,268 restaurants in the dataset.'
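
The filter above runs a substring test on `categories`, which is a single comma-separated string, so any label that merely contains the text `Restaurants` also matches. If exact label matching is wanted, one possible sketch splits the string first. `has_category` is a hypothetical helper that is not part of the notebook, the records below are made up, and switching to it could change the count reported above.

```python
def has_category(business, label):
    """Return True if `label` is one of the business's exact category labels."""
    categories = business.get('categories') or ''
    labels = {part.strip() for part in categories.split(',')}
    return label in labels

# Hypothetical records for illustration only.
diner = {'categories': 'Diners, Restaurants, Breakfast & Brunch'}
pop_up = {'categories': 'Pop-Up Restaurants, Food'}
no_categories = {'categories': None}

print(has_category(diner, 'Restaurants'))      # exact label present
print(has_category(pop_up, 'Restaurants'))     # only a longer label contains the text
print('Restaurants' in pop_up['categories'])   # the substring test still matches
print(has_category(no_categories, 'Restaurants'))
```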


In [11]:
scratch_directory = os.path.join(data_directory, 'scratch')

# Create the scratch directory if it does not exist yet.
os.makedirs(scratch_directory, exist_ok=True)

review_txt_filepath = os.path.join(scratch_directory, 'review_text_all.txt')

In [12]:
%%time
execute = False

if execute:
    review_count = 0
    with open(review_txt_filepath, 'w') as review_txt_file:
        with open(review_json_filepath) as review_json_file:
            for review_json in review_json_file:
                review = json.loads(review_json)
                if review['business_id'] not in restaurant_ids:
                    continue
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1
    print(f'Text from {review_count:,} restaurant reviews written to the new txt file.')
    
else:
    with open(review_txt_filepath) as review_txt_file:
        # Count lines lazily; this also gives 0 for an empty file.
        review_count = sum(1 for _ in review_txt_file)

    print(f'Text from {review_count:,} restaurant reviews in the txt file.')

Text from 4,725,884 restaurant reviews in the txt file.
CPU times: user 10.5 s, sys: 2.09 s, total: 12.6 s
Wall time: 29.3 s
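
The cell above stores one review per line by replacing real newlines inside a review with the two-character sequence `\n`, and later cells reverse that replacement when reading. A small sketch of the round trip, using an in-memory buffer instead of the Drive file (the sample texts are made up):

```python
import io

reviews = [
    'Great food.\n\nSlow service, though.',
    'Quick lunch spot.',
]

buffer = io.StringIO()
for text in reviews:
    # Escape embedded newlines so each review occupies exactly one line.
    buffer.write(text.replace('\n', '\\n') + '\n')

buffer.seek(0)
# Drop the line terminator, then restore the embedded newlines.
restored = [line.rstrip('\n').replace('\\n', '\n') for line in buffer]

print(len(restored))
print(restored == reviews)
```

One caveat of this scheme is that a review which already contains a literal backslash followed by `n` cannot be told apart from an escaped newline when it is read back.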


In [13]:
import spacy
from spacy import displacy
import pandas as pd
import itertools as it

In [14]:
!python -m spacy download en_core_web_md

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 18.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [15]:
nlp = spacy.load('en_core_web_md')

In [16]:
review_num = 42

with open(review_txt_filepath) as f:
    sample_review = list(it.islice(f, review_num, review_num+1))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

Excellent service! Great diner food and breakfast is served all day. Came here for lunch- they were busy but very friendly and hospitable. Easy to get to off the 295.



In [17]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 20.2 ms, sys: 4.03 ms, total: 24.2 ms
Wall time: 35.8 ms


In [18]:
print(parsed_review)

Excellent service! Great diner food and breakfast is served all day. Came here for lunch- they were busy but very friendly and hospitable. Easy to get to off the 295.



In [19]:
displacy.render(parsed_review, style="ent", jupyter=True)

In [20]:
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence

In [21]:
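def punct_space(token):
    """Return True for tokens that are pure punctuation or whitespace."""
    return token.is_punct or token.is_space

def pronoun_lemmatize(token):
    """Return the lowercase lemma, keeping the surface form for pronouns."""
    # spaCy v2 gave pronouns the placeholder lemma '-PRON-'; in that case
    # keep the lowercase token text. With spaCy v3 models, such as the
    # 3.5.0 model loaded above, this branch is not expected to trigger.
    if token.lemma_ == '-PRON-':
        return token.lower_

    else:
        return token.lemma_.lower()

def line_review(filename):
    """Yield reviews one at a time, restoring the escaped newlines."""
    with open(filename) as f:
        for review in f:
            yield review.replace('\\n', '\n')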

In [22]:
review_lemmatized_filepath = os.path.join(scratch_directory, 'review_lemmatized_all.txt')
sentence_lemmatized_filepath = os.path.join(scratch_directory, 'sentence_lemmatized_all.txt')

In [23]:
%%time
execute = False

if execute:
    with open(review_lemmatized_filepath, 'w') as review_file:
        with open(sentence_lemmatized_filepath, 'w') as sentence_file:
            pipe = nlp.pipe(
                line_review(review_txt_filepath),
                batch_size=5000
                )
            
            for parsed_review in pipe:
                lemmatized_review = ' '.join([
                    pronoun_lemmatize(token)
                    for token in parsed_review
                    if not punct_space(token)
                    ])
                
                review_file.write(lemmatized_review + '\n')
        
                for sent in parsed_review.sents:
                    lemmatized_sentence = ' '.join([
                        pronoun_lemmatize(token)
                        for token in sent
                        if not punct_space(token)
                        ])
                    
                    sentence_file.write(lemmatized_sentence + '\n')

CPU times: user 0 ns, sys: 5 µs, total: 5 µs
Wall time: 9.3 µs


In [24]:
sentences_unigrams = LineSentence(sentence_lemmatized_filepath)

In [25]:
for sentence_unigrams in it.islice(sentences_unigrams, 60, 70):
    print(' '.join(sentence_unigrams))
    print('')

the bun make the sonoran dog

it be like a snuggie for the pup

a first it seem ridiculous and almost like it be go to be too much exactly like everyone 's favorite blanket with sleeve

too much softness too much smush too indulgent

wrong

it be warm soft chewy fragrant and it succeed where other famed sonoran dogs fail

the hot dog itself be flavorful but i would prefer that it or the bacon have a little more bite or snap to well hold their own against the dominant mustard and onion

i be with the masse on the carne asada caramelo

excellent tortilla salty melty cheese and great carne

super cheap and you can drive through
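
`LineSentence` streams the sentence file lazily and yields each line as a list of whitespace-separated tokens, so the corpus never has to fit in memory. A rough standard-library approximation of that behavior, for illustration only: `iter_sentences` is a hypothetical helper, and it leaves out gensim details such as splitting very long lines into chunks.

```python
import io
import itertools as it

def iter_sentences(lines):
    """Yield each non-empty line as a list of whitespace-separated tokens."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens

# In-memory stand-in for the sentence file; lines copied from the output above.
sample_file = io.StringIO(
    'the bun make the sonoran dog\n'
    'it be like a snuggie for the pup\n'
    '\n'
    'super cheap and you can drive through\n'
)

# islice pulls a window of sentences without reading past what it needs.
for tokens in it.islice(iter_sentences(sample_file), 1, 3):
    print(tokens)
```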



**To do**

* Add comments
* Word embeddings with Word2vec
* Visualizations
* Phrase modeling
* Text cleaning
* Bigrams and trigrams
* LDA
* Word algebra
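
For the phrase modeling and bigram items, gensim's `Phrases` (imported earlier) is the likely tool. To convey the idea without depending on it, here is a simplified standard-library sketch that scores adjacent word pairs: the pair count, discounted by `min_count`, is divided by the product of the two word counts and scaled by the vocabulary size. The function name, the default `min_count`, and the toy sentences are assumptions for illustration, and this is not gensim's implementation.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score adjacent word pairs; a higher score suggests a stronger phrase."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)
    scores = {}
    for (first, second), count in bigrams.items():
        if count <= min_count:
            continue  # too rare to be considered a phrase
        scores[(first, second)] = (
            (count - min_count) * vocab_size
            / (unigrams[first] * unigrams[second])
        )
    return scores

# Toy lemmatized sentences, made up for this example.
toy_sentences = [
    ['the', 'hot', 'dog', 'be', 'flavorful'],
    ['i', 'order', 'a', 'hot', 'dog'],
    ['the', 'hot', 'dog', 'be', 'cheap'],
    ['the', 'salsa', 'be', 'hot'],
]

scores = score_bigrams(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

Pairs that score above a chosen threshold would then be joined into a single token (for example `hot_dog`) before training embeddings, and a second pass over the joined text could pick up trigrams.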

**References**

* https://spacy.io/
* https://radimrehurek.com/gensim/
* https://github.com/pwharrison/modern-nlp-in-python-2019/blob/master/notebooks/Modern_NLP_in_Python.ipynb