<a href="https://colab.research.google.com/github/FernandoBRdgz/inteligencia_artificial/blob/main/incrustaciones_de_palabras/word2vec_yelp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, released for personal, educational, and academic use. It is distributed as JSON files and can be used to teach students about databases, to learn NLP, or as sample production data while learning to build mobile apps.

Dataset link: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
import os
import json
from pprint import pprint

In [3]:
main_path = '/content/drive/MyDrive/'

In [4]:
data_directory = os.path.join(main_path, 'data', 'yelp_dataset')

In [7]:
businesses_filepath = os.path.join(data_directory, 'yelp_academic_dataset_business.json')

In [12]:
with open(businesses_filepath) as f:
    first_business_record = f.readline() 

pprint(first_business_record)

('{"business_id":"Pns2l4eNsfO8kk83dixA6A","name":"Abby Rappoport, LAC, '
 'CMQ","address":"1616 Chapala St, Ste 2","city":"Santa '
 'Barbara","state":"CA","postal_code":"93101","latitude":34.4266787,"longitude":-119.7111968,"stars":5.0,"review_count":7,"is_open":0,"attributes":{"ByAppointmentOnly":"True"},"categories":"Doctors, '
 'Traditional Chinese Medicine, Naturopathic\\/Holistic, Acupuncture, Health & '
 'Medical, Nutritionists","hours":null}\n')
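The record above is printed as a raw JSON string, one object per line. A minimal sketch of decoding such a line into a dictionary with `json.loads` follows; `sample_line` is a shortened, illustrative copy of the record shown above, not a value read from the file.

```python
import json

# Shortened copy of the first business record printed above (illustrative only).
sample_line = (
    '{"business_id":"Pns2l4eNsfO8kk83dixA6A","name":"Abby Rappoport, LAC, CMQ",'
    '"city":"Santa Barbara","state":"CA","stars":5.0,"review_count":7,'
    '"categories":"Doctors, Traditional Chinese Medicine","hours":null}\n'
)

# json.loads accepts the trailing newline and maps JSON null to Python None.
business = json.loads(sample_line)

print(business['name'])
print(business['categories'].split(', '))
print(business['hours'])
```

Note that `categories` arrives as a single comma-separated string rather than a list, which matters for any filtering done on it.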


In [13]:
review_json_filepath = os.path.join(data_directory, 'yelp_academic_dataset_review.json')

In [15]:
with open(review_json_filepath) as f:
    first_review_record = f.readline()
    
pprint(first_review_record)

('{"review_id":"KU_O5udG6zpxOg-VcAEodg","user_id":"mh_-eMZ6K5RLWhZyISBhwA","business_id":"XQfwVwDr-v0ZS3_CbbE5Xw","stars":3.0,"useful":0,"funny":0,"cool":0,"text":"If '
 'you decide to eat here, just be aware it is going to take about 2 hours from '
 'beginning to end. We have tried it multiple times, because I want to like '
 "it! I have been to it's other locations in NJ and never had a bad "
 'experience. \\n\\nThe food is good, but it takes a very long time to come '
 'out. The waitstaff is very young, but usually pleasant. We have just had too '
 'many experiences where we spent way too long waiting. We usually opt for '
 'another diner or restaurant on the weekends, in order to be done '
 'quicker.","date":"2018-07-07 22:09:11"}\n')


In [17]:
restaurant_ids = set()

with open(businesses_filepath) as f:
    for business_json in f:
        business = json.loads(business_json)
        # Skip businesses whose category string is missing or empty.
        if not business.get('categories'):
            continue
        # Substring test against the comma-separated category string.
        if 'Restaurants' not in business['categories']:
            continue
        restaurant_ids.add(business['business_id'])

restaurant_ids = frozenset(restaurant_ids)

pprint(f'{len(restaurant_ids):,} restaurants in the dataset.')

'52,268 restaurants in the dataset.'
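The filter above uses a substring test on the comma-separated `categories` string, so it also accepts any label that merely contains the word `Restaurants`. As a hedged alternative, the sketch below matches whole labels only. The helper name `has_category` and the sample category strings are hypothetical, and the count reported above was not recomputed with it.

```python
def has_category(categories, wanted):
    """Return True if `wanted` is one of the comma-separated labels in `categories`."""
    if not categories:
        # Covers both None (JSON null) and the empty string.
        return False
    labels = {label.strip() for label in categories.split(',')}
    return wanted in labels


# Hypothetical category strings, used only to contrast the two tests.
exact = 'Restaurants, Pizza'
compound = 'Pop-Up Restaurants, Food'

print(has_category(exact, 'Restaurants'), 'Restaurants' in exact)
print(has_category(compound, 'Restaurants'), 'Restaurants' in compound)
print(has_category(None, 'Restaurants'))
```

Whether the stricter test is preferable depends on whether compound labels should count as restaurants for the analysis at hand.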


In [19]:
scratch_directory = os.path.join(data_directory, 'scratch')

# Create the scratch directory if it does not already exist.
os.makedirs(scratch_directory, exist_ok=True)

review_txt_filepath = os.path.join(scratch_directory, 'review_text_all.txt')

In [21]:
%%time
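# Set to False to skip the slow rewrite and only count the lines already on disk.
execute = True

if execute:
    review_count = 0
    with open(review_txt_filepath, 'w', encoding='utf-8') as review_txt_file:
        with open(review_json_filepath, encoding='utf-8') as review_json_file:
            for review_json in review_json_file:
                review = json.loads(review_json)
                # Keep only reviews of businesses identified as restaurants.
                if review['business_id'] not in restaurant_ids:
                    continue
                # Escape embedded newlines so each review occupies exactly one line.
                review_txt_file.write(review['text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print(f'Text from {review_count:,} restaurant reviews written to the new txt file.')

else:
    with open(review_txt_filepath, encoding='utf-8') as review_txt_file:
        # sum() gives 0 for an empty file, where an enumerate loop would leave the count undefined.
        review_count = sum(1 for _ in review_txt_file)

    print(f'Text from {review_count:,} restaurant reviews in the txt file.')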

Text from 4,724,471 restaurant reviews written to the new txt file.
CPU times: user 1min 15s, sys: 9.03 s, total: 1min 24s
Wall time: 2min 20s
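The cell above stores each review on one line by replacing real newlines with the two-character sequence `\n`. A minimal sketch of that round trip is shown below. The helper names `escape_review` and `unescape_review` are hypothetical, and the sample text is made up for illustration.

```python
def escape_review(text):
    """Replace real newlines with a literal backslash-n so the text fits on one line."""
    return text.replace('\n', '\\n')


def unescape_review(line):
    """Undo escape_review on a line read back from the txt file."""
    # Drop the line terminator first, then restore the embedded newlines.
    return line.rstrip('\n').replace('\\n', '\n')


# Made-up review text with a paragraph break.
original = 'The food is good.\n\nThe wait is long.'

stored_line = escape_review(original) + '\n'   # what gets written to the file
restored = unescape_review(stored_line)        # what a later cell could recover

print(repr(stored_line))
print(restored == original)
```

The scheme is not fully reversible: a review that already contains a literal backslash followed by `n` would come back with a real newline in its place. For tokenization-oriented work that is usually acceptable, but it is an assumption worth keeping in mind.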


**To do**

* Add comments
* Word embeddings with Word2vec
* Visualizations
* Phrase modeling
* Text cleaning
* Bigrams and trigrams
* LDA
* Word algebra

**References**

* https://github.com/pwharrison/modern-nlp-in-python-2019/blob/master/notebooks/Modern_NLP_in_Python.ipynb