# Introduction, objectives, and contents
====================================================================================================================================

Contents:

* Library imports
* Data loading
* Preprocessing: data types, null values, and duplicates
    * RESTAURANTS dataset
    * TIP dataset
    * REVIEW dataset
* Data analysis
    


# Library imports
====================================================================================================================================

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from scipy import stats as st
import json

from joblib import Parallel, delayed
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Data loading
====================================================================================================================================

In [2]:
# Restaurants
dfyrst = pd.read_parquet('dfyrst_gastronomics.parquet')

In [4]:
# Tips
tip = pd.read_json('dataset_y_tips.json', lines=True)

In [6]:
# Reviews
review = pd.read_parquet('dataset_y_reviews.parquet')
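
The tips file is read in a single call with `lines=True`. If memory becomes a concern, `pd.read_json` also accepts `chunksize` for line-delimited JSON. The sketch below is a minimal illustration only: it writes a small temporary file as a stand-in for `dataset_y_tips.json`, and the sample records, the file name `tips_sample.json`, and the chunk size are all assumptions.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for the real tips file.
records = [
    {"business_id": "a", "text": "Great coffee", "compliment_count": 0},
    {"business_id": "b", "text": "$5 min for cards", "compliment_count": 0},
    {"business_id": "c", "text": "Super friendly staff!", "compliment_count": 1},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tips_sample.json")
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

    # With lines=True and chunksize set, read_json yields DataFrames lazily.
    with pd.read_json(path, lines=True, chunksize=2) as reader:
        chunks = [chunk for chunk in reader]

# Reassemble the pieces into one frame.
tip_sample = pd.concat(chunks, ignore_index=True)
print(tip_sample.shape)
```

Each chunk could also be filtered before concatenation, which keeps peak memory closer to the size of the filtered result.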

# Preprocessing
====================================================================================================================================

## RESTAURANTS dataset

In [7]:
dfyrst.sample()

Unnamed: 0,business_id,name,city,postal_code,latitude,longitude,stars,review_count,is_open,state,state_city,city_postalcode,state_city_postalcode,categories,food
177548,sUfFxJcSCzizQ6WflJWWiA,Creek Side Diner,Kennett Square,19348,39.845317,-75.721428,3.5,24,1,PA,PA - Kennett Square,Kennett Square - 19348,PA - Kennett Square - 19348,Restaurants,yes


## TIP dataset

In [8]:
tip.sample(2)

Unnamed: 0,user_id,business_id,text,date,compliment_count
807213,ThuiwScMkhSD22UgAmEPsQ,7v5mDIxhIwOJINFHaGOGAA,They are pushy sales people who constantly fin...,2017-11-07 22:30:41,0
590905,O2t1y_ZKrEnwa_mCfyyEFA,bchZXVE4feVx4Q5rhIRCGg,$5 min for cards,2017-11-02 16:27:30,0


In [9]:
# Field selection
dfytip = tip

# Derive year, month, and a compact YYMM key from the timestamp
dfytip['year'] = dfytip['date'].dt.year
dfytip['month'] = dfytip['date'].dt.month
dfytip['year_month'] = dfytip['year'].astype(str).str.slice(-2) + dfytip['month'].astype(str).str.zfill(2)

# Keep only tips whose business appears in the restaurants dataset
dfytip = dfytip[dfytip['business_id'].isin(dfyrst['business_id'])]

dfytip.info()
dfytip.sample(5)

<class 'pandas.core.frame.DataFrame'>
Index: 737451 entries, 1 to 908914
Data columns (total 8 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   user_id           737451 non-null  object        
 1   business_id       737451 non-null  object        
 2   text              737451 non-null  object        
 3   date              737451 non-null  datetime64[ns]
 4   compliment_count  737451 non-null  int64         
 5   year              737451 non-null  int32         
 6   month             737451 non-null  int32         
 7   year_month        737451 non-null  object        
dtypes: datetime64[ns](1), int32(2), int64(1), object(4)
memory usage: 45.0+ MB


Unnamed: 0,user_id,business_id,text,date,compliment_count,year,month,year_month
373853,qjfMBIZpQT9DDtw_BWCopQ,McYtxSrd0HQ07nsc3X65WQ,Seafood lafitte seafood casserole if u like cr...,2013-01-05 21:40:02,0,2013,1,1301
348591,tdEtv1_1SShU9RHz6J-DZw,3XzvT-SaEpx8Dun_Jv2LLw,Get some Founder's Breakfast Stout - a must try!,2015-10-31 20:33:44,0,2015,10,1510
280016,4bniqNkPi-5RkVxuDYpViQ,vpg7BwhGHjsAHm-bC1i3Mw,"Very friendly and enthusiastic owner, ecxited ...",2013-12-27 23:58:29,0,2013,12,1312
473779,FMmGJZtWMhnNoLNc6nncUw,0xZhPb5BSxIx9M5vatyE3g,Super friendly staff!,2014-03-02 21:30:02,0,2014,3,1403
464512,yLOHrp1kv9Ut4-rFVDAN8w,885WesBEfEB2t7iWmmm6qw,"Check out their chalk board, today they had $4...",2014-01-19 04:47:47,0,2014,1,1401
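
The `year_month` key above joins the last two digits of the year with the zero-padded month. As a minimal sketch, the same key can be produced with `dt.strftime('%y%m')`. The `demo` frame below uses a few illustrative timestamps and is not the tips data.

```python
import pandas as pd

demo = pd.DataFrame({"date": pd.to_datetime([
    "2013-01-05 21:40:02",
    "2015-10-31 20:33:44",
    "2014-03-02 21:30:02",
])})

# String-slicing approach, as in the cell above
year = demo["date"].dt.year
month = demo["date"].dt.month
by_slicing = year.astype(str).str.slice(-2) + month.astype(str).str.zfill(2)

# Single-call alternative: two-digit year followed by two-digit month
by_strftime = demo["date"].dt.strftime("%y%m")

print(by_strftime.tolist())  # ['1301', '1510', '1403']
```

Both variants return strings, so sorting is lexicographic. That matches chronological order only while all dates fall within the same century.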


In [13]:
# Sentiment analysis based on the "text" field
dfytip['text'] = dfytip['text'].astype(str)

analyzer = SentimentIntensityAnalyzer()
dfytip['polarity'] = dfytip['text'].apply(lambda text: analyzer.polarity_scores(text)['compound'])
dfytip['sentiment'] = pd.cut(dfytip['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[-1, 0, 1])

dfytip.info()
dfytip.sample(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfytip['text'] = dfytip['text'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfytip['polarity'] = dfytip['text'].apply(lambda text: analyzer.polarity_scores(text)['compound'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dfytip['sentiment'] = pd.cut(dfytip['polarity'], bins=[-float(

<class 'pandas.core.frame.DataFrame'>
Index: 737451 entries, 1 to 908914
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   user_id           737451 non-null  object        
 1   business_id       737451 non-null  object        
 2   text              737451 non-null  object        
 3   date              737451 non-null  datetime64[ns]
 4   compliment_count  737451 non-null  int64         
 5   year              737451 non-null  int32         
 6   month             737451 non-null  int32         
 7   year_month        737451 non-null  object        
 8   polarity          737451 non-null  float64       
 9   sentiment         737451 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int32(2), int64(1), object(4)
memory usage: 51.3+ MB


Unnamed: 0,user_id,business_id,text,date,compliment_count,year,month,year_month,polarity,sentiment
723714,qOmTJhh7QT4Exm7gekXVow,0QeEnTzmUTQHie1MPZVHLg,They were slow in delivering and the pizza was...,2017-07-31 16:37:24,0,2017,7,1707,0.0,0
266034,Ymqodts5h5I2QzBZ_PrInQ,mU49-Sb2tnKZvqppDfDw5g,"Happy Hour Buy 1 well, get one !!!",2018-05-31 13:21:17,0,2018,5,1805,0.7701,1
667145,U1GveY7G0jwNO4XSz3AmXg,CqL9aYc4_6YFiYVrq415Jg,Manager was inappropriate and flat out rude to...,2016-10-23 20:26:49,0,2016,10,1610,0.882,1
577073,cfthyNZzoLnrmuG5FOamCQ,XYwx1tsEB3_G0tgS6l-0PQ,Looks good so far. I'll let you know after the...,2012-04-17 22:55:50,0,2012,4,1204,0.4404,1
239385,WZjrnw_XnpYKdkEUZjDnlA,d4ZPdoYxDnT6f70AZPGrcw,Solid Brunch. You can't order booze on the sto...,2013-09-08 17:07:44,0,2013,9,1309,0.8625,1
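
The warnings in the output above appear because new columns are assigned on a DataFrame that was obtained by boolean filtering. One common way to avoid them is to take an explicit copy right after the filter. The sketch below shows this on toy data; `tips_demo`, `restaurants_demo`, and `text_length` are illustrative names, not part of the notebook.

```python
import pandas as pd

tips_demo = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "text": ["Great coffee", "$5 min for cards", "Slow service"],
})
restaurants_demo = pd.DataFrame({"business_id": ["a", "b"]})

# Filter, then copy so later assignments target an independent object.
mask = tips_demo["business_id"].isin(restaurants_demo["business_id"])
filtered = tips_demo[mask].copy()

# Adding a column no longer touches (or warns about) the original frame.
filtered["text_length"] = filtered["text"].str.len()
print(filtered)
```

The copy costs extra memory, so on a frame of this size it may be preferable to filter first and derive columns afterwards.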


In [34]:
dfytip.to_parquet('dfy_tips.parquet', index=False)

## REVIEW dataset

In [15]:
print(review.shape)
review.sample(2)

(6990280, 9)


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
4992661,_GFzOCJ2LOgkHkKYqPVtTA,zPil5bzquRDpSDXpNw_cNA,rLMRRaLhgxH2cU2XuJf_BQ,5.0,0,0,0,What a great little place in Tucson! 4 of us d...,2016-02-27 19:44:25
1448473,bW5FkOF_sTDu9js_vopwdQ,vGXqgn5tohJnrnmYRNiRbw,0R2yKDNMUztQwgo8pG4z-Q,4.0,1,0,0,Woop woop for good beer! I've been to many bre...,2011-06-30 03:43:54


In [16]:
# Field selection
dfyrev = review
# Field adjustments: parse dates (invalid values become NaT) and reset the index
dfyrev['date'] = pd.to_datetime(dfyrev['date'], errors='coerce')
dfyrev.reset_index(drop=True, inplace=True)

# Duplicate removal: not needed; the dataset has no duplicates (checked outside this notebook)

# Null removal
dfyrev = dfyrev.dropna()

dfyrev['year'] = dfyrev['date'].dt.year
dfyrev['month'] = dfyrev['date'].dt.month
dfyrev['year_month'] = dfyrev['year'].astype(str).str.slice(-2) + dfyrev['month'].astype(str).str.zfill(2)

# Keep reviews dated from 2010 through 2021
dfyrev = dfyrev[(dfyrev['year'] >= 2010) & (dfyrev['year'] <= 2021)]

# Keep only reviews whose business appears in the restaurants dataset
dfyrev = dfyrev[dfyrev['business_id'].isin(dfyrst['business_id'])]

dfyrev.info()
dfyrev.sample(2)

<class 'pandas.core.frame.DataFrame'>
Index: 5086663 entries, 0 to 6990279
Data columns (total 12 columns):
 #   Column       Dtype         
---  ------       -----         
 0   review_id    object        
 1   user_id      object        
 2   business_id  object        
 3   stars        float64       
 4   useful       int64         
 5   funny        int64         
 6   cool         int64         
 7   text         object        
 8   date         datetime64[ns]
 9   year         int32         
 10  month        int32         
 11  year_month   object        
dtypes: datetime64[ns](1), float64(1), int32(2), int64(3), object(5)
memory usage: 465.7+ MB


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,year,month,year_month
5737495,AXyLVo2YM4gRSbZk9Ib6cg,XLs_PhrJ7Qwn_RfgMM7Djw,Q4PsKtbnsRlccjs2C-oMng,1.0,2,1,0,I tried their most popular veggie pizza... The...,2013-07-21 01:37:28,2013,7,1307
5590497,XML0RS6nmd14S02SemUhJQ,UQ0lW_Zv5VYdEjjSyCpmow,fKbrDP45PmhwvVkNx3Tuow,1.0,0,0,0,"I hate being this person, but this location cl...",2020-05-09 03:54:49,2020,5,2005
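
The cleaning steps above rely on `errors='coerce'` turning unparseable dates into `NaT`, which `dropna()` then removes before the year filter is applied. A small sketch of that chain on invented rows (`reviews_demo` and its values are assumptions made for illustration):

```python
import pandas as pd

reviews_demo = pd.DataFrame({
    "review_id": ["r1", "r2", "r3", "r4"],
    "date": [
        "2013-07-21 01:37:28",
        "not a date",            # becomes NaT under errors="coerce"
        "2009-12-31 23:59:59",   # valid date, but outside 2010-2021
        "2020-05-09 03:54:49",
    ],
})

reviews_demo["date"] = pd.to_datetime(reviews_demo["date"], errors="coerce")
n_bad = int(reviews_demo["date"].isna().sum())

# Drop rows with missing values, then keep the 2010-2021 window (inclusive).
cleaned = reviews_demo.dropna().copy()
cleaned["year"] = cleaned["date"].dt.year
cleaned = cleaned[cleaned["year"].between(2010, 2021)]

print(n_bad, cleaned["review_id"].tolist())
```

Counting the `NaT` values before dropping them, as `n_bad` does here, is a cheap way to see how many rows the coercion discarded.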


In [17]:
'''
# Sentiment analysis based on the "text" field
#dfyr['text'] = dfyr['text'].astype(str)
#dfyr['polarity'] = dfyr['text'].apply(lambda text: TextBlob(text).sentiment.polarity)
#dfyr['sentiment'] = pd.cut(dfyr['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[-1, 0, 1])
'''


In [31]:
# Set the random seed for reproducibility
random_state = 42
dfyrev_sample = dfyrev.sample(n=1000000, random_state=random_state)
dfyrev_sample.head(2)

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,year,month,year_month
2254991,kAvTYPTeyG_USQBzRd5hYw,PGas3x06gHXGQITDMnFyow,uZsStnH9w2xY15og9VsQZA,1.0,0,0,0,Blech! Only went there cuz we had a gift cert...,2012-04-10 23:19:02,2012,4,1204
944067,60HMcgQ9EJZkhicMtnZ6JQ,D9wybQ24_bpA1WCadfqEig,ngCSdj_2csgsfgpLipCaMg,1.0,0,1,0,Me and my wife ate the chili cheese taquitos a...,2013-11-26 06:31:39,2013,11,1311
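
Fixing `random_state` means the same rows are drawn on every run, which is why the sample can be regenerated later. A minimal sketch on a toy frame (the frame and the sample size are illustrative assumptions):

```python
import pandas as pd

demo = pd.DataFrame({"review_id": [f"r{i}" for i in range(100)]})

# Two draws with the same seed select the same rows in the same order.
first = demo.sample(n=10, random_state=42)
second = demo.sample(n=10, random_state=42)

same_rows = first.index.equals(second.index)
print(same_rows)  # True
```

Sampling is without replacement by default, so `n` cannot exceed the number of rows. Using `frac` instead of `n` is an alternative when the filtered row count may change.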


In [32]:
analyzer = SentimentIntensityAnalyzer()

dfyrev_sample['polarity'] = [analyzer.polarity_scores(text)['compound'] for text in dfyrev_sample['text']]
dfyrev_sample['sentiment'] = pd.cut(dfyrev_sample['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[-1, 0, 1])

dfyrev_sample.info()
dfyrev_sample.sample(2)

<class 'pandas.core.frame.DataFrame'>
Index: 1000000 entries, 2254991 to 6314535
Data columns (total 14 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   review_id    1000000 non-null  object        
 1   user_id      1000000 non-null  object        
 2   business_id  1000000 non-null  object        
 3   stars        1000000 non-null  float64       
 4   useful       1000000 non-null  int64         
 5   funny        1000000 non-null  int64         
 6   cool         1000000 non-null  int64         
 7   text         1000000 non-null  object        
 8   date         1000000 non-null  datetime64[ns]
 9   year         1000000 non-null  int32         
 10  month        1000000 non-null  int32         
 11  year_month   1000000 non-null  object        
 12  polarity     1000000 non-null  float64       
 13  sentiment    1000000 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(2), int32(2), int64(3), object(5)

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date,year,month,year_month,polarity,sentiment
3575409,evpGzDTJBI4kPf6BV7GaDg,JDc7QsqqnCpWk-PwRC67yA,iDLDMMwPtS6_SSkNn62qEw,4.0,2,0,0,The wine is very.good that's the reason for t...,2015-04-20 03:16:59,2015,4,1504,0.8399,1
1772556,O6-SJ33ROMjjP3RwH602fQ,yBFHjAyiNm8xwpNP1450_Q,hz_zPpBrGPuypuIptEmMeA,1.0,0,0,0,Very dissaponted. The service was horrible it ...,2017-08-13 23:32:16,2017,8,1708,-0.6808,-1
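
The `sentiment` label comes from `pd.cut` with right-closed bins: scores at or below -0.001 map to -1, scores above -0.001 up to and including 0.0 map to 0, and anything above 0.0 maps to 1. The sketch below checks those edges on hand-picked polarity values, which are illustrative and not taken from the data.

```python
import pandas as pd

# Hand-picked scores covering each bin and both bin edges.
polarity = pd.Series([-0.6808, -0.001, -0.0005, 0.0, 0.4404])

sentiment = pd.cut(
    polarity,
    bins=[-float("inf"), -0.001, 0.0, float("inf")],
    labels=[-1, 0, 1],
)

print(sentiment.tolist())  # [-1, -1, 0, 0, 1]
```

The result is a categorical column, so numeric operations on the labels would need an explicit conversion such as `sentiment.astype(int)`.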


In [33]:
dfyrev_sample.to_parquet('dfy_reviews.parquet', index=False)

In [None]:
'''dfyrev['text'] = dfyrev['text'].astype(str)

analyzer = SentimentIntensityAnalyzer()
dfyrev['polarity'] = dfyrev['text'].apply(lambda text: analyzer.polarity_scores(text)['compound'])
dfyrev['sentiment'] = pd.cut(dfyrev['polarity'], bins=[-float('inf'), -0.001, 0.0, float('inf')], labels=[-1, 0, 1])

dfyrev.info()
dfyrev.sample(5)'''