<p align=center><img src=https://d31uz8lwfmyn8g.cloudfront.net/Assets/logo-henry-white-lg.png></p>

# <h1 align=center> **INDIVIDUAL PROJECT No. 1** </h1>
### Processing the Steam_games data

In [1]:
import pandas as pd
import ast
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyarrow as pa
import pyarrow.parquet as pq
import gzip

filas = []
with gzip.open("steam_games.json.gz", "rb") as archivo:
    for linea in archivo:
        try:
            objeto_json = json.loads(linea)
            filas.append(objeto_json)
        except json.JSONDecodeError:
            print(f"Invalid JSON on line: {linea}")

# Convert the list of JSON objects into a DataFrame
df_games = pd.DataFrame(filas)
df_games.head(12)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,
5,,,,,,,,,,,,,
6,,,,,,,,,,,,,
7,,,,,,,,,,,,,
8,,,,,,,,,,,,,
9,,,,,,,,,,,,,
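
The line-by-line loading used above can be tried without the real dataset. The following is a minimal sketch, assuming a small in-memory gzip stream with made-up records (the sample bytes and the `bad_lines` counter are hypothetical, not part of the notebook):

```python
import gzip
import io
import json

# Hypothetical sample: two valid JSON lines and one malformed line.
raw = b'{"id": "1", "app_name": "A"}\n{"id": "2", "app_name": "B"}\nnot json\n'
buffer = io.BytesIO(gzip.compress(raw))

rows, bad_lines = [], 0
with gzip.open(buffer, "rb") as fh:   # gzip.open also accepts a file object
    for line in fh:
        try:
            rows.append(json.loads(line))   # json.loads accepts bytes
        except json.JSONDecodeError:
            bad_lines += 1                  # count the line instead of printing it

print(len(rows), bad_lines)
```

Counting the malformed lines rather than printing each one keeps the output short when a file contains many of them.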


We display the number of rows and columns of the DataFrame, along with the name and non-null count of each column.

In [2]:
df_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 11.9+ MB


We count the null values in each column.

In [3]:
nulos1 = df_games.isna().sum()
nulos1

publisher       96362
genres          91593
app_name        88312
title           90360
url             88310
release_date    90377
tags            88473
reviews_url     88312
specs           88980
price           89687
early_access    88310
id              88312
developer       91609
dtype: int64

We drop the columns that carry no useful information.

In [4]:
df_games.drop(columns=["publisher","title","url","early_access","reviews_url","specs"], inplace=True)
df_games.dropna(subset=["id"], inplace=True)  # Drop the rows with a null id
df_games.reset_index(drop=True, inplace=True)  # Reset the index so it is consecutive again after dropping rows
df_games.head(2)

Unnamed: 0,genres,app_name,release_date,tags,price,id,developer
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",4.99,761140,Kotoshiro
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",Free To Play,643980,Secret Level SRL


In [5]:
df_games.isnull().sum()  # Count the remaining nulls per column

genres          3282
app_name           1
release_date    2066
tags             162
price           1377
id                 0
developer       3298
dtype: int64

We extract the year from the release_date column.

In [6]:
df_games['año_lanzamiento'] = df_games['release_date'].str.extract(r'(\d{4})')
df_games.drop(columns=["release_date"],inplace=True)
df_games.head(2)

Unnamed: 0,genres,app_name,tags,price,id,developer,año_lanzamiento
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",4.99,761140,Kotoshiro,2018
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",Free To Play,643980,Secret Level SRL,2018


We create dummy variables for the genres column, which will be used by the function that searches games by genre.

In [7]:
df_games['genres'] = df_games['genres'].fillna('[]')  # Fill missing values with the string '[]' (a placeholder string, not an empty list)
df_games['genres'] = df_games['genres'].apply(lambda x: ', '.join(x))  # Join each list of genres into a comma-separated string; the '[]' placeholder becomes '[, ]'

# Create dummy variables for the genres (the placeholder also yields '[' and ']' columns)
dummy_genres = df_games['genres'].str.get_dummies(', ')

# Concatenate the dummy variables with the original DataFrame
df_games = pd.concat([df_games, dummy_genres], axis=1)
df_games.head(2)

Unnamed: 0,genres,app_name,tags,price,id,developer,año_lanzamiento,Accounting,Action,Adventure,...,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing,[,]
0,"Action, Casual, Indie, Simulation, Strategy",Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",4.99,761140,Kotoshiro,2018,0,1,0,...,0,1,0,0,1,0,0,0,0,0
1,"Free to Play, Indie, RPG, Strategy",Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...",Free To Play,643980,Secret Level SRL,2018,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [8]:
df_games.drop(columns=["[","]","genres"],inplace=True)  # Drop genres and the '[' and ']' columns, which are artifacts of the '[]' placeholder
df_games.shape

(32133, 28)
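
The `[` and `]` columns arise because missing genres are filled with the string `'[]'` before joining. A minimal sketch of one way to avoid them, assuming the genres column holds Python lists or missing values and that `|` never occurs inside a genre name (the toy DataFrame below is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: genres stored as Python lists, with one missing value.
df = pd.DataFrame({
    "app_name": ["Game A", "Game B", "Game C"],
    "genres": [["Action", "Indie"], None, ["Indie", "RPG"]],
})

# Join each list with '|'; anything that is not a list becomes an empty
# string, which str.get_dummies turns into an all-zero row.
joined = df["genres"].apply(lambda g: "|".join(g) if isinstance(g, list) else "")
dummies = joined.str.get_dummies(sep="|")

result = pd.concat([df.drop(columns="genres"), dummies], axis=1)
print(result)
```

With this approach no placeholder columns are created, so there is nothing extra to drop afterwards.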

We check the remaining null values.

In [9]:
df_games.isnull().sum()

app_name                        1
tags                          162
price                        1377
id                              0
developer                    3298
año_lanzamiento              2167
Accounting                      0
Action                          0
Adventure                       0
Animation & Modeling            0
Audio Production                0
Casual                          0
Design & Illustration           0
Early Access                    0
Education                       0
Free to Play                    0
Indie                           0
Massively Multiplayer           0
Photo Editing                   0
RPG                             0
Racing                          0
Simulation                      0
Software Training               0
Sports                          0
Strategy                        0
Utilities                       0
Video Production                0
Web Publishing                  0
dtype: int64

We replace the nulls with the string '0'.

In [10]:
df_games.fillna('0',inplace=True)
df_games.isnull().sum()

app_name                     0
tags                         0
price                        0
id                           0
developer                    0
año_lanzamiento              0
Accounting                   0
Action                       0
Adventure                    0
Animation & Modeling         0
Audio Production             0
Casual                       0
Design & Illustration        0
Early Access                 0
Education                    0
Free to Play                 0
Indie                        0
Massively Multiplayer        0
Photo Editing                0
RPG                          0
Racing                       0
Simulation                   0
Software Training            0
Sports                       0
Strategy                     0
Utilities                    0
Video Production             0
Web Publishing               0
dtype: int64

We change the data type of the año_lanzamiento column to integer.

In [11]:
df_games['año_lanzamiento'] = df_games['año_lanzamiento'].astype(int)
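
Because the nulls were filled with `'0'` earlier, games without a release year end up with the year 0 after this cast. The sketch below contrasts that with pandas' nullable `Int64` dtype, which keeps missing years as `<NA>`. The sample series is hypothetical, and the nullable variant is an alternative, not what the notebook does:

```python
import pandas as pd

# Hypothetical sample: year strings with one missing value.
years = pd.Series(["2018", None, "1997"])

# As in the notebook: fill with the string '0', then cast to int.
as_int = years.fillna("0").astype(int)

# Alternative: coerce to numbers and use the nullable integer dtype,
# so the missing year stays missing instead of becoming 0.
nullable = pd.to_numeric(years, errors="coerce").astype("Int64")

print(as_int.tolist())
print(nullable)
```

Which variant is preferable depends on whether downstream code treats 0 as a sentinel for "unknown year".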

We rename the id column to item_id.

In [12]:
df_games = df_games.rename(columns={'id': 'item_id'})

We check whether the price column contains string values.

In [13]:
string_prices = df_games[df_games['price'].apply(lambda x: isinstance(x, str))]
print(string_prices['price'].value_counts())

price
0                                1377
Free                              905
Free to Play                      520
Free To Play                      462
Free Mod                            4
Free Demo                           3
Play for Free!                      2
Third-party                         2
Play Now                            2
Starting at $499.00                 1
Free Movie                          1
Free to Try                         1
Starting at $449.00                 1
Install Theme                       1
Play the Demo                       1
Free HITMAN™ Holiday Pack           1
Play WARMACHINE: Tactics Demo       1
Install Now                         1
Free to Use                         1
Name: count, dtype: int64


We set every string value in the "price" column to zero.

In [14]:

df_games.loc[df_games['price'].apply(lambda x: isinstance(x, str)), 'price'] = 0
df_games.isnull().sum()

app_name                     0
tags                         0
price                        0
item_id                      0
developer                    0
año_lanzamiento              0
Accounting                   0
Action                       0
Adventure                    0
Animation & Modeling         0
Audio Production             0
Casual                       0
Design & Illustration        0
Early Access                 0
Education                    0
Free to Play                 0
Indie                        0
Massively Multiplayer        0
Photo Editing                0
RPG                          0
Racing                       0
Simulation                   0
Software Training            0
Sports                       0
Strategy                     0
Utilities                    0
Video Production             0
Web Publishing               0
dtype: int64
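
Setting every string to zero also zeroes entries such as "Starting at $499.00", and it leaves the column with a mixed object dtype. A minimal sketch of an equivalent conversion with `pd.to_numeric`, assuming a small hypothetical sample of prices, which yields a single float column and flags the rows that were text labels:

```python
import pandas as pd

# Hypothetical sample mixing numeric prices and text labels.
prices = pd.Series([4.99, "Free To Play", "0", "Starting at $499.00", 9.99], dtype=object)

# Values that cannot be parsed as numbers become NaN.
numeric = pd.to_numeric(prices, errors="coerce")
is_text_label = numeric.isna()   # rows whose price was a text label

# Treat every text label as a price of zero, as the notebook does.
clean = numeric.fillna(0.0)

print(clean.tolist())
print(int(is_text_label.sum()))
```

Keeping the `is_text_label` mask makes it possible to review which labels were zeroed before discarding them.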

In [15]:
df_games.to_csv('games.csv', index=False)  # to_csv returns None when given a path, so its result is not assigned
# Read the CSV file into a pandas DataFrame
games = pd.read_csv('games.csv')

# Convert the pandas DataFrame to a PyArrow table
table = pa.Table.from_pandas(games)

# Write the table to a Parquet file
pq.write_table(table, 'games.parquet')

In [16]:
# Read the Parquet file into a PyArrow table
table = pq.read_table('games.parquet')

# Convert the PyArrow table to a pandas DataFrame
games_parquet = table.to_pandas()

### Processing the users_items data

In [17]:
filas1 = list()
with gzip.open("users_items.json.gz", "rb") as file:
    # Process the file line by line
    for line in file:
        try:
            decoded_line = line.decode("utf-8")    # Decode the line as UTF-8 text
            objeto_json = ast.literal_eval(decoded_line)   # Parse the line as a Python literal
            filas1.append(objeto_json)   # Append the parsed object to the list
        except (ValueError, SyntaxError):  # Exceptions ast.literal_eval raises on malformed input
            print(f"Malformed line: {line}")
            continue
# Convert the list of parsed objects into a DataFrame
df_items = pd.DataFrame(filas1)
df_items.head(5)

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."
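
This file is parsed with `ast.literal_eval` rather than `json.loads`, presumably because its lines are Python-style dict literals with single quotes, which are not valid JSON. A minimal sketch of the difference, using a made-up line and a hypothetical `parse_line` helper:

```python
import ast
import json

# Hypothetical line in Python-literal style (single quotes).
line = "{'user_id': 'js41637', 'items_count': 2, 'items': [{'item_id': '10'}]}"

# json.loads rejects single-quoted keys and strings.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval evaluates the text as a Python literal.
record = ast.literal_eval(line)

def parse_line(text):
    """Return the parsed object, or None if the text is not a valid literal."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

print(parsed_as_json, record["user_id"], parse_line("{'a': "))
```

Note that `ast.literal_eval` signals bad input with `ValueError` or `SyntaxError`, so a handler for `json.JSONDecodeError` alone would not catch its failures.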


We create a new row for each element of the list in the items column.

In [18]:
df_items = df_items.explode("items").reset_index()
df_items = df_items.drop(columns="index")
df_items = pd.concat([df_items, pd.json_normalize(df_items['items'])], axis=1)
df_items.head()

Unnamed: 0,user_id,items_count,steam_id,user_url,items,item_id,item_name,playtime_forever,playtime_2weeks
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '10', 'item_name': 'Counter-Strike...",10,Counter-Strike,6.0,0.0
1,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '20', 'item_name': 'Team Fortress ...",20,Team Fortress Classic,0.0,0.0
2,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '30', 'item_name': 'Day of Defeat'...",30,Day of Defeat,7.0,0.0
3,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '40', 'item_name': 'Deathmatch Cla...",40,Deathmatch Classic,0.0,0.0
4,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '50', 'item_name': 'Half-Life: Opp...",50,Half-Life: Opposing Force,0.0,0.0
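
The explode-then-normalize pattern above can be seen on a toy example. This is a sketch with hypothetical data. It also shows why the index is reset: `explode` repeats the original index for every element, and `pd.concat(axis=1)` aligns rows by index.

```python
import pandas as pd

# Hypothetical toy data: each user owns a list of item dicts.
df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "items": [
        [{"item_id": "10", "playtime_forever": 6}, {"item_id": "20", "playtime_forever": 0}],
        [{"item_id": "30", "playtime_forever": 7}],
    ],
})

# One row per list element; the index would be 0, 0, 1 without the reset.
exploded = df.explode("items").reset_index(drop=True)

# Expand each dict into its own columns (row order matches `exploded`).
flat = pd.json_normalize(exploded["items"].tolist())

result = pd.concat([exploded.drop(columns="items"), flat], axis=1)
print(result)
```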


We check for null values.

In [19]:
df_items.isnull().sum()

user_id                 0
items_count             0
steam_id                0
user_url                0
items               16806
item_id             16806
item_name           16806
playtime_forever    16806
playtime_2weeks     16806
dtype: int64

We drop the rows with null values.

In [20]:
df_items = df_items.dropna()
df_items.isnull().sum()

user_id             0
items_count         0
steam_id            0
user_url            0
items               0
item_id             0
item_name           0
playtime_forever    0
playtime_2weeks     0
dtype: int64

We inspect the column names and their data types.

In [21]:
df_items.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5153209 entries, 0 to 5170013
Data columns (total 9 columns):
 #   Column            Dtype  
---  ------            -----  
 0   user_id           object 
 1   items_count       int64  
 2   steam_id          object 
 3   user_url          object 
 4   items             object 
 5   item_id           object 
 6   item_name         object 
 7   playtime_forever  float64
 8   playtime_2weeks   float64
dtypes: float64(2), int64(1), object(6)
memory usage: 393.2+ MB


In [22]:
print(f"Number of rows: {len(df_items)}")
print(f"Number of columns: {len(df_items.columns)}")

Number of rows: 5153209
Number of columns: 9


In [23]:
# Drop the columns that carry no useful information
df_items.drop(["user_url","items"], axis=1, inplace=True)
df_items.head(10)

Unnamed: 0,user_id,items_count,steam_id,item_id,item_name,playtime_forever,playtime_2weeks
0,76561197970982479,277,76561197970982479,10,Counter-Strike,6.0,0.0
1,76561197970982479,277,76561197970982479,20,Team Fortress Classic,0.0,0.0
2,76561197970982479,277,76561197970982479,30,Day of Defeat,7.0,0.0
3,76561197970982479,277,76561197970982479,40,Deathmatch Classic,0.0,0.0
4,76561197970982479,277,76561197970982479,50,Half-Life: Opposing Force,0.0,0.0
5,76561197970982479,277,76561197970982479,60,Ricochet,0.0,0.0
6,76561197970982479,277,76561197970982479,70,Half-Life,0.0,0.0
7,76561197970982479,277,76561197970982479,130,Half-Life: Blue Shift,0.0,0.0
8,76561197970982479,277,76561197970982479,300,Day of Defeat: Source,4733.0,0.0
9,76561197970982479,277,76561197970982479,240,Counter-Strike: Source,1853.0,0.0


In [24]:
df_items.to_csv('items.csv', index=False)  # Write to CSV (to_csv returns None when given a path)
items = pd.read_csv('items.csv')

We cast item_id to an integer type to avoid future errors. Note that the cell below stores the result in a new column named items_id, so the DataFrame keeps both columns.

In [25]:
items['items_id'] = items['item_id'].astype(int)

In [26]:
# Convert the pandas DataFrame to a PyArrow table
table = pa.Table.from_pandas(items)
# Write the table to a Parquet file
pq.write_table(table, 'items.parquet')

In [27]:
# Read the Parquet file into a PyArrow table
table = pq.read_table('items.parquet')
# Convert the PyArrow table to a pandas DataFrame
items_parquet = table.to_pandas()
# Display the DataFrame
items_parquet

Unnamed: 0,user_id,items_count,steam_id,item_id,item_name,playtime_forever,playtime_2weeks,items_id
0,76561197970982479,277,76561197970982479,10,Counter-Strike,6.0,0.0,10
1,76561197970982479,277,76561197970982479,20,Team Fortress Classic,0.0,0.0,20
2,76561197970982479,277,76561197970982479,30,Day of Defeat,7.0,0.0,30
3,76561197970982479,277,76561197970982479,40,Deathmatch Classic,0.0,0.0,40
4,76561197970982479,277,76561197970982479,50,Half-Life: Opposing Force,0.0,0.0,50
...,...,...,...,...,...,...,...,...
5153204,76561198329548331,7,76561198329548331,346330,BrainBread 2,0.0,0.0,346330
5153205,76561198329548331,7,76561198329548331,373330,All Is Dust,0.0,0.0,373330
5153206,76561198329548331,7,76561198329548331,388490,One Way To Die: Steam Edition,3.0,3.0,388490
5153207,76561198329548331,7,76561198329548331,521570,You Have 10 Seconds 2,4.0,4.0,521570


### Processing the user_reviews.json data

In [28]:
filas2 = list()
with gzip.open("user_reviews.json.gz", "rb") as file:
    # Process the file line by line
    for line in file:
        try:
            decoded_line = line.decode("utf-8")  # Decode the line as UTF-8 text
            objeto_json = ast.literal_eval(decoded_line)  # Parse the line as a Python literal
            filas2.append(objeto_json)  # Append the parsed object to the list
        except (ValueError, SyntaxError):  # Exceptions ast.literal_eval raises on malformed input
            print(f"Malformed line: {line}")
            continue
# Convert the list of parsed objects into a DataFrame
df_reviews = pd.DataFrame(filas2)
df_reviews.head(5)

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."


We explode the reviews column to normalize the data.

In [29]:
df_reviews = df_reviews.explode("reviews").reset_index()
df_reviews = df_reviews.drop(columns="index")
df_reviews = pd.concat([df_reviews, pd.json_normalize(df_reviews['reviews'])], axis=1)
df_reviews.head()

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.
2,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted April 21, 2011...",,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...
3,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....",,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...
4,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted September 8, 2...",,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...


We inspect the column names, their data types, and the non-null counts.

In [30]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59333 entries, 0 to 59332
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      59333 non-null  object
 1   user_url     59333 non-null  object
 2   reviews      59305 non-null  object
 3   funny        59305 non-null  object
 4   posted       59305 non-null  object
 5   last_edited  59305 non-null  object
 6   item_id      59305 non-null  object
 7   helpful      59305 non-null  object
 8   recommend    59305 non-null  object
 9   review       59305 non-null  object
dtypes: object(10)
memory usage: 4.5+ MB


We check for null values.

In [31]:
nulos3 = df_reviews.isna().sum()
nulos3

user_id         0
user_url        0
reviews        28
funny          28
posted         28
last_edited    28
item_id        28
helpful        28
recommend      28
review         28
dtype: int64

We extract the year from the posted column.

In [32]:
# Use a regular expression to extract the year
df_reviews['año'] = df_reviews['posted'].str.extract(r'(\d{4})')
df_reviews.head(2)

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review,año
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,2011


In [33]:
df_reviews["año"].isna().value_counts()

año
False    49186
True     10147
Name: count, dtype: int64
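
A likely reason for the many missing years is that some `posted` strings carry no year at all, such as the "Posted February 3." value visible in the preview of the raw reviews. A minimal sketch of how the regular expression behaves on such values (the sample series is hypothetical):

```python
import pandas as pd

# Hypothetical sample: one date with a year, one without, one missing.
posted = pd.Series(["Posted November 5, 2011.", "Posted February 3.", None])

# expand=False returns a Series; strings without four consecutive digits give NaN.
year = posted.str.extract(r"(\d{4})", expand=False)

print(year.tolist())
print(int(year.isna().sum()))
```

Rows like the second one are the ones removed when the null years are dropped.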

In [34]:
df_reviews.isnull().sum()

user_id            0
user_url           0
reviews           28
funny             28
posted            28
last_edited       28
item_id           28
helpful           28
recommend         28
review            28
año            10147
dtype: int64

We drop the rows with a missing value in the año column.

In [35]:
df_reviews = df_reviews.dropna(subset=['año'])
df_reviews.head(2)

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review,año
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,2011
1,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,2011


In [36]:
df_reviews.isnull().sum()

user_id        0
user_url       0
reviews        0
funny          0
posted         0
last_edited    0
item_id        0
helpful        0
recommend      0
review         0
año            0
dtype: int64

Sentiment analysis with VADER, using the less strict compound-score threshold of ±0.05 (0 = negative, 1 = neutral, 2 = positive)

In [37]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment(review):
    
    if isinstance(review, str):
        sentiment = analyzer.polarity_scores(review)

        if sentiment['compound'] >= 0.05:
            return 2
        elif sentiment['compound'] <= -0.05:
            return 0
        else:
            return 1
    else:
        return 1
        
# Apply the sentiment function to the 'review' column; non-string values are labeled 1 (neutral)

df_reviews['sentiment_analysis'] = df_reviews['review'].apply(analyze_sentiment)
df_reviews["sentiment_analysis"].value_counts()


sentiment_analysis
2    33422
1     9080
0     6684
Name: count, dtype: int64
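
The labeling rule can be checked in isolation from VADER. The sketch below reproduces only the threshold logic on hand-picked compound scores. The function name `label_from_compound` and the sample scores are hypothetical:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 0 (negative), 1 (neutral) or 2 (positive)."""
    if compound >= threshold:
        return 2
    if compound <= -threshold:
        return 0
    return 1

# Scores exactly at +/-0.05 fall on the positive/negative side, as in the notebook.
labels = [label_from_compound(c) for c in (0.8, 0.05, 0.0, -0.05, -0.6)]
print(labels)
```

Raising `threshold` would widen the neutral band and move borderline reviews out of the positive and negative classes.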

We drop the columns that are no longer needed.

In [38]:
df_reviews.drop(["reviews",'user_url', 'funny', 'last_edited',"posted","review"], axis=1, inplace=True)
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49186 entries, 0 to 59304
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             49186 non-null  object
 1   item_id             49186 non-null  object
 2   helpful             49186 non-null  object
 3   recommend           49186 non-null  object
 4   año                 49186 non-null  object
 5   sentiment_analysis  49186 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 2.6+ MB


In [39]:
# Write to CSV and read it back
df_reviews.to_csv('reviews.csv', index=False)
reviews = pd.read_csv("reviews.csv")

In [40]:
# Convert the pandas DataFrame to a PyArrow table
table = pa.Table.from_pandas(reviews)
# Write the table to a Parquet file
pq.write_table(table, 'reviews.parquet')

In [41]:
# Read the Parquet file into a PyArrow table
table = pq.read_table('reviews.parquet')
# Convert the PyArrow table to a pandas DataFrame
reviews_parquet = table.to_pandas()
# Display the DataFrame
reviews_parquet

Unnamed: 0,user_id,item_id,helpful,recommend,año,sentiment_analysis
0,76561197970982479,1250,No ratings yet,True,2011,2
1,76561197970982479,22200,No ratings yet,True,2011,2
2,76561197970982479,43110,No ratings yet,True,2011,2
3,js41637,251610,15 of 20 people (75%) found this review helpful,True,2014,2
4,js41637,227300,0 of 1 people (0%) found this review helpful,True,2013,2
...,...,...,...,...,...,...
49181,wayfeng,730,1 of 1 people (100%) found this review helpful,True,2015,1
49182,76561198251004808,253980,No ratings yet,True,2015,2
49183,72947282842,730,No ratings yet,True,2015,0
49184,ApxLGhost,730,No ratings yet,True,2015,2


In [45]:
from fuzzywuzzy import fuzz

cadena1 = "14 Dimension Enterprise"
cadena2 = "14Dimension Enterprise"

# Get the similarity percentage
porcentaje_similitud = fuzz.ratio(cadena1, cadena2)

print(f"Similarity percentage: {porcentaje_similitud}%")


Similarity percentage: 98%


In [46]:
from fuzzywuzzy import fuzz

cadena1 = "2Chance Projects,IIchan Eroge Team"
cadena2 = "2Chance Projects,IIchan Eroge Team,DjSM"

# Get the similarity percentage
porcentaje_similitud = fuzz.ratio(cadena1, cadena2)

print(f"Similarity percentage: {porcentaje_similitud}%")


Similarity percentage: 93%
