
# Advancing Game Discovery: A Data-Driven Approach to Personalization in Gaming
![Data Science](https://img.shields.io/badge/Data%20Science-Advanced-brightgreen)
![Machine Learning](https://img.shields.io/badge/Machine%20Learning-Expert-blue)
![Natural Language Processing](https://img.shields.io/badge/NLP-Proficient-yellow)
[![FastAPI](https://img.shields.io/badge/FastAPI-Integration-orange)](https://fastapi.tiangolo.com)
[![Docker](https://img.shields.io/badge/Docker-Containerization-blue)](https://www.docker.com)
[![Render](https://img.shields.io/badge/Render-Deployment-success)](https://www.render.com)
[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://www.python.org/)
[![pandas](https://img.shields.io/badge/pandas-Data%20Manipulation-yellowgreen)](https://pandas.pydata.org/)
[![NumPy](https://img.shields.io/badge/NumPy-Array%20Operations-blue)](https://numpy.org/)
[![scikit-learn](https://img.shields.io/badge/scikit--learn-Machine%20Learning-orange)](https://scikit-learn.org/stable/)
[![NLTK](https://img.shields.io/badge/NLTK-Natural%20Language%20Processing-yellow)](https://www.nltk.org/)
[![spaCy](https://img.shields.io/badge/spaCy-NLP%20Library-green)](https://spacy.io/)
[![Matplotlib](https://img.shields.io/badge/Matplotlib-Data%20Visualization-lightgrey)](https://matplotlib.org/)
[![Parquet](https://img.shields.io/badge/Parquet-Data%20Storage%20Format-ff69b4)](https://parquet.apache.org/)

## Executive Summary
In an era marked by the proliferation of digital entertainment options, the Steam Game Recommendation System distinguishes itself as a sophisticated application of data science and machine learning to enhance user experience in selecting video games. This project represents more than just technical adeptness in managing complex datasets; it is a nuanced exploration into user behavior and preference patterns within the gaming industry.

Central to this initiative is the meticulous aggregation and analysis of Steam's extensive gaming data. The project adeptly navigates through a multitude of data points - from game titles and genres to user reviews and engagement metrics. This comprehensive ETL (Extract, Transform, Load) process lays the foundation for a robust analytical framework.

The project conducts an Exploratory Data Analysis (EDA) to uncover patterns and correlations in the gaming data. This stage shapes the approach to the recommendation algorithm and keeps the resulting insights grounded in the data.

The heart of the system is its recommendation engine. It uses TF-IDF for text analysis and cosine similarity to measure how alike two games are, so that recommendations reflect each user's preferences and stay relevant and engaging.

Furthermore, the integration of this model within a FastAPI framework showcases the project's commitment to practical applicability. It's a demonstration of how data science can be effectively translated into user-friendly applications, providing real-time, dynamic game suggestions to enhance the user experience.

### The Challenge: The Paradox of Choice in Digital Gaming

Steam, a leading platform in the digital distribution of games, offers a staggering array of choices to its users. However, this abundance creates a paradox of choice, where users often find themselves overwhelmed, unable to make informed decisions about their next gaming experience. This project addresses this challenge by implementing a sophisticated recommendation system.

### Methodological Framework

The project employs a multifaceted approach:

1. **Data Collection and Preprocessing:** Leveraging Steam's extensive dataset, the project involves meticulous data collection, cleansing, and preprocessing to structure a robust dataset suitable for analysis.

2. **Analytical Techniques:** Utilizing advanced natural language processing techniques, particularly TF-IDF, the project analyzes game descriptions and user reviews. This analysis forms the foundation for understanding user sentiment and game characteristics.

3. **Machine Learning Integration:** The core of the project is the development of a recommendation algorithm based on cosine similarity. This algorithm is fine-tuned to assess user preferences and suggest games that align with individual tastes, based on historical data.

4. **FastAPI Deployment:** The implementation of the recommendation system within a FastAPI framework demonstrates practical skills in deploying machine learning models in a real-world, user-facing application.
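
As a rough illustration of steps 2 and 3, the sketch below builds TF-IDF vectors with scikit-learn and ranks games by cosine similarity. The toy catalogue, the use of genre strings as the text field, and the helper `most_similar` are assumptions made for this example; they are not taken from the project's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue: title -> text describing the game
games = {
    "Game A": "Action Indie Strategy",
    "Game B": "Action Indie Casual",
    "Game C": "Racing Simulation",
}
titles = list(games)

# One TF-IDF row per game, then pairwise cosine similarity between rows
tfidf = TfidfVectorizer().fit_transform(list(games.values()))
sim = cosine_similarity(tfidf)

def most_similar(title, k=1):
    """Return the k titles most similar to `title`, excluding itself."""
    i = titles.index(title)
    ranked = sim[i].argsort()[::-1]  # indices sorted by descending similarity
    return [titles[j] for j in ranked if j != i][:k]

print(most_similar("Game A"))
```

Game B ranks first for Game A because the two share the terms "action" and "indie", while Game C shares none.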

### Implications and Professional Competence

This project is not merely an academic exercise but a real-world application that demonstrates a deep understanding of key data science concepts and their application:

- **Data Analysis and Processing:** Showcases the ability to handle and interpret large datasets, turning raw data into actionable insights.
  
- **Machine Learning Proficiency:** Reflects the skill in applying sophisticated algorithms to solve practical problems, demonstrating an understanding of both theory and application.

- **Software Engineering:** The use of FastAPI for deployment illustrates competence in software development, emphasizing the importance of making data science solutions accessible and user-friendly.

- **Business Acumen:** Beyond technical skills, this project highlights an understanding of user experience and market needs in the digital entertainment industry.


# File OUTPUT_STEAM_GAMES.JSON Data Analysis and Extraction

In [134]:
import pandas as pd
import json

file_path = r'Datasets/output_steam_games.json'  # Replace with your file path

data = []
with open(file_path, 'r') as file:
    for line in file:
        try:
            json_object = json.loads(line)
            data.append(json_object)
        except json.JSONDecodeError:
            # You can handle or log errors here if needed
            pass

df = pd.DataFrame(data)

# Optionally, check the first few rows of the DataFrame
print(df.head())
df

  publisher genres app_name title  url release_date tags reviews_url specs  \
0       NaN    NaN      NaN   NaN  NaN          NaN  NaN         NaN   NaN   
1       NaN    NaN      NaN   NaN  NaN          NaN  NaN         NaN   NaN   
2       NaN    NaN      NaN   NaN  NaN          NaN  NaN         NaN   NaN   
3       NaN    NaN      NaN   NaN  NaN          NaN  NaN         NaN   NaN   
4       NaN    NaN      NaN   NaN  NaN          NaN  NaN         NaN   NaN   

  price early_access   id developer  
0   NaN          NaN  NaN       NaN  
1   NaN          NaN  NaN       NaN  
2   NaN          NaN  NaN       NaN  
3   NaN          NaN  NaN       NaN  
4   NaN          NaN  NaN       NaN  


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"


In [135]:
df.dropna(how='all', inplace=True)

Considering the information needed for the API calls, we drop all columns that are not relevant to the task.
We keep 'genres' and fill its missing values from 'tags', using the tags that are also genres.
'release_date' will be converted to 'release_year', since the year is the only part we need.

In [136]:
columns_to_drop = ['publisher', 'app_name', 'url', 'reviews_url', 'specs', 'price', 'early_access']
df.drop(columns=columns_to_drop, inplace=True)
# Reset the index
df.reset_index(drop=True, inplace=True)
df.index += 1

In [137]:
df

Unnamed: 0,genres,title,release_date,tags,id,developer
1,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",761140,Kotoshiro
2,"[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",643980,Secret Level SRL
3,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",670290,Poolians.com
4,"[Action, Adventure, Casual]",弹炸人2222,2017-12-07,"[Action, Adventure, Casual]",767400,彼岸领域
5,,,,"[Action, Indie, Casual, Sports]",773570,
...,...,...,...,...,...,...
32131,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,2018-01-04,"[Strategy, Indie, Casual, Simulation]",773640,"Nikita ""Ghost_RUS"""
32132,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,2018-01-04,"[Strategy, Indie, Casual]",733530,Sacada
32133,"[Indie, Racing, Simulation]",Russian Roads,2018-01-04,"[Indie, Simulation, Racing]",610660,Laush Dmitriy Sergeevich
32134,"[Casual, Indie]",EXIT 2 - Directions,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",658870,"xropi,stev3ns"


First, we count how many distinct genres there are.

In [138]:
df_exploded = df.explode('genres')

In [139]:
df_exploded

Unnamed: 0,genres,title,release_date,tags,id,developer
1,Action,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",761140,Kotoshiro
1,Casual,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",761140,Kotoshiro
1,Indie,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",761140,Kotoshiro
1,Simulation,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",761140,Kotoshiro
1,Strategy,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",761140,Kotoshiro
...,...,...,...,...,...,...
32133,Racing,Russian Roads,2018-01-04,"[Indie, Simulation, Racing]",610660,Laush Dmitriy Sergeevich
32133,Simulation,Russian Roads,2018-01-04,"[Indie, Simulation, Racing]",610660,Laush Dmitriy Sergeevich
32134,Casual,EXIT 2 - Directions,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",658870,"xropi,stev3ns"
32134,Indie,EXIT 2 - Directions,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",658870,"xropi,stev3ns"


In [140]:
unique_genres_count = df_exploded['genres'].nunique()
unique_genres_breakdown = df_exploded['genres'].value_counts()

print(f"Total number of unique genres: {unique_genres_count}")
print("Breakdown of each genre's frequency:\n", unique_genres_breakdown)

Total number of unique genres: 22
Breakdown of each genre's frequency:
 genres
Indie                        15858
Action                       11321
Casual                        8282
Adventure                     8243
Strategy                      6957
Simulation                    6699
RPG                           5479
Free to Play                  2031
Early Access                  1462
Sports                        1257
Massively Multiplayer         1108
Racing                        1083
Design &amp; Illustration      460
Utilities                      340
Web Publishing                 268
Animation &amp; Modeling       183
Education                      125
Video Production               116
Software Training              105
Audio Production                93
Photo Editing                   77
Accounting                       7
Name: count, dtype: int64


To make further transformations and consolidate the information we have, we save the changes to a CSV file.

In [141]:
df.to_csv('steam_games.csv', index=False)

In [142]:
df = pd.read_csv('steam_games.csv')

We fill the missing values in 'genres' from 'tags'. Note that the loop below copies the full tag list of each row; it does not filter the tags down to those that are also genres.

In [143]:
import ast
# Function to safely parse a string list
def parse_string_list(string_list):
    try:
        return ast.literal_eval(string_list)
    except (ValueError, SyntaxError):
        return []

# Update missing genres based on tags
for index, row in df.iterrows():
    if pd.isna(row['genres']):
        tags = parse_string_list(row['tags'])
        # Assuming the tags contain genre information
        df.at[index, 'genres'] = tags
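
The loop above copies every tag into 'genres'. If only tags that are also genres should be kept, one possible approach is sketched below. `KNOWN_GENRES` is a hypothetical subset of the 22 genres counted earlier, `parse_list` is an illustrative helper, and the result is stored as a stringified list on the assumption that it should match the format the other rows have after the CSV round trip.

```python
import ast
import pandas as pd

# Hypothetical subset of the genre names counted earlier
KNOWN_GENRES = {"Action", "Indie", "Casual", "Sports"}

def parse_list(value):
    """Return a list from a list or a stringified list, or [] for anything else."""
    if isinstance(value, list):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []

# Invented rows: the second one has no genres
toy = pd.DataFrame({
    "genres": ["['Action', 'Indie']", None],
    "tags": ["['Action', 'Indie', 'Pixel Graphics']",
             "['Action', 'Singleplayer', 'Sports']"],
})

# Fill only the missing genres, keeping tags that are known genres
missing = toy["genres"].isna()
toy.loc[missing, "genres"] = toy.loc[missing, "tags"].apply(
    lambda tags: str([t for t in parse_list(tags) if t in KNOWN_GENRES])
)
print(toy["genres"].tolist())
```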


Now we can safely remove 'tags'.

In [144]:
df = df.drop(columns=['tags'])

In [145]:
df

Unnamed: 0,genres,title,release_date,id,developer
0,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018-01-04,761140.0,Kotoshiro
1,"['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018-01-04,643980.0,Secret Level SRL
2,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017-07-24,670290.0,Poolians.com
3,"['Action', 'Adventure', 'Casual']",弹炸人2222,2017-12-07,767400.0,彼岸领域
4,"[Action, Indie, Casual, Sports]",,,773570.0,
...,...,...,...,...,...
32130,"['Casual', 'Indie', 'Simulation', 'Strategy']",Colony On Mars,2018-01-04,773640.0,"Nikita ""Ghost_RUS"""
32131,"['Casual', 'Indie', 'Strategy']",LOGistICAL: South Africa,2018-01-04,733530.0,Sacada
32132,"['Indie', 'Racing', 'Simulation']",Russian Roads,2018-01-04,610660.0,Laush Dmitriy Sergeevich
32133,"['Casual', 'Indie']",EXIT 2 - Directions,2017-09-02,658870.0,"xropi,stev3ns"


We drop rows with a missing 'release_date', since it cannot be inferred from the other columns.

In [146]:
df.dropna(subset=['release_date'], inplace=True)
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce').dt.year
# Renaming 'release_date' to 'release_year' and converting to integer
df.rename(columns={'release_date': 'release_year'}, inplace=True)
df.dropna(subset=['release_year'], inplace=True)
df['release_year'] = df['release_year'].astype(int)
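
To see what `errors='coerce'` does in this step, consider the small sketch below. The sample values, including the unparseable "Soon..", are made up for illustration. The sketch also uses pandas' nullable `Int64` dtype, which is an alternative to the drop-then-cast approach above and not what the notebook does.

```python
import pandas as pd

# Invented sample: a valid date, an unparseable string, and a missing value
dates = pd.Series(["2018-01-04", "Soon..", None])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, errors="coerce")

# Nullable integer dtype keeps missing years as <NA> without dropping rows
years = parsed.dt.year.astype("Int64")
print(years.tolist())
```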

In [147]:
df

Unnamed: 0,genres,title,release_year,id,developer
0,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018,761140.0,Kotoshiro
1,"['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018,643980.0,Secret Level SRL
2,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017,670290.0,Poolians.com
3,"['Action', 'Adventure', 'Casual']",弹炸人2222,2017,767400.0,彼岸领域
5,"['Action', 'Adventure', 'Simulation']",Battle Royale Trainer,2018,772540.0,Trickjump Games Ltd
...,...,...,...,...,...
32129,"['Action', 'Adventure', 'Casual', 'Indie']",Kebab it Up!,2018,745400.0,Bidoniera Games
32130,"['Casual', 'Indie', 'Simulation', 'Strategy']",Colony On Mars,2018,773640.0,"Nikita ""Ghost_RUS"""
32131,"['Casual', 'Indie', 'Strategy']",LOGistICAL: South Africa,2018,733530.0,Sacada
32132,"['Indie', 'Racing', 'Simulation']",Russian Roads,2018,610660.0,Laush Dmitriy Sergeevich


We drop duplicate 'id' values and rows with a missing 'id', since we can't infer which game those rows describe.

In [148]:
df.drop_duplicates(subset='id', inplace=True)
df.dropna(subset=['id'], inplace=True)
# Converting 'id' to integer
df['id'] = df['id'].astype(int)
df.rename(columns={'id': 'game_id'}, inplace=True)

In [149]:
df

Unnamed: 0,genres,title,release_year,game_id,developer
0,"['Action', 'Casual', 'Indie', 'Simulation', 'S...",Lost Summoner Kitty,2018,761140,Kotoshiro
1,"['Free to Play', 'Indie', 'RPG', 'Strategy']",Ironbound,2018,643980,Secret Level SRL
2,"['Casual', 'Free to Play', 'Indie', 'Simulatio...",Real Pool 3D - Poolians,2017,670290,Poolians.com
3,"['Action', 'Adventure', 'Casual']",弹炸人2222,2017,767400,彼岸领域
5,"['Action', 'Adventure', 'Simulation']",Battle Royale Trainer,2018,772540,Trickjump Games Ltd
...,...,...,...,...,...
32129,"['Action', 'Adventure', 'Casual', 'Indie']",Kebab it Up!,2018,745400,Bidoniera Games
32130,"['Casual', 'Indie', 'Simulation', 'Strategy']",Colony On Mars,2018,773640,"Nikita ""Ghost_RUS"""
32131,"['Casual', 'Indie', 'Strategy']",LOGistICAL: South Africa,2018,733530,Sacada
32132,"['Indie', 'Racing', 'Simulation']",Russian Roads,2018,610660,Laush Dmitriy Sergeevich


Final check for missing values.

In [150]:
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 genres             0
title              1
release_year       0
game_id            0
developer       1250
dtype: int64


The file is ready. Before saving, any 'genres' values still stored as lists are joined into comma-separated strings.

In [151]:
df['genres'] = df['genres'].apply(lambda x: ','.join(x) if isinstance(x, list) else x)

In [152]:
df.to_parquet('games.parquet', engine='fastparquet', index=False)

# File USER_ITEMS.JSON.GZ Data Analysis and Extraction

In [153]:
import pandas as pd
import ast
import gzip
row = []  # empty list that collects each parsed line

for i in gzip.open(r'Datasets\users_items.json.gz'):  # iterate over the lines of the dataset
    row.append(ast.literal_eval(i.decode("utf-8")))  # parse each line and append it to the list

# Build the DataFrame from the list
df_items = pd.DataFrame(row)

In [154]:
df_items

Unnamed: 0,user_id,items_count,steam_id,user_url,items
0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
1,js41637,888,76561198035864385,http://steamcommunity.com/id/js41637,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
2,evcentric,137,76561198007712555,http://steamcommunity.com/id/evcentric,"[{'item_id': '1200', 'item_name': 'Red Orchest..."
3,Riot-Punch,328,76561197963445855,http://steamcommunity.com/id/Riot-Punch,"[{'item_id': '10', 'item_name': 'Counter-Strik..."
4,doctr,541,76561198002099482,http://steamcommunity.com/id/doctr,"[{'item_id': '300', 'item_name': 'Day of Defea..."
...,...,...,...,...,...
88305,76561198323066619,22,76561198323066619,http://steamcommunity.com/profiles/76561198323...,"[{'item_id': '413850', 'item_name': 'CS:GO Pla..."
88306,76561198326700687,177,76561198326700687,http://steamcommunity.com/profiles/76561198326...,"[{'item_id': '11020', 'item_name': 'TrackMania..."
88307,XxLaughingJackClown77xX,0,76561198328759259,http://steamcommunity.com/id/XxLaughingJackClo...,[]
88308,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"[{'item_id': '304930', 'item_name': 'Unturned'..."


The 'items' column holds, for each user, a list of records that include the time the player has spent in each game. We can therefore build the required functions from this information.

In [155]:
#Unwrap 'items'
df_items1 = df_items.explode(['items'])
df_items2 = pd.json_normalize(df_items1['items']).set_index(df_items1['items'].index)
df_items = pd.concat([df_items2, df_items1], axis=1)
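
The same explode-and-normalize pattern on a toy frame, as a minimal sketch. The two users and their items are invented. Resetting the index before `pd.json_normalize` is a choice made here so that the two pieces line up by position, whereas the notebook keeps the original index via `set_index`.

```python
import pandas as pd

# Invented users, each with a list of item records
users = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "items": [
        [{"item_id": "10", "playtime_forever": 6},
         {"item_id": "20", "playtime_forever": 0}],
        [{"item_id": "30", "playtime_forever": 7}],
    ],
})

# One row per (user, item) pair
exploded = users.explode("items").reset_index(drop=True)

# Turn each item dict into columns
item_cols = pd.json_normalize(exploded["items"].tolist())

# Combine the user column with the expanded item columns
flat = pd.concat([exploded[["user_id"]], item_cols], axis=1)
print(flat)
```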

In [156]:
df_items

Unnamed: 0,item_id,item_name,playtime_forever,playtime_2weeks,user_id,items_count,steam_id,user_url,items
0,10,Counter-Strike,6.0,0.0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '10', 'item_name': 'Counter-Strike..."
0,20,Team Fortress Classic,0.0,0.0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '20', 'item_name': 'Team Fortress ..."
0,30,Day of Defeat,7.0,0.0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '30', 'item_name': 'Day of Defeat'..."
0,40,Deathmatch Classic,0.0,0.0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '40', 'item_name': 'Deathmatch Cla..."
0,50,Half-Life: Opposing Force,0.0,0.0,76561197970982479,277,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'item_id': '50', 'item_name': 'Half-Life: Opp..."
...,...,...,...,...,...,...,...,...,...
88308,373330,All Is Dust,0.0,0.0,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"{'item_id': '373330', 'item_name': 'All Is Dus..."
88308,388490,One Way To Die: Steam Edition,3.0,3.0,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"{'item_id': '388490', 'item_name': 'One Way To..."
88308,521570,You Have 10 Seconds 2,4.0,4.0,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"{'item_id': '521570', 'item_name': 'You Have 1..."
88308,519140,Minds Eyes,3.0,3.0,76561198329548331,7,76561198329548331,http://steamcommunity.com/profiles/76561198329...,"{'item_id': '519140', 'item_name': 'Minds Eyes..."


Now we drop all the unnecessary columns.

In [157]:
# Define the columns to drop
columns_to_drop = ['user_url', 'playtime_2weeks', 'items', 'steam_id', 'items_count', 'item_name']

# Drop the specified columns
df_items = df_items.drop(columns_to_drop, axis=1)


In [158]:
df_items

Unnamed: 0,item_id,playtime_forever,user_id
0,10,6.0,76561197970982479
0,20,0.0,76561197970982479
0,30,7.0,76561197970982479
0,40,0.0,76561197970982479
0,50,0.0,76561197970982479
...,...,...,...
88308,373330,0.0,76561198329548331
88308,388490,3.0,76561198329548331
88308,521570,4.0,76561198329548331
88308,519140,3.0,76561198329548331


In [159]:
# Drop rows with a missing 'item_id' or 'playtime_forever'
# (note: each call in this cell returns a new DataFrame; the results are not assigned, so df_items itself is unchanged)
df_items.dropna(subset=['item_id', 'playtime_forever'])

# Filter out rows with zero playtime
df_items[df_items['playtime_forever'] != 0]

# Drop exact duplicates and any remaining rows with missing values
df_items.drop_duplicates()
df_items.dropna()

Unnamed: 0,item_id,playtime_forever,user_id
0,10,6.0,76561197970982479
0,20,0.0,76561197970982479
0,30,7.0,76561197970982479
0,40,0.0,76561197970982479
0,50,0.0,76561197970982479
...,...,...,...
88308,346330,0.0,76561198329548331
88308,373330,0.0,76561198329548331
88308,388490,3.0,76561198329548331
88308,521570,4.0,76561198329548331
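
Note that `dropna`, boolean filtering and `drop_duplicates` all return new DataFrames, so their results must be assigned (or chained) for the cleaning to take effect. A minimal sketch on invented data, assuming rows with zero playtime should indeed be discarded:

```python
import pandas as pd

# Invented rows: one valid, two zero-playtime duplicates, one with a missing item_id
toy = pd.DataFrame({
    "item_id": ["10", "20", "20", None],
    "playtime_forever": [6.0, 0.0, 0.0, 3.0],
    "user_id": ["u1", "u1", "u1", "u2"],
})

# Chain the steps and assign the result; `toy` itself is left untouched
cleaned = (
    toy.dropna(subset=["item_id", "playtime_forever"])
       .loc[lambda d: d["playtime_forever"] != 0]
       .drop_duplicates()
       .reset_index(drop=True)
)
print(len(toy), len(cleaned))
```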


So far we have the information organized as intended. We now save the file to be able to get back to this point for the different API calls.

In [160]:
df_items.to_parquet('items.parquet')

# File AUSTRALIAN_USER_REVIEWS.JSON Data Analysis and Extraction

In [161]:
import pandas as pd
import ast

# Generate an empty list to store data
review = []

file_path = 'australian_user_reviews.json'

# Parse each line, printing an error message if a line cannot be parsed
with open(file_path, 'r', encoding='utf-8') as file:
    for line in file:
        try:
            json_data = ast.literal_eval(line)
            review.append(json_data)
        except (ValueError, SyntaxError):
            print(f"Error on line: {line}")

# Creates a DataFrame with the data
df_reviews = pd.DataFrame(review)

In [162]:
df_reviews

Unnamed: 0,user_id,user_url,reviews
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"[{'funny': '', 'posted': 'Posted November 5, 2..."
1,js41637,http://steamcommunity.com/id/js41637,"[{'funny': '', 'posted': 'Posted June 24, 2014..."
2,evcentric,http://steamcommunity.com/id/evcentric,"[{'funny': '', 'posted': 'Posted February 3.',..."
3,doctr,http://steamcommunity.com/id/doctr,"[{'funny': '', 'posted': 'Posted October 14, 2..."
4,maplemage,http://steamcommunity.com/id/maplemage,"[{'funny': '3 people found this review funny',..."
...,...,...,...
25794,76561198306599751,http://steamcommunity.com/profiles/76561198306...,"[{'funny': '', 'posted': 'Posted May 31.', 'la..."
25795,Ghoustik,http://steamcommunity.com/id/Ghoustik,"[{'funny': '', 'posted': 'Posted June 17.', 'l..."
25796,76561198310819422,http://steamcommunity.com/profiles/76561198310...,"[{'funny': '1 person found this review funny',..."
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"[{'funny': '', 'posted': 'Posted July 21.', 'l..."


The file contains the reviews we need for the sentiment analysis, so we process the 'reviews' column next, using the same explode-and-expand strategy as in the previous section.

In [163]:
df_review1 = df_reviews.explode(['reviews'])
df_review2 = df_review1['reviews'].apply(pd.Series)
df_reviews = pd.concat([df_review1, df_review2], axis=1)

In [164]:
df_reviews

Unnamed: 0,user_id,user_url,reviews,funny,posted,last_edited,item_id,helpful,recommend,review,0
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted November 5, 20...",,"Posted November 5, 2011.",,1250,No ratings yet,True,Simple yet with great replayability. In my opi...,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted July 15, 2011....",,"Posted July 15, 2011.",,22200,No ratings yet,True,It's unique and worth a playthrough.,
0,76561197970982479,http://steamcommunity.com/profiles/76561197970...,"{'funny': '', 'posted': 'Posted April 21, 2011...",,"Posted April 21, 2011.",,43110,No ratings yet,True,Great atmosphere. The gunplay can be a bit chu...,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted June 24, 2014....",,"Posted June 24, 2014.",,251610,15 of 20 people (75%) found this review helpful,True,I know what you think when you see this title ...,
1,js41637,http://steamcommunity.com/id/js41637,"{'funny': '', 'posted': 'Posted September 8, 2...",,"Posted September 8, 2013.",,227300,0 of 1 people (0%) found this review helpful,True,For a simple (it's actually not all that simpl...,
...,...,...,...,...,...,...,...,...,...,...,...
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"{'funny': '', 'posted': 'Posted July 10.', 'la...",,Posted July 10.,,70,No ratings yet,True,a must have classic from steam definitely wort...,
25797,76561198312638244,http://steamcommunity.com/profiles/76561198312...,"{'funny': '', 'posted': 'Posted July 8.', 'las...",,Posted July 8.,,362890,No ratings yet,True,this game is a perfect remake of the original ...,
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,"{'funny': '1 person found this review funny', ...",1 person found this review funny,Posted July 3.,,273110,1 of 2 people (50%) found this review helpful,True,had so much fun plaing this and collecting res...,
25798,LydiaMorley,http://steamcommunity.com/id/LydiaMorley,"{'funny': '', 'posted': 'Posted July 20.', 'la...",,Posted July 20.,,730,No ratings yet,True,:D,


We can now drop the columns we don't use, including the trailing column named 0, which is most likely left over from expanding rows that had no reviews.

In [165]:
columns_to_drop = ['funny', 'last_edited', 'user_url', 'reviews', 'helpful', df_reviews.columns[-1]]
df_reviews = df_reviews.drop(columns_to_drop, axis=1)

In [166]:
df_reviews

Unnamed: 0,user_id,posted,item_id,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,True,Simple yet with great replayability. In my opi...
0,76561197970982479,"Posted July 15, 2011.",22200,True,It's unique and worth a playthrough.
0,76561197970982479,"Posted April 21, 2011.",43110,True,Great atmosphere. The gunplay can be a bit chu...
1,js41637,"Posted June 24, 2014.",251610,True,I know what you think when you see this title ...
1,js41637,"Posted September 8, 2013.",227300,True,For a simple (it's actually not all that simpl...
...,...,...,...,...,...
25797,76561198312638244,Posted July 10.,70,True,a must have classic from steam definitely wort...
25797,76561198312638244,Posted July 8.,362890,True,this game is a perfect remake of the original ...
25798,LydiaMorley,Posted July 3.,273110,True,had so much fun plaing this and collecting res...
25798,LydiaMorley,Posted July 20.,730,True,:D


We also drop rows with missing values in the columns we keep.

In [167]:
df_reviews = df_reviews.dropna(subset=['posted', 'recommend', 'review', 'item_id'])

We are now ready to run the sentiment analysis. We download and import NLTK's VADER lexicon to score the text of each review and label it as positive, neutral or negative.

In [168]:
import nltk
# Download the vader_lexicon dictionary
nltk.download('vader_lexicon')
# Import the VADER sentiment analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\juanp\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


We are ready to score the reviews. VADER is a rule-based, lexicon-driven analyzer, so no model training is needed.

In [169]:
sia = SentimentIntensityAnalyzer()

# Maps a compound score to a sentiment category: 2 for positive, 1 for neutral and 0 for negative.
def categorize_sentiment(score):
    if score < -0.05:
        return 0  # Negative
    elif score > 0.05:
        return 2  # Positive
    else:
        return 1  # Neutral

# In case it's needed, make sure the column is a string
df_reviews['review'] = df_reviews['review'].astype(str)

# Apply the sentiment analysis to the reviews and categorize the results
df_reviews['review'] = df_reviews['review'].apply(lambda review: sia.polarity_scores(review)['compound'])
df_reviews['review'] = df_reviews['review'].apply(categorize_sentiment)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_reviews['review'] = df_reviews['review'].astype(str)
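
The `SettingWithCopyWarning` above is probably raised because `df_reviews` came from a filtering step (`dropna`) and was then modified. One common way to avoid it is to take an explicit `.copy()` after filtering. The sketch below also writes the label to a new `sentiment` column instead of overwriting `review`; that column name and the hard-coded `compound` scores (standing in for `sia.polarity_scores(...)['compound']`) are assumptions for illustration.

```python
import pandas as pd

def categorize_sentiment(score: float) -> int:
    """Map a compound score to 0 (negative), 1 (neutral) or 2 (positive)."""
    if score < -0.05:
        return 0
    if score > 0.05:
        return 2
    return 1

# Invented reviews with hypothetical compound scores
raw = pd.DataFrame({
    "review": ["great", None, "meh", "awful"],
    "compound": [0.62, 0.0, 0.05, -0.54],
})

# .copy() makes the filtered frame independent of `raw`,
# so the column assignment below targets a real DataFrame
clean = raw.dropna(subset=["review"]).copy()
clean["sentiment"] = clean["compound"].apply(categorize_sentiment)
print(clean)
```

A score of exactly 0.05 falls in the neutral band because both comparisons are strict.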

In [170]:
df_reviews

Unnamed: 0,user_id,posted,item_id,recommend,review
0,76561197970982479,"Posted November 5, 2011.",1250,True,2
0,76561197970982479,"Posted July 15, 2011.",22200,True,2
0,76561197970982479,"Posted April 21, 2011.",43110,True,2
1,js41637,"Posted June 24, 2014.",251610,True,2
1,js41637,"Posted September 8, 2013.",227300,True,2
...,...,...,...,...,...
25797,76561198312638244,Posted July 10.,70,True,2
25797,76561198312638244,Posted July 8.,362890,True,2
25798,LydiaMorley,Posted July 3.,273110,True,2
25798,LydiaMorley,Posted July 20.,730,True,2


Next we save the result, then extract the year from 'posted' and clean the data produced.

In [171]:
df_reviews.to_parquet(r'reviews.parquet')

In [172]:
df_reviews = pd.read_parquet('reviews.parquet')

In [173]:
# Extracts the year from the posted column and rename it to 'year_posted'
df_reviews['year_posted'] = df_reviews['posted'].str.extract(r'(\d{4})')
df_reviews.drop('posted' , axis = 1, inplace = True) 

In [174]:
df_reviews

| | user_id | item_id | recommend | review | year_posted |
|---|---|---|---|---|---|
| 0 | 76561197970982479 | 1250 | True | 2 | 2011 |
| 0 | 76561197970982479 | 22200 | True | 2 | 2011 |
| 0 | 76561197970982479 | 43110 | True | 2 | 2011 |
| 1 | js41637 | 251610 | True | 2 | 2014 |
| 1 | js41637 | 227300 | True | 2 | 2013 |
| ... | ... | ... | ... | ... | ... |
| 25797 | 76561198312638244 | 70 | True | 2 | |
| 25797 | 76561198312638244 | 362890 | True | 2 | |
| 25798 | LydiaMorley | 273110 | True | 2 | |
| 25798 | LydiaMorley | 730 | True | 2 | |


Some rows have a null 'year_posted' because their 'posted' text contains no four-digit year (for example, "Posted July 10."). We drop these rows to avoid confusion later on.

In [175]:
df_reviews = df_reviews.dropna(subset=['year_posted'])
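
As a minimal sketch of why these nulls appear, the example below applies the same regular expression to a small, hypothetical sample of 'posted' strings (the sample values and the `cleaned` name are assumptions for illustration, not the real dataset). Strings without a four-digit year produce a missing value, which `dropna` then removes.

```python
import pandas as pd

# Hypothetical sample mimicking the two date formats seen in the 'posted' column
sample = pd.DataFrame({
    'posted': ['Posted November 5, 2011.', 'Posted July 10.', 'Posted June 24, 2014.']
})

# str.extract returns NaN where the pattern does not match
sample['year_posted'] = sample['posted'].str.extract(r'(\d{4})', expand=False)

# Drop rows without a year, then cast the remaining strings to integers
cleaned = sample.dropna(subset=['year_posted']).copy()
cleaned['year_posted'] = cleaned['year_posted'].astype(int)

print(cleaned['year_posted'].tolist())
```

Calling `.copy()` after `dropna` makes the later column assignment act on an independent DataFrame, which is one common way to avoid pandas' `SettingWithCopyWarning`.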


We remove duplicates as well.

In [176]:
df_reviews = df_reviews.drop_duplicates()

We cast 'year_posted' to integer for convenience.

In [177]:
df_reviews['year_posted'] = df_reviews['year_posted'].astype(int)

And we are done!

In [178]:
df_reviews.to_parquet('reviews.parquet')

# Data merger to simplify data manipulation for the API Endpoints

We are ready to merge all the data into a single file so that the API endpoints only need to read from one source. We begin by selecting the relevant columns from each data file and then merge them into one file.

In [179]:
import pandas as pd
import pyarrow.parquet as pq

# Open the parquet files
games = pd.read_parquet('games.parquet')
items = pd.read_parquet('items.parquet')
reviews = pd.read_parquet('reviews.parquet')

# Selecting relevant columns from "items" (copy to avoid SettingWithCopyWarning below)
items = items[['user_id','item_id','playtime_forever']].copy()

# Generating a unique identifier in the DataFrame "items"
items['item_id'] = items['item_id'].astype(str)
items['id'] = items['user_id'] + items['item_id']

# Renaming the column "game_id" to "item_id" in the table "games"
games = games.rename(columns={'game_id': 'item_id'})

# Changing the data type of the 'item_id' column in the DataFrame "games"
games['item_id'] = games['item_id'].astype(str)

# Generating a unique identifier in the DataFrame "reviews"
reviews['item_id'] = reviews['item_id'].astype(str)
reviews['id'] = reviews['user_id'] + reviews['item_id']

# Merging the DataFrames "reviews" and "games" on 'item_id' and removing nulls
merged_df = reviews.merge(games, on='item_id', how='left')
merged_df.dropna(inplace=True)


In [180]:
merged_df

| | user_id | item_id | recommend | review | year_posted | id | genres | title | release_year | developer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76561197970982479 | 1250 | True | 2 | 2011 | 765611979709824791250 | ['Action'] | Killing Floor | 2009.0 | Tripwire Interactive |
| 1 | 76561197970982479 | 22200 | True | 2 | 2011 | 7656119797098247922200 | ['Action', 'Indie'] | Zeno Clash | 2009.0 | ACE Team |
| 4 | js41637 | 227300 | True | 2 | 2013 | js41637227300 | ['Indie', 'Simulation'] | Euro Truck Simulator 2 | 2013.0 | SCS Software |
| 5 | js41637 | 239030 | True | 2 | 2013 | js41637239030 | ['Adventure', 'Indie'] | Papers, Please | 2013.0 | 3909 |
| 6 | evcentric | 370360 | True | 2 | 2015 | evcentric370360 | ['Indie', 'Simulation'] | TIS-100 | 2015.0 | Zachtronics |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48492 | 76561198239215706 | 730 | True | 2 | 2015 | 76561198239215706730 | ['Action'] | Counter-Strike: Global Offensive | 2012.0 | Valve |
| 48493 | wayfeng | 730 | True | 1 | 2015 | wayfeng730 | ['Action'] | Counter-Strike: Global Offensive | 2012.0 | Valve |
| 48494 | 76561198251004808 | 253980 | True | 2 | 2015 | 76561198251004808253980 | ['RPG'] | Enclave | 2003.0 | Starbreeze |
| 48495 | 72947282842 | 730 | True | 0 | 2015 | 72947282842730 | ['Action'] | Counter-Strike: Global Offensive | 2012.0 | Valve |


After this first merge, we join in the playtime data, fix the column types, and keep only the columns that the API endpoints and the machine learning model need.

In [181]:

# Creating the final DataFrame by merging 'items' with 'merged_df' on the unique identifier 'id', named "api_calls"
api_calls = items.merge(merged_df, on='id')

# Renaming columns in the 'api_calls' DataFrame
api_calls = api_calls.rename(columns={'user_id_x': 'user_id', 'item_id_x': 'item_id'})

# Dropping redundant columns from 'api_calls'
api_calls.drop(['user_id_y', 'item_id_y'], axis='columns', inplace=True)

# Changing data types for certain columns
api_calls['year_posted'] = api_calls['year_posted'].astype('int')
api_calls['playtime_forever'] = api_calls['playtime_forever'].astype('int')
api_calls['release_year'] = api_calls['release_year'].astype('int')

# Selecting final columns
api_calls = api_calls[['user_id', 'item_id', 'playtime_forever', 'recommend', 'review', 'year_posted', 'genres', 'title', 'developer', 'release_year']]

# Converting 'playtime_forever' values from minutes to hours
api_calls['playtime_forever'] = api_calls['playtime_forever'] / 60

api_calls = api_calls.rename(columns={'release_year': 'year'})

# Saving the DataFrame to a Parquet file
api_calls.to_parquet('API_requests.parquet')
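
The merges above join on a helper column built by concatenating 'user_id' and 'item_id'. As a hedged aside, concatenated string keys can in principle collide when different pairs produce the same string. The sketch below uses small, hypothetical frames (the values are made up for illustration) to show such a collision and one possible alternative: passing both columns to `merge` directly. Whether collisions actually occur in the Steam data is not checked here.

```python
import pandas as pd

# Hypothetical toy frames; the column names follow the notebook
items = pd.DataFrame({
    'user_id': ['u1', 'u11'],
    'item_id': ['10', '0'],
    'playtime_forever': [120, 30],
})
reviews = pd.DataFrame({
    'user_id': ['u1'],
    'item_id': ['10'],
    'recommend': [True],
})

# Concatenated keys collide: 'u1' + '10' and 'u11' + '0' both give 'u110'
concat_keys = items['user_id'] + items['item_id']
print(concat_keys.tolist())

# Merging on both columns keeps the pairs distinct and needs no helper column
merged = items.merge(reviews, on=['user_id', 'item_id'], how='inner')
print(merged)
```

Merging on a list of columns also avoids the `_x` and `_y` suffixes for the key columns, so the later rename and drop steps would not be needed for them.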

# Machine Learning Model

The `API_requests.parquet` file contains all the data required for the machine learning model.


### Loading and Preprocessing the Data

First, we load our dataset from a parquet file. Parquet is a columnar storage format that is efficient for storing and processing large amounts of data. Our dataset contains information about various games available on Steam.

Once the data is loaded, we perform some preprocessing:
- We remove certain columns that are not relevant to our analysis, such as user IDs and playtimes.
- We convert the 'item_id' and 'year_posted' columns to strings. The item ID is a unique identifier for each game rather than a quantity to calculate with, and the year must be a string so that it can be joined with the other text fields.
- We remove duplicate entries based on the 'item_id' to ensure each game is represented only once.
- We then create a new column called 'features' by combining each game's title, the 'year_posted' value of its first retained review, and its developer into a single string. This helps in analyzing the textual data.

### Creating the TF-IDF Matrix and Cosine Similarity Matrix

To analyze the textual data in our 'features' column, we use two key concepts in natural language processing:

1. **TF-IDF (Term Frequency-Inverse Document Frequency):** This is a statistical measure used to evaluate the importance of a word in a document, which in our case, is the combined text of game titles, years, and developers. Words that are more unique to a document weigh more in TF-IDF.

2. **Cosine Similarity:** This measures the similarity between two documents. Here, it's used to find how similar each game is to every other game, based on their 'features'.

We transform our 'features' column into a TF-IDF matrix and then use this matrix to create a cosine similarity matrix. Each entry in this matrix represents how similar a pair of games are based on their features.
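
To make the two concepts concrete, here is a minimal sketch on three hypothetical 'features' strings (the strings and variable names are assumptions for illustration). It uses `TfidfVectorizer` with its default settings rather than the parameters chosen later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 'features' strings in the same "title, year, developer" layout
features = [
    'Half-Life, 2011, Valve',
    'Portal, 2011, Valve',
    'Zeno Clash, 2013, ACE Team',
]

vectorizer = TfidfVectorizer()  # default settings, not the notebook's parameters
tfidf = vectorizer.fit_transform(features)  # sparse matrix, one row per game
similarity = cosine_similarity(tfidf)       # square matrix of pairwise scores

print(similarity.round(2))
```

The first two games share the tokens "2011" and "valve", so their similarity is positive, while the third game shares no tokens with the first and scores zero against it. Each game has a similarity of 1 with itself, which is why the diagonal of the matrix is all ones.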

### Recommender Function

The `recomendacion_juego` function is our game recommendation engine:
- It takes an `item_id` of a game as input.
- The function first checks if this game exists in our dataset. If not, it returns a message indicating the game is not found.
- If the game is found, it then looks for other games that are most similar to it, based on our cosine similarity matrix.
- It returns the top 5 similar games, suggesting them as recommendations if someone likes the input game.

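This recommendation is based on the assumption that games with similar titles, review years, and developers are likely to be similar in other aspects too, such as genre or gameplay style. Note that the 'features' string uses 'year_posted' (the year a review was posted) because the release year column 'year' is dropped during preprocessing.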

### Example Usage

To demonstrate how our recommendation system works, we use an example `item_id`. The system will then provide us with a list of games similar to the one represented by this ID, illustrating how you might use this system to find new games to explore based on your preferences.


We are ready to make the function that will recommend games!

In [183]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the data
steam_games = pd.read_parquet('API_requests.parquet')

# Preprocess the data
columns_to_drop = ['user_id', 'playtime_forever', 'recommend', 'review', 'year']
steam_games.drop(columns_to_drop, axis=1, inplace=True)

steam_games['item_id'] = steam_games['item_id'].astype(str)
steam_games['year_posted'] = steam_games['year_posted'].astype(str)
steam_games = steam_games.drop_duplicates(subset='item_id', keep='first')
steam_games['features'] = steam_games[['title', 'year_posted', 'developer']].agg(', '.join, axis=1)
# Keep 'title' for direct lookup: titles such as "Papers, Please" contain ', ' and cannot be split back out
steam_games.drop(['year_posted', 'developer'], axis=1, inplace=True)

# Create the TF-IDF matrix and cosine similarity matrix
tfidv = TfidfVectorizer(min_df=2, max_df=0.7, token_pattern=r'\b[a-zA-Z0-9]\w+\b')
data_vector = tfidv.fit_transform(steam_games['features'])
data_vector_df = pd.DataFrame(data_vector.toarray(), index=steam_games['item_id'])
cos_sim_df = pd.DataFrame(cosine_similarity(data_vector_df), index=data_vector_df.index, columns=data_vector_df.index)

# Define the recommendation function
def recomendacion_juego(item_id):
    # Ensure item_id is a string
    item_id = str(item_id)

    # Check if item_id exists in the data
    if item_id not in steam_games['item_id'].values:
        return {"message": "Item ID not found in the data"}

    # Get the five most similar games, explicitly excluding the input game itself
    similar_games = cos_sim_df.loc[item_id].drop(item_id).sort_values(ascending=False).head(5)
    result_df = similar_games.reset_index().merge(steam_games, on='item_id')

    # Prepare result
    game_title = steam_games.loc[steam_games['item_id'] == item_id, 'title'].values[0]
    message = f"If you liked the game {item_id} : {game_title}, you might also like:"
    result_dict = {
        'message': message,
        'recommended games': result_df['title'].tolist()
    }

    return result_dict

# Example usage
item_id_example = '500' # Replace with an item ID
recommendations = recomendacion_juego(item_id_example)
print(recommendations)
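
The cell above stores the full similarity matrix as a dense DataFrame, which grows with the square of the number of games. As a hedged alternative sketch, the similarities for a single game can be computed on demand from the sparse TF-IDF matrix. The item IDs, feature strings, and the `top_similar` helper below are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-ins for the notebook's item IDs and 'features' strings
item_ids = ['10', '20', '30', '40']
features = [
    'Half-Life, 2011, Valve',
    'Portal, 2011, Valve',
    'Zeno Clash, 2013, ACE Team',
    'Portal 2, 2012, Valve',
]

tfidf = TfidfVectorizer().fit_transform(features)  # rows are L2-normalized by default
position = {item_id: i for i, item_id in enumerate(item_ids)}

def top_similar(item_id, k=2):
    """Return the k most similar item IDs, computed for one row on demand."""
    row = position[item_id]
    # For L2-normalized rows, the dot product equals the cosine similarity
    scores = linear_kernel(tfidf[row], tfidf).ravel()
    scores[row] = -1.0  # exclude the query game itself
    best = np.argsort(scores)[::-1][:k]
    return [item_ids[i] for i in best]

print(top_similar('20'))
```

This keeps only the sparse TF-IDF matrix in memory and computes one row of scores per request, which may suit an API deployment with limited memory.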



And that's it! We are done with the data, so we can now build `main.py` for the API deployment.