# Exploratory analysis

This is an exploratory analysis of the Steam Games Dataset, available at: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset

Before running the analysis, place the dataset files in the `/datasets` folder.

## Setting up the environment

In [54]:
import csv
import sys
import numpy
import pandas
import sqlalchemy
import sqlalchemy.orm as orm
from dotenv import load_dotenv
import os
import json
import re
import html
import unicodedata

In [2]:
# remove the csv field size limit
csv.field_size_limit(sys.maxsize)

# show all columns
pandas.set_option('display.max_columns', None)

In [3]:
dataset_csv_path = "./datasets/games.csv"
dataset_csv_fixed_path = "./datasets/games_fixed.csv"


## Preparing the datasets

The `games.csv` file is malformed, so it has to be fixed before it can be loaded into pandas. The header declares fewer columns than the data rows contain, because the `Discount` and `DLC count` names are fused into a single `DiscountDLC count` field. The fix below splits that header field in two:

In [4]:
# show the first 5 lines of the file
with open(dataset_csv_path, 'r', encoding='utf-8') as f:
    for i in range(5):
        print(f.readline())


AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DiscountDLC count,About the game,Supported languages,Full audio languages,Reviews,Header image,Website,Support url,Support email,Windows,Mac,Linux,Metacritic score,Metacritic url,User score,Positive,Negative,Score rank,Achievements,Recommendations,Notes,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies

20200,"Galactic Bowling","Oct 21, 2008","0 - 20000",0,0,19.99,0,0,"Galactic Bowling is an exaggerated and stylized bowling game with an intergalactic twist. Players will engage in fast-paced single and multi-player competition while being submerged in a unique new universe filled with over-the-top humor, wild characters, unique levels, and addictive game play. The title is aimed at players of all ages and skill sets. Through accessible and intuitive controls and game-play, Galactic Bowling allows you to j

In [5]:
# Read the file
with open(dataset_csv_path, 'r', encoding='utf-8') as f:
    linhas = f.readlines()

# Fix the header
linhas[0] = linhas[0].replace('DiscountDLC count', 'Discount,DLC count')

# Save the fixed file
with open(dataset_csv_fixed_path, 'w', encoding='utf-8') as f:
    f.writelines(linhas)

print("✅ Fixed file saved as: games_fixed.csv")

✅ Fixed file saved as: games_fixed.csv
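The mismatch between header and data rows can be confirmed by counting fields with the `csv` module. The following is a minimal sketch on a hypothetical two-line sample that reproduces the fused `DiscountDLC count` header; the `field_counts` helper and the sample rows are assumptions made for illustration, not part of the notebook.

```python
import csv
import io


def field_counts(lines):
    """Return (header_fields, first_row_fields) for an iterable of CSV lines."""
    reader = csv.reader(lines)
    header = next(reader)
    first_row = next(reader)
    return len(header), len(first_row)


# Hypothetical sample: the header has one field fewer than the data row.
sample_text = (
    'AppID,Name,Price,DiscountDLC count\n'
    '20200,"Galactic Bowling",19.99,0,0\n'
)
broken = field_counts(io.StringIO(sample_text))

# Same correction as in the notebook: split the fused header field in two.
fixed_text = sample_text.replace('DiscountDLC count', 'Discount,DLC count', 1)
fixed = field_counts(io.StringIO(fixed_text))

print(broken)
print(fixed)
```

Running the same check against the real file (passing the open file handle instead of `io.StringIO`) would show whether the header and the first data row agree after the fix.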


Loading the fixed dataset into pandas:

In [6]:
games_dataset = pandas.read_csv(
  dataset_csv_fixed_path,
  sep=",",
  quotechar='"',
  quoting=csv.QUOTE_MINIMAL,
  engine="python",
  encoding="utf-8-sig",
)

## Initial analysis

In [7]:
games_dataset.info() # Show dtypes and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111452 entries, 0 to 111451
Data columns (total 40 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   AppID                       111452 non-null  int64  
 1   Name                        111446 non-null  object 
 2   Release date                111452 non-null  object 
 3   Estimated owners            111452 non-null  object 
 4   Peak CCU                    111452 non-null  int64  
 5   Required age                111452 non-null  int64  
 6   Price                       111452 non-null  float64
 7   Discount                    111452 non-null  int64  
 8   DLC count                   111452 non-null  int64  
 9   About the game              104969 non-null  object 
 10  Supported languages         111452 non-null  object 
 11  Full audio languages        111452 non-null  object 
 12  Reviews                     10624 non-null   object 
 13  Header image  

We can see that the table has 40 columns.

In [8]:
# Show the first rows
games_dataset.head()

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,Discount,DLC count,About the game,Supported languages,Full audio languages,Reviews,Header image,Website,Support url,Support email,Windows,Mac,Linux,Metacritic score,Metacritic url,User score,Positive,Negative,Score rank,Achievements,Recommendations,Notes,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,20200,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...,['English'],[],,https://cdn.akamai.steamstatic.com/steam/apps/...,http://www.galacticbowling.net,,,True,False,False,0,,0,6,11,,30,0,,0,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1,655370,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,0,0,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,http://trainbandit.com,,support@rustymoyher.com,True,True,False,0,,0,53,5,,12,0,,0,0,0,0,Rusty Moyher,Wild Rooster,"Single-player,Steam Achievements,Full controll...","Action,Indie","Indie,Action,Pixel Graphics,2D,Retro,Arcade,Sc...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
2,1732930,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,0,0,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,,,ramoncampiaof31@gmail.com,True,False,False,0,,0,0,0,,0,0,,0,0,0,0,Campião Games,Campião Games,Single-player,"Action,Adventure,Indie,Strategy",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
3,1355720,Henosis™,"Jul 23, 2020",0 - 20000,0,0,5.99,0,0,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,https://henosisgame.com/,https://henosisgame.com/,info@henosisgame.com,True,True,True,0,,0,3,0,,0,0,,0,0,0,0,Odd Critter Games,Odd Critter Games,"Single-player,Full controller support","Adventure,Casual,Indie","2D Platformer,Atmospheric,Surreal,Mystery,Puzz...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
4,1139950,Two Weeks in Painland,"Feb 3, 2020",0 - 20000,0,0,0.0,0,0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,https://www.unusual-games.com/home/,https://www.unusual-games.com/contact/,welistentoyou@unusual-games.com,True,True,False,0,,0,50,8,,17,0,This Game may contain content not appropriate ...,0,0,0,0,Unusual Games,Unusual Games,"Single-player,Steam Achievements","Adventure,Indie","Indie,Adventure,Nudity,Violent,Sexual Content,...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...


In [9]:
games_dataset.describe() # Basic statistics

Unnamed: 0,AppID,Peak CCU,Required age,Price,Discount,DLC count,Metacritic score,User score,Positive,Negative,Score rank,Achievements,Recommendations,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks
count,111452.0,111452.0,111452.0,111452.0,111452.0,111452.0,111452.0,111452.0,111452.0,111452.0,44.0,111452.0,111452.0,111452.0,111452.0,111452.0,111452.0
mean,1716972.0,177.7215,0.254208,7.061568,0.464209,0.44953,2.623354,0.030408,754.3525,125.859177,98.909091,17.511144,616.3715,81.24729,9.174954,72.65133,9.891038
std,920385.9,8390.462,2.035653,12.563246,3.503658,12.006677,13.736245,1.565136,21394.1,4002.844431,0.857747,150.139008,15738.54,999.935906,168.20103,1321.333137,183.232812
min,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,936255.0,0.0,0.0,0.99,0.0,0.0,0.0,0.0,0.0,0.0,98.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1665065.0,0.0,0.0,3.99,0.0,0.0,0.0,0.0,3.0,1.0,99.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,2453585.0,1.0,0.0,9.99,0.0,0.0,0.0,0.0,29.0,8.0,100.0,17.0,0.0,0.0,0.0,0.0,0.0
max,3671840.0,1311366.0,21.0,999.98,92.0,2366.0,97.0,100.0,5764420.0,895978.0,100.0,9821.0,3441592.0,145727.0,19159.0,208473.0,19159.0


In [10]:
games_dataset.shape # Dimensions

(111452, 40)

In [11]:
games_dataset.columns.tolist() # List of columns

['AppID',
 'Name',
 'Release date',
 'Estimated owners',
 'Peak CCU',
 'Required age',
 'Price',
 'Discount',
 'DLC count',
 'About the game',
 'Supported languages',
 'Full audio languages',
 'Reviews',
 'Header image',
 'Website',
 'Support url',
 'Support email',
 'Windows',
 'Mac',
 'Linux',
 'Metacritic score',
 'Metacritic url',
 'User score',
 'Positive',
 'Negative',
 'Score rank',
 'Achievements',
 'Recommendations',
 'Notes',
 'Average playtime forever',
 'Average playtime two weeks',
 'Median playtime forever',
 'Median playtime two weeks',
 'Developers',
 'Publishers',
 'Categories',
 'Genres',
 'Tags',
 'Screenshots',
 'Movies']

## Migrating the dataset

The dataset lives in a `.csv` file. To make the analysis easier, we will migrate it into a Postgres database.

In [12]:
load_dotenv()

# Prepare the database connection
POSTGRES_HOST=os.getenv("POSTGRES_HOST")
POSTGRES_PORT=os.getenv("POSTGRES_PORT")
POSTGRES_USER=os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD=os.getenv("POSTGRES_PASSWORD")
POSTGRES_DB=os.getenv("POSTGRES_DB")

# Note: this also prints the password; avoid it outside local development
print(POSTGRES_HOST, POSTGRES_PORT, POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB)

DATABASE_URL=f"postgresql+psycopg://{POSTGRES_USER}:{POSTGRES_PASSWORD}@{POSTGRES_HOST}:{POSTGRES_PORT}/{POSTGRES_DB}"

# Create the engine
engine = sqlalchemy.create_engine(DATABASE_URL)

# test the connection
try:
    with engine.connect() as connection:
        print("✅ Database connection established successfully!")
except Exception as e:
    print(f"❌ Error connecting to the database: {e}")


localhost 5432 postgres postgres steam_games
✅ Database connection established successfully!
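The connection URL above is built with plain string interpolation, which only works while the credentials contain no characters such as `@`, `:` or `/`. A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library's `urllib.parse.quote_plus`; the `build_database_url` helper and the sample password are hypothetical. SQLAlchemy's `URL.create` is an alternative that avoids manual escaping.

```python
from urllib.parse import quote_plus


def build_database_url(user, password, host, port, database,
                       driver="postgresql+psycopg"):
    """Build a database URL, percent-encoding the user name and password."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials with characters that would break a naive f-string.
url = build_database_url("postgres", "p@ss:word/1", "localhost", "5432", "steam_games")
print(url)
```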


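Now we need to prepare the tables. They are declared below as SQLAlchemy models and created with `Base.metadata.create_all`, rather than with a separate migration tool.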

In [13]:
Base = orm.declarative_base()

# =========================================
# ASSOCIATION (PIVOT) TABLES, M:N
# =========================================

game_developer = sqlalchemy.Table(
    "game_developer",
    Base.metadata,
    sqlalchemy.Column("game_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), primary_key=True),
    sqlalchemy.Column("developer_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("developers.id"), primary_key=True)
)

game_publisher = sqlalchemy.Table(
    "game_publisher",
    Base.metadata,
    sqlalchemy.Column("game_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), primary_key=True),
    sqlalchemy.Column("publisher_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("publishers.id"), primary_key=True)
)

game_category = sqlalchemy.Table(
    "game_category",
    Base.metadata,
    sqlalchemy.Column("game_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), primary_key=True),
    sqlalchemy.Column("category_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("categories.id"), primary_key=True)
)

game_genre = sqlalchemy.Table(
    "game_genre",
    Base.metadata,
    sqlalchemy.Column("game_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), primary_key=True),
    sqlalchemy.Column("genre_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("genres.id"), primary_key=True)
)

game_tag = sqlalchemy.Table(
    "game_tag",
    Base.metadata,
    sqlalchemy.Column("game_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), primary_key=True),
    sqlalchemy.Column("tag_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("tags.id"), primary_key=True)
)

game_language = sqlalchemy.Table(
    "game_language",
    Base.metadata,
    sqlalchemy.Column("game_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), primary_key=True),
    sqlalchemy.Column("language_id", sqlalchemy.Integer, sqlalchemy.ForeignKey("languages.id"), primary_key=True)
)

# =========================================
# MODELS
# =========================================

class Developer(Base):
    __tablename__ = "developers"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.Text, nullable=False)

    games = orm.relationship("Game", secondary=game_developer, back_populates="developers")

    @classmethod
    def get_or_create(cls, name, session):
        parsed = name.strip().lower()
        obj = session.query(cls).filter_by(name=parsed).first()
        if obj:
            return obj
        obj = cls(name=parsed)
        session.add(obj)
        session.commit()
        return obj


class Publisher(Base):
    __tablename__ = "publishers"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.Text, nullable=False)

    games = orm.relationship("Game", secondary=game_publisher, back_populates="publishers")

    @classmethod
    def get_or_create(cls, name, session):
        parsed = name.strip().lower()
        obj = session.query(cls).filter_by(name=parsed).first()
        if obj:
            return obj
        obj = cls(name=parsed)
        session.add(obj)
        session.commit()
        return obj


class Category(Base):
    __tablename__ = "categories"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(255), nullable=False)

    games = orm.relationship("Game", secondary=game_category, back_populates="categories")

    @classmethod
    def get_or_create(cls, name, session):
        parsed = name.strip().lower()
        obj = session.query(cls).filter_by(name=parsed).first()
        if obj:
            return obj
        obj = cls(name=parsed)
        session.add(obj)
        session.commit()
        return obj


class Genre(Base):
    __tablename__ = "genres"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(255), nullable=False)

    games = orm.relationship("Game", secondary=game_genre, back_populates="genres")

    @classmethod
    def get_or_create(cls, name, session):
        parsed = name.strip().lower()
        obj = session.query(cls).filter_by(name=parsed).first()
        if obj:
            return obj
        obj = cls(name=parsed)
        session.add(obj)
        session.commit()
        return obj


class Tag(Base):
    __tablename__ = "tags"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(255), nullable=False)

    games = orm.relationship("Game", secondary=game_tag, back_populates="tags")

    @classmethod
    def get_or_create(cls, name, session):
        parsed = name.strip().lower()
        obj = session.query(cls).filter_by(name=parsed).first()
        if obj:
            return obj
        obj = cls(name=parsed)
        session.add(obj)
        session.commit()
        return obj


class Language(Base):
    __tablename__ = "languages"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(255), nullable=False)

    games = orm.relationship("Game", secondary=game_language, back_populates="languages")

    @classmethod
    def get_or_create(cls, name, session):
        parsed = name.strip().lower()
        obj = session.query(cls).filter_by(name=parsed).first()
        if obj:
            return obj
        obj = cls(name=parsed)
        session.add(obj)
        session.commit()
        return obj


class Screenshot(Base):
    __tablename__ = "screenshots"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    game_id = sqlalchemy.Column(sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), nullable=False)
    screenshot_url = sqlalchemy.Column(sqlalchemy.String(255), nullable=False)


class Movie(Base):
    __tablename__ = "movies"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    game_id = sqlalchemy.Column(sqlalchemy.Integer, sqlalchemy.ForeignKey("games.id"), nullable=False)
    movie_url = sqlalchemy.Column(sqlalchemy.String(255), nullable=False)


class Game(Base):
    __tablename__ = "games"

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.Text, nullable=False)
    release_date = sqlalchemy.Column(sqlalchemy.Date, nullable=False)
    estimated_owners_lower = sqlalchemy.Column(sqlalchemy.Integer, nullable=False)
    estimated_owners_upper = sqlalchemy.Column(sqlalchemy.Integer, nullable=False)
    peak_ccu = sqlalchemy.Column(sqlalchemy.Integer, nullable=False, default=0)
    required_age = sqlalchemy.Column(sqlalchemy.Integer, nullable=False, default=0)
    price = sqlalchemy.Column(sqlalchemy.Float, nullable=False, default=0.0)
    discount = sqlalchemy.Column(sqlalchemy.Float, nullable=False, default=0.0)
    dlc_count = sqlalchemy.Column(sqlalchemy.Integer, nullable=False, default=0)
    about_the_game = sqlalchemy.Column(sqlalchemy.Text, nullable=False)
    header_image = sqlalchemy.Column(sqlalchemy.Text, nullable=True)
    website = sqlalchemy.Column(sqlalchemy.Text, nullable=True)
    support_url = sqlalchemy.Column(sqlalchemy.Text, nullable=True)
    support_email = sqlalchemy.Column(sqlalchemy.Text, nullable=True)
    windows = sqlalchemy.Column(sqlalchemy.Boolean, nullable=False, default=False)
    mac = sqlalchemy.Column(sqlalchemy.Boolean, nullable=False, default=False)
    linux = sqlalchemy.Column(sqlalchemy.Boolean, nullable=False, default=False)
    metacritic_score = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    metacritic_url = sqlalchemy.Column(sqlalchemy.Text, nullable=True)
    user_score = sqlalchemy.Column(sqlalchemy.Float, nullable=True)
    positive = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    negative = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    score_rank = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    achievements = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    recommendations = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    average_playtime_forever = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    average_playtime_2weeks = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    median_playtime_forever = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)
    median_playtime_2weeks = sqlalchemy.Column(sqlalchemy.Integer, nullable=True)

    # M:N RELATIONSHIPS
    developers = orm.relationship("Developer", secondary=game_developer, back_populates="games")
    publishers = orm.relationship("Publisher", secondary=game_publisher, back_populates="games")
    categories = orm.relationship("Category", secondary=game_category, back_populates="games")
    genres = orm.relationship("Genre", secondary=game_genre, back_populates="games")
    tags = orm.relationship("Tag", secondary=game_tag, back_populates="games")
    languages = orm.relationship("Language", secondary=game_language, back_populates="games")
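To make the role of the association tables concrete, here is a minimal sketch of the same many-to-many layout (`games`, `tags`, `game_tag`) using the standard library's `sqlite3` with an in-memory database as a stand-in for Postgres. The reduced column set, the tag ids and the use of the AppID as the game id are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE games (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Association table: one row per (game, tag) pair, composite primary key.
    CREATE TABLE game_tag (
        game_id INTEGER NOT NULL REFERENCES games(id),
        tag_id  INTEGER NOT NULL REFERENCES tags(id),
        PRIMARY KEY (game_id, tag_id)
    );

    INSERT INTO games VALUES (20200, 'Galactic Bowling'), (655370, 'Train Bandit');
    INSERT INTO tags  VALUES (1, 'indie'), (2, 'casual'), (3, 'action');
    INSERT INTO game_tag VALUES (20200, 1), (20200, 2), (655370, 1), (655370, 3);
""")

# All games carrying a given tag, resolved through the association table.
rows = conn.execute(
    """
    SELECT g.name
    FROM games AS g
    JOIN game_tag AS gt ON gt.game_id = g.id
    JOIN tags AS t ON t.id = gt.tag_id
    WHERE t.name = ?
    ORDER BY g.name
    """,
    ("indie",),
).fetchall()
names = [row[0] for row in rows]
print(names)
conn.close()
```

The composite primary key keeps each (game, tag) pair unique, which is the same constraint the `primary_key=True` flags express in the SQLAlchemy tables above.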


In [14]:
Base.metadata.create_all(engine)
session = sqlalchemy.orm.Session(engine)

In [15]:
# drop all tables
# Base.metadata.drop_all(engine)

In [16]:
games_dataset.head(1)

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,Discount,DLC count,About the game,Supported languages,Full audio languages,Reviews,Header image,Website,Support url,Support email,Windows,Mac,Linux,Metacritic score,Metacritic url,User score,Positive,Negative,Score rank,Achievements,Recommendations,Notes,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,20200,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...,['English'],[],,https://cdn.akamai.steamstatic.com/steam/apps/...,http://www.galacticbowling.net,,,True,False,False,0,,0,6,11,,30,0,,0,0,0,0,Perpetual FX Creative,Perpetual FX Creative,"Single-player,Multi-player,Steam Achievements,...","Casual,Indie,Sports","Indie,Casual,Sports,Bowling",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
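The `Game` model stores `estimated_owners_lower`, `estimated_owners_upper` and a `release_date`, while the CSV holds strings such as `0 - 20000` and `Oct 21, 2008`. A minimal sketch of one way to convert them, assuming every value follows exactly these two formats (something this section does not verify) and an English locale for month names; both helper names are hypothetical.

```python
from datetime import date, datetime


def parse_owner_range(value):
    """Split a string like '0 - 20000' into (lower, upper) integers."""
    lower, upper = (int(part.strip()) for part in value.split("-"))
    return lower, upper


def parse_release_date(value):
    """Parse a string like 'Oct 21, 2008'; other formats raise ValueError."""
    return datetime.strptime(value, "%b %d, %Y").date()


owners = parse_owner_range("0 - 20000")
released = parse_release_date("Oct 21, 2008")
print(owners, released)
```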


### Registering tags in Postgres

In [17]:
# take every value in the Tags column, drop NaN and make sure they are strings
tag_column = games_dataset['Tags'].dropna().astype(str)

unique_tags = set()

# extract the unique tags
for tag_string in tag_column:
    names = (name.strip().lower() for name in tag_string.split(','))
    unique_tags.update(names)

print("Total unique found:", len(unique_tags))


# fetch the tags that already exist in the database
existing_tags = {
    t.name.lower()
    for t in session.query(Tag).all()
}

# keep only the ones that are actually new
new_tags = [
    Tag(name=tag)
    for tag in unique_tags
    if tag not in existing_tags
]

# insert everything at once (much faster)
session.bulk_save_objects(new_tags)
session.commit()

print(f"Added: {len(new_tags)} tags")


Total unique found: 453
Added: 453 tags
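The extraction step (split on commas, trim, lowercase, deduplicate) is the same for every multi-valued column in this notebook. A minimal sketch of it as a standalone function over plain Python values; the `unique_names` helper is hypothetical, and unlike the cell above it also skips empty names and non-string values.

```python
def unique_names(values):
    """Collect trimmed, lowercased names from comma-separated strings."""
    names = set()
    for value in values:
        if not isinstance(value, str):
            continue  # e.g. NaN for rows without a value
        for part in value.split(","):
            cleaned = part.strip().lower()
            if cleaned:
                names.add(cleaned)
    return names


# Hypothetical sample shaped like the Tags column.
sample = ["Indie,Casual,Sports,Bowling", "Indie, Action ", float("nan"), ""]
found = unique_names(sample)
print(sorted(found))
```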


In [18]:
# build a {lowercase_name: id} dictionary for fast lookups
tag_map = {
    tag.name.lower(): tag.id
    for tag in session.query(Tag).all()
}

def get_tag_id(tag_names):
    if not isinstance(tag_names, str):
        return None
    
    ids = []
    for name in map(str.strip, tag_names.lower().split(',')):
        tag_id = tag_map.get(name)
        if tag_id is None:
            print(f"Tag {name} not found")
        else:
            ids.append(tag_id)
    return ids

# apply to the dataset
games_dataset['Tags'] = games_dataset['Tags'].apply(get_tag_id)

print("Tags replaced successfully!")

Tags replaced successfully!
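The name-to-id replacement follows one pattern regardless of the column, so it can be written once with the lookup dictionary passed in. A minimal sketch under that assumption; `names_to_ids` and its optional `missing` list are hypothetical and not part of the notebook.

```python
def names_to_ids(value, lookup, missing=None):
    """Map a comma-separated string of names to a list of ids.

    Returns None for non-string input (such as NaN). Names absent from
    `lookup` are skipped and, if a list is given, appended to `missing`.
    """
    if not isinstance(value, str):
        return None

    ids = []
    for name in (part.strip().lower() for part in value.split(",")):
        found = lookup.get(name)
        if found is None:
            if missing is not None:
                missing.append(name)
        else:
            ids.append(found)
    return ids


# Hypothetical lookup table shaped like tag_map.
lookup = {"indie": 1, "casual": 2}
missing = []
ids = names_to_ids("Indie, Casual,Bowling", lookup, missing)
print(ids, missing)
```

With pandas, a function like this could be applied to a column as `column.apply(names_to_ids, args=(lookup,))`.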


### Registering developers in Postgres

In [19]:
# take every value in the column
developer_column = games_dataset['Developers'].dropna().astype(str)

unique_developers = set()

# extract all unique developers
for dev_string in developer_column:
    names = (name.strip().lower() for name in dev_string.split(','))
    unique_developers.update(names)

print("Total unique found:", len(unique_developers))


# fetch the developers that already exist in the database
existing = {
    d.name.lower()
    for d in session.query(Developer).all()
}

# keep only the new ones
new_developers = [
    Developer(name=name)
    for name in unique_developers
    if name not in existing
]

# insert in batches (much faster than inserting one by one)
batch_size = 2500
for i in range(0, len(new_developers), batch_size):
    batch = new_developers[i:i+batch_size]
    session.add_all(batch)
    session.commit()

print(f"Added: {len(new_developers)} developers")


Total unique found: 67956
Added: 67956 developers
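The batching loop above slices the list into fixed-size chunks and commits once per chunk. A minimal sketch of that slicing as a reusable generator; the `chunked` name is hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Seven items in batches of three: the last batch is shorter.
batches = list(chunked(list(range(7)), 3))
print(batches)
```

Each yielded batch would then take the place of `batch` in the loop, followed by `session.add_all(batch)` and `session.commit()`.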


In [20]:
# build a {lowercase_name: id} dictionary
developer_map = {
    dev.name.lower(): dev.id
    for dev in session.query(Developer).all()
}

def get_developer_id(developer_names):
    if not isinstance(developer_names, str):
        return None
    
    ids = []
    for name in map(str.strip, developer_names.lower().split(',')):
        dev_id = developer_map.get(name)
        if dev_id is None:
            print(f"Developer {name} not found")
        else:
            ids.append(dev_id)
    return ids

# apply to the dataset
games_dataset['Developers'] = games_dataset['Developers'].apply(get_developer_id)

print("Developers replaced successfully!")


Developers replaced successfully!


### Registering publishers in Postgres

In [22]:
# take every value in the column
publisher_column = games_dataset['Publishers'].dropna().astype(str)

unique_publishers = set()

# extract all unique publishers
for pub_string in publisher_column:
    names = (name.strip().lower() for name in pub_string.split(','))
    unique_publishers.update(names)

print("Total unique found:", len(unique_publishers))


# fetch the publishers that already exist in the database
existing = {
    p.name.lower()
    for p in session.query(Publisher).all()
}

# keep only the new ones
new_publishers = [
    Publisher(name=name)
    for name in unique_publishers
    if name not in existing
]

# insert in batches (much faster than inserting one by one)
batch_size = 1000
for i in range(0, len(new_publishers), batch_size):
    batch = new_publishers[i:i+batch_size]
    session.add_all(batch)
    session.commit()

print(f"Added: {len(new_publishers)} publishers")

Total unique found: 56476
Added: 56476 publishers


In [23]:
# build a {lowercase_name: id} dictionary
publisher_map = {
    pub.name.lower(): pub.id
    for pub in session.query(Publisher).all()
}

def get_publisher_id(publisher_names):
    if not isinstance(publisher_names, str):
        return None
    
    ids = []
    for name in map(str.strip, publisher_names.lower().split(',')):
        pub_id = publisher_map.get(name)
        if pub_id is None:
            print(f"Publisher {name} not found")
        else:
            ids.append(pub_id)
    return ids

# apply to the dataset
games_dataset['Publishers'] = games_dataset['Publishers'].apply(get_publisher_id)

print("Publishers replaced successfully!")


Publishers replaced successfully!


### Registering categories in Postgres

In [28]:
# take every value in the column
categories_column = games_dataset['Categories'].dropna().astype(str)

unique_categories = set()

# extract all unique categories
for cat_string in categories_column:
    names = (name.strip().lower() for name in cat_string.split(','))
    unique_categories.update(names)

print("Total unique found:", len(unique_categories))


# fetch the categories that already exist in the database
existing = {
    c.name.lower()
    for c in session.query(Category).all()
}

# keep only the new ones
new_categories = [
    Category(name=name)
    for name in unique_categories
    if name not in existing
]

# insert in batches (much faster than inserting one by one)
batch_size = 1000
for i in range(0, len(new_categories), batch_size):
    batch = new_categories[i:i+batch_size]
    session.add_all(batch)
    session.commit()

print(f"Added: {len(new_categories)} categories")

Total unique found: 43
Added: 43 categories


In [29]:
category_map = {
    cat.name.lower(): cat.id
    for cat in session.query(Category).all()
}

def get_category_id(category_names):
    if not isinstance(category_names, str):
        return None
    
    ids = []
    for name in map(str.strip, category_names.lower().split(',')):
        cat_id = category_map.get(name)
        if cat_id is None:
            print(f"Category {name} not found")
        else:
            ids.append(cat_id)
    return ids

# apply to the dataset
games_dataset['Categories'] = games_dataset['Categories'].apply(get_category_id)

print("Categories replaced successfully!")


Categories replaced successfully!


### Registering genres in Postgres

In [32]:
# take every value in the column
genres_column = games_dataset['Genres'].dropna().astype(str)

unique_genres = set()

# extract all unique genres
for genre_string in genres_column:
    names = (name.strip().lower() for name in genre_string.split(','))
    unique_genres.update(names)

print("Total unique found:", len(unique_genres))

# fetch the genres that already exist in the database
existing = {
    g.name.lower()
    for g in session.query(Genre).all()
}

# keep only the new ones
new_genres = [
    Genre(name=name)
    for name in unique_genres
    if name not in existing
]

# insert in batches (much faster than inserting one by one)
batch_size = 1000
for i in range(0, len(new_genres), batch_size):
    batch = new_genres[i:i+batch_size]
    session.add_all(batch)
    session.commit()

print(f"Added: {len(new_genres)} genres")

Total unique found: 33
Added: 33 genres


In [33]:
genre_map = {
    genre.name.lower(): genre.id
    for genre in session.query(Genre).all()
}

def get_genre_id(genre_names):
    if not isinstance(genre_names, str):
        return None
    
    ids = []
    for name in map(str.strip, genre_names.lower().split(',')):
        genre_id = genre_map.get(name)
        if genre_id is None:
            print(f"Genre {name} not found")
        else:
            ids.append(genre_id)
    return ids

# apply to the dataset
games_dataset['Genres'] = games_dataset['Genres'].apply(get_genre_id)

print("Genres replaced successfully!")

Genres replaced successfully!


### Registering languages in Postgres

In [56]:
import ast
import json
import re

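def fix_brackets(m):
    # Re-quote every comma-separated item found between the brackets.
    # Plain concatenation avoids a backslash inside an f-string expression,
    # which is a SyntaxError before Python 3.12.
    items = m.group(1).split(",")
    items = ['"' + i.strip().strip('"') + '"' for i in items]
    return "[" + ",".join(items) + "]"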

def safe_parse_languages(s):
    original = s.strip()

    # 1. Try plain JSON first (only if it starts with "[")
    if original.startswith("["):
        try:
            return json.loads(original)
        except Exception:
            pass

    # 2. Try literal_eval directly
    try:
        return ast.literal_eval(original)
    except Exception:
        pass

    # 3. Try to repair malformed strings
    fixed = original

    # 3.1 Replace single quotes with double quotes
    fixed = fixed.replace("'", '"')

    # 3.2 Make sure unquoted items end up quoted
    # e.g. K"iche" -> "K\"iche\""
    fixed = re.sub(r'(\w+)"', r'"\1"', fixed)

    # 3.3 Make sure bare items end up quoted
    # e.g. [English, French] -> ["English", "French"]
    # fixed = re.sub(r'\[(.*?)\]', lambda m: "[" + ",".join(f'"{x.strip().strip(\'"\')}"' for x in m.group(1).split(",")) + "]", fixed)
    fixed = re.sub(r"\[(.*?)\]", fix_brackets, fixed)

    # 4. Try JSON again after the repairs
    try:
        return json.loads(fixed)
    except Exception:
        pass

    # 5. Manual fallback: strip the brackets and split on commas
    fallback = original.strip("[]").split(",")
    fallback = [x.strip().strip('"').strip("'") for x in fallback]
    return fallback

def clean_language(raw):
    if not isinstance(raw, str):
        return []

    # ---- 1) Decode HTML entities ----
    text = html.unescape(raw)

    # ---- 2) Remove HTML tags and BBCode ----
    text = re.sub(r'<[^>]*>', '', text)
    text = re.sub(r'\[/?[a-zA-Z0-9]+\]', '', text)

    # ---- 3) Replace line breaks with commas ----
    text = text.replace("\r", ",").replace("\n", ",")

    # ---- 4) Split items that come glued together ----
    parts = re.split(r'[,\|;/]+', text)

    cleaned = []

    for item in parts:
        item = item.strip().lower()

        if not item:
            continue

        # Skip hashtag tokens (e.g. #lang_français)
        if item.startswith("#"):
            continue

        # Remove leftovers of malformed HTML (lt, gt, amp, strong)
        item = re.sub(r'\b(lt|gt|amp|strong)\b', '', item)
        item = item.replace("&lt", "").replace("&gt", "").replace("&amp", "")

        # Strip symbols at the start/end
        item = re.sub(r'^[^a-z0-9]+|[^a-z0-9]+$', '', item)

        # Normalize Unicode to NFKC so equivalent character sequences compare equal
        item = unicodedata.normalize("NFKC", item)

        # Collapse common compound language labels
        item = item.replace("simplified chinese text only", "simplified chinese")
        item = item.replace("traditional chinese text only", "traditional chinese")

        # Remove internal duplicated words
        item = re.sub(r'\b(\w+)\s+\1\b', r'\1', item)

        # Handle cases like "english dutch english"
        words = item.split()
        if len(words) > 1 and all(w.isalpha() for w in words):
            # If it is a sequence of languages without commas, split it.
            # Note: this also splits multi-word names such as "simplified chinese",
            # which is why 'simplified' and 'chinese' show up as separate entries.
            for w in words:
                cleaned.append(w)
            continue

        # Fix k'iche' (keeping the apostrophe)
        if "k'iche" in item:
            item = "k'iche'"

        # Fix languages left without a closing ')'
        if "(" in item and ")" not in item:
            item += ")"

        # Discard if it ended up empty
        item = item.strip()
        if item:
            cleaned.append(item)

    return cleaned


# ---- process the column ----

languages_column = games_dataset['Supported languages'].dropna().astype(str)

unique_languages = set()

for lang_string in languages_column:
    lang_list = safe_parse_languages(lang_string)

    for name in lang_list:
        for cleaned in clean_language(name):
            if cleaned:
                unique_languages.add(cleaned)

sorted_unique_languages = sorted(unique_languages)
print(sorted_unique_languages)
print("Total unique languages found:", len(unique_languages))

# fetch the languages that already exist in the database
existing = {
    l.name.lower()
    for l in session.query(Language).all()
}

# keep only the new ones
new_languages = [
    Language(name=name)
    for name in unique_languages
    if name not in existing
]

# insert in batches (much faster than inserting one at a time)
batch_size = 1000
for i in range(0, len(new_languages), batch_size):
    batch = new_languages[i:i+batch_size]
    session.add_all(batch)
    session.commit()

print(f"Added: {len(new_languages)} languages")


['afrikaans', 'albanian', 'amharic', 'arabic', 'armenian', 'assamese', 'azerbaijani', 'bangla', 'basque', 'belarusian', 'bosnian', 'br', 'bulgarian', 'catalan', 'cherokee', 'chinese', 'croatian', 'czech', 'danish', 'dari', 'dutch', 'english', 'english (full audio)', 'estonian', 'filipino', 'finnish', 'french', 'galician', 'georgian', 'german', 'greek', 'gujarati', 'hausa', 'hebrew', 'hindi', 'hungarian', 'icelandic', 'igbo', 'indonesian', 'irish', 'italian', 'japanese', 'japanese (all with full audio support)', "k'iche'", 'kannada', 'kazakh', 'khmer', 'kinyarwanda', 'konkani', 'korean', 'kyrgyz', 'latvian', 'lithuanian', 'luxembourgish', 'macedonian', 'malay', 'malayalam', 'maltese', 'maori', 'marathi', 'mongolian', 'nepali', 'norwegian', 'odia', 'persian', 'polish', 'portuguese', 'portuguese - brazil', 'portuguese - portugal', 'punjabi (gurmukhi)', 'punjabi (shahmukhi)', 'quechua', 'romanian', 'russian', 'scots', 'serbian', 'simplified', 'sindhi', 'sinhala', 'slovak', 'slovakian', 'sl
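
The first stages of `clean_language` (decode entities, strip tags, split on separators, trim edge symbols) can be tried in isolation. The sketch below is a reduced version for illustration only: `split_language_field` is a hypothetical name, the sample string is made up, and the later steps of the real function (duplicate removal, word splitting, parenthesis repair) are left out.

```python
import html
import re

def split_language_field(raw):
    """Hypothetical helper: reduced version of the first cleaning stages."""
    # Decode HTML entities such as &lt; and &gt;
    text = html.unescape(raw)
    # Strip HTML tags and BBCode markers
    text = re.sub(r'<[^>]*>', '', text)
    text = re.sub(r'\[/?[a-zA-Z0-9]+\]', '', text)
    # Treat line breaks as separators
    text = text.replace("\r", ",").replace("\n", ",")

    items = []
    for part in re.split(r'[,\|;/]+', text):
        item = part.strip().lower()
        # Trim symbols left at the start or end (e.g. a footnote asterisk)
        item = re.sub(r'^[^a-z0-9]+|[^a-z0-9]+$', '', item)
        if item:
            items.append(item)
    return items

# Made-up input that mixes escaped HTML, BBCode and several separators
sample = "English&lt;strong&gt;*&lt;/strong&gt;, [b]French[/b]\r\nGerman; Spanish - Spain"
print(split_language_field(sample))
# ['english', 'french', 'german', 'spanish - spain']
```

Hyphenated names such as `spanish - spain` survive because `-` is not one of the separators.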

In [57]:
language_map = {
    language.name.lower(): language.id
    for language in session.query(Language).all()
}

def get_language_id(language_names):
    if not isinstance(language_names, str):
        return None
    
    ids = []
    # language names in the dataset are very dirty, so clean them before applying the map
    lang_list = safe_parse_languages(language_names)
    for name in lang_list:
        for cleaned in clean_language(name):
            if cleaned:
                lang_id = language_map.get(cleaned)
                if lang_id is None:
                    print(f"Language {cleaned} not found")
                else:
                    ids.append(lang_id)
    return ids

games_dataset['Supported languages'] = games_dataset['Supported languages'].apply(get_language_id)

print("Languages replaced successfully!")

Languages replaced successfully!
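
A possible variation on the lookup step, shown as a minimal sketch: instead of printing each unknown name, collect the names that have no id so they can be inspected afterwards, and skip ids that were already added. `map_names_to_ids` and the ids in `example_map` are hypothetical. In the notebook the mapping comes from the `Language` table.

```python
def map_names_to_ids(names, name_to_id):
    """Return (ids, missing): ids in first-seen order without repeats,
    and the names that had no entry in name_to_id."""
    ids = []
    missing = []
    seen = set()
    for name in names:
        lang_id = name_to_id.get(name)
        if lang_id is None:
            missing.append(name)
        elif lang_id not in seen:
            seen.add(lang_id)
            ids.append(lang_id)
    return ids, missing

# Hypothetical ids; the real ones come from the Language table.
example_map = {"english": 1, "french": 2, "portuguese - brazil": 3}

ids, missing = map_names_to_ids(
    ["english", "french", "english", "klingon"], example_map
)
print(ids)      # [1, 2]
print(missing)  # ['klingon']
```

Returning the missing names keeps the output of `apply` quiet on a large column and makes it easy to count how many rows were affected.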


### Insert games into Postgres

### Insert movies into Postgres

### Insert screenshots into Postgres

In [34]:
games_dataset.head()

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,Discount,DLC count,About the game,Supported languages,Full audio languages,Reviews,Header image,Website,Support url,Support email,Windows,Mac,Linux,Metacritic score,Metacritic url,User score,Positive,Negative,Score rank,Achievements,Recommendations,Notes,Average playtime forever,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,20200,Galactic Bowling,"Oct 21, 2008",0 - 20000,0,0,19.99,0,0,Galactic Bowling is an exaggerated and stylize...,['English'],[],,https://cdn.akamai.steamstatic.com/steam/apps/...,http://www.galacticbowling.net,,,True,False,False,0,,0,6,11,,30,0,,0,0,0,0,[40282],[33489],"[5, 11, 9, 21]","[15, 20, 26]","[126, 98, 265, 233]",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
1,655370,Train Bandit,"Oct 12, 2017",0 - 20000,0,0,0.99,0,0,THE LAW!! Looks to be a showdown atop a train....,"['English', 'French', 'Italian', 'German', 'Sp...",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,http://trainbandit.com,,support@rustymoyher.com,True,True,False,0,,0,53,5,,12,0,,0,0,0,0,[54162],[11233],"[5, 9, 19, 13, 32, 22, 38]","[21, 20]","[126, 444, 114, 183, 182, 450, 362, 108, 213, ...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
2,1732930,Jolt Project,"Nov 17, 2021",0 - 20000,0,0,4.99,0,0,Jolt Project: The army now has a new robotics ...,"['English', 'Portuguese - Brazil']",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,,,ramoncampiaof31@gmail.com,True,False,False,0,,0,0,0,,0,0,,0,0,0,0,[1593],[1332],[5],"[21, 22, 20, 18]",,https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
3,1355720,Henosis™,"Jul 23, 2020",0 - 20000,0,0,5.99,0,0,HENOSIS™ is a mysterious 2D Platform Puzzler w...,"['English', 'French', 'Italian', 'German', 'Sp...",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,https://henosisgame.com/,https://henosisgame.com/,info@henosisgame.com,True,True,True,0,,0,3,0,,0,0,,0,0,0,0,[31767],[26387],"[5, 19]","[22, 15, 20]","[443, 21, 11, 236, 406, 2, 187, 275, 247, 345,...",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
4,1139950,Two Weeks in Painland,"Feb 3, 2020",0 - 20000,0,0,0.0,0,0,ABOUT THE GAME Play as a hacker who has arrang...,"['English', 'Spanish - Spain']",[],,https://cdn.akamai.steamstatic.com/steam/apps/...,https://www.unusual-games.com/home/,https://www.unusual-games.com/contact/,welistentoyou@unusual-games.com,True,True,False,0,,0,50,8,,17,0,This Game may contain content not appropriate ...,0,0,0,0,[20133],[16647],"[5, 9]","[22, 20]","[126, 187, 12, 375, 298, 143]",https://cdn.akamai.steamstatic.com/steam/apps/...,http://cdn.akamai.steamstatic.com/steam/apps/2...
