## Overview

This notebook covers the ETL process for `steam_games.json`, detailing each step: data extraction, transformation, and loading. I will show the methods I used and the reasoning behind each phase.



In [94]:
# import libraries to be used for this project
import json
import pandas as pd
import numpy as np
from datetime import datetime
import re


## Data Extraction:
**Extracting JSON Data**:
   - An external tool (`7Zip`) was used to decompress the `.json.gz` file. This can also be done with Python's `gzip` module, but the external tool was chosen to save time and avoid mistakes.
   - After decompressing the file, the JSON data was parsed line by line using the method shown below.
   - The `pandas` library was then used to load the parsed records into a DataFrame for further analysis and manipulation.
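
For reference, the `gzip` route mentioned above could look roughly like the sketch below. It assumes the compressed file holds one JSON object per line, as the extraction cell in this notebook does. The helper name `read_json_lines_gz` and the demo file are hypothetical and only there to make the example runnable.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines_gz(path):
    """Read a gzip-compressed file with one JSON object per line into a DataFrame."""
    records = []
    # gzip.open in text mode decompresses on the fly, so no external tool is needed
    with gzip.open(path, "rt", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)


# Demo on a small temporary file with made-up records (not the real dataset)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_games.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as file:
    file.write(json.dumps({"id": "1", "app_name": "Demo Game A"}) + "\n")
    file.write(json.dumps({"id": "2", "app_name": "Demo Game B"}) + "\n")

demo_df = read_json_lines_gz(demo_path)
print(demo_df)
```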

In [95]:
file_path = '../data/raw/output_steam_games.json'

# list to store the parsed JSON records
records = []

# read the file line by line and parse each line into a Python dict
with open(file_path, 'rt', encoding='utf-8') as file:
    for line in file:
        records.append(json.loads(line))

# create the DataFrame and display the first rows
df_games = pd.DataFrame(records)
df_games.head()

# create a copy of the dataframe to be worked on 
df_games_clean = df_games.copy()

## Data Transformation

**Identifying and Managing Data Quality Issues:**

1. Identified a significant number of completely empty rows and removed them.
2. Changed the order of the columns for better visualization and analysis.
3. Removed the unnecessary columns `price`, `url`, `specs`, `early_access`, `reviews_url`.
4. Fixed missing values in the `genres`, `developer`, `id`, and `release_date` columns, using redundant information from other columns.
5. Checked for duplicate rows.
6. Renamed columns.
7. Created dummy tables for `genres`, then dropped the column.
8. Formatted data types: only the dates were changed; the others are fine as they are.

 
**Additional Information**
- A lot of the missing data could have been acquired by web scraping the `url` provided for each game. However, with this many rows it would have taken too long, which is why the approach above was used.
- Most of the missing values left at the end of this notebook belong to items that are not video games, such as soundtracks, trailers, and movies. We don't want these in our analysis.

**Removing empty rows.**

In [96]:

# Count the number of rows where all values are NaN
all_rows_nan = df_games_clean.isna().all(axis=1).sum()

# show the number of rows and display them to validate
print(f"Number of rows where all values are NaN: {all_rows_nan}")
rows_nan = df_games_clean[df_games_clean.isna().all(axis=1)]
rows_nan.head()

Number of rows where all values are NaN: 88310


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


In [97]:
# delete all NaN rows 
df_games_clean = df_games_clean.dropna(how='all')

# Double check the rows are gone and display
all_rows_nan = df_games_clean.isna().all(axis=1).sum()
print(f"Number of rows where all values are NaN: {all_rows_nan}")

Number of rows where all values are NaN: 0


In [98]:
df_games_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32135 entries, 88310 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 3.4+ MB


**Re-shuffling columns for better visualization**

In [99]:
# Reorder the columns
new_column_order = [
    'id', 'developer', 'publisher', 'genres', 'app_name', 'title', 'release_date', 'price', 'tags','url', 'specs', 'early_access', 'reviews_url'
]

# Reassign columns
df_games_clean = df_games_clean[new_column_order]

df_games_clean.head()

Unnamed: 0,id,developer,publisher,genres,app_name,title,release_date,price,tags,url,specs,early_access,reviews_url
88310,761140,Kotoshiro,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,4.99,"[Strategy, Action, Indie, Casual, Simulation]",http://store.steampowered.com/app/761140/Lost_...,[Single-player],False,http://steamcommunity.com/app/761140/reviews/?...
88311,643980,Secret Level SRL,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,2018-01-04,Free To Play,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://store.steampowered.com/app/643980/Ironb...,"[Single-player, Multi-player, Online Multi-Pla...",False,http://steamcommunity.com/app/643980/reviews/?...
88312,670290,Poolians.com,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017-07-24,Free to Play,"[Free to Play, Simulation, Sports, Casual, Ind...",http://store.steampowered.com/app/670290/Real_...,"[Single-player, Multi-player, Online Multi-Pla...",False,http://steamcommunity.com/app/670290/reviews/?...
88313,767400,彼岸领域,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,2017-12-07,0.99,"[Action, Adventure, Casual]",http://store.steampowered.com/app/767400/2222/,[Single-player],False,http://steamcommunity.com/app/767400/reviews/?...
88314,773570,,,,Log Challenge,,,2.99,"[Action, Indie, Casual, Sports]",http://store.steampowered.com/app/773570/Log_C...,"[Single-player, Full controller support, HTC V...",False,http://steamcommunity.com/app/773570/reviews/?...


**Removing Columns:**
1. Removed the columns that were clearly not necessary for this project.
2. Analyzed the `title` column: a significant number of `NaN` values and garbled characters were observed, most likely due to HTML encoding (for example `&amp;` in place of `&`). Further examination showed that `app_name` holds the same names in corrected form, so its values can be assumed to be the correct ones. Because `title` is redundant and has more missing values, it will be dropped.
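
One way to test the HTML-encoding assumption is to decode the entities in `title` and compare the result with `app_name`. The sketch below does this on a two-row toy frame (the first pair is taken from the comparison output in this notebook); it is only an illustration, not a step of the original pipeline. `na_action="ignore"` leaves missing titles untouched.

```python
import html

import pandas as pd

# Toy frame: one title with an HTML entity, one already clean
demo = pd.DataFrame({
    "app_name": ["Sam & Max 101: Culture Shock", "Ironbound"],
    "title": ["Sam &amp; Max 101: Culture Shock", "Ironbound"],
})

# Rows that differ before decoding
different_before = (demo["title"] != demo["app_name"]).sum()

# Decode HTML entities such as &amp; and compare again
decoded_title = demo["title"].map(html.unescape, na_action="ignore")
different_after = (decoded_title != demo["app_name"]).sum()

print(different_before, different_after)
```

If the second count drops to zero on the real data, the differences are explained by encoding alone and dropping `title` loses no information.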

In [100]:
# Removing the columns
df_games_clean = df_games_clean.drop(['price', 'url','specs', 'early_access','reviews_url'], axis=1)

In [101]:
df_games_clean.head()

Unnamed: 0,id,developer,publisher,genres,app_name,title,release_date,tags
88310,761140,Kotoshiro,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]"
88311,643980,Secret Level SRL,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game..."
88312,670290,Poolians.com,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind..."
88313,767400,彼岸领域,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,2017-12-07,"[Action, Adventure, Casual]"
88314,773570,,,,Log Challenge,,,"[Action, Indie, Casual, Sports]"


**Removing `Title` column.**

In [102]:
# check the rows with an empty 'title' and see if they are relevant
# notice that each of these empty values is matched with a valid app_name
nan_title_count  = df_games_clean[df_games_clean['title'].isna()]
nan_title_count.tail(3)

Unnamed: 0,id,developer,publisher,genres,app_name,title,release_date,tags
120387,705860,,,,SpaceWalker,,,"[Early Access, Casual]"
120395,755540,,,,LIV Client,,,"[Video Production, Utilities, Web Publishing]"
120444,681550,,,,Maze Run VR,,,"[Early Access, Adventure, Indie, Action, Simul..."


In [103]:
# Filter to show rows where 'app_name' and 'title' have values but are different
df_diff = df_games_clean[df_games_clean['app_name'].notna() & df_games_clean['title'].notna() & (df_games_clean['app_name'] != df_games_clean['title'])]

# display dataframe and notice that 'app_name' column has the corrected values for each value in the 'title' column
df_diff[['app_name', 'title']].head()

Unnamed: 0,app_name,title
88390,Sam & Max 101: Culture Shock,Sam &amp; Max 101: Culture Shock
88393,Sam & Max 102: Situation: Comedy,Sam &amp; Max 102: Situation: Comedy
88419,Command & Conquer: Red Alert 3,Command &amp; Conquer: Red Alert 3
88492,Heroes of Might & Magic V: Hammers of Fate,Heroes of Might &amp; Magic V: Hammers of Fate
88494,Heroes of Might & Magic V: Tribes of the East,Heroes of Might &amp; Magic V: Tribes of the East


In [104]:
# remove title column
df_games_clean.drop(columns=['title'], inplace=True)
df_games_clean.head()

Unnamed: 0,id,developer,publisher,genres,app_name,release_date,tags
88310,761140,Kotoshiro,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]"
88311,643980,Secret Level SRL,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game..."
88312,670290,Poolians.com,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind..."
88313,767400,彼岸领域,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,2017-12-07,"[Action, Adventure, Casual]"
88314,773570,,,,Log Challenge,,"[Action, Indie, Casual, Sports]"


In [105]:
df_games_clean.isna().sum()

id                 2
developer       3299
publisher       8052
genres          3283
app_name           2
release_date    2067
tags             163
dtype: int64

**Missing Value Analysis**
1. Filled missing values of `genres` with the information in `tags`, then dropped the `tags` column.
- Step 1: Create a set of unique genres from the non-null `genres` values
- Step 2: Filter `tags` to keep only the values present in `genres_set`
- Step 3: Fill null values in `genres` with the values from `tags`
- Step 4: Apply a function that merges `tags` into `genres` without duplicates
2. Filled missing values of `developer` with the information from the `publisher` column.
3. Checked missing values in `app_name`. These were checked manually and filled with information from the respective URL.
4. Checked missing values in `id`.
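
The four `genres` steps can be tried on a toy frame first. The sketch below is a variation, not the notebook's exact code: steps 3 and 4 are folded into one hypothetical helper, `merge_keep_order`, and `dict.fromkeys` is used instead of `set` so the merged lists keep a stable order. The sample rows are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "genres": [["Action", "Indie"], None, ["Casual"]],
    "tags": [["Indie", "Action", "Pixel Graphics"], ["Casual", "VR"], None],
})

# Step 1: unique genres seen in the non-null 'genres' values
genres_set = {genre for genres_list in demo["genres"].dropna() for genre in genres_list}

# Step 2: keep only the tags that are also known genres
demo["tags"] = demo["tags"].apply(
    lambda x: [item for item in x if item in genres_set] if isinstance(x, list) else x
)


# Steps 3 and 4: fill missing genres from tags and merge without duplicates
def merge_keep_order(genres, tags):
    genres = genres if isinstance(genres, list) else []
    tags = tags if isinstance(tags, list) else []
    merged = list(dict.fromkeys(genres + tags))  # de-duplicate, keep first-seen order
    return merged if merged else None


demo["genres"] = pd.Series(
    [merge_keep_order(g, t) for g, t in zip(demo["genres"], demo["tags"])],
    index=demo.index,
    dtype=object,
)
print(demo["genres"].tolist())
```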

`genres` column

In [106]:
# Filtering rows where 'genres' is empty but 'tags' has information
filtered_df = df_games_clean[pd.isnull(df_games_clean['genres']) & df_games_clean['tags'].notnull()]
filtered_df.head()

Unnamed: 0,id,developer,publisher,genres,app_name,release_date,tags
88314,773570,,,,Log Challenge,,"[Action, Indie, Casual, Sports]"
88321,724910,,,,Icarus Six Sixty Six,,[Casual]
88329,772590,,,,After Life VR,,"[Early Access, Indie, VR]"
88330,640250,,,,Kitty Hawk,,"[Early Access, Action, Adventure, Indie, Casual]"
88332,711440,,,,Mortars VR,,"[Early Access, Strategy, Action, Indie, Casual..."


In [107]:
# Step 1
genres_set = set(genre for genres_list in df_games_clean['genres'].dropna() for genre in genres_list)

# Step 2
df_games_clean['tags'] = df_games_clean['tags'].apply(lambda x: [item for item in x if item in genres_set] if isinstance(x, list) else x)

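# Step 3 (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas versions and may not update the DataFrame)
df_games_clean['genres'] = df_games_clean['genres'].fillna(df_games_clean['tags'])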

# Step 4
def merge_genres_tags(row):
    if isinstance(row['genres'], list) and isinstance(row['tags'], list):
        return list(set(row['genres'] + row['tags']))
    return row['genres']
# Apply function
df_games_clean['genres'] = df_games_clean.apply(merge_genres_tags, axis=1)

In [108]:
df_games_clean = df_games_clean.drop(columns=['tags'])

In [109]:
# Filter rows where 'genre' is missing
rows_with_missing_genre = df_games_clean[df_games_clean['genres'].isna()]
rows_with_missing_genre.head()

Unnamed: 0,id,developer,publisher,genres,app_name,release_date
88384,,,,,,
88668,25806.0,Paradox Interactive,Paradox Interactive,,Europa Universalis III: Heir to the Throne,2009-12-15
88779,27930.0,DnS Development,DnS Development,,Booster Trooper Demo,2010-08-31
88922,56436.0,"Relic Entertainment,Feral Interactive (Mac/Linux)","SEGA, Feral Interactive (Mac/Linux)",,"Warhammer 40,000: Dawn of War II - Retribution...",2011-02-28
89089,202520.0,Trendy Entertainment,Trendy Entertainment,,Dungeon Defenders Halloween Costume Pack,2011-11-11


In [110]:
df_games_clean.head()

Unnamed: 0,id,developer,publisher,genres,app_name,release_date
88310,761140,Kotoshiro,Kotoshiro,"[Indie, Action, Strategy, Simulation, Casual]",Lost Summoner Kitty,2018-01-04
88311,643980,Secret Level SRL,"Making Fun, Inc.","[Strategy, Free to Play, RPG, Indie]",Ironbound,2018-01-04
88312,670290,Poolians.com,Poolians.com,"[Indie, Sports, Simulation, Free to Play, Casual]",Real Pool 3D - Poolians,2017-07-24
88313,767400,彼岸领域,彼岸领域,"[Action, Casual, Adventure]",弹炸人2222,2017-12-07
88314,773570,,,"[Sports, Action, Casual, Indie]",Log Challenge,


`developer` column

In [111]:
# Filter rows 
filtered_df = df_games_clean[pd.isnull(df_games_clean['developer']) & df_games_clean['publisher'].notnull()]
filtered_df.head()

Unnamed: 0,id,developer,publisher,genres,app_name,release_date
88427,9730,,Retroism,[Simulation],Tycoon City: New York,2006-02-21
88576,12690,,"ValuSoft, Retroism","[Sports, Simulation]",Hunting Unlimited 2010,2009-07-07
88614,11390,,Meridian4,"[Racing, Action, Simulation, Adventure]",Crash Time 2,2009-08-27
88617,33730,,"ValuSoft, Retroism",[Simulation],18 Wheels of Steel: Extreme Trucker,2009-09-23
89039,33420,,Ubisoft,"[Action, Adventure]",Call of Juarez®: The Cartel,2011-09-13


In [112]:
# Fill in missing values in 'developer' column with values from 'publisher'
df_games_clean['developer'] = df_games_clean['developer'].fillna(df_games_clean['publisher'])

In [113]:
# Drop the publisher column
df_games_clean = df_games_clean.drop(columns=['publisher'])

Missing `id` values:
We can observe that one row is completely empty and the other is a duplicate of another row. Both are dropped since they aren't needed.

In [114]:
# filter id values 
empty_id_rows = df_games_clean[pd.isna(df_games_clean['id'])]
empty_id_rows

Unnamed: 0,id,developer,genres,app_name,release_date
88384,,,,,
119271,,"Rocksteady Studios,Feral Interactive (Mac)","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,2012-09-07


In [115]:
# Filter rows where 'app_name' is 'Batman: Arkham City - Game of the Year Edition'
filtered_rows = df_games_clean[df_games_clean['app_name'] == 'Batman: Arkham City - Game of the Year Edition']
filtered_rows

Unnamed: 0,id,developer,genres,app_name,release_date
89378,200260.0,"Rocksteady Studios,Feral Interactive (Mac)","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,2012-09-07
119271,,"Rocksteady Studios,Feral Interactive (Mac)","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,2012-09-07


In [116]:
df_games_clean = df_games_clean.dropna(subset=['id'])

In [117]:
df_games_clean.head()

Unnamed: 0,id,developer,genres,app_name,release_date
88310,761140,Kotoshiro,"[Indie, Action, Strategy, Simulation, Casual]",Lost Summoner Kitty,2018-01-04
88311,643980,Secret Level SRL,"[Strategy, Free to Play, RPG, Indie]",Ironbound,2018-01-04
88312,670290,Poolians.com,"[Indie, Sports, Simulation, Free to Play, Casual]",Real Pool 3D - Poolians,2017-07-24
88313,767400,彼岸领域,"[Action, Casual, Adventure]",弹炸人2222,2017-12-07
88314,773570,,"[Sports, Action, Casual, Indie]",Log Challenge,


`app_name`

In [118]:
# Filter rows where 'app_name' is missing
filtered_df = df_games_clean[pd.isnull(df_games_clean['app_name'])]
filtered_df

Unnamed: 0,id,developer,genres,app_name,release_date
90890,317160,,"[Action, Indie]",,2014-08-26


In [119]:
# Update 'app_name' and 'developer' where 'id' is 317160
df_games_clean.loc[df_games_clean['id'] == '317160', 'app_name'] = 'Duet'
df_games_clean.loc[df_games_clean['id'] == '317160', 'developer'] = 'Kumobius'

In [120]:

id_317160 = df_games_clean[df_games_clean['id'] == '317160']
id_317160


Unnamed: 0,id,developer,genres,app_name,release_date
90890,317160,Kumobius,"[Action, Indie]",Duet,2014-08-26


`release_date` modifications:
1. Checked each value and converted it to the `YYYY-MM-DD` format.
2. Values that are empty or not in a recognizable format mostly belong to games that have not been released yet. These were set to NaN and will be dropped later on.
3. Created a new `release_year` column and dropped `release_date`.
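
As a compact illustration of the same idea, the sketch below tries a fixed list of date formats and returns only the year. It is a simplified alternative written with the standard library, not the function used in this notebook; `KNOWN_FORMATS` and `to_release_year` are hypothetical names, and the month-name formats assume an English locale.

```python
from datetime import datetime

# Formats assumed to cover the dataset: full date, "Month YYYY", "Mon YYYY", "YYYY"
KNOWN_FORMATS = ("%Y-%m-%d", "%B %Y", "%b %Y", "%Y")


def to_release_year(value):
    """Return the release year as int, or None when the value cannot be parsed."""
    if not isinstance(value, str):
        return None  # NaN / None / non-text values
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).year
        except ValueError:
            continue  # try the next format
    return None  # e.g. "Coming Soon"


samples = ["2018-01-04", "Jan 2018", "January 2018", "2017", "Coming Soon", None]
years = [to_release_year(s) for s in samples]
print(years)
```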


In [121]:
# Modified conversion function
def standardize_date(date):
    if pd.isna(date) or re.match(r'^\d{4}-\d{2}-\d{2}$', date):
        return date  # Return standard dates and NaN as-is
    elif re.match(r'^[A-Za-z]+\s+\d{4}$', date):
        try:
            return datetime.strptime(date, '%B %Y').strftime('%Y-%m-%d')
        except ValueError:
            try:
                return datetime.strptime(date, '%b %Y').strftime('%Y-%m-%d')
            except ValueError:
                return pd.NA
    elif re.match(r'^\d{4}$', date):
        return date + '-01-01'  # Convert 'YYYY' to 'YYYY-01-01'
    else:
        return pd.NA  # Assign a missing value to any other format

# Apply the function to create 'standardized_release_date'
df_games_clean['standardized_release_date'] = df_games_clean['release_date'].apply(standardize_date)

# Extract the year to create 'release_year'
df_games_clean['release_year'] = pd.to_datetime(df_games_clean['standardized_release_date'], errors='coerce').dt.year

# Drop the original 'release_date' and 'standardized_release_date' columns
df_games_clean.drop(columns=['release_date', 'standardized_release_date'], inplace=True)


**The rest of the missing values:**
1. Calculated the percentage of missing values per column. The largest share is about 10% of the rows (the `developer` column).
2. Dropped the rows with missing values, since this won't impact a large share of the dataset. We want the data as clean as possible, especially for our machine learning models.
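
The per-column percentages can also be computed in a single expression, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "developer": ["A", None, "C", "D"],
    "genres": [["Action"], ["Indie"], None, ["RPG"]],
    "release_year": [2018, 2017, np.nan, np.nan],
})

# isna() gives a boolean frame; mean() of booleans is the share of missing values
nan_percentages = demo.isna().mean() * 100
print(nan_percentages.round(2))
```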


In [122]:
total_rows = len(df_games_clean)

nan_count_developer = df_games_clean['developer'].isna().sum()
nan_count_genres = df_games_clean['genres'].isna().sum()
nan_count_release_date = df_games_clean['release_year'].isna().sum()

nan_percentage_developer = (nan_count_developer / total_rows) * 100
nan_percentage_genres = (nan_count_genres / total_rows) * 100
nan_percentage_release_date = (nan_count_release_date / total_rows) * 100

print(f"Percentage of NaN values in 'developer': {nan_percentage_developer}%")
print(f"Percentage of NaN values in 'genres': {nan_percentage_genres}%")
print(f"Percentage of NaN values in 'release_year': {nan_percentage_release_date}%")


Percentage of NaN values in 'developer': 10.058195624435937%
Percentage of NaN values in 'genres': 0.42946503594435625%
Percentage of NaN values in 'release_year': 6.97725080135686%


In [123]:
# Drop rows with missing values in any column
df_games_clean = df_games_clean.dropna()

In [124]:
# check for missing values in each column 
missing_values = df_games_clean.isna().sum()
print(missing_values)

id              0
developer       0
genres          0
app_name        0
release_year    0
dtype: int64


**Find Duplicate ID** 

In [125]:
duplicates_count = df_games_clean[df_games_clean.duplicated(['id'])]
duplicates_count

Unnamed: 0,id,developer,genres,app_name,release_year
102883,612880,Machine Games,"[Action, Adventure]",Wolfenstein II: The New Colossus,2017.0


In [126]:
# Remove duplicate rows based on the 'id' column, keeping the first occurrence
df_games_clean = df_games_clean.drop_duplicates(subset=['id'], keep='first')
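
Note that `duplicated(['id'])` marks only the second and later occurrences. To inspect every copy side by side before dropping, `keep=False` can be used, as in this small sketch on made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    "id": ["10", "20", "10"],
    "app_name": ["Game A", "Game B", "Game A"],
})

# keep=False flags all rows that share an id, not just the later ones
all_copies = demo[demo.duplicated(subset=["id"], keep=False)]

# keep='first' retains the first occurrence of each id
deduplicated = demo.drop_duplicates(subset=["id"], keep="first")

print(all_copies)
print(deduplicated)
```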

**Format data types:**

In [127]:
df_games_clean['release_year'] = df_games_clean['release_year'].astype(int)

**Rename columns**

In [128]:
new_names = {
    'id': 'game_id',
}

df_games_clean.rename(columns=new_names, inplace=True)

df_games_clean.head()

Unnamed: 0,game_id,developer,genres,app_name,release_year
88310,761140,Kotoshiro,"[Indie, Action, Strategy, Simulation, Casual]",Lost Summoner Kitty,2018
88311,643980,Secret Level SRL,"[Strategy, Free to Play, RPG, Indie]",Ironbound,2018
88312,670290,Poolians.com,"[Indie, Sports, Simulation, Free to Play, Casual]",Real Pool 3D - Poolians,2017
88313,767400,彼岸领域,"[Action, Casual, Adventure]",弹炸人2222,2017
88315,772540,Trickjump Games Ltd,"[Action, Simulation, Adventure]",Battle Royale Trainer,2018


**Create dummy tables for `genre`**

In [129]:

genres_dummies = df_games_clean['genres'].explode().str.get_dummies().groupby(level=0).sum()

genres_dummies = pd.concat([df_games_clean[['game_id']], genres_dummies], axis=1)

genres_dummies = genres_dummies.reset_index(drop=True)

genres_dummies.columns = genres_dummies.columns.str.replace('&amp;', '&')

genres_dummies.head()

Unnamed: 0,game_id,Accounting,Action,Adventure,Animation & Modeling,Audio Production,Casual,Design & Illustration,Early Access,Education,...,Photo Editing,RPG,Racing,Simulation,Software Training,Sports,Strategy,Utilities,Video Production,Web Publishing
0,761140,0,1,0,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
1,643980,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
2,670290,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,1,0,0,0,0
3,767400,0,1,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,772540,0,1,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
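
To see what the `explode` / `get_dummies` / `groupby` chain does, it helps to run it on a tiny frame. The data below is made up; the calls are the same ones used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "game_id": [1, 2],
    "genres": [["Action", "Indie"], ["Indie"]],
})

# explode: one row per (game, genre) pair, repeating the original index
exploded = toy["genres"].explode()

# get_dummies: one 0/1 column per genre; groupby(level=0).sum() collapses
# the repeated index back to one row per game
toy_dummies = exploded.str.get_dummies().groupby(level=0).sum()

toy_dummies = pd.concat([toy[["game_id"]], toy_dummies], axis=1)
print(toy_dummies)
```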


In [130]:
# drop the 'genres' column now that the dummy table has been created
df_games_clean = df_games_clean.drop(columns=['genres'])

In [131]:
#reset the index for a cleaner visualization
df_games_clean = df_games_clean.reset_index(drop=True)
df_games_clean.head()

Unnamed: 0,game_id,developer,app_name,release_year
0,761140,Kotoshiro,Lost Summoner Kitty,2018
1,643980,Secret Level SRL,Ironbound,2018
2,670290,Poolians.com,Real Pool 3D - Poolians,2017
3,767400,彼岸领域,弹炸人2222,2017
4,772540,Trickjump Games Ltd,Battle Royale Trainer,2018


Change the type of `game_id` to integer:

In [132]:
df_games_clean['game_id'] = df_games_clean['game_id'].astype('Int64')
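
If the ids were not guaranteed to be clean numeric strings, a more defensive variant would be to go through `pd.to_numeric` first. This is an optional sketch on made-up values, not something the notebook needed:

```python
import pandas as pd

raw_ids = pd.Series(["761140", "643980", None])

# errors='coerce' turns unparseable or missing entries into NaN instead of raising
numeric_ids = pd.to_numeric(raw_ids, errors="coerce")

# the nullable Int64 dtype keeps integers while still allowing missing values
clean_ids = numeric_ids.astype("Int64")
print(clean_ids)
```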

In [133]:
missing_values = df_games_clean.isna().sum()
print(missing_values)

game_id         0
developer       0
app_name        0
release_year    0
dtype: int64


In [134]:
df_games_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28592 entries, 0 to 28591
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   game_id       28592 non-null  Int64 
 1   developer     28592 non-null  object
 2   app_name      28592 non-null  object
 3   release_year  28592 non-null  int32 
dtypes: Int64(1), int32(1), object(2)
memory usage: 809.9+ KB


## Loading/Saving the Data
1. Saved DataFrames: `df_games_clean`, `genres_dummies`
2. Saved the data in `.csv` and `.parquet` formats, plus a 10-row `.csv` sample of each DataFrame
3. File path: `'../data/processed/'`

In [135]:

save_path = '../data/processed/'

df_games_clean.to_csv(save_path + 'df_games_clean.csv', index=False)
genres_dummies.to_csv(save_path + 'df_genres_dummies.csv', index=False)


In [136]:
# Save both DataFrames in Parquet format
df_games_clean.to_parquet(save_path + 'df_games_clean.parquet')
genres_dummies.to_parquet(save_path + 'df_genres_dummies.parquet')

In [137]:
save_path = '../data/processed/'
df_games_clean.head(10).to_csv(save_path + '10_df_games_clean.csv', index=False)
genres_dummies.head(10).to_csv(save_path + '10_genres_dummies.csv', index=False)
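
A quick sanity check after saving is to read a file back and compare it with the DataFrame in memory. The sketch below uses a temporary directory and made-up file names so it does not touch `../data/processed/`; it also creates the target folder first, since `to_csv` does not create missing directories.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "game_id": [761140, 643980],
    "app_name": ["Lost Summoner Kitty", "Ironbound"],
})

# Hypothetical output folder inside a temporary directory
save_dir = Path(tempfile.mkdtemp()) / "processed"
save_dir.mkdir(parents=True, exist_ok=True)

csv_path = save_dir / "demo_games.csv"
demo.to_csv(csv_path, index=False)

# Read the file back and compare shape and column names
reloaded = pd.read_csv(csv_path)
round_trip_ok = (
    reloaded.shape == demo.shape
    and list(reloaded.columns) == list(demo.columns)
)
print(round_trip_ok)
```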