# **Machine Learning - Class Project: Games**

## Objective

The main goal of this notebook is to build, train, and save a Machine Learning regression model that predicts a video game's total sales (`total_sales`) from its features (platform, genre, year, scores, etc.).

## Libraries

### General libraries

In [1]:
import pandas as pd
import numpy as np
import time

### Machine Learning libraries

In [2]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
#from lightgbm import LGBMRegressor
#from xgboost import XGBRegressor
#from sklearn.model_selection import GridSearchCV

## Load the clean data

In [3]:
ruta = r"D:\UNIANDES\Carrera de Software\Titulacion\Seminario_Python\seminario_complexivo_demo-games\data\processed\games_clean.csv"
games_clean = pd.read_csv(ruta)

# Random Forest

## Prepare the data for the model

In [4]:
games_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8296 entries, 0 to 8295
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   videogame_names            8296 non-null   object 
 1   platform                   8296 non-null   object 
 2   year_of_release            8296 non-null   int64  
 3   genre                      8296 non-null   object 
 4   na_sales                   8296 non-null   float64
 5   eu_sales                   8296 non-null   float64
 6   jp_sales                   8296 non-null   float64
 7   other_sales                8296 non-null   float64
 8   critic_score               8296 non-null   float64
 9   user_score                 8296 non-null   float64
 10  rating_esrb                8296 non-null   object 
 11  total_sales                8296 non-null   float64
 12  gen_platform               8296 non-null   object 
 13  classification_user_score  8296 non-null   object

In [5]:
col_categoricas = ["platform", "genre", "rating_esrb"]
col_numericas = ["year_of_release", "user_score", "critic_score"]

In [6]:
target = "total_sales"

In [7]:
X_categoricas = games_clean[col_categoricas]
X_numericas = games_clean[col_numericas]
y = games_clean[target]

# Applying One-Hot Encoding

In [8]:
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder

Parameter,Value
categories,'auto'
drop,None
sparse_output,False
dtype,<class 'numpy.float64'>
handle_unknown,'ignore'
min_frequency,None
max_categories,None
feature_name_combiner,'concat'


In [9]:
X_categoricas_encoded = encoder.fit_transform(X_categoricas)

In [10]:
X_categoricas_encoded

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], shape=(8296, 43))

In [11]:
nuevas_columnas = encoder.get_feature_names_out(col_categoricas)

In [12]:
games_encoded = pd.DataFrame(
    X_categoricas_encoded, 
    columns = nuevas_columnas
)

In [13]:
print(f"Number of rows and columns: {games_encoded.shape}")
display(games_encoded.head())

Number of rows and columns: (8296, 43)


Unnamed: 0,platform_2600,platform_3DS,platform_DC,platform_DS,platform_GB,platform_GBA,platform_GC,platform_GEN,platform_N64,platform_NES,...,genre_Simulation,genre_Sports,genre_Strategy,rating_esrb_AO,rating_esrb_E,rating_esrb_E10+,rating_esrb_K-A,rating_esrb_M,rating_esrb_RP,rating_esrb_T
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [14]:
X_numericas.head()

Unnamed: 0,year_of_release,user_score,critic_score
0,2006,8.0,76.0
1,2008,8.3,82.0
2,2009,8.0,80.0
3,2006,8.5,89.0
4,2006,6.6,58.0


In [15]:
X = pd.concat([X_numericas.reset_index(drop=True), games_encoded], axis = 1)
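
The `reset_index(drop=True)` call matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below uses two hypothetical toy frames (`numeric` and `encoded`, invented for illustration) to show what happens when the indexes differ. In this notebook `games_clean` already has a default `RangeIndex`, so the reset acts as a safeguard rather than a fix.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_numericas and games_encoded
numeric = pd.DataFrame({"critic_score": [76.0, 82.0, 80.0]}, index=[10, 20, 30])
encoded = pd.DataFrame({"genre_Sports": [1.0, 0.0, 1.0]})  # default index 0, 1, 2

# Without the reset, the index labels do not match and pandas pads with NaN
misaligned = pd.concat([numeric, encoded], axis=1)

# With the reset, both frames share the labels 0, 1, 2 and line up row by row
aligned = pd.concat([numeric.reset_index(drop=True), encoded], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```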

In [16]:
y.head()

0    82.54
1    35.52
2    32.77
3    29.80
4    28.91
Name: total_sales, dtype: float64

## Split the data

In [17]:
len(games_clean)

8296

In [18]:
# Define variables for splitting the data
RAMDOM_STATE = 50
TEST_SIZE = 0.25

In [19]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = TEST_SIZE, random_state = RAMDOM_STATE)

In [20]:
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (6222, 46)
X_test shape: (2074, 46)
y_train shape: (6222,)
y_test shape: (2074,)
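
The shapes above can be checked by hand. The short sketch below redoes the arithmetic with the numbers reported earlier in the notebook (8296 rows, 3 numeric columns, 43 one-hot columns); the variable names are illustrative only.

```python
n_rows = 8296       # len(games_clean)
test_size = 0.25    # TEST_SIZE

# 0.25 * 8296 is a whole number, so no rounding is involved here
n_test = int(n_rows * test_size)
n_train = n_rows - n_test

# 3 numeric columns plus the 43 columns produced by the one-hot encoder
n_features = 3 + 43

print(n_train, n_test, n_features)
```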


In [21]:
X_train.head()

Unnamed: 0,year_of_release,user_score,critic_score,platform_2600,platform_3DS,platform_DC,platform_DS,platform_GB,platform_GBA,platform_GC,...,genre_Simulation,genre_Sports,genre_Strategy,rating_esrb_AO,rating_esrb_E,rating_esrb_E10+,rating_esrb_K-A,rating_esrb_M,rating_esrb_RP,rating_esrb_T
2686,2009,8.6,91.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5305,2005,7.1,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3139,2004,8.9,77.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
550,2007,7.5,64.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5583,2005,8.9,68.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


## Train the Random Forest Regressor model

In [24]:
# Set the model parameters
modelo = RandomForestRegressor(
    n_estimators=100,
    random_state=RAMDOM_STATE,
    n_jobs=-1,
    oob_score=True
)

In [25]:
# fit trains the model
modelo.fit(X_train, y_train)

Parameter,Value
n_estimators,100
criterion,'squared_error'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,1.0
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [None]:
print(f"OOB score (estimated R2): {modelo.oob_score_}")
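
The cell above has no recorded output, so as an illustration only, here is a sketch of what the out-of-bag (OOB) score measures, using synthetic data from `make_regression` (the data and the names `X_toy`, `y_toy`, and `toy_model` are assumptions, not part of the notebook). Each tree is trained on a bootstrap sample, and every row is predicted only by the trees that did not see it; the OOB score is the R2 of those predictions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data, unrelated to the games dataset
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

toy_model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
toy_model.fit(X_toy, y_toy)

# oob_prediction_ holds one out-of-bag prediction per training row
manual_r2 = r2_score(y_toy, toy_model.oob_prediction_)

print(f"oob_score_:        {toy_model.oob_score_:.4f}")
print(f"R2 from OOB preds: {manual_r2:.4f}")
```

Because it relies only on rows each tree left out, the OOB score gives a validation-style estimate without touching the test set.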

## Evaluate the model

In [26]:
# predict generates predictions from the trained model
predicciones = modelo.predict(X_test)

In [30]:
rmse = root_mean_squared_error(y_test, predicciones)

In [32]:
df_compraracion = pd.DataFrame({"Datos_Reales": y_test, "Prediccion": predicciones}).reset_index(drop=True)

In [33]:
df_compraracion.head(20)

Unnamed: 0,Datos_Reales,Prediccion
0,0.13,0.0828
1,0.53,0.2885
2,0.11,0.1937
3,0.86,1.1481
4,2.11,2.1431
5,1.3,1.4344
6,0.34,0.2226
7,0.14,0.7465
8,0.11,0.2947
9,0.04,0.0212
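
The notebook computes `rmse` but never displays it. As a worked illustration of the formula, the sketch below calculates the root mean squared error by hand over the ten rows displayed above only. It is not the RMSE of the full test set, and the predictions are the rounded values as displayed.

```python
import math

# The ten (actual, predicted) pairs shown in the comparison table
actual = [0.13, 0.53, 0.11, 0.86, 2.11, 1.30, 0.34, 0.14, 0.11, 0.04]
predicted = [0.0828, 0.2885, 0.1937, 1.1481, 2.1431,
             1.4344, 0.2226, 0.7465, 0.2947, 0.0212]

# RMSE = square root of the mean of the squared errors
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse_sample = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"RMSE over the 10 displayed rows: {rmse_sample:.4f}")
```

RMSE is expressed in the same units as `total_sales`, and squaring makes large misses (such as row 7) weigh much more than small ones.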


1. We select the numeric and categorical columns.
2. We convert the categorical columns into a matrix of 1s and 0s (OneHotEncoder).
3. We split the data into training data (75%) and test data (25%).
4. We train the model (any ML model would do) on the training data: X (the independent variables, or features) and y (the dependent variable, or target).
5. We evaluate on the test data by predicting from X alone (the independent variables).
6. We compute metrics that show how well the model predicts the data.