**Installation required for Google Colab**

In [127]:
!pip3 install h2o



**Library imports**

In [0]:
import pandas as pd

from google.colab import files

In [0]:
from h2o.automl import H2OAutoML, get_leaderboard

from h2o.estimators import H2ORandomForestEstimator

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

from h2o.estimators.gbm import H2OGradientBoostingEstimator

**Starting H2O**

In [130]:
import h2o
h2o.init(nthreads = -1, max_mem_size = 8)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,2 hours 7 mins
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.0.3
H2O cluster version age:,1 month and 5 days
H2O cluster name:,H2O_from_python_unknownUser_xvztxb
H2O cluster total nodes:,1
H2O cluster free memory:,7.703 Gb
H2O cluster total cores:,2
H2O cluster allowed cores:,2


**Import and light preparation with pandas**

In [0]:
df = pd.read_csv("/content/drive/My Drive/Csv4.csv").reset_index(drop=True)

In [0]:
# Drop the index column left over from an earlier export and the per-medal columns.
df = df.drop(columns=["Unnamed: 0", "Medal_Gold", "Medal_Silver", "Medal_Bronze"])

In [170]:
print(df.shape)

df.head()

(1365, 6)


,Host Country,Year,Region,Athlete,Sport,Medals
0,ESP,1992,Kuwait,32,7,0
1,ESP,1992,Niger,3,1,0
2,ESP,1992,Nigeria,55,8,11
3,ESP,1992,North Korea,64,12,10
4,ESP,1992,Norway,83,17,23


**Setting up the machine learning pipeline**

In [171]:
data = h2o.H2OFrame(df)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [0]:
# 70% training, 15% validation, and the remaining ~15% for testing.
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)

In [0]:
train = splits[0]
valid = splits[1]
test = splits[2]
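
H2O's `split_frame` assigns rows to splits probabilistically, so the resulting sizes are only approximately 70/15/15. The following is a minimal pure-Python sketch of that idea, not H2O's implementation; `split_rows` is a hypothetical helper and the random draws will not match H2O's for the same seed.

```python
import random

def split_rows(n_rows, ratios, seed):
    """Assign each row index to a split by drawing one uniform number per row."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.7, 0.85] for ratios [0.7, 0.15];
    # the leftover probability mass goes to the last split.
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        # Count how many boundaries u has passed; that count is the split index.
        k = sum(u >= b for b in bounds)
        splits[k].append(i)
    return splits

train_idx, valid_idx, test_idx = split_rows(1365, [0.7, 0.15], seed=1)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Because each row is drawn independently, the three sizes always add up to the number of rows but vary slightly around the requested proportions.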

In [0]:
y = 'Medals'
x = list(data.columns)

x.remove("Medals")

**AutoML**

In [138]:
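# Note: AutoML is trained on the full frame (data), so the test split above is not
# held out from it; the leaderboard metrics come from 10-fold cross-validation.
aml = H2OAutoML(max_models=20, nfolds=10)
aml.train(x=x, y=y, training_frame=data)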

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [139]:
aml.leaderboard.head(1)

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
DeepLearning_grid__1_AutoML_20200311_171021_model_1,63.1684,7.94785,63.1684,3.60037,
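
With `nfolds=10` and no separate leaderboard frame, the leaderboard metrics are cross-validated. As a rough illustration of what 10-fold cross-validation means for the 1365 rows, here is a pure-Python sketch; `kfold_indices` is a hypothetical helper, and H2O's actual fold assignment may differ.

```python
import random

def kfold_indices(n_rows, nfolds, seed):
    """Yield (training, holdout) index lists, one pair per fold."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    # Deal the shuffled rows into nfolds groups, like dealing cards.
    folds = [order[k::nfolds] for k in range(nfolds)]
    for k in range(nfolds):
        holdout = folds[k]
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, holdout

fold_pairs = list(kfold_indices(1365, 10, seed=1))
for training, holdout in fold_pairs:
    print(len(training), len(holdout))
```

Every row is held out exactly once, so each cross-validated metric is computed on predictions for rows the corresponding fold model did not see.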




**Random forest**

In [140]:
model = H2ORandomForestEstimator(ntrees=200, max_depth=20, nfolds=10)
model.train(x=x, y=y, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [141]:
performance = model.model_performance(test_data=test)
print(performance)
model.r2()  # R^2 on the training data; performance.r2() gives the test-set value


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 133.4332616997603
RMSE: 11.551331598554354
MAE: 4.447496464786203
RMSLE: 0.7187852638757543
Mean Residual Deviance: 133.4332616997603



0.8662591950147571
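
The error metrics in the report above can be reproduced by hand. The sketch below computes MSE, RMSE, MAE and RMSLE in pure Python; `regression_metrics` is a hypothetical helper, the `actual` values are the `Medals` column of the first five rows shown earlier, and the `predicted` values are invented for illustration.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and RMSLE for two equally long lists."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it needs values greater than -1.
    log_errors = [math.log1p(p) - math.log1p(a) for a, p in zip(actual, predicted)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "rmsle": rmsle}

actual = [0, 0, 11, 10, 23]      # Medals from the first five rows
predicted = [1, 0, 9, 12, 20]    # made-up predictions, for illustration only
metrics = regression_metrics(actual, predicted)
print(metrics)

# RMSE is the square root of MSE, which can be checked against the report above.
print(math.sqrt(133.4332616997603))
```

RMSE is expressed in medals, which makes it easier to read than MSE; MAE is less sensitive to the few countries with very large medal counts.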

**Gradient boosting machine**

In [142]:
model = H2OGradientBoostingEstimator(distribution="poisson", categorical_encoding="auto")
model.train(x=x, y=y, training_frame=train, validation_frame=valid)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [143]:
performance = model.model_performance(test_data=test)
print(performance)
model.r2()  # R^2 on the training data; performance.r2() gives the test-set value


ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 129.55432097783694
RMSE: 11.382193153247618
MAE: 3.771920285674336
RMSLE: 0.49329578416113057
Mean Residual Deviance: -52.93303127062341



0.9849132360909248

**Linear regression model**

In [144]:
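# This GLM is fitted as a regression (the report below is ModelMetricsRegressionGLM),
# not as a logistic model; the model_id "glm_logistic" is only a label.
model = H2OGeneralizedLinearEstimator()
model.train(x=x, y=y, training_frame=train, validation_frame=valid, model_id="glm_logistic")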

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [145]:
performance = model.model_performance(test_data=test)
print(performance)
model.r2()  # R^2 on the training data; the test-set R^2 is the one in the report


ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 396.54846572329615
RMSE: 19.913524693617052
MAE: 10.15434942461909
RMSLE: 1.6003723335630478
R^2: 0.4303482664184036
Mean Residual Deviance: 396.54846572329615
Null degrees of freedom: 206
Residual degrees of freedom: 204
Null deviance: 144279.2484192329
Residual deviance: 82085.5324047223
AIC: 1833.8797941108348



0.4030045606576991
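
The two R^2 values above differ (0.4303 in the test report, 0.4030 from `model.r2()`) because `model.r2()` without arguments reports the training metric. The sketch below shows the same effect on a tiny made-up dataset with a one-variable least-squares line; `fit_line` and `r_squared` are hypothetical helpers, not part of H2O.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(ys, preds):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of ys."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Made-up data, for illustration only.
train_x, train_y = [1, 2, 3, 4, 5], [1, 3, 2, 5, 4]
test_x, test_y = [6, 7, 8], [6, 5, 8]

intercept, slope = fit_line(train_x, train_y)
train_r2 = r_squared(train_y, [intercept + slope * x for x in train_x])
test_r2 = r_squared(test_y, [intercept + slope * x for x in test_x])
print(train_r2, test_r2)
```

The same fitted model yields a different R^2 on each dataset, so when comparing models it matters which of the two figures is being quoted.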

**Saving the best model found (AutoML)**

In [146]:
h2o.save_model(aml.leader, path = "model")

'/content/model/DeepLearning_grid__1_AutoML_20200311_171021_model_1'

**Predictions**

In [0]:
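# Note: this loads a StackedEnsemble saved from an earlier AutoML run (20200310),
# not the DeepLearning leader saved in the previous cell.
modelNew = h2o.load_model("/content/drive/My Drive/StackedEnsemble_BestOfFamily_AutoML_20200310_010756")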

In [148]:
pred = modelNew.predict(test)
pred[pred < 0] = 0
pred = pred.round()
pred.head(3)

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict
0
0
179
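
Medal counts cannot be negative or fractional, which is why the predictions are clipped at zero and rounded. A minimal pure-Python sketch of that post-processing follows; `clip_and_round` is a hypothetical helper and the raw values are invented. Python's `round` resolves exact .5 ties to the nearest even integer, which may differ from H2O's rounding in those edge cases.

```python
def clip_and_round(predictions):
    """Replace negative predictions with 0, then round to whole medals."""
    return [int(round(max(p, 0.0))) for p in predictions]

raw = [-0.7, 0.4, 178.6]   # made-up raw model outputs
print(clip_and_round(raw))
```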




In [0]:
predDef = pred.as_data_frame()


**Preparing the Tokyo Olympic Games data**

The data from the previous Olympic Games (2016) are used as the reference.

In [172]:
df.head()

,Host Country,Year,Region,Athlete,Sport,Medals
0,ESP,1992,Kuwait,32,7,0
1,ESP,1992,Niger,3,1,0
2,ESP,1992,Nigeria,55,8,11
3,ESP,1992,North Korea,64,12,10
4,ESP,1992,Norway,83,17,23


In [0]:
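# Use the 2016 rows as a template for 2020: same countries, athletes and sports,
# with only the year relabelled.
df2 = df.loc[df["Year"] == 2016].sort_values(by=["Year"]).reset_index(drop=True)
df2["Year"] = df2["Year"].replace({2016: 2020})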

In [174]:
df2.head()

,Host Country,Year,Region,Athlete,Sport,Medals
0,BRA,2020,Oman,4,2,0
1,BRA,2020,Albania,6,3,0
2,BRA,2020,Algeria,64,13,2
3,BRA,2020,Andorra,4,4,0
4,BRA,2020,Angola,26,7,0
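
The template step above (keep the 2016 rows, relabel the year) can also be written without pandas. This is only a sketch of the same idea on plain dictionaries; `rows_for_next_games` is a hypothetical helper, and the three rows are copied from the tables shown in this notebook.

```python
def rows_for_next_games(rows, from_year, to_year):
    """Copy the rows of from_year and relabel their year as to_year."""
    # dict(row, Year=...) builds a new dict, so the original rows stay untouched.
    return [dict(row, Year=to_year) for row in rows if row["Year"] == from_year]

history = [
    {"Host Country": "ESP", "Year": 1992, "Region": "Norway", "Athlete": 83, "Sport": 17, "Medals": 23},
    {"Host Country": "BRA", "Year": 2016, "Region": "Algeria", "Athlete": 64, "Sport": 13, "Medals": 2},
    {"Host Country": "BRA", "Year": 2016, "Region": "Angola", "Athlete": 26, "Sport": 7, "Medals": 0},
]

template_2020 = rows_for_next_games(history, from_year=2016, to_year=2020)
for row in template_2020:
    print(row)
```

The `Medals` values carried over are the 2016 results; they serve only as a reference column next to the model's predictions.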


**Predictions for the 2020 Olympic Games**

In [175]:
data2 = h2o.H2OFrame(df2)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [176]:
pred = modelNew.predict(data2)
pred[pred < 0] = 0
pred = pred.round()

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [0]:
pred = pred.as_data_frame()


In [0]:
dfMain = df2.join(pred)

In [181]:
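# The rows were scored above with Host Country = "BRA"; this only relabels the host
# in the output table.
dfMain["Host Country"] = dfMain["Host Country"].replace({"BRA": "JPN"})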
dfMain.head(10)

,Host Country,Year,Region,Athlete,Sport,Medals,predict
0,JPN,2020,Oman,4,2,0,1
1,JPN,2020,Albania,6,3,0,1
2,JPN,2020,Algeria,64,13,2,3
3,JPN,2020,Andorra,4,4,0,0
4,JPN,2020,Angola,26,7,0,0
5,JPN,2020,Antigua,8,2,0,1
6,JPN,2020,Argentina,215,26,22,29
7,JPN,2020,Armenia,31,8,4,3
8,JPN,2020,Aruba,7,4,0,0
9,JPN,2020,Czech Republic,104,20,15,8


**Exporting the results**

In [0]:
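# index=False keeps the row index out of the file, so re-importing it does not
# create an "Unnamed: 0" column.
dfMain.to_csv('dfPredicted.csv', index=False)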
files.download('dfPredicted.csv')