# MARATÓN BEHIND THE CODE 2020

# CHALLENGE 8 - Digital House

Digital House's challenge is tied to the company's mission and vision: to transform people's lives by developing digital skills so that they can have a positive impact on society. To that end, Digital House wants to use this challenge to measure the employability associated with the courses offered on its platform, that is, how likely a Digital House student or graduate is to enter the job market or find a job in their field after completing one or more of those courses.

Understanding which characteristics or variables make a person more or less employable is essential for the company and for generating the positive impact it seeks to bring to society.

<hr>

## Installing some Python libraries

In [1]:
!pip install scikit-learn --upgrade
!pip install scipy --upgrade

Collecting scikit-learn
  Downloading scikit_learn-0.23.2-cp38-cp38-win_amd64.whl (6.8 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.23.1
    Uninstalling scikit-learn-0.23.1:
      Successfully uninstalled scikit-learn-0.23.1
Successfully installed scikit-learn-0.23.2
Collecting scipy
  Downloading scipy-1.5.2-cp38-cp38-win_amd64.whl (31.4 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.5.0
    Uninstalling scipy-1.5.0:
      Successfully uninstalled scipy-1.5.0
Successfully installed scipy-1.5.2


<hr>

## Download the challenge dataset in .csv format

In [1]:
import pandas as pd

# !wget --no-check-certificate --content-disposition https://raw.githubusercontent.com/vanderlei-test/654986294958/master/train_dataset_digitalhouse.csv
# !python -m wget https://raw.githubusercontent.com/vanderlei-test/654986294958/master/train_dataset_digitalhouse.csv
df_training_dataset = pd.read_csv(r'train_dataset_digitalhouse.csv')
df_training_dataset.tail()

Unnamed: 0.1,Unnamed: 0,EDAD,GENERO,RESIDENCIA,NV_ESTUDIO,ESTUDIO_PREV,TRACK_DH,AVG_DH,MINUTES_DH,EXPERIENCIA,DIAS_EMP
8990,9995,29.0,MASCULINO,ARGENTINA,TERTIARY,DEVELOPMENT,,4.0,4701.6,9.1,86.0
8991,9996,34.0,,ARGENTINA,UNIVERSITARY,ENGINEERING,PROGRAMACION,3.4,4646.2,16.8,95.0
8992,9997,28.0,FEMENINO,ARGENTINA,POST_GRADUATE,ENGINEERING,EJECUTIVO,,3315.1,5.6,95.0
8993,9998,23.0,MASCULINO,MEXICO,TERTIARY,ENGINEERING,PROGRAMACION,3.3,4437.8,0.9,87.0
8994,9999,36.0,MASCULINO,ARGENTINA,UNIVERSITARY,COMMERCIAL,DATA,3.4,4600.8,19.6,88.0


## The provided dataset contains the following columns:

* Unnamed: 0
* EDAD
* GENERO
* RESIDENCIA
* NV_ESTUDIO
* ESTUDIO_PREV
* TRACK_DH
* AVG_DH
* MINUTES_DH
* EXPERIENCIA
* **DIAS_EMP = the "target" value to be predicted**

We can easily check for missing values with the following code:

```df_training_dataset.info()```

You must handle these missing values carefully before building a regression model.

In [2]:
df_training_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8995 entries, 0 to 8994
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    8995 non-null   int64  
 1   EDAD          7668 non-null   float64
 2   GENERO        7620 non-null   object 
 3   RESIDENCIA    7638 non-null   object 
 4   NV_ESTUDIO    7623 non-null   object 
 5   ESTUDIO_PREV  7665 non-null   object 
 6   TRACK_DH      7714 non-null   object 
 7   AVG_DH        7651 non-null   float64
 8   MINUTES_DH    7619 non-null   float64
 9   EXPERIENCIA   7618 non-null   float64
 10  DIAS_EMP      8995 non-null   float64
dtypes: float64(5), int64(1), object(5)
memory usage: 773.1+ KB


In [3]:
df_training_dataset.nunique()

Unnamed: 0      8995
EDAD              37
GENERO             2
RESIDENCIA         3
NV_ESTUDIO         3
ESTUDIO_PREV       5
TRACK_DH           4
AVG_DH            29
MINUTES_DH      5109
EXPERIENCIA      318
DIAS_EMP          34
dtype: int64

In [4]:
# Show the null counts of the dataset before the first transformation (df)
print("Null values in df_training_dataset before the DropNA transformation: \n\n{}\n".format(df_training_dataset.isnull().sum(axis = 0)))

Null values in df_training_dataset before the DropNA transformation: 

Unnamed: 0         0
EDAD            1327
GENERO          1375
RESIDENCIA      1357
NV_ESTUDIO      1372
ESTUDIO_PREV    1330
TRACK_DH        1281
AVG_DH          1344
MINUTES_DH      1376
EXPERIENCIA     1377
DIAS_EMP           0
dtype: int64
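
The counts above show that every feature column is missing roughly 14 to 15 percent of its 8995 values, while `DIAS_EMP` is complete. As a minimal sketch of how to express this as a share per column and to see how many rows a plain `dropna()` would keep, the example below uses a small synthetic frame (`df_demo` and its values are illustrative assumptions, not the challenge data):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for df_training_dataset (illustrative values only).
df_demo = pd.DataFrame({
    "EDAD": [29.0, 34.0, np.nan, 23.0],
    "GENERO": ["MASCULINO", None, "FEMENINO", "MASCULINO"],
    "DIAS_EMP": [86.0, 95.0, 95.0, 87.0],
})

# Fraction of missing values per column, most affected columns first.
missing_share = df_demo.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows with no missing value at all, i.e. what dropna() would keep.
complete_rows = len(df_demo.dropna())
print(f"{complete_rows} of {len(df_demo)} rows have no missing values")
```

When missing values are spread across many columns, dropping incomplete rows can discard far more data than the per-column share suggests, which is one reason to consider imputation instead.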



<hr>

## Some tips before training a model:

1. Handle missing values

2. Handle categorical variables

<hr>

## Below is a simple template that shows how to structure your model's inputs and outputs

### Mapping categorical values to integer codes

In [5]:
cleanup_nums = {"RESIDENCIA":     {"ARGENTINA": 0, "MEXICO": 1, "BRAZIL": 2},
                "GENERO": {"MASCULINO": 0, "FEMENINO": 1 },
                "NV_ESTUDIO": {"UNIVERSITARY": 0, "POST_GRADUATE": 1, "TERTIARY": 2 },
                "ESTUDIO_PREV": {"DEVELOPMENT": 0, "ENGINEERING": 1, "COMMERCIAL": 2, "MARKETING": 3, "BUSINESS": 4 },
                "TRACK_DH": {"PROGRAMACION": 0, "DATA": 1, "EJECUTIVO": 2, "MARKETING": 3}}
df_training_dataset_1 = df_training_dataset.replace(cleanup_nums, inplace=False)

### Encoding categorical variables

In [6]:
# One-hot encoding of the categorical columns with the pandas ``get_dummies`` method (demonstration only, not used below)
# df_one_hot = pd.get_dummies(df_training_dataset, columns=['GENERO','RESIDENCIA','NV_ESTUDIO','ESTUDIO_PREV','TRACK_DH'])
df_training_dataset_1.tail()

Unnamed: 0.1,Unnamed: 0,EDAD,GENERO,RESIDENCIA,NV_ESTUDIO,ESTUDIO_PREV,TRACK_DH,AVG_DH,MINUTES_DH,EXPERIENCIA,DIAS_EMP
8990,9995,29.0,0.0,0.0,2.0,0.0,,4.0,4701.6,9.1,86.0
8991,9996,34.0,,0.0,0.0,1.0,0.0,3.4,4646.2,16.8,95.0
8992,9997,28.0,1.0,0.0,1.0,1.0,2.0,,3315.1,5.6,95.0
8993,9998,23.0,0.0,1.0,2.0,1.0,0.0,3.3,4437.8,0.9,87.0
8994,9999,36.0,0.0,0.0,0.0,2.0,1.0,3.4,4600.8,19.6,88.0
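
The integer mapping above imposes an arbitrary order on categories such as `RESIDENCIA`, which a linear model will treat as a numeric scale. One-hot encoding avoids that. The sketch below shows `pd.get_dummies` on a small synthetic frame (`df_demo` is an illustrative assumption); `dummy_na=True` is one possible way to keep missing values visible as their own indicator column:

```python
import pandas as pd

# Synthetic stand-in for two categorical columns of the dataset (illustrative values).
df_demo = pd.DataFrame({
    "RESIDENCIA": ["ARGENTINA", "MEXICO", None, "BRAZIL"],
    "TRACK_DH": ["DATA", None, "EJECUTIVO", "PROGRAMACION"],
})

# One indicator column per category; dummy_na=True adds an extra indicator
# for missing values instead of encoding them as a row of zeros.
encoded = pd.get_dummies(df_demo, columns=["RESIDENCIA", "TRACK_DH"], dummy_na=True)
print(encoded.columns.tolist())
```

If this encoding is used for training, the scoring data needs exactly the same set of indicator columns in the same order.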


### Selecting the "features" and "target" columns

In [34]:
features = df_training_dataset_1.drop('DIAS_EMP', axis = 1)
target = df_training_dataset_1['DIAS_EMP']  ## DO NOT RENAME THE "target" VARIABLE.

In [35]:
# IterativeImputer is still experimental, so it must be enabled explicitly before the import
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Estimate each missing value from the other feature columns
imp_mean = IterativeImputer(random_state=0)
imp_mean.fit(features)
features = pd.DataFrame(imp_mean.transform(features), columns= features.columns)
features

Unnamed: 0.1,Unnamed: 0,EDAD,GENERO,RESIDENCIA,NV_ESTUDIO,ESTUDIO_PREV,TRACK_DH,AVG_DH,MINUTES_DH,EXPERIENCIA
0,1000.0,37.0,1.000000,2.000000,0.0,1.196804,0.000000,3.100000,4131.5,21.300000
1,1001.0,40.0,1.000000,0.594966,0.0,2.000000,0.000000,3.100000,4160.4,25.200000
2,1002.0,35.0,0.317661,0.000000,0.0,0.000000,1.000000,3.100000,4087.6,18.000000
3,1003.0,33.0,1.000000,2.000000,0.0,1.000000,0.851602,3.100000,4043.2,13.600000
4,1004.0,29.0,0.324470,2.000000,0.0,0.000000,0.859862,3.600000,4688.0,9.130095
...,...,...,...,...,...,...,...,...,...,...
8990,9995.0,29.0,0.000000,0.000000,2.0,0.000000,0.804419,4.000000,4701.6,9.100000
8991,9996.0,34.0,0.308509,0.000000,0.0,1.000000,0.000000,3.400000,4646.2,16.800000
8992,9997.0,28.0,1.000000,0.000000,1.0,1.000000,2.000000,2.721005,3315.1,5.600000
8993,9998.0,23.0,0.000000,1.000000,2.0,1.000000,0.000000,3.300000,4437.8,0.900000
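
The imputed table above contains fractional values such as 0.317661 in `GENERO`, which is an integer-coded category, because `IterativeImputer` treats every column as continuous. A possible alternative, sketched here on synthetic data (`df_demo` and the column split are assumptions for illustration), is to impute coded categories with the most frequent value and numeric columns with the median:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic stand-in: one integer-coded category and one continuous column.
df_demo = pd.DataFrame({
    "GENERO": [0.0, 1.0, np.nan, 0.0, 0.0],
    "EXPERIENCIA": [9.1, 16.8, 5.6, np.nan, 19.6],
})

categorical_cols = ["GENERO"]
numeric_cols = ["EXPERIENCIA"]

imputed = df_demo.copy()
# Most frequent code keeps the category column on valid integer codes.
imputed[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df_demo[categorical_cols])
# Median is a simple, outlier-resistant choice for continuous columns.
imputed[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df_demo[numeric_cols])
print(imputed)
```

Whether this performs better than iterative imputation on the challenge data is not something this sketch establishes; it only keeps the categorical codes valid.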


In [36]:
from sklearn.preprocessing import StandardScaler

col_names = features.columns
scaler = StandardScaler()
scaler.fit(features)
features = pd.DataFrame(scaler.transform(features), columns = col_names)

In [37]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import SGDRegressor

regression_model = SGDRegressor().fit(features, target)

# Recursive feature elimination: drop the weakest feature one at a time (step=1).
# With n_features_to_select left unset, RFE keeps half of the features (5 of 10 here);
# the count printed below is that default, not a tuned optimum.
rfe = RFE(estimator=regression_model, step=1)
rfe.fit(features, target)

print("Optimal number of features : %d" % rfe.n_features_)
# Keep only the columns that RFE marked as selected (ranking 1)
col = features.columns[rfe.support_]
features = features[col]
print(rfe.ranking_)
features.tail()

Optimal number of features : 5
[4 1 1 6 1 2 5 3 1 1]


Unnamed: 0,EDAD,GENERO,NV_ESTUDIO,MINUTES_DH,EXPERIENCIA
8990,-1.003119,-0.734429,1.824234,0.550069,-0.994837
8991,0.005887,-0.012012,-0.764917,0.441931,0.101575
8992,-1.20492,1.607211,0.529658,-2.15632,-1.493207
8993,-2.213926,-0.734429,1.824234,0.035143,-2.162445
8994,0.40949,-0.734429,-0.764917,0.353312,0.50027
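
`RFE` keeps half of the features when `n_features_to_select` is not given, which is why 5 of the 10 columns remain above. If the number of features should instead be chosen by cross-validation, scikit-learn's `RFECV` is one option. The sketch below uses synthetic data and `LinearRegression` (both assumptions made to keep the example deterministic, not the notebook's setup):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only the first two columns drive the synthetic target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# RFECV repeats recursive elimination inside 5-fold cross-validation and keeps
# the feature count with the best mean validation score (R^2 for regressors).
selector = RFECV(estimator=LinearRegression(), step=1, cv=5)
selector.fit(X, y)

print("Selected features:", selector.n_features_)
print("Support mask:", selector.support_)
```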


### Splitting the dataset for a "blind test"

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=133)

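### Evaluating the regression model with sklearn's "score()" method

In [39]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDRegressor

# Fit on the training split; this fitted model is scored on X_test in the next cell
regression_model = SGDRegressor().fit(X_train, y_train)
# cross_val_score clones and refits the estimator on each of the 5 folds of the full data;
# for a regressor the default metric is R^2
scores = cross_val_score(regression_model, features, target, cv=5)
scores.mean()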

0.7675610252911295

In [41]:
regression_model.score(X_test, y_test)

0.7777187715929601
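
For a regressor, `score()` returns the coefficient of determination R^2, so the values above are not accuracies. The sketch below, on synthetic data (all names and values are illustrative assumptions), shows R^2 next to RMSE, which is expressed in the units of the target, and wraps the scaler and the model in one pipeline so that cross-validation refits the scaler on each training fold:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(400, 3))
# Synthetic target: linear in the first two columns plus a little noise.
y = 90.0 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
model = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
cv_scores = cross_val_score(model, X, y, cv=5)  # R^2 per fold

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=133)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

r2 = r2_score(y_test, y_hat)                        # same value as model.score(X_test, y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_hat))   # error in the units of the target
print(cv_scores.mean(), r2, rmse)
```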

In [42]:
# import seaborn as sns
# sns.pairplot(features, diag_kind="kde")

<hr>

## Fill in the data required to submit the solution

### As your submission, we expect the numerical results predicted by your model. Use the file "to_be_scored_digitalhouse.csv" as input.

In [43]:
# !wget --no-check-certificate --content-disposition https://raw.githubusercontent.com/vanderlei-test/654986294958/master/to_be_scored_digitalhouse.csv
# !python -m wget https://raw.githubusercontent.com/vanderlei-test/654986294958/master/to_be_scored_digitalhouse.csv
df_to_be_scored = pd.read_csv(r'to_be_scored_digitalhouse.csv')
df_to_be_scored.tail()

Unnamed: 0.1,Unnamed: 0,EDAD,GENERO,RESIDENCIA,NV_ESTUDIO,ESTUDIO_PREV,TRACK_DH,AVG_DH,MINUTES_DH,EXPERIENCIA
995,995,28.0,,BRAZIL,UNIVERSITARY,DEVELOPMENT,DATA,4.0,4730.5,
996,996,30.0,MASCULINO,ARGENTINA,UNIVERSITARY,ENGINEERING,DATA,,4698.4,10.0
997,997,33.0,,BRAZIL,POST_GRADUATE,DEVELOPMENT,PROGRAMACION,3.9,4644.7,14.4
998,998,26.0,MASCULINO,,UNIVERSITARY,DEVELOPMENT,PROGRAMACION,3.4,4498.1,6.1
999,999,46.0,FEMENINO,ARGENTINA,POST_GRADUATE,ENGINEERING,,3.9,4682.0,


# Attention!

### The ``to_be_scored`` dataframe is your "evaluation sheet". Note that the "target" column does not exist in this sample, so it cannot be used to train supervised learning models.

# Attention!

### You must apply the same preprocessing steps that you applied to the training dataset before scoring the "answer sheet"
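
One way to guarantee that the scoring data goes through exactly the same steps is to put imputation, scaling and the model in a single scikit-learn `Pipeline` that is fitted once on the training data. The sketch below uses synthetic frames (`make_frame`, `train_X`, `score_X` and their values are assumptions for illustration, not the challenge files):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["EDAD", "MINUTES_DH", "EXPERIENCIA"]

def make_frame(n):
    """Build a synthetic frame with the same column names as the selected features."""
    return pd.DataFrame({
        "EDAD": rng.normal(30, 5, n),
        "MINUTES_DH": rng.normal(4500, 300, n),
        "EXPERIENCIA": rng.normal(10, 4, n),
    })

train_X = make_frame(300)
train_y = 90 + 0.2 * train_X["EXPERIENCIA"] - 0.001 * train_X["MINUTES_DH"] + rng.normal(0, 0.3, 300)
# Blank out about 15% of the feature cells to mimic the missing values.
train_X = train_X.mask(rng.random(train_X.shape) < 0.15)

# Imputer, scaler and model are fitted together on the training data only.
pipeline = make_pipeline(IterativeImputer(random_state=0), StandardScaler(), SGDRegressor(random_state=0))
pipeline.fit(train_X, train_y)

score_X = make_frame(50)
score_X = score_X.mask(rng.random(score_X.shape) < 0.15)
# predict() reuses the fitted imputer and scaler; nothing is refitted on the scoring data.
score_X["target"] = pipeline.predict(score_X[cols])
print(score_X.tail())
```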

In [46]:
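# Apply the same category-to-integer mapping used for the training data
df_to_be_scored_1 = df_to_be_scored.replace(cleanup_nums, inplace=False)

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Note: this refits a new imputer on the scoring data and does not apply the StandardScaler
# fitted on the training features, so the inputs are not on the scale the model was trained on.
imp_mean = IterativeImputer(random_state=0)
imp_mean.fit(df_to_be_scored_1)
df_to_be_scored_1 = pd.DataFrame(imp_mean.transform(df_to_be_scored_1), columns= df_to_be_scored_1.columns)

# Keep only the columns selected by RFE on the training data
df_to_be_scored_1 = df_to_be_scored_1[col]
df_to_be_scored_1.tail()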

Unnamed: 0,EDAD,GENERO,NV_ESTUDIO,MINUTES_DH,EXPERIENCIA
995,28.0,0.341753,0.0,4730.5,7.576609
996,30.0,0.0,0.0,4698.4,10.0
997,33.0,0.338506,1.0,4644.7,14.4
998,26.0,0.0,0.0,4498.1,6.1
999,46.0,1.0,1.0,4682.0,32.995542


In [47]:
df_to_be_scored_1.columns

Index(['EDAD', 'GENERO', 'NV_ESTUDIO', 'MINUTES_DH', 'EXPERIENCIA'], dtype='object')

<hr>

### Make the predictions with sklearn's "predict()" method and add the results to the "evaluation sheet" dataframe

In [48]:
y_pred = regression_model.predict(df_to_be_scored_1)
df_to_be_scored_1['target'] = y_pred
df_to_be_scored_1.tail()

Unnamed: 0,EDAD,GENERO,NV_ESTUDIO,MINUTES_DH,EXPERIENCIA,target
995,28.0,0.341753,0.0,4730.5,7.576609,-2138.097915
996,30.0,0.0,0.0,4698.4,10.0,-2119.064395
997,33.0,0.338506,1.0,4644.7,14.4,-2082.716391
998,26.0,0.0,0.0,4498.1,6.1,-2031.801127
999,46.0,1.0,1.0,4682.0,32.995542,-2063.214852
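
The predictions shown above are close to -2100, while the `DIAS_EMP` values visible in the training sample lie between 86 and 95. A plausible cause is that the model was fitted on standardized features and then received unscaled ones. A cheap sanity check before submitting is to compare the predicted range with the training target range. The helper below is a hypothetical sketch (`share_outside_range` and its `margin` parameter are not part of any library) that uses only the sample values printed in this notebook:

```python
import numpy as np

def share_outside_range(predictions, reference, margin=0.5):
    """Fraction of predictions outside the reference range widened by margin * span."""
    reference = np.asarray(reference, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    span = reference.max() - reference.min()
    low = reference.min() - margin * span
    high = reference.max() + margin * span
    return float(np.mean((predictions < low) | (predictions > high)))

# Values copied from the tail() outputs shown in this notebook.
dias_emp_sample = [86.0, 95.0, 95.0, 87.0, 88.0]
predicted_sample = [-2138.097915, -2119.064395, -2082.716391, -2031.801127, -2063.214852]

print(share_outside_range(predicted_sample, dias_emp_sample))
```

A share near 1.0 suggests a preprocessing mismatch between training and scoring rather than ordinary model error.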


# Attention!

### The added column with the results must be named "target"; otherwise, your submission will fail.

<hr>

### Export the results dataframe as a .csv file to your Watson Studio project.

In [None]:
project.save_data(file_name="results.csv", data=df_to_be_scored_1.to_csv(index=False))
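
The `project` object above is only available inside a Watson Studio notebook. When running locally, a comparable export can be sketched with plain pandas. In the example below the frame, its values and the temporary output path are illustrative assumptions; the read-back step simply confirms that the "target" column survives the round trip:

```python
import os
import tempfile

import pandas as pd

# Illustrative results frame; in the notebook this would be df_to_be_scored_1.
results = pd.DataFrame({"EDAD": [28.0, 30.0], "target": [88.0, 91.5]})

# Write without the index so the file contains only the data columns.
out_path = os.path.join(tempfile.mkdtemp(), "results.csv")
results.to_csv(out_path, index=False)

# Read the file back to confirm the column names that will be submitted.
round_trip = pd.read_csv(out_path)
print(round_trip.columns.tolist())
```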