## Download the dataset

You will work with the [Census Income dataset](http://archive.ics.uci.edu/ml/datasets/Census+Income), which can be used to predict whether a person earns more or less than 50,000 US dollars per year. Below is a summary of the attribute names with their descriptions and expected values; you can read more about them [in this data description file](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names).

* **age**: continuous.
* **workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* **fnlwgt**: continuous.
* **education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* **education-num**: continuous.
* **marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* **occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* **relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* **race**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* **sex**: Female, Male.
* **capital-gain**: continuous.
* **capital-loss**: continuous.
* **hours-per-week**: continuous.
* **native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [18]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import os

In [19]:
# URL of the data file
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Name of the local file where the dataset will be saved
filename = "census_income_data.csv"

# Download the file
response = requests.get(url)
response.raise_for_status()  # Raises an error if the download failed

# Save the file as CSV
with open(filename, 'wb') as file:
    file.write(response.content)

# Load the dataset into a pandas DataFrame (the file has no header row)
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
                "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
                "hours-per-week", "country", "income"]

df = pd.read_csv(filename, header=None, names=column_names)

# Show the last rows of the DataFrame
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,income
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [23]:
for f in df["education"].unique():
    print(f'{f.upper()} = "{f}"')

 BACHELORS = " Bachelors"
 HS-GRAD = " HS-grad"
 11TH = " 11th"
 MASTERS = " Masters"
 9TH = " 9th"
 SOME-COLLEGE = " Some-college"
 ASSOC-ACDM = " Assoc-acdm"
 ASSOC-VOC = " Assoc-voc"
 7TH-8TH = " 7th-8th"
 DOCTORATE = " Doctorate"
 PROF-SCHOOL = " Prof-school"
 5TH-6TH = " 5th-6th"
 10TH = " 10th"
 1ST-4TH = " 1st-4th"
 PRESCHOOL = " Preschool"
 12TH = " 12th"
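
The output above shows that every string value carries a leading space, because the raw file separates fields with a comma followed by a space. A minimal sketch of one way to strip that space (and to read `?` as missing) at load time, using two made-up rows in the file's layout rather than the downloaded file. The rest of this notebook matches on the space-prefixed literals (for example `' <=50K'`), so this is an alternative loading path, not a drop-in change.

```python
import io

import pandas as pd

column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
                "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
                "hours-per-week", "country", "income"]

# Two illustrative rows in the "comma + space" layout; the second has missing fields ("?")
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 148657, HS-grad, 9, Divorced, ?, "
    "Unmarried, White, Female, 0, 0, 20, United-States, <=50K\n"
)

clean = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=column_names,
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat "?" as a missing value
)

print(clean["education"].tolist())
print(clean["workclass"].isna().tolist())
```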


In [20]:
df.tail(100).to_csv("test_data.csv")

THERE IS NO TRUE MODEL!

### Preprocess

In [3]:
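# Encode the target: 1 = income <=50K, 0 = income >50K
# (so class 1 in the reports below is the lower income bracket)
salary_map = {' <=50K': 1, ' >50K': 0}
df['income'] = df['income'].map(salary_map).astype(int)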

In [4]:
df['sex'] = df['sex'].map({' Male': 1,' Female': 0}).astype(int)
df['country'] = df['country'].replace(' ?', np.nan)
df['workclass'] = df['workclass'].replace(' ?', np.nan)
df['occupation'] = df['occupation'].replace(' ?', np.nan)

df.dropna(how='any',inplace=True)

In [5]:
df.loc[df['country'] != ' United-States', 'country'] = 'Non-US'
df.loc[df['country'] == ' United-States', 'country'] = 'US'

df['country'] = df['country'].map({'US':1,'Non-US':0}).astype(int)

In [6]:
df['marital-status'] = df['marital-status'].replace([' Divorced',' Married-spouse-absent',' Never-married',' Separated',' Widowed'],'Single')
df['marital-status'] = df['marital-status'].replace([' Married-AF-spouse',' Married-civ-spouse'],'Couple')
df['marital-status'] = df['marital-status'].map({'Couple':0,'Single':1})

In [7]:
rel_map = {' Unmarried':0,' Wife':1,' Husband':2,' Not-in-family':3,' Own-child':4,' Other-relative':5}

df['relationship'] = df['relationship'].map(rel_map)

In [8]:
race_map={' White':0,' Amer-Indian-Eskimo':1,' Asian-Pac-Islander':2,' Black':3,' Other':4}

df['race']= df['race'].map(race_map)
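
`Series.map` with a dictionary silently turns any value that has no entry into `NaN`, which is easy to trigger here because every key needs its leading space. The following is a sketch of a stricter helper; `map_strict` is a hypothetical name, not part of pandas or of this notebook.

```python
import pandas as pd


def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map values through `mapping`, raising if any non-missing value has no entry."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped values: {sorted(unknown)}")
    return series.map(mapping)


race_map = {' White': 0, ' Amer-Indian-Eskimo': 1, ' Asian-Pac-Islander': 2, ' Black': 3, ' Other': 4}

# Values with the leading space map cleanly
encoded = map_strict(pd.Series([' White', ' Black', ' Other']), race_map)

# A value without the leading space is rejected instead of becoming NaN
try:
    map_strict(pd.Series(['White']), race_map)
    raised = False
except ValueError:
    raised = True

print(encoded.tolist(), raised)
```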

In [9]:
def f(x):
    if x['workclass'] == ' Federal-gov' or x['workclass']== ' Local-gov' or x['workclass']==' State-gov': return 'govt'
    elif x['workclass'] == ' Private':return 'private'
    elif x['workclass'] == ' Self-emp-inc' or x['workclass'] == ' Self-emp-not-inc': return 'self_employed'
    else: return 'without_pay'
    
df['employment_type']=df.apply(f, axis=1)

In [10]:
employment_map = {'govt':0,'private':1,'self_employed':2,'without_pay':3}

df['employment_type'] = df['employment_type'].map(employment_map)
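
The row-wise `apply` above can also be written as a single dictionary lookup on the `workclass` column. This is a sketch on a small stand-in Series; it assumes, as in the notebook, that rows with a missing `workclass` were already dropped, so `fillna` only catches the remaining categories (such as `' Without-pay'`).

```python
import pandas as pd

# Same grouping as the function f above, expressed as a lookup table
workclass_to_type = {
    ' Federal-gov': 'govt',
    ' Local-gov': 'govt',
    ' State-gov': 'govt',
    ' Private': 'private',
    ' Self-emp-inc': 'self_employed',
    ' Self-emp-not-inc': 'self_employed',
}

# Stand-in for df['workclass']
workclass = pd.Series([' State-gov', ' Private', ' Self-emp-inc', ' Without-pay'])

# Anything not listed in the table falls back to 'without_pay', like the final else branch
employment_type = workclass.map(workclass_to_type).fillna('without_pay')
print(employment_type.tolist())
```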

In [11]:
df.drop(labels=['workclass','education','occupation'], axis=1, inplace=True)

In [12]:
# Binarize capital gain: 1 if any gain was reported, 0 otherwise
df.loc[df['capital-gain'] > 0, 'capital-gain'] = 1
df.loc[df['capital-gain'] == 0, 'capital-gain'] = 0

In [13]:
# Binarize capital loss: 1 if any loss was reported, 0 otherwise
df.loc[df['capital-loss'] > 0, 'capital-loss'] = 1
df.loc[df['capital-loss'] == 0, 'capital-loss'] = 0
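
Both binarization steps amount to a "greater than zero" flag, which can be written as a boolean comparison cast to integers. A small sketch on a stand-in DataFrame with illustrative values:

```python
import pandas as pd

# Stand-in for the two capital columns of df
sample = pd.DataFrame({
    'capital-gain': [2174, 0, 15024],
    'capital-loss': [0, 1902, 0],
})

for column in ['capital-gain', 'capital-loss']:
    # True/False for "any amount reported", stored as 1/0
    sample[column] = (sample[column] > 0).astype(int)

print(sample)
```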

In [14]:
df.head()

Unnamed: 0,age,fnlwgt,education-num,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,income,employment_type
0,39,77516,13,1,3,0,1,1,0,40,1,1,0
1,50,83311,13,0,2,0,1,0,0,13,1,1,2
2,38,215646,9,1,3,0,1,0,0,40,1,1,1
3,53,234721,7,0,2,3,1,0,0,40,1,1,1
4,28,338409,13,0,1,3,0,0,0,40,0,1,1


In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X = df.drop(['income'],axis=1)
y = df['income']

split_size=0.3

#Creation of Train and Test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=split_size,random_state=22)

#Creation of Train and validation dataset
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.2,random_state=5)
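
The two calls above first hold out 30% of the rows for testing and then 20% of the remainder for validation, which works out to roughly 56% / 14% / 30%. The arithmetic can be sketched as follows, assuming the held-out share is rounded up at each step and assuming 30162 rows after preprocessing (an assumed count, chosen to be consistent with the 9049 test rows in the classification report further down).

```python
from math import ceil


def split_sizes(n_samples, test_size, val_size):
    """Return (train, validation, test) row counts for the two-step split."""
    n_test = ceil(test_size * n_samples)   # first split: hold out the test set
    n_remaining = n_samples - n_test
    n_val = ceil(val_size * n_remaining)   # second split: hold out the validation set
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_samples = 30162  # assumed row count after dropna
n_train, n_val, n_test = split_sizes(n_samples, test_size=0.3, val_size=0.2)

for label, count in [('train', n_train), ('validation', n_val), ('test', n_test)]:
    print(f'{label}: {count} rows ({count / n_samples:.1%})')
```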

In [33]:
X_test

Unnamed: 0,age,fnlwgt,education-num,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country,employment_type
15659,26,27834,13,1,3,0,1,0,0,40,1,0
29204,27,55390,10,1,4,0,1,0,0,45,1,1
9176,20,188612,10,1,4,0,0,0,0,38,0,1
13853,24,200295,10,1,3,0,1,0,0,40,1,1
161,25,252752,9,1,0,0,0,0,0,40,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
21238,29,140830,9,1,4,0,0,0,0,40,1,1
29843,36,214378,9,1,4,0,0,0,0,40,1,1
1515,44,123983,13,1,3,2,1,0,0,40,0,1
22497,28,272913,6,0,2,0,1,0,0,30,0,1


In [32]:
X_test.to_csv("test_data.csv")

#### Model

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [18]:
models = []
names = ['LR', 'Random Forest', 'Neural Network', 'GaussianNB', 'DecisionTreeClassifier', 'SVM']

models.append(LogisticRegression(max_iter=100000))
models.append(RandomForestClassifier(n_estimators=100))
models.append(MLPClassifier())
models.append(GaussianNB())
models.append(DecisionTreeClassifier())
models.append(SVC())

For the model, we can try each of these candidates and see which one performs best.

In [19]:
from sklearn import model_selection
from sklearn.metrics import accuracy_score

In [20]:
kfold = model_selection.KFold(n_splits=5)

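for name, model in zip(names, models):
    # 5-fold cross-validated accuracy on the training split (computed but not reported below)
    cv_result = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    # Fit on the full training split and score on the held-out validation split
    model.fit(X_train, y_train)
    prediction = model.predict(X_val)
    acc_score = accuracy_score(y_val, prediction)
    print('-'*40)
    print(f'{name}: {acc_score}')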

----------------------------------------
LR: 0.832346672981293
----------------------------------------
Random Forest: 0.8193227563343595
----------------------------------------
Neural Network: 0.7492304049254085
----------------------------------------
GaussianNB: 0.7712526639829506
----------------------------------------
DecisionTreeClassifier: 0.7778830215486621
----------------------------------------
SVM: 0.7492304049254085
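
The neural network and the SVM report exactly the same accuracy. One possible explanation, which is an assumption and was not checked in this notebook, is that both end up predicting a single class because unscaled columns such as `fnlwgt` are orders of magnitude larger than the rest. A sketch of putting a `StandardScaler` in front of the classifier, shown on synthetic stand-in data rather than the census features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic data: one informative small-scale feature and one large-scale noise feature
signal = rng.normal(size=400)
noise = rng.normal(loc=200_000, scale=100_000, size=400)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The scaler is fitted on the training data only and reused at prediction time
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)
scaled_accuracy = scaled_svm.score(X_va, y_va)
print(f'Scaled SVM accuracy on the synthetic data: {scaled_accuracy:.3f}')
```

The same pipeline object could take the place of `SVC()` or `MLPClassifier()` in the `models` list; whether it changes the ranking on the census data would have to be measured.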


Alright, logistic regression has the highest validation accuracy (about 0.83), so we will go with it.

In [21]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [26]:
lr = LogisticRegression(max_iter=100000)
lr.fit(X_train,y_train)
prediction = lr.predict(X_test)

In [27]:
print('-'*40)
print('Accuracy score:')
print(accuracy_score(y_test, prediction))
print('-'*40)
print('Confusion Matrix:')
print(confusion_matrix(y_test, prediction))
print('-'*40)
print('Classification Report:')
print(classification_report(y_test, prediction))

----------------------------------------
Accuracy score:
0.8299259586694663
----------------------------------------
Confusion Matrix:
[[1237 1006]
 [ 533 6273]]
----------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.55      0.62      2243
           1       0.86      0.92      0.89      6806

    accuracy                           0.83      9049
   macro avg       0.78      0.74      0.75      9049
weighted avg       0.82      0.83      0.82      9049
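
The figures in the report follow directly from the confusion matrix, where rows are the true classes and columns the predicted classes. Recall that the target was encoded with 0 for `>50K` and 1 for `<=50K`, so the weaker recall on class 0 concerns the higher income bracket. A short sketch of the arithmetic:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[1237, 1006],
      [533, 6273]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: correct predictions of a class / all predictions of that class (column sum)
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])
precision_1 = cm[1][1] / (cm[1][1] + cm[0][1])

# Recall: correct predictions of a class / all true members of that class (row sum)
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])
recall_1 = cm[1][1] / (cm[1][1] + cm[1][0])

print(f'accuracy: {accuracy:.4f}')
print(f'class 0: precision {precision_0:.2f}, recall {recall_0:.2f}')
print(f'class 1: precision {precision_1:.2f}, recall {recall_1:.2f}')
```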



#### Let's see it in the API

In [28]:
import pickle

In [29]:
with open('model.pkl', 'wb') as model_file:
    pickle.dump(lr, model_file)
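
On the API side, the saved file would be read back with `pickle.load` and used for prediction. The following sketch shows that round trip with a toy model and a temporary directory, so it does not touch the real `model.pkl`; the toy data and variable names are illustrative only. Two caveats apply: a pickle should only be loaded from a trusted source, and the features passed to the restored model need the same columns, in the same order, as `X_train`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the trained logistic regression
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')

    # Save the fitted model, as in the cell above
    with open(path, 'wb') as model_file:
        pickle.dump(toy_model, model_file)

    # Load it back, as an API process would do at startup
    with open(path, 'rb') as model_file:
        restored = pickle.load(model_file)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same_predictions)
```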