# One-Hot Encoding (Data Preprocessing)

One-hot encoding is a method commonly used in machine learning and data processing to represent categorical variables as binary vectors. It converts categorical variables into a numeric form that machine learning algorithms can take as input: many algorithms cannot work with text labels directly, and replacing the labels with arbitrary integers would suggest an ordering that does not exist.

For example, suppose you have a categorical variable "Fruit" with the categories "Apple", "Banana", and "Orange". Applying one-hot encoding produces three new columns: "Apple", "Banana", and "Orange". Each row in the dataset has a 1 in the column for the fruit it represents and a 0 in the other two columns.

| Fruit  | Apple | Banana | Orange |
|--------|-------|--------|--------|
| Apple  | 1     | 0      | 0      |
| Banana | 0     | 1      | 0      |
| Orange | 0     | 0      | 1      |

The "Apple" column has a 1 where the fruit is "Apple" and 0 in the other rows.

The "Banana" column has a 1 where the fruit is "Banana" and 0 in the other rows.

The "Orange" column has a 1 where the fruit is "Orange" and 0 in the other rows.

Each column represents one specific category and records whether that category is present in each row. The columns do not impose any order among the categories; each one simply reflects the presence or absence of a particular category for each observation in the dataset. Exactly one column is 1 in any given row.
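The fruit example above can be reproduced with a minimal sketch in pandas. The DataFrame `fruits` and its English category names are illustrative assumptions, not part of the dataset used later; `dtype=int` is passed so the result shows 0/1 as in the table rather than booleans.

```python
import pandas as pd

# Toy data mirroring the fruit table (names are illustrative assumptions)
fruits = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange"]})

# dtype=int requests 0/1 columns instead of booleans
encoded = pd.get_dummies(fruits["Fruit"], dtype=int)

# Exactly one column is set per row, so every row sums to 1
row_sums = encoded.sum(axis=1)

# drop_first=True omits the first category; a row of all zeros then stands for it
reduced = pd.get_dummies(fruits["Fruit"], dtype=int, drop_first=True)

print(encoded)
print(reduced)
```

Because each row sums to 1, any one column can be recovered from the others. That is why `drop_first=True` is sometimes used with linear models, though whether it helps depends on the model.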

In [2]:
import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences
route = r"\Users\Cristian\PythonLogic\Media\HRDataset_v14.csv"

human = pd.read_csv(route)

human.head(10)  # Preview the first 10 rows

Unnamed: 0,Employee_Name,EmpID,MarriedID,MaritalStatusID,GenderID,EmpStatusID,DeptID,PerfScoreID,FromDiversityJobFairID,Salary,...,ManagerName,ManagerID,RecruitmentSource,PerformanceScore,EngagementSurvey,EmpSatisfaction,SpecialProjectsCount,LastPerformanceReview_Date,DaysLateLast30,Absences
0,"Adinolfi, Wilson K",10026,0,0,1,1,5,4,0,62506,...,Michael Albert,22.0,LinkedIn,Exceeds,4.6,5,0,1/17/2019,0,1
1,"Ait Sidi, Karthikeyan",10084,1,1,1,5,3,3,0,104437,...,Simon Roup,4.0,Indeed,Fully Meets,4.96,3,6,2/24/2016,0,17
2,"Akinkuolie, Sarah",10196,1,1,0,5,5,3,0,64955,...,Kissy Sullivan,20.0,LinkedIn,Fully Meets,3.02,3,0,5/15/2012,0,3
3,"Alagbe,Trina",10088,1,1,0,1,5,3,0,64991,...,Elijiah Gray,16.0,Indeed,Fully Meets,4.84,5,0,1/3/2019,0,15
4,"Anderson, Carol",10069,0,2,0,5,5,3,0,50825,...,Webster Butler,39.0,Google Search,Fully Meets,5.0,4,0,2/1/2016,0,2
5,"Anderson, Linda",10002,0,0,0,1,5,4,0,57568,...,Amy Dunn,11.0,LinkedIn,Exceeds,5.0,5,0,1/7/2019,0,15
6,"Andreola, Colby",10194,0,0,0,1,4,3,0,95660,...,Alex Sweetwater,10.0,LinkedIn,Fully Meets,3.04,3,4,1/2/2019,0,19
7,"Athwal, Sam",10062,0,4,1,1,5,3,0,59365,...,Ketsia Liebig,19.0,Employee Referral,Fully Meets,5.0,4,0,2/25/2019,0,19
8,"Bachiochi, Linda",10114,0,0,0,3,5,3,1,47837,...,Brannon Miller,12.0,Diversity Job Fair,Fully Meets,4.46,3,0,1/25/2019,0,4
9,"Bacong, Alejandro",10250,0,2,1,1,3,3,0,50178,...,Peter Monroe,7.0,Indeed,Fully Meets,5.0,5,6,2/18/2019,0,16


In [3]:
human.columns.values

array(['Employee_Name', 'EmpID', 'MarriedID', 'MaritalStatusID',
       'GenderID', 'EmpStatusID', 'DeptID', 'PerfScoreID',
       'FromDiversityJobFairID', 'Salary', 'Termd', 'PositionID',
       'Position', 'State', 'Zip', 'DOB', 'Sex', 'MaritalDesc',
       'CitizenDesc', 'HispanicLatino', 'RaceDesc', 'DateofHire',
       'DateofTermination', 'TermReason', 'EmploymentStatus',
       'Department', 'ManagerName', 'ManagerID', 'RecruitmentSource',
       'PerformanceScore', 'EngagementSurvey', 'EmpSatisfaction',
       'SpecialProjectsCount', 'LastPerformanceReview_Date',
       'DaysLateLast30', 'Absences'], dtype=object)

In [4]:
human_table = human[["Employee_Name", "Position", "Salary", "State", "Sex", "DOB",
                     "MaritalDesc", "CitizenDesc", "Department", "PerformanceScore",
                     "EngagementSurvey", "EmpSatisfaction"]]

human_table.head(10)


Unnamed: 0,Employee_Name,Position,Salary,State,Sex,DOB,MaritalDesc,CitizenDesc,Department,PerformanceScore,EngagementSurvey,EmpSatisfaction
0,"Adinolfi, Wilson K",Production Technician I,62506,MA,M,07/10/83,Single,US Citizen,Production,Exceeds,4.6,5
1,"Ait Sidi, Karthikeyan",Sr. DBA,104437,MA,M,05/05/75,Married,US Citizen,IT/IS,Fully Meets,4.96,3
2,"Akinkuolie, Sarah",Production Technician II,64955,MA,F,09/19/88,Married,US Citizen,Production,Fully Meets,3.02,3
3,"Alagbe,Trina",Production Technician I,64991,MA,F,09/27/88,Married,US Citizen,Production,Fully Meets,4.84,5
4,"Anderson, Carol",Production Technician I,50825,MA,F,09/08/89,Divorced,US Citizen,Production,Fully Meets,5.0,4
5,"Anderson, Linda",Production Technician I,57568,MA,F,05/22/77,Single,US Citizen,Production,Exceeds,5.0,5
6,"Andreola, Colby",Software Engineer,95660,MA,F,05/24/79,Single,US Citizen,Software Engineering,Fully Meets,3.04,3
7,"Athwal, Sam",Production Technician I,59365,MA,M,02/18/83,Widowed,US Citizen,Production,Fully Meets,5.0,4
8,"Bachiochi, Linda",Production Technician I,47837,MA,F,02/11/70,Single,US Citizen,Production,Fully Meets,4.46,3
9,"Bacong, Alejandro",IT Support,50178,MA,M,01/07/88,Divorced,US Citizen,IT/IS,Fully Meets,5.0,5


In [5]:
human_table.describe()

Unnamed: 0,Salary,EngagementSurvey,EmpSatisfaction
count,311.0,311.0,311.0
mean,69020.684887,4.11,3.890675
std,25156.63693,0.789938,0.909241
min,45046.0,1.12,1.0
25%,55501.5,3.69,3.0
50%,62810.0,4.28,4.0
75%,72036.0,4.7,5.0
max,250000.0,5.0,5.0


In [6]:
dummy_variable_sex = pd.get_dummies(human_table["Sex"], prefix="sex")  # First, build the dummy columns for "Sex"; prefix sets how the new columns are named

dummy_variable_sex.head(10)

Unnamed: 0,sex_F,sex_M
0,False,True
1,False,True
2,True,False
3,True,False
4,True,False
5,True,False
6,True,False
7,False,True
8,True,False
9,False,True
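The output above holds `True`/`False` values, whereas the introductory table uses 1 and 0. A minimal sketch of one way to get integer columns is to pass `dtype=int` to `pd.get_dummies`. The `sample` DataFrame below is a small stand-in with made-up rows, not the real `human_table`.

```python
import pandas as pd

# Small stand-in for human_table["Sex"]; the rows are illustrative, not the real data
sample = pd.DataFrame({"Sex": ["M", "M", "F", "F"]})

# dtype=int produces integer 0/1 columns rather than booleans
dummy_sex_int = pd.get_dummies(sample["Sex"], prefix="sex", dtype=int)

print(dummy_sex_int)
```

An existing boolean dummy table can also be converted afterwards with `.astype(int)`.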


In [7]:
human_table = human_table.drop(["Sex"], axis=1)  # Remove from the original table the column that is being encoded

human_table

Unnamed: 0,Employee_Name,Position,Salary,State,DOB,MaritalDesc,CitizenDesc,Department,PerformanceScore,EngagementSurvey,EmpSatisfaction
0,"Adinolfi, Wilson K",Production Technician I,62506,MA,07/10/83,Single,US Citizen,Production,Exceeds,4.60,5
1,"Ait Sidi, Karthikeyan",Sr. DBA,104437,MA,05/05/75,Married,US Citizen,IT/IS,Fully Meets,4.96,3
2,"Akinkuolie, Sarah",Production Technician II,64955,MA,09/19/88,Married,US Citizen,Production,Fully Meets,3.02,3
3,"Alagbe,Trina",Production Technician I,64991,MA,09/27/88,Married,US Citizen,Production,Fully Meets,4.84,5
4,"Anderson, Carol",Production Technician I,50825,MA,09/08/89,Divorced,US Citizen,Production,Fully Meets,5.00,4
...,...,...,...,...,...,...,...,...,...,...,...
306,"Woodson, Jason",Production Technician II,65893,MA,05/11/85,Single,US Citizen,Production,Fully Meets,4.07,4
307,"Ybarra, Catherine",Production Technician I,48513,MA,05/04/82,Single,US Citizen,Production,PIP,3.20,2
308,"Zamora, Jennifer",CIO,220450,MA,08/30/79,Single,US Citizen,IT/IS,Exceeds,4.60,5
309,"Zhou, Julia",Data Analyst,89292,MA,02/24/79,Single,US Citizen,IT/IS,Fully Meets,5.00,3


In [8]:
human_table2 = pd.concat([human_table, dummy_variable_sex], axis=1)  # Concatenate the original table and the dummy table into a new one; axis=1 joins along columns

human_table2

Unnamed: 0,Employee_Name,Position,Salary,State,DOB,MaritalDesc,CitizenDesc,Department,PerformanceScore,EngagementSurvey,EmpSatisfaction,sex_F,sex_M
0,"Adinolfi, Wilson K",Production Technician I,62506,MA,07/10/83,Single,US Citizen,Production,Exceeds,4.60,5,False,True
1,"Ait Sidi, Karthikeyan",Sr. DBA,104437,MA,05/05/75,Married,US Citizen,IT/IS,Fully Meets,4.96,3,False,True
2,"Akinkuolie, Sarah",Production Technician II,64955,MA,09/19/88,Married,US Citizen,Production,Fully Meets,3.02,3,True,False
3,"Alagbe,Trina",Production Technician I,64991,MA,09/27/88,Married,US Citizen,Production,Fully Meets,4.84,5,True,False
4,"Anderson, Carol",Production Technician I,50825,MA,09/08/89,Divorced,US Citizen,Production,Fully Meets,5.00,4,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
306,"Woodson, Jason",Production Technician II,65893,MA,05/11/85,Single,US Citizen,Production,Fully Meets,4.07,4,False,True
307,"Ybarra, Catherine",Production Technician I,48513,MA,05/04/82,Single,US Citizen,Production,PIP,3.20,2,True,False
308,"Zamora, Jennifer",CIO,220450,MA,08/30/79,Single,US Citizen,IT/IS,Exceeds,4.60,5,True,False
309,"Zhou, Julia",Data Analyst,89292,MA,02/24/79,Single,US Citizen,IT/IS,Fully Meets,5.00,3,True,False
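The three steps above (build the dummies, drop the original column, concatenate) can also be done in a single call by passing the `columns` argument to `pd.get_dummies`. The sketch below assumes a small stand-in DataFrame, `sample`, with made-up employee names; it is not the notebook's real data.

```python
import pandas as pd

# Stand-in table; employee names are placeholders
sample = pd.DataFrame({
    "Employee_Name": ["A", "B", "C"],
    "Salary": [62506, 104437, 64955],
    "Sex": ["M", "M", "F"],
})

# Only the columns listed in `columns` are encoded; each is removed and
# replaced by its dummy columns, which are appended at the end
encoded = pd.get_dummies(sample, columns=["Sex"], prefix="sex")

print(encoded.columns.tolist())
```

The result should be equivalent to the manual approach, and the input DataFrame is left unchanged.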


In [16]:
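dummy_Country = pd.get_dummies(human_table["CitizenDesc"], prefix="citizen")  # Build the dummies from the original table, which still has the column

# Drop the original "CitizenDesc" column from the table being extended (errors="ignore" lets the cell be re-run)
human_table2 = human_table2.drop("CitizenDesc", axis=1, errors="ignore")

# Concatenate the table that already holds the sex dummies with the new dummy variables
human_Table3 = pd.concat([human_table2, dummy_Country], axis=1)

human_Table3.head(20)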

# The error I ran into: I was building the dummies from human_table2, which had already had that column removed, so there was nothing to look up.
# The dummies have to be built from the original table.


Task: write a function that automates this.



Unnamed: 0,Employee_Name,Position,Salary,State,DOB,MaritalDesc,Department,PerformanceScore,EngagementSurvey,EmpSatisfaction,sex_F,sex_M,citizen_Eligible NonCitizen,citizen_Non-Citizen,citizen_US Citizen
0,"Adinolfi, Wilson K",Production Technician I,62506,MA,07/10/83,Single,Production,Exceeds,4.6,5,False,True,False,False,True
1,"Ait Sidi, Karthikeyan",Sr. DBA,104437,MA,05/05/75,Married,IT/IS,Fully Meets,4.96,3,False,True,False,False,True
2,"Akinkuolie, Sarah",Production Technician II,64955,MA,09/19/88,Married,Production,Fully Meets,3.02,3,True,False,False,False,True
3,"Alagbe,Trina",Production Technician I,64991,MA,09/27/88,Married,Production,Fully Meets,4.84,5,True,False,False,False,True
4,"Anderson, Carol",Production Technician I,50825,MA,09/08/89,Divorced,Production,Fully Meets,5.0,4,True,False,False,False,True
5,"Anderson, Linda",Production Technician I,57568,MA,05/22/77,Single,Production,Exceeds,5.0,5,True,False,False,False,True
6,"Andreola, Colby",Software Engineer,95660,MA,05/24/79,Single,Software Engineering,Fully Meets,3.04,3,True,False,False,False,True
7,"Athwal, Sam",Production Technician I,59365,MA,02/18/83,Widowed,Production,Fully Meets,5.0,4,False,True,False,False,True
8,"Bachiochi, Linda",Production Technician I,47837,MA,02/11/70,Single,Production,Fully Meets,4.46,3,True,False,False,False,True
9,"Bacong, Alejandro",IT Support,50178,MA,01/07/88,Divorced,IT/IS,Fully Meets,5.0,5,False,True,False,False,True
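The task noted above, a function that automates these steps, could be sketched as follows. The function name `one_hot_encode`, its parameters, and the `sample` rows are all hypothetical; this is one possible approach, not the notebook author's solution.

```python
import pandas as pd

def one_hot_encode(df, column, prefix=None):
    """Return a new DataFrame where `column` is replaced by its dummy columns."""
    # Build the dummies from the table passed in, which still has the column
    dummies = pd.get_dummies(df[column], prefix=prefix or column)
    # drop() returns a copy, so the caller's DataFrame is not modified
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Stand-in rows for illustration only
sample = pd.DataFrame({
    "Salary": [62506, 104437, 64955],
    "Sex": ["M", "M", "F"],
    "CitizenDesc": ["US Citizen", "US Citizen", "Non-Citizen"],
})

result = one_hot_encode(sample, "Sex", prefix="sex")
result = one_hot_encode(result, "CitizenDesc", prefix="citizen")

print(result.columns.tolist())
```

Because the function never reassigns its input, it avoids the situation described in the note above, where a column had already been removed from the table the dummies were being read from.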


In [15]:
human_table2.columns

Index(['Employee_Name', 'Position', 'Salary', 'State', 'DOB', 'MaritalDesc',
       'Department', 'PerformanceScore', 'EngagementSurvey', 'EmpSatisfaction',
       'sex_F', 'sex_M '],
      dtype='object')