# Company analysis

This project starts from a dataset taken from Kaggle **(datasets/davidgauthier/glassdoor-job-reviews)**. The original idea, however, was to scrape Glassdoor or Indeed to extract this kind of information.

Given the terms and conditions of those sites, I chose to work with a Kaggle dataset that offers the same information without running into legal problems.

**Note:** There is no confirmation that the reviews were written by current or former employees of the company.

In [None]:
%pip install matplotlib

In [2]:
# Load libraries

import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt

In [3]:
# Load the CSV into df with pandas
df = pd.read_csv("glassdoor_reviews.csv")
df

Unnamed: 0,firm,date_review,job_title,current,location,overall_rating,work_life_balance,culture_values,diversity_inclusion,career_opp,comp_benefits,senior_mgmt,recommend,ceo_approv,outlook,headline,pros,cons
0,AFH-Wealth-Management,2015-04-05,,Current Employee,,2,4.0,3.0,,2.0,3.0,3.0,x,o,r,"Young colleagues, poor micro management",Very friendly and welcoming to new staff. Easy...,"Poor salaries, poor training and communication."
1,AFH-Wealth-Management,2015-12-11,Office Administrator,"Current Employee, more than 1 year","Bromsgrove, England, England",2,3.0,1.0,,2.0,1.0,4.0,x,o,r,"Excellent staff, poor salary","Friendly, helpful and hard-working colleagues",Poor salary which doesn't improve much with pr...
2,AFH-Wealth-Management,2016-01-28,Office Administrator,"Current Employee, less than 1 year","Bromsgrove, England, England",1,1.0,1.0,,1.0,1.0,1.0,x,o,x,"Low salary, bad micromanagement",Easy to get the job even without experience in...,"Very low salary, poor working conditions, very..."
3,AFH-Wealth-Management,2016-04-16,,Current Employee,,5,2.0,3.0,,2.0,2.0,3.0,x,o,r,Over promised under delivered,Nice staff to work with,No career progression and salary is poor
4,AFH-Wealth-Management,2016-04-23,Office Administrator,"Current Employee, more than 1 year","Bromsgrove, England, England",1,2.0,1.0,,2.0,1.0,1.0,x,o,x,client reporting admin,"Easy to get the job, Nice colleagues.","Abysmal pay, around minimum wage. No actual tr..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838561,the-LEGO-Group,2021-06-02,Marketing Manager,"Current Employee, more than 5 years","München, Bavaria, Bavaria",5,4.0,5.0,4.0,4.0,4.0,4.0,v,v,v,Just an awesome company to work for!!!,"Great company values, awesome product, smart c...",Not very easy to transfer to other locations
838562,the-LEGO-Group,2021-06-03,Sales Associate,"Current Employee, less than 1 year","London, England, England",3,,,,,,,o,o,o,working at lego,staff discount is really nice,micro managing is a hassle\r\ncan become menta...
838563,the-LEGO-Group,2021-06-03,Strategist,Current Employee,,4,5.0,5.0,5.0,3.0,5.0,3.0,v,o,o,not interested in growing their people,loved brand for a lot of people,you can spend 6-10 years without any promotion...
838564,the-LEGO-Group,2021-06-04,Customer Service Representative,"Current Employee, less than 1 year",,5,,,,,,,o,o,o,Great Place to Work,"Good wages, good hours, lots of resources","Working every other weekend, busy seasons can ..."


### Understanding the variables:

The variables `recommend`, `ceo_approv` and `outlook` take one of the values `v/r/x/o`, meaning: `v` = positive, `r` = moderate, `x` = negative and `o` = no opinion.

The variable `date_review` has dtype `object`, so we can convert it to a datetime type (`datetime64[ns]` in pandas). This will be useful both now and in later reporting stages.
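
A minimal sketch of that conversion, run on a toy frame rather than the real `df`. The `sample` frame and the `review_year` column are illustrative assumptions. Only the `date_review` column name and its `YYYY-MM-DD` format come from the preview above.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical data, same date format as the preview).
sample = pd.DataFrame({"date_review": ["2015-04-05", "2015-12-11", "2016-01-28"]})

# Parse the ISO-formatted strings into datetime64 values.
sample["date_review"] = pd.to_datetime(sample["date_review"], format="%Y-%m-%d")

# Datetime columns expose the .dt accessor, e.g. to group reviews by year later on.
sample["review_year"] = sample["date_review"].dt.year
print(sample.dtypes)
```

On the full dataset the same call would be applied to `df["date_review"]`. Passing `errors="coerce"` is an option if some rows turn out not to follow this format.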

One might expect `current` to be of type `bool`; since it contains additional text, for now we keep working with it as `object`.
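
If a boolean flag is wanted later, one possible approach is to split the text at the first comma. The sketch below assumes the values follow the two patterns visible in the preview ("Current Employee" or "Former Employee", optionally followed by a tenure). The full dataset may contain other values that need separate handling. The column names `is_current` and `tenure` are hypothetical.

```python
import pandas as pd

# Toy values copied from the preview; the real column may contain other variants.
sample = pd.DataFrame({
    "current": [
        "Current Employee",
        "Current Employee, more than 1 year",
        "Former Employee, more than 3 years",
        "Former Employee",
    ]
})

# Split once at the first comma: status on the left, optional tenure on the right.
parts = sample["current"].str.split(",", n=1, expand=True)

# Hypothetical helper columns: a boolean status flag and the tenure text (NaN when absent).
sample["is_current"] = parts[0].str.strip().eq("Current Employee")
sample["tenure"] = parts[1].str.strip()
print(sample)
```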

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 838566 entries, 0 to 838565
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   firm                 838566 non-null  object 
 1   date_review          838566 non-null  object 
 2   job_title            838566 non-null  object 
 3   current              838566 non-null  object 
 4   location             541223 non-null  object 
 5   overall_rating       838566 non-null  int64  
 6   work_life_balance    688672 non-null  float64
 7   culture_values       647193 non-null  float64
 8   diversity_inclusion  136066 non-null  float64
 9   career_opp           691065 non-null  float64
 10  comp_benefits        688484 non-null  float64
 11  senior_mgmt          682690 non-null  float64
 12  recommend            838566 non-null  object 
 13  ceo_approv           838566 non-null  object 
 14  outlook              838566 non-null  object 
 15  headline         

As we can see, only 7 of the 18 variables are numeric.

-  `overall_rating` is the overall score given to the company.
-  `work_life_balance` is the degree of balance between work and personal life.
-  `culture_values` is the rating of the company's culture and values.
-  `diversity_inclusion` is the rating of the company's diversity and inclusion.
-  `career_opp` refers to career opportunities.
-  `comp_benefits` refers to compensation and benefits.
-  `senior_mgmt` refers to the company's senior management.

Since all ratings are whole numbers from 1 to 5, we could change the `float` variables to an integer type. Because these columns contain missing values, a plain `int` cast would fail; pandas' nullable `Int64` dtype is one option.
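
As a sketch of that type change: plain NumPy integers cannot hold `NaN`, so the example uses pandas' nullable `Int64` dtype. The toy `sample` frame is an assumption for illustration. Only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in with missing ratings, as in the real columns.
sample = pd.DataFrame({
    "work_life_balance": [4.0, 3.0, np.nan],
    "culture_values": [3.0, 1.0, np.nan],
})

rating_cols = ["work_life_balance", "culture_values"]

# "Int64" (capital I) is the nullable integer dtype: NaN becomes <NA> instead of raising.
sample = sample.astype({col: "Int64" for col in rating_cols})
print(sample.dtypes)
```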

### Exploratory data analysis

First we identify the missing values and, depending on how many there are and their impact, apply one treatment or another.



In [5]:
# Check the null count for each column
df.isnull().sum()

firm                        0
date_review                 0
job_title                   0
current                     0
location               297343
overall_rating              0
work_life_balance      149894
culture_values         191373
diversity_inclusion    702500
career_opp             147501
comp_benefits          150082
senior_mgmt            155876
recommend                   0
ceo_approv                  0
outlook                     0
headline                 2590
pros                        2
cons                       13
dtype: int64
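
To judge the impact of these gaps, it can help to express them as a share of all rows. From the counts above, roughly 84% of `diversity_inclusion` and 35% of `location` are missing (702500 and 297343 out of 838566). The sketch below shows the calculation on a toy frame whose values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are illustrative, not taken from the real data.
sample = pd.DataFrame({
    "location": ["Bromsgrove, England, England", np.nan, np.nan, "London, England, England"],
    "diversity_inclusion": [np.nan, np.nan, np.nan, 4.0],
    "overall_rating": [2, 2, 1, 5],
})

# isna().mean() is the fraction of missing values per column; scale it to a percentage.
missing_pct = (sample.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```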

In [6]:
# Exploring the dataset, we can see blank values that are not reported as missing.
df["job_title"].unique()

array([' ', ' Office Administrator', ' IFA', ...,
       ' Seasonal Ride Operator/Attendant', ' Service Employee',
       ' Senior Experience Designer'], shape=(62275,), dtype=object)
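
The array above shows two things: a blank title stored as `' '`, and a leading space on every other title. One possible cleanup, sketched on a toy frame, is to strip the whitespace and then turn empty strings into real missing values so that `isnull()` counts them. This is only one option, not a step the notebook has applied.

```python
import pandas as pd

# Toy stand-in using values seen in the output above.
sample = pd.DataFrame({"job_title": [" ", " Office Administrator", " IFA", " "]})

# Strip surrounding whitespace; blank titles become empty strings.
cleaned = sample["job_title"].str.strip()

# mask() replaces entries where the condition holds with a missing value.
sample["job_title"] = cleaned.mask(cleaned == "")

print(sample["job_title"].isna().sum())
```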

In [29]:
# One way to select the rows whose "job_title" is ' '
df[df["job_title"] == ' ']
# 79065 entries do not specify the job title, around 9.4 %.

Unnamed: 0,firm,date_review,job_title,current,location,overall_rating,work_life_balance,culture_values,diversity_inclusion,career_opp,comp_benefits,senior_mgmt,recommend,ceo_approv,outlook,headline,pros,cons
0,AFH-Wealth-Management,2015-04-05,,Current Employee,,2,4.0,3.0,,2.0,3.0,3.0,x,o,r,"Young colleagues, poor micro management",Very friendly and welcoming to new staff. Easy...,"Poor salaries, poor training and communication."
3,AFH-Wealth-Management,2016-04-16,,Current Employee,,5,2.0,3.0,,2.0,2.0,3.0,x,o,r,Over promised under delivered,Nice staff to work with,No career progression and salary is poor
66,AJ-Bell,2015-07-01,,"Former Employee, more than 3 years",,3,4.0,1.0,,2.0,2.0,2.0,x,v,x,Average company,Good team work\r\nLife / work balance,No development\r\nLack of leadership\r\nPoor l...
71,AJ-Bell,2016-05-23,,Former Employee,,1,,,,,,,o,o,o,Tunbridge Wells office - ONLY good for first-j...,Great Experience for 18months - 2 years if str...,Pay in the Tunbridge Wells office matches the ...
74,AJ-Bell,2016-08-04,,Current Employee,,3,3.0,3.0,,3.0,2.0,2.0,x,v,v,Pensions Administrator,"People make the place, the best Manager I've h...","Management issues, salary is lower than averag..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838070,the-LEGO-Group,2017-11-23,,Current Employee,"San Mateo, CA",3,4.0,4.0,,1.0,2.0,2.0,x,r,v,"Great team, cooperate inconsistent",People you work with are great. \nGreat produc...,District Management and higher tell one story ...
838115,the-LEGO-Group,2018-03-26,,Former Employee,,5,3.0,5.0,,4.0,2.0,4.0,v,v,r,Associate,"Great environment, Team Friendly, Never a dull...","Long hours, sometimes bad management"
838171,the-LEGO-Group,2018-10-19,,Current Employee,,4,,,,,,,o,o,o,Supply planning manager,"Flexible, good working environment","Work life balance, Manuel system"
838177,the-LEGO-Group,2018-11-12,,Current Employee,,5,4.0,5.0,,3.0,4.0,4.0,v,v,v,Great Culture,Amazing culture & very enjoyable place to work.,Maybe a bit stressful to keep a smile for more...


In [31]:
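# Note: the output still contains rows both with and without a location, so the comparison
# with np.dtype(object) does not filter on location; .notna() / .isna() would be needed for that.
df[(df["job_title"] == ' ') & (df["location"] != np.dtype(object))]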

Unnamed: 0,firm,date_review,job_title,current,location,overall_rating,work_life_balance,culture_values,diversity_inclusion,career_opp,comp_benefits,senior_mgmt,recommend,ceo_approv,outlook,headline,pros,cons
0,AFH-Wealth-Management,2015-04-05,,Current Employee,,2,4.0,3.0,,2.0,3.0,3.0,x,o,r,"Young colleagues, poor micro management",Very friendly and welcoming to new staff. Easy...,"Poor salaries, poor training and communication."
3,AFH-Wealth-Management,2016-04-16,,Current Employee,,5,2.0,3.0,,2.0,2.0,3.0,x,o,r,Over promised under delivered,Nice staff to work with,No career progression and salary is poor
66,AJ-Bell,2015-07-01,,"Former Employee, more than 3 years",,3,4.0,1.0,,2.0,2.0,2.0,x,v,x,Average company,Good team work\r\nLife / work balance,No development\r\nLack of leadership\r\nPoor l...
71,AJ-Bell,2016-05-23,,Former Employee,,1,,,,,,,o,o,o,Tunbridge Wells office - ONLY good for first-j...,Great Experience for 18months - 2 years if str...,Pay in the Tunbridge Wells office matches the ...
74,AJ-Bell,2016-08-04,,Current Employee,,3,3.0,3.0,,3.0,2.0,2.0,x,v,v,Pensions Administrator,"People make the place, the best Manager I've h...","Management issues, salary is lower than averag..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
838070,the-LEGO-Group,2017-11-23,,Current Employee,"San Mateo, CA",3,4.0,4.0,,1.0,2.0,2.0,x,r,v,"Great team, cooperate inconsistent",People you work with are great. \nGreat produc...,District Management and higher tell one story ...
838115,the-LEGO-Group,2018-03-26,,Former Employee,,5,3.0,5.0,,4.0,2.0,4.0,v,v,r,Associate,"Great environment, Team Friendly, Never a dull...","Long hours, sometimes bad management"
838171,the-LEGO-Group,2018-10-19,,Current Employee,,4,,,,,,,o,o,o,Supply planning manager,"Flexible, good working environment","Work life balance, Manuel system"
838177,the-LEGO-Group,2018-11-12,,Current Employee,,5,4.0,5.0,,3.0,4.0,4.0,v,v,v,Great Culture,Amazing culture & very enjoyable place to work.,Maybe a bit stressful to keep a smile for more...
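
The output above still mixes rows with a location ("San Mateo, CA") and rows without one, which suggests the second condition is not filtering anything. If the intent was to split the blank-title rows by whether a location is present (an assumption, since the notebook does not say), a sketch with `notna()` and `isna()` on a toy frame looks like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in; the values are illustrative.
sample = pd.DataFrame({
    "job_title": [" ", " ", " Office Administrator", " IFA"],
    "location": [np.nan, "San Mateo, CA", "Bromsgrove, England, England", np.nan],
})

# True where the job title is blank once whitespace is stripped.
blank_title = sample["job_title"].str.strip().eq("")

# Blank job title but a known location.
with_location = sample[blank_title & sample["location"].notna()]

# Blank job title and no location either.
without_location = sample[blank_title & sample["location"].isna()]

print(len(with_location), len(without_location))
```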


### Feature engineering:

Taking the variables `recommend`, `ceo_approv` and `outlook`, we can build a score from those ratings, compute its mean, and use it to classify the opinions of the "respondents".

These variables refer, respectively, to whether the reviewer would recommend the company to a friend, the rating of the CEO's job performance, and the company's outlook over the next 6 months.

To do this, we translate `v/r/x/o` into +2/+1/-1/0 points respectively. We then compute the mean score in a new variable `avg_score` (the highest possible score is 2 points and the lowest is -1).
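
A minimal sketch of this scoring on a toy frame. The mapping and the `avg_score` name come from the text. The `sample` data and the helper names `points`, `opinion_cols` and `scores` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in with the three opinion columns.
sample = pd.DataFrame({
    "recommend": ["x", "v", "o"],
    "ceo_approv": ["o", "v", "o"],
    "outlook": ["r", "v", "o"],
})

# Points per code, as described in the text: v=+2, r=+1, x=-1, o=0.
points = {"v": 2, "r": 1, "x": -1, "o": 0}
opinion_cols = ["recommend", "ceo_approv", "outlook"]

# Map each code to its points column by column, then average across the three columns.
scores = sample[opinion_cols].apply(lambda col: col.map(points))
sample["avg_score"] = scores.mean(axis=1)
print(sample)
```

With this mapping a "no opinion" answer counts as 0 and is included in the mean, so it pulls the average toward zero rather than being ignored.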