# Python Code for Analyzing Glassdoor Data

In [5]:
import pandas as pd
import numpy as np
from scipy import stats

In [6]:
G = pd.read_csv('employee_reviews.csv', encoding='ISO-8859-1')

Before anything else, I need to decide what to do with the "none" values in my data. On Glassdoor, reviewers can skip any of the ratings except the overall rating. My options are to treat these values as NaN and ignore them, or to copy the overall rating into the missing sub-ratings. The overall rating may be a better indicator of what a missing value would have been than simply ignoring it. I decided to run a t-test to see whether the two approaches give different results.
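To make the two options concrete, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration and are not taken from employee_reviews.csv; only the column names match the real data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'overall-ratings': [5.0, 1.0, 4.0],
    'work-balance-stars': ['4.0', 'none', '2.0'],
})

# Option 1: treat 'none' as missing; pandas skips NaN when aggregating
as_nan = toy.replace('none', np.nan).astype(float)

# Option 2: fill the missing sub-rating with the same row's overall rating
as_overall = as_nan.copy()
as_overall['work-balance-stars'] = as_overall['work-balance-stars'].fillna(
    as_overall['overall-ratings'])

print(as_nan['work-balance-stars'].mean())      # mean of 4.0 and 2.0
print(as_overall['work-balance-stars'].mean())  # mean of 4.0, 1.0 and 2.0
```

In this toy case the second option pulls the column mean down, because the reviewer who skipped the sub-rating gave a low overall rating. Whether the same happens on the real data is what the t-test is meant to check.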

In [7]:
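# Numeric columns, with 'none' converted to NaN; the NaNs are filled from the overall rating in the next cell
G_ovr = G[['overall-ratings', 'work-balance-stars', 'culture-values-stars', 'carrer-opportunities-stars',
           'comp-benefit-stars', 'senior-mangemnet-stars', 'helpful-count']].replace('none', np.nan).astype(float)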

In [8]:
# Fill each missing sub-rating with the same review's overall rating.
# Assigning the result back avoids in-place modification of a column selection,
# which newer pandas versions warn about.
for col in ['work-balance-stars', 'culture-values-stars', 'carrer-opportunities-stars',
            'comp-benefit-stars', 'senior-mangemnet-stars']:
    G_ovr[col] = G_ovr[col].fillna(G_ovr['overall-ratings'])

In [9]:
# Same columns with 'none' left as NaN (no imputation), used as the comparison group
G_nan = G[['overall-ratings', 'work-balance-stars', 'culture-values-stars', 'carrer-opportunities-stars',
           'comp-benefit-stars', 'senior-mangemnet-stars', 'helpful-count']].replace('none', np.nan).astype(float)

In [10]:
# Column-wise two-sample t-test; NaN values in G_nan are omitted
stats.ttest_ind(G_ovr, G_nan, nan_policy='omit')

Ttest_indResult(statistic=masked_array(data=[0.0, 8.329244622305506, -0.5100861707200497,
                   4.723569370096835, 0.1391372807425222,
                   9.669769122711893, 0.0],
             mask=[False, False, False, False, False, False, False],
       fill_value=1e+20), pvalue=masked_array(data=[1.00000000e+00, 8.21507067e-17, 6.09992020e-01,
                   2.31986255e-06, 8.89341888e-01, 4.12458660e-22,
                   1.00000000e+00],
             mask=False,
       fill_value=1e+20))

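The results follow the column order used above. The overall-ratings and helpful-count columns were not imputed, so their p-values are exactly 1. Among the five imputed columns, the highest p-value is 0.889 (comp-benefit-stars) and the lowest is 4.12e-22 (senior-mangemnet-stars). Two columns (culture-values-stars and comp-benefit-stars) show no significant difference between the two approaches, while the other three (work-balance-stars, carrer-opportunities-stars and senior-mangemnet-stars) do. I will still use the overall ratings to replace the NaN values for the analysis, keeping in mind that this shifts the means of those three columns.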
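The raw `ttest_ind` output is hard to read because the statistics are not labelled by column. A minimal sketch of one way to attach the column names is shown here on toy frames that stand in for `G_nan` and `G_ovr`. The names `toy_nan`, `toy_ovr` and `summary` and all the values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for G_nan (values invented for illustration)
toy_nan = pd.DataFrame({
    'overall-ratings': [5.0, 1.0, 4.0, 2.0],
    'work-balance-stars': [4.0, np.nan, 2.0, np.nan],
})

# Toy stand-in for G_ovr: missing sub-ratings filled from the overall rating
toy_ovr = toy_nan.copy()
toy_ovr['work-balance-stars'] = toy_ovr['work-balance-stars'].fillna(
    toy_ovr['overall-ratings'])

# Column-wise two-sample t-test, omitting NaN values
result = stats.ttest_ind(toy_ovr.to_numpy(), toy_nan.to_numpy(), nan_policy='omit')

# Label the statistics with the column names; np.asarray also unwraps
# the masked arrays that some SciPy versions return
summary = pd.DataFrame({
    't': np.asarray(result.statistic, dtype=float),
    'p': np.asarray(result.pvalue, dtype=float),
}, index=toy_nan.columns)
print(summary)
```

On the real data the same pattern would use `G_ovr`, `G_nan` and `G_nan.columns`. The columns that were never imputed show t = 0 and p = 1, because the two samples are identical.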

I will proceed to clean the overall structure of the data.

In [12]:
# Write the imputed numeric columns back into the main frame
num_cols = ['overall-ratings', 'work-balance-stars', 'culture-values-stars', 'carrer-opportunities-stars',
            'comp-benefit-stars', 'senior-mangemnet-stars', 'helpful-count']
G[num_cols] = G_ovr[num_cols]

# Drop the index column, the link and the free-text fields
G = G.drop(['Unnamed: 0', 'link', 'summary', 'pros', 'cons', 'advice-to-mgmt'], axis=1)

In [13]:
# Fix the misspelled column names from the source file
G = G.rename(columns={'carrer-opportunities-stars': 'career-opportunities-stars',
                      'senior-mangemnet-stars': 'senior-management-stars'})