# Employer Review Prediction
## Reynara Ezra Pratama

## Background

## Business Understanding

1. Understand the reviews that employees give their companies.
2. Predict the sentiment of each review and classify it as positive, neutral, or negative.
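The second goal needs a rule for turning a star rating into a sentiment class. The notebook does not state that rule at this point, so the sketch below is only one plausible mapping: the thresholds (1 to 2 negative, 3 neutral, 4 to 5 positive) and the helper name `rating_to_sentiment` are assumptions, not the author's method.

```python
import pandas as pd


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    The cut-offs are an assumption made for illustration only.
    """
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


# Toy ratings standing in for the `Rating` column.
ratings = pd.Series([1, 3, 5, 4, 2], name="Rating")
sentiment = ratings.apply(rating_to_sentiment)
print(sentiment.tolist())
```

Because the mean rating reported later is about 4.05, a mapping like this would probably produce imbalanced classes, which is worth keeping in mind when choosing evaluation metrics.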

## Data Understanding

1. `ReviewTitle`: The topic of the review.
2. `CompleteReview`: The full review text written by the employee.
3. `URL`: The Uniform Resource Locator (URL) of the review page.
4. `Rating`: The rating given by the employee.
5. `ReviewDetails`: Details about the review, such as the reviewer's employment status, location, and review date.

## Import Library

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

import tensorflow as tf
import nltk

import warnings 
warnings.filterwarnings('ignore')

## Loading Dataset

**Load Data From GitHub**

In [2]:
# url = "https://raw.githubusercontent.com/ReynaraEzra/Employer-Review/main/data_input/results.json"
# df = pd.read_json(url)

**Load Data From Local File**

In [3]:
df = pd.read_json('data_input/results.json')

## Checking Dataset

In [4]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,..."
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021"
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021"
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021"
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021"


In [5]:
df.tail()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
145204,Definitely very good place to work and can hav...,We get a lot to learn in the company. Very sys...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - January 20, 2012"
145205,IT Services Company; Great scope for improvement.,Lot of scope to learn different technologies u...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - January 19, 2012"
145206,"Productive, fun to work, great place to do cer...","An overall positive experience, nice environme...",https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - January 19, 2012"
145207,Great place to start the career.,Happy that I've started my career from such a ...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,3,"(Former Employee) - - January 7, 2012"
145208,Nice place to work,Got good experience and knowledge about my wor...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,5,"(Former Employee) - - December 19, 2011"


In [6]:
df.sample(5)

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
68328,Nice place to work,Outstanding place to work for. Nice and fun fi...,https://in.indeed.com/cmp/Oracle/reviews?start...,5,"(Current Employee) - - August 19, 2016"
88916,Good place to work. One of the best in India,Very good work environment. All facilities are...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - July 18, 2018"
128938,Good learning experience,Got to work in WMB v8 for the first time in So...,https://in.indeed.com/cmp/Cognizant-Technology...,3,"(Former Employee) - - February 9, 2014"
129475,good,Its a very good organization.Where i joined he...,https://in.indeed.com/cmp/IBM/reviews?start=6640,4,"(Former Employee) - - June 20, 2016"
76178,Great place to work,"If beginning is Good, End must be perfect”-tha...",https://in.indeed.com/cmp/Ericsson/reviews?sta...,4,"Officer (Current Employee) - - April 26,..."


## Check Data Characteristics

**Data Shape**

In [7]:
df.shape

(145209, 5)

**Data Columns**

In [8]:
df.columns

Index(['ReviewTitle', 'CompleteReview', 'URL', 'Rating', 'ReviewDetails'], dtype='object')

**Data Info**

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145209 entries, 0 to 145208
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   ReviewTitle     145209 non-null  object
 1   CompleteReview  145209 non-null  object
 2   URL             145209 non-null  object
 3   Rating          145209 non-null  int64 
 4   ReviewDetails   145209 non-null  object
dtypes: int64(1), object(4)
memory usage: 5.5+ MB


**Descriptive Statistics**

In [10]:
df.describe()

| Statistic | Rating |
|---|---|
| count | 145209.0 |
| mean | 4.053661 |
| std | 0.925805 |
| min | 1.0 |
| 25% | 4.0 |
| 50% | 4.0 |
| 75% | 5.0 |
| max | 5.0 |


**Check Missing Values**

In [11]:
df.isnull().sum()

ReviewTitle       0
CompleteReview    0
URL               0
Rating            0
ReviewDetails     0
dtype: int64

**Check and Drop Duplicate Data**

In [12]:
df = df.drop_duplicates(keep='first')
df.reset_index(drop=True, inplace=True)

In [13]:
df.shape

(145191, 5)
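The shapes before and after the drop (145209 and 145191 rows) imply that 18 duplicate rows were removed. To make the "check" half of this step explicit, duplicates can be counted before they are dropped. The following is a minimal sketch on a toy frame; the names `toy` and `n_duplicates` are illustrative.

```python
import pandas as pd

# Toy frame: the second row is an exact copy of the first.
toy = pd.DataFrame({
    "ReviewTitle": ["Productive", "Productive", "Stressful"],
    "Rating": [3, 3, 3],
})

# Count rows that repeat an earlier row, then drop them.
n_duplicates = int(toy.duplicated(keep="first").sum())
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(n_duplicates, deduped.shape)
```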

## Feature Extraction

In [14]:
df.head(3)

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,..."
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021"
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021"


**Create Company Name Column**

In [15]:
df['Company Name'] = df['URL'].str.split('/')
df['Company Name'] = df['Company Name'].str[4]
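Taking element 4 of the split works because every URL in this dataset follows the pattern `https://in.indeed.com/cmp/<company>/...`. An alternative that does not depend on the position of the segment is to capture whatever follows `/cmp/`. This is a sketch only; the first URL below is a hypothetical example, and the second is one that appears in the sampled rows.

```python
import pandas as pd

urls = pd.Series([
    "https://in.indeed.com/cmp/Example-Company-Ltd/reviews",  # hypothetical URL
    "https://in.indeed.com/cmp/IBM/reviews?start=6640",
])

# Capture the path segment after "/cmp/", stopping at the next "/" or "?".
company = urls.str.extract(r"/cmp/([^/?]+)", expand=False)

# Optional: swap hyphens for spaces to get a more readable label.
company_readable = company.str.replace("-", " ", regex=False)
print(company.tolist())
```

URLs that do not contain `/cmp/` yield `NaN` here instead of an unrelated path segment, which makes malformed rows easier to spot.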

In [16]:
df['Company Name'].head()

0    Reliance-Industries-Ltd
1    Reliance-Industries-Ltd
2    Reliance-Industries-Ltd
3    Reliance-Industries-Ltd
4    Reliance-Industries-Ltd
Name: Company Name, dtype: object

In [17]:
df['Company Name'].unique()

array(['Reliance-Industries-Ltd', 'Mphasis', 'Kpmg', 'Yes-Bank',
       'Sutherland', 'Marriott-International,-Inc.', 'DHL', 'Jio',
       'Vodafoneziggo', 'HP', 'Maersk', 'Ride.swiggy', 'Jll', 'Alstom',
       'UnitedHealth-Group', 'Tata-Consultancy-Services-(tcs)',
       'Capgemini', 'Teleperformance', 'Cognizant-Technology-Solutions',
       'Mahindra-&-Mahindra-Ltd', 'L&T-Technology-Services-Ltd.',
       'Bharti-Airtel-Limited', 'Indeed', 'Hyatt',
       'Icici-Prudential-Life-Insurance', 'Accenture', 'Honeywell',
       'Standard-Chartered-Bank', 'Nokia', 'Apollo-Hospitals',
       'Tata-Aia-Life', 'Hdfc-Bank', 'Bosch', 'Deloitte', 'Ey',
       'Microsoft', 'Barclays', 'JPMorgan-Chase', 'Muthoot-Finance',
       'Wns-Global-Services', 'Kotak-Mahindra-Bank', 'Infosys', 'Oracle',
       "Byju's", 'Deutsche-Bank', 'Hinduja-Global-Solutions', 'Ericsson',
       'Axis-Bank', 'IBM', 'Concentrix', 'Wells-Fargo', 'Google',
       'Dell-Technologies', 'Facebook', 'Amazon.com', 'Flipkart.

In [18]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails,Company Name
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,...",Reliance-Industries-Ltd
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021",Reliance-Industries-Ltd
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021",Reliance-Industries-Ltd
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021",Reliance-Industries-Ltd
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021",Reliance-Industries-Ltd


**Create Date Column**

In [38]:
df['Date'] = df['ReviewDetails'].str.split('-', expand=True)[2]
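Splitting on `-` and taking element 2 assumes exactly two hyphens before the date. A job title or location that contains a hyphen, or a row with no date at all, shifts the pieces, which is a likely source of the non-date values that show up in `Year`, `Month`, and `Day` further down. A sketch of a more defensive alternative anchors a regular expression on the date at the end of the string. The sample strings are hypothetical and only imitate the format of `ReviewDetails`.

```python
import pandas as pd

details = pd.Series([
    "(Former Employee) -  - August 26, 2021",
    "(Current Employee) - Ghansoli - August 30, 2021",
    "Officer (Former Employee) - Navi-Mumbai - July 5, 2019",  # hyphen in location
    "(Former Employee) - New Delhi",                           # no date present
])

# Position-based split, as in the cell above, for comparison.
naive = details.str.split("-", expand=True)[2]

# Capture "<Month> <day>, <year>" only when it ends the string.
date_text = details.str.extract(r"([A-Za-z]+ \d{1,2}, \d{4})\s*$", expand=False)

# Rows without a recognizable date become NaT instead of junk text.
dates = pd.to_datetime(date_text, format="%B %d, %Y", errors="coerce")
print(pd.DataFrame({"naive": naive, "regex": date_text, "parsed": dates}))
```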

In [39]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails,Company Name,Date
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,...",Reliance-Industries-Ltd,"August 30, 2021"
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021",Reliance-Industries-Ltd,"August 26, 2021"
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021",Reliance-Industries-Ltd,"August 17, 2021"
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021",Reliance-Industries-Ltd,"August 17, 2021"
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021",Reliance-Industries-Ltd,"August 9, 2021"


**Create Year, Month, and Day Columns**

In [48]:
df['Year'] = df['Date'].str.split(',', expand=True)[1]

In [55]:
df['Month'] = df['Date'].str.split(' ', expand=True)[2]

In [62]:
df['Day'] = df['Date'].str.split(' ', expand=True)[3]
df['Day'] = df['Day'].str.replace(',','')
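The indices 2 and 3 in the space-based splits appear to rely on `Date` starting with leading whitespace left over from the hyphen split, so the first elements are empty strings. An alternative sketch parses the text once and reads the components from the `.dt` accessor; malformed values then become missing in all three columns at the same time. The sample values are illustrative.

```python
import pandas as pd

date_text = pd.Series(["  August 30, 2021", "  August 9, 2021", "  Nagar India  "])

# Strip surrounding whitespace, then parse; unparseable text becomes NaT.
parsed = pd.to_datetime(date_text.str.strip(), format="%B %d, %Y", errors="coerce")

parts = pd.DataFrame({
    "Year": parsed.dt.year,           # float column because of the missing row
    "Month": parsed.dt.month_name(),
    "Day": parsed.dt.day,
})
print(parts)
```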

**Check Columns**

Column `Year`

In [65]:
df['Year'].unique()

array([' 2021', ' 2020', ' 2019', ' 2018', ' 2017', ' 2016', None,
       ' 2015', ' 2014', ' 2013', ' 2012', ' 2011',
       ' GWAL PAHARI GURGAON  ', ' airoli  ', ' Malad west  ',
       ' Sp Infocity & Quadra ', ' New Delhi', ' Tamil nadu  ',
       'Gurgaon  '], dtype=object)

In [67]:
df['Year'] = df['Year'].str.replace(' ','')

In [68]:
df['Year'].unique()

array(['2021', '2020', '2019', '2018', '2017', '2016', None, '2015',
       '2014', '2013', '2012', '2011', 'GWALPAHARIGURGAON', 'airoli',
       'Maladwest', 'SpInfocity&Quadra', 'NewDelhi', 'Tamilnadu',
       'Gurgaon'], dtype=object)

In [84]:
df['Year'].value_counts(sort=False)

2021                  2967
2020                 12674
2019                 16985
2018                 15782
2017                 37335
2016                 15295
2015                 13664
2014                 15016
2013                 11140
2012                  4153
2011                    38
GWALPAHARIGURGAON        1
airoli                   1
Maladwest                1
SpInfocity&Quadra        1
NewDelhi                 1
Tamilnadu                1
Gurgaon                  1
Name: Year, dtype: int64

In [91]:
df['Year'].isnull().sum()

142

In [95]:
valid_year = ['2011', '2012', '2013', '2014', '2015', '2016',
              '2017', '2018', '2019', '2020', '2021']
# Keep recognized years; anything else (place names, None) becomes NaN.
df['Year'] = df['Year'].apply(lambda x: x if x in valid_year else np.nan)
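The same whitelist filter can be written without a Python-level lambda by combining `Series.isin` with `Series.where`, which keeps values where the mask is true and fills the rest with `NaN`. This is a sketch on toy values; generating the list with `range` is a convenience, not something the notebook does.

```python
import pandas as pd

year = pd.Series(["2021", "NewDelhi", None, "2011", "Gurgaon"])

# Years 2011 through 2021 as strings, matching the values seen in the data.
valid_year = [str(y) for y in range(2011, 2022)]

# Keep whitelisted years; everything else (including None) becomes NaN.
clean_year = year.where(year.isin(valid_year))
print(clean_year.tolist())
```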

In [96]:
df['Year'].value_counts()

2017    37335
2019    16985
2018    15782
2016    15295
2014    15016
2015    13664
2020    12674
2013    11140
2012     4153
2021     2967
2011       38
Name: Year, dtype: int64

In [92]:
df['Year'].isnull().sum()

142

Column `Month`

In [80]:
df['Month'].unique()

array(['August', 'July', 'September', 'May', 'June', 'April', 'March',
       'February', 'January', 'December', 'November', 'October', '',
       'Africa', 'bagh', 'Consultant', 'Road', '9', '.CLUSTER', '(west)',
       'West', 'PAHARI', 'mumbai', None, 'west', 'Ramannagar', 'West.',
       'Raman', 'park', 'Technohub', 'Solutions', 'Office', 'Estate',
       'Infocity', 'Nagar', 'Delhi,', 'Tamil', 'parel', ')', 'Locatino',
       'complex'], dtype=object)

In [94]:
df['Month'].value_counts()

March         15438
January       12626
June          12573
July          12214
September     12090
February      12008
May           11848
April         11667
August        11635
November      11397
October       10945
December      10608
                114
Infocity          1
Technohub         1
Solutions         1
Office            1
Estate            1
Locatino          1
Nagar             1
Delhi,            1
Tamil             1
parel             1
)                 1
Raman             1
park              1
West              1
West.             1
Ramannagar        1
west              1
mumbai            1
PAHARI            1
(west)            1
.CLUSTER          1
9                 1
Road              1
Consultant        1
bagh              1
Africa            1
complex           1
Name: Month, dtype: int64

In [97]:
df['Month'].isnull().sum()

1

In [98]:
valid_month = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df['Month'] = df['Month'].apply(lambda x:x if x in valid_month else np.nan)

In [99]:
df['Month'].value_counts()

March        15438
January      12626
June         12573
July         12214
September    12090
February     12008
May          11848
April        11667
August       11635
November     11397
October      10945
December     10608
Name: Month, dtype: int64

In [100]:
df['Month'].isnull().sum()

142

Column `Day`

In [101]:
df['Day'].unique()

array(['30', '26', '17', '9', '22', '18', '7', '8', '5', '3', '15', '6',
       '20', '2', '16', '10', '31', '23', '11', '28', '24', '21', '19',
       '13', '1', '4', '25', '12', '27', '29', '14', None, '',
       'chandigarh', '&', 'GURGAON', 'Nagar', 'malad', 'India', 'nadu'],
      dtype=object)

In [110]:
df['Day'].value_counts()

5             5159
2             4992
4             4969
21            4929
9             4920
3             4910
8             4910
7             4881
17            4864
12            4842
6             4829
22            4828
20            4825
23            4824
18            4823
16            4812
11            4793
19            4721
25            4709
24            4705
28            4691
26            4684
10            4680
15            4664
27            4660
13            4572
1             4557
14            4466
29            4311
30            4138
31            2381
                80
&                2
chandigarh       1
GURGAON          1
Nagar            1
malad            1
India            1
nadu             1
Name: Day, dtype: int64

In [111]:
valid_day = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
             '12', '13', '14', '15', '16', '17', '18', '19', '20', '21',
             '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']
df['Day'] = df['Day'].apply(lambda x:x if x in valid_day else np.nan)
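For a numeric field such as the day of the month, a range check can replace the hand-written list. The sketch below converts with `pd.to_numeric(errors="coerce")` and keeps values between 1 and 31. It is an alternative under the assumption that a numeric `Day` column is acceptable downstream; the notebook itself keeps `Day` as a string.

```python
import pandas as pd

day = pd.Series(["30", "9", "", "chandigarh", "&", None, "31"])

# Non-numeric text, empty strings, and None all become NaN.
day_num = pd.to_numeric(day, errors="coerce")

# Keep only plausible day-of-month values.
clean_day = day_num.where(day_num.between(1, 31))
print(clean_day.tolist())
```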

In [117]:
df['Day'].isnull().sum()

142

**Check Missing Values**

In [119]:
df.isnull().sum()

ReviewTitle         0
CompleteReview      0
URL                 0
Rating              0
ReviewDetails       0
Company Name        0
Date                0
Year              142
Month             142
Day               142
dtype: int64

In [120]:
df.shape

(145191, 10)

In [122]:
df = df.dropna()

In [123]:
df.shape

(145049, 10)

In [124]:
df.isnull().sum()

ReviewTitle       0
CompleteReview    0
URL               0
Rating            0
ReviewDetails     0
Company Name      0
Date              0
Year              0
Month             0
Day               0
dtype: int64

**Rebuild Date Column as Datetime**

In [130]:
df['Date'] = df['Year']+df['Month']+df['Day']
df['Date'] = pd.to_datetime(df['Date'], format='%Y%B%d')

In [131]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails,Company Name,Date,Year,Month,Day
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,...",Reliance-Industries-Ltd,2021-08-30,2021,August,30
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021",Reliance-Industries-Ltd,2021-08-26,2021,August,26
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021",Reliance-Industries-Ltd,2021-08-17,2021,August,17
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021",Reliance-Industries-Ltd,2021-08-17,2021,August,17
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021",Reliance-Industries-Ltd,2021-08-09,2021,August,9


**Create Employee Status Column**

In [134]:
df['Employee Status'] = df['ReviewDetails'].str.split('-', expand=True)[0]

In [135]:
df['Employee Status'].unique()

array(['(Current Employee)  ', '(Former Employee)  ',
       'Training   (Former Employee)  ', 'Officer   (Former Employee)  ',
       'Leader   (Current Employee)  ',
       'health care   (Current Employee)  ',
       'Good team worker   (Former Employee)  ',
       'Officer   (Current Employee)  ',
       'Sr.G.M.Engineering and projects .   (Former Employee)  ',
       'Hospitality   (Former Employee)  ',
       'Employee   (Current Employee)  ',
       'Employee   (Former Employee)  ', 'Worker   (Former Employee)  ',
       'SBI PR outbound    (Current Employee)  ',
       'PR in SBI outbound    (Current Employee)  ',
       'SBI PR    (Former Employee)  ', 'Senior   (Former Employee)  ',
       'Sbi inbound    (Current Employee)  ',
       'KOTAK CARD    (Current Employee)  ',
       'Marketing   (Current Employee)  ', 'Yes   (Current Employee)  ',
       'Kotak cardit card    (Current Employee)  ',
       'Kotak cards    (Current Employee)  ',
       'OFFICER   (Current Employee

In [137]:
df['Employee Status']

0         (Current Employee)  
1          (Former Employee)  
2          (Former Employee)  
3         (Current Employee)  
4          (Former Employee)  
                  ...         
145186     (Former Employee)  
145187     (Former Employee)  
145188     (Former Employee)  
145189     (Former Employee)  
145190     (Former Employee)  
Name: Employee Status, Length: 145049, dtype: object

In [140]:
former = 'Former Employee'
# Test whether the label occurs inside each raw value (not the other way round),
# because the raw strings carry parentheses, job titles, and trailing spaces.
df['Employee Status'].apply(lambda x: 'Former Employee' if former in x else 'Current Employee')

0         Current Employee
1          Former Employee
2          Former Employee
3         Current Employee
4          Former Employee
                ...       
145186     Former Employee
145187     Former Employee
145188     Former Employee
145189     Former Employee
145190     Former Employee
Name: Employee Status, Length: 145049, dtype: object
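The raw status strings include parentheses, optional job titles, and trailing spaces, so the check has to ask whether the label occurs inside each raw value. A vectorized way to express that is `Series.str.contains` combined with `np.where`. The sketch below uses a few of the raw values listed above. Treating everything that is not "Former Employee" as "Current Employee" mirrors the `apply` call and is an assumption about the data.

```python
import numpy as np
import pandas as pd

status_raw = pd.Series([
    "(Current Employee)  ",
    "(Former Employee)  ",
    "Officer   (Former Employee)  ",
    "Leader   (Current Employee)  ",
])

# True where the literal text "Former Employee" occurs in the raw value.
is_former = status_raw.str.contains("Former Employee", regex=False)

# Two-way label; assign the result back to the column to persist it.
status = pd.Series(
    np.where(is_former, "Former Employee", "Current Employee"),
    index=status_raw.index,
    name="Employee Status",
)
print(status.value_counts())
```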