# Employer Review Prediction
## Reynara Ezra Pratama

## Background

## Business Understanding

1. Understand the *reviews* that employees give to their companies.
2. Predict each *review* and classify it as positive, neutral, or negative.
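
The second goal needs a target label. A minimal sketch of one way to derive it from the star rating is shown here; the thresholds (4 and 5 as positive, 3 as neutral, 1 and 2 as negative) and the helper name `rating_to_sentiment` are assumptions made for illustration, not something this notebook defines.

```python
import pandas as pd


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label (thresholds are assumed)."""
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"


# Toy ratings, not taken from the dataset
sample = pd.DataFrame({"Rating": [5, 4, 3, 2, 1]})
sample["Sentiment"] = sample["Rating"].map(rating_to_sentiment)
print(sample)
```

Different cut-offs would change the class balance, so they are worth revisiting once the rating distribution is known.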

## Data Understanding

1. `ReviewTitle` : Title of the *review*.
2. `CompleteReview` : Full text of the *review* written by the employee.
3. `URL` : *Uniform Resource Locator* of the page the *review* comes from.
4. `Rating` : Rating given by the employee.
5. `ReviewDetails` : Details about the *review* (employment status, location, and date).

## Import Library

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score


import tensorflow as tf
import nltk

import warnings 
warnings.filterwarnings('ignore')

## Loading Dataset

**Load Data From GitHub**

In [2]:
# url = "https://raw.githubusercontent.com/ReynaraEzra/Employer-Review/main/data_input/results.json"
# df = pd.read_json(url)

**Load Data From Local File**

In [3]:
df = pd.read_json('data_input/results.json')

## Checking Dataset

In [4]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,..."
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021"
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021"
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021"
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021"


In [5]:
df.tail()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
145204,Definitely very good place to work and can hav...,We get a lot to learn in the company. Very sys...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - January 20, 2012"
145205,IT Services Company; Great scope for improvement.,Lot of scope to learn different technologies u...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - January 19, 2012"
145206,"Productive, fun to work, great place to do cer...","An overall positive experience, nice environme...",https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - January 19, 2012"
145207,Great place to start the career.,Happy that I've started my career from such a ...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,3,"(Former Employee) - - January 7, 2012"
145208,Nice place to work,Got good experience and knowledge about my wor...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,5,"(Former Employee) - - December 19, 2011"


In [6]:
df.sample(5)

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
68328,Nice place to work,Outstanding place to work for. Nice and fun fi...,https://in.indeed.com/cmp/Oracle/reviews?start...,5,"(Current Employee) - - August 19, 2016"
88916,Good place to work. One of the best in India,Very good work environment. All facilities are...,https://in.indeed.com/cmp/Tata-Consultancy-Ser...,4,"(Former Employee) - - July 18, 2018"
128938,Good learning experience,Got to work in WMB v8 for the first time in So...,https://in.indeed.com/cmp/Cognizant-Technology...,3,"(Former Employee) - - February 9, 2014"
129475,good,Its a very good organization.Where i joined he...,https://in.indeed.com/cmp/IBM/reviews?start=6640,4,"(Former Employee) - - June 20, 2016"
76178,Great place to work,"If beginning is Good, End must be perfect”-tha...",https://in.indeed.com/cmp/Ericsson/reviews?sta...,4,"Officer (Current Employee) - - April 26,..."


## Checking Data Characteristics

**Data Shape**

In [7]:
df.shape

(145209, 5)

**Data Columns**

In [8]:
df.columns

Index(['ReviewTitle', 'CompleteReview', 'URL', 'Rating', 'ReviewDetails'], dtype='object')

**Data Info**

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145209 entries, 0 to 145208
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   ReviewTitle     145209 non-null  object
 1   CompleteReview  145209 non-null  object
 2   URL             145209 non-null  object
 3   Rating          145209 non-null  int64 
 4   ReviewDetails   145209 non-null  object
dtypes: int64(1), object(4)
memory usage: 5.5+ MB


**Descriptive Statistics**

In [10]:
df.describe()

Unnamed: 0,Rating
count,145209.0
mean,4.053661
std,0.925805
min,1.0
25%,4.0
50%,4.0
75%,5.0
max,5.0


**Check Missing Values**

In [11]:
df.isnull().sum()

ReviewTitle       0
CompleteReview    0
URL               0
Rating            0
ReviewDetails     0
dtype: int64

**Check and Drop Duplicate Data**

In [12]:
df = df.drop_duplicates(keep='first')
df.reset_index(drop=True, inplace=True)

In [13]:
df.shape

(145191, 5)
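
The two shapes above imply that 18 rows were removed (145209 − 145191). The cell drops duplicates without first counting them, so a minimal sketch of the "check" step follows, run on a toy frame rather than the real data; the names `toy`, `n_duplicates`, and `deduped` are illustrative.

```python
import pandas as pd

# Toy frame with one exact duplicate row (not real data)
toy = pd.DataFrame({
    "ReviewTitle": ["Good", "Good", "Bad"],
    "Rating": [5, 5, 1],
})

# Count rows that repeat an earlier row across all columns
n_duplicates = int(toy.duplicated(keep="first").sum())
print(f"Duplicate rows: {n_duplicates}")

# Drop them and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped.shape)
```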

## Feature Extraction

In [14]:
df.head(3)

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,..."
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021"
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021"


**Create Company Name Column**

In [15]:
# The company slug is the fifth '/'-separated piece of the URL (index 4)
df['Company Name'] = df['URL'].str.split('/').str[4]

In [16]:
df['Company Name'].head()

0    Reliance-Industries-Ltd
1    Reliance-Industries-Ltd
2    Reliance-Industries-Ltd
3    Reliance-Industries-Ltd
4    Reliance-Industries-Ltd
Name: Company Name, dtype: object
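
Taking index 4 of the split works as long as every URL has the same prefix. As a hedged alternative, the sketch below captures whatever follows `/cmp/` with a regular expression, so it does not depend on the position of the segment. The first URL is made up for illustration; the pattern assumes the company slug never contains `/` or `?`.

```python
import pandas as pd

urls = pd.Series([
    "https://in.indeed.com/cmp/Oracle/reviews?start=20",  # illustrative URL
    "https://in.indeed.com/cmp/IBM/reviews?start=6640",
])

# Capture the path segment right after '/cmp/'
company = urls.str.extract(r"/cmp/([^/?]+)", expand=False)
print(company.tolist())
```

URLs without a `/cmp/` segment yield `NaN` rather than an unrelated path piece, which makes malformed rows easier to spot.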

In [17]:
df['Company Name'].unique()

array(['Reliance-Industries-Ltd', 'Mphasis', 'Kpmg', 'Yes-Bank',
       'Sutherland', 'Marriott-International,-Inc.', 'DHL', 'Jio',
       'Vodafoneziggo', 'HP', 'Maersk', 'Ride.swiggy', 'Jll', 'Alstom',
       'UnitedHealth-Group', 'Tata-Consultancy-Services-(tcs)',
       'Capgemini', 'Teleperformance', 'Cognizant-Technology-Solutions',
       'Mahindra-&-Mahindra-Ltd', 'L&T-Technology-Services-Ltd.',
       'Bharti-Airtel-Limited', 'Indeed', 'Hyatt',
       'Icici-Prudential-Life-Insurance', 'Accenture', 'Honeywell',
       'Standard-Chartered-Bank', 'Nokia', 'Apollo-Hospitals',
       'Tata-Aia-Life', 'Hdfc-Bank', 'Bosch', 'Deloitte', 'Ey',
       'Microsoft', 'Barclays', 'JPMorgan-Chase', 'Muthoot-Finance',
       'Wns-Global-Services', 'Kotak-Mahindra-Bank', 'Infosys', 'Oracle',
       "Byju's", 'Deutsche-Bank', 'Hinduja-Global-Solutions', 'Ericsson',
       'Axis-Bank', 'IBM', 'Concentrix', 'Wells-Fargo', 'Google',
       'Dell-Technologies', 'Facebook', 'Amazon.com', 'Flipkart.

In [18]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails,Company Name
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,...",Reliance-Industries-Ltd
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021",Reliance-Industries-Ltd
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021",Reliance-Industries-Ltd
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021",Reliance-Industries-Ltd
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021",Reliance-Industries-Ltd


**Create Date Columns**

In [22]:
dummy = df['ReviewDetails'].str.split('-', expand=True)
dummy

Unnamed: 0,0,1,2,3,4
0,(Current Employee),Ghansoli,"August 30, 2021",,
1,(Former Employee),,"August 26, 2021",,
2,(Former Employee),,"August 17, 2021",,
3,(Current Employee),,"August 17, 2021",,
4,(Former Employee),,"August 9, 2021",,
...,...,...,...,...,...
145186,(Former Employee),,"January 20, 2012",,
145187,(Former Employee),,"January 19, 2012",,
145188,(Former Employee),,"January 19, 2012",,
145189,(Former Employee),,"January 7, 2012",,


In [35]:
dummy[3].nunique()

137

In [136]:
dummy = df['ReviewDetails'].str.split('-', expand=True)
dummy = dummy[2]
dummy = dummy.str.replace(',','')
dummy = dummy.str.split(' ', expand=True)
dummy

Unnamed: 0,0,1,2,3,4,5,6,7
0,,,August,30,2021,,,
1,,,August,26,2021,,,
2,,,August,17,2021,,,
3,,,August,17,2021,,,
4,,,August,9,2021,,,
...,...,...,...,...,...,...,...,...
145186,,,January,20,2012,,,
145187,,,January,19,2012,,,
145188,,,January,19,2012,,,
145189,,,January,7,2012,,,


In [178]:
dummy.sample(10)

Unnamed: 0,0,1,2,3,4,5,6,7
21345,,,June,7,2017,,,
120751,,,May,9,2016,,,
24953,,,May,20,2017,,,
19897,,,November,6,2019,,,
58358,,,January,18,2020,,,
61224,,,September,4,2019,,,
28647,,,March,16,2017,,,
117346,,,April,27,2017,,,
43846,,,June,21,2019,,,
79858,,,April,29,2015,,,


In [179]:
dummy[2].unique()

array(['August', 'July', 'September', 'May', 'June', 'April', 'March',
       'February', 'January', 'December', 'November', 'October', '',
       'Africa', 'bagh', 'Consultant', 'Road', '9', '.CLUSTER', '(west)',
       'West', 'PAHARI', 'mumbai', None, 'west', 'Ramannagar', 'West.',
       'Raman', 'park', 'Technohub', 'Solutions', 'Office', 'Estate',
       'Infocity', 'Nagar', 'Delhi', 'Tamil', 'parel', ')', 'Locatino',
       'complex'], dtype=object)
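
The stray values above ('Africa', 'Road', 'Nagar', and so on) show that splitting on `-` misplaces the date for some rows, presumably because the location part itself contains hyphens. One possible workaround, sketched here on toy strings, is to pull the date from the end of the string with a regular expression instead of relying on column positions. The pattern and the third sample row are assumptions; the real `ReviewDetails` values may use different spacing.

```python
import pandas as pd

details = pd.Series([
    "(Current Employee) - Ghansoli - August 30, 2021",
    "(Former Employee) -  - August 26, 2021",
    # Hypothetical row with a hyphenated location
    "Officer (Current Employee) - Navi-Mumbai - April 26, 2018",
])

# Month name, day, and four-digit year anchored at the end of the string
date_pattern = r"(?P<Month>[A-Za-z]+)\s+(?P<Day>\d{1,2}),\s+(?P<Year>\d{4})\s*$"
parts = details.str.extract(date_pattern)
print(parts)
```

Rows that do not end in a date come back as `NaN` in all three columns, so they can be dropped in a single step.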

In [155]:
year = dummy[4]
month = dummy[2]
day = dummy[3]

In [157]:
cols = [day, month, year]

for col in cols:
    print(col.isnull().sum())

54
1
115


In [172]:
time = pd.DataFrame()
time['Day'] = day
time['Month'] = month
time['Year'] = year

In [173]:
time = time.dropna()
time

Unnamed: 0,Day,Month,Year
0,30,August,2021
1,26,August,2021
2,17,August,2021
3,17,August,2021
4,9,August,2021
...,...,...,...
145186,20,January,2012
145187,19,January,2012
145188,19,January,2012
145189,7,January,2012


In [175]:
time['Day'].unique()

array(['30', '26', '17', '9', '22', '18', '7', '8', '5', '3', '15', '6',
       '20', '2', '16', '10', '31', '23', '11', '28', '24', '21', '19',
       '13', '1', '4', '25', '12', '27', '29', '14', '', 'chandigarh',
       '&', 'GURGAON', 'Nagar', 'malad', 'India', 'nadu'], dtype=object)
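
`dropna()` removes only missing values, so empty strings and location fragments such as 'chandigarh' or 'GURGAON' remain in `Day`. A minimal sketch of a stricter filter follows; it keeps a row only when all three parts look like date components. The frame `time_parts` is toy data, and `valid_months` assumes English month names.

```python
import calendar

import pandas as pd

# Toy rows: one valid date and two rows of location debris
time_parts = pd.DataFrame({
    "Day": ["30", "", "Nagar"],
    "Month": ["August", "Africa", "Delhi"],
    "Year": ["2021", "", "Tamil"],
})

# 'January' .. 'December' (index 0 of month_name is an empty string)
valid_months = set(calendar.month_name[1:])

mask = (
    time_parts["Day"].str.fullmatch(r"\d{1,2}")
    & time_parts["Month"].isin(valid_months)
    & time_parts["Year"].str.fullmatch(r"\d{4}")
)
clean = time_parts[mask]
print(clean)
```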

In [169]:
cols=["Year","Month","Day"]
time['Date'] = pd.to_datetime(time[cols].apply(lambda x: '/'.join(x.values.astype(str)), axis=1))

ParserError: Unknown string format: /Africa/

In [171]:
time[time['Date']=='/Africa/']

Unnamed: 0,Day,Month,Year,Date
1437,,Africa,,/Africa/


In [166]:
time['Date']

0           2021/August/30
1           2021/August/26
2           2021/August/17
3           2021/August/17
4            2021/August/9
                ...       
145186     2012/January/20
145187     2012/January/19
145188     2012/January/19
145189      2012/January/7
145190    2011/December/19
Name: Date, Length: 145076, dtype: object

In [168]:
pd.to_datetime(time['Date'], format="%Y/%B/%d")

ValueError: time data '/Africa/' does not match format '%Y/%B/%d' (match)
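
Both errors come from the same malformed row ('/Africa/'). One way to keep the conversion from failing, sketched below on toy strings, is to pass `errors="coerce"` to `pd.to_datetime`: unparseable values become `NaT` and can then be inspected or dropped. The sample values are illustrative, and `%B` assumes English month names in the current locale.

```python
import pandas as pd

date_strings = pd.Series(["2021/August/30", "2012/January/7", "/Africa/"])

# Invalid strings become NaT instead of raising an error
parsed = pd.to_datetime(date_strings, format="%Y/%B/%d", errors="coerce")
print(parsed)

# Keep only the rows that parsed successfully
valid = parsed.dropna()
print(len(valid))
```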