### Analysis ideas:
- How have the topics changed over time? The same view can be broken down by occupation.
- Which aspects are associated with high ratings?
- Among current and former employees, which aspects lead them to stay or leave?
- Why do reviewers recommend or not recommend DBS?
- For those who have stayed for a long period (e.g., more than 3 years), why have they stayed on?

### Before running the data-cleaning code below, run the following command in Anaconda Prompt:
*`conda activate dbs_employee_reviews`*

Main idea of the script:
- Split the pros and cons of each review into sentences on new lines, full stops and semicolons
- Attach the index (review number) to each sentence
- Preprocess the pros and cons

In [1]:
# import pandas as pd
# intermediate = pd.read_csv(r'C:\Users\jingh\OneDrive\Desktop\DBS Internship stuff\Employee Reviews Documents\Employee Reviews\Microsoft Reviews\All_Microsoft_reviews.csv')
# intermediate.to_excel(r'C:\Users\jingh\OneDrive\Desktop\DBS Internship stuff\Employee Reviews Documents\Employee Reviews\Microsoft Reviews\All_Microsoft_reviews.xlsx', index=False)

In [2]:
import re
import time
import nltk
import string
import timeit
import unidecode
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from autocorrect import Speller
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('stopwords')

df = pd.read_excel('../Employee Reviews/DBS Reviews/All DBS Reviews/All DBS Reviews.xlsx').drop_duplicates().reset_index().drop(['index'],axis=1)
df = df[~pd.isna(df['Date'])]
# Split "Employee Type" at the first comma into status and (optional) duration.
df[['Employee Status', 'Duration']] = df['Employee Type'].str.split(',', n=1, expand=True)
df = df.drop(['Employee Type'],axis=1)
df = df[['Review_type', 'Date', 'Occupation', 'Employee Status', 'Duration', 'Pros', 'Cons', 'Recommended Or Not', 'CEO Approval', 'Business Outlook', 'Rating']]
df


Unnamed: 0,Review_type,Date,Occupation,Employee Status,Duration,Pros,Cons,Recommended Or Not,CEO Approval,Business Outlook,Rating
0,DBS Bank Reviews,27 Jun 2022,Software Engineer,Current Employee,more than 5 years,Good work life balance and salary,"Hierarchical, does not listen to feedback from...",Recommended,Recommended,Recommended,4.0
1,DBS Bank Reviews,27 Jun 2022,Senior Associate,Current Employee,,Annual leave granted per year is good,Number of carry forward leave allow is too little,Neutral,Neutral,Neutral,4.0
2,DBS Bank Reviews,27 Jun 2022,Senior Associate,Current Employee,,Infrastructure is good and the ambience,No work Life balance.. managers force to work ...,Not Recommended,Neutral,Neutral,1.0
3,DBS Bank Reviews,27 Jun 2022,Senior Data Analyst,Former Contractor,more than 1 year,opportunities to work on interesting and chall...,"some politics, heavy workload on some projects",Neutral,Neutral,Neutral,3.0
4,DBS Bank Reviews,27 Jun 2022,Graduate Associate,Current Employee,more than 1 year,you get exposed to multiple skillsets,you need to get exposed to multiple skillsets,Neutral,Neutral,Neutral,4.0
...,...,...,...,...,...,...,...,...,...,...,...
4593,DBS Bank Reviews,6 Nov 2009,Summer Intern,Former Employee,,The environment at DBS is very friendly. My co...,No much to criticize or comment about. I guess...,Recommended,Recommended,Neutral,5.0
4594,DBS Bank Reviews,20 Oct 2009,Analyst,Former Employee,,the brand name of the company,my boss was a slave driver who did not promote...,Not Recommended,Neutral,Neutral,2.0
4595,DBS Bank Reviews,27 Aug 2009,Assistant Vice President,Former Employee,,"The brand name in the Asia arena; stability, l...","Very bureaucratic. Seniority matters the most,...",Neutral,Neutral,Neutral,3.0
4596,DBS Bank Reviews,3 Aug 2009,Assistant Vice President,Current Employee,,people are not the smartest around; hence very...,far too many deadwood esp in middle management...,Neutral,Not Recommended,Neutral,2.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4598 entries, 0 to 4597
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Review_type         4598 non-null   object 
 1   Date                4598 non-null   object 
 2   Occupation          4598 non-null   object 
 3   Employee Status     4598 non-null   object 
 4   Duration            2650 non-null   object 
 5   Pros                4598 non-null   object 
 6   Cons                4598 non-null   object 
 7   Recommended Or Not  4598 non-null   object 
 8   CEO Approval        4598 non-null   object 
 9   Business Outlook    4598 non-null   object 
 10  Rating              4598 non-null   float64
dtypes: float64(1), object(10)
memory usage: 560.1+ KB


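### Break up all pros and cons based on new line, full stop and semicolon
Rationale for breaking up sentences: BERTopic maps each document to a single topic (one-to-one mapping), so a review that touches on several topics is split into sentences and each sentence is treated as its own document.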

In [5]:
def split_sentences_after_new_line(sentence):
    return sentence.split('\n')
def split_sentences_after_fullstop(sentence_list):
    lis = []
    for i in sentence_list:
        lis.extend(i.split('. '))
    return lis
def split_sentences_after_semicolon(sentence_list):
    lis = []
    for i in sentence_list:
        lis.extend(i.split('; '))
    return lis
def split_sentences_after_exclamation(sentence_list):
    lis = []
    for i in sentence_list:
        lis.extend(i.split('! '))
    return lis
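
The helpers above could also be collapsed into a single regex-based splitter. The following is a minimal sketch, not the notebook's method: `split_review` and `_SPLIT_PATTERN` are hypothetical names, and unlike the helpers above it also strips each fragment and drops empty ones. It includes the exclamation-mark split, which cell 9 does not apply.

```python
import re

# Hypothetical pattern: split on a newline, or on ".", ";" or "!" followed by a space.
_SPLIT_PATTERN = re.compile(r"\n|[.;!] ")

def split_review(text):
    """Split one review into fragments, dropping empty fragments."""
    parts = _SPLIT_PATTERN.split(text)
    return [part.strip() for part in parts if part.strip()]

print(split_review("good pay. nice team; long hours!\nlots of politics"))
```

Because the delimiter must be followed by a space (or be a newline), a run such as `office..they` is left intact, which matches the behaviour of splitting on `'. '`.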


In [6]:
def remove_newlines_tabs(text):
    """
    Replace newlines, tabs, carriage returns, backslashes and the literal
    sequences \\n and \\t with a space.
    
    arguments:
        text: input of type "String". 
                    
    return:
        value: "text" with those characters replaced by spaces.
        
    Example:
    Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
    Output : This is her first day at this place. Please, Be nice to her. 
    (the extra spaces left behind are collapsed later by remove_whitespace)
    """
    
    # Replace the literal "\n" and "\t" sequences first, then the real control
    # characters and any stray backslashes. A space is used as the replacement
    # so that the words on either side are not joined together.
    formatted_text = (text.replace('\\n', ' ').replace('\\t', ' ')
                          .replace('\n', ' ').replace('\t', ' ')
                          .replace('\r', ' ').replace('\\', ' '))
    return formatted_text

def remove_whitespace(text):
    """ This function will collapse 
        extra whitespace in the text
    arguments:
        text: input of type "String". 
                    
    return:
        value: "text" with runs of whitespace reduced to a single space.
        
    Example:
    Input : How   are   you   doing   ?
    Output : How are you doing ?     
        
    """
    # Collapse every run of whitespace into a single space.
    without_whitespace = re.sub(r'\s+', ' ', text)
    # Some reviews have no space after '?' or ')'. Pad these characters with
    # spaces so that two adjacent words are not treated as one token.
    text = without_whitespace.replace('?', ' ? ').replace(')', ') ')
    return text

In [7]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have",
}
# The code for expanding contraction words
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    """Expand shortened words to their full form,
       e.g. don't to do not.
    
       arguments:
            text: input of type "String" (expected in lower case,
                  since the keys of the mapping are lower case).
            contraction_mapping: dict mapping a contraction to its expansion.
         
       return:
            value: Text with shortened words expanded.
        
       Example: 
       Input : ain't aren't can't 'cause can't've
       Output : is not are not cannot because cannot have
    
     """
    # Tokenizing text into tokens on single spaces.
    list_of_tokens = text.split(' ')

    # Replace each token that is a key of the mapping with its expansion;
    # every other token is kept unchanged.
    expanded_tokens = [contraction_mapping.get(token, token) for token in list_of_tokens]

    # Converting list of tokens to String.
    return ' '.join(expanded_tokens)

def removing_special_characters(text):
    """Replace every run of characters outside the allowed set with a single space.
       Allowed: letters, digits and the punctuation : $ - , % . ? !
       as these carry meaning in the text provided.
    
    arguments:
         text: input of type "String".
         
    return:
        value: Text with the unwanted special characters replaced by spaces.
        
    Example: 
    Input : Hello, K-a-j-a-l. Thi*s is $100.05 : the payment that you will receive! (Is this okay?) 
    Output :  Hello, K-a-j-a-l. Thi s is $100.05 : the payment that you will receive! Is this okay? 
    
   """
    # The hyphen is escaped so that it is matched literally; an unescaped "$-,"
    # would be read as the character range from "$" to ",".
    formatted_text = re.sub(r"[^a-zA-Z0-9:$\-,%.?!]+", ' ', text)
    return formatted_text
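
Inside a regex character class, a hyphen placed between two characters defines a range unless it is escaped or sits first or last in the class. The sketch below, on a made-up sentence, compares a class containing an unescaped `$-,` with one where the hyphen is escaped; the variable names are illustrative only.

```python
import re

text = "pay (base) is 10-15% *above* market"

# Unescaped: "$-," is the range from "$" to ",", which also admits
# % & ' ( ) * + and leaves the hyphen itself outside the allowed set.
unescaped = re.sub(r"[^a-zA-Z0-9:$-,%.?!]+", " ", text)

# Escaped: "\-" is a literal hyphen, so only the listed symbols are kept.
escaped = re.sub(r"[^a-zA-Z0-9:$\-,%.?!]+", " ", text)

print(unescaped)  # pay (base) is 10 15% *above* market
print(escaped)    # pay base is 10-15% above market
```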

In [8]:
def joining(text):
    return ' '.join(text)
def to_lower(text):
    return text.lower()
def removing_x000d(text):
    text = text.replace(' x000d ', '')
    return text

In [9]:
df['Pros']=df['Pros'].apply(to_lower).apply(split_sentences_after_new_line)
df['Cons']=df['Cons'].apply(to_lower).apply(split_sentences_after_new_line)
df['Pros']=df['Pros'].apply(split_sentences_after_fullstop).apply(split_sentences_after_semicolon)
df['Cons']=df['Cons'].apply(split_sentences_after_fullstop).apply(split_sentences_after_semicolon)
df

Unnamed: 0,Review_type,Date,Occupation,Employee Status,Duration,Pros,Cons,Recommended Or Not,CEO Approval,Business Outlook,Rating
0,DBS Bank Reviews,27 Jun 2022,Software Engineer,Current Employee,more than 5 years,[good work life balance and salary],"[hierarchical, does not listen to feedback fro...",Recommended,Recommended,Recommended,4.0
1,DBS Bank Reviews,27 Jun 2022,Senior Associate,Current Employee,,[annual leave granted per year is good],[number of carry forward leave allow is too li...,Neutral,Neutral,Neutral,4.0
2,DBS Bank Reviews,27 Jun 2022,Senior Associate,Current Employee,,[infrastructure is good and the ambience],"[no work life balance., managers force to work...",Not Recommended,Neutral,Neutral,1.0
3,DBS Bank Reviews,27 Jun 2022,Senior Data Analyst,Former Contractor,more than 1 year,[opportunities to work on interesting and chal...,"[some politics, heavy workload on some projects]",Neutral,Neutral,Neutral,3.0
4,DBS Bank Reviews,27 Jun 2022,Graduate Associate,Current Employee,more than 1 year,[you get exposed to multiple skillsets],[you need to get exposed to multiple skillsets],Neutral,Neutral,Neutral,4.0
...,...,...,...,...,...,...,...,...,...,...,...
4593,DBS Bank Reviews,6 Nov 2009,Summer Intern,Former Employee,,"[the environment at dbs is very friendly, my c...","[no much to criticize or comment about, i gues...",Recommended,Recommended,Neutral,5.0
4594,DBS Bank Reviews,20 Oct 2009,Analyst,Former Employee,,[the brand name of the company],[my boss was a slave driver who did not promot...,Not Recommended,Neutral,Neutral,2.0
4595,DBS Bank Reviews,27 Aug 2009,Assistant Vice President,Former Employee,,"[the brand name in the asia arena, stability, ...","[very bureaucratic, seniority matters the most...",Neutral,Neutral,Neutral,3.0
4596,DBS Bank Reviews,3 Aug 2009,Assistant Vice President,Current Employee,,"[people are not the smartest around, hence ver...",[far too many deadwood esp in middle managemen...,Neutral,Not Recommended,Neutral,2.0


### Appending index to review

In [10]:
pros_and_idx_lis = []
cons_and_idx_lis = []

for i in range(len(df)):
    date = [df.loc[i, 'Date']] * len(df.loc[i, 'Pros']) 
    occupation = [df.loc[i, 'Occupation']] * len(df.loc[i, 'Pros'])
    employee_status = [df.loc[i, 'Employee Status']] * len(df.loc[i, 'Pros'])
    duration = [df.loc[i, 'Duration']] * len(df.loc[i, 'Pros'])
    pros = df.loc[i, 'Pros']
    recommended_or_not = [df.loc[i, 'Recommended Or Not']] * len(df.loc[i, 'Pros']) 
    ceo_approval = [df.loc[i, 'CEO Approval']] * len(df.loc[i, 'Pros'])
    business_outlook = [df.loc[i, 'Business Outlook']] * len(df.loc[i, 'Pros'])
    index = [i] * len(df.loc[i, 'Pros'])
    pros_and_idx_lis.append(list(zip(date, occupation, employee_status, duration, pros, recommended_or_not,\
                                     ceo_approval, business_outlook, index)))
    
    date = [df.loc[i, 'Date']] * len(df.loc[i, 'Cons'])
    occupation = [df.loc[i, 'Occupation']] * len(df.loc[i, 'Cons'])
    employee_status = [df.loc[i, 'Employee Status']] * len(df.loc[i, 'Cons'])
    duration = [df.loc[i, 'Duration']] * len(df.loc[i, 'Cons'])
    cons = df.loc[i, 'Cons']
    recommended_or_not = [df.loc[i, 'Recommended Or Not']] * len(df.loc[i, 'Cons']) 
    ceo_approval = [df.loc[i, 'CEO Approval']] * len(df.loc[i, 'Cons'])
    business_outlook = [df.loc[i, 'Business Outlook']] * len(df.loc[i, 'Cons'])
    index = [i] * len(df.loc[i, 'Cons'])
    cons_and_idx_lis.append(list(zip(date, occupation, employee_status, duration, cons, recommended_or_not,\
                                     ceo_approval, business_outlook, index)))
pros_and_idx_lis

[[('27 Jun 2022 ',
   ' Software Engineer',
   'Current Employee',
   ' more than 5 years',
   'good work life balance and salary',
   'Recommended',
   'Recommended',
   'Recommended',
   0)],
 [('27 Jun 2022 ',
   ' Senior Associate',
   'Current Employee',
   nan,
   'annual leave granted per year is good',
   'Neutral',
   'Neutral',
   'Neutral',
   1)],
 [('27 Jun 2022 ',
   ' Senior Associate',
   'Current Employee',
   nan,
   'infrastructure is good and the ambience',
   'Not Recommended',
   'Neutral',
   'Neutral',
   2)],
 [('27 Jun 2022 ',
   ' Senior Data Analyst',
   'Former Contractor',
   ' more than 1 year',
   'opportunities to work on interesting and challenging projects',
   'Neutral',
   'Neutral',
   'Neutral',
   3)],
 [('27 Jun 2022 ',
   ' Graduate Associate',
   'Current Employee',
   ' more than 1 year',
   'you get exposed to multiple skillsets',
   'Neutral',
   'Neutral',
   'Neutral',
   4)],
 [('27 Jun 2022 ',
   ' Engineer',
   'Current Employee',
   n

In [11]:
cons_and_idx_lis

[[('27 Jun 2022 ',
   ' Software Engineer',
   'Current Employee',
   ' more than 5 years',
   'hierarchical, does not listen to feedback from junior staff',
   'Recommended',
   'Recommended',
   'Recommended',
   0)],
 [('27 Jun 2022 ',
   ' Senior Associate',
   'Current Employee',
   nan,
   'number of carry forward leave allow is too little',
   'Neutral',
   'Neutral',
   'Neutral',
   1)],
 [('27 Jun 2022 ',
   ' Senior Associate',
   'Current Employee',
   nan,
   'no work life balance.',
   'Not Recommended',
   'Neutral',
   'Neutral',
   2),
  ('27 Jun 2022 ',
   ' Senior Associate',
   'Current Employee',
   nan,
   'managers force to work from office..they say sez rules and all 5days should work only in office and even on weekends if needed u need to travel to office.🙏🙏_x000d_',
   'Not Recommended',
   'Neutral',
   'Neutral',
   2),
  ('27 Jun 2022 ',
   ' Senior Associate',
   'Current Employee',
   nan,
   'no desks/cabin for seating..feels like call centre job_x000d_'

In [12]:
pros_and_idx_lis2 = []
cons_and_idx_lis2 = []

for i in pros_and_idx_lis:
    pros_and_idx_lis2.extend(i)
for i in cons_and_idx_lis:
    cons_and_idx_lis2.extend(i)

pros_and_idx_lis2

[('27 Jun 2022 ',
  ' Software Engineer',
  'Current Employee',
  ' more than 5 years',
  'good work life balance and salary',
  'Recommended',
  'Recommended',
  'Recommended',
  0),
 ('27 Jun 2022 ',
  ' Senior Associate',
  'Current Employee',
  nan,
  'annual leave granted per year is good',
  'Neutral',
  'Neutral',
  'Neutral',
  1),
 ('27 Jun 2022 ',
  ' Senior Associate',
  'Current Employee',
  nan,
  'infrastructure is good and the ambience',
  'Not Recommended',
  'Neutral',
  'Neutral',
  2),
 ('27 Jun 2022 ',
  ' Senior Data Analyst',
  'Former Contractor',
  ' more than 1 year',
  'opportunities to work on interesting and challenging projects',
  'Neutral',
  'Neutral',
  'Neutral',
  3),
 ('27 Jun 2022 ',
  ' Graduate Associate',
  'Current Employee',
  ' more than 1 year',
  'you get exposed to multiple skillsets',
  'Neutral',
  'Neutral',
  'Neutral',
  4),
 ('27 Jun 2022 ',
  ' Engineer',
  'Current Employee',
  nan,
  'great learning and nice culture',
  'Recommende

In [13]:
# date, occupation, employee_status, duration, cons, recommended_or_not,\
#                                      ceo_approval, business_outlook, index
pros_and_idx_lis_df = pd.DataFrame(pros_and_idx_lis2, columns = ['Date', 'Occupation', 'Employee Status', 'Duration', 'Pros',\
                                                                 'Recommended Or Not', 'CEO Approval', 'Business Outlook', 'Review Index'])
cons_and_idx_lis_df = pd.DataFrame(cons_and_idx_lis2, columns = ['Date', 'Occupation', 'Employee Status', 'Duration', 'Cons',\
                                                                 'Recommended Or Not', 'CEO Approval', 'Business Outlook', 'Review Index'])
pros_and_idx_lis_df

Unnamed: 0,Date,Occupation,Employee Status,Duration,Pros,Recommended Or Not,CEO Approval,Business Outlook,Review Index
0,27 Jun 2022,Software Engineer,Current Employee,more than 5 years,good work life balance and salary,Recommended,Recommended,Recommended,0
1,27 Jun 2022,Senior Associate,Current Employee,,annual leave granted per year is good,Neutral,Neutral,Neutral,1
2,27 Jun 2022,Senior Associate,Current Employee,,infrastructure is good and the ambience,Not Recommended,Neutral,Neutral,2
3,27 Jun 2022,Senior Data Analyst,Former Contractor,more than 1 year,opportunities to work on interesting and chall...,Neutral,Neutral,Neutral,3
4,27 Jun 2022,Graduate Associate,Current Employee,more than 1 year,you get exposed to multiple skillsets,Neutral,Neutral,Neutral,4
...,...,...,...,...,...,...,...,...,...
7682,22 Aug 2008,Vice President - DII,Former Employee,,people on the outside think it is a good place...,Not Recommended,Neutral,Neutral,4597
7683,22 Aug 2008,Vice President - DII,Former Employee,,it is also well repected name to have on the r...,Not Recommended,Neutral,Neutral,4597
7684,22 Aug 2008,Vice President - DII,Former Employee,,senior managemnt is usually top calibre,Not Recommended,Neutral,Neutral,4597
7685,22 Aug 2008,Vice President - DII,Former Employee,,i assume as a glc they do have to hire well at...,Not Recommended,Neutral,Neutral,4597
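
The flattening done in cells 10 to 13 (one row per sentence, with the review number carried along) can also be sketched with `DataFrame.explode`. This is an alternative under stated assumptions, not the notebook's code: `toy` is a made-up frame standing in for `df` after cell 9, with only three of its columns.

```python
import pandas as pd

# Toy frame standing in for df after the sentence split (made-up rows).
toy = pd.DataFrame({
    "Date": ["27 Jun 2022", "27 Jun 2022"],
    "Occupation": ["Software Engineer", "Senior Associate"],
    "Pros": [["good pay"], ["good leave", "nice office"]],
})

pros_long = (
    toy.explode("Pros")    # one row per sentence; the row index is repeated
       .reset_index()      # move the repeated index into a column named "index"
       .rename(columns={"index": "Review Index"})
)
print(pros_long)
```

All other columns are repeated automatically for each sentence, so the per-column list multiplication is not needed.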


### Preprocessing all Pros and Cons
Preprocessing steps:
- Remove new lines and tabs
- Expand contractions
- Remove extra white space
- Remove special characters (e.g., emojis)
- Convert to lower case (already applied before the sentences were split)

Steps deliberately skipped:
- Lemmatisation
- Stop-word removal
- Spelling correction
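
The steps listed above can be viewed as one ordered pipeline. The sketch below uses simplified stand-ins for the notebook helpers (all function names and the two-entry `TOY_CONTRACTIONS` map are assumptions for illustration). Lower-casing runs first because the contraction keys are lower case.

```python
import re
from functools import reduce

# Tiny stand-in for CONTRACTION_MAP (illustrative only).
TOY_CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def drop_newlines(text):
    # Replace newlines, carriage returns and tabs with a space.
    return re.sub(r"[\n\r\t]", " ", text)

def expand(text):
    # Expand tokens that exactly match a key of the toy map.
    return " ".join(TOY_CONTRACTIONS.get(tok, tok) for tok in text.split(" "))

def squeeze_spaces(text):
    # Collapse runs of whitespace into one space.
    return re.sub(r"\s+", " ", text)

def keep_basic_chars(text):
    # Drop anything that is not a letter, digit, space or basic punctuation.
    return re.sub(r"[^a-z0-9:$\-,%.?! ]+", "", text)

STEPS = [str.lower, drop_newlines, expand, squeeze_spaces, keep_basic_chars]

def preprocess(text):
    # Apply each step in order, then trim the ends.
    return reduce(lambda acc, step: step(acc), STEPS, text).strip()

print(preprocess("It's great,\tbut I don't like the hours 🙏"))
```

Keeping the steps in a list makes the order explicit and easy to change, for example when testing whether a step affects the topics found.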

In [14]:
pros_and_idx_lis_df['Pros'] = pros_and_idx_lis_df['Pros'].apply(remove_newlines_tabs)
cons_and_idx_lis_df['Cons'] = cons_and_idx_lis_df['Cons'].apply(remove_newlines_tabs)

pros_and_idx_lis_df['Pros'] = pros_and_idx_lis_df['Pros'].apply(expand_contractions).apply(remove_whitespace).apply(removing_special_characters)#.apply(joining)
cons_and_idx_lis_df['Cons'] = cons_and_idx_lis_df['Cons'].apply(expand_contractions).apply(remove_whitespace).apply(removing_special_characters)#.apply(joining)

In [15]:
pros_and_idx_lis_df

Unnamed: 0,Date,Occupation,Employee Status,Duration,Pros,Recommended Or Not,CEO Approval,Business Outlook,Review Index
0,27 Jun 2022,Software Engineer,Current Employee,more than 5 years,good work life balance and salary,Recommended,Recommended,Recommended,0
1,27 Jun 2022,Senior Associate,Current Employee,,annual leave granted per year is good,Neutral,Neutral,Neutral,1
2,27 Jun 2022,Senior Associate,Current Employee,,infrastructure is good and the ambience,Not Recommended,Neutral,Neutral,2
3,27 Jun 2022,Senior Data Analyst,Former Contractor,more than 1 year,opportunities to work on interesting and chall...,Neutral,Neutral,Neutral,3
4,27 Jun 2022,Graduate Associate,Current Employee,more than 1 year,you get exposed to multiple skillsets,Neutral,Neutral,Neutral,4
...,...,...,...,...,...,...,...,...,...
7682,22 Aug 2008,Vice President - DII,Former Employee,,people on the outside think it is a good place...,Not Recommended,Neutral,Neutral,4597
7683,22 Aug 2008,Vice President - DII,Former Employee,,it is also well repected name to have on the r...,Not Recommended,Neutral,Neutral,4597
7684,22 Aug 2008,Vice President - DII,Former Employee,,senior managemnt is usually top calibre,Not Recommended,Neutral,Neutral,4597
7685,22 Aug 2008,Vice President - DII,Former Employee,,i assume as a glc they do have to hire well at...,Not Recommended,Neutral,Neutral,4597


### Make sure that the serial number is distinct for all verbatims

The serial number uses 2 decimal places rather than 1, since some verbatims take up more than 10 rows once they have been split on new lines, full stops and semicolons.
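
As a sketch of this numbering scheme, the position of each sentence within its review can be taken from `groupby(...).cumcount()` instead of a row-by-row loop. The `toy` frame and the `Serial Label` column are assumptions for illustration; the string label is an optional variant that avoids floating-point rounding, which repeated addition of 0.01 can accumulate.

```python
import pandas as pd

# Made-up sentences: review 1 was split into three rows.
toy = pd.DataFrame({"Review Index": [0, 1, 1, 1, 2],
                    "Pros": ["a", "b", "c", "d", "e"]})

# Position of each sentence within its review, starting at 1.
position = toy.groupby("Review Index").cumcount() + 1

# Float form: review index plus position / 100 (distinct for up to 99 sentences).
toy["Serial Number"] = (toy["Review Index"] + position / 100).round(2)

# String form (not in the notebook): "<review>.<two-digit position>".
toy["Serial Label"] = (toy["Review Index"].astype(str) + "."
                       + position.astype(str).str.zfill(2))
print(toy)
```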

In [16]:
pros_and_idx_lis_df['Increment'] = pd.DataFrame([0.01] * len(pros_and_idx_lis_df))
for i in range(len(pros_and_idx_lis_df)-1):
    if pros_and_idx_lis_df.loc[i+1, 'Review Index'] == pros_and_idx_lis_df.loc[i, 'Review Index']:
        pros_and_idx_lis_df.loc[i+1, 'Increment'] += pros_and_idx_lis_df.loc[i, 'Increment']
pros_and_idx_lis_df['Serial Number'] = pros_and_idx_lis_df['Review Index'] + pros_and_idx_lis_df['Increment']
pros_and_idx_lis_df = pros_and_idx_lis_df.drop(['Review Index', 'Increment'], axis=1)
pros_and_idx_lis_df

Unnamed: 0,Date,Occupation,Employee Status,Duration,Pros,Recommended Or Not,CEO Approval,Business Outlook,Serial Number
0,27 Jun 2022,Software Engineer,Current Employee,more than 5 years,good work life balance and salary,Recommended,Recommended,Recommended,0.01
1,27 Jun 2022,Senior Associate,Current Employee,,annual leave granted per year is good,Neutral,Neutral,Neutral,1.01
2,27 Jun 2022,Senior Associate,Current Employee,,infrastructure is good and the ambience,Not Recommended,Neutral,Neutral,2.01
3,27 Jun 2022,Senior Data Analyst,Former Contractor,more than 1 year,opportunities to work on interesting and chall...,Neutral,Neutral,Neutral,3.01
4,27 Jun 2022,Graduate Associate,Current Employee,more than 1 year,you get exposed to multiple skillsets,Neutral,Neutral,Neutral,4.01
...,...,...,...,...,...,...,...,...,...
7682,22 Aug 2008,Vice President - DII,Former Employee,,people on the outside think it is a good place...,Not Recommended,Neutral,Neutral,4597.02
7683,22 Aug 2008,Vice President - DII,Former Employee,,it is also well repected name to have on the r...,Not Recommended,Neutral,Neutral,4597.03
7684,22 Aug 2008,Vice President - DII,Former Employee,,senior managemnt is usually top calibre,Not Recommended,Neutral,Neutral,4597.04
7685,22 Aug 2008,Vice President - DII,Former Employee,,i assume as a glc they do have to hire well at...,Not Recommended,Neutral,Neutral,4597.05


In [17]:
pros_and_idx_lis_df['Pros'] = pros_and_idx_lis_df['Pros'].apply(removing_x000d)

In [18]:
pros_and_idx_lis_df.to_excel('../Employee Reviews/DBS Reviews/All DBS Reviews/DBS_pros_reviews.xlsx', index=False)

In [19]:
pros_and_idx_lis_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7687 entries, 0 to 7686
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                7687 non-null   object 
 1   Occupation          7687 non-null   object 
 2   Employee Status     7687 non-null   object 
 3   Duration            4841 non-null   object 
 4   Pros                7687 non-null   object 
 5   Recommended Or Not  7687 non-null   object 
 6   CEO Approval        7687 non-null   object 
 7   Business Outlook    7687 non-null   object 
 8   Serial Number       7687 non-null   float64
dtypes: float64(1), object(8)
memory usage: 540.6+ KB


In [20]:
cons_and_idx_lis_df['Increment'] = pd.DataFrame([0.01] * len(cons_and_idx_lis_df))
for i in range(len(cons_and_idx_lis_df)-1):
    if cons_and_idx_lis_df.loc[i+1, 'Review Index'] == cons_and_idx_lis_df.loc[i, 'Review Index']:
        cons_and_idx_lis_df.loc[i+1, 'Increment'] += cons_and_idx_lis_df.loc[i, 'Increment']
cons_and_idx_lis_df['Serial Number'] = cons_and_idx_lis_df['Review Index'] + cons_and_idx_lis_df['Increment']
cons_and_idx_lis_df = cons_and_idx_lis_df.drop(['Review Index', 'Increment'], axis=1)
cons_and_idx_lis_df

Unnamed: 0,Date,Occupation,Employee Status,Duration,Cons,Recommended Or Not,CEO Approval,Business Outlook,Serial Number
0,27 Jun 2022,Software Engineer,Current Employee,more than 5 years,"hierarchical, does not listen to feedback from...",Recommended,Recommended,Recommended,0.01
1,27 Jun 2022,Senior Associate,Current Employee,,number of carry forward leave allow is too little,Neutral,Neutral,Neutral,1.01
2,27 Jun 2022,Senior Associate,Current Employee,,no work life balance.,Not Recommended,Neutral,Neutral,2.01
3,27 Jun 2022,Senior Associate,Current Employee,,managers force to work from office..they say s...,Not Recommended,Neutral,Neutral,2.02
4,27 Jun 2022,Senior Associate,Current Employee,,no desks cabin for seating..feels like call ce...,Not Recommended,Neutral,Neutral,2.03
...,...,...,...,...,...,...,...,...,...
8371,22 Aug 2008,Vice President - DII,Former Employee,,the middle management is overall farily poor,Not Recommended,Neutral,Neutral,4597.02
8372,22 Aug 2008,Vice President - DII,Former Employee,,middle management at dbs then to hoard informa...,Not Recommended,Neutral,Neutral,4597.03
8373,22 Aug 2008,Vice President - DII,Former Employee,,cronisim and boys club mentatliy in middle ma...,Not Recommended,Neutral,Neutral,4597.04
8374,22 Aug 2008,Vice President - DII,Former Employee,,if you can get away with it you can hide in th...,Not Recommended,Neutral,Neutral,4597.05


In [21]:
cons_and_idx_lis_df['Cons'] = cons_and_idx_lis_df['Cons'].apply(removing_x000d)

In [22]:
cons_and_idx_lis_df.to_excel('../Employee Reviews/DBS Reviews/All DBS Reviews/DBS_cons_reviews.xlsx', index=False)