# Data preprocessing

#### This notebook contains part of the data preprocessing code: it cleans and prepares the text columns.
#### The notebook writes the processed data to a file named *all_processed.csv*.
#### Be aware that running the entire notebook takes a while, depending on the performance of your computer. On our computer it took around 1.5 hours.

In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


In [2]:
df = pd.read_csv("employee_reviews.csv")

In [3]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,company,location,dates,job-title,summary,pros,cons,advice-to-mgmt,overall-ratings,work-balance-stars,culture-values-stars,carrer-opportunities-stars,comp-benefit-stars,senior-mangemnet-stars,helpful-count,link
0,1,google,none,"Dec 11, 2018",Current Employee - Anonymous Employee,Best Company to work for,People are smart and friendly,Bureaucracy is slowing things down,none,5.0,4.0,5.0,5.0,4.0,5.0,0,https://www.glassdoor.com/Reviews/Google-Revie...


### Some rows contained empty values, and some contained the string "*none*". We wrote a function that removes these rows for a given column

In [4]:
df.count()[0]

67529

In [5]:
def clean_by_clmn(df, clmn_name):
    """Drop rows whose value in clmn_name is "none", empty or missing."""
    df = df[df[clmn_name] != "none"]
    df = df[df[clmn_name] != ""]
    df = df[df[clmn_name].notna()]
    return df

In [6]:
df = clean_by_clmn(df, "summary")
df = clean_by_clmn(df, "pros")
df = clean_by_clmn(df, "cons")

After applying the function to the *summary*, *pros* and *cons* columns, only 130 rows had been removed (67529 before, 67399 after).

In [7]:
df.count()[0]

67399
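The same filtering can also be written as one vectorized pass over several columns. The following is a minimal sketch, not the notebook's own code: `drop_placeholder_rows` is a hypothetical helper and the small frame is made-up data for illustration.

```python
import numpy as np
import pandas as pd

def drop_placeholder_rows(df, columns, placeholders=("none", "")):
    """Return a copy of df without rows that are "none", empty or NaN in any of `columns`."""
    # Rows where any selected column holds a placeholder string
    is_placeholder = df[columns].isin(list(placeholders)).any(axis=1)
    # Rows where any selected column is missing
    is_missing = df[columns].isna().any(axis=1)
    return df[~(is_placeholder | is_missing)].copy()

# Made-up example data
example = pd.DataFrame({
    "summary": ["Great", "none", "Good", "Fine"],
    "pros": ["Smart people", "Food", "", "Pay"],
    "cons": ["Slow", "Hours", "Commute", np.nan],
})
cleaned_example = drop_placeholder_rows(example, ["summary", "pros", "cons"])
print(len(example), "->", len(cleaned_example))
```

Returning an explicit `.copy()` is a design choice in this sketch: it makes the result an independent frame, so adding columns to it later does not touch a slice of the original.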

### We used the Natural Language Toolkit for removing stopwords and stemming

In [8]:
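import re  # regular expressions
import time  # explicit import; clean_text_serie() calls time.time()
import nltk
nltk.download("stopwords")  # common words such as "the", "in", "you", "for"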
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stopwords = set(stopwords.words("english"))
ps = PorterStemmer()

[nltk_data] Downloading package stopwords to /home/ras/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### We also wanted to remove company names from the reviews so that the text is not biased by mentions of any of the companies we trained on

In [9]:
companies = [c.lower() for c in df["company"].unique()]
companies

['google', 'amazon', 'facebook', 'netflix', 'apple', 'microsoft']

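### The function below is the reason the notebook takes a long time to run on all the data
In the *clean_text_serie()* function the text data goes through the following steps:

1. Making all characters lowercase.
2. Removing all non-a-z characters.
3. Removal of stop words.
4. Removal of company names.
5. Stemming of the remaining words.

The function also records the character length, word count, stopword count and stopword frequency of the original text. Due to the long processing time of the function, a time estimator was also added.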
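The cleaning steps, including the company-name filter, can be sketched on a single string as follows. This is an illustration under stated assumptions rather than the notebook's implementation: `clean_text` is a hypothetical helper, `EXAMPLE_STOPWORDS` is a tiny stand-in for NLTK's English stopword list, and the `stem` parameter defaults to the identity function so the block runs without NLTK.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list
EXAMPLE_STOPWORDS = {"is", "for", "the", "to", "and", "are"}
EXAMPLE_COMPANIES = {"google", "amazon", "facebook", "netflix", "apple", "microsoft"}

def clean_text(text, stem=lambda word: word):
    """Clean one string and return the processed text plus simple statistics."""
    # Lowercase, replace every non-a-z character with a space, split into words
    words = re.sub("[^a-z]", " ", text.lower()).split()
    word_count = len(words)

    # Drop stopwords and count how many were removed
    kept = [word for word in words if word not in EXAMPLE_STOPWORDS]
    stopword_count = word_count - len(kept)
    stopword_freq = stopword_count / word_count if word_count else 0

    # Drop company names, then stem whatever is left
    processed = " ".join(stem(word) for word in kept if word not in EXAMPLE_COMPANIES)
    return {
        "processed": processed,
        "char_length": len(text),
        "word_count": word_count,
        "stopword_count": stopword_count,
        "stopword_freq": stopword_freq,
    }

result = clean_text("Bob is, happy for Facebook!")
print(result["processed"])  # bob happy (identity stem, so no stemming here)
```

Passing `PorterStemmer().stem` as `stem` would add the stemming step used in the notebook.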

In [10]:
def clean_text_serie(df, serie_name):
    print("clean_text_serie() is going to take some time...")
    
    processed_clmn_key = serie_name + "_processed"
    char_length_key = serie_name + "_char_length"
    word_count_key = serie_name + "_word_count"
    stopword_count_key = serie_name + "_stopword_count"
    stopword_freq_key = serie_name + "_stopword_freq"
    
    df[processed_clmn_key] = ""
    df[char_length_key] = np.nan
    df[word_count_key] = np.nan
    df[stopword_count_key] = np.nan
    df[stopword_freq_key] = np.nan
    
    
    j = 1000
    start_time = time.time()
    df_count = len(df)
    # Track progress by position: index labels have gaps after clean_by_clmn()
    for n, (i, row) in enumerate(df.iterrows(), start=1):
        
        if n == j:
            up_time = (time.time() - start_time)/60
            
            remaining = int((up_time * (df_count/n)) - up_time)
            
            print("{0}/{1} - {2} minutes since start - estimate: {3} minutes left"
                  .format(n,df_count,int(up_time),remaining))
            j+=1000
            
        cleaned = df[serie_name][i]
        # "Bob is, happy for Facebook!"
        
        char_length = len(cleaned)
        
        cleaned = cleaned.lower()
        # "bob is, happy for facebook!"
        cleaned = re.sub("[^a-z]", " ", cleaned)
        # "bob is  happy for facebook "
        cleaned = cleaned.split()
        # ["bob", "is", "happy", "for", "facebook"]
        
        word_count = len(cleaned)
        
        cleaned = [word for word in cleaned if word not in stopwords]
        # ["bob", "happy", "facebook"]
        
        stopword_count = word_count - len(cleaned)
        stopword_freq = 0
        if stopword_count > 0:
            stopword_freq = stopword_count / word_count
            
        
        cleaned = [ps.stem(word) for word in cleaned if word not in companies]
        # ["bob", "happi"]
        
        # .at writes straight into df; chained df[col][i] = ... may assign to a temporary copy
        df.at[i, processed_clmn_key] = " ".join(cleaned)
        # "bob happi"
        
        df.at[i, char_length_key] = char_length # character length of original text
        df.at[i, word_count_key] = word_count # word count of original text
        df.at[i, stopword_count_key] = stopword_count # number of stopwords in original text
        df.at[i, stopword_freq_key] = stopword_freq # frequency of stopwords in original text
    
    print("clean_text_serie() is finished!")
    print("It took a total of {0} minutes to process '{1}' column"
          .format((time.time() - start_time)/60, serie_name))
    return df
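The time estimate in the loop is a linear extrapolation: if `done` of `total` rows took `elapsed` minutes, the whole run is expected to take `elapsed * total / done`, and the remainder is that value minus `elapsed`. A minimal sketch of the arithmetic, using a hypothetical helper `estimate_remaining` and made-up numbers:

```python
def estimate_remaining(elapsed, done, total):
    """Linearly extrapolate the remaining time from the progress so far."""
    if done <= 0:
        raise ValueError("done must be positive")
    expected_total = elapsed * (total / done)
    return expected_total - elapsed

# Made-up numbers: 1000 of 67399 rows processed after 1.5 minutes
remaining_minutes = estimate_remaining(1.5, 1000, 67399)
print(round(remaining_minutes, 1))
```

The extrapolation assumes every row costs roughly the same. Reviews differ a lot in length, so the figure should be read as a rough guide only.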

In [11]:
serie = df["summary"]
serie.count()

67399

In [12]:
df.head()["summary"]

0                             Best Company to work for
1    Moving at the speed of light, burn out is inev...
2    Great balance between big-company security and...
3    The best place I've worked and also the most d...
4                      Unique, one of a kind dream job
Name: summary, dtype: object

In [13]:
df = clean_text_serie(df, "summary")

clean_text_serie() is going to take some time...


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


clean_text_serie() is finished!
It took a total of 0.02358227570851644 minutes to process 'summary' column


In [14]:
df.head(1)

Unnamed: 0.1,Unnamed: 0,company,location,dates,job-title,summary,pros,cons,advice-to-mgmt,overall-ratings,...,carrer-opportunities-stars,comp-benefit-stars,senior-mangemnet-stars,helpful-count,link,summary_processed,summary_char_length,summary_word_count,summary_stopword_count,summary_stopword_freq
0,1,google,none,"Dec 11, 2018",Current Employee - Anonymous Employee,Best Company to work for,People are smart and friendly,Bureaucracy is slowing things down,none,5.0,...,5.0,4.0,5.0,0,https://www.glassdoor.com/Reviews/Google-Revie...,best compani work,24.0,5.0,2.0,0.4


In [15]:
df = clean_text_serie(df, "pros")

clean_text_serie() is going to take some time...


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


clean_text_serie() is finished!
It took a total of 0.023672839005788166 minutes to process 'pros' column


In [16]:
df = clean_text_serie(df, "cons")

clean_text_serie() is going to take some time...


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


clean_text_serie() is finished!
It took a total of 0.021798606713612875 minutes to process 'cons' column


In [17]:
df["text"] = df["summary"].str.cat(df["pros"].str.cat(df["cons"],sep=" "),sep=" ")
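`Series.str.cat` also accepts a list of other columns, which avoids the nested call. A minimal sketch on made-up data follows. The `na_rep` argument only matters if missing values are still present, which should not be the case here after `clean_by_clmn()`.

```python
import numpy as np
import pandas as pd

# Made-up example data with one missing value
example = pd.DataFrame({
    "summary": ["Best company", "Dream job"],
    "pros": ["Smart people", np.nan],
    "cons": ["Bureaucracy", "Long commute"],
})

# Join the three columns row by row; na_rep replaces missing values with ""
example["text"] = example["summary"].str.cat(
    [example["pros"], example["cons"]], sep=" ", na_rep=""
)
print(example["text"].tolist())
```

Without `na_rep`, a missing value in any of the joined columns makes the whole concatenated result for that row missing.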

In [18]:
df = clean_text_serie(df, "text")

clean_text_serie() is going to take some time...


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


clean_text_serie() is finished!
It took a total of 0.024330619970957437 minutes to process 'text' column


### Saving the data to "*all_processed.csv*"

In [19]:
df.to_csv("all_processed.csv")
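The `Unnamed: 0` and `Unnamed: 0.1` columns in the `df.head()` output are what pandas typically produces when a frame is saved with its index and read back without `index_col`. A minimal sketch of the effect, using an in-memory buffer and made-up data instead of the real file:

```python
import io
import pandas as pd

# Made-up example data
example = pd.DataFrame({"company": ["google", "amazon"], "rating": [5.0, 4.0]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
example.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
example.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)
```

Whether to pass `index=False` when saving depends on whether later steps rely on the original row labels.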

In [20]:
df.head()

Unnamed: 0.1,Unnamed: 0,company,location,dates,job-title,summary,pros,cons,advice-to-mgmt,overall-ratings,...,cons_char_length,cons_word_count,cons_stopword_count,cons_stopword_freq,text,text_processed,text_char_length,text_word_count,text_stopword_count,text_stopword_freq
0,1,google,none,"Dec 11, 2018",Current Employee - Anonymous Employee,Best Company to work for,People are smart and friendly,Bureaucracy is slowing things down,none,5.0,...,34.0,5.0,2.0,0.4,Best Company to work for People are smart and ...,best compani work peopl smart friendli bureauc...,89.0,15.0,6.0,0.4
1,2,google,"Mountain View, CA","Jun 21, 2013",Former Employee - Program Manager,"Moving at the speed of light, burn out is inev...","1) Food, food, food. 15+ cafes on main campus ...",1) Work/life balance. What balance? All those ...,1) Don't dismiss emotional intelligence and ad...,4.0,...,2403.0,407.0,210.0,0.515971,"Moving at the speed of light, burn out is inev...",move speed light burn inevit food food food ca...,3505.0,575.0,272.0,0.473043
2,3,google,"New York, NY","May 10, 2014",Current Employee - Software Engineer III,Great balance between big-company security and...,"* If you're a software engineer, you're among ...","* It *is* becoming larger, and with it comes g...",Keep the focus on the user. Everything else wi...,5.0,...,1064.0,179.0,79.0,0.441341,Great balance between big-company security and...,great balanc big compani secur fun fast move p...,4772.0,838.0,408.0,0.486874
3,4,google,"Mountain View, CA","Feb 8, 2015",Current Employee - Anonymous Employee,The best place I've worked and also the most d...,You can't find a more well-regarded company th...,I live in SF so the commute can take between 1...,Keep on NOT micromanaging - that is a huge ben...,5.0,...,2614.0,512.0,286.0,0.558594,The best place I've worked and also the most d...,best place work also demand find well regard c...,4243.0,821.0,461.0,0.56151
4,5,google,"Los Angeles, CA","Jul 19, 2018",Former Employee - Software Engineer,"Unique, one of a kind dream job",Google is a world of its own. At every other c...,"If you don't work in MTV (HQ), you will be giv...",Promote managers into management for their man...,5.0,...,4694.0,830.0,417.0,0.50241,"Unique, one of a kind dream job Google is a wo...",uniqu one kind dream job world everi compani l...,12902.0,2276.0,1146.0,0.503515
