# Employee work life analysis

## Introduction
This phase of the project extracts and summarizes employees' accounts of working life at a given company. We gather first-hand narratives from people who have worked at the organization, using a combination of web scraping, data extraction, and text summarization techniques, supported by popular Natural Language Processing (NLP) libraries. The aim is to produce concise summaries of these experiences that convey the working environment, challenges, and successes that define the company's culture.

## Web Scraping Quora blog posts

### Input the company name

In [5]:
company = input("Which company's employee experiences do you want to know about?")

Which company's employee experiences do you want to know about? adobe
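The search URL in the next cell interpolates the company name directly, which works for a single word such as "adobe" but not for names that contain spaces or symbols. Below is a minimal sketch of one way to escape the name first, using `urllib.parse.quote` from the standard library. The helper name `build_search_url` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import quote

def build_search_url(company: str) -> str:
    """Build the Quora search URL for a company name (hypothetical helper)."""
    # Strip stray whitespace left over from the input prompt
    query = f"what is it like working at {company.strip()}"
    # safe="" also escapes characters such as "&" and "/" that would otherwise alter the URL
    return f"https://www.quora.com/search?q={quote(query, safe='')}"

print(build_search_url(" adobe"))
print(build_search_url("AT&T"))
```

For a single-word name this produces the same URL as the hand-written f-string, so it could replace that line without changing the results for "adobe".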


In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = f"https://www.quora.com/search?q=what%20is%20it%20like%20working%20at%20{company}"

# Launch Chrome (not headless as written: a browser window will open)
driver = webdriver.Chrome()

driver.get(url)

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "qu-userSelect--text")))

# Locate the "read more" links that expand truncated answers.
# This XPath matches Quora's generated class names exactly, so it breaks if the markup changes.
button_elements = driver.find_elements(By.XPATH, "//div[@class='q-text qu-cursor--pointer QTextTruncated__StyledReadMoreLink-sc-1pev100-3 dXJUbS qt_read_more qu-color--blue_dark qu-fontFamily--sans qu-pl--tiny']")

company_blog_posts = []
# Expand the first 10 truncated answers
for button_element in button_elements[:10]:
    button_element.click()

# Get the page source after the dynamic content has loaded
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")
driver.quit()  # release the browser once the HTML has been captured

div_elements = soup.find_all('div', class_='q-box spacing_log_answer_content puppeteer_test_answer_content')
for div_element in div_elements[:5]:
    company_blog_posts.append(div_element.get_text())
    
for c in company_blog_posts:
    print("Found a matching div element:")
    print(c)
    print("---")



Found a matching div element:
My 3-month internship at Adobe was one of the best in terms of type of work, company culture, managing team, level of challenge, and perks. Since experiences can vary from team to team and position to position, I'll focus more on the company wide factors that made my time there so great.Perks: Even though I was only there for 3-months, I received most of the perks that full-timers get. Tremendous product discounts (THESE ARE SUH-WEEEEET). $365 "Wellness" buck to spend on anything health related (Gym memberships, yoga class, REI membership). Caltrain GoPass. The perks I got at Adobe are much more aligned to my lifestyle when compared to the ones I get at my current company, which seems more geared towards an older crowd (free health exams, free 24/7 k-12 tutoring). Office: Both the SJ and SF offices are absolutely beautiful and a pleasure to work in. They're very well-designed spaces inside and out that bear no resemblance to gloomy labyrinths or cubicle pr

Note that the raw blog posts have been extracted and stored in company_blog_posts.

## Pre-Processing of extracted data (Removal of stop words and tokenisation)

In my code, I use the NLTK library's stop-word list and Punkt tokenizer for text processing.
<br/> **Stop-word removal** improves the quality of textual analysis by filtering out common words that carry little semantic meaning, so that the focus falls on the more significant terms.
<br/> **Punkt Sentence Tokenizer** handles abbreviations, upper-case characters, collocations, special characters and whitespace when splitting text, which makes it particularly effective for summarisation.

These preprocessing steps contribute to the efficiency and precision of natural language processing tasks, ensuring a more refined understanding of the underlying text.
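To make the effect of stop-word removal concrete, here is a small self-contained sketch. It assumes a tiny hand-written stop list and a regular-expression tokenizer in place of NLTK's stop-word corpus and `word_tokenize`, so it runs without any downloads. The names `DEMO_STOP_WORDS` and `demo_preprocess` are illustrative only.

```python
import re

# Tiny hand-written stop list for illustration only; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"the", "was", "of", "in", "and", "a", "at", "my"}

def demo_preprocess(text: str) -> str:
    """Lower-case the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)

print(demo_preprocess("My internship at Adobe was one of the best!"))
# prints: internship adobe one best
```

The sentence shrinks from nine words to four while keeping the terms that carry its meaning.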

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def process_review(review):
    """Return (token count of the raw review, lower-cased text without stop words or punctuation)."""
    words = word_tokenize(review)
    length = len(words)
    # Keep alphanumeric tokens that are not stop words
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    summary = ' '.join(filtered_words)
    return length, summary

# Each entry is a (token_count, preprocessed_text) tuple
individual_post_pp = [process_review(review) for review in company_blog_posts]

# Raw text of the first post; this is what the summarisation models receive
cummulative_p = company_blog_posts[0]
# Token count of the first post, used later as the T5 tokenizer's max_length
total_length = individual_post_pp[0][0]
# Preprocessed text of all posts, separated by spaces
cummulative_pp = ' '.join(summary for _, summary in individual_post_pp) + ' '

# Display the preprocessed posts (each printed as a (token_count, text) tuple)
print("\nPreprocessed Individual Blogs")
for i, summary in enumerate(individual_post_pp, start=1):
    print(f"{i}: {summary}\n")




Preprocessed Individual Blogs
1: (404, 'internship adobe one best terms type work company culture managing team level challenge perks since experiences vary team team position position focus company wide factors made time even though received perks get tremendous product discounts 365 wellness buck spend anything health related gym memberships yoga class rei membership caltrain gopass perks got adobe much aligned lifestyle compared ones get current company seems geared towards older crowd free health exams free tutoring office sj sf offices absolutely beautiful pleasure work spaces inside bear resemblance gloomy labyrinths cubicle prisons also pleasant walk away caltrain also breeze find book meetings rooms realize luxury role adobe required interact entire spectrum employees engineers developers content managers sure glad everybody friendly dedicated happy importantly looked experts fields actions ramblings environment another luxury know adobe intranet extremely usable useful guess 

[nltk_data] Downloading package stopwords to C:\Users\Aishika
[nltk_data]     Nandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Aishika
[nltk_data]     Nandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Summarization using my trained model

In [12]:
from tf_keras.models import model_from_json

# Load encoder model architecture from JSON file
with open('encoder_model.json', 'r') as json_file:
    loaded_model_json = json_file.read()

# Create model from the loaded JSON
loaded_encoder_model = model_from_json(loaded_model_json)

# Load weights into the new model
loaded_encoder_model.load_weights("encoder_model_weights.h5")


with open('decoder_model.json', 'r') as json_file:
    loaded_dmodel_json = json_file.read()
loaded_decoder_model = model_from_json(loaded_dmodel_json)

# Load weights into the new model
loaded_decoder_model.load_weights("decoder_model_weights.h5")


In [15]:
import pickle
from tf_keras.preprocessing.sequence import pad_sequences
import numpy as np

def decode_sequence_loaded_model(input_seq, loaded_emodel, loaded_dmodel):
    # Encode the input as state vectors using the loaded model.
    e_out, e_h, e_c = loaded_emodel.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    
    # Populate the first word of the target sequence with the start word.
    target_seq[0, 0] = target_word_index['sostok']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        # Predict using the loaded model.
        output_tokens, h, c = loaded_dmodel.predict([target_seq] + [e_out, e_h, e_c])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]

        if sampled_token != 'eostok':
            decoded_sentence += ' ' + sampled_token

        # Exit condition: either hit max length or find stop word.
        if sampled_token == 'eostok' or len(decoded_sentence.split()) >= (30 - 1):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence


input_text = cummulative_p

# Loading the tokenizer
with open('x_tokenizer.pkl', 'rb') as f:
    x_tokenizer = pickle.load(f)
with open('y_tokenizer.pkl', 'rb') as f:
    y_tokenizer = pickle.load(f)
    
reverse_target_word_index = y_tokenizer.index_word 
reverse_source_word_index = x_tokenizer.index_word 
target_word_index = y_tokenizer.word_index

input_sequence = x_tokenizer.texts_to_sequences([input_text])
input_sequence = pad_sequences(input_sequence, maxlen=80, padding='post')
loaded_model_predictions = decode_sequence_loaded_model(input_sequence.reshape(1, 80), loaded_encoder_model, loaded_decoder_model)
print("Loaded Model Predicted summary:", loaded_model_predictions)

Loaded Model Predicted summary:  calda kronung potato bloddy hh hh hh newest doesnt senior fed toddy fulfilled ki ki value gingersnap hubby hubby sacrifice puff properties notch mastiffs bundle buiscuits microwaved swisshot batter
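The encoder input above is padded or cut to 80 tokens by `pad_sequences`. With `padding='post'` and the default `truncating='pre'`, a sequence longer than `maxlen` keeps its last `maxlen` tokens, so a long post reaches the model as its ending rather than its opening. The pure-Python sketch below mimics that rule for a single sequence. `pad_or_truncate` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Mimic the padding/truncation rule described above for one sequence."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

short = [5, 8, 2]
long = list(range(1, 11))  # token ids 1..10

print(pad_or_truncate(short, 5))                    # [5, 8, 2, 0, 0]
print(pad_or_truncate(long, 5))                     # [6, 7, 8, 9, 10]
print(pad_or_truncate(long, 5, truncating="post"))  # [1, 2, 3, 4, 5]
```

If the opening of a post matters more than its ending, passing `truncating='post'` to `pad_sequences` keeps the first 80 tokens instead.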


## Summarization using pre-trained model

In [18]:
# Install the dependencies before importing them
!pip install -q transformers[sentencepiece]
!pip install -q sentencepiece
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_directory = './Pre-Trained'

# Load the T5 model
loaded_model = T5ForConditionalGeneration.from_pretrained(model_directory)
loaded_tokenizer = T5Tokenizer.from_pretrained(model_directory)

In [17]:
# Use the raw text of the first blog post as the input
user_input = cummulative_p

# Tokenize and encode the input text using the loaded tokenizer
input_ids = loaded_tokenizer.encode(user_input, return_tensors='pt', max_length=total_length, truncation=True)

# Generate the summary using the loaded T5 model
summary_ids = loaded_model.generate(input_ids, max_length=150, num_beams=2, repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)

# Decode the generated summary IDs to text
generated_summary = loaded_tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

# Print the generated summary
print("\nGenerated Summary:")
print(generated_summary)



Generated Summary:
perks: I received most of the perks that full-timers get. Adobe was one of the best in terms of type of work, company culture, managing team, perks, and perks. People: Everyone was very friendly, dedicated, and happy. Intranet/Work Environment: Another luxury I didn't realize I had at the time.
