# Employee work life analysis

## Introduction
To understand a company's culture and the day-to-day experience of its employees, this phase of the project extracts and summarises first-hand work-life anecdotes. We collect accounts written by people who have worked at a given organisation, using web scraping to gather the posts and Natural Language Processing (NLP) techniques to preprocess and summarise them. The goal is a set of concise summaries that convey the working environment, challenges, and successes that characterise the company.

## Web Scraping Quora blog posts

### Input the company name

In [1]:
company = input("Which company's employee experiences do you want to know about?")

Which company's employee experiences do you want to know about? facebook


In [2]:
from urllib.parse import quote

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# URL-encode the company name so that spaces and special characters are safe in the query
url = f"https://www.quora.com/search?q=what%20is%20it%20like%20working%20at%20{quote(company.strip())}"

# Initialize a headless Chrome browser
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get(url)

# Wait up to 10 seconds for the first answer text to appear
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "qu-userSelect--text")))

button_elements = driver.find_elements(By.XPATH, "//div[@class='q-text qu-cursor--pointer QTextTruncated__StyledReadMoreLink-sc-1pev100-3 dXJUbS qt_read_more qu-color--blue_dark qu-fontFamily--sans qu-pl--tiny']")

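company_blog_posts = []
# Expand the first 10 truncated answers by clicking their "read more" links
for button_element in button_elements[:10]:
    button_element.click()

# Get the page source after the dynamic content has loaded, then close the browser
page_source = driver.page_source
driver.quit()
soup = BeautifulSoup(page_source, "html.parser")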

div_elements = soup.find_all('div', class_='q-box spacing_log_answer_content puppeteer_test_answer_content')
for div_element in div_elements[:10]:
    company_blog_posts.append(div_element.get_text())

for c in company_blog_posts:
    print("Found a matching div element:")
    print(c)
    print("---")



Found a matching div element:
The Perks I can clearly remember how strange it all felt at first. I'm from a small Canadian prairie city and had spent the majority of my career working in art galleries where we were always strapped for cash. Facebook, on the other hand, is a place designed so that their employees can focus entirely on productivity. Breakfast, lunch and dinner are available for free, there's an onsite dry cleaning drop off so you don't have to worry about laundry, the shuttle picks you up in the morning and jets you off to work as you catch up on email using the Wi-Fi in the leather-seated vehicle ... For about six months after I started, I walked around completely astonished at how lucky I was, wrapped in a bubble of gratitude. Eventually these perks start to feel normal and although they're what people talk about most often, they're not the reason I stuck around for four years.  The Hustle I've never worked harder in my life than I did at Facebook. I put in long days, 

Note that the raw blog posts have been extracted and stored in `company_blog_posts`.

## Pre-Processing of extracted data (Removal of stop words and tokenisation)

The code below uses the NLTK library's stop-word list and Punkt tokenizer models to process the text.
<br/> **Stop-word removal** filters out common words that carry little semantic meaning (such as "the", "is", and "at"), so that later analysis focuses on the more significant terms.
<br/> **Punkt Sentence Tokenizer** splits text into sentences using an unsupervised model that accounts for abbreviations, collocations, and words that start sentences. NLTK's `word_tokenize` relies on it to split the text into sentences before splitting each sentence into words.

These preprocessing steps reduce noise in the text before it is passed to downstream NLP tasks.

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def process_review(review):
    """Lowercase a post, keep alphanumeric tokens only, and drop English stop words."""
    words = word_tokenize(review)
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    summary = ' '.join(filtered_words)
    return summary

# Preprocess each post once
individual_post_pp = [process_review(review) for review in company_blog_posts]

# Concatenate all preprocessed posts into a single string
cummulative_pp = ' '.join(individual_post_pp)

# Display individual summaries
print("\nPreprocessed Individual Blogs")
for i, summary in enumerate(individual_post_pp, start=1):
    print(f"{i}: {summary}\n")




Preprocessed Individual Blogs
1: perks clearly remember strange felt first small canadian prairie city spent majority career working art galleries always strapped cash facebook hand place designed employees focus entirely productivity breakfast lunch dinner available free onsite dry cleaning drop worry laundry shuttle picks morning jets work catch email using vehicle six months started walked around completely astonished lucky wrapped bubble gratitude eventually perks start feel normal although people talk often reason stuck around four years hustle never worked harder life facebook put long days often logged weekends felt tremendous sense responsibility good job environment facebook unstructured one really tells large extent hustle make sure invited right meetings spending time wisely appreciated freedom flexibility also struggled bit lack process content strategist role well defined understood design roles sometimes challenging find way project meaningful way structure facebook flat
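
Comparing the first raw post with its preprocessed form shows a side effect of the `word.isalnum()` filter: hyphenated tokens such as "Wi-Fi" and "leather-seated" disappear entirely, because a hyphen is not an alphanumeric character. The sketch below reproduces that behaviour on a hand-written token list and shows one possible alternative that keeps hyphenated words. The helper name `keep_token` and the tiny stop-word set are illustrative assumptions; they are not part of the notebook or of NLTK.

```python
import re

# Tokens as a word tokenizer might emit them (hand-written for illustration)
tokens = ["using", "the", "Wi-Fi", "in", "the", "leather-seated", "vehicle", "..."]

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
stop_words = {"the", "in"}

def keep_token(token):
    """Hypothetical filter: accept ASCII alphanumeric parts joined by single hyphens."""
    return re.fullmatch(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", token) is not None

# Filter used in the notebook: str.isalnum() rejects any token containing a hyphen
strict = [t.lower() for t in tokens if t.isalnum() and t.lower() not in stop_words]

# Alternative filter that keeps hyphenated words but still drops pure punctuation
relaxed = [t.lower() for t in tokens if keep_token(t) and t.lower() not in stop_words]

print(strict)
print(relaxed)
```

Whether the relaxed filter is preferable depends on how much weight hyphenated terms should carry in the later summaries.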



## Summarizer (pre-trained BART-large-CNN model)

### Summarisation (raw blogs) 

In [4]:
from transformers import pipeline
from nltk.tokenize import word_tokenize

# Set up the summarizer pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Initialize an empty list to store summaries
summaries = []

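# Process each post
for post in company_blog_posts:
    # Count the words in the post to bound the summary length
    tokens = word_tokenize(post)

    # truncation=True clips posts that exceed the model's maximum input length
    summary_chunk = summarizer(post, max_length=len(tokens), min_length=min(30, len(tokens)), truncation=True, do_sample=False)
    summaries.append(summary_chunk[0]['summary_text'])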

# Print or use the generated summaries
for summary in summaries:
    print(summary)

The day-to-day environment at Facebook is unstructured. No one really tells you what to do and when to do it. It's up to you to hustle to make sure you're invited to the right meetings and spending your time wisely.
I manage a team of Data Scientists and analysts working on Ads. We belong to probably the largest and most centralized analytics team at Facebook. Our goal is to come up with data backed insights which will result in informing the product road-map.
I recently worked at Facebook. It was horrendous. I can’t speak for the entire company, but my organization was a disaster. The pay and benefits were outstanding, so please believe me when I say that I quit because each day felt like mental torture.
Facebook has the highest levels of transperancy in the industry. You will get all the resources to execute your job well. There’s a lot of flexibility that FB offers - work from home, 4 months later Ety leaves, unlimited sick days etc.
Facebook Hyderabad is one of the best Facebook of
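
`facebook/bart-large-cnn` accepts at most 1024 input tokens, and long answers can exceed that limit. One common workaround is to split a long post into chunks, summarise each chunk, and join the partial summaries. The sketch below shows only the splitting step. The helper `chunk_words` and the 400-word chunk size are illustrative assumptions (word counts only approximate the model's subword token counts), and this is not what the notebook currently does.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Example: a synthetic 1000-word post is split into three chunks
post = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(post)
print([len(c.split()) for c in chunks])
```

Each chunk could then be passed to the summarizer on its own, at the cost of losing context that spans chunk boundaries.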

### Summarisation on preprocessed blogs

In [5]:
from transformers import pipeline
from nltk.tokenize import word_tokenize

# Set up the summarizer pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Initialize an empty list to store summaries
summaries = []

# Process each preprocessed post
for post in individual_post_pp:
    # Count the words in the post to bound the summary length
    tokens = word_tokenize(post)
    min_length = min(30, len(tokens))
    max_length = max(min(150, len(tokens)), min_length)

    # truncation=True clips posts that exceed the model's maximum input length
    summary_chunk = summarizer(post, max_length=max_length, min_length=min_length, truncation=True, do_sample=False)
    summaries.append(summary_chunk[0]['summary_text'])

# Print or use the generated summaries
for summary in summaries:
    print(summary)

Six months started walked around completely astonished lucky wrapped bubble gratitude eventually perks start feel normal although people talk often reason stuck around four years hustle never worked harder life facebook put long days often logged weekends felt tremendous sense responsibility good job environment.
Data scientist responsibilities spend time analyzing designing experiments optimize product features move key metrics. Data come business opportunities pursue product feature suggestions sometimes understand metric movements. Building production ml models though mostly done sw engineering multidisciplinary nature.
 recently worked facebook horrendous speak entire company organization disaster pay benefits outstanding please believe say quit day felt like mental torture starters gross ineptitude extended way management chain people worked idea apart distributing corporate acting like used car salesmen help lower level employees.
Facebook offers a lot of perks and luxuries. The 
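
The length bounds passed to the summarizer for the preprocessed posts follow two rules: `min_length` is 30, or the length of the post if it is shorter, and `max_length` is 150, or the length of the post if it is shorter, but never below `min_length`. A minimal sketch of the same arithmetic as a helper (the name `length_bounds` is hypothetical). Note that the notebook measures length in NLTK words, while the model counts subword tokens, so the bounds are approximate.

```python
def length_bounds(n_tokens, floor=30, ceiling=150):
    """Return (min_length, max_length) for a post of n_tokens tokens."""
    min_length = min(floor, n_tokens)
    max_length = max(min(ceiling, n_tokens), min_length)
    return min_length, max_length

# A very short post, a medium post, and a long post
for n in (12, 80, 500):
    print(n, length_bounds(n))
```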

# Training Model for summarisation 

In [25]:
!pip install --user transformers==4.20.0
!pip install --user keras_nlp==0.3.0
!pip install datasets
!pip install huggingface-hub
!pip install rouge-score

Collecting transformers==4.20.0
  Using cached transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1 (from transformers==4.20.0)
  Using cached tokenizers-0.12.1.tar.gz (220 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: tokenizers
  Building wheel for tokenizers (pyproject.toml): started
  Building wheel for tokenizers (pyproject.toml): finished with status 'error'
Failed to build tokenizers


  error: subprocess-exited-with-error
  
  Building wheel for tokenizers (pyproject.toml) did not run successfully.
  exit code: 1
  
  [51 lines of output]
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-cpython-311
  creating build\lib.win-amd64-cpython-311\tokenizers
  copying py_src\tokenizers\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers
  creating build\lib.win-amd64-cpython-311\tokenizers\models
  copying py_src\tokenizers\models\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\models
  creating build\lib.win-amd64-cpython-311\tokenizers\decoders
  copying py_src\tokenizers\decoders\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\decoders
  creating build\lib.win-amd64-cpython-311\tokenizers\normalizers
  copying py_src\tokenizers\normalizers\__init__.py -> build\lib.win-amd64-cpython-311\tokenizers\normalizers
  creating build\lib.win-amd64-cpython-311\tokenizers\pre_tokenizers
  copying py_

Collecting keras_nlp==0.3.0
  Using cached keras_nlp-0.3.0-py3-none-any.whl (142 kB)
INFO: pip is looking at multiple versions of keras-nlp to determine which version is compatible with other requirements. This could take a while.


ERROR: Could not find a version that satisfies the requirement tensorflow-text (from keras-nlp) (from versions: none)
ERROR: No matching distribution found for tensorflow-text
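
The build log above mentions `cpython-311`, which suggests the interpreter is Python 3.11, and pip reports no available `tensorflow-text` distribution for this environment. Older pinned releases may not ship prebuilt wheels for newer interpreters, in which case pip falls back to building from source. Before pinning versions, it can help to record what is actually installed. A minimal sketch using only the standard library (the helper name `installed_versions` is hypothetical):

```python
import platform
import sys
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

print("Python", sys.version.split()[0], "on", platform.system())
report = installed_versions(["transformers", "tokenizers", "tensorflow", "keras-nlp"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```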




In [17]:
import os
import logging

import nltk
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [18]:
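TRAIN_TEST_SPLIT = 0.1  # Fraction of the dataset used for each of the train and test splits

MAX_INPUT_LENGTH = 1024  # Maximum number of tokens in a model input
MIN_TARGET_LENGTH = 5  # Minimum number of tokens in a generated summary
MAX_TARGET_LENGTH = 128  # Maximum number of tokens in a generated summary
BATCH_SIZE = 8  # Batch size for training and evaluation
LEARNING_RATE = 2e-5  # Learning rate for the Adam optimizer
MAX_EPOCHS = 2  # Number of training epochs

MODEL_CHECKPOINT = "t5-small"  # Pretrained checkpoint from the Hugging Face Model Hub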


In [19]:
from datasets import load_dataset

dataset = load_dataset("multi_news", split="train")
print(dataset[0])

{'document': 'National Archives \n \n Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. \n \n A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists polled by Dow Jones Newswires, compared to 227,000 jobs added in February. The unemployment rate is expected to hold steady at 8.3%. \n \n Here at MarketBeat HQ, we’ll be offering color commentary before and after the data crosses the wires. Feel free to weigh-in yourself, via the comments section. And while you’re here, why don’t you sign up to follow us on Twitter. \n \n Enjoy the show. ||||| Employers pulled back sharply on hiring last month, a reminder that the U.S. economy 

## Training-Testing split

In [20]:
raw_datasets = dataset.train_test_split(
    train_size=TRAIN_TEST_SPLIT, test_size=TRAIN_TEST_SPLIT
)
if MODEL_CHECKPOINT in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""



In [21]:
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True)

    # Tokenize the target summaries, truncating them to MAX_TARGET_LENGTH
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

In [22]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/4497 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (519 > 512). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/4498 [00:00<?, ? examples/s]
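
The two progress bars report 4497 training and 4498 test examples. That is what a 10% / 10% split produces if the `multi_news` training split holds 44,972 examples, which is an assumption here because the notebook does not print the dataset size. The sketch below reproduces the arithmetic, assuming the train size is rounded down and the test size is rounded up, and derives the number of optimisation steps per epoch at `BATCH_SIZE = 8`. Which of the two step counts applies depends on whether the final partial batch is dropped.

```python
import math

N_EXAMPLES = 44_972      # assumed size of the multi_news training split
TRAIN_TEST_SPLIT = 0.1   # same fraction as in the notebook
BATCH_SIZE = 8           # same batch size as in the notebook

n_train = math.floor(N_EXAMPLES * TRAIN_TEST_SPLIT)  # train size rounded down
n_test = math.ceil(N_EXAMPLES * TRAIN_TEST_SPLIT)    # test size rounded up

full_batches = n_train // BATCH_SIZE                 # steps if the last partial batch is dropped
all_batches = math.ceil(n_train / BATCH_SIZE)        # steps if the last partial batch is kept

print(n_train, n_test)
print(full_batches, all_batches)
```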

In [None]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = TFAutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)

In [53]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

In [None]:
train_dataset = tokenized_datasets["train"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    collate_fn=data_collator,
)
test_dataset = tokenized_datasets["test"].to_tf_dataset(
    batch_size=BATCH_SIZE,
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=False,
    collate_fn=data_collator,
)
generation_dataset = (
    tokenized_datasets["test"]
    .shuffle()
    .select(list(range(200)))
    .to_tf_dataset(
        batch_size=BATCH_SIZE,
        columns=["input_ids", "attention_mask", "labels"],
        shuffle=False,
        collate_fn=data_collator,
    )
)

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer)

In [None]:
import keras_nlp

rouge_l = keras_nlp.metrics.RougeL()

def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = rouge_l(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    return result
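
ROUGE-L compares a generated summary with its reference through their longest common subsequence (LCS): precision is the LCS length divided by the candidate length, recall is the LCS length divided by the reference length, and the F1 score is their harmonic mean. The following sketch computes it for whitespace tokens in plain Python. It is meant to illustrate the metric, not to reproduce `keras_nlp.metrics.RougeL`, whose tokenisation may differ; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings, using lowercase whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_l_f1(reference, candidate), 3))
```

Here the LCS is "the cat on the mat" (5 tokens out of 6 on each side), so precision, recall, and F1 are all 5/6.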


# Model Training

In [None]:
from transformers.keras_callbacks import KerasMetricCallback
from tensorflow.keras.callbacks import ModelCheckpoint

# Compute ROUGE-L on generated summaries at the end of each epoch
metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True
)
save_dir = "/kaggle/working/saved_models"
os.makedirs(save_dir, exist_ok=True)

# Save the weights after every epoch; HDF5 cannot store a subclassed model in full
model_checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(save_dir, 'model_epoch_{epoch:02d}.h5'),
    save_best_only=False,  # Save on every epoch, not only the best one
    save_weights_only=True,  # Save weights only
    save_freq='epoch',
    verbose=1
)

callbacks = [metric_callback, model_checkpoint_callback]

# For now we will use our test set as our validation_data
model.fit(
    train_dataset, validation_data=test_dataset, epochs=MAX_EPOCHS, callbacks=callbacks
)

# Testing

In [None]:
from transformers import pipeline

summarizer = pipeline("summarization", model=model, tokenizer=tokenizer, framework="tf")
print("Actual Data",raw_datasets["test"][0]["document"])
print("\n-----\n")
summarizer(
    raw_datasets["test"][0]["document"],
    min_length=MIN_TARGET_LENGTH,
    max_length=MAX_TARGET_LENGTH,
)

In [None]:
# This is the same model used in the Reputation Analysis from news. Training is currently blocked by compatibility issues between transformers and keras; the results will be submitted in the next phase.