# Employee work life analysis

## Introduction
To understand a company's culture and how its employees experience it, this phase of the project extracts and summarizes work-life anecdotes from people who have worked at various organizations. We combine web scraping, data extraction, and text summarization, using popular Natural Language Processing (NLP) libraries. The goal is a set of concise summaries of employees' experiences that capture the working environment, challenges, and successes that define a company's culture.

## Web Scraping Quora blog posts

### Input the company name

In [4]:
company = input("Which company's employee experiences do you want to know about?")

Which company's employee experiences do you want to know about? google


In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = f"https://www.quora.com/search?q=what%20is%20it%20like%20working%20at%20{company}"

# Initialize a Chrome browser (opens a visible window; this session is not headless)
driver = webdriver.Chrome()

driver.get(url)

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "qu-userSelect--text")))

page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

button_elements = driver.find_elements(By.XPATH, "//div[@class='q-text qu-cursor--pointer QTextTruncated__StyledReadMoreLink-sc-1pev100-3 dXJUbS qt_read_more qu-color--blue_dark qu-fontFamily--sans qu-pl--tiny']")

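company_blog_posts = []
# Expand up to 10 truncated answers by clicking their "read more" links
for button_element in button_elements[:10]:
    button_element.click()

# Get the page source after dynamic content has loaded
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

# Close the browser once the HTML has been captured
driver.quit()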

div_elements = soup.find_all('div', class_='q-box spacing_log_answer_content puppeteer_test_answer_content')
for div_element in div_elements[:100]:
    company_blog_posts.append(div_element.get_text())

for c in company_blog_posts:
    print("Found a matching div element:")
    print(c)
    print("---")



Found a matching div element:
I have never worked anywhere apart from Google, so I cannot draw a comparison but here is what I think. Pros :I get to work with a set of people who always make me feel shit about myself and make me realize that I know very little. When I joined Google, I suffered from Imposter’s syndrome. I am always out of my comfort zone here. I always know how I can improve. It helps me grow and learn faster.The degree of freedom I have regarding what I want to work on is spectacular. Changing project / team is relatively easier here. I can choose to spend 20% of my time on anything I want.The food and other much talked about perks of Google actually help me improve my productivity. I feel like my employer cares about me and I would like to return the favor. There is even an internal website for employees that talk about how to waste time at Google in a good way.The company invests in growth of it’s employees. There are tons of learning resources. People volunteer to m

Note that we have extracted the raw blog posts and stored them in `company_blog_posts`.
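
The search URL in the scraping cell interpolates the company name exactly as typed, which works for a single word such as `google`. For names that contain spaces or characters like `&`, the query could be percent-encoded first. The following is a minimal sketch, not part of the original notebook: `build_search_url` is a hypothetical helper name, and it assumes the same search phrase used above.

```python
from urllib.parse import quote


def build_search_url(company: str) -> str:
    """Build a Quora search URL for a company name (hypothetical helper)."""
    # Same search phrase as the scraping cell, with surrounding whitespace removed
    query = f"what is it like working at {company.strip()}"
    # quote() percent-encodes spaces and reserved characters such as '&'
    return "https://www.quora.com/search?q=" + quote(query, safe="")


print(build_search_url("google"))
print(build_search_url("Procter & Gamble"))
```

For `google` this produces the same URL as the cell above; for multi-word names it keeps the `q` parameter intact instead of letting `&` start a new query parameter.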

## Pre-Processing of extracted data (Removal of stop words and tokenisation)

In my code, I use the NLTK library's stop-word list and Punkt tokenizer for text processing.
<br/> **Stop word** removal filters out common words that carry little semantic meaning, so the analysis can focus on more significant terms.
<br/> **Punkt Sentence Tokenizer** splits text into sentences while handling abbreviations, upper-case characters, collocations, special characters, and whitespace, which makes it particularly effective for summarisation. NLTK's `word_tokenize` relies on it before splitting each sentence into words.

These preprocessing steps contribute to the efficiency and precision of natural language processing tasks, ensuring a more refined understanding of the underlying text.
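
To make the effect of these steps concrete, here is a small self-contained sketch that runs without downloading any NLTK data. It is only an approximation of the notebook's pipeline: `DEMO_STOP_WORDS` is a tiny hand-picked list (NLTK's English list is much longer), a regular expression stands in for `word_tokenize`, and `demo_process` is a hypothetical name.

```python
import re

# Tiny illustrative stop list (an assumption; not NLTK's actual English list)
DEMO_STOP_WORDS = {"i", "the", "a", "to", "and", "of", "my", "is", "at", "on"}


def demo_process(text: str) -> str:
    """Lowercase, tokenize, and drop stop words (approximation of process_review)."""
    # Regex stand-in for word_tokenize: keep runs of letters and digits only,
    # so punctuation such as '%' and '.' is discarded
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in DEMO_STOP_WORDS)


print(demo_process("I get to spend 20% of my time on anything I want."))
# -> get spend 20 time anything want
```

The regex tokenizer treats contractions differently from `word_tokenize`, so the results can differ slightly on real posts; the sketch is meant only to show which kinds of tokens survive the filtering.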

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def process_review(review):
    words = word_tokenize(review)
    filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
    summary = ' '.join(filtered_words)
    return summary

cummulative_pp = ''
for review in company_blog_posts:
    summary = process_review(review)
    cummulative_pp += summary + ' '

individual_post_pp = [process_review(review) for review in company_blog_posts]

# Display individual summaries
print("\nPreprocessed Individual Blogs")
for i, summary in enumerate(individual_post_pp, start=1):
    print(f"{i}: {summary}\n")




Preprocessed Individual Blogs
1: never worked anywhere apart google draw comparison think pros get work set people always make feel shit make realize know little joined google suffered imposter syndrome always comfort zone always know improve helps grow learn degree freedom regarding want work spectacular changing project team relatively easier choose spend 20 time anything food much talked perks google actually help improve productivity feel like employer cares would like return favor even internal website employees talk waste time google good company invests growth employees tons learning resources people volunteer mentor others access code ever written google several voluntary trainings bottleneck learning googley culture amazing people worked google high ethical moral standards system continuous feedback people trying pull people often get rewarded recognized via public thank sufficient number channels bubble ideas concerns appropriate mailing list almost everything game thrones c

[nltk_data] Downloading package stopwords to C:\Users\Aishika
[nltk_data]     Nandi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Aishika
[nltk_data]     Nandi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Summarization using my trained model

In [2]:
from tf_keras.models import load_model
loaded_model = load_model('summarization.h5')




