# Analysis of Reviews Containing Identified Stress Mentions

**Purpose**: This notebook delves into the reviews that have been flagged for containing one or more of the stress mentions identified in the `medical-entity-analysis.ipynb` notebook. The aim is to understand the context and prevalence of these mentions within the broader dataset.

**Description**: 
- **Filtering Process**: The initial step involves filtering the reviews to retain only those that match the identified stress mentions.
- **Observational Analysis**: A basic observational analysis is conducted on the filtered dataset to understand patterns, frequencies, and other key metrics.
- **Stress Score Extraction**: Following the methodology described in the literature, a stress score is computed for each company based on the proportion of its posts related to stress. This score provides a quantitative measure of the stress level associated with each company.
- **Comparative Analysis**: The notebook also compares key statistics between the entire dataset and the filtered subset. This comparison provides insights into how the presence of stress mentions affects the overall tone and content of reviews.

In [1]:
import os
import numpy as np
import pandas as pd

import plotly.graph_objects as go

In [2]:
# Set the Pandas display option to show the entire content of each cell without truncation
pd.set_option('display.max_colwidth', None)

In [3]:
GREEN = "\033[92m"  # ANSI escape code for green text color
ENDC = "\033[0m"    # ANSI escape code to reset text color to default

## 1. Data Loading

In [4]:
# Specify the path to the folder containing the data (relative)
path = os.path.join("..", "..", "..", "Data")

# File containing the stress entities
txt_fpath = os.path.join(path, "stress_entities.txt")

# File containing the Glassdoor reviews
reviews_fpath = os.path.join(path, "reviews.csv")

# File containing the Glassdoor companies
companies_fpath = os.path.join(path, "companies.json")

In [5]:
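# Load the TXT file into a Python list
stress_entities = []

with open(txt_fpath, "r") as file:
    text = file.read()

    # Remove the quotes and brackets of the serialized list
    text = text.replace("'", "")
    text = text.replace("[", "").replace("]", "")

    # Split on commas, trim stray whitespace, and skip empty entries
    stress_entities = [entity.strip() for entity in text.split(",") if entity.strip()]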
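The quote and bracket stripping above suggests that `stress_entities.txt` holds a Python list literal, although the file format is not confirmed here. Under that assumption, `ast.literal_eval` is one way to parse it without manual string replacement. The helper name `parse_entity_list` and the sample string are hypothetical and only illustrate the idea.

```python
import ast


def parse_entity_list(text):
    """Parse text such as "['stress', 'burnout']" into a clean list of strings.

    Hypothetical helper: assumes the text is a valid Python list literal.
    """
    items = ast.literal_eval(text.strip())
    # Trim surrounding whitespace and drop empty entries
    return [str(item).strip() for item in items if str(item).strip()]


sample_text = "['stress', ' burnout', 'high pressure', '']"  # made-up sample
parsed = parse_entity_list(sample_text)
print(parsed)
```

If the file turns out not to be a valid literal, `ast.literal_eval` raises an exception, which may be preferable to silently producing malformed entities.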

In [6]:
# Load the reviews CSV file into a Pandas DataFrame object
reviews = pd.read_csv(reviews_fpath)
reviews.dropna(subset="summary", inplace=True)

In [7]:
# Load the companies JSON file into a Pandas DataFrame object
companies = pd.read_json(companies_fpath)

## 2. Preview of Stress Mentioned Reviews

The extracted list of stress entities is now used to sample some reviews that contain the related mentions.

In [8]:
def find_reviews_with_pattern(pattern, case=False, n=10):
    # Filter reviews to find rows where the summary column contains the desired pattern
    matching_reviews = reviews[reviews["summary"].str.contains(pattern, case=case)]
    matching_reviews = matching_reviews["summary"]
    matching_reviews.reset_index(drop=True, inplace=True)
    
    count = matching_reviews.shape[0]
    matching_reviews = matching_reviews.head(n)

    return (matching_reviews, count)

In [9]:
for entity in stress_entities:
    matching_reviews, count = find_reviews_with_pattern(entity)

    print(f"Total matching reviews for {GREEN}{entity}{ENDC}: {count}")
    print(matching_reviews, "\n")

Total matching reviews for stress: 25489
0                                                                                      Stressful & Frustrating
1                                                                                                    Stressful
2                                                                                                   Stressfull
3                                                                  My experience consists of being stressed...
4    Fun people, chaotic day-to-day, no work-life balance, poor information sharing, high stress, no training.
5                                                                                  Stressful at times but fine
6                                                                      Disorganized, stressful and incompetent
7                                                                                         very stress full job
8                                                             

## 3. Filtering Reviews

We now build a new DataFrame that keeps only the rows whose review summary contains at least one of the identified stress mentions.

In [10]:
mask = reviews["summary"].str.contains("|".join(stress_entities), case=False)
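Because the joined string is interpreted as a regular expression, an entity containing characters such as `+` or `(` would change what is matched. The sketch below assumes that literal matching is what is intended and escapes each entity first. The names `build_pattern`, `toy_entities`, and `toy` are hypothetical, and the data is made up.

```python
import re

import pandas as pd


def build_pattern(entities):
    """Join entities into one alternation, escaping regex metacharacters."""
    cleaned = [e.strip() for e in entities if e.strip()]
    return "|".join(re.escape(e) for e in cleaned)


toy_entities = ["stress", "burn-out", "c++ pressure"]  # made-up entities
toy = pd.Series(["Stressful place", "Great perks", None, "C++ pressure all day"])

# na=False treats missing summaries as non-matching instead of returning NaN
toy_mask = toy.str.contains(build_pattern(toy_entities), case=False, na=False)
print(toy_mask.tolist())
```

Passing `na=False` also keeps the mask strictly boolean, which matters if rows with a missing `summary` have not been dropped beforehand.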

In [11]:
filtered_reviews = reviews[mask].copy()
filtered_reviews.reset_index(drop=True, inplace=True)

In [12]:
filtered_reviews["summary"].sample(10)

8976     Starts off very fun and slowly becomes more stressful. While not commission, pressure is there to meet number targets.
319                                                                                                           Far too stressful
11501                                                                                                                 Stressing
27302                                                                                                                 Stressful
41462                                                                                             Best place if you love stress
3194                                                                   Good paying job could be very stressful most of the time
20415                                                                                                          Growing pains...
18352                                                                                                   

In [13]:
# How many records are we left with?
filtered_reviews.shape[0]

48631

Let us also add a new boolean column to the existing DataFrame indicating whether a stress mention is present in the `summary` column.

In [14]:
reviews["has_stress"] = mask

In [15]:
reviews[reviews["has_stress"]][["has_stress", "summary"]].sample(5)

Unnamed: 0,has_stress,summary
5656083,True,"Nepotism, precarious and stressful managemnt"
5647315,True,"Difficult work environment, smart people, high pressure"
5698499,True,High turnover and burnout
8104845,True,"Great place to work, long hours, some stress"
8647502,True,good if you like high pressure sales


## 4. Stress Score Extraction

### 4.1. Associating Stress Types with Companies
Following the literature's methodology, we compute a stress score for each company as the proportion of its posts related to stress. We also compute each company's average review rating, and then z-score both values so they are comparable.
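In symbols, the quantities computed by this notebook's `compute_stress_score_per_company` and `compute_average_rating_per_company` can be written as follows. The notation is introduced here for illustration and is not taken from the cited literature: $n_c$ is the number of reviews of company $c$, $n_c^{\text{stress}}$ is the number of those reviews flagged with a stress mention, and $r_c$ is the mean `ratingOverall` of company $c$.

```latex
s_c = \frac{n_c^{\text{stress}}}{n_c}, \qquad
z^{\text{stress}}_c = \frac{s_c - \bar{s}}{\sigma_s}, \qquad
z^{\text{rating}}_c = \frac{r_c - \bar{r}}{\sigma_r}
```

Here $\bar{s}$ and $\bar{r}$ are means across companies, and $\sigma_s$ and $\sigma_r$ are the corresponding sample standard deviations (pandas `.std()` uses `ddof=1` by default).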

Because we want to see how COVID-19 affected the reviews and the overall stress of employees, we compute the rating and stress scores separately for the periods before and during COVID-19.

In [16]:
# Cut-off point of COVID-19
covid_start_date = "2019-12-25"

In [17]:
reviews["reviewDateTime"] = pd.to_datetime(reviews["reviewDateTime"])

In [18]:
# Acquire the reviews submitted before COVID-19
reviews_pre_covid = reviews[reviews["reviewDateTime"] < covid_start_date]

# Acquire the reviews submitted during COVID-19
reviews_during_covid = reviews[reviews["reviewDateTime"] >= covid_start_date]

In [19]:
def compute_average_rating_per_company(df):
    # Group by 'company_id' and calculate mean of 'ratingOverall'
    result = df[["ratingOverall", "company_id"]].groupby(by="company_id").mean().reset_index()
    result.columns = ["company_id", "average_rating"]

    # Compute z-score for rating
    result["zrating"] = (result["average_rating"] - result["average_rating"].mean()) / result["average_rating"].std()
    return result

In [20]:
def compute_stress_score_per_company(df):
    # Reviews per company that had stress mentions
    true_counts = df[["company_id", "has_stress"]].groupby(by="company_id").sum()
    
    # Total number of reviews per company
    total_counts = df[["company_id", "has_stress"]].groupby(by="company_id").count()

    # Compute their ratio / stress score
    result = true_counts / total_counts

    result.reset_index(inplace=True)
    result.columns = ["company_id", "stress"]

    # Drop companies with 0 stress (copy so the column assignment below
    # does not trigger a SettingWithCopyWarning)
    result = result[result["stress"] != 0].copy()
    
    # Compute z-score for stress
    result["zstress"] = (result["stress"] - result["stress"].mean()) / result["stress"].std()

    return result
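As a quick sanity check of the stress score logic, the sketch below runs the same steps on a tiny made-up DataFrame. It uses the fact that the mean of a boolean column equals the share of `True` values, which gives the same ratio as dividing the sum by the count. The names `toy_reviews` and `toy_stress` are hypothetical.

```python
import pandas as pd

# Made-up toy data: three companies with per-review stress flags
toy_reviews = pd.DataFrame({
    "company_id": [1, 1, 1, 1, 2, 2, 3, 3],
    "has_stress": [True, True, False, True, True, False, False, False],
})

# The mean of a boolean column is the share of True values per group
toy_stress = (
    toy_reviews.groupby("company_id")["has_stress"]
    .mean()
    .rename("stress")
    .reset_index()
)

# Mirror the notebook's choice to drop companies with no stress mentions
toy_stress = toy_stress[toy_stress["stress"] != 0].copy()

# z-score across the remaining companies (pandas .std() uses ddof=1)
toy_stress["zstress"] = (
    toy_stress["stress"] - toy_stress["stress"].mean()
) / toy_stress["stress"].std()

print(toy_stress)
```

Company 3 has no stress mentions and is therefore excluded before the z-score is computed, so it does not influence the mean or the standard deviation.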

In [21]:
avg_rating_pre_covid = compute_average_rating_per_company(reviews_pre_covid)
stress_pre_covid = compute_stress_score_per_company(reviews_pre_covid)

results_pre_covid = pd.merge(left=avg_rating_pre_covid, right=stress_pre_covid, on="company_id")
results_pre_covid = pd.merge(left=companies, right=results_pre_covid, left_on="_id", right_on="company_id")

del results_pre_covid["_id"]

In [22]:
avg_rating_during_covid = compute_average_rating_per_company(reviews_during_covid)
stress_during_covid = compute_stress_score_per_company(reviews_during_covid)

results_during_covid = pd.merge(left=avg_rating_during_covid, right=stress_during_covid, on="company_id")
results_during_covid = pd.merge(left=companies, right=results_during_covid, left_on="_id", right_on="company_id")

del results_during_covid["_id"]

In [23]:
def plot_quadrants(df, title):
    # Define colors and quadrant names
    quadrants = {
        "Q1": {"name": "Positive Stress", "color": "orange", "position": (3, 3)},  # Top-right quadrant
        "Q2": {"name": "Negative Stress", "color": "red", "position": (-3, 3)},   # Top-left quadrant
        "Q3": {"name": "Passive", "color": "black", "position": (-3, -3)},        # Bottom-left quadrant
        "Q4": {"name": "Low Stress", "color": "green", "position": (3, -3)}       # Bottom-right quadrant
    }
    
    # Determine the quadrant for each point
    df["quadrant"] = df.apply(lambda row: "Q1" if row["zrating"] >= 0 and row["zstress"] >= 0 else 
                              ("Q2" if row["zrating"] < 0 and row["zstress"] >= 0 else 
                              ("Q3" if row["zrating"] < 0 and row["zstress"] < 0 else "Q4")), axis=1)
    
    # Create the base figure
    fig = go.Figure()
    
    # Add scatter points for each quadrant
    for quad, details in quadrants.items():
        df_quad = df[df["quadrant"] == quad]
        fig.add_trace(go.Scatter(x=df_quad["zrating"], y=df_quad["zstress"], 
                                 mode="markers", 
                                 marker=dict(color=details["color"]),
                                 name=details["name"], 
                                 hovertext=df_quad["company name"]))
        
        # Add annotation (text) to indicate the number of companies in the quadrant
        fig.add_annotation(
            text=str(len(df_quad)),
            x=details["position"][0], 
            y=details["position"][1],
            showarrow=False,
            font=dict(
                size=15,
                color=details["color"]
            ),
            bgcolor="white",
            borderpad=4
        )
    
    # Add vertical line at x=0
    fig.add_shape(type="line", line=dict(dash="dash"), x0=0, x1=0, y0=-3, y1=3)
    
    # Add horizontal line at y=0
    fig.add_shape(type="line", line=dict(dash="dash"), y0=0, y1=0, x0=-3, x1=3)
    
    # Update layout, titles, and axis ranges
    fig.update_layout(title=f"Stress Quadrants: {title}", 
                      xaxis_title="zrating", 
                      yaxis_title="zstress",
                      legend_title="Quadrants",
                      height=800)

    # Display the plot
    fig.show()
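The row-wise `apply` used for the quadrant assignment can become slow on large DataFrames. A vectorized alternative using `np.select` is sketched below. It is meant to follow the same rules as the lambda, including the fall-through to `Q4`. The function name `assign_quadrants` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def assign_quadrants(df):
    """Return quadrant labels using the same rules as the row-wise lambda."""
    rating_pos = (df["zrating"] >= 0).to_numpy()
    rating_neg = (df["zrating"] < 0).to_numpy()
    stress_pos = (df["zstress"] >= 0).to_numpy()
    stress_neg = (df["zstress"] < 0).to_numpy()

    conditions = [
        rating_pos & stress_pos,  # Q1: top-right
        rating_neg & stress_pos,  # Q2: top-left
        rating_neg & stress_neg,  # Q3: bottom-left
    ]
    # Anything not matched above falls through to Q4, as in the lambda
    labels = np.select(conditions, ["Q1", "Q2", "Q3"], default="Q4")
    return pd.Series(labels, index=df.index)


toy_scores = pd.DataFrame({
    "zrating": [1.0, -0.5, -1.2, 0.3],   # made-up values
    "zstress": [0.4, 0.9, -0.1, -2.0],
})
toy_quadrants = assign_quadrants(toy_scores)
print(toy_quadrants.tolist())
```

Returning a new Series rather than writing into `df` also leaves the caller's DataFrame untouched, which may or may not be desirable given that later cells display the `quadrant` column.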

In [24]:
plot_quadrants(df=results_pre_covid, title=f"Before COVID-19 ({covid_start_date})")

In [25]:
plot_quadrants(results_during_covid, title=f"During/After COVID-19 ({covid_start_date})")

In [26]:
results_pre_covid

Unnamed: 0,company name,company_id,average_rating,zrating,stress,zstress,quadrant
0,IHG Hotels and Resorts,4232,3.737639,0.436332,0.003532,-0.729921,Q4
1,FCA Fiat Chrysler Automobiles,149,3.319199,-0.436247,0.006990,-0.386843,Q3
2,Penn State,2931,4.066023,1.121115,0.002684,-0.814049,Q4
3,NetApp,5406,3.595722,0.140391,0.006417,-0.443656,Q4
4,Mercer,35818,3.216216,-0.650997,0.005019,-0.582338,Q3
...,...,...,...,...,...,...,...
5212,NYS Office Information Technology Services,703480,2.671141,-1.787649,0.026846,1.583140,Q2
5213,Turn5,341949,2.682927,-1.763072,0.012195,0.129603,Q2
5214,JAKKS Pacific,6084,3.621053,0.193213,0.021053,1.008392,Q1
5215,NorthMarq Capital,16676,3.338710,-0.395560,0.016129,0.519902,Q2


In [27]:
results_during_covid

Unnamed: 0,company name,company_id,average_rating,zrating,stress,zstress,quadrant
0,IHG Hotels and Resorts,4232,3.934496,0.415672,0.002641,-0.828235,Q4
1,FCA Fiat Chrysler Automobiles,149,3.488924,-0.975659,0.005538,-0.462486,Q3
2,Penn State,2931,4.216767,1.297086,0.001853,-0.927808,Q4
3,NetApp,5406,4.128931,1.022811,0.003145,-0.764680,Q4
4,Mercer,35818,3.760222,-0.128510,0.004764,-0.560238,Q3
...,...,...,...,...,...,...,...
5365,Skyeng,1063171,3.927022,0.392334,0.003945,-0.663653,Q4
5366,@properties,349652,3.960317,0.496303,0.007937,-0.159634,Q4
5367,Turn5,341949,3.456140,-1.078028,0.017544,1.053442,Q2
5368,KeyW,354608,4.033898,0.726064,0.016949,0.978351,Q1


In [28]:
def calculate_association(df):
    R = np.sqrt(np.power(df["zrating"], 2) + np.power(df["zstress"], 2))

    alpha = np.arccos(np.abs(df["zrating"]) / R)
    beta = np.arccos(np.abs(df["zstress"]) / R)
    gamma = np.maximum((alpha - np.pi / 4), (beta - np.pi / 4))

    return pd.Series(R / (gamma + np.pi))
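To make the arithmetic of `calculate_association` easier to follow, the sketch below applies the same formula to single points. The geometric reading in the comments is an interpretation of the code, not a statement from the cited literature: `R` is the distance from the origin, and `gamma` is how far the point lies from the diagonal, so points on the diagonal get the largest score for a given `R`. The name `association_score` is hypothetical.

```python
import numpy as np


def association_score(zrating, zstress):
    """Same arithmetic as calculate_association, for a single point."""
    r = np.hypot(zrating, zstress)        # distance from the origin
    alpha = np.arccos(abs(zrating) / r)   # angle to the zrating axis
    beta = np.arccos(abs(zstress) / r)    # angle to the zstress axis
    # Angular deviation from the diagonal (0 on the diagonal, pi/4 on an axis)
    gamma = max(alpha - np.pi / 4, beta - np.pi / 4)
    return r / (gamma + np.pi)


on_diagonal = association_score(1.0, 1.0)  # gamma is 0, score is sqrt(2) / pi
on_axis = association_score(1.0, 0.0)      # gamma is pi/4, score is 4 / (5 * pi)
print(on_diagonal, on_axis)
```

Note that a point exactly at the origin has `R = 0`, so the division inside `arccos` yields `NaN`. Such rows would need separate handling if they occur.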


In [29]:
results_pre_covid["association"] = calculate_association(results_pre_covid)
results_during_covid["association"] = calculate_association(results_during_covid)

### 4.2. Computing Stress Scores Over the Years

In [30]:
# Compute the scores for the whole data (without cutoff point)
avg_rating = compute_average_rating_per_company(reviews)
stress = compute_stress_score_per_company(reviews)

results = pd.merge(left=avg_rating, right=stress, on="company_id")
results = pd.merge(left=companies, right=results, left_on="_id", right_on="company_id")

results["association"] = calculate_association(results)

del results["_id"]

In [31]:
plot_quadrants(results, "All Years")

In [32]:
results.head()

Unnamed: 0,company name,company_id,average_rating,zrating,stress,zstress,association,quadrant
0,IHG Hotels and Resorts,4232,3.833806,0.454736,0.003097,-0.782521,0.266148,Q4
1,FCA Fiat Chrysler Automobiles,149,3.382111,-0.796629,0.006452,-0.208491,0.224314,Q3
2,Penn State,2931,4.146942,1.322239,0.002238,-0.929515,0.487669,Q4
3,NetApp,5406,3.77585,0.294175,0.005312,-0.403551,0.151464,Q4
4,Mercer,35818,3.484439,-0.513142,0.004893,-0.475122,0.219911,Q3


In [109]:
def calculate_weight(df):
    temp = df.groupby(by="company_id").size()
    temp = temp.reset_index(name="count")

    temp["weight"] = temp["count"] / df.shape[0]

    return temp
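A small made-up example shows what `calculate_weight` returns: each company's share of the rows in the DataFrame it is given. The names `toy_group` and `toy_weights` are hypothetical, and the computation is repeated inline so the sketch runs on its own.

```python
import pandas as pd

# Made-up toy data: 5 reviews spread over 3 companies
toy_group = pd.DataFrame({"company_id": [10, 10, 10, 20, 30]})

# Same computation as calculate_weight: per-company count over total rows
toy_weights = toy_group.groupby(by="company_id").size().reset_index(name="count")
toy_weights["weight"] = toy_weights["count"] / toy_group.shape[0]

print(toy_weights)
print(toy_weights["weight"].sum())
```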

In [34]:
grouped = reviews.groupby(reviews["reviewDateTime"].dt.year)

In [116]:
stress_over_years = {}

for year, group in grouped:
    
    # Compute the scores using only the reviews submitted in the current year
    avg_rating = compute_average_rating_per_company(group)
    stress = compute_stress_score_per_company(group)

    results = pd.merge(left=avg_rating, right=stress, on="company_id")
    results = pd.merge(left=companies, right=results, left_on="_id", right_on="company_id")

    results["association"] = calculate_association(results)

    weights = calculate_weight(group)
    results = pd.merge(left=results, right=weights, on="company_id")

    del results["_id"]
    del results["count"]

    stress_over_years[year] = np.sum(results["association"] * results["weight"])
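The yearly value stored in `stress_over_years` can be summarized as a weighted sum. The notation below is introduced here for illustration: $a_{c,y}$ is the association score of company $c$ computed from the reviews of year $y$, $n_{c,y}$ is the number of reviews of company $c$ in that year, and $N_y$ is the total number of reviews in that year.

```latex
S_y = \sum_{c \in C_y} w_{c,y} \, a_{c,y}, \qquad
w_{c,y} = \frac{n_{c,y}}{N_y}
```

$C_y$ denotes the companies that remain after the merges for year $y$. Because `compute_stress_score_per_company` drops companies with zero stress, and the merge with `companies` keeps only matching ids, the weights over $C_y$ can sum to less than 1.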

In [117]:
stress_over_years

{2008: 0.09324091240870061,
 2009: 0.07116437805398336,
 2010: 0.06416544064484744,
 2011: 0.0962872832321128,
 2012: 0.11143981724722324,
 2013: 0.10595639243912847,
 2014: 0.13365759239297104,
 2015: 0.11867266216490756,
 2016: 0.11559177132537822,
 2017: 0.13039039906966987,
 2018: 0.12516665354421314,
 2019: 0.12799110235879727,
 2020: 0.1394409159276057,
 2021: 0.19642980020771347,
 2022: 0.1958061097487982,
 2023: 0.14480755369180437}

In [118]:
years = list(stress_over_years.keys())
stress_values = list(stress_over_years.values())

# Create a trace for the stress values
trace = go.Scatter(
    x=years,
    y=stress_values,
    mode='lines+markers',
    name='Stress',
    marker=dict(
        size=8,
        color='blue'
    ),
    line=dict(
        width=2
    )
)

layout = go.Layout(
    title='Stress Over the Years',
    xaxis=dict(title='Year'),
    yaxis=dict(title='Stress'),
    hovermode='closest'
)

fig = go.Figure(data=[trace], layout=layout)

# Show the plot (you can also save it to an HTML file)
fig.show()