# Working on Amazon Employee Reviews dataset

The task is to develop a system that can identify both negative and positive topics in the provided dataset (“Amazon Jan 2023.csv”). In simpler terms, your task involves the following:
1. **Sentiment Analysis:** Divide the dataset into positive and negative data. Please consider that each comment can be classified only as positive or negative.
2. **Topic Modeling:** Detect the subjects being discussed in both the positive and negative comments. Please keep in mind that each comment might cover multiple topics.
3. **Displaying the Results:** Upon executing your code, we expect to observe several additional columns integrated into the current dataset.
    * A new column labeled `Sentiment` featuring either _"Positive"_ or _"Negative"_ labels corresponding to the sentiment of each comment.
    * Separate columns for each identified topic, featuring either _"Yes"_ or _"No"_ labels to indicate whether a comment covers that specific topic or not.
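
As a rough illustration of the target layout described above, the sketch below builds a toy frame with the expected columns and splits it by sentiment. The second comment, the topic column names, and all labels are hypothetical placeholders, not model output.

```python
import pandas as pd

# Toy rows only: the labels below are illustrative, not model output.
expected = pd.DataFrame({
    "Comment": ["very long hours", "great team"],  # second comment is made up
    "Sentiment": ["Negative", "Positive"],
    "Topic 0": ["Yes", "No"],   # hypothetical topic columns
    "Topic 1": ["No", "Yes"],
})

# Splitting by sentiment gives the two subsets that topics are reported for.
positive = expected[expected["Sentiment"] == "Positive"]
negative = expected[expected["Sentiment"] == "Negative"]

print(expected.columns.tolist())
print(len(positive), len(negative))
```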

In [61]:
# # Install bertopic
# !pip install bertopic

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import nltk
import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

from gensim.utils import simple_preprocess

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

# Topic model
from bertopic import BERTopic

# Dimension reduction
from umap import UMAP

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# df = pd.read_csv("Data/Amazon Jan 2023.csv") # On system
df = pd.read_csv("/content/drive/MyDrive/Data_Science/Coding_Challenge/Amazon Jan 2023.csv") # Google colab
df.head()

Unnamed: 0,Comment
0,Fun and flexible work environment and an oppor...
1,"be prepared for constant changes, rules, error..."
2,Award-winning training. All the support you ne...
3,very long hours
4,Inclusive environment. Good salary and benefit...


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  99 non-null     object
dtypes: object(1)
memory usage: 920.0+ bytes


In [7]:
# Check for null values
df.isnull().sum()

Comment    0
dtype: int64

In [8]:
df.isna().sum()

Comment    0
dtype: int64

In [10]:
df.shape

(99, 1)

In [11]:
df.loc[90, "Comment"]

'The opportunities to raise through the ladder are low and are extremely tough with very high competition.'

In [12]:
df.loc[9, "Comment"]

'Prepare to drive lots of miles for work. Drive tons of miles and far distances from pickup points at times.'

In [13]:
df.loc[50, "Comment"]

'However, I do have a really good manager which helps the situation a lot.'

In [14]:
data = df.copy()

## Sentiment Analysis Scores using Roberta

* Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive
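
To make the label order concrete, here is a minimal sketch of how a vector of three logits becomes a label. The logits are hypothetical, and `softmax_1d` is a small stand-in written for this example rather than the `scipy.special.softmax` call used in the notebook.

```python
import numpy as np

LABELS = ["Negative", "Neutral", "Positive"]  # index order listed above

def softmax_1d(logits):
    """Numerically stable softmax over a 1-D sequence of logits."""
    values = np.asarray(logits, dtype=float)
    exp = np.exp(values - values.max())
    return exp / exp.sum()

example_logits = [2.0, 0.5, -1.0]  # hypothetical values, not model output
probs = softmax_1d(example_logits)
predicted = LABELS[int(np.argmax(probs))]

print(probs.round(3), predicted)
```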

In [15]:
# Load model and tokenizer
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
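# Define a function that returns the softmax scores [negative, neutral, positive]
def polarity_scores_roberta(comment):
    # Truncate to the model's 512-token limit, as analyze_sentiment does
    inputs = tokenizer.encode(comment, return_tensors='pt', truncation=True, max_length=512)
    outputs = model(inputs)[0]  # logits, shape (1, 3)
    scores = softmax(outputs.detach().numpy(), axis=1)[0]
    return scores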

# Generate sentiment scores for each comment
data["Roberta_Sentiment_Scores"] = data["Comment"].apply(polarity_scores_roberta)
data.head()


Unnamed: 0,Comment,Roberta_Sentiment_Scores
0,Fun and flexible work environment and an oppor...,"[0.004399926, 0.057076164, 0.9385239]"
1,"be prepared for constant changes, rules, error...","[0.5717372, 0.3928328, 0.03543008]"
2,Award-winning training. All the support you ne...,"[0.0065416875, 0.032872286, 0.9605861]"
3,very long hours,"[0.52065134, 0.41311723, 0.06623143]"
4,Inclusive environment. Good salary and benefit...,"[0.005523243, 0.032195657, 0.96228105]"


In [17]:
# Define a function that maps each comment to a binary sentiment label
def analyze_sentiment(comment):
    inputs = tokenizer(comment, return_tensors='pt', truncation=True, padding=True, max_length=512)
    logits = model(**inputs).logits
    # Label index 2 is Positive; Negative (0) and Neutral (1) both map to "Negative"
    label = "Positive" if logits.argmax().item() == 2 else "Negative"
    return label

# Generate a sentiment label for each comment
data["Roberta_Sentiment"] = data["Comment"].apply(analyze_sentiment)
data.head()


Unnamed: 0,Comment,Roberta_Sentiment_Scores,Roberta_Sentiment
0,Fun and flexible work environment and an oppor...,"[0.004399926, 0.057076164, 0.9385239]",Positive
1,"be prepared for constant changes, rules, error...","[0.5717372, 0.3928328, 0.03543008]",Negative
2,Award-winning training. All the support you ne...,"[0.0065416875, 0.032872286, 0.9605861]",Positive
3,very long hours,"[0.52065134, 0.41311723, 0.06623143]",Negative
4,Inclusive environment. Good salary and benefit...,"[0.005523243, 0.032195657, 0.96228105]",Positive
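
The rule above labels a comment "Negative" whenever the top class is not Positive, so neutral-dominant comments also end up Negative. One possible alternative, sketched here as an assumption rather than the notebook's method, ignores the neutral score and compares only the positive and negative scores. The helper name `binary_sentiment` is hypothetical.

```python
def binary_sentiment(scores):
    """Pick Positive or Negative from [negative, neutral, positive] scores,
    ignoring the neutral score entirely."""
    negative, _neutral, positive = scores
    return "Positive" if positive > negative else "Negative"

# Scores copied from the notebook output for comments 1 and 0.
comment_1 = binary_sentiment([0.5717372, 0.3928328, 0.03543008])
comment_0 = binary_sentiment([0.004399926, 0.057076164, 0.9385239])

# Hypothetical neutral-dominant scores: the argmax rule would label this Negative.
neutral_case = binary_sentiment([0.10, 0.60, 0.30])

print(comment_1, comment_0, neutral_case)
```

Whether this is preferable depends on how neutral comments should be treated; both rules agree on the examples shown in the table above.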


In [18]:
data.loc[20, "Comment"]

"Typical Day* Challenging, but always rewarding. What I've Learned? * Overcoming the challenges and trying New things. Management* Caring, Very Helpful and wants you to succeed in every area possible. Workplace Culture* It's culture moves around Fast Pace and Productivity and Innovation. The Most Enjoyable Part of The Job* Knowing being Rewarded in your good and leveling up to become Great in your Greatness. Benefits, Dental, Medical, Visual Insurance."

In [20]:
data.loc[20, "Roberta_Sentiment_Scores"]

array([0.00521637, 0.02851149, 0.96627206], dtype=float32)

In [21]:
data.loc[20, "Roberta_Sentiment"]

'Positive'

In [32]:
data.loc[35, "Comment"]

'Union busting is disgusting! Even though we might not work at the air hub, working people must fight back against Amazon’s bullying tactics of firing workers who dare to fight for a fair wage and safe conditions!'

In [33]:
data.loc[35, "Roberta_Sentiment_Scores"]

array([0.8833754 , 0.10639054, 0.01023402], dtype=float32)

In [34]:
data.loc[35, "Roberta_Sentiment"]

'Negative'

In [25]:
data.loc[40, "Comment"]

'Stepping down from running the San Diego studio at Amazon Games. I have loved working for Christoph. Our San Diego team is incredible and the Games leadership is so talented. Can’t wait to see what the future brings for the team.'

In [26]:
data.loc[40, "Roberta_Sentiment_Scores"]

array([0.00289751, 0.00987672, 0.9872257 ], dtype=float32)

In [28]:
data.loc[40, "Roberta_Sentiment"]

'Positive'

In [29]:
data.loc[1, "Comment"]

'be prepared for constant changes, rules, errors, lack of support'

In [30]:
data.loc[1, "Roberta_Sentiment_Scores"]

array([0.5717372 , 0.3928328 , 0.03543008], dtype=float32)

In [31]:
data.loc[1, "Roberta_Sentiment"]

'Negative'

## Topic Modeling

In [35]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Text Preprocessing

In [36]:
# Preprocessing function
def preprocess_text(text):
    text = text.lower() # Lowercase text
    text = text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
    tokens = word_tokenize(text) # Tokenize text

    # Remove stopwords and lemmatize the remaining words
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    # Join the tokens back into a single string
    review_lemmatized = " ".join(tokens)

    return review_lemmatized
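
One side effect of deleting punctuation outright is that hyphenated words and contractions are fused into single tokens. The sketch below uses only the standard library and a hypothetical sample sentence to compare that behavior with an alternative that maps punctuation to spaces. It illustrates the trade-off and is not a change to the function above.

```python
import string

sample = "Award-winning training, and I haven't stopped learning."  # hypothetical text

# Deleting punctuation, as preprocess_text does, fuses hyphenated words.
deleted = sample.lower().translate(str.maketrans("", "", string.punctuation))

# Alternative: map each punctuation mark to a space, then collapse whitespace.
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = " ".join(sample.lower().translate(to_space).split())

print(deleted)
print(spaced)
```

The second variant keeps "award" and "winning" as separate tokens, at the cost of splitting contractions into fragments such as "haven" and "t".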


In [37]:
# Apply preprocessing to each comment
data["Processed_Comment"] = data["Comment"].apply(preprocess_text)
data.head()

Unnamed: 0,Comment,Roberta_Sentiment_Scores,Roberta_Sentiment,Processed_Comment
0,Fun and flexible work environment and an oppor...,"[0.004399926, 0.057076164, 0.9385239]",Positive,fun flexible work environment opportunity lear...
1,"be prepared for constant changes, rules, error...","[0.5717372, 0.3928328, 0.03543008]",Negative,prepared constant change rule error lack support
2,Award-winning training. All the support you ne...,"[0.0065416875, 0.032872286, 0.9605861]",Positive,awardwinning training support need havent done...
3,very long hours,"[0.52065134, 0.41311723, 0.06623143]",Negative,long hour
4,Inclusive environment. Good salary and benefit...,"[0.005523243, 0.032195657, 0.96228105]",Positive,inclusive environment good salary benefit mana...


In [38]:
data.loc[90, "Comment"]

'The opportunities to raise through the ladder are low and are extremely tough with very high competition.'

In [39]:
data.loc[90, "Processed_Comment"]

'opportunity raise ladder low extremely tough high competition'

### Topic Modeling using BERTopic

* **UMAP (Uniform Manifold Approximation and Projection)**

UMAP is a dimensionality reduction technique. In `BERTopic` it reduces the high-dimensional sentence embeddings to a lower-dimensional space before the documents are clustered, while preserving as much of the underlying structure of the data as possible.

In [42]:
# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5, # reduce the embeddings to 5 dimensions
                  min_dist=0.0, # minimum distance between points in the low-dimensional space; 0.0 allows tight packing
                  metric='cosine', # measure distance between embeddings with the cosine metric
                  random_state=45) # fix the seed for reproducible results

# Initiate BERTopic
topic_model = BERTopic(umap_model=umap_model, language="english", calculate_probabilities=True)

# Run BERTopic model
topics, probabilities = topic_model.fit_transform(data['Processed_Comment'])

In [43]:
# Extract topics - get the list of topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,79,0_work_time_management_good,"[work, time, management, good, day, hour, job,...",[honestly say work culture great want working ...
1,1,20,1_amazon_working_work_team,"[amazon, working, work, team, loved, ive, thin...",[great job loved manger care time work load ba...


In [44]:
# Get top 10 terms for a topic
topic_model.get_topic(0)

[('work', 0.09768722161098312),
 ('time', 0.06734318099925994),
 ('management', 0.05452003038482862),
 ('good', 0.05088536169250671),
 ('day', 0.04928336464690906),
 ('hour', 0.04807233775373709),
 ('job', 0.03998135561554098),
 ('break', 0.03919419120757278),
 ('like', 0.03839381200331249),
 ('place', 0.03839381200331249)]

In [45]:
topic_model.get_topic(1)

[('amazon', 0.20160322111982354),
 ('working', 0.10994436623839304),
 ('work', 0.08267030474268558),
 ('team', 0.07706625332245272),
 ('loved', 0.056783942926480174),
 ('ive', 0.052382862548171163),
 ('thing', 0.04686028366341683),
 ('ppl', 0.04493464630013803),
 ('great', 0.04489768179622335),
 ('people', 0.04489768179622335)]

In [46]:
# Visualize top topic keywords
topic_model.visualize_barchart(top_n_topics=12)

In [47]:
# Visualize similarity using heatmap
topic_model.visualize_heatmap()

In [48]:
# Get the topic predictions
topic_prediction = topic_model.topics_[:]

# Save the predictions in the dataframe
data['topic_prediction'] = topic_prediction

data.head()

Unnamed: 0,Comment,Roberta_Sentiment_Scores,Roberta_Sentiment,Processed_Comment,topic_prediction
0,Fun and flexible work environment and an oppor...,"[0.004399926, 0.057076164, 0.9385239]",Positive,fun flexible work environment opportunity lear...,0
1,"be prepared for constant changes, rules, error...","[0.5717372, 0.3928328, 0.03543008]",Negative,prepared constant change rule error lack support,0
2,Award-winning training. All the support you ne...,"[0.0065416875, 0.032872286, 0.9605861]",Positive,awardwinning training support need havent done...,0
3,very long hours,"[0.52065134, 0.41311723, 0.06623143]",Negative,long hour,0
4,Inclusive environment. Good salary and benefit...,"[0.005523243, 0.032195657, 0.96228105]",Positive,inclusive environment good salary benefit mana...,0


In [58]:
topic_ids = sorted(data['topic_prediction'].unique())

# Create a separate Yes/No column for each topic
for topic in topic_ids:
  data[f'Topic {topic}'] = data['topic_prediction'].apply(lambda x: 'Yes' if x == topic else 'No')

data.head()

Unnamed: 0,Comment,Roberta_Sentiment_Scores,Roberta_Sentiment,Processed_Comment,topic_prediction,Topic_0,Topic_1,Topic 0,Topic 1
0,Fun and flexible work environment and an oppor...,"[0.004399926, 0.057076164, 0.9385239]",Positive,fun flexible work environment opportunity lear...,0,Yes,No,Yes,No
1,"be prepared for constant changes, rules, error...","[0.5717372, 0.3928328, 0.03543008]",Negative,prepared constant change rule error lack support,0,Yes,No,Yes,No
2,Award-winning training. All the support you ne...,"[0.0065416875, 0.032872286, 0.9605861]",Positive,awardwinning training support need havent done...,0,Yes,No,Yes,No
3,very long hours,"[0.52065134, 0.41311723, 0.06623143]",Negative,long hour,0,Yes,No,Yes,No
4,Inclusive environment. Good salary and benefit...,"[0.005523243, 0.032195657, 0.96228105]",Positive,inclusive environment good salary benefit mana...,0,Yes,No,Yes,No
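
The columns above mark exactly one topic per comment, whereas the task allows a comment to cover several. Because the model was fitted with `calculate_probabilities=True`, `probabilities` should hold one row per comment and one column per topic, so a threshold on it could flag multiple topics. The sketch below shows the idea on a hypothetical matrix; the values and the `THRESHOLD` cut-off are assumptions, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic probabilities (rows: comments, columns: topics).
probs = np.array([
    [0.70, 0.25],
    [0.45, 0.40],
    [0.05, 0.90],
])
THRESHOLD = 0.30  # assumed cut-off; would need tuning on the real data

# Mark every topic whose probability reaches the threshold.
flags = pd.DataFrame(
    np.where(probs >= THRESHOLD, "Yes", "No"),
    columns=[f"Topic {i}" for i in range(probs.shape[1])],
)

print(flags)
```

With a real model, `probs` would be replaced by the `probabilities` array returned by `fit_transform`, and the resulting columns joined to `data` by row position.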


In [59]:
df_1 = data[['Comment', 'Roberta_Sentiment', 'Topic 0', 'Topic 1']]
df_1 = df_1.rename(columns={'Roberta_Sentiment':'Sentiment'})
df_1.head()

Unnamed: 0,Comment,Sentiment,Topic 0,Topic 1
0,Fun and flexible work environment and an oppor...,Positive,Yes,No
1,"be prepared for constant changes, rules, error...",Negative,Yes,No
2,Award-winning training. All the support you ne...,Positive,Yes,No
3,very long hours,Negative,Yes,No
4,Inclusive environment. Good salary and benefit...,Positive,Yes,No


In [60]:
# index=False keeps the row index out of the saved file
df_1.to_csv('/content/drive/MyDrive/Data_Science/Coding_Challenge/Submission.csv', index=False)

------------