<a href="https://colab.research.google.com/github/SRINIRAGZ/sentimentAnalysis/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clone the git repo

In [None]:
from getpass import getpass

# Securely ask for GitHub token
token = getpass("Enter your GitHub token: ")
!git clone https://{token}@github.com/SRINIRAGZ/sentimentAnalysis.git

# Installation of libraries

In [None]:
!pip install transformers torch sentence-transformers scikit-learn

# Code

In [None]:
import numpy as np
import os
import pandas as pd
import re


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

current_directory = os.getcwd()
print(f"Current working directory: {current_directory}")

## Configs

In [None]:
filename = 'engagements.csv'
DataFolder = './sentimentAnalysis/data/{filename}'
ResultsFolder = './sentimentAnalysis/results/{filename}'

## Data Import

In [None]:
df = pd.read_csv(DataFolder.format(filename=filename))

In [None]:
df.head()

## Data Cleaning

Converting `timestamp` to a valid datetime type,
then converting the text columns to strings and stripping any leading and trailing spaces.


In [None]:

df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed', utc=True)
# Fill missing text with '' before casting so blanks are not turned into the string 'nan'
df['media_caption'] = df['media_caption'].fillna('').astype(str).str.strip()
df['comment_text'] = df['comment_text'].fillna('').astype(str).str.strip()
print(df.shape)
print(df.dtypes)

cleaning duplicates if present

In [None]:
df.drop_duplicates(inplace=True)
df.shape

Estimating the time difference between consecutive engagements on the same media

In [None]:

df.sort_values(['timestamp','media_id'], inplace=True)
df['timedelta'] = df.groupby('media_id')['timestamp'].diff()
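To make the per-media time difference concrete, here is a minimal sketch on made-up rows. The `toy` frame and its values are illustrative assumptions, not data from `engagements.csv`.

```python
import pandas as pd

# Made-up engagement rows: two media items with out-of-order timestamps.
toy = pd.DataFrame({
    "media_id": [1, 2, 1, 2, 1],
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 09:00", "2024-01-01 10:30",
        "2024-01-01 09:45", "2024-01-01 12:00",
    ], utc=True),
})

# Sort by time so each group's rows are in chronological order,
# then take the difference to the previous row within the same media_id.
toy = toy.sort_values(["timestamp", "media_id"]).reset_index(drop=True)
toy["timedelta"] = toy.groupby("media_id")["timestamp"].diff()

print(toy)
```

The first row of each media item has no predecessor, so its `timedelta` is `NaT`.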

Add a comment category by looking for mentions.
Rows are split into four categories: `comments` (comment text only), `mentions` (mentions only), `commentions` (both mentions and comment text), and `no comments` (left blank).

In [None]:

def comments_category(text):
    # Pattern as used throughout this notebook: an apostrophe, then @ and word characters
    mention_pattern = r"'@\w+"
    has_mention = re.search(mention_pattern, text)
    # Strip so that whitespace left between removed mentions does not count as a comment
    text_without_mentions = re.sub(mention_pattern, '', text).strip()
    if has_mention and text_without_mentions == '':
        return 'mentions'
    elif has_mention:
        return 'commentions'
    elif len(text.strip()) == 0:
        return 'no comments'
    else:
        return 'comments'



In [None]:
df['comment_type'] = df['comment_text'].apply(comments_category)
df['comment_type'].value_counts()
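The four categories can be checked quickly on a few hand-written strings. The sketch below is standalone: `categorize` and the sample texts are hypothetical, the pattern is copied from the notebook, and stripping leftover whitespace after removing mentions is an assumption about the intended behavior.

```python
import re

# Same pattern as the notebook: an apostrophe followed by @ and word characters.
MENTION_PATTERN = r"'@\w+"

def categorize(text: str) -> str:
    """Return 'mentions', 'commentions', 'no comments' or 'comments'."""
    has_mention = re.search(MENTION_PATTERN, text) is not None
    # Whatever is left once mentions and surrounding whitespace are removed.
    remainder = re.sub(MENTION_PATTERN, "", text).strip()
    if has_mention and not remainder:
        return "mentions"
    if has_mention:
        return "commentions"
    if not text.strip():
        return "no comments"
    return "comments"

# Made-up comments, one per category.
samples = ["'@alice '@bob", "love this '@alice", "", "great product"]
labels = [categorize(s) for s in samples]
print(labels)
```

Without the `.strip()` on the remainder, a row holding two mentions separated by a space would be classed as `commentions`, because the space between them survives the substitution.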

add media count

In [None]:

df['count'] = df.groupby('media_id')['media_id'].transform('count')


add comments without mentions

In [None]:

df['comment_wo_mentions'] = df['comment_text'].str.replace(r"'@\w+",'', regex=True)
df[df['comment_type']=='mentions'].head()

## Looking for Semantic Similarities between media text

Semantic similarity between captions can be measured with a small sentence-embedding model: each caption is encoded as a vector, and captions with similar meaning end up close together. The aim here is to cluster the captions on these embeddings so that later analysis can be done per cluster.

In [None]:
#params
MODEL='all-mpnet-base-v2' #'all-MiniLM-L12-v2'

In [None]:
# .copy() so that adding the cluster column later does not write into a slice of df
df2 = df[['media_id','media_caption']].drop_duplicates().copy()

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Generate embeddings
embedder = SentenceTransformer(MODEL)
embeddings = embedder.encode(list(df2['media_caption']))
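What "close together" means for embeddings can be illustrated with cosine similarity. This is a sketch on made-up three-dimensional vectors; real sentence embeddings have hundreds of dimensions, and `cosine_similarity_matrix` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # scale every row to unit length
    return unit @ unit.T    # dot products of unit vectors are cosines

# Made-up "embeddings": rows 0 and 1 point in nearly the same direction,
# row 2 is orthogonal to row 0.
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
sim = cosine_similarity_matrix(toy_embeddings)
print(np.round(sim, 3))
```

Captions whose vectors have a cosine similarity near 1 are the ones a clustering step would be expected to group together.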



Looking at optimal cluster value for media captions

In [None]:
clusterlist = range(3,10)
sse=[]

for k in clusterlist:
    clustering_model = KMeans(n_clusters=k, random_state=42)
    clustering_model.fit(embeddings)
    sse.append(clustering_model.inertia_)

# Plot Elbow Curve
plt.plot(clusterlist, sse, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster SSE")
plt.title("Elbow Method for Optimal cluster")
plt.show()
# for k, sse in zip(list(range(len(sse))),sse):
#   print(k, sse)
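Reading an elbow off a plot is subjective, so it can help to look at how much the SSE drops with each added cluster. The sketch below uses synthetic 2-D points around three centres (all values are made up) rather than the caption embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three well-separated centres (illustrative only).
rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Inertia (within-cluster SSE) for each candidate k.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias[k] = model.inertia_

# Drop in inertia gained by going from k-1 to k clusters;
# the elbow is where these gains flatten out.
gains = {k: inertias[k - 1] - inertias[k] for k in range(2, 7)}
print(gains)
```

On this toy data the gain collapses after `k = 3`, which matches the three centres used to generate it. On real embeddings the drop is usually less sharp.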

Running KMeans with the number of clusters chosen from the elbow analysis (10 in the cell below)

In [None]:
# k= ~10 for optimal clustering
num_clusters = 10
clustering_model = KMeans(n_clusters=num_clusters, random_state=42)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_

In [None]:
df2['cluster'] = cluster_assignment
df2.head()

adding the cluster value to original dataframe

In [None]:
df = df.merge(df2[['media_id','cluster']],how='left',on='media_id')
df['cluster'] = df['cluster'].astype(str)
df.head()

## Sentiment Analysis

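Using a RoBERTa model for sentiment analysis. RoBERTa is a robustly optimized variant of BERT, and the `cardiffnlp/twitter-roberta-base-sentiment` checkpoint used here is fine-tuned for sentiment classification on tweets, which are short and informal like the comments in this dataset. Like BERT, it takes the surrounding context of each word into account.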

In [None]:
#model params
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
label_mapping = {'LABEL_0': 'Negative', 'LABEL_1': 'Neutral', 'LABEL_2': 'Positive'}

In [None]:
from transformers import pipeline

# Load sentiment analysis pipeline

sentiment_classifier = pipeline("sentiment-analysis", model=MODEL)


Run sentiment analysis on comments without mentions that we have created

In [None]:
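# Run sentiment analysis
comment_texts = list(df['comment_wo_mentions'])
# truncation=True cuts comments that exceed the model's maximum input length
results = sentiment_classifier(comment_texts, truncation=True)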

In [None]:
#Rework on the dataframe with sentiment results
df['sentiment'] = [label_mapping[result['label']] for result in results]
df['sentiment_score'] = [result['score'] for result in results]
df.head()
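The shape of the classifier output and the label mapping can be seen without downloading the model. In this sketch `fake_results` is a hand-written stand-in for the pipeline output (one dict with `label` and `score` per input text); the texts and scores are made up.

```python
import pandas as pd

label_mapping = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

# Hand-written stand-in for the classifier output: one dict per input text,
# each with a raw 'label' and a confidence 'score'. Values are made up.
fake_results = [
    {"label": "LABEL_2", "score": 0.97},
    {"label": "LABEL_0", "score": 0.81},
    {"label": "LABEL_1", "score": 0.66},
]

toy = pd.DataFrame({"comment_wo_mentions": ["love it", "arrived broken", "ok"]})
# Results come back in input order, so they can be attached positionally.
toy["sentiment"] = [label_mapping[r["label"]] for r in fake_results]
toy["sentiment_score"] = [r["score"] for r in fake_results]
print(toy)
```

Note that `sentiment_score` is the confidence of the predicted label, not a signed polarity, so a high score on a `Negative` row means "confidently negative".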

# Results Analysis

Although sentiment is computed for every row, it is mainly meaningful for `comments` and `commentions` (comments that also contain mentions).

In [None]:
df.pivot_table(index='comment_type', columns='sentiment',values='sentiment_score',aggfunc='mean')

The table above shows that mention-only rows are always neutral. Sentiment analysis is therefore meaningful only for rows that contain comment text, not for rows that contain mentions alone.

### Prepping data for mentions count by media and cluster

Mentions count can be treated as a positive signal, since customers who mention other users increase the content's visibility.

In [None]:
df['mentions_count_bymedia'] = df[df.comment_type.str.contains('mentions')].groupby('media_id',as_index=False)['comment_type'].transform('count')
df['mentions_count_bycluster'] = df[df.comment_type.str.contains('mentions')].groupby('cluster',as_index=False)['comment_type'].transform('count')

df[df.comment_type== 'mentions'].head()
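The assignment above relies on index alignment: the counts are computed on a filtered subset and written back to the full frame. A minimal sketch on made-up rows shows what ends up in the new column.

```python
import pandas as pd

toy = pd.DataFrame({
    "media_id": [1, 1, 1, 2, 2],
    "comment_type": ["mentions", "commentions", "comments", "mentions", "no comments"],
})

# Rows whose category contains 'mentions' (this matches 'commentions' too).
is_mention = toy["comment_type"].str.contains("mentions")

# transform('count') returns one value per selected row, indexed like the subset,
# so assigning it back leaves NaN on the rows that were filtered out.
toy["mentions_count_bymedia"] = (
    toy[is_mention].groupby("media_id")["comment_type"].transform("count")
)
print(toy)
```

Because of the `NaN` values, later filters such as `mentions_count_bymedia > 10` silently exclude rows that have no mention at all.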


In [None]:
df4 = df[(df.comment_type== 'mentions')&(df.mentions_count_bymedia>10)][['media_id','mentions_count_bymedia']].drop_duplicates()
fig,ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=df4,y='mentions_count_bymedia',x='media_id', ax=ax, order=df4.sort_values('mentions_count_bymedia', ascending=False).media_id)
ax.set_xlabel('Media ID')
ax.set_ylabel('Mentions Count')
ax.set_title('Top media id where Mentions Count is high')
ax.tick_params(axis='x', rotation=90)
plt.show()

### Clusters carry meaning derived from the text of the captions

In [None]:
df4 = df[(df.comment_type== 'mentions')&(df.mentions_count_bycluster>10)][['cluster','mentions_count_bycluster']].drop_duplicates()
df4['cluster'] = df4['cluster'].astype(str)
print(df4.dtypes)
fig2,ax2 = plt.subplots(figsize=(10, 6))
sns.barplot(data=df4,y='mentions_count_bycluster', x='cluster', ax=ax2)
ax2.set_xlabel('cluster')
ax2.set_ylabel('Mentions Count')
ax2.set_title('Top clusters where Mentions Count is high')
plt.show()

### Sentiment plot by media

In [None]:
def sentiment_bar_plot(col, title, rot=0, sortby='x'):
  sentiment_counts = df.groupby([col, 'sentiment']).size().unstack(fill_value=0)

  if sortby=='x':
    top_ids = df[col].value_counts().head(10).index
    sentiment_counts_subset = sentiment_counts.loc[top_ids]
  else:
    top_ids = sentiment_counts.sort_values(by=sortby,ascending=False).head(10).index
    sentiment_counts_subset = sentiment_counts.loc[top_ids][sortby]

  sentiment_counts_subset.plot(kind='bar', figsize=(15, 7))
  plt.title(f'Sentiment Distribution per {title}')
  plt.xlabel(f'{col}')
  plt.ylabel('Number of Comments')
  plt.xticks(rotation=rot)
  plt.tight_layout()
  plt.show()
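The table that `sentiment_bar_plot` draws from can be inspected on its own. This sketch rebuilds it on a made-up frame: counts per group and sentiment, then the groups ranked by negative comments.

```python
import pandas as pd

toy = pd.DataFrame({
    "media_id": [1, 1, 1, 2, 2, 3],
    "sentiment": ["Positive", "Negative", "Positive", "Neutral", "Negative", "Positive"],
})

# One row per media_id, one column per sentiment, zeros where a pair never occurs.
sentiment_counts = toy.groupby(["media_id", "sentiment"]).size().unstack(fill_value=0)

# Media ranked by number of negative comments (top 2 here; the notebook uses 10).
top_negative = sentiment_counts.sort_values(by="Negative", ascending=False).head(2)
print(sentiment_counts)
print(top_negative["Negative"])
```

With `sortby='x'` the function plots all sentiment columns for the busiest groups; with `sortby='Negative'` it plots only that one column, which is why the two kinds of chart look different.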

In [None]:

sentiment_bar_plot(col='media_id', title='media (top 10)', rot=90)


### Top negative comments by media

In [None]:
sentiment_bar_plot(col='media_id', title='media (top 10) - Negative', rot=90, sortby='Negative')

### Top comments by cluster

In [None]:
sentiment_bar_plot(col='cluster', title='cluster', rot=0)

### Top negative comments by cluster

In [None]:
sentiment_bar_plot(col='cluster', title='cluster', rot=90, sortby='Negative')

### Scatter plot of total comments per media, excluding extremes

No significant conclusion can be obtained from the counts vs sentiment

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
sentiment_df = df[(df.comment_type.str.contains('com'))&(df['count']>25)&(df['count']<250)].copy()
sentiment_df['media_id'] = sentiment_df['media_id'].astype(str)
sentiment_df.head()


sns.scatterplot(data=sentiment_df, x='media_id', y='count', ax=ax, hue='sentiment')
ax.set_ylabel('Count')
ax.set_title('Count (without extremities) vs Media ID - colored by sentiment')
plt.xticks(rotation=90)
# ax.set_yscale('log')
plt.tight_layout()
plt.show()

## Summarization

The summarization model summarizes user comments for each media item, separately for positive and negative sentiment. BART or FLAN-T5 is used because both are encoder-decoder models: the encoder reads the comments and the decoder generates the summary.

In [None]:
summarizer_model = "facebook/bart-large-cnn"  # alternative: "google/flan-t5-large"

In [None]:

df[(df['media_id']==17872089159294304)&(df.comment_type.str.contains('com'))&(df.sentiment=='Negative')].shape

In [None]:
summarizer = pipeline("summarization", model=summarizer_model)


##### Generate summaries for media with many comments, here those with more than 100

In [None]:
df_summary = df[(df['count']>100)&(df.media_id!=17872089159294304)]

In [None]:

# for m in df_summary['media_id'].unique():
#   print(f'\nmedia: {m}')
#   for s in ['Positive','Negative']:
#     df_tmp = df[(df['media_id']==m)&(df.comment_type.str.contains('com'))&(df.sentiment==s)]
#     if df_tmp.shape[0]<5:
#       continue
#     comments = list(df_tmp['comment_wo_mentions'])
#     comments = [str(i)+". "+c+"\n" for i,c in enumerate(comments)]
#     allcomments = f"Summarize all the following {s} user comments on products: " + " ".join(comments)
#     summary = summarizer(allcomments, max_length=25, min_length=5, do_sample=False)
#     print(f"\t{s} summary: {summary[0]['summary_text']}")
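The commented-out loop joins every comment for a media item into one input string. For media with many comments that string may exceed what the summarizer accepts, so one option is to split the comments into batches first. The sketch below is a hypothetical helper, not part of the notebook: `build_batches` and `max_chars` are assumed names, and character count is only a rough stand-in for the model's token limit.

```python
def build_batches(comments, max_chars=2000):
    """Group comments into numbered text blocks.

    The raw comment text in each block stays within roughly `max_chars`
    characters. `max_chars=2000` is an arbitrary illustrative value,
    not a tuned setting.
    """
    batches, current, size = [], [], 0
    for comment in comments:
        line = comment.strip()
        if not line:
            continue  # skip blank comments
        # Start a new batch when adding this comment would exceed the budget.
        if current and size + len(line) + 1 > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        batches.append(current)
    # Number the comments within each batch and join them into one string.
    return ["\n".join(f"{i}. {c}" for i, c in enumerate(b, start=1)) for b in batches]

# Made-up comments; a tiny budget forces more than one batch.
demo = build_batches(["great scent", "", "bottle leaked", "too pricey"], max_chars=25)
print(demo)
```

Each returned string could then be passed to the summarizer separately, with the partial summaries combined afterwards. Whether that two-stage approach gives better summaries than truncating the input is something to check on the actual data.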
