<a href="https://colab.research.google.com/github/SterlingHayden/Text-Analytics/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

## Data Exploration

In [None]:
df_iphone = pd.read_excel('/content/iphone_reddit_data.xlsx')
df_pixel = pd.read_excel('/content/pixel_reddit_data.xlsx')

In [None]:
df_iphone.head()

Unnamed: 0,text,upvotes,upvote_ratio,text_type,num_comments
0,TEMP,0,0.0,0,0.0
1,Welcome to the weekly stickied WSIB thread.\n\...,0,0.5,Original Post,5.0
2,When will you think the second batch of iPhone...,1,,Comment/Reply,
3,Girlfriend wants the latest iPhone but can't a...,1,,Comment/Reply,
4,Hi! I'm in a dire need of a new iPhone and bas...,1,,Comment/Reply,


In [None]:
df_pixel.head()

Unnamed: 0,text,upvotes,upvote_ratio,text_type,num_comments
0,TEMP,0,0.0,0,0.0
1,This is the weekly photo Megathread. Photos ca...,2,0.75,Original Post,0.0
2,*If you were redirected here from a removed po...,6,0.8,Original Post,88.0
3,Shipping mega thread link is outdated. That's ...,11,,Comment/Reply,
4,What is your overnight P9PXL drain? I lose abo...,4,,Comment/Reply,


The first two rows of each dataframe are not real discussion: row 0 is a TEMP placeholder and row 1 is a weekly thread created by bots/mods. I'm going to drop them.

In [None]:
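# Drop the first two rows by position; .copy() avoids a SettingWithCopyWarning
# when new columns are added to these frames later
df_iphone = df_iphone.iloc[2:].copy()
df_pixel = df_pixel.iloc[2:].copy()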

## Data Preprocessing  

In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # Remove links ('http\S+' already covers https URLs)
    text = re.sub(r'http\S+|www\S+', '', text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords (this runs before punctuation is stripped,
    # so a token such as "'m" is kept and later becomes "m")
    filtered_tokens = [word for word in tokens if word.lower() not in STOP_WORDS]

    # Remove punctuation and special characters, dropping tokens that become empty
    stripped = (re.sub(r'\W+', '', word) for word in filtered_tokens)
    filtered_tokens = [word for word in stripped if word]

    # Lowercase all words
    filtered_tokens = [word.lower() for word in filtered_tokens]

    # Lemmatize words (the default part of speech is noun)
    clean_tokens = [LEMMATIZER.lemmatize(word) for word in filtered_tokens]

    return ' '.join(clean_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
#applying clean_text() to the text data in r/iphone
df_iphone['cleaned_text'] = df_iphone['text'].astype(str).apply(clean_text)

#applying clean_text() to the text data in r/pixel
df_pixel['cleaned_text'] = df_pixel['text'].astype(str).apply(clean_text)

In [None]:
df_iphone['cleaned_text']

Unnamed: 0,cleaned_text
2,think second batch iphone 16 pro max available...
3,girlfriend want latest iphone ca nt afford sti...
4,hi m dire need new iphone basically question g...
5,product apple arrived defective l returned rea...
6,ordered mine att day launch still getting oct ...
...,...
14000,know annoying pay premium premium product come...
14001,desert titanium pro max arrived scratch saw qu...
14002,know damage happened tiny thing wont show year...
14003,wow thought almost impossible defect apple pro...


In [None]:
df_pixel['cleaned_text']

Unnamed: 0,cleaned_text
2,redirected removed post megathread link please...
3,shipping mega thread link outdated s last year...
4,overnight p9pxl drain lose 15 overnight feel r...
5,p9pxl averaging 6hrs sot last day m ending day...
6,live australia jbhifi promos around new p9 ser...
...,...
12531,deleted
12532,re hiding mentioned new lens new sensor chose ...
12533,suggestion camera hardware improved result bri...
12534,deleted
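
The cleaned columns above still contain stray fragments such as "m", "s", and "re". A likely cause is the order of steps in clean_text(): stopwords are removed while contraction tokens still carry their apostrophe, so they only turn into stopword-like fragments afterwards. The sketch below illustrates this with the standard library only. The token list and the tiny stopword set are hand-written assumptions standing in for the tokenizer output and NLTK's English stopword list.

```python
import re

# Hypothetical stand-ins: a tokenizer typically splits "I'm" into "I" and "'m"
TOKENS = ["Hi", "!", "I", "'m", "in", "need", "of", "a", "new", "iPhone"]
STOP_WORDS = {"i", "m", "in", "of", "a"}

def strip_punct(tokens):
    """Remove non-word characters and drop tokens that become empty."""
    stripped = (re.sub(r"\W+", "", t) for t in tokens)
    return [t for t in stripped if t]

def drop_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# Order used in clean_text(): stopwords first, punctuation second
stopwords_first = [t.lower() for t in strip_punct(drop_stopwords(TOKENS))]

# Alternative order: punctuation first, stopwords second
punct_first = [t.lower() for t in drop_stopwords(strip_punct(TOKENS))]

print(stopwords_first)  # ['hi', 'm', 'need', 'new', 'iphone']
print(punct_first)      # ['hi', 'need', 'new', 'iphone']
```

Reordering the steps would change the cleaned text and therefore the sentiment scores, so it is worth deciding on the order before running the analysis.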


In [None]:
#df_pixel.to_excel('pixel_processed.xlsx')
#df_iphone.to_excel('iphone_processed.xlsx')

Let's combine both dataframes into one.

In [None]:
df_combined = pd.concat([df_pixel, df_iphone], axis=0, ignore_index=True)
df_combined.head()

Unnamed: 0,text,upvotes,upvote_ratio,text_type,num_comments,cleaned_text
0,*If you were redirected here from a removed po...,6,0.8,Original Post,88.0,redirected removed post megathread link please...
1,Shipping mega thread link is outdated. That's ...,11,,Comment/Reply,,shipping mega thread link outdated s last year...
2,What is your overnight P9PXL drain? I lose abo...,4,,Comment/Reply,,overnight p9pxl drain lose 15 overnight feel r...
3,P9PXL - Averaging about 6hrs SOT the last few ...,4,,Comment/Reply,,p9pxl averaging 6hrs sot last day m ending day...
4,"I live in Australia, JBHIFI are doing some pro...",4,,Comment/Reply,,live australia jbhifi promos around new p9 ser...
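
After concatenation, nothing in df_combined records which subreddit a row came from. If a per-subreddit comparison is wanted later, one option is to tag each frame before combining. The sketch below uses two tiny made-up frames and a hypothetical `source` column name; it is an illustration, not part of the pipeline in this notebook.

```python
import pandas as pd

# Made-up miniature frames standing in for df_pixel and df_iphone
pixel = pd.DataFrame({"text": ["battery drain", "camera lens"]})
iphone = pd.DataFrame({"text": ["face id issue"]})

# Tag each frame with its origin (hypothetical column name), then concatenate
combined = pd.concat(
    [pixel.assign(source="pixel"), iphone.assign(source="iphone")],
    axis=0,
    ignore_index=True,
)

print(combined)
```

With such a column in place, a grouped mean like `combined.groupby("source")` could report sentiment per subreddit instead of one pooled average.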


## Sentiment Analysis
The analysis below runs only on df_combined.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # VADER for sentiment analysis
nltk.download('vader_lexicon')  # a Python call, so no leading "!" (that would send it to the shell)


In [None]:
# Build the analyzer once; constructing it on every call reloads the VADER lexicon
sia = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    # Reads the global target_words list, which must be set before this is called.
    # The input is passed through clean_text() again, even when it is already cleaned.
    clean = clean_text(text)

    sentiment_scores = sia.polarity_scores(clean)

    # Track the total count of all target words
    total_target_word_count = 0

    # Boost the compound score by 0.1 per occurrence of each target word.
    # str.count() matches substrings, so 'phone' is also counted inside 'iphone'.
    for word in target_words:
        word_count = clean.count(word.lower())
        total_target_word_count += word_count
        sentiment_scores['compound'] += 0.1 * word_count

    # Add a further 0.05 per occurrence (0.15 in total), so the adjusted
    # compound score can leave VADER's usual [-1, 1] range
    sentiment_scores['compound'] += 0.05 * total_target_word_count

    return sentiment_scores
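
analyze_sentiment() counts keywords with str.count(), which matches substrings rather than whole words. The sketch below contrasts that with a whole-token count on a made-up sentence. `count_tokens` is a hypothetical helper, not something this notebook defines.

```python
def count_substring(text, word):
    """Count occurrences the way str.count() does: any substring match."""
    return text.count(word.lower())

def count_tokens(text, word):
    """Count only whitespace-separated tokens equal to the word."""
    return text.split().count(word.lower())

sample = "video iphone face id used phone"  # made-up cleaned text

for keyword in ["id", "phone", "use"]:
    print(keyword, count_substring(sample, keyword), count_tokens(sample, keyword))
# id 2 1
# phone 2 1
# use 1 0
```

With keywords such as 'id', 'phone', and 'use', substring matching also counts hits inside 'video', 'iphone', and 'used', which inflates the boost added to the compound score.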

### Cluster A's Keywords

In [None]:
#words in cluster
_words = ['touch', 'charger', 'work', 'phone', 'face', 'charge', 'use', 'id', 'button']
#lemmatizing words of interest
lemmatizer = WordNetLemmatizer()
target_words = [lemmatizer.lemmatize(word) for word in _words]

In [None]:

df_combined['sentiment_A'] = df_combined['cleaned_text'].apply(analyze_sentiment)

#sentiment components
df_combined['compound_A'] = df_combined['sentiment_A'].apply(lambda x: x['compound'])
df_combined['positive_A'] = df_combined['sentiment_A'].apply(lambda x: x['pos'])
df_combined['neutral_A'] = df_combined['sentiment_A'].apply(lambda x: x['neu'])
df_combined['negative_A'] = df_combined['sentiment_A'].apply(lambda x: x['neg'])

#find average for each sentiment component
average_compound_A = df_combined['compound_A'].mean()
average_positive_A = df_combined['positive_A'].mean()
average_neutral_A = df_combined['neutral_A'].mean()
average_negative_A = df_combined['negative_A'].mean()

#average sentiment scores
print("Average Compound Sentiment of Cluster A:", average_compound_A)
print("Average Positive Sentiment of Cluster A:", average_positive_A)
print("Average Neutral Sentiment of Cluster A:", average_neutral_A)
print("Average Negative Sentiment of Cluster A:", average_negative_A)



Average Compound Sentiment of Cluster A: 0.4051006669932547
Average Positive Sentiment of Cluster A: 0.19315736518822776
Average Neutral Sentiment of Cluster A: 0.7190875004710405
Average Negative Sentiment of Cluster A: 0.0755839770885933


### Cluster B's Keywords

In [None]:
_words = ['iPhone', 'Video', 'Photo', 'Samsung', 'Set', 'Use', 'Phone', 'Pixel', 'App']
#lemmatizing words of interest
target_words = [lemmatizer.lemmatize(word) for word in _words]

In [None]:
df_combined['sentiment_B'] = df_combined['cleaned_text'].apply(analyze_sentiment)

#sentiment components
df_combined['compound_B'] = df_combined['sentiment_B'].apply(lambda x: x['compound'])
df_combined['positive_B'] = df_combined['sentiment_B'].apply(lambda x: x['pos'])
df_combined['neutral_B'] = df_combined['sentiment_B'].apply(lambda x: x['neu'])
df_combined['negative_B'] = df_combined['sentiment_B'].apply(lambda x: x['neg'])

#find average for each sentiment component
average_compound_B = df_combined['compound_B'].mean()
average_positive_B = df_combined['positive_B'].mean()
average_neutral_B = df_combined['neutral_B'].mean()
average_negative_B = df_combined['negative_B'].mean()

#average sentiment scores
print("Average Compound Sentiment of Cluster B:", average_compound_B)
print("Average Positive Sentiment of Cluster B:", average_positive_B)
print("Average Neutral Sentiment of Cluster B:", average_neutral_B)
print("Average Negative Sentiment of Cluster B:", average_negative_B)

Average Compound Sentiment of Cluster B: 0.4430571051739082
Average Positive Sentiment of Cluster B: 0.19315736518822776
Average Neutral Sentiment of Cluster B: 0.7190875004710405
Average Negative Sentiment of Cluster B: 0.0755839770885933


### Cluster C's Keywords

In [None]:
_words = ['XL', 'Quality', 'Issue', 'Max', '15', '16', 'Pixel', 'iPhone', 'Pro']
#lemmatizing words of interest
target_words = [lemmatizer.lemmatize(word) for word in _words]

In [None]:
df_combined['sentiment_C'] = df_combined['cleaned_text'].apply(analyze_sentiment)

#sentiment components
df_combined['compound_C'] = df_combined['sentiment_C'].apply(lambda x: x['compound'])
df_combined['positive_C'] = df_combined['sentiment_C'].apply(lambda x: x['pos'])
df_combined['neutral_C'] = df_combined['sentiment_C'].apply(lambda x: x['neu'])
df_combined['negative_C'] = df_combined['sentiment_C'].apply(lambda x: x['neg'])

#find average for each sentiment component
average_compound_C = df_combined['compound_C'].mean()
average_positive_C = df_combined['positive_C'].mean()
average_neutral_C = df_combined['neutral_C'].mean()
average_negative_C = df_combined['negative_C'].mean()

#average sentiment scores
print("Average Compound Sentiment of Cluster C:", average_compound_C)
print("Average Positive Sentiment of Cluster C:", average_positive_C)
print("Average Neutral Sentiment of Cluster C:", average_neutral_C)
print("Average Negative Sentiment of Cluster C:", average_negative_C)

Average Compound Sentiment of Cluster C: 0.36216438934318124
Average Positive Sentiment of Cluster C: 0.19315736518822776
Average Neutral Sentiment of Cluster C: 0.7190875004710405
Average Negative Sentiment of Cluster C: 0.0755839770885933


### Results

In [None]:
#create a df
average_sentiment_df = pd.DataFrame({
    'Cluster': ['Cluster_A', 'Cluster_B', 'Cluster_C'],
    'Average_Compound': [average_compound_A, average_compound_B, average_compound_C],
    'Average_Positive': [average_positive_A, average_positive_B, average_positive_C],
    'Average_Neutral': [average_neutral_A, average_neutral_B, average_neutral_C],
    'Average_Negative': [average_negative_A, average_negative_B, average_negative_C]
})

average_sentiment_df.set_index('Cluster')

Cluster,Average_Compound,Average_Positive,Average_Neutral,Average_Negative
Cluster_A,0.405101,0.193157,0.719088,0.075584
Cluster_B,0.443057,0.193157,0.719088,0.075584
Cluster_C,0.362164,0.193157,0.719088,0.075584
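
Only the compound column differs between clusters because analyze_sentiment() adjusts the compound score alone; the pos, neu, and neg values come straight from VADER on the same text. The adjustment adds 0.1 per keyword occurrence inside the loop and another 0.05 per occurrence afterwards, 0.15 in total. Below is a small sketch of that arithmetic, with an optional clamp to [-1, 1] that the notebook does not apply. Both function names and the base score are hypothetical.

```python
def adjusted_compound(base, occurrences):
    """Apply the two additive boosts: 0.1 and then 0.05 per occurrence."""
    return base + 0.1 * occurrences + 0.05 * occurrences

def clamp(score, low=-1.0, high=1.0):
    """Optionally keep a score inside VADER's usual compound range."""
    return max(low, min(high, score))

base = 0.6  # made-up VADER compound score for one comment
boosted = adjusted_compound(base, occurrences=4)

print(round(boosted, 2))  # 1.2
print(clamp(boosted))     # 1.0
```

Because the boost is always positive, a negative comment that mentions several keywords is pulled toward neutral or positive, so the cluster averages reflect keyword frequency as well as sentiment.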
