# Phase 4 Project: Twitter Sentiment Analysis for Apple and Google Products

* Student name: ROBERT KIDAKE KALAFA
* Student pace: DSF-PT07 PART TIME
* Instructor name: WINNIE ANYOSO, SAMUEL G. MWANGI and SAMUEL KARU

![Project Image](Project_Image.jpg)

# 1.0 Business Understanding

## 1.1 Background
In the dynamic world of technology, public sentiment wields considerable power in shaping the strategies and brand images of industry titans like Apple and Google. This analysis delves into Twitter sentiment surrounding these two giants, each commanding a formidable global presence. By tapping into public sentiment, these companies can unlock valuable insights to refine their marketing strategies and drive product innovation. The project aims to develop a sophisticated sentiment analysis model to assess tweets about Apple and Google products. This approach will empower our marketing agency to pinpoint areas for improvement, ultimately enhancing customer satisfaction and loyalty by aligning services with user experiences.

## 1.2 Problem Statement
The goal is to leverage sentiment data from Twitter to generate actionable insights for Apple and Google. This analysis seeks to identify patterns and trends in sentiment fluctuations related to these companies. By detecting spikes in sentiment and understanding their causes, Apple and Google can make more informed decisions, whether to address product concerns or capitalize on positive public perception.

## 1.3 Objectives
* Develop a specialized sentiment analysis model to evaluate Twitter sentiments regarding Apple and Google products.
* Identify differences in sentiment between Apple and Google products on Twitter.
* Capture the overall sentiment towards Apple and Google products as expressed on Twitter.
* Explore recurring topics linked to positive or negative sentiments related to the Apple and Google brands.

# 2.0 Data Understanding

The dataset for this project, sourced from https://data.world/crowdflower/brands-and-product-emotions, is well suited to our objectives. It can be used to train and test our sentiment analysis models, and it captures real-world sentiment from a platform where users openly share their opinions.

The dataset comprises three columns and 9,093 rows:

*  tweet_text: The text of the tweet.
*  emotion_in_tweet_is_directed_at: The brand or product the emotion is directed at (for example, iPhone, iPad, or Google).
*  is_there_an_emotion_directed_at_a_brand_or_product: The emotion label assigned to the tweet (for example, Positive emotion or Negative emotion).

With its sizable sample, the dataset provides ample data for model training and validation. The features have been carefully selected for their relevance, particularly focusing on the tweet text and the emotions associated with brands or products, which are crucial for understanding sentiment dynamics.

However, the dataset does have limitations. Interpreting tweet sentiment can be complex due to contextual factors and sarcasm. Additionally, it may not fully represent all sentiments expressed on Twitter, potentially impacting the comprehensiveness of our analysis.

In [85]:
# Import all the relevant libraries

import os
import re
import string
import sys
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import xgboost as xgb
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential
from wordcloud import WordCloud

# Download the NLTK resources used for tokenizing, stopword removal, and lemmatization
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to C:\Users\rkalafa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rkalafa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rkalafa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [86]:
# Load the dataset and preview first five rows
data = pd.read_csv('Tweet_Product_Company.csv', encoding='ISO-8859-1')
data.head()

|   | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [87]:
# Getting information of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [88]:
# Getting the shape of the data
data.shape

(9093, 3)

In [89]:
# Calculate sentiment counts
sentiment_counts = data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()
print(sentiment_counts)

is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: count, dtype: int64


In [90]:
# Examine text data to see what kind of data we are working with
data['tweet_text'].head()

0    .@wesley83 I have a 3G iPhone. After 3 hrs twe...
1    @jessedee Know about @fludapp ? Awesome iPad/i...
2    @swonderlin Can not wait for #iPad 2 also. The...
3    @sxsw I hope this year's festival isn't as cra...
4    @sxtxstate great stuff on Fri #SXSW: Marissa M...
Name: tweet_text, dtype: object

**The following observations were made:**

*  The dataset consists of 9,093 rows and 3 columns.
*  The columns are labeled 'tweet_text', 'emotion_in_tweet_is_directed_at', and 'is_there_an_emotion_directed_at_a_brand_or_product'.
*  The 'is_there_an_emotion_directed_at_a_brand_or_product' column includes four unique values: 'No emotion toward brand or product', 'Positive emotion', 'Negative emotion', and 'I can't tell'.
*  The 'emotion_in_tweet_is_directed_at' column contains nine unique values.
*  The value counts show that most tweets (5,389) express no emotion toward a brand or product. Positive tweets (2,978) far outnumber negative ones (570), and only 156 tweets are labeled 'I can't tell'.
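To make the class imbalance concrete, the share of each label can be computed from the counts printed above. This is a minimal sketch in plain Python: the dictionary simply restates the `value_counts()` output, and the variable names are illustrative rather than part of the notebook.

```python
# Label counts copied from the value_counts() output above
sentiment_counts = {
    "No emotion toward brand or product": 5389,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
}

total = sum(sentiment_counts.values())  # should match the number of rows

# Share of each label as a percentage of all tweets
shares = {label: 100 * count / total for label, count in sentiment_counts.items()}

for label, pct in shares.items():
    print(f"{label:<36} {pct:5.1f}%")

# Ratio of positive to negative tweets, a rough gauge of class imbalance
pos_neg_ratio = sentiment_counts["Positive emotion"] / sentiment_counts["Negative emotion"]
print(f"Positive-to-negative ratio: {pos_neg_ratio:.1f}")
```

An imbalance of this size suggests that accuracy alone may be a misleading metric later on, so per-class precision and recall are worth reporting as well.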

# 3.0 Data Preparation

## 3.1 Data Cleaning

This section prepares the data for exploratory data analysis (EDA) and modeling. We will:

* Check for and remove duplicate rows
* Check for and handle missing values
* Rename columns to make the dataset easier to read and work with
* Clean the text data
* Vectorize the text

In [91]:
# Dropping the emotion_in_tweet_is_directed_at column since we won't be using it in modelling
columns_to_drop = ['emotion_in_tweet_is_directed_at']
data = data.drop(columns=columns_to_drop)

In [92]:
# Confirming that the 'emotion_in_tweet_is_directed_at' column has been dropped
data.head()

|   | tweet_text | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Positive emotion |


In [93]:
# Dropping the "I can't tell" category since it carries no usable sentiment label; the other three categories are kept
data = data[(data['is_there_an_emotion_directed_at_a_brand_or_product'] != "I can't tell")]

In [94]:
# Checking that the "I can't tell" category has been dropped
data.is_there_an_emotion_directed_at_a_brand_or_product.value_counts()

is_there_an_emotion_directed_at_a_brand_or_product
No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
Name: count, dtype: int64

So far, we have removed the 'emotion_in_tweet_is_directed_at' column, leaving two columns for further analysis. We have also excluded the 'I can't tell' category from the 'is_there_an_emotion_directed_at_a_brand_or_product' column, so the dataset now contains only the 'Positive emotion', 'Negative emotion', and 'No emotion toward brand or product' labels.

In [95]:
# Checking if our dataset has missing values
data.isna().sum()

tweet_text                                            1
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

In [96]:
# Dropping the row with missing values
# Since we cannot impute text
data = data.dropna(subset=['tweet_text'])

In [97]:
# Checking if the row with missing values has been dropped
data.isna().sum()

tweet_text                                            0
is_there_an_emotion_directed_at_a_brand_or_product    0
dtype: int64

In [98]:
# Handling duplicates
# Count duplicated tweets; if any exist, they are dropped below, keeping the first occurrence

data['tweet_text'].duplicated().sum()

27

In [99]:
data = data.drop_duplicates(subset='tweet_text', keep='first')
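The `keep='first'` setting used here retains the earliest row for each repeated tweet and discards the later copies. A minimal plain-Python sketch of that rule follows; the sample rows and the `drop_duplicate_texts` helper are made up for illustration and are not taken from the dataset or the notebook.

```python
def drop_duplicate_texts(rows):
    """Keep the first row for each distinct tweet text, preserving order."""
    seen = set()
    unique_rows = []
    for text, emotion in rows:
        if text in seen:
            continue  # a later copy of a tweet we already kept
        seen.add(text)
        unique_rows.append((text, emotion))
    return unique_rows

# Hypothetical sample rows: (tweet_text, emotion)
sample_rows = [
    ("love my new ipad", "Positive emotion"),
    ("battery died again", "Negative emotion"),
    ("love my new ipad", "Positive emotion"),
]

deduped = drop_duplicate_texts(sample_rows)
print(len(sample_rows) - len(deduped), "duplicate row(s) removed")
```

Matching on the tweet text alone, as the notebook does with `subset='tweet_text'`, means two identical tweets with different labels would also collapse to the first one.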

In [100]:
# Renaming the 'is_there_an_emotion_directed_at_a_brand_or_product' column to 'emotion' to make it easier to work with

data.rename(columns={'is_there_an_emotion_directed_at_a_brand_or_product': 'emotion'}, inplace=True)

In [101]:
# Previewing the first five rows to check if the column has been renamed.
data.head()

|   | tweet_text | emotion |
|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Positive emotion |


In [102]:
# Renaming the 'No emotion toward brand or product' category to 'Neutral' for easier analysis
data['emotion'] = data['emotion'].replace({'No emotion toward brand or product': 'Neutral'})

In [103]:
# Checking that the category has been renamed
data.emotion.value_counts()

emotion
Neutral             5372
Positive emotion    2968
Negative emotion     569
Name: count, dtype: int64

In [104]:
# Build the stopword set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

# Function to clean text data
def clean_text(text):
    # Ensure text is a string
    text = str(text)

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove hashtags (including the # symbol)
    text = re.sub(r'#\w+', '', text)

    # Remove special characters and punctuation (except spaces)
    text = re.sub(r'[^\w\s]', '', text)

    # Convert text to lowercase
    text = text.lower()

    # Remove stopwords (split on whitespace; punctuation is already gone)
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    return ' '.join(filtered_tokens)
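
To see what each cleaning step does, the sketch below applies the same regular expressions to a made-up tweet. It is self-contained: the tiny stopword set is only a stand-in for NLTK's English list, and both the sample tweet and the `clean_text_sketch` name are invented for illustration.

```python
import re

# Small stand-in for NLTK's English stopword list (assumption, for illustration only)
SAMPLE_STOP_WORDS = {"i", "my", "the", "is", "at", "a", "so", "and"}

def clean_text_sketch(text, stop_words=SAMPLE_STOP_WORDS):
    text = str(text)
    text = re.sub(r'http\S+', '', text)   # remove URLs
    text = re.sub(r'#\w+', '', text)      # remove hashtags, including the tagged word
    text = re.sub(r'[^\w\s]', '', text)   # remove punctuation and symbols
    text = text.lower()
    tokens = text.split()                 # split on whitespace
    return ' '.join(word for word in tokens if word not in stop_words)

sample_tweet = "I LOVE my new iPad at #SXSW! So fast: http://example.com/x"
cleaned = clean_text_sketch(sample_tweet)
print(cleaned)  # love new ipad fast
```

Under these rules an `@mention` loses only its `@` symbol, so usernames remain in the cleaned text, and a hashtag such as `#iPad` is removed together with the product name it carries.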

In [106]:
# Create the lemmatizer once and reuse it for every row
lemmatizer = WordNetLemmatizer()

# Function to apply lemmatization
def lemmatize_text(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

# Apply text cleaning to the "tweet_text" column
data['cleaned_tweet'] = data['tweet_text'].apply(clean_text)

# Tokenize the cleaned text
data['tokenized_tweet'] = data['cleaned_tweet'].apply(word_tokenize)

# Apply lemmatization to the tokenized text
data['lemmatized_tweet'] = data['tokenized_tweet'].apply(lemmatize_text)

# Display the DataFrame with cleaned, tokenized, and lemmatized text
print(data[['tweet_text', 'cleaned_tweet', 'tokenized_tweet', 'lemmatized_tweet']])

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - 'C:\\Users\\rkalafa/nltk_data'
    - 'c:\\Users\\rkalafa\\AppData\\Local\\anaconda3\\nltk_data'
    - 'c:\\Users\\rkalafa\\AppData\\Local\\anaconda3\\share\\nltk_data'
    - 'c:\\Users\\rkalafa\\AppData\\Local\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\rkalafa\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
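
The traceback indicates that `word_tokenize` could not find the `punkt_tab` resource. As the message itself suggests, running `nltk.download('punkt_tab')` before tokenizing should resolve it. Because the text has already been stripped of punctuation by this point, a simple regular-expression tokenizer is another option that needs no downloaded data. The sketch below is an alternative, not the notebook's method, and `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    """Split cleaned text into word tokens without any NLTK data files."""
    return re.findall(r"\w+", str(text))

tokens = simple_tokenize("love new ipad fast")
print(tokens)  # ['love', 'new', 'ipad', 'fast']
```

On raw, uncleaned tweets this pattern would split contractions such as "isn't" into two tokens, so it is best applied after the cleaning step.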

