# EDA

In this Jupyter notebook, we analyze sentiment in tweets about Apple and Google products. Our goals are to understand the dataset's structure, clean and preprocess the data, and prepare it for a machine learning model that predicts sentiment from tweet content. We start by loading the dataset and performing an exploratory data analysis (EDA).


In [1]:
#Import the necessary packages
import pandas as pd

## Load and Explore the Dataset
First, we load the dataset to understand its basic structure, including the number of entries, columns, and types of data it contains. This initial exploration is crucial for planning our preprocessing steps.

The dataset contains three columns:
- `tweet_text`: The text of the tweet.
- `emotion_in_tweet_is_directed_at`: The product or brand the tweet is directed at (e.g., iPhone, iPad, Google).
- `is_there_an_emotion_directed_at_a_brand_or_product`: The sentiment of the tweet (e.g., Positive emotion, Negative emotion).



In [2]:
data_path = 'data/judge-1377884607_tweet_product_company.csv'
tweets_df = pd.read_csv(data_path, encoding='ISO-8859-1')
print("Dataset shape:", tweets_df.shape)
print("Data types:\n", tweets_df.dtypes)
tweets_df.head()

Dataset shape: (9093, 3)
Data types:
 tweet_text                                            object
emotion_in_tweet_is_directed_at                       object
is_there_an_emotion_directed_at_a_brand_or_product    object
dtype: object


Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


## Missing Values Analysis
Identifying and handling missing values is crucial since they can significantly impact the performance of our model.

The dataset contains missing values, primarily in the `emotion_in_tweet_is_directed_at` column (5,802 rows), with a single missing value in the `tweet_text` column. Since our primary focus is sentiment analysis based on the tweet text, we drop the row where the tweet text is missing. The `emotion_in_tweet_is_directed_at` column can be ignored for our current purpose, as we are focusing on sentiment, not the specific product mentioned.


In [3]:
missing_values = tweets_df.isnull().sum()
print("Missing values in each column:\n", missing_values)


Missing values in each column:
 tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64


In [4]:
# Remove rows with null values in the 'tweet_text' column in the original DataFrame
tweets_df = tweets_df.dropna(subset=['tweet_text'])

## Sentiment Distribution
Understanding the balance between different sentiment classes helps us gauge the dataset's bias towards certain sentiments. This is important for selecting appropriate modeling and resampling techniques. 

The `is_there_an_emotion_directed_at_a_brand_or_product` column has four unique labels:

- Negative emotion
- Positive emotion
- No emotion toward brand or product
- I can't tell

For the sentiment analysis model, we focus on positive and negative emotions. We treat "No emotion toward brand or product" and "I can't tell" as neutral or unknown sentiments and exclude them from training, so that the model concentrates on distinguishing clearly between positive and negative sentiment.


Most of the tweets are neutral. Among the tweets labeled positive or negative, there are about five positive tweets for every negative one (2,978 vs. 570).

In [5]:
sentiment_distribution = tweets_df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()
print("Sentiment distribution:\n", sentiment_distribution)


Sentiment distribution:
 No emotion toward brand or product    5388
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
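
To put numbers on the imbalance, the counts printed above can be turned into shares and a positive-to-negative ratio. This is a minimal sketch: the counts are copied by hand from the output rather than read from `tweets_df`, and the names `label_share` and `pos_neg_ratio` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above
sentiment_counts = pd.Series({
    "No emotion toward brand or product": 5388,
    "Positive emotion": 2978,
    "Negative emotion": 570,
    "I can't tell": 156,
})

# Share of each label in the full dataset
label_share = (sentiment_counts / sentiment_counts.sum()).round(3)
print(label_share)

# Positive-to-negative ratio among tweets with a clear polarity
pos_neg_ratio = sentiment_counts["Positive emotion"] / sentiment_counts["Negative emotion"]
print(f"Positive : negative = {pos_neg_ratio:.2f} : 1")
```

In the notebook itself, `tweets_df[...].value_counts(normalize=True)` would give the same shares directly.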


## Text Length Analysis
Analyzing the length of tweets can reveal insights about the dataset's variability. It helps us decide if we need to normalize text lengths through padding or truncation during preprocessing.


In [6]:
tweets_df['text_length'] = tweets_df['tweet_text'].apply(lambda x: len(str(x)))
tweets_df['text_length'].describe()


count    9092.000000
mean      104.962275
std        27.187640
min        11.000000
25%        86.000000
50%       109.000000
75%       126.000000
max       178.000000
Name: text_length, dtype: float64

## Frequent Words Analysis
Identifying the most frequently occurring words will help us understand common themes and refine our list of stopwords. High-frequency words with little sentiment value might be removed to improve model accuracy.

In [7]:
from collections import Counter
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

all_words = ' '.join(tweets_df['tweet_text'].dropna()).lower()
all_words_tokenized = word_tokenize(all_words)

word_counts = Counter(all_words_tokenized)
word_counts.most_common(20)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\johns\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('#', 15875),
 ('sxsw', 9516),
 ('@', 7194),
 ('mention', 7124),
 ('.', 5506),
 ('the', 4424),
 ('link', 4313),
 ('}', 4298),
 ('{', 4296),
 ('to', 3586),
 (',', 3533),
 ('at', 3102),
 ('rt', 2962),
 (';', 2800),
 ('&', 2707),
 ('google', 2595),
 ('for', 2545),
 ('ipad', 2446),
 ('!', 2398),
 ('a', 2312)]
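
The top of this list is dominated by punctuation, placeholders, and function words. One way to surface more informative terms is to keep only alphabetic tokens and drop a small noise list. The sketch below works on a handful of counts copied from the output above; the `noise_tokens` set is a hypothetical starting point, not a list used elsewhere in this notebook.

```python
from collections import Counter

# A few (token, count) pairs copied from the most_common(20) output above
word_counts = Counter({
    "#": 15875, "sxsw": 9516, "@": 7194, "mention": 7124, ".": 5506,
    "the": 4424, "link": 4313, "google": 2595, "ipad": 2446, "a": 2312,
})

# Hypothetical noise list: placeholders, the event hashtag, common function words
noise_tokens = {"sxsw", "mention", "link", "rt", "the", "to", "at", "for", "a"}

# Keep alphabetic tokens that are not in the noise list
filtered_counts = Counter({
    token: count
    for token, count in word_counts.items()
    if token.isalpha() and token not in noise_tokens
})
print(filtered_counts.most_common())  # [('google', 2595), ('ipad', 2446)]
```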

## Special Characters and URLs Analysis
Tweets often contain special characters, emojis, and URLs that may not contribute to sentiment analysis. Deciding whether to keep or remove these can impact preprocessing steps and ultimately model performance.


In [8]:
import re

special_chars = tweets_df['tweet_text'].apply(lambda x: re.findall(r'[^\w\s]', str(x)))
special_chars_counts = Counter([item for sublist in special_chars for item in sublist])
print("Special characters counts:", special_chars_counts.most_common(10))

urls_counts = tweets_df['tweet_text'].apply(lambda x: len(re.findall(r"http\S+|www\S+|https\S+", str(x)))).sum()
print("Number of URLs:", urls_counts)


Special characters counts: [('#', 15875), ('.', 8382), ('@', 7194), ('}', 4298), ('{', 4296), (',', 3558), ("'", 2903), (';', 2800), ('&', 2707), ('-', 2438)]
Number of URLs: 44
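
The near-equal counts for `{`, `}`, and `link`, together with `@` and `mention`, suggest that most URLs and usernames were already replaced by placeholders such as `{link}` and `@mention`, which would explain why only 44 raw URLs remain. This is an inference from the outputs above, not something the notebook verifies. A sketch of how one might check it, using hypothetical sample tweets in place of `tweets_df['tweet_text']`:

```python
import re

# Hypothetical sample tweets written in the style the token counts suggest
sample_tweets = [
    "RT @mention Google to launch new social network at #sxsw {link}",
    "@mention loving the new iPad 2 line at #SXSW",
    "Check http://example.com for details",
]

placeholder_pattern = re.compile(r"\{link\}|@mention")
url_pattern = re.compile(r"http\S+|www\S+")

# Count placeholder occurrences and raw URLs separately
placeholder_total = sum(len(placeholder_pattern.findall(t)) for t in sample_tweets)
url_total = sum(len(url_pattern.findall(t)) for t in sample_tweets)
print("Placeholders:", placeholder_total, "Raw URLs:", url_total)
```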


# Preprocessing


## Cleaning the Text Data
The next step in our preprocessing is to clean the text data. This involves removing URLs, mentions, the `#` symbol (the hashtag word itself is kept), numbers, and punctuation. These elements are generally not useful for sentiment analysis and can introduce noise into our model. Cleaning helps the model focus on the meaningful content of the tweets.


In [9]:
import re

def clean_tweet(tweet):
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)  # Remove URLs
    tweet = re.sub(r'\@\w+|\#', '', tweet)  # Remove mentions and the '#' symbol (hashtag words are kept)
    tweet = re.sub(r'\d+', '', tweet)  # Remove numbers
    tweet = re.sub(r'[^\w\s]', '', tweet)  # Remove punctuation
    return tweet

tweets_df['cleaned_tweet_text'] = tweets_df['tweet_text'].apply(clean_tweet)


In [10]:
tweets_df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text_length,cleaned_tweet_text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,127,I have a G iPhone After hrs tweeting at RISE...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,139,Know about Awesome iPadiPhone app that youl...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,79,Can not wait for iPad also They should sale ...
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,82,I hope this years festival isnt as crashy as ...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,131,great stuff on Fri SXSW Marissa Mayer Google ...
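
To see what each cleaning step leaves behind, it can help to run `clean_tweet` on a single string. The sketch below repeats the function from the cell above so it runs on its own; the sample tweet is hypothetical. Note that the placeholder word `link` survives and that removed tokens leave extra whitespace, which a final `split`/`join` would collapse (an optional step that the notebook does not apply).

```python
import re

def clean_tweet(tweet):
    # Same steps as the notebook cell above
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    tweet = re.sub(r'\@\w+|\#', '', tweet)
    tweet = re.sub(r'\d+', '', tweet)
    tweet = re.sub(r'[^\w\s]', '', tweet)
    return tweet

sample = "@mention Can't wait for #iPad 2 at #SXSW! {link}"  # hypothetical tweet
cleaned = clean_tweet(sample)
print(repr(cleaned))  # ' Cant wait for iPad  at SXSW link'

# Optional extra step (not in the notebook): collapse leftover whitespace
normalized = " ".join(cleaned.split())
print(repr(normalized))  # 'Cant wait for iPad at SXSW link'
```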


## Encoding Sentiment Labels
For our model to understand the sentiment labels, we need to encode them into a numerical format. In our case, we'll encode positive sentiment as 1 and negative sentiment as 0. This step is crucial for the subsequent modeling process.


In [11]:
sentiment_mapping = {
    'Positive emotion': 1,
    'Negative emotion': 0
}

# Labels not in the mapping (neutral and "I can't tell") become NaN
tweets_df['sentiment_label'] = tweets_df['is_there_an_emotion_directed_at_a_brand_or_product'].map(sentiment_mapping)


In [13]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9092 entries, 0 to 9092
Data columns (total 6 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   tweet_text                                          9092 non-null   object 
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object 
 2   is_there_an_emotion_directed_at_a_brand_or_product  9092 non-null   object 
 3   text_length                                         9092 non-null   int64  
 4   cleaned_tweet_text                                  9092 non-null   object 
 5   sentiment_label                                     3548 non-null   float64
dtypes: float64(1), int64(1), object(4)
memory usage: 497.2+ KB
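
The `float64` dtype and the 3,548 non-null count for `sentiment_label` follow from how `Series.map` treats labels that are absent from the dictionary: they become `NaN`, which forces a float column. A small sketch on a hypothetical four-row series:

```python
import pandas as pd

# One example of each label found in the dataset
labels = pd.Series([
    "Positive emotion",
    "Negative emotion",
    "No emotion toward brand or product",
    "I can't tell",
])
sentiment_mapping = {"Positive emotion": 1, "Negative emotion": 0}

encoded = labels.map(sentiment_mapping)
print(encoded.tolist())  # [1.0, 0.0, nan, nan]
print(encoded.dtype)     # float64

# Once the NaN rows are dropped, the labels can be cast back to integers
binary_labels = encoded.dropna().astype(int)
print(binary_labels.tolist())  # [1, 0]
```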


## Handling Missing Values
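After encoding, `sentiment_label` is missing for every tweet whose label is not in the mapping ("No emotion toward brand or product" and "I can't tell"), leaving 3,548 labeled rows out of 9,092. Missing sentiment labels could mislead our model training. Since we only model positive versus negative sentiment, we remove those entries rather than fill them with a placeholder value.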


In [14]:
# Drop rows without a sentiment label (neutral and "I can't tell" tweets)
tweets_df = tweets_df.dropna(subset=['sentiment_label'])
tweets_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3548 entries, 0 to 9088
Data columns (total 6 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   tweet_text                                          3548 non-null   object 
 1   emotion_in_tweet_is_directed_at                     3191 non-null   object 
 2   is_there_an_emotion_directed_at_a_brand_or_product  3548 non-null   object 
 3   text_length                                         3548 non-null   int64  
 4   cleaned_tweet_text                                  3548 non-null   object 
 5   sentiment_label                                     3548 non-null   float64
dtypes: float64(1), int64(1), object(4)
memory usage: 194.0+ KB


## Text Vectorization (TF-IDF)
Machine learning models require numerical input, so we convert our cleaned text data into a numerical format using TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF weights each term by how often it occurs in a tweet and how rare it is across all tweets, so words that appear everywhere contribute less than distinctive ones.


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf_vectorizer.fit_transform(tweets_df['cleaned_tweet_text'])
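
To make the weighting concrete, here is a hand-rolled sketch of the formula scikit-learn documents for `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`): `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The three toy documents and the helper names are hypothetical, and this is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = [
    "great ipad app",
    "ipad battery dead",
    "great google party",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Raw term frequency times idf, then L2-normalize the vector
    term_freq = Counter(tokens)
    weights = {t: tf * smoothed_idf(t) for t, tf in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first_doc = tfidf_vector(tokenized[0])
print(first_doc)
```

In the first document, `app` appears in only one document and so receives a higher weight than `great` or `ipad`, which each appear in two.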


## Splitting the Data
Before training our model, we'll split the dataset into training and testing sets. This allows us to train our model on one subset of the data and then test its performance on unseen data, providing a better evaluation of its real-world performance.


In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, 
    tweets_df['sentiment_label'], 
    test_size=0.2, 
    random_state=42
)
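
One caveat: the vectorizer above was fit on all tweets before the split, so the vocabulary and IDF weights also reflect the test tweets. A common alternative is to split the raw text first, fit the vectorizer on the training portion only, and pass `stratify` so both splits keep the same class ratio. The sketch below shows that ordering on hypothetical toy data; it is an alternative arrangement, not what the cells above do.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for cleaned_tweet_text and sentiment_label
texts = [
    "love the new ipad", "great google party", "awesome iphone app",
    "ipad line was fun", "google maps is great", "nice android phone",
    "cool ipad design", "love this app",
    "iphone battery died", "app keeps crashing",
]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Split the raw text first, keeping the class ratio in both halves
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Fit on the training text only, then reuse the fitted vocabulary for the test text
vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

print(X_train.shape, X_test.shape)
```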


## Handling Class Imbalance
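Our EDA revealed a class imbalance in the sentiment labels: about five positive tweets for every negative one. To address this, we can oversample the minority class or undersample the majority class. Here we use random oversampling, applied to the training set only so that the test set keeps its original distribution. This helps prevent the model from being biased towards the more frequent class.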


In [21]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
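
For intuition, random oversampling duplicates randomly chosen minority-class rows until every class matches the size of the largest one. The NumPy sketch below illustrates the idea on hypothetical toy arrays; it is not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy training set: five positive rows, one negative row
y_train = np.array([1, 1, 1, 1, 1, 0])
X_train = np.arange(len(y_train)).reshape(-1, 1)  # placeholder features

classes, counts = np.unique(y_train, return_counts=True)
target = counts.max()  # size of the largest class

resampled_idx = []
for cls, count in zip(classes, counts):
    cls_idx = np.flatnonzero(y_train == cls)
    # Draw extra rows with replacement until the class reaches the target size
    extra = rng.choice(cls_idx, size=target - count, replace=True)
    resampled_idx.append(np.concatenate([cls_idx, extra]))
resampled_idx = np.concatenate(resampled_idx)

X_resampled, y_resampled = X_train[resampled_idx], y_train[resampled_idx]
print(np.bincount(y_resampled))  # [5 5]
```

In the notebook, comparing `y_train.value_counts()` with `y_train_resampled.value_counts()` would confirm the same balancing effect on the real data.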
