# Natural Language Processing: Sentiment Analysis

**Author**: Albane Colmenares <br>
**Date**: December 12th, 2023 <br>
___________________________________________________________________________

### <u>Table of Contents</u>
**1. Overview**<br>
**2. Business Understanding**<br>
**3. Data Understanding**<br>
**4. Data Preparation**<br>
**5. Modeling**<br>
**6. Evaluation**<br>
**7. Findings & Recommendations**<br>
**8. Limits & Next Steps**<br>

## 1. Overview

This notebook examines tweets about several brands and products and predicts whether the sentiment of the short text is positive, negative or neutral. <br>
This notebook is organized according to the CRoss Industry Standard Process for Data Mining (CRISP-DM), a process model that serves as the base for a data science process.


## 2. Business Understanding

Business and data understanding: *what kind of data are you using, and what makes it well-suited for the business problem?*
* You do not need to include any data visualizations in your summary, but consider including relevant descriptive statistics

Text Text Text Text Text Text 

Text Text Text Text Text Text 

## 3. Data Understanding

**Data Source**

The data comes from CrowdFlower via [data.world](https://data.world/crowdflower/brands-and-product-emotions). 


Human raters labeled the sentiment of more than 9,000 tweets as positive, negative, or neither.

The file `judge-1377884607_tweet_product_company.csv` can be downloaded at the provided link. 
It was then renamed to `tweet_product_company.csv` and saved in the `data` subfolder of the current folder, from which it is loaded into the raw DataFrame. 



**Features**

Prior to preprocessing, the columns are: 

* `tweet_text`: the text of the tweet
* `emotion_in_tweet_is_directed_at`: the product or company referred to in the tweet
* `is_there_an_emotion_directed_at_a_brand_or_product`: the tweet's sentiment


**Target**

The tweet's sentiment is the target for the dataset. The specific column is `is_there_an_emotion_directed_at_a_brand_or_product`. Based on a given set of tweets, we will try to predict if the tweet's emotion was positive, negative or neutral. 

**Loading the data**

In [None]:
# Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import matplotlib.ticker as ticker
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split

%matplotlib inline

The text file is encoded in Latin-1 and is read with that encoding. Several encodings were tried, for example utf-8, utf-16 and ascii, to confirm that Latin-1 is the one that matches.

In [None]:
# Loading dataset and saving it as raw_df
raw_df = pd.read_csv('data/tweet_product_company.csv', encoding='latin-1')
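The encoding check described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for this notebook: the helper name `first_working_encoding` and the sample bytes are assumptions. Note that Latin-1 maps every possible byte to a character, so it never raises a decoding error; a successful Latin-1 read is therefore best confirmed by looking at the decoded text.

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "ascii", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes, so try the next one
            continue
        return encoding
    return None


# Hypothetical sample: an accented character stored as a single Latin-1 byte
sample_bytes = "café at #sxsw".encode("latin-1")
detected = first_working_encoding(sample_bytes)
print(detected)
```

On the real file, the same idea would apply to the bytes returned by `open(path, 'rb').read()`.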

In [None]:
# Inspecting the first 5 rows of the DataFrame
raw_df.head()

In [None]:
print(f'The dataset has {raw_df.shape[0]} rows and {raw_df.shape[1]} columns.')

The companies and products referred to in the tweets will be reviewed to understand how balanced the dataset is and which ones are discussed most often.  

The emotions will be reviewed in the same way. 

In [None]:
# Inspecting the number of tweets referring to each product or company
raw_df['emotion_in_tweet_is_directed_at'].value_counts()

In [None]:
# Inspecting the number of tweets referring to each emotion
raw_df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

## 4. Data Preparation

This section covers data cleaning and exploratory data analysis with `nltk`.




*why did you choose the data preparation steps that you did, and what was the result?*

* This should be specific to the kind of data you are working with. For example, if you are doing an NLP project, what did you decide to do with stopwords?
* Be sure to list the packages/libraries used to prepare the data, and why


To make the tweets easier to read, the maximum column width is increased. MathJax rendering is also disabled so that characters such as `$` in the tweets are not interpreted as mathematical expressions in the notebook output. 

In [None]:
# Increasing column width
pd.set_option('max_colwidth', 400)
pd.set_option('use_mathjax', False)

### 4. a) Column names' change

The column names are particularly long. To make them easier to handle, they will be renamed in a new DataFrame called `df`:
* `tweet`
* `product_or_company`
* `sentiment`


In [None]:
# Making a copy of the raw DataFrame to modify it
df = raw_df.copy()

In [None]:
# Defining the new columns' names and attributing them to the new DataFrame
df.columns = ['tweet', 'product_or_company', 'sentiment']

In [None]:
# Verifying the changes applied  
df.head()

### 4. b) Missing data

In this section, the missing values are inspected and handled column by column. 
<br>
The `tweet` column has only 1 null row, which carries no information in the other columns: it is removed. 
<br>
The `product_or_company` column contains many more missing values. 

In [None]:
# Looking for missing values
df.info()

* **Tweet**

The tweet column only has one null value with no information on the other columns: it will be dropped from the DataFrame.  

In [None]:
# Inspecting the tweet containing null information 
df[df['tweet'].isnull()]

The null tweet does not contain any information for either column and will be dropped.  

In [None]:
# Dropping the null tweet from the DataFrame

df = df.dropna(subset=['tweet'])

In [None]:
# Verifying it was correctly removed
df.info()

In [None]:
print(f'The dataset now has {len(df)} rows. The missing tweet was removed.')

* **Product or Company**

The `product_or_company` column contains many null values where neither the product nor the brand was specified. For now, all null values will be replaced by 'undefined', as the focus is on predicting sentiment. 
<br>If the analysis later needs to focus on product or company, two columns will be created to identify the product and the brand. 

In [None]:
# Inspecting the tweets with no product or company specified
df[df['product_or_company'].isnull()]

In [None]:
# Replacing the null product or company with 'undefined'
df['product_or_company'] = df['product_or_company'].fillna('undefined')

In [None]:
# Verifying it was correctly handled
df.info()

In [None]:
print(f'The dataset still has {len(df)} rows.')

In [None]:
# Verifying the count of rows by unique value in this column
df['product_or_company'].value_counts()

### 4. c) Handling duplicates

In [None]:
# How many rows were duplicates
print(f'{df.duplicated().sum()} duplicate rows were identified.')

In [None]:
# Viewing the duplicate rows
df[df.duplicated()]

In [None]:
# Verifying with one example that tweets were indeed duplicated 
df[df['tweet'] == 'Before It Even Begins, Apple Wins #SXSW {link}']

In [None]:
# Dropping duplicates
df.drop_duplicates(inplace=True)
df.info()
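By default, `duplicated()` flags only the second and later copies of a row, so the first occurrence is not shown when filtering on it. The sketch below, on a small hypothetical DataFrame (`toy` is not part of the dataset), shows how `keep=False` displays every copy and how `drop_duplicates()` keeps one of them.

```python
import pandas as pd

# Hypothetical toy data with one repeated row
toy = pd.DataFrame({
    "tweet": ["apple wins", "google maps", "apple wins", "new ipad"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# keep=False marks every member of a duplicated group, including the first one
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group by default
deduplicated = toy.drop_duplicates()

print(all_copies)
print(len(toy), "->", len(deduplicated))
```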

### 4. d) Turning sentiment classification into a binary one

* **Product or Company**

The product or company column does not have an impact on whether a tweet is positive or negative, so it will not be transformed as it will not be used further for predictions. 

* **Sentiment**

The raw data records four sentiment categories, which could be grouped into three: positive, negative and neutral. 
<br>The next cells review their counts and regroup them. 

In [None]:
# Number of rows by emotion
df['sentiment'].value_counts()

* **Categorizing**

Because the sentiment of interest is the positive one, the task is framed as binary: positive versus not positive. All other tweets, whether they are neutral, negative or unclear ("I can't tell"), are considered *not positive* and are labeled negative.

In [None]:
# Defining the new classifications for the sentiment column 
classification_columns = {
    'sentiment': {
        "No emotion toward brand or product": "negative", 
        "I can't tell": "negative", 
        "Positive emotion": "positive", 
        "Negative emotion": "negative" 
    }
}

In [None]:
# Converting the sentiment column classification

# Defining columns to change
column_classification = ['sentiment']

def convert_class(df, columns_mapping):
    for column, mapping in columns_mapping.items():
        print('Before: ' + column, df[column].unique())
        df[column] = df[column].map(mapping)
        print('After: ' + column, df[column].unique())
    

In [None]:
convert_class(df, classification_columns)

In [None]:
# Number of rows by unique sentiment
df['sentiment'].value_counts()
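One thing worth checking after a `map` call is that every original label was covered: `Series.map` with a dictionary returns `NaN` for any value that is not a key. A minimal sketch of that check, using a hypothetical toy Series with one label carrying a stray trailing space:

```python
import pandas as pd

# Hypothetical labels; the third one has a trailing space and will not match
toy_sentiment = pd.Series([
    "Positive emotion",
    "Negative emotion",
    "Positive emotion ",
    "I can't tell",
])
toy_mapping = {
    "Positive emotion": "positive",
    "Negative emotion": "negative",
    "I can't tell": "negative",
}

mapped = toy_sentiment.map(toy_mapping)

# Labels absent from the mapping come back as NaN
unmapped_labels = toy_sentiment[mapped.isna()].unique().tolist()
print(unmapped_labels)
```

An empty list means every label was converted.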

In [None]:
# Creating a bar chart to visualize class imbalance
fig, ax = plt.subplots(figsize=(10,6))

# Defining custom colors 
custom_colors = ['#3B3935', '#00917C']

# Ordering the bars by frequency and drawing on the axes created above
sns.countplot(data=df, x='sentiment', order=df['sentiment'].value_counts().index, palette=custom_colors, ax=ax)

ax.set_xlabel(xlabel = 'Sentiment', fontsize=15)
ax.set_ylabel(ylabel = 'Number of Tweets', fontsize=15)

# Capitalizing the tick labels in the same order as the bars
ax.set_xticklabels(labels=[label.capitalize() for label in df['sentiment'].value_counts().index])

ax.set_title('Number of tweets per sentiment')

plt.show()

### 4. e) Performing a Train-Test Split

In [None]:
# Splitting df into X and y
X = df.drop('sentiment', axis=1)
y = df['sentiment']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [None]:
X_train.head()

In [None]:
y_train.head()
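The `stratify=y` argument keeps the class proportions the same in the training and test sets, which matters when one class is much larger than the other. A small sketch on hypothetical toy labels (two thirds negative, one third positive) illustrates this:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 negative and 4 positive labels
toy_labels = ["negative"] * 8 + ["positive"] * 4
toy_tweets = [f"tweet {i}" for i in range(len(toy_labels))]

# Same call pattern as above: default test size (25%) and stratification on the labels
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_tweets, toy_labels, random_state=42, stratify=toy_labels
)

print("train:", Counter(toy_y_train))
print("test: ", Counter(toy_y_test))
```

Both subsets keep the two-to-one ratio of the full toy data.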

* **Distribution of Target**

In [None]:
train_target_counts = pd.DataFrame(y_train.value_counts())
train_target_counts.index.name = 'target name'
train_target_counts.rename(columns={'sentiment': 'count'}, inplace=True)

In [None]:
train_target_counts

* **Visually Inspecting Features**

In [None]:
# Making a sample of 5 records to display the full text of each
train_sample = X_train.sample(5, random_state=22)
train_sample['label'] = [y_train[val] for val in train_sample.index]
train_sample.style.set_properties(**{'text-align': 'left'})


### 4. f) Standardizing Case

Before starting any exploratory analysis, two fundamental data cleaning tasks will be performed on the text data: standardizing case and tokenizing. The first one will be standardizing.

We will glance at the first sampled tweet to see whether the case needs to be standardized.  

In [None]:
# Isolating the first tweet into windows_sample
windows_sample = train_sample.iloc[0]["tweet"]
windows_sample

Changing to lower case is necessary. We will apply this to the first tweet sample. 

In [None]:
# Transforming sample data to lowercase
windows_sample.lower()

This answers our needs, so we will apply it to the whole sample.

* **Lower case**

In [None]:
# Transforming sample data to lowercase
train_sample['tweet'] = train_sample['tweet'].str.lower()
# Displaying full text
train_sample.style.set_properties(**{'text-align': 'left'})

This answers our needs, so we will apply it to the full dataset.

* **Standardizing Case in the Full Dataset**

In [None]:
# Transforming full training data to lowercase
X_train['tweet'] = X_train['tweet'].str.lower()

In [None]:
# Verifying an example to see if this applied correctly
X_train.iloc[100]['tweet']

### 4. g) Tokenizing

The second fundamental data cleaning step is to tokenize the text data.

In [None]:
# Reviewing one of our train_sample tweets
tweet_sample = train_sample.iloc[1]['tweet']
tweet_sample

We will use `RegexpTokenizer` from NLTK to create tokens of two or more consecutive word characters, which include letters, numbers and underscores.

* **Tokenizing Pattern**

In [None]:
# Importing RegexpTokenizer

from nltk.tokenize import RegexpTokenizer

basic_token_pattern = r"(?u)\b\w\w+\b"

tokenizer = RegexpTokenizer(basic_token_pattern)
tokenizer.tokenize(tweet_sample)
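To see what the pattern keeps and drops without NLTK, the same regular expression can be applied with Python's built-in `re` module. This is only an illustration on a hypothetical tweet; with its default settings, `RegexpTokenizer` should give the same tokens for this pattern.

```python
import re

basic_token_pattern = r"(?u)\b\w\w+\b"

# Hypothetical lowercased tweet containing the placeholders seen in the dataset
toy_tweet = "i got in! rt @mention {link} #sxsw"

# findall returns every run of two or more word characters
toy_tokens = re.findall(basic_token_pattern, toy_tweet)
print(toy_tokens)
```

The single-character word "i" is dropped, and the symbols `@`, `{`, `}` and `#` are stripped from around the tokens they surround.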

* **Tokenizing the Full Dataset**

In [None]:
# Creating a column tweet_tokenized on X_train
X_train['tweet_tokenized'] = X_train['tweet'].apply(tokenizer.tokenize)

In [None]:
# Inspecting a tweet example
X_train.iloc[99][['tweet', 'tweet_tokenized']]

Single-character tokens have been removed: instead of 'i', 'got', 'in', we now have 'got', 'in'.  

## ?. Exploratory Data Analysis: Frequency Distributions

A frequency distribution is a data structure that contains pieces of data as well as the count of how frequently they appear. 
In this case, pieces of data are words. 

To do this, we will use the `FreqDist` class from NLTK, which takes a single list of words and produces a dictionary-like object mapping those words to their frequencies.  

We will visualize the top 10 words to evaluate further what cleaning needs to be done. 

In [None]:
# Importing the relevant package: FreqDist
from nltk import FreqDist

* **FreqDist**

In [None]:
example_freq_dist = FreqDist(X_train.iloc[100]['tweet_tokenized'][:20])
example_freq_dist
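`FreqDist` is a subclass of Python's `collections.Counter`, so the behavior used here can be sketched with the standard library alone. The token list below is hypothetical.

```python
from collections import Counter

# Hypothetical token list
toy_tokens = ["ipad", "sxsw", "ipad", "apple", "sxsw", "ipad"]

toy_counts = Counter(toy_tokens)

# most_common(n) returns the n most frequent (token, count) pairs, most frequent first
top_two = toy_counts.most_common(2)
print(top_two)
```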

In [None]:
# Importing the relevant package for top number of words
from matplotlib.ticker import MaxNLocator

# Creating a function to visualize the top 10 words

def visualize_top_10(freq_dist, title):
    # Extracting data for the graph
    top_10 = list(zip(*freq_dist.most_common(10)))
    tokens = top_10[0]
    counts = top_10[1]

    # Setting up the graph and plotting data
    fig, ax = plt.subplots()
    ax.bar(tokens, counts)

    # Customizing plot appearance
    ax.set_title(title)
    ax.set_ylabel('Count')
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
    ax.tick_params(axis='x', rotation=90)
    
visualize_top_10(example_freq_dist, "Top 10 Word Frequency for Example Tokens")

* **FreqDist on the Full DataSet**

In order to calculate the count of words, they need to be stored into a list. To do so, we will `explode` the dataset.  

In [None]:
# Creating a frequency distribution for X_train
train_freq_dist = FreqDist(X_train['tweet_tokenized'].explode())

# Plotting the top 10 tokens
visualize_top_10(train_freq_dist, 'Top 10 Word Frequency for Full X_train')
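As a quick illustration of what `explode` does to a column of token lists, the sketch below uses a hypothetical two-row Series: each list element becomes its own row, and the original row index is repeated.

```python
import pandas as pd

# Hypothetical column of tokenized tweets
toy_tokenized = pd.Series([["apple", "ipad"], ["ipad"]])

# Each token gets its own row; the index shows which tweet it came from
toy_exploded = toy_tokenized.explode()
print(toy_exploded)
```

The result is a flat sequence of words, which is the shape a frequency count needs.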


We can also subdivide this by category to see if it makes a difference:

In [None]:
# Adding in labels for filtering
X_train['label'] = [y_train[val] for val in X_train.index]

In [None]:
# Defining functions to plot 2 visualizations side by side

# Creating a figure with two columns 
def two_subplits():
    fig = plt.figure(figsize=(15, 9))
    fig.set_tight_layout(True)
    gs = fig.add_gridspec(1, 2)
    
    ax1 = fig.add_subplot(gs[0, 0]) #row 0, col 0 
    ax2 = fig.add_subplot(gs[0, 1]) #row 0, col 1 
    return fig, [ax1, ax2]

# Plotting the graph
def plot_distribution_by_sentiment(X_version, column, axes, title = "Word Frequency for:"):
    for index, category in enumerate(X_version['label'].unique()): 
#         Calculating frequency distribution for this subset
        all_words = X_version[X_version['label'] == category][column].explode()
        freq_dist = FreqDist(all_words)
        top_10 = list(zip(*freq_dist.most_common(10)))
        tokens = top_10[0]
        counts = top_10[1]
        
        
#         Setting up a plot
        ax = axes[index]
        ax.bar(tokens, counts)
        
#         Customizing plot appearance
        ax.set_title(f"{title} {category}")
        ax.set_ylabel("Count")
        ax.yaxis.set_major_locator(MaxNLocator(integer=True))
        ax.tick_params(axis='x', rotation=90)
        
        
fig, axes = two_subplits()
plot_distribution_by_sentiment(X_train, 'tweet_tokenized', axes)
fig.suptitle('Word Frequencies for Each Sentiment', fontsize=20)
plt.show()

## 5. Modeling

*what modeling package(s) did you use, which model(s) within the package(s), and what tuning steps did you take?*
* For some projects there may be only one applicable package; you should still briefly explain why this was the appropriate choice

### 5. a) Baseline Model with TfidfVectorizer and MultinomialNB

We will start modeling by building an initial model which only has access to the information in the plots above: the default token pattern is used to split the full text into tokens, with a limited vocabulary. 

To give the model a little more information from those same features, `TfidfVectorizer` will be used. It weights the term frequency (`tf`) within a single document by the inverse document frequency (`idf`), which reflects how rare the term is across documents. 

The first step is to import the vectorizer, instantiate a vectorizer object and fit it on `X_train['tweet']`.

In [None]:
# Importing the relevant vectorizer class
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiating a vectorizer with max_features=10 
tfidf = TfidfVectorizer(max_features=10)

# Fitting the vectorizer on X_train['tweet'] and transforming it
X_train_vectorized = tfidf.fit_transform(X_train['tweet'])

# Inspecting the vectorized data
pd.DataFrame.sparse.from_spmatrix(X_train_vectorized, columns=tfidf.get_feature_names_out())
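To make the weighting concrete, here is a hand-rolled sketch of the computation on a hypothetical three-document corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector; it is an illustration, not the library's code.

```python
import math

# Hypothetical tokenized corpus
toy_docs = [["apple", "ipad", "ipad"], ["google", "apple"], ["ipad"]]
toy_vocabulary = ["apple", "google", "ipad"]
n_docs = len(toy_docs)


def smoothed_idf(term):
    """Smoothed inverse document frequency: rarer terms get a larger weight."""
    doc_freq = sum(term in doc for doc in toy_docs)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_vector(doc):
    """Term count times idf for each vocabulary term, scaled to unit length."""
    raw = [doc.count(term) * smoothed_idf(term) for term in toy_vocabulary]
    norm = math.sqrt(sum(value ** 2 for value in raw))
    return [value / norm for value in raw] if norm else raw


first_vector = tfidf_vector(toy_docs[0])
print(dict(zip(toy_vocabulary, first_vector)))
```

In this toy corpus "google" appears in a single document, so its idf is higher than that of "apple" or "ipad", which each appear in two.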

Now that we have preprocessed data, we will fit and evaluate a Naive Bayes classifier using `cross_val_score`.

In [None]:
# Importing the relevant class function
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Instantiating a MultinomialNB classifier
baseline_model = MultinomialNB()

# Evaluating the classifier on X_train_vectorized and y_train
# Note: cross_val_score returns accuracy by default for classifiers, so
# 1 - score is the misclassification rate, not a positive-class metric
baseline_cv = 1 - cross_val_score(baseline_model, X_train_vectorized, y_train)
baseline_cv.mean()
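For a classifier, `cross_val_score` reports accuracy unless a `scoring` argument is given, and subtracting it from 1 yields the error rate rather than a score for the positive class. One possible way to score the positive class directly is a custom scorer built with `make_scorer`; the sketch below uses hypothetical toy count features (`X_toy`, `y_toy`) and F1 as the metric, both of which are assumptions rather than choices made in this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count features: column 0 is frequent in positive tweets, column 1 in negative ones
X_toy = np.array(
    [[3, 0], [2, 1], [4, 0], [3, 1], [2, 0]] * 2
    + [[0, 3], [1, 2], [0, 4], [1, 3], [0, 2]] * 2
)
y_toy = np.array(["positive"] * 10 + ["negative"] * 10)

# Scorer that computes F1 with 'positive' as the class of interest
positive_f1 = make_scorer(f1_score, pos_label="positive")

accuracy_scores = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=5)
f1_scores = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=5, scoring=positive_f1)

print("Accuracy:   ", accuracy_scores.mean())
print("Positive F1:", f1_scores.mean())
```

Recall or precision could be substituted for F1 in the same way, depending on which kind of error matters most.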

**Verifying the class balance**

In [None]:
# Verifying the class balance
y_train.value_counts(normalize=True)

How well did the final model perform?

If we guessed the same sentiment for every tweet, accuracy would simply equal that class's share of the data; the figure to beat here is about 33%. 
The baseline model does no better than this kind of constant guess.  

-----------------------ADD THE OTHER EVALUATION METRICS-----------------------

### <u>2nd iteration</u>: Addressing class imbalance: undersampling negative tweets

In [None]:
# Import relevant packages
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report

In [None]:
# Instantiating the undersampler
undersampler = RandomUnderSampler(random_state=42)
# Applying undersampling only on training data
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)
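Random undersampling keeps every row of the smallest class and draws an equally sized random subset from each larger class. The sketch below shows the idea with the standard library only; the helper `undersample_positions` and the toy labels are hypothetical, and this is not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def undersample_positions(labels, seed=42):
    """Return sorted row positions so that every class has as many rows as the smallest one."""
    rng = random.Random(seed)

    # Grouping row positions by class label
    positions_by_class = {}
    for position, label in enumerate(labels):
        positions_by_class.setdefault(label, []).append(position)

    smallest = min(len(positions) for positions in positions_by_class.values())

    # Drawing the same number of rows from each class, without replacement
    kept = []
    for positions in positions_by_class.values():
        kept.extend(rng.sample(positions, smallest))
    return sorted(kept)


# Hypothetical imbalanced labels: 6 negative, 3 positive
toy_labels = ["negative"] * 6 + ["positive"] * 3
kept_positions = undersample_positions(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept_positions)
print(balanced_counts)
```

The cost of this approach is that rows from the larger class are discarded, so the model trains on less data.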

In [None]:
# Fitting the vectorizer on X_resampled['tweet'] and transforming it
X_resampled_vectorized = tfidf.fit_transform(X_resampled["tweet"])

# Inspecting the vectorized data
pd.DataFrame.sparse.from_spmatrix(X_resampled_vectorized, columns=tfidf.get_feature_names_out())

With the resampled data vectorized, we will again fit and evaluate the Naive Bayes classifier using `cross_val_score`.

In [None]:
# Evaluating the classifier on X_resampled_vectorized and y_resampled
balanced_cv = 1- cross_val_score(baseline_model, X_resampled_vectorized, y_resampled)
balanced_cv.mean()

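The value of `1 - cross_val_score` moved from about 33% to about 44%. Because `cross_val_score` reports accuracy by default, this quantity is the error rate, and the two figures are not directly comparable: the baseline was scored on imbalanced classes, while this iteration is scored on balanced ones. 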

In [None]:
# Inspecting the new class balance
y_resampled.value_counts(normalize=True)

### <u>3rd iteration</u>: Removing Stopwords

**Removing Stopwords**

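We start from NLTK's standard English stopword list and add tokens specific to this corpus:
* `sxsw`: the name of the conference 
* `mention`: appears when a user is mentioned in a tweet
* `link`: appears in the tweets as the `{link}` placeholder
* `rt`: the retweet marker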

In [None]:
# Importing relevant packages
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

# Creating list to store stopwords
stopwords_list = stopwords.words('english')
stopwords_list[:5]

In [None]:
# Storing words to add to list of stopwords
manual_stopwords = ['sxsw', 'mention', 'link', 'rt']

# Adding to list of stopwords
stopwords_list.extend(manual_stopwords)


In [None]:
# Verifying the new words were added
stopwords_list[-len(manual_stopwords):]

In [None]:
# Just in case some stopwords need to be removed
# stopwords_list.remove(add_stopwords)

In [None]:
# Defining function that takes in a list of strings and returns only those that are not in the list
def remove_stopwords(token_list):
    stopwords_removed = [token for token in token_list if token not in stopwords_list]
    return stopwords_removed
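Checking membership in a Python list scans the list for every token, while a set lookup is constant time on average. When the filter is applied to thousands of tweets, converting the stopwords to a set first is a common optimization. The sketch below uses a hypothetical short stopword list and a hypothetical function name; the output is the same as with a list.

```python
# Hypothetical short stopword list standing in for the full one
toy_stopwords = ["the", "is", "at", "sxsw", "mention", "link", "rt"]

# Building the set once, outside the function, so it is not rebuilt per tweet
toy_stopword_set = set(toy_stopwords)


def remove_stopwords_fast(token_list):
    """Return the tokens that are not stopwords, preserving their order."""
    return [token for token in token_list if token not in toy_stopword_set]


toy_tokens = ["rt", "mention", "the", "ipad", "is", "great", "sxsw", "link"]
filtered_tokens = remove_stopwords_fast(toy_tokens)
print(filtered_tokens)
```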


In [None]:
# Testing it on an example
X_train.columns

In [None]:
tokens_example = X_train.iloc[100]['tweet_tokenized']
print("Length with stopwords: ", len(tokens_example))

tokens_example_without_stopwords = remove_stopwords(tokens_example)
print("Length without stopwords: ", len(tokens_example_without_stopwords))

Applying it to the full resampled dataset:

In [None]:
X_resampled['tweet_tokenized_without_stopwords'] = X_resampled['tweet_tokenized'].apply(remove_stopwords)

Now let's compare the frequency distributions once stopwords are removed.

In [None]:
fig, axes = two_subplits()
plot_distribution_by_sentiment(X_resampled, 'tweet_tokenized_without_stopwords', axes)
fig.suptitle('Word Frequencies for Each Sentiment', fontsize=20)
plt.show()


We will now re-run our model

In [None]:
# Instantiating the new vectorizer 
tfidf = TfidfVectorizer(
        max_features=10,
        stop_words=stopwords_list
        )

# Fitting the vectorizer on X_resampled['tweet'] and transforming it
X_resampled_vectorized = tfidf.fit_transform(X_resampled['tweet'])


# Visually inspecting the vectorized data
pd.DataFrame.sparse.from_spmatrix(X_resampled_vectorized, columns=tfidf.get_feature_names_out())

In [None]:
# Evaluating the classifier on X_resampled_vectorized and y_resampled
stopwords_removed_cv = 1- cross_val_score(baseline_model, X_resampled_vectorized, y_resampled)
stopwords_removed_cv

In [None]:
print("Baseline:         ", baseline_cv.mean())
print("Balanced:         ", balanced_cv.mean())
print("Stopwords removed:", stopwords_removed_cv.mean())

Compared with the baseline this is an improvement, but the score is lower than in the balanced iteration, before stopwords were removed.

### <u>3rd bis</u>: Removing only some of the stopwords

In [None]:
# Inspecting the resampled data before deciding which stopwords to keep
X_resampled.head()

### <u>4th iteration</u>: Lemmatize

In [None]:
# Importing relevant package
from nltk.stem.wordnet import WordNetLemmatizer
# nltk.download('wordnet')
# nltk.download('omw-1.4')

# Instantiating the Lemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
# Defining a function that lemmatizes each token, treating it as a verb (pos='v')
def lemmatize_words(token_list):
    lemmatized_token = [lemmatizer.lemmatize(token, pos='v') for token in token_list]
    return lemmatized_token

In [None]:
# Picking an example token list from the resampled data
tokens_example = X_resampled.iloc[300]['tweet_tokenized_without_stopwords']

In [None]:
tokens_example

In [None]:
lemmatize_words(tokens_example)

## 6. Evaluation

*how well did your final model perform?*
* Include one or more relevant metrics
 
* Be sure to briefly describe your validation approach

Text Text Text Text Text Text 

Text Text Text Text Text Text 

## 7. Findings & Recommendations

Text Text Text Text Text Text 

Text Text Text Text Text Text 

## 8. Limits & Next Steps

Text Text Text Text Text Text 

Text Text Text Text Text Text 