# 4. Feature Analysis

In addition to the *frequency* distribution of words across the sentiment categories explored in the previous sections, we now perform an in-depth ***feature*** *analysis* to gain insights and improve the performance of the classification model that will be selected to best fit the unseen data.

In [1]:
# Import Dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

plt.style.use('ggplot')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold

In [2]:
# Read Pre-processed Data
df = pd.read_csv('rws_eda.csv')
df = df.drop(df.columns[0], axis = 1)
df.head(1)

Unnamed: 0,Clean Title,Clean Content,Rating,Sentiment,Target
0,second time buying headphones overall value money,lost first pair replaced really like fit sound...,4,Positive,2


Before proceeding with the vectorization methods to extract meaningful features from text, let's ***split*** the *'df'* dataset into train and test sets with a *test_size* of *0.3*, meaning that **30%** of the dataset is allocated to the test set. The data is shuffled *before* splitting because 'shuffle = True' is the default of *train_test_split*, and 'random_state = 42' makes the split reproducible:

In [3]:
# Split Data
train, test = train_test_split(df, test_size = 0.3, random_state = 42)

print(f'Train Set Shape: {train.shape}')
print(f'Test Set Shape: {test.shape}')

Train Set Shape: (70, 5)
Test Set Shape: (30, 5)
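
As a minimal sketch of the split described above, the snippet below uses a hypothetical toy frame (`toy`, not the review dataset) to show that shuffling happens without passing `shuffle` explicitly. It also shows an optional `stratify` variant, which is an assumption on my side and not part of the original pipeline; it may be worth considering when the classes are imbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for 'df' (100 rows, imbalanced target)
toy = pd.DataFrame({
    'Clean Content': [f'review {i}' for i in range(100)],
    'Target': [2] * 84 + [1] * 12 + [0] * 4,
})

# shuffle=True is the default, so it does not need to be passed explicitly
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=42)

# Optional variant: stratify on the target to keep class proportions similar
strat_train, strat_test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy['Target']
)

print(toy_train.shape, toy_test.shape)
print(strat_test['Target'].value_counts().to_dict())
```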


For the modelling section it is necessary to apply **categorical encoding** techniques, converting categorical variables into numerical format. In particular, let's perform ***count***, ***frequency***, and ***mean*** encoding on the split datasets to gain insights into *how many* instances belong to each sentiment category, the *proportion* of instances for each sentiment class, and the *relationship* between sentiment categories and the target variable. The encodings are computed on the training set only and then mapped onto the test set.

Since ***mean*** encoding uses the target variable, there's a risk of data *leakage*; to avoid overfitting and improve the generalization of the model it is necessary, in further steps, to use techniques like *cross-validation* and *smoothing*.

In [4]:
# Count Encoding
count_encoded = train.groupby('Sentiment').size().reset_index(name = 'Count')
train = train.merge(count_encoded, on = 'Sentiment', how = 'left')
test = test.merge(count_encoded, on = 'Sentiment', how = 'left')

# Frequency Encoding
freq_encoded = train.groupby('Sentiment').size() / len(train)
train['Freq Encoded'] = train['Sentiment'].map(freq_encoded)
test['Freq Encoded'] = test['Sentiment'].map(freq_encoded)

# Mean Encoding
mean_encoded = train.groupby('Sentiment')['Target'].mean()
train['Mean Encoded'] = train['Sentiment'].map(mean_encoded)
test['Mean Encoded'] = test['Sentiment'].map(mean_encoded)

print(train.head())
print(test.head())

                                         Clean Title  \
0                                       tozo earbuds   
1  good value waterproof bluetooth earbuds nice f...   
2                                        great sound   
3                                initial review good   
4                  astonishingly good bass tiny buds   

                                       Clean Content  Rating Sentiment  \
0  great buds super sound also bonus wireless cha...       5  Positive   
1  used different lowend midrange bluetooth earbu...       5  Positive   
2       using walking dog gym really impressed money       4  Positive   
3  received today initial review seem really good...       4  Positive   
4  hard believe tiny buds produce bass sounds fac...       5  Positive   

   Target  Count  Freq Encoded  Mean Encoded  
0       2     59      0.842857           2.0  
1       2     59      0.842857           2.0  
2       2     59      0.842857           2.0  
3       2     59      0.842857
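
To make the *smoothing* idea mentioned above more concrete, here is a minimal sketch of one common form of smoothed mean encoding, in which each category mean is shrunk toward the global mean. The toy data, the binary target, and the smoothing strength `m` are all hypothetical choices for illustration, not values from this notebook.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and a numeric target
toy = pd.DataFrame({
    'Sentiment': ['Positive'] * 4 + ['Neutral'] * 2 + ['Negative'],
    'Target':    [1, 1, 1, 0, 1, 0, 0],
})

m = 2  # smoothing strength (assumed value; it would normally be tuned)
global_mean = toy['Target'].mean()
grp = toy.groupby('Sentiment')['Target'].agg(['count', 'mean'])

# Weighted average of the category mean and the global mean:
# categories with few rows are pulled more strongly toward the global mean
smoothed = (grp['count'] * grp['mean'] + m * global_mean) / (grp['count'] + m)
toy['Mean Encoded (smoothed)'] = toy['Sentiment'].map(smoothed)

print(smoothed)
```

With `m = 0` this reduces to the plain mean encoding used in the cell above; larger values of `m` trust small categories less.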

In [5]:
train.head(3)

Unnamed: 0,Clean Title,Clean Content,Rating,Sentiment,Target,Count,Freq Encoded,Mean Encoded
0,tozo earbuds,great buds super sound also bonus wireless cha...,5,Positive,2,59,0.842857,2.0
1,good value waterproof bluetooth earbuds nice f...,used different lowend midrange bluetooth earbu...,5,Positive,2,59,0.842857,2.0
2,great sound,using walking dog gym really impressed money,4,Positive,2,59,0.842857,2.0


In [6]:
sent_2 = train['Target'].value_counts().get(2, 0)
print("Number of rows with Pos Sentiment:", sent_2)
sent_1 = train['Target'].value_counts().get(1, 0)
print("Number of rows with Neu Sentiment:", sent_1)
sent_0 = train['Target'].value_counts().get(0, 0)
print("Number of rows with Neg Sentiment:", sent_0)

Number of rows with Pos Sentiment: 59
Number of rows with Neu Sentiment: 8
Number of rows with Neg Sentiment: 3


## 4.1. Word Vectorization

**Word vectorization** is a fundamental step in *preparing* text data for machine learning: it maps words or phrases from a vocabulary to corresponding vectors of real numbers, which are used to learn *patterns*, make *predictions*, find word *similarities*, and perform various text classification tasks.

Since there are several *techniques* for vectorizing text data and capturing word frequencies, let's analyze two of the most common approaches: the **Bag-Of-Words** and **TF-IDF** models.

##### BAG OF WORDS

The first step involves *splitting* a sentence or a piece of text into individual words (***tokens***). Let's transform the *'Clean Content'* of the training dataset into a sparse matrix of word counts, and then convert it to a dense NumPy array using the **toarray()** method:

In [7]:
# Instance for Unigrams
cv = CountVectorizer()

# Bag-Of-Words
bow = cv.fit_transform(train['Clean Content']).toarray()
bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In the resulting array, each row corresponds to a document (*review*) and each column corresponds to a *unique* word in the vocabulary. The values in the BoW representation represent the ***count*** of each word in the respective text: if the word appears in the document, the corresponding value is set to the *frequency* of that word; otherwise, it's set to *zero*.

Once the text is tokenized, a **vocabulary** is created by listing all *unique* words found in the entire corpus of documents, where each unique word becomes a *dimension* in the BoW representation. 

Let's show the list of *vocabulary* used for the CountVectorizer during the *word-level* analysis:

In [8]:
# Unigrams Vocabulary
vocab = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(f'The first 10 Unique Words in the corpus are: {vocab[:10].tolist()}')
#print(vocab)

The first 10 Unique Words in the corpus are: ['14hrs', '18month', '4hrs', '58x', '6th', 'ability', 'able', 'aboutsound', 'absorbed', 'accepted']




Let's inspect which features are the ***most*** *important* for the first review of the training set (row 0 of the BoW matrix), and which ones are absent from it, considering the following conditions:
* **Value 0:** indicates that a specific word (feature) is ***not*** *present* in the corresponding document. 
* **Value 1:** represents that a particular word is present in the corresponding document exactly once, indicating the *occurrence* of that feature in the document.

In [9]:
bow_df = pd.DataFrame(bow[0].T, index = cv.get_feature_names_out(), columns = ['BoW'])
bow_df = bow_df.sort_values('BoW', ascending = False)
# Top 10 Most Relevant Unigrams
bow_df.head(10)

Unnamed: 0,BoW
bonus,1
buds,1
charging,1
also,1
great,1
wireless,1
sound,1
super,1
pulled,0
put,0


What about performing a *word-level* analysis including *bigrams* and *trigrams*? 

In [10]:
# Instance for Bigrams
cv_bigrams = CountVectorizer(ngram_range = (2, 2))
# BOW for Bigrams
bow_bigrams = cv_bigrams.fit_transform(train['Clean Content'])
bow_df_bigrams = pd.DataFrame(bow_bigrams[0].T.toarray(), index = cv_bigrams.get_feature_names_out(), columns = ['BoW'])
bow_df_bigrams = bow_df_bigrams.sort_values('BoW', ascending = False)
# Top 10 Most Relevant Bigrams
bow_df_bigrams.head(10)

Unnamed: 0,BoW
sound also,1
wireless charging,1
also bonus,1
buds super,1
great buds,1
bonus wireless,1
super sound,1
products brilliant,0
product however,0
promptly received,0


In [11]:
# Instance for Trigrams
cv_trigrams = CountVectorizer(ngram_range = (3, 3))
# BoW for Trigrams
bow_trigrams = cv_trigrams.fit_transform(train['Clean Content'])
bow_df_trigrams = pd.DataFrame(bow_trigrams[0].T.toarray(), index = cv_trigrams.get_feature_names_out(), columns = ['BoW'])
bow_df_trigrams = bow_df_trigrams.sort_values('BoW', ascending = False)
# Top 10 Most Relevant Trigrams
bow_df_trigrams.head(10)

Unnamed: 0,BoW
super sound also,1
sound also bonus,1
bonus wireless charging,1
great buds super,1
buds super sound,1
also bonus wireless,1
14hrs delivery time,0
purchase working fine,0
purchased previously good,0
purchased black headphones,0


Let's show the list of *vocabulary* used for the corresponding CountVectorizer instances:

In [12]:
# Bigrams Vocabulary
vocab_bi = cv_bigrams.get_feature_names_out()
#print(vocab_bi)

# Trigrams Vocabulary
vocab_tr = cv_trigrams.get_feature_names_out()
#print(vocab_tr)



In [13]:
print(f'The first 10 Unique Bigrams in the corpus are: {cv_bigrams.get_feature_names_out()[:10].tolist()}')
print()
print(f'The first 10 Unique Trigrams in the corpus are: {cv_trigrams.get_feature_names_out()[:10].tolist()}')

The first 10 Unique Bigrams in the corpus are: ['14hrs delivery', '18month warranty', '4hrs continuous', '58x open', '6th choice', 'ability charge', 'ability use', 'able charge', 'able hear', 'able job']

The first 10 Unique Trigrams in the corpus are: ['14hrs delivery time', '18month warranty sticker', '4hrs continuous char', '58x open back', '6th choice buds', 'ability charge charging', 'ability use one', 'able charge die', 'able hear conversation', 'able job even']


Despite its simplicity, Bag-Of-Words does ***not*** consider the *order* of words in the text, treating each word independently, and does ***not*** capture their *semantic* meaning, treating words with similar meanings or synonyms as distinct entities. Moreover, words not present in the vocabulary are *ignored*, leading to a loss of semantic information.

For these reasons, let's consider a more accurate technique: Term Frequency–Inverse Document Frequency (**TF-IDF**).

##### TF-IDF

*Term Frequency-Inverse Document Frequency* (**TF-IDF**) is used to *convert* a collection of textual documents into a matrix of numerical values, taking into account the *importance* of words within a text and across the entire corpus. TF-IDF assigns ***higher*** *weights* to words that are ***rare*** across the entire dataset but *relevant* in individual texts. It is an important tool for learning sentiment patterns by identifying words that might carry significant sentiment-related information. In particular:

* *Term Frequency* (**TF**) is the number of times a word occurs in a document divided by the total number of words in that document. It is a *normalized* frequency score representing how *frequent* a word is.

* *Document Frequency* (**DF**) is the ratio between the number of documents containing a word (W) and the total number of documents in the corpus. It represents the *proportion* of documents that contain a *certain* word (W).

* *Inverse Document Frequency* (**IDF**) is the logarithm of the reciprocal of DF: the ***more*** *common* a word is across all documents, the ***lower*** its *relevance* for the current text.

The Term Frequency-Inverse Document Frequency score is given by $TF \times IDF$, indicating that the *higher* the score, the more *important* that word is. 
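
Note that scikit-learn's *TfidfVectorizer*, with its default parameters, differs slightly from the textbook definition above: it uses raw counts as TF, a smoothed IDF of $\ln\frac{1 + n}{1 + df} + 1$ (with $n$ documents and $df$ documents containing the word), and then rescales each row to unit L2 norm. The sketch below recomputes these steps by hand on a hypothetical toy corpus and checks them against the library output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = ['great sound great fit', 'sound cut out', 'great value']

# Raw term counts (TF as used by scikit-learn's defaults)
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
df_counts = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed IDF, then TF x IDF, then L2 normalization of each row
idf = np.log((1 + n_docs) / (1 + df_counts)) + 1
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Library result with default parameters
sk = TfidfVectorizer().fit_transform(docs).toarray()
matches = bool(np.allclose(manual, sk))
print(matches)
```

Because of the row normalization, TF-IDF values are comparable within a document but should be read with care when compared across documents of very different lengths.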

Let's *convert* pre-processed data into TF-IDF features and perform a *word-level* analysis starting from unigrams:

In [14]:
# Instance for Unigrams
tfidf = TfidfVectorizer()

# TF-IDF
words = tfidf.fit_transform(train['Clean Content'])

Let's show the list of *vocabulary* used for the TfidfVectorizer during the *word-level* analysis:

In [15]:
# Word-Level Vocabulary
vocab = tfidf.get_feature_names_out()
print(f'The first 10 Unique Words in the corpus are: {tfidf.get_feature_names_out()[:10]}')
#print(vocab)

The first 10 Unique Words in the corpus are: ['14hrs' '18month' '4hrs' '58x' '6th' 'ability' 'able' 'aboutsound'
 'absorbed' 'accepted']


In order to inspect which features are the ***most*** *important*, and which ones are useless, it is necessary to *convert* the resulting sparse matrix of TF-IDF weights to a dense matrix using the **todense()** method:

In [16]:
tfidf_df = pd.DataFrame(words[0].T.todense(), index = tfidf.get_feature_names_out(), columns = ["TF-IDF"])
tfidf_df = tfidf_df.sort_values('TF-IDF', ascending = False)
# Top 10 Most Relevant Unigrams
tfidf_df.head(10)

Unnamed: 0,TF-IDF
bonus,0.553146
super,0.504064
also,0.30924
wireless,0.294089
buds,0.280625
great,0.274416
charging,0.257498
sound,0.193575
pulled,0.0
put,0.0


What about performing a *word-level* analysis including *bigrams* and *trigrams*? 

In [17]:
# Instance for Bigrams
tfidf_bigrams = TfidfVectorizer(ngram_range = (2, 2))
# TF-IDF for Bigrams
words_bigrams = tfidf_bigrams.fit_transform(train['Clean Content'])
tfidf_df_bigrams = pd.DataFrame(words_bigrams[0].T.todense(), index = tfidf_bigrams.get_feature_names_out(), columns = ["TF-IDF"])
tfidf_df_bigrams = tfidf_df_bigrams.sort_values('TF-IDF', ascending = False)
# Top 10 Most Relevant Bigrams
tfidf_df_bigrams.head(10)

Unnamed: 0,TF-IDF
sound also,0.389931
bonus wireless,0.389931
also bonus,0.389931
buds super,0.389931
great buds,0.389931
super sound,0.389931
wireless charging,0.296183
products brilliant,0.0
product however,0.0
promptly received,0.0


In [18]:
# Instance for Trigrams
tfidf_trigrams = TfidfVectorizer(ngram_range = (3, 3))
# TF-IDF for Trigrams
words_trigrams = tfidf_trigrams.fit_transform(train['Clean Content'])
tfidf_df_trigrams = pd.DataFrame(words_trigrams[0].T.todense(), index = tfidf_trigrams.get_feature_names_out(), columns = ["TF-IDF"])
tfidf_df_trigrams = tfidf_df_trigrams.sort_values('TF-IDF', ascending = False)
# Top 10 Most Relevant Trigrams
tfidf_df_trigrams.head(10)

Unnamed: 0,TF-IDF
super sound also,0.408248
sound also bonus,0.408248
bonus wireless charging,0.408248
great buds super,0.408248
buds super sound,0.408248
also bonus wireless,0.408248
14hrs delivery time,0.0
purchase working fine,0.0
purchased previously good,0.0
purchased black headphones,0.0


Let's show the list of *vocabulary* used for the corresponding TfidfVectorizer instances:

In [19]:
# Bigrams Vocabulary
vocab_bi = tfidf_bigrams.get_feature_names_out()
#print(vocab_bi)

# Trigrams Vocabulary
vocab_tr = tfidf_trigrams.get_feature_names_out()
#print(vocab_tr)



In [20]:
print(f'The first 10 Unique Bigrams in the corpus are: {tfidf_bigrams.get_feature_names_out()[:10]}')
print()
print(f'The first 10 Unique Trigrams in the corpus are: {tfidf_trigrams.get_feature_names_out()[:10]}')

The first 10 Unique Bigrams in the corpus are: ['14hrs delivery' '18month warranty' '4hrs continuous' '58x open'
 '6th choice' 'ability charge' 'ability use' 'able charge' 'able hear'
 'able job']

The first 10 Unique Trigrams in the corpus are: ['14hrs delivery time' '18month warranty sticker' '4hrs continuous char'
 '58x open back' '6th choice buds' 'ability charge charging'
 'ability use one' 'able charge die' 'able hear conversation'
 'able job even']


To conclude the word vectorization analysis, let's *count* the unique ***words*** that *appear* in the first review of the training set (the document inspected above) to identify the most *significant* features provided by the BoW and TF-IDF models. Words that are relevant in both representations of the document are features having a *BoW* value of **1** and a *TF-IDF* value **greater** than **0**:

In [21]:
words_df = bow_df.merge(tfidf_df, left_index = True, right_index = True, suffixes = ('_BoW', '_TFIDF'))
total_count = ((words_df['BoW'] == 1) & (words_df['TF-IDF'] > 0)).sum()
print('Total count of significant features: ', total_count)
words_with_value_1 = words_df.index[(words_df['BoW'] == 1) & (words_df['TF-IDF'] > 0)].tolist()
print('Significant Words included are:', words_with_value_1)

Total count of significant features:  8
Significant Words included are: ['bonus', 'buds', 'charging', 'also', 'great', 'wireless', 'sound', 'super']


What about counting the total number of the most *significant* ***bigrams*** and ***trigrams***?

In [22]:
bigrams_df = bow_df_bigrams.merge(tfidf_df_bigrams, left_index = True, right_index = True, suffixes = ('_BoW', '_TFIDF'))
total_count = ((bigrams_df['BoW'] == 1) & (bigrams_df['TF-IDF'] > 0)).sum()
print('Total count of significant features: ', total_count)
bigrams_with_value_1 = bigrams_df.index[(bigrams_df['BoW'] == 1) & (bigrams_df['TF-IDF'] > 0)].tolist()
print('Significant Bigrams included are:', bigrams_with_value_1)

Total count of significant features:  7
Significant Bigrams included are: ['sound also', 'wireless charging', 'also bonus', 'buds super', 'great buds', 'bonus wireless', 'super sound']


In [23]:
trigrams_df = bow_df_trigrams.merge(tfidf_df_trigrams, left_index = True, right_index = True, suffixes = ('_BoW', '_TFIDF'))
total_count = ((trigrams_df['BoW'] == 1) & (trigrams_df['TF-IDF'] > 0)).sum()
print('Total count of significant features: ', total_count)
trigrams_with_value_1 = trigrams_df.index[(trigrams_df['BoW'] == 1) & (trigrams_df['TF-IDF'] > 0)].tolist()
print('Significant Trigrams included are:', trigrams_with_value_1)

Total count of significant features:  6
Significant Trigrams included are: ['super sound also', 'sound also bonus', 'bonus wireless charging', 'great buds super', 'buds super sound', 'also bonus wireless']


## 4.2. Feature Selection

**Feature Selection** is the process of selecting a subset of *relevant* features from the original dataset in order to improve the *performance* of a machine learning model. In text classification, feature selection involves choosing a subset of words, phrases, or other linguistic elements as input features for the classification algorithm with the purpose of retaining the most *significant* features while reducing noise and improving the model's *efficiency* and *effectiveness*.

To improve the model's performance and computational efficiency, it can be useful to *remove* ***low-variance*** features, reducing the complexity of the model and the risk of overfitting and leading to faster training times and more *efficient* predictions.

In [24]:
# Training Set
train = train.loc[:, ['Clean Title', 'Clean Content', 'Rating', 'Sentiment']]
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 69
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Clean Title    69 non-null     object
 1   Clean Content  70 non-null     object
 2   Rating         70 non-null     int64 
 3   Sentiment      70 non-null     object
dtypes: int64(1), object(3)
memory usage: 2.7+ KB


Before removing low-variance features, it is necessary to ***encode*** categorical columns *(object)* into numerical format:

In [25]:
# Columns to be encoded
columns_to_encode = ['Clean Title', 'Clean Content', 'Sentiment']

encoder = LabelEncoder()
for col in columns_to_encode:
    train[col] = encoder.fit_transform(train[col])
    
train.head(3)

Unnamed: 0,Clean Title,Clean Content,Rating,Sentiment
0,60,20,5,2
1,27,59,5,2
2,39,62,4,2
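
For reference, *LabelEncoder* assigns integer codes following the sorted order of the classes it sees. The minimal sketch below (hypothetical labels) shows the resulting mapping for the sentiment classes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sentiment labels
labels = ['Positive', 'Negative', 'Neutral', 'Positive']

le = LabelEncoder()
codes = le.fit_transform(labels).tolist()

# classes_ is sorted, so the codes follow alphabetical order
mapping = dict(zip(le.classes_.tolist(), le.transform(le.classes_).tolist()))
print(codes)    # [2, 0, 1, 2]
print(mapping)  # {'Negative': 0, 'Neutral': 1, 'Positive': 2}
```

For free-text columns such as *'Clean Title'* and *'Clean Content'*, the codes act only as arbitrary identifiers of each distinct string. Also, since the loop above refits the same encoder for every column, only the last fit is kept; one encoder per column would be needed to call `inverse_transform` later.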


Let's *remove* features with *low* variance from the train dataset, considering only the features (***X***) and not the desired output (***y***):

In [26]:
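# Separate features and target
# Note: only 'Rating' is dropped here, so X still contains the
# 'Sentiment' column that is also used as y
X = train.drop('Rating', axis = 1)
y = train['Sentiment']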

First of all, it is necessary to specify the threshold value below which features will be considered having low variance and thus will be removed from the dataset. In particular:
* If variance threshold **= 0**, *Constant* features will be dropped.
* If variance threshold **> 0**, *Quasi-Constant* Features will be removed.

Since variance measures how *spread out* the values of a feature are around the mean, low-variance features might ***not*** carry *significant* or discriminatory information to distinguish between different categories across classification tasks. 

Setting the threshold to **0.25**, let's apply the **fit_transform()** method to identify and *remove* low-variance features. After the transformed feature matrix is returned, the **get_support()** method provides a boolean mask indicating which features are selected *(True)* and which are removed *(False)* based on the variance threshold:

In [27]:
# Remove Low-Variance features
var_thr = VarianceThreshold(threshold = 0.25) 
var_thr.fit_transform(X)
# Features to keep
var_thr.get_support()

array([ True,  True, False])
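
To see how the threshold separates constant, quasi-constant, and informative columns, here is a minimal sketch on a hypothetical toy frame (`toy_X` and its column names are made up for illustration).

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical features: constant, quasi-constant, and well spread
toy_X = pd.DataFrame({
    'constant':       [1, 1, 1, 1, 1, 1],
    'quasi_constant': [0, 0, 0, 0, 0, 1],
    'spread':         [1, 5, 2, 8, 3, 9],
})

selector = VarianceThreshold(threshold=0.25)
selector.fit(toy_X)

# Population variances (ddof=0), which is what the selector compares
print(toy_X.var(ddof=0))

# Boolean mask -> names of the columns that pass the threshold
kept = toy_X.columns[selector.get_support()].tolist()
print(kept)  # ['spread']
```

Because variance depends on a column's scale, a single threshold is only directly comparable across features measured on similar scales.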

In [28]:
# Pick up Low-Variance columns
low = [column for column in X.columns 
          if column not in X.columns[var_thr.get_support()]]

for features in low:
    print(features)
    
print('Number of retained features is: ', sum(var_thr.get_support()))
print('Proportion of retained features is: {:.2f} %'.format(sum(var_thr.get_support()) / len(X.columns) * 100))

Sentiment
Number of retained features is:  2
Proportion of retained features is: 66.67 %


Inspecting the results, only the *'Sentiment'* feature has **not** passed the variance threshold and has **not** been retained. Since the dataset's small size compromises how well it represents the underlying population, I decide ***not*** to *simplify* the model training, considering that *'Clean Title'* and *'Clean Content'* carry *significant* or discriminatory information to distinguish between different categories.