<a id="4"></a>
<font color="darkslateblue" size=+2.5><b>Natural Language Processing</b></font>

With a better grasp of the data, we can move on to the Natural Language Processing step and analyze the comments left on e-commerce orders. The goal is to turn these comments into input for a `sentiment analysis` model that captures how customers feel about their online purchases. Let's take a look at the reviews data.

<a id="4.1"></a>
<font color="dimgrey" size=+2.0><b>Data Understanding</b></font>

In [3]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

In [22]:
import pandas as pd

# 1. Load dataset
df = pd.read_csv("/home/ayush-wase/E Commerce ML/dataset/olist_order_reviews_dataset.csv")

# 2. Keep only the columns we need
df = df[["review_score", "review_comment_message"]]

# 3. Remove rows with empty comments
df = df.dropna()

# 4. Rename columns for simplicity
df.columns = ["score", "comment"]

# 5. Reset index
df = df.reset_index(drop=True)

# 6. Check data
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (40977, 2)


Unnamed: 0,score,comment
0,5,Recebi bem antes do prazo estipulado.
1,5,Parabéns lojas lannister adorei comprar pela I...
2,4,aparelho eficiente. no site a marca do aparelh...
3,4,"Mas um pouco ,travando...pelo valor ta Boa.\r\n"
4,5,"Vendedor confiável, produto ok e entrega antes..."


So, we have approximately 41k comments at hand that could be used to train a sentiment analysis model. For this to work, though, we have to go through a long text preparation process that transforms each comment into a vector a Machine Learning model can interpret. **Let's go ahead**

<a id="4.2"></a>
<font color="dimgrey" size=+2.0><b>Regular Expressions</b></font>

Since the comments come from the internet, we probably have to deal with HTML tags, line breaks, special characters and other noise in the dataset. Let's dig a little deeper into `Regular Expressions` to search for those patterns.

First of all, let's define a function for analysing the results of an applied regular expression. With this we can validate our text preprocessing more easily.
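
The helper itself does not appear in this section, so here is a minimal sketch of what such a function could look like. The name `print_step_result` and its parameters are assumptions, not the original implementation; it simply prints selected comments before and after a transformation, using the same `--- Text N ---` layout as the examples in this section.

```python
def print_step_result(before, after, idx_list):
    """Print selected texts before and after a transformation.

    Hypothetical helper: `before` and `after` are equal-length sequences
    of strings, `idx_list` holds the positions to display.
    """
    lines = []
    for i, idx in enumerate(idx_list, start=1):
        lines.append(f"--- Text {i} ---\n")
        lines.append(f"Before: \n{before[idx]}\n")
        lines.append(f"After: \n{after[idx]}\n")
    report = "\n".join(lines)
    print(report)
    return report


# Usage with two toy comments (not taken from the dataset)
raw = ["Tudo certo!\r\nAtt", "Recomendo."]
clean = [t.replace("\r\n", " ") for t in raw]
report = print_step_result(raw, clean, idx_list=[0])
```

Returning the report as a string, besides printing it, makes the helper easy to check in a test or to log.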

<a id="4.2.1"></a>
<font color="dimgrey" size=+1.5><b>Breakline and Carriage Return</b></font>

In [23]:
df.sample(10)

Unnamed: 0,score,comment
12256,2,O CORREIO ME ENTREGOU SOMENTE UMA CORTINA SEND...
25586,3,Recebi parcialmente o pedido. De três relógios...
29493,5,"Bom produto, rápida entrega! Recomendo"
5118,5,Chegou tudo conforme esperado. O que mais surp...
21710,4,Muito boa compra em todos sentidos.
28420,5,Entrega feita normalmente. Produto de boa qual...
26173,4,Bom produto. Boa empresa: entregou no prazo co...
38062,5,Recomendo.
39863,1,"Só recebi 01 cadeira, ainda falta uma."
14052,4,Prático!


In [24]:
import re

def clean_breaklines(text):
    if isinstance(text, str):
        # Replace \r, \n, and \r\n with a space
        return re.sub(r'[\r\n]+', ' ', text)
    return text


# now lets apply the above function to our dataset

df['comment'] = df['comment'].apply(clean_breaklines)

df['comment'].head()


0                Recebi bem antes do prazo estipulado.
1    Parabéns lojas lannister adorei comprar pela I...
2    aparelho eficiente. no site a marca do aparelh...
3         Mas um pouco ,travando...pelo valor ta Boa. 
4    Vendedor confiável, produto ok e entrega antes...
Name: comment, dtype: str

--- Text 1 ---

*Before:* 
*Estava faltando apenas um produto, eu recebi hoje , muito obrigada!*
*Tudo certo!*

*Att*

*Elenice.*

*After:*
*Estava faltando apenas um produto, eu recebi hoje , muito obrigada!  Tudo certo!    Att     Elenice.*


Here it's possible to see the effect of \r (_carriage return_, ASCII code 13) and \n (_new line_, ASCII code 10). With RegEx, we can get rid of those patterns.
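
As a quick check of the two character codes, and of how the pattern in `clean_breaklines` treats a run of line breaks, the following sketch uses a made-up string rather than a real review.

```python
import re

# Code points of the two control characters
print(ord("\r"), ord("\n"))  # 13 10

sample = "Tudo certo!\r\n\r\nAtt"  # made-up text, Windows-style line endings

# One space per run of line-break characters (the pattern used above)
collapsed = re.sub(r"[\r\n]+", " ", sample)

# One space per individual character, which leaves repeated spaces behind
one_to_one = re.sub(r"[\r\n]", " ", sample)

print(repr(collapsed))
print(repr(one_to_one))
```

Replacing each character individually leaves repeated spaces in the text, while the `+` quantifier collapses a whole run of breaks into a single space.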

<a id="4.2.2"></a>
<font color="dimgrey" size=+1.5><b>Sites and Hyperlinks</b></font>

Another pattern that must be treated is sites and hyperlinks. Let's define another function to apply RegEx to them.

In [37]:
import re

def remove_urls(text):
    if isinstance(text, str):
        # Replace http, https, and www links with the token 'link'
        return re.sub(r'http\S+|www\S+', ' link ', text)
    return text


df['comment'] = df['comment'].apply(remove_urls)

--- Text 1 ---

Before: 
comprei o produto pela cor ilustrada pelo site da loja americana, no site mostra ser preto http://prntscr.com/jkx7hr quando o produto chegou aqui veio todos com a mesma cor, tabaco http://prntscr.com/

After: 
comprei o produto pela cor ilustrada pelo site da loja americana, no site mostra ser preto  link  quando o produto chegou aqui veio todos com a mesma cor, tabaco  link 

--- Text 2 ---

Before: 
Pedi esse: https://www.lannister.com.br/produto/22880118/botox-capilar-selafix-argan-premium-doux-clair-2x1-litro?pfm_carac=doux%20clair&pfm_index=3&pfm_page=search&pfm_pos=grid&pfm_type=search_page%

After: 
Pedi esse:  link 

<a id="4.2.3"></a>
<font color="dimgrey" size=+1.5><b>Dates</b></font>

Well, since we are dealing with customer reviews of items bought online, date mentions are probably very common. Let's see some examples and apply a RegEx to replace them with `data` (which means `date` in English).

In [36]:
def remove_dates_and_fix_spaces(text):
    if isinstance(text, str):
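        # Replace dd/mm/yy(yy) dates (separators '/', '-' or '.') with the token 'data'
        text = re.sub(r'\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b', ' data ', text)
        # Replace yyyy/mm/dd dates as well
        text = re.sub(r'\b\d{4}[/.-]\d{1,2}[/.-]\d{1,2}\b', ' data ', text)
        text = re.sub(r'\s+', ' ', text)  # collapse extra spaces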
        return text.strip()
    return text

df['comment'] = df['comment'].apply(remove_dates_and_fix_spaces)

--- Text 1 ---

Before: 
(tenso) tinhas mais de 10 lojas pra min escolher qual comprar, o pitei pela lannister por ser uma loja conhecida a entrega estava para dia 22/01/2018 . hoje já é 24/01/2018 pois comprei dia 06/01/18

After: 
(tenso) tinhas mais de 10 lojas pra min escolher qual comprar, o pitei pela lannister por ser uma loja conhecida a entrega estava para dia  data  . hoje já é  data  pois comprei dia  data 

--- Text 2 ---

Before: 
COMPREI EM 21/03/2018, PG VIA CARTÃO EM 21/03/2018, NF FOI EMITIDA DIA 27/03/2018, PREVISÃO ENTREGA EM 12/04/2018, HOJE É 14/04/2018, NÃO RECEBI, NÃO ESTÁ EM TRANSPORTE, ESTOU MUITO PREOCUPADO

After: 
COMPREI EM  data , PG VIA CARTÃO EM  data , NF FOI EMITIDA DIA  data , PREVISÃO ENTREGA EM  data , HOJE É  data , NÃO RECEBI, NÃO ESTÁ EM TRANSPORTE, ESTOU MUITO PREOCUPADO

--- Text 3 ---

Before: 
Já comprei várias vezes no site "lannister";mas  desta última vez,fiz uma compra de um TONER no  04.10.16 e só prometeram p/ 25.11.16 e ainda não  recebi o produto.

After: 
Já comprei várias vezes no site "lannister";mas  desta última vez,fiz uma compra de um TONER no   data  e só prometeram p/  data  e ainda não  recebi o produto.

<a id="4.2.4"></a>
<font color="dimgrey" size=+1.5><b>Money</b></font>

Another pattern that is probably very common in this kind of source is the representation of money (R$ _,_). To improve our model, it may be a good idea to transform this pattern into a keyword like `dinheiro` (which means money in English).

In [35]:
import re

def replace_money(text):
    if isinstance(text, str):
        # Match Brazilian real amounts: optional R, $, optional space, then digits
        # with '.' or ',' separators (trailing punctuation is left untouched)
        text = re.sub(r'R?\$ ?\d+(?:[.,]\d+)*', ' dinheiro ', text)
        return text
    return text

# Apply directly to comment column
df['comment'] = df['comment'].apply(replace_money)

--- Text 1 ---

Before: 
Recebi o produto correto, porém o valor do produto na NF ficou a menor, R$ 172,00 sendo que comprei a 219,90.  O valor do frete calculado foi R$ 18,90 e veio R$ 93,00.  Gostaria que viesse com correto

After: 
Recebi o produto correto, porém o valor do produto na NF ficou a menor,  dinheiro  sendo que comprei a 219,90.  O valor do frete calculado foi  dinheiro  e veio  dinheiro .  Gostaria que viesse com correto

--- Text 2 ---

Before: 
Infelizmente, para uma entrega em GRU (Região Metropolitana da Grande SP) achei bem "salgado" o valor do frete cobrado sobre o preço do produto! Afinal, a mercadoria custou R$26,70 + R$15,11 de frete!

After: 
Infelizmente, para uma entrega em GRU (Região Metropolitana da Grande SP) achei bem "salgado" o valor do frete cobrado sobre o preço do produto! Afinal, a mercadoria custou  dinheiro  +  dinheiro  de frete!

--- Text 3 ---

Before: 
Paguei $48,00 reais de frete e acabei tendo que buscar o pedido no Centro de Distribuição dos Correios, porém a loja nada tem a ver com o mal serviço prestado pela empresa contrata para entrega.

After: 
Paguei  dinheiro  reais de frete e acabei tendo que buscar o pedido no Centro de Distribuição dos Correios, porém a loja nada tem a ver com o mal serviço prestado pela empresa contrata para entrega.

<a id="4.2.5"></a>
<font color="dimgrey" size=+1.5><b>Numbers</b></font>

Here we will try to find numbers in the reviews and replace them with the string `numero` (which means number in English). We could just replace the numbers with whitespace, but this might cause some information loss. Let's see what we've got:

In [34]:
import re

def replace_numbers(text):
    if isinstance(text, str):
        # Replace any sequence of digits with 'numero'
        text = re.sub(r'\b\d+\b', 'numero', text)
        return text
    return text

# Apply directly to the comment column
df['comment'] = df['comment'].apply(replace_numbers)

--- Text 1 ---

Before: 
Comprei o produto dia 25 de fevereiro e hoje dia 29 de marco não fora entregue na minha residência. Não sei se os correios desse Brasil e péssimo ou foi a própria loja que demorou postar.

After: 
Comprei o produto dia  numero  de fevereiro e hoje dia  numero  de marco não fora entregue na minha residência. Não sei se os correios desse Brasil e péssimo ou foi a própria loja que demorou postar.


<a id="4.2.6"></a>
<font color="dimgrey" size=+1.5><b>Negation</b></font>

This section was thought through and discussed with special care. The problem is that when we remove the stopwords, we would probably lose the meaning of some phrases, because negation words like não (not) are removed as well. For this reason, it may be a good idea to replace negation words with a common token that signals negation.

In [38]:
# List of common negation words in Portuguese
negation_words = [
    'não', 'nao', 'nunca', 'jamais', 'nem', 'nenhum', 'ninguém', 'nenhuma'
]

# Create a regex pattern to match them as whole words
negation_pattern = r'\b(?:' + '|'.join(negation_words) + r')\b'

def replace_negations(text):
    if isinstance(text, str):
        # Replace each negation word with the token 'negação'
        return re.sub(negation_pattern, ' negação ', text, flags=re.IGNORECASE)
    return text

# Apply directly to the comment column
df['comment'] = df['comment'].apply(replace_negations)

--- Text 1 ---

Before: 
O material é bom, o problema é que a bolsa não fecha, não possui zíper, é como uma sacola. Isso me deixou insatisfeita, pois na foto não dá pra perceber e não há informação ou foto interna sobre isso.

After: 
O material é bom, o problema é que a bolsa  negação  fecha,  negação  possui zíper, é como uma sacola. Isso me deixou insatisfeita, pois na foto  negação  dá pra perceber e  negação  há informação ou foto interna sobre isso.

--- Text 2 ---

Before: 
Meu pedido era para ser entregue até dia  data , até a presente data ( numero / numero ) a nota fiscal não foi emitida, solicitei várias vezes não obtive retorno, não recomendo esta Loja, nem a lannister!!!!!!

After: 
Meu pedido era para ser entregue até dia  data , até a presente data ( numero / numero ) a nota fiscal  negação  foi emitida, solicitei várias vezes  negação  obtive retorno,  negação  recomendo esta Loja, nem a lannister!!!!!!

--- Text 3 ---

Before: 
OEQUIPAMENTO NÃO FUNCIONA. O mini cartao SD nao encaixa e o computador não reconhece quando é conectado com o cabo USB

After: 
OEQUIPAMENTO  negação  FUNCIONA. O mini cartao SD  negação  encaixa e o computador  negação  reconhece quando é conectado com o cabo USB

--- Text 4 ---

Before: 
Cancelei ha tempos, enviaram mesmo assim e nao estornaram os valores

After: 
Cancelei ha tempos, enviaram mesmo assim e  negação  estornaram os valores
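
The motivation for this step can be sketched with a toy example. The stopword set below is a small hand-picked assumption (real lists are much longer), and it assumes that `não` is treated as a stopword. Without the replacement, a positive and a negative review collapse into the same text once stopwords are dropped.

```python
import re

# Tiny illustrative stopword set (an assumption, not the NLTK list)
toy_stopwords = {"o", "a", "é", "não", "e"}

def drop_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in toy_stopwords)

def mark_negation(text):
    # Swap the negation word for a token that is not a stopword
    return re.sub(r"\bnão\b", "negação", text, flags=re.IGNORECASE)

positive = "o produto é bom"
negative = "o produto não é bom"

# Without the replacement both reviews end up identical
plain_pos = drop_stopwords(positive)
plain_neg = drop_stopwords(negative)

# With the replacement the negative review keeps its marker
marked_neg = drop_stopwords(mark_negation(negative))

print(plain_pos, "|", plain_neg, "|", marked_neg)
```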

<font color="dimgrey" size=+1.5><b>Special Characters</b></font>

The search for special characters is a really special one, because we see a lot of them in online comments. Let's build a RegEx to find them.

In [39]:
import re

def replace_emojis(text):
    if isinstance(text, str):
        # Emoji ranges in Unicode
        emoji_pattern = re.compile(
            "["
            u"\U0001F600-\U0001F64F"  # Emoticons
            u"\U0001F300-\U0001F5FF"  # Symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # Transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # Flags
            u"\U00002700-\U000027BF"  # Dingbats
            u"\U0001F900-\U0001F9FF"  # Supplemental Symbols & Pictographs
            u"\U00002600-\U000026FF"  # Misc symbols
            "]+",
            flags=re.UNICODE
        )
        # Replace all emojis/special symbols with 'emoji'
        text = emoji_pattern.sub('emoji', text)

        # Replace remaining punctuation with a space so adjacent words are not glued together
        text = re.sub(r'[^\w\s]', ' ', text)  # keep letters, numbers, underscore, spaces

        # Remove extra spaces
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    return text

# Apply to comment column
df['comment'] = df['comment'].apply(replace_emojis)

--- Text 1 ---

Before: 
Este foi o pedido  Balde Com  numero  Peças - Blocos De Montar  numero  un -  dinheiro  cada ( negação  FOI ENTREGUE)  Vendido e entregue targaryen  Tapete de Eva Nº Letras  numero  Peças Crianças  numero  un -  dinheiro  (ESTE FOI ENTREG

After: 
Este foi o pedido  Balde Com  numero  Peças   Blocos De Montar  numero  un    dinheiro  cada   negação  FOI ENTREGUE   Vendido e entregue targaryen  Tapete de Eva Nº Letras  numero  Peças Crianças  numero  un    dinheiro   ESTE FOI ENTREG

--- Text 2 ---

Before: 
Cada vez que compro mais fico satisfeita parabéns pela honestidade com seus clientes 👏👏👏👏?

After: 
Cada vez que compro mais fico satisfeita parab√©ns pela honestidade com seus clientes      

--- Text 3 ---

Before: 
Comprei o produto, paguei no boleto e só recebi metade do produto, anunciaram uma coisa é mandaram outra. Muito insatisfeita 😡😡😡

After: 
Comprei o produto  paguei no boleto e só recebi metade do produto  anunciaram uma coisa é mandaram outra  Muito insatisfeita  

<font color="dimgrey" size=+1.5><b>Additional Whitespaces</b></font>

After all the steps taken so far, it's important to clean our text by eliminating unnecessary whitespace. Let's apply a RegEx for this and see what we've got.

In [40]:
import re

def clean_whitespaces(text):
    if isinstance(text, str):
        # Replace multiple spaces/tabs/newlines with a single space
        text = re.sub(r'\s+', ' ', text)
        # Remove leading and trailing spaces
        return text.strip()
    return text

# Apply directly to comment column
df['comment'] = df['comment'].apply(clean_whitespaces)

--- Text 1 ---

Before: 
Mas um pouco  travando   pelo valor ta Boa   

After: 
Mas um pouco travando pelo valor ta Boa

--- Text 2 ---

Before: 
Vendedor confiável  produto ok e entrega antes do prazo 

After: 
Vendedor confiável produto ok e entrega antes do prazo

--- Text 3 ---

Before: 
meu produto chegou e ja tenho que devolver  pois está com defeito    negação  segurar carga

After: 
meu produto chegou e ja tenho que devolver pois está com defeito negação segurar carga
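
Since every step rewrites the same column, the order of application matters: dates and money amounts have to be replaced before bare numbers, otherwise `22/01/2018` has already become `numero/numero/numero` by the time the date pattern runs. Below is a minimal sketch of chaining the steps in a single function. The name `clean_text`, the simplified patterns and the replacement tokens are assumptions for illustration, not the exact functions defined above.

```python
import re

# (pattern, replacement) pairs, applied top to bottom; simplified versions
STEPS = [
    (re.compile(r"[\r\n]+"), " "),                                   # line breaks
    (re.compile(r"http\S+|www\S+"), " link "),                       # URLs
    (re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b"), " data "),  # dates
    (re.compile(r"R?\$ ?\d+(?:[.,]\d+)*"), " dinheiro "),            # money
    (re.compile(r"\b\d+\b"), " numero "),                            # numbers
    (re.compile(r"[^\w\s]"), " "),                                   # punctuation
    (re.compile(r"\s+"), " "),                                       # whitespace
]

def clean_text(text):
    for pattern, replacement in STEPS:
        text = pattern.sub(replacement, text)
    return text.strip()

sample = "Comprei dia 06/01/18 por R$ 26,70.\r\nChegaram 2 caixas!"
cleaned = clean_text(sample)
print(cleaned)

# Running the number step first destroys the date pattern
wrong_order = re.sub(r"\b\d+\b", "numero", "entrega dia 22/01/2018")
print(wrong_order)
```

Keeping the steps in an ordered list makes the dependency between them explicit and lets the whole chain be applied with a single `df['comment'].apply(clean_text)`.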


<font color="dimgrey" size=+2.0><b>Stopwords</b></font>

Well, by now we have a text dataset in which every pattern we treated with RegEx has been handled and punctuation has been removed. In other words, we have a half-clean text with a rich set of transformations applied.

So, we are ready to apply some more advanced text transformations: `stopwords` removal, `stemming` and the `TF-IDF` matrix. Let's start with Portuguese stopwords.

Step 1 : Import Stopwords

In [42]:
from nltk.corpus import stopwords
import nltk

# Download Portuguese stopwords if not already done
nltk.download('stopwords')

# Load Portuguese stopwords
pt_stopwords = set(stopwords.words('portuguese'))
print(f'Total Portuguese stopwords: {len(pt_stopwords)}')

Total Portuguese stopwords: 207


[nltk_data] Downloading package stopwords to /home/ayush-
[nltk_data]     wase/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Step 2 : Write a function

In [43]:
# a function to remove stopwords
def remove_stopwords(text):
    if isinstance(text, str):
        # Split text into words
        words = text.split()
        # Keep only words not in stopwords
        words = [word for word in words if word.lower() not in pt_stopwords]
        # Rejoin into cleaned text
        return ' '.join(words)
    return text

# now apply to our dataframe
df['comment'] = df['comment'].apply(remove_stopwords)

--- Text 1 ---

Before: 
Recebi bem antes do prazo estipulado

After: 
recebi bem antes prazo estipulado

--- Text 2 ---

Before: 
Este foi o pedido Balde Com numero Peças Blocos De Montar numero un dinheiro cada negação FOI ENTREGUE Vendido e entregue targaryen Tapete de Eva Nº Letras numero Peças Crianças numero un dinheiro ESTE FOI ENTREG

After: 
pedido balde numero peças blocos montar numero un dinheiro cada negação entregue vendido entregue targaryen tapete eva nº letras numero peças crianças numero un dinheiro entreg

--- Text 3 ---

Before: 
O produto negação é bom

After: 
produto negação bom

<font color="dimgrey" size=+2.0><b>Stemming</b></font>

Let's define a function to apply the stemming process on the comments. We will also give examples of the results.

Step 1 : Import and initialize the stemmer

In [45]:
from nltk.stem import RSLPStemmer
import nltk

# Download required NLTK data (if not done yet)
nltk.download('rslp')

# Initialize Portuguese stemmer
stemmer = RSLPStemmer()

[nltk_data] Downloading package rslp to /home/ayush-wase/nltk_data...
[nltk_data]   Unzipping stemmers/rslp.zip.


Step 2 : Define a function

In [47]:
def apply_stemming(text):
    if isinstance(text, str):
        # Split text into words
        words = text.split()
        # Apply stemmer to each word
        stemmed_words = [stemmer.stem(word) for word in words]
        # Rejoin back into a single string
        return ' '.join(stemmed_words)
    return text


# apply to our dataframe

df['comment'] = df['comment'].apply(apply_stemming)

--- Text 1 ---

Before: 
recebi bem antes prazo estipulado

After: 
receb bem ant praz estipul

--- Text 2 ---

Before: 
pedido balde numero peças blocos montar numero un dinheiro cada negação entregue vendido entregue targaryen tapete eva nº letras numero peças crianças numero un dinheiro entreg

After: 
ped bald numer peç bloc mont numer un dinh cad neg entreg vend entreg targaryen tapet eva nº letr numer peç crianç numer un dinh entreg

--- Text 3 ---

Before: 
produto chegou ja devolver pois defeito negação segurar carga

After: 
produt cheg ja devolv poi defeit neg segur carg

<font color="dimgrey" size=+1.5><b>TF-IDF</b></font>

With the _Bag of Words_ approach, each word has the same weight, which may not always be appropriate, especially for words with a very low frequency in the corpus. The _TF-IDF (Term Frequency and Inverse Document Frequency)_ approach addresses this and is available in the scikit-learn library. In its textbook form it follows the formulas below; note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each document vector, so its values differ slightly from these definitions:

$$TF=\frac{\text{Frequency of a word in the document}}{\text{Total words in the document}}$$

$$IDF = \log\left({\frac{\text{Total number of docs}}{\text{Number of docs containing the word}}}\right)$$
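
As a worked example of the two formulas, the sketch below computes TF-IDF by hand on a three-document toy corpus (made up for illustration, using stems similar to the ones produced above). It follows the definitions exactly as written here, so the numbers are not meant to reproduce scikit-learn's output, which applies its own smoothing and normalization by default.

```python
import math

# Toy corpus of already-cleaned, stemmed comments (illustrative only)
docs = [
    "produt bom entreg rapid",
    "produt neg cheg",
    "entreg ant praz",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Frequency of the term in the document / total words in the document
    return tokens.count(term) / len(tokens)

def idf(term):
    # log(total number of docs / number of docs containing the term)
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

# 'produt' appears in 2 of 3 docs, 'bom' in only 1, so 'bom' weighs more
w_produt = tfidf("produt", tokenized[0])
w_bom = tfidf("bom", tokenized[0])
print(round(w_produt, 4), round(w_bom, 4))
```

Both terms occur once in the first document, so their TF is identical; the difference in weight comes entirely from the IDF, which rewards the rarer term.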

Step 1 : Import TF-IDF vectorizer

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

Step 2 : Initialize the vectorizer

In [55]:
tfidf_vectorizer = TfidfVectorizer(
    max_features=1000,  # Keep top 1000 words (to limit dimensionality)
    lowercase=True,     # Ensure all text is lowercase
    stop_words=None     # Already removed manually
)

Step 3 : Fit and transform the stemmed corpus

In [57]:
# Use the cleaned and stemmed comments directly
corpus = df['comment'].tolist()  # convert column to list

# Fit TF-IDF and transform
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

In [58]:
import pandas as pd

# Feature names (words)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame for readability
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)

# Inspect the first few rows
df_tfidf.head()

Unnamed: 0,220v,abaix,abert,abr,abraç,absurd,acab,aceit,acess,ach,...,vão,whey,zer,ágil,águ,ótim,últ,únic,útel,útil
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
