In [0]:
import pandas as pd
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

In [0]:
reviews = pd.read_csv('https://raw.githubusercontent.com/lucasmoratof/customers_review_project/master/reviews_for_nlp.csv', usecols=['review_comment_message', 'is_good_review'])

In [0]:
reviews.head()

Unnamed: 0,review_comment_message,is_good_review
0,Recebi bem antes do prazo estipulado.,1
1,Parabéns lojas lannister adorei comprar pela I...,1
2,aparelho eficiente. no site a marca do aparelh...,1
3,"Mas um pouco ,travando...pelo valor ta Boa.\n",1
4,"Vendedor confiável, produto ok e entrega antes...",1


I will try some techniques to count the number of characters and words in each review.

In [0]:
# count the length (in characters) of each review
reviews['char_count'] = reviews['review_comment_message'].apply(len)

In [0]:
reviews['char_count'].head()

0     37
1    100
2    174
3     44
4     56
Name: char_count, dtype: int64

In [0]:
# average characters in the reviews
reviews['char_count'].mean() 

69.82382526225032

In [0]:
# create a function to count the number of words in each comment
def count_words(string):
  words = string.split()
  return len(words)

# applying the function to create a new feature
reviews['word_count'] = reviews['review_comment_message'].apply(count_words)

# finding the average number of words in the reviews
print(reviews['word_count'].mean()) 

11.901374718589835


Some text preprocessing techniques:

- Converting words to lowercase
- Removing leading and trailing whitespace
- Removing punctuation
- Removing stopwords
- Expanding contractions
- Removing special characters (numbers, emojis, etc.)
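
As a rough illustration of several of the steps listed above, here is a minimal sketch using only the standard library. The function name `basic_clean` and the tiny `STOPWORDS` set are hypothetical and exist only for this example; the notebook itself relies on spaCy's Portuguese stopword list. Expanding contractions is not covered here.

```python
import string

# Hypothetical, tiny stopword set for illustration only
STOPWORDS = {"o", "a", "do", "da", "de", "e", "no", "na"}

def basic_clean(text, stopwords=STOPWORDS):
    """Lowercase, strip, remove punctuation and non-alphabetic tokens, drop stopwords."""
    text = text.lower().strip()
    # Replace punctuation with spaces so "travando...pelo" does not fuse into one word
    table = str.maketrans({ch: " " for ch in string.punctuation})
    text = text.translate(table)
    # Keep only purely alphabetic tokens that are not stopwords
    words = [w for w in text.split() if w.isalpha() and w not in stopwords]
    return " ".join(words)

print(basic_clean("  Recebi bem antes do prazo estipulado.  "))
# prints: recebi bem antes prazo estipulado
```

Replacing punctuation with a space, instead of deleting it, is a design choice: it keeps words apart when a review has no space after a period or comma.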

**Tokenization** is the process of splitting a text into smaller units called tokens. Tokens are usually words, but punctuation marks and whole sentences can be tokens too.

**Lemmatization** is the process of converting a word into its base form, called the lemma.
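
To make the idea of tokenization concrete, here is a minimal sketch based on a regular expression. The `tokenize` helper is hypothetical and much simpler than spaCy's tokenizer, which applies language-specific rules (for example, it can split Portuguese contractions); it only separates runs of word characters from runs of punctuation.

```python
import re

def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s]+ matches a run of punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

print(tokenize("aparelho eficiente. no site"))
# prints: ['aparelho', 'eficiente', '.', 'no', 'site']
```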

In [0]:
# If you need to download the model (works on google colab)
import spacy.cli
spacy.cli.download("pt_core_news_sm")

✔ Download and installation successful
You can now load the model via spacy.load('pt_core_news_sm')


In [0]:
# Load the Portuguese model
import spacy
nlp = spacy.load("pt_core_news_sm")

In [0]:
doc= nlp(reviews['review_comment_message'][2])
# IMPORTANT, when you pass the strings through nlp(), it performs Lemmatization by default

In [0]:
tokens = [token.text for token in doc]
lemmas= [token.lemma_ for token in doc]

In [0]:
print(tokens, "\n", lemmas)

['aparelho', 'eficiente', '.', 'no', 'site', 'a', 'marca', 'do', 'aparelho', 'esta', 'impresso', 'como', '3desinfector', 'e', 'a', 'o', 'chegar', 'esta', 'com', 'outro', 'nome', '...', 'atualizar', 'com', 'a', 'marca', 'correta', 'uma', 'vez', 'que', 'é', 'o', 'mesmo', 'aparelho'] 
 ['aparelhar', 'eficiente', '.', 'o', 'site', 'o', 'marcar', 'do', 'aparelhar', 'este', 'impresso', 'comer', '3desinfector', 'e', 'o', 'o', 'chegar', 'este', 'com', 'outro', 'nome', '...', 'atualizar', 'com', 'o', 'marcar', 'corretar', 'umar', 'vez', 'que', 'ser', 'o', 'mesmo', 'aparelhar']


In [0]:
# Stopwords
stopwords = spacy.lang.pt.stop_words.STOP_WORDS

In [0]:
no_stops= [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
print(' '.join(no_stops))

aparelhar eficiente o site o marcar aparelhar impresso comer e o o chegar outro nome atualizar o marcar corretar umar o aparelhar


In [0]:
# Creating a function that combines tokenization and lemmatization

def preprocessing(text):
  doc= nlp(text) # creates the document
  lemmas= [token.lemma_ for token in doc] # extracts the lemmas
  # time to remove stopwords (remember that we are using the Portuguese version)
  clean_lemmas= [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]

  return ' '.join(clean_lemmas) 

Part of Speech - POS
POS tagging assigns a grammatical category to each word, such as noun, proper noun, or verb.

In [0]:
# load the model
nlp= spacy.load('pt_core_news_sm')

# create the doc
doc= nlp(reviews['review_comment_message'][2])

# generate tokens and pos tags
pos= [(token.text, token.pos_) for token in doc]
print(pos)


[('aparelho', 'NOUN'), ('eficiente', 'ADJ'), ('.', 'PUNCT'), ('no', 'ADP'), ('site', 'VERB'), ('a', 'DET'), ('marca', 'NOUN'), ('do', 'DET'), ('aparelho', 'NOUN'), ('esta', 'DET'), ('impresso', 'VERB'), ('como', 'ADP'), ('3desinfector', 'NUM'), ('e', 'PUNCT'), ('a', 'ADP'), ('o', 'DET'), ('chegar', 'VERB'), ('esta', 'DET'), ('com', 'ADP'), ('outro', 'DET'), ('nome', 'NOUN'), ('...', 'PUNCT'), ('atualizar', 'VERB'), ('com', 'ADP'), ('a', 'DET'), ('marca', 'NOUN'), ('correta', 'ADJ'), ('uma', 'DET'), ('vez', 'NOUN'), ('que', 'SCONJ'), ('é', 'VERB'), ('o', 'DET'), ('mesmo', 'DET'), ('aparelho', 'NOUN')]


Below I will create two functions to count the number of proper nouns and nouns. Then I will apply these functions to the data, separating good reviews from bad reviews. Finally, I will calculate the mean number of proper nouns (PROPN) and nouns (NOUN) in both groups and compare them.

In [0]:
# PROPN
def proper_nouns(text, model=nlp):
  # Create doc object
  doc= model(text)
  # Generate list of POS tags
  pos= [token.pos_ for token in doc]
  return pos.count('PROPN')

# NOUN
def nouns(text, model=nlp):
  doc= model(text)
  pos= [token.pos_ for token in doc]
  return pos.count('NOUN')

In [0]:
# Create two columns with the number of proper nouns and nouns
reviews['num_propn'] = reviews['review_comment_message'].apply(proper_nouns)
reviews['num_noun'] = reviews['review_comment_message'].apply(nouns)

In [0]:
# computing the mean of proper nouns
good_propn= reviews[reviews['is_good_review']== 1]['num_propn'].mean()
bad_propn= reviews[reviews['is_good_review']== 0]['num_propn'].mean()

# computing the mean of nouns
good_noun= reviews[reviews['is_good_review']== 1]['num_noun'].mean()
bad_noun= reviews[reviews['is_good_review']== 0]['num_noun'].mean()

# print results to compare
print("Mean number of proper nouns for good and bad reviews are %.2f and %.2f respectively"%(good_propn, bad_propn))
print("Mean number of nouns for good and bad reviews are %.2f and %.2f respectively"%(good_noun, bad_noun))

Mean number of proper nouns for good and bad reviews are 0.48 and 0.88 respectively
Mean number of nouns for good and bad reviews are 2.10 and 3.63 respectively


Named Entity Recognition

It classifies named entities into predefined categories, like person, organization, country, etc.

Uses:
- Efficient search algorithms
- Question answering
- News article classification
- Customer service

In [0]:
# Let's practice NER
nlp= spacy.load('pt_core_news_sm')
text= reviews['review_comment_message'][11]
doc= nlp(text)

# print all named entities:
for ent in doc.ents:
  print(ent.text, ent.label_)

Comprei PER


To find people's names, we can use the following function:

In [0]:
def find_persons(text, model=nlp):
  doc= model(text)
  persons= [ent.text for ent in doc.ents if ent.label_== 'PER']  # this Portuguese model labels people as 'PER', as in the output above
  return persons

Vectorization

The process of converting text into vectors, so it can be used in ML

Bag of Words is a model that performs vectorization by counting how often each vocabulary word occurs in a document. Text preprocessing matters here because it leads to a smaller vocabulary, and fewer dimensions help performance.

CountVectorizer, from scikit-learn, is the tool used to build the bag of words.
It takes some arguments that control text preprocessing.
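
To show what a bag of words looks like, here is a minimal sketch in plain Python. The `bag_of_words` helper and the two toy documents are made up for illustration. Unlike CountVectorizer, it splits on whitespace only and returns dense lists instead of a sparse matrix.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct word seen in any document
    vocab = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One position per vocabulary word; missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Toy documents, not taken from the dataset
docs = ["produto bom", "produto ruim produto atrasado"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['atrasado', 'bom', 'produto', 'ruim']
print(vectors)  # [[0, 1, 1, 0], [1, 0, 2, 1]]
```

Each document becomes a vector as long as the vocabulary, which is why a smaller vocabulary means fewer features.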

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Create CountVectorizer object, specifying the arguments to preprocess text
stop_words_port= list(spacy.lang.pt.stop_words.STOP_WORDS)  # converted to a list, since recent scikit-learn versions reject a set
vectorizer= CountVectorizer(stop_words=stop_words_port)

# Split into training and test sets
X_train, X_test, y_train, y_test= train_test_split(reviews['review_comment_message'], reviews['is_good_review'], test_size=0.25, random_state=24)

# Generate training Bow vectors
X_train_bow= vectorizer.fit_transform(X_train)

# Generate test Bow vector
X_test_bow= vectorizer.transform(X_test) 

print(X_train_bow.shape)
print(X_test_bow.shape)

(31315, 13603)
(10439, 13603)


We will try the Naive Bayes classifier on this problem.

In [0]:
# Import multinomialNB
from sklearn.naive_bayes import MultinomialNB

# create MultinomialNB object
clf= MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy= clf.score(X_test_bow, y_test)

print("The accuracy of the classifier is %.3f" % accuracy)

The accuracy of the classifier is 0.866


In [0]:
# Predict the sentiment of a negative review
review= "detestei o produto, nao gostei do vendedor, estou insatisfeito"

prediction= clf.predict(vectorizer.transform([review]))[0]

print("The sentiment predicted by the classifier is %i" % prediction)

The sentiment predicted by the classifier is 0


In the example above, the model correctly classified a bad review.

Techniques to give context to a review

n-grams

An n-gram is a contiguous sequence of n elements, or words, in a given document. A bag of words is an n-gram model where n=1.

Example: "I love you". If n=1, we have:
- "I"
- "love"
- "you"

If we change n to 2, we would have:
- "I love"
- "love you"

It helps the model understand the relationship between neighbouring words.
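
The "I love you" example can be reproduced with a short helper. This is a minimal sketch; the `ngrams` function is hypothetical and works on a list of words that has already been split.

```python
def ngrams(words, n):
    """Return all contiguous n-word sequences in a list of words."""
    # Slide a window of size n over the list; there are len(words) - n + 1 windows
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "I love you".split()
print(ngrams(words, 1))  # ['I', 'love', 'you']
print(ngrams(words, 2))  # ['I love', 'love you']
```

Note that in CountVectorizer, `ngram_range=(1, 2)` keeps both the unigrams and the bigrams, so the feature set is the union of the two lists above.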


In [0]:
# To limit the curse of dimensionality, avoid going beyond n=3
# We are going to compare how the number of features grows as the n-gram range widens

vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(X_train)

vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(X_train)

vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(X_train)

print("number of features by n-grams is:\n ng1= %i \n ng2= %i \n ng3= %i" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))

number of features by n-grams is:
 ng1= 13963 
 ng2= 114172 
 ng3= 295810


We can see that with n=1 we have about 14k features, while with n-grams up to n=3 the count grows to about 296k.

In [0]:
# We will try the same model again, now with n-grams up to n=3 (unigrams, bigrams and trigrams)

vectorizer_ng= CountVectorizer(stop_words=stop_words_port, ngram_range=(1,3))

X_train_bow_ng= vectorizer_ng.fit_transform(X_train)
X_test_bow_ng= vectorizer_ng.transform(X_test) 

clf.fit(X_train_bow_ng, y_train)

accuracy_ng= clf.score(X_test_bow_ng, y_test)

print("The accuracy of the classifier is %.3f" % accuracy_ng)

The accuracy of the classifier is 0.872


Term Frequency - Inverse Document Frequency -  **TF-IDF**

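The idea is that a word should get more weight the more often it occurs in a document and the fewer documents it appears in overall. Words that are frequent across all documents carry little information, so they are down-weighted.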
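
The weighting idea can be sketched with the textbook formula tf × log(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term. The `tf_idf` helper and the toy documents below are hypothetical. TfidfVectorizer uses a smoothed variant of the idf and L2-normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for words in tokenized for term in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)  # term frequency inside this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy documents, not taken from the dataset
docs = ["produto bom", "produto ruim", "entrega atrasada"]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In the first document, "bom" appears in one document out of three while "produto" appears in two, so "bom" receives the larger weight.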



In [0]:
# instead of using CountVectorizer(), we will use TfidfVectorizer() from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer= TfidfVectorizer()

tfidf_matrix= vectorizer.fit_transform(X_train)

print(tfidf_matrix.shape)

(31315, 13963)



Cosine similarity 

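It is the cosine of the angle between two vectors. For non-negative vectors such as TF-IDF rows, it ranges from 0 (no terms in common) to 1 (same direction).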
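
As a minimal sketch of the definition, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The `cosine_sim` helper below is hypothetical and works on plain Python lists; returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine_sim([1, 0], [0, 1]))  # orthogonal vectors: 0.0
print(cosine_sim([1, 2], [2, 4]))  # same direction: approximately 1
```

When both vectors already have length 1, the denominator is 1 and the similarity reduces to the dot product.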

In [0]:
from sklearn.metrics.pairwise import cosine_similarity
import time
# record time
start= time.time()

# Compute cosine similarity matrix
cosine_sim= cosine_similarity(tfidf_matrix, tfidf_matrix)

# print the cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.         0.20017999 ... 0.         0.         0.11678042]
 [0.         1.         0.         ... 0.         0.04735246 0.        ]
 [0.20017999 0.         1.         ... 0.         0.         0.58337711]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.04735246 0.         ... 0.         1.         0.        ]
 [0.11678042 0.         0.58337711 ... 0.         0.         1.        ]]
Time taken: 18.618732452392578 seconds


In [0]:
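# We can also use linear_kernel to calculate cosine similarity. TfidfVectorizer L2-normalizes each row by default,
# so the plain dot product gives the same result. It is often faster, although in this run the timings were about the same.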
from sklearn.metrics.pairwise import linear_kernel
import time
# record time
start= time.time()

# Compute cosine similarity matrix
cosine_sim= linear_kernel(tfidf_matrix, tfidf_matrix)

# print the cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.         0.20017999 ... 0.         0.         0.11678042]
 [0.         1.         0.         ... 0.         0.04735246 0.        ]
 [0.20017999 0.         1.         ... 0.         0.         0.58337711]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.04735246 0.         ... 0.         1.         0.        ]
 [0.11678042 0.         0.58337711 ... 0.         0.         1.        ]]
Time taken: 19.20054602622986 seconds


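Word embeddings
They represent words as dense vectors, which lets us measure similarity between words or sentences. Note that small spaCy models such as pt_core_news_sm do not ship with static word vectors, so the similarity scores below should be read as rough estimates.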

In [0]:
reviews['review_comment_message'].head()

0                Recebi bem antes do prazo estipulado.
1    Parabéns lojas lannister adorei comprar pela I...
2    aparelho eficiente. no site a marca do aparelh...
3        Mas um pouco ,travando...pelo valor ta Boa.\n
4    Vendedor confiável, produto ok e entrega antes...
Name: review_comment_message, dtype: object

In [0]:
# let's check how similar the reviews are

# first, create a Doc for each review
review_1_doc= nlp(reviews['review_comment_message'][1])
review_2_doc= nlp(reviews['review_comment_message'][2])
review_3_doc= nlp(reviews['review_comment_message'][3])

# Now, use the function similarity
print(review_1_doc.similarity(review_2_doc))
print(review_2_doc.similarity(review_3_doc))
print(review_3_doc.similarity(review_1_doc))

-0.06718104243836638
0.26803800927046
0.23318636637941698

In [0]:
# trying Multinomial Naive Bayes with Tfidf vectorization

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
import time

# Create TfidfVectorizer object, specifying the arguments to preprocess text
stop_words_port= list(spacy.lang.pt.stop_words.STOP_WORDS)  # converted to a list, since recent scikit-learn versions reject a set
vectorizer= TfidfVectorizer(stop_words=stop_words_port)

# Split into training and test sets
X_train, X_test, y_train, y_test= train_test_split(reviews['review_comment_message'], reviews['is_good_review'], test_size=0.25, random_state=24)

start= time.time()

# Generate training TF-IDF vectors
X_train_vec= vectorizer.fit_transform(X_train)

# Generate test TF-IDF vectors
X_test_vec= vectorizer.transform(X_test)

# create MultinomialNB object
clf= MultinomialNB()

# Train clf
clf.fit(X_train_vec, y_train)

# Compute accuracy on test set
accuracy= clf.score(X_test_vec, y_test)

print("The accuracy of the classifier is %.3f" % accuracy)
print("Time taken: %s seconds" %(time.time() - start))

The accuracy of the classifier is 0.866
Time taken: 0.5172257423400879 seconds


In [0]:
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, confusion_matrix
clf_y_pred = clf.predict(X_test_vec)
print(metrics.classification_report(y_test, clf_y_pred))

              precision    recall  f1-score   support

           0       0.84      0.77      0.81      3757
           1       0.88      0.92      0.90      6682

    accuracy                           0.87     10439
   macro avg       0.86      0.85      0.85     10439
weighted avg       0.87      0.87      0.86     10439

