# Table of Contents

## 1. Importing Libraries  
## 2. Adjusting Row Column Settings  
## 3. Text Preprocessing  
### 3.1 Normalizing Case Folding  
### 3.2 Punctuations  
### 3.3 Numbers  
### 3.4 Stopwords  
### 3.5 Rarewords  
### 3.6 Tokenization  
### 3.7 Lemmatization  

## 4. Text Visualization  
### 4.1 Calculation of Term Frequencies  
### 4.2 Barplot  
### 4.3 Word Cloud  
### 4.4 Word Cloud by Templates  

## 5. Sentiment Analysis  
## 6. Modelling  
### 6.1 Logistic Regression  
### 6.2 Random Forests


### 1 | Importing Libraries

In [23]:
from warnings import filterwarnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk
from PIL import Image
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_validate, train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
from textblob import Word, TextBlob
from wordcloud import WordCloud

### 2 | Adjusting Row Column Settings

In [24]:
filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

#### 3 | Text Preprocessing


In [25]:
df = pd.read_csv("duplicates_output\duplicates_Apple_iPad_Mini_3_MGYE2LL_A_NEWEST_VERSION__16GB__Wi-Fi__Gold___Certified_Refurbished_.csv")


In [26]:
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,Apple iPad Mini 3 MGYE2LL/A NEWEST VERSION (16...,,259.99,5,The IPad mini 3 was in perfect shape aND works...,5.0
1,Apple iPad Mini 3 MGYE2LL/A NEWEST VERSION (16...,,259.99,5,"Really good, new and cheap,",1.0
2,Apple iPad Mini 3 MGYE2LL/A NEWEST VERSION (16...,,259.99,5,"If U don't have an iPad, U don't have an iPad ...",10.0
3,Apple iPad Mini 3 MGYE2LL/A NEWEST VERSION (16...,,259.99,5,"I bought this for my sons 9th birthday, and he...",3.0
4,Apple iPad Mini 3 MGYE2LL/A NEWEST VERSION (16...,,259.99,5,Delivered as promised,1.0


 #### 3.1. | Normalizing Case Folding

 In the first step of our NLP project, we converted the comments in the 'Review' column to lowercase. This standardizes the data into a uniform format, eliminating inconsistencies that might arise from differences in case sensitivity during the text processing phase.

In [27]:
df['Reviews'] = df['Reviews'].str.lower()


In [28]:
df["Reviews"]


0     the ipad mini 3 was in perfect shape and works...
1                           really good, new and cheap,
2     if u don't have an ipad, u don't have an ipad ...
3     i bought this for my sons 9th birthday, and he...
4                                 delivered as promised
                            ...                        
94    the ipad arrived and i figured it would have s...
95    appeared to be as good as new - no scratches, ...
96    i haven't had the chance to utilize this ipad ...
97    this ipad is great!! my son earned this over t...
98                                            excellent
Name: Reviews, Length: 99, dtype: object

#### 3.2 | Punctuations

In this step, we removed punctuation marks from the comments in the 'Review' column. This process aims to enhance the efficiency of our language processing tasks by making our text data cleaner and more suitable for analysis.

In [29]:
df['Reviews'] = df['Reviews'].str.replace('[^\w\s]', '')


In [30]:
df['Reviews']

0     the ipad mini 3 was in perfect shape and works...
1                           really good, new and cheap,
2     if u don't have an ipad, u don't have an ipad ...
3     i bought this for my sons 9th birthday, and he...
4                                 delivered as promised
                            ...                        
94    the ipad arrived and i figured it would have s...
95    appeared to be as good as new - no scratches, ...
96    i haven't had the chance to utilize this ipad ...
97    this ipad is great!! my son earned this over t...
98                                            excellent
Name: Reviews, Length: 99, dtype: object

#### 3.3. | Numbers
In this step, we removed numerical characters from the comments in the 'Review' column. By eliminating numbers from the text, this process enables us to focus more on our language processing and text analysis tasks. Consequently, we can concentrate solely on the textual content, allowing for cleaner and more meaningful data analysis.

In [31]:
df['Reviews'] = df['Reviews'].str.replace('\d', '')
df['Reviews']

0     the ipad mini 3 was in perfect shape and works...
1                           really good, new and cheap,
2     if u don't have an ipad, u don't have an ipad ...
3     i bought this for my sons 9th birthday, and he...
4                                 delivered as promised
                            ...                        
94    the ipad arrived and i figured it would have s...
95    appeared to be as good as new - no scratches, ...
96    i haven't had the chance to utilize this ipad ...
97    this ipad is great!! my son earned this over t...
98                                            excellent
Name: Reviews, Length: 99, dtype: object

#### 3.4 | Stopwords
In this section, by removing frequently repeated and often meaningless words (such as 'the', 'is', 'in') from our texts, we can focus our analysis on more meaningful words. This allows us to better understand the essence of the comments and make our NLP processes more efficient.

In [32]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yassi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [33]:
sw = stopwords.words('english')


In [34]:
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x for x in str(x).split() if x not in sw))
df['Reviews']

0     ipad mini 3 perfect shape works flawlessly. gr...
1                               really good, new cheap,
2     u ipad, u ipad ;)i recommend getting 64gb... u...
3            bought sons 9th birthday, really likes it!
4                                    delivered promised
                            ...                        
94    ipad arrived figured would cosmetic issues din...
95    appeared good new - scratches, dents, etc. cab...
96    chance utilize ipad intended purpose yet; howe...
97    ipad great!! son earned summer working. arrive...
98                                            excellent
Name: Reviews, Length: 99, dtype: object

In [35]:
pd.Series(' '.join(df['Reviews']).split()).value_counts()


ipad        35
mini        19
great       17
love        15
like        14
            ..
yet;         1
purpose      1
intended     1
utilize      1
ipad!!       1
Name: count, Length: 649, dtype: int64