
### **In this study, I apply data cleaning and feature extraction methods to the Amazon Reviews dataset**



---



In [34]:
import pandas as pd
import numpy as np

In [35]:
df = pd.read_csv('/Users/rohitmeena/Desktop/Python/Reviews.csv')
df.head(5)

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [37]:
df.shape

(568454, 10)

In [4]:
df.describe().round(1)

| | Id | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 284227.5 | 1.7 | 2.2 | 4.2 | 1296257000.0 |
| std | 164098.7 | 7.6 | 8.3 | 1.3 | 48043310.0 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 142114.2 | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 284227.5 | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 426340.8 | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 568454.0 | 866.0 | 923.0 | 5.0 | 1351210000.0 |


In [5]:
# Count the missing values per column by chaining .sum() onto .isna(),
# then express each count as a percentage of all rows
null_values = df.isna().sum()
null_values = pd.DataFrame(null_values, columns=['null'])
sum_tot = len(df)
null_values['percent'] = null_values['null'] / sum_tot * 100
round(null_values, 3).sort_values('percent', ascending=False)

| | null | percent |
|---|---|---|
| Summary | 27 | 0.005 |
| ProfileName | 16 | 0.003 |
| Id | 0 | 0.0 |
| ProductId | 0 | 0.0 |
| UserId | 0 | 0.0 |
| HelpfulnessNumerator | 0 | 0.0 |
| HelpfulnessDenominator | 0 | 0.0 |
| Score | 0 | 0.0 |
| Time | 0 | 0.0 |
| Text | 0 | 0.0 |


Only a small number of values are missing (27 in Summary and 16 in ProfileName), so we can drop those rows entirely.

In [6]:
df= df.dropna()
df.shape

(568411, 10)

# Basic Feature Extraction - 1

Initially, I tried to clean the data first. I then realized that cleaning removes characters that are useful for feature extraction, such as punctuation and uppercase letters. Feature extraction is therefore split into two parts. Here, I extract the features that can't be extracted after data cleaning.
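To see why the order matters, here is a minimal sketch on a made-up review string. The sample text, the simplified cleaning step, and the helper names are illustrative assumptions, not part of the dataset or the pipeline below.

```python
import string

def count_punct(text):
    """Count characters that belong to string.punctuation."""
    return sum(ch in string.punctuation for ch in text)

def count_upper(text):
    """Count whitespace-separated tokens written entirely in uppercase."""
    return sum(tok.isupper() for tok in text.split())

raw = "WOW!!! Best taffy ever... #yum"  # made-up example review

# Simplified cleaning: lowercase, then drop punctuation characters
cleaned = "".join(ch for ch in raw.lower() if ch not in string.punctuation)

before = {"punctuation": count_punct(raw), "upper": count_upper(raw)}
after = {"punctuation": count_punct(cleaned), "upper": count_upper(cleaned)}
print(before)
print(after)
```

Once the text is cleaned, both counts drop to zero, so these features have to be computed on the raw text.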

### 1) Number of stopwords

In [7]:
!pip install -q wordcloud
import wordcloud
from nltk.corpus import stopwords
import nltk
import string
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
stop = stopwords.words('english')



In [8]:
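# Count the tokens that appear in the stopword list (the comparison is case-sensitive, so 'The' is not counted)
df['stopwords'] = df['Text'].apply(lambda text: len([word for word in text.split() if word in stop]))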
df[['Text','stopwords']].head()

| | Text | stopwords |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 21 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 12 |
| 2 | This is a confection that has been around a fe... | 42 |
| 3 | If you are looking for the secret ingredient i... | 15 |
| 4 | Great taffy at a great price. There was a wid... | 12 |
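The NLTK English stopword list is all lowercase, so a case-sensitive membership test misses capitalized stopwords such as "I" or "The". The sketch below shows the difference, assuming a tiny hypothetical stopword set (`stop_set`) in place of `stopwords.words('english')` and a made-up sentence.

```python
# Tiny stand-in for stopwords.words('english'); a set gives fast lookups
stop_set = {"i", "the", "is", "a", "of"}

def count_stopwords(text, stop_words):
    """Count tokens whose lowercase form is in stop_words."""
    return sum(word.lower() in stop_words for word in text.split())

sample = "I think The taffy is great"  # made-up example

case_sensitive = sum(word in stop_set for word in sample.split())
case_insensitive = count_stopwords(sample, stop_set)
print(case_sensitive, case_insensitive)
```

Whether the case-insensitive count is preferable depends on the analysis; the numbers in the table above come from the case-sensitive version.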


### 2) Number of punctuation characters

In [9]:
def count_punct(text):
    """Count the characters of text that are in string.punctuation."""
    return sum(1 for char in text if char in string.punctuation)

df['punctuation'] = df['Text'].apply(count_punct)

In [10]:
df[['Text','punctuation']].head()

| | Text | punctuation |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 3 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 7 |
| 2 | This is a confection that has been around a fe... | 18 |
| 3 | If you are looking for the secret ingredient i... | 5 |
| 4 | Great taffy at a great price. There was a wid... | 5 |


### 3) Number of hashtags

Another feature we can extract from a review is the number of hashtags it contains, that is, tokens that start with `#`. This can capture extra information from the text.

In [11]:
df['hastags'] = df['Text'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
df[['Text','hastags']].head()

| | Text | hastags |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 0 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 2 | This is a confection that has been around a fe... | 0 |
| 3 | If you are looking for the secret ingredient i... | 0 |
| 4 | Great taffy at a great price. There was a wid... | 0 |


In [12]:
df.hastags.loc[df.hastags != 0].count()

1414
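The same token-prefix idea extends to mentions. This is a minimal sketch with a hypothetical helper, `count_prefixed`, and a made-up string; it is not applied to the dataset in this notebook.

```python
def count_prefixed(text, prefix):
    """Count whitespace-separated tokens that start with prefix."""
    return sum(token.startswith(prefix) for token in text.split())

sample = "Loved it! #yum #glutenfree thanks @seller"  # made-up example

hashtags = count_prefixed(sample, "#")
mentions = count_prefixed(sample, "@")
print(hashtags, mentions)
```

On a DataFrame, it could be used as `df['Text'].apply(count_prefixed, prefix='@')`.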

### 4) Number of numerics
Counting the purely numeric tokens in each review can also be useful. At worst, the feature turns out to be uninformative.

In [13]:
df['numerics'] = df['Text'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df[['Text','numerics']].head()

| | Text | numerics |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 0 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 2 | This is a confection that has been around a fe... | 0 |
| 3 | If you are looking for the secret ingredient i... | 0 |
| 4 | Great taffy at a great price. There was a wid... | 0 |
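`str.isdigit()` is only true for tokens made up entirely of digits, so values such as `$10` or `3.5` are not counted. A possible alternative, sketched here on a made-up string with an assumed pattern named `NUMBER`, is to search for numbers with a regular expression.

```python
import re

# Integers and simple decimals; an illustrative pattern, not an exhaustive one
NUMBER = re.compile(r"\d+(?:\.\d+)?")

sample = "Paid $10 for 2 bags, about 3.5 oz each"  # made-up example

digit_tokens = sum(tok.isdigit() for tok in sample.split())
all_numbers = NUMBER.findall(sample)
print(digit_tokens, all_numbers)
```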


### 5) Number of uppercase words
Anger or strong emotion is often expressed by writing words in UPPERCASE, so it is worth counting those words. Note that `isupper()` is also true for single-letter words such as "I".

In [14]:
df['upper'] = df['Text'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
df[['Text','upper']].head()

| | Text | upper |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 1 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 2 | This is a confection that has been around a fe... | 2 |
| 3 | If you are looking for the secret ingredient i... | 4 |
| 4 | Great taffy at a great price. There was a wid... | 0 |
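Because the pronoun "I" is itself uppercase, the raw count mixes ordinary words with emphasized ones. A minimal sketch of one way to separate them, assuming a hypothetical helper `count_shouted` with a minimum word length, on a made-up sentence:

```python
def count_shouted(text, min_len=2):
    """Count uppercase tokens that are at least min_len characters long."""
    return sum(w.isupper() and len(w) >= min_len for w in text.split())

sample = "I will NEVER buy this AGAIN"  # made-up example

all_upper = sum(w.isupper() for w in sample.split())
shouted = count_shouted(sample)
print(all_upper, shouted)
```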




---



# **Text cleaning techniques**

### Make all text lower case

The first pre-processing step is to transform our reviews into lower case. This avoids treating differently cased copies of the same word as distinct. For example, when calculating word counts, ‘Analytics’ and ‘analytics’ would otherwise be counted as different words.

In [15]:
df['Text'] = df['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['Text'].head()

0    i have bought several of the vitality canned d...
1    product arrived labeled as jumbo salted peanut...
2    this is a confection that has been around a fe...
3    if you are looking for the secret ingredient i...
4    great taffy at a great price. there was a wide...
Name: Text, dtype: object

### Removing Punctuation

In [16]:
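# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)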
df['Text'].head()


0    i have bought several of the vitality canned d...
1    product arrived labeled as jumbo salted peanut...
2    this is a confection that has been around a fe...
3    if you are looking for the secret ingredient i...
4    great taffy at a great price there was a wide ...
Name: Text, dtype: object
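The effect of the `regex` argument can be checked on a small made-up Series. The sketch below passes `regex` explicitly in both calls, so it does not depend on the pandas default.

```python
import pandas as pd

s = pd.Series(["great price!!! (really)", "don't buy"])  # made-up examples

# Pattern interpreted as a regular expression: punctuation is removed
as_regex = s.str.replace(r"[^\w\s]", "", regex=True)

# Pattern interpreted as a literal string: nothing matches, text is unchanged
as_literal = s.str.replace(r"[^\w\s]", "", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Note that removing punctuation turns "don't" into "dont", which no longer matches the stopword "don't" in a later stopword-removal step.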

### Removal of Stop Words

In [17]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
df['Text'] = df['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['Text'].sample(10)

563342    got free bar voxbox influenster wow soo yummy ...
5689      product delicious say upfront thou semi messy ...
479863    really like product use breads cook run rather...
282706    cant find wheatena local shops buy great produ...
111934    feel really given product fair chance cant get...
469292    dog use get actually still get cronic ear infe...
158912    taste good edible giving 1 star let people kno...
183915    first stuff gives energy without cracked feeli...
84601     tinkyada brand pasta wonderful holds like norm...
403488    wonderful aroma plump moist beans great buy ni...
Name: Text, dtype: object

### Removing URLs

In [18]:
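import re

def remove_url(text):
    """Remove http(s):// and www. URLs from a string."""
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

In [19]:
# remove all urls from df
# Note: punctuation was already stripped above, so '://' and 'www.' no longer
# occur in the text and this pattern has nothing left to match at this point
df['Text'] = df['Text'].apply(remove_url)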

### Remove html tags

In [20]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

In [21]:
# remove all html tags from df
df['Text'] = df['Text'].apply(lambda x: remove_html(x))
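The order of these steps matters: the URL and HTML patterns rely on characters such as `:`, `/`, `<` and `>`, which punctuation removal deletes. The sketch below compares both orders on a made-up string. As an assumption that differs from `remove_html` above, tags are replaced with a space rather than an empty string so that neighbouring words stay separated.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
TAG = re.compile(r"<.*?>")
PUNCT = re.compile(r"[^\w\s]")

raw = "Great taffy<br />see http://example.com/taffy now"  # made-up example

# Markup first, punctuation last
markup_first = PUNCT.sub("", TAG.sub(" ", URL.sub("", raw)))

# Punctuation first: the URL and tag patterns no longer find anything
punct_first = TAG.sub(" ", URL.sub("", PUNCT.sub("", raw)))

print(markup_first.split())
print(punct_first.split())
```

In the second order, the remains of `<br />` survive as the token fragment `br`, which is consistent with `br` showing up among the frequent words later in this notebook.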

### Removing Emojis
Emojis can indicate emotions related to customer satisfaction. Unfortunately, we need to remove them for this text analysis.

In [22]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags 
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [23]:
#Example
remove_emoji("Omg another Earthquake 😔😔")

'Omg another Earthquake '

In [24]:
# remove all emojis from df
df['Text'] = df['Text'].apply(lambda x: remove_emoji(x))

### Remove Emoticons

In the previous steps, we removed emojis. Now we remove emoticons.

***What is the difference between emoji and emoticons?***

*   :-) is an emoticon
*   😜 is an emoji

In [25]:
!pip install emot



In [31]:
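from emot.emo_unicode import EMOTICONS_EMO

# Function for removing emoticons
def remove_emoticons(text):
    # Escape each emoticon so characters such as ')' are matched literally,
    # and try longer emoticons first so shorter ones do not cut them short
    emoticons = sorted(EMOTICONS_EMO, key=len, reverse=True)
    emoticon_pattern = re.compile(u'(' + u'|'.join(re.escape(k) for k in emoticons) + u')')
    return emoticon_pattern.sub(r'', text)

In [32]:
#Example
remove_emoticons("Hello :-)")

Without `re.escape`, this cell failed with `error: unbalanced parenthesis at position 7`, because emoticons such as `:-)` contain regex metacharacters.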

In [None]:
df['Text'] = df['Text'].apply(lambda x: remove_emoticons(x))
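Emoticons such as `:-)` contain characters that have a special meaning in regular expressions, so each one needs to be escaped before they are joined into a single pattern. A minimal sketch, assuming a hypothetical three-entry mapping (`SAMPLE_EMOTICONS`) in place of the `emot` dictionary:

```python
import re

# Hypothetical stand-in for the emot emoticon mapping
SAMPLE_EMOTICONS = {":-)": "happy", ":-))": "very happy", ":(": "sad"}

# Longest first, so ":-))" is matched before its prefix ":-)"
keys = sorted(SAMPLE_EMOTICONS, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def remove_sample_emoticons(text):
    """Delete every emoticon listed in SAMPLE_EMOTICONS from text."""
    return pattern.sub("", text)

print(repr(remove_sample_emoticons("Hello :-)")))
print(repr(remove_sample_emoticons("Yay :-)) ok")))
```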

### Spell Correction

We’ve all seen tweets with a plethora of spelling mistakes, and hastily written reviews are no different: some are barely legible.

In that regard, spelling correction is a useful pre-processing step because this also will help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.

To achieve this we will use the textblob library. 

In [None]:
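from textblob import TextBlob
# Preview the correction on the first five reviews only; the result is displayed, not written back to df
df['Text'][:5].apply(lambda x: str(TextBlob(x).correct()))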

In [None]:
# We could combine some of the cleaning steps into a single operation like this:

# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [None]:
df['Text'] = df.Text.apply(round1)
df.Text
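As a quick check of what the first round does to a single string, here is a self-contained sketch that repeats the same four substitutions on a made-up review.

```python
import re
import string

def clean_text_round1(text):
    """Lowercase, drop [bracketed] text, punctuation, and words containing digits."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return text

sample = "Bought 2 bags [Update: 5 stars] of K-Cups, 12oz each!"  # made-up example
cleaned = clean_text_round1(sample)
print(repr(cleaned))
print(cleaned.split())
```

The removed tokens leave their surrounding spaces behind, so the cleaned string can contain runs of two or more spaces.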

In [None]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2

In [None]:
df['Text'] = df.Text.apply(round2)
df.Text

Let's check whether the most frequent words make sense.

In [None]:
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:20]
freq

# Basic Feature Extraction - 2

### Number of Words

In [None]:
# split() with no argument ignores the repeated spaces left behind by the cleaning steps
df['word_count'] = df['Text'].apply(lambda x: len(str(x).split()))
df[['Text','word_count']].head()
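Splitting on a single space and splitting on any whitespace give different counts when the text contains repeated or trailing spaces. A small sketch on a made-up string:

```python
cleaned = "great  taffy "  # double space and trailing space, as cleaning can leave behind

by_space = cleaned.split(" ")   # keeps empty strings between adjacent spaces
by_whitespace = cleaned.split() # collapses runs of whitespace

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```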

Again, let's check the data and number of null values

In [None]:
null_values=df.isna().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot*100
round(null_values,3).sort_values('percent',ascending=False)

### Number of characters

In [None]:
df['char_count'] = df['Text'].str.len() ## this also includes spaces
df[['Text','char_count']].head()

### Average Word Length

In [None]:
def avg_word(sentence):
    """Return the mean word length, or 0.0 when the text has no words."""
    words = sentence.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

In [None]:
df['avg_word'] = df['Text'].apply(avg_word).round(1)
df[['Text','avg_word']].head()

In [None]:
df.sample(2)

**Let's convert the 'Time' column from Unix seconds to a datetime format**

In [None]:
df['Time'] = pd.to_datetime(df['Time'],unit='s')
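The conversion can be checked on the first two `Time` values shown in the `df.head(5)` output near the top of this notebook. The sketch below is standalone, and the variable names are illustrative.

```python
import pandas as pd

# First two Time values from the head of the dataset (Unix seconds)
times = pd.Series([1303862400, 1346976000])

parsed = pd.to_datetime(times, unit="s")
dates = parsed.dt.strftime("%Y-%m-%d").tolist()
years = parsed.dt.year.tolist()

print(dates)
print(years)
```

Once the column is a datetime, parts such as the year or month are available through the `.dt` accessor and can serve as additional features.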

In [None]:
df.sample(5)

Lastly, we don't need the 'ProfileName' feature because we already have 'UserId', so we can drop it.

In [None]:
df = df.drop('ProfileName', axis=1)

In [None]:
list(df)

In [None]:
df.sample(5)



---
## **Now, let's apply the round 1 and round 2 cleaning steps to the 'Summary' column**


Keep in mind that round1 operations make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.

In [None]:
df['Summary'] = df.Summary.apply(round1)
df.Summary

And, round2 operations get rid of some additional punctuation and non-sensical text that was missed the first time around.

In [None]:
df['Summary'] = df.Summary.apply(round2)
df.Summary

Let's check whether the most frequent words make sense. We can add our own stopwords based on the result.

In [None]:
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:50]
freq

# Adding our own stopwords

In [None]:
# Adding common words from our document to stop_words

add_words = ["br", "also", "im", "ive"]

stop_words = set(stopwords.words("english"))
stop_added = stop_words.union(add_words)

In [None]:
df['Text'] = df['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_added))
df['Text'].sample(10)

In [None]:
df1 = df  # note: df1 is a second name for the same DataFrame, not a copy

In [None]:
mask = df1.Text.str.endswith('br') 
df1.loc[mask, 'Text'] = df1.loc[mask, 'Text'].str[:-2]

In [None]:
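# rstrip('tty') would strip every trailing 't' or 'y' character, not the literal suffix 'tty';
# a regex anchored at the end of the string removes only that suffix
df1['Text'] = df1['Text'].str.replace(r'tty$', '', regex=True)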

In [None]:
df1['Text'].apply(lambda x: x[:-2] if x.endswith('tty') else x)
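`str.rstrip` treats its argument as a set of characters to strip, not as a suffix, which is easy to overlook. The sketch below contrasts it with a hypothetical suffix-aware helper, `strip_suffix`, on two made-up strings.

```python
def strip_suffix(text, suffix):
    """Remove suffix from the end of text only if text really ends with it."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text

# rstrip removes any trailing run of the characters 't' and 'y'
print("great tasty".rstrip("tty"))
print(strip_suffix("great tasty", "tty"))

# Both agree when the text really ends with the suffix
print("so pretty".rstrip("tty"))
print(strip_suffix("so pretty", "tty"))
```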

In [None]:
df1.loc[df1.Text.str.endswith('br'), 'Text']

In [None]:
df1.loc[df1.punctuation >= 1000].Text.tolist()

In [None]:
df.loc[df.punctuation >= 1000].Text.tolist()

In [None]:
freq = pd.Series(' '.join(df['Text']).split()).value_counts()[:50]
freq

**Now, let's save this cleaned, processed data as a CSV file**

In [None]:
df.to_csv('Amazon_reviews_processed.csv', index=False)
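Two things are worth keeping in mind when the file is read back: datetime columns come back as plain text unless they are parsed again, and text that became empty during cleaning is read back as a missing value. A minimal sketch on a two-row toy frame, using an in-memory buffer instead of a file on disk (the frame and variable names are assumptions):

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Time": pd.to_datetime([1303862400, 1346976000], unit="s"),
    "Text": ["great taffy", ""],  # second review became empty after cleaning
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                        # Time is read as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Time"]) # Time is parsed as datetime

print(plain.dtypes)
print(parsed.dtypes)
print(plain["Text"].isna().tolist())
```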