Background:
You work for a tech company that sells various products online. The company has a vast amount of customer reviews for its products and wants to prepare the text data for future sentiment analysis. The goal is to clean and preprocess the data so that it can be easily used for analysis or fed into a machine learning model in the future.

Problem Statement:
Perform comprehensive text data preprocessing on a dataset of customer reviews, including loading the data, cleaning, and applying techniques such as stemming and lemmatization.

Data:
The dataset consists of a large number of text reviews. The dataset is stored in a CSV file, with one column for the text reviews.

Tasks:
Data Loading: Load the dataset into your preferred programming environment (e.g., Python, R) and explore the structure of the data.

Data Cleaning: Perform data cleaning to handle any missing values, remove irrelevant characters, and address any other issues that may affect the quality of the data.

Text Preprocessing: Tokenize the text data and convert it to lowercase. Remove stop words and punctuation. Implement stemming and lemmatization to reduce words to their base or root form.

Exploratory Data Analysis: Explore the preprocessed data to identify common words, word frequencies, or other patterns that might be relevant for future analysis.

Deliverables:
Code implementing the data loading, cleaning, and text preprocessing steps.


In [2]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [5]:
file_path = 'C:/Users/Anes/Downloads/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv'
df = pd.read_csv(file_path)

# explore structure of the data 
print("Dataset structure: ")
print(df.head())

# data cleaning
# handling missing values
df = df.dropna(subset=['reviews.text'])
print(df)

df['cleaned_reviews']=df['reviews.text'].apply(lambda x: ' '.join(word.strip(string.punctuation) for word in x.lower().split()))

Dataset structure: 
                     id             dateAdded           dateUpdated  \
0  AVpgNzjwLJeJML43Kpxn  2015-10-30T08:59:32Z  2019-04-25T09:08:16Z   
1  AVpgNzjwLJeJML43Kpxn  2015-10-30T08:59:32Z  2019-04-25T09:08:16Z   
2  AVpgNzjwLJeJML43Kpxn  2015-10-30T08:59:32Z  2019-04-25T09:08:16Z   
3  AVpgNzjwLJeJML43Kpxn  2015-10-30T08:59:32Z  2019-04-25T09:08:16Z   
4  AVpgNzjwLJeJML43Kpxn  2015-10-30T08:59:32Z  2019-04-25T09:08:16Z   

                                                name                  asins  \
0  AmazonBasics AAA Performance Alkaline Batterie...  B00QWO9P0O,B00LH3DMUO   
1  AmazonBasics AAA Performance Alkaline Batterie...  B00QWO9P0O,B00LH3DMUO   
2  AmazonBasics AAA Performance Alkaline Batterie...  B00QWO9P0O,B00LH3DMUO   
3  AmazonBasics AAA Performance Alkaline Batterie...  B00QWO9P0O,B00LH3DMUO   
4  AmazonBasics AAA Performance Alkaline Batterie...  B00QWO9P0O,B00LH3DMUO   

          brand                                         categories  \
0  Amazo

In [6]:
print(df['cleaned_reviews'])

0        i order 3 of them and one of the item is bad q...
1        bulk is always the less expensive way to go fo...
2        well they are not duracell but for the price i...
3        seem to work as well as name brand batteries a...
4        these batteries are very long lasting the pric...
                               ...                        
28327    i got 2 of these for my 8 yr old twins my 11 y...
28328    i bought this for my niece for a christmas gif...
28329    very nice for light internet browsing keeping ...
28330    this tablet does absolutely everything i want ...
28331    at ninety dollars the expectionations are low ...
Name: cleaned_reviews, Length: 28332, dtype: object


In [7]:
df.head()

Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs,cleaned_reviews
0,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,3,https://www.amazon.com/product-reviews/B00QWO9...,I order 3 of them and one of the item is bad q...,... 3 of them and one of the item is bad quali...,Byger yang,"https://www.barcodable.com/upc/841710106442,ht...",i order 3 of them and one of the item is bad q...
1,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,4,https://www.amazon.com/product-reviews/B00QWO9...,Bulk is always the less expensive way to go fo...,... always the less expensive way to go for pr...,ByMG,"https://www.barcodable.com/upc/841710106442,ht...",bulk is always the less expensive way to go fo...
2,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Well they are not Duracell but for the price i...,... are not Duracell but for the price i am ha...,BySharon Lambert,"https://www.barcodable.com/upc/841710106442,ht...",well they are not duracell but for the price i...
3,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Seem to work as well as name brand batteries a...,... as well as name brand batteries at a much ...,Bymark sexson,"https://www.barcodable.com/upc/841710106442,ht...",seem to work as well as name brand batteries a...
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,These batteries are very long lasting the pric...,... batteries are very long lasting the price ...,Bylinda,"https://www.barcodable.com/upc/841710106442,ht...",these batteries are very long lasting the pric...


In [8]:
# text preprocessing
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [ps.stem(word) for word in tokens]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ''.join(tokens)

df['preprocessed_reviews']=df['cleaned_reviews'].apply(preprocess_text)
print(df['preprocessed_reviews'])

0        orderoneitembadqualitimissbackupspringputpcalu...
1                        bulkalwayleexpenswaygoproductlike
2                                    wellduracelpricehappi
3              seemworkwellnamebrandbatterimuchbetterprice
4                                batterilonglastpricegreat
                               ...                        
28327       gotyroldtwinyroldoneonebetterperfectwaygetread
28328                        boughtniecchristmayearoldlove
28329    nicelightinternetbrowkeeptopemailviewvideoread...
28330    tabletabsoluteverythwantwatchtvshowmovicheckma...
28331    ninetidollarexpectionlowstillgoodtablgoodlight...
Name: preprocessed_reviews, Length: 28332, dtype: object


In [9]:
# exploratory data analysis
# display common words and their frequencies

word_freq = pd.Series(' '.join(df['preprocessed_reviews']).split()).value_counts()
print("Top 10 common words:")
print(word_freq.head(10))


Top 10 common words:
good           173
great          134
greatprice     120
greatvalu      114
workgreat       95
ok              74
goodvalu        69
goodbatteri     67
work            64
excel           61
Name: count, dtype: int64


In [11]:
df.to_csv('sampleTaskNLP.csv', index=False)