## Prerequisites:

- Understanding of Python

- Understanding of Text Analytics and Natural Language Processing 

**Level of Exercise** : Beginner

**Effort in Time** : 120 minutes

# Data Preparation of Text Reviews

### Objective:
  **The Musical Instruments dataset contains reviews of musical instruments along with their ratings.**
   - Here we extract the `reviewText` column from the dataset and apply data cleaning steps to the review text.
   - We use the NLTK (Natural Language Toolkit) library for this purpose.
   - Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language,
     such as speech and text, by software.
### Data Cleaning Tasks

1. Dropping rows and columns containing missing values
2. Removing duplicate rows based on a chosen subset of columns
3. Sentence tokenization
4. Word tokenization
5. Punctuation removal
6. Stop word removal
7. Stemming
8. Lemmatization
9. POS tagging


# Read the Dataset
###  I have used the Musical Instruments dataset from Amazon: http://jmcauley.ucsd.edu/data/amazon/
The source is http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz  
The data is read with the pandas library.  

**Tasks performed in this section:**

1. Read the CSV file
2. Check the dimensions of the dataset
3. List the names of the variables in the dataset
4. Display the top 10 rows
5. Display the data types of the columns

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('https://raw.githubusercontent.com/Lakshmiholla-2808/first/master/Musical_Instruments_5.csv')


In [None]:
#See dimension
df.shape

In [None]:
#See column names
df.columns

In [None]:
#Display top 10 rows
df.head(10)

In [None]:
#Display DataTypes of Each column
df.dtypes

## Get the count of reviews in each rating category [5, 4, 3, 2, 1]

In [None]:
df['overall'].value_counts()
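
By default `value_counts()` orders its result by frequency, not by rating. A minimal sketch, assuming a small made-up series in place of `df['overall']`, shows how to order the counts by rating value and how to get proportions instead of raw counts:

```python
import pandas as pd

# Hypothetical ratings standing in for df['overall']
ratings = pd.Series([5, 4, 5, 3, 5, 4, 1], name="overall")

# Counts ordered by rating value (highest first) rather than by frequency
counts = ratings.value_counts().sort_index(ascending=False)

# Share of each rating instead of the raw count
shares = ratings.value_counts(normalize=True).sort_index(ascending=False)

print(counts)
print(shares.round(2))
```

The same two calls can be applied to `df['overall']` to see at a glance how skewed the ratings are.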

### Handling Missing Values
- Identify missing values using the `isnull()` function
- Drop the rows containing missing values

In [None]:
#Check for missing values
df.isnull().sum()

In [None]:
#Drop rows having missing values
df=df.dropna(axis=0, how='any')
df
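
The effect of the `axis` and `how` arguments is easier to see on a tiny table. The sketch below uses a made-up three-row DataFrame (not the real reviews) and also shows the `subset` argument, which limits the check to chosen columns:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with gaps, not the real review data
toy = pd.DataFrame({
    "reviewerName": ["Ann", np.nan, "Raj"],
    "reviewText": ["Great strings", "Too quiet", np.nan],
    "overall": [5, 2, 4],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# axis=0 drops rows; how='any' drops a row if any of its values is missing
rows_dropped = toy.dropna(axis=0, how="any")

# subset= restricts the check to the columns that matter for text cleaning
text_only = toy.dropna(subset=["reviewText"])

# axis=1 drops every column that contains a missing value
cols_dropped = toy.dropna(axis=1, how="any")

print(missing_per_column)
print(rows_dropped)
```

Dropping on a subset keeps more rows when only the review text is needed downstream.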

## Handle duplicate values
  **Sort the dataframe by product ID (`asin`), treat records that share the same `reviewerID`, `reviewerName`, `reviewTime`, `summary`,
    and `reviewText` as duplicate rows, and keep only the first of each.**

In [None]:
data_f = df.sort_values('asin').drop_duplicates(
    subset=['reviewerID', 'reviewerName', 'reviewTime', 'summary', 'reviewText'],
    keep='first')
data_f
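
With `subset`, two rows count as duplicates when the listed columns match, even if other columns differ. A minimal sketch on a made-up table (column values are invented for illustration):

```python
import pandas as pd

# Hypothetical rows: the same reviewer posted the same text under two products
toy = pd.DataFrame({
    "asin": ["B2", "B1", "B1"],
    "reviewerID": ["u1", "u1", "u2"],
    "reviewText": ["Nice tone", "Nice tone", "Broke fast"],
})

# After sorting by asin, keep='first' retains the copy that sorts first (asin B1)
deduped = toy.sort_values("asin").drop_duplicates(
    subset=["reviewerID", "reviewText"], keep="first"
)

print(deduped)
```

Here the `B2` row is dropped because its `reviewerID` and `reviewText` repeat a row that comes earlier in the sorted order.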

### Before cleaning the review text, cross-check for missing values

In [None]:
df.isnull().sum()

## Display the English stop words from the NLTK stopwords corpus
  - Stop words are frequently used words that add little value to information extraction

In [None]:
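import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # fetch the stop word corpus if it is not already installed
stop = set(stopwords.words('english'))
print(stop)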

## Sentence Tokenizer
- Take the review text from the `reviewText` column of the dataframe.
- Use the `sent_tokenize` function to break the review text into a list of sentences.


In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')  # tokenizer models; recent NLTK releases may need 'punkt_tab' instead
#Function for sentence tokenization
def sentence_tokenize(text):
    return sent_tokenize(text)


##  Word Tokenizer
- Take the tokenized sentences as input for word tokenization
- Loop through the sentence list
- Tokenize each sentence using the built-in `word_tokenize` function from `nltk.tokenize`
- Save the tokens in a list variable called `words`
- Remove punctuation
- Return the list of words

In [None]:
import string
#Function for word tokenization
def myword_tokenize(sentList):
    words = []
    for sentence in sentList:
        # Tokenize each sentence and collect the tokens in one flat list
        words.extend(word_tokenize(sentence))
    # Remove punctuation once, after all sentences are tokenized
    return [token for token in words if token not in string.punctuation]
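
One detail of the punctuation step is worth knowing: `token not in string.punctuation` is a substring test against the punctuation string, so multi-character tokens such as `...` or the quote tokens ``` `` ``` and `''` that `word_tokenize` can emit are not removed. The sketch below compares that test with a stricter check; the token list is hand-written for illustration and `is_punctuation` is a hypothetical helper, not part of NLTK:

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    """Return True when every character of the token is a punctuation mark."""
    return len(token) > 0 and all(ch in PUNCT for ch in token)

# Hand-written tokens of the kind a tokenizer may produce
tokens = ["Great", ",", "sound", "...", "``", "wow", "''", "!"]

# Substring test: removes only tokens that occur inside string.punctuation
substring_filter = [t for t in tokens if t not in string.punctuation]

# Character-wise test: removes any token made up solely of punctuation
strict_filter = [t for t in tokens if not is_punctuation(t)]

print(substring_filter)
print(strict_filter)
```

Whether the stricter check is preferable depends on whether tokens like `...` carry meaning for the later analysis.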




##  StopWords Removal 

- Take the tokenized word list as input for stop word removal
- Loop through the words and check whether each one is in the list of stop words.
- If so, filter it out, and return the filtered list.


In [None]:
#Function for stop word removal
def remove_stopwords(words):
    # A set gives fast membership tests; NLTK's list is lowercase, so compare lowercased tokens
    stop_words = set(stopwords.words('english'))
    return [w for w in words if w.lower() not in stop_words]
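
NLTK's English stop word list is all lowercase, so whether capitalized tokens such as "The" are filtered depends on whether tokens are lowercased before the lookup. A minimal sketch, assuming a small hand-picked stop word set in place of `stopwords.words('english')` (`remove_stopwords_demo` is a hypothetical name):

```python
# Small hand-picked set for illustration; the notebook uses NLTK's list instead
STOP_WORDS = {"the", "is", "a", "and", "this"}

def remove_stopwords_demo(words, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase so 'The' matches 'the'."""
    return [w for w in words if w.lower() not in stop_words]

words = ["This", "guitar", "is", "a", "bargain", "and", "The", "tone", "shines"]

# Exact-match lookup lets capitalized stop words through
case_sensitive = [w for w in words if w not in STOP_WORDS]

# Lowercasing before the lookup removes them as well
case_insensitive = remove_stopwords_demo(words)

print(case_sensitive)
print(case_insensitive)
```

Using a `set` rather than a list also keeps each lookup fast when the function is applied to many reviews.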

##  Stemming

- Stemming is the process of removing or replacing suffixes to reach the root form of a word, which is called the stem.
- For example, the words "connection", "connected", and "connecting" all reduce to the common stem "connect".
- The word list after removing stop words and punctuation is given as input to the stemming function.
- Loop through the words in the word list and apply the `PorterStemmer.stem()` function.
- The stemmed results are returned as a list.

In [None]:
#Function for stemming
from nltk.stem import PorterStemmer
def stem_words(words):
    ps = PorterStemmer()

    stemmed_words=[]
    for w in words:
        stemmed_words.append(ps.stem(w))
    return stemmed_words
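
The "connect" example above can be checked directly. This short sketch uses NLTK's `PorterStemmer`, which needs no extra corpus download; the word list is chosen for illustration, and "studies" is included to show that a stem is not always a dictionary word:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Illustrative words; the first three share a stem
examples = ["connection", "connected", "connecting", "studies"]

# Map each word to its stem
stems = {word: ps.stem(word) for word in examples}

for word, stem in stems.items():
    print(word, "->", stem)
```

Because stems such as "studi" are not real words, stemmed output is better suited to matching and counting than to display.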

##  Lemmatization

- Lemmatization reduces words to their base form, the linguistically correct lemma.
- It derives the lemma using a vocabulary and morphological analysis.
- Lemmatization is usually more sophisticated than stemming.
- A stemmer works on an individual word without knowledge of the context.
- For example, the word "better" has "good" as its lemma when it is treated as an adjective.
- The word list after removing stop words and punctuation is given as input to the lemmatization function.
- Loop through the words in the word list and apply the `WordNetLemmatizer.lemmatize()` function. The code below passes `"v"`, so every word is lemmatized as a verb.
- The lemmatized results are returned as a list.

In [None]:
#Function for Lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
def lemmatize_words(words):
    
    lem = WordNetLemmatizer()
    lemmatizedwords=[]
    for w in words:
        lemmatizedwords.append(lem.lemmatize(w,"v"))
    return lemmatizedwords

##  Parts of speech Tagging

- The primary goal of part-of-speech (POS) tagging is to identify the grammatical category of a given word,
   whether it is a noun, pronoun, adjective, verb, adverb, etc., based on the context.
-  POS tagging looks at relationships within the sentence and assigns a corresponding tag to each word.
-  The lemmatized list of words is given as input to the POS tagging function.
-  The `pos_tag` function assigns Penn Treebank tags such as NN (noun), VB (verb), NNP (proper noun), and so on.


In [None]:
#Function for POS tagging
def pos_tagging(words):
    # pos_tag returns a list of (word, tag) tuples
    return nltk.pos_tag(words)
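
POS tags can also feed back into lemmatization: `pos_tag` produces Penn Treebank tags, while `WordNetLemmatizer.lemmatize()` expects a WordNet part of speech such as `'n'`, `'v'`, `'a'`, or `'r'`. The sketch below shows one common way to bridge the two; `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are hand-written in the `(word, tag)` format rather than produced by the tagger:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else

# Hand-written (word, tag) pairs in the format returned by pos_tag
tagged = [("guitars", "NNS"), ("sounded", "VBD"), ("better", "JJR"), ("quickly", "RB")]

# Pair each word with the POS letter a WordNet lemmatizer would accept
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]

print(wordnet_pos)
```

Passing the mapped letter as the second argument of `lemmatize()` would let each word be lemmatized according to its own part of speech instead of treating every word as a verb. This requires tagging before lemmatizing, which is the reverse of the order used in this notebook.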

## Create a temporary DataFrame to store the results of the data cleaning operations for the first 20 records of the dataframe
 - Columns in the new dataframe are
 
    - **reviewText**: Review text  
    - **sent_tokenized**: Review text tokenized into sentences  
    - **word_tokenized**: Sentences tokenized into words  
    - **stop_word_removal**: Word list after stop word removal  
    - **stemming**: Word list after stemming  
    - **lemmatize**: Word list after lemmatization  
    - **postag**: Word list with POS tags assigned  
    
    

In [None]:

temp = pd.DataFrame()
temp['reviewText'] = df['reviewText'].head(20)
#Applying sentence tokenize function
temp['sent_tokenized'] = df['reviewText'].head(20).apply(sentence_tokenize)
#Applying word tokenize function
temp['word_tokenized'] = temp['sent_tokenized'].apply(myword_tokenize)
#Applying stopword removal function
temp['stop_word_removal'] = temp['word_tokenized'].apply(remove_stopwords)
#Applying stemming function
temp['stemming'] =temp['stop_word_removal'].apply(stem_words)
#Applying lemmatization function
temp['lemmatize'] =temp['stop_word_removal'].apply(lemmatize_words)
#Applying POS tagging function
temp['postag'] =temp['lemmatize'].apply(pos_tagging)

temp.head()


### Using the cell phone accessories dataset loaded below, write code to answer the following questions


In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Lakshmiholla-2808/first/master/cellphone.csv')

### Q1. Print the dimensions of the dataset

### Q2. Display the first 10 rows of the dataframe

### Q3. Get the count of reviews in each rating category [5, 4, 3, 2, 1]

### Q4. Check for missing values and drop the rows containing them

### Q5. Create and display a temporary dataframe containing the following columns.
  - reviewtext 
  - tokenized_sentences
  - tokenized_words
  - stopwords_removed
  - stemmed_values
  - lemmatized_values
  - POStags_assigned