## Problem Statement 2
Compute the frequency of each word used across all item descriptions (text analysis).


### Text Analysis 

In [1]:
#read the csv file
import pandas as pd
read_data=pd.read_csv("C:\\Users\\muska\\final_data3.csv")
read_data[:2]

Unnamed: 0,item_name,must_try_tag,price,item_description
0,"Voosh Paneer Premium Thali with Sweet, Butter ...",MUST TRY,₹209,Enjoy a wholesome thali meal with paneer masal...
1,"2 Gobhi Paratha, Curd & Pickle Meal",MUST TRY,₹134,"2 gobhi parathas, curd, sweet, salad and pickl..."


In [2]:
#convert the item descriptions to lowercase
read_data['item_description'] = read_data['item_description'].str.lower()

### Tokenization
We now apply the word_tokenize method from NLTK to split each item description into individual words, storing the result in a new column of the read_data DataFrame. Each entry is a list of words. We also drop every token that is not purely alphabetic (numbers, punctuation, and any token containing a digit) using .isalpha. Note that word_tokenize needs NLTK's tokenizer data to be downloaded beforehand.

In [3]:
#tokenization
import nltk
def identify_tokens(row):
    description = row['item_description']
    tokens = nltk.word_tokenize(description)
    # keep only purely alphabetic tokens (drops numbers and punctuation)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

read_data['words'] = read_data.apply(identify_tokens, axis=1)
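
To see what the .isalpha filter keeps and what it drops, here is a minimal sketch on a hand-written token list. The tokens are illustrative, not actual word_tokenize output, so the example runs without NLTK.

```python
# Hypothetical tokens, roughly what a tokenizer might produce for a
# description such as "2 gobhi parathas, curd & pickle (250ml)".
sample_tokens = ["2", "gobhi", "parathas", ",", "curd", "&", "pickle", "(", "250ml", ")"]

# str.isalpha() is True only when every character in the string is a letter,
# so digits, punctuation, and mixed tokens such as "250ml" are filtered out.
kept = [t for t in sample_tokens if t.isalpha()]
dropped = [t for t in sample_tokens if not t.isalpha()]

print(kept)
print(dropped)
```

One consequence worth noting: a token that mixes letters and digits is removed entirely rather than trimmed, so units attached to numbers never reach the word counts.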

In [4]:
read_data[:2]

Unnamed: 0,item_name,must_try_tag,price,item_description,words
0,"Voosh Paneer Premium Thali with Sweet, Butter ...",MUST TRY,₹209,enjoy a wholesome thali meal with paneer masal...,"[enjoy, a, wholesome, thali, meal, with, panee..."
1,"2 Gobhi Paratha, Curd & Pickle Meal",MUST TRY,₹134,"2 gobhi parathas, curd, sweet, salad and pickl...","[gobhi, parathas, curd, sweet, salad, and, pic..."


### Removing stop words
‘Stop words’ are very common words that carry little meaning on their own and rarely help in natural language processing. These include words such as ‘a’, ‘the’, ‘is’.

In [5]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\muska\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.append('ml')
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

As before, we define a function and apply it to the DataFrame.

In [7]:
def description_without_sw(words):
    # keep only the words that are not stop words
    txt = [word for word in words if word not in stop_words]
    return txt

In [8]:
read_data['words'] = read_data['words'].apply(description_without_sw)

In [9]:
read_data['words'][:2]

0    [enjoy, wholesome, thali, meal, paneer, masala...
1    [gobhi, parathas, curd, sweet, salad, pickle, ...
Name: words, dtype: object
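
A membership test against a list scans it element by element, whereas a set uses a hash lookup. The sketch below shows the same filter with a set. The small stop-word set and the helper name remove_stop_words are made up for illustration and stand in for the NLTK list used above.

```python
# Illustrative stand-in for stopwords.words('english') plus 'ml'.
stop_word_set = {"a", "with", "and", "the", "is", "ml"}

def remove_stop_words(words, stop_words=stop_word_set):
    """Return the words that are not in stop_words, preserving their order."""
    return [word for word in words if word not in stop_words]

tokens = ["enjoy", "a", "wholesome", "thali", "meal", "with", "paneer"]
filtered = remove_stop_words(tokens)
print(filtered)
```

With the real NLTK list, converting it once with set(stop_words) before calling apply would give the same result.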

In [10]:
#collect each description's word list into one list of lists
listoflist = read_data['words'].tolist()
print(listoflist[:2])

[['enjoy', 'wholesome', 'thali', 'meal', 'paneer', 'masala', 'dry', 'veggie', 'day', 'dal', 'tadka', 'phulkas', 'rice', 'sweet', 'butter', 'milk', 'amazing', 'one'], ['gobhi', 'parathas', 'curd', 'sweet', 'salad', 'pickle', 'amazing', 'one']]


In [11]:
#convert a list of lists to a flat list
flatlist = []
for elem in listoflist:
    flatlist.extend(elem)

To count how many times each word appears across all descriptions, use the Counter class from collections, a module in the Python standard library.

In [12]:
# Count how often each word occurs
import collections
counts = collections.Counter(flatlist)

In [13]:
counts.most_common()[:5]

[('amazing', 23), ('one', 22), ('sweet', 21), ('rice', 17), ('dal', 16)]
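
The flattening and counting steps can also be combined. A minimal sketch, assuming a list of token lists shaped like read_data['words']; the toy data and the names token_lists and word_counts are illustrative only.

```python
from collections import Counter
from itertools import chain

# Two short token lists standing in for read_data['words'].
token_lists = [
    ["thali", "rice", "sweet", "amazing", "one"],
    ["curd", "sweet", "salad", "amazing", "one"],
]

# chain.from_iterable yields the words of each inner list in turn,
# so Counter can consume them without building a flat list first.
word_counts = Counter(chain.from_iterable(token_lists))

print(word_counts["sweet"])
print(word_counts["rice"])
```

Looking up a word that never occurred returns 0 rather than raising an error, which is convenient when checking for specific terms.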

### Save word frequency as a csv file
From the counter, build a pandas DataFrame and write it to a CSV file.

In [14]:
word_frequency = pd.DataFrame(counts.most_common(),
                              columns=['words', 'count'])

In [15]:
word_frequency.to_csv('word_frequency.csv', index=False)
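
As a quick sanity check, the saved table can be read back and compared with the counts. The sketch below uses a toy counter and an in-memory buffer instead of a file on disk; toy_counts, freq, and restored are illustrative names, not part of the notebook.

```python
import io
from collections import Counter

import pandas as pd

# Toy counter standing in for the notebook's `counts`.
toy_counts = Counter({"amazing": 3, "sweet": 2, "rice": 1})

# most_common() returns (word, count) pairs sorted by descending count.
freq = pd.DataFrame(toy_counts.most_common(), columns=["words", "count"])

# Write the CSV text to an in-memory buffer rather than to disk.
buffer = io.StringIO()
freq.to_csv(buffer, index=False)

# Rewind the buffer and parse the CSV text back into a DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing index=False keeps the row index out of the file, so reading it back does not produce an extra unnamed column.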