# Exploratory Data Analysis

## Introduction

After the data cleaning step where we put our data into a few standard formats, the next step is to take a look at the data and see if what we're looking at makes sense. Before applying any fancy algorithms, it's always important to explore the data first.

When working with numerical data, exploratory data analysis (EDA) techniques include finding the average of the data set, its distribution, and its most common values. The idea is the same when working with text data: we look for the more obvious patterns with EDA before searching for hidden patterns with machine learning (ML) techniques. We are going to look at the following for each book:

1. **Most common words** - find these and create word clouds
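
As a minimal illustration of the "most common words" idea, the sketch below counts word frequencies in a tiny made-up text using only the standard library. The sample sentence and the `most_common_words` helper are hypothetical and are not part of the notebook's data or code.

```python
from collections import Counter
import re


def most_common_words(text, n=3):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text, for illustration only
sample = "The lord is good. The lord is kind. Praise the lord."
print(most_common_words(sample))
```

Words with equal counts come back in the order they were first seen, so the result is deterministic for a given text.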


### Most Common Words

**Analysis**

In [1]:
# Read the document-term matrix and transpose it so rows are words and columns are books
import pandas as pd
import numpy as np

books_tdm = pd.read_pickle('books_tdm.pkl')
books_tdm = books_tdm.transpose()

In [2]:
books_tdm.head()

books,Bahai_Aqdas,Budhist_Tipitaka,Gita,Rigveda,Jewish_bible,Quran,Gurugranthsahib,Bible
aa,0,0,3,3,0,0,325,0
aaa,0,0,1,0,0,0,0,0
aaay,0,0,0,0,0,0,509,0
aab,0,0,0,0,0,0,4,0
aabaadaan,0,0,0,0,0,0,1,0


In [3]:
# Find the top 30 words from the books
top_dict = {}
for c in books_tdm.columns:
    top = books_tdm[c].sort_values(ascending=False).head(30)
    top_dict[c]= list(zip(top.index,top.values))

top_30_df = pd.DataFrame(top_dict)
top_30_df

rank,Bahai_Aqdas,Budhist_Tipitaka,Gita,Rigveda,Jewish_bible,Quran,Gurugranthsahib,Bible
0,"(god, 613)","(buddha, 709)","(lord, 1928)","(veda, 164)","(shall, 7461)","(god, 2891)","(lord, 21718)","(shall, 11258)"
1,"(hath, 421)","(sutta, 485)","(supreme, 1702)","(time, 116)","(lord, 7227)","(lord, 971)","(har, 11183)","(lord, 8272)"
2,"(ye, 296)","(discourse, 202)","(material, 1437)","(vedic, 97)","(unto, 6846)","(say, 788)","(guru, 9364)","(thou, 6100)"
3,"(prayer, 198)","(dhamma, 167)","(book, 1066)","(word, 90)","(thou, 4572)","(said, 774)","(hir, 9308)","(thy, 6038)"
4,"(house, 189)","(life, 139)","(bhaktivedanta, 1063)","(rig, 90)","(thy, 4505)","(people, 684)","(gauvi, 9167)","(god, 5226)"
5,"(law, 189)","(bhikkhu, 136)","(right, 1052)","(india, 89)","(thee, 3353)","(day, 494)","(gur, 7629)","(said, 3950)"
6,"(lord, 181)","(venerable, 132)","(copyright, 1031)","(ancient, 81)","(god, 3253)","(believe, 411)","(naam, 7600)","(thee, 3948)"
7,"(justice, 171)","(bhikkhus, 129)","(trust, 1030)","(text, 79)","(son, 2996)","(know, 390)","(na, 6879)","(son, 3634)"
8,"(day, 167)","(given, 124)","(reserved, 1028)","(knowledge, 60)","(said, 2901)","(earth, 361)","(true, 6327)","(king, 3200)"
9,"(unto, 166)","(mind, 105)","(intl, 1026)","(life, 57)","(king, 2889)","(messenger, 351)","(awsw, 6167)","(hath, 3186)"
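
The per-book ranking can also be written with `Series.nlargest`, which picks the top counts without sorting the whole column. This is a sketch on a hypothetical toy matrix (`toy_tdm`, `top_words`, and the counts are made up), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy term-document matrix: rows are words, columns are books
toy_tdm = pd.DataFrame(
    {"book_a": [5, 0, 2, 1], "book_b": [1, 4, 0, 3]},
    index=["lord", "buddha", "god", "mind"],
)


def top_words(tdm, n):
    """Map each column name to its n highest-count (word, count) pairs."""
    result = {}
    for col in tdm.columns:
        top = tdm[col].nlargest(n)
        result[col] = [(word, int(count)) for word, count in top.items()]
    return result


top = top_words(toy_tdm, 2)
print(top)
```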


In [4]:
# Top 30 words across all the books combined
books_tdm.sum(axis=1).sort_values(ascending=False).head(30)

lord      40302
shall     21689
god       17472
har       11183
thou      10760
thy       10682
man       10041
guru       9368
hir        9311
gauvi      9167
unto       8691
said       8123
gur        7630
naam       7601
day        7460
thee       7409
na         7316
son        7194
king       6801
true       6601
awsw       6167
ji         6107
come       6011
mind       5827
people     5796
hath       5485
word       5382
israel     5133
nanak      5112
naanak     5000
dtype: int64

In [11]:
# Find words common to all the books (count > 0 in every column)
ind = []
record = []
for index, row in books_tdm.iterrows():
    if (row > 0).all():
        ind.append(index)
        record.append(row)
        
books_common_data = pd.DataFrame(record)
books_common_data.head()

books,Bahai_Aqdas,Budhist_Tipitaka,Gita,Rigveda,Jewish_bible,Quran,Gurugranthsahib,Bible
abide,3,1,4,1,46,20,185,96
able,12,3,83,8,95,36,26,227
accept,5,5,100,3,19,22,78,21
accepted,6,4,74,1,20,16,91,11
account,3,35,17,9,14,35,189,43
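
The same "present in every book" filter can be expressed without a Python-level loop by building a boolean mask over the rows. The sketch below assumes a matrix with words as rows and books as columns; `toy_tdm` and its counts are hypothetical.

```python
import pandas as pd

# Hypothetical toy term-document matrix: rows are words, columns are books
toy_tdm = pd.DataFrame(
    {"book_a": [5, 0, 2], "book_b": [1, 4, 3], "book_c": [2, 1, 0]},
    index=["lord", "buddha", "god"],
)

# Keep only the rows where every column has a positive count
common = toy_tdm[(toy_tdm > 0).all(axis=1)]
print(common.index.tolist())
```

On a large vocabulary this is typically much faster than `iterrows`, because the comparison and the row-wise `all` run over the whole frame at once.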


In [12]:
# Find the top 10 common words from each book
top_dict = {}
for c in books_common_data.columns:
    top = books_common_data[c].sort_values(ascending=False).head(10)
    top_dict[c]= list(zip(top.index,top.values))

top_30_df = pd.DataFrame(top_dict)
top_30_df

rank,Bahai_Aqdas,Budhist_Tipitaka,Gita,Rigveda,Jewish_bible,Quran,Gurugranthsahib,Bible
0,"(house, 189)","(life, 139)","(book, 1066)","(time, 116)","(son, 2996)","(say, 788)","(true, 6327)","(said, 3950)"
1,"(day, 167)","(given, 124)","(right, 1052)","(word, 90)","(said, 2901)","(said, 774)","(mind, 4947)","(son, 3634)"
2,"(question, 152)","(mind, 105)","(living, 796)","(ancient, 81)","(king, 2889)","(people, 684)","(man, 4759)","(king, 3200)"
3,"(book, 139)","(practice, 100)","(body, 771)","(knowledge, 60)","(day, 2122)","(day, 494)","(love, 2499)","(day, 2963)"
4,"(people, 130)","(teaching, 96)","(knowledge, 733)","(life, 57)","(house, 1965)","(believe, 411)","(world, 2240)","(man, 2948)"
5,"(answer, 129)","(rule, 88)","(life, 721)","(world, 55)","(people, 1904)","(know, 390)","(word, 2142)","(thing, 2563)"
6,"(world, 128)","(order, 86)","(service, 691)","(great, 46)","(land, 1879)","(earth, 361)","(come, 2076)","(people, 2252)"
7,"(time, 117)","(knowledge, 84)","(translation, 659)","(year, 45)","(man, 1843)","(come, 276)","(praise, 2013)","(hand, 2127)"
8,"(verse, 117)","(view, 81)","(person, 638)","(like, 44)","(book, 1781)","(good, 254)","(body, 1731)","(come, 2112)"
9,"(revealed, 101)","(world, 79)","(soul, 597)","(river, 43)","(child, 1753)","(punishment, 204)","(heart, 1686)","(house, 2062)"


In [45]:
# Optionally remove words that are among the most frequent in all the books.
# Custom stop words to filter out (left disabled here):
# custom_stopwords = ['god','lord','shall']

# books_tdm.drop(index=custom_stopwords, inplace=True)
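
If the stop-word filter is enabled, one thing to watch is that `drop` raises a `KeyError` when a label is missing from the index. The sketch below shows one way to guard against that with `errors="ignore"`, using a hypothetical toy matrix; it returns a new frame rather than modifying in place.

```python
import pandas as pd

# Hypothetical toy term-document matrix: rows are words, columns are books
toy_tdm = pd.DataFrame(
    {"book_a": [5, 3, 2], "book_b": [4, 0, 1]},
    index=["god", "lord", "mind"],
)
custom_stopwords = ["god", "lord", "shall"]  # "shall" is absent from the toy index

# errors="ignore" skips labels that are not in the index instead of raising KeyError
filtered = toy_tdm.drop(index=custom_stopwords, errors="ignore")
print(filtered.index.tolist())
```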

In [62]:
# Find words unique to each book: the word's total count comes entirely from that book

books_unique_content = {}
for book in books_tdm.columns:
    words = {}
    for index, row in books_tdm.iterrows(): 
        if row[book] > 0 and row[book] == row.sum():
            words[index] = row[book]
    books_unique_content[book] = words
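
The nested loop above visits every row once per book. A possible vectorized alternative is to first mark the words that have a positive count in exactly one column, then split them by book. This is a sketch on hypothetical toy data (`toy_tdm`, `unique_words`), not a drop-in replacement tested on the real matrix.

```python
import pandas as pd

# Hypothetical toy term-document matrix: rows are words, columns are books
toy_tdm = pd.DataFrame(
    {"book_a": [5, 0, 2, 0], "book_b": [1, 4, 0, 0], "book_c": [0, 0, 0, 7]},
    index=["lord", "sutta", "veda", "naam"],
)

# A word is unique to a book when exactly one column has a positive count
appears_once = (toy_tdm > 0).sum(axis=1) == 1
unique_rows = toy_tdm[appears_once]

unique_words = {}
for book in toy_tdm.columns:
    counts = unique_rows.loc[unique_rows[book] > 0, book]
    # Sort by count, highest first, and store as (word, count) pairs
    counts = counts.sort_values(ascending=False)
    unique_words[book] = [(word, int(count)) for word, count in counts.items()]

print(unique_words)
```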

In [63]:
books_unique_content.keys()

dict_keys(['Bahai_Aqdas', 'Budhist_Tipitaka', 'Gita', 'Rigveda', 'Jewish_bible', 'Quran', 'Gurugranthsahib', 'Bible'])

In [69]:
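# Sort each book's unique words by count (descending) before slicing;
# the values start out as dicts, which cannot be sliced
for book in books_unique_content.keys():
    if isinstance(books_unique_content[book], dict):
        books_unique_content[book] = sorted(books_unique_content[book].items(), key=lambda kv: kv[1], reverse=True)
    print(book, "==", books_unique_content[book][:50])
    print("####################################################################################")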

Bahai_Aqdas == [('effendi', 99), ('shoghi', 99), ('qa', 88), ('ablution', 39), ('aqdas', 31), ('qiblih', 23), ('allglorious', 21), ('ordainer', 16), ('requirement', 16), ('allusion', 15), ('dawningplace', 14), ('gwb', 13), ('behooveth', 12), ('pdc', 12), ('selfsubsisting', 11), ('intercalary', 10), ('remarry', 10), ('reverts', 9), ('antipathy', 8), ('constantinople', 8), ('implementation', 8), ('payable', 8), ('superseded', 8), ('adult', 7), ('allbountiful', 7), ('designate', 7), ('infallibility', 7), ('napoleon', 7), ('wilmette', 7), ('austria', 6), ('branched', 6), ('debar', 6), ('endowment', 6), ('erelong', 6), ('formulated', 6), ('guardianship', 6), ('invalidate', 6), ('legislate', 6), ('monogamy', 6), ('overshadoweth', 6), ('rhine', 6), ('siyyid', 6), ('abjad', 5), ('absolved', 5), ('allpossessing', 5), ('anniversary', 5), ('arson', 5), ('bestbeloved', 5), ('indemnity', 5), ('lotetree', 5)]
####################################################################################
Budhis

In [65]:
books_unique_content

{'Bahai_Aqdas': [('effendi', 99),
  ('shoghi', 99),
  ('qa', 88),
  ('ablution', 39),
  ('aqdas', 31),
  ('qiblih', 23),
  ('allglorious', 21),
  ('ordainer', 16),
  ('requirement', 16),
  ('allusion', 15),
  ('dawningplace', 14),
  ('gwb', 13),
  ('behooveth', 12),
  ('pdc', 12),
  ('selfsubsisting', 11),
  ('intercalary', 10),
  ('remarry', 10),
  ('reverts', 9),
  ('antipathy', 8),
  ('constantinople', 8),
  ('implementation', 8),
  ('payable', 8),
  ('superseded', 8),
  ('adult', 7),
  ('allbountiful', 7),
  ('designate', 7),
  ('infallibility', 7),
  ('napoleon', 7),
  ('wilmette', 7),
  ('austria', 6),
  ('branched', 6),
  ('debar', 6),
  ('endowment', 6),
  ('erelong', 6),
  ('formulated', 6),
  ('guardianship', 6),
  ('invalidate', 6),
  ('legislate', 6),
  ('monogamy', 6),
  ('overshadoweth', 6),
  ('rhine', 6),
  ('siyyid', 6),
  ('abjad', 5),
  ('absolved', 5),
  ('allpossessing', 5),
  ('anniversary', 5),
  ('arson', 5),
  ('bestbeloved', 5),
  ('indemnity', 5),
  ('lotetre