In [1]:
import pandas as pd 
import numpy as np 
import os
import matplotlib.pyplot as plt
import sweetviz as sv
import seaborn as sns
from textblob import TextBlob
import math
from sklearn.feature_extraction.text import TfidfVectorizer
import csv

# The All Nighter #
## Beginner Track: Religious Text Analysis ##

### Introduction ###
We investigate 8 fundamental texts from major religious and wisdom traditions around the world: the Buddhist texts, Tao Te Ching, Upanishads, Yoga Sutra, Book of Proverbs, Book of Ecclesiastes, Book of Ecclesiasticus, and Book of Wisdom. 
#### Background Research on the Texts ####
- **Yoga Sutra** <br>
Country of Origin: India <br> 
Time Period Written: 500BC - 400AD <br>
Attributed Writer: Patanjali <br>
Associated Religion: Yoga <br>
Key Themes: <br> 
1) Accepted as the most authoritative source on Yoga teaching. (Pramana) <br> 
2) Use of words or expressions that don't correspond to any actual physical reality but that are understood by most. (Metaphor) <br> 

- **Tao Te Ching** <br>
Country of Origin: China <br>
Time Period Written: ~400BC <br>
Attributed Writer: Laozi <br> 
Associated Religion: Tao <br>
Key Themes: <br>
1) How to live in the world with goodness and integrity <br> 
2) Teaches self-reflection and self-awareness <br>

- **Book of Ecclesiasticus (Sirach)** <br> 
Country of Origin: Israel (Jerusalem) <br>
Time Period Written: ~200BC - 175BC <br>
Attributed Writer: Ben Sira (Ecclesiasticus) <br>
Associated Religion: Christianity (Catholicism) <br>
Key Themes: <br>
1) Praise of wisdom <br> 
2) Duties to God/friends/parents/others <br>
3) Rules of correct self-conduct

- **Upanishads** <br>
Country of Origin: India <br>
Time Period Written: ~800BC or later <br>
Attributed Writer: Vyasa <br>
Associated Religion: Hinduism <br>
Key Themes: <br> 
1) Liberating the soul and returning to the world of Brahman <br> 
2) The fundamental concepts of Karma (actions and intentions have consequences), Samsara (the cycle of rebirth), Dharma (duty), and Moksha (liberation from the cycle of rebirth)

- **'Buddhist' Text** <br> 
Country of Origin: India <br> 
Time Period Written: ~29 BC <br> 
Attributed Writer: Gautama Buddha (transcriber) <br>
Associated Religion: Buddhism <br> 
Key Themes: <br> 
1) Lays out rules for nuns and monks <br> 
2) Summary of the teachings of the Buddha <br> 
3) Collection of texts that explain Buddhist doctrines about the mind <br> 

- **Proverbs** <br> 
Country of Origin: Israel <br> 
Time Period Written: 10th-6th century BCE <br> 
Attributed Writer: King Solomon <br>
Associated Religion: Judaism <br> 
Key Themes: <br>
1) Part of the Christian Old Testament <br> 
2) Third section of the Hebrew Bible <br> 

- **Ecclesiastes** <br> 
Country of Origin: Israel <br> 
Time Period Written: 450-200 BCE <br> 
Attributed Writer: Qoheleth <br>
Associated Religion: Judaism, Christianity <br> 
Key Themes: <br>
1) Part of the Christian Old Testament <br> 
2) Referred to by Catholic Church leaders <br> 

- **Wisdom** <br> 
Country of Origin: Israel <br> 
Time Period Written: 1-2CE <br> 
Attributed Writer: King Solomon (?) <br>
Associated Religion: Judaism, Christianity <br> 
Key Themes: <br>
1) Part of the Christian Old Testament <br> 


**Purpose of the Investigation**
- One goal is to extract useful patterns or trends in the word composition of each text, and to see whether those patterns can be linked to similarities between culturally related groups in neighboring regions. 
- Another goal is to learn about the nature of each religion (individually or as a group) by comparing the composition of their vocabularies and drawing inferences. 

### Basic Data Exploration / Pre-Processing ###
1. Get the data 

In [3]:
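# Build the file path in an OS-independent way.
train = pd.read_csv(os.path.join("Datasets", "Religious_text_train.csv"))
chapters = np.array(train.iloc[:, 0])  # chapter labels such as 'Buddhism_Ch1'
words = np.array(train.columns)        # column names: the label column plus one per word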

In [7]:
# Expect 590 chapters and 8264 words, plus the 'book' and 'chapter' columns
train.shape

(590, 8266)

There are 8264 word features (plus the `book` and `chapter` columns) and 590 chapters in total across 8 books. <br>

2. The book names and chapters are bundled in one column; it is better to separate them for easier processing later. 

In [4]:
# Split each label (ex: Buddhism_Ch1) into separate 'book' and 'chapter' columns,
# then drop the original combined label column.
train[['book', 'chapter']] = train.iloc[:,0].str.split('_',expand=True)
train = train.drop(columns='Unnamed: 0')
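To make the effect of the split concrete, here is a minimal sketch on a made-up three-row frame. The labels and the `peace` column are illustrative only and are not taken from the dataset. Passing `n=1` is an added assumption: it splits on the first underscore only, so a label with an extra underscore would still yield exactly two columns.

```python
import pandas as pd

# Toy stand-in for the real table (made-up values):
# the first column holds 'Book_Chapter' labels, the rest are word counts.
toy = pd.DataFrame({
    "Unnamed: 0": ["Buddhism_Ch1", "Buddhism_Ch2", "TaoTeChing_Ch1"],
    "peace": [2, 0, 1],
})

# Split on the first underscore only, producing exactly two new columns.
toy[["book", "chapter"]] = toy["Unnamed: 0"].str.split("_", n=1, expand=True)

# The combined label is no longer needed once it has been split.
toy = toy.drop(columns="Unnamed: 0")

print(toy)
```

After this step each row can be filtered or grouped by `book` directly, which is what the following cells rely on.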

3. Get the book names from the dataframe and store the word table for each book for later convenience

In [5]:
book_names = list(train.book.unique())
book_names

['Buddhism',
 'TaoTeChing',
 'Upanishad',
 'YogaSutra',
 'BookOfProverb',
 'BookOfEcclesiastes',
 'BookOfEccleasiasticus',
 'BookOfWisdom']

In [6]:
books = dict()
chapter_counts = dict()
for book in book_names:
    books[book] = train[train.book == book].drop(columns=['book'])
    chapter_counts[book] = books[book].shape[0]
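One possible follow-up, sketched here on made-up data, is to collapse each book's chapter rows into a single word-count vector so that books can be compared directly. The frame `toy` and the name `book_totals` are hypothetical; the same `groupby` pattern should apply to `train`, assuming all remaining columns are numeric word counts.

```python
import pandas as pd

# Made-up table in the same layout as `train` after the split step.
toy = pd.DataFrame({
    "book": ["Buddhism", "Buddhism", "TaoTeChing"],
    "chapter": ["Ch1", "Ch2", "Ch1"],
    "peace": [2, 0, 1],
    "way": [0, 1, 3],
})

# Sum the word counts over all chapters of each book (one row per book).
book_totals = toy.drop(columns="chapter").groupby("book").sum()

# Chapters per book, the same information as the chapter_counts dict.
chapter_counts = toy["book"].value_counts().to_dict()

print(book_totals)
print(chapter_counts)
```

Summing is only one choice of aggregation; because books differ in chapter count, per-chapter means or normalized frequencies may be fairer for comparison.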

### Visualizations / Analysis ###

### Proposal ###

Hypothesis: The similarity or difference between texts from different religions is related to their geographic origin.  
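One way this hypothesis could be examined is to measure how similar the books' vocabularies are and check whether books from nearby regions score higher with each other. The sketch below uses cosine similarity, which is an assumed choice of metric rather than one stated in this notebook. The per-book totals are made-up numbers, so the printed values say nothing about the real texts, and `cosine_similarity_table` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up per-book word totals (rows: books, columns: words).
book_totals = pd.DataFrame(
    {"peace": [4, 3, 0], "way": [1, 2, 0], "lord": [0, 0, 5]},
    index=["Buddhism", "TaoTeChing", "BookOfProverb"],
)

def cosine_similarity_table(totals: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity between the rows of `totals`."""
    values = totals.to_numpy(dtype=float)
    # Scale every row to unit length; assumes no book has an all-zero row.
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / norms
    # Dot products of unit vectors are the cosine similarities.
    return pd.DataFrame(unit @ unit.T, index=totals.index, columns=totals.index)

similarity = cosine_similarity_table(book_totals)
print(similarity.round(2))
```

A score near 1 means two books use words in similar proportions, and a score of 0 means they share no words. Raw counts are dominated by very common words, so a weighting such as TF-IDF may give a more informative comparison.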

### Conclusion ###