In [66]:
#!/usr/bin/python3
# -*- coding: UTF8 -*-
# Author: Nicolas Flandrois
# Date: Monday, May 20th, 2020

In [67]:
import os
import pandas as pd

# Show full quote text instead of truncating long cells.
pd.set_option("display.max_colwidth", None)

In [68]:
def fortune2dataframe(source_file):
    """Given the source file path/name, transform the raw data into a
    pandas DataFrame. This function is specifically made to suit the
    standardized fortune-mod format as input."""
    # Read as UTF-8 explicitly: the separator and some names are non-ASCII.
    with open(source_file, 'r', encoding='utf-8') as f:
        data = f.read()

    # Entries are delimited by '%'; drop newlines first, then empty entries.
    quote_list = data.replace('\n', '').split('%')
    clean_list = [x.strip() for x in quote_list if x]
    # Quote and author are separated by five spaces, a dash, and one space.
    split_quotes = [n.split('     — ') for n in clean_list]

    return pd.DataFrame(split_quotes, columns=['Quotes', 'Authors'])
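
To make the expected input layout concrete, here is a minimal sketch that runs the same parsing steps on an in-memory string rather than a file. The sample text, the `SEPARATOR` constant, and the helper name `parse_fortune_text` are hypothetical and only serve the illustration; the layout (author on its own line after five spaces and a dash, entries closed by a `%` line) is inferred from the parsing code above.

```python
# Five spaces, an em dash (U+2014), one space: the separator used above.
SEPARATOR = "     \u2014 "

# Hypothetical two-entry sample in the layout the parser expects.
sample = (
    "First sample quote.\n"
    f"{SEPARATOR}Author One\n"
    "%\n"
    "Second sample quote.\n"
    f"{SEPARATOR}Author Two\n"
    "%\n"
)


def parse_fortune_text(text):
    """Apply the same steps as fortune2dataframe to an in-memory string."""
    entries = text.replace("\n", "").split("%")
    cleaned = [entry.strip() for entry in entries if entry]
    return [entry.split(SEPARATOR) for entry in cleaned]


rows = parse_fortune_text(sample)
print(rows)
# [['First sample quote.', 'Author One'], ['Second sample quote.', 'Author Two']]
```

Two limits follow from these steps: a quote that itself contains `%` would be split in two, and a quote wrapped over several lines would be rejoined without a space at the line break.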

**Define the DataFrame** & **Sampling**

In [69]:
df = fortune2dataframe(os.path.abspath("zen"))
df

Unnamed: 0,Quotes,Authors
0,"Zen is not some kind of excitement, but concentration on our usual everyday routine.",Shunryu Suzuki
1,"Since the time we were born from our mother's womb, the only thing we have seen is the present. We have never seen the past and we have never seen the future. Wherever we are, whatever time it is, it is only the present.",Khenpo Tsultrim Rinpoche
2,"Rather than being your thoughts and emotions, be the awareness behind them.",Eckhart Tolle
3,Your suffering is never caused by the person you are blaming.,Byron Katie
4,Death is not an error. It is not a failure. It is taking off a tight shoe.,Ram Dass
...,...,...
471,The world is ruled by letting things take their course.,Lao Tzu
472,"Nurturing your beloved, you become impartial. Opening your heart, you become accepted. Accepting the World, you embrace Tao.",Lao Tzu
473,Man suffers only because he takes seriously what the gods made for fun.,Alan Wilson Watts
474,Life is a journey. Time is a river. The door is ajar.,Jim Butcher


# A bit of checkup for cleaning
- Checking for **Duplicate rows** (same Quote & same Author)

In [70]:
df[df.duplicated()]

Unnamed: 0,Quotes,Authors


- Checking for **Duplicate Quotes** (Same Quote, Any Author)

In [71]:
df[df.duplicated(['Quotes'])]

Unnamed: 0,Quotes,Authors
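
Since both checks come back empty here, a small sketch on hypothetical toy data may help show what each call would flag. It assumes only the documented `subset` and `keep` parameters of `DataFrame.duplicated`; the toy rows are invented for the illustration.

```python
import pandas as pd

# Hypothetical toy data: the same quote attributed to two different authors.
toy = pd.DataFrame({
    "Quotes": ["Same words.", "Same words.", "Other words."],
    "Authors": ["Author A", "Author B", "Author A"],
})

# Default: flags later occurrences of rows identical in every column.
full_row_dupes = toy.duplicated()

# subset compares the quote text only; keep=False flags every copy,
# which makes it easier to inspect all conflicting attributions at once.
quote_dupes = toy.duplicated(subset=["Quotes"], keep=False)

print(full_row_dupes.tolist())  # [False, False, False]
print(quote_dupes.tolist())     # [True, True, False]
```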


- Checking for **unwanted data** (typical issues left over from the initial parsing by the Twitter2Fortune bot)

In [72]:
df[df['Quotes'].str.contains('http')]

df[df['Authors'].str.contains('http')]

df[df['Authors'].str.match('None')]

Unnamed: 0,Quotes,Authors
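
A notebook cell displays only the value of its last expression, so the three filters above would need separate cells to be inspected one by one. As an alternative, here is a hedged sketch that folds them into a single mask. The toy frame is hypothetical, and the extra `isna()` test is an assumption about how a missing author might show up (a real `None` rather than the string `'None'`).

```python
import pandas as pd

# Hypothetical toy data covering each kind of unwanted row.
toy = pd.DataFrame({
    "Quotes": ["Plain quote.", "See http://example.com", "Another quote.", "Last quote."],
    "Authors": ["Author One", "Author Two", "None", None],
})

# na=False treats missing values as "no match" so the mask stays boolean.
has_url = (
    toy["Quotes"].str.contains("http", na=False)
    | toy["Authors"].str.contains("http", na=False)
)
# Catch both the literal string 'None' and a genuinely missing author.
author_missing = toy["Authors"].str.match("None", na=False) | toy["Authors"].isna()

suspect = toy[has_url | author_missing]
print(suspect.index.tolist())  # [1, 2, 3]
```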


# Statistics
- **How Many Quotes do we have?**

In [73]:
total_quotes = df['Quotes'].count()
total_quotes

476

- **How many Authors do we have?**

In [74]:
auth_grp = df.groupby('Authors').size()
auth_grp.count()

152

- **Which Author is quoted the most frequently?**

In [75]:
with pd.option_context('display.max_rows', None):
    print(auth_grp.sort_values(ascending=False))

Authors
Eckhart Tolle                                                        55
Alan Watts                                                           31
Thich Nhat Hanh                                                      30
Lao Tzu                                                              23
Dōgen Zenji                                                          23
Byron Katie                                                          18
Shunryu Suzuki                                                       17
Dalai Lama XIV                                                       14
Bruce Lee                                                            12
Bodhidharma                                                          11
B.D. Schiers                                                         11
Ajahn Chah                                                            8
Zen Proverb                                                           8
Haemin Sunim                                            
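
For reference, the same ranking can be obtained without an explicit `groupby`. This is a minimal sketch on hypothetical toy data, using `Series.value_counts` (which sorts in descending order by default) and `Series.nunique` for the author count.

```python
import pandas as pd

# Hypothetical toy data standing in for the parsed quotes.
toy = pd.DataFrame({
    "Quotes": ["q1", "q2", "q3", "q4"],
    "Authors": ["Author A", "Author B", "Author A", "Author A"],
})

# Quotes per author, most quoted first.
per_author = toy["Authors"].value_counts()
# Number of distinct authors.
n_authors = toy["Authors"].nunique()

print(per_author.idxmax(), per_author.max(), n_authors)  # Author A 3 2
```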

- List of Authors in *alphabetical order*, with quotation frequency

In [76]:
with pd.option_context('display.max_rows', None):
    print(auth_grp)

Authors
17th Karmapa                                                          1
Adyashanti                                                            7
Aesop                                                                 1
African Proverb, Swahili                                              1
Ajahn Brahm                                                           3
Ajahn Chah                                                            8
Alan Watts                                                           31
Alan Wilson Watts                                                     1
Alfred Korzybski                                                      1
Alfred North Whitehead                                                1
Allen Ginsberg                                                        1
Anne Scottlin                                                         1
Anonymous                                                             1
B.D. Schiers                                            
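
The alphabetical listing shows `Alan Watts` (31) and `Alan Wilson Watts` (1) as separate entries, presumably the same person under two spellings. One possible way to merge such variants before counting is sketched below; the `AUTHOR_ALIASES` mapping is hypothetical and would have to be maintained by hand.

```python
import pandas as pd

# Hypothetical toy data reproducing the two spellings seen in the listing.
toy = pd.DataFrame({
    "Quotes": ["q1", "q2", "q3"],
    "Authors": ["Alan Watts", "Alan Wilson Watts", "Lao Tzu"],
})

# Hypothetical alias table: variant spelling -> canonical name.
AUTHOR_ALIASES = {"Alan Wilson Watts": "Alan Watts"}

# Series.replace with a dict swaps exact matches and leaves other names alone.
normalized = toy.assign(Authors=toy["Authors"].replace(AUTHOR_ALIASES))
counts = normalized.groupby("Authors").size()

print(counts.to_dict())  # {'Alan Watts': 2, 'Lao Tzu': 1}
```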

- If one quote were displayed once a day, every day, all year long, how long would it take to cycle through them all?

In [79]:
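# Approximation: 1 year = 365.25 days, 1 month = 4 weeks.
yr = int(total_quotes // 365.25)
wks = int(total_quotes % 365.25 // 7)  # whole weeks left after full years
mth = f'{wks // 4} month(s), {wks % 4} week(s)'
days = int(round(total_quotes % 365.25 % 7))  # leftover days, rounded
f'{yr} year(s), {mth}, and {days} day(s).'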

'1 year(s), 3 month(s), 3 week(s), and 6 day(s).'
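
The arithmetic above can be wrapped in a small function so it can be re-run for any quote count. This is a sketch under the same approximations (a year of 365.25 days, a month of 4 weeks); the function name `cycle_length` is hypothetical.

```python
def cycle_length(total_quotes):
    """Break a number of days into years, months, weeks and days.

    Uses the same approximations as the cell above:
    1 year = 365.25 days and 1 month = 4 weeks.
    """
    years, rest = divmod(total_quotes, 365.25)
    weeks, day_rest = divmod(rest, 7)
    months, weeks_left = divmod(int(weeks), 4)
    days = int(round(day_rest))
    return (f'{int(years)} year(s), {months} month(s), '
            f'{weeks_left} week(s), and {days} day(s).')


print(cycle_length(476))
# 1 year(s), 3 month(s), 3 week(s), and 6 day(s).
```

Because the fractional year length leaves fractional days, the rounding step can occasionally report 7 days instead of rolling over into an extra week (for example with 372 quotes); the original cell behaves the same way.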