## Exploring Philosophical Texts Across Gender, Region, and Time with Text Analytics
### Haley Egan

#### Project Overview




The goal of this project is to explore cultural patterns in philosophical texts. Main topics and cultural themes are analyzed across the corpus as a whole, and are also compared between groups: male and female philosophical writers, regions/nations, and time periods. Text analytics is a powerful tool for extracting and understanding cultural patterns in large quantities of text.

Key Questions: 
- Do male and female philosophers explore different topics, or do they share similarities? 
- Are different topics discussed in different regions of the world? 
- Are different topics discussed in different time periods? 
- Can the key concepts of philosophical texts be extracted from each text, in an interpretable way? 
- What are the most important issues that philosophers are concerned with? Do these change over time, and do they differ based on gender?

#### Project Data 

For this project, 21 philosophical texts were collected. Ten are by male Western philosophers and span the 4th century BC to the 20th century AD. Four are by men from other parts of the world, including Asia and South America. Seven are by female Western philosophers, most of them from the 20th century. The texts were collected from Project Gutenberg and other online document archives. All are in plain-text format, which simplifies data cleaning, parsing, and manipulation.

The plain-text requirement limited which texts were available, especially for the female philosophers. Most (known) published female philosophers are from the modern era, and free plain-text access to their works is limited. The same is true of modern male philosophers. As a result, the distribution of texts is not even across time or gender, which must be taken into account when analysing them. There are many philosophical texts in the world, and this sample barely scratches the surface. Further exploration with more texts would provide a fuller picture and should be considered for future studies. Even so, the corpus should contain enough textual data to address many cultural questions about philosophical texts.

- 10 Western Philosophical Texts by Male Authors: 
    - Aristotle: Nicomachean Ethics (4th century BC)
    - Plato: The Republic (4th century BC)
    - Cicero: On Moral Duties / De Officiis (44 BC)
    - David Hume: An Enquiry Concerning Human Understanding (1748)
    - Immanuel Kant: Fundamental Principles of the Metaphysic of Morals (1785)
    - Karl Marx: The Communist Manifesto (1848)
    - John Stuart Mill: Utilitarianism (1861) 
    - Friedrich Nietzsche: Beyond Good and Evil (1886)
    - Søren Kierkegaard: Selections from the Writings of Kierkegaard (1923)
    - Michel Foucault: The Order of Things (1966)
- 4 Non-Western Philosophical Texts by Male Authors: 
    - Laozi (Lao Tzu): Tao Te Ching (400 BC) 
    - Paulo Freire: Pedagogy of the Oppressed (1968) 
    - **Sun Tzu: The Art of War (5th century BC)**
    - Hermann Hesse: Siddhartha (1922) 
- 7 Western Philosophical Texts by Female Authors:
    - Mary Wollstonecraft: A Vindication of the Rights of Men (1790)
    - Mary Wollstonecraft: A Vindication of the Rights of Woman (1792)
    - Harriet Taylor Mill: The Enfranchisement of Women (1852)
    - Simone de Beauvoir: The Second Sex (1952)
    - Hannah Arendt: The Origins of Totalitarianism (1951)
    - bell hooks: Ain’t I a Woman: Black Women and Feminism (1981)
    - bell hooks: Feminist Class Struggle (2002) 

#### Methodology

Several Text Analytics tools are used for this project.
- Bag-of-words:
- TF-IDF:
- Principal Component Analysis (PCA): 
- Latent Dirichlet Allocation (LDA):
- Topic Modeling:
- Word Embeddings: word2vec
- Visualizations:
    - Cluster Diagrams:
    - t-SNE:
    - Dispersion Plots:
    - Correlation heatmaps:
    
**Step 1**: Clean texts - import texts, remove beginning and end of texts that are not part of the main corpus. 

**Step 2**: Combine texts into one dataframe. From combined dataframe, create library table (LIB), document table (DOC), token table (TOKEN), and vocabulary table (VOCAB), all exported as csv files. 
- LIB: basic metadata about each book
- DOC: the paragraphs of each book, indexed by the appropriate OHCO levels
- TOKEN: one row per token, with its OHCO index and part-of-speech tag from NLTK
- VOCAB: one row per term, with NLTK stopword flags, Porter stems, and 'pos_max', the most frequent part-of-speech tag for that term in the TOKEN table

Note: these documents do not follow a traditional chapter structure, so no chapter level is used.

**Step 3**: 
- Build TF-IDF Matrix. 
- Get TF-IDF for texts. 
- Create Bag-of-Words.
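
The relationship between the bag-of-words and TF-IDF tables in Step 3 can be sketched on a toy example. This is a minimal illustration rather than the project's actual implementation: the token data is invented, the names `tokens`, `BOW`, `TF`, `DF`, `IDF`, and `TFIDF` are assumptions, and the weighting shown (relative term frequency times log2 inverse document frequency) is only one of several common TF-IDF variants.

```python
import numpy as np
import pandas as pd

# Toy token table: one row per token, tagged with the document it came from.
tokens = pd.DataFrame({
    'text_id':  [1, 1, 1, 2, 2, 3, 3, 3],
    'term_str': ['virtue', 'virtue', 'good', 'good', 'state',
                 'good', 'women', 'women'],
})

# Bag-of-words: rows are documents, columns are terms, cells are raw counts.
BOW = tokens.groupby(['text_id', 'term_str']).size().unstack(fill_value=0)

# Term frequency: counts normalised by document length.
TF = BOW.div(BOW.sum(axis=1), axis=0)

# Inverse document frequency: log2(N / number of documents containing the term).
DF = (BOW > 0).sum(axis=0)
IDF = np.log2(len(BOW) / DF)

# Multiplying a DataFrame by a Series aligns the Series with the term columns.
TFIDF = TF * IDF
print(TFIDF.round(3))
```

A term such as `good`, which occurs in every toy document, receives a weight of zero, while terms concentrated in a single document are weighted up. That is the property that makes TF-IDF useful for comparing texts by author group.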

**Step 4**: PCA
- Reduce number of features by removing proper nouns and insignificant words
- Vectorize TF-IDF and extract term covariance matrix
- Apply eigendecomposition to COV table
- Look at top components from each text
- Look at top components by author gender
- Plots: covariance matrix, eigen tables, Document Component Matrix (DCM), scatter plots
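
The covariance and eigendecomposition steps listed for Step 4 can be sketched as follows. The TF-IDF values here are made up, and the names `centered`, `COV`, `LOADINGS`, `DCM`, and `explained` are illustrative assumptions rather than the tables the project actually builds.

```python
import numpy as np
import pandas as pd

# Toy TF-IDF matrix: 4 documents x 3 terms (values are invented).
TFIDF = pd.DataFrame(
    [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.1],
     [0.1, 0.9, 0.7],
     [0.0, 0.8, 0.9]],
    index=pd.Index([1, 2, 3, 4], name='text_id'),
    columns=['virtue', 'women', 'class'],
)

# Centre each term column, then compute the term-by-term covariance matrix.
centered = TFIDF - TFIDF.mean()
COV = centered.cov()

# eigh is appropriate because a covariance matrix is symmetric.
eig_vals, eig_vecs = np.linalg.eigh(COV.values)

# eigh returns eigenvalues in ascending order; reorder to put the largest first.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Loadings: how much each term contributes to each component.
LOADINGS = pd.DataFrame(eig_vecs, index=COV.index,
                        columns=[f'PC{i}' for i in range(len(eig_vals))])

# Document-component matrix (DCM): documents projected onto the components.
DCM = centered.dot(LOADINGS)

# Share of total variance captured by each component.
explained = eig_vals / eig_vals.sum()
print(DCM.round(3))
print(explained.round(3))
```

Plotting the first two columns of `DCM` as a scatter plot, coloured by a LIB attribute such as author gender, is one way to see whether groups of texts separate along the leading components.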

**Step 5**: Distance Metrics
- Plots: cluster diagrams for texts by year

**Step 6**: Topic Modeling
- Top overall topics & topics by weight
- Topics by gender
- Topics by era
- Topics by region
- Plots: cluster diagrams, gradient graphs, horizontal bar charts, topic tables

**Step 7**: Word Embeddings
- Specific authors
- word2vec
- t-SNE Plots by author
- Semantic Algebra

**Step 8**: Sentiment Analysis
- NRC lexicons (multiple languages?)
- 8 emotions
- top emotions per text
- top emotions per group 

In [1]:
#import libraries
import pandas as pd
import numpy as np
from glob import glob
import re
import nltk
from pathlib import Path
%matplotlib inline

In [2]:
#hide warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
#NLTK downloads
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\haley\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\haley\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\haley\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\haley\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

#### OHCO Model & Structure Decisions

OHCO stands for Ordered Hierarchy of Content Objects. Breaking text elements down to the OHCO hierarchical levels allows us to create a database of text elements for data exploration. 

For this project, the OHCO consists of four levels: text_id, the unique ID given to each text so that it can be quickly distinguished from the others; para_num, the index of each paragraph within a text; sent_num, the index of each sentence within a paragraph; and token_num, the index of each individual token, or word, within a sentence. Many OHCO models also include chapter numbers, so that texts can be examined at the chapter level. This is often less computationally costly than examining a text at the paragraph or sentence level, especially for very long texts. For this project, however, chapters are not included, because most of the philosophical texts do not follow a chapter structure. For example, the two texts by Wollstonecraft follow a letter format, and the texts by hooks are in essay format. Because of these inconsistencies in document structure, the documents are broken down to the paragraph, sentence, and token level.

In [4]:
#set OHCO
OHCO = ['text_id', 'para_num', 'sent_num', 'token_num']
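
To illustrate why the order of the OHCO levels matters, the following sketch regroups a toy token table at the sentence, paragraph, and text level by slicing the OHCO list. The token values and the names `toy`, `SENTS`, `PARAS`, and `TEXTS` are invented for the example.

```python
import pandas as pd

OHCO = ['text_id', 'para_num', 'sent_num', 'token_num']

# Toy token table indexed by the full OHCO (values are invented).
rows = [
    (1, 0, 0, 0, 'Every'), (1, 0, 0, 1, 'art'),
    (1, 0, 1, 0, 'All'),   (1, 0, 1, 1, 'men'),
    (1, 1, 0, 0, 'Virtue'),
    (2, 0, 0, 0, 'Justice'), (2, 0, 0, 1, 'is'),
]
toy = pd.DataFrame(rows, columns=OHCO + ['token_str']).set_index(OHCO)

# Slicing the OHCO list selects the level of analysis.
SENTS = toy.groupby(OHCO[:3])['token_str'].apply(' '.join).to_frame('sent_str')
PARAS = toy.groupby(OHCO[:2])['token_str'].apply(' '.join).to_frame('para_str')
TEXTS = toy.groupby(OHCO[:1])['token_str'].apply(' '.join).to_frame('text_str')

print(SENTS)
print(PARAS)
print(TEXTS)
```

Because each level nests inside the one before it, the same token table can serve as the source for any coarser unit of analysis without re-parsing the files.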

In [5]:
#set directory for philosophy texts
philostext_dir = 'philostexts'

In [6]:
#first and last line of the main body of each text, keyed by text_id
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"
chap_pats = {
    1: {
        'start_line': 17,
        'end_line': 7723},
    2: {
        'start_line': 14,
        'end_line': 9370},
    3: {
        'start_line': 7,
        'end_line': 180},
    4: {
        'start_line': 90,
        'end_line': 12102},
    5: {
        'start_line': 1081,
        'end_line': 19707},
    6: {
        'start_line': 396,
        'end_line': 4716},
    7: {
        'start_line': 502,
        'end_line': 30458},
    8: {
        'start_line': 155,
        'end_line': 1371},
    9: {
        'start_line': 72,
        'end_line': 3992},
    10: {
        'start_line': 89,
        'end_line': 5281},
    11: {
        'start_line': 62,
        'end_line': 3113},
    12: {
        'start_line': 107,
        'end_line': 8653},
    13: {
        'start_line': 80,
        'end_line': 1433},
    14: {
        'start_line': 75,
        'end_line': 2319},
    15: {
        'start_line': 618,
        'end_line': 9483},
    16: {
        'start_line': 151,
        'end_line': 6007},
    17: {
        'start_line': 61,
        'end_line': 24563},
    18: {
        'start_line': 902,
        'end_line': 39254},
    19: {
        'start_line': 12,
        'end_line': 2437},
    #20: {
     #   'start_line': 268,
     #   'end_line': 6783,
    # },
    21: {
        'start_line': 135,
        'end_line': 2739}}

In [7]:
#make a list of all text files
text_list = [text for text in sorted(glob(philostext_dir+'/*.txt'))]
text_list

['philostexts\\Aristotle_NicomachaenEthics-1.txt',
 'philostexts\\Cicero_OnDuties-4.txt',
 'philostexts\\Foucault_TheOrderofThings-5.txt',
 'philostexts\\Freire_PedagogyOfTheOppressed-6.txt',
 'philostexts\\HannahArendt_TheOriginsofTotalitarianism-7.txt',
 'philostexts\\HarrietTaylorMill_EnfranchisementofWomen-8.txt',
 'philostexts\\Hesse_Siddhartha-9.txt',
 'philostexts\\Hume_AnEnquiryConcerningHumanUnderstanding-10.txt',
 'philostexts\\Kant_MetaphysicsOfMorals-11.txt',
 'philostexts\\Kierkegaard_CollectionOfWritings-12.txt',
 'philostexts\\Laozi_TaoTeChing-21.txt',
 'philostexts\\Marx_CommunistManifesto-13.txt',
 'philostexts\\MaryWollstonecraft_AVindicationOfTheRightsofMen-14.txt',
 'philostexts\\MaryWollstonecraft_AVindicationOfTheRightsofWoman-15.txt',
 'philostexts\\Nietzsche_BeyondGoodandEvil-16.txt',
 'philostexts\\Plato_TheRepublic-17.txt',
 'philostexts\\Simonedebeauvoir_TheSecondSex-18.txt',
 'philostexts\\StuartMill_Utilitarianism-19.txt',
 'philostexts\\bellhooks_AintIAWom

### Clean Texts and Build Library and Document Tables

In [8]:
#function to clean texts, split into paragraphs, and build dataframes
def clean_texts(text_list, chap_pats, OHCO=OHCO):
    lib = []
    doc = []
    for text in text_list:
        # Get ID from filename
        text_id = int(text.split('-')[-1].split('.')[0])
        #print(text_id)

        #get text title 
        text_title = text.split('_')[-1].split('-')[0]
        #print(text_title)

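        #get text author from the file name (Path uses the OS-native path separator)
        text_author = Path(text).name.split('_')[0]
        #print(text_author)    

        #read files as lines, closing the file handle afterwards
        with open(text, 'r', encoding='utf-8-sig') as f:
            lines = f.readlines()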
        
        #create dataframe to store text details
        df = pd.DataFrame(lines, columns=['line_str']) #add lines to df
        df.index.name = 'line_num' #set line_num as index for each line_str
        df.line_str = df.line_str.str.strip() #strip white space from lines
        #df['text_id'] = text_id #add text_id column to df
        
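        #remove inconsistent chapter headings (str.replace removes the word wherever it occurs in a line)
        df.line_str = df.line_str.str.replace('CHAPTER', '', regex=False).str.strip()
        df.line_str = df.line_str.str.replace('Chapter', '', regex=False).str.strip()
        
        #remove all digits (page numbers, but also any other numerals in the text)
        df.line_str = df.line_str.str.replace(r'[0-9]+', '', regex=True)
        
        # pad dashes and hyphens with spaces to improve tokenization
        df.line_str = df.line_str.str.replace('—', ' — ', regex=False)
        df.line_str = df.line_str.str.replace('-', ' - ', regex=False)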
        
        #remove front and back matter that is not part of the main text
        a = chap_pats[text_id]['start_line'] - 1
        b = chap_pats[text_id]['end_line'] + 1
        df = df.iloc[a:b]    

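        #split into paragraphs
        #note: each line was stripped above, so no line contains '\n\n'; every
        #non-empty line becomes its own row and para_num is the original line number
        df = df['line_str'].str.split(r'\n\n+', expand=True).stack().to_frame().rename(columns={0:'para_str'})
        df.index.names = OHCO[1:3]
        df['para_str'] = df['para_str'].str.replace(r'\n', ' ', regex=True).str.strip()
        df = df[~df['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs    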

        # Set index
        df['text_id'] = text_id
        df = df.reset_index().set_index(OHCO[:2])  
        df = df.drop(['sent_num'], axis=1) #drop sent_num - not needed at this point  

        #append extracted info to lists
        lib.append((text_id, text_title, text_author, text))
        doc.append(df)
    
    docs = pd.concat(doc) #combine all per-text frames into one dataframe
    #create new df with title, author, file, and text id index
    library = pd.DataFrame(lib, columns=['text_id', 'title', 'author', 'file']).set_index('text_id')
    return docs, library

In [9]:
#call function to make LIB and DOC tables
DOC, LIB = clean_texts(text_list, chap_pats)
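
In `clean_texts`, each line is stripped before the split on blank lines, so every non-empty line ends up as its own row of DOC and `para_num` is effectively a line number. If true paragraph units are wanted, one possible alternative is to number the blocks of text between blank lines and join each block. The sketch below shows the idea on invented lines; the names `lines`, `is_blank`, `block_id`, and `paras` are assumptions and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the stripped lines of one text; '' marks a blank line.
lines = pd.Series([
    'Every art and every inquiry,',
    'is thought to aim at some good.',
    '',
    'But a certain difference is found',
    'among ends.',
    '',
    '',
    'A third paragraph.',
], name='line_str')

# Each blank line increments the counter, so lines in the same block share an id.
is_blank = lines.str.strip().eq('')
block_id = is_blank.cumsum()

# Drop the blank lines, then join each block's lines into one paragraph string.
paras = (
    lines[~is_blank]
    .groupby(block_id[~is_blank])
    .apply(' '.join)
    .reset_index(drop=True)
    .to_frame('para_str')
)
paras.index.name = 'para_num'
print(paras)
```

Grouping this way would give far fewer and longer DOC rows than the line-level table shown in the outputs here, which in turn changes the unit over which later statistics are computed, so it is a design choice rather than a drop-in fix.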

In [10]:
LIB.sample(10)

text_id,title,author,file
17,TheRepublic,Plato,philostexts\Plato_TheRepublic-17.txt
1,NicomachaenEthics,Aristotle,philostexts\Aristotle_NicomachaenEthics-1.txt
19,Utilitarianism,StuartMill,philostexts\StuartMill_Utilitarianism-19.txt
16,BeyondGoodandEvil,Nietzsche,philostexts\Nietzsche_BeyondGoodandEvil-16.txt
6,PedagogyOfTheOppressed,Freire,philostexts\Freire_PedagogyOfTheOppressed-6.txt
4,OnDuties,Cicero,philostexts\Cicero_OnDuties-4.txt
12,CollectionOfWritings,Kierkegaard,philostexts\Kierkegaard_CollectionOfWritings-1...
3,FeministClassStruggle,bellhooks,philostexts\bellhooks_FeministClassStruggle-3.txt
8,EnfranchisementofWomen,HarrietTaylorMill,philostexts\HarrietTaylorMill_Enfranchisemento...
5,TheOrderofThings,Foucault,philostexts\Foucault_TheOrderofThings-5.txt


In [11]:
DOC.sample(10)

text_id,para_num,para_str
15,8130,senses will ever be at work to harden their he...
5,3789,"Bopp, Ray and Cuvier, Petty and Ricardo, the f..."
17,11762,slain my son.’
15,623,"understand me, which I do not suppose many per..."
7,13967,"poet Tyutchev asserted at the same time that ""..."
4,6782,"ipsi est, tum etiam sordidum ad famam, committ..."
18,34531,Bashkirtseff was so intoxicated by her beauty ...
9,1866,saw them complaining about pain at which a Sam...
16,2990,heaven of OUR life. There are few pains so gri...
12,3827,"She, however, seemed only to be glad that it t..."


### Create TOKEN table to extract NLTK Part-of-Speech tags

In [12]:
def tokenize(doc_df, OHCO=OHCO, remove_pos_tuple=False, ws=False):
    
    # Paragraphs to Sentences
    df = doc_df.para_str\
        .apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame()\
        .rename(columns={0:'sent_str'})
    
    # Sentences to Tokens
    # Local function to pick tokenizer
    def word_tokenize(x):
        if ws:
            s = pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))
        else:
            s = pd.Series(nltk.pos_tag(nltk.word_tokenize(x)))
        return s
            
    df = df.sent_str\
        .apply(word_tokenize)\
        .stack()\
        .to_frame()\
        .rename(columns={0:'pos_tuple'})
    
    # Grab info from tuple
    df['pos'] = df.pos_tuple.apply(lambda x: x[1])
    df['token_str'] = df.pos_tuple.apply(lambda x: x[0])
    if remove_pos_tuple:
        df = df.drop(columns='pos_tuple')
    
    # Add index
    df.index.names = OHCO
    
    return df

In [13]:
TOKEN = tokenize(DOC, ws=True)

In [14]:
TOKEN.head()

text_id,para_num,sent_num,token_num,pos_tuple,pos,token_str
1,16,0,0,"(Every, DT)",DT,Every
1,16,0,1,"(art, NN)",NN,art
1,16,0,2,"(and, CC)",CC,and
1,16,0,3,"(every, DT)",DT,every
1,16,0,4,"(inquiry,, NN)",NN,"inquiry,"


### Create VOCAB Table with NLTK Stopwords, Porter Stems, Most Frequent Parts-of-Speech Tags

In [15]:
#Extract a vocabulary from the TOKEN table
#lowercase all tokens and strip punctuation, whitespace, and underscores (letters and digits are kept)
TOKEN['term_str'] = TOKEN['token_str'].str.lower().str.replace(r'[\W_]', '', regex=True)

#count each term, sort alphabetically, and use the row position as term_id
VOCAB = TOKEN.term_str.value_counts().rename('n')\
    .sort_index().rename_axis('term_str').reset_index()
VOCAB.index.name = 'term_id'

#flag terms that begin with a digit
VOCAB['num'] = VOCAB.term_str.str.match(r"\d+").astype('int')

VOCAB

term_id,term_str,n,num
0,,32337,0
1,a,34591,0
2,aa,4,0
3,aarhus,1,0
4,ab,105,0
...,...,...,...
50195,ἅπαν,1,0
50196,ἐξοίχεται,1,0
50197,ὁρμαί,1,0
50198,ὁρμάς,1,0


In [16]:
#add stopwords
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1
sw.sample(10)

term_str,dummy
such,1
should've,1
she's,1
wouldn't,1
who,1
be,1
him,1
aren,1
been,1
haven't,1


In [17]:
VOCAB['stop'] = VOCAB.term_str.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')

VOCAB[VOCAB.stop == 1].sample(10)

term_id,term_str,n,num,stop
29147,more,5192,0,1
49697,won,82,0,1
20555,he,12294,0,1
24409,itself,1750,0,1
20522,haven,8,0,1
127,about,1825,0,1
29566,my,1951,0,1
50025,your,917,0,1
1226,against,1352,0,1
49378,what,4727,0,1


In [18]:
#add stems
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
VOCAB['p_stem'] = VOCAB.term_str.apply(stemmer.stem)

VOCAB.sample(10)

term_id,term_str,n,num,stop,p_stem
6286,castigatione,1,0,0,castigation
47196,unleashed,3,0,0,unleash
28368,mice,1,0,0,mice
41738,snivel,1,0,0,snivel
49439,whispering,8,0,0,whisper
31129,offhand,1,0,0,offhand
45024,theres,7,0,0,there
2550,applications,13,0,0,applic
26118,lhistoire,1,0,0,lhistoir
43842,suppliant,1,0,0,suppliant


In [19]:
TOKEN.sample(10)

text_id,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
9,1864,0,6,"(for, IN)",IN,for,for
17,2857,0,6,"(from, IN)",IN,from,from
17,20171,0,1,"(be, VB)",VB,be,be
19,113,0,7,"(the, DT)",DT,the,the
5,14242,1,4,"(therefore, VB)",VB,therefore,therefore
9,2142,0,6,"(of, IN)",IN,of,of
18,10048,0,3,"(was, VBD)",VBD,was,was
15,6441,0,10,"(its, PRP$)",PRP$,its,its
8,1036,1,0,"(When, WRB)",WRB,When,when
18,16191,0,4,"(experience, NN)",NN,experience,experience


In [20]:
#add 'term_id' column to TOKEN to make merging with VOCAB easier
TOKEN['term_id'] = TOKEN.term_str.map(VOCAB.reset_index().set_index('term_str').term_id)

In [21]:
TOKEN.head()

text_id,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str,term_id
1,16,0,0,"(Every, DT)",DT,Every,every,15890
1,16,0,1,"(art, NN)",NN,art,art,2942
1,16,0,2,"(and, CC)",CC,and,and,1976
1,16,0,3,"(every, DT)",DT,every,every,15890
1,16,0,4,"(inquiry,, NN)",NN,"inquiry,",inquiry,23364


In [22]:
#Finally, add a feature named "pos_max" to the VOCAB table that contains the most frequently 
#associated part-of-speech tag, as found in the TOKEN table, with each term.
VOCAB['pos_max'] = TOKEN.groupby(['term_id', 'pos']).count().iloc[:,0].unstack().idxmax(1)
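
The `pos_max` computation can be followed step by step on a small example. The data and the names `toy`, `counts`, and `pos_max` are invented for illustration. `size()` is used here in place of `count().iloc[:,0]`, which should give the same per-pair counts when the first column has no missing values.

```python
import pandas as pd

# Toy token table: term ids with their tagged parts of speech (invented data).
toy = pd.DataFrame({
    'term_id': [7, 7, 7, 9, 9],
    'pos':     ['NN', 'NN', 'VB', 'JJ', 'JJ'],
})

# Count each (term, tag) pair and pivot the tags into columns.
counts = toy.groupby(['term_id', 'pos']).size().unstack(fill_value=0)

# For each term (row), take the tag (column label) with the highest count.
pos_max = counts.idxmax(axis=1)
print(counts)
print(pos_max)
```

Term 7 is tagged `NN` twice and `VB` once, so its `pos_max` is `NN`. Ties are resolved in favour of the first column in sort order, which is worth keeping in mind for rare terms.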

In [23]:
VOCAB.sample(10)

term_id,term_str,n,num,stop,p_stem,pos_max
14176,effectuate,3,0,0,effectu,VB
24174,iracundiam,2,0,0,iracundiam,NN
1798,amicably,2,0,0,amic,NN
350,accomplice,6,0,0,accomplic,NN
32976,peperit,1,0,0,peperit,NN
37322,reconquest,1,0,0,reconquest,NN
8590,coneemed,1,0,0,coneem,VBN
42793,stautes,1,0,0,staut,NNP
36799,rain,26,0,0,rain,NN
26828,lubricus,1,0,0,lubricu,NN


In [24]:
#save to csv
DOC.to_csv('DOC.csv')
LIB.to_csv('LIB.csv')
VOCAB.to_csv('VOCAB.csv')
TOKEN.to_csv('TOKEN.csv')
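
When the saved tables are read back in a later notebook, the index levels come back as ordinary columns unless they are named explicitly. The sketch below shows one way to restore the OHCO index, using an in-memory buffer and invented rows in place of the real `TOKEN.csv`. The names `toy`, `buffer`, and `restored` are assumptions.

```python
import io
import pandas as pd

OHCO = ['text_id', 'para_num', 'sent_num', 'token_num']

# Toy TOKEN table with the same index layout as above (values invented).
toy = pd.DataFrame(
    {'pos': ['DT', 'NN'], 'term_str': ['every', 'art']},
    index=pd.MultiIndex.from_tuples([(1, 16, 0, 0), (1, 16, 0, 1)], names=OHCO),
)

# Write to an in-memory buffer; a file path such as 'TOKEN.csv' works the same way.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores the OHCO levels that to_csv wrote out as ordinary columns.
restored = pd.read_csv(buffer, index_col=OHCO)
print(restored)
```

Columns holding Python objects, such as `pos_tuple`, are written to CSV as plain strings, so they do not come back as tuples on reload.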