<a href="https://colab.research.google.com/github/Mandi-Li/ChatQuine-TriGram-Model/blob/main/ChatQuine1_NGram_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**ChatQuine: A Tri-Gram Language Model Simulating Philosopher Willard Van Orman Quine**

Programmer and Project Creator: **Mandi Li**


University of Amsterdam


Note: this is ChatQuine1.0, trained as a tri-gram model. I also led my team in creating **ChatQuine2.0, built on the GPT-3.5-turbo model**: https://github.com/Cameron-Meinsma/ChatQuine






*Simulating the brain of philosopher Willard Van Orman Quine: what would Quine's perspective be on contemporary philosophical challenges if he were alive today?*

*By emulating Quine's linguistic and thinking patterns in an AI language model, our research represents an initial stride towards 'mind uploading,' a transhumanist proposition aimed at overcoming humanity's inherent flaw of mortality.*

**Introduction**

Created by Mandi Li, ChatQuine is a vertical philosophy chatbot that is capable of engaging in extensive philosophical discussions, explaining complicated philosophical concepts, and completing philosophical texts from the perspective of philosopher Quine. This is ChatQuine1.0 trained on a tri-gram language model.

In many sci-fi movies, posthumans upload their consciousness into computers to achieve immortality. While today’s Artificial Intelligence (AI) cannot simulate human consciousness, AI language models are already capable of replicating human language patterns. As historian Yuval Noah Harari suggests, language is the foundation of human civilisation (2023).  Thus, AI’s mastery of language can provide us with a simulated copy of human culture.

This research aims to delve into the mind of the philosopher Willard Van Orman Quine and explore how he would approach contemporary philosophical questions if he were alive today. Quine’s scientific approach to logic is still relevant in contemporary philosophical discourse. By emulating Quine's thinking patterns in a generative language model, I seek to simulate his cognitive processes and gain unique insights into his philosophical logic.
Additionally, this research goes beyond philosophical exploration. It represents an initial stride towards 'mind uploading,' a transhumanist proposition aimed at overcoming humanity's inherent flaw of mortality. By emulating Quine's linguistic patterns in an AI language model, I aim to bridge the gap between human consciousness and machine intelligence, paving the way for preserving and continuing the intellectual legacy of great thinkers like Quine.




**Dataset**

Because the goal is a generative language model that emulates the thinking patterns of the philosopher Quine, the main source of data is the Quine corpus. The data comprises two text files: a lemmatized version of the corpus and an un-lemmatized version of the same corpus. This model was trained on the lemmatized one. The Quine corpus consists of all 228 articles and books that Quine wrote in his lifetime and comprises a total of 819 documents (Betti et al. 2020). When I constructed the ChatQuine tri-gram model, I used Python to count the words: the corpus contains 2,150,356 word tokens. The corpus covers a wide range of topics and genres from different phases of Quine’s life. Furthermore, according to Betti et al. (2020), the “lexical variation” of Quine’s corpus is rather “high”.
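
The notebook does not show how the token count was obtained, so the following is only a minimal sketch of one way to do it, assuming whitespace-separated tokens. The helper `count_tokens` and the sample string are hypothetical, and a different tokenizer would give a different number.

```python
# Hypothetical helper (not from the original notebook): count word tokens
# by splitting on whitespace. This is an assumption about the counting method.
def count_tokens(text: str) -> int:
    return len(text.split())

# Toy stand-in for the corpus; the real Quine corpus is not published.
sample = ("the connective e of membership be adopt as part of "
          "pron primitive logical notation")
print(count_tokens(sample))  # 14 tokens in this toy sample
```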

Notably, I acquired this dataset from the research of Betti et al. (2020). For privacy reasons, I have agreed not to publish it. All I am publishing here is the original code I wrote myself, so I am not violating the data use agreement I signed with Betti et al.'s research team. If you want to build a project similar to ChatQuine, you can use my code, but you will have to collect the dataset yourself. Good luck!

**Result Synopsis**

The two randomly generated sentences by ChatQuine are “hypotenuse be equal and opposite line in that book later pron be just these basic disagreement in the order of” and “trapping sentence by deep analysis of the function satisfy by x the hierarchy to which neg and x x y etc. stand in the main business of semantic.” Judged by intrinsic evaluation, the terminology and syntax of these outputs agree with Quine’s philosophy of language, epistemology, and metaphysics.

However, these nonhuman outputs lack grammatical accuracy and semantic meaning. This 'stupidity' probably results from the small training dataset and the simplicity of the tri-gram model. Thus, in my next project, ChatQuine 2.0 (https://github.com/Cameron-Meinsma/ChatQuine), my team built the chatbot on a GPT-3.5 model and on virtually all the internet content regarding Quine.




***Citations***

Programming: https://www.kaggle.com/code/alvations/n-gram-language-model-with-nltk/notebook

Dataset: Betti, Arianna, Martin Reynaert, Thijs Ossenkoppele, Yvette Oortwijn, Andrew Salway, and Jelke Bloem. 2020. “Expert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains.” In Proceedings of the 28th International Conference on Computational Linguistics, pages 6690–6702, Barcelona, Spain (Online). International Committee on Computational Linguistics. https://aclanthology.org/2020.coling-main.586/

  (Note: this dataset is not published here due to privacy reasons!)

In [None]:
from nltk.util import pad_sequence
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten

Open our tokenized text file of Quine's work:

In [None]:
import os

# The file is already tokenized and lemmatized, so we only need to read it in.
corpus_path = '/content/drive/MyDrive/Quine_data/QUINEV05FOLIASPACY.ColTextLem.20200516.W2Vnorm.PerlLimit.txt'
if not os.path.isfile(corpus_path):
    # Fail early with a clear message instead of a NameError on `text` below.
    raise FileNotFoundError(corpus_path)
with open(corpus_path, encoding='utf8') as fin:
    text = fin.read()
print(text[:100])
print(text[1:3])

{{N}} logical formules
the connective e of membership be adopt as part of pron primitive logical not
{N


Tokenize our text:

In [None]:
try:  # Use the default NLTK tokenizers.
    from nltk import word_tokenize, sent_tokenize
    # Test whether they work.
    # They can fail on some machines because of setup issues.
    word_tokenize(sent_tokenize("This is a foobar sentence. Yes it is.")[0])
except Exception:  # Fall back to a naive sentence splitter and toktok.
    import re
    from nltk.tokenize import ToktokTokenizer
    # See https://stackoverflow.com/a/25736515/610569
    sent_tokenize = lambda x: re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', x)
    # Use the toktok tokenizer, which requires no dependencies.
    toktok = ToktokTokenizer()
    word_tokenize = toktok.tokenize

In [None]:
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]

In [None]:
tokenized_text[0]

['{',
 '{',
 'n',
 '}',
 '}',
 'logical',
 'formules',
 'the',
 'connective',
 'e',
 'of',
 'membership',
 'be',
 'adopt',
 'as',
 'part',
 'of',
 'pron',
 'primitive',
 'logical',
 'notation',
 'along',
 'with',
 'the',
 'connective',
 'j',
 'of',
 'joint',
 'denial',
 'and',
 'the',
 'quantifier',
 'and',
 'variable',
 'which',
 'constitute',
 'the',
 'notation',
 'of',
 'quantification',
 'by',
 'put',
 'v',
 'between',
 'variable',
 'pron',
 'obtain',
 'formulae',
 'xfz',
 'which',
 'be',
 'atomic',
 'in',
 'the',
 'sense',
 'of',
 'have',
 'no',
 'other',
 'formula',
 'as',
 'part',
 'this',
 'be',
 'the',
 'first',
 'time',
 'atomic',
 'formula',
 'have',
 'come',
 'to',
 'hand',
 'hitherto',
 'though',
 'formula',
 'in',
 'general',
 'be',
 'describe',
 'as',
 'build',
 'up',
 'of',
 'atomic',
 'formula',
 'by',
 'joint',
 'denial',
 'and',
 'quantifier',
 'the',
 'atomic',
 'formula',
 'pron',
 'be',
 'leave',
 'unspecified',
 'cf',
 '{',
 '{',
 'n',
 '}',
 '}',
 'strictly',
 ...

In [None]:
print(tokenized_text[0][:50])

['{', '{', 'n', '}', '}', 'logical', 'formules', 'the', 'connective', 'e', 'of', 'membership', 'be', 'adopt', 'as', 'part', 'of', 'pron', 'primitive', 'logical', 'notation', 'along', 'with', 'the', 'connective', 'j', 'of', 'joint', 'denial', 'and', 'the', 'quantifier', 'and', 'variable', 'which', 'constitute', 'the', 'notation', 'of', 'quantification', 'by', 'put', 'v', 'between', 'variable', 'pron', 'obtain', 'formulae', 'xfz', 'which']


Padding sentences before splitting them into ngrams.

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

# Preprocess the tokenized text for 3-gram language modelling.
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
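
To make the padding step concrete, here is a minimal pure-Python sketch of what padding and n-gram extraction do for a single toy sentence. The functions `pad_both_ends_sketch` and `everygrams_sketch` are hypothetical re-implementations written for illustration, not NLTK's own code, and the order in which NLTK yields its n-grams may differ.

```python
from typing import Iterator, List, Tuple

def pad_both_ends_sketch(sent: List[str], n: int) -> List[str]:
    """Add n-1 start markers and n-1 end markers around a sentence."""
    return ["<s>"] * (n - 1) + list(sent) + ["</s>"] * (n - 1)

def everygrams_sketch(tokens: List[str], max_len: int) -> Iterator[Tuple[str, ...]]:
    """Yield every contiguous n-gram of length 1 up to max_len."""
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            if start + length <= len(tokens):
                yield tuple(tokens[start:start + length])

# Toy sentence (an assumption for illustration, not taken from the corpus).
padded = pad_both_ends_sketch(["logic", "be", "fun"], n=3)
print(padded)

# Keep only the trigrams to see how the padding gives the first word a context.
trigrams = [g for g in everygrams_sketch(padded, 3) if len(g) == 3]
print(trigrams)
```

With two start markers in front, even the first real word of a sentence has a full two-word context, which is why the padding happens before the n-grams are extracted.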

Training an N-gram Model:

In [None]:
from nltk.lm import MLE
model = MLE(n)  # Create an untrained 3-gram model; n = 3 was set above

Initializing the MLE model creates an empty vocabulary:

In [None]:
len(model.vocab)

0

In [None]:
model.fit(train_data, padded_sents)
print(model.vocab)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 25519 items>


In [None]:
len(model.vocab)

25519

The vocabulary helps us handle words that have not occurred during training.

In [None]:
print(model.vocab.lookup(tokenized_text[0][:50]))

('{', '{', 'n', '}', '}', 'logical', 'formules', 'the', 'connective', 'e', 'of', 'membership', 'be', 'adopt', 'as', 'part', 'of', 'pron', 'primitive', 'logical', 'notation', 'along', 'with', 'the', 'connective', 'j', 'of', 'joint', 'denial', 'and', 'the', 'quantifier', 'and', 'variable', 'which', 'constitute', 'the', 'notation', 'of', 'quantification', 'by', 'put', 'v', 'between', 'variable', 'pron', 'obtain', 'formulae', 'xfz', 'which')


In [None]:
# If we look up unseen sentences that are not from the training data,
# the vocabulary automatically replaces words it does not contain with `<UNK>`.
print(model.vocab.lookup('language is never random lah .'.split()))


('language', 'is', 'never', 'random', '<UNK>', '<UNK>')
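
The lookup behaviour can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not NLTK's `Vocabulary` class: `build_vocab` and `lookup` are hypothetical helpers, and the toy training tokens are made up. The sketch keeps every word whose count reaches the cutoff and maps everything else to `<UNK>`.

```python
from collections import Counter
from typing import Iterable, List, Set

def build_vocab(tokens: Iterable[str], cutoff: int = 1) -> Set[str]:
    """Keep the words that occur at least `cutoff` times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= cutoff}

def lookup(words: List[str], vocab: Set[str], unk: str = "<UNK>") -> List[str]:
    """Replace every word outside the vocabulary with the unknown label."""
    return [word if word in vocab else unk for word in words]

# Toy training tokens (an assumption for illustration).
vocab = build_vocab("language be never random".split())
print(lookup("language be never random lah .".split(), vocab))
# ['language', 'be', 'never', 'random', '<UNK>', '<UNK>']
```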


Using the N-gram Language Model:

When it comes to n-gram models, training boils down to counting the n-grams in the training corpus.

In [None]:
print(model.counts)

<NgramCounter with 3 ngram orders and 5909169 ngrams>


Calculate a score for how probable words are in certain contexts.

This being MLE, the model returns the item's relative frequency as its score.

In [None]:
model.score('logical')  # P('logical')

0.0010656315301026946

In [None]:
model.score('is', 'logical'.split())  # P('is' | 'logical')

0.0
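
The relative-frequency score can be sketched without NLTK. For a trigram model the estimate is P(w | u, v) = count(u, v, w) / count(u, v), so any word that never followed the context in training gets a score of exactly 0.0. The helpers `train_counts` and `mle_score` and the toy tokens below are hypothetical and only illustrate the idea; they are not NLTK's implementation.

```python
from collections import Counter

def train_counts(tokens, n=3):
    """Count n-grams and the (n-1)-word contexts they start with."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        ngram_counts[ngram] += 1
        context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def mle_score(word, context, ngram_counts, context_counts):
    """Relative frequency of `word` after `context`; 0.0 if never observed."""
    context = tuple(context)
    total = context_counts[context]
    if total == 0:
        return 0.0
    return ngram_counts[context + (word,)] / total

# Toy tokens (an assumption for illustration, not the Quine corpus).
tokens = "atomic formula be atomic formula have atomic formula be".split()
ngram_counts, context_counts = train_counts(tokens)

# 'be' follows ('atomic', 'formula') in 2 of the 3 occurrences of that context.
print(mle_score("be", ("atomic", "formula"), ngram_counts, context_counts))
# 'leave' never follows that context, so its score is zero.
print(mle_score("leave", ("atomic", "formula"), ngram_counts, context_counts))
```

Zero scores for unseen combinations are a known limitation of unsmoothed maximum likelihood estimation.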

Items that are not seen during training are mapped to the vocabulary's "unknown label" token. This is "<UNK>" by default.

In [None]:
model.score("<UNK>") == model.score("hahaha")

True

Generation using N-gram Language Model:

Use our model to generate a random text:

In [None]:
print(model.generate(20, random_seed=7))

['hypotenuse', 'be', 'equal', 'and', 'opposite', 'line', 'in', 'that', 'book', 'later', 'pron', 'be', 'just', 'these', 'basic', 'disagreement', 'in', 'the', 'order', 'of']


Make the generated text more readable by removing the padding symbols and detokenizing it:

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [None]:
generate_sent(model, 20, random_seed=7)

'hypotenuse be equal and opposite line in that book later pron be just these basic disagreement in the order of'

In [None]:
generate_sent(model, 28, random_seed=0)

'trapping sentence by deep analysis of the function satisfy by x the hierarchy to which neg and x x y etc. stand in the main business of semantic'
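
For intuition about what happens during generation, here is a minimal pure-Python sketch of sampling from trigram counts. It is a simplification under stated assumptions: `build_trigram_table` and `generate_sketch` are hypothetical helpers, the toy tokens are made up, and the sketch always starts from an explicit two-word context, whereas NLTK's `generate` handles more cases.

```python
import random
from collections import Counter, defaultdict

def build_trigram_table(tokens):
    """Map each two-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        table[(u, v)][w] += 1
    return table

def generate_sketch(table, seed_context, num_words, random_seed=42):
    """Sample up to num_words words, each weighted by its trigram count."""
    rng = random.Random(random_seed)
    context = tuple(seed_context)
    output = []
    for _ in range(num_words):
        followers = table.get(context)
        if not followers:
            break  # this context was never seen in training
        words = sorted(followers)
        weights = [followers[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        output.append(word)
        context = (context[1], word)  # slide the two-word window forward
    return output

# Toy padded token stream (an assumption for illustration).
tokens = ("<s> <s> atomic formula be atomic formula have "
          "atomic formula be </s> </s>").split()
table = build_trigram_table(tokens)
print(generate_sketch(table, ("<s>", "<s>"), 6, random_seed=7))
```

Because each word depends only on the two words before it, the output is locally plausible but has no long-range structure, which is consistent with the generated sentences above.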