<a href="https://colab.research.google.com/github/IuliaElisa/Data-Mining-project/blob/main/DM_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

                                                  
**metapoint**
*A philosophical system generator*

---

Table of Contents
1. Introduction 0 pts (Optional)
2. Problem statement 0 pts (Mandatory)
3. Dataset description 0 pts (Mandatory)
4. Data cleaning 2 pts
5. Analyzing word and document frequency 2 pts 
6. Relationship between words (N-grams) 2 pts 
7. Topic analysis 2 pts
8. Sentiment analysis 2 pts
9. Entity Recognition / PoS 2 pts
10. Conclusions (Mandatory)





##**1.Introduction**

This project aggregates several laboratory sub-problems from the Data Mining course, together with code snippets from public online sources. The idea is to build a simple, naive chat assistant that answers philosophical questions based on training data chosen at the start; in theory, this data is the book or text that best matches one's philosophical beliefs.

##**2.Problem statement**


Sometimes, people hold inherently complex perspectives on life, social constructs and even objective reality. Or rather, their thought process aligns with certain schools of philosophy without relying entirely on fixed ideas.

Also, some people don't have the opportunity to freely express their beliefs for various reasons: oppression, inhibition, shyness, or simply never having thought about practicing philosophical expression.

Next, we assume that everyone's beliefs revolve around the following 10 major philosophical schools or religions, for each of which we give an **utterly** brief description:

1. Nihilism: *No moral values, principles, truths.* 
2. Existentialism, Absurdism: *We killed God. Then how to reconstruct the meaning of life?*
3. Stoicism: *Focus only on what you can control.*
4. Hedonism, Utilitarianism: *Pleasure is the key to a good life. Morality is based on outcomes, rather than the nature of the actions.*
5. Marxism: *Religion is the opium of the masses* (materialism, dictatorship, communism).
6. Rationalism: *Knowledge comes from reason and thought, rather than empirical evidence.*
7. Relativism: *Don’t kill unless you would save lives by doing so.*
8. Buddhism and Taoism: meditation, karma, individual, simplicity, naturalness.
9. Christianity: The Ten Commandments of Moses. Original sin. Salvation by grace.
10. Deontology: *The morality of an action depends on the nature of the action.*

###**Imports**

In [None]:
import requests
import string
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import spacy
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from nltk import FreqDist 
from nltk.stem import WordNetLemmatizer




In [None]:
!python -m spacy download en_core_web_sm
!pip install --upgrade openai
nltk.download('wordnet')



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

##**3.Dataset description**

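The dataset is meant to consist of 10 books, one representative of each of the 10 philosophies described above. This notebook loads two of them from Project Gutenberg: *Beyond Good and Evil* by Friedrich Nietzsche and *The Confessions of Saint Augustine*.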



In [None]:
allowed_chars = string.ascii_letters + string.digits + string.whitespace

# Book 1: "Beyond Good and Evil" (Friedrich Nietzsche)
book1_url = 'https://www.gutenberg.org/cache/epub/4363/pg4363.txt'
# Book 2: "The Confessions of Saint Augustine"
book2_url = 'https://www.gutenberg.org/cache/epub/3296/pg3296.txt'

response = requests.get(book1_url)
text1 = response.text
# Keep only ASCII letters, digits and whitespace (this drops all punctuation)
text1 = ''.join(c for c in text1 if c in allowed_chars)

response = requests.get(book2_url)
text2 = response.text
text2 = ''.join(c for c in text2 if c in allowed_chars)




In [None]:
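# Keep the unprocessed text of each book (punctuation intact) for sentence-level analysis.
# Each book is fetched from its own URL; reusing `response` would give book 2 twice.
raw_text1 = requests.get(book1_url).text
raw_text2 = requests.get(book2_url).text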




##**4. Data cleaning**

We perform the following data cleaning steps on the dataset:
*   Tokenization: the text is split into sentences and the sentences into words. Words are lowercased and punctuation is removed.
*   English stopwords are removed.
*   Words are lemmatized.
*   Words are **NOT** stemmed.

**Note**: we also keep a copy of each book as processed plain text (rather than as a dataframe) for later use.
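
The steps listed above can be sketched on a single string as follows. This is only a minimal illustration, not the notebook's actual pipeline: the sentence splitter is a naive regular expression, `TOY_STOPWORDS` is a hypothetical, deliberately tiny stopword list (the notebook uses NLTK's English list), and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for this sketch)
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it"}

def clean_text(text, stopwords=TOY_STOPWORDS):
    """Return a list of sentences, each a list of cleaned word tokens."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = []
    for sentence in sentences:
        # Lowercase, then keep only runs of letters/digits (drops punctuation)
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        # Remove stopwords
        words = [w for w in words if w not in stopwords]
        if words:
            cleaned.append(words)
    return cleaned

sample = "The will to power is the essence of life. It is beyond Good and Evil!"
print(clean_text(sample))
```

Keeping the sentence boundaries, as done here, makes it possible to run sentence-level steps later without going back to the raw text.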


In [None]:
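nlp = spacy.load("en_core_web_sm")

lemmatizer = WordNetLemmatizer()

def clean_chars(text):
  # Flatten line breaks so that each book becomes a single line of text
  text = text.replace('\r', ' ')
  text = text.replace('\n', ' ')
  # Keep only ASCII letters, digits and whitespace
  text = ''.join(c for c in text if c in allowed_chars)

  # Tokenize with spaCy, then lemmatize each token with WordNet.
  # Note: lemmatize() treats every word as a noun by default, so e.g. "was" becomes "wa".
  doc = nlp(text)
  lemmas = [lemmatizer.lemmatize(token.text) for token in doc]

  return ' '.join(lemmas)

text1 = clean_chars(text1)
text2 = clean_chars(text2)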



In [None]:
text1





In [None]:

def to_df(text):
  text_lines = text.splitlines()
  text_df = pd.DataFrame({
    "line": text_lines,
    "line_number": list(range(len(text_lines)))
  })
  return text_df

text1_df = to_df(text1)
text2_df = to_df(text2)

# We’ll want to know which content comes from which book
text1_df = text1_df.assign(book = 'Nietzche')
text2_df = text2_df.assign(book = 'St_aug')

# Finally, we concatenate the books into one dataframe
books = [text1_df, text2_df]
text_books_df = pd.concat(books)
text_books_df.head()



Unnamed: 0,line,line_number,book
0,The Project Gutenberg EBook of Beyond Good and...,0,Nietzche
0,Project Gutenbergs The Confessions of Saint Au...,0,St_aug


In [None]:
# We split the data into words. We first split the text column into a list of words
text_books_df['word'] = text_books_df['line'].str.split()

# Explode the words column to create a new row for each word (this creates a separate row for each word from the newly created words list)
text_books_df = text_books_df.explode('word')

# Reset the index of the dataframe (we want to index each word now)
text_books_df = text_books_df.reset_index(drop=True)
text_books_df.drop(columns = 'line_number', inplace=True)
text_books_df.head()



Unnamed: 0,line,book,word
0,The Project Gutenberg EBook of Beyond Good and...,Nietzche,The
1,The Project Gutenberg EBook of Beyond Good and...,Nietzche,Project
2,The Project Gutenberg EBook of Beyond Good and...,Nietzche,Gutenberg
3,The Project Gutenberg EBook of Beyond Good and...,Nietzche,EBook
4,The Project Gutenberg EBook of Beyond Good and...,Nietzche,of


In [None]:
text_books_df



Unnamed: 0,line,book,word
0,The Project Gutenberg EBook of Beyond Good and...,Nietzche,The
1,The Project Gutenberg EBook of Beyond Good and...,Nietzche,Project
2,The Project Gutenberg EBook of Beyond Good and...,Nietzche,Gutenberg
3,The Project Gutenberg EBook of Beyond Good and...,Nietzche,EBook
4,The Project Gutenberg EBook of Beyond Good and...,Nietzche,of
...,...,...,...
180638,Project Gutenbergs The Confessions of Saint Au...,St_aug,to
180639,Project Gutenbergs The Confessions of Saint Au...,St_aug,hear
180640,Project Gutenbergs The Confessions of Saint Au...,St_aug,about
180641,Project Gutenbergs The Confessions of Saint Au...,St_aug,new


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

text_books_df['word'] = text_books_df['word'].apply(lambda x: x.lower())
text_books_df = text_books_df[~text_books_df['word'].isin(stopwords.words('english'))] 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


##**5. Analyzing word and document frequency**


In [None]:
# Occurrences of each word in each book
count_df_1 = text_books_df.groupby(['word', 'book']).size().sort_values(ascending=False).reset_index(name='count')
# Total number of words (after stopword removal) in each book
count_df_2 = text_books_df.groupby(['book']).size().sort_values(ascending=False).reset_index(name='count')
book_words = count_df_1.merge(count_df_2, on='book')
book_words = book_words.rename(columns={'count_x': 'word_appearances_in_book', 'count_y': 'book_total_word_count'})
# Term frequency: share of the book's words taken up by this word
book_words['tf'] = book_words['word_appearances_in_book']/book_words['book_total_word_count']
# Dense rank of each word within its book (1 = most frequent)
book_words['rank'] = book_words.groupby('book')['word_appearances_in_book'].rank(method='dense', ascending=False)



In [None]:
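# N: number of books; n: number of books in which each word appears
N = book_words['book'].nunique()
n = book_words.groupby('word')['book'].transform(lambda x: len(x.unique()))

# idf = ln(N / n); with two books this is 0 for shared words and ln(2) otherwise
book_words['idf'] = np.log(N/n)
book_words['tf-idf'] = book_words['tf'] * book_words['idf']

book_words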



Unnamed: 0,word,book,word_appearances_in_book,book_total_word_count,tf,rank,idf,tf-idf
0,thou,St_aug,1087,53153,0.020450,1.0,0.000000,0.000000
1,wa,St_aug,1020,53153,0.019190,2.0,0.000000,0.000000
2,thee,St_aug,921,53153,0.017327,3.0,0.000000,0.000000
3,thy,St_aug,898,53153,0.016895,4.0,0.000000,0.000000
4,thing,St_aug,714,53153,0.013433,5.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...
15871,himselfpresque,Nietzche,1,33526,0.000030,99.0,0.693147,0.000021
15872,himselfshould,Nietzche,1,33526,0.000030,99.0,0.693147,0.000021
15873,himselfthis,Nietzche,1,33526,0.000030,99.0,0.693147,0.000021
15874,himselfwhither,Nietzche,1,33526,0.000030,99.0,0.693147,0.000021


In [None]:
book_words.sort_values('tf-idf', ascending=False).head(20)



Unnamed: 0,word,book,word_appearances_in_book,book_total_word_count,tf,rank,idf,tf-idf
7078,german,Nietzche,90,33526,0.002684,27.0,0.693147,0.001861
7085,morality,Nietzche,80,33526,0.002386,32.0,0.693147,0.001654
7100,europe,Nietzche,65,33526,0.001939,40.0,0.693147,0.001344
73,thine,St_aug,95,53153,0.001787,62.0,0.693147,0.001239
81,mercy,St_aug,90,53153,0.001693,66.0,0.693147,0.001174
7162,modern,Nietzche,40,33526,0.001193,60.0,0.693147,0.000827
137,darkness,St_aug,62,53153,0.001166,90.0,0.693147,0.000809
7167,nowadays,Nietzche,38,33526,0.001133,62.0,0.693147,0.000786
148,hadst,St_aug,60,53153,0.001129,92.0,0.693147,0.000782
158,whereof,St_aug,56,53153,0.001054,96.0,0.693147,0.00073
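
In the tables above, every word that occurs in both books has an `idf` (and hence a `tf-idf`) of exactly 0, while words found in only one book have `idf` = ln(2) ≈ 0.693147. With just two documents, idf = ln(N/n) can only take these two values. The following sketch reproduces the computation in plain Python on a toy corpus; the two token lists are made up for illustration and are not taken from the books.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): two tiny "books", already tokenized
books = {
    "book_a": ["god", "mercy", "thou", "thou"],
    "book_b": ["god", "morality", "europe", "morality"],
}

n_books = len(books)
# Document frequency: in how many books does each word occur?
doc_freq = Counter(word for words in books.values() for word in set(words))

tf_idf = {}
for book, words in books.items():
    counts = Counter(words)
    for word, count in counts.items():
        tf = count / len(words)                       # term frequency in this book
        idf = math.log(n_books / doc_freq[word])      # ln(N / n)
        tf_idf[(book, word)] = tf * idf

# Words shared by both books ("god") score 0; book-specific words score highest
for key, value in sorted(tf_idf.items(), key=lambda kv: -kv[1]):
    print(key, round(value, 4))
```

With more than two documents the idf values spread out, and the measure becomes more informative than a simple "unique to one book" filter.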


##**6. Relationship between words (N-grams)**


In [None]:
text1_words_list  = text1.split(' ')



In [None]:
text2





In [None]:
type(ngrams)



function

In [None]:
# Create trigrams (n = 3) from the corpus
from nltk.util import ngrams

NGRAMS = ngrams(text1.split(), 3)
# Materialize the generator; note that this rebinds the name `ngrams` to the tuple of trigrams
ngrams = tuple(NGRAMS)

# Calculate the frequency distribution of the trigrams
ngrams_freqdist = nltk.FreqDist(ngrams)
ngrams_freqdist


FreqDist({('that', 'it', 'is'): 32, ('is', 'to', 'say'): 24, ('with', 'regard', 'to'): 24, ('in', 'order', 'to'): 19, ('by', 'mean', 'of'): 19, ('Project', 'Gutenbergtm', 'electronic'): 18, ('Gutenbergtm', 'electronic', 'work'): 18, ('in', 'the', 'end'): 17, ('that', 'is', 'to'): 17, ('the', 'fact', 'that'): 17, ...})

In [None]:
ngrams



(('The', 'Project', 'Gutenberg'),
 ('Project', 'Gutenberg', 'EBook'),
 ('Gutenberg', 'EBook', 'of'),
 ('EBook', 'of', 'Beyond'),
 ('of', 'Beyond', 'Good'),
 ('Beyond', 'Good', 'and'),
 ('Good', 'and', 'Evil'),
 ('and', 'Evil', 'by'),
 ('Evil', 'by', 'Friedrich'),
 ('by', 'Friedrich', 'Nietzsche'),
 ('Friedrich', 'Nietzsche', 'This'),
 ('Nietzsche', 'This', 'eBook'),
 ('This', 'eBook', 'is'),
 ('eBook', 'is', 'for'),
 ('is', 'for', 'the'),
 ('for', 'the', 'use'),
 ('the', 'use', 'of'),
 ('use', 'of', 'anyone'),
 ('of', 'anyone', 'anywhere'),
 ('anyone', 'anywhere', 'at'),
 ('anywhere', 'at', 'no'),
 ('at', 'no', 'cost'),
 ('no', 'cost', 'and'),
 ('cost', 'and', 'with'),
 ('and', 'with', 'almost'),
 ('with', 'almost', 'no'),
 ('almost', 'no', 'restriction'),
 ('no', 'restriction', 'whatsoever'),
 ('restriction', 'whatsoever', 'You'),
 ('whatsoever', 'You', 'may'),
 ('You', 'may', 'copy'),
 ('may', 'copy', 'it'),
 ('copy', 'it', 'give'),
 ('it', 'give', 'it'),
 ('give', 'it', 'away'),
 ('

In [None]:
# Calculate the total number of n-grams in the corpus
total_ngrams = len(ngrams)

def generate_ngrams(sentence, n):
    """Return the most frequent word that follows the last n-1 words of `sentence`."""
    if n < 2:
        return None

    # The context is made of the last n-1 words of the sentence
    context = tuple(sentence.split()[-(n - 1):])
    if len(context) < n - 1:
        return None

    possible_words = {}
    for ngram, count in ngrams_freqdist.items():
        # Keep only n-grams whose first n-1 words match the whole context
        if len(ngram) == n and ngram[:-1] == context:
            possible_words[ngram[-1]] = count / total_ngrams

    if possible_words:
        return max(possible_words, key=possible_words.get)
    return None


# Predict the next word for a given context
context = "The meaning"
next_word = generate_ngrams(context, 3)
print(f"The predicted next word for '{context}' is '{next_word}'")
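
A next-word predictor based on trigrams only needs to know which words follow each two-word context, and how often. The sketch below shows one possible way to organize this with a dictionary index, so that a prediction is a single lookup instead of a scan over all trigrams. The helper names `build_trigram_index` and `predict_next` and the toy token list are assumptions made for this illustration; in the notebook, the tokens would come from `text1.split()`.

```python
from collections import Counter, defaultdict

def build_trigram_index(tokens):
    """Map each (w1, w2) context to a Counter of the words that follow it."""
    index = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        index[(w1, w2)][w3] += 1
    return index

def predict_next(index, context):
    """Return the most frequent follower of the last two words, or None."""
    words = context.split()
    if len(words) < 2:
        return None
    # .get() avoids creating an empty entry for unseen contexts
    followers = index.get(tuple(words[-2:]))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy token list (hypothetical, not taken from the books)
tokens = "that is to say that is to be that is to say".split()
index = build_trigram_index(tokens)

print(predict_next(index, "that is"))        # follower of ("that", "is")
print(predict_next(index, "is to"))          # most frequent follower of ("is", "to")
print(predict_next(index, "unknown words"))  # unseen context
```

A context that never occurs in the corpus yields `None`; a fuller model would back off to bigram or unigram counts in that case.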


In [None]:
text1





##**7.Topic analysis**


Here we are interested in finding the dominant topic of each book, using Latent Dirichlet Allocation (*LDA*).



In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.tokenize import word_tokenize, sent_tokenize




In [None]:
text1





In [None]:
# Add all our documents to a list
documents = [text1, text2]
stop_words = set(stopwords.words('english'))
nltk.download('punkt')

# Preprocessing function
def preprocess(text):
    # Tokenize words
    words = word_tokenize(text)

    # Lowercase and lemmatize the alphanumeric tokens
    # (stopwords are not removed here; CountVectorizer drops them later)
    words = [lemmatizer.lemmatize(word.lower()) for word in words if word.isalnum()]

    return ' '.join(words)

# Preprocess the documents
preprocessed_documents = [preprocess(doc) for doc in documents]
preprocessed_documents[0]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




In [None]:
# Words occurring in more than 99% of the documents (here: in both books) are ignored
vectorizer = CountVectorizer(max_df=0.99, min_df=1, stop_words='english')
term_document_matrix = vectorizer.fit_transform(preprocessed_documents)
print(term_document_matrix[0]) # Print the term counts for the first book

  (0, 3536)	5
  (0, 5589)	6
  (0, 3981)	3
  (0, 9278)	3
  (0, 880)	1
  (0, 115)	1
  (0, 2956)	4
  (0, 241)	1
  (0, 6285)	1
  (0, 2104)	1
  (0, 116)	1
  (0, 4731)	3
  (0, 5125)	2
  (0, 1431)	2
  (0, 3511)	3
  (0, 6489)	2
  (0, 8175)	2
  (0, 8420)	1
  (0, 6903)	1
  (0, 8431)	3
  (0, 3644)	114
  (0, 6561)	1
  (0, 103)	1
  (0, 396)	3
  (0, 4678)	1
  :	:
  (0, 960)	1
  (0, 4794)	1
  (0, 8733)	1
  (0, 3538)	1
  (0, 2572)	1
  (0, 7110)	1
  (0, 8346)	1
  (0, 5303)	1
  (0, 8513)	1
  (0, 1196)	1
  (0, 9273)	1
  (0, 9053)	1
  (0, 242)	1
  (0, 243)	1
  (0, 4152)	1
  (0, 4146)	1
  (0, 3145)	1
  (0, 5276)	1
  (0, 4153)	1
  (0, 4149)	1
  (0, 1285)	1
  (0, 4147)	2
  (0, 4148)	1
  (0, 8296)	1
  (0, 4150)	1




In [None]:
!pip install numpy==1.23.4
!pip install pyldavis==3.4.1
!pip install pandas==1.5.3



In [None]:
# Apply LDA
n_topics = 2
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(term_document_matrix)

# Print top words for each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = f"Topic #{topic_idx + 1}: "
        message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

n_top_words = 10
feature_names = vectorizer.get_feature_names_out()
print_top_words(lda, feature_names, n_top_words)

Topic #1: thine mercy darkness hadst whereof madest saith wert formed dost
Topic #2: german morality europe european modern nowadays century refined consequently fundamental




In [None]:
import pyLDAvis
import pyLDAvis.lda_model

# Prepare the LDA visualization data
visualization_data = pyLDAvis.lda_model.prepare(lda, term_document_matrix, vectorizer)

# Display the LDA visualization
pyLDAvis.display(visualization_data)



In [None]:
# We can check what's the probability of our documents belonging to each of the generated topics.
# Get the topic distribution for documents
document_topics = lda.transform(term_document_matrix)

# Display the topic distribution for the first document
print(document_topics[0])

[5.29308649e-05 9.99947069e-01]




In [None]:
# Find the most dominant topic for each document
dominant_topics = np.argmax(document_topics, axis=1)

# Display the dominant topics for all documents
print(dominant_topics[:])

[1 0]




##**8.Sentiment analysis**

For this step, we will try to analyze the overall sentiment of each book. We will use the Hugging Face Transformers library, with a pretrained emotion classification model, for deep-learning-based sentiment analysis.

In [None]:
!pip install transformers

In [None]:
import nltk

sent_text = nltk.sent_tokenize(raw_text1) # this gives us a list of sentences
sent_text


In [None]:
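from transformers import pipeline

classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", return_all_scores=True) # https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

# Classify the first 10 sentences and keep every result (not only the last one)
sentence_sentiments = [classifier(sentence) for sentence in sent_text[:10]]
# Result for the last classified sentence
sentence_sentiment = sentence_sentiments[-1]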



In [None]:
sentence_sentiment



[[{'label': 'anger', 'score': 0.12516140937805176},
  {'label': 'disgust', 'score': 0.05693498253822327},
  {'label': 'fear', 'score': 0.5552675127983093},
  {'label': 'joy', 'score': 0.002138598123565316},
  {'label': 'neutral', 'score': 0.1375475525856018},
  {'label': 'sadness', 'score': 0.11441147327423096},
  {'label': 'surprise', 'score': 0.008538453839719296}]]
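
The output above describes a single sentence. To approximate the overall sentiment of a book, one option is to average the per-label scores over many sentences. The sketch below assumes the results have been collected as a list with one inner list of `{'label', 'score'}` dicts per sentence (the same shape as the inner list printed above); `average_emotions` is a hypothetical helper and the toy scores are invented for illustration, not real model output.

```python
from collections import defaultdict

def average_emotions(results):
    """Average per-label scores over many sentences.

    `results` is assumed to be a list of lists of {'label', 'score'} dicts,
    i.e. one inner list of label scores per sentence.
    """
    totals = defaultdict(float)
    for sentence_scores in results:
        for item in sentence_scores:
            totals[item["label"]] += item["score"]
    return {label: total / len(results) for label, total in totals.items()}

# Hypothetical scores for two sentences (not real model output)
toy_results = [
    [{"label": "fear", "score": 0.6}, {"label": "joy", "score": 0.4}],
    [{"label": "fear", "score": 0.2}, {"label": "joy", "score": 0.8}],
]

averages = average_emotions(toy_results)
dominant = max(averages, key=averages.get)
print(dominant, averages)
```

A plain mean treats every sentence equally; weighting by sentence length would be a reasonable alternative.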



##**Generator**

In [None]:
# Disabled on purpose: the whole cell is wrapped in a string so that it does not run.
# It targets the legacy (pre-1.0) `openai` client.
'''
import os
import openai

# Set up the OpenAI API client. Never hard-code the key in the notebook;
# here it is assumed to be stored in the OPENAI_API_KEY environment variable.
openai.api_key = os.environ["OPENAI_API_KEY"]

# this loop will let us ask questions continuously and behave like ChatGPT
while True:
    # Set up the model and prompt
    model_engine = "text-davinci-003"
    
    prompt = input('Enter new prompt: ')

    if 'exit' in prompt or 'quit' in prompt:
        break

    # Generate a response
    completion = openai.Completion.create(
        engine=model_engine,
        prompt=prompt,
        max_tokens=1024,
        n=1,
        stop=None,
        temperature=0.5,
    )

    # extracting useful part of response
    response = completion.choices[0].text
    
    # printing response
    print(response)
'''
