### Personality and Its Transformations ###

An analysis of Prof. Jordan Peterson's collection of lectures from the University of Toronto personality course.

|    More about prof. Peterson at https://www.jordanbpeterson.com/

---

In [10]:
# main imports
import os
import json

import pandas as pd
from tqdm import tqdm

from langchain import LLMChain
from langchain.chains.mapreduce import MapReduceChain
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
import textwrap

from constants import *

<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Credentials set-up</h4> 
 </div>

In [3]:
def load_api_keys(credentials_file_name: str = 'credentials.json') -> str:
    '''Load the OpenAI API key from a JSON credentials file

    Arguments:
        credentials_file_name: name of file containing credentials

    Returns:
        The OpenAI API key stored under 'OPENAI_API_KEY'

    Raises:
        FileNotFoundError: if the credentials file does not exist

    '''

    if not os.path.exists(credentials_file_name):
        # fail loudly instead of returning an error message that would
        # otherwise be stored as the API key
        raise FileNotFoundError(f'No file {credentials_file_name}')

    # open credentials file
    with open(credentials_file_name) as f:
        content = json.load(f)

    # load api key
    return content['OPENAI_API_KEY']


In [4]:
# load the API key from the credentials file and set OPENAI_API_KEY as an
# environment variable
os.environ['OPENAI_API_KEY'] = load_api_keys('credentials.json')

***LangChain*** is a framework for building applications on top of language models: it composes tasks into sequences called chains and lets the models' text generation ability interact with external tools.

Documentation for LangChain can be found here: https://python.langchain.com/en/latest/index.html

<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Load the data</h4> 
</div>

The dataset consists of sentences combined into clusters with HDBSCAN in the previous notebook.

In [6]:
df = pd.read_csv(os.path.join(OUTPUT_FOLDER, f'{CLUSTERED_TOPICS_DATAFRAME_NAME}.csv'))

In [7]:
df

Unnamed: 0.1,Unnamed: 0,cluster_ids,text
0,0,-1,"So, there’s a website—I don’t really like Blac..."
1,1,0,"So, what I’m going to do today—how I’m going t..."
2,2,1,So the first issue is that there’s a lot of re...
3,3,2,But I think we might as well jump right into t...
4,4,3,So when we’re first born—we’re very primitive ...
5,5,4,And so in the mirror test what you essentially...
6,6,5,"Now, dolphins seem to be able to manage that, ..."
7,7,6,"Now, part of the reason for that is that Nietz..."
8,8,7,"It brings in elements of cultural history, ele..."
9,9,8,Most of the brain is structured with the olfac...


<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Define text splitter</h4> 
</div>

In [69]:
llm = OpenAI(temperature = 0)
text_splitter = CharacterTextSplitter(chunk_size = 1000, chunk_overlap = 0, separator=' ')

The GPT-3.5 model has a context limit of 4096 tokens, which must hold the prompt, the current text chunk, and the generated summary. When defining the summarization plan we therefore cap the size of a single text chunk at 1000. Note that `CharacterTextSplitter` measures `chunk_size` in characters by default, so for English text a chunk is well under 1000 tokens. This leaves at least 3096 tokens for the prompts (main prompt and refine prompt) and for the summary from the previous step (when using the refine summary method).
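The budget described above can be checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only: the 4-characters-per-token ratio is a rule-of-thumb assumption rather than an exact tokenizer, and the prompt and summary lengths are hypothetical values chosen for the example.

```python
# Rough token-budget check for one refine step (constants are illustrative assumptions).
CONTEXT_LIMIT = 4096     # context window quoted in the text
CHARS_PER_TOKEN = 4      # rule-of-thumb assumption for English, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def remaining_budget(prompt: str, chunk: str, previous_summary: str = "") -> int:
    """Estimated tokens left for the model's answer after the inputs are counted."""
    used = sum(estimate_tokens(part) for part in (prompt, chunk, previous_summary))
    return CONTEXT_LIMIT - used


chunk = "x" * 1000       # a chunk at the splitter's maximum size (characters)
prompt = "p" * 400       # hypothetical prompt length
summary = "s" * 1200     # hypothetical running summary length
budget = remaining_budget(prompt, chunk, summary)
print(f"Estimated tokens left for the answer: {budget}")
```

For exact counts a real tokenizer would be needed; the point here is only that a 1000-character chunk leaves ample room for both prompts and the running summary.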

<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Define the prompts</h4> 
</div>

In [74]:
# main prompt for summarisation
prompt_template = """
Write a concise summary of the following extracting only key information:

{text}


SUMMARY:
"""
PROMPT = PromptTemplate(template = prompt_template, input_variables = ['text'])


# refinement prompt - because we will be choosing chain_type = 'refine' we must construct a prompt
# for combining the work that was produced so far
refine_template = (
    "Your job is to produce the final summary\n"
    "There is a current summary up to a certain point: {existing_answer}\n"
    "But there is an opportunity, if possible, to refine the existing summary "
    "with some of the context information below.\n"
    "---\n"
    "{text}\n"
    "---\n"
    "Given the new information, refine the current summary.\n"
    "If the context information isn't useful, return the original summary."
) 
REFINE_PROMPT = PromptTemplate(
    input_variables = ['existing_answer', 'text'],
    template = refine_template
)

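The variables `text` and `existing_answer` are fixed placeholder names that the LangChain summarization chain fills in: `text` receives the current chunk and `existing_answer` receives the summary produced so far. Do not rename them when using a summarization chain like the one in this notebook.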
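One way to guard against accidentally renaming these placeholders is to inspect a template before handing it to the chain. The following is a minimal sketch using only the standard library; `template_variables` and `check_refine_prompts` are hypothetical helper names introduced for illustration, and the two short templates are simplified stand-ins for the prompts above.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the set of {name} placeholders found in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_refine_prompts(question_template: str, refine_template: str) -> None:
    """Raise ValueError if the templates do not use the expected placeholder names."""
    if template_variables(question_template) != {"text"}:
        raise ValueError("question prompt must use exactly the {text} placeholder")
    if template_variables(refine_template) != {"existing_answer", "text"}:
        raise ValueError("refine prompt must use {existing_answer} and {text}")


question = "Write a concise summary of the following:\n\n{text}\n\nSUMMARY:"
refine = "Current summary: {existing_answer}\n---\n{text}\n---\nRefine the summary."
check_refine_prompts(question, refine)  # passes silently

# Filling the placeholders by hand shows what the model would receive at a refine step.
filled = refine.format(existing_answer="Lecture overview.", text="New chunk.")
print(filled)
```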

<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Use defined prompts to run summarization chain</h4> 
</div>

### Explanation

One can summarize either with a:

a) stuff chain

b) map-reduce chain

c) refine chain

`Stuffing` is the simplest method. All the text is stuffed into the prompt to provide context.

[+] single call to the LLM, because all text is provided at once

[-] LLMs have a context length limit, so large documents will produce an error when the provided context exceeds the model's limit

[-] works only with small texts


`Map-reduce` uses two prompts: one applied to each data chunk and another for combining the per-chunk results.

[+] can be parallelized because the calls on independent data chunks can be made simultaneously

[+] works well with big texts

[-] can lose some information during the combining step

`Refine` applies the main prompt to the initial chunk; its output, together with the second chunk, is then used as input for the next step, and so on.

[+] can produce more relevant content

[-] requires many calls with bigger token lengths to the LLM which can be costly

[-] cannot be parallelized because the calls are not independent of one another


In this example the most suitable method is `refine`. Stuffing won't work due to the length of the texts. The refine method can be a bit costly (compared to map-reduce), but it provides more insight than map-reduce.
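The control flow of the three strategies can be sketched in plain Python. This is not LangChain's implementation: `LLM` is a hypothetical stand-in for any function that maps a prompt to a completion, the prompt wording is invented for the example, and `fake_llm` is a toy model used only to make the call pattern visible.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical stand-in: takes a prompt, returns a completion


def stuff(llm: LLM, chunks: List[str]) -> str:
    """One call: all chunks are concatenated into a single prompt."""
    return llm("Summarize:\n" + "\n".join(chunks))


def map_reduce(llm: LLM, chunks: List[str]) -> str:
    """Summarize each chunk independently, then combine the partial summaries."""
    partials = [llm("Summarize:\n" + chunk) for chunk in chunks]  # independent calls
    return llm("Combine these summaries:\n" + "\n".join(partials))


def refine(llm: LLM, chunks: List[str]) -> str:
    """Summarize the first chunk, then update the running summary chunk by chunk."""
    summary = llm("Summarize:\n" + chunks[0])
    for chunk in chunks[1:]:
        # each call depends on the previous result, so the calls are sequential
        summary = llm(f"Current summary: {summary}\nNew context: {chunk}\nRefine the summary.")
    return summary


calls = []


def fake_llm(prompt: str) -> str:
    """Toy model that records each prompt and returns a numbered placeholder answer."""
    calls.append(prompt)
    return f"<summary {len(calls)}>"


chunks = ["chunk A", "chunk B", "chunk C"]
result = refine(fake_llm, chunks)
print(f"refine made {len(calls)} sequential calls")
```

With three chunks, stuffing makes one call, map-reduce makes four (three independent plus one combining call), and refine makes three calls that must run in order because each prompt embeds the previous summary.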

### Main summarization step

In [75]:
# the first entry is the noise cluster, so we won't summarize that one for now
summarized_clusters = ['[NOISE_CLUSTER]']

# number of all clusters when we exclude the noise cluster (22)
N = (df.shape[0] - 1)

# summarization chain with type 'refine'
chain = load_summarize_chain(llm,
                             chain_type = 'refine',
                             return_intermediate_steps = True,
                             question_prompt = PROMPT,
                             refine_prompt = REFINE_PROMPT)

# iterate over every text cluster
for n, doc in enumerate(df['text'][1:], 1):

    # split the text using text_splitter
    texts = text_splitter.split_text(doc)
    docs = [Document(page_content = t) for t in texts]

    # perform a summarization chain
    summary = chain({'input_documents': docs}, return_only_outputs=True)

    # wrap it with a text wrapper
    summary_wrapper = textwrap.fill(summary['output_text'],
                                    width = 100,
                                    break_long_words = False,
                                    replace_whitespace=False)
    summarized_clusters.append(summary_wrapper)
    print(f'Done {n} out of {N}')


Done 1 out of 22
Done 2 out of 22
Done 3 out of 22
Done 4 out of 22
Done 5 out of 22
Done 6 out of 22
Done 7 out of 22
Done 8 out of 22
Done 9 out of 22
Done 10 out of 22
Done 11 out of 22
Done 12 out of 22
Done 13 out of 22
Done 14 out of 22
Done 15 out of 22
Done 16 out of 22
Done 17 out of 22
Done 18 out of 22
Done 19 out of 22
Done 20 out of 22
Done 21 out of 22
Done 22 out of 22


<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Check summarizations</h4> 
</div>

In [87]:
summarized_clusters[:3]

['[NOISE_CLUSTER]',
 '\n\nJordan Peterson\'s course is wide-ranging and follows certain principles, such as not teaching\nanything that isn\'t relevant. Before taking the course, students should consider if the course is\nsuitable for them, as there is another personality course being offered at the same time. There is\nstill a lot of reading involved, but it is mostly original source material from the personality\ntheorists themselves and empirical papers. This reading is useful and should be taken into account\nwhen considering taking the course. Additionally, students should be aware that the lectures may not\nalways reflect the reading material, and that the course may be more suitable for those who enjoy\nreading literature. The course also goes back to the beginning of human history to understand how\npeople have understood and represented themselves. It is designed to give students tools to\nunderstand themselves and others, and to help them transition to university life, includ

In [77]:
df['summary'] = summarized_clusters

In [89]:
df['summary'] = df['summary'].map(lambda x: x.replace('\n', ' '))

<div class="tomcolor8">  
<h4 style="background:#135e96; color:white ;font-size:15px;line-height:1em; text-align:left; padding: 20px">
      Export</h4> 
</div>

In [None]:
df.to_csv(os.path.join(OUTPUT_FOLDER, f'{CLUSTERED_TOPICS_DATAFRAME_NAME}_with_summaries.csv'))

In [91]:
df['summary'][21]

'. We can use our body to represent ourselves, such as when telling a joke. However, when we do something unexpected, such as having an affair in a relationship, it can cause us to question our own model of ourselves and the world, leading to feelings of depression and lack of motivation. In such situations, we may feel fairly anxious and fall apart, not knowing which way is up. But if we can come out the other side, we can be a little smarter and more together than before. Even if someone is annoying us, we should approach them with respect, as they may have something to tell us. In such difficult times, something terrible may have happened to us, but we can still strive to come out the other side a little smarter and more together. We need to come to terms with the terrible thing that has happened, however we can, and put ourselves back together. We should also be mindful of our behavior when faced with difficult situations, such as when our partner is being annoying. We can take a s

In [92]:
df

Unnamed: 0.1,Unnamed: 0,cluster_ids,text,summary
0,0,-1,"So, there’s a website—I don’t really like Blac...",[NOISE_CLUSTER]
1,1,0,"So, what I’m going to do today—how I’m going t...",Jordan Peterson's course is wide-ranging and...
2,2,1,So the first issue is that there’s a lot of re...,"The speaker suggests reading novels, literat..."
3,3,2,But I think we might as well jump right into t...,"This article discusses rituals, stories, and..."
4,4,3,So when we’re first born—we’re very primitive ...,. Babies are born with limited control of the...
5,5,4,And so in the mirror test what you essentially...,. The Mirror Test is used to assess the socia...
6,6,5,"Now, dolphins seem to be able to manage that, ...","Dolphins, crows, ravens, whales, and humans ..."
7,7,6,"Now, part of the reason for that is that Nietz...",Nietzsche argued that philosophers are often...
8,8,7,"It brings in elements of cultural history, ele...",. This discussion focuses on the elements of ...
9,9,8,Most of the brain is structured with the olfac...,. Humans have a unique brain structure with th...
