In [1]:
import os
import chromadb
import openai
import tiktoken
from chromadb.utils import embedding_functions

from chunking_evaluation.utils import openai_token_count
from chunking_evaluation.chunking import ClusterSemanticChunker, LLMSemanticChunker, FixedTokenChunker
from chunking_evaluation.chunking import RecursiveTokenChunker, KamradtModifiedChunker

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_text_splitters import TokenTextSplitter

In [2]:
main_path = "/home/aswath/Projects/deep_learning/backup_brain/test_2/"
input_path = main_path + "input/outsiders.txt"

with open(input_path, 'r') as file:
    document = file.read()

In [3]:
print(document[:1000])

The OUTSIDERS



Introduction

An Intelligent Iconoclasm

It is impossible to produce superior performance unless you do something different.

—John Templeton


The New Yorker’s Atul Gawande uses the term positive deviant to describe unusually effective performers in the field of medicine. To Gawande, it is natural that we should study these outliers in order to learn from them and improve performance.1

Surprisingly, in business the best are not studied as closely as in other fields like medicine, the law, politics, or sports. After studying Henry Singleton, I began, with the help of a talented group of Harvard MBA students, to look for other cases where one company handily beat both its peers and Jack Welch (in terms of relative market performance). It turned out, as Warren Buffett’s quote in the preface suggests, that these companies (and CEOs) were rare as hen’s teeth. After extensive searching in databases at Harvard Business School’s Baker Library, we came across only seven other

### General Helper Functions

In [4]:
def analyze_chunks(chunks, use_tokens=False):
    # Print the chunks of interest
    print("\nNumber of Chunks:", len(chunks))
    print("\n", "="*50, "10th Chunk", "="*50,"\n", chunks[9])
    print("\n", "="*50, "11st Chunk", "="*50,"\n", chunks[10])
    
    chunk1, chunk2 = chunks[9], chunks[10]
    
    if use_tokens:
        encoding = tiktoken.get_encoding("cl100k_base")
        tokens1 = encoding.encode(chunk1)
        tokens2 = encoding.encode(chunk2)
        
        # Find overlapping tokens
        for i in range(len(tokens1), 0, -1):
            if tokens1[-i:] == tokens2[:i]:
                overlap = encoding.decode(tokens1[-i:])
                print("\n", "="*50, f"\nOverlapping text ({i} tokens):", overlap)
                return
        print("\nNo token overlap found")
    else:
        # Find overlapping characters
        for i in range(min(len(chunk1), len(chunk2)), 0, -1):
            if chunk1[-i:] == chunk2[:i]:
                print("\n", "="*50, f"\nOverlapping text ({i} chars):", chunk1[-i:])
                return
        print("\nNo character overlap found")

### Simple Split

In [5]:
def chunk_text(document, chunk_size, overlap):
    chunks = []
    stride = chunk_size - overlap
    current_idx = 0
    
    while current_idx < len(document):
        # Take chunk_size characters starting from current_idx
        chunk = document[current_idx:current_idx + chunk_size]
        if not chunk:  # Break if we're out of text
            break
        chunks.append(chunk)
        current_idx += stride  # Move forward by stride
    
    return chunks

In [6]:
simp_chunks = chunk_text(document, chunk_size=1100, overlap=100)
analyze_chunks(simp_chunks)


Number of Chunks: 318

 ismatic leadership and more on careful deployment of firm resources.

At bottom, these CEOs thought more like investors than managers. Fundamentally, they had confidence in their own analytical skills, and on the rare occasions when they saw compelling discrepancies between value and price, they were prepared to act boldly. When their stock was cheap, they bought it (often in large quantities), and when it was expensive, they used it to buy other companies or to raise inexpensive capital to fund future growth. If they couldn’t identify compelling projects, they were comfortable waiting, sometimes for very long periods of time (an entire decade in the case of General Cinema’s Dick Smith). Over the long term, this systematic, methodical blend of low buying and high selling produced exceptional returns for shareholders.

A Distant Mirror: 1974–1982

In assessing the current relevance of these outsider CEOs, it’s worth looking at how each navigated the post–World W

In [7]:
length = 0
for i in simp_chunks:
    length += len(i)
print(length/len(simp_chunks))

1098.254716981132


### Token Split

In [10]:
def count_tokens(text, model="cl100k_base"):
    encoder = tiktoken.get_encoding(model)
    return print(f"Number of tokens: {len(encoder.encode(text))}")

In [11]:
encoder = tiktoken.get_encoding("cl100k_base")

text = "humpty dumpty sat on the floor"
tokens = encoder.encode(text)

print("Tokens:", tokens)

for i in range(len(tokens)):
    print(f"Token {i+1}:", encoder.decode([tokens[i]]))

print("Full Decoding: ", encoder.decode(tokens))

Tokens: [28400, 1625, 63811, 1625, 7731, 389, 279, 6558]
Token 1: hum
Token 2: pty
Token 3:  dum
Token 4: pty
Token 5:  sat
Token 6:  on
Token 7:  the
Token 8:  floor
Full Decoding:  humpty dumpty sat on the floor


In [22]:
fixed_token_chunker = FixedTokenChunker(chunk_size=218, chunk_overlap=22, encoding_name="cl100k_base")
token_chunks = fixed_token_chunker.split_text(document)
analyze_chunks(token_chunks, use_tokens=True)


Number of Chunks: 321

  their own analytical skills, and on the rare occasions when they saw compelling discrepancies between value and price, they were prepared to act boldly. When their stock was cheap, they bought it (often in large quantities), and when it was expensive, they used it to buy other companies or to raise inexpensive capital to fund future growth. If they couldn’t identify compelling projects, they were comfortable waiting, sometimes for very long periods of time (an entire decade in the case of General Cinema’s Dick Smith). Over the long term, this systematic, methodical blend of low buying and high selling produced exceptional returns for shareholders.

A Distant Mirror: 1974–1982

In assessing the current relevance of these outsider CEOs, it’s worth looking at how each navigated the post–World War II period that looks most like today’s extended economic malaise: the brutal 1974–1982 period.

That period featured a toxic combination of an external oil shock, disast

In [23]:
length = 0
for i in token_chunks:
    length += len(i)
print(length/len(token_chunks))

1100.9127725856697


### Recursive Character Split

In [34]:
recursive_character_chunker = RecursiveTokenChunker(chunk_size=1375, chunk_overlap=137, length_function=len, separators=["\n\n", "\n", ".", "?", "!", " ", ""])
rec_ch_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(rec_ch_chunks, use_tokens=False)


Number of Chunks: 293

 That period featured a toxic combination of an external oil shock, disastrous fiscal and monetary policy, and the worst domestic political scandal in the nation’s history. This cocktail of negative news produced an eight-year period that saw crippling inflation, two deep recessions (and bear markets), 18 percent interest rates, a threefold increase in oil prices, and the first resignation of a sitting US president in over one hundred years. In the middle of this dark period, in August 1979, BusinessWeek famously ran a cover story titled “Are Equities Dead?”

The times, like now, were so uncertain and scary that most managers sat on their hands, but for all the outsider CEOs it was among the most active periods of their careers—every single one was engaged in either a significant share repurchase program or a series of large acquisitions (or in the case of Tom Murphy, both). As a group, they were, in the words of Warren Buffett, very “greedy” while their peers w

In [35]:
length = 0
for i in rec_ch_chunks:
    length += len(i)
print(length/len(rec_ch_chunks))

1091.5767918088736


### Recursive Token Split

In [49]:
recursive_token_chunker = RecursiveTokenChunker(chunk_size=275, chunk_overlap=27, length_function=openai_token_count, separators=["\n\n", "\n", ".", "?", "!", " ", ""])
rec_tk_chunks = recursive_token_chunker.split_text(document)
analyze_chunks(rec_tk_chunks, use_tokens=True)


Number of Chunks: 288

 a. Author interview with Warren Buffett, July 24, 2006.

This reformulation of the CEO’s job stemmed from shared (and unusual) backgrounds. All of these CEOs were outsiders. All were first-time chief executives (half not yet forty when they took the job), and all but one were new to their industries. They were not bound by prior experience or industry convention, and their collective records show the enormous power of fresh eyes. This freshness of perspective is an age-old catalyst for innovation across many fields. In science, Thomas Kuhn, inventor of the concept of the paradigm shift, found that the greatest discoveries were almost invariably made by newcomers and the very young (think of the middle-aged former printer, Ben Franklin, taming lightning; or Einstein, the twenty-seven-year-old patent clerk, deriving E = mc2).

This fox-like outsider’s perspective helped these executives develop differentiated approaches, and it informed their entire management ph

In [50]:
length = 0
for i in rec_tk_chunks:
    length += len(i)
print(length/len(rec_tk_chunks))

1107.173611111111


### Semantic Chunker

In [53]:
openai.api_key = 'dummy_val'
embedding_function = embedding_functions.OpenAIEmbeddingFunction(api_key=openai.api_key, model_name="text-embedding-3-large")

#### Greg Kamradt Semantic Chunker

In [80]:
kamradt_chunker = KamradtModifiedChunker(avg_chunk_size=220, min_chunk_size=20, embedding_function=embedding_function)
modified_kamradt_chunks = kamradt_chunker.split_text(document)

In [81]:
analyze_chunks(modified_kamradt_chunks, use_tokens=True)
print("\n\n", "="*50, "\n\n")
count_tokens(modified_kamradt_chunks[9])
count_tokens(modified_kamradt_chunks[10])


Number of Chunks: 289

 . Only two had MBAs. As a group, they did not attract or seek the spotlight . Rather, they labored in relative obscurity and were generally appreciated by only a handful of sophisticated investors and aficionados . As a group, they shared old-fashioned, premodern values including frugality, humility, independence, and an unusual combination of conservatism and boldness . They typically worked out of bare-bones offices (of which they were inordinately proud), generally eschewed perks such as corporate planes, avoided the spotlight wherever possible, and rarely communicated with Wall Street or the business press . They also actively avoided bankers and other advisers, preferring their own counsel and that of a select group around them . Ben Franklin would have liked these guys. This group of happily married, middle-aged men (and one woman) led seemingly unexciting, balanced, quietly philanthropic lives, yet in their business lives they were neither conventional n

In [82]:
length = 0
for i in modified_kamradt_chunks:
    length += len(i)
print(length/len(modified_kamradt_chunks))

1101.3356401384083


#### Cluster Semantic Chunker

In [89]:
cluster_chunker = ClusterSemanticChunker(embedding_function=embedding_function, max_chunk_size=790, length_function=openai_token_count)
cluster_chunker_chunks = cluster_chunker.split_text(document)

In [90]:
analyze_chunks(cluster_chunker_chunks, use_tokens=True)


Number of Chunks: 306

 one hundred years

 . In the middle of this dark period, in August 1979, BusinessWeek famously ran a cover story titled “Are Equities Dead?” The times, like now, were so uncertain and scary that most managers sat on their hands, but for all the outsider CEOs it was among the most active periods of their careers—every single one was engaged in either a significant share repurchase program or a series of large acquisitions (or in the case of Tom Murphy, both) . As a group, they were, in the words of Warren Buffett, very “greedy” while their peers were deeply “fearful.”a a. Author interview with Warren Buffett, July 24, 2006. This reformulation of the CEO’s job stemmed from shared (and unusual) backgrounds. All of these CEOs were outsiders . All were first-time chief executives (half not yet forty when they took the job), and all but one were new to their industries. They were not bound by prior experience or industry convention, and their collective records show 

In [91]:
length = 0
for i in cluster_chunker_chunks:
    length += len(i)
print(length/len(cluster_chunker_chunks))

1037.6666666666667


### Testing Chunks

In [94]:
llm = ChatOpenAI(temperature=0.0, model="gpt-4o", api_key= openai.api_key)

simp_chunks, token_chunks, rec_ch_chunks, rec_tk_chunks, lc_semantic_chunks, modified_kamradt_chunks, cluster_chunker_chunks, llm_chunker_chunks

In [95]:
def add_chunks(texts, collection):
    add_count = 0
    for text in texts:
        collection.add(documents=[text], ids=f"chunk_{add_count}")
        add_count += 1

def chroma_retrieval(query, collection, num_results=15):
    results = collection.query(query_texts=[query], n_results=num_results)
    return results

def chroma_rag(query, collection):
    retrieved_docs = chroma_retrieval(query, collection)["documents"][0]
    response = rag_chain.invoke({"retrieved_docs": retrieved_docs, "query": query})
    return retrieved_docs, response

In [96]:
client_path = main_path + "/notebook/chromadb"
chroma_client = chromadb.PersistentClient(path=client_path)

In [123]:
rag_prompt_template = """
Generate a response that responds to the user's question, summarizing all information in the input data tables, and incorporating any relevant general knowledge.

Do not include information where the supporting evidence for it is not provided.

Context: {retrieved_docs}

User Question: {query}

"""

#### Simple Chunking

In [134]:
simple_collection = chroma_client.get_or_create_collection(name="simple_collection_1100")
# add_chunks(simp_chunks, simple_collection)

In [135]:
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
rag_chain = rag_prompt | llm | StrOutputParser()

In [136]:
docs, response = chroma_rag("Give me who are the best CEO's and elaborate reasons behind it ?", simple_collection)

In [137]:
print(response)

The input data highlights a group of CEOs often referred to as "outsider CEOs," who are considered highly effective due to their unconventional management styles and impressive long-term performance. These CEOs include figures like Warren Buffett, John Malone, and Bill Anders, among others. Here are the key reasons behind their success:

1. **Long-term Value Focus**: These CEOs prioritized optimizing long-term shareholder value over short-term organizational growth. They often shrank their companies through share repurchases and asset sales, focusing on maximizing value rather than size.

2. **Capital Allocation**: They were adept at capital allocation, making disciplined acquisitions, and using leverage selectively. They focused on cash flow rather than reported net income, which is a common metric for most public company CEOs.

3. **Decentralization**: They ran highly decentralized organizations, allowing local managers to make operating decisions while centralizing capital allocatio

#### Token Chunking

In [138]:
token_collection = chroma_client.get_or_create_collection(name="token_collection_1100")
# add_chunks(token_chunks, token_collection)

In [139]:
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
rag_chain = rag_prompt | llm | StrOutputParser()

In [140]:
docs, response = chroma_rag("Give me who are the best CEO's and elaborate reasons behind it ?", token_collection)

In [141]:
print(response)

Determining the "best" CEOs can be subjective and depends on various criteria such as financial performance, innovation, leadership style, and impact on society. However, some CEOs are frequently recognized for their exceptional leadership and contributions to their companies and industries. Here are a few notable examples:

1. **Tim Cook (Apple Inc.)**: Tim Cook is often praised for his operational expertise and ability to maintain Apple's profitability and innovation after Steve Jobs. Under his leadership, Apple has expanded its product line and services, including the Apple Watch and Apple Music, and has become the first U.S. company to reach a $2 trillion market capitalization.

2. **Satya Nadella (Microsoft)**: Nadella is credited with transforming Microsoft by focusing on cloud computing and artificial intelligence. His leadership has revitalized the company, leading to significant growth in its Azure cloud services and a substantial increase in Microsoft's stock value.

3. **Elo

#### Recursive Character Chunking

In [142]:
rec_ch_collection = chroma_client.get_or_create_collection(name="rec_ch_collection_1100")
# add_chunks(rec_ch_chunks, rec_ch_collection)

In [143]:
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
rag_chain = rag_prompt | llm | StrOutputParser()

In [144]:
docs, response = chroma_rag("Give me who are the best CEO's and elaborate reasons behind it ?", rec_ch_collection)

In [158]:
print(response)

The best CEOs, as highlighted in the context, are those who adopt an unconventional approach to management and resource allocation. These CEOs, often referred to as "outsider CEOs," include individuals like Stonecipher, Tillerson, and others who have achieved extraordinary results by consistently making decisions that differ from their peers. Here are the key reasons behind their success:

1. **Long-term Investor Mindset**: These CEOs think more like investors than traditional managers. They focus on creating value for shareholders by making disciplined capital allocation decisions and optimizing resources.

2. **Decentralized Organizations**: They run highly decentralized organizations, allowing for efficient decision-making at local levels while maintaining centralized control over capital allocation.

3. **Contrarian Approach**: They are intelligent contrarians, willing to act differently from the crowd. This includes making large acquisitions, buying back stock, and minimizing taxe

#### Recursive Token Chunking

In [146]:
rec_token_collection = chroma_client.get_or_create_collection(name="rec_token_collection_1100")
# add_chunks(rec_tk_chunks, rec_token_collection)

In [147]:
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
rag_chain = rag_prompt | llm | StrOutputParser()

In [148]:
docs, response = chroma_rag("Give me who are the best CEO's and elaborate reasons behind it ?", rec_token_collection)

In [149]:
print(response)

The best CEOs, as highlighted in the context, are those who adopt an unorthodox approach to management, focusing on long-term value creation rather than short-term growth. These CEOs, often referred to as "outsider CEOs," include figures like Warren Buffett, Henry Singleton, and Rex Tillerson. Their success is attributed to several key principles:

1. **Long-term Value Focus**: These CEOs prioritize optimizing long-term value per share over mere organizational growth. They understand that bigger isn't always better and often shrink their operations through asset sales or spin-offs to maximize shareholder value.

2. **Capital Allocation**: They are adept at capital allocation, directing resources toward high-return projects and avoiding value-destroying investments. This involves a disciplined approach to financial management, often requiring a minimum return on capital projects.

3. **Decentralization**: They run highly decentralized organizations, empowering local managers and reducin

#### Modified Kamradt Chunks

In [150]:
mod_kamradt_collections = chroma_client.get_or_create_collection(name="mod_kamradt_collections_1100")
# add_chunks(modified_kamradt_chunks, mod_kamradt_collections)

In [151]:
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
rag_chain = rag_prompt | llm | StrOutputParser()

In [152]:
docs, response = chroma_rag("Give me who are the best CEO's and elaborate reasons behind it ?", mod_kamradt_collections)

In [153]:
print(response)

The best CEOs, as highlighted in the context, are those who adopt an unconventional approach to management and resource allocation. These CEOs, often referred to as "outsider CEOs," include individuals like Stonecipher, Tillerson, and others who have achieved extraordinary results by thinking and acting differently from their peers. Here are the key reasons behind their success:

1. **Long-term Investor Mindset**: These CEOs think more like investors than traditional managers. They focus on creating long-term value for shareholders rather than short-term gains.

2. **Capital Allocation**: They excel in capital allocation, making disciplined acquisitions, buying back stock, and minimizing taxes. They prioritize projects with attractive returns and are comfortable waiting for the right opportunities.

3. **Decentralized Organizations**: They run highly decentralized organizations, pushing operating decisions to the lowest levels while maintaining centralized control over capital allocati

#### Cluster Chunker

In [154]:
cluster_chunks_collection = chroma_client.get_or_create_collection(name="cluster_chunks_collections_1100")
# add_chunks(cluster_chunker_chunks, cluster_chunks_collection)

In [155]:
rag_prompt = ChatPromptTemplate.from_template(rag_prompt_template)
rag_chain = rag_prompt | llm | StrOutputParser()

In [156]:
docs, response = chroma_rag("Give me who are the best CEO's and elaborate reasons behind it ?", mod_kamradt_collections)

In [157]:
print(response)

The best CEOs, as highlighted in the context, are those who adopt an unconventional approach to management and resource allocation. These CEOs, often referred to as "outsider CEOs," include individuals like Stonecipher, Tillerson, and others who have achieved extraordinary results by consistently making decisions that differ from their peers. Here are the key reasons behind their success:

1. **Long-term Investor Mindset**: These CEOs think more like investors than traditional managers. They focus on creating value for shareholders by making disciplined capital allocation decisions and optimizing resources.

2. **Decentralized Organizations**: They run highly decentralized organizations, allowing for efficient decision-making at local levels while maintaining centralized control over capital allocation.

3. **Contrarian Approach**: They are intelligent contrarians, willing to act differently from the crowd. This includes making large acquisitions, buying back stock, and minimizing taxe