# Textual Analysis
You should create a separate `yaml` environment file for each project you work on; this is good practice in general. Without one, odd dependency conflicts can appear. For the same reason, a dedicated `yaml` file records all the dependencies, so this pipeline can be reproduced later.

In [12]:
%load_ext autoreload
%autoreload 2

import plotly.io as pio
pio.renderers.keys()
pio.renderers.default = 'notebook' 

import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')

from utils.eda import *
from utils.text_model import transform_text, tokenize

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to /Users/kevinb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
# text data
fall_2022_text = pd.read_csv('data/2022_fall_text.csv')
winter_2023_text = pd.read_csv('data/2023_winter_text.csv')
spring_2023_text = pd.read_csv('data/2023_spring_text.csv')
summer_2023_text = pd.read_csv('data/2023_summer_text.csv')
fall_2023_text = pd.read_csv('data/2023_fall_text.csv')
winter_2024_text = pd.read_csv('data/2024_winter_text.csv')
spring_2024_text = pd.read_csv('data/2024_spring_text.csv')
summer_2024_text = pd.read_csv('data/2024_summer_text.csv')
fall_2024_text = pd.read_csv('data/2024_fall_text.csv')
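
The nine `read_csv` calls above follow one naming pattern, so they could also be driven by a list of term labels. The sketch below is one possible way to do that; `load_term_texts` and `TERMS` are hypothetical names that are not part of `utils`, and the `data/<year>_<season>_text.csv` layout is taken from the cell above.

```python
from pathlib import Path

import pandas as pd

# Term labels follow the `<year>_<season>_text.csv` naming used above.
TERMS = [
    "2022_fall", "2023_winter", "2023_spring", "2023_summer", "2023_fall",
    "2024_winter", "2024_spring", "2024_summer", "2024_fall",
]


def load_term_texts(data_dir="data", terms=TERMS):
    """Read one CSV per term and return the frames in a dict keyed by term.

    Hypothetical helper for illustration; it is not part of `utils`.
    """
    data_dir = Path(data_dir)
    return {term: pd.read_csv(data_dir / f"{term}_text.csv") for term in terms}
```

With this helper, `pd.concat(list(load_term_texts().values()), axis=0)` would build the same combined frame, and adding a new quarter only means adding one label to `TERMS`.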

In [14]:
# Initial Concat
text = pd.concat([fall_2022_text, winter_2023_text, spring_2023_text, summer_2023_text, fall_2023_text, winter_2024_text, spring_2024_text, summer_2024_text, fall_2024_text], axis=0)
clean_text = text.pipe(transform_text)
clean_text

Unnamed: 0,Study Materials
0,"2022-02-14 | Book read (30m), DOC 2 P3 part2 (..."
1,"2022-02-15 | Nutrition read (15m), Physics Lig..."
2,2022-10-21 |
3,2022-10-22 | After consecutively doing DSC 10 ...
4,2022-10-23 | Wolfram good to check solution | ...
...,...
628,2024-12-11 | RPLH eval + paper algorithm + RPL...
629,"2024-12-12 | track-mjx meeting, RPLH paper wri..."
630,"2024-12-13 | constraint article, EM article, m..."
631,2024-12-14 | chill


In [15]:
pre_process = (clean_text['Study Materials']
              .str.lower()
              .str.replace(r'\([\d]*m\)','',regex=True)
              .str.replace(',','')
              .str.strip())

corpus = ' '.join(pre_process.astype(str).to_list())
tokens = nltk.tokenize.word_tokenize(corpus, language='english')
len(tokens)

15955
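
To see what the `.str` chain above does to a single entry, here is a plain-Python sketch of the same four steps using the same regular expression. `preprocess_entry` is a hypothetical name, and the sample string is made up and not taken from the data.

```python
import re

# Matches a parenthesised duration in minutes such as "(30m)" or "(15m)".
DURATION = re.compile(r"\([\d]*m\)")


def preprocess_entry(entry):
    """Lowercase, drop "(<n>m)" durations and commas, then trim whitespace."""
    text = entry.lower()
    text = DURATION.sub("", text)
    text = text.replace(",", "")
    return text.strip()


sample = "Book read (30m), Physics notes (15m)"  # made-up entry
print(repr(preprocess_entry(sample)))
# 'book read  physics notes'
```

Removing a duration leaves a double space behind. That is harmless here because `word_tokenize` splits on whitespace, but it would matter if the cleaned strings were compared directly.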

# Making Chat Familiar With My Data

Using any of `sentence-transformers`, `nltk`, `openai`, `langchain`, or related packages causes many dependency issues if they are simply put into one big environment; they need a separate, self-contained environment.

In [16]:
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
import faiss
import ast
import openai
import os

nltk.download('punkt')
model = SentenceTransformer('all-MiniLM-L6-v2')

[nltk_data] Downloading package punkt to /Users/kevinb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [17]:
from utils.text_model import split_text_nltk, get_similar_chunks, generate_response

documents = clean_text['Study Materials'].tolist()
all_chunks = []
for doc in documents:
    chunks = split_text_nltk(doc)
    all_chunks.extend(chunks)

print(f"Total chunks created: {len(all_chunks)}")
embeddings = model.encode(all_chunks, show_progress_bar=True, convert_to_tensor=False)

embedding_df = pd.DataFrame({
    'chunk': all_chunks,
    'embedding': embeddings.tolist()
})

embedding_df.to_csv('embeddings.csv', index=False)
print("Embeddings saved to embeddings.csv")

Total chunks created: 633


Batches: 100%|██████████| 20/20 [00:00<00:00, 35.29it/s]


Embeddings saved to embeddings.csv
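
Writing the vectors to CSV stores each one as the text of a Python list, which then has to be parsed back with `ast.literal_eval`. As an alternative sketch, NumPy's binary format keeps the dtype and shape intact. The array below is a made-up stand-in for the encoder output, and `embeddings.npy` is a hypothetical file name.

```python
import os
import tempfile

import numpy as np

# Stand-in for the encoder output: 4 vectors of dimension 3 (made-up values).
embeddings = np.random.default_rng(0).normal(size=(4, 3)).astype("float32")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")  # hypothetical file name
    np.save(path, embeddings)   # binary write, no string conversion
    restored = np.load(path)    # comes back as a float32 array

print(restored.dtype, restored.shape)        # float32 (4, 3)
print(np.array_equal(embeddings, restored))  # True
```

The chunk text could still live in a CSV next to it, as long as row order is kept the same in both files.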


In [18]:
embedding_df = pd.read_csv('embeddings.csv')
embedding_df['embedding'] = embedding_df['embedding'].apply(ast.literal_eval)

# Convert embeddings to a NumPy array of type float32
embeddings = np.array(embedding_df['embedding'].tolist()).astype('float32')

# Initialize a FAISS index that uses L2 distance (cosine similarity is also an option)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add embeddings to the index
index.add(embeddings)
print(f"FAISS index has {index.ntotal} vectors.")

FAISS index has 633 vectors.
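
A flat L2 index does an exhaustive nearest-neighbour search: it compares the query with every stored vector. The NumPy sketch below mimics that on random data and also checks the relationship hinted at in the comment above: for unit-length vectors, ranking by L2 distance and ranking by cosine similarity give the same order. `l2_search` and `normalize` are hypothetical helpers, not FAISS functions.

```python
import numpy as np


def l2_search(vectors, query, top_k=5):
    """Return (squared L2 distances, row indices) of the top_k nearest rows."""
    diffs = vectors - query
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared norm
    order = np.argsort(sq_dists)[:top_k]
    return sq_dists[order], order


def normalize(vectors):
    """Scale vectors to unit length along the last axis."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
vectors = normalize(rng.normal(size=(100, 8)))  # made-up data
query = normalize(rng.normal(size=8))

_, by_l2 = l2_search(vectors, query, top_k=5)
by_cosine = np.argsort(-(vectors @ query))[:5]  # highest dot product first
print(np.array_equal(by_l2, by_cosine))
```

In FAISS itself, the usual route to cosine similarity is to L2-normalize the embeddings and use `faiss.IndexFlatIP` (inner product) in place of `faiss.IndexFlatL2`.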


In [19]:
embedding_df

Unnamed: 0,chunk,embedding
0,"2022-02-14 | Book read (30m), DOC 2 P3 part2 (...","[-0.02826680615544319, -0.004485192708671093, ..."
1,"2022-02-15 | Nutrition read (15m), Physics Lig...","[-0.06095949560403824, -0.054127417504787445, ..."
2,2022-10-21 |,"[-0.09296953678131104, 0.028568459674715996, 0..."
3,2022-10-22 | After consecutively doing DSC 10 ...,"[-0.08834125101566315, -0.009416475892066956, ..."
4,2022-10-23 | Wolfram good to check solution | ...,"[-0.05123080313205719, 0.03428500518202782, -0..."
...,...,...
628,2024-12-11 | RPLH eval + paper algorithm + RPL...,"[-0.09284821152687073, 0.07178058475255966, 0...."
629,"2024-12-12 | track-mjx meeting, RPLH paper wri...","[-0.04259365424513817, 0.030164869502186775, 0..."
630,"2024-12-13 | constraint article, EM article, m...","[-0.10550352931022644, 0.030236540362238884, 0..."
631,2024-12-14 | chill,"[-0.0889950692653656, 0.03961106017231941, 0.0..."


In [20]:
api_key = "..." #os.getenv('OPENAI_API_KEY')
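
The commented-out `os.getenv('OPENAI_API_KEY')` above points at the safer option of keeping the key out of the notebook. A minimal sketch of that, where `read_api_key` is a hypothetical helper that fails loudly when the variable is missing:

```python
import os


def read_api_key(var_name="OPENAI_API_KEY"):
    """Return the key stored in the environment variable `var_name`.

    Raises RuntimeError when the variable is unset or empty, so a missing
    key is caught before any request is sent.
    """
    key = os.getenv(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The cell above would then read `api_key = read_api_key()`.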

In [21]:
user_prompt = "What did I mainly do in 2022 fall quarter?"
similar_chunks = get_similar_chunks(user_prompt, index, embedding_df, top_k=5)
generate_response(user_prompt, similar_chunks, api_key=api_key)

'In the 2022 fall quarter, you mainly spent time on various tasks such as checking and documenting your spending for Christmas break and the BOA 2022 fall quarter, working on projects such as the Personal Study Time Analysis Project, Cogs 9 Final Project, DOC 1 Portfolio 4, and Math 18 大局观, and meeting new people.'

In [22]:
user_prompt = "What did I mainly do in 2023 fall quarter?"
similar_chunks = get_similar_chunks(user_prompt, index, embedding_df, top_k=5)
generate_response(user_prompt, similar_chunks, api_key=api_key)

"In 2023 fall quarter, I mainly did housework, planning, and training during the holiday break. I also spent time analyzing my spending and documenting it for Bank of America's 2022 fall quarter. I had a break on Christmas and New Year's Day. I also worked on various projects and submitted them, including a personal study time analysis project, a Cogs 9 final project, a DOC 1 portfolio, and a 大局观 for Math 18. I also had the opportunity to meet fun people during this time."

In [23]:
user_prompt = "What did I mainly do in 2024 fall quarter?"
similar_chunks = get_similar_chunks(user_prompt, index, embedding_df, top_k=5)
generate_response(user_prompt, similar_chunks, api_key=api_key)

'In 2024 fall quarter, you spent time cleaning your house, updating your resume and website, and analyzing study time data. You also spent time running and training your legs, and enjoyed some leisure activities. Additionally, you spent time working on a project for your Github repository.'
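
The answers for 2022 fall and 2023 fall cite the same projects, which suggests that retrieval follows the wording of the question more than its date. One possible remedy, sketched below, is to narrow the chunks by their leading date stamp before running the similarity search. This assumes every chunk starts with `YYYY-MM-DD |` as in the table above; `chunk_date` and `chunks_between` are hypothetical helpers, and the date range is an arbitrary example, not the actual quarter boundaries.

```python
from datetime import date


def chunk_date(chunk):
    """Parse the leading `YYYY-MM-DD` stamp of a chunk, or return None."""
    stamp = chunk.split("|", 1)[0].strip()
    try:
        return date.fromisoformat(stamp)
    except ValueError:
        return None


def chunks_between(chunks, start, end):
    """Keep chunks whose date stamp falls in the inclusive range [start, end]."""
    kept = []
    for chunk in chunks:
        stamp = chunk_date(chunk)
        if stamp is not None and start <= stamp <= end:
            kept.append(chunk)
    return kept


chunks = ["2022-10-21 | a", "2023-10-05 | b", "no date here"]  # made-up chunks
print(chunks_between(chunks, date(2023, 9, 1), date(2023, 12, 31)))
# ['2023-10-05 | b']
```

The filtered subset would need its own index (or a lookup back to the original row positions) before it is passed to the similarity search.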