<a href="https://colab.research.google.com/github/Ejeat12/Portfolio/blob/main/An_rudimentary_example_of_CCA_for_Question_Anwering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this project I demonstrate the effectiveness of the Canonical Correlation Analysis (CCA) algorithm on a question-answering task: a text corpus of arbitrary responses is collected, and the model selects the response that best answers a given prompt. For more details on the proposed algorithm, see the corresponding blog post: https://intro-to-deep-learning.blogspot.com/2023/05/youre-all-that-i-been-looking-for.html

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Step 1: Preprocess the text data
corpus = [
    "I like pie",
    "I go by james",
    "Eric said hi to marcus.",
    "Is this the first document?",
]

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

preprocessed_corpus = []
for doc in corpus:
    # Tokenize
    tokens = word_tokenize(doc.lower())
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Append preprocessed document to the corpus
    preprocessed_corpus.append(" ".join(filtered_tokens))

# Step 2: Transform preprocessed text data into vector representations
vectorizer = TfidfVectorizer()
text_vectors = vectorizer.fit_transform(preprocessed_corpus).toarray()

# Step 3: Apply CCA and analyze correlations between text vectors
dummy_view = np.random.randn(len(corpus), 10)  # Create a dummy view for CCA

cca = CCA(n_components=1)
cca.fit(text_vectors, dummy_view)
text_vectors_cca = cca.transform(text_vectors)

# Step 4: Perform similarity search using cosine similarity
query = "What is your name?"
preprocessed_query = " ".join([token for token in word_tokenize(query.lower()) if token not in stop_words])
query_vector = vectorizer.transform([preprocessed_query]).toarray()
query_vector_cca = cca.transform(query_vector)

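# Note: with n_components=1 every projection is a single number, so each
# cosine similarity below can only be +1 or -1 (the signs agree or differ).
similarities = cosine_similarity(text_vectors_cca, query_vector_cca.reshape(1, -1))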
most_similar_index = similarities.argmax()

print("Relevant answer:", corpus[most_similar_index])
print("Cosine similarity:", similarities[most_similar_index])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Relevant answer: I go by james
Cosine similarity: [1.]
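
The cell above projects every document and the query onto a single CCA component, which is why the reported similarity is exactly 1. The sketch below illustrates this with plain NumPy: cosine similarity between one-dimensional vectors only records whether their signs agree, while vectors with several components give graded scores. The random values and the `cosine` helper are assumptions made for illustration, not outputs of the pipeline above.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D arrays."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)

# Hypothetical one-component projections: one number per document and one for the query
doc_scores = rng.normal(size=(4, 1))
query_score = rng.normal(size=(1,))
sims_1d = [cosine(d, query_score) for d in doc_scores]  # every entry is +1 or -1

# Hypothetical three-component projections: the scores now vary in magnitude
doc_vecs = rng.normal(size=(4, 3))
query_vec = rng.normal(size=3)
sims_3d = [cosine(d, query_vec) for d in doc_vecs]

print("1 component: ", sims_1d)
print("3 components:", sims_3d)
```

Under these assumptions, ranking with a single component reduces to a sign check, so keeping more than one component is likely needed for the similarity scores to separate candidate answers.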


In [2]:
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence_transformers)
  Downloading huggingface_hub-0.15.1-py3-

In [5]:
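# Load bert-base-uncased through sentence_transformers. This checkpoint was not
# fine-tuned for sentence embeddings; the library wraps it with mean pooling
# over the token outputs to produce one vector per sentence.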
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-uncased')

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
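
Because `bert-base-uncased` is a plain BERT checkpoint, the sentence vector comes from pooling its per-token outputs. The sketch below shows mask-aware mean pooling in NumPy on made-up numbers. The `mean_pool` helper and the toy arrays are assumptions for illustration and do not call the library itself.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(float)     # shape (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)     # shape (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # shape (batch, 1)
    return summed / counts

# Toy values standing in for BERT token outputs: 2 sentences, 3 tokens, 2 dimensions
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                   [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]])
mask = np.array([[1, 1, 0],    # the last token of sentence 1 is padding
                 [1, 1, 1]])

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors)
```

The padded token in the first sentence is excluded from the average, so only real tokens contribute to each sentence vector.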


In [6]:
# Compute cosine similarity on text vectors without using CCA
sentence1 = "What is your name?"
sentence2 = "I go by james"

encoded_sentences = model.encode([sentence1, sentence2])

# Pairwise cosine similarity matrix; entry [0, 1] compares sentence1 with sentence2
cosine_sim = cosine_similarity(encoded_sentences)
similarity_score = cosine_sim[0, 1]

print("Cosine similarity:", similarity_score)


Cosine similarity: 0.6230755
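
For reference, the pairwise matrix returned by `cosine_similarity` can be reproduced by normalizing each row to unit length and taking dot products. The sketch below does this with NumPy on two made-up 2-D vectors. The `pairwise_cosine` helper and the toy values are assumptions, not the BERT embeddings from the cell above.

```python
import numpy as np

def pairwise_cosine(X):
    """Return the matrix of cosine similarities between all rows of X."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # scale each row to length 1
    return unit @ unit.T

# Toy 2-D vectors standing in for sentence embeddings (values are made up)
embeddings = np.array([[1.0, 0.0],
                       [1.0, 1.0]])

sim_matrix = pairwise_cosine(embeddings)
score = sim_matrix[0, 1]  # similarity between row 0 and row 1
print("Cosine similarity:", score)
```

The diagonal of the matrix is 1 (each vector compared with itself) and the matrix is symmetric, which is why a single off-diagonal entry is enough for a pair of sentences.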


# As we can see, the model correctly assigns the relevant answer "I go by james" to the prompt "What is your name?" when the text vectors are paired with a linear transformation and dimensionality reduction such as CCA. It is worth noting that the cost of this brute-force search grows linearly with the number of responses in the corpus, since the query is compared against every one of them. In a production environment I would therefore propose an approximate-nearest-neighbor algorithm, which scores only a small subset of candidate responses instead of all of them at once. For more details, see: https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6
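
As a rough illustration of the approximate-nearest-neighbor idea, here is a minimal random-hyperplane locality-sensitive hashing (LSH) sketch in NumPy. It is one of several ANN techniques and is not taken from the linked guide. The `HyperplaneLSH` class, its parameters, and the random vectors are all hypothetical.

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Toy approximate nearest-neighbor index using random hyperplanes."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_planes, dim))
        self.buckets = defaultdict(list)
        self.vectors = None

    def _key(self, v):
        # One bit per hyperplane: which side of the plane the vector falls on
        return tuple((self.planes @ v > 0).astype(int))

    def fit(self, vectors):
        self.vectors = np.asarray(vectors, dtype=float)
        for i, v in enumerate(self.vectors):
            self.buckets[self._key(v)].append(i)
        return self

    def query(self, q):
        q = np.asarray(q, dtype=float)
        # Score only the candidates in the query's bucket; scan everything if it is empty
        candidates = self.buckets.get(self._key(q)) or range(len(self.vectors))
        cand = np.fromiter(candidates, dtype=int)
        vecs = self.vectors[cand]
        sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
        return int(cand[np.argmax(sims)])

# Random vectors standing in for response embeddings (assumed data)
rng = np.random.default_rng(1)
corpus_vectors = rng.normal(size=(1000, 32))
index = HyperplaneLSH(dim=32).fit(corpus_vectors)

# A stored vector hashes to its own bucket, so querying with it retrieves itself
nearest = index.query(corpus_vectors[42])
print("Nearest index:", nearest)
```

Each query is compared only against the vectors sharing its hash bucket, trading some recall for speed. Production systems would more likely rely on a dedicated ANN library and on several hash tables rather than a single one.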