<a href="https://colab.research.google.com/github/Ejeat12/Portfolio/blob/main/An_rudimentary_example_of_CCA_for_Question_Anwering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this project I demonstrate the Canonical Correlation Analysis (CCA) algorithm on a question-answering task, where a text corpus of arbitrary responses is collected and then searched by the model to answer a given prompt. For more details on the proposed algorithm, feel free to check out the corresponding blog post here: https://intro-to-deep-learning.blogspot.com/2023/05/youre-all-that-i-been-looking-for.html

In [4]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Step 1: Preprocess the text data
corpus = [
    "I like pie",
    "I go by james",
    "Eric said hi to marcus.",
    "Is this the first document?",
]

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

preprocessed_corpus = []
for doc in corpus:
    # Tokenize
    tokens = word_tokenize(doc.lower())
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Append preprocessed document to the corpus
    preprocessed_corpus.append(" ".join(filtered_tokens))

# Step 2: Transform preprocessed text data into vector representations
vectorizer = TfidfVectorizer()
text_vectors = vectorizer.fit_transform(preprocessed_corpus).toarray()

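# Step 3: Apply CCA and analyze correlations between text vectors
# The second view is random noise, used only as a placeholder so that CCA has two views to fit.
# It is unseeded, so the learned projection can change from run to run.
dummy_view = np.random.randn(len(corpus), 10)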

cca = CCA(n_components=1)
cca.fit(text_vectors, dummy_view)
text_vectors_cca = cca.transform(text_vectors)

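# Step 4: Perform similarity search using cosine similarity
query = "What is your name?"
# Apply the same preprocessing as for the corpus: lowercase, tokenize, drop stop words
preprocessed_query = " ".join([token for token in word_tokenize(query.lower()) if token not in stop_words])
# Note: "name" is not in the vocabulary fitted on the corpus, so this TF-IDF vector is all zeros
query_vector = vectorizer.transform([preprocessed_query]).toarray()
query_vector_cca = cca.transform(query_vector)

# With n_components=1 each projection is a single number, so every cosine similarity is +1 or -1
similarities = cosine_similarity(text_vectors_cca, query_vector_cca.reshape(1, -1))
# argmax returns the index of the first document with the highest similarity
most_similar_index = similarities.argmax()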

print("Relevant answer:", corpus[most_similar_index])

Relevant answer: I go by james
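
To make the ranking step easier to interpret, here is a minimal sketch of how cosine similarity behaves on one-component versus multi-component projections. The helper `cosine_sim` and all the numbers are hypothetical stand-ins (not values produced by the CCA model above); the helper computes the same quantity as scikit-learn's `cosine_similarity` for a single query row.

```python
import numpy as np

def cosine_sim(docs, query):
    """Cosine similarity between each row of `docs` and a single `query` row."""
    docs = np.asarray(docs, dtype=float)
    query = np.asarray(query, dtype=float).ravel()
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    return docs @ query / norms

# Hypothetical 1-D projections: one canonical component per document.
docs_1d = np.array([[0.8], [-0.3], [1.7], [-2.1]])
query_1d = np.array([[0.5]])
sims_1d = cosine_sim(docs_1d, query_1d)
print(np.round(sims_1d, 6))  # each entry is +1 or -1: only the sign survives

# Hypothetical 2-D projections: the scores are now graded, so documents can be ranked.
docs_2d = np.array([[0.8, 0.1], [-0.3, 0.9], [1.7, -1.2], [-2.1, 0.4]])
query_2d = np.array([[0.5, 0.2]])
sims_2d = cosine_sim(docs_2d, query_2d)
print(np.round(sims_2d, 3))
```

With a single component, all documents on the same side of zero as the query tie, so the ranking cannot distinguish between them; with more components the scores separate.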




# As we can see, the model answers the prompt "What is your name?" with the relevant response "I go by james". It is worth noting that the cost of this brute-force search grows with the size of the corpus, because every query is compared against every stored response. In a production environment I would therefore propose an approximate-nearest-neighbor algorithm, which searches only a subset of candidate responses instead of all of them at once. For more details, click here: https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6
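
As a rough illustration of the approximate-nearest-neighbor idea, the sketch below uses random-hyperplane locality-sensitive hashing, which is one of several possible ANN techniques and not necessarily the one covered in the linked guide. The class `RandomHyperplaneLSH`, its parameters (`n_bits`, `seed`), and the random "response" vectors are all hypothetical; a production system would more likely rely on a dedicated ANN library.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Toy approximate-nearest-neighbor index for cosine similarity (illustrative only)."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        # Each row is the normal vector of one random hyperplane through the origin.
        self.planes = rng.standard_normal((n_bits, dim))
        self.buckets = {}
        self.vectors = None

    def _hash(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple((self.planes @ vec > 0).astype(int))

    def fit(self, vectors):
        self.vectors = np.asarray(vectors, dtype=float)
        self.buckets = {}
        for idx, vec in enumerate(self.vectors):
            self.buckets.setdefault(self._hash(vec), []).append(idx)
        return self

    def query(self, vec):
        vec = np.asarray(vec, dtype=float)
        # Score only the vectors that share the query's bucket;
        # fall back to a full scan if that bucket is empty.
        candidates = self.buckets.get(self._hash(vec)) or range(len(self.vectors))
        candidates = np.fromiter(candidates, dtype=int)
        cand_vecs = self.vectors[candidates]
        sims = cand_vecs @ vec / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(vec))
        return int(candidates[sims.argmax()])

rng = np.random.default_rng(42)
responses = rng.standard_normal((1000, 16))  # stand-in for projected response vectors
index = RandomHyperplaneLSH(dim=16).fit(responses)

# Sanity check: a stored vector lands in its own bucket and retrieves itself.
print(index.query(responses[123]))  # 123
```

The trade-off is that a true nearest neighbor can fall into a different bucket and be missed; practical indexes reduce this risk by using several hash tables or by probing neighboring buckets.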