# README: expectations for scenarios 1 and 2

Use LLMs to explain code for a single Jupyter Notebook **cell** and, if possible, suggest what can be improved.

### Requirements: 
- The code should be in Python.
- The explanation should take the code's context into account.

In [None]:
!pip install requests
!pip install transformers

## Get the sample notebook's content

We will use one of the Google Colab demo notebooks as an example. A copy is kept in this repo's `examples` folder.

In [2]:
import requests 
raw_ipynb_url = 'https://raw.githubusercontent.com/C2DH/ai-notebooks-summer-workshop/master/examples/AI_Workshop_Semantic_Search.ipynb'
ipynb_data = requests.get(raw_ipynb_url).json()
print(ipynb_data['cells'][0]['source'])

['# Digital history of representations: analysing identity and nationalism in the *Capuchin Annual* periodical (1930-1977)']
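The cell above only prints the first cell's source. To pass the whole notebook to an LLM as context, the cell sources have to be flattened into one string. The following is a minimal sketch of one way to do that, not the workshop's method: the helper names `cell_source` and `notebook_context` are hypothetical, and `sample_notebook` is a tiny stand-in for `ipynb_data` so the example runs without a network call. It relies on the notebook JSON storing each cell's `source` either as a string or as a list of strings.

```python
def cell_source(cell):
    """Return a cell's source as one string (it may be a str or a list of str)."""
    source = cell.get("source", "")
    return source if isinstance(source, str) else "".join(source)


def notebook_context(notebook, cell_types=("markdown", "code")):
    """Join the sources of the selected cell types into one context string."""
    parts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") in cell_types:
            text = cell_source(cell).strip()
            if text:  # skip empty cells
                parts.append(f"[{cell['cell_type']}]\n{text}")
    return "\n\n".join(parts)


# Tiny stand-in for `ipynb_data` (hypothetical content).
sample_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some intro."]},
        {"cell_type": "code", "source": "import numpy as np"},
        {"cell_type": "raw", "source": ["ignored"]},
    ]
}

context = notebook_context(sample_notebook)
print(context)
```

With the real notebook, `notebook_context(ipynb_data)` would be the equivalent call. Long notebooks may need truncating to fit the model's context window.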


## Scenario 1: Explain, do your magic!

Given the notebook as context, let your LLM explain the following cell's code and suggest improvements.

In [None]:
cell_code = """
tfidf_vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(stop_words=stop_words)
tfidf_matrix = tfidf_vectorizer.fit_transform([' '.join(tokenize(d)) for d in df['description']])
svd = SVD(10)
svd.fit(tfidf_matrix)

def cosine(a, b):
    eps = 1e-8
    if type(a) is np.ndarray:
        return a.dot(b) / ((np.linalg.norm(a) * np.linalg.norm(b)) + eps)
    else:
        return a.dot(b) / ((a.norm() * b.norm()) + eps)

def tfidf_model(query, document):
    query_vector = tfidf_vectorizer.transform([' '.join(tokenize(query, tfidf_vectorizer.get_feature_names()))]).todense()
    doc_vector = tfidf_vectorizer.transform([' '.join(tokenize(document, tfidf_vectorizer.get_feature_names()))]).todense()
    doc_vector = np.squeeze(np.asarray(doc_vector))
    query_vector = np.squeeze(np.asarray(query_vector))
    return cosine(query_vector, doc_vector)

def lsa_model(query, document):
    query = ' '.join(tokenize(query, tfidf_vectorizer.get_feature_names()))
    document = ' '.join(tokenize(document, tfidf_vectorizer.get_feature_names()))
    query_vector = tfidf_vectorizer.transform([query]).todense()
    doc_vector = tfidf_vectorizer.transform([document]).todense()
    query_vector = svd.transform(query_vector)
    doc_vector = svd.transform(doc_vector)
    return cosine(np.squeeze(np.asarray(doc_vector)), np.squeeze(np.asarray(query_vector)))

def search(model, query, documents, names = None):
    scores = [model(query, document) for document in documents]
    ixs = list(reversed(np.argsort(scores)[-10:]))

    data = [[scores[i] for i in ixs], [documents[i][:150] for i in ixs]]
    if names is not None:
        data.append([names[i] for i in ixs])
    tabulate(data, header=False)
"""
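How the notebook context and `cell_code` reach the LLM is left open by this scenario. One possible approach, sketched here under assumptions, is to assemble a single text prompt. The function name `build_prompt`, the `audience` parameter, and the prompt wording are all hypothetical and should be adapted to the model you use.

```python
def build_prompt(context, cell_code, audience="advanced"):
    """Assemble one prompt string from the notebook context and the target cell."""
    # Hypothetical audience styles, mirroring the two expected outputs of this scenario.
    styles = {
        "advanced": "Answer for advanced users, one paragraph per task.",
        "beginner": "Answer for someone new to Python, in plain language.",
    }
    if audience not in styles:
        raise ValueError(f"unknown audience: {audience!r}")
    return (
        "You are given the content of a Jupyter notebook as context.\n\n"
        f"Notebook context:\n{context}\n\n"
        f"Cell to explain:\n{cell_code.strip()}\n\n"
        "Tasks:\n"
        "1. Explain what the cell's code does.\n"
        "2. Suggest improvements.\n\n"
        f"{styles[audience]}"
    )


# Stand-in inputs so the sketch runs on its own.
example_context = "[markdown]\n# Demo notebook"
example_cell = "\nx = 1\n"
prompt = build_prompt(example_context, example_cell, audience="beginner")
print(prompt)
```

The resulting string can be sent to whichever LLM you choose. Keeping prompt construction separate from the model call makes it easier to compare models on the same input.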

Expected output for scenario 1, using Gemini (https://gemini.google.com):

### Explanation for advanced users, one paragraph:

This code builds a search engine that can utilize either TF-IDF or LSA for document retrieval. It creates TF-IDF vectors for all document descriptions, then allows you to choose between a TF-IDF or LSA search model. Both models preprocess the query, convert it and documents to vectors using the chosen method (TF-IDF or LSA with dimensionality reduction), and calculate cosine similarity to rank documents based on their relevance to the query.
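To make the TF-IDF plus cosine-similarity ranking described above concrete, here is a small standard-library sketch. It is an illustration and not part of the Gemini output or the notebook's code: the whitespace tokenizer, the smoothed IDF formula, and the three toy documents are assumptions, and the weights will not match scikit-learn's `TfidfVectorizer` exactly.

```python
import math
from collections import Counter


def tokenize(text):
    """Very small stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map text to a sparse {term: weight} vector; terms unseen at fit time are dropped."""
    counts = Counter(term for term in tokenize(text) if term in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a, b, eps=1e-8):
    """Cosine similarity of two sparse vectors; eps avoids division by zero."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b + eps)


def rank(query, documents):
    """Return document indices sorted by decreasing similarity, plus the raw scores."""
    idf = fit_idf(documents)
    query_vector = tfidf_vector(query, idf)
    scores = [cosine(query_vector, tfidf_vector(doc, idf)) for doc in documents]
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return order, scores


documents = ["apple pie recipe", "fresh apple orchard apple", "train timetable"]
order, scores = rank("apple orchard", documents)
print(order)  # document 1 matches both query terms, document 2 matches none
```

The LSA variant in the cell adds one step before the cosine: both vectors are projected into a lower-dimensional space with a fitted SVD.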


### Improvements for advanced users, one paragraph:

This code can be improved by enhancing text preprocessing (stemming/lemmatization), tuning model parameters (stop words, n-grams, LSA components), using more advanced ranking algorithms (BM25), and improving result presentation (titles, summaries, keyword highlighting). For scalability with large datasets, consider sparse matrices or optimized indexing. Finally, building a user interface would provide a more interactive search experience.
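As a rough illustration of the BM25 suggestion above, the sketch below scores documents with a common Okapi BM25 formulation using only the standard library. It is not part of the Gemini output. The parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, the whitespace tokenizer, and the toy documents are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very small stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Average document length; guarded so an empty corpus does not divide by zero.
    avg_len = sum(len(tokens) for tokens in tokenized) / max(n_docs, 1) or 1.0
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in tokenize(query):
            freq = term_freq.get(term, 0)
            if freq == 0:
                continue
            n_t = doc_freq[term]
            # IDF variant that stays non-negative even for very frequent terms.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


documents = ["apple pie recipe", "fresh apple orchard apple", "train timetable"]
scores = bm25_scores("apple orchard", documents)
order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(order)
```

Unlike plain TF-IDF with cosine similarity, BM25 saturates the contribution of repeated terms through `k1` and normalizes for document length through `b`.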


### Explanation and improvements for a user new to Python:

This code creates a search engine behind the scenes!

It allows you to search through documents using two methods:

- Regular Search (TF-IDF): This focuses on keywords in the documents. Imagine searching for "apple" and it finds documents with "apple" mentioned a lot.

- Smarter Search (LSA): This tries to understand the meaning behind words. For beginners, think of synonyms. If you search for "apple" it might also find documents with "fruit" or "orchard" because they are related.

The code can be improved by:

- Being pickier with words: It can currently remove common words like "the" but you might want it to also consider removing words like "very" or "just".
- Understanding words better: Right now, it treats "apple" and "apples" as different, but ideally it would recognize they mean the same thing.
- Ranking results better: Currently, it just picks documents that have similar words. In the future, it could consider other factors like document length or how many times a word appears.
- Showing results in a nicer way: Right now, it just shows a score and a bit of the document. In the future, it could show the title, a summary, and highlight the words that matched your search.