# In-Context Learning


In-context learning generalises few-shot learning: the LLM receives a context as part of the prompt and is asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*

Retrieval-Augmented Generation (RAG), a popular technique that you will learn in week 5, is a form of in-context learning in which:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
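
The two steps above can be sketched in a few lines. This is a toy illustration only, assuming a word-overlap retriever in place of a real search engine; `retrieve` and `build_prompt` are hypothetical helper names, and the example documents paraphrase the papers summarized later in this notebook.

```python
def retrieve(query, documents, k=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, context_docs):
    """Place the retrieved text in the prompt, ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


documents = [
    "YuLan-Mini is a 2.42B parameter language model.",
    "Gist token-based methods compress context for long-context processing.",
    "Molar integrates multimodal LLMs with collaborative filtering.",
]
query = "How many parameters does YuLan-Mini have?"

# The resulting prompt is what would be sent to the LLM
prompt = build_prompt(query, retrieve(query, documents))
print(prompt)
```

A real RAG system would typically replace the word-overlap ranking with a search index or embedding similarity, but the prompt assembly step stays much the same.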


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each paper and list its main strengths and weaknesses. Finally, we write the summaries to a local HTML file and display them as markdown.

In [31]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm

In [32]:
# Read the key from the environment rather than hard-coding it in the notebook
API_KEY = os.environ.get("GEMINI_API_KEY", "")
genai.configure(api_key=API_KEY)

We select the papers currently featured on the Hugging Face papers page.

In [33]:
BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL)
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

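for h3 in h3s:
    a = h3.find("a")
    # Skip headings that do not wrap a paper link
    if a is None or not a.get("href"):
        continue
    title = a.text.strip()
    # "/papers/<arXiv id>" -> "/<arXiv id>"
    link = a["href"].replace('/papers', '')

    papers.append({"title": title, "url": f"https://arxiv.org/pdf{link}"})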
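
The link rewriting above relies on each `href` having the form `/papers/<arXiv id>`. A minimal sketch of a stricter variant, assuming new-style arXiv identifiers (four digits, a dot, then four or five digits); `to_arxiv_pdf_url` is a hypothetical helper, not part of the notebook:

```python
import re

# Assumed shape of a paper link on the listing page: /papers/2412.17743
ARXIV_HREF = re.compile(r"^/papers/(\d{4}\.\d{4,5})$")


def to_arxiv_pdf_url(href):
    """Return the arXiv PDF URL for a paper link, or None if the link does not match."""
    match = ARXIV_HREF.match(href)
    if match is None:
        return None
    return f"https://arxiv.org/pdf/{match.group(1)}"


print(to_arxiv_pdf_url("/papers/2412.17743"))
print(to_arxiv_pdf_url("/blog/some-post"))
```

Returning `None` for unexpected links lets the caller skip them instead of building a URL that cannot be downloaded.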

Helper functions to extract text from web pages and PDFs.

In [34]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


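def extract_pdf(url):
    # Download the PDF to a local file (overwritten on every call), then read it page by page
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    text = ""
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. image-only pages)
        text += (page.extract_text() or "") + "\n"
    return text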


def printmd(string):
    display(Markdown(string))

In [35]:
LLM = "gemini-1.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [36]:
for paper in tqdm(papers):
    try:
        paper["summary"] = model.generate_content("Summarize this research article into one paragraph without formatting highlighting its strengths and weaknesses. " + extract_pdf(paper["url"])).text
    except Exception as e:
        print(f"Generation failed for {paper['title']}: {e}")
        paper["summary"] = "Paper not available"

100%|██████████| 4/4 [00:19<00:00,  4.84s/it]
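
The loop above gives up on a paper after the first failure. If failures are transient (for example a network hiccup or a rate limit), one option is to retry a few times before falling back. The following is a minimal sketch under that assumption; `with_retries` and `flaky_summarize` are hypothetical names used for illustration, and the fake function stands in for the real API call.

```python
import time


def with_retries(fn, attempts=3, delay_seconds=0.0):
    """Call fn() until it succeeds or the attempts run out, then re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt
                time.sleep(delay_seconds * attempt)
    raise last_error


# Stand-in for a model call that fails twice and then succeeds
calls = {"count": 0}


def flaky_summarize():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "summary text"


result = with_retries(flaky_summarize, attempts=3)
print(result, calls["count"])
```

In the notebook, the wrapped function would be a small lambda around the `generate_content` call, with a non-zero `delay_seconds`.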


We write the results to an HTML file.

In [37]:
# Visible content belongs in <body>; <head> only carries metadata such as the title
page = f"<html> <head> <title>Daily Dose of AI Research</title> </head> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
with open("papers.html", "w") as f:
    f.write(page)
for paper in papers:
    page = f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> <p>{paper["summary"]}</p>'
    with open("papers.html", "a") as f:
        f.write(page)
end = "</body> </html>"
with open("papers.html", "a") as f:
    f.write(end)

We can also print the results to this notebook as markdown.

In [38]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[YuLan-Mini: An Open Data-efficient Language Model](https://arxiv.org/pdf/2412.17743)**<br>This research paper details the development of YuLan-Mini, a 2.42B parameter language model achieving top-tier performance among similarly sized models.  Its strengths lie in a meticulously designed data pipeline combining data cleaning, scheduling strategies, and synthetic data generation, particularly for reasoning tasks.  A robust optimization method, incorporating techniques like µP initialization and WeSaR re-parameterization, effectively mitigates training instability.  Furthermore, an annealing approach with targeted data selection and long context training further boosts performance.  However, a weakness is the reliance on a relatively small training dataset (1.08T tokens) compared to industry-leading models, limiting its long-context capabilities.  While the open-sourcing of the model and detailed training procedures is a significant contribution, the limited computational resources available may hinder complete reproducibility for other researchers.
<br><br>

**[A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression](https://arxiv.org/pdf/2412.17483)**<br>This research paper comprehensively investigates gist token-based context compression methods for improving long-context processing in large language models (LLMs).  The study finds that while these methods achieve near-lossless performance on tasks like retrieval-augmented generation and long-document QA, they struggle with tasks requiring precise recall, exhibiting three key failure patterns: information loss at segment boundaries, loss of surprising details, and loss of information mid-sequence.  Strengths include a unified framework for categorizing existing methods, extensive experimentation across diverse tasks, and the proposal of two effective mitigation strategies: fine-grained autoencoding and segment-wise token importance estimation.  However, a weakness is the limitation to relatively small LLMs (7-8B parameters) and a focus solely on gist token-based compression, potentially overlooking other effective context compression techniques.  Further research on larger models and a broader comparison of methods is needed.
<br><br>

**[Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation](https://arxiv.org/pdf/2412.18176)**<br>Molar is a novel sequential recommendation framework that integrates multimodal large language models (MLLMs) with collaborative filtering.  It addresses the limitations of existing LLM-based methods which neglect collaborative filtering information by using an MLLM to generate item representations from textual and non-textual data, and then aligning user representations from content-based and ID-based models via a post-alignment mechanism.  Experiments demonstrate Molar's superior performance compared to traditional and LLM-based baselines across multiple datasets.  However, a key weakness is the computationally intensive multi-task fine-tuning process, hindering real-time applications.  Furthermore, performance is dependent on the underlying capabilities of the MLLM used, limiting scalability to larger models.
<br><br>

**[MMFactory: A Universal Solution Search Engine for Vision-Language Tasks](https://arxiv.org/pdf/2412.18072)**<br>This research introduces MMFactory, a novel framework for solving vision-language tasks that acts as a solution search engine.  Unlike previous approaches that rely on single models or sample-specific solutions, MMFactory proposes a diverse pool of programmatic solutions by combining various vision, language, and vision-language models based on a user-provided task description, sample input-output pairs, and optional constraints (e.g., computational resources).  A committee-based multi-agent LLM system generates these executable solutions, ensuring robustness and diversity.  The framework also includes a metric router to evaluate and benchmark the performance and resource usage of each solution, enabling users to select the optimal solution for their needs.  While experimental results demonstrate state-of-the-art performance on benchmark datasets, a potential weakness is the computational cost of the multi-agent system, although this is mitigated by the reusability of generated solutions across all task instances.  Furthermore, the reliance on LLMs for both solution and metric routing might introduce biases and limitations inherent to these models.
<br><br>

In [39]:
# Modified prompt for tabulated analysis
for paper in tqdm(papers):
    try:
        prompt = """Analyze this research article and provide:
1. A brief one-sentence summary
2. Key strengths (list 3 points)
3. Key weaknesses (list 3 points)

Format the response as follows:
Summary: [one sentence]
| Strengths | Weaknesses |
| --- | --- |
| [strength 1] | [weakness 1] |
| [strength 2] | [weakness 2] |
| [strength 3] | [weakness 3] |

Article text: """ + extract_pdf(paper["url"])

        paper["analysis"] = model.generate_content(prompt).text
    except Exception as e:
        print(f"Generation failed for {paper['title']}: {e}")
        paper["analysis"] = "Paper not available"

# Modified markdown printing
for paper in papers:
    printmd(f"""**[{paper['title']}]({paper['url']})**\n
{paper['analysis']}\n\n---\n""")

100%|██████████| 4/4 [00:21<00:00,  5.42s/it]


**[YuLan-Mini: An Open Data-efficient Language Model](https://arxiv.org/pdf/2412.17743)**

Summary: YuLan-Mini is a data-efficient 2.42B parameter language model achieving top-tier performance among similarly sized models by employing an elaborate data pipeline, a robust optimization method, and an effective annealing approach.

| Strengths | Weaknesses |
|---|---|
| Achieves top-tier performance comparable to much larger models, demonstrating high data efficiency. | Limited long-context capability due to resource constraints; only achieved 28K context window. |
| Open-source and reproducible; full training details and data composition are released. |  The study primarily focuses on a specific model and its training methods; generalization to other models is not fully explored. |
|  Employs a comprehensive approach to training stability, combining several methods to mitigate instability issues. | The reproducibility of the baseline models' results is challenged due to incomplete reporting of evaluation setup. |



---


**[A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression](https://arxiv.org/pdf/2412.17483)**

Summary: This research comprehensively investigates gist token-based context compression for large language models, revealing its effectiveness in many tasks but identifying key failure patterns and proposing strategies to mitigate them.

| Strengths | Weaknesses |
|---|---|
| Comprehensive evaluation across diverse language tasks, including language modeling, weak context-dependent tasks, and long context tasks.  | Limited model size and context length explored due to computational resource constraints;  larger models might show different results. |
| Identification of three critical failure patterns (lost by the boundary, lost if surprise, lost along the way) arising from compression bottlenecks, providing valuable insights into the limitations of the method. | Focus solely on gist token-based compression;  other context compression methods are not included in the comparative analysis.  |
| Proposal of two effective strategies (fine-grained autoencoding and segment-wise token importance estimation) to mitigate identified weaknesses and improve model performance. |  While the proposed mitigation strategies show improvement, they don't entirely eliminate the identified failure patterns. |



---


**[Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation](https://arxiv.org/pdf/2412.18176)**

Summary: Molar, a novel multimodal large language model framework, enhances sequential recommendation by integrating multiple content modalities with collaborative filtering signals through a post-alignment mechanism, achieving superior performance compared to existing methods.

| Strengths | Weaknesses |
|---|---|
|  Combines multimodal data (text, images) with collaborative filtering, leveraging the strengths of both approaches. | Requires multi-task fine-tuning, which can be computationally expensive and time-consuming. |
|  Utilizes a post-alignment mechanism to effectively integrate collaborative filtering signals without hindering the LLM's semantic understanding. | Performance heavily relies on the underlying capabilities of the MLLM; suboptimal base models can degrade overall performance. |
|  Consistently outperforms traditional and LLM-based baselines across multiple datasets, demonstrating its effectiveness and robustness. |  Limited by computational constraints; inability to train larger LLMs may hinder further performance improvements. |



---


**[MMFactory: A Universal Solution Search Engine for Vision-Language Tasks](https://arxiv.org/pdf/2412.18072)**

Summary: MMFactory is a universal framework that acts as a solution search engine for vision-language tasks, suggesting diverse programmatic solutions tailored to user specifications and constraints by combining various models.

| Strengths | Weaknesses |
|---|---|
| Proposes multiple programmatic solutions for a given task, allowing users to choose based on performance and resource constraints. |  The reliance on a large language model (LLM) like GPT-4 as the core component raises concerns about cost and accessibility for users. |
|  Addresses limitations of existing methods by considering user constraints (computation, performance) and generating generalized solutions applicable to all task instances, not just individual examples. | The paper lacks detailed information on the size and complexity of the MMFactory framework itself, making it difficult to assess its scalability and practical deployment. |
| Outperforms state-of-the-art methods on two benchmarks (BLINK and Seedbench) by delivering tailored solutions.  | The ablation study, while informative, is limited in scope, focusing primarily on the individual components of the multi-agent system rather than a broader evaluation of the framework's overall performance. |



---


In [40]:
# Modified HTML printing
page = f"""<html>
<head>
    <style>
        table {{
            border-collapse: collapse;
            width: 100%;
            margin: 20px 0;
        }}
        th, td {{
            border: 1px solid #ddd;
            padding: 8px;
            text-align: left;
        }}
        th {{
            background-color: #f2f2f2;
        }}
    </style>
</head>
<body>
    <h1>Daily Dose of AI Research</h1>
    <h4>{date.today()}</h4>
    <p><i>Analysis generated with: {LLM}</i></p>"""

with open("papers_table.html", "w") as f:
    f.write(page)


for paper in papers:
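    # Crude conversion: every '|' becomes a cell boundary and every newline a row boundary,
    # so the summary line and the '---' separator row also end up as table cells
    analysis_html = paper['analysis'].replace('|', '</td><td>').replace('\n', '</td></tr><tr><td>')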
    page = f"""
    <h2><a href="{paper['url']}">{paper['title']}</a></h2>
    <table>
        <tr><td>{analysis_html}</td></tr>
    </table>
    <hr>"""
    with open("papers_table.html", "a") as f:
        f.write(page)


end = "</body></html>"
with open("papers_table.html", "a") as f:
    f.write(end)
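
The string replacement used above treats every line of the analysis as a table row. A slightly more careful conversion can be sketched as follows. This is only an illustration, assuming the model returns simple pipe tables like the ones shown earlier; `markdown_table_to_html` is a hypothetical helper, and a full markdown library would be the more robust choice.

```python
from html import escape


def markdown_table_to_html(markdown):
    """Turn pipe-table lines into an HTML table and all other non-empty lines into paragraphs."""
    parts, rows = [], []

    def flush_rows():
        # Close the table collected so far, if any
        if rows:
            parts.append("<table>\n" + "\n".join(rows) + "\n</table>")
            rows.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped.strip("|").split("|")]
            # Skip the separator row, e.g. | --- | --- |
            if all(cell and set(cell) <= set("-: ") for cell in cells):
                continue
            # The first row of each table is treated as the header
            tag = "th" if not rows else "td"
            row = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in cells)
            rows.append(f"<tr>{row}</tr>")
        else:
            flush_rows()
            if stripped:
                parts.append(f"<p>{escape(stripped)}</p>")
    flush_rows()
    return "\n".join(parts)


sample = "Summary: A & B.\n\n| Strengths | Weaknesses |\n|---|---|\n| fast | costly |"
html_out = markdown_table_to_html(sample)
print(html_out)
```

Escaping the cell text with `html.escape` also keeps characters such as `&` or `<` in the model output from breaking the page.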

In [41]:
# Open source model setup (note: `model` below rebinds the name previously used for the Gemini model)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct",
    device_map="cuda" if torch.cuda.is_available() else "cpu", # Use GPU if available
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

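generation_args = {
    "max_new_tokens": 500,  # Cap on generated tokens; longer answers are cut off
    "return_full_text": False,  # Return only the completion, not the prompt
    "temperature": 0.1,  # Low temperature keeps sampling close to greedy decoding
    "do_sample": True,
    "top_p": 0.9,  # Nucleus sampling threshold
}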

# Modified paper analysis loop for the open source model
for paper in tqdm(papers):
    try:
        # Limit the input to the first 2000 characters (not tokens) of the paper
        pdf_text = extract_pdf(paper["url"])
        max_input_length = 2000
        truncated_text = pdf_text[:max_input_length]

        messages = [{
            "role": "system",
            "content": "You are a research paper analyzer. Provide analysis in a table format with strengths and weaknesses."
        }, {
            "role": "user",
            "content": f"""Analyze this research article and provide:
1. A brief one-sentence summary
2. Key strengths (list 3 points)
3. Key weaknesses (list 3 points)

Format the response as follows:
Summary: [one sentence]
| Strengths | Weaknesses |
| --- | --- |
| [strength 1] | [weakness 1] |
| [strength 2] | [weakness 2] |
| [strength 3] | [weakness 3] |

Article text: {truncated_text}"""
        }]

        paper["analysis"] = pipe(messages, **generation_args)[0]['generated_text']
    except Exception as e:
        print(f"Generation failed for {paper['title']}: {e}")
        paper["analysis"] = "Paper not available"

# Modified markdown printing
for paper in papers:
    printmd(f"""**[{paper['title']}]({paper['url']})**\n
{paper['analysis']}\n\n---\n""")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda
100%|██████████| 4/4 [02:12<00:00, 33.17s/it]


**[YuLan-Mini: An Open Data-efficient Language Model](https://arxiv.org/pdf/2412.17743)**

 Summary: YuLan-Mini is a highly efficient language model with 2.42B parameters that achieves top performance with significantly less data than industry-leading models, thanks to an innovative pre-training approach.

| Strengths | Weaknesses |
| --- | --- |
| 1. Achieves top-tier performance with a significantly smaller dataset (1.08T tokens) compared to industry standards, demonstrating data efficiency. | 1. The paper may not fully address the potential limitations or challenges in scaling the model beyond the current parameter size. |
| 2. Introduces a novel pre-training approach with three key technical contributions (data pipeline, robust optimization, and effective annealing), which could be beneficial for future research and development in the field. | 2. The effectiveness of the proposed techniques may be context-dependent, and further studies are needed to evaluate their generalizability across different tasks and domains. |
| 3. Facilitates reproducibility and further research by releasing full details of the data composition for each training phase and providing access to project details on GitHub. | 3. The paper does not discuss the computational resources required for training YuLan-Mini, which could be a barrier for researchers with limited access to high-performance computing infrastructure. |

Note: The weaknesses listed are hypothetical and based on common challenges in research papers. The actual weaknesses would depend on a thorough analysis of the paper's content, methodology, and results.

---


**[A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression](https://arxiv.org/pdf/2412.17483)**

 Summary: The study investigates gist-based context compression methods to enhance long-context processing in large language models, revealing near-lossless performance in certain tasks but challenges in others, and proposing strategies to mitigate identified failure patterns.

| Strengths | Weaknesses |
| --- | --- |
| [Strength 1] Comprehensive investigation of gist-based context compression methods, providing valuable insights into their application and limitations. | [Weakness 1] The study may not cover all possible failure patterns or scenarios, leaving room for further exploration. |
| [Strength 2] Identification of three key failure patterns (lost by the boundary, lost if surprise, and lost along the way), which helps in understanding the limitations of gist-based compression. | [Weakness 2] The proposed strategies (fine-grained autoencoding and segment-wise token importance estimation) may require significant computational resources, limiting their practicality in resource-constrained environments. |
| [Strength 3] Practical strategies for improving compression capabilities, such as fine-grained autoencoding and segment-wise token importance estimation, offer actionable solutions for enhancing the performance of gist-based context compression. | [Weakness 3] The study's experimental results are focused on specific tasks (retrieval-augmented generation and long-document QA), which may not generalize to all types of long-context processing applications, potentially limiting the broader applicability of the findings. |

---


**[Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation](https://arxiv.org/pdf/2412.18176)**

 Summary: Molar is a novel framework for sequential recommendation that integrates multimodal content with collaborative filtering signals, significantly improving recommendation accuracy over traditional and LLM-based methods.

| Strengths | Weaknesses |
| --- | --- |
| 1. **Integration of Multimodal Data**: Molar effectively combines textual and non-textual data, enriching item representations and capturing a more comprehensive user interest profile. | 1. **Complexity and Resource Intensity**: The framework may require significant computational resources and expertise to implement, potentially limiting its accessibility and scalability. |
| 2. **Enhanced Personalization**: By aligning user representations from content-based and ID-based models, Molar ensures precise personalization, leading to more relevant recommendations. | 2. **Dependency on Quality of Data**: The performance of Molar heavily relies on the quality and diversity of the multimodal data and collaborative filtering signals, which may not always be available or accurately maintained. |
| 3. **Superior Performance in Experiments**: Extensive experimental validation demonstrates Molar's significant outperformance over traditional and LLM-based baselines, showcasing its effectiveness in capturing both user interests and contextual semantics. | 3. **Potential Overfitting**: The sophisticated modeling approach, while beneficial for performance, may lead to overfitting, especially in scenarios with limited or noisy data, affecting the model's generalizability. |

These strengths and weaknesses highlight Molar's innovative approach to sequential recommendation, balancing the benefits of multimodal data integration and collaborative filtering with considerations regarding implementation complexity, data dependency, and model robustness.

---


**[MMFactory: A Universal Solution Search Engine for Vision-Language Tasks](https://arxiv.org/pdf/2412.18072)**

 Summary: MMFactory is a universal framework that acts as a solution search engine for vision-language tasks, offering a diverse pool of programmatic solutions based on task descriptions, sample inputs/outputs, and user-defined constraints.

| Strengths | Weaknesses |
| --- | --- |
| 1. Provides a universal framework that can handle a wide range of vision-language tasks, making it versatile and adaptable to various applications. | 1. The effectiveness of the suggested solutions may depend heavily on the quality and relevance of the input-output pairs provided, which could limit its applicability in scenarios with limited or ambiguous data. |
| 2. Incorporates user-defined constraints such as resource and performance limitations, allowing for more tailored and efficient solutions. | 2. The framework's reliance on a model repository may restrict its ability to leverage the latest or most specialized models not included in the repository, potentially limiting its performance on cutting-edge tasks. |
| 3. Utilizes a committee-based solution proposer and leverages multi-agent LLM conversation, enhancing the generation of diverse, universal, and robust solutions. | 3. The complexity of the framework and its reliance on advanced AI components may pose challenges in terms of usability for non-expert users, requiring a steeper learning curve or technical support. |
| 4. Offers a systematic approach to synthesizing solutions by instantiating and combining various visio-lingual tools, potentially improving the efficiency of finding suitable solutions. | 4. The performance and resource characteristics proposed by the framework may not always accurately reflect real-world scenarios, leading to potential mismatches between expected and actual performance. |
| 5. By proposing metrics and benchmarks, MMFactory empowers users to make informed decisions based on their unique design constraints, enhancing the overall usability and effectiveness of the framework. | 5. The framework's ability to generate executable solutions may be limited by the quality and compatibility of the underlying models and tools, potentially requiring additional integration efforts for optimal results. |

Note: The strengths and weaknesses listed above are inferred from the provided article text and general knowledge of similar frameworks. The actual strengths and weaknesses may vary based on

---
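
The open source model only sees the first 2000 characters of each paper, which may explain why some of its listed weaknesses are generic. One possible alternative, not used in this notebook, is to split the full text into pieces that each fit the budget and analyze them separately. The sketch below shows only the splitting step; `chunk_text` is a hypothetical helper, and it counts characters rather than tokens.

```python
def chunk_text(text, max_chars=2000):
    """Split text into pieces of at most max_chars, breaking on whitespace where possible."""
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single word that is longer than the budget
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


pieces = chunk_text("alpha beta gamma delta", max_chars=10)
print(pieces)
```

Note that this sketch collapses line breaks into single spaces, which is usually acceptable for text extracted from PDFs.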
