# Text Summarization

In this notebook we use a pretrained summarization model. We summarize the abstract and the first two sections of OpenAI's paper introducing the CLIP model:

CLIP Paper: https://arxiv.org/abs/2103.00020

Model (1.22 GB)

In [30]:
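from functools import lru_cache

import pandas as pd
from transformers import pipeline

@lru_cache(maxsize=1)
def get_summarizer():
    # Build the pipeline once and reuse it; creating it on every call reloads the model.
    return pipeline("summarization")

def summarize_text(text):
    # Deterministic (non-sampled) summary of between 30 and 150 tokens
    result = get_summarizer()(text, max_length=150, min_length=30, do_sample=False)
    return result[0]['summary_text']

def read_text_file(file_path):
    # Assumes the text files are UTF-8 encoded
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

def summarize_text_file(file_path):
    # Read the content of the text file
    text_content = read_text_file(file_path)

    # Generate a summary of the text content
    return summarize_text(text_content)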

In [None]:
summary_abstract = summarize_text_file("./texts/abstract.txt")
summary_intro1 = summarize_text_file("./texts/introduction.txt")
summary_intro2 = summarize_text_file("./texts/introduction2.txt")
summary_approach21 = summarize_text_file("./texts/approach21.txt")
summary_approach22 = summarize_text_file("./texts/approach22.txt")
summary_approach23 = summarize_text_file("./texts/approach23.txt")
summary_approach24 = summarize_text_file("./texts/approach24.txt")
summary_approach25 = summarize_text_file("./texts/approach25.txt")
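
The introduction is read from two separate files above. If the reason is the model's limited input length (an assumption; the notebook does not say), the split can also be done in code. A minimal sketch, in which `split_into_chunks`, `summarize_long_text`, and the `max_words` default are hypothetical names and settings, and the word count is only a rough stand-in for the model's token limit:

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long_text(text, summarize, max_words=400):
    """Summarize each chunk separately and join the partial summaries.

    `summarize` is any callable mapping a string to a string,
    for example the notebook's summarize_text.
    """
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize(chunk).strip() for chunk in chunks)
```

With the helpers defined earlier, `summarize_long_text(read_text_file(path), summarize_text)` could replace the manual split. Chunks cut at word boundaries can break sentences, so splitting at paragraph boundaries may give cleaner summaries.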

In [38]:
df = pd.DataFrame({
    "section": [
        "Abstract",
        "1 Introduction",
        "2.1 Natural Language Supervision",
        "2.2 Creating a Sufficiently Large Dataset",
        "2.3 Selecting an Efficient Pre-Training Method",
        "2.4 Choosing and Scaling a Model",
        "2.5 Training",
    ],
    "summary": [
        summary_abstract,
        summary_intro1 + summary_intro2,
        summary_approach21,
        summary_approach22,
        summary_approach23,
        summary_approach24,
        summary_approach25,
    ]
})
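
The same table can be built from a mapping of section titles to source files, which avoids keeping one variable per file. A sketch under assumptions: `build_summary_rows` is a hypothetical helper, and `summarize` and `read` are passed in as callables so that the notebook's `summarize_text` and `read_text_file` can be plugged in. Unlike the cell above, it joins the per-file summaries with a single space.

```python
def build_summary_rows(section_files, summarize, read):
    """Return one {"section", "summary"} record per section.

    section_files maps a section title to the list of files that hold
    its text; the per-file summaries are joined in order.
    """
    rows = []
    for section, paths in section_files.items():
        parts = [summarize(read(path)).strip() for path in paths]
        rows.append({"section": section, "summary": " ".join(parts)})
    return rows


# Two of the notebook's sections, using the file names from the earlier cell.
section_files = {
    "Abstract": ["./texts/abstract.txt"],
    "1 Introduction": ["./texts/introduction.txt", "./texts/introduction2.txt"],
}
```

Passing the result to `pd.DataFrame(build_summary_rows(section_files, summarize_text, read_text_file))` should give the same `section` and `summary` columns.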

In [39]:
for _, row in df.iterrows():
    print(row["section"] + ":\n\n")
    print(row["summary"] + "\n\n")

Abstract:


 A simple pre-training task is an efficient and scalable way to learn SOTA-like representations from scratch on a dataset (image, text) pairs collected from the internet . Natural language is used to reference learned visual concepts (or new ones) enabling zero-shot transfer of the model to downstream tasks .


1 Introduction:


 Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years . The development of "text-to-text" as a standardized input-output interface has enabled taskagnostic-based architectures to zero-shot transfer to downstream data . Using natural language supervision for image representation learning is still rare . This is because demonstrated performance on common benchmarks is much lower than alternative approaches . Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance .


2.1 Natural Language Supervision:


 At the core of our approach is the idea of learning perc