# Movie Plot Compressor

Goal: Compress a long Wikipedia movie plot (~10 pages) into one concise paragraph

Model: facebook/bart-large-cnn

In [1]:
!pip install transformers torch nltk wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=44e39ae74c0408931deb1db4ea6c3d05b70d107d55bb91dbb8324c30e848e49d
  Stored in directory: /root/.cache/pip/wheels/63/47/7c/a9688349aa74d228ce0a9023229c6c0ac52ca2a40fe87679b8
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [16]:
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
import nltk
import wikipedia

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
model_name = "facebook/bart-large-cnn"

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


BartForConditionalGeneration(
  (model): BartModel(
    (shared): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): BartScaledWordEmbedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
      (layers): ModuleList(
        (0-11): 12 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        

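This cell loads the pretrained BART summarization model.
`facebook/bart-large-cnn` is a BART checkpoint fine-tuned for summarization on the CNN/Daily Mail news dataset, so no further training is needed here.
The tokenizer converts text into token IDs that the model can read.
The model is moved to the GPU when one is available, which makes generation faster.
This model performs the **actual summarization**.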

In [9]:
def get_wikipedia_plot(movie_title):
    wikipedia.set_lang("en")
    page = wikipedia.page(movie_title, auto_suggest=False)
    return page.content

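This function downloads the text of a movie's Wikipedia article.
`auto_suggest=False` makes the lookup use the exact title given, so a disambiguated title such as "Interstellar (film)" resolves to the film's page rather than to a suggested alternative.
Note that `page.content` is the full article text, including the lead and every section, not only the plot section.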
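
To feed only the plot to the summarizer, one option is to cut the section under the `== Plot ==` heading out of `page.content`. The sketch below assumes the `wikipedia` package's plain-text output marks top-level sections with lines of the form `== Title ==`; `extract_section` is a hypothetical helper, not part of the library.

```python
import re

def extract_section(content, title="Plot"):
    """Return the text under a level-2 heading such as '== Plot =='.

    Assumption: headings sit on their own line as '== Title =='.
    If the heading is not found, the content is returned unchanged.
    """
    # Match the heading line itself, e.g. "== Plot ==".
    heading = re.compile(
        r"^==\s*" + re.escape(title) + r"\s*==\s*$", re.MULTILINE
    )
    match = heading.search(content)
    if match is None:
        return content

    rest = content[match.end():]
    # The section ends at the next level-2 heading; '=== Sub ===' lines are kept.
    next_heading = re.search(r"^==[^=].*?==\s*$", rest, re.MULTILINE)
    end = next_heading.start() if next_heading else len(rest)
    return rest[:end].strip()


sample = (
    "Lead paragraph.\n\n\n== Plot ==\nCooper leaves Earth.\n\n"
    "=== Ending ===\nHe returns.\n\n\n== Cast ==\nMatthew McConaughey"
)
print(extract_section(sample))
```

In this notebook the helper could be applied as `extract_section(get_wikipedia_plot(movie_title))`, provided the article actually has a section named "Plot".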

In [11]:
movie_title = "Interstellar (film)"
plot_text = get_wikipedia_plot(movie_title)

print(plot_text[:500])


Interstellar is a 2014 epic science fiction film directed by Christopher Nolan, who co-wrote the screenplay with his brother Jonathan Nolan. It features an ensemble cast led by Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, and Michael Caine. Set in a dystopian future where Earth is suffering from catastrophic blight and famine, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.
The screenplay ha


This cell calls the Wikipedia function.
The movie title is given as input.
The article text is stored in `plot_text`.
The first 500 characters are printed to confirm that the page was fetched correctly.

In [12]:
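def chunk_text(text, max_tokens=900):
    """Split text into sentence-aligned chunks of fewer than max_tokens tokens."""
    sentences = nltk.sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Measure the chunk as it would look with this sentence appended.
        candidate = f"{current_chunk} {sentence}".strip()
        if len(tokenizer.encode(candidate)) < max_tokens:
            current_chunk = candidate
        else:
            # Flush the chunk built so far (never an empty one) and start a new one.
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = sentence

    if current_chunk:
        chunks.append(current_chunk)

    return chunks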

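This function breaks long text into smaller parts.
The text is split sentence by sentence, so no sentence is cut in half.
Each chunk stays under `max_tokens` (900 by default), below the 1,024-token input limit used in the next cell.
This keeps the model from silently truncating the input.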
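
A quick way to confirm that the chunks respect the limit is to count the tokens in each one. The helper below is a hypothetical sketch: `count_tokens` is an injected callable so that it can wrap `tokenizer.encode` from this notebook, and the whitespace counter in the example is only a stand-in for a real tokenizer.

```python
def oversized_chunks(chunks, count_tokens, limit=1024):
    """Return (index, token_count) pairs for chunks longer than limit tokens."""
    sizes = [(i, count_tokens(chunk)) for i, chunk in enumerate(chunks)]
    return [(i, n) for i, n in sizes if n > limit]


# Stand-in counter for illustration only; in the notebook one could pass
# lambda text: len(tokenizer.encode(text)) instead.
def whitespace_count(text):
    return len(text.split())


demo = ["a short chunk", "word " * 12]
print(oversized_chunks(demo, whitespace_count, limit=10))  # prints [(1, 12)]
```

An empty result means every chunk fits. A non-empty one usually points to a single sentence that is longer than the limit on its own.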

In [13]:
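def summarize_chunk(text, max_length=150, min_length=60):
    """Summarize one piece of text; lengths are measured in tokens."""
    # Tokenize, truncating anything beyond 1024 tokens.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)

    # Beam search with 4 beams; a length_penalty above 1 favors longer summaries.
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)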

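This function summarizes one text chunk.
The text is converted into tokens and truncated to 1,024 tokens if it is longer.
BART generates a short summary of between `min_length` and `max_length` tokens using beam search.
This is the core summarization logic.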

In [17]:
chunks = chunk_text(plot_text)

chunk_summaries = []
for i, chunk in enumerate(chunks):
    print(f"Summarizing chunk {i+1}/{len(chunks)}")
    chunk_summaries.append(summarize_chunk(chunk))

Summarizing chunk 1/11
Summarizing chunk 2/11
Summarizing chunk 3/11
Summarizing chunk 4/11
Summarizing chunk 5/11
Summarizing chunk 6/11
Summarizing chunk 7/11
Summarizing chunk 8/11
Summarizing chunk 9/11
Summarizing chunk 10/11
Summarizing chunk 11/11


This cell summarizes the entire page step by step.
The page is divided into chunks (11 for this article).
Each chunk is summarized separately.
Progress is printed on screen.
Processing one chunk at a time keeps every input within the model's token limit.

In [18]:
combined_summary = " ".join(chunk_summaries)

final_summary = summarize_chunk(
    combined_summary,
    max_length=180,
    min_length=90
)

print("\n FINAL MOVIE BLURB:\n")
print(final_summary)


 FINAL MOVIE BLURB:

Interstellar is a 2014 epic science fiction film directed by Christopher Nolan. It features an ensemble cast led by Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, and Michael Caine. The film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind. More IMAX cameras were used for Interstellar than for any of Nolan's previous films. Hans Zimmer, who scored Nolan's The Dark Knight Trilogy and Inception, returned to score Interstellar.


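All chunk summaries are combined.
The combined text is summarized again.
The final result is a single paragraph.
Because the input was the whole article rather than only the plot section, the blurb above describes the film's cast and production as well as its premise.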
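
The combined chunk summaries can be longer than the 1,024 tokens that `summarize_chunk` passes to the model. When that happens the text is truncated and the final blurb reflects only the earliest chunks. One possible workaround, sketched below, is to re-chunk and re-summarize until the text fits. `reduce_until_fits` and its parameters are hypothetical; `summarize`, `chunk`, and `count_tokens` are injected so that `summarize_chunk`, `chunk_text`, and a `tokenizer.encode` wrapper from this notebook could be passed in. The `demo_*` functions are word-based stand-ins used only to show the control flow.

```python
def reduce_until_fits(text, summarize, chunk, count_tokens, limit=900, max_rounds=5):
    """Repeatedly chunk and summarize text until it is under limit tokens."""
    for _ in range(max_rounds):
        if count_tokens(text) < limit:
            break
        pieces = chunk(text)
        text = " ".join(summarize(piece) for piece in pieces)
    return text


# Stand-ins for illustration only (not the real model or tokenizer).
def demo_count(text):
    return len(text.split())

def demo_chunk(text, size=4):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def demo_summarize(text):
    # Pretend "summary": keep the first half of the words.
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])


long_text = " ".join(f"w{i}" for i in range(16))
short = reduce_until_fits(long_text, demo_summarize, demo_chunk, demo_count, limit=5)
print(short)  # prints: w0 w1 w8 w9
```

With the real functions, the reduced text would then go through one last `summarize_chunk` call to produce the blurb. The `max_rounds` cap is there so the loop ends even if the summaries stop shrinking.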

## Final Workflow

Wikipedia Plot → Chunking → Chunk Summaries → Final Summary