

# **Paragraph AI**


**Setup and Install Required Libraries**

Install the required libraries if they are missing, or upgrade them if they are already installed.

Install gradio only if you want to use the GUI version:


```
!pip install gradio
```



In [None]:
!pip install pandas transformers datasets #gradio

Collecting gradio
  Downloading gradio-5.30.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<25.0,>=22.0 (from gradio)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.10.1 (from gradio)
  Downloading gradio_client-1.10.1-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.10-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.

In [None]:
!pip install --upgrade transformers datasets

Collecting transformers
  Downloading transformers-4.52.1-py3-none-any.whl.metadata (38 kB)
Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading transformers-4.52.1-py3-none-any.whl (10.5 MB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, transformers, datasets
  Attempting uninstall: fsspec
    Found existing installation: 
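
Before installing or upgrading, it can help to see which of these packages are already present. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages used in this notebook; gradio is only needed for the GUI.
for name, ver in installed_versions(["pandas", "transformers", "datasets", "gradio"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as not installed can then be passed to `!pip install`.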



---



**Upload and Load the Dataset**

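> Upload the `Paragraphs.txt` file. It will be moved to the `/content/ParagraphAI/` directory. Samples in the file are separated by `---`; the first line of each sample is the prompt and the remaining lines form the completion.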



In [12]:
import os

output_dir = "/content/ParagraphAI/"
os.makedirs(output_dir, exist_ok=True)

In [13]:
from google.colab import files
import shutil

uploaded = files.upload()

# Move each uploaded file to /content/ParagraphAI/
for filename in uploaded.keys():
    shutil.move(filename, os.path.join(output_dir, filename))

Saving Paragraphs.txt to Paragraphs.txt


In [2]:
# Read the uploaded .txt file
with open("/content/ParagraphAI/Paragraphs.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Split paragraphs by "---"
samples = [s.strip() for s in data.split('---') if s.strip()]



---


**Convert to Prompt/Completion Format**

In [3]:
dataset = []

for sample in samples:
    lines = sample.split("\n")
    if not lines or len(lines) < 2:
        continue
    prompt = lines[0].strip()
    completion = " ".join(line.strip() for line in lines[1:] if line.strip())
    dataset.append({"prompt": prompt, "completion": completion})
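
To make the expected file layout concrete, here is a small self-contained sketch of the same split-and-convert logic applied to an in-memory string. The sample text and the helper name `parse_samples` are made up for illustration; only the `---` separator and the first-line-is-prompt rule come from the cells above.

```python
def parse_samples(text):
    """Split text on '---' and turn each chunk into a prompt/completion pair."""
    records = []
    for chunk in text.split("---"):
        chunk = chunk.strip()
        if not chunk:
            continue
        lines = chunk.split("\n")
        if len(lines) < 2:
            # A chunk with only a prompt line has no completion; skip it.
            continue
        prompt = lines[0].strip()
        completion = " ".join(line.strip() for line in lines[1:] if line.strip())
        records.append({"prompt": prompt, "completion": completion})
    return records


# Hypothetical file content, for illustration only.
example_text = """Why is sleep important?
Sleep lets the body recover.
It also helps memory.
---
A prompt with no completion
---
Describe a rainy day.
Rain taps on the window."""

records = parse_samples(example_text)
print(records)
```

The middle chunk is dropped because it has no completion lines, which mirrors the `len(lines) < 2` check in the cell above.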

**Create Dataset for Training**

In [4]:
import pandas as pd
df = pd.DataFrame(dataset)
df.head()

| | prompt | completion |
|---|---|---|
| 0 | `**Opinion — *Why reading is better than watchi...` | Why reading is better than watching movies is ... |
| 1 | `**Expository/Informative — *What makes a good ...` | What makes a good leader? is a concept that pl... |
| 2 | `**Literary — *Themes of identity in 'The Catch...` | Themes of identity in 'The Catcher in the Rye'... |
| 3 | `**Descriptive — *A place that feels like home***` | A place that feels like home is something that... |
| 4 | `**Argumentative — *Is technology making us mor...` | Is technology making us more alone? is a topic... |


In [None]:
df.to_csv("./ParagraphAI/prompt_completion_dataset.csv", index=False)



---



**Tokenization and Dataset Preparation**

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize a batch of examples (called with batched=True, so each value is a list)
def tokenize(examples):
    prompts = examples['prompt']
    completions = examples['completion']

    # Format each pair with the prompt/completion template and a closing EOS token
    input_texts = [
        f"### PROMPT:\n{p}\n\n### COMPLETION:\n{c}{tokenizer.eos_token}"
        for p, c in zip(prompts, completions)
    ]

    # Truncate or pad every example to exactly 512 tokens
    return tokenizer(input_texts, truncation=True, padding='max_length', max_length=512)

hf_dataset = Dataset.from_pandas(df)
tokenized_dataset = hf_dataset.map(tokenize, batched=True)

# For causal language modeling the labels are a copy of 'input_ids'
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map:   0%|          | 0/81 [00:00<?, ? examples/s]
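
Because `pad_token` is set to `eos_token` and every example is padded to 512 tokens, copying `input_ids` into `labels` means the loss is also computed on padding positions. A common adjustment is to set the label to `-100` wherever `attention_mask` is 0, since the cross-entropy loss used by Hugging Face causal language models ignores that value. The sketch below shows the idea in plain Python on a toy batch; `add_masked_labels` is a hypothetical helper and was not part of the original run, so the losses reported later were obtained without it.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def add_masked_labels(batch):
    """Build labels from input_ids, masking positions where attention_mask is 0."""
    labels = []
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        labels.append([tok if m == 1 else IGNORE_INDEX for tok, m in zip(ids, mask)])
    return {"labels": labels}


# Toy batch: 50256 is GPT-2's end-of-text id, used here for both EOS and padding.
toy_batch = {
    "input_ids": [[10, 11, 50256, 50256, 50256]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
print(add_masked_labels(toy_batch))
```

The first `50256` is kept as a label because it is the real end-of-text token; only the padded positions are masked. With the `datasets` library, a function of this shape could be applied with `map(add_masked_labels, batched=True)` in place of the `add_column` step, though that variant was not run here.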





---



**Model Preparation and Fine-tuning**

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import os
os.environ["WANDB_DISABLED"] = "true"

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

training_args = TrainingArguments(
    output_dir="./ParagraphAI/results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=500,
    save_total_limit=1,
    fp16=True,  # mixed precision; requires a CUDA GPU
    warmup_steps=10,
    report_to=["none"],  # do not report to external services
    logging_first_step=True,  # also log the loss at step 1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# The Trainer computes the loss from the 'labels' column added during tokenization
trainer.train()

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|---|---|
| 1 | 10.5248 |
| 10 | 6.1621 |
| 20 | 0.8704 |
| 30 | 0.5747 |
| 40 | 0.5716 |
| 50 | 0.3652 |
| 60 | 0.3459 |
| 70 | 0.4055 |
| 80 | 0.3536 |
| 90 | 0.3737 |


TrainOutput(global_step=123, training_loss=0.926874938292232, metrics={'train_runtime': 2294.8463, 'train_samples_per_second': 0.106, 'train_steps_per_second': 0.054, 'total_flos': 63493963776000.0, 'train_loss': 0.926874938292232, 'epoch': 3.0})
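
The reported `global_step=123` can be checked by hand. Assuming a single device and the default gradient accumulation of 1 (neither is stated explicitly above), 81 examples at a batch size of 2 give 41 steps per epoch, and 3 epochs give 123 steps. The same numbers reproduce the throughput figures in `TrainOutput`:

```python
import math

num_examples = 81          # from the "Map: 0/81" progress line
batch_size = 2             # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
train_runtime = 2294.8463  # seconds, from TrainOutput

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```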



---



**Inference (Generating Text)**

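To generate text without the GUI, run the code cell below. It asks for input; type your prompt there. The cell calls `generate_text`, which is defined in the GUI cell further down (`In [9]`), so that definition must be run first.

> *Example: Why learning Math is important?*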

In [None]:
user_prompt = input("Enter a prompt for paragraph generation: ")
output = generate_text(prompt=user_prompt)
print("\nGenerated Paragraph:\n", output)



---



**Another Way to Get the Output**

Put the prompt directly in the code cell by replacing the sample prompt with your own.

>*Example:*
>
> From
>```
>print(generate_text(prompt="How the traffic jam is Harming us?"))
>```
>
>To
>```
>print(generate_text(prompt="Why learning Math is important?"))
>```

In [None]:
print(generate_text(prompt="How the traffic jam is Harming us?"))



---



**Alternate Inference (Generating Text) GUI**

A GUI version for a more user-friendly experience.
>It requires gradio. To install it, run:
>
>```
>!pip install gradio
>```

In [9]:
import gradio as gr
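def generate_text(prompt, max_length=100):
    # Use the same template as in training so the model continues after the completion marker
    input_text = f"### PROMPT:\n{prompt}\n\n### COMPLETION:\n"
    # Put the inputs on the model's device instead of assuming CUDA is available
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(inputs, max_length=max_length, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)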

def generate_response(prompt):
    return generate_text(prompt)

interface = gr.Interface(
    fn=generate_response,
    inputs=gr.Textbox(lines=3, placeholder="Enter your prompt here..."),
    outputs="text",
    title="Paragraph Generator",
    description="Enter a prompt and generate a well-structured paragraph using a fine-tuned GPT-2 model."
)

interface.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://12085ab5daa05515c0.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
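
As written, `generate_text` returns the whole decoded sequence, so the output starts with the `### PROMPT:` header and the prompt itself. If only the generated paragraph is wanted, the text after the `### COMPLETION:` marker can be sliced out. The helper below is a hypothetical post-processing step in plain Python, not part of the original notebook, and the sample string is invented.

```python
COMPLETION_MARKER = "### COMPLETION:\n"


def extract_completion(decoded_text):
    """Return the text after the first completion marker, or the input if absent."""
    _, marker, completion = decoded_text.partition(COMPLETION_MARKER)
    if not marker:
        # Marker not found: fall back to the full text.
        return decoded_text.strip()
    return completion.strip()


# Invented example of what a decoded sequence might look like.
decoded = "### PROMPT:\nWhy learning Math is important?\n\n### COMPLETION:\nMath trains logical thinking."
print(extract_completion(decoded))
```

In the notebook this would be applied as `extract_completion(generate_text(prompt))`. Passing `max_new_tokens` to `model.generate` instead of `max_length` is another option when the prompt length should not count against the generation budget.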






---


**EXPORT**


---






**Exporting the Model**

In [None]:
model.save_pretrained("./ParagraphAI/my_finetuned_gpt2")
tokenizer.save_pretrained("./ParagraphAI/my_finetuned_gpt2")

('./ParagraphAI/my_finetuned_gpt2/tokenizer_config.json',
 './ParagraphAI/my_finetuned_gpt2/special_tokens_map.json',
 './ParagraphAI/my_finetuned_gpt2/vocab.json',
 './ParagraphAI/my_finetuned_gpt2/merges.txt',
 './ParagraphAI/my_finetuned_gpt2/added_tokens.json',
 './ParagraphAI/my_finetuned_gpt2/tokenizer.json')

**Exporting the Logs**

In [None]:
import pandas as pd

# Use a separate name so the dataset DataFrame `df` is not overwritten
log_history = trainer.state.log_history
log_df = pd.DataFrame(log_history)
log_df.to_csv("./ParagraphAI/training_log.csv", index=False)
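
`trainer.state.log_history` is a list of dictionaries, and not every entry has the same keys: periodic entries carry a `loss`, while the final summary entry carries `train_loss` and runtime fields instead, which leaves empty cells in the CSV. A sketch of pulling out just the (step, loss) pairs follows, using a hand-written list that mimics a few of the values shown earlier; the exact set of keys in each real entry is an assumption.

```python
def loss_curve(log_history):
    """Return (step, loss) pairs from entries that contain a 'loss' key."""
    return [(entry["step"], entry["loss"]) for entry in log_history if "loss" in entry]


# Hand-written stand-in for trainer.state.log_history; values taken from the training output.
sample_history = [
    {"loss": 10.5248, "step": 1},
    {"loss": 6.1621, "step": 10},
    {"loss": 0.8704, "step": 20},
    {"train_runtime": 2294.8463, "train_loss": 0.926874938292232, "step": 123},
]
print(loss_curve(sample_history))
```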


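**Zipping the Whole Folder and Exporting It**

After the cell runs, download `ParagraphAI.zip` from the Colab file browser.

In [None]:
import shutil

# Archive only the ParagraphAI folder; using "." would zip the whole working directory
shutil.make_archive("/content/ParagraphAI", "zip", root_dir="/content/ParagraphAI")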

'/content/ParagraphAI.zip'



---



---


**Done By WAHIB UL MALIK**

---



---

