# Automated Course Curriculum Generation with Bloom-560 Fine-Tuning

This Jupyter notebook is part of a research project titled "A Comparative Analysis of Large Language Models for Automated Course Topic Extraction from Books." The objective is to fine-tune the Bloom-560 language model (`bigscience/bloom-560m` on Hugging Face) for the specific task of generating course curriculums. The training data consists of titles and tables of contents extracted from eight machine learning books, compiled in the file "filtered_data.txt".

## Data Description:

### Training Data Source
The training data for this project comes from a text file named "filtered_data.txt", which contains titles and tables of contents extracted from eight machine learning books.


## Sections Overview:

### Section 1: Setup and Library Installation
In this section, we install the necessary libraries, including transformers, PyPDF2, python-docx, and datasets, setting the foundation for the subsequent sections.

### Section 2: Data Loading
Here, we read and preprocess the training data from the "filtered_data.txt" file. We also define functions to load the dataset and create a data collator for language modeling.

### Section 3: Training 
This section contains the function and execution code for fine-tuning the Bloom-560 model on the provided training data. Key parameters such as batch size, number of epochs, and saving checkpoints are configured.

### Section 4: Model Loading 
Functions for loading the fine-tuned model and tokenizer, as well as generating text using both the fine-tuned and original Bloom-560 models, are defined in this section.

### Section 5: Generation with Fine-Tuning
Here, we demonstrate the generation of course outlines using the fine-tuned model and compare it with the generation from the original (unfine-tuned) Bloom-560 model.

### Section 6: Results Analysis
Finally, we compare the outputs of the fine-tuned and original models and draw conclusions.


**Note:** Adjust the file paths and parameters as needed for your specific environment.


## 1. Setup and Library Installation

In [2]:
# Install necessary libraries (transformers is pinned once to avoid conflicting installs)
!pip install transformers==4.28.0
!pip install -U PyPDF2
!pip install python-docx
!pip install datasets



In [5]:
# Import required Python libraries
import pandas as pd
import os
import numpy as np
import re
from PyPDF2 import PdfReader
import docx
import torch

In [6]:
# Load pre-trained model and tokenizer from Hugging Face
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
from transformers import AutoModel, AutoTokenizer, BloomForCausalLM
# Define tokenizer and model for Bloom-560
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = BloomForCausalLM.from_pretrained("bigscience/bloom-560m")

In [7]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            # Fall back to an empty string for pages with no extractable text
            text += page.extract_text() or ""
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text


## 2. Data Loading

In [8]:
# Read documents from the directory
file_path = 'C:/Users/bsherif/Desktop/personal/filtered_data.txt'

if os.path.exists(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text_data = file.read()
else:
    print("File not found.")


File not found.


In [9]:
# Functions to load dataset and data collator

# Load a text dataset using the provided tokenizer
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset

# Load a data collator for language modeling
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator
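
To make the `block_size` argument concrete: `TextDataset` tokenizes the whole file and cuts the token ids into consecutive blocks of `block_size` tokens, dropping a trailing remainder that is too short to fill a block. The following is a minimal sketch of that chunking on a toy list of ids. The helper name `chunk_token_ids` is hypothetical, and the real class also handles special tokens and feature caching, which are omitted here.

```python
def chunk_token_ids(token_ids, block_size):
    """Split token ids into consecutive full blocks; drop the short remainder."""
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Toy example: 10 token ids with a block size of 4 (the notebook uses 128).
toy_ids = list(range(10))
blocks = chunk_token_ids(toy_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this toy case the last two ids are discarded, so a small training file may lose a noticeable share of its tokens at the end.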

## 3. Training 

In [12]:
# Training function that fine-tunes the Bloom-560 model on a specific training file
def train(train_file_path,
          model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  train_dataset = load_dataset(train_file_path, tokenizer)
  data_collator = load_data_collator(tokenizer)

  tokenizer.save_pretrained(output_dir)

  model = BloomForCausalLM.from_pretrained(model_name)

  model.save_pretrained(output_dir)

  training_args = TrainingArguments(
          output_dir=output_dir,
          overwrite_output_dir=overwrite_output_dir,
          per_device_train_batch_size=per_device_train_batch_size,
          num_train_epochs=num_train_epochs,
          save_steps=save_steps,  # checkpoint interval, in optimizer steps
      )

  trainer = Trainer(
          model=model,
          args=training_args,
          data_collator=data_collator,
          train_dataset=train_dataset,
  )

  trainer.train()
  trainer.save_model()

In [13]:
# Train the model
train(
    train_file_path = 'C:/Users/bsherif/Desktop/personal/bloom_fine_tuning/filtered_data.txt',
    model_name = 'bigscience/bloom-560m',
    output_dir = 'C:/Users/bsherif/Desktop/personal/bloom_fine_tuning',
    overwrite_output_dir = False,
    per_device_train_batch_size = 4,
    num_train_epochs = 5,
    save_steps = 50000
)
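
As a rough check on these settings, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device and no gradient accumulation; the dataset size of 1000 blocks is a hypothetical placeholder, not the size of "filtered_data.txt", and `total_optimizer_steps` is an illustrative helper rather than a library function.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical dataset size; batch size and epochs are the values used above.
num_examples = 1000
steps = total_optimizer_steps(num_examples, batch_size=4, num_epochs=5)
print(steps)           # 1250
print(steps >= 50000)  # False
```

Under these assumptions, a checkpoint interval of 50000 steps would never be reached during the run, so the only saved weights would be those written by `trainer.save_model()` at the end.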




## 4. Model Loading

In [14]:
# Functions for Model Loading and Text Generation
from transformers import AutoModel, AutoTokenizer, BloomForCausalLM

# Load the fine-tuned Bloom-560 model from a specified path
def load_model(model_path):
    model = BloomForCausalLM.from_pretrained(model_path)
    return model

# Load the tokenizer for the fine-tuned Bloom-560 model
def load_tokenizer(tokenizer_path):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    return tokenizer

# Generate text using the fine-tuned Bloom-560 model
def generate_text(model_path, sequence, max_length):
    model = load_model(model_path)
    tokenizer = load_tokenizer(model_path)
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))

# Generate text using the original Bloom-560 model (without fine-tuning)
def generate_bloom(sequence, max_length):
    model = BloomForCausalLM.from_pretrained('bigscience/bloom-560m')
    tokenizer = load_tokenizer('bigscience/bloom-560m')
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))
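
Both functions sample with `top_k=50` and `top_p=0.95`. The sketch below illustrates the idea on a toy next-token distribution: keep the `top_k` most probable tokens, then keep the smallest prefix of them whose cumulative probability reaches `top_p`, and sample from what remains. It is a simplified pure-Python illustration with made-up tokens and probabilities, not the library's implementation, and `top_k_top_p_filter` is a hypothetical helper name.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Return {token: prob} kept after top-k then nucleus (top-p) filtering."""
    # Rank tokens from most to least probable and keep the top_k best.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # Renormalize over the top-k survivors before applying the nucleus cutoff.
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Made-up next-token distribution for illustration only.
toy_probs = {"Networks": 0.5, "Learning": 0.3, "Optimization": 0.15, "Banana": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(list(filtered))  # ['Networks', 'Learning']

# Sample the next token from the filtered, renormalized distribution.
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
```

Lower values of `top_k` or `top_p` restrict sampling to fewer candidates, which tends to make outputs more repetitive but less erratic.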

## 5. Text Generation 
### 5.1 Using fine-tuned model

In [15]:
# Generate text using the fine-tuned model
model2_path = 'C:/Users/bsherif/Desktop/personal/bloom_fine_tuning/'
sequence2 = "generate a course outline for a deep learning course"
max_len = 200
generate_text(model2_path, sequence2, max_len)

generate a course outline for a deep learning course
  Metaheuristics and its Applications
  Distributed Evolutionary Algorithms
  Example - Multi-Digit Learning
    One-hot Encoding
    Memory Management
    Distributing Data
    Training
  Machine Learning Algorithms
  Supervised Learning Algorithms
  Unsupervised Learning Algorithms
  Stochastic Gradient Descent
    Learning Curves
    Hessians
    Matrices
    Constrained Optimization
  Evolutionary Algorithms
  History of Machine Learning
--- Deep Networks Modern Practices
Deep Feedforward Networks
  From Feedforward to Feedback
  Inverse Feedforward Networks
  Directed Feedforward Networks
  Distributed Feedforward Networks
  Example - Multi-Digit Learning
    One-hot Encoding
    Memory Management
    Training
  Feedback and Predictive Algorithms
  Historical Notes on Deep Networks
  Finding All the Functions in a Module
--- Deep Networks Modern Practices
Optimization for Deep Networks
  Optimization for ML Basics
    Optimizati

### 5.2 Using original model without fine-tuning

In [16]:
# Generate text using the original (unfine-tuned) model
generate_bloom(sequence2, max_len)

generate a course outline for a deep learning course. Here are some of the key steps:
Download the course outline for this course. It will require the following features:
You need to have a certificate in Computer and Information Systems, Electronics Engineering or Engineering or Computer Science or something similar.


In [24]:
# Note: this redefinition shadows the TextDataset-based load_dataset from Section 2
def load_dataset(file_path, tokenizer, block_size=128):
    # Read the content of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        text_data = file.read()

    # Tokenize the text
    tokenized_text = tokenizer(
        text_data,
        max_length=block_size,
        truncation=True,
        return_tensors="pt",
    )
    
    # Ensure input_ids is of type Long
    tokenized_text["input_ids"] = tokenized_text["input_ids"].long()

    return tokenized_text

## 6. Model Evaluation
Here we evaluate the fine-tuned model by computing its perplexity on the evaluation dataset.
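
For reference, perplexity is the exponential of the mean negative log-likelihood per token: PPL = exp(-(1/N) Σ log p(x_i | x_<i)). A lower value means the model assigns higher probability to the evaluation text. The sketch below works through the formula on hypothetical per-token probabilities, independent of any model.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to the true next tokens.
token_probs = [0.25, 0.25, 0.25, 0.25]
score = perplexity([math.log(p) for p in token_probs])
print(round(score, 6))  # 4.0
```

A model that assigns probability 0.25 to every correct token is, on average, as uncertain as a uniform choice among four options, hence a perplexity of 4.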

In [29]:
def load_evaluation_dataset(file_path, tokenizer, block_size=128):
    # Read the content of the file
    with open(file_path, 'r', encoding='utf-8') as file:
        text_data = file.read()

    # Tokenize the text
    tokenized_text = tokenizer(
        text_data,
        max_length=block_size,
        truncation=True,
        return_tensors="pt",
    )

    # Ensure input_ids is of type Long
    tokenized_text["input_ids"] = tokenized_text["input_ids"].long()

    return tokenized_text
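
Because the loader above truncates to `block_size`, only the first 128 tokens of the evaluation file are scored. One way to cover a longer file, sketched here under assumptions, is to score it block by block and combine the per-block mean losses into a single corpus-level perplexity, weighting each block by its number of predicted tokens. The loss values below are hypothetical stand-ins for a per-block mean loss in nats per token, and `corpus_perplexity` is an illustrative helper.

```python
import math


def corpus_perplexity(block_losses, block_token_counts):
    """Combine per-block mean losses (nats/token) into one corpus-level perplexity."""
    total_nll = sum(loss * n for loss, n in zip(block_losses, block_token_counts))
    total_tokens = sum(block_token_counts)
    return math.exp(total_nll / total_tokens)


# Hypothetical per-block mean losses and predicted-token counts.
losses = [math.log(2.0), math.log(8.0)]
counts = [100, 100]
print(round(corpus_perplexity(losses, counts), 6))  # 4.0
```

Averaging the per-block perplexities directly (here, (2 + 8) / 2 = 5) would give a different and less meaningful number, which is why the losses are combined before exponentiating.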

In [30]:
import torch
from transformers import BloomForCausalLM, AutoTokenizer

# Load the fine-tuned model
model_path = 'C:/Users/bsherif/Desktop/personal/bloom_fine_tuning/'
model = BloomForCausalLM.from_pretrained(model_path)
model.eval()
eval_file_path = 'C:/Users/bsherif/Desktop/personal/bloom_fine_tuning/eval_dataset.txt'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load and tokenize the evaluation dataset (truncated to block_size tokens)
eval_inputs = load_evaluation_dataset(eval_file_path, tokenizer)

# transformers has no `Perplexity` class (importing it raises ImportError),
# so perplexity is computed as exp of the mean cross-entropy loss
with torch.no_grad():
    outputs = model(**eval_inputs, labels=eval_inputs["input_ids"])
perplexity_score = torch.exp(outputs.loss).item()

print(f"Perplexity Score: {perplexity_score}")



## 6. Results Analysis

### Fine-Tuned Model Output
The course curriculum generated after fine-tuning the model exhibits coherent and structured content. Key topics such as "Practical Methodology," "Code Optimization," "Semantic Segmentation and Mapping," and others are well-defined, suggesting that the fine-tuned model has captured relevant information from the training data. The content appears to be specific and closely aligned with the context of a deep learning course.

### Original Model Output
In contrast, the output from the original model, which was not fine-tuned, is a general and abstract description of a course outline. It lacks specific details, offers generic educational statements with no clear focus on deep learning topics, and does not have the structure of a typical curriculum or table of contents.

## Conclusion

The fine-tuned Bloom-560 model demonstrates its capability to generate course curriculums that are contextually relevant and aligned with the provided training data. This suggests that fine-tuning on a specific dataset can enhance the model's ability to generate content tailored to a particular domain. The results underscore the potential of large language models in educational content generation and highlight the importance of model customization for specific tasks.

Future work could explore additional fine-tuning strategies, experiment with different datasets, and assess the model's performance on a broader range of educational content generation tasks.
