
# Install the required packages

In [None]:
!pip install transformers



In [None]:
!pip install -U PyPDF2
!pip install python-docx

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Downloading python_docx-1.2.0-py3-none-any.whl (252 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.2.0


# Import libraries

In [None]:
import pandas as pd
import numpy as np
import re
from PyPDF2 import PdfReader
import os
import docx

# Define functions to read different file formats

In [None]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

def read_word(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text
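
The directory reader above picks a reader from the file extension with an `if`/`elif` chain. The following is a minimal alternative sketch, not the notebook's own implementation: it dispatches through a dictionary, matches extensions case-insensitively, and sorts the files because `os.listdir` returns them in arbitrary order. The names `read_directory` and `read_txt_utf8` are hypothetical, and only a plain-text reader is registered so that the example runs without third-party packages.

```python
from pathlib import Path
import tempfile


def read_txt_utf8(path):
    """Hypothetical plain-text reader used only for this demo."""
    return Path(path).read_text(encoding="utf-8")


def read_directory(directory, readers):
    """Concatenate text from every file whose suffix has a registered reader."""
    parts = []
    # Sort for a deterministic order; directory listing order is arbitrary.
    for path in sorted(Path(directory).iterdir()):
        reader = readers.get(path.suffix.lower())  # ".TXT" and ".txt" both match
        if reader is not None and path.is_file():
            parts.append(reader(path))
    # Separate documents with a newline so they do not run together.
    return "\n".join(parts)


# Demo on a throwaway directory: two text files and one unsupported file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.TXT").write_text("second", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first", encoding="utf-8")
    Path(tmp, "skip.csv").write_text("ignored", encoding="utf-8")
    demo_text = read_directory(tmp, {".txt": read_txt_utf8})

print(demo_text)
```

In the notebook, the same mapping could also register `read_pdf` under `".pdf"` and `read_word` under `".docx"`.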


# Data Preprocessing

In [None]:
import PyPDF2

pdf_path = '/content/Tuningg.pdf'

def read_pdf_file(pdf_path):
    text = ""
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            text += page.extract_text()
    return text

# Now read the single PDF file
text_data = read_pdf_file(pdf_path)
print(text_data[:500])


The project “Nepali News Summarization” focuses on developing a Nepali news summarization tool 
using advanced transformer-based models like MT5. Text summarization is a critical aspect of natural 
language processing, aimed at condensing lengthy text into concise, meaningful summaries. It is 
particularly significant in today's information-driven era, enabling quick decision-making in domains 
such as journalism, research, and formal documentation. While high resource languages like English 
ha


In [None]:
text_data = re.sub(r'\n+', '\n', text_data).strip()  # Remove excess newline characters
print(text_data[:500])

The project “Nepali News Summarization” focuses on developing a Nepali news summarization tool 
using advanced transformer-based models like MT5. Text summarization is a critical aspect of natural 
language processing, aimed at condensing lengthy text into concise, meaningful summaries. It is 
particularly significant in today's information-driven era, enabling quick decision-making in domains 
such as journalism, research, and formal documentation. While high resource languages like English 
ha
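
To make the effect of the `re.sub(r'\n+', '\n', ...)` call concrete, here is a small sketch on a made-up string (`sample` is illustrative and not taken from the PDF). The second substitution is an optional extra step, shown as an assumption about what one might want for hard-wrapped PDF text, and is not something the notebook does.

```python
import re

# Made-up sample with repeated blank lines, similar to raw PDF extraction output.
sample = "First line\n\n\nSecond line\n\n"

# Same pattern as the notebook: collapse runs of newlines, then trim the ends.
collapsed = re.sub(r'\n+', '\n', sample).strip()
print(repr(collapsed))

# Optional (assumption): join hard-wrapped lines into one line of running text.
single_line = re.sub(r'\s*\n\s*', ' ', collapsed)
print(repr(single_line))
```

Note that the first step keeps single line breaks, which is why the printed PDF text is unchanged apart from removed blank lines.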


In [None]:
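# Write the cleaned text back so it can be used as the training file.
# Note: this overwrites the original PDF; despite the .pdf extension,
# the file now contains plain text, not a PDF document.
with open("/content/Tuningg.pdf", "w", encoding="utf-8") as f:
    f.write(text_data)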

# Libraries for GPT-2 fine-tuning

In [None]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments

In [None]:
def load_dataset(file_path, tokenizer, block_size = 64):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset
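
The `block_size = 64` argument controls how the tokenized file is cut into training examples. The sketch below illustrates the idea in plain Python, assuming the dataset splits the token stream into consecutive, non-overlapping blocks and drops a final incomplete block; this is how `TextDataset` is generally understood to behave, but treat it as an illustration rather than the library's code. `chunk_into_blocks` and `fake_ids` are hypothetical names.

```python
def chunk_into_blocks(token_ids, block_size=64):
    """Split a flat list of token ids into full, non-overlapping blocks."""
    return [
        token_ids[i:i + block_size]
        # Stop early enough that every slice has exactly block_size tokens.
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Stand-in for a tokenized document of 150 tokens.
fake_ids = list(range(150))
blocks = chunk_into_blocks(fake_ids, block_size=64)

print(len(blocks), "blocks;", len(fake_ids) - 64 * len(blocks), "trailing tokens dropped")
```

With a short source document, a small block size therefore yields more (but shorter) training examples.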

In [None]:
def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator
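
Setting `mlm = False` selects causal (next-token) language modeling instead of masked language modeling. A rough plain-Python sketch of what such a collator produces is given below, under the assumption that labels are a copy of the input ids with padding positions replaced by `-100` so the loss ignores them; the shift by one position happens inside the model. The function name and the pad id `0` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def make_causal_lm_batch(sequences, pad_token_id):
    """Pad sequences to equal length and build labels for causal LM training."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, labels = [], []
    for seq in sequences:
        padded = seq + [pad_token_id] * (max_len - len(seq))
        input_ids.append(padded)
        # Labels mirror the inputs; every pad-id position is masked out.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "labels": labels}


batch = make_causal_lm_batch([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch)
```

In this notebook every block has the same length, so no padding is actually needed and the labels are simply equal to the inputs.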

In [None]:
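def train(train_file_path, model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
    # Load the pretrained tokenizer, then build the dataset and the collator
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    # Save the tokenizer files to output_dir
    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    # This saves the base (not yet fine-tuned) weights;
    # trainer.save_model() at the end overwrites them.
    model.save_pretrained(output_dir)

    # NOTE: save_steps is accepted but not forwarded to TrainingArguments,
    # so Trainer falls back to its default checkpoint interval.
    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    # Write the final fine-tuned model to output_dir
    trainer.save_model()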

In [None]:

train_file_path = "/content/Tuningg.pdf"
model_name = 'gpt2'
output_dir = '/content/'
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 300
save_steps = 50000
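
These settings determine how many optimizer steps the run takes and at which steps a checkpoint can be written. The sketch below works through that arithmetic, assuming a single device and no gradient accumulation. `num_examples=200` is a hypothetical dataset size chosen for illustration; the notebook does not report the real number of blocks.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, save_every):
    """Return steps per epoch, total steps, and the steps at which checkpoints fall."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = list(range(save_every, total_steps + 1, save_every))
    return steps_per_epoch, total_steps, checkpoints


# Hypothetical dataset of 200 blocks with the batch size and epochs set above.
steps_per_epoch, total_steps, checkpoints = training_schedule(
    num_examples=200, batch_size=8, num_epochs=300, save_every=500
)
print(steps_per_epoch, total_steps, checkpoints[:3])

# With an interval of 50000 and the same run length, no checkpoint is reached.
_, _, sparse_checkpoints = training_schedule(200, 8, 300, save_every=50000)
print(sparse_checkpoints)
```

A save interval larger than the total number of steps means only the final model is saved, so the interval is worth checking against the run length.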

In [None]:
# Train
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 500  | 1.4945 |
| 1000 | 0.1824 |
| 1500 | 0.0858 |
| 2000 | 0.0618 |
| 2500 | 0.0485 |
| 3000 | 0.0446 |
| 3500 | 0.0382 |
| 4000 | 0.0346 |
| 4500 | 0.0351 |
| 5000 | 0.0304 |
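
One way to read these loss values is to convert them to perplexity, which is the exponential of the mean cross-entropy loss. The sketch below does this for three of the logged steps. The interpretation is an assumption rather than something the notebook measures: a training perplexity very close to 1 on a single small document usually suggests the model has largely memorized that text, which says little about answer quality on new prompts.

```python
import math

# Training losses copied from three of the logged steps.
logged = {500: 1.4945, 1000: 0.1824, 5000: 0.0304}

# Perplexity = exp(cross-entropy loss).
perplexity = {step: math.exp(loss) for step, loss in logged.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: perplexity {ppl:.3f}")
```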


# Inference

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer

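def generate_text(model_path, sequence, max_length):
    # Load the model and tokenizer from the same checkpoint directory
    model = load_model(model_path)
    tokenizer = load_tokenizer(model_path)
    # Encode the prompt as a batch containing one sequence of token ids
    ids = tokenizer.encode(f'{sequence}', return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,          # sample instead of greedy decoding
        max_length=max_length,   # total length in tokens, prompt included
        pad_token_id=model.config.eos_token_id,  # GPT-2 has no pad token, so reuse EOS
        top_k=50,                # keep only the 50 most likely next tokens
        top_p=0.95,              # then keep the smallest set covering 95% probability
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))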
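
The `top_k=50` and `top_p=0.95` arguments restrict which tokens can be sampled at each step. The following is a simplified plain-Python sketch of that filtering on a toy probability list, assuming top-k is applied first and nucleus (top-p) filtering is applied to the renormalized result. Real implementations work on logits and may treat ties and boundaries differently; `top_k_top_p_filter` and `toy_probs` are hypothetical names.

```python
import random


def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token_index: renormalized probability} after top-k then top-p filtering."""
    # Rank token indices by probability (highest first) and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    # Keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sample the next token from the filtered distribution.
random.seed(0)
next_token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(next_token)
```

Smaller `top_k` or `top_p` values make the output more conservative, while larger values allow more varied text.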

This model was trained on the entire text and took much longer to train, yet it fails to give meaningful results.

In [None]:
model1_path = "/content/checkpoint-1500"
sequence1 = "[Question] What is Transformer?"
max_len = 50
generate_text(model1_path, sequence1, max_len)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


[Question] What is Transformer? [4. 6] Transformer is a deep learning model designed for 
sequence-to-sequence tasks, particularly well-suited for natural language processing. Transformer does not inherently use 
param


The following model was trained on 100 question-and-answer pairs derived from the original text, and training finished in a few seconds (50 epochs). It gives meaningful results.

In [None]:
model1_path = "/content/checkpoint-6500"
sequence1 = "[Question] Explain Softmax?"
max_len = 50
generate_text(model1_path, sequence1, max_len)

[Question] Explain Softmax? [4. 19] We 
already faced the challenge of generating a concise and 
accurate summary for our query, which is a big challenge for many Nepali institutions. The introduction of


In [None]:
model1_path = "/content/checkpoint-5000"
sequence1 = "[Question] What are the purpose of nepali news summarization?"
max_len = 50
generate_text(model1_path, sequence1, max_len)

[Question] What are the purpose of nepali news summarization? 
As Nepali language is low resource language, the project focuses on developing a Nepali news summarization tool 
using advanced transformer-based models like MT5. Text


In [None]:
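# Load the fine-tuned weights that trainer.save_model() wrote to output_dir ('/content/').
# Loading 'gpt2' here would export the base model instead of the fine-tuned one.
model = GPT2LMHeadModel.from_pretrained('/content/')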

In [None]:
import torch


In [None]:
torch.save(model.state_dict(), "model_finetuned.pth")


In [None]:
state_dict = torch.load("/content/model_finetuned.pth", map_location=torch.device('cpu'))
model.load_state_dict(state_dict, strict=False)



<All keys matched successfully>

In [None]:
from google.colab import files
files.download('model_finetuned.pth')



In [None]:
!git clone https://github.com/Sajita01/Fine_Tuning


Cloning into 'Fine_Tuning'...


In [None]:
# Copy the notebook into the cloned repository directory (the clone lives at /content/Fine_Tuning)
!cp /content/Fine_Tuning.ipynb /content/Fine_Tuning/


cp: cannot stat '/content/Fine_Tuning.ipynb': No such file or directory
