<a href="https://colab.research.google.com/github/Farhan-ANWAR0611/CHAT-BOT-NLP/blob/main/CAPSTONE_PROJECT_Deep_Learning_for_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Capstone Project: Development of an Industry-Specific Large Language Model (LLM) Bot for the Education and Training Sector**

**Cohort: Moscow**

**Contribution: Individual**

Artificial Intelligence (AI) is changing how knowledge is delivered, accessed, and personalized in education. Central to this shift are Large Language Models (LLMs): AI systems that can understand and generate fluent, human-like text. This capstone project uses a pre-trained model from the Hugging Face ecosystem to build a domain-specific bot for the Education and Training industry.

The core objective is to fine-tune a pre-trained language model on a curated educational dataset so that the bot can support students, teachers, and parents with accurate, contextually relevant responses. In this notebook that objective takes a concrete form: DistilBERT is fine-tuned to classify a student's overall performance as Low, Medium, or High from a short text description, and the resulting classifier is served through a Gradio interface.

By bridging cutting-edge NLP techniques with the real-world needs of the education sector, this project not only demonstrates technical proficiency but also showcases how AI can empower learning experiences, promote accessibility, and reshape the future of education.



Step 1: Install and Import Required Libraries

In [20]:
!pip install -q transformers datasets gradio scikit-learn

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch
from datasets import Dataset


Step 2: Load and Prepare the Dataset

In [21]:
# Load CSV (Upload this in Colab first using the sidebar or upload widget)
df = pd.read_csv("/StudentsPerformance01.csv")

# Create performance labels based on average score
df['average_score'] = df[['math score', 'reading score', 'writing score']].mean(axis=1)

def performance_label(score):
    if score < 60:
        return "Low"
    elif score < 80:
        return "Medium"
    else:
        return "High"

df['label'] = df['average_score'].apply(performance_label)
df['text'] = df.apply(lambda row: f"Gender: {row['gender']}, Race: {row['race/ethnicity']}, Parent Education: {row['parental level of education']}, Lunch: {row['lunch']}, Test Prep: {row['test preparation course']}, Scores: Math {row['math score']}, Reading {row['reading score']}, Writing {row['writing score']}", axis=1)
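# Keep only the model inputs; .copy() avoids pandas' SettingWithCopyWarning when 'label' is reassigned in Step 3
df = df[['text', 'label']].copy()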
df.head()


index,text,label
0,"Gender: female, Race: group B, Parent Educatio...",Medium
1,"Gender: female, Race: group C, Parent Educatio...",High
2,"Gender: female, Race: group B, Parent Educatio...",High
3,"Gender: male, Race: group A, Parent Education:...",Low
4,"Gender: male, Race: group C, Parent Education:...",Medium
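The labelling rule in this step uses half-open ranges on the average of the three scores: below 60 is Low, from 60 up to (but not including) 80 is Medium, and 80 or above is High. Because the text given to the model contains the three scores themselves, the label is fully determined by the input, which may help explain the very low validation loss reported in Step 5. The sketch below checks the boundaries with made-up scores; the `average` helper and the sample values are illustrative and not taken from the dataset.

```python
def performance_label(score):
    """Same thresholds as the notebook: <60 Low, <80 Medium, otherwise High."""
    if score < 60:
        return "Low"
    elif score < 80:
        return "Medium"
    else:
        return "High"


def average(math_score, reading_score, writing_score):
    """Hypothetical helper: mean of the three subject scores."""
    return (math_score + reading_score + writing_score) / 3


# Made-up score triples chosen to sit on or next to the thresholds.
boundary_cases = [(59, 60, 60), (60, 60, 60), (79, 80, 80), (80, 80, 80)]
boundary_labels = [performance_label(average(*scores)) for scores in boundary_cases]

for scores, label in zip(boundary_cases, boundary_labels):
    print(scores, round(average(*scores), 2), label)
```

An average of exactly 60 already counts as Medium and an average of exactly 80 as High, so the boundaries belong to the upper class.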


Step 3: Encode Labels and Tokenize Text

In [22]:
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])  # 0 = High, 1 = Low, 2 = Medium

# Convert to Hugging Face dataset
dataset = Dataset.from_pandas(df)

# Load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

dataset = dataset.map(tokenize, batched=True)

# Train/test split: 80% train, 20% test (the split is random; pass seed=<int> to make it reproducible)
dataset = dataset.train_test_split(test_size=0.2)
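`LabelEncoder` assigns integer ids in sorted (alphabetical) order of the class names, which is why the comment in this step reads 0 = High, 1 = Low, 2 = Medium. The sketch below reproduces that mapping in plain Python, without scikit-learn, to make the ordering explicit. The names `label2id` and `id2label` are chosen here for illustration; dictionaries of this shape can optionally be passed to `from_pretrained` so that the saved model config carries readable class names, which the original notebook does not do.

```python
# A few example labels in arbitrary order (illustrative, not the full dataset).
raw_labels = ["Medium", "High", "High", "Low", "Medium"]

# Sorting the unique names mirrors the alphabetical ordering LabelEncoder uses.
classes = sorted(set(raw_labels))

label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

encoded = [label2id[name] for name in raw_labels]

print(label2id)  # class name -> integer id
print(encoded)   # the example labels as integer ids
```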



Step 4: Load Model and Define Training Arguments

In [23]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    # Recent transformers releases use 'eval_strategy'; older ones call it 'evaluation_strategy'
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=10,
    report_to='none'
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
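With `eval_strategy='epoch'` the Trainer evaluates after every epoch, but without a metric function it reports only the validation loss. The following is a minimal sketch of an accuracy metric, not something the original run used: `compute_metrics` is a name chosen here, and it would be supplied as `Trainer(..., compute_metrics=compute_metrics)`. The toy logits and labels at the bottom are invented purely to exercise the function.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair handed over at evaluation time."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with invented values: 3 examples, 3 classes.
toy_logits = np.array([[0.1, 0.9, 0.0],
                       [2.0, 0.1, 0.3],
                       [0.0, 0.2, 1.5]])
toy_labels = np.array([1, 2, 2])
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)
```

In the toy data the predicted classes are 1, 0, and 2, so two of the three examples match their labels.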


Step 5: Train the Model

In [24]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
)

trainer.train()



Epoch,Training Loss,Validation Loss
1,0.1225,0.138921
2,0.1121,0.054043
3,0.0821,0.049055
4,0.0193,0.027151
5,0.0076,0.028304


TrainOutput(global_step=500, training_loss=0.15881728142499923, metrics={'train_runtime': 1283.3832, 'train_samples_per_second': 3.117, 'train_steps_per_second': 0.39, 'total_flos': 41396800320000.0, 'train_loss': 0.15881728142499923, 'epoch': 5.0})
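The reported `global_step=500` can be cross-checked against the settings used earlier: 1000 rows, a 20% test split, a per-device batch size of 8, and 5 epochs. The sketch below does the arithmetic, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook); the variable names are illustrative.

```python
import math

n_rows = 1000        # examples in the dataset (from the Map progress output)
test_fraction = 0.2  # test_size passed to train_test_split
batch_size = 8       # per_device_train_batch_size
num_epochs = 5       # num_train_epochs

n_test = round(n_rows * test_fraction)
n_train = n_rows - n_test

# Assumption: one device and no gradient accumulation, so one optimizer step per batch.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(n_train, steps_per_epoch, total_steps)
```

The result agrees with the `TrainOutput` line, and 800 training examples over 5 epochs is also consistent with the reported throughput of roughly 3.1 samples per second across about 1283 seconds.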

Step 6: Create and Launch a Gradio Bot

In [25]:
import gradio as gr

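def classify_student(input_text):
    # Tokenize and move the tensors to the model's device (needed if training ran on a GPU)
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True).to(model.device)
    model.eval()  # switch off dropout for inference
    with torch.no_grad():  # gradients are not needed for prediction
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    # Map the class index back to its name ("High", "Low" or "Medium")
    return label_encoder.inverse_transform([prediction])[0]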

gr.Interface(
    fn=classify_student,
    inputs=gr.Textbox(lines=4, label="Enter student information"),
    outputs=gr.Label(label="Predicted Performance Level"),
    title="Student Performance Predictor"
).launch(share=True)
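The interface above shows only the top class. `gr.Label` can also display a dictionary that maps class names to confidences, so one possible extension is to turn the model's logits into probabilities with a softmax. The sketch below does this in plain Python; `logits_to_confidences` and `CLASS_NAMES` are hypothetical names, and the example logits are invented.

```python
import math

# Class order assumed to match the label encoding from Step 3 (0 = High, 1 = Low, 2 = Medium).
CLASS_NAMES = ["High", "Low", "Medium"]


def logits_to_confidences(logits, class_names=CLASS_NAMES):
    """Apply a numerically stable softmax and return {class name: probability}."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shift to avoid overflow
    total = sum(exps)
    return {name: e / total for name, e in zip(class_names, exps)}


# Invented logits for a single input.
confidences = logits_to_confidences([0.2, -1.3, 2.4])
top_class = max(confidences, key=confidences.get)
print(top_class, confidences)
```

Inside `classify_student`, the list `outputs.logits[0].tolist()` could be passed to this helper and the resulting dictionary returned in place of the single label.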




