<a href="https://colab.research.google.com/github/JyotiJha01/Text_Classification/blob/main/Text_Classification_using_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment: Text Classification using Hugging Face

**Objective**: Build a text classification model with the Hugging Face library that assigns each text in a dataset to one of several categories. The candidate will start from a pre-trained model such as BERT or GPT-2 and fine-tune it on the classification task.


**Instructions:**

1. Choose a dataset of text that has multiple categories (e.g. news articles labeled as sports, politics, entertainment, etc.). The dataset should have at least 1000 samples for each category.

2. Preprocess the text data by cleaning it and removing stopwords, punctuation, and other irrelevant characters.

3. Use the Hugging Face library to fine-tune a pre-trained model such as BERT or GPT-2 on the classification task. The candidate should use the `transformers` library in Python.

4. Train the model on the dataset and evaluate the performance using metrics such as accuracy, precision, recall and F1-score.

5. Use the trained model to predict the categories of a few samples from the test set.



## Import the Required Libraries

In [None]:
#import libraries
import re
import string

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, train_test_split

## Read the data

In [None]:
# Read the data
categories = [
    "alt.atheism",
    "misc.forsale",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
]

news_group_data = fetch_20newsgroups(
    subset="all", remove=("headers", "footers", "quotes"), categories=categories
)

df = pd.DataFrame(
    dict(
        text=news_group_data["data"],
        target=news_group_data["target"]
    )
)
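# Map integer targets to names; target ids index into target_names, not the order of `categories`
df["target"] = df.target.map(lambda x: news_group_data["target_names"][x])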

## Prepare the Data

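1. Preprocess the text data by cleaning it and removing stopwords, punctuation, and other irrelevant characters. Punctuation is stripped in `process_text`; stopwords are dropped later by `CountVectorizer(stop_words="english")`.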

In [None]:
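# Clean the text column: lowercase, strip punctuation, collapse whitespace
def process_text(text):
    """Return a lowercased copy of `text` with punctuation replaced by spaces."""
    text = str(text).lower()
    # Replace every punctuation character with a space
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Collapse runs of whitespace into single spaces
    text = " ".join(text.split())
    return text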

df["clean_text"] = df.text.map(process_text)
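
As a quick check of what `process_text` does, the sketch below repeats the function and applies it to a made-up sample string (the sample is an assumption, not taken from the dataset). Note that apostrophes count as punctuation, so contractions are split.

```python
import re
import string


def process_text(text):
    # Same steps as the cell above: lowercase, punctuation -> space, collapse whitespace
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())


sample = "Hello, World!  It's 2023..."  # hypothetical input
cleaned = process_text(sample)
print(cleaned)  # hello world it s 2023
```

Stopwords such as "it" survive this step; they are only removed when the text is vectorized.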

In [None]:
# Split the data into train and test sets (stratified; fixed seed for reproducibility)
df_train, df_test = train_test_split(
    df, test_size=0.20, stratify=df.target, random_state=42
)
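
To illustrate what `stratify` buys, here is a minimal sketch on hypothetical labels (the class names `a`, `b`, `c` and their counts are assumptions for illustration). With a 50/30/20 class mix and a 20% test split, the test set keeps the same proportions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 / 30 / 20 documents per class
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
texts = [f"doc {i}" for i in range(len(labels))]

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=0
)

# Each class contributes 20% of its documents to the test set
test_counts = Counter(test_labels)
print(test_counts)
```

Without `stratify`, a small class could end up under-represented in the test set purely by chance.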

In [None]:
# bag of words
vec = CountVectorizer(
    ngram_range=(1, 3), 
    stop_words="english",
)

X_train = vec.fit_transform(df_train.clean_text)
X_test = vec.transform(df_test.clean_text)

y_train = df_train.target
y_test = df_test.target

## Training the Model Using Hugging Face
1. Use the Hugging Face library to fine-tune a pre-trained model such as BERT or GPT-2 on the classification task. The candidate should use the `transformers` library in Python. This notebook does so through `ktrain`, a wrapper that loads Hugging Face models.

In [None]:
# Fetch the dataset using scikit-learn
from sklearn.datasets import fetch_20newsgroups

category = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]

train = fetch_20newsgroups(
    subset="train", categories=category, shuffle=True, random_state=42
)
test = fetch_20newsgroups(
    subset="test", categories=category, shuffle=True, random_state=42
)

print('size of training set: %s' % (len(train['data'])))
print('size of validation set: %s' % (len(test['data'])))

x_train = train.data
y_train = train.target
x_test = test.data
y_test = test.target

size of training set: 2257
size of validation set: 1502
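
One detail worth checking before training is how integer targets map to class names. The sketch below assumes, as scikit-learn's `target_names` attribute reports for this loader, that target ids follow the alphabetical order of the category names rather than the order of the list passed in. The variable `id_to_name` is a hypothetical helper for illustration.

```python
# The list as passed to fetch_20newsgroups in the cell above
category = ["alt.atheism", "soc.religion.christian", "comp.graphics", "sci.med"]

# Assumption: ids index into the alphabetically ordered names
# (in the notebook, prefer train.target_names over re-deriving this)
target_names = sorted(category)
id_to_name = dict(enumerate(target_names))
print(id_to_name)

# Indexing the original list with a target id would give the wrong name
print(category[1], "vs", id_to_name[1])
```

For this reason it is safer to pass `train.target_names` to anything that needs class names than to reuse the hand-written list.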


In [None]:
#!pip install ktrain

In [None]:
# import ktrain and the ktrain.text modules
import ktrain
from ktrain import text

In [None]:
# Step 1: Create a Transformer instance
# Class names must match the 4 categories fetched above, in target-id order
model_name = "distilbert-base-uncased"
t = text.Transformer(model_name, maxlen=500, class_names=train.target_names)




In [None]:
# Step 2: Preprocess the datasets
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)

preprocessing train...
language: en
train sequence lengths:
	mean : 308
	95percentile : 837
	99percentile : 1938



Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 343
	95percentile : 979
	99percentile : 2562
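
The length statistics above matter for the choice of `maxlen`: the 95th percentile of the training lengths (837) is above `maxlen=500`, so the longest posts are cut off. A minimal sketch of how one might estimate the share of truncated documents follows. The `lengths` list is hypothetical, and the sketch ignores the difference between word counts and subword token counts.

```python
def truncation_rate(lengths, maxlen):
    """Return the fraction of documents longer than `maxlen`."""
    return sum(1 for n in lengths if n > maxlen) / len(lengths)


# Hypothetical document lengths (not the real dataset)
lengths = [120, 250, 308, 480, 510, 837, 1200, 1938]

rate = truncation_rate(lengths, maxlen=500)
print(f"{rate:.0%} of documents exceed maxlen")
```

A high truncation rate is not necessarily a problem for topic classification, since the start of a post often carries enough signal, but it is worth knowing before comparing models.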


In [None]:
# Step 3: Create a model and wrap it in a learner
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)


In [None]:
# Step 4 (optional): Estimate the learning rate
# learner.lr_find(show_plot=True, max_epochs=2)

In [None]:
# Step 5: Train the model (one-cycle policy, max learning rate 5e-5, 1 epoch)
learner.fit_onecycle(5e-5, 1)



begin training using onecycle policy with max lr of 5e-05...
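
For intuition about the one-cycle policy named in the log line above, here is a simplified sketch of a triangular schedule: the learning rate rises linearly to the maximum and then falls back. This is an illustrative assumption, not ktrain's exact implementation; the function `one_cycle_lr` and the `start_factor` parameter are hypothetical.

```python
def one_cycle_lr(step, total_steps, max_lr, start_factor=0.1):
    """Simplified triangular schedule: linear warm-up to max_lr, then linear decay."""
    base_lr = max_lr * start_factor
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # warm-up phase: 0 -> 1
    else:
        frac = (total_steps - step) / half  # decay phase: 1 -> 0
    return base_lr + (max_lr - base_lr) * frac


total_steps = 10  # hypothetical; real runs have one step per batch
schedule = [one_cycle_lr(s, total_steps, max_lr=5e-5) for s in range(total_steps + 1)]
for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```

The idea is that a brief period at a high learning rate speeds up training, while the low rates at the start and end keep the fine-tuned weights stable.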

In [None]:
# Step 6: Inspect the model (show the validation example with the highest loss)
learner.view_top_losses(n=1, preproc=t)
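
The assignment asks for accuracy, precision, recall, and F1-score. A minimal sketch of computing them with the scikit-learn functions imported at the top of the notebook is shown here. The label lists are hypothetical; in the notebook, `y_true` would be `y_test` and `y_pred` would come from the trained model's predictions on `x_test`, converted to the same label encoding.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted class ids for six documents
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging weights every class equally, regardless of its size
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Macro averaging is one reasonable choice for roughly balanced classes; `average="weighted"` or per-class scores (`average=None`) may be more informative when class sizes differ.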

In [None]:
# Step 7: Make predictions on new data
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.predict("jesus christ is the central figure of christianity")