
# **Text Classification by Fine-tuning Language Model**

**1. Data Loading**

In [None]:
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

Collecting simpletransformers
  Downloading simpletransformers-0.70.1-py3-none-any.whl.metadata (42 kB)
Collecting datasets (from simpletransformers)
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting seqeval (from simpletransformers)
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorboardx (from simpletransformers)
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting streamlit (from simpletransformers)
  Downloading streamlit-1.43.2-py2.py3-none-any.whl.metadata (8.9 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets-

In [None]:
from google.colab import drive

# Mount Google Drive and load the dataset
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/My Drive/data.csv')

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['Property Type'].value_counts())  # Class distribution

Mounted at /content/drive
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Description    1000 non-null   object
 1   Property Type  1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB
None
Property Type
Industrial     334
Residential    333
Commercial     333
Name: count, dtype: int64


In [None]:
# Hold out 20% of the rows for validation (fixed seed for reproducibility)
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Prepare the data for SimpleTransformers: text column first, label column second
train_df = pd.DataFrame({
    'Description': train_data['Description'],
    'Property Type': train_data['Property Type']
})

val_df = pd.DataFrame({
    'Description': val_data['Description'],
    'Property Type': val_data['Property Type']
})
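
The split above is purely random, so the class proportions in the validation set can drift slightly from those in the full data. The sketch below shows one way to keep them aligned: a stratified split written in plain Python. The function name `stratified_split` and the synthetic label list (which mirrors the class counts printed earlier) are assumptions for illustration, not part of the notebook. scikit-learn's `train_test_split` also accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) so that every class is split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)  # validation rows for this class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Synthetic labels with the same class counts as the dataset
labels = ["Industrial"] * 334 + ["Residential"] * 333 + ["Commercial"] * 333
train_idx, val_idx = stratified_split(labels)

val_counts = dict(Counter(labels[i] for i in val_idx))
print(len(train_idx), len(val_idx), val_counts)
```

With a dataset as balanced as this one the difference is small, but stratifying removes one source of noise when comparing two models on a 200-row validation set.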

**2. Text Processing**

In [None]:
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove everything except letters and whitespace (digits, punctuation, hyphens)
    text = re.sub(r'[^a-z\s]', '', text)

    # Collapse repeated whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to the dataset
train_df['Description'] = train_df['Description'].apply(clean_text)
val_df['Description'] = val_df['Description'].apply(clean_text)

print(train_df.head())

                                           Description Property Type
29   a large warehouse equipped with advanced stora...    Industrial
535  a large warehouse equipped with advanced stora...    Industrial
695  a spacious bedroom apartment with a modern ope...   Residential
557  a mixeduse development featuring luxury apartm...   Residential
836  a luxurious villa with an elegant private pool...    Industrial
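
To make the effect of the cleaning step concrete, here is a small self-contained sketch. The helper name `keep_letters` and the sample description are hypothetical. It shows that dropping non-letter characters fuses hyphenated words (the same effect that produces `mixeduse` in the output above) and that `strip()` on its own only trims the ends, so runs of spaces left behind by removed characters need a separate collapse step.

```python
import re

def keep_letters(text):
    """Lowercase the text and drop every character that is not a letter or whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower())

sample = "Mixed-use lot, 24/7  access!"  # hypothetical description

letters_only = keep_letters(sample)
trimmed = letters_only.strip()                          # trims only the two ends
collapsed = re.sub(r'\s+', ' ', letters_only).strip()   # also collapses inner runs

print(repr(trimmed))
print(repr(collapsed))
```

Removing digits also discards information such as room counts or floor area, which may or may not matter for this task.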


**3. Text Embedding using BERT and RoBERTa**

In [None]:
from simpletransformers.classification import ClassificationModel

# Get the number of unique labels (property types) in the dataset
num_labels = data['Property Type'].nunique()

# Create a BERT model for text classification
bert_model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=num_labels,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)

print("BERT and RoBERTa models initialized successfully!")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT and RoBERTa models initialized successfully!


**4. Model Training with BERT and RoBERTa**

In [None]:
from simpletransformers.classification import ClassificationArgs
from sklearn.preprocessing import LabelEncoder

# Convert string labels to integer labels using LabelEncoder
label_encoder = LabelEncoder()
train_df['Property Type'] = label_encoder.fit_transform(train_df['Property Type'])
val_df['Property Type'] = label_encoder.transform(val_df['Property Type'])

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=1,       # Single epoch for quick training
    train_batch_size=16,      # Training batch size
    eval_batch_size=16,       # Evaluation batch size
    learning_rate=3e-5,       # Learning rate
    max_seq_length=64,        # Truncate inputs to 64 tokens for faster processing
    weight_decay=0.01,        # Weight decay
    warmup_steps=0,           # No warmup steps for simplicity
    logging_steps=100,        # Log training progress every 100 steps
    save_steps=500,           # Save a checkpoint every 500 steps
    overwrite_output_dir=True,  # Overwrite the output directory
    output_dir='outputs',     # Directory to save model outputs
    fp16=False,               # Mixed precision needs a GPU; training here runs on CPU
    dataloader_num_workers=2  # Use multiple workers for data loading
)

# Train the BERT model with custom hyperparameters
bert_model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=len(label_encoder.classes_),
    args=model_args,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)
bert_model.train_model(train_df)




(50, 1.1125436234474182)
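
As documented for SimpleTransformers, `train_model` returns the number of training steps and the average training loss, which is the tuple printed above. The step count can be checked by hand from the dataset size, the split fraction, and the batch size. The short sketch below reproduces that arithmetic; all values are taken from earlier cells, and the variable names are chosen here for illustration.

```python
import math

n_rows = 1000            # rows in data.csv, from data.info()
test_size = 0.2          # fraction held out by train_test_split
train_batch_size = 16
eval_batch_size = 16

n_val = round(n_rows * test_size)   # validation rows
n_train = n_rows - n_val            # training rows

# The last batch may be partial, hence the ceiling
steps_per_epoch = math.ceil(n_train / train_batch_size)
eval_batches = math.ceil(n_val / eval_batch_size)

print(n_train, n_val, steps_per_epoch, eval_batches)
```

With one epoch, `steps_per_epoch` equals the total step count, which matches the first element of the returned tuple.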

In [None]:
# Train the RoBERTa model with the same hyperparameters
roberta_model = ClassificationModel(
    'roberta', 'roberta-base',
    num_labels=len(label_encoder.classes_), args=model_args, use_cuda=False
)
roberta_model.train_model(train_df)

(50, 1.1108564567565917)

**5. Evaluation on Validation Set**

In [None]:
# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)



BERT Evaluation Results:
{'mcc': np.float64(0.01793185673374517), 'eval_loss': 1.104737611917349}
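
MCC on its own can be hard to interpret, so it often helps to look at plain accuracy as well. The sketch below shows one way to derive predictions and accuracy from per-class scores. It assumes the model outputs hold one row of class scores per validation example, as `model_outputs_bert` does; the toy scores, toy labels, and function names are hypothetical and stand in for the real arrays.

```python
def argmax(row):
    """Index of the largest score in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_outputs(model_outputs, true_labels):
    """Turn per-class scores into predicted class IDs and an accuracy value."""
    preds = [argmax(row) for row in model_outputs]
    correct = sum(p == t for p, t in zip(preds, true_labels))
    return preds, correct / len(true_labels)

# Toy stand-ins for model outputs and integer-encoded validation labels
toy_outputs = [
    [0.2, 1.5, -0.3],
    [2.0, 0.1, 0.4],
    [0.0, 0.3, 0.9],
    [0.5, 0.4, 0.1],
]
toy_labels = [1, 0, 1, 0]

preds, acc = accuracy_from_outputs(toy_outputs, toy_labels)
print(preds, acc)
```

For three balanced classes, an accuracy near one third would indicate chance-level performance.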


In [None]:
# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)



RoBERTa Evaluation Results:
{'mcc': 0.0, 'eval_loss': 1.0986013412475586}
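
A useful reference point for these numbers: a classifier that assigns equal probability to each of three classes has a cross-entropy loss of ln 3, roughly 1.0986. Both evaluation losses above sit at or just above that value, which is consistent with MCC scores near zero. The sketch below computes the reference loss from equal class scores; the function name `cross_entropy` is chosen here for illustration.

```python
import math

def cross_entropy(logits, true_class):
    """Cross-entropy loss of one example, computed from raw class scores."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[true_class]

# Equal scores for all three classes give a uniform prediction
uniform_logits = [0.0, 0.0, 0.0]
chance_loss = cross_entropy(uniform_logits, true_class=0)

print(round(chance_loss, 4))
```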


**6. Saving the Best Model**

In [None]:
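# Pass the trained model explicitly; save_model() writes weights only when `model` is given
bert_model.save_model('bert_best_model', model=bert_model.model)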

In [None]:
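# Pass the trained model explicitly; save_model() writes weights only when `model` is given
roberta_model.save_model('roberta_best_model', model=roberta_model.model)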

**7. Prediction on Real-World Input**

In [None]:
# Load the saved BERT model from the directory written in step 6
bert_model = ClassificationModel(
    'bert',
    'bert_best_model',
    num_labels=num_labels,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)

# Real-world input text
real_world_text = ["Built in a quiet residential neighborhood, this property has spacious green areas and modern interiors. Ideal for those seeking a peaceful lifestyle."]

# Predict the class ID. LabelEncoder sorts class names alphabetically,
# so 0: Commercial, 1: Industrial, 2: Residential
predictions_bert, _ = bert_model.predict(real_world_text)

print(f"BERT Predictions: {predictions_bert}")

BERT Predictions: [1]
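
The prediction is an integer class ID, so it needs to be mapped back to a property type. `LabelEncoder` assigns IDs in sorted order of the class names, and with the fitted encoder `label_encoder.inverse_transform(predictions_bert)` performs the reverse lookup. The sketch below reproduces that mapping in plain Python for the three classes in this dataset; the variable names are chosen here for illustration.

```python
# Class names as they appear in the 'Property Type' column
class_names = sorted({"Industrial", "Residential", "Commercial"})

# Sorted order determines the integer ID of each class
id_to_label = dict(enumerate(class_names))

predictions = [1]  # the class ID printed above
decoded = [id_to_label[p] for p in predictions]

print(id_to_label)
print(decoded)
```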


In [None]:
# Load the saved RoBERTa model from the directory written in step 6
roberta_model = ClassificationModel(
    'roberta',
    'roberta_best_model',
    num_labels=num_labels,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)

# Real-world input text
real_world_text = ["Built in a quiet residential neighborhood, this property has spacious green areas and modern interiors. Ideal for those seeking a peaceful lifestyle."]

# Predict the class ID (0: Commercial, 1: Industrial, 2: Residential)
predictions_roberta, _ = roberta_model.predict(real_world_text)

print(f"RoBERTa Predictions: {predictions_roberta}")

RoBERTa Predictions: [1]
