#**Text Classification by Fine-tuning Language Model**
##**1. Data Loading**

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Property_val_dataset.csv to Property_val_dataset.csv


In [None]:
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('Property_val_dataset.csv')

# Rename columns to match the expected format
data = data.rename(columns={'Input': 'text', 'Prediction': 'label'})

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['label'].value_counts())  # Class distribution

# Split dataset into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.3, random_state=42)

# Preparing the data in the correct format for SimpleTransformers
train_df = pd.DataFrame({
    'text': train_data['text'],  # Use 'text' instead of 'Input'
    'labels': train_data['label']
})

val_df = pd.DataFrame({
    'text': val_data['text'],  # Use 'text' instead of 'Input'
    'labels': val_data['label']
})

# Display the first few rows of the training and validation data
print("Training Data:")
print(train_df.head())

print("\nValidation Data:")
print(val_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    1000 non-null   object
 1   label   1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB
None
label
Medium    486
High      311
Low       203
Name: count, dtype: int64
Training Data:
                                                  text  labels
541  Location: Chennai, Anna Nagar; Size: 4429 sq. ...  Medium
440  Location: Kolkata, Salt Lake City; Size: 2151 ...  Medium
482  Location: Kolkata, Salt Lake City; Size: 1954 ...     Low
422  Location: Kolkata, Salt Lake City; Size: 2115 ...  Medium
778  Location: Mumbai, Andheri; Size: 950 sq. ft; A...  Medium

Validation Data:
                                                  text  labels
521  Location: Mumbai, South Mumbai; Size: 3215 sq....    High
737  Location: Chennai, OMR; Size: 4538 sq. ft; Ame...  Medium
740  Location: Chennai, T. Naga
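
The class counts printed above are uneven (486 Medium, 311 High, 203 Low), and a plain random split can shift those proportions between the training and validation sets. With scikit-learn, passing `stratify=data['label']` to `train_test_split` preserves them. The dependency-free sketch below illustrates the same idea on the counts above; the helper name `stratified_split` is hypothetical and not part of the notebook.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, val_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Label list rebuilt from the class counts reported by value_counts()
labels = ["Medium"] * 486 + ["High"] * 311 + ["Low"] * 203
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```

Each class contributes roughly 30% of its rows to the validation set, so the validation distribution mirrors the full dataset.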

##**2. Text Preprocessing**

In [None]:
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove leading and trailing whitespace
    text = text.strip()

    return text

# Apply the cleaning function to the dataset
train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

# Display the first few rows of the cleaned training data
print("Cleaned Training Data:")
print(train_df.head())

# Display the first few rows of the cleaned validation data
print("\nCleaned Validation Data:")
print(val_df.head())

Cleaned Training Data:
                                                  text  labels
541  location chennai anna nagar size  sq ft amenit...  Medium
440  location kolkata salt lake city size  sq ft am...  Medium
482  location kolkata salt lake city size  sq ft am...     Low
422  location kolkata salt lake city size  sq ft am...  Medium
778  location mumbai andheri size  sq ft amenities ...  Medium

Cleaned Validation Data:
                                                  text  labels
521  location mumbai south mumbai size  sq ft ameni...    High
737  location chennai omr size  sq ft amenities gym...  Medium
740  location chennai t nagar size  sq ft amenities...  Medium
660  location pune koregaon park size  sq ft amenit...    High
411  location bangalore electronic city size  sq ft...  Medium
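
In the cleaned output above, the size value disappears ("size  sq ft") because the regex removes every digit. If the numeric size is meant to inform the prediction, a variant that keeps digits may be worth trying. The sketch below is one such variant; `clean_text_keep_digits` is a hypothetical name, and the sample string is assembled from the truncated rows shown in Section 1.

```python
import re

def clean_text_keep_digits(text):
    """Lowercase, keep letters and digits, and collapse repeated whitespace."""
    text = text.lower()
    # Replace punctuation with a space so neighbouring tokens do not merge
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

sample = "Location: Chennai, Anna Nagar; Size: 4429 sq. ft"
cleaned_sample = clean_text_keep_digits(sample)
print(cleaned_sample)
```

Whether keeping the digits actually helps is an empirical question that would need to be checked on the validation set.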


##**3. Text Embedding using BERT and RoBERTa**

In [None]:
from simpletransformers.classification import ClassificationModel

# Get the number of unique labels (property value classes) in the dataset
num_labels = len(data['label'].unique())

# Create a BERT model for text classification
bert_model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=num_labels,
    use_cuda=False  # Run on CPU; set to True if a GPU is available
)

print("BERT and RoBERTa models initialized successfully!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT and RoBERTa models initialized successfully!
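
Both models above are created with `use_cuda=False`, so they run on the CPU. A small check, sketched below, can set the flag from the runtime instead of hard-coding it. It assumes PyTorch is the backend (as it is for simpletransformers) and falls back to `False` when `torch` cannot be imported; `cuda_available` is a hypothetical helper.

```python
def cuda_available():
    """Return True only if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())

use_cuda = cuda_available()
print(f"use_cuda={use_cuda}")
```

The result could then be passed as `use_cuda=use_cuda` when constructing each `ClassificationModel`.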


##**4. Model Training with BERT and RoBERTa**

In [None]:
from sklearn.preprocessing import LabelEncoder
from simpletransformers.classification import ClassificationArgs

# Convert string labels to integer labels using LabelEncoder
label_encoder = LabelEncoder()
train_df['labels'] = label_encoder.fit_transform(train_df['labels'])
val_df['labels'] = label_encoder.transform(val_df['labels'])

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=3,       # Start with 3 epochs
    train_batch_size=8,       # Use a batch size of 8
    eval_batch_size=8,        # Same for evaluation
    learning_rate=3e-5,       # Learning rate
    max_seq_length=128,       # Max sequence length
    weight_decay=0.01,        # Weight decay
    warmup_steps=0,           # Optional: adjust based on total steps
    logging_steps=50,         # Log training progress every 50 steps
    save_steps=200,           # Save the model every 200 steps
    overwrite_output_dir=True,  # Overwrite the output directory
    output_dir='outputs',     # Directory to save model outputs
)

# Train the BERT model with custom hyperparameters
bert_model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    args=model_args,
    use_cuda=False  # Set to True if using GPU
)
bert_model.train_model(train_df)

# Train the RoBERTa model with custom hyperparameters
roberta_model = ClassificationModel(
    'roberta',
    'roberta-base',
    num_labels=num_labels,
    args=model_args,
    use_cuda=False  # Set to True if using GPU
)
roberta_model.train_model(train_df)

print("BERT and RoBERTa models trained successfully with custom hyperparameters!")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/88 [00:00<?, ?it/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/88 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/88 [00:00<?, ?it/s]

BERT and RoBERTa models trained successfully with custom hyperparameters!
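
The progress bars above show 88 steps per epoch, which follows from the 700 training rows (70% of 1000) and `train_batch_size=8`. The sketch below reproduces that arithmetic and derives a warmup length from it. It assumes no gradient accumulation, and the 10% warmup ratio is an assumption chosen for illustration, not a value from the notebook.

```python
import math

n_train = 700          # 70% of the 1000 rows after the train/validation split
train_batch_size = 8
num_train_epochs = 3
warmup_ratio = 0.1     # assumed ratio, chosen only for illustration

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = int(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

With only 264 optimizer steps in total, `warmup_steps=0` and a nonzero warmup differ by a couple of dozen steps, which may be worth comparing when tuning.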


##**5. Evaluation on Validation Set**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

# Decode predictions back to original labels
bert_predictions = np.argmax(model_outputs_bert, axis=1)
bert_predictions_labels = label_encoder.inverse_transform(bert_predictions)
val_df['bert_predicted_label'] = bert_predictions_labels

# Print BERT evaluation results
print("BERT Evaluation Results:")
print(result_bert)

# Classification report for BERT
print("\nBERT Classification Report:")
print(classification_report(val_df['labels'], bert_predictions, target_names=label_encoder.classes_))

# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

# Decode predictions back to original labels
roberta_predictions = np.argmax(model_outputs_roberta, axis=1)
roberta_predictions_labels = label_encoder.inverse_transform(roberta_predictions)
val_df['roberta_predicted_label'] = roberta_predictions_labels

# Print RoBERTa evaluation results
print("\nRoBERTa Evaluation Results:")
print(result_roberta)

# Classification report for RoBERTa
print("\nRoBERTa Classification Report:")
print(classification_report(val_df['labels'], roberta_predictions, target_names=label_encoder.classes_))

0it [00:00, ?it/s]

Running Evaluation:   0%|          | 0/38 [00:00<?, ?it/s]

BERT Evaluation Results:
{'mcc': 0.0, 'eval_loss': 1.0566326520944898}

BERT Classification Report:
              precision    recall  f1-score   support

        High       0.00      0.00      0.00        93
         Low       0.00      0.00      0.00        65
      Medium       0.47      1.00      0.64       142

    accuracy                           0.47       300
   macro avg       0.16      0.33      0.21       300
weighted avg       0.22      0.47      0.30       300




0it [00:00, ?it/s]

Running Evaluation:   0%|          | 0/38 [00:00<?, ?it/s]


RoBERTa Evaluation Results:
{'mcc': 0.0, 'eval_loss': 1.047704500587363}

RoBERTa Classification Report:
              precision    recall  f1-score   support

        High       0.00      0.00      0.00        93
         Low       0.00      0.00      0.00        65
      Medium       0.47      1.00      0.64       142

    accuracy                           0.47       300
   macro avg       0.16      0.33      0.21       300
weighted avg       0.22      0.47      0.30       300
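
Both reports show every validation sample assigned to Medium: recall is 1.00 for Medium and 0.00 for the other two classes, and MCC is 0.0. These figures match what a constant majority-class predictor would score. The sketch below recomputes them from the support counts alone; `constant_predictor_metrics` is a hypothetical helper.

```python
def constant_predictor_metrics(support, predicted_class):
    """Accuracy and macro precision/recall/F1 when one class is always predicted."""
    total = sum(support.values())
    precisions, recalls, f1s = [], [], []
    for label, count in support.items():
        if label == predicted_class:
            precision, recall = count / total, 1.0
            f1 = 2 * precision * recall / (precision + recall)
        else:
            # Never predicted: precision is undefined and reported as 0
            precision = recall = f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(support)
    return {
        "accuracy": support[predicted_class] / total,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }

# Support counts taken from the classification reports
support = {"High": 93, "Low": 65, "Medium": 142}
metrics = constant_predictor_metrics(support, "Medium")
print({name: round(value, 2) for name, value in metrics.items()})
```

Comparing a fine-tuned model against this baseline is a quick way to tell whether training has learned anything beyond the class prior.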





In [None]:
import pandas as pd

# Summarize the validation metrics reported above for BERT and RoBERTa
# (named `summary` so the original `data` DataFrame is not overwritten)
summary = {
    "No.": [1, 2],
    "Model Name": ["BERT", "RoBERTa"],
    "Precision": [0.16, 0.16],  # Macro avg precision from classification reports
    "Recall": [0.33, 0.33],     # Macro avg recall from classification reports
    "F1 Score": [0.21, 0.21],   # Macro avg F1-score from classification reports
    "Accuracy": [0.47, 0.47],   # Accuracy from classification reports
    "MCC": [0.0, 0.0]           # MCC from evaluation results
}

# Convert the dictionary to a pandas DataFrame
summary_df = pd.DataFrame(summary)

# Display the table
summary_df

   No. Model Name  Precision  Recall  F1 Score  Accuracy  MCC
0    1       BERT       0.16    0.33      0.21      0.47  0.0
1    2    RoBERTa       0.16    0.33      0.21      0.47  0.0


##**6. Saving the Model**

In [None]:
# Save the BERT model manually
bert_model.model.save_pretrained("bert_model")
bert_model.tokenizer.save_pretrained("bert_model")
print("BERT model saved manually!")
# Save the RoBERTa model manually
roberta_model.model.save_pretrained("roberta_model")
roberta_model.tokenizer.save_pretrained("roberta_model")
print("RoBERTa model saved manually!")

BERT model saved manually!
RoBERTa model saved manually!


##**7. Prediction on Real-World Input**

In [None]:
# Load the saved BERT model from the 'bert_model' directory
bert_model = ClassificationModel('bert', 'bert_model', use_cuda=False)

# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Predict the class
predictions_bert, _ = bert_model.predict(real_world_text)

print(f"BERT Predictions: {predictions_bert}")

BERT Predictions: [2 2]
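
The output `[2 2]` is a pair of integer class indices. `LabelEncoder` sorts class names before numbering them, which is consistent with the High, Low, Medium row order in the classification reports, so index 2 corresponds to Medium. The sketch below reproduces that mapping without scikit-learn; the variable names are illustrative.

```python
# Class names as they appear in the dataset (order here is arbitrary)
raw_labels = ["Medium", "High", "Low"]

# Sorted unique values, mirroring how LabelEncoder orders its classes_
classes = sorted(set(raw_labels))
index_to_label = dict(enumerate(classes))

predictions = [2, 2]  # values printed by the cell above
decoded = [index_to_label[i] for i in predictions]
print(index_to_label, decoded)
```

Both generic review sentences are therefore labelled Medium, the majority class.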


In [None]:
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_model', use_cuda=False)

# Real-world input text, one query per list item (the commas matter:
# Python joins adjacent string literals into a single string)
real_world_text = [
    "How does NLP improve property valuation accuracy?",
    "What role does sentiment analysis play in real estate assessment?",
    "How does BERT outperform traditional models in property valuation?",
    "Can machine learning enhance predictive performance in real estate?",
    "What are the key benefits of using NLP for analyzing real estate listings?",
]

# Predict the class using BERT
predictions_bert, _ = bert_model.predict(real_world_text)

# Decode predictions back to original labels
predictions_bert_labels = label_encoder.inverse_transform(predictions_bert)

# Print BERT predictions
print("BERT Predictions:")
for text, pred_label in zip(real_world_text, predictions_bert_labels):
    print(f"Text: {text} -> Predicted Value: {pred_label}")

# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_model', use_cuda=False)

# Predict the class using RoBERTa
predictions_roberta, _ = roberta_model.predict(real_world_text)

# Decode predictions back to original labels
predictions_roberta_labels = label_encoder.inverse_transform(predictions_roberta)

# Print RoBERTa predictions
print("\nRoBERTa Predictions:")
for text, pred_label in zip(real_world_text, predictions_roberta_labels):
    print(f"Text: {text} -> Predicted Value: {pred_label}")

BERT Predictions:
Text: How does NLP improve property valuation accuracy?What role does sentiment analysis play in real estate assessment?How does BERT outperform traditional models in property valuation?Can machine learning enhance predictive performance in real estate?What are the key benefits of using NLP for analyzing real estate listings? -> Predicted Value: Medium


RoBERTa Predictions:
Text: How does NLP improve property valuation accuracy?What role does sentiment analysis play in real estate assessment?How does BERT outperform traditional models in property valuation?Can machine learning enhance predictive performance in real estate?What are the key benefits of using NLP for analyzing real estate listings? -> Predicted Value: Medium
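
The training rows follow a fixed pattern ("Location: ...; Size: ... sq. ft; ...") and were cleaned before training, whereas the queries above are free-form questions passed to `predict` without cleaning. Inputs that resemble the training text are more likely to give meaningful predictions. The sketch below builds such an input and applies the same cleaning steps as Section 2; `format_listing` and its fields are hypothetical, and fields beyond location and size are omitted because the displayed rows are truncated.

```python
import re

def clean_text(text):
    """Same steps as the Section 2 cleaning function."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text.strip()

def format_listing(city, area, size_sq_ft):
    """Build a listing string in the 'Location: ...; Size: ...' pattern."""
    return f"Location: {city}, {area}; Size: {size_sq_ft} sq. ft"

listing = format_listing("Mumbai", "Andheri", 950)
model_input = clean_text(listing)
print(model_input)
```

A list such as `[model_input]` could then be passed to `bert_model.predict` in place of the free-form questions.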
