# Enhancing User Experience: Text Classification for Online App using DistilBertTokenizerFast

## Abstract
In the rapidly evolving digital landscape, online apps play a pivotal role in connecting users with information, services, and communities. This project aims to enhance user experience within an online app by implementing text classification with a DistilBERT model, using DistilBertTokenizerFast to tokenize the input. Text classification is a fundamental part of user interaction, aiding content organization, sentiment analysis, and personalized recommendations. The project seeks to develop a text classification model that categorizes user-generated content accurately and quickly enough for real-time use. Effective categorization can improve content discoverability, tailor recommendations, and make the user journey more engaging. The outcomes of this project have the potential to raise user satisfaction and interaction within the online app, contributing to its growth in an increasingly competitive market.

## Objective:
The objective of this project is to perform text classification on an app review dataset using the DistilBERT-base-uncased transformer model. By employing this state-of-the-art model, the project aims to accurately categorize user reviews and feedback, providing insights into user sentiments and preferences. The ultimate goal is to enhance app development and user experience by automating the analysis of user-generated content. Through this project, we aim to showcase the effectiveness of transformer models in handling real-world text data and their potential to improve decision-making and app optimization strategies.

## Table of Contents

* [Loading Local Dataset](#section-1.1)
* [Train Test Split](#section-1.2)
* [Importing Transformer Model](#section-1.3)
* [Tokenization & Padding](#section-1.4)
* [Converting Raw Dataset to TensorFlow Tensors](#section-1.5)
* [Fine Tuning](#section-1.6)
* [Model Training](#section-1.7)
* [Model Evaluation](#section-1.8)
* [Conclusion](#section-1.9)

   

### Loading local dataset
<a id='section-1.1'></a>

In [None]:
import pandas as pd


In [None]:
df = pd.read_excel('/content/evaluation.xlsx')
print(df.shape)
df.head()

(9000, 3)


|   | text | reason | label |
|---|------|--------|-------|
| 0 | the app is crashing when i play a vedio | app crashes during playback | 1 |
| 1 | but i want to connect it to the tv from one de... | want compatibility with more smart televisions | 0 |
| 2 | very helpful when and home working remotley | good app for work | 0 |
| 3 | this zoom so called and missed call and mobile... | receiving incorrect phone number message | 0 |
| 4 | one of my favorite apps | good for spending time | 0 |


In [None]:
X = list(df['text'])  # review texts


In [None]:
y = list(df['label'])

In [None]:
len(y)

9000

### Train test split
<a id='section-1.2'></a>

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the reviews for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
len(X_train)

7200

In [None]:
len(y_test)

1800

### Installing transformers

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

### Importing Transformer model
<a id='section-1.3'></a>

In [None]:
from transformers import DistilBertTokenizerFast

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


## Tokenization & Padding
<a id='section-1.4'></a>

The cell below calls the tokenizer on both the training and the test texts to turn raw strings into token encodings. With `truncation=True`, sequences longer than the model's maximum input length are cut off; with `padding=True`, shorter sequences are padded to the longest sequence in the batch, so every example in a split has the same length.

This preprocessing step is required before the text can be fed to the model, which expects fixed-shape integer inputs rather than raw strings.
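To make truncation and padding concrete, here is a minimal sketch in plain Python. It is not the WordPiece tokenizer that `DistilBertTokenizerFast` implements: the whitespace splitting, the `build_vocab` and `encode_batch` helpers, the `PAD_ID` value, and the `max_length` of 6 are all assumptions chosen for illustration. Only the shape of the result (an `input_ids` list and an `attention_mask` list of equal length per example) mirrors what the real tokenizer returns.

```python
# Toy illustration of truncation and padding; NOT the real WordPiece tokenizer.
PAD_ID = 0  # hypothetical id reserved for padding


def build_vocab(texts):
    """Map each whitespace-separated word to an integer id (ids start at 1)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode_batch(texts, vocab, max_length):
    """Truncate each sequence to max_length, then pad to the longest one."""
    ids = [[vocab[w] for w in t.lower().split()][:max_length] for t in texts]
    longest = max(len(seq) for seq in ids)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in ids]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sample reviews taken from the dataset preview
texts = ["one of my favorite apps", "the app is crashing when i play a vedio"]
vocab = build_vocab(texts)
encodings = encode_batch(texts, vocab, max_length=6)

print(encodings["input_ids"])       # [[1, 2, 3, 4, 5, 0], [6, 7, 8, 9, 10, 11]]
print(encodings["attention_mask"])  # [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```

The first review has five words and receives one padding id; the second has nine words and is truncated to six.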



In [None]:
train_encodings = tokenizer(X_train, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)

## Converting Raw Dataset to TensorFlow Tensors
<a id='section-1.5'></a>

`tf.data.Dataset.from_tensor_slices` builds a TensorFlow dataset from the tokenized encodings and their labels. Each element of the dataset pairs a dictionary of model inputs (`input_ids` and `attention_mask`) with the label of the same example.

Organizing the data this way lets the trainer shuffle and batch the examples during training without further conversion.
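As a rough mental model of what slicing and batching do, the sketch below imitates them in plain Python on three made-up examples. The helper names `from_slices` and `batch` are hypothetical and only mimic the behavior of `from_tensor_slices` followed by batching; they are not TensorFlow APIs, and the feature values are invented.

```python
def from_slices(features, labels):
    """Yield one (feature_dict, label) pair per example, slicing along axis 0."""
    for i in range(len(labels)):
        yield {name: values[i] for name, values in features.items()}, labels[i]


def batch(pairs, batch_size):
    """Group consecutive pairs into batches of at most batch_size."""
    current = []
    for pair in pairs:
        current.append(pair)
        if len(current) == batch_size:
            yield current
            current = []
    if current:  # last, possibly smaller, batch
        yield current


# Made-up encodings for three examples
features = {
    "input_ids": [[1, 2, 0], [3, 4, 5], [6, 0, 0]],
    "attention_mask": [[1, 1, 0], [1, 1, 1], [1, 0, 0]],
}
labels = [0, 1, 0]

examples = list(from_slices(features, labels))
batches = list(batch(examples, batch_size=2))

print(examples[1])               # ({'input_ids': [3, 4, 5], 'attention_mask': [1, 1, 1]}, 1)
print([len(b) for b in batches]) # [2, 1]
```

The point of the dictionary structure is that the keys line up with the model's input names, so each element can be passed to the model as keyword inputs.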

In [None]:
import tensorflow as tf

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), y_train))


In [None]:
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), y_test))


### Fine Tuning
<a id='section-1.6'></a>

Training Configuration and Parameters

The cell below sets the training parameters for the TensorFlow DistilBERT model through the `TFTrainingArguments` class from the Hugging Face Transformers library. These settings control how training runs and where checkpoints and logs are written.

* TFTrainingArguments: A class for configuring training arguments.
* output_dir: Directory where model checkpoints and results will be saved.
* num_train_epochs: Total number of training epochs.
* per_device_train_batch_size: Batch size per device during training.
* per_device_eval_batch_size: Batch size for evaluation.
* warmup_steps: Number of warm-up steps for the learning rate scheduler.
* weight_decay: Strength of weight decay regularization.
* logging_dir: Directory for storing training logs.
* logging_steps: Interval between logging training progress.

In [None]:

from transformers import TFDistilBertForSequenceClassification, TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log training progress every 10 steps
)
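To see how `warmup_steps=500` relates to the length of this training run, the sketch below computes the number of optimizer steps and a warm-up schedule. Several things here are assumptions rather than facts from the notebook: the schedule shape (linear warm-up followed by linear decay to zero), the peak learning rate of `5e-5`, and training on a single device. The helper `linear_warmup_decay` is hypothetical and is not a Transformers function.

```python
import math


def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


train_examples = 7200   # size of the training split
batch_size = 8          # per_device_train_batch_size, assuming one device
epochs = 2              # num_train_epochs
warmup_steps = 500
peak_lr = 5e-5          # assumed peak learning rate

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 900 1800
for step in (0, 250, 500, 1150, 1800):
    print(step, linear_warmup_decay(step, peak_lr, warmup_steps, total_steps))
```

Under these assumptions the warm-up covers 500 of 1,800 steps, which is more than a quarter of the run; with so few epochs it may be worth trying a shorter warm-up.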

## Model Training
<a id='section-1.7'></a>

In [None]:
with training_args.strategy.scope():
    model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset             # evaluation dataset
)

trainer.train()


Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

## Model Evaluation
<a id='section-1.8'></a>

In [None]:
trainer.evaluate(test_dataset)

{'eval_loss': 0.523518486360533}

In [None]:
y_pred = trainer.predict(test_dataset)
y_pred

PredictionOutput(predictions=array([[ 1.8882581 , -1.983661  ],
       [ 0.3870951 , -0.5638229 ],
       [ 0.9676417 , -0.9744448 ],
       ...,
       [-0.630165  ,  0.55363226],
       [ 1.3176115 , -1.4144349 ],
       [ 1.0437535 , -1.0954528 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 0, 1], dtype=int32), metrics={'eval_loss': 0.5240015013028035})
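The `predictions` array holds raw logits, one row per review and one column per class, so it has to be reduced to class ids before it can be compared with the labels. The sketch below does this in plain Python for the six rows visible in the output above (the first three and the last three) and their `label_ids`. The `softmax` and `argmax` helpers are written out here for illustration; in the notebook itself `np.argmax(..., axis=1)` would do the same reduction.

```python
import math

# The six logit rows and labels displayed in the PredictionOutput above
logits = [
    [1.8882581, -1.983661],
    [0.3870951, -0.5638229],
    [0.9676417, -0.9744448],
    [-0.630165, 0.55363226],
    [1.3176115, -1.4144349],
    [1.0437535, -1.0954528],
]
label_ids = [0, 0, 0, 1, 0, 1]


def softmax(row):
    """Convert one row of logits to probabilities (shifted for stability)."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row, i.e. the predicted class id."""
    return max(range(len(row)), key=row.__getitem__)


probs = [softmax(row) for row in logits]
pred_ids = [argmax(row) for row in logits]
matches = [p == t for p, t in zip(pred_ids, label_ids)]

print(pred_ids)  # [0, 0, 0, 1, 0, 0]
print(matches)   # [True, True, True, True, True, False]
```

The last displayed row is predicted as class 0 although its label is 1, so even this small sample contains a misclassification.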

In [None]:

trainer.predict(test_dataset)[1].shape

(1800,)

In [None]:

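import numpy as np

# `predictions` holds the logits; argmax over the class axis gives predicted class ids.
# Indexing the PredictionOutput with [1] would return label_ids (the true labels) instead.
y_pred = np.argmax(trainer.predict(test_dataset).predictions, axis=1)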

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm


array([[1188,    0],
       [   0,  612]])

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# y_pred must hold predicted class ids (the argmax of the logits).
# If it holds label_ids instead, every metric below is trivially 1.0.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
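For reference, the four scores can be derived by hand from the confusion-matrix counts. The sketch below does so in plain Python on made-up labels; `binary_metrics` is a hypothetical helper, not a scikit-learn function, and the toy `y_true_toy` and `y_pred_toy` lists are not taken from the dataset. It also shows why comparing a label list with itself always yields 1.0 for every score.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels and predictions for illustration
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]

scores = binary_metrics(y_true_toy, y_pred_toy)
self_scores = binary_metrics(y_true_toy, y_true_toy)

print(scores)       # every score is 0.75 for this toy data
print(self_scores)  # every score is 1.0: labels compared with themselves
```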

In [None]:
accuracy

1.0

In [None]:
precision

1.0

In [None]:
recall

1.0

In [None]:
f1

1.0

## Conclusion:
<a id='section-1.9'></a>

In conclusion, the project fine-tuned the DistilBERT-base-uncased transformer model for text classification of app reviews. The accuracy, precision, recall, and F1 scores of 1.0 recorded above should not be read as model performance: they were produced by comparing the true labels with themselves, because element `[1]` of the prediction output is `label_ids` rather than `predictions`. The evaluation loss of about 0.52, and the last displayed row of the prediction output (logits favoring class 0 for a review labeled 1), indicate that the model does make errors. Meaningful metrics require predicted class ids taken as the argmax of the logits.

The workflow still shows how a transformer model can be applied to real-world review text, where sentiment analysis and content categorization are valuable. The next steps are to recompute the metrics from the model's actual predictions and to test on other scenarios and datasets to assess generalizability. Beyond that, the model could be deployed within app ecosystems to automate user feedback analysis, supporting app development and user satisfaction.