# Title: Utilizing and Tweaking SqueezeBERT

#### Group Member Names : Josiah Grimm - 200531178



### INTRODUCTION:
My project implements and compares three transformer-based language models for masked language modeling (MLM): SqueezeBERT Base, my tuned SqueezeBERT, and BERT. The primary goal is to explore how SqueezeBERT works and to measure its speed on tasks such as GLUE MRPC and my sample text from Wikipedia.
#### AIM :

The aim of this project is to implement, fine-tune, and evaluate transformer-based language models, specifically SqueezeBERT and BERT, on a sample of Wikipedia text. The project seeks to reproduce the methodology of SqueezeBERT for masked language modeling, while experimenting with modifications such as hyperparameter tuning and domain adaptation to explore their effects on model performance. By comparing SqueezeBERT Base, Tuned SqueezeBERT, and BERT Base, my project evaluates not only predictive accuracy but also inference speed and resource efficiency. The models are deployed on unseen Wikipedia text to assess their applicability and to provide insights into the trade-offs between model size, speed, and effectiveness. The project also tests whether the paper's claims still hold more than four years after publication by comparing SqueezeBERT against the current BERT implementation.
#### GitHub Repo:

- My GitHub Project: https://github.com/GizMoGRZ/Final-Project-ML-Programming
- Hugging Face for SqueezeBERT-uncased: https://huggingface.co/squeezebert/squeezebert-uncased
#### DESCRIPTION OF PAPER:

The research paper explains how SqueezeBERT adapts techniques from efficient computer vision models to natural language processing (NLP). The authors observed that popular NLP models place a heavy load on small devices. To address this, they propose SqueezeBERT, a new architecture that uses grouped convolutions to improve computational efficiency and reduce latency. As a result, SqueezeBERT ran approximately 4.3 times faster than BERT at the time of publication while maintaining acceptable accuracy on standard NLP benchmarks such as the GLUE test suite.
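
To make the saving from grouped convolutions concrete, the sketch below counts the learnable parameters of a position-wise layer written as a 1D convolution with kernel size 1, once without grouping and once with grouping. The hidden size of 768 and the group count of 4 are assumptions chosen for illustration, and `conv1d_params` is a hypothetical helper, not code from the paper or from my notebook.

```python
def conv1d_params(in_channels: int, out_channels: int, kernel_size: int = 1,
                  groups: int = 1, bias: bool = True) -> int:
    """Count the learnable parameters of a 1D convolution layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels / groups input channels.
    weights = (in_channels // groups) * out_channels * kernel_size
    return weights + (out_channels if bias else 0)


HIDDEN = 768  # assumed hidden size, chosen for illustration
GROUPS = 4    # assumed group count, chosen for illustration

dense = conv1d_params(HIDDEN, HIDDEN, groups=1)
grouped = conv1d_params(HIDDEN, HIDDEN, groups=GROUPS)

print(f"ungrouped layer: {dense:,} parameters")
print(f"grouped layer:   {grouped:,} parameters")
```

With a group count of g, the weight matrix shrinks by a factor of g because each group of output channels connects to only its own slice of the input channels. The multiply-accumulate count of the layer shrinks by the same factor, which is where the latency improvement comes from.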
#### PROBLEM STATEMENT :

While SqueezeBERT reduces computational costs compared to BERT, it remains unclear how well it performs on custom domain-specific text or small real-world datasets. The problem is to evaluate and compare the effectiveness, efficiency, and real-world applicability of SqueezeBERT (base and fine-tuned) versus the current BERT Base on a sample of Wikipedia articles, and to explore how fine-tuning and hyperparameter modifications impact model performance, inference speed, and resource usage. My project will highlight trade-offs between model size, accuracy, and efficiency in practical NLP applications.
#### CONTEXT OF THE PROBLEM:

Modern NLP relies heavily on large transformer-based models like BERT, which achieve incredible results across a wide range of tasks, including sentiment analysis, question answering, and natural language understanding benchmarks such as GLUE. Unfortunately these models are computationally intensive, memory hungry, and difficult to deploy on devices with limited resources or in real time. SqueezeBERT was proposed as an efficient alternative, leveraging techniques from computer vision, such as grouped convolutions, to reduce latency and memory usage while maintaining competitive accuracy. Understanding how these models perform on unseen datasets, such as Wikipedia text samples, is important for assessing their practical applicability.

#### SOLUTION:
The solution involves implementing, fine-tuning, and evaluating three models on a sample of Wikipedia text to analyze speed and efficiency in NLP: SqueezeBERT, BERT Base, and my own SqueezeBERT variant with tweaked hyperparameters. The workflow includes preprocessing the text and tokenizing sequences, fine-tuning SqueezeBERT on the sample dataset with optimized hyperparameters, and evaluating all models using masked language modeling and GLUE metrics. Performance is compared in terms of inference speed and predictive accuracy. Finally, the models are deployed on unseen Wikipedia text to assess real-world applicability, highlighting the difference between models in practical NLP tasks.


# Background
*********************************************************************************************************************


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|
|BERT (Devlin et al., 2019)|Popular transformer-based language model tested on GLUE|GLUE, Wikipedia, BookCorpus|Large, slow inference, high memory usage|
|SqueezeBERT (Iandola et al., 2020)|Efficient variant of BERT using grouped convolutions|GLUE, BookCorpus, Wikipedia, Mobile devices|May have slightly lower accuracy, less tested on domain-specific text|








# Implement paper code :
To implement the concepts presented in the SqueezeBERT paper, I leveraged the pre-trained SqueezeBERT model from Hugging Face Transformers, which incorporates the paper’s architecture. The model was fine-tuned on a sample of Wikipedia text to evaluate its domain adaptation capabilities. The workflow included preprocessing and tokenizing the text, setting appropriate batch sizes and sequence lengths to manage resources, and configuring training arguments such as learning rate, number of epochs, and evaluation steps. Both masked language modeling and GLUE-style metrics were used to measure performance, and inference times were recorded to assess computational efficiency.
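
The masking step behind masked language modeling can be sketched in plain Python. This is a minimal illustration of the recipe described in the BERT paper (select about 15% of tokens, then replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged). It is not the code from my notebook: `mask_tokens`, the toy sentence, and the toy vocabulary are all hypothetical, and the sketch selects each token independently rather than picking an exact 15% of positions. In practice, a library data collator such as `DataCollatorForLanguageModeling` from Hugging Face Transformers would normally handle this step.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (masked_tokens, labels) for one tokenized sequence.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Select each position independently with probability mlm_probability.
        if rng.random() >= mlm_probability:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Toy input, for illustration only.
sentence = "squeezebert replaces fully connected layers with grouped convolutions".split()
vocab = sorted(set(sentence))

masked, labels = mask_tokens(sentence, vocab, mlm_probability=0.15, seed=1)
print(masked)
print(labels)
```

The model is trained to predict the original token only at the positions where `labels` is not `None`, so the loss ignores every unselected position.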




*********************************************************************************************************************
### Contribution  Code :
In this project, I made two main contributions to extend and evaluate the SqueezeBERT methodology. First, I created SqueezeBERT v2, a fine-tuned version of the pre-trained SqueezeBERT model, by adjusting hyperparameters such as batch size, learning rate, and evaluation steps, and training it on a sample of Wikipedia text. This allowed me to explore how targeted fine-tuning affects performance, inference speed, and memory usage on domain-specific data. Second, I implemented BERT Base as a baseline model to provide a direct comparison with both the original and tuned SqueezeBERT models. By evaluating all three models on the same dataset and measuring accuracy, GLUE-style metrics, and inference time, I was able to quantify the trade-offs between model efficiency, computational cost, and predictive performance, highlighting the benefits and limitations of efficiency-driven transformer architectures in practical NLP applications.
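
For reference, the two GLUE-style metrics reported for MRPC, accuracy and F1 on the positive (paraphrase) class, can be computed as in the sketch below. The helper `accuracy_and_f1` and the toy labels are hypothetical and only illustrate the definitions. They are not my model outputs.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 for a binary classification task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels, for illustration only (1 = paraphrase, 0 = not a paraphrase).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")
```

F1 ignores true negatives, so on a dataset with more positive than negative pairs it can sit noticeably above accuracy, which matches the pattern in my reported scores.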

### Results :
During training, SqueezeBERT Base, Tuned SqueezeBERT, and BERT Base all demonstrated decreasing training losses. SqueezeBERT Base’s training loss started at 0.606 and decreased to 0.246 by the final step of epoch 3. Tuned SqueezeBERT began at 0.623 and finished around 0.293 by epoch 4, while BERT Base started at 0.624 and ended at 0.224. GLUE-style evaluation metrics showed that SqueezeBERT Base achieved 86.5% accuracy and an F1 score of 0.905, Tuned SqueezeBERT reached 83.3% accuracy and 0.881 F1, and BERT Base achieved 84.3% accuracy with a 0.888 F1 score. Inference timing on 100 Wikipedia sentences revealed that Tuned SqueezeBERT processed the batch in 0.021 seconds, SqueezeBERT Base in 0.036 seconds, and BERT Base in 0.014 seconds.
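
The relative speeds implied by these timings are easier to compare once they are normalized. The short sketch below uses only the three timings reported above and the batch of 100 sentences. The variable names are my own for this illustration.

```python
# Reported inference times (seconds) for the batch of 100 Wikipedia sentences.
timings_s = {
    "SqueezeBERT Base": 0.036,
    "Tuned SqueezeBERT": 0.021,
    "BERT Base": 0.014,
}
NUM_SENTENCES = 100

baseline = timings_s["SqueezeBERT Base"]
summary = {}
for name, seconds in sorted(timings_s.items(), key=lambda item: item[1]):
    summary[name] = {
        "per_sentence_ms": seconds / NUM_SENTENCES * 1000,
        "speedup": baseline / seconds,  # relative to SqueezeBERT Base
    }
    print(f"{name:<18} {summary[name]['per_sentence_ms']:.2f} ms/sentence, "
          f"{summary[name]['speedup']:.2f}x vs SqueezeBERT Base")
```

On these numbers, Tuned SqueezeBERT comes out about 1.7 times faster than SqueezeBERT Base, and BERT Base about 2.6 times faster.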

#### Observations :
From this experiment, I observed several key insights about SqueezeBERT, Tuned SqueezeBERT, and BERT Base. First, training loss decreased steadily for all models, showing that fine-tuning was effective and stable across epochs. Second, the GLUE evaluation metrics indicated that SqueezeBERT Base slightly outperformed Tuned SqueezeBERT in accuracy and F1 score, suggesting that aggressive tuning may slightly reduce generalization even as it improves efficiency. Third, inference speed measurements showed that Tuned SqueezeBERT was faster than the base version, which I attribute to the hyperparameter adjustments, while BERT Base, despite being larger, achieved the fastest processing on this small batch, likely due to efficient batching and GPU optimization. Overall, the experiment demonstrates the trade-offs between model accuracy, efficiency, and computational cost. SqueezeBERT can provide a balance between speed and performance, and fine-tuning allows tailoring the model for domain-specific tasks like processing Wikipedia text. However, based on these results, BERT is arguably the better choice because of its speed, and further tuning could improve its accuracy enough to make SqueezeBERT irrelevant.
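
Because the timings above come from a single small batch, a repeated measurement with warm-up runs gives a steadier comparison. The sketch below is a generic timing harness under stated assumptions: `time_inference` is a hypothetical helper, and `dummy_batch` is a stand-in workload where a real run would call the model on the 100 tokenized sentences. On a GPU, the device would also need to be synchronized (for example with `torch.cuda.synchronize()`) before each clock reading, because kernels are launched asynchronously.

```python
import statistics
import time


def time_inference(run_batch, warmup=3, repeats=10):
    """Time a zero-argument callable and return summary statistics in seconds."""
    # Warm-up runs absorb one-time costs such as caching and lazy initialization.
    for _ in range(warmup):
        run_batch()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch()
        samples.append(time.perf_counter() - start)

    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


def dummy_batch():
    """Stand-in workload; replace with a real forward pass over the batch."""
    return sum(i * i for i in range(50_000))


stats = time_inference(dummy_batch)
print(f"median {stats['median_s'] * 1000:.3f} ms "
      f"(min {stats['min_s'] * 1000:.3f} ms, max {stats['max_s'] * 1000:.3f} ms)")
```

Reporting the median alongside the minimum and maximum shows how much of a gap between two models is within run-to-run noise.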


### Conclusion and Future Direction :
In conclusion, I successfully implemented and evaluated SqueezeBERT, Tuned SqueezeBERT, and BERT Base on a sample of Wikipedia text. The experiment showed that SqueezeBERT provides a good balance between efficiency and performance, with tuning allowing further improvements in inference speed at a minor cost to accuracy. BERT Base remains a strong baseline in accuracy but comes with higher computational cost. For future work, I could explore larger or more diverse datasets, further optimize hyperparameters, and experiment with other efficiency-focused transformer variants. Additionally, implementing mixed-precision training, gradient checkpointing, or model quantization could further reduce memory usage and speed up inference, making these models more practical for large-scale or real-time NLP applications.
#### Learnings :

From this project, I learned several important lessons about implementing and evaluating transformer-based NLP models. I gained hands-on experience with preprocessing large-scale text data, including parsing XML Wikipedia dumps and tokenizing sequences for masked language modeling. I also learned how model architecture and hyperparameter choices impact both performance and efficiency, observing that fine-tuning SqueezeBERT improved inference speed while slightly affecting accuracy. Comparing SqueezeBERT with BERT highlighted the trade-offs between computational cost, latency, and predictive performance, reinforcing the importance of selecting models that match the constraints of the task and hardware. Finally, I developed a better understanding of evaluation metrics such as GLUE accuracy and F1 scores, and how to measure real-world applicability through inference timing, which will inform future work in optimizing NLP models for practical deployment.
#### Results Discussion :

These results indicate several important insights. Although SqueezeBERT Base had the highest accuracy and F1 score, fine-tuning it to create Tuned SqueezeBERT improved inference speed while only slightly reducing predictive performance, highlighting the trade-off between efficiency and accuracy. BERT Base maintained competitive performance and, although it is the larger and more resource-intensive model, recorded the fastest inference time (0.014 seconds), demonstrating that larger models are not always the slowest in practice. The differences in inference times also reflect the impact of model architecture and hyperparameter tuning on real-world deployment efficiency. Overall, my observations confirm that efficiency-focused models like SqueezeBERT can be tuned to balance speed and accuracy, and careful evaluation is necessary to match model choice to an individual task.

#### Limitations :

While my program successfully implemented and compared SqueezeBERT, Tuned SqueezeBERT, and BERT Base on a sample of Wikipedia text, there are several limitations to note. First, I worked with a small subset of the Wikipedia dataset (limited to 100 samples for testing), which may not fully represent the variability and complexity of natural language in larger corpora. Second, GPU memory constraints in Google Colab restricted the maximum sequence lengths and batch sizes I could use, preventing full utilization of the models' capabilities. Third, the program currently evaluates only inference speed and standard GLUE-style metrics, without deeper analysis of downstream task performance like sentiment analysis or generalization to other domains. Finally, tokenizing and preprocessing the XML data introduced parsing errors, which may have impacted the quality of the training data. These limitations suggest that the results should be interpreted as a small-sample comparison of the models; scaling up to larger datasets or more diverse inputs could yield different results and better expose the models' true limitations.


#### Future Extension :
Future development of this project could focus on expanding both the experimental depth and the practical utility of the models. One extension would be to evaluate the trained models on downstream tasks, such as sentiment analysis or natural language inference, to better understand how architecture differences translate to real-world results beyond a single binary classification task. Another direction is to incorporate larger and more diverse datasets, enabling stronger generalization and reducing the limitations imposed by the small, curated XML subset used for training and testing.

# References:

[1]:  SqueezeBERT Paper: https://arxiv.org/pdf/2006.11316  
[2]:  BERT Paper: https://arxiv.org/pdf/1810.04805  
[3]:  SqueezeBERT HuggingFace: https://huggingface.co/squeezebert/squeezebert-uncased  
[4]:  BERT HuggingFace: https://huggingface.co/papers/1810.04805