## Applied Exploration - Fortnight 1

The `roberta-base-go_emotions` model is documented here: https://huggingface.co/SamLowe/roberta-base-go_emotions

Answer some questions about this:
* What is `roberta-base`? Write down some things you can learn about it from the documentation.
* What is `go_emotions`? Write down some things you can learn about it from the documentation.

Go to the Hugging Face models page: https://huggingface.co/models
* click `Text Classification`
* Try some additional models
    - test out at least one more sentiment/emotions model
    - test out at least two other kinds of models - like news topic classification or spam detection
    - write down some info about the models you found
        - what is it for?
        - who made it?
        - what kind of data was it trained on?
        - are they based on some other model and trained on new data (*fine-tuned*) for a specific task?


### roberta-base
* pretrained masked language transformer model
* from my brief research, masked language models are kind of like a spoon blank in woodcarving: a rough starting point for whatever kind of NLP task you want to do, which still needs to be refined (fine-tuned) afterwards
* self-supervised pre-training
* I was surprised to see how strong the bias was towards gender stereotypes (the example they gave); I suppose it doesn't surprise me, but that's something to keep in mind
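
The self-supervised part can be sketched in a few lines: the training labels come from the text itself, by hiding some tokens and asking the model to recover them. This is only a toy illustration, not RoBERTa's actual preprocessing, which works on subword tokens and uses a more involved masking scheme. The `mask_tokens` helper, the whitespace tokenization, and the masking rate used here are assumptions for illustration.

```python
import random

MASK = "<mask>"  # the mask token string used by RoBERTa tokenizers

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; labels keep the original token where masked, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the mask...
            labels.append(tok)    # ...and is trained to predict this token
        else:
            masked.append(tok)
            labels.append(None)   # no prediction target at this position
    return masked, labels

# Whitespace splitting is a simplification (assumption), not real tokenization.
tokens = "the spoon blank is a rough start that still needs carving".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

No human labeling is needed at any point, which is why this kind of pre-training can use very large amounts of raw text.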

### go_emotions
* https://huggingface.co/datasets/google-research-datasets/go_emotions
* 58,000 Reddit comments labeled with (potentially multiple) emotion categories
* data was curated for use in, for example, emotionally-aware chatbots (although the Hugging Face page doesn't say how they curated it)
* I did find it interesting that Hugging Face has a section on social impact & bias considerations
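
Because a comment can carry several labels at once, a model trained on this data is typically read as multi-label: each emotion gets its own independent score, and every label above a threshold is kept. A minimal sketch of that idea, assuming made-up logits and a 0.5 threshold (the `multi_label_predict` helper and all numbers are hypothetical, not taken from the model):

```python
import math

def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def multi_label_predict(logits, threshold=0.5):
    """Keep every label whose independent sigmoid score reaches the threshold."""
    scores = {label: sigmoid(z) for label, z in logits.items()}
    return {label: s for label, s in scores.items() if s >= threshold}

# Hypothetical raw scores (logits) for one comment; not real model output.
logits = {"joy": 2.0, "gratitude": 0.5, "anger": -3.0, "neutral": -1.0}
predicted = multi_label_predict(logits)
print(sorted(predicted))  # ['gratitude', 'joy']
```

Unlike a softmax, these scores do not have to sum to 1, so two emotions can both score high for the same comment.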

In [3]:
import sys
!{sys.executable} -m pip install transformers accelerate



### tabularisai/multilingual-sentiment-analysis
* like `roberta-base-go_emotions`, this is a sentiment analysis model, but it's not limited to just English
* made by tabularis.ai, a company focused on developing local AI models
* trained only on LLM-generated multilingual data
* fine-tuned version of `distilbert-base-multilingual-cased`

In [4]:
from transformers import pipeline

classifier = pipeline("text-classification", model="tabularisai/multilingual-sentiment-analysis")

# English meanings of the non-English inputs (via Google Translate): my dog is very, very pretty; the professor of CS 195 is very intelligent; I don't speak French
classifier(["I like Drake University", "mi perra es muy muy bonita", "el profe de CS 195 es muy inteligente", "Je ne parle pas français"])



[{'label': 'Positive', 'score': 0.3969952166080475},
 {'label': 'Positive', 'score': 0.7760933041572571},
 {'label': 'Positive', 'score': 0.8941887021064758},
 {'label': 'Negative', 'score': 0.6792149543762207}]
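
The first result is worth a second look: a top score of about 0.40 means the model was far less sure about that sentence than about the others. One way to make that visible is to flag predictions that fall under a cutoff. A small sketch using the scores printed above; the `flag_uncertain` helper and the 0.5 cutoff are arbitrary choices for illustration, not part of the `transformers` API.

```python
def flag_uncertain(results, cutoff=0.5):
    """Return a copy of each result with a boolean 'confident' flag added."""
    return [{**r, "confident": r["score"] >= cutoff} for r in results]

# Output copied from the cell above, with scores rounded.
results = [
    {"label": "Positive", "score": 0.397},
    {"label": "Positive", "score": 0.776},
    {"label": "Positive", "score": 0.894},
    {"label": "Negative", "score": 0.679},
]

for r in flag_uncertain(results):
    status = "ok" if r["confident"] else "low confidence"
    print(r["label"], r["score"], status)
```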

### papluca/xlm-roberta-base-language-detection
* based on `xlm-roberta-base`, a multilingual variant of RoBERTa; it's interesting to see how the same kind of model can be used for a variety of different kinds of outputs
* it looks like it's just made by some guy interested in language detection
* trained on 70,000 samples from the [language identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset, which itself looks like a few different datasets aggregated together
* a fine-tuned version of `xlm-roberta-base`

In [5]:
from transformers import pipeline

classifier = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection")

# English meanings of the non-English inputs (via Google Translate): my dog is very, very pretty; the professor of CS 195 is very intelligent; I don't speak French
classifier(["I like Drake University", "mi perra es muy muy bonita", "el profe de CS 195 es muy inteligente", "Je ne parle pas français"])

XLMRobertaForSequenceClassification LOAD REPORT from: papluca/xlm-roberta-base-language-detection
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[{'label': 'en', 'score': 0.9530580043792725},
 {'label': 'es', 'score': 0.9931657910346985},
 {'label': 'es', 'score': 0.9923359751701355},
 {'label': 'fr', 'score': 0.9915567636489868}]
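
The pipeline returns one result per input, in the same order as the inputs, so the texts and predictions can be paired back up. A short sketch that groups the four test sentences by detected language, using the labels printed above (the `group_by_label` helper is hypothetical, written just for this note, and the scores are rounded):

```python
from collections import defaultdict

def group_by_label(texts, results):
    """Group input texts by the label predicted for each one."""
    groups = defaultdict(list)
    for text, res in zip(texts, results):
        groups[res["label"]].append(text)
    return dict(groups)

texts = [
    "I like Drake University",
    "mi perra es muy muy bonita",
    "el profe de CS 195 es muy inteligente",
    "Je ne parle pas français",
]
# Output copied from the cell above, with scores rounded.
results = [
    {"label": "en", "score": 0.953},
    {"label": "es", "score": 0.993},
    {"label": "es", "score": 0.992},
    {"label": "fr", "score": 0.992},
]

by_lang = group_by_label(texts, results)
print(by_lang)
```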

### mshenoda/roberta-spam
* *also* based on roberta! on second thought, maybe I should have gone through to find a different base model
* it looks like this guy is a researcher at Drexel University with a few AI papers listed on his Hugging Face profile
* trained on 47,000 labeled ham/spam messages from SMS, Telegram, and email
* fine-tuned from `roberta-base` again

In [6]:
from transformers import pipeline

classifier = pipeline("text-classification", model="mshenoda/roberta-spam")

# the hard part about this is that spam filters are good enough now that I hardly get any real spam.
# both of these are marked as clean; the first, I pulled from a spammy email (although it might just be annoying marketing)
classifier(["I an reaching out to wish you a very Happy New Year. As we go into this year, are there any upcoming projects that I could assist with in terms of labrotory essential equipment? I would be happy to hop on a call to discuss more. I look forward to hearing from you!", 
           "Hi John, hope you're doing well. Just wanted to wish you a merry christmas, and I hope the kids are doing well!"])

RobertaForSequenceClassification LOAD REPORT from: mshenoda/roberta-spam
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[{'label': 'LABEL_0', 'score': 0.9999982118606567},
 {'label': 'LABEL_0', 'score': 0.9999983310699463}]
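
`LABEL_0` is a generic placeholder name, so the output does not say by itself which class is ham and which is spam. Below is a small sketch that renames the labels. The mapping of `LABEL_0` to ham and `LABEL_1` to spam is an assumption (it matches the comment above that both messages were marked as clean) and should be checked against the model card or the model's `config.id2label` before relying on it. The `rename_labels` helper is hypothetical.

```python
# ASSUMPTION: this mapping is a guess, not confirmed by the model card.
ASSUMED_NAMES = {"LABEL_0": "ham", "LABEL_1": "spam"}

def rename_labels(results, names=ASSUMED_NAMES):
    """Swap generic label ids for readable names; unknown ids pass through unchanged."""
    return [{**r, "label": names.get(r["label"], r["label"])} for r in results]

# Output copied from the cell above.
results = [
    {"label": "LABEL_0", "score": 0.9999982118606567},
    {"label": "LABEL_0", "score": 0.9999983310699463},
]
print(rename_labels(results))
```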