**What is a pre-trained NLP model, and what do I have to keep in mind when tokenizing?**
Pre-trained NLP models have already been trained on a dataset and can solve specific tasks. There are thousands of pre-trained models: some are trained on a narrow, specialized dataset, others are more general. For example, `microsoft/deberta-v3-small` is a general-purpose, well-trained model that is applicable to many cases. Here "small" indicates a smaller model, which for us means it is faster to train, so we can run more iterations.

To use a pre-trained NLP model, we need to make sure that we use the same dictionary (vocabulary) as the pre-trained model. Otherwise a word such as "of" would be mapped to a different numerical value than the one the model saw during training (because two different dictionaries are in play), and the model's predictions would no longer make sense.

This means that before we even start tokenizing, we need to decide which pre-trained model we want to use. To get the tokenizer that a given pre-trained model was trained with, we can use `AutoTokenizer`, which looks up which tokenizer belongs to which model.
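The following is a toy sketch of why the tokenizer has to match the model. Both vocabularies are invented for illustration and are not taken from any real checkpoint; the point is only that the same text produces different ids under different dictionaries.

```python
# Two made-up vocabularies standing in for two different pretrained models.
vocab_model_a = {"[UNK]": 0, "the": 1, "of": 2, "app": 3}
vocab_model_b = {"[UNK]": 0, "app": 1, "the": 2, "of": 3}


def encode(tokens, vocab):
    """Map each token to its id, falling back to the unknown-token id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = ["the", "name", "of", "the", "app"]
ids_a = encode(tokens, vocab_model_a)
ids_b = encode(tokens, vocab_model_b)

print(ids_a)  # ids as model A expects them
print(ids_b)  # same text, different ids

# Reading model A's ids with model B's vocabulary yields different words.
id_to_token_b = {idx: token for token, idx in vocab_model_b.items()}
misread = [id_to_token_b[idx] for idx in ids_a]
print(misread)
```

A model trained with vocabulary B would effectively "see" the misread sentence, which is why the tokenizer is always loaded from the same checkpoint name as the model.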



**How do I load them?**
For example, go to https://huggingface.co/models, pick one of the pre-trained models, and load it with the following code:
```
  from transformers import AutoTokenizer, AutoModelForMaskedLM

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
```

**What is tokenization?**
Tokenization separates a given string, such as a sentence, into smaller units called tokens, which are mostly words. They are called tokens rather than words because in some languages (e.g. Chinese) the category "word" does not exist in the same way. Tokenization has some difficulties with uncommon words and with contractions: depending on the tokenizer, "I'm" may be split into 3 tokens.
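As a rough illustration of the "I'm" case, here is a naive rule-based tokenizer. It is only a sketch: `simple_tokenize` is a hypothetical helper, and tokenizers shipped with pre-trained models usually work on subword units rather than with a single regular expression.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("I'm happy with this app!")
print(tokens)
# ['I', "'", 'm', 'happy', 'with', 'this', 'app', '!']
```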

**What is numericalization and why is it needed to work with Tokens?**
For machine learning models to learn from a dataset, the data must be numerical. NLP models therefore use numericalization to obtain a numerical value for each token: each distinct token is looked up in a dictionary and mapped to a number. For example, the word "of" might map to the numerical value 255 in a given dictionary.
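A minimal sketch of numericalization, assuming a vocabulary that is built from a tiny made-up corpus. The helper names (`build_vocab`, `numericalize`) and the `[UNK]` placeholder for unseen tokens are illustrative choices, not part of any specific library.

```python
def build_vocab(token_lists, unk_token="[UNK]"):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {unk_token: 0}
    for tokens in token_lists:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def numericalize(tokens, vocab, unk_token="[UNK]"):
    """Look up each token's id; unseen tokens map to the unknown id."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


corpus = [["nice", "app"], ["lots", "of", "bugs"]]
vocab = build_vocab(corpus)
ids = numericalize(["lots", "of", "nice", "features"], vocab)

print(vocab)
print(ids)  # "features" is not in the vocabulary, so it gets id 0
```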

**What does fine-tuning mean?**
Pre-trained models are usually trained on a different dataset than the one we use. We therefore need to fine-tune the model to make it work on our specific task: we continue training the model on our dataset so that its weights and biases are adjusted and it becomes more accurate on that task. Compared with training from scratch, where the weights and biases start out random and are adjusted with each epoch, starting from a pre-trained model gives better results much faster.

However, we need to watch out for overfitting. We therefore split our dataset into training and validation data: the validation data is held out during training, so we can check whether the model generalizes or has merely memorized every data point it has seen.
If our model is underfitting the training dataset, we need to fine-tune the model further.
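The split into training and validation data can be sketched in plain Python as follows. `train_valid_split`, the 20% validation share, and the fixed seed are assumptions made for this example; libraries such as `datasets` or scikit-learn offer their own split utilities.

```python
import random


def train_valid_split(items, valid_fraction=0.2, seed=42):
    """Shuffle a copy of items and hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


reviews = [f"review {i}" for i in range(10)]
train, valid = train_valid_split(reviews)
print(len(train), len(valid))  # 8 2
```

The model is fitted on `train` only; scores on `valid` then indicate how well it handles reviews it has not seen.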

**What types of NLP Models are there?**

**What are Transformers**
In the NLP context, Transformers are trained as language models. They are trained on raw text in a self-supervised fashion, without needing humans to label the data beforehand.

**What possibilities do I have with the Transformers package?** 
Examples of Transformers are GPT-2, GPT-3, BERT, BART, and DeBERTa. They can be grouped into three categories:
- GPT-like (also called auto-regressive Transformer models)
- BERT-like (also called auto-encoding Transformer models)
- BART/T5-like (also called sequence-to-sequence Transformer models)

With the package you can solve tasks such as:
1. Text generation
2. Text classification
  - e.g. sentiment analysis
  - topic classification
3. Question answering
4. Translation
5. Summarization

*OR*
1. Fine-tune a pretrained model
2. Create custom pipelines (preprocessing, model, postprocessing)

#### Import Data

#### Coding

In [18]:
import pandas as pd
from google.colab import drive
from google.colab import data_table
import matplotlib.pyplot as plt
data_table.enable_dataframe_formatter()

drive.mount('/content/drive')
data = pd.read_excel("/content/drive/MyDrive/Colab Notebooks/Signal_AppStore_lastYear_English.xlsx")

data.head(10)

Mounted at /content/drive


Unnamed: 0,Submission date,Publication date,AppID,AppName,Country,Review Language,Version,Author,Rating,Title,...,Updated,Semantic Tags,Semantic Categories,Semantic Sentiment,Notes,Likes,Dislikes,Link,Permalink,AF Link
0,2023-05-03T22:44:54,2023-05-05T18:31:05,874139669,Signal - निजी मैसेंजर,us,en,6.22,GhostBoy849,4,💯💯💯,...,2023-05-05T18:31:05,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180347190?...,https://watch.appfollow.io/apps/my-first-works...
1,2023-05-03T17:55:55,2023-05-05T07:58:55,874139669,Signal - निजी मैसेंजर,us,en,6.22,banky tee jp,5,Signal is secure and good alot I love it is go...,...,2023-05-05T07:58:55,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180260427?...,https://watch.appfollow.io/apps/my-first-works...
2,2023-05-03T17:49:19,2023-05-05T07:58:55,874139669,Signal - निजी मैसेंजर,us,en,6.22,karlaveggies,5,iMessage for any phone,...,2023-05-05T07:58:55,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180260426?...,https://watch.appfollow.io/apps/my-first-works...
3,2023-05-03T17:45:07,2023-05-05T07:58:55,874139669,Signal - निजी मैसेंजर,us,en,6.22,gukitoguka,4,Why?,...,2023-05-05T07:58:55,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180260425?...,https://watch.appfollow.io/apps/my-first-works...
4,2023-05-03T08:55:08,2023-04-22T08:02:06,874139669,Signal - निजी मैसेंजर,au,en,6.22,nayfanuel,1,Signal now sends unsolicited d picks,...,2023-05-05T07:58:54,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/176996669?...,https://watch.appfollow.io/apps/my-first-works...
5,2023-05-03T02:06:57,2023-04-26T07:57:53,874139669,Signal - निजी मैसेंजर,us,en,6.22,Jayvaun,1,Terrible,...,2023-05-05T07:58:55,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/177623644?...,https://watch.appfollow.io/apps/my-first-works...
6,2023-05-02T22:17:37,2023-05-05T07:58:55,874139669,Signal - निजी मैसेंजर,us,en,6.22,WT impossible,2,Constant pin reminders,...,2023-05-05T07:58:55,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180260424?...,https://watch.appfollow.io/apps/my-first-works...
7,2023-05-02T19:07:27,2023-05-04T07:57:24,874139669,Signal - निजी मैसेंजर,us,en,6.22,honestlynk,3,Used to be better when it supported SMS,...,2023-05-04T07:57:24,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180049434?...,https://watch.appfollow.io/apps/my-first-works...
8,2023-05-02T12:45:31,2023-05-04T11:13:54,874139669,Signal - निजी मैसेंजर,hk,en,6.22,andriod to ios,1,andriod to ios,...,2023-05-04T11:13:54,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/180088389?...,https://watch.appfollow.io/apps/my-first-works...
9,2023-05-02T12:44:04,2023-02-24T06:21:42,874139669,Signal - निजी मैसेंजर,gb,en,6.22,JulTimes2011,1,Loved Signal - till data loss messages IGNORED,...,2023-05-04T07:57:25,,,,,0,0,https://appstoreconnect.apple.com/WebObjects/i...,https://appfollow.io/app/194/review/162119712?...,https://watch.appfollow.io/apps/my-first-works...


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Submission date      686 non-null    object 
 1   Publication date     686 non-null    object 
 2   AppID                686 non-null    int64  
 3   AppName              686 non-null    object 
 4   Country              686 non-null    object 
 5   Review Language      686 non-null    object 
 6   Version              686 non-null    object 
 7   Author               686 non-null    object 
 8   Rating               686 non-null    int64  
 9   Title                686 non-null    object 
 10  Review               686 non-null    object 
 11  Translated title     0 non-null      float64
 12  Translated review    0 non-null      float64
 13  Reply Date           2 non-null      object 
 14  Developer Reply      2 non-null      object 
 15  User                 0 non-null      flo

In [20]:
data = data[['Review']]
data.head(10)

Unnamed: 0,Review
0,Nice App very appreciated
1,Good
2,This app is like iMessages but a little better...
3,I buy new phone and signal not working when I ...
4,Unverified people can’t been verified till I m...
5,There’s a lot of scam activity going on with t...
6,STOPIT! Please STOP IT!
7,I wish I didn’t have to have two texting apps ...
8,it been long time ，it still cannot tranfer fro...
9,I’ve been happy to make £$ contributions to Si...


#### Import Huggingface transformers

In [21]:
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-

In [22]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "Nice App very appreciated",
    candidate_labels=["abusive", "Bug", "Design"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'Nice App very appreciated',
 'labels': ['Design', 'Bug', 'abusive'],
 'scores': [0.8378982543945312, 0.11538271605968475, 0.046719010919332504]}

In [23]:
results = []

# Classify the first 10 reviews as positive or negative.
for review_text in data["Review"].head(10):
  classification = classifier(
      review_text,
      candidate_labels=["positive", "negative"],
  )
  results.append(classification)
print(results)



[{'sequence': 'Nice App very appreciated', 'labels': ['positive', 'negative'], 'scores': [0.9982302784919739, 0.001769729657098651]}, {'sequence': 'Good', 'labels': ['positive', 'negative'], 'scores': [0.9979331493377686, 0.0020668618381023407]}, {'sequence': 'This app is like iMessages but a little better because you can “react” with literally any emoji! I love it!', 'labels': ['positive', 'negative'], 'scores': [0.9321578741073608, 0.06784212589263916]}, {'sequence': 'I buy new phone and signal not working when I write my number it shows me that it is wrong and can you correct it?!', 'labels': ['negative', 'positive'], 'scores': [0.9505767226219177, 0.0494232252240181]}, {'sequence': 'Unverified people can’t been verified till I message them!? Yet they can send d picks. Might be time to switch platforms.\n\nThe fact that they’ve given us the option to “sync” but in reality it’s implemented in a way that is absolutely and totally useless is frustrating. With the security we go through

In [24]:
import pandas as pd


df = pd.DataFrame(results, columns=['sequence', 'labels', 'scores'])
print(df)

                                            sequence                labels  \
0                          Nice App very appreciated  [positive, negative]   
1                                               Good  [positive, negative]   
2  This app is like iMessages but a little better...  [positive, negative]   
3  I buy new phone and signal not working when I ...  [negative, positive]   
4  Unverified people can’t been verified till I m...  [negative, positive]   
5  There’s a lot of scam activity going on with t...  [negative, positive]   
6                            STOPIT! Please STOP IT!  [negative, positive]   
7  I wish I didn’t have to have two texting apps ...  [negative, positive]   
8  it been long time ，it still cannot tranfer fro...  [negative, positive]   
9  I’ve been happy to make £$ contributions to Si...  [negative, positive]   

                                        scores  
0   [0.9982302784919739, 0.001769729657098651]  
1  [0.9979331493377686, 0.00206686183810234
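Because each result carries lists of labels and scores, it can be handy to reduce it to one best label per review before building a table. The sketch below uses the first two results copied from the output above; `to_flat_row` and the rounding to four decimals are illustrative choices.

```python
# First two entries of `results`, copied from the printed output.
results_sample = [
    {"sequence": "Nice App very appreciated",
     "labels": ["positive", "negative"],
     "scores": [0.9982302784919739, 0.001769729657098651]},
    {"sequence": "Good",
     "labels": ["positive", "negative"],
     "scores": [0.9979331493377686, 0.0020668618381023407]},
]


def to_flat_row(result):
    """Reduce one classification result to its best label and score."""
    best_score, best_label = max(zip(result["scores"], result["labels"]))
    return {
        "sequence": result["sequence"],
        "top_label": best_label,
        "top_score": round(best_score, 4),
    }


flat_rows = [to_flat_row(result) for result in results_sample]
for row in flat_rows:
    print(row)
```

A list of such flat dictionaries can be passed to `pd.DataFrame` directly and gives one row per review.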

#### Testing whether typical app review topics can be applied

In [25]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "There’s a lot of scam activity going on with this app, be careful!",
    candidate_labels=["Error", "Design", "Security", "Privacy","pricing"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'There’s a lot of scam activity going on with this app, be careful!',
 'labels': ['Security', 'Error', 'Privacy', 'Design', 'pricing'],
 'scores': [0.3259131610393524,
  0.2853868007659912,
  0.1647179126739502,
  0.14147698879241943,
  0.08250526338815689]}
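In the result above the best label only reaches a score of about 0.33, so treating it as a firm classification may be misleading. One possible way to handle this is a minimum-score cut-off; the helper `confident_label`, the 0.5 default, and the "uncertain" fallback are assumptions for this sketch, not pipeline features.

```python
# Result dictionary copied from the cell output.
result = {
    "sequence": "There’s a lot of scam activity going on with this app, be careful!",
    "labels": ["Security", "Error", "Privacy", "Design", "pricing"],
    "scores": [0.3259131610393524, 0.2853868007659912, 0.1647179126739502,
               0.14147698879241943, 0.08250526338815689],
}


def confident_label(result, min_score=0.5, fallback="uncertain"):
    """Return the best label only if its score reaches min_score."""
    best_score, best_label = max(zip(result["scores"], result["labels"]))
    return best_label if best_score >= min_score else fallback


strict = confident_label(result)
lenient = confident_label(result, min_score=0.3)
print(strict)   # uncertain
print(lenient)  # Security
```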

In [26]:
review_string = df.sequence.tolist()
print(review_string)

['Nice App very appreciated', 'Good', 'This app is like iMessages but a little better because you can “react” with literally any emoji! I love it!', 'I buy new phone and signal not working when I write my number it shows me that it is wrong and can you correct it?!', 'Unverified people can’t been verified till I message them!? Yet they can send d picks. Might be time to switch platforms.\n\nThe fact that they’ve given us the option to “sync” but in reality it’s implemented in a way that is absolutely and totally useless is frustrating. With the security we go through why do you not allow conversations to sync across devices?!? \nThe result is that every few weeks when the app fails I have to start conversation all over again. USELESS', 'There’s a lot of scam activity going on with this app, be careful!', 'STOPIT! Please STOP IT!', 'I wish I didn’t have to have two texting apps now. I may uninstall soon.', 'it been long time ，it still cannot tranfer from andriod to ios', 'I’ve been happ

In [27]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")


def string_classifier(string):
  # Score one review against the fixed set of candidate topics.
  result = classifier(
      string,
      candidate_labels=["Error", "Design", "Security", "Privacy", "pricing"],
  )
  return result


# Build one small DataFrame per review (one row per label) and concatenate
# them once at the end; DataFrame.append is deprecated in newer pandas.
frames = []
for review in review_string:
  output = string_classifier(review)
  frames.append(pd.DataFrame(output, columns=['sequence', 'labels', 'scores']))

df = pd.concat(frames, ignore_index=True)

print(df)
  

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


                                             sequence    labels    scores
0                           Nice App very appreciated    Design  0.449475
1                           Nice App very appreciated   Privacy  0.218620
2                           Nice App very appreciated  Security  0.191128
3                           Nice App very appreciated   pricing  0.088114
4                           Nice App very appreciated     Error  0.052663
5                                                Good    Design  0.316005
6                                                Good  Security  0.276757
7                                                Good   Privacy  0.231264
8                                                Good   pricing  0.129260
9                                                Good     Error  0.046714
10  This app is like iMessages but a little better...    Design  0.395863
11  This app is like iMessages but a little better...   Privacy  0.194850
12  This app is like iMessages but a l



#### Create Pivot Table and export it as csv

In [28]:
df_pivot = df.pivot(index='sequence', columns='labels', values='scores').reset_index()
df_pivot = df_pivot.rename(columns={'sequence': 'Review'})
print(df_pivot)
df_pivot.to_csv("/content/drive/MyDrive/Colab Notebooks/ReviewScores.csv", mode='w+', index=False)

labels                                             Review    Design     Error  \
0                                                    Good  0.316005  0.046714   
1       I buy new phone and signal not working when I ...  0.095240  0.596546   
2       I wish I didn’t have to have two texting apps ...  0.208514  0.384962   
3       I’ve been happy to make £$ contributions to Si...  0.056806  0.225613   
4                               Nice App very appreciated  0.449475  0.052663   
5                                 STOPIT! Please STOP IT!  0.100703  0.450328   
6       There’s a lot of scam activity going on with t...  0.141477  0.285387   
7       This app is like iMessages but a little better...  0.395863  0.192010   
8       Unverified people can’t been verified till I m...  0.081149  0.398054   
9       it been long time ，it still cannot tranfer fro...  0.134810  0.550411   

labels   Privacy  Security   pricing  
0       0.231264  0.276757  0.129260  
1       0.108057  0.171191  0.
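One caveat when scaling this up beyond ten reviews: `DataFrame.pivot` raises a `ValueError` if the same review text occurs more than once, because the index/column pairs are then no longer unique. A possible workaround, sketched here on hypothetical scores, is `pivot_table` with an aggregation function; averaging duplicates is an assumption and dropping them beforehand would work as well.

```python
import pandas as pd

# Hypothetical long-format scores; the review "Good" appears twice.
long_df = pd.DataFrame({
    "sequence": ["Good", "Good", "Good", "Good", "Terrible", "Terrible"],
    "labels": ["Design", "Error", "Design", "Error", "Design", "Error"],
    "scores": [0.30, 0.10, 0.50, 0.30, 0.20, 0.60],
})

# pivot() cannot reshape when (sequence, labels) pairs are duplicated.
try:
    long_df.pivot(index="sequence", columns="labels", values="scores")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates instead, here by taking the mean.
wide_df = (
    long_df.pivot_table(index="sequence", columns="labels",
                        values="scores", aggfunc="mean")
    .reset_index()
    .rename(columns={"sequence": "Review"})
)

print(pivot_failed)
print(wide_df)
```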