<h3 style=color:#362419> Large Language Models (LLM) - Zero/Few Shot Classifier </h3>

<p style=color:#38629C>This is the third notebook in the Final Project and focuses on Large Language Models (LLMs). Their performance will be compared with the LSTM <i>(so far the best-performing model among the traditional and deep learning models)</i>.</p>


<h3 style=color:#362419> Approach </h3>

<p style=color:#38629C>Instead of changing the weights of the pre-trained model (fine-tuning), the <u>ZeroShotClassifier</u> and <u>FewShotClassifier</u> techniques are used to check how efficiently and effectively a pre-trained LLM can classify news as `real` or `fake`.</p>


<p style=color:#38629C>The <b style=color:#467200 >Skorch library</b> classes <u>ZeroShotClassifier</u> and <u>FewShotClassifier</u> are used for classification without any significant training. This works because most of these LLMs are pre-trained on extensive data. We rely on the open-source libraries available from Hugging Face.</p>



<h3 style=color:#362419> System Resources</h3>

<p style=color:#38629C>The most challenging part was securing enough resources, even for running ZeroShotClassifier and FewShotClassifier with a comparatively small model (about 1 billion parameters). Therefore, I selected the Google Colab environment to run the classifiers on <u>A100</u> and <u>V100</u> GPUs, and the original data is trimmed to a smaller number of records to get results quickly and to reduce the compute cost. This is discussed further in the sections below.</p>
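<p style=color:#38629C>To give a feel for the resource demand, the memory needed just to hold the weights can be approximated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: the function name is hypothetical, the parameter count is rounded to 1 billion, and activations, caches and framework overhead are ignored.</p>

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# "About 1 billion" parameters, as stated in the text (rounded, not exact).
approx_parameters = 1_000_000_000

# Common storage sizes: 4 bytes for float32, 2 bytes for float16.
for precision, size in [("float32", 4), ("float16", 2)]:
    estimate = weight_memory_gb(approx_parameters, size)
    print(f"{precision}: about {estimate:.1f} GB for the weights alone")
```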

<h3 style=color:#362419> Model Modification and Limited Training</h3>

<p style=color:#38629C><b style=color:#467200 >Step 1 - Importing Libraries :</b> Installing and importing the libraries required for the LLM classifiers, i.e. the skorch <u>ZeroShotClassifier</u> and <u>FewShotClassifier</u>.</p>

In [None]:
import subprocess

# Installation on Google Colab
try:
    import google.colab
    subprocess.run(['python', '-m', 'pip', 'install', 'skorch', 'transformers', 'datasets'])
except ImportError:
    pass

In [None]:
import numpy as np
import pandas as pd
import transformers
import torch
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import GridSearchCV, train_test_split

<p style=color:#38629C><b style=color:#467200 >Step 2 - Loading Dataset :</b> The dataset cleaned and exported in Notebook 1 is loaded into this Colab notebook (Notebook 3). Colab is used because it provides ample online resources (GPU/TPU and RAM) for loading the model, model modification and limited training.</p>

In [None]:
cleaned_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/cleaned_data.csv')

In [None]:
cleaned_df

Unnamed: 0,label,title_text
0,true,national federation independent business
1,true,comment fayetteville nc
2,true,romney make pitch hoping close deal election r...
3,true,democratic leader say house democrat united go...
4,true,budget united state government fy
...,...,...
60732,fake,white house theatrics gun control 21st century...
60733,fake,activist terrorist medium control dictate narr...
60734,fake,boiler room surrender retreat head roll ep tun...
60735,fake,federal showdown loom oregon blm abuse local r...


<p style=color:#38629C><b style=color:#467200 >Step 3 - Splitting Dataset :</b> The dataset is split into X_train, X_test, y_train and y_test using the scikit-learn library, with 80% of the records for training and 20% for testing.</p>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cleaned_df['title_text'], cleaned_df['label'], test_size=0.2, shuffle=True, random_state=42)


<p style=color:#38629C>The training dataset is reduced to a smaller size because of the following two problems:</p>
<ul style=color:#38629C>
<li>Running even a few records (such as 100) out of the complete dataset takes a significant amount of resources. Although no fine-tuning of the pre-trained model is involved, the training data still has to be reduced significantly in order to test the LLM.</li>
<li>Secondly, we are using the `ZeroShotClassifier` and `FewShotClassifier` techniques, which are designed to work with no or only a few labelled records rather than a comprehensive dataset. This is possible because the pre-trained model has about 1 billion parameters and has already been trained on a large amount of text, so it should be capable of handling many NLP tasks, including classification.</li>
</ul>
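<p style=color:#38629C>When the training data is cut down, it can be worth checking that the subset still contains both labels in reasonable proportion. The sketch below shows one way to do this. The helper `label_shares` and the list `sample_labels` are hypothetical and stand in for the reduced `y_train`; they are not the project data.</p>

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of records that carry each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels standing in for a reduced y_train.
sample_labels = ["true", "true", "fake", "true", "fake", "fake", "true", "fake"]

shares = label_shares(sample_labels)
print(shares)
```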



In [None]:
X_train

6915                khloé kardashian told tristan pregnant
4547                   u special air ahead season premiere
57151    racist want closed border protect citizen terr...
28800    factbox trump meet huckabee romney others week...
16093           golden globe look made statement — cost le
                               ...                        
54343    president trump travel orlando private school ...
38158    abadi defends role iranian-backed paramiltarie...
860                         way get boho wave without heat
15795    prince william kate middleton arrive poland ro...
56422    cafe owner reacts awesome way town told remove...
Name: title_text, Length: 48589, dtype: object

In [None]:
# X_train and y_train are reduced to their first 8000 records.
X_train = X_train[:8000]
y_train = y_train[:8000]

In [None]:
X_train

6915                khloé kardashian told tristan pregnant
4547                   u special air ahead season premiere
57151    racist want closed border protect citizen terr...
28800    factbox trump meet huckabee romney others week...
16093           golden globe look made statement — cost le
                               ...                        
33017    uae information tunisian woman may commit 'ter...
58805    loony california secession group proclaims ope...
55706    truth alicia machado blow up…backfires big-tim...
57610    obama putting terrorist boot ground every amer...
41385    puerto rico open arm refugee irma caribbean ch...
Name: title_text, Length: 8000, dtype: object

In [None]:
y_train

6915     true
4547     true
57151    fake
28800    true
16093    true
         ... 
33017    true
58805    fake
55706    fake
57610    fake
41385    true
Name: label, Length: 8000, dtype: object

In [None]:
# Reset the GPU to release any memory still allocated before loading the model
from numba import cuda

device = cuda.get_current_device()
device.reset()

In [None]:
# Release the GPU memory cached by PyTorch before running the model
torch.cuda.empty_cache()

In [None]:
# Select the device based on the available resources: GPU if present, otherwise CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

As per the documentation, FewShotClassifier accepts an `array-like of shape (n_samples,)`. Therefore, X_train is converted to a NumPy array.

In [None]:
X_train = np.array(X_train)

In [None]:
# Convert y_train to a NumPy array of fixed-width strings so that the model accepts it
y_train = np.array(y_train)
y_train = y_train.astype('<U8')

<p style=color:#38629C><b style=color:#467200 >Step 3(a) - Zero Shot Classifier :</b> The first step uses the zero-shot classifier: no example records are provided as context, and only the class labels (`true` and `fake`) are passed as the y parameter.</p>

<p style=color:#38629C>Multiple LLMs available from Hugging Face were tested, such as open_llama_3b, Flan-T5 small, bloomz-1b1, bloomz-1b7 and bloomz-7b1. Because of resource limitations, bloomz-1b1, which has about 1 billion parameters, was finally used. Although this model is small by prevailing standards, it was the only one capable of running within the limited computational resources available on my personal system or in the Colab environment.</p>


In [None]:
from skorch.llm import ZeroShotClassifier

# If use_caching is True, the predictions for each sample are cached, as well as the
# intermediate results for each generated token.
clf = ZeroShotClassifier('bigscience/bloomz-1b1', device=device, use_caching=False)


In [None]:
%time clf.fit(X=None, y=['true', 'fake'])

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

CPU times: user 17.1 s, sys: 5.71 s, total: 22.8 s
Wall time: 27.6 s


In [None]:
%time y_proba = clf.predict_proba(X_train)

CPU times: user 23min 46s, sys: 1.95 s, total: 23min 48s
Wall time: 14min 7s


In [None]:
log_loss(y_train, y_proba)

0.6488780449814768

In [None]:
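# Pick the column with the highest probability for each sample and map it back
# to its label; the columns follow the sorted label order ('fake', 'true').
y_pred = y_proba.argmax(1)
y_pred = np.array(['fake', 'true'])[y_pred]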

In [None]:
accuracy_score(y_train, y_pred)

0.63225

<p style=color:#38629C><b style=color:#467200 >Step 3(b) - Fewshot Classifier :</b> The accuracy of the zero-shot classifier is not impressive. A possible reason is that the model was not trained further and was not originally trained for the news classification task. In contrast to zero-shot, few-shot classification is more adaptive and requires only a relatively small labelled dataset to learn a new task. This approach is commonly used with LLMs such as GPT-3. Research has also shown that few-shot classifiers can perform better than zero-shot classifiers.</p>
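<p style=color:#38629C>Conceptually, few-shot classification places a handful of labelled examples in the prompt ahead of the text to classify. The sketch below only illustrates that idea and is not Skorch's implementation. The function `build_few_shot_prompt`, the second example headline and the wording of the final "text to classify" section are assumptions made for illustration.</p>

```python
FENCE = "`" * 3  # delimiter placed around each piece of text in the prompt


def build_few_shot_prompt(labels, examples, text):
    """Assemble a plain-text few-shot classification prompt.

    labels   : allowed label strings
    examples : (text, label) pairs shown to the model as demonstrations
    text     : the new text to classify
    """
    lines = [
        "You are a text classification assistant.",
        "",
        "Choose the label among the following possibilities with the highest probability.",
        "Only return the label, nothing more:",
        "",
        str(sorted(labels)),
        "",
        "Here are a few examples:",
        "",
    ]
    # One demonstration block per labelled example.
    for example_text, example_label in examples:
        lines += [FENCE, example_text, FENCE, "", "Your response:", example_label, ""]
    # The text to classify comes last, with the response left open for the model.
    lines += ["The text to classify:", "", FENCE, text, FENCE, "", "Your response:"]
    return "\n".join(lines)


examples = [
    ("pancreatic cancer killed aretha franklin score celebs", "true"),
    ("made up headline used only for illustration", "fake"),  # hypothetical example
]
prompt = build_few_shot_prompt(["true", "fake"], examples, "budget united state government fy")
print(prompt)
```

<p style=color:#38629C>The more examples are included, the longer the prompt becomes, which is why the number of examples has to stay within the model's context window.</p>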

In [None]:
from skorch.llm import FewShotClassifier

# max_samples=5: five labelled examples are placed in the prompt. This number should be
# large enough for the LLM to generalize, but not so large that it exceeds the context window.
clf = FewShotClassifier('bigscience/bloomz-1b1', max_samples=5, device=device, use_caching=False)



In [None]:
%time clf.fit(X_train,y_train)

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

CPU times: user 19.8 s, sys: 6.93 s, total: 26.8 s
Wall time: 45.2 s


<p style=color:#38629C>To ensure everything works as expected, we inspect the prompt. This is possible using the `get_prompt` method of the Skorch classifier.</p>

In [None]:
print(clf.get_prompt("Your task is to identify the fake news"))

You are a text classification assistant.

Choose the label among the following possibilities with the highest probability.
Only return the label, nothing more:

['fake', 'true']

Here are a few examples:

```
pancreatic cancer killed aretha franklin score celebs
```

Your response:
true

```
irs give “ school satan club ” tax-exempt status days…tea party group still waiting school satan club allowed school district district next classification offered charitable religious educational organization operate nonprofit obama administration irs political appointee illegally targeted conservative group either making wait seven year tax-exempt status denying application altogetherjudicial watch uncovered scandal obtained pile government record showing irs illegally colluded another federal agency single group conservative-sounding term patriot tea party title applying tax-exempt statusin meantime leftist group like satan club got fast tracked principle goal establishing satan club public schoo

In [None]:
%time y_proba = clf.predict_proba(X_train)

CPU times: user 1h 10min 51s, sys: 3.31 s, total: 1h 10min 54s
Wall time: 1h 18s



<p style=color:#38629C>Evaluating FewShotClassifier</p>

In [None]:
log_loss(y_train, y_proba)

0.783626876266087

In [None]:
y_pred = y_proba.argmax(1)
y_pred = np.array(['fake', 'true'])[y_pred]

In [None]:
accuracy_score(y_train, y_pred)


0.42375
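<p style=color:#38629C>To put the two log loss values reported in this notebook into context, they can be compared with the log loss of a classifier that always outputs 0.5 for both classes, which is ln(2). This is a small arithmetic check, not part of the original pipeline.</p>

```python
import math

# Log loss of a classifier that always predicts probability 0.5 for both classes.
chance_log_loss = -math.log(0.5)  # equals ln(2)

# Log loss values reported in this notebook.
zero_shot_log_loss = 0.6488780449814768
few_shot_log_loss = 0.783626876266087

for name, value in [("zero-shot", zero_shot_log_loss), ("few-shot", few_shot_log_loss)]:
    verdict = "better" if value < chance_log_loss else "worse"
    print(f"{name}: {value:.4f} is {verdict} than the constant-0.5 baseline {chance_log_loss:.4f}")
```

<p style=color:#38629C>The zero-shot value lies slightly below this baseline, while the few-shot value lies above it.</p>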

<p style=color:#38629C><b style=color:#467200 >GridSearchCV :</b> GridSearchCV helps select the best hyperparameter value (the `max_samples` value currently used in the few-shot classifier is 5) in order to get the best result from the model. Three different values of `max_samples` are tested to find the one that works best for the target dataset and, ideally, improves on the 42.37% accuracy above.</p>

In [None]:
# Candidate values of max_samples to test
params = {'max_samples': [3, 5, 7]}


In [None]:
# Create the GridSearchCV object
search = GridSearchCV(clf, param_grid=params, cv=2, scoring=['accuracy', 'neg_log_loss'], refit=False)

In [None]:
# Fit the GridSearchCV object to training data
%time search.fit(X_train,y_train)

CPU times: user 7min 10s, sys: 21.4 s, total: 7min 32s
Wall time: 8min 28s


In [None]:
# Tabulate the performance of each sample size: accuracy, negative log loss and mean scoring time.
pd.DataFrame(search.cv_results_)[['mean_test_accuracy', 'mean_test_neg_log_loss', 'param_max_samples', 'mean_score_time']]

Unnamed: 0,mean_test_accuracy,mean_test_neg_log_loss,param_max_samples,mean_score_time
0,0.39,-0.918018,3,44.047527
1,0.4,-1.040957,5,56.316571
2,0.38,-1.215785,7,74.462161


<p style=color:#38629C><u style=color:#467200 >Conclusion:</u> By accuracy, a sample size of 5 is marginally the best of the three choices (0.40 versus 0.39 and 0.38). The differences are small, however, and every setting stays below 50% accuracy, so changing `max_samples` does not improve the performance of the model.</p>
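<p style=color:#38629C>The choice of the best sample size depends on which metric is used. The sketch below re-reads the three rows of the results table above as plain Python dictionaries (values copied from the table) and picks the best row for each metric.</p>

```python
# Rows copied from the cv_results_ table above.
results = [
    {"max_samples": 3, "accuracy": 0.39, "neg_log_loss": -0.918018},
    {"max_samples": 5, "accuracy": 0.40, "neg_log_loss": -1.040957},
    {"max_samples": 7, "accuracy": 0.38, "neg_log_loss": -1.215785},
]

# For both metrics a larger value is better (negative log loss closer to 0 is better).
best_by_accuracy = max(results, key=lambda row: row["accuracy"])["max_samples"]
best_by_log_loss = max(results, key=lambda row: row["neg_log_loss"])["max_samples"]

print("best max_samples by accuracy:", best_by_accuracy)
print("best max_samples by negative log loss:", best_by_log_loss)
```

<p style=color:#38629C>The two metrics disagree: accuracy favours 5 samples, while the negative log loss is best with 3 samples and gets worse as more samples are added.</p>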


<p style=color:#38629C><b style=color:#467200 >Diagnosis :</b> It seems there is an issue with the classifier, as it does not classify fake and real news accurately. It is difficult to pinpoint the underlying issue with an LLM because limited information is available about the original training data and training parameters. However, the Skorch library provides tools that help identify the root cause.</p>

<p style=color:#38629C><b style=color:#467200 >Step 4 - Running unnormalized probabilities :</b> By default, the Skorch library normalizes the predicted probabilities so that they sum to 1. This can hide what is actually happening inside the model. Therefore, we set `probas_sum_to_1` to `False` in order to see the exact probabilities the model assigns to each class.</p>

In [None]:
clf = ZeroShotClassifier('bigscience/bloomz-1b1', device = device , use_caching=False, probas_sum_to_1=False)

In [None]:
%time clf.fit(X=None, y=['true', 'fake'])

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

CPU times: user 23.9 s, sys: 7.71 s, total: 31.7 s
Wall time: 53.4 s


In [None]:
y_proba = clf.predict_proba(X_train[:10])

In [None]:
y_proba

array([[0.19912136, 0.25912085],
       [0.23412521, 0.21689624],
       [0.1152491 , 0.09902817],
       [0.2456733 , 0.13392307],
       [0.20926329, 0.27272159],
       [0.01619505, 0.05243009],
       [0.39914721, 0.06411993],
       [0.23241256, 0.34167874],
       [0.10923862, 0.23131309],
       [0.27720082, 0.15755253]])

In [None]:
y_proba.sum(1)

array([0.45824221, 0.45102145, 0.21427727, 0.37959637, 0.48198488,
       0.06862513, 0.46326714, 0.5740913 , 0.34055172, 0.43475334])
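<p style=color:#38629C>To see what normalization would do to such raw values, the sketch below rescales two of the raw rows shown above so that each sums to 1. The helper `normalize` is hypothetical and only mimics the idea of rescaling; it is not Skorch's code.</p>

```python
def normalize(row):
    """Rescale a row of non-negative scores so that it sums to 1."""
    total = sum(row)
    return [value / total for value in row]


# Two raw rows taken from the y_proba output above.
raw_rows = [
    [0.01619505, 0.05243009],
    [0.39914721, 0.06411993],
]

for raw in raw_rows:
    normalized = normalize(raw)
    rounded = [round(p, 3) for p in normalized]
    print(f"raw sum = {sum(raw):.3f} -> normalized = {rounded}")
```

<p style=color:#38629C>After rescaling, the first row looks like a fairly confident prediction for one class, even though both labels together received less than 7% of the probability mass.</p>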

<p style=color:#38629C><b style=color:#467200 >Conclusion :</b> The probabilities assigned by the selected model are clearly low: in each of the ten samples, the two labels together receive less than 60% of the probability mass. This might be because `bloomz-1b1` was not trained for classification tasks, especially fake news detection. `ZeroShotClassifier` and `FewShotClassifier` rely heavily on what the pre-trained model has learned, so it is possible that the model cannot make fine-grained distinctions between the categories.</p>


<p style=color:#38629C>We can cross-check the above outcome using the `error_low_prob` parameter of the Skorch classifier, together with `threshold_low_prob`.</p>

In [None]:
clf = ZeroShotClassifier('bigscience/bloomz-1b1', device = device, use_caching=False, error_low_prob='raise', threshold_low_prob=0.5)

In [None]:
%time clf.fit(X=None, y=['true', 'fake'])

Downloading (…)lve/main/config.json:   0%|          | 0.00/715 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/2.13G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/222 [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

CPU times: user 20.7 s, sys: 7.61 s, total: 28.3 s
Wall time: 3min 21s


In [None]:
try:
    clf.predict_proba(X_train[:10])
except Exception as exc:
    print("There was an error:", exc)

There was an error: The sum of all probabilities is 0.458, which is below the minimum threshold of 0.500
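<p style=color:#38629C>The logic behind such a threshold check can be sketched in a few lines. The function `check_probability_mass` is hypothetical and only illustrates the idea; it is not Skorch's implementation. The input is the first raw row reported earlier in this notebook.</p>

```python
def check_probability_mass(row, threshold=0.5):
    """Raise ValueError when the label probabilities sum to less than the threshold."""
    total = sum(row)
    if total < threshold:
        raise ValueError(
            f"The sum of all probabilities is {total:.3f}, "
            f"which is below the minimum threshold of {threshold:.3f}"
        )
    return total


first_row = [0.19912136, 0.25912085]  # first raw row of y_proba reported earlier

try:
    check_probability_mass(first_row)
    message = ""
except ValueError as exc:
    message = str(exc)
    print("There was an error:", message)
```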


<b style=color:#467200 >Next Step :</b> The results indicate that `bloomz-1b1` is not the right model for this classification task, especially for `real` versus `fake` news. Still, `ZeroShotClassifier` and `FewShotClassifier` have their own benefits, such as:

*   They can work on unseen data by relying on pre-trained models.
*   They reduce the need for re-training or fine-tuning to learn the relationship between different classes or categories.
*   They are easy to scale, as they can handle a wide range of inputs.

However, the results depend entirely on the model being used. Our analysis indicates that `bloomz-1b1` is not the right model, even though it has about 1 billion parameters. One possible solution is to keep trying other, larger models with `ZeroShotClassifier` and `FewShotClassifier` with the help of `GridSearchCV`. However, this would be a resource-hungry task and might still not give the desired results. Therefore, the next steps in my journey of identifying the best-working model are as follows:

1.   Select a pre-trained model from the Hugging Face library, such as `bert-base-uncased`, which is well suited to text classification tasks.
2.   Fine-tune the selected LLM for the fake news classification task. The literature review suggests that BERT could be the best choice for fake news classification.
3.   Test the fine-tuned model against the LSTM, which seems to be the best option so far based on our analysis. This will give us a clear understanding of which approach works better for our problem statement, i.e. fake news classification.

<i style=color:#38629C>Final-Project/Notebooks/Final Project - Part 4 (LLM Fine-Tuning).ipynb</i>
