<font color="grey">
    
## Zero-Shot Classification

- Zero-shot classification is the task of predicting a class that the model did not see during training. The method leverages a pre-trained language model and can be thought of as an instance of transfer learning, which generally refers to using a model trained for one task in a different application from the one it was originally trained for. This is particularly useful when little labeled data is available.
- In zero-shot classification, we give the model a prompt and a sequence of text that describes, in natural language, what we want it to do. The prompt contains no examples of the desired task being completed. This differs from one-shot or few-shot classification, where the prompt includes one or a few such examples.

Learning links:
- https://huggingface.co/tasks/zero-shot-classification
- https://huggingface.co/models
- https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=trending

</font>
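As a rough illustration of how an NLI-based zero-shot classifier such as `facebook/bart-large-mnli` is typically applied: each candidate label is turned into a hypothesis sentence, and the model scores whether the text (the premise) entails it. The sketch below only builds those premise/hypothesis pairs and does not run a model. The template `"This example is {}."` is assumed here to match the default of the Hugging Face pipeline, and `build_nli_pairs` is a hypothetical helper, not a library function.

```python
def build_nli_pairs(text, candidate_labels, template="This example is {}."):
    """Pair the text (premise) with one hypothesis sentence per candidate label."""
    return [(text, template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "A preacher in Iowa records his family's story.",  # made-up example text
    ["Fiction", "Nonfiction"],
)
for premise, hypothesis in pairs:
    print(hypothesis)
```

The label whose hypothesis receives the highest entailment score is then taken as the prediction.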

In [3]:
import pandas as pd
books=pd.read_csv(r'F:\learning\LLM\data\books_cleaned.csv')

In [4]:
books.columns

Index(['isbn13', 'isbn10', 'title', 'authors', 'categories', 'thumbnail',
       'description', 'published_year', 'average_rating', 'num_pages',
       'ratings_count', 'title_and_subtitle', 'tagged_description'],
      dtype='object')

In [5]:
books['categories'].value_counts().reset_index()

Unnamed: 0,categories,count
0,Fiction,2111
1,Juvenile Fiction,390
2,Biography & Autobiography,311
3,History,207
4,Literary Criticism,124
...,...,...
474,Aged women,1
475,Imperialism,1
476,Human-animal relationships,1
477,Amish,1


In [6]:
books['categories'].value_counts().reset_index().query('count>50')

Unnamed: 0,categories,count
0,Fiction,2111
1,Juvenile Fiction,390
2,Biography & Autobiography,311
3,History,207
4,Literary Criticism,124
5,Religion,117
6,Philosophy,117
7,Comics & Graphic Novels,116
8,Drama,86
9,Juvenile Nonfiction,57


In [10]:
category_mapping = {'Fiction': "Fiction",
                    'Juvenile Fiction': "Fiction",
                    'Biography & Autobiography': "Nonfiction",
                    'History': "Nonfiction",
                    'Literary Criticism': "Nonfiction",
                    'Philosophy': "Nonfiction",
                    'Religion': "Nonfiction",
                    'Comics & Graphic Novels': "Fiction",
                    'Drama': "Fiction",
                    'Juvenile Nonfiction': "Nonfiction",
                    'Science': "Nonfiction",
                    'Poetry': "Fiction"}
books['simple_categories']=books["categories"].map(category_mapping)

print("The proportion of books with missing categories:")
round((books['simple_categories'].isna().sum())/len(books),2).item()

The proportion of books with missing categories:


0.28
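The missing share comes from how `Series.map` treats values that are not keys in the mapping dict: they become `NaN`. A minimal sketch on a toy frame (the rows and `toy_mapping` are made up for illustration, not taken from `books_cleaned.csv`):

```python
import pandas as pd

toy = pd.DataFrame({"categories": ["Fiction", "History", "Amish", None]})
toy_mapping = {"Fiction": "Fiction", "History": "Nonfiction"}

# Series.map returns NaN for any value that is not a key in the dict
toy["simple_categories"] = toy["categories"].map(toy_mapping)

# The mean of a boolean mask is the share of True values, i.e. the missing share
missing_share = toy["simple_categories"].isna().mean()
print(missing_share)
```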


In [11]:
from transformers import pipeline
fiction_categories = ["Fiction", "Nonfiction"]
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Usage: prediction = pipe(text_description, candidate_labels)

Device set to use cuda:0


In [12]:
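# Use .iloc[0] to take the first Fiction description by position, regardless of its index label
book1=books.loc[books['simple_categories']=='Fiction','description'].iloc[0]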
book1

'A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world ha

In [19]:
import numpy as np
prediction=pipe(book1, ["Fiction", "Nonfiction"])
print(f"Prediction keys: {prediction.keys()}")
print(f"Prediction labels: {prediction['labels']}")
print(f"Prediction scores: {prediction['scores']}")
print(f"The predicted category: {prediction['labels'][np.argmax(prediction['scores'])]}")

Prediction keys: dict_keys(['sequence', 'labels', 'scores'])
Prediction labels: ['Fiction', 'Nonfiction']
Prediction scores: [0.8438265323638916, 0.1561734527349472]
The predicted category: Fiction


In [20]:
def generate_predictions(sequence, categories):
    prediction=pipe(sequence, categories)
    max_index=np.argmax(prediction['scores'])
    max_label=prediction['labels'][max_index]
    return max_label # Predicted category
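The result dict stores `labels` and `scores` as parallel lists, so the predicted category is the label at the position of the highest score. The sketch below shows the same selection in plain Python on a hand-written dict that imitates the output shown above. `top_label` and `mock_prediction` are illustrative names, and no model is called.

```python
def top_label(prediction):
    """Return the label with the highest score from a pipeline-style result dict."""
    scores = prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return prediction["labels"][best]


mock_prediction = {
    "sequence": "...",  # placeholder for the input text
    "labels": ["Fiction", "Nonfiction"],
    "scores": [0.8438265323638916, 0.1561734527349472],
}
print(top_label(mock_prediction))
```

Selecting by maximum score rather than taking the first label keeps the function correct even if the labels are not sorted by score.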

<font color="grey">
    
In our dataset, there are two types of books: those with existing category information and those without it. About 28% of the books lack category information.

We first evaluate the model on books that already have category labels to assess its accuracy. If the performance is satisfactory, we can then use the model to infer categories for the books that do not have them.
</font>
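The evaluation described above boils down to comparing two label lists. A minimal sketch that reports per-class recall alongside overall accuracy, which shows whether errors are concentrated in one class. The `evaluate` helper and the toy labels are assumptions for illustration, not results from this notebook.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return overall accuracy and per-class recall for two equal-length label lists."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Count correct predictions per true class
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    totals = Counter(actual)
    accuracy = sum(correct.values()) / len(actual)
    recall = {label: correct[label] / totals[label] for label in totals}
    return accuracy, recall


# Toy labels for illustration only
actual = ["Fiction", "Fiction", "Nonfiction", "Nonfiction"]
predicted = ["Fiction", "Nonfiction", "Nonfiction", "Nonfiction"]
accuracy, recall = evaluate(actual, predicted)
print(accuracy, recall)
```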

In [21]:
from tqdm import tqdm
# Descriptions of books with a known label, re-indexed from 0 so the first 300 can be taken by position
book_fiction=books.loc[books['simple_categories']=='Fiction','description'].reset_index(drop=True)
book_nonfiction=books.loc[books['simple_categories']=='Nonfiction','description'].reset_index(drop=True)
k=0  # number of correct predictions
predicted_cats=[]
for i in tqdm(range(300)):
    sequence=book_fiction[i]
    pre=generate_predictions(sequence, ["Fiction", "Nonfiction"])
    predicted_cats += [pre]
    if pre=="Fiction":
        k=k+1
for i in tqdm(range(300)):
    sequence=book_nonfiction[i]
    pre=generate_predictions(sequence, ["Fiction", "Nonfiction"])
    predicted_cats += [pre]
    if pre=="Nonfiction":
        k=k+1
actual_cats=["Fiction"]*300+["Nonfiction"]*300
print(f'Text classification accuracy: {k/len(predicted_cats)}')

  1%|▏         | 4/300 [00:00<00:59,  4.97it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 300/300 [00:31<00:00,  9.42it/s]
100%|██████████| 300/300 [00:31<00:00,  9.60it/s]

Text classification accuracy: 0.75





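- An accuracy of 75% on this balanced sample of 600 books (300 per class) is well above the 50% expected from random guessing, which is reasonably good for a model that was never trained on this dataset.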
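The progress-bar output above includes a warning about calling the pipeline one text at a time on GPU. One possible mitigation is to pass several texts per call. The sketch below assumes the classifier accepts a list of texts and returns one result dict per text. `predict_in_chunks` and `fake_classifier` are hypothetical names, and `fake_classifier` is a stand-in so the example runs without downloading a model.

```python
def predict_in_chunks(classifier, texts, candidate_labels, chunk_size=16):
    """Classify texts chunk by chunk and return the top label for each text."""
    top_labels = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results = classifier(chunk, candidate_labels)
        if isinstance(results, dict):  # a single result may come back unwrapped
            results = [results]
        for result in results:
            scores = result["scores"]
            best = max(range(len(scores)), key=scores.__getitem__)
            top_labels.append(result["labels"][best])
    return top_labels


def fake_classifier(texts, candidate_labels):
    """Stand-in for `pipe`: scores a text as Fiction if it mentions 'novel'."""
    out = []
    for text in texts:
        scores = [0.9, 0.1] if "novel" in text.lower() else [0.1, 0.9]
        out.append({"sequence": text, "labels": list(candidate_labels), "scores": scores})
    return out


demo_texts = ["A novel about love", "A history of Rome", "A short novel"]
demo_labels = predict_in_chunks(fake_classifier, demo_texts, ["Fiction", "Nonfiction"], chunk_size=2)
print(demo_labels)
```

With the real pipeline, `pipe` would be passed in place of `fake_classifier`. Whether this actually speeds things up depends on the hardware and text lengths.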

In [22]:
predicted_cats=[]
# Books whose category could not be mapped to Fiction/Nonfiction
books_missing_cats=books.loc[books['simple_categories'].isna(),['isbn13','description']].reset_index(drop=True)
for i in tqdm(range(len(books_missing_cats))):
    sequence=books_missing_cats['description'][i]
    predicted_cats += [generate_predictions(sequence, ["Fiction", "Nonfiction"])]

# Attach the predictions by isbn13, then fill only the rows that had no category
missing_predictions_df=pd.DataFrame({"isbn13":books_missing_cats['isbn13'].values,"predicted_cats":predicted_cats})
books=pd.merge(books,missing_predictions_df,on='isbn13',how='left')
books['simple_categories']=np.where(books['simple_categories'].isna(),books['predicted_cats'],books['simple_categories'])
books=books.drop(columns=['predicted_cats'])
books.to_csv(r"F:\learning\LLM\step3_books_with_categories.csv", index=False)

100%|██████████| 1454/1454 [02:26<00:00,  9.92it/s]
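The merge-and-fill step can be checked on a toy frame. This sketch swaps `np.where` for `Series.fillna`, which gives the same result here because both keep the existing label and fall back to the prediction only where the label is missing. The toy data and column values are made up for illustration.

```python
import pandas as pd

toy_books = pd.DataFrame({
    "isbn13": [1, 2, 3],
    "simple_categories": ["Fiction", None, None],
})
toy_predictions = pd.DataFrame({
    "isbn13": [2, 3],
    "predicted_cats": ["Nonfiction", "Fiction"],
})

# Left merge keeps every book; books without a prediction get NaN in predicted_cats
merged = toy_books.merge(toy_predictions, on="isbn13", how="left")

# Keep the existing label where present, otherwise use the prediction
merged["simple_categories"] = merged["simple_categories"].fillna(merged["predicted_cats"])
merged = merged.drop(columns=["predicted_cats"])
print(merged)
```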


<font color="grey">
    
### The whole code
</font>

In [None]:
import pandas as pd
import numpy as np
from transformers import pipeline
books=pd.read_csv(r'F:\learning\LLM\data\books_cleaned.csv')
category_mapping = {'Fiction': "Fiction",
                    'Juvenile Fiction': "Fiction",
                    'Biography & Autobiography': "Nonfiction",
                    'History': "Nonfiction",
                    'Literary Criticism': "Nonfiction",
                    'Philosophy': "Nonfiction",
                    'Religion': "Nonfiction",
                    'Comics & Graphic Novels': "Fiction",
                    'Drama': "Fiction",
                    'Juvenile Nonfiction': "Nonfiction",
                    'Science': "Nonfiction",
                    'Poetry': "Fiction"}
books['simple_categories']=books["categories"].map(category_mapping)

pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Usage: prediction = pipe(text_description, candidate_labels)

def generate_predictions(sequence, categories):
    prediction=pipe(sequence, categories)
    max_index=np.argmax(prediction['scores'])
    max_label=prediction['labels'][max_index]
    return max_label # Predicted category

# We first evaluate the model using books that already have category labels to assess its accuracy
from tqdm import tqdm
book_fiction=books.loc[books['simple_categories']=='Fiction','description'].reset_index(drop=True)
book_nonfiction=books.loc[books['simple_categories']=='Nonfiction','description'].reset_index(drop=True)
k=0  # number of correct predictions
predicted_cats=[]
for i in tqdm(range(300)):
    sequence=book_fiction[i]
    pre=generate_predictions(sequence, ["Fiction", "Nonfiction"])
    predicted_cats += [pre]
    if pre=="Fiction":
        k=k+1
for i in tqdm(range(300)):
    sequence=book_nonfiction[i]
    pre=generate_predictions(sequence, ["Fiction", "Nonfiction"])
    predicted_cats += [pre]
    if pre=="Nonfiction":
        k=k+1
actual_cats=["Fiction"]*300+["Nonfiction"]*300
print(f'Text classification accuracy: {k/len(predicted_cats)}')
   
# Use the model to infer category information for books that do not have it
predicted_cats=[]
books_missing_cats=books.loc[books['simple_categories'].isna(),['isbn13','description']].reset_index(drop=True)
for i in tqdm(range(len(books_missing_cats))):
    sequence=books_missing_cats['description'][i]
    predicted_cats += [generate_predictions(sequence, ["Fiction", "Nonfiction"])]

missing_predictions_df=pd.DataFrame({"isbn13":books_missing_cats['isbn13'].values,"predicted_cats":predicted_cats})
books_o1=pd.merge(books,missing_predictions_df,on='isbn13',how='left')
books_o1['simple_categories']=np.where(books_o1['simple_categories'].isna(),books_o1['predicted_cats'],books_o1['simple_categories'])
books_o1=books_o1.drop(columns=['predicted_cats'])
books_o1.to_csv(r"F:\learning\LLM\step3_books_with_categories.csv", index=False)