# Handling Missing Values

In this Jupyter notebook, I tried to fill in the missing values in the book dataset. I considered three approaches:

1. **Regex Approach:** 
   - First, I re-extracted the metadata of each book and used regex to keep only the book's name, the author's name, and the publication date, so that I could repair these three columns.
   - I stored the resulting data in `refined_metadata.txt`. I wanted to keep extracting the information this way, but the data had too many corner cases, so I did not take this approach further. I only reused `refined_metadata.txt` in the approaches below.

2. **Named Entity Recognition (NER):**
   - The second idea was Named Entity Recognition (NER), a task in which a language model locates named entities in unstructured text and classifies them into predefined categories such as person names, organizations, and locations.
   - I found several NER models, such as [UniversalNER](https://universal-ner.github.io/), but I could not run this one because of limited GPU memory.
   - I also explored NER models on Hugging Face 🤗 and tried `bert-base-NER`, a popular model. However, its accuracy on this data was too low, so I abandoned it.
   - For those who are reading this repository, I found a good article on the Medium website about NER. I'm leaving it here if you want to deepen your knowledge. [Named Entity Recognition with LLMs — Extract Conversation Metadata](https://medium.com/@grisanti.isidoro/named-entity-recognition-with-llms-extract-conversation-metadata-94d5536178f2)

3. **together.ai API:**
   - My third idea was to use the together.ai API to query a Llama language model.
   - I used the same API in the EDA Jupyter notebook to convert the metadata of books into JSON objects.
   - Here, I wrote a prompt that asks the model to find the author's name in each text and return it as a string, but this approach also had low accuracy.


In [7]:
import os
import re  # needed by the regex cells below

import pandas as pd
from dotenv import load_dotenv
from together import Together
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline



In [None]:
# Open the booksummaries.txt file for reading
with open('booksummaries.txt', 'r') as file:
    # Read the entire contents of the file
    file_contents = file.read()

In [2]:
df = pd.read_excel('bookInfo.xlsx')

In [None]:
# Extract the metadata part of each record with regex
metadata_pattern = r"\d+\s*\/m\/\w+.*}?\t+"
metadata = [match.group() for match in re.finditer(metadata_pattern, file_contents)]

In [None]:
braces_pattern = r'{(.*?)}'
# Function to remove text between '{' and '}'
def remove_between_braces(text):
    return re.sub(braces_pattern, '', text)

In [None]:
# Remove text between '{' and '}' for each string in the list
metadata_without_genres = [remove_between_braces(string) for string in metadata]

In [None]:
first_redundant_chars = r'\d+\s+\/m\/[\w\d_]+'

# Function to remove the leading numeric ID and the /m/... identifier
def redundant_chars(text):
    return re.sub(first_redundant_chars, '', text)

In [None]:
# Remove redundant chars for each string in the list
refined_metadata = [redundant_chars(string) for string in metadata_without_genres]
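
To make the three regex steps above concrete, here is a minimal worked example on a single record. The sample line is made up to mirror the tab-separated layout that the patterns expect (numeric ID, `/m/...` identifier, title, author, date, genres in braces, summary). The IDs, genre, and summary text in it are illustrative assumptions, not copied from `booksummaries.txt`.

```python
import re

# Hypothetical record in the layout the patterns expect (values are illustrative).
sample = ('620\t/m/0hhy\tAnimal Farm\tGeorge Orwell\t1945-08-17\t'
          '{"/m/016lj8": "Satire"}\tOld Major calls a meeting.')

# Step 1: keep everything up to the last tab, which drops the summary.
metadata = re.search(r"\d+\s*\/m\/\w+.*}?\t+", sample).group()

# Step 2: remove the genre dictionary between '{' and '}'.
without_genres = re.sub(r'{(.*?)}', '', metadata)

# Step 3: remove the leading numeric ID and the /m/... identifier.
refined = re.sub(r'\d+\s+\/m\/[\w\d_]+', '', without_genres)

print(repr(refined))
# -> '\tAnimal Farm\tGeorge Orwell\t1945-08-17\t\t'
```

Because step 1 relies on the last tab in the line, a summary that itself contains a tab would extend the match into the summary text. That is one example of the kind of corner case this approach has to deal with.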

In [None]:
refined_metadata

['\tAnimal Farm\tGeorge Orwell\t1945-08-17\t\t',
 '\tA Clockwork Orange\tAnthony Burgess\t1962\t\t',
 '\tThe Plague\tAlbert Camus\t1947\t\t',
 '\tAn Enquiry Concerning Human Understanding\tDavid Hume\t\t\t',
 '\tA Fire Upon the Deep\tVernor Vinge\t\t\t',
 '\tAll Quiet on the Western Front\tErich Maria Remarque\t1929-01-29\t\t',
 '\tA Wizard of Earthsea\tUrsula K. Le Guin\t1968\t\t',
 '\tAnyone Can Whistle\tArthur Laurents\t\t\t',
 '\tBlade Runner 3: Replicant Night\tK. W. Jeter\t1996-10-01\t\t',
 '\tBlade Runner 2: The Edge of Human\tK. W. Jeter\t1995-10-01\t\t',
 '\tBook of Joshua\t\t\t\t',
 '\tBook of Ezra\t\t\t\t',
 '\tBook of Numbers\t\t\t\t',
 '\tBook of Ruth\t\t\t\t',
 '\tBook of Esther\t\t\t\t',
 '\tBook of Job\t\t\t\t',
 '\tBook of Hosea\t\t\t\t',
 '\tBook of Jonah\t\t\t\t',
 '\tBook of Micah\t\t\t\t',
 '\tBook of Haggai\t\t\t\t',
 '\tCrash\tJ. G. Ballard\t1973\t\t',
 '\tChildren of Dune\tFrank Herbert\t1976\t\t',
 "\tCandide, ou l'Optimisme\tVoltaire\t1759-01\t\t",
 '\tChapter

In [None]:
file_path = 'refined_metadata.txt'

with open(file_path, 'w') as file:
    for entity in refined_metadata:
        file.write(entity + '\n')
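
Because each record written to `refined_metadata.txt` keeps its tab separators, the three fields can in principle be recovered by splitting on tabs instead of writing more regex. The sketch below assumes every record has the layout shown in the output above (a leading tab, then title, author, and date). `parse_record` is a hypothetical helper and is not part of this notebook.

```python
def parse_record(record):
    """Split one refined metadata record into title, author, and date.

    Empty fields are returned as None so that missing values are easy to spot.
    """
    fields = record.rstrip('\n').split('\t')
    # fields[0] is the empty string in front of the leading tab.
    # Pad with empty strings so short records do not raise an error.
    title, author, date = (fields[1:4] + ['', '', ''])[:3]
    return {
        'bookName': title or None,
        'author': author or None,
        'publishDate': date or None,
    }

print(parse_record('\tAnimal Farm\tGeorge Orwell\t1945-08-17\t\t'))
print(parse_record('\tBook of Joshua\t\t\t\t'))
```

This only works while the tabs are still in place. Once a record is stripped and its tabs are replaced with spaces, the field boundaries are lost.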

### bert-base-NER

In [3]:
# Filter rows with NaN values in column 'author'
df_authorna = df[df['author'].isna()]

In [4]:
df_authorna

| Unnamed: 0 | bookName | author | publishDate | genres | summary |
|---|---|---|---|---|---|
| 10 | Book of Joshua | | | | (Chapter 1 is the first of three important mo... |
| 11 | Book of Ezra | | | | For the Bible text, see Bible Gateway (opens ... |
| 12 | Book of Numbers | | | | God orders Moses, in the wilderness of Sinai,... |
| 13 | Book of Ruth | | | | During the time of the Judges when there was ... |
| 14 | Book of Esther | | | | Ahasuerus, ruler of a massive Persian empire,... |
| ... | ... | ... | ... | ... | ... |
| 16498 | The Millionaire's Wife | | 2012-03-27 | ['Biography', 'True crime'] | Twenty years after George Kogan's murder, in ... |
| 16500 | Arrhythmia | | | ['Novel'] | The novel is set in a busy Montreal hospital ... |
| 16504 | The Life | | | ['Novel'] | The Life traces the life story of Dennis Keit... |
| 16540 | De vierde man | | | | The novel is a frame narrative: a writer name... |


In [5]:
authorna_indices = df_authorna.index.tolist()

In [12]:
refined_metadata = []
# Open the refined_metadata.txt file for reading
with open('refined_metadata.txt', 'r') as file:
    # Read the file line by line
    for line in file:
        # Remove leading/trailing whitespace (including tabs and the newline)
        cleaned_line = line.strip()
        refined_metadata.append(cleaned_line)

In [23]:
# Replacing the \t with a white space
refined_metadata = [string.replace("\t", " ") for string in refined_metadata]

In [67]:
authorna_list = []

In [8]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[]


In [27]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
# Run NER on a single sample record
example = refined_metadata[2]

ner_results = nlp(example)
print(ner_results)


[{'entity': 'I-MISC', 'score': 0.69562596, 'index': 2, 'word': 'P', 'start': 4, 'end': 5}, {'entity': 'I-MISC', 'score': 0.9242417, 'index': 4, 'word': '##ue', 'start': 8, 'end': 10}, {'entity': 'B-PER', 'score': 0.38932484, 'index': 5, 'word': 'Albert', 'start': 11, 'end': 17}, {'entity': 'I-PER', 'score': 0.9844342, 'index': 6, 'word': 'Cam', 'start': 18, 'end': 21}]


In [28]:
refined_metadata[2]

'The Plague Albert Camus 1947'
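
The token-level output above splits names into word pieces and stops at "Cam", so the author's surname is incomplete. One possible post-processing step, sketched here as an assumption rather than something this notebook does, is to merge adjacent person tokens using their character offsets and then widen each span to the end of the word. `person_spans` is a hypothetical helper. The token list is copied from the printed result with the scores omitted.

```python
# Token-level result copied from the output above (scores omitted).
text = "The Plague Albert Camus 1947"
ner_results = [
    {"entity": "I-MISC", "word": "P", "start": 4, "end": 5},
    {"entity": "I-MISC", "word": "##ue", "start": 8, "end": 10},
    {"entity": "B-PER", "word": "Albert", "start": 11, "end": 17},
    {"entity": "I-PER", "word": "Cam", "start": 18, "end": 21},
]

def person_spans(text, tokens, max_gap=1):
    """Merge adjacent PER tokens and widen each span to a whole word."""
    spans = []
    for tok in tokens:
        if not tok["entity"].endswith("PER"):
            continue  # ignore MISC, ORG, LOC tokens
        if spans and tok["start"] - spans[-1][1] <= max_gap:
            # Token starts right after the previous one: extend that span.
            spans[-1][1] = tok["end"]
        else:
            spans.append([tok["start"], tok["end"]])
    names = []
    for start, end in spans:
        # Extend to the next whitespace so cut-off words are completed.
        while end < len(text) and not text[end].isspace():
            end += 1
        names.append(text[start:end])
    return names

print(person_spans(text, ner_results))
# -> ['Albert Camus']
```

The `pipeline` function also accepts an `aggregation_strategy` argument that groups word pieces into entities. Whether it would improve the results on this dataset was not tested here.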

### together.ai

In [70]:
# Load TOGETHER_API_KEY from a local .env file (if present), then create the client
load_dotenv()
client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

prompt = (
    "Given a text containing information about a book, extract the author's name from the text. "
    "I don't want any explanation from you just give me the name or leave it empty "
    "and don't generate anything from yourself. "
)

# Query the model for a small batch of the records that have no author
for index in authorna_indices[20:30]:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3-8b-chat-hf",
        messages=[{"role": "user", "content": prompt + refined_metadata[index]}],
    )
    authorna_list.append(response.choices[0].message.content)
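
Since the prompt asks the model not to invent anything, one simple safeguard is to accept a reply only when it appears verbatim in the text that was sent. This is a sketch under that assumption. `accept_author` is a hypothetical helper and was not part of the original experiment.

```python
def accept_author(reply, source_text):
    """Return the cleaned reply if it occurs in source_text, otherwise None."""
    # Drop surrounding whitespace and quotes that chat models often add.
    candidate = reply.strip().strip('"\'')
    if not candidate:
        return None  # the model left the answer empty
    # Case-insensitive containment check against the original metadata text.
    if candidate.lower() in source_text.lower():
        return candidate
    return None

print(accept_author(' "Albert Camus" ', 'The Plague Albert Camus 1947'))
print(accept_author('Unknown', 'Book of Joshua'))
```

A check like this cannot tell an author from a title word, so it only filters out names that the model produced without support in the text.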