### LMs for keyphrase extraction (using the `KeyBERT` library)
* ‚ùå - Bad performance | ‚úÖ - Good performance (* - best performance) | üö´ - Not usable
* `vasugoel/K-12BERT` - [Indian corpus](https://medium.com/@vasu18322/k-12bert-bert-for-k-12-education-96a8a6ee9265)
* For `openaccess-ai-collective/jackalope-7b`:
    `kw_model.model.embedding_model.tokenizer.pad_token = kw_model.model.embedding_model.tokenizer.eos_token`

| Model                                                | Status |
|--------------------------------------------------    |:---:|
| `google/flan-t5-large`                               | ‚ùå |
| `dbmdz/bert-large-cased-finetuned-conll03-english`   | ‚ùå |
| `yanekyuk/bert-uncased-keyword-extractor`            | ‚ùå |
| `allenai/scibert_scivocab_uncased`                   | ‚úÖ |
| `vasugoel/K-12BERT`                                  | ‚úÖ* | 
| `ogimgio/K-12BERT-reward-neurallinguisticpioneers-3` | ‚úÖ* |
| `bbunzeck/gpt-wee-curriculum`                        | ‚úÖ |
| `openaccess-ai-collective/jackalope-7b`              | ‚úÖ |
| `Nonegom/roberta_curriculum_learn`                   | ‚úÖ |
| `egumasa/roberta-base-academic`                      | ‚úÖ |
| `spacy/en_core_web_lg`                               | üö´ |
| `51la5/roberta-large-NER`                            | üö´ |
| `EhimeNLP/AcademicBART`                              | üö´ |
| `brennan-richards/gpt2-finetuned-academic-topics`    | üö´ |
| `crumb/44m-textbook`                                 | üö´ |
| `openaccess-ai-collective/mistral-100m-textbooks`    | üö´ |
| `Taekyoon/textbook_scramble`                         | üö´ |
| `jupiterben/gpt-academic`                            | üö´ |
| `Dongchao/AcademiCodec`                              | üö´ |
| `ricardo-filho/BERT-pt-institutional-corpus-v.1`     | ‚ùå | 

### LMs for grammar checking & prepping for keyphrase extraction
| Model                                                | Status |
|--------------------------------------------------    |:---:|
| `grammarly/coedit-large`                             | ‚ùå |
| `grammarly/coedit-xl-composite`                      | ‚úÖ |
| `vennify/t5-base-grammar-correction`                 | ‚ùå |
| `pszemraj/flan-t5-large-grammar-synthesis`           | ‚ùå |


### Prep text
- [x] Grammar correction
    - [x] Remove numbers
    - [x] Remove / augment non-words
- [x] Other NLP prep:
    - [x] Remove stopwords
    - [x] Remove punctuation
    - [x] Stemming - ‚ùå
    - [x] Lemmatization - ‚úÖ

#### Checking transcripts
NOTE:
* Transcripts don't have numbers in them.
* There are contractions in the transcripts, like "don't".
* There are many grammatical, spelling & syntactical errors.

In [1]:
from pprint import pprint
from tqdm import tqdm
import numpy as np, pandas as pd

ROOT = "../data/kagdata/"
TO = ROOT + "cleaned/"

meta = pd.read_csv(ROOT + "metadata.csv")

In [2]:
import re

RE_D = re.compile('\d')
titles = []
for title, trans in meta[['video name', 'transcript']].values:
    # If there is a number in the transcript, print title
    res = RE_D.search(trans)
    if res:
        print(title)
    # trans = '' + trans
    if not trans.strip().replace(' ', '').isalpha():
        titles.append(title)
print(titles)

['Collection and Presentation of Data - I', 'Dalton√É¬¢√¢‚Äö¬¨√¢‚Äû¬¢s Atomic Theory', "kepler's first law", 'Average Speed', 'Linear Graph', 'Angles', 'Congruent Figures', 'Climatic Adaptations in Animals of Tropical Rainforests', 'Climatic Adaptations in Animals of Polar Regions', 'Plane Mirror and Image Formation', 'Types of Angles', 'Conservation of Water', 'Importance of Water', 'Potential Energy', 'Demagnetizing a Magnet', 'Regeneration', 'Sexual Reproduction in Plants', 'Converse of BPT', 'Similar Polygons', 'Discovery of Subatomic Particles', 'Thomson Atomic Model', 'Thomson√É¬¢√¢‚Äö¬¨√¢‚Äû¬¢s Plum Pudding Model', 'Thrust and Pressure', 'Care of Eyes', 'Image Formation by a Plane Mirror', 'Compound Interest', 'Convex and Concave Polygons', 'Profit and Loss', 'Fibre to Wool', 'Formation of water table', 'Irrigation', 'Minerals', 'Magnetic Field and Terrestrial Magnetism', 'Verification of Pythagoras Theorem', "Boyle's Law", "Henry's Law", 'Phagocytosis in Amoeba', '5R s of Manag

#### Correct grammar + Preprocess + Generate keyphrases / embeddings

In [4]:
# For grammatical corrections
from transformers import AutoTokenizer, T5ForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-xl-composite")
grammar_model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-xl-composite")
grammar_model.to('cuda');

# For text preprocessing (before embedding generation)
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# For embedding generation
from keybert import KeyBERT
from flair.embeddings import TransformerDocumentEmbeddings

kw_model = KeyBERT(model=TransformerDocumentEmbeddings('ogimgio/K-12BERT-reward-neurallinguisticpioneers-3'))

[nltk_data] Downloading package punkt to /home/js/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/js/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/js/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
def preprocess_text(text):
    """
    Preprocesses text by converting to lowercase, removing punctuation & 
    special characters, stop words, and lemmatizing.

    """
    # Lowercase
    text = text.lower()

    # Removing punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in stop_words]
    filtered_text = ' '.join(filtered_words)

    # Lemmatizate
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    lemmatized_text = ' '.join(lemmatized_words)

    return lemmatized_text

In [6]:
from ast import literal_eval

pe = pd.read_csv("../data/kagdata/pre_embeddings_std_combined.csv", converters={'pre_embed_vec': literal_eval})
# Add columns in pe for doc_emb_unproc & doc_emb
pe['doc_emb_unproc'] = np.nan
pe['doc_emb'] = np.nan

In [7]:
itr = tqdm(pe['title'].values)
for ttl in itr:
    # CONCERN: Set to 0s ?
    if ttl in ["3D circulatory system", "Introduction to Human Musculo - Skeletal System"]:
        continue

    text = meta[meta['video name'] == ttl]['transcript'].values[0]

    itr.set_description(desc=f"Grammar correction for {ttl}: ")

    # Generate embeddings with UNprocessed transcript
    doc_emb_unproc, word_emb_unproc = kw_model.extract_embeddings(text, keyphrase_ngram_range=(1, 6), stop_words=None)
    pe.loc[pe['title'] == ttl, 'doc_emb_unproc'] = str(list(doc_emb_unproc.squeeze()))

    # Grammar correction
    input_ids = tokenizer("Please correct the grammar & syntax of the following text" + str(text), max_length=len(text), truncation=True, return_tensors="pt").input_ids.cuda()
    outputs = grammar_model.generate(input_ids, max_length=len(text))
    edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Save transcript as well as original text in 2 separate files
    with open(TO + ttl + "_orig.txt", "w") as f:
        f.write(text)

    with open(TO + ttl + "_cleaned.txt", "w") as f:
        f.write(edited_text)

    itr.set_description(desc=f"Preprocessing {ttl}: ")
    # Usual NLP preprocessing
    text_nlp = preprocess_text(edited_text)

    # Save preprocessed text
    with open(TO + ttl + "_cleaned_nlp.txt", "w") as f:
        f.write(text_nlp)

    itr.set_description(desc=f"Generating embeddings for {ttl}: ")
    # Generate embeddings with processed transcript
    doc_emb, word_emb = kw_model.extract_embeddings(text_nlp, keyphrase_ngram_range=(1, 6), stop_words=None)
    pe.loc[pe['title'] == ttl, 'doc_emb'] = str(list(doc_emb.squeeze()))


Generating embeddings for work: : 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1092/1092 [9:09:37<00:00, 30.20s/it]                                                                      


In [8]:
pe

Unnamed: 0,title,pre_embed_vec,doc_emb_unproc,doc_emb
0,2-D Gel Electrophoresis,"[-0.19126136814513778, 0.13733942668046725, -0...","[0.24289848, -0.45071986, 1.1598305, 0.4752130...","[0.27433813, -0.51973844, 1.067887, 0.40132207..."
1,3 R of Management,"[-0.16999237728515493, 0.06898725536530119, 0....","[0.40318987, -0.47663635, 1.0908582, 0.3861287...","[0.45690998, -0.59301126, 1.0201745, 0.2898596..."
2,5R s of Management,"[-0.1642900425680676, 0.05772429490165987, 0.0...","[0.43877366, -0.4371265, 1.09398, 0.40499687, ...","[0.4427606, -0.5024001, 1.0034235, 0.2949309, ..."
3,Absorption of Water by the Soil,"[-0.16203413847803913, -0.04227356757548518, -...","[0.32079756, -0.48280507, 1.16238, 0.39818862,...","[0.70309246, -0.694826, 1.1850696, 0.39960852,..."
4,Acceleration,"[-0.1223012690576073, 0.03342131371255797, -0....","[0.30854106, -0.694411, 0.86721295, 0.4675479,...","[0.17490087, -0.32100463, 0.8312924, 0.3257601..."
...,...,...,...,...
1087,movement by cilia and flagella,"[-0.137220452074245, 0.07921627140121146, -0.0...","[0.26817966, -0.43313804, 1.0973607, 0.3645128...","[0.32050088, -0.5102894, 1.0391995, 0.3232281,..."
1088,polymerisation,"[-0.16961626438499486, -0.021925985188381163, ...","[0.34239438, -0.50310206, 1.0814409, 0.2538067...","[0.2860881, -0.53453565, 0.9084051, 0.25165945..."
1089,protein structure and folding,"[-0.16562660838023435, -0.1728222006648213, -0...","[0.32868332, -0.37179884, 1.1365671, 0.4342724...","[0.38396522, -0.46210217, 1.0276499, 0.3523318..."
1090,sieving,"[-0.22382595983772374, 0.11274248544353116, 0....","[0.2377191, -0.6006751, 0.91552424, 0.33194014...","[0.07291744, -0.53669393, 0.9200764, 0.271548,..."


In [22]:
# Defaults for the 2 no-transcript videos
pe.loc[pe['title'] == "3D circulatory system", 'doc_emb'] = str([0]*768)
pe.loc[pe['title'] == "3D circulatory system", 'doc_emb_unproc'] = str([0]*768)
pe.loc[pe['title'] == "Introduction to Human Musculo - Skeletal System", 'doc_emb'] = str([0]*768)
pe.loc[pe['title'] == "Introduction to Human Musculo - Skeletal System", 'doc_emb_unproc'] = str([0]*768)

In [23]:
# Saving to disk
# pe.to_csv("../data/kagdata/emb_std_combined.csv", index=False) # NOTE: Careful about overwriting