## **Gen AI & LLMs**

### **Intro**

**Q :** "Find out whether the given email is legitimate or spam". Give a High-level overview of this problem statement.

* It is a binary classification problem
* Input is text, should be subjected to numerical transformation
* supervised learning problem since email label is present in training data

**Features**

1. Timestamp
2. From
3. Subject (text field)
4. Mail body (text field)
5. cc list
6. Category (personal, domain based, social, promotional etc.)

**Label**

1. Spam/ Not Spam

**Task (Step by Step)** 

1. Identify the input and output --> Use 3rd and 4th feature combined(feature selection) to determine label
2. EDA --> Count plot, word cloud
3. Train-test split
4. NLP of training data (**fit_transform**)
   * Clean the text data 
   * Transform the cleaned input part of training dataset( use any of the text to num algo )
5. Repeat step 4 on the test data, using **transform** only (the vectorizer is never refitted on test data)
6. Model building step using the transformed train set
7. Train score calculation (Mock test)
8. Test score calculation (Evaluation step)
9. Inference - putting model in real time
10. deployment of the model

**NOTE** : The model = (pre-processing component + classification component)
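
The steps above can be sketched with scikit-learn. This is a minimal illustration, not a production setup: the toy emails, the variable names, and the choice of `CountVectorizer` plus `MultinomialNB` are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: subject and body already combined into one text field
emails = [
    "Win a free prize now", "Lowest price on meds, click here",
    "Claim your free reward today", "Free lottery tickets, click now",
    "Meeting moved to 3 pm", "Please review the attached report",
    "Lunch tomorrow with the team?", "Project deadline is next Friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = not spam

# Step 3: train-test split (stratified so both classes appear in each part)
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=42
)

# Steps 4-6: model = pre-processing component + classification component.
# fit() runs fit_transform on the training text, then trains the classifier.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

# Steps 7-8: train score and test score (test text is only transformed)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Step 9: inference on a new, unseen email
prediction = model.predict(["Click here to claim your free prize"])
print(train_score, test_score, prediction)
```

Wrapping both components in one pipeline object means the same fitted vectorizer is reused at inference time, which matches the NOTE above.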

**Q :** After train score calculation, what to do if you find out that the model has not learned anything?

* change the algorithm
* hyperparameter tuning (compare train score vs validation score)
* alter pre-processing steps

**Q :** After evaluation & before deployment of a ML model, there is an inference step. Justify

The inference step after model evaluation and before deployment is essential for the following reasons:

1. **Model Performance Validation**: 
* ensures that the trained model generalizes well to unseen data (test data)
* you verify that the model predictions meet performance expectations (accuracy, precision, recall, etc.), confirming the model is ready for real-world data.

2. **Operational Readiness**: 
* Inference tests the model’s ability to function in a live environment
* helps simulate the actual deployment context, including handling data inputs and system integration, ensuring the model will work smoothly after deployment.

3. **Debugging and Optimization**: 
* allows developers to catch potential issues such as latency, memory usage, or incorrect predictions in real-world conditions
* provides an opportunity to fine-tune and optimize the model before deployment

4. **Business Value Validation**: 
* by performing inference on a separate validation or test set, stakeholders can evaluate whether the model provides the desired business outcomes, ensuring alignment with business goals before full-scale deployment.

Thus, inference acts as a **bridge between evaluating model performance and the practical, efficient use of the model in production**.

### **Text Preprocessing**

**Q :** List out the major steps in text preprocessing.

1. Lower casing
2. Tokenization
3. Removing Punctuation and special characters
4. Removal of stop words
5. Stemming/ Lemmatization
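
The five steps can be chained in a few lines of plain Python. This is only a sketch: the stop-word set and the `strip_suffix` helper are hypothetical stand-ins for the NLTK tools covered in the rest of this section.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a much larger one)
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def strip_suffix(token):
    """Toy stemmer: drop a trailing 'ing' or 's' (not the Porter algorithm)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                                   # 1. lower casing
    tokens = re.findall(r"\w+|[^\w\s]", text)             # 2. tokenization (words and single symbols)
    tokens = [t for t in tokens if t.isalnum()]           # 3. drop punctuation / special characters
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. stop-word removal
    return [strip_suffix(t) for t in tokens]              # 5. crude stemming

print(preprocess("The cats are chasing the mice in the garden!"))
# ['cat', 'are', 'chas', 'mice', 'garden']
```

Note how the crude stemmer produces "chas", which is not a dictionary word; this is typical of stemming and is one reason lemmatization exists.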

###### **TOKENIZATION**

**Q :** What is tokenization?

* process of splitting text into smaller units called tokens 
* These tokens can be words, phrases, or even characters, depending on the level of tokenization
* crucial step in text preprocessing for NLP tasks
* By breaking down text into tokens, ML models can work with manageable and meaningful parts of the text

**Q :** What are the different types of tokenization?

* **Word Tokenization** : 
    * Breaks text into individual words 
    * "I love NLP" becomes ["I", "love", "NLP"].
* **Sentence Tokenization** : 
    * Splits text into sentences 
    * "I love NLP. It's fun!" becomes ["I love NLP.", "It's fun!"].
* **Character Tokenization** : 
    * Splits text into individual characters
    * "love" becomes ["l", "o", "v", "e"].

**Q :** Which module in **nltk** library contains functions for tokenization?

* tokenize module of nltk library

**Q :** How is word tokenization enabled in Python using nltk? Give example.

```python
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer data (only needed the first time)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead

# Sample text
text = "I love NLP! It's fascinating."

# Perform word tokenization
tokens = word_tokenize(text)
print(tokens)

# output
['I', 'love', 'NLP', '!', 'It', "'s", 'fascinating', '.']

```

**Q :** How is sentence tokenization implemented in Python using NLTK? Give example.

```python
from nltk.tokenize import sent_tokenize

# Sample text
text = "I love NLP. It's a fascinating field of study."

# Perform sentence tokenization
sentences = sent_tokenize(text)
print(sentences)

# output
['I love NLP.', "It's a fascinating field of study."]

```

###### **PUNCTUATIONS/SPECIAL CHARACTERS REMOVAL**

**Q :** Why Remove Special Characters and Punctuation?

* **Noise Reduction** : Special characters and punctuation may not contribute meaningfully to the task, and removing them reduces noise in the data.
* **Simplifies Text** : Focusing on words without clutter simplifies the input for machine learning models.
* **Uniformity** : It brings uniformity to the dataset, especially when handling large corpora of text.

**Q :** Why Remove Special Characters After Tokenization?

* **Flexibility** : By tokenizing first, you have more granular control. For example, some punctuation (e.g., emoticons, hashtags) might carry meaning, and you can choose which tokens to keep or remove.
* **Handling Contractions** : In English, contractions like "don't" and "it's" include punctuation. Removing punctuation before tokenization can complicate things (e.g., turning "don't" into "don" and "t").
* **Avoid Over-removal** : If you remove special characters before tokenization, you risk stripping away meaningful punctuation marks or symbols that should be preserved.
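
A small sketch of the contraction problem described above. It uses a plain whitespace split as the tokenizer, which is an assumption made to keep the example dependency-free; NLTK's `word_tokenize` splits contractions differently.

```python
import re

text = "Don't panic, it's fine :)"

# Stripping punctuation from the raw text first breaks contractions into fragments
stripped_first = re.sub(r"[^\w\s]", " ", text).split()

# Tokenizing first keeps each contraction in one token,
# so we can decide per token what to keep or drop
tokens = text.split()
kept = [t.strip(",.!?") for t in tokens if any(ch.isalnum() for ch in t)]

print(stripped_first)  # ['Don', 't', 'panic', 'it', 's', 'fine']
print(kept)            # ["Don't", 'panic', "it's", 'fine']
```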

**Q :** List out the methods that can be used for removal of punctuations and special characters from tokenized data.

1. Regex method
2. Python string library method

**Q :** How to use Regex for special character removal?

```python
import re

tokens = ['Hello', '!', '!', '!', 'How', "'s", 'everything', 'going', '?', '#', 'Exciting', ':', ')']

# Remove punctuation and special characters
cleaned_tokens = [token for token in tokens if re.match(r'^\w+$', token)]

print(cleaned_tokens)

# output
['Hello', 'How', 'everything', 'going', 'Exciting']

```

**Q :** How to use Python string library for special character removal?

```python
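import string

# Same token list as in the regex example above
tokens = ['Hello', '!', '!', '!', 'How', "'s", 'everything', 'going', '?', '#', 'Exciting', ':', ')']

# Remove tokens that are single punctuation characters listed in string.punctuation
cleaned_tokens = [token for token in tokens if token not in string.punctuation]

print(cleaned_tokens)

# output ("'s" is kept because it is not a single punctuation character)
# ['Hello', 'How', "'s", 'everything', 'going', 'Exciting']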

```

###### **DEALING WITH STOP WORDS**

**Q :** What are stop words?

* common words in a language that are often removed from text data during NLP tasks
* they usually don't carry significant meaning or contribute to the analysis
* These words are frequently used in the text 
* they don’t provide much context or information in tasks like 
    * text classification
    * information retrieval
    * sentiment analysis

**Q :** Give examples of stop words.

In English, stop words typically include:

1. **Articles** : the, a, an
2. **Pronouns** : he, she, it, they
3. **Conjunctions** : and, or, but
4. **Prepositions** : in, on, at, over, under, after, before
5. **Auxiliary verbs** : is, am, are, was, were, be, has, have, had

**Q :** Why Remove Stop Words?

* **Reduce Noise** : Stop words don’t add significant meaning, so removing them helps reduce noise in the text
* **Improve Efficiency** : reduces the number of tokens (words) that a machine learning model needs to process
* **Focus on Important Words** : for sentiment analysis, keywords like "great," "bad," etc., are more useful than "is" or "the").

**Q :** Which module in NLTK contains stop words data?

* the **corpus** module

**Q :** Give an example on how to implement stop words removal in Python using NLTK.

```python
import nltk
from nltk.corpus import stopwords

# Download stopwords dataset (if not already downloaded)
nltk.download('stopwords')

# List of English stopwords
stop_words = set(stopwords.words('english'))

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)

# output
# Original Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
# Filtered Tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']

```

###### **STEMMING**

**Q :** Define stemming.

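* process of reducing words to their base or root form (the stem) by stripping affixes, mostly suffixes
* it enables the grouping of related words for analysis
* eg. the words "running" and "runs" can both be reduced to the stem "run"
* the stem need not be a dictionary word ("studies" becomes "studi"), and irregular forms such as "ran" are not mapped to "run"; that requires lemmatization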

**Q :** List out the major algorithms to implement stemming.

1. Porter stemmer
2. Lovins stemmer

**Q :** Idea behind Porter stemmer algo.

The algorithm consists of multiple steps, each containing a set of rules to handle specific suffixes. These steps are applied in a sequential manner, where the output of one step may serve as the input for the next.


**Step 1: Removing Plurals and Past Tenses**

Rules are applied to remove common inflectional suffixes like -s, -es, -ed, and -ing

**Step 2: Simplifying Compound Suffixes**

Longer suffixes are rewritten to shorter ones, like -ational to -ate, -ization to -ize, and -fulness to -ful

**Step 3: Removing Derivational Suffixes**

Further rules target suffixes such as -ness, -ful, and -icate (reduced to -ic) to bring derived forms closer to their base form

**Step 4: Removing Remaining Suffixes and Tidying Up**

Suffixes such as -ment, -ence, -able, and -er are removed when the remaining stem is long enough, and a final pass tidies the stem (e.g. dropping a trailing -e). Irregular forms are not handled; the algorithm is purely rule-based

**Q :** Which NLTK module has function for implementing Porter stemmer algo?

* the stem module in NLTK

**Q :** How to implement Porter stemmer algo in Python using NLTK?

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ['running', 'jumps', 'easily']  # sample tokens
tokens = [stemmer.stem(token) for token in tokens]
print(tokens)
```


###### **LEMMATIZATION**

**Q :** Define lemmatization.

* process of reducing words to their base or dictionary form (lemma) by considering the context and meaning of the word
* Unlike stemming, which often removes suffixes arbitrarily, lemmatization uses linguistic knowledge to ensure that the root form is a valid word, such as converting "better" to "good" or "running" to "run."

**Q :** How to implement lemmatization in python using NLTK?

```python
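import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # only needed the first time
lemmatizer = WordNetLemmatizer()
tokens = ['cars', 'feet', 'running']  # sample tokens
# lemmatize() treats every word as a noun unless a POS tag is passed via pos=
tokens = [lemmatizer.lemmatize(token) for token in tokens]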

```

"**List Comprehension is widely used in Text Preprocessing Tasks!!!**"

### **NLP Overview**

**Q :** What is NLP?

* Natural Language Processing
* the input text data in the above problem is "Natural Language" used by humans
* The conversion of N.L into model-friendly format(pandas friendly tabular format) is NLP

It is a study of :
* processing
* analyzing
* understanding(by means of a built model) &
* generating(by means of a built model)

text data.

Processing
___
1. Cleaning
  * special char removal
  * lower
  * stop word removal
  * lemmatization

2. Vectorization
  * text to num conversion (BOW)

**Q :** List out steps involved in Text data cleaning 

* convert into lowercase
* remove stop words
* stemming, lemmatization
* tokenization 

**Q :** Tabulate the algorithms and their corresponding Libraries.

|Algo|Library|
|-----|----|
|BOW, TFIDF|sklearn|
|Word2Vec, GloVe, FastText|gensim|
|RNN, LSTM, GRU|TensorFlow, Keras, PyTorch|
|BERT, GPT, LLMs|Hugging Face, OpenAI, Google GenAI, Anthropic|

**Q :** List out the AI tasks that are dealt with by NLP domain.

1. **Text Classification** : 
   * Spam detection
   *  Sentiment analysis
   * Adult content filtering etc. 
   * Goal is to classify the entire piece of text into categories

2. **Information extraction** :
   * Goal is to extract structured info from unstructured text data 
   * Named entity recognition
     - It is a sequence labeling (token classification) task
     - goal is to classify each element (token) in a sequence into categories
   * Part of speech tagging

3. **Information Retrieval**
   * given a user query, the model should be able to retrieve a relevant piece of info from a huge repository such as documents / websites
   * retrieve docs/ info that best matches a user's query

4. **Text summarization**
   * News feed creation
   * case report summarization in a big law firm
   * such models are Gen AI models
   * text doc --> model --> summary of the doc

5. **Machine Translation**

6. **Q and A Systems**
   * automated systems to answer
     * customer query
     * weather updates
     * healthcare chatbots 
   * question reaches model --> model does information **retrieval** --> model performs info **extraction** --> model **generates**(Gen AI) the quality answer based on the question

**Q :** Show an example that illustrates the purpose of NER Information extraction.

* sentence : "Apple is looking to launch M4 Macbooks for 1500$"

Entities :

1. Apple = Organization
2. M4 Macbooks = Product
3. 1500$ = Price

**Q :** Show an example that illustrates part-of-speech tagging in information extraction.

* sentence = "She runs fast."

Parts of speech :

1. She = pronoun
2. runs = verb
3. fast = adverb


**Q :** Why is NLP hard?

1. **Ambiguity** : refers to the uncertainty in meaning
   * "The chicken is ready to eat"
   * "The car hit the pole while it was moving"
2. **Complexity of Representation** : refers to poem, sarcasm, phrases etc.
   * "You have a football game tomorrow. Break a leg!"
   * "Yeah, right, because that worked so well last time"( sarcasm )

### **Text to Num Conversion**

Vectorization techniques / Language Representation techniques / text representation techniques

**Q :** List out the possible challenges or drawbacks while vectorizing text.

1. Higher dimensionality of the vector
2. Sparsity of the vector
3. Vectors not able to capture the semantic relationships between the words
4. Vectors that do not account for sequence of the words (the order in which they appear) in a sentence
5. Vectors not being able to capture the context in which a word is used

**Q :** What is the problem with sparse vectors or matrix ?

* visualization
* memory consumption
* computational complexity
* curse of dimensionality
* lack of interpretability

**Q :** Suggest a solution for this problem.

* remove stop words
* convert words to their root form
* remove punctuations
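
To make the effect of cleaning concrete, the sketch below measures the size and density of a document-term matrix with and without stop-word removal. The four-sentence corpus and the `describe` helper are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The cat sat on the mat.",
    "A dog chased the cat across the garden.",
    "Stock prices fell sharply on Monday.",
    "The garden is full of flowers in spring.",
]

def describe(vectorizer):
    matrix = vectorizer.fit_transform(corpus)          # SciPy sparse matrix
    n_docs, n_terms = matrix.shape
    density = matrix.nnz / (n_docs * n_terms)          # share of non-zero cells
    return n_terms, density

terms_all, density_all = describe(CountVectorizer())
terms_clean, density_clean = describe(CountVectorizer(stop_words="english"))

print(f"all tokens:         {terms_all} columns, density {density_all:.2f}")
print(f"stop words removed: {terms_clean} columns, density {density_clean:.2f}")
```

Removing stop words shrinks the vocabulary (fewer columns). It does not by itself make the matrix dense; on real corpora most cells remain zero, which is why the output is kept in a sparse format.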

**Q :** BOW, TF-IDF vectorization techniques results in loss of sequence information in a text. What is the solution?

* N-grams approach

**Q :** Assuming that all the vocabulary words can be represented using a 2-dimensional vector representation, which plot do you think makes the most sense?

* scatter plot

**Q :** Give example where the vectorization technique's inability to capture sequence information in a sentence can be a problem. What is the problem?

* Sentence 1 : This is my notebook.
* Sentence 2 : Is this my notebook?

Problem :

1. loss of sequence information would mean that any model using such vector representations could 
   * misinterpret these two sentences as having the same meaning
   * leading to incorrect predictions or responses in applications like 
       * sentiment analysis
       * question answering
       * chatbots
2. Although these sentences contain exactly the same words, the sequence of the words completely changes the meaning.

**Q :** Why do you think that BOW and TF-IDF methods do not capture the sequence info in sentences?

* they treat sentences as collections of words, disregarding their order
* As a result, both sentences would have identical vector representations since these techniques only count occurrences of each word, not the position
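
This can be checked directly. A minimal sketch using scikit-learn's `CountVectorizer` on the two sentences above; the bigram variant is included to show how n-grams recover part of the order information.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is my notebook.", "Is this my notebook?"]

# Unigram bag of words: word order is discarded
unigram = CountVectorizer().fit_transform(sentences).toarray()

# Bigrams keep pairs of adjacent words, so some order information survives
bigram = CountVectorizer(ngram_range=(2, 2)).fit_transform(sentences).toarray()

same_unigram = np.array_equal(unigram[0], unigram[1])
same_bigram = np.array_equal(bigram[0], bigram[1])
print(same_unigram, same_bigram)  # True False
```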

**Q :** Why does this happen with Word2Vec also?

* represents individual words 
* it does not consider the sentence structure
* it captures word relationships but ignores how words are sequenced to convey meaning

#### **BOW**

**Q :** What is Bag of Words?

* it is an algorithm
* it is used for making numerical representation of a text data
* it transforms textual information into numerical data that ML algos can use

**Q :** What is the basic idea used in the BoW algo?

* text (such as a sentence or document) = a collection of words 
* grammar, order, or context of the words are ignored

**Q :** What are the major steps involved in BoW algo?

**Step 1** - Text preprocessing

**Step 2** - Create a vocabulary viz. a dictionary of words(tokens) in the corpus

**Step 3** - For each document, count the occurrence of each word in the vocabulary. The result is a vector where each dimension represents the frequency of a word from the vocabulary in the document

**Step 4** - Create a matrix(**Document Term matrix**) where each row represents a document, and each column represents a word from the vocabulary. The matrix contains word counts for each document.
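
The four steps can be written out by hand in plain Python. This is a sketch for understanding only; the three toy documents are invented, and preprocessing is reduced to lower-casing and whitespace splitting.

```python
from collections import Counter

documents = ["the cat sat", "the dog sat", "the cat saw the dog"]

# Step 1: minimal preprocessing (lower-case and whitespace tokenization)
tokenized = [doc.lower().split() for doc in documents]

# Step 2: vocabulary = sorted unique tokens; position in the list = column index
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Steps 3-4: one count vector per document, stacked into the document-term matrix
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)         # ['cat', 'dog', 'sat', 'saw', 'the']
for row in doc_term_matrix:
    print(row)
# [1, 0, 1, 0, 1]
# [0, 1, 1, 0, 1]
# [1, 1, 0, 1, 2]
```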

**Q :** Which class in scikit-learn is used for implementing BoW?

* CountVectorizer

**Q :** Which module in scikit-learn has the CountVectorizer class?

* feature_extraction.text module of sklearn

**Q :** Major tasks done by CountVectorizer?

* tokenizes the input text (lowercasing it by default)
* builds the vocabulary
* creates the document-term matrix

**Q :** List out the main input parameters of CountVectorizer.

1. stop_words: Optionally remove common stop words.
2. ngram_range: Specify the range of n-grams to consider (e.g., single words, pairs of words).
3. max_features: Limit the number of features (words) to include based on frequency.
4. lowercase: Convert all characters to lowercase to ensure uniformity.

**Q :** How to implement BOW in Python using Scikit learn?

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "I love programming in Python.",
    "Python is great for data science.",
    "I enjoy learning new programming languages."
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert to an array
count_array = X.toarray()

# Get the feature names (unique words)
feature_names = vectorizer.get_feature_names_out()

# Display results
print("Feature Names:", feature_names)
print("Count Matrix:\n", count_array)

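# output (lowercased; the default tokenizer drops single-character tokens such as "I")
# Feature Names: ['data' 'enjoy' 'for' 'great' 'in' 'is' 'languages' 'learning' 'love'
#                 'new' 'programming' 'python' 'science']
# Count Matrix:
#  [[0 0 0 0 1 0 0 0 1 0 1 1 0]
#  [1 0 1 1 0 1 0 0 0 0 0 1 1]
#  [0 1 0 0 0 0 1 1 0 1 1 0 0]]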

```

**Q :** List out the features of BOW algo.

* Simplicity: The model is straightforward and easy to interpret.
* Easy to Implement: Libraries like sklearn provide direct methods to use BoW.
* Works with Many ML Algorithms: BoW outputs vectors, which work well as inputs to most machine learning models.
* Scalable: Suitable for large text datasets, especially with efficient sparse matrix representations.

**Q :** Disadvantages of BOW algo?

* ignores meaning. numerical representations are meaningless
* semantic meaning not visible in the output
* high dimensionality

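**Q :** What is the n-grams approach?

* an n-gram is a contiguous sequence of n tokens; using n-grams (instead of only single words) as features preserves some local word order
* "I love programming in Python"--> ("I love"), ("love programming"), ("programming in"), ("in Python") [bi-grams]
* drawback: the vocabulary grows quickly, so the dimensionality of the output vectors explodes
* stop words removal, special character removal etc. help with the sparsity and dimensionality problems, i.e. text preprocessing is crucial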
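
Generating n-grams from a token list takes one line of Python. The helper name `ngrams` is just for this sketch (NLTK and scikit-learn provide their own equivalents).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love programming in Python".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('I', 'love'), ('love', 'programming'), ('programming', 'in'), ('in', 'Python')]
print(len(trigrams))  # 3
```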

#### **TF IDF**

**Q :** Define term frequency (TF)

* measures how frequently a term occurs in a document

* It's typically calculated as:

$$TF(t,d) = \frac{\# \ of \ term \  t \  in \  d}{\# \ terms \ in \ doc \ d}$$

* it represents the presence or importance of a term in a document
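* its value lies in [0, 1]: it is 0 when the term is absent from the document and within (0, 1] when it is present
* for any term t and document d, a higher TF means that t makes up a larger share of d, which suggests that t is significant in d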

**Q :** What is inverse document frequency (IDF)?

* This measures how important a term is across the entire corpus
* If a term appears in many documents, it is less important, meaning it's less specific or, it carries lesser information to differentiate between the documents in the corpus

$$IDF(t, D) = \log\left(\frac{\# \ of \ docs \ in \ the \ corpus \ D}{\# \ of \ docs \ that \ has \ t}\right)$$

* higher IDF score indicates that the term is rare across the entire corpus ie., the term is specific to certain documents in the corpus
* Conversely, common terms (like "the," "is," etc.) will have low IDF scores because they appear in many documents

<img src="Images/log.png" alt="image description" width=120 height=100>

**Q :** How to compute TF-IDF score?

$$\text{TF-IDF}(t, d, D) = TF(t,d)\times IDF(t,D)$$
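
The three formulas translate directly into Python. This is a from-scratch sketch on an invented, already-tokenized corpus; it assumes the natural logarithm (the notes do not fix a base) and that the term occurs in at least one document.

```python
import math

corpus = [
    ["data", "science", "uses", "data"],
    ["machine", "learning", "uses", "data"],
    ["statistics", "uses", "probability"],
]

def tf(term, doc):
    # occurrences of the term divided by the number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(number of documents / number of documents containing the term)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("data", corpus[0]))                 # 0.5
print(idf("uses", corpus))                   # 0.0, "uses" appears in every document
print(tf_idf("data", corpus[0], corpus))     # 0.5 * log(3/2)
print(tf_idf("machine", corpus[1], corpus))  # 0.25 * log(3)
```

A term that occurs in every document gets an IDF of zero, so its TF-IDF score is zero however often it appears.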

**Q :** TF-IDF score depends on (or is a function of) ___ , ___ and ___ .

* the term t, document d, corpus D

**Q :** In a specific corpus, TF-IDF score is a function of ___.

* the term t and document d

**Q :** What does a higher TF-IDF score imply?

TF-IDF is high 

$\Rightarrow$ TF is high & IDF is high 

$\Rightarrow$ the term t appears more frequently in the document d, but the term t is rare in the whole corpus D

$\Rightarrow$ the term t is specific to the document d and also significant in the document 

$\Rightarrow$ term t is likely to be an important keyword or topic in the document d

**Q :** What does a lower TF-IDF score imply?

TF-IDF is low 

$\Rightarrow$ TF is low or IDF is low or both are low

$\Rightarrow$ the term t appears less frequently in document d or the term t is common in the whole corpus D or both

$\Rightarrow$ the term t is not specific to the document d or not significant in the document d or both

$\Rightarrow$ This indicates that the term is likely not a key topic or is generic in nature

**Q :** The terms $t_1$ and $t_2$ have the same frequency in a document d. However, the IDF of $t_1$ is higher than that of $t_2$. What does that imply?

1. **$t_1$ is rarer across the entire corpus**: A higher IDF value for $t_1$ suggests it appears in fewer documents in the corpus compared to $t_2$. Since IDF measures how unique a term is within the corpus, a higher IDF for $t_1$ implies it is more specific or unique across documents, while $t_2$ is more common.

2. **$t_1$ is likely more important in document d**: Since both terms have the same frequency in d, the TF values for $t_1$ and $t_2$ are identical. However, the higher IDF for $t_1$ increases its overall **TF-IDF score** in d, making $t_1$ more significant than $t_2$ in identifying the unique content or focus of document d.

3. **Interpretation in context**:
   - $t_1$ likely represents a **more specific or specialized topic** in d since it is relatively rare across the corpus.
   - $t_2$, being more common across documents, might be a broader or more general term that does not add as much unique value to d in comparison.
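
A quick numeric check of this reasoning. All numbers are hypothetical: a corpus of 1000 documents, a shared TF of 0.05, and natural-log IDF.

```python
import math

n_docs = 1000                           # hypothetical corpus size
tf_value = 0.05                         # both terms have the same TF in document d
docs_with_t1, docs_with_t2 = 10, 500    # t1 is rare, t2 is common

idf_t1 = math.log(n_docs / docs_with_t1)   # log(100)
idf_t2 = math.log(n_docs / docs_with_t2)   # log(2)

score_t1 = tf_value * idf_t1
score_t2 = tf_value * idf_t2
print(round(score_t1, 3), round(score_t2, 3))  # 0.23 0.035
```

With identical TF, the ranking of the two terms is decided entirely by IDF.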



**Q :** Why is Inverse of DF considered? Why not use DF?

* to penalize common terms across the corpus as they do not give much information for distinguishing between documents
* to highlight uncommon terms in the corpus allowing unique or specific terms in a document to stand out, making them more significant
* Balancing local and global relevance of a term

**Q :** Why Not Just Use TF? What is the need of IDF factor?

* Without IDF, TF alone could make common words seem equally relevant across documents
* TF-IDF approach refines the relevance by adjusting for words that are commonly used across the corpus, focusing on those that provide distinguishing information within each document
* Using the inverse of document frequency allows us to emphasize terms that are not just frequently occurring but also unique or relevant to specific documents
* makes TF-IDF a powerful technique for identifying key features that characterize documents in a meaningful way

**Q :** Give an example to demonstrate the use of TF-IDF score.

| Word              | Document 1 | Document 2 | Document 3 |
|-------------------|------------|------------|------------|
| data              | 0.1        | 0.2        | 0.15       |
| machine learning  | 0.9        | 0.05       | 0.0        |
| statistics        | 0.0        | 0.7        | 0.3        |

* In Document 1, "machine learning" has a very high TF-IDF score, indicating that this term is central to its content.
* In Document 2, "statistics" is crucial, while "machine learning" is not.
* In Document 3, "data" is present but not significant enough to indicate a specific topic

**Q :** Which module in sklearn has the functionality to execute TF-IDF approach?

* feature_extraction.text module of sklearn

**Q :** Give a brief overview about TfidfVectorizer.


* It is a Scikit-Learn tool that transforms a collection of raw text documents into a matrix of TF-IDF features 
* it represents each document by the importance of each term
* Purpose: Convert a text corpus into a numerical representation, capturing the relative importance of words across documents
* Output: A sparse matrix where each row represents a document, and each column represents a term. Each element in the matrix is the TF-IDF score for that term in that document

**Q :** List out the inputs and main parameters of TfidfVectorizer.

1. **documents** : A list of documents (as strings) to be transformed into TF-IDF scores; this is passed to .fit() / .fit_transform() / .transform(), not to the constructor

2. **max_df** : Words that appear in more than this proportion of documents (float) or number of documents (int) are ignored (helps to remove very common words)

3. **min_df** : Words that appear in fewer than this proportion of documents (float) or number of documents (int) are ignored (helps to remove rare words)

4. **stop_words** : A list or language name (e.g., 'english') to filter out common stop words

5. **ngram_range** : Specifies the range of n-grams (e.g., (1, 2) for unigrams and bigrams)

6. **max_features** : Limits the number of features based on term frequency across the corpus; only the top N features are retained

7. **vocabulary** : Manually provide a mapping of terms to indices (optional)

**Q :** List out the methods and learned attributes of a TfidfVectorizer object.

* **.fit(documents)** : Learns the vocabulary and IDF values from the documents

* **.transform(documents)** : Transforms documents into the TF-IDF matrix using the learned vocabulary and IDF values

* **.fit_transform(documents)** : Combines the fit and transform steps in a single call (most commonly used)

* **.get_feature_names_out()** : Returns the list of terms in the vocabulary

* **.idf_** (attribute, available after fitting) : Contains the learned IDF vector with inverse document frequencies for each term in the vocabulary

* **.vocabulary_** (attribute, available after fitting) : A dictionary that maps terms to feature indices
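
A minimal sketch of how these fit together on a train/test split. The documents are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["free prize inside", "meeting at noon", "claim your free prize"]
test_docs = ["free meeting tomorrow"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary_ and idf_ from training text
X_test = vectorizer.transform(test_docs)         # reuses them; unseen words ("tomorrow") are ignored

print(vectorizer.vocabulary_)        # term -> column index
print(vectorizer.idf_)               # one IDF value per column
print(X_train.shape, X_test.shape)   # same number of columns
```

Calling only `transform` on the test documents keeps their columns aligned with the training matrix and avoids leaking test-set statistics into the IDF values.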

**Q :** What happens when a list of strings is fed into the TfidfVectorizer.

1. **Tokenization** - The text is split into individual tokens (words or phrases, based on ngram_range)

2. **TF Calculation** - Term frequency for each term in each document is calculated

3. **IDF Calculation** - The inverse document frequency of each term is computed across the corpus

4. **TF-IDF Transformation** - Each term's TF is multiplied by its IDF to produce the TF-IDF score for that term in each document

5. **Normalization** - By default (norm='l2'), each row (document) vector is normalized to have unit Euclidean norm, which helps in scaling the vector; this can be changed or switched off through the norm parameter
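
The steps above can be reproduced by hand and compared with `TfidfVectorizer`. Note that scikit-learn's defaults differ slightly from the textbook formulas earlier in these notes: TF is the raw count, and IDF is smoothed as ln((1 + n) / (1 + df)) + 1. The sketch assumes those default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]

# Steps 1-2: tokenize and count raw term occurrences
counts = CountVectorizer().fit_transform(documents).toarray().astype(float)

# Step 3: smoothed IDF, ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Step 4: multiply counts by IDF; Step 5: scale each row to unit Euclidean length
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(tfidf, reference))  # True
```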

**Q :** How is TF-IDF text to num conversion done in Python using scikit-learn?

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Initialize the vectorizer
# Note: stop_words, max_df and min_df all prune the vocabulary, so on a tiny
# corpus like this one very few terms survive
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7, min_df=0.1)

# Fit and transform the documents
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Access the feature names and idf values
feature_names = tfidf_vectorizer.get_feature_names_out()
idf_values = tfidf_vectorizer.idf_

# Convert to a dense array for visualization (optional)
dense_matrix = tfidf_matrix.toarray()

# Display the output
import pandas as pd
df = pd.DataFrame(dense_matrix, columns=feature_names)
print(df)

```

**Q :** List out the unique features of the TF-IDF approach.

1. Down-weighting of common terms: Reduces the impact of terms that occur in most documents (e.g. stop words) by assigning them lower weights.
2. Normalization: Term frequencies (and the optional row normalization) account for document length, making it easier to compare documents of different lengths.
3. Useful for Information Retrieval: Effective for tasks like document classification, clustering, and search engines.
4. Sparse output: The result is stored as a sparse matrix, which keeps high-dimensional data manageable in memory.

**Q :** List out the demerits associated with the TF-IDF approach.

* **Sparsity** : TF-IDF can result in sparse matrices, which can be computationally expensive to process.
* **Ignores Word Order** : It treats words independently, losing context or meaning derived from word sequences.
* **Not Semantic** : Does not account for synonyms or related words, which can lead to the loss of meaning.
* **Fixed Representation** : The model is static; it does not adapt to new contexts or meanings over time.

#### **Word2Vec**

* captures semantic relations
* dense vectors
* low dimensional vectors
* it's called an embedding

**Q :** What is the basic idea that powers the Word2Vec model?

* words appearing in similar contexts have similar meaning

<img src="Images/w2v001.png" alt="image description" width=700 height=400>

<img src="Images/w2v002.png" alt="image description" width=700 height=400>

**Q :** What is the benefit of a numerical text representation algo that captures semantic meaning?

* words with similar meaning will be closer together while plotting their num representation
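
"Closer together" is usually measured with cosine similarity. The sketch below uses hand-made 2-D vectors purely for illustration; real embeddings are learned and have many more dimensions.

```python
import numpy as np

# Hand-made 2-D vectors for illustration only; real embeddings are learned
vectors = {
    "king":  np.array([0.90, 0.80]),
    "queen": np.array([0.85, 0.90]),
    "apple": np.array([-0.70, 0.20]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_king_queen)  # close to 1
print(sim_king_apple)  # much lower
```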

**Q :** What do you mean by semantic/ meaning?

* it is about understanding the deeper meaning of, and the relationships between, words
* BOW, TFIDF treat words as separate, unrelated entities
* Word2Vec creates a numerical representation based on the contexts in which the word occurred

**Q :** How can word2vec capture or learn the similarities in words based on their meaning?

* Using a shallow ANN that takes the text data and learns the meaning of, and relationships between, words
$$text \to shallow \ ANN \ (algo) \to Word2Vec \ (model) \to num \ representation$$

**Q :** Learning word vectors from raw text looks like an unsupervised learning problem. Name the techniques that turn it into a familiar classification (supervised learning) problem.

* CBOW - continuous bag of words
* skip gram
* skip gram with negative sampling

|Deep ANN|Shallow ANN|
|---|--|
| Many hidden layers| Few hidden layers (often just one)|

##### **CBOW**

**Q :** What is the goal of CBOW technique?

The CBOW model aims to **predict a target word based on its context words**.

**Q :** Differentiate between the target word and its context words.

* Context words are simply the words surrounding a target word in a sentence, within a specified window size.
* Window Size defines how many words around the target word will be considered as context
* For instance, if we have a window size of 2, then we look at the 2 words before and the 2 words after the target word

**Q :** Illustrate target & context with example.

Sentence - "The quick brown fox jumps over the lazy dog."

target word - say target word is "fox"

window size - say, 2

context words - "quick", "brown", "jumps", "over"
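
Building the context-target pairs for a whole sentence is a simple sliding window. The helper name `cbow_pairs` is made up for this sketch; punctuation is dropped and the text lower-cased beforehand.

```python
def cbow_pairs(tokens, window):
    """Return a (context_words, target_word) pair for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = cbow_pairs(tokens, window=2)

print(pairs[3])   # (['quick', 'brown', 'jumps', 'over'], 'fox')
print(pairs[0])   # (['quick', 'brown'], 'the'), fewer context words at the sentence edge
```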

**Q :** Is CBOW a supervised learning method?

Yes, in the sense that it is trained like a supervised model (often called self-supervised, because the labels come from the text itself rather than from manual annotation):

* The training data consists of context-target pairs: given a set of context words, the model needs to learn to predict the target word
* CBOW model uses error (loss) between the predicted and actual target words to update its weights
* it learns from the relationship between context words (input) and target words (output), optimizing its weights by minimizing the difference between predicted and actual outputs, just like in typical supervised learning

**Q :** What is fed as input into the CBOW neural network ?

* the context words
* based on window size --> if window size = k, then k words occurring before the target word and k words occurring after the target word are fed as inputs
* each of the context words will be fed in the form of binary(0,1 are the only component values) fixed-length vectors
* the fixed length of such vectors = the number of unique words in the corpus 

**Q :** Explain how the words are represented as one-hot vectors.

1. determine the vocabulary from the corpus ie. the collection of unique words
2. for each word in the vocabulary, assign a unique index; an integer
3. find out the size of the vocabulary, say N
4. then any word in that vocabulary with index i can be represented using 
   * an N-component vector
   * with i-th component equal to 1
   * & rest of the (N-1) components equal to 0

   **NOTE** : $1 \leq i \leq N$
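
The four steps above can be sketched in plain Python. This is only an illustration on a toy corpus; note that Python indices start at 0, so here $0 \leq i \leq N-1$ rather than the 1-based convention in the note. The names `word_to_index` and `one_hot` are made up for this sketch.

```python
corpus = "the quick brown fox jumps over the lazy dog".split()

vocab = sorted(set(corpus))                               # step 1: unique words
word_to_index = {word: i for i, word in enumerate(vocab)} # step 2: one index per word
N = len(vocab)                                            # step 3: vocabulary size

def one_hot(word):
    """Step 4: N-component vector with a single 1 at the word's index."""
    vec = [0] * N
    vec[word_to_index[word]] = 1
    return vec

print(N)               # 8
print(one_hot("fox"))  # [0, 0, 1, 0, 0, 0, 0, 0]
```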

**Q :** List out the layers in a CBOW's neural network.

1. Input layer
2. Embedding layer(hidden)
3. Output layer

**Q :** What is the purpose of the embedding layer in CBOW?

* to convert sparse, high-dimensional one-hot encoded vectors (which are typically very large) into lower-dimensional, dense vectors that can capture semantic relationships between words
* For example, words that have similar meanings should have similar vector representations

**Q :** Explain the dimensionality of the embedding space in the context of CBOW.

* it is the number of components in vectors that are generated by the embedding layer
* how many features each word vector will have
* this will be much less than the size of the corpus vocabulary ie. the number of components in the input one-hot vectors
* typically 50, 100, 200 etc.

**Q :** In the due course of accomplishing its task of predicting target words using context words, what is the shallow neural network of CBOW trying to learn?

* it is trying to learn the vector representations of words in the vocabulary
* unlike the trivial high-dimensional, sparse one-hot vector representations, the learned representations are dense and low-dimensional, and capture the semantic relations between the words in the vocabulary
* these are also called word embeddings

**Q :** What is meant by embedding matrix in the context of CBOW?

* Embedding matrix is a 2D matrix with dimension, V×d, where:
    * V is the vocabulary size 
    * d is the embedding dimension 
* Each row represents the word embedding (vector) for a specific word in the vocabulary
* each column represents a dimension in that embedding space
* iow. during training, the CBOW tries to create a perfect embedding matrix that captures actual semantic relation between words in the vocabulary

**Q :** Name the connection between CBOW's input layer & embedding layer.

* lookup connection

**Q :** How are inputs fed into the CBOW NN?

1. Each context word is represented as a one-hot encoded vector whose length equals the vocabulary size (say, 100). If the context has 4 words, there will be 4 separate one-hot vectors for those words.
2. Instead of feeding the one-hot vectors directly into the neurons (as in a fully connected layer), each one-hot vector is used to perform an embedding lookup. 
3. lookup retrieves the corresponding row in the embedding matrix
4. The one-hot vector effectively acts as an "index" in the vocabulary, pulling out the embedding for each word from the embedding matrix.
5. Once the embeddings for each of the 4 context words are retrieved, they are averaged or summed to produce a single vector. This combined vector is then passed to the next layer to predict the target word.
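
The lookup and averaging steps can be sketched with NumPy. The sizes, the random embedding matrix `E` and the context indices below are all made-up toy values; the point is only that multiplying a one-hot vector by `E` selects a row, and that the rows are then averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 3                          # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))          # embedding matrix: one row per vocabulary word

context_ids = [6, 0, 3, 5]           # hypothetical indices of the 4 context words

# A one-hot vector times E picks out exactly one row, so it equals a plain lookup.
one_hot = np.eye(V)[context_ids[0]]
assert np.allclose(one_hot @ E, E[context_ids[0]])

context_embeddings = E[context_ids]              # shape (4, d): one row per context word
context_vector = context_embeddings.mean(axis=0) # shape (d,): averaged embedding
print(context_vector.shape)
```

Using the index directly (`E[context_ids]`) avoids ever building the large one-hot vectors, which is why the connection is described as a lookup.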

**Q :** CBOW's neural network is a **single layer network**. Justify.

* In the basic CBOW architecture, there’s often **no hidden layer between the averaged embeddings and the output layer**, making it a single-layer network.
* However, some implementations may include a hidden layer after the embedding layer to add non-linearity and improve the model's ability to capture complex relationships in the data. 
* If a hidden layer is added, the averaged embedding vector would be transformed through this layer using a weight matrix and an activation function (e.g., ReLU).

**Q :** What is context vector?

* the average of the context word embeddings for any target word is called context vector

**Q :** List out the features of output layer of CBOW NN.

* activation function is softmax
* number of neurons in the output layer is equal to the vocabulary size
* each neuron represents the likelihood of each word being the target word given the context
* each neuron in the output layer corresponds to one word in the vocabulary

**Q :** How does the output layer use the context vector for predicting the target word?

1. The averaged embedding (or output from the hidden layer if there is one) is multiplied by a weight matrix  and a bias term is added. 
2. The result is a vector with a score for each word in the vocabulary.
3. Applying the softmax function to this vector converts the scores into probabilities, with each probability representing the likelihood of a specific word being the target word
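
A minimal NumPy sketch of these three steps follows. The weight matrix `W_out`, bias `b_out` and the context vector are random toy values with hypothetical names, so the predicted index is meaningless; only the shapes and the softmax behaviour matter here.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max()   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(1)
V, d = 8, 3                               # toy vocabulary size and embedding dimension
context_vector = rng.normal(size=d)       # stands in for the averaged embedding
W_out = rng.normal(size=(d, V))           # output weight matrix
b_out = np.zeros(V)                       # output bias

scores = context_vector @ W_out + b_out   # steps 1-2: one score per vocabulary word
probs = softmax(scores)                   # step 3: scores -> probabilities
predicted_index = int(np.argmax(probs))   # most likely target word
```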



**Q :** How is embedding matrix being created during training?

* In some models, like word2vec (CBOW and Skip-gram), the embedding matrix is initialized with random values.
* During training, the model learns embeddings by adjusting these values based on the context words surrounding each target word.
* In CBOW, the embeddings are adjusted to minimize the error in predicting a target word given its context, while Skip-gram does the opposite (predicting context given a target word).
* Through multiple training epochs, these embeddings are updated using backpropagation until they capture the relationships between words based on co-occurrence patterns.
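
The training process described above can be sketched end to end in NumPy. This is a toy illustration, not how production word2vec is implemented: it assumes a full softmax with plain SGD on a one-sentence corpus, and the learning rate, embedding size and epoch count are arbitrary choices. Real implementations typically use speed-ups such as negative sampling.

```python
import numpy as np

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V, d, window, lr = len(vocab), 5, 2, 0.1

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, d))   # embedding matrix, randomly initialized
W = rng.normal(scale=0.1, size=(d, V))   # output layer weights
b = np.zeros(V)                          # output layer bias

# Build (context indices, target index) training pairs
pairs = []
for i, w in enumerate(tokens):
    ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append(([idx[c] for c in ctx], idx[w]))

losses = []
for epoch in range(200):
    total = 0.0
    for ctx, target in pairs:
        h = E[ctx].mean(axis=0)                  # context vector
        scores = h @ W + b
        p = np.exp(scores - scores.max())
        p /= p.sum()                             # softmax probabilities
        total += -np.log(p[target])              # cross-entropy loss
        grad_scores = p.copy()
        grad_scores[target] -= 1.0               # gradient of loss w.r.t. scores
        grad_h = W @ grad_scores                 # gradient w.r.t. context vector
        W -= lr * np.outer(h, grad_scores)       # update output weights
        b -= lr * grad_scores
        # each context embedding receives an equal share of the gradient
        np.subtract.at(E, ctx, lr * grad_h / len(ctx))
    losses.append(total / len(pairs))

print(losses[0], losses[-1])   # the average loss should go down over training
```

After training, the rows of `E` are the word embeddings; the output weights `W` and `b` are usually discarded.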

**Q :** Word2Vec cannot capture the context in which a word appears. Give example.

* Sentence 1 : "I went to the bank to take a bath"
* Sentence 2 : "I went to the bank to apply for a loan"

In the two sentences, the word bank has a different meaning depending on the context. However, word2vec assigns both occurrences of bank the same vector.
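
The limitation can be made concrete with a tiny sketch. The vectors below are hand-written placeholders, not real trained embeddings, and `embed` is a made-up helper; the point is that a static lookup table has no way to use the surrounding sentence.

```python
# Hypothetical stand-ins for trained word2vec vectors: one fixed vector per word.
static_embeddings = {
    "bank": [0.21, -0.47, 0.80],
    "bath": [0.05, 0.33, -0.12],
    "loan": [-0.64, 0.10, 0.58],
}

def embed(sentence, word):
    """Static lookup: the sentence is ignored, which is exactly the limitation."""
    return static_embeddings[word]

v1 = embed("I went to the bank to take a bath", "bank")
v2 = embed("I went to the bank to apply for a loan", "bank")
print(v1 == v2)  # True
```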

**Q :** What features of the word2vec model prevents it from differentiating similar words with context-dependent meanings?

* **Single Vector Per Word**:
    * generates a single embedding (vector) for each unique word in the vocabulary
    * This single embedding captures the overall meaning of the word based on all contexts in which it appears during training.
* **Context-Independent Representation**:
    * The embedding is averaged over various contexts
    * so words with multiple meanings get a "generalized" representation that blends different contexts
    * In the case of "bank," the model learns a representation that might mix meanings like "riverbank" and "financial institution" into a single vector
    * This depends on 
        - how often the word appears in the text
        - and in what contexts it appears
* **Limited Sense Disambiguation**:
    * does not handle polysemy (multiple meanings) 
    * it doesn’t consider the specific sentence-level context each time it encounters the word
    * So, "bank" in a sentence about a river and "bank" in a sentence about finance both get mapped to the same embedding

**Q :** Drawbacks of word2vec technique.

* doesn't capture the sequence (order) of words in a sentence, and thereby the intended expression of the sentence
* this is a limitation of the shallow neural network, which only sees an unordered set of context words within the window
* this can be solved using recurrent neural networks

**Q :** List out the neural networks capable of capturing the sequential info in the data.

* LSTM
* GRU
* RNN


#### **ELMo**

Embeddings from Language Models

**Q :** ELMo **model** is created using _____ **algorithm**.

* Bidirectional LSTM

**Q :** What is the **technique** used by Bi-LSTM for building ELMo model?

* Language modelling

Moving from word2vec to ELMo, each component is replaced as follows:

shallow NN ---> Bi-LSTM (algo)

CBOW ---> Language Modelling (technique)

word2vec ---> ELMo (model)

**Q :** ELMo was not used heavily in the industry. Why?

* Bi-LSTM, RNN and GRU networks process a sequence one token at a time. The number of tokens can be exceptionally huge in real-world data, so the process is highly time consuming (an impactful bottleneck in NLP before 2017)
* speed of processing sequences is too low
* model development is a time-consuming process
* these models were overtaken by the **transformer** architecture developed by Google. It revolutionized the whole field of NLP, paving the way for the brand new idea of Gen AI

### **TRANSFORMERS**

**Q :** What are transformers?

* it is a **deep learning model architecture**
* a model architecture defines the structural design for organizing and processing data within a neural network
* It’s neither a standalone model nor a single algorithm
* it serves as a blueprint for building models like GPT, BERT, and others

**Q :** If I can call a transformer as a modified version of something familiar, what is that familiar something?

* transformers can be seen as a modified version of RNNs & LSTM networks
* transformers keep the sequence-handling strengths of RNNs but replace the sequential processing with a more efficient, attention-based structure

**Modifications**

1. self-attention mechanism
2. Parallel processing

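**Q :** Variants of the transformer algo.

The original transformer has two parts, an encoder and a decoder. Many models use only one of them:

1. Encoder only transformer
2. Decoder only transformer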


**NOTE 01**

MODEL = BERT

ALGO = encoder only transformer

TECH = Language Modelling

**NOTE 02**

MODEL = GPT

ALGO = decoder only transformer

TECH = Language Modelling

**Q :** What is meant by long term dependencies?

* sometimes, to understand the context, we must travel farther from the word

Eg. The bank was overcrowded, with people from all across the country coming to worship


**Q :** What is a possible solution? When was it introduced?

* Attention Mechanism (AM), 2015
* in order to capture the meaning, we don't need to look at all the surrounding words
* just by focusing on a few important words, we can understand more about each word
* AM plays a vital role by learning context-based language representations
* it also helps with the long-term dependency problem
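
To make "focusing on some important words" concrete, here is a minimal NumPy sketch of one attention step. It assumes the scaled dot-product self-attention form used in transformers, which is not necessarily the exact formulation these notes have in mind; the inputs and the projection matrices `Wq`, `Wk`, `Wv` are random toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how relevant each token is to each other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # context-aware vectors and the attention weights

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 4                      # toy sentence length and vector size
X = rng.normal(size=(n_tokens, d_model))      # one input vector per token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = self_attention(X, Wq, Wk, Wv)
print(weights.round(2))   # row i shows how much token i attends to every token
```

Because every token attends to every other token directly, a distant word can receive a large weight, which is how attention helps with long-term dependencies.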


#### **Language Modelling**

**Activity :** Predict the next word !!!

1. I ___

**A :** am / was / have / can / will

2. I am ___

**A :** Abhinaya / a 

3. I am learning ___

**A :** data science / maths



**Q :** The word can be anything. But still we predicted certain specific words only. Why ?

* I have good vocabulary
* I have good language skills
* I know grammar
* I am familiar with usage of words
* I am aware of what words to use based on context

**Q :** Imagine a machine having the above skills you mentioned. What possible tasks it can do then ?

* auto-complete
* text summarizations
* QnA
* Machine translations
* chatbots

**Q :** The above tasks are called ____ tasks.

* Generative

**Q :** A model built on text data is called ____.

* Language model

**Q :** Language models try to learn what ?

* **sequential** relationships between words and sentences of a language in which the text is written
* the language in which the text is written


**Q :** What are the basic types of language modelling techniques?

1. Auto regressive
2. Auto encoding

##### **Auto-regressive**

**Q :** What are auto-regressive lang models trying to do?

* They are trained to predict the next token in a sequence, based on the previous tokens
* a mask is applied to the sentence so that the tokens after the current position are hidden
* unidirectional (forward / left to right through a sentence)
* also called "**next-word prediction**"
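
The sketch below shows what next-token training data could look like. It assumes the sentence is wrapped in start and end markers `<s>` and `</s>`, and it interprets the mask as a causal mask that hides future positions; both are assumptions made for illustration.

```python
tokens = ["<s>", "I", "am", "learning", "language", "modelling", "</s>"]

# Each training example: all tokens seen so far -> the next token to predict
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for prefix, next_token in examples[:3]:
    print(prefix, "->", next_token)

# Causal mask: position i may look at positions 0..i, never at later ones
n = len(tokens)
causal_mask = [[j <= i for j in range(n)] for i in range(n)]
```

Row `i` of `causal_mask` is `True` only for the current and earlier positions, which is what makes the model unidirectional (left to right).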

**Q :** What is masking?

* 

**Q :** How are sentences represented for this purpose ?

```html
<s>I am learning language modelling</s>
```

**Q :** ChatGPT uses what type of language modelling? Why?

Auto-regressive language modelling. While it writes an answer, it generates it word by word, which is nothing but next-word prediction.

**Q :** Auto-regressive model is a type of RL. Justify.

##### **Auto-Encoder**

**Q :** What is main idea behind auto encoding technique?

* they are trained to reconstruct the original sentence from a corrupted version of the input
* Bidirectional
* certain words of a sentence are replaced with a special token, usually "[MASK]"
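
A minimal sketch of how such a corrupted input could be produced is given below. The function name `mask_tokens`, the masking probability and the seed are illustrative choices, not values taken from any particular model.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; labels keep the original words."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            labels.append(tok)       # the model must reconstruct this word
        else:
            corrupted.append(tok)
            labels.append(None)      # unmasked positions are not scored
    return corrupted, labels

tokens = "I am learning language modelling".split()
corrupted, labels = mask_tokens(tokens, mask_prob=0.4, seed=1)
print(corrupted)
print(labels)

# Filling each [MASK] with its label gives back the original sentence
restored = [lab if tok == "[MASK]" else tok for tok, lab in zip(corrupted, labels)]
```

Since the model sees the words on both sides of each `[MASK]`, it can use left and right context together, which is why this technique is bidirectional.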


**Q :** This method is similar to the familiar fill in the blanks question. Justify

**Q :** What is reconstruction?

#### **Transformers**

**Q :** Large in the word LLM represents what?

1. Number of model parameters ie., size of the model
2. Amount of data used for training the model
3. Both 1 & 2

**A :** 1

**Q :** List out some companies, the names of the LLMs developed by them, and the applications that are built on those models.

* OpenAI - GPT 1/2/3/4/4o - ChatGPT
* Meta - Llama
* Google - Gemini
* Microsoft - Phi
* Anthropic - Claude

**Q :** The idea of training transformer algos using the language modelling technique was introduced in a 2017 paper. But it was only in 2020 that these LLMs were developed using this idea. Why?

* Time was taken for development of transfer learning

**Q :** Give a basic definition of transfer learning.

* Using knowledge acquired previously by means of doing some task to solve another related task.

Eg. Task 1 - Riding a bicycle & Task 2 - Riding a motorcycle. 

Here Task 1 gives the ability of balancing 2 wheelers.
   

**Q :** Consider a model trained for language translation task, say English to Hindi translation. Can it do text summarization?

* No, they are entirely different tasks

**Q :** Downstream task vs Pre-training task. 


### **Summary**

||Language representation| Captures |
|----|----|----|
|1| BOW | frequency or word counts|
|2| TF-IDF | word importance|
|3| word2vec |semantic meaning by learning relationships b/w different words|
|4| BERT | word context & sequence|


|  |Model|Algorithm|Technique|Description|
|-----|-----|-----|-----|-----|
|1|Word2Vec|shallow ANN|CBOW/Skip gram||
|2|Embeddings from Language Models (ELMo)|Bidirectional Long Short-Term Memory|Language modelling|time consuming|
|3|Bidirectional Encoder Representations from Transformers |Encoder Only Transformer|Auto-encoding Language modelling||
|4|Generative Pre-trained Transformers|Decoder Only Transformer|Auto-regressive Language modelling||


<img src="Images/vectortechs.jpeg" alt="image description" width=700 height=400>

* Gemini's AI is free for a few initial requests
* the setup & API key process is a bit different from that of OpenAI

Q : Which website to visit in order to play with Google AI?

Q : How to set up the API keys in Google AI?

Q : GitHub repo for Google AI basics?

A : Gen AI scratch 2 advance by that ai guy/Google ai walk through
