---   
<img align="left" width="110"   src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg"> 
<h1 align="center">Tools and Techniques for Data Science</h1>
<h1 align="center">Course: Natural Language Processing</h1>

--- 
<h2><div align="right">Muhammad Sheraz (Data Scientist)</div></h2>
<h1 align="center">Lecture 4: (Text Representation)</h1>

<img align="center" width="1100"  src="../images/phase3.png"  > 

<div>

</div>

## Learning agenda of this notebook

1. **Feature Extraction**
    - Label Encoding
    - One Hot Encoding   
    - Bag of Words (BoW) Representation
    - Creating Bag of N-grams
    - Term Frequency-Inverse Document Frequency

## Feature Engineering

<img    src='images/Text Representation.png'>

## Label Encoding

- **Definition:**
  - Label encoding is a technique used in natural language processing (NLP) to assign numerical labels to categorical variables. It is commonly applied to represent text data, such as unique words in a vocabulary, as numerical values.

- **Process:**
  - Assign a unique numerical label to each category (word) in the dataset.
  - Labels are typically assigned in ascending order starting from 0.

- **Key Components:**
  - **Encoded Labels:** The set of unique words across all documents mapped to numerical labels.

- **Pros:**
  - Simple and straightforward method to represent categorical variables as numerical values.
  - Compatible with various machine learning algorithms.

- **Cons:**
  - Does not capture any inherent ordinal relationships between categories.
  - May introduce unintended ordinality in certain algorithms.

- **Use Cases:**
  - Commonly used in preprocessing text data for machine learning tasks such as classification and regression.
  - Suitable for scenarios where nominal categories need to be represented numerically.

- **Libraries:**
  - Python libraries like scikit-learn provide tools (e.g., `LabelEncoder`) for easy implementation of label encoding.

- **Considerations:**
  - Ensure that label encoding is appropriate for the specific task and dataset.
  - Handle unseen categories gracefully, either by ignoring them or assigning a special label.
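
The process above, including a fallback for unseen words, can be sketched in plain Python. This is a minimal illustration and not scikit-learn's `LabelEncoder`; the helper names `build_label_encoding` and `encode`, and the choice of `-1` as the label for unseen words, are assumptions made for this example.

```python
def build_label_encoding(documents):
    """Assign labels 0, 1, 2, ... to words in order of first appearance."""
    mapping = {}
    for document in documents:
        for word in document.split():
            if word not in mapping:
                mapping[word] = len(mapping)
    return mapping


def encode(document, mapping, unknown_label=-1):
    """Replace each word by its label; unseen words get `unknown_label`."""
    return [mapping.get(word, unknown_label) for word in document.split()]


documents = [
    "Muhammad Sheraz is a Student of Data Science",
    "He is Learning Natural Language Processing",
    "He is very good in Machine Learning",
]

label_encoding = build_label_encoding(documents)
encoded_second = encode(documents[1], label_encoding)
encoded_unseen = encode("He is Teaching", label_encoding)  # 'Teaching' was never seen

print(encoded_second)
print(encoded_unseen)
```

Reserving a dedicated label for unseen words keeps encoding from failing on new text, at the cost of mapping every unknown word to the same value.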

### Encoded Labels:

- Vocabulary Label Encoding:
  - Muhammad: 0
  - Sheraz: 1
  - is: 2
  - a: 3
  - Student: 4
  - of: 5
  - Data: 6
  - Science: 7
  - He: 8
  - Learning: 9
  - Natural: 10
  - Language: 11
  - Processing: 12
  - very: 13
  - good: 14
  - in: 15
  - Machine: 16


In [33]:
import pandas as pd

# Define the vocabulary label encoding
label_encoding = {
    'Muhammad': 0,
    'Sheraz': 1,
    'is': 2,
    'a': 3,
    'Student': 4,
    'of': 5,
    'Data': 6,
    'Science': 7,
    'He': 8,
    'Learning': 9,
    'Natural': 10,
    'Language': 11,
    'Processing': 12,
    'very': 13,
    'good': 14,
    'in': 15,
    'Machine': 16
}

# Sample dataframe
data = {'Text': ['Muhammad Sheraz is a Student of Data Science',
                  'He is Learning Natural Language Processing',
                  'He is very good in Machine Learning'],
        'Output': [1, 0, 1]}  # Adding a binary output column

df = pd.DataFrame(data)

# Apply label encoding to the text column
encoded_text = df['Text'].apply(lambda x: ' '.join(str(label_encoding.get(word, word)) for word in x.split()))

# Combine the encoded text with the output column
encoded_df = pd.DataFrame({'Text': encoded_text, 'Output': df['Output']})

encoded_df


Unnamed: 0,Text,Output
0,0 1 2 3 4 5 6 7,1
1,8 2 9 10 11 12,0
2,8 2 13 14 15 16 9,1


In [34]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create a sample dataframe
data = {'Text': ['"Muhammad Sheraz is a Student of Data Science."',
                 'He is Learning Natural Language Processing.',
                 'He is very good in Machine Learning.'],
        'Output': [1, 0, 1]}  # Adding a binary output column

df = pd.DataFrame(data)

df


Unnamed: 0,Text,Output
0,"""Muhammad Sheraz is a Student of Data Science.""",1
1,He is Learning Natural Language Processing.,0
2,He is very good in Machine Learning.,1


In [35]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Tokenization using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Text'])

# Convert sparse matrix to DataFrame
dtm_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

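# Apply label encoding to each column; this encodes the count values in that column,
# not the words, so a constant column such as 'is' (always 1) becomes all 0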
label_encoder = LabelEncoder()
encoded_df = dtm_df.apply(label_encoder.fit_transform)

# Combine the encoded DataFrame with the output column
#encoded_df['Output'] = df['Output']

encoded_df


Unnamed: 0,data,good,he,in,is,language,learning,machine,muhammad,natural,of,processing,science,sheraz,student,very
0,1,0,0,0,0,0,0,0,1,0,1,0,1,1,1,0
1,0,0,1,0,0,1,1,0,0,1,0,1,0,0,0,0
2,0,1,1,1,0,0,1,1,0,0,0,0,0,0,0,1


## One-Hot Encoding Representation

- **Definition:**
  - One-hot encoding is a technique used in natural language processing (NLP) to represent categorical variables as binary vectors.
  - It creates a binary vector where each element corresponds to a unique word in the vocabulary, indicating its presence or absence in a document.

- **Process:**
  - **Tokenization:** Breaks text into individual words or tokens.
  - **Lowercasing:** Converts all words to lowercase to ensure uniformity.
  - **Vocabulary Creation:** Build a vocabulary of unique words across all documents.
  - **One-Hot Encoding:** Represent each document as a binary vector, where each element corresponds to a word in the vocabulary. If a word is present in the document, its corresponding element is set to 1; otherwise, it's set to 0.

- **Key Components:**
  - **Binary Document-Term Matrix:** The resulting matrix where each row represents a document, and each column represents a unique word, with binary values indicating word presence or absence.
  - **Vocabulary:** The set of unique words across all documents.

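- **Pros:**
  - Simple to compute and interpret: each vector element corresponds to exactly one vocabulary word.
  - Records only presence or absence, so repeated words do not dominate a document's representation.

- **Cons:**
  - Results in high-dimensional sparse matrices, especially with large vocabularies.
  - Discards word order, word frequency, and semantic similarity between words.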

- **Use Cases:**
  - Useful when word presence is more important than word frequency.
  - Commonly used in text classification and neural network-based models.

- **Libraries:**
  - Libraries like scikit-learn in Python provide tools for implementing one-hot encoding.

- **Considerations:**
  - Suitable for scenarios where each word is considered independent of others.
  - May require additional techniques like dimensionality reduction for handling large vocabularies.



<img src='images/ohe1.png'>

### One-Hot Encoding Example

Consider a corpus with three documents:

1. **Document 1**: "Muhammad Sheraz is a Student of Data Science."
2. **Document 2**: "He is Learning Natural Language Processing."
3. **Document 3**: "He is very good in Machine Learning."

#### Step 1: Tokenization

- Unique words across all documents:
  - {Muhammad, Sheraz, is, a, Student, of, Data, Science, He, Learning, Natural, Language, Processing, very, good, in, Machine}

#### Step 2: One-Hot Encoding

- Represent each word as a binary vector indicating its presence in each document:

| Word       | Document 1 | Document 2 | Document 3 |
|------------|------------|------------|------------|
| Muhammad   | 1          | 0          | 0          |
| Sheraz     | 1          | 0          | 0          |
| is         | 1          | 1          | 1          |
| a          | 1          | 0          | 0          |
| Student    | 1          | 0          | 0          |
| of         | 1          | 0          | 0          |
| Data       | 1          | 0          | 0          |
| Science    | 1          | 0          | 0          |
| He         | 0          | 1          | 1          |
| Learning   | 0          | 1          | 1          |
| Natural    | 0          | 1          | 0          |
| Language   | 0          | 1          | 0          |
| Processing | 0          | 1          | 0          |
| very       | 0          | 0          | 1          |
| good       | 0          | 0          | 1          |
| in         | 0          | 0          | 1          |
| Machine    | 0          | 0          | 1          |

The one-hot encoded representation captures the presence of each word in each document.
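
The steps of this example can be reproduced with a short, dependency-free sketch. It is an illustration only, not scikit-learn's implementation: the `tokenize` and `one_hot` helpers are hypothetical names, and the `\w+` pattern is an assumed tokenization rule that keeps single-character words such as "a".

```python
import re

corpus = [
    "Muhammad Sheraz is a Student of Data Science.",
    "He is Learning Natural Language Processing.",
    "He is very good in Machine Learning.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


# Vocabulary: sorted list of unique tokens across all documents
vocabulary = sorted({token for document in corpus for token in tokenize(document)})


def one_hot(document):
    """Binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokenize(document))
    return [1 if word in present else 0 for word in vocabulary]


one_hot_matrix = [one_hot(document) for document in corpus]

print(vocabulary)
for document_vector in one_hot_matrix:
    print(document_vector)
```

Each row is one document and each column one vocabulary word, so this matrix is the transpose of the word-by-document table shown in Step 2.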


## Bag of Words (BoW) Representation

- **Definition:**
  - BoW is a common technique in natural language processing (NLP) to represent text data as a numerical matrix.
  - It focuses on the occurrence and frequency of words in a document, disregarding grammar and word order.

- **Process:**
  - **Tokenization:** Breaks text into individual words or tokens.
  - **Lowercasing:** Converts all words to lowercase to ensure uniformity.
  - **Stopword Removal (optional):** Eliminates common words (e.g., 'the', 'is') that add little meaning.
  - **Counting:** Creates a matrix where each row represents a document, and each column represents a unique word, counting the occurrences.

- **Key Components:**
  - **Document-Term Matrix (DTM):** The resulting matrix showing the frequency of each word in each document.
  - **Vocabulary:** The set of unique words across all documents.

- **Pros:**
  - Simple and computationally efficient.
  - Captures important term frequencies for basic text analysis.

- **Cons:**
  - Sparsity: most entries of the document-term matrix are zero, especially with large vocabularies.
  - Ignores word order and semantics.
  - Doesn't consider relationships between words (e.g., 'good' and 'great' are treated as separate entities).

- **Use Cases:**
  - Commonly used in text classification, sentiment analysis, and information retrieval.
  - Foundation for more advanced NLP techniques like TF-IDF and word embeddings.

- **Libraries:**
  - Popular libraries like scikit-learn in Python provide tools (e.g., `CountVectorizer`) for easy implementation.

- **Considerations:**
  - Customize preprocessing steps and hyperparameters based on specific needs.
  - May require additional techniques for handling large vocabularies or improving semantic understanding.


In [37]:
from sklearn.feature_extraction.text import CountVectorizer

# Define the corpus
corpus = [
    "Muhammad Sheraz is a Student of Data Science.",
    "He is Learning Natural Language Processing.",
    "He is very good in Machine Learning."
]

# Initialize CountVectorizer for one-hot encoding
# (binary=True records presence/absence instead of counts)
vectorizer = CountVectorizer(binary=True)

# Fit the vectorizer on the corpus
vectorizer.fit(corpus)

# Transform the corpus into a one-hot encoded matrix
one_hot_encoded_matrix = vectorizer.transform(corpus).toarray()

# Get the vocabulary (unique words) and their respective indices
vocabulary = vectorizer.get_feature_names_out()

# Display the one-hot encoded matrix and vocabulary
print("One-Hot Encoded Matrix:")
print(one_hot_encoded_matrix)
print("\nVocabulary:")
print(vocabulary)


One-Hot Encoded Matrix:
[[1 0 0 0 1 0 0 0 1 0 1 0 1 1 1 0]
 [0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0]
 [0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1]]

Vocabulary:
['data' 'good' 'he' 'in' 'is' 'language' 'learning' 'machine' 'muhammad'
 'natural' 'of' 'processing' 'science' 'sheraz' 'student' 'very']


### Bag of Words (BoW) Example

Consider a corpus with three documents:

1. Document 1: "Muhammad Sheraz is a Student of Data Science."
2. Document 2: "He is Learning Natural Language Processing."
3. Document 3: "He is very good in Machine Learning."

#### Step 1: Tokenization

- Unique words across all documents:
  - {Muhammad, Sheraz, is, a, Student, of, Data, Science, He, Learning, Natural, Language, Processing, very, good, in, Machine}

#### Step 2: Create Vocabulary

- Assign an index to each unique word:
  - Vocabulary: {Muhammad: 0, Sheraz: 1, is: 2, a: 3, Student: 4, of: 5, Data: 6, Science: 7, He: 8, Learning: 9, Natural: 10, Language: 11, Processing: 12, very: 13, good: 14, in: 15, Machine: 16}

#### Step 3: Document-Term Matrix (DTM)

- Represent each document as a vector of word frequencies based on the vocabulary:

| Document | Muhammad | Sheraz | is | a | Student | of | Data | Science | He | Learning | Natural | Language | Processing | very | good | in | Machine |
|----------|----------|--------|----|---|---------|----|------|---------|----|----------|---------|----------|------------|------|------|----|---------|
| 1        | 1        | 1      | 1  | 1 | 1       | 1  | 1    | 1       | 0  | 0        | 0       | 0        | 0          | 0    | 0    | 0  | 0       |
| 2        | 0        | 0      | 1  | 0 | 0       | 0  | 0    | 0       | 1  | 1        | 1       | 1        | 1          | 0    | 0    | 0  | 0       |
| 3        | 0        | 0      | 1  | 0 | 0       | 0  | 0    | 0       | 1  | 1        | 0       | 0        | 0          | 1    | 1    | 1  | 1       |

The Document-Term Matrix (DTM) captures the frequency of each word in each document, forming the basis of the Bag of Words (BoW) representation.
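
In the corpus above every word occurs at most once per document, so the counts look identical to a one-hot encoding. The sketch below uses two made-up sentences with a repeated word to show where the two representations differ. The `bag_of_words` helper is a hypothetical, plain-Python illustration and not the scikit-learn implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    counts = [Counter(tokenize(document)) for document in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


# Illustrative documents (not part of the lecture corpus); 'good' occurs twice
vocabulary, matrix = bag_of_words(["good food is good", "food is ready"])

print(vocabulary)
print(matrix)
```

The entry for "good" in the first document is 2, whereas a binary (one-hot) representation would record only 1.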


In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create a sample dataframe
data = {'Text': ['"Muhammad Sheraz is a Student of Data Science."',
                 'He is Learning Natural Language Processing.',
                 'He is very good in Machine Learning.'],
        'Output': [1, 0, 1]}  # Adding a binary output column

df = pd.DataFrame(data)

df


Unnamed: 0,Text,Output
0,"""Muhammad Sheraz is a Student of Data Science.""",1
1,He is Learning Natural Language Processing.,0
2,He is very good in Machine Learning.,1


In [2]:
# Apply bag-of-words representation with specified hyperparameters
vectorizer = CountVectorizer(
    lowercase=True,      # Convert all characters to lowercase
    #stop_words='english', # Remove common English stop words
    max_features=None,    # Keep all unique words (no limit on features)
    binary=False,         # Count occurrences (binary=False) or presence (binary=True)
    ngram_range=(1, 1)    # Use unigrams (single words), can be adjusted for bigrams, trigrams, etc.
)

bow_matrix = vectorizer.fit_transform(df['Text'])

# Convert the bag-of-words matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
bow_df

Unnamed: 0,data,good,he,in,is,language,learning,machine,muhammad,natural,of,processing,science,sheraz,student,very
0,1,0,0,0,1,0,0,0,1,0,1,0,1,1,1,0
1,0,0,1,0,1,1,1,0,0,1,0,1,0,0,0,0
2,0,1,1,1,1,0,1,1,0,0,0,0,0,0,0,1


In [8]:
bow_matrix.toarray()

array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0],
       [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)

In [9]:
bow_matrix[0].toarray()

array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0]], dtype=int64)

In [10]:
bow_matrix[2].toarray()

array([[0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)

In [11]:
# Words outside the fitted vocabulary ('document', 'the', 'final') are ignored; only 'is' is counted
vectorizer.transform(['document is the final document']).toarray()

array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

## Understanding `vectorizer.vocabulary_` in `CountVectorizer`

In the scikit-learn `CountVectorizer` class, `vectorizer.vocabulary_` is an attribute that provides a mapping between terms (words) and their indices in the bag-of-words matrix. Here's how it works:

### Vocabulary Building Process

1. **Tokenization:**
   - The text data is tokenized, breaking it into individual words or tokens.

2. **Lowercasing:**
   - All words are converted to lowercase to ensure consistency.

3. **Stopword Removal (if specified):**
   - Common English stop words (e.g., 'the', 'is', 'and') may be removed based on the `stop_words` parameter.

4. **Building the Vocabulary:**
   - Each unique word in the preprocessed text data is assigned a unique index. scikit-learn assigns these indices in sorted (alphabetical) order of the terms.
   - The resulting vocabulary is stored in a dictionary, where the keys are the words, and the values are their corresponding column indices.
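
The four steps can be sketched without scikit-learn. The `build_vocabulary` helper below is a hypothetical illustration that mimics, under stated assumptions, the behaviour described here: tokens are runs of two or more word characters, terms are indexed in sorted order, and the stop-word set in the second call is hand-picked for this example (it is not scikit-learn's English list).

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def build_vocabulary(documents, stop_words=frozenset()):
    """Map each unique lowercase term to an index, in sorted term order."""
    terms = set()
    for document in documents:
        for token in TOKEN_PATTERN.findall(document.lower()):
            if token not in stop_words:
                terms.add(token)
    return {term: index for index, term in enumerate(sorted(terms))}


corpus = [
    "Muhammad Sheraz is a Student of Data Science.",
    "He is Learning Natural Language Processing.",
    "He is very good in Machine Learning.",
]

vocabulary = build_vocabulary(corpus)
# Hand-picked stop words, chosen only for illustration
filtered = build_vocabulary(corpus, stop_words={"he", "is", "of", "in", "very"})

print(vocabulary)
print(filtered)
```

Because the indices depend on the sorted set of surviving terms, removing stop words changes the index of every term that sorts after a removed word.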



### Text data:

1. "Muhammad Sheraz is a Student of Data Science."
2. "He is Learning Natural Language Processing."
3. "He is very good in Machine Learning."

After tokenization, lowercasing, and stopword removal, we get a set of unique words:
{'muhammad', 'sheraz', 'student', 'data', 'science', 'learning', 'natural', 'language', 'processing', 'good', 'machine'}

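An illustrative vocabulary dictionary, indexed here by order of first appearance (scikit-learn would instead index the terms in alphabetical order):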
{
 'muhammad': 0,
 'sheraz': 1,
 'student': 2,
 'data': 3,
 'science': 4,
 'learning': 5,
 'natural': 6,
 'language': 7,
 'processing': 8,
 'good': 9,
 'machine': 10
}


## Example of Vocabulary Building in Bag-of-Words (BoW)

Consider the following text data:

1. Document 1: "Muhammad Sheraz is a Student of Data Science."
2. Document 2: "He is Learning Natural Language Processing."
3. Document 3: "He is very good in Machine Learning."

### Step 1: Tokenization

Tokenize the text into individual words, dropping punctuation:

1. Document 1: ["Muhammad", "Sheraz", "is", "a", "Student", "of", "Data", "Science"]
2. Document 2: ["He", "is", "Learning", "Natural", "Language", "Processing"]
3. Document 3: ["He", "is", "very", "good", "in", "Machine", "Learning"]

### Step 2: Lowercasing

Convert all words to lowercase:

1. Document 1: ["muhammad", "sheraz", "is", "a", "student", "of", "data", "science"]
2. Document 2: ["he", "is", "learning", "natural", "language", "processing"]
3. Document 3: ["he", "is", "very", "good", "in", "machine", "learning"]

### Step 3: Vocabulary Building

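Build the vocabulary by assigning a unique index to each unique word (here in order of first appearance):

```markdown
{
 'muhammad': 0,
 'sheraz': 1,
 'is': 2,
 'a': 3,
 'student': 4,
 'of': 5,
 'data': 6,
 'science': 7,
 'he': 8,
 'learning': 9,
 'natural': 10,
 'language': 11,
 'processing': 12,
 'very': 13,
 'good': 14,
 'in': 15,
 'machine': 16
}
```

scikit-learn's `CountVectorizer` yields a different mapping for the same corpus: its default `token_pattern` drops single-character tokens such as 'a', and its indices follow the alphabetical order of the terms.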


In [3]:
vectorizer.vocabulary_

{'muhammad': 8,
 'sheraz': 13,
 'is': 4,
 'student': 14,
 'of': 10,
 'data': 0,
 'science': 12,
 'he': 2,
 'learning': 6,
 'natural': 9,
 'language': 5,
 'processing': 11,
 'very': 15,
 'good': 1,
 'in': 3,
 'machine': 7}

In [4]:
result_df = pd.concat([df, bow_df], axis=1)
result_df

Unnamed: 0,Text,Output,data,good,he,in,is,language,learning,machine,muhammad,natural,of,processing,science,sheraz,student,very
0,"""Muhammad Sheraz is a Student of Data Science.""",1,1,0,0,0,1,0,0,0,1,0,1,0,1,1,1,0
1,He is Learning Natural Language Processing.,0,0,0,1,0,1,1,1,0,0,1,0,1,0,0,0,0
2,He is very good in Machine Learning.,1,0,1,1,1,1,0,1,1,0,0,0,0,0,0,0,1


## CountVectorizer Hyperparameters

- **`lowercase` (default=True):**
  - *Description:* Converts all text to lowercase. Helps in treating words with different cases as the same.
  - *Default Value:* True
  - *Use Case:* Set to False if you want to preserve the case sensitivity of words.

- **`stop_words` (default=None):**
  - *Description:* Removes common English stop words (e.g., 'the', 'is', 'and') to focus on more meaningful words.
  - *Default Value:* None
  - *Use Case:* Pass 'english' to remove common English stop words. Custom stop words can be provided as a list.

- **`max_features` (default=None):**
  - *Description:* Limits the number of unique words to consider. If specified, the most frequent words are selected.
  - *Default Value:* None
  - *Use Case:* Set a specific number to limit the vocabulary size, helpful when dealing with large datasets or when focusing on top words.

- **`binary` (default=False):**
  - *Description:* If True, the matrix representation is binary (1 if the word is present, 0 if not). If False, it counts the occurrences.
  - *Default Value:* False
  - *Use Case:* Set to True for binary representation when only presence/absence matters, not the frequency.

- **`ngram_range` (default=(1, 1)):**
  - *Description:* Specifies the range of n-grams to consider. For example, (1, 1) considers only unigrams, (1, 2) considers unigrams and bigrams, etc.
  - *Default Value:* (1, 1)
  - *Use Case:* Adjust to capture more context by including bigrams (or trigrams) in addition to unigrams.

- **`tokenizer` (default=None):**
  - *Description:* Custom function for tokenization. If None, it uses the default tokenizer.
  - *Default Value:* None
  - *Use Case:* Provide a custom tokenizer function if the default tokenization is not suitable for your data.

- **`preprocessor` (default=None):**
  - *Description:* Custom function applied to each document before tokenization and stop word removal.
  - *Default Value:* None
  - *Use Case:* Use when additional preprocessing is required before tokenization.

- **`max_df` (default=1.0):**
  - *Description:* Ignores terms that have a document frequency strictly higher than the specified threshold (float or integer).
  - *Default Value:* 1.0
  - *Use Case:* Exclude words that are too common and may not provide meaningful information.

- **`min_df` (default=1):**
  - *Description:* Ignores terms that have a document frequency strictly lower than the specified threshold (float or integer).
  - *Default Value:* 1
  - *Use Case:* Exclude words that are too rare and may not contribute much to the analysis.

- **`vocabulary` (default=None):**
  - *Description:* List of words to consider. If not None, it ignores all terms that are not in this list.
  - *Default Value:* None
  - *Use Case:* Provide a custom vocabulary list to restrict the features to a predefined set.

- **`strip_accents` (default=None):**
  - *Description:* Remove accents during the preprocessing step.
  - *Default Value:* None
  - *Use Case:* Set to 'ascii' or 'unicode' to remove accents from words.

- **`token_pattern` (default=r"(?u)\b\w\w+\b"):**
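  - *Description:* Regular expression defining what constitutes a 'word'. The default matches tokens of two or more alphanumeric characters, so single-character words such as 'a' are dropped and punctuation acts as a separator. Only used when `analyzer='word'`.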
  - *Default Value:* r"(?u)\b\w\w+\b"
  - *Use Case:* Customize the pattern to suit specific tokenization requirements.

- **`analyzer` (default='word'):**
  - *Description:* Determines whether the feature should be made of word n-gram or character n-grams.
  - *Default Value:* 'word'
  - *Use Case:* Set to 'char' or 'char_wb' for character n-grams instead of word n-grams.

- **`dtype` (default=np.int64):**
  - *Description:* Type of the matrix returned.
  - *Default Value:* np.int64
  - *Use Case:* Adjust if a different data type for the matrix is required.

- **`input` (default='content'):**
  - *Description:* 'content' interprets input as a collection of raw text documents.
  - *Default Value:* 'content'
  - *Use Case:* Typically, there's no need to change this unless the input format is different.

These hyperparameters offer flexibility in customizing the behavior of the `CountVectorizer` based on specific requirements and characteristics of the input text data. Adjusting these hyperparameters allows for fine-tuning the Bag of Words representation.
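
To make the `min_df` and `max_df` thresholds concrete, here is a plain-Python sketch of document-frequency filtering. The function `filter_by_document_frequency` is a hypothetical helper written for this example, not part of scikit-learn; it assumes that integer thresholds are absolute document counts and float thresholds are proportions of the corpus, as described above.

```python
import re
from collections import Counter


def filter_by_document_frequency(documents, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integer thresholds are absolute document counts; float thresholds are
    proportions of the number of documents.
    """
    n_documents = len(documents)
    document_frequency = Counter()
    for document in documents:
        # set(...) so that a term is counted at most once per document
        document_frequency.update(set(re.findall(r"\b\w\w+\b", document.lower())))

    low = min_df * n_documents if isinstance(min_df, float) else min_df
    high = max_df * n_documents if isinstance(max_df, float) else max_df
    return sorted(
        term for term, frequency in document_frequency.items()
        if low <= frequency <= high
    )


corpus = [
    "Muhammad Sheraz is a Student of Data Science.",
    "He is Learning Natural Language Processing.",
    "He is very good in Machine Learning.",
]

all_terms = filter_by_document_frequency(corpus)
kept = filter_by_document_frequency(corpus, min_df=2, max_df=0.9)

print(all_terms)
print(kept)
```

With `min_df=2` and `max_df=0.9`, "is" is dropped because it appears in all three documents (more than 90% of the corpus), and every word that appears in only one document is dropped as too rare.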


In [5]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create an advanced sample dataframe
data = {
        'Text': ['This is the first book about a thrilling adventure.',
                 'Book2 is a non-fiction work by AuthorB.',
                 'The third book is a fictional story by AuthorA.',
                 'AuthorC wrote a mystery novel in Book4.']}
df_advanced = pd.DataFrame(data)

# Display the advanced sample dataframe
df_advanced



Unnamed: 0,Text
0,This is the first book about a thrilling adven...
1,Book2 is a non-fiction work by AuthorB.
2,The third book is a fictional story by AuthorA.
3,AuthorC wrote a mystery novel in Book4.


In [6]:
# Apply bag-of-words representation
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
bow_matrix = vectorizer.fit_transform(df_advanced['Text'])

# Convert the bag-of-words matrix to a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())

bow_df

Unnamed: 0,adventure,authora,authorb,authorc,book,book2,book4,fiction,fictional,mystery,non,novel,story,thrilling,work,wrote
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0
2,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0
3,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1


In [7]:
vectorizer.vocabulary_

{'book': 4,
 'thrilling': 13,
 'adventure': 0,
 'book2': 5,
 'non': 10,
 'fiction': 7,
 'work': 14,
 'authorb': 2,
 'fictional': 8,
 'story': 12,
 'authora': 1,
 'authorc': 3,
 'wrote': 15,
 'mystery': 9,
 'novel': 11,
 'book4': 6}

In [8]:
result_df_advanced = pd.concat([df_advanced, bow_df], axis=1)

result_df_advanced

Unnamed: 0,Text,adventure,authora,authorb,authorc,book,book2,book4,fiction,fictional,mystery,non,novel,story,thrilling,work,wrote
0,This is the first book about a thrilling adven...,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,Book2 is a non-fiction work by AuthorB.,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0
2,The third book is a fictional story by AuthorA.,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0
3,AuthorC wrote a mystery novel in Book4.,0,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1


## Creating N-grams
- **What are n-grams?** 
    - A sequence of n consecutive words; for example, a bigram (n = 2) or a trigram (n = 3).
- **Why use n-grams?** 
    - Capture contextual information (`good food` carries more meaning than `good` and `food` observed independently).
    - Applications of n-grams:
        - Sentence completion
        - Automatic spell checking and correction
        - Automatic grammar checking and correction
    - Is there a perfect value of n?
        - Different values of n suit different applications. Try several values on your data before concluding which one works best for your text analysis.

- **How to create n-grams?** 

In [9]:
import nltk
mystr = "Allama Iqbal was a visionary philosopher and politician. Thank you"
tokens = nltk.tokenize.word_tokenize(mystr)
bgs = nltk.bigrams(tokens)
print(bgs)
for grams in bgs:
    print(grams)

<generator object bigrams at 0x000002CF2D3778B0>
('Allama', 'Iqbal')
('Iqbal', 'was')
('was', 'a')
('a', 'visionary')
('visionary', 'philosopher')
('philosopher', 'and')
('and', 'politician')
('politician', '.')
('.', 'Thank')
('Thank', 'you')


>- The number of n-grams in a document is **`X - N + 1`**, where `X` is the number of tokens in the document and `N` is the number of tokens in each n-gram. The example sentence has 11 tokens (the period counts as a token), so it yields 10 bigrams:
\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 2 + 1 \hspace{0.5cm} = \hspace{0.5cm} 10
\end{equation}
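
The formula can be checked with a few lines of plain Python. This sketch does not use NLTK: `make_ngrams` is a hypothetical helper, and the token list is typed in by hand to match the tokens that appear in the bigram output above.

```python
def make_ngrams(tokens, n):
    """Return all n-grams (as tuples) over consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Tokens of the example sentence, copied from the bigram output
tokens = ["Allama", "Iqbal", "was", "a", "visionary", "philosopher",
          "and", "politician", ".", "Thank", "you"]

for n in (2, 3, 4):
    grams = make_ngrams(tokens, n)
    # The number of n-grams equals X - N + 1 for every n
    assert len(grams) == len(tokens) - n + 1
    print(n, len(grams))

bigrams = make_ngrams(tokens, 2)
```

When `n` exceeds the number of tokens, the helper returns an empty list, which matches the formula giving a non-positive count.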


In [65]:
tgs = nltk.trigrams(tokens)
for grams in tgs:
    print(grams)

('Allama', 'Iqbal', 'was')
('Iqbal', 'was', 'a')
('was', 'a', 'visionary')
('a', 'visionary', 'philosopher')
('visionary', 'philosopher', 'and')
('philosopher', 'and', 'politician')
('and', 'politician', '.')
('politician', '.', 'Thank')
('.', 'Thank', 'you')


\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 3 + 1 \hspace{0.5cm} = \hspace{0.5cm} 9
\end{equation}


In [66]:
ngrams = nltk.ngrams(tokens, 4)
for grams in ngrams:
    print(grams)

('Allama', 'Iqbal', 'was', 'a')
('Iqbal', 'was', 'a', 'visionary')
('was', 'a', 'visionary', 'philosopher')
('a', 'visionary', 'philosopher', 'and')
('visionary', 'philosopher', 'and', 'politician')
('philosopher', 'and', 'politician', '.')
('and', 'politician', '.', 'Thank')
('politician', '.', 'Thank', 'you')
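
Applying the same `X - N + 1` formula to the 4-grams, with the 11 tokens of the example sentence, gives the 8 tuples printed above:
\begin{equation}
    \text{Count of N-grams} \hspace{0.5cm} = \hspace{0.5cm} 11 - 4 + 1 \hspace{0.5cm} = \hspace{0.5cm} 8
\end{equation}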


## N-Grams in Bag of Words (BoW)

- **Definition:**
  - N-Grams extend the BoW representation by considering sequences of 'n' consecutive words as a single feature.
  - Unigrams (1-grams) are single words, bigrams (2-grams) are pairs of consecutive words, trigrams (3-grams) are triplets, and so on.

- **Enhancements:**
  - **Contextual Information:** Captures the context and relationship between adjacent words.
  - **Increased Complexity:** Larger 'n' introduces more features, leading to a richer representation.

- **Use Cases:**
  - Useful in tasks where word order matters, such as language modeling and certain types of text analysis.
  - Provides more nuanced information for understanding the meaning of phrases.

- **Implementation:**
  - Supported in libraries like scikit-learn through the `ngram_range` parameter in `CountVectorizer`.

- **Considerations:**
  - Larger 'n' increases the dimensionality of the feature space, which may impact computational efficiency.
  - Finding the right balance between granularity and complexity is crucial.



### N-Grams in Bag of Words (BoW) Example

Consider a corpus with three documents:

1. Document 1: "Muhammad Sheraz is a Student of Data Science."
2. Document 2: "He is Learning Natural Language Processing."
3. Document 3: "He is very good in Machine Learning."

#### Step 1: Tokenization and N-Gram Formation

- Unique unigrams and bigrams across all documents:
  - {Muhammad, Sheraz, is, a, Student, of, Data, Science, He, Learning, Natural, Language, Processing, very, good, in, Machine, Muhammad Sheraz, Sheraz is, is a, a Student, Student of, of Data, Data Science, He is, is Learning, Learning Natural, Natural Language, Language Processing, is very, very good, good in, in Machine, Machine Learning}

#### Step 2: Create Vocabulary

- Assign an index to each unique n-gram:
  - Vocabulary: {Muhammad: 0, Sheraz: 1, is: 2, a: 3, Student: 4, of: 5, Data: 6, Science: 7, He: 8, Learning: 9, Natural: 10, Language: 11, Processing: 12, very: 13, good: 14, in: 15, Machine: 16, Muhammad Sheraz: 17, Sheraz is: 18, is a: 19, a Student: 20, Student of: 21, of Data: 22, Data Science: 23, He is: 24, is Learning: 25, Learning Natural: 26, Natural Language: 27, Language Processing: 28, is very: 29, very good: 30, good in: 31, in Machine: 32, Machine Learning: 33}

#### Step 3: Document-Term Matrix (DTM) for N-Grams

- Represent each document as a vector of n-gram frequencies based on the vocabulary:

| Document | Muhammad | Sheraz | is | a | Student | of | Data | Science | He | Learning | Natural | Language | Processing | very | good | in | Machine | Muhammad Sheraz | Sheraz is | is a | a Student | Student of | of Data | Data Science | He is | is Learning | Learning Natural | Natural Language | Language Processing | is very | very good | good in | in Machine | Machine Learning |
|----------|----------|--------|----|---|---------|----|------|---------|----|----------|---------|----------|------------|------|------|----|---------|-----------------|-----------|------|-----------|------------|---------|--------------|-------|-------------|------------------|------------------|---------------------|---------|-----------|---------|------------|------------------|
| 1        | 1        | 1      | 1  | 1 | 1       | 1  | 1    | 1       | 0  | 0        | 0       | 0        | 0          | 0    | 0    | 0  | 0       | 1               | 1         | 1    | 1         | 1          | 1       | 1            | 0     | 0           | 0                | 0                | 0                   | 0       | 0         | 0       | 0          | 0                |
| 2        | 0        | 0      | 1  | 0 | 0       | 0  | 0    | 0       | 1  | 1        | 1       | 1        | 1          | 0    | 0    | 0  | 0       | 0               | 0         | 0    | 0         | 0          | 0       | 0            | 1     | 1           | 1                | 1                | 1                   | 0       | 0         | 0       | 0          | 0                |
| 3        | 0        | 0      | 1  | 0 | 0       | 0  | 0    | 0       | 1  | 1        | 0       | 0        | 0          | 1    | 1    | 1  | 1       | 0               | 0         | 0    | 0         | 0          | 0       | 0            | 1     | 0           | 0                | 0                | 0                   | 1       | 1         | 1       | 1          | 1                |

The Document-Term Matrix (DTM) for N-Grams captures the frequency of each n-gram in each document, forming the basis of the Bag of Words (BoW) representation with N-Grams.
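
A bag of n-grams for a single document can be sketched in plain Python as follows. The `word_ngrams` helper is hypothetical and only approximates what a vectorizer with an `ngram_range` setting would produce; it assumes tokens are runs of two or more word characters, so a single-character word such as "a" would be skipped before n-grams are formed.

```python
import re
from collections import Counter


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of `text` for n in the inclusive range."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    smallest, largest = ngram_range
    return [
        " ".join(tokens[start:start + n])
        for n in range(smallest, largest + 1)
        for start in range(len(tokens) - n + 1)
    ]


document = "He is very good in Machine Learning."

# Unigrams and bigrams, counted as bag-of-n-grams features
features = Counter(word_ngrams(document, ngram_range=(1, 2)))
print(sorted(features))

# Widening the range to trigrams adds X - 3 + 1 more features
with_trigrams = word_ngrams(document, ngram_range=(1, 3))
print(len(with_trigrams))
```

Each additional value of n adds roughly one new feature per token position, which is why the feature space grows quickly as the range widens.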


In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create a sample dataframe
data = {'Text': ['"Muhammad Sheraz is a Student of Data Science."',
                 'He is Learning Natural Language Processing.',
                 'He is very good in Machine Learning.'],
        'Output': [1, 0, 1]}  # Adding a binary output column

df = pd.DataFrame(data)

df


|   | Text | Output |
|---|------|--------|
| 0 | "Muhammad Sheraz is a Student of Data Science." | 1 |
| 1 | He is Learning Natural Language Processing. | 0 |
| 2 | He is very good in Machine Learning. | 1 |


In [12]:
# Apply N-Grams using CountVectorizer
ngram_vectorizer = CountVectorizer(ngram_range=(1, 3))  # Use unigrams, bigrams, and trigrams
ngram_matrix = ngram_vectorizer.fit_transform(df['Text'])

# Convert the N-Grams matrix to a DataFrame
ngram_df = pd.DataFrame(ngram_matrix.toarray(), columns=ngram_vectorizer.get_feature_names_out())

# Concatenate the N-Grams DataFrame with the original dataframe
result_df = pd.concat([df, ngram_df], axis=1)

result_df


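|   | Text | Output | data | data science | good | good in | good in machine | he | he is | he is learning | ... | science | sheraz | sheraz is | sheraz is student | student | student of | student of data | very | very good | very good in |
|---|------|--------|------|--------------|------|---------|-----------------|----|-------|----------------|-----|---------|--------|-----------|-------------------|---------|------------|-----------------|------|-----------|--------------|
| 0 | "Muhammad Sheraz is a Student of Data Science." | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | He is Learning Natural Language Processing. | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | He is very good in Machine Learning. | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |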


In [13]:
ngram_vectorizer.vocabulary_

{'muhammad': 26,
 'sheraz': 37,
 'is': 12,
 'student': 40,
 'of': 32,
 'data': 0,
 'science': 36,
 'muhammad sheraz': 27,
 'sheraz is': 38,
 'is student': 15,
 'student of': 41,
 'of data': 33,
 'data science': 1,
 'muhammad sheraz is': 28,
 'sheraz is student': 39,
 'is student of': 16,
 'student of data': 42,
 'of data science': 34,
 'he': 5,
 'learning': 21,
 'natural': 29,
 'language': 19,
 'processing': 35,
 'he is': 6,
 'is learning': 13,
 'learning natural': 22,
 'natural language': 30,
 'language processing': 20,
 'he is learning': 7,
 'is learning natural': 14,
 'learning natural language': 23,
 'natural language processing': 31,
 'very': 43,
 'good': 2,
 'in': 9,
 'machine': 24,
 'is very': 17,
 'very good': 44,
 'good in': 3,
 'in machine': 10,
 'machine learning': 25,
 'he is very': 8,
 'is very good': 18,
 'very good in': 45,
 'good in machine': 4,
 'in machine learning': 11}

## TF-IDF (Term Frequency-Inverse Document Frequency)

- **Definition:**
  - TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents.
  - It balances the frequency of a word in a document (Term Frequency - TF) with its rarity across documents (Inverse Document Frequency - IDF).

- **Components:**
  - **Term Frequency (TF):**
    - Measures how often a term appears in a document.
    - Calculated as the number of occurrences of a term divided by the total number of terms in the document.

  - **Inverse Document Frequency (IDF):**
    - Measures how unique or rare a term is across all documents.
    - Calculated as the logarithm of the total number of documents divided by the number of documents containing the term.

- **Formula:**
  - TF-IDF = TF * IDF

- **Key Characteristics:**
  - Words with high TF-IDF scores are important in a specific document but not common across all documents.
  - Common words receive lower scores, emphasizing the uniqueness of terms.

- **Pros:**
  - Captures the importance of terms in the context of a document and a corpus.
  - Penalizes common words and highlights distinctive terms.

- **Cons:**
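  - The vocabulary, and with it the size and sparsity of the vectors, grows with the corpus.
  - Results depend on choices such as the logarithm base, smoothing, normalization, and vocabulary cut-offs, which need to be set with care.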

- **Use Cases:**
  - Document retrieval and ranking.
  - Keyword extraction.
  - Text mining and clustering.

- **Libraries:**
  - Implemented in scikit-learn as `TfidfVectorizer`.
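
Written out with symbols, the definitions above read as follows. The notation is introduced here for illustration only: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents that contain $t$.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For example, a term that occurs in every document has $n_t = N$, so $\mathrm{idf}(t) = \log 1 = 0$ and its TF-IDF score is 0 in every document.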


### Steps to Calculate TF-IDF

- **Step 1: Compute Term Frequency (TF)**
  - Calculate TF for each term in each document: the number of times the term occurs in the document divided by the total number of terms in that document.

- **Step 2: Compute Inverse Document Frequency (IDF)**
  - Calculate IDF for each term across the corpus: the logarithm of the total number of documents divided by the number of documents containing the term. Rarer terms get a higher IDF.

- **Step 3: Compute TF-IDF**
  - Multiply TF by IDF. The result is high for terms that are frequent in a document but rare in the rest of the corpus.

The TF-IDF representation captures the importance of terms in each document, emphasizing the uniqueness of terms across the entire corpus.
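
The three steps can be sketched in plain Python as shown below. This is a minimal illustration, not the scikit-learn implementation: the tokenization (lowercase, strip the final period, split on whitespace) and the base-10 logarithm are assumptions made for this example.

```python
import math
from collections import Counter

docs = ["Muhammad Sheraz is a Student of Data Science.",
        "He is Learning Natural Language Processing.",
        "He is very good in Machine Learning."]

# Assumption: lowercase, strip the final period, split on whitespace
tokenized = [d.lower().rstrip(".").split() for d in docs]

# Step 1: TF = occurrences of the term / total terms in the document
tf = [{t: c / len(toks) for t, c in Counter(toks).items()} for toks in tokenized]

# Step 2: IDF = log10(total documents / documents containing the term)
n_docs = len(tokenized)
doc_freq = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log10(n_docs / df) for t, df in doc_freq.items()}

# Step 3: TF-IDF = TF * IDF
tfidf = [{t: w * idf[t] for t, w in doc_tf.items()} for doc_tf in tf]

for i, scores in enumerate(tfidf, start=1):
    print(f"Document {i}:", {t: round(s, 3) for t, s in scores.items()})
```

The word "is" occurs in all three sentences, so its IDF is log10(3/3) = 0 and its TF-IDF score is 0 everywhere, while words confined to one sentence get the highest scores.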


<img src='images/tdidf1.png'>

## Example

Consider a corpus with three documents:

1. Document 1: "Muhammad Sheraz is a Student of Data Science."
2. Document 2: "He is Learning Natural Language Processing."
3. Document 3: "He is very good in Machine Learning."


### Term Frequency (TF) Table

Document 1 contains 8 terms, Document 2 contains 6, and Document 3 contains 7, so a term that occurs once has a TF of 1/8, 1/6, or 1/7 respectively.

| Term        | Document 1 | Document 2 | Document 3 |
|-------------|------------|------------|------------|
| Muhammad    | 1/8        | 0          | 0          |
| Sheraz      | 1/8        | 0          | 0          |
| is          | 1/8        | 1/6        | 1/7        |
| a           | 1/8        | 0          | 0          |
| Student     | 1/8        | 0          | 0          |
| of          | 1/8        | 0          | 0          |
| Data        | 1/8        | 0          | 0          |
| Science     | 1/8        | 0          | 0          |
| He          | 0          | 1/6        | 1/7        |
| Learning    | 0          | 1/6        | 1/7        |
| Natural     | 0          | 1/6        | 0          |
| Language    | 0          | 1/6        | 0          |
| Processing  | 0          | 1/6        | 0          |
| very        | 0          | 0          | 1/7        |
| good        | 0          | 0          | 1/7        |
| in          | 0          | 0          | 1/7        |
| Machine     | 0          | 0          | 1/7        |

### Inverse Document Frequency (IDF) Table

IDF is computed here with a base-10 logarithm over the 3 documents: a term found in one document gets log10(3/1) = 0.477, a term found in two documents gets log10(3/2) = 0.176, and a term found in all three gets log10(3/3) = 0.

| Term        | IDF          |
|-------------|--------------|
| Muhammad    | 0.477        |
| Sheraz      | 0.477        |
| is          | 0            |
| a           | 0.477        |
| Student     | 0.477        |
| of          | 0.477        |
| Data        | 0.477        |
| Science     | 0.477        |
| He          | 0.176        |
| Learning    | 0.176        |
| Natural     | 0.477        |
| Language    | 0.477        |
| Processing  | 0.477        |
| very        | 0.477        |
| good        | 0.477        |
| in          | 0.477        |
| Machine     | 0.477        |

### TF-IDF Table

Each entry is the TF value multiplied by the IDF value, rounded to three decimals.

| Term        | Document 1 | Document 2 | Document 3 |
|-------------|------------|------------|------------|
| Muhammad    | 0.060      | 0          | 0          |
| Sheraz      | 0.060      | 0          | 0          |
| is          | 0          | 0          | 0          |
| a           | 0.060      | 0          | 0          |
| Student     | 0.060      | 0          | 0          |
| of          | 0.060      | 0          | 0          |
| Data        | 0.060      | 0          | 0          |
| Science     | 0.060      | 0          | 0          |
| He          | 0          | 0.029      | 0.025      |
| Learning    | 0          | 0.029      | 0.025      |
| Natural     | 0          | 0.080      | 0          |
| Language    | 0          | 0.080      | 0          |
| Processing  | 0          | 0.080      | 0          |
| very        | 0          | 0          | 0.068      |
| good        | 0          | 0          | 0.068      |
| in          | 0          | 0          | 0.068      |
| Machine     | 0          | 0          | 0.068      |



In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
# Sample documents
documents = ["Muhammad Sheraz is a Student of Data Science.",
                 'He is Learning Natural Language Processing.',
                'He is very good in Machine Learning.']

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense NumPy array for display
dense_matrix = tfidf_matrix.toarray()

# Display the TF-IDF values in a readable format
print("TF-IDF Values:")
pd.DataFrame(dense_matrix,columns=feature_names)



TF-IDF Values:


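|   | data | good | he | in | is | language | learning | machine | muhammad | natural | of | processing | science | sheraz | student | very |
|---|------|------|----|----|----|----------|----------|---------|----------|---------|----|------------|---------|--------|---------|------|
| 0 | 0.396875 | 0.0 | 0.0 | 0.0 | 0.2344 | 0.0 | 0.0 | 0.0 | 0.396875 | 0.0 | 0.396875 | 0.0 | 0.396875 | 0.396875 | 0.396875 | 0.0 |
| 1 | 0.0 | 0.0 | 0.358291 | 0.0 | 0.278245 | 0.47111 | 0.358291 | 0.0 | 0.0 | 0.47111 | 0.0 | 0.47111 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.426184 | 0.324124 | 0.426184 | 0.251711 | 0.0 | 0.324124 | 0.426184 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.426184 |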


In [20]:
# Display feature names (terms)
print("\nFeature Names (Terms):")
feature_names



Feature Names (Terms):


array(['data', 'good', 'he', 'in', 'is', 'language', 'learning',
       'machine', 'muhammad', 'natural', 'of', 'processing', 'science',
       'sheraz', 'student', 'very'], dtype=object)
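
The scores printed by `TfidfVectorizer` are not a plain TF × IDF product. With its default settings it uses raw counts as TF, a smoothed natural-log IDF, and scales each row to unit length (L2 norm). The sketch below reproduces those defaults by hand as a sanity check; the variable names `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ["Muhammad Sheraz is a Student of Data Science.",
             "He is Learning Natural Language Processing.",
             "He is very good in Machine Learning."]

# Raw term counts; the default tokenizer lowercases and drops
# single-character tokens such as "a"
counts = CountVectorizer().fit_transform(documents).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF with natural log: ln((1 + N) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale every row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual, reference))
```

Because of the `+ 1` in the IDF, a word that occurs in every document, such as "is", still receives a small non-zero score instead of 0.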

In [22]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
data = {
    'Document':  ["Muhammad Sheraz is a Student of Data Science.",
                 'He is Learning Natural Language Processing.',
                'He is very good in Machine Learning.']
}

# Create DataFrame
df = pd.DataFrame(data)

# Apply TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['Document'])

# Convert TF-IDF matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Display original DataFrame
print("Original DataFrame:")
print(df)

# Display TF-IDF DataFrame
print("\nTF-IDF DataFrame:")
print(tfidf_df)


Original DataFrame:
                                        Document
0  Muhammad Sheraz is a Student of Data Science.
1    He is Learning Natural Language Processing.
2           He is very good in Machine Learning.

TF-IDF DataFrame:
       data      good        he        in        is  language  learning  \
0  0.396875  0.000000  0.000000  0.000000  0.234400   0.00000  0.000000   
1  0.000000  0.000000  0.358291  0.000000  0.278245   0.47111  0.358291   
2  0.000000  0.426184  0.324124  0.426184  0.251711   0.00000  0.324124   

    machine  muhammad  natural        of  processing   science    sheraz  \
0  0.000000  0.396875  0.00000  0.396875     0.00000  0.396875  0.396875   
1  0.000000  0.000000  0.47111  0.000000     0.47111  0.000000  0.000000   
2  0.426184  0.000000  0.00000  0.000000     0.00000  0.000000  0.000000   

    student      very  
0  0.396875  0.000000  
1  0.000000  0.000000  
2  0.000000  0.426184  
