<a href="https://colab.research.google.com/github/Naren221/NLP/blob/main/NarendraMDS202336_Assignment02_Version_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;      Assignment 02


## **Narendra. C** <br> **MDS202336**

## 1. Implementing Modified Verison of COALS

* The preprocessed corpus is stored in JSON files, where each file contains a collection of documents.
* Each document in the JSON file is represented as a key-value pair:

    * The key is the document name or ID, which serves as a unique identifier for the document.
    * The value is the preprocessed text of that document, which consists of cleaned and tokenized words from the original text (e.g., lowercase, punctuation removed, stopwords removed).

* This structure allows easy access to both the document identifiers and their
corresponding preprocessed content, facilitating efficient data manipulation and analysis

In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Steps to Create a Co-occurrence Matrix

1. **Identify Data Sources**
   - Locate and list all relevant JSON files containing text data.

2. **Extract Vocabulary**
   - **Read Files:**
     - Read and parse text data from each JSON file.
   - **Build Vocabulary:**
     - Extract unique tokens from each file and aggregate them into a global vocabulary.
   - **Save Vocabulary:**
     - Save the vocabulary to a file for future reference.

3. **Initialize Co-occurrence Matrix**
   - **Prepare Matrix:**
     - Initialize a sparse matrix for storing co-occurrence counts based on the global vocabulary.

4. **Process Each File**
   - **Load Data:**
     - Read and parse text data from each JSON file.
   - **Update Matrix:**
     - Convert tokens to matrix indices using the global vocabulary.
     - Update the co-occurrence matrix based on token co-occurrences within a defined window size.

5. **Finalize Matrix**
   - **Ensure Symmetry:**
     - Make the matrix symmetric by adding its transpose.
   - **Convert Format:**
     - Change the matrix to a more efficient storage format for further use (e.g., CSR format).

6. **Save Results**
   - **Store Matrix:**
     - Save the matrix and vocabulary in a storage format for future use (e.g., HDF5 format).

7. **Clean Up**
   - Free up memory by deleting temporary data structures.


In [None]:
import os
import json
from tqdm import tqdm

def extract_vocabulary(directory, output_path='/content/drive/MyDrive/vocabulary.txt'):
    # Get the list of JSON files in the directory
    json_files = [f for f in os.listdir(directory) if f.endswith('.json')]

    if not json_files:
        raise ValueError("No JSON files found in the directory.")

    # Initialize a set to store the vocabulary
    vocab = set()

    # Process each JSON file with a progress bar
    for json_file_name in tqdm(json_files, desc='Extracting vocabulary'):
        file_path = os.path.join(directory, json_file_name)

        # Read the JSON file
        with open(file_path, 'r') as file:
            data = json.load(file)

        # Collect unique tokens from the current file
        for document in data:
            vocab.update(document)

        # Free the data from memory
        del data

    # Save the vocabulary to a file
    with open(output_path, 'w') as file:
        for word in sorted(vocab):
            file.write(f"{word}\n")

    print(f"Vocabulary extracted and saved to: {output_path}")

# Example usage
directory = '/content/drive/MyDrive/NLP Assignments/preprocessed_files/'
output_path = '/content/drive/MyDrive/NLP Assignments/vocabulary.txt'
extract_vocabulary(directory, output_path=output_path)


Extracting vocabulary: 100%|██████████| 12/12 [01:17<00:00,  6.46s/it]


Vocabulary extracted and saved to: /content/drive/MyDrive/vocabulary.txt


In [31]:
from google.colab import drive
drive.mount('/content/drive')

import os
import json
import numpy as np
import scipy.sparse as sp
import h5py
from tqdm import tqdm

# Load vocabulary
def load_vocabulary(vocab_path):
    with open(vocab_path, 'r') as file:
        vocab = [line.strip() for line in file]
    return vocab

# Function to process a single JSON file and build its co-occurrence matrix
def process_single_file(file_path, vocab_index, window_size, output_path):
    vocab_size = len(vocab_index)
    cooccurrence_matrix = sp.lil_matrix((vocab_size, vocab_size), dtype=np.float32)

    # Read the JSON file
    with open(file_path, 'r') as file:
        data = json.load(file)

    # Process each document in the current JSON file with a progress bar
    with tqdm(total=len(data), desc=f'Processing {os.path.basename(file_path)}') as pbar:
        for document in data:
            token_indices = [vocab_index[token] for token in document if token in vocab_index]

            for i, token_idx in enumerate(token_indices):
                start_index = max(0, i - window_size)
                end_index = min(len(token_indices), i + window_size + 1)

                # Update co-occurrence for previous and future tokens without weights
                for j in range(start_index, i):
                    cooccurrence_matrix[token_idx, token_indices[j]] += 1

                for j in range(i + 1, end_index):
                    cooccurrence_matrix[token_idx, token_indices[j]] += 1

            pbar.update(1)

    # Convert to CSR format for efficient storage
    cooccurrence_matrix_csr = cooccurrence_matrix.tocsr()

    # Store the matrix in a file
    output_file = os.path.join(output_path, f'cooccurrence_matrix_{os.path.basename(file_path).split(".")[0]}.h5')
    with h5py.File(output_file, 'w') as f:
        f.create_dataset('data', data=cooccurrence_matrix_csr.data, compression='gzip')
        f.create_dataset('indices', data=cooccurrence_matrix_csr.indices, compression='gzip')
        f.create_dataset('indptr', data=cooccurrence_matrix_csr.indptr, compression='gzip')
        f.attrs['shape'] = cooccurrence_matrix_csr.shape

    # Free memory
    del cooccurrence_matrix
    del cooccurrence_matrix_csr
    del data

    print(f"Co-occurrence matrix for {file_path} stored at: {output_file}")
    return output_file

def process_json_file(json_file_path, vocab_path, output_path, window_size=4):
    vocab = load_vocabulary(vocab_path)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Process and store co-occurrence matrix for the specified JSON file
    process_single_file(json_file_path, vocab_index, window_size, output_path)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
# file_number =  3 # Replace with the desired file number

for file_number in [5, 6, 7]:
  json_file_path = f'/content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_{file_number}.json'  # Path to your specific JSON file
  directory = '/content/drive/MyDrive/NLP Assignments/preprocessed_files/'
  vocab_path = '/content/drive/MyDrive/NLP Assignments/vocabulary.txt'
  output_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/'

  process_json_file(json_file_path, vocab_path, output_path)


Processing chunk_5.json: 100%|██████████| 4711/4711 [23:00<00:00,  3.41it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_5.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_5.h5


Processing chunk_6.json: 100%|██████████| 4711/4711 [23:07<00:00,  3.39it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_6.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_6.h5


Processing chunk_7.json: 100%|██████████| 4711/4711 [22:10<00:00,  3.54it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_7.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_7.h5


In [8]:
for file_number in [8, 9, 10]:
  json_file_path = f'/content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_{file_number}.json'  # Path to your specific JSON file
  directory = '/content/drive/MyDrive/NLP Assignments/preprocessed_files/'
  vocab_path = '/content/drive/MyDrive/NLP Assignments/vocabulary.txt'
  output_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/'

  process_json_file(json_file_path, vocab_path, output_path)

Processing chunk_8.json: 100%|██████████| 4711/4711 [21:41<00:00,  3.62it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_8.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_8.h5


Processing chunk_9.json: 100%|██████████| 4711/4711 [23:04<00:00,  3.40it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_9.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_9.h5


Processing chunk_10.json: 100%|██████████| 4711/4711 [24:27<00:00,  3.21it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_10.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_10.h5


In [3]:
for file_number in [11,12]:
  json_file_path = f'/content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_{file_number}.json'  # Path to your specific JSON file
  directory = '/content/drive/MyDrive/NLP Assignments/preprocessed_files/'
  vocab_path = '/content/drive/MyDrive/NLP Assignments/vocabulary.txt'
  output_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/'

  process_json_file(json_file_path, vocab_path, output_path)

Processing chunk_11.json: 100%|██████████| 4711/4711 [15:33<00:00,  5.05it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_11.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_11.h5


Processing chunk_12.json: 100%|██████████| 4707/4707 [14:34<00:00,  5.38it/s]


Co-occurrence matrix for /content/drive/MyDrive/NLP Assignments/preprocessed_files/chunk_12.json stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/cooccurrence_matrix_chunk_12.h5


* I built a cooccurrence matrix for each chunk of the corpus.
* This was done to manage tasks with limited memory.
* Now i will combine these matrices to form a single cooccurrence matrix for the entire corpus.

In [2]:
import os
import scipy.sparse as sp
import h5py
from tqdm import tqdm

# Function to incrementally combine matrices and write them to disk
def combine_cooccurrence_matrices_incrementally(folder_path, vocab_size, output_file):
    final_matrix = None

    # Get all .h5 files in the folder
    matrix_files = [f for f in os.listdir(folder_path) if f.startswith('cooccurrence_matrix') and f.endswith('.h5')]
    if not matrix_files:
        raise ValueError("No co-occurrence matrix files found in the folder.")

    # Combine each matrix into the final matrix incrementally
    for file in tqdm(matrix_files, desc="Combining matrices"):
        file_path = os.path.join(folder_path, file)

        with h5py.File(file_path, 'r') as f:
            data = f['data'][:]
            indices = f['indices'][:]
            indptr = f['indptr'][:]
            shape = f.attrs['shape']

        cooccurrence_matrix = sp.csr_matrix((data, indices, indptr), shape=shape)

        # If final_matrix is None (i.e., first matrix), set it. Otherwise, add the new matrix
        if final_matrix is None:
            final_matrix = cooccurrence_matrix
        else:
            final_matrix = final_matrix + cooccurrence_matrix

        # After processing each matrix, save intermediate results to avoid memory overload
        _save_intermediate(final_matrix, output_file)
        print(f"\nProcessed {file} and saved intermediate result.")

    # Return the final matrix
    return final_matrix

# Helper function to save intermediate results
def _save_intermediate(matrix, output_file):
    with h5py.File(output_file, 'w') as f:
        f.create_dataset('data', data=matrix.data, compression='gzip')
        f.create_dataset('indices', data=matrix.indices, compression='gzip')
        f.create_dataset('indptr', data=matrix.indptr, compression='gzip')
        f.attrs['shape'] = matrix.shape

    print(f"Intermediate result saved to: {output_file}")

# Function to store the final combined matrix
def store_final_matrix(final_matrix, folder_path):
    final_output_path = os.path.join(folder_path, 'final_cooccurrence_matrix.h5')
    _save_intermediate(final_matrix, final_output_path)
    print(f"Final co-occurrence matrix stored at: {final_output_path} 🎉")


In [3]:
folder_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices'  # Folder where matrices are hanging out
vocab_size = len(load_vocabulary('/content/drive/MyDrive/NLP Assignments/vocabulary.txt'))  # Load your vocab size
output_file = '/content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5'  # Path to store the intermediate/final matrix

# Call the function to combine matrices incrementally
final_matrix = combine_cooccurrence_matrices_incrementally(folder_path, vocab_size, output_file)

# After processing, you can store the final result
store_final_matrix(final_matrix, '/content/drive/MyDrive/NLP Assignments/Assignment_02')


Combining matrices:   8%|▊         | 1/12 [00:12<02:12, 12.01s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_3.h5 and saved intermediate result.


Combining matrices:  17%|█▋        | 2/12 [00:31<02:41, 16.16s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_1.h5 and saved intermediate result.


Combining matrices:  25%|██▌       | 3/12 [00:56<03:03, 20.41s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_2.h5 and saved intermediate result.


Combining matrices:  33%|███▎      | 4/12 [01:26<03:12, 24.01s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_4.h5 and saved intermediate result.


Combining matrices:  42%|████▏     | 5/12 [02:02<03:20, 28.62s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_5.h5 and saved intermediate result.


Combining matrices:  50%|█████     | 6/12 [02:44<03:17, 32.91s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_6.h5 and saved intermediate result.


Combining matrices:  58%|█████▊    | 7/12 [03:28<03:03, 36.76s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_7.h5 and saved intermediate result.


Combining matrices:  67%|██████▋   | 8/12 [04:16<02:41, 40.40s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_8.h5 and saved intermediate result.


Combining matrices:  75%|███████▌  | 9/12 [05:10<02:13, 44.46s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_9.h5 and saved intermediate result.


Combining matrices:  83%|████████▎ | 10/12 [06:09<01:38, 49.01s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_10.h5 and saved intermediate result.


Combining matrices:  92%|█████████▏| 11/12 [07:11<00:52, 52.92s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_11.h5 and saved intermediate result.


Combining matrices: 100%|██████████| 12/12 [08:14<00:00, 41.17s/it]

Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/output_matrices/final_cooccurrence_matrix_intermediate.h5

Processed cooccurrence_matrix_chunk_12.h5 and saved intermediate result.





Intermediate result saved to: /content/drive/MyDrive/NLP Assignments/Assignment_02/final_cooccurrence_matrix.h5
Final co-occurrence matrix stored at: /content/drive/MyDrive/NLP Assignments/Assignment_02/final_cooccurrence_matrix.h5 🎉


In [2]:
import h5py
import numpy as np
import scipy.sparse as sp

def view_submatrix(file_path, row_start, col_start, size=5):
    with h5py.File(file_path, 'r') as f:
        data = f['data'][:]
        indices = f['indices'][:]
        indptr = f['indptr'][:]
        shape = f.attrs['shape']

        # Create the CSR matrix from the stored data
        cooccurrence_matrix = sp.csr_matrix((data, indices, indptr), shape=shape)

        # Extract the submatrix
        submatrix = cooccurrence_matrix[row_start:row_start + size, col_start:col_start + size].toarray()

        print("10x10 Submatrix:")
        print(submatrix)

file_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/final_cooccurrence_matrix.h5'
view_submatrix(file_path, row_start=0, col_start=0)


10x10 Submatrix:
[[4.116e+03 4.900e+01 4.000e+00 1.000e+00 0.000e+00]
 [4.900e+01 1.600e+02 5.000e+00 2.000e+00 0.000e+00]
 [4.000e+00 5.000e+00 3.400e+01 1.000e+00 1.000e+00]
 [1.000e+00 2.000e+00 1.000e+00 0.000e+00 0.000e+00]
 [0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00]]


In [7]:
vocab_path = '/content/drive/MyDrive/NLP Assignments/vocabulary.txt'
vocab = load_vocabulary(vocab_path)
print(len(vocab))

1470595


In [25]:
type(vocab)


list

## Now i will try to reduce the size of the vocabulary to 7000.

### I am reducing the vocabulary size by **filtering out infrequent words**, specifically those that have **low co-occurrence frequencies**. Additionally, stop words will be removed. This optimization aims to enhance the co-occurrence matrix and improve overall efficiency.


### Reason for Reducing Vocabulary Size This Way

* Reducing vocabulary size based on **co-occurrence frequency** helps prioritize words that frequently appear together, capturing more meaningful word relationships.
* Filtering out terms with low co-occurrence reduces noise and leads to a more efficient, less sparse co-occurrence matrix, improving computational performance.
* This method emphasizes relationships between words rather than individual token frequency, offering a more insightful analysis of the text.
* The approach focuses on selecting the **top 7000** words based on co-occurrence frequency, ensuring that significant terms are retained while less informative ones are removed.


In [34]:
import numpy as np
import scipy.sparse as sp
import h5py

# Step 1: Load the original co-occurrence matrix from the HDF5 file
def load_cooccurrence_matrix(hdf5_file):
    with h5py.File(hdf5_file, 'r') as f:
        data = f['data'][:]
        indices = f['indices'][:]
        indptr = f['indptr'][:]
        shape = f.attrs['shape']

        # Reconstruct sparse matrix
        cooccurrence_matrix = sp.csr_matrix((data, indices, indptr), shape=shape)

    return cooccurrence_matrix

# Step 2: Load the original vocabulary (for 1.4 million words)
def load_vocabulary(vocab_file):
    with open(vocab_file, 'r') as file:
        vocab = [line.strip() for line in file]
    return vocab

# Step 3: Sum co-occurrence frequencies for each word
def get_cooccurrence_sums(cooccurrence_matrix):
    # Sum along rows to get the total co-occurrence frequency for each word
    word_cooccurrence_sums = np.array(cooccurrence_matrix.sum(axis=1)).flatten()
    return word_cooccurrence_sums

# Step 4: Rank words by total co-occurrence frequency and get top N words
def select_top_words(cooccurrence_sums, vocab_size):
    # Get indices of top N words based on co-occurrence sums
    top_word_indices = np.argsort(cooccurrence_sums)[-vocab_size:]
    return top_word_indices

# Step 5: Filter the co-occurrence matrix to keep only the top words
def filter_cooccurrence_matrix(cooccurrence_matrix, top_word_indices):
    # Extract rows and columns corresponding to top_word_indices
    reduced_matrix = cooccurrence_matrix[top_word_indices, :][:, top_word_indices]
    return reduced_matrix

# Step 6: Save the reduced co-occurrence matrix and reduced vocabulary
def save_reduced_matrix_and_vocab(reduced_matrix, reduced_vocab, output_matrix_file, output_vocab_file):
    # Save reduced co-occurrence matrix to HDF5
    with h5py.File(output_matrix_file, 'w') as f:
        f.create_dataset('data', data=reduced_matrix.data, compression='gzip')
        f.create_dataset('indices', data=reduced_matrix.indices, compression='gzip')
        f.create_dataset('indptr', data=reduced_matrix.indptr, compression='gzip')
        f.attrs['shape'] = reduced_matrix.shape

    # Save reduced vocabulary to a text file
    with open(output_vocab_file, 'w') as f:
        for word in reduced_vocab:
            f.write(f"{word}\n")

# Main function to reduce co-occurrence matrix and vocabulary by co-occurrence frequency
def reduce_matrix_and_vocab(hdf5_file, vocab_file, output_matrix_file, output_vocab_file, vocab_size):
    # Load the original co-occurrence matrix
    cooccurrence_matrix = load_cooccurrence_matrix(hdf5_file)

    # Load the original vocabulary
    vocab = load_vocabulary(vocab_file)

    # Get co-occurrence sums for each word
    cooccurrence_sums = get_cooccurrence_sums(cooccurrence_matrix)

    # Select top words based on co-occurrence frequency
    top_word_indices = select_top_words(cooccurrence_sums, vocab_size)

    # Filter the matrix to keep only the top words
    reduced_matrix = filter_cooccurrence_matrix(cooccurrence_matrix, top_word_indices)

    # Get the reduced vocabulary
    reduced_vocab = [vocab[i] for i in top_word_indices]

    # Save the reduced matrix and vocabulary
    save_reduced_matrix_and_vocab(reduced_matrix, reduced_vocab, output_matrix_file, output_vocab_file)

# Paths to original matrix, vocab, and reduced output files
original_matrix_file = '/content/drive/MyDrive/NLP Assignments/Assignment_02/final_cooccurrence_matrix.h5'
original_vocab_file = '/content/drive/MyDrive/NLP Assignments/vocabulary.txt'
reduced_matrix_file = '/content/drive/MyDrive/NLP Assignments/Assignment_02/reduced_cooccurrence_matrix.h5'
reduced_vocab_file = '/content/drive/MyDrive/NLP Assignments/Assignment_02/reduced_vocabulary_2.txt'
vocab_size = 7000  # Number of top words to keep

# Run the reduction process
reduce_matrix_and_vocab(original_matrix_file, original_vocab_file, reduced_matrix_file, reduced_vocab_file, vocab_size)


In [38]:
vocab_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/reduced_vocabulary_2.txt'
vocab = load_vocabulary(vocab_path)
print(len(vocab))

7000


### 2. I noticed some letters present in the vocab which can be considered as stop words ... for eg there are alphabets like `e` and all which are just letters. I will remove those from the vocabulary.

In [39]:
# List of stop words
stop_words = [
    'given', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
    'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
    'u', 'v', 'w', 'x', 'y', 'z'
]

# Removing stop words from the vocabulary
filtered_vocab = [word for word in vocab if word not in stop_words]

print(len(filtered_vocab))

6981


### 3. Since the corpus is made from research papers things like `et.al.` will be present. we can remove those as well.

In [40]:
# List of terms to remove
terms_to_remove = ['et', 'al', 'ie', 'eg', 'etc', 'aka', 'viz', 'cf', 'nb', 'pm', 'vs']

# Remove the terms from the vocabulary
filtered_vocab = [word for word in filtered_vocab if word not in terms_to_remove]

len(filtered_vocab)

6973

### Now i will store this filtered vocablary.

In [41]:
# Save the filtered vocabulary
filtered_vocab_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_vocab_2.txt'
with open(filtered_vocab_path, 'w') as f:
    f.write("\n".join(filtered_vocab))

### Now i will do corresponding changes to the cooccurrence matrix as well.

In [42]:
import h5py
import numpy as np
import scipy.sparse as sp
import pandas as pd
from tqdm import tqdm

# Load the original co-occurrence matrix (sparse format)
co_occurrence_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/reduced_cooccurrence_matrix.h5'
with h5py.File(co_occurrence_path, 'r') as f:
    data = f['data'][:]
    indices = f['indices'][:]
    indptr = f['indptr'][:]
    shape = f.attrs['shape']

# Rebuild the sparse co-occurrence matrix (CSR format)
co_occurrence_matrix = sp.csr_matrix((data, indices, indptr), shape=shape)

# Load the filtered vocabulary
filtered_vocab_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_vocab_2.txt'
with open(filtered_vocab_path, 'r') as file:
    filtered_vocab = [line.strip() for line in file]

# Create a mapping from vocabulary words to their indices
vocab_index = {word: i for i, word in enumerate(filtered_vocab)}

# Create a new co-occurrence matrix based on the filtered vocabulary
filtered_matrix = sp.lil_matrix((len(filtered_vocab), len(filtered_vocab)), dtype=np.float32)

# Populate the filtered co-occurrence matrix based on the original matrix with progress bar
for word, new_index in tqdm(vocab_index.items(), desc="Building filtered co-occurrence matrix", total=len(vocab_index)):
    original_index = vocab_index.get(word)  # Get the index of the word in the original matrix

    if original_index is not None:
        # Copy relevant row and column from the original matrix
        filtered_matrix[new_index, :] = co_occurrence_matrix[original_index, :].toarray()[0, list(vocab_index.values())]
        filtered_matrix[:, new_index] = co_occurrence_matrix[:, original_index].toarray()[list(vocab_index.values()), 0]

# Convert the matrix to CSR for efficiency
filtered_matrix_csr = filtered_matrix.tocsr()

# Save the new filtered co-occurrence matrix
filtered_co_occurrence_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_cooccurrence_matrix_2.h5'
with h5py.File(filtered_co_occurrence_path, 'w') as f:
    f.create_dataset('data', data=filtered_matrix_csr.data, compression='gzip')
    f.create_dataset('indices', data=filtered_matrix_csr.indices, compression='gzip')
    f.create_dataset('indptr', data=filtered_matrix_csr.indptr, compression='gzip')
    f.attrs['shape'] = filtered_matrix_csr.shape

print(f"Filtered co-occurrence matrix saved at: {filtered_co_occurrence_path}")


Building filtered co-occurrence matrix: 100%|██████████| 6973/6973 [21:16<00:00,  5.46it/s]


Filtered co-occurrence matrix saved at: /content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_cooccurrence_matrix_2.h5


## Co-occurrence Matrix

The co-occurrence matrix used in this assignment is too large to be displayed here directly. You can download or view the matrix using the following link:

[Download Co-occurrence Matrix](https://drive.google.com/file/d/1-1Ga3x4Pl0GexVOtseEdYzXhSKccLWWb/view?usp=drive_link)

_Note: This matrix is stored in HDF5 format due to its large size._


In [5]:
import h5py
import numpy as np
import scipy.sparse as sp

def view_submatrix(file_path, row_start, col_start, size=5):
    with h5py.File(file_path, 'r') as f:
        data = f['data'][:]
        indices = f['indices'][:]
        indptr = f['indptr'][:]
        shape = f.attrs['shape']

        # Create the CSR matrix from the stored data
        cooccurrence_matrix = sp.csr_matrix((data, indices, indptr), shape=shape)

        # Extract the submatrix
        submatrix = cooccurrence_matrix[row_start:row_start + size, col_start:col_start + size].toarray()

        print("10x10 Submatrix:")
        print(submatrix)

file_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_cooccurrence_matrix_2.h5'
view_submatrix(file_path, row_start=0, col_start=0, size = 10)

10x10 Submatrix:
[[  2.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0. 170.   0.   0.   0.   0.   0.   0.   0.   0.]
 [  0.   0.  84.   0.   0.   0.   1.   1.   0.   0.]
 [  0.   0.   0. 246.   0.   0.   2.   0.   3.  29.]
 [  0.   0.   0.   0.  74.   0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0. 220.   0.   0.   0.   0.]
 [  0.   0.   1.   2.   0.   0.  16.   0.   0.   0.]
 [  0.   0.   1.   0.   0.   0.   0.  38.   0.   0.]
 [  0.   0.   0.   3.   0.   0.   0.   0.  62.  15.]
 [  0.   0.   0.  29.   0.   0.   0.   0.  15.  54.]]


## Normalizing the entries using proabilities $\frac {P_{ik}} {P_{jk}}$

$X$      -> Co-occurrence Matrix      <br>
$X_{ij}$ -> Number of times word j occurs in the context of word i. <br>
$X_i = \sum_k X_{ik}$ -> Number of any word appears in the context of word i.<br>
$P_{ij} = P(j|i) = \frac {X_{ij} } {X_i}$ <br>
This gives the probability that the word j occurs in the contex of the word i.

<br>

$\frac {P_{ik}} {P_{jk}} = {P(k|i)} {P(k|j)} $ <br>

<br>
The ratio  $ \frac {P(k|i)} {P(k|j)}$
helps normalize co-occurrence counts by considering the probability of context words with respect to different target words. This normalization emphasizes the relative strength of association between target words, making it easier to distinguish meaningful patterns and relationships.

### Choosing the word for k

**Value of k:** High-Frequency Words (e.g., "the", "and", "of")

**Reason for Choosing This:**

In co-occurrence matrix normalization, selecting high-frequency words as the reference `k` helps address the issue of frequency bias. High-frequency words, being common across various contexts, serve as a stable baseline for comparison. By normalizing co-occurrence counts using these high-frequency words, we adjust for their overwhelming presence and mitigate their disproportionate influence. The ratio of co-occurrence probabilities $ \frac{P(k | i)}{P(k | j)} $ helps us understand the relative strength of the association between the target words `i` and `j` compared to their association with a high-frequency word `k`. This normalization process emphasizes the distinctive relationships between `i` and `j` by neutralizing the impact of common words that might otherwise dominate the matrix. Thus, using high-frequency words ensures that the co-occurrence matrix reflects more meaningful and specific patterns rather than being skewed by the frequent occurrence of generic words.



## Steps for Normalizing Co-occurrence Matrix

1. **Find the Most Frequent Word**:
   Start by summing each row (or column) of the co-occurrence matrix to get the total occurrences of each word across all contexts. The word with the highest total is identified as the most frequent word.

2. **Normalize the Co-occurrence Matrix**:
   Using the most frequent word $ k $, normalize the co-occurrence matrix. The normalization is done by calculating the ratio $ \frac{P(k | i)}{P(k | j)} $, where:
   - $ P(k | i) $ is the probability of word $ k $ occurring in the context of word $ i $.
   - $ P(k | j) $ is the probability of word $ k $ occurring in the context of word $ j $.
   - This step adjusts the co-occurrence counts by considering the probability of context words with respect to different target words.

3. **Remove the Most Frequent Word**:
   Once the normalization is done, remove the row and column corresponding to the most frequent word from the co-occurrence matrix. Also, remove the most frequent word from the vocabulary list, so it no longer influences further analysis.


### Step 1. Finding the most Frequent word.

In [2]:
import numpy as np
import h5py

def load_cooccurrence_matrix(hdf5_file):
    with h5py.File(hdf5_file, 'r') as f:
        data = f['data'][:]
        indices = f['indices'][:]
        indptr = f['indptr'][:]
        shape = f.attrs['shape']

        # Reconstruct sparse matrix
        cooccurrence_matrix = sp.csr_matrix((data, indices, indptr), shape=shape)

    return cooccurrence_matrix

def load_vocabulary(vocab_path):
    with open(vocab_path, 'r') as f:
        vocabulary = [line.strip() for line in f.readlines()]
    return vocabulary

def find_most_frequent_word(cooccurrence_matrix, vocabulary):
    word_frequencies = np.sum(cooccurrence_matrix, axis=1)
    most_frequent_word_index = np.argmax(word_frequencies)
    return most_frequent_word_index, vocabulary[most_frequent_word_index]


In [9]:
hdf5_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_cooccurrence_matrix_2.h5'
vocab_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_vocab_2.txt'

X = load_cooccurrence_matrix(hdf5_path)
vocabulary = load_vocabulary(vocab_path)

k, most_frequent_word = find_most_frequent_word(X, vocabulary)
print(f"Most Frequent Word (k): {most_frequent_word} (Index: {k})")

Most Frequent Word (k): patients (Index: 6972)


In [45]:
X.shape

(6973, 6973)

## Step 2. Normalize the co-occurrence matrix

In [29]:
import numpy as np
from tqdm import tqdm

def normalize_cooccurrence_matrix(X, k):
    m = X.shape[0]
    X_dense = X.toarray()  # Convert sparse matrix to dense for easy manipulation
    normalized_matrix = np.zeros_like(X_dense, dtype=np.float32)

    # Calculate the sum of each row only once
    row_sums = np.sum(X_dense, axis=1)

    # Extract probabilities for the k-th word for all rows
    with np.errstate(divide='ignore', invalid='ignore'):
        P_ik = np.where(row_sums > 0, X_dense[:, k] / row_sums, 0)  # Handle division by zero

    # Loop through each row to fill the normalized matrix
    for i in tqdm(range(m), desc="Processing X", total=m):
        # Extract probabilities for the current row
        X_jk = X_dense[:, k]  # This will be the same for each row i
        with np.errstate(divide='ignore', invalid='ignore'):
            P_jk = np.where(row_sums > 0, X_jk / row_sums, 0)  # Handle division by zero

        # Normalize for row i, but ensure that if X[i, j] is 0, the value stays 0
        non_zero_mask = (X_dense[i, :] != 0)  # Mask to check where X[i, :] is non-zero
        normalized_matrix[i, non_zero_mask] = P_ik[i] / P_jk[non_zero_mask]  # Normalize where non-zero
        normalized_matrix[i, ~non_zero_mask] = 0  # Set to 0 where X[i, :] is zero

    # Replace infinity values with zeros
    normalized_matrix[np.isinf(normalized_matrix)] = 0.0
    normalized_matrix[np.isnan(normalized_matrix)] = 0.0

    # Convert back to sparse matrix
    return normalized_matrix


In [37]:
import numpy as np
from tqdm import tqdm

def normalize_cooccurrence_matrix(X, k):
    m = X.shape[0]
    X_dense = X.toarray()  # Convert sparse matrix to dense for easy manipulation
    normalized_matrix = np.zeros_like(X_dense, dtype=np.float32)

    # Calculate the sum of each row only once
    row_sums = np.sum(X_dense, axis=1)

    # Extract probabilities for the k-th word for all rows
    # Avoid division by zero
    P_ik = np.where(row_sums > 0, X_dense[:, k] / row_sums, 0)

    # Loop through each row to fill the normalized matrix
    for i in tqdm(range(m), desc="Processing X", total=m):
        # Extract probabilities for the current row
        X_jk = X_dense[:, k]  # This will be the same for each row i

        # Avoid division by zero for P_jk
        P_jk = np.where(row_sums > 0, X_jk / row_sums, 0)

        # Normalize for row i, but ensure that if X[i, j] is 0, the value stays 0
        non_zero_mask = (X_dense[i, :] != 0)  # Mask to check where X[i, :] is non-zero

        # Ensure normalization does not result in inf or NaN
        normalized_values = np.zeros_like(normalized_matrix[i, :])
        valid_division = P_jk[non_zero_mask] > 0  # Check if P_jk is greater than 0
        normalized_values[non_zero_mask] = np.where(valid_division, P_ik[i] / P_jk[non_zero_mask], 0)

        normalized_matrix[i, :] = normalized_values

    return normalized_matrix


In [38]:
X_normalized = normalize_cooccurrence_matrix(X, k)

  normalized_values[non_zero_mask] = np.where(valid_division, P_ik[i] / P_jk[non_zero_mask], 0)
  normalized_values[non_zero_mask] = np.where(valid_division, P_ik[i] / P_jk[non_zero_mask], 0)
Processing X: 100%|██████████| 6973/6973 [00:01<00:00, 3982.81it/s]


In [28]:
import numpy as np
from tqdm import tqdm

def normalize_cooccurrence_matrix(X):
    X_dense = X.toarray()  # Convert sparse matrix to dense for easy manipulation
    m = X_dense.shape[0]
    normalized_matrix = np.zeros_like(X_dense, dtype=np.float32)

    # Calculate the sum of each row once
    row_sums = np.sum(X_dense, axis=1, keepdims=True)  # Shape (m, 1)

    # Compute P_ik for all rows (i) and for all columns (k)
    P_ik = np.where(row_sums > 0, X_dense / row_sums, 0)  # Shape (m, n)

    # Calculate the normalized matrix
    for i in tqdm(range(m), desc="Processing X", total=m):
        P_jk = P_ik[i]  # Get P_jk for the current row i
        normalized_matrix[i, :] = np.where(P_jk > 0, P_ik / P_jk, 0)  # Calculate P_ik/P_jk

    # Replace infinity and NaN values with zeros
    normalized_matrix[np.isinf(normalized_matrix)] = 0.0
    normalized_matrix[np.isnan(normalized_matrix)] = 0.0

    return normalized_matrix


In [None]:
X_normalized = normalize_cooccurrence_matrix(X, k)

## Step 3. Removing the most frequent word from the matrix and vocabulary


In [39]:
def remove_most_frequent_word(cooccurrence_matrix, most_frequent_word_index, vocabulary):
    new_matrix = np.delete(cooccurrence_matrix, most_frequent_word_index, axis=0)
    new_matrix = np.delete(new_matrix, most_frequent_word_index, axis=1)

    new_vocabulary = [word for idx, word in enumerate(vocabulary) if idx != most_frequent_word_index]

    return new_matrix, new_vocabulary

In [40]:
vocab_path = '/content/drive/MyDrive/NLP Assignments/Assignment_02/filtered_vocab_2.txt'

vocabulary = load_vocabulary(vocab_path)

X_final, final_vocab = remove_most_frequent_word(X_normalized, k, vocabulary)

### 10x10 Submatrix.

In [41]:
X_final[:10,:10]

array([[1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 1.1039914 , 1.3774192 , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.27166787, 0.        , 1.1946853 , 1.2950433 ],
       [0.        , 0.        , 0.        , 0.        , 1.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.90580416, 3.6809652 , 0.        ,
        0.        , 1.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.7259953

### I will save X_final and final_vocab.

In [19]:
output_vocab_file = '/content/drive/MyDrive/NLP Assignments/Assignment_02/Final_Vocab.txt'
with open(output_vocab_file, 'w') as f:
  for word in final_vocab:
    f.write(f"{word}\n")

In [21]:
np.save("/content/drive/MyDrive/NLP Assignments/Assignment_02/Final_Matrix.npy",X_final)

## Displaying the Vocabulary size and the size of the matrix.

In [16]:
print(f"Vocabulary Size: {len(final_vocab)}")
print(f"Matrix Shape: {X_final.shape}")

Vocabulary Size: 6972
Matrix Shape: (6972, 6972)


In [25]:
import numpy as np

def cosine_similarity(vec_a, vec_b):
    # print(vec_a, vec_b)
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    # print(dot_product)
    # print(dot_product / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0)
    return dot_product / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0

def get_top_similar_words(cooccurrence_matrix, word_index, vocabulary, top_n=10):
    # Get the vector for the specified word
    target_vector = cooccurrence_matrix[word_index].flatten()

    # Check if the target vector is non-zero
    if np.count_nonzero(target_vector) == 0:
        return []  # Return empty if the target vector is zero

    # Calculate cosine similarities
    similarities = []
    for i in range(cooccurrence_matrix.shape[0]):
        if i != word_index:  # Avoid comparing the word with itself
            other_vector = cooccurrence_matrix[i].flatten()

            # Check if the other vector is non-zero
            if np.count_nonzero(other_vector) > 0:
                similarity = cosine_similarity(target_vector, other_vector)
                similarities.append((vocabulary[i], similarity))

    # Sort by similarity score in descending order and get the top N
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

# Example usage
vocab_path = "/content/drive/MyDrive/NLP Assignments/Assignment_02/Final_Vocab.txt"
vocabulary = load_vocabulary(vocab_path)  # Your vocabulary list
covid_index = vocabulary.index("covid")  # Get the index of "covid"

X_final = np.load("/content/drive/MyDrive/NLP Assignments/Assignment_02/Final_Matrix.npy")
top_similar_words = get_top_similar_words(X_final, covid_index, vocabulary, top_n = 40)

print("Top 40 words similar to 'covid':")
for word, score in top_similar_words:
    print(f"{word}: {score:.4f}")


Top 40 words similar to 'covid':
treatment: 0.9732
use: 0.9690
studies: 0.9663
proteins: 0.9617
cases: 0.9590
specific: 0.9590
two: 0.9588
disease: 0.9575
one: 0.9575
severe: 0.9553
information: 0.9490
activities: 0.9474
immune: 0.9456
data: 0.9455
sample: 0.9429
virus: 0.9409
challenge: 0.9390
response: 0.9389
determined: 0.9385
clinical: 0.9381
increased: 0.9379
license: 0.9375
review: 0.9372
considered: 0.9355
cell: 0.9353
per: 0.9351
chest: 0.9344
incubated: 0.9336
de: 0.9334
surveillance: 0.9326
low: 0.9321
stable: 0.9320
role: 0.9319
characteristics: 0.9297
negative: 0.9292
approach: 0.9290
diagnostic: 0.9272
free: 0.9263
nature: 0.9252
results: 0.9218


## Five nouns and verbs relevant to COVID19 from the corpus.

**Nouns:**

- treatment
- disease
- virus
- response
- cases

**Verbs:**

- use
- increased
- reviewed
- considered
- incubated


##

In [27]:
# List of nouns and verbs
words_to_find = {
    "Nouns": ["treatment", "disease", "virus", "response", "cases"],
    "Verbs": ["use", "increased", "reviewed", "considered", "incubated"]
}

# Loop through each category and word
for category, words in words_to_find.items():
    print(f"\n{category}:")
    for word in words:
        # Assume you have a way to get the index of each word in the vocabulary
        word_index = vocabulary.index(word)  # Get the index of the word
        top_similar_words = get_top_similar_words(X_final, word_index, vocabulary, top_n=5)

        print(f"\nTop 5 words similar to '{word}':")
        for similar_word, score in top_similar_words:
            print(f"{similar_word}: {score:.4f}")


Nouns:

Top 5 words similar to 'treatment':
severe: 0.9762
covid: 0.9732
information: 0.9621
review: 0.9581
one: 0.9561

Top 5 words similar to 'disease':
specific: 0.9643
role: 0.9628
two: 0.9608
covid: 0.9575
diagnostic: 0.9500

Top 5 words similar to 'virus':
less: 0.9686
findings: 0.9669
cells: 0.9638
expression: 0.9636
protein: 0.9635

Top 5 words similar to 'response':
severe: 0.9451
covid: 0.9389
disease: 0.9377
treatment: 0.9375
nature: 0.9350

Top 5 words similar to 'cases':
role: 0.9740
two: 0.9711
covid: 0.9590
virus: 0.9583
pneumonia: 0.9545

Verbs:

Top 5 words similar to 'use':
covid: 0.9690
cell: 0.9607
studies: 0.9566
specific: 0.9515
disease: 0.9478

Top 5 words similar to 'increased':
two: 0.9406
cases: 0.9390
covid: 0.9379
activities: 0.9361
disease: 0.9348

Top 5 words similar to 'reviewed':
notably: 0.7997
lipid: 0.7916
toward: 0.7887
con: 0.7825
detect: 0.7823

Top 5 words similar to 'considered':
role: 0.9548
cases: 0.9540
associated: 0.9525
pneumonia: 0.9525
re

In [55]:
import numpy as np

# Replace NaN and infinity values with 0
X_final_cleaned = np.nan_to_num(X_normalized, nan=0.0, posinf=0.0, neginf=0.0)

# Get the rank of the cleaned matrix
rank = np.linalg.matrix_rank(X_final_cleaned)

print(f"Rank of the matrix: {rank}")


Rank of the matrix: 3636


In [34]:
import numpy as np

# Count the number of NaN values
num_nan = np.isnan(X_final).sum()

# Count the number of positive and negative infinity values
num_posinf = np.isposinf(X_final).sum()
num_neginf = np.isneginf(X_final).sum()

# Count the total number of infinity values (both positive and negative)
num_inf = np.isinf(X_final).sum()

# Output the counts
print(f"Number of NaN values: {num_nan}")
print(f"Number of positive infinity values: {num_posinf}")
print(f"Number of negative infinity values: {num_neginf}")
print(f"Total number of infinity values: {num_inf}")


Number of NaN values: 0
Number of positive infinity values: 0
Number of negative infinity values: 0
Total number of infinity values: 0
