## Chunking a Large Dataset

In [None]:
import torch
import pandas as pd
import re  # Import the re module for regular expressions

### GPU Configuration in PyTorch

This section outlines how to check for GPU availability and configure PyTorch to use the GPU if available, otherwise fall back to using the CPU. This is crucial for optimizing computational efficiency, especially when working with large neural network models that benefit significantly from GPU acceleration.

#### Code Explanation:

1. **Check GPU Availability**:
   - The code begins by checking if a CUDA-compatible GPU is available using `torch.cuda.is_available()`. CUDA is NVIDIA's parallel computing architecture, which allows PyTorch to accelerate operations using the GPU.

2. **Configure Device**:
   - If a GPU is available:
     - The code sets the `device` to use CUDA by calling `torch.device("cuda")`.
     - It prints the number of GPUs available using `torch.cuda.device_count()` and displays the name of the first GPU (index 0) using `torch.cuda.get_device_name(0)`, which helps confirm that the expected hardware is being used.
   - If no GPU is available:
     - The code outputs a message indicating that no GPU is available and sets the `device` to use the CPU instead. This ensures that the code remains portable and can run on systems without a GPU, albeit at potentially lower speeds.

#### Code Usage:
This setup is typically employed at the beginning of a script to configure the hardware settings appropriately before proceeding with data loading, model creation, and training processes. It helps in leveraging available hardware to its fullest potential, or provides a fallback to ensure compatibility.



In [2]:
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: NVIDIA A30


### Processing Large CSV Files in Chunks Using Pandas

This guide details how to efficiently process a large CSV file in manageable chunks using Python's pandas library. This approach is particularly useful when dealing with large datasets that do not fit into memory.

#### Code Overview:

1. **Parameters Setup**:
   - **input_file**: Specifies the path to the CSV file to be processed.
   - **number_of_chunks**: Sets the total number of chunks to be produced and saved.
   - **chunk_size**: Determines the number of rows each chunk contains. It is calculated as the total number of data rows in the CSV file divided by `number_of_chunks` and rounded up, which gives evenly sized chunks except possibly the last one.

2. **Chunk Processing**:
   - The code iterates through the CSV file in segments, each limited to `chunk_size` rows.
   - Each chunk is saved as a separate CSV file. This is done in a loop that enumerates over `pd.read_csv`, which reads the file in segments of the size given by `chunksize`.

3. **Conditional Chunk Handling**:
   - Once the loop has saved the last requested chunk (index `number_of_chunks - 1`), it breaks to stop further processing. Because `chunk_size` is rounded up, all rows fit within `number_of_chunks` files, so the break acts only as a safeguard against writing more chunks than requested.



In [None]:

# Parameters
input_file = 'note_Jun2023.csv'
number_of_chunks = 5

# Count data rows (physical lines minus the header line). Quoted fields that
# span several lines make this an overestimate, which only enlarges the chunks.
with open(input_file, 'r') as f:
    total_rows = sum(1 for _ in f) - 1

# Ceiling division so that at most number_of_chunks chunks are produced
chunk_size = max(1, -(-total_rows // number_of_chunks))

# Process and save chunks
for i, chunk in enumerate(pd.read_csv(input_file, chunksize=chunk_size)):
    chunk.to_csv(f'chunk_{i}.csv', index=False)
    if i == number_of_chunks - 1:  # Safeguard: stop once the requested number of chunks is written
        break
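
The relationship between the total row count, the chunk size, and the number of chunks can be checked in isolation. The following is a minimal sketch in plain Python; `rows_per_chunk` and `chunk_bounds` are hypothetical helper names introduced here for illustration and are not part of the notebook.

```python
def rows_per_chunk(total_rows, number_of_chunks):
    """Smallest chunk size that covers total_rows in at most number_of_chunks chunks."""
    if number_of_chunks < 1:
        raise ValueError("number_of_chunks must be at least 1")
    # Ceiling division without floats; max(1, ...) keeps the size valid for an empty file
    return max(1, -(-total_rows // number_of_chunks))


def chunk_bounds(total_rows, number_of_chunks):
    """Half-open (start, stop) row ranges that each chunk would cover."""
    size = rows_per_chunk(total_rows, number_of_chunks)
    return [(start, min(start + size, total_rows))
            for start in range(0, total_rows, size)]


print(rows_per_chunk(1_000_003, 5))  # 200001
print(chunk_bounds(11, 5))           # [(0, 3), (3, 6), (6, 9), (9, 11)]
```

Rounding up means the number of chunks can be smaller than requested (four instead of five in the second call), but no rows are left over after the last chunk.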

## Cleaning an EHR Chunk and Grouping Sentences (to Meet BERT's 512-Token Limit)

### Random Sampling and Shuffling Data with Pandas

This guide outlines the process of reading data from a CSV file into a DataFrame, shuffling the data, and obtaining a random sample. Shuffling removes any ordering present in the source file, sampling reduces the dataset to a size that is manageable to process, and a fixed random seed makes both steps reproducible.

#### Libraries:
- **pandas**: Used for data manipulation and analysis.

#### Detailed Steps:

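1. **Reading Data**:
   - The CSV file named `chunk_0.csv` is read into a pandas DataFrame named `df`. This initial step loads the data into memory, making it ready for further processing.

2. **Shuffling**:
   - Calling `sample(frac=1, random_state=42)` returns every row exactly once in random order. The fixed `random_state` makes the shuffle reproducible.

3. **Sampling**:
   - A second call to `sample` draws a random subset of 1,000,000 rows from the shuffled data, again with a fixed `random_state`.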


In [7]:
# Read the CSV file into a DataFrame
df = pd.read_csv("chunk_0.csv")
# Shuffle the DataFrame
df_shuffled = df.sample(frac=1, random_state=42)  # Using random_state for reproducibility
# Get a random sample of up to 1,000,000 rows (capped at the number of rows available)
df = df_shuffled.sample(n=min(1_000_000, len(df_shuffled)), random_state=42)  # Using random_state for reproducibility
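
The behaviour of `sample` can be seen on a small frame. This is a minimal sketch on toy data; the DataFrame contents and the `requested` variable are illustrative assumptions, not values from the notebook. It shows that `frac=1` returns every row once, and that capping `n` at the frame length avoids the `ValueError` pandas raises when more rows are requested than exist (with the default `replace=False`).

```python
import pandas as pd

# Toy stand-in for a chunk file; the values are illustrative only
toy = pd.DataFrame({"Note_ID": range(10),
                    "NOTE_TXT": [f"note {i}" for i in range(10)]})

# frac=1 returns every row exactly once, in random order
shuffled = toy.sample(frac=1, random_state=42)

# Cap the requested size so sample() does not raise when the frame is small
requested = 1_000_000
subset = toy.sample(n=min(requested, len(toy)), random_state=42)
smaller = toy.sample(n=3, random_state=42)

print(sorted(shuffled["Note_ID"]) == list(range(10)))  # True
print(len(subset), len(smaller))                       # 10 3
```

Since `sample` already draws rows at random without replacement, the separate shuffle is not strictly required before sampling; it is harmless, and keeping it makes the intent explicit.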

### Text Processing Functions for Natural Language Processing

This guide provides an overview of three essential Python functions designed for preparing text data in natural language processing (NLP) applications. These functions are particularly useful when dealing with models like BERT that require specific text formatting.


#### Functions Overview:

1. **Preprocess and Clean Text**:
   - This function standardizes text data, making it more uniform and easier for models to process. It converts all text to lowercase, replaces non-ASCII characters with a space, replaces runs of three or more consecutive punctuation characters with a single space, and collapses extra whitespace.

2. **Sentence Tokenization**:
   - Sentence tokenization involves splitting a block of text into its constituent sentences. This step is essential for understanding the structure of the text and for subsequent processing steps that may treat each sentence as a separate unit.

3. **Group Sentences**:
   - This function packs consecutive sentences into groups whose combined length stays within 500 whitespace-separated words. The word count serves as a rough proxy for BERT's 512-token input limit, and grouping neighbouring sentences keeps local context together.


In [None]:
def preprocess_and_clean_text(text):
    
    # Convert text to lowercase
    text = text.lower()
    
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # Replace 3 or more consecutive non-alphanumeric characters with 1 white space
    text = re.sub(r'[^a-zA-Z0-9\s]{3,}', ' ', text)
    
    # Replace 2 or more consecutive white spaces with 1 white space
    text = re.sub(r'\s{2,}', ' ', text)
    
    return text.strip()  # Trim whitespace from beginning and end
 
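# Tokenize sentences using nltk's sent_tokenize
# (requires NLTK's Punkt sentence tokenizer data, which can be fetched with nltk.download)
from nltk.tokenize import sent_tokenize

def tokenize_sentences(text):
    return sent_tokenize(text)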

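# Function to group sentences up to a maximum of 500 whitespace-separated words per group.
# This is a rough proxy for BERT's 512-token limit: WordPiece tokenization usually
# produces more tokens than words, so a group can still exceed 512 tokens.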
def group_sentences(sentences):
    grouped_sentences = []
    current_group = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence.split())

        if current_length + sentence_length <= 500:
            current_group.append(sentence)
            current_length += sentence_length
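        else:
            # Close the current group (if any) and start a new one with this sentence,
            # so that a sentence longer than the limit is kept rather than dropped
            if current_group:
                grouped_sentences.append(current_group)
            current_group = [sentence]
            current_length = sentence_length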

    if current_group:
        grouped_sentences.append(current_group)

    return grouped_sentences
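
To see how cleaning, sentence splitting, and grouping interact, the steps can be run on a short made-up note. This is a minimal, self-contained sketch rather than the notebook's exact code: `clean` restates the cleaning steps, `naive_split` is a regex stand-in assumed here in place of nltk's `sent_tokenize`, and `group_by_word_budget` is a hypothetical variant of the grouping function with the limit exposed as the `max_words` parameter.

```python
import re


def clean(text):
    """Lowercase, drop non-ASCII, replace runs of 3+ punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9\s]{3,}', ' ', text)
    text = re.sub(r'\s{2,}', ' ', text)
    return text.strip()


def naive_split(text):
    """Assumption: a simple stand-in for sent_tokenize that splits after . ! or ?"""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


def group_by_word_budget(sentences, max_words):
    """Pack consecutive sentences into groups of at most max_words words.
    A single sentence longer than max_words becomes its own group."""
    groups, current, length = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and length + n > max_words:
            groups.append(current)
            current, length = [], 0
        current.append(sentence)
        length += n
    if current:
        groups.append(current)
    return groups


# Made-up example text, not taken from the dataset
raw = "Pt seen TODAY.   BP 120/80 ---- stable.  Temp 37\u00b0C noted. Follow up in 2 weeks."
cleaned = clean(raw)
sentences = naive_split(cleaned)
groups = group_by_word_budget(sentences, max_words=6)

print(cleaned)
for g in groups:
    print(len(' '.join(g).split()), g)
```

With `max_words=6`, the first two three-word sentences share a group, and the remaining two sentences each start a new group. Lowering the limit to 4 leaves the five-word final sentence in a group of its own instead of discarding it.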

In [None]:

# Initialize to maintain continuity, though in a single-file scenario, it might be less relevant
current_sent_id = 1
# Drop rows with missing or non-string text before cleaning (NaN is a float and has no .lower())
df = df[df['NOTE_TXT'].apply(lambda x: isinstance(x, str))].copy()
# Apply the preprocessing function to clean the text data
df['NOTE_TXT'] = df['NOTE_TXT'].apply(preprocess_and_clean_text)
# Filter out texts shorter than 10 characters after cleaning
df = df[df['NOTE_TXT'].str.len() >= 10].copy()
# Apply sentence tokenization and grouping to the 'NOTE_TXT' column
df['Sentences'] = df['NOTE_TXT'].apply(tokenize_sentences)
df['GroupedSentences'] = df['Sentences'].apply(group_sentences)

# Generate one row per sentence group, numbering the groups consecutively
# so that SentID stays unique across notes
new_rows = []
for _, row in df.iterrows():
    for group in row['GroupedSentences']:
        new_rows.append({'Note_ID': row['Note_ID'], 'SentID': current_sent_id, 'SentText': ' '.join(group)})
        current_sent_id += 1

# Create a new DataFrame for the processed sentences
df_sent = pd.DataFrame(new_rows, columns=['Note_ID', 'SentID', 'SentText'])

# Save the processed DataFrame to a new CSV file
df_sent.to_csv("1_million_cleaned_data.csv", index=False)