# Text Processing and Analysis in Jupyter

This notebook takes a body of text entered by the user and performs several natural language processing (NLP) tasks on it. An interactive interface accepts the input and displays the processed results.

## Features:
- **Sentence Segmentation:** The input text is split into individual sentences.
- **Word Tokenization:** Each sentence is lowercased and tokenized into words. Only purely alphabetic tokens are kept, so punctuation and numbers are dropped.
- **Word Frequency Count:** The occurrences of each word are counted per sentence.
- **Timestamping:** Each sentence is timestamped with the time it was processed, in ISO 8601 format.
- **Interactive UI:** A text area for input and a button to trigger processing are provided, with the output displayed below.
- **JSON Export:** The results are saved to `Processed_Data/processed_data.json`.

Enter your text in the text area and click "Process Text" to see the script in action. The results list each sentence with its word counts and the timestamp of processing.
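As a rough illustration of the record produced for each sentence, the sketch below uses only the standard library. The regular expressions are a simplified stand-in for NLTK's sentence and word tokenizers (an assumption made so the example runs without any downloads), and `toy_split` and `toy_record` are hypothetical names that do not appear in the notebook.

```python
import re
from collections import Counter
from datetime import datetime

def toy_split(text):
    # Naive stand-in for Punkt: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def toy_record(sentence):
    # Lowercase the sentence, extract runs of letters, and count them.
    words = re.findall(r'[a-z]+', sentence.lower())
    return {
        'sentence': sentence,
        'word_dict': Counter(words),
        'timestamp': datetime.now().isoformat(),
    }

records = [toy_record(s) for s in toy_split("The cat sat. The cat ran to the mat!")]
for r in records:
    print(r['sentence'], dict(r['word_dict']), r['timestamp'])
```

The real tokenizers handle abbreviations, contractions, and non-ASCII text far better than these patterns; the point here is only the shape of each record: a sentence, a word-count mapping, and a timestamp.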

# Imports
This notebook uses the following Python libraries:

- `nltk`: The Natural Language Toolkit, used for sentence and word tokenization.
- `collections`: Standard-library module that provides `Counter`, used to count word frequencies.
- `datetime`: Standard-library module for date and time operations, used here to timestamp each sentence.
- `ipywidgets`: Provides interactive HTML widgets for Jupyter notebooks, used to build the text area and button.
- `IPython.display`: Offers utilities for displaying rich media in Jupyter notebooks, used here to manage the output area.
- `logging`: Standard-library logging module, used to raise the `nltk` logger's level to `ERROR` so that informational messages are hidden.
- `os`: Provides operating-system functionality, used here to create the output folder and build file paths.
- `json`: Encodes and decodes JSON, used to save the processed data to a file.

These libraries work together to create an interactive text processing application within the Jupyter notebook environment.


In [5]:
import os
import json
from collections import Counter
from datetime import datetime
import nltk
from nltk.tokenize import word_tokenize
import ipywidgets as widgets
from IPython.display import display, clear_output
import logging

# Core Functionality

This script provides a user interface for entering and processing text to extract sentences and generate word frequency data.

Functionality includes:
- **Sentence Tokenization**: Splitting entered text into individual sentences.
- **Word Tokenization**: Breaking down sentences into words, excluding punctuation.
- **Word Count**: Generating a frequency count of words in each sentence.
- **Timestamping**: Marking each processed sentence with the current date and time.
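- **Data Display**: Printing each sentence with its word counts and timestamp in the notebook.
- **Data Storage**: Saving the results to `Processed_Data/processed_data.json`, overwriting the file on each run.

To use the script, enter text in the text area and click "Process Text".
The output appears below the button, showing the sentences, word counts, and timestamps. A sentence that occurs more than once in the same input is processed only once.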
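To make the "excluding punctuation" step concrete, the sketch below applies an `isalpha()` filter to a hand-written token list. The list is an assumption about what a Treebank-style tokenizer such as `word_tokenize` typically returns for "She didn't buy 3 apples, did she?" after lowercasing; it is hard-coded so the example runs without NLTK or its data files.

```python
from collections import Counter

# Assumed tokenizer output for: "she didn't buy 3 apples, did she?"
tokens = ['she', 'did', "n't", 'buy', '3', 'apples', ',', 'did', 'she', '?']

# Keep only tokens made up entirely of letters.
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

counts = Counter(kept)
print(dict(counts))
print(dropped)
```

The filter removes more than punctuation: numbers, contraction fragments such as `n't`, and hyphenated words are dropped as well, which matters if those tokens are relevant to the analysis.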

In [6]:
# Set the level to ERROR to display only errors, not info messages
logging.getLogger('nltk').setLevel(logging.ERROR)

# Ensure the Punkt sentence tokenizer data is available.
# Older NLTK releases use the 'punkt' resource; newer ones use 'punkt_tab'.
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource, quiet=True)

def split_into_sentences(text):
    # sent_tokenize uses the English Punkt model by default
    return nltk.sent_tokenize(text)

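def process_sentence(sentence):
    # Lowercase and tokenize, then keep only purely alphabetic tokens.
    # This drops punctuation, numbers, and fragments such as "n't".
    words = word_tokenize(sentence.lower())
    words = [word for word in words if word.isalpha()]
    return Counter(words)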

def add_to_list(sentences, sentence):
    # Add a record for the sentence unless it is empty or already in the list.
    # Returns True if a record was added, False otherwise.
    if not sentence or any(s['sentence'] == sentence for s in sentences):
        return False
    sentences.append({
        'sentence': sentence,
        'word_dict': process_sentence(sentence),
        'timestamp': datetime.now().isoformat(),
    })
    return True

# Create a folder to save the data
folder_name = 'Processed_Data'
os.makedirs(folder_name, exist_ok=True)  # This will create the folder if it doesn't exist

# Save data to a JSON file (the file is overwritten on every call)
def save_data(data):
    file_name = os.path.join(folder_name, 'processed_data.json')
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)
    print(f"Data saved to {file_name}")

# UI Components
text_area = widgets.Textarea(
    placeholder='Type or paste your transcribed text here',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='200px')
)

output_area = widgets.Output()
button = widgets.Button(description="Process Text")

# Event Handlers
def on_button_clicked(b):
    with output_area:
        clear_output()
        if text_area.value.strip():
            sentences = []
            data_updated = False
            for sentence in split_into_sentences(text_area.value):
                if add_to_list(sentences, sentence):
                    data_updated = True
            if data_updated:
                print("Data processed successfully.")
                # Displaying the sentences in a formatted way
                for item in sentences:
                    print(f"Sentence: {item['sentence']}")
                    print(f"Word Count: {dict(item['word_dict'])}")
                    print(f"Timestamp: {item['timestamp']}\n")
                # Save the data to a JSON file
                save_data(sentences)
        else:
            print("Please enter some text to process.")

button.on_click(on_button_clicked)

# Display UI
display(text_area, button, output_area)

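The saved file can be read back with the standard library. The sketch below writes a small hand-made record list to a temporary folder (so it does not touch `Processed_Data`) and reloads it. The sample record and the `load_data` helper are hypothetical and only mirror the record layout used above.

```python
import json
import os
import tempfile
from collections import Counter
from datetime import datetime

def load_data(file_name):
    # Hypothetical helper: read back a record list written with json.dump.
    with open(file_name, encoding='utf-8') as f:
        return json.load(f)

# Hand-made sample record in the same layout as the notebook's output.
sample = [{
    'sentence': 'The cat sat.',
    'word_dict': Counter({'the': 1, 'cat': 1, 'sat': 1}),
    'timestamp': datetime(2024, 1, 1, 12, 0, 0).isoformat(),
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'processed_data.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f, indent=4)
    loaded = load_data(path)

# JSON has no Counter or datetime types, so convert them back explicitly.
restored = Counter(loaded[0]['word_dict'])
when = datetime.fromisoformat(loaded[0]['timestamp'])
print(type(loaded[0]['word_dict']).__name__, dict(restored), when)
```

After a round trip, the word counts come back as a plain `dict` and the timestamp as a string, so code that relies on `Counter` methods or date arithmetic needs the explicit conversions shown at the end.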