## Project Requirements
Perform extractive summarization on a given text: preprocess the text, score each sentence, extract the highest-scoring sentences, and display the final summary.

## Load or Generate Text Data

Load an existing text document or generate a sample text for the summarization project. This will be the input document from which we will extract the summary.


In [1]:
text_data = """Artificial intelligence (AI) is rapidly transforming various aspects of our lives, from how we work to how we interact with technology. Machine learning, a subset of AI, enables systems to learn from data without explicit programming, leading to advancements in fields like natural language processing and computer vision. Deep learning, a further specialization within machine learning, utilizes neural networks with multiple layers to uncover intricate patterns in vast datasets. These technologies are at the forefront of innovation, powering applications such as self-driving cars, medical diagnostics, and personalized recommendation systems. Ethical considerations and responsible development are crucial as AI continues to evolve and become more integrated into society. The potential benefits are immense, but so are the challenges, requiring careful thought and collaboration from researchers, policymakers, and the public. Understanding the fundamentals of AI is becoming increasingly important for everyone in the modern world. This text provides a brief overview of the key concepts and impact of artificial intelligence."""

print("Sample text data loaded successfully.")
print(f"Length of text_data: {len(text_data)} characters")

Sample text data loaded successfully.
Length of text_data: 1133 characters
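
The cell above uses a hard-coded sample. For the "load an existing text document" case, a minimal sketch might look like the following. The helper name `load_text` and the file name `input_document.txt` are hypothetical and not part of the original notebook; the helper falls back to a sample string when the file is absent.

```python
from pathlib import Path


def load_text(path, fallback):
    """Return the contents of `path` if it is a file, otherwise `fallback`."""
    file_path = Path(path)
    if file_path.is_file():
        # Read as UTF-8; adjust the encoding if the document uses another one
        return file_path.read_text(encoding="utf-8")
    return fallback


# Hypothetical file name; the fallback string is used when the file is missing
text = load_text("input_document.txt", fallback="Sample text. It has two sentences.")
print(f"Length of text: {len(text)} characters")
```

Passing the sample text as `fallback` keeps the rest of the notebook runnable whether or not a document is available.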


## Text Preprocessing

Clean and preprocess the text data. This typically involves sentence tokenization, word tokenization, removing stop words, and potentially stemming or lemmatization, to prepare it for analysis.


In [2]:
try:
    import nltk
except ImportError:
    print("NLTK not found. Installing NLTK...")
    !pip install nltk
    import nltk
    print("NLTK installed successfully.")
else:
    print("NLTK is already installed.")

NLTK is already installed.


The next cell imports the necessary NLTK modules, downloads the required NLTK data, and splits the text into sentences. Each sentence is then word-tokenized, lowercased, stripped of punctuation, filtered for stop words, and lemmatized; the results are stored in `preprocessed_sentences`.



In [4]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

# Download necessary NLTK data
print("Downloading NLTK data...")
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True) # Open Multilingual Wordnet for lemmatizer
nltk.download('punkt_tab', quiet=True) # Punkt tables needed by the tokenizers in newer NLTK releases (avoids a LookupError)
print("NLTK data downloaded.")

# 3. Tokenize the text_data into individual sentences
sentences = sent_tokenize(text_data)

# 4. Initialize a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()

# 5. Create an empty list, preprocessed_sentences
preprocessed_sentences = []

# Get English stopwords
stop_words = set(stopwords.words('english'))

# 6. Iterate through each sentence for preprocessing
for sentence in sentences:
    # a. Tokenize the sentence into words
    words = word_tokenize(sentence)

    preprocessed_words = []
    for word in words:
        # b. Convert each word to lowercase
        word = word.lower()

        # c. Remove punctuation
        # Strip leading and trailing punctuation (e.g., 'word.' -> 'word');
        # punctuation inside a token, such as the hyphen in 'self-driving', is kept
        word = word.strip(string.punctuation)

        # Skip tokens that were pure punctuation and are now empty
        if not word:
            continue

        # d. Filter out stopwords
        if word not in stop_words:
            # e. Apply lemmatization to each remaining word
            # lemmatize() treats the word as a noun by default; pass pos= (e.g., pos='v') for other parts of speech
            lemmatized_word = lemmatizer.lemmatize(word)
            preprocessed_words.append(lemmatized_word)

    # f. Join the lemmatized words back into a string and append it to preprocessed_sentences
    preprocessed_sentences.append(' '.join(preprocessed_words))

# 7. Print the first few preprocessed sentences to verify
print("\nFirst 3 preprocessed sentences:")
for i, sent in enumerate(preprocessed_sentences[:3]):
    print(f"Sentence {i+1}: {sent}")

print(f"\nTotal preprocessed sentences: {len(preprocessed_sentences)}")

Downloading NLTK data...
NLTK data downloaded.

First 3 preprocessed sentences:
Sentence 1: artificial intelligence ai rapidly transforming various aspect life work interact technology
Sentence 2: machine learning subset ai enables system learn data without explicit programming leading advancement field like natural language processing computer vision
Sentence 3: deep learning specialization within machine learning utilizes neural network multiple layer uncover intricate pattern vast datasets

Total preprocessed sentences: 8
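
The per-sentence steps can also be packaged as a reusable function. The sketch below is a dependency-free approximation, not the notebook's NLTK pipeline: the name `preprocess_sentence`, the regex tokenizer, and the tiny `demo_stop_words` set are assumptions made for illustration. The lemmatizer is passed in as a callable, so `WordNetLemmatizer().lemmatize` could be supplied where NLTK is available.

```python
import re

# Words made of letters/digits, optionally joined by hyphens or apostrophes
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


def preprocess_sentence(sentence, stop_words, lemmatize=lambda word: word):
    """Lowercase, tokenize, drop stop words, and lemmatize one sentence."""
    tokens = TOKEN_PATTERN.findall(sentence.lower())
    kept = [lemmatize(token) for token in tokens if token not in stop_words]
    return " ".join(kept)


# Tiny illustrative stop-word set; the notebook uses NLTK's English list
demo_stop_words = {"a", "of", "to", "from", "the", "is"}
print(preprocess_sentence(
    "Machine learning, a subset of AI, enables systems to learn from data.",
    demo_stop_words,
))
# prints: machine learning subset ai enables systems learn data
```

With the default identity lemmatizer, "systems" is left as-is; the NLTK lemmatizer used in the notebook would reduce it to "system".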


## Implement Sentence Scoring

Develop a method to score the importance of each sentence in the document.

Scoring here is based on word frequency. The next cell counts how often each word occurs across all preprocessed sentences, then scores each sentence as the sum of the frequencies of its words, and finally prints the scores.



In [5]:
from collections import defaultdict

# 1. Calculate the frequency of each word across all preprocessed_sentences.
word_frequencies = defaultdict(int)
for sentence in preprocessed_sentences:
    for word in sentence.split():
        word_frequencies[word] += 1

print("Top 10 word frequencies:")
# Sort word_frequencies by value in descending order and print the top 10
sorted_word_frequencies = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)
for word, freq in sorted_word_frequencies[:10]:
    print(f"  {word}: {freq}")

# 2. Create an empty list called sentence_scores
sentence_scores = []

# 3. Iterate through each preprocessed_sentence
for sentence in preprocessed_sentences:
    # a. Initialize a current_sentence_score to 0.
    current_sentence_score = 0
    # b. For each word in the current preprocessed_sentence (split by spaces):
    for word in sentence.split():
        # i. Add the word's frequency to current_sentence_score.
        current_sentence_score += word_frequencies[word]
    # c. Append current_sentence_score to the sentence_scores list.
    sentence_scores.append(current_sentence_score)

# 4. Print the sentence_scores list to inspect the calculated scores.
print("\nCalculated sentence scores:")
for i, score in enumerate(sentence_scores):
    print(f"Sentence {i+1} score: {score}")


Top 10 word frequencies:
  ai: 4
  learning: 3
  artificial: 2
  intelligence: 2
  technology: 2
  machine: 2
  system: 2
  rapidly: 1
  transforming: 1
  various: 1

Calculated sentence scores:
Sentence 1 score: 17
Sentence 2 score: 27
Sentence 3 score: 21
Sentence 4 score: 14
Sentence 5 score: 14
Sentence 6 score: 11
Sentence 7 score: 12
Sentence 8 score: 11
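
Because each score is a raw sum, sentences with more words accumulate higher totals. One common variation, shown here only as a sketch and not as the notebook's method, divides the sum by the number of words in the sentence. The function name `score_sentences`, the `normalize` flag, and the `demo` sentences are all hypothetical.

```python
from collections import Counter


def score_sentences(preprocessed, normalize=False):
    """Score each sentence by the summed corpus frequency of its words.

    With normalize=True, the sum is divided by the sentence's word count.
    """
    word_freq = Counter(word for sent in preprocessed for word in sent.split())
    scores = []
    for sent in preprocessed:
        words = sent.split()
        total = sum(word_freq[word] for word in words)
        if normalize:
            # Guard against empty sentences to avoid division by zero
            total = total / len(words) if words else 0.0
        scores.append(total)
    return scores


# Made-up preprocessed sentences for illustration
demo = ["ai learning data", "ai", "deep learning network layer pattern"]
print(score_sentences(demo))  # prints: [5, 2, 6]
print(score_sentences(demo, normalize=True))
```

In this toy input the longest sentence wins on the raw sum, while the one-word sentence "ai" wins after normalization, so which variant is preferable depends on whether sentence length should count toward importance.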


## Extract Summary Sentences

Select the top N highest-scoring sentences from the original document to form the extractive summary. The value of N can be adjusted based on the desired summary length.


In [6]:
n = 3 # Desired number of summary sentences

# Ensure n does not exceed the total number of sentences
if n > len(sentences):
    n = len(sentences)

# Create a list of (score, original_sentence) tuples
sentence_score_pairs = []
for i, score in enumerate(sentence_scores):
    sentence_score_pairs.append((score, sentences[i]))

# Sort the list of tuples by score in descending order
sentence_score_pairs.sort(key=lambda x: x[0], reverse=True)

# Select the top 'n' sentences to form the summary
summary_sentences = []
for i in range(n):
    summary_sentences.append(sentence_score_pairs[i][1]) # Append the original sentence

print(f"Extracted Summary (Top {n} sentences):")
for i, sent in enumerate(summary_sentences):
    print(f"Summary Sentence {i+1}: {sent}")

Extracted Summary (Top 3 sentences):
Summary Sentence 1: Machine learning, a subset of AI, enables systems to learn from data without explicit programming, leading to advancements in fields like natural language processing and computer vision.
Summary Sentence 2: Deep learning, a further specialization within machine learning, utilizes neural networks with multiple layers to uncover intricate patterns in vast datasets.
Summary Sentence 3: Artificial intelligence (AI) is rapidly transforming various aspects of our lives, from how we work to how we interact with technology.
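
The summary above lists sentences in descending score order. If the summary should instead read in the order the sentences appear in the document, one option is to pick the top-N indices and then sort them. This is a sketch under that assumption; the helper name `top_n_in_document_order` and the placeholder sentences `S1` to `S8` are hypothetical, while the scores are the ones printed earlier.

```python
import heapq


def top_n_in_document_order(sentences, scores, n):
    """Return the n highest-scoring sentences, kept in original order."""
    n = max(0, min(n, len(sentences)))
    # Indices of the n best scores
    top_indices = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    # Sorting the indices restores document order
    return [sentences[i] for i in sorted(top_indices)]


demo_sentences = [f"S{i}" for i in range(1, 9)]
demo_scores = [17, 27, 21, 14, 14, 11, 12, 11]  # scores from the scoring cell
print(top_n_in_document_order(demo_sentences, demo_scores, 3))
# prints: ['S1', 'S2', 'S3']
```

For this document the same three sentences are selected, but they would be presented as sentence 1, then 2, then 3, which usually reads more naturally.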


## Display Extracted Summary

Present the generated extractive summary to clearly show the result of the summarization process.


In [7]:
print("\nFinal Extracted Summary:")
for i, sent in enumerate(summary_sentences):
    print(f"{i+1}. {sent}")


Final Extracted Summary:
1. Machine learning, a subset of AI, enables systems to learn from data without explicit programming, leading to advancements in fields like natural language processing and computer vision.
2. Deep learning, a further specialization within machine learning, utilizes neural networks with multiple layers to uncover intricate patterns in vast datasets.
3. Artificial intelligence (AI) is rapidly transforming various aspects of our lives, from how we work to how we interact with technology.


### Data Analysis Key Findings

*   **Text Data Initialization**: A sample text document of 1133 characters related to Artificial Intelligence was successfully loaded as the input for summarization.
*   **Comprehensive Text Preprocessing**: The text underwent several preprocessing steps using NLTK, including sentence tokenization (resulting in 8 sentences), word tokenization, conversion to lowercase, removal of punctuation, filtering of common English stopwords, and lemmatization. This process required downloading necessary NLTK data like 'punkt', 'stopwords', 'wordnet', 'omw-1.4', and 'punkt_tab'.
*   **Word Frequency Analysis**: Word frequencies were calculated across all preprocessed sentences. For example, 'ai' appeared 4 times, and 'learning' appeared 3 times, indicating their relative importance.
*   **Sentence Scoring Implementation**: Each preprocessed sentence was scored by summing the frequencies of its constituent words. Scores ranged from 11 to 27; the highest score, 27, went to "Machine learning, a subset of AI, enables systems to learn from data without explicit programming, leading to advancements in fields like natural language processing and computer vision" (scored on its preprocessed form). Because the score is an unnormalized sum, sentences with more content words tend to score higher.
*   **Extractive Summary Generation**: The top 3 highest-scoring sentences from the original text were extracted to form the summary. In descending score order, they were:
    1.  "Machine learning, a subset of AI, enables systems to learn from data without explicit programming, leading to advancements in fields like natural language processing and computer vision."
    2.  "Deep learning, a further specialization within machine learning, utilizes neural networks with multiple layers to uncover intricate patterns in vast datasets."
    3.  "Artificial intelligence (AI) is rapidly transforming various aspects of our lives, from how we work to how we interact with technology."
*   **Final Summary Presentation**: The generated extractive summary was clearly displayed, listing the three selected sentences in an ordered format.