# Data Cleaning Using Regex

In [6]:
import re

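def clean_and_minimize(text):
    # Drop every character that is not an ASCII letter, digit, or whitespace.
    # This also deletes hyphens and sentence punctuation (e.g. "computer-based" becomes "computerbased").
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse each run of whitespace (spaces, tabs, newlines) into a single space
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    return cleaned_text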


# Read the original text
with open('Ch1-LEC-1_Introduction.txt', "r") as file:
    original_text = file.read()

# Calculate and print the length (in characters) of the original text
original_text_length = len(original_text)
print("Original Text Length:", original_text_length)


# Clean and minimize the text using regex
cleaned_text = clean_and_minimize(original_text)

# Calculate and print the length of the cleaned text
cleaned_text_length = len(cleaned_text)
print("Cleaned Text Length:", cleaned_text_length)

# Save cleaned text to a text file
text_filename = 'Ch1_cleaned.txt'
with open(text_filename, 'w') as text_file:
    text_file.write(cleaned_text)

print(f"Cleaned text saved to {text_filename}")

Original Text Length: 16015
Cleaned Text Length: 12022
Cleaned text saved to Ch1_cleaned.txt
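
To make the effect of the two substitutions concrete, here is a minimal sketch on a made-up sentence. The sample string and the `clean_keep_punctuation` variant are illustrative assumptions, not part of the original notebook. The first pattern deletes hyphens outright, which is why fused words such as `computerbased` appear in the generated summary.

```python
import re

def clean_and_minimize(text):
    # Same two substitutions as the notebook cell
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', cleaned_text)

def clean_keep_punctuation(text):
    # Hypothetical variant: turn hyphens into spaces and keep basic
    # sentence punctuation so sentence boundaries survive cleaning
    text = text.replace('-', ' ')
    text = re.sub(r"[^a-zA-Z0-9\s.,;:?!']", '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Computer-based systems:\n\n  theories, methods & tools!"
print(clean_and_minimize(sample))      # Computerbased systems theories methods tools
print(clean_keep_punctuation(sample))  # Computer based systems: theories, methods tools!
```

Keeping periods and commas may help a summarization model, since it can still see where sentences end; whether it improves the result on this text has not been tested here.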


# Summarization Model

In [8]:
from transformers import pipeline

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn", tokenizer="facebook/bart-large-cnn")

# Read the text from the file
file_path = 'Ch1_cleaned.txt'
with open(file_path, 'r') as file:
    input_text = file.read()

# Split the text into fixed-size chunks of chunk_size characters (not tokens).
# Slicing by character offset can cut a word in half at a chunk boundary.
chunk_size = 1000
chunks = [input_text[i:i+chunk_size] for i in range(0, len(input_text), chunk_size)]

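# Generate a summary for each chunk. max_length and min_length are measured in
# tokens; 512 is larger than the token length of a 1000-character chunk, which
# is what triggers the max_length warnings in the output below.
summaries = []
for chunk in chunks:
    summary = summarizer(chunk, max_length=512, min_length=30,
                         length_penalty=2.0, num_beams=4, early_stopping=True)
    summaries.append(summary[0]['summary_text'])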

# Combine the summaries into a final summary
final_summary = " ".join(summaries)

# Print the final summary
print("\nGenerated Summary:")
print(final_summary)

Your max_length is set to 512, but your input_length is only 153. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=76)
Your max_length is set to 512, but your input_length is only 147. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=73)
Your max_length is set to 512, but your input_length is only 146. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=73)
Your max_length is set to 512, but your input_length is only 159. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=79)



Generated Summary:
Software engineering is concerned with theories methods and tools for professional software development. Software engineering involves wider responsibilities than simply the application of technical skills. The economies of all developed nations are dependent on software. Software engineering is an engineering discipline that is concerned with all aspects of software production. Good software should deliver the required functionality and performance to the user and should be maintainable dependable and usable. Software engineering is concerned with the practicalities of developing and delivering useful software. Software engineering is part of the more general process of computerbased systems development. All software projects have to be professionally managed and developed different techniques are appropriate for different types of system. The web has led to the availability of software services and the possibility of developing highly distributed servicebased syst
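
The warnings above appear because `max_length=512` exceeds the token count of each chunk, and fixed character slicing can also split a word across two chunks. The following is a minimal sketch of one way to address both. `chunk_by_words` and `capped_max_length` are hypothetical helpers that are not part of the original notebook, and the 0.5 ratio simply mirrors the halved values the warnings suggest.

```python
def chunk_by_words(text, max_chars=1000):
    # Greedily pack whole words into chunks of at most max_chars characters
    # (a single word longer than max_chars ends up in a chunk of its own)
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        added = len(word) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

def capped_max_length(input_tokens, ratio=0.5, floor=30):
    # Assumed heuristic: target about half the input token count, but never
    # drop below the min_length of 30 used in the notebook
    return max(floor, int(input_tokens * ratio))

print(chunk_by_words("aaa bbb ccc ddd", max_chars=7))  # ['aaa bbb', 'ccc ddd']
print(capped_max_length(153))                          # 76

# Possible wiring into the loop (assumes the pipeline object from the notebook):
#   n_tokens = len(summarizer.tokenizer(chunk)["input_ids"])
#   summarizer(chunk, max_length=capped_max_length(n_tokens), min_length=30, ...)
```

With an input length of 153 tokens the helper returns 76, the same value the first warning proposes.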

In [9]:
print(len(final_summary))

3152
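
As a quick sanity check on the three character counts reported in the outputs, the sketch below computes how much text each stage kept. The `percent` helper is a hypothetical convenience function; the numbers are copied from the cell outputs.

```python
# Character counts copied from the notebook outputs
original_len, cleaned_len, summary_len = 16015, 12022, 3152

def percent(part, whole):
    # Share of `whole` represented by `part`, rounded to one decimal place
    return round(100 * part / whole, 1)

print(percent(cleaned_len, original_len))  # 75.1
print(percent(summary_len, cleaned_len))   # 26.2
```

The regex cleaning kept about 75% of the original characters, and the combined summary is about 26% of the cleaned text.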
