# Extracting Keywords and Important Sentences from a Text

---
#### Course: Computational Data Mining
#### Professor: Dr. Fatemeh Shakeri
#### Student: Ilya Khalafi
#### Student ID: 9913039
#### January 2024

# Table Of Contents
- [Introduction](#intro)
- [Dependencies](#dependency)
- [Preparing the Data](#data)
  - [Importing the Text](#import)
  - [Text Preprocessing](#preprocessing)
  - [Extracting Sentences & Their Words](#extract)
- [Implementations](#implement)
  - [Method 1](#impmethod1)
  - [Method 2](#impmethod2)
- [Applying Implemented methods](#apply)
- [Final Conclusion](#conclusion)

<a name="intro"></a>

# Introduction 📚

---

In natural language processing, we design mathematical methods that let computers interpret text automatically and help with tasks such as classification and summarization. One important goal is to detect and extract keywords and significant sentences automatically.

<img src="https://previews.123rf.com/images/lculig/lculig1409/lculig140900234/31533068-keyword-concept-word-cloud-background.jpg" width="400"/>

In this article, we implement and apply two computational methods to automatically extract important words and sentences.

 1. In the first method, we define a score for each word-sentence pair and collect these scores in a matrix. Finally, we apply SVD to this matrix to extract the important features.
   
 2. Unfortunately, the first method does not account for similarity between the sentences it reports, so it may report several near-identical sentences as important. In the second method, we use Householder matrices to decrease the score of sentences that are similar to those already reported; as a result, the method reports only one sentence from each group of similar sentences.

<a name="dependency"></a>

# Dependencies 🧰

---

We need the following libraries during this article:

- **numpy** : <br />
    numpy is a commonly used library for scientific computing. Unlike a Python list, which stores pointers to objects, a numpy array stores its values contiguously in memory, and its vectorized routines make numerical computations much faster.

- **Matplotlib**: <br />
    Matplotlib is a widely used Python library for drawing charts and visualizing data.

- **seaborn** : <br />
    seaborn is built on Matplotlib and provides many chart templates, so we don't need to build every component of a chart by hand.

- **scikit-learn (sklearn)** : <br />
    scikit-learn is a well-known library that provides implementations of machine learning models for many tasks. Here, we use its NMF class in the implementation of the second method.

- **nltk** : <br />
    nltk stands for Natural Language Toolkit and contains implementations of many NLP methods. We will use it to remove stop words and replace words with their stems.


In [1]:
%%capture
# Python Standard Libraries
import re
# Fundamental Data Analysis & Visualization Tools
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.decomposition import NMF
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

<a name="data"></a>

# Preparing the Data 

---

In this section, we download our dataset from my Google Drive and then import its content. Then we remove its stop words and replace each word with its stem, and finally, we extract sentences from the text and convert our text into a list of lists where each list represents a sentence.

<a name="import"></a>

### Importing the Text 🔽

---

Let's download our data. It is simply a text file that contains the text of 10 pages (258 ~ 268) of the book "The Design of Everyday Things". We use the `gdown` command to download this file from my Google Drive...

In [2]:
%%capture
# Downloading the text file
!gdown '1y08C3tczKpmexAJZDgrCOOmXlPOBQ8tW'

Now, let's read it as a string and print some of its content...

In [3]:
# Reading the first line of the txt file as a single string
with open('book.txt', 'r') as f:
    data = f.readline().strip()
# Printing first 80 characters of the text
print(data[:80])

The realities of the world impose severe constraints upon the design of products


Great! Let's proceed to the next section and clean this text.

<a name="preprocessing"></a>

### Text Preprocessing

---

Before processing the data and extracting keywords, we have to preprocess our text. In this section, we apply these three steps to our text:

1. While extracting keywords, we need to omit stop words such as 'a', 'an', 'the', etc., because these words occur in almost any text and could be mistakenly reported as significant.

2. Additionally, we need to remove unnecessary characters such as ',', ':', ';' and parentheses, and replace '!' and '?' with a simple period.

3. A single word can have several forms, depending on whether it is used as a noun, adjective, adverb, etc. This is an issue because it makes our feature matrix much larger and sparser. Therefore, we replace each word with its stem to address this problem.

In [4]:
def remove_stopwords_and_signs(text):
    # Downloading the required NLTK resources
    nltk.download('punkt')
    nltk.download('stopwords')
    
    # Removing stop words and unnecessary characters
    stop_words = set(stopwords.words('english'))
    text = re.sub(r'[,:;()\“\”\"\']', '', text)
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])

    # Replacing "!" and "?" with a single period
    return re.sub(r'[!?]+', '.', text)

def stem_words(text):
    # Replacing each word with its stem
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in text.split()])
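
To make the cleanup step concrete, here is a minimal sketch of the same idea on a single sentence. It uses only the standard library; `TINY_STOP_WORDS`, `clean`, and `sample` are hypothetical names for illustration, and the tiny stop-word set is an assumption standing in for NLTK's much longer English list.

```python
import re

# Hypothetical miniature stop-word list (NLTK's English list is much longer)
TINY_STOP_WORDS = {"a", "an", "the", "of", "is", "to"}

def clean(text, stop_words=TINY_STOP_WORDS):
    # Drop commas, colons, semicolons, parentheses and quotes
    text = re.sub(r'[,:;()“”"\']', '', text)
    # Drop stop words (case-insensitive comparison)
    text = ' '.join(w for w in text.split() if w.lower() not in stop_words)
    # Turn every run of '!' or '?' into a single period
    return re.sub(r'[!?]+', '.', text)

sample = 'Is the design (of a door) simple? Yes: push to open!'
print(clean(sample))  # design door simple. Yes push open.
```

Because every sentence terminator ends up as a period, splitting on '.' afterwards is enough to recover the sentences.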

Now, let's call these functions to apply all three changes.

In [5]:
# Removing stop words and unnecessary signs
cleaned_text = remove_stopwords_and_signs(data)

# We have to keep a list of original sentences to reconstruct sentences later
original_sentences = [sentence.strip() for sentence in cleaned_text.split('.') if len(sentence.strip()) != 0]

# Replacing words with their stems
text = stem_words(cleaned_text)

# Comparing first three sentences of the preprocessed string with the crude data
print('First 3 sentences of the crude data:\n', '\n'.join(data.split('.')[0:3]))
print('-' * 30)
print('First 3 sentences of the preprocessed data:\n', '\n'.join(text.split('.')[0:3]))

First 3 sentences of the crude data:
 The realities of the world impose severe constraints upon the design of products
 Up to now I have described the ideal case, assuming that human-centered design principles could be followed in a vacuum; that is, without attention to the real world of competition, costs, and schedules
 Conflicting requirements will come from different sources, all of which are legitimate, all of which need to be resolved
------------------------------
First 3 sentences of the preprocessed data:
 realiti world impos sever constraint upon design products
 describ ideal case assum human-cent design principl could follow vacuum without attent real world competit cost schedules
 conflict requir come differ sourc legitim need resolved


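Awesome! Notice that the changes are applied to the text. One caveat is visible in the output: the last word of each sentence ('products', 'schedules', 'resolved') is not stemmed, because the period is still attached to it when the stemmer runs.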

<a name="extract"></a>

### Extracting Sentences & Their Words

---

Before proceeding to the next section, we have to convert our text into a list of lists, where each inner list holds the tokens of the words in one sentence. To do so, we extract the unique words from the text and assign each one a token (its index in the list of unique words). Then we convert the text into a list of lists of tokens.

In [6]:
# Defining a set that will contain all words present in the preprocessed text
words = set()
for sentence in text.split('.'):
    words = words.union(set(sentence.strip().split()))
words = list(words)
    
# Converting a set of words to a dictionary of words and their tokens
word_to_token = {word:token for token, word in enumerate(words)}
token_to_word = {token:word for word, token in word_to_token.items()}

# Making a list of lists where each inner list represents a sentence and contains its integer word tokens
tokenized_sentences = [[word_to_token[word.strip()] for word in sentence.split()] for sentence in text.split('.') if len(sentence.strip()) != 0]

# Printing statistics of words and sentences
print('Total number of unique words: ', len(words))
print('Total number of sentences: ', len(tokenized_sentences))

Total number of unique words:  803
Total number of sentences:  182
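
As a small illustration of the token mapping, the sketch below runs the same steps on a toy string. The names `toy_text`, `vocab`, and `tokenized` are hypothetical, and sorting the vocabulary is an assumption made here only so that the token numbers are reproducible; the notebook cell above uses an unordered set.

```python
toy_text = "design help user. user like design"

# One list of words per non-empty sentence
toy_sentences = [s.split() for s in toy_text.split('.') if s.strip()]

# Sorted vocabulary, so each word always receives the same token
vocab = sorted({w for s in toy_sentences for w in s})
toy_word_to_token = {w: t for t, w in enumerate(vocab)}

# Each sentence becomes a list of integer tokens
tokenized = [[toy_word_to_token[w] for w in s] for s in toy_sentences]

print(vocab)      # ['design', 'help', 'like', 'user']
print(tokenized)  # [[0, 1, 3], [3, 2, 0]]
```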


Perfect! Now let's proceed to the next section.

<a name="implement"></a>

# Implementations 🧨

---

Now let's implement the methods that we mentioned earlier!

<a name="impmethod1"></a>

### Method 1

---

The naive approach defines a score for every word-sentence pair. These scores are assembled into a matrix, and we apply singular value decomposition (SVD) to it to extract the significant words and sentences.

The simplest score for a word-sentence pair is the number of occurrences of the word in the sentence. Alternatively, we can weight that count by how rare the word is across sentences and use the following score to define our score matrix:

> a[i][j] = f[i][j] * log(n / n_i)

where:
1. f[i][j] = Number of occurrences of the word i in sentence j
2. n = Total number of sentences
3. n_i = Total number of sentences containing word i
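
As a worked example of the score above, the following sketch evaluates it on a hand-made 3-word by 3-sentence frequency matrix. The matrix values and the names `freq`, `n_i`, and `score` are made up for illustration.

```python
import numpy as np

# Rows are words, columns are sentences (toy values)
freq = np.array([[1.0, 1.0, 1.0],   # word 0 appears in every sentence
                 [2.0, 0.0, 0.0],   # word 1 appears twice, in sentence 0 only
                 [0.0, 1.0, 1.0]])  # word 2 appears in sentences 1 and 2

n = freq.shape[1]                   # total number of sentences
n_i = (freq > 0).sum(axis=1)        # sentences containing each word: [3, 1, 2]

# a[i][j] = f[i][j] * log(n / n_i), broadcast over the columns
score = freq * np.log(n / n_i)[:, np.newaxis]
print(np.round(score, 3))
```

A word that occurs in every sentence gets weight log(1) = 0, so its row vanishes, while the rare word 1 receives the largest weight, 2 * log(3).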

We implement both options for our word-sentence matrix...


In [7]:
def naive_analysis(sentences, words, matrix_mode='score'):
    '''
        args:
        1. sentences = a list of lists, where each inner list contains the word tokens of one sentence
        2. words = list of unique words of the text
        3. matrix_mode = 'score' or 'frequency'
        output:
        returns two arrays of indices: words and sentences sorted from most to least important
    '''
    # freq_mat[i, j] = number of occurrences of word i in sentence j
    freq_mat = np.zeros((len(words), len(sentences)))
    for j, sentence in enumerate(sentences):
        for token in sentence:
            freq_mat[token, j] += 1
            
    if matrix_mode == 'score':
        # n_i[i] = number of sentences containing word i
        n_i = (freq_mat > 0).sum(axis=1)
        freq_mat = freq_mat * np.log(len(sentences) / n_i)[:, np.newaxis]
    
    # np.linalg.svd returns V transposed, so Vt[0] is the first right singular vector
    U, _, Vt = np.linalg.svd(freq_mat, full_matrices=False)
    # Rank words and sentences by their magnitude in the leading singular vectors
    word_ranks = np.abs(U[:, 0]).argsort()[::-1]
    sentence_ranks = np.abs(Vt[0]).argsort()[::-1]
    return word_ranks, sentence_ranks
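
To see how the leading singular vectors produce a ranking, here is a minimal sketch on a toy matrix. The matrix `toy` is made up for illustration: word 2 and sentence 2 carry the largest counts, so they should come out on top.

```python
import numpy as np

# Rows are words, columns are sentences (toy counts)
toy = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0],
                [0.0, 2.0, 3.0]])

# np.linalg.svd returns U, the singular values, and V transposed
U, s, Vt = np.linalg.svd(toy, full_matrices=False)

# Largest magnitude in the first left/right singular vector ranks first
toy_word_ranks = np.abs(U[:, 0]).argsort()[::-1]
toy_sentence_ranks = np.abs(Vt[0]).argsort()[::-1]

print(toy_word_ranks)      # [2 1 0]
print(toy_sentence_ranks)  # [2 1 0]
```

Word 0 and sentence 0 share nothing with the dominant block of the matrix, so their entries in the leading singular vectors are zero and they rank last.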

<a name="impmethod2"></a>

### Method 2

---

The first approach does not filter out similar sentences, so it may label several near-duplicates as significant. In the second method, we factorize the word-sentence frequency matrix with NMF and then apply Householder reflections with column pivoting to the sentence factor: at each step we pick the sentence with the largest remaining norm and reflect it away, which reduces the scores of the sentences that resemble it. Consequently, the method designates only one sentence from each group of similar sentences as important.


In [8]:
def uniqueness_analysis(sentences, words, n_components=10):
    '''
        args:
        1. sentences = a list of lists which each list contains the token of words
        2. words = list of unique words of the text
        3. n_components = total number of top-ranked sentences to find
        output:
        returns one list, containing sorted rankings of sentences
    '''
    A = np.zeros((len(words), len(sentences)))
    for i, sentence in enumerate(sentences):
        for token in sentence:
            A[token, i] += 1
    
    nmf = NMF(n_components=n_components, init='random', random_state=41)
    W = nmf.fit_transform(A)
    H = nmf.components_
    
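    swaps = []
    for col_idx in range(min(H.shape)):
        # argmax is relative to the trailing block H[col_idx:, col_idx:], so shift it back to a full column index
        max_norm_idx = col_idx + np.argmax(np.linalg.norm(H[col_idx:, col_idx:], axis=0))
        swaps.append((col_idx, max_norm_idx))
        # Column pivoting: move the chosen sentence into position col_idx in both A and H
        A[:, [col_idx, max_norm_idx]] = A[:, [max_norm_idx, col_idx]]
        H[:, [col_idx, max_norm_idx]] = H[:, [max_norm_idx, col_idx]]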
        
        x = H[col_idx:, col_idx]
        e = np.zeros_like(x)
        e[0] = np.linalg.norm(x)
        u = x - e
        v = u / (np.linalg.norm(u) + 1e-16)
        # Construct Householder matrix
        Q = np.eye(H.shape[0])
        Q[col_idx:, col_idx:] -= 2.0 * np.outer(v, v)
        
        W = W @ Q
        H = Q.T @ H
        k = H[col_idx, col_idx]
        H[col_idx] /= k
        W[:, col_idx] *= k
        
    indices = list(range(A.shape[1]))
    for swap in swaps:
        indices[swap[0]], indices[swap[1]] = indices[swap[1]], indices[swap[0]]
    return indices[:n_components]
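
The effect of a single Householder step can be seen on a toy sentence factor. This is only a sketch under assumed values: `householder` is a hypothetical helper and `H_toy` is hand-made so that sentence 1 is exactly half of sentence 0, while sentence 2 points in a different direction.

```python
import numpy as np

def householder(x):
    """Reflection Q (symmetric and orthogonal) with Q @ x = ||x|| * e_1."""
    x = np.asarray(x, dtype=float)
    e = np.zeros_like(x)
    e[0] = np.linalg.norm(x)
    u = x - e
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:              # x already points along e_1
        return np.eye(len(x))
    v = u / norm_u
    return np.eye(len(x)) - 2.0 * np.outer(v, v)

# Columns are sentences; column 1 is a scaled copy of column 0
H_toy = np.array([[6.0, 3.0, 0.0],
                  [8.0, 4.0, 1.0]])

Q = householder(H_toy[:, 0])        # reflect the selected sentence onto e_1
R = Q @ H_toy
residual = np.abs(R[1])             # what is left of each sentence below row 0

print(np.round(R, 6))               # rows: [10, 5, 0.8] and [0, 0, -0.6]
print(np.round(residual, 6))        # [0, 0, 0.6]
```

After the reflection, the sentence that duplicates the selected one has no residual left, so the next pivot has to be the dissimilar sentence 2. This is the mechanism that keeps near-duplicates out of the final ranking.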

<a name="apply"></a>

# Applying Implemented methods

---

Finally, we apply the implemented methods and print the results!

In [9]:
# Feel free to change k to print k-top sentences
k = 5

# Method 1 with Score Matrix - Results
word_ranks, sentence_ranks = naive_analysis(tokenized_sentences, words, matrix_mode='score')
print(f'Top {k} sentences from method-1 with score matrix: ')
for i, idx in enumerate(sentence_ranks[:k]):
    print(f'{i+1}. ', original_sentences[idx])

print('-' * 50)
print(f'Top {k} words from method-1 with score matrix: ')
for i, idx in enumerate(word_ranks[:k]):
    print(f'{i+1}. ', words[idx])

# Method 1 with Frequency Matrix - Results
print('-' * 50)
word_ranks, sentence_ranks = naive_analysis(tokenized_sentences, words, matrix_mode='frequency')
print(f'Top {k} sentences from method-1 with frequency matrix: ')
for i, idx in enumerate(sentence_ranks[:k]):
    print(f'{i+1}. ', original_sentences[idx])
    
print('-' * 50)
print(f'Top {k} words from method-1 with frequency matrix: ')
for i, idx in enumerate(word_ranks[:k]):
    print(f'{i+1}. ', words[idx])

# Method 2
print('-' * 50)
sentence_ranks = uniqueness_analysis(tokenized_sentences, words, n_components=k)
print(f'Top {k} sentences from method-2 with frequency matrix: ')
for i, idx in enumerate(sentence_ranks):
    print(f'{i+1}. ', original_sentences[idx])

Top 5 sentences from method-1 with score matrix: 
1.  pressures larger screens forced demise physical keyboards despite attempt make tiny keyboards operated single fingers thumbs keyboards displayed screen whenever needed letter tapped one time
2.  anyone type dictate take photographs videos draw animated scenes creatively produce experiences twentieth century required huge amounts technology large crews specialized workers types devices allow us tasks ways controlled proliferate
3.  Reading done quickly possible read around three hundred words per minute skim jumping ahead back effectively acquiring information rates thousands words per minute
4.  Today talking video conferences writing photography still video collaborative interaction sorts increasingly done one single device available large variety screen sizes computational power portability
5.  written using traditional keyboards even new technological devices keyboard still remains fastest way enter words system whether paper ele

<a name="conclusion"></a>

# Final Conclusion

---

Here are our observations for each method...

Method-1:

- Sentences: The sentences selected by this method focus on various topics such as the evolution of keyboards, technological advancements, reading speed, and the importance of keyboards in entering words into a system. The selection seems to capture different aspects related to the design and usage of products.

- Words: The top 5 words identified by this method are "keyboard," "word," "product," "even," and "new." These words reflect the importance of keyboards, the general concept of words, and the significance of products in the context of the extracted sentences.

It is no surprise that the extracted keywords also appear in the extracted sentences, because this method assumes that important sentences contain important words and vice versa.

Method-2:

- Sentences: The selected sentences from this method cover topics such as the impact of new technologies, the shift from traditional writing to new media, listening speed, and the need for a new term to describe devices that combine multiple functions. These sentences provide insights into the changing landscape of technology and communication.

- Words: The top 5 words identified by this method are "product," "new," "feature," "company," and "competition." These words emphasize the importance of products, the introduction of new features, the role of companies, and the competitive environment.

The sentences extracted by this method look much less similar to each other. This was expected, because the method is designed to reduce the similarity among its top-ranked sentences.

In general, important words & sentences reported from the 1st method are strongly connected, and important sentences of the 2nd method look more unique and less similar to each other.
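
One possible way to put a number on "less similar" is the mean pairwise cosine similarity between the word-count columns of the selected sentences. The sketch below is not part of the notebook's results: `mean_pairwise_cosine` is a hypothetical helper, `A_toy` is a made-up matrix, and the code assumes that no selected sentence is empty (a zero column would cause a division by zero).

```python
import numpy as np

def mean_pairwise_cosine(A, cols):
    """Mean cosine similarity over all pairs of the given columns of A."""
    X = A[:, cols].astype(float)
    X = X / np.linalg.norm(X, axis=0, keepdims=True)  # unit-length columns
    S = X.T @ X                                       # cosine similarity matrix
    upper = np.triu_indices(len(cols), k=1)           # each pair counted once
    return S[upper].mean()

# Rows are words, columns are sentences; sentences 0 and 1 are identical
A_toy = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])

print(mean_pairwise_cosine(A_toy, [0, 1]))  # identical sentences, close to 1
print(mean_pairwise_cosine(A_toy, [0, 2]))  # no shared words, 0
```

Applied to the frequency matrix and the top-k indices returned by each method, a lower value would indicate a more diverse selection.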


Thanks for your valuable time and attention! This notebook is available at the link below😀

https://drive.google.com/file/d/12eBTqHXboWXehfCPu4nmoDLTxnpC5UZl/view?usp=sharing