
In [0]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [6]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
from nltk.tokenize import sent_tokenize

In [2]:
data = "Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Paragraphs are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar paragraphs while values close to 0 represent very dissimilar paragraphs."
data

'Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Paragraphs are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns. Values close to 1 represent very similar paragraphs while values close to 0 represent very dissimilar paragraphs.'

In [8]:
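# Split the passage into sentences; each sentence is treated as one "document".
docs = sent_tokenize(data)
# `text` keeps the untouched sentences for printing the summary later,
# since the DataFrame copy is stripped of stopwords further down.
text = docs
text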

['Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.',
 'LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis).',
 'A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.',
 'Paragraphs are then compared by taking the cosine of the angle between the two vectors (or the dot product between the normalizations of the two vectors) formed by any two columns.',
 'Values close to 1 represent very similar paragraphs while values close to 0 represent very dissimilar paragraphs.']

In [9]:
df = pd.DataFrame(docs, columns=['sent'])
df

                                                sent
0  Latent semantic analysis (LSA) is a technique ...
1  LSA assumes that words that are close in meani...
2  A matrix containing word counts per paragraph ...
3  Paragraphs are then compared by taking the cos...
4  Values close to 1 represent very similar parag...


In [11]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [13]:
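# Compare in lowercase so capitalised stopwords (e.g. a sentence-initial "A") are dropped too;
# a set makes each membership test constant-time.
stop_set = set(stop)
df['sent'] = df['sent'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop_set))
df['sent'][0]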

'Latent semantic analysis (LSA) technique natural language processing, particular distributional semantics, analyzing relationships set documents terms contain producing set concepts related documents terms.'

In [38]:
vectorizer = TfidfVectorizer(max_df = 0.5, min_df = 2, smooth_idf = True)
vectorizer.fit(df['sent'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.5, max_features=None,
                min_df=2, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [39]:
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

{'lsa': 3, 'technique': 7, 'distributional': 2, 'words': 9, 'close': 0, 'similar': 6, 'text': 8, 'represent': 5, 'columns': 1, 'paragraphs': 4}
[1.69314718 1.69314718 1.69314718 1.69314718 1.69314718 1.69314718
 1.69314718 1.69314718 1.69314718 1.69314718]
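All ten idf values above are identical, which follows from the vectorizer settings: with 5 sentences, `min_df=2` and `max_df=0.5` together leave only terms that occur in exactly 2 sentences. The sketch below reproduces the printed value by hand. It assumes the smoothed idf formula described in the scikit-learn documentation for `smooth_idf=True`; the variable names are illustrative only.

```python
import math

# Values taken from the notebook: 5 sentences, and every retained term
# appears in exactly 2 of them (min_df=2; max_df=0.5 allows at most 2.5).
n_docs = 5
doc_freq = 2

# Smoothed idf as documented for scikit-learn (smooth_idf=True):
# idf = ln((1 + n) / (1 + df)) + 1
idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
print(round(idf, 8))  # 1.69314718
```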


In [40]:
vector = vectorizer.transform(df['sent'])
# print(vector)
print(vector.shape)
tfidf = vector.toarray()
# tfidf

(5, 10)


In [41]:
tfidf = tfidf.T
print(tfidf.shape)
print(tfidf)

(10, 5)
[[0.         0.40824829 0.         0.         0.5547002 ]
 [0.         0.         0.60302269 0.70710678 0.        ]
 [0.57735027 0.40824829 0.         0.         0.        ]
 [0.57735027 0.40824829 0.         0.         0.        ]
 [0.         0.         0.         0.70710678 0.5547002 ]
 [0.         0.         0.60302269 0.         0.5547002 ]
 [0.         0.40824829 0.         0.         0.2773501 ]
 [0.57735027 0.         0.30151134 0.         0.        ]
 [0.         0.40824829 0.30151134 0.         0.        ]
 [0.         0.40824829 0.30151134 0.         0.        ]]
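The passage being summarised says that documents are compared by the cosine of the angle between two columns. As a rough illustration, the sketch below computes that cosine for every pair of sentence columns. The matrix is copied by hand from the output above so the block runs on its own; the names `norms` and `similarity` are not part of the notebook. Because the vectorizer used `norm='l2'`, the columns are already unit length, so dividing by the norms changes very little here.

```python
import numpy as np

# Term-by-sentence TF-IDF matrix copied from the notebook output
# (10 terms x 5 sentences).
tfidf = np.array([
    [0.0,        0.40824829, 0.0,        0.0,        0.5547002],
    [0.0,        0.0,        0.60302269, 0.70710678, 0.0],
    [0.57735027, 0.40824829, 0.0,        0.0,        0.0],
    [0.57735027, 0.40824829, 0.0,        0.0,        0.0],
    [0.0,        0.0,        0.0,        0.70710678, 0.5547002],
    [0.0,        0.0,        0.60302269, 0.0,        0.5547002],
    [0.0,        0.40824829, 0.0,        0.0,        0.2773501],
    [0.57735027, 0.0,        0.30151134, 0.0,        0.0],
    [0.0,        0.40824829, 0.30151134, 0.0,        0.0],
    [0.0,        0.40824829, 0.30151134, 0.0,        0.0],
])

# Cosine similarity between every pair of sentence columns:
# dot products divided by the product of the column lengths.
norms = np.linalg.norm(tfidf, axis=0)
similarity = (tfidf.T @ tfidf) / np.outer(norms, norms)
print(np.round(similarity, 3))
```

Sentences that share no retained term (for example the first and the fourth) get a similarity of 0, while each sentence compared with itself gets 1.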


In [0]:
# TruncatedSVD was already imported in the first cell.
# -(-n // 2) is ceiling division: keep ceil(n / 2) latent topics for n sentences.
svd_model = TruncatedSVD(n_components=-(-len(text) // 2), algorithm='randomized', n_iter=10, random_state=22)

In [50]:
svd_model.fit(tfidf)

svd_model.n_components

3
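The value 3 comes from the expression `-(-len(text)//2)`, which rounds half the sentence count up rather than down. A small sketch of why that works, using a hypothetical helper name `ceil_half` that does not appear in the notebook:

```python
import math

def ceil_half(n):
    # Floor division rounds toward negative infinity, so flooring -n / 2
    # and negating the result rounds n / 2 upward.
    return -(-n // 2)

# Agrees with math.ceil for a range of small sentence counts.
for n in range(1, 8):
    assert ceil_half(n) == math.ceil(n / 2)

print(ceil_half(5))  # 3, the n_components used for the 5 sentences here
```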

In [51]:
svd_model.components_

array([[ 0.30735003,  0.45000746,  0.52083595,  0.42495099,  0.50117444],
       [ 0.62984917,  0.51310073, -0.17410579, -0.50185479, -0.24051322],
       [-0.41432952,  0.356272  , -0.42650899, -0.31780859,  0.64690733]])
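Each row of `components_` holds one latent topic's weights over the five sentences. As a sanity check, the sketch below copies the printed values by hand and verifies two things: the rows are (approximately) orthonormal, as right singular vectors should be, and the index of the largest weight in each row can be read off with `np.argmax`. The names `gram` and `top_sentence` are illustrative and not part of the notebook.

```python
import numpy as np

# Topic-by-sentence weights copied from the components_ output
# (3 topics x 5 sentences).
components = np.array([
    [ 0.30735003,  0.45000746,  0.52083595,  0.42495099,  0.50117444],
    [ 0.62984917,  0.51310073, -0.17410579, -0.50185479, -0.24051322],
    [-0.41432952,  0.356272,   -0.42650899, -0.31780859,  0.64690733],
])

# Rows of V^T should be orthonormal, so this product should be close
# to the 3 x 3 identity matrix.
gram = components @ components.T
print(np.round(gram, 6))

# Sentence index with the largest weight in each topic.
top_sentence = np.argmax(components, axis=1)
print(top_sentence)  # [2 0 4]
```

Note that singular vectors are only defined up to sign, so selecting by the largest signed weight depends on the sign convention of the SVD routine; selecting by absolute value would be a different design choice from the one this notebook makes.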

In [0]:
svc = svd_model.components_
svc = svc.tolist()

In [53]:
svc

[[0.3073500337440256,
  0.4500074569621447,
  0.520835947161369,
  0.42495098535202247,
  0.5011744423141258],
 [0.6298491686324654,
  0.5131007328141923,
  -0.17410578876022723,
  -0.5018547874025349,
  -0.24051322092530797],
 [-0.4143295175773899,
  0.35627199632993245,
  -0.426508987538849,
  -0.31780859055121163,
  0.6469073340252146]]

In [54]:
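# For each latent topic, pick the sentence with the largest weight.
selected = []
for topic in svc:
  m = max(topic)
  print(m)
  pos = topic.index(m)
  print(pos)
  print(text[pos])
  if pos not in selected:  # two topics may pick the same sentence
    selected.append(pos)
# Emit the chosen sentences in their original document order.
summary = '\n'.join(text[pos] for pos in sorted(selected))
print(summary)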

0.520835947161369
2
A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.
0.6298491686324654
0
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
0.6469073340252146
4
Values close to 1 represent very similar paragraphs while values close to 0 represent very dissimilar paragraphs.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related 