### Import Packages

In [107]:
import numpy as np
import re
import pandas as pd
from collections import Counter
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
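from nltk.tokenize import sent_tokenize
# sumy provides the parser, tokenizer and LSA summarizer used in the summary cells
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer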

### Import Data

In [118]:
file_path = "pandp.txt"
with open(file_path, "r", encoding= "utf-8") as file:
    pp = file.read()
pp[:100]

'\ufeffThe Project Gutenberg eBook of Pride and Prejudice\n    \nThis ebook is for the use of anyone anywher'

In [119]:
file_path = "sands.txt"
with open(file_path, "r", encoding= "utf-8") as file:
    ss = file.read()
ss[:100]

'\ufeffThe Project Gutenberg eBook of Sense and Sensibility\n    \nThis ebook is for the use of anyone anywh'
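
Both previews start with '\ufeff', the byte order mark that some UTF-8 files carry. A minimal sketch, assuming a throwaway temporary file rather than the real novels, of how Python's built-in "utf-8-sig" codec drops that mark on read:

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte order mark (BOM).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\ufeffThe Project Gutenberg eBook".encode("utf-8"))
    path = tmp.name

# Plain "utf-8" keeps the BOM as the first character of the string.
with open(path, "r", encoding="utf-8") as f:
    raw = f.read()

# "utf-8-sig" strips a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    clean = f.read()

os.remove(path)
print(repr(raw[:5]), repr(clean[:5]))
```

Switching the encoding would shift every character offset by one, so the hard-coded slice positions used later would need to be recomputed.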

### Text Cleaning 

Step 1: Extract the novel itself from the full Project Gutenberg file, which also contains front matter and licence text.
Step 2: Lowercase the text and remove punctuation.

Use the re module to find where the novel's text starts and ends.
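
A minimal sketch of this idea on a toy string: find the first and last sentences of the body and slice with the match offsets instead of typing the numbers by hand. The helper name `extract_body` and the sample text are illustrative assumptions, not part of this notebook.

```python
import re

def extract_body(text, first_sentence, last_sentence):
    """Return the slice of text from first_sentence through last_sentence."""
    # re.escape makes characters such as '.' match literally.
    start = re.search(re.escape(first_sentence), text)
    if start is None:
        raise ValueError("first sentence not found")
    # Search for the closing sentence only after the opening one.
    end = re.compile(re.escape(last_sentence)).search(text, start.start())
    if end is None:
        raise ValueError("last sentence not found")
    return text[start.start():end.end()]

sample = "HEADER\n\nIt begins here. Middle part. It ends here.\n\nTHE END\nLICENSE"
body = extract_body(sample, "It begins here.", "It ends here.")
print(body)
```

Using the match object's offsets keeps the slice correct even if the source file changes length, for example after a new Gutenberg edition.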

In [120]:
pp_start = re.search("It is a truth universally acknowledged",pp)
pp_start

<re.Match object; span=(35187, 35225), match='It is a truth universally acknowledged'>

In [121]:
pp[35187:]

'It is a truth universally acknowledged, that a single man in possession\nof a good fortune must be in want of a wife.\n\nHowever little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds\nof the surrounding families, that he is considered as the rightful\nproperty of some one or other of their daughters.\n\n“My dear Mr. Bennet,” said his lady to him one day, “have you heard that\nNetherfield Park is let at last?”\n\nMr. Bennet replied that he had not.\n\n“But it is,” returned she; “for Mrs. Long has just been here, and she\ntold me all about it.”\n\nMr. Bennet made no answer.\n\n“Do not you want to know who has taken it?” cried his wife, impatiently.\n\n“_You_ want to tell me, and I have no objection to hearing it.”\n\n[Illustration:\n\n“He came down to see the place”\n\n[_Copyright 1894 by George Allen._]]\n\nThis was invitation enough.\n\n“Why, my dear, you must know, Mrs. Long says that Netherfield is ta

In [122]:
pp_end = re.search("END",pp)
pp_end

<re.Match object; span=(729478, 729481), match='END'>

In [123]:
pp[729360:]

'\n\n                            [Illustration:\n\n                                  THE\n                                  END\n                                   ]\n\n\n\n\n             CHISWICK PRESS:--CHARLES WHITTINGHAM AND CO.\n                  TOOKS COURT, CHANCERY LANE, LONDON.\n\n\n        \n            *** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n        \n\n    \n\nUpdated editions will replace the previous one—the old editions will\nbe renamed.\n\nCreating the works from print editions not protected by U.S. copyright\nlaw means that no one owns a United States copyright in these works,\nso the Foundation (and you!) can copy and distribute it in the United\nStates without permission and without paying copyright\nroyalties. Special rules, set forth in the General Terms of Use part\nof this license, apply to copying and distributing Project\nGutenberg™ electronic works to protect the PROJECT GUTENBERG™\nconcept and trademark. Project Gutenberg is a register

In [124]:
pp_short = pp[35187:729360]
pp_short[:100]

'It is a truth universally acknowledged, that a single man in possession\nof a good fortune must be in'

In [125]:
ss_start = re.search("The family of Dashwood had long been settled in Sussex.",ss)
ss_start

<re.Match object; span=(17796, 17851), match='The family of Dashwood had long been settled in S>

In [126]:
ss[17796:]



In [127]:
ss_end = re.search("between themselves, or producing coolness between their husbands.",ss)
ss_end

<re.Match object; span=(690803, 690868), match='between themselves, or producing coolness between>

In [128]:
ss[690868:]

'\n\nTHE END\n\n\n\n\n\n        \n            *** END OF THE PROJECT GUTENBERG EBOOK SENSE AND SENSIBILITY ***\n        \n\n    \n\nUpdated editions will replace the previous one—the old editions will\nbe renamed.\n\nCreating the works from print editions not protected by U.S. copyright\nlaw means that no one owns a United States copyright in these works,\nso the Foundation (and you!) can copy and distribute it in the United\nStates without permission and without paying copyright\nroyalties. Special rules, set forth in the General Terms of Use part\nof this license, apply to copying and distributing Project\nGutenberg™ electronic works to protect the PROJECT GUTENBERG™\nconcept and trademark. Project Gutenberg is a registered trademark,\nand may not be used if you charge for an eBook, except by following\nthe terms of the trademark license, including paying royalties for use\nof the Project Gutenberg trademark. If you do not charge anything for\ncopies of this eBook, complying with the

In [129]:
ss_short = ss[17796:690868]

In [130]:
pp_short = pp_short.replace('\n',' ')
ss_short = ss_short.replace('\n',' ')

print(ss_short[:100])
print(pp_short[:100])

The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence 
It is a truth universally acknowledged, that a single man in possession of a good fortune must be in


In [132]:
pp_sent = sent_tokenize(pp_short)
ss_sent = sent_tokenize(ss_short)


In [15]:
# First pattern: remove most punctuation (square brackets are handled in the next step)
p_to_remove = r'[.!",“”?:;_()-]'
pp_short = re.sub(p_to_remove, '', pp_short)
ss_short = re.sub(p_to_remove, '', ss_short)

print(ss_short[:100])
print(pp_short[:2000])

The family of Dashwood had long been settled in Sussex Their estate was large and their residence wa
It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife  However little known the feelings or views of such a man may be on his first entering a neighbourhood this truth is so well fixed in the minds of the surrounding families that he is considered as the rightful property of some one or other of their daughters  My dear Mr Bennet said his lady to him one day have you heard that Netherfield Park is let at last  Mr Bennet replied that he had not  But it is returned she for Mrs Long has just been here and she told me all about it  Mr Bennet made no answer  Do not you want to know who has taken it cried his wife impatiently  You want to tell me and I have no objection to hearing it  [Illustration  He came down to see the place  [Copyright 1894 by George Allen]]  This was invitation enough  Why my dear you must know Mrs Long says t

In [16]:
# Second pattern: remove bracketed inserts such as [Illustration ...] from the book
pattern = r'\[.*?\]'
pp_short = re.sub(pattern, '', pp_short)
ss_short = re.sub(pattern, '', ss_short)

print(ss_short[:100])
print(pp_short[:2000])

The family of Dashwood had long been settled in Sussex Their estate was large and their residence wa
It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife  However little known the feelings or views of such a man may be on his first entering a neighbourhood this truth is so well fixed in the minds of the surrounding families that he is considered as the rightful property of some one or other of their daughters  My dear Mr Bennet said his lady to him one day have you heard that Netherfield Park is let at last  Mr Bennet replied that he had not  But it is returned she for Mrs Long has just been here and she told me all about it  Mr Bennet made no answer  Do not you want to know who has taken it cried his wife impatiently  You want to tell me and I have no objection to hearing it  ]  This was invitation enough  Why my dear you must know Mrs Long says that Netherfield is taken by a young man of large fortune from the north of Eng

In [17]:
# Third pattern: remove any stray ] left behind by nested brackets
pattern = r']'
pp_short = re.sub(pattern, '', pp_short)
ss_short = re.sub(pattern, '', ss_short)

print(ss_short[:100])
print(pp_short[:2000])

The family of Dashwood had long been settled in Sussex Their estate was large and their residence wa
It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife  However little known the feelings or views of such a man may be on his first entering a neighbourhood this truth is so well fixed in the minds of the surrounding families that he is considered as the rightful property of some one or other of their daughters  My dear Mr Bennet said his lady to him one day have you heard that Netherfield Park is let at last  Mr Bennet replied that he had not  But it is returned she for Mrs Long has just been here and she told me all about it  Mr Bennet made no answer  Do not you want to know who has taken it cried his wife impatiently  You want to tell me and I have no objection to hearing it    This was invitation enough  Why my dear you must know Mrs Long says that Netherfield is taken by a young man of large fortune from the north of Engl
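
The three passes above could also be folded into one helper. The sketch below is an alternative, not the notebook's method: it removes bracketed inserts first (repeating until nested brackets are gone), then strips the same punctuation set, then collapses whitespace. The name `clean_text` and the sample sentence are assumptions made for illustration.

```python
import re

def clean_text(text):
    # 1. Drop bracketed inserts. The pattern matches innermost [...] pairs only,
    #    so repeat until nothing changes to handle nested brackets.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r'\[[^\[\]]*\]', '', text)
    # 2. Remove the same punctuation characters as the first pattern above.
    text = re.sub(r'[.!",“”?:;_()-]', '', text)
    # 3. Collapse runs of whitespace left behind by the removals.
    return re.sub(r'\s+', ' ', text).strip()

sample = ('“_You_ want to tell me.” [Illustration: “He came down” '
          '[_Copyright 1894._]] This was invitation enough.')
cleaned = clean_text(sample)
print(cleaned)
```

Handling the brackets before the punctuation means no separate clean-up of a leftover `]` is needed.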

### Tokenization

In [18]:
pp_split = pp_short.split()
ss_split = ss_short.split()

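# Lowercase and keep purely alphabetic tokens only
# (this drops numbers and any word that still contains an apostrophe or dash)
pp_low = [t.lower() for t in pp_split if t.isalpha()]
ss_low = [t.lower() for t in ss_split if t.isalpha()]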


print(pp_low[:30])
print(ss_low[:30])

['it', 'is', 'a', 'truth', 'universally', 'acknowledged', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'however', 'little', 'known', 'the', 'feelings', 'or', 'views']
['the', 'family', 'of', 'dashwood', 'had', 'long', 'been', 'settled', 'in', 'sussex', 'their', 'estate', 'was', 'large', 'and', 'their', 'residence', 'was', 'at', 'norland', 'park', 'in', 'the', 'centre', 'of', 'their', 'property', 'where', 'for', 'many']


In [19]:
# Lemmatization (pos='v' treats every token as a verb, e.g. 'is' -> 'be', 'feelings' -> 'feel')
nltk_lem = WordNetLemmatizer()

pp_token = [nltk_lem.lemmatize(word, pos = 'v') for word in pp_low]
ss_token = [nltk_lem.lemmatize(word, pos = 'v') for word in ss_low]


print(pp_token[:30])
print(ss_token[:30])

['it', 'be', 'a', 'truth', 'universally', 'acknowledge', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', 'however', 'little', 'know', 'the', 'feel', 'or', 'view']
['the', 'family', 'of', 'dashwood', 'have', 'long', 'be', 'settle', 'in', 'sussex', 'their', 'estate', 'be', 'large', 'and', 'their', 'residence', 'be', 'at', 'norland', 'park', 'in', 'the', 'centre', 'of', 'their', 'property', 'where', 'for', 'many']


### Explore Text And Feature Engineering

In [62]:
# Token counts before stop words are removed
pride_and_prejudice_count = Counter(pp_token)
sense_and_sensibility_count = Counter(ss_token)

top_10_most_frequent_in_pp = pride_and_prejudice_count.most_common(10)
top_10_most_frequent_in_ss = sense_and_sensibility_count.most_common(10)

top_10_most_frequent_in_pp, top_10_most_frequent_in_ss

([('be', 5850),
  ('the', 4326),
  ('to', 4133),
  ('of', 3614),
  ('and', 3537),
  ('have', 2335),
  ('her', 2219),
  ('i', 2052),
  ('a', 1935),
  ('in', 1860)],
 [('be', 5398),
  ('to', 4086),
  ('the', 4085),
  ('of', 3565),
  ('and', 3420),
  ('her', 2521),
  ('have', 2082),
  ('a', 2040),
  ('i', 1937),
  ('in', 1932)])
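
Both top-10 lists are dominated by function words. A minimal sketch of filtering them out before counting, using a tiny hand-written stop list and made-up tokens (both are assumptions for illustration; the notebook itself uses the NLTK English stop word list):

```python
from collections import Counter

# Toy lemmatized tokens standing in for pp_token.
tokens = ['mr', 'bennet', 'be', 'a', 'man', 'mr', 'darcy', 'be', 'the', 'man', 'mr']

# Hypothetical, deliberately tiny stop list.
small_stop_list = {'be', 'a', 'the'}

# Count only the tokens that are not stop words.
counts = Counter(t for t in tokens if t not in small_stop_list)
top_two = counts.most_common(2)
print(top_two)
```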

In [133]:
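# Short summary for Sense and Sensibility (built from ss_sent)
# Note: ss_sent is a list of sentences, while from_string expects one string;
# the stray ', ' fragments in the output below appear to come from the list being converted to text.
parser = PlaintextParser.from_string(ss_sent, Tokenizer("english"))

# Initialize the summarizer (LSA algorithm)
summarizer = LsaSummarizer()

# Summarize the text (sentences_count sets the number of sentences in the summary)
summary = summarizer(parser.document, sentences_count=3)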

print(summary)

(<Sentence: ', 'He looked more than usually grave, and though expressing satisfaction at finding Miss Dashwood alone, as if he had somewhat in particular to tell her, sat for some time without saying a word.>, <Sentence: ', 'My sister will be equally sorry to miss the pleasure of seeing you; but she has been very much plagued lately with nervous head-aches, which make her unfit for company or conversation.>, <Sentence: ', "But at last she found herself with some surprise, accosted by Miss Steele, who, though looking rather shy, expressed great satisfaction in meeting them, and on receiving encouragement from the particular kindness of Mrs. Jennings, left her own party for a short time, to join their's.>)


In [None]:
# Short summary for Pride and Prejudice (built from pp_sent)
# Join the sentence list back into one string, which is what from_string expects
parser = PlaintextParser.from_string(" ".join(pp_sent), Tokenizer("english"))

# Initialize the summarizer (LSA algorithm)
summarizer = LsaSummarizer()

# Summarize the text (sentences_count sets the number of sentences in the summary)
summary = summarizer(parser.document, sentences_count=3)

print(summary)

In [63]:
# Bag-of-words counts with English stop words removed
stop_wd = stopwords.words('english')

cv_1 = CountVectorizer(stop_words= stop_wd)
cv_2 = CountVectorizer(stop_words= stop_wd)


bow_pp = cv_1.fit_transform(pp_token)
bow_ss = cv_2.fit_transform(ss_token)

vocab_pp = cv_1.vocabulary_
vocab_ss = cv_2.vocabulary_

In [21]:

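# Each token is passed as its own one-word document, so after L2 normalisation
# every non-empty row holds a single 1.0 and the column sums equal token counts.
tfidf = TfidfVectorizer(stop_words=stop_wd, norm='l2')
tfidf2 = TfidfVectorizer(stop_words=stop_wd, norm='l2')

tfidf_pp = tfidf.fit_transform(pp_token)
tfidf_ss = tfidf2.fit_transform(ss_token)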


In [89]:
clean_pp_freq = (pd.DataFrame({'tfidf': tfidf_pp.sum(axis=0).A1,
                            'token': tfidf.get_feature_names_out()})
              .sort_values('tfidf', ascending=False))
clean_pp_freq.head(10)

      tfidf      token
2898  776.0         mr
3848  608.0        say
1412  595.0  elizabeth
 964  524.0      could
4925  468.0      would
2539  387.0       know
1028  369.0      darcy
4399  348.0      think
2899  343.0        mrs
2717  334.0       make


In [90]:
clean_ss = (pd.DataFrame({'tfidf': tfidf_ss.sum(axis=0).A1,
                            'token': tfidf2.get_feature_names_out()})
              .sort_values('tfidf', ascending=False))
clean_ss.head(10)

      tfidf     token
1492  618.0    elinor
4147  600.0       say
1014  575.0     could
3128  527.0       mrs
5344  514.0     would
2938  488.0  marianne
2732  387.0      know
1627  373.0     every
4755  349.0     think
3322  317.0       one
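
The scores in both tables are whole numbers because each token was vectorized as its own one-word document. The pure-Python sketch below, on made-up tokens, reproduces that effect by hand. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 described in the scikit-learn documentation and does not call scikit-learn itself.

```python
import math
from collections import Counter

# One token per "document", mirroring fit_transform on a list of tokens.
docs = ['mr', 'say', 'mr', 'elizabeth', 'mr', 'say']
n_docs = len(docs)
df = Counter(docs)  # document frequency of each token

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

column_sums = Counter()
for token in docs:
    weight = 1 * idf[token]           # term frequency is 1 in a one-word document
    norm = math.sqrt(weight ** 2)     # L2 norm of a row with a single entry
    column_sums[token] += weight / norm  # normalised weight is 1.0 regardless of idf

print(dict(column_sums))
```

Each row normalises to 1.0, so the IDF weight cancels out and the column sum is just the number of times the token occurs. Getting genuine TF-IDF scores would require fitting on larger documents, such as sentences or chapters.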
