**SOFT DEADLINE:** `20.03.2022 23:59 msk` 

# [5 points] Part 1. Data cleaning

The task is to clean the text of web pages crawled from different sites.

The distribution of the 100 most frequent words must include only meaningful English words (no particles, conjunctions, prepositions, numbers, tags, or symbols).

Determine the order of operations below and carry out the appropriate cleaning.

1. Remove non-English words
1. Remove HTML tags (try a regular expression, or experiment with the BeautifulSoup library)
1. Apply lemmatization / stemming
1. Remove stop words
1. Additional processing, at your own initiative, if it helps to obtain a better distribution

#### Hints

1. For text processing you may use the nltk and re libraries
1. and / or any other libraries of your choice
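
As an illustration of the regular-expression route mentioned in the list of operations, here is a minimal sketch using only the standard library. The function name `strip_html` and the sample string are hypothetical. It is an approximation: a regex cannot handle every page, for example an attribute value that itself contains `>`, which is where a real parser such as BeautifulSoup is more robust.

```python
import html
import re

# Drop <script>/<style> elements together with their contents first,
# otherwise JavaScript and CSS would leak into the extracted text.
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(raw: str) -> str:
    """Return the visible text of an HTML string (regex approximation)."""
    text = SCRIPT_STYLE_RE.sub(" ", raw)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)      # &amp; -> &, &nbsp; -> U+00A0
    return " ".join(text.split())   # collapse whitespace, including U+00A0


sample = (
    "<html><title>Love &amp; War</title><script>var x = 1;</script>"
    "<td>Moochable&nbsp;copies:</td></html>"
)
print(strip_html(sample))
```

Replacing each tag with a space rather than an empty string keeps words from adjacent table cells from being glued together.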

#### Data reading

The dataset for this part can be downloaded here: `https://drive.google.com/file/d/1wLwo83J-ikCCZY2RAoYx8NghaSaQ-lBA/view?usp=sharing`

In [1]:
import pandas as pd

df = pd.read_csv('./web_sites_data.csv')

#### Data processing

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
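# Uncomment on the first run to fetch the NLTK resources used below:
# nltk.download('words'); nltk.download('stopwords'); nltk.download('wordnet'); nltk.download('punkt')
words = set(words.words())  # note: rebinds the imported corpus module `words` to a set
stop_words = set(stopwords.words('english'))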


In [3]:
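lemmatizer = WordNetLemmatizer()  # create once instead of once per row

def clean_text(raw_html: str) -> str:
    # Strip the markup and keep only the visible text, lowercased.
    soup = BeautifulSoup(raw_html, 'html.parser')
    text = " ".join(soup.stripped_strings).lower()
    # Keep English dictionary words that are not stop words, then lemmatize
    # token by token (WordNetLemmatizer works on single words, not sentences).
    tokens = [w for w in nltk.wordpunct_tokenize(text)
              if w in words and w not in stop_words]
    return " ".join(lemmatizer.lemmatize(w) for w in tokens)

# Write the result back to the first column explicitly; assigning to
# `row.values[0]` inside `iterrows()` is not guaranteed to modify `df`.
text_col = df.columns[0]
df[text_col] = df[text_col].apply(clean_text)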

In [4]:
freq = nltk.FreqDist(nltk.word_tokenize(" ".join(row.values[0] for _, row in df.iterrows())))


In [5]:
freq_df = pd.DataFrame.from_records(
    freq.most_common(100), columns=['Word', 'Count'])

#### Visualization

For the visualization, plot the frequency distribution of the 100 most common words, sorted by frequency.

We advise using plotly, but you are free to choose other libraries.

In [6]:
import plotly.express as px
fig = px.histogram(freq_df, x="Word", y="Count")
fig.update_xaxes(tickmode="linear")
fig.show()

#### Provide examples of processed text (some parts)

Is everything all right with the result of cleaning these examples? What kind of information was lost?

### Was
```<html> <head profile=""http://www.w3.org/2005/10/profile""> <LINK REL=""SHORTCUT ICON"" href=""http://i.bookmooch.com/favicon.ico""> <link rel=""icon"" type=""image/png"" href=""http://i.bookmooch.com/favicon.png""> <title>Eric Newby : Love and War in the Apennines</title> <meta http-equiv=""Content-Type"" content=""text/html""> </head> <body bgcolor=""#FFFFFF"" leftmargin=""0"" topmargin=""0"" marginwidth=""0"" marginheight=""0"" text=""#000000"" link=""#0000FF"" vlink=""#0000FF"" alink=""#FF0000"" > <basefont face=""arial, sans-serif""><font face=""arial, sans-serif""> <table width=""100%"" height=""70"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr><form action=""/search"" method=""get""> <td width=""283"" colspan=""2"" rowspan=""2"" bgcolor=""#689A9B""> <a href=""/""> <img src=""http://i.bookmooch.com/images/bookmooch_logo.gif"" width=""283"" height=""66"" border=""0"" alt=""BookMooch logo""></a></td> <td width=""675"" height=""38"" colspan=""9"" align=""right"" bgcolor=""#689A9B"" xcolor=""#689A9B""> <table border=0 cellpadding=""0"" cellspacing=""0""><tr> <td width=270 height=18 valign=""middle"" align=""right""> <INPUT TYPE=""text"" NAME=""w"" VALUE="""" SIZE=""20"" MAXLENGTH=""100"">&nbsp;</td> <td width=67 height=18 valign=""middle"" align=""right""><input type=""image"" BORDER=""0"" title=""search"" alt=""search"" src=""http://i.bookmooch.com/images/search_button.gif"" width=""67"" height=""18"" name=""search""></td> <td height=38 width=37><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""10"" height=""1"" alt=""""></td></tr> </table> </td> <td bgcolor=""#689A9B"" width=940><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""1"" alt=""""></td> </tr></form> <tr> <td width=""193"" height=""28"" bgcolor=""#689A9B""><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""193"" height=""28"" alt=""""></td> <td bgcolor=""#FFFF99""><a href=""/""><img src=""http://i.bookmooch.com/images/home.gif"" width=""85"" 
height=""28"" border=""0"" alt=""home""></a></td> <td bgcolor=""#98D5DF""><a href=""/browse""><img src=""http://i.bookmooch.com/images/browse_selected.gif"" width=""86"" height=""28"" border=""0"" alt=""browse""></a></td> <td colspan=""2"" bgcolor=""#97D5DF""><a href=""/about/""><img src=""http://i.bookmooch.com/images/about.gif"" width=""85"" height=""28"" border=""0"" alt=""about""></a></td> <td bgcolor=""#8DD1D8""><a href=""/join""><img src=""http://i.bookmooch.com/images/join.gif"" width=""86"" height=""28"" border=""0"" alt=""join""></a></td> <td bgcolor=""#92D3DD""><a href=""/login""><img src=""http://i.bookmooch.com/images/login.gif"" width=""84"" height=""28"" border=""0"" alt=""login""></a></td> <td width=""38"" height=""28"" colspan=""2"" bgcolor=""#689A9B""><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""38"" height=""28"" alt=""""></td><td bgcolor=""#689A9B"" width=""100%""><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""1"" alt=""""></td> </tr> <tr> <td width=""940"" height=""4"" colspan=""11"" bgcolor=""#FFFF99""> <img src=""http://i.bookmooch.com/images/spacer.gif"" width=""940"" height=""4"" alt=""""></td><td bgcolor=""#FFFF99"" width=100%><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""1"" alt=""""></td> </tr> </table> <table width=""100%"" border=""0"" cellpadding=""0"" cellspacing=""0""><tr> <td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""12"" alt=""""></td><td></td><td></td></tr><td width=12><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""12"" height=""1"" alt=""""></td><td>

<Table width=891 cellspacing=0 cellpadding=0 border=0><tr><Td align=""left""><font face=""Verdana, Arial, utopia, sans-serif"" size=4 color=""#1F4A58"">Eric Newby : Love and War in the Apennines</font></td><td align=""right""><table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""http://wiki.bookmooch.com/index.php?title=Book+detail"" target=""help"" title=""help""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a target=""help"" STYLE=""text-decoration:none"" href=""http://wiki.bookmooch.com/index.php?title=Book+detail"" title=""help""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>?</nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a target=""help"" href=""http://wiki.bookmooch.com/index.php?title=Book+detail"" title=""help""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table></td></tr></table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""5"" height=""5"" alt=""""><br><img src=""http://i.bookmooch.com/images/greydot.gif"" width=""891"" height=""1"" alt=""""><br><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""12"" alt=""""><br>

 <table border=0 cellpadding=0 width=1221 cellspacing=0><tr><td valign=""top""> <table border=0 cellpadding=1 width=891 cellspacing=0><tr> <td valign=""top"" align=""left""> <table border=0 width=100% cellpadding=1 cellspacing=0> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Author:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF""><a title=""Search for this author"" for href=""/s/eric+newby"">Eric Newby</a> </td> </tr> <tr> <td valign=""top"" width=10% align=""right"" bgcolor=""FFFFFF"">Title:</td> <td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""12"" height=""1"" alt=""""></td> <td valign=""top"" width=90% bgcolor=""FFFFFF""><a title=""Search for this title"" href=""/s/love+and+war+in+the+apennines"">Love and War in the Apennines</a></td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Moochable&nbsp;copies:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">No copies available</td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Amazon&nbsp;suggests:</td> <td></td><form action=""/recommended_get"" method=""get"" name=""form31""><input type=""hidden"" name=""go"" value=""0001049305""> <td valign=""top"" bgcolor=""FFFFFF""><table cellspacing=0 cellpadding=0 border=0><tr><td align=""left"" valign=""top""><select size=1 name=""asin""><option value=""0864426046"">A Short Walk in the Hindu Kush</option><option value=""0864426313"">Slowly Down the Ganges</option><option value=""0864426216"">On the Shores of the Mediterranean</option><option value=""0864427689"">The Last Grain Race</option><option value=""0864426275"">Round Ireland in Low Gear</option></select></td><td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""5"" height=""1"" alt=""""></td><td align=""left"" valign=""top""><table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""javascript:document.form31.submit();"" title=""""><img border=0 bgcolor=""#6EB0B1"" 
src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""javascript:document.form31.submit();"" title=""""><font face=""Verdana, Arial, utopia, sans-serif"" size=1 color=""#FFFFFF""><nobr>></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""javascript:document.form31.submit();"" title=""""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table></td></tr></table></td></form> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Recommended:</td> <td></td> <form action=""/recommended_get"" method=""get"" name=""form3""><input type=""hidden"" name=""go"" value=""0001049305""><td valign=""top"" bgcolor=""FFFFFF""> <table cellspacing=0 cellpadding=0 border=0><tr><td align=""left"" valign=""top""><select size=1 name=""asin""><option value=""BM1229593901148194047"">BookMooch: Box of 25 BookMooch Bookmarks</option><option value=""more"">---</option><option value=""more"">Show more recommendations...</option></select></td><td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""5"" height=""1"" alt=""""></td><td align=""left"" valign=""top""><table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""javascript:document.form3.submit();"" title=""""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""javascript:document.form3.submit();"" title=""""><font face=""Verdana, Arial, utopia, sans-serif"" size=1 color=""#FFFFFF""><nobr>></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""javascript:document.form3.submit();"" title=""""><img border=0 
bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table></td></tr></table> </td></form> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Topics:</td> <td></td><form action=""/topic_go"" method=""get"" name=""form""> <td valign=""top"" bgcolor=""FFFFFF""> <table cellspacing=0 cellpadding=0 border=0><tr><td align=""left"" valign=""top""><select size=1 name=""name""><option value=""Abridged"">Abridged</option><option value=""Ancient"">Ancient</option><option value=""Asia"">Asia</option><option value=""Audiobooks"">Audiobooks</option><option value=""Biographies & Memoirs"">Biographies & Memoirs</option><option value=""Biographies & Memoirs: General"">Biographies & Memoirs: General</option><option value=""Books on Cassette"">Books on Cassette</option><option value=""Eastern Front"">Eastern Front</option><option value=""Edition (format)"">Edition (format)</option><option value=""Europe"">Europe</option><option value=""Florence"">Florence</option><option value=""Hiroshima & Nagasaki"">Hiroshima & Nagasaki</option><option value=""History"">History</option><option value=""History: Europe: General"">History: Europe: General</option><option value=""History: Europe: Italy: General"">History: Europe: Italy: General</option><option value=""Home Front"">Home Front</option><option value=""Intelligence Operations"">Intelligence Operations</option><option value=""Italy"">Italy</option><option value=""Iwo Jima"">Iwo Jima</option><option value=""Medieval"">Medieval</option><option value=""Milan"">Milan</option><option value=""Military"">Military</option><option value=""Naples"">Naples</option><option value=""Naval"">Naval</option><option value=""Normandy"">Normandy</option><option value=""Pearl Harbor"">Pearl Harbor</option><option value=""Personal Narratives"">Personal Narratives</option><option value=""Reference"">Reference</option><option 
value=""Refinements"">Refinements</option><option value=""Renaissance"">Renaissance</option><option value=""Rome"">Rome</option><option value=""Sardinia"">Sardinia</option><option value=""Sicily"">Sicily</option><option value=""Stalingrad"">Stalingrad</option><option value=""Travel"">Travel</option><option value=""Travel: Europe: Italy: General"">Travel: Europe: Italy: General</option><option value=""Tuscany"">Tuscany</option><option value=""Umbria"">Umbria</option><option value=""Venice"">Venice</option><option value=""Western Front"">Western Front</option><option value=""Women"">Women</option><option value=""World War II"">World War II</option><option value=""Writing"">Writing</option></select></td><td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""5"" height=""1"" alt=""""></td><td align=""left"" valign=""top""><table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""javascript:document.form.submit();"" title=""""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""javascript:document.form.submit();"" title=""""><font face=""Verdana, Arial, utopia, sans-serif"" size=1 color=""#FFFFFF""><nobr>></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""javascript:document.form.submit();"" title=""""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table></td></tr></table> </td></form> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Published&nbsp;in:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">English</td> </tr> 

 <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Binding:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">Audio Cassette</td> </tr> 

 <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Pages:</td> <td></td> <td bgcolor=""FFFFFF""></td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Date:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">1995-10-23</td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">ISBN:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">0001049305</td> </tr>

 <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Publisher:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF""><a title=""Search for this publisher"" href=""/s/HarperCollins+Audio"">HarperCollins Audio</a></td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Weight:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">0.49 pounds</td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Size:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">4.17 x 5.28 x 0.71 inches</td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Edition:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">Abridged Ed</td> </tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Amazon&nbsp;prices:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF""><table bgcolor=DDDDDD cellspacing=1 cellpadding=0><tr><td><table bgcolor=FFFFFF cellspacing=3 cellpadding=0><tr><td> <table bgcolor=FFFFFF cellspacing=0 cellpadding=0> </table> </td></tr></table></td></tr></table> <font size=1></font></td> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Previous&nbsp;givers:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">1 <a href=""/bio/jessierey"">jessierey (USA: OH)</a></td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Previous&nbsp;moochers:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF"">1 <a href=""/bio/markwp27"">will2-for (USA: CA)</a></td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Wishlists:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF""><table cellspacing=0 cellpadding=0 border=0><tr><td align=""left"" valign=""top"">1</td><td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""5"" height=""1"" alt=""""></td><td align=""left"" valign=""top""><a title=""view this member's wishlist"" href=""/wishlist/doobs54"">Deb (USA)</a>.</td></tr></table> </td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">Description:</td> <td></td> <td valign=""top"" 
bgcolor=""FFFFFF""><i>Book Description</I><br> When Italy made peace in the summer of '43, 50,000 Allied POWs, Eric Newby among them, walked away from their prison camps. But Italy was occupied by the Germans, and the camps were behind those lines. Newby went to the mountains where, with the help of locals, he evaded the retreating enemy. <P>Italian peasants sheltered him for more than three months. In this classic memoir of WW II, Newby recalls these selfless people. . .their unchanging lifestyle, the funny, bizarre and dangerous incidents, his hopes of the local girl who later became his wife. <P>""An exciting story, superbly told."" (Punch) <P>Of related interest: Carlino by Stuart Hood and Passages to Freedom by Joseph S. Frelinghuysen, both available from B-O-T.

</td> </tr> <tr> <td valign=""top"" align=""right"" bgcolor=""FFFFFF"">URL:</td> <td></td> <td valign=""top"" bgcolor=""FFFFFF""><a title=""Link to this book"" href=""http://bookmooch.com/0001049305"">http://bookmooch.com/0001049305</a></td> </tr> </table></td><td valign=""top"" align=""right""> <a target=""lgamazon"" onmouseover=""this.T_WIDTH='';return escape('<img src=\'http://images.amazon.com/images/P/0001049305.01._BO1,130,130,130_PC_SCLZZZZZZZ_.jpg\' height=\'\' width=\'\'>')"" href=""http://ecx.images-amazon.com/images/I/51CXMAZHTZL._SL500_.jpg"" title=""large book cover""><img alt=""large book cover"" border=0 height=87 width=70 src=""http://images.amazon.com/images/P/0001049305.01._BO1,130,130,130_PC_SCTZZZZZZZ_.jpg"" align=""none""></a><p> <p> <table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""/m/wishlist_add?asin=0001049305&store=amazon.com"" title=""add this book to the list of books you want right away""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""/m/wishlist_add?asin=0001049305&store=amazon.com"" title=""add this book to the list of books you want right away""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>WISHLIST&nbsp;ADD&nbsp;></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""/m/wishlist_add?asin=0001049305&store=amazon.com"" title=""add this book to the list of books you want right away""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""2"" alt=""""><br> <table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td 
bgcolor=""#6EB0B1""> <a href=""/m/savelater_add?asin=0001049305&store=amazon.com"" title=""add this book to those you may someday want""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""/m/savelater_add?asin=0001049305&store=amazon.com"" title=""add this book to those you may someday want""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>SAVE&nbsp;FOR&nbsp;LATER&nbsp;></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""/m/savelater_add?asin=0001049305&store=amazon.com"" title=""add this book to those you may someday want""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""2"" alt=""""><br> <table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""http://www.amazon.it/gp/product/0001049305?ie=UTF8&tag=book05e-21&linkCode=bn1"" target=""amazon-0001049305"" title=""more info about this book at Amazon (a small commission goes to BookMooch if you buy books with this link)""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a target=""amazon-0001049305"" STYLE=""text-decoration:none"" href=""http://www.amazon.it/gp/product/0001049305?ie=UTF8&tag=book05e-21&linkCode=bn1"" title=""more info about this book at Amazon (a small commission goes to BookMooch if you buy books with this link)""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>AMAZON&nbsp;></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a 
target=""amazon-0001049305"" href=""http://www.amazon.it/gp/product/0001049305?ie=UTF8&tag=book05e-21&linkCode=bn1"" title=""more info about this book at Amazon (a small commission goes to BookMooch if you buy books with this link)""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""2"" alt=""""><br> <table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""/book_sites?asin=0001049305"" title=""other web sites with information about this book""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""/book_sites?asin=0001049305"" title=""other web sites with information about this book""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>OTHER&nbsp;WEB&nbsp;SITES&nbsp;></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""/book_sites?asin=0001049305"" title=""other web sites with information about this book""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""2"" alt=""""><br> <table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""/related_show?asin=0001049305"" title=""show books that are related editions to this one""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" 
href=""/related_show?asin=0001049305"" title=""show books that are related editions to this one""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>RELATED&nbsp;EDITIONS&nbsp;></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""/related_show?asin=0001049305"" title=""show books that are related editions to this one""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""2"" alt=""""><br> <table id=""button"" height=""18"" border=""0"" cellpadding=""0"" cellspacing=""0""> <tr> <td bgcolor=""#6EB0B1""> <a href=""/recommend?asin=0001049305"" title=""recommend this book to someone else""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_left.gif"" width=""5"" height=""18"" alt=""""></a></td> <td valign=""middle"" height=""18"" bgcolor=""#6EB0B1""><a STYLE=""text-decoration:none"" href=""/recommend?asin=0001049305"" title=""recommend this book to someone else""><font face=""Verdana, Arial, utopia, sans-serif"" size=2 color=""#FFFFFF""><nobr>RECOMMEND&nbsp;></nobr></font></a></td> <td bgcolor=""#6EB0B1""> <a href=""/recommend?asin=0001049305"" title=""recommend this book to someone else""><img border=0 bgcolor=""#6EB0B1"" src=""http://i.bookmooch.com/images/button_template_right.gif"" width=""6"" height=""18"" alt=""""></a></td> </tr> </table><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""1"" height=""2"" alt=""""><br> </td></tr></table> </td><td width=10><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""10"" height=""1"" alt=""""></td><td xbgcolor=AAAAAA valign=""top"" width=320><link rel=""stylesheet"" type=""text/css"" href=""http://cache.blogads.com/582081819/feed.css"" /><script language=""javascript"" src=""http://cache.blogads.com/582081819/feed.js""></script><link rel=""stylesheet"" 
type=""text/css"" href=""http://cache.blogads.com/990819385/feed.css"" /><script language=""javascript"" src=""http://cache.blogads.com/990819385/feed.js""></script><br/></td></tr></table> </td><td><img src=""http://i.bookmooch.com/images/spacer.gif"" width=""12"" height=""12"" alt=""""></td></tr><tr><td></td><td></td></tr></table> <script language=""JavaScript"" type=""text/javascript"" src=""http://i.bookmooch.com/js/wz_tooltip.js""></script> </body> </html>```

### Became

In [36]:
print(df.iloc[0].values[0])

eric love war eric love war author eric title love war available short walk slowly mediterranean last grain race round low gear box show abridged ancient general eastern front edition format florence history history general history general home front intelligence medieval military naval pearl harbor personal reference renaissance travel travel general western front world war writing binding audio date publisher audio weight size x x edition abridged previous oh previous ca deb description book description made peace summer allied eric among away prison behind went help retreating enemy sheltered three classic memoir selfless people unchanging funny bizarre dangerous local girl later wife exciting story superbly told punch related interest hood freedom available b add save later web related recommend
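
One way to approach the question of what was lost is to list the tokens of the visible text that no longer appear after cleaning. The sketch below is only an illustration: `lost_tokens` is a hypothetical helper, and the two strings are a sentence from the example page above and the corresponding fragment of the cleaned output.

```python
import re
from collections import Counter


def lost_tokens(raw_text: str, cleaned_text: str, top: int = 10):
    """Return the most frequent raw tokens that are absent from the cleaned text."""
    raw = Counter(re.findall(r"[a-z0-9]+", raw_text.lower()))
    kept = set(cleaned_text.split())
    lost = Counter({tok: n for tok, n in raw.items() if tok not in kept})
    return lost.most_common(top)


raw_text = "In this classic memoir of WW II, Newby recalls these selfless people."
cleaned_text = "classic memoir selfless people"
print(lost_tokens(raw_text, cleaned_text))
```

On this sentence the dropped tokens include stop words, but also the proper noun "Newby", the abbreviation "WW II", and the inflected verb "recalls", which suggests that the dictionary filter removes more than noise.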


# [10 points] Part 2. Duplicates detection. LSH

#### Libraries you can use

1. LSH - https://github.com/ekzhu/datasketch
1. LSH - https://github.com/mattilyra/LSH
1. Any other library of your choice

1. Detect duplicated texts (a duplicate need not match word for word; it may be a paraphrase or a rearrangement of words or sentences)
1. Plot the number of duplicates as a function of shingle size (with a fixed MinHash length)
1. Plot the number of duplicates as a function of MinHash length (with a fixed shingle size)
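
MinHash estimates the Jaccard similarity between shingle sets, so it can help to look at the exact quantity on a tiny example first. The following is a plain-Python sketch, not the datasketch implementation; the helper names and the three toy strings are assumptions made for illustration.

```python
def char_shingles(text: str, k: int) -> set:
    """All contiguous character k-grams of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "love war eric title love war available"
doc_b = "eric title love war available love war"   # same words, reordered
doc_c = "history general home front intelligence"  # unrelated text

for k in (3, 5, 7):
    sim_ab = jaccard(char_shingles(doc_a, k), char_shingles(doc_b, k))
    sim_ac = jaccard(char_shingles(doc_a, k), char_shingles(doc_c, k))
    print(f"k={k}: reordered={sim_ab:.2f}, unrelated={sim_ac:.2f}")
```

Reordered text keeps most of its short shingles, which is why shingle-based similarity can flag rearrangements as near-duplicates; longer shingles span more word boundaries and are therefore more sensitive to reordering.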

In [7]:
from datasketch import MinHash, MinHashLSH
import math

In [8]:
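def shingle(text: str, k: int) -> set:
    # Return the set of character k-shingles of `text`, i.e. every
    # contiguous substring of length k, including the one ending at the last character.
    return {text[i:i + k] for i in range(len(text) - k + 1)}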

permutations_list = dict([(32, 0), (64, 0), (128, 0), (256, 0), (512, 0)])
shinlge_list =  dict([(3, 0), (4, 0), (5, 0), (6, 0), (7, 0)])

for perm in permutations_list:
    print("permutation: ", perm)
    lsh = MinHashLSH(threshold=0.7, num_perm=perm)
    n = 0
    for index, row in df.iterrows():
        # if n == 100: break
        val = row.values[0]
        val = shingle(val, 5)
        # print(val)
        m = MinHash(num_perm=perm)
        for d in val:
            m.update(d.encode('utf8'))
        if len(lsh.query(m)) > 0:
            permutations_list.update({perm : permutations_list.get(perm) + 1})
        lsh.insert("m%s" % n, m)
        n += 1

for sh in shinlge_list:
    print("shinlge: ", sh)
    lsh = MinHashLSH(threshold=0.7)
    n = 0
    for index, row in df.iterrows():
        # if n == 100: break
        val = row.values[0]
        val = shingle(val, sh)
        m = MinHash()
        for d in val:
            m.update(d.encode('utf8'))
        if len(lsh.query(m)) > 0:
            shinlge_list.update({sh : shinlge_list.get(sh) + 1})
        lsh.insert("m%s" % n, m)
        n += 1


permutation:  32
permutation:  64
permutation:  128
permutation:  256
permutation:  512
shinlge:  3
shinlge:  4
shinlge:  5
shinlge:  6
shinlge:  7


In [33]:
print(permutations_list)
print(shinlge_list)

perm_df = pd.DataFrame.from_dict(permutations_list.items())
fig = px.histogram(perm_df, x=0, y=1)
fig.update_xaxes(type='category', title='Minhash length')
fig.update_yaxes(title='Duplicates')
fig.show()

sh_df = pd.DataFrame.from_dict(shinlge_list.items())
fig = px.histogram(sh_df, x=0, y=1)
fig.update_xaxes(type='category', title='Shingle size')
fig.update_yaxes(title='Duplicates')
fig.show()

{32: 66007, 64: 64746, 128: 65166, 256: 65160, 512: 64270}
{3: 70556, 4: 66557, 5: 65166, 6: 64048, 7: 63591}
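
When reading these counts, it may help to recall how banded LSH decides that two documents are candidates: the MinHash signature is split into `bands` groups of `rows` values, and a pair is reported if at least one band matches exactly. The sketch below evaluates the standard candidate probability for a few band/row splits. The specific splits are illustrative assumptions; datasketch chooses its own split internally from `threshold` and `num_perm`.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """P(two documents with Jaccard similarity s share at least one LSH band)."""
    return 1.0 - (1.0 - s ** rows) ** bands


# Illustrative (bands, rows) splits; each satisfies bands * rows <= num_perm.
for num_perm, bands, rows in [(32, 8, 4), (128, 16, 8), (512, 32, 16)]:
    probs = [candidate_probability(s, bands, rows) for s in (0.5, 0.7, 0.9)]
    print(num_perm, [round(p, 3) for p in probs])
```

A longer signature allows more rows per band, which makes the transition around the threshold steeper, so the reported count depends on the split as well as on the data.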


# [Optional 10 points] Part 3. Topic model

In this part you will learn how to do topic modeling with common tools and assess the resulting quality of the models. 

The provided data contain chunked stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL).

The dataset can be downloaded here: `https://drive.google.com/file/d/14tAjAzHr6UmFVFV7ABTyNHBh-dWHAaLH/view?usp=sharing`

#### Preprocess the dataset with the functions from Part 1

#### Quality estimation

Implement the following three quality functions: `coherence` (or `tf-idf coherence`), `normalized PMI`, and a measure based on distributed word representations (you can use pretrained w2v vectors or some other model). You are free to use any libraries (for instance gensim) and components.
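
As a starting point for the `normalized PMI` score, here is a minimal sketch. It assumes probabilities are estimated from document-level co-occurrence and that a topic is scored by averaging NPMI over all pairs of its top words; both are common choices, not requirements of the assignment. The function name and toy documents are hypothetical. For a pair of words, NPMI is log(p(w1, w2) / (p(w1) p(w2))) divided by -log p(w1, w2).

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words (needs at least two words).

    `documents` is a list of token lists; probabilities are the fraction
    of documents that contain the word (or both words of a pair).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*ws):
        return sum(all(w in d for w in ws) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)   # words never co-occur: lower limit of NPMI
        elif p12 == 1.0:
            scores.append(1.0)    # words co-occur in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


docs = [["love", "war"], ["love", "war"], ["gear", "race"], ["gear", "race"]]
print(npmi_coherence(["love", "war"], docs))
print(npmi_coherence(["love", "gear"], docs))
```

Scores lie in [-1, 1]: values near 1 mean the top words tend to appear in the same documents, values near -1 mean they almost never do.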

### Topic modeling

Read and preprocess the dataset, then divide it into train and test parts with `sklearn.model_selection.train_test_split`. The test part will be used in the classification part. For simplicity we do not perform cross-validation here, but you should keep it in mind.

Plot a histogram of the resulting token counts in the processed datasets.


#### NMF

Implement topic modeling with NMF (you can use `sklearn.decomposition.NMF`) and print the resulting topics. Try changing the hyperparameters to better fit the dataset.

#### LDA

Implement topic modeling with LDA (you can use the gensim implementation) and print the resulting topics. Try changing the hyperparameters to better fit the dataset.

### Additive regularization of topic models 

Implement topic modeling with ARTM. You may use the bigartm library (simple installation on Linux: `pip install bigartm`) or the TopicNet framework (`https://github.com/machine-intelligence-laboratory/TopicNet`).

Create an ARTM topic model and fit it to the data. Try changing the hyperparameters (number of specific and background topics) to better fit the dataset. Experiment with the smoothing and sparsing coefficients (use a grid), and try adding a decorrelator. Print the resulting topics.

Write a function that converts new documents to topic probability vectors.

Calculate the quality scores for each model. Make a barplot to compare the quality.