* Oskar Szudzik 148245
* Krystian Moras 148243

Our first step was to download articles from Wikipedia and save them in a .csv file. More of the preprocessing code can be found in pipeline.py.

Now we need to load this data.

In [494]:
import pandas as pd

In [495]:
data = pd.read_csv('wikipedia.csv', header=0)
data.head()

Unnamed: 0,title,summary,links,categories,references
0,Psilocybe yungensis,psilocybe yungensis is a species of psychedeli...,Agaricales|Agaricomycetes|Alexander H. Smith|A...,Articles with 'species' microformats|Articles ...,http://www.fungimag.com/summer-2011-articles/F...
1,"Mian Rud, South Khorasan","mian rud (persian: ميان رود, also romanized as...","Abbasabad, Doreh|Abbasabad, Momenabad|Administ...",All stub articles|Articles containing Persian-...,http://geonames.nga.mil/namesgaz/|http://geoha...
2,Coat of arms of Hesse,the coat of arms of the german state of hesse...,Armiger|Blazon|Coat of arms|Coat of arms of An...,All stub articles|Articles with short descript...,https://www.hessen.de/fuer-besucher/70-jahre-h...
3,Helsinki Police Department,the helsinki police department (hpd) (finnish:...,2018 Russia–United States summit|8th World Fes...,Government agencies established in 1826|Law en...,http://tass.com/politics/1013215/amp|https://b...
4,Broadway Stages,"broadway stages, ltd. is one of new york’s ful...","Annadale, Staten Island|Arden Heights, Staten ...",1983 establishments in New York City|American ...,http://Broadway-Stages.com/|http://www.broadwa...


We will compute TF-IDF over the article summaries so that we can later propose articles similar to the ones a client liked.

In [496]:
article = data.iloc[0]
article.summary

'psilocybe yungensis is a species of psychedelic mushroom in the family hymenogastraceae. in north america, it is found in northeast, central and southeastern mexico. in south america, it has been recorded from bolivia, colombia, and ecuador. it is also known from the caribbean island martinique, and china. the mushroom grows in clusters or groups on rotting wood. the fruit bodies have conical to bell-shaped reddish- to orangish-brown caps that are up to 2.5 cm (1.0 in) in diameter, set atop slender stems 3 to 5 cm (1.2 to 2.0 in) long. the mushrooms stain blue when bruised, indicative of the presence of the compound psilocybin. psilocybe yungensis is used by mazatec indians in the mexican state of oaxaca for entheogenic purposes.'

First, we need to tokenize the summary.

In [497]:
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
from math import log 
import numpy as np

In [498]:
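# Requires NLTK's Punkt tokenizer data to be available (installed via nltk.download).
dict_of_words = {}

# Only the first 50 articles are processed here.
for i in range(50):
    article = data.iloc[i]
    words = []

    # Split the summary into sentences, then each sentence into word tokens.
    for sentence in sent_tokenize(article.summary):
        words += word_tokenize(sentence)
    dict_of_words[article.title] = words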


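Now we build a dictionary of normalized term frequencies (TF): each token count is divided by the count of the most frequent token in the same summary.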

In [499]:
all_words = sorted({x for v in dict_of_words.values() for x in v})

# all_words

In [500]:
tf_dict = {}
for document, tokens in dict_of_words.items():
    counted = Counter(tokens)
    max_occurrences = max(counted.values())

    # Term frequency normalized by the most frequent token in the document.
    word_freq = {word: count / max_occurrences for word, count in counted.items()}

    # Words that do not appear in this document get a frequency of 0,
    # so every document covers the whole vocabulary.
    for word in all_words:
        word_freq.setdefault(word, 0)

    tf_dict[document] = word_freq

# tf_dict
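To make the normalization concrete, here is a minimal sketch on a toy token list. The helper name `normalized_tf` and the toy data are assumptions made for illustration; they are not part of the notebook or of pipeline.py.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return token counts divided by the count of the most frequent token."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {token: count / max_count for token, count in counts.items()}


# Hypothetical toy input: "the" appears twice, every other token once.
toy_tokens = ["the", "mushroom", "grows", "on", "the", "wood"]
toy_tf = normalized_tf(toy_tokens)
print(toy_tf)
```

With this scheme the most frequent token always gets a frequency of 1, which keeps long and short summaries on a comparable scale.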

Next, we compute the inverse document frequency (IDF) of every word.

In [501]:
idf_dict = {}
num_of_docs = len(dict_of_words)
for word in all_words:
    # Number of documents whose summary contains the word.
    word_occurrences = 0
    for article in dict_of_words:
        if word in dict_of_words[article]:
            word_occurrences += 1
    try:
        idf_dict[word] = log(num_of_docs / word_occurrences)
    except ZeroDivisionError:
        # Defensive guard: every word in all_words comes from at least
        # one document, so this branch should not trigger.
        continue

# idf_dict
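The same idea can be sketched on a toy corpus, using sets for the membership test instead of scanning token lists. The function name `inverse_document_frequency` and the toy documents are assumptions for illustration only.

```python
from math import log


def inverse_document_frequency(docs):
    """Map each token to ln(N / n_t), with n_t the number of documents containing it."""
    num_docs = len(docs)
    doc_sets = [set(tokens) for tokens in docs.values()]
    vocabulary = set().union(*doc_sets)
    return {
        token: log(num_docs / sum(token in doc for doc in doc_sets))
        for token in vocabulary
    }


# Hypothetical toy corpus: "the" occurs everywhere, "mushroom" in one document.
toy_docs = {
    "a": ["the", "police", "department"],
    "b": ["the", "police", "officer"],
    "c": ["the", "mushroom"],
}
toy_idf = inverse_document_frequency(toy_docs)
print(toy_idf)
```

A token present in every document gets an IDF of 0, so very common words such as "the" contribute nothing to the final weights.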

Now we combine the two into TF-IDF weights by multiplying each term frequency by the IDF of the word.

In [502]:
from copy import deepcopy

In [503]:
weight_dict = deepcopy(tf_dict)

for document in tf_dict.keys():
    for word in tf_dict[document].keys():
        weight_dict[document][word] = tf_dict[document][word] * idf_dict[word]

# weight_dict
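Written out, the cells so far correspond to the formulas below. The symbols are introduced here for illustration: f_{t,d} is the number of times token t occurs in summary d, N is the number of summaries, and n_t is the number of summaries containing t. The logarithm is natural, matching `math.log`.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\max_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$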

After that, we score every remaining article against the ones we liked with a cosine-style similarity and suggest the five highest-scoring articles.

In [504]:
liked_articles = ['Helsinki Police Department', 'William Stephen Devery']
liked_weights = {}
liked_w_abs = {}
for l_article in liked_articles:
    # Take liked articles out of the candidate pool so they are not suggested back.
    liked_weights[l_article] = weight_dict.pop(l_article)
    # Normalizer: absolute value of the summed weights (not the Euclidean norm).
    liked_w_abs[l_article] = abs(sum(liked_weights[l_article].values()))

cos_sim = {}
for l_article in liked_articles:
    for document in weight_dict:
        if document not in cos_sim:
            cos_sim[document] = 0
        # Add the dot product with this liked article to the running score.
        for word in weight_dict[document]:
            cos_sim[document] += liked_weights[l_article][word] * weight_dict[document][word]
        # The running score, including earlier liked articles, is rescaled here.
        cos_sim[document] /= (liked_w_abs[l_article] * abs(sum(weight_dict[document].values())))

# Sort by descending score and keep the five best suggestions.
suggestions = dict(sorted(cos_sim.items(), key=lambda item: -item[1]))
list(suggestions.keys())[:5]

['Lois Delander',
 'Broadway Stages',
 'Premier League Riders Championship',
 '2011–12 TFF Second League',
 'Soul Food (soundtrack)']
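
The score used above divides by the summed weights of each article. The textbook cosine similarity divides by the Euclidean norms instead, and a minimal sketch of that variant follows. The names `cosine_similarity` and `rank_candidates`, the toy vectors, and the choice to average over the liked articles are all assumptions for illustration; the ranking it yields on the Wikipedia data may differ from the list shown above.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


def rank_candidates(liked, candidates, top_n=5):
    """Rank candidates by mean cosine similarity to the liked vectors (assumed aggregation)."""
    scores = {
        title: sum(cosine_similarity(vec, l_vec) for l_vec in liked.values()) / len(liked)
        for title, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical toy TF-IDF vectors.
toy_liked = {"liked": {"police": 1.0, "city": 0.5}}
toy_candidates = {
    "close": {"police": 2.0, "city": 1.0},
    "far": {"mushroom": 1.0},
    "mixed": {"police": 1.0, "mushroom": 1.0},
}
toy_ranking = rank_candidates(toy_liked, toy_candidates)
print(toy_ranking)
```

Because each pairwise score is computed independently and then averaged, the order in which liked articles are listed does not affect the ranking.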