### Objective

1. Implement TF-IDF on text from Wikipedia articles
2. Rank articles by closest match to a search term

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import pymongo 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
%matplotlib inline

#### Load the Machine Learning and Business Software Wikipedia pages from MongoDB into DataFrames

In [2]:
def get_list_pages(col_name, ip='34.209.242.27'):
    """Return every document in the named collection of the wikipedia database."""
    cli = pymongo.MongoClient(ip, 27016)
    wikidb = cli.wikipedia
    col_pages = wikidb.get_collection(col_name)
    # find() with no filter yields all documents; list() materializes the cursor.
    return list(col_pages.find())

In [7]:
cli = pymongo.MongoClient('34.209.242.27', 27016)
wikidb = cli.wikipedia
wikidb.list_collection_names()

['bussof', 'Machine Learning 2', 'ml_col', 'Business Software 2']

In [8]:
ml_df = pd.DataFrame(get_list_pages('Machine Learning 2'))
bizsoft_df = pd.DataFrame(get_list_pages('Business Software 2'))

In [12]:
bizsoft_df['text'].head(1)[0]

'{{multiple issues|\n{{COI|date=September 2011}}\n{{Orphan|date=September 2011}}\n}}\n\n{{Infobox software\n| name = SAP Business Rule Framework plus (BRFplus)\n| logo = \n| screenshot = BRFplus.PNG\n| caption = Pricing rule with BRFplus\n| developer = [[SAP AG|SAP]]\n| frequently updated = yes <!-- Release version update? Don\'t edit this page, just click on the version number!-->\n| operating_system = [[Microsoft Windows]], [[Linux]]\n| genre = [[BRMS]]\n| license = SAP NetWeaver 7.0 Enhancement Package 2\n| website = [http://www.sdn.sap.com/irj/sdn/nw-rules-management?rid=/webcontent/uuid/d00df7db-c783-2b10-aa97-ccfeacc19fcb BRFplus on SAP Developers Network (SDN)]\n}}\n\n\'\'\'BRFplus\'\'\' (Business Rule Framework plus) is a [[BRMS|business rules management system (BRMS)]] offered by [[SAP AG]]. BRFplus is part of the [[SAP NetWeaver]] [[ABAP|ABAP stack]]. Therefore, all SAP applications that are based on SAP NetWeaver can access BRFplus within the boundaries of an SAP system. How

In [4]:
# bizy = lambda x: "Business_Software"
# bizsoft_df['Category'] = bizsoft_df['Category'].map(bizy)

In [5]:
wiki_df = pd.concat([ml_df, bizsoft_df], axis=0)

In [6]:
import re

In [7]:
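def cleaner(text):
    # Drop HTML-escaped apostrophes and lowercase everything.
    text = re.sub(r'&#39;', '', text).lower()
    # Remove line-break tags, then tag pairs and their contents (greedy, within one line).
    text = re.sub(r'<br />', '', text)
    text = re.sub(r'<.*>.*</.*>', '', text)
    # Strip byte-order marks and digits.
    text = re.sub(r'\ufeff', '', text)
    text = re.sub(r'\d', '', text)
    # Keep only lowercase letters and spaces; everything else is deleted, not replaced.
    text = re.sub(r'[^a-z ]', '', text)
    #text = ' '.join(text.split())
    return text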

In [8]:
#clean = lambda x: re.sub('/[^a-z0-9-]/g', "", x)

wiki_df['text'] = wiki_df['text'].map(str)
wiki_df['text'] = wiki_df['text'].apply(cleaner)

In [9]:
wiki_df.set_index('page_id', inplace=True)

In [10]:
wiki_df.sample(5)

| page_id | Category | _id | categoies | text | title |
|---|---|---|---|---|---|
| 18025626 | Business_Software | 5a0e80f38423e101482c8e5b | [Category:All articles with dead external link... | document automation also known as document ass... | Document automation |
| 42363103 | Business_Software | 5a0e80ee8423e101482c8e3c | [Category:Business software, Category:Companie... | infobox company name gooddata logo ... | GoodData |
| 41699654 | Business_Software | 5a0e81068423e101482c8edc | [Category:1993 video games, Category:All artic... | multiple issuesnotabilityproductsdateoctober o... | On the Ball (video game) |
| 15689191 | Machine Learning | 5a0d0b308423e1001feda4ac | [Category:All articles lacking sources, Catego... | unreferenceddatejuly a loglinear model is a ma... | Log-linear model |
| 12286078 | Business_Software | 5a0e80f68423e101482c8e81 | [Category:All articles lacking sources, Catego... | unreferenceddateseptember a workflow applicati... | Workflow application |
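
The fused tokens in the sample above (for example `unreferenceddatejuly`) arise because `cleaner` deletes punctuation and markup outright instead of replacing it with whitespace. A minimal sketch of a variant that substitutes a space, assuming the same lowercase-letters-only goal. The name `cleaner_spaced` and the sample string are invented for illustration and are not part of the original notebook:

```python
import re

def cleaner_spaced(text):
    """Lowercase the text and replace every run of non-letters with one space."""
    text = text.lower()
    # Anything that is not a-z becomes a single space, so neighbouring words stay separate.
    text = re.sub(r'[^a-z]+', ' ', text)
    return text.strip()

# Invented snippet of wiki markup, for illustration only.
sample = "{{Unreferenced|date=July 2011}}\nA log-linear model is a mathematical model."
cleaned = cleaner_spaced(sample)
print(cleaned)
```

Using this variant would change the vocabulary, so the TF-IDF shape and search results shown later in the notebook would not necessarily be reproduced.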


#### Use TFIDF to vectorize page text

In [11]:
tfidf_vector = TfidfVectorizer(min_df=5, stop_words="english")

In [12]:
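wiki_pages_matrix_sparse = tfidf_vector.fit_transform(wiki_df['text'])
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names().
wiki_pages_df_tfd = pd.DataFrame(wiki_pages_matrix_sparse.toarray(),
                                 index=wiki_df.index,
                                 columns=tfidf_vector.get_feature_names_out())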
full_wiki_text_tfd_df = pd.concat([wiki_df['text'], wiki_pages_df_tfd], axis=1)

In [13]:
wiki_pages_df_tfd.shape

(2501, 12849)

### Compute SVD on document matrix to sort

In [34]:
wiki_pages_df_tfd.head()

| page_id | aa | aaai | aalst | aaron | ab | abacus | abandoned | abbreviated | abbreviation | abbreviations | ... | zoho | zone | zones | zoo | zoom | zos | zoubin | zur | zurich | zx |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30632997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 26499237 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.057049 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 16369738 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7309022 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 55330205 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [14]:
n_components = 100
SVD = TruncatedSVD(n_components)
component_names = ["component_"+str(i+1) for i in range(n_components)]

In [15]:
wiki_svd_matrix = SVD.fit_transform(wiki_pages_df_tfd)

In [36]:
wiki_svd_matrix[0:5]

array([[  1.63423144e-01,   1.41495792e-01,  -1.79123033e-03,
         -1.72436698e-02,  -1.10398598e-01,  -1.72652986e-02,
         -2.01521882e-02,   1.68462682e-02,   6.51002445e-02,
          1.12084405e-01,   7.90359704e-02,  -5.26982479e-02,
         -1.07386747e-02,   1.33838669e-01,  -6.38847829e-02,
         -2.98023646e-02,   1.56784312e-02,   1.06877483e-02,
          4.17276194e-02,  -4.91245161e-02,  -2.52677210e-02,
         -2.76148211e-02,  -1.88843707e-02,   1.53222410e-02,
         -1.60019745e-02,   1.66578937e-02,   9.86093860e-05,
         -3.34842872e-03,   1.91287742e-03,   7.66833177e-03,
         -7.67517390e-02,   1.27216103e-02,  -2.81010949e-03,
         -3.70287487e-02,  -2.23941838e-02,   1.36198597e-02,
          3.81595846e-02,  -2.99406275e-03,   5.85891349e-03,
         -1.84224317e-03,  -2.60305035e-02,   4.54508536e-02,
          1.84134244e-02,  -2.41000094e-02,  -7.27532944e-02,
         -1.26204758e-02,   4.65502841e-03,   2.13822271e-02,
        

In [16]:
SVD.explained_variance_ratio_

array([ 0.00483526,  0.01776534,  0.01302833,  0.01045203,  0.00770768,
        0.00721908,  0.00669387,  0.00552595,  0.00529684,  0.00499334,
        0.00482626,  0.00451449,  0.00432126,  0.00398246,  0.00372253,
        0.00361133,  0.00342344,  0.00339462,  0.00327732,  0.00316749,
        0.00309248,  0.00301424,  0.00293291,  0.00281471,  0.00272836,
        0.00266205,  0.00264757,  0.00252462,  0.0024713 ,  0.00241751,
        0.00236871,  0.00233813,  0.0022923 ,  0.00226654,  0.00225078,
        0.00220429,  0.00218351,  0.00212686,  0.00211591,  0.00210277,
        0.00208909,  0.00205042,  0.00203009,  0.00200261,  0.00196178,
        0.00193066,  0.00189356,  0.00186667,  0.00184384,  0.00182704,
        0.00182005,  0.00176988,  0.00175747,  0.00174845,  0.00173248,
        0.00172398,  0.00168664,  0.00167141,  0.00164616,  0.0016312 ,
        0.0016234 ,  0.00161893,  0.00159204,  0.00158884,  0.00156506,
        0.00155294,  0.00153279,  0.0015196 ,  0.00150473,  0.00
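
Each ratio above is small on its own, so a running total is a more convenient way to judge how much variance the retained components capture together. A minimal sketch using `np.cumsum`, run on a random stand-in matrix because the Wikipedia matrix is not reproduced here (`toy_matrix` and `toy_svd` are illustrative names):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Stand-in data: a small random non-negative matrix (NOT the Wikipedia TF-IDF matrix).
rng = np.random.RandomState(0)
toy_matrix = rng.rand(50, 20)

toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_svd.fit(toy_matrix)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)
print(cumulative)
```

In the notebook itself the equivalent call would be `np.cumsum(SVD.explained_variance_ratio_)`, and the last entry would give the share of variance kept by all 100 components.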

### Use Cosine Similarity to produce the top matching articles for a given search term

In [17]:
search_term = "microsoft word"

In [18]:
search_term_vec = tfidf_vector.transform([search_term])
search_term_lsa = SVD.transform(search_term_vec)
# Dot product in the reduced space; this equals cosine similarity only for unit-length vectors.
cosine_similarities = wiki_svd_matrix.dot(search_term_lsa.T).ravel()

In [19]:
cosine_similarities.argsort()[:-6:-1]

array([1113, 1442, 1428, 1179, 1858])

In [20]:
# Position of the fifth-best match.
cos_sims = cosine_similarities.argsort()[-5]

In [21]:
wiki_df.iloc[cos_sims]['title']

'Office Genuine Advantage'
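
The scores computed in In [18] are raw dot products. They coincide with cosine similarity only when the vectors have unit length, and the output of `TruncatedSVD` is not normalized, so long documents can be favoured. A minimal sketch of the difference using `sklearn.metrics.pairwise.cosine_similarity`; the vectors below are made up for illustration and are not taken from the notebook:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 2-D "document" vectors: doc 0 is long, doc 1 is short but points
# in exactly the same direction as the query.
docs = np.array([[4.0, 3.0],
                 [0.5, 0.0]])
query = np.array([[1.0, 0.0]])

dot_scores = docs.dot(query.T).ravel()
cos_scores = cosine_similarity(docs, query).ravel()

dot_ranking = dot_scores.argsort()[::-1]  # best first, by raw dot product
cos_ranking = cos_scores.argsort()[::-1]  # best first, by cosine similarity
print(dot_ranking, cos_ranking)
```

Applied to this notebook, the normalized version would be `cosine_similarity(wiki_svd_matrix, search_term_lsa).ravel()`. The rankings it produces may differ from the ones shown here.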

In [22]:
def get_articles(search_term, svd_matrix, orig_df):
    # NOTE: relies on the global tfidf_vector and SVD fitted above;
    # these should eventually be passed in as arguments.
    search_term_vec = tfidf_vector.transform([search_term])
    search_term_lsa = SVD.transform(search_term_vec)

    # Score every article against the query in the reduced space.
    cosine_similarities = svd_matrix.dot(search_term_lsa.T).ravel()
    # Row positions of the five highest scores, best first.
    cos_sims_5 = cosine_similarities.argsort()[:-6:-1].tolist()

    for rank, idx in enumerate(cos_sims_5, start=1):
        print("{}: {}".format(rank, orig_df.iloc[idx]['title']))
    return cos_sims_5
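
The function above reads `tfidf_vector` and `SVD` from the notebook's global scope. One way to remove that dependency is to pass the fitted objects in explicitly. A sketch under that assumption, run on a tiny invented corpus; `rank_articles`, the toy titles, and the toy texts are all hypothetical:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_articles(search_term, vectorizer, svd, svd_matrix, titles, top_n=5):
    """Return the top_n titles ranked by dot product in the reduced space."""
    query_lsa = svd.transform(vectorizer.transform([search_term]))
    scores = svd_matrix.dot(query_lsa.T).ravel()
    best = scores.argsort()[::-1][:top_n]
    return [titles.iloc[i] for i in best]

# Invented four-document corpus, for illustration only.
toy = pd.DataFrame({
    'title': ['Kernel doc A', 'Kernel doc B',
              'Spreadsheet doc A', 'Spreadsheet doc B'],
    'text': ['kernel method support vector machine',
             'radial basis function kernel support vector',
             'spreadsheet business software cells',
             'spreadsheet accounting software formulas'],
})

vec = TfidfVectorizer()
toy_tfidf = vec.fit_transform(toy['text'])
svd = TruncatedSVD(n_components=2, random_state=0)
toy_lsa = svd.fit_transform(toy_tfidf)

result = rank_articles('kernel', vec, svd, toy_lsa, toy['title'], top_n=2)
print(result)
```

With the notebook's own objects the call would look like `rank_articles(search_term, tfidf_vector, SVD, wiki_svd_matrix, wiki_df['title'])`.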

In [24]:
search_term = 'Artificial Intelligence'

In [25]:
get_articles(search_term, wiki_svd_matrix, wiki_df)

1: AAAI Conference on Artificial Intelligence
2: Glossary of artificial intelligence
3: International Joint Conference on Artificial Intelligence
4: European Conference on Artificial Intelligence
5: Jürgen Schmidhuber


[709, 84, 613, 112, 1037]

This search method appears to work reasonably well. Four of the top five hits contain the search term in their title, even though I did not train the model on the article titles. That said, because of the way Wikipedia provides page text through its API, the article title may also appear in the text I trained on. Even so, the fact that the search returns articles with matching titles is indicative of a reasonable degree of success.

Below are further examples of good functionality of the search engine.

In [32]:
get_articles("Microsoft Word", wiki_svd_matrix, wiki_df)

1: Microsoft Office 3.0
2: Microsoft
3: Microsoft Dynamics
4: Microsoft Office 98 Macintosh Edition
5: Office Genuine Advantage


[1113, 1442, 1428, 1179, 1858]

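Note: Interestingly, "Microsoft Word" itself does not appear in the results for the "Microsoft Word" search. This is unlikely to be caused by min_df=5 in the TfidfVectorizer, because min_df only drops terms that occur in fewer than five documents, so a common term such as 'word' is kept. A more plausible explanation is that 'word' occurs in many articles and therefore receives a low IDF weight, which leaves 'microsoft' to dominate the query (for example, I just used the word word 4 times in this sentence). It is also possible that the article is simply not among the pages I downloaded for these two categories.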
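
To make the effect of `min_df` concrete: it discards terms that occur in too few documents, not terms that occur in many. A small sketch on an invented corpus (the sentences are illustrative only, and `min_df=2` is used instead of 5 to keep the example short):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: "word" appears in every document, every other term in only one.
corpus = [
    "microsoft word document",
    "word processor software",
    "word count tool",
    "brfplus word rules",
]

vec = TfidfVectorizer(min_df=2)
vec.fit(corpus)

# Terms that survive min_df=2: only those found in at least two documents.
vocabulary = sorted(vec.vocabulary_)
print(vocabulary)
```

The common term is the one that survives. Its influence on a query is reduced by its low IDF weight rather than by `min_df`.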

In [27]:
get_articles("Machine Learning", wiki_svd_matrix, wiki_df)

1: Machine learning
2: Outline of machine learning
3: Meta learning (computer science)
4: BigDL
5: Portal:Machine learning/Related portals


[924, 1031, 916, 990, 745]

In [29]:
get_articles("Kernel Approximation", wiki_svd_matrix, wiki_df)

1: Kernel method
2: Radial basis function kernel
3: Multiple kernel learning
4: Low-rank matrix approximations
5: Kernel density estimation


[917, 946, 1051, 111, 1002]

Comparing these results with Google for the same search term, "Kernel Approximation", four of my top five results appear on the first page of Google's results.

"Kernel method" appears first on Google, followed by "Low-rank matrix approximation" (rank 4 using get_articles()) and "Kernel density estimation" (rank 5 using get_articles()).

Lower down, but still on the first page, is "Radial basis function kernel": rank 7 on google.com versus rank 2 using get_articles().

"Multiple kernel learning", which appeared at rank 3 using get_articles(), did not appear on the first page of the Google search.

Clearly, there is some divergence in the results, and Google's are no doubt preferable, but for such a simple implementation this function works surprisingly well.