In [1]:
%load_ext autoreload
%autoreload 1

import sys
sys.path.append("../utils/")

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler, MaxAbsScaler, RobustScaler

from GraphAPI import GraphCreator
from graph_helpers import *
from evaluations import *
from visualizers import *

%aimport GraphAPI
%aimport graph_helpers
%aimport evaluations
%aimport visualizers

## Generating Graph from Entry Point

1. We initialize our GraphCreator class and check how many new nodes we will need to query. 

In [14]:
# include_see_also=False used for validation below.
# in deployment, include_see_also should be set to True
gc = GraphCreator("https://en.wikipedia.org/wiki/Data_science",
                  include_see_also=False, max_recursive_requests=50)
print("Number of Links to Search:", len(gc.next_links), "\n\n")
print(list(gc.primary_nodes.keys()), "\n\n")
print(gc.see_also_articles)

Number of Links to Search: 436 


['Business intelligence', 'Data', 'Multi-disciplinary', 'Business analytics', 'Data analysis', 'Data mining', 'Computational science', 'Harvard Business Review', 'Buzzword', 'Distributed computing', 'Information explosion', 'Empirical research', 'Knowledge', 'Turing award', 'Computer science', 'Predictive modelling', 'Basic research', 'Statistics', 'Machine learning', 'Hans Rosling', 'American Statistical Association', 'Information science', 'Database', 'Nate Silver', 'Big data', 'Mathematics', 'Jim Gray (computer scientist)'] 


[]


2. We query all the nodes linked to/from the entry point (expand our network one level for each node).

In [15]:
gc.expand_network_threaded(threads=20, chunk_size=1)
print("Number of Links After Expansion: ", len(gc.graph.nodes))

Number of Links After Expansion:  226732


3. Since some nodes will likely have linked to articles through a redirect link, we need to traverse our graph and ensure that all redirects are assigned to the correct nodes. Once all redirects have been dealt with, we remove any old redirect nodes. 

In [16]:
gc.redraw_redirects()
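
The redirect handling described in step 3 can be sketched in plain Python. This is only an illustration under stated assumptions, not the `redraw_redirects` implementation: the function name `resolve_redirects`, the edge-set representation, and the `redirects` mapping are all hypothetical, and "ML" is used as a made-up redirect title.

```python
def resolve_redirects(edges, redirects):
    """Return edges with every redirect node replaced by its canonical node.

    edges: iterable of (source, target) tuples (hypothetical representation).
    redirects: dict mapping a redirect title to the title it points to.
    """
    def canonical(node):
        seen = set()
        # Follow redirect chains, stopping if a chain loops back on itself.
        while node in redirects and node not in seen:
            seen.add(node)
            node = redirects[node]
        return node

    resolved = set()
    for source, target in edges:
        s, t = canonical(source), canonical(target)
        if s != t:  # drop self-loops created by collapsing a redirect
            resolved.add((s, t))
    return resolved


edges = {
    ("Data science", "ML"),              # link that goes through a redirect
    ("ML", "Machine learning"),          # the redirect itself
    ("Machine learning", "Statistics"),
}
redirects = {"ML": "Machine learning"}
clean_edges = resolve_redirects(edges, redirects)
```

After this pass, no edge in `clean_edges` mentions the redirect title, which mirrors the idea of reassigning redirects and then removing the old redirect nodes.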

4. Edges are weighted by how many categories two connected nodes have in common. Once we have all our nodes, and we have dealt with redirects, we can add edge weights for our entire graph. 

In [17]:
gc.update_edge_weights()
gc.get_edge_weights().head()

Unnamed: 0,source_node,target_node,edge_weight
0,Claudio Silva (computer scientist),Juliana Freire,9
1,Juliana Freire,Claudio Silva (computer scientist),9
2,International Standard Book Number,International Standard Serial Number,7
3,International Standard Serial Number,International Standard Book Number,7
4,Big memory,Big data,6
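
As a rough illustration of the weighting rule in step 4, the sketch below counts the categories two connected articles share. It is a minimal sketch, not the `update_edge_weights` implementation: the helper `shared_category_weight`, the article names, and the category sets are all invented for the example.

```python
def shared_category_weight(categories_a, categories_b):
    """Edge weight = number of categories the two articles have in common."""
    return len(set(categories_a) & set(categories_b))


# Made-up category sets, for illustration only.
categories = {
    "Article A": {"Data", "Statistics", "Computing"},
    "Article B": {"Statistics", "Computing", "Mathematics"},
    "Article C": {"History"},
}

# Directed edges; A <-> B appears in both directions.
edges = [
    ("Article A", "Article B"),
    ("Article B", "Article A"),
    ("Article A", "Article C"),
]

weights = {
    (source, target): shared_category_weight(categories[source], categories[target])
    for source, target in edges
}
```

Because set intersection is symmetric, both directions of a reciprocal link receive the same weight, which is consistent with the paired rows in the output above.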


# Getting Our Feature Set

There are two options when generating the feature set:

1. We can generate a standard feature set containing only the raw feature values. To do this, set the `rank` parameter to `False`.
2. We can generate a ranked feature set (set `rank` to `True`). Each feature is then ranked from _best_ to _worst_ (ascending or descending, depending on what the feature measures).

After running `get_features_df`, the feature set is saved on the GraphCreator instance as `features_df`.
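
To make the idea of direction-dependent ranking concrete, here is a small sketch. It assumes a dense ranking where 1 is the best value, and it assumes that a higher `page_rank` and a shorter path length count as "better"; the function `rank_feature` and both assumptions are illustrative and may not match what `get_features_df(rank=True)` does internally.

```python
def rank_feature(values, higher_is_better=True):
    """Dense-rank values so that rank 1 is the 'best' value; ties share a rank."""
    ordered = sorted(set(values), reverse=higher_is_better)
    position = {value: index + 1 for index, value in enumerate(ordered)}
    return [position[value] for value in values]


# Values for three nodes (Data science, Academic journal, Academic publishing).
page_rank = [0.000599, 0.009561, 0.001702]
path_length = [0.0, 1.0, 1.0]

# Assumption: larger PageRank is better, so rank in descending order.
page_rank_ranks = rank_feature(page_rank, higher_is_better=True)

# Assumption: a shorter path from the entry point is better, so rank ascending.
path_length_ranks = rank_feature(path_length, higher_is_better=False)
```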

In [18]:
features_df = gc.get_features_df(rank=False)

In [19]:
features_df

Unnamed: 0,node,degree,category_matches_with_source,in_edges,out_edges,shared_neighbors_with_entry_score,centrality,page_rank,adjusted_reciprocity,shortest_path_length_from_entry,shortest_path_length_to_entry,jaccard_similarity,primary_link
0,Data science,434,1,278,156,1.000000,3.053527e-02,0.000599,0.006842,0.0,0.0,1.000000,0
1,Academic journal,6482,1,6294,188,0.002495,9.420964e-03,0.009561,0.029934,1.0,3.0,0.000457,0
2,Academic publishing,746,4,547,199,0.014480,4.331199e-03,0.001702,0.018388,1.0,2.0,0.001214,0
3,Academy,1337,3,763,574,0.006035,3.011145e-03,0.001108,0.037632,1.0,3.0,0.000962,0
4,American Statistical Association,633,1,606,27,0.011696,1.275282e-03,0.000562,0.005559,1.0,2.0,0.007982,1
5,Analytics,422,2,343,79,0.042660,9.244348e-03,0.000383,0.003849,1.0,2.0,0.036728,0
6,Anand Rajaraman,33,0,13,20,0.008969,1.193452e-03,0.000015,0.000428,1.0,2.0,0.003448,0
7,Anomaly detection,256,0,149,107,0.171260,9.334228e-02,0.000221,0.033783,1.0,2.0,0.011848,0
8,Applied science,391,2,311,80,0.014286,4.658904e-03,0.000607,0.011974,1.0,2.0,0.003407,0
9,Artificial neural network,1138,2,814,324,0.072344,1.404851e-01,0.000593,0.056875,1.0,2.0,0.008310,0


# Similarity Rank

Two articles are more similar the more categories they share and the closer they are to each other. 

In [20]:
gc.rank_similarity()
gc.features_df.sort_values("similarity_rank", ascending=False)

Unnamed: 0,node,degree,category_matches_with_source,in_edges,out_edges,shared_neighbors_with_entry_score,centrality,page_rank,adjusted_reciprocity,shortest_path_length_from_entry,shortest_path_length_to_entry,jaccard_similarity,primary_link,similarity_rank
0,Data science,434,1,278,156,1.000000,3.053527e-02,0.000599,0.006842,0.0,0.0,1.000000,0,5.645827e+00
33,Data analysis,688,3,459,229,0.064286,1.790331e-02,0.000499,0.026941,1.0,1.0,0.068116,1,2.613886e+00
68,Information explosion,64,3,27,37,0.014894,2.209606e-03,0.000065,0.002138,1.0,2.0,0.006601,1,2.587707e+00
69,Information science,1253,4,985,268,0.016297,8.016865e-03,0.001545,0.047895,1.0,1.0,0.011209,1,2.484508e+00
183,Data quality,221,4,119,102,0.071048,1.297741e-02,0.000392,0.015395,2.0,1.0,0.081744,0,2.434840e+00
24,Computational science,305,3,143,162,0.036928,1.177864e-02,0.000291,0.008980,1.0,2.0,0.036946,1,2.313982e+00
169,Geomatics,397,4,265,132,0.007833,7.580153e-04,0.000427,0.018388,2.0,1.0,0.001845,0,2.200532e+00
97,Nate Silver,686,3,317,369,0.007042,1.283311e-03,0.000483,0.044046,1.0,2.0,0.000000,1,1.957366e+00
34,Data mining,1542,3,1056,486,0.098965,1.492661e-01,0.001664,0.111612,1.0,1.0,0.043818,1,1.910824e+00
2,Academic publishing,746,4,547,199,0.014480,4.331199e-03,0.001702,0.018388,1.0,2.0,0.001214,0,1.904720e+00


# Scaling Features

We can scale each of our features with the `scale_features_df` method. It defaults to `StandardScaler`, but we can specify an alternate scaler through the `scaler` parameter.

In [21]:
scaled_feature_df = gc.scale_features_df(scaler=Normalizer, copy=True) # Makes a copy of the df
scaled_feature_df.sort_values("similarity_rank", ascending=False).reset_index(drop=True)

Unnamed: 0,node,similarity_rank,degree,category_matches_with_source,in_edges,out_edges,shared_neighbors_with_entry_score,centrality,page_rank,adjusted_reciprocity,shortest_path_length_from_entry,shortest_path_length_to_entry,jaccard_similarity,primary_link
0,Data science,5.645827e+00,0.805947,0.001857,0.516252,0.289695,0.001857,5.670463e-05,1.112161e-06,0.000013,0.000000,0.000000,0.001857,0.000000
1,Data analysis,2.613886e+00,0.801694,0.003496,0.534851,0.266843,0.000075,2.086187e-05,5.819979e-07,0.000031,0.001165,0.001165,0.000079,0.001165
2,Information explosion,2.587707e+00,0.812212,0.038072,0.342652,0.469560,0.000189,2.804168e-05,8.255044e-07,0.000027,0.012691,0.025382,0.000084,0.012691
3,Information science,2.484508e+00,0.775279,0.002475,0.609457,0.165822,0.000010,4.960342e-06,9.561722e-07,0.000030,0.000619,0.000619,0.000007,0.000619
4,Data quality,2.434840e+00,0.815576,0.014762,0.439156,0.376420,0.000262,4.789167e-05,1.447715e-06,0.000057,0.007381,0.003690,0.000302,0.000000
5,Computational science,2.313982e+00,0.815925,0.008025,0.382549,0.433377,0.000099,3.150981e-05,7.793568e-07,0.000024,0.002675,0.005350,0.000099,0.002675
6,Geomatics,2.200532e+00,0.801605,0.008077,0.535076,0.266529,0.000016,1.530551e-06,8.623845e-07,0.000037,0.004038,0.002019,0.000004,0.000000
7,Nate Silver,1.957366e+00,0.815707,0.003567,0.376938,0.438770,0.000008,1.525956e-06,5.745152e-07,0.000052,0.001189,0.002378,0.000000,0.001189
8,Data mining,1.910824e+00,0.798513,0.001554,0.546841,0.251671,0.000051,7.729629e-05,8.616215e-07,0.000058,0.000518,0.000518,0.000023,0.000518
9,Academic publishing,1.904720e+00,0.788394,0.004227,0.578085,0.210309,0.000015,4.577333e-06,1.798401e-06,0.000019,0.001057,0.002114,0.000001,0.000000
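
One detail worth keeping in mind is that scikit-learn's `Normalizer` rescales each row (sample) to unit norm, using the L2 norm by default, rather than rescaling each column. The sketch below reproduces that row-wise operation in plain Python for the "Data science" row. It assumes, based on the output above, that `similarity_rank` is left out of the scaling; the helper `l2_normalize_row` is hypothetical.

```python
import math


def l2_normalize_row(row):
    """Divide every value in a row by the row's L2 norm."""
    norm = math.sqrt(sum(value * value for value in row))
    if norm == 0:
        return list(row)  # an all-zero row is left unchanged
    return [value / norm for value in row]


# "Data science" row from the unscaled feature table, without similarity_rank:
# degree, category_matches_with_source, in_edges, out_edges,
# shared_neighbors_with_entry_score, centrality, page_rank,
# adjusted_reciprocity, shortest_path_length_from_entry,
# shortest_path_length_to_entry, jaccard_similarity, primary_link
data_science = [434, 1, 278, 156, 1.0, 3.053527e-02, 0.000599,
                0.006842, 0.0, 0.0, 1.0, 0]

scaled = l2_normalize_row(data_science)
scaled_degree = scaled[0]
scaled_in_edges = scaled[2]
```

Because each row is rescaled on its own, large-valued columns such as `degree` still dominate within a row. Column-wise scalers such as `StandardScaler` or `MinMaxScaler` behave differently, so the choice of scaler can change how the features compare with one another.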


___
# Validation

Here, we _validate_ our results. Many Wikipedia articles have a **See Also** section containing user-defined links that are highly related to the article (some pages do not have one). These links are not ordered by importance in any meaningful way, and they carry no rating scores.

The intuition in this validation is as follows: 

> _Given that we know some articles are highly related from user input, if the recommendations provided by this system are valid, we would expect to see those **See Also** links ranked relatively high on our list._ 

_Note: This validation is not meant as **confirmation** or **evaluation** of the results. It only provides us one way of telling if the results we are seeing are reasonably valid. It is important to note that we cannot compare these results across two different articles, as those would be two entirely different network structures, likely with different human labeled links._  


In [22]:
validation_df = evaluate_metrics(scaled_feature_df, 
                 on=[
                     "similarity_rank", 
                     "degree",
                     "category_matches_with_source",
                     "in_edges",
                     "out_edges",
                     "shared_neighbors_with_entry_score",
                     "centrality", 
                     "adjusted_reciprocity", 
                     "page_rank", 
                     "shortest_path_length_from_entry", 
                     "shortest_path_length_to_entry",
                     "jaccard_similarity"
                 ], 
                 targets=gc.see_also_articles).sort_values(["% targets in top 1%", 'score'], ascending=False)
validation_df

ZeroDivisionError: division by zero

The table produced by `evaluate_metrics` compares different ranking metrics (the row index) for a given article. The most important column, `score`, gives us a fast way to compare these metrics.

For example, if we see a _score_ of 0.98 for a given ranking metric, the following statement would be true:

> All of the human labeled **See Also** links are present within the top 98% of our recommendations. 

Because the human labeled links occupy a range of positions rather than a single one, a score of 100% is not possible. The `max score possible` column indicates the score that would be achieved if all the human labeled _See Also_ links appeared at the top of our recommendations without any other links intervening.

The `difference` column is an alternative way of looking at the score. If we had a 0.02 in this column, we could say:

> All the human labeled **See Also** links are contained within the top 2% of our recommendations. 

`Total targets` is the number of human labeled _See Also_ links. 

Because different metrics can have similar scores, we want a way to break down the dispersion of the known related links to see whether one metric performs better than another. The trailing four columns provide a coarse way of measuring this dispersion.

Each of these columns indicates the percentage of human labeled _See Also_ links captured within a given percentage of the top of our recommendations. For example, if we see a 0.92 in the `% targets in top 1%` column, we could say:

> 92% of the human labeled **See Also** links appear in the top 1% of our recommendations. 

The value of these columns is as follows: if two ranking metrics have similar scores, we _might_ consider the better performing one to be the one in which the majority of the human labeled links are higher in our recommendation list.
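
The "% targets in top N%" idea can be sketched as follows. This is a minimal illustration, not the `evaluate_metrics` implementation: the function name `fraction_of_targets_in_top`, its parameters, and the single-letter node names are hypothetical. The sketch also raises an explicit error when the target list is empty, since dividing by the number of targets would otherwise fail with a division by zero.

```python
def fraction_of_targets_in_top(ranked_nodes, targets, top_percent):
    """Share of `targets` found in the top `top_percent` percent of `ranked_nodes`.

    ranked_nodes: node names ordered from best to worst recommendation.
    targets: human labeled links to look for (e.g. See Also articles).
    top_percent: integer percentage of the list to inspect (1 to 100).
    """
    if not targets:
        raise ValueError("targets is empty; there is nothing to validate against")
    # Ceiling division, so that at least one node is inspected on short lists.
    cutoff = (len(ranked_nodes) * top_percent + 99) // 100
    top_nodes = set(ranked_nodes[:cutoff])
    hits = sum(1 for target in targets if target in top_nodes)
    return hits / len(targets)


# Ten recommendations, best first, and three human labeled targets.
ranked = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
targets = ["A", "C", "J"]

share_in_top_half = fraction_of_targets_in_top(ranked, targets, 50)
```

In this toy case two of the three targets fall in the top half of the list, so the share is two thirds. Checking that the target list is non-empty before computing any of these columns is a cheap safeguard for articles that have no **See Also** section.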

___

In [None]:
plot_validation_scores(validation_df, gc.entry)