# Wikipedia Node Connection
## Marco Gancitano

This notebook demonstrates a function built for searching Wikipedia's massive network of links. The `find_connection` function uses web crawling and breadth-first search to find the shortest chain of links from one Wikipedia page to another. 

Steps:
1. Load a Wikipedia URL and scrape all internal Wikipedia links.
2. Store the links in a queue (FIFO) along with their connection layer.  
    - Links found on the starting page are layer 1, links found on those pages are layer 2, and so on.
3. Pop the next URL from the queue.
4. Repeat steps 1 - 3 until the `end_url` is found or the maximum number of layers to search is reached. 
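
The steps above can be sketched as a breadth-first search over a link graph. This is a minimal illustration, not the code in `functions.py`: the names `find_path_bfs`, `get_links`, `max_layers`, and `toy_links` are hypothetical, and a small in-memory dictionary stands in for live Wikipedia pages so the example runs without network access.

```python
from collections import deque


def find_path_bfs(start, goal, get_links, max_layers=3):
    """Return the shortest list of pages from start to goal, or None.

    get_links is any callable that maps a page to the pages it links to
    (an assumption made here so the crawler can be swapped for a toy graph).
    """
    if start == goal:
        return [start]

    parent = {start: None}        # doubles as the visited set
    queue = deque([(start, 0)])   # FIFO of (page, layer)

    while queue:
        page, layer = queue.popleft()
        if layer >= max_layers:
            continue              # do not expand pages at the deepest layer

        for link in get_links(page):
            if link in parent:
                continue          # already queued or visited
            parent[link] = page
            if link == goal:
                # Walk the parent pointers back to the start page.
                path = [link]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append((link, layer + 1))

    return None


# Toy link graph standing in for scraped pages (hypothetical data).
toy_links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
    "E": ["F"],
    "F": [],
}

path = find_path_bfs("A", "F", lambda page: toy_links.get(page, []))
print(path)
```

Because the queue is first-in, first-out, every page in layer 1 is examined before any page in layer 2, so the first time the goal is reached the path uses the fewest possible links.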

<br>
NOTE: See `functions.py` for the actual code.  

Functions:  
`is_absolute`: Checks whether a link is absolute or relative to the website.  
`find_connection`: Master function that crawls Wikipedia looking for a connection between two pages.  
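
As a rough illustration of what an absolute-versus-relative check and the link-scraping step could look like, here is a sketch using only the standard library. It is not the implementation in `functions.py`: the body of `is_absolute` is a guess at its behavior, `internal_wiki_links` and `_LinkCollector` are hypothetical helpers, and skipping paths that contain `:` (such as `File:` pages) is an assumption about what counts as an internal article link.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


def is_absolute(url):
    """True if the URL carries both a scheme and a host (assumed behavior)."""
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_wiki_links(html, page_url):
    """Return absolute article URLs on the same host as page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc

    links = []
    for href in collector.hrefs:
        if href.startswith("#"):
            continue  # same-page anchor, not a new page
        # Resolve relative links against the page they were found on.
        full = href if is_absolute(href) else urljoin(page_url, href)
        full = urldefrag(full)[0]  # drop any #fragment
        parts = urlparse(full)
        is_article = parts.path.startswith("/wiki/") and ":" not in parts.path
        if parts.netloc == host and is_article and full not in links:
            links.append(full)
    return links


# Hand-written HTML fragment standing in for a downloaded page.
sample_html = """
<a href="/wiki/Chicago">Chicago</a>
<a href="https://en.wikipedia.org/wiki/Illinois">Illinois</a>
<a href="https://example.org/page">external</a>
<a href="/wiki/File:Photo.jpg">file</a>
<a href="#cite_note-1">[1]</a>
"""

links = internal_wiki_links(sample_html, "https://en.wikipedia.org/wiki/Al_Gore")
print(links)
```

Resolving relative links with `urljoin` before queuing them means every stored URL has the same absolute form, which keeps the visited-page check from treating `/wiki/Chicago` and its full URL as two different pages.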

In [11]:
import functions
import operator

import networkx as nx

In [2]:
base_url = 'https://en.wikipedia.org/wiki/Barrack_Obama'
end_url = 'https://en.wikipedia.org/wiki/Glaucoma'

G = functions.find_connection(base_url,end_url)

Layer: 0
Layer: 1
Connection Found!
Degrees to end site: 1
0: https://en.wikipedia.org/wiki/Barrack_Obama
1: https://en.wikipedia.org/wiki/Cocaine
2: https://en.wikipedia.org/wiki/Glaucoma


In [3]:
G.number_of_edges()

131025

In [5]:
G.number_of_nodes()

57444

In [9]:
nx.average_clustering(G)

0.27617546653639796

In [16]:
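# Compute a PageRank score for every page in the crawled graph
page_rank_dict = nx.pagerank(G)
# Sort pages by score, highest first
ordered_page_rank = sorted(page_rank_dict.items(), key=operator.itemgetter(1), reverse=True)
# Show ranks 2 through 10; the slice starts at index 1, so the top-ranked page is skipped
ordered_page_rank[1:10]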

[('https://en.wikipedia.org/wiki/Chicago', 0.011773186044081427),
 ('https://en.wikipedia.org/wiki/Marijuana', 0.008594407678305127),
 ('https://en.wikipedia.org/wiki/ISIL', 0.007781139309373355),
 ('https://en.wikipedia.org/wiki/Illinois', 0.0075487352334875045),
 ('https://en.wikipedia.org/wiki/St._Francis_of_Assisi',
  0.0073408666676589174),
 ('https://en.wikipedia.org/wiki/Washington,_D.C.', 0.007275465059430593),
 ('https://en.wikipedia.org/wiki/Iraq_War', 0.007141811181429644),
 ('https://en.wikipedia.org/wiki/Al_Gore', 0.007033667412814957),
 ('https://en.wikipedia.org/wiki/Hillary_Clinton', 0.006805282188664555)]