# Widebot 
## Task 1 - Getting to Philosophy

Please write a Python script to check the "Getting to Philosophy" law.
https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy

Clicking on the first link in the main body of a Wikipedia article, and repeating the process for each subsequent article, usually leads to the Philosophy article.

The program should receive a Wikipedia link as input, follow the first normal link on that page, and repeat this process until the Philosophy page is reached, an article without any outgoing wikilinks is reached, or the search is stuck in a loop.

A "normal link" is a link in the main body of the article that is not in a box, is blue (red links point to non-existing articles), and is not in parentheses, not italic and not a footnote. You don't have to check style tables or other fancy things; it is enough that the script works with the current Wikipedia style (for example, you can use the 'class' attribute in Wikipedia tags). For easy validation, please print all visited links to the standard output.
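
The "not in parentheses" rule can be approximated by removing parenthesized text from a paragraph's HTML before looking for links. The sketch below is one possible approach, not part of the original script: the function name `strip_parenthesized` is hypothetical, and it assumes that attribute values never contain a literal `>` character. Parentheses inside tags are left alone, so hrefs such as `/wiki/Norm_(social)` survive.

```python
def strip_parenthesized(html: str) -> str:
    """Remove visible text (and tags) enclosed in parentheses.

    Parentheses that appear inside a tag, for example in an href,
    are ignored so that article URLs are not damaged.
    """
    out = []
    depth = 0        # nesting depth of parentheses in visible text
    in_tag = False   # True while scanning between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
            if depth == 0:
                out.append(ch)
            continue
        if ch == "<":
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)
```

The cleaned string could then be parsed again with BeautifulSoup so that only links outside parentheses remain. A tag that opens inside parentheses and closes outside them would leave unbalanced markup, which a tolerant parser usually copes with.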

Use a 0.5 second delay between queries to avoid heavy load on Wikipedia (the `sleep` function from the `time` module).
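
A fixed `time.sleep(0.5)` after each request satisfies this requirement. As an alternative sketch, the delay can be enforced relative to the previous request, so time already spent parsing counts towards the pause. The `Throttle` class below is a hypothetical helper and is not part of the original script.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval: float = 0.5):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            # Sleep only for the part of the interval that has not elapsed yet
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `wait()` immediately before each request keeps consecutive requests at least `min_interval` seconds apart; the first call returns without sleeping.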

You can use https://en.wikipedia.org/wiki/Special:Random to check this hypothesis at home.


In [1]:
from bs4 import BeautifulSoup
import urllib.parse  # urljoin lives in the urllib.parse submodule
import time
import requests

## Getting to Philosophy will stop when the script:
1. reaches the Philosophy article
2. has visited 30 articles without reaching the Philosophy article
3. lands on an article with no links
4. gets stuck in a loop (as with the Mathematics article)

In [2]:
# Starting point: a random article
start_url = "https://en.wikipedia.org/wiki/Special:Random"
# Target article
target_url = "https://en.wikipedia.org/wiki/Philosophy"
# URLs visited so far, in order
visited_urls = [start_url]

In [3]:
def find_first_link(url):
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")

    # This div holds the body of the article
    content_text = soup.find(id="mw-content-text")
    content_div = content_text.find(class_="mw-parser-output") if content_text else None
    if content_div is None:
        return None

    # Stays None if the article contains no links
    article_link = None

    # Only direct-child paragraphs are searched, which skips boxes and tables
    for element in content_div.find_all("p", recursive=False):
        # Only direct-child anchors are considered, which skips links nested
        # in other tags such as <i> (italics) or <sup> (footnotes)
        anchor = element.find("a", recursive=False)
        if anchor and anchor.get('href'):
            article_link = anchor.get('href')
            break

    if not article_link:
        return None

    # Build a full url 
    first_link = urllib.parse.urljoin(
        'https://en.wikipedia.org/', article_link)

    return first_link
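
The first anchor in a paragraph is not always an article link: it may point to a section fragment, a non-existing (red) article, a namespaced page such as `Help:` or `File:`, or another site. The sketch below shows one way to filter and normalize an `href` before following it. The name `normalize_article_url` is hypothetical, and rejecting every title that contains a colon is a simplifying assumption that also discards the few real articles with a colon in their title.

```python
from urllib.parse import urljoin, urldefrag, urlparse

BASE_URL = "https://en.wikipedia.org/"


def normalize_article_url(href):
    """Return an absolute article URL without fragment, or None if unsuitable."""
    if not href:
        return None
    # Make the link absolute and drop any "#section" fragment
    absolute, _fragment = urldefrag(urljoin(BASE_URL, href))
    parsed = urlparse(absolute)
    # Keep only links to /wiki/ pages on English Wikipedia
    if parsed.netloc != "en.wikipedia.org" or not parsed.path.startswith("/wiki/"):
        return None
    title = parsed.path[len("/wiki/"):]
    # Assumption: a colon marks a namespace such as Help: or File:
    if not title or ":" in title:
        return None
    return absolute
```

Dropping the fragment also helps loop detection, since `/wiki/Language` and `/wiki/Language#History` then compare as the same visited URL.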

In [4]:
def continue_scraping(scraping_history, target_url, max_steps=30):
    # Stop when the Philosophy article is reached
    if scraping_history[-1] == target_url:
        print("Target ('Philosophy') article reached!")
        return False
    # Stop after the maximum number of steps
    elif len(scraping_history) > max_steps:
        print(f"Maximum ({max_steps}) searches reached, interrupted.")
        return False
    # Stop when the latest URL has already been visited
    elif scraping_history[-1] in scraping_history[:-1]:
        print("We are in a loop, interrupted.")
        return False
    else:
        return True

In [5]:
while continue_scraping(visited_urls, target_url):
    # Print the article currently being visited
    print(visited_urls[-1])

    first_link = find_first_link(visited_urls[-1])
    # Stop when the article has no outgoing links
    if not first_link:
        print("Arrived at an article with no links, search aborted.")
        break

    visited_urls.append(first_link)

    time.sleep(0.5)  # Slow things down so as to not overload Wikipedia's servers

# Reset the history so that re-running this cell starts again from start_url
visited_urls = [start_url]

https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Finnish_language
https://en.wikipedia.org/wiki/Finnic_language
https://en.wikipedia.org/wiki/Uralic_language_family
https://en.wikipedia.org/wiki/Language_family
https://en.wikipedia.org/wiki/Language
https://en.wikipedia.org/wiki/Linguistic_system
https://en.wikipedia.org/wiki/Ferdinand_de_Saussure
https://en.wikipedia.org/wiki/Switzerland
https://en.wikipedia.org/wiki/Sovereign_state
https://en.wikipedia.org/wiki/International_law
https://en.wikipedia.org/wiki/Nation
https://en.wikipedia.org/wiki/Community
https://en.wikipedia.org/wiki/Norm_(social)
https://en.wikipedia.org/wiki/Sociology
https://en.wikipedia.org/wiki/Society
https://en.wikipedia.org/wiki/Social_group
https://en.wikipedia.org/wiki/Social_science
https://en.wikipedia.org/wiki/Discipline_(academia)
https://en.wikipedia.org/wiki/Knowledge
https://en.wikipedia.org/wiki/Fact
https://en.wikipedia.org/wiki/Reality
https://en.wikipedia.org/wiki/Object

### Mohamed Adel Mohamed 
### Mohamedadelmohamed@gmail.com