# Motivation

The scikit-learn package for Python will be one of the most important tools during the master's program. I therefore wanted an overview of which topics are covered on the documentation website and, in particular, which external resources it links to for learning more about the theoretical background. My approach is to start at the top level of the website and then walk down the tree formed by the website's directory structure. I limit the analysis to 3 levels below the start page, because going deeper would be too computationally expensive. According to a quick check, this should still cover the whole website. At the end I export two CSV files:
1. scikit_FullLinks.csv:
    - contains all data retrieved in the analysis, without any data cleansing

2. scikit_ExternalLinks.csv:
    - contains all external links,
    - the context in which each link was provided, and
    - the link to the scikit-learn page where the link was found
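
The idea of walking down the link tree and stopping 3 levels below the start page can be sketched without any network access. The snippet below is only an illustration under assumptions: `TOY_SITE` and `crawl` are hypothetical names, the site map is made up, and the `seen` set (which skips pages that were already visited) is an addition that the notebook itself does not use.

```python
from collections import deque

# Hypothetical in-memory site map: page -> links found on that page
TOY_SITE = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["a1.html", "index.html"],
    "b.html": [],
    "a1.html": ["a2.html"],
    "a2.html": ["a3.html"],
    "a3.html": [],
}

def crawl(start, get_links, max_depth=3):
    """Breadth-first walk that stops max_depth levels below start."""
    seen = {start}
    frontier = deque([(start, 0)])
    visited = []
    while frontier:
        page, depth = frontier.popleft()
        visited.append((page, depth))
        if depth == max_depth:
            continue  # do not follow links any deeper
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited

pages = crawl("index.html", lambda p: TOY_SITE.get(p, []), max_depth=3)
for page, depth in pages:
    print(depth, page)
```

In this toy map, `a3.html` sits 4 levels below the start page and is therefore never visited.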

### Load required packages

In [1]:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import queue
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

### Define the helper functions

In [2]:
def get_class(DRs):
    """Return the tag's CSS classes joined into one string, e.g. "referenceexternal"."""
    classes = DRs.get("class")
    if not classes:
        # The tag has no class attribute
        return "Not Labeled"
    # Join without a separator; later cells filter on "referenceexternal"
    return "".join(classes)

In [3]:
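def get_title(soup):
    """Return the text of the first <h1> in the page body, or "No Header"."""
    try:
        # [:-1] drops the last character of the heading text
        return soup.select("div.body h1")[0].get_text()[:-1]
    except IndexError:
        # The page has no <h1> inside div.body
        return "No Header"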

In [4]:
def website_type(webdir):
    """Classify a link as "scikit" (relative page path) or "foreign"."""
    if not isinstance(webdir, str):
        # Missing href (None)
        return "foreign"
    if webdir.startswith(("http", "..")) or webdir.endswith("pdf"):
        return "foreign"
    return "scikit"
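
As a possible alternative to slicing the string by hand, the same decision could be based on `urllib.parse.urlparse`. This is a hedged sketch, not the notebook's method: `classify_link` is a hypothetical name, and treating an empty link or any link with a scheme (for example `mailto:`) as foreign is an assumption that differs slightly from `website_type`.

```python
from urllib.parse import urlparse

def classify_link(href):
    """Label a link 'scikit' if it is a relative page path, else 'foreign'."""
    if not isinstance(href, str) or not href:
        return "foreign"  # missing or empty href (assumption: nothing to crawl)
    parts = urlparse(href)
    if parts.scheme or parts.netloc:
        return "foreign"  # absolute URL, mailto:, and similar
    if href.startswith("..") or href.lower().endswith(".pdf"):
        return "foreign"  # outside the crawled tree, or a PDF document
    return "scikit"

for example in ["modules/sgd.html", "https://en.wikipedia.org/wiki/Machine_learning",
                "../index.html", "paper.pdf", None]:
    print(example, "->", classify_link(example))
```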

In [5]:
def get_links_scikit(webdir):
    """Fetch one scikit-learn page and return one row per link found in its body."""
    if website_type(webdir) != "scikit":
        # Foreign links are recorded as a placeholder row and never fetched
        return pd.DataFrame({"Header": "No Header", "Location": webdir,
                             "Links": "No Link here", "Type": "Not Labeled"},
                            index=[0])
    URL = "http://scikit-learn.org/stable/"
    soup = BeautifulSoup(requests.get(URL + webdir).content, "lxml")
    header = get_title(soup)
    anchors = soup.select("div.body a")
    if not anchors:
        return pd.DataFrame({"Header": header, "Location": webdir,
                             "Links": "No Link here", "Type": "Not Labeled"},
                            index=[0])
    # DataFrame.append was removed in pandas 2.0: build one frame per link, then concatenate
    frames = [pd.DataFrame({"Header": header, "Location": webdir,
                            "Links": DRs.get("href"), "Type": get_class(DRs)},
                           index=[0])
              for DRs in anchors]
    return pd.concat(frames, ignore_index=True)

In [6]:
def simple_worker(i):
    """Process one item from queue q; i is only the task index supplied by executor.map."""
    item = q.get()
    if item == 'break':
        return None
    results = get_links_scikit(item)
    q.task_done()
    # Report progress at a few milestones and for the last ten tasks
    if (q.unfinished_tasks in [2000,1500,1000,500,10]) or (q.unfinished_tasks < 10):
        print("Task_done & amount of unfinished sub tasks: " + str(q.unfinished_tasks))
    return results

In [7]:
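def hard_worker(i):
    """Process one item from queue m and put every link found on that page into q."""
    item = m.get()
    if item == 'break':
        return None
    results = get_links_scikit(item)
    # get_links_scikit always returns at least one row, so this check is only a safeguard
    if not results.empty:
        for x in results["Links"]:
            q.put(x)
    m.task_done()
    # The printed number is the count of sub tasks waiting in q
    print("Task_done & amount of unfinished tasks: " + str(q.unfinished_tasks))
    return results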
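
Because each worker call handles exactly one queue item, the same level-by-level fan-out could also be written without explicit queues. The following is a minimal sketch under assumptions: `fake_fetch` is a hypothetical stand-in for `get_links_scikit` that needs no network, and `fetch_level` is a made-up helper name.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(page):
    """Stand-in for a real page download: returns the links 'found' on a page."""
    return [f"{page}/child{i}" for i in range(2)]

def fetch_level(pages, fetch, max_workers=4):
    """Fetch every page of one level in parallel and return the links of the next level."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fetch, pages))  # results keep the input order
    return [link for links in results for link in links]

level_1 = fetch_level(["root"], fake_fetch)
level_2 = fetch_level(level_1, fake_fetch)
print(len(level_1), len(level_2))
```

Passing the list of pages directly to `executor.map` lets the executor hand out the work, so no sentinel values or `join()` calls are needed in this variant.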

### Retrieving information from the website

In [8]:
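m = queue.Queue()  # pages linked from the documentation start page
q = queue.Queue()  # links found on those pages
initial_tasks = list(get_links_scikit("documentation.html")["Links"])
for y in initial_tasks:
    m.put(y)
with ThreadPoolExecutor(max_workers=4) as executor:
    # One hard_worker call per queued page; the results are collected in a later cell
    m_results = executor.map(hard_worker,range(m.unfinished_tasks))
    m.join()
    for i in range(4):
        m.put('break')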

Task_done & amount of unfinished tasks: 1
Task_done & amount of unfinished tasks: 2
Task_done & amount of unfinished tasks: 49
Task_done & amount of unfinished tasks: 90
Task_done & amount of unfinished tasks: 121
Task_done & amount of unfinished tasks: 209
Task_done & amount of unfinished tasks: 773
Task_done & amount of unfinished tasks: 823
Task_done & amount of unfinished tasks: 851
Task_done & amount of unfinished tasks: 852
Task_done & amount of unfinished tasks: 928
Task_done & amount of unfinished tasks: 1407
Task_done & amount of unfinished tasks: 2080


In [9]:
# DataFrame.append was removed in pandas 2.0, so concatenate all results at once
links_full = pd.concat(list(m_results), ignore_index=True)

In [10]:
with ThreadPoolExecutor(max_workers=25) as executor:
    # One simple_worker call per link queued during the first pass
    q_results = executor.map(simple_worker,range(q.unfinished_tasks))
    q.join()
    for i in range(25):
        q.put('break')

Task_done & amount of unfinished sub tasks: 2000
Task_done & amount of unfinished sub tasks: 1500
Task_done & amount of unfinished sub tasks: 1000
Task_done & amount of unfinished sub tasks: 500
Task_done & amount of unfinished sub tasks: 10
Task_done & amount of unfinished sub tasks: 9
Task_done & amount of unfinished sub tasks: 8
Task_done & amount of unfinished sub tasks: 7
Task_done & amount of unfinished sub tasks: 6
Task_done & amount of unfinished sub tasks: 5
Task_done & amount of unfinished sub tasks: 4
Task_done & amount of unfinished sub tasks: 3
Task_done & amount of unfinished sub tasks: 2
Task_done & amount of unfinished sub tasks: 1
Task_done & amount of unfinished sub tasks: 0


In [11]:
# Add the second-pass results to the first-pass results in one concatenation
links_full = pd.concat([links_full, *q_results], ignore_index=True)

In [12]:
len(links_full)

105466

### Data Manipulation

In [13]:
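# Extract the domain (network location) of every link
links_full["url"] = links_full.Links.apply(lambda x: urlparse(x).netloc)
# Absolute URL of the scikit-learn page on which each link was found
links_full["Scikit_Location"] = links_full.Location.apply(lambda x: str("http://scikit-learn.org/stable/"+str(x)))
# Keep only the external links (get_class joins the classes "reference" and "external")
to_explore = pd.DataFrame(links_full[links_full.Type == "referenceexternal"])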
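
What the `url` column holds can be seen on a few sample links. This is a small illustration with assumptions: `domain_of` is a hypothetical helper, and returning an empty string for a missing link is a choice made here, not something the notebook does explicitly.

```python
from urllib.parse import urlparse

def domain_of(link):
    """Return the network location of a link, or '' if the link is missing."""
    return urlparse(link).netloc if isinstance(link, str) else ""

links = ["https://en.wikipedia.org/wiki/Machine_learning", "modules/sgd.html", None]
domains = [domain_of(link) for link in links]
print(domains)
```

Relative links such as `modules/sgd.html` have no network location, so only absolute URLs produce a non-empty domain.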

In [14]:
# Example 1: Find all YouTube links on the scikit-learn website
to_explore[to_explore['url'].str.contains('youtube', na=False)]

Unnamed: 0,Header,Location,Links,Type,url,Scikit_Location
1997,"External Resources, Videos and Talks",presentations.html,https://www.youtube.com/watch?v=Zd5dfooZWG4,referenceexternal,www.youtube.com,http://scikit-learn.org/stable/presentations.html
2001,"External Resources, Videos and Talks",presentations.html,https://www.youtube.com/watch?v=cHZONQ2-x7I,referenceexternal,www.youtube.com,http://scikit-learn.org/stable/presentations.html
11651,Who is using scikit-learn?,testimonials/testimonials.html,https://www.youtube.com/watch?v=Jm-eBD9xR3w,referenceexternal,www.youtube.com,http://scikit-learn.org/stable/testimonials/te...


In [15]:
# Example 2: Find all Wikipedia links on the scikit-learn website (first 5 rows)
(to_explore[to_explore['url'].str.contains('wikipedia', na=False)])[0:5]

Unnamed: 0,Header,Location,Links,Type,url,Scikit_Location
48,An introduction to machine learning with sciki...,tutorial/basic/tutorial.html,https://en.wikipedia.org/wiki/Machine_learning,referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/tutorial/basic/...
50,An introduction to machine learning with sciki...,tutorial/basic/tutorial.html,https://en.wikipedia.org/wiki/Sample_(statistics),referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/tutorial/basic/...
51,An introduction to machine learning with sciki...,tutorial/basic/tutorial.html,https://en.wikipedia.org/wiki/Multivariate_ran...,referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/tutorial/basic/...
52,An introduction to machine learning with sciki...,tutorial/basic/tutorial.html,https://en.wikipedia.org/wiki/Supervised_learning,referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/tutorial/basic/...
54,An introduction to machine learning with sciki...,tutorial/basic/tutorial.html,https://en.wikipedia.org/wiki/Classification_i...,referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/tutorial/basic/...


In [16]:
# Example 3: Find all external links for the topic Stochastic Gradient Descent (first 5 rows)
# regex=False matches the dots in the header literally
(to_explore[to_explore['Header'].str.contains('1.5. Stochastic Gradient Descent', na=False, regex=False)])[0:5]

Unnamed: 0,Header,Location,Links,Type,url,Scikit_Location
22739,1.5. Stochastic Gradient Descent,modules/sgd.html,https://en.wikipedia.org/wiki/Support_vector_m...,referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/modules/sgd.html
22740,1.5. Stochastic Gradient Descent,modules/sgd.html,https://en.wikipedia.org/wiki/Logistic_regression,referenceexternal,en.wikipedia.org,http://scikit-learn.org/stable/modules/sgd.html
22768,1.5. Stochastic Gradient Descent,modules/sgd.html,https://docs.scipy.org/doc/scipy/reference/spa...,referenceexternal,docs.scipy.org,http://scikit-learn.org/stable/modules/sgd.html
22769,1.5. Stochastic Gradient Descent,modules/sgd.html,http://docs.scipy.org/doc/scipy/reference/gene...,referenceexternal,docs.scipy.org,http://scikit-learn.org/stable/modules/sgd.html
22776,1.5. Stochastic Gradient Descent,modules/sgd.html,http://yann.lecun.com/exdb/publis/pdf/lecun-98...,referenceexternal,yann.lecun.com,http://scikit-learn.org/stable/modules/sgd.html


In [17]:
# Keep one row per unique (Header, Links) pair and re-attach the source page via the index
export = pd.merge(to_explore[['Header','Links']].drop_duplicates(), to_explore[["Scikit_Location"]], 
                  left_index=True, right_index=True, sort=True)
# Exclude selected pages (glossary, release notes, project pages)
export = pd.DataFrame(export[~export["Header"].isin(["Glossary of Common Terms and API Elements",
                                              "Release History","Version 0.19.2","Version 0.18.2",
                                              "About us","Contributing"])])

### Saving data

In [18]:
links_full.to_csv("scikit_FullLinks.csv")
export.to_csv("scikit_ExternalLinks.csv")