Author: Suvansh Vaid

Dated: 21/09/2020

# Project idea

We often find ourselves trying to learn something new and ending up on a Wikipedia page. Although Wikipedia does a pretty good job of describing a topic in detail, sometimes we just want an overview of that topic and some important topics related to it.

This project does exactly that: it scrapes the Wikipedia page of the topic you want to learn about and gives you a list of the top 20 related topics, each with a short introductory paragraph (just for demonstration purposes).

__*NOTE*__: This notebook is just a demonstration of how we can achieve the project goals using web scraping and not an actual implementation. 

<img src="wiki.png">

# What is Web Scraping?
Web scraping is an exciting practice of using libraries to sift through a web page and gather the data that you need in a format most useful to you while at the same time preserving the structure of the data. 

There are several ways to extract information from the web. Using an API is probably the best way to extract data from a website. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. However, not all websites provide an API, so sometimes we need to scrape the HTML of the website to fetch the information.
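
To illustrate the API route, Wikipedia itself exposes the MediaWiki Action API. The sketch below only builds a request URL for the plain-text introduction of an article and does not send it. The helper name `build_intro_url` is made up for this example, and the query parameters (`prop=extracts`, `exintro`, `explaintext`) assume the TextExtracts extension is available on the target wiki. It is not the approach used in the rest of this notebook.

```python
from urllib.parse import urlencode

# Endpoint of the MediaWiki Action API on English Wikipedia
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_intro_url(title):
    """Build (but do not send) a query URL for an article's intro text.

    Hypothetical helper for illustration; the parameters assume the
    TextExtracts extension is enabled on the wiki.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the section before the first heading
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


print(build_intro_url("Data science"))
```

The resulting URL could then be opened with `urlopen()` and the response decoded as JSON, which avoids parsing HTML altogether.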

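Here, we look at the following Python libraries for scraping web data. Note that urllib is part of the standard library, while the other two are third-party packages; the code in this notebook relies on urllib and BeautifulSoup.
* urllib
* beautifulsoup
* requests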

In [60]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pandas as pd
from IPython.display import HTML

# Getting started

## Extracting a list of links on the Wikipedia page.

Instead of retrieving all the links existing in a Wikipedia article, we are interested in extracting links that point to other article pages. If we look at the source code of the following page 
```
https://en.wikipedia.org/wiki/Data_science
```
in a browser, we will find that all these links have 3 things in common:
* They are in the *div* with *id* set to *bodyContent*
* The URLs do not contain colons
* The URLs begin with */wiki/*

We can use these rules to construct our search through the HTML page. 
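
As a quick check of the last two rules, the sketch below applies the same regular expression used later in this notebook to a handful of made-up `href` values (the sample list is an assumption for illustration, not taken from the page). The first rule, the *bodyContent* div, needs a parsed page and is not shown here.

```python
import re

# Starts with /wiki/ and contains no colon anywhere after that prefix
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

# Hypothetical href values of the kinds found on a Wikipedia page
sample_hrefs = [
    "/wiki/Machine_learning",                     # article link: kept
    "/wiki/Category:Data_science",                # contains a colon: dropped
    "/wiki/File:Example.png",                     # contains a colon: dropped
    "https://example.org/wiki/Machine_learning",  # does not start with /wiki/
    "#cite_note-1",                               # in-page anchor: dropped
]

article_links = [href for href in sample_hrefs if ARTICLE_LINK.search(href)]
print(article_links)
```

The negative lookahead `(?!:)` is checked before every character, so a single colon anywhere after `/wiki/` is enough to reject the link.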

Firstly, we use the urlopen() function to open the wikipedia page for "Data Science".

In [2]:
html = urlopen("https://en.wikipedia.org/wiki/Data_science")

Then, we find and print all the links in the page. In order to do this, we need to
* find the *div* whose *id = "bodyContent"*
* find all the link tags whose href starts with "/wiki/" and does not contain ":".

In [3]:
bsobj = BeautifulSoup(html, "lxml")


for link in bsobj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Information_science
/wiki/Machine_learning
/wiki/Data_mining
/wiki/Statistical_classification
/wiki/Cluster_analysis
/wiki/Regression_analysis
/wiki/Anomaly_detection
/wiki/Automated_machine_learning
/wiki/Association_rule_learning
/wiki/Reinforcement_learning
/wiki/Structured_prediction
/wiki/Feature_engineering
/wiki/Feature_learning
/wiki/Online_machine_learning
/wiki/Semi-supervised_learning
/wiki/Unsupervised_learning
/wiki/Learning_to_rank
/wiki/Grammar_induction
/wiki/Supervised_learning
/wiki/Statistical_classification
/wiki/Regression_analysis
/wiki/Decision_tree_learning
/wiki/Ensemble_learning
/wiki/Bootstrap_aggregating
/wiki/Boosting_(machine_learning)
/wiki/Random_forest
/wiki/K-nearest_neighbors_algorithm
/wiki/Linear_regression
/wiki/Naive_Bayes_classifier
/wiki/Artificial_neural_network
/wiki/Logistic_regression
/wiki/Perceptron
/wiki/Relevance_vector_machine
/wiki/Support-vector_machine
/wiki/Cluster_analysis
/wiki/BIRCH
/wiki/CURE_data_clustering_algorithm
/wik

## Extracting information about each of the link topics

This is where the real fun begins!

Assume that we want to find all the keyword links in the Wikipedia page of "Data Science". We first store all the links from our target page into a list called `links`. 

In [4]:
def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Data_Science")
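
The function above builds the full address by concatenating the site root and the relative link. A slightly more defensive way to do the same thing, sketched here with the standard library's `urljoin`, also copes with links that are already absolute. The names `BASE_URL` and `absolute_url` are made up for this example.

```python
from urllib.parse import urljoin

# Site root used to resolve relative article links (illustrative constant)
BASE_URL = "https://en.wikipedia.org"


def absolute_url(href):
    """Resolve a relative href such as /wiki/Data_science against BASE_URL."""
    return urljoin(BASE_URL, href)


print(absolute_url("/wiki/Information_science"))
```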

In [5]:
print(links[0])

<a href="/wiki/Information_science" title="Information science">information science</a>


In [6]:
len(links)

226
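
The printed list earlier shows `/wiki/Statistical_classification` more than once, so the 226 links include repeats. If only distinct topics are wanted, one option is to drop duplicates while keeping the order in which links first appear. This is a minimal sketch; `unique_in_order` is a made-up helper name and the sample list is abbreviated from the output above.

```python
def unique_in_order(hrefs):
    """Return hrefs without duplicates, keeping first-seen order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(hrefs))


sample = [
    "/wiki/Statistical_classification",
    "/wiki/Cluster_analysis",
    "/wiki/Regression_analysis",
    "/wiki/Statistical_classification",
    "/wiki/Regression_analysis",
]

print(unique_in_order(sample))

# With the notebook's list of <a> tags this could be applied as:
# unique_in_order(a.attrs['href'] for a in links)
```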

Now, since the list of links is too long for us, we extract the first paragraph of the Wikipedia page behind each of the first 20 links.

In [16]:
topic_info = {'topic' : [], 'para': []}

In [17]:
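for index in range(20):
    # Relative URL of the linked article, e.g. /wiki/Information_science
    newArticle = links[index].attrs['href']

    # Download and parse the linked article
    html = urlopen("http://en.wikipedia.org" + newArticle)
    bsObj = BeautifulSoup(html, "lxml")

    # Store the link and the text of the first <p> tag on the page
    topic_info['topic'].append(newArticle)
    topic_info['para'].append(bsObj.find('p').text)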
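
Note that `bsObj.find('p')` returns the first `<p>` tag, whatever it contains. On pages where that first tag happens to be empty (an assumption about page layout, not something shown in this notebook), the stored paragraph would be blank. A minimal sketch of a guard is to take the first paragraph text that is not just whitespace; `first_non_empty` is a made-up helper name.

```python
def first_non_empty(texts):
    """Return the first string that contains non-whitespace text, else ''."""
    for text in texts:
        if text.strip():
            return text
    return ""


# Made-up paragraph texts standing in for a parsed page
paragraphs = ["\n", "   ", "Machine learning (ML) is the study of ...\n"]
print(first_non_empty(paragraphs))

# With BeautifulSoup this could be used inside the loop as:
# first_non_empty(p.get_text() for p in bsObj.find_all("p"))
```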

In [18]:
topic_info.keys()

dict_keys(['topic', 'para'])

In [42]:
topic_df = pd.DataFrame(topic_info)

In [43]:
topic_df

| | topic | para |
|---|---|---|
| 0 | /wiki/Information_science | Information science (also known as information... |
| 1 | /wiki/Machine_learning | Machine learning (ML) is the study of computer... |
| 2 | /wiki/Data_mining | Data mining is a process of discovering patter... |
| 3 | /wiki/Statistical_classification | In statistics, classification is the problem o... |
| 4 | /wiki/Cluster_analysis | Cluster analysis or clustering is the task of ... |
| 5 | /wiki/Regression_analysis | In statistical modeling, regression analysis i... |
| 6 | /wiki/Anomaly_detection | In data analysis, anomaly detection (also outl... |
| 7 | /wiki/Automated_machine_learning | Automated machine learning (AutoML) is the pro... |
| 8 | /wiki/Association_rule_learning | Association rule learning is a rule-based mach... |
| 9 | /wiki/Reinforcement_learning | Reinforcement learning (RL) is an area of mach... |


## Refining the topic and paragraph content

Next, we remove the unwanted `/wiki/` prefix from each topic using a regular expression.

In [44]:
topic_df['topic'] = topic_df['topic'].apply(lambda x : re.findall(r'/wiki/(.*)', x)[0])

In [45]:
topic_df

| | topic | para |
|---|---|---|
| 0 | Information_science | Information science (also known as information... |
| 1 | Machine_learning | Machine learning (ML) is the study of computer... |
| 2 | Data_mining | Data mining is a process of discovering patter... |
| 3 | Statistical_classification | In statistics, classification is the problem o... |
| 4 | Cluster_analysis | Cluster analysis or clustering is the task of ... |
| 5 | Regression_analysis | In statistical modeling, regression analysis i... |
| 6 | Anomaly_detection | In data analysis, anomaly detection (also outl... |
| 7 | Automated_machine_learning | Automated machine learning (AutoML) is the pro... |
| 8 | Association_rule_learning | Association rule learning is a rule-based mach... |
| 9 | Reinforcement_learning | Reinforcement learning (RL) is an area of mach... |


Then, we replace the `_` in the topic names with spaces.

In [46]:
topic_df['topic'] = topic_df['topic'].apply(lambda x: x.replace('_', ' '))

In [47]:
topic_df

| | topic | para |
|---|---|---|
| 0 | Information science | Information science (also known as information... |
| 1 | Machine learning | Machine learning (ML) is the study of computer... |
| 2 | Data mining | Data mining is a process of discovering patter... |
| 3 | Statistical classification | In statistics, classification is the problem o... |
| 4 | Cluster analysis | Cluster analysis or clustering is the task of ... |
| 5 | Regression analysis | In statistical modeling, regression analysis i... |
| 6 | Anomaly detection | In data analysis, anomaly detection (also outl... |
| 7 | Automated machine learning | Automated machine learning (AutoML) is the pro... |
| 8 | Association rule learning | Association rule learning is a rule-based mach... |
| 9 | Reinforcement learning | Reinforcement learning (RL) is an area of mach... |
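
The two clean-up steps for the topic column can also be sketched as one plain function, which is easy to test on its own before applying it to the DataFrame. The helper name `href_to_title` is made up for this example, and the percent-decoding step is an extra assumption: it only matters for links whose titles contain encoded characters, which do not appear in the rows shown here.

```python
from urllib.parse import unquote


def href_to_title(href):
    """Turn a relative link such as /wiki/Data_mining into 'Data mining'."""
    path = unquote(href)               # decode %XX escapes, if any
    if path.startswith("/wiki/"):
        path = path[len("/wiki/"):]    # drop the /wiki/ prefix
    return path.replace("_", " ")      # underscores become spaces


print(href_to_title("/wiki/Boosting_(machine_learning)"))

# Applied to the original link column this could look like:
# topic_df['topic'] = topic_df['topic'].apply(href_to_title)
```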


In [48]:
topic_df['para'][0]

'Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.[1] Practitioners within and outside the field study the  application and the usage of knowledge in organizations along with the interaction between people, organizations, and any existing information systems with the aim of creating, replacing, improving, or understanding information systems. Historically, information science is associated with computer science, psychology, technology and intelligence agencies.[2] However, information science also incorporates aspects of diverse fields such as archival science, cognitive science, commerce, law, linguistics, museology, management, mathematics, philosophy, public policy, and social sciences.\n'

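Next, we remove the bracketed citation markers such as `[1]` from the paragraphs as well. Note that the pattern used here is greedy: it matches from the first `[` to the last `]` in a paragraph, so any text between two markers is removed along with them, which is why the output that follows is shorter than the original paragraph.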

In [51]:
topic_df['para'] = topic_df['para'].apply(lambda x : re.sub(r'\[.*\]', '', x))

In [52]:
topic_df['para'][0]

'Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information. However, information science also incorporates aspects of diverse fields such as archival science, cognitive science, commerce, law, linguistics, museology, management, mathematics, philosophy, public policy, and social sciences.\n'
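
Comparing this output with the previous one, the sentences that sat between `[1]` and `[2]` are gone. That happens because `.*` is greedy and spans from the first `[` to the last `]`. The sketch below shows the difference on a shortened, made-up sample string, using a narrower pattern that matches only numeric markers. The pattern `\[\d+\]` is a suggested alternative, not what the notebook ran, so the outputs shown in this notebook still reflect the greedy version.

```python
import re

# Shortened stand-in for a scraped paragraph with two citation markers
text = ("protection of information.[1] Practitioners study its use.[2] "
        "However, it spans many fields.")

# Greedy: removes everything from the first '[' to the last ']'
greedy = re.sub(r"\[.*\]", "", text)

# Narrow: removes only markers made of digits, such as [1] or [23]
digits_only = re.sub(r"\[\d+\]", "", text)

print(greedy)
print(digits_only)
```

With the narrower pattern, the middle sentence survives and only the markers themselves are stripped.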

Finally, removing the `\n` from the paragraphs. 

In [54]:
topic_df['para'] = topic_df['para'].apply(lambda x : re.sub('\\n', '', x))

## Presenting the information to the user!

In [59]:
HTML(topic_df.to_html())

| | topic | para |
|---|---|---|
| 0 | Information science | Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information. However, information science also incorporates aspects of diverse fields such as archival science, cognitive science, commerce, law, linguistics, museology, management, mathematics, philosophy, public policy, and social sciences. |
| 1 | Machine learning | Machine learning (ML) is the study of computer algorithms that improve automatically through experience. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. |
| 2 | Data mining | Data mining is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. |
| 3 | Statistical classification | In statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. |
| 4 | Cluster analysis | Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. |
| 5 | Regression analysis | In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features'). The most common form of regression analysis is linear regression, in which a researcher finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared differences between the true data and that line (or hyperplane). For specific mathematical reasons (see linear regression), this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters (e.g., quantile regression or Necessary Condition Analysis) or estimate the conditional expectation across a broader collection of non-linear models (e.g., nonparametric regression). |
| 6 | Anomaly detection | In data analysis, anomaly detection (also outlier detection) |
| 7 | Automated machine learning | Automated machine learning (AutoML) is the process of automating the process of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model. AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. The high degree of automation in AutoML allows non-experts to make use of machine learning models and techniques without requiring becoming an expert in the field first. |
| 8 | Association rule learning | Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. |
| 9 | Reinforcement learning | Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. |
