## Web Scraping Exercises

### Problem 1

From the following Wikipedia page, extract the hyperlinked words and their hyperlinks, and arrange them in a table.
https://en.wikipedia.org/wiki/Johann_Sebastian_Bach

### Libraries

In [1]:
#Required libraries for web scraping
from urllib.request import urlopen  #to fetch the web page
from bs4 import BeautifulSoup       #to parse the HTML content

In [2]:
bach_html = urlopen('https://en.wikipedia.org/wiki/Johann_Sebastian_Bach')  #Fetch the complete HTML of the web page
bach_soup = BeautifulSoup(bach_html, 'html.parser')                         #Parse the HTML into a navigable tree of Python objects (more in the notes)
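
Some servers, Wikipedia possibly among them, may reject or throttle requests that carry the default `Python-urllib` User-Agent. If `urlopen` fails with an HTTP error, one option is to send a descriptive User-Agent through `urllib.request.Request`. The sketch below is only an illustration: the helper names `build_request` and `fetch_html` and the User-Agent string are made up for this example and are not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier; replace it with something that describes your script
USER_AGENT = 'bach-link-exercise/0.1 (learning project)'

def build_request(url, user_agent=USER_AGENT):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})

def fetch_html(url, timeout=30):
    """Download the page and return its raw bytes (needs network access)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Usage (not run here because it needs network access):
# bach_html = fetch_html('https://en.wikipedia.org/wiki/Johann_Sebastian_Bach')
# bach_soup = BeautifulSoup(bach_html, 'html.parser')
```

`BeautifulSoup` accepts the returned bytes directly, so the rest of the notebook would not need to change.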

In [4]:
print(bach_soup.prettify())   #Print the page's HTML in an indented, more readable form

### A Basic Attempt

In [25]:
#Note: The tag <a> in HTML defines hyperlinks
#From the BS object we will find all the lines with the tag 'a'

hlink = bach_soup.find_all('a')

#Using a counter to break out of the loop early
#(Not required; used only for the demo)
count = 0
for obj in hlink:
    count += 1
    if count == 20:
        break                                                #1
    try:                                                     #2
        if obj.get('title') is not None:                     #3
            print(obj.get('title'), '-', obj.get('href'))    #4
    except Exception:                                        #5
        pass


#Comments:

#1. The break is only for testing: we stop when the counter reaches 20, so at most 19 tags are processed. It will be removed in the final code.
#2. Try to read the title. The try block guards against unexpected errors on unusual tags.
#3. obj.get returns None when a tag has no title attribute, so such tags are skipped. (The titles are the hyperlinked words here.)
#4. Print the title and its hyperlink, separated by a hyphen.
#5. If an error does occur, skip this tag and move on to the next one.


This article is semi-protected. - /wiki/Wikipedia:Protection_policy#semi
Johann Sebastian Bach (painter) - /wiki/Johann_Sebastian_Bach_(painter)
Bach (disambiguation) - /wiki/Bach_(disambiguation)
Elias Gottlob Haussmann - /wiki/Elias_Gottlob_Haussmann
Old Style and New Style dates - /wiki/Old_Style_and_New_Style_dates
Old Style and New Style dates - /wiki/Old_Style_and_New_Style_dates
Eisenach - /wiki/Eisenach
Leipzig - /wiki/Leipzig
List of compositions by Johann Sebastian Bach - /wiki/List_of_compositions_by_Johann_Sebastian_Bach
Old Style and New Style dates - /wiki/Old_Style_and_New_Style_dates
Baroque music - /wiki/Baroque_music
Brandenburg Concertos - /wiki/Brandenburg_Concertos
Goldberg Variations - /wiki/Goldberg_Variations
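
`Tag.get` returns `None` when an attribute is missing instead of raising, so the same loop can be written without the counter and the `try`/`except`. Here is a minimal sketch on a small inline HTML snippet, using `itertools.islice` to cap the number of tags inspected. The snippet and the function name `first_titled_links` are assumptions for illustration only.

```python
from itertools import islice
from bs4 import BeautifulSoup

# Small stand-in for the Wikipedia page so the example runs offline
sample_html = """
<p>
  <a href="/wiki/Eisenach" title="Eisenach">Eisenach</a>
  <a href="#cite_note-1">[1]</a>
  <a href="/wiki/Leipzig" title="Leipzig">Leipzig</a>
</p>
"""

def first_titled_links(soup, limit):
    """Return (title, href) pairs from the first `limit` <a> tags that have a title."""
    pairs = []
    for tag in islice(soup.find_all('a'), limit):
        title = tag.get('title')   # None when the attribute is absent
        if title is not None:
            pairs.append((title, tag.get('href')))
    return pairs

sample_soup = BeautifulSoup(sample_html, 'html.parser')
for title, href in first_titled_links(sample_soup, limit=19):
    print(title, '-', href)
```

The footnote link has no `title`, so it is skipped without any exception handling.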


### An Improvement

In [15]:
#This repeats the previous code with one improvement.
#Note: the links obtained above are relative (incomplete) hyperlinks, which are not usable on their own.
#They are all Wikipedia links, so the complete URLs start with 'https://en.wikipedia.org'.
#So for each link, let's concatenate this string to the left of the href.

hlink = bach_soup.find_all('a')

count = 0
for obj in hlink:
    count += 1
    if count == 20:
        break
    try:
        if obj.get('title') is not None:
            print(obj.get('title'), '-', 'https://en.wikipedia.org'+obj.get('href'))
    except:
        pass
    
#Now we should get the complete hyperlinks...

This article is semi-protected. - https://en.wikipedia.org/wiki/Wikipedia:Protection_policy#semi
Johann Sebastian Bach (painter) - https://en.wikipedia.org/wiki/Johann_Sebastian_Bach_(painter)
Bach (disambiguation) - https://en.wikipedia.org/wiki/Bach_(disambiguation)
Elias Gottlob Haussmann - https://en.wikipedia.org/wiki/Elias_Gottlob_Haussmann
Old Style and New Style dates - https://en.wikipedia.org/wiki/Old_Style_and_New_Style_dates
Old Style and New Style dates - https://en.wikipedia.org/wiki/Old_Style_and_New_Style_dates
Eisenach - https://en.wikipedia.org/wiki/Eisenach
Leipzig - https://en.wikipedia.org/wiki/Leipzig
List of compositions by Johann Sebastian Bach - https://en.wikipedia.org/wiki/List_of_compositions_by_Johann_Sebastian_Bach
Old Style and New Style dates - https://en.wikipedia.org/wiki/Old_Style_and_New_Style_dates
Baroque music - https://en.wikipedia.org/wiki/Baroque_music
Brandenburg Concertos - https://en.wikipedia.org/wiki/Brandenburg_Concertos
Goldberg Variatio
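
Prepending `'https://en.wikipedia.org'` works for hrefs that start with `/wiki/`, but a page can also contain other kinds of href, and plain concatenation would mangle those (for example, an already complete URL would end up with the prefix glued in front of it). The standard-library function `urllib.parse.urljoin` resolves each href against the page URL instead. The sketch below uses a hand-written list of href values as an illustration; they were not scraped from the page.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Johann_Sebastian_Bach'

# Illustrative href values of the kinds that can appear in a page
hrefs = [
    '/wiki/Leipzig',                                                # site-relative link
    '#cite_note-1',                                                 # anchor within the page
    '//commons.wikimedia.org/wiki/Category:Johann_Sebastian_Bach',  # protocol-relative link
    'https://www.bach-digital.de/',                                 # already complete
]

# urljoin completes relative links and leaves absolute ones unchanged
complete = [urljoin(PAGE_URL, href) for href in hrefs]
for url in complete:
    print(url)
```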

### Improvement over the Basic Attempt (Getting a Data Frame Instead)

In [35]:
#ARRANGING IN DATA FRAME

hlink = bach_soup.find_all('a')

#Creating two empty lists to store the hyperlinked words (titles) and the hyperlinks (href)
title = []                      
links = []

count = 0
for obj in hlink:
    count += 1
    if count == 20:
        break
    try:
        if obj.get('title') is not None:
            title.append(obj.get('title'))                             #Appending the title in the title list
            links.append('https://en.wikipedia.org'+obj.get('href'))   #Appending the hyperlinks in the links list
    except:
        pass

In [36]:
#This step is for creating the data frame in the next step
#Create a dictionary with the keys as the variable names and the values as the created lists
dic = {'Title':title, 'Link':links}               


In [40]:
#Creating the Pandas data frame using the dictionary
import pandas as pd
bach_wiki_links = pd.DataFrame(dic)

#Re-arranging
bach_wiki_links = bach_wiki_links[['Title','Link']]
bach_wiki_links

| | Title | Link |
|---|---|---|
| 0 | This article is semi-protected. | https://en.wikipedia.org/wiki/Wikipedia:Protec... |
| 1 | Johann Sebastian Bach (painter) | https://en.wikipedia.org/wiki/Johann_Sebastian... |
| 2 | Bach (disambiguation) | https://en.wikipedia.org/wiki/Bach_(disambigua... |
| 3 | Elias Gottlob Haussmann | https://en.wikipedia.org/wiki/Elias_Gottlob_Ha... |
| 4 | Old Style and New Style dates | https://en.wikipedia.org/wiki/Old_Style_and_Ne... |
| 5 | Old Style and New Style dates | https://en.wikipedia.org/wiki/Old_Style_and_Ne... |
| 6 | Eisenach | https://en.wikipedia.org/wiki/Eisenach |
| 7 | Leipzig | https://en.wikipedia.org/wiki/Leipzig |
| 8 | List of compositions by Johann Sebastian Bach | https://en.wikipedia.org/wiki/List_of_composit... |
| 9 | Old Style and New Style dates | https://en.wikipedia.org/wiki/Old_Style_and_Ne... |
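
The output above lists "Old Style and New Style dates" more than once, because the same link appears several times in the page. If one row per link is preferred, `DataFrame.drop_duplicates` can remove the repeats. A minimal sketch on a few hand-typed rows follows; the names `rows`, `links_df` and `unique_links` are assumptions for illustration.

```python
import pandas as pd

# A few (title, link) pairs, including one repeated entry
rows = [
    ('Eisenach', 'https://en.wikipedia.org/wiki/Eisenach'),
    ('Old Style and New Style dates', 'https://en.wikipedia.org/wiki/Old_Style_and_New_Style_dates'),
    ('Old Style and New Style dates', 'https://en.wikipedia.org/wiki/Old_Style_and_New_Style_dates'),
    ('Leipzig', 'https://en.wikipedia.org/wiki/Leipzig'),
]

# Passing `columns` fixes the column order, so no re-arranging step is needed
links_df = pd.DataFrame(rows, columns=['Title', 'Link'])

# Keep the first occurrence of each (Title, Link) pair and renumber the rows
unique_links = links_df.drop_duplicates().reset_index(drop=True)
print(unique_links)
```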


### The Final Code

Create a function that takes a Wikipedia link and returns the hyperlinked words and their hyperlinks as a data frame, also saving them as a CSV file in a specified directory (the working directory by default).

In [85]:
#Required Libraries

#Libraries for web scraping
from urllib.request import urlopen  #to fetch the web page
from urllib.parse import urljoin    #to turn relative links into complete URLs
from bs4 import BeautifulSoup       #to parse the HTML content

#Pandas for the data frame
import pandas as pd

#For building the output file path
import os

cwd = os.getcwd()

#Function
def get_wiki_hyperlinks(link, directory=cwd):

    #Only English Wikipedia links are accepted
    if not link.startswith('https://en.wikipedia.org'):
        print('This is not a Wikipedia link')
        return None

    html = urlopen(link)                       #Fetch the complete HTML of the web page
    soup = BeautifulSoup(html, 'html.parser')  #Parse the HTML into a navigable tree (more in the notes)

    #Find all <a> tags. <a> defines a hyperlink in HTML
    hlink = soup.find_all('a')

    #Two lists to store the hyperlinked words (titles) and the hyperlinks (href)
    title = []
    links = []

    for obj in hlink:
        #Keep only the tags that carry both a title and an href.
        #obj.get returns None when an attribute is missing, so no try/except is needed,
        #and checking both keeps the two lists the same length
        if obj.get('title') is not None and obj.get('href') is not None:
            title.append(obj.get('title'))
            #urljoin completes relative links and leaves absolute ones unchanged
            links.append(urljoin(link, obj.get('href')))

    wiki_links = pd.DataFrame({'Title': title, 'Link': links})

    #Re-arranging
    wiki_links = wiki_links[['Title', 'Link']]

    #Getting the title of the article. We will use it as the file name
    filename = soup.title.string
    print('The file name of the exported csv file is '+filename+'.csv')

    #Exporting the data frame as a csv file in the requested directory
    #(joining the path avoids changing and re-setting the working directory)
    wiki_links.to_csv(os.path.join(directory, filename+'.csv'))

    return wiki_links
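
Using the page title as the file name can fail when the title contains characters that are not allowed in file names, such as `/` on most systems or `:` and `?` on Windows. Below is a hedged sketch of a small helper that cleans the title and builds the full output path with `pathlib`, so the working directory never has to change. The name `safe_csv_path` and the choice of `_` as the replacement character are assumptions for this example.

```python
import re
from pathlib import Path

def safe_csv_path(page_title, directory='.'):
    """Build '<directory>/<cleaned title>.csv' without changing the working directory."""
    # Replace characters that are commonly invalid in file names with an underscore
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', page_title).strip()
    return Path(directory) / (cleaned + '.csv')

print(safe_csv_path('Johann Sebastian Bach - Wikipedia'))
print(safe_csv_path('AC/DC - Wikipedia', 'exports'))
```

Inside the function, the export step could then read `wiki_links.to_csv(safe_csv_path(soup.title.string, directory))`, since `to_csv` accepts path objects as well as strings.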

#### Testing

In [67]:
#TESTING 1
ML = get_wiki_hyperlinks('https://en.wikipedia.org/wiki/Machine_learning')

The file name of the exported csv file is Machine learning - Wikipedia.csv


In [66]:
ML.iloc[1:5,]

| | Title | Link |
|---|---|---|
| 1 | Statistical learning in language acquisition | https://en.wikipedia.org/wiki/Statistical_lear... |
| 2 | Data mining | https://en.wikipedia.org/wiki/Data_mining |
| 3 | Statistical classification | https://en.wikipedia.org/wiki/Statistical_clas... |
| 4 | Cluster analysis | https://en.wikipedia.org/wiki/Cluster_analysis |


In [68]:
#TESTING 2
bla = get_wiki_hyperlinks('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')

This is not a Wikipedia link
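
The `startswith` check rejects this link correctly, but a prefix test also accepts any host name that merely begins with the same text, for instance `https://en.wikipedia.org.example.com/...`. A stricter variant compares the parsed host name instead. This is a minimal sketch; the function name `is_english_wikipedia` is an assumption and not part of the notebook.

```python
from urllib.parse import urlparse

def is_english_wikipedia(url):
    """Return True only when the URL's host is exactly en.wikipedia.org."""
    parts = urlparse(url)
    return parts.scheme in ('http', 'https') and parts.hostname == 'en.wikipedia.org'

print(is_english_wikipedia('https://en.wikipedia.org/wiki/Machine_learning'))
print(is_english_wikipedia('https://www.crummy.com/software/BeautifulSoup/bs4/doc/'))
print(is_english_wikipedia('https://en.wikipedia.org.example.com/wiki/Machine_learning'))
```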


In [87]:
#TESTING 3
ML = get_wiki_hyperlinks(link='https://en.wikipedia.org/wiki/Johann_Sebastian_Bach', directory='C:\\Users\\Gourab\\Desktop')

The file name of the exported csv file is Johann Sebastian Bach - Wikipedia.csv


**Notes**

**1. What does BeautifulSoup do?**
Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.
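
As a small illustration of this tolerance, the sketch below feeds BeautifulSoup a fragment in which the `<a>` and `<li>` tags are never closed. The fragment and the variable names are made up for the example. The links can still be reached through the usual `find_all` and `get` calls.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy HTML: the <a> and <li> tags are never closed
messy_html = (
    "<ul>"
    "<li><a href='/wiki/Eisenach' title='Eisenach'>Eisenach"
    "<li><a href='/wiki/Leipzig' title='Leipzig'>Leipzig"
    "</ul>"
)

soup = BeautifulSoup(messy_html, 'html.parser')

# Both links are still available as Tag objects with their attributes
found = [(a.get('title'), a.get('href')) for a in soup.find_all('a')]
print(found)
```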

