# 3. Web Crawler - fetch links from wiki page

Write a web crawler in a Python notebook. The crawler should be given a single URL entry point to a webpage with links to a number of static web pages in the same domain. From that entry point, the crawler should gather additional “target” URLs to webpages inside the same domain as the entry URL and visit those pages to collect the page title of each (“page_title_target”). For each page aside from the entry URL, I want you to collect:
    The source URL (url_source) of the HTML webpage that you gathered the target URL from
    The target URL (url_target) of an HTML webpage you discovered through crawling
    The title (page_title_target) of the web page rendered by the url_target URL

From the entry webpage you will gather an additional 99 unique target URLs to other webpages, using them as additional source_urls as needed to obtain the additional target_urls. Do note that we are collecting HTML webpage URLs and not URLs to other resources like images; these target URLs should be unique (i.e. no two should point to different sections of the same page).

Write the collected url_source, url_target, and page_title_target data into a CSV file as described in the deliverables section. The initial entry point into the crawl should be given a “None” value for its url_source. Including the entry point, you should have 100 target_urls by the end of the crawl.
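
The requirement to use discovered pages as additional sources "as needed" amounts to a breadth-first crawl. The following is a minimal sketch of that loop, not the notebook's own implementation. The names `crawl` and `fetch_page` are hypothetical, and `fetch_page(url)` is assumed to return a `(title, hrefs)` pair for an HTML page or `None` for anything else.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(entry_url, fetch_page, limit=100):
    """Breadth-first crawl that stays on the entry URL's domain.

    fetch_page is a caller-supplied (hypothetical) function returning
    (page_title, list_of_hrefs) for an HTML page, or None otherwise.
    Returns a list of (url_source, url_target, page_title_target) rows.
    """
    netloc = urlparse(entry_url).netloc
    queue = deque([(None, entry_url)])  # the entry point has no source
    seen = {entry_url}
    rows = []
    while queue and len(rows) < limit:
        source, target = queue.popleft()
        page = fetch_page(target)
        if page is None:  # not an HTML page, or the request failed
            continue
        title, hrefs = page
        rows.append((source, target, title))
        for href in hrefs:
            # Resolve relative links and drop the #fragment part
            link = urldefrag(urljoin(target, href)).url
            if urlparse(link).netloc == netloc and link not in seen:
                seen.add(link)
                queue.append((target, link))
    return rows
```

Each row records the page on which a target was first seen, so `url_source` is the entry URL only for links found directly on the entry page. Passing the fetch function in as an argument keeps the crawl logic testable without network access.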

# Step 1 - Read the URL HTML Page 

In [5]:
import urllib.request  # For downloading webpages
import urllib.parse    # For parsing and joining URLs (used in Step 3)
import bs4
from bs4 import BeautifulSoup  # For parsing HTML documents

input_url = 'https://en.wikipedia.org/wiki/2019_in_film'
with urllib.request.urlopen(input_url) as response:
    Read1 = response.read()  # Reads the whole HTML document as bytes

Read1_BS = BeautifulSoup(Read1, 'html.parser')  # Name the parser explicitly
#print(Read1_BS)
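
The download step can be wrapped in a small helper that sets a timeout, checks the content type, and decodes the bytes. This is a sketch under assumptions: `fetch_html` is a hypothetical name, the `User-Agent` value is a placeholder, and UTF-8 is assumed when the server does not declare a charset.

```python
import urllib.request


def fetch_html(url, timeout=5):
    """Return the decoded HTML of url, or None if it is not an HTML page."""
    # Placeholder User-Agent; replace it with a string that identifies your crawler
    request = urllib.request.Request(url, headers={'User-Agent': 'example-crawler/0.1'})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.headers.get_content_type() != 'text/html':
            return None
        # Assumption: fall back to UTF-8 when no charset is declared
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

The returned string can be passed to `BeautifulSoup` in place of the raw bytes, for example `BeautifulSoup(fetch_html(input_url), 'html.parser')`.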


In [6]:
# Identify the child tags of <body>
for child in Read1_BS.body:
    if isinstance(child, bs4.element.Tag):
        print(type(child), child.name)

<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> script
<class 'bs4.element.Tag'> script
<class 'bs4.element.Tag'> script


# Step 2 - Filter out the a-Tags

#### Then getting unique href values in a set

In [7]:
a_tags = Read1_BS.body.find_all('a')  # All a-tags in the body (a-tags carry the href attributes)
#a_tags

In [8]:
href_found = set()  # Using a set to keep unique values
for tag in a_tags:
    if 'href' in tag.attrs:  # .attrs holds the tag's attributes; href is an attribute of the a-tag
        href_found.add(tag.attrs['href'])

len(href_found)

1272

In [3]:
href_found  # Many of the links obtained are relative paths or fragments, not absolute URLs
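
The href values found on a page come in several forms, which is why they need cleaning before they can be requested. The sketch below classifies a few illustrative values with `urllib.parse.urlparse`. The function name `href_kind` and the sample hrefs are made up for illustration and are not taken from the crawl output.

```python
from collections import Counter
from urllib.parse import urlparse


def href_kind(href):
    """Classify an href by the form it takes in the page source."""
    parsed = urlparse(href)
    if parsed.scheme:
        return 'absolute'           # e.g. https://..., also mailto: links
    if parsed.netloc:
        return 'protocol-relative'  # e.g. //example.org/logo.png
    if href.startswith('#'):
        return 'fragment-only'      # points to a section of the same page
    return 'relative'               # e.g. /wiki/Film


# Illustrative sample values, not taken from the actual crawl
sample_hrefs = ['/wiki/Film', '#cite_note-1', '//example.org/logo.png',
                'https://en.wikipedia.org/wiki/2019_in_film']
kind_counts = Counter(href_kind(h) for h in sample_hrefs)
```

Running `Counter(href_kind(h) for h in href_found)` on the real set would show how many links of each form the page contains.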

# Step 3 - Cleaning the extracted URLs

### First we parse the source URL to get its netloc, then convert the extracted links to absolute URLs

In [9]:
check_source_url = urllib.parse.urlparse(input_url)  # Parse the input URL
check_source_url  # Note the netloc value in the output; it is used in the next step

ParseResult(scheme='https', netloc='en.wikipedia.org', path='/wiki/2019_in_film', params='', query='', fragment='')

In [10]:
cleaned_discovered_links = set()
# Clean up discovered links: make them absolute and drop URLs
# from different "netloc" sources
for link in href_found:
    # Join relative links to the base URL so that every link is absolute
    # (a relative link on its own has a blank netloc)
    joined_link = urllib.parse.urljoin(input_url, link)
    parsed_joined_link = urllib.parse.urlparse(joined_link)
    if check_source_url.netloc == parsed_joined_link.netloc:
        cleaned_discovered_links.add(joined_link)

#cleaned_discovered_links

In [11]:
len(cleaned_discovered_links)

1179
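
To see what the join-and-filter step does to each kind of href, here is a small self-contained sketch. The function name `same_domain_links` is hypothetical, and the extra scheme check, which drops `mailto:` and similar links, is an addition that the cell above does not perform.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep those on the same netloc."""
    base_netloc = urlparse(base_url).netloc
    kept = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        # Added check (assumption): keep only http(s) links on the same host
        if parsed.scheme in ('http', 'https') and parsed.netloc == base_netloc:
            kept.add(absolute)
    return kept
```

Note that a fragment-only href such as `#History` resolves to the base URL plus that fragment, so it passes the netloc test and still has to be deduplicated in a later step.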

In [12]:
# Remove the fragment and query parts of the links, if any, to get unique links

unique_links = set()
for link in cleaned_discovered_links:  # The links are strings, so split() is enough here
    link = link.split('#')[0]  # Drop the fragment
    link = link.split('?')[0]  # Drop the query string

    unique_links.add(link)

#unique_links

In [13]:
len(unique_links)

1078
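
Splitting at `#` removes the fragment, and splitting at `?` also removes the query string. The second step can merge URLs that are in fact different pages, because some sites select the page through a query parameter. The sketch below contrasts the two approaches using `urllib.parse.urldefrag`. Both function names are hypothetical, and the sample URLs are illustrative.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Remove only the #fragment, leaving any query string intact."""
    return urldefrag(url).url


def strip_fragment_and_query(url):
    """Remove the #fragment and the ?query, as the two split() calls do."""
    return url.split('#')[0].split('?')[0]


# Illustrative URLs that differ only in their query strings
history = 'https://en.wikipedia.org/w/index.php?title=Film&action=history'
edit = 'https://en.wikipedia.org/w/index.php?title=2019_in_film&action=edit'
```

With these inputs, `strip_fragment_and_query` maps both URLs to the same string while `strip_fragment` keeps them distinct. Which behavior is wanted depends on whether query-string pages should count as separate targets.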

# Step 4 - Checking that links are HTML pages and fetching the first 100 links

In [14]:
%%time

# This step takes a long time to execute. Ask Matt whether there is a way to speed it up

webpage_only = set()
for check in unique_links:
    #print(check)
    try:
        with urllib.request.urlopen(check, data=None, timeout=5) as response:
            # Checks the content type, which is listed under Network - Headers
            if 'text/html' in response.headers.get('Content-Type', ''):
                webpage_only.add(check)
    except Exception:  # Skip links that fail to load or time out
        pass

len(webpage_only)

Wall time: 3min 26s


1078
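
The loop above makes one request at a time, so most of the wall time is likely spent waiting on the network. One possible way to speed it up is to run the checks in a thread pool and to send `HEAD` requests, which ask for headers only. This is a sketch under assumptions: the names `is_html_page` and `filter_html_links` are hypothetical, the server is assumed to answer `HEAD` with the same `Content-Type` as `GET`, and the actual gain depends on the network and the server.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def is_html_page(url, timeout=5):
    """Send a HEAD request and report whether the response is text/html."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get_content_type() == 'text/html'
    except OSError:  # URLError, HTTPError, and socket timeouts derive from OSError
        return False


def filter_html_links(links, check=is_html_page, max_workers=8):
    """Run check on every link in parallel threads and keep the links that pass."""
    links = list(links)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, links))
    return {link for link, ok in zip(links, results) if ok}
```

A call such as `webpage_only = filter_html_links(unique_links)` would replace the loop. Keeping `max_workers` small is a courtesy to the server. Since only 100 links are needed, checking a subset of the links would reduce the time further.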

## Here we will fetch 100 links as required 

In [15]:
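entry_url = 'https://en.wikipedia.org/wiki/2019_in_film'
# Put the entry URL first, then add 99 other links; excluding the entry URL
# from the rest avoids listing it twice
other_links = [u for u in webpage_only if u != entry_url]
url_target = [entry_url] + other_links[0:99]  # Slicing
#url_target  # List of 100 unique links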
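
Because a set has no defined order, the 100 links chosen from it can differ from one run to the next. If reproducible output matters, the selection can be made deterministic. The sketch below assumes that alphabetical order is an acceptable choice. The name `pick_targets` is hypothetical.

```python
def pick_targets(entry_url, links, total=100):
    """Return entry_url followed by up to total - 1 other links, without duplicates."""
    # Sorting fixes the order, so the same input always gives the same list
    others = sorted(link for link in set(links) if link != entry_url)
    return [entry_url] + others[:total - 1]
```

For this notebook the call would look like `url_target = pick_targets(input_url, webpage_only)`.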

# Step 5 - Fetching titles of 100 URL Pages

In [16]:
%%time
import urllib.request

page_title_target = list()
url_source = list()

for t in url_target:
    with urllib.request.urlopen(t) as response:
        soup = BeautifulSoup(response, 'html.parser')
    # soup.title is None when a page has no <title> tag
    page_title_target.append(soup.title.string if soup.title else None)  # Page title
    url_source.append('https://en.wikipedia.org/wiki/2019_in_film')  # Source URL for every target


#print(page_title_target)
#print(len(page_title_target))

Wall time: 43.3 s
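
A page without a `<title>` element has no title to record (with BeautifulSoup, `soup.title` is `None` in that case). The sketch below shows one way to handle this using only the standard library's `html.parser`. This is a swapped-in technique, not what the notebook uses, and the names `TitleParser` and `extract_title` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._in_title = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title>, e.g. inside an SVG

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return ' '.join(''.join(parser.parts).split()) or None
```

Returning `None` for a missing title lets the caller decide whether to skip the page or record an empty value.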


In [17]:
url_source[0] = None # As required, the first entry in url_source must be None
#print(url_source)
#print(len(url_source))

# Final step - Export into CSV in the required format

In [18]:

%%time
# Adding double quotes to all elements as required
url_source = ['"' + str(i) + '"' for i in url_source]
#print(url_source)

url_target = ['"' + str(i) + '"' for i in url_target]
#print(url_target)

page_title_target = ['"' + str(i) + '"' for i in page_title_target]
#print(page_title_target)


Wall time: 0 ns


In [19]:
# we can use pandas for this job.

import pandas as pd

#data = [url_source, url_target, page_title_target]
crawl = pd.DataFrame()

crawl['url_source'] = url_source
crawl['url_target'] = url_target
crawl['page_title_target'] = page_title_target
crawl

Unnamed: 0,url_source,url_target,page_title_target
0,"""None""","""https://en.wikipedia.org/wiki/2019_in_film""","""2019 in film - Wikipedia"""
1,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/List_of_Malayal...","""List of Malayalam films of 2019 - Wikipedia"""
2,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/Giulio_Brogi""","""Giulio Brogi - Wikipedia"""
3,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/(I%27m_Gonna)_L...","""(I'm Gonna) Love Me Again - Wikipedia"""
4,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/1930s_in_film""","""1930s in film - Wikipedia"""
...,...,...,...
95,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/The_Boat_That_R...","""The Boat That Rocked - Wikipedia"""
96,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/1998_in_film""","""1998 in film - Wikipedia"""
97,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/List_of_highest...","""List of highest-grossing films in Japan - Wik..."
98,"""https://en.wikipedia.org/wiki/2019_in_film""","""https://en.wikipedia.org/wiki/Quentin_Tarantino""","""Quentin Tarantino - Wikipedia"""


In [32]:
%%time
# Export dataframe to csv
crawl.to_csv(r'C:\Users\neilr\OneDrive\Desktop\Notes\Matt 891\crawl.csv', index=False, header=True)

Wall time: 5.28 ms
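
Adding quote characters to the values by hand and then passing them through a CSV writer produces tripled quotes such as `"""None"""`. The writer escapes each embedded quote by doubling it and then wraps the whole field in quotes. If the goal is simply to have every field double-quoted, the writer can do that itself. The sketch below uses the standard library's `csv` module with `csv.QUOTE_ALL`. The name `rows_to_csv` and the sample row are illustrative.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text with every field wrapped in double quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerow(['url_source', 'url_target', 'page_title_target'])
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative row; str(None) writes the entry point's source as the text None
sample_rows = [
    (str(None), 'https://en.wikipedia.org/wiki/2019_in_film', '2019 in film - Wikipedia'),
]
csv_text = rows_to_csv(sample_rows)
```

pandas' `to_csv` accepts the same `quoting` argument, so passing `quoting=csv.QUOTE_ALL` there should give similar output without the manual quoting step. Missing values would first need converting to the string `None`.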


# END