<a href="https://colab.research.google.com/github/MamadouBousso/IAProjects/blob/master/Webscrapingnotebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## requests library
This is one of the most important libraries for web scraping. It downloads the web pages we need to scrape.

In [None]:
# Let's install the library
!pip install requests --upgrade --quiet

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Let's import the library
import requests

### Job 1
We will write a function that downloads a web page using `requests`.
The `requests.get` function returns a response object containing the page contents, along with a status code indicating whether the request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.
If the request was successful, `response.status_code` is set to a value between 200 and 299.
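
As a small illustration of the 200 to 299 rule, here is a minimal sketch that classifies a few status codes without making any request. It relies on `HTTPStatus` from the standard library; the helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (is_success, reason phrase) for an HTTP status code."""
    # Success means any code in the 2xx range
    is_success = 200 <= code <= 299
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not a registered HTTP status
        phrase = "Unknown status"
    return is_success, phrase

for code in (200, 204, 301, 404, 500):
    print(code, describe_status(code))
```

Checking the range rather than a fixed list of codes keeps the test in line with the definition of a successful response.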

In [None]:
def getWebPage(topic_url):
    '''
    Get the content of the page you want to scrape
    :param topic_url: The URL of the page you want to scrape
    :type topic_url: str
    :return: the status code and the text of the response, or (None, None) if the request was not successful
    '''
    response = requests.get(topic_url)

    # A status code between 200 and 299 means the request was successful
    if 200 <= response.status_code <= 299:
        return response.status_code, response.text

    print(f"Problem with your url or your request. Try to check it.status_code: {response.status_code}")
    return None, None

In [None]:
#Test
topic_url = "https://dagshub.com/Omdena/OmdenaLore"
code,pagecontent = getWebPage(topic_url)
print(code)



Problem with your url or your request. Try to check it.status_code: 404
None


In [None]:
topic_url1 = "https://developer.mozilla.org/en-US/docs/Web/HTTP/Status"
code2,pagecontent2 = getWebPage(topic_url1)




In [None]:
print(code2)
pagecontent2[:1000]

200


'<!DOCTYPE html><html lang="en-US" prefix="og: https://ogp.me/ns#"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1"><link rel="icon" href="/favicon-48x48.97046865.png"><link rel="apple-touch-icon" href="/apple-touch-icon.0ea0fa02.png"><meta name="theme-color" content="#ffffff"><link rel="manifest" href="/manifest.56b1cedc.json"><script>Array.prototype.flat&&Array.prototype.includes||document.write(\'<script src="https://polyfill.io/v3/polyfill.min.js?features=Array.prototype.flat%2Ces6"><\\/script>\')</script><title>HTTP response status codes - HTTP | MDN</title><link rel="preload" as="font" type="font/woff2" crossorigin="" href="/static/media/ZillaSlab-Bold.subset.0beac26b.woff2"><link rel="alternate" title="HTTP response status codes" href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status" hreflang="en"><link rel="alternate" title="HTTP 狀態碼" href="https://developer.mozilla.org/zh-TW/docs/Web/HTTP/Status" hreflang="zh-TW"><link rel

## Exercise

Try to get the contents of several web pages and analyse the status code you get when an error occurs.

### Job 2
Write the page contents to a file, using the function defined below.

In [None]:
def putContentFile(filename, contents, encoding="utf-8"):
  '''
  Write the content of a web page to a new HTML file
  :param filename: name of the HTML file to write
  :param contents: the page contents to write to the file
  :param encoding: the encoding of the file
  :type filename: str
  :type contents: str (the text of the response object)
  :type encoding: str
  '''
  with open(filename, mode='w', encoding=encoding) as file:
    file.write(contents)
        

In [None]:
putContentFile('status.html',pagecontent2)

### You can verify that the file status.html is created
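
One way to do this check from Python is sketched below. It is a self-contained illustration that writes a small stand-in page to a temporary directory instead of touching `status.html`; the names `sample_html` and `demo.html` are hypothetical.

```python
import tempfile
from pathlib import Path

# Stand-in for downloaded HTML, including a non-ASCII character
sample_html = "<html><head><title>Demo page</title></head><body>café</body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo.html"
    # Write with an explicit encoding, as putContentFile does
    path.write_text(sample_html, encoding="utf-8")

    file_exists = path.exists()
    size_in_bytes = path.stat().st_size
    # Reading back with the same encoding should give the same text
    round_trip_ok = path.read_text(encoding="utf-8") == sample_html

print(file_exists, size_in_bytes, round_trip_ok)
```

Reading the file back with the encoding used for writing avoids decoding errors on pages that contain non-ASCII characters.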

> **EXERCISE**: Download the web page for a different topic, e.g., https://github.com/topics/data-analysis using `requests` and save it to a file, e.g., `data-analysis.html`. 

## Beautiful Soup library

### Job 3: Extracting information from HTML using Beautiful Soup

To extract information from the HTML source code of a webpage programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

In [None]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

[?25l[K     |██▉                             | 10 kB 23.1 MB/s eta 0:00:01[K     |█████▋                          | 20 kB 9.8 MB/s eta 0:00:01[K     |████████▌                       | 30 kB 8.4 MB/s eta 0:00:01[K     |███████████▎                    | 40 kB 7.6 MB/s eta 0:00:01[K     |██████████████▏                 | 51 kB 4.1 MB/s eta 0:00:01[K     |█████████████████               | 61 kB 4.4 MB/s eta 0:00:01[K     |███████████████████▉            | 71 kB 4.5 MB/s eta 0:00:01[K     |██████████████████████▋         | 81 kB 5.0 MB/s eta 0:00:01[K     |█████████████████████████▌      | 92 kB 5.0 MB/s eta 0:00:01[K     |████████████████████████████▎   | 102 kB 4.0 MB/s eta 0:00:01[K     |███████████████████████████████▏| 112 kB 4.0 MB/s eta 0:00:01[K     |████████████████████████████████| 115 kB 4.0 MB/s 
[?25h

In [None]:
# Import the library
from bs4 import BeautifulSoup

In [None]:
# You can see all the documentation for beautiful soup
?BeautifulSoup

In [None]:
# Open a file for reading
def openFile(filename):
    '''
    Read the content scraped from a website and saved to a file
    :param filename: the name of an existing file
    :return: the content of the file
    '''
    # Read with the same encoding that putContentFile uses by default
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

In [None]:
html_source = openFile('status.html')

In [None]:
html_source



Let's create a `BeautifulSoup` object, which provides several properties and methods for extracting information from the HTML document. You can look up the BeautifulSoup documentation or search online to find what you need when you need it.

In [None]:
doc = BeautifulSoup(html_source, 'html.parser')

In [None]:
#Example: Find the title of the page
title_tag = doc.title

In [None]:
print(title_tag)

<title>HTTP response status codes - HTTP | MDN</title>


In [None]:
print(title_tag.text)

HTTP response status codes - HTTP | MDN


In [None]:
# This helps you identify the properties and methods of the doc object. We will mostly work with the methods starting with find
dir(doc)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'ROOT_TAG_NAME',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_check_markup_is_url',
 '_decode_markup',
 '_feed',
 '_find_all',
 '_find_one',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_linkage_fixer',
 '_most_recent_element',
 '_namespaces',
 '_popToTag',
 '_should_pretty_print',
 'append',
 'attrs',
 'builder',
 'can_be_empty_element',
 'cdata_list_attributes',
 'childGenerator',
 'children',
 'clear',
 'conta

In [None]:
# Let's define a function that will help us get the tag we want using find
def get_tag(doc, tag_name):
    '''
    Get the first tag with the given name, and its content.
    :param doc: the BeautifulSoup object to search
    :param tag_name: the name of the tag
    :type doc: BeautifulSoup object
    :type tag_name: str
    :return: the tag and its content, or None if no tag matches
    '''
    return doc.find(tag_name)
    
    
    

In [None]:
# Let's define a function that will help us get the tag we want using find_all
def get_groups_of_tag(doc, tag_name):
    '''
    Get all the tags with the given name in the page.
    :param doc: the BeautifulSoup object to search
    :param tag_name: the name of the tag
    :type doc: BeautifulSoup object
    :type tag_name: str
    :return: a list of the matching tags and their content
    '''
    # find_all is the current name of the method; findAll is a legacy alias
    return doc.find_all(tag_name)
    

In [None]:
#Example: get the title
title_tag = get_tag(doc,'title')

In [None]:
#Example: get all the links
a_tag = get_groups_of_tag(doc,'a')

In [None]:
print(title_tag)

<title>HTTP response status codes - HTTP | MDN</title>


In [None]:
print(a_tag)



In [None]:
a_tag[2]

<a href="#select-language" id="skip-select-language">Skip to select language</a>

> **EXERCISE**: Get a list of all the `img` tags on the page. How many images does the page contain?

### Accessing attributes

The attributes of a tag can be accessed using the indexing notation, e.g., `first_link['href']`
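
The following minimal sketch shows the indexing notation on a small hand-written HTML fragment rather than on the downloaded page. The fragment and the variable names `sample_doc` and `first_link` are illustrative assumptions; `Tag.attrs` and `Tag.get` are part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# A tiny hand-written fragment standing in for a real page
sample_html = '<p><a href="#content" id="skip-main">Skip to main content</a></p>'
sample_doc = BeautifulSoup(sample_html, 'html.parser')
first_link = sample_doc.find('a')

# Indexing returns the value of an attribute (KeyError if it is missing)
href_value = first_link['href']
# attrs holds all the attributes of the tag as a dictionary
all_attrs = first_link.attrs
# get returns None instead of raising when the attribute is missing
missing = first_link.get('title')

print(href_value, all_attrs, missing)
```

Using `get` is convenient when an attribute may be absent from some tags, since indexing would raise a `KeyError` in that case.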

In [None]:
def get_tag_and_attribute(doc,tag_name,attr):
  '''
  Get the value of one attribute of the first tag with the given name
  :param doc: the BeautifulSoup object to search
  :param tag_name: the name of the tag
  :param attr: the name of the attribute
  :type doc: BeautifulSoup object
  :type tag_name: str
  :type attr: str
  :return: a str that represents the value of the attribute, or '' if the tag or the attribute is missing
  '''
  tag = doc.find(tag_name)
  # find returns None when no tag matches
  if tag is None:
    print(f"Tag {tag_name} doesn't exist in the document")
    return ''
  if attr not in tag.attrs:
    print(f"Attribute {attr} doesn't exist in tag {tag_name}")
    return ''
  return tag[attr]
    
  
  


In [None]:
attr = get_tag_and_attribute(doc,'a','href')

In [None]:
attr

'#content'

> **EXERCISE**: Find the `img` tag at index 1 on the page (counting from 0). Which attributes does the tag contain? Find the values of the `src` and `alt` attributes of the tag.

### Searching by Attribute Value

> **PROBLEM**: Find the `a` tag(s) on the page with the `href` attribute set to `#select-language`.

We can provide a dictionary of attributes as the second argument to `find_all`
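
As a minimal sketch on a hand-written fragment (not the downloaded page), the dictionary form can be compared with the equivalent keyword-argument form that `find_all` also accepts. The fragment and the names `sample_doc`, `by_dict`, and `by_keyword` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Two links modelled on the skip links of the page studied above
sample_html = '''
<a href="#content" id="skip-main">Skip to main content</a>
<a href="#select-language" id="skip-select-language">Skip to select language</a>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Attributes given as a dictionary in the second argument
by_dict = sample_doc.find_all('a', {'href': '#select-language'})
# The same filter written as a keyword argument
by_keyword = sample_doc.find_all('a', href='#select-language')

print(by_dict)
print(by_dict == by_keyword)
```

The dictionary form is the safer choice for attribute names that are not valid Python keyword arguments, such as names containing a hyphen.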

In [None]:
def find_tag_with_attributes(doc,tag_name,attr_dict):
  
    '''
    Get all the tags with the given name and attribute values.
    :param doc: the BeautifulSoup object to search
    :param tag_name: the name of the tag
    :param attr_dict: a dictionary of attribute names and values
    :type doc: BeautifulSoup object
    :type tag_name: str
    :type attr_dict: dict
    :return: a list of the tags that have the attributes defined in attr_dict
    '''
    return doc.find_all(tag_name,attr_dict)
    

In [None]:
find_tag_with_attributes(doc,'a',{'href':'#content'})

[<a href="#content" id="skip-main">Skip to main content</a>]

### Job 5: Parsing Information from Tags

Once we have a list of tags matching some criteria, it's easy to extract information and convert it to a more convenient format.

> **QUESTION**: Find the link text and URL of all the links related to social network on https://www.reddit.com/ .

We'll create a list of dictionaries containing the required information. When an `href` attribute contains only a relative path, e.g. `/explore`, the base URL https://www.reddit.com must be added as a prefix; the links selected in this section already carry absolute URLs, so they can be used as-is.
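
Adding a base URL to relative paths can be sketched with `urljoin` from the standard library, which leaves absolute URLs untouched. The list `hrefs` below is a hand-written sample, not data scraped from the site.

```python
from urllib.parse import urljoin

base_url = "https://www.reddit.com"

# Sample href values: two relative paths and one absolute URL
hrefs = [
    "/explore",
    "https://www.redditinc.com/policies/user-agreement",
    "/",
]

# urljoin prefixes relative paths and keeps absolute URLs as they are
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

This avoids string concatenation, which would corrupt links that are already absolute.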

In [None]:
# Let's first try to scrape www.reddit.com
topic_url = "https://www.reddit.com"
code,pagecontentred = getWebPage(topic_url)
print(code)

200


In [None]:
pagecontentred[:1000]

'\n    <!DOCTYPE html>\n    <html lang="en-US">\n      <head>\n        <script>\n    var __SUPPORTS_TIMING_API = typeof performance === \'object\' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;\n    function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };\n    var __firstPostLoaded = false;\n    function __markFirstPostVisible() {\n      if (__firstPostLoaded) { return; }\n      __firstPostLoaded = true;\n      __perfMark("first_post_title_image_loaded");\n    }\n    var __firstCommentLoaded = false;\n    function __markFirstCommentVisible() {\n      if (__firstCommentLoaded) { return; }\n      __firstCommentLoaded = true;\n      __perfMark("first_comment_loaded");\n    }\n  </script>\n        <script>__perfMark(\'head_tag_start\');</script>\n        <meta charSet="utf-8"/>\n        <meta name="viewport" content="width=device-width, initial-scale=1" />\n        <meta name="referrer" content="origin-when-cross-origin" />\n      

In [None]:
# Let's create a BeautifulSoup object for the reddit website
docreddit = BeautifulSoup(pagecontentred, 'html.parser')

In [None]:
# Let's explore all the links
a_tag = get_groups_of_tag(docreddit,'a')

In [None]:
a_tag

[<a aria-label="Home" class="_30BbATRhFv3V83DHNDjJAO" href="/"><svg class="_1O4jTk-dZ-VIxsCuYB6OR8 _32hLJ8_m9mplK6bwNXysk8" viewbox="0 0 20 20" xmlns="http://www.w3.org/2000/svg"><g><circle cx="10" cy="10" fill="#FF4500" r="10"></circle><path d="M16.67,10A1.46,1.46,0,0,0,14.2,9a7.12,7.12,0,0,0-3.85-1.23L11,4.65,13.14,5.1a1,1,0,1,0,.13-0.61L10.82,4a0.31,0.31,0,0,0-.37.24L9.71,7.71a7.14,7.14,0,0,0-3.9,1.23A1.46,1.46,0,1,0,4.2,11.33a2.87,2.87,0,0,0,0,.44c0,2.24,2.61,4.06,5.83,4.06s5.83-1.82,5.83-4.06a2.87,2.87,0,0,0,0-.44A1.46,1.46,0,0,0,16.67,10Zm-10,1a1,1,0,1,1,1,1A1,1,0,0,1,6.67,11Zm5.81,2.75a3.84,3.84,0,0,1-2.47.77,3.84,3.84,0,0,1-2.47-.77,0.27,0.27,0,0,1,.38-0.38A3.27,3.27,0,0,0,10,14a3.28,3.28,0,0,0,2.09-.61A0.27,0.27,0,1,1,12.48,13.79Zm-0.18-1.71a1,1,0,1,1,1-1A1,1,0,0,1,12.29,12.08Z" fill="#FFF"></path></g></svg><svg class="_1bWuGs_1sq4Pqy099x_yy-" viewbox="0 0 57 18" xmlns="http://www.w3.org/2000/svg"><g fill="#1c1c1c"><path d="M54.63,16.52V7.68h1a1,1,0,0,0,1.09-1V6.65a1,1,0,0,0-.

In [None]:
# Keep only the links from reddit that carry this class
reddit_link_tags = find_tag_with_attributes(docreddit,'a',{'class':'_3Eyh3vRo5o4IfzVZXhaWAG'})

In [None]:
reddit_link_tags

[<a class="_3Eyh3vRo5o4IfzVZXhaWAG" href="https://www.redditinc.com/policies/user-agreement">User Agreement</a>,
 <a class="_3Eyh3vRo5o4IfzVZXhaWAG" href="https://www.redditinc.com/policies/privacy-policy">Privacy policy</a>,
 <a class="_3Eyh3vRo5o4IfzVZXhaWAG" href="https://www.redditinc.com/policies/content-policy">Content policy</a>,
 <a class="_3Eyh3vRo5o4IfzVZXhaWAG" href="https://www.redditinc.com/policies/moderator-guidelines">Moderator Guidelines</a>]

In [None]:
reddit_links = []


for tag in reddit_link_tags:
    reddit_links.append({ 'title': tag.text.strip(), 'url': tag['href']})
    
reddit_links

[{'title': 'User Agreement',
  'url': 'https://www.redditinc.com/policies/user-agreement'},
 {'title': 'Privacy policy',
  'url': 'https://www.redditinc.com/policies/privacy-policy'},
 {'title': 'Content policy',
  'url': 'https://www.redditinc.com/policies/content-policy'},
 {'title': 'Moderator Guidelines',
  'url': 'https://www.redditinc.com/policies/moderator-guidelines'}]

In [None]:
def findChildren(beautyobject,tagvalue,childvalue,classname,deeper=False):
  '''
  Get the children (and optionally the deeper descendants) of a tag.
  :param beautyobject: the BeautifulSoup object built from the scraped data
  :param tagvalue: the name of the parent tag
  :param childvalue: the name of the child tag to search for
  :param classname: the class of the parent tag
  :param deeper: if True, search all descendants; if False, only direct children
  :type beautyobject: BeautifulSoup object
  :type tagvalue: str
  :type childvalue: str
  :type classname: str
  :type deeper: boolean
  :return: a list of the matching children of the tag
  '''
  return beautyobject.find(tagvalue,class_=classname).find_all(childvalue,recursive= deeper)
  



In [None]:
l = findChildren(docreddit,'div','div','STit0dLageRsa2yR4te_b')

In [None]:
print(l)

[]
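
An empty list here means the matched `div` has no `div` among its direct children, since `deeper` defaults to `False` and is passed to `find_all` as `recursive`. The effect of that flag can be sketched on a hand-written fragment; the fragment and the names `outer`, `direct_only`, and `all_levels` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# A parent div with one child div and one grandchild div
sample_html = (
    '<div class="outer">'
    '<div id="child"><div id="grandchild"></div></div>'
    '<p>text</p>'
    '</div>'
)
sample_doc = BeautifulSoup(sample_html, 'html.parser')
outer = sample_doc.find('div', class_='outer')

# recursive=False looks only at the direct children of the tag
direct_only = [tag['id'] for tag in outer.find_all('div', recursive=False)]
# recursive=True (the default) looks at every descendant
all_levels = [tag['id'] for tag in outer.find_all('div', recursive=True)]

print(direct_only)
print(all_levels)
```

Passing `deeper=True` to `findChildren` would therefore also return nested matches.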


In [None]:
def findTagAttr(doc,tagname,classname,*attributes):
  '''
  Get a list of dictionaries giving, for each tag of class classname, the requested attributes and their values
    :param doc: the BeautifulSoup object to search
    :param tagname: the name of the tag
    :param classname: the class of the tag
    :param attributes: the names of the attributes we need
    :type doc: BeautifulSoup object
    :type tagname: str
    :type classname: str
    :type attributes: variable number of str arguments
    :return: a list of dictionaries mapping attribute names to their values, plus a 'text' key when the tag has text
  '''
  attrList = []
  # Search the document passed as argument, not a global variable
  tags = find_tag_with_attributes(doc,tagname,{'class':classname})
  for tag in tags:
    # Map each requested attribute name to its value in this tag
    dictattr = {attr: tag[attr] for attr in attributes}
    # Add the text of the tag if it exists
    if tag.text != '':
      dictattr['text'] = tag.text.strip()
    attrList.append(dictattr)
  return attrList


In [None]:
dictscrape = findTagAttr(docreddit,'a','_3Eyh3vRo5o4IfzVZXhaWAG','href')

In [None]:
dictscrape

[{'href': 'https://www.redditinc.com/policies/user-agreement',
  'text': 'User Agreement'},
 {'href': 'https://www.redditinc.com/policies/privacy-policy',
  'text': 'Privacy policy'},
 {'href': 'https://www.redditinc.com/policies/content-policy',
  'text': 'Content policy'},
 {'href': 'https://www.redditinc.com/policies/moderator-guidelines',
  'text': 'Moderator Guidelines'}]

## Write the list of dictionaries to a CSV file

In [None]:
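import csv

def write_csv(items, path):
    '''
    Write a list of dictionaries to a CSV file
    :param items: the list of dictionaries to write
    :param path: the path of the CSV file
    '''
    # newline='' lets the csv module control the line endings
    with open(path, 'w', newline='', encoding='utf-8') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return

        # Use the keys of the first item as the headers
        headers = list(items[0].keys())
        # The csv module quotes values that contain commas or quotes
        writer = csv.DictWriter(f, fieldnames=headers, restval='',
                                extrasaction='ignore', lineterminator='\n')
        writer.writeheader()

        # Write one item per line; missing keys are written as empty strings
        for item in items:
            writer.writerow(item)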

In [None]:
write_csv(dictscrape, 'datascrape.csv')
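
To check what a CSV export looks like, the rows can be written and read back with the standard `csv` module. The sketch below works on an in-memory buffer and on hand-written rows (the `example.com` URLs are placeholders), so it does not depend on `datascrape.csv`.

```python
import csv
import io

# Hand-written rows; the first text value contains a comma on purpose
rows = [
    {'href': 'https://example.com/a', 'text': 'Terms, conditions'},
    {'href': 'https://example.com/b', 'text': 'Privacy policy'},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['href', 'text'], lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the same text back into a list of dictionaries
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(csv_text)
print(read_back == rows)
```

The value containing a comma is wrapped in quotes by the writer, which is what allows the reader to recover the original two columns.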