## CHAPTER 3: PROCESSING, WRANGLING AND DATA VISUALIZATION
### WEB SCRAPING
#### The book encourages us to build a web scraper to learn the basics of this task, which is essential for every data scientist.

##### *Jose Ruben Garcia Garcia*
##### *February 2024*
##### *Reference: Practical Machine Learning with Python: A Problem-Solver's Guide to Building Real-World Intelligent Systems*

### Basic Crawler

In [20]:
#As a first task we identify the URL to be scraped.
#We selected a blog post from Apress: https://www.apress.com/in/blog/all-blog-posts/gradient-descent-optimization/15512052


In [15]:
import re
import requests

def extract_blog_content(content):
    """Extract the blog post content using a regex.

    Args:
        content (str): HTML text of the page (response.text from requests.get),
            with newlines removed so that '.' can match across the whole block.

    Returns:
        str: the first regex match, or the string "None" if nothing matches.

    """
    # Non-greedy match: stops at the first closing </div> after the opening tag
    content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')
    result = content_pattern.findall(content)
    return result[0] if result else "None"
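
The regex approach depends on two details of Python's `re` module: `.` does not match newlines by default, and `(.*?)` stops at the first closing `</div>` it meets. The following is a minimal sketch of both effects. The HTML snippets are made up for illustration and are not taken from the Apress page.

```python
import re

# Same pattern as in extract_blog_content; the sample HTML strings are invented.
content_pattern = re.compile(r'<div class="cms-richtext">(.*?)</div>')

flat_html = '<div class="cms-richtext"><p>Intro</p><p>Body</p></div>'
nested_html = (
    '<div class="cms-richtext"><p>Intro</p>'
    '<div class="figure">Figure 1</div><p>Body</p></div>'
)
multiline_html = '<div class="cms-richtext">\n<p>Intro</p>\n</div>'

# Simple case: the whole block is captured.
flat_match = content_pattern.findall(flat_html)

# Nested <div>: the non-greedy group ends at the inner </div>,
# so everything after "Figure 1" is lost.
nested_match = content_pattern.findall(nested_html)

# Newlines: '.' cannot cross them, so nothing matches until they are removed.
multiline_match = content_pattern.findall(multiline_html)
flattened_match = content_pattern.findall(multiline_html.replace("\n", ""))

print(flat_match)
print(nested_match)
print(multiline_match)
print(flattened_match)
```

This is why the page text is flattened to a single line before matching, and why a regex can cut the content short on pages whose content block contains nested `<div>` elements.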

In [16]:
if __name__ == '__main__':

    base_url = "https://www.apress.com/in/blog/all-blog-posts"
    blog_suffix = "/gradient-descent-optimization/15512052"

    print("Crawling Apress.com for required blog post...\n\n")

    response = requests.get(base_url + blog_suffix)

    if response.status_code == 200:
        # Drop any characters that cannot be encoded as UTF-8
        content = response.text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')
        # The regex's '.' does not match newlines, so flatten the page to one line
        content = content.replace("\n", '')
        blog_post_content = extract_blog_content(content)

Crawling Apress.com for required blog post...




In [18]:
# The variable `content` holds the full HTML of the page with newlines removed; the regex match
# itself is stored in `blog_post_content`. Raw HTML is not what we ultimately want, but it is
# enough for this basic example.
content

'<!DOCTYPE html><!--[if lt IE 7]> <html lang="en" class="no-js ie6 lt-ie9 lt-ie8"> <![endif]--><!--[if IE 7]> <html lang="en" class="no-js ie7 lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie9"> <![endif]--><!--[if IE 9]> <html lang="en" class="no-js ie9"> <![endif]--><!--[if gt IE 9]><!--> <html lang="en" class="no-js"> <!--<![endif]--><head><meta http-equiv="x-ua-compatible" content="IE=edge"><script type="text/javascript" src="/spcom/js/vendor/googleapis/ajax/libs/jquery/1.9.1/jquery.min.js"></script><script type="text/javascript" id="angular-script" src="/spcom/js/vendor/googleapis/ajax/libs/angularjs/1.2.17/angular.min.js"></script><script type="text/javascript" id="script--165730135" src="/spcom/min/prod.js?r=0.102.0"></script><link rel="stylesheet" type="text/css" href="/spcom/min/modern_sprcom-cms-frontend_apress.css?r=0.102.0" /><!--[if (lt IE 9) & (!IEMobile)]><link rel="stylesheet" type="text/css" href="/spcom/min/ielt9_sprcom-cms-frontend_ap

### Scraping using the BeautifulSoup module
##### To get a cleaner result we can use this module, which simplifies the scraping task.

##### This script utilizes the requests and BeautifulSoup packages to crawl the blog page of apress.com in order to:
##### + Extract a list of recent blog post titles and their URLs.
##### + Extract the plain text content related to each blog post.

In [19]:
## importing modules

import requests
from time import sleep
from bs4 import BeautifulSoup

In [21]:
def get_post_mapping(content):
    """
    This function extracts blog post titles and URLs from the HTML content of a page.

    Args:
        content (str): String content returned from requests.get.
    
    Returns:
        list: A list of dictionaries with keys 'title' and 'url'.
        
    We use the h3 tags because, on inspecting the web page, h3 is the tag used for post titles.
    """
    post_detail_list = []
    post_soup = BeautifulSoup(content, "lxml")
    h3_content = post_soup.find_all("h3")
    
    for h3 in h3_content:
        # Skip headings without a link; h3.a is None in that case
        if h3.a is None:
            continue
        post_detail_list.append(
            {'title': h3.a.get_text(), 'url': h3.a.attrs.get('href')}
            )
    
    return post_detail_list
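
To see the title and URL extraction in isolation, here is a small offline sketch that applies the same `h3` lookup to a made-up HTML fragment. The fragment and its paths are invented for illustration. The sketch uses the standard-library `html.parser` backend instead of `lxml` so that it runs without the extra dependency, and it skips any `h3` that has no link.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics a blog listing; not copied from apress.com.
sample_html = """
<h3><a href="/in/blog/all-blog-posts/post-one/1">Post One</a></h3>
<h3>Section heading without a link</h3>
<h3><a href="/in/blog/all-blog-posts/post-two/2">Post Two</a></h3>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml".
soup = BeautifulSoup(sample_html, "html.parser")

posts = [
    {"title": h3.a.get_text(), "url": h3.a.attrs.get("href")}
    for h3 in soup.find_all("h3")
    if h3.a is not None  # h3.a is None when the heading contains no <a> tag
]

for post in posts:
    print(post["title"], "->", post["url"])
```

Note that the extracted URLs are relative paths in this fragment, so they would need to be joined with the site's base URL before being requested.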

In [24]:
#The next step is to iterate through the list of URLs and extract each blog post's text

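def get_post_content(content):
    """
    This function extracts the blog post text from the HTML content of a page.
    
    Args:
        content (str): String content returned from requests.get.
    
    Returns:
        str: Blog content in plain text ("" if the content block is not found).
    """

    plain_text = ""
    text_soup = BeautifulSoup(content, "lxml")
    para_list = text_soup.find_all("div",
                                   {'class': 'cms-richtext'})
    
    # Guard against pages without a cms-richtext block (para_list[0] would raise IndexError)
    if not para_list:
        return plain_text
    
    # Concatenate the text of each direct child of the first matching div
    for p in para_list[0]:
        plain_text += p.getText()
    
    return plain_text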

In [27]:
get_post_content(content)

"By Santanu PattanayakIt is important to understand and appreciate few key points regarding full batch Gradient Descent and Stochastic gradient descent methods along with their shortcomings so that one can appreciate the need of using different variants of gradient based optimizers in Deep Learning.Elliptical ContoursThe cost function for a linear neuron with least square error is quadratic. When the cost function is quadratic the direction of gradient in full batch gradient descent method gives the best direction for cost reduction in a linear sense but it doesn’t point to the minimum unless the different elliptical contours of the cost function are circles. Incase of long elliptical contours the gradient components might be large in directions where less change is required and less in directions where more change is required to move to the minimum point. As we can see in the Figure 1 below the gradient at \xa0doesn’t point to the direction of the minimum i.e. at point . The problem w

##### As we can see, the text extracted from one of the URLs now looks better than the output of the first approach, which used regular expressions.
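
The notebook stops after extracting a single post. One possible way to chain the two steps (list the posts, then fetch the text of each one) is sketched below. The function `crawl_posts` and its parameters `fetch_html`, `extract_text`, `base_url`, and `delay` are hypothetical names introduced for this sketch, and the demonstration runs offline against a made-up page dictionary instead of the real site.

```python
from time import sleep
from urllib.parse import urljoin


def crawl_posts(post_list, fetch_html, extract_text, base_url, delay=1.0):
    """Return a list of (title, text) pairs for every post that could be fetched.

    post_list    -- list of {'title': ..., 'url': ...} dictionaries
    fetch_html   -- callable taking a URL and returning HTML text, or None on failure
    extract_text -- callable turning HTML text into plain text
    base_url     -- used to resolve relative links
    delay        -- seconds to wait between requests
    """
    results = []
    for post in post_list:
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        url = urljoin(base_url, post["url"])
        html = fetch_html(url)
        if html is not None:
            results.append((post["title"], extract_text(html)))
        # Pause between requests so the server is not flooded
        sleep(delay)
    return results


# Offline demonstration with stand-in callables and invented data
fake_pages = {"https://example.com/blog/post-one/1": "<p>Hello</p>"}
demo = crawl_posts(
    [
        {"title": "Post One", "url": "/blog/post-one/1"},
        {"title": "Missing", "url": "/blog/missing/2"},
    ],
    fetch_html=fake_pages.get,              # returns None for unknown URLs
    extract_text=lambda html: html.upper(),  # placeholder for a real extractor
    base_url="https://example.com",
    delay=0,
)
print(demo)
```

In the notebook's setting, `fetch_html` could be a small wrapper around `requests.get` that returns `response.text` when the status code is 200 and `None` otherwise, and `extract_text` could be `get_post_content`. Passing them in as arguments keeps the loop testable without network access.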