# Web operations are easy

## Socket

Python has built-in support for TCP sockets through the socket module. Common application protocols that run over TCP, with their usual ports:
- HTTP (80)
- HTTPS (443)
- FTP (21) - File Transfer
- SMTP (25) - Mail
- IMAP (143/220/993) - Mail Retrieval

Operations at the raw level: open a connection, send an HTTP request by hand, and read the response in chunks.

In [1]:
import socket

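# Open a TCP connection to the web server on the HTTP port
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))

# An HTTP request is a request line followed by a blank line
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.sendall(cmd)  # sendall keeps sending until every byte has gone out

# Read the response in chunks of up to 512 bytes until the server closes
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()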

HTTP/1.1 200 OK
Date: Wed, 26 Apr 2023 07:12:49 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
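
The raw output above mixes the status line, the headers, and the body in one stream. As a minimal sketch (not part of the original notebook), the three parts can be separated at the first blank line. The function name `split_http_response` and the abbreviated sample bytes are assumptions made for illustration.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # Headers end at the first blank line, i.e. the first CRLF CRLF
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        # Header names are case-insensitive, so store them in lowercase
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Abbreviated, hand-written sample modelled on the response shown above
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Connection: close\r\n'
          b'Content-Type: text/plain\r\n'
          b'\r\n'
          b'But soft what light through yonder window breaks\n')

status, headers, body = split_http_response(sample)
print(status)
print(headers['content-type'])
print(body.decode(), end='')
```

With a real connection, the bytes collected from every recv call would be joined first and then passed to the function.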


## Using urllib in Python

Since HTTP is so common, Python has a library that does all the socket work for us and makes web pages look like a file.

In [7]:
import urllib.request, urllib.parse, urllib.error

response = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in response:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


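Because the response behaves like a file, we can loop over its lines, for example to count how often each word appears: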

In [6]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1

print(counts)

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}
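
The same counting idiom can be written with collections.Counter from the standard library. This is a hedged alternative sketch, not the notebook's own method; `count_words` and `sample_lines` are hypothetical names, and the sample bytes stand in for the lines that urlopen would yield so the example runs without network access.

```python
from collections import Counter

def count_words(lines):
    """Count whitespace-separated words in an iterable of byte lines."""
    counts = Counter()
    for line in lines:
        # Each line arrives as bytes, just like the lines read from urlopen
        counts.update(line.decode().split())
    return counts

# Two lines of the text, as bytes, standing in for the HTTP response
sample_lines = [
    b'But soft what light through yonder window breaks\n',
    b'It is the east and Juliet is the sun\n',
]

counts = count_words(sample_lines)
print(counts.most_common(2))
```

Passing the object returned by urllib.request.urlopen instead of `sample_lines` should give the counts for the whole file.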


Reading web pages is just as easy: read the whole response and decode it with the charset declared by the server.

In [8]:
import urllib.request, urllib.parse, urllib.error

with urllib.request.urlopen('http://python.org/') as response:
    # The Content-Type header may omit the charset; fall back to UTF-8
    charset = response.info().get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

print(html)


<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <!-- Google tag (gtag.js) -->
    <script async src="https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH"></script>
    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());
      gtag('config', 'G-TF35YF9CVH');
    </script>

    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js">

    <meta name="application-name" content="Python.org">
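
The charset used for decoding comes from the Content-Type response header, and it can be missing. The sketch below illustrates that lookup offline with email.message.Message, the base class of the headers object that urllib returns. The helper names `charset_of` and `decode_body` and the UTF-8 fallback are assumptions, not something the notebook prescribes.

```python
from email.message import Message

def charset_of(content_type):
    """Return the charset declared in a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()

def decode_body(body, charset=None, fallback='utf-8'):
    """Decode bytes with the declared charset, or with a fallback if none."""
    return body.decode(charset or fallback, errors='replace')

print(charset_of('text/html; charset=utf-8'))   # charset is declared
print(charset_of('text/html'))                  # no charset declared
print(decode_body('café'.encode('utf-8'), charset_of('text/html')))
```

Using errors='replace' trades strictness for robustness: undecodable bytes become replacement characters instead of raising an exception.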

## The First Lines of Code at Google

Following links in a simple way. In this example only one web page is fetched and searched for links. To go further, repeat the same search on each link found.

In [11]:
import urllib.request, urllib.parse, urllib.error
import re

# Matches absolute http(s) URLs (the pattern finds links, not email addresses)
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

with urllib.request.urlopen('http://python.org/') as response:
    # Fall back to UTF-8 when the server does not declare a charset
    charset = response.info().get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

    urls = re.findall(url_pattern, html)
    print(urls)


['https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH', 'https://media.ethicalads.io/media/client/v1.4.0/ethicalads.min.js', 'https://www.python.org/static/opengraph-icon-200x200.png', 'https://www.python.org/static/opengraph-icon-200x200.png', 'https://www.python.org/', 'https://www.python.org/dev/peps/peps.rss/', 'https://www.python.org/jobs/feed/rss/', 'https://feeds.feedburner.com/PythonSoftwareFoundationNews', 'https://feeds.feedburner.com/PythonInsider', 'https://schema.org', 'https://www.python.org/', 'https://www.python.org/search/?q=', "https://ssl' : 'http://www') + '.google-analytics.com/ga.js';", 'http://browsehappy.com/', 'https://www.python.org/psf/', 'https://docs.python.org', 'https://pypi.org/', 'https://psfmember.org/civicrm/contribute/transact?reset=1&id=2', 'https://www.facebook.com/pythonlang?fref=ts', 'https://twitter.com/ThePSF', 'http://brochure.getpython.info/', 'https://docs.python.org/3/license.html', 'https://wiki.python.org/moin/BeginnersGuide', 'https
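
Repeating the link search on every link found amounts to a breadth-first crawl. The following is a minimal sketch under stated assumptions: the `crawl` function, its `fetch` parameter, the simplified URL pattern, and the in-memory `pages` dictionary are all hypothetical and exist only so the example runs without network access.

```python
import re
import urllib.request
from collections import deque

# Simplified pattern (an assumption): an absolute URL up to a quote, bracket or space
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, extract its links, queue unseen ones."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be retrieved
        visited.append(url)
        for link in URL_PATTERN.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

def fetch_with_urllib(url):
    """A fetch function for real pages (defined here but not called)."""
    with urllib.request.urlopen(url) as response:
        charset = response.info().get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')

# Tiny in-memory "web" with made-up pages
pages = {
    'http://a.example/': '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    'http://b.example/': '<a href="http://a.example/">a</a>',
    'http://c.example/': 'no links here',
}

order = crawl('http://a.example/', pages.__getitem__)
print(order)
```

The `seen` set is what keeps the crawl from looping when pages link back to each other, and `max_pages` bounds the work. For real pages, `fetch_with_urllib` could be passed as the `fetch` argument.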

## Web Scraping - Crawling

Web scraping is when a program or script pretends to be a browser: it retrieves web pages, looks at those pages, extracts information, and then looks at more web pages.

Search engines scrape web pages - we call this "spidering the web" or "web crawling".
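
One concrete way a script "pretends to be a browser" is by sending a User-Agent header with its request. The sketch below only builds the request object, so it runs offline. The function name `build_request` and the user-agent string are made-up illustrations.

```python
import urllib.request

def build_request(url, user_agent='Mozilla/5.0 (compatible; ExampleCrawler/0.1)'):
    """Build a request that identifies itself with a custom User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

req = build_request('http://data.pr4e.org/romeo.txt')

# urllib stores header names capitalized as 'User-agent'
print(req.get_header('User-agent'))

# urllib.request.urlopen(req) would then send the request with this header.
```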

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. [Reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)

In [5]:
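import sys
# Quote the interpreter path: on Windows an unquoted path containing a space is
# cut at the space, and the shell reports that "d:\Documentos" is not recognized
# as an internal or external command, operable program or batch file.
!"{sys.executable}" -m pip install beautifulsoup4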


In [4]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter url: ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))


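ModuleNotFoundError: No module named 'bs4'

This error means Beautiful Soup is not installed in the environment the notebook kernel runs in; the install step has to succeed before the import works.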
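
Where Beautiful Soup is not available, the anchor extraction can be approximated with html.parser from the standard library. This is a swapped-in technique, not the Beautiful Soup API. The `LinkCollector` class and the sample markup are hypothetical, and urljoin is added so that relative links resolve against a base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(urljoin(self.base_url, href))

collector = LinkCollector('http://example.com/')
collector.feed('<p><a href="/elsie">Elsie</a> '
               '<a href="http://example.com/lacie">Lacie</a></p>')
print(collector.links)
```

Beautiful Soup remains the more convenient choice once installed, since it builds a full parse tree and tolerates messy markup.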

In [None]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

In [None]:
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id="link3"))