# Chapter 8: Extracting Data from the Internet
## 05: Web Scraping with BeautifulSoup and Requests
## How to scrape the specific content we want from an HTML page

2020-12-12, 2019-5-11, 2019-12-05

content:
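### Start with the example on Corey's personal website:
1. How to use Inspect to map what the browser shows (figures, tables, text, and other media) to the corresponding HTML code.
2. How to load the whole HTML document with requests and BeautifulSoup, and then access the content we want. This does not need to be exhaustive, just clear enough.
3. How to scrape every item of the same type on a page: find_all()
4. How to scrape the YouTube ID of each embedded video and turn it into a clickable link.
5. How to write the scraped data to a CSV file (which Excel can open).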

## Exercise: scrape the editorial headlines from the major online newspapers

## *Python Tutorial: Web Scraping with BeautifulSoup and Requests*
https://www.youtube.com/watch?v=ng2o98k983k

This video works through an example that takes Corey Schafer's own tutorial posts and scrapes, for each one, the:
1. article topic
2. summary
3. YouTube link

and stores everything in a CSV table.

## First, learn how to use "Inspect" in the browser to go from an item on the page to its underlying HTML code

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/ng2o98k983k" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Corey Schafer's website: http://coreyms.com/

In [2]:
from bs4 import BeautifulSoup
import requests # fetches web pages over HTTP; used here in place of urllib.request
import csv # used later to write the CSV file

In [3]:
source = requests.get('http://coreyms.com')

In [4]:
source

<Response [200]>
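
The `<Response [200]>` above only says that the request succeeded. As a minimal sketch of how a `Response` reacts to other status codes, the snippet below builds bare `Response` objects by hand so that it runs offline; the helper name `describe` is hypothetical, and in the notebook you would call `bool()` or `raise_for_status()` on the object returned by `requests.get()` instead.

```python
import requests

def describe(status_code):
    """Hypothetical helper: build a bare Response (no network) and probe it."""
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx
        raised = False
    except requests.HTTPError:
        raised = True
    # bool(response) is True when the status code is below 400
    return bool(response), raised

for code in (200, 301, 404, 500):
    print(code, describe(code))
```

Calling `raise_for_status()` right after `requests.get()` is one way to stop early instead of parsing an error page by accident.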

In [5]:
whos

Variable        Type        Data/Info
-------------------------------------
BeautifulSoup   type        <class 'bs4.BeautifulSoup'>
csv             module      <module 'csv' from 'C:\\U<...>\Anaconda3\\lib\\csv.py'>
requests        module      <module 'requests' from '<...>\\requests\\__init__.py'>
source          Response    <Response [200]>


In [6]:
help(source)

Help on Response in module requests.models object:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, *args)
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if

In [7]:
# To get the fetched content as a string, use the .text attribute
source = requests.get('http://coreyms.com').text

In [8]:
source

'<!DOCTYPE html>\n<html lang="en-US">\n<head >\n<meta charset="UTF-8" />\n<meta name="viewport" content="width=device-width, initial-scale=1" />\n\n\t<!-- This site is optimized with the Yoast SEO plugin v15.4 - https://yoast.com/wordpress/plugins/seo/ -->\n\t<title>CoreyMS - Development, Design, DIY, and more</title>\n\t<meta name="description" content="Development, Design, DIY, and more" />\n\t<meta name="robots" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />\n\t<link rel="canonical" href="https://coreyms.com/" />\n\t<link rel="next" href="https://coreyms.com/page/2" />\n\t<meta property="og:locale" content="en_US" />\n\t<meta property="og:type" content="website" />\n\t<meta property="og:title" content="CoreyMS - Development, Design, DIY, and more" />\n\t<meta property="og:description" content="Development, Design, DIY, and more" />\n\t<meta property="og:url" content="https://coreyms.com/" />\n\t<meta property="og:site_name" content="CoreyMS

In [9]:
type(source)

str

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [10]:
# Parse source with the lxml parser
soup = BeautifulSoup(source, 'lxml')

In [11]:
type(soup)

bs4.BeautifulSoup

In [12]:
print(soup)

<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- This site is optimized with the Yoast SEO plugin v15.4 - https://yoast.com/wordpress/plugins/seo/ -->
<title>CoreyMS - Development, Design, DIY, and more</title>
<meta content="Development, Design, DIY, and more" name="description"/>
<meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/>
<link href="https://coreyms.com/" rel="canonical"/>
<link href="https://coreyms.com/page/2" rel="next"/>
<meta content="en_US" property="og:locale"/>
<meta content="website" property="og:type"/>
<meta content="CoreyMS - Development, Design, DIY, and more" property="og:title"/>
<meta content="Development, Design, DIY, and more" property="og:description"/>
<meta content="https://coreyms.com/" property="og:url"/>
<meta content="CoreyMS" property="og:site_name"/>
<meta content="https://coreyms.com/wp-content/uploa

In [13]:
# prettify() is built into BeautifulSoup and prints the markup with indentation
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- This site is optimized with the Yoast SEO plugin v15.4 - https://yoast.com/wordpress/plugins/seo/ -->
  <title>
   CoreyMS - Development, Design, DIY, and more
  </title>
  <meta content="Development, Design, DIY, and more" name="description"/>
  <meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/>
  <link href="https://coreyms.com/" rel="canonical"/>
  <link href="https://coreyms.com/page/2" rel="next"/>
  <meta content="en_US" property="og:locale"/>
  <meta content="website" property="og:type"/>
  <meta content="CoreyMS - Development, Design, DIY, and more" property="og:title"/>
  <meta content="Development, Design, DIY, and more" property="og:description"/>
  <meta content="https://coreyms.com/" property="og:url"/>
  <meta content="CoreyMS" property="og:site_name"/>
  <meta content

In [14]:
help(soup)

Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(n

## First, compare against the content of Corey Schafer's web page:
http://coreyms.com

After running the next cell, also take a look at the generated cms_scrape.csv.

In [15]:
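# A preview of the program Corey Schafer builds in the video,
# including how to create a CSV file from scratch.
# newline='' is what the csv module expects; utf-8 keeps characters such as '–' and '…' intact.
csv_file = open('cms_scrape.csv', 'w', newline='', encoding='utf-8')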

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'video_link'])

for article in soup.find_all('article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.find('div', class_='entry-content').p.text
    print(summary)

    try:
        vid_src = article.find('iframe', class_='youtube-player')['src']

        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]

        yt_link = f'https://youtube.com/watch?v={vid_id}'
    except Exception as e:
        yt_link = None

    print(yt_link)

    print()

    csv_writer.writerow([headline, summary, yt_link])

csv_file.close()

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
https://youtube.com/watch?v=z0gguhEmWiY

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
https://youtube.com/watch?v=_P7X8tMplsw

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how t
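
The file handling in the cell above can also be written with a `with` block, which closes the file even if the loop raises. The following is a sketch under assumptions: the rows are hypothetical stand-ins for the scraped values, and the file is written to a temporary directory rather than to cms_scrape.csv.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped (headline, summary, video_link) values.
rows = [
    ("First post", "Summary with a comma, and a dash – here",
     "https://youtube.com/watch?v=z0gguhEmWiY"),
    ("Post without video", "No iframe on this one", None),
]

path = os.path.join(tempfile.mkdtemp(), "cms_scrape_demo.csv")

# newline='' lets the csv module control line endings;
# utf-8 keeps non-ASCII characters intact.
with open(path, "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["headline", "summary", "video_link"])
    csv_writer.writerows(rows)

# Read the file back to check what was written.
with open(path, newline="", encoding="utf-8") as csv_file:
    read_back = list(csv.reader(csv_file))

print(read_back[0])
print(read_back[2])   # None was written as an empty field
```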

## We want to collect three pieces of information:
1. headline
2. summary
2. summary
3. YouTube link to the video

In [16]:
# The title shown at the top of the browser window
soup.title

<title>CoreyMS - Development, Design, DIY, and more</title>

In [20]:
print(soup.title)

<title>CoreyMS - Development, Design, DIY, and more</title>


In [21]:
soup.title.text

'CoreyMS - Development, Design, DIY, and more'

In [22]:
type(soup.title)

bs4.element.Tag
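
To try the same steps without fetching a page, here is a minimal sketch on a tiny hypothetical HTML string. It uses `html.parser`, which ships with Python, so it does not need lxml; `soup_demo` is an illustrative name, not part of the notebook.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical page standing in for the real site.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
soup_demo = BeautifulSoup(html, "html.parser")

title_tag = soup_demo.title   # a bs4.element.Tag
print(title_tag)              # the whole element, markup included
print(title_tag.name)         # the tag name
print(title_tag.text)         # only the text inside the tag
```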

In [23]:
# find() returns the first element whose tag matches
article = soup.find('article')
type(article)

bs4.element.Tag

In [24]:
help(article)

Help on Tag in module bs4.element object:

class Tag(PageElement)
 |  Tag(parser=None, builder=None, name=None, namespace=None, prefix=None, attrs=None, parent=None, previous=None, is_xml=None, sourceline=None, sourcepos=None, can_be_empty_element=None, cdata_list_attributes=None, preserve_whitespace_tags=None)
 |  
 |  Represents an HTML or XML tag that is part of a parse tree, along
 |  with its attributes and contents.
 |  
 |  When Beautiful Soup parses the markup <b>penguin</b>, it will
 |  create a Tag object representing the <b> tag.
 |  
 |  Method resolution order:
 |      Tag
 |      PageElement
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      A tag is non-None even if it has no contents.
 |  
 |  __call__(self, *args, **kwargs)
 |      Calling a Tag like a function is the same as calling its
 |      find_all() method. Eg. tag('a') returns a list of all the A tags
 |      found within this tag.
 |  
 |  __contains__(self, x)
 |  
 |  __co

In [25]:
print(article.prettify())

<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
    Python Tutorial: Zip Files – Creating and Extracting Zip Archives
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
    November 19, 2019
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
    <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </spa

### The full output of print(article.prettify())
```HTML
<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
    Python Tutorial: Zip Files – Creating and Extracting Zip Archives
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
    November 19, 2019
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
    <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </span>
    </a>
   </span>
   <span class="entry-comments-link">
    <a href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives#respond">
     <span class="dsq-postid" data-dsqidentifier="1670 http://coreyms.com/?p=1670">
      Leave a Comment
     </span>
    </a>
   </span>
  </p>
 </header>
 <div class="entry-content" itemprop="text">
  <p>
   In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
   <br/>
  </p>
  <span class="embed-youtube" style="text-align:center; display: block;">
   <iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640">
   </iframe>
  </span>
 </div>
 <footer class="entry-footer">
  <p class="entry-meta">
   <span class="entry-categories">
    Filed Under:
    <a href="https://coreyms.com/category/development" rel="category tag">
     Development
    </a>
    ,
    <a href="https://coreyms.com/category/development/python" rel="category tag">
     Python
    </a>
   </span>
   <span class="entry-tags">
    Tagged With:
    <a href="https://coreyms.com/tag/gzip" rel="tag">
     gzip
    </a>
    ,
    <a href="https://coreyms.com/tag/shutil" rel="tag">
     shutil
    </a>
    ,
    <a href="https://coreyms.com/tag/zip" rel="tag">
     zip
    </a>
    ,
    <a href="https://coreyms.com/tag/zipfile" rel="tag">
     zipfile
    </a>
   </span>
  </p>
 </footer>
</article>
```

In [26]:
# Note the difference between find() and find_all(): the latter returns a list-like ResultSet
articles = soup.find_all('article')

In [27]:
# bs4.element.ResultSet is a subclass of list, so it can be used like one
type(articles)

bs4.element.ResultSet

In [28]:
articles

[<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork"><header class="entry-header"><h2 class="entry-title" itemprop="headline"><a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a></h2>
 <p class="entry-meta"><time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">November 19, 2019</time> by <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author"><span class="entry-author-name" itemprop="name">Corey Schafer</span></a></span> <span class="entry-comments-link"><a href="https:/

In [29]:
len(articles)

10
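
The difference between the two calls also shows up when nothing matches. A small sketch on hypothetical markup (`soup_demo` and the `video` tag are illustrative choices):

```python
from bs4 import BeautifulSoup

html = """
<main>
  <article><h2>First</h2></article>
  <article><h2>Second</h2></article>
  <article><h2>Third</h2></article>
</main>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first = soup_demo.find("article")          # first match only (a Tag)
every = soup_demo.find_all("article")      # all matches (a ResultSet)
missing = soup_demo.find("video")          # no match: None
none_found = soup_demo.find_all("video")   # no match: empty ResultSet

print(first.h2.text)
print([a.h2.text for a in every])
print(isinstance(every, list), len(every))
print(missing, len(none_found))
```

Because `find()` can return `None`, chaining attribute access onto its result fails when the element is absent, whereas looping over an empty `find_all()` result simply does nothing.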

In [30]:
help(articles)

Help on ResultSet in module bs4.element object:

class ResultSet(builtins.list)
 |  ResultSet(source, result=())
 |  
 |  A ResultSet is just a list that keeps track of the SoupStrainer
 |  that created it.
 |  
 |  Method resolution order:
 |      ResultSet
 |      builtins.list
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __getattr__(self, key)
 |      Raise a helpful exception to explain a common code fix.
 |  
 |  __init__(self, source, result=())
 |      Constructor.
 |      
 |      :param source: A SoupStrainer.
 |      :param result: A list of PageElements.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from builtins.list:
 |  
 |  __add__(se

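## How to use find(): locating an element with an XPath-like path
1. Find the headline you are interested in in the browser,
2. then use Inspect on it.
3. In the HTML code that comes up, you will see that every headline's content is the text inside article/header/h2/a.
4. In the HTML pane, the Copy > Copy XPath option gives something like:
/html/body/div/div/div/main/article[1]/header/h2/a \
(Note: in XPath the first child is numbered [1].)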
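
Beautiful Soup does not evaluate XPath expressions itself, but the path found with Inspect can be followed in two ways: dotted navigation, or a CSS selector passed to `select_one()`. The sketch below uses hypothetical markup that mimics the article/header/h2/a nesting; the selector strings are one possible translation of the XPath, not the only one.

```python
from bs4 import BeautifulSoup

html = """
<main>
  <article><header><h2><a href="/one">First headline</a></h2></header></article>
  <article><header><h2><a href="/two">Second headline</a></h2></header></article>
</main>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Dotted navigation: each attribute access takes the first matching descendant.
by_attributes = soup_demo.find("article").header.h2.a

# CSS selector: '>' means "direct child", mirroring the '/' steps of the XPath.
by_selector = soup_demo.select_one("article > header > h2 > a")

# XPath's article[2] can be approximated with :nth-of-type(2), which is also 1-based.
second = soup_demo.select_one("article:nth-of-type(2) > header > h2 > a")

print(by_attributes.text)
print(by_selector["href"])
print(second.text)
```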

In [36]:
# How to get the headline (the article's title)
# /html/body/div/div/div/main/article[1]/header/h2/a
headline = soup.find("article").header.h2.a

In [37]:
print(headline)

<a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a>


In [38]:
headline.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'

In [39]:
#/html/body/div/div/div/main/article[1]/header/h2/a
article.header

<header class="entry-header"><h2 class="entry-title" itemprop="headline"><a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a></h2>
<p class="entry-meta"><time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">November 19, 2019</time> by <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author"><span class="entry-author-name" itemprop="name">Corey Schafer</span></a></span> <span class="entry-comments-link"><a href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives#respond"><span class="dsq-postid" data-dsqidentifier="1670 http://coreyms.com/?p=1670">Leave a Comment</span></a></span> </p></header>

In [40]:
print(article.header.prettify())

<header class="entry-header">
 <h2 class="entry-title" itemprop="headline">
  <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
   Python Tutorial: Zip Files – Creating and Extracting Zip Archives
  </a>
 </h2>
 <p class="entry-meta">
  <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
   November 19, 2019
  </time>
  by
  <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
   <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
    <span class="entry-author-name" itemprop="name">
     Corey Schafer
    </span>
   </a>
  </span>
  <span class="entry-comments-link">
   <a href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives#respond">
    <span class="dsq-postid" data-dsqidentifier="1670 http://corey

In [41]:
#/html/body/div/div/div/main/article[1]/header/h2/a
article.header.h2

<h2 class="entry-title" itemprop="headline"><a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a></h2>

In [42]:
print(article.header.h2.prettify())

<h2 class="entry-title" itemprop="headline">
 <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
  Python Tutorial: Zip Files – Creating and Extracting Zip Archives
 </a>
</h2>



In [43]:
print(article.header.h2.a.prettify())

<a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
 Python Tutorial: Zip Files – Creating and Extracting Zip Archives
</a>


## Shortening the XPath: when there is no ambiguity, you can skip the intermediate nodes and go straight to the target node

In [44]:
# /html/body/div/div/div/main/article[1]/header/h2/a
# Skip the intermediate header
# In other words, this is equivalent to article.header.h2
article.h2

<h2 class="entry-title" itemprop="headline"><a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a></h2>

In [45]:
print(article.h2.prettify())

<h2 class="entry-title" itemprop="headline">
 <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
  Python Tutorial: Zip Files – Creating and Extracting Zip Archives
 </a>
</h2>



In [46]:
article.h2.a.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'

In [47]:
# Same idea
article.h2.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'

In [48]:
# Same idea
article.a.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'

In [49]:
# However...
# this returns all of the text under article
# Compare with the article's HTML code shown earlier
article.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives\nNovember 19, 2019 by Corey Schafer Leave a Comment \nIn this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…\n\nFiled Under: Development, Python Tagged With: gzip, shutil, zip, zipfile'

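### This surprised me a little: the text belongs to the a under h2, yet h2.text returns it too. The reason is that .text joins the strings of all descendants, so h2.text includes the text of its child a.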

In [50]:
article.h2.a.text

'Python Tutorial: Zip Files – Creating and Extracting Zip Archives'
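
The behaviour of `.text` at different levels can be checked on a small hypothetical fragment. The sketch also shows `.string`, a related attribute that only returns a value when there is a single-child chain down to the text; `article_demo` is an illustrative name.

```python
from bs4 import BeautifulSoup

html = '<article><h2><a href="#">Headline</a></h2><p>Body text</p></article>'
article_demo = BeautifulSoup(html, "html.parser").article

print(article_demo.h2.a.text)    # text directly inside <a>
print(article_demo.h2.text)      # same: <a> is the only descendant carrying text
print(article_demo.text)         # headline and body text joined together
print(article_demo.h2.string)    # .string follows the single-child chain to the text
print(article_demo.string)       # None: <article> has more than one child
```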

## Next, we look for the summary
/html/body/div/div/div/main/article[1]/div/p

In [52]:
# /html/body/div/div/div/main/article[1]/div/p
article.div

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>

In [51]:
# /html/body/div/div/div/main/article[1]/div/p
article.div.text

'\nIn this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…\n\n'

In [53]:
# /html/body/div/div/div/main/article[1]/header/p/time
# /html/body/div/div/div/main/article[1]/header/p
article.time.text

'November 19, 2019'

## soup.*find*("article") returns only the *first* element whose tag is "article"; to get *all* elements whose tag is "article", use soup.*find_all*("article"), which returns a list-like ResultSet

In [54]:
articles = soup.find_all('article')

In [55]:
articles

[<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork"><header class="entry-header"><h2 class="entry-title" itemprop="headline"><a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">Python Tutorial: Zip Files – Creating and Extracting Zip Archives</a></h2>
 <p class="entry-meta"><time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">November 19, 2019</time> by <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person"><a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author"><span class="entry-author-name" itemprop="name">Corey Schafer</span></a></span> <span class="entry-comments-link"><a href="https:/

In [56]:
type(articles)

bs4.element.ResultSet

In [57]:
len(articles)

10

## Next, we find the summary

In [59]:
soup.find_all('div')

[<div class="site-container"><header class="site-header" itemscope="" itemtype="https://schema.org/WPHeader"><div class="wrap"><div class="title-area"> <div class="site-avatar">
 <a href="https://coreyms.com/"><svg class="site-avatar-svg" enable-background="new 0 0 441.5 441.5" height="150px" id="Layer_1" version="1.1" viewbox="0 0 441.5 441.5" width="150px" x="0px" xml:space="preserve" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" y="0px">
 <g class="site-avatar-background">
 <path d="M79.178 390.133C30.781 349.639 0 288.789 0 220.75C0 98.833 98.833 0 220.75 0S441.5 98.833 441.5 220.75 c0 63.558-26.86 120.842-69.848 161.12" fill="#56616B"></path>
 </g>
 <g class="site-avatar-foreground">
 <path d="M254.602 182.291c0 1.992-0.097 4 0 6c0.057 0.88-0.093 1.952 0.194 2.78c0.22 0.631 0.69 1.12 1.704 1.273 c2.009 0.3 3.436-1.062 4.384-2.719c0.712-1.244 0.863-3.376 1.843-3.807c1.612-0.712 2.646-1.537 3.276-2.44 c1.903-2.732 0.09-6.185-0.723-9.42c-0.29-1.157-1.9

In [60]:
# Narrow down find(): match on the tag plus an attribute (here, the class)
# /html/body/div/div/div/main/article[1]/div/p/text()
article.find('div', class_='entry-content')

<div class="entry-content" itemprop="text">
<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>
<span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640"></iframe></span>
</div>
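
The keyword is spelled `class_` because `class` is a reserved word in Python. The sketch below, on hypothetical markup, shows how the filter behaves when an element carries several classes, as the article tags on this site do.

```python
from bs4 import BeautifulSoup

html = """
<article class="post-1 post entry">
  <div class="entry-header">header</div>
  <div class="entry-content wide">content</div>
</article>
"""
article_demo = BeautifulSoup(html, "html.parser").article

print(article_demo["class"])                                   # class is stored as a list
print(article_demo.find("div", class_="entry-content").text)   # matches one of several classes
print(article_demo.find("div", attrs={"class": "wide"}).text)  # attrs dict form of the same filter
print(article_demo.find("div", class_="entry"))                # no partial matches: None
```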

In [61]:
print(article.find('div', class_='entry-content').prettify())

<div class="entry-content" itemprop="text">
 <p>
  In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
  <br/>
 </p>
 <span class="embed-youtube" style="text-align:center; display: block;">
  <iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640">
  </iframe>
 </span>
</div>



### The content of article.find('div', class_='entry-content'):
```HTML
<div class="entry-content" itemprop="text">
 <p>
  In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
  <br/>
 </p>
 <span class="embed-youtube" style="text-align:center; display: block;">
  <iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640">
  </iframe>
 </span>
</div>
```

In [62]:
article.find('div', class_='entry-content').p

<p>In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…<br/></p>

In [63]:
article.find('div', class_='entry-content').p.text

'In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…'

In [64]:
article.find('div', class_='entry-content').text

'\nIn this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…\n\n'

In [65]:
print(article.find('div', class_='entry-content').text)


In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…




## The div returned by article.find('div', class_='entry-content') happens to contain exactly one p.

In [66]:
summary = article.find('div', class_='entry-content').p.text

In [67]:
print(summary)

In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
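
Not every post is guaranteed to have an entry-content div with a paragraph inside. As a hedged sketch, a small helper (the name `get_summary` and the markup are hypothetical) can return `None` instead of raising when the element is missing, and `get_text(strip=True)` trims the surrounding whitespace:

```python
from bs4 import BeautifulSoup

html = """
<article><h2><a>With summary</a></h2>
  <div class="entry-content"><p>  A short summary.  <br/></p></div></article>
<article><h2><a>No summary</a></h2></article>
"""
soup_demo = BeautifulSoup(html, "html.parser")

def get_summary(article):
    """Return the first paragraph of the entry-content div, or None if absent."""
    content = article.find("div", class_="entry-content")
    if content is None or content.p is None:
        return None
    return content.p.get_text(strip=True)   # strip=True trims surrounding whitespace

summaries = [get_summary(a) for a in soup_demo.find_all("article")]
print(summaries)
```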


## Now we extract the YouTube URL

## The HTML <iframe\> tag (inline frame)

https://www.fooish.com/html/iframe-tag.html
    
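The <iframe\> tag (inline frame) embeds another HTML page inside an HTML page; a common example is a blog that uses iframe markup to embed a Facebook fan page or a Like button plugin.

For example, the following uses an iframe to embed the home page of fooish.com: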

```HTML
<iframe src="http://www.fooish.com/">
  Your browser does not support iframes
</iframe>
```

In [68]:
# /html/body/div/div/div/main/article[1]/div/span/iframe
article.find('iframe')

<iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640"></iframe>

```HTML
<iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640"></iframe>
```

### The key part is the YouTube video ID that follows src, here "z0gguhEmWiY":
src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp

In [69]:
print(article.find('iframe').prettify())

<iframe allowfullscreen="true" class="youtube-player" height="360" sandbox="allow-scripts allow-same-origin allow-popups allow-presentation" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3&amp;rel=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;fs=1&amp;hl=en-US&amp;autohide=2&amp;wmode=transparent" style="border:0;" width="640">
</iframe>


## What we want is https://www.youtube.com/embed/z0gguhEmWiY, that is, everything before the "?"

In [70]:
# First get the value of the "src" attribute:
vid_src = article.find('iframe', class_='youtube-player')['src']
vid_src

'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'

In [71]:
# Split on '/'
vid_id = vid_src.split('/')
vid_id

['https:',
 '',
 'www.youtube.com',
 'embed',
 'z0gguhEmWiY?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent']

In [72]:
# Take the element at index 4 (the fifth item in the list)
vid_id = vid_id[4]
vid_id

'z0gguhEmWiY?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent'

In [73]:
# Then split on "?" and keep the part before it
vid_id = vid_id.split('?')[0]
vid_id

'z0gguhEmWiY'

In [74]:
# Then build the watch URL string
yt_link = f'https://youtube.com/watch?v={vid_id}'
yt_link

'https://youtube.com/watch?v=z0gguhEmWiY'

In [75]:
# we now put everything together
vid_src = article.find('iframe', class_='youtube-player')['src']
vid_id = vid_src.split('/')[4]
vid_id = vid_id.split('?')[0]
yt_link = f'https://youtube.com/watch?v={vid_id}'
yt_link

'https://youtube.com/watch?v=z0gguhEmWiY'
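
The two `split()` calls above depend on the video ID always being the element at index 4. As a minimal sketch of an alternative, the standard library's `urllib.parse.urlsplit` can separate the path from the query string first. The helper name `embed_to_watch_url` is made up for illustration and is not part of the original tutorial; the sample `src` is a shortened copy of the one scraped above.

```python
from urllib.parse import urlsplit


def embed_to_watch_url(embed_src):
    """Turn a YouTube embed URL into a watch URL (hypothetical helper)."""
    # urlsplit separates the path from the query string,
    # so there is no need to split on '?' by hand
    path = urlsplit(embed_src).path            # '/embed/z0gguhEmWiY'
    vid_id = path.rstrip('/').split('/')[-1]   # the last path segment is the video ID
    return f'https://youtube.com/watch?v={vid_id}'


# Shortened copy of the src attribute scraped above
src = ('https://www.youtube.com/embed/z0gguhEmWiY'
       '?version=3&rel=1&showsearch=0&showinfo=1')
watch_url = embed_to_watch_url(src)
print(watch_url)  # https://youtube.com/watch?v=z0gguhEmWiY
```

Taking the last path segment instead of a fixed index means the helper does not care how many segments come before the ID.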

In [76]:
# Now run the same steps for every article on the page
for article in soup.find_all('article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.find('div', class_='entry-content').p.text
    print(summary)

    try:
        # find() returns None when the article has no embedded video,
        # so subscripting the result raises TypeError
        vid_src = article.find('iframe', class_='youtube-player')['src']
        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]
        yt_link = f'https://youtube.com/watch?v={vid_id}'

    except (TypeError, KeyError, IndexError):
        yt_link = None

    print(yt_link)

    print()

#     csv_writer.writerow([headline, summary, yt_link])

# csv_file.close()

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
https://youtube.com/watch?v=z0gguhEmWiY

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
https://youtube.com/watch?v=_P7X8tMplsw

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how t
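
The `try` block in the loop exists because some articles have no embedded video. Here is a minimal sketch of the same idea with an explicit `None` check instead of exception handling. The HTML is made up for illustration, and the names `demo_soup` and `art` were chosen so they do not overwrite the notebook's `soup` and `article`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: the second article has no video
html = """
<article><h2><a href="#">With video</a></h2>
  <iframe class="youtube-player" src="https://www.youtube.com/embed/z0gguhEmWiY?version=3"></iframe>
</article>
<article><h2><a href="#">Without video</a></h2></article>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

links = []
for art in demo_soup.find_all('article'):
    iframe = art.find('iframe', class_='youtube-player')
    if iframe is None:
        # No embedded player in this article
        links.append(None)
        continue
    vid_id = iframe['src'].split('/')[4].split('?')[0]
    links.append(f'https://youtube.com/watch?v={vid_id}')

print(links)  # ['https://youtube.com/watch?v=z0gguhEmWiY', None]
```

The explicit check handles only the "no iframe" case, so any other unexpected error still surfaces instead of being silently turned into `None`.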

## Also write the scraped results to a CSV file

In [77]:
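# Open a file for writing; newline='' keeps the csv module from adding blank rows on Windows
csv_file = open('cms_scrape.csv', 'w', newline='', encoding='utf-8')
# Create a writer object with csv.writer()
csv_writer = csv.writer(csv_file)
# Write the header row of the CSV file
csv_writer.writerow(['headline', 'summary', 'video_link'])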

for article in soup.find_all('article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.find('div', class_='entry-content').p.text
    print(summary)

    try:
        # find() returns None when the article has no embedded video,
        # so subscripting the result raises TypeError
        vid_src = article.find('iframe', class_='youtube-player')['src']
        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]
        yt_link = f'https://youtube.com/watch?v={vid_id}'

    except (TypeError, KeyError, IndexError):
        yt_link = None

    print(yt_link)

    print()
    
    csv_writer.writerow([headline, summary, yt_link])

csv_file.close()

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
https://youtube.com/watch?v=z0gguhEmWiY

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
https://youtube.com/watch?v=_P7X8tMplsw

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how t
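
To check what ended up in the CSV file, it helps to read it back. The sketch below writes a few made-up rows (stand-ins for the scraped headline, summary and video_link values) to a temporary file with a `with` block, then reads them back with `csv.reader`. The file name `demo_scrape.csv` and the row contents are assumptions for illustration only.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the scraped (headline, summary, video_link) values
rows = [
    ['Post A', 'Summary with a comma, inside', 'https://youtube.com/watch?v=z0gguhEmWiY'],
    ['Post B', 'No video here', None],
]

path = os.path.join(tempfile.mkdtemp(), 'demo_scrape.csv')

# newline='' is what the csv module documentation recommends;
# the with-block closes the file even if writing a row raises an error
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['headline', 'summary', 'video_link'])
    writer.writerows(rows)

# Read the file back to confirm the contents
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back[1])
print(read_back[2])
```

Two things to notice when reading back: the summary containing a comma comes back as a single field because `csv.writer` quotes it, and a `None` link is stored as an empty string.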