# Data Extraction with Python
- categories: [Python, DataExtraction, BeautifulSoup]

A data project often starts with extracting data from various kinds of sources (web pages, Word documents, Excel files, databases, and so on). In this blog post (perhaps the first of a series), I will cover several libraries and tricks I have used.

## Data in html format

The most common way to work with data in HTML format (i.e. a downloaded web page) is probably <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/'>BeautifulSoup</a>. It is a powerful external library: it does not ship with Python itself and has to be installed, usually with pip (other installation methods are described on the documentation page).

In [1]:
#hide
# ! pip install beautifulsoup4

To begin with, we have to import it:

In [6]:
from bs4 import BeautifulSoup

# if you have a saved file, you can open it for reading
# ('r' mode; opening with 'w' would erase the file's contents):
with open('path_to_html_file', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')
    print(soup.prettify())

BeautifulSoup also accepts the HTML as a text string.  
The prettify() function displays the HTML in a form that is easy for humans to read (the computer doesn't care at all!)


In [1]:
#collapse
# as an example, I kept only the content inside the body tag
html_sample = '''
<body class="review-template-default single single-review postid-7685 single-format-standard header-image content-sidebar genesis-breadcrumbs-hidden unknown-os ie11 feature-top-outside site-fluid override"><div class="site-container"><ul class="genesis-skip-link"><li><a href="#genesis-nav-primary" class="screen-reader-shortcut"> Skip to primary navigation</a></li><li><a href="#genesis-content" class="screen-reader-shortcut"> Skip to main content</a></li><li><a href="#genesis-sidebar-primary" class="screen-reader-shortcut"> Skip to primary sidebar</a></li></ul><header class="site-header"><div class="wrap"><div class="title-area"><p class="site-title"><a href="https://www.coffeereview.com/">Coffee Review</a></p><p class="site-description">The World&#039;s Leading Coffee Guide</p></div><div class="widget-area header-widget-area"><section id="text-11" class="widget widget_text"><div class="widget-wrap"><div class="textwidget"><form role="search" id="searchform" method="get" action="https://www.coffeereview.com/"><div class="header_search_line_1"> <input type="radio" name="post_type" id="cr_reviews" value="review" checked="checked"> <label for="cr_reviews">Reviews</label> <input type="radio" name="post_type" id="cr_tasting_reports" value="post"> <label for="cr_tasting_reports">Tasting Reports</label></div><div class="header_search_line_2"> <input type="search" value="" placeholder="Enter search terms" size="18" maxlength="50" name="s" id="searchfield"> <input type="submit" value="Search" class="header_search_button"></div><div class="header_search_line_3"> <a href="/advanced-search/">Advanced Search</a></div></form></div></div></section><section id="text-12" class="widget widget_text"><div class="widget-wrap"><div class="textwidget"><p><a href="http://bit.ly/2r1silm" target="_blank" rel="noopener noreferrer"><img class="aligncenter size-full wp-image-19359" 
src="https://dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com/wp-content/uploads/2019/12/Hula-Daddy-Button-1st-place.jpg" alt="Shop for Award-Winning 100% Kona Coffees at Hula Daddy" width="195" height="90" /></a></p></div></div></section></div></div></header><div class="responsive-primary-menu-container"><h3 class="mobile-primary-toggle"></h3><div class="responsive-menu-icon"> <span class="responsive-icon-bar"></span> <span class="responsive-icon-bar"></span> <span class="responsive-icon-bar"></span></div></div><nav class="nav-primary" aria-label="Main" id="genesis-nav-primary"><div class="wrap"><ul id="menu-main" class="menu genesis-nav-menu menu-primary js-superfish"><li id="menu-item-4998" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-has-children menu-item-4998"><a href="https://www.coffeereview.com/review/"><span >Reviews</span></a><ul class="sub-menu"><li id="menu-item-13553" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13553"><a href="https://www.coffeereview.com/review/"><span >Latest Reviews</span></a></li><li id="menu-item-13533" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-13533"><a href="https://www.coffeereview.com/highest-rated-coffees/"><span >Top-Rated (94+)</span></a></li><li id="menu-item-18958" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-18958"><a href="https://coffeereview.com/types/espresso/"><span >Espressos</span></a></li><li id="menu-item-13534" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13534"><a href="https://coffeereview.com/types/best-value-coffees/"><span >Best Values</span></a></li><li id="menu-item-19057" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-19057"><a href="https://www.coffeereview.com/top-30-coffees-2019/"><span >Top 30 Coffees of 2019</span></a></li><li id="menu-item-19632" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-19632"><a 
href="https://www.coffeereview.com/types/coffees-from-taiwan/"><span >Taiwan Coffees &#8211; 台灣送評的咖啡豆</span></a></li><li id="menu-item-19160" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-19160"><a href="https://www.coffeereview.com/top-30-coffees-past-rankings/"><span >Top 30 &#8211; Past Rankings</span></a></li><li id="menu-item-13536" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13536"><a href="https://coffeereview.com/types/single-serve-capsule/"><span >Pods and Capsules</span></a></li><li id="menu-item-18959" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18959"><a href="https://www.coffeereview.com/best-coffee-cities/"><span >Reviews by U.S. City</span></a></li><li id="menu-item-13857" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13857"><a href="https://www.coffeereview.com/advanced-search/"><span >Advanced Search</span></a></li></ul></li><li id="menu-item-18960" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-has-children menu-item-18960"><a href="https://www.coffeereview.com/category/articles/"><span >Reports</span></a><ul class="sub-menu"><li id="menu-item-15819" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-15819"><a href="https://www.coffeereview.com/category/articles/"><span >Latest Reports</span></a></li><li id="menu-item-18961" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18961"><a href="https://www.coffeereview.com/category/articles/africa/"><span >Africa</span></a></li><li id="menu-item-18962" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18962"><a href="https://www.coffeereview.com/category/articles/americas/"><span >Americas</span></a></li><li id="menu-item-18964" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18964"><a 
href="https://www.coffeereview.com/category/articles/asia-pacific-coffees/"><span >Asia-Pacific</span></a></li><li id="menu-item-18966" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18966"><a href="https://www.coffeereview.com/category/articles/espressos/"><span >Espressos</span></a></li><li id="menu-item-18963" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18963"><a href="https://www.coffeereview.com/category/articles/annual-top-30/"><span >Annual Top 30</span></a></li><li id="menu-item-18967" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18967"><a href="https://www.coffeereview.com/category/articles/tasting-report-processing-method/"><span >Processing Method</span></a></li><li id="menu-item-18968" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18968"><a href="https://www.coffeereview.com/category/articles/tasting-reports-social-environmental/"><span >Social/Environmental</span></a></li><li id="menu-item-18969" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18969"><a href="https://www.coffeereview.com/category/articles/tasting-reports-tree-variety/"><span >Tree Variety</span></a></li><li id="menu-item-18965" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-18965"><a href="https://www.coffeereview.com/category/articles/coffee-and-espresso-blends/"><span >Blends</span></a></li></ul></li><li id="menu-item-15781" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-has-children menu-item-15781"><a href="https://www.coffeereview.com/category/equipment-reports/"><span >Equipment</span></a><ul class="sub-menu"><li id="menu-item-19950" class="menu-item menu-item-type-post_type menu-item-object-post menu-item-19950"><a href="https://www.coffeereview.com/burr-coffee-grinder-reviews/"><span >Mid-Range Burr Coffee Grinders</span></a></li><li id="menu-item-19622" 
class="menu-item menu-item-type-post_type menu-item-object-post menu-item-19622"><a href="https://www.coffeereview.com/equipment-report-digital-electric-gooseneck-pourover-kettles/"><span >Electric Gooseneck Kettles</span></a></li><li id="menu-item-19586" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19586"><a href="https://www.coffeereview.com/interpreting-equipment-ratings/"><span >Interpreting Equipment Ratings</span></a></li></ul></li><li id="menu-item-12025" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-has-children menu-item-12025"><a href="https://www.coffeereview.com/category/blog/"><span >Journal</span></a><ul class="sub-menu"><li id="menu-item-19959" class="menu-item menu-item-type-post_type menu-item-object-post menu-item-19959"><a href="https://www.coffeereview.com/recognizing-the-coffees-communities-and-contributions-of-black-owned-coffee-companies/"><span >Recognizing Black-Owned Coffee Companies</span></a></li><li id="menu-item-19650" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-19650"><a href="https://www.coffeereview.com/category/covid-19-info/"><span >COVID-19 Information</span></a></li></ul></li><li id="menu-item-18977" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-has-children menu-item-18977"><a href="https://www.coffeereview.com/our-story/"><span >About</span></a><ul class="sub-menu"><li id="menu-item-18978" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18978"><a href="https://www.coffeereview.com/our-story/"><span >Our Story</span></a></li><li id="menu-item-18971" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18971"><a href="https://www.coffeereview.com/kennethdavids/"><span >Kenneth Davids</span></a></li><li id="menu-item-13913" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13913"><a href="https://www.coffeereview.com/our-team/"><span >Our 
Team</span></a></li><li id="menu-item-19050" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19050"><a href="https://www.coffeereview.com/advertisers/"><span >Our Advertisers</span></a></li><li id="menu-item-19677" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19677"><a href="https://www.coffeereview.com/our-sponsors/"><span >Our Sponsors</span></a></li><li id="menu-item-18975" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-has-children menu-item-18975"><a href="https://www.coffeereview.com/learn/"><span >Learn</span></a><ul class="sub-menu"><li id="menu-item-18973" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18973"><a href="https://www.coffeereview.com/interpret-coffee/"><span >Interpreting Coffee Reviews</span></a></li><li id="menu-item-15773" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-15773"><a href="https://www.coffeereview.com/coffee-reference/"><span >Reference</span></a></li><li id="menu-item-18974" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18974"><a href="https://www.coffeereview.com/coffee-glossary/"><span >Glossary</span></a></li></ul></li><li id="menu-item-8925" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-8925"><a href="https://www.coffeereview.com/contact/"><span >Contact Us</span></a></li></ul></li><li id="menu-item-18956" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-has-children menu-item-18956"><a href="#"><span >Trade</span></a><ul class="sub-menu"><li id="menu-item-13549" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-13549"><a href="https://www.coffeereview.com/calendar/"><span >Tasting Report Calendar</span></a></li><li id="menu-item-13543" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-13543"><a href="https://www.coffeereview.com/advertising/"><span >Becoming an 
Advertiser</span></a></li><li id="menu-item-19389" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19389"><a href="https://www.coffeereview.com/what-we-would-do-campaign-packages/"><span >Campaign Package Deals</span></a></li><li id="menu-item-13548" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-13548"><a href="https://www.coffeereview.com/review-services/"><span >Getting Coffees Reviewed</span></a></li><li id="menu-item-18970" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-18970"><a href="https://www.coffeereview.com/guidelines/"><span >Quoting Reviews</span></a></li><li id="menu-item-19643" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19643"><a href="https://www.coffeereview.com/award-certificates/"><span >Award Certificates</span></a></li><li id="menu-item-13544" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13544"><a target="_blank" rel="noopener noreferrer" href="https://dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com/wp-content/uploads/2020/01/CR_Media_Kit_Jan_2020_v5.pdf"><span >Media Kit</span></a></li></ul></li><li id="menu-item-19401" class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-has-children menu-item-19401"><a href="https://www.coffeereview.com/category/blog/green-coffee-origins-and-issues/"><span >中文 &#8211; Chinese</span></a><ul class="sub-menu"><li id="menu-item-13537" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-13537"><a href="/types/coffees-from-taiwan/"><span >台灣送評的咖啡豆</span></a></li><li id="menu-item-19392" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19392"><a href="https://www.coffeereview.com/%e5%a6%82%e4%bd%95%e5%b0%87%e6%82%a8%e7%9a%84%e5%92%96-%e5%95%a1%e9%80%81%e8%a9%95/"><span >如何將您的咖啡送評</span></a></li><li id="menu-item-19400" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19400"><a 
href="https://www.coffeereview.com/%e8%a1%8c%e9%8a%b7%e6%94%bb%e7%95%a5-%e4%bf%83%e9%8a%b7%e6%b4%bb%e5%8b%95/"><span >“行銷攻略” 促銷活動</span></a></li></ul></li><li id="menu-item-19674" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-has-children menu-item-19674"><a href="https://www.coffeereview.com/why-become-a-member/"><span >Members</span></a><ul class="sub-menu"><li id="menu-item-19676" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19676"><a href="https://www.coffeereview.com/why-become-a-member/"><span >WHY BECOME A MEMBER?</span></a></li><li id="menu-item-19673" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19673"><a href="https://www.coffeereview.com/member-benefits/"><span >Member Benefits</span></a></li><li id="menu-item-19675" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19675"><a href="https://www.coffeereview.com/our-sponsors/"><span >Our Sponsors</span></a></li><li id="menu-item-19671" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19671"><a href="https://www.coffeereview.com/programs-and-initiatives/"><span >Programs and Initiatives</span></a></li><li id="menu-item-19672" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-19672"><a href="https://www.coffeereview.com/member-support/"><span >Member Support</span></a></li></ul></li></ul></div></nav><div class="site-inner"><div class="content-sidebar-wrap"><main class="content" id="genesis-content"><article class="post-7685 review type-review status-publish format-standard types-supermarket entry override"><header class="entry-header"></header><div class="entry-content"><div class="review-template"><div class="row row-1"><div class="column col-1"> <span class="review-template-rating">84</span></div><div class="column col-2"><p class="review-roaster">Yuban</p><h1 class="review-title">100% Colombian</h1></div><div class="column col-3"></div></div><div 
class="row row-2"><div class="column col-1"><table class="review-template-table"><tr><td>Roaster Location:</td><td>Rye Brook, New York</td></tr><tr><td>Roast Level:</td><td>Very Dark</td></tr><tr><td>Agtron:</td><td>0/46</td></tr></table></div><div class="column col-2"><table class="review-template-table"><tr><td>Review Date:</td><td>March 2002</td></tr><tr><td>Aroma:</td><td>5</td></tr><tr><td>Acidity :</td><td>5</td></tr><tr><td>Body:</td><td>7</tr><tr><td>Flavor:</td><td>7</td></tr></table></div></div><p><strong>Blind Assessment: </strong>Sweet, balanced, restrained. The fruit and vegetal cocoa notes are faint but pleasant, the roasty tones understated.</p><p><strong>Notes: </strong>Sold roasted and ground in a 12-ounce can; 33 cents per ounce. Visit <website link="www.yuban.com">www.yuban.com</website> or call 1-800-982-2649 for more info.</p><p><strong> Who Should Drink It : </strong>The sort of self-effacing, versatile coffee that doesn't call attention to itself, yet three cups later you're still pouring it.</p><div class="row row-3"><div class="column col-1"> <strong><a href="/all-reviews/?roaster_name=Yuban" title="Show all reviews for this roaster">Show all reviews for this roaster</a></strong></div><div class="column col-2"></div></div></div><p style="text-align: center;"><br />This review originally appeared in the March, 2002 tasting report: <a href="/the-robusta-fuss" style="font-weight: bold;">The Robusta Fuss</a></p></div><footer class="entry-footer"></footer></article><img src="https://dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com/wp-content/themes/dynamik-gen/images/content-filler.png" class="dynamik-content-filler-img" alt=""></main><aside class="sidebar sidebar-primary widget-area" role="complementary" aria-label="Primary Sidebar" id="genesis-sidebar-primary"><h2 class="genesis-sidebar-title screen-reader-text">Primary Sidebar</h2><section id="cr_advertiser_widget-2" class="widget widget_cr_advertiser_widget"><div class="widget-wrap"><div 
id="cr-advertiser-widget-content"></div> <script>var passedArray = [{"url":"https:\/\/espressorepublic.com\/shop\/","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2017\/05\/Coffee-Review-Ad-2-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"\" \/>","excerpt":""},{"url":"https:\/\/paradiseroasters.com","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/paradise-cr-1-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for top-rated coffees at Paradise Roasters\" srcset=\"https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2014\/04\/paradise-cr-1-300x190.jpg 300w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2014\/04\/paradise-cr-1.jpg 600w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/>","excerpt":"Elite single-origin coffees and top-rated espressos."},{"url":"https:\/\/www.jbccoffeeroasters.com\/product-category\/coffee\/","thumb":"<img width=\"300\" height=\"189\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2018\/11\/JBC-300x190-Dec-2018-300x189.jpeg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for Top-Rated Coffees at JBC Coffee Roasters\" srcset=\"https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2018\/11\/JBC-300x190-Dec-2018-300x189.jpeg 300w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2018\/11\/JBC-300x190-Dec-2018-768x485.jpeg 768w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2018\/11\/JBC-300x190-Dec-2018.jpeg 1020w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/>","excerpt":"JBC Coffee Roasters of Madison, Wisconsin is a distinguished small-batch roaster that has produced numerous 90+ point coffees."},{"url":"https:\/\/www.willoughbyscoffee.com\/","thumb":"<img width=\"300\" height=\"190\" 
src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/CR_Willoughbys_300x190_vA-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Visit Willoughby&#039;s Coffee And Tea\" \/>","excerpt":""},{"url":"https:\/\/bit.ly\/2V86Iaw","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/06\/Thanksgiving_Coffee-CR_ad-2020_Mocha-Java.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for 96-point Mocha Java at Thanksgiving Coffee\" \/>","excerpt":"Thanksgiving Coffee was founded in 1972 by Paul and Joan Katzeff. From their roastery in the small town of Fort Bragg, California comes amazing single origins, complex blends and stunning espressos. "},{"url":"https:\/\/www.1stincoffee.com","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/CR_firstincoffee_300x190-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"1st in Coffee Logo\" \/>","excerpt":"Superior service and low prices on top-quality espresso machines, coffee equipment, and accessories.  
Free shipping."},{"url":"https:\/\/jackrabbitjava.com\/","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2018\/11\/2x_BannerAd_JRJV_2020-300x190.png\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for top-rated coffees at great prices at Jackrabbit Java\" srcset=\"https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2018\/11\/2x_BannerAd_JRJV_2020-300x190.png 300w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2018\/11\/2x_BannerAd_JRJV_2020.png 600w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/>","excerpt":""},{"url":"http:\/\/www.mysticmonkcoffee.com","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/CR_mysticmonk_300x190-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Mystic Monk Coffee Ad\" \/>","excerpt":"Gourmet coffees roasted by the Carmelite Monks at their monastery in the Rocky Mountains of northern Wyoming."},{"url":"https:\/\/greatergoodsroasting.com\/collections\/all-coffee","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2016\/03\/GG-CR-LTO-300x190@2X-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for top-rated coffees at Greater Goods\" srcset=\"https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2016\/03\/GG-CR-LTO-300x190@2X-300x190.jpg 300w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2016\/03\/GG-CR-LTO-300x190@2X.jpg 600w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/>","excerpt":""},{"url":"http:\/\/www.lexingtoncoffee.com","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/Lexington-300x190-Ad-Dec-2018-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for top-rated coffees at Lexington 
Coffee Roasters\" srcset=\"https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2014\/04\/Lexington-300x190-Ad-Dec-2018-300x190.jpg 300w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2014\/04\/Lexington-300x190-Ad-Dec-2018-768x486.jpg 768w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2014\/04\/Lexington-300x190-Ad-Dec-2018-1024x649.jpg 1024w, https:\/\/dlo9n43mpvj20faa12i3i3lh-wpengine.netdna-ssl.com\/wp-content\/uploads\/2014\/04\/Lexington-300x190-Ad-Dec-2018.jpg 2048w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/>","excerpt":"\"Uncommon & Uncompromised\" artisan coffees roasted to order and shipped the same day."},{"url":"https:\/\/www.ptscoffee.com","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/PTs-300x190-banner-300x190.png\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for top-rated coffees at PT&#039;s Coffee\" \/>","excerpt":"Award-winning single origin coffees and top-of-the-line equipment for homes and businesses."},{"url":"https:\/\/www.klatchroasting.com\/products\/out-of-africa-blend","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2018\/11\/CR_OUT_OF_AFR.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for African coffees at Klatch Coffee\" \/>","excerpt":""},{"url":"https:\/\/www.templecoffee.com","thumb":"<img width=\"300\" height=\"190\" src=\"https:\/\/www.coffeereview.com\/wp-content\/uploads\/2014\/04\/Coffee-Review-Ad-Decv2-300x190.jpg\" class=\"attachment-medium size-medium wp-post-image\" alt=\"Shop for top-rated coffees at Temple Coffee\" \/>","excerpt":"Temple Coffee specializing in artisan coffees from individual farms and cooperatives."}]; 
'''

In [2]:
#collapse-output
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_sample, 'html.parser')
print(soup.prettify())

<body class="review-template-default single single-review postid-7685 single-format-standard header-image content-sidebar genesis-breadcrumbs-hidden unknown-os ie11 feature-top-outside site-fluid override">
 <div class="site-container">
  <ul class="genesis-skip-link">
   <li>
    <a class="screen-reader-shortcut" href="#genesis-nav-primary">
     Skip to primary navigation
    </a>
   </li>
   <li>
    <a class="screen-reader-shortcut" href="#genesis-content">
     Skip to main content
    </a>
   </li>
   <li>
    <a class="screen-reader-shortcut" href="#genesis-sidebar-primary">
     Skip to primary sidebar
    </a>
   </li>
  </ul>
  <header class="site-header">
   <div class="wrap">
    <div class="title-area">
     <p class="site-title">
      <a href="https://www.coffeereview.com/">
       Coffee Review
      </a>
     </p>
     <p class="site-description">
      The World's Leading Coffee Guide
     </p>
    </div>
    <div class="widget-area header-widget-area">
     <section 

The sample we use here is one of the pages on <a href='https://www.coffeereview.com/advanced-search/'>Coffee Review</a>: a simple table with some text.

*This page is not the same as the HTML sample.

![](../images/coffeereview_sample_page.png)

One small tip for finding the target data faster is to use the "Inspect" function of your browser: right-click on the target data and select Inspect, and the browser will locate the corresponding position in the HTML for us.

Most of the time we work with the find and find_all functions of the BeautifulSoup class. (There are a lot more, e.g. soup.a gives us the first a tag, but in most cases there are too many other tags, so we have to narrow our choices with more specific conditions, like all a tags inside a div tag named "target".)

find() gives you the first matched result (or None if nothing matches), while find_all() returns all matched results in a list.

In [None]:
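# "name" is find()'s own first parameter (the tag name), so passing name='target'
# raises a TypeError; a name attribute has to go through attrs instead
target_div = soup.find("div", attrs={'name': 'target'})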

# we can also search for class name
target_div = soup.find("div", class_='target')

# or id
target_div = soup.find("div", id='target')
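
To make the difference between find() and find_all() concrete, here is a minimal sketch on a tiny, made-up HTML snippet (the snippet, the `demo_soup` variable and the attribute values are assumptions for illustration, not part of the coffee review page). It also shows how searching inside an already found tag narrows the result.

```python
from bs4 import BeautifulSoup

# A made-up snippet for illustration only
html = '''
<div id="menu"><a href="/home">Home</a><a href="/about">About</a></div>
<div class="target" name="box"><a href="/data">Data</a></div>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

first_link = demo_soup.find('a')         # first match only
all_links = demo_soup.find_all('a')      # every match, in document order

# Narrow the search: first find the div, then search only inside it
target = demo_soup.find('div', class_='target')
target_links = target.find_all('a')

# Searching by a "name" attribute goes through attrs
by_name = demo_soup.find('div', attrs={'name': 'box'})

print(first_link['href'])                    # /home
print([a['href'] for a in all_links])        # ['/home', '/about', '/data']
print([a['href'] for a in target_links])     # ['/data']
print(by_name['class'])                      # ['target']
print(demo_soup.find('table'))               # None, nothing matched
```

Note that find() returns None when nothing matches, so calling .text on the result without a check will raise an AttributeError.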

In the following example, we will extract the data in the table into a dictionary. As we can see, the target data are located inside the "div" tag with class "entry-content"; reducing the search area gives a less noisy result (you can tell the difference when you search for the "a" tags in a coffee roaster's web page).

In [3]:
content = soup.find('div', class_='entry-content')
content

<div class="entry-content"><div class="review-template"><div class="row row-1"><div class="column col-1"> <span class="review-template-rating">84</span></div><div class="column col-2"><p class="review-roaster">Yuban</p><h1 class="review-title">100% Colombian</h1></div><div class="column col-3"></div></div><div class="row row-2"><div class="column col-1"><table class="review-template-table"><tr><td>Roaster Location:</td><td>Rye Brook, New York</td></tr><tr><td>Roast Level:</td><td>Very Dark</td></tr><tr><td>Agtron:</td><td>0/46</td></tr></table></div><div class="column col-2"><table class="review-template-table"><tr><td>Review Date:</td><td>March 2002</td></tr><tr><td>Aroma:</td><td>5</td></tr><tr><td>Acidity :</td><td>5</td></tr><tr><td>Body:</td><td>7</td></tr><tr><td>Flavor:</td><td>7</td></tr></table></div></div><p><strong>Blind Assessment: </strong>Sweet, balanced, restrained. The fruit and vegetal cocoa notes are faint but pleasant, the roasty tones understated.</p><p><strong>Note

Compared to the original page, there is much less to handle, and we can pick out the data one by one. Let's start by getting the roaster, the name of the coffee and its overall score (the title part of the page).

In [4]:
review_dict = {}
review_dict['OverallRating'] = content.find('span', class_='review-template-rating').text
review_dict['Roaster'] = content.find('p', class_='review-roaster').text
review_dict['Name'] = content.find('h1', class_='review-title').text

review_dict

{'OverallRating': '84', 'Roaster': 'Yuban', 'Name': '100% Colombian'}
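
Not every page is guaranteed to contain every tag, and calling .text on a missing tag fails because find() returns None. One possible way to guard against that is a small helper like the one sketched below; `text_or_default` is a hypothetical name introduced here, not a BeautifulSoup function, and the snippet is a cut-down stand-in for the real page.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, tag, class_name, default=None):
    """Return the stripped text of the first matching tag, or `default` if it is absent."""
    found = parent.find(tag, class_=class_name)
    return found.get_text(strip=True) if found is not None else default

# Cut-down stand-in for the review page: the title tag is deliberately missing
snippet = '<div><p class="review-roaster"> Yuban </p></div>'
demo_soup = BeautifulSoup(snippet, 'html.parser')

print(text_or_default(demo_soup, 'p', 'review-roaster'))   # Yuban
print(text_or_default(demo_soup, 'h1', 'review-title'))    # None
```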

We could parse the table in the same way, but the process can be simplified with some knowledge of the HTML format. In HTML, each row of a table is within a "tr" tag, while each cell is inside a "td" tag. Knowing that the first cell of each row is the item name (key) and the second is the data (value), we can put everything in a loop by asking BeautifulSoup to search for all "tr" tags and then the "td" tags inside each of them.

In [5]:
# all rows in the table
rows = content.find_all('tr')

#loop through each row
for row in rows:
    
    # convert all cells within a single row into a list
    cells = row.find_all('td')
    print(cells)
    
    # use .text to get the text within the tag
    review_dict[cells[0].text] = cells[1].text
        
# let's see the result
review_dict

[<td>Roaster Location:</td>, <td>Rye Brook, New York</td>]
[<td>Roast Level:</td>, <td>Very Dark</td>]
[<td>Agtron:</td>, <td>0/46</td>]
[<td>Review Date:</td>, <td>March 2002</td>]
[<td>Aroma:</td>, <td>5</td>]
[<td>Acidity :</td>, <td>5</td>]
[<td>Body:</td>, <td>7</td>]
[<td>Flavor:</td>, <td>7</td>]


{'OverallRating': '84',
 'Roaster': 'Yuban',
 'Name': '100% Colombian',
 'Roaster Location:': 'Rye Brook, New York',
 'Roast Level:': 'Very Dark',
 'Agtron:': '0/46',
 'Review Date:': 'March 2002',
 'Aroma:': '5',
 'Acidity :': '5',
 'Body:': '7',
 'Flavor:': '7'}
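
The keys taken from the table still carry the trailing colon (and, for 'Acidity :', a stray space). If cleaner keys are preferred, one option is to normalise them afterwards. The sketch below uses a few of the entries from the output above; `clean_key` is a hypothetical helper name.

```python
# A few entries copied from the table output
raw = {'Roaster Location:': 'Rye Brook, New York', 'Acidity :': '5', 'Body:': '7'}

def clean_key(key):
    """Drop surrounding whitespace and the trailing colon from a table label."""
    return key.strip().rstrip(':').strip()

cleaned = {clean_key(k): v for k, v in raw.items()}
print(cleaned)
# {'Roaster Location': 'Rye Brook, New York', 'Acidity': '5', 'Body': '7'}
```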

Now we have most of the useful data in the page. Finally, if we want to get the comments below the table, we could search for the "p" tags. But notice that each comment always starts with a label in bold, so we can search for the "strong" tags instead.

In [6]:
for item in content.find_all('strong'):
    # .replace(': ', '') clears the ': ' after the key,
    # and .strip() cuts the whitespace before and after the text.
    # .next_sibling is the node right after the <strong> tag, here the comment text
    review_dict[item.text.replace(': ', '').strip()] = item.next_sibling
    
review_dict

{'OverallRating': '84',
 'Roaster': 'Yuban',
 'Name': '100% Colombian',
 'Roaster Location:': 'Rye Brook, New York',
 'Roast Level:': 'Very Dark',
 'Agtron:': '0/46',
 'Review Date:': 'March 2002',
 'Aroma:': '5',
 'Acidity :': '5',
 'Body:': '7',
 'Flavor:': '7',
 'Blind Assessment': 'Sweet, balanced, restrained. The fruit and vegetal cocoa notes are faint but pleasant, the roasty tones understated.',
 'Notes': 'Sold roasted and ground in a 12-ounce can; 33 cents per ounce. Visit ',
 'Who Should Drink It': "The sort of self-effacing, versatile coffee that doesn't call attention to itself, yet three cups later you're still pouring it.",
 'Show all reviews for this roaster': None}

You can see that the link in 'Notes' is missing, because it sits inside a 'website' tag. Extracting it would need another .next_sibling and some if statements; alternatively, we can iterate over all following siblings with .next_siblings:

In [55]:
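for item in content.find_all('strong'):
    # compute the key once instead of repeating the expression
    key = item.text.replace(': ', '').strip()
    # plain text nodes are str subclasses; tags expose their text via .text
    parts = [sib if isinstance(sib, str) else sib.text for sib in item.next_siblings]
    # a <strong> with nothing after it (e.g. the roaster link) has no text to store;
    # assigning (rather than +=) avoids duplicated text if the cell is re-run
    if parts:
        review_dict[key] = ''.join(parts)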
    
review_dict

{'OverallRating': '84',
 'Roaster': 'Yuban',
 'Name': '100% Colombian',
 'Roaster Location:': 'Rye Brook, New York',
 'Roast Level:': 'Very Dark',
 'Agtron:': '0/46',
 'Review Date:': 'March 2002',
 'Aroma:': '5',
 'Acidity :': '5',
 'Body:': '7',
 'Flavor:': '7',
 'Blind Assessment': 'Sweet, balanced, restrained. The fruit and vegetal cocoa notes are faint but pleasant, the roasty tones understated.',
 'Notes': 'Sold roasted and ground in a 12-ounce can; 33 cents per ounce. Visit www.yuban.com or call 1-800-982-2649 for more info.',
 'Who Should Drink It': "The sort of self-effacing, versatile coffee that doesn't call attention to itself, yet three cups later you're still pouring it."}
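
As an alternative to walking the siblings, we could take the whole text of the parent "p" tag with get_text() and cut off the label. This is only a sketch, and it assumes the "strong" label is always the first thing in its paragraph; the snippet is a shortened version of the 'Notes' paragraph, and `notes` and `demo_soup` are names made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened version of the 'Notes' paragraph from the sample
snippet = ('<p><strong>Notes: </strong>Visit '
           '<website link="www.yuban.com">www.yuban.com</website> for more info.</p>')
demo_soup = BeautifulSoup(snippet, 'html.parser')

notes = {}
for label in demo_soup.find_all('strong'):
    paragraph = label.parent              # the <p> holding the label and the text
    full_text = paragraph.get_text()      # text of all descendants, nested tags included
    label_text = label.get_text()
    key = label_text.replace(': ', '').strip()
    # assumption: the label comes first, so slicing it off leaves the comment
    notes[key] = full_text[len(label_text):].strip()

print(notes)   # {'Notes': 'Visit www.yuban.com for more info.'}
```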

And that is it! With the script above (plus some exception handling) we can extract the data from a coffee review. The remaining work would be a) converting the numeric text to integers for statistics (with the int() function), and b) converting the review date to a datetime (with the datetime library).
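
A minimal sketch of those two conversions, using values copied from the dictionary above. The format string '%B %Y' (full month name, then year) is an assumption based on the 'March 2002' sample, and month names are parsed according to the current locale, so this expects an English one.

```python
from datetime import datetime

# Values copied from the extracted dictionary
review = {'OverallRating': '84', 'Aroma:': '5', 'Review Date:': 'March 2002'}

rating = int(review['OverallRating'])
aroma = int(review['Aroma:'])

# With no day in the text, strptime defaults to the first of the month
review_date = datetime.strptime(review['Review Date:'], '%B %Y')

print(rating, aroma)                          # 84 5
print(review_date.year, review_date.month)    # 2002 3
```

A value such as the Agtron reading '0/46' is not a plain integer, so int() would raise a ValueError on it; fields like that need their own handling.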

If you prefer to learn the whole web scraping process, here is a YouTube tutorial by <a href='https://www.youtube.com/watch?v=XVv6mJpFOb0'>freeCodeCamp</a>. ;)