# Scraping Medical Research Topics on TED Talks

## About Web Scraping


Data analysts analyze and report insights gleaned from data. Ever wondered how this data is collected? That's where knowledge of web scraping comes in handy.

### What is web scraping?

Web scraping is a technique used to extract content and data from websites. The extracted data is stored in databases and retrieved later to perform analysis and communicate insights.

*“Data are just summaries of thousands of stories.”* **– By Chip & Dan Heath**

Given that large amounts of data are extracted, web scraping automates tasks that might otherwise take humans far longer, or even be impossible to complete in a timely manner.

### How does web scraping work?

Hypertext Markup Language (HTML) is used to give structure to websites. Because every page is marked up with the same standard set of tags, scrapers can pinpoint specific elements within a page and extract their content.
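To make the idea concrete, here is a minimal sketch that uses only Python's built-in `html.parser` module, not the libraries this project relies on. The `sample_html` snippet and the `HeadingCollector` class are made up for illustration; the point is only that tags mark where the content of interest begins and ends.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text inside every <h4> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # An opening <h4> tag marks the start of a heading
        if tag == 'h4':
            self.in_h4 = True
            self.headings.append('')

    def handle_endtag(self, tag):
        # A closing </h4> tag marks the end of the heading
        if tag == 'h4':
            self.in_h4 = False

    def handle_data(self, data):
        # Keep text only while we are inside an <h4> element
        if self.in_h4:
            self.headings[-1] += data


# Made-up HTML, not taken from any real website
sample_html = """
<div class="talk">
  <h4>Example talk title</h4>
  <p>Some description that we do not need.</p>
</div>
<div class="talk">
  <h4>Another example title</h4>
</div>
"""

collector = HeadingCollector()
collector.feed(sample_html)
headings = [text.strip() for text in collector.headings]
print(headings)
```

Beautiful Soup, introduced later, does this kind of bookkeeping for us and offers far more convenient search methods.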

The general web scraping process follows these steps:

- Identify a site to scrape.
- Use `Requests` to fetch the HTML code.
- Locate HTML elements using Beautiful Soup.
- Use Pandas to create CSV files.



## About TED Talks

TED Conferences LLC (Technology, Entertainment, Design) is an American media organization that posts talks online for free distribution under the slogan "ideas worth spreading". TED's early emphasis was on technology and design, consistent with its Silicon Valley origins. It has since broadened its perspective to include talks on many scientific, cultural, political, humanitarian and academic topics.

In this project, we are going to scrape inspirational medical research topics presented on TED Talks.


![Title](https://i.ibb.co/PDPrnVW/ashraful-islam-p-Rt3-JVYl-Jho-unsplash.jpg)


Photo by <a href="https://unsplash.com/@ashraful25?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Ashraful Islam</a> on <a href="https://unsplash.com/photos/a-double-strand-of-blue-and-white-spirals-pRt3JVYlJho?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash">Unsplash</a>
      

## Website of Interest and Objectives

Our website of interest is https://www.ted.com/talks?page=1&sort=popular&topics%5B%5D=medical+research

As mentioned earlier, this site contains talks. From the talks posted, we want to extract the following information:
- Title of the talk
- The speaker
- Talk URL
- Duration of the talk
- Talk views



## Web Scraping Toolbox

Before we proceed to the next step, let's learn a little bit about the tools that we will use for web scraping. We will make use of three Python libraries for this project.

`Requests`
- for making various types of HTTP requests.

`Beautiful Soup`
- for pulling data out of HTML and XML files.

`Pandas`
- for storing the data in the required format.

## Download Web Pages Using `Requests`

We will use the `requests` library to fetch HTML code from the website.

In [None]:
# Installing the library
!pip install requests --upgrade --quiet

In [None]:
# import the library
import requests

In [None]:
#variable for our website of interest
talks_url = 'https://www.ted.com/talks?page=1&sort=popular&topics%5B%5D=medical+research'

We will use the `requests.get` function to request the webpage. A `response` object is generated once Requests gets a response back from the server.

In [None]:
response = requests.get(talks_url)

We can check whether our request was successful using `response.status_code`. A successful response has a status code between 200 and 299.

In [None]:
response.status_code

200
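The range check described above can be written as a small helper. This is only a sketch: `is_success` is a hypothetical name introduced here, and it works on a plain integer so that it can be tried without making a request.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# Try the helper on a few common status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```

With a real response we would call `is_success(response.status_code)`. Requests also provides `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses, and `response.ok`, which is true for any status code below 400 and is therefore a slightly looser check than the 2xx range.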

We can access the contents of the webpage using the `.text` property of the `response`.

In [None]:
page_content = response.text

Let's take a peek at `page_content`.

In [None]:
len(page_content)

92230

The page has 92,230 characters.

In [None]:
page_content[:1000]

'<!DOCTYPE html>\n<!--[if lt IE 8]> <html class="no-js loggedout oldie ie7" lang="en"> <![endif]-->\n<!--[if IE 8]> <html class="no-js loggedout oldie ie8" lang="en"> <![endif]-->\n<!--[if gt IE 8]><!--><html class=\'no-js loggedout\' lang=\'en\'><!--<![endif]-->\n<head>\n<script>\n  (function (H){\n  H.className=H.className.replace(/\\bno-js\\b/,\'js\');\n  if ((\'; \'+document.cookie).match(/; _ted_user_id=/)) H.className=H.className.replace(/\\bloggedout\\b/,\'loggedin\');\n  })(document.documentElement)\n</script><meta charset=\'utf-8\'>\n<title>TED Talks</title>\n<meta name="description" content="TED Talks are influential videos from expert speakers on education, business, science, tech and creativity, with subtitles in 100+ languages. Ideas free to stream and download." />\n<meta name="rss-feed" content="https://www.ted.com/feeds/talks.rss" />\n<meta name="keywords" content="TED, Talks, Themes, Speakers, Technology, Entertainment, Design" />\n<link rel="mask-icon" href="https://p

Above is the `source code` of the webpage written in HTML.

Let's save the contents to a file with the `.html` extension.

In [None]:
with open('talks.html', 'w', encoding='utf-8') as file:
    file.write(page_content)
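A saved copy can be read back later so that we can experiment with parsing without requesting the page again. The sketch below shows the round trip on a made-up snippet written to a temporary folder; `save_page` and `load_page` are hypothetical helper names. Passing `encoding='utf-8'` explicitly on both sides is a choice made here so that non-ASCII characters, such as those in some speakers' names, do not depend on the platform's default encoding.

```python
import tempfile
from pathlib import Path


def save_page(text, path):
    """Write page text to disk as UTF-8."""
    Path(path).write_text(text, encoding='utf-8')


def load_page(path):
    """Read previously saved page text back from disk."""
    return Path(path).read_text(encoding='utf-8')


# Made-up content containing non-ASCII characters
sample_content = '<html><body><h4>Céline Valéry</h4></body></html>'

with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / 'sample.html'
    save_page(sample_content, sample_path)
    restored = load_page(sample_path)

# The text read back should match what was written
print(restored == sample_content)
```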

## Beautiful Soup : Parse and Extract Information

To use Beautiful Soup, we first need to install it.

In [None]:
#Install library
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
#import library
from bs4 import BeautifulSoup

Then, we create the Beautiful Soup object with the `page_content` as the input.

In [None]:
talk_doc = BeautifulSoup(page_content, 'html.parser')

In [None]:
type(talk_doc)

bs4.BeautifulSoup

We have used a lot of code to prepare our web document for scraping. Let us now create a function to download the page.

In [None]:
import requests
from bs4 import BeautifulSoup

def get_page(talks_url):
    """Download a page and return it as a BeautifulSoup object."""
    # Get the HTML page using requests
    response = requests.get(talks_url)
    # Confirm the request is a success
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(talks_url))
    # Create the Beautiful Soup object with an explicit parser
    talk_doc = BeautifulSoup(response.text, 'html.parser')
    return talk_doc

In [None]:
# Check that our function is working
talk_doc = get_page(talks_url)
type(talk_doc)

bs4.BeautifulSoup

Data is stored in HTML tags. Beautiful Soup has many methods for searching through these tags (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree); the two most common are `find()` and `find_all()`. For our project, we will use the `find_all()` method, which returns a list of matching elements, to extract information from our webpage.
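The difference between the two methods is easiest to see on a small document. The HTML and class names below are made up for illustration and are not the ones used on the TED site.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two kinds of <h4> tags
sample_html = """
<div>
  <h4 class="title">First example talk</h4>
  <h4 class="speaker">Speaker One</h4>
  <h4 class="title">Second example talk</h4>
  <h4 class="speaker">Speaker Two</h4>
</div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns only the first matching tag
first_title = soup.find('h4', {'class': 'title'})
# find_all() returns every matching tag in a list-like object
all_titles = soup.find_all('h4', {'class': 'title'})

# When nothing matches, find() gives None and find_all() gives an empty list
missing = soup.find('h4', {'class': 'duration'})
no_matches = soup.find_all('h4', {'class': 'duration'})

print(first_title.text)
print([tag.text for tag in all_titles])
print(missing, len(no_matches))
```

Because `find_all()` returns an empty list rather than `None` when nothing matches, looping over its result is safe even if the page layout changes.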

Let's now create some functions using the `find_all()` method.

We will first need to inspect our page. The image below shows the webpage at the top and the HTML code that gives the page its structure at the bottom.

We can open this view by right-clicking the element of interest on the page and selecting the Inspect Element option.

![Title](https://i.ibb.co/j8Jz9y9/Annotation-2021-05-10-140602.png)

The speakers' names, talk URLs, talk titles, view counts and talk durations are the information we need to extract from the webpage.
If you are familiar with HTML, you will notice that some of this information is located within `<h4>` tags that have different classes.

We will now create functions to extract this information.

In [None]:
#Talk titles
def get_title(talk_doc):
    # Initialize an empty list to store talk titles
    title = []

    # Define the class used for talk titles in <h4>
    selection_class = 'f-w:700 h9 m5'

    # Search for all <h4> tags with the specific class
    title_tags = talk_doc.find_all('h4', {'class': selection_class})

    # Loop through each <h4> tag that contains a talk title
    for tag in title_tags:
        # Append the stripped text (removing leading/trailing spaces) to the title list
        title.append(tag.text.strip())

    # Return the list of extracted titles
    return title



In [None]:
#Check if our function is working
get_title(talk_doc)

['Sleep is your superpower',
 "You can grow new brain cells. Here's how",
 'How does anesthesia work?',
 'Conception to birth — visualized',
 'Is marijuana bad for your brain?',
 "What you can do to prevent Alzheimer's",
 "One more reason to get a good night's sleep",
 'How reliable is your memory?',
 'A promising test for pancreatic cancer ... from a teenager',
 'What happens during a heart attack?',
 'The surprisingly charming science of your gut',
 'Is the obesity crisis hiding a bigger problem?',
 "Autism — what we know (and what we don't know yet)",
 'What is bipolar disorder?',
 'How your emotions change the shape of your heart',
 'The future of psychedelic-assisted psychotherapy',
 'What happens when you remove the hippocampus?',
 'How CRISPR lets us edit our DNA',
 'The brain may be able to repair itself — with help',
 'Printing a human kidney',
 'The power of the placebo effect',
 'How menopause affects the brain',
 "A doctor's case for medical marijuana",
 "The most groundbre

Our function is working properly. Let's move to the next.

In [None]:
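#Speakers names
def get_speakers(talk_doc):
    # List to store the speakers' names
    speakers = []
    # Speaker names are in <h4> tags with this class
    speaker_class = 'h12 talk-link__speaker'
    speaker_tags = talk_doc.find_all('h4', {'class': speaker_class})
    for tag in speaker_tags:
        # Strip surrounding whitespace from each name
        speakers.append(tag.text.strip())

    return speakers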

In [None]:
get_speakers(talk_doc)

['Matt Walker',
 'Sandrine Thuret',
 'Steven Zheng',
 'Alexander Tsiaras',
 'Anees Bahji',
 'Lisa Genova',
 'Jeff Iliff',
 'Elizabeth Loftus',
 'Jack Andraka',
 'Krishna Sudhir',
 'Giulia Enders',
 'Peter Attia',
 'Wendy Chung',
 'Helen M. Farrell',
 'Sandeep Jauhar',
 'Rick Doblin',
 'Sam Kean',
 'Jennifer Doudna',
 'Jocelyne Bloch',
 'Anthony Atala',
 'Emma Bryce',
 'Lisa Mosconi',
 'David Casarett',
 'Addison Anderson',
 'Shohini Ghose',
 'Suchitra Krishnan-Sarin',
 'Céline Valéry',
 'Ben Goldacre',
 'Bill Gates',
 'Sara-Jane Dunn',
 'TED-Ed',
 'Samuel Cohen',
 'Henna-Maria Uusitupa',
 'Matt Walker',
 'Kaitlyn Sadtler',
 'Rebecca Brachman']

In [None]:
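#Talk duration
def get_time(talk_doc):
    # List to store the duration text of each talk
    talk_time = []
    # Durations are in <span> tags with this class
    duration_class = 'thumb__duration'
    talk_duration = talk_doc.find_all('span', {'class': duration_class})
    for duration in talk_duration:
        # Keep the text as-is, including any leading space
        talk_time.append(duration.text)

    return talk_time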

In [None]:
get_time(talk_doc)

['19:18',
 '11:04',
 ' 4:41',
 ' 9:37',
 ' 6:21',
 '13:56',
 '11:41',
 '17:36',
 '10:49',
 ' 4:39',
 '14:03',
 '15:58',
 '15:35',
 ' 5:52',
 '16:02',
 '16:32',
 ' 5:10',
 '15:53',
 '11:34',
 '17:24',
 ' 4:22',
 '13:04',
 '15:07',
 ' 4:18',
 ' 4:43',
 '14:29',
 ' 3:58',
 '13:29',
 '43:07',
 '14:47',
 ' 5:25',
 ' 7:53',
 '10:40',
 '1h  0m',
 ' 4:57',
 ' 5:10']
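The durations come back as text in two shapes: `MM:SS` (sometimes with a leading space) and, for the longest talk, a form like `1h  0m`. If we later want to sort or average them, a numeric value is easier to work with. The function below is a sketch; `duration_to_seconds` is a hypothetical helper, and it handles only the two formats visible in the output above.

```python
import re


def duration_to_seconds(text):
    """Convert 'MM:SS' or 'Xh Ym' duration text to a number of seconds."""
    text = text.strip()
    # Hours-and-minutes form, for example '1h  0m'
    hours_match = re.fullmatch(r'(\d+)h\s*(\d+)m', text)
    if hours_match:
        hours, minutes = map(int, hours_match.groups())
        return hours * 3600 + minutes * 60
    # Minutes-and-seconds form, for example '19:18'
    minutes, seconds = map(int, text.split(':'))
    return minutes * 60 + seconds


# A few values copied from the output above
samples = ['19:18', ' 4:41', '43:07', '1h  0m']
seconds = [duration_to_seconds(value) for value in samples]
print(seconds)
```

Any other format would raise a `ValueError`, which is a deliberate choice here: failing loudly is safer than silently storing a wrong duration.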

In [None]:
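#Talk views
def get_views(talk_doc):
    # List to store the view count text of each talk
    talk_views = []
    # View counts are in <span> tags with this class
    views_class = 'meta__val'
    views = talk_doc.find_all('span', {'class': views_class})
    for view in views:
        # Strip surrounding whitespace from each value
        talk_views.append(view.text.strip())

    return talk_views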

In [None]:
get_views(talk_doc)

['13M',
 '9.6M',
 '9.1M',
 '8.1M',
 '5.8M',
 '5.3M',
 '5.3M',
 '5.3M',
 '4.9M',
 '4.8M',
 '4.7M',
 '4.4M',
 '4.1M',
 '4.1M',
 '4M',
 '3.6M',
 '3.5M',
 '3.4M',
 '3.3M',
 '3.1M',
 '3M',
 '2.9M',
 '2.9M',
 '2.7M',
 '2.7M',
 '2.6M',
 '2.6M',
 '2.6M',
 '2.5M',
 '2.5M',
 '2.5M',
 '2.4M',
 '2.4M',
 '2.4M',
 '2.4M',
 '2.3M']
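The view counts are abbreviated strings such as `13M` and `9.6M`. For numeric analysis they can be converted to integers. This is a sketch with a hypothetical helper, `views_to_int`; only the `M` suffix appears in the output above, so the `K` and `B` suffixes are assumptions about how other values might be abbreviated.

```python
from decimal import Decimal

# Multipliers for abbreviated counts ('K' and 'B' are assumed, 'M' is observed)
SUFFIXES = {'K': 1_000, 'M': 1_000_000, 'B': 1_000_000_000}


def views_to_int(text):
    """Convert an abbreviated view count such as '9.6M' to an integer."""
    text = text.strip().upper()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids floating-point rounding surprises for values like 9.6
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(Decimal(text))


# A few values copied from the output above
samples = ['13M', '9.6M', '4M', '2.3M']
counts = [views_to_int(value) for value in samples]
print(counts)
```

Since the site has already rounded these figures, the converted numbers are approximations of the true view counts.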

In [None]:
#Talks urls
def get_urls(talk_doc):
    # List to store the unique talk URLs
    urls = []
    # Talk links are <a> tags with these attributes
    a_tags = talk_doc.find_all('a', {'class': 'ga-link', 'data-ga-context': 'talks'})
    for a_tag in a_tags:
        # Prefix the site address to the link's href
        url = 'https://www.ted.com/' + a_tag['href']
        # Skip URLs that were already collected
        if url not in urls:
            urls.append(url)

    return urls

In [None]:
get_urls(talk_doc)

['https://www.ted.com//talks/matt_walker_sleep_is_your_superpower',
 'https://www.ted.com//talks/sandrine_thuret_you_can_grow_new_brain_cells_here_s_how',
 'https://www.ted.com//talks/steven_zheng_how_does_anesthesia_work',
 'https://www.ted.com//talks/alexander_tsiaras_conception_to_birth_visualized',
 'https://www.ted.com//talks/anees_bahji_is_marijuana_bad_for_your_brain',
 'https://www.ted.com//talks/lisa_genova_what_you_can_do_to_prevent_alzheimer_s',
 'https://www.ted.com//talks/jeff_iliff_one_more_reason_to_get_a_good_night_s_sleep',
 'https://www.ted.com//talks/elizabeth_loftus_how_reliable_is_your_memory',
 'https://www.ted.com//talks/jack_andraka_a_promising_test_for_pancreatic_cancer_from_a_teenager',
 'https://www.ted.com//talks/krishna_sudhir_what_happens_during_a_heart_attack',
 'https://www.ted.com//talks/giulia_enders_the_surprisingly_charming_science_of_your_gut',
 'https://www.ted.com//talks/peter_attia_is_the_obesity_crisis_hiding_a_bigger_problem',
 'https://www.ted
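The URLs in the output above contain a double slash (`//talks/...`) because each `href` already begins with `/` and the base string ends with one. One way to avoid this is the standard library's `urljoin`, which resolves a relative link against a base URL. In this sketch, `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.ted.com/'


def absolute_url(href):
    """Resolve a link from the page against the site's base URL."""
    return urljoin(BASE_URL, href)


# A relative link of the kind found in the href attributes
relative = absolute_url('/talks/matt_walker_sleep_is_your_superpower')
# A link that is already absolute is returned unchanged
already_absolute = absolute_url('https://www.ted.com/talks/matt_walker_sleep_is_your_superpower')

print(relative)
print(already_absolute)
```

Inside `get_urls`, the string concatenation could be replaced with a call like `absolute_url(a_tag['href'])`.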

## Create CSV Files Using Pandas

In [None]:
#Installing Pandas library
!pip install pandas --quiet

In [None]:
#Import Pandas
import pandas as pd

In [None]:
#Create a dictionary
talks_dict = {
    'Title': get_title(talk_doc),
    'Speaker': get_speakers(talk_doc),
    'Duration': get_time(talk_doc),
    'Views': get_views(talk_doc),
    'URLs': get_urls(talk_doc),
}
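Each function searches the page independently, so the five lists line up row by row only if every talk contributes exactly one value to each list. Pandas raises a `ValueError` when the lists in such a dictionary have different lengths, so it can help to check first and get a clearer message. The helper below, `check_equal_lengths`, is a hypothetical name, and the data is made up.

```python
def check_equal_lengths(columns):
    """Return each column's length, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Made-up data: every column has two entries, so the check passes
good_columns = {
    'Title': ['Talk A', 'Talk B'],
    'Speaker': ['Speaker A', 'Speaker B'],
}
good_lengths = check_equal_lengths(good_columns)
print(good_lengths)

# Made-up data: one speaker is missing, so the check fails
bad_columns = {
    'Title': ['Talk A', 'Talk B'],
    'Speaker': ['Speaker A'],
}
try:
    check_equal_lengths(bad_columns)
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```

Equal lengths do not prove that the rows are aligned, but a mismatch is a clear sign that some element on the page was missed or counted twice.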

In [None]:
talks_dict

{'Title': ['Sleep is your superpower',
  "You can grow new brain cells. Here's how",
  'How does anesthesia work?',
  'Conception to birth — visualized',
  'Is marijuana bad for your brain?',
  "What you can do to prevent Alzheimer's",
  "One more reason to get a good night's sleep",
  'How reliable is your memory?',
  'A promising test for pancreatic cancer ... from a teenager',
  'What happens during a heart attack?',
  'The surprisingly charming science of your gut',
  'Is the obesity crisis hiding a bigger problem?',
  "Autism — what we know (and what we don't know yet)",
  'What is bipolar disorder?',
  'How your emotions change the shape of your heart',
  'The future of psychedelic-assisted psychotherapy',
  'What happens when you remove the hippocampus?',
  'How CRISPR lets us edit our DNA',
  'The brain may be able to repair itself — with help',
  'Printing a human kidney',
  'The power of the placebo effect',
  'How menopause affects the brain',
  "A doctor's case for medical 

In [None]:
#Create dataframe
talks_pd = pd.DataFrame(talks_dict)

In [None]:
talks_pd

| | Title | Speaker | Duration | Views | URLs |
|---|---|---|---|---|---|
| 0 | Sleep is your superpower | Matt Walker | 19:18 | 13M | https://www.ted.com//talks/matt_walker_sleep_i... |
| 1 | You can grow new brain cells. Here's how | Sandrine Thuret | 11:04 | 9.6M | https://www.ted.com//talks/sandrine_thuret_you... |
| 2 | How does anesthesia work? | Steven Zheng | 4:41 | 9.1M | https://www.ted.com//talks/steven_zheng_how_do... |
| 3 | Conception to birth — visualized | Alexander Tsiaras | 9:37 | 8.1M | https://www.ted.com//talks/alexander_tsiaras_c... |
| 4 | Is marijuana bad for your brain? | Anees Bahji | 6:21 | 5.8M | https://www.ted.com//talks/anees_bahji_is_mari... |
| 5 | What you can do to prevent Alzheimer's | Lisa Genova | 13:56 | 5.3M | https://www.ted.com//talks/lisa_genova_what_yo... |
| 6 | One more reason to get a good night's sleep | Jeff Iliff | 11:41 | 5.3M | https://www.ted.com//talks/jeff_iliff_one_more... |
| 7 | How reliable is your memory? | Elizabeth Loftus | 17:36 | 5.3M | https://www.ted.com//talks/elizabeth_loftus_ho... |
| 8 | A promising test for pancreatic cancer ... fro... | Jack Andraka | 10:49 | 4.9M | https://www.ted.com//talks/jack_andraka_a_prom... |
| 9 | What happens during a heart attack? | Krishna Sudhir | 4:39 | 4.8M | https://www.ted.com//talks/krishna_sudhir_what... |


We have now extracted information from the first page of results.
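The URL we scraped contains a `page=1` query parameter, which suggests that further results could be reached by changing that number. The sketch below only builds such URLs; it assumes, without having verified it, that the site serves later results under higher page numbers. `page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = 'https://www.ted.com/talks'


def page_url(page, topic='medical research', sort='popular'):
    """Build the listing URL for a given results page."""
    # urlencode escapes the brackets in 'topics[]' and the space in the topic
    query = urlencode({'page': page, 'sort': sort, 'topics[]': topic})
    return f'{BASE}?{query}'


first_page = page_url(1)
second_page = page_url(2)
print(first_page)
print(second_page)
```

Each generated URL could then be passed to `get_page`, with the results of the extraction functions appended to the same lists. When requesting several pages in a loop, pausing briefly between requests is a considerate default.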

In [None]:
#Save in a csv file
talks_pd.to_csv('talks.csv', index=False)
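To confirm that the file holds what we expect, we can read it back. The sketch below uses a small made-up DataFrame and an in-memory buffer instead of `talks.csv`, so it can be run on its own. It shows that leaving the index out of the export keeps the header row to just the column names.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped data
sample = pd.DataFrame({
    'Title': ['Example talk A', 'Example talk B'],
    'Views': ['13M', '9.6M'],
})

# Write the CSV to an in-memory buffer, leaving out the row index
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text.splitlines()[0])
print(restored.shape)
```

If the index were written to the file, reading it back would produce an extra unnamed column holding the old row numbers.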