## Scrape Talks Data from ted.com



#### As of today (May 1, 2022), there are 5033 talks on ted.com, spread across 154 pages. To scrape all of them, I need to visit every page.


***Data to be collected from each talk:***<br>
title, speaker, posted_on date, talk_link

In [6]:
# !pip install requests_html


In [2]:
from requests_html import AsyncHTMLSession
import json
import asyncio

#### Make a list to store all talks later

In [3]:
all_talks = []

#### Build the list of URLs for all 154 listing pages

In [4]:
def get_all_talks_pages_urls():
    links=[]
    for i in range(1,155):
        link = f'https://www.ted.com/talks?page={i}'
        links.append(link)
    return links

#### Scraping one page is one task: for each page, collect the info in every 'div.col' element (one per talk) and store it in a dictionary.
*To run all tasks in one event loop, `asession` is passed in so that every task shares the same AsyncHTMLSession.*

In [5]:
async def get_all_talks_from_one_page(url, asession):
    # Fetch the listing page with the shared session
    a_re = await asession.get(url)
    # The grid row that holds one 'div.col' per talk
    row = a_re.html.find('div.row.row-sm-4up.row-lg-6up.row-skinny', first=True)
    all_cols = row.find('div.col')

    for col in all_cols:
        speaker = col.find('h4', first=True).text
        a_tag = col.find('div.media__message a.ga-link', first=True)
        title = a_tag.text
        talk_link = a_tag.attrs['href']  # site-relative path, e.g. '/talks/...'
        post_date = col.find('span.meta__val', first=True).text

        talk_dict = {
            'title':title,
            'speaker': speaker,
            'posted_on': post_date,
            'talk_link': talk_link
        }
        all_talks.append(talk_dict)
    # Returns the shared module-level list, not only this page's talks
    return all_talks

#### The top-level function creates a new session and runs all tasks with asyncio.gather()

In [8]:
async def scrape_all_talks():
    asession = AsyncHTMLSession()
    links = get_all_talks_pages_urls()
    # Generator of coroutines; links[:2] limits this run to the first two pages
    tasks = (get_all_talks_from_one_page(link, asession) for link in links[:2])
    return await asyncio.gather(*tasks)
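
Since every task above returns the same shared `all_talks` list, `asyncio.gather()` hands back that one list once per page. The following is a minimal sketch of an alternative, not the approach used in this notebook: each task returns only its own page's talks, a semaphore caps how many pages are in flight at once, and the per-page lists are flattened at the end. `fake_scrape_page`, `scrape_pages`, and the `limit` parameter are hypothetical names, and the page fetch is simulated with `asyncio.sleep()` so the example runs offline.

```python
import asyncio
from itertools import chain

# Hypothetical stand-in for a real page scrape: no network access involved.
async def fake_scrape_page(page_number, semaphore):
    async with semaphore:  # at most `limit` coroutines run this body at once
        await asyncio.sleep(0.01)  # simulates waiting on an HTTP response
        return [{'title': f'talk {page_number}-{i}'} for i in range(3)]

async def scrape_pages(page_numbers, limit=5):
    semaphore = asyncio.Semaphore(limit)
    tasks = (fake_scrape_page(n, semaphore) for n in page_numbers)
    per_page = await asyncio.gather(*tasks)  # one list per page, in input order
    return list(chain.from_iterable(per_page))  # flatten into a single list

# Script form; in a notebook cell this would be `await scrape_pages(range(1, 5))`.
demo_talks = asyncio.run(scrape_pages(range(1, 5)))
```

Capping concurrency this way is one option for being gentler on the site when all 154 pages are requested instead of the first two.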

#### Run the top-level coroutine
**Note:**
Jupyter Notebook already has an event loop running, so the coroutine can be awaited directly with `await scrape_all_talks()`. In a regular Python script, use `asyncio.run(scrape_all_talks())` instead.
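
As a rough sketch of the script form mentioned in the note, the entry point might look like this. `scrape_all_talks_stub` is a hypothetical stand-in for `scrape_all_talks()` so that the example runs without network access.

```python
import asyncio

# Hypothetical stand-in for scrape_all_talks(), so the sketch runs offline.
async def scrape_all_talks_stub():
    await asyncio.sleep(0)
    return ['page 1 talks', 'page 2 talks']

def main():
    # asyncio.run() creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards.
    return asyncio.run(scrape_all_talks_stub())

if __name__ == '__main__':
    print(main())
```

Inside the notebook, the next cell awaits the coroutine directly.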

In [9]:
await scrape_all_talks()

[[{'title': '"I Hope" / "DAWN"',
   'speaker': 'Resistance Revival Chorus',
   'posted_on': 'Mar 2022',
   'talk_link': '/talks/resistance_revival_chorus_i_hope_dawn'},
  {'title': '"A seat at the table" isn\'t the solution for gender equity',
   'speaker': 'Lilly Singh',
   'posted_on': 'Mar 2022',
   'talk_link': '/talks/lilly_singh_a_seat_at_the_table_isn_t_the_solution_for_gender_equity'},
  {'title': 'Why US laws must expand beyond the nuclear family',
   'speaker': 'Diana Adams',
   'posted_on': 'Mar 2022',
   'talk_link': '/talks/diana_adams_why_us_laws_must_expand_beyond_the_nuclear_family'},
  {'title': 'The ingredient in almost everything you eat',
   'speaker': 'Francesca Bot',
   'posted_on': 'Mar 2022',
   'talk_link': '/talks/francesca_bot_the_ingredient_in_almost_everything_you_eat'},
  {'title': 'What seaweed and cow burps have to do with climate change',
   'speaker': 'Ermias Kebreab',
   'posted_on': 'Mar 2022',
   'talk_link': '/talks/ermias_kebreab_what_seaweed_and_

In [10]:
with open('ted_talks.json', 'w') as file:
    json.dump(all_talks, file, indent=4)
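
By default `json.dump()` escapes non-ASCII characters as `\uXXXX` sequences. If titles or speaker names should stay readable in the file, one option is to write with an explicit UTF-8 encoding and `ensure_ascii=False`. The sketch below uses made-up sample data and hypothetical helper names (`save_talks`, `load_talks`), and writes to a temporary directory rather than to `ted_talks.json`.

```python
import json
import os
import tempfile

# Made-up sample record, only for illustration (not scraped data).
sample_talks = [{'title': 'Café talk', 'speaker': 'Zoë',
                 'posted_on': 'Mar 2022', 'talk_link': '/talks/example'}]

def save_talks(talks, path):
    # ensure_ascii=False keeps characters such as 'é' as-is in the file
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(talks, file, indent=4, ensure_ascii=False)

def load_talks(path):
    with open(path, 'r', encoding='utf-8') as file:
        return json.load(file)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_talks.json')
    save_talks(sample_talks, sample_path)
    reloaded = load_talks(sample_path)
    with open(sample_path, encoding='utf-8') as file:
        raw_text = file.read()
```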

In [11]:
with open('ted_talks.json', 'r') as file:
    talks = json.load(file)
talks[:3]

[{'title': '"I Hope" / "DAWN"',
  'speaker': 'Resistance Revival Chorus',
  'posted_on': 'Mar 2022',
  'talk_link': '/talks/resistance_revival_chorus_i_hope_dawn'},
 {'title': '"A seat at the table" isn\'t the solution for gender equity',
  'speaker': 'Lilly Singh',
  'posted_on': 'Mar 2022',
  'talk_link': '/talks/lilly_singh_a_seat_at_the_table_isn_t_the_solution_for_gender_equity'},
 {'title': 'Why US laws must expand beyond the nuclear family',
  'speaker': 'Diana Adams',
  'posted_on': 'Mar 2022',
  'talk_link': '/talks/diana_adams_why_us_laws_must_expand_beyond_the_nuclear_family'}]
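
The `talk_link` values are site-relative paths. If full URLs are needed later, they could be built with `urllib.parse.urljoin`, as in this small sketch. It assumes the talk pages live under the same `https://www.ted.com` host as the listing pages; `BASE_URL` and `with_absolute_link` are hypothetical names.

```python
from urllib.parse import urljoin

# Assumption: talk pages share the host used for the listing pages.
BASE_URL = 'https://www.ted.com'

def with_absolute_link(talk, base_url=BASE_URL):
    # Return a copy of the talk dict with 'talk_link' turned into a full URL
    return {**talk, 'talk_link': urljoin(base_url, talk['talk_link'])}

example = {'title': '"I Hope" / "DAWN"',
           'speaker': 'Resistance Revival Chorus',
           'posted_on': 'Mar 2022',
           'talk_link': '/talks/resistance_revival_chorus_i_hope_dawn'}
absolute = with_absolute_link(example)
```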

In [12]:
len(talks)

252
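
Because `all_talks` is a module-level list that every run of the scraping cell appends to, rerunning that cell in the same session adds the same talks again. If that is a concern, one possible cleanup step is to drop repeats by `talk_link` before saving. This is a sketch with made-up records; `dedupe_talks` is a hypothetical helper.

```python
def dedupe_talks(talks):
    # Keep the first occurrence of each talk_link, preserving order
    seen = set()
    unique = []
    for talk in talks:
        key = talk['talk_link']
        if key not in seen:
            seen.add(key)
            unique.append(talk)
    return unique

# Made-up records, only for illustration (not scraped data).
sample = [
    {'title': 'A', 'talk_link': '/talks/a'},
    {'title': 'B', 'talk_link': '/talks/b'},
    {'title': 'A', 'talk_link': '/talks/a'},
]
unique_sample = dedupe_talks(sample)
```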