# Teacher List Scrape

This notebook scrapes a list of teachers from directory pages on Insight Timer's website.

The resulting dataframe contains three columns, shown here with example values:
- alpha_index: m
- teacher_name: Malcolm Huxter
- teacher_href: malhuxter

The dataframe is saved to **teachers_list_df.csv**.

In [1]:
from selenium import webdriver
#import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import string

In [1]:
# Insight Timer pages are rendered with JavaScript, so Selenium and ChromeDriver are needed to load them.
chrome_driver_path = '../../../../Tech/chrome_driver/chromedriver.exe'
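
The `executable_path` keyword that this notebook passes to `webdriver.Chrome` belongs to the Selenium 3 API; Selenium 4 deprecated it in favor of a `Service` object. The sketch below shows one way the page fetch might look under Selenium 4, using an explicit wait for a rendered element rather than a fixed delay. The helper name `fetch_rendered_html`, the default CSS selector, and the timeout are assumptions for illustration and are not part of the original notebook.

```python
def fetch_rendered_html(url, driver_path, css_selector="div a", timeout=10):
    """Load `url` in Chrome and return the page source once `css_selector` is present.

    Hypothetical helper: assumes Selenium 4 and a local chromedriver at `driver_path`.
    """
    # Imported inside the function so that defining it does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    try:
        driver.get(url)
        # Block until at least one matching element exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # quit() ends the whole browser session, not just the current window.
        driver.quit()
```

The returned HTML string could then be passed to BeautifulSoup in place of `driver.page_source`.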

### Teacher Lists
https://insighttimer.com/dir/meditation-teachers/  <br>
https://insighttimer.com/dir/meditation-teachers/a <br>
https://insighttimer.com/dir/meditation-teachers/b

In [3]:
directory_alpha_index = ['hash'] + list(string.ascii_lowercase) + ['more']
# Result is ['hash', 'a', 'b', 'c', ... 'x', 'y', 'z', 'more']
# 'hash' page includes teachers with names starting with a number.
# 'more' page includes teachers with names starting with non-standard characters,
#     such as punctuation marks or letters from languages other than English.
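
As a quick check of the index list described above, the following sketch builds the full set of directory page URLs without opening a browser. The names `BASE_URL`, `index_keys`, and `page_urls` are illustrative only.

```python
import string

BASE_URL = "https://insighttimer.com/dir/meditation-teachers/"

# 'hash' (names starting with a digit), then a-z, then 'more' (everything else)
index_keys = ["hash"] + list(string.ascii_lowercase) + ["more"]
page_urls = [BASE_URL + key for key in index_keys]

print(len(page_urls))   # 28
print(page_urls[0])     # https://insighttimer.com/dir/meditation-teachers/hash
print(page_urls[-1])    # https://insighttimer.com/dir/meditation-teachers/more
```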

In [4]:
teachers_dir_url = 'https://insighttimer.com/dir/meditation-teachers/'

In [5]:
# Initialize the lists that will become columns in the dataframe
alpha_indices = []
teacher_hrefs = []
teacher_names = []

In [6]:
# Iterate through teacher pages for each letter
for alpha_index in directory_alpha_index:
    
    # Build the URL of the directory page for this index
    url = teachers_dir_url + alpha_index

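    # Create a Chrome session and load the page
    driver = webdriver.Chrome(executable_path=chrome_driver_path)
    driver.get(url)

    # Set a 2-second implicit wait. Note that this only affects element lookups
    # made through the driver; it does not pause here for JavaScript rendering.
    driver.implicitly_wait(2)

    # Make soup with an explicit parser, then end the browser session
    soup = BS(driver.page_source, 'html.parser')
    driver.quit()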
    
    # Example of tag that includes teacher data:
    # <div class="css-1y0feak">
    #   <a href="/malhuxter">Malcolm Huxter</a>
    # </div>

    div_teacher_tags = soup.find_all('div', attrs={'class': 'css-1y0feak'})

    for teacher_div_tag in div_teacher_tags:

        # Add current alpha index to list
        alpha_indices.append(alpha_index)

        # Get 'a' tag contained within 'div' tag
        teacher_a_tag = teacher_div_tag.find('a')

        # Get teacher name and add to list
        teacher_names.append(teacher_a_tag.text)

        # Get href attribute and remove the leading "/" character to create teacher_href
        teacher_href = teacher_a_tag.get('href', default='/no href')
        teacher_href = teacher_href[1:]

        # Add new teacher_href to list
        teacher_hrefs.append(teacher_href)
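
The href handling in the loop above drops the first character unconditionally, which assumes every href starts with exactly one "/". A slightly more defensive variant is sketched below. `href_to_slug` is a hypothetical helper name, and returning `None` for a missing href is an assumption that differs from the notebook's `'no href'` placeholder.

```python
def href_to_slug(href):
    """Convert an href such as '/malhuxter' into the bare slug 'malhuxter'."""
    if not href:
        return None  # missing or empty href
    # Strip surrounding whitespace, then any leading slashes
    return href.strip().lstrip("/")


print(href_to_slug("/malhuxter"))  # malhuxter
print(href_to_slug("malhuxter"))   # malhuxter
print(href_to_slug(None))          # None
```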

In [7]:
# Create DataFrame with teacher IDs
teachers_list_dict = {'alpha_index':alpha_indices, 'teacher_name':teacher_names, 'teacher_href':teacher_hrefs}
teachers_list_df = pd.DataFrame(teachers_list_dict)

In [8]:
print(teachers_list_df.shape)
teachers_list_df.head()

(12492, 3)


| | alpha_index | teacher_name | teacher_href |
|---|---|---|---|
| 0 | hash | 33bowls | 33bowls |
| 1 | hash | 432hz Dúo | miguebrea_and_nicopucci |
| 2 | hash | 8FINITY ANGEL - SLEEP SPELLS | 8finityangel |
| 3 | a | A Course in Miracles Lessons | acimlessons |
| 4 | a | A Friday Studio | afridaystudio |


In [9]:
# Save results to data file
teachers_list_df.to_csv('../data/teachers_list_df.csv')