# Teacher List Scrape

This notebook scrapes a list of teachers from directory pages on Insight Timer's website.

The resulting dataframe contains three columns (example values shown):
- teacher_id: malhuxter
- dir_teacher_name: Malcolm Huxter
- alpha_index: m

The dataframe is saved to **teachers_list_df.csv**.

In [None]:
from selenium import webdriver
#import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import string
from datetime import datetime

In [None]:
# Insight Timer pages are rendered with JavaScript, so Selenium and ChromeDriver are needed to load them.
chrome_driver_path = '../../../../Tech/chrome_driver/chromedriver.exe'

### Teacher Lists
https://insighttimer.com/dir/meditation-teachers/  <br>
https://insighttimer.com/dir/meditation-teachers/a <br>
https://insighttimer.com/dir/meditation-teachers/b

In [None]:
directory_alpha_index = ['hash'] + list(string.ascii_lowercase) + ['more']
# Result is ['hash', 'a', 'b', 'c', ... 'x', 'y', 'z', 'more']
# 'hash' page includes teachers with names starting with a number.
# 'more' page includes teachers with names starting with non-standard characters 
#     such as punctuation marks or languages other than English.
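
To make the mapping from index to page concrete, here is a minimal sketch that expands the index list into one URL per directory page. The name `directory_urls` is hypothetical; the base URL and index list are the ones used in this notebook.

```python
import string

# Base URL and index list as defined in this notebook
teachers_dir_url = 'https://insighttimer.com/dir/meditation-teachers/'
directory_alpha_index = ['hash'] + list(string.ascii_lowercase) + ['more']

# Hypothetical name: one URL per directory page
directory_urls = [teachers_dir_url + alpha_index
                  for alpha_index in directory_alpha_index]

print(len(directory_urls))   # 'hash' + 26 letters + 'more'
print(directory_urls[0])
print(directory_urls[-1])
```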

In [None]:
teachers_dir_url = 'https://insighttimer.com/dir/meditation-teachers/'

In [None]:
# Initialize lists that will become columns in the dataframe
alpha_indices = []
teacher_ids = []
dir_teacher_names = []

In [None]:
start_time = datetime.now()

In [None]:
# Iterate through teacher pages for each letter
for alpha_index in directory_alpha_index:
    
    # Build the directory URL for this index
    url = teachers_dir_url + alpha_index

    # Create a new Chrome session with a custom executable path and load the page
    # (executable_path is deprecated in Selenium 4 in favor of a Service object)
    driver = webdriver.Chrome(executable_path=chrome_driver_path)
    driver.get(url)

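    # implicitly_wait only sets the timeout for element lookups; by default
    # driver.get() already blocks until the page's load event fires
    driver.implicitly_wait(2)

    # Make soup and quit the driver to end the browser session
    soup = BS(driver.page_source, 'html.parser')
    driver.quit()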
    
    # Example of a tag that includes teacher data:
    # <div class="css-1y0feak">
    #   <a href="/malhuxter">Malcolm Huxter</a>
    # </div>
    
    div_teacher_tags = soup.find_all('div', attrs={'class': 'css-1y0feak'})

    for teacher_div_tag in div_teacher_tags:
        
        # Add current alpha index to list
        alpha_indices.append(alpha_index)
        
        # Get 'a' tag contained within 'div' tag
        teacher_a_tag = teacher_div_tag.find('a')
        
        # Get teacher name and add to list
        dir_teacher_names.append(teacher_a_tag.text)
        
        # Get href attribute and remove the leading "/" to create teacher_id
        # (a missing href yields the placeholder 'no href')
        teacher_id = teacher_a_tag.get('href', default='/no href')
        teacher_id = teacher_id[1:]

        # Add new teacher_id to list
        teacher_ids.append(teacher_id)
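
The parsing step can be tried without launching a browser. The sketch below runs the same tag lookup on a static HTML string modeled on the example tag in the scraping cell. The function name `extract_teachers` and the `sample_html` string are hypothetical, and skipping a `div` that has no link is an added safeguard rather than something the loop above does.

```python
from bs4 import BeautifulSoup

# Static sample modeled on the example tag in the scraping cell
sample_html = '''
<div class="css-1y0feak">
  <a href="/malhuxter">Malcolm Huxter</a>
</div>
'''

def extract_teachers(html):
    """Return (teacher_id, teacher_name) pairs found in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for div_tag in soup.find_all('div', attrs={'class': 'css-1y0feak'}):
        a_tag = div_tag.find('a')
        if a_tag is None:
            continue  # added safeguard: skip a div without a link
        # Same placeholder as the notebook when href is missing
        href = a_tag.get('href', '/no href')
        rows.append((href[1:], a_tag.text))
    return rows

print(extract_teachers(sample_html))
```

Because the class name `css-1y0feak` looks auto-generated, it may change when the site is rebuilt, so an empty result is worth checking for before trusting a run.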

In [None]:
# Create DataFrame with teacher IDs
teachers_list_dict = {'teacher_id':teacher_ids,
                      'dir_teacher_name':dir_teacher_names, 
                      'alpha_index':alpha_indices}

teachers_list_df = pd.DataFrame(teachers_list_dict)

In [None]:
# Save results to data file
teachers_list_df.to_csv('../data/teachers_list_df.csv')
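
Since `to_csv` is called without `index=False`, the row index is written as an unnamed first column. The sketch below shows one way to read such a file back; it uses an in-memory buffer and a one-row stand-in dataframe (both hypothetical) instead of the real data file.

```python
import io
import pandas as pd

# Small stand-in for teachers_list_df, using the example row from this notebook
df = pd.DataFrame({'teacher_id': ['malhuxter'],
                   'dir_teacher_name': ['Malcolm Huxter'],
                   'alpha_index': ['m']})

# to_csv writes the row index as an unnamed first column by default
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores that column as the index instead of 'Unnamed: 0'
reloaded = pd.read_csv(buffer, index_col=0)
print(list(reloaded.columns))
```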

In [None]:
end_time = datetime.now()

In [None]:
# Print runtime
runtime = end_time - start_time
hours, remainder = divmod(int(runtime.total_seconds()), 3600)
minutes, seconds = divmod(remainder, 60)

print('Runtime:')

if hours > 0:
    print(hours,'hours')
if minutes > 0:
    print(minutes,'minutes')
print(seconds,'seconds')
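
The `divmod` arithmetic above can be wrapped in a small helper so it can be checked on a known duration. This is a sketch only; the function name `format_runtime` is hypothetical.

```python
from datetime import timedelta

def format_runtime(runtime):
    """Format a timedelta, omitting hours and minutes when they are zero."""
    # Split total seconds into hours, then split the remainder into minutes and seconds
    hours, remainder = divmod(int(runtime.total_seconds()), 3600)
    minutes, seconds = divmod(remainder, 60)
    parts = []
    if hours > 0:
        parts.append(f'{hours} hours')
    if minutes > 0:
        parts.append(f'{minutes} minutes')
    parts.append(f'{seconds} seconds')
    return ' '.join(parts)

print(format_runtime(timedelta(minutes=8, seconds=59)))
```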

The final run was on December 8, 2022. The runtime was 8 minutes 59 seconds.

In [None]:
print(teachers_list_df.shape)
teachers_list_df.head()

# Alternative Teacher Lists
https://insighttimer.com/meditation-teachers/ <br>
https://insighttimer.com/meditation-teachers/starts-with-k <br>
https://insighttimer.com/meditation-teachers/starts-with-k/1 <br>
https://insighttimer.com/meditation-teachers/starts-with-k/2 <br>
These pages list 50 teachers per page. <br>
They show the number of teachers for each letter. <br>
They do not include the "hash" and "more" teachers. <br>
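
If the alternative lists were scraped instead, the per-letter teacher count and the 50-per-page limit would determine how many pages to request. The sketch below is illustrative only: the function name `alt_page_urls` and the count of 120 are hypothetical, and it assumes the bare `starts-with-k` URL is the first page with `/1`, `/2`, ... following it, which has not been verified.

```python
import math

# Base URL pattern taken from the links above
alt_dir_url = 'https://insighttimer.com/meditation-teachers/starts-with-'
TEACHERS_PER_PAGE = 50

def alt_page_urls(letter, teacher_count):
    """Build page URLs for one letter.

    Assumption: the bare URL is the first page and /1, /2, ... follow it.
    """
    page_count = math.ceil(teacher_count / TEACHERS_PER_PAGE)
    base = alt_dir_url + letter
    return [base] + [f'{base}/{n}' for n in range(1, page_count)]

# Hypothetical count of 120 teachers, which needs three pages
print(alt_page_urls('k', 120))
```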