# Faculty Directory Crawler Project
This Python project uses Beautiful Soup to crawl the university’s faculty directory and extract key information:
* Name
* Email address
* Position
* Bio content
   
The crawler navigates the directory pages, parses the HTML content, and stores the extracted fields in a pandas DataFrame. Ethical scraping practices, such as respecting `robots.txt` and limiting the request rate, are a key consideration.
The resulting dataset of faculty information can support a range of academic and administrative uses.
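
As one concrete form of the ethical-scraping consideration above, the sketch below checks `robots.txt` rules before fetching a URL. It uses only the standard library. The rules in `SAMPLE_ROBOTS_TXT` and the `USER_AGENT` name are hypothetical and exist only so the example runs offline; they are not the actual rules published by the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

USER_AGENT = "faculty-directory-crawler"  # hypothetical user-agent name


def build_robot_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robot_parser(SAMPLE_ROBOTS_TXT)

# Ask whether this user agent may fetch each URL under the sample rules.
directory_allowed = rules.can_fetch(USER_AGENT, "https://www.rhsmith.umd.edu/directory")
admin_allowed = rules.can_fetch(USER_AGENT, "https://www.rhsmith.umd.edu/admin/settings")
print(directory_allowed, admin_allowed)
```

In a live crawl, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the sample text.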

**The purpose of this notebook is to experiment with web scraping and text summarization.**

In [1]:
import requests
import urllib3
from bs4 import BeautifulSoup
import re
from tqdm import tqdm
import pandas as pd



## Fetch URL Content
* Use urllib3 to fetch the page content
* Use BeautifulSoup to parse the fetched HTML

In [2]:
http = urllib3.PoolManager()  # create the pool once so connections are reused
fetchURL = lambda url: http.request('GET', url).data
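
A slightly more defensive fetch helper is sketched below, assuming that a timeout, a few retries, and an explicit status check are wanted. The name `fetch_html`, the 10-second timeout, and the retry count are illustrative choices, not part of the original notebook.

```python
import urllib3

# One shared pool lets urllib3 reuse connections across requests.
_POOL = urllib3.PoolManager()


def fetch_html(url, pool=_POOL, timeout=10.0):
    """Return the response body as bytes, or raise if the status is not 200."""
    response = pool.request("GET", url, timeout=timeout, retries=3)
    if response.status != 200:
        raise RuntimeError(f"GET {url} returned HTTP {response.status}")
    return response.data
```

Passing the pool as a parameter keeps the function easy to exercise with a stand-in object, without touching the network.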

In [3]:
%%time
sampleUrl = 'https://www.rhsmith.umd.edu/directory'

html_content = fetchURL(sampleUrl)

soup = BeautifulSoup(html_content, 'lxml')

CPU times: user 40.1 ms, sys: 5.3 ms, total: 45.4 ms
Wall time: 815 ms


## Finding Tags

### Name Tag

In [4]:
tagList = soup.find_all('div',attrs={'class':"col-12 mb-4"})
print(len(tagList))
tag = tagList[0]

10
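
Only 10 cards come back from this single request, which suggests the listing may be paginated. The sketch below generates one URL per listing page under the assumption that the site accepts a zero-based `page` query parameter. That parameter name is hypothetical and should be confirmed against the site's own pagination links before use.

```python
from urllib.parse import urlencode


def page_urls(base_url, num_pages, param="page"):
    """Yield one listing URL per page; `param` is an assumed query key."""
    for page in range(num_pages):
        yield f"{base_url}?{urlencode({param: page})}"


for url in page_urls("https://www.rhsmith.umd.edu/directory", 3):
    print(url)
```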


In [5]:
tag

<div class="col-12 mb-4">
<div class="row">
<div class="col-12 col-800-3 mb-3 mb-800-0">
<img alt="Suresh Acharya" class="w-100" src="/sites/default/files/people/headshots/acharya-suresh.jpg"/>
</div>
<div class="col-12 col-800-6">
<p class="h3 mb-1 person-name">
<a class="fancy-link" href="/directory/suresh-acharya">Suresh Acharya</a>
</p>
<ul><li>Professor of Practice</li><li>Academic Director, MS in Business Analytics</li></ul>
<div>Decision, Operations and Information Technologies</div>
<div>AI Faculty, Social Impact</div>
</div>
<div class="col-12 col-800-3">
<p class="profile-subheading">Contact</p>
<div>
<a href="tel:301-405-9678">301-405-9678</a>
</div>
<div>
<a href="mailto:suresh12@umd.edu">suresh12@umd.edu</a>
</div>
<div>4342 Van Munching Hall</div>
</div>
</div>
</div>

In [6]:
tag.find('a',attrs={'class':'fancy-link'}).string

'Suresh Acharya'

### Mail Tag

In [7]:
tag.find('a',href=re.compile(r'^mailto')).string

'suresh12@umd.edu'

### Position Tag

In [8]:
tag.find('ul').find_all('li')

[<li>Professor of Practice</li>,
 <li>Academic Director, MS in Business Analytics</li>]

In [9]:
finPos = ''
for pos in tag.find('ul').find_all('li'):
    finPos += pos.string + '$'
finPos

'Professor of Practice$Academic Director, MS in Business Analytics$'
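
The loop above leaves a trailing `$` on the result. If that is not wanted, `str.join` is an alternative, sketched here on a trimmed copy of the card HTML shown earlier. It uses the built-in `html.parser` so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the position list from the sample card.
SAMPLE_CARD = """
<div class="col-12 mb-4">
<ul><li>Professor of Practice</li><li>Academic Director, MS in Business Analytics</li></ul>
</div>
"""

card = BeautifulSoup(SAMPLE_CARD, "html.parser")
positions = [li.get_text(strip=True) for li in card.find("ul").find_all("li")]

# join() places the delimiter only between items, so nothing trails.
joined_positions = "$".join(positions)
print(joined_positions)
# Professor of Practice$Academic Director, MS in Business Analytics

# Splitting on the same delimiter recovers the original list.
print(joined_positions.split("$"))
```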

### HREF Tag

In [10]:
tag.find('a',attrs={'class':'fancy-link'})

<a class="fancy-link" href="/directory/suresh-acharya">Suresh Acharya</a>

In [11]:
link = tag.find('a',attrs={'class':'fancy-link'}).get('href')
link

'/directory/suresh-acharya'

In [12]:
f"{sampleUrl}/{link.split('/')[-1]}"

'https://www.rhsmith.umd.edu/directory/suresh-acharya'
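
Building the profile URL by string formatting relies on every `href` having the same shape. As an alternative sketch, the standard library's `urljoin` resolves an `href` against the page URL and handles both absolute and relative paths. The variable names below are illustrative.

```python
from urllib.parse import urljoin

sample_url = "https://www.rhsmith.umd.edu/directory"
link = "/directory/suresh-acharya"

# An href starting with "/" replaces the whole path of the base URL.
profile_url = urljoin(sample_url, link)
print(profile_url)
# https://www.rhsmith.umd.edu/directory/suresh-acharya
```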

### Content Section

In [159]:
%%time
sampleContentUrl = f"{sampleUrl}/{link.split('/')[-1]}"

html_contentV2 = fetchURL(sampleContentUrl)

soupV2 = BeautifulSoup(html_contentV2, 'lxml')

CPU times: user 46 ms, sys: 4.75 ms, total: 50.7 ms
Wall time: 247 ms


In [167]:
soupV2.find('div',attrs={'class':"profile-bio editor-content"}).p.string

'Suresh Acharya is Professor of Practice and the Academic Director for the MS in Business Analytics programs at the Robert H. Smith School of Business. Suresh has spent over 25 years designing and building statistical and optimization solutions in the areas of Supply Chain Management, Retail Planning, Airline Operations, Logistics, and Pricing and Revenue Management.\xa0As a practicing analytics professional, Suresh continues to work with Fortune 500 companies in delivering practical algorithmic solutions that demonstrate measurable customer value. His experience helps students gain real world insights as they prepare themselves for a fulfilling career in Analytics.\xa0Suresh has a Masters in Mathematical Sciences from Clemson University and a Masters in Operations Research from the University of North Carolina.'
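
Reading `.p.string` takes only the first `<p>` of the bio block, returns `None` when that paragraph contains nested tags, and keeps non-breaking spaces (`\xa0`) in the text. A more tolerant extraction is sketched below. The helper name `extract_bio` and the sample HTML are hypothetical, and matching on the `profile-bio` class alone is an assumption about the page layout.

```python
from bs4 import BeautifulSoup


def extract_bio(profile_soup):
    """Join the text of every <p> in the bio block, or return None if absent."""
    bio_div = profile_soup.find("div", class_="profile-bio")
    if bio_div is None:
        return None
    paragraphs = [p.get_text() for p in bio_div.find_all("p")]
    # str.split() with no arguments also splits on non-breaking spaces,
    # so this collapses every whitespace run into a single space.
    return " ".join(" ".join(paragraphs).split())


# Hypothetical profile snippet with a nested tag and a non-breaking space.
SAMPLE_PROFILE = (
    '<div class="profile-bio editor-content">'
    "<p>First <em>point</em>.&nbsp;Second point.</p>"
    "<p>Third point.</p>"
    "</div>"
)

bio_text = extract_bio(BeautifulSoup(SAMPLE_PROFILE, "html.parser"))
print(bio_text)
# First point. Second point. Third point.
```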

## Merging them all

In [183]:
%%time
directory = {'name':[],'email':[],'position':[],'bio':[]}
for tag in tqdm(tagList):
    # Handle errors per card so one malformed entry does not stop the crawl
    try:
        # Fields available on the listing card
        name = tag.find('a',attrs={'class':'fancy-link'}).string
        email = tag.find('a',href=re.compile(r'^mailto')).string
        finPos = ''
        for pos in tag.find('ul').find_all('li'):
            finPos += pos.string + '$'

        # The bio lives on the individual profile page
        link = tag.find('a',attrs={'class':'fancy-link'}).get('href')
        contentUrl = f"{sampleUrl}/{link.split('/')[-1]}"
        html_contentV2 = fetchURL(contentUrl)
        soupV2 = BeautifulSoup(html_contentV2, 'lxml')
        bio = soupV2.find('div',attrs={'class':"profile-bio editor-content"}).p.string
    except Exception as e:
        print(e)
        continue

    # Append only after every field was read, so the columns stay aligned
    directory['name'].append(name)
    directory['email'].append(email)
    directory['position'].append(finPos)
    directory['bio'].append(bio)

100%|███████████████████████████████████████████| 10/10 [00:03<00:00,  3.08it/s]

CPU times: user 524 ms, sys: 26.6 ms, total: 551 ms
Wall time: 3.25 s
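
Some cards may lack an email link or a position list, in which case chained calls such as `.find(...).string` raise `AttributeError`. The sketch below moves the per-card parsing into a helper that returns `None` (or an empty string for positions) for missing fields. The helper name `parse_card` and the second card in `SAMPLE_LISTING` are hypothetical.

```python
import re

from bs4 import BeautifulSoup


def parse_card(card):
    """Extract name, email, positions, and profile href from one directory card."""
    name_link = card.find("a", class_="fancy-link")
    mail_link = card.find("a", href=re.compile(r"^mailto:"))
    position_list = card.find("ul")
    positions = position_list.find_all("li") if position_list is not None else []
    return {
        "name": name_link.get_text(strip=True) if name_link is not None else None,
        "email": mail_link.get_text(strip=True) if mail_link is not None else None,
        "position": "$".join(li.get_text(strip=True) for li in positions),
        "href": name_link.get("href") if name_link is not None else None,
    }


# First card is trimmed from the sample above; the second is made up
# to show a card with no email link and no position list.
SAMPLE_LISTING = """
<div class="col-12 mb-4">
  <a class="fancy-link" href="/directory/suresh-acharya">Suresh Acharya</a>
  <ul><li>Professor of Practice</li><li>Academic Director, MS in Business Analytics</li></ul>
  <a href="mailto:suresh12@umd.edu">suresh12@umd.edu</a>
</div>
<div class="col-12 mb-4">
  <a class="fancy-link" href="/directory/example-person">Example Person</a>
</div>
"""

listing = BeautifulSoup(SAMPLE_LISTING, "html.parser")
rows = [parse_card(card) for card in listing.find_all("div", class_="col-12 mb-4")]
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.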





In [184]:
pd.DataFrame(directory)

| | name | email | position | bio |
|---|---|---|---|---|
| 0 | Suresh Acharya | suresh12@umd.edu | Professor of Practice$Academic Director, MS in... | Suresh Acharya is Professor of Practice and th... |
| 1 | Rajshree Agarwal | rajshree@umd.edu | Rudolph Lamone Chair of Strategy and Entrepren... | Rajshree Agarwal is the Rudolph Lamone Chair o... |
| 2 | Aysun Alp Paukowits | alpaysun@umd.edu | Visiting Assistant Professor of Finance$ | Aysun Alp Paukowits joined the Robert H. Smith... |
| 3 | Tejwansh (Tej) Singh Anand | tejanand@umd.edu | Clinical Professor$Academic Director, MS in In... | Tejwansh (Tej) Singh Anand is a Clinical Profe... |
| 4 | G. “Anand” Anandalingam | ganand@umd.edu | Ralph J. Tyser Professor of Management Science$ | Professor G. 'Anand' Anandalingam is the Ralph... |
| 5 | T. Leigh Anenson | lanenson@umd.edu | Professor$Associate Director, C-BERC$ | TERP Teaching Essentials |
| 6 | Manmohan Aseri | maseri@umd.edu | Assistant Professor of Information Systems$ | Manmohan Aseri is an Assistant Professor of De... |
| 7 | Joseph P. Bailey | jpbailey@umd.edu | Associate Dean of Undergraduate Programs$Assoc... | Joseph P. Bailey's research and teaching inter... |
| 8 | M. Gisela Bardossy | bardossy@umd.edu | Faculty Director, QUEST$Associate Clinical Pro... | M. Gisela Bardossy is a Clinical Professor in ... |
| 9 | Progyan Basu | pbasu@umd.edu | Clinical Professor$ | Dr. Basu has over 30 years of experience teach... |
