In [1]:
# first, load up the necessary libraries

import requests
from bs4 import BeautifulSoup
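# lxml is the parser that BeautifulSoup is told to use below;
# importing it here simply confirms that it is installed
import lxml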

In [3]:
# second, load up the webpage and create a 'soup' object

webpage = requests.get("https://www.stern.nyu.edu/faculty/search_name_form")
source = webpage.content
soup = BeautifulSoup(source, 'lxml')

## identifying html tags

Before you can scrape a specific part of a webpage, you need to find the HTML tag for the element you want to scrape. In this case, I want to get the names, affiliations, and email addresses of all the Adjuncts listed on this [faculty page](https://www.stern.nyu.edu/faculty/search_name_form).

In order to find out the element I need to scrape, I need to go into the page source. Here, I will use the "inspector" tool. On most browsers, you can fire up this tool by right-clicking on the element you want to scrape. 

Note: if you want to scrape an attribute, that's totally fine too. Check out the bs4 docs on [tags and attributes](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Tag) to get started.
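As a minimal sketch of reading an attribute: the snippet below parses a small hand-written HTML fragment (shaped like one row of the faculty table, not fetched live) and reads the `href` attribute of each link. The names `sample_html`, `sample_soup`, and `links` are made up for illustration, and I use Python's built-in `html.parser` here so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# a small hand-written fragment shaped like one row of the faculty table
sample_html = """
<table>
<tr>
<td><a href="https://www.stern.nyu.edu/faculty/bio/daniel-abut">Abut, Daniel </a></td>
<td><a href="mailto:daa249@stern.nyu.edu">daa249@stern.nyu.edu</a></td>
</tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# tag['href'] raises a KeyError if the attribute is missing;
# tag.get('href') returns None instead, which is safer inside a loop
links = [a.get('href') for a in sample_soup.find_all('a')]
print(links)
```

The same pattern works for any attribute name, such as `class` or `id`.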

After right-clicking on the element you want to scrape, select "inspect" in the popup menu.

![inspector tool](img/inspector.png)

Then, a window should appear that covers about half of your webpage. Check in the largest panel in this window (the one displaying html code) and look for the *highlighted code* in this section. That code contains the name of the html tag that you want to scrape.

![selection pane](img/selection.png)

Remember, it's helpful to think of the HTML as a tree structure. You can use bs4 attributes like `.parent` and `.children` to navigate this tree. Read more about [navigating html tags](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names) in the bs4 documentation.

Sometimes, it helps to identify the html tag directly above the current one. This is known as the 'parent' tag. 
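Here is a small sketch of moving up and down the tree with `.parent` and `.children`. It runs on a hand-written fragment rather than the live page, and the variable names (`sample_html`, `i_tag`, `cell`, `row`, `child_tags`) are my own. It uses the built-in `html.parser` so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup, Tag

sample_html = """
<table>
<tr>
<td><i>Adjunct Assistant Professor of Finance</i><br/>Finance Department<br/></td>
</tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

i_tag = sample_soup.find('i')
cell = i_tag.parent   # the <td> that directly wraps the <i> tag
row = cell.parent     # the <tr> that wraps the <td>
print(cell.name, row.name)

# .children yields the direct children of a tag, including plain text;
# keep only the children that are themselves tags
child_tags = [child.name for child in cell.children if isinstance(child, Tag)]
print(child_tags)
```

Note that `.parent` moves up exactly one level at a time, so getting from a tag to an enclosing row can take more than one step.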

## scraping the right tag

Let's first print the text inside every `<i>` tag. In the inspector, we saw that this is the tag that wraps each person's title.

Here, we are going to use the `find_all()` method from bs4.

In [4]:
# this is a for loop that goes through all of the <i> tags in our 'soup' object
# then uses the 'text' attribute to grab just the text (taking out the html tag)
# for that item

for item in soup.find_all('i'):
    print(item.text)

Adjunct Assistant Professor of Finance
C.V. Starr Professor of Economics
Adjunct Associate Professor
Adjunct Assistant Professor
Adjunct Professor of Finance
Assistant Professor of Accounting
Professor of Marketing
Professor Emeritus of Finance
Adjunct Assistant Professor
Ira Rennert Professor of Entrepreneurial Finance
Assistant Professor
Vice Dean for Faculty and Research
Clinical Assistant Professor
Emeritus Professor of Marketing
Adjunct Professor of Information, Operations and Management Sciences
Adjunct Instructor of Management Communication
Assistant Professor of Technology, Operations, and Statistics
Associate Professor of Technology, Operations, and Statistics
Professor Emeritus of Accounting
Professor of the Practice and Senior Fellow
Professor of Accounting
Executive in Residence
Adjunct Professor of Finance
Adjunct Professor
Clinical Professor of Finance and Professor of Management Practice
Adjunct Associate Professor
Visiting Associate Professor of Management
Adjunct Assis

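The code definitely works here. We got a list of titles. But we want more than the titles; we also want to get the names and emails. So what do we do?

We go up the tree to the `<tr>` tag, the table row that encloses each person's `<i>` tag together with their name and email. (Strictly speaking, the direct parent of `<i>` is a `<td>` cell, and the `<tr>` is the parent of that cell.)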

In [6]:
# printing all of the items contained within the <tr> element

for item in soup.find_all('tr'):
    print(item.text)


Name
Title & Department
Contact


Abut, Daniel 

Adjunct Assistant Professor of Finance
                                          Finance Department


daa249@stern.nyu.edu



Acharya, Viral V.

C.V. Starr Professor of Economics
                                          Finance Department
                                          Berkley Center for Entrepreneurship
                                          Salomon Center for the Study of Financial Institutions
                                          Volatility and Risk Institute
                                          Chao-Hon Chen Institute for Global Real Estate Finance


vva1@stern.nyu.edu



Adamson, Allen 

Adjunct Associate Professor
                                          Marketing Department


apa1@stern.nyu.edu



Afsar Melemetci, Beril 

Adjunct Assistant Professor
                                          Marketing Department


ba2201@stern.nyu.edu



Albanese, Tommaso M.

Adjunct Professor of Finance
                 

This is good! But remember, we are only looking for faculty who are adjuncts. How do we do this?

The first thing is to realize the structure of our data. Each `tr` element is one item among a series of items. You can think of it like an item in a list, or a row in a dataset. Either way, the main idea is that we can loop through these items to pick out the desired objects. 
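To make the "row in a dataset" idea concrete, here is a hedged sketch that loops over the rows of a hand-written fragment and pulls each cell out separately. This is an alternative to the text-based approach used in the rest of this notebook, not a replacement for it. The fragment, the variable names, and the assumption that the heading row uses `<th>` cells are all mine; the live page may be built differently.

```python
from bs4 import BeautifulSoup

sample_html = """
<table>
<tr><th>Name</th><th>Title &amp; Department</th><th>Contact</th></tr>
<tr>
<td><a href="https://www.stern.nyu.edu/faculty/bio/daniel-abut">Abut, Daniel </a></td>
<td>
<i>Adjunct Assistant Professor of Finance</i><br/>
                                          Finance Department<br/>
</td>
<td>
<a href="mailto:daa249@stern.nyu.edu">daa249@stern.nyu.edu</a>
</td>
</tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

records = []
for row in sample_soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) != 3:
        continue  # skip the heading row (assumed here to use <th> cells)
    # get_text(separator, strip=True) trims each piece of text and joins the pieces
    records.append([cell.get_text(', ', strip=True) for cell in cells])

print(records)
```

Each inner list in `records` holds one person's name, affiliation, and email, which is the shape a spreadsheet row needs.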

In [7]:
# saves all the `<tr>` tags to the `tr_elements` variable
tr_elements = soup.find_all('tr')

In [9]:
# accesses just the second item, which is the very first row of data in our faculty table
# FYI - the first item, at position 0, is the row of column headings
tr_elements[1]

<tr>
<td><a href="https://www.stern.nyu.edu/faculty/bio/daniel-abut">Abut, Daniel </a></td>
<td>
<i>Adjunct Assistant Professor of Finance</i><br/>
                                          Finance Department<br/>
</td>
<td>
<a href="mailto:daa249@stern.nyu.edu">daa249@stern.nyu.edu</a>
</td>
</tr>

Our goal is to weed out the items that do *not* contain "Adjunct" in the title. Our final list, therefore, should only be items that contain "Adjunct" somewhere in the text. 

To accomplish that, we can write a `for loop` that contains a conditional statement, i.e. an `if-statement`. Then, we save the results to a list called `adjuncts`.

In [10]:
adjuncts = []

# for each "row" (or item) in all the TR elements
for item in soup.find_all('tr'):
    
    # check if it has Adjunct
    if 'Adjunct' in item.text:
        
        # if it does, then append it to our new list
        adjuncts.append(item.text)

In [11]:
# checking our output, just the first item (at position 0)

adjuncts[0]

'\nAbut, Daniel \n\nAdjunct Assistant Professor of Finance\n                                          Finance Department\n\n\ndaa249@stern.nyu.edu\n\n'

In [12]:
# now the first 10 items

adjuncts[:10]

['\nAbut, Daniel \n\nAdjunct Assistant Professor of Finance\n                                          Finance Department\n\n\ndaa249@stern.nyu.edu\n\n',
 '\nAdamson, Allen \n\nAdjunct Associate Professor\n                                          Marketing Department\n\n\napa1@stern.nyu.edu\n\n',
 '\nAfsar Melemetci, Beril \n\nAdjunct Assistant Professor\n                                          Marketing Department\n\n\nba2201@stern.nyu.edu\n\n',
 '\nAlbanese, Tommaso M.\n\nAdjunct Professor of Finance\n                                          Finance Department\n\n\nta42@stern.nyu.edu\n\n',
 '\nAltman, Steven A.\n\nAdjunct Assistant Professor\n                                          Management and Organizations Department\n                                          Center for the Future of Management\n\n\nsaa558@stern.nyu.edu\n\n',
 '\nAviv, Yossi \n\nAdjunct Professor of Information, Operations and Management Sciences\n                                          Technology, Operat

We got our list of adjunct names, affiliation, and emails! Next step is to clean up the formatting of this list and save the results to a handy CSV file. 

## cleaning the results

There are some things we want to take out of our data. These include so-called "whitespace" characters, which here are newlines that appear:
- on their own, as `\n`
- in pairs, as `\n\n`
- in threes, as `\n\n\n`

We also want to separate out each entity (a person's name, affiliation, email) into separate items, so that they can populate separate cells on a spreadsheet. In other words we want the following string: 

`'Abut, Daniel Adjunct Assistant Professor of Finance Finance Department daa249@stern.nyu.edu'`

to become a part of a list of individual strings, like: 

`'Abut, Daniel ', 'Adjunct Assistant Professor of Finance, Finance Department', 'daa249@stern.nyu.edu'`

In [13]:
# first, remove the newline characters, i.e. the \n

stripped = []

for item in adjuncts:
    stripped.append(item.strip('\n'))
    
# check the results
stripped[0]

'Abut, Daniel \n\nAdjunct Assistant Professor of Finance\n                                          Finance Department\n\n\ndaa249@stern.nyu.edu'

In [14]:
# we can also do the same using 'list comprehension' - a way of shortening
# the syntax to compress the loop into one line of code.

stripped = [item.strip('\n') for item in adjuncts]

# check the first few lines
stripped[:3]

['Abut, Daniel \n\nAdjunct Assistant Professor of Finance\n                                          Finance Department\n\n\ndaa249@stern.nyu.edu',
 'Adamson, Allen \n\nAdjunct Associate Professor\n                                          Marketing Department\n\n\napa1@stern.nyu.edu',
 'Afsar Melemetci, Beril \n\nAdjunct Assistant Professor\n                                          Marketing Department\n\n\nba2201@stern.nyu.edu']

As you can see, we got some of the `\n` but not all. `strip('\n')` only trims characters from the two ends of a string, so the ones at the very start and end are gone, but the ones in the middle, like `\n\n` or `\n\n\n`, are still there.

But this is actually good. We will use the remaining `\n\n` to split each row into individual items. This will be necessary when we transfer the data to a CSV.
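A quick sketch of the two steps on a single string, so you can see exactly what each one does. The string is copied from the `adjuncts[0]` output earlier in this notebook; the names `raw`, `trimmed`, and `pieces` are just for illustration.

```python
# the raw text of one table row, copied from the adjuncts[0] output
raw = ('\nAbut, Daniel \n\nAdjunct Assistant Professor of Finance\n'
       '                                          Finance Department'
       '\n\n\ndaa249@stern.nyu.edu\n\n')

trimmed = raw.strip('\n')        # removes newlines at the two ends only
pieces = trimmed.split('\n\n')   # cuts the string at every pair of newlines

print('\n' in trimmed)   # the newlines in the middle are still there
print(len(pieces))
print(pieces)
```

Because the email is preceded by three newlines and the split uses two, one `\n` is left over at the front of the email. That leftover is what a later strip has to clean up.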

In [17]:
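# a quick demo of split(): it cuts a string at every occurrence of the separator
# and returns the pieces as a list (the separator itself is dropped)
quote = 'Give me liberty or give me death!'

quote.split('e')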

['Giv', ' m', ' lib', 'rty or giv', ' m', ' d', 'ath!']

In [27]:
# list comprehension version of loop

divided_comprehension = [item for row in stripped for item in row.split('\n\n')]

# extended (traditional) version of loop, building the same flat list

divided = []  # create the empty list first, otherwise append() raises a NameError

for row in stripped:
    for item in row.split('\n\n'):
        divided.append(item)

In [28]:
divided_comprehension[:10]

['Abut, Daniel ',
 'Adjunct Assistant Professor of Finance\n                                          Finance Department',
 '\ndaa249@stern.nyu.edu',
 'Adamson, Allen ',
 'Adjunct Associate Professor\n                                          Marketing Department',
 '\napa1@stern.nyu.edu',
 'Afsar Melemetci, Beril ',
 'Adjunct Assistant Professor\n                                          Marketing Department',
 '\nba2201@stern.nyu.edu',
 'Albanese, Tommaso M.']

In [26]:
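# judging by the cell number, this output appears to come from an earlier run
# in which each row's pieces were kept together in their own sub-list
divided[:10]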

[['Abut, Daniel ',
  'Adjunct Assistant Professor of Finance\n                                          Finance Department',
  '\ndaa249@stern.nyu.edu'],
 ['Adamson, Allen ',
  'Adjunct Associate Professor\n                                          Marketing Department',
  '\napa1@stern.nyu.edu'],
 ['Afsar Melemetci, Beril ',
  'Adjunct Assistant Professor\n                                          Marketing Department',
  '\nba2201@stern.nyu.edu'],
 ['Albanese, Tommaso M.',
  'Adjunct Professor of Finance\n                                          Finance Department',
  '\nta42@stern.nyu.edu'],
 ['Altman, Steven A.',
  'Adjunct Assistant Professor\n                                          Management and Organizations Department\n                                          Center for the Future of Management',
  '\nsaa558@stern.nyu.edu'],
 ['Aviv, Yossi ',
  'Adjunct Professor of Information, Operations and Management Sciences\n                                          Technology, Opera

In [29]:
divided_comprehension[0]

'Abut, Daniel '

In [30]:
# Now we can remove some of the remaining \n characters

stripped_twice = []

for item in divided_comprehension:
    stripped_twice.append(item.strip('\n'))

In [32]:
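# one more pass, this time over stripped_twice; strip() only trims the ends of
# each string, so the \n inside the affiliation strings is still there afterwards
stripped_thrice = []

for item in stripped_twice:
    stripped_thrice.append(item.strip('\n'))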

In [33]:
stripped_thrice

['Abut, Daniel ',
 'Adjunct Assistant Professor of Finance\n                                          Finance Department',
 'daa249@stern.nyu.edu',
 'Adamson, Allen ',
 'Adjunct Associate Professor\n                                          Marketing Department',
 'apa1@stern.nyu.edu',
 'Afsar Melemetci, Beril ',
 'Adjunct Assistant Professor\n                                          Marketing Department',
 'ba2201@stern.nyu.edu',
 'Albanese, Tommaso M.',
 'Adjunct Professor of Finance\n                                          Finance Department',
 'ta42@stern.nyu.edu',
 'Altman, Steven A.',
 'Adjunct Assistant Professor\n                                          Management and Organizations Department\n                                          Center for the Future of Management',
 'saa558@stern.nyu.edu',
 'Aviv, Yossi ',
 'Adjunct Professor of Information, Operations and Management Sciences\n                                          Technology, Operations and Statistics Department'

Excellent work! Now we have a much cleaner list of names, affiliations, and emails, although each affiliation still has a `\n` and a run of spaces in the middle. The next step will be to break up this list into a tabular format, so we can have one person per row.
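One possible way to finish the cleanup and group the flat list into rows, sketched on a short hand-copied sample of the list. The helper `tidy` and the names `flat`, `cleaned`, and `rows` are hypothetical, and the grouping assumes every person contributes exactly three items (name, affiliation, email). If a row on the live page is missing a piece, this grouping would drift out of step.

```python
# a short hand-copied sample of the flat list
flat = [
    'Abut, Daniel ',
    'Adjunct Assistant Professor of Finance\n                                          Finance Department',
    'daa249@stern.nyu.edu',
    'Adamson, Allen ',
    'Adjunct Associate Professor\n                                          Marketing Department',
    'apa1@stern.nyu.edu',
]

def tidy(text):
    # split on the leftover newlines, trim each part, drop empties, rejoin with ', '
    parts = [part.strip() for part in text.split('\n')]
    return ', '.join(part for part in parts if part)

cleaned = [tidy(item) for item in flat]

# assumption: exactly three items per person, so take the list three at a time
rows = [cleaned[i:i + 3] for i in range(0, len(cleaned), 3)]

for row in rows:
    print(row)
```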

## writing CSV to format and save our data

In [34]:
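import csv

# write the whole flat list as one long row
# newline='' stops the csv module from adding blank lines between rows on Windows
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stripped_twice)

# the with block closes the file for us as soon as it ends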

In [35]:
# open input and output files
with open('data.csv', 'r', newline='') as input_file, open('adjuncts.csv', 'w', newline='') as output_file:
    # create csv reader and writer objects
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)

    # iterate over rows in input file
    for row in reader:
        # initialize list for updated row
        new_row = []
        # iterate over cells in row
        for cell in row:
            # add cell to current row
            new_row.append(cell)
            # an email (it contains "edu") marks the end of one person's entry
            if "edu" in cell:
                # write the finished row to output file
                writer.writerow(new_row)
                # start new row
                new_row = []
        # write any leftover cells, but skip the write if nothing is left
        if new_row:
            writer.writerow(new_row)
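If the data is already grouped into one list per person, the intermediate `data.csv` file is not strictly needed. The sketch below writes such rows straight to a CSV with a heading row and then reads the file back to check it. The sample rows are hand-copied from the output above, the column names are my own choice, and the file goes into a temporary folder (under the hypothetical name `adjuncts_sketch.csv`) so it does not overwrite `adjuncts.csv`.

```python
import csv
import os
import tempfile

# hand-copied sample rows: one person per inner list
rows = [
    ['Abut, Daniel', 'Adjunct Assistant Professor of Finance, Finance Department', 'daa249@stern.nyu.edu'],
    ['Adamson, Allen', 'Adjunct Associate Professor, Marketing Department', 'apa1@stern.nyu.edu'],
]

# write to a temporary folder so the example leaves adjuncts.csv alone
out_path = os.path.join(tempfile.mkdtemp(), 'adjuncts_sketch.csv')

with open(out_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'affiliation', 'email'])  # heading row (names are my own)
    writer.writerows(rows)                               # one line per person

# read the file back to confirm its shape
with open(out_path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(read_back)
```

The `csv` module adds quotes around any cell that contains a comma, so names like `Abut, Daniel` stay in a single column.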