# Collect Data From CSV Files

In this tutorial, we retrieve data from a .csv file.

### What is CSV?

* A CSV (comma-separated values) file is a text file in which information is separated by commas.
* CSV files are widely used to exchange tabular data between spreadsheets and databases.
* CSV files are convenient and can be read/output by many programs.
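
To make the format concrete, here is a minimal sketch that parses a few lines of CSV text with Python's standard csv module. The sample rows are made up for illustration (they are not taken from the project's data files); the point is only that the first line supplies the column names and every later line is one record.

```python
import csv
import io

# Hypothetical CSV text: the first line holds the column names,
# each following line is one record with comma-separated values.
raw = (
    "age_group,vaccine_status,outcome\n"
    "under 50,vaccinated,survived\n"
    "50 +,unvaccinated,death\n"
)

# DictReader uses the first line as keys for every following row.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

print(reader.fieldnames)       # the column names from the first line
print(rows[0]["age_group"])    # a value from the first record
```

pandas does the same job with far less code, which is why the rest of this tutorial uses it.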



We use the Python package pandas to work with CSV files (and many other file types). Its central data structure, the DataFrame, holds a CSV file as a table of rows and columns: the first line of the file becomes the column names, and each remaining line becomes a row. By convention, pandas is imported under the alias pd.

In [None]:
import pandas as pd

Once the pandas package is imported, we can start to use the functions included in this package. 

The function that reads a CSV file is pd.read_csv(path_to_file). It returns a DataFrame, which by convention we name df. To check that the data was read in correctly, we usually display the first 5 rows with .head().

The project contains two CSV files: simpsons_paradox_covid.csv and HELPfull.csv. Let's start with simpsons_paradox_covid.csv.

In [None]:
df = pd.read_csv('data/simpsons_paradox_covid.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,age_group,vaccine_status,outcome
0,1,under 50,vaccinated,death
1,2,under 50,vaccinated,death
2,3,under 50,vaccinated,death
3,4,under 50,vaccinated,death
4,5,under 50,vaccinated,death


The data has been read in with four columns: ['Unnamed: 0', 'age_group', 'vaccine_status', 'outcome']. By default, .head() shows the first 5 rows below the column header. To see more rows, pass a number: .head(10) shows the first 10 rows.

In [None]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,age_group,vaccine_status,outcome
0,1,under 50,vaccinated,death
1,2,under 50,vaccinated,death
2,3,under 50,vaccinated,death
3,4,under 50,vaccinated,death
4,5,under 50,vaccinated,death
5,6,under 50,vaccinated,death
6,7,under 50,vaccinated,death
7,8,under 50,vaccinated,death
8,9,under 50,vaccinated,death
9,10,under 50,vaccinated,death


We can use .info() to get an overview of the data we read in.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268166 entries, 0 to 268165
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Unnamed: 0      268166 non-null  int64 
 1   age_group       268166 non-null  object
 2   vaccine_status  268166 non-null  object
 3   outcome         268166 non-null  object
dtypes: int64(1), object(3)
memory usage: 8.2+ MB


This tells us that the table has 4 columns and 268166 rows.

To write a DataFrame to a CSV file, we use the matching method .to_csv(path_to_file). Caution: if the file does not exist, it is created; if it already exists, it is overwritten!

In [None]:
df.to_csv('data/simpsons_paradox_covid_new.csv')
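
One detail worth knowing: by default, to_csv also writes the row index as an extra first column with an empty name. That is a likely explanation for the 'Unnamed: 0' column seen earlier, although the tutorial does not say how the file was produced. The sketch below uses a small made-up DataFrame (not the project's data) and writes to a string instead of a file, so nothing on disk is touched.

```python
import io

import pandas as pd

# Hypothetical two-row table, used only for illustration.
df_small = pd.DataFrame({
    "age_group": ["under 50", "50 +"],
    "outcome": ["survived", "death"],
})

# With no path, to_csv returns the CSV text as a string.
with_index = df_small.to_csv()                # default: index is written too
without_index = df_small.to_csv(index=False)  # only the real columns

print(with_index.splitlines()[0])     # header line starts with an empty field
print(without_index.splitlines()[0])

# Reading the first version back turns the empty header into 'Unnamed: 0'.
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))
```

Passing index=False when writing avoids the extra column on the next read.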

This completes CSV input and output with pandas. We will cover more details in the Data Understanding step.

### Exercise: Read the HELPfull.csv

Now that we have learned how to read from and write to a CSV file, make a duplicate of the project and experiment with the **'data/HELPfull.csv'** file in this project.

Tasks you can try:
1.  Read the file in.
2.  Print the first 5 rows.
3.  Print the first 10 rows.
4.  Print the information of the dataset.
5.  Write the data to a file **'data/HELPful_new.csv'**

## Read Some Other Types of Files

The pandas package can deal with more than just CSV files. It provides functions to read or write many other common formats, such as JSON, MS Excel, HTML, LaTeX (write only), and fixed-width files (FWF). You can check out the details at: https://pandas.pydata.org/docs/user_guide/io.html

# Collect Data From SQLite Databases

## What is SQLite

A file with the .sqlite extension is a lightweight SQL database file created with the SQLite software. The entire database is stored in that single file, and SQLite itself is a self-contained, full-featured, highly reliable SQL database engine.

## Read an SQLite Database in Python

We use the Python package sqlite3 to work with SQLite databases. Once the sqlite3 package is imported, the general steps are:
1.  Create a connection object that connects to the SQLite database.
2.  Create a cursor object.
3.  Write a query statement.
4.  Execute the query statement.
5.  Fetch the query results.
6.  When all work is done, close the connection.
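
The six steps can be sketched end to end as follows. To keep the example self-contained, it assumes a throwaway in-memory database (":memory:") with a tiny, hand-made Artist table rather than the project's database file.

```python
import sqlite3

# 1. Connect. ":memory:" creates a temporary database in RAM (an
#    assumption for this sketch; normally you pass a file path).
connection = sqlite3.connect(":memory:")

# 2. Create a cursor.
cursor = connection.cursor()

# Set up a small illustrative table so there is something to query.
cursor.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
cursor.executemany("INSERT INTO Artist (Name) VALUES (?)",
                   [("AC/DC",), ("Accept",)])

# 3. Write the query.
query = "SELECT ArtistId, Name FROM Artist ORDER BY ArtistId"

# 4. Execute it.
cursor.execute(query)

# 5. Fetch the results: a list with one tuple per row.
results = cursor.fetchall()
print(results)

# 6. Close the cursor and the connection when done.
cursor.close()
connection.close()
```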

We use the Chinook sample database as the example here. We connect to the database and list all the tables it contains.

In [None]:
import sqlite3

connection = sqlite3.connect('data/Chinook.sqlite')
cursor = connection.cursor()

query = '''SELECT name FROM sqlite_master  
WHERE type='table';'''

cursor.execute(query)
results = cursor.fetchall()
results

[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]

## Play with the SQLite Databases

Using SQL statements, you can play with the SQLite Databases and get the data you need.

In [None]:
query = '''SELECT * 
FROM Artist'''

cursor.execute(query)
results = cursor.fetchall()
results

[(1, 'AC/DC'),
 (2, 'Accept'),
 (3, 'Aerosmith'),
 (4, 'Alanis Morissette'),
 (5, 'Alice In Chains'),
 (6, 'Antônio Carlos Jobim'),
 (7, 'Apocalyptica'),
 (8, 'Audioslave'),
 (9, 'BackBeat'),
 (10, 'Billy Cobham'),
 (11, 'Black Label Society'),
 (12, 'Black Sabbath'),
 (13, 'Body Count'),
 (14, 'Bruce Dickinson'),
 (15, 'Buddy Guy'),
 (16, 'Caetano Veloso'),
 (17, 'Chico Buarque'),
 (18, 'Chico Science & Nação Zumbi'),
 (19, 'Cidade Negra'),
 (20, 'Cláudio Zoli'),
 (21, 'Various Artists'),
 (22, 'Led Zeppelin'),
 (23, 'Frank Zappa & Captain Beefheart'),
 (24, 'Marcos Valle'),
 (25, 'Milton Nascimento & Bebeto'),
 (26, 'Azymuth'),
 (27, 'Gilberto Gil'),
 (28, 'João Gilberto'),
 (29, 'Bebel Gilberto'),
 (30, 'Jorge Vercilo'),
 (31, 'Baby Consuelo'),
 (32, 'Ney Matogrosso'),
 (33, 'Luiz Melodia'),
 (34, 'Nando Reis'),
 (35, 'Pedro Luís & A Parede'),
 (36, 'O Rappa'),
 (37, 'Ed Motta'),
 (38, 'Banda Black Rio'),
 (39, 'Fernanda Porto'),
 (40, 'Os Cariocas'),
 (41, 'Elis Regina'),
 (42, 'Mi
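
A query does not have to return the whole table. As a small sketch of narrowing the result, the example below filters and limits rows using "?" placeholders, which let sqlite3 substitute the values safely. It assumes a hand-made in-memory Artist table, not the Chinook file.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Illustrative table with a few artist names.
cursor.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
cursor.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("AC/DC",), ("Accept",), ("Aerosmith",), ("Black Sabbath",)],
)

# Each "?" is filled in, in order, from the tuple passed to execute().
query = "SELECT Name FROM Artist WHERE Name LIKE ? ORDER BY Name LIMIT ?"
cursor.execute(query, ("A%", 2))

# fetchall() returns one-element tuples; unpack them into plain strings.
a_names = [row[0] for row in cursor.fetchall()]
print(a_names)

connection.close()
```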

## Save Data to CSV Files

Because CSV files are convenient to process, we again use pandas to convert the query results and write them to a CSV file.

In [None]:
import pandas as pd

df = pd.DataFrame(results)
df.info()
df.to_csv('data/Chinook.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 275 entries, 0 to 274
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       275 non-null    int64 
 1   1       275 non-null    object
dtypes: int64(1), object(1)
memory usage: 4.4+ KB


In [None]:
cursor.close()
connection.close()
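
As an aside, pandas can also run the query itself through pd.read_sql_query, which keeps the column names from the database (a DataFrame built from plain tuples labels its columns 0 and 1, as in the output above). The sketch below is one possible alternative, not the approach this tutorial uses, and it again assumes a small in-memory table for illustration.

```python
import sqlite3

import pandas as pd

# Illustrative in-memory database standing in for a real .sqlite file.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
connection.executemany("INSERT INTO Artist (Name) VALUES (?)",
                       [("AC/DC",), ("Accept",)])

# pandas executes the query and builds the DataFrame in one step.
artists = pd.read_sql_query("SELECT ArtistId, Name FROM Artist", connection)
connection.close()

print(list(artists.columns))
print(len(artists))
```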

## Exercise

Do a search, and learn how to use mysql.connector in Python to connect to a MySQL server and fetch data as needed.

# Collect Data From Web

## Introduction to Requests and Beautiful Soup
Now that we understand the structure of a web page, it’s time to get into the fun part: scraping the content we want!
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library.

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. GET is just one of several types of requests we can make with requests.

Let’s try downloading a simple sample website, https://dataquestio.github.io/web-scraping-pages/simple.html.

### Download
We’ll need to first import the requests library, and then download the page using the requests.get method:

In [None]:
import requests

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:

In [None]:
page.status_code

200

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
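
The rule of thumb above can be written as a small helper. This sketch uses only the standard library's http.HTTPStatus to look up the standard phrase for a code; the helper name describe is made up for illustration. With requests itself, the Response object also offers page.ok and page.raise_for_status() for the same kind of check.

```python
from http import HTTPStatus


def describe(status_code):
    """Return (standard phrase, succeeded?) for an HTTP status code."""
    phrase = HTTPStatus(status_code).phrase
    succeeded = 200 <= status_code < 300  # codes starting with 2
    return phrase, succeeded


for code in (200, 404, 500):
    print(code, *describe(code))
```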

We can print out the HTML content of the page using the content property:

In [None]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

### Parsing a page
As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


This step isn’t strictly necessary, and we won’t always bother with it, but it can be helpful to look at prettified HTML to make the structure of the page and the nesting of its tags easier to see.

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup.

Note that children returns an iterator rather than a list, so we need to call the list function on it:

In [None]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

The above tells us that there are two items of interest at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

In [None]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

As we can see, all of the items are BeautifulSoup objects:

*  The first is a Doctype object, which contains information about the type of the document.
*  The second is a NavigableString, which represents text found in the HTML document.
*  The final item is a Tag object, which contains other nested tags.

The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. The Beautiful Soup documentation describes the other object types in more detail.

We can now select the html tag and its children by taking the third item in the list:

In [None]:
html = list(soup.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.

Now, we can find the children inside the html tag:

In [None]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

As we can see above, there are two tags here, head, and body. We want to extract the text inside the p tag, so we’ll dive into the body:

In [None]:
body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

In [None]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

In [None]:
p = list(body.children)[1]

In [None]:
list(p.children)

['Here is some simple content for this page.']

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

In [None]:
p.get_text()

'Here is some simple content for this page.'

### Finding Tags

#### Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple.

If we want to extract every instance of a tag, we can instead use the find_all method, which finds all the instances of that tag on a page.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [None]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

In [None]:
soup.find('p')

<p>Here is some simple content for this page.</p>

#### Searching for tags by class and id

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we’re scraping, we can also use them to specify the elements we want to scrape.

Let's try another page.

In [None]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

In [None]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In the below example, we’ll look for any tag that has the class outer-text:

In [None]:
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

We can also search for elements by id:

In [None]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

#### Using CSS Selectors

We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

*  p a — finds all a tags inside of a p tag.
*  body p a — finds all a tags inside of a p tag inside of a body tag.
*  html body — finds all body tags inside of an html tag.
*  p.outer-text — finds all p tags with a class of outer-text.
*  p#first — finds all p tags with an id of first.
*  body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

In [None]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

Note that the select method above returns a list of BeautifulSoup objects, just like find_all (find, by contrast, returns a single object).
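
A few of the other selectors listed above can be tried the same way. The sketch below parses a short inline HTML string modelled on the ids_and_classes.html page (simplified, so it runs without a network connection) and uses select together with select_one, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the example page, embedded as a string.
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# p.outer-text: every p tag with the class outer-text.
outer = [p.get_text(strip=True) for p in demo_soup.select("p.outer-text")]

# p#first: the p tag with id "first"; select_one returns a single tag.
first = demo_soup.select_one("p#first").get_text(strip=True)

# p.first-item: matches tags both inside and outside the div.
first_item_ids = [p["id"] for p in demo_soup.select("p.first-item")]

print(outer)
print(first)
print(first_item_ids)
```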

## Case study: Weather!

### Downloading weather data

We now know enough to proceed with extracting information about the local weather from the National Weather Service website!

The local weather of Boulder, CO is: https://forecast.weather.gov/MapClick.php?lat=40.0466&lon=-105.2523#.YwpRBy2B1f0

Time to Start Scraping!

We now know enough to download the page and start parsing it. In the below code, we will:

*  Download the web page containing the forecast.
*  Create a BeautifulSoup object to parse the page.
*  Find the div with id seven-day-forecast, and assign it to seven_day.
*  Inside seven_day, find each individual forecast item.
*  Print the forecast items; the first one is extracted in the next cell.


In [None]:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://forecast.weather.gov/MapClick.php?lat=40.0466&lon=-105.2523#.YwpRBy2B1f0")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
print(forecast_items)

[<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Sunny, with a high near 88. Northwest wind 9 to 13 mph, with gusts as high as 21 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 88. Northwest wind 9 to 13 mph, with gusts as high as 21 mph. "/></p><p class="short-desc">Sunny</p><p class="temp temp-high">High: 88 °F</p></div>, <div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: Mostly clear, with a low around 59. Northeast wind 8 to 10 mph becoming southwest in the evening. Winds could gust as high as 16 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 59. Northeast wind 8 to 10 mph becoming southwest in the evening. Winds could gust as high as 16 mph. "/></p><p class="short-desc">Mostly Clear</p><p class="temp temp-low">Low: 59 °F</p></div>, <div class="tombstone-container">
<p clas

In [None]:
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 88. Northwest wind 9 to 13 mph, with gusts as high as 21 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 88. Northwest wind 9 to 13 mph, with gusts as high as 21 mph. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 88 °F
 </p>
</div>


### Extracting information of tonight

As we can see, the forecast item stored in tonight holds all the information we want. (In this run the first item on the page is the Today period, so that is what the variable contains.) There are four pieces of information we can extract:

*  The name of the forecast item: in this case, Today.
*  The description of the conditions: this is stored in the title attribute of the img tag.
*  A short description of the conditions: in this case, Sunny.
*  The temperature: in this case, a high of 88 degrees.


We’ll extract the name of the forecast item, the short description, and the temperature first, since they’re all similar:

In [None]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)

Today
Sunny
High: 88 °F


Now, we can extract the title attribute from the img tag. To do this, we just treat the BeautifulSoup object like a dictionary, and pass in the attribute we want as a key:

In [None]:
img = tonight.find("img")
desc = img['title']
print(desc)

Today: Sunny, with a high near 88. Northwest wind 9 to 13 mph, with gusts as high as 21 mph. 


### Extract all nights!

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we will:

*  Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
*  Use a list comprehension to call the get_text method on each BeautifulSoup object.

In [None]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday']

As we can see above, our technique gets us each of the period names, in order.

We can apply the same technique to get the other three fields:

In [None]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['Sunny', 'Mostly Clear', 'Sunny thenSlight ChanceT-storms', 'Slight ChanceT-storms thenMostly Clear', 'Sunny', 'Mostly Clear', 'Sunny thenSlight ChanceT-storms', 'Slight ChanceT-storms', 'Sunny thenSlight ChanceT-storms']
['High: 88 °F', 'Low: 59 °F', 'High: 88 °F', 'Low: 57 °F', 'High: 87 °F', 'Low: 58 °F', 'High: 89 °F', 'Low: 60 °F', 'High: 86 °F']
['Today: Sunny, with a high near 88. Northwest wind 9 to 13 mph, with gusts as high as 21 mph. ', 'Tonight: Mostly clear, with a low around 59. Northeast wind 8 to 10 mph becoming southwest in the evening. Winds could gust as high as 16 mph. ', 'Sunday: A 20 percent chance of showers and thunderstorms after 1pm.  Mostly sunny, with a high near 88. West southwest wind 6 to 13 mph, with gusts as high as 20 mph. ', 'Sunday Night: A 20 percent chance of showers and thunderstorms before 7pm.  Partly cloudy, with a low around 57. West wind 5 to 11 mph, with gusts as high as 17 mph. ', 'Monday: Sunny, with a high near 87. Light and variable win
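
Some of the extracted strings run words together, such as 'SundayNight'. A plausible cause is that the page separates the words with a br tag, which get_text drops without leaving a space. The sketch below assumes markup of that shape (the snippet is hand-written, not copied from the live page) and shows how the separator and strip arguments of get_text change the result.

```python
from bs4 import BeautifulSoup

# Assumed markup: two words separated only by a <br/> tag.
html = '<p class="period-name">Sunday<br/>Night</p>'
tag = BeautifulSoup(html, "html.parser").p

# Default: text pieces are concatenated with nothing in between.
joined = tag.get_text()

# separator=" " puts a space between pieces; strip=True trims each
# piece and skips empty ones.
spaced = tag.get_text(separator=" ", strip=True)

print(repr(joined))
print(repr(spaced))
```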

### Deal with data

We can now combine the data into a pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy.

In order to do this, we’ll call the DataFrame class, and pass in each list of items that we have. We pass them in as part of a dictionary.

Each dictionary key will become a column in the DataFrame, and each list will become the values in the column:

In [None]:
import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc":descs
})
weather

Unnamed: 0,period,short_desc,temp,desc
0,Today,Sunny,High: 88 °F,"Today: Sunny, with a high near 88. Northwest w..."
1,Tonight,Mostly Clear,Low: 59 °F,"Tonight: Mostly clear, with a low around 59. N..."
2,Sunday,Sunny thenSlight ChanceT-storms,High: 88 °F,Sunday: A 20 percent chance of showers and thu...
3,SundayNight,Slight ChanceT-storms thenMostly Clear,Low: 57 °F,Sunday Night: A 20 percent chance of showers a...
4,Monday,Sunny,High: 87 °F,"Monday: Sunny, with a high near 87. Light and ..."
5,MondayNight,Mostly Clear,Low: 58 °F,"Monday Night: Mostly clear, with a low around 58."
6,Tuesday,Sunny thenSlight ChanceT-storms,High: 89 °F,Tuesday: A 10 percent chance of showers and th...
7,TuesdayNight,Slight ChanceT-storms,Low: 60 °F,Tuesday Night: A slight chance of showers and ...
8,Wednesday,Sunny thenSlight ChanceT-storms,High: 86 °F,Wednesday: A slight chance of showers and thun...


Now let's save it to CSV.

In [None]:
weather.to_csv('data/Boulder_Weather_7_Days.csv')
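
As one small example of the analysis this enables, the temperature strings can be turned into numbers. The sketch below uses the standard re module on a short list copied from the temps output above; the helper name parse_temp is made up for illustration, and the pattern assumes every label looks like "High: 88 °F" or "Low: 59 °F".

```python
import re

# A few labels in the same shape as the temps list above.
temp_labels = ["High: 88 °F", "Low: 59 °F", "High: 87 °F"]


def parse_temp(label):
    """Split a label into ('High' or 'Low', degrees as int), or None."""
    match = re.search(r"(High|Low):\s*(-?\d+)", label)
    if match is None:
        return None  # label did not have the expected shape
    return match.group(1), int(match.group(2))


parsed = [parse_temp(label) for label in temp_labels]
print(parsed)
```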

## Case study: MSDS Faculty

### Downloading the faculty data

In [None]:
import requests

url = 'https://www.colorado.edu/program/data-science/faculty'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)
content = r.text
print(content)

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html  lang="en" dir="ltr"
  xmlns:og="http://ogp.me/ns#"><!--<![endif]-->

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="https://www.colorado.edu/program/data-science/profiles/express/themes/cumodern/favicon.ico" type="image/vnd.microsoft.icon" />
<link href="https://www.colorado.edu/program/data-science/feed/rss.xml" rel="alternate" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<link rel="apple-touch-icon" sizes="57x57" href="https://www.colorado.edu/program/data-science/profiles/express/themes/ucb/apple-icon-57x57.png" />
<link rel="appl

### Extract Data from the Page

In [None]:
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(content, 'html.parser')
soup

<!DOCTYPE html>

<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!--><html dir="ltr" lang="en" xmlns:og="http://ogp.me/ns#"><!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://www.colorado.edu/program/data-science/profiles/express/themes/cumodern/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<link href="https://www.colorado.edu/program/data-science/feed/rss.xml" rel="alternate"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://www.colorado.edu/program/data-science/profiles/express/themes/ucb/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="https://www.c

### Find all \<h2\> tags

In [None]:
links = soup.find_all('h2')
links

[<h2>Search</h2>,
 <h2 class="element-invisible">Main menu</h2>,
 <h2 class="element-invisible">Mobile menu</h2>,
 <h2>The MS-DS is a truly interdisciplinary program, with faculty in Computer Science, Applied Mathematics, Information Science, and more.</h2>,
 <h2 class="node-title"><a href="/program/data-science/jane-wall">Jane Wall</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/bobby-schnabel">Bobby Schnabel</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/brian-zaharatos">Brian Zaharatos</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/jem-corcoran">Jem Corcoran</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/anne-dougherty">Anne Dougherty</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/ioana-fleming">Ioana Fleming</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/ami-gates">Ami Gates</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/abel-iyasele-0">Abel Iyasele</a></h2>,

### Find all \<h2\> tags with class 'node-title'

In [None]:
faculty = soup.find_all('h2', {'class':'node-title'})
faculty

[<h2 class="node-title"><a href="/program/data-science/jane-wall">Jane Wall</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/bobby-schnabel">Bobby Schnabel</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/brian-zaharatos">Brian Zaharatos</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/jem-corcoran">Jem Corcoran</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/anne-dougherty">Anne Dougherty</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/ioana-fleming">Ioana Fleming</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/ami-gates">Ami Gates</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/abel-iyasele-0">Abel Iyasele</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/geena-kim">Geena Kim</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/william-kuskin">William Kuskin</a></h2>,
 <h2 class="node-title"><a href="/program/data-science/qin-christine-lv">Qin

### Putting it together: Output List of MSDS faculty

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.colorado.edu/program/data-science/faculty'
r = requests.get(url)
content = r.text

soup = BeautifulSoup(content, 'html.parser')

faculty = soup.find_all('h2', {'class': 'node-title'})

names = [n.get_text() for n in faculty]

names


['Jane Wall',
 'Bobby Schnabel',
 'Brian Zaharatos',
 'Jem Corcoran',
 'Anne Dougherty',
 'Ioana Fleming',
 'Ami Gates',
 'Abel Iyasele',
 'Geena Kim',
 'William Kuskin',
 'Qin (Christine) Lv',
 'Quentin McAndrew',
 'Osita Onyejekwe',
 'Alan Paradise',
 'Al Pisano',
 'Sriram Sankaranarayanan',
 'Chris Vargo',
 'Di Wu',
 'Dave Underwood']

### Save the data to CSV

In [None]:
import pandas as pd
df = pd.DataFrame({'Names': names})
df

Unnamed: 0,Names
0,Jane Wall
1,Bobby Schnabel
2,Brian Zaharatos
3,Jem Corcoran
4,Anne Dougherty
5,Ioana Fleming
6,Ami Gates
7,Abel Iyasele
8,Geena Kim
9,William Kuskin


In [None]:
df.to_csv('data/MSDS_Faculty.csv')

# Congratulations!