# Web Scraping with BeautifulSoup
This tutorial covers the basics of web scraping with BeautifulSoup, data manipulation and cleaning with Pandas, and data visualization with Matplotlib.

This example analyses data from a 10K race that took place in Hillsboro, OR in June 2017.

## Prerequisites
The following packages need to be installed for this tutorial:

1. `numpy` for array processing
2. `pandas` for data manipulation and cleaning
3. `matplotlib` for data visualization
4. `seaborn` for further data visualization; it is built on `matplotlib`
5. `beautifulsoup4` for parsing HTML (Beautiful Soup version 4)
6. `lxml`, the parser used with Beautiful Soup

All of these can be installed with `conda install <package-name>`.
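Before going further, it can help to confirm that the packages are importable. The check below is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are names made up for this illustration. Note that `beautifulsoup4` is imported under the module name `bs4`.

```python
from importlib.util import find_spec

# Install name -> module name used in `import` statements.
# `beautifulsoup4` is the install name; the importable module is `bs4`.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
}


def missing_packages(required=REQUIRED):
    """Return the install names whose modules cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
print("Missing packages:", missing if missing else "none")
```

Anything listed as missing can then be installed with the `conda install` command above.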

## Imports
First, we'll import the basic libraries.

The `%matplotlib inline` magic displays the plots inline in the Jupyter Notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

BeautifulSoup doesn't have its own built-in URL access library, so we shall use `urllib` from Python's standard library.
BeautifulSoup is imported from the `bs4` package, whose name stands for BeautifulSoup version 4.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

## Loading and Accessing Elements of an HTML page

We shall load the HTML from the specified URL with `urllib` and parse it into a BeautifulSoup object.

In [3]:
url = "http://www.hubertiming.com/results/2017GPTR10K"
html = urlopen(url)
soup = bs(html, 'lxml')
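Network requests can fail or hang, so in practice it may be worth wrapping the download. The sketch below is one possible approach, not part of the original notebook: `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary assumption.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        # The context manager closes the connection once the body is read.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Server returned HTTP {err.code} for {url}")
    except URLError as err:
        print(f"Could not reach {url}: {err.reason}")
    return None
```

The returned bytes can be passed to `bs(...)` in the same way as the `html` object above, after checking that the result is not `None`.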

The BeautifulSoup object allows you to extract various elements from the HTML, such as:

* title
* text
* various tag elements, e.g. `<a>`: hyperlinks, `<table>`: tables, `<tr>`: table rows, `<th>`: table headers, `<td>`: table cells

as shown below:

In [4]:
# Title
print('The Title\n')
print(soup.title)

# Text
print('\nThe Text\n')
#print(soup.get_text())

# <a> tags
print('\nAll Links <a>\n')
all_links = soup.find_all('a')
print(all_links)

# href attribute from <a> tags
print('\nAll href attributes of <a>\n')
for link in all_links:
    print(link.get('href'))

# print first ten rows of a table via <tr>
print('\nTable rows <tr>\n')
print(soup.find_all('tr')[:10])

The Title

<title>2017 Intel Great Place to Run 10K \ Urban Clash Games Race Results</title>

The Text


All Links <a>

[<a class="btn btn-primary btn-lg" href="/results/2017GPTR" role="button" style="margin: 0px 0px 5px 5px">5K</a>, <a href="http://hubertiming.com/">Huber Timing Home</a>, <a href="#individual">Individual Results</a>, <a href="#team">Team Results</a>, <a href="mailto:timing@hubertiming.com">timing@hubertiming.com</a>, <a href="#tabs-1" style="font-size: 18px">Results</a>, <a name="individual"></a>, <a name="team"></a>, <a href="http://www.hubertiming.com/"><img height="65" src="/sites/all/themes/hubertiming/images/clockWithFinishSign_small.png" width="50"/>Huber Timing</a>, <a href="http://facebook.com/hubertiming/"><img src="/results/FB-f-Logo__blue_50.png"/></a>]

All href attributes of <a>

/results/2017GPTR
http://hubertiming.com/
#individual
#team
mailto:timing@hubertiming.com
#tabs-1
None
None
http://www.hubertiming.com/
http://facebook.com/hubertiming/

Table ro
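The same calls can be tried offline on a small hand-written page. The snippet below is a sketch with invented sample HTML, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without the extra dependency. It also shows why `None` appears in the list of `href` values above: an `<a>` tag that only has a `name` attribute has no `href`, so `.get('href')` returns `None`.

```python
from bs4 import BeautifulSoup

# Invented sample page, used only for illustration.
sample_html = """
<html>
  <head><title>Sample Results</title></head>
  <body>
    <a href="/results">Results</a>
    <a name="top"></a>
    <table>
      <tr><th>Place</th><th>Name</th></tr>
      <tr><td>1</td><td>RUNNER ONE</td></tr>
    </table>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> element
page_title = sample_soup.title.get_text()

# href attribute of every <a> tag; tags without one give None
hrefs = [a.get("href") for a in sample_soup.find_all("a")]

# Cell text of the second table row (the first row holds the headers)
first_data_cells = [td.get_text() for td in sample_soup.find_all("tr")[1].find_all("td")]

print(page_title)        # Sample Results
print(hrefs)             # ['/results', None]
print(first_data_cells)  # ['1', 'RUNNER ONE']
```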

## Extracting data into DataFrames
Now that we have an overview of how to extract elements with BeautifulSoup, let's extract the results table into a Pandas DataFrame.

First, we shall get all the table rows as a list and then convert them into a DataFrame.

In [5]:
rows = soup.find_all('tr')
for row in rows:
    row_td = row.find_all('td')
    print(row_td)
type(row_td)

[<td>Finishers:</td>, <td>577</td>]
[<td>Male:</td>, <td>414</td>]
[<td>Female:</td>, <td>163</td>]
[]
[<td>1</td>, <td>814</td>, <td>JARED WILSON</td>, <td>M</td>, <td>TIGARD</td>, <td>OR</td>, <td>00:36:21</td>, <td>05:51</td>, <td>1 of 414</td>, <td>M 36-45</td>, <td>1 of 152</td>, <td>00:00:03</td>, <td>00:36:24</td>, <td></td>]
[<td>2</td>, <td>573</td>, <td>NATHAN A SUSTERSIC</td>, <td>M</td>, <td>PORTLAND</td>, <td>OR</td>, <td>00:36:42</td>, <td>05:55</td>, <td>2 of 414</td>, <td>M 26-35</td>, <td>1 of 154</td>, <td>00:00:03</td>, <td>00:36:45</td>, <td>INTEL TEAM F</td>]
[<td>3</td>, <td>687</td>, <td>FRANCISCO MAYA</td>, <td>M</td>, <td>PORTLAND</td>, <td>OR</td>, <td>00:37:44</td>, <td>06:05</td>, <td>3 of 414</td>, <td>M 46-55</td>, <td>1 of 64</td>, <td>00:00:04</td>, <td>00:37:48</td>, <td></td>]
[<td>4</td>, <td>623</td>, <td>PAUL MORROW</td>, <td>M</td>, <td>BEAVERTON</td>, <td>OR</td>, <td>00:38:34</td>, <td>06:13</td>, <td>4 of 414</td>, <td>M 36-45</td>, <td>2 of 152

bs4.element.ResultSet

From the results above, we can see that each row still contains the HTML tags. We can remove them either with BeautifulSoup or with a regular expression.

With BeautifulSoup, this is done with `.get_text()`:

In [6]:
for row in rows:
    row_td = row.find_all('td')
    clean_cells = bs(str(row_td), 'lxml').get_text()
    print(clean_cells)

type(clean_cells)

[Finishers:, 577]
[Male:, 414]
[Female:, 163]
[]
[1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21, 05:51, 1 of 414, M 36-45, 1 of 152, 00:00:03, 00:36:24, ]
[2, 573, NATHAN A SUSTERSIC, M, PORTLAND, OR, 00:36:42, 05:55, 2 of 414, M 26-35, 1 of 154, 00:00:03, 00:36:45, INTEL TEAM F]
[3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:37:44, 06:05, 3 of 414, M 46-55, 1 of 64, 00:00:04, 00:37:48, ]
[4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:34, 06:13, 4 of 414, M 36-45, 2 of 152, 00:00:03, 00:38:37, ]
[5, 569, DEREK G OSBORNE, M, HILLSBORO, OR, 00:39:21, 06:20, 5 of 414, M 26-35, 2 of 154, 00:00:03, 00:39:24, INTEL TEAM F]
[6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39:49, 06:25, 6 of 414, M 18-25, 1 of 34, 00:00:06, 00:39:55, ]
[7, 144, GEORGE TOTONCHY, M, PORTLAND, OR, 00:40:04, 06:27, 7 of 414, M 36-45, 3 of 152, 00:00:13, 00:40:17, ]
[8, 395, BENJAMIN C CHAFFIN, M, PORTLAND, OR, 00:40:05, 06:27, 8 of 414, M 36-45, 4 of 152, 00:00:04, 00:40:09, ]
[9, 7, BRANDON THOMAS, M, , , 00:40:17, 06:29

[156, 176, ANDREW WENDLANDT, M, HILLSBORO, OR, 00:51:24, 08:17, 139 of 414, M 26-35, 42 of 154, 00:00:42, 00:52:06, ]
[157, 422, TRISTAN WOEHRLE, M, BEAVERTON, OR, 00:51:34, 08:19, 140 of 414, M 26-35, 43 of 154, 00:00:07, 00:51:41, ]
[158, 276, PAUL HACK, M, PORTLAND, OR, 00:51:35, 08:19, 141 of 414, M 46-55, 25 of 64, 00:00:39, 00:52:14, ]
[159, 595, JIANREN TAI, M, PORTLAND, OR, 00:51:38, 08:19, 142 of 414, M 36-45, 56 of 152, 00:00:22, 00:52:00, ]
[160, 159, EUGENE LUHAVY, M, PORTLAND, OR, 00:51:41, 08:20, 143 of 414, M 26-35, 44 of 154, 00:00:50, 00:52:31, ]
[161, 221, REET CHATTERJEE, M, BEAVERTON, OR, 00:51:43, 08:20, 144 of 414, M Under 18, 1 of 2, 00:00:27, 00:52:10, ]
[162, 4, BHASKAR MANDALA, M, , , 00:51:49, 08:21, 145 of 414, M 36-45, 57 of 152, 00:00:21, 00:52:10, COLUMBIA TEAM A]
[163, 757, JOSE MAURICIO MARULANDA, M, HILLSBORO, OR, 00:51:52, 08:21, 146 of 414, M 26-35, 45 of 154, 00:00:18, 00:52:10, ]
[164, 571, MARC STEVENSON SO, M, BEAVERTON, OR, 00:51:54, 08:22, 147 

[352, 661, CARINA E HAHN, F, HILLSBORO, OR, 01:02:29, 10:04, 66 of 163, F 18-25, 11 of 21, 00:00:14, 01:02:43, ]
[353, 244, SANDEEP CHIPPADA, M, PORTLAND, OR, 01:02:30, 10:04, 287 of 414, M 26-35, 101 of 154, 00:01:07, 01:03:37, ]
[354, 756, VEERESH A HONGAL, M, HILLSBORO, OR, 01:02:36, 10:05, 288 of 414, M 26-35, 102 of 154, 00:01:06, 01:03:42, ]
[355, 658, RAVINDRA HOSKOTE, M, PORTLAND, OR, 01:02:39, 10:06, 289 of 414, M 36-45, 106 of 152, 00:01:03, 01:03:42, ]
[356, 850, SHASHIKIRAN KONNUR SAMPATHKUMAR, M, HILLSBORO, OR, 01:02:41, 10:06, 290 of 414, M 26-35, 103 of 154, 00:00:50, 01:03:31, ]
[357, 608, MAITHREYI GOPALAKRISHNAN, F, HILLSBORO, OR, 01:02:43, 10:07, 67 of 163, F 18-25, 12 of 21, 00:01:11, 01:03:54, ]
[358, 376, SREENIVAS KASTURI, M, TIGARD, OR, 01:02:44, 10:07, 291 of 414, M 36-45, 107 of 152, 00:01:07, 01:03:51, ]
[359, 406, JEFFREY K SCIPIO, M, BEAVERTON, OR, 01:02:47, 10:07, 292 of 414, M 46-55, 52 of 64, 00:00:31, 01:03:18, ]
[360, 229, MEENAKSHI MAMUNURU, F, TIGARD

[566, 219, BHARAT ADDALA, M, PORTLAND, OR, 01:32:20, 14:53, 411 of 414, M 36-45, 150 of 152, 00:01:31, 01:33:51, ]
[567, 133, LOLA ABDUL-JABBAR, F, OUT OF STATE, OR, 01:32:28, 14:54, 156 of 163, F 26-35, 57 of 59, 00:00:37, 01:33:05, ]
[568, 51, AIMEE TONEY-LOVINGS, F, BEAVERTON, OR, 01:32:30, 14:55, 157 of 163, F 46-55, 21 of 22, 00:00:31, 01:33:01, ]
[569, 158, NARENDER MUDUGANTI, M, HILLSBORO, OR, 01:33:50, 15:08, 412 of 414, M 36-45, 151 of 152, 00:00:04, 01:33:54, ]
[570, 670, HEMA VIJWANI, F, HILLSBORO, OR, 01:34:45, 15:17, 158 of 163, F 26-35, 58 of 59, 00:00:08, 01:34:53, ]
[571, 357, USHA K KETINENI, F, PORTLAND, OR, 01:34:48, 15:17, 159 of 163, F 36-45, 55 of 56, 00:00:39, 01:35:27, ]
[572, 293, SPASS O STOIANTSCHEWSKY, M, PORTLAND, OR, 01:37:10, 15:40, 413 of 414, M 46-55, 64 of 64, 00:01:23, 01:38:33, ]
[573, 273, RACHEL L VANEY, F, OTHER, OR, 01:38:17, 15:51, 160 of 163, F 18-25, 21 of 21, 00:00:17, 01:38:34, ]
[574, 467, ROHIT B DSOUZA, M, PORTLAND, OR, 01:38:31, 15:53, 4

str

This creates a string of comma-separated values for each row.
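For comparison, the regular-expression route mentioned earlier could look roughly like the sketch below. The pattern and the `strip_tags` helper are illustrative assumptions; a simple pattern like this works for plain cells such as these, but it is not a robust way to handle arbitrary HTML.

```python
import re

# Matches anything that looks like a tag: '<', then non-'>' characters, then '>'.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Remove tag-like substrings from a string of markup."""
    return TAG_PATTERN.sub("", markup)


row_markup = "[<td>1</td>, <td>814</td>, <td>JARED WILSON</td>]"
print(strip_tags(row_markup))  # [1, 814, JARED WILSON]
```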

Notice that each string is wrapped in `[` and `]`. Let's remove those.

In [7]:
for row in rows:
    row_td = row.find_all('td')
    clean_cells = bs(str(row_td), 'lxml').get_text()
    clean_cells = clean_cells[1:-1]
    print(clean_cells)

Finishers:, 577
Male:, 414
Female:, 163

1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21, 05:51, 1 of 414, M 36-45, 1 of 152, 00:00:03, 00:36:24, 
2, 573, NATHAN A SUSTERSIC, M, PORTLAND, OR, 00:36:42, 05:55, 2 of 414, M 26-35, 1 of 154, 00:00:03, 00:36:45, INTEL TEAM F
3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:37:44, 06:05, 3 of 414, M 46-55, 1 of 64, 00:00:04, 00:37:48, 
4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:34, 06:13, 4 of 414, M 36-45, 2 of 152, 00:00:03, 00:38:37, 
5, 569, DEREK G OSBORNE, M, HILLSBORO, OR, 00:39:21, 06:20, 5 of 414, M 26-35, 2 of 154, 00:00:03, 00:39:24, INTEL TEAM F
6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39:49, 06:25, 6 of 414, M 18-25, 1 of 34, 00:00:06, 00:39:55, 
7, 144, GEORGE TOTONCHY, M, PORTLAND, OR, 00:40:04, 06:27, 7 of 414, M 36-45, 3 of 152, 00:00:13, 00:40:17, 
8, 395, BENJAMIN C CHAFFIN, M, PORTLAND, OR, 00:40:05, 06:27, 8 of 414, M 36-45, 4 of 152, 00:00:04, 00:40:09, 
9, 7, BRANDON THOMAS, M, , , 00:40:17, 06:29, 9 of 414, M 26-35, 3 of

178, 584, CODY L COVEY, M, HILLSBORO, OR, 00:52:47, 08:30, 158 of 414, M 26-35, 48 of 154, 00:00:05, 00:52:52, INTEL TEAM F
179, 21, VINOTHKUMAR RADHAKRISHNAN, M, PORTLAND, OR, 00:52:48, 08:31, 159 of 414, M 26-35, 49 of 154, 00:00:14, 00:53:02, 
180, 337, DEEPAK RAJENDRAKUMARAN, M, HILLSBORO, OR, 00:52:50, 08:31, 160 of 414, M 26-35, 50 of 154, 00:00:32, 00:53:22, 
181, 459, JOHN HAMILTON, M, PORTLAND, OR, 00:52:55, 08:32, 161 of 414, M 46-55, 28 of 64, 00:00:29, 00:53:24, INTEL TEAM K
182, 444, VARUN K SETLUR, M, HILLSBORO, OR, 00:52:55, 08:32, 162 of 414, M 26-35, 51 of 154, 00:00:56, 00:53:51, 
183, 211, NAHOM YEMANE, M, BEAVERTON, OR, 00:52:58, 08:32, 163 of 414, M 18-25, 19 of 34, 00:00:32, 00:53:30, 
184, 55, ERNIE KHAW, M, HILLSBORO, OR, 00:53:02, 08:33, 164 of 414, M 46-55, 29 of 64, 00:00:16, 00:53:18, 
185, 613, ANTON RODRIGUEZ-DMITRIEV, M, PORTLAND, OR, 00:53:05, 08:33, 165 of 414, M 26-35, 52 of 154, 00:01:06, 00:54:11, 
186, 129, KARL H AMSPACHER, M, HILLSBORO, OR, 00:53:

369, 737, HEIDI N BARNABY, F, OTHER, OR, 01:03:34, 10:15, 73 of 163, F 36-45, 23 of 56, 00:00:49, 01:04:23, 
370, 615, SEAN P DELCAMBRE, M, PORTLAND, OR, 01:03:36, 10:15, 297 of 414, M 26-35, 104 of 154, 00:00:32, 01:04:08, 
371, 859, ERIN MCKALIP, F, PORTLAND, OR, 01:03:44, 10:16, 74 of 163, F 26-35, 26 of 59, 00:00:19, 01:04:03, 
372, 309, BHARADWAJ THANDRA, M, HILLSBORO, OR, 01:03:44, 10:16, 298 of 414, M 18-25, 28 of 34, 00:00:08, 01:03:52, 
373, 725, GENE H LANDREVILLE, M, HILLSBORO, OR, 01:03:49, 10:17, 299 of 414, M 46-55, 54 of 64, 00:01:07, 01:04:56, 
374, 678, RITU ARORA, F, HILLSBORO, OR, 01:03:51, 10:17, 75 of 163, F 26-35, 27 of 59, 00:00:29, 01:04:20, 
375, 343, JAYAKUMARAN RAVI, M, HILLSBORO, OR, 01:03:51, 10:18, 300 of 414, M 26-35, 105 of 154, 00:00:43, 01:04:34, 
376, 131, ASHLEY DURHAM, F, HILLSBORO, OR, 01:03:53, 10:18, 76 of 163, F 36-45, 24 of 56, 00:00:45, 01:04:38, 
377, 854, JESSE T HARRIS, M, LAKE OSWEGO, OR, 01:04:07, 10:20, 301 of 414, M 36-45, 111 of 152, 0

4TH, DTNA1, 03:15:33, 00:40:28 - WITALI SPULING, 00:46:45 - INGA ANDREYEVA, 00:54:09 - KEATON WEISENBORN, 00:54:10 - MAISIE WEISENBORN
5TH, FXG1, 03:21:16, 00:42:56 - DAVID HERRON, 00:46:35 - LEO SOTO, 00:53:59 - TONY GONZALEZ, 00:57:45 - ARINDA SCHRUM
6TH, INTEL TEAM B, 03:26:38, 00:46:26 - EDMONDO MAZZULLI, 00:48:44 - JUN TAKEI, 00:53:49 - NIKHIL TALPALLIKAR, 00:57:38 - PRANAV NAKATE
7TH, COLUMBIA TEAM B, 03:38:34, 00:40:17 - BRANDON THOMAS, 00:47:05 - RANIER EVANS, 01:04:19 - KENDRA DONLIN, 01:06:51 - MARIE-PIERRE VIGNE
8TH, INTEL TEAM D, 03:41:59, 00:47:56 - TEAL HAND, 00:49:03 - BENJAMIN R PORTER, 00:51:10 - JUSTON LI, 01:13:49 - DHAVAL V SHAH
9TH, COLUMBIA TEAM A, 03:46:19, 00:40:21 - ERIK BJORNSTAD, 00:41:59 - JIM DROZDOWSKI, 00:51:49 - BHASKAR MANDALA, 01:32:10 - SURESH GAUJULA
10TH, INTEL TEAM N, 04:02:13, 00:48:39 - ERIN F WIESENAUER, 01:00:24 - JULIE MAAS, 01:00:49 - MEETA A BHATE, 01:12:19 - JANE M JACKSON
11TH, DTNA3, 04:05:09, 00:50:29 - AUSTIN WIPF, 00:56:06 - YUNFENG PI
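An alternative to converting each row to a string and slicing off the brackets is to read the text of each cell directly. The sketch below uses invented sample HTML and the built-in `html.parser`; it is not the approach this notebook takes, only a variation. It produces a list of lists, one inner list per row.

```python
from bs4 import BeautifulSoup

# Invented sample table, used only for illustration.
sample_html = """
<table>
  <tr><th>Place</th><th>Bib</th><th>Name</th></tr>
  <tr><td>1</td><td>814</td><td>JARED WILSON</td></tr>
  <tr><td>2</td><td>573</td><td>NATHAN A SUSTERSIC</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for row in sample_soup.find_all("tr"):
    # strip=True trims surrounding whitespace from each cell's text
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:  # header rows contain only <th>, so they yield an empty list
        records.append(cells)

print(records)
```

Passing a list of lists like `records` to `pd.DataFrame` gives one column per cell, and commas inside a cell cannot be mistaken for separators.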

Alright, let's now convert this into a Pandas DataFrame.

In [8]:
clean_rows = []
for row in rows:
    row_td = row.find_all('td')
    clean_cells = bs(str(row_td), 'lxml').get_text()[1:-1]
    clean_rows.append(clean_cells)
    
df = pd.DataFrame(clean_rows)

df.head(10)

Unnamed: 0,0
0,"Finishers:, 577"
1,"Male:, 414"
2,"Female:, 163"
3,
4,"1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21,..."
5,"2, 573, NATHAN A SUSTERSIC, M, PORTLAND, OR, 0..."
6,"3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:37..."
7,"4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:3..."
8,"5, 569, DEREK G OSBORNE, M, HILLSBORO, OR, 00:..."
9,"6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39:..."


## Data Manipulation & Cleaning

The DataFrame is not in the format we want. We can see a few problems here:

1. There is only 1 column with everything in it
2. There is no header
3. The top few rows are summary information which we don't need in this DataFrame

Let's start cleaning! First, we split the single column on commas.

In [9]:
# Split the single column (labelled 0) on ','
df1 = df[0].str.split(',', expand=True)
df1.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,Finishers:,577.0,,,,,,,,,,,,
1,Male:,414.0,,,,,,,,,,,,
2,Female:,163.0,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,1,814.0,JARED WILSON,M,TIGARD,OR,00:36:21,05:51,1 of 414,M 36-45,1 of 152,00:00:03,00:36:24,
5,2,573.0,NATHAN A SUSTERSIC,M,PORTLAND,OR,00:36:42,05:55,2 of 414,M 26-35,1 of 154,00:00:03,00:36:45,INTEL TEAM F
6,3,687.0,FRANCISCO MAYA,M,PORTLAND,OR,00:37:44,06:05,3 of 414,M 46-55,1 of 64,00:00:04,00:37:48,
7,4,623.0,PAUL MORROW,M,BEAVERTON,OR,00:38:34,06:13,4 of 414,M 36-45,2 of 152,00:00:03,00:38:37,
8,5,569.0,DEREK G OSBORNE,M,HILLSBORO,OR,00:39:21,06:20,5 of 414,M 26-35,2 of 154,00:00:03,00:39:24,INTEL TEAM F
9,6,642.0,JONATHON TRAN,M,PORTLAND,OR,00:39:49,06:25,6 of 414,M 18-25,1 of 34,00:00:06,00:39:55,
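One detail to keep in mind: the values in each string are separated by a comma followed by a space, so splitting on `','` alone leaves a leading space in every cell after the first. The sketch below, on a small invented frame, shows the effect and one way to trim it with `str.strip()`; the variable names are made up for this illustration.

```python
import pandas as pd

# Two invented rows in the same "value, value, ..." shape as the scraped strings.
raw = pd.DataFrame(["1, 814, JARED WILSON, M", "2, 573, NATHAN A SUSTERSIC, M"])

# Splitting on ',' keeps the space that followed each comma.
split_plain = raw[0].str.split(",", expand=True)

# Strip whitespace from every column after splitting.
split_clean = split_plain.apply(lambda col: col.str.strip())

print(repr(split_plain.loc[0, 2]))  # ' JARED WILSON'
print(repr(split_clean.loc[0, 2]))  # 'JARED WILSON'
```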


Next up, let's add the column headers based on the `<th>` elements.

First, we will create a list with all the headers, using the same cleaning method as before to remove the element tags.

Then, we will rename the columns with this list.

In [10]:
col_labels = soup.find_all('th')
label_list = []
for col_label in col_labels:
    clean_col_label = bs(str(col_label),'lxml').get_text()
    label_list.append(clean_col_label)
print(label_list)

df1.columns = label_list
df1.head(10)

['Place', 'Bib', 'Name', 'Gender', 'City', 'State', 'Chip Time', 'Chip Pace', 'Gender Place', 'Age Group', 'Age Group Place', 'Time to Start', 'Gun Time', 'Team']


Unnamed: 0,Place,Bib,Name,Gender,City,State,Chip Time,Chip Pace,Gender Place,Age Group,Age Group Place,Time to Start,Gun Time,Team
0,Finishers:,577.0,,,,,,,,,,,,
1,Male:,414.0,,,,,,,,,,,,
2,Female:,163.0,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,1,814.0,JARED WILSON,M,TIGARD,OR,00:36:21,05:51,1 of 414,M 36-45,1 of 152,00:00:03,00:36:24,
5,2,573.0,NATHAN A SUSTERSIC,M,PORTLAND,OR,00:36:42,05:55,2 of 414,M 26-35,1 of 154,00:00:03,00:36:45,INTEL TEAM F
6,3,687.0,FRANCISCO MAYA,M,PORTLAND,OR,00:37:44,06:05,3 of 414,M 46-55,1 of 64,00:00:04,00:37:48,
7,4,623.0,PAUL MORROW,M,BEAVERTON,OR,00:38:34,06:13,4 of 414,M 36-45,2 of 152,00:00:03,00:38:37,
8,5,569.0,DEREK G OSBORNE,M,HILLSBORO,OR,00:39:21,06:20,5 of 414,M 26-35,2 of 154,00:00:03,00:39:24,INTEL TEAM F
9,6,642.0,JONATHON TRAN,M,PORTLAND,OR,00:39:49,06:25,6 of 414,M 18-25,1 of 34,00:00:06,00:39:55,
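Assigning to `.columns` raises an error if the number of labels does not match the number of columns, so a quick length check can make a mismatch easier to diagnose. The following is a sketch on invented sample HTML with the built-in `html.parser`; the list comprehension is a compact equivalent of the loop above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample table, used only for illustration.
sample_html = (
    "<table><tr><th> Place </th><th>Bib</th></tr>"
    "<tr><td>1</td><td>814</td></tr></table>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Header text with surrounding whitespace trimmed
labels = [th.get_text(strip=True) for th in sample_soup.find_all("th")]

frame = pd.DataFrame([["1", "814"]])

# Only rename when the label count matches the column count.
if len(labels) != frame.shape[1]:
    raise ValueError(f"Expected {frame.shape[1]} labels, found {len(labels)}")
frame.columns = labels

print(frame.columns.tolist())  # ['Place', 'Bib']
```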


Next, let's drop the first four rows, which hold summary information rather than results.

In [11]:
df2 = df1.drop(df1.index[0:4])
df2.head(10)

Unnamed: 0,Place,Bib,Name,Gender,City,State,Chip Time,Chip Pace,Gender Place,Age Group,Age Group Place,Time to Start,Gun Time,Team
4,1,814,JARED WILSON,M,TIGARD,OR,00:36:21,05:51,1 of 414,M 36-45,1 of 152,00:00:03,00:36:24,
5,2,573,NATHAN A SUSTERSIC,M,PORTLAND,OR,00:36:42,05:55,2 of 414,M 26-35,1 of 154,00:00:03,00:36:45,INTEL TEAM F
6,3,687,FRANCISCO MAYA,M,PORTLAND,OR,00:37:44,06:05,3 of 414,M 46-55,1 of 64,00:00:04,00:37:48,
7,4,623,PAUL MORROW,M,BEAVERTON,OR,00:38:34,06:13,4 of 414,M 36-45,2 of 152,00:00:03,00:38:37,
8,5,569,DEREK G OSBORNE,M,HILLSBORO,OR,00:39:21,06:20,5 of 414,M 26-35,2 of 154,00:00:03,00:39:24,INTEL TEAM F
9,6,642,JONATHON TRAN,M,PORTLAND,OR,00:39:49,06:25,6 of 414,M 18-25,1 of 34,00:00:06,00:39:55,
10,7,144,GEORGE TOTONCHY,M,PORTLAND,OR,00:40:04,06:27,7 of 414,M 36-45,3 of 152,00:00:13,00:40:17,
11,8,395,BENJAMIN C CHAFFIN,M,PORTLAND,OR,00:40:05,06:27,8 of 414,M 36-45,4 of 152,00:00:04,00:40:09,
12,9,7,BRANDON THOMAS,M,,,00:40:17,06:29,9 of 414,M 26-35,3 of 154,00:00:07,00:40:24,COLUMBIA TEAM B
13,10,3,ERIK BJORNSTAD,M,,,00:40:21,06:30,10 of 414,M 36-45,5 of 152,00:00:04,00:40:25,COLUMBIA TEAM A


We're almost there. Notice that the index no longer starts from 0 now that we have removed some of the rows, so let's reset it. Passing `drop=True` discards the old index instead of inserting it as a new column.

In [12]:
df = df2.reset_index(drop=True)
df.head(10)

Unnamed: 0,Place,Bib,Name,Gender,City,State,Chip Time,Chip Pace,Gender Place,Age Group,Age Group Place,Time to Start,Gun Time,Team
0,1,814,JARED WILSON,M,TIGARD,OR,00:36:21,05:51,1 of 414,M 36-45,1 of 152,00:00:03,00:36:24,
1,2,573,NATHAN A SUSTERSIC,M,PORTLAND,OR,00:36:42,05:55,2 of 414,M 26-35,1 of 154,00:00:03,00:36:45,INTEL TEAM F
2,3,687,FRANCISCO MAYA,M,PORTLAND,OR,00:37:44,06:05,3 of 414,M 46-55,1 of 64,00:00:04,00:37:48,
3,4,623,PAUL MORROW,M,BEAVERTON,OR,00:38:34,06:13,4 of 414,M 36-45,2 of 152,00:00:03,00:38:37,
4,5,569,DEREK G OSBORNE,M,HILLSBORO,OR,00:39:21,06:20,5 of 414,M 26-35,2 of 154,00:00:03,00:39:24,INTEL TEAM F
5,6,642,JONATHON TRAN,M,PORTLAND,OR,00:39:49,06:25,6 of 414,M 18-25,1 of 34,00:00:06,00:39:55,
6,7,144,GEORGE TOTONCHY,M,PORTLAND,OR,00:40:04,06:27,7 of 414,M 36-45,3 of 152,00:00:13,00:40:17,
7,8,395,BENJAMIN C CHAFFIN,M,PORTLAND,OR,00:40:05,06:27,8 of 414,M 36-45,4 of 152,00:00:04,00:40:09,
8,9,7,BRANDON THOMAS,M,,,00:40:17,06:29,9 of 414,M 26-35,3 of 154,00:00:07,00:40:24,COLUMBIA TEAM B
9,10,3,ERIK BJORNSTAD,M,,,00:40:21,06:30,10 of 414,M 36-45,5 of 152,00:00:04,00:40:25,COLUMBIA TEAM A
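The effect of `drop=True` is easiest to see on a toy frame. The data below is invented for illustration; the calls are the same `drop` and `reset_index` used above.

```python
import pandas as pd

# Invented frame: two summary rows followed by two result rows.
toy = pd.DataFrame(
    {
        "Place": ["Finishers:", "Male:", "1", "2"],
        "Name": [None, None, "RUNNER ONE", "RUNNER TWO"],
    }
)

# Drop the first two rows by position; the remaining labels are 2 and 3.
trimmed = toy.drop(toy.index[0:2])

# Without drop=True the old index is kept as a new 'index' column.
kept_old = trimmed.reset_index()

# With drop=True the old index is discarded.
fresh = trimmed.reset_index(drop=True)

print(trimmed.index.tolist())     # [2, 3]
print(kept_old.columns.tolist())  # ['index', 'Place', 'Name']
print(fresh.index.tolist())       # [0, 1]
```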


## Data Analysis and Visualization

