# Web Scraping in Python (EoC I, F21)

Student Name: Joseph Neus
<br>
Net ID: 902039137
<br>

NOTE: Remember to submit two versions of your lab notebook on Canvas- a Jupyter Notebook (`.ipynb`) file and a PDF.

**Q1: Describe the general approach to loading a web page in Python using `requests` and isolating the section of HTML you need using `BeautifulSoup`. What are the basic steps involved in this workflow, thinking about what happens at the start of the program to isolate the section of HTML you would need to do further work with to extract the data you want to work with?**

**Answer**: We start by importing the `requests` library and `BeautifulSoup`. To load a page, we pass its URL to `requests.get()` and assign the response to a variable. We then create a `BeautifulSoup` object from the response text (`page.text`), using `'html.parser'` so the text is parsed as HTML. With that object, `.find()` returns the first instance of a tag, while `.find_all()` returns a list-like object containing every instance of a tag or class. Finally, we use index operators on that list to pick out the table or section that holds the HTML we want to work with.
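
The workflow described in this answer can be sketched roughly as follows. To keep the example runnable offline, it parses a small made-up HTML string that stands in for `page.text` from `requests.get()`. The markup and the variable names are assumptions for illustration, not taken from a real page.

```python
from bs4 import BeautifulSoup

# Stand-in for page.text; with a live page this string would come from
# requests.get(url).text. The markup below is a made-up sample.
html = """
<html><body>
<table class="wikitable"><tr><th>No.</th><th>Name</th></tr>
<tr><td>1</td><td>Trygve Lie</td></tr></table>
<table class="wikitable"><tr><th>Year</th></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_table = soup.find("table")                # first match, or None
all_tables = soup.find_all(class_="wikitable")  # list-like of every match
rows = all_tables[0].find_all("tr")             # index, then search inside

print(len(all_tables), len(rows))
```

Indexing into the result of `.find_all()` is what narrows the page down to one table before any rows or cells are extracted.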

**Q2: Select another Wikipedia page that includes a table. From looking at the public web page, what data do you want to scrape from this web page (i.e. specific table, multiple tables, etc.)? What do you want the resulting data structure to look like (columns, rows, etc)?**

**Answer**: I chose the United Nations Wikipedia page (https://en.wikipedia.org/wiki/United_Nations). From this page, I want to scrape the table titled "Secretaries-General of the United Nations". I want the resulting data structure to have columns for the number (order in the position), name, country of origin, when they took office, when they left office, and extra notes. There will also be nine rows, one for each of the nine secretaries.

Number,
Name,
Country of Origin,
When They Took Office,
When They Left Office,
Extra Notes

In [1]:
# your codes here (if needed)

**Q3: Take a look at the HTML for this page. What tags or other HTML components do you see around the section of the page you want to work with? For this question, we're thinking about how we will end up writing a program with <code>BeautifulSoup</code> to isolate a section of the web page.**

**Answer**: Looking at the section of the page I want to work with, I see tags such as `<th>` and `<tr>` for the table headings and rows. More specifically, the table starts with `<table class="wikitable" style="font-size:90%; text-align:left;">`. There is also `class="flagicon"`, which appears on a `<span>` inside each row's country cell. These are clues that `BeautifulSoup` can use to isolate this section of the web page.
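
One possible way to use these clues is to pick the table by its caption text instead of its position in a list. The sketch below assumes a made-up HTML sample with two `wikitable` tables, and `table_with_caption` is a hypothetical helper written for this example, not part of `BeautifulSoup`.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described above.
html = """
<table class="wikitable"><caption>Member states</caption>
<tr><th>Name</th></tr></table>
<table class="wikitable"><caption>Secretaries-General of the United Nations</caption>
<tr><th>No.</th><th>Name</th></tr>
<tr><td>1</td><td><span class="flagicon"></span>Norway</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")


def table_with_caption(soup, text):
    """Return the first wikitable whose caption contains `text`, else None."""
    for table in soup.find_all("table", class_="wikitable"):
        caption = table.find("caption")
        if caption is not None and text in caption.get_text():
            return table
    return None


secretary_table = table_with_caption(soup, "Secretaries-General")
header_cells = [th.get_text(strip=True) for th in secretary_table.find_all("th")]
print(header_cells)
```

Matching on the caption is a design choice: it keeps working if another `wikitable` is added earlier on the page, whereas a fixed index such as `tables[0]` would not.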

In [1]:
# your codes here (if needed)

<strong>Q4: Develop an outline for a Python program that scrapes data from the web page you selected. 

A preliminary workflow:
- Load URL and create BeautifulSoup object
- Isolate section of HTML with your table (either directly or extract from list)
- Isolate table row elements (create list where each element is a table row)
- Extract contents from row (isolate the pieces of information from each row)
- Write extracted row contents to CSV file

NOTE: You do not need to have working code for all components of this program. That's where we're heading with the final project. At this point, we're focusing on the conceptual framework for the web scraping program. Start to build out code where you can, but think about the programming version of outlining a paper.</strong>

**Answer**: I first imported the libraries. I then used `requests.get()` to load the United Nations page. After that, I created a `BeautifulSoup` object and isolated every element with the `wikitable` class. I selected the table I wanted the program to focus on (`tables[0]`) and then got all of the table rows from `secretary_table`.

In [10]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

In [11]:
page = requests.get('https://en.wikipedia.org/wiki/United_Nations')
soup = BeautifulSoup(page.text, 'html.parser')

In [12]:
tables = soup.find_all(class_ = 'wikitable')

In [18]:
tables[0]

<table class="wikitable" style="font-size:90%; text-align:left;">
<caption style="padding-top:1em;">Secretaries-General of the United Nations<sup class="reference" id="cite_ref-115"><a href="#cite_note-115">[114]</a></sup>
</caption>
<tbody><tr>
<th>No.</th>
<th>Name</th>
<th>Country of origin</th>
<th>Took office</th>
<th>Left office</th>
<th>Notes
</th></tr>
<tr>
<td>-
</td>
<td><b><a href="/wiki/Gladwyn_Jebb" title="Gladwyn Jebb">Gladwyn Jebb</a></b>
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/23px-Flag_of_the_United_Kingdom.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/35px-Flag_of_the_United_Kingdom.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/46px-Flag_of_the_United_Kingdom.svg.png 2x" width="23"/> </span><a h

In [14]:
secretary_table = tables[0]

In [15]:
secretary_test = secretary_table.find_all("tr")
secretary_test

[<tr>
 <th>No.</th>
 <th>Name</th>
 <th>Country of origin</th>
 <th>Took office</th>
 <th>Left office</th>
 <th>Notes
 </th></tr>,
 <tr>
 <td>-
 </td>
 <td><b><a href="/wiki/Gladwyn_Jebb" title="Gladwyn Jebb">Gladwyn Jebb</a></b>
 </td>
 <td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/23px-Flag_of_the_United_Kingdom.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/35px-Flag_of_the_United_Kingdom.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/46px-Flag_of_the_United_Kingdom.svg.png 2x" width="23"/> </span><a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a>
 </td>
 <td>24 October 1945
 </td>
 <td>2 February 1946
 </td>
 <td>Served as <a href="/wiki/Acting_(law)" title="Acting (law)">Acting</a> Secretary-General 

In [16]:
single_person = secretary_test[1]
single_person

<tr>
<td>-
</td>
<td><b><a href="/wiki/Gladwyn_Jebb" title="Gladwyn Jebb">Gladwyn Jebb</a></b>
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/23px-Flag_of_the_United_Kingdom.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/35px-Flag_of_the_United_Kingdom.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/46px-Flag_of_the_United_Kingdom.svg.png 2x" width="23"/> </span><a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a>
</td>
<td>24 October 1945
</td>
<td>2 February 1946
</td>
<td>Served as <a href="/wiki/Acting_(law)" title="Acting (law)">Acting</a> Secretary-General until Lie's election
</td></tr>

In [17]:
single_position = single_person.find_all("td")
single_position

[<td>-
 </td>,
 <td><b><a href="/wiki/Gladwyn_Jebb" title="Gladwyn Jebb">Gladwyn Jebb</a></b>
 </td>,
 <td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/23px-Flag_of_the_United_Kingdom.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/35px-Flag_of_the_United_Kingdom.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/46px-Flag_of_the_United_Kingdom.svg.png 2x" width="23"/> </span><a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a>
 </td>,
 <td>24 October 1945
 </td>,
 <td>2 February 1946
 </td>,
 <td>Served as <a href="/wiki/Acting_(law)" title="Acting (law)">Acting</a> Secretary-General until Lie's election
 </td>]

In [19]:
actual_position = single_position[0]
actual_position

<td>-
</td>

In [20]:
actual_position.contents[0]

'-\n'

In [21]:
sample_position = actual_position.contents[0]
sample_position = sample_position.strip()
sample_position

'-'

In [22]:
link_test = single_person.find("a")
link_test

<a href="/wiki/Gladwyn_Jebb" title="Gladwyn Jebb">Gladwyn Jebb</a>

In [23]:
sample_link = link_test.get('href')
sample_link

'/wiki/Gladwyn_Jebb'

In [24]:
full_link = "https://en.wikipedia.org/" + sample_link
full_link

'https://en.wikipedia.org//wiki/Gladwyn_Jebb'
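
The link above contains a double slash after the domain because both the base string and the `href` contribute one. One way to avoid this, sketched here with the standard library's `urljoin`, is to let Python resolve the relative link against the base URL. The variable names mirror the cells above.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/"
sample_link = "/wiki/Gladwyn_Jebb"

# urljoin resolves the relative href against the base URL,
# so the slashes are not doubled.
full_link = urljoin(base, sample_link)
print(full_link)
```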

In [25]:
name = single_position[1]
sample_name = name.find('a')
sample_name = sample_name.contents[0]
sample_name


'Gladwyn Jebb'

In [26]:
origin = single_position[2]
sample_origin = origin.find('a')
sample_origin = sample_origin.contents[0]
sample_origin

'United Kingdom'

In [27]:
otherlink = single_position[2]
sample_otherlink = otherlink.find('a')
sample_otherlink = sample_otherlink.get('href')
fullother_link = "https://en.wikipedia.org/" + sample_otherlink
fullother_link

'https://en.wikipedia.org//wiki/United_Kingdom'

In [91]:
took = single_position[3]
sample_took = took.contents[0]
sample_took = sample_took.strip()
sample_took

'24 October 1945'

In [28]:
left = single_position[4]
sample_left = left.contents[0]
sample_left = sample_left.strip()
sample_left

'2 February 1946'

In [29]:
notes = single_position[5]
sample_notes = notes.contents[0]
sample_notes


'Served as '
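
The output above stops at `'Served as '` because `.contents[0]` is only the first child of the cell, which is the text before the `<a>` tag. A minimal sketch of one alternative, assuming the goal is the full note text, is `.get_text()`, which also includes the text of nested tags. The cell HTML below is copied from the row shown earlier.

```python
from bs4 import BeautifulSoup

cell_html = (
    '<td>Served as <a href="/wiki/Acting_(law)" title="Acting (law)">Acting</a>'
    " Secretary-General until Lie's election\n</td>"
)
notes = BeautifulSoup(cell_html, "html.parser").td

first_piece = notes.contents[0]        # only the text before the <a> tag
full_notes = notes.get_text().strip()  # text of the cell and its nested tags

print(repr(first_piece))
print(full_notes)
```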

In [30]:
del secretary_test[0]
secretary_test

[<tr>
 <td>-
 </td>
 <td><b><a href="/wiki/Gladwyn_Jebb" title="Gladwyn Jebb">Gladwyn Jebb</a></b>
 </td>
 <td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/23px-Flag_of_the_United_Kingdom.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/35px-Flag_of_the_United_Kingdom.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/46px-Flag_of_the_United_Kingdom.svg.png 2x" width="23"/> </span><a href="/wiki/United_Kingdom" title="United Kingdom">United Kingdom</a>
 </td>
 <td>24 October 1945
 </td>
 <td>2 February 1946
 </td>
 <td>Served as <a href="/wiki/Acting_(law)" title="Acting (law)">Acting</a> Secretary-General until Lie's election
 </td></tr>,
 <tr>
 <td>1</td>
 <td><b><a href="/wiki/Trygve_Lie" title="Trygve Lie">Trygve Lie</a></b></td>
 

In [34]:
test_list = []

for row in secretary_test:
    
    tags = []
    
    test = row.find_all("td")
    
    tags.append(test)
    
    for single_position in tags:
        
        try:
            actual_position = single_position[0]
            actual_position.contents[0]
            sample_position = actual_position.contents[0]
            sample_position = sample_position.strip()
             
            link_test = single_person.find("a")
            sample_link = link_test.get('href')
            full_link = "https://en.wikipedia.org/" + sample_link
            
            name = single_position[1]
            sample_name = name.find('a')
            sample_name = sample_name.contents[0]
             
            origin = single_position[2]
            sample_origin = origin.find('a')
            sample_origin = sample_origin.contents[0]
            
            otherlink = single_position[2]
            sample_otherlink = otherlink.find('a')
            sample_otherlink = sample_otherlink.get('href')
            fullother_link = "https://en.wikipedia.org/" + sample_otherlink
            
            took = single_position[3]
            sample_took = took.contents[0]
            sample_took = sample_took.strip() 
            
            left = single_position[4]
            sample_left = left.contents[0]
            sample_left = sample_left.strip()
            
            notes = single_position[5]
            sample_notes = notes.contents[0]
            sample_notes = sample_notes.strip()

            row_data = [sample_position, full_link, sample_name, sample_origin, fullother_link, sample_took, sample_left, sample_notes]

            test_list.append(row_data)
            
        except:
            continue
    
# show list
test_list


            

[['-',
  'https://en.wikipedia.org//wiki/Gladwyn_Jebb',
  'Gladwyn Jebb',
  'United Kingdom',
  'https://en.wikipedia.org//wiki/United_Kingdom',
  '24 October 1945',
  '2 February 1946',
  'Served as'],
 ['1',
  'https://en.wikipedia.org//wiki/Gladwyn_Jebb',
  'Trygve Lie',
  'Norway',
  'https://en.wikipedia.org//wiki/Norway',
  '2 February 1946',
  '10 November 1952',
  'Resigned'],
 ['2',
  'https://en.wikipedia.org//wiki/Gladwyn_Jebb',
  'Dag Hammarskjöld',
  'Sweden',
  'https://en.wikipedia.org//wiki/Sweden',
  '10 April 1953',
  '18 September 1961',
  'Died in office'],
 ['3',
  'https://en.wikipedia.org//wiki/Gladwyn_Jebb',
  'U Thant',
  'Burma',
  'https://en.wikipedia.org//wiki/Myanmar',
  '30 November 1961',
  '31 December 1971',
  'First non-European to hold office'],
 ['4',
  'https://en.wikipedia.org//wiki/Gladwyn_Jebb',
  'Kurt Waldheim',
  'Austria',
  'https://en.wikipedia.org//wiki/Austria',
  '1 January 1972',
  '31 December 1981',
  ''],
 ['5',
  'https://en.wikipe
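
In the loop above, `single_person.find("a")` refers to the row saved in an earlier cell, which is why every row in the output carries the Gladwyn Jebb link. The following is a minimal sketch of one way to scope each lookup to the current row. It assumes a made-up two-row sample in the same shape as the table, uses fewer columns to stay short, and catches only the two exceptions that a short or link-less row would raise.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org/"

# Made-up sample in the same shape as the table above.
html = """
<table>
<tr><th>No.</th><th>Name</th><th>Country of origin</th></tr>
<tr><td>-
</td><td><b><a href="/wiki/Gladwyn_Jebb">Gladwyn Jebb</a></b></td>
<td><a href="/wiki/United_Kingdom">United Kingdom</a></td></tr>
<tr><td>1</td><td><b><a href="/wiki/Trygve_Lie">Trygve Lie</a></b></td>
<td><a href="/wiki/Norway">Norway</a></td></tr>
</table>
"""

rows = BeautifulSoup(html, "html.parser").find_all("tr")

records = []
for row in rows[1:]:  # slicing skips the header row
    cells = row.find_all("td")
    try:
        name_link = cells[1].find("a")     # searched inside this row's cell
        country_link = cells[2].find("a")
        records.append([
            cells[0].get_text().strip(),
            urljoin(BASE, name_link.get("href")),
            name_link.get_text().strip(),
            country_link.get_text().strip(),
            urljoin(BASE, country_link.get("href")),
        ])
    except (IndexError, AttributeError):
        # The row has too few cells, or a cell has no link; skip it.
        continue

print(records)
```

Catching specific exceptions instead of using a bare `except` means that an unexpected error, such as a misspelled variable name, still surfaces instead of silently dropping rows.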

In [35]:
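headers = ["position", "link", "name", "country_origin", "country_link", "took_office", "left_office", "notes"]

# newline='' avoids blank lines between rows on Windows;
# the with block closes the file once all rows are written
with open('secretary.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)

    f.writerow(headers)

    for row in test_list:
        f.writerow(row)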

In [32]:
import pandas as pd

# create data frame 
df = pd.DataFrame(test_list, columns=headers)

# show df
df

Unnamed: 0,position,link,name,country_origin,country_link,took_office,left_office,notes
0,-,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Gladwyn Jebb,United Kingdom,https://en.wikipedia.org//wiki/United_Kingdom,24 October 1945,2 February 1946,Served as
1,1,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Trygve Lie,Norway,https://en.wikipedia.org//wiki/Norway,2 February 1946,10 November 1952,Resigned
2,2,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Dag Hammarskjöld,Sweden,https://en.wikipedia.org//wiki/Sweden,10 April 1953,18 September 1961,Died in office
3,3,https://en.wikipedia.org//wiki/Gladwyn_Jebb,U Thant,Burma,https://en.wikipedia.org//wiki/Myanmar,30 November 1961,31 December 1971,First non-European to hold office
4,4,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Kurt Waldheim,Austria,https://en.wikipedia.org//wiki/Austria,1 January 1972,31 December 1981,
5,5,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Javier Pérez de Cuéllar,Peru,https://en.wikipedia.org//wiki/Peru,1 January 1982,31 December 1991,
6,6,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Boutros Boutros-Ghali,Egypt,https://en.wikipedia.org//wiki/Egypt,1 January 1992,31 December 1996,Served for the shortest time
7,7,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Kofi Annan,Ghana,https://en.wikipedia.org//wiki/Ghana,1 January 1997,31 December 2006,
8,8,https://en.wikipedia.org//wiki/Gladwyn_Jebb,Ban Ki-moon,South Korea,https://en.wikipedia.org//wiki/South_Korea,1 January 2007,31 December 2016,


**Q5: What challenges or roadblocks did you face working on Q4? What parts of the program do you understand/feel ready to develop at this point? What parts of the program are less clear?**

**Answer**: One challenge was accounting for the first row, which has `-` as its number. I had to narrow down to the position cell and then use `sample_position = sample_position.strip()` to strip the trailing `\n`. Another roadblock was extracting links, such as the one for Gladwyn Jebb. I had to locate the tag with `.find()` and then pull out the link with `.get('href')`. I then did the same thing for the country link, using a different position in the row. I feel ready to use `.contents` and `.find()`, and I was able to successfully write the extracted row contents to a CSV file.

In [1]:
# your codes here (if needed)

**Q6: Describe in your own words how the url generation program covered in the previous section of the lab works. The full program is also included below. What is happening in the different program components?**

In [None]:
root = "https://www.sports-reference.com/cfb/schools/notre-dame/"

years = range(1899, 2021, 1)

tag = "-schedule.html"

urls = []

for year in years:
    urls.append(root + str(year) + tag)
    
urls

**Answer**: If we have a program that works on one page and a list of URLs, we can iterate over the list to scrape data from a series of web pages. Here, `root` and `tag` are assigned strings: `root` holds the beginning of the URL, while `tag` holds the end. `range(1899, 2021, 1)` produces the sequence of years we want to cover, from 1899 through 2020. Next comes an empty list, `urls`, followed by a for loop that builds each full URL by concatenating `root`, the year (converted to a string with `str()`), and `tag`, and appends the result to the list. Finally, we call `urls` on its own to show the list.
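
The same pattern can be written more compactly as a function with a list comprehension. This is only a sketch: `season_urls` is a hypothetical helper name, and it assumes both end years should be included, which is why it adds 1 to the last year before calling `range()`.

```python
def season_urls(root, first_year, last_year, tag):
    """Build one URL per season, including both end years."""
    return [f"{root}{year}{tag}" for year in range(first_year, last_year + 1)]


urls = season_urls(
    "https://www.sports-reference.com/cfb/schools/notre-dame/",
    1899,
    2020,
    "-schedule.html",
)

print(len(urls))
print(urls[0])
print(urls[-1])
```

Wrapping the loop in a function makes it easy to reuse the pattern for another team or site by changing only the arguments.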

In [2]:
# your codes here (if needed)

<strong>Q7: Select another Sports Reference web page that follows this pattern and write a program that generates a list of full URLs for that team/organization.

A few places to start:
- Baseball Reference season web pages have the following URL pattern:
  * `https://www.baseball-reference.com/teams/`, `TEAM ABBREVIATION`, `SEASON`, `.shtml`
- Basketball Reference season web pages have a similar pattern for NBA teams:
  * `https://www.basketball-reference.com/teams/`, `TEAM ABBREVIATION`, `SEASON`, `.html`
- Basketball Reference uses a slightly different pattern for its WNBA pages:
  * `https://www.basketball-reference.com/wnba/teams`, `TEAM ABBREVIATION`, `SEASON`, `.html`
- College Basketball Reference pages also follow a pattern: 
  * `https://www.sports-reference.com/cbb/schools`, `SCHOOL ABBREVIATION`, `SEASON`, `.html`
- For Hockey Reference pages: 
  * `https://www.hockey-reference.com/teams/`, `TEAM ABBREVIATION`, `SEASON`, `.html`
- Football Reference pages follow the same pattern for men's and women's teams:
  * `https://fbref.com/en/squads/`, `SQUAD ID`, `SEASON`, `TEAM NAME`
- Pro Football Reference pages also have a pattern: 
  * `https://www.pro-football-reference.com/teams/`, `TEAM ABBREVIATION`, `SEASON`, `.htm`

NOTE: You DO NOT need to write a program that scrapes data from these pages for this question. The purpose of this question is to be able to programmatically generate a list of URLs that cover a date range.</strong>

**Answer**: I used the Baseball Reference pattern with the team abbreviation `DET`, generating one URL for each season from 1899 through 2020.

In [137]:
root = "https://www.baseball-reference.com/teams/DET/"

years = range(1899, 2021, 1)

tag = ".shtml"

urls = []

for year in years:
    urls.append(root + str(year) + tag)
    
urls

['https://www.baseball-reference.com/teams/DET/1899.shtml',
 'https://www.baseball-reference.com/teams/DET/1900.shtml',
 'https://www.baseball-reference.com/teams/DET/1901.shtml',
 'https://www.baseball-reference.com/teams/DET/1902.shtml',
 'https://www.baseball-reference.com/teams/DET/1903.shtml',
 'https://www.baseball-reference.com/teams/DET/1904.shtml',
 'https://www.baseball-reference.com/teams/DET/1905.shtml',
 'https://www.baseball-reference.com/teams/DET/1906.shtml',
 'https://www.baseball-reference.com/teams/DET/1907.shtml',
 'https://www.baseball-reference.com/teams/DET/1908.shtml',
 'https://www.baseball-reference.com/teams/DET/1909.shtml',
 'https://www.baseball-reference.com/teams/DET/1910.shtml',
 'https://www.baseball-reference.com/teams/DET/1911.shtml',
 'https://www.baseball-reference.com/teams/DET/1912.shtml',
 'https://www.baseball-reference.com/teams/DET/1913.shtml',
 'https://www.baseball-reference.com/teams/DET/1914.shtml',
 'https://www.baseball-reference.com/tea

**Q8: Describe the general approach to loading a web page in Python using `pd.read_html()`. What are the basic steps involved in this workflow, thinking about what happens to identify/isolate the specific table you want to work with?**

**Answer**: `pd.read_html()` looks for every `<table>` tag on a web page and reads each table's HTML into Python as a list of DataFrame objects. In this example we import pandas, store the URL in a variable, and pass that variable to `pd.read_html()`, which reads the tables into Python as a list of DataFrames. We can then use index values to pull out each DataFrame.
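
As a rough illustration of how the list of DataFrames behaves, the sketch below runs `pd.read_html()` on a made-up HTML string instead of a live URL. The table contents are invented for the example, and it assumes an HTML parser that pandas can use (such as `lxml`) is installed. The optional `match` argument is one way to keep only the tables that contain a given piece of text.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; a live page would be passed as a URL instead.
html = """
<table><tr><th>Name</th><th>W</th></tr>
<tr><td>Pitcher A</td><td>9</td></tr></table>
<table><tr><th>Year</th><th>Wins</th></tr>
<tr><td>2001</td><td>10</td></tr></table>
"""

dfs = pd.read_html(StringIO(html))  # one DataFrame per <table>
print(len(dfs))

second_df = dfs[1]                  # index into the list to pick a table
print(second_df.columns.tolist())

# Keep only the tables whose text matches "Wins".
matched = pd.read_html(StringIO(html), match="Wins")
print(len(matched))
```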

In [2]:
import pandas as pd

url = "https://www.baseball-reference.com/teams/DET/2021.shtml"

dfs = pd.read_html(url)

dfs

len(dfs)

<strong>Q9: For Q7, you generated a list of Sports Reference URLs covering a time span for a specific team/organization. Select three years and web pages from that list- something early in the time period covered, something in the middle of the time period covered, and something toward the end of the time period covered.
    
Do these pages have the same pattern in terms of number and order of tables?

For one of these pages, what table or tables on these pages would you want to be able to extract and work with?</strong>

**Answer**: I chose three years from my list of Detroit Tigers (`DET`) pages: 1903, 1952, and 2019. These pages have the same pattern in terms of number and order of tables. Each has two data tables, and they are in the same order, as shown by the table headings when I isolate `dfs[0]` and `dfs[1]`. I want to extract and work with the tables from the 2019 page.

In [19]:
import pandas as pd

url = "https://www.baseball-reference.com/teams/DET/1903.shtml"

dfs = pd.read_html(url)

len(dfs)

2

In [18]:
import pandas as pd

url = "https://www.baseball-reference.com/teams/DET/1952.shtml"

dfs = pd.read_html(url)

dfs

len(dfs)

2

In [7]:
import pandas as pd

url = "https://www.baseball-reference.com/teams/DET/2019.shtml"

dfs = pd.read_html(url)

dfs

len(dfs)

2

<strong>Q10: Develop an outline for a Python program that uses `pd.read_html()` to scrape data from one of the web pages you select in Q9.

A preliminary workflow:
- Use `pd.read_html()` to create a list of DataFrame objects
- Identify which DataFrame object in the list is the table you want to work with
- Isolate the list element to create a new DataFrame
- Write the new DataFrame to a CSV file

NOTE: For Q4, you did not need to have working code for all components of this program. Since `pd.read_html()` has an easier learning curve, let's see if we can flesh out more of this program. But if you run into problems, it's okay to focus on the conceptual framework for the web scraping program. Start to build out code where you can, but think about the programming version of outlining a paper.

ANOTHER NOTE: For many Sports Reference pages, tables further down the page are buried in HTML comments. These tables will not show up when you use `pd.read_html()`. We can come back to these "hidden tables" in the final project, but for now, focus on the tables that do show up when you use `pd.read_html()`.</strong>

**Answer**: I first imported pandas and then assigned the URL of the 2019 page to a variable. I then used `pd.read_html()` to create a list of DataFrame objects and identified which one I wanted to use (`dfs[1]`). After that, I isolated it as `players_df` and wrote the new DataFrame to a CSV file.

In [20]:
import pandas as pd

url = "https://www.baseball-reference.com/teams/DET/2019.shtml"

dfs = pd.read_html(url)

len(dfs)


2

In [23]:
players_df = dfs[1]

players_df

Unnamed: 0,Rk,Pos,Name,Age,W,L,W-L%,ERA,G,GS,...,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W
0,1,SP,Matthew Boyd*,28,9,12,.429,4.56,32,32,...,6,788,104,4.32,1.230,8.6,1.9,2.4,11.6,4.76
1,2,SP,Spencer Turnbull,26,3,17,.150,4.61,30,30,...,9,656,103,3.99,1.436,9.3,0.8,3.6,8.9,2.47
2,3,SP,Daniel Norris*,26,3,13,.188,4.49,32,29,...,5,610,106,4.61,1.330,9.6,1.6,2.4,7.8,3.29
3,4,SP,Jordan Zimmermann,33,1,13,.071,6.91,23,23,...,3,504,69,4.79,1.518,11.7,1.5,2.0,6.6,3.28
4,Rk,Pos,Name,Age,W,L,W-L%,ERA,G,GS,...,WP,BF,ERA+,FIP,WHIP,H9,HR9,BB9,SO9,SO/W
5,5,CL,Shane Greene,30,0,2,.000,1.18,38,0,...,0,151,405,3.69,0.868,5.0,1.2,2.8,10.2,3.58
6,6,RP,Nick Ramirez*,29,5,4,.556,4.07,46,0,...,6,348,117,4.51,1.393,8.6,1.2,4.0,8.4,2.11
7,7,RP,Buck Farmer,28,6,6,.500,3.72,73,1,...,4,288,128,3.88,1.271,8.2,1.1,3.2,9.7,3.04
8,8,RP,Joe Jimenez,24,4,7,.364,4.37,66,0,...,1,257,109,4.66,1.324,8.4,2.0,3.5,12.4,3.57
9,9,RP,Daniel Stumpf*,28,1,1,.500,4.34,48,0,...,0,135,111,5.08,1.724,10.9,1.6,4.7,8.7,1.87


In [25]:
players_df.to_csv("sample_DET_players.csv", index=False)
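
The table output above includes a row that repeats the column headers (`Rk`, `Pos`, `Name`, ...) partway through the data. A possible cleanup step, sketched here on made-up rows with the same shape, is to filter out any row whose `Rk` value is the literal header text before writing the CSV. The player names and values are invented for the example.

```python
import pandas as pd

# Made-up rows in the shape of the output above, including a repeated header.
players_df = pd.DataFrame(
    [
        ["1", "SP", "Pitcher A", "9"],
        ["Rk", "Pos", "Name", "W"],
        ["2", "RP", "Pitcher B", "5"],
    ],
    columns=["Rk", "Pos", "Name", "W"],
)

# Keep only rows whose Rk value is not the header text, then renumber.
clean_df = players_df[players_df["Rk"] != "Rk"].reset_index(drop=True)

# The repeated header forces the column to strings; convert it back to numbers.
clean_df = clean_df.assign(W=pd.to_numeric(clean_df["W"]))

print(clean_df)
```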

**Q11: What challenges or roadblocks did you face working on Q10? What parts of the program do you understand and/or were able to develop? What parts of the program are less clear?**

**Answer**: I was able to successfully develop the program and understood how to isolate my DataFrame. I was able to choose the index of my DataFrame in the list and assign it to a variable. The `index=False` argument in `.to_csv()` was a little less clear to me, but the program still ran and wrote to a CSV file.
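
To make the `index` argument more concrete, this small sketch compares the two settings on a made-up DataFrame. Calling `.to_csv()` without a file path returns the CSV text as a string, which makes the difference easy to see.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Pitcher A", "Pitcher B"], "W": [9, 5]})

with_index = df.to_csv()                # row labels written as a first column
without_index = df.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the default setting, the row labels (0, 1, ...) are written as an extra unnamed first column, which is where an `Unnamed: 0` column comes from when the file is read back in.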

In [2]:
# your codes here (if needed)

**Q12: Describe in your own words how program for scraping unstructured text covered in the previous section of the lab works. The full program is also included below. What is happening in the different program components?**

In [28]:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://ndsmcobserver.com/2021/10/south-bend-community-leaders-discuss-role-of-notre-dame-in-fight-for-black-civil-rights/")

soup = BeautifulSoup(page.text, 'html.parser')

article = soup.find_all(class_ = "content-body")

article_text = article[0]

article_paragraphs = article_text.find_all("p")

f = open("observer_text.txt", "a")

for paragraph in article_paragraphs:
    text = str(paragraph.contents[0])
    f.write(text)

# close the file so the text is flushed to disk
f.close()

**Answer**: First we use `requests.get()` to load the URL. Then we parse the page as HTML by creating a `BeautifulSoup` object so we can interact with it. We need to find what surrounds the paragraph tags in order to isolate part of the HTML, so we look for the class `"content-body"`. We refine this further by taking the first match with `article[0]`. Then we create a list-like object by isolating the paragraph (`<p>`) tags inside it. Next we open a new TXT file in append mode. Finally, a for loop iterates over the paragraphs, takes the first piece of content from each one as a string (leaving the `<p>` markup behind), and writes it to the file.
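
A variation on this program, sketched below on made-up HTML, uses `.get_text()` so that text inside nested tags (such as links) is kept, and writes the paragraphs with blank lines between them. The sample markup and the temporary output path are assumptions for the example. The `with` block closes the file automatically.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up article markup with the same class name as the page above.
html = """
<div class="content-body">
<p>First paragraph with a <a href="#">link</a> inside.</p>
<p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="content-body")

# get_text() returns the text of each paragraph, including nested tags.
paragraphs = [p.get_text().strip() for p in article.find_all("p")]

# Write to a temporary folder so the example does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "observer_text.txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write("\n\n".join(paragraphs))

print(paragraphs)
```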

In [2]:
# your codes here (if needed)

<strong>Q13: Select another web page that includes unstructured text. From looking at the public web page, what text do you want to scrape from this web page (i.e. specific sections, multiple paragraphs, etc.)?

A few places to start for unstructured text:
- [The Observer!](https://ndsmcobserver.com) (or another news publication of your choosing)
- [WikiSource](https://en.wikisource.org/wiki/Main_Page)(a library of texts that are not covered by copyright)
  * [U.S. Presidential State of the Union Addresses](https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents)
  * [U.S. Presidential Inaugural Speeches](https://en.wikisource.org/wiki/Portal:Inaugural_Speeches_by_United_States_Presidents)
- [Project Gutenberg](https://www.gutenberg.org) (a library of literary works or texts that are not covered by copyright)</strong>

**Answer**: The text I chose is titled "Building our sanctuary, one student at a time" from The Observer. I want to scrape multiple paragraphs from the article.

In [2]:
# your codes here (if needed)

**Q14: Take a look at the HTML for this page. What tags or other HTML components do you see around the section of the page you want to work with? For this question, we're thinking about how we will end up writing a program with `BeautifulSoup` to isolate a section of the web page.**

**Answer**: Looking at the HTML, I see `<p>` tags. Around them is `<div class="content-body">`. The `content-body` class appears only once on the page, so I will use it to isolate the text I want to scrape.

In [2]:
# your codes here (if needed)

<strong>Q15: Develop an outline for a Python program that scrapes unstructured text from the web page you selected. 

A preliminary workflow:
- Load URL and create BeautifulSoup object
- Isolate section of HTML with your text (either directly or extract from list)
- IF NEEDED: Isolate text elements (create list where each element is a section of text)
- IF NEEDED: Extract text contents (isolate text from each section/paragraph)
- Write text to TXT file

NOTE: You do not need to have working code for all components of this program. That's where we're heading with the final project. At this point, we're focusing on the conceptual framework for the web scraping program. Start to build out code where you can, but think about the programming version of outlining a paper.</strong>

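**Answer**: I loaded the article with `requests`, created a `BeautifulSoup` object, isolated the `content-body` section, collected its `<p>` tags, and wrote the text from each paragraph to a TXT file.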

In [51]:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://ndsmcobserver.com/2021/11/building-our-sanctuary-one-student-at-a-time/")

soup = BeautifulSoup(page.text, 'html.parser')

observer = soup.find_all(class_ = "content-body")

observer_text = observer[0]

# isolate the paragraph tags inside the content-body section
observer_paragraphs = observer_text.find_all("p")

text_list = []

# collect the first piece of content from each paragraph
for paragraph in observer_paragraphs:
    text = str(paragraph.contents[0])
    text_list.append(text)

# append each paragraph to the TXT file; the with block closes the file
with open("observer_text.txt", "a") as f:
    for text in text_list:
        f.write(text)
    

**Q16: What challenges or roadblocks did you face working on Q15? What parts of the program do you understand/feel ready to develop at this point? What parts of the program are less clear?**

**Answer**: I forgot to import `requests` and `BeautifulSoup` at the beginning, and I also forgot to specify the `content-body` class. However, I was able to narrow down my paragraphs and create a for loop that writes the text to a TXT file.

In [2]:
# your codes here (if needed)

# Final Project


For this project, I decided to examine a Smithsonian Magazine photo contest archive page (https://photocontest.smithsonianmag.com/photocontest/archive/2020/where). In this case, I would be working with a type of unstructured data: a grid-like layout of images with brief information underneath each one, rather than a table like those on Wikipedia. I particularly wanted to scrape the first three finalist categories on the site: Natural World, People, and Travel. My plan was to write the data into a separate CSV file for each category.

The challenges I expected to encounter were organizing the images and deciding how to structure all of the information in a table. The initial plan was to use a for loop that would take each section with the same class name and iterate through my data.

One challenge I encountered was scraping the hashtags for an individual grid. I knew this would be a problem because each image has a different number of hashtags. After much trial and error, I consulted w3schools and worked out that I could use a for loop to collect each tag and then use the `join()` method to combine all of the hashtags into one string. This worked and could handle any number of hashtags. Another challenge was the for loop that builds and compiles a list from all of my separate grids. For this, I had to be careful with my variable names and make sure I was calling `grids`, which holds the entire grid structure for Natural World.

After that, I worked on how to reuse my previous code to scrape the other categories. This was challenging as well, because I had to be very specific with variable names and use the right variables for the "people" and "travel" categories. I think the project improved my understanding of the importance of variables as well as how to expand on previous code. Once I had the code figured out for Natural World, I could apply it to other scenarios, such as scraping the People and Travel categories. Every line of code you write requires maintenance, and small changes can require a lot of work. However, the final project was very useful for understanding the process of writing extensive code.
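
The hashtag step and the reuse across categories described above could look roughly like the sketch below. It runs on a trimmed sample of the page's markup, `grid_to_row` is a hypothetical helper name, and looking up each category by its `<h3>` heading is an assumption about how the sections might be told apart.

```python
from bs4 import BeautifulSoup

# Trimmed sample in the shape of the contest page's markup.
html = """
<section class="winners-category">
<h3>NATURAL WORLD</h3>
<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/crabeater-seals/">
<div class="photographer">Florian Ledoux</div>
<div class="photo-title">Crabeater Seals</div>
</a>
<div class="photo-tags">
<a href="/photocontest/tags/antarctica/">#antarctica</a>
<a href="/photocontest/tags/ice/">#ice</a>
</div>
</div>
</section>
"""


def grid_to_row(grid):
    """Return [photographer, title, hashtags] for one grid-photo element."""
    tag_links = grid.find(class_="photo-tags").find_all("a")
    tags = [a.get_text(strip=True) for a in tag_links]
    return [
        grid.find(class_="photographer").get_text(strip=True),
        grid.find(class_="photo-title").get_text(strip=True),
        " ".join(tags),  # works for any number of hashtags
    ]


soup = BeautifulSoup(html, "html.parser")

# One list of rows per category, keyed by the section's <h3> heading.
rows_by_category = {}
for section in soup.find_all(class_="winners-category"):
    category = section.find("h3").get_text(strip=True)
    grids = section.find_all(class_="grid-photo")
    rows_by_category[category] = [grid_to_row(grid) for grid in grids]

print(rows_by_category)
```

Putting the per-grid logic in one function means the same code runs for every category, so there is no need for a separate set of variable names for each one.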


In [6]:
#First I imported libraries
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

In [7]:
#Here I request the Smithsonian Magazine photo contest page with its URL
page = requests.get('https://photocontest.smithsonianmag.com/photocontest/archive/2020/')
#Created a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

In [8]:
#Isolating every section of HTML that has the winners-category class
group = soup.find_all(class_='winners-category')

In [9]:
#showing group of Finalists for Natural World
group[0]

<section class="winners-category">
<h2>Finalists</h2>
<h3>NATURAL WORLD</h3>
<div class="photo-grid-photos" id="photoGridPhotos">
<div class="photo-container" id="photoContainer1">
<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/crabeater-seals/">
<img alt="Crabeater Seals thumbnail" src="https://th-thumbnailer.cdn-si-edu.com/el_0tAXud0t3EpTjl2j_c5FYHFo=/fit-in/600x0/filters:focal(2732x1817:2733x1818)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/c130aa79-3045-4d87-9287-ebb3c8c322c3.jpg"/>
<div class="photographer">Florian Ledoux</div>
<div class="photo-title">Crabeater Seals</div>
</a>
<div class="photo-tags">
<div class="photo-tags">
<a href="/photocontest/tags/antarctica/">#antarctica</a>
<a href="/photocontest/tags/ice/">#ice</a>
<a href="/photocontest/tags/nature/">#nature</a>
<a href="/photocontest/tags/seal/">#seal</a>
<a href="/photocontest/tags/wildlife/">#wildlife</a>
</div>
</div>
</div>
<div class="grid-photo">

In [10]:
#Isolating group of Finalists for Natural World
natural_world = group[0]

In [69]:
#Finding every photo grid in Natural World, then taking the first one (the photo by Florian Ledoux)
grids = natural_world.find_all(class_="grid-photo")

first_grid = grids[0]

In [12]:
#finding the individual photo from the grid
photo = first_grid.find("img")
sample_image = photo.get('src')
sample_image


'https://th-thumbnailer.cdn-si-edu.com/el_0tAXud0t3EpTjl2j_c5FYHFo=/fit-in/600x0/filters:focal(2732x1817:2733x1818)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/c130aa79-3045-4d87-9287-ebb3c8c322c3.jpg'

In [13]:
#finding the page link of the grid as well as making sure that I grab the entire link
photo_page = first_grid.find("a")
page_link = photo_page.get('href')
entire_link = "https://photocontest.smithsonianmag.com" + page_link
entire_link

'https://photocontest.smithsonianmag.com/photocontest/detail/crabeater-seals/'
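
The link above is built by adding two strings together. As an alternative, here is a small sketch using `urljoin` from the standard library, which combines a base address with a relative `href` and leaves an already complete address unchanged. The variable names `BASE_URL`, `relative_link`, and `absolute_link` are just for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://photocontest.smithsonianmag.com"

# A relative href like the one returned by photo_page.get('href')
relative_link = "/photocontest/detail/crabeater-seals/"
entire_link = urljoin(BASE_URL, relative_link)
print(entire_link)

# An href that is already a full address passes through unchanged
absolute_link = urljoin(BASE_URL, "https://photocontest.smithsonianmag.com/photocontest/tags/ice/")
print(absolute_link)
```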

In [14]:
#Locating the photographer name
finding_stuff = first_grid.find_all("div")
finding_stuff[0]


<div class="photographer">Florian Ledoux</div>

In [15]:
#Narrowing the photographer name using .contents
photographer = finding_stuff[0]
sample_photographer = photographer.contents[0]
sample_photographer

'Florian Ledoux'

In [16]:
#Similar process to the photographer name, but for the title of the photo, again using .contents
photo_title = finding_stuff[1]
sample_title = photo_title.contents[0]
sample_title

'Crabeater Seals'
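
Picking the photographer and title by position (index 0 and index 1) works as long as every grid lists its `div` tags in the same order. A sketch of an alternative, assuming the class names shown in the HTML above, is to look each `div` up by its class and read its text with `get_text()`. The inline HTML here is a shortened stand-in, not the live page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one grid (assumption: same class names as the real page)
html = """
<div class="grid-photo">
  <a class="lightbox-thumbnail" href="/photocontest/detail/crabeater-seals/">
    <div class="photographer">Florian Ledoux</div>
    <div class="photo-title">Crabeater Seals</div>
  </a>
</div>
"""

grid = BeautifulSoup(html, "html.parser").find(class_="grid-photo")

# Look up each div by its class instead of its position in a list
photographer_tag = grid.find("div", class_="photographer")
title_tag = grid.find("div", class_="photo-title")

# get_text(strip=True) returns the text inside the tag without surrounding whitespace
photographer_name = photographer_tag.get_text(strip=True)
title_text = title_tag.get_text(strip=True)

# find() returns None when nothing matches, so a missing element can be checked directly
tags_div = grid.find("div", class_="photo-tags")
tags_text = tags_div.get_text(strip=True) if tags_div is not None else ""

print(photographer_name, title_text, repr(tags_text))
```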

In [38]:
#Narrowing the photo tags by finding all of the 'a' tags
photo_tags = finding_stuff[2]
finding_tag = photo_tags.find_all("a")
all_tags = []
#The for loop creates a list of all the tags, and ', '.join combines the elements of the list into one string separated by commas
for tag in finding_tag:
    all_tags.append(tag.contents[0])
', '.join(all_tags)


'#antarctica, #ice, #nature, #seal, #wildlife'
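
One detail worth keeping in mind about `join()`: it returns a new string and leaves the original list unchanged, so the result only sticks if it is assigned to a variable. A short sketch (the name `joined_tags` is just for illustration):

```python
all_tags = ["#antarctica", "#ice", "#nature", "#seal", "#wildlife"]

# join() builds and returns a new string; all_tags itself is still a list afterwards
joined_tags = ", ".join(all_tags)

print(type(all_tags).__name__)
print(joined_tags)

# An image with no hashtags simply gives an empty string
no_tags = ", ".join([])
```

Using `joined_tags` (rather than `all_tags`) when building a row is what puts the single string into the data.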

In [60]:
#Creating my first example list for Natural World using a for loop that repeats my previous steps for a single image. This iterates over every image in the first section, Natural World.

example_list = []

for image in grids:

    try:
        photo = image.find("img")
        sample_image = photo.get('src')

        photo_page = image.find("a")
        page_link = photo_page.get('href')
        entire_link = "https://photocontest.smithsonianmag.com" + page_link

        finding_stuff = image.find_all("div")
        photographer = finding_stuff[0]
        sample_photographer = photographer.contents[0]

        photo_title = finding_stuff[1]
        sample_title = photo_title.contents[0]

        photo_tags = finding_stuff[2]
        finding_tag = photo_tags.find_all("a")
        all_tags = []
        for tag in finding_tag:
            all_tags.append(tag.contents[0])
        #The hashtags stay a list in each row; ', '.join(all_tags) only changes the row if its result is assigned

        row_data = [sample_image, entire_link, sample_photographer, sample_title, all_tags]
        example_list.append(row_data)
    #Skip a grid that is missing an image, link, or div, without hiding every other kind of error
    except (AttributeError, IndexError, TypeError):
        continue

example_list



[['https://th-thumbnailer.cdn-si-edu.com/el_0tAXud0t3EpTjl2j_c5FYHFo=/fit-in/600x0/filters:focal(2732x1817:2733x1818)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/c130aa79-3045-4d87-9287-ebb3c8c322c3.jpg',
  'https://photocontest.smithsonianmag.com/photocontest/detail/crabeater-seals/',
  'Florian Ledoux',
  'Crabeater Seals',
  ['#antarctica', '#ice', '#nature', '#seal', '#wildlife']],
 ['https://th-thumbnailer.cdn-si-edu.com/pFvOhv9r6ieQjCrixRLQUQ5ctmo=/fit-in/600x0/filters:focal(1000x750:1001x751)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/ded5bc33-a7ab-4558-b33a-a978b2eb4273.jpg',
  'https://photocontest.smithsonianmag.com/photocontest/detail/turtle-19/',
  'Galice Hoarau',
  'Turtle',
  ['#indonesia', '#reef', '#turtle']],
 ['https://th-thumbnailer.cdn-si-edu.com/BFarV8TaHBkTDTQAvPJCwZVw1EI=/fit-in/600x0/filters:focal(1700x1054:1701x1055)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/0443

In [64]:
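#Here I create a CSV file with corresponding headers and a for loop that writes each element in example_list to a new row
headers = ["Image Link", "Page Link", "Photographer", "Photo Title", "Hashtags"]

#The with block closes the file when the writing is done; newline='' avoids blank rows on Windows
with open('Natural_World.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(headers)
    for row in example_list:
        f.writerow(row)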
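
When a row contains a Python list, `csv.writer` converts it with `str()`, so the cell ends up holding the list's printed form, brackets and quotes included. The sketch below compares that with a joined string. It writes to an in-memory buffer instead of a real file, and the image and page values are shortened placeholders.

```python
import csv
import io

headers = ["Image Link", "Page Link", "Photographer", "Photo Title", "Hashtags"]
tags = ["#indonesia", "#reef", "#turtle"]

# Same photo twice: once with the hashtags as a list, once joined into a single string
row_with_list = ["img.jpg", "page/", "Galice Hoarau", "Turtle", tags]
row_with_string = ["img.jpg", "page/", "Galice Hoarau", "Turtle", ", ".join(tags)]

# io.StringIO stands in for an open file so nothing is written to disk
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerow(row_with_list)
writer.writerow(row_with_string)

# Read the text back to see what each Hashtags cell actually contains
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
list_cell = parsed[1][4]
string_cell = parsed[2][4]
print(list_cell)
print(string_cell)
```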

In [63]:
#Here I create an example data frame of my first grid structure (Natural World) to see what it would look like
import pandas as pd

exdf = pd.DataFrame(example_list, columns=headers)

exdf

Unnamed: 0,Image Link,Page Link,Photographer,Photo Title,Hashtags
0,https://th-thumbnailer.cdn-si-edu.com/el_0tAXu...,https://photocontest.smithsonianmag.com/photoc...,Florian Ledoux,Crabeater Seals,"[#antarctica, #ice, #nature, #seal, #wildlife]"
1,https://th-thumbnailer.cdn-si-edu.com/pFvOhv9r...,https://photocontest.smithsonianmag.com/photoc...,Galice Hoarau,Turtle,"[#indonesia, #reef, #turtle]"
2,https://th-thumbnailer.cdn-si-edu.com/BFarV8Ta...,https://photocontest.smithsonianmag.com/photoc...,Gueorgui Petkov,Walrus With Baby,"[#arctic, #wildlife]"
3,https://th-thumbnailer.cdn-si-edu.com/yk4spV5F...,https://photocontest.smithsonianmag.com/photoc...,Vladimir Karamazov,Goodbye Family,"[#bulgaria, #snail]"
4,https://th-thumbnailer.cdn-si-edu.com/FD8Dhg1H...,https://photocontest.smithsonianmag.com/photoc...,John Comisky,Whale Tail,[#antarctica]
5,https://th-thumbnailer.cdn-si-edu.com/yQmiLoUE...,https://photocontest.smithsonianmag.com/photoc...,Chandrashekhar Shirur,Unsung Warrior,"[#india, #reptile, #wildlife]"
6,https://th-thumbnailer.cdn-si-edu.com/GXYzus6I...,https://photocontest.smithsonianmag.com/photoc...,Frank Klein,Robber Fly With Dew,[#texas]
7,https://th-thumbnailer.cdn-si-edu.com/x4iUB1YE...,https://photocontest.smithsonianmag.com/photoc...,Andro Loria,Light and Rain,"[#highlands, #iceland, #landscape, #weather]"
8,https://th-thumbnailer.cdn-si-edu.com/qGEPde6v...,https://photocontest.smithsonianmag.com/photoc...,Ly Dang,At the Helm,"[#animal, #baboon, #ethiopia, #monkey, #wildlife]"
9,https://th-thumbnailer.cdn-si-edu.com/G65wHhF6...,https://photocontest.smithsonianmag.com/photoc...,Yaron Schmid,The Giraffe and the Roller,"[#bird, #giraffe, #kenya, #nature, #safari, #w..."
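
If the hashtags are already sitting in a DataFrame as lists, pandas can join them column-wide with `Series.str.join`. The sketch below also saves with `index=False`; a column labelled `Unnamed: 0` typically shows up when the row index is written to a CSV file and the file is read back. The rows are shortened placeholders and the buffer stands in for a file.

```python
import io

import pandas as pd

headers = ["Image Link", "Page Link", "Photographer", "Photo Title", "Hashtags"]

# Two shortened placeholder rows with the hashtags stored as lists
rows = [
    ["img1.jpg", "page1/", "Florian Ledoux", "Crabeater Seals", ["#antarctica", "#ice"]],
    ["img2.jpg", "page2/", "John Comisky", "Whale Tail", ["#antarctica"]],
]

df = pd.DataFrame(rows, columns=headers)

# Join each list in the Hashtags column into one comma-separated string
df["Hashtags"] = df["Hashtags"].str.join(", ")

# index=False leaves the row numbers out of the saved CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False)

reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(reloaded)
```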


In [71]:
#Here I start on my next grid structure (people) by identifying the group of tags that I want isolated. In this case, I am taking the second group by using index 1.
people = group[1]
people

<section class="winners-category">
<h3>PEOPLE</h3>
<div class="photo-grid-photos" id="photoGridPhotos">
<div class="photo-container" id="photoContainer1">
<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/proud-7/">
<img alt="Nilima's Challenge thumbnail" src="https://th-thumbnailer.cdn-si-edu.com/HA-WGl4U_s6qrJwGEpgHQ3jSjs4=/fit-in/600x0/filters:focal(750x1001:751x1002)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/9f4a977a-cc98-455c-90bb-79e69d85640c.jpg"/>
<div class="photographer">Mauro De Bettio</div>
<div class="photo-title">Nilima's Challenge</div>
</a>
<div class="photo-tags">
<div class="photo-tags">
<a href="/photocontest/tags/bangladesh/">#bangladesh</a>
</div>
</div>
</div>
<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/once-with-the-earth/">
<img alt="Once With the Earth thumbnail" src="https://th-thumbnailer.cdn-si-edu.com/ETSl48rPSIqy8AG9-CyJhy-1RJk=/fit-in/600x0/filters:foca

In [72]:
#I then continue narrowing down until I reach the first image of the second grid structure that I want.
second_grids = people.find_all(class_ ="grid-photo")

one_grid = second_grids[0]
one_grid

<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/proud-7/">
<img alt="Nilima's Challenge thumbnail" src="https://th-thumbnailer.cdn-si-edu.com/HA-WGl4U_s6qrJwGEpgHQ3jSjs4=/fit-in/600x0/filters:focal(750x1001:751x1002)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/9f4a977a-cc98-455c-90bb-79e69d85640c.jpg"/>
<div class="photographer">Mauro De Bettio</div>
<div class="photo-title">Nilima's Challenge</div>
</a>
<div class="photo-tags">
<div class="photo-tags">
<a href="/photocontest/tags/bangladesh/">#bangladesh</a>
</div>
</div>
</div>

In [73]:
#Here I create another for loop with different variable names that collects the rows for my second grid structure, People.

secondexample_list = []

for image in second_grids:

    try:
        photo = image.find("img")
        sample_image = photo.get('src')

        photo_page = image.find("a")
        page_link = photo_page.get('href')
        entire_link = "https://photocontest.smithsonianmag.com" + page_link

        finding_stuff = image.find_all("div")
        photographer = finding_stuff[0]
        sample_photographer = photographer.contents[0]

        photo_title = finding_stuff[1]
        sample_title = photo_title.contents[0]

        photo_tags = finding_stuff[2]
        finding_tag = photo_tags.find_all("a")
        all_tags = []
        for tag in finding_tag:
            all_tags.append(tag.contents[0])
        #The hashtags stay a list in each row; ', '.join(all_tags) only changes the row if its result is assigned

        row_data = [sample_image, entire_link, sample_photographer, sample_title, all_tags]
        secondexample_list.append(row_data)
    #Skip a grid that is missing an image, link, or div, without hiding every other kind of error
    except (AttributeError, IndexError, TypeError):
        continue

secondexample_list

[['https://th-thumbnailer.cdn-si-edu.com/HA-WGl4U_s6qrJwGEpgHQ3jSjs4=/fit-in/600x0/filters:focal(750x1001:751x1002)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/9f4a977a-cc98-455c-90bb-79e69d85640c.jpg',
  'https://photocontest.smithsonianmag.com/photocontest/detail/proud-7/',
  'Mauro De Bettio',
  "Nilima's Challenge",
  ['#bangladesh']],
 ['https://th-thumbnailer.cdn-si-edu.com/ETSl48rPSIqy8AG9-CyJhy-1RJk=/fit-in/600x0/filters:focal(750x1001:751x1002)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/b96e86ac-5110-4b11-abc5-db6e7cdc4d4d.jpg',
  'https://photocontest.smithsonianmag.com/photocontest/detail/once-with-the-earth/',
  'Mauro De Bettio',
  'Once With the Earth',
  ['#ethiopia']],
 ['https://th-thumbnailer.cdn-si-edu.com/RdPs5ezbJjjmbZHYd1lRcjconLQ=/fit-in/600x0/filters:focal(1500x998:1501x999)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/22e2dc83-1c20-40c6-8c57-5693514f1d21.jpg',
  'htt

In [74]:
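#I then create another CSV file with corresponding headers and a for loop that writes each element in secondexample_list to a new row
headers = ["Image Link", "Page Link", "Photographer", "Photo Title", "Hashtags"]

#The with block closes the file when the writing is done; newline='' avoids blank rows on Windows
with open('People.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(headers)
    for row in secondexample_list:
        f.writerow(row)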

In [75]:
#Here is a data frame that illustrates the category of people in a chart.
import pandas as pd

ex2df = pd.DataFrame(secondexample_list, columns=headers)

ex2df

Unnamed: 0,Image Link,Page Link,Photographer,Photo Title,Hashtags
0,https://th-thumbnailer.cdn-si-edu.com/HA-WGl4U...,https://photocontest.smithsonianmag.com/photoc...,Mauro De Bettio,Nilima's Challenge,[#bangladesh]
1,https://th-thumbnailer.cdn-si-edu.com/ETSl48rP...,https://photocontest.smithsonianmag.com/photoc...,Mauro De Bettio,Once With the Earth,[#ethiopia]
2,https://th-thumbnailer.cdn-si-edu.com/RdPs5ezb...,https://photocontest.smithsonianmag.com/photoc...,Mauro Scattolini,At the Brink of Desperation,"[#italy, #people]"
3,https://th-thumbnailer.cdn-si-edu.com/aN1NXXxK...,https://photocontest.smithsonianmag.com/photoc...,Sarah Wouters,Alms,"[#culture, #thailand]"
4,https://th-thumbnailer.cdn-si-edu.com/I5PwvaXB...,https://photocontest.smithsonianmag.com/photoc...,Alain Schroeder,Grandma Diver,[#south korea]
5,https://th-thumbnailer.cdn-si-edu.com/nN75w8o5...,https://photocontest.smithsonianmag.com/photoc...,Vladimir Karamazov,The Boy From the Mountain,"[#boy, #bulgaria, #house, #village]"
6,https://th-thumbnailer.cdn-si-edu.com/b1uGchxL...,https://photocontest.smithsonianmag.com/photoc...,Ye Win Nyunt,Attachment and Hatred,"[#monks, #myanmar, #people]"
7,https://th-thumbnailer.cdn-si-edu.com/A3YIPQlQ...,https://photocontest.smithsonianmag.com/photoc...,Balu Gudimetla,A Photo of a Kid Enjoying the Summer Evening P...,"[#adventure, #family, #kids, #massachusetts, #..."
8,https://th-thumbnailer.cdn-si-edu.com/VfFwN2HO...,https://photocontest.smithsonianmag.com/photoc...,Marlon Porter,The Pride Of A Nation,"[#canada, #people, #ritual]"
9,https://th-thumbnailer.cdn-si-edu.com/SEqJGPFW...,https://photocontest.smithsonianmag.com/photoc...,Matt Stasi,Paying Respect at a George Floyd Sit-In,[#california]


In [76]:
#I then repeat the process for the subsequent cells to isolate my third category of travel. 
travel = group[2]
travel

<section class="winners-category">
<h3>TRAVEL</h3>
<div class="photo-grid-photos" id="photoGridPhotos">
<div class="photo-container" id="photoContainer1">
<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/serenity-137/">
<img alt="Serenity thumbnail" src="https://th-thumbnailer.cdn-si-edu.com/utJov_MoR8N6__ytjRfMHBOZpvQ=/fit-in/600x0/filters:focal(2016x1512:2017x1513)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/611b3d47-0a64-4814-8f7a-be79351a1389.jpg"/>
<div class="photographer">Ankit Sharma</div>
<div class="photo-title">Serenity</div>
</a>
<div class="photo-tags">
<div class="photo-tags">
<a href="/photocontest/tags/india/">#india</a>
<a href="/photocontest/tags/morning/">#morning</a>
<a href="/photocontest/tags/people/">#people</a>
<a href="/photocontest/tags/reflection/">#reflection</a>
<a href="/photocontest/tags/travel/">#travel</a>
</div>
</div>
</div>
<div class="grid-photo">
<a class="lightbox-thumbnail" href="/p

In [77]:
#Finding every photo grid in Travel, then looking at the first one
third_grids = travel.find_all(class_="grid-photo")

thefirst_grid = third_grids[0]
thefirst_grid

<div class="grid-photo">
<a class="lightbox-thumbnail" href="/photocontest/detail/serenity-137/">
<img alt="Serenity thumbnail" src="https://th-thumbnailer.cdn-si-edu.com/utJov_MoR8N6__ytjRfMHBOZpvQ=/fit-in/600x0/filters:focal(2016x1512:2017x1513)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/611b3d47-0a64-4814-8f7a-be79351a1389.jpg"/>
<div class="photographer">Ankit Sharma</div>
<div class="photo-title">Serenity</div>
</a>
<div class="photo-tags">
<div class="photo-tags">
<a href="/photocontest/tags/india/">#india</a>
<a href="/photocontest/tags/morning/">#morning</a>
<a href="/photocontest/tags/people/">#people</a>
<a href="/photocontest/tags/reflection/">#reflection</a>
<a href="/photocontest/tags/travel/">#travel</a>
</div>
</div>
</div>

In [78]:
#The same for loop again, this time collecting the rows for my third grid structure, Travel.
thirdexample_list = []

for image in third_grids:

    try:
        photo = image.find("img")
        sample_image = photo.get('src')

        photo_page = image.find("a")
        page_link = photo_page.get('href')
        entire_link = "https://photocontest.smithsonianmag.com" + page_link

        finding_stuff = image.find_all("div")
        photographer = finding_stuff[0]
        sample_photographer = photographer.contents[0]

        photo_title = finding_stuff[1]
        sample_title = photo_title.contents[0]

        photo_tags = finding_stuff[2]
        finding_tag = photo_tags.find_all("a")
        all_tags = []
        for tag in finding_tag:
            all_tags.append(tag.contents[0])
        #The hashtags stay a list in each row; ', '.join(all_tags) only changes the row if its result is assigned

        row_data = [sample_image, entire_link, sample_photographer, sample_title, all_tags]
        thirdexample_list.append(row_data)
    #Skip a grid that is missing an image, link, or div, without hiding every other kind of error
    except (AttributeError, IndexError, TypeError):
        continue

thirdexample_list

[['https://th-thumbnailer.cdn-si-edu.com/utJov_MoR8N6__ytjRfMHBOZpvQ=/fit-in/600x0/filters:focal(2016x1512:2017x1513)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/611b3d47-0a64-4814-8f7a-be79351a1389.jpg',
  'https://photocontest.smithsonianmag.com/photocontest/detail/serenity-137/',
  'Ankit Sharma',
  'Serenity',
  ['#india', '#morning', '#people', '#reflection', '#travel']],
 ['https://th-thumbnailer.cdn-si-edu.com/-D-PfFeZFiBCaA_wt9YAw2Eot94=/fit-in/600x0/filters:focal(1024x681:1025x682)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/e4faeeb8-1a6e-4008-8782-754181c79cee.jpg',
  'https://photocontest.smithsonianmag.com/photocontest/detail/inside-a-hot-balloon/',
  'Tran Tuan Viet',
  'Colorful Work',
  ['#travel', '#vietnam']],
 ['https://th-thumbnailer.cdn-si-edu.com/oHep_crZs8VYZmZUJr3_TAdVRvA=/fit-in/600x0/filters:focal(1125x748:1126x749)/https://tf-cmsv2-photocontest-smithsonianmag-prod-approved.s3.amazonaws.com/b39056d

In [81]:
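#Finally I create the third CSV file with the same headers and write each element in thirdexample_list to a new row
headers = ["Image Link", "Page Link", "Photographer", "Photo Title", "Hashtags"]

#The with block closes the file when the writing is done; newline='' avoids blank rows on Windows
with open('Travel.csv', 'w', newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    f.writerow(headers)
    for row in thirdexample_list:
        f.writerow(row)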

In [80]:
#Here is a data frame that illustrates the category of travel in a chart.
import pandas as pd

ex3df = pd.DataFrame(thirdexample_list, columns=headers)

ex3df

Unnamed: 0,Image Link,Page Link,Photographer,Photo Title,Hashtags
0,https://th-thumbnailer.cdn-si-edu.com/utJov_Mo...,https://photocontest.smithsonianmag.com/photoc...,Ankit Sharma,Serenity,"[#india, #morning, #people, #reflection, #travel]"
1,https://th-thumbnailer.cdn-si-edu.com/-D-PfFeZ...,https://photocontest.smithsonianmag.com/photoc...,Tran Tuan Viet,Colorful Work,"[#travel, #vietnam]"
2,https://th-thumbnailer.cdn-si-edu.com/oHep_crZ...,https://photocontest.smithsonianmag.com/photoc...,Prithwiraj Dhang,Dance on the Auspicious Day of the Turmeric Fe...,"[#festival, #india]"
3,https://th-thumbnailer.cdn-si-edu.com/CABmhyYs...,https://photocontest.smithsonianmag.com/photoc...,Alain Schroeder,Taekwondo,"[#korea, #north]"
4,https://th-thumbnailer.cdn-si-edu.com/z7TERTzh...,https://photocontest.smithsonianmag.com/photoc...,Phuoc Hoai Nguyen,Fishing Boat,"[#blue, #boat, #fishing, #net, #people, #sea, ..."
5,https://th-thumbnailer.cdn-si-edu.com/puOuQJrw...,https://photocontest.smithsonianmag.com/photoc...,Viktor Lyagushkin,Green Clouds,"[#ice, #underwater]"
6,https://th-thumbnailer.cdn-si-edu.com/4dIQbcqt...,https://photocontest.smithsonianmag.com/photoc...,Callie Chee,Autumn Morning at the American Swamps,"[#kayak, #sunrise]"
7,https://th-thumbnailer.cdn-si-edu.com/rX6lSleT...,https://photocontest.smithsonianmag.com/photoc...,Olesia Kim,Generous Russia,"[#food, #hood, #meal]"
8,https://th-thumbnailer.cdn-si-edu.com/b7OuuIZA...,https://photocontest.smithsonianmag.com/photoc...,Marco Campi,In Spite of Everything,"[#flood, #italy, #square, #tide, #venice]"
9,https://th-thumbnailer.cdn-si-edu.com/2jxx2Bpm...,https://photocontest.smithsonianmag.com/photoc...,Christopher Michel,The Far Away,[#antarctica]


# Final Project Next Steps

Our work in this lab is designed to lay the foundation and serve as a springboard for final project work. Specifically, Q4, Q10, and Q13 ask you to develop an outline for web scraping programs using `BeautifulSoup` and `pd.read_html()`.

Those questions (and other work for this lab) are the starting place for the final project.

The final project for this course involves a web-scraping project written in Python. Specifically, the final project allows you to select a web page (or web pages) and write a Python program (or programs) that downloads select content from that web page as a plain-text file (`CSV`, `TXT`, etc). That content could be paragraphs of text, tables of data, etc. 

Successful final projects will include two main components:
- a well-documented, working Python program written in Jupyter Notebooks
- a written reflection (minimum 300 words) that documents how you approached the final project/what you wanted to accomplish via the final project, resources consulted, how you handled challenges you encountered, key takeaways, etc.

That reflection can come at the end of the Jupyter Notebook or be embedded throughout the Jupyter Notebook, if you want to approach authoring the notebook as a type of tutorial or "report".

Expect to spend at least 10 hours working on the final project. That includes brainstorming, meeting with instructor/TAs, in-class work time, etc. If you’re working on a project that is not going to take that much time, think about how to add complexity or take on another smaller scale project.
- Contact the instructor with questions.

**So where to start?**

The instructor and TAs are going to move quickly on getting you feedback on this lab. 

But your work in Q4, Q10, and Q13 should be the starting place for how you think about and approach the final project.

Specifically, think about the kinds of web content you were able to scrape for these questions and how you might further develop, refine, or expand your work.

For example, these questions asked you to develop an outline for specific types of webscraping programs. 
- One next step for the final project could be selecting 1-2 of these programs and further developing or refining the code.
- Another next step (especially if you were able to develop working programs for these questions) is to think about how you could expand or extend these workflows to multiple web pages or other web data sources.
- A third option for next steps would involve thinking expansively about how you could apply the concepts and approaches covered in this lab to a different type of data source/structure. 

So in the interim, as you're waiting for feedback on this lab, think about where you could go next with expanding and extending your work in this lab, and start to flesh out or develop some of your own ideas about where you put your time and effort as you work on the final project.

**Final project timeline**
- Week #14 (11/23): In-class work time
- Week #15 (11/30, 12/2):
  * Initial version of lab due end of day Tuesday 11/30
  * In-class work time both days
- By end of Week #15: Final project update/shareout due on Canvas
- By end of day Friday 12/10 (Week #16, last week of classes/start of exam week): Final project due on Canvas

Again, contact the instructor with questions.