<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Highest Mountains in the World</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/highest-mountain/">https://discovery.cs.illinois.edu/microproject/highest-mountain/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Wikipedia's "List of mountains by elevation"

Wikipedia is an absolutely amazing source of information about almost every topic you can imagine!  In this microproject, you will explore how to easily use data in Wikipedia tables as datasets.

The Wikipedia article "[List of mountains by elevation](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)" (https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) contains information on hundreds of mountains -- including Mount Everest (highest in the world), Denali (highest in the United States), and many more!
- Click the link above [(or right here)](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation) to see how the Wikipedia page looks in your web browser!

### Using the pandas `read_html` function

The `pd.read_html(...)` function in the pandas library is designed to read data from tables found in webpages.
- `read_html` is very similar to the more commonly used `read_csv`
- Instead of returning a single DataFrame like `read_csv`, `read_html` returns a **list of DataFrames** -- one DataFrame for each table!
- Just like `read_csv`, you only need to provide the URL of the data!
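
To make the "list of DataFrames" idea concrete, here is a minimal sketch that runs without a network connection. The `html` string and the `demo_tables` name are made up for illustration and are not part of the Wikipedia page. The sketch assumes an HTML parser that pandas can use (such as `lxml`) is installed.

```python
import io
import pandas as pd

# Hypothetical stand-in for a webpage that contains two tables.
html = """
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Mount Everest</td><td>8848</td></tr>
  <tr><td>K2</td><td>8612</td></tr>
</table>
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Gasherbrum III</td><td>7952</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element.
demo_tables = pd.read_html(io.StringIO(html), header=0)

print(type(demo_tables))  # a plain Python list
print(len(demo_tables))   # number of tables found
print(demo_tables[0])     # the first table, as a DataFrame
```

Each element of the list is an ordinary DataFrame, so `demo_tables[0]` and `demo_tables[1]` can be inspected the same way as any DataFrame loaded with `read_csv`.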

Import `pandas` and create a new variable called `pages` that reads in all of the tables on the Wikipedia page "[List of mountains by elevation](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)":

In [6]:
import pandas as pd
pages = pd.read_html('https://en.wikipedia.org/wiki/List_of_mountains_by_elevation', header=0)
pages

[                          Mountain  Metres   Feet      Range  \
 0                    Mount Everest    8848  29029  Himalayas   
 1                               K2    8612  28255  Karakoram   
 2                    Kangchenjunga    8586  28169  Himalayas   
 3                           Lhotse    8516  27940  Himalayas   
 4                           Makalu    8485  27838  Himalayas   
 5                          Cho Oyu    8188  26864  Himalayas   
 6                       Dhaulagiri    8167  26795  Himalayas   
 7                          Manaslu    8163  26781  Himalayas   
 8                     Nanga Parbat    8126  26660  Himalayas   
 9                        Annapurna    8091  26545  Himalayas   
 10  Gasherbrum I (Hidden peak; K5)    8080  26509  Karakoram   
 11                      Broad Peak    8051  26414  Karakoram   
 12              Gasherbrum II (K4)    8035  26362  Karakoram   
 13                    Shishapangma    8027  26335  Himalayas   
 
                       

### 🔬 Checkpoint Tests 🔬

In [8]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("pages" in vars())
assert(type(pages[0]) == type(pd.DataFrame()))
assert("Feet" in pages[0])
assert("Range" in pages[1])
assert("Mountain" in pages[2])
assert("Location and Notes" in pages[3])
assert("Metres" in pages[4])
print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Joining the individual DataFrames into one large DataFrame

Now that you have **ALL** of the tables in the `pages` variable, we want to combine them into one large DataFrame.  Right now, each table on the webpage is stored as its own DataFrame in the `pages` list.

Let's explore the individual tables.  Using `pages[0]`, you can view the first table of data found on the Wikipedia page:

In [9]:
pages[0]

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Mount Everest,8848,29029,Himalayas,Nepal/China
1,K2,8612,28255,Karakoram,Pakistan/China
2,Kangchenjunga,8586,28169,Himalayas,Nepal/India
3,Lhotse,8516,27940,Himalayas,Nepal – Climbers ascend Lhotse Face in climbin...
4,Makalu,8485,27838,Himalayas,Nepal
5,Cho Oyu,8188,26864,Himalayas,"Nepal – Considered ""easiest"" eight-thousander"
6,Dhaulagiri,8167,26795,Himalayas,Nepal – Presumed world's highest from 1808-1838
7,Manaslu,8163,26781,Himalayas,Nepal
8,Nanga Parbat,8126,26660,Himalayas,Pakistan
9,Annapurna,8091,26545,Himalayas,Nepal – First eight-thousander to be climbed (...


Using `pages[1]`, you can view the second table that was found:

In [10]:
pages[1]

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Gasherbrum III,7952,26089,Karakoram,Pakistan
1,Gyachung Kang,7952,26089,Himalayas,Nepal (Khumbu)/China
2,Annapurna II,7937,26040,Himalayas,Nepal
3,Gasherbrum IV (K3),7932,26024,Karakoram,Pakistan
4,Himalchuli,7893,25896,Himalayas,"Manaslu, Nepal"
5,Distaghil Sar,7885,25869,Karakoram,Pakistan
6,Ngadi Chuli,7871,25823,Himalayas,"Manaslu, Nepal"
7,Nuptse,7861,25791,Himalayas,"Everest Massif, Nepal"
8,Khunyang Chhish,7852,25761,Karakoram,Pakistan
9,Masherbrum (K1),7821,25659,Karakoram,Pakistan – Originally named K1


### Finding the Last DataFrame

Continue to look at the tables the Wikipedia page contains.  Find out the **last index** of `pages` that contains data about the mountains:

In [15]:
pages[len(pages)-1]

Unnamed: 0,Mountain,Metres,Feet,Range,Location and Notes
0,Sgurr Dearg,986.00,3235,Cuillin,Scotland
1,Mount Sizer,980.00,3215,Diablo Range,US (California)
2,Mount Valin,980.00,3215,Saguenay Lac St-Jean,Canada (Québec)
3,Hyangnosan,979.00,3212,,"Gyeongnam Province, South Korea"
4,Scafell Pike,978.00,3209,Southern Fells,England (Cumbria) – Highest in England
5,Mount Edgecumbe,976.00,3202,,US (Alaska)
6,Grand Bonhomme,973.00,3192,,Saint Vincent and the Grenadines
7,North Mountain (Catskills),969.00,3179,Catskill Escarpment,US (New York)
8,Doli Gutta,965.00,3166,Deccan Plateau,India – Highest in Telangana State
9,Mount Monadnock,965.00,3166,,US (New Hampshire) – One of the most frequentl...


### Combining the DataFrames Together

Before we can do analysis on the whole dataset, we need to join the individual tables together.  When we join DataFrames end-to-end, where the last row of the previous DataFrame is followed by the first row of the next DataFrame, the operation is called concatenation.

Read the DISCOVERY guide to learn the syntax on "Combining DataFrames by Concatenation"
- [Guide: "Combining DataFrames by Concatenation"](https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/) (https://discovery.cs.illinois.edu/guides/Combining-DataFrames/Combining-DataFrames-by-Concatenation/)
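
As a small illustration of concatenation, the sketch below stacks three tiny DataFrames end-to-end. The `demo_pages` list and its contents are made up for this example. Slicing the list (`demo_pages[0:3]`) is one possible alternative to naming every element individually.

```python
import pandas as pd

# Hypothetical miniature versions of the tables in `pages`.
first = pd.DataFrame({"Mountain": ["Mount Everest", "K2"], "Feet": [29029, 28255]})
second = pd.DataFrame({"Mountain": ["Gasherbrum III"], "Feet": [26089]})
third = pd.DataFrame({"Mountain": ["Sgurr Dearg"], "Feet": [3235]})
demo_pages = [first, second, third]

# Stack the tables end-to-end; ignore_index=True renumbers the rows 0, 1, 2, ...
stacked = pd.concat(demo_pages[0:3], ignore_index=True)
print(stacked)
```

With `ignore_index=True` the rows are renumbered directly. Calling `.reset_index()` after `pd.concat` also renumbers the rows, but it keeps each row's old label in an extra column named `index`.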

Use concatenation to create a single DataFrame `df` that contains data about every mountain found on the Wikipedia page:

In [17]:
df = pd.concat([pages[0], pages[1], pages[2], pages[3], pages[4], pages[5], pages[6], pages[7], pages[8]]).reset_index()
df


Unnamed: 0,index,Feet,Location and Notes,Metres,Mountain,Range
0,0,29029,Nepal/China,8848.00,Mount Everest,Himalayas
1,1,28255,Pakistan/China,8612.00,K2,Karakoram
2,2,28169,Nepal/India,8586.00,Kangchenjunga,Himalayas
3,3,27940,Nepal – Climbers ascend Lhotse Face in climbin...,8516.00,Lhotse,Himalayas
4,4,27838,Nepal,8485.00,Makalu,Himalayas
5,5,26864,"Nepal – Considered ""easiest"" eight-thousander",8188.00,Cho Oyu,Himalayas
6,6,26795,Nepal – Presumed world's highest from 1808-1838,8167.00,Dhaulagiri,Himalayas
7,7,26781,Nepal,8163.00,Manaslu,Himalayas
8,8,26660,Pakistan,8126.00,Nanga Parbat,Himalayas
9,9,26545,Nepal – First eight-thousander to be climbed (...,8091.00,Annapurna,Himalayas


### 🔬 Checkpoint Tests 🔬

In [18]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("df" in vars())
assert(len(df) > len(pages[0]))
assert("Feet" in df)
assert("Mountain" in df)
assert(len(df[df.Feet > 26000]) > 0)
assert(len(df[df.Feet < 2000]) > 0)
assert(len(df[ (df.Feet < 26000) & (df.Feet > 22000) ]) > 0)
assert(len(df[ (df.Feet < 22000) & (df.Feet > 18000) ]) > 0)
assert(len(df[ (df.Feet < 18000) & (df.Feet > 14000) ]) > 0)
assert(len(df[ (df.Feet < 14000) & (df.Feet > 10000) ]) > 0)
assert(len(df[ (df.Feet < 10000) & (df.Feet > 6000) ]) > 0)
assert(len(df[ (df.Feet < 6000) & (df.Feet > 2000) ]) > 0)
print(f"{tada} All Tests Passed! {tada}")


🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Mountains in the United States

Now that we have every mountain in a single DataFrame, we can do some analysis!  In the dataset, the `Location and Notes` column contains a human-written description of the location and other notes.

Create a DataFrame called `df_us` that contains all of the mountains in the United States.

- You will need to look back at the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation), or explore `df` here in Python, to find out all the different ways mountains in the United States might be labeled.  *(Hint: There are two different ways!)*

In [27]:
df_us1 = df[df['Location and Notes'].str.contains('US')]
df_us2 = df[df['Location and Notes'].str.contains('United States')]
df_us = pd.concat([df_us1, df_us2])
df_us

Unnamed: 0,index,Feet,Location and Notes,Metres,Mountain,Range
293,30,18009,"Yukon, Canada/Alaska, US – Second highest in b...",5489.0,Mount Saint Elias,Saint Elias Mountains
310,47,17402,"Alaska, US",5304.0,Mount Foraker,Alaska Range
338,75,16421,"Alaska, US – Also given as 5,030 m or 5,045m",5005.0,Mount Bona,Saint Elias Mountains
339,0,16391,"Wrangell Mtns., Alaska, US (also given 5036 m)",4996.0,Mount Blackburn,
343,4,16237,"Wrangell Mtns., Alaska, US",4949.0,Mount Sanford,
362,23,15636,"Saint Elias Mountains, Alaska, US",4766.0,Mount Churchill,
378,39,15299,"Fairweather Range, Alaska, US",4663.0,Mount Fairweather,
393,54,14833,"Saint Elias Mountains, Alaska, US",4521.0,Mount Bear,
409,70,14573,"Alaska Range, Alaska, US",4442.0,Mount Hunter,
412,73,14505,"Sierra Nevada, California, US",4421.0,Mount Whitney,
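
Filtering twice and concatenating would list a row twice if its notes matched both substrings. One possible alternative is a single boolean mask built from a regular expression. The sketch below uses a made-up `demo` DataFrame (including a hypothetical "Example Peak" row). The word-boundary pattern is an assumption about how the labels are written, not something taken from the Wikipedia page.

```python
import pandas as pd

# Hypothetical rows; None mimics a missing "Location and Notes" value.
demo = pd.DataFrame({
    "Mountain": ["Mount Foraker", "Example Peak", "Makalu", "Unlabeled Peak"],
    "Location and Notes": ["Alaska, US", "Alaska, United States", "Nepal", None],
})

# Match "US" as a whole word, or the phrase "United States".
pattern = r"\bUS\b|United States"

# na=False treats missing notes as "no match" so the mask is purely True/False.
is_us = demo["Location and Notes"].str.contains(pattern, na=False)

demo_us = demo[is_us]
print(demo_us)
```

Because each row is tested once, a mountain whose notes contain both labels appears only once in the result.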


### Analysis: Percentage of Mountains in the Dataset in the United States?

What percentage of the mountains in the entire dataset are found in the United States?  Store the result in `pct_us` as a proportion between 0 and 1.

In [28]:
pct_us = len(df_us)/len(df)
pct_us

0.21397379912663755
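
The value above is a proportion, so multiplying by 100 gives the percentage. A minimal sketch of the arithmetic on a made-up four-row `demo` DataFrame (the names `fraction_us` and `percent_us` are illustrative only):

```python
import pandas as pd

# Hypothetical four-row dataset: two rows mention "US", two do not.
demo = pd.DataFrame({
    "Location and Notes": ["Alaska, US", "Nepal", "Pakistan", "US (California)"],
})
is_us = demo["Location and Notes"].str.contains("US", na=False)

# Proportion of matching rows: rows in the subset divided by rows in the whole dataset.
fraction_us = len(demo[is_us]) / len(demo)

# Multiply by 100 to express the proportion as a percentage.
percent_us = fraction_us * 100

print(fraction_us, percent_us)
```

Taking the mean of the boolean mask (`is_us.mean()`) gives the same proportion, since `True` counts as 1 and `False` as 0.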

### 🔬 Checkpoint Tests 🔬

In [29]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("df_us" in vars())
assert(len(df_us) > 300)
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1)
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1)

assert("pct_us" in vars())
assert(pct_us == len(df_us) / len(df))

print(f"{tada} DataFrame Analysis: All Tests Passed! {tada}")

🎉 DataFrame Analysis: All Tests Passed! 🎉


<hr style="color: #DD3403;">

## 🔬 MicroProject - All Checkpoints 🔬

The final check is that you pass all the tests, all at once!

In [30]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.

tada = "\N{PARTY POPPER}"

assert("pages" in vars())
assert(type(pages[0]) == type(pd.DataFrame()))
assert("Feet" in pages[0])
assert("Range" in pages[1])
assert("Mountain" in pages[2])
assert("Location and Notes" in pages[3])
assert("Metres" in pages[4])

assert("df" in vars())
assert(len(df) > len(pages[0]))
assert("Feet" in df)
assert("Mountain" in df)
assert(len(df[df.Feet > 26000]) > 0)
assert(len(df[df.Feet < 2000]) > 0)
assert(len(df[ (df.Feet < 26000) & (df.Feet > 22000) ]) > 0)
assert(len(df[ (df.Feet < 22000) & (df.Feet > 18000) ]) > 0)
assert(len(df[ (df.Feet < 18000) & (df.Feet > 14000) ]) > 0)
assert(len(df[ (df.Feet < 14000) & (df.Feet > 10000) ]) > 0)
assert(len(df[ (df.Feet < 10000) & (df.Feet > 6000) ]) > 0)
assert(len(df[ (df.Feet < 6000) & (df.Feet > 2000) ]) > 0)

assert("df_us" in vars())
assert(len(df_us) > 300)
assert(len(df_us[ df_us.Mountain.str.contains("Mount Saint Elias")]) == 1)
assert(len(df_us[ df_us.Mountain.str.contains("Denali")]) == 1)

assert("pct_us" in vars())
assert(pct_us == len(df_us) / len(df))

print(f"{tada}{tada} All Tests Passed! {tada}{tada}")


🎉🎉 All Tests Passed! 🎉🎉


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is commit your MicroProject to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject on GitHub!
