# Data for Mountains Taller than 8,000 meters

Will Imoehl

October 2016

# Abstract
This project looks at data for all mountains taller than 8,000 meters. The first data set comes from Kaggle and contains the name, height, and number of ascents of each mountain. It is analyzed to determine which mountains have the highest and lowest rates of successful climbing attempts. The second data set contains information on deaths on Mount Everest, and the third contains data on deaths on the remaining mountains over 8,000 meters tall. These last two data sets are combined in order to ask about the most common cause of death across all the mountains, on the mountain with the most deaths, and on the mountain with the fewest deaths. The initial plan was to combine all three data sets, but the Kaggle data disagrees with the Wikipedia data in several important ways. For some mountains, for example, the number of deaths is far larger than the number of failed attempts reported by Kaggle. I therefore felt it was best to analyze the Kaggle data set separately from the two Wikipedia data sets.

In [1]:
import pandas as pd

# The First Dataset
This data comes from Kaggle and contains information on
* Mountain Name
* Height in meters
* Height in feet
* Prominence in meters
* Mountain Range
* Coordinates
* Parent Mountain
* Year of first ascent
* Number of ascents before 2004
* Number of failed ascents before 2004

In [2]:
start = pd.read_excel("Mountains.xls")

In [3]:
start = start[(start.Heightm>=8000)]

# What was the last mountain to be climbed? The first?
We start with the last mountain to be climbed by sorting by first ascent in descending order.

In [4]:
start.sort_values('First ascent',ascending=False).head(1)

Unnamed: 0,Rank,Mountain,Heightm,Heightft,Prominencem,Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
13,14,Shishapangma,8027,26335,2897,Jugal Himalaya,28°21′12″N 85°46′43″E,Cho Oyu,1964,43.0,19.0


Now we reverse the sort order to obtain the first mountain on the list to be climbed.

In [5]:
start.sort_values('First ascent').head(1)

Unnamed: 0,Rank,Mountain,Heightm,Heightft,Prominencem,Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
9,10,Annapurna I,8091,26545,2984,Annapurna Himalaya,28°35′44″N 83°49′13″E,Cho Oyu,1950,36.0,47.0
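
The same two lookups can also be done without sorting the whole frame. The following is a minimal sketch, assuming a hypothetical four-row frame named `sample` whose values are copied by hand from the outputs in this notebook; it is not the full Kaggle table.

```python
import pandas as pd

# Hypothetical four-row subset copied from the outputs in this notebook.
sample = pd.DataFrame({
    'Mountain': ['Mount Everest', 'Annapurna I', 'Gasherbrum II', 'Shishapangma'],
    'First ascent': [1953, 1950, 1956, 1964],
})

# nlargest/nsmallest return the n rows with the largest/smallest value in a column.
last_climbed = sample.nlargest(1, 'First ascent')
first_climbed = sample.nsmallest(1, 'First ascent')

print(last_climbed['Mountain'].iloc[0])
print(first_climbed['Mountain'].iloc[0])
```

On the full table the same calls would take `start` in place of `sample`.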


# Which mountain has been climbed the most?

In [6]:
start.sort_values('Ascents bef. 2004',ascending=False).head(1)

Unnamed: 0,Rank,Mountain,Heightm,Heightft,Prominencem,Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
0,1,Mount Everest,8848,29029,8848,Mahalangur Himalaya,27°59′17″N 86°55′31″E,,1953,145.0,121.0


# Which mountain had the most failed attempts?

In [7]:
start.sort_values('Failed attempts bef. 2004',ascending=False).head(1)

Unnamed: 0,Rank,Mountain,Heightm,Heightft,Prominencem,Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
0,1,Mount Everest,8848,29029,8848,Mahalangur Himalaya,27°59′17″N 86°55′31″E,,1953,145.0,121.0


# Which mountain has the highest success rate?
The success rate is defined to be the number of successful attempts divided by the total number of attempts (failed ascents plus successful ascents). 

In [8]:
start['success'] = start['Ascents bef. 2004']/(start['Ascents bef. 2004']+start['Failed attempts bef. 2004'])

In [9]:
start.sort_values('success',ascending=False).head(1)

Unnamed: 0,Rank,Mountain,Heightm,Heightft,Prominencem,Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004,success
12,13,Gasherbrum II,8035,26362,1524,Baltoro Karakoram,35°45′28″N 76°39′12″E,Gasherbrum I,1956,54.0,12.0,0.818182
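
As a quick hand check of the success-rate definition, the arithmetic for Gasherbrum II can be reproduced from the two counts in its row. The variable names here are only illustrative.

```python
# Counts taken from the Gasherbrum II row in the output.
ascents = 54.0  # 'Ascents bef. 2004'
failed = 12.0   # 'Failed attempts bef. 2004'

# Success rate = successful ascents / total attempts.
success_rate = ascents / (ascents + failed)

print(round(success_rate, 6))
```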


# Which mountain has the highest rate of failure?
Similarly, the failure rate is defined as the number of failed attempts divided by the total number of attempts.

In [10]:
start['failure'] = start['Failed attempts bef. 2004']/(start['Ascents bef. 2004']+start['Failed attempts bef. 2004'])

In [11]:
start.sort_values('failure', ascending=False).head(1)

Unnamed: 0,Rank,Mountain,Heightm,Heightft,Prominencem,Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004,success,failure
9,10,Annapurna I,8091,26545,2984,Annapurna Himalaya,28°35′44″N 83°49′13″E,Cho Oyu,1950,36.0,47.0,0.433735,0.566265
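
Because every attempt is counted as either a success or a failure, the two rates sum to one, so the mountain with the highest failure rate is also the one with the lowest success rate. The sketch below illustrates this on a hypothetical four-row frame `rates`, with counts copied from the outputs in this notebook; it is not the full table.

```python
import pandas as pd

# Hypothetical four-row subset copied from the outputs in this notebook.
rates = pd.DataFrame({
    'Mountain': ['Mount Everest', 'Annapurna I', 'Gasherbrum II', 'Shishapangma'],
    'Ascents bef. 2004': [145.0, 36.0, 54.0, 43.0],
    'Failed attempts bef. 2004': [121.0, 47.0, 12.0, 19.0],
})

total = rates['Ascents bef. 2004'] + rates['Failed attempts bef. 2004']
rates['success'] = rates['Ascents bef. 2004'] / total
rates['failure'] = 1 - rates['success']  # complement of the success rate

# idxmin/idxmax return the index label of the smallest/largest value.
lowest_success = rates.loc[rates['success'].idxmin(), 'Mountain']
highest_failure = rates.loc[rates['failure'].idxmax(), 'Mountain']

print(lowest_success, highest_failure)
```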


# Now getting data on deaths on Mount Everest.
This data was scraped from Wikipedia. While Wikipedia may not be the most trustworthy source of data, it has the largest data set I could find for deaths on Mount Everest and on the other mountains over 8,000 meters tall. The scraped data was then exported to Excel to make it easier to handle, since the scraping produced strange formatting issues. The data contains the name, age, nationality, cause of death, and date of death for each of the deceased.

In [12]:
everest = pd.read_excel("EverestDeaths.xlsx")
everest = everest[['Name','Age','Nationality','Cause of death','Date']]

In [13]:
everest.fillna(0).head(0)

Unnamed: 0,Name,Age,Nationality,Cause of death,Date


When the data was compiled for the other mountains I manually added a column that named the mountain the deaths occurred on. In order for this data set to be combined with the data set I import below, we first need to create a column that says which mountain these deaths belong to.

In [14]:
everest['Mountain'] = 'Mount Everest'

# Now death data on the rest of the mountains over 8,000 meters.
This data also comes from Wikipedia. Unfortunately it does not include the age of the climbers when they died, so any questions about age must be limited to the data on Mount Everest. It does contain the date of death, name, nationality, cause of death, and mountain for each of the deceased.

In [15]:
mount = pd.read_excel("8000metermountaindeaths.xlsx")

In [16]:
mount.head()

Unnamed: 0,Date,Name,Nationality,Cause of death,Mountain,Age
0,1905-08-28,unnamed porter,India,Fall,Kangchenjunga,
1,1905-09-01,three unnamed others,India,Fall,Kangchenjunga,
2,1905-09-01,Alexis Pache,Switzerland,Fall,Kangchenjunga,
3,1929-05-27,E. Farmer,United States,,Kangchenjunga,
4,1930-05-09,Chettan,India,Falling ice,Kangchenjunga,


This line combines the two data frames.

In [17]:
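# DataFrame.append was removed in pandas 2.0; sort=True orders the columns alphabetically.
all_mount = pd.concat([mount, everest], sort=True)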

In [18]:
all_mount.fillna(0).head(0)

Unnamed: 0,Age,Cause of death,Date,Mountain,Name,Nationality
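
Stacking two frames row-wise can be sketched with `pd.concat`. The two small frames below are hypothetical stand-ins built from a few rows shown in this notebook. As an assumption made only for illustration, the stand-in for the other mountains omits the `Age` column entirely, to show that a column missing from one frame is filled with `NaN` in the result.

```python
import pandas as pd

# Hypothetical two-row stand-ins for the two death tables.
mount_small = pd.DataFrame({
    'Name': ['Alexis Pache', 'Chettan'],
    'Cause of death': ['Fall', 'Falling ice'],
    'Mountain': ['Kangchenjunga', 'Kangchenjunga'],
})
everest_small = pd.DataFrame({
    'Name': ['Shailendra Kumar Upadhyaya', 'Pemba Sherpa'],
    'Age': [82.0, 19.0],
    'Cause of death': ['Altitude', None],
})
everest_small['Mountain'] = 'Mount Everest'  # constant column, one value for every row

# ignore_index=True renumbers the rows 0..n-1 so every row label is unique.
combined = pd.concat([mount_small, everest_small], ignore_index=True)

print(combined)
```

Without `ignore_index=True` each frame keeps its own row labels, so the same label can appear once per source frame in the combined result.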


# Who is the oldest person to die on Mount Everest? The youngest?
Only the Mount Everest records include an age, so this can be answered by sorting the combined data frame of deaths on all the mountains by age.

In [19]:
all_mount.sort_values('Age',ascending=False).head(1)[['Age','Date','Cause of death', 'Name', 'Nationality','Mountain']]

Unnamed: 0,Age,Date,Cause of death,Name,Nationality,Mountain
59,82.0,2011-05-9,Altitude,Shailendra Kumar Upadhyaya,Nepal,Mount Everest


Here we do the exact same thing as the cell above, just in reversed order.

In [20]:
all_mount.sort_values('Age').head(1)[['Age','Date','Cause of death', 'Name', 'Nationality','Mountain']]

Unnamed: 0,Age,Date,Cause of death,Name,Nationality,Mountain
3,19.0,2015-04-25,,Pemba Sherpa,Nepal,Mount Everest
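
Sorting the combined frame works even though most rows have no age, because `sort_values` places missing values last by default in both ascending and descending order. A minimal sketch, assuming a hypothetical three-row frame `ages` with names and ages copied from the outputs in this notebook:

```python
import pandas as pd

# Hypothetical three-row frame; the last row has no recorded age.
ages = pd.DataFrame({
    'Name': ['Shailendra Kumar Upadhyaya', 'Pemba Sherpa', 'Alexis Pache'],
    'Age': [82.0, 19.0, None],
})

# Missing ages sort to the end regardless of direction (na_position='last').
oldest = ages.sort_values('Age', ascending=False).head(1)
youngest = ages.sort_values('Age').head(1)

print(oldest['Name'].iloc[0])
print(youngest['Name'].iloc[0])
```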


# What is the most common cause of death on all the mountains over 8,000 meters?
To solve this we use groupby to get the number of people that died each way. Then we can just sort by the number of people that died to find the most common cause of death.

In [21]:
cod = all_mount.groupby('Cause of death').size().reset_index()
cod.columns = ['Cause of death', 'num']
cod.sort_values('num',ascending=False).head(1)

Unnamed: 0,Cause of death,num
3,Avalanche,243
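
An equivalent count can be sketched with `value_counts`, which returns the counts already sorted from most to least common. Like `groupby`, it leaves out rows with a blank cause by default. The series `causes` below is a hypothetical sample copied from the five Kangchenjunga rows shown earlier.

```python
import pandas as pd

# Causes from the five Kangchenjunga rows shown earlier; None marks the blank cause.
causes = pd.Series(['Fall', 'Fall', 'Fall', None, 'Falling ice'],
                   name='Cause of death')

counts = causes.value_counts()                         # blank causes are dropped
counts_with_blank = causes.value_counts(dropna=False)  # blank causes get their own row

print(counts.index[0], counts.iloc[0])
```

These are counts of rows rather than people: an entry such as "three unnamed others" contributes one row.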


# What are the top 10 causes of death?
Here we just use the data frame from the previous question and show the top 10. This was done more for my own curiosity than anything else.

In [22]:
cod.sort_values('num',ascending=False).head(10)

Unnamed: 0,Cause of death,num
3,Avalanche,243
15,Fall,214
34,Unknown cause,81
2,Altitude,49
14,Exposure,44
13,Exhaustion,32
30,Sickness,23
19,Heart attack,13
1,Accident,12
32,Storm,12


# List all the mountains in order of most deaths to least
Here we use groupby to count the number of deaths on each of the mountains. Then we can simply sort by the number of deaths to get the ordered list of mountains.

In [23]:
dpm = all_mount.groupby('Mountain').size().reset_index()
dpm.columns = ['Mountain', 'number of deaths']
dpm.sort_values('number of deaths',ascending=False)

Unnamed: 0,Mountain,number of deaths
11,Mount Everest,287
6,K2,84
12,Nanga Parbat,80
10,Manaslu,79
0,Annapurna I,72
3,Dhaulagiri I,72
7,Kangchenjunga,53
2,Cho Oyu,50
9,Makalu,35
4,Gasherbrum I,34


# What were the most common causes of death on the mountain with the most deaths and on the mountain with the fewest deaths?
First we look at the mountain with the most deaths, which the previous step showed to be Mount Everest. We create a new data frame called most by keeping only the deaths that occurred on Mount Everest. Then we use groupby to count the number of people that died each way and look at the top cause of death.

In [24]:
most = all_mount[all_mount['Mountain'] == 'Mount Everest']
most = most.groupby('Cause of death').size().reset_index()
most.columns = ['Cause of death','number of deaths']
most.sort_values('number of deaths', ascending=False).head(1)

Unnamed: 0,Cause of death,number of deaths
2,Avalanche,70


Next we look at the mountain with the fewest deaths, which was Gasherbrum II according to the previous question. We use the same methodology as for Mount Everest to find the most common cause of death.

In [25]:
# Same steps as for Mount Everest, stored under a name that matches the question.
fewest = all_mount[all_mount['Mountain'] == 'Gasherbrum II']
fewest = fewest.groupby('Cause of death').size().reset_index()
fewest.columns = ['Cause of death','number of deaths']
fewest.sort_values('number of deaths', ascending=False).head(1)

Unnamed: 0,Cause of death,number of deaths
3,Fall,9
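
Since the two cells above differ only in the mountain name, the steps could be wrapped in a small helper. This is a sketch only: `top_cause` is a hypothetical function that is not part of the original notebook, and the `toy` frame uses made-up mountain labels "A" and "B" purely to exercise it.

```python
import pandas as pd

def top_cause(deaths, mountain):
    """Return (cause, count) for the most frequent recorded cause on one mountain.

    Raises IndexError if the mountain has no rows with a recorded cause.
    """
    on_mountain = deaths[deaths['Mountain'] == mountain]
    counts = on_mountain['Cause of death'].value_counts()  # blank causes are dropped
    return counts.index[0], int(counts.iloc[0])

# Made-up toy data, used only to exercise the helper.
toy = pd.DataFrame({
    'Mountain': ['A', 'A', 'A', 'B', 'B'],
    'Cause of death': ['Avalanche', 'Avalanche', 'Fall', 'Fall', None],
})

print(top_cause(toy, 'A'))
print(top_cause(toy, 'B'))
```

Applied to the combined frame, a call such as `top_cause(all_mount, 'Mount Everest')` would be expected to give the same answer as the groupby cells, assuming `all_mount` is built as described above.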
