### Maven Everest Challenge

Use data storytelling to visualize the evolution of mankind's pursuit of the world's highest peak.

#### Challenge Objective:

For the Maven Everest Challenge, you’ll play the role of a data journalist tasked with telling the story of mankind’s quest to conquer Mount Everest. Using real expedition data, your goal is to craft a compelling visual narrative that highlights key milestones, shifting strategies, and the climbers who dared to reach the top of the world.

#### About The Data Set:

This dataset, based on the archives of Elizabeth Hawley, provides a comprehensive record of mountaineering expeditions in the Nepalese Himalaya from 1905 to 2024. It covers 89,000+ members across 11,000+ expeditions and 480 mountain peaks, with details such as dates, successes, and significant events.

In [228]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# show full output
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 1000)

In [229]:
# read all the datasets (raw strings keep the Windows backslashes literal)
expedition = pd.read_csv(r'E:\Maven_Analytics_Challenges\Everest_Challenge\HimalayanExpeditions\exped.csv')
members = pd.read_csv(r'E:\Maven_Analytics_Challenges\Everest_Challenge\HimalayanExpeditions\members.csv')
peaks = pd.read_csv(r'E:\Maven_Analytics_Challenges\Everest_Challenge\HimalayanExpeditions\peaks.csv')
# reference = pd.read_csv('E:\\Maven_Analytics_Challenges\\Everest_Challenge\\HimalayanExpeditions\\refer.csv', encoding="ISO-8859-1")
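The absolute Windows paths in this cell tie the notebook to one machine. A minimal sketch of one alternative, assuming a hypothetical base folder `DATA_DIR` and a hypothetical helper `data_path` (neither is part of the original notebook), builds every file path from a single location with `pathlib`:

```python
from pathlib import Path

# Hypothetical base folder: point this at wherever the CSV files actually live.
DATA_DIR = Path("HimalayanExpeditions")

def data_path(filename: str) -> Path:
    """Return the full path of a file inside DATA_DIR."""
    return DATA_DIR / filename

# Build one path per dataset name used in this notebook.
paths = {name: data_path(f"{name}.csv") for name in ("exped", "members", "peaks")}
for name, path in paths.items():
    print(name, "->", path)
```

Each entry in `paths` can be passed directly to `pd.read_csv`, and only `DATA_DIR` needs to change when the files move.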

In [230]:
print('Expedition shape:', expedition.shape)
print('Members shape:', members.shape)
print('Peaks shape:', peaks.shape)
# print('Reference shape:', reference.shape)
# The reference dataset is not used: this analysis does not cite media coverage or enrich a specific expedition story.

Expedition shape: (11425, 65)
Members shape: (89000, 61)
Peaks shape: (480, 23)


In [231]:
# Expedition columns
exped_cols = [
    'expid', 'peakid', 'year', 'season', 'route1', 'nation', 'leaders',
    'sponsor', 'success1', 'claimed', 'disputed', 'smtdate', 'smtdays',
    'totdays', 'termreason', 'accidents', 'achievment', 'o2used',
    'ski', 'parapente', 'traverse', 'tothired', 'smthired',
    'totmembers', 'smtmembers', 'mdeaths', 'hdeaths',
    'comrte', 'stdrte']

# Member columns
members_cols = [
    'expid', 'membid', 'peakid', 'myear', 'mseason', 'fname', 'lname',
    'sex', 'yob', 'citizen', 'leader', 'hired', 'sherpa',
    'msuccess', 'mclaimed', 'mdisputed', 'msolo', 'mtraverse',
    'mski', 'mparapente', 'mspeed', 'mo2used', 'mo2climb',
    'mo2sleep', 'death', 'deathdate', 'deathtype']

# Peak columns
peaks_cols = [
    'peakid', 'pkname', 'pkname2', 'heightm', 'location',
    'himal', 'region', 'open', 'restrict', 'pyear',
    'pcountry', 'psummiters']

# Filter columns
expedition = expedition[exped_cols]
members = members[members_cols]
peaks = peaks[peaks_cols]
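The cell above loads every column and then keeps a subset. As a hedged alternative, `pd.read_csv` accepts a `usecols` argument so that unwanted columns are never parsed. The sketch below uses a made-up three-row CSV held in memory rather than the real files, so the values are illustrative only:

```python
import io
import pandas as pd

# Toy CSV standing in for one of the real files (made-up values).
csv_text = (
    "expid,peakid,year,extra1,extra2\n"
    "E1,PK1,1953,a,b\n"
    "E2,PK1,1996,c,d\n"
)

wanted_cols = ["expid", "peakid", "year"]

# usecols tells the parser to skip every column that is not listed.
toy = pd.read_csv(io.StringIO(csv_text), usecols=wanted_cols)
print(toy.columns.tolist())
```

With the real files, passing `usecols=exped_cols` (and the equivalent lists for the other two tables) would replace the separate filtering step.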

In [232]:
# let's join the members and expedition dataframes on expid
members_expedition = pd.merge(members, expedition, on='expid', how='left')

In [233]:
members_expedition.shape

(89089, 55)
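A left join keeps every row of the left table, but it can also add rows when a key value appears more than once on the right. The sketch below, on made-up toy frames, shows two `pd.merge` options that make such cases visible: `validate='many_to_one'` raises if the right-hand key is not unique, and `indicator=True` adds a `_merge` column that flags unmatched rows. The frame names and values are illustrative assumptions, not data from the challenge.

```python
import pandas as pd

# Toy tables with made-up values.
members_toy = pd.DataFrame({"expid": ["A1", "A1", "B2", "C3"], "membid": [1, 2, 1, 1]})
exped_toy = pd.DataFrame({"expid": ["A1", "B2"], "year": [1953, 1996]})

# validate checks that expid is unique on the right; indicator records match status.
merged = pd.merge(members_toy, exped_toy, on="expid", how="left",
                  validate="many_to_one", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())
print("rows:", len(merged), "unmatched:", unmatched)

# A repeated key on the right-hand side makes the same call raise MergeError.
exped_dup = pd.concat([exped_toy, exped_toy.iloc[[0]]], ignore_index=True)
try:
    pd.merge(members_toy, exped_dup, on="expid", how="left", validate="many_to_one")
    duplicate_key_detected = False
except pd.errors.MergeError:
    duplicate_key_detected = True
print("duplicate key detected:", duplicate_key_detected)
```

Comparing `len(merged)` with the length of the left table is a quick way to confirm whether the join changed the row count.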

In [234]:
# let's check that peakid_x and peakid_y match on every row (no output means they all match)
mismatch = members_expedition.loc[members_expedition['peakid_x'] != members_expedition['peakid_y'], ['peakid_x', 'peakid_y']]
for x, y in mismatch.itertuples(index=False):
    print('Not same:', x, y)

In [235]:
# drop the column peakid_y
members_expedition.drop(columns=['peakid_y'], inplace=True)

In [236]:
# now let's join the members_expedition dataframe with the peaks dataframe on peakid
everest_data = pd.merge(members_expedition, peaks, left_on='peakid_x', right_on='peakid', how='left')

In [237]:
everest_data.shape

(89089, 66)

In [238]:
everest_data.isnull().sum()

expid             0
membid            0
peakid_x          0
myear             0
mseason           0
fname           104
lname          1125
sex               0
yob            5424
citizen           7
leader            0
hired             0
sherpa            0
msuccess          0
mclaimed          0
mdisputed         0
msolo             0
mtraverse         0
mski              0
mparapente        0
mspeed            0
mo2used           0
mo2climb          0
mo2sleep          0
death             0
deathdate     87944
deathtype     87918
year              0
season            0
route1          648
nation            0
leaders          19
sponsor        3960
success1          0
claimed           0
disputed          0
smtdate        3291
smtdays       14147
totdays       26166
termreason        0
accidents     59138
achievment    80786
o2used            0
ski               0
parapente         0
traverse          0
tothired          0
smthired          0
totmembers        0
smtmembers        0


In [239]:
# let's remove the columns with more than 10% null values (thresh keeps columns with at least 90% non-null values)
everest_data = everest_data.dropna(thresh=len(everest_data) * 0.9, axis=1)

In [240]:
len(everest_data) * 0.9

80180.1

In [241]:
everest_data.shape

(89089, 58)
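The `thresh` argument of `dropna` is a minimum count of non-null values, not a maximum share of nulls. A small sketch on a made-up ten-row frame shows the effect: with a threshold of 90% of the row count, a column with one missing value survives and a column with two missing values is dropped.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows and made-up values.
toy = pd.DataFrame({
    "full": np.arange(10.0),                    # 10 non-null values
    "one_null": [np.nan] + [1.0] * 9,           # 9 non-null values
    "two_nulls": [np.nan, np.nan] + [1.0] * 8,  # 8 non-null values
})

# Keep only columns that have at least this many non-null values.
min_non_null = int(len(toy) * 0.9)
kept = toy.dropna(thresh=min_non_null, axis=1)
print(min_non_null, kept.columns.tolist())
```

The same reading applies to the notebook cell above: columns are kept only if at least 90% of their values are present.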

In [242]:
# duplicate records
everest_data.duplicated().sum()

0
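`duplicated()` with no arguments only flags rows that are identical in every column. A hedged complement is to check for repeats on the identifying key alone. The sketch below assumes, for illustration only, that `expid` plus `membid` identifies one member record; the toy values are made up.

```python
import pandas as pd

# Toy frame: rows 1 and 2 share the same key but differ in another column.
toy = pd.DataFrame({
    "expid": ["A1", "A1", "A1", "B2"],
    "membid": [1, 2, 2, 1],
    "lname": ["Alpha", "Beta", "Beta (variant)", "Gamma"],
})

# Full-row check: finds nothing because no two rows are identical.
full_row_dupes = int(toy.duplicated().sum())

# Key-only check (assumed key): finds the repeated (expid, membid) pair.
key_dupes = int(toy.duplicated(subset=["expid", "membid"]).sum())
print(full_row_dupes, key_dupes)
```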

In [243]:
# columns with null values
everest_data.columns[everest_data.isnull().any()]

Index(['fname', 'lname', 'yob', 'citizen', 'route1', 'leaders', 'sponsor',
       'smtdate', 'location', 'pyear', 'pcountry', 'psummiters'],
      dtype='object')

In [244]:
# let's fill the remaining null values with "not available" (numeric columns such as yob become object dtype as a result)
everest_data = everest_data.fillna('not available')
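Filling every column with a text placeholder also puts strings into numeric columns, which then stop behaving as numbers. One possible alternative, sketched here on a made-up toy frame, fills only the non-numeric columns and leaves numeric gaps as `NaN`. This is an illustration of the trade-off, not a change to the notebook's approach.

```python
import numpy as np
import pandas as pd

# Toy frame with one text column and one numeric column (made-up values).
toy = pd.DataFrame({
    "lname": ["Alpha", None, "Gamma"],
    "yob": [1919.0, np.nan, 1914.0],
})

# Blanket fill: the numeric column now mixes floats and strings.
blanket = toy.fillna("not available")

# Selective fill: only non-numeric columns get the placeholder.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
selective = toy.copy()
selective[text_cols] = selective[text_cols].fillna("not available")

print(blanket.dtypes)
print(selective.dtypes)
```

With the selective version, calculations such as an age derived from `yob` still work and simply skip the missing entries.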

In [245]:
everest_data.duplicated().sum()

0

In [246]:
# columns with null values
everest_data.columns[everest_data.isnull().any()]

Index([], dtype='object')

In [247]:
# let's check that peakid_x and peakid match on every row (no output means they all match)
mismatch = everest_data.loc[everest_data['peakid_x'] != everest_data['peakid'], ['peakid_x', 'peakid']]
for x, y in mismatch.itertuples(index=False):
    print('Not same:', x, y)

In [248]:
# let's check that mseason and season match on every row (no output means they all match)
mismatch = everest_data.loc[everest_data['mseason'] != everest_data['season'], ['mseason', 'season']]
for x, y in mismatch.itertuples(index=False):
    print('Not same:', x, y)

In [249]:
# let's drop the redundant peakid_x and mseason columns
everest_data.drop(columns=['peakid_x', 'mseason'], inplace=True)

In [250]:
everest_data.dtypes

expid         object
membid         int64
myear          int64
fname         object
lname         object
sex           object
yob           object
citizen       object
leader          bool
hired           bool
sherpa          bool
msuccess        bool
mclaimed        bool
mdisputed       bool
msolo           bool
mtraverse       bool
mski            bool
mparapente      bool
mspeed          bool
mo2used         bool
mo2climb        bool
mo2sleep        bool
death           bool
year           int64
season        object
route1        object
nation        object
leaders       object
sponsor       object
success1        bool
claimed         bool
disputed        bool
smtdate       object
termreason    object
o2used          bool
ski             bool
parapente       bool
traverse        bool
tothired       int64
smthired       int64
totmembers     int64
smtmembers     int64
mdeaths        int64
hdeaths        int64
comrte          bool
stdrte          bool
peakid        object
pkname       

In [251]:
import re

# Standardize 'citizen' values by sorting country names and removing special characters
def standardize_citizen(val):
    # Remove every character except ASCII letters, digits, spaces, and '/' (this also strips accented letters)
    val = re.sub(r'[^A-Za-z0-9/ ]+', '', val)
    # Split by '/', sort, and join back if multiple countries
    if '/' in val:
        countries = [c.strip() for c in val.split('/')]
        countries = sorted(set(countries))
        return '/'.join(countries)
    return val.strip()

everest_data['citizen'] = everest_data['citizen'].apply(standardize_citizen)
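To make the effect of the normalizer concrete, the sketch below repeats the same function body from the cell above and runs it on a handful of made-up sample strings. The sample values are assumptions chosen to exercise each branch; they are not taken from the dataset.

```python
import re

def standardize_citizen(val):
    # Keep only ASCII letters, digits, spaces, and '/'
    val = re.sub(r'[^A-Za-z0-9/ ]+', '', val)
    # For multiple countries: split, de-duplicate, sort, and rejoin
    if '/' in val:
        countries = [c.strip() for c in val.split('/')]
        countries = sorted(set(countries))
        return '/'.join(countries)
    return val.strip()

# Made-up inputs covering punctuation, ordering, duplicates, and whitespace.
samples = ["Nepal/India?", "UK/USA/UK", "  W Germany ", "USA"]
results = {s: standardize_citizen(s) for s in samples}
for raw, clean in results.items():
    print(repr(raw), "->", repr(clean))
```

Because the question mark is removed and multi-country values are sorted, variants such as `Nepal/India?` and `India/Nepal` collapse to one spelling before any manual replacements are applied.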

In [252]:
# A few inconsistencies remain in the 'citizen' column; map them to a single spelling
everest_data['citizen'] = everest_data['citizen'].replace({
    'Nepal/India': 'India/Nepal',
    'Nepal/India?': 'India/Nepal',
    'India?': 'India',
    'W Germany': 'Germany',
    'Malaysi': 'Malaysia',
    'Iran/W Germany': 'Germany/Iran',
    'Lativa/USA': 'Latvia/USA',
})

In [253]:
drop_columns = ['phost', 'unlisted', 'trekking', 'trekyear', 'pexpid', 'msmtdate2', 'msmtdate3', 'msmttime2', 'mo2note', 'occupation',
                'residence', 'hcn', 'mchksum', 'route2', 'route4']

# drop any of these columns that are present; errors='ignore' skips the ones that are not
everest_data.drop(columns=drop_columns, errors='ignore', inplace=True)

In [254]:
everest_data.shape

(89089, 56)

In [255]:
# now that we have cleaned the data, let's save the cleaned data to a csv file
everest_data.to_csv(r'E:\Maven_Analytics_Challenges\Everest_Challenge\HimalayanExpeditions\everest_data.csv', index=False)
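After writing a cleaned file, a quick round trip can confirm that it reads back with the expected shape and values. The sketch below uses an in-memory buffer and a made-up toy frame instead of the real output file, so it runs anywhere; the column names mirror the notebook but the values are illustrative.

```python
import io
import pandas as pd

# Toy stand-in for the cleaned data (made-up values).
toy = pd.DataFrame({
    "expid": ["A1", "B2"],
    "msuccess": [True, False],
    "yob": [1919, "not available"],
})

# Write to an in-memory buffer instead of a file on disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape == toy.shape)
print(restored.dtypes)
```

Note that a column holding the `not available` placeholder comes back as text, so any numeric use of it later needs an explicit conversion.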