## Data

The Survivor data comes from an R package by Daniel Oehm.  Daniel has made the package's data available as an Excel file, as explained in his article on [gradientdescending.com](http://gradientdescending.com/survivor-data-from-the-tv-series-in-r/).  Please use the file from our Brightspace page, though, so that your data matches what CodeGrade expects.  We have also corrected some errors in that file, which is another reason you must use the data given to you.

You need to first read the article on the website linked above.  This will give you additional details about the data that will be important as you answer the questions below.

Please note that there is a data dictionary in the file that explains the columns in the data.  You will also want to become familiar with the various spreadsheets and column names.  This will help you out tremendously in this assignment.

Finally, here are a couple of things to know for those of you that have not seen the show:
- Survivor is a reality TV show that first aired May 31, 2000 and is currently still on TV.
- Contestants are (usually) split into two teams that live in separate camps.
- The teams compete in various challenges for rewards (food, supplies, brief experience trips, etc.) and tribal immunity.
- The team that loses a challenge, and therefore doesn't get the tribal immunity, goes to tribal council where they have to vote one of their members out (this data is represented in the "Vote History" spreadsheet).
- After there are a small number of contestants left, the tribes are merged into one tribe where each contestant competes for individual immunity.  The winner of the individual immunity cannot get voted out and is safe at the next tribal council.
- There are also hidden immunity idols hidden around the campground.  If a contestant finds a hidden immunity idol and plays it at tribal council, then all votes against them do not count, and the player with the next highest number of votes goes home.
- When the contestants get down to 2 or 3 people, a number of the last contestants, known as the jury, come back to vote for the person who they think should win the game.  The winner is the one who gets the most jury votes (this data is represented in the "Jury Votes" spreadsheet).  This person is known as the Sole Survivor.
- Voting recap: 
    - Tribal Council votes (Vote History spreadsheet) are bad; the contestant with the most votes is sent home
    - Jury Votes (Jury Votes spreadsheet) are good; the contestant with the most votes wins the game and is the Sole Survivor
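
To make the tribal council rule concrete, here is a minimal sketch of how a single vote could be tallied when a hidden immunity idol is played. The function name `voted_out`, its parameters, and the contestant names are hypothetical and do not come from the data file; ties are ignored for simplicity.

```python
from collections import Counter

def voted_out(votes, idol_played_by=None):
    """Return the name with the most counted votes at one tribal council.

    votes: list of names, one entry per vote cast (hypothetical input format).
    idol_played_by: name of the contestant who played an idol, if any.
    """
    # Votes cast against the idol holder are discarded before counting.
    counted = [name for name in votes if name != idol_played_by]
    tally = Counter(counted)
    # most_common(1) returns [(name, count)] for the highest count.
    # Ties are not handled in this sketch.
    return tally.most_common(1)[0][0]

votes = ["Ana", "Ana", "Ana", "Ben", "Ben", "Cy"]
print(voted_out(votes))                        # Ana
print(voted_out(votes, idol_played_by="Ana"))  # Ben
```

With the idol played, the three votes against Ana are dropped, so Ben, who has the next highest count, goes home.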

In [1]:

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)


In [3]:

fileName = "survivor.xlsx"
xls = pd.ExcelFile(fileName)

castaway_details = pd.read_excel(xls, 'Castaway Details')
castaways = pd.read_excel(xls, 'Castaways')
challenge_description = pd.read_excel(xls, 'Challenge Description')
challenge_results = pd.read_excel(xls, 'Challenge Results')
confessionals = pd.read_excel(xls, 'Confessionals')
hidden_idols = pd.read_excel(xls, 'Hidden Idols')
jury_votes = pd.read_excel(xls, 'Jury Votes')
tribe_mapping = pd.read_excel(xls, 'Tribe Mapping')
viewers = pd.read_excel(xls, 'Viewers')
vote_history = pd.read_excel(xls, 'Vote History')
season_summary = pd.read_excel(xls, 'Season Summary')
season_palettes = pd.read_excel(xls, 'Season Palettes')
tribe_colours = pd.read_excel(xls, 'Tribe Colours')

In [4]:
dataframes = [castaway_details,castaways,challenge_description,challenge_results,confessionals,hidden_idols,jury_votes,tribe_mapping,viewers,vote_history,season_summary,season_palettes,tribe_colours]

for df in dataframes:
    df.columns = [col.lower().replace(' ', '_') for col in df.columns]
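
The loop above works because assigning to `df.columns` changes the DataFrame object itself, so the original variables (for example `castaways`) see the new names. A small sketch on a made-up frame, where `demo` and its column names are placeholders:

```python
import pandas as pd

# A toy frame with spreadsheet-style headers (hypothetical names).
demo = pd.DataFrame({"Castaway ID": [1, 2], "Full Name": ["A", "B"]})

frames = [demo]
for df in frames:
    # Lower-case each header and replace spaces with underscores, in place.
    df.columns = [col.lower().replace(" ", "_") for col in df.columns]

# The original variable reflects the change because the same object was modified.
print(list(demo.columns))  # ['castaway_id', 'full_name']
```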
    


In [5]:
# Work on a copy so castaway_details itself is left unchanged.
cast = castaway_details.copy()
cast['date_of_birth'] = pd.to_datetime(cast['date_of_birth'], errors='coerce')
# Age in whole years on the show's premiere date; rows without a valid birth date stay missing.
reference_date = pd.to_datetime("May 31, 2000")
cast['age_at_time_of_season'] = (reference_date - cast['date_of_birth']).dt.days // 365
oldest_contestant_index = cast['age_at_time_of_season'].idxmax()
# Select with a list so the result is a one-row DataFrame, then drop the helper column.
Q2 = cast.loc[[oldest_contestant_index]].drop(columns='age_at_time_of_season')
Q2

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
13,14,Rudy Boesch,Rudy,1928-01-20,2019-11-01,Male,,,Retired Navy SEAL,ISTJ
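
Because the cell above measures every age against the same reference date, the largest age belongs to the earliest birth date. A possible shortcut, sketched here on made-up rows (the names and dates in `people` are placeholders), is to take `idxmin()` of the parsed birth dates directly:

```python
import pandas as pd

# Toy data; the third row has an unparseable birth date on purpose.
people = pd.DataFrame({
    "full_name": ["A Older", "B Younger", "C Unknown"],
    "date_of_birth": ["1928-01-20", "1975-12-25", "not a date"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
people["date_of_birth"] = pd.to_datetime(people["date_of_birth"], errors="coerce")

# idxmin() skips NaT by default and returns the label of the earliest date.
oldest_idx = people["date_of_birth"].idxmin()

# Selecting with a list keeps the result as a one-row DataFrame.
oldest = people.loc[[oldest_idx]]
print(oldest["full_name"].iloc[0])  # A Older
```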


In [6]:
seasons_played = castaways.groupby('full_name')['season_name'].nunique()

max_seasons_contestant = seasons_played.idxmax()

Q3 = castaway_details[castaway_details['full_name'] == max_seasons_contestant]

Q3

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
54,55,Rob Mariano,Boston Rob,1975-12-25,NaT,Male,,,Construction Worker,ESTJ


In [7]:
# Keep only the row for each season's winner, ordered by season with a fresh 0..n-1 index.
sole_survivor = castaways[castaways['result'] == 'Sole Survivor']
sole_survivor = sole_survivor.sort_values(by='season').reset_index(drop=True)

Q4 = sole_survivor
Q4

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
0,Survivor: Borneo,1,Richard Hatch,16,Richard,39,Newport,Rhode Island,ENTP,14,39,16,Sole Survivor,,Tagi,,,Rattana,6,4
1,Survivor: The Australian Outback,2,Tina Wesson,32,Tina,40,Knoxville,Tennessee,ESFJ,16,42,16,Sole Survivor,,Ogakor,,,Barramundi,0,2
2,Survivor: Africa,3,Ethan Zohn,48,Ethan,27,Lexington,Massachusetts,ISFP,15,39,16,Sole Survivor,,Boran,Boran,,Moto Maji,0,4
3,Survivor: Marquesas,4,Vecepia Towery,64,Vecepia,36,Hayward,California,ISTJ,15,39,16,Sole Survivor,,Maraamu,Rotu,,Soliantu,2,4
4,Survivor: Thailand,5,Brian Heidik,80,Brian,34,Quartz Hill,California,ISTP,15,39,16,Sole Survivor,,Chuay Gahn,,,Chuay Jai,0,8
5,Survivor: The Amazon,6,Jenna Morasca,96,Jenna,21,Bridgeville,Pennsylvania,ISTP,15,39,16,Sole Survivor,,Jaburu,Jaburu,,Jacaré,3,7
6,Survivor: Pearl Islands,7,Sandra Diaz-Twine,112,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
7,Survivor: All-Stars,8,Amber Brkich,27,Amber,25,Beaver,Pennsylvania,ISFP,17,39,18,Sole Survivor,,Chapera,Chapera,Chapera,Chaboga Mogo,6,6
8,Survivor: Vanuatu,9,Chris Daugherty,130,Chris,33,South Vienna,Ohio,ENTP,15,39,18,Sole Survivor,,Lopevi,Lopevi,,Alinta,3,6
9,Survivor: Palau,10,Tom Westman,150,Tom,40,Sayville,New York,ESTJ,15,39,20,Sole Survivor,,Koror,,,Koror,0,12


In [8]:
win_counts = sole_survivor['castaway_id'].value_counts()

winners_with_multiple_wins = win_counts[win_counts > 1].index

if len(winners_with_multiple_wins) > 0:
    Q5 = sole_survivor[sole_survivor['castaway_id'].isin(winners_with_multiple_wins)].sort_values(by='season')
else:
    Q5 = None
    
Q5

Unnamed: 0,season_name,season,full_name,castaway_id,castaway,age,city,state,personality_type,episode,day,order,result,jury_status,original_tribe,swapped_tribe,swapped_tribe_2,merged_tribe,total_votes_received,immunity_idols_won
6,Survivor: Pearl Islands,7,Sandra Diaz-Twine,112,Sandra,29,Fort Lewis,Washington,ESTP,15,39,18,Sole Survivor,,Drake,Drake,,Balboa,0,3
19,Survivor: Heroes vs. Villains,20,Sandra Diaz-Twine,112,Sandra,35,Fayetteville,North Carolina,ESTP,15,39,20,Sole Survivor,,Villains,,,Yin Yang,3,4
27,Survivor: Cagayan,28,Tony Vlachos,424,Tony,39,Jersey City,New Jersey,ESTP,14,39,18,Sole Survivor,,Aparri,Solana,,Solarrion,5,6
39,Survivor: Winners at War,40,Tony Vlachos,424,Tony,45,Allendale,New Jersey,ESTP,15,39,22,Sole Survivor,,Dakal,Dakal,,Koru,0,9
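
An alternative way to find repeat winners, shown here only as a sketch on made-up rows, is `duplicated(keep=False)`, which flags every row whose id appears more than once. The `winners` frame below is a placeholder, not the real `sole_survivor` table.

```python
import pandas as pd

# Toy winners table: castaway 10 wins seasons 1 and 3.
winners = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "castaway_id": [10, 20, 10, 30],
})

# keep=False marks all occurrences of a repeated id, not just the later ones.
is_repeat = winners["castaway_id"].duplicated(keep=False)
repeat = winners[is_repeat].sort_values("season")
print(repeat["season"].tolist())  # [1, 3]
```

If no id repeats, `repeat` is simply an empty DataFrame, so no `if`/`else` branch is needed.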


In [628]:
Q6 = int(castaways['age'].mean().round())

In [9]:
castaways.sort_values(by=['season', 'castaway_id', 'day'], inplace=True)

last_results = castaways.groupby(['season', 'castaway_id', 'full_name']).last().reset_index()

total_days_played = last_results.groupby(['castaway_id', 'full_name'])['day'].sum().reset_index()

Q7 = total_days_played.sort_values(by='day', ascending=False).head(5).reset_index(drop=True)

Q7.rename(columns={'day': 'total_days_played'}, inplace=True)
Q7

Unnamed: 0,castaway_id,full_name,total_days_played
0,55,Rob Mariano,131
1,197,Parvati Shallow,130
2,201,Oscar Lusth,128
3,179,Cirie Fields,121
4,112,Sandra Diaz-Twine,110
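
The two-step grouping above (one value per contestant per season, then a sum across seasons) can be seen on a small made-up table. Note that this sketch takes the largest `day` per season rather than the last row after sorting, which is a swapped-in technique, and that the rows in `rows` are placeholders.

```python
import pandas as pd

# Toy data: castaway 7 plays in seasons 1 and 2.
rows = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "castaway_id": [7, 8, 7, 9],
    "day": [39, 12, 30, 39],
})

# Step 1: one row per (season, castaway) holding the last day reached.
per_season = rows.groupby(["season", "castaway_id"], as_index=False)["day"].max()

# Step 2: add up those days across seasons for each castaway.
totals = per_season.groupby("castaway_id")["day"].sum().sort_values(ascending=False)
print(totals.head(1))
```

Here castaway 7 ends up with 39 + 30 = 69 days and sorts to the top.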


In [10]:
castaways_with_personality = castaway_details.dropna(subset=['personality_type'])

total_with_personality = len(castaways_with_personality)

extrovert_percentage = (castaways_with_personality['personality_type'].str[0] == 'E').mean() * 100

introvert_percentage = (castaways_with_personality['personality_type'].str[0] == 'I').mean() * 100

Q8A = round(extrovert_percentage, 2)
Q8B = round(introvert_percentage, 2)

In [11]:
Q8A

53.63

In [12]:
Q8B

46.37
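
The percentages above rely on the fact that the mean of a boolean Series is the fraction of `True` values. A minimal sketch with made-up personality types:

```python
import pandas as pd

# Toy data with one missing type.
types = pd.Series(["ENFP", "ISTJ", "ESTP", None, "INFP"])

# Drop missing values first so they do not count in the denominator.
known = types.dropna()

# .str[0] takes the first letter; comparing gives True/False; mean() gives the share of True.
extrovert_pct = round((known.str[0] == "E").mean() * 100, 2)
introvert_pct = round((known.str[0] == "I").mean() * 100, 2)
print(extrovert_pct, introvert_pct)  # 50.0 50.0
```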

In [13]:
winning_castaways = castaways.loc[
    (castaways['personality_type'].notna()) & (castaways['result'] == 'Sole Survivor')
]

winners = winning_castaways['full_name'].nunique()

unique_winners = winning_castaways.drop_duplicates('full_name')

extroverted_winners = winning_castaways[winning_castaways['personality_type'].str[0] == 'E']['full_name'].nunique()

Q9A = (extroverted_winners / winners) * 100 

Q9B = 100 - Q9A  

Q9A = round(Q9A, 2)
Q9B = round(Q9B, 2)

Q9A

61.54

In [14]:
Q9B

38.46

In [15]:
contestants_with_votes = vote_history['castaway_id'].unique()
Q10 = castaway_details[~castaway_details['castaway_id'].isin(contestants_with_votes)]
Q10 = Q10.sort_values(by='castaway_id', ascending=True)

Q10

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
130,131,Jonathan Libby,Jonathan,1981-09-05,NaT,Male,,,Sales & Marketing Associate,ISTP
131,132,Wanda Shirk,Wanda,1949-08-24,NaT,Female,,,English Teacher,ENFP
205,206,Gary Stritesky,Gary,1951-09-16,NaT,Male,,,School Bus Driver,ISFJ
353,354,Kourtney Moon,Kourtney,1982-02-27,NaT,Female,,,Motorcycle Repair,ISFP
374,375,Dana Lambert,Dana,1979-12-13,NaT,Female,,,Cosmetologist,ISTP
536,537,Pat Cusack,Pat,1977-02-25,NaT,Male,,,Maintenance Manager,ESTP


In [16]:
# Count the rows in which each castaway is listed as a challenge winner.
# Assumes challenge_results has one row per (challenge, winning castaway).
challenge_wins = challenge_results.groupby('winner_id').size()

# idxmax() returns a castaway id, so look the contestant up by castaway_id.
max_challenges_contestant_id = challenge_wins.idxmax()
Q11 = castaway_details[castaway_details['castaway_id'] == max_challenges_contestant_id]
Q11

Unnamed: 0,castaway_id,full_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
200,201,Oscar Lusth,Ozzy,1981-08-23,NaT,Male,Mexican American,Hispanic or Latino,Waiter;Photographer,ISFP


In [637]:
merged_df = pd.merge(castaway_details, jury_votes, on='castaway_id')
result = merged_df.groupby(['season', 'finalist_id', 'season_name']).agg({'vote': 'sum', 'castaway_id': 'count'})
unanimous_votes = result[result['vote'] == result['castaway_id']].copy()
unanimous_votes.reset_index(inplace=True)

finalist_ids = unanimous_votes['finalist_id']
filtered_castaway_details = castaway_details[castaway_details['castaway_id'].isin(finalist_ids)][['castaway_id', 'full_name']]
final_output = unanimous_votes.merge(filtered_castaway_details, left_on='finalist_id', right_on='castaway_id', how='left')

final_output = final_output[['season', 'season_name', 'finalist_id', 'full_name']].sort_values('season').reset_index(drop=True)

final_output.rename(columns={'finalist_id': 'winner_id'}, inplace=True)

Q12 = final_output
Q12

Unnamed: 0,season,season_name,winner_id,full_name
0,14,Survivor: Fiji,221,Earl Cole
1,18,Survivor: Tocantins,281,James Thomas Jr.
2,26,Survivor: Caramoan,348,John Cochran
3,31,Survivor: Cambodia,433,Jeremy Collins
4,33,Survivor: Millennials vs. Gen X,498,Adam Klein
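
The unanimity check above compares, for each finalist, the sum of the `vote` column with the number of jurors. A sketch on made-up rows, assuming (as the cell above does) that `vote` is 1 when a juror voted for that finalist and 0 otherwise:

```python
import pandas as pd

# Toy jury table: one row per (juror, finalist) pair.
jury = pd.DataFrame({
    "season":      [1, 1, 1, 1, 2, 2, 2, 2],
    "finalist_id": [5, 5, 6, 6, 7, 7, 8, 8],
    "castaway_id": [1, 2, 1, 2, 3, 4, 3, 4],  # the juror casting the vote
    "vote":        [1, 0, 0, 1, 1, 1, 0, 0],
})

# For each finalist: votes received ("sum") and number of jurors ("count").
per_finalist = jury.groupby(["season", "finalist_id"])["vote"].agg(["sum", "count"])

# A finalist is a unanimous winner when every juror voted for them.
unanimous = per_finalist[per_finalist["sum"] == per_finalist["count"]].reset_index()
print(unanimous["finalist_id"].tolist())  # [7]
```

Season 1 is split 1 to 1, so only finalist 7 from season 2 qualifies.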


In [17]:
Q13 = castaway_details.copy()

Q13[['first_name', 'last_name']] = Q13['full_name'].str.split(n=1, expand=True)


Q13 = Q13[['castaway_id','full_name', 'first_name', 'last_name', 'short_name','date_of_birth', 'date_of_death', 'gender', 'race', 'ethnicity', 'occupation', 'personality_type']] 


Q13 = Q13.reset_index(drop=True)

Q13

Unnamed: 0,castaway_id,full_name,first_name,last_name,short_name,date_of_birth,date_of_death,gender,race,ethnicity,occupation,personality_type
0,1,Sonja Christopher,Sonja,Christopher,Sonja,1937-01-28,NaT,Female,,,Musician,ENFP
1,2,B.B. Andersen,B.B.,Andersen,B.B.,1936-01-18,2013-10-29,Male,,,Real Estate Developer,ESTJ
2,3,Stacey Stillman,Stacey,Stillman,Stacey,1972-08-11,NaT,Female,,,Attorney,ENTJ
3,4,Ramona Gray,Ramona,Gray,Ramona,1971-01-20,NaT,Female,Black,,Biochemist/Chemist,ISTJ
4,5,Dirk Been,Dirk,Been,Dirk,1976-06-15,NaT,Male,,,Dairy Farmer,ISFP
...,...,...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,Tiffany,Seely,Tiffany,1973-12-08,NaT,Female,White,Jewish,Teacher,ENTP
604,605,Sydney Segal,Sydney,Segal,Sydney,1995-07-19,NaT,Female,White,Jewish,Law Student,ESTP
605,606,Shantel Smith,Shantel,Smith,Shan,1987-03-11,NaT,Female,Black,,Pastor,ENFJ
606,607,David Voce,David,Voce,Voce,1986-05-01,NaT,Male,,,Neurosurgeon,ENTJ


In [18]:
personality_mapping = {
    'ISTJ': 1, 'ISTP': 2, 'ISFJ': 3, 'ISFP': 4,
    'INFJ': 5, 'INFP': 6, 'INTJ': 7, 'INTP': 8,
    'ESTP': 9, 'ESTJ': 10, 'ESFP': 11, 'ESFJ': 12,
    'ENFP': 13, 'ENFJ': 14, 'ENTP': 15, 'ENTJ': 16,
    None: 17
}

Q14 = castaway_details[['castaway_id', 'full_name', 'personality_type']].copy()
# Map each type to its code; a missing type may not match the None key, so fill it with 17 explicitly.
Q14['personality_type'] = Q14['personality_type'].map(personality_mapping).fillna(17).astype('Int64')

Q14.sort_values('castaway_id', inplace=True)

Q14.reset_index(drop=True, inplace=True)
Q14

Unnamed: 0,castaway_id,full_name,personality_type
0,1,Sonja Christopher,13
1,2,B.B. Andersen,10
2,3,Stacey Stillman,16
3,4,Ramona Gray,1
4,5,Dirk Been,4
...,...,...,...
603,604,Tiffany Seely,15
604,605,Sydney Segal,9
605,606,Shantel Smith,14
606,607,David Voce,16
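
When encoding with a dictionary, any value that has no matching key comes back as missing, so it helps to fill the fallback code after mapping. A small sketch with a two-entry mapping (the shortened `codes` dictionary is only for illustration):

```python
import pandas as pd

codes = {"ISTJ": 1, "ENFP": 13}  # shortened mapping for illustration
types = pd.Series(["ENFP", None, "ISTJ"])

# map() leaves unmatched or missing values as NaN; fillna() then assigns the fallback code.
encoded = types.map(codes).fillna(17).astype("Int64")
print(encoded.tolist())  # [13, 17, 1]
```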


In [19]:
Q15 = castaway_details[['castaway_id', 'full_name', 'personality_type']].copy()

dummy_columns = pd.get_dummies(Q15['personality_type'], prefix='type', drop_first=True)

Q15 = pd.concat([Q15, dummy_columns], axis=1)

Q15.drop(columns=['personality_type'], inplace=True)

Q15


Unnamed: 0,castaway_id,full_name,type_ENFP,type_ENTJ,type_ENTP,type_ESFJ,type_ESFP,type_ESTJ,type_ESTP,type_INFJ,type_INFP,type_INTJ,type_INTP,type_ISFJ,type_ISFP,type_ISTJ,type_ISTP
0,1,Sonja Christopher,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,2,B.B. Andersen,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
2,3,Stacey Stillman,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
3,4,Ramona Gray,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4,5,Dirk Been,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603,604,Tiffany Seely,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False
604,605,Sydney Segal,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False
605,606,Shantel Smith,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
606,607,David Voce,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False
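
Note that `drop_first=True` removes the first category in sorted order, which is why there is no `type_ENFJ` column above and why an ENFJ contestant shows `False` in every column. A minimal sketch of the difference on made-up values:

```python
import pandas as pd

types = pd.Series(["ENFJ", "ENTJ", "ISTP", "ENFJ"])

# Full one-hot encoding: one column per category.
full = pd.get_dummies(types, prefix="type")

# drop_first=True drops the first category (ENFJ), which becomes the implied baseline.
reduced = pd.get_dummies(types, prefix="type", drop_first=True)

print(list(full.columns))     # ['type_ENFJ', 'type_ENTJ', 'type_ISTP']
print(list(reduced.columns))  # ['type_ENTJ', 'type_ISTP']
```

In `reduced`, the ENFJ rows are all zeros/False, so a row of all `False` means "ENFJ" only if no personality types are missing.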


In [641]:
Q16 = castaway_details[['castaway_id', 'full_name', 'personality_type']].copy()
# Split the four-letter type into one 0/1 column per letter position.
Q16['interaction'] = Q16['personality_type'].str[0].map({'I': 0, 'E': 1})
Q16['information'] = Q16['personality_type'].str[1].map({'S': 0, 'N': 1})
Q16['decision'] = Q16['personality_type'].str[2].map({'T': 0, 'F': 1})
Q16['organization'] = Q16['personality_type'].str[3].map({'J': 0, 'P': 1})

# Fill only the four derived columns with 2, so a missing personality_type is not overwritten.
trait_columns = ['interaction', 'information', 'decision', 'organization']
Q16[trait_columns] = Q16[trait_columns].fillna(2).astype(int)
Q16.sort_values(by='castaway_id', inplace=True)
Q16.reset_index(drop=True, inplace=True)
Q16

Unnamed: 0,castaway_id,full_name,personality_type,interaction,information,decision,organization
0,1,Sonja Christopher,ENFP,1,1,1,1
1,2,B.B. Andersen,ESTJ,1,0,0,0
2,3,Stacey Stillman,ENTJ,1,1,0,0
3,4,Ramona Gray,ISTJ,0,0,0,0
4,5,Dirk Been,ISFP,0,0,1,1
...,...,...,...,...,...,...,...
603,604,Tiffany Seely,ENTP,1,1,0,1
604,605,Sydney Segal,ESTP,1,0,0,1
605,606,Shantel Smith,ENFJ,1,1,1,0
606,607,David Voce,ENTJ,1,1,0,0
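
The same letter-by-letter encoding can be written as a loop over a small lookup table. This is only a sketch on made-up values; the `letters` dictionary and the `out` frame are hypothetical names, and 2 is used as the code for a missing type, as in the cell above.

```python
import pandas as pd

types = pd.Series(["ENFP", "ISTJ", None])

# Column name -> (letter position, letter-to-code mapping).
letters = {
    "interaction":  (0, {"I": 0, "E": 1}),
    "information":  (1, {"S": 0, "N": 1}),
    "decision":     (2, {"T": 0, "F": 1}),
    "organization": (3, {"J": 0, "P": 1}),
}

out = pd.DataFrame({"personality_type": types})
for name, (pos, mapping) in letters.items():
    # Missing types yield NaN after map(); fill those with 2 before casting to int.
    out[name] = types.str[pos].map(mapping).fillna(2).astype(int)

print(out["interaction"].tolist())  # [1, 0, 2]
```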
