# Asking some questions

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly as plt 
import plotly.graph_objects as go
pd.options.plotting.backend = "plotly"

In [2]:
train_csv = "../data/train.csv"
test_csv = "../data/test.csv"

In [3]:
train_df = pd.read_csv(train_csv)

## Conversation with Matt

* Whether someone is cryo or non-cryo is important: cryo passengers aren't moving around, so if they were transported, their cabin gives some indication of where the anomaly intersected the ship.

* Determine if FoodCourt, ShoppingMall, Spa, VRDeck intersect the anomaly independently (might be in different areas of the ship)

* If a non-cryo passenger was transported, you can estimate their probable location at the time of the accident from their spend: if they spent the majority of their money on RoomService, they were probably in their cabin.

* Could check whether the passengers who were likely to be at each activity were transported; if they were, those activities/areas likely intersect the anomaly too.

* For NaN cabin number values, you are really assigning a risk score to each passenger, e.g. if a row has only a 10% intersection with the anomaly and the passenger's cabin number is missing, then they are likely a survivor.

* For NaN cabin number/deck/side you can assign a set of values that would be valid by determining which cabins are unoccupied.
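
A minimal sketch of the spend-as-location idea from the conversation above, on made-up toy rows rather than `train_df`. It turns each passenger's spend into per-venue shares and takes the largest share as their most likely location. Treating RoomService spend as "in the cabin" is the assumption from the notes, and `toy_df`, `shares` and `likely_location` are hypothetical names.

```python
import pandas as pd

# Made-up toy rows standing in for train_df (values are illustrative only).
toy_df = pd.DataFrame(
    {
        "PassengerId": ["a", "b", "c"],
        "RoomService": [900.0, 0.0, 0.0],
        "FoodCourt": [50.0, 300.0, 0.0],
        "ShoppingMall": [0.0, 100.0, 0.0],
        "Spa": [50.0, 0.0, 0.0],
        "VRDeck": [0.0, 600.0, 0.0],
    }
)
spend_columns = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

spend = toy_df[spend_columns].fillna(0.0)
total_spend = spend.sum(axis=1)
has_spend = total_spend > 0

# Share of each passenger's total spend per venue; rows sum to 1.
# Passengers with no spend are left out to avoid dividing by zero.
shares = spend[has_spend].div(total_spend[has_spend], axis=0)

# Assumption: RoomService spend is a proxy for time spent in the cabin.
likely_location = (
    shares.idxmax(axis=1)
    .replace({"RoomService": "Cabin"})
    .reindex(toy_df.index)  # passengers with no spend come back as NaN
)
print(likely_location)
```

Passengers with no spend at all (cryo passengers, for example) get no location from this and would need the cabin-based approach instead.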


# 1. For cryo passengers who were transported, where were they situated on the ship?

The Side (`x`), Deck (`y`) and Number (`z`) columns give us `x, y, z` coordinates for passengers. 

In [4]:
extra_cabin_cols = train_df.Cabin.str.split("/", expand=True).rename(columns={0: "Deck", 1:"Number", 2:"Side"})

train_df = train_df.assign(
    Deck=extra_cabin_cols["Deck"],
    Number=extra_cabin_cols["Number"],
    Side=extra_cabin_cols["Side"],
)

desired_column_order = [
    "Name", "PassengerId", 
    "HomePlanet", "CryoSleep", "Cabin", 
    "Deck", "Number", "Side", "Destination", 
    "Age", "VIP", "RoomService", "FoodCourt", 
    "ShoppingMall", "Spa", "VRDeck", "Transported"
    ]

train_df = train_df[desired_column_order]
print(len(train_df))
train_df.head(1)

8693


Unnamed: 0,Name,PassengerId,HomePlanet,CryoSleep,Cabin,Deck,Number,Side,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Maham Ofracculy,0001_01,Europa,False,B/0/P,B,0,P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False


In [5]:
cryo_passengers = train_df[train_df["CryoSleep"] == True]
transported_cryo_passengers = cryo_passengers[cryo_passengers["Transported"] == True]

In [6]:
transported_cryo_passengers_deck_counts = transported_cryo_passengers["Deck"].value_counts().reset_index().reset_index(names="rank")

In [7]:
transported_cryo_passengers_side_counts = transported_cryo_passengers["Side"].value_counts().reset_index().reset_index(names="rank")

In [8]:
transported_cryo_passengers_number_counts = transported_cryo_passengers["Number"].value_counts().reset_index().reset_index(names="rank")

### What location makes someone high risk of being transported?

Highest risk is Deck G, Number 176, Side S. Assign a rank to each of the `value_counts()` dfs I made before (rank 0 = the value with the most transported cryo passengers), and add these ranks to passengers in the main df as risk ratings. One risk level for each location, score 0-5? And then as we go along, the lower the sum of a passenger's risk ratings, the more likely they were to be transported.

In [9]:
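# value_counts() sorts by frequency, so rank 0 is the value with the most transported cryo passengers.
# Each dict is {rank: value}; invert it to {value: rank} before mapping onto the main df.
deck_ranking_dict = transported_cryo_passengers_deck_counts["Deck"].to_dict()
train_df["deck_risk_rating"] = train_df["Deck"].map({v:k for k,v in deck_ranking_dict.items()})

side_ranking_dict = transported_cryo_passengers_side_counts["Side"].to_dict()
train_df["side_risk_rating"] = train_df["Side"].map({v:k for k,v in side_ranking_dict.items()})

# Numbers never seen among transported cryo passengers get NaN here.
number_ranking_dict = transported_cryo_passengers_number_counts["Number"].to_dict()
train_df["number_risk_rating"] = train_df["Number"].map({v:k for k,v in number_ranking_dict.items()})

# NOTE: sum(axis=1) skips NaN, so a row with no cabin data sums to 0.0,
# the same score as the highest-risk location.
location_risk_columns = ["deck_risk_rating", "side_risk_rating", "number_risk_rating"]
train_df["sum_of_location_risk"] = train_df[location_risk_columns].sum(axis=1)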

In [10]:
train_df.sort_values(by="sum_of_location_risk")

Unnamed: 0,Name,PassengerId,HomePlanet,CryoSleep,Cabin,Deck,Number,Side,Destination,Age,...,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,deck_risk_rating,side_risk_rating,number_risk_rating,sum_of_location_risk
5967,Murie Hinetthews,6324_02,Earth,,G/1025/S,G,1025,S,55 Cancri e,44.0,...,0.0,0.0,0.0,0.0,0.0,True,0.0,0.0,,0.0
6847,Adammy Whitakers,7236_01,Earth,True,,,,,TRAPPIST-1e,16.0,...,0.0,0.0,0.0,0.0,0.0,True,,,,0.0
3671,Floyde Grahangory,3942_01,Earth,True,G/645/S,G,645,S,PSO J318.5-22,20.0,...,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,,0.0
1362,Hughan Cartez,1434_02,Earth,False,,,,,TRAPPIST-1e,0.0,...,0.0,0.0,0.0,0.0,0.0,True,,,,0.0
7708,Heersh Nutty,8225_01,Mars,False,,,,,TRAPPIST-1e,15.0,...,85.0,0.0,1130.0,0.0,248.0,False,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3798,Milda Mirandry,4055_01,Earth,False,F/836/P,F,836,P,TRAPPIST-1e,21.0,...,3.0,34.0,600.0,0.0,8.0,False,1.0,1.0,1134.0,1136.0
8684,Chelsa Bullisey,9274_01,,True,G/1508/P,G,1508,P,TRAPPIST-1e,23.0,...,0.0,0.0,0.0,0.0,0.0,True,0.0,1.0,1136.0,1137.0
4604,Arla Moodman,4903_01,Earth,False,F/995/P,F,995,P,55 Cancri e,23.0,...,62.0,0.0,0.0,1.0,2203.0,False,1.0,1.0,1135.0,1137.0
7378,Chrisa Reenez,7890_01,Earth,False,F/1508/S,F,1508,S,TRAPPIST-1e,58.0,...,0.0,268.0,0.0,109.0,594.0,False,1.0,0.0,1136.0,1137.0
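
The ranking above uses raw counts of transported cryo passengers, so a deck with lots of passengers can rank as high risk just because it is crowded. A possible alternative (not what the cells above do) is to rank by transported rate per deck. A rough sketch on made-up toy rows; `toy_cryo` and `deck_summary` are hypothetical names.

```python
import pandas as pd

# Made-up toy rows standing in for the cryo-sleep subset of train_df.
toy_cryo = pd.DataFrame(
    {
        "Deck": ["G"] * 6 + ["F"] * 2 + [None],
        "Transported": [True, True, True, False, False, False, True, True, True],
    }
)

deck_summary = (
    toy_cryo.dropna(subset=["Deck"])  # rows without a deck can't be placed
    .groupby("Deck")["Transported"]
    .agg(transported_count="sum", passenger_count="size", transported_rate="mean")
    .sort_values("transported_rate", ascending=False)
)
print(deck_summary)
```

In the toy data deck G has the most transported passengers but only half of its passengers were transported, while deck F has fewer transported passengers but a higher rate, so the two rankings disagree.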


## 1.b. How best to handle NaNs for cabin information?

* Depending on what information is available, e.g. if there is partial information for Deck/Number/Side, could assign a risk score based on what remains.

* Could work out which Cabins are unoccupied, and fill in passengers depending on what cabin information they have available. 
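
A small sketch of the first bullet, assuming the three `*_risk_rating` columns already exist. It uses `min_count` on `DataFrame.sum` so that a row with no cabin information comes out as NaN rather than 0.0, and keeps a count of how many ratings were actually available. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Made-up ratings: full info, partial info, and no cabin info at all.
toy_ratings = pd.DataFrame(
    {
        "deck_risk_rating": [0.0, 0.0, np.nan],
        "side_risk_rating": [0.0, 1.0, np.nan],
        "number_risk_rating": [0.0, np.nan, np.nan],
    }
)
location_risk_columns = list(toy_ratings.columns)

# Plain sum skips NaN: the row with no cabin data scores 0.0 like the riskiest cabin.
naive_sum = toy_ratings[location_risk_columns].sum(axis=1)

# min_count=1 gives NaN unless at least one rating is present.
known_sum = toy_ratings[location_risk_columns].sum(axis=1, min_count=1)
known_count = toy_ratings[location_risk_columns].notna().sum(axis=1)

# Mean over the ratings that are present, so partial rows aren't scored as lower risk
# just because a component is missing.
mean_known_risk = known_sum / known_count
print(pd.DataFrame({"naive": naive_sum, "known": known_sum, "n_known": known_count}))
```

The deck, side and number ranks are on very different scales, so averaging them as-is is only a starting point.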


In [11]:
cabin_coordinate_columns = ["Cabin", "Deck",  "Number", "Side"]
nan_check = train_df[cabin_coordinate_columns].isna().any(axis=1)
print(nan_check.sum())

199


In [28]:
train_df["cabin_coordinate_nan_check"] = nan_check

In [31]:
train_df[(train_df["cabin_coordinate_nan_check"]==True) & (train_df["Transported"] == True)]

Unnamed: 0,Name,PassengerId,HomePlanet,CryoSleep,Cabin,Deck,Number,Side,Destination,Age,...,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,deck_risk_rating,side_risk_rating,number_risk_rating,sum_of_location_risk,cabin_coordinate_nan_check
93,Book Trad,0101_01,Mars,True,,,,,TRAPPIST-1e,31.0,...,0.0,0.0,0.0,0.0,True,,,,0.0,True
227,Froos Sad,0244_01,Mars,True,,,,,TRAPPIST-1e,43.0,...,0.0,0.0,0.0,0.0,True,,,,0.0,True
260,Tetra Bootty,0287_01,Europa,True,,,,,55 Cancri e,39.0,...,0.0,0.0,0.0,0.0,True,,,,0.0,True
295,Jasony Ocherman,0327_01,Earth,False,,,,,TRAPPIST-1e,19.0,...,0.0,784.0,0.0,2.0,True,,,,0.0,True
314,Weet Mane,0348_02,Mars,,,,,,TRAPPIST-1e,36.0,...,0.0,1865.0,0.0,0.0,True,,,,0.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7847,Carlen Valezaley,8375_01,Earth,False,,,,,PSO J318.5-22,15.0,...,0.0,0.0,0.0,640.0,True,,,,0.0,True
8043,Mera Netshaless,8605_02,Europa,True,,,,,TRAPPIST-1e,28.0,...,0.0,0.0,0.0,0.0,True,,,,0.0,True
8110,Donnie Hurchrisong,8663_02,Earth,True,,,,,55 Cancri e,40.0,...,0.0,0.0,0.0,0.0,True,,,,0.0,True
8485,Bath Brakeng,9069_03,Europa,True,,,,,55 Cancri e,25.0,...,0.0,0.0,0.0,0.0,True,,,,0.0,True


In [20]:
train_df[nan_check]["sum_of_location_risk"].value_counts()

sum_of_location_risk
0.0    199
Name: count, dtype: int64

In [21]:
train_df[~nan_check]["sum_of_location_risk"].value_counts()

sum_of_location_risk
1.0      716
2.0      496
0.0      263
4.0       89
5.0       88
        ... 
986.0      1
426.0      1
590.0      1
379.0      1
641.0      1
Name: count, Length: 1127, dtype: int64

In [26]:
train_df[~nan_check].sort_values(by="sum_of_location_risk")

Unnamed: 0,Name,PassengerId,HomePlanet,CryoSleep,Cabin,Deck,Number,Side,Destination,Age,...,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,deck_risk_rating,side_risk_rating,number_risk_rating,sum_of_location_risk
6872,Jaimer Pacerty,7273_01,Earth,False,G/1183/S,G,1183,S,55 Cancri e,36.0,...,207.0,0.0,450.0,94.0,0.0,True,0.0,0.0,,0.0
6194,Stanya Kellyons,6548_01,Earth,False,G/1065/S,G,1065,S,TRAPPIST-1e,38.0,...,395.0,1035.0,0.0,0.0,4.0,True,0.0,0.0,,0.0
3178,Garry Oconley,3426_01,Earth,False,G/549/S,G,549,S,TRAPPIST-1e,33.0,...,22.0,37.0,415.0,0.0,333.0,False,0.0,0.0,,0.0
7175,Moniey Belley,7656_01,Earth,True,G/1244/S,G,1244,S,TRAPPIST-1e,16.0,...,0.0,0.0,0.0,0.0,0.0,False,0.0,0.0,,0.0
2319,Karlie Beckerson,2502_01,Earth,False,G/404/S,G,404,S,TRAPPIST-1e,24.0,...,873.0,12.0,1.0,0.0,3.0,False,0.0,0.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4866,Tostex Khaf,5192_01,Mars,True,F/995/S,F,995,S,TRAPPIST-1e,27.0,...,0.0,0.0,0.0,0.0,0.0,True,1.0,0.0,1135.0,1136.0
8684,Chelsa Bullisey,9274_01,,True,G/1508/P,G,1508,P,TRAPPIST-1e,23.0,...,0.0,0.0,0.0,0.0,0.0,True,0.0,1.0,1136.0,1137.0
4604,Arla Moodman,4903_01,Earth,False,F/995/P,F,995,P,55 Cancri e,23.0,...,62.0,0.0,0.0,1.0,2203.0,False,1.0,1.0,1135.0,1137.0
7378,Chrisa Reenez,7890_01,Earth,False,F/1508/S,F,1508,S,TRAPPIST-1e,58.0,...,0.0,268.0,0.0,109.0,594.0,False,1.0,0.0,1136.0,1137.0


**NEXT STEPS**

TODO: How should this be plotted or represented? It would be cool to have x,y,z scatter plots in 3D representing the spaces that cryo-sleep passengers were more likely to be transported from.

TODO: Look into how RoomService affects whether non-cryo passengers were more likely to be transported if their rooms were in the lower-scoring areas.
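
For the 3D scatter plot TODO, one possible starting point with `plotly.express.scatter_3d`, using the Side/Deck/Number as `x, y, z` mapping from section 1. The numeric encodings (P=0, S=1, decks in alphabetical order) and the helper names `to_plot_coordinates` and `plot_cabins_3d` are assumptions for illustration. `Number` is a string after the `str.split`, so it is converted first. The toy rows are made up.

```python
import pandas as pd

# Made-up toy rows standing in for cryo_passengers.
toy_cryo = pd.DataFrame(
    {
        "Deck": ["G", "G", "F", "B"],
        "Number": ["12", "400", "12", "3"],
        "Side": ["S", "P", "S", "P"],
        "Transported": [True, False, True, True],
    }
)

def to_plot_coordinates(df: pd.DataFrame) -> pd.DataFrame:
    """Add numeric x (Side), y (Deck), z (Number) columns and drop rows missing any."""
    deck_order = sorted(df["Deck"].dropna().unique())
    return df.assign(
        x=df["Side"].map({"P": 0, "S": 1}),
        y=df["Deck"].map({deck: i for i, deck in enumerate(deck_order)}),
        z=pd.to_numeric(df["Number"], errors="coerce"),
    ).dropna(subset=["x", "y", "z"])

def plot_cabins_3d(df: pd.DataFrame):
    """Build a 3D scatter of cabins coloured by Transported."""
    import plotly.express as px  # imported here so the coordinate step runs without plotly

    return px.scatter_3d(
        to_plot_coordinates(df),
        x="x", y="y", z="z",
        color="Transported",
        hover_data=["Deck", "Number", "Side"],
    )

plot_df = to_plot_coordinates(toy_cryo)
# fig = plot_cabins_3d(toy_cryo); fig.show()
```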

# 2. Do the on-ship activities seem to interact with the anomaly independently? 

For example, if someone is a big spender at the FoodCourt, and their cabin doesn't intersect with the anomaly, but they disappeared, maybe the FoodCourt itself intersects with the anomaly?
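
One rough way to start on this question: for each activity, compare the transported rate of awake passengers who spent anything there against those who didn't. This is a sketch on made-up toy rows, it ignores cabin location for now, and `toy_awake` and `activity_rates` are hypothetical names.

```python
import pandas as pd

# Made-up toy rows standing in for awake (non-cryo) passengers.
toy_awake = pd.DataFrame(
    {
        "FoodCourt": [500.0, 0.0, 300.0, 0.0],
        "ShoppingMall": [0.0, 20.0, 0.0, 0.0],
        "Spa": [0.0, 0.0, 10.0, 700.0],
        "VRDeck": [0.0, 0.0, 0.0, 0.0],
        "Transported": [True, False, True, False],
    }
)
activities_columns = ["FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

rows = []
for activity in activities_columns:
    # Treat missing spend as zero for this rough split.
    spent = toy_awake[activity].fillna(0) > 0
    rows.append(
        {
            "activity": activity,
            "n_spenders": int(spent.sum()),
            "rate_spenders": toy_awake.loc[spent, "Transported"].mean(),
            "rate_non_spenders": toy_awake.loc[~spent, "Transported"].mean(),
        }
    )

activity_rates = pd.DataFrame(rows).set_index("activity")
print(activity_rates)
```

A large gap between the two rates for an activity would be a hint worth following up, ideally after restricting to passengers whose cabins look low risk.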

# 3. For non-cryo passengers who were transported, where did they spend the most money?

Some of the non-cryo passengers could have also been in their cabins during the anomaly, but maybe there's also an intersection between the anomaly and the activities on board?

In [49]:
activities_columns = ["FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

In [48]:
train_df.columns

Index(['Name', 'PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Deck',
       'Number', 'Side', 'Destination', 'Age', 'VIP', 'RoomService',
       'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Transported',
       'deck_risk_rating', 'side_risk_rating', 'number_risk_rating',
       'sum_of_location_risk', 'cabin_coordinate_nan_check'],
      dtype='object')

In [37]:
awake_passengers = train_df[train_df["CryoSleep"]==False]

In [38]:
unsure_consciousness_passengers = train_df[train_df["CryoSleep"].isna()]

In [40]:
transported_awake_passengers = awake_passengers[awake_passengers["Transported"]==True]

In [47]:
transported_awake_passengers

Unnamed: 0,Name,PassengerId,HomePlanet,CryoSleep,Cabin,Deck,Number,Side,Destination,Age,...,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,deck_risk_rating,side_risk_rating,number_risk_rating,sum_of_location_risk,cabin_coordinate_nan_check
1,Juanna Vines,0002_01,Earth,False,F/0/S,F,0,S,TRAPPIST-1e,24.0,...,9.0,25.0,549.0,44.0,True,1.0,0.0,450.0,451.0,False
4,Willy Santantines,0004_01,Earth,False,F/1/S,F,1,S,TRAPPIST-1e,16.0,...,70.0,151.0,565.0,2.0,True,1.0,0.0,117.0,118.0,False
5,Sandie Hinetthews,0005_01,Earth,False,F/0/P,F,0,P,PSO J318.5-22,44.0,...,483.0,0.0,291.0,0.0,True,1.0,1.0,450.0,452.0,False
6,Billex Jacostaffey,0006_01,Earth,False,F/2/S,F,2,S,TRAPPIST-1e,26.0,...,1539.0,3.0,0.0,0.0,True,1.0,0.0,562.0,563.0,False
8,Andona Beston,0007_01,Earth,False,F/3/S,F,3,S,TRAPPIST-1e,35.0,...,785.0,17.0,216.0,0.0,True,1.0,0.0,12.0,13.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8669,Alchium Stranbeate,9252_01,Europa,False,B/301/P,B,301,P,55 Cancri e,26.0,...,8160.0,205.0,0.0,438.0,True,2.0,1.0,201.0,204.0,False
8682,Ireene Simson,9272_01,Earth,False,G/1507/P,G,1507,P,TRAPPIST-1e,26.0,...,242.0,510.0,0.0,0.0,True,0.0,1.0,,1.0,False
8685,Polaton Conable,9275_01,Europa,False,A/97/P,A,97,P,TRAPPIST-1e,0.0,...,0.0,0.0,0.0,0.0,True,6.0,1.0,91.0,98.0,False
8690,Fayey Connon,9279_01,Earth,False,G/1500/S,G,1500,S,TRAPPIST-1e,26.0,...,0.0,1872.0,1.0,0.0,True,0.0,0.0,,0.0,False
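
A sketch of how the question in the heading could be answered from a frame like `transported_awake_passengers`: find each passenger's top spend category and count how often each category comes out on top. RoomService is included here as a stand-in for the cabin, which is an assumption from the earlier notes, and the toy rows are made up.

```python
import pandas as pd

# Made-up toy rows standing in for transported_awake_passengers.
toy_transported_awake = pd.DataFrame(
    {
        "RoomService": [0.0, 700.0, 0.0, 0.0],
        "FoodCourt": [9.0, 0.0, 1539.0, 0.0],
        "ShoppingMall": [25.0, 0.0, 3.0, 0.0],
        "Spa": [549.0, 10.0, 0.0, 0.0],
        "VRDeck": [44.0, 0.0, 0.0, 0.0],
    }
)
spend_columns = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

spend = toy_transported_awake[spend_columns].fillna(0.0)
has_spend = spend.sum(axis=1) > 0

# Category with the largest spend per passenger (passengers with no spend are skipped).
top_category = spend[has_spend].idxmax(axis=1)
top_category_counts = top_category.value_counts()

# Total spend per category across all transported awake passengers.
total_by_category = spend.sum().sort_values(ascending=False)
print(top_category_counts)
print(total_by_category)
```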


# 4. Are there other interactions between the variables?

* e.g. do children/young people end up Transported more than expected, and are they spending more time at the VR deck?
* e.g. do VIPs get Transported more than expected, and is there a VIP area/lounge that intersects with the anomaly?
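
A first pass at these could be a simple grouped transported rate by age band and by VIP status, compared against the overall rate. The age bins below are arbitrary assumptions, and the toy rows are made up.

```python
import pandas as pd

# Made-up toy rows standing in for train_df.
toy_df = pd.DataFrame(
    {
        "Age": [5.0, 12.0, 30.0, 45.0, 70.0, None],
        "VIP": [False, False, True, False, True, False],
        "Transported": [True, True, False, True, False, True],
    }
)

# Assumed age bands; left-closed so 13 falls in "13-17".
age_bins = [0, 13, 18, 40, 65, 120]
age_labels = ["0-12", "13-17", "18-39", "40-64", "65+"]
toy_df["age_band"] = pd.cut(toy_df["Age"], bins=age_bins, labels=age_labels, right=False)

overall_rate = toy_df["Transported"].mean()

# Rows with a missing Age are dropped by the groupby.
rate_by_age_band = toy_df.groupby("age_band", observed=True)["Transported"].agg(["mean", "size"])
rate_by_vip = toy_df.groupby("VIP")["Transported"].agg(["mean", "size"])

print(overall_rate)
print(rate_by_age_band)
print(rate_by_vip)
```

The `size` column matters as much as the rate: a band with only a handful of passengers can look extreme by chance.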