# Part 1. Data Analysis (2.5 p.)

`train.csv` - Personal records for about 8700 of the passengers, to be used as training data.
* `PassengerId` - A unique Id for each passenger.
* `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
* `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* `Cabin` - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* `Destination` - The planet where the passenger will disembark.
* `Age` - The age of the passenger.
* `VIP` - Whether the passenger has paid for special VIP service during the voyage.
* `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* `Name` - The first and last names of the passenger.
* `Transported` - Whether the passenger was transported to another dimension. **This is the target, the column you are trying to predict!**

In [135]:
# import libraries: pandas, matplotlib.pyplot, numpy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt


In [136]:
# read the 'train.csv'
df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


### 1.1 Which cabin type (`Cabin`) is the most popular among passengers? (0.5 p)

In [137]:
# value counts
df["Cabin"].value_counts()
# Answer: G/734/S (8 passengers)

G/734/S     8
G/1368/P    7
E/13/S      7
G/981/S     7
C/137/S     7
           ..
E/107/P     1
E/496/S     1
F/1727/S    1
F/1137/S    1
A/5/P       1
Name: Cabin, Length: 6560, dtype: int64
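
The counts above are per exact cabin. Because `Cabin` has the form deck/num/side, "cabin type" could also be read as the deck or the side. The following is a minimal sketch on made-up rows, not part of the original solution; the names `cabins` and `parts` are illustrative.

```python
import pandas as pd

# Toy cabin values in the deck/num/side format (made up for illustration)
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S", None])

# Split into three columns; a missing cabin stays missing in every part
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]

# value_counts() ignores missing values by default
deck_counts = parts["deck"].value_counts()
side_counts = parts["side"].value_counts()
print(deck_counts)
print(side_counts)
```

On the real data the same split could be applied to `df["Cabin"]` before calling `value_counts()`.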

### 1.2 For each `Destination`, how many passengers have VIP status? (0.5 p)

In [138]:
# groupby
df.groupby("Destination")["VIP"].sum()


Destination
55 Cancri e       65
PSO J318.5-22     18
TRAPPIST-1e      114
Name: VIP, dtype: int64

### 1.3 What is the average amount spent on the `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, and `VRDeck`? Use `mean()` to find out. (0.5 p)

In [139]:
# mean
rs = df['RoomService'].mean()
fc = df['FoodCourt'].mean()
sm = df['ShoppingMall'].mean()
sp = df['Spa'].mean()
vd = df['VRDeck'].mean()
print(rs)
print(fc)
print(sm)
print(sp)
print(vd)

224.687617481203
458.07720329024676
173.72916912197996
311.1387779083431
304.8547912992357
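
The five separate calls above can also be collapsed into one by selecting the columns as a list. A small sketch on made-up numbers (the `toy` frame is an assumption for illustration, only the column names come from the dataset):

```python
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy data (made up); None marks a missing bill
toy = pd.DataFrame({
    "RoomService": [0.0, 109.0, 43.0, None],
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 371.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
    "VRDeck": [0.0, 44.0, 49.0, 193.0],
})

# One call returns a Series of per-column means; missing values are skipped
means = toy[spend_cols].mean()
print(means)
```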


### 1.4 How many passengers were transported (`Transported` column)? Calculate both the count and the proportion of transported vs. not transported. (0.5 p)

In [140]:
# use value_counts(normalize=True) to calculate percentage
df["Transported"].value_counts(normalize=True)

True     0.503624
False    0.496376
Name: Transported, dtype: float64
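
The question asks for both the count and the proportion, while the cell above reports only the proportion. One possible way to show both side by side, sketched on a made-up Series (the `summary` name is illustrative):

```python
import pandas as pd

# Toy target column (made up)
transported = pd.Series(
    [False, True, False, False, True, True, True], name="Transported"
)

counts = transported.value_counts()                # absolute counts
shares = transported.value_counts(normalize=True)  # fractions that sum to 1

# Both Series share the same index, so they align into one table
summary = pd.DataFrame({"count": counts, "proportion": shares})
print(summary)
```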

### 1.5 Is there a difference in the average age of passengers who were transported versus those who were not? Compare using `groupby()` and `mean()` on the `Transported` column. (0.5 p)

In [217]:
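# NOTE: this groups by Age (share transported at each age); the task asks for df.groupby("Transported")["Age"].mean()
df.groupby("Age")['Transported'].mean()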


Age
0.0     0.808989
1.0     0.731343
2.0     0.706667
3.0     0.786667
4.0     0.746479
          ...   
75.0    0.500000
76.0    0.500000
77.0    0.500000
78.0    0.333333
79.0    0.000000
Name: Transported, Length: 80, dtype: float64
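
The output above is the share of transported passengers at each age. To compare the average age of the two groups, as the question asks, the grouping key and the aggregated column need to be swapped. A minimal sketch on made-up rows (the real numbers are not reproduced here):

```python
import pandas as pd

# Toy data (made up); None marks a missing age
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0, None],
    "Transported": [False, True, False, False, True, True],
})

# Group by the target, then average Age within each group (missing ages are skipped)
mean_age = toy.groupby("Transported")["Age"].mean()
print(mean_age)
```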

---
# Part 2. Machine Learning (7.5 p.)

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

* `Transported` - Whether the passenger was transported to another dimension. **This is the target, the column you are trying to predict!**

## 2.1. Data Preparing (3 p)

#### 2.1.1. Delete unnecessary columns (0.5 p.)

In [142]:
# For the purpose of the exam let's delete 'PassengerId', 'Name', 'Cabin' columns
df = df.drop(columns=['PassengerId','Name','Cabin'])
df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


#### 2.1.2. Fill in the missing values (1.5 p.)

```
HomePlanet (string)
Destination (string)

Age (numeric)
RoomService (numeric)
FoodCourt (numeric)
ShoppingMall (numeric)
Spa (numeric)
VRDeck (numeric)

VIP (bool)
CryoSleep (bool)
```

HINT: Use `df['column'].value_counts()` to look at the values and pick the most common value

In [154]:
# check the missing values
# isnull
# (this cell was last run as In [154], after the fills below, so every count is 0)
df.isnull().sum()

HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Transported     0
dtype: int64

In [144]:
# hint: df[''].fillna(value = )
# assign back: inplace=True on a selected column is not reliable in recent pandas
df['HomePlanet'] = df['HomePlanet'].fillna('Earth')
df['HomePlanet'].value_counts()

Earth     4803
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64

In [145]:
df['CryoSleep'] = df['CryoSleep'].fillna(False)
df['CryoSleep'].value_counts()

False    5656
True     3037
Name: CryoSleep, dtype: int64

In [146]:
df['Destination'] = df['Destination'].fillna('TRAPPIST-1e')
df['Destination'].value_counts()

TRAPPIST-1e      6097
55 Cancri e      1800
PSO J318.5-22     796
Name: Destination, dtype: int64

In [147]:
df['Age'] = df['Age'].fillna(24.0)
df['Age'].value_counts()

24.0    503
18.0    320
21.0    311
19.0    293
23.0    292
       ... 
75.0      4
79.0      3
78.0      3
77.0      2
76.0      2
Name: Age, Length: 80, dtype: int64

In [148]:
df['VIP'] = df['VIP'].fillna(False)
df['VIP'].value_counts()

False    8494
True      199
Name: VIP, dtype: int64

In [149]:
df['RoomService'] = df['RoomService'].fillna(0.0)
df['RoomService'].value_counts()

0.0       5758
1.0        117
2.0         79
3.0         61
4.0         47
          ... 
1230.0       1
987.0        1
930.0        1
3097.0       1
1186.0       1
Name: RoomService, Length: 1273, dtype: int64

In [150]:
df['FoodCourt'] = df['FoodCourt'].fillna(0.0)

df['FoodCourt'].value_counts()

0.0       5639
1.0        116
2.0         75
4.0         53
3.0         53
          ... 
3206.0       1
3879.0       1
734.0        1
4076.0       1
2325.0       1
Name: FoodCourt, Length: 1507, dtype: int64

In [151]:
df['ShoppingMall'] = df['ShoppingMall'].fillna(0.0)
df['ShoppingMall'].value_counts()

0.0       5795
1.0        153
2.0         80
3.0         59
4.0         45
          ... 
2454.0       1
1770.0       1
871.0        1
9058.0       1
1031.0       1
Name: ShoppingMall, Length: 1115, dtype: int64

In [152]:
df['Spa'] = df['Spa'].fillna(0.0)
df['Spa'].value_counts()

0.0       5507
1.0        146
2.0        105
5.0         53
3.0         53
          ... 
1104.0       1
892.0        1
1559.0       1
777.0        1
2234.0       1
Name: Spa, Length: 1327, dtype: int64

In [153]:
df['VRDeck'] = df['VRDeck'].fillna(0.0)
df['VRDeck'].value_counts()

0.0        5683
1.0         139
2.0          70
3.0          56
5.0          51
           ... 
8040.0        1
1920.0        1
5913.0        1
11213.0       1
1543.0        1
Name: VRDeck, Length: 1306, dtype: int64
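
The fill values in this section were read off `value_counts()` and typed in by hand. As an alternative, the most common value of each column can be computed with `mode()` and passed to a single `fillna` call. This is only a sketch on a made-up frame; `toy`, `fill_values`, and `filled` are illustrative names.

```python
import pandas as pd

# Toy data (made up) with one missing value per column
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", None, "Earth"],
    "Age": [24.0, None, 58.0, 24.0],
    "Spa": [0.0, 549.0, None, 0.0],
})

# Most common value of each column, computed rather than typed in
fill_values = {col: toy[col].mode().iloc[0] for col in toy.columns}

# fillna with a dict fills each column with its own value and returns a new frame
filled = toy.fillna(value=fill_values)
print(fill_values)
print(filled.isnull().sum())
```

Keeping `fill_values` around also makes it straightforward to apply exactly the same values to the test data later.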

#### 2.1.3. Convert categorical variables (`HomePlanet` and `Destination`) into dummy variables (0.5 p.)

In [155]:
df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


In [None]:
# hint: pd.get_dummies(df, columns=[])
df = pd.get_dummies(df, columns=['HomePlanet','Destination'])

In [162]:
df

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,0,1,0,0,0,1
1,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,1,0,0,0,0,1
2,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,0,1,0,0,0,1
3,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,0,1,0,0,0,1
4,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,0,1,0,1,0,0
8689,True,18.0,False,0.0,0.0,0.0,0.0,0.0,False,1,0,0,0,1,0
8690,False,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,1,0,0,0,0,1
8691,False,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,0,1,0,1,0,0

#### 2.1.4. Split the data into X and y (0.25 p.)

In [163]:
X = df.copy()
y = X.pop('Transported')

#### 2.1.5. Split the data into train and test (0.25 p.)

In [164]:
# test = 20%
from sklearn.model_selection import train_test_split as tts
X_train,X_test,y_train,y_test=tts(X,y, test_size=0.2)
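
The split above is random on every run, so the scores reported later will vary between executions. If reproducibility matters, `train_test_split` accepts `random_state`, and `stratify` keeps the class balance equal in both parts. A sketch on synthetic arrays (the seed 42 and the toy data are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and a balanced boolean target (made up)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([True, False] * 10)

# Fixed seed gives the same split every run; stratify preserves the 50/50 balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape, y_te.mean())
```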

## 2.2. kNN (1 p)

#### 2.2.1. Fit a kNN Model (sklearn) on train data (0.5 p.)

In [165]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

KNeighborsClassifier()

#### 2.2.2. Show the score on train and test data (0.5 p.)

In [167]:
# score on train data
y_train_pred = knn.predict(X_train)
knn.score(X_train, y_train)

0.8219729651998849

In [168]:
# score on test data
y_test_pred = knn.predict(X_test)
knn.score(X_test, y_test)

0.772857964347326
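
kNN classifies by distance, so features with large ranges (here the spending columns) can outweigh small-range ones such as `Age`. A common remedy is to standardize the features inside a pipeline. The sketch below uses synthetic data only and makes no claim about how the scores above would change; all names in it are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: one small-range informative feature, one large-range noise feature
informative = rng.normal(size=200)
noise = rng.normal(scale=1000.0, size=200)
X_toy = np.column_stack([informative, noise])
y_toy = informative > 0

# The scaler is fitted on the training rows only and reused at predict time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_toy[:150], y_toy[:150])

test_score = scaled_knn.score(X_toy[150:], y_toy[150:])
print(test_score)
```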

## 2.3. Logistic Regression (1 p)

#### 2.3.1. Fit a Logistic Regression Model on train data (0.5 p.)

In [173]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
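
The warning above means the solver stopped before converging. It suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch combining both on synthetic data (the value 1000 and the toy features are assumptions, not values taken from this notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, loosely like Age and a spending column
X_toy = np.column_stack([rng.normal(30, 10, 300), rng.exponential(500.0, 300)])
y_toy = X_toy[:, 1] < 300

# Standardize first, then fit with a higher iteration limit
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_toy, y_toy)

train_score = clf.score(X_toy, y_toy)
print(train_score)
```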

#### 2.3.2. Show the score on train and test data (0.5 p.)

In [174]:
# score on train data
model.score(X_train,y_train)

0.7798389416163359

In [176]:
# score on test data
model.score(X_test,y_test)

0.7872340425531915

## 2.4. New Test Data (2.5 p)

#### 2.4.1. Prepare real test data exactly in the same way as training data. (1.5 p)

In [196]:
# read the 'test.csv'
df2=pd.read_csv('test.csv')
df2

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4272,9266_02,Earth,True,G/1496/S,TRAPPIST-1e,34.0,False,0.0,0.0,0.0,0.0,0.0,Jeron Peter
4273,9269_01,Earth,False,,TRAPPIST-1e,42.0,False,0.0,847.0,17.0,10.0,144.0,Matty Scheron
4274,9271_01,Mars,True,D/296/P,55 Cancri e,,False,0.0,0.0,0.0,0.0,0.0,Jayrin Pore
4275,9273_01,Europa,False,D/297/P,,,False,0.0,2680.0,0.0,0.0,523.0,Kitakan Conale


In [197]:
# Prepare real test data exactly in the same way as training data
df2 = df2.drop(columns=['PassengerId','Name','Cabin'])
df2['HomePlanet'] = df2['HomePlanet'].fillna('Earth')
df2['CryoSleep'] = df2['CryoSleep'].fillna(False)
df2['Destination'] = df2['Destination'].fillna('TRAPPIST-1e')
df2['Age'] = df2['Age'].fillna(24.0)
df2['VIP'] = df2['VIP'].fillna(False)
df2['FoodCourt'] = df2['FoodCourt'].fillna(0.0)
df2['ShoppingMall'] = df2['ShoppingMall'].fillna(0.0)
df2['RoomService'] = df2['RoomService'].fillna(0.0)
df2['Spa'] = df2['Spa'].fillna(0.0)
df2['VRDeck'] = df2['VRDeck'].fillna(0.0)


In [198]:
df2 = pd.get_dummies(df2, columns=['HomePlanet','Destination'])
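
`pd.get_dummies` only creates columns for categories that actually occur, so a test file that lacks a category would end up with fewer columns than the training data, or with columns in a different order. One defensive option is to reindex the test frame to the training columns. A sketch on made-up frames (`train_toy` and `test_toy` are illustrative):

```python
import pandas as pd

# Toy frames (made up): the test data never contains Europa or Mars
train_toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars"], "Age": [24.0, 39.0, 18.0]})
test_toy = pd.DataFrame({"HomePlanet": ["Earth", "Earth"], "Age": [27.0, 19.0]})

train_d = pd.get_dummies(train_toy, columns=["HomePlanet"])
test_d = pd.get_dummies(test_toy, columns=["HomePlanet"])

# Reindex to the training columns; categories absent from the test data become all-zero
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```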

In [199]:
# note: X_test now holds the real test data and replaces the 20% hold-out split from 2.1.5
X_test = df2.copy()
y_pred = model.predict(X_test)

In [200]:
print(y_pred)

[ True False  True ...  True  True  True]


#### 2.4.2. Make predictions and save as `csv` file (1 p)

```python
# make predictions
y_pred = model.predict(df_test)

# create dataframe from predictions array
df_predictions = pd.DataFrame({
    'my_prediction': y_pred
})


# save dataframe as csv file
df_predictions.to_csv('studentname_predictions.csv', index=False)
```

In [203]:
# make predictions
y_pred = model.predict(X_test)

# create dataframe from predictions array
df_predictions = pd.DataFrame({
    'my_prediction': y_pred
})


# save dataframe as csv file
df_predictions.to_csv('Almaz Akzholtoev.csv', index=False)

---
#### Congratulations! 

You've done a great job!

<!-- Prepared by Atalov S. -->



<div>
    <img src="https://media.tenor.com/4d4VLoLuZNkAAAAM/hooray-letsgo.gif"/>
</div>
