# LAB | Intro to Machine Learning

**Load the data**

In this challenge, we will be working with Spaceship Titanic data. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

### load data

In [None]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

In [None]:
spaceship

### inspect data

**Check the shape of your data**

In [None]:
#your code here
print(spaceship.shape)    # (number of rows, number of columns)
print(spaceship.columns)

In [None]:
spaceship.keys()

**Check for data types**

In [None]:
spaceship.info()  # info() prints its report itself; a shorter alternative is print(spaceship.dtypes)

**Check for missing values**

In [None]:
spaceship.duplicated().sum()

In [None]:
#your code here
spaceship.isna().sum()

### clean data

There are multiple strategies to handle missing data:

- Removing all rows or all columns containing missing data.
- Filling all missing values with a value (for example, the mean for continuous columns or the mode for categorical ones).
- Filling all missing values with an algorithm.
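
The first two strategies can be sketched on a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the spaceship data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "HomePlanet": ["Earth", "Earth", None],
})

# Strategy 1: drop every row that contains a missing value
dropped = toy.dropna()

# Strategy 2: fill with the mean (continuous) or the mode (categorical)
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["HomePlanet"] = filled["HomePlanet"].fillna(filled["HomePlanet"].mode()[0])

print(len(dropped))
print(filled)
```

Dropping is the simplest option but discards whole rows; filling keeps every row at the cost of inventing values.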

For this exercise, because we have so few null values, we will drop every row that contains a missing value.

In [None]:
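#your code here
# .copy() makes the result an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning
df_spaceship_cleaned = spaceship.dropna().copy()
df_spaceship_cleaned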

In [None]:
df_spaceship_cleaned['Transported'] = df_spaceship_cleaned['Transported'].astype(int)

display(df_spaceship_cleaned)

### quick eda

In [None]:
df_space_numerical =  df_spaceship_cleaned.select_dtypes(include=['number'])
df_space_numerical

In [None]:
df_space_categorical = df_spaceship_cleaned.select_dtypes(include=['object', 'category'])
df_space_categorical

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#your code here
# Plot Age against every numerical column
sns.pairplot(df_space_numerical, y_vars=['Age'])
plt.show()

In [None]:
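# pairplot needs numerical columns to draw, so plot the full cleaned frame
# and use the categorical column "HomePlanet" only for coloring
sns.pairplot(df_spaceship_cleaned, hue="HomePlanet")
plt.show()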

### dev model // train/test

**KNN**

In [None]:
from sklearn.model_selection import train_test_split

K Nearest Neighbors is a distance-based algorithm and requires all **input data to be numerical.**

Let's only select numerical columns as our features.
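
To make "distance-based" concrete, here is a minimal from-scratch sketch of the idea. It is not scikit-learn's implementation; the helper `knn_predict` and the toy arrays are hypothetical and only illustrate why every feature has to be a number.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the majority label among the k closest training points."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Hypothetical toy data: two numerical features, binary label
X_toy = np.array([[20, 0], [22, 10], [60, 500], [65, 450]], dtype=float)
y_toy = np.array([1, 1, 0, 0])

print(knn_predict(X_toy, y_toy, np.array([21.0, 5.0]), k=3))
print(knn_predict(X_toy, y_toy, np.array([62.0, 480.0]), k=3))
```

A text column such as a planet name has no place in the subtraction `X_train - x_new`, which is why only numerical columns are used here.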

In [None]:
df_space_numerical

In [None]:
df_space_numerical.nunique()

In [None]:
df_space_numerical.describe()

Let's also define our target.

In [None]:
#your code here
features = df_space_numerical.drop(columns = ["Transported"])
target = df_space_numerical["Transported"]
target

**Train Test Split**

Now that we have split the data into **features** and **target** variables and imported the **train_test_split** function, split X and y into X_train, X_test, y_train, and y_test. 80% of the data should be in the training set and 20% in the test set.
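
As a quick sanity check of what an 80/20 split produces, the sketch below runs `train_test_split` on a small made-up array (`X_toy` and `y_toy` are assumptions, not the spaceship data) and prints the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.20 keeps 20% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=0)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```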

In [None]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [None]:
X_train.head()

In [None]:
y_train.head()

**Model Selection**

In this exercise we will be using **KNN** as our predictive model.

You need to choose between a **Classifier** and a **Regressor**. Consider the type of the target variable when deciding.
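
One rough way to make that decision is to look at how many distinct values the target takes. The helper `suggest_model`, its threshold, and the two toy series below are hypothetical; treat this as a rule of thumb, not a fixed rule.

```python
import pandas as pd

# Hypothetical toy targets, not the real columns
transported_like = pd.Series([1, 0, 1, 1, 0, 0])
age_like = pd.Series([24.0, 39.5, 58.0, 33.0, 16.0, 44.0])


def suggest_model(target, max_classes=2):
    """Rule of thumb (an assumption): very few distinct values suggests classification."""
    return "classifier" if target.nunique() <= max_classes else "regressor"


print(suggest_model(transported_like))
print(suggest_model(age_like))
```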

Initialize a KNN instance without setting any hyperparameter.

### trying classifier

In [None]:
#your code here
from sklearn.neighbors import KNeighborsClassifier
knn_c = KNeighborsClassifier(n_neighbors=10)

Fit the model to your data.

In [None]:
#your code here
knn_c.fit(X_train, y_train)

In [None]:
#your code here
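# Values must follow the column order of X_train; using a DataFrame with the
# same column names avoids scikit-learn's "feature names" warning
new_traveler_c = pd.DataFrame([[25, 223, 479, 0, 313, 304]], columns=X_train.columns)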
knn_c.predict(new_traveler_c)

Evaluate your model.

In [None]:
pred_c = knn_c.predict(X_test)
pred_c

In [None]:
y_test.values

In [None]:
knn_c.score(X_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score

pred_c = knn_c.predict(X_test)
accuracy_score(y_test, pred_c)
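
For intuition, accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on made-up arrays (`y_true` and `y_pred` are assumptions) and compares it with the accuracy of always predicting the most common class, which is a useful reference point.

```python
import numpy as np

# Hypothetical true labels and predictions
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Accuracy: fraction of positions where prediction and truth agree
accuracy = (y_true == y_pred).mean()

# Baseline: accuracy of always predicting the majority class
baseline = max(y_true.mean(), 1 - y_true.mean())

print(accuracy, baseline)
```

A model is only informative if its accuracy is clearly above that baseline.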

**With "Transported" as the target, the classifier works much better (accuracy score of 77.4%). N still has an impact, but a comparatively small one.**

____________________________________________________________________________

These were the results for the classifier with "Age" as the target:
- N = 10     => accuracy score of 3.40%
- N = 100    => accuracy score of 4.38%
- N = 1000   => accuracy score of 4.23%

A higher N improves the accuracy up to a point, but too many neighbors make it worse again: 100 neighbors is better than 1000.

The wide range of values in each column probably makes it hard to reach an accuracy score of 80% or higher, which a reliable prediction model would need.
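
One option that is often tried when columns have very different ranges is to standardize the features before KNN and to compare several values of N in a loop. The sketch below does this on synthetic data from `make_classification`; the data, the inflated column, and the chosen values of N are all assumptions, so the scores say nothing about the spaceship data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the spaceship data)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
X_demo[:, 0] *= 1000  # inflate one column to mimic a feature with a very wide range

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

scores = {}
for k in (10, 100, 400):
    # StandardScaler rescales every column to mean 0 and standard deviation 1,
    # so no single column dominates the distance computation
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)

print(scores)
```

The same loop could be run on `X_train` and `y_train` from this notebook to check whether scaling changes the picture.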

**Congratulations, you have just developed your first Machine Learning model!**

### trying regression // 2nd model

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
knn_r = KNeighborsRegressor(n_neighbors=100)

In [None]:
knn_r.fit(X_train, y_train)

In [None]:
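# Values must follow the column order of X_train. This row follows the order
# used when "Age" was the target, so with the current X_train the first value is read as Age
new_traveler_r = np.array([[223, 479, 0, 313, 304, 1]])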
knn_r.predict(new_traveler_r)

In [None]:
pred_r = knn_r.predict(X_test)
pred_r

In [None]:
y_test.values

In [None]:
#your code here
from sklearn.metrics import r2_score, root_mean_squared_error, mean_absolute_error

# r2_score on the test predictions gives the same value as knn_r.score(X_test, y_test)
print('R2 score:', r2_score(y_test, pred_r))
print('RMSE:', root_mean_squared_error(y_test, pred_r))
print('MAE:', mean_absolute_error(y_test, pred_r))

**The regression model makes sense with "Age" as the target, but not with "Transported", which only takes the values 0 and 1. Age, by contrast, is a continuous quantity that can take many different values.**

Results with "Age" as the target:

With N = 10
- R2 score: 0.009019950613868732
- RMSE: 14.408438809457115
- MAE: 11.425945537065054

With N = 100
- R2 score: 0.04880331178562369
- RMSE: 14.116259570660468
- MAE: 11.133993948562782

With N = 1000
- R2 score: 0.04252750916527137
- RMSE: 14.162751118544717
- MAE: 10.896608925869893

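The low R2 score indicates a poor model. R2 is at most 1, and the closer it is to 1, the better the model. A value near 0 means the model is barely better than always predicting the mean, and on test data R2 can even be negative.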
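
To see where these three numbers come from, the sketch below computes R2, RMSE and MAE by hand with NumPy. The arrays are made-up values for illustration, not results from this notebook.

```python
import numpy as np

# Hypothetical true ages and three sets of predictions
y_true = np.array([20.0, 30.0, 40.0, 50.0])
y_pred = np.array([25.0, 30.0, 35.0, 50.0])      # reasonably close
y_mean = np.full_like(y_true, y_true.mean())     # always predict the mean
y_bad = np.array([50.0, 40.0, 30.0, 20.0])       # worse than the mean


def r2(y, y_hat):
    """R2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    ss_res = ((y - y_hat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot


rmse = np.sqrt(((y_true - y_pred) ** 2).mean())  # root mean squared error
mae = np.abs(y_true - y_pred).mean()             # mean absolute error

print(r2(y_true, y_pred), rmse, mae)
print(r2(y_true, y_mean))  # predicting the mean gives R2 = 0
print(r2(y_true, y_bad))   # worse than the mean gives a negative R2
```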