#  K Nearest Neighbours (KNN) Classification Solution


![alt text](data/titanic.png "Title")


https://www.kaggle.com/c/titanic

### Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. 
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

* Can you predict whether a person aged 35 who paid a fare of 450 would have survived?

* Some of the cells are already completed for you. Roughly in the middle there is a "YOUR TURN" section, where you will have to insert your own code.

## 1. Analyse your Data

##### <font color='red'>Python:</font>

* To access the functionality of a library we first have to import it: we use the keyword import followed by the name of the library. Since we have to type that name every time we want to use one of its functions, we use the keyword 'as' to give it a shorter alias.

In [None]:
import numpy as np 
import pandas as pd 

##### <font color='red'>Python:</font>

* To load a dataset we use the function pd.read_csv and pass it the path to the CSV file.
* The data is stored in the data folder that you downloaded.


In [None]:
# This is only the training set
data=pd.read_csv("data/titanic.csv")

##### <font color='red'>Python:</font>

* Now all the data is stored in the object called data.
* To get an idea of the data we can use:
    * .shape
    * .describe()
    * .info()
    

In [None]:
# data.shape  this is an attribute of the data
data.shape

In [None]:
# data.describe() this is a method call -->  notice the difference between a method and an attribute 
data.describe()

In [None]:
# data.info() this is a method call
data.info()

#### selecting only some columns for our model
The info method showed that there are some missing values in the dataset. For instance, there are 891 entries but the Age column has only 714 non-null values. This is a problem we need to fix. Also, for the purpose of this exercise we will focus on only two features and one target:

* Age --> FEATURE
* Fare --> FEATURE
* Survived --> TARGET

##### <font color='red'>Python:</font>

* To select only specific columns of a dataset we use indexing: we put a list of column names (as quoted strings) inside square brackets.
* We then assign the selection back to the original data variable, so the object data now contains only 3 columns.
* data.head() is a method that shows the first 5 rows of the data (it keeps the output compact).

In [None]:
data = data[["Survived","Age","Fare"]]
data.head()

##### <font color='red'>Python:</font>

* The Age column has some missing values. Here we decide to fill them with the mean of the column. This is a choice: alternatively, we could have dropped the rows with missing values or found a more refined way to fill them. For instance, we could calculate two means, one for women and one for men.

In [None]:
# we fill the null values with the mean of the column
data=data.fillna(data.mean())
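
The per-group alternative mentioned above (one mean for women, one for men) could be sketched as follows. This is only an illustration on a small hypothetical table: the `toy` frame and its values are made up, and it assumes a `Sex` column like the one in the Kaggle file, which our `data` object no longer has after the column selection.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from the Titanic file)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
})

# Mean age of each group, aligned row by row with the original frame
# (missing values are ignored when the mean is computed)
group_mean_age = toy.groupby("Sex")["Age"].transform("mean")

# Fill each missing age with the mean of that passenger's own group
toy["Age"] = toy["Age"].fillna(group_mean_age)

print(toy)
```

To use this idea on the real data, the fill would have to happen before the `Sex` column is dropped.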

##### <font color='red'>Python:</font>

* matplotlib is a library for plotting
* here we plot a histogram of our target

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.hist(data["Survived"])

* 0 means the passenger did not survive and 1 means the passenger survived

##### <font color='red'>Python:</font>

* The scatter plot is provided by the matplotlib library. It requires the x and y values, and we can also pass values that determine the color of each point.
* We assign a different color to each class (survived or not).
* We also set alpha (the opacity of the points) to 0.5, so that overlapping dots remain visible.

In [None]:
plt.scatter( x=data["Fare"],y=data["Age"], c=data["Survived"], alpha = 0.5)


Here a purple point means that the person did not survive; with matplotlib's default colormap, survivors are shown in yellow.





# <font color='purple'>Your Turn:</font>

#### <font color='blue'>Try to create a model using a KNN classifier to make predictions on the Titanic dataset</font>

Your tasks are:

* import the train_test_split function from sklearn.model_selection
* decide which columns are your features and which one is your label
* split the data into a training set and a test set
* import KNeighborsClassifier from sklearn.neighbors
* create the classifier with n_neighbors=20
* fit the model to your training data (features and labels)
* check the accuracy of the model with different values of n_neighbors
* use the model to make a prediction for a person of age 35 and fare 450


## 2. Define the features X and the target y
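
One possible way to define the features and the target is sketched below. So that the cell runs on its own, it builds a small hypothetical stand-in for the `data` frame from section 1 (the four rows are made up); in the notebook you would skip that part and use the real `data`.

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame built in section 1
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Features: a DataFrame with the two columns the model will see
X = data[["Age", "Fare"]]

# Target: a Series with one 0/1 value per passenger
y = data["Survived"]

print(X.shape, y.shape)
```

Note the double square brackets for `X` (a list of columns gives a DataFrame) and the single brackets for `y` (one column gives a Series).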

##### <font color='red'>Python:</font>

* To evaluate our model we need to split the data into a training set and a test set.
* We can use train_test_split from sklearn.model_selection, which also shuffles the data by default.


## 3. Divide the data into 2 splits: training set and testing set
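
A minimal sketch of the split, assuming an 80/20 ratio. The values `test_size=0.2` and `random_state=42` are illustrative choices, not prescribed by the exercise, and the randomly generated `data` frame is a hypothetical stand-in for the real one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real `data` frame: 100 random passengers
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Survived": rng.integers(0, 2, size=100),
    "Age": rng.uniform(1, 80, size=100),
    "Fare": rng.uniform(5, 500, size=100),
})

X = data[["Age", "Fare"]]
y = data["Survived"]

# Keep 20% of the rows aside for testing; random_state makes the
# shuffle reproducible (both values are assumptions for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```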

### 4. Create the model
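
Creating the classifier only requires choosing the number of neighbours. The sketch below uses k = 20, as suggested in the task list; the variable name `knn` is an arbitrary choice.

```python
from sklearn.neighbors import KNeighborsClassifier

# k = 20 neighbours, as asked in the task list
knn = KNeighborsClassifier(n_neighbors=20)

print(knn)
```

At this point the model has not seen any data yet; it is just a configured object.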

### 5. Train the model
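
Training is done with the `fit` method, which takes the training features and the training labels. The sketch below runs on a hypothetical synthetic `data` frame (survival is simply tied to a high fare) so that it is self-contained; in the notebook, `X_train` and `y_train` would come from the split of the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data, NOT the real Titanic values
rng = np.random.default_rng(0)
fare = rng.uniform(5, 500, size=200)
data = pd.DataFrame({
    "Survived": (fare > 250).astype(int),
    "Age": rng.uniform(1, 80, size=200),
    "Fare": fare,
})

X_train, X_test, y_train, y_test = train_test_split(
    data[["Age", "Fare"]], data["Survived"], test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=20)

# For KNN, fitting mostly means storing the training points
# (scikit-learn may also build a search index for them)
knn.fit(X_train, y_train)

print(knn.classes_)
```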

### 6. Evaluate the model

In [None]:
# training accuracy


Test accuracy: to get the accuracy of the model we call its score method and pass the test features and the test labels.

In [None]:
# Test accuracy
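
Both accuracies can be computed with the same `score` method, once on the training split and once on the test split. The sketch below again uses a hypothetical synthetic `data` frame so that it runs on its own; the numbers it prints say nothing about the real Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data, NOT the real Titanic values
rng = np.random.default_rng(0)
fare = rng.uniform(5, 500, size=200)
data = pd.DataFrame({
    "Survived": (fare > 250).astype(int),
    "Age": rng.uniform(1, 80, size=200),
    "Fare": fare,
})

X_train, X_test, y_train, y_test = train_test_split(
    data[["Age", "Fare"]], data["Survived"], test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)

# score() returns the fraction of correctly classified rows
train_accuracy = knn.score(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)

print("train accuracy:", train_accuracy)
print("test accuracy: ", test_accuracy)
```

The test accuracy is the more honest estimate, because the model has never seen those rows.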


### 7. Tune the parameters of the model to increase the performance

In [None]:
####### try to change the k in the previous cells and run the entire code again
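
Instead of editing k by hand and re-running everything, a loop can compare several values at once. This is a sketch on hypothetical synthetic data; the list of k values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data, NOT the real Titanic values
rng = np.random.default_rng(0)
fare = rng.uniform(5, 500, size=200)
data = pd.DataFrame({
    "Survived": (fare > 250).astype(int),
    "Age": rng.uniform(1, 80, size=200),
    "Fare": fare,
})

X_train, X_test, y_train, y_test = train_test_split(
    data[["Age", "Fare"]], data["Survived"], test_size=0.2, random_state=42
)

# Map each k to its (train accuracy, test accuracy) pair
scores = {}
for k in (1, 5, 10, 20, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:>2}  train={train_acc:.3f}  test={test_acc:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training accuracy is perfect whenever no two rows share the same features; this is why the values of k should be compared on the test accuracy.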


### 8. Make prediction

In [None]:
# age 35 and fare 450
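
To predict for a new person, the input needs the same columns, in the same order, as the training features. The sketch below assumes the features were `["Age", "Fare"]` and trains on hypothetical synthetic data, so its result is not the answer for the real Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data, NOT the real Titanic values
rng = np.random.default_rng(0)
fare = rng.uniform(5, 500, size=200)
data = pd.DataFrame({
    "Survived": (fare > 250).astype(int),
    "Age": rng.uniform(1, 80, size=200),
    "Fare": fare,
})

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(data[["Age", "Fare"]], data["Survived"])

# One row, same column names and order as the training features
person = pd.DataFrame({"Age": [35], "Fare": [450]})

prediction = knn.predict(person)          # array with one 0/1 value
probabilities = knn.predict_proba(person) # share of neighbours per class

print("prediction:", prediction[0])
print("class probabilities:", probabilities[0])
```

`predict_proba` is optional, but it shows how many of the 20 neighbours voted for each class.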



According to the model, this person would survive.