##### Introduction
##### Plan:
##### 1.Import
##### 1.1.Import of Required Modules
##### 1.2.Importing (Reading) Data
##### 2.Exploratory Analysis
##### 2.1.Data Visualization
##### 3.Data Cleaning
##### 4.Feature Engineering / Feature Selection
##### 5.Machine Learning Models
##### 6.Evaluate & Interpret Results
##### 7.(if necessary) Define the Question of Interest/Goal

### Introduction


### 1.Import

### 1.1.Import of Required Modules

In [None]:
import numpy as np                  # numerical arrays
import matplotlib.pyplot as plt     # plotting
import pandas as pd                 # DataFrames for tabular data
import seaborn as sns               # statistical visualization

### 1.2.Importing (Reading) Data

In [None]:
df = pd.read_csv('data/train.csv')

In [None]:
test_df = pd.read_csv('data/test.csv')

### 2.Exploratory Analysis 

In [None]:
type(df)

In [None]:
df
# in the Survived column, 0 means the passenger died and 1 means they survived

In [None]:
df.values

In [None]:
df.info()            # the info method prints a summary of the DataFrame

In [None]:
df.shape     # the shape attribute returns the size as a pair (number of rows, number of columns)

In [None]:
df.columns   # the columns attribute returns the labels of all columns

In [None]:
df.head(3)  # the head method returns the first 3 rows

In [None]:
df.tail(3)  # the tail method returns the last 3 rows

In [None]:
df.dtypes    # the dtypes attribute returns the data type of each column

In [None]:
df.loc[[1,3,2]]  # select rows by index label
                 # the rows can be listed in any order

In [None]:
df.iloc[[1,3,2],[0, 3, 4, 1]]  # select rows and columns by integer position
                               # the rows and columns can be listed in any order
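
In the Titanic DataFrame the default index labels coincide with row positions, so `loc` and `iloc` can look interchangeable. The following is a minimal sketch on a small hypothetical frame (the `toy` data and its index labels are made up for illustration) where labels and positions differ:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3, 2], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=[10, 11, 12, 13],
)

# loc selects rows by index label, in the order given
by_label = toy.loc[[11, 13, 12]]

# iloc selects by integer position: rows 1, 3, 2 and columns 1, 0
by_position = toy.iloc[[1, 3, 2], [1, 0]]

print(by_label)
print(by_position)
```

Both calls return the same three rows here, but `toy.loc[[1, 3, 2]]` would raise a `KeyError` because those labels do not exist in the index.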

### 2.1.Data Visualization

In [None]:
df['Age'].plot(kind='hist', bins=20)     # histogram of passenger ages; bins sets the number of bins

In [None]:
df['Age'].plot(kind='kde')    # age distribution as a kernel density estimate (KDE)
# A KDE is a smooth, non-parametric estimate of the probability density
# of a continuous variable.

In [None]:
df.groupby('Sex')['Age'].plot(kind='kde', xlim=[0,100], legend=True) # age distribution (KDE) for men and women separately; xlim sets the x-axis limits


In [None]:
df['Pclass'].value_counts().plot.pie(legend=True) # pie chart of passengers by ticket class (Pclass)

In [None]:
df['Survived'].value_counts().plot.pie(legend=True)
# pie chart of passengers split into survivors and non-survivors
# in the Survived column, 0 means the passenger died and 1 means they survived

### 3.Data Cleaning 

In [None]:
df.info()

In [None]:
df.isnull().sum()   # number of NaN values in each column of df

In [None]:
# replace the NaN values in the 'Age' column with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
df.isnull().sum()   # confirm that the 'Age' column no longer contains NaN values
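
The median is less sensitive to extreme values than the mean, which can matter for a skewed column such as age. A small sketch with made-up ages (the `ages` series is hypothetical, not taken from the dataset) shows the difference and the fill step:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one large outlier
ages = pd.Series([2.0, 24.0, 28.0, np.nan, 30.0, 80.0], name="Age")

mean_age = ages.mean()      # NaN is skipped: (2 + 24 + 28 + 30 + 80) / 5
median_age = ages.median()  # middle of the five known values

# Assign the result back instead of using inplace=True on a column selection
filled = ages.fillna(median_age)

print(mean_age, median_age)
print(filled.isnull().sum())
```

Assigning the result of `fillna` back to the column avoids the chained-assignment pitfalls that `inplace=True` on `df['Age']` can trigger in recent pandas versions.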

In [None]:
# drop columns we consider uninformative for this model
# Name, Ticket and Cabin hold the passenger's name, ticket number and cabin
df = df.drop(['Name', 'Ticket', 'Cabin'], axis = 1)

In [None]:
df.head(3)

### 4.Feature Engineering / Feature Selection 

In [None]:
# Pre-processing
# Recoding categorical features
df['Sex'].value_counts()

In [None]:
sex_mapping = {'male':0, "female":1}
df['Sex'] = df['Sex'].map(sex_mapping)
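
One property of `Series.map` with a dictionary is worth keeping in mind: any value missing from the dictionary becomes NaN. The sketch below uses a hypothetical series with an inconsistently spelled value to show how such cases could be detected:

```python
import pandas as pd

# Hypothetical column with one value that is not a key of the mapping
sex = pd.Series(["male", "female", "Female", "male"])
sex_mapping = {"male": 0, "female": 1}

encoded = sex.map(sex_mapping)

# Values that had no entry in the mapping are NaN after map
unmapped = sex[encoded.isnull()].unique()

print(encoded)
print(unmapped)
```

Running `df['Sex'].value_counts()` before the mapping, as done above, serves the same purpose: it confirms that only `male` and `female` occur.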

In [None]:
df.head(3)

In [None]:
# Recoding categorical features
# The Embarked column in the Titanic dataset is the port where the passenger boarded
df['Embarked'].value_counts()

In [None]:
# Recoding categorical features
# Here we create three new binary features in place of the single 'Embarked' feature
Embarked_dummies = pd.get_dummies(df['Embarked'], prefix='port', dummy_na = False)
Embarked_dummies.head(3)

In [None]:
df.head(3)

In [None]:
# we combine our data in df with the created DataFrame Embarked_dummies
df = pd.concat([df, Embarked_dummies], axis=1)

In [None]:
# Drop the 'Embarked' column, since this feature is now encoded in 3 new columns
df = df.drop(['Embarked'], axis=1)
df.head(3)
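
To make the encoding concrete, here is a minimal sketch on a hypothetical `embarked` series containing one missing value. With `dummy_na=False`, a missing port produces a row in which all three indicator columns are zero:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes, including one missing value
embarked = pd.Series(["S", "C", "Q", np.nan, "S"], name="Embarked")

# One indicator column per distinct port; NaN gets no column of its own
dummies = pd.get_dummies(embarked, prefix="port", dummy_na=False)

print(dummies)
```

Depending on the pandas version the indicator columns are boolean or integer; either way each row has at most one positive entry.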

### 5.Machine Learning Models

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# In this kernel we create a model - k nearest neighbors
df

In [None]:
X = df.drop(['Survived'], axis=1)
y = df['Survived']

In [None]:
X

In [None]:
y

In [None]:
# Split the original data into X_train, X_test, y_train, y_test
# X and y are split in a fixed proportion (here 75% for training and 25% for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
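
The effect of `test_size` and `random_state` can be checked on synthetic data. The arrays `X_demo` and `y_demo` below are made up for illustration; the sketch confirms the 75/25 split and shows that the same `random_state` reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features and alternating labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Repeating the call with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```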

In [None]:
len(X_train)

In [None]:
len(X_test)

In [None]:
# Create the model: k-nearest neighbors with n_neighbors=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
# knn is an instance of the KNeighborsClassifier class from the sklearn.neighbors module

In [None]:
# the fit method trains the model
# the fit method returns the knn object itself (and modifies it)
knn.fit(X_train, y_train)           

In [None]:
y_pred = knn.predict(X_test)        # the predict method makes predictions for the test set
acc_knn = accuracy_score(y_test, y_pred)   # argument order is accuracy_score(y_true, y_pred)
print("Accuracy on the test set: {:.2f}".format(acc_knn))
# knn.score(X_test, y_test) computes the same accuracy on the test set:
#print("Accuracy on the test set: {:.2f}".format(knn.score(X_test, y_test)))
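
Accuracy is simply the fraction of predictions that match the true labels. The following sketch, using made-up label arrays, computes it by hand and compares the result with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (0 = died, 1 = survived)
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Fraction of positions where the prediction equals the true label
manual = (y_true_demo == y_pred_demo).mean()

# Same quantity via scikit-learn; the first argument is y_true
from_sklearn = accuracy_score(y_true_demo, y_pred_demo)

print(manual, from_sklearn)
```

Six of the eight predictions match, so both values are 0.75.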

In [None]:
y_pred
# in the y_pred array, as in the Survived column, 0 means the passenger died and 1 means they survived

### 6.Evaluate & Interpret Results

In [None]:
training_accuracy = []
test_accuracy = []
# trying n_neighbors from 1 to 35
neighbors_settings = range(1, 36)
for n_neighbors in neighbors_settings:
    # building a model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # record the accuracy on the training set
    training_accuracy.append(knn.score(X_train, y_train))
    # record the accuracy on the test set
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="accuracy on the training set")
plt.plot(neighbors_settings, test_accuracy, label="accuracy on the test set")
plt.ylabel("accuracy")
plt.xlabel("number of neighbors")
plt.legend()
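
Instead of reading the best value off the plot, the index of the highest test accuracy can be taken with `np.argmax`. This is a sketch on synthetic data from `make_classification` (not the Titanic data), so the selected value will differ from the one found in this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

ks = list(range(1, 36))
test_scores = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))

# argmax returns the first index of the maximum, i.e. the smallest best k
best_index = int(np.argmax(test_scores))
best_k = ks[best_index]

print(best_k, test_scores[best_index])
```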

In [None]:
# The graph shows that the maximum accuracy on the test set is achieved with 23 neighbors
# n_neighbors=23
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=23)   

In [None]:
# the fit method trains the model
# the fit method returns the knn object itself (and modifies it)
knn.fit(X_train, y_train) 

In [None]:
y_pred = knn.predict(X_test)        # the predict method makes predictions for the test set
# the score method computes the accuracy of the model on the test set
print("Accuracy on the test set: {:.2f}".format(knn.score(X_test, y_test)))
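
Because the number of neighbors was chosen using the same test set that is scored here, the reported accuracy may be somewhat optimistic. One common alternative, not used in this notebook, is to compare candidate values with cross-validation on the training data. A minimal sketch on synthetic data, where the candidate values are picked arbitrarily for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Mean 5-fold cross-validated accuracy for a few candidate values of k
cv_means = {}
for k in (1, 5, 15, 23):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_means[k] = scores.mean()

best_k_cv = max(cv_means, key=cv_means.get)

print(cv_means)
print(best_k_cv)
```

With this approach the held-out test set is touched only once, after the value of k has been fixed.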

### Selection of parameters
In this work, I did not consider other parameters.

### 7.(if necessary) Define the Question of Interest/Goal
I did not fill in this section in this Notebook.

### Submission

In [None]:
test_df

In [None]:
test_df.info()

In [None]:
test_df.isnull().sum()   # number of NaN values in each column of test_df

In [None]:
# fill missing values with the medians computed on the training data
test_df['Age'] = test_df['Age'].fillna(df['Age'].median())
test_df['Fare'] = test_df['Fare'].fillna(df['Fare'].median())
test_df = test_df.drop(['Name', 'Ticket', 'Cabin'], axis = 1)

In [None]:
sex_mapping = {'male':0, "female":1}
test_df['Sex'] = test_df['Sex'].map(sex_mapping)

In [None]:
test_df['Embarked'].value_counts()

In [None]:
Embarked_dummies = pd.get_dummies(test_df['Embarked'], prefix='port', dummy_na = False)
test_df = pd.concat([test_df, Embarked_dummies], axis=1)
test_df = test_df.drop(['Embarked'], axis=1)
test_df
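
The model expects the test features to have the same columns, in the same order, as the training features. If a port were absent from the test data, `get_dummies` would not create its column. A hedged sketch of one way to guard against this with `reindex` follows; the column list and the `test_demo` frame are hypothetical:

```python
import pandas as pd

# Hypothetical training column order
train_cols = ["Pclass", "port_C", "port_Q", "port_S"]

# Hypothetical test frame: port_Q is missing and the order differs
test_demo = pd.DataFrame({"port_S": [1, 0], "Pclass": [3, 1], "port_C": [0, 1]})

# Reorder to the training layout and fill any missing column with zeros
aligned = test_demo.reindex(columns=train_cols, fill_value=0)

print(aligned)
```

In this notebook the same idea would amount to reindexing `test_df` with `X.columns`.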

In [None]:
test_df

In [None]:
test_df.isnull().sum()   # confirm that test_df no longer contains NaN values

In [None]:
test_X = test_df

In [None]:
# The knn model fitted above with n_neighbors=23 is reused for the submission,
# so no new model is created here.

In [None]:
pred_y = knn.predict(test_X)        # predict survival for the test set provided by kaggle.com
# the Kaggle test set has no Survived labels, so knn.score cannot be computed for it here

In [None]:
test_X

In [None]:
pred_y

In [None]:
test_X.to_csv('data/abc.csv', index=False)

In [None]:
submission = pd.DataFrame({
         "PassengerId": test_X["PassengerId"],
         "Survived": pred_y
     })
# submission.to_csv('titanic.csv', index=False) # use this path when running on kaggle.com
submission.to_csv('data/titanic1.csv', index=False)

In [None]:
submission.head()
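
Before uploading, it can help to check that the submission has the expected shape and contents. The sketch below builds a small hypothetical submission (the IDs and predictions are made up) and runs a few basic checks:

```python
import numpy as np
import pandas as pd

# Hypothetical passenger IDs and predictions, for illustration only
ids = np.arange(892, 897)
preds = np.array([0, 1, 0, 0, 1])

submission_demo = pd.DataFrame({"PassengerId": ids, "Survived": preds})

# Basic checks: column names, no missing values, unique IDs, binary predictions
has_expected_columns = list(submission_demo.columns) == ["PassengerId", "Survived"]
has_no_missing = not submission_demo.isnull().values.any()
ids_are_unique = submission_demo["PassengerId"].is_unique
is_binary = submission_demo["Survived"].isin([0, 1]).all()

print(has_expected_columns, has_no_missing, ids_are_unique, is_binary)
```

The same checks could be applied to the `submission` DataFrame created above, together with a comparison of `len(submission)` against `len(test_df)`.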