# Lesson 4. Machine Learning

Naturally, Julia has its own implementation of everyone's favorite, [ScikitLearn](https://github.com/cstjean/ScikitLearn.jl/). Before we start we need some data; to keep things simple, we will use the [Titanic](https://www.kaggle.com/c/titanic/data) dataset.

In [None]:
import Pkg;
Pkg.add("HTTP")
Pkg.add("ScikitLearn")
Pkg.add("CategoricalArrays")

In [None]:
using CSV
using HTTP # Module for working with HTTP
using DataFrames
using ScikitLearn
using CategoricalArrays # Module for working with categorical variables

## 0. Data preparation

In [None]:
url_train = "https://raw.githubusercontent.com/JuliaEvangelists/Julia-in-DS/main/data/titanic/train.csv"
url_test = "https://raw.githubusercontent.com/JuliaEvangelists/Julia-in-DS/main/data/titanic/test.csv"

By the way, we did not mention this in Lesson 2, but the [CSV](https://juliadata.github.io/CSV.jl/stable/) module can be combined with the [HTTP](https://juliaweb.github.io/HTTP.jl/stable/) module to load data that is available online:

In [None]:
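# Recent CSV.jl versions require an explicit sink type as the second argument
train = CSV.read(HTTP.get(url_train).body, DataFrame)
test = CSV.read(HTTP.get(url_test).body, DataFrame)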

size(train), size(test)

In [None]:
first(train, 5)

Let's do some light preprocessing. First, we replace the missing values with zeros using the broadcast form of `coalesce` (written `coalesce.`). For more on working with missing values, see [here](http://juliadata.github.io/DataFrames.jl/stable/man/missing/).

In [None]:
train_ready = coalesce.(train, 0)
test_ready = coalesce.(test, 0)
nothing  # suppress the cell output
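To make the effect of `coalesce.` concrete, here is a minimal sketch on a made-up vector (the `ages` values are illustrative, not taken from the dataset):

```julia
# Toy column with a gap; `missing` is Julia's marker for absent values.
ages = [22.0, missing, 38.0]

# `coalesce(x, 0)` returns x unless x is missing, in which case it returns 0;
# the dot broadcasts the call over every element.
filled = coalesce.(ages, 0)

println(count(ismissing, ages))   # number of gaps before filling
println(any(ismissing, filled))   # are there any gaps left?
```

Zero-filling is the simplest option, but note that it turns a missing `Age` into a literal 0, which the model will treat as a real value.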

Drop the columns we will not use:

In [None]:
train_ready = select(train_ready, Not([:Name, :PassengerId, :Ticket, :Cabin]))
test_ready = select(test_ready, Not([:Name, :PassengerId, :Ticket, :Cabin]))

nothing  # suppress the cell output

Next, we get rid of the text data by converting it to categorical variables. More details [here](http://juliadata.github.io/DataFrames.jl/stable/man/categorical/).

In [None]:
unique(train_ready.Embarked), unique(train_ready.Sex)

In [None]:
train_ready.Sex = recode(train_ready.Sex, "male" => 1, "female" => 0)
test_ready.Sex = recode(test_ready.Sex, "male" => 1, "female" => 0)
nothing  # suppress the cell output

In [None]:
train_ready.Embarked_S = recode(train_ready.Embarked, 0, "S" => 1)
train_ready.Embarked_C = recode(train_ready.Embarked, 0, "C" => 1)
train_ready.Embarked_Q = recode(train_ready.Embarked, 0, "Q" => 1)

test_ready.Embarked_S = recode(test_ready.Embarked, 0, "S" => 1)
test_ready.Embarked_C = recode(test_ready.Embarked, 0, "C" => 1)
test_ready.Embarked_Q = recode(test_ready.Embarked, 0, "Q" => 1)

nothing  # suppress the cell output

In [None]:
train_ready = select(train_ready, Not([:Embarked]))
test_ready = select(test_ready, Not([:Embarked]))

first(train_ready, 3)
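The three `Embarked_*` columns built above amount to one-hot encoding. As a rough sketch of the same idea in plain Julia, without `recode`, on a made-up vector (the values below are illustrative only):

```julia
# Made-up port codes; the 0 stands in for a value that was missing
# and has already been replaced by `coalesce.`.
embarked = Any["S", "C", 0, "Q", "S"]

# Comparing every element with one level gives a Bool vector,
# and `Int.` turns it into a 0/1 indicator column.
embarked_S = Int.(embarked .== "S")
embarked_C = Int.(embarked .== "C")
embarked_Q = Int.(embarked .== "Q")

println(embarked_S)
println(embarked_C)
println(embarked_Q)
```

A row whose port was missing gets 0 in all three indicator columns, which matches the default value `0` passed to `recode` in the cells above.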

In [None]:
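# Feature matrix and target vector for training
# (`convert(Array, df)` is no longer supported by DataFrames.jl; use the Matrix constructor)
X = Matrix{Float64}(select(train_ready, Not(:Survived)))
y = Vector(train_ready.Survived)

# Feature matrix for the test set (note: this overwrites the `test` DataFrame)
test = Matrix{Float64}(test_ready)
nothing  # suppress the cell output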

In [None]:
@sk_import linear_model: LogisticRegression

Every model's constructor accepts hyperparameters (such as regularization strength, whether to fit the intercept, the penalty type, etc.) as keyword arguments. See `?LogisticRegression` for details.

In [None]:
?LogisticRegression

In [None]:
model = LogisticRegression(fit_intercept=true, max_iter = 200)

In [None]:
fit!(model, X, y);

In [None]:
accuracy = score(model, X, y)

In [None]:
using ScikitLearn.CrossValidation: cross_val_score

In [None]:
scores = cross_val_score(model, X, y; cv=5)

In [None]:
using Statistics

In [None]:
scores_std = std(scores)
scores_mean = mean(scores)

In [None]:
print("Accuracy: $scores_mean (+/- $scores_std)")
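To see what `cross_val_score` with `cv=5` does conceptually, here is a minimal sketch of k-fold splitting in plain Julia. The helper `kfold_indices` and the fold accuracies are hypothetical, and the library may additionally stratify or shuffle the folds, so treat this as an illustration rather than its actual implementation:

```julia
using Statistics

# Hypothetical helper: split the row indices 1:n into k contiguous folds.
function kfold_indices(n::Integer, k::Integer)
    bounds = round.(Int, range(0, n; length = k + 1))
    return [(bounds[i] + 1):bounds[i + 1] for i in 1:k]
end

folds = kfold_indices(10, 5)

for test_idx in folds
    train_idx = setdiff(1:10, test_idx)
    # A real run would fit the model on `train_idx` and score it on `test_idx`.
    println("train = ", train_idx, ", test = ", collect(test_idx))
end

# Summarising made-up fold accuracies as mean and standard deviation
fold_scores = [0.78, 0.80, 0.79, 0.77, 0.81]
println("Accuracy: ", mean(fold_scores), " (+/- ", std(fold_scores), ")")
```

Each row lands in the test part exactly once, so the mean over folds is a less optimistic estimate than the training accuracy returned by `score(model, X, y)`.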

In [None]:
using ScikitLearn.CrossValidation: KFold

In [None]:
scores = cross_val_score(model, X, y; cv=KFold(size(X, 1), n_folds=10))

In [None]:
@sk_import preprocessing: MinMaxScaler

In [None]:
scaler = MinMaxScaler(feature_range=(-1, 1))

In [None]:
X_scaled = fit_transform!(scaler, X)
scores = cross_val_score(model, X_scaled, y; cv=5)
scores_std = std(scores)
scores_mean = mean(scores)
print("Accuracy: $scores_mean (+/- $scores_std)")
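The scaling applied by `MinMaxScaler(feature_range=(-1, 1))` is a per-column linear map onto the interval [-1, 1]. A minimal sketch of that formula in plain Julia follows; the helper `minmax_scale` and the toy matrix are hypothetical, written only to illustrate the arithmetic:

```julia
# Hypothetical helper: rescale each column of X linearly into [lo, hi].
function minmax_scale(X::AbstractMatrix; lo = -1.0, hi = 1.0)
    col_min = minimum(X; dims = 1)
    col_max = maximum(X; dims = 1)
    # Assumes no column is constant (col_max == col_min would divide by zero).
    return (X .- col_min) ./ (col_max .- col_min) .* (hi - lo) .+ lo
end

X_toy = [1.0 10.0;
         2.0 20.0;
         3.0 40.0]
X_toy_scaled = minmax_scale(X_toy)

display(X_toy_scaled)
```

After scaling, the smallest value in every column maps to -1 and the largest to 1, so features such as `Fare` and `Pclass` end up on a comparable scale.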

In [None]:
predictions = predict_proba(model, test)

In [None]:
model.coef_
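For a binary logistic regression, the coefficients in `model.coef_` (together with the intercept in `model.intercept_`) turn a feature vector into a probability through the logistic function. Here is a sketch with made-up numbers; `w`, `b` and `x` are hypothetical and not the fitted values:

```julia
# Logistic (sigmoid) function: maps any real number into (0, 1).
sigmoid(t) = 1 / (1 + exp(-t))

# Hypothetical weights, intercept and feature vector, for illustration only.
w = [-1.0, 0.5]
b = 0.25
x = [1.0, 2.0]

z = sum(w .* x) + b        # linear score w·x + b
p_survived = sigmoid(z)    # probability of class 1
p_died = 1 - p_survived    # probability of class 0

println((p_died, p_survived))
```

The two numbers correspond to one row of the matrix returned by `predict_proba`: the first column is the probability of class 0 and the second column the probability of class 1. A positive coefficient pushes the survival probability up, a negative one pushes it down.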

Submit your solution to [kaggle](https://www.kaggle.com/c/titanic/data).

In [None]:
url_submission = "https://raw.githubusercontent.com/JuliaEvangelists/Julia-in-DS/main/data/titanic/gender_submission.csv"
sample_submission = CSV.read(HTTP.get(url_submission).body, DataFrame)

first(sample_submission, 3)
# Kaggle expects 0/1 labels; column 2 of `predictions` holds P(Survived = 1),
# column 1 holds P(Survived = 0)
sample_submission.Survived = Int.(predictions[:, 2] .>= 0.5)

CSV.write("submission.csv", sample_submission)

In [None]:
@sk_import ensemble: RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_estimators=100)
@time cross_val_score(model, X, y, cv=5)

In [None]:
# Pkg.add("DecisionTree")  # uncomment if the package is not installed yet
using DecisionTree

In [None]:
model = DecisionTreeClassifier()
fit!(model, X, y)
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)

In [None]:
# apply learned model
predict(model, test)
# get the probability of each label
predict_proba(model, test)

In [None]:
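# Qualify the name: `RandomForestClassifier` was already bound to the scikit-learn
# class by `@sk_import`, and that class has no `n_trees` keyword
model = DecisionTree.RandomForestClassifier(n_trees=100)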
@time cross_val_score(model, X, y, cv=5)