### Logistic Regression [Homework]

Follow these instructions:

1. Load the ```"titanic"``` dataset from the ```seaborn``` library.

2. Perform analysis, cleaning, and any necessary preprocessing on the data set. Note: You are not allowed to discard data points.

3. Split the data into ```train``` and ```validation``` with the following ratio 8-2 using the ```train_test_split()``` function from the ```sklearn``` library. Note that you must set ```random_state = 1```.

4. Build a model that will predict whether the passenger survived the accident or not.

5. Model accuracy for both ```train``` and ```validation``` data should be above ```80%```.


**Note:** Please include explanations at all stages.

## 1. Load the "titanic" dataset from the seaborn library

In [27]:
# Начало кода
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
titanic = sns.load_dataset("titanic")


In [28]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## 2. Let's clean and pre-process the data set

First, I would like to replace categorical values ​​with numeric values.

In [29]:
titanic["class"].unique

<bound method Series.unique of 0       Third
1       First
2       Third
3       First
4       Third
        ...  
886    Second
887     First
888     Third
889     First
890     Third
Name: class, Length: 891, dtype: category
Categories (3, object): ['First', 'Second', 'Third']>

In [30]:
titanic["who"].unique()

array(['man', 'woman', 'child'], dtype=object)

For interpreted values ​​we use `label_encoder`, for the rest we use `one_hot_encoder`.

In [31]:
titanic["sex"].replace("male", 1, inplace=True)
titanic["sex"].replace("female", 0, inplace=True)
titanic["class"].replace("First", 1, inplace=True)
titanic["class"].replace("Second", 2, inplace=True)
titanic["class"].replace("Third", 3, inplace=True)
titanic["who"].replace("man", 1, inplace=True)
titanic["who"].replace("woman", 0, inplace=True)
titanic["who"].replace("child", 2, inplace=True)
titanic["adult_male"].replace(True, 1, inplace=True)
titanic["adult_male"].replace(False, 0, inplace=True)
titanic["alive"].replace("yes", 1, inplace=True)
titanic["alive"].replace("no", 0, inplace=True)
titanic["alone"].replace(True, 1, inplace=True)
titanic["alone"].replace(False, 0, inplace=True)

titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.2500,S,3,1,1,,Southampton,0,0
1,1,1,0,38.0,1,0,71.2833,C,1,0,0,C,Cherbourg,1,0
2,1,3,0,26.0,0,0,7.9250,S,3,0,0,,Southampton,1,1
3,1,1,0,35.0,1,0,53.1000,S,1,0,0,C,Southampton,1,0
4,0,3,1,35.0,0,0,8.0500,S,3,1,1,,Southampton,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,S,2,1,1,,Southampton,0,1
887,1,1,0,19.0,0,0,30.0000,S,1,0,0,B,Southampton,1,1
888,0,3,0,,1,2,23.4500,S,3,0,0,,Southampton,0,0
889,1,1,1,26.0,0,0,30.0000,C,1,1,1,C,Cherbourg,1,1


We can notice that the `survived` and `alive` columns have very similar values. Let's check how much.

In [32]:
(titanic["survived"] == titanic["alive"]).value_counts()

True    891
Name: count, dtype: int64

`alive` completely repeats the `survived` attribute, which will be our target. Therefore, we can omit the `alive` sign.

In [33]:
titanic = titanic.drop(columns="alive")

But before using `one_hot_encoding` it is still better to fill in the missing data.

In [34]:
titanic.describe()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,who,adult_male,alone
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,0.789001,0.602694,0.602694
std,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429,0.594291,0.489615,0.489615
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.125,0.0,0.0,7.9104,0.0,0.0,0.0
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542,1.0,1.0,1.0
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.0,1.0,1.0,1.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,2.0,1.0,1.0


In [35]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    int64   
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    int64   
 10  adult_male   891 non-null    int64   
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alone        891 non-null    int64   
dtypes: category(2), float64(2), int64(8), object(2)
memory usage: 85.9+ KB


Not all columns contain 891 values. Let's consider features with missing values.

In [36]:
titanic["age"].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

Since we will in any case use `one_hot_encoding` for the `embarked`, `embark_town` and `deck` features, we can replace `NaN` with any other value - for example, "no"

In [37]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    int64   
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    int64   
 10  adult_male   891 non-null    int64   
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alone        891 non-null    int64   
dtypes: category(2), float64(2), int64(8), object(2)
memory usage: 85.9+ KB


Missing values ​​for the “age” attribute can be replaced with the average value for the column. But I think that would be too rude. Let's try to replace the missing values ​​with the values ​​of similar rows. To determine the similarity of two vectors we use `scipy.spatial.distance.cosine`. However, we still have categorical values, so we use `one_hot_encoding` first.

Now let's try to find vectors similar to those that do not have the value of the `age` attribute. When calculating `similarity` we will use a vector with the values ​​of all signs except the target `age`.

In [38]:
nas = numeric_titanic[numeric_titanic["age"].isna() == True].index.tolist()

NameError: name 'numeric_titanic' is not defined

In [None]:
import scipy
for idx in nas:
    similarities = {}
    navector = numeric_titanic.drop(columns='age').loc[idx].values
    for idx2 in numeric_titanic[numeric_titanic["age"].isna() == False].index.tolist():
        vector2 = numeric_titanic.drop(columns='age').loc[idx2].values
        similarity = 1 - scipy.spatial.distance.cosine(navector,vector2)
        similarities[idx2]=similarity
        
    max_idx2 = [key for key in similarities.keys() if similarities[key] == max(similarities.values())][0]
    replacing_value = numeric_titanic["age"].loc[max_idx2]
    numeric_titanic.at[idx, 'age']= replacing_value

In [None]:
similarities

{0: 0.9085087634073239,
 1: 0.9805828490311704,
 2: 0.9304751783763521,
 3: 0.9830093249690779,
 4: 0.9224224377748935,
 5: 0.9217632842313502,
 6: 0.9819523643397228,
 7: 0.9892638982923231,
 8: 0.9734933720028671,
 9: 0.9873286474745191,
 10: 0.9862321581655026,
 11: 0.9848441368228138,
 12: 0.9224224377748935,
 13: 0.994695549707806,
 14: 0.9130434040204417,
 15: 0.9913719874136602,
 16: 0.9881973748793831,
 17: 0.9799095922194868,
 18: 0.9948743681133047,
 19: 0.9054327478364079,
 20: 0.9902799825784248,
 21: 0.9768459386561931,
 22: 0.9031463533839669,
 23: 0.9826655164984617,
 24: 0.9902956594102419,
 25: 0.99568190072131,
 26: 0.893627479230283,
 27: 0.9793940039077571,
 28: 0.9205376699312651,
 29: 0.9191933722270215,
 30: 0.981954924176588,
 31: 0.9789972306801082,
 32: 0.9178073473900857,
 33: 0.972638348606616,
 34: 0.9805696761844811,
 35: 0.9835440829618443,
 36: 0.8880148924894637,
 37: 0.9224224377748935,
 38: 0.9928197293608866,
 39: 0.9542906183630265,
 40: 0.962448757

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   survived                 891 non-null    int64   
 1   pclass                   891 non-null    int64   
 2   sex                      891 non-null    int64   
 3   age                      891 non-null    float64 
 4   sibsp                    891 non-null    int64   
 5   parch                    891 non-null    int64   
 6   fare                     891 non-null    float64 
 7   class                    891 non-null    category
 8   who                      891 non-null    int64   
 9   adult_male               891 non-null    float64 
 10  alone                    891 non-null    float64 
 11  embarked_C               891 non-null    uint8   
 12  embarked_Q               891 non-null    uint8   
 13  embarked_S               891 non-null    uint8   
 14  embarked_U

We now have high-quality and complete numerical information.

In [None]:
numeric_titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,class,who,adult_male,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
0,0,3,1,22.0,1,0,7.2500,3,1,1.0,...,1,0,0,0,0,0,0,0,0,1
1,1,1,0,38.0,1,0,71.2833,1,0,0.0,...,0,0,0,0,1,0,0,0,0,0
2,1,3,0,26.0,0,0,7.9250,3,0,0.0,...,1,0,0,0,0,0,0,0,0,1
3,1,1,0,35.0,1,0,53.1000,1,0,0.0,...,1,0,0,0,1,0,0,0,0,0
4,0,3,1,35.0,0,0,8.0500,3,1,1.0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,2,1,1.0,...,1,0,0,0,0,0,0,0,0,1
887,1,1,0,19.0,0,0,30.0000,1,0,0.0,...,1,0,0,1,0,0,0,0,0,0
888,0,3,0,30.0,1,2,23.4500,3,0,0.0,...,1,0,0,0,0,0,0,0,0,1
889,1,1,1,26.0,0,0,30.0000,1,1,1.0,...,0,0,0,0,1,0,0,0,0,0


## 3. Divide the data into train and validation

In [None]:
X = np.arange(1, 21)
#array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
#np.split(a, [int(0.8 * len(a)), int(0.9 * len(a))])

In [None]:
np.split(X, [int(0.8 * len(X)), int(0.9 * len(X))])

[array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]),
 array([17, 18]),
 array([19, 20])]

In [None]:
X_train = train_test_split(X, test_size = 0.2, random_state = 1)
print(len(X_train))

2


In [26]:
X_test = train_test_split(X, test_size = 0.2, random_state = 1)

NameError: name 'X' is not defined

## 4. Let's build a model based on logistic regression

Let's create and train a Logistic Regression model using data from the training set.

In [None]:
x = regression()

NameError: name 'regression' is not defined

The `max_iter` parameter was set to 1000 because with the default number of 100, the model cannot achieve convergence.

The accuracy of the Logistic Regression model on test data is: 80.45 %


The accuracy of the Logistic Regression model on train data is: 84.13 %


In [None]:
X_test

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,class,who,adult_male,alone,...,embark_town_Southampton,embark_town_Unknown,deck_A,deck_B,deck_C,deck_D,deck_E,deck_F,deck_G,deck_Unknown
862,1,0,48.0,0,0,25.9292,1,0,0.0,1.0,...,1,0,0,0,0,1,0,0,0,0
223,3,1,28.0,0,0,7.8958,3,1,1.0,1.0,...,1,0,0,0,0,0,0,0,0,1
84,2,0,17.0,0,0,10.5000,2,0,0.0,1.0,...,1,0,0,0,0,0,0,0,0,1
680,3,0,21.0,0,0,8.1375,3,0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,1
535,2,0,7.0,0,2,26.2500,2,2,0.0,0.0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,1,0,49.0,0,0,25.9292,1,0,0.0,1.0,...,1,0,0,0,0,1,0,0,0,0
815,1,1,40.0,0,0,0.0000,1,1,1.0,1.0,...,1,0,0,1,0,0,0,0,0,0
629,3,1,21.0,0,0,7.7333,3,1,1.0,1.0,...,0,0,0,0,0,0,0,0,0,1
421,3,1,21.0,0,0,7.7333,3,1,1.0,1.0,...,0,0,0,0,0,0,0,0,0,1


### Great job