## Task 
Build a model with `sklearn`'s `LogisticRegression` that reaches an accuracy of at least 0.80 on the test set (0.82 for the bonus point, 0.85 for the super-bonus point).

Some (optional) suggestions:
- Add new features (e.g. missing value indicator columns)
- Fill missing values
- Encode categorical features (e.g. one-hot encoding)
- Scale the features (e.g. with standard or robust scaler)
- Think of other ways of preprocessing the features (e.g. `Fare` $\to$ `log(Fare)`)
- Try adding polynomial features
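
Several of the suggestions above can be combined into a single `sklearn` pipeline. The sketch below is one possible arrangement, not the reference solution: the tiny `toy` frame, its values, and the choice of `SimpleImputer`, `OneHotEncoder` and `StandardScaler` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data that reuses some of the Titanic column names.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 8.46],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})
target = np.array([0, 1, 1, 1, 0, 0])

# Numeric columns: median fill plus a missing-value indicator, then scaling.
numeric = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    StandardScaler(),
)
# Categorical columns: one-hot encoding.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = make_column_transformer(
    (numeric, ["Age", "Fare"]),
    (categorical, ["Sex", "Embarked"]),
)
toy_model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
toy_model.fit(toy, target)

# Inspect the feature matrix produced by every step except the classifier.
toy_features = toy_model[:-1].transform(toy)
```

Because the imputer and scaler live inside the pipeline, their statistics are learned during `fit` and reused unchanged at prediction time.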


In [1]:
import numpy as np
import pandas as pd
import wget
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, OneHotEncoder, Binarizer, MultiLabelBinarizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [2]:
wget.download('https://raw.githubusercontent.com/HSE-LAMBDA/MLDM-2022/main/01-intro/train.csv', '/Users/spokr/OneDrive/Documents/GitHub/train.csv')

100% [..............................................................................] 60302 / 60302

'/Users/spokr/OneDrive/Documents/GitHub/train (1).csv'

#### About the data
Here are some of the columns:
* Name - a string with person's full name
* Survived - 1 if a person survived the shipwreck, 0 otherwise.
* Pclass - passenger class. Pclass == 3 is cheap'n'cheerful, Pclass == 1 is for moneybags.
* Sex - a person's gender
* Age - age in years, if available
* SibSp - number of siblings and spouses aboard the ship
* Parch - number of parents and children aboard the ship
* Fare - ticket cost
* Embarked - port where the passenger embarked
 * C = Cherbourg; Q = Queenstown; S = Southampton
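
Since `SibSp` and `Parch` both count relatives aboard, they can be combined into new features. The snippet below is a minimal sketch on made-up rows; the column names `FamilySize` and `IsAlone` are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
toy = pd.DataFrame({"SibSp": [1, 1, 0, 0], "Parch": [0, 0, 0, 2]})

# Relatives aboard plus the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
# 1 if the passenger travels without relatives, 0 otherwise.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
```

Whether such features improve the accuracy would have to be checked on the validation split.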

In [3]:
data = pd.read_csv("train.csv", index_col='PassengerId')
data.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [4]:
data.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
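
The counts above show that `Age` and `Cabin` have many gaps. One way to follow the "missing value indicator" suggestion is sketched below. The helper name `add_missing_indicators` and the toy frame are hypothetical, and the median used for filling is taken from whichever frame is passed in.

```python
import numpy as np
import pandas as pd

def add_missing_indicators(frame, columns):
    """Return a copy with a 0/1 `<col>_missing` flag for each listed column."""
    out = frame.copy()
    for col in columns:
        out[f"{col}_missing"] = out[col].isnull().astype(int)
    return out

# Hypothetical rows that mimic the gaps in Age and Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Flag the gaps first, then fill Age so the flag still records where they were.
flagged = add_missing_indicators(toy, ["Age", "Cabin"])
flagged["Age"] = flagged["Age"].fillna(flagged["Age"].median())
```

Creating the flags before filling matters: after `fillna` the information about which rows were missing is gone.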

In [5]:

def feature_selection_and_preprocessing(dataset):
    """Select the model inputs and turn them into numeric features."""
    features = dataset[["Age", "Fare", "Sex", "Embarked", "SibSp", "Parch", "Pclass"]].copy()

    # Fill in missing values (statistics come from whichever frame is passed in)
    features["Embarked"] = features["Embarked"].fillna(features["Embarked"].mode()[0])
    features["Age"] = features["Age"].fillna(features["Age"].median())

    # Encode the binary variable: male -> 1, female -> 0
    features["Sex"] = features["Sex"].map({"male": 1, "female": 0})

    # One-hot encode Pclass and Embarked as 0/1 indicator columns
    for pclass in (1, 2, 3):
        features[f"Pclass_{pclass}"] = (features["Pclass"] == pclass).astype(int)
    features = features.drop(columns="Pclass")
    for port in ("C", "S", "Q"):
        features[f"Embarked_{port}"] = (features["Embarked"] == port).astype(int)
    features = features.drop(columns="Embarked")

    # Transform the quantitative variables
    features["Age_sin"] = np.sin(1 + features["Age"])
    features["Fare_log"] = np.log1p(features["Fare"])

    return features
    
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    LogisticRegression(),
)

# Validation code (do not touch)
data = pd.read_csv("train.csv", index_col='PassengerId')
data_train, data_test = train_test_split(data, test_size=200, random_state=42)

model.fit(
    feature_selection_and_preprocessing(
        data_train.drop('Survived', axis=1)
    ),
    data_train['Survived']
)

train_predictions = model.predict(
    feature_selection_and_preprocessing(
        data_train.drop('Survived', axis=1)
    )
)

test_predictions = model.predict(
    feature_selection_and_preprocessing(
        data_test.drop('Survived', axis=1)
    )
)

print("Train accuracy:", accuracy_score(
    data_train['Survived'],
    train_predictions
))
print("Test accuracy:", accuracy_score(
    data_test['Survived'],
    test_predictions
))

Train accuracy: 0.8523878437047757
Test accuracy: 0.84


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
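
The warning above says that the default `lbfgs` solver stopped at its iteration limit (100 by default) before converging. Two common remedies are raising `max_iter` and rescaling after `PolynomialFeatures`, since products of standardized columns are no longer standardized. The sketch below shows both on synthetic data. The value `max_iter=1000` and the second `StandardScaler` are assumptions, and their effect on the accuracy reported above has not been tested.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with columns on very different scales (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * [1.0, 10.0, 100.0, 0.1]
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

rescaled_model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(include_bias=False),  # the classifier fits its own intercept
    StandardScaler(),                        # rescale the polynomial terms as well
    LogisticRegression(max_iter=1000),       # higher iteration limit than the default
)
rescaled_model.fit(X, y)

# Number of iterations the solver actually used.
n_iter = rescaled_model[-1].n_iter_[0]
```

If `n_iter` stays below `max_iter`, the solver stopped because it met its tolerance rather than because it hit the limit.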
