# Spaceship. Part 4. New start.

## Task description

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

## Starting over

We'll start over, tweaking some steps of the process and testing every step using the best model we've found in Part 3.

We'll repeat all the commentary, so readers can conveniently start from Part 4 without reading the previous parts.

## Test function

But first, let's create a function that lets us easily test every new step by reporting the average cross-validation ROC AUC and accuracy scores, and that also prepares data for new submissions:

In [None]:
# Import the ROC AUC metric and the stratified K-fold splitter
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Random seed for reproducibility
SEED = 123

# Prepare our best model for training
from sklearn.ensemble import RandomForestClassifier
model_for_tests = RandomForestClassifier(
    random_state=SEED,
    n_estimators=516,
    criterion='log_loss',
    max_depth=17,
    max_features=0.7,
    max_leaf_nodes=123,
    min_impurity_decrease=0.00020380822483963789,
    min_samples_leaf=2,
    max_samples=0.9999360987512214,
    n_jobs=-1,
)



def get_cv_scores(train, test, model, scores_df, verbose=1):
    
    '''
    This function takes train and test sets, as well as a model for cross validation and a DataFrame with previous scores.
    
    Setting verbose to 0 prevents the function from printing the updated scores.
    
    It returns:
        
        -) Updated DataFrame with new:
            1) Average training ROC AUC score.
            2) Average cross-validation ROC AUC score.
            3) Average training accuracy score. 
            4) Average cross-validation accuracy score.
        
        -) A dataset for a new submission.
    '''
    
    # Create a StratifiedKFold object (6 splits with equal proportion of positive target values)
    skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
    
    # Empty lists for collecting scores
    train_roc_auc_scores = []
    cv_roc_auc_scores = []
    train_accuracy_scores = []
    cv_accuracy_scores = []
    
    # Iterate through folds
    for train_index, cv_index in skf.split(train.drop('Transported', axis=1), train['Transported']):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[cv_index]
        
        # Fit the model
        model.fit(cv_train.drop('Transported', axis=1), cv_train['Transported']) 
        
        # Calculate scores and append to the scores lists
        train_pred_proba = model.predict_proba(cv_train.drop('Transported', axis=1))[:, 1]
        train_roc_auc_scores.append(roc_auc_score(cv_train['Transported'], train_pred_proba))
        cv_pred_proba = model.predict_proba(cv_test.drop('Transported', axis=1))[:, 1]
        cv_roc_auc_scores.append(roc_auc_score(cv_test['Transported'], cv_pred_proba))
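        # Accuracy: for scikit-learn classifiers, score() returns the mean accuracy
        train_accuracy_scores.append(model.score(cv_train.drop('Transported', axis=1), cv_train['Transported']))
        cv_accuracy_scores.append(model.score(cv_test.drop('Transported', axis=1), cv_test['Transported']))

    # Average each score over the folds, in the order listed in the docstring
    avg_scores = [sum(scores) / len(scores) for scores in
                  (train_roc_auc_scores, cv_roc_auc_scores, train_accuracy_scores, cv_accuracy_scores)]

    # ASSUMPTION: scores_df has exactly four columns, in the docstring order, and a default integer index
    scores_df.loc[len(scores_df)] = avg_scores

    if verbose:
        print(scores_df)

    # ASSUMPTION: the submission data are predictions for the test set
    # from the model refitted on the whole training set
    model.fit(train.drop('Transported', axis=1), train['Transported'])
    submission_pred = model.predict(test)

    return scores_df, submission_pred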
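
As a cross-check of the fold loop above, the same four averages can be obtained from scikit-learn's `cross_validate`. The following is only a minimal sketch under stated assumptions: the data are synthetic (generated with `make_classification`, not the competition data), and `demo_model` and `avg_scores` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

SEED = 123

# Synthetic stand-in for the competition data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=SEED)

# Same splitting scheme as in the function: 6 stratified, shuffled folds
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)

# A small forest keeps the sketch fast; it is not the tuned model from Part 3
demo_model = RandomForestClassifier(n_estimators=50, random_state=SEED)

# Score every fold on both metrics, for the training and the held-out part
results = cross_validate(
    demo_model, X, y,
    cv=skf,
    scoring=['roc_auc', 'accuracy'],
    return_train_score=True,
)

# Average over the folds, mirroring the four values described in the docstring
avg_scores = {
    'train_roc_auc': results['train_roc_auc'].mean(),
    'cv_roc_auc': results['test_roc_auc'].mean(),
    'train_accuracy': results['train_accuracy'].mean(),
    'cv_accuracy': results['test_accuracy'].mean(),
}
print(avg_scores)
```

If the averages from the hand-written loop and from `cross_validate` differ noticeably on the same data, model, and seed, that usually points to a bug in the loop, such as scoring on the wrong fold.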
                         
