# Santander Customer Satisfaction
Before starting, a short recap of the problem. The dataset is an anonymized set of Santander customers with a large number of numeric features and one binary target indicating whether the customer is satisfied (1 for unsatisfied customers, 0 for satisfied customers).

**Goal**: build a model that predicts customer satisfaction from the given features.

- Provided Data
    - *train.csv*: labeled data containing features and target values. Each data sample (customer) is a row in the csv

- Problem Description
    - **Type**: binary classification problem
    - **Evaluation metric (Santander)**: area under the ROC curve (AUC)
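
To make the metric concrete, here is a minimal sketch of how AUC can be computed from labels and predicted scores. It uses the rank-based (Mann-Whitney) formulation and plain Python only. The function name `roc_auc` is illustrative and not part of this repository; the solution itself may rely on a library implementation instead.

```python
def roc_auc(labels, scores):
    """Return the AUC: the probability that a randomly chosen positive
    (label 1, here an unsatisfied customer) gets a higher score than a
    randomly chosen negative (label 0). Ties count as one half."""
    n = len(scores)
    # Sort sample indices by score and assign 1-based ranks,
    # giving tied scores the average of the ranks they span.
    order = sorted(range(n), key=lambda idx: scores[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # Mann-Whitney U statistic of the positive class, normalized to [0, 1].
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy example with made-up labels and scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of satisfied and unsatisfied customers.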


## Before running solution
Make sure the dependencies listed in "__requirements.txt__" are installed in their respective versions.

## Run Solution
The solution notebook is "__main_model.ipynb__". Open it and follow the steps there.

The results from this script are saved in "__./data/output/__":
* __rf_winner.pkl__: a pickled Python dictionary. The key "rf_object" contains the best-performing classifier; the key "auc_test" contains the highest AUC-ROC value on the test set.
* __rf_winner.png__: a plot of the AUC-ROC of the "rf_object" over the cross-validation values
* __log__: subdirectory containing:
    * __log.csv__: comma-separated table of the models that successively achieved the best AUC-ROC values during cross-validation, including the model that achieved the best AUC-ROC value on the test set
    * __cv_roc_plot_n.png__: image corresponding to n-th entry in log.csv
* __log_backup__: folder containing the output of a run of "__main_model.ipynb__" on a local machine before delivery to Combient
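
As a minimal sketch, the saved result could be read back as shown here. The keys "rf_object" and "auc_test" and the output path are taken from the description above; the helper name `load_winner` is hypothetical and not part of this repository. Unpickling the classifier presumably requires the same package versions as in "__requirements.txt__".

```python
import pickle
from pathlib import Path


def load_winner(path="./data/output/rf_winner.pkl"):
    """Load the pickled result dictionary and return (classifier, test AUC).

    Hypothetical helper: only the dictionary keys come from the README.
    """
    with Path(path).open("rb") as handle:
        winner = pickle.load(handle)
    return winner["rf_object"], winner["auc_test"]


if __name__ == "__main__":
    result_path = Path("./data/output/rf_winner.pkl")
    # Only attempt to load if the notebook has already produced the file.
    if result_path.exists():
        model, auc_test = load_winner(result_path)
        print(f"Test AUC of the saved model: {auc_test}")
    else:
        print(f"{result_path} not found; run main_model.ipynb first.")
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.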
    
## Folder Structure
* __data__: contains input and output data
* __include__: contains helper modules for main_model.ipynb
* __presi__: contains the presentation
* __00_ExploreData.ipynb__: first notebook, intended for an initial evaluation of the provided data
* __01_test_pipeline.ipynb__: notebook for testing; it was overwritten several times
* __requirements.txt__: Python environment requirements
* __main_model.ipynb__: see "Run Solution"





In [1]:
import autoreload

%load_ext autoreload
%autoreload 2

In [2]:
# ======================================================
# Load Data
# ======================================================
from explore_data import load_data

In [3]:
# Get relevant directories
src_dir, data_dir, train_data_path = load_data.get_directories()

Working directory: 		 C:\Users\dpere\Documents\combient_challenge
Path for data files: 		 C:\Users\dpere\Documents\combient_challenge\data
Path to trainign data file: 	 C:\Users\dpere\Documents\combient_challenge\data\train.csv


In [4]:
# Load training data csv as a pandas dataframe
pd_train = load_data.get_dataframe(train_data_path)