# Data Science Intern Interview Task

#### Note: All calculations should be done in this notebook using Python 3

## Part 1

You are given three files by a client:
 - 'Transactions.csv' contains a list of transactions for a bank,
 - 'EUR Exchange Rates.csv' contains a list of Euro to Pound Sterling exchange rates,
 - 'USD Exchange Rates.csv' contains a list of US Dollar to Pound Sterling exchange rates.
 
The client has asked for monthly forecasts of their data.

Your task is to clean and transform these files before they can be used for analysis.

Please produce a single CSV file covering the period Jan-2015 to Feb-2019, with one row per calendar month and the following column headings:
 - Calendar Month
 - Sum of Withdrawals (GBP)
 - Sum of Deposits (GBP)
 - Number of Transactions
 - Account Balance for each account (GBP)
 
You may assume that the Account Balance is zero on 31-Dec-2014.
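
One possible shape for this transformation is sketched below on toy data. The column names (`Date`, `Account`, `Currency`, `Amount`, `Rate`), the sign convention (positive amounts are deposits, negative amounts are withdrawals) and the rate direction (GBP per one unit of foreign currency) are all assumptions made for illustration; the real files may be laid out differently and will need cleaning first.

```python
import pandas as pd

# Toy transactions. Column names and sign convention are hypothetical.
tx = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-05", "2015-01-20", "2015-02-10", "2015-02-15"]),
    "Account": ["A", "B", "A", "A"],
    "Currency": ["GBP", "EUR", "USD", "GBP"],
    "Amount": [100.0, 200.0, -50.0, -30.0],  # assumed: positive = deposit
})

# Toy exchange rates in long format (assumed: GBP per one unit of foreign currency).
# With the real data, the EUR and USD files could be stacked into this shape.
rates = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-20", "2015-02-10"]),
    "Currency": ["EUR", "USD"],
    "Rate": [0.75, 0.65],
})

# Attach the rate for each transaction; GBP rows need no conversion.
tx = tx.merge(rates, on=["Date", "Currency"], how="left")
tx.loc[tx["Currency"] == "GBP", "Rate"] = 1.0
tx["AmountGBP"] = tx["Amount"] * tx["Rate"]

# Split the signed amount into non-negative deposit and withdrawal columns.
tx["Deposit"] = tx["AmountGBP"].clip(lower=0)
tx["Withdrawal"] = (-tx["AmountGBP"]).clip(lower=0)
tx["Month"] = tx["Date"].dt.to_period("M")

# Every month in the reporting window (two months here; Jan-2015 to Feb-2019 in the task).
months = pd.period_range("2015-01", "2015-02", freq="M")

monthly = tx.groupby("Month").agg(
    withdrawals=("Withdrawal", "sum"),
    deposits=("Deposit", "sum"),
    n_transactions=("AmountGBP", "size"),
).reindex(months, fill_value=0)

# Month-end balance per account: cumulative sum of net monthly flows from zero.
balance = (
    tx.pivot_table(index="Month", columns="Account", values="AmountGBP", aggfunc="sum")
    .reindex(months)
    .fillna(0.0)
    .cumsum()
    .add_prefix("Balance ")
)

monthly = monthly.join(balance)
print(monthly)
```

Reindexing against the full month range keeps months with no transactions in the output, and the cumulative sum starts each balance from the zero assumed on 31-Dec-2014.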

In [1]:
# Answer for Part 1



## Part 2

For this task you are given a single file containing patient data: 'diabetes.csv'.

The 'Outcome' column is 1 if the patient has diabetes, and 0 if they do not. The aim of this task is to build a model that estimates the probability that a patient has diabetes, based on all other variables in the dataset.

Some code has been provided for you, but please feel free to edit it and investigate.

We first load the data and split it into a training set and a test set. Please run the code below.

In [2]:
import pandas as pd

# Load diabetes.csv into a dataframe
data = pd.read_csv('diabetes.csv')
print(data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


In [3]:
from sklearn.model_selection import train_test_split

# Split data into features (X) and labels (y)
X = data.drop(columns='Outcome')
y = data['Outcome']

# Split data as 80% training set, and 20% test set
# (random_state fixes the shuffle so the split is the same on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We are given two models to compare. Please run the code below.

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

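def random_forest_predict(X_train, y_train, X_test, y_test):
    """Fit a random forest and return P(Outcome = 1) for each row of X_test.

    y_test is accepted for a uniform signature but is not used.
    """
    rfc = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1, random_state=42)
    rfc.fit(X_train, y_train)

    # Column 1 of predict_proba is the probability of the positive class
    return rfc.predict_proba(X_test)[:, 1]

def knn_predict(X_train, y_train, X_test, y_test):
    """Fit a 5-nearest-neighbours model and return P(Outcome = 1) for each row of X_test.

    y_test is accepted for a uniform signature but is not used.
    """
    # Standardise features using statistics from the training set only
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # Column 1 of predict_proba is the probability of the positive class
    return knn.predict_proba(X_test)[:, 1]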

rf_probabilities = random_forest_predict(X_train, y_train, X_test, y_test)
knn_probabilities = knn_predict(X_train, y_train, X_test, y_test)

Please compare y_test to rf_probabilities and knn_probabilities and give a recommendation of which model to use. You may want to use different metrics and/or produce plots to back up your answer.
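
As a starting point, the comparison could be organised as a small table of probability-based metrics. The sketch below is one possible approach, not a prescribed answer: the helper name `compare_probabilities` and the toy labels and probabilities are invented for illustration, and the choice of metrics (ROC AUC, average precision, Brier score) is an assumption about what is relevant here.

```python
import pandas as pd
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

def compare_probabilities(y_true, model_probs):
    """Return one row of metrics per model.

    model_probs maps a model name to its predicted probabilities of class 1.
    (Hypothetical helper, not part of the provided notebook code.)
    """
    rows = {}
    for name, probs in model_probs.items():
        rows[name] = {
            "ROC AUC": roc_auc_score(y_true, probs),  # ranking quality, higher is better
            "Average precision": average_precision_score(y_true, probs),  # higher is better
            "Brier score": brier_score_loss(y_true, probs),  # mean squared error, lower is better
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy labels and probabilities, invented purely for illustration
toy_y = [0, 0, 1, 1, 0, 1]
toy_probs = {
    "model_a": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9],
    "model_b": [0.5] * 6,  # uninformative constant prediction
}

summary = compare_probabilities(toy_y, toy_probs)
print(summary)
```

In the notebook itself, the same helper could be called as `compare_probabilities(y_test, {"Random forest": rf_probabilities, "KNN": knn_probabilities})`. Note that a 5-neighbour KNN can only output the six probabilities 0, 0.2, ..., 1, which may affect threshold-based plots such as ROC curves.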

In [5]:
# Answer for Part 2

