Disclaimer: Despite the title of the notebook, we won't really do *deep* learning here, since deep learning only pays off when we have large, unstructured datasets.

# Import

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import keras
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("diabetes.csv")
df.head()

# Preparing data

In [None]:
y = df["class"].copy()
X = df.drop("class", axis=1)
cols = X.columns
X.shape

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train = pd.DataFrame(sc.fit_transform(X_train), columns=cols)
X_test = pd.DataFrame(sc.transform(X_test), columns=cols)

# Models

## Baseline model

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)

baseline = lr.score(X_test, y_test)
print("Baseline accuracy", baseline)

## Neural network model

### Shallow NN

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# np.random.seed(42)

model = Sequential()

model.add(Dense(4, input_shape=(len(X_train.columns),), activation="relu"))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs = 500, validation_data = (X_test, y_test))

accuracy = model.evaluate(X_test, y_test)[1]

print('Accuracy:', accuracy)
print("Diff with baseline:", baseline - accuracy)

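# Older Keras versions name the metric "acc"; newer ones use "accuracy"
acc_key = "acc" if "acc" in history.history else "accuracy"

plt.figure(figsize=(10,6))
plt.plot(history.history[acc_key], label="train")
plt.plot(history.history["val_" + acc_key], label="val")
plt.legend()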

### Deep NN

In [None]:
from keras.models import Sequential
from keras.layers import Dense


model = Sequential()

model.add(Dense(16, input_shape=(len(X_train.columns),), activation="relu"))
model.add(Dense(16, activation="relu"))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs = 1000, validation_data = (X_test, y_test))

accuracy = model.evaluate(X_test, y_test)[1]

print('Accuracy:', accuracy)
print("Diff with baseline:", baseline - accuracy)

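# Older Keras versions name the metric "acc"; newer ones use "accuracy"
acc_key = "acc" if "acc" in history.history else "accuracy"

plt.figure(figsize=(10,6))
plt.plot(history.history[acc_key], label="train")
plt.plot(history.history["val_" + acc_key], label="val")
plt.legend()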

The plot clearly shows overfitting: the model keeps fitting the training set better without a matching gain on the validation set.
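
One way to put a number on what the plot shows is to compare the two curves directly. The sketch below is only an illustration: `overfitting_summary` is a hypothetical helper, not part of Keras, and the two accuracy lists are made-up values standing in for the contents of `history.history`.

```python
def overfitting_summary(train_acc, val_acc):
    """Summarize the gap between training and validation accuracy.

    Both arguments are per-epoch accuracy lists of equal length.
    """
    if len(train_acc) != len(val_acc) or not train_acc:
        raise ValueError("need two non-empty lists of equal length")

    # Epoch (0-based) at which validation accuracy peaked
    best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i])

    return {
        "best_epoch": best_epoch,
        "best_val_acc": val_acc[best_epoch],
        # A large positive gap at the end of training suggests overfitting
        "final_gap": train_acc[-1] - val_acc[-1],
    }


# Made-up curves for illustration only, not results from this notebook
train_acc = [0.70, 0.78, 0.84, 0.90, 0.95]
val_acc = [0.69, 0.75, 0.77, 0.76, 0.74]

summary = overfitting_summary(train_acc, val_acc)
print(summary)
```

With a real run, the two lists would come from the `history` object, under whichever key names the installed Keras version uses. A validation peak well before the last epoch, combined with a large final gap, is the numeric counterpart of the diverging curves.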

# Exercise

For this exercise you will use another dataset called HTRU2.

You can find it here: https://archive.ics.uci.edu/ml/datasets/HTRU2

This dataset contains data about pulsar candidates.

    Pulsars are a rare type of Neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter (see [2] for more uses).

    As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes.

    Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional info, each candidate could potentially describe a real pulsar. However in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. The class labels used are 0 (negative) and 1 (positive).
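
With classes this imbalanced, it helps to know what accuracy a trivial classifier would reach before judging a model. A minimal sketch, using only the counts quoted above (the variable names are illustrative and not part of the dataset):

```python
# Class counts quoted in the dataset description above
n_negative = 16259  # spurious candidates (RFI/noise)
n_positive = 1639   # real pulsars

n_total = n_negative + n_positive

# Accuracy of a classifier that always predicts the majority class (0)
majority_accuracy = n_negative / n_total

print("Total examples:", n_total)
print("Share of real pulsars:", round(n_positive / n_total, 4))
print("Always-negative accuracy:", round(majority_accuracy, 4))
```

An always-negative classifier already reaches roughly 0.91 accuracy on the full dataset, so a model scoring close to that value has learned little beyond the class frequencies.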

Import data

In [None]:
df = pd.read_csv("HTRU_2.csv")
df.head()

Isolate target variable

In [None]:
y = df["target"].copy()
X = df.drop("target", axis=1)
X.shape

Split into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Create a neural network model

In [None]:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(16, input_shape = (len(X_train.columns),), activation = "relu"))
model.add(Dense(10, activation = "tanh"))
model.add(Dense(1, activation = "sigmoid"))

model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

model.summary()

Train your model and check its performance

In [None]:
history = model.fit(X_train, y_train, epochs = 1000, validation_data = (X_test, y_test))

accuracy = model.evaluate(X_test, y_test)[1]

print('Accuracy:', accuracy)

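# Older Keras versions name the metric "acc"; newer ones use "accuracy"
acc_key = "acc" if "acc" in history.history else "accuracy"

plt.figure(figsize=(10,6))
plt.plot(history.history[acc_key], label="train")
plt.plot(history.history["val_" + acc_key], label="val")
plt.legend()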

In [None]:
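# Older Keras versions name the metric "acc"; newer ones use "accuracy"
acc_key = "acc" if "acc" in history.history else "accuracy"

plt.figure(figsize=(10,6))
plt.plot(history.history[acc_key], label="train")
plt.plot(history.history["val_" + acc_key], label="val")
plt.legend()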