## Logistic Regression

This notebook evaluates logistic regression on the [SPECTF Heart Data Set](https://archive.ics.uci.edu/ml/datasets/SPECTF+Heart) and the [Skin Segmentation Data Set](https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation), using numpy, pandas and scikit-learn.

## 1. Setup
Import the required dependencies and define the path to the datasets.

In [35]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
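# cross_val_score lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score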
from sklearn import metrics

In [None]:
# Path to the datasets
path = '../core/src/test/resources/'

## 2. SPECTF Heart Data Set

Load the training and testing datasets and convert them into numpy arrays. The first column holds the class label; the remaining columns are the features.

In [54]:
train_csv = pd.read_csv(path + 'spectf.train.csv')
test_csv = pd.read_csv(path + 'spectf.test.csv')

train_data = np.array(train_csv)
X_train = train_data[:, 1:]
y_train = np.ravel(train_data[:, 0])

test_data = np.array(test_csv)
X_test = test_data[:, 1:]
y_test = np.ravel(test_data[:, 0])
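
The label-first slicing used above can be checked on a tiny in-memory table. This is a minimal sketch: the column names and values below are made up for illustration and are not taken from the SPECTF files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row CSV with the label in column 0 (made-up values).
csv_text = "label,f1,f2\n1,59,52\n0,72,62\n1,71,62\n"
frame = pd.read_csv(io.StringIO(csv_text))

data = np.array(frame)
X = data[:, 1:]           # every column after the first: features
y = np.ravel(data[:, 0])  # first column: class label

print(X.shape, y.shape)
```

Note that `pd.read_csv` treats the first line as column names by default. If the CSV files carry no header row, passing `header=None` keeps the first record from being consumed as a header.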

In [65]:
# C is the inverse regularization strength; fit_intercept=True is the scikit-learn default
model = LogisticRegression(C=0.1, fit_intercept=True)
model = model.fit(X_train, y_train)

# Check accuracy on the training dataset
model.score(X_train, y_train)

0.97468354430379744
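
Accuracy measured on the data the model was fitted to tends to be optimistic. One common alternative is a cross-validated estimate with `cross_val_score`. The sketch below runs on synthetic stand-in data from `make_classification` (an assumption for illustration, not the SPECTF features), and the names `X_demo`, `y_demo` and `model_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the SPECTF features).
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=0)

model_demo = LogisticRegression(C=0.1, max_iter=1000)

# 5-fold cross-validation: each fold is held out once while the model
# is fitted on the remaining four folds.
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring='accuracy')

print(scores.mean(), scores.std())
```

The same call could be applied to `X_train` and `y_train` to get a held-out estimate without touching the test set.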

In [64]:
preds = model.predict(X_test)
probs = model.predict_proba(X_test)

# Check accuracy on the testing dataset
print(metrics.accuracy_score(y_test, preds))

# Rows of the confusion matrix are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, preds))
print(metrics.classification_report(y_test, preds))

0.618279569892
[[  7   8]
 [ 63 108]]
             precision    recall  f1-score   support

          0       0.10      0.47      0.16        15
          1       0.93      0.63      0.75       171

avg / total       0.86      0.62      0.71       186
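
The figures in the report follow directly from the confusion matrix. As a quick check, the sketch below recomputes accuracy, precision and recall from the four counts printed above, using plain numpy.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes.
cm = np.array([[7, 8],
               [63, 108]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column sums)
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row sums)

print(round(accuracy, 3))
print(np.round(precision, 2), np.round(recall, 2))
```

Only 7 of the 70 samples predicted as class 0 actually belong to class 0, which is where the 0.10 precision comes from. Together with the gap between the training accuracy (0.97) and the test accuracy (0.62), this suggests the model generalizes poorly on this split.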

