This notebook explores the basic details of several scikit-learn data sets.

In [2]:
###############################
# RUN THIS CELL BEFORE CODING #
###############################

# The copy of UCI ML Breast Cancer Wisconsin (Diagnostic) dataset is
# downloaded from: https://goo.gl/U2Uwz2
from sklearn.datasets import fetch_california_housing
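# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# this import requires an older scikit-learn version.
from sklearn.datasets import load_boston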
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_iris
from sklearn.datasets import load_wine

import numpy as np
import itertools 
import time

from matplotlib import rcParams, pyplot as plt
from itertools import product
from copy import deepcopy

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

from sklearn.svm import SVC
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier

%reload_ext autoreload
%autoreload 2

rcParams['figure.figsize'] = (10.0, 10.0)

globalStart = time.time()
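
The data set cells that follow all repeat the same setup: pull out `data.data` and `data.target`, list the feature names, and print a short overview. A minimal sketch of one way to factor that out is shown here. `summarize` is a hypothetical helper name, not part of scikit-learn, and it is demonstrated on `load_iris` only because that loader ships with the library and needs no download.

```python
from sklearn.datasets import load_iris


def summarize(data):
    """Print a short overview of a scikit-learn Bunch and return its parts.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X, y = data.data, data.target
    feature_names = list(data.feature_names)
    # target_names may be absent for some loaders, so fall back to None.
    raw_targets = getattr(data, "target_names", None)
    target_names = list(raw_targets) if raw_targets is not None else None

    print("SETUP")
    print(X.shape, y.shape)
    print("Features: \n" + str(feature_names))
    if target_names is not None:
        print("Targets: \n" + str(target_names))
    return X, y, feature_names, target_names


X, y, feature_names, target_names = summarize(load_iris())
```

The same call should work with any of the loaders imported above, since each returns an object exposing `data`, `target`, and `feature_names`.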

# CALIFORNIA HOUSING DATA SET

In [3]:
###############################
# RUN THIS CELL BEFORE CODING #
###############################

data = fetch_california_housing()

# Setup variables
X = data.data
y = data.target

feature_names = list(data.feature_names)

print("SETUP")
print(X.shape, y.shape)
print("Features: \n" + str(feature_names))
print(data.DESCR)


SETUP
(20640, 8) (20640,)
Features: 
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from t

# BOSTON HOUSING DATA SET

In [5]:
###############################
# RUN THIS CELL BEFORE CODING #
###############################

### https://www.kaggle.com/prasadperera/the-boston-housing-dataset
### https://www.ritchieng.com/machine-learning-project-boston-home-prices/
### https://www.kaggle.com/schirmerchad/bostonhoustingmlnd
### https://www.kaggle.com/heptapod/uci-ml-datasets
### https://github.com/udacity/machine-learning/tree/master/projects/boston_housing

## Features of interest - crime rate (CRIM), average rooms (RM), nitric oxides concentration (NOX), distance to employment centres (DIS), property-tax rate (TAX), lower-status population share (LSTAT)

## Target - MEDV (Median value of owner-occupied homes in $1000's) 

data = load_boston()

# Setup variables
X = data.data
y = data.target

feature_names = list(data.feature_names)

print("SETUP")
print(X.shape, y.shape)
print("Features: \n" + str(feature_names))
print(data.DESCR)

SETUP
(506, 13) (506,)
Features: 
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - 
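
`load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell above only runs on older installations. The sketch below is one way to guard the call. `try_load_boston` is a hypothetical helper, and returning `None` when the loader is missing is an assumption made for illustration.

```python
import warnings


def try_load_boston():
    """Return the Boston Bunch, or None if this scikit-learn does not provide it.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    try:
        from sklearn.datasets import load_boston
    except ImportError:
        # Raised by scikit-learn >= 1.2, where the loader was removed.
        return None
    with warnings.catch_warnings():
        # Versions 1.0 and 1.1 still load the data but emit a deprecation warning.
        warnings.simplefilter("ignore")
        return load_boston()


boston = try_load_boston()
if boston is None:
    print("load_boston is unavailable in this scikit-learn version")
else:
    print(boston.data.shape, boston.target.shape)
```

On a recent installation this prints the fallback message rather than failing the whole cell, which keeps the remaining data set cells runnable.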

# BREAST CANCER DATA SET

In [4]:
###############################
# RUN THIS CELL BEFORE CODING #
###############################

data = load_breast_cancer()

# Setup variables
X = data.data
y = data.target

feature_names = list(data.feature_names)
target_names = list(data.target_names) # ['malignant', 'benign']: label 0 = malignant, 1 = benign

print("SETUP")
print(X.shape, y.shape)
print("Features: \n" + str(feature_names))
print("Targets: \n" + str(target_names))
print(data.DESCR)

SETUP
(569, 30) (569,)
Features: 
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
Targets: 
['malignant', 'benign']
Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
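
The printed description gives the number of instances but not how they divide between the two classes, which matters when choosing a stratified split or an evaluation metric. A short sketch of how the per-class counts could be read from the target vector follows; the variable name `class_counts` is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# np.bincount counts how often each integer label (0, 1) occurs in the target.
counts = np.bincount(data.target)

# Pair each count with its class name; the label index matches target_names.
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}
print(class_counts)
```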


# IRIS DATA SET

In [5]:
###############################
# RUN THIS CELL BEFORE CODING #
###############################

data = load_iris()

# Setup variables
X = data.data
y = data.target

feature_names = list(data.feature_names)
target_names = list(data.target_names) # output: ['setosa', 'versicolor', 'virginica']

print("SETUP")
print(X.shape, y.shape)
print("Features: \n" + str(feature_names))
print("Targets: \n" + str(target_names))
print(data.DESCR)

SETUP
(150, 4) (150,)
Features: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Targets: 
['setosa', 'versicolor', 'virginica']
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3%
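
The Min, Max, Mean, and SD columns of the summary table above can be recomputed from `X` directly, and the class distribution from `y`. The following is a rough sketch using NumPy. It assumes the population standard deviation (`ddof=0`), which may differ slightly from the convention behind the published table, and it does not attempt the class correlation column.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

# One (min, max, mean, sd) tuple per feature; X.T iterates over columns.
stats = {
    name: (col.min(), col.max(), col.mean(), col.std())
    for name, col in zip(data.feature_names, X.T)
}
for name, (lo, hi, mean, sd) in stats.items():
    print(f"{name:<20}{lo:5.1f}{hi:5.1f}{mean:7.2f}{sd:7.2f}")

# Share of samples in each class, in the order given by data.target_names.
class_counts = np.bincount(y)
print(class_counts / class_counts.sum())
```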

# WINE DATA SET

In [6]:
###############################
# RUN THIS CELL BEFORE CODING #
###############################

data = load_wine()

# Setup variables
X = data.data
y = data.target

feature_names = list(data.feature_names)
target_names = list(data.target_names) # output: ['class_0', 'class_1', 'class_2']

print("SETUP")
print(X.shape, y.shape)
print("Features: \n" + str(feature_names))
print("Targets: \n" + str(target_names))
print(data.DESCR)

SETUP
(178, 13) (178,)
Features: 
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Targets: 
['class_0', 'class_1', 'class_2']
Wine Data Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
        - 1) Alcohol
        - 2) Malic acid
        - 3) Ash
        - 4) Alcalinity of ash
        - 5) Magnesium
        - 6) Total phenols
        - 7) Flavanoids
        - 8) Nonflavanoid phenols
        - 9) Proanthocyanins
        - 10) Color intensity
        - 11) Hue
        - 12) OD280/OD315 of diluted wines
        - 13) Proline
        - class:
                - class_0
                - class_1
                - class_2

    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                  
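
The printed description lists 50 instances in each of three classes, which cannot add up to the 178 instances it also reports. The actual class sizes can be read from the target vector, as in this small sketch.

```python
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()

# Count the samples carrying each integer label (0, 1, 2).
counts = np.bincount(data.target)

for name, n in zip(data.target_names, counts):
    print(name, n)
print("total:", counts.sum())
```

Because the classes are not equally sized, a stratified split (for example with the `StratifiedKFold` imported at the top of the notebook) is one option for keeping the class proportions stable across folds.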