# Logistic Regression Lab 2

Scikit-Learn includes several sample datasets which can demonstrate
logistic regression's usefulness.

This is a very free-form lab: you won't be walked through it step-by-step,
so you might want to keep some other examples open.

In [71]:
import sklearn.datasets

We will look at the Wisconsin breast cancer database and a classic
dataset of [different kinds of iris flowers](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [72]:
bc = sklearn.datasets.load_breast_cancer()
print(bc.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)
        
        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.
 

In [73]:
iris = sklearn.datasets.load_iris()
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

# Wisconsin

In the Wisconsin breast cancer database, you are trying to predict whether
a tumour is malignant or benign. The database consists of the measurements
of the tumour (`bc.data`) and the nature of the tumour (`bc.target`): 0 = malignant, 1 = benign.
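
The label encoding does not have to be remembered: scikit-learn stores the class names alongside the data in `bc.target_names`, where position `i` holds the name for the integer label `i`. A minimal sketch of that check (the variable names `label_of` and `counts` are illustrative, not part of the original notebook):

```python
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# target_names[i] is the class name that the integer label i stands for
label_of = {i: str(name) for i, name in enumerate(bc.target_names)}
print(label_of)

# Count how many tumours fall into each class
counts = {str(name): int((bc.target == i).sum())
          for i, name in enumerate(bc.target_names)}
print(counts)
```

Checking the class counts is also useful before judging an accuracy figure, because the two classes are not equally frequent.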

Try using various combinations of parameters in a logistic regression.

Validate your results with cross-validation.
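
One way to approach both tasks at once is to score several parameter settings with k-fold cross-validation. The following is only a sketch under some assumptions that are not in the original lab: the three `C` values are arbitrary, `cv=5` is an arbitrary number of folds, and the features are standardised with `StandardScaler` inside a pipeline so that the scaling is learned on each training fold only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()

cv_results = {}
for C in [0.01, 1.0, 100.0]:  # illustrative values; smaller C = stronger regularisation
    # Scaling lives inside the pipeline so it is refitted on every training fold
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    # Five accuracy scores, one per held-out fold
    scores = cross_val_score(model, bc.data, bc.target, cv=5)
    cv_results[C] = scores.mean()
    print("C=%g: mean accuracy %.3f" % (C, scores.mean()))
```

Each score comes from rows the model did not see during fitting, so the means are a fairer basis for comparing settings than accuracy on the training data.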



In [74]:
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

In [75]:
logreg = LogisticRegression(C=1e9)

In [76]:
import pandas as pd
bcdf = pd.DataFrame(data=bc.data, columns=['radius (mean)',
                                    'texture (mean)',
                                    'perimeter (mean)',
                                    'area (mean)',
                                    'smoothness (mean)',
                                    'compactness (mean)',
                                    'concavity (mean)',
                                    'concave points (mean)',
                                    'symmetry (mean)',
                                    'fractal dimension (mean)',
                                    'radius (standard error)',
                                    'texture (standard error)',
                                    'perimeter (standard error)',
                                    'area (standard error)',
                                    'smoothness (standard error)',
                                    'compactness (standard error)',
                                    'concavity (standard error)',
                                    'concave points (standard error)',
                                    'symmetry (standard error)',
                                    'fractal dimension (standard error)',
                                    'radius (worst)',
                                    'texture (worst)',
                                    'perimeter (worst)',
                                    'area (worst)',
                                    'smoothness (worst)',
                                    'compactness (worst)',
                                    'concavity (worst)',
                                    'concave points (worst)',
                                    'symmetry (worst)',
                                    'fractal dimension (worst)',
                                   ])

In [77]:
bcdf['target'] = bc.target

In [78]:
bcdf.head()

| | radius (mean) | texture (mean) | perimeter (mean) | area (mean) | smoothness (mean) | compactness (mean) | concavity (mean) | concave points (mean) | symmetry (mean) | fractal dimension (mean) | ... | texture (worst) | perimeter (worst) | area (worst) | smoothness (worst) | compactness (worst) | concavity (worst) | concave points (worst) | symmetry (worst) | fractal dimension (worst) | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [79]:
# Keep the 30 feature columns; everything except the label column
columns = bcdf.drop('target', axis=1)

In [80]:
import seaborn as sb
sb.pairplot(bcdf)

<seaborn.axisgrid.PairGrid at 0x16f845b90>

In [66]:
logreg.fit(columns,bcdf.target)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [67]:
prediction = logreg.predict(columns)
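
The predictions above are made on the same rows the model was fitted to, so comparing them with the true labels gives a training-set score, which tends to be optimistic. A minimal sketch of that comparison follows. It assumes `accuracy_score` and `confusion_matrix` from `sklearn.metrics`, which the original notebook does not use, and it passes `solver="liblinear"` explicitly to match the solver shown in the fit output.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

bc = load_breast_cancer()

# Same setting as the notebook: a very large C means almost no regularisation
logreg = LogisticRegression(C=1e9, solver="liblinear")
logreg.fit(bc.data, bc.target)
prediction = logreg.predict(bc.data)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(bc.target, prediction)
train_accuracy = accuracy_score(bc.target, prediction)
print(cm)
print("training accuracy: %.3f" % train_accuracy)
```

The off-diagonal entries of the matrix separate the two kinds of mistake (a malignant tumour predicted benign, and the reverse), which a single accuracy number hides.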

# Irises

There are three kinds of flowers in the dataset:

- [Setosa](https://en.wikipedia.org/wiki/Iris_setosa) ( = 0)

- [Versicolor](https://en.wikipedia.org/wiki/Iris_versicolor) ( = 1)

- [Virginica](https://en.wikipedia.org/wiki/Iris_virginica) ( = 2)

Try using various combinations of parameters in a logistic regression.

Validate your results with cross-validation.
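
For the three-class case, a parameter search and the validation can be combined with `GridSearchCV`. This is a sketch, not a reference solution: the grid of `C` values, the five folds and `random_state=0` are arbitrary choices, and how the three classes are handled internally depends on the scikit-learn version and solver.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

iris = load_iris()

# The rows are ordered by species, so unshuffled plain k-fold splits would be
# unbalanced; stratified folds keep all three species in every split.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},  # illustrative grid
    cv=folds,
)
search.fit(iris.data, iris.target)

print(search.best_params_)
print("mean cross-validated accuracy: %.3f" % search.best_score_)
```

After fitting, `search.predict` uses the best setting refitted on all rows, and `search.cv_results_` holds the per-setting scores if you want to compare them.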