# BDA Lab 2: Supervised Machine Learning
## Classification and Regression

In [4]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

import warnings 

warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

%matplotlib inline



## Cancer Dataset: Classification

First we load the Breast Cancer dataset, which is a Scikit-learn dataset.

In [5]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print("cancer.keys():\n", cancer.keys())
print("Shape of cancer data:", cancer.data.shape)



cancer.keys():
 dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
Shape of cancer data: (569, 30)


In [6]:
load_breast_cancer()

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

## Question 1
cancer.target is an array consisting of 0s and 1s. We'd like to know the distribution: print it using np.bincount. Then print the names of the features.

In [7]:
np.bincount(cancer.target)  # number of 0s and number of 1s


array([212, 357])
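
The raw counts are easier to read when paired with the class names. A minimal sketch, assuming `cancer.target_names` is ordered so that index 0 names target 0 and index 1 names target 1 (`class_counts` is an illustrative name, not part of the lab):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# np.bincount counts how often each non-negative integer occurs,
# so position i holds the number of samples with target == i.
counts = np.bincount(cancer.target)

# Pair each class name with its count (class_counts is an illustrative name).
class_counts = {str(name): int(n) for name, n in zip(cancer.target_names, counts)}
print(class_counts)
```

The classes are not balanced, which is one reason to stratify the train/test split later on.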

In [8]:
print(cancer.feature_names)  # one of the keys holds the feature names, i.e. the features for my ML model

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [9]:
cancer.feature_names.shape

(30,)

In [10]:
print(cancer.data)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]


## Question 2
Create a train/test split using cancer.data and cancer.target.


In [11]:
# cancer.target holds the labels y: I want to predict whether each sample is 0 or 1
# cancer.data holds the samples (the feature matrix X)

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=0)
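
To see what `stratify` does, the class fractions of the two subsets can be compared with those of the full dataset. This is a minimal sketch that repeats the split above; the names `full_frac`, `train_frac` and `test_frac` are illustrative, and the default `test_size` of 0.25 is assumed:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)

# Fraction of class 0 and class 1 in the full data and in each subset.
full_frac = np.bincount(cancer.target) / len(cancer.target)
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)

print("Split sizes:", X_train.shape, X_test.shape)
print("Class fractions (full): ", full_frac.round(3))
print("Class fractions (train):", train_frac.round(3))
print("Class fractions (test): ", test_frac.round(3))
```

With stratification the three rows of fractions should be nearly identical; without it they can drift apart, especially on small datasets.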



## Question 3
Create a KNeighborsClassifier. Use the fit method with the train data. Then score the model on both the train and test data, and print the results. 

In [12]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [13]:

print('Accuracy of KNN (n_neighbors=5) on the training set: {:.3f}'.format(knn.score(X_train, y_train)))
print('Accuracy of KNN (n_neighbors=5) on the test set: {:.3f}'.format(knn.score(X_test, y_test)))

Accuracy of KNN (n_neighbors=5) on the training set: 0.946
Accuracy of KNN (n_neighbors=5) on the test set: 0.916


## Question 4
Create a loop that builds kNN classifiers with either distance or uniform weighting, with numbers of neighbors varying between 1 and 20. What is the best combination? Produce a list consisting of test accuracy, training accuracy, number of neighbors and weighting choice. The list should be sorted by test accuracy.

<hr>

<i>Hints: </i>
<br>
<i>
Create two dicts: training_accuracy and test_accuracy. Make two loops: an outer loop ranging over the number of neighbors, and an inner loop ranging over the weighting choice. Then build the model with the current options and create a key unique to that combination (e.g. by concatenating the two options). You can then store the current results in the training and test dicts, using the current key. When you're done with the loops, print out the sorted results of the two dicts.</i>

In [14]:
# stratify=cancer.target keeps the class proportions (roughly) the same in the training and test sets; the default split is 75%/25%

X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)

# Create two dicts for the training and test accuracies
training_accuracy = {}
test_accuracy = {}

# Neighbor counts to test: 1 to 10 (inclusive)
neighbors_settings = range(1,11)
w = ["distance", "uniform"]

# Fit a KNN model for every combination of neighbor count and weighting to find the best one
for n_neighbors in neighbors_settings:
    for ws in w:
        
        key = str(n_neighbors) + " " + ws   # n_neighbors must be converted to str before it can be concatenated with ws
                                            # The key combines k and the weighting, e.g. "2 uniform" or "2 distance",
                                            # so each accuracy can be looked up later by that key
    
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=ws)
        knn.fit(X_train, y_train)
        # Store the raw floats; format them only when printing
        training_accuracy[key] = knn.score(X_train, y_train)
        test_accuracy[key] = knn.score(X_test, y_test)


print("Test  -   Train    -  k, weighting\n")

# sorted() iterates over the dict keys; key=test_accuracy.get makes it sort them
# by their test accuracy. The key function is called exactly once per element.
# With reverse=False the order is ascending, so the lowest test accuracy comes
# first and the best combination is printed last.
for key in sorted(test_accuracy, key=test_accuracy.get, reverse=False):
    print("{:.3f}  -  {:.3f}   -   {}".format(test_accuracy[key], training_accuracy[key], key))


Test  -   Train    -  k, weighting

0.888  -  0.977   -   2 uniform
0.902  -  1.000   -   1 distance
0.902  -  1.000   -   1 uniform
0.902  -  1.000   -   2 distance
0.916  -  1.000   -   4 distance
0.916  -  0.934   -   9 uniform
0.916  -  1.000   -   10 distance
0.916  -  0.939   -   10 uniform
0.923  -  1.000   -   3 distance
0.923  -  0.958   -   3 uniform
0.923  -  0.955   -   4 uniform
0.923  -  0.948   -   5 uniform
0.923  -  1.000   -   8 distance
0.923  -  1.000   -   9 distance
0.930  -  1.000   -   5 distance
0.930  -  1.000   -   6 distance
0.930  -  1.000   -   7 distance
0.930  -  0.944   -   7 uniform
0.930  -  0.941   -   8 uniform
0.937  -  0.946   -   6 uniform
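
The run above covers 1 to 10 neighbors, while the question asks for 1 to 20. The following is a sketch of one way to cover the full range, storing each result as a tuple in a list instead of using two dicts. The name `results` and the tuple layout are illustrative choices, and no particular output is implied:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66
)

results = []  # one (test accuracy, training accuracy, k, weighting) tuple per model
for n_neighbors in range(1, 21):  # 1 to 20 inclusive
    for weights in ("distance", "uniform"):
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
        knn.fit(X_train, y_train)
        results.append(
            (knn.score(X_test, y_test), knn.score(X_train, y_train), n_neighbors, weights)
        )

# Sort by test accuracy, best first; ties keep their insertion order.
results.sort(key=lambda r: r[0], reverse=True)

print("Test  -   Train    -  k, weighting\n")
for test_acc, train_acc, k, weights in results:
    print(f"{test_acc:.3f}  -  {train_acc:.3f}   -   {k} {weights}")
```

Keeping the accuracies as floats until printing means the sort compares numbers, not formatted strings.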


## Linear models for classification

## Question 5
Produce a LogisticRegression classifier with default settings for the cancer dataset. Print the training and test accuracy.<br>
<i>(You may get a warning, TOTAL NO. of ITERATIONS REACHED LIMIT. You can ignore that. Or you can try larger settings for the optional parameter, max_iter, to remove the warning. You may see improved scores as a result!)</i>

In [15]:
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state = 0)

log_reg = LogisticRegression()  #here I am creating a classifier
log_reg.fit(X_train, y_train) 


LogisticRegression()

In [16]:
print('Accuracy on the training subset: {:.3f}'.format(log_reg.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(log_reg.score(X_test, y_test)))

Accuracy on the training subset: 0.958
Accuracy on the test subset: 0.937
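
The hint in the question suggests raising `max_iter`. A minimal sketch of that variant follows; the value 10000 and the name `log_reg_long` are arbitrary choices for illustration, and the resulting scores depend on the scikit-learn version and solver, so none are quoted here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)

# Same model as above, but with a much larger iteration budget for the solver.
log_reg_long = LogisticRegression(max_iter=10000)
log_reg_long.fit(X_train, y_train)

train_score = log_reg_long.score(X_train, y_train)
test_score = log_reg_long.score(X_test, y_test)
print("Accuracy on the training subset: {:.3f}".format(train_score))
print("Accuracy on the test subset: {:.3f}".format(test_score))
```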


## Question 6
Print the three features with the highest coefficients -- these are the features which the model most strongly links with the target value of 1, i.e., <i>benign</i>. Print the three features with the lowest coefficients -- these are the features which the model most strongly links with the target value 0, i.e., <i>malignant</i>. On each line, print the feature name, followed by its coefficient value.


In [17]:
sorted_coef= sorted(list(zip(list(cancer.feature_names), log_reg.coef_[0])),
                    key=lambda e: e[1], reverse=True)

                    
for i in sorted_coef:
    print(i)
    


('mean radius', 1.5401326895947085)
('worst radius', 1.4431551414281087)
('texture error', 0.5980304133214931)
('perimeter error', 0.47130761409091976)
('mean texture', 0.14825821884843635)
('mean perimeter', 0.10838088991509104)
('radius error', 0.03831100309254111)
('smoothness error', -0.005887078627145201)
('fractal dimension error', -0.006717993286983086)
('mean area', -0.0071515141095640136)
('symmetry error', -0.02162683705571555)
('mean fractal dimension', -0.022202032610368568)
('concave points error', -0.022325780772632146)
('worst area', -0.022421602968474407)
('compactness error', -0.06484907721542935)
('mean smoothness', -0.06935069577776512)
('mean symmetry', -0.09000480496750345)
('worst fractal dimension', -0.0916629456565422)
('concavity error', -0.09232314271652398)
('worst smoothness', -0.11737342420026399)
('area error', -0.11775112916014886)
('mean concave points', -0.1768256861136574)
('worst perimeter', -0.2274198242701929)
('worst symmetry', -0.26269538203229686

In [18]:
print("The top 3 features are: ", sorted_coef[0:3])
print("\nThe Lowest 3 features are: ", sorted_coef[-3:])


The top 3 features are:  [('mean radius', 1.5401326895947085), ('worst radius', 1.4431551414281087), ('texture error', 0.5980304133214931)]

The Lowest 3 features are:  [('mean concavity', -0.412423747163664), ('worst compactness', -0.8969185229888859), ('worst concavity', -1.0972363532985538)]


The function np.argpartition(coefs, -k) returns an index array whose last k entries are the indices of the k largest elements of coefs (in no particular order) and whose first n-k entries are the indices of the remaining, smaller elements, where n = len(coefs). Because it only partitions the array, it is more efficient than a full sort (note that passing -3 is the same as passing len(coefs)-3). If you don't need the efficiency, top_three = np.argsort(coefs)[-3:] also gives the indices of the three largest coefficients, in ascending order of value.
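
A small sketch comparing the two approaches on a made-up array (the values in `coefs` are invented for illustration and are not the model's coefficients):

```python
import numpy as np

coefs = np.array([0.5, -1.2, 3.0, 0.1, 2.2, -0.3])  # made-up values

# Partition only: the last three positions hold the indices of the three
# largest values, but their order among themselves is not guaranteed.
top_three_partition = np.argpartition(coefs, -3)[-3:]

# Full sort: the last three indices, in ascending order of value.
top_three_sorted = np.argsort(coefs)[-3:]

print(sorted(top_three_partition.tolist()))  # the same three indices as below, as a set
print(top_three_sorted.tolist())             # [0, 4, 2]
```

Both calls select the indices 0, 4 and 2 (values 0.5, 2.2 and 3.0); only `np.argsort` guarantees their order.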

## Question 7
Print the 3 features that are the <i>least</i> informative about our target value (Hint: you should use the absolute value of coefficients.)

In [19]:
sorted_coef_1= list(zip(list(cancer.feature_names), log_reg.coef_[0]))  # unsorted (name, coefficient) pairs; sorted_coef from above could have been reused instead
sorted_coef_abs= sorted(sorted_coef_1, key=lambda x:abs(x[1]))

# sorted() calls the lambda once for each element of the list and sorts by the value it returns.
# I take x[1] because indexing starts at zero and each element is a (name, coefficient) tuple,
# so abs(x[1]) applies abs to the log_reg.coef_ value of each pair.
# Below, [0:3] takes the first 3 elements of the new list sorted by absolute value.
# key=lambda is a standard keyword argument of sorted(), much like weights is for the KNN classifier.


least_informative_features= sorted_coef_abs[0:3]

print(least_informative_features)

[('smoothness error', -0.005887078627145201), ('fractal dimension error', -0.006717993286983086), ('mean area', -0.0071515141095640136)]
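
The same selection can be written with NumPy index arrays instead of `sorted`. This is a sketch that refits the default model from Question 5; `order` and `least_three` are illustrative names, and the exact coefficients depend on the scikit-learn version, so no output is quoted:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)
log_reg = LogisticRegression().fit(X_train, y_train)  # may emit a convergence warning

coefs = log_reg.coef_[0]
order = np.argsort(np.abs(coefs))  # indices from smallest to largest |coefficient|

# The three features whose coefficients are closest to zero.
least_three = [(str(cancer.feature_names[i]), float(coefs[i])) for i in order[:3]]
for name, value in least_three:
    print(name, value)
```

Because the features are not standardized here, a small coefficient can also reflect a feature measured on a large scale, not only a weak link with the target.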


In [20]:
print(sorted_coef_abs)

[('smoothness error', -0.005887078627145201), ('fractal dimension error', -0.006717993286983086), ('mean area', -0.0071515141095640136), ('symmetry error', -0.02162683705571555), ('mean fractal dimension', -0.022202032610368568), ('concave points error', -0.022325780772632146), ('worst area', -0.022421602968474407), ('radius error', 0.03831100309254111), ('compactness error', -0.06484907721542935), ('mean smoothness', -0.06935069577776512), ('mean symmetry', -0.09000480496750345), ('worst fractal dimension', -0.0916629456565422), ('concavity error', -0.09232314271652398), ('mean perimeter', 0.10838088991509104), ('worst smoothness', -0.11737342420026399), ('area error', -0.11775112916014886), ('mean texture', 0.14825821884843635), ('mean concave points', -0.1768256861136574), ('worst perimeter', -0.2274198242701929), ('worst symmetry', -0.26269538203229686), ('mean compactness', -0.2985915061072213), ('worst texture', -0.3292413520444602), ('worst concave points', -0.3421762303096131),

## Question 8
Try different values of C for a Logistic Regression model with the same data. Give results sorted by test accuracy. Each line of the output should include test accuracy, training accuracy, and C value.

In [18]:
c=[0.0001,0.001,0.01,0.1,1,10,100,1000]

for i in c:
    log_reg=LogisticRegression(C = i)   # smaller C means stronger regularization
    log_reg.fit(X_train, y_train) 
    test_acc="{:.3f}".format(log_reg.score(X_test,y_test))
    train_acc="{:.3f}".format(log_reg.score(X_train,y_train))
    print("The test accuracy is: ", test_acc, "with an accuracy on the training set of: ", train_acc, "Using C value of: ", i)


The test accuracy is:  0.895 with an accuracy on the training set of:  0.934 Using C value of:  0.0001
The test accuracy is:  0.937 with an accuracy on the training set of:  0.953 Using C value of:  0.001
The test accuracy is:  0.923 with an accuracy on the training set of:  0.939 Using C value of:  0.01
The test accuracy is:  0.930 with an accuracy on the training set of:  0.946 Using C value of:  0.1
The test accuracy is:  0.937 with an accuracy on the training set of:  0.958 Using C value of:  1
The test accuracy is:  0.930 with an accuracy on the training set of:  0.960 Using C value of:  10
The test accuracy is:  0.951 with an accuracy on the training set of:  0.965 Using C value of:  100
The test accuracy is:  0.930 with an accuracy on the training set of:  0.965 Using C value of:  1000
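
The results above appear in the order the C values were tried. A sketch of one way to print them sorted by test accuracy, as the question asks, is to collect tuples first and sort afterwards (`results` and `model` are illustrative names; convergence warnings may appear for some values of C):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0
)

results = []  # one (test accuracy, training accuracy, C) tuple per model
for C in [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    results.append((model.score(X_test, y_test), model.score(X_train, y_train), C))

# Highest test accuracy first.
results.sort(key=lambda r: r[0], reverse=True)

print("Test   -  Train  -  C")
for test_acc, train_acc, C in results:
    print(f"{test_acc:.3f}  -  {train_acc:.3f}  -  {C}")
```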


<h1> Regression</h1>

In [19]:
from sklearn.datasets import fetch_california_housing

## The commented-out ssl code below works around a possible certificate error when downloading the data.
## Uncomment it if you get that error.
#import ssl

#try:
#    _create_unverified_https_context = ssl._create_unverified_context
#except AttributeError:
#    pass
#else:
#    ssl._create_default_https_context = _create_unverified_https_context


housing = fetch_california_housing()

print("Data shape:", housing.data.shape)

Data shape: (20640, 8)


In [20]:
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [21]:
print(housing.DESCR[:1400])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

### Question 9
Build a LinearRegression model using the housing dataset, and print the score on the test data.
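
One possible shape for this step is sketched below. The helper `score_linear_regression` is a hypothetical name, and the demo runs on synthetic data from `make_regression` so that the block works without downloading anything; the synthetic data is an assumption for illustration, not the lab data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def score_linear_regression(X, y, random_state=0):
    """Fit LinearRegression on a train split; return (model, train R^2, test R^2)."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    return model, model.score(X_train, y_train), model.score(X_test, y_test)


# Synthetic stand-in with 8 features, like the housing data (illustration only).
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
lr, train_r2, test_r2 = score_linear_regression(X_demo, y_demo)
print("R^2 on the training subset: {:.3f}".format(train_r2))
print("R^2 on the test subset: {:.3f}".format(test_r2))
```

In the notebook the call would be `score_linear_regression(housing.data, housing.target)`. For a regressor, `score` returns the coefficient of determination R², not an accuracy.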

### Question 10
Print the 3 features with the highest coefficients, those with the lowest, and those closest to zero.
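
A sketch of a helper that picks all three groups from a coefficient vector. `summarize_coefficients` is a hypothetical name, and the demo names and values are made up for illustration:

```python
import numpy as np


def summarize_coefficients(feature_names, coefs, k=3):
    """Return the k highest, k lowest and k closest-to-zero (name, coefficient) pairs."""
    names = np.asarray(feature_names)
    coefs = np.asarray(coefs, dtype=float)
    order = np.argsort(coefs)           # ascending by signed value
    by_abs = np.argsort(np.abs(coefs))  # ascending by magnitude

    def pairs(indices):
        return [(str(names[i]), float(coefs[i])) for i in indices]

    return {
        "highest": pairs(order[::-1][:k]),
        "lowest": pairs(order[:k]),
        "closest_to_zero": pairs(by_abs[:k]),
    }


# Made-up names and coefficients, for illustration only.
demo_names = ["a", "b", "c", "d", "e", "f", "g"]
demo_coefs = [2.0, -1.5, 0.01, 0.7, -0.02, -3.0, 0.3]
summary = summarize_coefficients(demo_names, demo_coefs)
for group, items in summary.items():
    print(group, items)
```

With a fitted model named, say, `lr`, the call would be `summarize_coefficients(housing.feature_names, lr.coef_)`. Since the housing data has only 8 features, the three groups can overlap.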