In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
sns.set_context("poster")

# Data Preprocessing

Most machine learning algorithms make assumptions about your data, for example that the scales of the attributes are comparable, or they simply work only on numerical data. This means that we need to pre-process the data. 
User-oriented applications, such as BigML, do that automatically. However, when you use a language such as Python or R, you have to do it manually and decide what to apply to each attribute. 

In the Machine Learning process (see figure below) pre-processing is the first step after loading and examining your data. 

There are 4 basic processes that we will treat separately. Depending on the algorithm that we will use, we'll need to apply all of them or only some:

        1) Rescale data.
        2) Standardize data.
        3) Normalize data. 
        4) Binarize data. 
        
Before the pre-processing there are three important steps:

        a) Load your dataset
        b) Examine it and get rid of everything that doesn't apply.
        c) Split the dataset into the input and output variables.
        
You will observe that scikit-learn provides two equivalent ways to apply a transformation. You can first call the fit() method to learn the parameters from your data and then the transform() method to apply them, or you can use the combined fit_transform() method. 
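As a minimal sketch of that equivalence, the snippet below runs both routes on a small made-up matrix (`X_demo` is invented for illustration and is not the Pima data) and checks that they give the same result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made up for illustration): 4 observations, 2 attributes
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [3.0, 400.0],
                   [4.0, 600.0]])

# Way 1: fit() learns the per-column minimum and maximum, transform() applies them
scaler = MinMaxScaler().fit(X_demo)
two_step = scaler.transform(X_demo)

# Way 2: fit_transform() does both in a single call
one_step = MinMaxScaler().fit_transform(X_demo)

print(np.allclose(two_step, one_step))
```

The two-step form is the one to keep when the same fitted scaler has to be reused later on other data, for example on a test set.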

![text](ML_process.png)

<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional machine learning datasets, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The dataset consists of several medical predictor variables and one target variable, <b>Outcome</b>. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

----------------------------------------------------------------------------------------------------------
DATACAMP CHAPTER 1:

SUPERVISED LEARNING
The goal is to learn from data for which the right output is known, so that we can make predictions on new data for which we don’t know the output.
The already labeled data is called training data.

In scikit-learn, all machine learning models are implemented as Python classes: they implement the algorithms for learning and predicting, and they store the information learned from the data.

Training a model on the data = 'fitting' a model to the data.
In scikit learn: .fit() method

To predict the labels of new data: .predict() method

Exercise: It is important to ensure your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point. The target needs to be a single column with the same number of observations as the feature data. Notice we named the feature array X and the response variable y: this is in accordance with common scikit-learn practice.

Split the data into a training set and a test set.
Fit/train the classifier on the training set.
Make predictions on the labeled test set and compare these predictions with the known labels. You then compute the accuracy of your predictions.

train_test_split() --> function to randomly split our data. The first argument is the feature data, the second the targets or labels. The test_size keyword argument specifies what proportion of the original data is used for the test set (e.g. 0.3). The random_state kwarg sets a seed for the random number generator that splits the data into train and test. Setting the seed to the same value later will allow you to reproduce the exact split and your downstream results.
By default, train_test_split splits the data into 75 % training data and 25 % test data.
You want the labels to be distributed in the train and test sets as they are in the original dataset. To achieve this, we use the keyword argument stratify=y, where y is the list or array containing the labels.

To check the accuracy of our model, we use the score method of the model and pass it X_test and y_test.
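The split / fit / score steps above can be sketched as follows. The data is a synthetic stand-in generated with make_classification (an assumption for this sketch, not the course dataset), and the values chosen for test_size, random_state and n_neighbors are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 observations, 5 features, roughly 70/30 class balance
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     weights=[0.7, 0.3], random_state=42)

# 30 % of the rows go to the test set; stratify keeps the label proportions
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Fit on the training set, then measure accuracy on the held-out test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(accuracy)
```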

k-NN: larger k = smoother decision boundary = less complex model
      smaller k = more complex model = can lead to overfitting
Generally, complex models run the risk of being sensitive to noise in the specific data that you have, rather than reflecting general trends in the data. This is known as overfitting.

It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data.
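A rough way to see the effect of k is to record training and test accuracy for a range of values. The sketch below again uses synthetic data as an assumption, so the k that scores best here says nothing about the exercise result quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the course dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

neighbors = range(1, 10)
train_acc, test_acc = [], []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))  # accuracy on data the model has seen
    test_acc.append(knn.score(X_test, y_test))     # accuracy on held-out data

for k, tr, te in zip(neighbors, train_acc, test_acc):
    print(k, round(tr, 3), round(te, 3))
```

With k=1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the test accuracy need not be: the most complex model memorizes the training set.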

----------------------------------------------------------------------------------------------------------

DATACAMP CHAPTER 2:

In regression tasks, the target value is a continuously varying variable, such as a country's GDP or the price of a house.

Fitting a regression model: import numpy and linear_model from sklearn, and instantiate LinearRegression as regr. We then fit the regressor to the data using regr.fit and passing in the data. After this, we want to check out the regressor's predictions over the range of the data. We can achieve that by using np.linspace between the minimum and maximum number of rooms and making a prediction for this data. predict...


a and b are the parameters of the model that we want to learn: y = ax + b
y is the target and x is the single feature.
So the question of fitting is reduced to: how do we choose a and b?
The default scoring method for linear regression is called R squared.
This metric quantifies the amount of variance in the target variable that is predicted from the feature variables.
To compute the R squared, we once again apply the score method to the model and pass it two arguments: the test data and the test data target.
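A small sketch of these steps on made-up data: the "true" values a=3 and b=2 and the noise level are assumptions chosen for illustration, so that the learned parameters can be compared against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data following y = a*x + b plus noise, with a=3 and b=2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=200)

X_feat = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    X_feat, y, test_size=0.3, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
a, b = reg.coef_[0], reg.intercept_   # learned slope and intercept
r2 = reg.score(X_test, y_test)        # R squared on the test set
print(a, b, r2)

# Predictions over the range of the feature, e.g. for plotting the fitted line
line_x = np.linspace(X_feat.min(), X_feat.max(), 50).reshape(-1, 1)
line_y = reg.predict(line_x)
```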

Note that generally you will never use linear regression out of the box; you will most likely wish to use regularization, which places further constraints on the model coefficients.

CROSS VALIDATION: This method avoids the problem of your metric of choice being dependent on the train test split.

    Potential pitfall without cross-validation: the R squared returned on the test set depends on the way that you split up the data. The data points in the test set may have some peculiarities that mean the R squared computed on it is not representative of the model's ability to generalize to unseen data.
    To combat this dependency on what is essentially an arbitrary split, we use a technique called CROSS-VALIDATION. The groups into which you split the data are called folds.
    With k folds (k-fold CV) we get k values of R squared, from which we can calculate statistics of interest, such as the mean, the median and 95 % confidence intervals.
    Trade-off: using more folds is more computationally expensive, because you are fitting and predicting more times.


Default score for linear regression is R squared.

%timeit cross_val_score(reg, X, y, cv = ____) ; to see how long it takes.
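A minimal sketch of k-fold cross-validation with cross_val_score, using synthetic regression data as a stand-in (the dataset, the noise level and cv=5 are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the course dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)

reg = LinearRegression()

# cv=5 gives 5 folds, so 5 fits and 5 scores; for a regressor the score is R squared
cv_scores = cross_val_score(reg, X_demo, y_demo, cv=5)

print(cv_scores)
print(np.mean(cv_scores), np.median(cv_scores))
```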

REGULARIZED REGRESSION: To alter the loss function so that it penalizes for large coefficients that can lead to overfitting.

    RIDGE REGRESSION: OLS loss function + the squared value of each coefficient multiplied by some constant alpha.
    We can select the alpha (also called lambda in the wild) for which our model performs best. Picking alpha is similar to picking k in k-NN. This is called hyperparameter tuning.
    Alpha can be thought of as a parameter that controls model complexity.
    Setting the argument normalize to True ensured that all our variables were on the same scale. That argument has since been removed from scikit-learn's linear models (version 1.2), so the features should be scaled beforehand, for example with StandardScaler.
    
    LASSO REGRESSION: OLS loss function + the absolute value of each coefficient multiplied by some constant alpha.
    A cool aspect of lasso regression is that it can be used to select important features of a dataset. This is because it tends to shrink the coefficients of less important features to exactly zero. The features whose coefficients are not shrunk to zero are 'selected' by the Lasso algorithm. 
    To extract the coefficients, read the .coef_ attribute after fit.
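The sketch below fits both models on synthetic data in which only 2 of the 6 features carry signal. The data and alpha=1.0 are assumptions picked for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 6 features, only 2 of them informative
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=2,
                                 noise=5.0, random_state=0)

# Put all features on the same scale before penalizing the coefficients
X_scaled = StandardScaler().fit_transform(X_demo)

ridge = Ridge(alpha=1.0).fit(X_scaled, y_demo)
lasso = Lasso(alpha=1.0).fit(X_scaled, y_demo)

print(ridge.coef_)  # ridge shrinks coefficients but rarely makes them exactly zero
print(lasso.coef_)  # lasso can set some coefficients to exactly zero

# Indices of the features that lasso kept (non-zero coefficients)
selected = np.flatnonzero(lasso.coef_)
print(selected)
```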

In [6]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]

C=p_indians.iloc[:,0:8]
C.values
X
pd.DataFrame(X).head()

# PRINTOPTIONS:
# numpy.set_printoptions :These options determine the way floating point numbers, arrays and other NumPy objects are displayed.
# precision : int or None, optional. Number of digits of precision for floating point output (default 8). 
# May be None if floatmode is not fixed, to print as many digits as necessary to uniquely specify the value.


# NAMES:
#names : array-like, optional; List of column names to use. 
#If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed.

# .VALUES:
#dataFrame.values; Return a Numpy representation of the DataFrame.

   pregnancies  glucose  pressure  skin  insulin   bmi   pedi  age  outcome
0            6      148        72    35        0  33.6  0.627   50        1
1            1       85        66    29        0  26.6  0.351   31        0
2            8      183        64     0        0  23.3  0.672   32        1
3            1       89        66    23       94  28.1  0.167   21        0
4            0      137        40    35      168  43.1  2.288   33        1


array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

     0      1     2     3      4     5      6     7
0  6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0
1  1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0
2  8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0
3  1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0
4  0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0


<h1>Rescale Data </h1>

It is very common for attributes to have very different scales. Many machine learning algorithms therefore benefit from rescaling the attributes so that they all share the same scale, normally between 0 and 1. This process is commonly called normalization. 

This is important for optimization algorithms that use gradient descent, and for algorithms that weight their inputs, such as regression or neural networks. It is also needed when the algorithm uses distances, as is the case for k-means or k-NN (k-Nearest Neighbors). 

For rescaling your data, you use the <b>MinMaxScaler</b> class.

----------------------------
DATACAMP:
Min-Max scaling, sometimes also called normalization: your data is scaled linearly between a minimum and a maximum value, often 0 and 1, with 0 corresponding to the lowest value in the column and 1 to the largest. Because the scaling is linear, the values change but the shape of the distribution does not.
After min-max scaling, although the shape of the distribution is the same, the values sit fully between 0 and 1.
  
(screenshots in my word document)

---------------------------
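To make the linear scaling concrete, this sketch computes it by hand on a made-up column and compares the result with MinMaxScaler (the four values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single column, kept 2-D as scikit-learn expects
col = np.array([[21.0], [31.0], [50.0], [33.0]])

# By hand: subtract the column minimum, divide by the column range
by_hand = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

# Same thing with scikit-learn
by_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(col)

print(by_hand.ravel())
print(np.allclose(by_hand, by_sklearn))
```

The smallest value (21) maps to 0 and the largest (50) maps to 1; everything else keeps its relative position in between.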

In [8]:
# Rescale data between 0 and 1

p_indians.head()

from sklearn.preprocessing import MinMaxScaler

# Scale between 0 and 1
scaler=MinMaxScaler(feature_range=(0,1))
rescaledX=scaler.fit_transform(X) # this only transforms the data, no model is trained. It uses the parameters set on the previous line.

rescaledX


# MINMAXSCALER
# sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
# Transforms features by scaling each feature to a given range.
# This estimator scales and translates each feature individually such that
# it is in the given range on the training set, e.g. between zero and one:
#   X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
#   X_scaled = X_std * (max - min) + min
# where min, max = feature_range.

   pregnancies  glucose  pressure  skin  insulin   bmi   pedi  age  outcome
0            6      148        72    35        0  33.6  0.627   50        1
1            1       85        66    29        0  26.6  0.351   31        0
2            8      183        64     0        0  23.3  0.672   32        1
3            1       89        66    23       94  28.1  0.167   21        0
4            0      137        40    35      168  43.1  2.288   33        1


array([[0.353, 0.744, 0.59 , ..., 0.501, 0.234, 0.483],
       [0.059, 0.427, 0.541, ..., 0.396, 0.117, 0.167],
       [0.471, 0.92 , 0.525, ..., 0.347, 0.254, 0.183],
       ...,
       [0.294, 0.608, 0.59 , ..., 0.39 , 0.071, 0.15 ],
       [0.059, 0.633, 0.492, ..., 0.449, 0.116, 0.433],
       [0.059, 0.467, 0.574, ..., 0.453, 0.101, 0.033]])

<h1>Standardize Data </h1>

Standardization is a technique that takes attributes assumed to follow Gaussian distributions with different means and standard deviations
and transforms them so that each has a mean of 0 and a standard deviation of 1. 

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with 
rescaled data, such as linear regression, logistic regression or LDA (linear discriminant analysis).

For standardizing you use the <b>StandardScaler</b> class. 


DATACAMP:
As opposed to finding an outer boundary and squeezing everything within it, standardization instead finds the mean of your data and centers your distribution around it, calculating the number of standard deviations away from the mean each point is. These values (the number of standard deviations) are then used as your new values. This centers the data around 0 but technically has no limit to the maximum and minimum values as you can see here.
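The "number of standard deviations away from the mean" can be computed by hand and compared with StandardScaler. This sketch uses only the five glucose values visible in the head() output above, so the numbers differ from those obtained on the full column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five glucose values from the head() output, kept 2-D
col = np.array([[148.0], [85.0], [183.0], [89.0], [137.0]])

# By hand: subtract the mean, divide by the (population) standard deviation
by_hand = (col - col.mean(axis=0)) / col.std(axis=0)

# Same thing with scikit-learn
by_sklearn = StandardScaler().fit_transform(col)

print(by_hand.ravel())
print(np.allclose(by_hand, by_sklearn))
```

After the transformation the column has mean 0 and standard deviation 1, but individual values are not confined to a fixed range.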

In [9]:
# Standardize data (0 mean, 1 stdev)

from sklearn.preprocessing import StandardScaler

# `import sklearn.preprocessing` would also work, but the class would then have to be called as sklearn.preprocessing.StandardScaler.
# Importing a single name does not save memory (Python still loads the whole module); it only keeps the namespace small.
# Only names listed in the import can be called directly; anything else needs dot notation.
# Example of an aliased import: from pandas import read_csv as read

p_indians.head()

scaler=StandardScaler().fit(X) # no model is trained here; fit only learns the statistics of X.
# what fit looks at is the mean, standard deviation, etc. of each column
rescaledX=scaler.transform(X) # this is where it is applied to the numbers in each column.

rescaledX

   pregnancies  glucose  pressure  skin  insulin   bmi   pedi  age  outcome
0            6      148        72    35        0  33.6  0.627   50        1
1            1       85        66    29        0  26.6  0.351   31        0
2            8      183        64     0        0  23.3  0.672   32        1
3            1       89        66    23       94  28.1  0.167   21        0
4            0      137        40    35      168  43.1  2.288   33        1


array([[ 0.64 ,  0.848,  0.15 , ...,  0.204,  0.468,  1.426],
       [-0.845, -1.123, -0.161, ..., -0.684, -0.365, -0.191],
       [ 1.234,  1.944, -0.264, ..., -1.103,  0.604, -0.106],
       ...,
       [ 0.343,  0.003,  0.15 , ..., -0.735, -0.685, -0.276],
       [-0.845,  0.16 , -0.471, ..., -0.24 , -0.371,  1.171],
       [-0.845, -0.873,  0.046, ..., -0.202, -0.474, -0.871]])

In [6]:
# for my own reference, to compare.
#print(rescaledX)
#print(scaler)

In [8]:
p_indians.outcome.unique()

array([1, 0])

<b><font color="red" size=6>Mission 1</font>

a) Sometimes we only want to standardize some attributes and not all of them. 
For the sake of example, let's say that we only want to standardize glucose.
<br><br>
b) Create a new X with all the attributes of the original X, but with glucose standardized.
<br><br>
c) Do the same with scaling (from 0 to 1) instead of standardizing. 
<br><br>
hint: we are dealing with NumPy arrays and not DataFrames here, so you should use np.concatenate()
</b>

In [12]:
# A) Standardize only glucose.
# Slice with 1:2 (not 1) so the column stays 2-D; a 1-D array makes the scaler raise an error.

# Import
from sklearn.preprocessing import StandardScaler

# Select the array corresponding to glucose
glucose_array = X[:,1:2]
glucose_array # print to compare

# Instantiate the scaler, fit it on glucose and transform
scaler=StandardScaler().fit(glucose_array)
rescaled_glucose =scaler.transform(glucose_array)
rescaled_glucose # print to compare



array([[148.],
       [ 85.],
       [183.],
       [ 89.],
       [137.],
       [116.],
       [ 78.],
       [115.],
       [197.],
       [125.],
       [110.],
       [168.],
       [139.],
       [189.],
       [166.],
       [100.],
       [118.],
       [107.],
       [103.],
       [115.],
       [126.],
       [ 99.],
       [196.],
       [119.],
       [143.],
       [125.],
       [147.],
       [ 97.],
       [145.],
       [117.],
       [109.],
       [158.],
       [ 88.],
       [ 92.],
       [122.],
       [103.],
       [138.],
       [102.],
       [ 90.],
       [111.],
       [180.],
       [133.],
       [106.],
       [171.],
       [159.],
       [180.],
       [146.],
       [ 71.],
       [103.],
       [105.],
       [103.],
       [101.],
       [ 88.],
       [176.],
       [150.],
       [ 73.],
       [187.],
       [100.],
       [146.],
       [105.],
       [ 84.],
       [133.],
       [ 44.],
       [141.],
       [114.],
       [ 99.],
       [10

array([[ 8.483e-01],
       [-1.123e+00],
       [ 1.944e+00],
       [-9.982e-01],
       [ 5.041e-01],
       [-1.532e-01],
       [-1.342e+00],
       [-1.845e-01],
       [ 2.382e+00],
       [ 1.285e-01],
       [-3.410e-01],
       [ 1.474e+00],
       [ 5.666e-01],
       [ 2.132e+00],
       [ 1.412e+00],
       [-6.539e-01],
       [-9.059e-02],
       [-4.349e-01],
       [-5.600e-01],
       [-1.845e-01],
       [ 1.598e-01],
       [-6.852e-01],
       [ 2.351e+00],
       [-5.929e-02],
       [ 6.918e-01],
       [ 1.285e-01],
       [ 8.170e-01],
       [-7.478e-01],
       [ 7.544e-01],
       [-1.219e-01],
       [-3.723e-01],
       [ 1.161e+00],
       [-1.030e+00],
       [-9.043e-01],
       [ 3.460e-02],
       [-5.600e-01],
       [ 5.354e-01],
       [-5.913e-01],
       [-9.669e-01],
       [-3.097e-01],
       [ 1.850e+00],
       [ 3.789e-01],
       [-4.662e-01],
       [ 1.568e+00],
       [ 1.193e+00],
       [ 1.850e+00],
       [ 7.857e-01],
       [-1.56

In [15]:
X[:,1:2]
array[:,1:2]

array([[148.],
       [ 85.],
       [183.],
       [ 89.],
       [137.],
       [116.],
       [ 78.],
       [115.],
       [197.],
       [125.],
       [110.],
       [168.],
       [139.],
       [189.],
       [166.],
       [100.],
       [118.],
       [107.],
       [103.],
       [115.],
       [126.],
       [ 99.],
       [196.],
       [119.],
       [143.],
       [125.],
       [147.],
       [ 97.],
       [145.],
       [117.],
       [109.],
       [158.],
       [ 88.],
       [ 92.],
       [122.],
       [103.],
       [138.],
       [102.],
       [ 90.],
       [111.],
       [180.],
       [133.],
       [106.],
       [171.],
       [159.],
       [180.],
       [146.],
       [ 71.],
       [103.],
       [105.],
       [103.],
       [101.],
       [ 88.],
       [176.],
       [150.],
       [ 73.],
       [187.],
       [100.],
       [146.],
       [105.],
       [ 84.],
       [133.],
       [ 44.],
       [141.],
       [114.],
       [ 99.],
       [10

array([[148.],
       [ 85.],
       [183.],
       [ 89.],
       [137.],
       [116.],
       [ 78.],
       [115.],
       [197.],
       [125.],
       [110.],
       [168.],
       [139.],
       [189.],
       [166.],
       [100.],
       [118.],
       [107.],
       [103.],
       [115.],
       [126.],
       [ 99.],
       [196.],
       [119.],
       [143.],
       [125.],
       [147.],
       [ 97.],
       [145.],
       [117.],
       [109.],
       [158.],
       [ 88.],
       [ 92.],
       [122.],
       [103.],
       [138.],
       [102.],
       [ 90.],
       [111.],
       [180.],
       [133.],
       [106.],
       [171.],
       [159.],
       [180.],
       [146.],
       [ 71.],
       [103.],
       [105.],
       [103.],
       [101.],
       [ 88.],
       [176.],
       [150.],
       [ 73.],
       [187.],
       [100.],
       [146.],
       [105.],
       [ 84.],
       [133.],
       [ 44.],
       [141.],
       [114.],
       [ 99.],
       [10

In [14]:
X
array

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [None]:
# B) Create a new X with all the attributes of the original X, but with glucose standardized.

# Column order in X: pregnancies (0), glucose (1), pressure (2), ..., age (7).
# Keep column 0, put the standardized glucose next, then append columns 2 to 7.
# axis=1 joins the pieces side by side (as columns); axis=0 would stack them as rows.
new_X = np.concatenate((X[:, 0:1], rescaled_glucose, X[:, 2:]), axis=1)

new_X[:5] # check: only the second column has changed


In [None]:
# C) Do the same, but scaling glucose from 0 to 1 instead of standardizing it.

# Import
from sklearn.preprocessing import MinMaxScaler

# Scale glucose between 0 and 1 (slice with 1:2 so the input stays 2-D and avoids the 2-D error)
new_scaler=MinMaxScaler(feature_range=(0,1))
scaled_glucose=new_scaler.fit_transform(X[:,1:2])

# Rebuild X with the scaled glucose in place of the original column
new_rescaledX=np.concatenate((X[:,0:1], scaled_glucose, X[:,2:]), axis=1)

# Check
new_rescaledX



<h1>Normalize Data </h1>

Normalization works on observations (rows) instead of attributes (columns). 

The idea here is to rescale each observation so that it has length 1 (a unit vector in linear algebra).

It is useful for algorithms that weight input values as a whole, as is the case for neural networks,
and also for distance-based algorithms such as k-NN (k-Nearest Neighbors).

For normalization you use the <b>Normalizer</b> class. 
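As a small illustration of "length 1 per row", the sketch below normalizes three made-up rows and checks the result against a manual division by each row's Euclidean (L2) norm, which is the default norm of Normalizer.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up observations: rows 1 and 3 point in the same direction
rows = np.array([[3.0, 4.0],
                 [1.0, 0.0],
                 [6.0, 8.0]])

by_sklearn = Normalizer(norm="l2").fit_transform(rows)

# By hand: divide every row by its own Euclidean length
by_hand = rows / np.linalg.norm(rows, axis=1, keepdims=True)

# Length of each normalized row
lengths = np.linalg.norm(by_sklearn, axis=1)

print(by_sklearn)
print(lengths)
```

Rows [3, 4] and [6, 8] both become [0.6, 0.8]: normalization keeps the direction of an observation and discards its magnitude.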


In [None]:
from sklearn.preprocessing import Normalizer

p_indians.head()

scaler=Normalizer().fit(X)
normalizedX=scaler.transform(X)

normalizedX


# Each row is divided by its own norm (the L2 norm by default). Unlike standardization, no mean is subtracted and the divisor is computed per row, not per column.

<h1>Binarize Data </h1>

Binarizing consists of transforming data using a binary threshold: all values above the threshold are marked as 1
and all values equal to or below it as 0. 

It is useful when you want to turn probabilities into crisp values. It is also often used in feature engineering
when you add a new feature. 

For binarization you use the <b>Binarizer</b> class. 
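A quick sketch of turning probabilities into crisp values. The probabilities and the 0.5 threshold are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up predicted probabilities, one per observation
probs = np.array([[0.10], [0.50], [0.51], [0.93]])

# Values strictly greater than the threshold become 1, the rest become 0
labels = Binarizer(threshold=0.5).fit_transform(probs)

print(labels.ravel())
```

Note that 0.50 itself maps to 0, because only values strictly above the threshold are set to 1.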


In [None]:
from sklearn.preprocessing import Binarizer

p_indians.head()

binarizer=Binarizer(threshold=0.0).fit(X) # similar to building a True/False array (value > threshold)
# I have a sketch on paper with an eye-colour example.
# For numeric variables there is a threshold. "Floats can be compared against ints too... the other way round may raise an error."
# Floats take up more space.
# The CPU spends its time fetching from and writing to memory, not processing.

binaryX=binarizer.transform(X) # here True/False are turned into 1s and 0s respectively.

binaryX[:10,0:8]

# the decimal point in the output numbers shows that they are floats.
# the outcome column is already zeros and ones, so with a threshold of zero it would not be affected.

<b><font color="red" size=6>Mission 2</font>

a) We want to highlight everybody with a glucose level over 140 setting it to 1. 
<br><br>
b) We'll do the same with blood pressure over 80. 
<br><br>
c) Finally we will create a new attribute that we will name warning, which is 1 when both 
glucose and blood pressure are set to 1, and 0 otherwise. 
<br><br>
d) We need a new X (let's call it X_new) with these attributes instead of the original ones.

</b>

In [49]:
# A) We want to highlight everybody with a glucose level over 140 setting it to 1. 

# Import
from sklearn.preprocessing import Binarizer

# Select glucose
glucose = X[:,1:2]

# Binarize: values above 140 become 1, the rest 0
new_gbinarizer=Binarizer(threshold=140.0).fit(glucose)
new_gbinaryX=new_gbinarizer.transform(glucose)

new_gbinaryX
    

array([[1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],

In [52]:
# B) We'll do the same with blood pressure over 80. 

# Import
from sklearn.preprocessing import Binarizer

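# Select blood pressure
pressure = X[:,2:3]

# Binarize: values above 80 become 1, the rest 0.
# Use the binarizer fitted here; a `binarizer` left over from an earlier cell would apply that cell's threshold instead.
new_pbinarizer=Binarizer(threshold=80.0).fit(pressure)
new_pbinaryX=new_pbinarizer.transform(pressure)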

new_pbinaryX

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],

In [54]:
# C) Finally we will create a new attribute that we will name warning, when both glucose and blood pressure is set to 1, ...
# ... being 0 otherwise. 

warning = new_gbinaryX*new_pbinaryX
warning



array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],

In [11]:
# D) We need a new X (let's call it X_new) with these attributes instead of the original ones.

# This follows Carles's solution, using the variable names from the cells above.
# X[:,0:1] is the pregnancies column: it comes before glucose and has to be kept.
X_new = np.concatenate((X[:,0:1],new_gbinaryX,new_pbinaryX,X[:,3:],warning),axis=1)

X_new[:5]

<h1>Three useful functions: map, zip and filter </h1>

There are three very useful functions that are often used in preprocessing for encoding, filtering or iteration. 

They are: map, zip and filter. 


<h1>map</h1> 

Lets you apply a function to each element of an iterable, such as a list or a dictionary.

Very useful for encoding. Please remember that it returns an iterator that must be converted to a list to be displayed.

In [None]:
#map 

ln=["1","2","3","4","5"]
dn={"1":"10","2":"20","3":"30","4":"40","5":"50"}

list(map(lambda x: int(x),ln)) # apply the lambda defined here to each element of what I called ln.

list(map(lambda x:int(x),dn)) # iterating over a dict yields its keys, so the keys are converted. The value is what comes after the colon.

# map is a function that applies functions. It is different from apply; this one is slower.

# To create an empty DataFrame, create a df as usual but put only square brackets (an empty list) inside the parentheses.

<h1>zip</h1> 

It enables you to iterate over two or more lists at the same time. 

In [None]:
#zip

first = [1, 3, 8, 4, 9]
second = [2, 2, 7, 5, 8]

# Iterate over two or more lists at the same time
for x, y in zip(first, second):
    print(x + y)

# List comprehension: an efficient way to combine two lists.

In [None]:
# My own example.

first = [1, 3, 8, 4, 9]
second = [2, 2, 7, 5, 8]

[print(x, y) for x, y in zip(first, second)] # using a list comprehension: [what I want + for + list or zip of lists]



<h1>filter</h1> 

Similar to map(), but the function passed in must return True or False, and filter() keeps only the elements for which it returns True. 

In [None]:
#filter

ln=["1","2","3","4","5"]
dn={"1":"10","2":"20","3":"30","4":"40","5":"50"}

list(filter(lambda x:int(x)>1,ln)) # returns a list of the elements that meet the criterion

list(filter(lambda x:int(x)>1,dn)) # applied to the keys. Iterators advance with the next function (it takes the next element)
list(filter(lambda x:int(x)>10,dn.values())) # applied to the values of the dictionary, not the keys.

# If you apply filter to True/False results, it only returns the elements that gave True. The same goes for a dictionary, a list and a tuple.
# Tuples are immutable, but they are iterable.
# A lambda holds a single expression (one line).
# if I name the lambda argument y instead of x, int(x) has to become int(y) too.

In [None]:
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Will return true if input number is even
def even(x):
    return x % 2 == 0
 
even_n = filter(even, numbers) # first the function, then the iterable the function is applied to
even_n1 = filter(lambda x: x % 2==0, numbers) # the same, but the function is defined with a lambda.

list(even_n)
list(even_n1)


# Lambdas are more compact, but they do not run faster than a named function.
# It is better to define the function separately: it makes errors easier to locate.