Introduction to Python
================================

Lesson 2 - Part 2
--------

## Summary

In this lesson we will start using Python to build some models.

The models that we will create are:
  - Linear regression
  - Random forest classifier
  - Exponential fit

In order to do so we must introduce some important libraries:

  - Pandas
  - Sklearn
  - Numpy

In [None]:
%load_ext rpy2.ipython

In [None]:
from IPython.display import display, HTML

## Linear Model

Now we will create a linear model using the `diabetes` dataset from `sklearn`.

The description of the dataset, taken from the [Doc. Page](http://scikit-learn.org/stable/datasets/index.html#diabetes-dataset), says:

*Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.*

We will use the **body mass index** to predict the **target variable**, i.e. the measure of disease progression one year after baseline.

The same analysis is done three times in order to show how to handle it using:

  - `Numpy`
  - `Pandas`
  - `R`

## Numpy solution

Let's see how to solve the problem using `Numpy`.

**NOTE**: the graph at the end of the analysis is done using the library `matplotlib.pyplot`

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
display(diabetes)
print('The type of diabetes.data:',type(diabetes.data),diabetes.data.shape)
print('The type of diabetes.target:',type(diabetes.target),diabetes.target.shape)

In [None]:
# Use only one feature
#diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X = diabetes.data[:,2].reshape(-1,1)
print(diabetes_X.shape)

In [None]:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
print('Dimension of diabetes_X_train:',diabetes_X_train.shape)
print('Dimension of diabetes_X_test:',diabetes_X_test.shape)

In [None]:
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
print('Dimension of diabetes_y_train:',diabetes_y_train.shape)
print('Dimension of diabetes_y_test:',diabetes_y_test.shape)

**NOTE**: as you can see, the shapes of `X` and `y` are different: `X` is a 2D array of shape `(n_samples, 1)`, while `y` is a 1D array of shape `(n_samples,)`. This is necessary because `LinearRegression` needs the `X` variable to be a **2D array**.
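
To make the 1D versus 2D distinction concrete, here is a minimal sketch on a hypothetical toy array (the values are invented for illustration and are not part of the diabetes dataset). It shows that `reshape(-1, 1)` and `np.newaxis` give the same column-shaped result.

```python
import numpy as np

# Hypothetical 1D feature with 5 samples (illustrative values only)
x = np.arange(5, dtype=float)
print(x.shape)          # 1D: one axis only

# Two equivalent ways to turn it into a single-column 2D array
x_2d = x.reshape(-1, 1)        # -1 lets numpy infer the number of rows
x_2d_alt = x[:, np.newaxis]    # add a new axis of length 1

print(x_2d.shape, x_2d_alt.shape)
print(np.array_equal(x_2d, x_2d_alt))
```

The target `y` can stay 1D, because each sample has a single value to predict.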

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_,regr.intercept_)

In [None]:
# The mean squared error
mse = mean_squared_error(diabetes_y_test, diabetes_y_pred)
print("Mean squared error: %.2f" % mse)
# The root mean squared error
# (np.sqrt is used because the `squared=False` argument was removed in recent scikit-learn releases)
print("Root mean squared error: %.2f" % np.sqrt(mse))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
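
The metrics printed in the cell above can also be computed by hand, which may help to see what they measure. The following is a minimal sketch on hypothetical toy values (`y_true` and `y_pred` are invented here, they are not results from the diabetes model).

```python
import numpy as np

# Hypothetical observed and predicted values (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# Root mean squared error: same units as the target
rmse = np.sqrt(mse)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE: %.3f  RMSE: %.3f  R^2: %.3f" % (mse, rmse, r2))
```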

In [None]:
# Plot outputs

#Scatter and line
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

#Axis 
plt.tick_params(axis='both', which='major', labelsize=10, pad=15)
plt.tick_params(axis='y', which='minor', labelsize=10, pad=15)

#Print graph
plt.show()


## Pandas solution

Let's see how to solve the problem using `Pandas`.

Again the graph at the end of the analysis is done using the library `matplotlib.pyplot`.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

testDim = 20
nrows = diabetes.data.shape[0]
index = ['patient_%i'%i for i in range(0,nrows)]
df = pd.DataFrame(diabetes.data,
                  columns=['Age','Sex','Body_mass_index','Average_blood_pressure','S1','S2','S3','S4','S5','S6'],
                  index=index)
df.head()

In [None]:
# d_X = df.iloc[:,2]
d_X = df[['Body_mass_index']]
d_Y = pd.DataFrame(diabetes.target,index=index)
#testPats = ['patient_%i'%i for i in range()]
d_X_train = d_X.iloc[:-testDim]
d_X_test = d_X.iloc[-testDim:]
print('Dimension of d_X_train:',d_X_train.shape)
print('Dimension of d_X_test:',d_X_test.shape)
d_Y_train = d_Y.iloc[:-testDim]
d_Y_test = d_Y.iloc[-testDim:]
print('Dimension of d_Y_train:',d_Y_train.shape)
print('Dimension of d_Y_test:',d_Y_test.shape)
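
Note that the feature was selected with double brackets. As a small sketch of why this matters, the example below uses a hypothetical toy DataFrame (the values are invented): double brackets return a 2D `DataFrame`, single brackets return a 1D `Series`.

```python
import pandas as pd

# Hypothetical toy data, only to compare the two selection styles
toy = pd.DataFrame({'Body_mass_index': [0.06, -0.05, 0.04],
                    'Age': [0.03, 0.00, 0.08]})

as_frame = toy[['Body_mass_index']]   # list of columns -> DataFrame (2D)
as_series = toy['Body_mass_index']    # single label -> Series (1D)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The 2D form is the one that `LinearRegression.fit` expects for the features.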

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(d_X_train, d_Y_train)

# Make predictions using the testing set
d_Y_pred = regr.predict(d_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_,regr.intercept_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(d_Y_test, d_Y_pred))
# Coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(d_Y_test, d_Y_pred))

In [None]:
# Plot outputs
plt.scatter(d_X_test, d_Y_test,  color='blue')
plt.plot(d_X_test, d_Y_pred, color='red', linewidth=3)

plt.tick_params(axis='both', which='major', labelsize=10, pad=15)
plt.tick_params(axis='y', which='minor', labelsize=10, pad=15)

plt.show()


## Exercise
What am I doing in the cell below?

In [None]:
start=0
end=testDim
allPats = set(index)
ret = []
while end<=nrows:
    testPats = index[start:end]
    # Recent pandas versions reject sets as .loc indexers, so we use lists
    trainPats = index[:start] + index[end:]
    d_X_train = d_X.loc[trainPats]
    d_X_test = d_X.loc[testPats]
    print('Dimension of d_X_train:',d_X_train.shape)
    print('Dimension of d_X_test:',d_X_test.shape)
    d_Y_train = d_Y.loc[trainPats]
    d_Y_test = d_Y.loc[testPats]
    print('Dimension of d_Y_train:',d_Y_train.shape)
    print('Dimension of d_Y_test:',d_Y_test.shape)
    regr = linear_model.LinearRegression()
    regr.fit(d_X_train, d_Y_train)
    d_Y_pred = regr.predict(d_X_test)
    # The coefficients
    print('Coefficients: \n', regr.coef_,regr.intercept_)
    # The root mean squared error
    # (np.sqrt is used because `squared=False` was removed in recent scikit-learn releases)
    rmse = np.sqrt(mean_squared_error(d_Y_test, d_Y_pred))
    print("Root mean squared error: %.2f"
          % rmse)
    # Coefficient of determination (R^2): 1 is perfect prediction
    print('R^2 score: %.2f' % r2_score(d_Y_test, d_Y_pred))
    start += testDim
    end += testDim
    ret.append(rmse)
res = pd.DataFrame({'rmse':ret})
print('Averaged RMSE: %f'%res['rmse'].mean())

## R solution

Let's see how to solve the problem using R.

Please note how it is easier to plot the graph with R.

In [None]:
%%R -i df,d_Y
library(dplyr)
library(magrittr)
library(ggplot2)



nrow=dim(df)[1]
ntest=20
df['target']=d_Y
df %<>%
  select(Body_mass_index,target)
df_Train=df[1:(nrow-ntest),]
df_Test=df[(nrow-ntest+1):nrow,]

model=lm(target ~ Body_mass_index,data=df_Train)

print(model)

pred=predict.lm(model,df_Test)
df_Test %>% 
  mutate(pred=pred) %>% 
  ggplot() +
  geom_point(aes(Body_mass_index,target)) +
  geom_line(aes(Body_mass_index,pred))   


## Random Forest Classifier

Now we'll use a random forest on the breast cancer dataset ([Man. Page](http://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset)).


In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as rfc
data = load_breast_cancer()
df_X=pd.DataFrame(data.data)
df_Y=data.target
print('Shape of df_X:',df_X.shape)
print('Shape of df_Y:',df_Y.shape)

In [None]:
nrow=df_X.shape[0]
ntest=50
df_X_train=df_X.head(nrow-ntest)
df_Y_train=df_Y[0:(nrow-ntest)]
df_X_test=df_X.tail(ntest)
df_Y_test=df_Y[(nrow-ntest):nrow]
print(df_Y_test.shape)
print(df_Y_train.shape)

In [None]:
model = rfc(n_jobs=-1,oob_score=True,n_estimators=100)
print('Model parameters:',model)
model.fit(df_X_train,df_Y_train)
print('Importance of the features',model.feature_importances_)
print('Accuracy of the model during training (out-of-bag score) is: %i%s'%(int(model.oob_score_*100),'%'))
pred=model.predict(df_X_test)
diff=pred-df_Y_test
diff=sum(abs(diff))
acc=1-diff/len(df_Y_test)
print('Accuracy of the model on the test set: %i%s'%(int(acc*100),'%'))
print('Accuracy of the model on the test set: %i%s'%(int(model.score(df_X_test,df_Y_test)*100),'%'))
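
The hand-made accuracy in the cell above relies on the labels being 0 or 1, so that each wrong prediction contributes exactly 1 to the sum of absolute differences. A minimal sketch on hypothetical labels (invented for illustration) shows that comparing the arrays for equality gives the same number and also works for labels that are not 0/1.

```python
import numpy as np

# Hypothetical true and predicted binary labels (illustration only)
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Version used in the lesson: count mismatches via absolute differences
acc_abs = 1 - np.sum(np.abs(y_hat - y_true)) / len(y_true)

# Equivalent version: fraction of positions where the labels agree
acc_eq = np.mean(y_hat == y_true)

print(acc_abs, acc_eq)
```

`sklearn.metrics.accuracy_score` computes the same quantity.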

**NOTE**: we can do the same analysis in an easier way.

In [None]:
from sklearn.model_selection import cross_val_score
model = rfc(n_jobs=-1,n_estimators=10)
scores = cross_val_score(model, data.data, data.target, cv=10)
# scores = cross_val_score(model,XFake,YFake, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))           
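
To see what `cross_val_score` does behind the scenes, here is a sketch that repeats the same 10-fold evaluation with an explicit loop. It assumes the default behaviour for a classifier with an integer `cv` (stratified folds without shuffling), and it fixes `random_state=0` (a choice made only for this example) so that the two results can be compared.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = load_breast_cancer()
X, y = data.data, data.target

# Explicit loop: fit on 9 folds, score on the remaining one
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# Same evaluation in a single call
auto_scores = cross_val_score(
    RandomForestClassifier(n_estimators=10, random_state=0), X, y, cv=10)

print("Manual mean accuracy: %0.3f" % manual_scores.mean())
print("cross_val_score mean accuracy: %0.3f" % auto_scores.mean())
```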



## Exponential fit

In the last part of this lesson we'll see how to implement a function and try to find the parameters that give the best fit.

For this purpose we'll use the function `optimize.minimize` of `scipy` ([library page](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html)).
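
Before the real fit, a minimal sketch may help to see how `optimize.minimize` is called. The function `loss` below is a hypothetical toy example (not part of the lesson data) whose minimum is known to be at `(3, -1)`, so we can check that the optimizer finds it.

```python
import numpy as np
from scipy import optimize

def loss(params):
    # Hypothetical toy function with its minimum at x=3, y=-1
    x, y = params
    return (x - 3.0) ** 2 + (y + 1.0) ** 2

# minimize needs the function and a starting guess for the parameters
result = optimize.minimize(loss, x0=[0.0, 0.0])

print('Success:', result.success)
print('Best parameters:', result.x)
print('Loss at the minimum:', result.fun)
```

The returned object exposes the best parameters in `x`, the value of the function in `fun`, and a status text in `message`.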

In [None]:
import numpy as np
import scipy.optimize  # import the submodule explicitly; plain `import scipy` may not load it
import matplotlib.pyplot as plt

baskets = np.array([475, 108, 2, 38, 320])
scaling_factor = np.array([95.5, 57.7, 1.4, 21.9, 88.8])

def lsq(arg):
    a = arg[0]*100
    b = arg[1]*100
    c = arg[2]*0.1
    now = a - (b*np.exp(c * baskets)) - scaling_factor
    return np.sum(now**2)

def pred(arg):
    a = arg[0]*100
    b = arg[1]*100
    c = arg[2]*0.1
    ret = a - (b*np.exp(c * baskets))
    return ret

guesses = [1, 1, -0.9]
res = scipy.optimize.minimize(lsq, guesses)

print('Message:',res.message)

print('X:',res.x)

print([lsq(guesses), lsq(res.x)])

In [None]:
# Reorder both arrays by basket size, keeping each (x, y) pair together
order = np.argsort(baskets)
baskets = baskets[order]
scaling_factor = scaling_factor[order]
# Prediction
prev = pred(res.x)
# Plot
plt.scatter(baskets, scaling_factor, color='black')
plt.plot(baskets, prev)
plt.show()

## Final Exercise

Following the examples above, we want to use the *Boston Houses* dataset to create a linear model.

We want to:
  1. Import the dataset (see below)
  2. Separate target and data
  3. Select 3 features
  4. Split train and test: the last 50 samples are for the test
  5. Train the model
  
Hint (note: `load_boston` was removed in scikit-learn 1.2, so this call needs an older release):

```
from sklearn import datasets 
ds  = datasets.load_boston()
```

## Tips and Tricks

Let's focus on some tricks to increase performance.

### How to append a row to a DF

In [None]:
%%timeit -n 1 -r 1
dfFake2 = pd.DataFrame({'a':[1],'b':[2]})
for i in range(0,10000):
    curDF = pd.DataFrame({'a':[1],'b':[2]}) 
    # DataFrame.append was removed in pandas 2.0; pd.concat shows the same slow pattern
    dfFake2 = pd.concat([dfFake2, curDF])
print(dfFake2.shape)
print(dfFake2.head())

In [None]:
dfFake2

In [None]:
%%timeit -n 1 -r 1
retList = []
for i in range(0,10001):
    curDict = {'a':1,'b':2}
    retList.append(curDict)
dfFake3 = pd.DataFrame(retList)
print(dfFake3.shape)
print(dfFake3.head())
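
When the rows already arrive as small DataFrames, a middle ground is to collect them in a list and concatenate only once at the end. This is a sketch under that assumption; the names `frames` and `dfAll` are hypothetical and do not appear in the cells above.

```python
import pandas as pd

# Collect the pieces in a plain Python list (cheap appends)
frames = []
for i in range(0, 1000):
    frames.append(pd.DataFrame({'a': [1], 'b': [2]}))

# Build the final DataFrame with a single concat call
dfAll = pd.concat(frames, ignore_index=True)

print(dfAll.shape)
print(dfAll.head())
```

The idea is the same as with the list of dictionaries: the DataFrame is built once, instead of being copied at every iteration.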

### The `n_jobs` parameter

In [None]:
import numpy as np
import pandas as pd
data = load_breast_cancer()
type(data.target)
dfFake=pd.DataFrame()
YFake=np.ndarray(shape=(1,0))
XFake=np.ndarray(shape=(0,30))
for i in range(0,1000):
    XFake=np.append(XFake,data.data,axis=0)
    YFake=np.append(YFake,data.target)
print(XFake.shape)

In [None]:
scores = cross_val_score(model,XFake,YFake, cv=10)