Regression Examples using Model_Validation, Splitter_Classes, LogisticRegression and DecisionTreeRegressor

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = [12, 12]
np.random.seed(42)

Functions for training and validation

In [3]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import math


# LogisticRegression evaluated with 10-fold cross_validate
def LogReg_cross_validate(df, y):
    # Returns the mean accuracy over the 10 test folds.
    clf = LogisticRegression(C=.1, solver='lbfgs', random_state=42)
    cv_results = cross_validate(clf, df, y, cv=10)
    return cv_results['test_score'].mean()  # mean accuracy

# LogisticRegression evaluated with the LeaveOneOut splitter
def LogReg_LoO(df, y):
    X = df.values
    y = np.asarray(y)  # positional indexing, independent of the Series index
    loo = LeaveOneOut()
    list_results = []
    list_results_acc = []
    clf = LogisticRegression(C=1, solver='liblinear', random_state=42)
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = clf.fit(X_train, y_train)
        result = clf.predict(X_test)
        list_results.append(mean_squared_error(y_test, result))
        list_results_acc.append(clf.score(X_test, y_test))
    # return mean accuracy, mean squared error
    return np.array(list_results_acc).mean(), np.array(list_results).mean()

# DecisionTreeRegressor evaluated with the LeaveOneOut splitter
def tree_LoO(df, y):
    list_results = []
    list_results_r2 = []
    X = df.values
    y = np.asarray(y)  # positional indexing, independent of the Series index
    loo = LeaveOneOut()
    clf = DecisionTreeRegressor(random_state=42)
    for train_index, test_index in loo.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = clf.fit(X_train, y_train)
        result = clf.predict(X_test)
        list_results.append(mean_squared_error(y_test, result))
        # Note: R^2 is not well defined on a single held-out sample,
        # so the averaged value should be interpreted with care.
        list_results_r2.append(clf.score(X_test, y_test))
    # return mean squared error, mean coefficient of determination R^2
    return np.array(list_results).mean(), np.array(list_results_r2).mean()
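
The manual loop in `LogReg_LoO` can also be expressed with `cross_val_score`. The following is a minimal sketch on synthetic stand-in data (the `X_demo` and `y_demo` arrays are hypothetical, not the fertility set). It also illustrates why, for 0/1 labels, the leave-one-out mean squared error equals one minus the mean accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (hypothetical; not the fertility data set)
X_demo, y_demo = make_classification(n_samples=40, n_features=5, random_state=42)

clf_demo = LogisticRegression(C=1, solver='liblinear', random_state=42)

# Each LeaveOneOut fold scores one held-out sample: 1.0 if correct, 0.0 otherwise
loo_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=LeaveOneOut())
loo_accuracy = loo_scores.mean()

# With 0/1 labels the squared error of a sample is 1 when it is misclassified,
# so the mean squared error is the misclassification rate
loo_mse = 1.0 - loo_accuracy
print(loo_accuracy, loo_mse)
```

This sketch only mirrors the accuracy part of the loop; the original functions remain the ones used in the rest of the notebook.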

**Import Data**

**Fertility Data Set**

"100 volunteers provide a semen sample analyzed according to the WHO 2010 criteria. Sperm concentration are related to socio-demographic data, environmental factors, health status, and life habits"

Link: https://archive.ics.uci.edu/ml/datasets/Fertility

Attribute Information:

    - Season in which the analysis was performed. 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1)

    - Age at the time of analysis. 18-36 (0, 1)

    - Childhood diseases (i.e., chicken pox, measles, mumps, polio) 1) yes, 2) no. (0, 1)

    - Accident or serious trauma 1) yes, 2) no. (0, 1)

    - Surgical intervention 1) yes, 2) no. (0, 1)

    - High fevers in the last year 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)

    - Frequency of alcohol consumption 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never (0, 1)

    - Smoking habit 1) never, 2) occasional 3) daily. (-1, 0, 1)

    - Number of hours spent sitting per day 1-16 (0, 1)

    - Output: Diagnosis normal (N), altered (O)
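
The values in parentheses are the normalized ranges stored in the file. As a rough sketch, and assuming the age attribute was rescaled with a linear min-max mapping (the data set description does not spell this out), a stored value can be mapped back to its original unit. The helper names `decode_age` and `decode_season` are hypothetical.

```python
# Season codes as listed in the attribute information
SEASON_CODES = {-1.0: "winter", -0.33: "spring", 0.33: "summer", 1.0: "fall"}

# Age range from the attribute information: 18-36 mapped to (0, 1)
AGE_MIN, AGE_MAX = 18, 36

def decode_age(value):
    """Invert an assumed min-max scaling of age (assumption, not confirmed by the source)."""
    return AGE_MIN + value * (AGE_MAX - AGE_MIN)

def decode_season(value):
    """Look up the season name for a stored season code."""
    return SEASON_CODES[round(value, 2)]

print(decode_age(0.5))        # 27.0
print(decode_season(-0.33))   # spring
```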

In [4]:
df=pd.read_csv('data/fertility.csv',header=None) 

In [5]:
df.head()

,0,1,2,3,4,5,6,7,8,9
0,-0.33,0.69,0,1,1,0,0.8,0,0.88,N
1,-0.33,0.94,1,0,1,0,0.8,1,0.31,O
2,-0.33,0.5,1,0,0,0,1.0,-1,0.5,N
3,-0.33,0.75,0,1,1,0,1.0,-1,0.38,N
4,-0.33,0.67,1,1,0,0,0.8,-1,0.5,O


In [6]:
df=df.rename(columns={
    0:'season',
    1:'age',
    2:'childish_diseases',
    3:'accident',
    4:'surgical_intervention',
    5:'high_fevers',
    6:'frequency_alcohol_consumption',
    7:'smoking',
    8:'hours_spent_sitting',
    9:'output'
})
df.head()

,season,age,childish_diseases,accident,surgical_intervention,high_fevers,frequency_alcohol_consumption,smoking,hours_spent_sitting,output
0,-0.33,0.69,0,1,1,0,0.8,0,0.88,N
1,-0.33,0.94,1,0,1,0,0.8,1,0.31,O
2,-0.33,0.5,1,0,0,0,1.0,-1,0.5,N
3,-0.33,0.75,0,1,1,0,1.0,-1,0.38,N
4,-0.33,0.67,1,1,0,0,0.8,-1,0.5,O


In [7]:
import pandas_profiling
pandas_profiling.ProfileReport(df)

Overview

Dataset info
Statistic,Value
Number of variables,10
Number of observations,100
Total Missing (%),0.0%
Total size in memory,7.9 KiB
Average record size in memory,80.8 B

Variable types
Type,Count
Numeric,6
Categorical,1
Boolean,3
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable: accident (Boolean)

Statistic,Value
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0
Mean,0.44

Value,Count,Frequency (%)
0,56,56.0%
1,44,44.0%

Variable: age (Numeric)

Statistic,Value
Distinct count,18
Unique (%),18.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,0.669
Minimum,0.5
Maximum,1
Zeros (%),0.0%

Quantile statistics
Statistic,Value
Minimum,0.5
5-th percentile,0.5
Q1,0.56
Median,0.67
Q3,0.75
95-th percentile,0.92
Maximum,1.0
Range,0.5
Interquartile range,0.19

Descriptive statistics
Statistic,Value
Standard deviation,0.12132
Coef of variation,0.18134
Kurtosis,-0.027747
Mean,0.669
MAD,0.09668
Skewness,0.66685
Sum,66.9
Variance,0.014718
Memory size,880.0 B

Common values
Value,Count,Frequency (%)
0.67,14,14.0%
0.56,12,12.0%
0.75,10,10.0%
0.53,9,9.0%
0.78,7,7.0%
0.58,7,7.0%
0.5,7,7.0%
0.69,7,7.0%
0.64,6,6.0%
0.81,5,5.0%

Minimum 5 values
Value,Count,Frequency (%)
0.5,7,7.0%
0.53,9,9.0%
0.56,12,12.0%
0.58,7,7.0%
0.61,5,5.0%

Maximum 5 values
Value,Count,Frequency (%)
0.86,1,1.0%
0.89,1,1.0%
0.92,2,2.0%
0.94,2,2.0%
1.0,2,2.0%

Variable: childish_diseases (Boolean)

Statistic,Value
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0
Mean,0.87

Value,Count,Frequency (%)
1,87,87.0%
0,13,13.0%

Variable: frequency_alcohol_consumption (Numeric)

Statistic,Value
Distinct count,5
Unique (%),5.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,0.832
Minimum,0.2
Maximum,1
Zeros (%),0.0%

Quantile statistics
Statistic,Value
Minimum,0.2
5-th percentile,0.6
Q1,0.8
Median,0.8
Q3,1.0
95-th percentile,1.0
Maximum,1.0
Range,0.8
Interquartile range,0.2

Descriptive statistics
Statistic,Value
Standard deviation,0.1675
Coef of variation,0.20132
Kurtosis,0.74233
Mean,0.832
MAD,0.1344
Skewness,-0.83766
Sum,83.2
Variance,0.028057
Memory size,880.0 B

Common values
Value,Count,Frequency (%)
1.0,40,40.0%
0.8,39,39.0%
0.6,19,19.0%
0.2,1,1.0%
0.4,1,1.0%

Variable: high_fevers (Numeric)

Statistic,Value
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,0.19
Minimum,-1
Maximum,1
Zeros (%),63.0%

Quantile statistics
Statistic,Value
Minimum,-1
5-th percentile,-1
Q1,0
Median,0
Q3,1
95-th percentile,1
Maximum,1
Range,2
Interquartile range,1

Descriptive statistics
Statistic,Value
Standard deviation,0.58075
Coef of variation,3.0566
Kurtosis,-0.24542
Mean,0.19
MAD,0.4536
Skewness,-0.037793
Sum,19
Variance,0.33727
Memory size,880.0 B

Common values
Value,Count,Frequency (%)
0,63,63.0%
1,28,28.0%
-1,9,9.0%

Variable: hours_spent_sitting (Numeric)

Statistic,Value
Distinct count,14
Unique (%),14.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,0.4068
Minimum,0.06
Maximum,1
Zeros (%),0.0%

Quantile statistics
Statistic,Value
Minimum,0.06
5-th percentile,0.19
Q1,0.25
Median,0.38
Q3,0.5
95-th percentile,0.75
Maximum,1.0
Range,0.94
Interquartile range,0.25

Descriptive statistics
Statistic,Value
Standard deviation,0.1864
Coef of variation,0.4582
Kurtosis,0.58214
Mean,0.4068
MAD,0.14868
Skewness,0.77562
Sum,40.68
Variance,0.034743
Memory size,880.0 B

Common values
Value,Count,Frequency (%)
0.25,17,17.0%
0.5,16,16.0%
0.38,13,13.0%
0.19,11,11.0%
0.31,11,11.0%
0.63,10,10.0%
0.44,9,9.0%
0.88,3,3.0%
0.75,3,3.0%
0.56,2,2.0%

Minimum 5 values
Value,Count,Frequency (%)
0.06,2,2.0%
0.13,1,1.0%
0.19,11,11.0%
0.25,17,17.0%
0.31,11,11.0%

Maximum 5 values
Value,Count,Frequency (%)
0.56,2,2.0%
0.63,10,10.0%
0.75,3,3.0%
0.88,3,3.0%
1.0,1,1.0%

Variable: output (Categorical)

Statistic,Value
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

Value,Count,Frequency (%)
N,88,88.0%
O,12,12.0%

Variable: season (Numeric)

Statistic,Value
Distinct count,4
Unique (%),4.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,-0.0789
Minimum,-1
Maximum,1
Zeros (%),0.0%

Quantile statistics
Statistic,Value
Minimum,-1.0
5-th percentile,-1.0
Q1,-1.0
Median,-0.33
Q3,1.0
95-th percentile,1.0
Maximum,1.0
Range,2.0
Interquartile range,2.0

Descriptive statistics
Statistic,Value
Standard deviation,0.79673
Coef of variation,-10.098
Kurtosis,-1.4306
Mean,-0.0789
MAD,0.70163
Skewness,0.34113
Sum,-7.89
Variance,0.63477
Memory size,880.0 B

Common values
Value,Count,Frequency (%)
-0.33,37,37.0%
1.0,31,31.0%
-1.0,28,28.0%
0.33,4,4.0%

Variable: smoking (Numeric)

Statistic,Value
Distinct count,3
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,-0.35
Minimum,-1
Maximum,1
Zeros (%),23.0%

Quantile statistics
Statistic,Value
Minimum,-1
5-th percentile,-1
Q1,-1
Median,-1
Q3,0
95-th percentile,1
Maximum,1
Range,2
Interquartile range,1

Descriptive statistics
Statistic,Value
Standard deviation,0.80873
Coef of variation,-2.3107
Kurtosis,-1.0837
Mean,-0.35
MAD,0.728
Skewness,0.72636
Sum,-35
Variance,0.65404
Memory size,880.0 B

Common values
Value,Count,Frequency (%)
-1,56,56.0%
0,23,23.0%
1,21,21.0%

Variable: surgical_intervention (Boolean)

Statistic,Value
Distinct count,2
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0
Mean,0.51

Value,Count,Frequency (%)
1,51,51.0%
0,49,49.0%


Encode the output: diagnosis normal (N -> 1), altered (O -> 0)

In [8]:
y=df['output'].replace('N',1).replace('O',0)
y.head()

0    1
1    0
2    1
3    1
4    0
Name: output, dtype: int64

In [9]:
df_new=df.drop('output',axis=1)
df_new.head()

,season,age,childish_diseases,accident,surgical_intervention,high_fevers,frequency_alcohol_consumption,smoking,hours_spent_sitting
0,-0.33,0.69,0,1,1,0,0.8,0,0.88
1,-0.33,0.94,1,0,1,0,0.8,1,0.31
2,-0.33,0.5,1,0,0,0,1.0,-1,0.5
3,-0.33,0.75,0,1,1,0,1.0,-1,0.38
4,-0.33,0.67,1,1,0,0,0.8,-1,0.5


### Modeling

**Scaling data**

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_new)
df_scaled=pd.DataFrame(scaler.transform(df_new))

df_scaled.head()

,0,1,2,3,4,5,6,7,8
0,-0.316753,0.17397,-2.586949,1.128152,0.980196,-0.32881,-0.192006,0.434959,2.551481
1,-0.316753,2.245043,0.386556,-0.886405,0.980196,-0.32881,-0.192006,1.677698,-0.521943
2,-0.316753,-1.400045,0.386556,-0.886405,-1.020204,-0.32881,1.008032,-0.807781,0.502532
3,-0.316753,0.671028,-2.586949,1.128152,0.980196,-0.32881,1.008032,-0.807781,-0.144505
4,-0.316753,0.008284,0.386556,1.128152,-1.020204,-0.32881,-0.192006,-0.807781,0.502532
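
As a quick check of what `StandardScaler` computes, the sketch below standardizes a small toy array (hypothetical values, not the fertility data) and compares the result with the manual formula z = (x - mean) / std, where std is the population standard deviation of each column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two columns on very different scales
X_toy = np.array([[0.0, 10.0],
                  [1.0, 20.0],
                  [2.0, 30.0]])

X_std = StandardScaler().fit_transform(X_toy)

# Manual z-score per column; np.std defaults to the population std (ddof=0),
# which is what StandardScaler uses
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_std, X_manual))  # True
```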


**Apply PCA on scaled data**

In [11]:
from sklearn.decomposition import PCA

df_scaled_pca = PCA(n_components=2).fit_transform(df_scaled)
df_scaled_pca = pd.DataFrame(df_scaled_pca, columns=["PC1", "PC2"])

df_scaled_pca.head()

,PC1,PC2
0,-0.708366,1.851833
1,2.048427,0.176078
2,-2.179288,-0.627278
3,0.104001,1.723623
4,-0.220799,-0.465998
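
Keeping only two components discards part of the variance. One way to see how much is retained is the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses random stand-in data with nine columns (hypothetical; the numbers it prints say nothing about the fertility data).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in data with 9 features (hypothetical, not the fertility set)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 9))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of total variance captured by each kept component, largest first
ratio = pca_demo.explained_variance_ratio_
print(ratio, ratio.sum())
```

Applying the same two lines to a `PCA` fitted on `df_scaled` would show how much information PC1 and PC2 retain before the tree is trained on them.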


Training and evaluating on the data:

In [12]:
results=pd.DataFrame(columns=['Metrics','Data','Data_Scaled','Data_Scaled_PCA'])

# Run each leave-one-out evaluation once and reuse both returned values
logreg_acc, logreg_mse = LogReg_LoO(df_scaled, y)
tree_mse, tree_r2 = tree_LoO(df_scaled_pca, y)

results['Metrics']=['mean_acc','mse','r2_score']
results['Data']=[LogReg_cross_validate(df_new,y),'NaN','NaN']
results['Data_Scaled']=[logreg_acc,logreg_mse,'NaN']
results['Data_Scaled_PCA']=['NaN',tree_mse,tree_r2]
results=results.set_index(['Metrics'])
results

Metrics,Data,Data_Scaled,Data_Scaled_PCA
mean_acc,0.881414,0.87,NaN
mse,NaN,0.13,0.195
r2_score,NaN,NaN,0.79
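
When reading the accuracy values, it helps to keep the class balance in mind: the profile report lists 88 normal (N) and 12 altered (O) samples. As a minimal sketch, a baseline that always predicts the majority class can be built with `DummyClassifier`; the arrays `X_base` and `y_base` are placeholders reconstructed from those counts, not the actual rows.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts from the profile report: 88 normal (1), 12 altered (0)
y_base = np.array([1] * 88 + [0] * 12)
X_base = np.zeros((100, 1))  # placeholder features; the baseline ignores them

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_base, y_base)
baseline_acc = baseline.score(X_base, y_base)
print(baseline_acc)  # 0.88
```

A mean accuracy close to 0.88 is therefore about what a model reaches by always predicting "normal", so per-class metrics may be more informative than accuracy here.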
