<h2>Part 1</h2>

In [9]:
from scipy.stats import ttest_ind
from scipy.stats import ttest_ind_from_stats
from pandas import read_csv
import math

import warnings
warnings.filterwarnings('ignore')

# sklearn 10-fold cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

# confusion matrix
from sklearn.metrics import confusion_matrix

# SK learn Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB


filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values


X = array[:,0:8]
Y = array[:,8]

# Folds and seed
num_folds = 10
seed = 1

print("Naive Bayes:\n------------------------------------")
kfold = KFold(n_splits=num_folds)  # unshuffled folds; random_state only applies when shuffle=True
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", round(results.mean()*100.0,2),
      "      Standard Deviation", round(results.std()*100.0,2))

# Out-of-fold predictions for every sample (their count is reused as n below)
y_pred1 = cross_val_predict(model, X, Y, cv=kfold)


print("\n\n\nLogistic Regression:\n------------------------------------")
kfold = KFold(n_splits=num_folds)  # unshuffled folds; random_state only applies when shuffle=True
model2 = LogisticRegression(solver='liblinear')
results2 = cross_val_score(model2, X, Y, cv=kfold)
print("Accuracy:", round(results2.mean()*100.0,2),
      "      Standard Deviation", round(results2.std()*100.0,2))

# Out-of-fold predictions for every sample (their count is reused as n below)
y_pred2 = cross_val_predict(model2, X, Y, cv=kfold)





# Statistics for output 1 (mean accuracy in percent)
mean1 = round(results.mean()*100.0,2)

# Binomial standard deviation of the number of correct predictions: sqrt(n*p*(1-p))
std1 = math.sqrt((results.mean()*len(y_pred1))*(1 - results.mean()))
n1 = len(y_pred1)


# Statistics for output 2 (mean accuracy in percent)
mean2 = round(results2.mean()*100.0,2)

# Binomial standard deviation of the number of correct predictions: sqrt(n*p*(1-p))
std2 = math.sqrt((results2.mean()*len(y_pred2))*(1 - results2.mean()))
n2 = len(y_pred2)

# The t-test must be for independent means, computed here from the summary statistics of both models
result = ttest_ind_from_stats(mean1, std1, n1, mean2, std2, n2)


print("\n\nStats Analysis\n---------------------------------------------")
print()
print(result)
print()
print("p-value on its own :", result[1])
print()
if result[1] < 0.05:
    print("Results are statistically significant")
else:
    print("Results are not statistically significant")

Naive Bayes:
------------------------------------
Accuracy: 75.52       Standard Deviation 4.28



Logistic Regression:
------------------------------------
Accuracy: 76.95       Standard Deviation 4.84


Stats Analysis
---------------------------------------------

Ttest_indResult(statistic=-2.375932736085734, pvalue=0.0176268572335954)

p-value on its own : 0.0176268572335954

Results are statistically significant


<p>As we can see, the p-value (0.0176) is less than 0.05, so we reject the null hypothesis that the two algorithms perform equally. Because the difference between the two algorithms is statistically significant, we choose the one with the higher accuracy, which in this case is Logistic Regression.</p>
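<p>To make the arithmetic behind this result easier to follow, here is a minimal sketch that rebuilds the test from the printed summary numbers alone, with no dataset needed. The accuracies are the rounded values from the output, and the helper name <code>binomial_count_std</code> is hypothetical. Note the pairing the call relies on: the means are on a percentage scale, while the standard deviations describe the number of correct predictions out of 768, so the p-value depends on that choice.</p>

```python
import math
from scipy.stats import ttest_ind_from_stats

# Values copied from the printed output (accuracies rounded to two decimals)
n = 768
acc_nb, acc_lr = 0.7552, 0.7695


def binomial_count_std(p, n):
    """Standard deviation of the number of successes in n Bernoulli(p) trials."""
    return math.sqrt(n * p * (1 - p))


std_nb = binomial_count_std(acc_nb, n)
std_lr = binomial_count_std(acc_lr, n)

# Same call shape as the cell above: means in percent, count-scale standard deviations
t_stat, p_value = ttest_ind_from_stats(acc_nb * 100, std_nb, n,
                                       acc_lr * 100, std_lr, n)

print("std (Naive Bayes):", round(std_nb, 3))
print("std (Logistic Regression):", round(std_lr, 3))
print("t statistic:", round(t_stat, 4), " p-value:", round(p_value, 4))
```

<p>Because the inputs are rounded, the statistic may differ from the notebook output in the last few decimals.</p>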

<h2>Part 2</h2>

In [8]:
from scipy.stats import ttest_ind
from scipy.stats import ttest_ind_from_stats
from pandas import read_csv
import math

import warnings
warnings.filterwarnings('ignore')

# sklearn 10-fold cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

# confusion matrix
from sklearn.metrics import confusion_matrix

# SK learn Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values


X = array[:,0:8]
Y = array[:,8]

# Folds and seed
num_folds = 10
seed = 1

print("Naive Bayes:\n------------------------------------")
kfold = KFold(n_splits=num_folds)  # unshuffled folds; random_state only applies when shuffle=True
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy:", round(results.mean()*100.0,2),
      "      Standard Deviation", round(results.std()*100.0,2))
y_pred1 = cross_val_predict(model, X, Y, cv=kfold)
# Statistics for output 1
mean1 = round(results.mean()*100.0,2)
# Binomial standard deviation of the number of correct predictions: sqrt(n*p*(1-p))
std1 = math.sqrt((results.mean()*len(y_pred1))*(1 - results.mean()))
n1 = len(y_pred1)
print('Mean: ', mean1)
print('STD: ', std1)
print('N: ', n1)


print("\n\n\nLogistic Regression:\n------------------------------------")
kfold = KFold(n_splits=num_folds)  # unshuffled folds; random_state only applies when shuffle=True
model2 = LogisticRegression(solver='liblinear')
results2 = cross_val_score(model2, X, Y, cv=kfold)
print("Accuracy:", round(results2.mean()*100.0,2),
      "      Standard Deviation", round(results2.std()*100.0,2))
y_pred2 = cross_val_predict(model2, X, Y, cv=kfold)
# Statistics for output 2
mean2 = round(results2.mean()*100.0,2)
# Binomial standard deviation of the number of correct predictions: sqrt(n*p*(1-p))
std2 = math.sqrt((results2.mean()*len(y_pred2))*(1 - results2.mean()))
n2 = len(y_pred2)
print('Mean: ', mean2)
print('STD: ', std2)
print('N: ', n2)


print("\n\n\nKNN:\n------------------------------------")
kfold = KFold(n_splits=num_folds)  # unshuffled folds; random_state only applies when shuffle=True
model3 = KNeighborsClassifier(n_neighbors=3)
results3 = cross_val_score(model3, X, Y, cv=kfold)
print("Accuracy:", round(results3.mean()*100.0,2),
      "      Standard Deviation", round(results3.std()*100.0,2))
y_pred3 = cross_val_predict(model3, X, Y, cv=kfold)
# Statistics for output 3
mean3 = round(results3.mean()*100.0,2)
# Binomial standard deviation of the number of correct predictions: sqrt(n*p*(1-p))
std3 = math.sqrt((results3.mean()*len(y_pred3))*(1 - results3.mean()))
n3 = len(y_pred3)
print('Mean: ', mean3)
print('STD: ', std3)
print('N: ', n3)


print("\n\n\nDecision Tree:\n------------------------------------")
kfold = KFold(n_splits=num_folds)  # unshuffled folds; random_state only applies when shuffle=True
model4 = DecisionTreeClassifier()
results4 = cross_val_score(model4, X, Y, cv=kfold)
print("Accuracy:", round(results4.mean()*100.0,2),
      "      Standard Deviation", round(results4.std()*100.0,2))
y_pred4 = cross_val_predict(model4, X, Y, cv=kfold)
# Statistics for output 4
mean4 = round(results4.mean()*100.0,2)
# Binomial standard deviation of the number of correct predictions: sqrt(n*p*(1-p))
std4 = math.sqrt((results4.mean()*len(y_pred4))*(1 - results4.mean()))
n4 = len(y_pred4)
print('Mean: ', mean4)
print('STD: ', std4)
print('N: ', n4)

Naive Bayes:
------------------------------------
Accuracy: 75.52       Standard Deviation 4.28
Mean:  75.52
STD:  11.916004680331053
N:  768



Logistic Regression:
------------------------------------
Accuracy: 76.95       Standard Deviation 4.84
Mean:  76.95
STD:  11.671070407881018
N:  768



KNN:
------------------------------------
Accuracy: 70.56       Standard Deviation 6.06
Mean:  70.56
STD:  12.63067857706682
N:  768



Decision Tree:
------------------------------------
Accuracy: 68.87       Standard Deviation 5.58
Mean:  68.87
STD:  12.831676705127537
N:  768


<h4>ANOVA and post-hoc test</h4>

<p>Tukey HSD Post-hoc Test...</p>
<p>Naive Bayes vs Logistic Regression: Diff=1.4300, 95%CI=-0.1791 to 3.0391, p=0.1020</p>
<p>Naive Bayes vs KNN: Diff=-4.9600, 95%CI=-6.5691 to -3.3509, p=0.0000</p>
<p>Naive Bayes vs Decision Tree: Diff=-6.6500, 95%CI=-8.2591 to -5.0409, p=0.0000</p>
<p>Logistic Regression vs KNN: Diff=-6.3900, 95%CI=-7.9991 to -4.7809, p=0.0000</p>
<p>Logistic Regression vs Decision Tree: Diff=-8.0800, 95%CI=-9.6891 to -6.4709, p=0.0000</p>
<p>KNN vs Decision Tree: Diff=-1.6900, 95%CI=-3.2991 to -0.0809, p=0.0351</p>
<br>
<p>From the results obtained, there is no statistically significant difference between Logistic Regression and Naive Bayes (p=0.1020), so either of the two can be chosen based on other factors such as efficiency. Every other pairwise comparison has a p-value below 0.05, so those differences are statistically significant: Logistic Regression and Naive Bayes are both more accurate than KNN and Decision Tree, and KNN is more accurate than Decision Tree. In those pairs we should choose the algorithm with the higher accuracy.</p>
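<p>The notebook does not show how the Tukey HSD figures were produced. The sketch below is one way to derive such intervals and p-values from the printed means, standard deviations and N, assuming equal group sizes and a pooled within-group variance. It relies on <code>scipy.stats.studentized_range</code> (available in SciPy 1.7 and later). The dictionary layout and variable names are illustrative, and each difference is taken as the second group's mean minus the first's.</p>

```python
import math
from itertools import combinations
from scipy.stats import studentized_range

# (mean accuracy in percent, standard deviation) copied from the output above
groups = {
    "Naive Bayes": (75.52, 11.916004680331053),
    "Logistic Regression": (76.95, 11.671070407881018),
    "KNN": (70.56, 12.63067857706682),
    "Decision Tree": (68.87, 12.831676705127537),
}
n = 768                    # samples per group (equal for all four)
k = len(groups)            # number of groups
df_within = k * (n - 1)    # within-group degrees of freedom

# With equal group sizes the pooled variance is the mean of the group variances
mse = sum(sd ** 2 for _, sd in groups.values()) / k
se = math.sqrt(mse / n)

# Critical value of the studentized range for a 95% family-wise interval
q_crit = studentized_range.ppf(0.95, k, df_within)
half_width = q_crit * se

tukey = {}
for (a, (mean_a, _)), (b, (mean_b, _)) in combinations(groups.items(), 2):
    diff = mean_b - mean_a
    p = studentized_range.sf(abs(diff) / se, k, df_within)
    tukey[(a, b)] = (diff, diff - half_width, diff + half_width, p)
    print(f"{a} vs {b}: Diff={diff:.4f}, "
          f"95%CI={diff - half_width:.4f} to {diff + half_width:.4f}, p={p:.4f}")
```

<p>A pair is significant at the 5% level when its interval excludes zero, which is equivalent to its p-value being below 0.05.</p>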
