In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")

# Model Evaluation

We need to evaluate the performance of our machine learning models because of two main reason. First to be able to tune them deciding which interventions increase their performance and which ones don't. And secondly to have concrete knowlodge on how accurate they are and therefore to what extent can we trust them. 

The first insight that we have to incorporate into our thinking of model evaluation is that by no means we can evaluate our model with the data that we used for training. Nowadays algorithms are very sophisticated and therefore prone to overfitting. It is therefore necessary to use new data, unseen by model, for its evaluation. 

There are two main approaches. One, the most obvious, is to divide the data in a training and test set. We train the model with the train set and we use the test set to evaluate it. It is simple and works well if we have lots of data. However, if data is scarce, then we don't have enough diversity in the data and the evaluation could not be very accurate. 

The second approach tries to solve this problem of evaluation, as accurate as we can, a model with a limited amount of data. As you can imagine, they consist on using sampling techniques with or without repetition in order to try to "augment" the amount of data available. 

Once we have chosen the best hyperparameters and have the model ready for production, we train it with the whole data and put it in operational use. 

We are going to look at four different techniques that we can use to split our data and create useful estimates of our models:

        1) Train and test sets.
        2) K-fold Cross-validation.
        3) Leave one-out cross-validation.
        4) Repeated random test-train splits.

Yes, we will use the Pima Indians onset of diabetes dataset. 


<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional Machine Learning dataset, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The datasets consists of several medical predictor variables and one target variable, <b>Outcome</b>. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction (scores de likelihood of diabetes based on family history)</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

In [2]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]
X
pd.DataFrame(X).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

Unnamed: 0,0,1,2,3,4,5,6,7
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0


<h1>Split into Train and Test Sets </h1>

A simple idea and also probably the most commonly used approach is to split our data into two sets. Use one for training and the other for testing. Normally a 70% of the data is used for training and 30% for testing, but of course these are arbitrary numbers and anything can be (e.g. 80% - 20% if the dataset is large). 

The points in favor of this approach is that is simple and fast. It works well when datasets are large but also it is widely used as a first approximation. One important thing that must be taken into account is that the variance of both sets is similar, if not we can encounter unwanted surprises. 

The downside is that we can have meaninful differences is the differences in variance are high and we that we take an important risk when the amount of data is small. Once the model is in production we may find that its performance has little in common with what we tested because the data that it encounters is really different. 

The <b> train_test_split </b> module in scikit-learn is the one used for splitting the dataset.


In [3]:
# Split into Train and Test Sets
set_printoptions(precision=3)
p_indians.head()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

# we need to make it reproducible, so we use a seed for the pseudo-random
test_size=0.3
seed = 7

# the actual split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)

# Let's do the log regresssion
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)

# Now let's find the accurary with the test split
result = model.score(X_test, Y_test)

print(f'Accuracy {result*100:5.3f}')



Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy 76.190


<b><font color="red" size=6>Mission 1</font>

a) Change the distribution between Train and Test Sets. How does it affect accurarcy?
<br><br>
</b>

In [4]:
#a)
# USE THE SAME PSEUDO RANDOM SEED AND MANY DIFFERENT TEST SIZES
test_size2=0.2
test_size4=0.4
test_size6=0.6
test_size8=0.8
seed = 7

# the actual splits
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X,Y, test_size=test_size2, random_state=seed)
X_train4, X_test4, Y_train4, Y_test4 = train_test_split(X,Y, test_size=test_size4, random_state=seed)
X_train6, X_test6, Y_train6, Y_test6 = train_test_split(X,Y, test_size=test_size6, random_state=seed)
X_train8, X_test8, Y_train8, Y_test8 = train_test_split(X,Y, test_size=test_size8, random_state=seed)



# Let's do the log regresssions
model2 = LogisticRegression(solver='liblinear')
model4 = LogisticRegression(solver='liblinear')
model6 = LogisticRegression(solver='liblinear')
model8 = LogisticRegression(solver='liblinear')
model2.fit(X_train2,Y_train2)
model4.fit(X_train4,Y_train4)
model6.fit(X_train6,Y_train6)
model8.fit(X_train8,Y_train8)


# Now let's find the accurary with the test split
result2 = model.score(X_test2, Y_test2)
result4 = model.score(X_test4, Y_test4)
result6 = model.score(X_test6, Y_test6)
result8 = model.score(X_test8, Y_test8)

print("########## Test Size = 0.2")
print(f'Accuracy {result2*100:5.3f}')

print("########## Test Size = 0.4")
print(f'Accuracy {result4*100:5.3f}')

print("########## Test Size = 0.6")
print(f'Accuracy {result6*100:5.3f}')

print("########## Test Size = 0.8")
print(f'Accuracy {result8*100:5.3f}')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

########## Test Size = 0.2
Accuracy 78.571
########## Test Size = 0.4
Accuracy 76.948
########## Test Size = 0.6
Accuracy 77.657
########## Test Size = 0.8
Accuracy 77.561


<img src=k-fold.png>

<h1>K-fold Cross-Validation</h1>

The objective of k-fold cross-validation is to reduce the variance that we encounter when using the train-test split approach. 

In this approach the available data is divided into k splits that are called folds (3, 5, 10 are common). We train and test the model k times. Each time we use k-1 folds for training and one fold for testing. Once we finish we use the mean of the evaluation measure together with its standard deviation as performance measure. 

Obviously the dataset must be large enough to accommodate the process. 

K-Fold Cross Validation uses the <b>KFold </b> class. 


In [5]:
# K-fold Cross Validation

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

p_indians.head()

# KFold
splits=10
kfold=KFold(n_splits=splits, random_state=7)

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

print(f'Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Logistic regression, k-fold 10 - Accuracy 76.951% (4.841%)


<h1>Leave One Out Cross-Validation</h1>

What will happen if we take k-fold to the extreme? In this case we will have as many folds as points, so k will be equal to the number of points and the prediction will be done each time for the one point left. 

This is an effort to make the most reasonable estimate possible given a dataset, it's called leave one out cross validation. 

Obviously you pay a penalty in terms of computational expense and the standard deviation has more variance than with k-fold. 

For Leave One Out Cross-Validation you use the <b>LeaveOneOut</b> class. 


In [6]:
# Leave one out cross-validation 

from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

p_indians.head()

# Leave one out cross-validation
loo=LeaveOneOut()

# Logistic Regression
model = LogisticRegression(solver='liblinear')

#  performance
results = cross_val_score(model, X, Y, cv=loo)

print(f'Logistic regression, Leave one out - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Logistic regression, Leave one out - Accuracy 76.953% (42.113%)


<h1>Repeated Random Test-Train Splits </h1>

Another approach is to apply repeatedly a train-test split. This way takes advantage of the train-test speed and the reduction of variance of cross validation at the same time. 

A down side of the method is that we are including much of the same data, therefore results even if they look very nice, may not be realistic.

For Repeated Random Test-Train Splits you use the <b>ShuffleSplit</b> class.


In [7]:
# Repeated Random Test-Train Splits

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

p_indians.head()

nrepeat=10
test_size=0.3
seed=7

shuffle=ShuffleSplit(n_splits=nrepeat, test_size=test_size, random_state=seed)

model = LogisticRegression(solver='liblinear')

res = cross_val_score(model, X, Y, cv=shuffle)

print(f'Log Regression - Repeated Test-Train {nrepeat:d} - Accuracy {res.mean()*100:5.3f}% {res.std()*100:5.3f}%')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Log Regression - Repeated Test-Train 10 - Accuracy 76.970% 1.366%


<h1>Which one to use and When? </h1>

First things first. K-fold is the gold-standard, if you are doing any serious work or presenting results to a digital educated audiende, please use k-fold and you'll avoid problems. 

Train and Test is ok, and it is used for common quick & dirty work. As you have observed in the exercices if the dataset is moderately large, the differences are small. Certainly you avoid surprises with repeatedly using train-test or much better k-fold and your last model should be evaluated always this way, but train and test split is ok for model selection and hyperparameter tunning. 

What about the rest? In all these techniques you try to balance accuracy in the estimated performance, evaluation speed and dataset size, they correspond to different bets in this balance. 

You don't know what to do ... The staple is k-fold with 10-cross-validation, start there. 

<b><font color="red" size=6>Mission 2</font>

a) Using the Shangai Data and log regression for top-10, top-50 and top-100 evaluate the models with train-test split and k-fold-10.
<br><br>
b) Same for the data of the Times ranking. 
<br><br>

</b>

In [8]:
sh = pd.read_csv("shanghaiData.csv")

########### CLEAN THE DATASET
sh=sh[sh['year']==2015]
sh['world_rank'] = sh['world_rank'].str.replace('-', '').astype(int)
sh['total_score'].fillna(0,inplace=True)
sh['ns'].fillna(0,inplace=True)
sh = sh.dropna()


########### TESTSIZES AND SEED
seed = 7
test_size = 0.3


############ RECREATE TOP10, TOP50 & TOP100 DATASET
n10=10
def top10(a):
    if a['world_rank']<=n10:
        return 1
    else:
        return 0
    
n50=50
def top50(a):
    if a['world_rank']<=n50:
        return 1
    else:
        return 0
    
n100=100
def top100(a):
    if a['world_rank']<=n100:
        return 1
    else:
        return 0

    
sh["top10"]= sh.apply(top10, axis=1).astype("int")
sh["top50"]= sh.apply(top50, axis=1).astype("int")
sh["top100"]= sh.apply(top100, axis=1).astype("int")
sh.head(12)



############## SEPARATE INTO INPUT AND OUTPUT COMPONENTS
sh_array=sh.values
independentsh=sh_array[:,4:11]
dependentsh10=sh_array[:,11:12].astype("int")
dependentsh50=sh_array[:,12:13].astype("int")
dependentsh100=sh_array[:,13].astype("int")

print("################ EVALUATION USING ACCURACY AND TRAINING SPLIT #####################")


################# MODEL WITH SPLIT=0.3 AND TOP10
dependentsh10_train, dependentsh10_test, independentsh_train, independentsh_test  = train_test_split(independentsh, dependentsh10, test_size=test_size, random_state=seed)


model2 = LogisticRegression(solver='liblinear')
model2.fit(dependentsh10_train, np.ravel(independentsh_train))

result2 = model2.score(dependentsh10_test, independentsh_test)
print("ACCURACY OF 30% TRAINING SPLIT TOP10")
print(f'Accuracy {result2*100:5.3f}')


################# MODEL WITH SPLIT=0.3 AND TOP50
dependentsh50_train, dependentsh50_test, independentsh_train, independentsh_test  = train_test_split(independentsh, dependentsh50, test_size=test_size, random_state=seed)


model3 = LogisticRegression(solver='liblinear')
model3.fit(dependentsh50_train, np.ravel(independentsh_train))

result3 = model3.score(dependentsh50_test, np.ravel(independentsh_test))
print("ACCURACY OF 30% TRAINING SPLIT TOP 50")
print(f'Accuracy {result2*100:5.3f}')


################# MODEL WITH SPLIT=0.3 AND TOP100
dependentsh100_train, dependentsh100_test, independentsh_train, independentsh_test  = train_test_split(independentsh, dependentsh100, test_size=test_size, random_state=seed)


model4 = LogisticRegression(solver='liblinear')
model4.fit(dependentsh100_train, np.ravel(independentsh_train))

result4 = model4.score(dependentsh100_test, np.ravel(independentsh_test))
print("ACCURACY OF 30% TRAINING SPLIT TOP 100")
print(f'Accuracy {result2*100:5.3f}')




print("###################### K-FOLD EVALUATION ######################")

#KFold
splits= 10
kfold= KFold(n_splits=splits, random_state=7, shuffle = True)

#Obtain the performance measure - accuracy
results10 = cross_val_score(model2, independentsh, np.ravel(dependentsh10), cv=kfold) 
results50 = cross_val_score(model3, independentsh, np.ravel(dependentsh50), cv=kfold) 
results100 = cross_val_score(model4, independentsh, np.ravel(dependentsh100), cv=kfold)

print(f'Logistic regression, k-fold {splits:d} - Accuracy {results10.mean()*100:5.3f}% ({results10.std()*100:5.3f}%)')
print(f'Logistic regression, k-fold {splits:d} - Accuracy {results50.mean()*100:5.3f}% ({results50.std()*100:5.3f}%)')
print(f'Logistic regression, k-fold {splits:d} - Accuracy {results100.mean()*100:5.3f}% ({results100.std()*100:5.3f}%)')

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year,top10,top50,top100
4397,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,76.6,2015,1,1,1
4398,2,Stanford University,2,73.3,40.7,89.6,80.1,70.1,70.6,53.8,2015,1,1,1
4399,3,Massachusetts Institute of Technology (MIT),3,70.4,68.2,80.7,60.6,73.1,61.1,68.0,2015,1,1,1
4400,4,"University of California, Berkeley",4,69.6,65.1,79.4,66.1,65.6,67.9,56.5,2015,1,1,1
4401,5,University of Cambridge,1,68.8,77.1,96.6,50.8,55.6,66.4,55.8,2015,1,1,1
4402,6,Princeton University,5,61.0,53.3,93.4,57.1,43.0,42.4,70.3,2015,1,1,1
4403,7,California Institute of Technology,6,59.6,49.5,66.7,49.3,56.4,44.0,100.0,2015,1,1,1
4404,8,Columbia University,7,58.8,63.5,65.9,52.1,51.9,68.8,33.2,2015,1,1,1
4405,9,University of Chicago,8,57.1,59.8,86.3,49.0,42.9,49.8,42.0,2015,1,1,1
4406,10,University of Oxford,2,56.6,49.7,54.9,52.3,51.9,70.9,43.1,2015,1,1,1


################ EVALUATION USING ACCURACY AND TRAINING SPLIT #####################


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

ACCURACY OF 30% TRAINING SPLIT TOP10
Accuracy 100.000


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

ACCURACY OF 30% TRAINING SPLIT TOP 50
Accuracy 100.000


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

ACCURACY OF 30% TRAINING SPLIT TOP 100
Accuracy 100.000
###################### K-FOLD EVALUATION ######################
Logistic regression, k-fold 10 - Accuracy 99.400% (1.281%)
Logistic regression, k-fold 10 - Accuracy 99.800% (0.600%)
Logistic regression, k-fold 10 - Accuracy 99.200% (1.327%)


In [9]:
################ SAME FOR TIMES RANKING
df = pd.read_csv("timesData.csv")

########### CLEAN THE DATASET
df["num_students"] = df["num_students"].str.replace(',','')
df["world_rank"] = df["world_rank"].str.replace('=','')
df["international_students"] = df["international_students"].str.replace('%','')
df["female_male_ratio"] = df["female_male_ratio"].str[:2]
df=df[df['year']==2015]
df['world_rank'] = df['world_rank'].str.replace('-', '').astype(int)
df = df.drop(columns="country")
df = df.drop(columns="university_name")
df = df.replace({'-': ''}, regex=True)
df = df.drop(columns="total_score")
df = df.apply(pd.to_numeric)
df = df.dropna()

########### TESTSIZES AND SEED
seed = 7
test_size = 0.3


############ RECREATE TOP10, TOP50 & TOP100 DATASET
df["top10"]= df.apply(top10, axis=1).astype("int")
df["top50"]= df.apply(top50, axis=1).astype("int")
df["top100"]= df.apply(top100, axis=1).astype("int")
df.head(12)



############## SEPARATE INTO INPUT AND OUTPUT COMPONENTS
df_array=df.values
df_array
independent=df_array[:,1:10]
dependent10=df_array[:,11:12].astype("int")
dependent50=df_array[:,12:13].astype("int")
dependent100=df_array[:,13].astype("int")

print("################ EVALUATION USING ACCURACY AND TRAINING SPLIT #####################")


################# MODEL WITH SPLIT=0.3 AND TOP10
dependent10_train, dependent10_test, independent_train, independent_test  = train_test_split(independent, dependent10, test_size=test_size, random_state=seed)


model5 = LogisticRegression(solver='liblinear')
model5.fit(dependent10_train, np.ravel(independent_train))

result5 = model5.score(dependent10_test, independent_test)
print("ACCURACY OF 30% TRAINING SPLIT TOP10")
print(f'Accuracy {result5*100:5.3f}')


################# MODEL WITH SPLIT=0.3 AND TOP50
dependent50_train, dependent50_test, independent_train, independent_test  = train_test_split(independent, dependent50, test_size=test_size, random_state=seed)


model6 = LogisticRegression(solver='liblinear')
model6.fit(dependent50_train, np.ravel(independent_train))

result6 = model6.score(dependent50_test, np.ravel(independent_test))
print("ACCURACY OF 30% TRAINING SPLIT TOP 50")
print(f'Accuracy {result6*100:5.3f}')


################# MODEL WITH SPLIT=0.3 AND TOP100
dependent100_train, dependent100_test, independent_train, independent_test  = train_test_split(independent, dependent100, test_size=test_size, random_state=seed)


model7 = LogisticRegression(solver='liblinear')
model7.fit(dependent100_train, np.ravel(independent_train))

result7 = model7.score(dependent100_test, np.ravel(independent_test))
print("ACCURACY OF 30% TRAINING SPLIT TOP 100")
print(f'Accuracy {result2*100:5.3f}')


print("###################### K-FOLD EVALUATION ######################")

#KFold
splits= 10
kfold= KFold(n_splits=splits, random_state=7, shuffle = True)

#Obtain the performance measure - accuracy
ftresults10 = cross_val_score(model5, independent, np.ravel(dependent10), cv=kfold) 
ftresults50 = cross_val_score(model6, independent, np.ravel(dependent50), cv=kfold) 
ftresults100 = cross_val_score(model7, independent, np.ravel(dependent100), cv=kfold)

print(f'Logistic regression, k-fold {splits:d} - Accuracy {ftresults10.mean()*100:5.3f}% ({ftresults10.std()*100:5.3f}%)')
print(f'Logistic regression, k-fold {splits:d} - Accuracy {ftresults50.mean()*100:5.3f}% ({ftresults50.std()*100:5.3f}%)')
print(f'Logistic regression, k-fold {splits:d} - Accuracy {ftresults100.mean()*100:5.3f}% ({ftresults100.std()*100:5.3f}%)')

Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,top10,top50,top100
1402,1,92.2,67.0,98.1,99.7,89.1,2243.0,6.9,27.0,33.0,2015,1,1,1
1404,3,88.6,90.7,97.7,95.5,72.9,19919.0,11.6,34.0,46.0,2015,1,1,1
1405,4,91.5,69.0,96.7,99.1,63.1,15596.0,7.8,22.0,42.0,2015,1,1,1
1406,5,89.7,87.8,95.6,95.2,51.1,18812.0,11.8,34.0,46.0,2015,1,1,1
1407,6,89.1,84.3,88.2,100.0,95.7,11074.0,9.0,33.0,37.0,2015,1,1,1
1408,7,86.6,61.2,94.7,99.6,82.7,7929.0,8.4,27.0,45.0,2015,1,1,1
1409,8,84.2,58.5,96.7,99.1,44.8,36186.0,16.4,15.0,50.0,2015,1,1,1
1410,9,84.6,92.7,88.3,89.4,72.7,15060.0,11.7,51.0,37.0,2015,1,1,1
1411,9,88.5,59.8,90.8,94.0,42.0,11751.0,4.4,20.0,50.0,2015,1,1,1
1412,11,83.9,65.2,89.9,97.3,36.8,14221.0,6.9,21.0,42.0,2015,0,1,1


array([[1.000e+00, 9.220e+01, 6.700e+01, ..., 1.000e+00, 1.000e+00,
        1.000e+00],
       [3.000e+00, 8.860e+01, 9.070e+01, ..., 1.000e+00, 1.000e+00,
        1.000e+00],
       [4.000e+00, 9.150e+01, 6.900e+01, ..., 1.000e+00, 1.000e+00,
        1.000e+00],
       ...,
       [3.514e+05, 2.850e+01, 3.600e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.514e+05, 1.780e+01, 5.010e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.514e+05, 1.620e+01, 2.140e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])

################ EVALUATION USING ACCURACY AND TRAINING SPLIT #####################


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

ACCURACY OF 30% TRAINING SPLIT TOP10
Accuracy 100.000


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

ACCURACY OF 30% TRAINING SPLIT TOP 50
Accuracy 98.077


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

ACCURACY OF 30% TRAINING SPLIT TOP 100
Accuracy 100.000
###################### K-FOLD EVALUATION ######################
Logistic regression, k-fold 10 - Accuracy 97.672% (3.383%)
Logistic regression, k-fold 10 - Accuracy 95.639% (2.990%)
Logistic regression, k-fold 10 - Accuracy 88.412% (5.222%)
