In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")

# Model Evaluation

We need to evaluate the performance of our machine learning models for two main reasons. First, to be able to tune them, deciding which interventions increase their performance and which ones don't. Second, to have concrete knowledge of how accurate they are and therefore to what extent we can trust them. 

The first insight to incorporate into our thinking about model evaluation is that we must never evaluate a model on the data we used to train it. Modern algorithms are very flexible and therefore prone to overfitting, so a score computed on the training data tends to be optimistic. It is therefore necessary to use new data, unseen by the model, for its evaluation. 

There are two main approaches. The first, and most obvious, is to divide the data into a training set and a test set. We train the model with the training set and use the test set to evaluate it. It is simple and works well if we have lots of data. However, if data is scarce, the test set does not contain enough diversity and the evaluation may not be very accurate. 

The second approach tries to solve this problem: evaluating a model as accurately as we can with a limited amount of data. As you can imagine, it consists of using resampling techniques, with or without replacement, in order to "augment" the amount of data available. 

Once we have chosen the best hyperparameters and have the model ready for production, we train it with the whole data and put it in operational use. 
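
The workflow described above (estimate performance on unseen data, then refit on everything) can be sketched as follows. This is only an illustration: the data comes from `make_classification` as a synthetic stand-in, and the names `X_demo`, `y_demo`, `candidate` and `final_model` are hypothetical, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)

# 1) Hold out unseen data and estimate performance on it
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7
)
candidate = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)
train_acc = candidate.score(X_tr, y_tr)  # optimistic: data the model has seen
test_acc = candidate.score(X_te, y_te)   # the estimate we actually trust
print(f"train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")

# 2) Once satisfied, refit the same configuration on all the data
final_model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)
```

The test accuracy is the number to report; the final refit only exists to give the production model as much data as possible.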

We are going to look at four different techniques that we can use to split our data and create useful estimates of our models:

        1) Train and test sets.
        2) K-fold Cross-validation.
        3) Leave-one-out cross-validation.
        4) Repeated random test-train splits.

Yes, we will use the Pima Indians onset of diabetes dataset. 


<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional Machine Learning datasets, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The dataset consists of several medical predictor variables and one target variable, <b>Outcome</b>. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction (scores the likelihood of diabetes based on family history)</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

My notes:

You divide the data into TRAINING and TEST sets (e.g. 20% for test); the test result is the one you present.
K-FOLD is used to validate the model you present.
The data is separated into several chunks (e.g. 1-10):
- first iteration: train with chunks 2-10 and test with chunk 1
- second iteration: train with chunks 1 and 3-10 and test with chunk 2
- continue until all chunks have been used for testing
Compare the test results.
Look at the standard deviation of the results (if it is high, you may need to acquire new data).

Problem: a single test split can be biased.
Multiple splits (not to be presented).
Normally ten folds, but if there is not enough data then 4 (is the std higher?).
Look at the std of the results. If it is low, the estimate can be trusted.
Typical error --> using too many folds.

You need a good chunk of data for both training and test.

TRAIN AND TEST SPLIT: a variation is to randomize it. Imagine you have ten splits with random samples.
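
The chunk-by-chunk procedure in these notes can be written out as an explicit loop. The sketch below is an illustration under assumptions: the data is synthetic (`make_classification`), and `X_demo`, `y_demo` and `fold_scores` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption; not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

fold_scores = []
# Each iteration holds out one chunk for testing and trains on the other nine
for train_idx, test_idx in KFold(n_splits=10).split(X_demo):
    clf = LogisticRegression(solver="liblinear")
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

fold_scores = np.array(fold_scores)
# The mean is the performance estimate; the std says how much to trust it
print(f"mean {fold_scores.mean():.3f}, std {fold_scores.std():.3f}")
```

`cross_val_score` wraps the same loop in a single call, so the explicit version is mostly useful for seeing what happens in each iteration.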

In [3]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]
X
pd.DataFrame(X).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

Unnamed: 0,0,1,2,3,4,5,6,7
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0


<h1>Split into Train and Test Sets </h1>

A simple idea, and probably the most commonly used approach, is to split our data into two sets: one for training and the other for testing. Normally 70% of the data is used for training and 30% for testing, but of course these are arbitrary numbers and other splits can be used (e.g. 80%/20% if the dataset is large). 

The points in favor of this approach are that it is simple and fast. It works well when datasets are large, and it is also widely used as a first approximation. One important thing to take into account is that both sets should have similar distributions (for instance, similar variance); if not, we can encounter unwanted surprises. 

The downside is that the estimate can change meaningfully from one split to another when the two sets differ a lot, and that we take an important risk when the amount of data is small. Once the model is in production we may find that its performance has little in common with what we tested because the data that it encounters is really different. 

The <b>train_test_split</b> function in scikit-learn's <b>model_selection</b> module is the one used for splitting the dataset.
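
One way to keep the two sets comparable, at least in class balance, is the `stratify` argument of `train_test_split`. The sketch below uses made-up data (40 positives out of 100, a hypothetical ratio) and is not part of the original exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 3))        # hypothetical features
y_demo = np.array([1] * 40 + [0] * 60)    # hypothetical labels, 40% positive

# stratify=y_demo asks for the same class proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7, stratify=y_demo
)
print(f"positives in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```

Stratification only controls the target distribution; it does not guarantee that the feature distributions match.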


In [4]:
# Split into Train and Test Sets
set_printoptions(precision=3)
p_indians.head()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

# we need to make it reproducible, so we use a seed for the pseudo-random generator
test_size=0.3
seed = 7 # fixed so that the result is always the same (and matches the instructor's)

# the actual split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)

# Let's do the logistic regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)

# Now let's find the accuracy with the test split
result = model.score(X_test, Y_test) # scored on the test set to check for overfitting

# print(model.score(X_train, Y_train)) -- the training error is usually lower because the model was fitted on this data.
# If the train and test errors are similar, you are on the right track.
print(f'Accuracy {result*100:5.3f}') # as a percentage: fixed-point float ('f'), 3 decimals, minimum width 5

# f-strings are the quickest and most convenient way to format output.


# You can set test_size or train_size; setting one is enough because the other defaults to its complement.
# liblinear is a solver suited to binary problems and small datasets; newton-cg handles multiclass problems.
# lbfgs is the default solver in recent scikit-learn versions; it works for small or large datasets and supports multinomial loss (more than two classes).

# model.score returns the classification accuracy: correct predictions divided by all predictions.
# 95% is good.
# 99%: suspect either a linear dependency or overfitting.

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy 76.190
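
The format spec `5.3f` in the `print` call above reads as "minimum width 5, 3 decimals, fixed-point". A few standalone examples (the values are arbitrary, chosen only for illustration):

```python
value = 0.7619047619

as_percent = f"{value * 100:5.3f}"  # already wider than 5 characters, so no padding
padded = f"{7.5:8.3f}"              # '7.500' is 5 characters, left-padded to width 8
count = f"{10:d}"                   # 'd' formats an integer in decimal notation

print(repr(as_percent), repr(padded), repr(count))
# → '76.190' '   7.500' '10'
```

The width is a minimum, not a limit: a longer number is never truncated.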


<b><font color="red" size=6>Mission 1</font>

a) Change the distribution between Train and Test Sets. How does it affect accuracy?
<br><br>
</b>

In [5]:
# A) Change the distribution between Train and Test Sets. How does it affect accuracy? 

# Imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

test_size=0.3
seed = 7

# Split, log regression and result for i % of data as test
print("-------------Test is 30 % of the data-------------")
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result= model.score(X_test, Y_test)
print(f'Accuracy with test 30 % {result*100:5.3f}')

print("-------------Test is 10 % of the data-------------")
test_size=0.1
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result= model.score(X_test, Y_test)
print(f'Accuracy with test 10 % {result*100:5.3f}')

print("-------------Test is 50 % of the data-------------")
test_size=0.5
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result= model.score(X_test, Y_test)
print(f'Accuracy with test 50 % {result*100:5.3f}')

print("-------------Test is 60 % of the data-------------")
test_size=0.6
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result= model.score(X_test, Y_test)
print(f'Accuracy with test 60 % {result*100:5.3f}')




# Should I also answer the theoretical question?

-------------Test is 30 % of the data-------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy with test 30 % 76.190
-------------Test is 10 % of the data-------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy with test 10 % 83.117
-------------Test is 50 % of the data-------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy with test 50 % 77.604
-------------Test is 60 % of the data-------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy with test 60 % 75.054
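
A single split per test size, as in the cell above, mixes the effect of the test size with the luck of one particular split. A rough sketch of separating the two is to repeat each split with several seeds and look at the spread. The data here is synthetic (`make_classification`, sized like the Pima file as an assumption), and `spread` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

spread = {}
for test_size in (0.1, 0.3, 0.5, 0.6):
    accs = []
    for seed in range(20):  # 20 different random splits per test size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed
        )
        clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)
        accs.append(clf.score(X_te, y_te))
    spread[test_size] = (np.mean(accs), np.std(accs))
    print(f"test_size={test_size:.1f}: mean {np.mean(accs)*100:5.2f}%, std {np.std(accs)*100:4.2f}%")
```

If the standard deviation across seeds is comparable to the differences between test sizes, those differences should not be read as a real effect.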


<img src=k-fold.png>

train_test_split splits the data at random.

<h1>K-fold Cross-Validation</h1>

The objective of k-fold cross-validation is to reduce the variance that we encounter when using the train-test split approach. 

In this approach the available data is divided into k splits that are called folds (3, 5, 10 are common). We train and test the model k times. Each time we use k-1 folds for training and one fold for testing. Once we finish we use the mean of the evaluation measure together with its standard deviation as performance measure. 

Obviously the dataset must be large enough to accommodate the process. 

K-Fold Cross Validation uses the <b>KFold </b> class. 
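
One detail of `KFold` worth knowing: by default it does not shuffle, so the folds are contiguous blocks of rows, and `random_state` only matters when `shuffle=True`. To the best of my knowledge, recent scikit-learn releases raise an error if `random_state` is set while `shuffle` is left at `False`. A small sketch on a toy array (`X_demo` is a hypothetical name):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 toy samples

# Default: no shuffling, test folds are consecutive blocks of indices
plain = [test for _, test in KFold(n_splits=5).split(X_demo)]

# shuffle=True with a seed: random folds, reproducible across runs
shuffled_a = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=7).split(X_demo)]
shuffled_b = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=7).split(X_demo)]

print("plain   :", [t.tolist() for t in plain])
print("shuffled:", [t.tolist() for t in shuffled_a])
```

Shuffling matters when the rows are sorted in some meaningful way (by date, by class, by rank), because contiguous folds would then be unrepresentative.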


In [6]:
# K-fold Cross Validation

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

p_indians.head()

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

# cv: the cross-validation splitting strategy (here, the KFold object)

print(f'Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')

# random_state is the seed (it only has an effect when shuffle=True).
# cv defines how the data is divided.
# {splits:d} formats the integer in decimal notation.
# cross_val_score returns 10 scores (accuracies); we report their mean and std.

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Logistic regression, k-fold 10 - Accuracy 76.951% (4.841%)


<h1>Leave One Out Cross-Validation</h1>

What will happen if we take k-fold to the extreme? In this case we will have as many folds as points, so k will be equal to the number of points and the prediction will be done each time for the one point left. 

This is an effort to make the most reasonable estimate possible given a dataset; it is called leave-one-out cross-validation. 

Obviously you pay a penalty in terms of computational expense, because the model is fitted once per data point. Each fold's score is also either 0 or 1, so the standard deviation of the per-fold scores is much larger than with k-fold. 

For Leave One Out Cross-Validation you use the <b>LeaveOneOut</b> class. 
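
With leave-one-out every fold contains a single point, so each score is 0 or 1 and the standard deviation of the scores is fully determined by the mean accuracy $p$: it equals $\sqrt{p(1-p)}$. The sketch below checks this with made-up counts; 590 correct predictions out of 768 is an assumption chosen to give an accuracy of about 76.8%.

```python
import numpy as np

n_samples = 768   # assumed number of rows
n_correct = 590   # assumed number of correct leave-one-out predictions

# One score per fold: 1.0 if the held-out point was classified correctly, else 0.0
loo_scores = np.array([1.0] * n_correct + [0.0] * (n_samples - n_correct))

mean_acc = loo_scores.mean()
std_acc = loo_scores.std()
bernoulli_std = np.sqrt(mean_acc * (1 - mean_acc))

print(f"mean {mean_acc*100:5.3f}%, std {std_acc*100:5.3f}%, sqrt(p(1-p)) {bernoulli_std*100:5.3f}%")
```

A large standard deviation here says nothing about the stability of the model; it is a property of 0/1 scores, so it is not comparable with the standard deviation reported by k-fold.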


My notes:
k-fold taken to the extreme: each test subset contains a single element.
loo = leave one out.

Normally k-fold is good enough; that is the summary.
Instructor: play with the model, changing values, etc.

In [7]:
# Leave one out cross-validation 

from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

p_indians.head()

# Leave one out cross-validation
loo=LeaveOneOut()

# Logistic Regression
model = LogisticRegression(solver='liblinear')

#  performance
results = cross_val_score(model, X, Y, cv=loo)

print(f'Logistic regression, Leave one out - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Logistic regression, Leave one out - Accuracy 76.823% (42.196%)


<h1>Repeated Random Test-Train Splits </h1>

Another approach is to repeatedly apply a train-test split. This takes advantage of the speed of the train-test split and the variance reduction of cross-validation at the same time. 

A downside of the method is that the splits include much of the same data from run to run, so the results, even if they look very nice, may not be realistic.

For Repeated Random Test-Train Splits you use the <b>ShuffleSplit</b> class.


My notes:
It is more of the same. It repeats the test 10 times in this case?
The instructor usually prints a table... 2 std is about 95%? (shape of the normal curve)

Normally you use a train-test split.
Bootstrapping and bagging artificially augment the size of the sample if you don't have enough data.
You present the model with k-fold. First you try a train-test split to see whether it works, but then you run k-fold and that is what you present.
Similar to what is done in deep learning.
Training with more data increases accuracy, up to a point where it becomes stable. With test data it starts and then drops? Look for the inflection point. But normally you are not so sophisticated.


In [8]:
# Repeated Random Test-Train Splits

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

p_indians.head()

nrepeat=10
test_size=0.3
seed=7

shuffle=ShuffleSplit(n_splits=nrepeat, test_size=test_size, random_state=seed)

model = LogisticRegression(solver='liblinear')

res = cross_val_score(model, X, Y, cv=shuffle)

print(f'Log Regression - Repeated Test-Train {nrepeat:d} - Accuracy {res.mean()*100:5.3f}% {res.std()*100:5.3f}%')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Log Regression - Repeated Test-Train 10 - Accuracy 76.970% 1.366%


<h1>Which one to use and When? </h1>

First things first. K-fold is the gold standard; if you are doing any serious work or presenting results to a digitally educated audience, please use k-fold and you'll avoid problems. 

Train and test is OK, and it is commonly used for quick-and-dirty work. As you have observed in the exercises, if the dataset is moderately large, the differences are small. You certainly avoid surprises by using repeated train-test splits or, much better, k-fold, and your final model should always be evaluated this way, but a train-test split is OK for model selection and hyperparameter tuning. 

What about the rest? All these techniques try to balance accuracy of the estimated performance, evaluation speed and dataset size; they correspond to different bets in this balance. 

If you don't know what to do, the staple is 10-fold cross-validation; start there. 

<b><font color="red" size=6>Mission 2</font>

a) Using the Shanghai Data and log regression for top-10, top-50 and top-100 evaluate the models with train-test split and k-fold-10.
<br><br>
b) Same for the data of the Times ranking. 
<br><br>

</b>

In [9]:
# A) Using the Shanghai Data and log regression for top-10, top-50 and top-100 evaluate the models with train-test split 
# ...and k-fold-10. 

# Import dataset (copy-pasted from the 'Feature Selection' notebook missions)
sg_data = pd.read_csv('shanghaiData.csv')
sg_data['world_rank'] = sg_data['world_rank'].str.split('-').str.get(0).astype(float)
sg_data.drop(columns=['national_rank','total_score'], inplace=True)
sg_data.dropna(inplace=True)
sg_data.head()

print('-----------------------------Train-test split TOP10------------------------------------')
# Select input and target variables with iloc (integer position-based) and transform to numpy arrays.
X = sg_data.iloc[:,2:8].values
sg_data['Top10'] = sg_data['world_rank']<11
Y = sg_data['Top10'].values

# Specific imports, log regression and accuracy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test) # find accuracy

print(f'Accuracy for TOP10 {result*100:5.3f}') 

print('-----------------------------Train-test split TOP50------------------------------------')
# Select input and target variables with iloc (integer position-based) and transform to numpy arrays.
X = sg_data.iloc[:,2:8].values
sg_data['Top50'] = sg_data['world_rank']<51
Y = sg_data['Top50'].values

#  Specific imports, log regression and accuracy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test) # find accuracy

print(f'Accuracy for TOP50 {result*100:5.3f}') 

print('-----------------------------Train-test split TOP100------------------------------------')
# Select input and target variables with iloc (integer position-based) and transform to numpy arrays.
X = sg_data.iloc[:,2:8].values
sg_data['Top100'] = sg_data['world_rank']<101
Y = sg_data['Top100'].values

#  Specific imports, log regression and accuracy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test) # find accuracy

print(f'Accuracy for TOP100 {result*100:5.3f}')



print('-----------------------------K-FOLD TOP10------------------------------------')

# Imports
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

# Select input and target variables and transform to numpy arrays.
X = sg_data.iloc[:,2:8].values
sg_data['Top10'] = sg_data['world_rank']<11
Y = sg_data['Top10'].values

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

#Show results
print(f'Logistic regression TOP10, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


print('-----------------------------K-FOLD TOP50------------------------------------')

# Imports
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

# Select input and target variables and transform to numpy arrays.
X = sg_data.iloc[:,2:8].values
sg_data['Top50'] = sg_data['world_rank']<51 # not necessary since I have it for the train-test split already, but pretend not.
Y = sg_data['Top50'].values

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

#Show results
print(f'Logistic regression TOP50, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


print('-----------------------------K-FOLD TOP100------------------------------------')

# Imports
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

# Select input and target variables and transform to numpy arrays.
X = sg_data.iloc[:,2:8].values
sg_data['Top100'] = sg_data['world_rank']<101 # not necessary since I have it for the train-test split already, but pretend not.
Y = sg_data['Top100'].values

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

#Show results
print(f'Logistic regression TOP100, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


Unnamed: 0,world_rank,university_name,alumni,award,hici,ns,pub,pcp,year
0,1.0,Harvard University,100.0,100.0,100.0,100.0,100.0,72.4,2005
1,2.0,University of Cambridge,99.8,93.4,53.3,56.6,70.9,66.9,2005
2,3.0,Stanford University,41.1,72.2,88.5,70.9,72.3,65.0,2005
3,4.0,"University of California, Berkeley",71.8,76.0,69.4,73.9,72.2,52.7,2005
4,5.0,Massachusetts Institute of Technology (MIT),74.0,80.6,66.7,65.8,64.3,53.0,2005


-----------------------------Train-test split TOP10------------------------------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for TOP10 99.727
-----------------------------Train-test split TOP50------------------------------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for TOP50 98.770
-----------------------------Train-test split TOP100------------------------------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for TOP100 98.565
-----------------------------K-FOLD TOP10------------------------------------
Logistic regression TOP10, k-fold 10 - Accuracy 99.836% (0.179%)
-----------------------------K-FOLD TOP50------------------------------------
Logistic regression TOP50, k-fold 10 - Accuracy 98.687% (0.307%)
-----------------------------K-FOLD TOP100------------------------------------
Logistic regression TOP100, k-fold 10 - Accuracy 98.358% (0.610%)


In [11]:
# B) Same for the data of the Times ranking. 

# I copy-paste the code of my previous notebook.
# Import dataset and do corresponding adjustments.
times = pd.read_csv('timesData.csv')
times.world_rank.replace('=.','.', regex=True, inplace=True)
times['world_rank'] = times['world_rank'].str.split('-').str.get(0).astype(float)
times.world_rank = pd.to_numeric(times.world_rank, errors='coerce')
times.dropna(subset=['world_rank'], axis=0, inplace=True)
times.international = pd.to_numeric(times.international, errors='coerce')
times.income = pd.to_numeric(times.income, errors='coerce')
times.num_students.replace(r'\D','', regex=True, inplace=True)
times.num_students = pd.to_numeric(times.num_students, errors='coerce')
times.international_students.replace(r'\D','', regex=True, inplace=True)
times.international_students = pd.to_numeric(times.international_students, errors='coerce')
times.international_students = times.international_students/100
times['females'] = times.female_male_ratio.str.split(':').str.get(0)
times.females = pd.to_numeric(times.females, errors='coerce')
times.drop(columns='total_score',inplace=True)
times.dropna(inplace=True)
times.head()


print('-----------------------------Train-test split TOP10------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top10'] = times['world_rank']<11
Y = times['Top10'].values

# Specific imports, log regression and accuracy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test) # find accuracy

print(f'Accuracy for TOP10 {result*100:5.3f}') 

print('-----------------------------Train-test split TOP50------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top50'] = times['world_rank']<51
Y = times['Top50'].values

# Specific imports, log regression and accuracy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test) # find accuracy

print(f'Accuracy for TOP50 {result*100:5.3f}') 

print('-----------------------------Train-test split TOP100------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top100'] = times['world_rank']<101
Y = times['Top100'].values

# Specific imports, log regression and accuracy
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test) # find accuracy

print(f'Accuracy for TOP100 {result*100:5.3f}') 


print('-----------------------------K-FOLD TOP10------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top10'] = times['world_rank']<11
Y = times['Top10'].values

# Imports
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

#Show results
print(f'Logistic regression TOP10, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')

print('-----------------------------K-FOLD TOP50------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top50'] = times['world_rank']<51
Y = times['Top50'].values

# Imports (repeated as a reminder; they were already imported above)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

#Show results
print(f'Logistic regression TOP50, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')

print('-----------------------------K-FOLD TOP100------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top100'] = times['world_rank']<101
Y = times['Top100'].values

# Imports (repeated as a reminder; they were already imported above)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# KFold
splits=10
kfold=KFold(n_splits=splits) # no random_state: it only applies when shuffle=True

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

#Show results
print(f'Logistic regression TOP100, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,females
1,2.0,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,0.27,33 : 67,2011,33.0
2,3.0,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,0.33,37 : 63,2011,37.0
3,4.0,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,0.22,42 : 58,2011,42.0
5,6.0,University of Cambridge,United Kingdom,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,0.34,46 : 54,2011,46.0
6,6.0,University of Oxford,United Kingdom,88.2,77.2,93.9,95.1,73.5,19919.0,11.6,0.34,46 : 54,2011,46.0


-----------------------------Train-test split TOP10------------------------------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for TOP10 94.793
-----------------------------Train-test split TOP50------------------------------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for TOP50 92.649
-----------------------------Train-test split TOP100------------------------------------


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for TOP100 88.821
-----------------------------K-FOLD TOP10------------------------------------
Logistic regression TOP10, k-fold 10 - Accuracy 94.063% (9.043%)
-----------------------------K-FOLD TOP50------------------------------------
Logistic regression TOP50, k-fold 10 - Accuracy 92.776% (9.568%)
-----------------------------K-FOLD TOP100------------------------------------
Logistic regression TOP100, k-fold 10 - Accuracy 88.368% (10.026%)
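
A note on the spread of these k-fold scores: `KFold` does not shuffle by default, so each fold is a contiguous block of rows. The Times table appears to be ordered by year and rank, which may contribute to the large standard deviations above. The sketch below compares contiguous and shuffled folds on synthetic data; `X_demo`, `Y_demo`, and the "Top20" target are made up for illustration and are not the Times data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the ranking table (an assumption, not the Times data):
# 5 "years" of 200 rows each, sorted by year and then by rank.
rng = np.random.default_rng(7)
rank = np.tile(np.arange(1, 201), 5)
feature = (200 - rank + rng.normal(scale=20.0, size=rank.size)) / 100.0
X_demo = feature.reshape(-1, 1)
Y_demo = rank <= 20  # hypothetical "Top20" target

model = LogisticRegression(solver='liblinear')

# Contiguous folds (KFold default): some folds contain no top-ranked rows at all.
plain = cross_val_score(model, X_demo, Y_demo, cv=KFold(n_splits=10))

# Shuffled folds: every fold mixes rows from the whole table.
shuffled_cv = KFold(n_splits=10, shuffle=True, random_state=7)
shuffled = cross_val_score(model, X_demo, Y_demo, cv=shuffled_cv)

print(f'Contiguous folds - Accuracy {plain.mean()*100:5.3f}% ({plain.std()*100:5.3f}%)')
print(f'Shuffled folds   - Accuracy {shuffled.mean()*100:5.3f}% ({shuffled.std()*100:5.3f}%)')
```

Whether shuffling is appropriate depends on the data. If the same university appears in several years, a grouped split might be a better fit than plain shuffling.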


In [12]:
# B) Same for the data of the Times ranking. 

# Code copied from my previous notebook:
# import the dataset and clean the columns we need.
times = pd.read_csv('timesData.csv')
times.world_rank.replace('=.','.', regex=True, inplace=True)
times['world_rank'] = times['world_rank'].str.split('-').str.get(0).astype(float)
times.world_rank = pd.to_numeric(times.world_rank, errors='coerce')
times.dropna(subset=['world_rank'], axis=0, inplace=True)
times.international = pd.to_numeric(times.international, errors='coerce')
times.income = pd.to_numeric(times.income, errors='coerce')
# Raw strings avoid the invalid-escape warning for \D (any non-digit character).
times.num_students.replace(r'\D','', regex=True, inplace=True)
times.num_students = pd.to_numeric(times.num_students, errors='coerce')
times.international_students.replace(r'\D','', regex=True, inplace=True)
times.international_students = pd.to_numeric(times.international_students, errors='coerce')
times.international_students = times.international_students/100
times['females'] = times.female_male_ratio.str.split(':').str.get(0)
times.females = pd.to_numeric(times.females, errors='coerce')
times.drop(columns='total_score',inplace=True)
times.dropna(inplace=True)
times.head()


print('-----------------------------Train-test split TOP10------------------------------------')

# Select input and target variables and transform to numpy arrays.
X = pd.concat([times.iloc[:,3:11],times['females']],axis=1).values
times['Top10'] = times['world_rank']<11
Y = times['Top10'].values


Y

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,females
1,2.0,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,0.27,33 : 67,2011,33.0
2,3.0,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,0.33,37 : 63,2011,37.0
3,4.0,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,0.22,42 : 58,2011,42.0
5,6.0,University of Cambridge,United Kingdom,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,0.34,46 : 54,2011,46.0
6,6.0,University of Oxford,United Kingdom,88.2,77.2,93.9,95.1,73.5,19919.0,11.6,0.34,46 : 54,2011,46.0


-----------------------------Train-test split TOP10------------------------------------


array([ True,  True,  True, ..., False, False, False])

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
test_size=0.3 
seed = 7 # to reproduce
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
Y_train

array([False, False, False, ..., False, False, False])
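
The target is mostly `False`, so two checks may be worth adding to a split like the one above. One is to pass `stratify` to `train_test_split`, so the train and test sets keep the same share of positives. The other is to compare the model's accuracy with a majority-class baseline. This is a minimal sketch on synthetic data: `X_demo`, `Y_demo`, and the 5% positive rate are assumptions, not values from the Times table.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (an assumption): 5% positives, like a "Top N" flag.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(2000, 3))
Y_demo = np.zeros(2000, dtype=bool)
Y_demo[:100] = True

# stratify=Y_demo keeps the share of positives the same in both parts.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.3, random_state=7, stratify=Y_demo
)
print(f'Positive share: train {Y_tr.mean():.3f}, test {Y_te.mean():.3f}')

# Baseline that always predicts the majority class (here: False).
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, Y_tr)
baseline_acc = baseline.score(X_te, Y_te)
print(f'Majority-class baseline accuracy: {baseline_acc*100:5.3f}%')
```

If a model's accuracy is not clearly above this baseline, a high accuracy figure mostly reflects the class imbalance rather than what the model has learned.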