In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")

# Model Evaluation

We need to evaluate the performance of our machine learning models for two main reasons. First, to be able to tune them, deciding which interventions increase their performance and which ones don't. Second, to have concrete knowledge of how accurate they are and therefore to what extent we can trust them.

The first insight to incorporate into our thinking about model evaluation is that we must never evaluate a model with the data we used to train it. Modern algorithms are very sophisticated and therefore prone to overfitting. It is therefore necessary to evaluate on new data, unseen by the model.
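To make the overfitting point concrete, here is a minimal sketch. It assumes a synthetic noisy dataset from `make_classification` and an unconstrained decision tree, neither of which is part of this notebook; they are only chosen because they make the gap between training and test accuracy easy to see.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy binary problem (NOT the Pima dataset): 20% of labels are flipped
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# A tree with no depth limit can memorise the training set, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the model has already seen
test_acc = tree.score(X_te, y_te)   # accuracy on unseen data

print(f'Accuracy on training data {train_acc*100:5.3f}')
print(f'Accuracy on unseen data   {test_acc*100:5.3f}')
```

The score on the training data says how well the model memorised, not how well it generalises; only the second number is a usable estimate.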

There are two main approaches. The first and most obvious is to divide the data into a training set and a test set. We train the model on the training set and use the test set to evaluate it. It is simple and works well if we have lots of data. However, if data is scarce, the test set does not contain enough diversity and the evaluation may not be very accurate.

The second approach tries to solve this problem of evaluating a model as accurately as we can with a limited amount of data. As you can imagine, it consists of using sampling techniques, with or without repetition, to try to "augment" the amount of data available.

Once we have chosen the best hyperparameters and have the model ready for production, we train it with the whole data and put it in operational use. 
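As a rough sketch of that workflow: estimate each candidate with cross-validation, keep the best one, then refit on all the data. The synthetic dataset and the candidate values for the `C` hyperparameter are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical candidate values for the regularisation strength C
candidates = [0.01, 1.0, 100.0]

# Step 1: estimate the performance of each candidate on held-out folds
cv_means = {}
for C in candidates:
    model = LogisticRegression(C=C, solver='liblinear')
    cv_means[C] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Step 2: keep the candidate with the best cross-validated accuracy
best_C = max(cv_means, key=cv_means.get)

# Step 3: train the chosen configuration on the whole dataset for operational use
final_model = LogisticRegression(C=best_C, solver='liblinear').fit(X_demo, y_demo)

print(f'Best C {best_C} - CV accuracy {cv_means[best_C]*100:5.3f}%')
```

The cross-validated score is the estimate we report; the final model itself is never scored on the data it was trained on.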

We are going to look at four different techniques that we can use to split our data and create useful estimates of our models:

        1) Train and test sets.
        2) K-fold Cross-validation.
        3) Leave-one-out cross-validation.
        4) Repeated random test-train splits.

Yes, we will use the Pima Indians onset of diabetes dataset. 


<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional machine learning datasets, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The dataset consists of several medical predictor variables and one target variable, <b>Outcome</b>. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction (scores the likelihood of diabetes based on family history)</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

In [2]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]
X
pd.DataFrame(X).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

Unnamed: 0,0,1,2,3,4,5,6,7
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0


<h1>Split into Train and Test Sets </h1>

A simple idea, and probably the most commonly used approach, is to split our data into two sets: one for training and the other for testing. Typically 70% of the data is used for training and 30% for testing, but these numbers are arbitrary and other ratios work too (e.g. 80% - 20% if the dataset is large).

The points in favor of this approach are that it is simple and fast. It works well when datasets are large, and it is also widely used as a first approximation. One important thing to take into account is that the variance of both sets should be similar; if it is not, we can encounter unwanted surprises.

The downside is that the estimate can be meaningfully off if the differences in variance between the two sets are high, and that we take a significant risk when the amount of data is small. Once the model is in production we may find that its performance has little in common with what we tested, because the data it encounters is really different.
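One cheap check that the two sets look alike is to compare the share of positive cases in each. The sketch below uses synthetic data with roughly 35% positives (an assumption, not the Pima data) and also shows the `stratify` parameter of `train_test_split`, which this notebook does not use but which keeps the class proportions matched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced problem: roughly 65% negatives, 35% positives
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     weights=[0.65, 0.35], random_state=0)

# Plain random split: class proportions in the two sets can drift apart
_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=7)
plain_tr, plain_te = y_tr.mean(), y_te.mean()

# Stratified split: proportions are matched as closely as the counts allow
_, _, y_tr_s, y_te_s = train_test_split(X_demo, y_demo, test_size=0.3,
                                        random_state=7, stratify=y_demo)
strat_tr, strat_te = y_tr_s.mean(), y_te_s.mean()

print(f'Plain split      - positives train {plain_tr*100:5.3f}% test {plain_te*100:5.3f}%')
print(f'Stratified split - positives train {strat_tr*100:5.3f}% test {strat_te*100:5.3f}%')
```

Matching class proportions does not guarantee that the feature distributions match too, so this is a sanity check rather than a cure.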

The <b>train_test_split</b> function in scikit-learn's <b>model_selection</b> module is the one used for splitting the dataset.


In [3]:
# Split into Train and Test Sets
set_printoptions(precision=3)
p_indians.head()

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

# we need to make it reproducible, so we use a seed for the pseudo-random generator
test_size=0.3
seed = 7

# the actual split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# random_state=seed so that later experiments can reuse exactly the same split

# Let's fit the logistic regression
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)

# Now let's find the accuracy on the test split
result = model.score(X_test, Y_test)

print(f'Accuracy {result*100:5.3f}')



Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy 76.190


<b><font color="red" size=6>Mission 1</font>

a) Change the distribution between Train and Test Sets. How does it affect accuracy?
<br><br>
</b>

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

test_size=0.2
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)                                                                        

model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)

print(f'Accuracy {result*100:5.3f}')



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy 79.870


In [5]:
# With a 20% test set (so a bigger training set) accuracy rose from 76.190 to 79.870.
# Note that this comes from a single split with seed 7, so part of the change may be chance.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 

test_size=0.6
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=test_size, random_state=seed)                                                                        

model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)

print(f'Accuracy {result*100:5.3f}')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy 75.271


In [7]:
# With a 60% test set (so a smaller training set) accuracy dropped to 75.271:
# the model has less data to train on, so it is less accurate.
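Before reading too much into the three numbers above, it helps to see how much a single split's accuracy moves when only the seed changes. This is a minimal sketch on synthetic data; the dataset, the helper `split_accuracy` and the choice of 20 seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical noisy dataset of the same size as the Pima data (768 rows)
X_demo, y_demo = make_classification(n_samples=768, n_features=8,
                                     flip_y=0.2, random_state=0)

def split_accuracy(test_size, seed):
    """Accuracy of one train-test split for a given test size and seed."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              test_size=test_size, random_state=seed)
    model = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# Repeat each split with 20 different seeds and record the lowest and highest accuracy
spread = {}
for test_size in (0.2, 0.3, 0.6):
    accs = np.array([split_accuracy(test_size, seed) for seed in range(20)])
    spread[test_size] = (accs.min(), accs.max())
    print(f'test_size {test_size:3.1f} - Accuracy from {accs.min()*100:5.3f} to {accs.max()*100:5.3f}')
```

If the ranges for different test sizes overlap, a difference seen with one seed is not enough to conclude which split ratio is better.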

<img src=k-fold.png>

<h1>K-fold Cross-Validation</h1>

The objective of k-fold cross-validation is to reduce the variance that we encounter when using the train-test split approach. 

In this approach the available data is divided into k splits called folds (3, 5 and 10 are common values of k). We train and test the model k times, each time using k-1 folds for training and one fold for testing. Once we finish, we use the mean of the evaluation measure together with its standard deviation as the performance measure.

Obviously the dataset must be large enough to accommodate the process. 
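The procedure just described can be written out by hand. The sketch below, on synthetic data chosen only for illustration, loops over the folds explicitly and then checks that `cross_val_score` returns the same per-fold accuracies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=5)

# Manual k-fold: train on k-1 folds, test on the remaining one, k times
scores = []
for train_idx, test_idx in kfold.split(X_demo):
    model = LogisticRegression(solver='liblinear')
    model.fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
scores = np.array(scores)

# The same thing in one call
auto = cross_val_score(LogisticRegression(solver='liblinear'), X_demo, y_demo, cv=kfold)

print(f'Manual loop     - Accuracy {scores.mean()*100:5.3f}% ({scores.std()*100:5.3f}%)')
print(f'cross_val_score - Accuracy {auto.mean()*100:5.3f}% ({auto.std()*100:5.3f}%)')
```

Every point is used for testing exactly once and for training k-1 times, which is why the mean is less dependent on one lucky or unlucky split.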

K-Fold Cross Validation uses the <b>KFold </b> class. 


In [8]:
# K-fold Cross Validation

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

p_indians.head()

# KFold
splits=10
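# folds are taken in order (no shuffling), so random_state is omitted:
# recent scikit-learn versions raise an error if it is set without shuffle=True
kfold=KFold(n_splits=splits)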

#Logistic regression
model = LogisticRegression(solver='liblinear')

# Obtain the performance measure - accuracy
results = cross_val_score(model, X, Y, cv=kfold)

print(f'Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Logistic regression, k-fold 10 - Accuracy 76.951% (4.841%)


<h1>Leave One Out Cross-Validation</h1>

What will happen if we take k-fold to the extreme? In this case we will have as many folds as points, so k will be equal to the number of points and each prediction will be made for the single point left out.

This is an effort to make the most reasonable estimate possible given a dataset. It is called leave-one-out cross-validation.

Obviously you pay a penalty in terms of computational expense. The standard deviation of the scores is also much larger than with k-fold, because each fold contains a single point and its score can only be 0 or 1.
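Because every leave-one-out score is either 0 or 1, the standard deviation is fully determined by the mean accuracy p:

$$\sigma = \sqrt{p\,(1-p)}$$

For the 76.953% accuracy this notebook reports for leave-one-out, that gives $\sqrt{0.76953 \times 0.23047} \approx 0.4211$, which matches the 42.113% printed next to it. The sketch below checks the identity on a small synthetic dataset (an assumption chosen to keep the run fast).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical small dataset: leave-one-out fits one model per point
X_demo, y_demo = make_classification(n_samples=100, n_features=8,
                                     flip_y=0.2, random_state=0)

scores = cross_val_score(LogisticRegression(solver='liblinear'),
                         X_demo, y_demo, cv=LeaveOneOut())

p = scores.mean()                  # mean accuracy over all left-out points
predicted_std = np.sqrt(p * (1 - p))  # standard deviation implied by 0/1 scores

print(f'Distinct fold scores: {np.unique(scores)}')
print(f'Accuracy {p*100:5.3f}% - std {scores.std()*100:5.3f}% - sqrt(p(1-p)) {predicted_std*100:5.3f}%')
```

So the large figure in parentheses says nothing about how stable the estimate is; it is just a restatement of the mean.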

For Leave One Out Cross-Validation you use the <b>LeaveOneOut</b> class. 


In [9]:
# Leave one out cross-validation 

from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

p_indians.head()

# Leave one out cross-validation
loo=LeaveOneOut()

# Logistic Regression
model = LogisticRegression(solver='liblinear')

#  performance
results = cross_val_score(model, X, Y, cv=loo)

print(f'Logistic regression, Leave one out - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Logistic regression, Leave one out - Accuracy 76.953% (42.113%)


<h1>Repeated Random Test-Train Splits </h1>

Another approach is to repeatedly apply a train-test split. This combines the speed of the train-test split with the variance reduction of cross-validation.

A downside of the method is that the different splits share much of the same data, so the results, even if they look very nice, may not be realistic.

For Repeated Random Test-Train Splits you use the <b>ShuffleSplit</b> class.


In [10]:
# Repeated Random Test-Train Splits

from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LogisticRegression

p_indians.head()

nrepeat=10
test_size=0.3
seed=7

shuffle=ShuffleSplit(n_splits=nrepeat, test_size=test_size, random_state=seed)

model = LogisticRegression(solver='liblinear')

res = cross_val_score(model, X, Y, cv=shuffle)

print(f'Log Regression - Repeated Test-Train {nrepeat:d} - Accuracy {res.mean()*100:5.3f}% {res.std()*100:5.3f}%')


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Log Regression - Repeated Test-Train 10 - Accuracy 76.970% 1.366%


<h1>Which one to use and When? </h1>

First things first. K-fold is the gold standard: if you are doing any serious work or presenting results to a digitally educated audience, please use k-fold and you'll avoid problems.

Train and test is OK, and it is commonly used for quick & dirty work. As you have observed in the exercises, if the dataset is moderately large the differences are small. You certainly avoid surprises by using repeated train-test splits or, much better, k-fold, and your final model should always be evaluated this way, but a train and test split is OK for model selection and hyperparameter tuning.

What about the rest? All these techniques try to balance the accuracy of the estimated performance, evaluation speed and dataset size; they correspond to different bets in this balance.

You don't know what to do ... The staple is 10-fold cross-validation; start there.

<b><font color="red" size=6>Mission 2</font>

a) Using the Shanghai data and logistic regression for top-10, top-50 and top-100, evaluate the models with train-test split and k-fold-10.
<br><br>
b) Same for the data of the Times ranking. 
<br><br>

</b>

In [11]:
# a) Using the Shanghai data and logistic regression
# for top-10, top-50 and top-100 evaluate the models with train-test split and k-fold-10.

sid = pd.read_csv('shanghaiData.csv')
sid = sid[pd.notnull(sid['ns'])]
sid['world_rank'] = sid['world_rank'].str.split('-').str.get(0).astype(int)
sid["top_10"]=sid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
Xs=sid.iloc[:,4:10]
# Xs.head()
Ys = sid['top_10'].values
# Yu = sid['world_rank'].values
                

from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression

test_size=0.2
seed = 7  # any fixed number works; reusing it reproduces the same random split

X_train, X_test, Y_train, Y_test = train_test_split(Xs,Ys, test_size=test_size, random_state=seed)                                                                        
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print(f'SHANGHAI TOP10 train-test split - Accuracy {result*100:5.3f}')

splits=10
# no shuffling, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold=KFold(n_splits=splits)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, Xs, Ys, cv=kfold)
print(f'SHANGHAI TOP10 Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')
# #######################################################
print()
sid["top_50"]=sid["world_rank"].apply(lambda x: (1 if x<=50 else 0))
Ys = sid['top_50'].values
# sid.head()
X_train, X_test, Y_train, Y_test = train_test_split(Xs,Ys, test_size=test_size, random_state=seed)                                                                        
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print(f'SHANGHAI TOP50 train-test split - Accuracy {result*100:5.3f}')

kfold=KFold(n_splits=splits)  # random_state only applies when shuffle=True
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, Xs, Ys, cv=kfold)
print(f'SHANGHAI TOP50 Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')
#####################################################
print()
sid["top_100"]=sid["world_rank"].apply(lambda x: (1 if x<=100 else 0))
Ys = sid['top_100'].values
# sid.head()
X_train, X_test, Y_train, Y_test = train_test_split(Xs,Ys, test_size=test_size, random_state=seed)                                                                        
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print(f'SHANGHAI TOP100 train-test split - Accuracy {result*100:5.3f}')

kfold=KFold(n_splits=splits)  # random_state only applies when shuffle=True
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, Xs, Ys, cv=kfold)
print(f'SHANGHAI TOP100 Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')
##############################################################
print()
# sid

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

SHANGHAI TOP10 train-test split - Accuracy 99.590
SHANGHAI TOP10 Logistic regression, k-fold 10 - Accuracy 99.836% (0.179%)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

SHANGHAI TOP50 train-test split - Accuracy 99.385
SHANGHAI TOP50 Logistic regression, k-fold 10 - Accuracy 98.687% (0.307%)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

SHANGHAI TOP100 train-test split - Accuracy 98.256
SHANGHAI TOP100 Logistic regression, k-fold 10 - Accuracy 98.358% (0.610%)



In [12]:
# b) Same for the data of the Times ranking.

originaltid = pd.read_csv('timesData.csv')
tid =originaltid.copy()

tid['world_rank'] = tid['world_rank'].str.split('-').str.get(0)
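# strip the '=' prefix used for tied ranks; assign back instead of a chained inplace call
tid['world_rank'] = tid['world_rank'].str.replace('=', '', regex=False)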
print("----------------------------------------------")

tid.income = pd.to_numeric(tid.income, errors='coerce')
tid.international = pd.to_numeric(tid.international, errors='coerce')
# strip the percent sign; assign back instead of a chained inplace call
tid['international_students'] = tid['international_students'].str.replace('%', '', regex=False)
tid.international_students = pd.to_numeric(tid.international_students, errors='coerce')
tid.num_students = tid.num_students.astype(str)
tid['num_students'] = tid.num_students.apply(lambda x: x.replace(',', ''))
tid.num_students = pd.to_numeric(tid.num_students, errors='coerce')
tid.drop(['female_male_ratio','country','university_name','total_score','year'],axis=1,inplace=True)
tid.dropna(inplace=True)
tid.world_rank = pd.to_numeric(tid.world_rank, errors='coerce')
# tid["top_10"]=tid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
tid.iloc[0:5,:]

Xt= tid.iloc[:,1:9]
Xt.head()
# Yt = tid["top_10"].values



from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression

tid["top_10"]=tid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
Yt = tid["top_10"].values

test_size=0.3
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(Xt,Yt, test_size=test_size, random_state=seed)                                                                        
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print(f'TIMES TOP10 train-test split - Accuracy {result*100:5.3f}')

splits=10
# no shuffling, so random_state is omitted (recent scikit-learn rejects it without shuffle=True)
kfold=KFold(n_splits=splits)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, Xt, Yt, cv=kfold)
print(f'TIMES TOP10 Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')
# #######################################################
print()
tid["top_50"]=tid["world_rank"].apply(lambda x: (1 if x<=50 else 0))
Yt = tid["top_50"].values
# sid.head()
X_train, X_test, Y_train, Y_test = train_test_split(Xt,Yt, test_size=test_size, random_state=seed)                                                                        
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print(f'TIMES TOP50 train-test split - Accuracy {result*100:5.3f}')

kfold=KFold(n_splits=splits)  # random_state only applies when shuffle=True
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, Xt, Yt, cv=kfold)
print(f'TIMES TOP50 Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')
#####################################################
print()
tid["top_100"]=tid["world_rank"].apply(lambda x: (1 if x<=100 else 0))
Yt = tid["top_100"].values
# sid.head()
X_train, X_test, Y_train, Y_test = train_test_split(Xt,Yt, test_size=test_size, random_state=seed)                                                                        
model = LogisticRegression(solver='liblinear')
model.fit(X_train,Y_train)
result = model.score(X_test, Y_test)
print(f'TIMES TOP100 train-test split - Accuracy {result*100:5.3f}')

kfold=KFold(n_splits=splits)  # random_state only applies when shuffle=True
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, Xt, Yt, cv=kfold)
print(f'TIMES TOP100 Logistic regression, k-fold {splits:d} - Accuracy {results.mean()*100:5.3f}% ({results.std()*100:5.3f}%)')
##############################################################
print()


----------------------------------------------


Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students
0,1,99.7,72.4,98.7,98.8,34.5,20152.0,8.9,25.0
1,2,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,27.0
2,3,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,33.0
3,4,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,22.0
5,6,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,34.0


Unnamed: 0,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students
0,99.7,72.4,98.7,98.8,34.5,20152.0,8.9,25.0
1,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,27.0
2,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,33.0
3,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,22.0
5,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,34.0


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

TIMES TOP10 train-test split - Accuracy 98.569
TIMES TOP10 Logistic regression, k-fold 10 - Accuracy 98.798% (1.395%)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

TIMES TOP50 train-test split - Accuracy 94.850
TIMES TOP50 Logistic regression, k-fold 10 - Accuracy 94.421% (5.560%)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

TIMES TOP100 train-test split - Accuracy 88.984
TIMES TOP100 Logistic regression, k-fold 10 - Accuracy 87.983% (9.169%)

