<a href="https://www.kaggle.com/code/amsamms/sklearn-competition?scriptVersionId=106005661" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.base import clone

In [None]:
X= pd.read_csv('../input/data-science-london-scikit-learn/train.csv',header=None)

In [None]:
y= pd.read_csv('../input/data-science-london-scikit-learn/trainLabels.csv',header=None)

In [None]:
y=y[0]

In [None]:
y.shape

In [None]:
X.shape

In [None]:
x=5

In [None]:
X.isnull().sum().sum()

Now we know that X is a two-dimensional DataFrame, y is a one-dimensional Series, and there is no missing data. Let's check whether the independent variables in X are normally distributed.

In [None]:
for column in X.columns:
    X[column].hist()
    plt.show()

All 40 columns look approximately normally distributed, so logistic regression and a support vector machine should work fine. I will also try two ensemble methods, AdaBoost and random forest.
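
Histograms are only a visual check. A rough numeric complement is to look at per-column skewness and excess kurtosis, both of which are close to 0 for a normal distribution. The sketch below runs on synthetic stand-in data (`demo` is a hypothetical name, and the 0.5 threshold is an arbitrary choice); in this notebook one would pass `X` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X (hypothetical data, not the competition set)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 4)))
demo[3] = rng.exponential(size=1000)  # deliberately non-normal column for contrast

# For a normal distribution both statistics are close to 0
shape_stats = pd.DataFrame({
    "skew": demo.skew(),
    "excess_kurtosis": demo.kurt(),
})
print(shape_stats)

# Flag columns whose skewness is far from 0 (0.5 is an arbitrary threshold)
suspicious = shape_stats.index[shape_stats["skew"].abs() > 0.5].tolist()
print("columns that look non-normal:", suspicious)
```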

In [None]:
X.describe()

Also, all columns have similar ranges, so there is no need for `MinMaxScaler()`.
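
Whether scaling matters can also be checked empirically rather than by eye. A minimal sketch, on synthetic data from `make_classification` (an assumption, not the competition data), compares `SVC` with and without `StandardScaler` using cross-validation; `MinMaxScaler` could be swapped in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical names X_demo, y_demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold cross-validated accuracy with and without scaling
raw_scores = cross_val_score(SVC(), X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(make_pipeline(StandardScaler(), SVC()), X_demo, y_demo, cv=5)

print(f"SVC without scaling:     {raw_scores.mean():.3f}")
print(f"SVC with StandardScaler: {scaled_scores.mean():.3f}")
```

If the two means are close, skipping the scaler is a reasonable choice for that data.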

Next, let's define a function that takes a list of estimators, performs a train-test split, fits and predicts a specified number of times, and collects the results for comparison.

Note: the following function is not very simple, because I wrote it in a generic way so that it can be used anywhere.

In [None]:
def estimators_repeater(estimators=[RandomForestClassifier(),AdaBoostClassifier(),SVC()],tr_slicer=(None,None),tst_slicer=(None,None),loops=500,scorer=accuracy_score,X=X,y=y):
    '''Train each supplied estimator on selected slices of the data for a number of random train/test splits (default 500)
    and record the training score, test score and time used for each estimator.

    All estimators and the scorer must be imported before calling this function.

    Inputs:
    - estimators : a list of estimators, default is RandomForestClassifier, AdaBoostClassifier and SVC

    - tr_slicer : slice of the training split to fit on, as a tuple of integers (start, stop);
    default is (None, None), i.e. all training samples [:]

    - tst_slicer : slice of the test split to score on, as a tuple of integers (start, stop);
    default is (None, None), i.e. all test samples [:]

    - loops : int, number of train/test splits to run, default is 500

    - scorer : default is accuracy_score; it can be any metric from sklearn.metrics, but it is called as
    scorer(y_true, y_pred), so a metric that needs other inputs requires modifying the function

    - X : features as a DataFrame or 2-dimensional np.array
    - y : target as a DataFrame, Series or np.array

    Returns: nothing; creates 3 global dataframes: training_score_df (training scores),
    testing_score_df (test scores) and timing_df (time used by each estimator for fitting and predicting)

    Example, fitting the first 200 training samples with SVC() and RandomForestClassifier() for 400 loops, scoring accuracy on the full test split:

    estimators_repeater(estimators=[RandomForestClassifier(),SVC()],tr_slicer=(0,200),loops=400,scorer=accuracy_score,X=X,y=y)

    '''
    
    training_score={}
    testing_score={}
    timing={}
    for clf in estimators:
        clf_name = clf.__class__.__name__
        training_score[clf_name]=[]
        testing_score[clf_name]=[]
        timing[clf_name]=[]
       
    for i in range(loops):
        k1=time()
        # a new random train/test split on every loop, seeded by the loop index
        X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=i)
        for clf in estimators:
            a=time()
            clf_name = clf.__class__.__name__
            # clone gives a fresh, unfitted copy so nothing carries over between loops
            clean_clf=clone(clf)
            clean_clf.fit(X_train[tr_slicer[0]:tr_slicer[1]],y_train[tr_slicer[0]:tr_slicer[1]])
            training_score[clf_name].append(scorer(y_train[tr_slicer[0]:tr_slicer[1]],clean_clf.predict(X_train[tr_slicer[0]:tr_slicer[1]])))
            testing_score[clf_name].append(scorer(y_test[tst_slicer[0]:tst_slicer[1]],clean_clf.predict(X_test[tst_slicer[0]:tst_slicer[1]])))
            b=time()
            # elapsed time covers fitting plus predicting and scoring on both slices
            timing[clf_name].append(b-a)
        k2=time()
        print(f'loop number {i+1} out of {loops} took {k2-k1} seconds')
    
    global training_score_df
    training_score_df=pd.DataFrame(training_score)
    global testing_score_df
    testing_score_df=pd.DataFrame(testing_score)
    global timing_df
    timing_df=pd.DataFrame(timing)
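
The `tr_slicer` and `tst_slicer` tuples are used directly as slice bounds, so `(None, None)` keeps every row while `(0, -1)` would drop the last one. A small illustration on a plain list (the helper `apply_slicer` and the variable `rows` exist only for this example; a DataFrame sliced with integer bounds selects rows the same way):

```python
rows = list(range(10))

def apply_slicer(data, slicer):
    """Mimic data[slicer[0]:slicer[1]] as used inside estimators_repeater."""
    start, stop = slicer
    return data[start:stop]

print(apply_slicer(rows, (None, None)))  # every row is kept
print(apply_slicer(rows, (0, -1)))       # the last row is excluded
print(apply_slicer(rows, (0, 3)))        # only the first three rows
```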

A simple application of the above function is as follows:

* split the data into train and test sets
* fit X and y with random forest, AdaBoost, SVC and logistic regression
* record the time for the above steps, along with the test score and the train score
* repeat the above 500 times
* the final output is 3 dataframes: one for the test scores, one for the train scores, and one for the time consumed in each loop
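
The steps above are close to what scikit-learn's `cross_validate` does when given a `ShuffleSplit` splitter. The sketch below is an alternative, not the function used in this notebook; it runs on synthetic data with only 5 repetitions to keep it fast, and `X_demo`, `y_demo` and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 5 random 75/25 splits (train_test_split also holds out 25% by default)
splitter = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

summary = {}
for clf in [SVC(), LogisticRegression(max_iter=1000)]:
    result = cross_validate(clf, X_demo, y_demo, cv=splitter,
                            scoring="accuracy", return_train_score=True)
    # Average each recorded quantity over the splits
    summary[clf.__class__.__name__] = {
        "train_score": result["train_score"].mean(),
        "test_score": result["test_score"].mean(),
        "fit_time": result["fit_time"].mean(),
    }

print(pd.DataFrame(summary).T)
```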

In [None]:
estimators_repeater(estimators=[RandomForestClassifier(), AdaBoostClassifier(), SVC(),LogisticRegression()],X=X,y=y)

In [None]:
training_score_df.describe()

It seems that for the training score, random forest does best (which is expected), followed by SVC.

In [None]:
testing_score_df.describe()

As for the testing score, SVC seems to be the best, with an average score of 0.894472.

In [None]:
timing_df.describe()

It seems the random forest classifier took an average of 0.5 seconds per fit-predict cycle, AdaBoost took 0.3 seconds, and SVC took only 0.06 seconds!

So I will use SVC for the final submission.

In [None]:
testing_score_df.sort_values(by='SVC',ascending=False)

In [None]:
# create the SVC instance; 303 is the train/test split seed that gave the best SVC score above, it does not change how SVC itself fits
prediction=SVC(random_state=303)
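
One caveat: the loop index in `estimators_repeater` seeds `train_test_split`, not the classifier. According to the scikit-learn documentation, `SVC`'s `random_state` is only used for probability estimates and is ignored when `probability=False` (the default). A quick check on synthetic data (an assumed stand-in, not the competition set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical names X_demo, y_demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Two SVCs that differ only in random_state
pred_a = SVC(random_state=303).fit(X_demo, y_demo).predict(X_demo)
pred_b = SVC(random_state=1).fit(X_demo, y_demo).predict(X_demo)

same = np.array_equal(pred_a, pred_b)
print("identical predictions:", same)
```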

In [None]:
# fit on the whole training set (train.csv and trainLabels.csv)
prediction.fit(X,y)

In [None]:
# load the test data to be predicted for the submission
test_data= pd.read_csv('../input/data-science-london-scikit-learn/test.csv',header=None)

In [None]:
predictions=prediction.predict(test_data)

In [None]:
# construct the submission dataframe; Ids run from 1 to the number of predictions
df=pd.DataFrame()
df['Id']=np.arange(1,len(predictions)+1)
df['Solution']=predictions

In [None]:
df.tail()

In [None]:
df.to_csv('submission.csv',index=False)

The final submission scored **0.89819**, which I consider good given that I didn't use grid search.
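
For reference, a grid search over `C` and `gamma` could look like the sketch below. The grid values are arbitrary assumptions, the data is synthetic, and no claim is made about the score it would reach on this competition.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical names X_demo, y_demo)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameters (illustrative values only)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

In this notebook, `search.fit(X, y)` would replace the synthetic data, and `search.best_estimator_` could then be used to predict `test_data`.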