In this notebook, we classify the pictures using a Bag of Words approach. <br> <br>
SIFT descriptors were extracted from every train and test picture after three different kinds of preprocessing:
- Resizing, converting to grayscale, and extracting every SIFT descriptor
- Denoising, resizing, converting to grayscale, and extracting every SIFT descriptor
- Resizing, converting to grayscale, and extracting SIFT descriptors densely (dividing the picture into squares and extracting one descriptor per square) <br>
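
To make the dense variant concrete, here is a minimal sketch of how the centre of each square could be computed. The function name `dense_grid_centers` and the cell size are hypothetical; the extraction itself was done outside this notebook and its actual parameters are not given here.

```python
def dense_grid_centers(height, width, cell_size):
    """Return the (row, col) centre of every full square cell in the image."""
    centers = []
    # Step through the image one square at a time, ignoring partial squares
    for top in range(0, height - cell_size + 1, cell_size):
        for left in range(0, width - cell_size + 1, cell_size):
            centers.append((top + cell_size // 2, left + cell_size // 2))
    return centers


# Hypothetical 64 x 96 picture cut into 16-pixel squares
centers = dense_grid_centers(height=64, width=96, cell_size=16)
print(len(centers), centers[0], centers[-1])
```

One descriptor would then be computed at each of these positions, so every picture of a given size yields the same number of descriptors.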

Here we cluster the previously extracted SIFT descriptors to obtain a dictionary, count the number of "words" per picture using different weightings, and build our models on top of that. <br>
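
As a rough illustration of the counting step, the sketch below builds one histogram of "words" per picture with NumPy only. The helper name `bag_of_words_histograms` and the toy ids are assumptions made for the example; the notebook itself relies on scikit-learn vectorizers for this step.

```python
import numpy as np


def bag_of_words_histograms(picture_ids, word_ids, n_words):
    """Count how often each visual word occurs in each picture.

    picture_ids[i] is the picture that descriptor i came from and
    word_ids[i] is the cluster ("word") assigned to descriptor i.
    """
    pictures, rows = np.unique(picture_ids, return_inverse=True)
    counts = np.zeros((len(pictures), n_words), dtype=int)
    # Unbuffered add, so repeated (picture, word) pairs are all counted
    np.add.at(counts, (rows, word_ids), 1)
    return pictures, counts


# Toy data: 5 descriptors spread over 2 pictures, dictionary of 3 words
picture_ids = np.array([10, 10, 10, 20, 20])
word_ids = np.array([0, 2, 2, 1, 2])
pictures, counts = bag_of_words_histograms(picture_ids, word_ids, n_words=3)
print(pictures)
print(counts)
```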

# Imports

We first import the necessary Python libraries

In [1]:
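from __future__ import division  # true division under Python 2, so the per-class ratios are not truncated
import numpy as np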
import pandas as pd
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
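# train_test_split and GridSearchCV live in sklearn.model_selection
# (the cross_validation and grid_search modules were removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split, GridSearchCV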
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

Now we import and process the data: <br> 
There are two datasets of SIFT descriptors (train and test) for each preprocessing technique we used.

In [2]:
#Reading the sift descriptors files (dense)
sifts_train = pd.read_csv('train_sifts_dense.csv')
sifts_test = pd.read_csv('test_sifts_dense.csv')
sifts_train.head()

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,...,119,120,121,122,123,124,125,126,127,Id
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,14.0,0.0,0.0,0.0,0.0,0.0,20.0,255.0,8.0,-1191173
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,33.0,99.0,4.0,0.0,0.0,0.0,0.0,2.0,40.0,-1191173
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,175.0,58.0,0.0,0.0,0.0,0.0,0.0,0.0,-1191173
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,177.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0,-1191173
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,176.0,7.0,0.0,0.0,0.0,0.0,0.0,4.0,-1191173


The sifts_train dataframe contains all the SIFT descriptors extracted from each train picture:
- Unnamed: 0 : Id of the descriptor within the picture 
- 0 - 127 : SIFT descriptor values
- Id : Id of the picture
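
The layout described above can be reproduced on a toy dataframe. This is only a sketch: it uses 3 descriptor columns instead of 128, and the picture id `42` is made up. It shows that label-based slicing with `.loc` includes both end columns, and how to count descriptors per picture.

```python
import pandas as pd

# Toy frame with the same layout: descriptor index, descriptor values, picture Id
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 0],
    "0": [0.0, 3.0, 1.0],
    "1": [14.0, 0.0, 2.0],
    "2": [0.0, 99.0, 5.0],
    "Id": [-1191173, -1191173, 42],
})

# Label-based slice: both "0" and "2" are included
values = toy.loc[:, "0":"2"]

# Number of descriptors available for each picture
per_picture = toy.groupby("Id").size()
print(values.shape)
print(per_picture)
```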

In [24]:
print('Train:', 8000 - len(np.unique(sifts_train['Id'])), 'pictures without any sifts descriptors')
print('Test:', 13999 - len(np.unique(sifts_test['Id'])), 'pictures without any sifts descriptors')

('Train:', 0, 'pictures without any sifts descriptors')
('Test:', 0, 'pictures without any sifts descriptors')


In [25]:
#Scaling the sifts descriptors values for the clustering
sifts_train_values = sifts_train.loc[:,'0':'127']
sifts_test_values = sifts_test.loc[:,'0':'127']
std_sc = StandardScaler()
sifts_train_values = pd.DataFrame(std_sc.fit_transform(sifts_train_values))
sifts_test_values = pd.DataFrame(std_sc.transform(sifts_test_values))
sifts_train_values.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
0,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,4.664526,-0.130893,-0.517853,-0.479069,-0.491777,-0.475926,-0.501479,0.124039,5.052102,-0.244152
1,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.537519,0.376582,1.650301,-0.360755,-0.491777,-0.475926,-0.501479,-0.47564,-0.463781,0.676013
2,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.537519,-0.504821,3.314742,1.236482,-0.491777,-0.475926,-0.501479,-0.47564,-0.507385,-0.474193
3,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.537519,-0.504821,3.358543,1.354796,-0.491777,-0.475926,-0.501479,-0.47564,-0.507385,-0.474193
4,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.537519,-0.397985,3.336642,-0.272019,-0.491777,-0.475926,-0.501479,-0.47564,-0.507385,-0.359172
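
As a small illustration of the scaling step, the sketch below fits a `StandardScaler` on toy train data and reuses the same means and standard deviations on toy test data. The arrays are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
test = np.array([[2.0, 30.0], [6.0, 70.0]])

scaler = StandardScaler()
# The statistics are learned on the train data only...
train_scaled = scaler.fit_transform(train)
# ...and reused as-is on the test data
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(test_scaled)
```

The first test row equals the train mean, so it is mapped to zeros; the second one is expressed in train standard deviations.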


# KMeans Clustering

Now we cluster the SIFT descriptors to group together the ones that look alike. 
We use K-means (scikit-learn's MiniBatchKMeans), with different values of k (100, 250, 500, 750, 1000). 

In [27]:
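# Adding one cluster-label column per value of k
warnings.filterwarnings("ignore")
nb_clusters = [100, 250, 500, 750, 1000]
# Cluster on the 128 scaled descriptor values only, so that the label columns
# added for a previous k are never used as features
feature_cols = list(range(128))
X_km_train = sifts_train_values[feature_cols].values
X_km_test = sifts_test_values[feature_cols].values
for k in nb_clusters:
    print(k)
    name = 'cluster_km_' + str(k)
    km = MiniBatchKMeans(n_init=3, max_iter=100, n_clusters=k)
    km.fit(X_km_train)
    sifts_train_values[name] = km.labels_
    sifts_test_values[name] = km.predict(X_km_test)
sifts_train_values['Id'] = sifts_train['Id']
sifts_test_values['Id'] = sifts_test['Id']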

100
250
500
750
1000


In [28]:
sifts_train_values.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,124,125,126,127,cluster_km_100,cluster_km_250,cluster_km_500,cluster_km_750,cluster_km_1000,Id
0,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.501479,0.124039,5.052102,-0.244152,41,80,189,234,52,-1191173
1,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.501479,-0.47564,-0.463781,0.676013,4,175,145,148,459,-1191173
2,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.501479,-0.47564,-0.507385,-0.474193,10,228,423,208,416,-1191173
3,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.501479,-0.47564,-0.507385,-0.474193,10,228,423,208,416,-1191173
4,-0.4831,-0.452598,-0.460596,-0.445141,-0.466192,-0.450662,-0.476716,-0.447204,-0.516943,-0.476097,...,-0.501479,-0.47564,-0.507385,-0.359172,10,228,423,346,58,-1191173


In sifts_train_values, each cluster_km_k column holds the label of the cluster the descriptor belongs to. In sifts_test_values, these columns hold the predicted cluster label.
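
The difference between the two can be seen on a toy example. The sketch below is an illustration under assumptions: two synthetic, well-separated blobs stand in for the scaled descriptors, and `random_state` is fixed only to make the example repeatable.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# Two well-separated blobs stand in for scaled descriptors
train = np.vstack([rng.normal(0.0, 0.1, size=(50, 4)),
                   rng.normal(5.0, 0.1, size=(50, 4))])
test = np.array([[0.0, 0.0, 0.0, 0.0],
                 [5.0, 5.0, 5.0, 5.0]])

km = MiniBatchKMeans(n_clusters=2, n_init=3, max_iter=100, random_state=0)
km.fit(train)

train_words = km.labels_        # label of each train descriptor, set during fit
test_words = km.predict(test)   # nearest learned centre for each test descriptor
print(train_words[:3], train_words[-3:], test_words)
```

The test descriptors never influence the cluster centres; they are only assigned to the closest one.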

# Weighting + Model

Now that we have a cluster label for each SIFT descriptor, we count the cluster labels per picture, using 3 types of weighting:
- Binary: 1 if the cluster is present in the picture, 0 otherwise.
- Count: Number of times the cluster appears in the picture
- Tf-Idf

Once we obtain the final dataset (1 row per picture, 1 column per cluster, value = weight), we can train a model on it.
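
The three weightings can be compared on two toy "pictures" written as strings of cluster labels. This is a sketch with made-up labels. It also shows one thing to watch for: with scikit-learn's default `token_pattern`, tokens of a single character are ignored, so the example passes a pattern that keeps single-digit labels.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Each string lists the cluster labels of one picture's descriptors
docs = ["3 3 17 8", "17 17 17 42"]

# Keep tokens of any length (the default pattern needs 2+ characters)
pattern = r"(?u)\b\w+\b"

binary = CountVectorizer(binary=True, token_pattern=pattern).fit_transform(docs).toarray()
counts = CountVectorizer(token_pattern=pattern).fit_transform(docs).toarray()
tfidf = TfidfVectorizer(token_pattern=pattern).fit_transform(docs).toarray()

# Vocabulary learned with the default pattern, for comparison
default_vocab = CountVectorizer().fit(docs).vocabulary_

print(binary)
print(counts)
print(tfidf.round(3))
print(sorted(default_vocab))
```

Columns follow the sorted vocabulary, which is ordered as strings (`'17'`, `'3'`, `'42'`, `'8'`), not as numbers.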

In [29]:
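id_train = pd.read_csv('../Data/id_train.csv')
# The default token_pattern only keeps tokens of 2+ characters, which would
# silently drop cluster labels 0-9; this pattern keeps single-digit labels too
token_pattern = r'(?u)\b\w+\b'
binary_vectorizer = CountVectorizer(min_df=1, binary=True, token_pattern=token_pattern)
count_vectorizer = CountVectorizer(min_df=1, token_pattern=token_pattern)
tfidf_vectorizer = TfidfVectorizer(min_df=1, token_pattern=token_pattern)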

#### Testing different models with different number of clusters

For each number k of clusters, we will try: 
- Logistic regression with different penalties (L1 and L2, different values of C)
- Bernoulli Naive Bayes with different values of the smoothing parameter alpha
- KNN with different numbers of neighbors and two types of weighting (uniform, distance)
- SVM with different kernels
- Random Forest
- Gradient Boosting

Each model is built on a random sample of 80% of the pictures in the sifts_train_binary dataset, with its hyperparameters chosen by a cross-validated grid search. We then test its performance on the validation dataset (the remaining 20% of the pictures). We report the best cross-validated accuracy and, for each class, the proportion of validation pictures that are correctly classified.
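
The evaluation protocol can be sketched end to end on synthetic data. Everything about the data here is an assumption made for the example (200 fake pictures, 20 binary words, 4 classes), and `stratify=y` is an addition that the notebook cells do not use; it only guarantees that every class appears in the small validation set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.RandomState(42)
# Hypothetical stand-in for the picture x word matrix
y = np.repeat(np.arange(4), 50)
X = (rng.rand(200, 20) < 0.2).astype(int)
for label in range(4):
    # Each class gets its own block of 5 frequently present words
    X[y == label, label * 5:(label + 1) * 5] = (rng.rand(50, 5) < 0.8).astype(int)

# 80% of the pictures for the grid search, 20% kept for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1]}, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# Rows of the confusion matrix are true classes, columns are predictions
conf_mat = confusion_matrix(y_val, search.predict(X_val), labels=np.arange(4))
per_class = np.diagonal(conf_mat) / conf_mat.sum(axis=1).astype(float)

print(search.best_params_, search.best_score_)
print(per_class)
```

`best_score_` is the mean cross-validated accuracy on the 80% sample, while the per-class proportions come from the held-out 20%.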

In [30]:
n_j = -1  # number of jobs for the grid search (-1 uses all cores)
n_cv = 5  # number of cross-validation folds for the grid search

# Binary weighting

## Logistic regression

In [12]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
   
    #Logistic Regression
    print("\t Logistic Regression")
    logreg = LogisticRegression(solver ='lbfgs')
    parameters = {'C' : [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 5, 10], 'penalty': ['l2'], 'class_weight': ['balanced', None], 'multi_class' : ['ovr', 'multinomial']}
    gs_logreg = GridSearchCV(logreg , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_logreg.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_logreg.predict(X_test))
    print('\t Best params:', gs_logreg.best_params_, ':', gs_logreg.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
	 Logistic Regression
('\t Best params:', {'penalty': 'l2', 'multi_class': 'multinomial', 'C': 0.01, 'class_weight': None}, ':', 0.44828125000000002)
('\t', array([0, 0, 0, 0]))
(250, 'clusters:')
	 Logistic Regression
('\t Best params:', {'penalty': 'l2', 'multi_class': 'ovr', 'C': 0.01, 'class_weight': None}, ':', 0.50718750000000001)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
	 Logistic Regression
('\t Best params:', {'penalty': 'l2', 'multi_class': 'ovr', 'C': 0.01, 'class_weight': None}, ':', 0.54921874999999998)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
	 Logistic Regression
('\t Best params:', {'penalty': 'l2', 'multi_class': 'ovr', 'C': 0.001, 'class_weight': 'balanced'}, ':', 0.55843750000000003)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
	 Logistic Regression
('\t Best params:', {'penalty': 'l2', 'multi_class': 'ovr', 'C': 0.001, 'class_weight': 'balanced'}, ':', 0.56015625000000002)
('\t', array([0, 0, 0, 0]))
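
The per-class ratios printed above are whole numbers. One likely explanation, given the Python 2 style of the recorded outputs, is that dividing two integer arrays performed floor division. The sketch below uses a made-up 2-class confusion matrix to contrast floor division with true division.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions
conf_mat = np.array([[30, 10],
                     [5, 15]])
true_counts = conf_mat.sum(axis=1)

# Floor division, which is what "/" does on integer arrays under Python 2
truncated = np.diagonal(conf_mat) // true_counts

# True division gives the proportion of well classified pictures per class
recall = np.diagonal(conf_mat) / true_counts.astype(float)

print(truncated)
print(recall)
```

With floor division, a class only shows 1 when every one of its pictures is classified correctly, and 0 otherwise.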


## Bernoulli Naive Bayes

In [13]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #Bernoulli Naive Bayes
    print()
    print('\t Bernoulli NB')
    bnb = BernoulliNB()
    parameters = {'alpha' : [0, 0.25, 0.5, 1, 5, 10, 25, 50, 75, 100]}
    gs_bnb = GridSearchCV(bnb , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_bnb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_bnb.predict(X_test))
    print('\t Best params:', gs_bnb.best_params_, ':', gs_bnb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
()
	 Bernoulli NB
('\t Best params:', {'alpha': 50}, ':', 0.44515624999999998)
('\t', array([0, 0, 0, 0]))
(250, 'clusters:')
()
	 Bernoulli NB
('\t Best params:', {'alpha': 75}, ':', 0.50390625)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
()
	 Bernoulli NB
('\t Best params:', {'alpha': 75}, ':', 0.54203124999999996)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
()
	 Bernoulli NB
('\t Best params:', {'alpha': 50}, ':', 0.56046874999999996)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
()
	 Bernoulli NB
('\t Best params:', {'alpha': 25}, ':', 0.56562500000000004)
('\t', array([0, 0, 0, 0]))


## KNN

In [15]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #KNN
    print()
    print('\t Knn')
    knn = KNeighborsClassifier(p = 1)
    parameters = {'n_neighbors' : [5, 10, 25, 50, 75, 100, 150, 200, 250], 'weights' : ['uniform', 'distance']}
    gs_knn = GridSearchCV(knn , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_knn.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_knn.predict(X_test))
    print('\t Best params:', gs_knn.best_params_, ':', gs_knn.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
()
	 Knn
('\t Best params:', {'n_neighbors': 75, 'weights': 'uniform'}, ':', 0.43281249999999999)
('\t', array([0, 0, 0, 0]))
(250, 'clusters:')
()
	 Knn
('\t Best params:', {'n_neighbors': 25, 'weights': 'uniform'}, ':', 0.45687499999999998)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
()
	 Knn
('\t Best params:', {'n_neighbors': 75, 'weights': 'distance'}, ':', 0.51406249999999998)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
()
	 Knn
('\t Best params:', {'n_neighbors': 100, 'weights': 'uniform'}, ':', 0.52390625000000002)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
()
	 Knn
('\t Best params:', {'n_neighbors': 150, 'weights': 'uniform'}, ':', 0.51359374999999996)
('\t', array([0, 0, 0, 0]))


## SVM

In [19]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #SVM
    print()
    print('\t SVM')
    svm = SVC(decision_function_shape='ovr', class_weight = 'balanced')
    parameters = {'kernel':['linear', 'poly', 'rbf', 'sigmoid'], 'C':[0.1, 0.75, 1, 10]}
    gs_svm = GridSearchCV(svm, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_svm.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_svm.predict(X_test))
    print('\t Best params:', gs_svm.best_params_, ':', gs_svm.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
()
	 SVM
('\t Best params:', {'kernel': 'sigmoid', 'C': 0.75}, ':', 0.43125000000000002)
('\t', array([0, 0, 0, 1]))
(250, 'clusters:')
()
	 SVM
('\t Best params:', {'kernel': 'rbf', 'C': 0.1}, ':', 0.48375000000000001)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
()
	 SVM
('\t Best params:', {'kernel': 'poly', 'C': 10}, ':', 0.54468749999999999)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
()
	 SVM
('\t Best params:', {'kernel': 'poly', 'C': 10}, ':', 0.55874999999999997)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
()
	 SVM
('\t Best params:', {'kernel': 'rbf', 'C': 0.25}, ':', 0.55718749999999995)
('\t', array([0, 0, 0, 0]))


## Random Forest

In [20]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #Random Forest
    print()
    print('\t Random Forest')
    rf = RandomForestClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.2, 0.3, 'sqrt', 'log2'], 'max_depth': [6, 8, 10, 12, 15], 'criterion': ['gini', 'entropy']}
    gs_rf = GridSearchCV(rf, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_rf.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_rf.predict(X_test))
    print('\t Best params:', gs_rf.best_params_, ':', gs_rf.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
()
	 Random Forest
('\t Best params:', {'max_features': 0.2, 'n_estimators': 500, 'criterion': 'gini', 'max_depth': 15}, ':', 0.4478125)
('\t', array([0, 0, 0, 0]))
(250, 'clusters:')
()
	 Random Forest
('\t Best params:', {'max_features': 0.2, 'n_estimators': 250, 'criterion': 'gini', 'max_depth': 15}, ':', 0.46921875000000002)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
()
	 Random Forest
('\t Best params:', {'max_features': 0.3, 'n_estimators': 100, 'criterion': 'gini', 'max_depth': 10}, ':', 0.49421874999999998)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
()
	 Random Forest
('\t Best params:', {'max_features': 0.3, 'n_estimators': 500, 'criterion': 'gini', 'max_depth': 15}, ':', 0.49296875000000001)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
()
	 Random Forest
('\t Best params:', {'max_features': 0.3, 'n_estimators': 250, 'criterion': 'gini', 'max_depth': 15}, ':', 0.49796875000000002)
('\t', array([0, 0, 0, 0]))


## Gradient Boosting

In [21]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    # Gradient Boosting
    print()
    print('\t Gradient Boosting')
    gb = GradientBoostingClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1,0.3, 'sqrt', 'log2', None], 'max_depth': [2, 5, 10]}
    gs_gb = GridSearchCV(gb, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_gb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_gb.predict(X_test))
    print('\t Best params:', gs_gb.best_params_, ':', gs_gb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 0.3, 'n_estimators': 100, 'max_depth': 2}, ':', 0.453125)
('\t', array([0, 0, 0, 0]))
(250, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 'log2', 'n_estimators': 250, 'max_depth': 3}, ':', 0.51484375000000004)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 'sqrt', 'n_estimators': 500, 'max_depth': 5}, ':', 0.55843750000000003)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 'log2', 'n_estimators': 500, 'max_depth': 2}, ':', 0.57218749999999996)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 'sqrt', 'n_estimators': 500, 'max_depth': 2}, ':', 0.57093749999999999)
('\t', array([0, 0, 0, 0]))


for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_binary = pd.DataFrame(binary_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_binary['Id'] = sifts_train_groups.index
    sifts_train_binary = pd.merge(sifts_train_binary, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_binary.iloc[:,0:-2], sifts_train_binary['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
   
    #Logistic Regression
    print("\t Logistic Regression")
    # saga supports both the L1 and L2 penalties as well as the multinomial loss
    logreg = LogisticRegression(solver='saga')
    parameters = {'C' : [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 5, 10], 'penalty': ['l1', 'l2'], 'class_weight': ['balanced', None], 'multi_class' : ['ovr', 'multinomial']}
    gs_logreg = GridSearchCV(logreg , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_logreg.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_logreg.predict(X_test))
    print('\t Best params:', gs_logreg.best_params_, ':', gs_logreg.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

    #Bernoulli Naive Bayes
    print()
    print('\t Bernoulli NB')
    bnb = BernoulliNB()
    parameters = {'alpha' : [0, 0.25, 0.5, 1, 5, 10, 25, 50, 75, 100]}
    gs_bnb = GridSearchCV(bnb , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_bnb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_bnb.predict(X_test))
    print('\t Best params:', gs_bnb.best_params_, ':', gs_bnb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    

    #KNN
    print()
    print('\t Knn')
    knn = KNeighborsClassifier(p = 1)
    parameters = {'n_neighbors' : [5, 10, 25, 50, 75, 100, 150, 200, 250], 'weights' : ['uniform', 'distance']}
    gs_knn = GridSearchCV(knn , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_knn.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_knn.predict(X_test))
    print('\t Best params:', gs_knn.best_params_, ':', gs_knn.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    #SVM
    print()
    print('\t SVM')
    svm = SVC(decision_function_shape='ovr', class_weight = 'balanced')
    parameters = {'kernel':['linear', 'poly', 'rbf', 'sigmoid'], 'C':[0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 10]}
    gs_svm = GridSearchCV(svm, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_svm.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_svm.predict(X_test))
    print('\t Best params:', gs_svm.best_params_, ':', gs_svm.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    #Random Forest
    print()
    print('\t Random Forest')
    rf = RandomForestClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.2, 0.3, 'sqrt', 'log2'], 'max_depth': [6, 8, 10, 12, 15], 'criterion': ['gini', 'entropy']}
    gs_rf = GridSearchCV(rf, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_rf.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_rf.predict(X_test))
    print('\t Best params:', gs_rf.best_params_, ':', gs_rf.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    # Gradient Boosting
    print()
    print('\t Gradient Boosting')
    gb = GradientBoostingClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.3, 'sqrt', 'log2', None], 'max_depth': [2,3, 5, 10]}
    gs_gb = GridSearchCV(gb, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_gb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_gb.predict(X_test))
    print('\t Best params:', gs_gb.best_params_, ':', gs_gb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

# Count Weighting

## Logistic regression

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
   
    #Logistic Regression
    print("\t Logistic Regression")
    logreg = LogisticRegression(solver ='lbfgs')
    parameters = {'C' : [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 5, 10], 'penalty': ['l2'], 'class_weight': ['balanced', None], 'multi_class' : ['ovr', 'multinomial']}
    gs_logreg = GridSearchCV(logreg , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_logreg.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_logreg.predict(X_test))
    print('\t Best params:', gs_logreg.best_params_, ':', gs_logreg.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## Bernoulli Naive Bayes

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
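    #Bernoulli Naive Bayes
    # Note: BernoulliNB binarizes its input by default (binarize=0.0), so the counts are reduced to presence/absence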
    print()
    print('\t Bernoulli NB')
    bnb = BernoulliNB()
    parameters = {'alpha' : [0, 0.25, 0.5, 1, 5, 10, 25, 50, 75, 100]}
    gs_bnb = GridSearchCV(bnb , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_bnb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_bnb.predict(X_test))
    print('\t Best params:', gs_bnb.best_params_, ':', gs_bnb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## KNN

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #KNN
    print()
    print('\t Knn')
    knn = KNeighborsClassifier(p = 1)
    parameters = {'n_neighbors' : [5, 10, 25, 50, 75, 100, 150, 200, 250], 'weights' : ['uniform', 'distance']}
    gs_knn = GridSearchCV(knn , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_knn.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_knn.predict(X_test))
    print('\t Best params:', gs_knn.best_params_, ':', gs_knn.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## SVM

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #SVM
    print()
    print('\t SVM')
    svm = SVC(decision_function_shape='ovr', class_weight = 'balanced')
    parameters = {'kernel':[ 'poly', 'rbf', 'sigmoid'], 'C':[0.1, 0.75, 1, 10]}
    gs_svm = GridSearchCV(svm, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_svm.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_svm.predict(X_test))
    print('\t Best params:', gs_svm.best_params_, ':', gs_svm.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## Random Forest

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #Random Forest
    print()
    print('\t Random Forest')
    rf = RandomForestClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.2, 0.3, 'sqrt', 'log2'], 'max_depth': [10, 15, 20, 50], 'criterion': ['gini', 'entropy']}
    gs_rf = GridSearchCV(rf, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_rf.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_rf.predict(X_test))
    print('\t Best params:', gs_rf.best_params_, ':', gs_rf.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## Gradient Boosting

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    # Gradient Boosting
    print()
    print('\t Gradient Boosting')
    gb = GradientBoostingClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.3, 'sqrt', 'log2', None], 'max_depth': [2, 5, 10, 20]}
    gs_gb = GridSearchCV(gb, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_gb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_gb.predict(X_test))
    print('\t Best params:', gs_gb.best_params_, ':', gs_gb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_count = pd.DataFrame(count_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_count['Id'] = sifts_train_groups.index
    sifts_train_count = pd.merge(sifts_train_count, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_count.iloc[:,0:-2], sifts_train_count['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
        
    #Logistic Regression
    print("\t Logistic Regression")
    # saga supports both the L1 and L2 penalties as well as the multinomial loss
    logreg = LogisticRegression(solver='saga')
    parameters = {'C' : [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 5, 10], 'penalty': ['l1', 'l2'], 'class_weight': ['balanced', None], 'multi_class' : ['ovr', 'multinomial']}
    gs_logreg = GridSearchCV(logreg , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_logreg.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_logreg.predict(X_test))
    print('\t Best params:', gs_logreg.best_params_, ':', gs_logreg.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

    #Multinomial Naive Bayes
    print()
    print('\t Multinomial NB')
    bnb = MultinomialNB()
    parameters = {'alpha' : [0, 0.25, 0.5, 1, 5, 10, 25, 50, 75, 100]}
    gs_bnb = GridSearchCV(bnb , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_bnb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_bnb.predict(X_test))
    print('\t Best params:', gs_bnb.best_params_, ':', gs_bnb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    #KNN
    print()
    print('\t Knn')
    knn = KNeighborsClassifier(p = 2)
    parameters = {'n_neighbors' : [5, 10, 25, 50, 75, 100, 150, 200, 250], 'weights' : ['uniform', 'distance']}
    gs_knn = GridSearchCV(knn , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_knn.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_knn.predict(X_test))
    print('\t Best params:', gs_knn.best_params_, ':', gs_knn.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    print()
    print('\t SVM')
    svm = SVC(decision_function_shape='ovr', class_weight = 'balanced')
    parameters = {'kernel':['linear', 'poly', 'rbf', 'sigmoid'], 'C':[0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 10]}
    gs_svm = GridSearchCV(svm, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_svm.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_svm.predict(X_test))
    print('\t Best params:', gs_svm.best_params_, ':', gs_svm.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    print()
    print('\t Random Forest')
    rf = RandomForestClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.2, 0.3, 'sqrt', 'log2'], 'max_depth': [6, 8, 10, 12, 15], 'criterion': ['gini', 'entropy']}
    gs_rf = GridSearchCV(rf, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_rf.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_rf.predict(X_test))
    print('\t Best params:', gs_rf.best_params_, ':', gs_rf.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    print()
    print('\t Gradient Boosting')
    gb = GradientBoostingClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.3, 'sqrt', 'log2', None], 'max_depth': [2,3, 5, 10]}
    gs_gb = GridSearchCV(gb, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_gb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_gb.predict(X_test))
    print('\t Best params:', gs_gb.best_params_, ':', gs_gb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

# TFIDF Weighting
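
As a rough illustration of what TF-IDF weighting does to the visual-word counts, the sketch below reweights a few toy "documents" of cluster ids by hand. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is the default behaviour of scikit-learn's `TfidfVectorizer`; how `tfidf_vectorizer` is actually configured is not shown in this section. The function name `tfidf_weights` and the toy cluster ids are made up for the example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of strings of space-separated cluster ids, one string per picture."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many pictures does each visual word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vocab = sorted(df)

    # Smoothed inverse document frequency (assumed formula, see the text above).
    idf = {w: math.log((1.0 + n_docs) / (1.0 + df[w])) + 1.0 for w in vocab}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        # L2-normalise each picture so that the number of descriptors does not dominate.
        rows.append([v / norm if norm else 0.0 for v in row])
    return vocab, rows

# Three toy pictures: word '12' appears everywhere, word '7' in one picture only.
docs = ["12 12 40", "12 7", "12 40 40"]
vocab, weights = tfidf_weights(docs)
for doc, row in zip(docs, weights):
    print(doc, '->', [round(v, 3) for v in row])
```

In the second toy picture, the rare word `7` ends up with a larger weight than the ubiquitous word `12` even though both occur once, which is the effect the weighting is meant to have.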

## Logistic regression

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
   
    #Logistic Regression
    print("\t Logistic Regression")
    logreg = LogisticRegression(solver ='lbfgs')
    parameters = {'C' : [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 5, 10], 'penalty': ['l2'], 'class_weight': ['balanced', None], 'multi_class' : ['ovr', 'multinomial']}
    gs_logreg = GridSearchCV(logreg , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_logreg.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_logreg.predict(X_test))
    print('\t Best params:', gs_logreg.best_params_, ':', gs_logreg.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## Bernoulli Naive Bayes

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #Bernoulli Naive Bayes
    print()
    print('\t Bernoulli NB')
    bnb = BernoulliNB()
    parameters = {'alpha' : [0, 0.25, 0.5, 1, 5, 10, 25, 50, 75, 100]}
    gs_bnb = GridSearchCV(bnb , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_bnb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_bnb.predict(X_test))
    print('\t Best params:', gs_bnb.best_params_, ':', gs_bnb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## KNN

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #KNN
    print()
    print('\t Knn')
    knn = KNeighborsClassifier(p = 1)
    parameters = {'n_neighbors' : [5, 10, 25, 50, 75, 100, 150, 200, 250], 'weights' : ['uniform', 'distance']}
    gs_knn = GridSearchCV(knn , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_knn.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_knn.predict(X_test))
    print('\t Best params:', gs_knn.best_params_, ':', gs_knn.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## SVM

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #SVM
    print()
    print('\t SVM')
    svm = SVC(decision_function_shape='ovr', class_weight = 'balanced')
    parameters = {'kernel':['poly', 'rbf', 'sigmoid'], 'C':[0.1, 0.75, 10]}
    gs_svm = GridSearchCV(svm, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_svm.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_svm.predict(X_test))
    print('\t Best params:', gs_svm.best_params_, ':', gs_svm.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
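
`GridSearchCV` hands each key of `parameters` to the estimator through `set_params`, so every key has to be a constructor argument of that estimator (for `SVC`, the kernel is selected with `kernel`). A minimal sketch of a check that can be run before launching a long search; the helper name `unknown_grid_keys` is hypothetical:

```python
from sklearn.svm import SVC

def unknown_grid_keys(estimator, param_grid):
    """Return the grid keys that the estimator does not accept as parameters."""
    valid = set(estimator.get_params().keys())
    return sorted(set(param_grid) - valid)

svm = SVC(decision_function_shape='ovr', class_weight='balanced')

# A misspelled key is reported, a valid grid yields an empty list.
print(unknown_grid_keys(svm, {'ker': ['rbf'], 'C': [1]}))
print(unknown_grid_keys(svm, {'kernel': ['rbf'], 'C': [1]}))
```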

## Random Forest

In [None]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #Random Forest
    print()
    print('\t Random Forest')
    rf = RandomForestClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.2, 0.3, 'sqrt', 'log2'], 'max_depth': [5, 10, 15, 20, 50], 'criterion': ['gini', 'entropy']}
    gs_rf = GridSearchCV(rf, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_rf.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_rf.predict(X_test))
    print('\t Best params:', gs_rf.best_params_, ':', gs_rf.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

## Gradient Boosting

In [22]:
for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    # Gradient Boosting
    print()
    print('\t Gradient Boosting')
    gb = GradientBoostingClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.3, 'sqrt', 'log2', None], 'max_depth': [2, 5, 10, 20]}
    gs_gb = GridSearchCV(gb, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_gb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_gb.predict(X_test))
    print('\t Best params:', gs_gb.best_params_, ':', gs_gb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

(100, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 0.3, 'n_estimators': 100, 'max_depth': 5}, ':', 0.57140625)
('\t', array([0, 0, 0, 0]))
(250, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 0.3, 'n_estimators': 500, 'max_depth': 5}, ':', 0.58125000000000004)
('\t', array([0, 0, 0, 0]))
(500, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 'sqrt', 'n_estimators': 500, 'max_depth': 5}, ':', 0.59375)
('\t', array([0, 0, 0, 0]))
(750, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 0.3, 'n_estimators': 250, 'max_depth': 5}, ':', 0.59250000000000003)
('\t', array([0, 0, 0, 0]))
(1000, 'clusters:')
()
	 Gradient Boosting
('\t Best params:', {'max_features': 'sqrt', 'n_estimators': 500, 'max_depth': 5}, ':', 0.59031250000000002)
('\t', array([0, 0, 0, 0]))
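
The all-zero arrays printed above are consistent with Python 2 integer division (the tuple-style `print` output suggests a Python 2 kernel): `np.diagonal(conf_mat)` and `y_test_count` are both integer arrays, so `/` floors every ratio below 1 to 0. A minimal sketch of a per-class recall computed in floating point, assuming the rows of the confusion matrix are the true labels (the scikit-learn convention); the helper name `per_class_recall` and the matrix values are made up for illustration:

```python
import numpy as np

def per_class_recall(conf_mat):
    """Fraction of each true class that was predicted correctly."""
    conf_mat = np.asarray(conf_mat, dtype=float)
    # Row sums are the number of test samples per true class.
    support = conf_mat.sum(axis=1)
    # Float division; classes absent from the test set get a recall of 0.
    return np.divide(np.diagonal(conf_mat), support,
                     out=np.zeros_like(support), where=support > 0)

# Made-up 4-class confusion matrix, 10 test samples per class.
conf_mat = [[8, 2, 0, 0],
            [1, 6, 3, 0],
            [0, 0, 5, 5],
            [0, 1, 0, 9]]
print(per_class_recall(conf_mat))
```

Using the row sums also avoids relying on `y_test.value_counts().sort_index()` having the same class order as the confusion matrix.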


for k in nb_clusters:
    print(k, 'clusters:')
    sifts_train_groups = sifts_train_values.groupby(["Id"])['cluster_km_'+str(k)].apply(list).apply(lambda x: ' '.join(map(str, x)))
    sifts_train_tfidf = pd.DataFrame(tfidf_vectorizer.fit_transform(sifts_train_groups).todense())
    sifts_train_tfidf['Id'] = sifts_train_groups.index
    sifts_train_tfidf = pd.merge(sifts_train_tfidf, id_train, how='inner', on='Id')
    X_train, X_test, y_train, y_test = train_test_split(sifts_train_tfidf.iloc[:,0:-2], sifts_train_tfidf['label'], test_size=0.2, random_state=42)
    y_test_count = np.array(y_test.value_counts().sort_index())
    
    #Logistic Regression
    print("\t Logistic Regression")
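    # 'saga' is the solver that accepts every penalty / multi_class combination in the grid below
    logreg = LogisticRegression(solver='saga')
    parameters = {'C' : [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1, 5, 10], 'penalty': ['l1', 'l2'], 'class_weight': ['balanced', None], 'multi_class' : ['ovr', 'multinomial']}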
    gs_logreg = GridSearchCV(logreg , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_logreg.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_logreg.predict(X_test))
    print('\t Best params:', gs_logreg.best_params_, ':', gs_logreg.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)

    #Gaussian Naive Bayes
    print()
    print('\t Gaussian NB')
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    acc = accuracy_score(y_test, gnb.predict(X_test))
    conf_mat = confusion_matrix(y_test, gnb.predict(X_test))
    print('\t', acc)
    print('\t', np.diagonal(conf_mat)/y_test_count)
 
    #KNN
    print()
    print('\t Knn')
    knn = KNeighborsClassifier(p = 2)
    parameters = {'n_neighbors' : [5, 10, 25, 50, 75, 100, 150, 200, 250], 'weights' : ['uniform', 'distance']}
    gs_knn = GridSearchCV(knn , parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_knn.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_knn.predict(X_test))
    print('\t Best params:', gs_knn.best_params_, ':', gs_knn.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    #SVM
    print()
    print('\t SVM')
    svm = SVC(decision_function_shape='ovr', class_weight = 'balanced')
    parameters = {'kernel':['linear', 'poly', 'rbf', 'sigmoid'], 'C':[0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 10]}
    gs_svm = GridSearchCV(svm, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_svm.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_svm.predict(X_test))
    print('\t Best params:', gs_svm.best_params_, ':', gs_svm.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    #Random Forest
    print()
    print('\t Random Forest')
    rf = RandomForestClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.2, 0.3, 'sqrt', 'log2'], 'max_depth': [6, 8, 10, 12, 15], 'criterion': ['gini', 'entropy']}
    gs_rf = GridSearchCV(rf, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_rf.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_rf.predict(X_test))
    print('\t Best params:', gs_rf.best_params_, ':', gs_rf.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
    
    #Gradient Boosting
    print()
    print('\t Gradient Boosting')
    gb = GradientBoostingClassifier()
    parameters = {'n_estimators': [100, 250, 500], 'max_features':[0.1, 0.3, 'sqrt', 'log2', None], 'max_depth': [2,3, 5, 10]}
    gs_gb = GridSearchCV(gb, parameters, scoring='accuracy', n_jobs=n_j, cv = n_cv)
    gs_gb.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs_gb.predict(X_test))
    print('\t Best params:', gs_gb.best_params_, ':', gs_gb.best_score_)
    print('\t', np.diagonal(conf_mat)/y_test_count)
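
Every model in the cells above goes through the same four steps: build the grid search, fit it, predict on the held-out split, and print the best parameters with the per-class recall. One possible way to factor this out is sketched below. It assumes a recent scikit-learn, where `GridSearchCV` and `train_test_split` live in `sklearn.model_selection`; the helper name `fit_and_report` and the synthetic data are hypothetical and only stand in for the bag-of-words matrices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

def fit_and_report(name, estimator, param_grid,
                   X_train, y_train, X_test, y_test, cv=3, n_jobs=1):
    """Grid-search one model and print the same summary as the cells above."""
    gs = GridSearchCV(estimator, param_grid, scoring='accuracy', cv=cv, n_jobs=n_jobs)
    gs.fit(X_train, y_train)
    conf_mat = confusion_matrix(y_test, gs.predict(X_test)).astype(float)
    # Per-class recall: diagonal over the number of true samples per class.
    recall = np.diagonal(conf_mat) / conf_mat.sum(axis=1)
    print('\t', name)
    print('\t Best params:', gs.best_params_, ':', gs.best_score_)
    print('\t', recall)
    return gs, recall

# Synthetic 4-class data standing in for the word-count features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

gs, recall = fit_and_report('Logistic Regression',
                            LogisticRegression(max_iter=1000),
                            {'C': [0.1, 1, 10]},
                            X_train, y_train, X_test, y_test)
```

With such a helper, each loop over `nb_clusters` would only need to build the feature matrix once and then call the function once per model.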