ADVANCED TOPICS IN MACHINE LEARNING

Assignment - 2

Ιπποκράτης Κοτσάνης - 131

Φιλίτσα-Ιωάννα Κουσκουβέλη - 125

PART B

**Sources:**

    https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
    https://www.analyticsvidhya.com/blog/2021/01/in-depth-intuition-of-k-means-clustering-algorithm-in-machine-learning/
    https://pypi.org/project/kneed/
    https://github.com/arvkevi/kneed
    https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/

**Links used:**

    https://pythonprogramminglanguage.com/kmeans-text-clustering/
    https://www.w3schools.com/python/python_ml_k-means.asp
    https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/


# **Import the necessary libraries and modules**

In [14]:
!pip install kneed

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [15]:
import warnings
from sklearn.exceptions import ConvergenceWarning, UndefinedMetricWarning, FitFailedWarning

import pandas as pd 
import re
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text
from sklearn.cluster import KMeans
from kneed import KneeLocator

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, zero_one_loss
 

# **Functions**

In [16]:
def load_vocab_dict(vocabFileName):
    vocab = {}
    with open(vocabFileName, 'r') as f:
        for line in f:
            word, index = line.strip().split(',')
            vocab[word] = int(index)
    return vocab


def txts_to_bOs(fileName):
    '''
    Takes the name of the data file as input. Each line corresponds to a text.
    Each text consists of a variable number of sentences, and each sentence
    consists of a variable number of words.

    Returns a list of strings. Each string holds all the sentences of one text,
    with words separated by single spaces.
    '''
    # Read data from file
    with open(fileName, 'r') as f:
        lines = f.readlines()  

    dataset = []
    pattern = r'<\d+>\s([\d\s]+)'
    for line in lines:
        txt_in_sentences = re.findall(pattern, line)
        txt = ''.join(txt_in_sentences) # join the strings of the different sentences into a single one
        dataset.append(txt.rstrip()) # strip any trailing whitespace
    return dataset


def txts_to_sentences(fileName):
    '''
    Takes the name of the data file as input. Each line corresponds to a text.
    Each text consists of a variable number of sentences, and each sentence
    consists of a variable number of words.

    Returns a tuple (idxs_list, dataset). idxs_list holds, for each text, the
    first bracketed number on its line (used downstream as the text's sentence count).
    dataset is a list of strings, one per sentence, with words separated by single spaces.
    '''
    # Read data from file
    with open(fileName, 'r') as f:
        lines = f.readlines()

    dataset, idxs_list = [], []

    idx_pattern = r"<(\d+)>"
    sentence_pattern = r"<\d+>\s((?:\d+\s)+)"

    for line in lines:
        idx = re.search(idx_pattern, line)
        idxs_list.append(int(idx.group(1)))
        sentences = re.findall(sentence_pattern, line)
        sentences = [s.strip() for s in sentences]
        dataset += sentences
    return idxs_list, dataset


def vectorize_inputData(trainData, testData, vectorizer):
    '''
    Fits the vectorizer on the training data only, then applies the fitted
    vectorizer to the test data.
    '''
    trainData_matrix = vectorizer.fit_transform(trainData)
    testData_matrix = vectorizer.transform(testData)
    return trainData_matrix, testData_matrix


def plot_elbow(start, stop, inertias_list):    
    plt.plot(range(start, stop), inertias_list, marker='o')
    plt.title('Elbow method')
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia')
    plt.show()
    plt.clf()
    plt.close()


def search_opt_k_kmeans(data_matrix, max_num_k, vectorizer, num_centroids = 10, numIter=100, elbowPlot_flag = True):
    
    top_centroids_lol = []
    inertias_list = []
    candidate_k_list = list(range(2, max_num_k))

    for k in candidate_k_list:
        model = KMeans(n_clusters=k, init='k-means++', max_iter=numIter, n_init=1)
        model.fit(data_matrix)

        # Rank the terms of each cluster centre by descending weight
        order_centroids = model.cluster_centers_.argsort()[:, ::-1]
        terms = vectorizer.get_feature_names_out()
        curren_k_list = []
        for i in range(k):
            cluster_top = []
            for ind in order_centroids[i, :num_centroids]:
                cluster_top.append(terms[ind]) # collect the top terms of this cluster's centroid
            curren_k_list.append(cluster_top) # one list of top terms per cluster, for the current k

        inertias_list.append(model.inertia_)
        top_centroids_lol.append(curren_k_list) # one element per candidate k (max_num_k-2 in total); each element is a list of lists
    
    kneedle = KneeLocator(candidate_k_list, inertias_list, curve="convex", direction="decreasing")
    opt_k = kneedle.knee
    if opt_k is None:
        return None, None
    
    idx_opt = candidate_k_list.index(opt_k)

    if elbowPlot_flag:
        plot_elbow(2,max_num_k, inertias_list)

    return opt_k, top_centroids_lol[idx_opt]


def txts_to_clusterVectors(model, data_matrix, opt_k, idxs_list):
    
    data_transf = model.predict(data_matrix) # cluster index of every sentence

    data_transf_list = data_transf.tolist()
    df = pd.DataFrame(columns=[str(i) for i in range(1, (opt_k+1))]) # one column per cluster, labelled 1..opt_k

    for idx in idxs_list:
        
        if idx>len(data_transf_list):
            idx = len(data_transf_list)

        txt_clusters = data_transf_list[:idx]
        txt_clusters_dict = dict(list(zip([str(i) for i in range(1, (opt_k+1))], [ [0] for i in range(1, (opt_k+1))])))
        
        for cluster in txt_clusters:
            txt_clusters_dict[str(cluster+1)][0]+=1 

        new_row = pd.DataFrame(txt_clusters_dict)
        df = pd.concat([df, new_row], ignore_index=True)
        
        if data_transf_list:
            data_transf_list = data_transf_list[idx:]
        else:
            break
    
    print('clustered data:\n', df.head(100)); print('clustered data shape:\n', df.shape)

    return df


def get_targValues(fileName):
    
    df = pd.read_csv(fileName, sep=r'\s+', header=None)
    value_counts = df.apply(lambda x: x.eq(1).sum()) # number of positive examples per class
    max_index = value_counts.idxmax()
    print('most frequent class: ', (max_index+1))
    return max_index, df[max_index]


def get_txt_targVal_from_sentences(fileName, class_idx):

    df = pd.read_csv(fileName, sep=r'\s+', header=None)
    df = df[[0,(class_idx)]]
    txt_idx, targVal = -1, 0
    idx_list, target_list = [], []

    for i in range(df.shape[0]):

        if df.iloc[i, 1] and txt_idx==df.iloc[i, 0]:
            continue

        if df.iloc[i, 1]:
            targVal=df.iloc[i, 1]

        if txt_idx!=df.iloc[i, 0]:
            target_list.append(targVal)
            idx_list.append(df.iloc[i, 0])
            txt_idx=df.iloc[i, 0]
            targVal=0

    #print(idx_list)
    #print(target_list)
    return idx_list, target_list


def get_clfs():
    # get a list of models to evaluate
    clfs_dict = dict()

    # Logistic Regression Classifier Grid Search
    clfs_dict['lr'] = GridSearchCV(LogisticRegression(max_iter=10000), {'penalty': ['l2'], 'C': [0.1, 1, 10, 100]}, cv=5)

    # Linear SVM Classifier Grid Search
    clfs_dict['lsvc'] = GridSearchCV(LinearSVC(max_iter=100000), {'C': [0.1, 1, 10, 100]}, cv=5)
     
    # Gradient Boosting Classifier (base)
    clfs_dict['gbm'] = GradientBoostingClassifier()
    
    # Random Forest Classifier (base)
    clfs_dict['rf'] = RandomForestClassifier()

    return clfs_dict


def apply_clf_models(clfs_dict, X_train, y_train, X_test, y_test):
    for clf_name, clf in clfs_dict.items():
        clf.fit(X_train, y_train)
        pred_labels = clf.predict(X_test)

        print('\nEvaluating {}'.format(clf_name))
        print(classification_report(y_test, pred_labels))
        print('Subset accuracy = {}'.format((1 - zero_one_loss(y_test, pred_labels))))


def select_head_records(n, df_X_train, df_y_train, df_X_test, df_y_test):
    df_X_train = df_X_train.head(n)
    df_y_train = df_y_train.head(n)
    df_X_test = df_X_test.head(int(n/2))
    df_y_test = df_y_test.head(int(n/2))
    
    return df_X_train, df_y_train, df_X_test, df_y_test




In [17]:
    # Data files' names
    x_train_fileName, x_test_fileName = 'train-data.dat', 'test-data.dat' # X
    y_train_fileName, y_test_fileName = 'train-label.dat', 'test-label.dat' # y

    vectorizer = TfidfVectorizer()

    # Read y_train data and transform the multi-labeled output to binary 
    class_idx, df_y_train = get_targValues(y_train_fileName) # class_idx: index of the most common class 
    #print('y_train_fileName',df_y_train.shape)

    # Read y_test data and transform the multi-labeled output to binary 
    df_y_test = pd.read_csv(y_test_fileName, sep=r'\s+', header=None)
    df_y_test = df_y_test[class_idx] #; print('y_test_fileName',df_y_test.shape)
    

most frequent class:  3


The most frequent of the 20 classes is identified, and y_train and y_test are transformed so that the task is treated as a binary classification problem.
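
As a minimal sketch of the transformation described above, the snippet below uses a tiny made-up label matrix (the values are hypothetical, not taken from the dataset) and plain Python lists instead of pandas. It counts how often each label is set, picks the most frequent one, and keeps only that column as the binary target, which mirrors the idea behind `get_targValues`.

```python
# Toy multi-label matrix: one row per document, one column per label (hypothetical data).
label_rows = [
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
]

# Count how many documents carry each label (column-wise sums).
label_counts = [sum(column) for column in zip(*label_rows)]

# Index of the most frequent label.
most_frequent = label_counts.index(max(label_counts))

# Binary target: 1 if the document carries the most frequent label, else 0.
y_binary = [row[most_frequent] for row in label_rows]

print("label counts:", label_counts)
print("most frequent label index:", most_frequent)
print("binary target:", y_binary)
```

The same selected column index has to be applied to the test labels as well, so that both targets refer to the same class.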

# **02b: Task 01**

In [18]:
    # Read X_train data
    idxs_X_train_list, X_train = txts_to_sentences(x_train_fileName) #; print(type(X_train[0]),type(X_train))
    #print(X_train[0:1])

    # Read X_test data
    idxs_X_test_list, X_test = txts_to_sentences(x_test_fileName)
    
    # Vectorize X_train data and X_test data
    trainData_matrix, testData_matrix = vectorize_inputData(X_train, X_test, vectorizer)



In [19]:
    # k-means search to find optimum k
    max_num_k = 11
    numIter = 100
    opt_k = None
    while opt_k is None:
        opt_k, top_centroids_lol = search_opt_k_kmeans(trainData_matrix, max_num_k, vectorizer, elbowPlot_flag=False)
    
    # Create k-means model with the optimum k on training set examples
    model = KMeans(n_clusters=opt_k, init='k-means++', max_iter=numIter, n_init=1)
    model.fit(trainData_matrix)

    print('optimum k: ', opt_k)
    print (type(trainData_matrix), trainData_matrix.shape)
    

optimum k:  5
<class 'scipy.sparse._csr.csr_matrix'> (149925, 8510)
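
The search above relies on `KneeLocator` from `kneed` to read the elbow off the inertia curve. To illustrate the idea only, here is a simplified stand-in (not the Kneedle algorithm itself): it picks the k whose inertia lies farthest below the straight line joining the first and last points of the curve. The function name and the inertia values are hypothetical.

```python
def elbow_by_chord_distance(ks, inertias):
    """Return the k whose inertia lies farthest below the chord joining
    the first and last points, or None if no point lies below it."""
    x0, y0 = ks[0], inertias[0]
    x1, y1 = ks[-1], inertias[-1]
    best_k, best_gap = None, 0.0
    for k, inertia in zip(ks, inertias):
        # Height of the straight line between the end points at this k.
        chord = y0 + (y1 - y0) * (k - x0) / (x1 - x0)
        gap = chord - inertia
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Made-up inertia values for k = 2..8 (not the notebook's actual values).
ks = [2, 3, 4, 5, 6, 7, 8]
inertias = [1000.0, 700.0, 500.0, 360.0, 340.0, 325.0, 315.0]

print("elbow at k =", elbow_by_chord_distance(ks, inertias))
```

Like `KneeLocator.knee`, this sketch can return `None` when the curve has no clear bend, which is the case the `while` loop above guards against.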


In [20]:
    # Transform X_train data via clustering
    df_X_train = txts_to_clusterVectors(model, trainData_matrix, opt_k, idxs_X_train_list) #; print('trainData_matrix', df_X_train.shape)

    # Transform X_test data via clustering
    df_X_test = txts_to_clusterVectors(model, testData_matrix, opt_k, idxs_X_test_list) #; print('testData_matrix',df_X_test.shape)

    '''
    n = 200
    df_X_train, df_y_train, df_X_test, df_y_test = select_head_records(n, df_X_train, df_y_train, df_X_test, df_y_test)
    #'''
    

clustered data:
     1  2  3  4   5
0   0  0  0  1   1
1   1  0  0  1  26
2   0  3  1  1  26
3   0  0  0  0   2
4   0  0  0  0   7
.. .. .. .. ..  ..
95  1  1  0  4  25
96  0  1  0  0  27
97  8  2  3  0  18
98  0  0  0  2   9
99  0  0  0  1  30

[100 rows x 5 columns]
clustered data shape:
 (8251, 5)
clustered data:
     1  2  3  4   5
0   2  3  2  1  23
1   0  2  1  0  28
2   1  1  0  0  29
3   0  1  0  3  15
4   0  0  0  0   4
.. .. .. .. ..  ..
95  0  2  1  2  25
96  0  1  0  0   6
97  1  0  1  0  10
98  0  0  0  0   7
99  0  0  0  0   4

[100 rows x 5 columns]
clustered data shape:
 (3983, 5)


'\nn = 200\ndf_X_train, df_y_train, df_X_test, df_y_test = select_head_records(n, df_X_train, df_y_train, df_X_test, df_y_test)\n#'

For the *DeliciousMIL: A Data Set for Multi-Label Multi-Instance Learning with Instance Labels Data Set* dataset, the sentences of the training set were grouped with k-means clustering. Using the optimal k that was found (and the clusters fitted on the training set), each document of the training set, and subsequently of the test set, was represented by the clusters to which its sentences belong, i.e. by the number of its sentences assigned to each cluster.
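
The document representation described above can be sketched as follows. This is a minimal illustration in plain Python, not the notebook's `txts_to_clusterVectors`: given the cluster index predicted for every sentence and the number of sentences in each document, it builds one row of per-cluster counts per document. The function name and the toy inputs are hypothetical.

```python
from collections import Counter


def docs_to_cluster_counts(sentence_clusters, sentences_per_doc, k):
    """Return one row per document holding how many of its sentences
    fall into each of the k clusters."""
    rows, start = [], 0
    for n_sentences in sentences_per_doc:
        # Cluster indices of the sentences that belong to this document.
        doc_clusters = sentence_clusters[start:start + n_sentences]
        counts = Counter(doc_clusters)
        rows.append([counts.get(cluster, 0) for cluster in range(k)])
        start += n_sentences
    return rows


# Six sentences in total, split over three documents (hypothetical data).
sentence_clusters = [0, 2, 2, 1, 1, 1]
sentences_per_doc = [3, 1, 2]

for row in docs_to_cluster_counts(sentence_clusters, sentences_per_doc, k=3):
    print(row)
```

Each row has exactly k entries regardless of document length, which is what lets documents with different numbers of sentences be fed to the same classifier.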

In [21]:
    clfs_dict = get_clfs()
    apply_clf_models(clfs_dict, df_X_train, df_y_train, df_X_test, df_y_test)


Evaluating lr
              precision    recall  f1-score   support

           0       0.61      1.00      0.76      2425
           1       1.00      0.00      0.00      1558

    accuracy                           0.61      3983
   macro avg       0.80      0.50      0.38      3983
weighted avg       0.76      0.61      0.46      3983

Subset accuracy = 0.6090886266633191





Evaluating lsvc
              precision    recall  f1-score   support

           0       0.61      1.00      0.76      2425
           1       1.00      0.00      0.00      1558

    accuracy                           0.61      3983
   macro avg       0.80      0.50      0.38      3983
weighted avg       0.76      0.61      0.46      3983

Subset accuracy = 0.6090886266633191

Evaluating gbm
              precision    recall  f1-score   support

           0       0.61      0.98      0.75      2425
           1       0.47      0.03      0.05      1558

    accuracy                           0.61      3983
   macro avg       0.54      0.50      0.40      3983
weighted avg       0.56      0.61      0.48      3983

Subset accuracy = 0.6075822244539292

Evaluating rf
              precision    recall  f1-score   support

           0       0.61      0.82      0.70      2425
           1       0.38      0.17      0.23      1558

    accuracy                           0.57      3983
   mac

The evaluation results above were obtained by applying Grid Search to the LogisticRegression (lr) and LinearSVC (lsvc) classifiers (with parameter grids {'penalty': ['l2'], 'C': [0.1, 1, 10, 100]} and {'C': [0.1, 1, 10, 100]} respectively) and by using the base models for GradientBoostingClassifier and RandomForestClassifier.

Both Logistic Regression and the Linear SVM achieved the same accuracy on the test set (approximately 60.9%). However, looking also at precision, recall and F1, both models score a recall of 0.00 and an F1 of 0.00 for the "positive" category (i.e. documents that belong to the most frequent class). The accuracy is almost exactly the share of negative examples in the test set (2425 of 3983), which means that the models assign nearly every document to the "negative" category and fail to recognise the cases that belong to the "positive" one.

The Gradient Boosting Classifier achieves an accuracy on the test set similar to that of Logistic Regression and the Linear SVM (approximately 60.8%). However, it performs slightly better on recall and F1: it reaches a recall of 0.03 and an F1 of 0.05 for the "positive" category.

The Random Forest Classifier achieves the lowest accuracy on the test set among the models (approximately 56.6%). However, it performs better on recall and F1 than Logistic Regression, the Linear SVM and the Gradient Boosting Classifier: it reaches a recall of 0.17 and an F1 of 0.23 for the "positive" category.
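
To make the reading of the report concrete, the sketch below recomputes accuracy, precision, recall and F1 from confusion-matrix counts. The counts passed in for the first case (1 true positive, 0 false positives, 1557 false negatives, 2425 true negatives) are not printed by the notebook; they are inferred here as one set of counts consistent with the lr report and its subset accuracy, so treat them as an assumption. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy plus precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed counts, inferred from the lr report above (not printed by the notebook).
lr_like = binary_metrics(tp=1, fp=0, fn=1557, tn=2425)

# Baseline that labels every one of the 3983 test documents as negative.
majority_baseline = binary_metrics(tp=0, fp=0, fn=1558, tn=2425)

for name, metrics in [("lr-like", lr_like), ("always negative", majority_baseline)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Comparing the two rows shows why accuracy alone is misleading here: a model that never predicts the positive class scores almost the same accuracy.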



# **02b: Task 02**

In [22]:
    X_train = txts_to_bOs(x_train_fileName)
    #print(X_train[0:1])
    X_test = txts_to_bOs(x_test_fileName)
    trainData_matrix, testData_matrix = vectorize_inputData(X_train, X_test, vectorizer)

    #print(type(trainData_matrix), type(testData_matrix))
    #print(trainData_matrix.shape,testData_matrix.shape)

    df_X_train = pd.DataFrame(trainData_matrix.toarray())
    df_X_test = pd.DataFrame(testData_matrix.toarray())

    '''
    n = 200
    df_X_train, df_y_train, df_X_test, df_y_test = select_head_records(n, df_X_train, df_y_train, df_X_test, df_y_test)
    '''
    #df_X_train, df_y_train, df_X_test, df_y_test = select_head_records(n, df_X_train, df_y_train, df_X_test, df_y_test)



'\nn = 200\ndf_X_train, df_y_train, df_X_test, df_y_test = select_head_records(n, df_X_train, df_y_train, df_X_test, df_y_test)\n'

For the *DeliciousMIL: A Data Set for Multi-Label Multi-Instance Learning with Instance Labels Data Set* dataset, each document is represented on the basis of all of its sentences, joined into a single string and vectorized with TF-IDF.
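
One detail worth checking in this representation: the texts are strings of numeric word IDs, and scikit-learn's documented default `token_pattern` for `TfidfVectorizer` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. The sketch below applies that regular expression with Python's `re` module to a made-up sentence of IDs (hypothetical data) to show that single-digit IDs would be dropped under the default, and that a one-or-more pattern keeps them. Whether this matters for the results here is not established by the notebook.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
KEEP_ALL_PATTERN = r"(?u)\b\w+\b"

# A made-up sentence of word IDs, in the same style as the data files.
sentence = "3 17 204 9 5120"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)
all_tokens = re.findall(KEEP_ALL_PATTERN, sentence)

print("default pattern keeps:", default_tokens)
print("one-or-more pattern keeps:", all_tokens)
```

If single-digit IDs should be kept, the alternative pattern could be passed as `TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")`.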

In [23]:
    clfs_dict = get_clfs()
    apply_clf_models(clfs_dict, df_X_train, df_y_train, df_X_test, df_y_test)
 
    #randSearch_clf_models(df_X_train, df_y_train, df_X_test, df_y_test)
    


Evaluating lr
              precision    recall  f1-score   support

           0       0.69      0.87      0.77      2425
           1       0.66      0.40      0.49      1558

    accuracy                           0.68      3983
   macro avg       0.67      0.63      0.63      3983
weighted avg       0.68      0.68      0.66      3983

Subset accuracy = 0.6834044689932212

Evaluating lsvc
              precision    recall  f1-score   support

           0       0.69      0.87      0.77      2425
           1       0.66      0.38      0.49      1558

    accuracy                           0.68      3983
   macro avg       0.68      0.63      0.63      3983
weighted avg       0.68      0.68      0.66      3983

Subset accuracy = 0.682400200853628

Evaluating gbm
              precision    recall  f1-score   support

           0       0.65      0.91      0.76      2425
           1       0.64      0.25      0.36      1558

    accuracy                           0.65      3983
   macr

The evaluation results above were obtained by applying Grid Search to the LogisticRegression (lr) and LinearSVC (lsvc) classifiers (with parameter grids {'penalty': ['l2'], 'C': [0.1, 1, 10, 100]} and {'C': [0.1, 1, 10, 100]} respectively) and by using the base models for GradientBoostingClassifier and RandomForestClassifier.

Logistic Regression and the Linear SVM achieved similar accuracy on the test set (approximately 68.3% and 68.2% respectively). Precision, recall and F1 do not differ notably between the two models. Both models perform reasonably well for the "positive" category (i.e. the most frequent class) as well as for the "negative" class, with better performance on the "negative" label (F1 of 0.77 versus 0.49).

The Gradient Boosting Classifier achieves an accuracy on the test set of approximately 65.1%. Recall and F1 are high for the "negative" label (0.91 and 0.76), while for the "positive" label they are considerably lower (0.25 and 0.36), indicating difficulty in recognising the cases that belong to the most frequent class.

The Random Forest Classifier achieves an accuracy on the test set of approximately 65.9%. Similarly to the Gradient Boosting Classifier, it performs well on precision, recall and F1 for label 0, but its performance for label 1 is relatively lower.