Project: Machine Learning <br>
Student: Wojciech Miarczyński 106532 <br>
Date: 06-02-2016 <br>

In this project I used two classifiers: AdaBoostClassifier and BaggingClassifier. I also tried various parameter settings for the underlying decision tree and for the grid search (GridSearchCV). In the end BaggingClassifier proved the most effective, reaching a best grid-search score of about 73% when trained on the label groups. AdaBoostClassifier reached at most 55%.
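
The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data generated with `make_classification`, not the project's dataset, and it assumes a recent scikit-learn in which `GridSearchCV` lives in `sklearn.model_selection`. The helper `tune` and the `recall_macro` scoring choice are assumptions made for the example, and the scores it prints say nothing about the 73% and 55% figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tune(ensemble, X, y):
    """Grid-search the tree criterion and ensemble size (hypothetical helper)."""
    # The nested tree is called `estimator` in newer scikit-learn releases
    # and `base_estimator` in older ones, so look the name up at run time.
    prefix = "estimator" if "estimator" in ensemble.get_params() else "base_estimator"
    params = {
        prefix + "__criterion": ["gini", "entropy"],
        "n_estimators": [1, 5],
    }
    search = GridSearchCV(ensemble, param_grid=params, scoring="recall_macro", cv=3)
    search.fit(X, y)
    return search


# Synthetic 3-class problem standing in for the real feature table.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

candidates = [
    ("bagging", BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0)),
    ("adaboost", AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, random_state=0),
                                    random_state=0)),
]

results = {}
for name, ensemble in candidates:
    search = tune(ensemble, X, y)
    results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 3), search.best_params_)
```

The double-underscore keys in the parameter grid are how scikit-learn addresses a parameter of the nested tree from the outer ensemble, which is the same mechanism the notebook uses with `base_estimator__criterion`.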

Libraries used:


In [2]:
import pandas as pd
from sklearn import svm, grid_search, tree
from sklearn import preprocessing
import numpy as np
from collections import Counter
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier
from sklearn.externals import joblib
import warnings
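
The `sklearn.grid_search` module and `sklearn.externals.joblib` imported above were removed from later scikit-learn releases, where `GridSearchCV` is provided by `sklearn.model_selection` and `joblib` is a standalone package. A minimal sketch of an import block that tolerates both layouts, assuming one of the two is installed:

```python
try:
    # Newer scikit-learn: grid search lives in model_selection.
    from sklearn.model_selection import GridSearchCV
except ImportError:
    # Older releases, as used in this notebook.
    from sklearn.grid_search import GridSearchCV

try:
    # joblib as a standalone package.
    import joblib
except ImportError:
    # Older scikit-learn bundled its own copy.
    from sklearn.externals import joblib
```

With these names in scope, the rest of the code would call `GridSearchCV(...)` directly rather than `grid_search.GridSearchCV(...)`.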

Functions for reading and filtering the data:

In [3]:
def read(file, sep):
    df = pd.read_csv(file,sep=sep, header=0, na_values=["nan", "NA", "NaN"], keep_default_na = False)
    print "data read: ",format(len(df)),",",len(df.columns)
    return df


def filter_data(data,remove_classes):
    # Drop rows whose res_name is on the exclusion list
    filtered = data[~data["res_name"].isin(remove_classes)]

    # Keep one row per (pdb_code, res_name) pair
    unique = filtered.drop_duplicates(subset=["pdb_code", "res_name"], keep='first')

    # Keep only classes that occur at least 5 times
    counts = unique[['res_name']].stack().value_counts()
    values = counts[counts>=5].index
    min5Occurs = unique.loc[unique['res_name'].isin(values)]

    print "original: ",format(len(data)),",",len(data.columns)
    print "filtered: ",format(len(filtered)),",",len(filtered.columns)
    print "unique: ",format(len(unique)),",",len(unique.columns)
    print "5+ occurs: ",format(len(min5Occurs)),",",len(min5Occurs.columns)

    return min5Occurs


def remove_na_columns(df):
    no_na_columns = df.count()==len(df)
    no_na_columns = no_na_columns.values
    return df.iloc[:,no_na_columns]
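
To make the three filtering steps concrete, here is a small self-contained sketch that applies the same logic to a made-up table. The data and the helper name `filter_demo` are invented for illustration; only the column names `pdb_code` and `res_name` come from the notebook.

```python
import pandas as pd


def filter_demo(data, remove_classes, min_count=5):
    """Same three steps as filter_data, written for Python 3 (illustrative)."""
    # 1. Drop rows whose class is on the exclusion list.
    filtered = data[~data["res_name"].isin(remove_classes)]
    # 2. Keep one row per (pdb_code, res_name) pair.
    unique = filtered.drop_duplicates(subset=["pdb_code", "res_name"], keep="first")
    # 3. Keep only classes that still occur at least min_count times.
    counts = unique["res_name"].value_counts()
    frequent = counts[counts >= min_count].index
    return unique[unique["res_name"].isin(frequent)]


# Made-up example: ATP appears in 5 distinct structures (once duplicated),
# HEM appears in only 2, and UNK is on the exclusion list.
toy = pd.DataFrame({
    "pdb_code": ["1a", "1b", "1c", "1d", "1e", "1e", "2a", "2b", "3a"],
    "res_name": ["ATP", "ATP", "ATP", "ATP", "ATP", "ATP", "HEM", "HEM", "UNK"],
})

kept = filter_demo(toy, remove_classes=["UNK"])
print(kept)
```

Note that the occurrence count is taken after deduplication, so a class is kept only if it appears in at least five distinct `pdb_code` entries.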

Training function:

In [4]:
def learn(data,test_data):
    # Encode the text class labels as integers
    encoder = preprocessing.LabelEncoder()
    encoder.fit(data[['res_name']])
    data[['res_name']] = encoder.transform(data[['res_name']])

    # Remove columns that contain NA values
    data = remove_na_columns(data)
    test_data = remove_na_columns(test_data)

    # Keep a reference that still holds res_name; needed later for the class labels
    original = data

    # Select the columns of type float
    float_idx = data.dtypes == np.float64
    data = data.loc[:,float_idx]

    test_float_idx = test_data.dtypes == np.float64
    test_data = test_data.loc[:,test_float_idx]

    # Keep only the columns shared by test_data and data
    common_columns = [col for col in test_data if col in data]
    test_data = test_data[common_columns]
    data = data[common_columns]

    # Classification: data already holds only the shared float columns,
    # so the stale float_idx mask must not be applied again
    df, classes = data, original[['res_name']]

    my_tree = tree.DecisionTreeClassifier(max_features = "auto",max_depth = None)
    #method = AdaBoostClassifier(base_estimator = my_tree)
    method = BaggingClassifier(base_estimator = my_tree)

    params = {
        "base_estimator__criterion" : ["gini", "entropy"],
        "base_estimator__splitter" :   ["best", "random"],
        "n_estimators": [1, 5]
     }

    classificator = grid_search.GridSearchCV(method, param_grid=params, scoring = 'recall')
    classificator.fit(df, np.asarray(classes).ravel())

    joblib.dump(classificator, "klasyfikator.pk")

    print 'grid_scores:', classificator.grid_scores_
    print 'best_estimator:', classificator.best_estimator_
    print 'best_score:', classificator.best_score_
    print 'best_params:', classificator.best_params_
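
`learn` passes `scoring='recall'` to the grid search, while `res_name` has many classes. Recall for more than two classes has to be averaged in some way. Recent scikit-learn releases expose the choices as the scoring names `recall_macro`, `recall_micro` and `recall_weighted`, and the plain `'recall'` scorer is meant for binary targets. Which averaging the old version applied to the scores in this notebook is not stated here. The sketch below, on invented labels, shows how much the averaging choice can change the number:

```python
from sklearn.metrics import recall_score

# Invented labels: class 0 is frequent, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

# Per-class recall: class 0 -> 6/6, class 1 -> 1/2, class 2 -> 0/2.
macro = recall_score(y_true, y_pred, average="macro")        # unweighted mean over classes
weighted = recall_score(y_true, y_pred, average="weighted")  # mean weighted by class frequency

print("macro recall:   ", macro)     # (1 + 0.5 + 0) / 3 = 0.5
print("weighted recall:", weighted)  # (6*1 + 2*0.5 + 2*0) / 10 = 0.7
```

On a dataset with many rare classes, as here, the weighted variant is dominated by the frequent classes, so it is worth stating the averaging explicitly when reporting a score.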

Main function - loading the data:

In [5]:
warnings.filterwarnings("ignore")
LABELS = True

# Load the data
data = read("all_summary.txt",";")
test_data = read("test_data.txt",",")

# Filter the data
data = filter_data(data,["DA","DC","DT","DU","DG", "DI","UNK","UNX","UNL","PR","PD","Y1","EU","N","15P","UQ","PX4", "NAN"])

data read:  40309 , 795
data read:  18917 , 824
original:  40309 , 795
filtered:  40027 , 795
unique:  14132 , 795
5+ occurs:  10767 , 795


Main function - training on the original dataset:

In [6]:
learn(data,test_data)

grid_scores: [mean: 0.16681, std: 0.00460, params: {'n_estimators': 1, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'best'},
 mean: 0.23191, std: 0.00931, params: {'n_estimators': 5, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'best'},
 mean: 0.15139, std: 0.00188, params: {'n_estimators': 1, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'random'},
 mean: 0.21259, std: 0.01271, params: {'n_estimators': 5, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'random'},
 mean: 0.16476, std: 0.00388, params: {'n_estimators': 1, 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'best'},
 mean: 0.22513, std: 0.00427, params: {'n_estimators': 5, 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'best'},
 mean: 0.14888, std: 0.00347, params: {'n_estimators': 1, 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'random'},
 mean: 0.20061, std: 0.00722, params: {'n_estimators': 

Main function - training on the label set:

In [7]:
if(LABELS):
    labels = read("labels.txt",",")
    data['res_name']=labels['res_name_group']
    
# Training
learn(data,test_data)

data read:  11005 , 2
grid_scores: [mean: 0.52206, std: 0.01177, params: {'n_estimators': 1, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'best'},
 mean: 0.72295, std: 0.00969, params: {'n_estimators': 5, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'best'},
 mean: 0.52540, std: 0.00877, params: {'n_estimators': 1, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'random'},
 mean: 0.72639, std: 0.00732, params: {'n_estimators': 5, 'base_estimator__criterion': 'gini', 'base_estimator__splitter': 'random'},
 mean: 0.54537, std: 0.00892, params: {'n_estimators': 1, 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'best'},
 mean: 0.73270, std: 0.00861, params: {'n_estimators': 5, 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'best'},
 mean: 0.51556, std: 0.00338, params: {'n_estimators': 1, 'base_estimator__criterion': 'entropy', 'base_estimator__splitter': 'random'},
 mean: 0.72852, std: 0.01143, params: {'n_estimators': 5, 'base_est