## Multiclass classification

Note the next cell takes around 10 mins to run!

This could be using the OvO strategy (One vs One classifier) where since there are 3 classes, it would train 3 binary classifiers and select the class which wins the most. Note that in this case, the OvR strategy (One vs Rest) was chosen by the model (can see in output). It would also need to train 3 binary classifiers so there's not much of a difference in performance. This quickly changes as we increase the number of classes!


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def get_train_test():
    # load
    df = pd.read_csv("star_classification.csv")
    # remove outliers
    # select the row with u or z or g below 0 (should not be allowed)
    outliers = df[(df["u"] <= 0)  
                | (df["z"] <= 0) 
                | (df["g"] <= 0)]

    # drop the row corresponding to the outlier
    df = df.drop(outliers.index, axis=0)
    # we drop the data we are not interested in and which won't be useful
    df = df.drop(
        columns=["obj_ID", "fiber_ID", "MJD", "plate", "spec_obj_ID", 
                 "field_ID", "cam_col", "rerun_ID", "run_ID"], axis=1)
    
    # replace all infinite values with NaN
    #df = df.replace([np.inf, -np.inf], np.nan)
    #df = df.dropna(axis=0)
    
    # replace star class with 1s and others with 0
    #df.loc[df['class'] == "STAR", 'class'] = 1
    #df.loc[df['class'] != "STAR", 'class'] = 0
    
    # stratified split
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    
    for train_index, test_index in split.split(df, df["class"]):
        strat_train_set = df.loc[train_index]
        strat_test_set = df.loc[test_index]
        
    return strat_train_set, strat_test_set

get_train_test()
train, test = get_train_test()
train = train.reset_index(drop=True)

# seperate measurements (x) with class(y)
x_train = train.drop(columns=["class"], axis=1)
y_train = train["class"]

# replace all infinite values with NaN
x_train = x_train.replace([np.inf, -np.inf], np.nan)
y_train = y_train.replace([np.inf, -np.inf], np.nan)

# drop all the rows with NaN
x_train = x_train.dropna(axis=0)
y_train = y_train.dropna(axis=0)

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike


#### Trying with SVC

In [6]:
from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(x_train, y_train) # y_train, not y_train_star
#svm_clf.predict([some_object])
# Takes around 10 minutes to run!!



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [8]:
# introduce new random object from training set
import random
rand_index = random.randint(0, len(x_train))
some_object = x_train.loc[x_train.index == rand_index]
print("The object has label:", y_train.loc[x_train.index == rand_index], "in the dataset")
print("and the model predicts that it is a", svm_clf.predict(some_object))

The object has label: 1295    STAR
Name: class, dtype: object in the dataset
and the model predicts that it is a ['STAR']


In [9]:
some_object_scores = svm_clf.decision_function(some_object)
# shows that highest score corresponds to:
# list of target classes stored in "classes_"
print("The maximum score corresponds to:", svm_clf.classes_[np.argmax(some_object_scores)])
print("The stored classes are:", svm_clf.classes_, "with respective scores:", some_object_scores)

The maximum score corresponds to: STAR
The stored classes are: ['GALAXY' 'QSO' 'STAR'] with respective scores: [[ 1.06128058 -0.23297349  2.22567201]]


In [None]:
# evaluating the performance of the model through cross validation
from sklearn.model_selection import cross_val_score
cross_val_score(svm_clf, x_train, y_train, cv=3, scoring="accuracy")



In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train.astype(np.float64))
cross_val_score(svm_clf, x_train_scaled, y_train, cv=3, scoring="accuracy")

#### Trying with SGD

we can also try it out with SGD. Note we train with the whole y_train and not only y_train_star -> it will run 3 binary classifiers using OvR strategy.

Very quick to run (sub 1 minute)

In [None]:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(x_train, y_train)

In [None]:
print("Looking at decision function array below, the model predicts that the object"
     "which was chosen is a", svm_clf.classes_[np.argmax(some_object_scores)])
sgd_clf.decision_function(some_object)

In [None]:
# evaluating the performance of the model through cross validation
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, x_train, y_train, cv=3, scoring="accuracy")

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train.astype(np.float64))
cross_val_score(sgd_clf, x_train_scaled, y_train, cv=3, scoring="accuracy")

Immediately improve to around 90% accuracy! Fine tuned the model!

#### Trying with Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, x_train, y_train, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1]   # score = proba of positive class

## 10. Error Analysis

Maybe create a function that would automatically display confusion matrix so i can call it for every model?

#### A. For SGD

In [None]:
# sgd_clf.fit(x_train_scaled, y_train)
#y_train_pred = sgd_clf.predict(x_train_scaled)

# better to use cross-validate! -> will train model three times and test on 1/3 of trainingset
# which is used as a test set! otherwise you would test on values you used to fit!
y_train_pred = cross_val_predict(sgd_clf, x_train_scaled, y_train, cv=3)

In [None]:
conf_mx_SGD = confusion_matrix(y_train, y_train_pred)

# obviously since its mainly galaxies, that's why it appears more white!
plt.matshow(conf_mx_SGD, cmap=plt.cm.gray)

conf_mx_SGD

In [None]:
# focusing on the ERRORS
row_sums_SGD = conf_mx_SGD.sum(axis=1, keepdims=True)
# divide each value by number of objects in that class -> compare error rates
norm_conf_mx_SGD = conf_mx_SGD / row_sums_SGD
# ignore diagonal -> zeroes
np.fill_diagonal(norm_conf_mx_SGD, 0)
plt.matshow(norm_conf_mx_SGD, cmap=plt.cm.gray)
plt.show()

norm_conf_mx_SGD

The order is Galaxy, QSO, Stars so we notice that objects often get misclassified as galaxies, especially stars! However, galaxies most often get correclty classified (little errors on that row)

#### B. For SVC

In [None]:
y_train_pred_SVC = cross_val_predict(svm_clf, x_train_scaled, y_train, cv=3)

In [None]:
conf_mx_SVC = confusion_matrix(y_train, y_train_pred_SVC)

# obviously since its mainly galaxies, that's why it appears more white!
plt.matshow(conf_mx_SVC, cmap=plt.cm.gray)
plt.show()

conf_mx_SVC

In [None]:
# focusing on the ERRORS
row_sums_SVC = conf_mx_SVC.sum(axis=1, keepdims=True)
# divide each value by number of objects in that class -> compare error rates
norm_conf_mx_SVC = conf_mx_SVC / row_sums_SVC
# ignore diagonal -> zeroes
np.fill_diagonal(norm_conf_mx_SVC, 0)
plt.matshow(norm_conf_mx_SVC, cmap=plt.cm.gray)
plt.show()

norm_conf_mx_SVC

QSO often misclassified as Galaxies and galaxies occasionally misclassified as stars. Stars always well classified.

#### C. Random Forest Classifier

### Comparing Random Forest, SGD and SVC

In [None]:
# comparing Random Forest, SGD and SVC
# show confusion matrices side to side 
# show accuracy