# Classification task
## Data:
### Response:
__religion:__ 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
### Predictors:
1. __name:__ Name of the country concerned
2. __landmass:__ 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
3. __zone:__ Geographic quadrant, based on Greenwich and the Equator: 1=NE, 2=SE, 3=SW, 4=NW
4. __area:__ in thousands of square km
5. __population:__ in round millions
6. __language:__ 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. __bars:__ Number of vertical bars in the flag
8. __stripes:__ Number of horizontal stripes in the flag
9. __colours:__ Number of different colours in the flag
10. __red:__ 0 if red absent, 1 if red present in the flag
11. __green:__ same for green
12. __blue:__ same for blue
13. __gold:__ same for gold (also yellow)
14. __white:__ same for white
15. __black:__ same for black
16. __orange:__ same for orange (also brown)
17. __mainhue:__ predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
18. __circles:__ Number of circles in the flag
19. __crosses:__ Number of (upright) crosses
20. __saltires:__ Number of diagonal crosses
21. __quarters:__ Number of quartered sections
22. __sunstars:__ Number of sun or star symbols
23. __crescent:__ 1 if a crescent moon symbol present, else 0
24. __triangle:__ 1 if any triangles present, 0 otherwise
25. __icon:__ 1 if an inanimate image present (e.g., a boat), otherwise 0
26. __animate:__ 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
27. __text:__ 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise
28. __topleft:__ colour in the top-left corner (moving right to decide tie-breaks)
29. __botright:__ colour in the bottom-right corner (moving left to decide tie-breaks)

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

# define columns
colnames=['name', 'landmass', 'zone', 'area', 'population', 'language', 'religion', 'bars', 'stripes', 'colours',
          'red', 'green', 'blue', 'gold', 'white', 'black', 'orange', 'mainhue', 'circles', 'crosses', 'saltires',
          'quarters', 'sunstars', 'crescent', 'triangle', 'icon', 'animate', 'text', 'topleft', 'botright']
# read data
df = pd.read_csv('data/flag.data', names=colnames, header=None)
# convert factor columns (mainhue, topleft, botright)
convert_factor = {'black': 0, 'blue': 1, 'brown': 2, 'gold': 3, 'green': 4, 'orange': 5, 'red': 6, 'white': 7}
for factor in ['mainhue', 'topleft', 'botright']:
    df[factor] = df[factor].apply(lambda x: convert_factor[x])

df.head()

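| | name | landmass | zone | area | population | language | religion | bars | stripes | colours | ... | saltires | quarters | sunstars | crescent | triangle | icon | animate | text | topleft | botright |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 5 | 1 | 648 | 16 | 10 | 2 | 0 | 3 | 5 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 4 |
| 1 | Albania | 3 | 1 | 29 | 3 | 6 | 6 | 0 | 0 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 6 | 6 |
| 2 | Algeria | 4 | 1 | 2388 | 20 | 8 | 2 | 2 | 0 | 3 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 7 |
| 3 | American-Samoa | 6 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 6 |
| 4 | Andorra | 3 | 1 | 0 | 0 | 6 | 0 | 3 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 6 |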
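
The integer codes given to `mainhue`, `topleft` and `botright` impose an arbitrary order on the colours (black < blue < brown, and so on), which a distance-based model such as k-neighbors may treat as meaningful. One possible alternative is one-hot encoding with `pd.get_dummies`. The sketch below uses a small made-up frame (`demo` and its values are hypothetical, not rows of the flag data) and is not what the notebook actually does.

```python
import pandas as pd

# Hypothetical miniature of the flag data: one colour column, one count column
demo = pd.DataFrame({'mainhue': ['green', 'red', 'green', 'blue'],
                     'bars': [0, 0, 2, 3]})

# Replace the colour column with one indicator column per colour
encoded = pd.get_dummies(demo, columns=['mainhue'])
print(encoded.columns.tolist())
```

Whether this helps on the flag data is untested here; it increases the number of columns, which can matter with so few rows.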


In [3]:
from sklearn.model_selection import train_test_split

# split data
X, y = df.drop(['religion', 'name'], axis=1), df['religion']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
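
Because some religions have very few samples, a purely random split can leave a class with no test examples at all. A minimal sketch of a stratified split, assuming the `stratify` argument of `train_test_split` is acceptable here; the arrays `X_demo` and `y_demo` are synthetic stand-ins, not the flag data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, three imbalanced classes (60 / 30 / 10)
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 4)
y_demo = np.array([0] * 60 + [1] * 30 + [2] * 10)

# stratify=y_demo keeps the class proportions the same in train and test
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

test_counts = np.bincount(yd_test, minlength=3)
print(test_counts)
```

Applying this to the notebook would change the split and therefore every reported number, so the results in the following sections still refer to the unstratified split.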

# Naive Bayes

In [45]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# define a Gaussian Classifier
nb_model = GaussianNB()

# train the model
nb_model.fit(X_train, y_train)

# predict
nb_pred = nb_model.predict(X_test)

# classification_report
target_names = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
print(classification_report(y_test, nb_pred, target_names=target_names))

                 precision    recall  f1-score   support

       Catholic       0.67      0.33      0.44         6
Other Christian       0.85      0.55      0.67        20
         Muslim       0.60      0.33      0.43         9
       Buddhist       0.00      0.00      0.00         2
          Hindu       0.00      0.00      0.00         0
         Ethnic       0.38      0.20      0.26        15
        Marxist       0.18      0.60      0.27         5
         Others       0.00      0.00      0.00         2

    avg / total       0.56      0.37      0.43        59
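
The `avg / total` row appears to be the support-weighted mean of the per-class rows, not a plain mean. The sketch below checks this from the rounded values printed in the Naive Bayes report; because the inputs are rounded to two decimals, the agreement is only approximate.

```python
import numpy as np

# Per-class values copied from the Naive Bayes report (rounded to two decimals)
precision = np.array([0.67, 0.85, 0.60, 0.00, 0.00, 0.38, 0.18, 0.00])
recall = np.array([0.33, 0.55, 0.33, 0.00, 0.00, 0.20, 0.60, 0.00])
support = np.array([6, 20, 9, 2, 0, 15, 5, 2])

# Weight each class by its number of test samples
weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)

# Unweighted (macro) mean for comparison: every class counts equally
macro_precision = precision.mean()

print(round(weighted_precision, 2), round(weighted_recall, 2))
print(round(macro_precision, 2))
```

The weighted figures are dominated by the large classes, so a model can score reasonably on this row while failing completely on the rare ones.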



# k-neighbors

In [5]:
convert_rel = {0: 'Catholic', 1: 'Other Christian', 2: 'Muslim', 3: 'Buddhist', 
               4: 'Hindu', 5: 'Ethnic', 6: 'Marxist', 7: 'Others'}
for cls in set(y.values):
    print('Religion: {}, number of samples: {}'.format(convert_rel[cls], list(y.values).count(cls)))

Religion: Catholic, number of samples: 40
Religion: Other Christian, number of samples: 60
Religion: Muslim, number of samples: 36
Religion: Buddhist, number of samples: 8
Religion: Hindu, number of samples: 4
Religion: Ethnic, number of samples: 27
Religion: Marxist, number of samples: 15
Religion: Others, number of samples: 4
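
The same counts can be obtained with `value_counts`, and they give a useful reference point: the accuracy of always predicting the largest class. The sketch below rebuilds a label vector from the counts printed above (the name `y_demo` is illustrative) instead of reading the data file.

```python
import numpy as np
import pandas as pd

convert_rel = {0: 'Catholic', 1: 'Other Christian', 2: 'Muslim', 3: 'Buddhist',
               4: 'Hindu', 5: 'Ethnic', 6: 'Marxist', 7: 'Others'}

# Rebuild labels from the printed class counts (row order does not matter here)
class_counts = [40, 60, 36, 8, 4, 27, 15, 4]
y_demo = pd.Series(np.repeat(np.arange(8), class_counts))

# Count per class, ordered by class code, with readable names
counts = y_demo.value_counts().sort_index().rename(index=convert_rel)

# Accuracy of a classifier that always predicts the most frequent class
baseline_accuracy = counts.max() / counts.sum()

print(counts)
print('Majority-class baseline accuracy: {:.3f}'.format(baseline_accuracy))
```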


In [44]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# rescale data
scaler = StandardScaler()
scaler.fit(X_train)

x_train = scaler.transform(X_train)
x_test = scaler.transform(X_test)
# train and choose the best number of neighbors
error = {}
for num_neigh in range(2, 15): 
    kns_model = KNeighborsClassifier(n_neighbors=num_neigh)
    kns_model.fit(x_train, y_train)
    
    # add mean error value
    kns_pred = kns_model.predict(x_test)
    error[num_neigh] = np.mean(kns_pred != y_test)

num_neighbors = sorted(error.items(), key=lambda x: x[1])[0][0]

kns_model = KNeighborsClassifier(n_neighbors=num_neighbors)
kns_model.fit(x_train, y_train)

# predict
kns_pred = kns_model.predict(x_test)

# classification_report: pass the labels that actually occur, so each row gets the right name
target_names = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
labels = sorted(set(y_test) | set(kns_pred))
print(classification_report(y_test, kns_pred, labels=labels,
                            target_names=[target_names[i] for i in labels]))

                 precision    recall  f1-score   support

       Catholic       0.17      0.67      0.28         6
Other Christian       0.71      0.75      0.73        20
         Muslim       0.58      0.78      0.67         9
       Buddhist       0.00      0.00      0.00         2
         Ethnic       1.00      0.07      0.12        15
        Marxist       0.00      0.00      0.00         5
         Others       0.00      0.00      0.00         2

    avg / total       0.60      0.46      0.41        59
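
The loop in this cell scores every candidate `n_neighbors` on the test set, so the test metrics of the chosen model are optimistically biased. A minimal sketch of an alternative, assuming cross-validation on the training portion only: the data comes from `make_classification`, and the names `X_demo`, `search` and `best_k` are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the flag features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Same candidate range as the loop in this cell: 2..14 neighbors
search = GridSearchCV(pipe, {'knn__n_neighbors': list(range(2, 15))}, cv=5)
search.fit(Xd_train, yd_train)

best_k = search.best_params_['knn__n_neighbors']
test_accuracy = search.score(Xd_test, yd_test)  # test set is used only once
print(best_k, round(test_accuracy, 2))
```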



# Logistic regression

In [46]:
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(solver='newton-cg')
logreg_model.fit(X_train, y_train)

# predict
logreg_pred = logreg_model.predict(X_test)

# classification_report: pass the labels that actually occur, so each row gets the right name
target_names = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
labels = sorted(set(y_test) | set(logreg_pred))
print(classification_report(y_test, logreg_pred, labels=labels,
                            target_names=[target_names[i] for i in labels]))

                 precision    recall  f1-score   support

       Catholic       0.27      0.50      0.35         6
Other Christian       0.62      0.65      0.63        20
         Muslim       0.42      0.89      0.57         9
       Buddhist       0.00      0.00      0.00         2
         Ethnic       0.50      0.13      0.21        15
        Marxist       0.00      0.00      0.00         5
         Others       0.00      0.00      0.00         2

    avg / total       0.43      0.44      0.39        59



# Random Forest classifier

In [42]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# perform grid search to get best fit parameters
'''
parameters = {'n_estimators': [10, 100, 200, 500, 1000, 1200],
              'max_depth': [None, 2, 5, 7, 10, 11, 13],
              'bootstrap': [True, False]}

rf_model = GridSearchCV(RandomForestClassifier(random_state=0), parameters, cv=5)
rf_model.fit(X_train, y_train)
'''

rf_model = RandomForestClassifier(n_estimators=10, max_depth=None, random_state=0)
rf_model.fit(X_train, y_train)
# predict
rf_pred = rf_model.predict(X_test)

# classification_report
target_names = ['Catholic', 'Other Christian', 'Muslim', 'Buddhist', 'Hindu', 'Ethnic', 'Marxist', 'Others']
print(classification_report(y_test, rf_pred, target_names=target_names))

                 precision    recall  f1-score   support

       Catholic       0.50      0.67      0.57         6
Other Christian       0.75      0.90      0.82        20
         Muslim       0.50      1.00      0.67         9
       Buddhist       0.00      0.00      0.00         2
          Hindu       0.00      0.00      0.00         0
         Ethnic       1.00      0.40      0.57        15
        Marxist       0.50      0.20      0.29         5
         Others       0.00      0.00      0.00         2

    avg / total       0.68      0.64      0.61        59
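
The grid search in this cell is disabled, so the forest above uses hand-picked parameters. As a rough sketch of how such a search could be run and inspected, here is a reduced grid on synthetic data; the smaller grid, `cv=3` and the `X_demo`/`search` names are assumptions made to keep the example fast, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the flag features
X_demo, y_demo = make_classification(n_samples=150, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

# Reduced version of the parameter grid from the disabled block
parameters = {'n_estimators': [10, 50],
              'max_depth': [None, 5],
              'bootstrap': [True, False]}

search = GridSearchCV(RandomForestClassifier(random_state=0), parameters, cv=3)
search.fit(X_demo, y_demo)

best_params = search.best_params_     # parameter combination with the best mean CV score
best_forest = search.best_estimator_  # forest refit on all of X_demo with those parameters
print(best_params, round(search.best_score_, 2))
```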



# Comparison

| Classifier | precision | recall | f1-score |
|---|---|---|---|
| __Naive Bayes__ | _0.56_ | _0.37_ | _0.43_ |
| __k-neighbors__ | _0.60_ | _0.46_ | _0.41_ |
| __Logistic regression__ | _0.43_ | _0.44_ | _0.39_ |
| __Random Forest__ | _0.68_ | _0.64_ | _0.61_ |

Ranked by weighted-average f1-score: __Logistic regression__ (0.39) < __k-neighbors__ (0.41) < __Naive Bayes__ (0.43) < __Random Forest__ (0.61)
<br>
The classifiers perform best on classes that have more samples. For instance, 'Other Christian', 'Muslim' and 'Ethnic' score better than 'Buddhist' and 'Others', which have only two test samples each. With such a small dataset it is difficult to pick a suitable classifier: only a few examples of some classes remain in the test set ('Hindu' has none), so the metrics for those classes are unreliable or undefined and the classifiers cannot be evaluated fairly. Even so, random forest clearly shows the best results on this split.
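
One way to reduce the dependence on a single small test set is stratified k-fold cross-validation, so that every sample is used for testing exactly once. A minimal sketch, assuming macro-averaged f1 as the metric and using synthetic data in place of the flag data; the `models` and `results` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data standing in for the flag features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

models = {'Naive Bayes': GaussianNB(),
          'Random Forest': RandomForestClassifier(n_estimators=10, random_state=0)}

# Five folds, each preserving the class proportions of the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for model_name, model in models.items():
    # f1_macro weights every class equally, regardless of its size
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='f1_macro')
    results[model_name] = scores.mean()
    print('{}: {:.2f} +/- {:.2f}'.format(model_name, scores.mean(), scores.std()))
```

On the flag data the smallest classes have only four samples, so the number of folds would have to be chosen with that in mind.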