# Ethno

This notebook is a general template for loading and preprocessing data.

## What is the SVM algorithm?

 The support vector machine (SVM) is a supervised learning algorithm designed to solve classification and regression problems.
 
 It often performs well on image classification tasks, especially when the data set is small.
 
 ![enter image description here](https://editor.analyticsvidhya.com/uploads/61706svm3.png)
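
To make the idea concrete, here is a minimal sketch of a linear SVM on toy data. The points and labels are hypothetical and only serve to illustrate the separating hyperplane and its support vectors; they are not the bird images used in the rest of this notebook.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated 2-D clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Support vectors are the training points that define the margin.
print(clf.support_vectors_)

# New points are classified by the side of the hyperplane they fall on.
preds = clf.predict([[0.1, 0.1], [2.0, 2.1]])
print(preds)
```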
 
 ## Summary
 
 1. [Data preparation](#prepaData)
 2. [Model training](#model)
 3. [Displaying metrics](#metric)
 4. [Conclusion](#conclusion)

### 1. Data preparation <a id="prepaData"></a>
 
 - To start, we import the necessary libraries.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import cv2
from tqdm import tqdm


 - We define all necessary variables.

In [2]:
path = '../images'
IMG_SIZE = 100

- We create the function that loads all our data, then we call it.

In [3]:
def loadingData(path):
    # Each sub-directory of `path` is treated as one class label.
    CATEGORIES = [d for d in os.listdir(path)
                  if os.path.isdir(os.path.join(path, d))]
    data = []

    for category in tqdm(CATEGORIES):
        localPath = os.path.join(path, category)
        for img in os.listdir(localPath):
            img_array = cv2.imread(os.path.join(localPath, img))
            if img_array is None:
                # cv2.imread returns None for files it cannot decode; skip them.
                continue
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            data.append([new_array, category])

    return data

In [4]:
data = loadingData(path)

100%|██████████| 2/2 [00:00<00:00,  5.83it/s]


- We create the function that preprocesses all our data, then we call it.

In [5]:
from sklearn.model_selection import train_test_split

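def preprocessing(samples):
    x = []
    y = []

    # Each sample is an [image, label] pair produced by loadingData.
    for image, label in samples:
        x.append(image)
        y.append(label)

    # Flatten each image into a 1-D feature vector and scale pixels to [0, 1].
    x = np.array(x).reshape(len(x), -1)
    x = x / 255.0

    # Hold out 33% of the samples for testing.
    return train_test_split(x, y, test_size=0.33, random_state=42)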

In [6]:
X_train, x_test, y_train, y_test = preprocessing(data)
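
The split above is not stratified, so on a small data set the class proportions can differ between the training and test sets. As a hedged sketch, the `stratify` argument of `train_test_split` keeps the proportions equal in both subsets. The arrays `x_demo` and `y_demo` below are hypothetical toy data, not the notebook's images.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 samples, 8 of class "a" and 4 of class "b".
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.array(["a"] * 8 + ["b"] * 4)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratify, the 2:1 class ratio is preserved in both subsets.
print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```

In this notebook, the equivalent change would be to pass `stratify=y` inside `preprocessing`; the metrics reported below would then change, since the split itself changes.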

### 2. Model training <a id="model"></a>

- We search for the best hyperparameters with a cross-validated grid search.

In [7]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)
 
# fitting the model for grid search
grid.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.500 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.562 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.562 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.533 total time=   0.0s
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.625 total time=   0.0s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.500 total time=   0.0s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.533 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.500 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf', 'linear']},
             verbose=3)
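
Note that `gamma` has no effect on the `linear` kernel, so half of the 50 candidates above repeat the same linear model. A possible refinement, sketched here as an assumption rather than as the method used in this notebook, is to give each kernel its own sub-grid. `ParameterGrid` is used only to count the candidates; the names `full_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100, 1000]
gamma_values = [1, 0.1, 0.01, 0.001, 0.0001]

# Grid used above: every (C, gamma, kernel) combination.
full_grid = {"C": C_values, "gamma": gamma_values, "kernel": ["rbf", "linear"]}

# Alternative: one sub-grid per kernel, so gamma is only varied for 'rbf'.
split_grid = [
    {"kernel": ["rbf"], "C": C_values, "gamma": gamma_values},
    {"kernel": ["linear"], "C": C_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

A list of dictionaries like `split_grid` can be passed directly as `param_grid` to `GridSearchCV`.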

In [8]:
# print best parameter after tuning
print(grid.best_params_)
 
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
SVC(C=10, gamma=0.001)


- We train the final model with the best parameters found.

In [9]:
# Reuse every tuned parameter (including the kernel) instead of hard-coding 'rbf'.
svc = SVC(**grid.best_params_, class_weight='balanced')
svc.fit(X_train, y_train)

SVC(C=10, class_weight='balanced', gamma=0.001)

In [10]:
pred = svc.predict(x_test)

### 3. Displaying metrics <a id="metric"></a>
- Accuracy

In [11]:
from sklearn.metrics import accuracy_score
print("Accuracy on data is",accuracy_score(y_test,pred))

Accuracy on data is 0.717948717948718


The accuracy on the test set is about 72%.

In [12]:
from sklearn.metrics import classification_report, confusion_matrix
print("Classification report is")
print(classification_report(y_test,pred))

Classification report is
                       precision    recall  f1-score   support

     Pigeon_Guillemot       0.86      0.57      0.69        21
Red_headed_Woodpecker       0.64      0.89      0.74        18

             accuracy                           0.72        39
            macro avg       0.75      0.73      0.71        39
         weighted avg       0.76      0.72      0.71        39



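We can see that the F1-score is 0.69 for pigeon guillemots against 0.74 for woodpeckers. The gap comes mostly from recall: the model finds only 57% of the pigeon guillemots but 89% of the woodpeckers.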

In [13]:
print('confusion matrix')
print(confusion_matrix(y_test, pred))

confusion matrix
[[12  9]
 [ 2 16]]


The confusion matrix confirms this trend: 16 of the 18 woodpeckers are classified correctly, against only 12 of the 21 pigeon guillemots.
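
As a worked check, the figures in the classification report can be recomputed from the confusion matrix printed above. This sketch assumes the scikit-learn convention (rows are true classes, columns are predictions) and the alphabetical label order shown in the report.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[12, 9],
               [2, 16]])
labels = ["Pigeon_Guillemot", "Red_headed_Woodpecker"]

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as the class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of the class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.2f}")
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```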

In [15]:
result = pd.DataFrame({'original' : y_test,'predicted' : pred})
result[:5]

|   | original | predicted |
|---|---|---|
| 0 | Pigeon_Guillemot | Pigeon_Guillemot |
| 1 | Red_headed_Woodpecker | Red_headed_Woodpecker |
| 2 | Pigeon_Guillemot | Pigeon_Guillemot |
| 3 | Pigeon_Guillemot | Red_headed_Woodpecker |
| 4 | Pigeon_Guillemot | Pigeon_Guillemot |


In [None]:
import pickle

# Save the trained model; the context manager closes the file afterwards.
filename = 'svm_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(svc, f)
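
A saved model can be restored later with `pickle.load`. The sketch below shows the round trip on a tiny stand-in model written to a temporary file; `X_toy`, `y_toy`, `model`, and `tmp_path` are hypothetical names, and in this notebook the file to load would be `svm_model.sav`. Only unpickle files from a source you trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for the trained `svc`: a tiny model on toy data.
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])
model = SVC(kernel="linear").fit(X_toy, y_toy)

# Save to a temporary file, then reload it.
tmp_path = os.path.join(tempfile.mkdtemp(), "svm_model.sav")
with open(tmp_path, "wb") as f:
    pickle.dump(model, f)
with open(tmp_path, "rb") as f:
    restored = pickle.load(f)

# The restored model gives the same predictions as the original.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```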

### 4. Conclusion <a id="conclusion"></a>

Our overall accuracy is reasonable, at about 72%. Taking the woodpecker as the positive class, there is a large number of false positives (9 pigeon guillemots predicted as woodpeckers) but few false negatives (2 woodpeckers predicted as pigeon guillemots).

This means that our model is better at recognizing woodpeckers, with an F1-score of 0.74, than pigeon guillemots, with an F1-score of 0.69.

This discrepancy can have several causes, such as an uneven distribution of the starting data set, with one class more represented than the other.
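
One way to check this hypothesis is to count the labels per class. The helper below is a minimal sketch; `class_distribution` is a hypothetical name, and the demo labels simply reproduce the test-set supports from the classification report (21 and 18).

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Test-set supports taken from the classification report above.
demo_labels = ["Pigeon_Guillemot"] * 21 + ["Red_headed_Woodpecker"] * 18
dist = class_distribution(demo_labels)
for label, (n, share) in dist.items():
    print(f"{label}: {n} images ({share:.0%})")
```

Calling `class_distribution(y_train)` in this notebook would show whether the training set is similarly balanced.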

One possible improvement is to test another kernel. We can also try data augmentation.
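
As an illustration of data augmentation, the sketch below doubles a batch of images by adding horizontally flipped copies. `augment_with_flips` is a hypothetical helper, and the toy batch stands in for the resized images; in practice it would be applied to the training images only, before they are flattened.

```python
import numpy as np

def augment_with_flips(images, labels):
    """Append a horizontally flipped copy of every image.

    `images` has shape (n, height, width, channels).
    """
    images = np.asarray(images)
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    return np.concatenate([images, flipped]), list(labels) + list(labels)

# Hypothetical toy batch: two 2x3 RGB "images".
batch = np.arange(2 * 2 * 3 * 3).reshape(2, 2, 3, 3)
aug_x, aug_y = augment_with_flips(batch, ["a", "b"])
print(aug_x.shape, aug_y)
```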