![title](image.png)

<h1>Feather Vision</h1>


<h2>1. Analysis</h2>

<style>
	p {
		line-height: 180%;
	}
</style>
<h3>1.1. Project description</h3>

<p>We owe ornithology enthusiasts and zoologists our thanks for protecting bird species and for discovering new ones on an almost daily basis. Their work is already extremely difficult: they have to keep track of a large amount of information while always staying on the lookout for rare sightings of exotic birds.
With FeatherVision, identifying the many different species of birds becomes much easier.</p>
<p>The objective of our project is to create a machine learning model that accurately classifies bird species based on their visual attributes in images. This way, anyone can become a beginner ornithologist simply by taking a picture and learning about the fauna in their local area. Those advanced in the field can cut down the time needed to identify species properly and spend more of it caring for endangered species and doing research.</p>
<p>The world of birds is enormous, and with FeatherVision you will never get lost in it again!</p>


<h3>1.3. Framing the problem</h3>

<p>Our goal is to identify a bird's species from a picture alone: you take a picture of the bird and our trained model tells you what kind of bird it is. Since our dataset is fully labeled, meaning every picture has a label (the bird species name) that the model's predictions can be checked against, we chose the <b>Supervised Learning</b> approach. This means we will most likely use a classification model that maps an image to a species name.</p>

<h2>2. About the data</h2>
<p>
We decided to use a bird classification dataset from <a href="https://www.kaggle.com/datasets/gpiosenka/100-bird-species">Kaggle</a>. It has a large number of training images (almost 85,000) and covers over 500 different kinds of birds, which will help our model learn to identify many different species. We wanted a dataset with many high-quality images and good bird visibility, and this one fits well. An important note is that all the images are photographs, not AI-generated. There are a few things to keep in mind: each picture contains only one bird, which can be good or bad, and about 80% of the images show male birds. Even so, we believe this dataset is a good choice for our bird classification project.
</p>

<h3>Size and type of data</h3>
<p>
As mentioned earlier, the dataset contains almost 85K good-quality training images.
<ul>
	<li>The images are 224 x 224 pixels and are in JPG format.</li>
	<li>The average image file size is around 20 kB.</li>
	<li>The images were photographed during all seasons and should include species from every continent.</li>
</ul>
</p>
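<p>The file-size figure above can be spot-checked locally. The following is only a minimal sketch: the helper name <i>average_jpg_size_kb</i> and the folder in the commented-out call are our assumptions, not part of the dataset's documentation.</p>

```python
from pathlib import Path
from statistics import mean


def average_jpg_size_kb(root):
    """Return the mean size in kB of all .jpg files below root, or None if there are none."""
    sizes = [p.stat().st_size for p in Path(root).rglob("*.jpg")]
    return mean(sizes) / 1000 if sizes else None


# Assumed location of the extracted training images; adjust to your setup.
# print(average_jpg_size_kb("./archive_exploration/train"))
```

<p>The helper only matches the lowercase <i>.jpg</i> extension, so files with a different extension would be skipped.</p>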

<h2>3. Data exploration</h2>

<h4>3.1 Features and their characteristics</h4>

In [1]:
import pandas as pd

rawData = pd.read_csv('./archive_exploration/birds.csv')

display(rawData)

Unnamed: 0,class id,filepaths,labels,data set,scientific name
0,0.0,train/ABBOTTS BABBLER/001.jpg,ABBOTTS BABBLER,train,MALACOCINCLA ABBOTTI
1,0.0,train/ABBOTTS BABBLER/007.jpg,ABBOTTS BABBLER,train,MALACOCINCLA ABBOTTI
2,0.0,train/ABBOTTS BABBLER/008.jpg,ABBOTTS BABBLER,train,MALACOCINCLA ABBOTTI
3,0.0,train/ABBOTTS BABBLER/009.jpg,ABBOTTS BABBLER,train,MALACOCINCLA ABBOTTI
4,0.0,train/ABBOTTS BABBLER/002.jpg,ABBOTTS BABBLER,train,MALACOCINCLA ABBOTTI
...,...,...,...,...,...
89880,524.0,valid/BLACK BREASTED PUFFBIRD/3.jpg,BLACK BREASTED PUFFBIRD,valid,NOTHARCHUS PECTORALIS
89881,524.0,valid/BLACK BREASTED PUFFBIRD/4.jpg,BLACK BREASTED PUFFBIRD,valid,NOTHARCHUS PECTORALIS
89882,524.0,valid/BLACK BREASTED PUFFBIRD/1.jpg,BLACK BREASTED PUFFBIRD,valid,NOTHARCHUS PECTORALIS
89883,524.0,valid/BLACK BREASTED PUFFBIRD/2.jpg,BLACK BREASTED PUFFBIRD,valid,NOTHARCHUS PECTORALIS


<p>As we can see above, the dataset has almost 90,000 data points and 5 features: <b><i>class id</i></b>, <b><i>filepaths</i></b>, <b><i>labels</i></b>, <b><i>data set</i></b>, and <b><i>scientific name</i></b>. The <b><i>class id</i></b>, <b><i>labels</i></b>, and <b><i>scientific name</i></b> features are self-explanatory; <b><i>filepaths</i></b> is a string that contains the path to the image of that particular data point. In the feature <b><i>data set</i></b> we can see that the author of the dataset has already divided the data into train, valid, and test splits.</p>
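<p>To see how the predefined splits are populated, one option is to count rows and distinct labels per split. This is a minimal sketch on a toy DataFrame that reuses the column names shown above; <i>split_summary</i> and <i>toy</i> are hypothetical names, and the toy rows are not real dataset statistics.</p>

```python
import pandas as pd


def split_summary(df, split_col="data set", label_col="labels"):
    """Return the number of rows and of distinct labels for each split."""
    return df.groupby(split_col)[label_col].agg(rows="size", classes="nunique")


# Toy stand-in for rawData; the real frame would be passed in the same way.
toy = pd.DataFrame({
    "labels": ["ABBOTTS BABBLER", "ABBOTTS BABBLER", "ROCK DOVE", "ROCK DOVE", "ROCK DOVE"],
    "data set": ["train", "valid", "train", "train", "test"],
})
print(split_summary(toy))
```

<p>Calling <i>split_summary(rawData)</i> would give the same two columns for the real train, valid, and test splits.</p>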

In [2]:
#Check the feature's data types
print(rawData.dtypes)

print("\n\n")

#Print how many different values there are in each feature
for column in rawData.columns:
    print(column, ":", len(rawData[column].unique()))

class id           float64
filepaths           object
labels              object
data set            object
scientific name     object
dtype: object



class id : 525
filepaths : 89885
labels : 525
data set : 3
scientific name : 522


<p>As printed above, most features have the object datatype, which here means their values are strings that we treat as categorical. The <b><i>data set</i></b> feature has the fewest categories (3) and <b><i>filepaths</i></b> has the most, as expected, since each of its values should be unique. What is interesting is that <b><i>labels</i></b> and <b><i>scientific name</i></b> do not have the same number of distinct values: 525 versus 522. Let's investigate that further.</p>
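<p>One detail in the output above is that <b><i>class id</i></b> is stored as float64 even though it appears to hold whole-number class indices. If integer ids are preferred later on, a cast along the following lines could be used. This is only a sketch under that assumption; <i>class_id_as_int</i> is a hypothetical helper, not part of the original notebook.</p>

```python
import pandas as pd


def class_id_as_int(df, col="class id"):
    """Return a copy of df with col cast to int64.

    Raises ValueError if the column has missing or fractional values,
    so the cast never silently changes an id.
    """
    ids = df[col]
    if ids.isnull().any() or (ids % 1 != 0).any():
        raise ValueError(f"{col!r} has missing or fractional values")
    return df.assign(**{col: ids.astype("int64")})


# Toy stand-in for rawData with the same column name and dtype.
toy = pd.DataFrame({"class id": [0.0, 0.0, 524.0]})
print(class_id_as_int(toy).dtypes)
```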

In [3]:
print(rawData.groupby('scientific name')['labels'].nunique().sort_values(ascending=False))

print("\n\n")
print("PSITTACULA EUPATRIA:", rawData[rawData['scientific name'] == 'PSITTACULA EUPATRIA']['labels'].unique())
print("COLAPTES AURATUS:", rawData[rawData['scientific name'] == 'COLAPTES AURATUS']['labels'].unique())
print("COLUMBA LIVIA:", rawData[rawData['scientific name'] == 'COLUMBA LIVIA']['labels'].unique())

scientific name
PSITTACULA EUPATRIA         2
COLAPTES AURATUS            2
COLUMBA LIVIA               2
PASSERINA CYANEA            1
PARIDAE                     1
                           ..
DENDRAGAPUS OBSCURUS        1
DELICHON URBICUM            1
DACNIS CAYANA               1
DACELO                      1
ZOSTEROPS MADERASPATANUS    1
Name: labels, Length: 522, dtype: int64



PSITTACULA EUPATRIA: ['ALEXANDRINE PARAKEET' 'AMERICAN AVOCET']
COLAPTES AURATUS: ['GILDED FLICKER' 'NORTHERN FLICKER']
COLUMBA LIVIA: ['JACOBIN PIGEON' 'ROCK DOVE']


<p>As it turns out, there are 3 <b><i>scientific names</i></b> that each map to 2 different <b><i>labels</i></b>, as you can see above. This indicates that we should classify by the <b><i>labels</i></b> feature: using <b><i>scientific name</i></b> as the target would merge two different labels into a single class.</p>

In [4]:
#Check the percentage of missing values in each feature
print(rawData.isnull().mean())

#Check if there are any duplicate rows
print("\nNumber of duplicated rows:", rawData.duplicated().sum())

class id           0.0
filepaths          0.0
labels             0.0
data set           0.0
scientific name    0.0
dtype: float64

Number of duplicated rows: 0


<p>Since we chose a well-maintained dataset for our project, no values are missing, which means we can use the whole dataset without worries. We also checked for duplicate rows. There are none, but it is a good habit to check, because we would want to remove any duplicates: if one copy ended up in the training set and another in the test set, the model would be evaluated on data it had already seen, which would make its measured performance unreliable.</p>
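<p>Note that the check above compares rows of the CSV, so two different file paths that point to identical images would not be flagged. A possible complementary check is to hash the image files themselves. The sketch below is an assumption on our side rather than a step from the original analysis; <i>find_duplicate_files</i> is a hypothetical helper, and it only detects byte-identical files, not resized or re-encoded copies.</p>

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(paths):
    """Group paths by the SHA-256 of their bytes; return only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Assumed usage with the notebook's DataFrame; the base folder is a guess:
# duplicates = find_duplicate_files("./archive_exploration/" + rawData["filepaths"])
```

<p>Any group that mixes paths from different splits would be a candidate for removal before training.</p>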

<h4>3.4 Identifying the target</h4>

<p>Our target is the feature <b><i>labels</i></b>. As argued above, it is a better choice than <b><i>scientific name</i></b>, because three scientific names each map to two different labels.</p>
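<p>Both <b><i>labels</i></b> and <b><i>class id</i></b> have 525 distinct values, which suggests, but does not prove, that each label has exactly one class id. A quick way to confirm such a one-to-one mapping could look like the sketch below. The helper <i>is_one_to_one</i> and the toy rows (including their class ids) are illustrative assumptions, not values taken from the dataset.</p>

```python
import pandas as pd


def is_one_to_one(df, left, right):
    """True if every value of left pairs with exactly one value of right, and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)


# Toy rows that mimic the situation found earlier: two labels share one scientific name.
toy = pd.DataFrame({
    "class id": [0.0, 0.0, 1.0, 2.0],
    "labels": ["GILDED FLICKER", "GILDED FLICKER", "NORTHERN FLICKER", "ROCK DOVE"],
    "scientific name": ["COLAPTES AURATUS", "COLAPTES AURATUS", "COLAPTES AURATUS", "COLUMBA LIVIA"],
})
print(is_one_to_one(toy, "labels", "class id"))         # True
print(is_one_to_one(toy, "labels", "scientific name"))  # False
```

<p>Running <i>is_one_to_one(rawData, "labels", "class id")</i> on the real data would tell us whether <b><i>class id</i></b> can be used directly as the numeric encoding of the target.</p>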