# Team Members: Samuel Tan and William Meng

## Importing Data

In [4]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [5]:
# This file contains all the main external libs we'll use
from fastai.imports import *

In [6]:
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

Use a pretrained ResNet-34 network and set the image size to 224 by 224 pixels.

In [7]:
arch=resnet34
PATH = "/content/clouderizer/cs152/data/competitions/dog-breed-identification/"
sz=224

Unzip the data files (provided by Kaggle).

In [8]:
os.chdir(PATH)
!unzip '*.zip'

Archive:  sample_submission.csv.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

Note that this output comes from a re-run of the entire notebook: the files had already been unzipped, so `unzip` asks whether to replace them (as the message above shows).
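To avoid the interactive prompt on re-runs, the archives can also be extracted from Python while skipping anything that is already on disk. This is a minimal sketch using only the standard library; the helper name `unzip_missing` is hypothetical and is not part of the original notebook.

```python
import zipfile
from pathlib import Path

def unzip_missing(directory):
    """Extract every *.zip in `directory`, skipping members that already exist.

    Hypothetical helper: returns the list of member names that were extracted.
    """
    directory = Path(directory)
    extracted = []
    for archive in sorted(directory.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                target = directory / member
                if target.exists():
                    continue  # already unpacked on an earlier run
                zf.extract(member, directory)
                extracted.append(member)
    return extracted

# Usage in this notebook would be: unzip_missing(PATH)
```

A second call on the same directory extracts nothing, so the cell can be re-run safely.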

Randomly sample the indices of the validation set from the labeled training data.

In [9]:
label_csv = f'{PATH}labels.csv'
n = len(list(open(label_csv)))-1
val_idxs = get_cv_idxs(n)
val_idxs

array([2882, 4514, 7717, ..., 8922, 6774,   37])
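For intuition, a random hold-out split of this kind can be sketched in plain NumPy. This is not fastai's `get_cv_idxs` itself; the function name `random_val_idxs`, the 20% validation fraction, the fixed seed, and the example size of 1000 are all assumptions made for illustration.

```python
import numpy as np

def random_val_idxs(n, val_pct=0.2, seed=42):
    """Return a reproducible random subset of row indices for validation.

    Hypothetical stand-in for a hold-out split; val_pct and seed are assumed.
    """
    rng = np.random.RandomState(seed)   # fixed seed keeps the split stable
    n_val = int(val_pct * n)            # number of validation rows
    return rng.permutation(n)[:n_val]   # shuffled indices, no repeats

idxs = random_val_idxs(1000)  # 1000 is a made-up dataset size
```

Fixing the seed matters here: if the split changed between runs, validation losses from different runs would not be comparable.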

Format data for model.

In [10]:
data = ImageClassifierData.from_csv(
    PATH, 'train', label_csv,
    tfms=tfms_from_model(arch, sz),
    val_idxs=val_idxs,
    test_name='test',
    suffix='.jpg',
)

## Train Model

We settled on 7 epochs as a balance between underfitting and overfitting. With 20 epochs, for instance, the validation loss was over twice the training loss, which indicates overfitting. Our score was also highest with 7 epochs.

In [49]:
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 7)


epoch      trn_loss   val_loss   accuracy   
    0      2.071434   0.980211   0.77544   
    1      1.046492   0.667764   0.828278  
    2      0.753494   0.590875   0.824853  
    3      0.619193   0.547516   0.832681  
    4      0.558535   0.531087   0.840509  
    5      0.508794   0.526207   0.834149  
    6      0.474092   0.498486   0.845401  



[array([0.49849]), 0.8454011747515131]
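One rough way to watch for the overfitting described above is to track the ratio of validation loss to training loss per epoch. The sketch below uses the values from the training log above; the threshold of 1.0 and the variable names are illustrative choices, not part of the original analysis.

```python
# Losses copied from the 7-epoch training log above.
trn_loss = [2.071434, 1.046492, 0.753494, 0.619193, 0.558535, 0.508794, 0.474092]
val_loss = [0.980211, 0.667764, 0.590875, 0.547516, 0.531087, 0.526207, 0.498486]

# A ratio well above 1 means the model fits the training set much better
# than held-out data, which is the usual symptom of overfitting.
ratios = [v / t for t, v in zip(trn_loss, val_loss)]
for epoch, r in enumerate(ratios):
    print(f"epoch {epoch}: val/trn = {r:.2f}")

# First epoch at which validation loss exceeds training loss (None if never).
first_over = next((e for e, r in enumerate(ratios) if r > 1.0), None)
```

In this run the ratio stays close to 1 through the last epoch, far from the factor of two we saw with 20 epochs.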

Generate the class probabilities for the test set. The model outputs log-probabilities, so we need to exponentiate them.

In [50]:
log_preds, y = learn.predict_with_targs(is_test=True) # use test dataset rather than validation dataset
probs = np.exp(log_preds)
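The reason exponentiating works can be seen on a toy example, assuming the network's final layer is a log-softmax (which the name `log_preds` suggests). The logits below are made up, and `log_softmax` is written out by hand rather than taken from fastai.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Made-up scores for 2 images and 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])

toy_log_preds = log_softmax(logits)
toy_probs = np.exp(toy_log_preds)   # back to ordinary probabilities

# Each row of toy_probs is a probability distribution over the classes.
row_sums = toy_probs.sum(axis=1)
```

Each row of `toy_probs` sums to 1, which is what Kaggle's per-class probability format expects.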

Check the shape: one row per test image and one column per breed.

In [24]:
probs.shape

(10357, 120)

Confirm that the classes are the dog breeds.

In [14]:
data.classes

['affenpinscher',
 'afghan_hound',
 'african_hunting_dog',
 'airedale',
 'american_staffordshire_terrier',
 'appenzeller',
 'australian_terrier',
 'basenji',
 'basset',
 'beagle',
 'bedlington_terrier',
 'bernese_mountain_dog',
 'black-and-tan_coonhound',
 'blenheim_spaniel',
 'bloodhound',
 'bluetick',
 'border_collie',
 'border_terrier',
 'borzoi',
 'boston_bull',
 'bouvier_des_flandres',
 'boxer',
 'brabancon_griffon',
 'briard',
 'brittany_spaniel',
 'bull_mastiff',
 'cairn',
 'cardigan',
 'chesapeake_bay_retriever',
 'chihuahua',
 'chow',
 'clumber',
 'cocker_spaniel',
 'collie',
 'curly-coated_retriever',
 'dandie_dinmont',
 'dhole',
 'dingo',
 'doberman',
 'english_foxhound',
 'english_setter',
 'english_springer',
 'entlebucher',
 'eskimo_dog',
 'flat-coated_retriever',
 'french_bulldog',
 'german_shepherd',
 'german_short-haired_pointer',
 'giant_schnauzer',
 'golden_retriever',
 'gordon_setter',
 'great_dane',
 'great_pyrenees',
 'greater_swiss_mountain_dog',
 'groenendael',


Put the probabilities into a dataframe, with an `id` column first (following the sample submission).

In [51]:
df = pd.DataFrame(probs)
df.columns = data.classes
# Strip the 'test/' prefix and '.jpg' suffix from each file name to get the id
df.insert(0, 'id', [o[5:-4] for o in data.test_ds.fnames])
df.head()

| | id | affenpinscher | afghan_hound | african_hunting_dog | airedale | american_staffordshire_terrier | appenzeller | australian_terrier | basenji | basset | ... | toy_poodle | toy_terrier | vizsla | walker_hound | weimaraner | welsh_springer_spaniel | west_highland_white_terrier | whippet | wire-haired_fox_terrier | yorkshire_terrier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 433832e6fdc7400cfefd357d2bb889a2 | 0.0006316795 | 2.362755e-05 | 0.0003808084 | 0.000869493 | 0.002834617 | 0.001246202 | 0.003501303 | 0.0002443731 | 9.365161e-06 | ... | 0.0007410989 | 0.001541 | 0.0006541756 | 9.094047e-05 | 0.001822798 | 5.940974e-05 | 0.0004940692 | 0.0007353099 | 0.0001866468 | 0.01593384 |
| 1 | cecb377c724cd2e385458d8b0eba2a49 | 4.151429e-07 | 9.493127e-08 | 2.533695e-06 | 7.9218e-07 | 2.430508e-06 | 1.327539e-06 | 7.809668e-08 | 6.203733e-08 | 1.710265e-07 | ... | 1.730322e-07 | 5e-06 | 0.003013215 | 0.0001212405 | 0.0001042325 | 8.810629e-07 | 1.115024e-07 | 9.364611e-07 | 4.884091e-06 | 9.981766e-08 |
| 2 | e7ed96b272013c6de9505a753816ce75 | 2.695915e-05 | 9.972332e-07 | 8.504518e-07 | 3.583614e-06 | 4.740868e-05 | 6.38277e-06 | 0.942874 | 0.0002220828 | 2.381228e-06 | ... | 4.428845e-06 | 3.8e-05 | 3.484718e-06 | 3.919203e-06 | 4.444007e-07 | 1.170891e-05 | 0.0001032226 | 5.828138e-07 | 1.312689e-05 | 0.0002915359 |
| 3 | 4bf924974410498a1d52d9eb45eb0703 | 3.461973e-07 | 1.127799e-07 | 8.143571e-06 | 8.684868e-07 | 5.913324e-06 | 3.204967e-05 | 2.728062e-08 | 1.095674e-06 | 9.026891e-05 | ... | 9.410955e-08 | 1.3e-05 | 2.192598e-06 | 0.01839617 | 1.52423e-05 | 7.352124e-06 | 3.039453e-07 | 7.699245e-06 | 3.000569e-07 | 5.770946e-08 |
| 4 | f9c6eaf6f490f30fdecd76831805d0f7 | 7.633561e-06 | 1.98375e-06 | 3.978221e-07 | 9.556312e-08 | 6.35537e-08 | 2.464959e-07 | 2.072688e-06 | 4.335882e-07 | 2.258121e-06 | ... | 0.0002677134 | 1e-06 | 5.910231e-07 | 3.988113e-07 | 2.574534e-07 | 1.84327e-05 | 4.230863e-07 | 1.664037e-08 | 2.61913e-07 | 2.910837e-05 |
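The same frame layout can be reproduced on toy data, which makes it easy to check the id extraction and the column order in isolation. The file names, the three-class subset, and the probabilities below are made up; `os.path` is used here as a more explicit alternative to the `[5:-4]` slice.

```python
import os
import numpy as np
import pandas as pd

# Hypothetical stand-ins for data.classes, data.test_ds.fnames and probs.
classes = ["affenpinscher", "afghan_hound", "airedale"]
fnames = ["test/aaa111.jpg", "test/bbb222.jpg"]
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])

# Drop the directory and the extension to recover the image id.
ids = [os.path.splitext(os.path.basename(f))[0] for f in fnames]

toy_df = pd.DataFrame(toy_probs, columns=classes)
toy_df.insert(0, "id", ids)   # the id must be the first column

# Sanity check: every row should be a probability distribution.
rows_ok = np.allclose(toy_df[classes].sum(axis=1), 1.0)
```

Using `os.path` avoids hard-coding the lengths of the `test/` prefix and `.jpg` suffix, at the cost of a slightly longer expression.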


Save the probabilities in the Kaggle submission format, as a gzip-compressed CSV.

In [52]:
SUBM = '../../out/'  # relative to PATH, since we called os.chdir(PATH) earlier
os.makedirs(SUBM, exist_ok=True)
df.to_csv(f'{SUBM}dogbreed_4.gz', compression='gzip', index=False)
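Before uploading, it can be worth reading the compressed file back to confirm that the columns and row count survived the round trip. The sketch below does this on a small made-up frame in a temporary directory; the file name `toy_submission.gz` and the values are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up miniature submission with the same layout: id first, then classes.
toy = pd.DataFrame({
    "id": ["aaa111", "bbb222"],
    "affenpinscher": [0.7, 0.1],
    "afghan_hound": [0.3, 0.9],
})

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "toy_submission.gz")

# index=False keeps the row index out of the file.
toy.to_csv(out_path, compression="gzip", index=False)

# Read it back the way a grader would see it.
round_trip = pd.read_csv(out_path, compression="gzip")
```

If `index=False` were omitted, the file would gain an extra unnamed leading column, which does not match the sample submission.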

![Submission result on Kaggle](submission.png)