# Setup

In [None]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Loading the DataFrames

In [None]:
root = r"../input/petfinder-pawpularity-score/"

sample_submission = pd.read_csv(os.path.join(root, "sample_submission.csv"))
test = pd.read_csv(os.path.join(root, "test.csv"))
train = pd.read_csv(os.path.join(root, "train.csv"))

# Exploring Sample Submission

In [None]:
# display() returns None, so show the head and the shape as separate statements.
display(sample_submission.head())
sample_submission.shape

In [None]:
sample_submission.info()

In [None]:
sample_submission.Pawpularity.value_counts()

In [None]:
sample_submission.Pawpularity.nunique()

In [None]:
sample_submission.Pawpularity.unique()

# Looking at Train CSV

```
Photo Metadata
The train.csv and test.csv files contain metadata for photos in the training set and test set, respectively. Each pet photo is labeled with the value of 1 (Yes) or 0 (No) for each of the following features:

Focus - Pet stands out against uncluttered background, not too close / far.
Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
Face - Decently clear face, facing front or near-front.
Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
Action - Pet in the middle of an action (e.g., jumping).
Accessory - Accompanying physical or digital accessory / prop (i.e. toy, digital sticker), excluding collar and leash.
Group - More than 1 pet in the photo.
Collage - Digitally-retouched photo (i.e. with digital photo frame, combination of multiple photos).
Human - Human in the photo.
Occlusion - Specific undesirable objects blocking part of the pet (i.e. human, cage or fence). Note that not all blocking objects are considered occlusion.
Info - Custom-added text or labels (i.e. pet name, description).
Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0.
```

In [None]:
train

In [None]:
# info() prints its summary and returns None, so call it on its own line.
train.info()
train.shape

In [None]:
analyzed_columns = ["Eyes", "Face", "Near", "Action", "Accessory", "Group", "Collage", "Human", "Occlusion",
                    "Info", "Blur"]


# Reminder: 1 means the feature is present, 0 means it is absent.
for col in analyzed_columns:
    display(train[col].value_counts())
    print("________________________________")

# My guesses before any modeling (none of these are verified yet):
# - Eyes: most images show the eyes (though some might not). We humans tend to look
#   at the eyes first, so this flag may be very important and could be weighted
#   higher or used as a dominant feature when predicting the score.
# - Face: probably important for the same reason.
# - Near: may also matter, since a close-up makes us feel closer to the animal.
# - Action: can be important. We often laugh or get a cuteness overload from
#   seeing animals doing stuff.
# - Accessory: props may work against a model that relies on spatial features,
#   but they probably encourage adoption by making the pet look "cuter".
# - Group: may deter adoption of any specific animal by making people indecisive.
# - Collage: could work well.
# - Human: nope, probably a negative.
# - Occlusion: should work AGAINST adoption of the pet!
# - Info: may work well if the pet has a cute name.
# - Blur: should work AGAINST adoption!
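
The guesses in the comments above can be checked against the data rather than taken on faith. Here is a minimal sketch, assuming a frame shaped like `train` (binary feature columns plus `Pawpularity`). The helper name `mean_score_by_flag` and the toy rows are made up for illustration and are not competition data:

```python
import pandas as pd


def mean_score_by_flag(df, feature_cols, target="Pawpularity"):
    """Return one row per feature with the mean target for flag == 0 and flag == 1."""
    rows = {}
    for col in feature_cols:
        # Mean target for each value of the binary flag (0 = absent, 1 = present).
        means = df.groupby(col)[target].mean()
        rows[col] = {
            "mean_if_0": means.get(0, float("nan")),
            "mean_if_1": means.get(1, float("nan")),
        }
    out = pd.DataFrame.from_dict(rows, orient="index")
    # Positive diff: photos with the flag set score higher on average.
    out["diff"] = out["mean_if_1"] - out["mean_if_0"]
    return out


# Toy rows, invented purely to show the call; swap in `train` in the notebook.
toy = pd.DataFrame({
    "Eyes": [1, 1, 0, 0],
    "Blur": [0, 0, 1, 1],
    "Pawpularity": [40, 60, 20, 30],
})
summary = mean_score_by_flag(toy, ["Eyes", "Blur"])
print(summary)
```

In the notebook the call would be `mean_score_by_flag(train, analyzed_columns)`. A difference in means is only a first look: it says nothing about causation or about interactions between flags.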

# Looking at Pawpularity!

```
How Pawpularity Score Is Derived
The Pawpularity Score is derived from each pet profile's page view statistics at the listing pages, using an algorithm that normalizes the traffic data across different pages, platforms (web & mobile) and various metrics.
Duplicate clicks, crawler bot accesses and sponsored profiles are excluded from the analysis.
```

The description above doesn't spell out the algorithm, so building a custom loss function that mimics how Pawpularity is computed doesn't seem possible. I also saw somewhere that a Kaggle grandmaster was cheating by web-scraping the scores. 😳

In [None]:
train.Pawpularity.hist()

I wonder why there is a little bump at 100.
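
One way to see how big that bump is: `hist()` uses 10 bins by default, so per-score counts at the top end give a sharper picture. This is a minimal sketch; `top_score_counts` is a hypothetical helper and the toy series is invented:

```python
import pandas as pd


def top_score_counts(scores, n=5):
    """Count occurrences of each score and return the n highest score values."""
    counts = scores.value_counts().sort_index()
    return counts.tail(n)


# Invented scores with an artificial pile-up at 100, just to exercise the helper.
toy_scores = pd.Series([97, 98, 98, 99, 100, 100, 100, 100])
top = top_score_counts(toy_scores, n=3)
print(top)
```

On the real data, `top_score_counts(train.Pawpularity)` would show whether 100 holds noticeably more rows than the scores just below it. `train.Pawpularity.hist(bins=100)` gives roughly one bar per score for a visual check.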

In [None]:
train.Pawpularity.dtype

In [None]:
train.Pawpularity.nunique()

In [None]:
train.Pawpularity.value_counts()
# Interesting: this looks like a regression problem, yet the training data
# contains only 100 distinct scores ("classes").

# Question: could there be more "classes" that *aren't* in the training data?
# If so, a pure classification model could never predict them.

# An idea: turn this into a 101-class classification task. A DNN encodes the metadata,
# its features are combined with those of a large CNN (EffNet maybe?), and the
# 101st class means "other". Whenever "other" gets the highest probability, the
# image + metadata are deferred to a regression model!
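
The deferral idea in the last comment can be sketched independently of any particular network. The snippet below assumes the classifier emits 101 probabilities per image (indices 0 to 99 standing for scores 1 to 100, index 100 standing for "other") and that a separate regressor has already produced a score per image. That index layout, the function name `route_predictions`, and the fake inputs are all assumptions for illustration:

```python
import numpy as np

OTHER_CLASS = 100  # assumed index of the catch-all "other" class


def route_predictions(class_probs, regressor_scores):
    """Use the classifier's score unless it picks "other"; then use the regressor."""
    class_probs = np.asarray(class_probs)
    regressor_scores = np.asarray(regressor_scores, dtype=float)
    top_class = class_probs.argmax(axis=1)
    # Classes 0..99 map to scores 1..100 under the assumed layout.
    classifier_scores = top_class + 1.0
    deferred = top_class == OTHER_CLASS
    return np.where(deferred, regressor_scores, classifier_scores), deferred


# Two fake images: the first is confidently class 41 (score 42),
# the second puts most mass on "other" and is handed to the regressor.
probs = np.full((2, 101), 0.001)
probs[0, 41] = 0.9
probs[1, OTHER_CLASS] = 0.9
scores, deferred = route_predictions(probs, regressor_scores=[10.0, 73.5])
print(scores, deferred)
```

The routing itself is the easy part. How to train the "other" class is still open, since no training image would carry that label.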

# Looking at Test CSV

In [None]:
test.columns

In [None]:
test

In [None]:
test.shape  # They make it very easy for us to do fast submissions!

I presume they will replace this test data with their own test images and metadata.

```
Example Test Data
In addition to the training data, we include some randomly generated example test data to help you author submission code. When your submitted notebook is scored, this example data will be replaced by the actual test data (including the sample submission).

test/ - Folder containing randomly generated images in a format similar to the training set photos. The actual test data comprises about 6800 pet photos similar to the training set photos.
test.csv - Randomly generated metadata similar to the training set metadata.
sample_submission.csv - A sample submission file in the correct format.
```

Hmm, this raises a question: the example metadata is "randomly generated". Do they ever specify whether the real test set will come with actual metadata? I assume it will.
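
Whichever metadata ends up in the hidden `test.csv`, the submission only has to pair each test `Id` with a predicted `Pawpularity`, so the code can stay agnostic to the metadata values. Below is a minimal baseline sketch that predicts the training mean for every row. `constant_submission` is a hypothetical helper, the toy frames are invented, and the output file name `submission.csv` is an assumption about what the competition expects:

```python
import pandas as pd


def constant_submission(train_df, test_df):
    """Build a submission frame that predicts the training mean for every test Id."""
    baseline = train_df["Pawpularity"].mean()
    return pd.DataFrame({"Id": test_df["Id"], "Pawpularity": baseline})


# Invented stand-ins for train.csv / test.csv.
toy_train = pd.DataFrame({"Id": ["a", "b"], "Pawpularity": [30, 50]})
toy_test = pd.DataFrame({"Id": ["x", "y", "z"], "Blur": [0, 1, 0]})

submission = constant_submission(toy_train, toy_test)
print(submission)
# In the notebook: constant_submission(train, test).to_csv("submission.csv", index=False)
```

Because the helper reads only the `Id` column from the test frame, it behaves the same whether the metadata is random or real.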

# Looking at the Images

In [None]:
from PIL import Image

In [None]:
Image.open(os.path.join(root, "test", "4128bae22183829d2b5fea10effdb0c3.jpg"))

Wow, that puppy is so cute! OK, jokes aside, this is just noise.

In [None]:
Image.open(os.path.join(root, "train", "0007de18844b0dbbb5e1f607da0606e0.jpg"))

THAT's an actual puppy. 

Let's take a look at multiple photos.
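
A small matplotlib grid is enough for a first look at several photos together with their scores. In this minimal sketch, `show_grid` is a hypothetical helper that accepts anything `imshow` can draw (PIL images or arrays). The demo uses random-noise arrays and invented titles rather than real files:

```python
import numpy as np
from matplotlib import pyplot as plt


def show_grid(images, titles, ncols=3):
    """Draw images in a grid, one title per image, and return the figure."""
    nrows = -(-len(images) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(3 * ncols, 3 * nrows), squeeze=False)
    for ax in axes.ravel():
        ax.axis("off")  # hide ticks, including on unused cells
    for ax, image, title in zip(axes.ravel(), images, titles):
        ax.imshow(image)
        ax.set_title(title)
    return fig


# Random noise stands in for photos here; the scores in the titles are invented.
rng = np.random.default_rng(0)
noise = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(4)]
fig = show_grid(noise, [f"Pawpularity: {s}" for s in (10, 20, 30, 40)])
```

For real photos, `images` could be `[Image.open(os.path.join(root, "train", f"{image_id}.jpg")) for image_id in train.Id.head(9)]` and `titles` could be `train.Pawpularity.head(9)`, assuming every `Id` has a matching `.jpg` in the `train` folder as the example above does.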

In [None]:
# !pip install wandb --upgrade -qqq

# import wandb
# wandb.login()

In [None]:
len(os.listdir("../input/petfinder-pawpularity-score/train")) / 4

In [None]:
train[train.Id == "0007de18844b0dbbb5e1f607da0606e0"].Pawpularity

In [None]:
# from glob import glob
# from tqdm import tqdm

# list_of_image_paths = glob(r"../input/petfinder-pawpularity-score/train/*")

# data = []
# cnt = 0
# for image_path in tqdm(list_of_image_paths):
#     imageid = image_path.split("/")[-1].split(".")[0]
#     data.append([train[train.Id == imageid].Pawpularity.values[0], wandb.Image(Image.open(image_path))])
# #     if cnt == 10: break
# #     cnt += 1

In [None]:
# run = wandb.init(project="paw-pictures", name="paw-dataset")
# my_table = wandb.Table(columns=["Pawpularity Score", "Images"], 
#                        data=data)
# run.log({"paw-dataset": my_table})
# run.finish()

[Check out my first ever table visualization of a dataset!](https://wandb.ai/vincenttu/paw-pictures?workspace=user-vincenttu)