# Shelter Dogs - can we predict if a dog will be 'reserved' or 'available' based on their listing information, and could this help shelters funnel resources to likely-to-be-overlooked dogs?




### Interesting things

Here's a sneak preview of some of the more interesting findings, in case scrolling through a long page of scripting isn't up your street!  :)

(NB if you run these cells, the outputs will have errors as the cells need to be run in order, so just skip these - you'll see the same charts throughout the investigation)




In [169]:
fig_overall_reserved_dogs.show()

In [170]:
fig_breeds_reserved.show()
fig_breeds_percentage_reserved.show()

In [171]:
fig_traits.show()
fig_traits_slider.show()

In [172]:
fig_age_reserved.show()

In [173]:
fig_photos_reserved.show()

## Investigation

In [70]:
#Imports
import plotly
import pandas as pd
import plotly.io as pio
pio.renderers.default = "colab"

# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# init_notebook_mode(connected=True)

from collections import Counter
import numpy as np

import plotly.graph_objects as go


In [71]:
#Import CSV file and skip initial space to make boolean comparisons easier later - a lot of white space in initial file
shelterDogsOriginalData = pd.read_csv('https://raw.githubusercontent.com/A-F-McG/shelterDogsAdoptionPredictions/refs/heads/master/dogsTrustCrawledData.csv', skipinitialspace=True)

Let's have an initial glance at the dataset and see if there's anything we need to clean up

In [72]:
shelterDogsOriginalData.head()

Unnamed: 0,age,breed,centreLocation,gender,hasPhoto,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,isNewDog,name,numberOfPhotos,reserved
0,'2 to 5 Years','Staffordshire Bull Terrier (SBT)','West Calder','Female',True,False,Child friendly (under 12 yrs),,,,,,,,Especially likes cuddles,,Needs some training,,Stranger friendly,,,,True,'Abby',3,False
1,'5 to 7 Years','Shih Tzu ','Shoreham','Male',True,False,,,,,,Medical needs,,,,,Needs some training,Small but sparky,Stranger friendly,,,,False,'Alfie',3,False
2,'5 to 7 Years','A Crossbreed ','Merseyside','Male',True,True,,Crossbreed,,,Has basic housetraining,,Live wire,,,Loves toys/games,Needs some training,,,Bright spark/very clever,,,False,'Alfie',5,False
3,'2 to 5 Years','Belgian Shepherd Dog: Malinois (BSD)','London (Harefield)','Female',True,True,,Crossbreed,,,,,Live wire,,Especially likes cuddles,Loves toys/games,Needs some training,,,Bright spark/very clever,Loves treats/will work for food,,False,'Alessia',5,False
4,'1 to 2 Years','Lurcher ','Shoreham','Male',True,True,,Crossbreed,,Gentle giant,,,,,,Loves toys/games,Needs some training,,Stranger friendly,,Loves treats/will work for food,,False,'Alby',3,True


In [73]:
shelterDogsOriginalData.describe(include='all')

Unnamed: 0,age,breed,centreLocation,gender,hasPhoto,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,isNewDog,name,numberOfPhotos,reserved
count,948,948,948,948,948,948,101,619,385,68,284,33,430,72,203,491,619,131,270,262,537,85,948,948,948.0,948
unique,6,83,20,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,648,,2
top,'2 to 5 Years','A Crossbreed ','Kenilworth','Male',True,False,Child friendly (under 12 yrs),Crossbreed,Dog friendly,Gentle giant,Has basic housetraining,Medical needs,Live wire,Living off site,Especially likes cuddles,Loves toys/games,Needs some training,Small but sparky,Stranger friendly,Bright spark/very clever,Loves treats/will work for food,Young at heart,False,'Charlie',,False
freq,381,133,128,581,892,740,101,619,385,68,284,33,430,72,203,491,619,131,270,262,537,85,581,8,,672
mean,,,,,,,,,,,,,,,,,,,,,,,,,3.408228,
std,,,,,,,,,,,,,,,,,,,,,,,,,1.681948,
min,,,,,,,,,,,,,,,,,,,,,,,,,0.0,
25%,,,,,,,,,,,,,,,,,,,,,,,,,2.0,
50%,,,,,,,,,,,,,,,,,,,,,,,,,4.0,
75%,,,,,,,,,,,,,,,,,,,,,,,,,5.0,




## Neaten up the dataset

I'm going to rearrange the dataset so that all of the icons are last (and finally 'reserved' as this is the target variable).

In [74]:
#Make a new dataframe so that we always have the original.
#.copy() makes it an independent copy, so later edits never touch shelterDogsOriginalData.
shelterDogsData = shelterDogsOriginalData[['name', 'age', 'breed', 'gender', 'centreLocation', 'isNewDog', 'hasPhoto', 'numberOfPhotos','hasVideo',
       'icon_childFriendly', 'icon_crossBreed', 'icon_dogFriendly',
       'icon_gentleGiant', 'icon_hasBasicHousetraining',
       'icon_hasMedicalNeeds', 'icon_livewire', 'icon_livingOffsite',
       'icon_lovesCuddles', 'icon_lovesToysGames', 'icon_needsTraining',
       'icon_smallButSparky', 'icon_strangerFriendly', 'icon_veryClever',
       'icon_willWorkForFood', 'icon_youngAtHeart', 'reserved']].copy()


In [75]:
pd.set_option('display.max_columns', 50)
shelterDogsData.head()

Unnamed: 0,name,age,breed,gender,centreLocation,isNewDog,hasPhoto,numberOfPhotos,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,reserved
0,'Abby','2 to 5 Years','Staffordshire Bull Terrier (SBT)','Female','West Calder',True,True,3,False,Child friendly (under 12 yrs),,,,,,,,Especially likes cuddles,,Needs some training,,Stranger friendly,,,,False
1,'Alfie','5 to 7 Years','Shih Tzu ','Male','Shoreham',False,True,3,False,,,,,,Medical needs,,,,,Needs some training,Small but sparky,Stranger friendly,,,,False
2,'Alfie','5 to 7 Years','A Crossbreed ','Male','Merseyside',False,True,5,True,,Crossbreed,,,Has basic housetraining,,Live wire,,,Loves toys/games,Needs some training,,,Bright spark/very clever,,,False
3,'Alessia','2 to 5 Years','Belgian Shepherd Dog: Malinois (BSD)','Female','London (Harefield)',False,True,5,True,,Crossbreed,,,,,Live wire,,Especially likes cuddles,Loves toys/games,Needs some training,,,Bright spark/very clever,Loves treats/will work for food,,False
4,'Alby','1 to 2 Years','Lurcher ','Male','Shoreham',False,True,3,True,,Crossbreed,,Gentle giant,,,,,,Loves toys/games,Needs some training,,Stranger friendly,,Loves treats/will work for food,,True


None of the non-icon columns has missing values, which is great! The icon features don't all have the full 948 entries, but each has only one unique value. This is because they are really boolean values, e.g. 'icon_childFriendly' is effectively True for every instance with an entry and False for every N/A. The N/As arise because if a dog didn't have a particular attribute, it simply wasn't listed on the site at all. I'm going to change all the icons to boolean variables.

### Changing the icon features to boolean variables

In [76]:
#Make sure to only run this cell once or else you'll end up with everything reading as true!

iconColumnNames = ['icon_childFriendly', 'icon_crossBreed', 'icon_dogFriendly',
       'icon_gentleGiant', 'icon_hasBasicHousetraining',
       'icon_hasMedicalNeeds', 'icon_livewire', 'icon_livingOffsite',
       'icon_lovesCuddles', 'icon_lovesToysGames', 'icon_needsTraining',
       'icon_smallButSparky', 'icon_strangerFriendly', 'icon_veryClever',
                                          'icon_willWorkForFood', 'icon_youngAtHeart']

#Change all icon entries to True and all N/A to False to convert to boolean variables.
#Ignore the warning, I am indeed changing values on the copy of the original dataset.
for iconsTitles in iconColumnNames:
    shelterDogsData.loc[shelterDogsData[iconsTitles].notnull(), iconsTitles] = True
    shelterDogsData.loc[shelterDogsData[iconsTitles].isnull(), iconsTitles] = False

In [77]:
#Check that all bools have been changed correctly
shelterDogsData.dtypes

Unnamed: 0,0
name,object
age,object
breed,object
gender,object
centreLocation,object
isNewDog,bool
hasPhoto,bool
numberOfPhotos,int64
hasVideo,bool
icon_childFriendly,object
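
The dtype listing above still reports `icon_childFriendly` as `object`, which is what you'd expect when `True`/`False` are written through `.loc` into a column that originally held strings. Here is a minimal sketch of an alternative that produces a real `bool` dtype. The frame `toy` and its values are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two icon columns: a label where the icon was listed, NaN where it was not.
toy = pd.DataFrame({
    "icon_childFriendly": ["Child friendly (under 12 yrs)", np.nan, np.nan],
    "icon_crossBreed": [np.nan, "Crossbreed", "Crossbreed"],
})

# notna() gives True where a label exists and False where it is missing,
# and the resulting columns have dtype bool rather than object.
toy_bool = toy.notna()

print(toy_bool.dtypes)
print(toy_bool)
```

Writing the result to a new object (`toy_bool` here) rather than overwriting the source columns also sidesteps the run-once caveat, since the raw labels are still there if the cell is run again.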


Let's have a look at what else we can change.

In [78]:
shelterDogsData.describe(include='all')

Unnamed: 0,name,age,breed,gender,centreLocation,isNewDog,hasPhoto,numberOfPhotos,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,reserved
count,948,948,948,948,948,948,948,948.0,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948
unique,648,6,83,2,20,2,2,,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
top,'Charlie','2 to 5 Years','A Crossbreed ','Male','Kenilworth',False,True,,False,False,True,False,False,False,False,False,False,False,True,True,False,False,False,True,False,False
freq,8,381,133,581,128,581,892,,740,847,619,563,880,664,915,518,876,745,491,619,817,678,686,537,863,672
mean,,,,,,,,3.408228,,,,,,,,,,,,,,,,,,
std,,,,,,,,1.681948,,,,,,,,,,,,,,,,,,
min,,,,,,,,0.0,,,,,,,,,,,,,,,,,,
25%,,,,,,,,2.0,,,,,,,,,,,,,,,,,,
50%,,,,,,,,4.0,,,,,,,,,,,,,,,,,,
75%,,,,,,,,5.0,,,,,,,,,,,,,,,,,,


I'm going to explore whether there are any very popular names which might have some correlation to the target variable.

### Exploring whether keeping the names is worth it

In [79]:
Counter(shelterDogsData['name']).most_common()[:10]

[("'Charlie'", 8),
 ("'Poppy'", 8),
 ("'Benji'", 7),
 ("'Bella'", 7),
 ("'Barney'", 7),
 ("'Bailey'", 7),
 ("'Lola'", 7),
 ("'Max'", 7),
 ("'Buddy'", 6),
 ("'Frankie'", 6)]

The most popular names don't cover many dogs: even the top names, Charlie and Poppy, appear only 8 times each among 948 dogs, so there doesn't seem to be any point keeping any. With 648 unique names, I'm going to drop this column.

In [80]:
shelterDogsData.drop(['name'], axis = 'columns', inplace=True)






In [81]:
shelterDogsData.head()

Unnamed: 0,age,breed,gender,centreLocation,isNewDog,hasPhoto,numberOfPhotos,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,reserved
0,'2 to 5 Years','Staffordshire Bull Terrier (SBT)','Female','West Calder',True,True,3,False,True,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False
1,'5 to 7 Years','Shih Tzu ','Male','Shoreham',False,True,3,False,False,False,False,False,False,True,False,False,False,False,True,True,True,False,False,False,False
2,'5 to 7 Years','A Crossbreed ','Male','Merseyside',False,True,5,True,False,True,False,False,True,False,True,False,False,True,True,False,False,True,False,False,False
3,'2 to 5 Years','Belgian Shepherd Dog: Malinois (BSD)','Female','London (Harefield)',False,True,5,True,False,True,False,False,False,False,True,False,True,True,True,False,False,True,True,False,False
4,'1 to 2 Years','Lurcher ','Male','Shoreham',False,True,3,True,False,True,False,True,False,False,False,False,False,True,True,False,True,False,True,False,True


### Exploring whether to keep the breeds variable

In [82]:
Counter(shelterDogsData['breed']).most_common()

[("'A Crossbreed '", 133),
 ("'Lurcher '", 109),
 ("'Collie Cross (Border)'", 61),
 ("'Jack Russell Terrier (JRT)'", 58),
 ("'Border Collie '", 52),
 ("'Staffordshire Cross (SBT)'", 46),
 ("'German Shepherd Dog (GSD / Alsatian)'", 43),
 ("'Staffordshire Bull Terrier (SBT)'", 41),
 ("'Terrier Cross'", 38),
 ("'Lab Cross'", 37),
 ("'Greyhound  '", 36),
 ("'Beagle  '", 16),
 ("'Siberian Husky  '", 15),
 ("'Akita'", 13),
 ("'Patterdale Terrier'", 13),
 ("'Rottweiler '", 12),
 ("'Pug '", 11),
 ("'Boxer  '", 11),
 ("'Terrier: Yorkshire '", 10),
 ("''", 10),
 ("'Chihuahua: Short Hr'", 9),
 ("'Harrier'", 9),
 ("'American Bulldog'", 9),
 ("'Shih Tzu  '", 8),
 ("'Bulldog: French'", 8),
 ("'Cocker Spaniel'", 7),
 ("'Labrador'", 7),
 ("'Spaniel: English Springer'", 7),
 ("'Bichon Frise '", 7),
 ("'Bulldog: English'", 6),
 ("'Saluki '", 6),
 ("'Belgian Shepherd Dog: Malinois (BSD)'", 5),
 ("'Dobermann  '", 5),
 ("'Shar-Pei  '", 5),
 ("'Spaniel Cross'", 4),
 ("'Chihuahua: Long Hr '", 4),
 ("'Whippet

There seem to be a lot of different types of terriers with just a few dogs, so I'm going to combine these into a purebred terrier category.

In [83]:
for index, breed in enumerate(shelterDogsData['breed']):
    if "Terrier" in breed and "Terrier Cross" not in breed:
        shelterDogsData.loc[index, 'breed'] = "Terrier Purebreed"
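
The same grouping can also be done without an explicit loop. This is a sketch on a small made-up Series (`breeds` is hypothetical, using the raw quoted format seen in the listing above), not a replacement the notebook actually ran.

```python
import pandas as pd

breeds = pd.Series([
    "'Jack Russell Terrier (JRT)'",
    "'Terrier Cross'",
    "'Lurcher '",
    "'Patterdale Terrier'",
])

# True for any breed mentioning "Terrier", except the existing "Terrier Cross" group.
is_purebred_terrier = (
    breeds.str.contains("Terrier", regex=False)
    & ~breeds.str.contains("Terrier Cross", regex=False)
)

# mask() replaces values where the condition is True and keeps the rest unchanged.
grouped = breeds.mask(is_purebred_terrier, "Terrier Purebreed")
print(grouped.tolist())
```

Because the condition is computed on the whole column at once, this version does not rely on the dataframe's index matching the positions that `enumerate` produces.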

In [84]:
Counter(shelterDogsData['breed']).most_common()

[('Terrier Purebreed', 138),
 ("'A Crossbreed '", 133),
 ("'Lurcher '", 109),
 ("'Collie Cross (Border)'", 61),
 ("'Border Collie '", 52),
 ("'Staffordshire Cross (SBT)'", 46),
 ("'German Shepherd Dog (GSD / Alsatian)'", 43),
 ("'Terrier Cross'", 38),
 ("'Lab Cross'", 37),
 ("'Greyhound  '", 36),
 ("'Beagle  '", 16),
 ("'Siberian Husky  '", 15),
 ("'Akita'", 13),
 ("'Rottweiler '", 12),
 ("'Pug '", 11),
 ("'Boxer  '", 11),
 ("''", 10),
 ("'Chihuahua: Short Hr'", 9),
 ("'Harrier'", 9),
 ("'American Bulldog'", 9),
 ("'Shih Tzu  '", 8),
 ("'Bulldog: French'", 8),
 ("'Cocker Spaniel'", 7),
 ("'Labrador'", 7),
 ("'Spaniel: English Springer'", 7),
 ("'Bichon Frise '", 7),
 ("'Bulldog: English'", 6),
 ("'Saluki '", 6),
 ("'Belgian Shepherd Dog: Malinois (BSD)'", 5),
 ("'Dobermann  '", 5),
 ("'Shar-Pei  '", 5),
 ("'Spaniel Cross'", 4),
 ("'Chihuahua: Long Hr '", 4),
 ("'Whippet'", 4),
 ("'Foxhound  '", 4),
 ("'Pointer: English'", 4),
 ("'Dachshund: Std Smooth Hr '", 3),
 ("'Dogue De Bordeaux'"

In [85]:
#Calculate the fraction of dogs (between 0 and 1) which fall into the most popular breeds
def percentage_in_most_popular(data, number_of_breeds_included):
    #Use the dataframe passed in rather than the global one
    breeds_counted = Counter(data["breed"])
    list_of_n_most_common_breeds = breeds_counted.most_common()[:number_of_breeds_included]
    number_of_animals_in_breeds = sum(x[1] for x in list_of_n_most_common_breeds)
    total_no_dogs = data["breed"].count()
    percentage_covered = number_of_animals_in_breeds/total_no_dogs
    return percentage_covered

In [86]:
percentage_covered_by_number_of_breeds_array = []
for breeds in np.arange(shelterDogsData["breed"].nunique()+1):
    percentage_covered_by_number_of_breeds_array.append(percentage_in_most_popular(shelterDogsData, breeds))

In [87]:
trace = go.Scatter(x=np.arange(shelterDogsData["breed"].nunique()+1),
                   y=percentage_covered_by_number_of_breeds_array,
                  line = dict(color='#f242f5'))
data = [trace]
layout = go.Layout(title="Percentage of dogs covered by number of different breeds", xaxis=dict(title="Number of breeds"), yaxis=dict(title="Percentage"))

fig = go.Figure(data=data, layout=layout)

fig.show()
# py.iplot(fig, filename="covered")

~73% of dogs are covered by the top 10 breeds. This seems a reasonable number of breeds to keep, so I'll mark all the other breeds as 'other' and then convert the variable to a categorical variable.
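
The 73% figure can be checked by hand from the counts printed earlier: the ten most common breeds after the terrier merge account for 693 of the 948 dogs. A small sketch of that arithmetic, with the counts copied from the output above:

```python
# Counts of the ten most common breeds after the terrier merge, copied from the output above.
top_ten_counts = [138, 133, 109, 61, 52, 46, 43, 38, 37, 36]
total_dogs = 948  # row count reported by describe()

covered = sum(top_ten_counts)
coverage = covered / total_dogs

print(covered, round(100 * coverage, 1))
```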

In [88]:
#Find the 10 most common breeds once, before anything is relabelled
#(recomputing them inside the loop would let 'other' itself enter the top 10 as it grows)
#then change every other breed value to 'other'

top_ten_breeds = [x[0] for x in Counter(shelterDogsData["breed"]).most_common()[:10]]

dog_breeds = shelterDogsData["breed"].unique()
number_of_dog_breeds = shelterDogsData["breed"].nunique()

for breed in dog_breeds:
    if breed not in top_ten_breeds:
        breed_indexes = shelterDogsData.loc[shelterDogsData["breed"] == breed].index
        shelterDogsData.loc[breed_indexes.values, "breed"] = "other"

In [89]:
shelterDogsData.describe(include='all')

Unnamed: 0,age,breed,gender,centreLocation,isNewDog,hasPhoto,numberOfPhotos,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,reserved
count,948,948,948,948,948,948,948.0,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948,948
unique,6,11,2,20,2,2,,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
top,'2 to 5 Years',other,'Male','Kenilworth',False,True,,False,False,True,False,False,False,False,False,False,False,True,True,False,False,False,True,False,False
freq,381,255,581,128,581,892,,740,847,619,563,880,664,915,518,876,745,491,619,817,678,686,537,863,672
mean,,,,,,,3.408228,,,,,,,,,,,,,,,,,,
std,,,,,,,1.681948,,,,,,,,,,,,,,,,,,
min,,,,,,,0.0,,,,,,,,,,,,,,,,,,
25%,,,,,,,2.0,,,,,,,,,,,,,,,,,,
50%,,,,,,,4.0,,,,,,,,,,,,,,,,,,
75%,,,,,,,5.0,,,,,,,,,,,,,,,,,,


Note: if I was doing a more detailed exploration, I might want to explore this even further and look at the list of breeds more closely to see if any were similar that we could lump together or if any were different that we could separate.

In [90]:
shelterDogsData.head()

Unnamed: 0,age,breed,gender,centreLocation,isNewDog,hasPhoto,numberOfPhotos,hasVideo,icon_childFriendly,icon_crossBreed,icon_dogFriendly,icon_gentleGiant,icon_hasBasicHousetraining,icon_hasMedicalNeeds,icon_livewire,icon_livingOffsite,icon_lovesCuddles,icon_lovesToysGames,icon_needsTraining,icon_smallButSparky,icon_strangerFriendly,icon_veryClever,icon_willWorkForFood,icon_youngAtHeart,reserved
0,'2 to 5 Years',Terrier Purebreed,'Female','West Calder',True,True,3,False,True,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False
1,'5 to 7 Years',other,'Male','Shoreham',False,True,3,False,False,False,False,False,False,True,False,False,False,False,True,True,True,False,False,False,False
2,'5 to 7 Years','A Crossbreed ','Male','Merseyside',False,True,5,True,False,True,False,False,True,False,True,False,False,True,True,False,False,True,False,False,False
3,'2 to 5 Years',other,'Female','London (Harefield)',False,True,5,True,False,True,False,False,False,False,True,False,True,True,True,False,False,True,True,False,False
4,'1 to 2 Years','Lurcher ','Male','Shoreham',False,True,3,True,False,True,False,True,False,False,False,False,False,True,True,False,True,False,True,False,True


### Exploring whether to keep the centre location, age and gender variables

In [91]:
Counter(shelterDogsData['centreLocation']).most_common()

[("'Kenilworth'", 128),
 ("'Loughborough'", 65),
 ("'Merseyside'", 63),
 ("'London (Harefield)'", 60),
 ("'Basildon'", 58),
 ("'Evesham'", 54),
 ("'Shrewsbury'", 53),
 ("'Salisbury'", 49),
 ("'Bridgend'", 47),
 ("'Leeds'", 43),
 ("'West Calder'", 40),
 ("'Shoreham'", 40),
 ("'Newbury'", 36),
 ("'Manchester'", 35),
 ("'Glasgow'", 34),
 ("'Snetterton'", 33),
 ("'Darlington'", 30),
 ("'Canterbury'", 30),
 ("'Ilfracombe'", 25),
 ("'Ballymena (N.Ireland)'", 25)]

In [92]:
Counter(shelterDogsData['age']).most_common()

[("'2 to 5 Years'", 381),
 ("'5 to 7 Years'", 203),
 ("'8+ Years'", 183),
 ("'1 to 2 Years'", 117),
 ("'6 to 12 Months'", 41),
 ("'0 to 6 Months'", 23)]

There seem to be a reasonable number of locations and ages, so I'll keep them all and convert these to categorical variables later on when I need them for models. I am going to remove the apostrophes that encase the values in the location, age, gender and breed columns though, along with any stray whitespace.

In [93]:
columnsToRemoveApostrophesFrom = ['centreLocation', 'age', 'gender', 'breed']

for columns in columnsToRemoveApostrophesFrom:
    for index, locations in enumerate(shelterDogsData[columns]):
        shelterDogsData.loc[index, columns] = shelterDogsData.loc[index, columns].replace("'","").strip()
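
The same clean-up can be written with pandas' vectorised string methods instead of a row-by-row loop. This is a sketch on a made-up frame (`toy` is hypothetical), assuming every entry in the column is a string:

```python
import pandas as pd

toy = pd.DataFrame({
    "breed": ["'Lurcher '", "'Greyhound  '"],
    "gender": ["'Male'", "'Female'"],
})

for column in ["breed", "gender"]:
    # .str applies the string method to every entry in the column at once.
    toy[column] = toy[column].str.replace("'", "", regex=False).str.strip()

print(toy)
```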

## Data Exploration

Let's take a quick look at the target (whether the dog is reserved or available on the website) distribution.

In [94]:
#See how many dogs are reserved and how many are available
reservedStats = shelterDogsData['reserved'].value_counts()
reservedStats

Unnamed: 0_level_0,count
reserved,Unnamed: 1_level_1
False,672
True,276


In [162]:
colours = ['#42daf5', '#a4f542', '#39a127', '#138c9c']

trace = go.Pie(labels = ["Available", "Reserved"], values = reservedStats.values, marker=dict(colors=colours), textinfo='label+percent')
layout = go.Layout(title="Distribution of reserved dogs")

fig_overall_reserved_dogs = go.Figure(data = [trace],
                layout=layout,
               )
fig_overall_reserved_dogs.show()

### Let's see if the breed has any correlation to whether the dog is reserved.

In [96]:
groupedByBreed = shelterDogsData.groupby(['breed', 'reserved']).size()
dogs_status_breeds = pd.DataFrame(groupedByBreed.reset_index())

dogs_status_breeds

Unnamed: 0,breed,reserved,0
0,A Crossbreed,False,98
1,A Crossbreed,True,35
2,Border Collie,False,39
3,Border Collie,True,13
4,Collie Cross (Border),False,46
5,Collie Cross (Border),True,15
6,German Shepherd Dog (GSD / Alsatian),False,28
7,German Shepherd Dog (GSD / Alsatian),True,15
8,Greyhound,False,31
9,Greyhound,True,5


In [157]:
numberOfReservedDogsForEachBreed = dogs_status_breeds.loc[dogs_status_breeds['reserved'] == True][0]
numberOfAvailableDogsForEachBreed = dogs_status_breeds.loc[dogs_status_breeds['reserved'] == False][0]

trace1 = go.Bar(
    x=dogs_status_breeds['breed'].unique(),
    y=numberOfReservedDogsForEachBreed,
    name='Reserved',
    marker=dict(
        color=colours[1],
    )
)
trace2 = go.Bar(
    x=dogs_status_breeds['breed'].unique(),
    y=numberOfAvailableDogsForEachBreed,
    name='Available',
    marker=dict(
        color=colours[0],
    )
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group',
    title = 'Number of available and reserved dogs of different breeds',
    xaxis = dict(title="Breed", tickfont=dict(
            size=12,
        ),tickangle=15),
    yaxis = dict(title="Number of dogs")
)

fig_breeds_reserved = go.Figure(data=data, layout=layout)
fig_breeds_reserved.show()

In [160]:
numberOfDogsInEachBreed = dogs_status_breeds.groupby('breed').sum()[0]
percentageOfReservedDogsForEachBreed = (dogs_status_breeds.loc[dogs_status_breeds['reserved'] == True][0]).values/(dogs_status_breeds.groupby('breed').sum()[0]).values
percentageOfAvailableDogsForEachBreed = (dogs_status_breeds.loc[dogs_status_breeds['reserved'] == False][0]).values/(dogs_status_breeds.groupby('breed').sum()[0]).values

numberOfUniqueBreeds = dogs_status_breeds['breed'].nunique()
arrayOfNumberOfUniqueBreeds = np.arange(numberOfUniqueBreeds)

arrayOfAverageReservedPercentage = [reservedStats.values[1]/reservedStats.sum()]*numberOfUniqueBreeds
arrayOfAverageAvailablePercentage = [reservedStats.values[0]/reservedStats.sum()]*numberOfUniqueBreeds

trace1 = go.Bar(
    x=arrayOfNumberOfUniqueBreeds,
    y=percentageOfReservedDogsForEachBreed,
    name='Reserved',
    marker=dict(
        color=colours[1],
    )
)
trace2 = go.Bar(
    x=arrayOfNumberOfUniqueBreeds,
    y=percentageOfAvailableDogsForEachBreed,
    name='Available',
    marker=dict(
        color=colours[0],
    )
)

trace3 = go.Scatter(
    x=arrayOfNumberOfUniqueBreeds,
    y=arrayOfAverageReservedPercentage,
    name='Average percentage of reserved dogs across all breeds',
    marker=dict(
        color=colours[2],
    ),
    mode='lines'
)

trace4 = go.Scatter(
    x=arrayOfNumberOfUniqueBreeds,
    y=arrayOfAverageAvailablePercentage,
    name='Average percentage of available dogs across all breeds',
    marker=dict(
        color=colours[3],
    ),
    mode='lines'
)

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    barmode='group',
    title = 'Percentage of available and reserved dogs of different breeds',
    xaxis = dict(title="Breed", tickvals=arrayOfNumberOfUniqueBreeds, ticktext = dogs_status_breeds['breed'].unique(), tickfont=dict(
            size=8,
        ),tickangle=20),

    yaxis = dict(title="Percentage of dogs")
)

fig_breeds_percentage_reserved = go.Figure(data=data, layout=layout)
fig_breeds_percentage_reserved.show()

The breed of dog does seem to have some bearing on whether they're available or reserved. For example, if you're a greyhound, you're less likely to be reserved (only 5 of the 36 greyhounds are, about 14% against roughly 29% of all dogs), and if you're a lab cross, you have the highest chance of being reserved (even though you're still more likely to be available).
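
For a quick numeric version of this comparison, a row-normalised cross-tabulation gives the share of reserved and available dogs per breed in one call. The sketch below uses made-up rows (`toy` and the `status` column are hypothetical), so the shares it prints are not the real ones.

```python
import pandas as pd

toy = pd.DataFrame({
    "breed": ["Greyhound", "Greyhound", "Greyhound", "Greyhound", "Lab Cross", "Lab Cross"],
    "reserved": [False, False, False, True, True, False],
})

# Readable column labels instead of True/False.
toy["status"] = toy["reserved"].map({True: "Reserved", False: "Available"})

# normalize="index" divides each row by its total, giving per-breed shares.
shares = pd.crosstab(toy["breed"], toy["status"], normalize="index")
print(shares)
```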

### Let's now see whether any of the icons look like they have a significant bearing on whether the dogs are available or reserved

In [99]:
Counter(shelterDogsData[iconColumnNames])

Counter({'icon_childFriendly': 1,
         'icon_crossBreed': 1,
         'icon_dogFriendly': 1,
         'icon_gentleGiant': 1,
         'icon_hasBasicHousetraining': 1,
         'icon_hasMedicalNeeds': 1,
         'icon_livewire': 1,
         'icon_livingOffsite': 1,
         'icon_lovesCuddles': 1,
         'icon_lovesToysGames': 1,
         'icon_needsTraining': 1,
         'icon_smallButSparky': 1,
         'icon_strangerFriendly': 1,
         'icon_veryClever': 1,
         'icon_willWorkForFood': 1,
         'icon_youngAtHeart': 1})
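
Every count above is 1 because iterating over a DataFrame yields its column labels, so `Counter` is tallying the column names rather than the values inside them. To count how many dogs have each icon, one option is to sum the boolean columns. This is a sketch assuming the columns hold `True`/`False`, shown on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "icon_dogFriendly": [True, False, True, True],
    "icon_gentleGiant": [False, False, True, False],
})

# Summing a boolean column counts its True entries (True is treated as 1).
listed_counts = toy.sum()
not_listed_counts = len(toy) - listed_counts

print(listed_counts.to_dict())
print(not_listed_counts.to_dict())
```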

In [100]:
yValuesTrue = []
for index, columns in enumerate(iconColumnNames):
     yValuesTrue.append(Counter(shelterDogsData[iconColumnNames[index]])[True])

yValuesFalse = []
for index, columns in enumerate(iconColumnNames):
     yValuesFalse.append(Counter(shelterDogsData[iconColumnNames[index]])[False])


In [164]:
trace1 = go.Bar(
    x=iconColumnNames,
    y=yValuesTrue,
    name='True',
    marker=dict(
        color=colours[1],
    )
)

trace2 = go.Bar(
    x=iconColumnNames,
    y=yValuesFalse,
    name='False',
    marker=dict(
        color='#e30f0b',
    )
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title = 'Traits of the dogs',
    xaxis = dict(title="Traits", tickfont=dict(
            size=12,
        ),tickangle=25),
    yaxis = dict(title="Number of dogs")
)

fig_traits = go.Figure(data=data, layout=layout)
fig_traits.show()

Note: 'False' for these traits doesn't necessarily mean false, it just means it wasn't listed.

In [102]:
iconsTrueReserved = []
iconsTrueAvailable = []
iconsFalseReserved = []
iconsFalseAvailable = []

for icons in iconColumnNames:
    iconsTrueReserved.append(shelterDogsData[shelterDogsData[icons]==True][[icons, 'reserved']].groupby('reserved').size()[True])
    iconsTrueAvailable.append(shelterDogsData[shelterDogsData[icons]==True][[icons, 'reserved']].groupby('reserved').size()[False])
    iconsFalseReserved.append(shelterDogsData[shelterDogsData[icons]==False][[icons, 'reserved']].groupby('reserved').size()[True])
    iconsFalseAvailable.append(shelterDogsData[shelterDogsData[icons]==False][[icons, 'reserved']].groupby('reserved').size()[False])



In [165]:
# initialize notebook for offline plotting
# init_notebook_mode()

# Set initial slider/title index
start_index = 0

# Build all traces with visible=False

trace1 = []

for index, icon in enumerate(iconColumnNames):
    trace1.append(go.Pie(
           visible = False,
           labels = ["Available", "Reserved"],
           textinfo='label+percent',
           textfont=dict(size=8),
           marker=dict(colors=colours),
           domain=dict(x=[0, yValuesTrue[index]/(yValuesTrue[index]+yValuesFalse[index])]),
           title = "Dog is " + iconColumnNames[index][5:],
           values = [iconsTrueAvailable[index], iconsTrueReserved[index]]))

for index, icon in enumerate(iconColumnNames):
    trace1.append(go.Pie(
           visible = False,
           textfont=dict(size=8),
           labels = ["Available", "Reserved"],
           textinfo='label+percent',
           domain=dict(x=[yValuesTrue[index]/(yValuesTrue[index]+yValuesFalse[index]), 1]),
           title = "Dog is not " + iconColumnNames[index][5:],
           marker=dict(colors=colours),
           values = [iconsFalseAvailable[index], iconsFalseReserved[index]]))

# Make initial trace visible
trace1[start_index]['visible'] = True
trace1[start_index+len(iconColumnNames)]['visible']=True

# Build slider steps
steps = []
for i in range(len(iconColumnNames)):
    step = dict(
        # Update method allows us to update both trace and layout properties
        method = 'update',
        args = [
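            # Show the ith "Dog is ..." pie and its matching "Dog is not ..." pie,
            # which sits len(iconColumnNames) places later in the trace list
            {'visible': [t % len(iconColumnNames) == i for t in np.arange(len(trace1))]},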
            ],
            label = iconColumnNames[i][5:]
    )
    steps.append(step)

# Build sliders
sliders = [go.layout.Slider(
    currentvalue = {"prefix": "Icon label: "},
    steps = steps,

)]

layout = go.Layout(
    sliders=sliders,
    title={'text': "Proportion of reserved and available dogs according to their different attributes", 'font':dict(
                size=14,
            )},
)

fig_traits_slider = go.Figure(data=trace1, layout=layout)

fig_traits_slider.show()

### Are any of the traits highly correlated with each other?

In [104]:
import seaborn as sns
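
The cell above only imports seaborn, and no correlation output appears in this section. One way to answer the question is to cast the boolean icon columns to 0/1 and compute pairwise correlations, which for two binary columns is the phi coefficient. The sketch below uses made-up data (`toy` is hypothetical); the resulting matrix could then be drawn with `sns.heatmap`.

```python
import pandas as pd

toy = pd.DataFrame({
    "icon_livewire":      [True, True, False, False, True, False],
    "icon_needsTraining": [True, True, False, False, True, True],
    "icon_gentleGiant":   [False, False, True, True, False, True],
})

# Pearson correlation on 0/1 columns is the phi coefficient between two binary traits.
trait_correlations = toy.astype(int).corr()

print(trait_correlations.round(2))
```

Values near 1 or -1 would flag pairs of icons that carry almost the same information, which matters if both are later fed to a model.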

## Taking a peek at age, gender and centre locations

In [105]:
agesForReservedDogs = shelterDogsData[shelterDogsData['reserved'] == True][['age', 'reserved']].groupby('age').size()
agesForAvailableDogs = shelterDogsData[shelterDogsData['reserved'] == False][['age', 'reserved']].groupby('age').size()

In [106]:
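#Ages have been automatically sorted alphabetically. Reorder so the position of 6 to 12 months category
#is moved into second place, where it would appear chronologically.
#Both series need the same reordering, otherwise the stacked bars and percentages won't line up with the x axis labels.
ageOrder = [0,4,1,2,3,5]
agesForReservedDogs = [agesForReservedDogs.iloc[i] for i in ageOrder]
agesForAvailableDogs = [agesForAvailableDogs.iloc[i] for i in ageOrder]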


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`



In [166]:
#Need to sort the age categories labels because the groupby methods above automatically do this,
#then re-order to correct chronological order like above, so the x axis labels align with correct bars
ageCategories = [sorted(shelterDogsData['age'].unique())[i] for i in ageOrder]

trace1 = go.Bar(
    x=ageCategories,
    y=agesForReservedDogs,
    name='Reserved',
    marker=dict(
        color=colours[1],
    )
)

trace2 = go.Bar(
    x=ageCategories,
    y=agesForAvailableDogs,
    name='Available',
    marker=dict(
        color=colours[0],
    )
)

annotations = []
for i in np.arange(len(ageCategories)):
            annotations.append(dict(
                x=i,
                y=agesForReservedDogs[i]+agesForAvailableDogs[i] + 10,
               # xref='x',
               # yref='y',
                text=str(int(round(100*(agesForReservedDogs[i]/(agesForReservedDogs[i]+agesForAvailableDogs[i]))))) + '% reserved',
                showarrow=True,
                arrowhead=7,
              #  ax=0,
              #  ay=-40
            ))

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title = 'Comparing the number of dogs in different age categories',
    xaxis = dict(title="Age", tickfont=dict(
            size=12,
        ),tickangle=25),
    yaxis = dict(title="Number of dogs"),
    annotations=annotations
)

fig_age_reserved = go.Figure(data=data, layout=layout)
fig_age_reserved.show()


Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`





There are lots of interesting things shown in this chart, leaving many questions that could be investigated.

The smallest group is young pups below 6 months, which have the highest percentage of reservations - this makes perfect sense because lots of people want pups.

Between the 6 month mark and 7 years, there's a rise then fall in numbers, but in contrast a fall then rise in the percentage of reserved animals. My theory is that this may be because people often prefer either a younger dog or an older trained one. The dogs at 1-2 years struggle the most because they're out of the adorable puppy stage, but they're not into the trained adult stage yet.

Senior dogs after 8 years also have a low reservation percentage, and the number of these dogs is much higher than their neightbour category 5-7 years. This is sad, but not surprising - it's not uncommon knowledge that older dogs are often overlooked in shelters.
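
The percentages annotated on the chart can also be computed directly as a table. The following is only a minimal sketch on made-up data: `toyDogs` and its values are hypothetical and not taken from the shelter dataset. It relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the shelter dataset)
toyDogs = pd.DataFrame({
    'age': ['0 to 6 Months', '0 to 6 Months', '1 to 2 Years',
            '1 to 2 Years', '1 to 2 Years', '8+ Years'],
    'reserved': [True, True, False, True, False, False],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of reserved dogs in each age category
percentReservedByAge = (toyDogs.groupby('age')['reserved'].mean() * 100).round(1)
print(percentReservedByAge)
```

Applied to the real data, the same `groupby` on `shelterDogsData` should give the figures shown above each bar, assuming `reserved` is stored as a boolean column.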

In [108]:
numberOfFemaleDogs=shelterDogsData.groupby('gender').size()['Female']
numberOfMaleDogs=shelterDogsData.groupby('gender').size()['Male']
print("There are currently %d female dogs and %d male dogs." %(numberOfFemaleDogs,numberOfMaleDogs))

There are currently 367 female dogs and 581 male dogs.


### Out of the dogs who are reserved, how many are each gender?

In [109]:
numberOfReservedFemales = shelterDogsData[shelterDogsData['reserved']==True][['gender', 'reserved']].groupby('gender').size()['Female']
numberOfReservedMales = shelterDogsData[shelterDogsData['reserved']==True][['gender', 'reserved']].groupby('gender').size()['Male']
numberOfUneservedFemales = shelterDogsData[shelterDogsData['reserved']==False][['gender', 'reserved']].groupby('gender').size()['Female']
numberOfUneservedMales = shelterDogsData[shelterDogsData['reserved']==False][['gender', 'reserved']].groupby('gender').size()['Male']


In [110]:
coloursFM =['#f542da','#3a22f0']

trace1 = go.Pie(labels = ["Female", "Male"],
                values = [numberOfReservedFemales, numberOfReservedMales],
                marker=dict(colors=coloursFM),
                textinfo='label+percent',
                title = "Out of reserved dogs, how many are which gender?",
                domain =dict(x=[0,0.5],y=[0.5,1]))

trace2 = go.Pie(labels = ["Female", "Male"],
                values = [numberOfUneservedFemales, numberOfUneservedMales],
                marker=dict(colors=coloursFM),
                textinfo='label+percent',
                title = "Out of unreserved dogs, how many are which gender?",
                domain=dict(x=[0.5,1],y=[0.5,1]))

trace3 = go.Pie(labels = ["Reserved", "Unreserved"],
                values = [numberOfReservedFemales, numberOfUneservedFemales],
                marker=dict(colors=colours),
                textinfo='label+percent',
                title = "Out of females, how many are reserved?",
                domain =dict(x=[0,0.5],y=[0,0.5]))

trace4 = go.Pie(labels = ["Reserved", "Unreserved"],
                values = [numberOfReservedMales, numberOfUneservedMales],
                marker=dict(colors=colours),
                textinfo='label+percent',
                title = "Out of males, how many are reserved?",
                domain =dict(x=[0.5,1],y=[0,0.5]))

layout = go.Layout(title="Distribution of genders for reserved animals")

fig = go.Figure(data = [trace1, trace2, trace3, trace4],
                layout=layout,
               )
fig.show()

Comparing the pie charts, there's no indication that gender plays a part in whether a dog is reserved for adoption or not, but I'll leave the feature in the model for now in case it provides any insight when I begin to train.
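
The same comparison can be read off a single normalised cross-tabulation instead of four pie charts. This is a sketch on made-up data (`toyDogs` and the `status` column are hypothetical names); `normalize='index'` makes each row sum to 1, so each row shows the reserved/available split within one gender.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the shelter dataset)
toyDogs = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Male'],
    'reserved': [True, False, True, False, False, False, True, False],
})

# Use readable string labels for the columns of the table
toyDogs['status'] = toyDogs['reserved'].map({True: 'Reserved', False: 'Available'})

# Rows are genders, columns are statuses; each row sums to 1
reservedShareByGender = pd.crosstab(toyDogs['gender'], toyDogs['status'], normalize='index')
print(reservedShareByGender)
```

If the two rows look similar, as the pie charts suggest for the real data, gender is unlikely to be a strong predictor on its own.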

### Exploring centre locations

In [111]:
locationsReserved = shelterDogsData[shelterDogsData['reserved']==True][['centreLocation', 'reserved']].groupby('centreLocation').size()
locationsUnreserved = shelterDogsData[shelterDogsData['reserved']==False][['centreLocation', 'reserved']].groupby('centreLocation').size()


In [112]:
locationNames = sorted(shelterDogsData.centreLocation.unique())

In [113]:
trace1 = go.Bar(
    x=locationNames,
    y=locationsReserved,
    name='Reserved',
    marker=dict(
        color=colours[0],
    )
)

trace2 = go.Bar(
    x=locationNames,
    y=locationsUnreserved,
    name='Available',
    marker=dict(
        color= colours[1],
    )
)

#Convert to plain arrays so that [i] is always positional, whatever the index labels are
locationsReservedCounts = np.asarray(locationsReserved)
locationsUnreservedCounts = np.asarray(locationsUnreserved)

#gather list to use in following graph
percentageOfReservedDogsAtLocation = []
for i in range(len(locationNames)):
    totalAtThisLocation = locationsReservedCounts[i] + locationsUnreservedCounts[i]
    percentageReserved = int(round(100 * locationsReservedCounts[i] / totalAtThisLocation))
    percentageOfReservedDogsAtLocation.append(percentageReserved)

annotations = []
for i in range(len(locationNames)):
    annotations.append(dict(
        x=i,
        y=locationsReservedCounts[i],
        text=str(percentageOfReservedDogsAtLocation[i]) + "%",
        showarrow=True,
        arrowhead=1,
        ax=0,
    ))

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title = 'Number of dogs at each centre location, annotated with the percentage of dogs reserved there',
    xaxis = dict(title="Centre location", tickfont=dict(
            size=12,
        ),tickangle=25),
    yaxis = dict(title="Number of dogs"),
    annotations=annotations
)

fig = go.Figure(data=data, layout=layout)
fig.show()


It looks as though the number of dogs at a centre correlates with the percentage of reserved dogs there, but it's difficult to tell from this chart, so I'm going to investigate that a bit more.

In [114]:
groupedByLocation = shelterDogsData.groupby(['centreLocation', 'reserved']).size()
dogsStatusLocation = pd.DataFrame(groupedByLocation.reset_index())

In [123]:
#Total number of dogs at each location (reserved + unreserved), repeated on every row for that location.
#transform('sum') does this in one step and avoids chained assignment, which stops working in pandas 3.0
dogsStatusLocation['totalAtLocation'] = dogsStatusLocation.groupby('centreLocation')[0].transform('sum')

In [124]:
dogsStatusLocation.head()

Unnamed: 0,centreLocation,reserved,0,totalAtLocation
0,Ballymena (N.Ireland),False,22,25
1,Ballymena (N.Ireland),True,3,25
2,Basildon,False,36,58
3,Basildon,True,22,58
4,Bridgend,False,28,47


In [125]:
orderedNumberOfReservedAtLocation = dogsStatusLocation[dogsStatusLocation['reserved']==True].sort_values(by=['totalAtLocation'])[0]
orderedTotalNumberAtLocation = dogsStatusLocation[dogsStatusLocation['reserved']==True].sort_values(by=['totalAtLocation'])['totalAtLocation']

In [128]:
#x is the total number of dogs at each shelter (matching the axis title),
#y is the reserved share converted to a percentage
trace = go.Scatter(
    x=orderedTotalNumberAtLocation,
    y=100 * orderedNumberOfReservedAtLocation / orderedTotalNumberAtLocation.astype(float),
    mode='markers')

layout = go.Layout(
    title = 'A scatter plot to compare the number of dogs at shelters to the percentage of reserved dogs there',
    xaxis = dict(title="Number of dogs at shelter"),
    yaxis = dict(title="Percentage of dogs reserved"),
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
fig.show()

There does indeed seem to be a positive correlation between the number of dogs at a shelter and the percentage of dogs which are reserved there. Perhaps that's because the bigger shelters are better known and so more people go there to adopt, or perhaps the larger shelters have more money, better resources or better advertising. Perhaps the larger centres are based in busier towns, so there's simply more demand for dogs. There could be many causes for this correlation.

Maybe I should add in another feature which is 'number of dogs at shelter'. The centre location would be categorical, but this would be numerical.
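
One way such a feature could be built is sketched below on made-up data. The column name `dogsAtCentre` and the `toyDogs` frame are hypothetical; `transform('size')` broadcasts each group's row count back onto every row of that group.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the shelter dataset)
toyDogs = pd.DataFrame({
    'centreLocation': ['Basildon', 'Basildon', 'Bridgend', 'Basildon', 'Bridgend'],
    'reserved': [True, False, False, True, True],
})

# Count the rows in each centre and repeat that count on every row of the centre,
# giving a numerical feature alongside the categorical centreLocation
toyDogs['dogsAtCentre'] = toyDogs.groupby('centreLocation')['centreLocation'].transform('size')
print(toyDogs)
```

Note that a feature like this is fully determined by `centreLocation`, so after one-hot encoding the location it may add little for a linear model; it is more likely to help if the location column is dropped or if a tree-based model is used.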

## Investigating whether having a photo, how many photos or having a video has any correlation with reserved status

In [129]:
noDogsReservedWithPhoto = shelterDogsData[shelterDogsData['hasPhoto']==True][['hasPhoto','reserved']].groupby('reserved').size()[True]
noDogsUnreservedWithPhoto =shelterDogsData[shelterDogsData['hasPhoto']==True][['hasPhoto','reserved']].groupby('reserved').size()[False]
noDogsReservedWithoutPhoto = shelterDogsData[shelterDogsData['hasPhoto']==False][['hasPhoto','reserved']].groupby('reserved').size()[True]
noDogsUnreservedWithoutPhoto =shelterDogsData[shelterDogsData['hasPhoto']==False][['hasPhoto','reserved']].groupby('reserved').size()[False]

noDogsReservedWithVideo = shelterDogsData[shelterDogsData['hasVideo']==True][['hasVideo','reserved']].groupby('reserved').size()[True]
noDogsUnreservedWithVideo =shelterDogsData[shelterDogsData['hasVideo']==True][['hasVideo','reserved']].groupby('reserved').size()[False]
noDogsReservedWithoutVideo = shelterDogsData[shelterDogsData['hasVideo']==False][['hasVideo','reserved']].groupby('reserved').size()[True]
noDogsUnreservedWithoutVideo =shelterDogsData[shelterDogsData['hasVideo']==False][['hasVideo','reserved']].groupby('reserved').size()[False]


In [130]:
allWithPhoto = noDogsReservedWithPhoto + noDogsUnreservedWithPhoto
allWithVideo = noDogsReservedWithVideo + noDogsUnreservedWithVideo
totalDogs = len(shelterDogsData)

In [132]:
trace1 = go.Pie(labels = ["Reserved", "Unreserved"],
                values = [noDogsReservedWithPhoto, noDogsUnreservedWithPhoto],
                marker=dict(colors=colours),
                textinfo='label+percent',
                title = "For those with a photo, how many are reserved?",
                domain =dict(x=[0,0.5],y=[0.5,1]))

trace2 = go.Pie(labels = ["Reserved", "Unreserved"],
                values = [noDogsReservedWithoutPhoto, noDogsUnreservedWithoutPhoto],
                marker=dict(colors=colours),
                textinfo='label+percent',
                title = "For those without a photo, how many are reserved?",
                domain=dict(x=[0.5,1], y=[0.5,1]))

#The two video charts must use the video counts, not the photo counts
trace3 = go.Pie(labels = ["Reserved", "Unreserved"],
                values = [noDogsReservedWithVideo, noDogsUnreservedWithVideo],
                marker=dict(colors=colours),
                textinfo='label+percent',
                title = "For those with a video, how many are reserved?",
                domain =dict(x=[0,0.5], y=[0,0.5]))

trace4 = go.Pie(labels = ["Reserved", "Unreserved"],
                values = [noDogsReservedWithoutVideo, noDogsUnreservedWithoutVideo],
                marker=dict(colors=colours),
                textinfo='label+percent',
                title = "For those without a video, how many are reserved?",
                domain=dict(x=[0.5,1], y=[0,0.5]))

fig = go.Figure(data = [trace1, trace2, trace3, trace4])
fig.show()

It would seem that most dogs have a photo, and those that do are more likely to be reserved, although it doesn't appear to make a huge difference.

In [133]:
#How many photos vs percentage reserved
#maybe hasPhoto is redundant since we already have number of photos with 0 as an option

In [134]:
numberOfDogsWithXNumberOfPhotos = shelterDogsData.groupby('numberOfPhotos').size()

In [135]:
numberOfReservedDogsWithXNumberOfPhotos = shelterDogsData[shelterDogsData['reserved']==True].groupby('numberOfPhotos').size()

In [167]:
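#Percentage reserved = reserved / total for each photo count.
#reindex makes photo counts with no reserved dogs show as 0 rather than NaN
percentageReservedByPhotoCount = 100 * numberOfReservedDogsWithXNumberOfPhotos.reindex(
    numberOfDogsWithXNumberOfPhotos.index, fill_value=0) / numberOfDogsWithXNumberOfPhotos

trace = go.Scatter(
    x=sorted(shelterDogsData.numberOfPhotos.unique()),
    y=percentageReservedByPhotoCount,
    mode='markers')

layout = go.Layout(
    title = 'A scatter plot to compare the number of photos of a dog displayed on the website with the percentage of those dogs reserved',
    xaxis = dict(title="Number of photos of dog"),
    yaxis = dict(title="Percentage of dogs reserved"),
)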

data = [trace]
fig_photos_reserved = go.Figure(data=data, layout=layout)
fig_photos_reserved.show()

It does seem that the more photos a dog has, the higher its chances of being reserved. This could be questioned, since the dogs with no photos appear to have the same percentage of reservations as those with 5 photos. It should be noted, though, that dogs with no photo represent a very small part of the population compared to the other groups.

## Investigating suitable machine learning methods

In [138]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

Firstly, I'm going to make a copy of the dataset. Then, I'm going to take out a small portion to test all my models on after training. I won't take out a separate validation set because I'm going to train using cross-validation, which splits the training set into several sections and rotates them through mini training and validation sets. I'm doing this because my dataset isn't huge, and it will give me a better idea of how good my models are.
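
The plan described above (hold out a test portion, then cross-validate on the rest) could look roughly like the sketch below. It uses synthetic data from `make_classification` rather than the shelter dataset, and the 20% test size and 5 folds are assumptions chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the encoded shelter data (hypothetical, for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 20% as a final test set; stratify keeps the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

# 5-fold cross-validation on the training portion only:
# each fold takes one turn as the mini validation set
cvScores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("Mean cross-validated accuracy: {0:.3f}".format(cvScores.mean()))
```

The held-out `X_test` and `y_test` are deliberately left untouched here; they would only be used once, at the very end, to score the chosen model.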

In [139]:
shelterDogsDataTrain = shelterDogsData.copy()

The features need to be converted to the types of variables that we can use in models.

I will use label encoding for age since it is an ordinal variable - it's categorical but there's a logical (chronological) order. For breed, gender and centre location, I will use one hot encoding.
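
As a rough illustration of the difference between the two encodings, here is a sketch on made-up data. It assumes the age labels match the category names exactly (in the real file they carry extra quote characters, so they would need stripping first); `toyDogs` and `ageCode` are hypothetical names.

```python
import pandas as pd

# Age categories in chronological order; the position in this list becomes the ordinal code
ageOrderLabels = ['0 to 6 Months', '6 to 12 Months', '1 to 2 Years',
                  '2 to 5 Years', '5 to 7 Years', '8+ Years']

# Toy data (hypothetical values, not taken from the shelter dataset)
toyDogs = pd.DataFrame({
    'age': ['2 to 5 Years', '0 to 6 Months', '8+ Years'],
    'gender': ['Female', 'Male', 'Male'],
})

# Ordinal (label) encoding: one integer column that preserves the chronological order
toyDogs['ageCode'] = pd.Categorical(
    toyDogs['age'], categories=ageOrderLabels, ordered=True).codes

# One-hot encoding: one indicator column per category, with no implied order
encoded = pd.get_dummies(toyDogs[['ageCode', 'gender']], columns=['gender'])
print(encoded)
```

An ordered `pd.Categorical` is just one option; passing an explicit category order avoids the alphabetical sorting that would otherwise put '1 to 2 Years' before '6 to 12 Months'.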

In [140]:
le1 = LabelEncoder()

In [143]:
#Label encode for the age category. I'm doing this manually because sklearn's label encoder sorts labels
#alphabetically, which would lose the chronological order that an ordinal variable should keep
ageOrdinal = {
    '0 to 6 Months': 0,
    '6 to 12 Months': 1,
    '1 to 2 Years': 2,
    '2 to 5 Years': 3,
    '5 to 7 Years': 4,
    '8+ Years': 5,
}

def encodeAge(value):
    #Values that are already encoded are left alone, so the cell can safely be re-run
    if not isinstance(value, str):
        return value
    #Substring match because the raw labels carry extra quote characters
    for label, code in ageOrdinal.items():
        if label in value:
            return code
    return value

shelterDogsDataTrain['age'] = shelterDogsDataTrain['age'].apply(encodeAge)

In [144]:
#Change the dtype of the age category from object to numeric so it can be used in models
shelterDogsDataTrain['age'] = pd.to_numeric(shelterDogsDataTrain['age'])

In [145]:
shelterDogsDataTrain_encoded = pd.get_dummies(shelterDogsDataTrain)

In [146]:
shelterDogsDataTrain_encoded.shape

(948, 71)

In [147]:
shelterDogsDataTrain.shape

(948, 25)

In [148]:
y= shelterDogsDataTrain_encoded['reserved']
X= shelterDogsDataTrain_encoded.drop(['reserved'], axis=1)

In [149]:
from sklearn.model_selection import cross_val_predict

In [150]:
from sklearn.linear_model import LinearRegression, LogisticRegressionCV

I always try to start with the most basic classifier I can think of, so that when I start applying algorithms I can see whether they really improve on this base model and whether they're worth the extra computing time. The most basic model I can think of is one that always predicts the mode of the target variable.

In [151]:
print("The mode of the target variable is {0}".format(y.mode()))

The mode of the target variable is 0    False
Name: reserved, dtype: bool


In [152]:
from sklearn.metrics import accuracy_score

In [153]:
y_pred = np.full((len(y),),False)

In [154]:
print("Accuracy using base model: {0}".format(accuracy_score(y, y_pred)))

Accuracy using base model: 0.7088607594936709
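
scikit-learn also ships a ready-made version of this kind of baseline, `DummyClassifier` with `strategy='most_frequent'`, which can be dropped into the same evaluation code as any other model. The sketch below uses a made-up target with 7 `False` and 3 `True` values; `X_toy` is a placeholder, since this strategy ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy target (hypothetical): 7 available dogs, 3 reserved dogs
y_toy = np.array([False] * 7 + [True] * 3)
# Placeholder features: the most_frequent strategy never looks at them
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most common class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baselineAccuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print("Baseline accuracy: {0}".format(baselineAccuracy))
```

On this toy target the baseline scores 0.7, which is simply the share of the majority class, in the same way that the 0.709 above is the share of available dogs.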


The simplest classifier I can think of is a logistic regression classifier. It makes sense because I am trying to predict something with two outcomes - reserved or available. I'll start with this as a base predictor.

In [155]:
logisticPredictions = LogisticRegressionCV(cv=5, random_state=0).fit(X,y)

In [156]:
logisticPredictions.score(X,y)

0.7521097046413502

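75.2% accuracy is an improvement on the base model's 70.9%, and it's been an interesting investigation. Note, though, that this score is measured on the same data the model was fitted to, so it is likely to be optimistic, and further work would be needed as this isn't high enough to be reliable in any real-world application.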
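
A possible next step would be to score the model only on rows it has not seen. The sketch below uses `cross_val_predict` on synthetic data (not the shelter dataset) to get out-of-fold predictions, and compares their accuracy with the accuracy on the training data itself. The 5 folds and the plain `LogisticRegression` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the encoded shelter data (hypothetical, for illustration only)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# Each row is predicted by a model that was fitted without that row's fold
outOfFoldPredictions = cross_val_predict(model, X_toy, y_toy, cv=5)
cvAccuracy = accuracy_score(y_toy, outOfFoldPredictions)

# Accuracy on the same data the model was fitted to, for comparison
trainAccuracy = model.fit(X_toy, y_toy).score(X_toy, y_toy)

print("Out-of-fold accuracy: {0:.3f}".format(cvAccuracy))
print("Training accuracy:    {0:.3f}".format(trainAccuracy))
```

The out-of-fold figure is usually the fairer one to compare against the baseline, because the training accuracy rewards a model for memorising the rows it was fitted on.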