# (Messy) Petfinder Analysis

In [1]:
import pandas as pd

## Project Details
* **Problem**: I plan to look at a snapshot of adoptable dogs across the United States (as posted on [Petfinder.com](https://www.petfinder.com)) to determine the predominant reported breeds of dogs within the shelter and rescue systems. In particular, I want to look at this distribution state-by-state.
* **Questions**:
    - For each state, what is the most common reported breed found in shelters?
    - What is the current average length of stay (LOS) for each reported breed?
    - What is the current average length of stay (LOS) for dogs in each state?
    - For each US geographical region, what are the top 10 reported breeds?
* **Justification**: Examining which reported dog breeds are most often ending up homeless and offered for adoption in local shelters/rescues may help to answer questions about breed popularity and dog ownership culture in the United States (and whether there are any region-specific trends). It may also help to answer questions about how meaningful a breed identity actually is in the context of animal rescue. [This professional study](https://pubmed.ncbi.nlm.nih.gov/27008213/) determined that breed labels in dogs can influence their perceived adoptability and LOS. Hopefully, this analysis will provide some insight into how dog breed identity in shelter/rescue systems differs by region. The results could perhaps lead to further study or reinforce preexisting conclusions.
* **Datasets**: [allDogDescriptions.csv](https://github.com/the-pudding/data/tree/master/dog-shelters), a dataset of all adoptable dogs from petfinder.com on September 20th, 2019.
* **Ethical Concerns/Considerations**:
    - The results may cause people to draw false conclusions about why certain breeds often end up in shelters and rescues.
    - The results risk reinforcing biases associated with various dog breeds.
    - The results may influence rescue/shelter intake depending on the dog's perceived adoptability.

The first task is to simply read in the data.

In [2]:
df = pd.read_csv("data/allDogDescriptions.csv")
df.head()

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,status,posted,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,adoptable,2019-09-20T16:37:59+0000,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,adoptable,2019-09-20T16:24:57+0000,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,,False,False,Brindle,...,adoptable,2019-09-20T14:10:11+0000,Mesquite,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,,False,False,,...,adoptable,2019-09-20T10:08:22+0000,Pahrump,NV,89048,US,89009,2019-09-20,Dog,
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,,False,False,,...,adoptable,2019-09-20T06:48:30+0000,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...


I start off by getting the overall breed frequencies:

In [3]:
print("OVERALL FREQUENCIES FOR PRIMARY LISTED BREED:")
print(df['breed_primary'].value_counts())
print("\n\nOVERALL FREQUENCIES FOR SECONDARY LISTED BREED:")
print(df['breed_secondary'].value_counts())
uniq_breeds = pd.concat([df['breed_primary'], df['breed_secondary']]).drop_duplicates()
print("\n\nNUMBER OF UNIQUE DOG BREEDS: ", uniq_breeds.size)

OVERALL FREQUENCIES FOR PRIMARY LISTED BREED:
Pit Bull Terrier               7890
Labrador Retriever             7198
Chihuahua                      3766
Mixed Breed                    3242
Terrier                        2641
                               ... 
Field Spaniel                     1
Bouvier des Flandres              1
Briard                            1
Wirehaired Pointing Griffon       1
Chinook                           1
Name: breed_primary, Length: 216, dtype: int64


OVERALL FREQUENCIES FOR SECONDARY LISTED BREED:
Mixed Breed                           4348
Labrador Retriever                    2194
Pit Bull Terrier                      1365
Terrier                               1195
Hound                                 1143
                                      ... 
Finnish Spitz                            1
Toy Manchester Terrier                   1
Curly-Coated Retriever                   1
Lowchen                                  1
Nova Scotia Duck Tolling Retrie

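The most common primary breed is the Pit Bull Terrier, closely followed by the Labrador Retriever and then, at roughly half their counts, the Chihuahua. The most common secondary breed is simply "Mixed Breed", followed again by the Labrador Retriever and the Pit Bull Terrier. In total, the code counts 223 unique values across both breed columns in this snapshot. Because `drop_duplicates()` keeps one missing value (from listings with no secondary breed), the number of actual breed labels is probably 222, and that still includes generic labels such as "Mixed Breed".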
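To illustrate how the missing secondary breed affects the unique count, here is a minimal sketch on a toy frame. The `toy` data and the variable names are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the two breed columns (hypothetical rows, real column names).
toy = pd.DataFrame({
    "breed_primary": ["Pit Bull Terrier", "Shepherd", "Dachshund", "Shepherd"],
    "breed_secondary": ["Mixed Breed", None, None, "Pit Bull Terrier"],
})

all_labels = pd.concat([toy["breed_primary"], toy["breed_secondary"]])

# drop_duplicates() keeps one missing value, so it is counted like a breed.
with_missing = all_labels.drop_duplicates().size

# Dropping missing values first counts only the actual breed labels.
named_only = all_labels.dropna().nunique()

print(with_missing, named_only)
```

On the real data, the same `dropna().nunique()` call on the concatenated columns should give the count of named breed labels.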

**Question**: For each state, what is the most common breed found in shelters?

In [4]:
# some values in the original data have seemingly been shifted, which is why some "states" are actually zip codes
# I'm hoping to fix this at some point
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
df.groupby('contact_state')['breed_primary'].value_counts().groupby('contact_state').apply(lambda x : x.nlargest(10))

contact_state  contact_state  breed_primary                      
12220          12220          Basset Hound                             1
                              Beagle                                   1
                              Mixed Breed                              1
12477          12477          American Bulldog                         1
                              Mixed Breed                              1
17325          17325          Alaskan Malamute                         2
19053          19053          Pit Bull Terrier                         1
19063          19063          Pit Bull Terrier                         1
20136          20136          Maltese                                  1
20905          20905          Fox Terrier                              1
23112          23112          Labrador Retriever                       1
24588          24588          Hound                                    1
37189          37189          Yellow Labrador Retriever   
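The zip codes showing up as "states" could be filtered out before grouping. The sketch below is one possible approach, assuming that a valid state code is exactly two uppercase letters. It drops the shifted rows rather than repairing them, and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy stand-in: two real-looking states plus two shifted rows holding zip codes.
toy = pd.DataFrame({
    "contact_state": ["NV", "NV", "NV", "NY", "NY", "12220", "17325"],
    "breed_primary": ["Pit Bull Terrier", "Pit Bull Terrier", "Chihuahua",
                      "Beagle", "Beagle", "Basset Hound", "Alaskan Malamute"],
})

# Assumption: anything that is not two uppercase letters is a shifted row.
is_state = toy["contact_state"].astype(str).str.fullmatch(r"[A-Z]{2}")
clean = toy[is_state]

# Most common primary breed per state; on a tie, mode() returns the
# candidates sorted, and this keeps the first one.
top_breed = clean.groupby("contact_state")["breed_primary"].agg(
    lambda s: s.mode().iloc[0]
)
print(top_breed)
```

This answers the question as asked (a single breed per state) instead of listing the ten largest counts for each group.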

Is there any statistically significant difference between the states? A chi-square test is awkward to apply here, because the set of breed categories differs from state to state, while the test expects every group to be counted over the same categories. I may come back to this, but for now I can't think of a sensible way to compare these groups.
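One possible way around the differing categories, offered only as a sketch, is to fix the category set up front: keep the `k` most common breeds overall and lump everything else into an "Other" bucket, so every state is counted over the same columns. The toy data, the choice of `k`, and the "Other" label are all assumptions, and the statistic is computed by hand from the standard formula (sum of (observed - expected)² / expected).

```python
import pandas as pd

# Toy stand-in with two states and a couple of rare breeds.
toy = pd.DataFrame({
    "contact_state": ["NV"] * 6 + ["NY"] * 6,
    "breed_primary": ["Pit Bull Terrier"] * 3 + ["Chihuahua"] * 2 + ["Briard"]
                     + ["Pit Bull Terrier"] * 2 + ["Chihuahua"] * 3 + ["Chinook"],
})

# Assumption: keep the k most common breeds overall, lump the rest together.
k = 2
top = toy["breed_primary"].value_counts().nlargest(k).index
breed = toy["breed_primary"].where(toy["breed_primary"].isin(top), "Other")

# Contingency table: one row per state, the same breed columns for every state.
observed = pd.crosstab(toy["contact_state"], breed)

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()
expected = row_totals[:, None] * col_totals[None, :] / n

chi2 = ((observed.to_numpy() - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(chi2, dof)
```

The toy counts are far too small for a meaningful test. On the real data, `scipy.stats.chi2_contingency(observed)` would also return a p-value, but states with very few listings would still need to be merged or dropped first.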

**Question:** What is the current average length of stay (LOS) for dogs in each state?
<br>This refers only to pets who are still listed on Petfinder, where we "guess" the length of stay by comparing the date the dog was posted to the date the data was collected.
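As a rough illustration of that estimate, the sketch below computes the days between the posting date and the snapshot date in a vectorized way, then averages per state. It is a sketch under assumptions: the `toy` rows are made up, only the date part of `posted` is used, and values that do not parse as dates (such as shifted rows) are simply dropped.

```python
import pandas as pd

# Toy stand-in; the last row mimics a shifted record with a non-date "posted".
toy = pd.DataFrame({
    "contact_state": ["NV", "NV", "NY", "NY", "12220"],
    "posted": ["2019-09-20T16:37:59+0000", "2019-09-18T10:00:00+0000",
               "2019-09-10T08:00:00+0000", "2019-08-31T23:59:59+0000", "NV"],
})

snapshot = pd.Timestamp("2019-09-20")  # date the data was collected

# Keep only the YYYY-MM-DD part; errors="coerce" turns bad values into NaT.
posted = pd.to_datetime(
    toy["posted"].str.slice(0, 10), format="%Y-%m-%d", errors="coerce"
)
toy["los_days"] = (snapshot - posted).dt.days

# Average estimated length of stay per state, ignoring unparseable rows.
mean_los = (
    toy.dropna(subset=["los_days"])
       .groupby("contact_state")["los_days"]
       .mean()
)
print(mean_los)
```

Grouping by `breed_primary` instead of `contact_state` would give the per-breed version of the same estimate.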

In [66]:
"""
Consumes two strings representing a date/time
Returns a list of ints containing the difference in years, months, and days respectively
"""
def subdate(date1, date2):
    # YYYY-MM-DD
    if '-' not in date2 or 'T' not in date2:
        return
    date2 = date2[0:date2.index('T')]
    date1 = pd.to_datetime(date1)
    date2 = pd.to_datetime(date2)
    return (date1 - date2).days

"""def grouphandler(grp):
    ret = grp.apply(lambda y : subdate('2019-09-20', str(y)))
    return ret"""

#los = df.apply((lambda y : subdate('2019-09-20', str(y['posted']))), axis=1)#.groupby('contact_state')['posted']
#los

0           0.0
1           0.0
2           0.0
3           0.0
4           0.0
5           0.0
6           0.0
7           0.0
8           0.0
9           0.0
10          0.0
11          0.0
12          0.0
13          0.0
14          0.0
15          0.0
16          0.0
17          0.0
18          0.0
19          0.0
20          0.0
21          0.0
22          0.0
23          0.0
24          0.0
25          0.0
26          0.0
27          0.0
28          0.0
29          0.0
30          0.0
31          0.0
32          1.0
33          1.0
34          1.0
35          1.0
36          1.0
37          1.0
38          1.0
39          1.0
40          1.0
41          1.0
42          1.0
43          1.0
44          1.0
45          1.0
46          1.0
47          1.0
48          1.0
49          1.0
50          1.0
51          1.0
52          1.0
53          1.0
54          1.0
55          1.0
56          1.0
57          1.0
58          1.0
59          1.0
60          1.0
61          1.0
62      