You have to work on the [Dogs adoptions](https://drive.google.com/file/d/1wQsA0oB6wwYlnkvvcyBCmLk7QmgVWNax/view?usp=sharing) dataset. 

It contains three files:
*  `dogs.csv`, shortly *dogs*
*  `dogTravel.csv`, shortly *travels*
*  `NST-EST2021-POP.csv`

### Notes

1.    It is mandatory to use GitHub for developing the project.
1.    The project must be a jupyter notebook.
1.    There is no restriction on the libraries that can be used, nor on the Python version.
1.    All questions on the project **must** be asked in a public channel on [Zulip](https://focs.zulipchat.com).
1.    At most 3 students can be in each group. You must create the groups by yourself.
1.    You do not have to send me the project *before* the discussion.

### 0.1 Importing files

In [23]:
# Importing Pandas
import pandas as pd

# Opening dogs.csv and checking columns
with open("dogs.csv", "r") as dogs_file:
    headers = dogs_file.readline()
    print(headers)

id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,color_secondary,color_tertiary,age,sex,size,coat,fixed,house_trained,declawed,special_needs,shots_current,env_children,env_dogs,env_cats,name,status,posted,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost



In [24]:
# Creating 'dogs' df 
dogs = pd.read_csv("dogs.csv", sep=',', doublequote='"', low_memory=False)

# Checking the head
dogs.head()

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/nv/las-vegas/animal-network-nv163/?referrer_id=87b...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter in his senior years but as you see from the pictur...,70,124.81
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/nv/las-vegas/animal-network-nv163/?referrer_id=87b...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really wants a home of his own. We are getting more info...,49,122.07
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/nv/mesquite/city-of-mesquite-animal-shelter-nv99/?r...,Dog,Dog,Shepherd,,False,False,Brindle,...,Mesquite,NV,89027,US,89009,2019-09-20,Dog,"Approx 2 years old.\r\n Did I catch your eye? I don't blame you if you had to stop and stare, I ...",87,281.51
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/nv/pahrump/pets-are-worth-saving-paws-nv202/?referr...,Dog,Dog,German Shepherd Dog,,False,False,,...,Pahrump,NV,89048,US,89009,2019-09-20,Dog,,62,145.83
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv/henderson/wagging-tails-rescue-nv184/?referrer_id...,Dog,Dog,Dachshund,,False,False,,...,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets along well with other dogs in his size range. This cut...,93,241.09


### 0.2 Cleaning up

In [25]:
tmp_dog_full = pd.read_csv("dogs.csv", sep=',', doublequote='"', low_memory=False, encoding='utf-8')
print(f'tmp_dog_full shape: {tmp_dog_full.shape}')

# check what lines are ok and what need to be managed in different way: use contact state as watermark
tmp_dog_full['ok'] = ~tmp_dog_full.contact_state.str.isnumeric()
tmp_dog_full.columns = [col.lower().replace(".", "_") for col in tmp_dog_full.columns]

# split dataframe with different case
tmp_dog_ok = tmp_dog_full[tmp_dog_full.ok == True]
tmp_dog_not_ok = tmp_dog_full[tmp_dog_full.ok == False]
print('tmp_dog_ok:')
display(tmp_dog_ok.head(5))
print('##################################')
print('tmp_dog_not_ok')
display(tmp_dog_not_ok.head(5))

# check all rows are ok
print(len(tmp_dog_ok.contact_state.unique()))
tmp_dog_ok.contact_state.unique()

# manage not ok dataframe: split name column and shift the others

pd.set_option('display.max_colwidth', 100) #50
print('before')
display(tmp_dog_not_ok.head(1))
tmp_dog_not_ok_fixed = pd.DataFrame(columns=tmp_dog_not_ok.columns, index=tmp_dog_not_ok.index)
tmp_dog_not_ok_fixed.iloc[:, 0:24] =  tmp_dog_not_ok.iloc[:, 0:24].copy()
tmp_dog_not_ok_fixed.iloc[:, 26:] =  tmp_dog_not_ok.iloc[:, 25:].drop('accessed', axis = 1).copy()
tmp_dog_not_ok.iloc[: , 24]
tmp_dog_not_ok_fixed.name = tmp_dog_not_ok.name.apply(lambda x : x.split('\",')[0])
tmp_dog_not_ok_fixed.status = tmp_dog_not_ok.name.apply(lambda x : x.split('\",')[1].strip('"'))
print('after')
tmp_dog_not_ok_fixed.head()

# unify dataframes
print('tmp_dog_ok shape:', tmp_dog_ok.shape)
print('tmp_dog_not_ok shape:', tmp_dog_not_ok.shape)
dogs = pd.concat([tmp_dog_ok, tmp_dog_not_ok_fixed])
print('dogs shape:', dogs.shape)
del tmp_dog_full
del tmp_dog_not_ok
del tmp_dog_not_ok_fixed
del tmp_dog_ok

dogs.columns = [col.lower().replace(".", "_") for col in dogs.columns]
dogs.drop('ok', axis=1, inplace=True)
dogs.columns


# travels dataset

tmp_travels = pd.read_csv("dogTravel.csv", sep=',', doublequote='"', low_memory=False).drop('index', axis=1)
display(tmp_travels.head())
display(tmp_travels.contact_state.unique())
display(tmp_travels[tmp_travels.contact_state == '17325'].id.unique())
anomalies = tmp_travels[tmp_travels.contact_state == '17325'].id.unique()
tmp_travels.loc[tmp_travels.id == anomalies[0], 'contact_state'] = 'PA'
tmp_travels.loc[tmp_travels.id == anomalies[1], 'contact_state'] = 'PA'
display(tmp_travels[tmp_travels.id.isin(anomalies)])
display(tmp_travels.contact_state.unique())

travels = tmp_travels.copy()
del tmp_travels

# states dataset

tmp_states = pd.read_csv("NST-EST2021-POP.csv", header=None, names=["state", "population"], sep=',', low_memory=False)
tmp_states.head()

tmp_states.population = tmp_states.population.str.replace('.', '', regex=False).astype(int)
states = tmp_states.copy()
del tmp_states
states.head()

tmp_dog_full shape: (58180, 37)
tmp_dog_ok:


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/nv/las-vegas/animal-network-nv163/?referrer_id=87b...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter in his senior years but as you see from the pictur...,70,124.81,True
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/nv/las-vegas/animal-network-nv163/?referrer_id=87b...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really wants a home of his own. We are getting more info...,49,122.07,True
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/nv/mesquite/city-of-mesquite-animal-shelter-nv99/?r...,Dog,Dog,Shepherd,,False,False,Brindle,...,NV,89027,US,89009,2019-09-20,Dog,"Approx 2 years old.\r\n Did I catch your eye? I don't blame you if you had to stop and stare, I ...",87,281.51,True
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/nv/pahrump/pets-are-worth-saving-paws-nv202/?referr...,Dog,Dog,German Shepherd Dog,,False,False,,...,NV,89048,US,89009,2019-09-20,Dog,,62,145.83,True
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv/henderson/wagging-tails-rescue-nv184/?referrer_id...,Dog,Dog,Dachshund,,False,False,,...,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets along well with other dogs in his size range. This cut...,93,241.09,True


##################################
tmp_dog_not_ok


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41330726/nv/las-vegas/vegas-shepherd-rescue-nv173/?r...,Dog,Dog,German Shepherd Dog,,False,False,,...,89146,US,89009,2019-09-20,,Dog,Meet handsome 3 year old Gunther. Gunther came to us after being returned to the local shelter f...,108,256.88,False
5549,38169117,AZ414,https://www.petfinder.com/dog/annabelle-annie-38169117/az/chandler/underdog-rescue-of-az-az414/?...,Dog,Dog,Boxer,Pit Bull Terrier,True,False,Black,...,85249,US,AZ,2019-09-20,,Dog,You can fill out an adoption application online on our official website.\r\n\r\nMEET ANNABELLE o...,80,130.77,False
10888,45833989,NY98,https://www.petfinder.com/dog/pepper-courtesy-listing-45833989/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Beagle,,False,False,,...,12220,US,CT,2019-09-20,,Dog,This is Pepper. He is a 15 year old tri-color beagle. He is 32 lbs and can still run a mile! He ...,86,180.7,False
11983,45515547,NY98,https://www.petfinder.com/dog/cooper-courtesy-listing-45515547/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Mixed Breed,,False,False,,...,12220,US,CT,2019-09-20,,Dog,"Cooper is 13 years old, but according to a very recent vet visit he is in perfect health. He is ...",105,400.82,False
12495,45294115,NY98,https://www.petfinder.com/dog/daisy-courtesy-listing-45294115/ny/albany/peppertree-rescue-ny98/?...,Dog,Dog,Basset Hound,,False,False,Brown / Chocolate,...,12220,US,CT,2019-09-20,,Dog,"â¢Basset Hound, female, â¢10 years \r\n\r\nDelightful Daisy is a friendly girl looking for a r...",57,82.61,False


53
before


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41330726/nv/las-vegas/vegas-shepherd-rescue-nv173/?r...,Dog,Dog,German Shepherd Dog,,False,False,,...,89146,US,89009,2019-09-20,,Dog,Meet handsome 3 year old Gunther. Gunther came to us after being returned to the local shelter f...,108,256.88,False


after
tmp_dog_ok shape: (58147, 38)
tmp_dog_not_ok shape: (33, 38)
dogs shape: (58180, 38)


Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He lov...,Arkansas,,,
1,44698509,Groveland,FL,"Duke is an almost 2 year old Potcake from Abacos in the Bahamas. He is a happy boy, who loves hi...",Abacos,Bahamas,,
2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star looking to settle down with the right person! \r\n\r\nAs...,Adam,Maryland,,
3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from another rescue ~~Interacted with other dogs and was ...,Adaptil,,True,
4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her people very much and likes getting loved on. She can ...,Afghanistan,,,


array(['MN', 'FL', 'MD', 'CO', 'CT', 'OH', 'AL', 'NY', 'NJ', 'PA', 'VA',
       'GA', 'ME', 'NH', 'MI', 'VT', 'TN', 'WI', 'NM', 'OR', 'WA', 'IA',
       'KY', 'NV', 'UT', 'AZ', 'NC', 'AR', 'MA', 'RI', 'OK', 'CA', 'IN',
       'SC', 'IL', 'MO', 'TX', 'DC', 'KS', 'DE', 'WV', 'NB', 'MS', 'LA',
       '17325'], dtype=object)

array([36978896, 33218331], dtype=int64)

Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
2472,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,Maryland,,True,
2473,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",Maryland,,True,
3190,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,New Jersey,,True,
3191,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",New Jersey,,True,
3237,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,New York,,True,
3238,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",New York,,True,
3714,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,Pennsylvania,,True,
3715,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",Pennsylvania,,True,
6029,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,Virginia,,True,
6030,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",Virginia,,True,


array(['MN', 'FL', 'MD', 'CO', 'CT', 'OH', 'AL', 'NY', 'NJ', 'PA', 'VA',
       'GA', 'ME', 'NH', 'MI', 'VT', 'TN', 'WI', 'NM', 'OR', 'WA', 'IA',
       'KY', 'NV', 'UT', 'AZ', 'NC', 'AR', 'MA', 'RI', 'OK', 'CA', 'IN',
       'SC', 'IL', 'MO', 'TX', 'DC', 'KS', 'DE', 'WV', 'NB', 'MS', 'LA'],
      dtype=object)

Unnamed: 0,state,population
0,Alabama,5024279
1,Alaska,733391
2,Arizona,7151502
3,Arkansas,3011524
4,California,39538223


### 1. Extract all dogs with status that is not *adoptable*

In [26]:
print(dogs[dogs.status != 'adoptable'].shape)
not_adoptable_dogs = dogs[dogs.status != 'adoptable']

not_adoptable_dogs

(33, 37)


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41330726/nv/las-vegas/vegas-shepherd-rescue-nv173/?r...,Dog,Dog,German Shepherd Dog,,False,False,,...,Las Vegas,NV,89146,US,89009,2019-09-20,Dog,Meet handsome 3 year old Gunther. Gunther came to us after being returned to the local shelter f...,108,256.88
5549,38169117,AZ414,https://www.petfinder.com/dog/annabelle-annie-38169117/az/chandler/underdog-rescue-of-az-az414/?...,Dog,Dog,Boxer,Pit Bull Terrier,True,False,Black,...,Chandler,AZ,85249,US,AZ,2019-09-20,Dog,You can fill out an adoption application online on our official website.\r\n\r\nMEET ANNABELLE o...,80,130.77
10888,45833989,NY98,https://www.petfinder.com/dog/pepper-courtesy-listing-45833989/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Beagle,,False,False,,...,Albany,NY,12220,US,CT,2019-09-20,Dog,This is Pepper. He is a 15 year old tri-color beagle. He is 32 lbs and can still run a mile! He ...,86,180.7
11983,45515547,NY98,https://www.petfinder.com/dog/cooper-courtesy-listing-45515547/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Mixed Breed,,False,False,,...,Albany,NY,12220,US,CT,2019-09-20,Dog,"Cooper is 13 years old, but according to a very recent vet visit he is in perfect health. He is ...",105,400.82
12495,45294115,NY98,https://www.petfinder.com/dog/daisy-courtesy-listing-45294115/ny/albany/peppertree-rescue-ny98/?...,Dog,Dog,Basset Hound,,False,False,Brown / Chocolate,...,Albany,NY,12220,US,CT,2019-09-20,Dog,"â¢Basset Hound, female, â¢10 years \r\n\r\nDelightful Daisy is a friendly girl looking for a r...",57,82.61
12600,45229004,NY1436,https://www.petfinder.com/dog/elmo-momo-45229004/ny/saugerties/ulster-county-canines-ny1436/?ref...,Dog,Dog,American Bulldog,,True,False,,...,Saugerties,NY,12477,US,CT,2019-09-20,Dog,"Hello i'm MoMo or Elmo , 7 year old, mixed breed! I can't wait to have a family of my own. I wan...",73,136.3
12613,45227052,NY1436,https://www.petfinder.com/dog/bianca-pinky-45227052/ny/saugerties/ulster-county-canines-ny1436/?...,Dog,Dog,Mixed Breed,,False,False,White / Cream,...,Saugerties,NY,12477,US,CT,2019-09-20,Dog,"Hello I'm Bianca, a female, 7 year old mixed breed. I enjoy not only getting but giving love. I ...",107,231.31
17619,45569380,CA1209,https://www.petfinder.com/dog/baby-girl-45569380/va/bristow/american-maltese-association-rescue-...,Dog,Dog,Maltese,,False,False,White / Cream,...,Bristow,VA,20136,US,DC,2019-09-20,Dog,"This 10-year young senior is very sweet and loving. She weighs 9.5 lbs, heart worm negative, up-...",76,263.63
18611,44694387,MD295,https://www.petfinder.com/dog/king-bert-bertie-44694387/md/silver-spring/master-md295/?referrer_...,Dog,Dog,Fox Terrier,Chihuahua,True,False,Bicolor,...,Silver Spring,MD,20905,US,DC,2019-09-20,Dog,"\""Bertie\"" came to us from the shelter. He was found stray in the inner city. He had some skin...",61,158.84
19747,36978896,VA127,https://www.petfinder.com/dog/maddie-cutie-patootie-36978896/pa/gettysburg/chesapeake-area-alask...,Dog,Dog,Alaskan Malamute,,False,False,Bicolor,...,Gettysburg,PA,17325,US,DC,2019-09-20,Dog,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,119,431.66


### 2. For each (primary) breed, determine the number of dogs

In [27]:
dogs['breed_primary'].value_counts()

Pit Bull Terrier                7890
Labrador Retriever              7198
Chihuahua                       3766
Mixed Breed                     3242
Terrier                         2641
                                ... 
Wirehaired Pointing Griffon        1
Boykin Spaniel                     1
Old English Sheepdog               1
Belgian Shepherd / Laekenois       1
Tosa Inu                           1
Name: breed_primary, Length: 216, dtype: int64

### 3. For each (primary) breed, determine the ratio between the number of dogs of `Mixed Breed` and those not of Mixed Breed. Hint: look at the `secondary_breed`.

In [28]:
# Counting each combination
dogs.groupby(['breed_primary','breed_secondary']).size().reset_index().rename(columns={0:'count'})

Unnamed: 0,breed_primary,breed_secondary,count
0,Affenpinscher,Chihuahua,1
1,Affenpinscher,Mixed Breed,1
2,Afghan Hound,Cocker Spaniel,1
3,Airedale Terrier,Catahoula Leopard Dog,1
4,Airedale Terrier,Coonhound,2
...,...,...,...
2953,Yorkshire Terrier,Scottish Terrier,1
2954,Yorkshire Terrier,Shih Tzu,10
2955,Yorkshire Terrier,Silky Terrier,2
2956,Yorkshire Terrier,Terrier,13


In [29]:
# Creating a groupby for Mixed Breeds
dogs.groupby('breed_primary')['breed_secondary'].apply(lambda x: (x=='Mixed Breed').sum()).reset_index(name='count')

Unnamed: 0,breed_primary,count
0,Affenpinscher,1
1,Afghan Hound,0
2,Airedale Terrier,1
3,Akbash,0
4,Akita,6
...,...,...
211,Wirehaired Pointing Griffon,0
212,Wirehaired Terrier,4
213,Xoloitzcuintli / Mexican Hairless,0
214,Yellow Labrador Retriever,0


In [30]:
# Creating a groupby for those who aren't Mixed Breeds
dogs.groupby('breed_primary')['breed_secondary'].apply(lambda x: (x!='Mixed Breed').sum()).reset_index(name='count')

Unnamed: 0,breed_primary,count
0,Affenpinscher,16
1,Afghan Hound,4
2,Airedale Terrier,18
3,Akbash,3
4,Akita,175
...,...,...
211,Wirehaired Pointing Griffon,1
212,Wirehaired Terrier,56
213,Xoloitzcuintli / Mexican Hairless,11
214,Yellow Labrador Retriever,158


In [31]:
# Storing them in two different df
mixed_breed = dogs.groupby('breed_primary')['breed_secondary'].apply(lambda x: (x=='Mixed Breed').sum()).reset_index(name='mixed')
not_mixed_breed = dogs.groupby('breed_primary')['breed_secondary'].apply(lambda x: (x!='Mixed Breed').sum()).reset_index(name='not_mixed')

In [32]:
# Merging the two df into a single one
ratio_mixed = mixed_breed.merge(not_mixed_breed, left_on='breed_primary', right_on='breed_primary')

ratio_mixed

Unnamed: 0,breed_primary,mixed,not_mixed
0,Affenpinscher,1,16
1,Afghan Hound,0,4
2,Airedale Terrier,1,18
3,Akbash,0,3
4,Akita,6,175
...,...,...,...
211,Wirehaired Pointing Griffon,0,1
212,Wirehaired Terrier,4,56
213,Xoloitzcuintli / Mexican Hairless,0,11
214,Yellow Labrador Retriever,0,158


In [33]:
# Calculating the ratio in a different column
ratio_mixed['ratio'] = ratio_mixed['mixed']/ratio_mixed['not_mixed']

ratio_mixed

Unnamed: 0,breed_primary,mixed,not_mixed,ratio
0,Affenpinscher,1,16,0.062500
1,Afghan Hound,0,4,0.000000
2,Airedale Terrier,1,18,0.055556
3,Akbash,0,3,0.000000
4,Akita,6,175,0.034286
...,...,...,...,...
211,Wirehaired Pointing Griffon,0,1,0.000000
212,Wirehaired Terrier,4,56,0.071429
213,Xoloitzcuintli / Mexican Hairless,0,11,0.000000
214,Yellow Labrador Retriever,0,158,0.000000


### 4. For each (primary) breed, determine the earliest and the latest `posted` timestamp.

In [34]:
## Formatting the 'posted' column
dogs['posted'] = pd.to_datetime(dogs['posted'], errors="coerce")

## Creating the df with earliest and latest 'posted' timestamps
earliest_latest_timestamp = dogs.groupby('breed_primary', as_index=False).aggregate({'posted':[min, max]})

earliest_latest_timestamp

Unnamed: 0_level_0,breed_primary,posted,posted
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max
0,Affenpinscher,2012-03-08 10:27:33+00:00,2019-09-14 10:10:51+00:00
1,Afghan Hound,2017-06-29 23:28:51+00:00,2019-07-27 00:38:48+00:00
2,Airedale Terrier,2014-06-13 12:59:36+00:00,2019-09-19 18:40:39+00:00
3,Akbash,2019-07-21 00:35:59+00:00,2019-08-23 17:11:04+00:00
4,Akita,2012-03-03 09:31:08+00:00,2019-09-20 15:19:57+00:00
...,...,...,...
211,Wirehaired Pointing Griffon,2016-06-29 20:03:55+00:00,2016-06-29 20:03:55+00:00
212,Wirehaired Terrier,2012-11-27 14:07:54+00:00,2019-09-19 22:52:45+00:00
213,Xoloitzcuintli / Mexican Hairless,2007-02-01 00:00:00+00:00,2019-09-08 11:15:54+00:00
214,Yellow Labrador Retriever,2010-05-31 00:00:00+00:00,2019-09-20 06:30:27+00:00


### 5. For each state, compute the sex imbalance, that is the difference between male and female dogs. In which state this imbalance is largest?

In [35]:
## TODO extract missing answer!

malefemale = dogs[['contact_state', 'contact_city', 'contact_zip', 'contact_country', 'sex']].copy()
malefemale['imbalance'] = malefemale.sex.apply(lambda x : 1 if x.upper() == 'MALE' else -1)

malefemale_imbalance = malefemale.groupby('contact_state', as_index=False).sum('imbalance')[['contact_state', 'imbalance']]
malefemale_imbalance.iloc[[malefemale_imbalance.imbalance.idxmin(), malefemale_imbalance.imbalance.idxmax()]]
malefemale_imbalance

Unnamed: 0,contact_state,imbalance
0,AK,1
1,AL,-4
2,AR,-7
3,AZ,113
4,CA,110
5,CO,-51
6,CT,58
7,DC,-16
8,DE,0
9,FL,101


### 6. For each pair (age, size), determine the average duration of the stay and the average cost of stay.

In [36]:
dogs.stay_duration = dogs.stay_duration.astype(int)
dogs.stay_cost = dogs.stay_cost.astype(float)
stay = dogs.groupby(['age', 'size'], as_index=False).agg({'stay_duration' : 'mean', 'stay_cost' : 'mean'})
stay.stay_duration = stay.stay_duration.apply(lambda x : round(x, 2))
stay.stay_cost = stay.stay_cost.apply(lambda x : round(x, 2))
stay

Unnamed: 0,age,size,stay_duration,stay_cost
0,Adult,Extra Large,89.02,232.59
1,Adult,Large,89.53,238.66
2,Adult,Medium,89.42,238.26
3,Adult,Small,89.41,238.97
4,Baby,Extra Large,87.03,237.18
5,Baby,Large,89.7,238.7
6,Baby,Medium,89.58,237.11
7,Baby,Small,89.96,239.08
8,Senior,Extra Large,88.86,235.23
9,Senior,Large,88.98,237.51


### 7. Find the dogs involved in at least 3 travels. Also list the breed of those dogs.

In [37]:
many_travels = travels[['id', 'contact_state']].groupby('id', as_index=False).count().rename({'contact_state':'travels'}, axis=1)
many_travels = many_travels[many_travels.travels > 2]
many_travels

Unnamed: 0,id,travels
5,16657005,4
9,20905974,5
17,24894870,4
18,24894894,4
55,33218331,7
...,...,...
4110,46042569,3
4111,46042587,3
4112,46042618,3
4113,46043099,3


In [38]:
more_travels = many_travels.merge(dogs[['id', 'breed_primary']], left_on='id', right_on='id')
more_travels.sort_values('travels', ascending=False)

Unnamed: 0,id,travels,breed_primary
68,44759410,11,German Shepherd Dog
67,44759409,11,German Shepherd Dog
178,45728583,7,Alaskan Malamute
142,45537987,7,Alaskan Malamute
55,44572953,7,Alaskan Malamute
...,...,...,...
226,45831317,3,Shiba Inu
224,45831313,3,Chihuahua
223,45831312,3,Chihuahua
222,45831310,3,Labrador Retriever


### 8. Fix the `travels` table so that the correct state is computed from  the `manual` and the `found` fields. If `manual` is not missing, then it overrides what is stored in `found`.

In [47]:
# Creating a copy
exercise_8 = travels.copy()

# Creating an empty list to store the correct states
correct_states = []

# Looping through each row of the dataframe
for _, row in exercise_8.iterrows():
    # If the 'manual' column is not missing, use its value to populate the 'correct_state' column
    if pd.notnull(row['manual']):
        correct_states.append(row['manual'])
    else:
        correct_states.append(row['found'])

# Add the 'correct_state' column to the dataframe
## TODO ma siamo sicuri che non bisogna sovrascrivere il campo "found"? eventualmente mettiamo qui sotto exercise_8['found'] = correct_states
## TODO inotre il correct state non è così corretto :) 
## TODO nell'esercizio 9 non te ne accorgi più perchè fai il merge sull'id e le righe sporche te le perdi 
exercise_8['correct_state'] = correct_states
print(sorted(set(correct_states)))
exercise_8

['Adaptil', 'Afghanistan', 'Alabama', 'Amish Country', 'Apoquel', 'Arizona', 'Ark', 'Ark.', 'Arkansas', 'Arroyo', 'Aruba', 'Atlanta', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Berkeley Heights', 'Billie', 'Birmingham', 'Blaine', 'Blue Ridge', 'Bosnia', 'Boulder', 'Brida', 'Bridgewater', 'British Virgin Islands', 'Brittany', 'Buddy', 'Calico Rock', 'California', 'Canada', 'Carolinas', 'Cayman Islands', 'Central', 'Char', 'Charlotte', 'Charlotte North Carolina', 'Chihuahua', 'China', 'Cleveland', 'Clinton', 'Clover', 'Coast', 'Collingswood', 'Colorado', 'Connecticut', 'Costa Rica', 'County', 'Cytopoint', 'Death', 'Delaware', 'Dena', 'Dickson City', 'Doggie', 'Egypt', 'Elyria', 'England', 'Fairfax', 'Far Rockaway', 'Finland', 'Florenceville', 'Florida', 'Georgia', 'Glaucoma', 'Great Dane', 'Greece', 'Haiti', 'Hawaii', 'Heartworm', 'Hickory', 'Ho-Bo Care Boxer', 'Honduras', 'Howlin4Spirit', 'Idaho', 'Illinois', 'India', 'Indiana', 'Indianapolis', 'Iowa', 'Iran', 'Ireland', 'Jamaica', 'Jefferson

Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there,correct_state
0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He lov...,Arkansas,,,,Arkansas
1,44698509,Groveland,FL,"Duke is an almost 2 year old Potcake from Abacos in the Bahamas. He is a happy boy, who loves hi...",Abacos,Bahamas,,,Bahamas
2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star looking to settle down with the right person! \r\n\r\nAs...,Adam,Maryland,,,Maryland
3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from another rescue ~~Interacted with other dogs and was ...,Adaptil,,True,,Adaptil
4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her people very much and likes getting loved on. She can ...,Afghanistan,,,,Afghanistan
...,...,...,...,...,...,...,...,...,...
6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc.org) for more information about this pet.\r\n\r\nM...,WV,,True,,WV
6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed Bull Breed mix. She came from a shelter in Wyoming...,Wyoming,,,,Wyoming
6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescued from a high kill shelter with his girlfriend Yaz...,Yazmin,,True,,Yazmin
6192,44519341,Dayton,OH,Callie is a 14 year old Chihuahua whose owner died and whose family couldn't keep her permanentl...,Young,Ohio,,,Ohio


#### For the future me: the 'exercise_8' contains duplicates

### 9. For each state, compute the ratio between the number of travels and the population.

In [40]:
# Storing the NST-EST2021-POP.csv into a new df, 'populationsDf'
populationsDf = pd.read_csv("NST-EST2021-POP.csv", sep=',', doublequote='"', low_memory=False, names=["correct_state", "population"])
populationsDf.columns

Index(['correct_state', 'population'], dtype='object')

In [49]:
import re

## TODO infatti se qua ti metti una left join....viene fuori... NARNIA :D 
## TODO invece NARNIA non esce perchè è una inner join, ma perdi 2000 record
## TODO comunque prima di mettere in join exercise_8 bisogna farci su un group by per stato e contare i viaggi.
## TODO inoltre questa mancata groupby è quella che ti fa uscire i duplicati qualche riga più su
# Merge the two dataframes on the 'contact_state' column
print(f'nomber of rows before merge: {exercise_8.shape[0]}')
exercise_9 = exercise_8.merge(populationsDf, on='correct_state')
print(f'nomber of rows after merge: {exercise_9.shape[0]}')
print(exercise_9['id'])

# Removing duplicate rows based on the 'id' column, keeping the last occurrence of each duplicate row
## TODO questi duplicati vengono fuori per il motivo che scrivo qualche riga più su
exercise_9 = exercise_9.drop_duplicates(subset='id', keep='last')
print(f'nomber of rows after drop: {exercise_9.shape[0]}')

# Group the dataframe by the 'correct_state' column
grouped_df = exercise_9.groupby('correct_state')

# Create an empty dictionary to store the results
results = {}

# Iterate through each group
for name, group in grouped_df:
    # Calculate the number of travels and the population
    num_travels = group.shape[0]
    population = group['population'].str.replace('.', '').astype(int).sum()
    
    # Calculate the ratio and store it in the dictionary
    ratio = num_travels / population
    results[name] = ratio

# Convert the dictionary to a dataframe
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['ratio'])


nomber of rows before merge: 6194
nomber of rows after merge: 4186
0       44520267
1       45745645
2       45745640
3       45760416
4       45759786
          ...   
4181    45903459
4182    45794510
4183    45386321
4184    38643680
4185    45799729
Name: id, Length: 4186, dtype: int64
nomber of rows after drop: 3099


  population = group['population'].str.replace('.', '').astype(int).sum()


In [50]:
print(sorted(exercise_8.correct_state.unique()))
results_df

['Adaptil', 'Afghanistan', 'Alabama', 'Amish Country', 'Apoquel', 'Arizona', 'Ark', 'Ark.', 'Arkansas', 'Arroyo', 'Aruba', 'Atlanta', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Berkeley Heights', 'Billie', 'Birmingham', 'Blaine', 'Blue Ridge', 'Bosnia', 'Boulder', 'Brida', 'Bridgewater', 'British Virgin Islands', 'Brittany', 'Buddy', 'Calico Rock', 'California', 'Canada', 'Carolinas', 'Cayman Islands', 'Central', 'Char', 'Charlotte', 'Charlotte North Carolina', 'Chihuahua', 'China', 'Cleveland', 'Clinton', 'Clover', 'Coast', 'Collingswood', 'Colorado', 'Connecticut', 'Costa Rica', 'County', 'Cytopoint', 'Death', 'Delaware', 'Dena', 'Dickson City', 'Doggie', 'Egypt', 'Elyria', 'England', 'Fairfax', 'Far Rockaway', 'Finland', 'Florenceville', 'Florida', 'Georgia', 'Glaucoma', 'Great Dane', 'Greece', 'Haiti', 'Hawaii', 'Heartworm', 'Hickory', 'Ho-Bo Care Boxer', 'Honduras', 'Howlin4Spirit', 'Idaho', 'Illinois', 'India', 'Indiana', 'Indianapolis', 'Iowa', 'Iran', 'Ireland', 'Jamaica', 'Jefferson

Unnamed: 0,ratio
Alabama,1.990335e-07
Arizona,1.398308e-07
Arkansas,3.320578e-07
California,2.529198e-08
Colorado,1.731987e-07
Connecticut,2.773199e-07
Delaware,1.010154e-06
Florida,4.642916e-08
Georgia,9.335405e-08
Hawaii,6.871572e-07


### 10. For each dog, compute the number of days from the `posted` day to the day of last access.

In [51]:
# Creating a df copy for this exercise
exercise_10 = dogs[['id', 'name', 'posted', 'accessed']].copy()

# Computing the number of days from the 'posted' day to the day of last access, assuming it's 'accessed' column
# The value is stored in 'days_delay' column
exercise_10['posted'] = pd.to_datetime(pd.to_datetime(exercise_10['posted']).dt.date)
exercise_10['accessed'] = pd.to_datetime(exercise_10['accessed'])
exercise_10['days_delay'] = (exercise_10['accessed'].dt.date - exercise_10['posted'].dt.date).dt.days

# Printing the result
exercise_10

Unnamed: 0,id,name,posted,accessed,days_delay
0,46042150,HARLEY,2019-09-20,2019-09-20,0
1,46042002,BIGGIE,2019-09-20,2019-09-20,0
2,46040898,Ziggy,2019-09-20,2019-09-20,0
3,46039877,Gypsy,2019-09-20,2019-09-20,0
4,46039306,Theo,2019-09-20,2019-09-20,0
...,...,...,...,...,...
56013,45916348,\Cody\,2019-09-09,2019-09-20,11
56248,45733027,\Gracie\,2019-08-24,2019-09-20,27
56464,45413997,\Jameson\,2019-07-31,2019-09-20,51
56473,45406516,\Canelo\,2019-07-29,2019-09-20,53


### 11. Partition the dogs according to the number of weeks from the `posted` day to the day of last access.

In [58]:
7//5

1

In [61]:
##TODO questo esercizio l'ho modificato: uscivano le settimane decimali. "//" sta per divisione intera.

# Creating a df copy for this exercise
exercise_11 = exercise_10

# Creating a new column, 'weeks', that stores the number of weeks from the posted day to the day of last access
exercise_11["weeks"] = round(exercise_11["days_delay"] // 7,0).astype(int)

# Grouping the dogs in different partitions, based on 'weeks' value
partitioned_dogs = exercise_11.groupby("weeks").count()[['id']].rename({'id': 'number_of_dogs'}, axis=1)
# # Printing them
partitioned_dogs

Unnamed: 0_level_0,number_of_dogs
weeks,Unnamed: 1_level_1
0,9803
1,6547
2,5764
3,3353
4,2439
...,...
729,1
746,1
811,1
812,1


### 12. Find for duplicates in the `dogs` dataset. Two records are duplicates if they have (1) same breeds and sex, and (2) they share at least 90% of the words in the description field. Extra points if you find and implement a more refined for determining if two rows are duplicates.

In [23]:
df = dogs[['id', 'breed_primary', 'sex', 'description']].copy()

In [24]:
# seleziona casualmente un campione di 1000 record del dataframe
df = df.sample(8000)

# filtra il dataframe per escludere i record con valori NaN nella colonna 'description'
df = df[df['description'].notnull()]

# rimuovi i simboli dalla colonna 'description'
df['description'] = df['description'].str.replace(r'[^\w\s]', '')

# crea una lista vuota per i duplicati
duplicates = []

  df['description'] = df['description'].str.replace(r'[^\w\s]', '')


In [25]:
# itera su ogni record del dataframe
for i, row in df.iterrows():
    # confronta il record corrente con quelli successivi
    for j in range(i + 1, len(df)):
        # se 'breed_primary' e 'sex' sono uguali...
        if row['breed_primary'] == df.iloc[j]['breed_primary'] and row['sex'] == df.iloc[j]['sex']:
            # ...confronta le colonne 'description'
            description1 = set(row['description'].split())
            description2 = set(df.iloc[j]['description'].split())
            if len(description1 & description2) / len(description1 | description2) >= 0.9:
                # se i record condividono almeno il 90% delle parole, considerali duplicati
                duplicates.append(row)

# visualizza i duplicati
print(duplicates)

[id                                                                                                          46038703
breed_primary                                                                                      Italian Greyhound
sex                                                                                                           Female
description      Cashew is sweet girl who was born around July 8th With her long legs she should grow into a grac...
Name: 11, dtype: object, id                                                                                    45924189
breed_primary                                                                   Dogo Argentino
sex                                                                                       Male
description      Casan is a big dude but hes pretty relaxed and chill He likes treats and toys
Name: 181, dtype: object, id                                                                                                  