You have to work on the [Dogs adoptions](https://drive.google.com/file/d/1wQsA0oB6wwYlnkvvcyBCmLk7QmgVWNax/view?usp=sharing) dataset. 

It contains three files:
*  `dogs.csv`, *dogs* for short
*  `dogTravel.csv`, *travels* for short
*  `NST-EST2021-POP.csv`

### Notes

1.    It is mandatory to use GitHub for developing the project.
1.    The project must be a Jupyter notebook.
1.    There is no restriction on the libraries that can be used, nor on the Python version.
1.    All questions on the project **must** be asked in a public channel on [Zulip](https://focs.zulipchat.com).
1.    At most 3 students can be in each group. You must create the groups by yourself.
1.    You do not have to send me the project *before* the discussion.

### 0.1 Importing files

In [1]:
# Importing Pandas
import pandas as pd

# Opening dogs.csv and checking columns
with open("dogs.csv", "r") as dogs_file:
    headers = dogs_file.readline()
    print(headers)

id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,color_secondary,color_tertiary,age,sex,size,coat,fixed,house_trained,declawed,special_needs,shots_current,env_children,env_dogs,env_cats,name,status,posted,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost



In [2]:
# Creating 'dogs' df
# ('quotechar' is the parameter that takes the quote character; 'doublequote' expects a boolean)
dogs = pd.read_csv("dogs.csv", sep=',', quotechar='"', low_memory=False)

# Checking the head
dogs.head()

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...,70,124.81
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...,49,122.07
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,,False,False,Brindle,...,Mesquite,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...,87,281.51
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,,False,False,,...,Pahrump,NV,89048,US,89009,2019-09-20,Dog,,62,145.83
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,,False,False,,...,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...,93,241.09


### 0.2.1 Dogs df cleaning

In [3]:
# For cleaning purposes, several temporary dataframes are created from here on.
# tmp_dog_full is the original dataframe; its shape is printed below.
tmp_dog_full = pd.read_csv("dogs.csv", sep=',', quotechar='"', low_memory=False, encoding='utf-8')
print(f'tmp_dog_full shape: {tmp_dog_full.shape}')

# A new column 'ok' flags the rows whose 'contact_state' is NOT numeric.
# A numeric 'contact_state' (a ZIP code) is used as a marker for rows whose
# columns were shifted during parsing and need to be repaired separately.
tmp_dog_full['ok'] = ~tmp_dog_full.contact_state.str.isnumeric()

# Makes all the column names lowercase and replaces dots with underscores in them
tmp_dog_full.columns = [col.lower().replace(".", "_") for col in tmp_dog_full.columns]

# A new dataframe, called tmp_dog_ok is created. It contains the rows
# where 'ok' is True.
tmp_dog_ok = tmp_dog_full[tmp_dog_full.ok == True]
print('tmp_dog_ok:')
display(tmp_dog_ok.head(5))

# Checks that all rows are ok, and prints the number of unique contact_state.
print('check all rows are ok')
print(len(tmp_dog_ok.contact_state.unique()))
tmp_dog_ok.contact_state.unique()

tmp_dog_full shape: (58180, 37)
tmp_dog_ok:


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...,70,124.81,True
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...,49,122.07,True
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,,False,False,Brindle,...,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...,87,281.51,True
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,,False,False,,...,NV,89048,US,89009,2019-09-20,Dog,,62,145.83,True
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,,False,False,,...,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...,93,241.09,True


check all rows are ok
53


array(['NV', 'AZ', 'UT', 'CA', 'AK', 'AL', 'AR', 'CO', 'NY', 'MA', 'CT',
       'RI', 'NJ', 'NH', 'VT', 'MD', 'VA', 'DC', 'PA', 'WV', 'DE', 'FL',
       'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'OH', 'KS', 'KY', 'LA', 'ME',
       'QC', 'NB', 'MI', 'MN', 'WI', 'MO', 'MS', 'MT', 'NC', 'SC', 'ND',
       'NE', 'NM', 'OK', 'OR', 'SD', 'TN', 'TX', 'WA', 'WY'], dtype=object)

In [4]:
# A new df, tmp_dog_not_ok, is created. It only contains the rows from the
# original dataframe tmp_dog_full where the column 'ok' is False.
# This is the complement of the dataframe tmp_dog_ok created previously.
tmp_dog_not_ok = tmp_dog_full[tmp_dog_full.ok == False]
print('tmp_dog_not_ok')
display(tmp_dog_not_ok.head(5))

tmp_dog_not_ok


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41...,Dog,Dog,German Shepherd Dog,,False,False,,...,89146,US,89009,2019-09-20,,Dog,Meet handsome 3 year old Gunther. Gunther came...,108,256.88,False
5549,38169117,AZ414,https://www.petfinder.com/dog/annabelle-annie-...,Dog,Dog,Boxer,Pit Bull Terrier,True,False,Black,...,85249,US,AZ,2019-09-20,,Dog,You can fill out an adoption application onlin...,80,130.77,False
10888,45833989,NY98,https://www.petfinder.com/dog/pepper-courtesy-...,Dog,Dog,Beagle,,False,False,,...,12220,US,CT,2019-09-20,,Dog,This is Pepper. He is a 15 year old tri-color ...,86,180.7,False
11983,45515547,NY98,https://www.petfinder.com/dog/cooper-courtesy-...,Dog,Dog,Mixed Breed,,False,False,,...,12220,US,CT,2019-09-20,,Dog,"Cooper is 13 years old, but according to a ver...",105,400.82,False
12495,45294115,NY98,https://www.petfinder.com/dog/daisy-courtesy-l...,Dog,Dog,Basset Hound,,False,False,Brown / Chocolate,...,12220,US,CT,2019-09-20,,Dog,"â¢Basset Hound, female, â¢10 years \n\nDelig...",57,82.61,False


In [5]:
pd.set_option('display.max_colwidth', 100) #50

# Managing "not ok" dataframe: split name column and shift the others
# Let's see what the dataframe looks like before the cleaning.
print('before')
display(tmp_dog_not_ok.head(1))

# tmp_dog_not_ok_fixed is created, with the same columns and indices as the original dataframe
tmp_dog_not_ok_fixed = pd.DataFrame(columns=tmp_dog_not_ok.columns, index=tmp_dog_not_ok.index)

# Columns 0-23 are copied unchanged. Columns from 25 onward are copied one position
# to the right (into 26 onward), after dropping the empty 'accessed' column.
tmp_dog_not_ok_fixed.iloc[:, 0:24] =  tmp_dog_not_ok.iloc[:, 0:24].copy()
tmp_dog_not_ok_fixed.iloc[:, 26:] =  tmp_dog_not_ok.iloc[:, 25:].drop('accessed', axis = 1).copy()

# Column 24 ('name') holds the name and the status fused together:
# it is split on '",' into the two columns 'name' and 'status'.
tmp_dog_not_ok_fixed.name = tmp_dog_not_ok.name.apply(lambda x : x.split('\",')[0])
tmp_dog_not_ok_fixed.status = tmp_dog_not_ok.name.apply(lambda x : x.split('\",')[1].strip('"'))
print('after')
tmp_dog_not_ok_fixed.head()


before


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41330726/nv/las-vegas/vegas-shepherd-rescue-nv173/?r...,Dog,Dog,German Shepherd Dog,,False,False,,...,89146,US,89009,2019-09-20,,Dog,Meet handsome 3 year old Gunther. Gunther came to us after being returned to the local shelter f...,108,256.88,False


after


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost,ok
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41330726/nv/las-vegas/vegas-shepherd-rescue-nv173/?r...,Dog,Dog,German Shepherd Dog,,False,False,,...,NV,89146,US,89009,2019-09-20,Dog,Meet handsome 3 year old Gunther. Gunther came to us after being returned to the local shelter f...,108,256.88,False
5549,38169117,AZ414,https://www.petfinder.com/dog/annabelle-annie-38169117/az/chandler/underdog-rescue-of-az-az414/?...,Dog,Dog,Boxer,Pit Bull Terrier,True,False,Black,...,AZ,85249,US,AZ,2019-09-20,Dog,You can fill out an adoption application online on our official website.\n\nMEET ANNABELLE or AN...,80,130.77,False
10888,45833989,NY98,https://www.petfinder.com/dog/pepper-courtesy-listing-45833989/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Beagle,,False,False,,...,NY,12220,US,CT,2019-09-20,Dog,This is Pepper. He is a 15 year old tri-color beagle. He is 32 lbs and can still run a mile! He ...,86,180.7,False
11983,45515547,NY98,https://www.petfinder.com/dog/cooper-courtesy-listing-45515547/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Mixed Breed,,False,False,,...,NY,12220,US,CT,2019-09-20,Dog,"Cooper is 13 years old, but according to a very recent vet visit he is in perfect health. He is ...",105,400.82,False
12495,45294115,NY98,https://www.petfinder.com/dog/daisy-courtesy-listing-45294115/ny/albany/peppertree-rescue-ny98/?...,Dog,Dog,Basset Hound,,False,False,Brown / Chocolate,...,NY,12220,US,CT,2019-09-20,Dog,"â¢Basset Hound, female, â¢10 years \n\nDelightful Daisy is a friendly girl looking for a retir...",57,82.61,False


In [6]:
# Combining the two dataframes, 'tmp_dog_ok' and 'tmp_dog_not_ok_fixed' into 'dogs'.
# It concatenates the two dataframes vertically.
print('tmp_dog_ok shape:', tmp_dog_ok.shape)
print('tmp_dog_not_ok shape:', tmp_dog_not_ok.shape)
dogs = pd.concat([tmp_dog_ok, tmp_dog_not_ok_fixed])
print('dogs shape:', dogs.shape)

# Temporary dataframes are deleted.
del tmp_dog_full
del tmp_dog_not_ok
del tmp_dog_not_ok_fixed
del tmp_dog_ok

# Makes all column names lowercase and replaces dots with underscores
dogs.columns = [col.lower().replace(".", "_") for col in dogs.columns]

# Dropping the 'ok' column
dogs.drop('ok', axis=1, inplace=True)
dogs.columns

tmp_dog_ok shape: (58147, 38)
tmp_dog_not_ok shape: (33, 38)
dogs shape: (58180, 38)


Index(['id', 'org_id', 'url', 'type_x', 'species', 'breed_primary',
       'breed_secondary', 'breed_mixed', 'breed_unknown', 'color_primary',
       'color_secondary', 'color_tertiary', 'age', 'sex', 'size', 'coat',
       'fixed', 'house_trained', 'declawed', 'special_needs', 'shots_current',
       'env_children', 'env_dogs', 'env_cats', 'name', 'status', 'posted',
       'contact_city', 'contact_state', 'contact_zip', 'contact_country',
       'stateq', 'accessed', 'type_y', 'description', 'stay_duration',
       'stay_cost'],
      dtype='object')
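
The repair above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's own code: the column set is reduced, the values are invented, and it assumes (as in the real data) that the broken rows are exactly those with a numeric `contact_state` and that the last column of a broken row is empty.

```python
import pandas as pd

# Toy data (invented): in row 1 the 'name' field swallowed 'status',
# so every later column sits one position too far to the left.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "name": ["Rex", 'Gunny",adoptable'],
        "status": ["adoptable", "Las Vegas"],
        "contact_city": ["Reno", "NV"],
        "contact_state": ["NV", "89146"],
        "contact_zip": ["89501", None],
    }
)

# A numeric contact_state marks a shifted row (no missing values in this toy frame).
shifted = df["contact_state"].str.isnumeric()
name_pos = df.columns.get_loc("name")

fixed = df.copy()
# Move every column after 'name' one position to the right for the broken rows.
fixed.loc[shifted, df.columns[name_pos + 2:]] = (
    df.loc[shifted, df.columns[name_pos + 1:-1]].to_numpy()
)
# Split the fused field back into 'name' and 'status'.
parts = df.loc[shifted, "name"].str.split('",', n=1, expand=True)
fixed.loc[shifted, "name"] = parts[0]
fixed.loc[shifted, "status"] = parts[1].str.strip('"')

print(fixed)
```

Working with `get_loc` instead of hard-coded positions keeps the sketch valid if columns are added before `name`.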

### 0.2.2 Travels dataset cleaning

In [7]:
# The travels dataset is stored in the temporary 'tmp_travels' dataframe
tmp_travels = pd.read_csv("dogTravel.csv", sep=',', quotechar='"', low_memory=False).drop('index', axis=1)

# Data exploration
display(tmp_travels.head())
display(tmp_travels.contact_state.unique())
display(tmp_travels[tmp_travels.contact_state == '17325'].id.unique())

# The variable 'anomalies' is created, which contains the unique 'id' values
# of the rows where the 'contact_state' is '17325'
anomalies = tmp_travels[tmp_travels.contact_state == '17325'].id.unique()

# Setting 'contact_state' to 'PA' for every row whose 'id' is among the
# anomalies found above ('17325' is a Pennsylvania ZIP code).
tmp_travels.loc[tmp_travels.id.isin(anomalies), 'contact_state'] = 'PA'
display(tmp_travels[tmp_travels.id.isin(anomalies)])
display(tmp_travels.contact_state.unique())

# Storing the fixed tmp_travels into the final 'travels' dataframe
travels = tmp_travels.copy()
del tmp_travels

Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He lov...,Arkansas,,,
1,44698509,Groveland,FL,"Duke is an almost 2 year old Potcake from Abacos in the Bahamas. He is a happy boy, who loves hi...",Abacos,Bahamas,,
2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star looking to settle down with the right person! \n\nAs you...,Adam,Maryland,,
3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from another rescue ~~Interacted with other dogs and was ...,Adaptil,,True,
4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her people very much and likes getting loved on. She can ...,Afghanistan,,,


array(['MN', 'FL', 'MD', 'CO', 'CT', 'OH', 'AL', 'NY', 'NJ', 'PA', 'VA',
       'GA', 'ME', 'NH', 'MI', 'VT', 'TN', 'WI', 'NM', 'OR', 'WA', 'IA',
       'KY', 'NV', 'UT', 'AZ', 'NC', 'AR', 'MA', 'RI', 'OK', 'CA', 'IN',
       'SC', 'IL', 'MO', 'TX', 'DC', 'KS', 'DE', 'WV', 'NB', 'MS', 'LA',
       '17325'], dtype=object)

array([36978896, 33218331])

Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
2472,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,Maryland,,True,
2473,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",Maryland,,True,
3190,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,New Jersey,,True,
3191,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",New Jersey,,True,
3237,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,New York,,True,
3238,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",New York,,True,
3714,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,Pennsylvania,,True,
3715,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",Pennsylvania,,True,
6029,36978896,PA,PA,Maddie is our little Miss Cutie Patootie! She is a short and stocky malamute girl with so much p...,Virginia,,True,
6030,33218331,PA,PA,"Born in August 2014, Bucky has a great sense of humor and is full of personality. He would love...",Virginia,,True,


array(['MN', 'FL', 'MD', 'CO', 'CT', 'OH', 'AL', 'NY', 'NJ', 'PA', 'VA',
       'GA', 'ME', 'NH', 'MI', 'VT', 'TN', 'WI', 'NM', 'OR', 'WA', 'IA',
       'KY', 'NV', 'UT', 'AZ', 'NC', 'AR', 'MA', 'RI', 'OK', 'CA', 'IN',
       'SC', 'IL', 'MO', 'TX', 'DC', 'KS', 'DE', 'WV', 'NB', 'MS', 'LA'],
      dtype=object)
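
The anomaly above was spotted by reading the list of unique values. A possible way to automate the check is sketched below, under two assumptions that are not in the notebook: a valid code is exactly two uppercase letters, and the ZIP-to-state lookup `zip_to_state` is filled in by hand. The toy data is invented.

```python
import pandas as pd

# Toy data (invented) with one malformed state value repeated over several rows.
demo = pd.DataFrame(
    {
        "id": [10, 11, 10, 12],
        "contact_state": ["17325", "17325", "17325", "MD"],
    }
)

# Flag every value that is not a two-letter uppercase code.
malformed = ~demo["contact_state"].str.fullmatch(r"[A-Z]{2}")
bad_ids = demo.loc[malformed, "id"].unique()

# Hypothetical manual lookup from ZIP code to state code.
zip_to_state = {"17325": "PA"}
demo.loc[malformed, "contact_state"] = demo.loc[malformed, "contact_state"].map(zip_to_state)

print(bad_ids)
print(demo)
```

A malformed value missing from `zip_to_state` would become `NaN`, which makes unresolved cases easy to find afterwards.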

### 0.2.3 States df cleaning

In [8]:
# Storing the states dataset into tmp_states
tmp_states = pd.read_csv("NST-EST2021-POP.csv", header=None, names=["state", "population"], sep=',', low_memory=False)
tmp_states.head()

Unnamed: 0,state,population
0,Alabama,5.024.279
1,Alaska,733.391
2,Arizona,7.151.502
3,Arkansas,3.011.524
4,California,39.538.223


In [9]:
# Removing the '.' thousands separators from the 'population' column and casting it to int
tmp_states.population = tmp_states.population.str.replace('.', '', regex=False).astype(int)

# Creating the final 'states' dataframe
states = tmp_states.copy()
del tmp_states
states.head()

Unnamed: 0,state,population
0,Alabama,5024279
1,Alaska,733391
2,Arizona,7151502
3,Arkansas,3011524
4,California,39538223
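
As an alternative to the string replacement, `read_csv` can strip the separators while parsing through its `thousands` parameter. The sketch below reads two lines copied from the output above from an in-memory string. Setting `decimal=','` is an assumption made here so that the thousands and decimal markers do not clash.

```python
import io
import pandas as pd

# Two rows in the same layout as NST-EST2021-POP.csv (no header line).
raw = "Alabama,5.024.279\nAlaska,733.391\n"

states_demo = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["state", "population"],
    thousands=".",   # treat '.' as a thousands separator
    decimal=",",     # assumption: keep the decimal marker distinct from '.'
)

print(states_demo)
print(states_demo.dtypes)
```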


### 1. Extract all dogs with status that is not *adoptable*

In [10]:
pd.set_option('display.max_rows', 10) #50

print(dogs[dogs.status != 'adoptable'].shape)
not_adoptable_dogs = dogs[dogs.status != 'adoptable']

display(not_adoptable_dogs)

(33, 37)


Unnamed: 0,id,org_id,url,type_x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateq,accessed,type_y,description,stay_duration,stay_cost
644,41330726,NV173,https://www.petfinder.com/dog/gunther-gunny-41330726/nv/las-vegas/vegas-shepherd-rescue-nv173/?r...,Dog,Dog,German Shepherd Dog,,False,False,,...,Las Vegas,NV,89146,US,89009,2019-09-20,Dog,Meet handsome 3 year old Gunther. Gunther came to us after being returned to the local shelter f...,108,256.88
5549,38169117,AZ414,https://www.petfinder.com/dog/annabelle-annie-38169117/az/chandler/underdog-rescue-of-az-az414/?...,Dog,Dog,Boxer,Pit Bull Terrier,True,False,Black,...,Chandler,AZ,85249,US,AZ,2019-09-20,Dog,You can fill out an adoption application online on our official website.\n\nMEET ANNABELLE or AN...,80,130.77
10888,45833989,NY98,https://www.petfinder.com/dog/pepper-courtesy-listing-45833989/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Beagle,,False,False,,...,Albany,NY,12220,US,CT,2019-09-20,Dog,This is Pepper. He is a 15 year old tri-color beagle. He is 32 lbs and can still run a mile! He ...,86,180.7
11983,45515547,NY98,https://www.petfinder.com/dog/cooper-courtesy-listing-45515547/ny/albany/peppertree-rescue-ny98/...,Dog,Dog,Mixed Breed,,False,False,,...,Albany,NY,12220,US,CT,2019-09-20,Dog,"Cooper is 13 years old, but according to a very recent vet visit he is in perfect health. He is ...",105,400.82
12495,45294115,NY98,https://www.petfinder.com/dog/daisy-courtesy-listing-45294115/ny/albany/peppertree-rescue-ny98/?...,Dog,Dog,Basset Hound,,False,False,Brown / Chocolate,...,Albany,NY,12220,US,CT,2019-09-20,Dog,"â¢Basset Hound, female, â¢10 years \n\nDelightful Daisy is a friendly girl looking for a retir...",57,82.61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56013,45916348,WA581,https://www.petfinder.com/dog/cody-45916348/wa/seattle/6dogrees-rescue-wa581/?referrer_id=87b31e...,Dog,Dog,Golden Retriever,Terrier,True,False,Bicolor,...,Seattle,WA,98106,US,WA,2019-09-20,Dog,"*Please apply online at www.6dogrees.com to get a faster response.\nMeet \""Cody\"" He is a 3 yr o...",92,164.55
56248,45733027,WA581,https://www.petfinder.com/dog/gracie-45733027/wa/seattle/6dogrees-rescue-wa581/?referrer_id=87b3...,Dog,Dog,Papillon,Cavalier King Charles Spaniel,True,False,Golden,...,Seattle,WA,98106,US,WA,2019-09-20,Dog,"Meet \""Gracie\"" a very beautiful 11 pound,female Spaniel mix. 6dogrees Rescue saved Gracie rom a...",82,168.54
56464,45413997,WA581,https://www.petfinder.com/dog/jameson-45413997/wa/seattle/6dogrees-rescue-wa581/?referrer_id=87b...,Dog,Dog,Rat Terrier,Chihuahua,True,False,"Tricolor (Brown, Black, & White)",...,Seattle,WA,98106,US,WA,2019-09-20,Dog,"Meet \""Jameson\"" He is a very handsome 9 pound, tri-color, smooth coat Rat terrier mix. 6dogree...",108,297.27
56473,45406516,WA581,https://www.petfinder.com/dog/canelo-45406516/wa/seattle/6dogrees-rescue-wa581/?referrer_id=87b3...,Dog,Dog,Chihuahua,Terrier,True,False,Bicolor,...,Seattle,WA,98106,US,WA,2019-09-20,Dog,"Meet Canelo a adorable small, tan very sweet, adult Chihuahua. \nCanelo weighs 6 pounds and is...",94,312.1


### 2. For each (primary) breed, determine the number of dogs

In [11]:
dogs['breed_primary'].value_counts()

Pit Bull Terrier                7890
Labrador Retriever              7198
Chihuahua                       3766
Mixed Breed                     3242
Terrier                         2641
                                ... 
Wirehaired Pointing Griffon        1
Boykin Spaniel                     1
Old English Sheepdog               1
Belgian Shepherd / Laekenois       1
Tosa Inu                           1
Name: breed_primary, Length: 216, dtype: int64

### 3. For each (primary) breed, determine the ratio between the number of dogs of `Mixed Breed` and those not of Mixed Breed. Hint: look at the `breed_secondary`.

In [12]:
print('distinct breed: ', len(dogs.breed_primary.unique()))
breeds = dogs.groupby('breed_primary', as_index=False).count()[['breed_primary','id']].rename({'id': 'number_of_dogs'}, axis=1).sort_values(by='number_of_dogs', ascending=False)
breeds

distinct breed:  216


Unnamed: 0,breed_primary,number_of_dogs
157,Pit Bull Terrier,7890
124,Labrador Retriever,7198
57,Chihuahua,3766
138,Mixed Breed,3242
195,Terrier,2641
...,...,...
88,Field Spaniel,1
41,Boykin Spaniel,1
190,Spinone Italiano,1
200,Tosa Inu,1


In [13]:
## compute total mixed dogs by primary breed
sec_breeds = dogs[dogs.breed_secondary.notnull()]
sec_breeds = sec_breeds.groupby('breed_primary', as_index=False).count()[['breed_primary','id']].rename({'id': 'number_of_dogs'}, axis=1)
sec_breeds.head()


Unnamed: 0,breed_primary,number_of_dogs
0,Affenpinscher,2
1,Afghan Hound,1
2,Airedale Terrier,9
3,Akita,52
4,Alaskan Malamute,14


In [14]:
## compute ratios
mix_breeds = breeds.merge(sec_breeds, on='breed_primary', how='left', suffixes=('_tot','_mixed'))
mix_breeds['number_of_dogs_mixed'] = mix_breeds['number_of_dogs_mixed'].fillna(0)
mix_breeds['mixed_ratio_perc'] = mix_breeds.apply(lambda x : round(x.number_of_dogs_mixed/x.number_of_dogs_tot, 2)*100, axis=1)
mix_breeds['pure_ratio_perc'] = mix_breeds.apply(lambda x : 100 - x.mixed_ratio_perc, axis=1)
mix_breeds

Unnamed: 0,breed_primary,number_of_dogs_tot,number_of_dogs_mixed,mixed_ratio_perc,pure_ratio_perc
0,Pit Bull Terrier,7890,2243.0,28.0,72.0
1,Labrador Retriever,7198,3228.0,45.0,55.0
2,Chihuahua,3766,1085.0,29.0,71.0
3,Mixed Breed,3242,114.0,4.0,96.0
4,Terrier,2641,822.0,31.0,69.0
...,...,...,...,...,...
211,Field Spaniel,1,0.0,0.0,100.0
212,Boykin Spaniel,1,0.0,0.0,100.0
213,Spinone Italiano,1,0.0,0.0,100.0
214,Tosa Inu,1,0.0,0.0,100.0


### 4. For each (primary) breed, determine the earliest and the latest `posted` timestamp.

In [15]:
## Formatting the 'posted' column
dogs['posted'] = pd.to_datetime(dogs['posted'], errors="coerce")

## Creating the df with earliest and latest 'posted' timestamps
earliest_latest_timestamp = dogs.groupby('breed_primary', as_index=False).aggregate({'posted':['min', 'max']})

earliest_latest_timestamp

Unnamed: 0_level_0,breed_primary,posted,posted
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max
0,Affenpinscher,2012-03-08 10:27:33+00:00,2019-09-14 10:10:51+00:00
1,Afghan Hound,2017-06-29 23:28:51+00:00,2019-07-27 00:38:48+00:00
2,Airedale Terrier,2014-06-13 12:59:36+00:00,2019-09-19 18:40:39+00:00
3,Akbash,2019-07-21 00:35:59+00:00,2019-08-23 17:11:04+00:00
4,Akita,2012-03-03 09:31:08+00:00,2019-09-20 15:19:57+00:00
...,...,...,...
211,Wirehaired Pointing Griffon,2016-06-29 20:03:55+00:00,2016-06-29 20:03:55+00:00
212,Wirehaired Terrier,2012-11-27 14:07:54+00:00,2019-09-19 22:52:45+00:00
213,Xoloitzcuintli / Mexican Hairless,2007-02-01 00:00:00+00:00,2019-09-08 11:15:54+00:00
214,Yellow Labrador Retriever,2010-05-31 00:00:00+00:00,2019-09-20 06:30:27+00:00


### 5. For each state, compute the sex imbalance, that is the difference between male and female dogs. In which state is this imbalance largest?

In [16]:
malefemale = dogs[['contact_state', 'contact_city', 'contact_zip', 'contact_country', 'sex']].copy()
malefemale['imbalance'] = malefemale.sex.apply(lambda x : 1 if x.upper() == 'MALE' else -1)

# Summing the +1/-1 values per state gives (males - females); the two extremes are shown
malefemale_imbalance = malefemale.groupby('contact_state', as_index=False)['imbalance'].sum()
malefemale_imbalance.loc[[malefemale_imbalance.imbalance.idxmin(), malefemale_imbalance.imbalance.idxmax()]]

Unnamed: 0,contact_state,imbalance
5,CO,-51
36,OH,205


### 6. For each pair (age, size), determine the average duration of the stay and the average cost of stay.

In [17]:
dogs.stay_duration = dogs.stay_duration.astype(int)
dogs.stay_cost = dogs.stay_cost.astype(float)
stay = dogs.groupby(['age', 'size'], as_index=False).agg({'stay_duration' : 'mean', 'stay_cost' : 'mean'})
stay.stay_duration = stay.stay_duration.apply(lambda x : round(x, 2))
stay.stay_cost = stay.stay_cost.apply(lambda x : round(x, 2))
stay

Unnamed: 0,age,size,stay_duration,stay_cost
0,Adult,Extra Large,89.02,232.59
1,Adult,Large,89.53,238.66
2,Adult,Medium,89.42,238.26
3,Adult,Small,89.41,238.97
4,Baby,Extra Large,87.03,237.18
...,...,...,...,...
11,Senior,Small,89.07,238.28
12,Young,Extra Large,90.59,245.84
13,Young,Large,90.10,238.15
14,Young,Medium,89.52,239.30


### 7. Find the dogs involved in at least 3 travels. Also list the breed of those dogs.

In [18]:
many_travels = travels[['id', 'contact_state']].groupby('id', as_index=False).count().rename({'contact_state':'travels'}, axis=1)
many_travels = many_travels[many_travels.travels > 2]
many_travels

Unnamed: 0,id,travels
5,16657005,4
9,20905974,5
17,24894870,4
18,24894894,4
55,33218331,7
...,...,...
4110,46042569,3
4111,46042587,3
4112,46042618,3
4113,46043099,3


In [19]:
#just a check
travels[travels.id == 46042569]

Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
1977,46042569,Baltimore,MD,Willow #4 is a 2-year-old spayed female yellow lab. Willow weighs 58 pounds and is up to date on...,Lab Rescue LRCP,,True,
2364,46042569,Baltimore,MD,Willow #4 is a 2-year-old spayed female yellow lab. Willow weighs 58 pounds and is up to date on...,Maryland,,True,
5903,46042569,Baltimore,MD,Willow #4 is a 2-year-old spayed female yellow lab. Willow weighs 58 pounds and is up to date on...,Virginia,,True,


In [20]:
breed_travels = many_travels.merge(dogs[['id', 'breed_primary']], left_on='id', right_on='id')

# print the result
breed_travels.sort_values('travels', ascending=False)

Unnamed: 0,id,travels,breed_primary
68,44759410,11,German Shepherd Dog
67,44759409,11,German Shepherd Dog
178,45728583,7,Alaskan Malamute
142,45537987,7,Alaskan Malamute
55,44572953,7,Alaskan Malamute
...,...,...,...
226,45831317,3,Shiba Inu
224,45831313,3,Chihuahua
223,45831312,3,Chihuahua
222,45831310,3,Labrador Retriever


### 8. Fix the `travels` table so that the correct state is computed from  the `manual` and the `found` fields. If `manual` is not missing, then it overrides what is stored in `found`.

In [21]:
# Creating a copy
exercise_8 = travels.copy()
exercise_8.found = exercise_8.apply(lambda x : x['found'] if pd.isnull(x['manual']) else x['manual'] ,axis=1)
exercise_8

Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made his long trek up her from Arkansas on 4/2019. He lov...,Arkansas,,,
1,44698509,Groveland,FL,"Duke is an almost 2 year old Potcake from Abacos in the Bahamas. He is a happy boy, who loves hi...",Bahamas,Bahamas,,
2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star looking to settle down with the right person! \n\nAs you...,Maryland,Maryland,,
3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from another rescue ~~Interacted with other dogs and was ...,Adaptil,,True,
4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her people very much and likes getting loved on. She can ...,Afghanistan,,,
...,...,...,...,...,...,...,...,...
6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc.org) for more information about this pet.\n\nMy na...,WV,,True,
6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed Bull Breed mix. She came from a shelter in Wyoming...,Wyoming,,,
6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescued from a high kill shelter with his girlfriend Yaz...,Yazmin,,True,
6192,44519341,Dayton,OH,Callie is a 14 year old Chihuahua whose owner died and whose family couldn't keep her permanentl...,Ohio,Ohio,,


### 9. For each state, compute the ratio between the number of travels and the population.

In [22]:
# load a lookup table mapping state names to state codes
abbreviations = pd.read_csv("abbreviations.csv", sep=',', quotechar='"')
print(abbreviations.shape)
abbreviations

(51, 3)


Unnamed: 0,state,abbrev,code
0,Alabama,Ala.,AL
1,Alaska,Alaska,AK
2,Arizona,Ariz.,AZ
3,Arkansas,Ark.,AR
4,California,Calif.,CA
...,...,...,...
46,Virginia,Va.,VA
47,Washington,Wash.,WA
48,West Virginia,W.Va.,WV
49,Wisconsin,Wis.,WI


In [23]:
## check if merge would be ok
anomalies = [s for s in states['state'].unique() if s not in abbreviations.state.unique()]
print(f"anomalies number in states df: {len(anomalies)}")

anomalies = [s for s in travels['contact_state'].unique() if s not in abbreviations.code.unique()]
print(f"anomalies number in travels df: {len(anomalies)}")

print('next state is missing in file of population: located in Canada')
travels[travels['contact_state'].isin(anomalies)]


anomalies number in states df: 0
anomalies number in travels df: 1
next state is missing in file of population: located in Canada


Unnamed: 0,id,contact_city,contact_state,description,found,manual,remove,still_there
1147,33842633,Florenceville,NB,Our rescues mean a lot to the vet staff and DunRoamin' volunteers. We do everything we can to fi...,Florenceville,,True,
4065,40589804,Florenceville,NB,Our rescues mean a lot to the vet staff and DunRoamin' volunteers. We do everything we can to fi...,Saint John,,True,


In [24]:
travels_by_states = travels.groupby('contact_state', as_index=False).count()[['contact_state', 'id']].rename({'id' : 'travels', 'contact_state': 'code'}, axis=1)
travels_by_states

Unnamed: 0,code,travels
0,AL,75
1,AR,10
2,AZ,70
3,CA,28
4,CO,103
...,...,...
39,VA,1025
40,VT,49
41,WA,634
42,WI,83


In [25]:
# match state name to state code
states_travels = states.merge(abbreviations[['state', 'code']], on='state')
states_travels

Unnamed: 0,state,population,code
0,Alabama,5024279,AL
1,Alaska,733391,AK
2,Arizona,7151502,AZ
3,Arkansas,3011524,AR
4,California,39538223,CA
...,...,...,...
46,Virginia,8631393,VA
47,Washington,7705281,WA
48,West Virginia,1793716,WV
49,Wisconsin,5893718,WI


In [26]:
## fill missing values and compute the ratio
states_travels = states_travels.merge(travels_by_states, on='code', how='left').fillna(0)
states_travels['travels_per_people'] = states_travels.apply(lambda x : x['travels']/x['population'], axis=1)
states_travels

Unnamed: 0,state,population,code,travels,travels_per_people
0,Alabama,5024279,AL,75.0,1.492751e-05
1,Alaska,733391,AK,0.0,0.000000e+00
2,Arizona,7151502,AZ,70.0,9.788154e-06
3,Arkansas,3011524,AR,10.0,3.320578e-06
4,California,39538223,CA,28.0,7.081755e-07
...,...,...,...,...,...
46,Virginia,8631393,VA,1025.0,1.187526e-04
47,Washington,7705281,WA,634.0,8.228123e-05
48,West Virginia,1793716,WV,27.0,1.505255e-05
49,Wisconsin,5893718,WI,83.0,1.408279e-05


### 10. For each dog, compute the number of days from the `posted` day to the day of last access.

In [27]:
# Creating a df copy for this exercise
exercise_10 = dogs[['id', 'name', 'posted', 'accessed']].copy()

# Computing the number of days from the 'posted' day to the day of last access (assumed to be the 'accessed' column).
# The value is stored in the 'days_delay' column.
# 'posted' is reduced to its calendar date (this also drops the timezone), so both columns are naive midnight timestamps.
exercise_10['posted'] = pd.to_datetime(pd.to_datetime(exercise_10['posted']).dt.date)
exercise_10['accessed'] = pd.to_datetime(exercise_10['accessed'])
exercise_10['days_delay'] = (exercise_10['accessed'].dt.normalize() - exercise_10['posted']).dt.days

# Printing the result
exercise_10

Unnamed: 0,id,name,posted,accessed,days_delay
0,46042150,HARLEY,2019-09-20,2019-09-20,0
1,46042002,BIGGIE,2019-09-20,2019-09-20,0
2,46040898,Ziggy,2019-09-20,2019-09-20,0
3,46039877,Gypsy,2019-09-20,2019-09-20,0
4,46039306,Theo,2019-09-20,2019-09-20,0
...,...,...,...,...,...
56013,45916348,\Cody\,2019-09-09,2019-09-20,11
56248,45733027,\Gracie\,2019-08-24,2019-09-20,27
56464,45413997,\Jameson\,2019-07-31,2019-09-20,51
56473,45406516,\Canelo\,2019-07-29,2019-09-20,53
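
A minimal sketch of the same day computation on two made-up rows. The timestamp format used for `posted` (an ISO string with a UTC offset) is an assumption about the raw file. The idea is to parse both columns, drop the time of day and the timezone from `posted`, and subtract the datetime columns directly.

```python
import pandas as pd

# Made-up rows; the 'posted' timestamp format is an assumption
toy = pd.DataFrame({
    "posted": ["2019-09-09T10:15:00+0000", "2019-07-29T23:59:00+0000"],
    "accessed": ["2019-09-20", "2019-09-20"],
})

# Parse as UTC, make the values timezone-naive, then truncate to midnight
posted = pd.to_datetime(toy["posted"], utc=True).dt.tz_localize(None).dt.normalize()
accessed = pd.to_datetime(toy["accessed"])

# The difference of two datetime columns is a timedelta column
toy["days_delay"] = (accessed - posted).dt.days
print(toy)
```

Normalising `posted` to midnight makes the result a count of calendar days rather than elapsed 24-hour periods.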


### 11. Partition the dogs according to the number of weeks from the `posted` day to the day of last access.

In [28]:
# Creating a df copy for this exercise (an explicit copy, so exercise_10 is left untouched)
exercise_11 = exercise_10.copy()

# Creating a new column, 'weeks', that stores the number of whole weeks from the posted day to the day of last access
exercise_11["weeks"] = (exercise_11["days_delay"] // 7).astype(int)

# Grouping the dogs in different partitions, based on 'weeks' value
partitioned_dogs = exercise_11.groupby("weeks").count()[['id']].rename({'id': 'number_of_dogs'}, axis=1)
# Printing them
partitioned_dogs

Unnamed: 0_level_0,number_of_dogs
weeks,Unnamed: 1_level_1
0,9803
1,6547
2,5764
3,3353
4,2439
...,...
729,1
746,1
811,1
812,1
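
The table above reports the size of each partition. If the members of each partition are needed as well, one possible sketch (on made-up ids and delays, not on the real data) is to collect the ids per week with `groupby`:

```python
import pandas as pd

# Made-up ids and delays, for illustration only
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5], "days_delay": [0, 6, 7, 13, 53]})

# Whole weeks elapsed: integer division by 7
toy["weeks"] = toy["days_delay"] // 7

# One list of ids per week number
partition = toy.groupby("weeks")["id"].apply(list).to_dict()
print(partition)
```

Every dog falls in exactly one week bucket, so the buckets form a partition of the dataset.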


### 12. Find duplicates in the `dogs` dataset. Two records are duplicates if (1) they have the same breeds and sex, and (2) they share at least 90% of the words in the description field. Extra points if you find and implement a more refined method for determining if two rows are duplicates.

In [29]:
# lowercase, remove punctuation, tokenize, lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop = stopwords.words('english')
stop.extend(['dog', 'dogs', '-', 'old'])

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/simonebellavia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/simonebellavia/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/simonebellavia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
# First step: remove dogs with no description
descripted_dogs = dogs[dogs['description'].notnull()]
print(f'{len(descripted_dogs)} have description')

49475 have description


In [31]:
# Second step: find exact duplicates, i.e. dogs with equal 'breed_primary', 'sex' and 'description'
# keep=False flags every member of a duplicated group, not only the repeated rows
duplicate_key = ['breed_primary', 'sex', 'description']
is_duplicated = descripted_dogs.duplicated(subset=duplicate_key, keep=False)
duplicated_dogs = descripted_dogs.loc[is_duplicated, ['id'] + duplicate_key].drop_duplicates()
duplicated_dogs_id = list(duplicated_dogs['id'])

print(f'found {len(duplicated_dogs_id)} duplicated dogs on {descripted_dogs.shape[0]} dogs with description')

duplicated_dogs.sort_values(by=['sex','breed_primary'])

found 2775 duplicated dogs on 49475 dogs with description


Unnamed: 0,id,breed_primary,sex,description
39500,46023963,Alaskan Malamute,Female,My adoption fee has been fully sponsored. Adopt me and pay $0.00!!!
39499,46023964,Alaskan Malamute,Female,My adoption fee has been fully sponsored. Adopt me and pay $0.00!!!
7792,45639446,American Bulldog,Female,Lola Jewel is a jewel her foster says! She is about 3yrs old (we guesstimated a birthday of 8/7/...
8003,45252099,American Bulldog,Female,All people inquiring about pets listed as Courtesy Listings on the EAPL website and/or EAPL post...
8257,44517341,American Bulldog,Female,Meet Tyla! She is an adorable American Bulldog mix. She is 4.11 yo (estimated bd 8/12/2014.) She...
...,...,...,...,...
21363,45451610,Yorkshire Terrier,Male,YOU MUST APPLY ON OUR WEBSITE\nhttp://www.yorkierescueme.com\nBEFORE CONTACTING MY FOSTER HOME!\...
17610,45572590,Yorkshire Terrier,Male,YOU MUST APPLY ON OUR WEBSITE\nhttp://www.yorkierescueme.com\nBEFORE CONTACTING MY FOSTER HOME!\...
17809,45451786,Yorkshire Terrier,Male,YOU MUST APPLY ON OUR WEBSITE\nhttp://www.yorkierescueme.com\nBEFORE CONTACTING MY FOSTER HOME!\...
34992,37010444,Mixed Breed,Unknown,No Notes
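
The table above lists the exact duplicates row by row. To see which rows belong together, one option is to give every group of identical records a shared label. The sketch below uses made-up rows and an assumed column name `group`; `ngroup()` numbers the groups in the sorted order of their keys.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "breed_primary": ["Beagle", "Beagle", "Beagle", "Pug", "Pug"],
    "sex": ["Male"] * 5,
    "description": ["No Notes", "No Notes", "Friendly", "No Notes", "No Notes"],
})
key = ["breed_primary", "sex", "description"]

# Keep every member of a group that occurs more than once
dups = toy[toy.duplicated(subset=key, keep=False)].copy()

# Label each group of identical records with an integer
dups["group"] = dups.groupby(key).ngroup()
print(dups[["id", "group"]])
```

Counting rows per label then shows which boilerplate descriptions (such as "No Notes") are repeated most often.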


In [32]:
# Third step: remove the exact duplicates found above, so that only the remaining rows are checked for near-duplicates

print(f'{len(descripted_dogs)} - {len(duplicated_dogs)}' )

# Select the index labels of the rows that also appear in duplicated_dogs
# isin() with a DataFrame argument matches on both index and column labels
idx = descripted_dogs.index[descripted_dogs.isin(duplicated_dogs).any(axis=1)]

# Drop the matching rows, i.e. descripted_dogs minus duplicated_dogs, and store the result in dogs_12
# The explicit copy avoids SettingWithCopyWarning when columns are added later
dogs_12 = descripted_dogs.drop(idx).copy()

print(f'remain {len(dogs_12)} on {len(descripted_dogs)} dogs')

### ALTERNATIVE ###
# substraction_df = pd.merge(descripted_dogs, duplicated_dogs, how='outer', indicator='Duplicated')
# dogs_12 = substraction_df[substraction_df["Duplicated"] == 'left_only']
### Either version can be used; the resulting count is the same ###


49475 - 2775
remain 46700 on 49475 dogs
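
The subtraction of one dataframe from another is an anti-join. A minimal sketch on made-up data, assuming the rows can be identified by a single `id` column (an assumption that may not hold if ids repeat in the real dataset), shows two ways to write it:

```python
import pandas as pd

# Made-up data, for illustration only
left = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
to_remove = pd.DataFrame({"id": [2, 4]})

# Variant 1: left merge with an indicator column, keep the rows found only on the left
merged = left.merge(to_remove, on="id", how="left", indicator=True)
remaining_merge = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Variant 2: boolean mask on the key column
remaining_mask = left[~left["id"].isin(to_remove["id"])]

print(remaining_merge)
print(remaining_mask)
```

The mask variant preserves the original index, while the merge variant builds a new one.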


In [33]:
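# Normalise the descriptions: lowercase them and replace every character that is not
# a word character, a space, '.', '+' or '@' with a space (so e-mail addresses and URLs survive)
dogs_12['lemm_description'] = dogs_12['description'].str.lower().str.replace(r'[^\w .+@]', ' ', regex=True)
# Drop the period that ends a sentence (a word character followed by '. ') and any leading or trailing periods
dogs_12['lemm_description'] = dogs_12['lemm_description'].str.replace(r'(\w)(\. )', r'\1 ', regex=True).str.strip('.')
# Remove the stop words and lemmatize the remaining words
stop_set = set(stop)  # set membership is faster than scanning the list for every word
dogs_12['lemm_description'] = dogs_12['lemm_description'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split() if word not in stop_set]))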
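
To check what the normalisation does to a single string, the regular-expression steps can be tried in isolation. This is only a sketch: the function name `normalize`, the tiny stop list and the sample sentence are made up, and lemmatization is left out so that the snippet runs without downloading NLTK data.

```python
import re

# Tiny made-up stop list; the notebook uses the NLTK English list plus a few extras
STOP = {"is", "a", "and", "the", "dog"}

def normalize(text):
    text = text.lower()
    # Keep word characters, spaces, '.', '+' and '@'; everything else becomes a space
    text = re.sub(r"[^\w .+@]", " ", text)
    # Drop a period that ends a sentence (word character followed by '. ')
    text = re.sub(r"(\w)\. ", r"\1 ", text)
    # Drop leading and trailing periods
    text = text.strip(".")
    # Remove stop words and collapse repeated whitespace
    return " ".join(word for word in text.split() if word not in STOP)

sample = "Harley is a sweet dog! Email: info@example.org. He loves kids."
print(normalize(sample))
```

The period inside `info@example.org` is kept because it is not followed by a space.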

In [34]:
from collections import Counter
pd.set_option('display.max_colwidth', 500)  # the default is 50

# Glue a number to the words around it (e.g. 'about 3 year' -> 'about3year')
dogs_12['cleaned_description'] = dogs_12['lemm_description'].str.replace(r'(\w+)? ?(\d+) (\w+)', r'\1\2\3', regex=True)
# Remove single-character tokens
dogs_12['cleaned_description'] = dogs_12['cleaned_description'].str.replace(r' \w ', ' ', regex=True)
# Collapse runs of whitespace
dogs_12['cleaned_description'] = dogs_12['cleaned_description'].str.replace(r'\s+', ' ', regex=True)
print(f'rows before pruning: {len(dogs_12)}')
dogs_12 = dogs_12[dogs_12['cleaned_description'].notnull()].copy()
print(f'rows after pruning: {len(dogs_12)}')
# Word frequencies of each description
dogs_12['description_counter'] = dogs_12['cleaned_description'].apply(lambda x: dict(Counter(x.split())))
# Set of distinct words of each description (despite the column name, this is a set)
dogs_12['description_dictionary'] = dogs_12['description_counter'].apply(lambda x: set(x.keys()))

rows before pruning: 46700
rows after pruning: 46700
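
The requirement "share at least 90% of the words" can be read in more than one way once each description is a set of words. The sketch below compares two possible readings on made-up word sets; the function names `jaccard` and `containment` are illustrative, and which reading the assignment intends is an assumption either way.

```python
def jaccard(a, b):
    """Shared words over all distinct words of the two descriptions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def containment(a, b):
    """Shared words over the size of the shorter description."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    return len(a & b) / min(len(a), len(b))

# Made-up word sets: the second description is the first plus two extra words
short = set("sweet playful puppy love kid cat walk leash crate trained".split())
longer = short | {"neutered", "vaccinated"}

print(round(jaccard(short, longer), 3))   # 10 shared words out of 12 distinct words
print(containment(short, longer))         # all 10 words of the shorter set are shared
```

With a 0.9 threshold this pair is a duplicate under containment but not under the Jaccard index, so the choice of measure changes the result.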


In [35]:
# Compute the share of words that survive the cleaning, over the whole corpus of descriptions
all_description_words = dogs_12.description.apply(lambda x : len(str(x).split())).sum()
all_cleaned_words = dogs_12.cleaned_description.apply(lambda x : len(str(x).split())).sum()
ratio_cleaned_words = round(100*all_cleaned_words/all_description_words,2)

print(f"""{ratio_cleaned_words}% of the words remain after cleaning""")

49.12% of the words remain after cleaning


In [36]:
# clusters = dogs_12[['breed_primary', 'sex']].drop_duplicates().shape
# print(f'number of cluster: {clusters}')
print(f'dogs before clustering: {dogs_12.shape[0]}')
dogs_clusters = dogs_12.groupby(['breed_primary', 'sex'])[['id']].count().reset_index().rename(columns={'id':'counts'})
print(f'dogs after clustering: {dogs_clusters.counts.sum()}')
print(f'dogs clusters: {dogs_clusters.shape}')
dogs_clusters

dogs before clustering: 46700
dogs after clustering: 46700
dogs clusters: (401, 3)


Unnamed: 0,breed_primary,sex,counts
0,Affenpinscher,Female,7
1,Affenpinscher,Male,7
2,Afghan Hound,Female,1
3,Afghan Hound,Male,3
4,Airedale Terrier,Female,6
...,...,...,...
396,Xoloitzcuintli / Mexican Hairless,Male,7
397,Yellow Labrador Retriever,Female,69
398,Yellow Labrador Retriever,Male,75
399,Yorkshire Terrier,Female,140


In [37]:
# List that collects the duplicate pairs
duplicated_couples = []
counter = 0  # number of pairwise comparisons performed
threshold = 0.9

# Process one sex at a time to reduce the work per iteration
for sex in ['Male', 'Female']:
    clusters_by_sex = dogs_clusters[dogs_clusters['sex'] == sex][['breed_primary','counts']]
    dogs_by_sex = dogs_12[dogs_12['sex'] == sex]
    cluster_size = clusters_by_sex.shape[0]
    print(f'sex: {sex}')
    cluster_number = 0
    
    # Analyse one cluster (breed) at a time
    for breed_primary, counts in clusters_by_sex.values:
        
        cluster_number = cluster_number + 1 
        print(f'processing cluster number: {cluster_number} of {cluster_size}--> {breed_primary} ({counts})')
        
        this_cluster = dogs_by_sex[dogs_by_sex['breed_primary']==breed_primary]
        duplicated_id_already_found = set()
        
        # Compare each record with the following ones in the cluster
        for i in range(0, counts-1):
            first_dog = this_cluster.iloc[i]
            
            # By transitivity, if descA == descB and descA == descC then descB == descC,
            # so a dog already found to be an exact duplicate is skipped
            if first_dog['id'] in duplicated_id_already_found:
                continue
            
            desc1 = first_dog['cleaned_description']
            set1 = first_dog['description_dictionary']
            
            for j in range(i+1, counts):
                counter = counter + 1
                second_dog = this_cluster.iloc[j]
                desc2 = second_dog['cleaned_description']
                set2 = second_dog['description_dictionary']
                
                if desc1 == desc2:
                    duplicated_couples.append({'sex': sex, 'breed_primary':breed_primary, 'first':first_dog['id'], 'second':second_dog['id'], 'overlap_ratio':1})
                    duplicated_id_already_found.add(second_dog['id'])
                else:
                    # Otherwise compare the word sets: shared words over all distinct words (Jaccard index)
                    union = len(set1 | set2)
                    intersect = len(set1 & set2)
                    # Guard against two descriptions that are both empty after cleaning
                    overlap_ratio = intersect / union if union else 1.0
                    if overlap_ratio >= threshold:
                        duplicated_couples.append({'sex': sex, 'breed_primary':breed_primary, 'first':first_dog['id'], 'second':second_dog['id'], 'overlap_ratio':overlap_ratio})
# Save the duplicates
df = pd.DataFrame(duplicated_couples)
df.to_csv('duplicates_full_optimized.csv', index=False, sep=',', encoding='utf-8')  
print(counter)

sex: Male
processing cluster number: 1 of 200--> Affenpinscher (7)
processing cluster number: 2 of 200--> Afghan Hound (3)
processing cluster number: 3 of 200--> Airedale Terrier (10)
processing cluster number: 4 of 200--> Akbash (1)
processing cluster number: 5 of 200--> Akita (84)
processing cluster number: 6 of 200--> Alaskan Malamute (40)
processing cluster number: 7 of 200--> American Bulldog (509)
processing cluster number: 8 of 200--> American Eskimo Dog (21)
processing cluster number: 9 of 200--> American Foxhound (10)
processing cluster number: 10 of 200--> American Hairless Terrier (1)
processing cluster number: 11 of 200--> American Staffordshire Terrier (806)
processing cluster number: 12 of 200--> Anatolian Shepherd (55)
processing cluster number: 13 of 200--> Australian Cattle Dog / Blue Heeler (385)
processing cluster number: 14 of 200--> Australian Kelpie (14)
processing cluster number: 15 of 200--> Australian Shepherd (329)
processing cluster number: 16 of 200--> Austr

KeyboardInterrupt: 
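
The run above was interrupted before finishing, and the pairwise loop grows quadratically with the cluster size. One possible way to skip comparisons, sketched here under the assumption that the overlap measure is the Jaccard index, relies on a size bound: the intersection is at most as large as the smaller set and the union at least as large as the larger one, so the index can never exceed the ratio of the two sizes. The function name `near_duplicate_pairs` and the toy sets are made up for illustration.

```python
def near_duplicate_pairs(word_sets, threshold=0.9):
    """Return (i, j, score) for all pairs whose Jaccard index is >= threshold."""
    # Visit the sets from smallest to largest
    order = sorted(range(len(word_sets)), key=lambda k: len(word_sets[k]))
    pairs = []
    for pos, i in enumerate(order):
        a = word_sets[i]
        for j in order[pos + 1:]:
            b = word_sets[j]
            # Here len(b) >= len(a), and Jaccard <= len(a) / len(b).
            # Later sets are even larger, so the inner loop can stop early.
            if len(a) < threshold * len(b):
                break
            union = len(a | b)
            score = len(a & b) / union if union else 1.0
            if score >= threshold:
                pairs.append((min(i, j), max(i, j), score))
    return pairs

# Toy word sets (single letters stand in for words)
toy_sets = [set("abcdefghij"), set("abcdefghijk"), set("ab"), set("abcdefghij")]
found = near_duplicate_pairs(toy_sets)
print(sorted((i, j) for i, j, _ in found))
```

The pruning only removes pairs that could not reach the threshold, so the result should match the exhaustive comparison while doing less work when description lengths vary widely.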

In [None]:
df = pd.read_csv('duplicates_full.csv', sep=',', encoding='utf-8')

In [None]:
threshold = 0.5
df.describe()

In [None]:
dogs.shape