# <center>Project for Foundations of Computer Science</center>
### <center>University of Milano-Bicocca</center>
<center>Matteo Corona - Costanza Pagnin</center>

### 0. Preliminary steps
### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import re

### Reading *.csv* files from GitHub Repository

In [2]:
travel=pd.read_csv('https://raw.githubusercontent.com/CoroTheBoss/CS-project/main/dogTravel.csv', index_col=0)
dog=pd.read_csv('https://raw.githubusercontent.com/CoroTheBoss/CS-project/main/dogs.csv')
nst=pd.read_csv('https://raw.githubusercontent.com/CoroTheBoss/CS-project/main/NST-EST2021-POP.csv')

### 1. Extract all dogs with status that is *not adoptable*

In the rows whose `status` is not `adoptable`, the values were off by one column, so they had to be shifted one column to the right.

In [3]:
# Shifting values
dog.loc[dog['status']!='adoptable','status':'accessed']=dog.loc[dog['status']!='adoptable','status':'accessed'].shift(periods=1, axis="columns")
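The effect of the shift is easier to see on a small table. The following is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from `dogs.csv`), assuming the misaligned rows hold their values one column too far to the left:

```python
import pandas as pd

# Hypothetical toy data: row 1 is misaligned, its values sit one column
# too far to the left, so 'status' holds what belongs in 'posted'.
toy = pd.DataFrame(
    {
        "id": ["1", "2"],
        "status": ["adoptable", "2019-02-03"],
        "posted": ["2019-01-01", "2019-09-20"],
        "accessed": ["2019-09-20", None],
    },
    dtype=object,
)

mask = toy["status"] != "adoptable"
# Shift only the affected rows one position to the right,
# restricted to the columns from 'status' to 'accessed'.
toy.loc[mask, "status":"accessed"] = (
    toy.loc[mask, "status":"accessed"].shift(periods=1, axis="columns")
)
print(toy)
```

After the shift the first column of the slice (`status`) is left empty for the repaired rows, which is why `status` shows up as NaN for them.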

In [4]:
# Checking all possible values in status
dog["status"].unique()

array(['adoptable', nan], dtype=object)

Since `status` takes only two distinct values, the NaN entries correspond to the *not adoptable* dogs.

In [5]:
# Replacing NaN values
dog.loc[dog['status']!='adoptable', 'status'] = 'not adoptable'
# Printing the first not adoptable dogs to visualize the data
dog.loc[dog['status']!='adoptable', ['id', 'status']]

Unnamed: 0,id,status
644,41330726,not adoptable
5549,38169117,not adoptable
10888,45833989,not adoptable
11983,45515547,not adoptable
12495,45294115,not adoptable
12600,45229004,not adoptable
12613,45227052,not adoptable
17619,45569380,not adoptable
18611,44694387,not adoptable
19747,36978896,not adoptable


In [6]:
print("There are", len(dog.loc[dog['status']!='adoptable', ['id']]) ,"dogs with status that is not adoptable" )

There are 33 dogs with status that is not adoptable


### 2. For each (primary) breed, determine the number of dogs

In [7]:
# Checking if all dogs have a primary breed
dog[dog['breed_primary'].isna()]

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost


In [8]:
# Checking if all dogs have an id
dog[dog['id'].isna()]

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost


In [12]:
# Grouping dogs by their primary breed and counting them
dog.groupby('breed_primary')['id'].count()

breed_primary
Affenpinscher                         17
Afghan Hound                           4
Airedale Terrier                      19
Akbash                                 3
Akita                                181
                                    ... 
Wirehaired Pointing Griffon            1
Wirehaired Terrier                    60
Xoloitzcuintli / Mexican Hairless     11
Yellow Labrador Retriever            158
Yorkshire Terrier                    360
Name: id, Length: 216, dtype: int64
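The two `isna()` checks matter because `count()` skips missing values. A small sketch on hypothetical data (the `toy` frame is made up) shows the difference between `count()` and `size()`:

```python
import pandas as pd

# Hypothetical toy data with one missing id
toy = pd.DataFrame(
    {
        "id": [1, 2, None, 4],
        "breed_primary": ["Akita", "Akita", "Akita", "Boxer"],
    }
)

# count() ignores rows whose id is missing
by_count = toy.groupby("breed_primary")["id"].count()
# size() counts every row in the group
by_size = toy.groupby("breed_primary").size()

print(by_count.to_dict())
print(by_size.to_dict())
```

When no `id` is missing, as checked above for the real data, the two approaches give the same numbers.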

### 3. For each (primary) breed, determine the ratio between the number of dogs of `Mixed Breed` and those not of Mixed Breed. Hint: look at the `secondary_breed`.

In [None]:
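# Add a helper column right after breed_primary; a string placeholder
# avoids mixing an integer column with the string labels assigned below
dog.insert(loc = 6,
          column = 'secondary_clean',
          value = 'o')

In [None]:
# 'm': the secondary breed is 'Mixed Breed'
dog.loc[dog['breed_secondary']=='Mixed Breed', 'secondary_clean']='m'

In [None]:
# 'o': every other dog, including those with no secondary breed (NaN)
dog.loc[(dog.breed_secondary!='Mixed Breed'),'secondary_clean']='o'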

In [None]:
dog

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,secondary_clean,breed_secondary,breed_mixed,breed_unknown,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,m,Mixed Breed,True,False,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...,70,124.81
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,m,Mixed Breed,True,False,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...,49,122.07
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,o,,False,False,...,Mesquite,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...,87,281.51
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,o,,False,False,...,Pahrump,NV,89048,US,89009,2019-09-20,Dog,,62,145.83
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,o,,False,False,...,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...,93,241.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58175,44605893,WY20,https://www.petfinder.com/dog/tren-44605893/wy...,Dog,Dog,Border Collie,o,,False,False,...,Lander,WY,82520,US,WY,2019-09-20,Dog,"Due to the small size of our volunteer base, w...",100,324.34
58176,44457061,WY24,https://www.petfinder.com/dog/harley-44457061/...,Dog,Dog,Australian Shepherd,o,Australian Cattle Dog / Blue Heeler,True,False,...,Riverton,WY,82501,US,WY,2019-09-20,Dog,,65,245.90
58177,42865848,WY20,https://www.petfinder.com/dog/echo-42865848/wy...,Dog,Dog,Border Collie,o,,False,False,...,Glenrock,WY,82637,US,WY,2019-09-20,Dog,"Due to the small size of our volunteer base, w...",100,184.06
58178,42734734,WY24,https://www.petfinder.com/dog/simon-42734734/w...,Dog,Dog,Boxer,m,Mixed Breed,True,False,...,Riverton,WY,82501,US,WY,2019-09-20,Dog,,58,61.05


In [None]:
tab=dog.groupby(['breed_primary','secondary_clean'])['id'].count()
tab

breed_primary                      secondary_clean
Affenpinscher                      m                    1
                                   o                   16
Afghan Hound                       o                    4
Airedale Terrier                   m                    1
                                   o                   18
                                                     ... 
Wirehaired Terrier                 o                   56
Xoloitzcuintli / Mexican Hairless  o                   11
Yellow Labrador Retriever          o                  158
Yorkshire Terrier                  m                   15
                                   o                  345
Name: id, Length: 354, dtype: int64

In [None]:
tabdf=tab.unstack() # one row per breed, one column per category (m, o)
tabdf

secondary_clean,m,o
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1
Affenpinscher,1.0,16.0
Afghan Hound,,4.0
Airedale Terrier,1.0,18.0
Akbash,,3.0
Akita,6.0,175.0
...,...,...
Wirehaired Pointing Griffon,,1.0
Wirehaired Terrier,4.0,56.0
Xoloitzcuintli / Mexican Hairless,,11.0
Yellow Labrador Retriever,,158.0


In [None]:
# where m is NaN the ratio is 0
tabdf.loc[tabdf['m'].isnull(),'m']=0

In [None]:
# where o is NaN the ratio is +infinity, so o is set to 0.1 as a placeholder
tabdf.loc[tabdf['o'].isnull(),'o']=0.1

In [None]:
tabdf

secondary_clean,m,o
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1
Affenpinscher,1.0,16.0
Afghan Hound,0.0,4.0
Airedale Terrier,1.0,18.0
Akbash,0.0,3.0
Akita,6.0,175.0
...,...,...
Wirehaired Pointing Griffon,0.0,1.0
Wirehaired Terrier,4.0,56.0
Xoloitzcuintli / Mexican Hairless,0.0,11.0
Yellow Labrador Retriever,0.0,158.0


In [None]:
tabdf['ratio']=tabdf['m']/tabdf['o']

In [None]:
# check
tabdf.loc['Norwegian Buhund'] # only breed with no non-mixed dogs (o was NaN, set to 0.1)


secondary_clean
m         2.0
o         0.1
ratio    20.0
Name: Norwegian Buhund, dtype: float64
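The placeholder `o=0.1` turns an undefined ratio into the arbitrary value 20. One possible alternative, sketched here on hypothetical data (the `toy` frame is made up), is to fill missing counts with 0 and let the division produce `inf`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from dogs.csv
toy = pd.DataFrame(
    {
        "breed_primary": ["Akita", "Akita", "Akita", "Akbash", "Norwegian Buhund"],
        "breed_secondary": ["Mixed Breed", None, "Husky", None, "Mixed Breed"],
    }
)

# 'm' for Mixed Breed, 'o' for everything else (including missing values)
toy["secondary_clean"] = np.where(toy["breed_secondary"] == "Mixed Breed", "m", "o")

# Missing (breed, category) combinations become 0 instead of NaN
tab = (
    toy.groupby(["breed_primary", "secondary_clean"])
    .size()
    .unstack(fill_value=0)
)

# pandas returns inf for a positive count divided by zero
ratio = tab["m"] / tab["o"]
print(ratio)
```

Whether `inf` or a finite placeholder is preferable depends on how the ratio is used afterwards; `inf` at least makes the special case explicit.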

In [None]:
tabdf.head()

secondary_clean,m,o,ratio
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Affenpinscher,1.0,16.0,0.0625
Afghan Hound,0.0,4.0,0.0
Airedale Terrier,1.0,18.0,0.055556
Akbash,0.0,3.0,0.0
Akita,6.0,175.0,0.034286


### 4. For each (primary) breed, determine the earliest and the latest `posted` timestamp.



In [None]:
dog['posted']=pd.to_datetime(dog['posted'])

In [None]:
time=dog.groupby('breed_primary')[['posted']].max()
time

Unnamed: 0_level_0,posted
breed_primary,Unnamed: 1_level_1
Affenpinscher,2019-09-14 10:10:51+00:00
Afghan Hound,2019-07-27 00:38:48+00:00
Airedale Terrier,2019-09-19 18:40:39+00:00
Akbash,2019-08-23 17:11:04+00:00
Akita,2019-09-20 15:19:57+00:00
...,...
Wirehaired Pointing Griffon,2016-06-29 20:03:55+00:00
Wirehaired Terrier,2019-09-19 22:52:45+00:00
Xoloitzcuintli / Mexican Hairless,2019-09-08 11:15:54+00:00
Yellow Labrador Retriever,2019-09-20 06:30:27+00:00


In [None]:
time['postedmin']=dog.groupby('breed_primary')[['posted']].min()
time

Unnamed: 0_level_0,posted,postedmin
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1
Affenpinscher,2019-09-14 10:10:51+00:00,2012-03-08 10:27:33+00:00
Afghan Hound,2019-07-27 00:38:48+00:00,2017-06-29 23:28:51+00:00
Airedale Terrier,2019-09-19 18:40:39+00:00,2014-06-13 12:59:36+00:00
Akbash,2019-08-23 17:11:04+00:00,2019-07-21 00:35:59+00:00
Akita,2019-09-20 15:19:57+00:00,2012-03-03 09:31:08+00:00
...,...,...
Wirehaired Pointing Griffon,2016-06-29 20:03:55+00:00,2016-06-29 20:03:55+00:00
Wirehaired Terrier,2019-09-19 22:52:45+00:00,2012-11-27 14:07:54+00:00
Xoloitzcuintli / Mexican Hairless,2019-09-08 11:15:54+00:00,2007-02-01 00:00:00+00:00
Yellow Labrador Retriever,2019-09-20 06:30:27+00:00,2010-05-31 00:00:00+00:00
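Both timestamps can also be computed in a single pass with named aggregation. A minimal sketch on hypothetical data (the `toy` frame and the column names `earliest` and `latest` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "breed_primary": ["Akita", "Akita", "Boxer"],
        "posted": [
            "2012-03-03 09:31:08+00:00",
            "2019-09-20 15:19:57+00:00",
            "2018-01-01 00:00:00+00:00",
        ],
    }
)
toy["posted"] = pd.to_datetime(toy["posted"], utc=True)

# One row per breed with the earliest and the latest timestamp
span = toy.groupby("breed_primary")["posted"].agg(earliest="min", latest="max")
print(span)
```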


### 5. For each state, compute the sex imbalance, that is, the difference between the number of male and female dogs. In which state is this imbalance largest?

Work in progress: for now this is only a rough draft, because it is not yet clear which state column to use.

In [None]:
dog.stateQ.unique() #should  

array(['89009', '89014', '89024', '89027', '89121', '89406', '89408',
       '89423', '89431', '89451', '89801', 'AK', 'AL', 'AR', 'AZ', 'CA',
       'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN',
       'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT',
       'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NY', 'OH', 'OK', 'OR', 'PA',
       'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV',
       'WY'], dtype=object)

In [None]:
contr=dog[dog['stateQ']=='89009'][['contact_state']] # 89009 contains both NV and AZ
contr.contact_state.unique()

array(['NV', 'AZ'], dtype=object)

In [None]:
dog.contact_state.unique()

array(['NV', 'AZ', 'UT', 'CA', 'AK', 'AL', 'AR', 'CO', 'NY', 'MA', 'CT',
       'RI', 'NJ', 'NH', 'VT', 'MD', 'VA', 'DC', 'PA', 'WV', 'DE', 'FL',
       'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'OH', 'KS', 'KY', 'LA', 'ME',
       'QC', 'NB', 'MI', 'MN', 'WI', 'MO', 'MS', 'MT', 'NC', 'SC', 'ND',
       'NE', 'NM', 'OK', 'OR', 'SD', 'TN', 'TX', 'WA', 'WY'], dtype=object)

In [None]:
dog.iloc[0]

id                                                          46042150
org_id                                                         NV163
url                https://www.petfinder.com/dog/harley-46042150/...
type.x                                                           Dog
species                                                          Dog
breed_primary                         American Staffordshire Terrier
secondary_clean                                                    m
breed_secondary                                          Mixed Breed
breed_mixed                                                     True
breed_unknown                                                  False
color_primary                                          White / Cream
color_secondary                          Yellow / Tan / Blond / Fawn
color_tertiary                                                   NaN
age                                                           Senior
sex                               
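Once a state column is chosen, the imbalance itself is a short computation. The sketch below runs on hypothetical data and makes two assumptions that the notebook has not settled: that `contact_state` is the state column, and that `sex` takes the values `Male` and `Female`.

```python
import pandas as pd

# Hypothetical toy data; the column choice and the sex labels are assumptions
toy = pd.DataFrame(
    {
        "contact_state": ["NV", "NV", "NV", "NV", "AZ", "AZ", "WY"],
        "sex": ["Male", "Male", "Male", "Female", "Female", "Male", "Female"],
    }
)

# Count dogs per (state, sex); make sure both columns exist
counts = pd.crosstab(toy["contact_state"], toy["sex"])
counts = counts.reindex(columns=["Male", "Female"], fill_value=0)

# Signed imbalance: males minus females
imbalance = counts["Male"] - counts["Female"]
# State with the largest imbalance in absolute value
largest = imbalance.abs().idxmax()
print(imbalance)
print(largest)
```

Any other `sex` value (for example an unknown label) is simply ignored by the `reindex` step in this sketch.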

### 6. For each pair (age, size), determine the average duration of the stay and the average cost of stay.

In [None]:
dog.groupby(['age','size'], as_index=False)[['stay_duration','stay_cost']].mean()

NameError: ignored

### 7. Find the dogs involved in at least 3 travels. Also list the breed of those dogs.
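A possible approach, sketched on hypothetical data: count how often each dog appears in the travel table and keep the ones with at least three rows. The frames `travel_toy` and `dog_toy` are made up, and the sketch assumes that the travel table has an `id` column matching `id` in the dogs table and that each row is one travel.

```python
import pandas as pd

# Hypothetical toy tables; the 'id' column in the travel table is an assumption
travel_toy = pd.DataFrame({"id": [10, 10, 10, 20, 30, 30]})
dog_toy = pd.DataFrame(
    {"id": [10, 20, 30], "breed_primary": ["Akita", "Boxer", "Beagle"]}
)

# Number of travel rows per dog
n_travels = travel_toy["id"].value_counts()
frequent_ids = n_travels[n_travels >= 3].index

# Dogs with at least 3 travels, together with their primary breed
frequent = dog_toy.loc[dog_toy["id"].isin(frequent_ids), ["id", "breed_primary"]]
print(frequent)
```

If the real travel table contains repeated rows for the same trip, they would need to be dropped before counting.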

### 8. Fix the `travels` table so that the correct state is computed from  the `manual` and the `found` fields. If `manual` is not missing, then it overrides what is stored in `found`.
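The override rule can be expressed with `fillna`: take `manual` and fall back to `found` only where `manual` is missing. A minimal sketch on hypothetical data (the `travel_toy` frame and the output column name `state` are made up):

```python
import pandas as pd

# Hypothetical toy data using the 'manual' and 'found' fields from the task
travel_toy = pd.DataFrame(
    {
        "found": ["Texas", "Ohio", None],
        "manual": [None, "Iowa", "Utah"],
    }
)

# 'manual' wins whenever it is present; otherwise use 'found'
travel_toy["state"] = travel_toy["manual"].fillna(travel_toy["found"])
print(travel_toy)
```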

### 9. For each state, compute the ratio between the number of travels and the population.

### 10. For each dog, compute the number of days from the `posted` day to the day of last access.
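One way to read this task is as a difference in calendar days between the `posted` timestamp and the `accessed` date. The sketch below uses hypothetical data and assumes that interpretation (the time of day is dropped before subtracting); the output column name `days` is made up.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dogs table
toy = pd.DataFrame(
    {
        "posted": ["2019-09-10 10:00:00+00:00", "2019-06-01 23:30:00+00:00"],
        "accessed": ["2019-09-20", "2019-09-20"],
    }
)

posted = pd.to_datetime(toy["posted"], utc=True)
accessed = pd.to_datetime(toy["accessed"], utc=True)

# normalize() sets the time to midnight so only calendar days are compared
toy["days"] = (accessed.dt.normalize() - posted.dt.normalize()).dt.days
print(toy)
```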

### 11. Partition the dogs according to the number of weeks from the `posted` day to the day of last access.
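Assuming a per-dog day count is already available, the partition reduces to an integer division by 7 followed by a `groupby`. A minimal sketch on hypothetical data (the `toy` frame and the `days` and `weeks` columns are made up):

```python
import pandas as pd

# Hypothetical toy data: 'days' is the number of days from posting to last access
toy = pd.DataFrame({"id": [1, 2, 3, 4], "days": [0, 6, 7, 20]})

# Number of completed weeks
toy["weeks"] = toy["days"] // 7

# One group of dog ids per number of weeks
partition = {int(w): g["id"].tolist() for w, g in toy.groupby("weeks")}
print(partition)
```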

### 12. Find duplicates in the `dogs` dataset. Two records are duplicates if they have (1) same breeds and sex, and (2) they share at least 90% of the words in the description field. Extra points if you find and implement a more refined method for determining if two rows are duplicates.
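The basic criterion can be sketched with the standard library only. The task does not say how the 90% is measured, so this sketch assumes one possible reading: the number of distinct shared words divided by the size of the larger word set. The helper names `words`, `share_words` and `is_duplicate` are hypothetical.

```python
import re


def words(text):
    """Return the set of lowercase word tokens in a description."""
    return set(re.findall(r"[a-z0-9']+", str(text).lower()))


def share_words(a, b, threshold=0.9):
    """True if the two texts share at least `threshold` of their words.

    Assumption: the share is measured against the larger of the two word sets.
    """
    wa, wb = words(a), words(b)
    if not wa or not wb:
        return False
    return len(wa & wb) / max(len(wa), len(wb)) >= threshold


def is_duplicate(r1, r2):
    """Apply both conditions of the task to two records given as dicts.

    Missing breeds (NaN) would need separate handling, since NaN != NaN.
    """
    keys = ("breed_primary", "breed_secondary", "sex")
    same_keys = all(r1[k] == r2[k] for k in keys)
    return same_keys and share_words(r1["description"], r2["description"])


# Hypothetical records
a = {"breed_primary": "Akita", "breed_secondary": "Husky", "sex": "Male",
     "description": "one two three four five six seven eight nine ten"}
b = dict(a, description="One two three four five six seven eight nine eleven")
c = dict(a, description="one two three")

print(is_duplicate(a, b), is_duplicate(a, c))
```

Comparing every pair of rows is quadratic in the number of dogs, so on the full dataset it would make sense to group by breeds and sex first and compare descriptions only within each group.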

In [14]:
%pip install gingerit

In [15]:
pip install ftfy

Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
prova = dog.loc[12495].at["description"]
prova

'â\x80¢Basset Hound, female, â\x80¢10 years \n\nDelightful Daisy is a friendly girl looking for a retirement home! Daisy is a spry 10 who greets people with a wagging tail and a hop so it is easy to pet her. She also enjoys walks, snuggling on the couch, and treats, not necessarily in that order. Daisy is a loved pet who will be missed, but she does not enjoy living with young children, and two have joined the family. Daisy is happy to leave them alone but the children are young and humans are not as easy to train as a dog is. Daisy does live with another dog but can be protective of her food, and may be happiest as an only dog, unless the family is prepared to manage the dogs. Daisy is much more about people than other dogs. Daisy has never lived with cats, but does have the hound part of Basset Hound in full, and likes to chase small fuzzy creatures in the yard, so we suspect it would not go well. She is open to meeting a cat though to see if our theory is correct. Daisy will be stay

In [21]:
from ftfy import fix_text
prova = dog.loc[12495].at["description"]
fix_text(prova)

'•Basset Hound, female, •10 years \n\nDelightful Daisy is a friendly girl looking for a retirement home! Daisy is a spry 10 who greets people with a wagging tail and a hop so it is easy to pet her. She also enjoys walks, snuggling on the couch, and treats, not necessarily in that order. Daisy is a loved pet who will be missed, but she does not enjoy living with young children, and two have joined the family. Daisy is happy to leave them alone but the children are young and humans are not as easy to train as a dog is. Daisy does live with another dog but can be protective of her food, and may be happiest as an only dog, unless the family is prepared to manage the dogs. Daisy is much more about people than other dogs. Daisy has never lived with cats, but does have the hound part of Basset Hound in full, and likes to chase small fuzzy creatures in the yard, so we suspect it would not go well. She is open to meeting a cat though to see if our theory is correct. Daisy will be staying with h
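The garbled prefix `â\x80¢` is consistent with UTF-8 bytes that were decoded as Latin-1. The following standard-library sketch reproduces and reverses that for the bullet character; it only illustrates the likely cause and is not a description of how `ftfy` works internally.

```python
# The bullet character U+2022 is three bytes in UTF-8: E2 80 A2
bullet = "\u2022"

# Decoding those bytes as Latin-1 yields three separate characters
garbled = bullet.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Reversing the two steps restores the original character
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == bullet)
```

This manual round trip only works when the whole string was damaged in exactly this way, which is one reason to prefer a dedicated library for real data.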

In [4]:
pip install spacy


In [6]:
pip install contextualSpellCheck



In [7]:
import spacy
import contextualSpellCheck

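# The model must be downloaded first (python -m spacy download en_core_web_sm);
# otherwise spacy.load raises OSError [E050]
nlp = spacy.load('en_core_web_sm')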
contextualSpellCheck.add_to_pipe(nlp)
doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')

print(doc._.performed_spellCheck) #Should be True
print(doc._.outcome_spellCheck) #Income was $9.4 million compared to the prior year of $2.7 million.

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.