# <center>Project for Foundations of Computer Science</center>
### <center>University of Milano-Bicocca</center>
<center>Matteo Corona - Costanza Pagnin</center>

### 0. Preliminary steps
### Importing libraries

In [1]:
# Importing the necessary libraries 
from collections import Counter
from spacy.cli import download
from ftfy import fix_encoding
import pandas as pd
import numpy as np
import spacy
import ast
import re

### Reading *.csv* files from GitHub Repository

In [2]:
# Reading .csv files from GitHub Repository
nst    = pd.read_csv("https://raw.githubusercontent.com/CoroTheBoss/CS-project/main/NST-EST2021-POP.csv", header=None)
travel = pd.read_csv("https://raw.githubusercontent.com/CoroTheBoss/CS-project/main/dogTravel.csv", index_col=0)
dog    = pd.read_csv("https://raw.githubusercontent.com/CoroTheBoss/CS-project/main/dogs.csv")

### 1. Extract all dogs with status that is *not adoptable*

In [3]:
# Shifting values (some values were off by one column)
dog.loc[dog["status"] != "adoptable",
        "status":"accessed"] = dog.loc[dog["status"] != "adoptable",
                                       "status":"accessed"].shift(periods = 1, axis = "columns")

In [4]:
# Cheching all possible values in status
dog["status"].unique()

array(['adoptable', nan], dtype=object)

In [5]:
# Replacing NaN values (the NaN values refers to the not adoptable dogs)
dog.loc[dog.status != "adoptable", ["status"]] = "not adoptable"
# Printing the first not adoptable dogs to visualize the data
dog.loc[dog.status != "adoptable", ["id", "status"]].head()

Unnamed: 0,id,status
644,41330726,not adoptable
5549,38169117,not adoptable
10888,45833989,not adoptable
11983,45515547,not adoptable
12495,45294115,not adoptable


In [6]:
print("There are", len(dog[dog.status != "adoptable"]) ,"dogs with status that is not adoptable" )

There are 33 dogs with status that is not adoptable


### 2. For each (primary) breed, determine the number of dogs

In [7]:
# Grouping id by their primary_breed and counting them
dog.groupby("breed_primary")["id"].count()

breed_primary
Affenpinscher                         17
Afghan Hound                           4
Airedale Terrier                      19
Akbash                                 3
Akita                                181
                                    ... 
Wirehaired Pointing Griffon            1
Wirehaired Terrier                    60
Xoloitzcuintli / Mexican Hairless     11
Yellow Labrador Retriever            158
Yorkshire Terrier                    360
Name: id, Length: 216, dtype: int64

### 3. For each (primary) breed, determine the ratio between the number of dogs of `Mixed Breed` and those not of Mixed Breed. Hint: look at the `secondary_breed`.

In [8]:
breed_tab = dog.groupby(["breed_primary","breed_mixed"])["id"].count()
breed_tab = breed_tab.unstack()
breed_tab.columns = ["not_mixed", "mixed"]
breed_tab[np.isnan(breed_tab)] = 0
breed_tab["not_mixed_%"] = round(100 * breed_tab["not_mixed"] / (breed_tab["mixed"] + breed_tab["not_mixed"]), 1)
breed_tab["mixed_%"] = round(100 * breed_tab["mixed"] / (breed_tab["mixed"] + breed_tab["not_mixed"]), 1)
breed_tab["ratio"] = breed_tab["mixed"] / breed_tab["not_mixed"]
breed_tab

Unnamed: 0_level_0,not_mixed,mixed,not_mixed_%,mixed_%,ratio
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Affenpinscher,12.0,5.0,70.6,29.4,0.416667
Afghan Hound,0.0,4.0,0.0,100.0,inf
Airedale Terrier,2.0,17.0,10.5,89.5,8.500000
Akbash,1.0,2.0,33.3,66.7,2.000000
Akita,98.0,83.0,54.1,45.9,0.846939
...,...,...,...,...,...
Wirehaired Pointing Griffon,0.0,1.0,0.0,100.0,inf
Wirehaired Terrier,15.0,45.0,25.0,75.0,3.000000
Xoloitzcuintli / Mexican Hairless,6.0,5.0,54.5,45.5,0.833333
Yellow Labrador Retriever,36.0,122.0,22.8,77.2,3.388889


### 4. For each (primary) breed, determine the earliest and the latest `posted` timestamp.



In [9]:
# Converting "posted" column for manipulating dates and times
dog.posted = pd.to_datetime(dog.posted)
# Grouping "posted" by the primary_breed and finding erliest and lates posted time for each group
time_tab = dog.groupby("breed_primary")[["posted"]].min()
time_tab["postedmin"] = dog.groupby("breed_primary")[["posted"]].max()
# Renaming columns and printing dataframe
time_tab.columns = ["erliest_posted_timestamp", "latest_posted_timestamp"]
time_tab

Unnamed: 0_level_0,erliest_posted_timestamp,latest_posted_timestamp
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1
Affenpinscher,2012-03-08 10:27:33+00:00,2019-09-14 10:10:51+00:00
Afghan Hound,2017-06-29 23:28:51+00:00,2019-07-27 00:38:48+00:00
Airedale Terrier,2014-06-13 12:59:36+00:00,2019-09-19 18:40:39+00:00
Akbash,2019-07-21 00:35:59+00:00,2019-08-23 17:11:04+00:00
Akita,2012-03-03 09:31:08+00:00,2019-09-20 15:19:57+00:00
...,...,...
Wirehaired Pointing Griffon,2016-06-29 20:03:55+00:00,2016-06-29 20:03:55+00:00
Wirehaired Terrier,2012-11-27 14:07:54+00:00,2019-09-19 22:52:45+00:00
Xoloitzcuintli / Mexican Hairless,2007-02-01 00:00:00+00:00,2019-09-08 11:15:54+00:00
Yellow Labrador Retriever,2010-05-31 00:00:00+00:00,2019-09-20 06:30:27+00:00


### 5. For each state, compute the sex imbalance, that is the difference between male and female dogs. In which state this imbalance is largest?

In [10]:
# Grouping id by their contact_state and sex and counting them
state_tab = dog.groupby(["contact_state","sex"])["id"].count()
# Unstacking table
state_tab = state_tab.unstack()
# Setting the NaN values to zero 
state_tab[np.isnan(state_tab)] = 0
# Computing the sex imbalance and then printing the dataframe
state_tab["sex_imbalance"] = state_tab["Male"] - state_tab["Female"]
state_tab.head()

sex,Female,Male,Unknown,sex_imbalance
contact_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AK,7.0,8.0,0.0,1.0
AL,716.0,712.0,0.0,-4.0
AR,351.0,344.0,0.0,-7.0
AZ,1067.0,1181.0,1.0,114.0
CA,777.0,887.0,0.0,110.0


In [11]:
# Printing the state with the highest sex imbalance
print("The state with the highest sex imbalance is Ohio.")
state_tab.loc[state_tab["sex_imbalance"] == state_tab["sex_imbalance"].max()]

The state with the highest sex imbalance is Ohio.


sex,Female,Male,Unknown,sex_imbalance
contact_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
OH,1234.0,1439.0,0.0,205.0


### 6. For each pair (age, size), determine the average duration of the stay and the average cost of stay.

In [12]:
# Grouping dogs by their age and size and averaging the stay_duration and stay_cost values for each group
round(dog.groupby(["age","size"])[["stay_duration","stay_cost"]].mean(), 2)

Unnamed: 0_level_0,Unnamed: 1_level_0,stay_duration,stay_cost
age,size,Unnamed: 2_level_1,Unnamed: 3_level_1
Adult,Extra Large,89.02,232.59
Adult,Large,89.53,238.66
Adult,Medium,89.42,238.26
Adult,Small,89.41,238.97
Baby,Extra Large,87.03,237.18
Baby,Large,89.7,238.7
Baby,Medium,89.58,237.11
Baby,Small,89.96,239.08
Senior,Extra Large,88.86,235.23
Senior,Large,88.98,237.51


### 7. Find the dogs involved in at least 3 travels. Also list the breed of those dogs.

In [13]:
# Grouping "contact_city" by the dogs id and counting them
travel_tab = travel.groupby(["id"], as_index=False)[["contact_city"]].count()
# Renaming columns
travel_tab.columns = ["id", "count"]
# Excluding all dogs that do not match the given condition
travel_tab = travel_tab[travel_tab["count"] > 2]
# Merging the dataframe with "dog" in order to show the breed
pd.merge(travel_tab,dog[["id", "breed_primary"]],
         how = "left", on = ["id"])

Unnamed: 0,id,count,breed_primary
0,16657005,4,Pit Bull Terrier
1,20905974,5,Chow Chow
2,24894870,4,Hound
3,24894894,4,Hound
4,33218331,7,Alaskan Malamute
...,...,...,...
558,46042569,3,Labrador Retriever
559,46042587,3,Labrador Retriever
560,46042618,3,Labrador Retriever
561,46043099,3,Labrador Retriever


In [14]:
# Printing the number of dogs involved in at least three travels
print("There are", len(travel_tab) ,"dogs involved in at least three travels" )

There are 563 dogs involved in at least three travels


### 8. Fix the `travels` table so that the correct state is computed from  the `manual` and the `found` fields. If `manual` is not missing, then it overrides what is stored in `found`.

In [15]:
# Asking if the element in the "manual" column is not null and then replacing the found column
travel.loc[travel["manual"].isnull() == False,"found"] = travel.loc[travel["manual"].isnull() == False, "manual"]
# Printing the fixed dataframe
travel[["id","found","manual"]]

Unnamed: 0_level_0,id,found,manual
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,44520267,Arkansas,
1,44698509,Bahamas,Bahamas
2,45983838,Maryland,Maryland
3,44475904,Adaptil,
4,43877389,Afghanistan,
...,...,...,...
6189,40492179,WV,
6190,45799729,Wyoming,
6191,34276515,Yazmin,
6192,44519341,Ohio,Ohio


### 9. For each state, compute the ratio between the number of travels and the population.

In [16]:
# Fixing a value in travels: 17325 refers to Pennsylvania (PA)
travel["contact_state"] = travel["contact_state"].replace('17325','PA')
# Opening and reading the file (found on the internet) which contains the state abbreviations
file = open("abbreviations.txt", "r")
contents = file.read()
# Converting the file into a dictionary
abb = ast.literal_eval(contents)
file.close()
# Naming nst columns (the file was without header)
nst.columns = ["state", "population"]
# Substituting state names with abbreviations
nst = nst.replace({"state": abb})
# Converting population column into numeric values
nst["population"] = nst["population"].str.replace('.', '', regex=True)
nst.head()

Unnamed: 0,state,population
0,AL,5024279
1,AK,733391
2,AZ,7151502
3,AR,3011524
4,CA,39538223


In [17]:
# Grouping id by the contact_state
state_tab_ratio = travel.groupby(["contact_state"], as_index=False)["id"].count()
# Renaming columns
state_tab_ratio.columns = ["state", "travels"]
# Merging the new dataframe with the nst dataframe in otder to show the state populations
state_tab_ratio = pd.merge(nst,state_tab_ratio[["state","travels"]],
         how = "left", on = ["state"])
# Setting the NaN values to zero 
state_tab_ratio = state_tab_ratio.fillna(0)
# Fixing "population" type and setting it as float
state_tab_ratio["population"] = state_tab_ratio["population"].astype(float)
# Computing the ratio between travel and population and printing the dataframe
state_tab_ratio["ratio"] = state_tab_ratio["travels"] / state_tab_ratio["population"]
state_tab_ratio.head()

Unnamed: 0,state,population,travels,ratio
0,AL,5024279.0,75.0,1.492751e-05
1,AK,733391.0,0.0,0.0
2,AZ,7151502.0,70.0,9.788154e-06
3,AR,3011524.0,10.0,3.320578e-06
4,CA,39538223.0,28.0,7.081755e-07


### 10. For each dog, compute the number of days from the `posted` day to the day of last access.

In [18]:
# Ignoring the SettingWithCopyWarning
import warnings
warnings.filterwarnings('ignore')
# Selecting the needed columns and saving them in a new dataframe for convenience
days_tab = dog[["id","posted", "accessed"]]
# Converting "posted" and "accessed" columns for manipulating dates and times
days_tab.accessed = pd.to_datetime(days_tab.accessed).dt.date
days_tab.posted   = pd.to_datetime(days_tab.posted).dt.date
# Computing their difference and saving the result in a new column
days_tab["days"]  = days_tab["accessed"] - days_tab["posted"]
# Printing the dataframe 
days_tab

Unnamed: 0,id,posted,accessed,days
0,46042150,2019-09-20,2019-09-20,0 days
1,46042002,2019-09-20,2019-09-20,0 days
2,46040898,2019-09-20,2019-09-20,0 days
3,46039877,2019-09-20,2019-09-20,0 days
4,46039306,2019-09-20,2019-09-20,0 days
...,...,...,...,...
58175,44605893,2019-05-03,2019-09-20,140 days
58176,44457061,2019-04-13,2019-09-20,160 days
58177,42865848,2018-09-27,2019-09-20,358 days
58178,42734734,2018-09-12,2019-09-20,373 days


### 11. Partition the dogs according to the number of weeks from the `posted` day to the day of last access.

In [19]:
# Selecting the needed columns and saving them in a new dataframe for convenience
weeks_tab = dog[["id","posted", "accessed"]]
# Converting "posted" and "accessed" columns for manipulating dates and times
weeks_tab.accessed = pd.to_datetime(weeks_tab.accessed).dt.date
weeks_tab.posted   = pd.to_datetime(weeks_tab.posted).dt.date
# Computing the number of weeks from the posted day to the day of last access
weeks_tab["weeks"] = round((weeks_tab["accessed"] - weeks_tab["posted"])/np.timedelta64(1,'W'),0)
# Grouping dogs according to the number of weeks
weeks_tab = weeks_tab.groupby(["weeks"])[['id']].agg(lambda x: list(x))
# Printing the dataframe
weeks_tab

Unnamed: 0_level_0,id
weeks,Unnamed: 1_level_1
0.0,"[46042150, 46042002, 46040898, 46039877, 46039..."
1.0,"[45989641, 45988823, 45988816, 45988814, 45987..."
2.0,"[45919405, 45917309, 45917305, 45917298, 45911..."
3.0,"[45841113, 45841108, 45841101, 45841088, 45841..."
4.0,"[45751169, 45748689, 45748573, 45748545, 45748..."
...,...
730.0,[5142790]
747.0,[4527948]
812.0,[2613506]
813.0,[2592031]


### 12. Find for duplicates in the `dogs` dataset. Two records are duplicates if they have (1) same breeds and sex, and (2) they share at least 90% of the words in the description field. Extra points if you find and implement a more refined method for determining if two rows are duplicates.

In [20]:
# Slecting the needed column and dropping values with NaN description (they can't be compared) 
dog_duplicates = dog[['breed_primary','sex',"id",'description']].dropna(subset=['description'])
dog_duplicates = dog_duplicates[pd.notnull(dog_duplicates["description"])]
# Cleaning descriptions: replacing the unwanted "/n" character with a space
dog_duplicates["description"] = dog_duplicates["description"].apply(lambda x: x.replace("\n", " "))
# Cleaning descriptions: using fix_encoding() function 
dog_duplicates["description"] = dog_duplicates["description"].apply(lambda x: fix_encoding(x))
# Grouping dogs accordi to breed and sex in order to delete those that are unique (they can't have duplicates)
dog_duplicates = dog_duplicates.groupby(["breed_primary", "sex"])[["id", "description"]].agg(lambda x: list(x))
dog_duplicates = dog_duplicates[dog_duplicates['id'].map(len)!=1]
# Restoring the dataframe index
dog_duplicates = dog_duplicates.reset_index()

In [21]:
# Importing the natural language processing (nlp) needed packages
nlp = spacy.load("en_core_web_sm")
regex = re.compile(r'[a-z]')

# Defining a function that splits each description contained in a row into a list of words
def makelist(row):
    # Creating blank output
    output = []
    for description in row:
        # Applying nlp to the current element and saving result in a temporary variable
        doc = nlp(description)
        # Creating a blank list for containing the list of word
        description_list = list()
        # For loop that iterates on each word contained in the current element of the row
        for token in doc:
            if regex.search(token.text.lower()): # Condition for a string to be a word
                # Append word in the descript_list
                description_list.append(token.text.lower())
        # Appending the new list of words in the final output list
        output.append(description_list)
    return(output)

In [22]:
# Applying the makelist() function to all elements in the description column
dog_duplicates['listed'] = dog_duplicates['description'].apply(lambda row: makelist(row))
# Printing the dataframe in order to visualize the data
dog_duplicates

Unnamed: 0,breed_primary,sex,id,description,listed
0,Affenpinscher,Female,"[45889013, 22427951, 45970614, 45871731, 45916...",[This cutie is very sweet. She is a little shy...,"[[this, cutie, is, very, sweet, she, is, a, li..."
1,Affenpinscher,Male,"[38985146, 45728674, 45787432, 45858286, 45362...",[Ralphie is a darling black Affenpinscher mix ...,"[[ralphie, is, a, darling, black, affenpinsche..."
2,Afghan Hound,Male,"[45382284, 42476375, 39728532]",[We do not know what breed Bear is. He resembl...,"[[we, do, not, know, what, breed, bear, is, he..."
3,Airedale Terrier,Female,"[45682240, 45811667, 45295124, 43692266, 46007...",[Meet Cher! She is a very sweet puppy needing ...,"[[meet, cher, she, is, a, very, sweet, puppy, ..."
4,Airedale Terrier,Male,"[44752626, 45682439, 45565329, 29481512, 45308...","[Ehu is a cool-looking dog with a happy, posit...","[[ehu, is, a, cool, looking, dog, with, a, hap..."
...,...,...,...,...,...
353,Xoloitzcuintli / Mexican Hairless,Male,"[44118935, 43283678, 43248772, 45905447, 44869...",[Vlad is one of our sanctuary dogs. He is a M...,"[[vlad, is, one, of, our, sanctuary, dogs, he,..."
354,Yellow Labrador Retriever,Female,"[43828473, 39983699, 28672284, 44927114, 46006...","[We've had Goldie for almost 5 years, and she ...","[[we, 've, had, goldie, for, almost, years, an..."
355,Yellow Labrador Retriever,Male,"[46031975, 45987630, 45592988, 45288980, 39528...",[Ranger is a sweet boy full of puppy energy an...,"[[ranger, is, a, sweet, boy, full, of, puppy, ..."
356,Yorkshire Terrier,Female,"[45908725, 45272327, 44618549, 45310965, 45609...",[Meet Lucy! Lucy is a 10 month old Carin Terr...,"[[meet, lucy, lucy, is, a, month, old, carin, ..."


In [23]:
# Defining a function that compares two description and compute the percentage of shared words
def comparison(x,y):
    # Counting how many times each word appears in a description wiht the "Counter()" function
    a = Counter(x)
    b = Counter(y)
    # Defining a blank list for the final output
    res=[]
    # For loop that iterates on each common element of the two listx and y
    # set() lists all the unique elements contained in a list
    # The ".intersection()" method return only the common element between two list
    for i in set(x).intersection(set(y)):
        # Extendind the "res" list basing on how many time each common word appears in the the two lists
        # (we choose the minumun number of appearence in the two lists)
        res.extend([i] * min(b[i], a[i]))
    # Finding all the uncommon words
    uncommon = set(x) ^ set(y)
    # Computind denominator as the number of common words + the number of uncommon words
    denominator = len(res) + len(uncommon)
    # COmputing the output as the ratio common / (common + uncommon) and rounding the value for convenience
    return (round(len(res)/denominator*100,1))

# Defining a function that gets all possible combination of two elements in a list
def getCombinations(seq):
    # Defining a blank list for the output
    combinations = []
    # For loop that iterates on each element of the considered sequence
    for i in range(0,len(seq)):
        # For loop that iterates on each element of the sequence (starting from the previous element)
        for j in range(i+1,len(seq)):
            # Appending the combination in the output
            combinations.append([seq[i],seq[j]])
    return combinations

In [24]:
# Searching for blank list of words and popping them (and their corresponding id) from the dataframe
# Iterating in each element of the "listed" column
for elem in dog_duplicates["listed"]:
    # Iterating in each element of a single "listed" element
    for lista in elem:
        # Condition for the element to be a blank list
        if lista == []:
            # Finding the corresponding indexes
            index  = list(dog_duplicates["listed"]).index(elem)
            found  = elem.index(lista)
            # Callind the .pop() method in order to delete the blank lists and their corresponding ids
            dog_duplicates["id"][index].pop(found)
            elem.pop(found)

In [25]:
# Creating a blank dataframe for containing the duplicate couples of id
#all_duplicates = pd.DataFrame()
# Creating blank lists for storing the id which will fill the dataframe
first  = []
second = []
# For loop tha iterates on each element in the "listed" column
for elem in dog_duplicates["listed"]:
    # Calling the getCombinations() function
    temp = getCombinations(elem)
    # For loop that iterates on each combination
    for x in temp:
        # Condition for two list to share 90% of the words (calling the compariosn() function)
        if comparison(x[0], x[1]) >= 90:
            # Finding the corresponding indexes of the first description in the tested combination
            index  = list(dog_duplicates["listed"]).index(elem)
            found1 = elem.index(x[0])
            # Saving the first id
            found1 = dog_duplicates["id"][index][found1]
            # Finding the corresponding indexes of the second description in the tested combination
            found2 = elem.index(x[1])
            # Saving the second id
            found2 = dog_duplicates["id"][index][found2]
            # Appendind the ids in to lists
            first.append(found1)
            second.append(found2)
# Creating dataframe of containing the duplicates couples of ids and naming the columns 
all_duplicates = pd.DataFrame(list(zip(first, second)),
               columns =['id', 'duplicates'])

In [26]:
# Removing pairs of equal id from the dataframe
for index in range(len(all_duplicates)):
    if all_duplicates["id"][index] == all_duplicates["duplicates"][index]:
                   all_duplicates = all_duplicates.drop(index)
# Printing the dataframe
all_duplicates

Unnamed: 0,id,duplicates
0,45970614,45871731
3,46020162,46019974
4,46020162,46019762
5,46019974,46019762
6,45897570,45897563
...,...,...
8357,46037706,46037440
8358,40616088,40616074
8359,40616088,40615973
8360,40616074,40615973


In [27]:
dog["description"][list(dog["id"]).index(45958024)]

'Thank you for your interest in dog who needs a new home. Adoption can be one of the most fulfilling experiences in your life and especially in the life of your new canine companion.\n\nA note to those looking at CPR. We have no perfect dogs - but many of our dogs have proven to be perfect for a certain someone. Our job is to find those someone\'s. Our most successful adopters come to CPR not because they want a dog. They come to us because they want to give a dog a home. We urge all of our potential adopters not to fall in love with a picture. While we like to know what you like in looks, a dog is more than a pretty face. Our goal is to match you with the best possible pet for you and your home. We want to create furever families.\n\nOur descriptions contain all that we know about this dog today. We\'ll update as more information is known.\n\nAre you looking to adopt within the next 30 days? If so please visit our webpage at www.carolinapoodlerescue.org and complete the application. Y

In [28]:
dog["description"][list(dog["id"]).index(45957955)]

'Thank you for your interest in dog who needs a new home. Adoption can be one of the most fulfilling experiences in your life and especially in the life of your new canine companion.\n\nA note to those looking at CPR. We have no perfect dogs - but many of our dogs have proven to be perfect for a certain someone. Our job is to find those someone\'s. Our most successful adopters come to CPR not because they want a dog. They come to us because they want to give a dog a home. We urge all of our potential adopters not to fall in love with a picture. While we like to know what you like in looks, a dog is more than a pretty face. Our goal is to match you with the best possible pet for you and your home. We want to create furever families.\n\nOur descriptions contain all that we know about this dog today. We\'ll update as more information is known.\n\nAre you looking to adopt within the next 30 days? If so please visit our webpage at www.carolinapoodlerescue.org and complete the application. Y