# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Part 2: Dataset + Data Collection

## Overview

Based on the feedback you received from your lightning talk, choose **one** of your topic areas to move forward. For Part 2, you'll need to collect, clean, and document the dataset(s) you intend to use for your project.

This is not always a trivial task. Remember that data acquisition, transformation, and cleaning are typically the most time-consuming parts of data science projects, so don’t procrastinate!

Once you have your data, read into it and review it to confirm whether it is as productive as you intended. If not, switch datasets, gather additional data (e.g. multiple datasets), or revise your project goals.

Create your own database and data dictionary, then clean and munge your data as appropriate. Finally, document your work so far.

**Goal**: Find the data you need for your project, then clean and document it.


## Requirements

1. Find and Clean Your Data: Source and format the required data for your project.
   - Create a database
   - Create a data dictionary
2. Perform preliminary data munging and cleaning of your data: organize your data relevant to your project goals.
   - Review data to verify initial assumptions
   - Clean and munge data as necessary
3. Describe your data: keep your intended audience(s) in mind.
   - Document your work so far in a Jupyter notebook.

## Scrape 1 - getting advert URLs

To do: refactor this section so that the scrape is a single loop rather than one loop per animal type.

In [2]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import requests
import bs4
from bs4 import BeautifulSoup
from selenium import webdriver

import re
import time
import json

I am sourcing my data by webscraping pets4homes.co.uk. My main goal is to scrape information for each current listing for dogs and cats. Time permitting, I may also go back and scrape other animal listings too.

My plan to source the data requires two rounds of webscraping. The first round will loop through each page of search results and collect the URL associated with each listing. After removing any duplicates returned by this scrape, I will use the scraped URLs to view each listing and scrape its relevant information.

**N.B.** *A quick warning from the outset: because of some issues I had during scraping, I ended up writing this code with a lot of repetition (rather than using loops and functions). As such, parts 1 and 2 of the scrape are badly in need of refactoring. The cleaning section is, well, cleaner.*
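
A minimal sketch of how the per-animal loops in this part could collapse into a single loop. The category slugs and page counts are copied from the loops in this notebook. The names `SEARCH_URL`, `PAGE_COUNTS`, `search_page_urls`, `scrape_category` and `scrape_all` are hypothetical, as is the choice to pass the page fetcher and the URL extractor in as arguments so that the loop can be exercised without launching a browser. It has not been run against the live site.

```python
from typing import Callable, Dict, Iterable, List

SEARCH_URL = 'https://www.pets4homes.co.uk/sale/{category}/local/local/page-{page_num}/'

# Page counts taken from the range() calls in this notebook. They were valid
# on the day of each scrape and would need re-checking before any reuse.
PAGE_COUNTS: Dict[str, int] = {
    'puppies': 500, 'kittens': 218, 'reptiles': 111, 'birds': 106, 'rodents': 85,
    'rabbits': 85, 'fish': 34, 'poultry': 24, 'horses': 14, 'invertebrates': 9,
}


def search_page_urls(category: str, n_pages: int) -> List[str]:
    '''Builds the search-result URLs for pages 1..n_pages of one category.'''
    return [SEARCH_URL.format(category=category, page_num=page_num)
            for page_num in range(1, n_pages + 1)]


def scrape_category(category: str,
                    n_pages: int,
                    fetch_page: Callable[[str], object],
                    extract_urls: Callable[[object], Iterable[str]]) -> List[str]:
    '''Collects the unique listing URLs for one category, in first-seen order.'''
    unique_urls: Dict[str, None] = {}  # dict keys keep insertion order
    for search_url in search_page_urls(category, n_pages):
        for listing_url in extract_urls(fetch_page(search_url)):
            unique_urls.setdefault(listing_url, None)
    return list(unique_urls)


def scrape_all(page_counts: Dict[str, int],
               fetch_page: Callable[[str], object],
               extract_urls: Callable[[object], Iterable[str]]) -> Dict[str, List[str]]:
    '''Runs scrape_category once per category and returns {category: urls}.'''
    return {category: scrape_category(category, n_pages, fetch_page, extract_urls)
            for category, n_pages in page_counts.items()}
```

In this notebook, `fetch_page` would wrap `dr.get(url)` followed by `BeautifulSoup(dr.page_source, 'html.parser')`, and `extract_urls` would be `get_url`. Unlike `set()`, the dict-based de-duplication keeps the URLs in the order they were found.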

#### URL scraper

In [263]:
# importing a page of manually scraped html to be used for developing functions/code

path = '/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/dogs_html.rtf'

with open(path) as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')

In [3]:
# function to extract urls from search pages

def get_url(page):
    '''    
    Extracts Pets4Homes advert listing URLs from a page of search results:
    
    Takes a BeautifulSoup object as input. Finds every <a> tag with the class that stores an advert listing's
    URL. Adds each unique URL to a list. Returns the list.
    
    ---------
    Arguments
    ---------
    page : bs4.BeautifulSoup
        the html for a Pets4Homes search page, encoded as a BeautifulSoup object
    
    ---------
    Returns
    ---------
    cur_page_listing_urls : list
        a list containing the URL of each advert listed in page
    '''
    
    cur_page_listing_urls = []

    # n.b. Pets4Homes' HTML class names appear to change on a daily basis. 'ab Ym' works as I am writing this
    # comment, but may not when you are reading it.
    
    for a in page.find_all('a', class_="ab Ym"):
        url = 'https://www.pets4homes.co.uk' + a['href']
        if url not in cur_page_listing_urls:
            cur_page_listing_urls.append(url)
    
    return cur_page_listing_urls

In [None]:
# testing get_url
get_url(soup)
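
Because the class names keep changing, one alternative is to select advert links by their `href` instead. The sketch below assumes that every advert link, and no other link, has an `href` beginning with `/classifieds/`. That assumption is consistent with the URLs collected in this notebook but has not been verified against the whole site. It uses the standard library's `HTMLParser` rather than BeautifulSoup so that it runs without extra dependencies, and it takes a raw HTML string rather than a BeautifulSoup object. `ListingLinkParser` and `get_url_by_href` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List

SITE_ROOT = 'https://www.pets4homes.co.uk'


class ListingLinkParser(HTMLParser):
    '''Collects the unique hrefs of <a> tags whose path starts with /classifieds/.'''

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        href = dict(attrs).get('href') or ''
        if href.startswith('/classifieds/'):
            url = SITE_ROOT + href
            if url not in self.urls:
                self.urls.append(url)


def get_url_by_href(html: str) -> List[str]:
    '''Returns the advert URLs found in a page of search results, ignoring class names.'''
    parser = ListingLinkParser()
    parser.feed(html)
    return parser.urls
```

With BeautifulSoup, the same filter could be written as `page.find_all('a', href=re.compile(r'^/classifieds/'))`.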

#### Dogs

In [300]:
# getting dogs urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

for page_num in range(1,501):
    
    URL = f'https://www.pets4homes.co.uk/sale/puppies/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))    

# flattening list of lists and dropping duplicates
dogs = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
dogs = pd.DataFrame(dogs, columns = ['URLs'])

# exporting to csv
dogs.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/dogs_ulrs_10-01-23.csv')

In [301]:
# checking output of dog search scrape
dogs = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/dogs_ulrs_10-01-23.csv',
                   index_col='Unnamed: 0')
dogs

Unnamed: 0,URLs
0,https://www.pets4homes.co.uk/classifieds/bhhhe...
1,https://www.pets4homes.co.uk/classifieds/n-9id...
2,https://www.pets4homes.co.uk/classifieds/yblc-...
3,https://www.pets4homes.co.uk/classifieds/b4ei5...
4,https://www.pets4homes.co.uk/classifieds/j0vow...
...,...
7918,https://www.pets4homes.co.uk/classifieds/hjb6l...
7919,https://www.pets4homes.co.uk/classifieds/pqe3z...
7920,https://www.pets4homes.co.uk/classifieds/owqq6...
7921,https://www.pets4homes.co.uk/classifieds/pyigz...


#### Cats

In [390]:
# getting cats urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,219):
    
    URL = f'https://www.pets4homes.co.uk/sale/kittens/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {219-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
cats = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
cats = pd.DataFrame(cats, columns = ['URLs'])

# exporting to csv
cats.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/cat_ulrs.csv')

10 pages scraped, 209 to go. Time is: 11:45:51
20 pages scraped, 199 to go. Time is: 11:46:21
30 pages scraped, 189 to go. Time is: 11:47:00
40 pages scraped, 179 to go. Time is: 11:47:38
50 pages scraped, 169 to go. Time is: 11:48:28
60 pages scraped, 159 to go. Time is: 11:49:11
70 pages scraped, 149 to go. Time is: 11:49:47
80 pages scraped, 139 to go. Time is: 11:50:29
90 pages scraped, 129 to go. Time is: 11:51:02
100 pages scraped, 119 to go. Time is: 11:51:35
110 pages scraped, 109 to go. Time is: 11:52:09
120 pages scraped, 99 to go. Time is: 11:52:41
130 pages scraped, 89 to go. Time is: 11:53:19
140 pages scraped, 79 to go. Time is: 11:53:59
150 pages scraped, 69 to go. Time is: 11:54:44
160 pages scraped, 59 to go. Time is: 11:55:30
170 pages scraped, 49 to go. Time is: 11:56:16
180 pages scraped, 39 to go. Time is: 11:57:04
190 pages scraped, 29 to go. Time is: 11:58:02
200 pages scraped, 19 to go. Time is: 11:59:01
210 pages scraped, 9 to go. Time is: 11:59:55
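
The totals in these progress messages are hard-coded, which makes them easy to get out of step with the loop: `range(1, 219)` covers 218 pages, so `219-n` overstates the remainder by one. A small helper that receives the total from the range itself avoids this. This is a sketch only; `progress_line` is a hypothetical name, and `time.strftime` is used here in place of slicing `time.ctime()`.

```python
import time
from typing import Optional


def progress_line(done: int, total: int, every: int = 10) -> Optional[str]:
    '''Returns a progress message every `every` pages and on the final page, else None.'''
    if done % every != 0 and done != total:
        return None
    return f'{done} pages scraped, {total - done} to go. Time is: {time.strftime("%H:%M:%S")}'


# Example of use inside a scraping loop (the scraping itself is omitted):
page_nums = range(1, 219)
for done, page_num in enumerate(page_nums, start=1):
    message = progress_line(done, len(page_nums), every=100)
    if message is not None:
        print(message)
```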


#### Reptiles

In [433]:
# getting reptiles urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,112):
    
    URL = f'https://www.pets4homes.co.uk/sale/reptiles/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {112-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
reptiles = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
reptiles = pd.DataFrame(reptiles, columns = ['URLs'])

# exporting to csv
reptiles.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/reptiles_ulrs.csv')

10 pages scraped, 209 to go. Time is: 08:42:46
20 pages scraped, 199 to go. Time is: 08:43:06
30 pages scraped, 189 to go. Time is: 08:43:30
40 pages scraped, 179 to go. Time is: 08:44:01
50 pages scraped, 169 to go. Time is: 08:44:40
60 pages scraped, 159 to go. Time is: 08:45:29
70 pages scraped, 149 to go. Time is: 08:46:06
80 pages scraped, 139 to go. Time is: 08:46:43
90 pages scraped, 129 to go. Time is: 08:47:20
100 pages scraped, 119 to go. Time is: 08:48:05
110 pages scraped, 109 to go. Time is: 08:48:58


#### Birds

In [4]:
# getting birds urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,107):
    
    URL = f'https://www.pets4homes.co.uk/sale/birds/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {107-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
birds = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
birds = pd.DataFrame(birds, columns = ['URLs'])

birds.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/birds_ulrs.csv')

10 pages scraped, 97 to go. Time is: 11:21:11
20 pages scraped, 87 to go. Time is: 11:21:28
30 pages scraped, 77 to go. Time is: 11:21:45
40 pages scraped, 67 to go. Time is: 11:22:04
50 pages scraped, 57 to go. Time is: 11:22:25
60 pages scraped, 47 to go. Time is: 11:22:45
70 pages scraped, 37 to go. Time is: 11:23:05
80 pages scraped, 27 to go. Time is: 11:23:25
90 pages scraped, 17 to go. Time is: 11:23:43
100 pages scraped, 7 to go. Time is: 11:23:55


NameError: name 'reptiles' is not defined

#### Rodents

In [10]:
# getting rodents urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,86):
    
    URL = f'https://www.pets4homes.co.uk/sale/rodents/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {107-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
rodents = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
rodents = pd.DataFrame(rodents, columns = ['URLs'])

rodents.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/rodents_ulrs.csv')

10 pages scraped, 97 to go. Time is: 11:29:12
20 pages scraped, 87 to go. Time is: 11:29:24
30 pages scraped, 77 to go. Time is: 11:29:36
40 pages scraped, 67 to go. Time is: 11:29:51
50 pages scraped, 57 to go. Time is: 11:30:07
60 pages scraped, 47 to go. Time is: 11:30:24
70 pages scraped, 37 to go. Time is: 11:30:42
80 pages scraped, 27 to go. Time is: 11:30:59


URLs    https://www.pets4homes.co.uk/classifieds/juq3f...
Name: 0, dtype: object

#### Rabbits

n.b. rabbits are lagomorphs rather than rodents, and Pets4Homes lists them separately.

In [13]:
# getting rabbits urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,86):
    
    URL = f'https://www.pets4homes.co.uk/sale/rabbits/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {85-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
rabbits = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
rabbits = pd.DataFrame(rabbits, columns = ['URLs'])

rabbits.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/rabbits_ulrs.csv')

10 pages scraped, 75 to go. Time is: 11:32:27
20 pages scraped, 65 to go. Time is: 11:32:41
30 pages scraped, 55 to go. Time is: 11:32:54
40 pages scraped, 45 to go. Time is: 11:33:11
50 pages scraped, 35 to go. Time is: 11:33:27
60 pages scraped, 25 to go. Time is: 11:33:44
70 pages scraped, 15 to go. Time is: 11:34:02
80 pages scraped, 5 to go. Time is: 11:34:13


#### Fish

In [16]:
# getting fish urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,35):
    
    URL = f'https://www.pets4homes.co.uk/sale/fish/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {34-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
fish = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
fish = pd.DataFrame(fish, columns = ['URLs'])

fish.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/fish_ulrs.csv')

10 pages scraped, 24 to go. Time is: 11:45:24
20 pages scraped, 14 to go. Time is: 11:45:36
30 pages scraped, 4 to go. Time is: 11:45:48


#### Poultry

In [17]:
# getting poultry urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,25):
    
    URL = f'https://www.pets4homes.co.uk/sale/poultry/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {24-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
poultry = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
poultry = pd.DataFrame(poultry, columns = ['URLs'])

poultry.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/poultry_ulrs.csv')

10 pages scraped, 14 to go. Time is: 11:48:38
20 pages scraped, 4 to go. Time is: 11:48:50


#### Horses

In [18]:
# getting horses urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,15):
    
    URL = f'https://www.pets4homes.co.uk/sale/horses/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {14-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
horses = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
horses = pd.DataFrame(horses, columns = ['URLs'])

horses.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/horses_ulrs.csv')

10 pages scraped, 4 to go. Time is: 11:52:05


#### Invertebrates

In [20]:
# getting invertebrates urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

n = 0

for page_num in range(1,10):
    
    URL = f'https://www.pets4homes.co.uk/sale/invertebrates/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))
          
    n += 1
    if n % 10 == 0:
        print(f'{n} pages scraped, {9-n} to go. Time is:', time.ctime()[11:19])

# flattening list of lists and dropping duplicates
invertebrates = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
invertebrates = pd.DataFrame(invertebrates, columns = ['URLs'])

invertebrates.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/invertebrates_ulrs.csv')

## Scrape 2 - getting listings' HTML

I had initially designed my 2nd scraper loop to pull up each advert listing one by one, extract the relevant data and add it to a DataFrame. This would have allowed me to save only the information I need, not the full HTML for each listing.

Unfortunately, I discovered that Pets4Homes changes its HTML class names on a regular (I think daily) basis. This caused problems because code that I had written to extract the relevant information on one day would stop working the next. Although this didn't happen, it is also possible that Pets4Homes could have updated their class names whilst I was in the middle of a scrape. Any pages scraped after the change would return no information or the wrong information, but this wouldn't become clear until the scrape was finished (which would have wasted a considerable amount of time).

Given the problems outlined above, I have changed my approach so that my 2nd scraper loop saves the full HTML for each advert. I will then process the saved HTML in a 3rd step to extract the relevant information. This ensures that I can work on the information extraction code without worrying about a Pets4Homes update making it obsolete.

The shortcoming of this approach is that the code written for the 3rd step will only work for the pages I've already scraped, not for future scrapes. It might be possible to write a 2nd scraper loop which is unaffected by Pets4Homes' changing HTML classes, but I felt that, given the timeframe, the approach I have chosen was safer.
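
One way to save the full HTML while protecting against a crash mid-scrape is to append each page to a file as soon as it is fetched, and to skip any URL already in that file when the scrape is restarted. The sketch below does this with a JSON Lines file, assuming one `{"url", "html"}` record per advert. The file layout and the names `already_scraped` and `scrape_to_jsonl` are hypothetical, and `fetch_html` stands in for the Selenium call that returns `dr.page_source`.

```python
import json
import os
from typing import Callable, Iterable, Set


def already_scraped(path: str) -> Set[str]:
    '''Returns the URLs already stored in the JSON Lines file at path.'''
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return {json.loads(line)['url'] for line in f if line.strip()}


def scrape_to_jsonl(urls: Iterable[str],
                    fetch_html: Callable[[str], str],
                    path: str) -> int:
    '''Appends one record per URL not yet in the file; returns how many were added.'''
    done = already_scraped(path)
    added = 0
    with open(path, 'a', encoding='utf-8') as f:
        for url in urls:
            if url in done:
                continue  # scraped in an earlier run
            record = {'url': url, 'html': fetch_html(url)}
            f.write(json.dumps(record) + '\n')
            f.flush()  # hand each record to the OS before fetching the next page
            done.add(url)
            added += 1
    return added
```

Keeping the URL next to its HTML also makes it possible to trace any later extraction problem back to the advert it came from.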

#### Dogs

In [360]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the dogs list which I had scraped prior to the last crash.
n = 7910

dogs_html = []

for page in dogs['URLs'][n:]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    dogs_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(n, 'Time is:', time.ctime())

13


In [361]:
dogs_html_strs = [str(i) for i in dogs_html]

dogs_html_strs_DF = pd.DataFrame(data = dogs_html_strs, columns = ['URLs'])
dogs_html_strs_DF.head()

In [413]:
len(cats)

3612

In [364]:
dogs_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/dogs_html_strs_(7910-to-7923)_11-01-23')
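
Since the browser crashed several times during this scrape, a retry wrapper around the page fetch may reduce the amount of manual restarting. This is a sketch under assumptions: `fetch_with_retry` and its parameters are hypothetical, `fetch_html` stands in for the `dr.get(URL)` and `dr.page_source` pair, and `on_failure` is a hook where the driver could be closed and relaunched. Whether a relaunch actually recovers from the crashes seen here is untested.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch_html: Callable[[str], str],
                     url: str,
                     retries: int = 3,
                     wait_seconds: float = 5.0,
                     on_failure: Optional[Callable[[], None]] = None) -> str:
    '''Calls fetch_html(url), retrying up to `retries` times before giving up.'''
    last_error: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return fetch_html(url)
        except Exception as error:  # broad on purpose: crashes can surface as many exception types
            last_error = error
            if on_failure is not None:
                on_failure()  # e.g. relaunch the browser
            if attempt < retries:
                time.sleep(wait_seconds)
    raise RuntimeError(f'gave up on {url} after {retries} attempts') from last_error
```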

#### Cats

In [424]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the cats list which I had scraped prior to the last crash.
n = 0

cats_html = []

for page in cats['URLs'][0:371]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    cats_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(n, 'Time is:', time.ctime()[11:19])
    
print(len(cats_html))

50 Time is: 20:46:04
100 Time is: 20:48:49
150 Time is: 20:51:00
200 Time is: 20:53:52
250 Time is: 20:57:16
300 Time is: 21:00:17
350 Time is: 21:03:47
371


In [425]:
cats_html_strs = [str(i) for i in cats_html]

cats_html_strs_DF = pd.DataFrame(data = cats_html_strs, columns = ['URLs'])
cats_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><style class=""vjs-styles..."
2,"<html lang=""en""><head><style class=""vjs-styles..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [426]:
cats_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/cats_listing_scrapes_(0-to-371).csv')

#### Reptiles

In [31]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the reptiles list which I had scraped prior to the last crash.
n = 1584

reptiles = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/reptiles_ulrs.csv',
                      index_col = 'Unnamed: 0')

reptiles_html = []

for page in reptiles['URLs'][n:len(reptiles)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    reptiles_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(n, 'Time is:', time.ctime()[11:19])
    
print(len(reptiles_html))

1600 Time is: 13:12:26
1650 Time is: 13:13:38
1700 Time is: 13:14:46
1750 Time is: 13:15:55
1800 Time is: 13:17:17
1850 Time is: 13:18:57
1900 Time is: 13:20:55
1950 Time is: 13:23:11
2000 Time is: 13:25:51
2050 Time is: 13:29:05
2100 Time is: 13:31:43
2150 Time is: 13:32:53
612


In [32]:
reptiles_html_strs = [str(i) for i in reptiles_html]

reptiles_html_strs_DF = pd.DataFrame(data = reptiles_html_strs, columns = ['URLs'])
reptiles_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [34]:
reptiles_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/reptiles_listing_scrapes_(1584-to-612).csv')

#### Birds

In [42]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the birds list which I had scraped prior to the last crash.
n = 1465

birds = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/birds_ulrs.csv',
                      index_col = 'Unnamed: 0')

birds_html = []

for page in birds['URLs'][n:len(birds)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    birds_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(birds)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(birds_html))

1500 scraped,180 remaining. Time is: 15:29:53
1550 scraped,130 remaining. Time is: 15:31:21
1600 scraped,80 remaining. Time is: 15:32:58
1650 scraped,30 remaining. Time is: 15:34:57
215


In [44]:
len(birds_html)

215

In [43]:
len(birds)

1680

In [45]:
birds_html_strs = [str(i) for i in birds_html]

birds_html_strs_DF = pd.DataFrame(data = birds_html_strs, columns = ['URLs'])
birds_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><style class=""vjs-styles..."


In [46]:
birds_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/birds_listing_scrapes_(1465-to-end).csv')

#### Rodents

In [47]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the rodents list which I had scraped prior to the last crash.
n = 0

rodents = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/rodents_ulrs.csv',
                      index_col = 'Unnamed: 0')

rodents_html = []

for page in rodents['URLs'][n:len(rodents)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    rodents_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(rodents)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(rodents_html))

50 scraped,1624 remaining. Time is: 15:44:16
100 scraped,1574 remaining. Time is: 15:46:30
150 scraped,1524 remaining. Time is: 15:49:16
200 scraped,1474 remaining. Time is: 15:51:51
250 scraped,1424 remaining. Time is: 15:54:57
300 scraped,1374 remaining. Time is: 15:58:33
350 scraped,1324 remaining. Time is: 16:00:48
400 scraped,1274 remaining. Time is: 16:02:32
450 scraped,1224 remaining. Time is: 16:05:01
500 scraped,1174 remaining. Time is: 16:07:31
550 scraped,1124 remaining. Time is: 16:09:41
600 scraped,1074 remaining. Time is: 16:12:24
650 scraped,1024 remaining. Time is: 16:14:31
700 scraped,974 remaining. Time is: 16:16:18
750 scraped,924 remaining. Time is: 16:18:36
800 scraped,874 remaining. Time is: 16:21:28
850 scraped,824 remaining. Time is: 16:24:21
900 scraped,774 remaining. Time is: 16:26:01
950 scraped,724 remaining. Time is: 16:27:33
1000 scraped,674 remaining. Time is: 16:29:26
1050 scraped,624 remaining. Time is: 16:31:40
1100 scraped,574 remaining. Time is: 16:3

In [48]:
rodents_html_strs = [str(i) for i in rodents_html]

rodents_html_strs_DF = pd.DataFrame(data = rodents_html_strs, columns = ['URLs'])
rodents_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [51]:
rodents_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/rodents_listing_scrapes_(all).csv')

#### Rabbits

In [75]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the rabbits list which I had scraped prior to the last crash.
n = 1280

rabbits = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/rabbits_ulrs.csv',
                      index_col = 'Unnamed: 0')

rabbits_html = []

for page in rabbits['URLs'][n:len(rabbits)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    rabbits_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(rabbits)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(rabbits_html))

1300 scraped,69 remaining. Time is: 23:36:26
1350 scraped,19 remaining. Time is: 23:37:26
89


In [77]:
rabbits_html_strs = [str(i) for i in rabbits_html]

rabbits_html_strs_DF = pd.DataFrame(data = rabbits_html_strs, columns = ['URLs'])
rabbits_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [76]:
len(rabbits_html) + 1280

1369

In [78]:
rabbits_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/rabbits_listing_scrapes_(1280-to-1369).csv')

#### Fish

In [84]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the fish list which I had scraped prior to the last crash.
n = 332

fish = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/fish_ulrs.csv',
                      index_col = 'Unnamed: 0')

fish_html = []

for page in fish['URLs'][n:len(fish)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    fish_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(fish)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(fish_html))

350 scraped,311 remaining. Time is: 00:16:21
400 scraped,261 remaining. Time is: 00:17:18
450 scraped,211 remaining. Time is: 00:18:15
500 scraped,161 remaining. Time is: 00:19:09
550 scraped,111 remaining. Time is: 00:20:05
600 scraped,61 remaining. Time is: 00:20:58
650 scraped,11 remaining. Time is: 00:21:52
329


In [86]:
fish_html_strs = [str(i) for i in fish_html]

fish_html_strs_DF = pd.DataFrame(data = fish_html_strs, columns = ['URLs'])
fish_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><style class=""vjs-styles..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [85]:
len(fish_html) + 332

661

In [87]:
fish_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/fish_listing_scrapes_(332-to-661).csv')
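
Because each animal was scraped in stages, the saved HTML ends up spread over several files whose names carry the slice they cover, such as `(332-to-661)`. Before the chunks are joined back together they need to be put in order. The sketch below sorts filenames by the start index in their name, assuming every chunk follows the `(start-to-end)` pattern used in this notebook, with `(all)` treated as starting at 0. `CHUNK_PATTERN`, `chunk_start` and `order_chunks` are hypothetical names.

```python
import re
from typing import Iterable, List

# Matches suffixes such as '(332-to-661)' or '(1465-to-end)'
CHUNK_PATTERN = re.compile(r'\((\d+)-to-(\d+|end)\)')


def chunk_start(filename: str) -> int:
    '''Returns the start index encoded in a chunk filename, or 0 if there is none.'''
    match = CHUNK_PATTERN.search(filename)
    return int(match.group(1)) if match else 0


def order_chunks(filenames: Iterable[str]) -> List[str]:
    '''Sorts chunk filenames by the position at which each chunk starts.'''
    return sorted(filenames, key=chunk_start)
```

The ordered files could then be joined with `pd.concat([pd.read_csv(p, index_col=0) for p in ordered], ignore_index=True)`.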

#### Poultry

In [89]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the poultry list which I had scraped prior to the last crash.
n = 0

poultry = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/poultry_ulrs.csv',
                      index_col = 'Unnamed: 0')

poultry_html = []

for page in poultry['URLs'][n:len(poultry)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    poultry_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(poultry)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(poultry_html))

50 scraped,411 remaining. Time is: 00:39:38
100 scraped,361 remaining. Time is: 00:40:32
150 scraped,311 remaining. Time is: 00:41:28
200 scraped,261 remaining. Time is: 00:42:22
250 scraped,211 remaining. Time is: 00:43:17
300 scraped,161 remaining. Time is: 00:44:22
350 scraped,111 remaining. Time is: 00:45:17
400 scraped,61 remaining. Time is: 00:46:19
450 scraped,11 remaining. Time is: 00:47:36
461


In [90]:
poultry_html_strs = [str(i) for i in poultry_html]

poultry_html_strs_DF = pd.DataFrame(data = poultry_html_strs, columns = ['URLs'])
poultry_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [91]:
poultry_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/poultry_listing_scrapes_(all).csv')

#### Horses

In [92]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the horses list which I had scraped prior to the last crash.
n = 0

horses = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/horses_ulrs.csv',
                      index_col = 'Unnamed: 0')

horses_html = []

for page in horses['URLs'][n:len(horses)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    horses_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(horses)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(horses_html))

50 scraped,219 remaining. Time is: 14:40:38
100 scraped,169 remaining. Time is: 14:42:10
150 scraped,119 remaining. Time is: 14:43:37
200 scraped,69 remaining. Time is: 14:45:11
250 scraped,19 remaining. Time is: 14:46:47
269


In [93]:
horses_html_strs = [str(i) for i in horses_html]

horses_html_strs_DF = pd.DataFrame(data = horses_html_strs, columns = ['URLs'])
horses_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [94]:
horses_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/horses_listing_scrapes_(all).csv')

#### invertebrates

In [95]:
dr = webdriver.Chrome()

# I've had to carry out this scrape in multiple stages due to selenium/chrome crashing. 
# n is the number of URLs in the invertebrates list which I had scraped prior to the last crash.
n = 0

invertebrates = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/scraped_urls/invertebrates_ulrs.csv',
                      index_col = 'Unnamed: 0')

invertebrates_html = []

for page in invertebrates['URLs'][n:len(invertebrates)]:
    
    URL = page

    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    invertebrates_html.append(page)
    
    # output to track progress
    n += 1 
    if n % 50 == 0:
        print(f'{n} scraped,{len(invertebrates)-n} remaining. Time is: {time.ctime()[11:19]}')
    
print(len(invertebrates_html))

50 scraped,118 remaining. Time is: 14:50:51
100 scraped,68 remaining. Time is: 14:52:10
150 scraped,18 remaining. Time is: 14:53:27
168


In [96]:
invertebrates_html_strs = [str(i) for i in invertebrates_html]

invertebrates_html_strs_DF = pd.DataFrame(data = invertebrates_html_strs, columns = ['URLs'])
invertebrates_html_strs_DF.head()

Unnamed: 0,URLs
0,"<html lang=""en""><head><link as=""script"" href=""..."
1,"<html lang=""en""><head><link as=""script"" href=""..."
2,"<html lang=""en""><head><link as=""script"" href=""..."
3,"<html lang=""en""><head><link as=""script"" href=""..."
4,"<html lang=""en""><head><link as=""script"" href=""..."


In [98]:
invertebrates_html_strs_DF.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/invertebrates_listing_scrapes_(all).csv')
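
The per-category cells above all repeat the same fetch-and-append loop. A minimal sketch of how that loop could be folded into one resumable function is shown here. The names `scrape_pages`, `fetch` and `fake_fetch` are hypothetical and do not appear elsewhere in the notebook; `fake_fetch` stands in for the Selenium driver so the sketch runs without a browser.

```python
import time


def scrape_pages(urls, fetch, start=0, report_every=50):
    """Fetch every URL from index `start` onwards and return the HTML strings.

    `fetch` is any callable that takes a URL and returns its HTML, so the
    Selenium driver can be swapped for a stub when testing.
    """
    pages = []
    done = start
    for url in urls[start:]:
        pages.append(fetch(url))
        done += 1
        # progress output, same idea as the per-category loops
        if done % report_every == 0:
            print(f'{done} scraped, {len(urls) - done} remaining. '
                  f'Time is: {time.strftime("%H:%M:%S")}')
    return pages


# Stub standing in for the browser (assumption: real code would call the driver).
def fake_fetch(url):
    return f'<html><title>{url}</title></html>'


# Made-up URL lists; the real ones come from the scraped_urls CSV files.
category_urls = {
    'horses': ['h1', 'h2', 'h3'],
    'invertebrates': ['i1', 'i2'],
}

scraped = {name: scrape_pages(urls, fake_fetch, report_every=2)
           for name, urls in category_urls.items()}
```

With Selenium, `fetch` could be a small function that calls `dr.get(url)` and returns `dr.page_source`. The `start` argument plays the role of `n` when resuming after a crash.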

## Data extraction

I've now scraped the full HTML of all current listings on Pets4homes.co.uk. The next step is to process this HTML to extract the relevant data.

In [170]:
def extract_data(html):
    """
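    Extract listing fields from the HTML string of a single advert page.

    Returns a dict containing whichever fields could be parsed. Fields that
    cannot be found are skipped. Raises if the page has no JSON-LD
    <script type="application/ld+json"> block.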
    """
    
    soup = BeautifulSoup(html, 'html.parser')
    
    data_dict = {}

    application = soup.find('script', attrs = {'type':"application/ld+json"}).text.replace('\\n','\n')
    application_json = json.loads(application, strict=False)
    
    # TITLE
    try:
        data_dict['title'] = soup.find('title').text.split('|')[0]
    except:
        pass
        
    # PRICE
    try:
        data_dict['price'] = (application_json['offers']['price'])
    except:
        pass
        
    # URL
    try:
        data_dict['url'] = (application_json['offers']['url'])
    except:
        pass
        
    # SELLER
    try:
        data_dict['seller_type'] = (application_json['offers']['seller']['type'])
    except:
        pass
    
    try:
        data_dict['seller_name'] = (application_json['offers']['seller']['name'])
    except:
        pass
        
        
    # VERIFICATION
    colour_dict = {'#69d4a1':1, '#c0ccda':0}

    try:
        for i in (soup.find(attrs = {'data-testid':"verification-status-field"})):
            els = i.find_all()
            colour = (els[0]['style'])
            data_dict[els[1].text] = colour_dict[colour[13:20]]
    except Exception as e:
        try:
            contact_type = ['Phone', 'Email','Facebook', 'Google']
            verification_status = soup.find_all(attrs = {'data-testid':"verification-status-field"})[:4]
            for ct, vs in zip(contact_type, verification_status):
                if '#69d4a1' in str(vs):
                    data_dict[ct] = 1
                elif '#c0ccda' in str(vs):
                    data_dict[ct] = 0
        except Exception as e:
            pass

    # IMAGES
    try:
        data_dict['n_images'] = len(application_json['image'])
    except:
        pass
    
    # CATEGORY
    try:
        data_dict['category'] = (application_json['category']).replace(' ','').split('>')[1]
    except:
        pass
    
    
    # DETAILS 
    try:
        listing_details = soup.find(attrs = {"data-testid": 'listing-details'})
        listing_params = listing_details.find_all(attrs = {"data-testid": 'details-parameter'})

        for param in listing_params:

            param_text = re.sub(r"<.*?>", "|", str(param))
            param_text = param_text.split("|")
            values = []
            for el in param_text:
                if el != '':
                    values.append(el)
            data_dict[values[0]] = values[1]
    except:
        pass

    # DESCRIPTION
    try:
        data_dict['description'] = soup.find('meta', attrs = {'name':"description"})['content']
    except:
        pass
    
    
    return(data_dict)
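
Most of `extract_data` is a series of `try`/`except` blocks around nested dictionary lookups. One possible alternative, sketched below, is a small helper that walks the nested keys and returns a default when a step is missing. The helper name `dig` and the cut-down payload are assumptions for illustration. Unlike the function above, this records a missing field as `None` instead of leaving the key out.

```python
import json


def dig(obj, *keys, default=None):
    """Walk nested dicts/lists key by key; return `default` if any step is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Cut-down, made-up JSON-LD payload shaped like the fields extract_data reads.
application_json = json.loads(
    '{"offers": {"price": "70", "seller": {"type": "Person"}}, "image": ["a", "b"]}'
)

price = dig(application_json, 'offers', 'price')
seller_name = dig(application_json, 'offers', 'seller', 'name')  # absent in this payload
n_images = len(dig(application_json, 'image', default=[]))
```

Catching only `KeyError`, `IndexError` and `TypeError` means unrelated errors still surface instead of being silently swallowed.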

In [171]:
fish_test = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/data/data_fish_listings/fish_listing_scrapes_(0-to-332).csv')

extract_data(fish_test.iloc[0].values[1])

{'title': 'breading pair of oscars ',
 'price': '70',
 'url': '/classifieds/taskggl35-breading-pair-of-oscars-telford/',
 'seller_type': 'Person',
 'seller_name': 'Phil B.',
 'Phone': 1,
 'Email': 1,
 'Facebook': 1,
 'Google': 0,
 'n_images': 3,
 'category': 'Fish',
 'Adv. ID': 'tASkGGL35',
 'Adv. Location': 'Trench, Telford',
 'Advert Type': 'For sale',
 'Advertiser': 'Individual',
 'Breed': 'Cichlids',
 'Pet Age: ': '3 days',
 'description': 'large pair of oscars around 12 inchs 1 albino and female a tigar a proven pair of breaders no fault of there own for sale I just want a change in tank tigar oscar was Born with a lip defect but never stopped her eating and never effected her every day life must go together'}

In [173]:
# combining all scraped html into single df for data extraction

all_data = pd.DataFrame()

path = '/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/data'

# BIRDS
all_data = all_data.append(pd.read_csv(f'{path}/data_birds_listing/birds_listing_scrapes_(0-to-1465).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_birds_listing/birds_listing_scrapes_(1465-to-end).csv',
                                      index_col = 'Unnamed: 0'))

print('Birds done.')



# CATS
all_data = all_data.append(pd.read_csv(f'{path}/data_cats_listings/cats_listing_scrapes_(0-to-371).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_cats_listings/cats_listing_scrapes_(370-to-3268).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_cats_listings/cats_listing_scrapes_(3268-to-3612).csv',
                                      index_col = 'Unnamed: 0'))

print('Cats done.')



# DOGS
all_data = all_data.append(pd.read_csv(f'{path}/data_dogs_listings/dogs_html_strs_(0-to-4368)_11-01-23',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_dogs_listings/dogs_html_strs_(4368-to-5561)_11-01-23',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_dogs_listings/dogs_html_strs_(5561-to-6984)_11-01-23',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_dogs_listings/dogs_html_strs_(6984-to-7910)_11-01-23',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_dogs_listings/dogs_html_strs_(7910-to-7923)_11-01-23',
                                      index_col = 'Unnamed: 0'))

print('Dogs done.')



# FISH
all_data = all_data.append(pd.read_csv(f'{path}/data_fish_listings/fish_listing_scrapes_(0-to-332).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_fish_listings/fish_listing_scrapes_(332-to-661).csv',
                                      index_col = 'Unnamed: 0'))

print('Fish done.')


# HORSES
all_data = all_data.append(pd.read_csv(f'{path}/horses_listing_scrapes_(all).csv',
                                      index_col = 'Unnamed: 0'))

print('Horses done.')


# INVERTEBRATES
all_data = all_data.append(pd.read_csv(f'{path}/invertebrates_listing_scrapes_(all).csv',
                                      index_col = 'Unnamed: 0'))


print('Invertebrates done.')


# POULTRY
all_data = all_data.append(pd.read_csv(f'{path}/poultry_listing_scrapes_(all).csv',
                                      index_col = 'Unnamed: 0'))


print('Poultry done.')


# POULTRY
all_data = all_data.append(pd.read_csv(f'{path}/poultry_listing_scrapes_(all).csv',
                                      index_col = 'Unnamed: 0'))

print('Poultry done.')


# RABBITS
all_data = all_data.append(pd.read_csv(f'{path}/data_rabbits_listings/rabbits_listing_scrapes_(0-to-537).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_rabbits_listings/rabbits_listing_scrapes_(537-to-923).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_rabbits_listings/rabbits_listing_scrapes_(923-to-1280).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_rabbits_listings/rabbits_listing_scrapes_(1280-to-1369).csv',
                                      index_col = 'Unnamed: 0'))


print('Rabbits done.')



# REPTILES
all_data = all_data.append(pd.read_csv(f'{path}/data_reptiles_listings/reptiles_listing_scrapes_(0-to-1584).csv',
                                      index_col = 'Unnamed: 0'))

all_data = all_data.append(pd.read_csv(f'{path}/data_reptiles_listings/reptiles_listing_scrapes_(1584-to-2196).csv',
                                      index_col = 'Unnamed: 0'))


print('Reptiles done.')



# RODENTS
all_data = all_data.append(pd.read_csv(f'{path}/rodents_listing_scrapes_(all).csv',
                                      index_col = 'Unnamed: 0'))


print('Rodents done.')


Birds done.
Cats done.
Dogs done.
Fish done.
Horses done.
Invertebrates done.
Poultry done.
Poultry done.
Rabbits done.
Reptiles done.
Rodents done.
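
`DataFrame.append` is deprecated in recent pandas releases and was removed in pandas 2.0, so the cell above may not run on a newer install. A minimal sketch of the same combination using `pd.concat` follows. The helper name `combine_scrapes` and the temporary stand-in files are assumptions, and the glob pattern would need adjusting for the dog files, which have no `.csv` extension.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_scrapes(folder, pattern='*.csv'):
    """Read every matching CSV under `folder` and stack them into one DataFrame."""
    files = sorted(Path(folder).rglob(pattern))
    frames = [pd.read_csv(f, index_col=0) for f in files]
    # ignore_index gives one clean 0..n-1 index instead of repeating per-file indices
    return pd.concat(frames, ignore_index=True)


# Tiny stand-in files so the sketch runs without the real scrape data.
with tempfile.TemporaryDirectory() as tmp:
    for name, rows in [('fish_a.csv', ['<html>1</html>', '<html>2</html>']),
                       ('fish_b.csv', ['<html>3</html>'])]:
        pd.DataFrame({'URLs': rows}).to_csv(Path(tmp) / name)
    combined = combine_scrapes(tmp)
```

Building the file list once, rather than writing one call per file, should also make it harder to load the same file twice by accident.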


In [174]:
# extracting dataset from htmls

pets_data = []

n = 0
el = 0

for pet in all_data['URLs']:
    el += 1
    if el % 50 == 0:
        print(f'{el} complete. {len(all_data["URLs"]) - el} to go.')
    try:
        pets_data.append(extract_data(pet))
    except:
        n += 1
        pass    
print(f'{n} failed. Done.')

50 complete. 20161 to go.
100 complete. 20111 to go.
150 complete. 20061 to go.
200 complete. 20011 to go.
250 complete. 19961 to go.
300 complete. 19911 to go.
350 complete. 19861 to go.
400 complete. 19811 to go.
450 complete. 19761 to go.
500 complete. 19711 to go.
550 complete. 19661 to go.
600 complete. 19611 to go.
650 complete. 19561 to go.
700 complete. 19511 to go.
750 complete. 19461 to go.
800 complete. 19411 to go.
850 complete. 19361 to go.
900 complete. 19311 to go.
950 complete. 19261 to go.
1000 complete. 19211 to go.
1050 complete. 19161 to go.
1100 complete. 19111 to go.
1150 complete. 19061 to go.
1200 complete. 19011 to go.
1250 complete. 18961 to go.
1300 complete. 18911 to go.
1350 complete. 18861 to go.
1400 complete. 18811 to go.
1450 complete. 18761 to go.
1500 complete. 18711 to go.
1550 complete. 18661 to go.
1600 complete. 18611 to go.
1650 complete. 18561 to go.
1700 complete. 18511 to go.
1750 complete. 18461 to go.
1800 complete. 18411 to go.
...

14750 complete. 5461 to go.
14800 complete. 5411 to go.
14850 complete. 5361 to go.
14900 complete. 5311 to go.
14950 complete. 5261 to go.
15000 complete. 5211 to go.
15050 complete. 5161 to go.
15100 complete. 5111 to go.
15150 complete. 5061 to go.
15200 complete. 5011 to go.
15250 complete. 4961 to go.
15300 complete. 4911 to go.
15350 complete. 4861 to go.
15400 complete. 4811 to go.
15450 complete. 4761 to go.
15500 complete. 4711 to go.
15550 complete. 4661 to go.
15600 complete. 4611 to go.
15650 complete. 4561 to go.
15700 complete. 4511 to go.
15750 complete. 4461 to go.
15800 complete. 4411 to go.
15850 complete. 4361 to go.
15900 complete. 4311 to go.
15950 complete. 4261 to go.
16000 complete. 4211 to go.
16050 complete. 4161 to go.
16100 complete. 4111 to go.
16150 complete. 4061 to go.
16200 complete. 4011 to go.
16250 complete. 3961 to go.
16300 complete. 3911 to go.
16350 complete. 3861 to go.
16400 complete. 3811 to go.
16450 complete. 3761 to go.
...

In [338]:
pets_df = pd.DataFrame.from_dict(pets_data)

## Cleaning data

In [339]:
# removing duplicate rows
print(f'{pets_df.shape[0] - pets_df.drop_duplicates().shape[0]} duplicate rows dropped') 
pets_df.drop_duplicates(inplace = True)

718 duplicate rows dropped


In [340]:
# checking columns
print(pets_df.columns)

Index(['title', 'price', 'url', 'seller_type', 'seller_name', 'Phone', 'Email',
       'Facebook', 'Google', 'n_images', 'category', 'Adv. ID',
       'Adv. Location', 'Advert Type', 'Advertiser', 'Breed', 'Pet Age: ',
       'Pet Colour', 'Sex', 'description', 'Health Checked', 'Is Microchipped',
       'Is Neutered', 'Is Vaccinated', 'Is Worm Treated', 'Pet Available',
       'Pets in litter', 'Registered', 'Is advertiser the orig. breeder',
       'Is KC Registered', 'Pet Viewable with Mother', 'Birth Year',
       'Category 1', 'Category 2', 'Gender', 'Height', 'Origin',
       'Level Jumping', 'Level Dressage'],
      dtype='object')


In [341]:
# Cleaning column names

pets_df.columns = ['title', 
 'price', 
 'url', 
 'seller_type', 
 'seller_name', 
 'phone_verified', 
 'email_verified',
 'facebook_verified', 
 'google_verified', 
 'n_images', 
 'category',
 'advert_id', 
 'advert_location',
 'advert_type', 
 'advertiser_type', 
 'breed', 
 'pet_age', 
 'pet_colour', 
 'pet_sex',
 'description',
 'health_checked',
 'microchipped',
 'neutered',
 'vaccinated',
 'worm_treated',
 'pet_available',
 'pets_in_litter',
 'registered',
 'original_breeder',
 'kc_registered',
 'viewable_with_mother',
 'birth_year',
 'category_1',
 'category_2',
 'gender',
 'height',
 'origin',
 'level_jumping',
 'level_dressage']

print(pets_df.columns)

Index(['title', 'price', 'url', 'seller_type', 'seller_name', 'phone_verified',
       'email_verified', 'facebook_verified', 'google_verified', 'n_images',
       'category', 'advert_id', 'advert_location', 'advert_type',
       'advertiser_type', 'breed', 'pet_age', 'pet_colour', 'pet_sex',
       'description', 'health_checked', 'microchipped', 'neutered',
       'vaccinated', 'worm_treated', 'pet_available', 'pets_in_litter',
       'registered', 'original_breeder', 'kc_registered',
       'viewable_with_mother', 'birth_year', 'category_1', 'category_2',
       'gender', 'height', 'origin', 'level_jumping', 'level_dressage'],
      dtype='object')


In [342]:
pets_df.dtypes

title                    object
price                    object
url                      object
seller_type              object
seller_name              object
phone_verified          float64
email_verified          float64
facebook_verified       float64
google_verified         float64
n_images                  int64
category                 object
advert_id                object
advert_location          object
advert_type              object
advertiser_type          object
breed                    object
pet_age                  object
pet_colour               object
pet_sex                  object
description              object
health_checked           object
microchipped             object
neutered                 object
vaccinated               object
worm_treated             object
pet_available            object
pets_in_litter           object
registered               object
original_breeder         object
kc_registered            object
viewable_with_mother     object
...

In [343]:
# converting price dtype to float
pets_df.price = pets_df.price.astype(float)

In [344]:
# dropping advert type, as all values are the same
pets_df.drop(columns = 'advert_type', inplace = True)

In [345]:
# 9475 of the sellers have unique names
sum(pets_df['seller_name'].value_counts()==1)

# 3219 of the seller names appear more than once
sum(pets_df['seller_name'].value_counts()>1)

# of the sellers whose names appear more than once, 1153 are organizations
sum(pets_df[pets_df['seller_type'] == 'Organization']['seller_name'].value_counts()>1)

# I'm going to drop the seller_name column and replace it with a continuous numeric column indicating the 
# number of times that seller's name appeared in the dataset. 
seller_name_n_occurances = dict(pets_df['seller_name'].value_counts())
pets_df['seller_n_adverts'] = [seller_name_n_occurances[i] for i in pets_df['seller_name']]
pets_df.drop(columns = 'seller_name', inplace = True)

# Roughly two thirds of the sellers whose names appear more than once are individuals; it is likely that several 
# of these are multiple individuals with similar names (e.g. John S.). I will explore this relationship
# during EDA and will perform additional cleaning if it seems necessary.
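
The per-seller advert count built above with a dictionary and a list comprehension can also be written as a single `map` call. This is a minimal sketch on a made-up three-row frame; the frame name `pets` and the seller names are illustrative only.

```python
import pandas as pd

# Made-up seller names, standing in for pets_df['seller_name'].
pets = pd.DataFrame({'seller_name': ['John S.', 'Ann B.', 'John S.']})

# value_counts() gives name -> count; map() looks each row's name up in it.
pets['seller_n_adverts'] = pets['seller_name'].map(pets['seller_name'].value_counts())
```

One difference worth noting: a row with a missing seller name would get NaN here, whereas the dictionary lookup would likely raise a `KeyError`, because `value_counts` leaves NaN out by default.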

In [346]:
# 6653 unique locations
pets_df['advert_location'].value_counts()

# converting all values to strings
pets_df['advert_location'] = pets_df['advert_location'].astype(str)

# taking only the last element of each location
pets_df['advert_location'] = [i.split(',')[-1].replace(' ', '') for i in pets_df['advert_location']]

# 1168 unique locations
pets_df['advert_location'].value_counts()

London         1116
Birmingham      454
Manchester      386
Nottingham      251
Doncaster       218
               ... 
Corsham           1
Lifton            1
IsleofLewis       1
Keith             1
Bures             1
Name: advert_location, Length: 1168, dtype: int64
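
Removing every space also joins multi-word place names, which is why 'IsleofLewis' appears in the output above. If readable names are preferred, a possible variant is to strip only the surrounding whitespace. The helper name `town_from_location` is hypothetical.

```python
def town_from_location(location):
    """Keep the part after the last comma and trim surrounding whitespace only."""
    return str(location).split(',')[-1].strip()


examples = ['Trench, Telford', 'Isle of Lewis', ' Waltham Abbey']
towns = [town_from_location(loc) for loc in examples]
```

Whether this changes the number of unique locations in the real data has not been checked here.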

In [347]:
# the seller_type and advertiser_type columns contain similar information. Advertiser type has more details
# on the types of organisations, but includes some NaNs. I will combine these columns.

# no 'Organization' has advertiser_type == 'Individual', there are 68 NaNs.
pets_df[pets_df['seller_type'] == 'Organization']['advertiser_type'].value_counts(dropna = False)

# every 'Person' has advertiser_type == 'Individual', except 92 NaNs
pets_df[pets_df['seller_type'] == 'Person']['advertiser_type'].value_counts(dropna = False)

# Inferring any row with seller_type == 'Person' should have advertiser_type == 'Individual' & replacing NaNs
pets_df.loc[pets_df['seller_type'] == 'Person','advertiser_type'] = 'Individual'

# Looking at rows with seller_type == 'Organization' & advertiser_type == NaN
pets_df[(pets_df['seller_type'] == 'Organization') & (pets_df['advertiser_type'].isna())]

# These adverts all have similar/generic titles (e.g. "rabbits for sale"), NaNs in most columns
# and are not verified by any of the 4 verification methods. I suspect they're fake ads and so will drop them.
pets_df = pets_df[pets_df['advertiser_type'].notna()]

# advertiser_type now contains all information in seller_type (plus extra detail), dropping seller_type
pets_df.drop(columns = 'seller_type', inplace = True)

In [348]:
# 388 unique values for breed
pets_df.breed.value_counts()

# 92 NaNs
sum(pets_df.breed.isna())
pets_df[pets_df.breed.isna()]

# these adverts seem similar to the ones I removed in the previous cell. I.e. mostly NaNs, unverified, generic
# names, etc. Dropping them.
pets_df = pets_df[pets_df.breed.notna()]

In [349]:
# pet_age and birth_year contain similar information, but the latter is horse specific. I'll combine them.

# all rows with birth_year != NaN are horses
pets_df[pets_df.birth_year.notna()]['category'].value_counts()

# no horse rows have NaNs for birth_year
len(pets_df[pets_df['category'] == 'Horses'])

# Horses' age in years can be approximated from their birth_year
pets_df.loc[pets_df['category'] == 'Horses','pet_age'] = (2023 - pets_df[pets_df['category'] == 'Horses']['birth_year'].astype(int)).astype(str)+' years'

# birth_year can now be dropped
pets_df.drop(columns = 'birth_year', inplace = True)

# 34 rows remain with NaN for pet_age. I'll drop them.
pets_df['pet_age'].isna().sum()
pets_df = pets_df[pets_df['pet_age'].notna()]

In [350]:
import datetime

#dt = datetime.datetime.strptime(str_td, "%H:%M:%S")
pets_age_in_days = []
for row in [i.split(',') for i in pets_df['pet_age'].astype(str)]:
    num_days = 0
    for el in row:
        try:
            if el == 'Just Born Today':
                pass
            elif 'NaN' in el: # some rows have 'NaN years' / 'NaN months'
                num_days = np.nan
            elif 'year' in el:
                num_days += (int(re.findall(r'\d+', el)[0]) * 365)
            elif 'month' in el:
                # 30.436875 is the mean month length in the gregorian calendar
                num_days += (int(re.findall(r'\d+', el)[0]) * 30.436875)
            elif 'week' in el:
                num_days += (int(re.findall(r'\d+', el)[0]) * 7)
            elif 'day' in el:
                num_days += int(re.findall(r'\d+', el)[0])
        except Exception:
            # print any element that could not be parsed
            print(el)
    pets_age_in_days.append(num_days)
    
pets_df['pets_age_in_days'] = pets_age_in_days

# after converting all ages to days, 3 missing values remain, which I'll drop
pets_df[pets_df['pets_age_in_days'].isna()]
pets_df = pets_df[pets_df['pets_age_in_days'].notna()]
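
The age conversion in the cell above can also be expressed as a small standalone function, which is easier to test on individual strings. This is a sketch only: the function name `age_to_days` and the `DAYS_PER_UNIT` table are assumptions, and it uses a single regular expression in place of the `if`/`elif` chain.

```python
import math
import re

# Same conversion factors as the loop above.
DAYS_PER_UNIT = {'year': 365, 'month': 30.436875, 'week': 7, 'day': 1}


def age_to_days(age_text):
    """Convert a string such as '10 months, 3 days' into a number of days."""
    if 'NaN' in age_text:  # e.g. 'NaN years'
        return math.nan
    total = 0
    # Each match is a (number, unit) pair such as ('10', 'month').
    for number, unit in re.findall(r'(\d+)\s*(year|month|week|day)', age_text):
        total += int(number) * DAYS_PER_UNIT[unit]
    return total


sample_ages = ['10 months, 3 days', '3 years', '15 weeks', 'Just Born Today']
sample_days = [age_to_days(a) for a in sample_ages]
```

It could then be applied with something like `pets_df['pet_age'].astype(str).map(age_to_days)`.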

In [351]:
# about half of the rows have NaNs for pet_available (the date from which the pet is available for collection)
pets_df['pet_available'].isna().sum()

# Given this and given that this is unlikely to be a good predictor of price, I'm going to drop the column.
pets_df = pets_df.drop(columns = 'pet_available')

In [352]:
# 'registered' is cat specific and indicates if the cat is registered with one of 3 ownership clubs
# e.g. The Governing Council of the Cat Fancy (no, seriously). 'kc_registered' indicates whether a dog is 
# registered with the UK Kennel Club. These two columns can be combined. Later I'll create interaction terms
# with the pet categories.

# for dogs, we have yes and no values and no NaNs
pets_df[pets_df['category'] == 'Dogs']['kc_registered'].value_counts(dropna = False)

# for cats, we have 2344 NaNs and either 'TICA', 'GCCF', 'FIFe', or some combination of them
pets_df[pets_df['category'] == 'Cats']['registered'].value_counts(dropna = False)

# If a cat is registered with any club, I will encode this as 1
pets_df.loc[(pets_df['category'] == 'Cats') & (pets_df['registered'].notna()), 'registered'] = 1

# I will assume that cats with NaNs for registered are not registered with any club
pets_df.loc[(pets_df['category'] == 'Cats') & (pets_df['registered'].isna()), 'registered'] = 0

# for dogs, I'll move the kc_registered values over to registered
pets_df.loc[(pets_df['category'] == 'Dogs') & (pets_df['kc_registered'] == 'yes'), 'registered'] = 1
pets_df.loc[(pets_df['category'] == 'Dogs') & (pets_df['kc_registered'] == 'no'), 'registered'] = 0

# kc_registered can now be dropped
pets_df.drop(columns = 'kc_registered', inplace = True)

In [353]:
# Pets in litter takes the form of 'n males / n females'. Needs to be cleaned into two continuous numeric 
# columns. 

males_in_litter = []
females_in_litter = []

for row in pets_df[pets_df['pets_in_litter'].notna()]['pets_in_litter']:
    
    males_in_cur_litter = 0
    females_in_cur_litter = 0

    els = row.split('/')
    for el in els[:2]:
    
        # check 'female' first, since 'male' is a substring of it
        if 'female' in el:
            females_in_cur_litter += int(re.findall(r'\d+', el)[0])
        
        elif 'male' in el:
            males_in_cur_litter += int(re.findall(r'\d+', el)[0])
    
    males_in_litter.append(males_in_cur_litter)
    females_in_litter.append(females_in_cur_litter)
    
# adding new columns
pets_df.loc[pets_df['pets_in_litter'].notna(), 'males_in_litter'] = males_in_litter
pets_df.loc[pets_df['pets_in_litter'].notna(), 'females_in_litter'] = females_in_litter

# dropping pets in litter
pets_df.drop(columns = 'pets_in_litter', inplace = True)
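
The litter parsing above can be isolated into a function that handles one string at a time. This is a minimal sketch assuming the 'n males / n females' format described in the cell comment; the name `parse_litter` is hypothetical.

```python
import re


def parse_litter(text):
    """Split a string like '3 males / 2 females' into (males, females) counts."""
    males = females = 0
    for part in text.split('/')[:2]:
        found = re.search(r'\d+', part)
        if found is None:
            continue  # no number in this part, nothing to count
        # Test 'female' first because 'male' is a substring of 'female'.
        if 'female' in part:
            females += int(found.group())
        elif 'male' in part:
            males += int(found.group())
    return males, females


litters = [parse_litter(s) for s in ['3 males / 2 females', '4 females', '1 male']]
```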

The columns 'pet_sex', 'gender' and 'neutered' overlap and need some specific processing. 'pet_sex' has three levels (male, female and mixed). 'gender' is specific to horse listings and also has three levels (mare, gelding and stallion). 'neutered' has only two levels. A gelding is a neutered male horse, yet the horse rows all have NaNs in the 'neutered' column. I will recode these columns so that the information stored in 'gender' is carried by 'pet_sex' and 'neutered', after which 'gender' can be dropped.

In [354]:
# if gender == gelding, neutered == yes, pet_sex == male
pets_df.loc[pets_df['gender'] == 'Gelding', 'neutered'] = 'yes'
pets_df.loc[pets_df['gender'] == 'Gelding', 'pet_sex'] = 'Male'

# if gender == stallion, neutered == no, pet_sex == male
pets_df.loc[pets_df['gender'] == 'Stallion', 'neutered'] = 'no'
pets_df.loc[pets_df['gender'] == 'Stallion', 'pet_sex'] = 'Male'

# spaying a mare is extremely rare as it is a dangerous operation. It's typically only done in the UK as a 
# life-saving procedure. As such, I will assume all listed mares have not been spayed.
pets_df.loc[pets_df['gender'] == 'Mare', 'neutered'] = 'no'
pets_df.loc[pets_df['gender'] == 'Mare', 'pet_sex'] = 'Female'

# dropping gender
pets_df.drop(columns = 'gender', inplace = True)

Having disentangled gender and pet_sex, there is still some semantic overlap between pet_sex and fe/males_in_litter. Specifically, rows with non-NaN values for pet_sex all have NaNs for fe/males_in_litter, and vice versa. Where the litter is all male or all female, pet_sex can be updated to match. Where there are both, pet_sex can be updated to mixed.

I still want to keep both sets of columns, as fe/males_in_litter indicates a) that the pets are infants and b) the number of each sex, which is absent from pet_sex. However, I can't just drop pet_sex, as some of the pets are adult animals and so would have a fe/males_in_litter value of 0.

In [355]:
pets_df['pet_sex'].value_counts()

Mixed     3256
Male      2302
Female    1638
Name: pet_sex, dtype: int64

In [356]:
pets_df.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/data/pets4homes_data_(16-01-23)')

In [357]:
pets_df = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/data/pets4homes_data_(16-01-23)',
                     index_col = 'Unnamed: 0')



In [358]:
pets_df.head()

Unnamed: 0,title,price,url,phone_verified,email_verified,facebook_verified,google_verified,n_images,category,advert_id,advert_location,advertiser_type,breed,pet_age,pet_colour,pet_sex,description,health_checked,microchipped,neutered,vaccinated,worm_treated,registered,original_breeder,viewable_with_mother,category_1,category_2,height,origin,level_jumping,level_dressage,seller_n_adverts,pets_age_in_days,males_in_litter,females_in_litter
0,Budgies for sale,20.0,/classifieds/vpqtzqc0i-budgies-for-sale-middle...,1.0,1.0,0.0,0.0,9,Birds,VpQTZqc0I,Middlesbrough,Breeder,Budgerigars,"10 months, 3 days","Blue, Green, Yellow",Mixed,Exhibition Budgies for sale\n\nBoth male and f...,,,,,,,,,,,,,,,3,307.36875,,
1,Ringneck,120.0,/classifieds/khxbn9dk9-ringneck-witney/,1.0,1.0,0.0,1.0,3,Birds,khxBn9dK9,Witney,Individual,Ringnecks,3 years,,Female,A honest post please read fully. \nIndie is my...,,,,,,,,,,,,,,,6,1095.0,,
2,kakarekis,35.0,/classifieds/m4uz4ww-a-kakarekis-northwich/,1.0,1.0,0.0,0.0,2,Birds,m4uZ4ww-A,Northwich,Breeder,Parakeets,"8 months, 26 days",Green,Mixed,beautiful this years young baby’s parent reare...,,,,,,,,,,,,,,,1,269.495,,
3,Baby / adult budgies for sale £ 15 each 2 for...,15.0,/classifieds/ohjzun2lp-baby-adult-budgies-for-...,1.0,1.0,0.0,0.0,7,Birds,OHjzUN2Lp,WalthamAbbey,Breeder,Budgerigars,3 days,,Mixed,Baby and adult budgies for sale\n£15 each \n2 ...,,,,,,,,,,,,,,,3,3.0,,
4,Beautiful friendly white capped Pionus parrots,675.0,/classifieds/5npb4stu6-beautiful-friendly-whit...,1.0,1.0,0.0,1.0,7,Birds,5nPB4sTu6,Ilford,Breeder,Parrots,15 weeks,Green,Mixed,Hand reared white capped pionus. Very friendly...,,,,,,,,,,,,,,,4,105.0,,


In [359]:
# These are all NaN
pets_df[pets_df['pet_sex'].notna()]['males_in_litter'].value_counts(dropna = False)
pets_df[pets_df['pet_sex'].notna()]['females_in_litter'].value_counts(dropna = False)

# These are also all NaN
pets_df[pets_df['males_in_litter'].notna()]['pet_sex'].value_counts(dropna = False)
pets_df[pets_df['females_in_litter'].notna()]['pet_sex'].value_counts(dropna = False)

# These can be updated to 'Mixed'
pets_df.loc[(pets_df['pet_sex'].isna()) & (pets_df['males_in_litter'] > 0) & (pets_df['females_in_litter'] > 0),['pet_sex']] = 'Mixed'

# These can be updated to 'Male'
pets_df.loc[(pets_df['pet_sex'].isna()) & (pets_df['males_in_litter'] > 0) & (pets_df['females_in_litter'] == 0),['pet_sex']] = 'Male'

# These can be updated to 'Female'
pets_df.loc[(pets_df['pet_sex'].isna()) & (pets_df['males_in_litter'] == 0) & (pets_df['females_in_litter'] > 0),['pet_sex']] = 'Female'

# This leaves 1791 NaNs
pets_df['pet_sex'].value_counts(dropna= False)

Mixed     9687
Male      4183
Female    3615
NaN       1791
Name: pet_sex, dtype: int64
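
The three conditional updates above can also be written as a single `np.select` call. This is a minimal sketch on a made-up frame (`toy` and its values are invented for illustration, not rows from the scraped data). It assumes the same rule as the cell above and leaves rows without litter counts as missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pets_df (values are invented for illustration)
toy = pd.DataFrame({
    'pet_sex': ['Male', None, None, None, None],
    'males_in_litter': [np.nan, 2, 3, 0, np.nan],
    'females_in_litter': [np.nan, 1, 0, 4, np.nan],
})

males = toy['males_in_litter']
females = toy['females_in_litter']

# Conditions are checked in order; comparisons against NaN evaluate to False
conditions = [
    (males > 0) & (females > 0),
    (males > 0) & (females == 0),
    (males == 0) & (females > 0),
]
labels = ['Mixed', 'Male', 'Female']

# Rows matching no condition get an empty string, which is then turned into NaN
derived = pd.Series(np.select(conditions, labels, default=''), index=toy.index)
derived = derived.where(derived != '')

# Keep any sex that was already listed; only fill the gaps
toy['pet_sex'] = toy['pet_sex'].fillna(derived)
```

Using `fillna` rather than overwriting means a listed `pet_sex` is never replaced by a derived one.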

In [360]:
pets_df.isna().sum()

title                       0
price                       0
url                         0
phone_verified              0
email_verified              0
facebook_verified           0
google_verified             0
n_images                    0
category                    0
advert_id                   0
advert_location             0
advertiser_type             0
breed                       0
pet_age                     0
pet_colour              11138
pet_sex                  1791
description                 0
health_checked           7863
microchipped             6524
neutered                 6269
vaccinated               6524
worm_treated             6599
registered               7861
original_breeder        13770
viewable_with_mother    11451
category_1              19021
category_2              19072
height                  19026
origin                  19021
level_jumping           19263
level_dressage          19268
seller_n_adverts            0
pets_age_in_days            0
males_in_l

I still have a large number of missing values in several columns. Each missing value has one of two causes:

1. Some aspect of the scrape or data extraction failed for that listing.
2. The column records information that is not applicable to certain categories of animal. An example is level_dressage, which is specific to horses.

I need to determine which NaNs belong to which category. The NaNs in category 1 need to be cleaned, imputed or dropped. The NaNs in category 2 can be replaced by a placeholder value indicating that the column is not relevant for that animal type.

In [361]:
# This groupby shows which columns are entirely NaNs for each animal type
# These columns are most likely to be instances of category 2.
pd.set_option('display.max_columns', None)
(pets_df.groupby("category").apply(lambda x: x.isna().mean()) == 1)[pets_df.columns]

category,title,price,url,phone_verified,email_verified,facebook_verified,google_verified,n_images,category,advert_id,advert_location,advertiser_type,breed,pet_age,pet_colour,pet_sex,description,health_checked,microchipped,neutered,vaccinated,worm_treated,registered,original_breeder,viewable_with_mother,category_1,category_2,height,origin,level_jumping,level_dressage,seller_n_adverts,pets_age_in_days,males_in_litter,females_in_litter
Birds,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,False,True,True
Cats,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,True,True,False,False,False,False
Dogs,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,False,False,False,False
Fish,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,False,True,True
Horses,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,True,True,True,True,True,False,False,False,False,False,False,False,False,True,True
Invertebrates,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,False,True,True
Rabbits,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,True,True,True,True,True,True,True,True,False,False,True,True
Reptiles,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,False,True,True
Rodents,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,True,True,True,True,True,True,True,True,True,True,True,True,True,False,False,True,True
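
A variant of the groupby above reports the fraction of NaNs rather than a True/False flag, which also surfaces columns that are only partly missing within a category. This is a sketch on an invented frame (`toy`, `nan_share` and the values are assumptions for illustration). It groups the NaN mask by the category column instead of using `apply`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pets_df (values are invented for illustration)
toy = pd.DataFrame({
    'category': ['Dogs', 'Dogs', 'Horses', 'Horses'],
    'microchipped': ['yes', np.nan, np.nan, np.nan],
    'height': [np.nan, np.nan, '15.2', np.nan],
})

# Fraction of missing values per column within each category
nan_share = toy.drop(columns='category').isna().groupby(toy['category']).mean()

# 1.0: never populated for that category (candidate for 'Not applicable')
never_populated = nan_share == 1

# Between 0 and 1: populated for some rows only (candidate for cleaning or 'Unlisted')
partly_missing = (nan_share > 0) & (nan_share < 1)
```

In this toy frame, `height` is never populated for dogs, while `microchipped` is partly missing for dogs and never populated for horses.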


I will begin with the horse-specific columns, as these are the clearest instances of category 2.

In [362]:
pets_df[(pets_df['category'] == 'Horses')][['category_1', 
                                            'category_2', 
                                            'height', 
                                            'origin', 
                                            'level_jumping',
                                            'level_dressage']].isna().sum()

category_1          0
category_2         51
height              5
origin              0
level_jumping     242
level_dressage    247
dtype: int64

In [363]:
pets_df[(pets_df['category'] != 'Horses')][['category_1', 
                                            'category_2', 
                                            'height', 
                                            'origin', 
                                            'level_jumping',
                                            'level_dressage']].isna().sum()

category_1        19021
category_2        19021
height            19021
origin            19021
level_jumping     19021
level_dressage    19021
dtype: int64

In [364]:
# all horses have values for category_1 and origin, all other animals get a placeholder value
pets_df.loc[pets_df['category'] != 'Horses', 'category_1'] = 'Not applicable'
pets_df.loc[pets_df['category'] != 'Horses', 'origin'] = 'Not applicable'

In [365]:
# 51 horses have NaN for category_2 (per the NaN counts for horses above)
pets_df.loc[pets_df['category'] == 'Horses', 'category_2'].value_counts(dropna=False)

# as one level of category_2 is the catch-all 'other', this value can be assigned to the NaN rows
pets_df.loc[(pets_df['category'] == 'Horses') & (pets_df['category_2'].isna()), 'category_2'] = 'Other'

# all other animals get a placeholder
pets_df.loc[pets_df['category'] != 'Horses', 'category_2'] = 'Not applicable'

In [366]:
pets_df.shape

(19276, 35)

In [367]:
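# 5 horses are missing height values. I will add placeholders for other animals and drop these 5 horses
pets_df.loc[pets_df['category'] != 'Horses', 'height'] = 'Not applicable'
# .copy() makes the filtered frame independent, so later .loc assignments do not trigger SettingWithCopyWarning
pets_df = pets_df[pets_df['height'].notna()].copy()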

In [368]:
# There are only 13 animals in the entire dataset with a jumping level
pets_df[pets_df['category'] == 'Horses']['level_jumping'].notna().sum()

# I will assume that if a horse does not have a value for jumping level, then it is not a jumping horse.
# This can then be treated as a binary variable for horses.
pets_df.loc[(pets_df['category'] == 'Horses') & (pets_df['level_jumping'].notna()), 'level_jumping'] = 'Yes'
pets_df.loc[(pets_df['category'] == 'Horses') & (pets_df['level_jumping'].isna()), 'level_jumping'] = 'No'

# all other animals get a placeholder value
pets_df.loc[pets_df['category'] != 'Horses', 'level_jumping'] = 'Not applicable'

# renaming the column
pets_df.rename(columns={'level_jumping':'jumping_horse'}, inplace = True)

In [369]:
# There are only 8 animals in the entire dataset with a dressage level
pets_df[pets_df['category'] == 'Horses']['level_dressage'].notna().sum()

# I will assume that if a horse does not have a value for dressage level, then it is not a dressage horse.
# This can then be treated as a binary variable for horses.
pets_df.loc[(pets_df['category'] == 'Horses') & (pets_df['level_dressage'].notna()), 'level_dressage'] = 'Yes'
pets_df.loc[(pets_df['category'] == 'Horses') & (pets_df['level_dressage'].isna()), 'level_dressage'] = 'No'

# all other animals get a placeholder value
pets_df.loc[pets_df['category'] != 'Horses', 'level_dressage'] = 'Not applicable'

# renaming the column
pets_df.rename(columns={'level_dressage':'dressage_horse'}, inplace = True)

In [370]:
pets_df.shape

(19271, 35)
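
The jumping and dressage cells above apply the same three-step rule, so it could be wrapped in one helper. This is a sketch only: `level_to_flag` is a hypothetical name, and the `'Novice'` level in the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def level_to_flag(df, level_col, category='Horses'):
    """Return 'Yes'/'No' for rows in `category` and 'Not applicable' elsewhere."""
    in_category = df['category'] == category
    has_level = df[level_col].notna()
    # Inside the category: any recorded level counts as 'Yes', NaN counts as 'No'
    flag = np.where(has_level, 'Yes', 'No')
    return pd.Series(np.where(in_category, flag, 'Not applicable'), index=df.index)

# Toy stand-in for pets_df (values are invented for illustration)
toy = pd.DataFrame({
    'category': ['Horses', 'Horses', 'Dogs'],
    'level_jumping': ['Novice', np.nan, np.nan],
})
toy['jumping_horse'] = level_to_flag(toy, 'level_jumping')
```

Returning a new Series rather than overwriting `level_jumping` keeps the original levels available if they are needed later.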

In [371]:
# Pet colour

# Pets4homes does not record pet colour for fish, invertebrates, reptiles or rodents
no_colour = ['Fish', 'Invertebrates', 'Reptiles', 'Rodents']
for i in no_colour:
    pets_df.loc[(pets_df['category'] == i), 'pet_colour'] = 'Not applicable'

# other NaNs are where the seller has not listed the colour
pets_df.loc[pets_df['pet_colour'].isna(), 'pet_colour'] = 'Unlisted'

In [388]:
# Pet sex

# Pets4homes does not record pet sex for fish
pets_df.loc[(pets_df['category'] == 'Fish'), 'pet_sex'] = 'Not applicable'

# other NaNs are where the seller has not listed info
pets_df.loc[pets_df['pet_sex'].isna(), 'pet_sex'] = 'Unlisted'

In [373]:
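# health checked, microchipped, vaccinated, worm treated, registered, males in litter, females in litter

# I treat all of the above as applicable to dogs and cats only.
# Note: the all-NaN groupby above shows that Rabbits have some values for microchipped,
# vaccinated and worm_treated; the loop below overwrites those with the placeholder.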
categories = pets_df['category'].unique()
for i in categories:
    if i == 'Cats' or i == 'Dogs':
        continue
    pets_df.loc[(pets_df['category'] == i), 'health_checked'] = 'Not applicable'
    pets_df.loc[(pets_df['category'] == i), 'microchipped'] = 'Not applicable'
    pets_df.loc[(pets_df['category'] == i), 'vaccinated'] = 'Not applicable'
    pets_df.loc[(pets_df['category'] == i), 'worm_treated'] = 'Not applicable'
    pets_df.loc[(pets_df['category'] == i), 'registered'] = 'Not applicable'
    pets_df.loc[(pets_df['category'] == i), 'males_in_litter'] = 'Not applicable'
    pets_df.loc[(pets_df['category'] == i), 'females_in_litter'] = 'Not applicable'

# other NaNs are where the seller has not listed info
pets_df.loc[pets_df['health_checked'].isna(), 'health_checked'] = 'Unlisted'
pets_df.loc[pets_df['microchipped'].isna(), 'microchipped'] = 'Unlisted'
pets_df.loc[pets_df['vaccinated'].isna(), 'vaccinated'] = 'Unlisted'
pets_df.loc[pets_df['worm_treated'].isna(), 'worm_treated'] = 'Unlisted'
pets_df.loc[pets_df['registered'].isna(), 'registered'] = 'Unlisted'
pets_df.loc[pets_df['males_in_litter'].isna(), 'males_in_litter'] = 'Unlisted'
pets_df.loc[pets_df['females_in_litter'].isna(), 'females_in_litter'] = 'Unlisted'


In [374]:
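# neutered is treated as applicable for cats, dogs and horses only
# Note: the all-NaN groupby above shows that Rabbits also have some neutered values; these are overwritten below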
categories = pets_df['category'].unique()
for i in categories:
    if i in ['Cats', 'Dogs', 'Horses']:
        continue
    pets_df.loc[(pets_df['category'] == i), 'neutered'] = 'Not applicable'

In [375]:
# original breeder and viewable with mother are only applicable to dogs

pets_df.loc[(pets_df['category'] != 'Dogs'), 'original_breeder'] = 'Not applicable'
pets_df.loc[(pets_df['category'] != 'Dogs'), 'viewable_with_mother'] = 'Not applicable'

# other NaNs are where the seller has not listed info
pets_df.loc[pets_df['original_breeder'].isna(), 'original_breeder'] = 'Unlisted'
pets_df.loc[pets_df['viewable_with_mother'].isna(), 'viewable_with_mother'] = 'Unlisted'
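
The placeholder cells above repeat one pattern: mark a column as 'Not applicable' outside the categories it belongs to, then mark the remaining NaNs as 'Unlisted'. A mapping-driven helper is one way to express that in a single place. This is a sketch under assumptions: `applicable` and `fill_placeholders` are hypothetical names, and the toy rows are invented.

```python
import numpy as np
import pandas as pd

# Column -> categories for which the column is applicable (taken from the comments above)
applicable = {
    'original_breeder': ['Dogs'],
    'neutered': ['Cats', 'Dogs', 'Horses'],
}

def fill_placeholders(df, applicable, unlisted_cols=()):
    """Return a copy with 'Not applicable' / 'Unlisted' placeholders filled in."""
    out = df.copy()
    for col, cats in applicable.items():
        # Cast to object first so string placeholders can sit in numeric columns
        out[col] = out[col].astype(object)
        out.loc[~out['category'].isin(cats), col] = 'Not applicable'
    for col in unlisted_cols:
        # Whatever is still NaN belongs to an applicable category: the seller left it blank
        out[col] = out[col].fillna('Unlisted')
    return out

# Toy stand-in for pets_df (values are invented for illustration)
toy = pd.DataFrame({
    'category': ['Dogs', 'Dogs', 'Cats', 'Fish'],
    'original_breeder': ['yes', np.nan, np.nan, np.nan],
    'neutered': ['no', 'yes', np.nan, np.nan],
})
cleaned = fill_placeholders(toy, applicable, unlisted_cols=['original_breeder'])
```

Because `neutered` is not passed in `unlisted_cols`, its NaN for the cat row is left in place, which makes it easy to see which columns still need a decision.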

In [376]:
pets_df[pets_df['worm_treated'].isna()]

Unnamed: 0,title,price,url,phone_verified,email_verified,facebook_verified,google_verified,n_images,category,advert_id,advert_location,advertiser_type,breed,pet_age,pet_colour,pet_sex,description,health_checked,microchipped,neutered,vaccinated,worm_treated,registered,original_breeder,viewable_with_mother,category_1,category_2,height,origin,jumping_horse,dressage_horse,seller_n_adverts,pets_age_in_days,males_in_litter,females_in_litter


In [389]:
pets_df.isna().sum()

title                   0
price                   0
url                     0
phone_verified          0
email_verified          0
facebook_verified       0
google_verified         0
n_images                0
category                0
advert_id               0
advert_location         0
advertiser_type         0
breed                   0
pet_age                 0
pet_colour              0
pet_sex                 0
description             0
health_checked          0
microchipped            0
neutered                0
vaccinated              0
worm_treated            0
registered              0
original_breeder        0
viewable_with_mother    0
category_1              0
category_2              0
height                  0
origin                  0
jumping_horse           0
dressage_horse          0
seller_n_adverts        0
pets_age_in_days        0
males_in_litter         0
females_in_litter       0
dtype: int64

The following columns do not require any cleaning:
 - url (won't be used in analysis, kept for reference)
 - advert_id (won't be used in analysis, kept for reference)
 - n_images

In [390]:
pets_df.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/data/pets4homes_data_cleaned')
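
One thing to be aware of when saving: by default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is re-read. The sketch below shows the difference on a two-row frame, using an in-memory buffer instead of a file path. Whether to pass `index=False` for the real export is a judgment call, not something the cell above requires.

```python
import io
import pandas as pd

# Two-row stand-in for pets_df
toy = pd.DataFrame({'title': ['Budgies for sale', 'Ringneck'],
                    'price': [20.0, 120.0]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
```

Here `with_index` has the columns `Unnamed: 0`, `title` and `price`, while `without_index` has only `title` and `price`.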

In [394]:
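# numeric_only=True restricts the mean to numeric columns (recent pandas versions raise on text columns otherwise)
pets_df.groupby('category').mean(numeric_only=True)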

category,price,phone_verified,email_verified,facebook_verified,google_verified,n_images,seller_n_adverts,pets_age_in_days
Birds,401.652116,0.999394,0.990904,0.150394,0.195876,4.727107,8.451789,417.127237
Cats,688.375796,0.99665,0.999162,0.092406,0.212451,10.324958,2.165271,257.343958
Dogs,5323.000409,0.999234,0.998851,0.100472,0.16622,11.691561,1.8934,139.191073
Fish,3459.258891,1.0,1.0,0.440571,0.017433,3.949287,3.944532,396.391639
Horses,2484.712,1.0,0.996,0.296,0.016,5.348,2.776,2357.9
Invertebrates,30.211728,1.0,1.0,0.401235,0.018519,3.54321,5.092593,338.443607
Rabbits,49.286462,1.0,0.997008,0.290202,0.013463,6.145849,2.747943,238.490932
Reptiles,155.264887,0.999539,0.99677,0.125981,0.239963,4.636364,11.23904,971.527713
Rodents,52.771657,1.0,0.998193,0.353614,0.013855,5.101807,6.16988,202.579484


#### Creating dummy variables

In [379]:
dummies = pd.get_dummies(pets_df[['health_checked',
                                  'microchipped',
                                  'vaccinated',
                                  'worm_treated',
                                  'original_breeder',
                                  'viewable_with_mother']],
              drop_first = True,
              dummy_na = True)

dummies.mean()*100

health_checked_Unlisted           0.010378
health_checked_no                 9.781537
health_checked_yes               49.442167
health_checked_nan                0.000000
microchipped_no                  19.822531
microchipped_yes                 39.411551
microchipped_nan                  0.000000
vaccinated_no                    15.053708
vaccinated_yes                   44.180375
vaccinated_nan                    0.000000
worm_treated_Unlisted             0.010378
worm_treated_no                   4.893363
worm_treated_yes                 54.330341
worm_treated_nan                  0.000000
original_breeder_Unlisted        12.075139
original_breeder_no               0.902911
original_breeder_yes             27.668517
original_breeder_nan              0.000000
viewable_with_mother_Unlisted     0.041513
viewable_with_mother_no           4.384827
viewable_with_mother_yes         36.220227
viewable_with_mother_nan          0.000000
dtype: float64
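
To make the output above easier to read, the sketch below shows how `drop_first` and `dummy_na` interact on a single invented column. `drop_first=True` removes the first level in sorted order (here 'Not applicable', since uppercase sorts before lowercase). `dummy_na=True` adds a `_nan` column even when no NaNs remain, which is why those rows are all 0 above.

```python
import pandas as pd

# Toy column with the same four levels used in pets_df (row counts are invented)
toy = pd.DataFrame({
    'health_checked': ['Not applicable', 'Unlisted', 'no', 'yes', 'yes']
})

dummies = pd.get_dummies(toy, drop_first=True, dummy_na=True)

# Percentage of rows in each remaining level
share = dummies.mean() * 100
```

The resulting columns are `health_checked_Unlisted`, `health_checked_no`, `health_checked_yes` and `health_checked_nan`. The dropped 'Not applicable' level is the baseline, implied when all the other dummies are 0.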

In [395]:
pets_df.isna().sum()

title                   0
price                   0
url                     0
phone_verified          0
email_verified          0
facebook_verified       0
google_verified         0
n_images                0
category                0
advert_id               0
advert_location         0
advertiser_type         0
breed                   0
pet_age                 0
pet_colour              0
pet_sex                 0
description             0
health_checked          0
microchipped            0
neutered                0
vaccinated              0
worm_treated            0
registered              0
original_breeder        0
viewable_with_mother    0
category_1              0
category_2              0
height                  0
origin                  0
jumping_horse           0
dressage_horse          0
seller_n_adverts        0
pets_age_in_days        0
males_in_litter         0
females_in_litter       0
dtype: int64

In [397]:
pets_df_dictionary = {
    'title' : 'The title of the listing',
    'price' : 'The price of the listed pet',
    'url' : 'The URL of the listing',
    'phone_verified' : "Whether the seller's profile has been verified by phone",
    'email_verified' : "Whether the seller's profile has been verified by email",
    'facebook_verified' : "Whether the seller's profile has been verified by Facebook",
    'google_verified' : "Whether the seller's profile has been verified by Google",
    'n_images' : 'The number of images included in the listing',
    'category' : 'The type of animal',
    'advert_id' : "The listing's unique ID",
    'advert_location' : "The location of the seller",
    'advertiser_type' : 'The type of seller',
    'breed' : 'The breed of the pet(s) being sold',
    'pet_age' : 'The age of the pet(s) being sold',
    'pet_colour' : 'The colour of the pet(s) being sold',
    'pet_sex' : 'The sex of the pet(s) being sold',
    'description' : 'The listing description provided by the seller',
    'health_checked' : 'Whether the pet(s) have been health checked',
    'microchipped' : 'Whether the pet(s) have been microchipped',
    'neutered' : "Whether the pet(s) have been neutered",
    'vaccinated' : "Whether the pet(s) have been vaccinated",
    'worm_treated' : "Whether the pet(s) have been worm treated",
    'registered' : "Whether the pet(s) are registered with a breeders' or owners' club/society",
    'original_breeder' : 'Whether the seller is the original breeder of the pet',
    'viewable_with_mother' : "Whether the pet is viewable with its mother",
    'category_1' : 'The primary category of a horse', 
    'category_2' : 'The secondary category of a horse',
    'height' : 'The height of a horse (measured in hands)',
    'origin' : 'The origin of a horse',
    'jumping_horse' : 'Whether a horse does show jumping',
    'dressage_horse' : 'Whether a horse does dressage',
    'seller_n_adverts' : 'The number of adverts that seller had at the time of data collection',
    'pets_age_in_days' : "The age of the pet(s), measured in days",
    'males_in_litter' : "The number of male pets in the litter",
    'females_in_litter' : "The number of female pets in the litter"}
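
Since two columns were renamed during cleaning, it may be worth checking that the data dictionary and the DataFrame columns still agree. This is a minimal sketch, assuming a hypothetical helper `check_dictionary` and an invented three-column example; it also shows one way to turn the dictionary into a table for documentation.

```python
import pandas as pd

def check_dictionary(df, dictionary):
    """Return (columns missing from the dictionary, dictionary keys with no column)."""
    cols = set(df.columns)
    keys = set(dictionary)
    return sorted(cols - keys), sorted(keys - cols)

# Toy example: the dictionary still uses a pre-rename column name
toy = pd.DataFrame(columns=['title', 'price', 'jumping_horse'])
toy_dictionary = {
    'title': 'The title of the listing',
    'price': 'The price of the listed pet',
    'level_jumping': 'Whether a horse does show jumping',
}

undocumented, stale = check_dictionary(toy, toy_dictionary)

# The dictionary as a two-column table, ready to save or display
dictionary_table = (pd.Series(toy_dictionary, name='description')
                      .rename_axis('column')
                      .reset_index())
```

In the toy example `undocumented` is `['jumping_horse']` and `stale` is `['level_jumping']`; for a dictionary that matches its DataFrame, both lists would be empty.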

#### Bonus

4. Document your project goals (revise from your initial pitch)
   - Articulate “Specific aim”
   - Outline proposed methods and models
   - Define risks & assumptions

5. Create a blog post of at least 500 words that describes your work so far. Link to it in your Jupyter notebook.


## Deliverable Format & Submission

- Table, file, or database with relevant text file or notebook description.

---

## Suggested Ways to Get Started

- Review your initial proposal topic and feedback, and revise accordingly.
- Spend time with your data and verify that it can help you accomplish the goals you set out to pursue.
- If not, document how you intend to change those goals.
- Alternatively, go find some additional data and/or try another source.

---

## Useful Resources

- [Exploratory Data Analysis](http://insightdatascience.com/blog/eda-and-graphics-eli-bressert.html)
- [Best practices for data documentation](https://www.dataone.org/all-best-practices)

---

## Project Feedback + Evaluation

[Attached here is a complete rubric for this project.](./capstone-part-02-rubric.md)

Your instructors will score each of your technical requirements using the scale below:

Score  | Expectations
--- | ---
**0** | _Incomplete._
**1** | _Does not meet expectations._
**2** | _Meets expectations, good job!_