# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Part 2: Dataset + Data Collection

## Overview

Based on the feedback you received from your lightning talk, choose **one** of your topic areas to move forward with. For Part 2, you'll need to collect, clean, and document the dataset(s) you intend to use for your project.

This is not always a trivial task. Remember that data acquisition, transformation, and cleaning are typically the most time-consuming parts of data science projects, so don’t procrastinate!

Once you have your data, load it and review it to confirm that it is as useful as you expected. If not, switch datasets, gather additional data (e.g. multiple datasets), or revise your project goals.

Create your own database and data dictionary, then clean and munge your data as appropriate. Finally, document your work so far.

**Goal**: Find the data you need for your project, then clean and document it.


## Requirements

1. Find and Clean Your Data: Source and format the required data for your project.
   - Create a database
   - Create a data dictionary
2. Perform preliminary data munging and cleaning of your data: organize your data relevant to your project goals.
   - Review data to verify initial assumptions
   - Clean and munge data as necessary
3. Describe your data: keep your intended audience(s) in mind.
   - Document your work so far in a Jupyter notebook.
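
One way to approach the database and data dictionary requirements is sketched below with Python's built-in `sqlite3`. The table name `listings`, its columns, and the `DATA_DICTIONARY` entries are illustrative assumptions, not a prescribed schema.

```python
import sqlite3

# Hypothetical data dictionary: one entry per column in the table below.
DATA_DICTIONARY = [
    {"column": "url", "type": "TEXT", "description": "Listing URL (unique key)"},
    {"column": "category", "type": "TEXT", "description": "Listing category, e.g. 'dog' or 'cat'"},
]

def create_listings_db(path=":memory:"):
    """Create the (hypothetical) listings table from the data dictionary."""
    conn = sqlite3.connect(path)
    columns = ", ".join(f"{d['column']} {d['type']}" for d in DATA_DICTIONARY)
    conn.execute(f"CREATE TABLE IF NOT EXISTS listings ({columns}, UNIQUE (url))")
    return conn

def add_listings(conn, urls, category):
    """Insert URLs, skipping any that are already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO listings (url, category) VALUES (?, ?)",
        [(u, category) for u in urls],
    )
    conn.commit()

conn = create_listings_db()
add_listings(conn, ["https://example.com/a", "https://example.com/b"], "dog")
add_listings(conn, ["https://example.com/a"], "dog")  # duplicate is ignored
print(conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0])
```

Keeping the data dictionary next to the schema means the column descriptions and the table definition are less likely to drift apart.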

In [2]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import requests
import bs4
from bs4 import BeautifulSoup
from selenium import webdriver

import re
import time



I am sourcing my data by web scraping pets4homes.co.uk. My goal is to scrape information from every current dog listing. Time permitting, I may also go back and scrape all cat listings too.

To source the data I will need to conduct two rounds of web scraping. The first round will loop through each page of search results and collect the URL associated with each listing. After removing any duplicates returned by this scrape, I will use the scraped URLs to visit each listing and scrape its relevant information.
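
A minimal sketch of how the second round could be structured, assuming the first round has already produced a list of URLs. The names `dedupe_keep_order`, `scrape_listings`, `fetch`, and `parse` are hypothetical; the real fetch and parse steps depend on the listing page layout, which is not shown here.

```python
import time

def dedupe_keep_order(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))

def scrape_listings(urls, fetch, parse, delay=1.0):
    """Second round: fetch and parse each unique listing URL.

    `fetch(url)` returns HTML text and `parse(html)` returns a dict;
    both are placeholders supplied by the caller.
    """
    records = []
    for url in dedupe_keep_order(urls):
        record = parse(fetch(url))
        record["url"] = url
        records.append(record)
        time.sleep(delay)  # pause between requests to be polite to the site
    return records

# Usage with stand-in functions (no network access needed):
fake_pages = {"u1": "<h1>A</h1>", "u2": "<h1>B</h1>"}
rows = scrape_listings(
    ["u1", "u2", "u1"],
    fetch=fake_pages.get,
    parse=lambda html: {"n_chars": len(html)},
    delay=0,
)
print(rows)
```

Passing `fetch` and `parse` in as arguments keeps the loop testable offline and lets the same loop serve both dog and cat listings.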

In [10]:
# temp - for developing code

path = '/Users/lewis/Desktop/Pets4homes_html.rtf'

with open(path) as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')




In [56]:
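# function to extract listing urls from a parsed search results page

def get_url(page):
    """Return the unique listing URLs found on one search page, in page order."""
    
    cur_page_listing_urls = []

    # listing links carried the class "cb Um" when this was written
    for a in page.find_all('a', class_="cb Um"):
        url = 'https://www.pets4homes.co.uk' + a['href']
        # skip links that appear more than once on the page
        if url not in cur_page_listing_urls:
            cur_page_listing_urls.append(url)
    
    return cur_page_listing_urls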

get_url(soup)

['https://www.pets4homes.co.uk/classifieds/yr3fb2k-p-american-xl-bullys-derby/',
 'https://www.pets4homes.co.uk/classifieds/vyty6c1dk-minature-sproodles-both-parents-dna-health-tested-ludlow/',
 'https://www.pets4homes.co.uk/classifieds/z1oexw0ee-beautiful-pocket-bully-york/',
 'https://www.pets4homes.co.uk/classifieds/nkagacnp6-4-minature-black-cream-tan-puppies-for-sale-saint-austell/',
 'https://www.pets4homes.co.uk/classifieds/zs49btlds-xl-bully-puppies-enzo-bossy-huntingdon/',
 'https://www.pets4homes.co.uk/classifieds/ki82wcwyi-beautiful-blue-pomeranian-girl-wednesbury/',
 'https://www.pets4homes.co.uk/classifieds/lxvob6o3u-french-bulldog-puppies-5-weeks-old-wigston/',
 'https://www.pets4homes.co.uk/classifieds/svdoywp2-bosipoos-winchester/',
 'https://www.pets4homes.co.uk/classifieds/dldyzmqdc-f1-goldendoodle-puppies-available-18th-feb-wrexham/',
 'https://www.pets4homes.co.uk/classifieds/jlhv6nwjw-beautiful-shihpoochon-puppies-for-sale-llanelli/',
 'https://www.pets4homes.co.uk
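
Class names such as `cb Um` may change when the site is rebuilt. A hedged alternative, assuming every listing link's `href` starts with `/classifieds/` (as the output above suggests), is to match on the href instead. `get_url_by_href`, `BASE_URL`, and the sample HTML are illustrative and not part of the original scrape.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.pets4homes.co.uk'

def get_url_by_href(page):
    """Collect unique listing URLs by matching the href prefix."""
    urls = []
    # CSS attribute selector: anchors whose href begins with /classifieds/
    for a in page.select('a[href^="/classifieds/"]'):
        url = urljoin(BASE_URL, a['href'])
        if url not in urls:
            urls.append(url)
    return urls

# Made-up HTML: two links to the same listing and one unrelated link
sample = BeautifulSoup(
    '<a class="x" href="/classifieds/abc-pup/">Pup</a>'
    '<a class="y" href="/classifieds/abc-pup/">Pup again</a>'
    '<a href="/about/">About</a>',
    'html.parser',
)
print(get_url_by_href(sample))
```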

In [64]:
# getting dogs urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

for page_num in range(1,501):
    
    URL = f'https://www.pets4homes.co.uk/sale/puppies/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))

# flattening list of lists and dropping duplicates
dogs = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
dogs = pd.DataFrame(dogs, columns = ['URLs'])

# exporting to csv
dogs.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/dog_ulrs.csv')

500


In [73]:
# getting cats urls

# launching Chrome
dr = webdriver.Chrome()

listing_urls = []

for page_num in range(1,224):
    
    URL = f'https://www.pets4homes.co.uk/sale/kittens/local/local/page-{page_num}/'
    
    # going to the URL
    dr.get(URL)

    # getting the html 
    html = dr.page_source

    page = BeautifulSoup(html, 'html.parser')
    
    listing_urls.append(get_url(page))

# flattening list of lists and dropping duplicates
cats = set([item for sublist in listing_urls for item in sublist])

# converting to DataFrame
cats = pd.DataFrame(cats, columns = ['URLs'])

# exporting to csv
cats.to_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/cat_ulrs.csv')

223
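
The dog and cat cells above differ only in the URL category and the page count, so they could share one helper. The sketch below is one possible refactor, not the code that produced the results above: `collect_listing_urls` and `FakeDriver` are hypothetical names, and the fake driver only exists so the example runs without a browser.

```python
import time

def collect_listing_urls(driver, category, n_pages, extract, delay=1.0):
    """Visit each search page for `category` and return the unique listing URLs.

    `driver` needs `.get(url)` and `.page_source` (as a Selenium WebDriver has);
    `extract(html)` returns the listing URLs found in one page of HTML.
    """
    found = []
    for page_num in range(1, n_pages + 1):
        driver.get(f'https://www.pets4homes.co.uk/sale/{category}/local/local/page-{page_num}/')
        found.extend(extract(driver.page_source))
        time.sleep(delay)  # brief pause between page loads
    return list(dict.fromkeys(found))  # drop duplicates, keep order

class FakeDriver:
    """Stand-in for webdriver.Chrome() so the sketch runs offline."""
    def get(self, url):
        self.page_source = url  # pretend the page HTML is just its URL

urls = collect_listing_urls(FakeDriver(), 'puppies', 3,
                            extract=lambda html: [html, 'dup'], delay=0)
print(len(urls))
```

With a real browser, the call might look like `collect_listing_urls(webdriver.Chrome(), 'kittens', 223, lambda html: get_url(BeautifulSoup(html, 'html.parser')))`, followed by `dr.quit()` on the driver once scraping finishes.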


In [82]:
dogs = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/dog_ulrs.csv',
                   index_col='Unnamed: 0')
dogs.head()

Unnamed: 0 | URLs
--- | ---
0 | https://www.pets4homes.co.uk/classifieds/bhhhe...
1 | https://www.pets4homes.co.uk/classifieds/n-9id...
2 | https://www.pets4homes.co.uk/classifieds/yblc-...
3 | https://www.pets4homes.co.uk/classifieds/umi4g...
4 | https://www.pets4homes.co.uk/classifieds/b4ei5...


In [88]:
print(dogs.iloc[0].values)

['https://www.pets4homes.co.uk/classifieds/bhhhealt0-beautiful-pups-looking-for-forever-homes-sittingbourne/']


In [83]:
cats = pd.read_csv('/Users/lewis/Desktop/GA/DSI25-lessons/projects/project-capstone/cat_ulrs.csv',
                   index_col='Unnamed: 0')
cats.head()

Unnamed: 0 | URLs
--- | ---
0 | https://www.pets4homes.co.uk/classifieds/lwbo0...
1 | https://www.pets4homes.co.uk/classifieds/d32lj...
2 | https://www.pets4homes.co.uk/classifieds/gdgkk...
3 | https://www.pets4homes.co.uk/classifieds/51nyb...
4 | https://www.pets4homes.co.uk/classifieds/3izse...
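
The `Unnamed: 0` column appears because `to_csv` writes the row index by default, and `read_csv` then sees it as an unnamed column. A small sketch of avoiding that with `index=False`; the example URLs are made up, and an in-memory buffer stands in for the CSV file.

```python
import io
import pandas as pd

urls = pd.DataFrame({'URLs': ['https://example.com/a', 'https://example.com/b']})

buffer = io.StringIO()            # stands in for a file on disk
urls.to_csv(buffer, index=False)  # do not write the row index as a column
buffer.seek(0)

reloaded = pd.read_csv(buffer)    # no index_col argument needed
print(list(reloaded.columns))
```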


#### Bonus

4. Document your project goals (revise from your initial pitch)
   - Articulate “Specific aim”
   - Outline proposed methods and models
   - Define risks & assumptions

5. Create a blog post of at least 500 words that describes your work so far. Link to it in your Jupyter notebook.


## Deliverable Format & Submission

- Table, file, or database with relevant text file or notebook description.

---

## Suggested Ways to Get Started

- Review your initial proposal topic and feedback, and revise accordingly.
- Spend time with your data and verify that it can help you accomplish the goals you set out to pursue.
- If not, document how you intend to change those goals.
- Alternatively, go find some additional data and/or try another source.

---

## Useful Resources

- [Exploratory Data Analysis](http://insightdatascience.com/blog/eda-and-graphics-eli-bressert.html)
- [Best practices for data documentation](https://www.dataone.org/all-best-practices)

---

## Project Feedback + Evaluation

[Attached here is a complete rubric for this project.](./capstone-part-02-rubric.md)

Your instructors will score each of your technical requirements using the scale below:

Score  | Expectations
--- | ---
**0** | _Incomplete._
**1** | _Does not meet expectations._
**2** | _Meets expectations, good job!_