TikitaPeralta/wikihow_scrape

Scraping WikiHow - Images & Captions

Description

This repository contains a .gitignore, a JSON file, a Python file, and the images scraped from WikiHow. The purpose of this project was to scrape WikiHow.com for all images and their captions. The project is written in Python. In the YouTube video linked below, I talk through my process and go through the results.

Installation

Setting up the virtual environment

  1. pip install virtualenv (optional: the venv module used in step 2 is part of the Python 3 standard library)
  2. python3 -m venv venv
  3. source venv/bin/activate

Create gitignore

  1. touch .gitignore (no output expected)
  2. open .gitignore and add the line: venv

Installing dependencies

  1. pip install requests
  2. pip install beautifulsoup4
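  3. pip install pathlib (not needed on Python 3.4+, where pathlib is part of the standard library)
  4. pip install urllib3 (installed automatically as a dependency of requests; the urllib.request module imported below is part of the standard library)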
  5. pip install pillow

Usage

Import modules

import requests
import urllib.request
from pathlib import Path
import pathlib
import os
from bs4 import BeautifulSoup
from random import randint 
from time import sleep
import json
from PIL import Image

The script works as follows:

  • access every image regardless of page type, since the main page's layout differs from that of the article pages
  • exclude png/svg files when finding images, as the majority of these were icons or profile photos from comments
  • if a link is found, add it to the list of URLs
  • iterate through the list of URLs until it is empty, i.e. while len(list) > 0
  • avoid repetitions by adding each visited URL to a 'URL_seen' list and only following links that are not already in it
  • follow the link for each image and download it to the documents/webscrape/photos folder
  • place the results into JSON format
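
The crawl described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this repository. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra dependencies, and it keeps seen URLs in a set rather than a list. The names LinkAndImageParser, is_wanted_image, crawl, fetch_html and max_pages are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumption: these suffixes are skipped because they were mostly icons.
SKIPPED_SUFFIXES = {".png", ".svg"}


class LinkAndImageParser(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def is_wanted_image(url):
    """Return True unless the URL path ends in .png or .svg."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix not in SKIPPED_SUFFIXES


def crawl(start_url, fetch_html, max_pages=50):
    """Breadth-first crawl returning wanted image URLs in discovery order.

    fetch_html is a hypothetical callable that maps a URL to its HTML text.
    """
    site = urlparse(start_url).netloc
    to_visit = deque([start_url])
    url_seen = {start_url}
    image_urls = []
    pages_fetched = 0

    while len(to_visit) > 0 and pages_fetched < max_pages:
        page_url = to_visit.popleft()
        parser = LinkAndImageParser()
        parser.feed(fetch_html(page_url))
        pages_fetched += 1

        # Keep every image that is not a png/svg and was not found before.
        for src in parser.images:
            image_url = urljoin(page_url, src)
            if is_wanted_image(image_url) and image_url not in image_urls:
                image_urls.append(image_url)

        # Queue same-site links that have not been seen yet.
        for href in parser.links:
            link = urljoin(page_url, href).split("#")[0]
            if urlparse(link).netloc == site and link not in url_seen:
                url_seen.add(link)
                to_visit.append(link)

    return image_urls
```

In real use, fetch_html could wrap requests.get(url).text with a short random pause between requests (sleep and randint are among the imports above). Passing the fetcher in as a parameter also makes the loop easy to try against a small dictionary of sample pages without touching the network.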

JSON format

{
    "1" : ["caption", "img link", "path to img"]
}
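
As an illustration of the format above, the following sketch builds such a dictionary from (caption, image link) pairs and writes it to disk. It is an assumption about how the records could be assembled, not the repository's actual code. The function names build_records and save_records, the photo_dir parameter, and the choice to derive the file name from the last URL segment are all hypothetical.

```python
import json
from pathlib import Path


def build_records(items, photo_dir):
    """Map "1", "2", ... to [caption, image link, path to image].

    items is an iterable of (caption, image_url) pairs.
    """
    records = {}
    for index, (caption, image_url) in enumerate(items, start=1):
        # Assumption: the local file keeps the last segment of the URL.
        filename = image_url.rsplit("/", 1)[-1]
        local_path = Path(photo_dir) / filename
        records[str(index)] = [caption, image_url, local_path.as_posix()]
    return records


def save_records(records, json_path):
    """Write the records as indented, UTF-8 encoded JSON."""
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=4, ensure_ascii=False)
```

The keys are strings because JSON object keys must be strings; numbering from 1 matches the example above.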

YouTube Video

Scraping Project Video
