
skrape{it} is a Kotlin-based HTML/XML testing and web scraping library that can be used seamlessly in Spring Boot, Ktor, Android or other Kotlin-JVM projects. Its ability to analyze and extract HTML, including client-side rendered DOM trees and all other XML-related markup specifications such as SVG, UML and RSS, makes it unique. It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. First and foremost skrape{it} aims to be a testing tool (not tied to a particular test runner), but it can also be used to scrape websites in a convenient fashion.

Features

Parsing

  • Deserialization of HTML/XML from websites, local HTML files or plain strings into data classes / POJOs.
  • Designed to deserialize HTML, but can handle any XML-related markup specification such as SVG, UML, RSS or XML itself.
  • DSL to select HTML elements, with support for CSS query-selector syntax via string invocation.

Http-Client

  • HTTP client without verbosity and ceremony: make requests and set corresponding request options like headers, cookies etc. in a fluent-style interface.
  • Pre-configure the client regarding auth and other request settings.
  • Can handle client-side rendered web pages; JavaScript execution results can optionally be considered in the response body (see mode = DOM in the Configure HTTP-Client example below).

Idiomatic

  • Easy-to-use, idiomatic and type-safe DSL to ensure a high level of readability.
  • Built-in matchers/assertions based on infix functions to achieve a very high level of readability.
  • The DSL behaves like a fluent API to make data extraction/scraping as comfortable as possible.

Compatibility

  • Not bound to a specific test runner or framework.
  • Open to use any assertion library of your choice (see the sketch below).
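
For instance, text extracted via the DSL can be verified with kotlin.test's assertEquals instead of the built-in infix matchers. A minimal sketch (imports are omitted as in the examples below; assertEquals is assumed to come from kotlin.test):

@Test
fun `headline can be verified with any assertion library`() {
    htmlDocument("<html><body><h1>welcome</h1></body></html>") {
        h1 {
            findFirst {
                // plain kotlin.test assertion instead of skrape{it}'s infix matchers
                assertEquals("welcome", text)
            }
        }
    }
}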

Extensions

In addition, extensions for well-known testing libraries are provided that add the skrape{it} functionality described above. Currently available:


Quick Start

Read the Docs

You'll always find the latest documentation, release notes and examples at https://docs.skrape.it. If you don't want to read that much or just want to get a rough overview of how to use skrape{it}, have a look at the Documentation by Example section:

Installation

All official/stable releases are published to Maven Central.

Add dependency

Gradle
dependencies {
    implementation("it.skrape:skrapeit-core:1.0.0-alpha6")
}
Maven
<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit-core</artifactId>
    <version>1.0.0-alpha6</version>
</dependency>

Using bleeding-edge features before an official release

We offer snapshot releases via JitPack, so you can install any commit or version you want (see the pinning sketch after the snippets below). Be careful: these are unofficial releases, may be unstable, and breaking changes can occur at any time.

Add experimental stuff
Gradle
repositories {
    maven { url "https://jitpack.io" }
}
dependencies {
    implementation("com.github.skrapeit:skrape.it:master-SNAPSHOT")
}
Maven
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

...

<dependency>
    <groupId>com.github.skrapeit</groupId>
    <artifactId>skrape.it</artifactId>
    <version>master-SNAPSHOT</version>
</dependency>
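
Since JitPack accepts any branch name, tag or commit hash as the version, a build can also be pinned to an exact commit instead of the moving master-SNAPSHOT. A minimal Gradle sketch (the hash below is a placeholder, not a real commit):

dependencies {
    // any commit hash or tag of the repository works as the version on JitPack
    implementation("com.github.skrapeit:skrape.it:1a2b3c4d5e")
}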

Documentation by Example

Parse and verify HTML from String

@Test
internal fun `can read and return html from String`() {
    htmlDocument("""
        <html>
            <body>
                <h1>welcome</h1>
                <div>
                    <p>first p-element</p>
                    <p class="foo">some p-element</p>
                    <p class="foo">last p-element</p>
                </div>
            </body>
        </html>""") {

        h1 {
            findFirst {
                text toBe "welcome"
            }
        }
        p {
            withClass = "foo"
            findFirst {
                text toBe "some p-element"
                className toBe "foo"
            }
        }
        p {
            findAll {
                text toContain "p-element"
            }
            findLast {
                text toBe "last p-element"
            }
        }
    }
}

Parse HTML and extract it

data class MyDataClass(
        var httpStatusCode: Int = 0,
        var httpStatusMessage: String = "",
        var paragraph: String = "",
        var allParagraphs: List<String> = emptyList(),
        var allLinks: List<String> = emptyList()
)

class HtmlExtractionService {

    fun extract() {
        val extracted = skrape {
            url = "http://localhost:8080/"

            extractIt<MyDataClass> {
                it.httpStatusCode = statusCode
                it.httpStatusMessage = statusMessage.toString()
                htmlDocument {
                    it.allParagraphs = p { findAll { eachText }}
                    it.paragraph = p { findFirst { text }}
                    it.allLinks = a { findAll { eachHref }}
                }
            }
        }
        print(extracted)
        // will print:
        // MyDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}
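
extractIt populates a mutable instance of the target class, which is why MyDataClass above uses vars with default values. If you prefer an immutable result type, the DSL also offers a plain extract block whose lambda result becomes the return value of skrape. A minimal sketch (it assumes extract is available in this alpha release and uses a made-up endpoint and class):

data class Headline(val text: String)

class HeadlineExtractionService {

    fun extractHeadline(): Headline = skrape {
        url = "http://localhost:8080/" // hypothetical endpoint, as in the example above

        extract {
            var headline = ""
            htmlDocument {
                headline = h1 { findFirst { text } }
            }
            // the last expression of the lambda is what extract (and thus skrape) returns
            Headline(headline)
        }
    }
}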

Testing HTML responses:

@Test
fun `dsl can skrape by url`() {
    skrape {
        url = "http://localhost:8080/example"
        expect {
            htmlDocument {
                // all official html and html5 elements are supported by the DSL
                div {
                    withClass = "foo" and "bar" and "fizz" and "buzz"

                    findFirst {
                        text toBe "div with class foo"
                    }

                    findAll {
                        toBePresentExactlyTwice
                    }
                }
                // can handle custom tags as well
                "a-custom-tag" {
                    findFirst {
                        toBePresentExactlyOnce
                        text toBe "i'm a custom html5 tag"
                        text
                    }
                }
                // can handle selectors written in css query syntax
                "div.foo.bar.fizz.buzz" {
                    findFirst {
                        text toBe "div with class foo"
                    }
                }

                // can handle string selectors and add further selector specifics via the DSL
                "div.foo" {

                    withClass = "bar" and "fizz" and "buzz"

                    findFirst {
                        text toBe "div with class foo"
                    }
                }
            }
        }
    }
}

Configure HTTP-Client:

class ExampleTest {
    val myPreConfiguredClient = skrape {
        // url can be a plain url as a string or built via urlBuilder
        url = urlBuilder {
            protocol = Protocol.HTTPS
            host = "skrape.it"
            port = 12345
            path = "/foo"
            queryParams = mapOf("foo" to "bar")
        }
        
        mode = DOM // optional -> defaults to SOURCE (plain http request) - DOM will also render JS
        method = GET // optional -> defaults to GET
        timeout = 5000 // optional -> defaults to 5000ms
        followRedirects = true // optional -> defaults to true
        userAgent = "some custom user agent" // optional -> defaults to "Mozilla/5.0 skrape.it"
        cookies = mapOf("some-cookie-name" to "some-value") // optional
        headers = mapOf("some-custom-header" to "some-value") // optional
        
        asConfig // <--- returns the configured request object
    }

    @Test
    fun `can use preconfigured client`() {

        myPreConfiguredClient.expect { 
            statusCode toBe 200
            // do more stuff
        }

        myPreConfiguredClient.apply {
            followRedirects = false
        }.expect { 
            statusCode toBe 301
            // do more stuff
        }
    }
}

Sponsoring

skrape{it} is and always will be free and open source. However, your sponsorship of this project is greatly appreciated and will fund the caffeine and pizzas that fuel its development. To sponsor skrape{it}, just click this button → Donate
