Many frameworks today are heavyweight and include features most developers will never use.
This project was made to make extracting data from websites concise and simple
(no prior knowledge of the framework is required to get the job done).
Only the default libraries are used (including System.Net.Http
since .NET Core 2.1).
The code has been tested and works in most cases, but it is not guaranteed to work as expected every time, since real-world HTML
can be inconsistent.
<div class="outer_div" property="random73913">
  StartText
  <div class="inner_div">
    Inner text
  </div>
  Ending text
</div>
HtmlDoc doc = new HtmlDoc(html);
Tag? tag = doc.Find("div", ("class", "inner_div", Compare.EXACT));
if (tag != null) {
    string extract = doc.ExtractText(tag);
    Console.WriteLine(extract);
}
Output: Inner text
Each attribute filter has its own comparison policy and follows the format (key, value, comparison_policy).
Use Compare.VALUE_STARTS_WITH
when attribute values are obfuscated, either intentionally or because CSS tooling
auto-generates gibberish.
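As a rough illustration of what the three policies named in this README (EXACT, VALUE_STARTS_WITH, KEY_ONLY) might mean, here is a minimal standalone sketch. It is not the library's implementation: `MatchPolicy`, `AttributeMatcher`, and `Matches` are hypothetical names invented for this example, and the semantics shown are an assumption based on the policy names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's Compare values (illustrative only).
public enum MatchPolicy { Exact, ValueStartsWith, KeyOnly }

public static class AttributeMatcher
{
    // Returns true when a tag's attributes satisfy one (key, value, policy) filter.
    public static bool Matches(
        IReadOnlyDictionary<string, string> attributes,
        string key, string value, MatchPolicy policy)
    {
        // Assumption: every policy requires the attribute key to be present.
        if (!attributes.TryGetValue(key, out var actual))
            return false;

        return policy switch
        {
            // The whole value must be identical.
            MatchPolicy.Exact => actual == value,
            // Only the beginning of the value must match, e.g. "random" vs "random73913".
            MatchPolicy.ValueStartsWith => actual.StartsWith(value, StringComparison.Ordinal),
            // The value is ignored; the key's presence is enough.
            MatchPolicy.KeyOnly => true,
            _ => false
        };
    }
}
```

Under these assumptions, a tag with property="random73913" would satisfy ("property", "random", ValueStartsWith) but not ("property", "random", Exact).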
HtmlDoc doc = new HtmlDoc(html);
Tag? tag = doc.Find("div",
    ("class", "outer_div", Compare.EXACT),
    ("property", "random", Compare.VALUE_STARTS_WITH)
);
if (tag != null) {
    string extract = doc.ExtractText(tag);
    Console.WriteLine(extract);
}
Output:
StartText
Inner text
Ending text
Change the concatenating character:
doc.SetConcatenatingChar(';');
Output:
StartText;Inner text;Ending text
Or disable the delimiter completely:
doc.DelimitTags(false);
Output:
StartTextInner textEnding text
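The difference between the two outputs above can be pictured with plain string operations. This is a minimal sketch that assumes the extracted text arrives as separate fragments; `TextJoiner` and its `Join` method are hypothetical helpers written for this example and say nothing about how HtmlDoc works internally.

```csharp
using System;

public static class TextJoiner
{
    // Joins extracted text fragments.
    // A null delimiter means the fragments are glued together with no separator.
    public static string Join(string[] fragments, char? delimiter) =>
        delimiter.HasValue
            ? string.Join(delimiter.Value, fragments)
            : string.Concat(fragments);
}
```

With the fragments "StartText", "Inner text", and "Ending text", passing ';' mirrors the SetConcatenatingChar(';') output, and passing null mirrors the DelimitTags(false) output.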
<ul>
  <li>item 1</li>
  <li>item 2</li>
  <li>item 3</li>
</ul>
HtmlDoc doc = new HtmlDoc(input);
Tag? tag = doc.Find("ul");
if (tag == null) {
    return;
}
List<Tag> listElements = doc.ExtractTags(tag, "li");
string html = HtmlDoc.fetchHtml("https://toscrape.com");
HtmlDoc doc = new HtmlDoc(html);
// Search the fetched document; KEY_ONLY matches any tag that has an href attribute.
Tag? tag = doc.Find("a", ("href", "", Compare.KEY_ONLY));
if (tag == null) {
    return;
}
string link = tag.GetAttribute("href");
Console.WriteLine(link);