diff --git a/src/basics.md b/src/basics.md
index 0752c235..0f6f1d84 100644
--- a/src/basics.md
+++ b/src/basics.md
@@ -13,6 +13,8 @@
 | [Maintain global mutable state][ex-global-mut-state] | [![lazy_static-badge]][lazy_static] | [![cat-rust-patterns-badge]][cat-rust-patterns] |
 | [Access a file randomly using a memory map][ex-random-file-access] | [![memmap-badge]][memmap] | [![cat-filesystem-badge]][cat-filesystem] |
 | [Define and operate on a type represented as a bitfield][ex-bitflags] | [![bitflags-badge]][bitflags] | [![cat-no-std-badge]][cat-no-std] |
+| [Extract a list of unique #Hashtags from a text][ex-extract-hashtags] | [![regex-badge]][regex] [![lazy_static-badge]][lazy_static] | [![cat-text-processing-badge]][cat-text-processing] |
+
 
 [ex-std-read-lines]: #ex-std-read-lines
 
@@ -470,6 +472,43 @@ fn main() {
 }
 ```
 
+[ex-extract-hashtags]: #ex-extract-hashtags
+
+## Extract a list of unique #Hashtags from a text
+
+[![regex-badge]][regex] [![lazy_static-badge]][lazy_static] [![cat-text-processing-badge]][cat-text-processing]
+
+Extracts a deduplicated set of hashtags from a text.
+
+The hashtag regex given here only matches Latin hashtags that start with a letter. The complete [twitter hashtag regex] is considerably more complicated.
+
+```rust
+extern crate regex;
+#[macro_use] extern crate lazy_static;
+
+use regex::Regex;
+use std::collections::HashSet;
+
+/// Note: A HashSet does not contain duplicate values.
+fn extract_hashtags(text: &str) -> HashSet<&str> {
+    lazy_static! {
+        static ref HASHTAG_REGEX: Regex = Regex::new(
+            r"\#[a-zA-Z][0-9a-zA-Z_]*"
+        ).unwrap();
+    }
+    HASHTAG_REGEX.find_iter(text).map(|mat| mat.as_str()).collect()
+}
+
+fn main() {
+    let tweet = "Hey #world, I just got my new #dog, say hello to Till. \n#dog #forever #2 #_ ";
+    let tags = extract_hashtags(tweet);
+    assert!(tags.contains("#dog") && tags.contains("#forever") && tags.contains("#world"));
+    assert_eq!(tags.len(), 3);
+}
+```
+
+
+
 
 
 [cat-no-std-badge]: https://badge-cache.kominick.com/badge/no_std--x.svg?style=social
@@ -535,3 +574,4 @@ fn main() {
 
 
 [race-condition-file]: https://en.wikipedia.org/wiki/Race_condition#File_systems
+[twitter hashtag regex]: https://github.com/twitter/twitter-text/blob/master/java/src/com/twitter/Regex.java#L255
diff --git a/src/intro.md b/src/intro.md
index 2d878c74..3cbe5635 100644
--- a/src/intro.md
+++ b/src/intro.md
@@ -31,6 +31,8 @@ community. It needs and welcomes help. For details see
 | [Maintain global mutable state][ex-global-mut-state] | [![lazy_static-badge]][lazy_static] | [![cat-rust-patterns-badge]][cat-rust-patterns] |
 | [Access a file randomly using a memory map][ex-random-file-access] | [![memmap-badge]][memmap] | [![cat-filesystem-badge]][cat-filesystem] |
 | [Define and operate on a type represented as a bitfield][ex-bitflags] | [![bitflags-badge]][bitflags] | [![cat-no-std-badge]][cat-no-std] |
+| [Extract a list of unique #Hashtags from a text][ex-extract-hashtags] | [![regex-badge]][regex] [![lazy_static-badge]][lazy_static] | [![cat-text-processing-badge]][cat-text-processing] |
+
 
 ## [Encoding](encoding.html)
 
@@ -220,6 +222,7 @@ Keep lines sorted.
 [ex-threadpool-fractal]: concurrency.html#ex-threadpool-fractal
 [ex-dedup-filenames]: app.html#ex-dedup-filenames
+[ex-extract-hashtags]: basics.html#ex-extract-hashtags
 [ex-extract-links-webpage]: net.html#ex-extract-links-webpage
 [ex-file-post]: net.html#ex-file-post
 [ex-file-predicate]: app.html#ex-file-predicate
 [ex-file-skip-dot]: app.html#ex-file-skip-dot
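
A `HashSet` iterates in arbitrary order, so callers that need the tags in a stable, sorted order could collect into a `BTreeSet` instead. The following is a minimal std-only sketch of that idea, not part of the patch: the helper name `extract_hashtags_sorted` is hypothetical, and the hand-written byte scanner is an assumption that stands in for the `regex` crate, approximating the pattern `\#[a-zA-Z][0-9a-zA-Z_]*` so the example compiles without external dependencies.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not in the patch): returns unique hashtags in sorted order.
/// A BTreeSet deduplicates like a HashSet but iterates in ascending order.
fn extract_hashtags_sorted(text: &str) -> BTreeSet<&str> {
    let bytes = text.as_bytes();
    let mut tags = BTreeSet::new();
    let mut i = 0;
    while i < bytes.len() {
        // A tag starts with '#' immediately followed by an ASCII letter.
        if bytes[i] == b'#' && i + 1 < bytes.len() && bytes[i + 1].is_ascii_alphabetic() {
            let start = i;
            i += 2;
            // Consume the remaining ASCII letters, digits, and underscores.
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
            // Slicing is safe: both ends are adjacent to ASCII bytes,
            // so they fall on UTF-8 character boundaries.
            tags.insert(&text[start..i]);
        } else {
            i += 1;
        }
    }
    tags
}

fn main() {
    let tweet = "Hey #world, I just got my new #dog, say hello to Till. \n#dog #forever #2 #_ ";
    let tags: Vec<&str> = extract_hashtags_sorted(tweet).into_iter().collect();
    // Duplicates collapse and iteration order is lexicographic.
    assert_eq!(tags, ["#dog", "#forever", "#world"]);
}
```

With the `regex` crate, the same effect should be achievable by changing only the return type of `extract_hashtags` to `BTreeSet<&str>`, since `collect()` can build either set type from the iterator of matches.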