A simple text scanner for .NET which can parse primitive types and strings using regular expressions.
- Project Home: github.com/ngbrown/TextScanner
- Bug/Feature Tracking: github.com/ngbrown/TextScanner/issues
A TextScanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various next methods.
This is a port of the Java class java.util.Scanner to .NET.
By default, a scanner uses white space to separate tokens. (White space characters include blanks, tabs, and line terminators. For the full list, refer to the documentation for Char.IsWhiteSpace.) To see how scanning works, let’s look at ScanXan, a program that reads the individual words in xanadu.txt and prints them out, one per line.
The .NET version of the Java tutorial's ScanXan example looks like this:
    namespace ScanXan
    {
        using System;
        using System.IO;
        using TextScanner;

        internal class ScanXan
        {
            private static void Main(string[] args)
            {
                TextScanner s = null;
                try
                {
                    s = new TextScanner(new StreamReader("xanadu.txt"));
                    while (s.HasNext())
                    {
                        Console.WriteLine(s.Next());
                    }
                }
                finally
                {
                    if (s != null)
                    {
                        s.Close();
                    }
                }
            }
        }
    }
The output is the same as that of the Java version:
In
Xanadu
did
Kubla
Khan
A
stately
pleasure-dome
...
To use a different token separator, invoke UseDelimiter(), specifying a regular expression. For example, suppose you wanted the token separator to be a comma, optionally followed by white space. You would invoke:
    s.UseDelimiter(",\\s*");
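To get a feel for what that pattern matches, here is a minimal sketch that applies the same regular expression with the framework's Regex.Split. It does not use TextScanner at all, and the class name, method name, and sample string are made up for illustration. A scanner may also treat edge cases such as leading or trailing delimiters differently from Regex.Split.

```csharp
using System;
using System.Text.RegularExpressions;

internal static class DelimiterSketch
{
    // Hypothetical helper: splits on a comma optionally followed by white space,
    // the same pattern passed to UseDelimiter above.
    internal static string[] SplitOnCommas(string input)
    {
        return Regex.Split(input, ",\\s*");
    }

    private static void Main()
    {
        foreach (var token in SplitOnCommas("In, Xanadu,did,   Kubla Khan"))
        {
            // Brackets make the token boundaries visible.
            Console.WriteLine("[" + token + "]");
        }
    }
}
```

This prints [In], [Xanadu], [did] and [Kubla Khan], one per line. "Kubla Khan" stays together because, with this pattern, white space on its own no longer separates tokens.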
The ScanXan example treats all input tokens as simple string values. TextScanner also supports tokens for all of the .NET primitive types (except for char), as well as Decimal. Also, numeric values can use thousands separators. Thus, in an en-US locale, TextScanner correctly reads the string “32,767” as representing an integer value.
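The effect of the locale on a string like “32,767” can be seen with the framework's own parser. The sketch below uses double.Parse and CultureInfo directly rather than TextScanner. The class and method names are hypothetical, and de-DE is chosen here only as an example of a culture where the comma is the decimal separator.

```csharp
using System;
using System.Globalization;

internal static class CultureSketch
{
    // Hypothetical helper: parses text using the number format of the named culture.
    internal static double ParseWith(string text, string cultureName)
    {
        return double.Parse(text, NumberStyles.Number, CultureInfo.GetCultureInfo(cultureName));
    }

    private static void Main()
    {
        double us = ParseWith("32,767", "en-US"); // comma is a thousands separator
        double de = ParseWith("32,767", "de-DE"); // comma is the decimal separator

        Console.WriteLine(us.ToString(CultureInfo.InvariantCulture)); // 32767
        Console.WriteLine(de.ToString(CultureInfo.InvariantCulture)); // 32.767
    }
}
```

The same characters yield two values that differ by a factor of a thousand, depending only on the culture used to read them.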
We have to mention the locale, because thousands separators and decimal symbols are locale specific. So, the ScanSum example below would not work correctly in all locales if we didn’t specify that the scanner should use the en-US locale. That’s not something you usually have to worry about, because your input data usually comes from sources that use the same locale as you do.
The ScanSum example reads a list of double values and adds them up. Here’s the source:
    namespace ScanSum
    {
        using System;
        using System.Globalization;
        using System.IO;
        using TextScanner;

        internal class ScanSum
        {
            private static void Main(string[] args)
            {
                TextScanner s = null;
                double sum = 0;
                try
                {
                    s = new TextScanner(new StreamReader("usnumbers.txt"));
                    s.UseCulture(new CultureInfo("en-US"));
                    while (s.HasNext())
                    {
                        if (s.HasNextDouble())
                        {
                            sum += s.NextDouble();
                        }
                        else
                        {
                            s.Next();
                        }
                    }
                }
                finally
                {
                    if (s != null)
                    {
                        s.Close();
                    }
                }

                Console.WriteLine(sum);
            }
        }
    }
And here’s the sample input file, usnumbers.txt:
8.5
32,767
3.14159
1,000,000.1
The output string is “1032778.74159”.
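As a cross-check of that total, the four input lines can be summed with the framework's parser alone. This sketch is independent of TextScanner and its names are hypothetical. It also uses decimal instead of the double used in ScanSum, an intentional substitution so that the printed total is exact and does not depend on how a given runtime formats doubles.

```csharp
using System;
using System.Globalization;

internal static class SumCheck
{
    // Hypothetical helper: parses each line as an en-US number and adds them up.
    internal static decimal SumLines(string[] lines)
    {
        var enUs = CultureInfo.GetCultureInfo("en-US");
        decimal sum = 0m;
        foreach (var line in lines)
        {
            sum += decimal.Parse(line, NumberStyles.Number, enUs);
        }

        return sum;
    }

    private static void Main()
    {
        var lines = new[] { "8.5", "32,767", "3.14159", "1,000,000.1" };

        // 8.5 + 32767 + 3.14159 + 1000000.1
        Console.WriteLine(SumLines(lines).ToString(CultureInfo.InvariantCulture)); // 1032778.74159
    }
}
```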
We can rewrite the ScanXan example with using and foreach blocks like this:
    namespace ScanXan
    {
        using System;
        using System.IO;
        using TextScanner;

        internal class ScanXan
        {
            private static void Main(string[] args)
            {
                using (var s = new TextScanner(new StreamReader("xanadu.txt")))
                {
                    foreach (var token in s)
                    {
                        Console.WriteLine(token);
                    }
                }
            }
        }
    }
The output is the same as before.
The specification is derived from Sun’s specification. See their documentation license.
The source code was written without any reference to the Java library source code.
The source code is licensed under The Common Development and Distribution License.