TextScanner

A simple text scanner for .NET which can parse primitive types and strings using regular expressions.

Project Info

A TextScanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various Next methods.

This is a port of the Java class java.util.Scanner to .NET.

Breaking input into tokens

By default, a scanner uses white space to separate tokens. (White space characters include blanks, tabs, and line terminators. For the full list, refer to the documentation for Char.IsWhiteSpace.) To see how scanning works, let’s look at ScanXan, a program that reads the individual words in xanadu.txt and prints them out, one per line.

The .NET version of the Java tutorial's ScanXan program looks like this:

namespace ScanXan
{
    using System;
    using System.IO;

    using TextScanner;

    internal class ScanXan
    {
        private static void Main(string[] args)
        {
            TextScanner s = null;

            try
            {
                s = new TextScanner(new StreamReader("xanadu.txt"));

                while (s.HasNext())
                {
                    Console.WriteLine(s.Next());
                }
            }
            finally
            {
                if (s != null)
                {
                    s.Close();
                }
            }
        }
    }
}

The output is the same as for the Java version:

In
Xanadu
did
Kubla
Khan
A
stately
pleasure-dome
...

To use a different token separator, invoke UseDelimiter(), specifying a regular expression. For example, suppose you wanted the token separator to be a comma, optionally followed by white space. You would invoke:

s.UseDelimiter(",\\s*");
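
To get a feel for what this pattern matches, here is a minimal sketch that uses only the .NET base class library (Regex.Split), not TextScanner itself. The class name DelimiterDemo, the helper SplitTokens, and the sample input are made up for illustration, and TextScanner may treat edge cases such as leading or trailing delimiters differently.

```csharp
using System;
using System.Text.RegularExpressions;

internal static class DelimiterDemo
{
    // Hypothetical helper: splits the input wherever the delimiter pattern matches.
    internal static string[] SplitTokens(string input, string delimiter)
    {
        return Regex.Split(input, delimiter);
    }

    private static void Main()
    {
        // A comma, optionally followed by white space.
        // The verbatim string @",\s*" is the same pattern as ",\\s*".
        const string delimiter = @",\s*";

        // Sample input invented for this sketch.
        string[] tokens = SplitTokens("apples, oranges,pears,   plums", delimiter);

        // Prints apples, oranges, pears and plums, one per line.
        foreach (string token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
```

The white space after each comma is consumed as part of the delimiter, so the tokens come out without leading blanks.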

Translating individual tokens

The ScanXan example treats all input tokens as simple string values. TextScanner also supports tokens for all of the .NET primitive types (except for char), as well as Decimal. Also, numeric values can use thousands separators. Thus, in an en-US locale, TextScanner correctly reads the string “32,767” as representing an integer value.

We have to mention the locale, because thousands separators and decimal symbols are locale-specific. So, the following example would not work correctly in all locales if we didn’t specify that the scanner should use the en-US locale. That’s not something you usually have to worry about, because your input data usually comes from sources that use the same locale as you do.

The ScanSum example reads a list of double values and adds them up. Here’s the source:

namespace ScanSum
{
    using System;
    using System.Globalization;
    using System.IO;

    using TextScanner;

    internal class ScanSum
    {
        private static void Main(string[] args)
        {
            TextScanner s = null;
            double sum = 0;

            try
            {
                s = new TextScanner(new StreamReader("usnumbers.txt"));
                s.UseCulture(new CultureInfo("en-US"));

                while (s.HasNext())
                {
                    if (s.HasNextDouble())
                    {
                        sum += s.NextDouble();
                    }
                    else
                    {
                        // Not a number: consume the token and move on.
                        s.Next();
                    }
                }
            }
            finally
            {
                if (s != null)
                {
                    s.Close();
                }
            }

            Console.WriteLine(sum);
        }
    }
}

And here’s the sample input file, usnumbers.txt:

8.5
32,767
3.14159
1,000,000.1

The output string is “1032778.74159”.
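
To see why the culture matters, the following sketch parses the same text under two cultures using only double.Parse from the base class library. It does not show how TextScanner applies the culture internally. The class name CultureDemo and the helper Parse are made up for illustration, and the sketch assumes full culture data is available (that is, the runtime is not in invariant-globalization mode).

```csharp
using System;
using System.Globalization;

internal static class CultureDemo
{
    // Hypothetical helper: parses text as a number using the named culture's separators.
    internal static double Parse(string text, string cultureName)
    {
        return double.Parse(text, NumberStyles.Number, CultureInfo.GetCultureInfo(cultureName));
    }

    private static void Main()
    {
        // en-US: ',' groups thousands and '.' is the decimal symbol.
        double us = Parse("32,767", "en-US");   // 32767

        // de-DE: the roles are swapped, so ',' is the decimal symbol.
        double de = Parse("32,767", "de-DE");   // 32.767

        // Print with the invariant culture so the output does not depend on the machine's locale.
        Console.WriteLine(us.ToString(CultureInfo.InvariantCulture));
        Console.WriteLine(de.ToString(CultureInfo.InvariantCulture));
    }
}
```

The same characters produce values that differ by a factor of one thousand, which is why ScanSum sets the culture explicitly before reading usnumbers.txt.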

Updated for .NET

We can rewrite the ScanXan example with using and foreach blocks like this:

namespace ScanXan
{
    using System;
    using System.IO;

    using TextScanner;

    internal class ScanXan
    {
        private static void Main(string[] args)
        {
            using (var s = new TextScanner(new StreamReader("xanadu.txt")))
            {
                foreach (var token in s)
                {
                    Console.WriteLine(token);
                }
            }
        }
    }
}

The output is the same as before.

License

The specification is derived from Sun’s specification. See their documentation license.

The source code was written without any reference to the Java library source code.

The source code is licensed under The Common Development and Distribution License.
