
Strings in Java

Claude Martin edited this page Sep 17, 2020 · 36 revisions

This is the FAQ part on Strings. See our general FAQ for other topics.

What are Strings?

A string is a sequence of characters. A string has a length (which can be zero or more) and the characters are in a certain order. The characters are taken from a certain alphabet, which is the character set (see below).

Java has a class called java.lang.String, which is used to represent strings in Java. This FAQ is all about that class. If you want to learn about strings you should read a book or follow this tutorial:
https://docs.oracle.com/javase/tutorial/java/data/strings.html

How to convert a String to a number?

Use Integer.parseInt(String) or Double.parseDouble(String). The equivalent methods exist for Long and Float. These methods throw a NumberFormatException if the input is not valid.
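A quick sketch of parsing and of catching the exception (the class name is just for illustration):

```java
public class ParseDemo {
    public static void main(String[] args) {
        int i = Integer.parseInt("42");        // 42
        double d = Double.parseDouble("3.14"); // 3.14
        System.out.println(i + d);             // 45.14
        try {
            Integer.parseInt("forty-two");     // not a valid number
        } catch (NumberFormatException e) {
            System.out.println("invalid input: " + e.getMessage());
        }
    }
}
```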

How to convert a number to a String?

There's Integer.toString(int) and Double.toString(double). Both return a string representation of the argument. Long and Float each have such a method as well.
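A minimal sketch (the radix overload and String.valueOf are additional options, shown here for completeness):

```java
public class ToStringDemo {
    public static void main(String[] args) {
        String a = Integer.toString(255);     // "255"
        String b = Double.toString(2.5);      // "2.5"
        String c = Integer.toString(255, 16); // "ff" — overload with a radix
        String d = String.valueOf(255L);      // String.valueOf works for all primitives
        System.out.println(a + " " + b + " " + c + " " + d); // 255 2.5 ff 255
    }
}
```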

How to convert an object to a String?

There is toString(), but you'd always have to check for null. So you can use Objects.toString(Object) instead. There's even an overload to provide a null-default. And in concatenation you can just use the object like this:

String string = "Object: " + myObject; 

Java will automatically use toString() on non-null values and "null" if it is null.
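The three null-safe options side by side, as a small sketch:

```java
import java.util.Objects;

public class NullSafeToString {
    public static void main(String[] args) {
        Object myObject = null;
        System.out.println(Objects.toString(myObject));        // "null"
        System.out.println(Objects.toString(myObject, "n/a")); // "n/a" — the null-default overload
        System.out.println("Object: " + myObject);             // "Object: null" via concatenation
    }
}
```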

Alternatively you could use any JSON library to JSONify objects. It will convert the object to JavaScript Object Notation, which is human readable.

How is a String reference different from a String object?

The object is an instance of the String class (java.lang.String). It exists in memory on the Java heap and it behaves like any other object. It is not a value. In Java we distinguish between pure values and referenced objects.

Variables only hold values. Since a reference is a value you can assign it to a variable. So when you assign a String to a variable the JVM simply copies the reference, not the String.
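This can be illustrated with a small sketch (using == here deliberately, to show reference identity):

```java
public class ReferenceCopy {
    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = s1;               // copies the reference, not the String
        System.out.println(s1 == s2); // true — both variables refer to the same object
        s2 = "goodbye";               // reassigning s2 leaves s1 untouched
        System.out.println(s1);       // hello
    }
}
```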

Why are String objects immutable?

Strings are immutable in Java, simply because the makers of Java (Sun Microsystems, now Oracle) designed them this way (JLS: "A String object has a constant (unchanging) value.").

Making Strings immutable has some advantages over just using an array of char:

  • Arrays are not as secure, since any method could alter their contents.
  • Performance is better, because you do not need to defensively copy the array to guard against other methods altering the string.
  • The length is fixed and you do not need to search for an ending character (usually \0 in C).

Note that these are not reasons why they are immutable. They are problems that are solved by the immutability of Java Strings. But other languages have other solutions, such as Copy-On-Write-Strings (PHP, Swift, Delphi), which are mutable but still perform well and are just as secure.

How are String objects immutable?

Just like any other immutable type. Oracle already has a guide on this: https://docs.oracle.com/javase/tutorial/essential/concurrency/immutable.html

Note that this doesn't mean you couldn't use reflection to alter the String. But this can be prevented by using a Java Security Manager that does not grant reflective access to Strings.

Why is String final?

That's one requirement to make a class immutable (see above). It prevents you from creating a subclass of String, which could introduce bugs and mutability. Implement java.lang.CharSequence if you need your own String-like class, and don't forget to override toString().

What are literal constants?

Numbers, booleans and Strings can be defined in the source code as "literals". You literally have the value inside the code. Anything that is a series of digits is a number literal. true and false are the literals for the two boolean values. And anything inside double quotes is a String literal.

String str = "this here is a string literal";  

What is the String constant pool?

Strings are used a lot in all applications. And they are special because we can use String literals to create them. Some Strings are used a lot and they would take a lot of space if they existed more than once. So the JVM puts String literals into a constant pool. Note that modern versions of Java do not treat them much differently from other objects, but there is a lot of optimization.

So when you use a String literal in your code it will only exist once, even if you use the same literal in another unit of code. The JVM checks if there is already such a String and will reuse it. You can even put your dynamically created Strings into that pool by invoking myString.intern().
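A small sketch of pooling and intern() (using == deliberately here, to show which references are identical):

```java
public class PoolDemo {
    public static void main(String[] args) {
        String a = "pool";
        String b = "pool";
        System.out.println(a == b);          // true — both refer to the pooled literal
        String c = new StringBuilder("po").append("ol").toString();
        System.out.println(a == c);          // false — built at runtime, not pooled
        System.out.println(a == c.intern()); // true — intern() returns the pooled instance
    }
}
```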

There are other types of constant memory, which you can read about in the JLS.

What are the different constructors used for?

First you need to understand that two of the constructors should not be used!

DO NOT USE:  str = new String();  
INSTEAD USE: str = "";

DO NOT USE:  str2 = new String(str1);  
INSTEAD USE: str2 = str1;

Read about the String constant pool (above) to learn why this is relevant.

For the other constructors you can simply read their API documentation. That's what it's here for.

You will rarely need any of them. Some can be used to get a String from a char[] or int[] (array of Unicode code points).

How can I process a character sequence?

You probably want to use StringBuilder. It acts as a mutable String alternative and has better performance if you execute many operations on a large String.

You can also use string.chars() to get a stream of code units (or string.codePoints() for a stream of code points). This is helpful if you want to process one symbol after the other.
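Both approaches as a short sketch (the loop and the vowel filter are just illustrative examples):

```java
public class BuilderDemo {
    public static void main(String[] args) {
        // StringBuilder mutates in place instead of creating a new String per step:
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 3; i++) {
            sb.append("item").append(i).append(';');
        }
        System.out.println(sb); // item1;item2;item3;

        // chars() streams the code units as ints:
        long vowels = "hello world".chars().filter(c -> "aeiou".indexOf(c) >= 0).count();
        System.out.println(vowels); // 3
    }
}
```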

For small Strings and few operations you can just use regular Strings, as the type already has many methods to help you process Strings. You simply create a new String instance and let the garbage collector take care of unused Strings. The same is true for all immutable types. The important part to understand is that a variable is not a String. A String-type variable only holds a reference to a String (or null). Unless the variable is final you can simply assign a reference to another String to it.

// One String, two variables:
String greeting = "Hallo";
String backup = greeting;
// We make a new String:
greeting = greeting.replace('a', 'e');
// We now have our greeting:
System.out.println(greeting);
// But we still have the original:
System.out.println(backup);
backup = null;
// GC can now remove the original String.

How can I convert the character encoding of a string?

For reading you should just use Files.readAllLines(Path, Charset) or a Reader, with the appropriate Charset:

try (var stream = new FileInputStream(name);
     var reader = new InputStreamReader(stream, StandardCharsets.UTF_8)) {
    char[] buffer = new char[4096];
    int read = reader.read(buffer);
    // ... process the buffer ...
}

This would treat the data as UTF-8. Use BufferedReader to read lines as Strings.

If you have a byte[] from somewhere and you need to convert it, you should still let Java do the work for you. You could use a ByteArrayInputStream to use the above approach. But when you have a byte array it's easier to just use the constructor java.lang.String.String(byte[], int, int, Charset). To get the encoded bytes from a String you use getBytes(Charset).

byte[] data = api.getData(); // some API gives you a byte[] instead of a Java String
String str = new String(data, 0, data.length, Charset.forName("ISO-8859-5")); // read as Latin/Cyrillic
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // convert to UTF-8

How are Strings implemented in Java?

The class java.lang.String models an immutable sequence of characters.

String is in the package java.lang, so you do not need to import it. And since you can use String literals (i.e. String str = "hello";), you do not need any constructors to create instances. The String class has many methods for String operations (such as substring and indexOf). For concatenation the infix operator + can be used.

Technical details (you will need to read the rest of this page to understand this):
In Java a String represents a sequence of code units in the UTF-16 format (as in many other languages, such as JavaScript, C#). A String is an object (instance of java.lang.String). String implements the interfaces Serializable, Comparable<String>, and CharSequence. The primitive type char is used for the API, for example the method charAt. However, in modern implementations of Java Strings can be compacted, so they use a byte[] and a field named coder for the encoding. If Latin1 is used (compact) then each byte represents one symbol (all code points need no more than 8 bits). If UTF-16 is used (not compacted) then two bytes are used per char (two or four per code point).
In many cases one char (code unit) represents one symbol, but some symbols (code points) need two chars (a surrogate pair). The value of a String (which is a byte[]) cannot be altered. But there are mutable types such as StringBuilder.
All this might seem rather confusing. This is due to backwards compatibility, from the days when a Java String was just a wrapper for a char[]. Often it's better to just process code points. So have a look at the method codePoints. It allows you to process all code points as integers. Converting the result back to a String is still somewhat cumbersome in Java 10. But you can do this:

var upper = "hello".codePoints().map(Character::toUpperCase)
    .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
    .toString();

Where do I get the source code of the String class?

It's in the JDK. Install it:
http://www.oracle.com/technetwork/java/javase/downloads/index.html

Then simply open java/lang/String.java from src.zip.

What is a character?

This is probably the most confusing thing about strings, especially because there is a type called "char" in Java, which is not a character, but actually a code unit for UTF-16. That means it's just a 16-bit value, and not necessarily one character. It could be only half a character, since some characters need two chars to be stored.
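The "half a character" case can be demonstrated with the G Clef symbol (the escape sequence is its UTF-16 surrogate pair):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // 𝄞 (U+1D11E), outside the Basic Multilingual Plane
        System.out.println(clef.length());                         // 2 — two chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 — but one symbol
        System.out.println(Integer.toHexString(clef.charAt(0)));   // d834 — half a symbol (the high surrogate)
    }
}
```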

The term has different meanings, depending on context. It could stand for:

  • a letter/grapheme: A-Z are Latin letters
  • a printable symbol: Any symbol that you could think of. Letters, numbers, punctuation, emojis etc.
  • any element of a character set: Even non-printable characters, such as control characters.
  • a code unit: The type "char" holds one code unit.

In mathematics the formal definition of a language is that it is a set of words. Each word is a sequence of symbols.

What is a character set?

A character set (or charset) is literally a set of characters. A set as in a collection of well defined and distinct elements. And characters as in letters, numbers, symbols, emojis etc. Some characters have special meanings and are not printable (control characters, such as "backspace").

Each character in a charset is mapped to a numeric value. For example in US-ASCII (see below), 'A' is 65 and so on. GBK is a character set for simplified Chinese characters. With the numbering the characters have a certain order (in theory a set is not ordered). So it's basically a list of characters, sometimes represented as a table (ASCII table). Each character has a number (the position in the list, starting at 0) and a definition of what it looks like (though each font will render it a bit differently).

What is a character encoding?

It defines how a single character is encoded in a text file. For any character set that has 128 or fewer characters you would simply store it in the 7 least significant bits of a byte (a byte usually has 8 bits). If it has up to 256 characters you simply use all 8 bits. Unicode (see below) has many more characters, so its encodings are a bit more complex than that.

The character 'A' is 65 in many charsets (see above) and that's 100 0001 in binary, so that's how it is encoded in US-ASCII. UTF-16 would store this as 0000 0000 0100 0001 while UTF-8 would store it as 0100 0001. In this case it's just filled with zeroes so it's 16 or 8 bits, respectively. In both cases the character uses exactly one code unit. But a symbol such as 𝄞 (G Clef) needs more than one unit:

  • ASCII : not available
  • UTF-16BE : 1101 1000 0011 0100 | 1101 1101 0001 1110 (two units)
  • UTF-16LE : 0011 0100 1101 1000 | 0001 1110 1101 1101 (two units)
  • UTF-8 : 1111 0000 | 1001 1101 | 1000 0100 | 1001 1110 (four units)

To understand how this is done you need to read the specifications of UTF-16 / UTF-8. UTF-16 supports BE and LE format (see next question).
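You can check these unit counts from Java by looking at the encoded byte lengths (using UTF-16BE explicitly, since plain UTF-16 would prepend a BOM):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String a = "A";               // U+0041, one code unit in every UTF
        String clef = "\uD834\uDD1E"; // 𝄞 (U+1D11E)
        System.out.println(a.getBytes(StandardCharsets.US_ASCII).length);    // 1
        System.out.println(a.getBytes(StandardCharsets.UTF_16BE).length);    // 2
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println(clef.getBytes(StandardCharsets.UTF_16BE).length); // 4 — two 16-bit units
    }
}
```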

What is java.nio.charset.Charset?

This class has a misleading name. It describes an encoding as a "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes". UTF-8 and UTF-16 are different encodings for the same character set (namely Unicode). It should have been named Encoding instead, and StandardCharsets should have been StandardEncodings. But they are aware of it and blame it on the IETF:

The name of this class is taken from the terms used in RFC 2278. In that document a charset is defined as the combination of one or more coded character sets and a character-encoding scheme. (This definition is confusing; some other software systems define charset as a synonym for coded character set.)

And there you can find a historical note, which explains why "charset" was used instead of "encoding".

What is endianness?

Big-endian and little-endian are two different formats to store or transmit a number that is wider than a single byte. A byte is usually 8 bits. That's enough for 0 to 255 (unsigned) or -128 to 127 (signed). UTF-16 (see below) needs two bytes per code unit (= 16 bits). So there are two possible byte orders of UTF-16. To know which it is you can use a byte order mark (see next).

What is a byte order mark?

The byte order mark (BOM) is an optional sequence of bytes (not characters) at the beginning of a string, when it is encoded. It is important for UTF-16 so the byte order (also called endianness, see above) is clear. For UTF-8 there is no byte order since every unit is 8 bits wide, but a BOM can be used as an indication of UTF-8.
So a file or a network stream can begin with a BOM. To prevent it from being confused with a character, the byte sequence is reserved:

The BOM is the code point U+FEFF, which in UTF-8 needs three bytes: EF BB BF.
U+EFBB (the first unit if UTF-8 with a BOM is mistakenly interpreted as UTF-16) is not an assigned character.

What is a code page?

Code pages (CP) are used to distinguish different character encodings but there is no clear definition of the term. It was supposed to make things easier, but only made them worse with different names for the code pages and different code pages by different vendors. Today, most vendors recommend Unicode (see below). Some code pages are still in use, such as Windows-1252 by Microsoft.

Some character sets have a region that is undefined and can be used for system-specific characters. This usually only leads to problems, which is why these ranges should not be used. Windows code pages are often based on some extended ASCII character set (see below) and simply add more characters by defining the system-specific range.
If a text is transferred from one computer to another you need to make sure both support the same code page (the system, the application and the font) so that the text will be the same. Or save the trouble by using Unicode.

What is US-ASCII?

US-ASCII (often just ASCII) is short for American Standard Code for Information Interchange. It defines a character encoding standard and is rarely used directly today. A single code unit is only 7 bits, so it only knows 128 code points. This is enough for A-Z, a-z, 0-9, some punctuation, and some control characters (e.g. backspace). However, there are many character sets that are based on US-ASCII. Many use 8 bits and therefore have 256 code points. The term extended ASCII is used for such encodings. The now popular Unicode uses such an extended ASCII encoding as the first block (Latin1).
This allows you to just cast a byte to char if ASCII is used. But this bears the risk that non-ASCII characters get misinterpreted. See here.

What is ANSI?

It's short for American National Standards Institute, which is an institute, not a character set and not an encoding. However, the term is often mistakenly used for Microsoft Windows code pages, such as Windows-1252. This is wrong! ANSI was never a code page or anything like that. So don't use that term, unless you actually mean that institute. Microsoft does not use that term. Some text editors use it as a placeholder for the default Windows code page of a Windows system (e.g. Notepad++).

What is Unicode?

Unicode is a standard defined by the Unicode Consortium. It's basically a very long list that gives a number (called code point) to each and every character. The list contains letters, numbers, symbols, emojis, etc. Unicode also defines processing, storage and interchange of text data.

It is based on the Universal Coded Character Set and there are different versions, since new characters are added to the set constantly.

In many cases the Unicode character set is the universal alphabet (set of characters) used for strings (not just in Java). The alphabets of natural languages (such as Latin, Greek etc.) are all part of Unicode.

Therefore it is also recommended to use Unicode formats (i.e. UTF-8) for source code. The Java compiler can read many encodings, but when you collaborate with other programmers you do not want to waste time on problems that arise with system specific encodings.

There are other character sets. ISO/IEC 8859-1 (Latin1) is another popular character set and it is actually used as the first block of Unicode.

Since it's just a set it does not define how characters could be stored, but there are encoding formats (called UTF, see below) that support all code points of Unicode. Some supplementary blocks are reserved so you could define your own symbols there for private use (Supplementary Private Use Area planes).

What is UTF?

UTF stands for Unicode Transformation Format. It's about how you represent Unicode code points on a binary level. Most computers use 8-bit bytes, but Unicode needs up to 21 bits per code point. So UTF describes how characters are stored in a file.

UTF-8 uses one or more bytes for each Unicode code point. UTF-16 uses two or four bytes. UTF-32 uses a full 32-bit integer for each code point. There are pros and cons for each format, but you will usually only use UTF-8 for files and UTF-16 for Strings in Java. So both are important to learn.

(See above to learn about BOM.)

What is a code point?

A code point is the number of one Unicode character. With that number (position, index) you can find the other information that the Unicode standard defines:

  • The name of that character
  • The block that it is part of
  • How it looks
  • How it has to be printed (some are zero-width)
  • Similar characters

What is a code unit?

A code unit is a sequence of bits of a certain length used as a single unit for some encoding. Usually they are 7, 8, 16 or 32 bits long. A single character can take more than one code unit. The data stored in the code units that belong together is one code point (see above).

What are Unicode blocks?

The Unicode standard has many characters. They are grouped in blocks. For some of these blocks there are 8-bit character sets that can be used if only characters of that block/set are needed. But there is not much benefit, and using them only makes text processing more complicated.

What's the length of a String?

Just using myString.length() might not be what you need.

It's actually not easy to define the length of a String. You could count "ä" as two characters ("a" and "¨") or just one. You can say that non-printable characters have no influence on the length (such as control characters or even line breaks; backspace might actually reduce the length). To solve this problem, you need to normalise the string: you replace each substring with the one you actually want to count.

Then you need to make sure you count the code units or code points (see above), depending on what you want.
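Both points can be demonstrated with java.text.Normalizer and the two Unicode forms of "ä":

```java
import java.text.Normalizer;

public class LengthDemo {
    public static void main(String[] args) {
        String composed = "\u00E4";    // "ä" as one code point
        String decomposed = "a\u0308"; // "ä" as 'a' + combining diaeresis
        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2 — same visible symbol, different length
        // Normalize before counting, so both forms count (and compare) the same:
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed)); // true
        System.out.println(nfc.length());         // 1
    }
}
```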

Here's a full explanation and example:
https://humanoidreadable.wordpress.com/2014/08/17/string-length/

How to compare Strings?

String implements Comparable<String>, so you can use compareTo to see which of two Strings comes first. compareToIgnoreCase does the same, but ignores case.

In Java you must use equals to compare two strings for equality. That means the lengths must be the same and each character must match. When you use == instead, you compare for the same object. But two different String objects (instances of the String class) could have the same contents. So always use equals:

Scanner s = new Scanner(System.in);
String s1 = s.nextLine();
String s2 = s.nextLine();
boolean equal = s1.equals(s2);

equalsIgnoreCase does the same but ignores case.
To prevent null pointer exceptions you can use Objects.equals instead.
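A short sketch of the null-safe variants:

```java
import java.util.Objects;

public class EqualsDemo {
    public static void main(String[] args) {
        String s1 = null;
        String s2 = "hi";
        // s1.equals(s2) would throw a NullPointerException here:
        System.out.println(Objects.equals(s1, s2));     // false, no exception
        System.out.println(Objects.equals(null, null)); // true
        System.out.println("HI".equalsIgnoreCase(s2));  // true
    }
}
```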

== vs. equals()

Strings are just regular objects. So the difference between == and equals() is the same as it is for List, int[], File etc.

  • The == operator compares references
  • equals() compares object states.

The object state of a String is the array of bytes it holds. Those are compared by equals(). Two Strings could contain the same data but since they are two separate objects, they are not the same when compared using ==. It's best to use Objects.equals(a, b) to compare two strings for equality.

