Strings in Java
This is the FAQ part on Strings. See our general FAQ for other topics.
A string is a sequence of characters. A string has a length (which can be zero or more) and the characters are in a certain order. The characters are taken from a certain alphabet, which is the character set (see below).
Java has a class called java.lang.String, which is used to represent strings in Java. This FAQ is all about that class. If you want to learn the basics of strings you should read a book or follow this tutorial: https://docs.oracle.com/javase/tutorial/java/data/strings.html
Use Integer.parseInt(String) or Double.parseDouble(String). Equivalent methods exist for Long and Float. These methods throw a NumberFormatException if the input is not valid.
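For example (a minimal sketch; the input values are made up):
String input = "42";
try {
    int number = Integer.parseInt(input);          // 42
    double fraction = Double.parseDouble("3.14");  // 3.14
} catch (NumberFormatException e) {
    // the input was not a valid number
}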
There's Integer.toString(int) and Double.toString(double). Both return a string representation of the argument. Long and Float each have such a method as well.
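A short example (the values are just for illustration):
String fromInt = Integer.toString(42);      // "42"
String fromDouble = Double.toString(3.14);  // "3.14"
String fromLong = Long.toString(42L);       // "42"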
There is toString(), but you'd always have to check for null. So you can use Objects.toString(Object) instead. There's even an overload that lets you provide a null-default. And in concatenation you can just use the object like this:
String string = "Object: " + myObject;
Java will automatically use toString() on non-null values and "null" if it is null.
Alternatively, you could use any JSON library to JSONify objects. It will convert the object to JavaScript Object Notation, which is human-readable.
The object is an instance of the String class (java.lang.String). It lives on the Java heap and behaves like any other object. It is not a value; in Java we distinguish between pure values and referenced objects.
Variables only hold values. Since a reference is a value, you can assign it to a variable. So when you assign a String to a variable, the JVM simply copies the reference, not the String.
Strings are immutable in Java simply because the makers of Java (Sun Microsystems, now Oracle) designed them that way (JLS: "A String object has a constant (unchanging) value."). Making Strings immutable has some advantages over just using an array of char:
- Arrays are not as secure, since any method could alter their contents.
- Performance is better, because you never have to copy the array just to make sure other methods cannot alter the string.
- The length is fixed and you do not need to search for an ending character (usually \0 in C).
Note that these are not reasons why they are immutable. They are problems that are solved by the immutability of Java Strings. But other languages have other solutions, such as Copy-On-Write-Strings (PHP, Swift, Delphi), which are mutable but still perform well and are just as secure.
Just like any other immutable type. Oracle already has a guide on this: https://docs.oracle.com/javase/tutorial/essential/concurrency/immutable.html
Note that this doesn't mean you couldn't use reflection to alter the String. But this can be prevented by using a Java Security Manager that does not grant reflective access to Strings.
That's one requirement to make a class immutable (see above). It prevents you from creating a subclass of String, which could introduce bugs and mutability. Implement java.lang.CharSequence if you need your own String-like class, and don't forget to override toString().
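A minimal sketch of such a class (the class name CharArrayView is made up for illustration):
// Hypothetical example: an immutable CharSequence backed by a char array.
public final class CharArrayView implements CharSequence {
    private final char[] data;

    public CharArrayView(char[] data) {
        this.data = data.clone(); // defensive copy keeps the view immutable
    }

    @Override public int length() { return data.length; }
    @Override public char charAt(int index) { return data[index]; }

    @Override public CharSequence subSequence(int start, int end) {
        return new CharArrayView(java.util.Arrays.copyOfRange(data, start, end));
    }

    @Override public String toString() { return new String(data); }
}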
Numbers, booleans and Strings can be defined in the source code as "literals". You literally have the value inside the code. Anything that is a series of decimal digits is a number. true is the literal for one of the two boolean values. And anything inside double quotes is a String literal.
String str = "this here is a string literal";
Strings are used a lot in all applications. And they are special because we can use String literals to create them. Some Strings are used a lot and would take up a lot of space if they existed more than once. So the JVM puts String literals into a constant pool. Note that modern versions of Java do not treat them much differently from other objects, but there is a lot of optimization.
So when you use a String literal in your code it will only exist once, even if you use the same literal in another unit of code. The JVM checks if there is already such a String and will reuse it. You can even put your dynamically created Strings into that pool by invoking myString.intern().
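You can observe the pool yourself (a sketch; the == identity checks are only for demonstration, use equals in real code):
String a = "hello";
String b = "hel" + "lo";  // compile-time constant, resolved to the same pooled String
System.out.println(a == b);  // true
String c = new StringBuilder("hel").append("lo").toString();  // created at runtime
System.out.println(a == c);           // false: a distinct object on the heap
System.out.println(a == c.intern());  // true: intern() returns the pooled instance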
There are other types of constant memory and you can read about that in the JLS:
- https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html
- https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.4
First you need to understand that two of the constructors should not be used!
DO NOT USE: str = new String();
INSTEAD USE: str = "";
DO NOT USE: str2 = new String(str1);
INSTEAD USE: str2 = str1;
Read about the String constant pool (above) to learn why this is relevant.
For the other constructors you can simply read their API documentation. That's what it's there for.
You will rarely need any of them. Some can be used to get a String from a char[] or an int[] (array of Unicode code points).
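For example (a small sketch):
char[] chars = {'J', 'a', 'v', 'a'};
String fromChars = new String(chars);  // "Java"
int[] codePoints = {0x1D11E, 0x41};    // G clef (needs two chars) and 'A'
String fromCodePoints = new String(codePoints, 0, codePoints.length);  // "𝄞A"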
You probably want to use StringBuilder. It acts as a mutable String alternative and has better performance if you execute many operations on a large String.
You can also use string.chars() to get a stream of char values (code units), or string.codePoints() to get a stream of code points. The latter is helpful if you want to process one symbol after the other.
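A small sketch of typical StringBuilder use:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 3; i++) {
    sb.append("item ").append(i).append('\n');  // mutates the builder, no new Strings
}
String result = sb.toString();  // only now a String is created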
For small Strings and few operations you can just use regular Strings, as the type already has many methods to help you process Strings. You simply create a new String instance and let the garbage collector take care of unused Strings. The same is true for all immutable types. The important part to understand is that a variable is not a String. A String-type variable only holds a reference to a String (or null). Unless the variable is final, you can simply assign a reference to another String to it.
// One String, two variables:
String greeting = "Hallo";
String backup = greeting;
// We make a new String:
greeting = greeting.replace('a', 'e');
// We now have our greeting:
System.out.println(greeting);
// But we still have the original:
System.out.println(backup);
backup = null;
// GC can now remove the original String.
For reading you should just use Files.readAllLines(Path, Charset) or a Reader with the appropriate Charset:
try (var stream = new FileInputStream(name);
     var reader = new InputStreamReader(stream, StandardCharsets.UTF_8)) {
    char[] buffer = new char[8192];
    int read = reader.read(buffer);
    // process buffer[0..read), repeat until read == -1
}
This would treat the data as UTF-8. Use BufferedReader to read lines as Strings.
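For example (a sketch, assuming path is a java.nio.file.Path to a UTF-8 text file):
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);  // each line as one String, without the line separator
    }
}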
If you have a byte[] from somewhere and you need to convert it, you should still let Java do the work for you. You could use a ByteArrayInputStream with the above approach. But when you have a byte array it's easier to just use the constructor String(byte[], int, int, Charset). To get the encoded bytes from a String you use getBytes(Charset).
byte[] data = api.getData(); // some API gives you a byte[] instead of a Java String
String str = new String(data, 0, data.length, Charset.forName("ISO-8859-5")); // read as Latin/Cyrillic
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // convert to UTF-8
The class java.lang.String models an immutable sequence of characters. String is in the package java.lang, so you do not need to import it. And since you can use String literals (i.e. String str = "hello";), you do not need any constructors to create instances. The String class has many methods for String operations (such as substring and indexOf). For concatenation the infix operator + can be used.
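For example:
String s = "Hello, World";
int comma = s.indexOf(',');           // 5
String first = s.substring(0, comma); // "Hello"
String shout = first + "!";           // "Hello!" via the + operator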
Technical details (you will need to read the rest of this page to understand this):
In Java a String represents a sequence of code units in the UTF-16 format (as in many other languages, such as JavaScript and C#). A String is an object (instance of java.lang.String). String implements the interfaces Serializable, Comparable<String>, and CharSequence. The primitive type char is used in the API, for example by the method charAt. However, in modern implementations of Java, Strings can be compacted, so they use a byte[] and a field named coder for the encoding. If Latin1 is used (compact), then each byte represents one symbol (all code points need no more than 8 bits). If UTF-16 is used (not compacted), then two bytes are used per char (two or four per code point).
In many cases one char (code unit) represents one symbol, but some symbols (code points) need two chars (a surrogate pair). The value of a String (which is a byte[]) can not be altered. But there are mutable types such as StringBuilder.
All this might seem rather confusing. This is due to backwards compatibility, from when Strings in Java used to be just a wrapper for a char[]. Often it's better to just process code points. So have a look at the method codePoints. It allows you to process all code points as integers. Converting them back to a String is still somewhat cumbersome in Java 10. But you can do this:
var upper = "hello".codePoints().map(Character::toUpperCase)
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
It's in the JDK. Install it:
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Then simply open java/lang/String.java from src.zip.
This is probably the most confusing thing about strings. Especially because there is a type called "char" in Java, which is not a character, but actually a code unit for UTF-16. That means it's just a 16-bit value, not necessarily one character. It could be only half a character, since some characters need two chars to be stored.
The term has different meanings, depending on context. It could stand for:
- a letter/grapheme: A-Z are Latin letters
- a printable symbol: Any symbol that you could think of. Letters, numbers, punctuation, emojis etc.
- any element of a character set: Even non-printable characters, such as control characters.
- a code unit: The type "char" holds one code unit.
In mathematics the formal definition of a language is that it is a set of words. Each word is a sequence of symbols.
A character set (or charset) is literally a set of characters. A set as in a collection of well defined and distinct elements. And characters as in letters, numbers, symbols, emojis etc. Some characters have special meanings and are not printable (control characters, such as "backspace").
Each character in a charset is mapped to a numeric value. For example in US-ASCII (see below), 'A' is 65 and so on. GBK is a character set for simplified Chinese characters. With the numbering the characters have a certain order (in theory a set is not ordered). So it's basically a list of characters, sometimes represented as a table (ASCII table). Each character has a number (the position in the list, starting at 0) and a definition of what it looks like (though each font will render it a bit differently).
It defines how a single character is encoded in a text file. For any character set that has 128 or fewer characters you would simply store each character in the 7 least significant bits of a byte (a byte usually has 8 bits). If it has up to 256 characters you simply use all 8 bits. Unicode (see below) has many more characters, so the encoding is a bit more complex than that.
The character 'A' is 65 in many charsets (see above), and that's 100 0001 in binary, so that's how it is encoded in US-ASCII. UTF-16 would store this as 0000 0000 0100 0001 while UTF-8 would store it as 0100 0001. In this case it's just filled with zeroes so it's 16 or 8 bits, respectively. In both cases the character uses exactly one code unit. But a symbol such as 𝄞 (G Clef) needs more than one unit:
- ASCII: not available
- UTF-16BE: 1101 1000 0011 0100 | 1101 1101 0001 1110 (two units)
- UTF-16LE: 0011 0100 1101 1000 | 0001 1110 1101 1101 (two units)
- UTF-8: 1111 0000 | 1001 1101 | 1000 0100 | 1001 1110 (four units)
To understand how this is done you need to read the specifications of UTF-16 / UTF-8. UTF-16 supports BE and LE format (see next question).
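You can verify this in Java (a sketch; StandardCharsets is java.nio.charset.StandardCharsets):
String clef = "\uD834\uDD1E";  // U+1D11E (G Clef) as a surrogate pair
byte[] utf8 = clef.getBytes(StandardCharsets.UTF_8);      // F0 9D 84 9E (four units)
byte[] utf16 = clef.getBytes(StandardCharsets.UTF_16BE);  // D8 34 DD 1E (two units)
System.out.println(clef.length());                         // 2 chars (code units)
System.out.println(clef.codePointCount(0, clef.length())); // 1 code point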
This class has a misleading name. It describes an encoding as a "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes". UTF-8 and UTF-16 are different encodings for the same character set (namely Unicode). It should have been named Encoding instead, and StandardCharsets should have been StandardEncodings.
But they are aware of it and blame it on the IETF:
The name of this class is taken from the terms used in RFC 2278. In that document a charset is defined as the combination of one or more coded character sets and a character-encoding scheme. (This definition is confusing; some other software systems define charset as a synonym for coded character set.)
And there you can find a historical note, which explains why "charset" was used instead of "encoding".
Big-endian and little-endian are two different formats to store or transmit a number that is wider than a single byte. A byte is usually 8 bits. That's enough for 0 to 255 (unsigned) or -128 to 127 (signed). UTF-16 (see below) needs two bytes per code unit (= 16 bits). So there are two possible byte orders for UTF-16. To know which one is used you can use a byte order mark (see next).
The byte order mark (BOM) is an optional sequence of bytes (not characters) at the beginning of a string, when it is encoded. It is important for UTF-16 so the byte order (also called endianness, see above) is clear. For UTF-8 there is no byte order since every unit is 8 bits wide, but a BOM can be used as an indication of UTF-8.
So a file or a network stream can begin with a BOM. To prevent it from being confused with a character, those byte sequences are reserved:
In UTF-8 the value FEFF needs three bytes: EF BB BF. U+EFBB (the first unit when UTF-8 with BOM is interpreted as UTF-16 by mistake) is not defined.
Code pages (CP) are used to distinguish different character encodings, but there is no clear definition of the term. It was supposed to make things easier, but only made them worse, with different names for the code pages and different code pages by different vendors. Today, most vendors recommend Unicode (see below). Some code pages are still in use, such as Windows-1252 by Microsoft.
Some character sets have a region that is undefined and can be used for system-specific characters. This usually only leads to problems, which is why these ranges should not be used. Windows code pages are often based on some extended ASCII character set (see below) and simply add more characters by defining the system-specific range.
If a text is transferred from one computer to another you need to make sure both support the same code page (the system, the application and the font) so that the text will be the same. Or save the trouble by using Unicode.
US-ASCII (often just ASCII) is short for American Standard Code for Information Interchange. It defines a character encoding standard and is rarely used today. A single code unit is only 7 bits, so it only knows 128 code points. This is enough for A-Z, a-z, 0-9, some punctuation, and some control characters (e.g. backspace).
However, there are many character sets that are based on US-ASCII. Many use 8 bits and therefore have 256 code points. The term extended ASCII is used for such encodings. The now popular Unicode uses such an extended ASCII encoding as the first block (Latin1).
This allows you to just cast a byte to char if ASCII is used. But this bears the risk that non-ASCII characters get misinterpreted.
It's short for American National Standards Institute, which is an institute, not a character set and not an encoding. However, the term is often mistakenly used for Microsoft Windows code pages, such as Windows-1252. This is wrong! ANSI was never a code page or anything like that. So don't use that term unless you actually mean the institute. Microsoft does not use the term. Some text editors use it as a placeholder for the default Windows code page of a system (e.g. Notepad++).
Unicode is a standard defined by the Unicode Consortium. It's basically a very long list that gives a number (called code point) to each and every character. The list contains letters, numbers, symbols, emojis, etc. Unicode also defines processing, storage and interchange of text data.
It is based on the Universal Coded Character Set and there are different versions, since new characters are added to the set constantly.
In many cases the Unicode character set is the universal alphabet (set of characters) used for strings (not just in Java). The alphabets of natural languages (such as Latin, Greek etc.) are all part of Unicode.
Therefore it is also recommended to use Unicode formats (e.g. UTF-8) for source code. The Java compiler can read many encodings, but when you collaborate with other programmers you do not want to waste time on problems that arise from system-specific encodings.
There are other character sets. ISO/IEC 8859-1 (Latin1) is another popular character set and it is actually used as the first block of Unicode.
Since it's just a set it does not define how characters could be stored, but there are encoding formats (called UTF, see below) that support all code points of Unicode. Some supplementary blocks are reserved so you could define your own symbols there for private use (Supplementary Private Use Area planes).
UTF stands for Unicode Transformation Format. It's about how you represent Unicode code points on a binary level. Most computers use 8-bit bytes, but Unicode needs up to 21 bits per code point. So UTF describes how characters are stored in a file.
UTF-8 uses one or more bytes for each Unicode code point. UTF-16 uses two or four bytes. UTF-32 uses a full 32-bit integer for each code point. There are pros and cons for each format, but you will usually only use UTF-8 for files and UTF-16 for Strings in Java. So both are important to learn.
(See above to learn about BOM.)
A code point is the number of one Unicode character. With that number (position, index) you can find the other information that the Unicode standard defines:
- The name of that character
- The block that it is part of
- How it looks
- How it has to be printed (some are zero-width)
- Similar characters
A code unit is a sequence of bits of a certain length used as a single unit for some encoding. Usually they are 7, 8, 16 or 32 bits long. A single character can take more than one code unit. The data stored in the code units that belong together is one code point (see above).
The Unicode standard has many characters, grouped in blocks. For some of these blocks there are 8-bit character sets that can be used if only characters of that block/set occur. But there is not much benefit, and it only makes text processing more complicated.
Just using myString.length() might not be what you need.
It's actually not easy to define the length of a String. You could count "ä" as two characters ("a" and "¨") or just one. You can say that non-printable characters have no influence on the length (such as control characters or even line breaks; a backspace might actually reduce the length). To solve this problem, you need to normalise the string: you replace each substring with the one you actually want to count.
Then you need to make sure you count the code units or code points (see above), depending on what you want.
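A small sketch of the difference (Normalizer is java.text.Normalizer, part of the JDK):
String s = "a\u0308";  // "ä" written as 'a' plus a combining diaeresis
System.out.println(s.length());                       // 2 code units
System.out.println(s.codePointCount(0, s.length()));  // 2 code points
String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
System.out.println(nfc.length());                     // 1 after normalisation to "ä"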
Here's a full explanation and example:
https://humanoidreadable.wordpress.com/2014/08/17/string-length/
String implements Comparable<String>, so you can use compareTo to see which of two Strings comes first. compareToIgnoreCase does the same, but ignores case.
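For example:
String a = "apple";
String b = "Banana";
System.out.println(a.compareTo(b) > 0);            // true: 'a' (97) sorts after 'B' (66)
System.out.println(a.compareToIgnoreCase(b) < 0);  // true: "apple" before "banana"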
In Java you must use equals to compare two strings for equality. That means the length must be the same and each character is compared as well. When you use == instead, you check whether it is the same object. But two different String objects (instances of the String class) can have the same contents. So always use equals:
Scanner s = new Scanner(System.in);
String s1 = s.nextLine();
String s2 = s.nextLine();
boolean equal = s1.equals(s2);
equalsIgnoreCase does the same but ignores case. To prevent null pointer exceptions you can use Objects.equals instead.
Strings are just regular objects. So the difference between == and equals() is the same as it is for List, int[], File etc.
- The == operator compares references.
- equals() compares object states.
The object state of a String is the array of bytes it holds. Those are compared by equals(). Two Strings could contain the same data, but since they are two separate objects, they are not the same when compared using ==. It's best to use Objects.equals(a, b) to compare two strings for equality.
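A short demonstration (the new String(...) call is only there to force a second object; don't do that in real code):
String a = new String("test");  // explicitly creates a second object
String b = "test";
System.out.println(a == b);                // false: two different objects
System.out.println(a.equals(b));           // true: same contents
System.out.println(Objects.equals(a, b));  // true, and also null-safe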
Here are some useful links:
Our Group on Facebook: https://www.facebook.com/groups/Javagroup123/
The Rules (read them!): https://github.com/Javagroup123/group/wiki/Rules
Frequently Asked Questions: https://github.com/Javagroup123/group/wiki/FAQ
Recommended Books: https://github.com/Javagroup123/group/wiki/Recommended-Books