### Data types


#### Array types and conversions between types
NumPy supports a much greater variety of numerical types than Python does. This section shows which are available, and how to modify an array’s data-type.


NumPy numerical types are instances of dtype (data-type) objects, each having unique characteristics. Once you have imported NumPy using

In [1]:
import numpy as np

the dtypes are available as np.bool_, np.float32, etc.

Advanced types, which are not covered here, are explored in the Structured arrays section:

https://numpy.org/doc/1.19/user/basics.rec.html#structured-arrays

There are 5 basic numerical types, representing booleans (bool), integers (int), unsigned integers (uint), floating point (float) and complex. Those with numbers in their name indicate the bitsize of the type (i.e. how many bits are needed to represent a single value in memory). Some types, such as int and intp, have differing bitsizes, dependent on the platform (e.g. 32-bit vs. 64-bit machines). This should be taken into account when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.
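
The bit sizes described above can be checked through the itemsize attribute of a dtype, which reports bytes per value. This is a minimal sketch; the width of platform-dependent types such as np.intp varies between machines, so no output is shown for it:

```python
import numpy as np

# Fixed-size types: the number in the name is the width in bits.
bits_int8 = np.dtype(np.int8).itemsize * 8
bits_float32 = np.dtype(np.float32).itemsize * 8
bits_complex128 = np.dtype(np.complex128).itemsize * 8
print(bits_int8, bits_float32, bits_complex128)  # 8 32 128

# Platform-dependent types: query the width instead of assuming it.
bits_intp = np.dtype(np.intp).itemsize * 8
print("np.intp uses", bits_intp, "bits on this platform")
```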

Data-types can be used as functions to convert Python numbers to array scalars (see the Array Scalars section for an explanation), Python sequences of numbers to arrays of that type, or as arguments to the dtype keyword that many NumPy functions or methods accept. Some examples:

In [2]:
np.float32(1)

1.0

In [3]:
np.int_([1,2,3])

array([1, 2, 3])

In [4]:
np.arange(3, dtype=np.int8)

array([0, 1, 2], dtype=int8)

Array types can also be referred to by character codes, mostly to retain backward compatibility with older packages such as Numeric. Some documentation may still refer to these, for example:

In [5]:
np.arange(3, dtype='f')

array([0., 1., 2.], dtype=float32)

We recommend using dtype objects instead.

To convert the type of an array, use the .astype() method (preferred) or the type itself as a function. For example:

In [6]:
z = np.arange(5, dtype=np.int8)
z

array([0, 1, 2, 3, 4], dtype=int8)

In [9]:
a = z.astype(np.float32)
a

array([0., 1., 2., 3., 4.], dtype=float32)

In [10]:
np.int8(z)

array([0, 1, 2, 3, 4], dtype=int8)

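Python type objects can also be used as dtypes, for example z.astype(float). NumPy knows that int refers to np.int_, bool means np.bool_, that float is np.float_ (an alias of np.float64) and complex is np.complex_ (an alias of np.complex128). The np.float_ and np.complex_ aliases were removed in NumPy 2.0, so np.float64 and np.complex128 are the portable spellings. The other data-types do not have Python equivalents.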
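
The mapping from Python types to NumPy dtypes can be verified directly. The sketch below only uses names that exist in both NumPy 1.x and 2.x:

```python
import numpy as np

# Python built-in types are accepted wherever a dtype is expected.
z = np.arange(5, dtype=np.int8)
as_float = z.astype(float)

print(as_float.dtype == np.float64)          # True
print(np.dtype(bool) == np.dtype(np.bool_))  # True
print(np.dtype(complex) == np.complex128)    # True
print(np.dtype(int) == np.dtype(np.int_))    # True
```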

To determine the type of an array, look at the dtype attribute:

In [12]:
z.dtype

dtype('int8')

dtype objects also contain information about the type, such as its bit-width and its byte-order. The data type can also be used indirectly to query properties of the type, such as whether it is an integer:

In [13]:
d = np.dtype(int)
d

dtype('int64')

In [14]:
np.issubdtype(d, np.integer)

True

In [15]:
np.issubdtype(d, np.floating)

False
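
The bit-width and byte-order information mentioned above is exposed as plain attributes. A short sketch; the byteorder value is not shown because it depends on whether the requested order matches the machine's native order:

```python
import numpy as np

d = np.dtype(np.int16)
print(d.itemsize)   # 2 (bytes per element)
print(d.kind)       # i (signed integer)
print(d.name)       # int16

# '>' requests big-endian storage explicitly; '<' requests little-endian.
big = np.dtype(">i2")
print(big.byteorder, big.itemsize)
```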

### Array Scalars
NumPy generally returns elements of arrays as array scalars (a scalar with an associated dtype). Array scalars differ from Python scalars, but for the most part they can be used interchangeably (the primary exception is for versions of Python older than v2.x, where integer array scalars cannot act as indices for lists and tuples). There are some exceptions, such as when code requires very specific attributes of a scalar or when it checks specifically whether a value is a Python scalar. Generally, problems are easily fixed by explicitly converting array scalars to Python scalars, using the corresponding Python type function (e.g., int, float, complex, str, unicode).

The primary advantage of using array scalars is that they preserve the array type (Python may not have a matching scalar type available, e.g. int16). Therefore, the use of array scalars ensures identical behaviour between arrays and scalars, irrespective of whether the value is inside an array or not. NumPy scalars also have many of the same methods arrays do.
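
As an illustration of the points above, the following sketch shows that an array element keeps its dtype and how to convert it explicitly when a true Python scalar is needed:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int16)
elem = arr[0]

print(type(elem).__name__)     # int16: an array scalar, not a Python int
print(elem.dtype)              # int16
print(isinstance(elem, int))   # False

# Convert explicitly when code needs a true Python scalar.
as_python = int(elem)          # elem.item() is equivalent here
print(type(as_python).__name__)  # int
```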

### Overflow Errors
The fixed size of NumPy numeric types may cause overflow errors when a value requires more memory than available in the data type. For example, numpy.power evaluates 100 ** 8 correctly for 64-bit integers, but gives 1874919424 (incorrect) for a 32-bit integer.

In [16]:
np.power(100, 8, dtype=np.int64)

10000000000000000

In [17]:
np.power(100, 8, dtype=np.int32)

1874919424

The behaviour of NumPy and Python integer types differs significantly for integer overflows and may confuse users expecting NumPy integers to behave similar to Python’s int. Unlike NumPy, the size of Python’s int is flexible. This means Python integers may expand to accommodate any integer and will not overflow.
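
The difference can be seen by adding 1 to the largest 32-bit integer. This is a small sketch of the wrap-around behaviour for array arithmetic:

```python
import numpy as np

int32_max = np.iinfo(np.int32).max   # 2147483647

# Python int: arbitrary precision, so the sum is exact.
python_sum = int(int32_max) + 1
print(python_sum)                    # 2147483648

# NumPy int32 array: the result wraps around to the minimum value.
wrapped = np.array([int32_max], dtype=np.int32) + np.int32(1)
print(wrapped)                       # [-2147483648]
```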

NumPy provides numpy.iinfo and numpy.finfo to verify the minimum or maximum values of NumPy integer and floating point values respectively:

In [21]:
np.iinfo(int)  # the np.int alias was deprecated in NumPy 1.20 and later removed

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

In [20]:
np.iinfo(np.int64)

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

In [22]:
np.iinfo(np.int32)

iinfo(min=-2147483648, max=2147483647, dtype=int32)
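
These limits can be used to check a value before storing it. The helper below, fits_in, is a hypothetical name introduced only for this sketch; it is not part of NumPy:

```python
import numpy as np

def fits_in(value, dtype):
    """Return True if the Python integer `value` is representable in `dtype`."""
    info = np.iinfo(dtype)
    return info.min <= value <= info.max

print(fits_in(100 ** 8, np.int32))   # False
print(fits_in(100 ** 8, np.int64))   # True
```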

If 64-bit integers are still too small, the result may be cast to a floating point number. Floating point numbers offer a larger, but inexact, range of possible values.

In [23]:
np.power(100,100, dtype=np.int64)

0

In [24]:
np.power(100,100,dtype=np.float64)

1e+200

### Extended Precision
Python’s floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to np.float64. In some unusual situations it may be useful to use floating-point numbers with more precision. Whether this is possible in NumPy depends on the hardware and on the development environment: specifically, x86 machines provide hardware floating-point with 80-bit precision, and while most C compilers provide this as their long double type, MSVC (standard for Windows builds) makes long double identical to double (64 bits). NumPy makes the compiler’s long double available as np.longdouble (and np.clongdouble for the complex numbers). You can find out what your NumPy provides with np.finfo(np.longdouble).

NumPy does not provide a dtype with more precision than C’s long double; in particular, the 128-bit IEEE quad precision data type (FORTRAN’s REAL*16) is not available.

For efficient memory alignment, np.longdouble is usually stored padded with zero bits, either to 96 or 128 bits. Which is more efficient depends on hardware and development environment; typically on 32-bit systems they are padded to 96 bits, while on 64-bit systems they are typically padded to 128 bits. np.longdouble is padded to the system default; np.float96 and np.float128 are provided for users who want specific padding. In spite of the names, np.float96 and np.float128 provide only as much precision as np.longdouble, that is, 80 bits on most x86 machines and 64 bits in standard Windows builds.

Be warned that even if np.longdouble offers more precision than Python float, it is easy to lose that extra precision, since Python often forces values to pass through float. For example, the % formatting operator requires its arguments to be converted to standard Python types, and it is therefore impossible to preserve extended precision even if many decimal places are requested. It can be useful to test your code with the value 1 + np.finfo(np.longdouble).eps.
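
A sketch of that test follows. The last line prints True only on platforms where long double is wider than 64 bits, so no expected output is given for it:

```python
import numpy as np

ld = np.finfo(np.longdouble)
print(np.dtype(np.longdouble).itemsize * 8)  # storage size in bits, padding included
print(ld.eps)                                # spacing just above 1.0 in long double

# 1 + eps is distinguishable from 1 in long double arithmetic ...
x = np.longdouble(1) + ld.eps
print(x > 1)                                 # True

# ... but converting to a Python float may round it back to exactly 1.0
# wherever long double carries more precision than a 64-bit float.
print(float(x) == 1.0)
```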

### Array creation

Introduction
There are 5 general mechanisms for creating arrays:

1. Conversion from other Python structures (e.g., lists, tuples)

2. Intrinsic numpy array creation objects (e.g., arange, ones, zeros, etc.)

3. Reading arrays from disk, either from standard or custom formats

4. Creating arrays from raw bytes through the use of strings or buffers

5. Use of special library functions (e.g., random)

This section will not cover means of replicating, joining, or otherwise expanding or mutating existing arrays. Nor will it cover creating object arrays or structured arrays. Both of those are covered in their own sections.
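
Mechanism 4, creating arrays from raw bytes, can be sketched with np.frombuffer. This is a minimal illustration that assumes the bytes are already in memory; the sample bytes are made up:

```python
import numpy as np

raw = bytes([1, 2, 3, 4])

# Interpret the four bytes as four unsigned 8-bit integers.
as_uint8 = np.frombuffer(raw, dtype=np.uint8)
print(as_uint8)          # [1 2 3 4]

# The same bytes viewed as two little-endian 16-bit integers.
as_uint16 = np.frombuffer(raw, dtype="<u2")
print(as_uint16)         # [ 513 1027]
```
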
### Converting Python array_like Objects to NumPy Arrays
In general, numerical data arranged in an array-like structure in Python can be converted to arrays through the use of the array() function. The most obvious examples are lists and tuples. See the documentation for array() for details of its use. Some objects may support the array protocol and allow conversion to arrays this way. A simple way to find out whether an object can be converted to a NumPy array using array() is to try it interactively and see if it works (the Python way).

Examples:

In [1]:
import numpy as np
x = np.array([1,2,3,4])
x

array([1, 2, 3, 4])

In [2]:
x1 = np.array([1, 2, 3, 4])
x1

array([1, 2, 3, 4])

In [4]:
x2 = np.array([[1,2.0],[0,0],(1+1j,3.)])
x2

array([[1.+0.j, 2.+0.j],
       [0.+0.j, 0.+0.j],
       [1.+1.j, 3.+0.j]])

In [5]:
x3 = np.array([[ 1.+0.j, 2.+0.j], [ 0.+0.j, 0.+0.j], [ 1.+1.j, 3.+0.j]])
x3

array([[1.+0.j, 2.+0.j],
       [0.+0.j, 0.+0.j],
       [1.+1.j, 3.+0.j]])
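
Objects that support the array protocol mentioned above can be converted in the same way. The class below is a hypothetical container written only for this sketch; it exposes its data through the __array__ method:

```python
import numpy as np

class Temperatures:
    """Hypothetical container that exposes its data through __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # Called by np.array() / np.asarray() to obtain the array data.
        return np.array(self._values, dtype=dtype)

readings = Temperatures([20.5, 21.0, 19.8])
arr = np.asarray(readings)
print(arr)        # [20.5 21.  19.8]
print(arr.dtype)  # float64
```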

### Intrinsic NumPy Array Creation
NumPy has built-in functions for creating arrays from scratch:

zeros(shape) will create an array filled with 0 values with the specified shape. The default dtype is float64.

In [6]:
a = np.zeros(5, dtype=np.float64)
a

array([0., 0., 0., 0., 0.])

ones(shape) will create an array filled with 1 values. It is identical to zeros in all other respects.

In [7]:
a1 = np.ones(5, dtype=np.float64)
a1

array([1., 1., 1., 1., 1.])

arange() will create arrays with regularly incrementing values. Check the docstring for complete information on the various ways it can be used. A few examples will be given here:

In [8]:
a2 = np.arange(10, dtype=np.int64)
a2

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
a3 = np.arange(2,10)
a3

array([2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
a4 = np.arange(2,10,2)
a4

array([2, 4, 6, 8])

Note that the last usage, with an explicit step, has some subtleties that are described in the arange docstring; in particular, a non-integer step can produce an unexpected number of elements because of floating-point rounding.

linspace() will create arrays with a specified number of elements, and spaced equally between the specified beginning and end values. For example:

In [15]:
a5 = np.linspace(1,5,10)
a5

array([1.        , 1.44444444, 1.88888889, 2.33333333, 2.77777778,
       3.22222222, 3.66666667, 4.11111111, 4.55555556, 5.        ])

The advantage of this creation function is that one can guarantee the number of elements and the starting and end point, which arange() generally will not do for arbitrary start, stop, and step values.
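
The contrast can be sketched as follows. The length of the arange result is deliberately not shown, since it depends on floating-point rounding of (stop - start) / step:

```python
import numpy as np

start, stop, n = 1.0, 1.3, 4

# linspace: n elements, first == start, last == stop, by construction.
pts = np.linspace(start, stop, n)
print(len(pts), pts[0], pts[-1])

# arange: the length follows from (stop - start) / step, which is subject
# to floating-point rounding, so it can differ from what the step suggests.
stepped = np.arange(start, stop, 0.1)
print(len(stepped))
```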

indices() will create a set of arrays (stacked as a one-higher dimensioned array), one per dimension with each representing variation in that dimension. An example illustrates much better than a verbal description:

In [17]:
np.indices((3,3))

array([[[0, 0, 0],
        [1, 1, 1],
        [2, 2, 2]],

       [[0, 1, 2],
        [0, 1, 2],
        [0, 1, 2]]])

This is particularly useful for evaluating functions of multiple dimensions on a regular grid.
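
For instance, a function of the row and column index can be evaluated on the whole grid in one expression. The function f(i, j) = 10 * i + j is an arbitrary choice for this sketch:

```python
import numpy as np

# rows[i, j] == i and cols[i, j] == j for every grid point.
rows, cols = np.indices((3, 3))

# Evaluate f(i, j) = 10 * i + j on the whole grid at once.
f = 10 * rows + cols
print(f)
# [[ 0  1  2]
#  [10 11 12]
#  [20 21 22]]
```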



### Reading Arrays From Disk
This is presumably the most common case of large array creation. The details, of course, depend greatly on the format of data on disk and so this section can only give general pointers on how to handle various formats.

#### Standard Binary Formats
Various fields have standard formats for array data. The following lists the ones with known Python libraries to read them and return NumPy arrays (there may be others that can be read and converted to NumPy arrays, so check the last section as well):

HDF5: h5py

FITS: Astropy

Examples of formats that cannot be read directly but for which it is not hard to convert are those formats supported by libraries like PIL (able to read and write many image formats such as jpg, png, etc).

#### Common ASCII Formats
Comma Separated Value files (CSV) are widely used (and an export and import option for programs like Excel). There are a number of ways of reading these files in Python. There are CSV functions in Python and functions in pylab (part of matplotlib).

More generic ascii files can be read using the io package in scipy.
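
As a rough sketch of reading CSV data, the example below parses the same text with the standard-library csv module and with np.loadtxt. The in-memory text stands in for a file on disk and its contents are made up:

```python
import csv
import io

import numpy as np

# In-memory stand-in for a CSV file on disk (made-up sample data).
text = "1.0,2.0,3.0\n4.0,5.0,6.0\n"

# Standard-library route: parse rows as strings, then convert.
rows = list(csv.reader(io.StringIO(text)))
from_csv = np.array(rows, dtype=float)

# NumPy route: loadtxt parses and converts in one call.
from_loadtxt = np.loadtxt(io.StringIO(text), delimiter=",")

print(np.array_equal(from_csv, from_loadtxt))  # True
```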

#### Custom Binary Formats
There are a variety of approaches one can use. If the file has a relatively simple format, then one can write a simple I/O library and use the NumPy fromfile() function and .tofile() method to read and write NumPy arrays directly (mind your byte order, though!). If a good C or C++ library exists that reads the data, one can wrap that library with a variety of techniques, though that is certainly much more work and requires significantly more advanced knowledge to interface with C or C++.
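
A minimal round trip with tofile() and fromfile() might look like the sketch below. The file name is hypothetical, and the dtype is spelled with an explicit byte order because the raw file stores neither dtype nor shape:

```python
import os
import tempfile

import numpy as np

original = np.arange(6, dtype="<i4")   # explicit little-endian 32-bit ints

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")   # hypothetical file name
    original.tofile(path)                  # raw bytes only: no shape, no dtype
    restored = np.fromfile(path, dtype="<i4")

print(np.array_equal(original, restored))  # True
```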

### Use of Special Libraries
There are libraries that can be used to generate arrays for special purposes, and it isn’t possible to enumerate all of them. The most common are the many array generation functions in random, which produce arrays of random values, and some utility functions that generate special matrices (e.g. diagonal).
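
Two such functions are sketched below: a seeded random generator and np.diag. The seed value is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded for reproducibility
samples = rng.random((2, 3))          # uniform values in [0, 1)
print(samples.shape)                  # (2, 3)

d = np.diag([1, 2, 3])                # 3x3 matrix with the given diagonal
print(d)
# [[1 0 0]
#  [0 2 0]
#  [0 0 3]]
```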

### Importing data with genfromtxt
NumPy provides several functions to create arrays from tabular data. We focus here on the genfromtxt function.

In a nutshell, genfromtxt runs two main loops. The first loop converts each line of the file into a sequence of strings. The second loop converts each string to the appropriate data type. This mechanism is slower than a single loop, but gives more flexibility. In particular, genfromtxt is able to take missing data into account, whereas faster and simpler functions like loadtxt cannot.

In [18]:
import numpy as np
from io import StringIO

### Defining the input
The only mandatory argument of genfromtxt is the source of the data. It can be a string, a list of strings, a generator or an open file-like object with a read method, for example, a file or io.StringIO object. If a single string is provided, it is assumed to be the name of a local or remote file. If a list of strings or a generator returning strings is provided, each string is treated as one line in a file. When the URL of a remote file is passed, the file is automatically downloaded to the current directory and opened.
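
The input forms described above can be sketched as follows. The same two made-up lines are passed as a list, as a generator and as a file-like object:

```python
import numpy as np
from io import StringIO

lines = ["1 2 3", "4 5 6"]

from_list = np.genfromtxt(lines)                        # list of strings
from_gen = np.genfromtxt(line for line in lines)        # generator
from_file = np.genfromtxt(StringIO("\n".join(lines)))   # file-like object

print(from_list)
# [[1. 2. 3.]
#  [4. 5. 6.]]
```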

Recognized file types are text files and archives. Currently, the function recognizes gzip and bz2 (bzip2) archives. The type of the archive is determined from the extension of the file: if the filename ends with '.gz', a gzip archive is expected; if it ends with '.bz2', a bzip2 archive is assumed.

### Splitting the lines into columns
#### The delimiter argument
Once the file is defined and open for reading, genfromtxt splits each non-empty line into a sequence of strings. Empty or commented lines are just skipped. The delimiter keyword is used to define how the splitting should take place.

Quite often, a single character marks the separation between columns. For example, comma-separated files (CSV) use a comma (,) or a semicolon (;) as delimiter:

In [21]:
data = u"1, 2, 3\n4, 5, 6"
d = np.genfromtxt(StringIO(data), delimiter=",")
d

array([[1., 2., 3.],
       [4., 5., 6.]])

Another common separator is "\t", the tabulation character. However, we are not limited to a single character: any string will do. By default, genfromtxt assumes delimiter=None, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are considered as a single white space.
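
Both behaviours can be sketched briefly. The "::" separator is an arbitrary choice used only to show a multi-character delimiter:

```python
import numpy as np
from io import StringIO

# delimiter=None: any run of spaces and tabs separates two columns.
ragged = "1   2\t3\n4 5     6"
by_whitespace = np.genfromtxt(StringIO(ragged))
print(by_whitespace)
# [[1. 2. 3.]
#  [4. 5. 6.]]

# A multi-character string is also a valid delimiter.
by_string = np.genfromtxt(StringIO("1::2::3\n4::5::6"), delimiter="::")
print(np.array_equal(by_whitespace, by_string))  # True
```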

Alternatively, we may be dealing with a fixed-width file, where columns are defined as a given number of characters. In that case, we need to set delimiter to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes):

In [22]:
data = u"  1  2  3\n  4  5 67\n890123  4"
d1 = np.genfromtxt(StringIO(data), delimiter=(3))
print(d1)

[[  1.   2.   3.]
 [  4.   5.  67.]
 [890. 123.   4.]]


In [24]:
data = u"123456789\n   4  7 9\n   4567 9"
d2 = np.genfromtxt(StringIO(data), delimiter=(4,3,2))
d2

array([[1234.,  567.,   89.],
       [   4.,    7.,    9.],
       [   4.,  567.,    9.]])

#### The autostrip argument
By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white spaces. This behavior can be overridden by setting the optional argument autostrip to True:

In [26]:
data = u"1, abc , 2\n 3, xxx, 4"
d3 = np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5",autostrip=False)
d3

array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')

In [27]:
data = u"1, abc , 2\n 3, xxx, 4"
d3 = np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5",autostrip=True)
d3

array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], dtype='<U5')

#### The comments argument
The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored:

In [28]:
data = u"""#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
 """
d5 = np.genfromtxt(StringIO(data), delimiter=",",comments="#")
d5

array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])

### Skipping lines and choosing columns
#### The skip_header and skip_footer arguments
The presence of a header in the file can hinder data processing. In that case, we need to use the skip_header optional argument. The value of this argument must be an integer, which corresponds to the number of lines to skip at the beginning of the file before any other action is performed. Similarly, we can skip the last n lines of the file by using the skip_footer argument and giving it a value of n:

In [29]:
data = u"\n".join(str(i) for i in range(0,10))
d6 = np.genfromtxt(StringIO(data))
d6

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [30]:
d7 = np.genfromtxt(StringIO(data),skip_footer=5, skip_header=3)
d7

array([3., 4.])

By default, skip_header=0 and skip_footer=0, meaning that no lines are skipped.

#### The usecols argument
In some cases, we are not interested in all the columns of the data but only a few of them. We can select which columns to import with the usecols argument. This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import. Remember that by convention, the first column has an index of 0. Negative integers behave the same as regular Python negative indexes.

For example, if we want to import only the first and the last columns, we can use usecols=(0, -1):

In [31]:
data = u"1 2 3\n4 5 6"
d8 = np.genfromtxt(StringIO(data), usecols=(0,-1))
d8

array([[1., 3.],
       [4., 6.]])

If the columns have names, we can also select which columns to import by giving their name to the usecols argument, either as a sequence of strings or a comma-separated string:

In [32]:
d9 = np.genfromtxt(StringIO(data),names="A,B,C", usecols=("A","B"))
d9

array([(1., 2.), (4., 5.)], dtype=[('A', '<f8'), ('B', '<f8')])

In [33]:
d10 = np.genfromtxt(StringIO(data), names="A,B,C", usecols=("A,C"))
d10

array([(1., 3.), (4., 6.)], dtype=[('A', '<f8'), ('C', '<f8')])

### Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:

a single type, such as dtype=float. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the names argument (see below). Note that dtype=float is the default for genfromtxt.

a sequence of types, such as dtype=(int, float, float).

a comma-separated string, such as dtype="i4,f8,|U3".

a dictionary with two keys 'names' and 'formats'.

a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].

an existing numpy.dtype object.

the special value None. In that case, the type of the columns will be determined from the data itself (see below).

In all the cases but the first one, the output will be a 1D array with a structured dtype. This dtype has as many fields as items in the sequence. The field names are defined with the names keyword.

When dtype=None, the type of each column is determined iteratively from its data. We start by checking whether a string can be converted to a boolean (that is, if the string matches true or false in lower cases); then whether it can be converted to an integer, then to a float, then to a complex and eventually to a string. This behavior may be changed by modifying the default mapper of the StringConverter class.

The option dtype=None is provided for convenience. However, it is significantly slower than setting the dtype explicitly.
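
The sketch below contrasts type guessing with an explicit dtype. The sample data is made up, and encoding=None is passed so that the string column is returned as str rather than bytes on older NumPy versions:

```python
import numpy as np
from io import StringIO

data = "1 2.5 abc\n2 3.5 def"

# dtype=None: infer one type per column from the data.
guessed = np.genfromtxt(StringIO(data), dtype=None, encoding=None)
print(guessed.dtype.names)   # ('f0', 'f1', 'f2')
print([guessed.dtype[n].kind for n in guessed.dtype.names])  # ['i', 'f', 'U']

# Explicit dtype as a comma-separated string: same layout, no guessing.
explicit = np.genfromtxt(StringIO(data), dtype="i4,f8,U3", encoding=None)
print(explicit["f2"])        # ['abc' 'def']
```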

### Setting the names
#### The names argument
A natural approach when dealing with tabular data is to allocate a name to each column. A first possibility is to use an explicit structured dtype, as mentioned previously:

In [35]:
a = np.genfromtxt(StringIO(data), dtype=[(_, int) for _ in "ABC"])
a

array([(1, 2, 3), (4, 5, 6)],
      dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

Another simpler possibility is to use the names keyword with a sequence of strings or a comma-separated string:

In [37]:
a1 = np.genfromtxt(StringIO(data), names="A,B,C")
a1

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

In the example above, we used the fact that by default, dtype=float. By giving a sequence of names, we are forcing the output to a structured dtype.

We may sometimes need to define the column names from the data itself. In that case, we must use the names keyword with a value of True. The names will then be read from the first line (after the skip_header ones), even if the line is commented out:

In [38]:
data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
a2 = np.genfromtxt(data, skip_header=1, names=True)
a2

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

The default value of names is None. If we give any other value to the keyword, the new names will overwrite the field names we may have defined with the dtype:

In [39]:
data = StringIO("1 2 3\n 4 5 6")
ndtype = [("a", "int"), ("b", "float"), ("c", "int")]
names = "A,B,C"
a3 = np.genfromtxt(data, names=names, dtype=ndtype)
a3

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('A', '<i8'), ('B', '<f8'), ('C', '<i8')])

#### The defaultfmt argument
If names=None but a structured dtype is expected, names are defined with the standard NumPy default of "f%i", yielding names like f0, f1 and so forth:

In [40]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int))

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])

In the same way, if we don’t give enough names to match the length of the dtype, the missing names will be defined with this default template:

In [41]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int), names="a")

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('a', '<i8'), ('f0', '<f8'), ('f1', '<i8')])

We can override this default with the defaultfmt argument, which takes any format string:

In [42]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('var_00', '<i8'), ('var_01', '<f8'), ('var_02', '<i8')])

### Tweaking the conversion
#### The converters argument
Usually, defining a dtype is sufficient to define how the sequence of strings must be converted. However, some additional control may sometimes be required. For example, we may want to make sure that a date in a format YYYY/MM/DD is converted to a datetime object, or that a string like xx% is properly converted to a float between 0 and 1. In such cases, we should define conversion functions with the converters argument.

The value of this argument is typically a dictionary with column indices or column names as keys and conversion functions as values. These conversion functions can either be actual functions or lambda functions. In any case, they should accept only a string as input and output only a single element of the wanted type.

In the following example, the second column is converted from a string representing a percentage to a float between 0 and 1:

In [2]:
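def convertfunc(x):
    # genfromtxt passes bytes or str, depending on the NumPy version and the encoding argument
    text = x.decode() if isinstance(x, bytes) else x
    return float(text.strip().rstrip("%")) / 100
data = u"1, 2.3%, 45.\n6, 78.9%, 0"
names = ("i", "p", "n")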

In [6]:
import numpy as np
from io import StringIO
np.genfromtxt(StringIO(data), delimiter=",", names=names)

array([(1., nan, 45.), (6., nan,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

We need to keep in mind that by default, dtype=float. A float is therefore expected for the second column. However, the strings ' 2.3%' and ' 78.9%' cannot be converted to float and we end up having np.nan instead. Let’s now use a converter:

In [7]:
np.genfromtxt(StringIO(data), delimiter=",", names=names, converters={1:convertfunc})

array([(1., 0.023, 45.), (6., 0.789,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

The same results can be obtained by using the name of the second column ("p") as key instead of its index (1):

In [8]:
np.genfromtxt(StringIO(data), delimiter=",", names=names, converters={"p": convertfunc})

array([(1., 0.023, 45.), (6., 0.789,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

Converters can also be used to provide a default for missing entries. In the following example, the converter convert transforms a stripped string into the corresponding float or into -999 if the string is empty. We need to explicitly strip the string of white spaces, as this is not done by default:

In [9]:
data = u"1, , 3\n 4, 5, 6"

In [10]:
convert = lambda x: float(x.strip() or -999)

In [11]:
np.genfromtxt(StringIO(data), delimiter=",", names=names, converters={1: convert})

array([(1., -999., 3.), (4.,    5., 6.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

### Using missing and filling values
Some entries may be missing in the dataset we are trying to import. In a previous example, we used a converter to transform an empty string into a float. However, user-defined converters may rapidly become cumbersome to manage.

The genfromtxt function provides two other complementary mechanisms: the missing_values argument is used to recognize missing data and a second argument, filling_values, is used to process these missing data.


#### missing_values
By default, any empty string is marked as missing. We can also consider more complex strings, such as "N/A" or "???" to represent missing or invalid data. The missing_values argument accepts three kinds of values:

a string or a comma-separated string
This string will be used as the marker for missing data for all the columns

a sequence of strings
In that case, each item is associated to a column, in order.

a dictionary
Values of the dictionary are strings or sequence of strings. The corresponding keys can be column indices (integers) or column names (strings). In addition, the special key None can be used to define a default applicable to all columns.


#### filling_values
We know how to recognize missing data, but we still need to provide a value for these missing entries. By default, this value is determined from the expected dtype according to this table:

| Expected type | Default |
|---|---|
| bool | False |
| int | -1 |
| float | np.nan |
| complex | np.nan+0j |
| string | '???' |

We can get finer control over the conversion of missing values with the filling_values optional argument. Like missing_values, this argument accepts different kinds of values:

a single value
This will be the default for all columns

a sequence of values
Each entry will be the default for the corresponding column

a dictionary
Each key can be a column index or a column name, and the corresponding value should be a single object. We can use the special key None to define a default for all columns.

In the following example, we suppose that the missing values are flagged with "N/A" in the first column and by "???" in the third column. We wish to transform these missing values to 0 if they occur in the first and second column, and to -999 if they occur in the last column:

In [13]:
data = u"N/A, 2, 3\n4, ,???"
names = ("p", "q", "r")

In [17]:
np.genfromtxt(StringIO(data), delimiter=",",\
              names=names, missing_values={0:"N/A", 1:" ", 2:"???"},\
              filling_values={0:0, 1:0, 2:-999})

array([(0., 2.,    3.), (4., 0., -999.)],
      dtype=[('p', '<f8'), ('q', '<f8'), ('r', '<f8')])