# numpy.genfromtxt()

In [146]:
import numpy as np
import io


```python
np.genfromtxt(fname, dtype=<class 'float'>, comments='#',
                delimiter=None, skip_header=0, skip_footer=0, converters=None,
                missing_values=None, filling_values=None, usecols=None, names=None,
                excludelist=None, deletechars=" !#$%&'()*+, -./:;<=>?@[\\]^{|}~",
                replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i',
                unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None,
                encoding=None, *, ndmin=0, like=None)
```

__[Go to Documentation](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html)__

to avoid creating lots of sample text files on disk, we will use `io` to fake them

`io.StringIO()` returns an object that behaves like an opened file, but it lives in memory instead of on disk
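
as a quick illustration of the "file in memory" idea, here is a minimal sketch (the names `buffer`, `first_line` and `remaining` are only for this example) showing that a `StringIO` object supports the usual file methods:

```python
import io

# build an in-memory text "file"; nothing is written to disk
buffer = io.StringIO("1 2 3\n4 5 6\n")

first_line = buffer.readline()  # reads up to and including the first newline
remaining = buffer.read()       # reads everything that is left

print(repr(first_line))
print(repr(remaining))
```

like a real file, the object keeps a read position, which is why every example here creates a fresh `StringIO` before calling `genfromtxt`.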

## fname


fname: file, str, pathlib.Path, list of str, generator
File, filename, list, or generator to read. If the filename extension is .gz or .bz2, the file is first decompressed. Note that generators must return bytes or strings. The strings in a list or produced by a generator are treated as lines.
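
since `fname` does not have to be a file, here is a small sketch that passes a list of strings and a generator instead; the helper `line_source` is a made-up name for this example:

```python
import numpy as np

# a list of strings: each string is treated as one line
lines = ["1 2 3", "4 5 6"]
arr_from_list = np.genfromtxt(lines, dtype=int)

# a generator that yields one line (as str) at a time
def line_source():
    for i in range(3):
        yield f"{i} {i * 10}"

arr_from_gen = np.genfromtxt(line_source(), dtype=int)

print(arr_from_list)
print(arr_from_gen)
```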

## dtype

- optional
- Data type of the resulting array. If None, the dtypes will be determined by the contents of each column, individually.
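
to see what "determined by the contents of each column" means, here is a minimal sketch with `dtype=None` on mixed data. passing `encoding="utf-8"` is a choice made for this example so that text columns come back as str; the exact integer width can differ between platforms:

```python
import io
import numpy as np

mixed = io.StringIO("1 2.5 abc\n2 3.5 def")

# dtype=None lets genfromtxt pick a type for every column separately
arr_mixed = np.genfromtxt(mixed, dtype=None, encoding="utf-8")

print(arr_mixed.dtype)  # a structured dtype with one field per column
print(arr_mixed["f2"])  # the text column
```

because the columns end up with different types, the result is a 1-D structured array whose fields get the default names f0, f1 and f2.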

## comments

- default = "#" (optional)
- The character used to indicate the start of a comment. All the characters occurring on a line __after__ a comment are discarded.

In [147]:
text_file = io.StringIO("""
1 2 3 4
# a comment which takes up the whole line
5 6 7 8 # this is a comment but the beginning of the line isn't
10 11 12 13
""")

arr_1 = np.genfromtxt(text_file, dtype=int)
arr_1

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [10, 11, 12, 13]])
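
the comment marker does not have to be `#`. a small sketch, assuming a file that uses `%` for comments:

```python
import io
import numpy as np

percent_comments = io.StringIO("1 2 % trailing note\n% a full-line comment\n3 4")

# everything from '%' to the end of the line is discarded
arr_comments = np.genfromtxt(percent_comments, dtype=int, comments="%")

print(arr_comments)
```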

## delimiter

- str, int, or sequence, optional
- The string used to separate values.
- By default, any consecutive whitespaces act as delimiter.
- An integer or sequence of integers can also be provided as width(s) of each field.
- newlines (`\n`) separate rows

In [148]:
text_file1 = io.StringIO("""
1, 22, 333
4, 55, 666
7, 88, 999
""")

arr_2 = np.genfromtxt(text_file1, dtype=int, delimiter=',')
arr_2

array([[  1,  22, 333],
       [  4,  55, 666],
       [  7,  88, 999]])

### we can also use int to specify a width


this does not raise an error when a field cannot be converted; for example, if a field contains no valid int, it is filled with -1

in the example below the delimiter is set to 2, so every line is cut into 2-character fields. the first field is ``1,``, which cannot be converted to an int because of the comma, so the result looks like this:

In [149]:
text_file2 = io.StringIO("1, 22, 333\n4, 55, 666\n7, 88, 999")
arr_3 = np.genfromtxt(text_file2, dtype=int, delimiter=2)
arr_3

array([[-1,  2, -1,  3, 33],
       [-1,  5, -1,  6, 66],
       [-1,  8, -1,  9, 99]])
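
for contrast, here is a sketch where an integer width fits the data: every field is exactly 3 characters wide and zero-padded (the data is made up for this example), so no field falls back to -1:

```python
import io
import numpy as np

fixed_width = io.StringIO("001022333\n004055666")

# delimiter=3 cuts every line into consecutive 3-character fields
arr_fixed = np.genfromtxt(fixed_width, dtype=int, delimiter=3)

print(arr_fixed)
```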

### and it's possible to provide a sequence like this:

with a sequence, each number is the width of one column, so the output has as many columns as there are numbers. in the example below we give 2 widths, so the output has 2 columns

In [150]:
text_file3 = io.StringIO("123456789")

arr_4 = np.genfromtxt(text_file3, dtype=int, delimiter=(3, 5))
arr_4

array([  123, 45678])

with this method, whitespace inside a field is ignored when converting (but still counts toward the width):

In [151]:
text_file4 = io.StringIO("123 1000000\n345 200  22")

arr_5 = np.genfromtxt(text_file4, dtype=int, delimiter=(3, 4))
# in the first row the first field is the first 3 characters, '123'
# the second field is the next 4 characters, ' 100'; the leading space is ignored when converting, giving 100
# the remaining characters ('0000') are dropped
# in the second row the fields are '345' and ' 200', and the trailing '  22' is dropped
arr_5

array([[123, 100],
       [345, 200]])

## autostrip

- bool
- default = False (optional)
- Whether to automatically strip white spaces from the variables.

In [152]:
text_file5 = io.StringIO("hi, parham \n numpy, python")

arr_6 = np.genfromtxt(text_file5, dtype="S5", delimiter=",", autostrip=False)

arr_6

array([[b'hi', b' parh'],
       [b'numpy', b' pyth']], dtype='|S5')

In [153]:
text_file6 = io.StringIO("hi, parham \n numpy, python")

arr_7 = np.genfromtxt(text_file6, dtype="S5", delimiter=",", autostrip=True)

arr_7

array([[b'hi', b'parha'],
       [b'numpy', b'pytho']], dtype='|S5')

as you can see, when a whitespace is stripped it no longer counts toward the 5 characters,  
so in the second example, unlike the first one (where autostrip was False, the default), the `a` in parham and the `o` in python are also included as the 5th character

## skip_header & skip_footer

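- both are int and optional (default = 0)
- skip_header is the number of lines to skip at the beginning of the file
- skip_footer is the number of lines to skip at the end of the file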


In [154]:
text_file7 = io.StringIO("""lets imagine its a header for the file
which will provide some information about data set or something
or think of the headers in CSV files like:
id, age, sex
0, 20, 0
1, 22, 0
2, 40, 1
3, 44, 1
end of the dataset :D
""")

arr_8 = np.genfromtxt(text_file7, dtype=int,  delimiter=",", skip_header=4, skip_footer=1)

arr_8

array([[ 0, 20,  0],
       [ 1, 22,  0],
       [ 2, 40,  1],
       [ 3, 44,  1]])

## usecols


- sequence, optional
- Which columns to read, with 0 being the first. For example, usecols = (1, 4, 5) will extract the 2nd, 5th and 6th columns.
- negative indices also work
- you can also use names of the columns

In [155]:
text_file8 = io.StringIO("1 2 3 4\n10 20 30 40")

arr_9 = np.genfromtxt(text_file8, dtype=int, usecols=(1, 2, -1)) 

arr_9

array([[ 2,  3,  4],
       [20, 30, 40]])

In [156]:
text_file9 = io.StringIO("1 2 3\n10 20 30")

arr_10 = np.genfromtxt(text_file9, dtype=int, names=("ID", "age", "parham"), usecols=(-1, "ID"))
# returns the last column (-1) first, followed by the column named "ID":
arr_10

array([( 3,  1), (30, 10)], dtype=[('parham', '<i8'), ('ID', '<i8')])

## names

- {None, True, str, sequence}, optional
- If names is True, the field names are read from the first line after the first skip_header lines.
    - This line can optionally be preceded by a comment delimiter.
    - Any content before the comment delimiter is discarded.
- If names is a sequence or a single-string of comma-separated names, the names will be used to define the field names in a structured dtype
- If names is None, the names of the dtype fields will be used, if any.

In [157]:
text_file10 = io.StringIO("""this is header
and this one is also header
age, weight, hight
20, 70, 170
90, 190, 100
30, 50, 160
""")

arr_11 = np.genfromtxt(text_file10, dtype=int,names=True, skip_header=2, delimiter=',')
arr_11

array([(20,  70, 170), (90, 190, 100), (30,  50, 160)],
      dtype=[('age', '<i8'), ('weight', '<i8'), ('hight', '<i8')])

even if the line with the names is written as a comment, the names are still read from it:

In [158]:
text_file11 = io.StringIO("""this is header
and this one is also header
#age, #weight, #hight
20, 70, 170
90, 190, 100
30, 50, 160
""")

arr_12 = np.genfromtxt(text_file11, dtype=int,names=True, skip_header=2, delimiter=',')
arr_12

array([(20,  70, 170), (90, 190, 100), (30,  50, 160)],
      dtype=[('age', '<i8'), ('weight', '<i8'), ('hight', '<i8')])

we can use a comma separated string too:

In [159]:
text_file12 = io.StringIO("""
20, 70, 170
90, 190, 100
30, 50, 160
""")
arr_13 = np.genfromtxt(text_file12, dtype=int,names='age, weight, hight', delimiter=',')
arr_13

array([(20,  70, 170), (90, 190, 100), (30,  50, 160)],
      dtype=[('age', '<i8'), ('weight', '<i8'), ('hight', '<i8')])

or we can use dtype to do that:

In [160]:
text_file13 = io.StringIO("""
20, 70, 170
90, 190, 100
30, 50, 160
""")
arr_14 = np.genfromtxt(text_file13, dtype=[(name, int) for name in ['age', 'weight', 'hight']], delimiter=',')
arr_14

array([(20,  70, 170), (90, 190, 100), (30,  50, 160)],
      dtype=[('age', '<i8'), ('weight', '<i8'), ('hight', '<i8')])
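
once the columns have names, each one can be pulled out by name like a dictionary key. a minimal sketch with made-up data:

```python
import io
import numpy as np

named = io.StringIO("age, weight\n20, 70\n90, 190")
arr_named = np.genfromtxt(named, dtype=int, names=True, delimiter=",")

ages = arr_named["age"]        # 1-D array holding only the 'age' column
weights = arr_named["weight"]  # 1-D array holding only the 'weight' column

print(arr_named.dtype.names)
print(ages, weights)
```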

## defaultfmt
   

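- str
- default = 'f%i' (optional)
    - the format used to name columns that have no explicit name
    - `f` is a literal prefix and `%i` is replaced by an auto-incremented integer starting from 0

in the example below I pass a different dtype for each column so that the column names show up in the output: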

In [161]:
text_file14 = io.StringIO("""
20, 70, 170
90, 190, 100
30, 50, 160
""")
arr_15 = np.genfromtxt(text_file14, delimiter=',', dtype=(int, float, float))
arr_15

array([(20,  70., 170.), (90, 190., 100.), (30,  50., 160.)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<f8')])

we can change the default naming like this: 

In [162]:
text_file15 = io.StringIO("""
20, 70, 170
90, 190, 100
30, 50, 160
""")
arr_16 = np.genfromtxt(text_file15, delimiter=',', dtype=(int, float, float), defaultfmt="parham_%i")
arr_16

array([(20,  70., 170.), (90, 190., 100.), (30,  50., 160.)],
      dtype=[('parham_0', '<i8'), ('parham_1', '<f8'), ('parham_2', '<f8')])

there are 2 points:
1. we can use other format specifiers such as %d or %e too
2. the number is a counter of unnamed columns, not the column index, so manually named columns are not counted

take a look at the example below:

In [163]:
text_file16 = io.StringIO("""
20, 70, 170
90, 190, 100
30, 50, 160
""")
arr_17 = np.genfromtxt(text_file16, delimiter=',', dtype=(int, float, float), defaultfmt="parham_%02d", names="firstcol")
# since the first column is named 'firstcol' manually, the second column is named parham_00, not parham_01
# in other words, manually named columns are not counted
arr_17

array([(20,  70., 170.), (90, 190., 100.), (30,  50., 160.)],
      dtype=[('firstcol', '<i8'), ('parham_00', '<f8'), ('parham_01', '<f8')])

## __converters & bad data__

- The set of functions that convert the data of a column to a value.
- The converters can also be used to provide a default value for missing data: converters = {3: lambda s: float(s or 0)}.
- In the dictionary you can use either the name of the column or its index as the key.

for example imagine we have a dataset like this:
```csv
id, win_rate, class
1,   40%,     2
2,   33%,     1
3,   70%,     1
```
in this example we want the win rate as a fraction (for example 0.4) rather than as a percentage string, so we need a converter for that column:

In [164]:
text_file17 = io.StringIO("""
1, 40%, 2
2, 33%, 1
3, 70%, 1
""")

arr_18 = np.genfromtxt(text_file17, delimiter=",")

arr_18

array([[ 1., nan,  2.],
       [ 2., nan,  1.],
       [ 3., nan,  1.]])

in the previous example numpy failed to convert the values of the second column to float, so it replaced them with `nan`, which means `Not a Number`.
let's use converters to fix it:

In [165]:
text_file18 = io.StringIO("""
1, 40%, 2
2, 33%, 1
3, 70%, 1
""")

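def my_convertor(x):
    # x may arrive as bytes or str depending on the NumPy version and `encoding`
    text = x.decode() if isinstance(x, bytes) else x
    return int(text.strip().rstrip("%")) / 100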

arr_19 = np.genfromtxt(text_file18, delimiter=",", converters={1: my_convertor})
arr_19

array([[1.  , 0.4 , 2.  ],
       [2.  , 0.33, 1.  ],
       [3.  , 0.7 , 1.  ]])

we could use col name too:

In [166]:
text_file19 = io.StringIO("""
1, 40%, 2
2, 33%, 1
3, 70%, 1
""")
def my_convertor(x):
    # x may arrive as bytes or str depending on the NumPy version and `encoding`
    text = x.decode() if isinstance(x, bytes) else x
    return int(text.strip().rstrip("%")) / 100

# arr_20 = np.genfromtxt(text_file19, delimiter=",", names="id, winrate, class", converters={"winrate": my_convertor})
# or build the dictionary with dict() so the key needs no quotes:
arr_20 = np.genfromtxt(text_file19, delimiter=",", names="id, winrate, class", converters=dict(winrate=my_convertor))

arr_20

array([(1., 0.4 , 2.), (2., 0.33, 1.), (3., 0.7 , 1.)],
      dtype=[('id', '<f8'), ('winrate', '<f8'), ('class', '<f8')])

## __missing_values & filling_values__

__missing_values__:
-  The set of strings corresponding to missing data.
- default = None (optional)
  
__filling_values__:
- The set of values to be used as default when the data are missing.
- it can be either a dict that maps each column to its fill value, or a sequence with one fill value per column
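
before the per-column examples, here is the simplest case as a hedged sketch: one marker string and one scalar fill value, which I assume are applied to every column (the data is made up for this example):

```python
import io
import numpy as np

with_gaps = io.StringIO("1,N/A,3\n4,5,N/A")

# "N/A" marks missing data in every column; each missing field becomes 0
arr_filled = np.genfromtxt(with_gaps, delimiter=",", missing_values="N/A", filling_values=0)

print(arr_filled)
```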



### bad data examples:

first, imagine a dataset that has blank fields, like the example below:

In [167]:
text_file20 = io.StringIO("""
1, 40, 2
2,   , 1
3, 70, 1
""")
# in this example the value at [1][1] is missing (blank)


my_convertor = lambda x: int(x.strip() or -9999)
# the converter strips the field and converts it to int; if nothing is left after stripping, it uses -9999 instead
# in other words, it returns either the int value of the field or -9999

arr_21 = np.genfromtxt(text_file20, delimiter=",", converters={1: my_convertor})

arr_21

array([(1.,    40, 2.), (2., -9999, 1.), (3.,    70, 1.)],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8')])

now imagine another example where our missing data is "N/A" like this:

```csv
age, class, score
"N/A", 2,    3
 4,     ,    1
 4,    7,    ???
```

notice we have three different markers for missing data here:
- N/A
- " "
- ???

we can handle it like this:
- declare a dictionary that defines the missing-value string for each column
- declare another dictionary that says what to replace those missing values with
- we can use an index, a negative index or a column name to point to the desired column

In [168]:
text_file21 = io.StringIO("""
N/A, 2, 3
4, , 1
4, 7, ???
""")

missing_values_dict = {
    0: "N/A",
    1: " ",
    2: "???"
}
filling_values_dict = {
    0: -1,
    "B": -999,
    -1: 100
}

arr_22 = np.genfromtxt(text_file21, delimiter=",", names="A, B, C", missing_values=missing_values_dict, filling_values=filling_values_dict)

arr_22

array([(-1.,    2.,   3.), ( 4., -999.,   1.), ( 4.,    7., 100.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

naturally, we could collect the arguments in a dict for cleaner code:

In [169]:
text_file22 = io.StringIO("""
N/A, 2, 3
4, , 1
4, 7, ???
""")

kwargs = dict(
    delimiter=",",
    names="A, B, C",
    missing_values={0: "N/A", 1: " ", 2: "???"},
    filling_values={0: -1, "B": -999, -1: 100},
)
arr_23 = np.genfromtxt(text_file22, **kwargs)

arr_23

array([(-1.,    2.,   3.), ( 4., -999.,   1.), ( 4.,    7., 100.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

### a sequence as filling_values 

in this case we can set a value for each col's missing values:

In [170]:
text_file23 = io.StringIO("""
N/A, 2, 3
4, , 1
4, 7, ???
""")

arr_24 = np.genfromtxt(text_file23, delimiter=',', filling_values=(-100, -200, -300))
# here the dtype is float (the default), so "N/A", the blank field and "???" cannot be converted; they are treated as missing and get filled
# missing values in the first column are replaced with -100, in the second column with -200 and in the last column with -300
arr_24

array([[-100.,    2.,    3.],
       [   4., -200.,    1.],
       [   4.,    7., -300.]])