# Importing data with genfromtxt
* In short, genfromtxt runs two main loops. The first loop converts each line of the file into a sequence of strings. The second loop converts each string to the appropriate data type.
* This mechanism is slower than a single loop but gives more flexibility. In particular, genfromtxt can take missing data into account, which faster and simpler functions like loadtxt cannot.
* The only mandatory argument of genfromtxt is the source of the data. It can be a string (a file name or a URL), a list of strings, a generator, or an open file-like object with a read method, such as a file or an io.StringIO object.

* If the source is the URL of a remote file, the file is automatically downloaded to the current directory and opened.
* Recognized file types are text files and archives. Currently, gzip and bz2 (bzip2) archives are recognized.
* The type of the archive is determined from the extension of the file: if the file name ends with '.gz', a gzip archive is expected; if it ends with 'bz2', a bzip2 archive is assumed.
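
As a minimal sketch of the archive handling described above, the snippet below writes a small gzip-compressed text file into a temporary directory and passes its path to genfromtxt. The file name `sample.txt.gz` and the sample values are made up for illustration.

```python
import gzip
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name; the '.gz' suffix is what selects gzip handling.
    path = os.path.join(tmpdir, "sample.txt.gz")
    with gzip.open(path, "wt") as fh:
        fh.write("1 2 3\n4 5 6\n")

    # Only the source is mandatory; every other argument keeps its default.
    arr_gz = np.genfromtxt(path)

print(arr_gz)
```

The archive is read directly from its path, without decompressing it by hand first.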

In [1]:
import numpy as np
from io import StringIO

# Splitting the lines into columns
## Delimiter argument 
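* The delimiter keyword defines how each line is split into columns.
* Quite often, a single character marks the separation between columns. For example, comma-separated files (CSV) use a comma (,) or a semicolon (;) as the delimiter. Another common separator is "\t", the tabulation character.
* The delimiter is not limited to a single character: any string will do. For fixed-width files, where columns are defined by a number of characters, set delimiter to a single integer (all columns have the same width) or to a sequence of integers (columns can have different widths).
* By default, `delimiter=None`, meaning that each line is split on white space (including tabs) and that consecutive white spaces are treated as a single separator.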
  

In [2]:
data = "1, 2, 3\n4, 5, 6"
np.genfromtxt(StringIO(data), delimiter=",")

array([[1., 2., 3.],
       [4., 5., 6.]])

In [3]:
data = "  1  2  3\n  4  5 67\n890123  4"
np.genfromtxt(StringIO(data), delimiter=3)
data = "123456789\n   4  7 9\n   4567 9"
np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))

array([[1234.,  567.,   89.],
       [   4.,    7.,    9.],
       [   4.,  567.,    9.]])
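
The default `delimiter=None` case is not covered by the cells above. A small sketch, with made-up sample data, of how a mix of spaces and tabs is treated as one separator:

```python
from io import StringIO

import numpy as np

# Columns separated by a single space, a tab, and a run of several spaces.
data = "1 2\t3\n4    5 6"

# No delimiter is given, so the default delimiter=None applies.
arr_ws = np.genfromtxt(StringIO(data))
print(arr_ws)
```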

# The autostrip argument
* By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white space. This behavior can be overridden by setting the optional argument autostrip to True.
* With autostrip=True, leading and trailing white space is removed from every entry.

In [5]:
data = "1, abc , 2\n 3, xxx, 4"
# Without autostrip
print(np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5"))
# With autostrip
print(np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True))

[['1' ' abc ' ' 2']
 ['3' ' xxx' ' 4']]
[['1' 'abc' '2']
 ['3' 'xxx' '4']]


# The comments argument
* The optional argument comments defines a character string that marks the beginning of a comment.
* By default, genfromtxt assumes `comments='#'`.
* The comment marker may occur anywhere on the line.
* Any character present after the comment marker is simply ignored.
* NOTE: There is one notable exception to this behavior: if the optional argument `names=True` is set, the first commented line will be examined for names.
 

In [6]:
data = """#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
"""
np.genfromtxt(StringIO(data), comments="#", delimiter=",")

array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])
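
The exception for `names=True` can be sketched as follows. The sample data is made up; its first line is a comment that holds the column names.

```python
from io import StringIO

import numpy as np

# The first line starts with the comment marker but carries the column names.
data = "# a b c\n1 2 3\n4 5 6"

# With names=True, that commented line is examined for names, not discarded.
arr_named = np.genfromtxt(StringIO(data), names=True)

print(arr_named.dtype.names)
print(arr_named["b"])
```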

# Skipping lines and choosing the columns
## The skip_header and skip_footer arguments
* skip_header tells NumPy to ignore a given number of lines at the top of a file before reading data.
  - If the file has some extra material (like titles or descriptions) at the beginning that you don’t want to read, you can skip those lines.
  - If skip_header=2, NumPy skips the first 2 lines of the file and starts reading from the 3rd line.
* skip_footer tells NumPy to ignore a given number of lines at the bottom of a file.
  - If the file has some extra material (like notes or summaries) at the end that you don’t want to read, you can skip those lines.
  - If skip_footer=1, NumPy skips the last line of the file and stops reading just before it.

* By default, skip_header=0 and skip_footer=0, meaning that no lines are skipped.

In [7]:
data = "\n".join(str(i) for i in range(10))
np.genfromtxt(StringIO(data),)
np.genfromtxt(StringIO(data),
              skip_header=3, skip_footer=5)

array([3., 4.])
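
A slightly more file-like sketch of the same idea: two title lines at the top and one summary line at the bottom are dropped. The contents of `data` are invented for illustration.

```python
from io import StringIO

import numpy as np

# Two header lines, two data rows, and one summary line at the end.
data = "Measurement log\nx y\n1 2\n3 4\nsum 6"

# Drop the two title lines at the top and the summary line at the bottom.
arr_body = np.genfromtxt(StringIO(data), skip_header=2, skip_footer=1)
print(arr_body)
```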

# The usecols argument
* usecols tells NumPy to read only specific columns from a file.
* If the file has many columns but you only need a few, use usecols to pick the ones you want. It accepts a single integer or a sequence of integers giving the indices of the columns to import.
* By convention, the first column has index 0, so usecols=(0, 2) reads only the first and third columns. Negative integers behave like regular Python negative indices, so -1 refers to the last column.
* If the columns have names, we can also select which columns to import by giving their names to the usecols argument, either as a sequence of strings or as a comma-separated string:

In [8]:
data = "1 2 3\n4 5 6"
np.genfromtxt(StringIO(data), usecols=(0, -1))

array([[1., 3.],
       [4., 6.]])

In [9]:
data = "1 2 3\n4 5 6"
np.genfromtxt(StringIO(data),
              names="a, b, c", usecols=("a", "c"))
np.genfromtxt(StringIO(data),
              names="a, b, c", usecols=("a, c"))

array([(1., 3.), (4., 6.)], dtype=[('a', '<f8'), ('c', '<f8')])

# Choosing the data type

* The main way to control how the sequences of strings read from the file are converted to other types is to set the dtype argument. Acceptable values for this argument are:
* a single type, such as dtype=float. Note that dtype=float is the default for genfromtxt.
* a sequence of types, such as dtype=(int, float, float).

* a comma-separated string, such as dtype="i4,f8,|U3".

* a dictionary with two keys 'names' and 'formats'.

* a sequence of tuples (name, type), such as dtype=[('A', int), ('B', float)].

* an existing numpy.dtype object.

* the special value None. In that case, the type of each column is determined from the data itself. This option is convenient, but it is slower than setting the dtype explicitly.
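
Two of the forms listed above, sketched on made-up data: a comma-separated format string and the special value None. The `encoding="utf-8"` argument is an assumption added here so that text columns come back as `str` on both older and newer NumPy versions.

```python
from io import StringIO

import numpy as np

data = "1 2.5 abc\n2 3.5 def"

# Comma-separated string: one format per column (int32, float64, 3-char text).
arr_explicit = np.genfromtxt(StringIO(data), dtype="i4,f8,U3", encoding="utf-8")

# dtype=None: the type of each column is inferred from its contents.
arr_inferred = np.genfromtxt(StringIO(data), dtype=None, encoding="utf-8")

print(arr_explicit.dtype)
print(arr_inferred.dtype)
```

In both cases the result is a one-dimensional structured array whose fields get the default names f0, f1, f2.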

# Setting the names
## The names argument
* names tells NumPy to give names to the columns when reading data from a file.
* If your file has column names (like in a table), you can use names to make the data easier to work with. Instead of referring to columns by number, you refer to them by name.
* If names=True, NumPy uses the first line of the file (after any lines skipped by skip_header) as the column names. You can also give your own names, either as a sequence such as names=['A', 'B', 'C'] or as a comma-separated string such as names="A, B, C".
* Using names makes your data easier to understand. Instead of remembering that column 0 is "Name" and column 1 is "Age", you use the names directly.
* The default value of names is None.
* A natural approach when dealing with tabular data is to allocate a name to each column. A first possibility is to use an explicit structured dtype, as mentioned previously:

In [10]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])

array([(1, 2, 3), (4, 5, 6)],
      dtype=[('a', '<i8'), ('b', '<i8'), ('c', '<i8')])
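
The cell above names the columns through a structured dtype. As a short sketch of the names argument itself, on made-up data, the first call takes the names from the header line and the second supplies them explicitly:

```python
from io import StringIO

import numpy as np

# names=True: the first line of the data supplies the column names.
arr_header = np.genfromtxt(StringIO("a b c\n1 2 3\n4 5 6"), names=True)

# Explicit names for data that has no header line.
arr_custom = np.genfromtxt(StringIO("1 2 3\n4 5 6"), names=["x", "y", "z"])

print(arr_header["c"])          # access a column by name
print(arr_custom.dtype.names)
```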

# The defaultfmt argument
* defaultfmt tells NumPy how to build default column names when names are expected but not supplied.
* If a structured dtype is expected and your file doesn’t provide column names, NumPy creates the names automatically from the defaultfmt format string. This helps you keep track of the columns when working with the data.
* If defaultfmt='col_%d', NumPy names the columns col_0, col_1, col_2, etc.
* If names=None but a structured dtype is expected, names are defined with the standard NumPy default of "f%i", yielding names like f0, f1 and so forth.
* Keep in mind that defaultfmt is used only if some names are expected but not defined.

In [11]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int))

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8')])
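
For comparison with the default f0, f1, f2 names, here is a minimal sketch of the same call with a custom `defaultfmt`. The format string `"col_%d"` is only an illustrative choice.

```python
from io import StringIO

import numpy as np

data = StringIO("1 2 3\n 4 5 6")

# A sequence of types implies a structured dtype, so names are expected.
# None are given, so they are generated from defaultfmt.
arr_fmt = np.genfromtxt(data, dtype=(int, float, int), defaultfmt="col_%d")

print(arr_fmt.dtype.names)
```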

# Tweaking the conversion
## The converters argument
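* converters lets you change or tweak the data in specific columns while reading a file.
* If some columns in your file hold data that needs to be fixed or converted (e.g., turning strings into numbers, replacing missing values, or applying a formula), you can use converters to do that automatically.
* The argument is a dictionary whose keys are column indices or column names and whose values are conversion functions. Each function takes a single string as input and returns a single element of the desired type.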

In [12]:
convertfunc = lambda x: float(x.strip("%"))/100.
data = "1, 2.3%, 45.\n6, 78.9%, 0"
names = ("i", "p", "n")
# General case .....
np.genfromtxt(StringIO(data), delimiter=",", names=names)

array([(1., nan, 45.), (6., nan,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

In [13]:
# Converted case ...
np.genfromtxt(StringIO(data), delimiter=",", names=names,
              converters={1: convertfunc})

array([(1., 0.023, 45.), (6., 0.789,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

In [14]:
data = "1, , 3\n 4, 5, 6"
convert = lambda x: float(x.strip() or -999)
np.genfromtxt(StringIO(data), delimiter=",",
              converters={1: convert})

array([[   1., -999.,    3.],
       [   4.,    5.,    6.]])
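
Converters can also be keyed by column name instead of column index. The sketch below repeats the percentage example that way. The helper `percent_to_fraction` is a name introduced here, and `encoding="utf-8"` is an assumption added so the converter receives `str` on both older and newer NumPy versions.

```python
from io import StringIO

import numpy as np


def percent_to_fraction(text):
    """Turn a string such as ' 2.3%' into the fraction 0.023."""
    return float(text.strip().rstrip("%")) / 100.0


data = "1, 2.3%, 45.\n6, 78.9%, 0"

arr_conv = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    names=("i", "p", "n"),
    converters={"p": percent_to_fraction},  # key is a column name
    encoding="utf-8",
)

print(arr_conv["p"])
```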

# Using missing and filling values
* Some entries may be missing in the dataset we are trying to import. Besides converters, the genfromtxt function provides two complementary mechanisms: the missing_values argument is used to recognize missing data, and a second argument, filling_values, is used to process these missing data.
* Filling missing entries matters for analysis, because missing values can distort later calculations.

# missing_values
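* By default, any empty string is marked as missing. More complex strings, such as "N/A" or "???", can also be declared as missing or invalid data.
* The missing_values argument accepts three kinds of values:
    - a string or a comma-separated string, used as the marker for missing data in all columns
    - a sequence of strings, where each item is associated with a column, in order
    - a dictionary, whose values are strings or sequences of strings. The corresponding keys can be column indices (integers) or column names (strings). In addition, the special key None can be used to define a default applicable to all columns.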

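# filling_values
* We know how to recognize missing data, but we still need a value to put in place of the missing entries. By default, this value is determined from the expected dtype according to the following table:

| Expected type | Default   |
|---------------|-----------|
| bool          | False     |
| int           | -1        |
| float         | np.nan    |
| complex       | np.nan+0j |
| string        | '???'     |

* For finer control, filling_values accepts a single value (the default for all columns), a sequence of values (one per column), or a dictionary whose keys are column indices or column names. As with missing_values, the special key None defines a default for all columns.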

data = "N/A, 2, 3\n4, ,???"
kwargs = dict(delimiter=",",
              dtype=int,
              names="a,b,c",
              missing_values={0:"N/A", 'b':" ", 2:"???"},
              filling_values={0:0, 'b':0, 2:-999})
np.genfromtxt(StringIO(data), **kwargs)
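
A simpler variant of the same mechanism, as a sketch on made-up data: one marker string for missing entries in every column and one fill value for every column.

```python
from io import StringIO

import numpy as np

data = "1,NA,3\n4,5,NA"

arr_filled = np.genfromtxt(
    StringIO(data),
    delimiter=",",
    missing_values="NA",  # one marker, applied to all columns
    filling_values=0,     # one replacement value, applied to all columns
)

print(arr_filled)
```

Because the default dtype is float, the fill value 0 appears as 0.0 in the result.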

# usemask
* We may also want to keep track of where data was missing by constructing a boolean mask, with True entries where data was missing and False otherwise. To do that, set the optional argument usemask to True (the default is False). The output array will then be a MaskedArray.
* Boolean mask: an array of the same shape as the data that marks missing values as True and valid values as False.
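
A minimal sketch of usemask on made-up data with two empty entries:

```python
from io import StringIO

import numpy as np

# The second entry of row 1 and the third entry of row 2 are empty.
data = "1,,3\n4,5,"

arr_masked = np.genfromtxt(StringIO(data), delimiter=",", usemask=True)

print(type(arr_masked))      # a numpy.ma.MaskedArray
print(arr_masked.mask)       # True where data was missing
print(arr_masked.filled(0))  # plain ndarray with masked entries replaced by 0
```

Reductions on a masked array skip the masked entries, so `arr_masked.sum()` adds only the four values that are present.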