## Importing Data with NumPy

In [1]:
import numpy as np

### np.loadtxt() vs np.genfromtxt()

In [2]:
lendingCoDataNumeric=np.loadtxt("LendingCompanyNumericData.csv",delimiter=',')
lendingCoDataNumeric


array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [3]:
lendingCoDataNumeric1=np.genfromtxt("LendingCompanyNumericData.csv",delimiter=',')
lendingCoDataNumeric1

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [4]:
np.array_equal(lendingCoDataNumeric,lendingCoDataNumeric1)

True

So far, loadtxt and genfromtxt appear to produce identical results. The main difference is that loadtxt is the faster of the two, but it breaks when we feed it incomplete or ill-formatted data. genfromtxt, on the other hand, is slightly slower but can handle missing values. Now let's import a dataset with missing values and see what happens.

In [9]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",delimiter=';')
lendingCoDataNumericNAN

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [13]:
lendingCoDataNumericNAN=np.loadtxt("LendingCoNumericDataWithNAN.csv",delimiter=';',dtype=str)
lendingCoDataNumericNAN
# Reading every field as a string lets loadtxt load the file, but this approach is only
# useful for inspecting the values, not for performing mathematical operations on them.

array([['2000', '40', '365', '3121', '4241', '13621'],
       ['2000', '40', '365', '3061', '4171', '15041'],
       ['1000', '40', '365', '2160', '3280', '15340'],
       ...,
       ['', '40', '365', '4201', '5001', '16600'],
       ['1000', '40', '365', '2080', '3320', '15600'],
       ['2000', '40', '365', '4601', '4601', '16600']], dtype='<U5')
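
To illustrate why a string array is suited to inspection only, here is a minimal sketch. The small array `strings` is a made-up stand-in for the real output above, not data from the file. It locates the empty entries and shows that conversion to float fails while they remain.

```python
import numpy as np

# Hypothetical string array, shaped like the output of loadtxt(..., dtype=str).
strings = np.array([["2000", "40"], ["", "40"], ["1000", "40"]])

# Row and column indices of the empty (missing) entries.
missing_positions = np.argwhere(strings == "")

# Arithmetic needs numbers, and the conversion fails while empty strings remain.
try:
    strings.astype(float)
    converted = True
except ValueError:
    converted = False
```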

In practice, we often deal with incomplete data, and we frequently did not create that data ourselves. This is why genfromtxt is a great choice when loading files with NumPy.
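
The contrast between the two functions can be reproduced without the CSV file. The sketch below assumes a small in-memory dataset (`csv_text` is hypothetical, with one empty field) in place of LendingCoNumericDataWithNAN.csv. loadtxt with its default float dtype raises an error on the empty field, while genfromtxt fills it with nan.

```python
import io
import numpy as np

# Hypothetical semicolon-separated data; the second row is missing its first value.
csv_text = "2000;40;365\n;40;365\n1000;40;365\n"

# loadtxt cannot convert the empty field to a float, so it raises ValueError.
try:
    np.loadtxt(io.StringIO(csv_text), delimiter=";")
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# genfromtxt replaces the missing field with nan and loads the rest normally.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=";")

# Count the missing entries so we know how much cleaning lies ahead.
missing_count = int(np.isnan(data).sum())
```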

### Partial Cleaning While Importing

In [14]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",delimiter=';')
lendingCoDataNumericNAN

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

We will use an argument of genfromtxt called skip_header, which skips a given number of lines at the start of the file. skip_header=1 skips the first line; to skip the first two lines, we use skip_header=2.

In [19]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      skip_header=2)
lendingCoDataNumericNAN

# Comparing this output with that of the previous cell, we see that the first two
# rows have been removed.
# We can use skip_header=10 if we want to skip the first ten rows from the top.

array([[ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       [ 2000.,    40.,   365.,  3041.,  4241., 15321.],
       [ 2000.,    50.,   365.,  3470.,  4820., 13720.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])
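
As a quick check of what skip_header does, the following sketch uses a tiny made-up dataset (`csv_text` is an assumption, not the lending file) and confirms that skipping two lines gives the same result as loading everything and slicing off the first two rows.

```python
import io
import numpy as np

# Hypothetical four-row dataset held in memory instead of a file.
csv_text = "1;2;3\n4;5;6\n7;8;9\n10;11;12\n"

# Load all rows, then load again while skipping the first two lines.
full = np.genfromtxt(io.StringIO(csv_text), delimiter=";")
trimmed = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=2)

# Skipping on import matches slicing the fully loaded array.
same_as_slice = np.array_equal(trimmed, full[2:])
```

Skipping at import time avoids loading rows that will be discarded anyway, which is mostly a convenience when the top of a file holds headers or notes rather than data.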

### String vs Object vs Numbers