## Importing Data with NumPy

In [41]:
import numpy as np

### np.loadtxt() vs np.genfromtxt()

In [42]:
lendingCoDataNumeric=np.loadtxt("LendingCompanyNumericData.csv",delimiter=',')
lendingCoDataNumeric


array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [43]:
lendingCoDataNumeric1=np.genfromtxt("LendingCompanyNumericData.csv",delimiter=',')
lendingCoDataNumeric1

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [44]:
np.array_equal(lendingCoDataNumeric,lendingCoDataNumeric1)

True

So far, loadtxt and genfromtxt give identical results. The main difference is that loadtxt is the faster of the two, but it raises an error when we feed it incomplete or ill-formatted data. genfromtxt, on the other hand, is slightly slower but can handle missing values. Now let's import a dataset with missing values in it and see what happens.

In [45]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",delimiter=';')
lendingCoDataNumericNAN

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [46]:
lendingCoDataNumericNAN=np.loadtxt("LendingCoNumericDataWithNAN.csv",delimiter=';',dtype=str)
lendingCoDataNumericNAN
# Reading everything as strings (dtype=str) lets loadtxt load the file despite the missing entry,
# but this is only useful for inspecting the values, not for performing mathematical operations

array([['2000', '40', '365', '3121', '4241', '13621'],
       ['2000', '40', '365', '3061', '4171', '15041'],
       ['1000', '40', '365', '2160', '3280', '15340'],
       ...,
       ['', '40', '365', '4201', '5001', '16600'],
       ['1000', '40', '365', '2080', '3320', '15600'],
       ['2000', '40', '365', '4601', '4601', '16600']], dtype='<U5')
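The difference can also be reproduced without the CSV files. The sketch below uses a small, made-up in-memory sample (the `raw` string and the `loadtxt_failed` flag are illustrative names, not part of the dataset above) to show loadtxt rejecting a missing field while genfromtxt fills it with nan.

```python
import io
import numpy as np

# Hypothetical sample standing in for a CSV file; the second row is missing its first field.
raw = "2000;40;365\n;40;365\n1000;50;365\n"

# loadtxt cannot convert the empty field to a float, so it raises ValueError.
try:
    np.loadtxt(io.StringIO(raw), delimiter=';')
    loadtxt_failed = False
except ValueError:
    loadtxt_failed = True

# genfromtxt replaces the missing field with nan instead of failing.
data = np.genfromtxt(io.StringIO(raw), delimiter=';')

print(loadtxt_failed)
print(data)
```

Wrapping the text in `io.StringIO` makes it behave like an open file, so both functions can read it directly.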

In practice, we often deal with incomplete data, and we do not always create that data ourselves. This is why genfromtxt is a good choice when loading files with NumPy.

### Partial Cleaning While Importing

In [47]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",delimiter=';')
lendingCoDataNumericNAN

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

We will use an argument called skip_header, which skips lines at the start of the file. skip_header=1 skips the first line; to skip the first two lines, we use skip_header=2.

In [48]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      skip_header=2)
lendingCoDataNumericNAN

# Comparing this output with the previous one, we see that the first two rows have been removed.
# To skip the first ten rows instead, we can use skip_header=10.

array([[ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       [ 2000.,    40.,   365.,  3041.,  4241., 15321.],
       [ 2000.,    50.,   365.,  3470.,  4820., 13720.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])
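To make the effect of skip_header easy to check by eye, here is a minimal sketch on a made-up five-line sample in which the first field of each row is its own line number. The names `raw`, `full` and `trimmed` are illustrative only.

```python
import io
import numpy as np

# Hypothetical sample: the first field of each row is the line number.
raw = "1,10\n2,20\n3,30\n4,40\n5,50\n"

# Read every line, then read again while skipping the first two lines.
full = np.genfromtxt(io.StringIO(raw), delimiter=',')
trimmed = np.genfromtxt(io.StringIO(raw), delimiter=',', skip_header=2)

print(full.shape)     # (5, 2)
print(trimmed[:, 0])  # line numbers that remain: 3, 4 and 5
```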

What if we want to remove columns instead of rows? For that, we use the usecols argument.

In [49]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      usecols=(0,1,5))
lendingCoDataNumericNAN
# usecols=(0,1,5) keeps only the first, second and sixth columns

array([[ 2000.,    40., 13621.],
       [ 2000.,    40., 15041.],
       [ 1000.,    40., 15340.],
       ...,
       [   nan,    40., 16600.],
       [ 1000.,    40., 15600.],
       [ 2000.,    40., 16600.]])

Interestingly, usecols lets us not only pick the columns we want but also arrange them in whatever order we want. Let's try it.

In [50]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      usecols=(5,1,0))
lendingCoDataNumericNAN
# the columns now appear in the order we listed them in usecols

array([[13621.,    40.,  2000.],
       [15041.,    40.,  2000.],
       [15340.,    40.,  1000.],
       ...,
       [16600.,    40.,    nan],
       [15600.,    40.,  1000.],
       [16600.,    40.,  2000.]])
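The same selection and reordering can be seen on a tiny, made-up sample where the result is easy to verify by hand. The names `raw` and `subset` are illustrative.

```python
import io
import numpy as np

# Hypothetical sample with three columns.
raw = "1;2;3\n4;5;6\n"

# Keep only the third and the first column, in that order.
subset = np.genfromtxt(io.StringIO(raw), delimiter=';', usecols=(2, 0))

print(subset)
```

The second column is dropped entirely, and the remaining two come back swapped because usecols lists index 2 before index 0.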

In [51]:
lendingCoDataNumericNAN=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      skip_header=2,
                                      skip_footer=2,
                                      usecols=(5,1,0))
lendingCoDataNumericNAN

array([[15340.,    40.,  1000.],
       [15321.,    40.,  2000.],
       [13720.,    50.,  2000.],
       ...,
       [16600.,    40.,  2000.],
       [16600.,    40.,  2000.],
       [16600.,    40.,    nan]])
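The cell above also uses skip_footer, which drops lines from the end of the file in the same way skip_header drops them from the start. A minimal sketch on a made-up six-line sample (the names `raw` and `window` are illustrative) shows the three arguments working together.

```python
import io
import numpy as np

# Hypothetical six-line sample: 1;10;100 up to 6;60;600.
raw = "\n".join(f"{i};{i * 10};{i * 100}" for i in range(1, 7))

# Skip the first line and the last two lines, then keep the third and first columns.
window = np.genfromtxt(io.StringIO(raw),
                       delimiter=';',
                       skip_header=1,
                       skip_footer=2,
                       usecols=(2, 0))

print(window)
```

Only lines 2, 3 and 4 of the sample survive, and each row shows the third column followed by the first.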

Now we can assign each of these columns to its own variable. A first attempt looks like this.

In [52]:
rowOne,rowTwo,rowThree=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      skip_header=2,
                                      skip_footer=2,
                                      usecols=(5,1,0))
lendingCoDataNumericNAN

ValueError: too many values to unpack (expected 3)

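Oh! Too many values to unpack? By default, genfromtxt returns one row per line of the file, so Python tries to spread every row over just three variables. Setting unpack=True transposes the result, so each variable receives one column. (Despite their names, rowOne, rowTwo and rowThree therefore hold columns.)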

In [53]:
rowOne,rowTwo,rowThree=np.genfromtxt("LendingCoNumericDataWithNAN.csv",
                                      delimiter=';',
                                      skip_header=2,
                                      skip_footer=2,
                                      usecols=(5,1,0),
                                      unpack=True)
lendingCoDataNumericNAN
# lendingCoDataNumericNAN still holds the array loaded in In [51], shown here for comparison

array([[15340.,    40.,  1000.],
       [15321.,    40.,  2000.],
       [13720.,    50.,  2000.],
       ...,
       [16600.,    40.,  2000.],
       [16600.,    40.,  2000.],
       [16600.,    40.,    nan]])

In [54]:
rowOne

array([15340., 15321., 13720., ..., 16600., 16600., 16600.])

In [55]:
rowThree

array([1000., 2000., 2000., ..., 2000., 2000.,   nan])
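To see why unpack=True fixes the error, the sketch below compares the default result with the unpacked one on a made-up four-line sample. The names `raw`, `table`, `col_a`, `col_b` and `col_c` are illustrative.

```python
import io
import numpy as np

# Hypothetical sample with four rows and three columns.
raw = "1;2;3\n4;5;6\n7;8;9\n10;11;12\n"

# Default: shape (4, 3). Unpacking this into three names would iterate over
# four rows and raise "too many values to unpack".
table = np.genfromtxt(io.StringIO(raw), delimiter=';')

# unpack=True transposes the result, so each name receives one column.
col_a, col_b, col_c = np.genfromtxt(io.StringIO(raw), delimiter=';', unpack=True)

print(table.shape)                         # (4, 3)
print(col_a)
print(np.array_equal(col_a, table[:, 0]))  # True
```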

### String vs Object vs Numbers

In [60]:
lendingCoLT=np.genfromtxt('lendingCoLT.csv',delimiter=',')
lendingCoLT
# open the file in a text editor first to check whether the delimiter is a semicolon or a comma

array([[      nan,       nan,       nan, ...,       nan,       nan,
              nan],
       [1.000e+00,       nan,       nan, ...,       nan,       nan,
        1.660e+04],
       [2.000e+00,       nan,       nan, ...,       nan,       nan,
        1.660e+04],
       ...,
       [1.041e+03,       nan,       nan, ...,       nan,       nan,
        1.660e+04],
       [1.042e+03,       nan,       nan, ...,       nan,       nan,
        1.560e+04],
       [1.043e+03,       nan,       nan, ...,       nan,       nan,
        1.660e+04]])

To get a neater output, we can print the array instead of just evaluating it. Let's do this.

In [62]:
lendingCoLT=np.genfromtxt('lendingCoLT.csv',delimiter=',')
print(lendingCoLT)
# print() gives a more compact layout than displaying the array directly

[[      nan       nan       nan ...       nan       nan       nan]
 [1.000e+00       nan       nan ...       nan       nan 1.660e+04]
 [2.000e+00       nan       nan ...       nan       nan 1.660e+04]
 ...
 [1.041e+03       nan       nan ...       nan       nan 1.660e+04]
 [1.042e+03       nan       nan ...       nan       nan 1.560e+04]
 [1.043e+03       nan       nan ...       nan       nan 1.660e+04]]
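One plausible reason for all the nan values is that genfromtxt tries to read every field as a float by default, so any text it cannot convert becomes nan. The sketch below assumes a file with a header line and a text column, which is a guess about lendingCoLT.csv and not something shown above. The sample data and the names `raw`, `as_float` and `as_text` are made up for illustration.

```python
import io
import numpy as np

# Hypothetical sample with a header line and a text column.
raw = "id,product,amount\n1,Product B,16600\n2,Product A,15600\n"

# Default dtype is float: the header row and the text column turn into nan.
as_float = np.genfromtxt(io.StringIO(raw), delimiter=',')

# dtype=str keeps every field as text, including the numbers.
as_text = np.genfromtxt(io.StringIO(raw), delimiter=',', dtype=str)

print(as_float)
print(as_text)
```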
