In [1]:
import numpy as np

### Checking for Missing Values

In [2]:
lending_com_data = np.loadtxt("Lending-company-Numeric.csv", 
                              delimiter = ',')
#np.loadtxt raises an error if any field is missing or non-numeric,
#so a successful load already suggests the file has no missing values

In [3]:
np.isnan(lending_com_data).sum()
#np.isnan returns a boolean array: True where a value is nan, False otherwise
#.sum() counts the True entries, since True counts as 1 and False as 0
#the result is 0, so in this case there are no missing values

0

In [4]:
#lending_com_data_NAN = np.loadtxt("Lending-company-Numeric-NAN.csv", 
                              #delimiter = ';')
    #np.loadtxt raises an error here, which tells us this file has missing values

In [5]:
lending_com_data_NAN = np.genfromtxt("Lending-company-Numeric-NAN.csv", 
                              delimiter = ';')
#np.genfromtxt does not raise an error; it reads the missing entries as nan

In [6]:
np.isnan(lending_com_data_NAN).sum()

260
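The difference between the two loaders can be reproduced without the CSV file. The sketch below uses a small in-memory text (`raw_text`, a made-up stand-in for the real file) with one empty field; it is only an illustration of the behaviour described above.

```python
import io
import numpy as np

# Hypothetical two-row "file": the second field of the first row is empty.
raw_text = "1;;3\n4;5;6"

try:
    np.loadtxt(io.StringIO(raw_text), delimiter=';')
    loadtxt_failed = False
except ValueError:
    # loadtxt cannot convert the empty field to a float
    loadtxt_failed = True

# genfromtxt reads the same text and stores the empty field as nan
sample = np.genfromtxt(io.StringIO(raw_text), delimiter=';')
missing_count = np.isnan(sample).sum()

print(loadtxt_failed)
print(sample)
print(missing_count)
```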

In [7]:
#one way of filling the missing values is
lending_com_data_NAN = np.genfromtxt("Lending-company-Numeric-NAN.csv",
                                     delimiter = ';',
                                    filling_values = 0)

In [8]:
np.isnan(lending_com_data_NAN).sum()
#now all the missing values have been replaced with 0
#but 0 can be a meaningful value in the data,
#so we should use a fill value that cannot be confused with real entries

0
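To see why 0 is a risky fill value, here is a small sketch on made-up data (`raw_text` is hypothetical, not the lending file). It compares the mean of a column when the missing entry is ignored with the mean after that entry has been filled with 0.

```python
import io
import numpy as np

# Hypothetical data: the middle row is missing its second field.
raw_text = "10;20;1\n30;;2\n50;60;3"

with_nan = np.genfromtxt(io.StringIO(raw_text), delimiter=';')
with_zero = np.genfromtxt(io.StringIO(raw_text), delimiter=';',
                          filling_values=0)

# Mean of the observed values only: (20 + 60) / 2
mean_ignoring_missing = np.nanmean(with_nan[:, 1])
# Mean after the 0 fill: (20 + 0 + 60) / 3, pulled down by the fake 0
mean_with_zero_fill = np.mean(with_zero[:, 1])

print(mean_ignoring_missing, mean_with_zero_fill)
```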

In [9]:
lending_com_data_NAN = np.genfromtxt("Lending-company-Numeric-NAN.csv",
                                     delimiter = ';')
temporary_fill = np.nanmax(lending_com_data_NAN).round(2) + 1
#here we are creating a temporary fill value
#this value will be greater than the biggest value in the data set by 1,
#so it cannot be mistaken for a real entry
#a good practice is to round off the numbers when working with floats

In [10]:
temporary_fill

64002.0

In [11]:
lending_com_data_NAN = np.genfromtxt('Lending-company-Numeric-NAN.csv',
                                    delimiter = ';',
                                    filling_values = temporary_fill)

In [12]:
np.isnan(lending_com_data_NAN).sum()

0

### Substituting Missing Values

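So far we have placed the biggest value of the array + 1 in place of the missing values. Leaving that placeholder in the data would badly distort any statistic computed from it, such as the column mean. The preferred approach is to substitute each missing value with the mean of its column, which leaves the column mean unchanged, although it does reduce the spread (variance) of that column somewhat.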
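As a rough illustration of what mean substitution does to a column, the sketch below fills the nan entries of a made-up array (`column` is hypothetical) directly with `np.isnan`, then compares the mean and the standard deviation before and after.

```python
import numpy as np

# Hypothetical column with two missing entries
column = np.array([1000.0, 2000.0, np.nan, 3000.0, np.nan])

# Mean of the observed values only
column_mean = np.nanmean(column)

# Replace every nan with that mean
filled = np.where(np.isnan(column), column_mean, column)

# The mean is preserved by the substitution
print(column_mean, filled.mean())

# The standard deviation shrinks, because the filled entries sit exactly on the mean
print(np.nanstd(column), filled.std())
```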

In [13]:
lending_com_data_NAN = np.genfromtxt("Lending-company-Numeric-NAN.csv",
                                    delimiter = ';')
lending_com_data_NAN

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   nan,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [14]:
temporary_mean = np.nanmean(lending_com_data_NAN,
                           axis = 0).round(2)

In [15]:
temporary_mean[0]

2250.25

In [16]:
temporary_fill = np.nanmax(lending_com_data_NAN).round(2) + 1

lending_com_data_NAN = np.genfromtxt('Lending-company-Numeric-NAN.csv',
                                    delimiter = ';',
                                    filling_values = temporary_fill)

In [17]:
temporary_fill

64002.0

In [18]:
np.mean(lending_com_data_NAN[:,0]).round(2)

4263.25

In [19]:
temporary_mean[0]

2250.25

In [20]:
lending_com_data_NAN[:,0] = np.where(lending_com_data_NAN[:,0] == temporary_fill,
                                    temporary_mean[0],
                                    lending_com_data_NAN[:,0])

The above code can be broken down into the following steps:

1.) We check the 1st column of the data set; wherever it contains a value equal to temporary_fill (64002.0 in our case), we replace that value with the mean of the 1st column, which we stored above in temporary_mean[0].

2.) If the value is not equal to 64002.0, it remains the same.
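The two steps can be traced on a tiny array. This is a minimal sketch with made-up numbers; `sentinel` plays the role of temporary_fill and `column_mean` the role of temporary_mean[0].

```python
import numpy as np

sentinel = 64002.0        # same role as temporary_fill
column_mean = 1500.0      # assumed mean of the observed values (2000 and 1000)

first_column = np.array([2000.0, sentinel, 1000.0, sentinel])

# np.where(condition, value_if_true, value_if_false) works element by element
replaced = np.where(first_column == sentinel, column_mean, first_column)

print(replaced)   # [2000. 1500. 1000. 1500.]
```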

In [21]:
np.mean(lending_com_data_NAN[:,0]).round(2)
#this confirms that substituting missing values with the column mean leaves the mean of the column unchanged

2250.25

In [22]:
#now if we want to do this to all the columns we use a for loop...
for i in range(lending_com_data_NAN.shape[1]): #number of columns
    lending_com_data_NAN[:,i] = np.where(lending_com_data_NAN[:,i] == temporary_fill,
                                    temporary_mean[i],
                                    lending_com_data_NAN[:,i])

Now all the missing values have been changed from 64002.0 to the mean of their respective columns.
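The loop is not strictly necessary. As a sketch on made-up data (`data`, `sentinel` and `column_means` are hypothetical), a single `np.where` call gives the same result, because a 1-D array of column means broadcasts across the rows of a 2-D array.

```python
import numpy as np

sentinel = 99.0
data = np.array([[1.0, 10.0],
                 [sentinel, 20.0],
                 [3.0, sentinel]])
column_means = np.array([2.0, 15.0])   # assumed means of the observed values

# Column-by-column version, as in the loop above
looped = data.copy()
for i in range(looped.shape[1]):
    looped[:, i] = np.where(looped[:, i] == sentinel,
                            column_means[i],
                            looped[:, i])

# Vectorized version: column_means (shape (2,)) is broadcast along the rows
vectorized = np.where(data == sentinel, column_means, data)

print(np.array_equal(looped, vectorized))   # True
```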

In [23]:
for i in range(lending_com_data_NAN.shape[1]): #number of columns
    lending_com_data_NAN[:,i] = np.where(lending_com_data_NAN[:,i] < 0,
                                    0,
                                    lending_com_data_NAN[:,i])

This changes all the negative values to zero.
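For this particular replacement a loop can also be avoided. The sketch below, on a made-up array (`values` is hypothetical), shows that `np.where` applied to the whole array and `np.maximum` give the same result.

```python
import numpy as np

values = np.array([[-5.0, 2.0],
                   [3.0, -0.5]])

# Whole-array np.where: negative entries become 0, the rest are kept
with_where = np.where(values < 0, 0, values)

# Element-wise maximum against 0 does the same thing
with_maximum = np.maximum(values, 0)

print(with_where)
print(np.array_equal(with_where, with_maximum))   # True
```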

### Reshaping Ndarrays

In [24]:
lending_co_data_numeric = np.loadtxt("Lending-company-Numeric.csv",
                                    delimiter = ',')

In [25]:
lending_co_data_numeric

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

<b>RESHAPING:</b>

Reshaping is the act of changing the shape of an object in a certain way. It is usually needed when a certain condition has to be met, since it's not always possible to store the output of a function as part of an existing array (or Series).

There are restrictions on the shapes we can give to an array, because it holds a fixed amount of data: the product of the new dimensions must equal the number of elements.
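The size restriction is easy to check on a small array. In this sketch (`small` is a made-up 12-element array), a shape works only when its dimensions multiply to 12; passing `-1` for one dimension asks NumPy to infer it.

```python
import numpy as np

small = np.arange(12)             # 12 elements

ok = small.reshape(3, 4)          # 3 * 4 == 12, allowed
inferred = small.reshape(2, -1)   # -1 is inferred as 6

try:
    small.reshape(5, 2)           # 5 * 2 == 10, does not match 12
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(ok.shape, inferred.shape, reshape_failed)
```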

In [26]:
lending_co_data_numeric.shape

(1043, 6)

In [27]:
np.reshape(lending_co_data_numeric, (6, 1043))
#reshape reads the elements in row-major order, as if the array were flattened,
#then it places the first 1043 entries in the 1st row,
#the next 1043 in the 2nd row, and so on...
#so it does not turn rows into columns; for that purpose transpose is used

array([[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
       [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
       [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.],
       [ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
       [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
       [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]])

In [28]:
np.transpose(lending_co_data_numeric)

array([[ 2000.,  2000.,  1000., ...,  2000.,  1000.,  2000.],
       [   40.,    40.,    40., ...,    40.,    40.,    40.],
       [  365.,   365.,   365., ...,   365.,   365.,   365.],
       [ 3121.,  3061.,  2160., ...,  4201.,  2080.,  4601.],
       [ 4241.,  4171.,  3280., ...,  5001.,  3320.,  4601.],
       [13621., 15041., 15340., ..., 16600., 15600., 16600.]])
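The difference between the two operations is easier to see on a small array. A minimal sketch with a made-up 2 x 3 array (`grid`):

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# reshape keeps the reading order 1, 2, 3, 4, 5, 6 and only regroups it
reshaped = np.reshape(grid, (3, 2))
print(reshaped)      # [[1 2] [3 4] [5 6]]

# transpose turns rows into columns
transposed = np.transpose(grid)
print(transposed)    # [[1 4] [2 5] [3 6]]
```

Both results have shape (3, 2), but only the transposed one keeps each original column together as a row.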

In [30]:
#np.reshape(lending_co_data_numeric,(3,500))
#this code raises an error
#because 3*500 = 1500 does not match the number of elements in the array,
#which is m*n = 1043*6 = 6258

In [31]:
np.reshape(lending_co_data_numeric,(3,2086))

array([[ 2000.,    40.,   365., ...,    50.,   365.,  5350.],
       [ 6850., 15150.,  1000., ..., 16600.,  2000.,    40.],
       [  365.,  3441.,  4661., ...,  4601.,  4601., 16600.]])

In [32]:
np.reshape(lending_co_data_numeric,(2,3,1043))
#you can also change the number of dimensions of the array

array([[[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
        [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
        [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.]],

       [[ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
        [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
        [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]]])

In [33]:
#another example to change the dimension is
np.reshape(lending_co_data_numeric,(1,1,2,3,1043))

array([[[[[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
          [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
          [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.]],

         [[ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
          [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
          [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]]]]])

###### Adding a dimension is useful when a method or function only takes inputs with a higher number of dimensions than the array we want to plug in
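Besides `reshape`, NumPy offers a couple of shorthand ways to add a dimension of length 1. The sketch below uses a made-up 1-D array (`row`) and is only meant to show the resulting shapes.

```python
import numpy as np

row = np.array([2000.0, 40.0, 365.0])       # shape (3,)

as_row_matrix = row[np.newaxis, :]          # shape (1, 3)
as_column = np.expand_dims(row, axis=1)     # shape (3, 1)
via_reshape = row.reshape(1, -1)            # shape (1, 3), same as as_row_matrix

print(as_row_matrix.shape, as_column.shape, via_reshape.shape)
```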

In [34]:
lending_co_data_numeric
#so reshaping doesn't change the original array
#it returns a new arrangement of the same data
#if you want to use the reshaped data then you need to store it separately

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])
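One caveat when storing the result: for an ordinary contiguous array, `reshape` usually returns a view that shares memory with the original, so writing into the reshaped array would also change the original. The sketch below (with a made-up array `original`) shows the check and uses `.copy()` to get independent data.

```python
import numpy as np

original = np.arange(6.0).reshape(2, 3)

reshaped = original.reshape(3, 2)
print(original.shape)                          # (2, 3): the original keeps its shape
print(np.shares_memory(original, reshaped))    # True: same underlying data

# An explicit copy can be modified without touching the original
independent = original.reshape(3, 2).copy()
independent[0, 0] = -1.0
print(original[0, 0])                          # 0.0
```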

In [35]:
#another way to reshape data is to use .reshape()
lending_co_data_numeric.reshape(6,1043)

array([[ 2000.,    40.,   365., ...,   365.,  1581.,  3041.],
       [12277.,  2000.,    40., ...,    50.,   365.,  5350.],
       [ 6850., 15150.,  1000., ...,  2000.,    40.,   365.],
       [ 3101.,  4351., 16600., ..., 16600.,  2000.,    40.],
       [  365.,  3441.,  4661., ...,  8450., 22250.,  2000.],
       [   40.,   365.,  3701., ...,  4601.,  4601., 16600.]])

### Removing Values

In [36]:
lending_co_data_numeric = np.loadtxt("Lending-company-Numeric.csv",
                                    delimiter = ',')

In [37]:
lending_co_data_numeric

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [38]:
#without an axis argument, np.delete flattens the array first
#and then removes the element at index 0 (the first value)
np.delete(lending_co_data_numeric, 0).shape

(6257,)
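The role of the `axis` argument in `np.delete` can be checked on a small array. A minimal sketch with a made-up 2 x 3 array (`table`):

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# No axis: the array is flattened, then the element at index 0 is removed
no_axis = np.delete(table, 0)
print(no_axis)       # [2 3 4 5 6]

# axis=0 removes the first row, axis=1 removes the first column
drop_row = np.delete(table, 0, axis=0)
drop_col = np.delete(table, 0, axis=1)
print(drop_row)      # [[4 5 6]]
print(drop_col)      # [[2 3] [5 6]]

# np.delete returns a new array, so table itself is unchanged
print(table.shape)   # (2, 3)
```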

In [39]:
lending_co_data_numeric.size
#np.delete returns a new array; since we have not stored it, the original is unchanged

6258

In [40]:
lending_co_data_numeric

array([[ 2000.,    40.,   365.,  3121.,  4241., 13621.],
       [ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [41]:
#when you want to delete a complete row or column
np.delete(lending_co_data_numeric, 0, axis = 0)
#axis = 0 deletes rows
#axis = 1 deletes columns

array([[ 2000.,    40.,   365.,  3061.,  4171., 15041.],
       [ 1000.,    40.,   365.,  2160.,  3280., 15340.],
       [ 2000.,    40.,   365.,  3041.,  4241., 15321.],
       ...,
       [ 2000.,    40.,   365.,  4201.,  5001., 16600.],
       [ 1000.,    40.,   365.,  2080.,  3320., 15600.],
       [ 2000.,    40.,   365.,  4601.,  4601., 16600.]])

In [42]:
np.delete(lending_co_data_numeric, 0, axis = 1)

array([[   40.,   365.,  3121.,  4241., 13621.],
       [   40.,   365.,  3061.,  4171., 15041.],
       [   40.,   365.,  2160.,  3280., 15340.],
       ...,
       [   40.,   365.,  4201.,  5001., 16600.],
       [   40.,   365.,  2080.,  3320., 15600.],
       [   40.,   365.,  4601.,  4601., 16600.]])

In [43]:
np.delete(lending_co_data_numeric, 1, axis = 1)
#change the column index and the corresponding column is deleted

array([[ 2000.,   365.,  3121.,  4241., 13621.],
       [ 2000.,   365.,  3061.,  4171., 15041.],
       [ 1000.,   365.,  2160.,  3280., 15340.],
       ...,
       [ 2000.,   365.,  4201.,  5001., 16600.],
       [ 1000.,   365.,  2080.,  3320., 15600.],
       [ 2000.,   365.,  4601.,  4601., 16600.]])

In [44]:
np.delete(lending_co_data_numeric, (0,2,4), axis = 1)

array([[   40.,  3121., 13621.],
       [   40.,  3061., 15041.],
       [   40.,  2160., 15340.],
       ...,
       [   40.,  4201., 16600.],
       [   40.,  2080., 15600.],
       [   40.,  4601., 16600.]])

In [45]:
#to delete both rows and columns, nest two np.delete calls
np.delete(np.delete(lending_co_data_numeric, 
                    [0,2,4],
                    axis = 1),
         [0,2,-1],
         axis = 0)

array([[   40.,  3061., 15041.],
       [   40.,  3041., 15321.],
       [   50.,  3470., 13720.],
       ...,
       [   40.,  4240., 16600.],
       [   40.,  4201., 16600.],
       [   40.,  2080., 15600.]])
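The nested call can be followed on a small array. In this sketch (`table` is a made-up 4 x 5 array), the order of the two deletions does not matter, and a negative index such as -1 refers to the last row.

```python
import numpy as np

table = np.arange(20).reshape(4, 5)

# Remove columns 0, 2, 4 first, then the first and last rows
cols_then_rows = np.delete(np.delete(table, [0, 2, 4], axis=1),
                           [0, -1],
                           axis=0)

# Same deletions in the opposite order
rows_then_cols = np.delete(np.delete(table, [0, -1], axis=0),
                           [0, 2, 4],
                           axis=1)

print(cols_then_rows)                                   # [[ 6  8] [11 13]]
print(np.array_equal(cols_then_rows, rows_then_cols))   # True
```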