This chapter covers the numpy.ndarray data structure, the one-dimensional (1D) pandas.Series, the two-dimensional (2D) tabular pandas.DataFrame, and the three-dimensional (3D) pandas.Panel (deprecated in pandas 0.20 and removed in pandas 0.25).

In [3]:
import array as arr
arr_x = arr.array("d", [98.6, 22.35, 72.1])
arr_x

array('d', [98.6, 22.35, 72.1])

The array module cannot create a two-dimensional structure with rows and columns directly; the closest equivalent is a nested list of such arrays. Operations normally associated with matrices or arrays, such as matrix multiplication, determinants, and eigenvalues, are not defined in this module.
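As a minimal sketch of the nested-list workaround (the names `grid` and `matmul_supported` are illustrative, and the second row's values are made up for the example), a list of `array` objects can be indexed by row and column, but matrix operators are not available on it:

```python
import array as arr

# A 2 x 3 "table" emulated as a list of typed arrays (one array per row).
# The second row is made-up illustrative data.
grid = [
    arr.array("d", [98.6, 22.35, 72.1]),
    arr.array("d", [97.9, 21.8, 70.4]),
]

# Row/column access works through chained indexing
print(grid[1][2])

# Matrix operators are not defined for a list of arrays
try:
    grid @ grid
    matmul_supported = True
except TypeError:
    matmul_supported = False
print(matmul_supported)
```

This is the gap that NumPy fills: the container above stores the numbers but offers no array arithmetic.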


NumPy is the preferred package to create and work on array-type objects. NumPy allows multidimensional arrays to be created. Multidimensional arrays provide a systematic and efficient framework for storing data. 

All NumPy array objects are of the type numpy.ndarray.

Properties of an ndarray such as its data type, shape, number of dimensions, and size are exposed as attributes of the array. The following code explores some of these attributes for the array ndarray_1:

In [10]:
import numpy as np
ndarray_1 = np.array([[[100, 65, 160],
[150, 82, 200],
[ 90, 55, 80],
[130, 73, 220],
[190, 80, 150]],
[[ 95, 68, 140],
[145, 80, 222],
[ 90, 62, 100],
[150, 92, 200],
[140, 60, 90]],
[[110, 72, 160],
[160, 95, 185],
[100, 80, 110],
[140, 92, 120],
[100, 55, 100]]])

In [11]:
ndarray_1

array([[[100,  65, 160],
        [150,  82, 200],
        [ 90,  55,  80],
        [130,  73, 220],
        [190,  80, 150]],

       [[ 95,  68, 140],
        [145,  80, 222],
        [ 90,  62, 100],
        [150,  92, 200],
        [140,  60,  90]],

       [[110,  72, 160],
        [160,  95, 185],
        [100,  80, 110],
        [140,  92, 120],
        [100,  55, 100]]])

In [12]:
ndarray_1.dtype

dtype('int32')

In [13]:
ndarray_1.shape

(3, 5, 3)

In [14]:
ndarray_1.ndim

3

In [15]:
# Size of the array (number of elements in the array)
ndarray_1.size

45

In [16]:
ndarray_1.itemsize

4

In [17]:
ndarray_1.nbytes

180

In [18]:
ndarray_1.strides
# The shape of the array is given by the tuple (3, 5, 3). The values in the tuple represent the number
# of years for which there is data, the number of patients, and the number of clinical parameters, respectively. 
# For each year or first dimension, there are 15 records, and hence to move from one year to another in the array, 
# 60 bytes should be jumped across. On a similar note, each distinct patient has 3 records for a given year, 
# and 12 bytes of memory should be moved past to get to the next patient.

(60, 12, 4)
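The stride arithmetic described in the comments above can be reproduced by hand. The sketch below assumes a C-contiguous array of 4-byte integers with the same shape as ndarray_1; the names `demo` and `expected_strides` are illustrative:

```python
import numpy as np

# Same shape and item size as ndarray_1: 3 years x 5 patients x 3 parameters
demo = np.zeros((3, 5, 3), dtype=np.int32)

item = demo.itemsize                 # 4 bytes per int32 value
expected_strides = (
    5 * 3 * item,                    # next year: skip 15 values
    3 * item,                        # next patient: skip 3 values
    item,                            # next parameter: skip 1 value
)
print(expected_strides, demo.strides)
```

Each stride is the product of the trailing dimension lengths and the item size, which is why the values shrink from left to right.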

In [19]:
array1d = np.array([1, 2, 3, 4])
array1d

array([1, 2, 3, 4])

In [21]:
array2d = np.array([[0, 1, 2],[2, 3, 4]])
array2d

array([[0, 1, 2],
       [2, 3, 4]])

In [22]:
# Creating an array of ones with shape (2, 3, 4)
np.ones((2, 3, 4))

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

In [23]:
# Creating an array of zeros with shape (2, 1, 3)
np.zeros((2, 1, 3))

array([[[0., 0., 0.]],

       [[0., 0., 0.]]])

In [24]:
np.identity(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [25]:
# Creating an identity matrix of order 3 with the eye function
np.eye(N = 3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [26]:
# Creating a rectangular equivalent of identity matrix with 2 rows and 3 columns
np.eye(N = 2, M = 3)

array([[1., 0., 0.],
       [0., 1., 0.]])

In [27]:
# Offsetting the diagonal of ones by one position in the upper triangle
np.eye(N = 4, M = 3, k = 1)

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [29]:
# Offsetting the diagonal of ones by two positions in the lower triangle

#By default, k holds the value 0 in the eye function.

np.eye(N = 4, M = 3, k = -2)

array([[0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

The arange function of NumPy functionally resembles Python's range function. Given a start value, a stop value, and a step by which to increment or decrement successive values, arange generates a sequence of numbers. As with range, the start and step arguments are optional. Unlike range, which returns a lazy range object in Python 3 (a list in Python 2), arange returns a NumPy array:

In [32]:
# Creating an array with continuous values from 0 to 5
np.arange(6)

array([0, 1, 2, 3, 4, 5])

In [33]:
# Creating an array with numbers from 2 to 12 spaced out at intervals of 3
np.arange(2, 13, 3)

array([ 2,  5,  8, 11])

In [34]:
# Creating a linearly spaced array of 20 samples between 5 and 10
np.linspace(start = 5, stop = 10, num = 20)

array([ 5.        ,  5.26315789,  5.52631579,  5.78947368,  6.05263158,
        6.31578947,  6.57894737,  6.84210526,  7.10526316,  7.36842105,
        7.63157895,  7.89473684,  8.15789474,  8.42105263,  8.68421053,
        8.94736842,  9.21052632,  9.47368421,  9.73684211, 10.        ])

The arange function and linspace function do not allow for any shape specification by themselves and produce 1D arrays with the given sequence of numbers. We can very well use some shape manipulation methods to mold these arrays to the desired shape. 
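As a small sketch of such shape manipulation (the variable names are illustrative), the reshape method turns the 1D output of arange or linspace into the desired shape, provided the total number of elements matches:

```python
import numpy as np

# 12 consecutive integers arranged as 3 rows x 4 columns
grid = np.arange(12).reshape(3, 4)

# 6 evenly spaced samples between 0 and 1 arranged as 2 rows x 3 columns
samples = np.linspace(0, 1, 6).reshape(2, 3)

# -1 asks NumPy to infer that dimension from the element count
auto = np.arange(12).reshape(2, -1)

print(grid.shape, samples.shape, auto.shape)
```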

In [35]:
# Creating a random array with 2 rows and 4 columns, from a uniform distribution
np.random.rand(2, 4)


array([[0.1415181 , 0.76985359, 0.32215195, 0.8684011 ],
       [0.58656301, 0.35148033, 0.26354222, 0.80429235]])

In [41]:
# Creating a 2X4 array from a standard normal distribution
print(np.random.randn(2, 4))
print('~'*50)
# Creating a 2X4 array from a normal distribution with mean 10 and standard deviation 5
print(5 * np.random.randn(2, 4) + 10)


[[ 2.14241647 -1.36694629  0.40681982  0.95121384]
 [ 1.65737796 -0.29611119 -1.35528032  0.20802836]]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[4.32511855 9.01232898 0.90438197 2.28937742]
 [6.43190826 8.54937434 5.76745028 9.49727057]]


In [42]:
# Creating an array of shape (2, 3) with random integers chosen from the interval [2, 5)
np.random.randint(2, 5, (2, 3))

array([[3, 2, 3],
       [4, 4, 3]])

In [43]:
# Creating an uninitialized empty array of 4X3 dimensions
np.empty([4,3])

array([[0., 0., 0.],
       [0., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])

In [47]:
arr_a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr_a)
print('-'*50)
# Getting the diagonal of the array
print(np.diag(arr_a))
print('-'*50)
# Constructing the diagonal matrix from a 1D array
# diag returns a 1D array of diagonals for a 2D input matrix. This 1D array of diagonals can be used here.
print(np.diag(np.diag(arr_a)))
print('-'*50)
# Creating the diagonal matrix with diagonals other than main diagonal
print(np.diag(np.diag(arr_a, k = 1)))


[[1 2 3]
 [4 5 6]
 [7 8 9]]
--------------------------------------------------
[1 5 9]
--------------------------------------------------
[[1 0 0]
 [0 5 0]
 [0 0 9]]
--------------------------------------------------
[[2 0]
 [0 6]]


In [48]:
# Repeating a 1D array 2 times
print(np.tile(np.array([1, 2, 3]), 2))
print('-'*50)
# Repeating a 2D array 4 times
print(np.tile(np.array([[1, 2, 3], [4, 5, 6]]), 4))
print('-'*50)
# Repeating a 2D array 4 times along axis 0 and 1 time along axis 1
print(np.tile(np.array([[1, 2, 3], [4, 5, 6]]), (4,1)))

[1 2 3 1 2 3]
--------------------------------------------------
[[1 2 3 1 2 3 1 2 3 1 2 3]
 [4 5 6 4 5 6 4 5 6 4 5 6]]
--------------------------------------------------
[[1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]]


In [None]:
np.array([-2, -1, 0, 1, 2]).dtype

In [None]:
np.array([-2, -1, 0, 1, 2], dtype = "float")

In [None]:
np.array([-2, -1, 0, 1, 2], dtype = "str")

In [None]:
np.array(["a", "bb", "ccc", "dddd", "eeeee"])

In [52]:
print(np.array([True, False, True, True]).dtype)
print('-'*50)
print(np.array([0, 1, 1, 0, 0], dtype = "bool"))
print('-'*50)
print(np.array([0, 1, 2, 3, -4], dtype = "bool"))
print('-'*50)
# Complex array
print(np.array([[1 + 1j, 2 + 2j], [3 + 3j, 4 + 4j]]))
print('-'*50)
print(np.array([[1 + 1j, 2 + 2j], [3 + 3j, 4 + 4j]]).dtype)


bool
--------------------------------------------------
[False  True  True False False]
--------------------------------------------------
[False  True  True  True  True]
--------------------------------------------------
[[1.+1.j 2.+2.j]
 [3.+3.j 4.+4.j]]
--------------------------------------------------
complex128


In [55]:
# Int to float conversion
int_array = np.array([0, 1, 2, 3])
print(int_array)
print('-'*50)
print(int_array.astype("float"))
print('-'*50)
# Float to int conversion
float_array = np.array([1.56, 2.95, 3.12, 4.65])
print(float_array)
print('-'*50)
print(float_array.astype("int"))


[0 1 2 3]
--------------------------------------------------
[0. 1. 2. 3.]
--------------------------------------------------
[1.56 2.95 3.12 4.65]
--------------------------------------------------
[1 2 3 4]


In [61]:
# print entire array, element 0, element 1, last element.
ar = np.arange(5)
print(ar)
print(ar[0], ar[1], ar[-1])
print('-'*50)
# 2nd, last and 1st elements
print(ar[1], ar[-1], ar[0])
print('-'*50)
# Arrays can be reversed using the ::-1 idiom as follows:
print(ar)
print('-'*50)
print(ar[::-1])

[0 1 2 3 4]
0 1 4
--------------------------------------------------
1 4 0
--------------------------------------------------
[0 1 2 3 4]
--------------------------------------------------
[4 3 2 1 0]


In [63]:
ar = np.array([[2,3,4],[9,8,7],[11,12,13]]); 
ar[1,1]

8

In [64]:
ar[1,1]=5; 
ar

array([[ 2,  3,  4],
       [ 9,  5,  7],
       [11, 12, 13]])

In [65]:
ar[2]

array([11, 12, 13])

In [66]:
ar[2,:]

array([11, 12, 13])

In [67]:
# Retrieve column 1
ar[:,1]

array([ 3,  5, 12])

In [69]:
ar=2*np.arange(6); ar

array([ 0,  2,  4,  6,  8, 10])

In [72]:
ar[1:5:2]

array([2, 6])

In [73]:
ar[1:6:2]

array([ 2,  6, 10])

In [74]:
# Obtain the first n elements using ar[:n]:
ar[:4]

array([0, 2, 4, 6])

In [75]:
ar[4:]

array([ 8, 10])

In [78]:
print(ar)
#Slice array with stepValue=3:
print(ar[::3])

[ 0  2  4  6  8 10]
[0 6]


## Array masking

In [85]:
np.random.seed(15)
# random_integers is deprecated; randint excludes the upper bound, so 26 keeps the closed range [0, 25]
ar=np.random.randint(0,26,10); 
print(ar)
print('-'*50)
evenMask=(ar % 2==0); 
print(evenMask)
print('-'*50)
evenNums=ar[evenMask]; 
print(evenNums)

[ 8 21 12  5 23  0  7 11 21 22]
--------------------------------------------------
[ True False  True False False  True False False False  True]
--------------------------------------------------
[ 8 12  0 22]




In [87]:
ar = np.array(['Hungary', 'Nigeria',
               'Guatemala', '', 'Poland',
               '', 'Japan'])
print(ar)
print('-'*50)

# Eliminate missing values by replacing them with a default value.
# Here, the missing value '' (an empty string) is replaced by 'USA' as the default country:

ar[ar=='']='USA'; 
print(ar)

['Hungary' 'Nigeria' 'Guatemala' '' 'Poland' '' 'Japan']
--------------------------------------------------
['Hungary' 'Nigeria' 'Guatemala' 'USA' 'Poland' 'USA' 'Japan']


In [88]:
ar=11*np.arange(0,10)
print(ar)
print('-'*50)
# Select the elements at positions 1, 3, 4, 2 and 7 with a list of indices
print(ar[[1,3,4,2,7]])

[ 0 11 22 33 44 55 66 77 88 99]
--------------------------------------------------
[11 33 44 22 77]


In the preceding code, the selection object is a list, and elements at indices 1, 3, 4, 2, and 7 are selected. 
Now, assume that we change it to the following:

In [89]:
ar[1,3,4,2,7]

# We get an IndexError since the array is 1D and we're specifying too many indices to access it:

IndexError: too many indices for array: array is 1-dimensional, but 5 were indexed
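To illustrate the difference on a 2D array (a small sketch; the name `table` is illustrative), comma-separated indices address one position per axis, while a list of indices selects several positions along the first axis:

```python
import numpy as np

table = np.arange(12).reshape(3, 4)

# Comma-separated indices: one index per axis, returns a single element
single = table[1, 2]

# A list of indices: selects rows 1 and 2, returns a 2 x 4 array
rows = table[[1, 2]]

print(single)
print(rows.shape)
```

The same comma-separated form fails on a 1D array because there is only one axis to index.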

### Complex indexing
Here, we illustrate the use of complex indexing to assign values from a smaller array into a larger one:

In [91]:
ar=np.arange(15); 
print(ar)
print('-'*50)
ar2=np.arange(0,-10,-1)[::-1]; 
print(ar2)
print('-'*50)

#Slice out the first 10 elements of ar, and replace them with elements from ar2, as follows:

ar[:10]=ar2; 
print(ar)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
--------------------------------------------------
[-9 -8 -7 -6 -5 -4 -3 -2 -1  0]
--------------------------------------------------
[-9 -8 -7 -6 -5 -4 -3 -2 -1  0 10 11 12 13 14]


## Copies and views

In [92]:
ar1=np.arange(12); 
print(ar1)
print('-'*50)
ar2=ar1[::2]; 
print(ar2)
print('-'*50) 
ar2[1]=-1; 
print(ar2)

[ 0  1  2  3  4  5  6  7  8  9 10 11]
--------------------------------------------------
[ 0  2  4  6  8 10]
--------------------------------------------------
[ 0 -1  4  6  8 10]


In [93]:
ar=np.arange(8);
print(ar)
print('-'*50)
arc=ar[:3].copy(); 
print(arc)
print('-'*50) 
arc[0]=-1; 
print(arc)

[0 1 2 3 4 5 6 7]
--------------------------------------------------
[0 1 2]
--------------------------------------------------
[-1  1  2]
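The practical consequence of the two cells above is easiest to see by inspecting the original array afterwards. The following sketch (variable names are illustrative) uses np.shares_memory to check whether two arrays are backed by the same buffer:

```python
import numpy as np

base = np.arange(12)
view = base[::2]           # slicing returns a view onto base
copy = base[::2].copy()    # copy() allocates independent storage

view[1] = -1               # writes through to base[2]
copy[2] = -99              # leaves base[4] untouched

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, copy))
```

Modifying the view changes the original array, whereas modifying the copy does not.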


## Basic operators

In [95]:
array_1 = np.array([[1, 2, 3], [4, 5, 6]])
print(array_1)
print('-'*50)
# Matrix multiplication of an array and its transpose
print(array_1 @ array_1.T)

[[1 2 3]
 [4 5 6]]
--------------------------------------------------
[[14 32]
 [32 77]]


In [100]:
# Computing the cube of each number from 0 to 999, using a vectorized NumPy operation
%timeit np.arange(1000) ** 3

3 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [101]:
# Computing the cube of each number from 0 to 999, using a list comprehension (an explicit Python loop)
array_list = range(1000)
%timeit [array_list[i]**3 for i in array_list]

263 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In this run, the vectorized NumPy operation (about 3 µs) is roughly 90 times faster than the equivalent Python loop (about 263 µs).
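Outside IPython, where the %timeit magic is unavailable, a comparable measurement can be sketched with the standard-library timeit module. The function names below are illustrative, and the absolute timings will vary by machine:

```python
import timeit
import numpy as np

values = np.arange(1000)

def cube_vectorized():
    # One array-level operation executed in compiled code
    return values ** 3

def cube_loop():
    # One Python-level multiplication per element
    return [v ** 3 for v in range(1000)]

# Both approaches produce the same numbers
same = cube_vectorized().tolist() == cube_loop()

t_vec = timeit.timeit(cube_vectorized, number=200)
t_loop = timeit.timeit(cube_loop, number=200)
print(same, t_vec, t_loop)
```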

### Mathematical operators

In [104]:
# Column sum of elements
print(np.array([[1, 2, 3], [4, 5, 6]]).sum(axis = 0))

# Cumulative sum of elements along axis 0
print(np.array([[1, 2, 3], [4, 5, 6]]).cumsum(axis = 0))

# Cumulative sum of all elements in the array
print(np.array([[1, 2, 3], [4, 5, 6]]).cumsum())


[5 7 9]
[[1 2 3]
 [5 7 9]]
[ 1  3  6 10 15 21]


### Statistical operators

In [105]:
array_x = np.array([[0, 1, 2], [3, 4, 5]])
np.mean(array_x)
np.median(array_x)
np.var(array_x)
np.std(array_x)

1.707825127659933

In [107]:
array_corr = np.random.randn(3,4)
np.corrcoef(array_corr)
np.cov(array_corr)

array([[ 0.46741424,  0.54164647, -0.62948127],
       [ 0.54164647,  1.0967383 , -0.5719871 ],
       [-0.62948127, -0.5719871 ,  1.6598233 ]])

### Logical operators

In [124]:
array_logical = np.random.randn(3, 4)
print(array_logical)
print('-'*50)
# Check if any value is negative along each dimension in axis 0
print(np.any(array_logical < 0, axis = 0))

# Check if all the values are negative in the array
print(np.all(array_logical < 0))

[[ 0.17700825 -0.65143391  0.47890877  0.09568479]
 [ 1.9811775  -0.26987231 -0.71554056 -0.84686743]
 [ 0.45180436  0.32849958 -0.89661332  0.88326518]]
--------------------------------------------------
[False  True  True  True]
False


Some functions test for the presence of NaN or infinite values in an array. Such checks are an essential part of data processing and data cleaning. These functions take an array or array-like object as input and return a Boolean array of the same shape:

In [125]:
print(np.isfinite(np.array([12, np.inf, 3, np.nan])))
print('-'*50)
print(np.isnan((np.array([12, np.inf, 3, np.nan]))))
print('-'*50)
print(np.isinf((np.array([12, np.inf, 3, np.nan]))))

[ True False  True False]
--------------------------------------------------
[False False False  True]
--------------------------------------------------
[False  True False False]
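In a cleaning step, these Boolean results are typically used as masks. A minimal sketch, assuming that non-finite values should either be dropped or replaced with 0 (the names `readings`, `clean`, and `filled` are illustrative):

```python
import numpy as np

readings = np.array([12, np.inf, 3, np.nan])

# Keep only the finite entries
clean = readings[np.isfinite(readings)]

# Or keep the shape and substitute a default for non-finite entries
filled = np.where(np.isfinite(readings), readings, 0.0)

print(clean)
print(filled)
```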


Functions such as greater, less, and equal perform element-wise comparisons between two arrays of identical shape:

In [126]:
# Creating two random arrays for comparison
array1 = np.random.randn(3,4)
print(array1)
print('-'*50)
array2 = np.random.randn(3, 4)
print(array2)
print('-'*50)
# Checking for the truth of array1 greater than array2
print(np.greater(array1, array2))
print('-'*50)
# Checking for the truth of array1 less than array2
print(np.less(array1, array2))

[[-1.57793062 -0.89442757 -0.43037381 -0.02844481]
 [-0.84742946  1.40732819  1.16421582  0.17061978]
 [ 0.15441828  0.00956198  0.08559231  0.80542052]]
--------------------------------------------------
[[ 0.88291496  0.52379055 -0.32115335 -1.45463775]
 [-1.65555653  0.24873851 -1.39627125 -0.53041408]
 [ 1.13944002  0.38841262  0.10531119  0.1600142 ]]
--------------------------------------------------
[[False False False  True]
 [ True  True  True  True]
 [False False False  True]]
--------------------------------------------------
[[ True  True  True False]
 [False False False False]
 [ True  True  True False]]


### Array shape manipulation

In [127]:
reshape_array = np.arange(0,15)
np.reshape(reshape_array, (5, 3))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [130]:
trans_array = np.arange(0,15).reshape(3, 5)
print(trans_array)
print('-'*50)
print(trans_array.T)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
--------------------------------------------------
[[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]


In [131]:
trans_array = np.arange(0,24).reshape(2, 3, 4)
trans_array.T.shape

(4, 3, 2)

### Ravel helps to flatten the data from multidimensional to 1D:

In [134]:
ravel_array = np.arange(0,12).reshape(4, 3)
print(ravel_array)
print('-'*50)
print(ravel_array.ravel())
print('-'*50)

# The order in which the array is raveled can be set. The order can be "C", "F", "A", or "K". 
# "C" is the default order, where the array is flattened in row-major order, while with "F", 
# flattening occurs in column-major order. "A" uses Fortran-like order if the array is Fortran-contiguous 
# in memory and C-like order otherwise, and "K" reads the elements in the order in which they are stored in memory:

ravel_array.ravel(order = "F")

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
--------------------------------------------------
[ 0  1  2  3  4  5  6  7  8  9 10 11]
--------------------------------------------------


array([ 0,  3,  6,  9,  1,  4,  7, 10,  2,  5,  8, 11])

### Adding a new axis

In [135]:
# Creating a 1D array with 7 elements
array_x = np.array([0, 1, 2, 3, 4, 5, 6])
print(array_x)
print('-'*50)
# Adding a new axis changes the 1D array to 2D
print(array_x[:, np.newaxis])
print('-'*50)
print(array_x[:, np.newaxis].shape)
print('-'*50)
# Adding 2 new axes to the 1D array to make it 3D
print(array_x[:, np.newaxis, np.newaxis])
print('-'*50)  
print(array_x[:, np.newaxis, np.newaxis].shape)
    

[0 1 2 3 4 5 6]
--------------------------------------------------
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]]
--------------------------------------------------
(7, 1)
--------------------------------------------------
[[[0]]

 [[1]]

 [[2]]

 [[3]]

 [[4]]

 [[5]]

 [[6]]]
--------------------------------------------------
(7, 1, 1)


### Basic linear algebra operations

Linear algebra constitutes a set of vital operations for matrices and arrays. The NumPy package is built with a special module called linalg to deal with all linear algebra requirements. The following segment discusses some frequently used functions of the linalg module in detail.

The dot function, which lives in the top-level numpy namespace rather than in linalg, performs matrix multiplication. For 2D arrays, it behaves exactly like matrix multiplication. For higher-dimensional inputs, it requires the last dimension of the first array to equal the second-to-last dimension of the second array. The arrays need not have equal numbers of dimensions. For an N-dimensional array and an M-dimensional array, the output has N+M-2 dimensions:


In [139]:
# For 2D arrays
array_1 = np.random.randn(2, 4)
array_2 = np.random.randn(4, 2)
print(array_1)
print('-'*50)
print(array_2)
print('-'*50)
print(np.dot(array_1, array_2))
print('-'*50)
# For N dimensional arrays
array_1 = np.random.randn(2, 4, 2)
array_2 = np.random.randn(1, 1, 2, 1)
print(np.dot(array_1, array_2).shape)

[[-0.41070786 -0.71884278  0.77055707 -0.40157547]
 [-1.09640949  0.72790947 -0.49843124 -0.14130392]]
--------------------------------------------------
[[ 0.82585335 -0.32774714]
 [-0.37906692  0.37314405]
 [ 1.61165248 -1.6380887 ]
 [ 1.64838597  0.9856031 ]]
--------------------------------------------------
[[ 0.5132239  -1.79165843]
 [-2.2176212   1.30816516]]
--------------------------------------------------
(2, 4, 1, 1, 1)
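The shape of the last result can be derived without random data. In this sketch (names are illustrative), np.dot contracts the last axis of the first array with the second-to-last axis of the second array and concatenates the remaining axes:

```python
import numpy as np

a = np.ones((2, 4, 2))       # 3 dimensions, last axis has length 2
b = np.ones((1, 1, 2, 1))    # 4 dimensions, second-to-last axis has length 2

product = np.dot(a, b)

# Remaining axes of a, then remaining axes of b
expected_shape = a.shape[:-1] + b.shape[:-2] + b.shape[-1:]
print(product.ndim, product.shape, expected_shape)
```

The result has 3 + 4 - 2 = 5 dimensions, matching the output shown above.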


The linalg.multi_dot function computes the product of several arrays in a single call, instead of using a nested sequence of dot functions. It automatically chooses the most efficient order in which to evaluate the sequence of products.
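A brief sketch of this function, which NumPy spells np.linalg.multi_dot, compared against nested dot calls. The array shapes and names are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(10, 100))
b = rng.normal(size=(100, 5))
c = rng.normal(size=(5, 50))

# One call for the whole chain; the evaluation order is chosen internally
chained = np.linalg.multi_dot([a, b, c])

# Equivalent nested form with an explicit evaluation order
nested = np.dot(np.dot(a, b), c)

print(chained.shape, np.allclose(chained, nested))
```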

The linalg.svd function helps in singular value decomposition and returns three arrays as the result of decomposition. It accepts an array with two or more dimensions as the input:

In [140]:
array_svd = np.random.randn(4, 3)
np.linalg.svd(array_svd)

(array([[-0.7165471 ,  0.67709603, -0.07095546, -0.15187674],
        [ 0.02375063, -0.19586423, -0.01063688, -0.98028566],
        [-0.69188907, -0.6854047 ,  0.19382235,  0.11807966],
        [ 0.08535616,  0.18275276,  0.97840946, -0.04506308]]),
 array([4.32181827, 2.57484718, 0.93725601]),
 array([[-0.20265395, -0.6515194 ,  0.7310635 ],
        [ 0.86345108, -0.47105308, -0.18044732],
        [-0.46193465, -0.59466921, -0.6580159 ]]))
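One way to check a decomposition is to multiply the factors back together. The sketch below assumes the reduced form (full_matrices=False), so that the three factors have directly compatible shapes; the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
matrix = rng.normal(size=(4, 3))

# Reduced SVD: u is 4 x 3, s holds 3 singular values, vt is 3 x 3
u, s, vt = np.linalg.svd(matrix, full_matrices=False)

# Rebuild the input from its factors
rebuilt = u @ np.diag(s) @ vt

print(u.shape, s.shape, vt.shape)
print(np.allclose(matrix, rebuilt))
```

With the default full_matrices=True, as in the cell above, the first factor is 4 x 4 and the singular values would need to be padded into a 4 x 3 matrix before multiplying.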

Eigenvalues and eigenvectors of an array can be calculated with the linalg.eig function. The eig function requires the last two dimensions of the input array to form a square matrix. A single call returns both the eigenvalues and the eigenvectors:

In [141]:
np.linalg.eig(np.random.randn(5, 5))

(array([-0.99021138+0.j        ,  0.33633478+1.53244295j,
         0.33633478-1.53244295j,  1.62796579+0.55533248j,
         1.62796579-0.55533248j]),
 array([[ 0.83990496+0.j        , -0.16797936+0.39124741j,
         -0.16797936-0.39124741j,  0.14825217-0.0192569j ,
          0.14825217+0.0192569j ],
        [-0.0984714 +0.j        , -0.14480561+0.39659146j,
         -0.14480561-0.39659146j,  0.11317588-0.10188239j,
          0.11317588+0.10188239j],
        [ 0.02905907+0.j        , -0.57761858+0.j        ,
         -0.57761858-0.j        ,  0.90960556+0.j        ,
          0.90960556-0.j        ],
        [ 0.52721053+0.j        ,  0.17138715+0.08568786j,
          0.17138715-0.08568786j, -0.18472645-0.14761669j,
         -0.18472645+0.14761669j],
        [-0.07789522+0.j        ,  0.39957977-0.33231274j,
          0.39957977+0.33231274j,  0.03488031+0.26447725j,
          0.03488031-0.26447725j]]))
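The two returned arrays are related column by column: the j-th column of the eigenvector array corresponds to the j-th eigenvalue. A minimal check of this relationship (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
matrix = rng.normal(size=(5, 5))

eigenvalues, eigenvectors = np.linalg.eig(matrix)

# For every column v with eigenvalue lam: matrix @ v == lam * v
lhs = matrix @ eigenvectors
rhs = eigenvectors * eigenvalues   # broadcasting scales column j by eigenvalue j

print(np.allclose(lhs, rhs))
```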

The linalg module also has functions to solve systems of linear equations. The linalg.solve function takes a coefficient matrix and the vector of dependent-variable values, and computes the exact solution. It requires the coefficient matrix to be square with linearly independent rows:

In [142]:
a = np.array([[1, 2, 3], [5, 4, 2], [8, 9, 7]])
b = np.array([6, 19, 47])
np.linalg.solve(a, b)

array([-6.27272727, 15.81818182, -6.45454545])
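The solution can be verified by substituting it back into the system, and the independence requirement can be checked through the rank of the coefficient matrix. A short sketch using the same a and b:

```python
import numpy as np

a = np.array([[1, 2, 3], [5, 4, 2], [8, 9, 7]])
b = np.array([6, 19, 47])

x = np.linalg.solve(a, b)

# Substituting the solution back should reproduce b
residual_ok = np.allclose(a @ x, b)

# Full rank (3) means the rows are linearly independent
rank = np.linalg.matrix_rank(a)

print(residual_ok, rank)
```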

The linalg.det function computes the determinant of a square array. If there are more than two dimensions in the input array, it is treated as a stack of matrices and the determinant is computed for each matrix in the stack. The last two dimensions must, however, correspond to a square matrix:

In [143]:
np.linalg.det(np.random.randn(3,3))
np.linalg.det(np.random.randn(2,3,3))

array([ 2.89875949, -0.58268981])

### Array sorting

In [153]:
ar=np.array([[3,2],[10,-1]])
print(ar)
print('-'*50)
ar.sort(axis=0)
print(ar)

[[ 3  2]
 [10 -1]]
--------------------------------------------------
[[ 3 -1]
 [10  2]]


## Implementing neural networks with NumPy

In [155]:
N = 1000
X1 = np.random.randn(N, 2) + np.array([0.9, 0.9])
X2 = np.random.randn(N, 2) + np.array([-0.9, -0.9])
Y1 = np.zeros((N, 1))
Y2 = np.ones((N, 1))

X = np.vstack((X1, X2))
Y = np.vstack((Y1, Y2))
train = np.hstack((X, Y))
print(train)

[[ 0.47266661  0.56079828  0.        ]
 [ 0.87938926  0.60225714  0.        ]
 [ 0.05409917  0.92867414  0.        ]
 ...
 [-1.3585913  -1.32040834  1.        ]
 [-0.81277616 -1.51825232  1.        ]
 [-1.79616994 -3.26677192  1.        ]]


Our aim is to build a simple neural network with one hidden layer of three neurons. For a moment, let's move away from NumPy to understand the architecture of the neural network we will be building from scratch.

The following is a schematic diagram of a simple neural network architecture:

![image.png](attachment:image.png)

There are two neurons in the input layer, three neurons in the hidden layer, and a single output neuron. The squares represent the bias. To implement the neural network, the independent variables and the target have been stored in x and t:

In [157]:
x = train[:, 0:2]
print(x)
t = train[:, 2].reshape(2000, 1)

[[ 0.47266661  0.56079828]
 [ 0.87938926  0.60225714]
 [ 0.05409917  0.92867414]
 ...
 [-1.3585913  -1.32040834]
 [-0.81277616 -1.51825232]
 [-1.79616994 -3.26677192]]


In [181]:
def sigmoid(x, derive = False):
    # With derive=True, x must already be a sigmoid output a, so that a * (1 - a) is the derivative
    if derive:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

The preceding function does the sigmoid transformation and also derivative computation (for backpropagation). The process of training consists of two modes of propagation—feedforward and backpropagation.
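One detail of this function is easy to miss: in derivative mode it expects the already-activated value, not the raw input. The sketch below repeats the function so that it is self-contained and compares the result with a central finite difference; the test points and step size h are arbitrary choices:

```python
import numpy as np

def sigmoid(x, derive=False):
    if derive:
        return x * (1 - x)   # expects x to be a sigmoid output
    return 1 / (1 + np.exp(-x))

z = np.array([-2.0, 0.0, 1.5])     # raw inputs
a = sigmoid(z)                     # activations

analytic = sigmoid(a, derive=True)           # a * (1 - a)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(np.allclose(analytic, numeric))
```

This is why the backpropagation code later passes the activations a_o and a_h, rather than y_o and y_h, when requesting the derivative.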

The first stage of feedforward is from the input layer to the hidden layer. This stage can be summarized with the following set of equations:

ah1 = sigmoid(x1*w_ih11 + x2*w_ih21 + 1*b_ih1)

ah2 = sigmoid(x1*w_ih12 + x2*w_ih22 + 1*b_ih2)

ah3 = sigmoid(x1*w_ih13 + x2*w_ih23 + 1*b_ih3)

Here, ah1, ah2, and ah3 are the activations of the hidden layer and serve as inputs to the next stage of the feedforward network, from the hidden layer to the output. The first stage involves multiplying the input matrix of dimensions 2000 x 2 by the weight matrix w_ih of dimensions 2 x 3 (one column per hidden neuron), and then adding the bias. Instead of handling the bias components separately, they can be handled as part of the weight matrix: a column of ones is appended to the input matrix and the bias values are inserted as the last row of the weight matrix. Hence, the new dimensions of the input matrix and weight matrix are 2000 x 3 and 3 x 3:
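The equivalence between a separate bias term and an augmented weight matrix can be checked on a small example. This sketch uses 6 samples instead of 2000, and the names `w`, `bias`, `x_aug`, and `w_aug` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(6, 2))       # 6 samples, 2 inputs
w = rng.normal(size=(2, 3))       # 2 inputs -> 3 hidden neurons
bias = rng.normal(size=(1, 3))    # one bias per hidden neuron

# Bias handled separately
separate = x @ w + bias

# Bias folded in: column of ones on the inputs, bias as the last weight row
x_aug = np.hstack([x, np.ones((6, 1))])
w_aug = np.vstack([w, bias])
combined = x_aug @ w_aug

print(x_aug.shape, w_aug.shape, np.allclose(separate, combined))
```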

In [182]:
# A column of ones is appended to the inputs so that the bias is handled by the weight matrix
x_in = np.concatenate([x, np.repeat([[1]], 2000, axis = 0)], axis = 1)
# The weight matrix is initialized with random values
w_ih = np.random.normal(size = (3, 3))

In [183]:

y_h = np.dot(x_in, w_ih)
print(y_h)
print('-'*50)
a_h = sigmoid(y_h)
print(a_h)

[[ 0.36375139 -0.08174245 -1.82873064]
 [ 0.54136326  0.24413459 -1.80128481]
 [ 0.11670997 -0.08276082 -1.79679573]
 ...
 [-0.1707162  -2.92904249 -2.20070179]
 [ 0.10731026 -2.69815803 -2.2010269 ]
 [-0.06414385 -4.82845147 -2.50900255]]
--------------------------------------------------
[[0.58994824 0.47957576 0.13838956]
 [0.63212949 0.5607323  0.14169474]
 [0.52914442 0.4793216  0.14224157]
 ...
 [0.4574243  0.05073642 0.09968749]
 [0.52680185 0.06308213 0.09965831]
 [0.48396953 0.00793542 0.07522947]]


In [184]:
a_hin = np.concatenate([a_h, np.repeat([[1]], 2000, axis = 0)], axis = 1)
w_ho = np.random.normal(size = (4, 1))
w_ho

array([[-1.30951935],
       [-0.95401343],
       [ 0.14811731],
       [-0.29020848]])

Now the matrix multiplication and sigmoid transformation can be done for this stage:

In [185]:
y_o = np.dot(a_hin, w_ho)
a_o = sigmoid(y_o)

In [186]:
# Output layer
delta_a_o_error = a_o - t
delta_y_o = sigmoid(a_o, derive=True)
delta_w_ho = a_hin
delta_output_layer = np.dot(delta_w_ho.T,(delta_a_o_error * delta_y_o))

# Hidden layer
delta_a_h = np.dot(delta_a_o_error * delta_y_o, w_ho[0:3,:].T)
delta_y_h = sigmoid(a_h, derive=True)
delta_w_ih = x_in
delta_hidden_layer = np.dot(delta_w_ih.T, delta_a_h * delta_y_h)

The change to be made to the weight has been computed. Let's use these delta values to update the weights:

In [187]:
eta = 0.1
w_ih = w_ih - eta * delta_hidden_layer
w_ho = w_ho - eta * delta_output_layer

Here, eta is the learning rate of the model. Feedforward will take place again using the updated weights. Backpropagation will again follow to reduce the error. Hence, feedforward and backpropagation should take place iteratively for a set number of epochs. The complete code is as follows:

In [189]:
### Neural Network with one hidden layer with feedforward and backpropagation
x = train[:,0:2]
t = train[:,2].reshape(2000,1)
x_in = np.concatenate([x, np.repeat([[1]], 2000, axis = 0)], axis = 1)
w_ih = np.random.normal(size = (3, 3))
w_ho = np.random.normal(size = (4, 1))
def sigmoid(x, derive = False):
    if derive:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))
epochs = 500
eta = 0.1

for epoch in range(epochs):
    # Feed forward
    y_h = np.dot(x_in, w_ih)
    a_h = sigmoid(y_h)
    a_hin = np.concatenate([a_h, np.repeat([[1]], 2000, axis = 0)], axis = 1)
    y_o = np.dot(a_hin, w_ho)
    a_o = sigmoid(y_o)

    # Calculate the error
    a_o_error = ((1 / 2) * (np.power((a_o - t), 2)))

    # Backpropagation
    ## Output layer
    delta_a_o_error = a_o - t
    delta_y_o = sigmoid(a_o, derive=True)
    delta_w_ho = a_hin
    delta_output_layer = np.dot(delta_w_ho.T,(delta_a_o_error * delta_y_o))

    ## Hidden layer
    delta_a_h = np.dot(delta_a_o_error * delta_y_o, w_ho[0:3,:].T)
    delta_y_h = sigmoid(a_h, derive=True)
    delta_w_ih = x_in
    delta_hidden_layer = np.dot(delta_w_ih.T, delta_a_h * delta_y_h)
    w_ih = w_ih - eta * delta_hidden_layer
    w_ho = w_ho - eta * delta_output_layer
    print(a_o_error.mean())

0.183904841565596
0.2493061172257
0.24883816603954106
0.2475402986897844
0.2413292525342937
0.1291557901915279
0.2389754632836524
0.05738732835173803
0.04821223050048435
0.04743745722331914
0.0468859752231239
0.04632132389692259
0.045668474006992935
0.04494031234788589
0.04426812561529192
0.043720474565992165
0.04324926858355627
0.04282449677049515
0.04242812948658468
0.04206095514865746
0.04173345971115678
0.04146209638749092
0.04130077101360433
0.04150104964944119
0.04387970266247815
0.04432802252271486
0.05379081356187707
0.04369912431603805
0.04199430907137102
0.041370395791959065
0.04085580667279685
0.04043031619592779
0.04008140465878328
0.03979864878176605
0.03957328864008619
0.039419801745433955
0.03950633692760425
0.040664377165749596
0.0482100618255656
0.04405622264010854
0.049880914844405884
0.04108045855372588
0.0403948499372309
0.03978635908179419
0.03928999409505506
0.038922410705294744
0.03867169236501448
0.03850963411953415
0.038413906309639284
0.03840720939557697
0.038

The neural network has been trained for 500 epochs. This is a simple yet effective model suitable for a range of problems. Good accuracy can be obtained by choosing an appropriate number of epochs, learning rate, loss function, and activation function. To test and validate the model, only the feedforward step is needed.
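A feedforward-only prediction step might look like the following sketch. The predict function, its threshold parameter, and the stand-in weights are assumptions for illustration; in practice the trained w_ih (3 x 3) and w_ho (4 x 1) from the loop above would be passed in:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(x, w_ih, w_ho, threshold=0.5):
    """Run the feedforward pass only and return 0/1 labels of shape (n, 1)."""
    ones = np.ones((x.shape[0], 1))
    a_h = sigmoid(np.hstack([x, ones]) @ w_ih)      # input -> hidden
    a_o = sigmoid(np.hstack([a_h, ones]) @ w_ho)    # hidden -> output
    return (a_o >= threshold).astype(int)

# Stand-in weights with the same shapes as the trained matrices (hypothetical values)
rng = np.random.default_rng(4)
w_ih_demo = rng.normal(size=(3, 3))
w_ho_demo = rng.normal(size=(4, 1))

x_new = np.array([[0.9, 0.9], [-0.9, -0.9]])
labels = predict(x_new, w_ih_demo, w_ho_demo)
print(labels.shape)
```

Comparing such labels with held-out targets gives an accuracy estimate without running backpropagation.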

# Data structures in pandas


## Series

A Series is essentially a 1D NumPy array coupled with an array of labels. Just like a NumPy array, a Series can hold data of any single type. The labels are collectively called the index of the Series. A Series therefore consists of two components: the 1D data and the index.


The general construct for creating a series data structure is as follows:

import pandas as pd
ser = pd.Series(data, index = idx)

Here, data can be one of the following:

    An ndarray
    A Python dictionary
    A scalar value


In [191]:
ser = pd.Series(np.random.randn(7))
ser

0    0.354696
1    1.720727
2   -0.876338
3    1.375861
4    0.933529
5    0.941816
6    0.575559
dtype: float64

The following example creates a Series structure of the first five months of the year with a specified index of month names:

In [192]:
import calendar as cal
monthNames=[cal.month_name[i] for i in np.arange(1,6)]
months = pd.Series(np.arange(1,6), index = monthNames)
print(months)
print("-"*40)
print(months.index)

January     1
February    2
March       3
April       4
May         5
dtype: int32
----------------------------------------
Index(['January', 'February', 'March', 'April', 'May'], dtype='object')



## Using a Python dictionary

A dictionary consists of key-value pairs. When a dictionary is used to create a Series, the keys form the index, and the values form the 1D data of the Series:

In [193]:
currDict={'US' : 'dollar', 'UK' : 'pound', 'Germany': 'euro', 'Mexico':'peso', 'Nigeria':'naira', 'China':'yuan', 'Japan':'yen'}
currSeries = pd.Series(currDict)
print(currSeries)

US         dollar
UK          pound
Germany      euro
Mexico       peso
Nigeria     naira
China        yuan
Japan         yen
dtype: object


The index of a pandas Series structure is of type pandas.Index and can be viewed as an ordered multiset.
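
As a minimal sketch of what "ordered multiset" means in practice, the following example builds a Series whose index repeats a label. The data and labels are made up for illustration:

```python
import pandas as pd

# Hypothetical data: the label 'a' appears twice in the index
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])

print(s.index.is_unique)   # False, because 'a' is repeated
print(s['a'])              # both values labelled 'a', in their original order
print(s['b'])              # a single value, since 'b' is unique
```

A lookup on a repeated label returns a Series rather than a scalar, which is worth keeping in mind when the index is built from real-world keys.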

If an index is also specified when creating the Series, then this specified index setting overrides the dictionary keys. If the specified index contains values that are not keys in the original dictionary, NaN is appended against that index in the Series:

In [195]:
stockPrices = {'GOOG':1180.97, 'FB':62.57, 'TWTR': 64.50, 'AMZN':358.69, 'AAPL':500.6}
# "YHOO" is not a key in the above dictionary
stockPriceSeries = pd.Series(stockPrices, index=['GOOG','FB','YHOO','TWTR','AMZN','AAPL'], name='stockPrices')
stockPriceSeries
# Note: The name attribute is useful in tasks such as combining Series objects into a DataFrame structure.

GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64


## Using a scalar value

A Series can also be initialized with just a scalar value. For scalar data, an index must be provided. 

In [197]:
dogSeries = pd.Series('chihuahua', index=['breed', 'countryOfOrigin', 'name', 'gender'])
dogSeries

breed              chihuahua
countryOfOrigin    chihuahua
name               chihuahua
gender             chihuahua
dtype: object


## Operations on Series

The behavior of a Series is very similar to that of NumPy arrays, discussed previously in this chapter, with one caveat being that an operation such as slicing also slices the index of the series.
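
The caveat about slicing can be illustrated with a short sketch that compares a plain ndarray slice with a Series slice. The values and labels here are assumptions chosen only for the demonstration:

```python
import numpy as np
import pandas as pd

values = np.array([5, 6, 7, 8])
s = pd.Series(values, index=['w', 'x', 'y', 'z'])

arr_slice = values[1:3]     # plain ndarray: only the values are kept
ser_slice = s.iloc[1:3]     # Series: the labels 'x' and 'y' come along

print(arr_slice)
print(ser_slice)
print(list(ser_slice.index))
```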


In [198]:
# Accessing a value from the Series using an index label
currSeries['China']

'yuan'

In [199]:
# Assigning a value to the Series through an index label
stockPriceSeries['GOOG'] = 1200.0
stockPriceSeries

GOOG    1200.00
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

Just as in the case of dict, KeyError is raised if you try to retrieve a missing label:

In [200]:
stockPriceSeries['MSFT']

KeyError: 'MSFT'

In [202]:
# This error can be avoided by explicitly using get as follows:
stockPriceSeries.get('MSFT', np.nan)

nan

In [203]:
# Slice till the 4th index (0 to 3)
print(stockPriceSeries[:4])
print("-"*40)
print(stockPriceSeries[stockPriceSeries > 100])

GOOG    1200.00
FB        62.57
YHOO        NaN
TWTR      64.50
Name: stockPrices, dtype: float64
----------------------------------------
GOOG    1200.00
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64


In [204]:
# Mean of entire series
print(np.mean(stockPriceSeries))
print("-"*40)
# Standard deviation of entire series
print(np.std(stockPriceSeries))

437.27200000000005
----------------------------------------
417.4446361087899


Elementwise operations can also be performed on a Series:

In [205]:
ser * ser

0    0.125809
1    2.960900
2    0.767969
3    1.892993
4    0.871476
5    0.887018
6    0.331268
dtype: float64

An important feature of a Series is that data is automatically aligned based on the label:

In [207]:
print(ser[1:])
print("-"*40)
print(ser[1:] + ser[:-2])

1    1.720727
2   -0.876338
3    1.375861
4    0.933529
5    0.941816
6    0.575559
dtype: float64
----------------------------------------
0         NaN
1    3.441453
2   -1.752677
3    2.751721
4    1.867057
5         NaN
6         NaN
dtype: float64
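
Labels that exist in only one operand produce NaN in the result. When that is not wanted, the add method with a fill_value is one option. The following is a small sketch with made-up Series:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

plain = a + b                     # 'x' has no partner in b, so the result is NaN
filled = a.add(b, fill_value=0)   # treat the missing partner as 0 instead

print(plain)
print(filled)
```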



## DataFrames

A DataFrame is a two-dimensional data structure composed of rows and columns—exactly like a simple spreadsheet or a SQL table. Each column of a DataFrame is a pandas Series. These columns should be of the same length, but they can be of different data types—float, int, bool, and so on. DataFrames are both value-mutable and size-mutable. This lets us perform operations that would alter values held within the DataFrame or add/delete columns to/from the DataFrame.

Similar to a Series, which has a name and index as attributes, a DataFrame has column names and a row index. The row index can be made of either numerical values or strings such as month names. Indexes are needed for fast lookups as well as proper aligning and joining of data in pandas. Multilevel indexing is also possible in DataFrames.


## DataFrame creation

A DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:

    Dictionary of 1D ndarrays, lists, dictionaries, or Series structures
    2D NumPy array
    Structured or record ndarray
    Series
    Another DataFrame

Row label indexes and column labels can be specified along with the data. If they're not specified, they will be generated from the input data in an intuitive fashion, for example, from the keys of dict (in the case of column labels) or by using np.arange(n) in the case of row labels, where n corresponds to the number of rows.
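
A quick way to see the default labels is to build a DataFrame from a 2D NumPy array with and without explicit labels. This is only a sketch; the label names are hypothetical:

```python
import numpy as np
import pandas as pd

# No labels given: rows and columns are numbered from 0
df_default = pd.DataFrame(np.zeros((3, 2)))

# Explicit row and column labels
df_labelled = pd.DataFrame(np.zeros((3, 2)),
                           index=['r1', 'r2', 'r3'],
                           columns=['c1', 'c2'])

print(list(df_default.index), list(df_default.columns))
print(list(df_labelled.index), list(df_labelled.columns))
```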

A DataFrame can be created from a variety of sources as discussed in the following subsections.


## Using a dictionary of Series
Each individual entity of a dictionary is a key-value pair. A DataFrame is, in essence, a dictionary of several Series put together. The name of the Series corresponds to the key, and the contents of the Series correspond to the value.

As the first step, the dictionary with all the Series should be defined:

In [208]:
stockSummaries = {
'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],
index=['Closing price','EPS',
'Shares Outstanding(M)',
'Beta', 'P/E','Market Cap(B)']),
'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],
index=['Closing price','EPS','Shares Outstanding(M)',
'Beta','P/E','Market Cap(B)']),
'FB': pd.Series([61.48,0.59,2450,104.93,150.92],
index=['Closing price','EPS','Shares Outstanding(M)',
'P/E', 'Market Cap(B)']),
'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],
index=['Closing price','EPS','Shares Outstanding(M)',
'P/E','Beta', 'Market Cap(B)']),
'TWTR':pd.Series([65.25,-0.3,555.2,36.23],
index=['Closing price','EPS','Shares Outstanding(M)',
'Market Cap(B)']),
'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],
index=['Closing price','EPS','Shares Outstanding(M)','P/E',
'Market Cap(B)','Beta'])}

The preceding dictionary summarizes the performance of six different stocks and indicates that the DataFrame will have six columns. Observe that each Series has a different set of index labels and that the Series are of different lengths. The row index of the final DataFrame will be the union of the index labels of all the Series. If a certain column has no value at a row label, NaN is placed in that cell automatically. Now, the following step wraps up this dictionary into a DataFrame:

In [210]:
stockDF = pd.DataFrame(stockSummaries)
stockDF

Unnamed: 0,AMZN,GOOG,FB,YHOO,TWTR,AAPL
Beta,0.52,0.87,,0.66,,0.84
Closing price,346.15,1133.43,61.48,34.9,65.25,501.53
EPS,0.59,36.05,0.59,1.27,-0.3,40.32
Market Cap(B),158.88,380.64,150.92,35.36,36.23,447.59
P/E,589.8,31.44,104.93,27.48,,12.44
Shares Outstanding(M),459.0,335.83,2450.0,1010.0,555.2,892.45


In [213]:
# The DataFrame need not necessarily have all the row and column labels from the original dictionary. 
# At times, only a subset of these rows and columns may be needed.

stockDF = pd.DataFrame(stockSummaries,
index=['Closing price','EPS',
'Shares Outstanding(M)',
'P/E', 'Market Cap(B)','Beta'],
columns=['FB','TWTR','SCNW'])

print(stockDF)

                            FB    TWTR SCNW
Closing price            61.48   65.25  NaN
EPS                       0.59   -0.30  NaN
Shares Outstanding(M)  2450.00  555.20  NaN
P/E                     104.93     NaN  NaN
Market Cap(B)           150.92   36.23  NaN
Beta                       NaN     NaN  NaN


The row index and column names can be accessed as attributes of the DataFrame:

In [214]:
stockDF.index

Index(['Closing price', 'EPS', 'Shares Outstanding(M)', 'P/E', 'Market Cap(B)',
       'Beta'],
      dtype='object')

In [215]:
stockDF.columns

Index(['FB', 'TWTR', 'SCNW'], dtype='object')

## Using a dictionary of ndarrays/lists

In [218]:
# The dictionary of lists is defined in the following code:

algos = {'search': ['DFS','BFS','Binary Search',
'Linear','ShortestPath (Djikstra)'],
'sorting': ['Quicksort','Mergesort', 'Heapsort',
'Bubble Sort', 'Insertion Sort'],
'machine learning': ['RandomForest', 'K Nearest Neighbor',
'Logistic Regression', 'K-Means Clustering', 'Linear Regression']}

Now, let's convert this dictionary to a DataFrame and print it:

In [220]:
algoDF = pd.DataFrame(algos)
algoDF

Unnamed: 0,search,sorting,machine learning
0,DFS,Quicksort,RandomForest
1,BFS,Mergesort,K Nearest Neighbor
2,Binary Search,Heapsort,Logistic Regression
3,Linear,Bubble Sort,K-Means Clustering
4,ShortestPath (Djikstra),Insertion Sort,Linear Regression


In [221]:
pd.DataFrame(algos,index=['algo_1','algo_2','algo_3','algo_4','algo_5'])

Unnamed: 0,search,sorting,machine learning
algo_1,DFS,Quicksort,RandomForest
algo_2,BFS,Mergesort,K Nearest Neighbor
algo_3,Binary Search,Heapsort,Logistic Regression
algo_4,Linear,Bubble Sort,K-Means Clustering
algo_5,ShortestPath (Djikstra),Insertion Sort,Linear Regression



### Using a structured array

Structured arrays are slightly different from regular ndarrays: each element is a record, and each field in a record can be of a different data type. For more information on structured arrays, refer to the following: http://docs.scipy.org/doc/numpy/user/basics.rec.html.

The following is an example of a structured array:

In [223]:
memberData = np.array([('Sanjeev',37,162.4),
('Yingluck',45,137.8),
('Emeka',28,153.2),
('Amy',67,101.3)],
dtype = [('Name','S15'),
('Age','i4'),
('Weight','f4')])
memberData

array([(b'Sanjeev', 37, 162.4), (b'Yingluck', 45, 137.8),
       (b'Emeka', 28, 153.2), (b'Amy', 67, 101.3)],
      dtype=[('Name', 'S15'), ('Age', '<i4'), ('Weight', '<f4')])

In [224]:
memberDF = pd.DataFrame(memberData)
memberDF

Unnamed: 0,Name,Age,Weight
0,b'Sanjeev',37,162.399994
1,b'Yingluck',45,137.800003
2,b'Emeka',28,153.199997
3,b'Amy',67,101.300003


In [225]:
pd.DataFrame(memberData, index=['a','b','c','d'])

Unnamed: 0,Name,Age,Weight
a,b'Sanjeev',37,162.399994
b,b'Yingluck',45,137.800003
c,b'Emeka',28,153.199997
d,b'Amy',67,101.300003


In [226]:
pd.DataFrame(memberData, columns = ["Weight", "Name", "Age"])

Unnamed: 0,Weight,Name,Age
0,162.399994,b'Sanjeev',37
1,137.800003,b'Yingluck',45
2,153.199997,b'Emeka',28
3,101.300003,b'Amy',67



### Using a list of dictionaries

When a list of dictionaries is converted to a DataFrame, each dictionary in the list corresponds to a row in the DataFrame and each key in each dictionary represents a column label.

Let's define a list of dictionaries:

In [227]:
demographicData = [{"Age": 32, "Gender": "Male"}, {"Race": "Hispanic", "Gender": "Female", "Age": 26}]

In [229]:
demographicDF = pd.DataFrame(demographicData)
demographicDF

Unnamed: 0,Age,Gender,Race
0,32,Male,
1,26,Female,Hispanic



## Using a dictionary of tuples for multilevel indexing

A dictionary of tuples can create a structured DataFrame with hierarchically indexed rows and columns. The following is a dictionary of tuples:

In [233]:
salesData = {("2012", "Q1"): {("North", "Brand A"): 100, ("North", "Brand B"): 80,
                              ("South", "Brand A"): 25, ("South", "Brand B"): 40},
("2012", "Q2"): {("North", "Brand A"): 30, ("South", "Brand B"): 50},
("2013", "Q1"): {("North", "Brand A"): 80, ("North", "Brand B"): 10, ("South", "Brand B"): 25},
("2013", "Q2"): {("North", "Brand A"): 70, ("North", "Brand B"): 50, ("South", "Brand A"): 35, ("South", "Brand B"): 40}}

Instead of a regular key-value pair, each outer key is a tuple with two values denoting two levels in the column index, and the value is a dictionary in which each key-value pair represents a row. Here, again, the inner key is a tuple and denotes two levels of the row index.

Now this dictionary of tuples can be converted to a DataFrame and printed:

In [234]:
salesDF = pd.DataFrame(salesData)
salesDF

               2012        2013
                 Q1    Q2    Q1  Q2
North Brand A   100  30.0  80.0  70
      Brand B    80   NaN  10.0  50
South Brand A    25   NaN   NaN  35
      Brand B    40  50.0  25.0  40
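
Once a DataFrame has hierarchical rows and columns, partial labels can be used to select whole groups. The following sketch uses a cut-down, hypothetical version of the sales dictionary:

```python
import pandas as pd

# Outer keys become two-level column labels, inner keys two-level row labels
sales = {
    ("2012", "Q1"): {("North", "Brand A"): 100, ("South", "Brand A"): 25},
    ("2012", "Q2"): {("North", "Brand A"): 30, ("South", "Brand A"): 45},
}
sales_df = pd.DataFrame(sales)

year_2012 = sales_df["2012"]       # all columns under the first-level label "2012"
north = sales_df.loc["North"]      # all rows under the first-level label "North"
cell = sales_df.loc[("South", "Brand A"), ("2012", "Q2")]   # a single cell

print(year_2012)
print(north)
print(cell)
```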


## Using a Series

In [235]:
currDict={'US' : 'dollar', 'UK' : 'pound', 'Germany': 'euro', 'Mexico':'peso',
          'Nigeria':'naira', 'China':'yuan', 'Japan':'yen'}
currSeries = pd.Series(currDict)

In [236]:
currDF = pd.DataFrame(currSeries)
currDF

Unnamed: 0,0
US,dollar
UK,pound
Germany,euro
Mexico,peso
Nigeria,naira
China,yuan
Japan,yen


In [237]:
# Default setting
pd.DataFrame.from_dict(algos, orient = "columns")

Unnamed: 0,search,sorting,machine learning
0,DFS,Quicksort,RandomForest
1,BFS,Mergesort,K Nearest Neighbor
2,Binary Search,Heapsort,Logistic Regression
3,Linear,Bubble Sort,K-Means Clustering
4,ShortestPath (Djikstra),Insertion Sort,Linear Regression


In [238]:
pd.DataFrame.from_dict(algos, orient = "index", columns = ["A", "B", "C", "D", "E"])

Unnamed: 0,A,B,C,D,E
search,DFS,BFS,Binary Search,Linear,ShortestPath (Djikstra)
sorting,Quicksort,Mergesort,Heapsort,Bubble Sort,Insertion Sort
machine learning,RandomForest,K Nearest Neighbor,Logistic Regression,K-Means Clustering,Linear Regression


In [239]:
pd.DataFrame.from_records(memberData, index="Name")

             Age      Weight
Name
b'Sanjeev'    37  162.399994
b'Yingluck'   45  137.800003
b'Emeka'      28  153.199997
b'Amy'        67  101.300003


## Operations on pandas DataFrames
Many operations, such as column/row indexing, assignment, concatenation, deletion, and so on, can be performed on DataFrames. Let's have a look at them in the following subsections.

In [241]:
#A specific column can be selected out from the DataFrame, as a Series, using the column name:
memberDF["Name"]

0     b'Sanjeev'
1    b'Yingluck'
2       b'Emeka'
3         b'Amy'
Name: Name, dtype: object

### Adding a new column

In [243]:
memberDF['Height'] = 60
memberDF['Height2'] = [57, 62, 65, 59]
memberDF

Unnamed: 0,Name,Age,Weight,Height,Height2
0,b'Sanjeev',37,162.399994,60,57
1,b'Yingluck',45,137.800003,60,62
2,b'Emeka',28,153.199997,60,65
3,b'Amy',67,101.300003,60,59


In [246]:
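# insert(loc, column, value) places a new column at position loc.
# "ID" is added first, then "ID2" is inserted ahead of it at position 1.
memberDF.insert(1, "ID", ["S01", "S02", "S03", "S04"])
memberDF.insert(1, "ID2", ["S01", "S02", "S03", "S04"])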
memberDF

Unnamed: 0,Name,ID2,ID,Age,Weight,Height,Height2
0,b'Sanjeev',S01,S01,37,162.399994,60,57
1,b'Yingluck',S02,S02,45,137.800003,60,62
2,b'Emeka',S03,S03,28,153.199997,60,65
3,b'Amy',S04,S04,67,101.300003,60,59


In [250]:
del memberDF["ID2"]
memberDF

Unnamed: 0,Name,ID,Age,Weight,Height,Height2
0,b'Sanjeev',S01,37,162.399994,60,57
1,b'Yingluck',S02,45,137.800003,60,62
2,b'Emeka',S03,28,153.199997,60,65
3,b'Amy',S04,67,101.300003,60,59


In [None]:
height2 = memberDF.pop("Height2")

In [253]:
memberDF

Unnamed: 0,Name,ID,Age,Weight,Height
0,b'Sanjeev',S01,37,162.399994,60
1,b'Yingluck',S02,45,137.800003,60
2,b'Emeka',S03,28,153.199997,60
3,b'Amy',S04,67,101.300003,60


## Alignment of DataFrames

In [255]:
ore1DF=pd.DataFrame(np.array([[20,35,25,20],
[11,28,32,29]]),
columns=['iron','magnesium',
'copper','silver'])
ore2DF=pd.DataFrame(np.array([[14,34,26,26],
[33,19,25,23]]),
columns=['iron','magnesium',
'gold','silver'])
ore1DF

Unnamed: 0,iron,magnesium,copper,silver
0,20,35,25,20
1,11,28,32,29


In [256]:
ore1DF + ore2DF

Unnamed: 0,copper,gold,iron,magnesium,silver
0,,,34,69,46
1,,,44,47,52


In [257]:
ore1DF + pd.Series([25,25,25,25], index=['iron', 'magnesium', 'copper', 'silver'])

Unnamed: 0,iron,magnesium,copper,silver
0,45,60,50,45
1,36,53,57,54


In [258]:
ore1DF["add_iron_copper"] = ore1DF["iron"] + ore1DF["copper"]

In [259]:
ore1DF

Unnamed: 0,iron,magnesium,copper,silver,add_iron_copper
0,20,35,25,20,45
1,11,28,32,29,43


In [260]:
logical_df1 = pd.DataFrame({'Col1' : [1, 0, 1], 'Col2' : [0, 1, 1] }, dtype=bool)
logical_df2 = pd.DataFrame({'Col1' : [1, 0, 0], 'Col2' : [0, 0, 1] }, dtype=bool)

In [261]:
logical_df1 | logical_df2

Unnamed: 0,Col1,Col2
0,True,False
1,False,True
2,True,True


In [263]:
print(ore1DF)
print("-"*40)
print(np.sqrt(ore1DF))

   iron  magnesium  copper  silver  add_iron_copper
0    20         35      25      20               45
1    11         28      32      29               43
----------------------------------------
       iron  magnesium    copper    silver  add_iron_copper
0  4.472136   5.916080  5.000000  4.472136         6.708204
1  3.316625   5.291503  5.656854  5.385165         6.557439



## Panels

A Panel is a 3D array. It is not as widely used as Series or DataFrames, and it is not as easily displayed on screen or visualized as the other two because of its 3D nature. The Panel data structure is the final piece of the data structure puzzle in pandas and is generally used for 3D time-series data. The three axis names are as follows:

    items: This is axis 0. Each item corresponds to a DataFrame structure.
    major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
    minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.

Panels were deprecated in pandas 0.20 and removed in pandas 0.25, so they are not available in current versions. Hence, it's advisable to use multi-indexing in DataFrames instead of Panels.
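
As a rough sketch of the multi-indexing alternative, 3D data can be held in a DataFrame whose row index has two levels. The item names, dates, and column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical 3D data: 2 items x 3 dates x 2 measurements
items = ['store_A', 'store_B']
dates = pd.date_range('2020-01-01', periods=3)
index = pd.MultiIndex.from_product([items, dates], names=['item', 'date'])

panel_like = pd.DataFrame(np.arange(12).reshape(6, 2),
                          index=index, columns=['sales', 'returns'])

# Selecting one first-level label plays the role of selecting one Panel item
store_a = panel_like.loc['store_A']

# xs takes a cross-section along the second level (the date axis)
first_day = panel_like.xs(dates[0], level='date')

print(store_a)
print(first_day)
```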

## Data sources and pandas methods

The data sources for a data science project can be grouped into the following categories:

    Databases: Most of the CRM, ERP, and other business operations tools store data in a database. Depending on the volume, velocity, and variety, it can be a traditional or NoSQL database. To connect to most of the popular databases, we need JDBC/ODBC drivers from Python. Fortunately, there are such drivers available for all the popular databases. Working with data in such databases involves making a connection through Python to these databases, querying the data through Python, and then manipulating it using pandas. We will look at an example of how to do this later in this chapter.
    Web services: Many of the business operations tools, especially Software as a Service (SaaS) tools, make their data accessible through Application Programming Interfaces (APIs) instead of a database. This reduces the infrastructure cost of hosting a database permanently. Instead, data is made available as a service, as and when required. An API call can be made through Python, which returns packets of data in formats such as JSON or XML. This data is parsed and then manipulated using pandas for further usage.
    Data files: A lot of data for prototyping data science models comes as data files. One example of data being stored as a physical file is the data from IoT sensors: more often than not, the data from these sensors is stored in a flat file, a .txt file, or a .csv file. Another source for a data file is the sample data that's been extracted from a database and stored in such files. The output of many data science and machine learning algorithms is also stored in such files, such as CSV, Excel, and .txt files. Another example is that the trained weight matrices of a deep learning neural network model can be stored as an HDF file.
    Web and document scraping: Two other sources of data are the tables and text present on web pages. This data is gleaned from these pages using Python packages such as BeautifulSoup and Scrapy and is put into a data file or database to be used further. The tables and data that are present in another non-data format file, such as PDF or Docs, are also a major source of data. This data is then extracted using Python packages such as Tesseract and Tabula-py.
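
To make the database route concrete, here is a minimal sketch that uses Python's built-in sqlite3 module with an in-memory database, so no external driver or server is needed. The table name, columns, and rows are made up for illustration; a production database would need its own driver and connection details:

```python
import sqlite3
import pandas as pd

# Create a throwaway in-memory database with a hypothetical table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE customers (id INTEGER, name TEXT, spend REAL)')
conn.executemany('INSERT INTO customers VALUES (?, ?, ?)',
                 [(1, 'Asha', 120.5), (2, 'Ben', 80.0), (3, 'Chen', 200.25)])
conn.commit()

# Query through the connection and receive the result as a DataFrame
big_spenders = pd.read_sql_query(
    'SELECT name, spend FROM customers WHERE spend > 100 ORDER BY id', conn)
conn.close()

print(big_spenders)
```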

In this chapter, we will look at how to read and write data to and from these formats/sources using pandas and ancillary libraries. We will also discuss a little bit about these formats, their utilities, and various operations that can be performed on them.

The following is a summary of the read and write methods in Python for some of the data formats we are going to discuss in this chapter: 

![image.png](attachment:image.png)

## CSV and TXT

In [264]:
import pandas as pd
import os
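# Replace ' ' with the path of the directory that contains the data files;
# leaving the placeholder unchanged raises the OSError shown below.
os.chdir(' ')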
data=pd.read_csv('Hospital Cost.csv')

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: ' '


## Specifying column names for a dataset

The following code will specify the column names for a dataset:



In [None]:
column_names=pd.read_csv('Customer Churn Columns.csv')
column_names_list=column_names['Column Names'].tolist()
data=pd.read_csv('Customer Churn Model.txt',header=None,names=column_names_list)

## Reading from a string of data

In [265]:
from io import StringIO
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3\nc,e,4\ng,f,5\ne,z,6'
pd.read_csv(StringIO(data))

Unnamed: 0,col1,col2,col3
0,a,b,1
1,a,b,2
2,c,d,3
3,c,e,4
4,g,f,5
5,e,z,6


In [266]:
from io import StringIO
data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3\nc,e,4\ng,f,5\ne,z,6'
pd.read_csv(StringIO(data),skiprows=lambda x: x % 3 != 0)

Unnamed: 0,col1,col2,col3
0,c,d,3
1,e,z,6



## Row index

If a file has one more column of data than the number of column names, the first column will be used as the DataFrame's row names:

In [267]:
data = 'a,b,c\n4,apple,bat,5.7\n8,orange,cow,10'
pd.read_csv(StringIO(data), index_col=0)

Unnamed: 0,a,b,c
4,apple,bat,5.7
8,orange,cow,10.0



### Reading a text file

read_csv can help read text files as well. Often, data is stored in .txt files with different kinds of delimiters. The sep parameter can be used to specify the delimiter of a particular file, as shown in the following code:

In [None]:
data=pd.read_csv('Tab Customer Churn Model.txt',sep='\t')
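
The effect of the sep parameter can be tried without any file on disk by reading from in-memory strings. In this sketch, the column names and values are hypothetical; note that a tab is written as '\t' with a backslash:

```python
from io import StringIO
import pandas as pd

# The same small table, once tab-delimited and once pipe-delimited
tab_text = 'name\tplan\tcalls\nAnn\tyes\t12\nBob\tno\t7'
pipe_text = 'name|plan|calls\nAnn|yes|12\nBob|no|7'

tab_df = pd.read_csv(StringIO(tab_text), sep='\t')
pipe_df = pd.read_csv(StringIO(pipe_text), sep='|')

print(tab_df)
print(tab_df.equals(pipe_df))
```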


## Subsetting while reading

A selected subset of columns can be loaded using the usecols parameter while reading:

In [None]:
data=pd.read_csv('Tab Customer Churn Model.txt',sep='\t',usecols=[1,3,5])
data=pd.read_csv('Tab Customer Churn Model.txt',sep='\t',usecols=['VMail Plan','Area Code'])


Numeric lists, as well as explicit lists with column names, can be used. Numeric indexing follows Python indexing, that is, starting from 0.
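
The following sketch shows both forms of usecols on a small in-memory table with made-up values. One detail worth noticing is that the resulting column order follows the file, not the order of the list passed to usecols:

```python
from io import StringIO
import pandas as pd

text = 'State,Area Code,VMail Plan,Day Calls\nKS,415,yes,110\nOH,408,no,123'

# Positions 0 and 2 pick the first and third columns
by_position = pd.read_csv(StringIO(text), usecols=[0, 2])

# Names can be listed in any order; the file order is kept in the result
by_name = pd.read_csv(StringIO(text), usecols=['VMail Plan', 'Area Code'])

print(list(by_position.columns))
print(list(by_name.columns))
```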

## Indexing and multi-indexing

In [None]:
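# Default read: every column is loaded as data, with a default integer row index
pd.read_csv('mindex.txt')
# Use the first two columns as a two-level row index
data=pd.read_csv('mindex.txt',index_col=[0,1])

# All rows under the first-level label 1977
data.loc[1977]
# The row with first-level label 1977 and second-level label 'A'
data.loc[(1977,'A')]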
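
Since the contents of the file used above are not shown, the following sketch reproduces the idea on a small in-memory table. The column names and numbers are assumptions made for illustration only:

```python
from io import StringIO
import pandas as pd

text = ('year,indiv,zit,xit\n'
        '1977,A,1.2,0.6\n'
        '1977,B,1.5,0.5\n'
        '1978,A,0.2,0.06\n'
        '1978,B,0.7,0.2')

# The first two columns together form a two-level row index
data = pd.read_csv(StringIO(text), index_col=[0, 1])

rows_1977 = data.loc[1977]            # DataFrame indexed by the second level
row_1977_a = data.loc[(1977, 'A')]    # one row, returned as a Series

print(rows_1977)
print(row_1977_a)
```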


### Reading large files in chunks

Reading a large file in memory at once may consume the entire RAM of the computer and may cause it to throw an error. In such cases, it becomes pertinent to divide the data into chunks. These chunks can then be read sequentially and processed. This is achieved by using the chunksize parameter in read_csv.

The resulting chunks can be iterated over using a for loop. In the following code, we are printing the shape of the chunks:

In [None]:
for chunks in pd.read_csv('Chunk.txt',chunksize=500):
    print(chunks.shape)


These chunks can then be concatenated to each other using the pd.concat function:

In [None]:
data=pd.read_csv('Chunk.txt',chunksize=500)
data=pd.concat(data,ignore_index=True)
print(data.shape)
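
Concatenating every chunk brings the whole file back into memory. When only a summary is needed, each chunk can be reduced as it arrives instead. The sketch below assumes a hypothetical single-column table held in a string:

```python
from io import StringIO
import pandas as pd

# A hypothetical 10-row table with one numeric column
text = 'value\n' + '\n'.join(str(i) for i in range(1, 11))

total = 0
n_rows = 0
# Each chunk is an ordinary DataFrame holding at most 4 rows
for chunk in pd.read_csv(StringIO(text), chunksize=4):
    total += chunk['value'].sum()
    n_rows += len(chunk)

print(n_rows, total)
```

Only one chunk is held in memory at a time, so this pattern scales to files that are larger than the available RAM.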


## Writing to a CSV

A DataFrame is an in-memory object. Often, DataFrames need to be saved as physical files for later use. In such cases, the DataFrames can be written as a CSV or TXT file.

Let's create a synthesized DataFrame using random numbers:

In [268]:
import numpy as np
import pandas as pd
a=['Male','Female']
b=['Rich','Poor','Middle Class']
gender=[]
seb=[]

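# Draw one random category per row for the two categorical columns
for i in range(1,101):
    gender.append(np.random.choice(a))
    seb.append(np.random.choice(b))

# The numeric columns are generated once, as arrays of 100 values each
height=30*np.random.randn(100)+155
weight=20*np.random.randn(100)+60
age=10*np.random.randn(100)+35
income=1500*np.random.randn(100)+15000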

df=pd.DataFrame({'Gender':gender,'Height':height,'Weight':weight,'Age':age,'Income':income,'Socio-Eco':seb})

In [269]:
df

Unnamed: 0,Gender,Height,Weight,Age,Income,Socio-Eco
0,Female,176.921590,86.446794,49.443952,13618.839513,Rich
1,Female,176.667305,48.321437,21.334596,15884.184108,Poor
2,Male,151.265198,44.172707,31.028598,16820.599454,Middle Class
3,Male,158.849838,56.276932,38.197687,14836.039592,Middle Class
4,Male,124.036257,79.795135,35.414817,11685.636239,Middle Class
...,...,...,...,...,...,...
95,Male,191.509648,78.006989,38.707176,16959.005271,Middle Class
96,Female,163.863058,70.477718,31.141672,14655.068237,Middle Class
97,Male,201.107750,54.371142,39.069568,16232.608590,Rich
98,Male,167.264993,73.384612,59.097302,14152.293588,Rich


This can be written to a .csv or .txt file using the to_csv method, as shown in the following code:

In [270]:
df.to_csv('data.csv')
df.to_csv('data.txt')
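
By default, to_csv also writes the row index as the first column. The following sketch, which uses a small made-up DataFrame, shows the index and sep parameters; calling to_csv without a path returns the text instead of writing a file:

```python
from io import StringIO
import pandas as pd

df_small = pd.DataFrame({'Gender': ['Male', 'Female'], 'Age': [31, 28]})

with_index = df_small.to_csv()                          # row index included
without_index = df_small.to_csv(index=False)            # row index dropped
tab_separated = df_small.to_csv(index=False, sep='\t')  # tab-delimited text

# Reading the index-free text back gives the original columns
round_trip = pd.read_csv(StringIO(without_index))

print(with_index)
print(without_index)
```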


## JSON

JSON is a popular dictionary-like, key-value pair-based data structure that's suitable for exposing data as APIs from SaaS tools. In the following example, address, postalCode, state, streetAddress, age, firstName, lastName, and phoneNumber are keys whose values are shown to the right of them. JSON can be nested as well (the value of a key can itself be a JSON object). Here, address has nested values:

![image.png](attachment:image.png)

DataFrames can be converted into JSON using to_json:

In [None]:
import numpy as np 
pd.DataFrame(np.random.randn(5, 2), columns=list('AB')).to_json()

![image.png](attachment:image.png)

While converting the DataFrame into a JSON file, the orientation can be set.

If we want to keep the column name as the primary index and the row indices as the secondary index, then we can choose the orientation to be columns:

In [None]:
dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),columns=list('ABC'), index=list('xyz')) 
dfjo.to_json(orient="columns")

![image.png](attachment:image.png)

If we want to keep the row indices as the primary index and the column names as the secondary index, then we can choose the orientation to be index: 

In [None]:
dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),columns=list('ABC'), index=list('xyz')) 
dfjo.to_json(orient="index")

Finally, we can also orient the converted JSON in order to separate the row indices, column names, and data values:

In [None]:
dfjo = pd.DataFrame(dict(A=range(1, 4), B=range(4, 7), C=range(7, 10)),columns=list('ABC'),
                    index=list('xyz'))
dfjo.to_json(orient="split")

![image.png](attachment:image.png)
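
The structure of the split orientation can be inspected by parsing the string back with the json module, and the same string can be turned back into a DataFrame with read_json. This is a sketch on a smaller, hypothetical DataFrame:

```python
import json
from io import StringIO
import pandas as pd

dfjo_small = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=list('xyz'))

split_text = dfjo_small.to_json(orient='split')

# The split layout has three top-level keys: columns, index, and data
parsed = json.loads(split_text)

# read_json needs the same orient to rebuild the DataFrame
restored = pd.read_json(StringIO(split_text), orient='split')

print(parsed)
print(restored)
```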


## Writing a JSON to a file

JSON can be written to physical files like so:

In [None]:
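import json
# to_json() already returns a JSON string, so it is written out directly;
# passing it through json.dump would encode it a second time as a quoted string
with open('jsonex.txt','w') as outfile:
    outfile.write(dfjo.to_json(orient="columns"))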


## Reading a JSON

json.loads is used to parse a JSON string, such as a line read from a physical file containing JSONs:

In [None]:
f=open('usagov_bitly.txt','r').readline()
json.loads(f)

The files can be read one JSON at a time using the open and readline methods: 

In [None]:
records=[]
f=open('usagov_bitly.txt','r')
for i in range(1000):
    fiterline=f.readline()
    d=json.loads(fiterline)
    records.append(d)
f.close()

Now, records contains a list of JSONs from which all the values of a particular key can be pulled out. For example, here, we are pulling out all the latlong ('ll' key) values from the records in which that key is present:

In [None]:
latlong=[rec['ll'] for rec in records if 'll' in rec]


## Writing JSON to a DataFrame

A list of JSON objects can be converted into a DataFrame (much like a dictionary can). The records element we created previously is a list of JSONs (we can check this by using records[0:3] or type(records)):

In [None]:
df=pd.DataFrame(records) 
df.head() 
df['tz'].value_counts() 
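
When a file holds one JSON object per line, read_json with lines=True is another possible route that skips the manual loop. The sketch below uses three made-up records in a string rather than the actual file, and the key names are assumptions:

```python
from io import StringIO
import pandas as pd

# Three hypothetical records, one JSON object per line
lines = ('{"tz": "America/New_York", "c": "US"}\n'
         '{"tz": "Europe/London", "c": "GB"}\n'
         '{"tz": "America/New_York", "c": "US"}\n')

# lines=True tells read_json to parse each line as a separate record
df_lines = pd.read_json(StringIO(lines), lines=True)
tz_counts = df_lines['tz'].value_counts()

print(df_lines)
print(tz_counts)
```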


## Reading feather files

The feather format is a binary file format for storing data that makes use of Apache Arrow, an in-memory columnar data structure. It was developed by Wes McKinney, the creator of pandas, and Hadley Wickham, chief scientist at RStudio, as an initiative for a data sharing infrastructure across Python and R. The columnar serialization of data in feather files makes way for efficient read and write operations, making it far faster than CSV and JSON files, where storage is record-wise.

In [None]:
pd.read_feather("sample.feather")

![image.png](attachment:image.png)


## Reading from a clipboard

This is a rather interesting feature in pandas. Any tabular data that has been copied onto the clipboard can be read as a DataFrame in pandas.


In [275]:
pd.read_clipboard()

Unnamed: 0,Gender,Entry_Date,Flag
0,A,M,1/19/2012
1,B,F,12/30/2012
2,C,M,5/5/2012



## Managing sparse data

Sparse data refers to data structures such as arrays, series, DataFrames, and panels in which there is a very high proportion of missing data or NaNs.

Let's create a sparse DataFrame:

In [276]:
df = pd.DataFrame(np.random.randn(100, 3))
df.iloc[:95] = np.nan

In [277]:
df.memory_usage()

Index    128
0        800
1        800
2        800
dtype: int64

In [281]:
df

Unnamed: 0,0,1,2
0,,,
1,,,
2,,,
3,,,
4,,,
...,...,...,...
95,0.320499,-1.058794,-1.212914
96,1.544345,-0.689210,-1.023178
97,-1.032622,-0.703434,-0.039521
98,0.195441,-0.955426,0.483690


In [284]:
import scipy.sparse
# csr_matrix treats only zeros as empty, so the NaN cells are still stored
sparse_df = scipy.sparse.csr_matrix(df.values)
sparse_df

<100x3 sparse matrix of type '<class 'numpy.float64'>'
	with 300 stored elements in Compressed Sparse Row format>
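
The CSR matrix above still stores all 300 elements because NaN is not zero. One alternative, sketched here under the assumption that NaN should be the value that is left out, is the pandas SparseDtype with NaN as the fill value:

```python
import numpy as np
import pandas as pd

dense = pd.DataFrame(np.random.randn(100, 3))
dense.iloc[:95] = np.nan

# Convert every column to a sparse dtype that omits NaN entries
sparse = dense.astype(pd.SparseDtype('float64', np.nan))

density = sparse.sparse.density          # share of values actually stored
dense_bytes = dense.memory_usage().sum()
sparse_bytes = sparse.memory_usage().sum()

print(density)
print(dense_bytes, sparse_bytes)
```

Only the 15 non-missing values (5 rows in each of 3 columns) are stored, so the memory footprint drops well below that of the dense DataFrame.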


### Writing JSON objects to a file

The to_json() function allows any DataFrame object to be converted into a JSON string or written to a JSON file if the file path is specified: 

In [286]:
df = pd.DataFrame({"Col1": [1, 2, 3, 4, 5], "Col2": ["A", "B", "B", "A", "C"], "Col3": [True, False, False, True, True]})
df.to_json()

'{"Col1":{"0":1,"1":2,"2":3,"3":4,"4":5},"Col2":{"0":"A","1":"B","2":"B","3":"A","4":"C"},"Col3":{"0":true,"1":false,"2":false,"3":true,"4":true}}'

The orientation of the data in the JSON can be altered. The to_json() function has an orient argument which can be set to the following modes: columns, index, records, values, split, and table. Columns is the default setting for orientation:

In [287]:
df.to_json(orient="columns")

'{"Col1":{"0":1,"1":2,"2":3,"3":4,"4":5},"Col2":{"0":"A","1":"B","2":"B","3":"A","4":"C"},"Col3":{"0":true,"1":false,"2":false,"3":true,"4":true}}'

In [288]:
df.to_json(orient="table")

'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"Col1","type":"integer"},{"name":"Col2","type":"string"},{"name":"Col3","type":"boolean"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":0,"Col1":1,"Col2":"A","Col3":true},{"index":1,"Col1":2,"Col2":"B","Col3":false},{"index":2,"Col1":3,"Col2":"B","Col3":false},{"index":3,"Col1":4,"Col2":"A","Col3":true},{"index":4,"Col1":5,"Col2":"C","Col3":true}]}'
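
To compare the remaining orientations side by side, the output can be parsed back with the json module. This sketch uses a two-row, hypothetical DataFrame so that the structures stay easy to read:

```python
import json
import pandas as pd

df_small = pd.DataFrame({'Col1': [1, 2], 'Col2': ['A', 'B']})

# records: a list with one dictionary per row
as_records = json.loads(df_small.to_json(orient='records'))

# values: a list of rows with no labels at all
as_values = json.loads(df_small.to_json(orient='values'))

# index: row labels on the outside, column names on the inside
as_index = json.loads(df_small.to_json(orient='index'))

print(as_records)
print(as_values)
print(as_index)
```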