# Data Science: Data Manipulation

Effective data-driven science and computation require an understanding of how data is stored and manipulated.
Python is a dynamically typed language. While a statically typed language like C requires each variable to be explicitly declared with a type, Python skips this specification.

For example:
### C
- int variable = 0;

### Python
- variable = 0
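
As a small sketch of what dynamic typing means in practice (reusing the name `variable` from the example above), the type belongs to the object rather than to the name, so the same name can be rebound to values of different types:

```python
variable = 0
first_type = type(variable)    # the name currently refers to an int

variable = "zero"              # rebinding to a string is allowed
second_type = type(variable)   # now the name refers to a str

print(first_type.__name__, second_type.__name__)  # int str
```

In C, the equivalent reassignment would be rejected at compile time because `variable` was declared as `int`.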

## A Python Integer is more than just an Integer
The standard Python implementation is written in C. This means that every Python object is simply a cleverly disguised C structure, which contains not only its value but other information as well. When we define an integer in Python, the variable is actually a pointer to a compound C structure, which contains several values.

<font color=green>struct</font> <font color=blue>_longobject {</font>

&nbsp;
    <font color=green>long</font> <font color=blue>ob_refcnt;</font>
    
&nbsp;
   <font color=blue>PyTypeObject*ob_type;</font>
    
&nbsp;
   <font color=green>size_t</font> <font color=blue>ob_size;</font>
    
&nbsp;
   <font color=green>long</font> <font color=blue>ob_digit[1];</font>
    
&nbsp;
<font color=blue>};</font>

A single integer in Python 3.4 actually contains four pieces:
- <font color=blue>ob_refcnt</font>, a reference count that helps Python silently handle memory allocation and deallocation
- <font color=blue>ob_type</font>, which encodes the type of the variable
- <font color=blue>ob_size</font>, which specifies the size of the following data members
- <font color=blue>ob_digit</font>, which contains the actual integer value that we expect the Python variable to represent

A C integer is essentially a label for a position in memory whose bytes encode an integer value. A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
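
One rough way to observe this overhead is `sys.getsizeof`, which reports the size of the whole object in bytes. The sketch below assumes CPython; the exact numbers depend on the interpreter version and platform, so no specific values are shown:

```python
import sys

small = 0
large = 2 ** 100

# getsizeof measures the full object, including the reference count,
# type pointer, and size field, not just the digits of the value.
print(sys.getsizeof(small))   # well above the 4 or 8 bytes of a C integer
print(sys.getsizeof(large))   # grows as more digits are needed
```

A C `long` occupies a fixed 4 or 8 bytes, so even the integer `0` costs several times more memory as a Python object.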

## A Python List is more than just a List
The standard mutable multielement container (data structure) in Python is the list.

&nbsp;
In[1] list = [1,2,3,4,5]

&nbsp;
In[2] type(list)

&nbsp;
Out[2] returns: list

Because of Python's dynamic typing, we can even create heterogeneous lists:

&nbsp;
In[3] listTwo = [1,2,3,"4",5.7]

But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. The difference between a dynamic-type list and a fixed-type (NumPy-style) array is that the array holds a single pointer to one contiguous block of data, whereas the list holds pointers to separately stored, complete Python objects.
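
As a rough illustration of that cost, the sketch below compares the approximate memory footprint of a list of integers with a fixed-type array. It uses the standard-library `array` module as a stand-in for a fixed-type container (an assumption made here to keep the example dependency-free), and the byte counts are CPython-specific estimates:

```python
import sys
from array import array

values = list(range(1000))

# List: a block of pointers plus one full Python object per element.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Fixed-type array: one header plus a contiguous buffer of raw integers.
fixed = array('q', values)        # 'q' = signed 8-byte integers
fixed_bytes = sys.getsizeof(fixed)

print(list_bytes, fixed_bytes)    # the list total is several times larger
```

NumPy arrays follow the same storage idea and add efficient operations on top of it.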

# NumPy 

NumPy (or Numpy) is a linear algebra library for Python. It is so important for data science with Python because almost all of the libraries in the PyData ecosystem rely on NumPy as one of their main building blocks.

NumPy is also incredibly fast, as it has bindings to C libraries. For more info on why you would want to use arrays instead of lists, check out this great [StackOverflow post](http://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists).


In [1]:
import numpy as np

NumPy arrays most often come in two flavors: vectors and matrices. Vectors are strictly 1-d arrays and matrices are 2-d (but you should note a matrix can still have only one row or one column). Arrays with more dimensions are also possible.

We can create an array by directly converting a list or list of lists:

In [2]:
list = [1,2,3,4,5]
np.array(list)

array([1, 2, 3, 4, 5])

Remember that unlike Python lists, NumPy arrays are constrained to contain elements of a single type. If the types do not match, NumPy will upcast if possible.


In [3]:
np.array([1,2,3.14,4,5])

array([1.  , 2.  , 3.14, 4.  , 5.  ])
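
To see which type NumPy settles on, the resulting `dtype` can be inspected. The following is a minimal sketch; the variable names are illustrative only:

```python
import numpy as np

ints = np.array([1, 2, 3])             # all integers
mixed = np.array([1, 2, 3.14])         # integers and a float
with_text = np.array([1, 2.5, "3"])    # numbers and a string

print(ints.dtype.kind)        # i  (signed integer; width is platform dependent)
print(mixed.dtype)            # float64
print(with_text.dtype.kind)   # U  (everything became a Unicode string)
```

The last case shows that upcasting always succeeds in some sense, but the result may be a string array that no longer supports arithmetic.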

If we want to explicitly set the data type of the resulting array, we can use the dtype parameter.

In [4]:
np.array([1,2,3.14,5], dtype = 'float32')

array([1.  , 2.  , 3.14, 5.  ], dtype=float32)
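
The choice of `dtype` affects both memory use and the stored values. A short sketch, assuming the same input list as above:

```python
import numpy as np

as_float64 = np.array([1, 2, 3.14, 5])                    # default: float64
as_float32 = np.array([1, 2, 3.14, 5], dtype='float32')   # half the width

print(as_float64.itemsize, as_float32.itemsize)  # 8 4

# astype converts an existing array; float -> int drops the fractional part
as_int = as_float64.astype('int64')
print(as_int)  # [1 2 3 5]
```

Smaller types save memory at the price of precision, so the right choice depends on the data.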

NumPy arrays can explicitly be multidimensional. The inner lists are treated as rows of the resulting two-dimensional array.

In [5]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix
np.array(my_matrix)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [6]:
np.array([range(i,i+3) for i in [2,4,6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

## Creating arrays from scratch

In [7]:
np.zeros(10, dtype='int')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [8]:
np.ones((3,5), dtype='float')

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [9]:
np.full((3,5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [10]:
# Create an array filled with a linear sequence
# Starting at 0, ending before 100, stepping by 5
np.arange(0,100,5)

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
       85, 90, 95])

In [11]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0,1,5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
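
The two functions treat the end value differently: `np.arange` excludes the stop value, while `np.linspace` includes it by default. A brief sketch of the difference:

```python
import numpy as np

steps = np.arange(0, 100, 5)
print(steps[-1], len(steps))      # 95 20  (100 itself is not included)

points = np.linspace(0, 1, 5)
print(points[-1], len(points))    # 1.0 5  (the endpoint is included)

# endpoint=False makes linspace behave like a half-open interval
open_points = np.linspace(0, 1, 5, endpoint=False)
print(open_points)                # [0.  0.2 0.4 0.6 0.8]
```

As a rule of thumb, `np.arange` is convenient when the step is known and `np.linspace` when the number of points is known.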

In [12]:
# Creates an identity matrix
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [13]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3,3))

array([[0.1691855 , 0.23600427, 0.46380454],
       [0.95712938, 0.10819036, 0.45741494],
       [0.17866935, 0.05127164, 0.59649398]])

In [14]:
# Create a 4x4 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0,1,(4,4))

array([[ 0.55946514, -0.35912997,  0.11352896,  0.66508335],
       [ 0.83528782, -0.18429285,  1.23495724,  0.14877141],
       [-1.7834188 , -1.31922495,  1.49050798, -0.82963164],
       [ 1.4712052 , -0.75486763, -1.37462723, -1.36236274]])

In [15]:
# 3x3 array of random integers in the interval [0,10)
np.random.randint(0,10,(3,3))

array([[0, 9, 5],
       [7, 1, 8],
       [5, 3, 9]])
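
Random values differ on every run unless the generator is seeded. The sketch below uses `np.random.default_rng`, the generator interface available in NumPy 1.17 and later; the seed value 42 is arbitrary:

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

a = rng_a.integers(0, 10, size=(3, 3))   # integers in [0, 10)
b = rng_b.integers(0, 10, size=(3, 3))

print(np.array_equal(a, b))  # True
```

Seeding is useful when results need to be reproducible, for example in a tutorial or a test.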

## NumPy Array Attributes

In [16]:
np.random.seed(0)

x = np.random.randint(10, size=6) #One-dimensional array
y = np.random.randint(10, size=(3,4)) #Two-dimensional array
z = np.random.randint(10, size=(3,4,5)) #Three-dimensional array

In [17]:
print("z ndim: ", z.ndim)
print("z shape:", z.shape)
print("z size: ", z.size)

z ndim:  3
z shape: (3, 4, 5)
z size:  60


In [18]:
z.dtype

dtype('int32')

In [19]:
print(z.itemsize, 'bytes') #Lists the size of each array element

4 bytes


In [20]:
print(z.nbytes, 'bytes') #Lists the total size of the array

240 bytes
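
These attributes are related: `size` is the product of `shape`, and `nbytes` is `size` times `itemsize`. The sketch below fixes the `dtype` explicitly, since the default integer width varies by platform:

```python
import numpy as np

z = np.zeros((3, 4, 5), dtype='int32')

print(z.size == 3 * 4 * 5)               # True
print(z.nbytes == z.size * z.itemsize)   # True

# The same values stored as 64-bit integers take twice the memory
print(z.nbytes, z.astype('int64').nbytes)  # 240 480
```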


## Reshaping of Arrays

The `reshape` method returns an array containing the same data with a new shape. The new shape must have the same total number of elements as the original.

In [21]:
arr = np.arange(9)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [22]:
grid = arr.reshape(3,3)
grid

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
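
Two details of `reshape` are worth a quick sketch: one dimension can be given as `-1` so that NumPy infers it, and for a contiguous array like this one the result is a view that shares memory with the original:

```python
import numpy as np

arr = np.arange(9)
grid = arr.reshape(3, -1)     # -1 lets NumPy infer the second dimension
print(grid.shape)             # (3, 3)

grid[0, 0] = 99               # writing through the view...
print(arr[0])                 # 99  ...also changes the original

try:
    arr.reshape(2, 5)         # 9 elements cannot fill a 2x5 array
except ValueError as err:
    print("cannot reshape:", err)
```

If an independent array is needed, calling `.copy()` on the result avoids the shared memory.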

## Concatenation of Arrays

In [23]:
x

array([5, 0, 3, 3, 7, 9])

In [24]:
x1 = np.arange(0,6) 
x1

array([0, 1, 2, 3, 4, 5])

In [25]:
xConcatenate= np.concatenate([x,x1])
xConcatenate

array([5, 0, 3, 3, 7, 9, 0, 1, 2, 3, 4, 5])
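
`np.concatenate` also works on two-dimensional arrays, where the `axis` argument selects the direction. A minimal sketch with an illustrative 2x3 array:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

rows = np.concatenate([grid, grid])           # axis=0 (default): stack rows
cols = np.concatenate([grid, grid], axis=1)   # axis=1: stack columns
print(rows.shape, cols.shape)                 # (4, 3) (2, 6)

# vstack is convenient when the arrays have different dimensions
stacked = np.vstack([np.array([7, 8, 9]), grid])
print(stacked.shape)                          # (3, 3)
```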

## Splitting of arrays

In [26]:
x1,x2,x3 = np.split(x,[3,5])

In [27]:
print(x1,x2,x3)

[5 0 3] [3 7] [9]
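
The list passed to `np.split` gives the indices at which to cut, so N split points produce N + 1 subarrays. The sketch below shows that `[3, 5]` is equivalent to three slices of the same array:

```python
import numpy as np

x = np.array([5, 0, 3, 3, 7, 9])

parts = np.split(x, [3, 5])
manual = [x[:3], x[3:5], x[5:]]   # the same cuts written as slices

for part, expected in zip(parts, manual):
    print(part, np.array_equal(part, expected))
```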


## max, min, argmax, argmin

These methods find the maximum or minimum value of an array. Use `argmax` or `argmin` to find the index location of that value instead.

In [28]:
array = np.array([10, 12, 41, 17, 49,  2, 46,  3, 19, 39])

In [29]:
array.max()

49

In [30]:
array.min()

2

In [31]:
array.argmax()

4

In [32]:
array.argmin()

5

In [33]:
array.sum()

238
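
On multidimensional arrays, the same methods accept an `axis` argument. A short sketch with an illustrative 2x3 array `m`:

```python
import numpy as np

m = np.array([[1, 5, 3],
              [7, 2, 9]])

print(m.max(axis=0))      # [7 5 9]  maximum of each column
print(m.argmax(axis=1))   # [1 2]    column index of the maximum in each row

# Without axis, argmax returns an index into the flattened array;
# unravel_index converts it back to (row, column)
row, col = np.unravel_index(m.argmax(), m.shape)
print(int(row), int(col))  # 1 2
```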

In [34]:
arr = np.arange(0,10)
arr + arr

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [35]:
arr * arr

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [36]:
arr - arr

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [37]:
# Warning on division by zero, but not an error!
# Just replaced with nan
arr/arr

array([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [38]:
# Also a warning, not an error; the result is infinity
1/arr

array([       inf, 1.        , 0.5       , 0.33333333, 0.25      ,
       0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111])
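
If the warnings are expected, they can be silenced locally with `np.errstate`, or the division can be skipped for zero entries. The following is a sketch of both options, not the only way to handle it:

```python
import numpy as np

arr = np.arange(0, 10)

# Suppress the warnings only inside this block
with np.errstate(divide='ignore', invalid='ignore'):
    ratio = arr / arr      # 0/0 -> nan
    inverse = 1 / arr      # 1/0 -> inf

print(np.isnan(ratio[0]), np.isinf(inverse[0]))  # True True

# Divide only where arr is nonzero; `out` supplies the value used elsewhere
safe = np.divide(1, arr, out=np.zeros(10), where=arr != 0)
print(safe[0])  # 0.0
```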

In [39]:
arr**3

array([  0,   1,   8,  27,  64, 125, 216, 343, 512, 729], dtype=int32)