![Data science and machine learning software, 2015-2017](https://whatsthebigdata.files.wordpress.com/2017/06/data-science-machine-learning-software-2015-2017.jpg)

##**NumPy**


NumPy (Numerical Python) is a Python library for numerical computing on n-dimensional arrays, including linear algebra. Why learning NumPy is important:

1.   NumPy performs mathematical and logical operations on whole arrays at once.
2.   NumPy provides many functions for working with n-dimensional arrays and matrices in Python.
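
As a small illustration of the two points above, the sketch below applies one arithmetic and one logical operation to every element of an array at once. The variable names and sample values are made up for this example.

```python
import numpy as np

# hypothetical sample values, chosen only for illustration
temps_c = np.array([18.5, 21.0, 25.5, 30.0])

# mathematical operation applied element by element (Celsius to Fahrenheit)
temps_f = temps_c * 9 / 5 + 32

# logical operation applied element by element
is_warm = temps_c > 22

print(temps_f)
print(is_warm)
```

No explicit loop is needed; NumPy applies each operation across the whole array.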

Almost every data science and machine learning package in Python, such as SciPy (Scientific Python), Matplotlib (plotting library), and scikit-learn, depends on NumPy to some extent.

Documentation - https://numpy.org/doc/

###**Import NumPy and Pandas Libraries**

In [0]:
import pandas as pd
import numpy as np

In [0]:
print ("NumPy version - " + np.version.version)

NumPy version - 1.18.2


###**Creating Arrays**

####Creating 1D *NumPy Arrays*

In [0]:
#create 1D numpy array from a list
array_1d = np.array([1, 2, 3, 4, 5])

print (array_1d)

[1 2 3 4 5]


In [0]:
#create 1D numpy array from a list variable
list_1 = [11, 22, 33, 44, 55]
array_l1 = np.array(list_1)

print(array_l1)

[11 22 33 44 55]


In [0]:
#print the type of the NumPy array we created
print ("Array type = " + str(type(array_1d)))

#print the data type 
print ("Data type = " + str(array_1d.dtype))

Array type = <class 'numpy.ndarray'>
Data type = int64
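
The data type is inferred from the values passed in, and it can also be requested explicitly. A minimal sketch with illustrative variable names follows. The exact integer width (for example int64) can depend on the platform and NumPy version, so no output is shown.

```python
import numpy as np

ints = np.array([1, 2, 3])        # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])     # one float -> the whole array becomes float
as_float = np.array([1, 2, 3], dtype=np.float64)  # dtype requested explicitly

print(ints.dtype, mixed.dtype, as_float.dtype)
```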


In [0]:
#check the shape of the ndarray we created, it should have only 1 dimension
print(array_1d.shape)

(5,)


In [0]:
type(array_1d)

numpy.ndarray

In [0]:
#Demonstrating the Range function.
plain_range = range(4,12)
list(plain_range)

[4, 5, 6, 7, 8, 9, 10, 11]

Notice that the second argument (the stop value) is exclusive: the range above ends at 11, not 12.

In [0]:
interval_range = range(1, 15, 2)

In [0]:
list(interval_range)

[1, 3, 5, 7, 9, 11, 13]
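
NumPy has its own counterpart, np.arange, which follows the same start/stop/step convention as range but returns an array directly. A minimal sketch mirroring the two ranges above:

```python
import numpy as np

plain = np.arange(4, 12)       # same values as list(range(4, 12))
stepped = np.arange(1, 15, 2)  # same values as list(range(1, 15, 2))

print(plain)
print(stepped)
```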

### Your Turn!

1. Create your own Python List using the range function.
2. Convert the list into a NumPy array
3. Check the data type to make sure it is a NumPy array
4. Convert that array back into a list



In [0]:
list(range(15, 1, -2))

[15, 13, 11, 9, 7, 5, 3]

### Python Matrices (2D) to NumPy Array

In [0]:
mymatrix = [[4,2,9],[32,43,68],[73,8,9]]
mymatrix

[[4, 2, 9], [32, 43, 68], [73, 8, 9]]

In [0]:
#Create a matrix (2D array)
numpy_array = np.array(mymatrix)

In [0]:
numpy_array

array([[ 4,  2,  9],
       [32, 43, 68],
       [73,  8,  9]])

In [0]:
type(numpy_array)

numpy.ndarray

### Convert a NumPy Array into a Pandas DataFrame

In [0]:
df_from_array = pd.DataFrame(numpy_array)
df_from_array

    0   1   2
0   4   2   9
1  32  43  68
2  73   8   9


### Your Turn!

1. Create your own Python matrix
2. Convert the matrix into a NumPy array
3. Check the data type to make sure it is a NumPy array
4. Convert the NumPy array to a Pandas DataFrame

### Python Data Types

In [0]:
type(4)

int

In [0]:
type(4.0)

float

In [0]:
type('4353')

str

In [0]:
type('Hello')

str

Dates and times have their own type, provided by the datetime module:

In [0]:
import datetime
datetime_example = datetime.datetime.now()
print(datetime_example)

2020-03-27 17:28:02.885315


In [0]:
type(datetime_example)

datetime.datetime

In [0]:
another_list = [23, [34, 21], 'Google']
another_list

[23, [34, 21], 'Google']

###**Indexing, Slice indexing**

We can use slice indexing to pull out sub-regions of ndarrays.

More documentation can be found at https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html

In [0]:
#create 1D array
array_1d = np.array([1, 2, 3, 4, 5, 6, 7])
print (array_1d)

[1 2 3 4 5 6 7]


In [0]:
#because this is a 1D array, we need only one index to access the element at any position
#call the value at index 2
print ('element at index 2: ', array_1d[2])

#ndarrays are mutable, here we change an element at index 2 
array_1d[2] = 10
print ('element at index 2 after change: ', array_1d[2])

element at index 2:  3
element at index 2 after change:  10


In [0]:
print (array_1d)

[ 1  2 10  4  5  6  7]


In [0]:
#get the values in the range
print ('elements in the range: ', array_1d[1:3]) #a:b - including a, until (and excluding) b

#we can change values in the range as well
array_1d[1:3] = 10
print ('elements in the range after change: ', array_1d[1:3])

elements in the range:  [ 2 10]
elements in the range after change:  [10 10]


In [0]:
#for 2D arrays we need two indices: the first for the row and the second for the column
#create 2D array
array_2d = np.array([[11, 12, 13, 14, 15, 16, 17], [21, 22, 23, 24, 25, 26, 27], [31, 32, 33, 34, 35, 36, 37]])
print ('original array: \n', array_2d, '\n')

#slicing: generates an array of the same rank
slice_row = array_2d[1:2, :] #a:b - including a, until (and excluding) b
print ('sliced array on [1:2, :]: \n', slice_row)

original array: 
 [[11 12 13 14 15 16 17]
 [21 22 23 24 25 26 27]
 [31 32 33 34 35 36 37]] 

sliced array on [1:2, :]: 
 [[21 22 23 24 25 26 27]]


In [0]:
#a slice is a view, so changing the sliced array changes the original array too
slice_row[:, :] = 10

print ('sliced array: \n', slice_row, '\n')
print ('original array: \n', array_2d)


sliced array: 
 [[10 10 10 10 10 10 10]] 

original array: 
 [[11 12 13 14 15 16 17]
 [10 10 10 10 10 10 10]
 [31 32 33 34 35 36 37]]
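
When the original array has to stay unchanged, one option is to take an explicit copy of the slice. The sketch below uses a smaller made-up array to contrast a view with a copy. np.shares_memory reports whether two arrays use the same underlying data.

```python
import numpy as np

original = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])

view = original[1:2, :]                # basic slicing returns a view
independent = original[1:2, :].copy()  # .copy() allocates new memory

independent[:, :] = 0                  # does not touch `original`
print(original[1])                     # still [21 22 23]

view[:, :] = 10                        # writes through to `original`
print(original[1])                     # now [10 10 10]

print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False
```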


In [0]:
#we can do the slicing for columns as well
slice_col = array_2d[:, 1:5]
print (slice_col)

[[12 13 14 15]
 [10 10 10 10]
 [32 33 34 35]]


In [0]:
#we can use filters to select just those elements which meet certain criteria 
#select the elements that are greater than 10
slice_col[slice_col>10]

array([12, 13, 14, 15, 32, 33, 34, 35])
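
The expression inside the brackets is itself an array of booleans, often called a mask. A minimal sketch that builds the mask as a separate step, using the same values as the slice above (the names values, mask and selected are illustrative):

```python
import numpy as np

values = np.array([[12, 13, 14, 15],
                   [10, 10, 10, 10],
                   [32, 33, 34, 35]])

mask = values > 10       # boolean array with the same shape as `values`
print(mask)

selected = values[mask]  # 1D array of the elements where the mask is True
print(selected)
```

The result is always one-dimensional, because the selected elements need not form a rectangular block.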

In [0]:
#use similar logical filter to change elements in the array
#add 1 to all the odd values
slice_col[slice_col % 2 == 1] += 1
slice_col

array([[12, 14, 14, 16],
       [10, 10, 10, 10],
       [32, 34, 34, 36]])

###**Arithmetic array operations**

More documentation at https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html

In [0]:
#addition can be done with the plus sign or 'add' numpy function
arr_x = np.array([[10, 20, 30], [40, 50, 60]])
arr_y = np.array([[11, 21, 31], [41, 51, 61]])

print("Array X")
print(arr_x, '\n')

print("Array Y")
print(arr_y, '\n')

print("Direct Addition")
print (arr_x + arr_y, '\n')

print("Numpy Addition")
print (np.add(arr_x, arr_y))

Array X
[[10 20 30]
 [40 50 60]] 

Array Y
[[11 21 31]
 [41 51 61]] 

Direct Addition
[[ 21  41  61]
 [ 81 101 121]] 

Numpy Addition
[[ 21  41  61]
 [ 81 101 121]]


In [0]:
#subtraction works the same way
print("Direct Subtraction")
print (arr_y - arr_x, '\n')

print("Numpy Subtraction")
print (np.subtract(arr_y, arr_x))

Direct Subtraction
[[1 1 1]
 [1 1 1]] 

Numpy Subtraction
[[1 1 1]
 [1 1 1]]


In [0]:
#multiplication
print("Direct Multiplication")
print (arr_x * arr_y, '\n')

print("Numpy Multiplication")
print (np.multiply(arr_x, arr_y))

Direct Multiplication
[[ 110  420  930]
 [1640 2550 3660]] 

Numpy Multiplication
[[ 110  420  930]
 [1640 2550 3660]]
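
Note that * and np.multiply work element by element; they do not compute the matrix product from linear algebra. A minimal sketch contrasting the two on small made-up matrices, using the @ operator for the matrix product:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b  # multiplies matching positions
matrix_prod = a @ b  # rows of `a` combined with columns of `b`

print(elementwise)   # [[ 5 12] [21 32]]
print(matrix_prod)   # [[19 22] [43 50]]
```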


In [0]:
#division
print("Direct Division")
print (arr_y / arr_x, '\n')

print("Numpy Division")
print (np.divide(arr_y, arr_x))

Direct Division
[[1.1        1.05       1.03333333]
 [1.025      1.02       1.01666667]] 

Numpy Division
[[1.1        1.05       1.03333333]
 [1.025      1.02       1.01666667]]


In [0]:
#square root

print (np.sqrt(arr_x))

[[3.16227766 4.47213595 5.47722558]
 [6.32455532 7.07106781 7.74596669]]


In [0]:
#exponent (e**x)

print (np.exp(arr_x))

[[2.20264658e+04 4.85165195e+08 1.06864746e+13]
 [2.35385267e+17 5.18470553e+21 1.14200739e+26]]


In [0]:
#power

print (np.power(arr_x, 3))

[[  1000   8000  27000]
 [ 64000 125000 216000]]


###**Statistical methods**

More documentation at https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.statistics.html


In [0]:
#create a 4x6 matrix of random samples drawn from the standard normal distribution
arr = np.random.randn(4,6)
print (arr)

[[-0.46776425 -0.59193826  1.714382    0.66848482  0.42387325 -0.09313839]
 [-2.51965952  0.76993114  0.02091415 -0.02349162 -0.99492269 -1.00285551]
 [ 2.66771011  0.56007468  0.47088951  0.65709385  0.55297837  0.45115599]
 [ 1.23912292  0.9080372   1.20101258  1.778134    1.84565503  0.1600903 ]]


In [0]:
#compute the mean for all elements
print (arr.mean())

0.4331570687646356


In [0]:
#compute the mean by row
print (arr.mean(axis = 1))

[ 0.27564986 -0.62501401  0.89331708  1.18867534]


In [0]:
#compute mean by column
print (arr.mean(axis = 0))

[ 0.22985231  0.41152619  0.85179956  0.77005526  0.45689599 -0.1211869 ]


In [0]:
#compute the median for all elements, by row or column can be done the same way as for mean
print (np.median(arr))

0.5119339391015164


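*What does axis mean?* With axis = 0 the operation runs down the rows and returns one value per column; with axis = 1 it runs across the columns and returns one value per row. More at https://stackoverflow.com/questions/22149584/what-does-axis-in-pandas-mean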
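
The axis argument is easiest to check on an array small enough to verify by hand. A minimal sketch with made-up values:

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

col_means = small.mean(axis=0)  # collapses the rows: one value per column
row_means = small.mean(axis=1)  # collapses the columns: one value per row

print(col_means)  # [2.5 3.5 4.5]
print(row_means)  # [2. 5.]
```

The axis that is named is the one that disappears from the shape of the result: a (2, 3) array becomes (3,) with axis=0 and (2,) with axis=1.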


In [0]:
#sum of all the elements, by row or column can be done the same way as for mean
print (np.sum(arr))

10.395769650351255


###**Read from, or Write to Disk**

In [0]:
np.savetxt('array.txt', X = arr, delimiter = ',')

In [0]:
np.loadtxt('array.txt', delimiter = ',')

array([[-0.46776425, -0.59193826,  1.714382  ,  0.66848482,  0.42387325,
        -0.09313839],
       [-2.51965952,  0.76993114,  0.02091415, -0.02349162, -0.99492269,
        -1.00285551],
       [ 2.66771011,  0.56007468,  0.47088951,  0.65709385,  0.55297837,
         0.45115599],
       [ 1.23912292,  0.9080372 ,  1.20101258,  1.778134  ,  1.84565503,
         0.1600903 ]])
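
To confirm that saving and loading return the same values, the round trip can be checked programmatically. The sketch below is one way to do it. It writes to a temporary directory so nothing is left behind, and it uses a seeded random generator (an assumption made here for repeatability, not part of the cells above).

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is repeatable
original = rng.standard_normal((4, 6))  # 4x6 matrix, like the one above

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "array.txt")
    np.savetxt(path, original, delimiter=",")   # write as comma-separated text
    restored = np.loadtxt(path, delimiter=",")  # read it back

print(restored.shape)
print(np.allclose(original, restored))
```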

##**Pandas**
Pandas is an open source Python library for data analysis that is built on top of NumPy. Why learning Pandas is important:

1.   Pandas allows you to do fast analysis as well as data cleaning and preparation.
2.   Pandas works well with data from a wide variety of sources, such as an Excel sheet, a CSV file, a SQL database, or even a web page.

Documentation - https://pandas.pydata.org/pandas-docs/version/0.25.3/

###**Pandas Series**

A Pandas Series is a one-dimensional labeled array that can hold data of any type.

In [0]:
#creating a pandas Series from a list
ls = ['a', 'b', 'c', 'd', 'e']
ser_1 = pd.Series(ls)

print (ser_1)

0    a
1    b
2    c
3    d
4    e
dtype: object


In [0]:
#creating a pandas Series from an array
arr = np.array([10, 20, 30, 40, 50])
ser_2 = pd.Series(arr)

print (ser_2)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [0]:
#create series with specific indexing
ser_3 = pd.Series(arr, index = ['sarah', 'bob', 'alex', 'den', 'nancy'])

print (ser_3)

sarah    10
bob      20
alex     30
den      40
nancy    50
dtype: int64


In [0]:
#accessing elements by integer position (use .iloc when the index holds labels)
print (ser_3.iloc[3], '\n\n')
print (ser_3.iloc[[0, 2, 4]], '\n\n')
print (ser_3.iloc[:3])

40 


sarah    10
alex     30
nancy    50
dtype: int64 


sarah    10
bob      20
alex     30
dtype: int64


In [0]:
#accessing element using an index label
print (ser_3['den'], '\n\n')
print (ser_3[['den', 'alex', 'sarah']])

40 


den      40
alex     30
sarah    10
dtype: int64


### Create a Pandas DataFrame

Pandas DataFrame is a two-dimensional labeled data structure.

In [0]:
#Creating from a list of lists
pet_info = [['Blain', 10, 'Dog'], ['Lucy', 4 , 'Cat'], ['Cinco', 8, 'Rabbit']]
pet_info

[['Blain', 10, 'Dog'], ['Lucy', 4, 'Cat'], ['Cinco', 8, 'Rabbit']]

In [0]:
pet_df = pd.DataFrame(pet_info, columns = ['Name', 'Age', 'Type'],
                      index = ['A', 'B', 'C'])
                     
pet_df

    Name  Age    Type
A  Blain   10     Dog
B   Lucy    4     Cat
C  Cinco    8  Rabbit
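
A list of lists stores the data row by row. The same table can also be described column by column with a dictionary that maps each column name to its values. A minimal sketch (the names pet_columns and pet_df_from_dict are made up for this example):

```python
import pandas as pd

# each key becomes a column name, each list becomes that column's values
pet_columns = {
    'Name': ['Blain', 'Lucy', 'Cinco'],
    'Age': [10, 4, 8],
    'Type': ['Dog', 'Cat', 'Rabbit'],
}
pet_df_from_dict = pd.DataFrame(pet_columns, index=['A', 'B', 'C'])

print(pet_df_from_dict)
```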


In [0]:
pet_df['Name']

A    Blain
B     Lucy
C    Cinco
Name: Name, dtype: object

### Your Turn!

Print out the 'Type' Column

In [0]:
type(pet_df)

pandas.core.frame.DataFrame

In [0]:
pet_df.dtypes

Name    object
Age      int64
Type    object
dtype: object

### Your Turn!



In [0]:
#Create a list of lists that contain [[Month, Month #(Dec=12, etc.), Season]]
#Create a Pandas DataFrame with the columns = ['Month', 'Int', 'Season']

### iloc vs. loc

iloc - position-based indexing (index always starts with 0)

loc - label-based indexing

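If you don't specify an index, Pandas assigns the default integer index 0, 1, 2, ..., so iloc and loc select the same single rows. Slices still differ: iloc excludes the stop position, while loc includes the stop label.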
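
The two accessors can be compared directly on a frame that has no explicit index. A minimal sketch with a small hypothetical DataFrame:

```python
import pandas as pd

pets = pd.DataFrame({'Name': ['Blain', 'Lucy', 'Cinco'],
                     'Age': [10, 4, 8]})  # default index 0, 1, 2

# a single row looks identical through both accessors
same_row = pets.iloc[1].equals(pets.loc[1])

by_position = pets.iloc[0:2]  # positions 0 and 1 (stop excluded)
by_label = pets.loc[0:2]      # labels 0, 1 and 2 (stop included)

print(same_row)               # True
print(len(by_position))       # 2
print(len(by_label))          # 3
```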

In [0]:
pet_df.iloc[0]

Name    Blain
Age        10
Type      Dog
Name: A, dtype: object

In [0]:
pet_df.loc['B']

Name    Lucy
Age        4
Type     Cat
Name: B, dtype: object

### Apply Concepts to a Data Set

In [0]:
from sklearn.datasets import load_iris
iris = load_iris()
data = iris.data
column_names = iris.feature_names

In [0]:
#columns
column_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [0]:
#rows
len(data)

150

### Slicing Arrays, Lists, and Tuples

Indexing always starts at 0, which means the 1st element is [0] and the 2nd element is [1]. A slice start:stop:step includes start and excludes stop.
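
The same slice notation works on lists, tuples, and NumPy arrays alike. A short sketch with made-up values:

```python
import numpy as np

letters_list = ['a', 'b', 'c', 'd', 'e', 'f']
letters_tuple = tuple(letters_list)
letters_array = np.array(letters_list)

print(letters_list[2:5])     # ['c', 'd', 'e']
print(letters_tuple[0:6:2])  # ('a', 'c', 'e')
print(letters_array[:3])     # ['a' 'b' 'c']
```

Each slice returns the same kind of object it was taken from: a list, a tuple, or an array.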

In [0]:
data[:10] 


array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [0]:
data[2:5]

array([[4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [0]:
data[0:10:2] #intervals

array([[5.1, 3.5, 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [4.6, 3.4, 1.4, 0.3],
       [4.4, 2.9, 1.4, 0.2]])

In [0]:
#Convert an array to a Pandas DataFrame
df = pd.DataFrame(data, columns = column_names)
df

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
...                ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3


In [0]:
#set up PyDrive authentication for saving to Google Drive from Google Colab
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


In [0]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
gdrive = GoogleDrive(gauth)


In [0]:
#note: df must already be defined before running this cell

from google.colab import drive
drive.mount('/content/drive')

#write the DataFrame to a CSV file on the mounted drive
df.to_csv('/content/drive/My Drive/WWCode/data/iris.csv')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
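
Outside Colab, the same idea works with any local path. The sketch below is one way to write a DataFrame to CSV and read it back. It uses a temporary directory, a small made-up frame, and a hypothetical file name.

```python
import os
import tempfile

import pandas as pd

# small made-up frame standing in for the iris DataFrame
small_df = pd.DataFrame({'sepal length (cm)': [5.1, 4.9],
                         'sepal width (cm)': [3.5, 3.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'iris_sample.csv')
    small_df.to_csv(csv_path, index=False)  # index=False omits the row labels
    reloaded = pd.read_csv(csv_path)

print(reloaded.equals(small_df))
```

Without index=False, the row labels are written as an extra unnamed first column, which then appears as "Unnamed: 0" when the file is read back.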


In [0]:
df.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


In [0]:
df.tail()

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8


In [0]:
df.describe()

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count              150.0             150.0              150.0             150.0
mean            5.843333          3.057333              3.758          1.199333
std             0.828066          0.435866           1.765298          0.762238
min                  4.3               2.0                1.0               0.1
25%                  5.1               2.8                1.6               0.3
50%                  5.8               3.0               4.35               1.3
75%                  6.4               3.3                5.1               1.8
max                  7.9               4.4                6.9               2.5


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB


In [0]:
df.loc[140:149]

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
140                6.7               3.1                5.6               2.4
141                6.9               3.1                5.1               2.3
142                5.8               2.7                5.1               1.9
143                6.8               3.2                5.9               2.3
144                6.7               3.3                5.7               2.5
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8


In [0]:
df.mode()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.0               3.0                1.4               0.2
1                NaN               NaN                1.5               NaN


In [0]:
df.mean()

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
dtype: float64

In [0]:
df['sepal length (cm)'].mean()

5.843333333333335

### Your turn!

Find the mean petal width

In [0]:
df.groupby(['sepal width (cm)']).mean()

                  sepal length (cm)  petal length (cm)  petal width (cm)
sepal width (cm)
2.0                             5.0                3.5               1.0
2.2                        6.066667                4.5          1.333333
2.3                           5.325               3.25             0.975
2.4                             5.3                3.6          1.033333
2.5                          5.7625             4.5125              1.55
2.6                            6.16               4.88              1.42
2.7                        5.855556           4.622222          1.555556
2.8                        6.335714           5.042857          1.707143
2.9                            6.06               4.35              1.32
3.0                        6.015385           4.234615          1.403846

### Your Turn! 

In [0]:
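#note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older scikit-learn version
from sklearn.datasets import load_boston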
boston = load_boston()
data = boston.data
column_names = boston.feature_names

In [0]:
data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

In [0]:
df = pd.DataFrame(data, columns = column_names)
df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48
