# Welcome Pythoneers!

In this lesson, we will cover

*   Recap of Python containers (Lists, Dictionaries)
*   Introduction to NumPy arrays for scientific computing (arrays, indexing, array arithmetic)
*   Introduction to DataFrames and data manipulation with Pandas

Author: Mohammad Zul Fitri, Skymind Holdings Berhad

## A Brief Overview on Python Containers

### Lists

A list is the first built-in container type that we looked at in this course. It is similar to an array, but its size is not fixed and it can hold elements of different data types.

In [None]:
x = [1, 4, 3, 5, 2] # Create a list of 5 elements and assign it to x
print(x, x[3])  # List indexing starts at 0, so the n-th element is x[n-1]; x[3] is the 4th element

[1, 4, 3, 5, 2] 5


You can also print a list element using negative indices. This will count from the end.

In [None]:
print(x[-3], x[-1])

3 2


As mentioned before, a list can contain elements of different data types:

In [None]:
x[4] = 'hello'
print(x)

[1, 4, 3, 5, 'hello']


Other things we can do with lists:

In [None]:
x.append('world') # .append will add a new element to the end of your list
print(x)

[1, 4, 3, 5, 'hello', 'world']


In [None]:
xpop = x.pop() # .pop will remove the last element of your list and will return it as well
print(xpop,x)

world [1, 4, 3, 5, 'hello']


We can access a sublist of elements instead of just accessing a single element; this is done by slicing the list

In [None]:
myNumbers = list(range(10))
print(myNumbers)          
print(myNumbers[1:3])     # Slice list from index 1 to 3, excluding element at index 3
print(myNumbers[:5])      # Slice list from the start to index 5, excluding element at index 5
print(myNumbers[5:])      # Slice list from index 5 to the end
print(myNumbers[:])       # Slice the whole list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2]
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
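
Slices also accept an optional third value, the step. The following is a minimal sketch of the `start:stop:step` form; the variable names are illustrative and not part of the lesson.

```python
my_numbers = list(range(10))

every_second = my_numbers[::2]        # Step of 2, starting from index 0
reversed_numbers = my_numbers[::-1]   # A negative step walks the list backwards
stepped = my_numbers[1:8:3]           # Indices 1, 4 and 7

print(every_second)      # [0, 2, 4, 6, 8]
print(reversed_numbers)  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(stepped)           # [1, 4, 7]
```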


We can also loop through list elements as follows:

In [None]:
fruits = ['apple', 'banana', 'durian', 'grape', 'mango']
for fruit in fruits:
    print(fruit)

apple
banana
durian
grape
mango


In [None]:
for idx, fruit in enumerate(fruits):
    print('{}: {}'.format(idx+1, fruit))

1: apple
2: banana
3: durian
4: grape
5: mango


Suppose we want to transform the values of a list and store the results in a new list, for example by squaring each element. This can be done as follows:

In [None]:
num = list(range(4))
print(num)
squared = []
for n in num:
    squared.append(n**2)  # Add the squared value into empty list
print(squared)

[0, 1, 2, 3]
[0, 1, 4, 9]


Alternatively, we can simplify this process by:

In [None]:
num = list(range(4))
squared = [n**2 for n in num]
print(num)
print(squared)

[0, 1, 2, 3]
[0, 1, 4, 9]


This is called a list comprehension. We can extend it by adding a condition. Suppose we now want only the squared values of the odd elements:

In [None]:
num = list(range(9))
odd_squared = [n**2 for n in num if n%2!=0]
print(num)
print(odd_squared)

[0, 1, 2, 3, 4, 5, 6, 7, 8]
[1, 9, 25, 49]
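
An `if` at the end of a comprehension filters elements out, as above. A conditional expression placed before the `for` instead chooses a value for every element. The sketch below illustrates the second form; the `parity` list is a made-up example, not part of the lesson.

```python
num = list(range(6))

# The conditional expression decides what value each element maps to;
# no element is dropped, so the result has the same length as num
parity = ['even' if n % 2 == 0 else 'odd' for n in num]

print(parity)  # ['even', 'odd', 'even', 'odd', 'even', 'odd']
```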


### Dictionaries

A dictionary is another built-in data structure in Python. It stores key-value pairs, where each key maps to its corresponding value.

In [None]:
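fruit_by_colour = {'red':'apple','yellow':'banana','green':'durian'} # Curly brackets create a dictionary; naming it `dict` would shadow the built-in type
print(fruit_by_colour['yellow'])
print(fruit_by_colour['green'])
print('blue' in fruit_by_colour) # Membership tests check the keys

banana
durian
False


In [None]:
fruit_by_colour['purple'] = 'grape' # Add a new key-value pair
print(fruit_by_colour.get('purple', 'N/A')) # .get returns the value, or the given default if the key is missing
print(fruit_by_colour.get('blue', 'N/A'))

grape
N/A


In [None]:
for colour, fruit in fruit_by_colour.items():
    print('{} is {}'.format(fruit, colour))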

apple is red
banana is yellow
durian is green
grape is purple
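
Besides `.items()`, a dictionary exposes its keys and values separately, and entries can be removed. The following is a small sketch using a hypothetical `fruit_by_colour` dictionary defined inside the block.

```python
fruit_by_colour = {'red': 'apple', 'yellow': 'banana', 'green': 'durian'}

print(list(fruit_by_colour.keys()))    # ['red', 'yellow', 'green']
print(list(fruit_by_colour.values()))  # ['apple', 'banana', 'durian']

removed = fruit_by_colour.pop('yellow')  # .pop removes the key and returns its value
del fruit_by_colour['green']             # del removes the key without returning anything
print(removed, fruit_by_colour)          # banana {'red': 'apple'}
```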


Finally, just as lists have list comprehensions, dictionaries have dictionary comprehensions:

In [None]:
num = list(range(9))
odd_squared = {n: n**2 for n in num if n%2!=0}
print(odd_squared)

{1: 1, 3: 9, 5: 25, 7: 49}
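
A dictionary comprehension can also build a mapping from two parallel lists, or invert an existing mapping. This is an illustrative sketch; the lists below are assumptions chosen to mirror the fruit examples earlier in the lesson.

```python
colours = ['red', 'yellow', 'green']
fruits = ['apple', 'banana', 'durian']

# zip pairs up the elements of both lists position by position
fruit_by_colour = {colour: fruit for colour, fruit in zip(colours, fruits)}

# Swap keys and values (safe here because every fruit is unique)
colour_by_fruit = {fruit: colour for colour, fruit in fruit_by_colour.items()}

print(fruit_by_colour)  # {'red': 'apple', 'yellow': 'banana', 'green': 'durian'}
print(colour_by_fruit)  # {'apple': 'red', 'banana': 'yellow', 'durian': 'green'}
```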


## NumPy

While lists offer the flexibility of heterogeneity, most of the time it is more useful to have a *container* that holds a single, fixed data type. Python's built-in `array` module can store such data efficiently, but NumPy goes further by also providing efficient operations on that data.

NumPy is an important library for scientific computing in Python. Its array object is called `ndarray`, which stands for n-dimensional array, and the library also provides tools to manipulate the items in an array. We can start using NumPy by writing the following:

In [None]:
import numpy as np # `np` is the conventional alias for NumPy
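
To make the difference between a list and an array concrete, the sketch below applies the same `*` operator to both. It is an added illustration, and the variable names are not from the lesson.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

print(values * 2)  # A list is repeated: [1, 2, 3, 1, 2, 3]
print(arr * 2)     # An array is multiplied elementwise: [2 4 6]
print(arr.dtype)   # Every element of an array shares one data type
```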

### Creating Arrays

#### Using Python Lists

In [None]:
np.array([2, 7, 4, 3, 1])

array([2, 7, 4, 3, 1])

In [None]:
np.array(list(range(4)))

array([0, 1, 2, 3])

In [None]:
np.array(list(range(4)),dtype='float')

array([0., 1., 2., 3.])

In [None]:
# Creating multidimensional array using nested list
np.array([[1,2],[5,3],[4,8]])

array([[1, 2],
       [5, 3],
       [4, 8]])

In [None]:
np.array([range(i, i+2) for i in [1, 2, 3]]) # Each row is range(i, i+2), i.e. [i, i+1]

array([[1, 2],
       [2, 3],
       [3, 4]])
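
Once an array exists, its attributes describe its structure. The following sketch inspects the nested-list array from above; the name `m` is only illustrative.

```python
import numpy as np

m = np.array([[1, 2], [5, 3], [4, 8]])

print(m.ndim)   # Number of dimensions: 2
print(m.shape)  # Size along each dimension: (3, 2)
print(m.size)   # Total number of elements: 6
print(m.dtype)  # Data type shared by all elements (an integer type here)
```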

#### Using functions

In [None]:
# Make an array full of 0's
np.zeros((5,2),dtype=int)

array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

In [None]:
# Make an array full of 1's
np.ones((2,3),dtype=int)

array([[1, 1, 1],
       [1, 1, 1]])

In [None]:
# Make a sequential array
# More or less similar to range() function
np.arange(0, 16, 4)

array([ 0,  4,  8, 12])

In [None]:
# Make an evenly spaced array
np.linspace(0, 1, 9)

array([0.   , 0.125, 0.25 , 0.375, 0.5  , 0.625, 0.75 , 0.875, 1.   ])

In [None]:
# Make 3x5 array filled with the same value i.e. 2.13
np.full((3, 5), 2.13)

array([[2.13, 2.13, 2.13, 2.13, 2.13],
       [2.13, 2.13, 2.13, 2.13, 2.13],
       [2.13, 2.13, 2.13, 2.13, 2.13]])

In [None]:
# Make 3x3 array of uniformly distributed values
# values generated randomly to be between 0 and 1
np.random.random((3, 3))

array([[0.74622101, 0.49550127, 0.91861552],
       [0.34720072, 0.84717542, 0.91196066],
       [0.68392369, 0.84620011, 0.67164692]])

In [None]:
# Make 3x3 array of random integers
# values generated randomly to be in [0, 20)
np.random.randint(0, 20, (3, 3))

array([[ 5,  2,  3],
       [10,  1,  9],
       [ 0, 16, 12]])

In [None]:
# Make 3x3 array of normally distributed values
# values generated randomly with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 0.45446536,  0.06578045,  0.14638267],
       [ 1.22230698,  0.75575012,  1.42736728],
       [-0.20809488,  0.88065748, -0.09181724]])
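
The random arrays above will differ on every run. If reproducible values are needed, one option is NumPy's `Generator` interface with a fixed seed. This is a sketch of an alternative to the `np.random.*` calls used in the lesson, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.random((3, 3))  # Uniform values in [0, 1)
sample_b = rng_b.random((3, 3))
print(np.array_equal(sample_a, sample_b))  # True

ints = rng_a.integers(0, 20, (3, 3))  # Random integers in [0, 20)
normals = rng_a.normal(0, 1, (3, 3))  # Mean 0, standard deviation 1
print(ints.shape, normals.shape)
```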

In [None]:
# Make 5x5 identity matrix
np.eye(5, dtype=int)

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

### Array Indexing

In [None]:
import numpy as np

#### Basics

In [None]:
a = np.arange(15)
print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]


In [None]:
b = a.reshape(3, 5) # .reshape(u, v) requires u*v to equal the number of elements in the original array
print(b)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]


In [None]:
# Get last row
print(b[-1])

[10 11 12 13 14]


In [None]:
# Get last column
print(b.T[-1]) # .T is the transpose attribute: it swaps rows and columns, so the last row of b.T is the last column of b

[ 4  9 14]
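
The last column can also be selected directly, without transposing, by slicing every row and indexing the last column. A minimal sketch, rebuilding `b` so the block runs on its own:

```python
import numpy as np

b = np.arange(15).reshape(3, 5)

last_col = b[:, -1]  # ':' keeps all rows, -1 picks the last column
print(last_col)      # [ 4  9 14]
print(np.array_equal(last_col, b.T[-1]))  # True
```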


In [None]:
# Get the element at 2nd row and 3rd column, 7
print(b[1, 2])

7


#### Get elements in a 1-D array

In [None]:
x = np.arange(6)

print(x,'\n')
print(x[:4],'\n')
print(x[2:6],'\n')
print(x[-5:])

[0 1 2 3 4 5] 

[0 1 2 3] 

[2 3 4 5] 

[1 2 3 4 5]


#### Array Slicing

In [None]:
a = np.array([[10, 20, 30, 40], [-5, -6, -7, -8], [9, 10, 11, 12]])
b = a[:2, 1:4]
print(a)
print('\n')
print(b)

[[10 20 30 40]
 [-5 -6 -7 -8]
 [ 9 10 11 12]]


[[20 30 40]
 [-6 -7 -8]]
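
One detail worth knowing is that a slice of a NumPy array is a view onto the same data, so modifying the slice also modifies the original array. The sketch below demonstrates this and shows `.copy()` as a way to get independent data; the names `view` and `independent` are illustrative.

```python
import numpy as np

a = np.array([[10, 20, 30, 40], [-5, -6, -7, -8], [9, 10, 11, 12]])

view = a[:2, 1:4]
view[0, 0] = 99   # Writes through to a[0, 1]
print(a[0, 1])    # 99

independent = a[:2, 1:4].copy()
independent[0, 0] = 0  # Only the copy changes
print(a[0, 1])         # 99
```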


In [None]:
# Get middle row

middle1 = a[1, :]
middle2 = a[[1], :]

print(middle1, middle1.shape)
print(middle2, middle2.shape)

[-5 -6 -7 -8] (4,)
[[-5 -6 -7 -8]] (1, 4)


#### Other Indexing

In [None]:
# Integer indexing

u = np.array([[10, 20, 30, 40], [-5, -6, -7, -8], [9, 10, 11, 12]])

print(u)
print(u[[0, 1, 0, 2], [1, 0, 3, 2]])

[[10 20 30 40]
 [-5 -6 -7 -8]
 [ 9 10 11 12]]
[20 -5 40 11]


In [None]:
# Boolean indexing

bool_index = (u%2==0)
print(bool_index)

[[ True  True  True  True]
 [False  True False  True]
 [False  True False  True]]


In [None]:
print(u[bool_index])

# Alternatively:

print(u[u%2==0])

[10 20 30 40 -6 -8 10 12]
[10 20 30 40 -6 -8 10 12]
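
Boolean conditions can be combined with `&` (and) and `|` (or), and a mask can also be used on the left-hand side of an assignment. The following is a sketch under those assumptions; `clipped` is a made-up name for the modified copy.

```python
import numpy as np

u = np.array([[10, 20, 30, 40], [-5, -6, -7, -8], [9, 10, 11, 12]])

# Each condition needs its own parentheses when combined with &
mask = (u > 0) & (u % 2 == 0)
print(u[mask])  # [10 20 30 40 10 12]

# Assign through a mask: replace every negative value with 0
clipped = u.copy()
clipped[clipped < 0] = 0
print(clipped)
```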


### Array Arithmetic

In [None]:
import numpy as np

We can perform basic mathematical operations on our arrays; these are computed elementwise.

In [None]:
u = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
v = np.array([[14, 16, 18], [8, 10, 12], [2, 4, 6]])

print(u)
print(v)
print('\n')

print(u+v)
print('\n')
print(np.add(u,v))

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[[14 16 18]
 [ 8 10 12]
 [ 2  4  6]]


[[15 18 21]
 [12 15 18]
 [ 9 12 15]]


[[15 18 21]
 [12 15 18]
 [ 9 12 15]]


In [None]:
print(u-v)
print('\n')
print(np.subtract(u,v))

[[-13 -14 -15]
 [ -4  -5  -6]
 [  5   4   3]]


[[-13 -14 -15]
 [ -4  -5  -6]
 [  5   4   3]]


In [None]:
print(u*v)
print('\n')
print(np.multiply(u,v))

[[14 32 54]
 [32 50 72]
 [14 32 54]]


[[14 32 54]
 [32 50 72]
 [14 32 54]]


In [None]:
print(u/v)
print('\n')
print(np.divide(u,v))

[[0.07142857 0.125      0.16666667]
 [0.5        0.5        0.5       ]
 [3.5        2.         1.5       ]]


[[0.07142857 0.125      0.16666667]
 [0.5        0.5        0.5       ]
 [3.5        2.         1.5       ]]


In [None]:
# Inner product (vector-vector) of vectors a, b

a = np.array([0, 3])
b = np.array([1,5])

print(a)
print(b)

x = np.array([[1, 2], [3, 4]])
y = np.array([[2, 4], [3, 6]])

print(x)
print(y)

print(a.dot(b))
print(np.dot(a, b))

[0 3]
[1 5]
[[1 2]
 [3 4]]
[[2 4]
 [3 6]]
15
15


In [None]:
# Vector-Matrix product

print(a.dot(x))
print(np.dot(a, x))

[ 9 12]
[ 9 12]


In [None]:
# Matrix-Matrix product

print(x.dot(y))
print(np.dot(x, y))

[[ 8 16]
 [18 36]]
[[ 8 16]
 [18 36]]
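
Note that `*` on arrays is elementwise multiplication, not the matrix product. For 2-D arrays, the `@` operator gives the same result as `.dot`. A short sketch contrasting the two, reusing the `x` and `y` values from above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[2, 4], [3, 6]])

print(x * y)  # Elementwise: [[ 2  8] [ 9 24]]
print(x @ y)  # Matrix product: [[ 8 16] [18 36]]
```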


In [None]:
print(y)
print(np.sum(y))
print(np.sum(y, axis=0)) # Column wise sum
print(np.sum(y, axis=1)) # Row wise sum

[[2 4]
 [3 6]]
15
[ 5 10]
[6 9]


In [None]:
print(np.min(y))
print(np.max(y))
print(np.mean(y))

2
6
3.75
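
Like `np.sum`, the other reductions accept an `axis` argument. The sketch below applies `np.mean` and `np.max` per column and per row of the same `y`.

```python
import numpy as np

y = np.array([[2, 4], [3, 6]])

col_means = np.mean(y, axis=0)  # Mean of each column
row_max = np.max(y, axis=1)     # Maximum of each row

print(col_means)  # [2.5 5. ]
print(row_max)    # [4 6]
```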


## Pandas

This package provides data frames: essentially two-dimensional arrays with attached row and column labels. It acts as an interface for labeled data and also provides tools for data manipulation and other operations. It has two primary objects, `Series` and `DataFrame`, both built upon NumPy's `ndarray`.



In [None]:
import pandas as pd # `pd` is the conventional alias for Pandas

### Pandas Series
A Series is a one-dimensional array of values with an attached index.



In [None]:
import pandas as pd

In [None]:
series = pd.Series([10, -12, 14, 16, -18, 20])

print(series)

0    10
1   -12
2    14
3    16
4   -18
5    20
dtype: int64


In [None]:
series.index

RangeIndex(start=0, stop=6, step=1)

In [None]:
series.values

array([ 10, -12,  14,  16, -18,  20])

In [None]:
series[2]

14

In [None]:
series[2:5]

2    14
3    16
4   -18
dtype: int64
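
Because a Series is backed by a NumPy array, elementwise arithmetic and boolean filtering work much as they do for arrays, while the index is carried along. A minimal sketch, rebuilding `series` so the block is self-contained:

```python
import pandas as pd

series = pd.Series([10, -12, 14, 16, -18, 20])

doubled = series * 2            # Elementwise arithmetic keeps the index
positives = series[series > 0]  # Boolean mask keeps matching rows and their labels

print(doubled.tolist())  # [20, -24, 28, 32, -36, 40]
print(positives)
```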

### Generalized Pandas Series

In [None]:
ds = pd.Series([.1, .25, .3, .45, .5, .65], index= ['u', 'v', 'w', 'x', 'y', 'z'])

ds

u    0.10
v    0.25
w    0.30
x    0.45
y    0.50
z    0.65
dtype: float64

In [None]:
ds['x']

0.45

In [None]:
# Creating a Series from a Python dictionary

dailyTemp = {'Monday': 27, 'Tuesday': 26.7, 'Wednesday': 26.9, 'Thursday': 27.3, 'Friday': 26.4, 'Saturday': 26.9, 'Sunday': 27.5}

temp = pd.Series(dailyTemp)
temp

Monday       27.0
Tuesday      26.7
Wednesday    26.9
Thursday     27.3
Friday       26.4
Saturday     26.9
Sunday       27.5
dtype: float64

In [None]:
temp['Friday']

26.4

In [None]:
# Slicing a Series by label (unlike positional slices, both endpoints are included)

temp['Wednesday':'Friday']

Wednesday    26.9
Thursday     27.3
Friday       26.4
dtype: float64
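
To make the choice between labels and positions explicit, Pandas provides the `.loc` (label-based) and `.iloc` (position-based) indexers. The sketch below contrasts the two on the same temperature data; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

temp = pd.Series({'Monday': 27, 'Tuesday': 26.7, 'Wednesday': 26.9,
                  'Thursday': 27.3, 'Friday': 26.4, 'Saturday': 26.9,
                  'Sunday': 27.5})

by_label = temp.loc['Wednesday':'Friday']  # Label slice: end label is included
by_position = temp.iloc[2:4]               # Position slice: end position is excluded

print(by_label.index.tolist())     # ['Wednesday', 'Thursday', 'Friday']
print(by_position.index.tolist())  # ['Wednesday', 'Thursday']
```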

### DataFrames

In [None]:
temp

Monday       27.0
Tuesday      26.7
Wednesday    26.9
Thursday     27.3
Friday       26.4
Saturday     26.9
Sunday       27.5
dtype: float64

In [None]:
daily_weather = {'Monday': 'sunny', 'Tuesday': 'cloudy', 'Wednesday': 'sunny', 'Thursday': 'sunny', 'Friday': 'rainy', 'Saturday': 'sunny', 'Sunday': 'sunny'}

weather = pd.Series(daily_weather)
weather

Monday        sunny
Tuesday      cloudy
Wednesday     sunny
Thursday      sunny
Friday        rainy
Saturday      sunny
Sunday        sunny
dtype: object

In [None]:
day = pd.DataFrame({'Temperature':temp, 'Weather':weather})
day

           Temperature  Weather
Monday            27.0    sunny
Tuesday           26.7   cloudy
Wednesday         26.9    sunny
Thursday          27.3    sunny
Friday            26.4    rainy
Saturday          26.9    sunny
Sunday            27.5    sunny


In [None]:
day.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Monday to Sunday
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Temperature  7 non-null      float64
 1   Weather      7 non-null      object 
dtypes: float64(1), object(1)
memory usage: 488.0+ bytes


In [None]:
day.describe()

       Temperature
count     7.000000
mean     26.957143
std       0.364496
min      26.400000
25%      26.800000
50%      26.900000
75%      27.150000
max      27.500000


In [None]:
day.head()

           Temperature  Weather
Monday            27.0    sunny
Tuesday           26.7   cloudy
Wednesday         26.9    sunny
Thursday          27.3    sunny
Friday            26.4    rainy


In [None]:
day.tail()

           Temperature  Weather
Wednesday         26.9    sunny
Thursday          27.3    sunny
Friday            26.4    rainy
Saturday          26.9    sunny
Sunday            27.5    sunny


In [None]:
day.head(2)

         Temperature  Weather
Monday          27.0    sunny
Tuesday         26.7   cloudy


In [None]:
day.tail(3)

          Temperature  Weather
Friday           26.4    rainy
Saturday         26.9    sunny
Sunday           27.5    sunny
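
Beyond `head` and `tail`, individual columns, rows and cells can be selected from a DataFrame. The following is a sketch of three common selections. It rebuilds `day` from plain lists so the block runs on its own, and the names `temperatures`, `sunny_days` and `friday_temp` are illustrative.

```python
import pandas as pd

day = pd.DataFrame(
    {'Temperature': [27.0, 26.7, 26.9, 27.3, 26.4, 26.9, 27.5],
     'Weather': ['sunny', 'cloudy', 'sunny', 'sunny', 'rainy', 'sunny', 'sunny']},
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

temperatures = day['Temperature']             # One column, returned as a Series
sunny_days = day[day['Weather'] == 'sunny']   # Rows matching a boolean condition
friday_temp = day.loc['Friday', 'Temperature']  # A single cell by row and column label

print(temperatures.mean())
print(sunny_days.index.tolist())
print(friday_temp)  # 26.4
```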


### Exercise: DataFrames

In [None]:
name_list = ['Jane', 'Abu', 'Zul', 'Ali', 'James', 'Yuen', 'Pranesh', 'Sarah', 'John', 'Anne', 'Lisa', 'Jordan', 'Molly']
gender_list = ['Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female']
age_list = [33, 26, 37, 26, 28, 36, 39, 26, 20, 25, 23, 21, 32]
location_list = ['Penang', 'Penang', 'Kuala Lumpur', 'Kuala Lumpur', 'Penang', 'Kuala Lumpur', 'Kuala Lumpur', 'Kuala Lumpur', 'Penang', 'Kuala Lumpur', 'Penang', 'Penang', 'Penang']

In [None]:
# Create a DataFrame from the lists above, with the names as the index

df = pd.DataFrame({'Gender':gender_list, 'Age':age_list, 'Location':location_list}, index = name_list)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, Jane to Molly
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Gender    13 non-null     object
 1   Age       13 non-null     int64 
 2   Location  13 non-null     object
dtypes: int64(1), object(2)
memory usage: 416.0+ bytes


In [None]:
df.describe()

             Age
count  13.000000
mean   28.615385
std     6.198635
min    20.000000
25%    25.000000
50%    26.000000
75%    33.000000
max    39.000000


In [None]:
# Sort the data by Age in ascending order

In [None]:
# display the top rows after sorting
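
One possible way to approach the two exercise steps above is sketched here; it is not the only solution. It assumes `DataFrame.sort_values` for the sort and `head` for the display, rebuilds `df` so the block is self-contained, and uses `kind='mergesort'` (a stable sort) so that people with the same age keep their original order.

```python
import pandas as pd

name_list = ['Jane', 'Abu', 'Zul', 'Ali', 'James', 'Yuen', 'Pranesh',
             'Sarah', 'John', 'Anne', 'Lisa', 'Jordan', 'Molly']
gender_list = ['Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male',
               'Female', 'Male', 'Female', 'Female', 'Male', 'Female']
age_list = [33, 26, 37, 26, 28, 36, 39, 26, 20, 25, 23, 21, 32]
location_list = ['Penang', 'Penang', 'Kuala Lumpur', 'Kuala Lumpur', 'Penang',
                 'Kuala Lumpur', 'Kuala Lumpur', 'Kuala Lumpur', 'Penang',
                 'Kuala Lumpur', 'Penang', 'Penang', 'Penang']

df = pd.DataFrame({'Gender': gender_list, 'Age': age_list,
                   'Location': location_list}, index=name_list)

# Sort by Age; ascending order is the default
sorted_df = df.sort_values('Age', kind='mergesort')

# Display the top rows after sorting
print(sorted_df.head())
```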

# The End

That's it! You have now covered the basics of NumPy and Pandas, Python's core libraries for array computing and DataFrame manipulation.

In the next lesson, you will apply your knowledge of NumPy and Pandas to implement basic machine learning in Python using a library called **Scikit-learn**, or `sklearn` as enthusiasts like to call it.