# Introduction to Machine Learning
Sharon Ong and Emmanuel Keuleers, Department of Cognitive Science and Artificial Intelligence – Tilburg University
Academic Year 2020-2021

# Working with Essential Python Libraries 
## Numpy 
Numpy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions and pseudorandom number generators. 

Having a solid understanding of numpy is necessary if you want to work on machine learning projects. 
Other packages, such as pandas, are built on top of numpy. In this course, we will use the scikit-learn package to build machine learning applications, and it relies heavily on numpy as well.  

Numpy provides the ndarray object, short for n-dimensional array. In an ‘ndarray’ object, aka ‘array’, you can store multiple items of the same data type. The facilities built around the array object are what make numpy so convenient for performing math and data manipulations.
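
To make the "same data type" rule concrete, here is a minimal sketch (the variable names are illustrative, not part of the exercises): when a list mixes integers and a float, numpy converts every element to one common type, and arithmetic is then applied to all elements at once.

```python
import numpy as np

# Mixing ints and a float: numpy upcasts every element to one common dtype
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)   # float64

# Arithmetic is applied element-wise, without an explicit Python loop
doubled = mixed * 2
print(doubled)
```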

You will perform some exercises with the Numpy package. 

1. Create a 1D numpy array with the contents of list1 and print the result 

In [1]:
import numpy as np

list1 = [0,1,2,3,4]
# 
# Your code goes here 
# 
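
One possible solution for exercise 1, shown as a sketch; the name `arr1d` is an illustrative choice and does not come from the original notebook.

```python
import numpy as np

list1 = [0, 1, 2, 3, 4]

# np.array converts the Python list into a 1D ndarray
arr1d = np.array(list1)
print(arr1d)
```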

2. Create a 2D array of 2 rows and 3 columns with numbers of your choice. 
3. Add the number 2 to every item in the array. Print the results 

In [9]:
#
# Your code goes here 
#
array2d = np.array([(8,7,6),(3,9,1)])
print(array2d)
array2d_p_2 = array2d + 2
print(array2d_p_2)

[[8 7 6]
 [3 9 1]]
[[10  9  8]
 [ 5 11  3]]


4. Print the shape, data types, size and number of dimensions 

In [20]:
#
# Your code goes here 
#
shape = array2d_p_2.shape
print("shape:", shape)
print("Number of dimensions:", array2d_p_2.ndim)
print("size:", array2d_p_2.size)
print("data type:", array2d_p_2.dtype)

shape: (2, 3)
Number of dimensions: 2
size: 6
data type: int32


5.  Create a numpy array which will print the following output 
 
#> ['1' '2' 'a' 'b']

In [22]:
#
# Your code goes here 
#
array = np.array(['1', '2', 'a', 'b'])
print(array)

['1' '2' 'a' 'b']


6. Reverse the rows of the following numpy array 

In [24]:
# Use a name other than 'list' so the built-in list type is not shadowed
nested_list = [[1, 2, 3, 4],[3, 4, 5, 6], [5, 6, 7, 8]]
arr = np.array(nested_list, dtype='float')
#
# Your code goes here 
#
reversed_arr = arr[::-1]
print(reversed_arr)

[[5. 6. 7. 8.]
 [3. 4. 5. 6.]
 [1. 2. 3. 4.]]


7. Compute the mean and standard deviation of the numpy array (arr)

In [27]:
#
# Your code goes here 
#
print('mean:', np.mean(arr))
print('stdev:', np.std(arr))

mean: 4.5
stdev: 1.9790570145063195
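
Note that `np.std` computes the population standard deviation by default (it divides by N). If you need the sample standard deviation (dividing by N - 1), you can pass `ddof=1`. A small sketch, re-creating `arr` so that it runs on its own:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]], dtype='float')

population_std = np.std(arr)       # default ddof=0: divides by N
sample_std = np.std(arr, ddof=1)   # ddof=1: divides by N - 1
print('population stdev:', population_std)
print('sample stdev:', sample_std)
```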


8. Compute the minimum values row wise or column wise (use the np.amin version)

In [28]:
#
# Your code goes here 
#
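# axis=0 takes the minimum down each column, axis=1 along each row
print('column-wise min:', np.amin(arr, axis=0))
print('row-wise min:', np.amin(arr, axis=1))

column-wise min: [1. 2. 3. 4.]
row-wise min: [1. 3. 5.]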


9. Flatten the numpy array 'arr'. You can use the flatten() or the ravel() method.
These are 2 popular ways to implement flattening. The difference is that flatten() always returns a copy, whereas ravel() returns a view of the parent array whenever possible. So, any changes to an array created with ravel may affect the parent as well. In return, ravel is more memory efficient, since it avoids creating a copy when it can.

In [34]:
#
# Your code goes here 
#
flat_arr = arr.flatten()
print(flat_arr)

[1. 2. 3. 4. 3. 4. 5. 6. 5. 6. 7. 8.]
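
The copy-versus-view difference between flatten() and ravel() can be checked directly. The sketch below re-creates `arr` so it runs on its own; the names `flat_copy` and `flat_view` are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]], dtype='float')

flat_copy = arr.flatten()   # always a new, independent array
flat_view = arr.ravel()     # a view here, because arr is contiguous in memory

print(np.shares_memory(arr, flat_copy))  # False
print(np.shares_memory(arr, flat_view))  # True

flat_copy[0] = 100.0   # does not touch arr
flat_view[1] = -1.0    # also changes arr[0, 1]
print(arr[0])
```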


## SciPy
SciPy is a collection of functions for scientific computing in Python. It provides advanced linear algebra routines, optimization, signal processing, special mathematical functions and statistical distributions. 
One of the things we will work with is sparse matrices. We use a sparse matrix when we want to store a 2D array that contains mostly zeros. Let's try: 

In [35]:
from scipy import sparse
import numpy as np
# Create a 2D NumPy array with a diagonal of ones, and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n", eye)
# Convert the NumPy array to a SciPy sparse matrix in CSR format
# Only the nonzero entries are stored
sparse_matrix = sparse.csr_matrix(eye)
print("\nSciPy sparse CSR matrix:\n", sparse_matrix)

NumPy array:
 [[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

SciPy sparse CSR matrix:
   (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0


In [36]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n", eye_coo)
print("row_indices:\n", row_indices)
print("col_indices:\n", col_indices)

COO representation:
   (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
row_indices:
 [0 1 2 3]
col_indices:
 [0 1 2 3]
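
As a quick check that the sparse formats hold the same matrix, the following sketch counts the stored entries of the COO matrix, converts it to CSR, and converts it back to a dense array. The names `eye_csr` and `dense` are illustrative.

```python
import numpy as np
from scipy import sparse

data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))

# Number of explicitly stored entries versus total number of cells
print("stored entries:", eye_coo.nnz, "of", eye_coo.shape[0] * eye_coo.shape[1])

# Convert COO to CSR, then back to a regular dense NumPy array
eye_csr = eye_coo.tocsr()
dense = eye_csr.toarray()
print(dense)
```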


## matplotlib 
matplotlib is the primary scientific plotting library in Python, providing functions for making publication-quality visualizations. You can show figures directly in the browser using the "%matplotlib notebook" or "%matplotlib inline" commands. The code below plots the generated arrays 'x' and 'y' against each other with different markers and labels the x and y axes. Run the code below: 

In [37]:
%matplotlib notebook
import matplotlib.pyplot as plt

# Generate 100 evenly spaced numbers from -10 to 10
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function plots one array against another; 'r.' draws red dots
plt.figure()
plt.plot(x, y, 'r.')
plt.xlabel('x')
plt.ylabel('y')
plt.figure()
plt.plot(x, 2*y, 'b+')
plt.xlabel('x')
plt.ylabel('y')

Text(0, 0.5, 'y')
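
The same two curves can also be drawn side by side in a single figure with `plt.subplots`. This is only a sketch of an alternative layout: the `Agg` backend line is an assumption added so the snippet also runs without a display, and it should be removed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption; remove in a notebook)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 100)
y = np.sin(x)

# One figure with two side-by-side axes instead of two separate figures
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
ax_left.plot(x, y, 'r.')
ax_left.set_xlabel('x')
ax_left.set_ylabel('y')
ax_right.plot(x, 2 * y, 'b+')
ax_right.set_xlabel('x')
ax_right.set_ylabel('2y')
fig.tight_layout()
```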

## pandas
pandas is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame (a table, similar to an Excel spreadsheet). pandas provides a great range of methods to modify and operate on this table (e.g. SQL-like queries and joins of tables). 

In [39]:
import pandas as pd

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "Amsterdam"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes
# in the Jupyter notebook
display(data_pandas)

    Name   Location  Age
0   John   New York   24
1   Anna      Paris   13
2  Peter     Berlin   53
3  Linda  Amsterdam   33


In [41]:
# Select all rows that have an age column greater than 30
display(data_pandas[data_pandas.Age > 30])

    Name   Location  Age
2  Peter     Berlin   53
3  Linda  Amsterdam   33
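
The SQL-like queries and joins mentioned in the introduction to this section can be sketched as follows. The `countries` lookup table and the names `people`, `over_30` and `joined` are made up purely for illustration and are not part of the original notebook.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ["John", "Anna", "Peter", "Linda"],
    'Location': ["New York", "Paris", "Berlin", "Amsterdam"],
    'Age': [24, 13, 53, 33],
})

# SQL-like filter: keep the rows with Age greater than 30
over_30 = people.query("Age > 30")
print(over_30)

# Hypothetical lookup table, made up purely for illustration
countries = pd.DataFrame({
    'Location': ["New York", "Paris", "Berlin", "Amsterdam"],
    'Country': ["USA", "France", "Germany", "Netherlands"],
})

# SQL-like left join on the shared 'Location' column
joined = people.merge(countries, on='Location', how='left')
print(joined)
```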


## Exploring Data with Pandas 

We will explore the Iris dataset discussed in the theory lecture.

In [43]:
# Load the data first

# Import package we need
import pandas as pd

# Make sure the file 'iris.csv' is in the same folder as this notebook
# This loads the data into a 'dataframe', which is like a table
iris = pd.read_csv("iris.csv")

# Look at the first 5 rows
iris.head(5)

   sepal_length  sepal_width  petal_length  petal_width   class
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
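
If you do not have 'iris.csv' at hand, you can still try out `read_csv` on a small inline sample. The sketch below uses only the five rows displayed above and assumes that the file has exactly these column headers; `csv_text` and `sample` are illustrative names.

```python
import io
import pandas as pd

# The five rows displayed above, inlined so the sketch runs without iris.csv
# (assumption: the real file uses these column headers)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
4.6,3.1,1.5,0.2,Setosa
5.0,3.6,1.4,0.2,Setosa
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)       # column types inferred by read_csv
print(sample.describe())   # summary statistics of the numeric columns
```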


In [45]:
# Explore the shape: the number of rows and columns
iris.shape

(150, 5)

In [46]:
# How many of each flower?
iris.value_counts("class")

class
Setosa        50
Versicolor    50
Virginica     50
Name: count, dtype: int64

In [47]:
# Visualizing the data

# Import package we need
import seaborn as sns

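# Plot the data using only 2 features: play around to find the best 2
# Start a new figure so the scatter plot is not drawn on top of the previous plot
plt.figure()
sns.scatterplot(x='sepal_length', y='sepal_width',
                hue='class', data=iris)

<Axes: xlabel='sepal_length', ylabel='sepal_width'>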

In [48]:
# Easier way to find the best 2 features that separate the flowers
sns.pairplot(iris, hue='class', height=2)

<seaborn.axisgrid.PairGrid at 0x22c41332e90>