# AI Workshop - Lab 1-0: Introduction to Python

In this lab, we're going to get acquainted with Python, NumPy, and scikit-learn. We're going to cover the following topics:

1. Using Python in a notebook environment
2. Basic NumPy operations
3. Basic scikit-learn operations

This lab is intended to help you become familiar with how we will run code in this workshop, as well as the basic tools we will be using. If you are already familiar with Python, NumPy, and scikit-learn, feel free to skip this lab and move on to Lab 1: Data Cleaning and Processing.

##### Notebooks
This lab will be using notebooks as a Python development environment. Notebooks are a great way to interactively develop code and are widely used in the data science community. They allow us to mix together code, text, and images in a single document.

##### Cells

Notebooks are made up of cells. Each cell can contain either code or text. You can change the type of a cell by selecting it and then selecting the appropriate type from the dropdown menu in the toolbar.

When you want to run a cell, you can either click the "play" button next to the cell or press "shift+enter".

#### Getting started

Let's get started by running some Python code. When we work with Python, we almost always rely on _libraries_ - collections of code that provide functionality beyond what is available in the core Python language. In this lab, we're going to be using two libraries: NumPy and scikit-learn.

**>> Run the code in the next cell** to import scikit-learn and NumPy.

In [None]:
import sklearn
import numpy as np

### NumPy Basics

Great. Let's move on to our next topic: getting a handle on NumPy basics. You can think of NumPy as a library that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

Plain Python can perform many of the same operations using lists and loops, but NumPy is much more efficient because it is optimized for numerical operations on whole arrays.
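
As a rough illustration of the difference in style, the sketch below squares the same numbers twice: once with a plain Python list comprehension and once with a single NumPy expression. The variable names and the example numbers are chosen here for illustration only, and any speed benefit depends on the array size and your machine.

```python
import numpy as np

# Hypothetical example data: the integers 0 through 9
numbers = list(range(10))

# Plain Python: visit each element explicitly
squares_list = [n ** 2 for n in numbers]

# NumPy: one expression applied to the whole array at once
numbers_array = np.array(numbers)
squares_array = numbers_array ** 2

print(squares_list)
print(squares_array)
```

Both approaches produce the same values; the NumPy version simply expresses the operation on the array as a whole.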

Let's create a 2x4 array containing the numbers 0 through 7 and conduct some basic operations on it.  
**>> Run the code in the next cell** to create and print the array.

In [None]:
# Create a variable called "array" and fill it with a 2x4 NumPy array
array = np.arange(8).reshape(2,4)
array

We can access the shape, number of dimensions, data type, and number of elements in our array as follows:  
*(Tip: use "print()" when you want a cell to output more than one thing, or when you want to add text to your output. Otherwise, the cell displays only the value of its last expression, as in the cell above.)*

In [None]:
print("Shape:", array.shape)
print("Dimensions:", array.ndim)
print("Data type:", array.dtype.name)
print("Number of elements:", array.size)

If we have a Python list containing a set of numbers, we can use it to create an array:  
*(Tip: if you click on a function call, such as array(), and press "shift+tab", the notebook will show you the documentation for that function.)*

In [None]:
mylist = [0, 1, 1, 2, 3, 5, 8, 13, 21]
myarray = np.array(mylist)
myarray

And we can do it for nested lists as well, creating multidimensional NumPy arrays:

In [None]:
my2dlist = [[1,2,3],[4,5,6]]
my2darray = np.array(my2dlist)
my2darray

We can also index and slice NumPy arrays like we would do with a Python list or another container object as follows:

In [None]:
array = np.arange(10)
print("Originally: ", array)
print("First four elements: ", array[:4])
print("After the first four elements: ", array[4:])
print("The last element: ", array[-1])

And we can index/slice multidimensional arrays, too.

In [None]:
array = np.array([[1,2,3],[4,5,6]])
print("Originally: ", array)
print("First row only: ", array[0])
print("First column only: ", array[:, 0])
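
A few more indexing patterns are sketched below on the same 2x3 array. These are standard NumPy indexing forms, shown as additional illustrations rather than as part of the original lab steps.

```python
import numpy as np

array = np.array([[1, 2, 3], [4, 5, 6]])

# A single element: row index 1, column index 2
print("Row 1, column 2:", array[1, 2])

# The last column, using a negative index
print("Last column:", array[:, -1])

# A sub-block: both rows, first two columns
print("First two columns:\n", array[:, :2])
```

The comma separates the row selection from the column selection, and a bare ":" means "everything along this axis".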

#### Sneak preview

Often, when designing a machine learning classifier, it is useful to compare an array of predictions (0 or 1 values) to an array of true values. We can do this easily in NumPy to compute the *accuracy* (i.e., the proportion of predictions that match the true values), as follows:

In [None]:
true_values = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]
predictions = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

true_values_array = np.array(true_values)
predictions_array = np.array(predictions)

accuracy = np.sum(true_values_array == predictions_array) / true_values_array.size
print("Accuracy: ", accuracy * 100, "%")

In the previous cell, we took two Python lists, converted them to NumPy arrays, and then used a combination of np.sum() and .size to compute the accuracy (the proportion of elements that are pairwise equal). This is a tiny bit more advanced, but it demonstrates the power of NumPy arrays.

You'll notice we didn't use a loop to conduct the comparison. Instead, the == operator compared the two arrays element by element, and np.sum() counted the matches. These are examples of vectorized operations in NumPy, which are much more efficient than explicit loops when dealing with large datasets.
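
For comparison, here is a sketch of the same accuracy calculation written with an explicit Python loop, next to a vectorized version. The vectorized line uses np.mean() in place of np.sum() and .size; that is an equivalent shortcut assumed for this example, not the form used in the cell above.

```python
import numpy as np

true_values_array = np.array([0, 0, 1, 1, 1, 1, 1, 0, 1, 0])
predictions_array = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 0])

# Explicit loop: count the matching positions one at a time
matches = 0
for truth, prediction in zip(true_values_array, predictions_array):
    if truth == prediction:
        matches += 1
loop_accuracy = matches / len(true_values_array)

# Vectorized: "==" compares every pair at once and returns a boolean array.
# np.mean() treats True as 1 and False as 0, so the mean is the accuracy.
vectorized_accuracy = np.mean(true_values_array == predictions_array)

print("Loop accuracy:", loop_accuracy)
print("Vectorized accuracy:", vectorized_accuracy)
```

Both versions give the same result; the vectorized one is shorter and avoids per-element work in Python.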

### Scikit-learn Basics

Scikit-learn is a great library to use for doing machine learning in Python. Data preparation, exploratory data analysis (EDA), classification, regression, clustering: it has it all.

Scikit-learn usually expects feature data to be in the form of a 2D array with dimensions *n_samples x n_features*, with the target values supplied separately as a 1D array of length *n_samples*. To get acquainted with scikit-learn, we are going to use the [iris dataset](https://archive.ics.uci.edu/ml/datasets/iris), one of the most famous datasets in pattern recognition.

Each entry in the dataset represents an iris plant, and is categorized as: 

* Setosa (class 0)
* Versicolor (class 1)
* Virginica (class 2)

These represent the target classes to predict. Each entry also includes a set of features, namely:

* Sepal length (cm)
* Sepal width (cm)
* Petal length (cm)
* Petal width (cm)
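
To make the data layout concrete, the sketch below builds a tiny dataset by hand: the features go in a 2D array with one row per sample, and the classes go in a separate 1D array with one label per sample. The numbers are made up for illustration (they are not taken from the iris dataset), and the names X and y are a common convention assumed here.

```python
import numpy as np

# Made-up toy data: 3 samples, each described by 4 features
X = np.array([[5.0, 3.0, 1.5, 0.2],
              [6.0, 2.8, 4.5, 1.4],
              [7.0, 3.1, 6.0, 2.1]])

# One class label per sample, in the same order as the rows of X
y = np.array([0, 1, 2])

print("X shape (n_samples, n_features):", X.shape)
print("y shape (n_samples,):", y.shape)
```

The number of rows in X must equal the number of entries in y, since row i of the features and entry i of the labels describe the same sample.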

In the context of machine learning classification, the remainder of the lab is going to investigate the following question:  
*Can we design a model that, based on the iris sample features, can accurately predict the iris sample class?*

Scikit-learn has a copy of the iris dataset readily importable for us. Let's grab it now and conduct some EDA.

In [None]:
from sklearn.datasets import load_iris
iris_data = load_iris()
feature_data = iris_data.data

**YOUR TURN:** "feature_data" now contains the feature data for all of the iris samples. 
* What is the shape of this feature data? ________________
* The data type? ________________
* How many samples are there? ________________
* How many features are there? ________________

In [None]:
## Enter your code here


Next, we will save the target classification data in a similar fashion.

In [None]:
target_data = iris_data.target
target_names = iris_data.target_names

**YOUR TURN:**
* What values are in "target_data"? ________________
* What is the data type? ________________
* What values are in "target_names"? ________________
* What is the data type? ____________
* How many samples are of type "setosa"? ________________

In [None]:
## Enter your code here


We can also do some more visual EDA by plotting the samples according to a subset of the features and coloring the data points to coincide with the sample classification. We will use [matplotlib](https://matplotlib.org/), a powerful plotting library within Python, to accomplish this.

For example, let's plot sepal width vs. sepal length.


In [None]:
import matplotlib.pyplot as plt

In [None]:
setosa = feature_data[target_data==0]
versicolor = feature_data[target_data==1]
virginica = feature_data[target_data==2]

plt.scatter(setosa[:,0], setosa[:,1], label="setosa")
plt.scatter(versicolor[:,0], versicolor[:,1], label="versicolor")
plt.scatter(virginica[:,0], virginica[:,1], label="virginica")

plt.legend()
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.title("Visual EDA")

In the above step, we used boolean indexing to filter the feature data based on the target data class. This allowed us to create a scatter plot for each of the iris classes and distinguish them by color.
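
Boolean indexing can be easier to follow on a very small example. The sketch below uses made-up feature rows and labels (hypothetical values, not iris data) to show the two steps: building a True/False mask and then using it to select rows.

```python
import numpy as np

# Made-up toy data: 4 samples with 2 features each, plus their class labels
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0],
                     [7.0, 8.0]])
labels = np.array([0, 1, 0, 1])

# Comparing the labels to a class value gives one True/False per sample
mask = labels == 0
print("Mask:", mask)

# Indexing with the mask keeps only the rows where the mask is True
print("Rows with label 0:\n", features[mask])
```

This is the same pattern as feature_data[target_data==0], just with the mask stored in its own variable first.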

*Observations*: We can see that the "setosa" samples typically have medium-to-high sepal width and low-to-medium sepal length, while the other two classes tend to have lower width and higher length. The "virginica" class appears to include the samples with the greatest sepal length.

**YOUR TURN:** 
* Which of the iris classes is separable based on sepal characteristics? ________________
* Which of the iris classes is not? ________________
* Can we (easily) visualize each of the samples w.r.t. all features on the same plot? Why/why not? ________________