# SWCON253 Machine Learning

# Data Preprocessing and Machine Learning with Scikit-Learn

In [None]:
%load_ext watermark
%watermark -v -a "Won Hee Lee" -p numpy,scipy,matplotlib,sklearn,mlxtend

Author: Won Hee Lee

Python implementation: CPython
Python version       : 3.8.3
IPython version      : 7.16.1

numpy     : 1.18.5
scipy     : 1.5.0
matplotlib: 3.2.2
sklearn   : 0.23.1
mlxtend   : 0.18.0



# Machine Learning Workflow

<img src="images/ml_worflow.png" alt="drawing" width="700"/>

## Overview

In this lecture, we close the **"Computational Foundation"** section by introducing one more Python library, pandas, which is extremely handy for data (pre)processing. The second focus of this lecture is the [Scikit-learn](http://scikit-learn.org) machine learning library, which is widely regarded as one of the most mature and well-designed general-purpose machine learning libraries.

## Pandas - A Python Library for Working with Data Frames

- Pandas is probably the most popular and convenient data wrangling library for Python (official website: https://pandas.pydata.org) 
- The name pandas is derived from "panel data" (PANel-DAta-S).
- Similar to data frames in R.
- How is it different from NumPy arrays? 
    - Allows for heterogeneous data (columns can have different data types)
    - Adds some more convenient functions on top that are handy for data processing
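
As a minimal sketch of the first point above, the toy data below (made up for illustration, not taken from the Iris file) shows that a NumPy array falls back to a single common data type, whereas a `DataFrame` keeps one data type per column:

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype for all entries, so mixing numbers and
# strings converts everything to a common type (here: unicode strings).
arr = np.array([[5.1, 'setosa'], [6.7, 'virginica']])
print(arr.dtype.kind)  # 'U' stands for unicode string

# A DataFrame stores a separate dtype for every column.
# The column names are hypothetical and only used for this example.
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 6.7],
    'species': ['setosa', 'virginica'],
})
print(toy_df.dtypes)
```

The numeric column stays a floating-point column, so arithmetic on it works without any conversion.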

### Loading Tabular Datasets from Text Files

- Here, we are working with structured data, that is, data organized like a "design matrix" (see lecture 3) with examples as rows and features as columns (in contrast to unstructured data such as text or images).
- CSV stands for "comma-separated values" (also common: TSV, tab-separated values).
- The `head` command is a Linux/Unix command that shows the first 10 rows by default; the `!` denotes that Jupyter/the IPython kernel should execute it as a shell command (`!`-commands may not work if you are on Windows, but it is not really important).

In [3]:
from sklearn import datasets
iris = datasets.load_iris()
iris


 'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

sepal = 꽃받침 (the leaf-like outer part of the flower), petal = 꽃잎 (the colored inner part of the flower)

<img src="images/iris_pic.png" alt="drawing" width="800"/>

- We use the `read_csv` command to load the CSV file into a pandas data frame object `df` of the class `DataFrame`.
- Data frames also have a `head` command; here it shows the first 5 rows.

In [4]:
import pandas as pd

# load the CSV file into a DataFrame and show the first 5 rows
df = pd.read_csv('data/iris.csv')
df.head()

In [None]:
type(df)

pandas.core.frame.DataFrame

- It is always good to double check the dimensions and see if they are what we expect. 
- The `DataFrame` `shape` attribute works the same way as the NumPy array `shape` attribute (Lecture 10).

In [None]:
df.shape

(150, 6)

In [None]:
# many additional options exist 

import pandas as pd

pd.read_csv?
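
To illustrate a few of these options, here is a small sketch that reads tab-separated text with `sep` and keeps only some columns with `usecols`. The in-memory text is a made-up stand-in for a file (an assumption for this example, not the course's `data/iris.csv`):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a small tab-separated (TSV) file.
tsv_text = (
    "Id\tSepalLength[cm]\tSpecies\n"
    "1\t5.1\tIris-setosa\n"
    "2\t4.9\tIris-setosa\n"
)

toy_df = pd.read_csv(
    io.StringIO(tsv_text),                    # any file path or file-like object works
    sep='\t',                                 # column delimiter (default is ',')
    usecols=['SepalLength[cm]', 'Species'],   # load only these columns
)
print(toy_df.shape)
print(toy_df.columns.tolist())
```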

### Basic Data Handling

- The `apply` method offers a convenient way to manipulate pandas `DataFrame` entries along the column axis.
- We can use a regular Python or lambda function as input to the apply method.
- In this context, assume that our goal is to transform class labels from a string representation (e.g., "Iris-Setosa") to an integer representation (e.g., 0), which is a historical convention and a recommendation for compatibility with various machine learning tools.

In [None]:
df['Species'] = df['Species'].apply(lambda x: 0 if x=='Iris-setosa' else x)
df.tail()

Unnamed: 0,Id,SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Species
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


#### Digression: Lambda Functions

- If you are not familiar with "lambda functions," they are basically the same as "regular" functions but can be written more compactly as one-liners.

In [None]:
def some_func(x):
    return 'Hello World ' + str(x)

some_func(123)

'Hello World 123'

In [None]:
f = lambda x: 'Hello World ' + str(x)
f(123)

'Hello World 123'
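
Since both kinds of functions are interchangeable, either one can be passed to `apply`. The following sketch uses a made-up two-element `Series` and a hypothetical helper `to_upper` to show that the results are identical:

```python
import pandas as pd

def to_upper(label):
    """Regular function: convert a label to upper case."""
    return label.upper()

# Toy labels for illustration only.
s = pd.Series(['Iris-setosa', 'Iris-virginica'])

via_def = s.apply(to_upper)                         # regular function
via_lambda = s.apply(lambda label: label.upper())   # equivalent lambda

print(via_def.equals(via_lambda))
```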

#### .map vs. .apply

- If we want to map column values from one value to another, it is often more convenient to use the `map` method instead of `apply`.
- To achieve the following with the `apply` method, we would have to call `apply` three times.

In [None]:
d = {'Iris-setosa': 0,
     'Iris-versicolor': 1,
     'Iris-virginica': 2}

df = pd.read_csv('data/iris.csv')
df['Species'] = df['Species'].map(d)  # map all class labels in a single call
df.head()

Unnamed: 0,Id,SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Species
0,1,5.1,3.5,1.4,0.2,0
1,2,4.9,3.0,1.4,0.2,0
2,3,4.7,3.2,1.3,0.2,0
3,4,4.6,3.1,1.5,0.2,0
4,5,5.0,3.6,1.4,0.2,0


- The `tail` method is similar to `head` but shows the last five rows by default; we use it to double check that the last class label (Iris-virginica) was also successfully transformed.

In [None]:
df.tail()

Unnamed: 0,Id,SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Species
145,146,6.7,3.0,5.2,2.3,2
146,147,6.3,2.5,5.0,1.9,2
147,148,6.5,3.0,5.2,2.0,2
148,149,6.2,3.4,5.4,2.3,2
149,150,5.9,3.0,5.1,1.8,2


- It's actually not a bad idea to check if all row entries of the `Species` column got transformed correctly.

In [None]:
import numpy as np

# np.unique returns the distinct values, so we can verify the labels
np.unique(df['Species'])

array([0, 1, 2], dtype=int64)
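
One reason this check is worthwhile: when `map` is given a dictionary, values that are missing from the dictionary become `NaN` instead of raising an error. A small sketch with made-up labels, one of which is deliberately misspelled:

```python
import pandas as pd

label_map = {'Iris-setosa': 0,
             'Iris-versicolor': 1,
             'Iris-virginica': 2}

# Toy labels; 'Iris-Virginica' (capital V) is an intentional typo.
labels = pd.Series(['Iris-setosa', 'Iris-Virginica', 'Iris-versicolor'])
mapped = labels.map(label_map)

# Keys that are not in the dictionary are mapped to NaN.
print(mapped.isna().sum())
```

Counting missing values after the mapping is therefore a cheap way to catch unexpected label spellings.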

#### NumPy Arrays

- Pandas' data frames are built on top of NumPy arrays.
- While many machine learning-related tools also support pandas `DataFrame` objects as inputs now, by convention, we usually use NumPy arrays for most tasks.
- We can access the NumPy array that is underlying a `DataFrame` via the `values` attribute.

In [None]:
y = df['Species'].values
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)

- There are many different ways to access columns and rows in a pandas `DataFrame`, which we won't discuss here; a good reference documentation can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
- The `iloc` attribute allows for integer-based indexing and slicing, which is similar to how we use indexing on NumPy arrays (Lectures 10-11).
- The following expression selects columns 1, 2, 3, and 4 (sepal length, sepal width, petal length, petal width) from the `DataFrame` and then assigns the underlying NumPy array to `X`.

In [None]:
X = df.iloc[:, 1:5].values

- Just as a quick check, we show the first 5 rows in the NumPy array:

In [None]:
X[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
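
As with NumPy slicing, the end index of an `iloc` slice is exclusive. The sketch below uses a small made-up `DataFrame` (fewer columns than the Iris file) to show that `1:3` selects the columns at positions 1 and 2 only:

```python
import pandas as pd

# Toy frame for illustration; the real Iris file has more columns.
toy_df = pd.DataFrame({
    'Id': [1, 2, 3],
    'SepalLength[cm]': [5.1, 4.9, 4.7],
    'SepalWidth[cm]': [3.5, 3.0, 3.2],
    'Species': [0, 0, 0],
})

# All rows, columns at integer positions 1 and 2 (position 3 is excluded).
X_toy = toy_df.iloc[:, 1:3].values
print(X_toy.shape)
print(X_toy[0])
```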

### Exploratory Data Analysis

- Occasionally, we will use the MLxtend library (http://rasbt.github.io/mlxtend/) -- MLxtend stands for "machine learning extensions" and contains some convenience functions for machine learning and data science tasks.
- In particular, we will use the `scatterplotmatrix` function to display a scatter plot matrix of the dataset, which is useful to get a quick overview of the dataset (to inspect the relationship between features, look for outliers, etc.).
- You can install the MLxtend library by un-commenting and executing the next cell.

<img src="images/mlxtend.png" alt="drawing" width="400"/>

In [None]:
#conda install mlxtend --channel conda-forge

In [None]:
#pip install mlxtend

In [5]:
%matplotlib inline
import matplotlib.pyplot as plt
from mlxtend.data import iris_data
from mlxtend.plotting import scatterplotmatrix



names = df.columns[1:5]

fig, axes = scatterplotmatrix(X[y==0], figsize=(10, 8), alpha=0.5)
fig, axes = scatterplotmatrix(X[y==1], fig_axes=(fig, axes), alpha=0.5)
fig, axes = scatterplotmatrix(X[y==2], fig_axes=(fig, axes), alpha=0.5, names=names)

plt.tight_layout()
plt.legend(labels=['Setosa', 'Versicolor', 'Virginica'])
plt.savefig('images/eda.pdf')
plt.show()


## Object Oriented Programming (OOP) & Python Classes

- To get a better understanding of the scikit-learn API, we need to understand the main concepts behind Object Oriented Programming (OOP) and classes in Python.
- This section illustrates the concept of "classes" in Python, which is relevant for understanding how the scikit-learn API works on a fundamental level later in this lecture.
- Note that Python is an object oriented language, and everything in Python is an object.
- Classes are "templates" for creating objects (this is called "instantiating" objects).
- An object is a collection of attributes and special "functions" (a "function" of an object or class is called a "method").
- Note that `self` is not a reserved keyword but the conventional name of the first parameter of a method; it refers to the instantiated object "itself."

In [6]:
class VehicleClass():
    
    def __init__(self, horsepower):   # __init__ is the so-called "constructor"
        "This is the 'init' method"
        # this is an instance attribute:
        self.horsepower = horsepower
        
    def horsepower_to_torque(self, rpm):
        "This is a regular method"
        # torque [lb-ft] = horsepower * 33000 / (2 * pi * rpm)
        numerator = self.horsepower * 33000
        denominator = 2 * np.pi * rpm
        return numerator / denominator
    
    def tune_motor(self):
        self.horsepower *= 2
    
    def _private_method(self):
        print('this is private')
    
    def __very_private_method(self):
        print('this is very private')

In [9]:
import numpy as np
# instantiate an object:
car1 = VehicleClass(horsepower=123)
print(car1.horsepower)

123


In [10]:
car1.horsepower_to_torque(rpm=5000)

129.20198280200063

In [12]:
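car1.tune_motor()  # doubles the horsepower on every call
# One call gives horsepower 246 and a torque of about 258.4 at 5000 rpm.
# The value recorded below is four times the original torque, i.e.,
# tune_motor() had been run twice when this output was captured.
car1.horsepower_to_torque(rpm=5000)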

516.8079312080025

In [None]:
car1._private_method()

this is private


- Python has the motto "we are all adults here," which means that a user can do the same things as a developer (in contrast to other programming languages, e.g., Java).
- A preceding underscore is an indicator that a method is considered "private" -- this means the method is meant to be used internally but not by the user directly (also, it does not show up in the "help" documentation).
- A preceding double underscore is a "stronger" indicator for methods that are supposed to be private. Users can still access these (adhering to the "we are all adults here" motto), but only through the "name-mangled" form `_ClassName__method_name`.

In [13]:
# Executing the following raises an error:
car1.__very_private_method()

AttributeError: ignored

In [14]:
# If we use "name mangling" we can access this private method:
car1._VehicleClass__very_private_method()

this is very private


- Another useful aspect of using classes is the concept of "inheritance."
- Using inheritance, we can "inherit" methods and attributes from a parent class for re-use.
- For instance, consider the `VehicleClass` as a more general class than the `CarClass` -- i.e., a car, truck, or motorbike are specific cases of a vehicle.
- Below is an example of a `CarClass` that inherits the methods from the `VehicleClass` and adds a specific `self.num_wheels=4` attribute -- if we were to create a `BikeClass`, we could set this to `self.num_wheels=2`, for example.
- All in all, this is a very simple demonstration of class inheritance. However, it is a concept that is very useful for writing "clean code" and structuring projects, and the scikit-learn machine learning library makes heavy use of it internally. As users, we don't have to worry about it too much, but it is useful to know in case you would like to modify or contribute to the library.

In [15]:
class CarClass(VehicleClass):

    def __init__(self, horsepower):
        super(CarClass, self).__init__(horsepower)
        self.num_wheels = 4
    
new_car = CarClass(horsepower=123)
print('Number of wheels:', new_car.num_wheels)
print('Horsepower:', new_car.horsepower)
new_car.tune_motor()
print('Horsepower:', new_car.horsepower)

Number of wheels: 4
Horsepower: 123
Horsepower: 246
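
The `BikeClass` mentioned above could be sketched as follows. This is only an illustration: the cell repeats a stripped-down `VehicleClass` so that it runs on its own, and the overridden `tune_motor` (adding a fixed 10 horsepower) is an assumption made up to show that a child class can also replace an inherited method.

```python
class VehicleClass:
    """Stripped-down parent class, repeated so the example is self-contained."""

    def __init__(self, horsepower):
        self.horsepower = horsepower

    def tune_motor(self):
        self.horsepower *= 2


class BikeClass(VehicleClass):
    """Hypothetical child class with two wheels."""

    def __init__(self, horsepower):
        super().__init__(horsepower)   # reuse the parent constructor
        self.num_wheels = 2

    def tune_motor(self):
        # Overrides the inherited method (illustrative assumption).
        self.horsepower += 10


new_bike = BikeClass(horsepower=50)
new_bike.tune_motor()
print('Number of wheels:', new_bike.num_wheels)
print('Horsepower:', new_bike.horsepower)
print('Is a vehicle:', isinstance(new_bike, VehicleClass))
```

An instance of the child class also counts as an instance of the parent class, which is why `isinstance` returns `True` here.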


## Further Resources

- Scikit-learn documentation: http://scikit-learn.org/stable/documentation.html