# Looking at the Skulls

### <b>Welcome to Lab 1b of Machine Learning 101 with Python.</b>
<p><b>Machine Learning is a subset of artificial intelligence (AI) in which a system can "learn" without being explicitly programmed.</b></p>

In this lab exercise, you will learn some basic functions for viewing and analysing data such as target, feature names, etc. Also, you will get a basic understanding of how to use data to fit (train) a model and use it to make a prediction. This will serve as a building block for future labs!


### Some Notebook Commands
<p>In case you haven't dealt with a Jupyter Notebook before, here are some quick, useful commands that may be handy to get started.</p>
<ul>
    <li>Run a cell: CTRL + ENTER</li>
    <li>Create a cell above a cell: a</li>
    <li>Create a cell below a cell: b</li>
    <li>Change a cell to Markdown: m</li>
    
    <li>Change a cell to code: y</li>
</ul>

If you are interested in more keyboard shortcuts, go to <b> Help -> Keyboard Shortcuts </b>

# Looking at the Skulls dataset

In this section, we will take a closer look at a data set, which is different from the digits data set from the first lab.

Everything starts off with how the data is stored. We will be working with .csv files, or comma separated value files. As the name implies, each attribute (or column) in the data is separated by commas.

Next, a little information about the dataset. We are using a dataset called skulls.csv, which contains the measurements made on Egyptian skulls from five epochs.

## The attributes of the data are as follows: 

<b>epoch</b> - The epoch corresponding to each skull. Assigned as a factor with levels c4000BC, c3300BC, c1850BC, c200BC, and cAD150, where the years are only given approximately.

<b>mb</b> - Maximum Breadth of the skull.

<b>bh</b> - Basibregmatic Height of the skull.

<b>bl</b> - Basialveolar Length of the skull.

<b>nh</b> - Nasal Height of the skull.

---

### Importing Libraries
Before we begin, we need to import some libraries, as they have useful functions that will be used later on.<br>
If you look at the imports below, you will notice the return of **numpy**! Remember that numpy's main object is the homogeneous multidimensional array (ndarray).

<b>Note</b>: The **KNeighborsClassifier** is a machine learning algorithm that we will discuss later on, so don't worry about understanding it right now.

In [1]:
import numpy as np
import pandas
from sklearn.neighbors import KNeighborsClassifier
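
To make the remark about numpy arrays being homogeneous a little more concrete, here is a small sketch. The array names are made up for this illustration and are not used elsewhere in the lab. Every element of an ndarray shares one dtype, so mixing kinds of values changes how all of them are stored.

```python
import numpy as np

# All integers: the array gets a single integer dtype
ints = np.array([131, 138, 89, 49])
print(ints.dtype.kind)   # 'i' means integer

# Mixing an int with a float: every element is stored as a float
mixed = np.array([131, 138.5])
print(mixed.dtype)       # float64

# Mixing text and numbers: every element is stored as a string
text = np.array(["c4000BC", 131])
print(text.dtype.kind)   # 'U' means unicode string
```

This is one reason the text column of the skulls data has to be handled separately later on.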

---
We need the **pandas** library for a function to read .csv files
<ul>
    <li> <b>pandas.read_csv</b> - Reads data into DataFrame </li>
    <li> Here, we pass <i>2 parameters</i> to the read_csv function (it accepts many more optional ones): </li>
    <ul>
        <li> The .csv file as the first parameter </li>
        <li> The delimiter as the second parameter </li>
    </ul>
</ul>
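
As a rough sketch of the two parameters described above, the snippet below reads a tiny in-memory CSV instead of a real file. The variable names `csv_text` and `sample` are assumptions made for this example, and the two rows are copied from the skulls data.

```python
import io
import pandas

# A tiny in-memory stand-in for a .csv file (two rows copied from the skulls data)
csv_text = "epoch,mb,bh,bl,nh\nc4000BC,131,138,89,49\nc4000BC,125,131,92,48\n"

# First parameter: the file (a path, URL, or file-like object); second: the delimiter
sample = pandas.read_csv(io.StringIO(csv_text), delimiter=",")
print(sample.shape)          # (2, 5)
print(list(sample.columns))  # ['epoch', 'mb', 'bh', 'bl', 'nh']
```

The first line of the text is treated as the header, which is why only two data rows are reported.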

-----------------------------
<font color = "green"> Save the "<b> skulls.csv </b>" data file into a variable called <b> my_data </b> </font>

In [2]:
my_data = pandas.read_csv("https://ibm.box.com/shared/static/u8orgfc65zmoo3i0gpt9l27un4o0cuvn.csv", delimiter=",")

-------
<font color = "green"> Print out the data in <b> my_data </b> </font>

In [3]:
print(my_data)

     Unnamed: 0    epoch   mb   bh   bl  nh
0             1  c4000BC  131  138   89  49
1             2  c4000BC  125  131   92  48
2             3  c4000BC  131  132   99  50
3             4  c4000BC  119  132   96  44
4             5  c4000BC  136  143  100  54
5             6  c4000BC  138  137   89  56
6             7  c4000BC  139  130  108  48
7             8  c4000BC  125  136   93  48
8             9  c4000BC  131  134  102  51
9            10  c4000BC  134  134   99  51
10           11  c4000BC  129  138   95  50
11           12  c4000BC  134  121   95  53
12           13  c4000BC  126  129  109  51
13           14  c4000BC  132  136  100  50
14           15  c4000BC  141  140  100  51
15           16  c4000BC  131  134   97  54
16           17  c4000BC  135  137  103  50
17           18  c4000BC  132  133   93  53
18           19  c4000BC  139  136   96  50
19           20  c4000BC  132  131  101  49
20           21  c4000BC  126  133  102  51
21           22  c4000BC  135  1

------------
<font color = "green"> Check the type of <b> my_data </b> </font>

In [4]:
print(type(my_data))

<class 'pandas.core.frame.DataFrame'>


-----------
There are various functions that the **pandas** library has to look at the data
<ul>
    <li> <font color = "red"> [DataFrame Data].columns </font> - Displays the Header of the Data </li>
    <ul> 
        <li> Type: pandas.Index (reported as pandas.core.indexes.base.Index in recent pandas versions) </li>
    </ul>
</ul>

<ul>
    <li> <font color = "red"> [DataFrame Data].values </font> (or <font color = "red"> [DataFrame Data].to_numpy() </font>; the older <font color = "red"> [DataFrame Data].as_matrix() </font> was removed in pandas 1.0) - Displays the values of the data (without headers) </li>
    <ul>
        <li> Type: numpy.ndarray </li>
    </ul>
</ul>

<ul>
    <li> <font color = "red"> [DataFrame Data].shape </font> - Displays the dimensions of the data (rows x columns) </li>
    <ul>
        <li> Type: tuple </li>
    </ul>
</ul>
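
The three attributes listed above can be tried on a small example first. This is only a sketch: `df` is a hypothetical two-row DataFrame standing in for my_data.

```python
import pandas

# Small hypothetical DataFrame standing in for my_data
df = pandas.DataFrame({"epoch": ["c4000BC", "c3300BC"], "mb": [131, 124]})

print(df.columns)   # the header, as a pandas Index
print(df.values)    # the data without the header, as a numpy.ndarray
print(df.shape)     # (2, 2): 2 rows, 2 columns

# Check the types named in the list above
print(type(df.values).__name__, type(df.shape).__name__)  # ndarray tuple
```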

----------
<font color = "green"> Using the <b> my_data </b> variable containing the DataFrame data, retrieve the <b> header </b> data, data <b> values </b>, and <b> shape </b> of the data. </font> 


Column

In [5]:
my_data.columns

Index(['Unnamed: 0', 'epoch', 'mb', 'bh', 'bl', 'nh'], dtype='object')

-----------
Values

In [6]:
my_data.values

array([[1, 'c4000BC', 131, 138, 89, 49],
       [2, 'c4000BC', 125, 131, 92, 48],
       [3, 'c4000BC', 131, 132, 99, 50],
       [4, 'c4000BC', 119, 132, 96, 44],
       [5, 'c4000BC', 136, 143, 100, 54],
       [6, 'c4000BC', 138, 137, 89, 56],
       [7, 'c4000BC', 139, 130, 108, 48],
       [8, 'c4000BC', 125, 136, 93, 48],
       [9, 'c4000BC', 131, 134, 102, 51],
       [10, 'c4000BC', 134, 134, 99, 51],
       [11, 'c4000BC', 129, 138, 95, 50],
       [12, 'c4000BC', 134, 121, 95, 53],
       [13, 'c4000BC', 126, 129, 109, 51],
       [14, 'c4000BC', 132, 136, 100, 50],
       [15, 'c4000BC', 141, 140, 100, 51],
       [16, 'c4000BC', 131, 134, 97, 54],
       [17, 'c4000BC', 135, 137, 103, 50],
       [18, 'c4000BC', 132, 133, 93, 53],
       [19, 'c4000BC', 139, 136, 96, 50],
       [20, 'c4000BC', 132, 131, 101, 49],
       [21, 'c4000BC', 126, 133, 102, 51],
       ...

-----------
Shape

In [7]:
my_data.shape

(150, 6)

When we train a model, the model requires two inputs, X and y
<ul>
    <li> X: Feature Matrix, or array that contains the data. </li>
    <li> y: Response Vector, or 1-D array that contains the classification categories </li>
</ul>
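
To picture what X and y look like, here is a minimal sketch with made-up names (`X_example`, `y_example`) and only three rows. The key point is that X is two-dimensional, y is one-dimensional, and they have the same number of rows.

```python
import numpy as np

# Hypothetical feature matrix: 3 skulls (rows) x 4 measurements (columns)
X_example = np.array([[131, 138, 89, 49],
                      [125, 131, 92, 48],
                      [124, 138, 101, 48]])

# Hypothetical response vector: one integer class label per row of X_example
y_example = np.array([0, 0, 1])

print(X_example.shape)  # (3, 4)
print(y_example.shape)  # (3,)

# The number of rows in X must equal the number of labels in y
print(X_example.shape[0] == y_example.shape[0])  # True
```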

<b> Note: We will not be able to use the built-in scikit-learn attributes that were used with the digits dataset, since this data is not a scikit-learn Bunch object. </b>

------------
There are some problems with the data in my_data:
<ul>
    <li> There is a header on the data (Unnamed: 0    epoch   mb   bh   bl  nh) </li>
    <li> The data needs to be in numpy.ndarray format in order to use it in the machine learning model </li>
    <li> There is non-numeric data within the dataset </li>
    <li> There are row numbers associated with each row that would affect the model if left in </li>
</ul>

To resolve these problems, I have created a function that fixes these for us:
<b> removeColumns(pandasArray, column) </b>

This function produces one output and requires two inputs.
<ul>
    <li> 1st Input: A pandas DataFrame. The DataFrame we have been using is my_data </li>
    <li> 2nd Input: Any number of integer values (order doesn't matter) that represent the columns that we want to remove. (Look at the data again and find which column contains the non-numeric values.) We also want to remove the first column because it only contains the row number, which is irrelevant to our analysis.</li>
    <ul>
        <li> Note: Remember that Python is zero-indexed, therefore the first column would be 0. </li>
    </ul>
</ul>


In [8]:
# Remove the column containing the target name, since it doesn't contain numeric values.
# Also remove the column that contains the row number.
# axis=1 means we are removing columns instead of rows.
# The function takes a pandas DataFrame and any number of column indices, and returns
# a numpy array without the stated columns.
def removeColumns(pandasArray, *column):
    # column is a tuple of indices; convert it to a list so it selects the column labels
    return pandasArray.drop(pandasArray.columns[list(column)], axis=1).values
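
For comparison, the same clean-up can be sketched by dropping columns by name rather than by position. This is an alternative illustration, not the function used in this lab, and `df` and `features` are hypothetical names for a two-row stand-in.

```python
import pandas

# Hypothetical two-row stand-in for my_data
df = pandas.DataFrame({
    "Unnamed: 0": [1, 2],
    "epoch": ["c4000BC", "c4000BC"],
    "mb": [131, 125],
    "bh": [138, 131],
})

# Drop the row-number and text columns by name, then convert to a numpy array
features = df.drop(columns=["Unnamed: 0", "epoch"]).to_numpy()
print(features)
# [[131 138]
#  [125 131]]
```

Dropping by name is less fragile if the column order ever changes, while dropping by index (as removeColumns does) works without knowing the header text.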

---------
<font color = "green"> Using the function, store the values from the DataFrame data into a variable called new_data. </font>

In [9]:
new_data = removeColumns(my_data, 0, 1)

<font color = "green"> Print out the data in <b> new_data </b> </font>

In [10]:
print(new_data)

[[131 138  89  49]
 [125 131  92  48]
 [131 132  99  50]
 [119 132  96  44]
 [136 143 100  54]
 [138 137  89  56]
 [139 130 108  48]
 [125 136  93  48]
 [131 134 102  51]
 [134 134  99  51]
 [129 138  95  50]
 [134 121  95  53]
 [126 129 109  51]
 [132 136 100  50]
 [141 140 100  51]
 [131 134  97  54]
 [135 137 103  50]
 [132 133  93  53]
 [139 136  96  50]
 [132 131 101  49]
 [126 133 102  51]
 [135 135 103  47]
 [134 124  93  53]
 [128 134 103  50]
 [130 130 104  49]
 [138 135 100  55]
 [128 132  93  53]
 [127 129 106  48]
 [131 136 114  54]
 [124 138 101  46]
 [124 138 101  48]
 [133 134  97  48]
 [138 134  98  45]
 [148 129 104  51]
 [126 124  95  45]
 [135 136  98  52]
 [132 145 100  54]
 [133 130 102  48]
 [131 134  96  50]
 [133 125  94  46]
 [133 136 103  53]
 [131 139  98  51]
 [131 136  99  56]
 [138 134  98  49]
 [130 136 104  53]
 [131 128  98  45]
 [138 129 107  53]
 [123 131 101  51]
 [130 129 105  47]
 [134 130  93  54]
 [137 136 106  49]
 [126 131 100  48]
 [135 136  9

-------
Now we have one half of the data required to fit a model: X, which is new_data.

Next, we need to get the response vector y. Since we cannot use .target and .target_names, I have created a function that will do this for us.

<b> targetAndtargetNames(numpyArray, targetColumnIndex) </b>

This function produces two outputs, and requires two inputs.
<ul>
    <li> <font size = 3.5><b><i>1st Input</i></b></font>: A numpy array. The numpy array you will use is my_data.values (or my_data.to_numpy())</li>
    <ul>
        <li> Note: DO NOT USE <b> new_data </b> here. We need the original .csv data file without the headers </li>
    </ul>
</ul>
<ul>
    <li> <font size = 3.5><b><i>2nd Input</i></b></font>: An integer value that represents the target column. (Look at the data again and find which column contains the non-numeric values. This is the target column.)</li>
    <ul>
        <li> Note: Remember that Python is zero-indexed, therefore the first column would be 0. </li>
   </ul>
</ul>

<ul>
    <li> <font size = 3.5><b><i>1st Output</i></b></font>: The response vector (target) </li>
    <li> <font size = 3.5><b><i>2nd Output</i></b></font>: The target names (target_names) </li>
</ul>



In [11]:
def targetAndtargetNames(numpyArray, targetColumnIndex):
    # Work on the array that was passed in (np.asarray also accepts a DataFrame)
    data = np.asarray(numpyArray)
    target_dict = dict()   # maps each target name to an integer label
    target = list()
    target_names = list()
    count = -1
    for row in data:
        name = row[targetColumnIndex]
        if name not in target_dict:
            count += 1
            target_dict[name] = count
        target.append(target_dict[name])
    # Order the names by their integer label and output them to a list, so that
    # target_names[i] is the name that corresponds to label i in target.
    for targetName in sorted(target_dict, key=target_dict.get):
        target_names.append(targetName)
    return np.asarray(target), target_names
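
As a point of comparison, pandas has a built-in function, `pandas.factorize`, that produces the same kind of output: integer labels in order of first appearance, plus the list of unique names. The sketch below uses a hypothetical three-item column called `epochs` rather than the real data.

```python
import pandas

# Hypothetical epoch column with three skulls from two epochs
epochs = pandas.Series(["c4000BC", "c4000BC", "c3300BC"])

# factorize returns integer codes (in order of first appearance) and the unique names
codes, names = pandas.factorize(epochs)
print(codes)        # [0 0 1]
print(list(names))  # ['c4000BC', 'c3300BC']
```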

<font color = "green"> Using the targetAndtargetNames function, create two variables called <b>target</b> and <b>target_names</b> </font>

In [12]:
target, target_names = targetAndtargetNames(my_data.values, 1)

<font color = "green"> Print out the <b>target</b> and <b>target_names</b> variables you created. </font>

In [13]:
print(target, target_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4] ['c4000BC', 'c3300BC', 'c1850BC', 'c200BC', 'cAD150']


---------
Now that we have the two required variables to fit the data, the cell below gives a sneak peek at how to fit a model.

The data will be fit into a K-Nearest Neighbors model, which will be discussed more in a future lab.<br>
<b>Note</b>: Depending on your scikit-learn version, the predict function may show a warning when run. Please ignore the warning.

In [14]:
X = new_data
y = target
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X, y)
# predict expects a 2-D array, so reshape the single sample into 1 row
print('Prediction:', neigh.predict(new_data[10].reshape(1, -1))[0])
print('Actual:', y[10])

Prediction: 0
Actual: 0
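
To give a feel for what a prediction with n_neighbors=1 is doing, here is a rough sketch of the idea in plain numpy. It is not scikit-learn's implementation: `predict_1nn` is a hypothetical helper written for this illustration, and the training data is made up.

```python
import numpy as np

# Hypothetical training data: 3 skulls with 2 measurements each, and their labels
X_train = np.array([[131, 138], [125, 131], [124, 120]])
y_train = np.array([0, 0, 1])

def predict_1nn(X_train, y_train, sample):
    # Euclidean distance from the sample to every training row
    distances = np.sqrt(((X_train - sample) ** 2).sum(axis=1))
    # The label of the closest training row is the prediction
    return y_train[np.argmin(distances)]

# A training row is at distance 0 from itself, so it gets its own label back
print(predict_1nn(X_train, y_train, X_train[2]))             # 1
# A new, unseen sample gets the label of whichever row is closest
print(predict_1nn(X_train, y_train, np.array([130, 137])))   # 0
```

This also shows why predicting a sample that was part of the training data is not a real test of the model: with one neighbor, the closest point is the sample itself.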