<a href="https://colab.research.google.com/github/Pomalec/AI-data-structure-demo1/blob/main/lab_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome

Welcome to the first set of lab tasks. If you've managed to open this file up in Colab then you've hopefully already worked through the "Preparatory Tasks" on Moodle.

You should already know about:

*   Our use of Google Drive and a root-level folder called /AI/
*   How to download, unzip, and upload the new lab tasks each week
*   How to run the notebooks in Google Colab, and the purpose of the first 3 code cells

Recall that the next three code cells will be found at the start of every notebook and you can just run them! Work through them in order, hovering over them with your mouse and clicking the little "play" button that appears.

In [1]:
# mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# change current working directory
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/')

In [3]:
# check we can see the data
print(os.path.isfile('iris.csv'))

True


# Generative AI use in labs


## Your favourite large language model (YFLLM)

Remember that as you work it's fine (and expected!) for you to be talking to your favourite large language model (YFLLM) - e.g., [Microsoft Copilot](https://copilot.microsoft.com/). But try to centre your own learning, rather than just asking it to generate answers for you to copy/paste. There's further guidance on how to make the best use of generative AI tools on this module in the "[How to study](https://moodle.mmu.ac.uk/mod/page/view.php?id=4935302)" page on Moodle.

The solutions to the lab tasks are available on Moodle if you want to look at them.

## Gemini for Colab

Google's Gemini LLM is now embedded into Colab's interface and will suggest context-aware autocompletions for your code, by default. Auto-completion suggestions can sometimes be a useful time-saver, but they can also be distracting and stop you from thinking and learning. Consider turning Colab’s code completion suggestions off if you find this is happening (Tools->Settings->Editor->see checkbox options).

# The iris.csv data

Recall the iris.csv data from the lecture materials. It contains 4 measurements taken from 3 different species of iris. The overall challenge is to try and predict the species of iris from the measurements: a classification problem. It's not particularly exciting data, but it's good for practising and learning about supervised learning techniques.

You should still have a copy of the iris.csv file on your local machine following unzipping (probably in your Downloads/ folder). Open it up in Excel and have a look at the data.

In a moment we'll introduce some Python code that applies the basic machine learning recipe from the lecture to the iris.csv file, but before we do, we need to introduce a little bit more supervised learning terminology.

Have a look over the terminology document on Moodle ([direct link](https://moodle.mmu.ac.uk/pluginfile.php/8206181/mod_resource/content/5/1-AI-2526-terminology.pdf)) and then answer the questions about the iris.csv file, below:

*Double click to type your answers in:*

1.   How many **observations** are there? **type your answer here**
2.   How many **features** are there? **type your answer here**
3.   What is the name of the **target feature**? **type your answer here**
4.   How many **predictive features** are there? **type your answer here**
5.   What is the first **predictive feature value** in the first **observation**? **type your answer here**
6.   Has the data been sorted in any way? **type your answer here**

# Basic supervised learning recipe

The code below implements the basic supervised learning recipe (steps 1-3 only) discussed in the lecture materials using Python and the [Scikit-learn](https://scikit-learn.org/stable/) machine learning package.

The code trains and evaluates a model called a decision tree classifier. We won't worry about how the model works yet (more on that next week), just give the code a run and see what accuracy it produces.

*   How accurate was the Decision Tree model? **type your answer here**

In [4]:
# Import the packages we need
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# <comment missing>
observations = pd.read_csv('iris.csv')

# <comment missing>
target_feature = 'species'

# <comment missing>
examples = observations.drop(columns=target_feature).to_numpy()
labels = observations[target_feature].to_numpy()

# <comment missing>
train_examples, test_examples, train_labels, test_labels = train_test_split(
    examples,
    labels,
    test_size=0.4,
    random_state=99,
    shuffle=True
)

# <comment missing>
model = DecisionTreeClassifier(random_state=99)

# <comment missing>
model.fit(train_examples, train_labels)

# <comment missing>
predictions = model.predict(test_examples)

# <comment missing>
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy:", accuracy, "(or", round(accuracy * 100, 1), "%)")

Accuracy: 0.95 (or 95.0 %)


Now see if you can copy and paste the comments below to the right places in the code:

```
# Fit the model to our training data
# Split into examples and labels
# Load all the observations from file
# Use the model to generate predictions for our testing examples
# Set the name of the target feature
# Calculate the model's accuracy - the fraction of predictions that were correct
# Create a decision tree classifier model object
# Shuffle and split into training data (60%) and testing data (40%)
```

**Tip**: adding some `print()` statements might help you to be sure about what's happening on each line. For example, `print(test_examples)` to see what's in the `test_examples` variable.

**Tip**: note the first comment has been left in. The list of package imports can be really offputting when you're new to Python. You're not expected to remember them or know them off by heart - almost everyone just looks them up based on the packages they've found they want to use (or their IDE might even autocomplete them). So don't worry about them and just skip past them when you're reading new code listings.
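
To make the idea of accuracy (the fraction of predictions that were correct) a bit more concrete, here is a minimal sketch that works it out by hand with NumPy. The `toy_labels` and `toy_predictions` arrays are made up for illustration; they are not output from the model above.

```python
import numpy as np

# Hypothetical true labels and predictions (made up for illustration)
toy_labels = np.array(['setosa', 'versicolor', 'virginica', 'setosa', 'virginica'])
toy_predictions = np.array(['setosa', 'versicolor', 'versicolor', 'setosa', 'virginica'])

# Element-wise comparison gives an array of True/False values
matches = toy_predictions == toy_labels

# The mean of True/False values is the fraction that are True
toy_accuracy = matches.mean()
print(toy_accuracy)  # 4 of the 5 predictions match, so 0.8
```

This is only meant to show what the number means; in the recipe code, `accuracy_score()` does the work for us.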

# Data Structures

From a Python/programming perspective, probably the most interesting thing about the code is the way that we store the data. Let's spend some time looking at the options, and seeing which of them are used, and where, in our basic recipe code.

## Python lists

Python has built-in ‘lists’ which you can use to store ordered collections of items. For example:

In [5]:
a = [1, 2, 3]
print(a)
print(type(a))

[1, 2, 3]
<class 'list'>


**Tip**: Python's `type()` function is really useful for finding out what data type your variables belong to - it's a good one to add to your debug output.

Lists are really flexible, and handy for little storage jobs while you're coding. For example, you can store a mixture of different data types side by side:

In [6]:
b = [22, 'Dave', 4.6]
print(b)

[22, 'Dave', 4.6]


You can update list entries and resize lists, easily:

In [7]:
b[1] = 'Brian'
print(b)
b.append(99)
print(b)

[22, 'Brian', 4.6]
[22, 'Brian', 4.6, 99]


You can 'slice' lists using the following basic notation:

```
list[start:end:step]
```
Where
*   `start` = The index where the slice starts (inclusive) [defaults to 0 if omitted]
*   `end` = The index where the slice ends (exclusive) [defaults to end of the list if omitted]
*   `step` = The interval between elements in the slice. [defaults to 1 if omitted]

For example:

In [8]:
print(b[2:4:1])
print(b[2:]) # use defaults via omission

[4.6, 99]
[4.6, 99]
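
A few more slices, as a small sketch of how `start`, `end` and `step` combine. The `letters` list is made up for illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

print(letters[::2])    # every second item: ['a', 'c', 'e']
print(letters[1:5:2])  # from index 1 up to (not including) 5, step 2: ['b', 'd']
print(letters[:3])     # the first three items: ['a', 'b', 'c']
print(letters[-2:])    # negative indices count from the end: ['e', 'f']
```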


You can even store other lists inside your lists and get something similar to the sorts of 2D array structures you will have worked with in other languages:

In [9]:
c = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # a list of lists
print(c[1][1]) # access a single value

5


However, lists aren't really arrays (usually a fixed-size data structure containing a single data type) and the cost of their flexibility is that they are _slow_.

So, interestingly, we don't actually use any Python lists in our original code listing. They will sometimes come in handy as a quick/easy way to store data, but we'll typically use another option.

## NumPy arrays

In practice, almost all array-type operations in Python are performed using the [NumPy](https://numpy.org/doc/stable/) package. NumPy implements an `ndarray` object that gives us a fast way to handle arrays that contain data of a single data type.
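
A rough way to see what "fast" means here is to time the same sum over a Python list and over a NumPy array. This is a sketch rather than a careful benchmark: the names `numbers_list` and `numbers_array` are made up for illustration, `np.arange()` is just a convenient way to build an array of consecutive integers, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 1_000_000
numbers_list = list(range(n))
numbers_array = np.arange(n, dtype=np.int64)

# Both approaches compute the same total
list_total = sum(numbers_list)
array_total = int(numbers_array.sum())
print(list_total == array_total)

# Time 10 repeats of each sum
list_time = timeit.timeit(lambda: sum(numbers_list), number=10)
array_time = timeit.timeit(lambda: numbers_array.sum(), number=10)
print("list: ", list_time, "seconds")
print("array:", array_time, "seconds")
```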

You can create a NumPy array from a Python list, for example:



In [12]:
import numpy as np

a = np.array([1, 2, 5, 3])
print(type(a))
print(a.shape)

<class 'numpy.ndarray'>
(4,)


**Tip**: it's slightly confusing that you make `ndarray` objects with a call to a function called `array()` - it's just a quirk of the package.

**Tip**: NumPy arrays have a `.shape` property that shows their shape (or the number/extent of their dimensions) - it's another really useful thing to print out in your debug.

Just remember now that the array can't change size (though there are NumPy helper functions to help you make differently sized _copies_ of arrays) and it can't contain mixtures of data types.
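
The sketch below illustrates both restrictions. The variable names (`mixed`, `original`, `longer`) are made up for illustration, and `np.append()` is just one of the helper functions that returns a differently sized copy.

```python
import numpy as np

# Mixing types: NumPy converts everything to one common type (here, strings)
mixed = np.array([22, 'Dave', 4.6])
print(mixed.dtype)   # a string dtype, not a mixture of types
print(mixed)

# "Resizing": np.append returns a new, longer copy and leaves the original alone
original = np.array([1, 2, 5, 3])
longer = np.append(original, 99)
print(original.shape)  # (4,)
print(longer.shape)    # (5,)
```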

One of the nicest things about NumPy arrays, once you've made them, is that the 'slicing' syntax we saw with simple Python lists, still works:

In [16]:
print(a[0:3])

[1 2 5]


Slicing can also be applied across multiple dimensions of a NumPy array simultaneously. For example, here we use slicing to access values from a 2D array with what is some very compact, and hopefully intuitive, notation:

In [17]:
c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(c.shape)

print(c[2,2]) # print one element - notice only one pair of square brackets, versus lists

print(c[0:3:1,2]) # print contents of right-hand column

print(c[::,2]) # print contents of right-hand column (using defaults via omission)
print(c[:,2]) # print contents of right-hand column (you can actually drop the second colon, too)

print(c[0,:]) # print contents of first row

print(c[0::2,0::2]) # print only the even-indexed rows/columns (indices 0 and 2)

(3, 3)
9
[3 6 9]
[3 6 9]
[3 6 9]
[1 2 3]
[[1 3]
 [7 9]]


This is really useful notation for us on this module, and we'll use it quite a bit.

Starting off with the `examples` and `labels` variables, almost all the variables in our original code listing are NumPy arrays. But we don't initialise the `examples` and `labels` variables from Python lists, so how are they made?

## Pandas DataFrames

Any data we might want to load up on this module is likely to contain a mixture of data types (for example, numbers and strings). Rather than reverting to using Python lists to store them side-by-side, we use another nice package called [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), which implements a `DataFrame` object capable of storing mixed data types in a 2D tabular structure.

DataFrames are handy for loading our data up, and also particularly handy if you need to take extra steps to clean data (something we'll return to look at in future weeks). So the initial loading of our data is done via a Pandas DataFrame:

In [18]:
import pandas as pd

observations = pd.read_csv('iris.csv')
print(type(observations))
print(observations.shape) # DataFrames also have a .shape property (like NumPy arrays)

<class 'pandas.core.frame.DataFrame'>
(150, 5)


If we print the resulting DataFrame object, you can see that the column titles in the .csv file have actually been integrated into the data structure itself:

In [19]:
print(observations)

     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]


You can use these column titles to index the DataFrame. For example, the following line prints the column of data holding sepal_width values, only:

In [20]:
print(observations['sepal_width'])

0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, Length: 150, dtype: float64
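
Here is a short sketch of a couple of other ways to index by column title. To keep it self-contained it uses a tiny hand-made DataFrame called `small` (a made-up stand-in holding the first three rows from the printout above, with only some of the columns) rather than the full iris.csv file.

```python
import pandas as pd

# A tiny stand-in for the iris data (first three rows, three of the columns)
small = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
    'species': ['setosa', 'setosa', 'setosa'],
})

# A single column title gives back one column (a pandas Series)
print(type(small['sepal_width']))

# A list of column titles gives back a smaller DataFrame
print(small[['sepal_length', 'sepal_width']])

# .dtypes shows that each column keeps its own data type
print(small.dtypes)
```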


DataFrames are powerful objects with lots of functionality via a range of different methods, but for now we don't really need this power, and we'd prefer the raw speed of NumPy arrays.

At the start of our original code listing, we move pretty quickly to separate the different data types in our classification problem using the DataFrame's `.drop()` method (to remove a column), and set up our `examples` and `labels` variables as NumPy arrays using the DataFrame's `.to_numpy()` method:

In [21]:
examples = observations.drop(columns='species').to_numpy()
labels = observations['species'].to_numpy()
print(type(examples))
print(examples.shape)
print(type(labels))
print(labels.shape)

<class 'numpy.ndarray'>
(150, 4)
<class 'numpy.ndarray'>
(150,)


So we really only have one DataFrame in our code listing: `observations` right at the start. But DataFrames are still very useful objects that we'll come back to use when we look more at data cleaning in future.

### Predictive features

One thing we lose when we convert from DataFrames to NumPy arrays is the record of the predictive feature names that lived in the DataFrame. This doesn't matter to any of the models we'll train - they don't care what names we assigned to different features - but the names can occasionally be useful to us.

Looking them up in the original .csv file is one option - the columns stay in the same order inside our NumPy arrays. But the code below shows how to grab the predictive features from the original `observations` DataFrame, as a NumPy array, should you need them:

In [22]:
# Grab a record of the predictive features
predictive_features = observations.drop(columns=target_feature).columns.to_numpy()
print(predictive_features)

['sepal_length' 'sepal_width' 'petal_length' 'petal_width']
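
One possible use for such an array is looking up which column index goes with a feature name. The sketch below re-creates the names by hand (as `feature_names`, a made-up variable) so that it runs on its own, and uses `np.where()` to find the position of a match.

```python
import numpy as np

# The names printed above, re-created by hand so this cell runs on its own
feature_names = np.array(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

# Find the column index that goes with a feature name
column_index = int(np.where(feature_names == 'petal_width')[0][0])
print(column_index)  # 3
```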


## Scikit-learn behaviour

The Scikit-learn classes and functions we use in the original code listing are actually flexible about what format they accept data in. They will process: Python lists, NumPy arrays, or Pandas DataFrames. Much of Scikit-learn's functionality is similarly flexible.

However, the package works predominantly with NumPy arrays behind the scenes, for their speed and efficiency, and there are some cases where NumPy arrays are passed back from functions/methods regardless of what you pass in (which can lead to confusion/problems for the caller).

As a general rule, we will aim to pass Scikit-learn NumPy arrays to keep things simple, fast and efficient.
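
As a small illustration of that flexibility, the sketch below fits a decision tree to made-up data stored in plain Python lists and then checks what type comes back from `.predict()`. The `toy_` names and values are invented for this example only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up training data stored as plain Python lists
toy_examples = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
toy_labels = ['low', 'low', 'high', 'high']

toy_model = DecisionTreeClassifier(random_state=99)
toy_model.fit(toy_examples, toy_labels)  # lists are accepted

toy_predictions = toy_model.predict([[0.05, 0.1], [0.95, 1.0]])
print(type(toy_predictions))  # a NumPy array comes back, not a list
print(toy_predictions)
```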

# Interacting with examples and labels

The code below sets up the original `examples` and `labels` variables again. Attempt the following tasks by adding code at the end of the listing:

1.   Print the 'sepal_width' feature values for all examples
2.   Print the 'sepal_width' and 'sepal_length' feature values for all examples
3.   Print all the feature values for the 50th example
4.   Print the petal_length feature values for the first 10 examples
5.   Set the first example to have the new feature values: 2.1, 3.2, 4.3, 5.4
6.   Set the first label to have the new value: 'virginica'
7.   Print the unique labels (hint: find a NumPy method to help you)
8.   Set all the examples with 'sepal_length' feature values that are less than 5.0 equal to 1.0 (hint: look up conditional indexing for NumPy arrays)
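
The hints in tasks 7 and 8 point at two NumPy features. The sketch below shows them on a small made-up array (`toy` and `animals` are invented names and values) rather than on the iris data, so the tasks themselves are still left for you.

```python
import numpy as np

toy = np.array([[1.0, 7.0],
                [4.0, 2.0],
                [6.0, 9.0]])
animals = np.array(['cat', 'dog', 'cat'])

# np.unique returns the distinct values, sorted
print(np.unique(animals))  # ['cat' 'dog']

# A comparison produces a boolean "mask" with one True/False per row
mask = toy[:, 0] < 5.0
print(mask)                # [ True  True False]

# The mask can be used as an index: here, to overwrite the first column
# in just the rows where the mask is True
toy[mask, 0] = 0.0
print(toy)
```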

In [23]:
import pandas as pd
import numpy as np

observations = pd.read_csv('iris.csv')

examples = observations.drop(columns='species').to_numpy()
labels = observations['species'].to_numpy()

# Add your code on the lines below



In [26]:
#1. Print the 'sepal_width' feature values for all examples
print(labels)
print(examples[:, 1])

['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'v

# 1D NumPy arrays

You might have noticed, particularly if looking carefully at `.shape` debug in the previous sections, that 1D NumPy arrays don't have an orientation. That may be obvious, but perhaps not.

Let's play with our simple 2D NumPy array again, as an example:

In [None]:
import numpy as np
c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

Take the example of grabbing all the values in the 2nd row as a single array. We might tend to think of that array as being horizontal - it having formerly been a row of a 2D tabular structure - and therefore as having a shape of `(1,3)` or 1 row deep and 3 columns wide. NumPy doesn't, however. It's a 1D array and, from NumPy's perspective, there is no other dimension to assign it an extent of 1 in. So, if you print out its shape property you'll see `(3,)`. That is, an extent of 3 in the first dimension, and nothing else.

In [None]:
row = c[1,:]
print(row.shape)

The same applies if you read all the values in a single column. The result isn't vertical, despite having formerly been a column of a 2D tabular structure, and it doesn't have a shape of `(3,1)`, but rather `(3,)` again:

In [None]:
column = c[:,1]
print(column.shape)

This can matter in some circumstances on this module. For example, if you want to pass a single testing example to a model in order to generate a prediction. (Or if you want to treat 1D NumPy arrays as row or column vectors and multiply them by matrices for the extension tasks.)

It's useful to keep in mind that 1D NumPy arrays don't actually have an orientation, and also that there is syntax for making sure they do.

If you want to get back horizontal or vertical 2D arrays (with a single row/column, respectively), then one option is to use a slice that covers just one row/column index, rather than a single index, as follows:

In [None]:
row = c[1:2,:]
column = c[:,1:2]
print(row.shape)
print(column.shape)

**Tip**: remember that the number after the first colon is exclusive (i.e., it's the index of the element you want to stop just _before_); the number before the first colon is inclusive (it's the index you want to start from).

Another solution is to use NumPy's `np.atleast_2d()` function, which does pretty much what it sounds like it does, and can be handy sometimes. However, it always turns a 1D array into a horizontal (single-row) 2D array, so the `column` variable below also comes back with a shape of `(1, 3)`.

In [None]:
row = np.atleast_2d(c[1,:])
column = np.atleast_2d(c[:,1])
print(row.shape)
print(column.shape)
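
A further option, sketched below, is the array's `.reshape()` method, where `-1` means "work this dimension out for me". The variable names `values`, `as_row` and `as_column` are made up for illustration.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
values = c[:, 1]                   # 1D array, shape (3,)

as_row = values.reshape(1, -1)     # one row, as many columns as needed
as_column = values.reshape(-1, 1)  # as many rows as needed, one column
print(as_row.shape)     # (1, 3)
print(as_column.shape)  # (3, 1)
```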

# Documentation

We'll work with the Python standard library, NumPy, Pandas and Scikit-learn lots on the module. The documentation and tutorial pages for each are really useful and worth bringing together in a single place here. Googling or talking to YFLLM is often really useful for the first steps, but the documentation pages often contain the fine details:



*   Python: [tutorial/examples](https://docs.python.org/3/tutorial/index.html); [docs](https://docs.python.org/3/library/index.html)
*   NumPy: [tutorial/examples](https://numpy.org/devdocs/user/quickstart.html); [docs](https://numpy.org/doc/stable/reference/index.html)
*   Pandas: [tutorial/examples](https://pandas.pydata.org/docs/user_guide/10min.html); [docs](https://pandas.pydata.org/docs/reference/index.html)
*   Scikit-learn: [tutorial/examples](https://scikit-learn.org/stable/auto_examples/index.html); [docs](https://scikit-learn.org/stable/api/index.html)



# Final review

Now we've looked in more depth at what's happening in the basic recipe code, give it a final review. Are you happy with each of the lines of code? Do you want to add any debug to clarify anything? We'll use this basic recipe a lot in the early weeks of the module so it's good to get comfortable with it.

# Connection to 1CWK100

Notice you can copy/paste the basic recipe code to produce an initial solution for the first 1CWK100 task (and get a passing mark!). You'll just need to make a couple of small modifications to the original code listing; what are they? (Work inside the 1CWK100 template - available on Moodle - if you try things out...)

# Extension tasks

(Extension tasks are optional. They don't map to the assessment on this module and there is no harm in skipping them. However, for people who are interested in the subject, and who have the time, the extension tasks can help you deepen your machine learning knowledge and meet the module learning objectives more fully.)

Start by asking YFLLM about the syntax for defining a new Python function, and ask any follow-up questions you might have.

**Possible prompt**: "*How do I define a new function in Python?*"

**Possible follow-up**: "*How do I return more than one value?*"

**Possible follow-up**: "*Tell me more about tuples in Python.*"

**Possible follow-up**: "*Tell me more about tuple unpacking in Python.*"
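
For reference, here is a minimal sketch of the ideas behind those prompts: a function that returns more than one value as a tuple, and a call that unpacks the tuple into separate variables (the same pattern the recipe code uses when it calls `train_test_split()`). The `min_and_max()` function is made up for illustration.

```python
def min_and_max(values):
    """Return the smallest and largest item of a non-empty list."""
    return min(values), max(values)  # the comma makes this a tuple

result = min_and_max([4, 9, 2, 7])
print(result)        # (2, 9)
print(type(result))  # <class 'tuple'>

# Tuple unpacking: assign each item of the tuple to its own variable
smallest, largest = min_and_max([4, 9, 2, 7])
print(smallest, largest)  # 2 9
```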

Can you now write a `my_train_test_split()` function that re-implements the main steps carried out by scikit-learn's `train_test_split()` function? Some suggestions for approaching the re-implementation follow below:

*   You don't need to accept all the same parameters as the original function straight away - start by accepting `examples` and `labels` only and use sensible defaults for the other values
*   The original function will accept `examples` and `labels` as lists, ndarrays, or DataFrames - cater only for ndarrays to begin with (add support for other options later on if you want to)
*   For the shuffle, remember that the random sequence you use to reorder the `examples` and `labels` must be the same - individual examples and labels must still correspond after the shuffle is complete
*   You should be able to replace the call to `train_test_split()` in the original code listing with a call to `my_train_test_split()` and see a comparable performance from the model in terms of accuracy (it won't be identical due to the randomness of the shuffle)
*   Can you extend your function to handle lists/ndarrays/DataFrames? Can you extend your function to handle other helpful parameters from the caller?
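
For the shuffle suggestion above, one possible approach (a sketch, not necessarily how scikit-learn does it internally) is to generate a single random ordering of the row indices and use it to index both arrays. The `toy_` arrays are made up for illustration, and the seed of 99 is arbitrary.

```python
import numpy as np

toy_examples = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
toy_labels = np.array(['a', 'b', 'c', 'd'])

# One random ordering of the row indices 0..3 (seeded so it is repeatable)
rng = np.random.default_rng(99)
order = rng.permutation(len(toy_examples))

# Index both arrays with the same ordering so the rows stay paired up
shuffled_examples = toy_examples[order]
shuffled_labels = toy_labels[order]
print(order)
print(shuffled_examples)
print(shuffled_labels)
```

Whatever ordering comes out, the example starting with 10 should still sit alongside label 'a', the one starting with 20 alongside 'b', and so on.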

Assuming you're new to Python, you'll inevitably need to Google for how to do new/small things and/or chat to YFLLM and/or search the relevant docs (e.g., NumPy). This will be a regular feature of the module, and particularly of the extension tasks.

In [None]:
# Add your code on the lines below

