# Welcome to MacAI's workshop "Code Your First ML Program"!
In this workshop, we will be taking a look at the "Breast Cancer Wisconsin" dataset. This dataset describes 569 tumour samples, each with 30 numeric measurements computed from a digitized image of the sample, along with a label indicating whether the tumour is benign (non-cancerous) or malignant (cancerous). Our job is to create a program that can classify new samples such as these into one of the two categories. <br> <br>
For more details on this dataset:
> [Breast Cancer Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)<br>

To explore other datasets:
> [List of Available Datasets](https://scikit-learn.org/stable/datasets/index.html)<br><br>

### Great, we're ready to begin! Let's start by importing the function to load the dataset. We can import any of the built-in datasets using the following code:
 > # from sklearn.datasets import d
  - The placeholder "d" stands for the name of a loader function from the list of ***sklearn*** datasets, such as `load_breast_cancer`.



In [4]:
from sklearn.datasets import load_breast_cancer

### Now we'll actually load the dataset and store it in a variable using the following code:
> # ourData = load_datasetName()
 - The placeholder "datasetName" stands for the name of your sklearn dataset

In [5]:
bc = load_breast_cancer()
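
As a quick first check, here is a minimal sketch of how one might inspect what `load_breast_cancer()` returned. It only uses the loader already imported above; the choice of what to print is ours and is not part of the original workshop cells.

```python
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# The returned object behaves like a dictionary; list what it contains.
print(list(bc.keys()))

# data holds one row per sample and one column per feature;
# target holds one integer label per sample.
print(bc.data.shape)
print(bc.target.shape)
```

For this dataset the shapes are `(569, 30)` and `(569,)`: 569 samples, 30 features each, and one label per sample.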

### Before we can proceed, we must understand what information is in our dataset. The image below offers a visual representation of this data in the context of ***sklearn***:

![alt text](ExampleImage.png "Title")

Conceptually, our ML model/algorithm is observing training data, trying to establish patterns in its **features**, and then using these patterns to predict **labels**. Fundamentally, however, our algorithm must operate in the context of numbers. For example, if we plug our data into a function, it has no way of outputting "dog" as its classification. Instead, it may output 0's and 1's, where "dog" is represented by 0, and "cat" is represented by 1. The input for our algorithm, therefore, is the **data** contained within our **features**, and the output is the **targets**. To extract meaningful information, we need to know what this **data** and these **targets** really mean. The meaning of these is contained under **feature_names** and **target_names**.<br>
### Let's print the feature_names and target_names of our dataset:

> # ourData.feature_names
> # ourData.target_names

In [12]:
print(bc.feature_names)
print(bc.target_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
['malignant' 'benign']
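
To connect the integer targets to these names, the sketch below counts how many samples carry each target value and prints the name stored at the same index. Using `np.bincount` for the counting is our choice for illustration; any other counting approach would do.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# Count how many samples carry each integer target (0 and 1).
counts = np.bincount(bc.target)

# Pair every count with the human-readable name stored at the same index.
for index, name in enumerate(bc.target_names):
    print(index, name, counts[index])
```

The position of a name in `target_names` is the integer used for it in `target`, so 0 stands for "malignant" and 1 stands for "benign".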


### Now let's look at the 0-th datapoint in our dataset, as well as its corresponding target:<br>
> # ourData.data[0]<br>
> # ourData.target[0]

In [13]:
print(bc.data[0])
print(bc.target[0])

[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]
0


### Let's make this a little easier to see: <br>
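
One possible way to do this, offered as a sketch rather than the workshop's own cell, is to pair each value of the 0-th datapoint with its feature name and to translate the target into its name. The column widths in the format string are arbitrary.

```python
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# Pair each feature name with the matching value of the 0-th datapoint.
for name, value in zip(bc.feature_names, bc.data[0]):
    print(f"{name:<25} {value:10.4f}")

# Translate the integer target into its name.
label = bc.target_names[bc.target[0]]
print("target:", bc.target[0], "=", label)
```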

# Excellent! We now understand the terms in sklearn, and have explored our data to see what it consists of.

### The last step before actually making our classifier is to split our data into training and testing data.
Remember, we want our model to *train* on the **features** *and* **labels** of our training data in order to establish patterns. Then, we want it to *predict* the **labels** of our test data using *only* its **features**.

### To do this, we'll first import the appropriate function from sklearn:
> # from sklearn.model_selection import train_test_split

In [16]:
from sklearn.model_selection import train_test_split

### We'll now split our data using the following code:
> # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = d)
 - The variable "X_train" receives our training features
 - The variable "X_test" receives our test features
 - The variable "y_train" receives our training labels
 - The variable "y_test" receives our test labels<br><br>
 - The parameter "X" takes in your dataset's **data**
 - The parameter "y" takes in your dataset's **targets**
 - The parameter "d" takes in the desired fraction of data allocated to the test data, as a decimal

**Note 1 |** Notice that "X" and "y" must be defined before calling this function. <br>
**Note 2 |** A common convention is to use uppercase "X" and lowercase "y".

### Let's look at the code below:<br>

In [18]:
X = bc.data
y = bc.target
# test_size=0.8 holds out 80% of the samples for testing; the remaining 20% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
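
To confirm what the split produced, the following sketch prints how many samples ended up in each set. It repeats the same call so that it can run on its own; nothing here goes beyond the functions already introduced.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
X, y = bc.data, bc.target

# Same call as above: 80% of the samples are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)

# Every sample ends up in exactly one of the two sets.
print("training samples:", X_train.shape[0])
print("test samples:    ", X_test.shape[0])
print("total:           ", X_train.shape[0] + X_test.shape[0])
```

With `test_size=0.8` most of the data goes to the test set. A more common choice is a smaller fraction such as `test_size=0.2`, which keeps most of the data for training.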

# We can finally create our classifier! 
There are a number of different classifiers in ***sklearn***, and the full list can be found here:<br>
> [List of sklearn Classifiers](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)<br>

In this example, we will use the ***K-Nearest Neighbors*** classifier. More information about the classifier can be found here:
> [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)<br><br>
> [Video](https://www.youtube.com/watch?v=MDniRwXizWo)

### To begin, we'll import this classifier; we'll initialize it in the next step:<br>

In [19]:
from sklearn.neighbors import KNeighborsClassifier

### Next, we need to "train" our classifier using our training data. We use the following code:
> # classifier.fit(trainingFeatures, trainingLabels)
 - "classifier" refers to the name of our classifier
 - The parameter "trainingFeatures" refers to our training **data**
 - The parameter "trainingLabels" refers to our training **targets**

### Let's look at the code below:<br>

In [21]:
# Create a classifier that considers the 3 nearest neighbors, then train it
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

### Cool, we've just trained our classifier! We won't delve into the inner workings of *how* the algorithm was "trained" and the math behind it, but appreciate that our classifier has now established some patterns in our data and is ready to make predictions. We'll do this with the following code:
> # classifier.predict(testFeatures)
 - Again, "classifier" simply refers to the name we've given to our classifier
 - The parameter "testFeatures" refers to our test **data**

### We'll make and print these predictions using the code below:<br>

In [23]:
predictions = classifier.predict(X_test)
print(predictions)

[1 0 0 0 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0
 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 0 0 1 1 1 0 1 1 0 0 1 1
 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1
 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1
 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 0
 0 0 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 1 0 0 1 0
 0 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 1 1
 1 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1
 1 1 1 1 0 1 1 0 1 0 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 1 1 0
 1 0 1 0 0 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 1
 1 1 0 0 1 1 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1
 1 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0
 1 0 1 0 1 0 0 1 1 0 1 1]
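
For a rough sense of where a single prediction comes from, the sketch below asks the classifier for the 3 training samples nearest to one test sample and takes a majority vote over their labels. It uses the classifier's `kneighbors` method; the `random_state=0` argument is our assumption, added only so the sketch gives the same result on every run.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    bc.data, bc.target, test_size=0.8, random_state=0  # random_state is an assumption
)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

# Take the first test sample, keeping it 2-D (1 row, 30 columns).
sample = X_test[:1]

# Find the 3 training samples closest to it.
distances, indices = classifier.kneighbors(sample)
neighbor_labels = y_train[indices[0]]

# The most common label among the neighbors is the predicted class.
vote = np.bincount(neighbor_labels).argmax()

print("neighbor labels:", neighbor_labels)
print("majority vote:  ", vote, "=", bc.target_names[vote])
print("predict():      ", classifier.predict(sample)[0])
```

The majority vote and the output of `predict` agree, which is the core idea of K-Nearest Neighbors with uniform weights.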


In [24]:
for prediction in predictions:
    print(prediction, ' = ', bc.target_names[prediction])

1  =  benign
0  =  malignant
0  =  malignant
0  =  malignant
0  =  malignant
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
0  =  malignant
0  =  malignant
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
0  =  malignant
0  =  malignant
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
0  =  malignant
0  =  malignant
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
0  =  malignant
0  =  malignant
1  =  benign
1  =  benign
1  =  benign
0  =  malignant
1  =  benign
1  =  benign
0  =  malignant
0  =  malignant
...

### Sweet! We can go back to our dataset and manually check if the predictions are correct, but that would be extremely inefficient, especially for larger datasets. Instead, we'll measure the accuracy of our predictions using a built-in function. We'll first import this, and then call it:
> # from sklearn.metrics import accuracy_score
> # accuracy_score(testLabels, predictions)
 - The parameter "testLabels" corresponds to the actual test targets from our dataset.
 - The parameter "predictions" corresponds to the targets we just predicted for our test features.

### Let's look at the code below:<br>

In [27]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(accuracy)

0.9298245614035088
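
To see what this number means, the sketch below computes the same quantity by hand: the number of predictions that match the true labels, divided by the total number of predictions. It rebuilds the split and the classifier so that it can run on its own, so its accuracy will generally differ from the value printed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

bc = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, test_size=0.8)

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# Count the predictions that match the true labels, then divide by the total.
correct = np.sum(predictions == y_test)
manual_accuracy = correct / len(y_test)

print(correct, "of", len(y_test), "predictions are correct")
print(manual_accuracy)
print(accuracy_score(y_test, predictions))
```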


### Not too bad! If we re-run our entire program, we will observe a slightly different accuracy, but within the same ballpark range. This is because `train_test_split` shuffles the data randomly, so each run trains and tests on a different set of samples; the K-Nearest Neighbors algorithm itself involves no randomness.<br>
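
If repeatable results are wanted, one option is to pass a fixed `random_state` to `train_test_split`, which fixes how the data is shuffled. The sketch below illustrates this; `run_once` is a hypothetical helper written for this example, and the seed values 42 and 7 are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

bc = load_breast_cancer()


def run_once(seed):
    """Split, train, predict, and return the accuracy for one seed."""
    X_train, X_test, y_train, y_test = train_test_split(
        bc.data, bc.target, test_size=0.8, random_state=seed
    )
    classifier = KNeighborsClassifier(n_neighbors=3)
    classifier.fit(X_train, y_train)
    return accuracy_score(y_test, classifier.predict(X_test))


# The same seed gives the same split, and therefore the same accuracy.
first = run_once(42)
second = run_once(42)
print(first, second)

# A different seed shuffles the data differently, so the accuracy may change.
print(run_once(7))
```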

--- 


# Congratulations! You've just created your (potentially) first ML program and used it to successfully predict whether a tumour is benign or malignant based on its appearance. While solving new real-life problems may not be so trivial, the concepts you've learned here will remain with you throughout your journey in AI.
<br>