# Emerging Technologies
***

Jupyter notebook for Emerging Technologies module to keep track of tasks.
Author: Mark Reilly
ID: G00357230 
***

## Tasks
***

1. Write a Python function called sqrt2 that calculates and
prints to the screen the square root of 2 to 100 decimal places. Your code should
not depend on any module from the standard library.

2. The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives a table as an example (reproduced under Task 2 below), stating that the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell.


3. The standard deviation of an array of numbers <span style="background-color: #FFFF00"> x</span> is calculated using <span style="background-color: #FFFF00">numpy</span> as <span style="background-color: #FFFF00">np.sqrt(np.sum((x - np.mean(x))**2)/len(x))</span> .
However, Microsoft Excel has two different versions of the standard deviation calculation, <span style="background-color: #FFFF00">STDEV.P</span> and <span style="background-color: #FFFF00">STDEV.S</span> . The <span style="background-color: #FFFF00">STDEV.P</span> function performs the above
calculation but in the <span style="background-color: #FFFF00">STDEV.S</span> calculation the division is by <span style="background-color: #FFFF00">len(x)-1 </span> rather
than <span style="background-color: #FFFF00">len(x)</span> . Research these Excel functions, writing a note in a Markdown cell
about the difference between them. Then use numpy to perform a simulation
demonstrating that the <span style="background-color: #FFFF00">STDEV.S</span> calculation is a better estimate for the standard deviation of a population when performed on a sample. Note that part of
this task is to figure out the terminology in the previous sentence.

4. Use scikit-learn to apply kNN clustering to Fisher’s famous Iris data set. You will easily obtain a copy of the data set online. Explain in a Markdown cell how your code works and how accurate it might be, and then explain how your model could be used to make predictions of species of iris.


***
# Task 1: Calculate a square root
<br>

### Introduction

Having read the brief I will complete the following list of tasks:
<br>

- Research the following:
    - Definition of square root
    - Newton's Method
    - Babylonian Method
    - F-String Formatting
- Calculate the square root using python.
- Test the function and compare output with another function that imports decimal.
- Review work and write a conclusion.
- Continually add references to the section.

## Research

<b>Definition of Square Root:</b>

The square root of a number is another number which produces the first number when it is multiplied by itself. For example, the square root of 16 is 4 [4].

Algorithms for calculating square root:

<b>Newton's Method</b> for computing the square root of a number. Let a given number be $b$ and let $x$ be a rough guess of the square root of $b$. Newton's method suggests that a better guess, $x_{\text{new}}$, can be computed as follows [5]:

$$ x_{\text{new}} = 0.5 \left(x + \frac{b}{x}\right) $$

You can start with $b$ as a rough guess and compute $x_{\text{new}}$. From $x_{\text{new}}$ you can generate an even better guess, and so on, until two successive guesses are very close. Either of these guesses can then be taken as an approximation of the square root of $b$ [5].

<b>The Babylonian</b> method for finding the square root of a number $S$. Make an initial guess using any positive number $x_0$.
Improve the guess by applying the following formula [6]:

$$ x_1 = \frac{x_0 + \frac{S}{x_0}}{2} $$

Iterate until convergence: feed the result $x_1$ back into the formula in place of $x_0$ and repeat until successive results agree to the desired number of decimal places. This is the same update as Newton's method above, with $S$ playing the role of $b$.
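
As a worked illustration of the formula (taking $S = 2$ and an initial guess $x_0 = 2$, both chosen here purely as an example), the first three iterations are:

$$ x_1 = \frac{2 + \frac{2}{2}}{2} = \frac{3}{2} = 1.5 $$

$$ x_2 = \frac{\frac{3}{2} + \frac{2}{3/2}}{2} = \frac{17}{12} \approx 1.416667 $$

$$ x_3 = \frac{\frac{17}{12} + \frac{2}{17/12}}{2} = \frac{577}{408} \approx 1.414216 $$

After only three steps the guess already agrees with $\sqrt{2} \approx 1.414214$ to five decimal places.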


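<b>F-String formatting</b> [8,9]

Inside an f-string, a format specifier follows a colon after the expression. The f presentation type formats a value as a fixed-point number. We can also specify the precision, i.e. the number of decimal places: it is the value that goes right after the dot character, as in .2f.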
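
A minimal sketch of the two kinds of format specifier this notebook relies on: fixed-point precision for floats and zero-padded width for integers. The values are made up for illustration and are not taken from the notebook.

```python
# Illustrative values only; they are not taken from the notebook.
value = 3.14159265

# .2f: fixed-point notation with 2 digits after the decimal point.
two_places = f"{value:.2f}"

# 08d: an integer padded with leading zeros to a total width of 8.
padded = f"{42:08d}"

print(two_places)  # 3.14
print(padded)      # 00000042
```

The zero-padded integer form matters later in this task: when the digits after the decimal point are stored as an integer, the padding keeps any leading zeros that would otherwise be lost.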


We can calculate the square root of a number using Newton's method [1]. Starting from a current guess $X$, we can find an improved estimate $Z$ of the square root of a number $N$ using the following formula [2].

$$ Z = 0.5 \left(X + \frac{N}{X}\right) $$


In [37]:
# Adapted from: https://www.geeksforgeeks.org/find-root-of-a-number-using-newtons-method/
def sqrtTest():
    """
    Approximate the square root of 2 with Newton's method using floats.
    """
    # Number whose square root is wanted.
    n = 2
    # Initial guess; n itself or 1 both work.
    x = 2
    # Tolerance: stop once successive guesses differ by less than this.
    tol = 0.00001

    while True:
        # Calculate a more accurate guess z from the current guess x.
        z = 0.5 * (x + n / x)
        # If successive guesses agree to within the tolerance, break out of the loop.
        if abs(z - x) < tol:
            break
        # Update x with the improved guess.
        x = z

    return z

In [38]:
print ("The Square Root of 2 is: {:.100f}".format(sqrtTest()))

The Square Root of 2 is: 1.4142135623746898698271934335934929549694061279296875000000000000000000000000000000000000000000000000


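As seen above, the sqrtTest() function returns a Python float, which has only 53 bits of precision, so the digits printed beyond roughly the 16th significant figure are not meaningful. With the tolerance used here, the result also matches the true square root of 2 to only 11 decimal places.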
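
The 53-bit limit can be inspected directly. The sketch below uses the sys module only to look at the float type; it is an illustration and not part of the solution, which must avoid the standard library.

```python
import sys

# Number of bits in the significand of a Python float (IEEE 754 double).
mantissa_bits = sys.float_info.mant_dig

# Decimal digits that a float is guaranteed to preserve.
reliable_digits = sys.float_info.dig

# 2**53 + 1 needs 54 bits, so it cannot be stored exactly as a float.
collapses = float(2**53 + 1) == float(2**53)

print(mantissa_bits, reliable_digits, collapses)  # 53 15 True
```

Python integers have no such limit, which is why the approach that follows works with integers instead of floats.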

In [9]:
# Adapted from: https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra

"""
Function to calculate square root of 2 and output the result correct to 100 decimal places.
"""

x = 2 * 10 ** 200

r = x

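def test_diffs(x, r):
    # Distance of r**2, (r-1)**2 and (r+1)**2 from the target x.
    d0 = abs(x - r**2)
    dm = abs(x - (r-1)**2)
    dp = abs(x - (r+1)**2)
    # r is the best integer root if neither neighbour is closer to x.
    minimised = d0 <= dm and d0 <= dp
    # True when r + 1 is closer than r - 1, so r should be increased.
    below_min = dp < dm
    return minimised, below_min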

while True:
    oldr = r
    r = (r + x // r) // 2

    minimised, below_min = test_diffs(x, r)
    if minimised:
        break

    if r == oldr:
        if below_min:
            r += 1
        else:
            r -= 1
        minimised, _ = test_diffs(x, r)
        if minimised:
            break

print(f'{r // 10**100}.{r % 10**100:0100d}')

1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727


In [33]:
# Using above example to help with this function.
def sqrt():
    """
    A function to calculate the square root of 2 to 100 decimal places.
    """
    # The number whose square root is wanted.
    x = 2

    # Scale by 10**200 so that the integer square root of num
    # holds the digits of the square root of 2 to 100 decimal places.
    num = x * 10 ** 200

    # Initial guess of z which is halfway between 0 and num
    z = num//2

    # Iterate 1000 times using the integer form of Newton's update,
    # which is more iterations than the guess needs to settle.
    for i in range(0, 1000):
        z = (z + num//z)//2

    # Return the result and by using format specifiers [8], add the decimal point after the first digit.
    return f'{z // 10**100}.{z % 10**100:0100d}'



In [9]:
sqrt()

'1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727'
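
The fixed count of 1000 iterations above could be replaced by a stopping rule. The sketch below is one possible variant, not the notebook's own function: the names isqrt_newton and sqrt2_digits are hypothetical, and the loop stops as soon as the integer guess no longer decreases. It uses no imports, and it checks its own answer by squaring.

```python
def isqrt_newton(n):
    """Return the largest integer r with r * r <= n, for n >= 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Starting at n guarantees the first guess is not below the true root.
    guess = n
    while True:
        better = (guess + n // guess) // 2
        # The guesses decrease until they reach the integer square root.
        if better >= guess:
            return guess
        guess = better


def sqrt2_digits(places=100):
    """Return sqrt(2) truncated to `places` decimal places, as a string."""
    scale = 10 ** places
    root = isqrt_newton(2 * scale * scale)
    return f"{root // scale}.{root % scale:0{places}d}"


result = sqrt2_digits(100)
print(result)

# Self-check by squaring: root is correct when root**2 <= n < (root + 1)**2.
n = 2 * 10 ** 200
root = int(result.replace(".", ""))
print(root * root <= n < (root + 1) ** 2)
```

The squaring check confirms every one of the 100 digits without relying on floats or on any imported module.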

<br>

## Testing 

In [12]:
import math
print ("The Square Root of 2 is: {:.100f}".format(math.sqrt(2)))

The Square Root of 2 is: 1.4142135623730951454746218587388284504413604736328125000000000000000000000000000000000000000000000000


In [55]:
# Adapted from https://stackoverflow.com/questions/4733173/how-can-i-show-an-irrational-number-to-100-decimal-places-in-python
from decimal import *
getcontext().prec = 100
a = Decimal(2)
print("\nSquare root of 2 to 100 significant digits : ", a.sqrt())


Square root of 2 to 100 significant digits :  1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641573


In [57]:
# Adapted from: https://www.journaldev.com/23715/python-convert-string-to-float#:~:text=We%20can%20convert%20a%20string,object%20__float__()%20function.
'''
Converting both results to float values and then comparing to see if they are equal.
'''
getcontext().prec = 100
DecimalSqrt = a.sqrt()
# Store the result under a new name so the sqrt() function is not overwritten.
sqrtResult = sqrt()

# Showing what type each output is.
print(type(DecimalSqrt))
print(type(sqrtResult))

# Convert to float, cannot compare String value and a decimal.Decimal value.
# Note: a float keeps only about 15 to 17 significant digits, so this
# comparison checks agreement to that precision at most, not all 100 places.
DecimalSqrt = float(DecimalSqrt)
sqrtResult = float(sqrtResult)

# https://www.geeksforgeeks.org/problem-in-comparing-floating-point-numbers-and-how-to-compare-them-correctly/
def compareFloatNum(a, b):

    # Compare floating-point numbers using a small tolerance
    # rather than testing for exact equality.
    if abs(a - b) < 1e-9:
        print("The numbers are equal ")
    else:
        print("The numbers are not equal ")

compareFloatNum(DecimalSqrt, sqrtResult)

<class 'decimal.Decimal'>
<class 'str'>
The numbers are equal 
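
Because the comparison above goes through floats, it can only confirm agreement to the tolerance of 1e-9. A stricter check is to compare the digit strings themselves. The sketch below is one way to do that under a few assumptions: math.isqrt (Python 3.8+) stands in for the notebook's sqrt() so that the block is self-contained, and the Decimal value is computed with 10 extra guard digits and then cut back to 100 decimal places.

```python
import math
from decimal import Decimal, localcontext

PLACES = 100

# Integer route: floor(sqrt(2 * 10**200)) holds sqrt(2) scaled by 10**100.
# math.isqrt (Python 3.8+) is used here as a stand-in for the notebook's sqrt().
root = math.isqrt(2 * 10 ** (2 * PLACES))
integer_digits = f"{root // 10**PLACES}.{root % 10**PLACES:0{PLACES}d}"

# Decimal route: work with 10 guard digits, then cut back to 100 decimal places.
with localcontext() as ctx:
    ctx.prec = PLACES + 10
    decimal_digits = str(Decimal(2).sqrt())[: PLACES + 2]

print(integer_digits == decimal_digits)
```

Slicing the Decimal string truncates rather than rounds, which matches the integer route, since an integer square root is also rounded down.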


<br>

## Conclusion and Learning Outcomes

- Newton's Method is a fast way to approximate the square root of a number.
- Newton's Method works by taking an initial guess and repeatedly applying Newton's update until the desired accuracy is reached, i.e. until the difference between two successive guesses is small enough.
- Implemented a function to calculate the square root of two, but could only calculate the result to the precision of a float (53 bits).
- Learned that there are 53 bits of precision available for a Python float. [3]
- Researched how to calculate the result to 100 decimal places.
- Found a function that used integer arithmetic and format specifiers to display the square root of 2 to 100 decimal places. [7]
- Adapted this function and was able to complete the task.
- Compared the outputs from my sqrt() function and the function using the Decimal import. The two agree when converted to floats, and the printed digits match as far as the Decimal output extends, apart from rounding in its final digit.

<br>

## References

[1] https://en.wikipedia.org/wiki/Newton%27s_method

[2] https://www.geeksforgeeks.org/find-root-of-a-number-using-newtons-method/

[3] https://docs.python.org/2/tutorial/floatingpoint.html

[4] https://www.collinsdictionary.com/dictionary/english/square-root

[5] https://pages.mtu.edu/~shene/COURSES/cs201/NOTES/chap06/sqrt-1.html

[6] https://blogs.sas.com/content/iml/2016/05/16/babylonian-square-roots.html

[7] https://stackoverflow.com/questions/64278117/is-there-a-way-to-create-more-decimal-points-on-python-without-importing-a-libra

[8] https://www.python.org/dev/peps/pep-0498/#format-specifiers

[9] https://stackoverflow.com/questions/45310254/fixed-digits-after-decimal-with-f-strings/45310389

***

<br>

# Task 2: The Chi-Squared Test

The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example, stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell.

The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification.

| | A | B | C | D | Total |
| :- | :-: | :-: |:-: |:-: |:-: |
| White Collar | 90 | 60 | 104 | 95 | 349 |
| Blue Collar | 30 | 50 | 51 | 20 | 151 |
| No Collar | 30 | 40 | 45 | 35 | 150 |
| Total | 150 | 150 | 200 | 150 | 650 |

<br>

## Introduction

Having read the brief I will complete the following list of tasks:
<br>

- Research the Chi-Squared test.

- Research Chi-Squared formula and how to use it.

- Calculate the Chi-Squared value through theory.

- Calculate the Chi-Squared value using scipy.stats.

- Analyse the results.

- Continually add references to the section.

#### What is the Chi-Squared Test?

The Chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called a "goodness of fit" statistic, because it measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent. Chi is the Greek letter $\chi$, so we can also write it as the $\chi^2$ test.[2] It is important to note that the test only works with categorical data. This means the data has to be counted and sorted into categories; it will not work with parametric or continuous data, e.g. height in inches.

<br>

## Research

#### The Chi-Squared Formula [1]

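$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

$O$ stands for the Observed Frequency, that is, the frequencies observed in the sample data.

$E$ stands for the Expected Frequency, calculated for each cell of the table as

$$ E = \frac{\text{Row Total} \times \text{Column Total}}{N} $$

where $N$ is the grand total of all observations. The sum runs over every cell of the table.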




<br>

### Calculating the Chi-Square Value  in theory

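In the following I am going to calculate the Chi-Squared value for the above data, starting with the Expected Frequency for position (1,1), i.e. White Collar in neighborhood A.

$$ \text{Expected Frequency} = \frac{349 \times 150}{650} $$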

In [12]:
Row = 349
Column = 150
Total = 650

ExpectedFreq = (Row * Column) / Total
print("Expected Frequency =", ExpectedFreq)

Expected Frequency = 80.53846153846153


$$ \text{Expected Frequency} \approx 80.5385 $$

Now that we have the Expected Frequency we can calculate the contribution to the Chi-Squared value from position (1,1) in the data using the formula.

$$ X = \frac{(90 - 80.5385)^2}{80.5385} $$


In [13]:
ExpectedF = 80.5385
Observed = 90

X = ((Observed - ExpectedF)**2) / ExpectedF
print("Value for position (1,1) = ", X)

Value for position (1,1) =  1.111517873439411


The above example calculated the contribution to the Chi-squared value from position (1,1) in the data. The same calculation has to be done for every observed value in the table, and the results are then summed. This summed total is the Chi-squared statistic, which is then used to calculate the p-value.[1]
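
The full hand calculation can be sketched in plain Python by repeating the single-cell steps above for every cell. The variable names are my own; the counts are the ones in the table at the start of this task.

```python
# Observed counts from the table in this section
# (rows: White Collar, Blue Collar, No Collar; columns: A, B, C, D).
observed = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for row, row_total in zip(observed, row_totals):
    for obs, col_total in zip(row, col_totals):
        # Expected frequency for this cell if the variables are independent.
        expected = row_total * col_total / grand_total
        chi_squared += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi_squared, 4), dof)  # 24.5712 6
```

The total of about 24.57 is consistent with the approximate value of 24.6 quoted in the task.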

<br>

### Calculating Chi-Squared value using scipy.stats

In [40]:
# Adapted from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html?highlight=scipy%20stats%20chi2_contingency#scipy.stats.chi2_contingency
"""
The following code calculates the chi-square statistic, p-value and degrees of freedom.
"""

from scipy.stats import chi2_contingency
import numpy as np

ObservedData = np.array([[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]])

# Returns the statistic, the p-value, the degrees of freedom and the expected frequencies.
Chi_Square = chi2_contingency(ObservedData)

print(type(Chi_Square))
print("The chi statistic =", Chi_Square[0])
print("The p-Value =", Chi_Square[1])
print("The degree of freedom =", Chi_Square[2])


<class 'tuple'>
The chi statistic = 24.5712028585826
The p-Value = 0.0004098425861096696
The degree of freedom = 6


<br>

### Analysis of Results

The statistic returned by scipy.stats, 24.57, agrees with the approximate value of 24.6 quoted in the task. For a Chi-square test, a p-value that is less than or equal to your significance level indicates there is sufficient evidence to conclude that the observed distribution is not the same as the expected distribution. You can conclude that a relationship exists between the categorical variables.[4] Based on this statement we can reject the null hypothesis that each person's neighborhood of residence is independent of the person's occupational classification, since the p-value of 0.0004 is less than the significance level of 0.05.
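
The p-value can also be cross-checked without SciPy. When the degrees of freedom are even, the upper-tail probability of the Chi-squared distribution has a simple closed form, and 6 is even. The sketch below assumes that closed form; the function name chi2_sf_even_dof is hypothetical, and the statistic is the one printed by the scipy.stats cell.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-squared variable, for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only holds for positive even dof")
    half = x / 2
    # exp(-x/2) * sum of (x/2)**k / k! for k = 0 .. dof/2 - 1
    total = sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return math.exp(-half) * total


# Statistic and degrees of freedom as printed by the scipy.stats cell.
p_value = chi2_sf_even_dof(24.5712028585826, 6)
print(p_value)
print(p_value < 0.05)
```

This only works for even degrees of freedom; for the general case scipy.stats remains the practical choice.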

## References

[1] https://www.ling.upenn.edu/~clight/chisquared.htm

[2] https://www.mathsisfun.com/data/chi-square-test.html 

[3] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html?highlight=scipy%20stats%20chi2_contingency#scipy.stats.chi2_contingency

[4] https://statisticsbyjim.com/hypothesis-testing/chi-square-test-independence-example/#:~:text=For%20a%20Chi%2Dsquare%20test,exists%20between%20the%20categorical%20variables. 

***
# Task 3

The standard deviation of an array of numbers <span style="background-color: #FFFF00"> x</span> is
calculated using <span style="background-color: #FFFF00">numpy</span> as <span style="background-color: #FFFF00">np.sqrt(np.sum((x - np.mean(x))**2)/len(x))</span> .
However, Microsoft Excel has two different versions of the standard deviation calculation, <span style="background-color: #FFFF00">STDEV.P</span> and <span style="background-color: #FFFF00">STDEV.S</span> . The <span style="background-color: #FFFF00">STDEV.P</span> function performs the above
calculation but in the <span style="background-color: #FFFF00">STDEV.S</span> calculation the division is by <span style="background-color: #FFFF00">len(x)-1 </span> rather
than <span style="background-color: #FFFF00">len(x)</span> . Research these Excel functions, writing a note in a Markdown cell
about the difference between them. Then use numpy to perform a simulation
demonstrating that the <span style="background-color: #FFFF00">STDEV.S</span> calculation is a better estimate for the standard deviation of a population when performed on a sample. Note that part of
this task is to figure out the terminology in the previous sentence.

<br>

### Introduction

Having read the brief I will complete the following list of tasks:
<br>

- Complete research on STDEV.P and STDEV.S explaining what the functions do.

- Compare the differences between the functions.

- Use numpy to perform a simulation demonstrating that the STDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample

- Explain my code.

- Continually add to references section.

<br>

### Research

<b> Standard Deviation:</b>
<br>
Standard deviation is a measure of how much variance there is in a set of numbers compared to the average (mean) of the numbers. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.[3]

<b>STDEV.P:</b>
<br>
The STDEV.P function calculates the standard deviation of a set of data that represents an entire population. It divides the sum of squared deviations by the number of values, n.[1]

<b>STDEV.S</b>
<br>
STDEV.S is a statistical function that calculates and returns the standard deviation for a sample of data. STDEV.S calculates the standard deviation using the “n-1” method.[2]

<b> Compare the differences between STDEV.P and STDEV.S:</b>
<br>
- The STDEV.P function is used when your data represents the entire population. The STDEV.S function is used when your data is a sample of the entire population.

- If we have a dataset covering all of the people living in Ireland, we would use STDEV.P because it contains <span style="background-color: #FFFF00">All</span> the people in Ireland. Whereas if the dataset is a sample of <span style="background-color: #FFFF00">Some</span> of the people in Ireland, we would use STDEV.S because it is a subset of the whole population.

- In the STDEV.P function, the sum of squared deviations is divided by the total number of values, usually written N. In STDEV.S, the sum of squared deviations is divided by the number of values in the sample minus one, written N-1.[4]

- The STDEV.P function treats the data as the entire population, so some values may dominate the resulting standard deviation, which is then taken to describe everyone in the data, including minorities. Source [4] calls this biased analysis and recommends STDEV.P only when an analysis is non-destructive.[4]
- The STDEV.S function is used on a small sample of the entire population, and one is subtracted from the denominator (the number of values in the sample). Source [4] calls this unbiased analysis of the standard deviation and recommends it when an analysis is destructive.[4] Dividing by N-1 compensates for the deviations being measured from the sample mean rather than the true population mean, which would otherwise make the spread look smaller than it is.
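
Written out as formulas, with $x_1, \dots, x_N$ the data values and $\bar{x}$ their mean, the two calculations described above are:

$$ \text{STDEV.P} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2} $$

$$ \text{STDEV.S} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2} $$

As a small worked example (the eight values are made up for illustration), take the data 2, 4, 4, 4, 5, 5, 7, 9. The mean is 5 and the squared deviations sum to 32, so

$$ \text{STDEV.P} = \sqrt{\frac{32}{8}} = 2 \qquad \text{STDEV.S} = \sqrt{\frac{32}{7}} \approx 2.138 $$

STDEV.S is always the larger of the two, and the gap shrinks as $N$ grows.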

### Calculation

Use numpy to perform a simulation demonstrating that the STDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample.

#### Simulating the data using numpy.random.normal

numpy.random.normal draws random samples from a normal (Gaussian) distribution.[6]

This function takes in three parameters:[6]
- loc: float or array_like of floats - Mean (“centre”) of the distribution.

- scale: float or array_like of floats - Standard deviation (spread or “width”) of the distribution. Must be non-negative.

- size: int or tuple of ints, optional Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. If size is None (default), a single value is returned if loc and scale are both scalars. Otherwise, np.broadcast(loc, scale).size samples are drawn.

In [38]:
# Adapted from https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html
import numpy as np
#Set the loc, scale and size
loc = 40
scale = 3
size = 1000

# Simulating the data using numpy.random.normal
simulation = np.random.normal(loc,scale,size)


#### Calculating standard deviation using numpy

In [39]:

#Using STDEV.P formula to calculate standard deviation
def stdev_P(x):
    return np.sqrt(np.sum((x - np.mean(x))**2)/len(x))

stdev_P(simulation)

3.020372747307692

In [40]:
#Using STDEV.S formula to calculate standard deviation
def stdev_S(x):
    return np.sqrt(np.sum((x - np.mean(x))**2)/(len(x)-1))

stdev_S(simulation)

3.0218840672658187

#### Which is a better estimate for standard deviation?

In [41]:
# Adapted from https://www.codegrepper.com/code-examples/python/numpy+standard+deviation

# Calculate standard deviation using numpy.std()
# By default, std() treats the data as a whole population (ddof=0)
STDEVP = np.std(simulation)

print("STDEV.P result:", STDEVP)

STDEV.P result: 3.020372747307692


In [42]:
# Adapted from https://www.codegrepper.com/code-examples/python/numpy+standard+deviation

#Calculate standard deviation using numpy.std()
# by specifying ddof=1, it calculates based on the sample
STDEVS = np.std(simulation, ddof=1)

print("STDEV.S result:", STDEVS)

STDEV.S result: 3.0218840672658187


In [43]:
# Adapted from https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html

#numpy.random.choice() Generates a random sample from a given array
sample = np.random.choice(simulation, 30)
np.random.shuffle(sample)

In [48]:
# Standard deviation of the 30-value sample using each formula.
sample_P = stdev_P(sample)
print(f"STDEV.P result using a sample: {sample_P:0.2f}")
# Accuracy is expressed relative to the true standard deviation (scale = 3).
print(f'STDEV.P Accuracy: {(sample_P/scale)*100:0.2f}%')

sample_S = stdev_S(sample)
print(f"\nSTDEV.S result using a sample: {sample_S:0.2f}")
print(f'STDEV.S Accuracy: {(sample_S/scale)*100:0.2f}%')

STDEV.P result using a sample: 2.70
STDEV.P Accuracy: 89.93%

STDEV.S result using a sample: 2.74
STDEV.S Accuracy: 91.46%


### Analysis of Results

From analysing the results of the sample data, the STDEV.S function calculated a better estimate of the true Standard Deviation of 3, with a result of about 2.74 (an Accuracy of 91.46%), compared with about 2.70 (89.93%) for STDEV.P.

In this simulation the STDEV.S function gives a better estimate of the true Standard Deviation from a sample of data than STDEV.P. A single sample of 30 values supports this conclusion but does not prove it on its own, because the claim is about how the two calculations behave on average across many samples.
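
To look at the average behaviour rather than one sample, the experiment can be repeated many times. The sketch below is one way to do this with numpy, under assumptions that differ slightly from the cells above: samples are drawn directly from the normal distribution instead of from the simulated array, a fixed seed is used for repeatability, and the number of repetitions (10,000) is an arbitrary choice.

```python
import numpy as np

# Assumed settings for this sketch: same population as above, seeded for repeatability.
rng = np.random.default_rng(0)
true_sd = 3
sample_size = 30
n_samples = 10_000

# Each row is one independent sample of 30 values from a normal(40, 3) population.
samples = rng.normal(loc=40, scale=true_sd, size=(n_samples, sample_size))

# ddof=0 matches STDEV.P and ddof=1 matches STDEV.S, applied to every row.
sd_p = np.std(samples, axis=1, ddof=0)
sd_s = np.std(samples, axis=1, ddof=1)

print("Mean STDEV.P over all samples:", sd_p.mean())
print("Mean STDEV.S over all samples:", sd_s.mean())
print("Mean squared STDEV.P (variance):", (sd_p ** 2).mean())
print("Mean squared STDEV.S (variance):", (sd_s ** 2).mean())
```

Averaged over many samples, the STDEV.S values should sit closer to the true value of 3, and their squares closer to the true variance of 9, than the STDEV.P values do.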

<br>

### References

[1] STDEV.P definition - https://exceljet.net/excel-functions/excel-stdev.p-function#:~:text=The%20STDEV.,deviation%20for%20an%20entire%20population.

[2] STDEV.S definition - https://www.spreadsheetweb.com/excel-stdev-s-function/#:~:text=S%20(STDEV%20S)%20is%20a,%E2%80%9Cn%2D1%E2%80%9D%20method.

[3] Standard Deviation definition - https://en.wikipedia.org/wiki/Standard_deviation

[4] Difference between STDEV.P & STDEV.S - https://www.exceltip.com/statistical-formulas/how-to-use-excel-stdev-p-function.html#:~:text=Let's%20Explore.-,The%20STDEV.,sample%20of%20the%20entire%20population.

[5] Calculation of Stddev.P & Stddev.S - https://www.codegrepper.com/code-examples/python/numpy+standard+deviation

[6] Simulating data - https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html

***
# Task 4

Use scikit-learn to apply kNN clustering to Fisher’s famous Iris data set. You will easily obtain a copy of the data set online. Explain in a Markdown cell how your code works and how accurate it might be, and then explain how your model could be used to make predictions of species of iris.

### Introduction

Having read the question I will complete the following list of tasks:

- Complete research on the meaning of scikit-learn and kNN clustering.
- Using Fisher's famous Iris data set, use scikit-learn to apply kNN clustering to this data set.
- Explain the above code and how accurate it might be.
- Explain how the model could be used to make predictions of species of iris.
- Continually add to references section.

### Research

##### Scikit-Learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is built upon NumPy, SciPy and Matplotlib.[1]

The functionality that scikit-learn provides include:[1]

- Regression, including Linear and Logistic Regression
- Classification, including K-Nearest Neighbors
- Clustering, including K-Means and K-Means++
- Model selection
- Preprocessing, including Min-Max Normalization

<b>Unsupervised Machine Learning</b>

Unsupervised learning is where you only have input data (X) and no corresponding output variables. The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.[2]

<b>Supervised Machine Learning</b>

Supervised learning is where you have input variables (x) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output. The goal is to approximate the mapping function so well that when you have new input data (x) that you can predict the output variables (Y) for that data.

#### k-Nearest Neighbors 

<b> k-Nearest Neighbors Algorithm</b>

The k-nearest neighbors algorithm is a supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. It uses ‘feature similarity’ to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. The algorithm works through the following steps: [3,4]

- To implement the algorithm we need a dataset, so the first step of KNN is to load the training data as well as the test data.
- Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer.
- For each point in the test data do the following:
    - Calculate the distance between the test point and each row of training data using a distance measure such as Euclidean, Manhattan or Hamming distance. The most commonly used is Euclidean distance.
    - Sort the rows in ascending order of distance.
    - Choose the top K rows from the sorted array.
    - Assign a class to the test point based on the most frequent class among these rows.
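
The steps above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how scikit-learn implements it: the function name knn_predict is hypothetical, the six training points are made-up petal measurements rather than rows read from the Iris file, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance, ascending, and keep the k closest rows.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent class among the k neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up (petal length, petal width) values for illustration only.
train_points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3),
                (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
train_labels = ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "versicolor"]

small_flower = knn_predict(train_points, train_labels, (1.6, 0.25), k=3)
large_flower = knn_predict(train_points, train_labels, (4.6, 1.4), k=3)
print(small_flower, large_flower)  # setosa versicolor
```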

<b> Clustering</b>

Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm (for example the k-means algorithm) to assign each data point to a specific group. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields.[5] kNN, by contrast, is a supervised classification algorithm, so the rest of this task uses it as a classifier on the labelled Iris data.


### Using scikit-learn for k-NN clustering 

<b> The Dataset </b>

I am going to use the famous iris dataset. The dataset consists of four attributes: sepal-width, sepal-length, petal-width and petal-length. These are the attributes of specific types of iris plant. The task is to predict the class to which these plants belong. There are three classes in the dataset: Iris-setosa, Iris-versicolor and Iris-virginica.[6]

In [9]:
# Adapted from - [6]
# Importing the Iris Dataset
import pandas as pd
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']


# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)

#Output the dataset
dataset.head(20)

| | sepal-length | sepal-width | petal-length | petal-width | Class |
| :- | :-: | :-: | :-: | :-: | :-: |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
| 17 | 5.1 | 3.5 | 1.4 | 0.3 | Iris-setosa |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |


<b> Split the dataset into its attributes and labels </b>


In [10]:
# Adapted from - [6]
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

<b> Train Test Split </b>

To avoid over-fitting, we will divide our dataset into training and test splits, which gives us a better idea as to how our algorithm performed during the testing phase. This way our algorithm is tested on un-seen data, as it would be in a production application.[6]

<b> Over-fitting </b>is a modeling error that occurs when a function is too closely fit to a limited set of data points.[7]

In [12]:
# Adapted from - [6]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

<b> Feature Scaling </b>

Before making any actual predictions, it is always a good practice to scale the features so that all of them can be uniformly evaluated.[6] 


In [13]:
# Adapted from - [6]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

<b>Training and Predictions</b>

The KNeighborsClassifier class is imported from the sklearn.neighbors module. In the second line, this class is initialized with one parameter, n_neighbors, which is the value of K. There is no ideal value for K and it is selected after testing and evaluation. 5 is a commonly used starting value for the KNN algorithm [6], although the cell below uses K = 1, so each test point is given the class of its single nearest training point.

In [23]:
# Adapted from - [6]
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
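
Once a classifier is fitted, predicting the species of a new flower comes down to scaling its four measurements in the same way as the training data and calling predict. The sketch below is self-contained and rests on some assumptions: it uses scikit-learn's bundled copy of the Iris data rather than the CSV loaded above, K = 5, all 150 rows for training, and a made-up set of measurements for the new flower. With the notebook's own objects the equivalent call would be classifier.predict(scaler.transform(new_flower)).

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the Iris data, used so the sketch runs offline.
iris = load_iris()

# Chain scaling and the classifier so new data is scaled like the training data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(iris.data, iris.target)

# Hypothetical new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.0, 3.4, 1.5, 0.2]]
predicted_index = model.predict(new_flower)[0]
predicted_species = iris.target_names[predicted_index]
print(predicted_species)
```

Short petals like these are typical of Iris setosa, so that is the species the model would be expected to return here.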

<b> Evaluating the Algorithm </b>

For evaluating an algorithm, confusion matrix, precision, recall and f1 score are the most commonly used metrics. The confusion_matrix and classification_report methods of the sklearn.metrics can be used to calculate these metrics.[6] 

A <b>Confusion matrix</b> is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

<b> Precision </b> is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision corresponds to a low false positive rate. [8]

$$ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$


<b> Recall</b> calculates how many of the Actual Positives our model captures through labeling it as Positive (True Positive). Applying the same understanding, we know that Recall shall be the model metric we use to select our best model when there is a high cost associated with False Negative.[8]

$$ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$

$$ = \frac{\text{True Positive}}{\text{Total Actual Positive}} $$

<b>F1-Score</b> is needed when you want to seek a balance between Precision and Recall AND there is an uneven class distribution (large number of Actual Negatives).[8]


$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

In [24]:
# Adapted from - [6]
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[12  0  0]
 [ 0  8  1]
 [ 0  1  8]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       0.89      0.89      0.89         9
 Iris-virginica       0.89      0.89      0.89         9

       accuracy                           0.93        30
      macro avg       0.93      0.93      0.93        30
   weighted avg       0.93      0.93      0.93        30
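
The figures in the classification report can be reproduced by hand from the confusion matrix using the formulas above. The sketch below copies the matrix from the output and treats each species in turn as the "positive" class; the variable names are my own.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [12, 0, 0],
    [0, 8, 1],
    [0, 1, 8],
]

metrics = {}
for i, label in enumerate(labels):
    true_positive = matrix[i][i]
    # Predicted as this class but actually another: rest of column i.
    false_positive = sum(matrix[r][i] for r in range(len(labels))) - true_positive
    # Actually this class but predicted as another: rest of row i.
    false_negative = sum(matrix[i]) - true_positive
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (round(precision, 2), round(recall, 2), round(f1, 2))

# Accuracy: correct predictions (the diagonal) over all predictions.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / sum(sum(row) for row in matrix)

for label, values in metrics.items():
    print(label, values)
print("accuracy", round(accuracy, 2))
```

In this run 28 of the 30 test flowers were classified correctly, which is where the accuracy of 0.93 comes from. Because the train and test split is random, a different run may give a slightly different figure.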



#### References

[1] Explaining Scikit-learn - https://www.codecademy.com/articles/scikit-learn

[2] Supervised and Unsupervised Machine Learning definitions - https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

[3] K-NN explained - https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

[4] K-NN Algorithm - https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm

[5] Clustering Definition - https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68#:~:text=Clustering%20is%20a%20Machine%20Learning,the%20grouping%20of%20data%20points.&text=In%20theory%2C%20data%20points%20that,dissimilar%20properties%20and%2For%20features.

[6] Using scikit learn and k-nn algorithm - https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/

[7] Over-fitting - https://www.investopedia.com/terms/o/overfitting.asp#:~:text=Overfitting%20is%20a%20modeling%20error,in%20the%20data%20under%20study.

[8] Precision, recall and F1 - https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

End