GUID: 2660237
GitHub Link: ["Github"](https://github.com/Natasha-Warder/AI.Python.github.io.-).


# Machine Learning by Example: from Start to End

# Task 1: Initial Readings 


## Reading 2 Analysis 
The code in this notebook is adapted from code made available by Aurélien Géron in his fabulous book ["Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow"](https://eleanor.lib.gla.ac.uk/record=b4094676).

- Programming techniques process: examine what the desired output usually looks like, write a detection algorithm for noticeable patterns, then test the program, repeating steps one and two as needed. 
- Machine learning is highly applicable to fluctuating environments as it can be retrained on new data, keeping systems up to date.
- Artificial Neural Networks (ANNs) are simplified models of the central nervous system: highly interconnected neural computing elements that can respond to input stimuli, learn, and adapt to the environment. ANNs have been shown to be effective as computational processors for various tasks including pattern recognition, speech and visual image recognition.  
- Dynamic recurrent networks are trained with supervised backpropagation learning algorithms. A recurrent neural network (RNN) is a type of artificial neural network that works on sequential or time series data. 
- Pattern recognition: ANNs have been applied to character recognition, including the recognition of lateral character strings (Imai, 1991); RNNs capture temporal information well.  



### Scikit-Learn

In [None]:
import sys # importing the package sys which lets you talk to your computer system

assert sys.version_info >= (3, 7) # versions are expressed as a tuple of numbers, e.g. (3, 7)

The `assert` statement raises an error when the condition following it is not true. If it is true, nothing is shown. 
You can experiment by replacing the numbers in the round brackets with much bigger ones to see the error. A pair of numbers like `(3, 7)` in round brackets is a data structure known as a **tuple** in programming lingo. 

Python versions are typically represented as three integers (major, minor, micro). For example, version 3.7.1 would be represented as (3, 7, 1).

The tuple `(3, 7)` sets the minimum required version to Python 3.7. This kind of version check ensures that the script or program runs on a sufficiently recent Python interpreter, which helps to avoid unexpected issues that may arise due to differences in syntax or functionality between Python versions.
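
The comparison in the `assert` works because Python compares tuples element by element, from left to right. A minimal sketch of this, using made-up version numbers (`older` and `newer` are hypothetical, not the version on your machine):

```python
# Tuples are compared element by element, left to right.
required = (3, 7)

# Hypothetical interpreter versions, written as (major, minor, micro).
older = (3, 6, 9)
newer = (3, 10, 2)

print(older >= required)  # False: 6 is smaller than 7
print(newer >= required)  # True: 10 is larger than 7

# Comparing version *strings* instead would give the wrong answer,
# because the character "1" sorts before the character "7".
print("3.10" >= "3.7")  # False
```

This is also why the Scikit-Learn check further down parses the version string before comparing it.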

In [None]:
import sys # importing the package sys which lets you talk to your computer system.

assert sys.version_info >= (3, 8)

The next cell checks that the installed Scikit-Learn package version is at least 1.0.1. 

In [None]:
from packaging import version # import the "version" module from the "packaging" package
import sklearn # import scikit-learn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1") 

### Fonts Used in Figures

The following code sets some font sizes to be used with `matplotlib.pyplot`, which was previously used to display visual information or data. Sizes have been chosen based on visual aesthetics, alignment and clarity. 

In [None]:
import matplotlib.pyplot as plt

plt.rc('font', size=14) # sets general font size
plt.rc('axes', labelsize=14, titlesize=14) # sets font size for axis labels and axis titles
plt.rc('legend', fontsize=14) # font size for legends
plt.rc('xtick', labelsize=10) # labelsize controls the font size of labels for intervals marked on the x axis
plt.rc('ytick', labelsize=10) # labelsize controls the font size of labels for intervals marked on the y axis

Setting the font sizes for axis titles, legends and tick labels overrides the default styling of Matplotlib plots. 

Moreover, adjusting font sizes can improve the readability and aesthetics of plots, which makes them more suitable for your specific use case or preference.

### Creating the Folder for Images 

The code below creates the directory `images/classification` (if it doesn't already exist) and defines the `save_fig()` function which is used to save the figures you create in matplotlib in high resolution.

In [None]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "classification"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Task 1-3: Review Machine Learning

In [None]:
from IPython.display import Image
Image("images/workflow.jpg")


# Lectures 3 & 4 in alignment with the workflow

## What is the Data? 
As a subfield of Machine Learning (ML), Information Retrieval (IR) involves the development of algorithms and systems for searching, retrieving, and presenting information from large collections of structured and unstructured information. In data processing, the task is to make data more meaningful and informative by converting it from a given form into a much more usable one. This entire process can be automated using machine learning algorithms, mathematical modelling, and statistical knowledge.

 
## What is a Learning Algorithm?
A learning algorithm has two components: a loss function and an optimization technique. A loss occurs when the target estimate provided by the ML model does not match the target exactly; the loss function quantifies this penalty as a single value. The optimization technique then adjusts the model's weights to reduce the loss, which is how a learning algorithm learns the weights. The learned weights determine how strongly each input contributes to the model's prediction. 
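
As a rough illustration of these two components, the sketch below pairs a mean squared error loss with gradient descent to learn a single weight. The data, the learning rate and the function names `mse_loss` and `mse_gradient` are assumptions made for this example only.

```python
# Toy data following y = 2 * x (hypothetical values, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse_loss(weight):
    """Loss function: mean squared error of the predictions weight * x."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(weight):
    """Derivative of the loss with respect to the weight."""
    return sum(2 * (weight * x - y) * x for x, y in zip(xs, ys)) / len(xs)

weight = 0.0          # initial guess
learning_rate = 0.05  # step size (an assumed value)

# Optimization technique: gradient descent.
# Each step moves the weight in the direction that reduces the loss.
for step in range(100):
    weight -= learning_rate * mse_gradient(weight)

print(round(weight, 3), round(mse_loss(weight), 6))
```

The weight ends up very close to 2, the value that makes the loss zero for this data.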

 
## What is the Model Output?
Model output is the prediction or decision made by a machine learning model based on input data. In supervised learning, the model output is a predicted target value for a given input, as produced for example by decision trees or linear regression models. In unsupervised learning, the model output may include cluster assignments or other learned patterns in the data.
 

## Explaining the workflow and Examples to a Wider Audience
Spam filtering Example: 

- Words in a document are divided up between different nodes. These nodes distinguish different characteristics of the words, weighting each word as important or not, so that the email can be classified as spam or not.  
- Initially you generate a network with random weights.  
- The network 'learns' distinguishing features (punctuation etc.) with the objective of flagging spam. 
- Examples, experience = training data.  
- Performance measure = dependent on how you define ham and spam.  
- Eventually each arrow will have a weight which influences how much firing occurs. 
- Neurons fire in certain patterns in response to information: the neural network will fire when identifying a spam email.  

 
Here “flagging spam” is the task T, and training on the examples of spam and ham constitutes the experience E. Performance measure P would measure how well the machine is performing its task, so it needs to be precisely defined. 
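
To make "precisely defined" concrete, the sketch below computes three common candidates for P on a handful of made-up predictions. The labels are hypothetical and not the output of any real spam filter.

```python
# Hypothetical true labels and predictions for six emails (1 = spam, 0 = ham).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)   # spam flagged as spam
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed
correct = sum(1 for t, p in pairs if t == p)

accuracy = correct / len(y_true)               # share of all emails classified correctly
precision = true_pos / (true_pos + false_pos)  # share of flagged emails that really are spam
recall = true_pos / (true_pos + false_neg)     # share of spam emails that were flagged

print(accuracy, precision, recall)
```

Which of these is the right P depends on what matters more for the task: losing a genuine email (favour precision) or letting spam through (favour recall).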

Real life examples have assisted my learning in comprehending the machine learning process. To engage a wider audience with learning code I plan to include videos, images and external readings to assist in the delivery of information. 

[spam_filtering_resource](https://www.sciencedirect.com/science/article/pii/S1877050921013016)

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo("9phQ4Mut2wc")

# Task 1-4: Framing the Problem

Two datasets:

1)	Tabular data consisting of information about houses in districts within the US state of California.
The first of these datasets will be used to **predict median housing prices for a given district**. The results of the prediction will be combined with other data to determine whether it is worth investing in a given district. 

2)	Image pixel data, each image representing a digit handwritten by high school students and employees of the US Census Bureau. 
The second of these datasets will be used to **classify handwritten digits**. It was originally developed as a way of sorting out the handwritten US zip codes (similar to UK postcodes) at the post office. 


## Step 1: How framing the problem affects data selection

**“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask … for once I know the proper question, I could solve the problem in less than five minutes.” — (Einstein, SOURCE)**

Problem framing is the process of defining a challenge, such as a business challenge, in terms of the data that may be used to solve it. It transforms an abstract objective, such as "we want to know which customers are likely to return," into information about what data will be used, how the data will be presented, and potential modelling strategies. This involves both comprehending the fundamental principles of your issue and connecting them with appropriate methods and strategies.

The machine learning model using the data above could be used to gather statistics on catchment areas for schools and the average salary of parents, indicating a level of education received based on class. Sociologically, this could be used to identify areas of inequality.  


## Step 2: How to select your algorithm

In Lecture 2, we discussed how machine learning can be divided into three types: Supervised, Unsupervised, and Reinforcement. We will explore supervised learning by using examples of regression and classification.

#### Regression or classification for predicting median housing prices?
Regression is used to predict outputs that are continuous, like housing prices, which can take any value within a range. The outputs are quantities that can be flexibly determined based on the inputs of the model rather than being confined to a set of possible labels, as in classification. 

#### Regression or classification for handwritten digit recognition?
Classification is used to predict a discrete label. The outputs fall under a finite set of possible outcomes. Many situations have only two possible outcomes (true or false, yes or no); this is called binary classification. Multi-class classification follows the same idea, except that instead of two possible outcomes there are three or more. Handwritten digit recognition is a multi-class problem, with ten possible labels (the digits 0 to 9).


#### Conclusion 
It is important to assess whether the task involves predicting a continuous output or one of a finite set of outcomes. 
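
The difference shows up directly in what a fitted model returns. The sketch below is a minimal illustration on made-up data; the feature values, targets and labels are all hypothetical.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical single feature (e.g. a size measurement) for six examples.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

# Regression target: a continuous quantity.
y_continuous = [1.5, 3.1, 4.4, 6.2, 7.4, 9.0]
# Classification target: one of a finite set of labels.
y_label = ["small", "small", "small", "large", "large", "large"]

regressor = LinearRegression().fit(X, y_continuous)
classifier = LogisticRegression().fit(X, y_label)

print(regressor.predict([[3.5]]))   # a number that need not appear in y_continuous
print(classifier.predict([[3.5]]))  # always one of "small" or "large"
```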

**KEY RESOURCE: short article from Codecademy** – [“Regression vs Classification”](https://www.codecademy.com/article/regression-vs-classification).


## Step 3: Before Data Collection

Once your problem is defined (e.g. predicting the median housing price of a district), you will need to collect a new data set appropriate for your task, and/or identify existing data sets that can be used for training your model. 

#### Information about housing in a district that would help predict the median housing price in the district:
- schools in the area
- accessibility to public transport
- square footage of housing
- date built
- total rooms
- distance from city centre
- population density 

#### How these decisions might depend on geographical and/or cultural differences:
- conflict/war impacting availability of housing
- rural vs centralised living 
- low populations vs high populations
- accessibility to education

#### How the information collected would already bias the data:
- Reporting bias (also known as selective reporting) takes place when only a selection of results or outcomes are captured in a data set.
- Citation bias: occurs when your analysis is based on studies found in the citations of other studies.
- Language bias: occurs when you ignore reports not published in your native language.
- Location bias: occurs when certain studies are harder to locate than others.
- Automation bias: the tendency of humans to favor results or suggestions generated by automated systems and to ignore contradictory information from non-automated sources, even if it is correct.

**KEY RESOURCE: short article from Medium** – [“Type of Data Bias”](https://towardsdatascience.com/types-of-biases-in-data-cafc4f2634fb).



# Working with Data

Although reviews of machine learning and AI commonly present training the algorithm as the largest part of machine learning, this is inaccurate. Machine learning also includes, among other things, data cleaning, re-scaling, and labelling.


In [None]:

from IPython.display import YouTubeVideo

YouTubeVideo("DUpDQWTOivg")

A majority of machine learning work is **data management**: storing and backing up the data generated alongside the beginning data, and recording the evaluation results, data exploration, and interpretation. 

Data management plays a substantial part in:
- achieving transparency
- addressing the ethical concerns of AI and bias
- data protection 
- etc.

# Task 2: Getting Data
There are many ways you can get data. Previous labs explored scikit-learn's datasets package, which has some datasets already available.   

Data can also come from OECD, OpenML, Kaggle, individual repositories in GitHub. 

We will use the following datasets to carry out the predictions:
- **Example Tabular Data**: dataset comprising housing prices in California in the United States. This dataset is available on GitHub, courtesy of Aurélien Géron. 

- **Example Image Data**: MNIST dataset comprising images of handwritten digits. Handwritten digit recognition with the MNIST dataset is sometimes called the **"Hello World!" of machine learning**. 

## Task 2-1: Download the Data: Example Tabular Data

The following code defines a **function** called `load_housing_data()`. This function retrieves a compressed file available at `https://github.com/ageron/data/raw/main/housing.tgz` and saves it in a folder `datasets`, which is in the same folder as this notebook. The folder will be created if it does not exist. The code will also extract the contents into the folder `datasets`. 

In [None]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data(): #defines a function that loads the housing data available as .tgz file on a github URL
    tarball_path = Path("datasets/housing.tgz") # where you will save your compressed data
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True) # creates the datasets folder if it does not exist yet
        url = "https://github.com/ageron/data/raw/main/housing.tgz" # github url for data retrieval 
        urllib.request.urlretrieve(url, tarball_path) # gets the url content and saves it at location specified by tarball_path
        with tarfile.open(tarball_path) as housing_tarball: # opens saved compressed file as housing_tarball
            housing_tarball.extractall(path="datasets") # extracts the compressed content to datasets folder
    return pd.read_csv(Path("datasets/housing/housing.csv")) # uses pandas to read the csv file from the extracted content

housing = load_housing_data() #runs the function defined above

This code exemplifies a convenient way to download, extract, and load housing data from a specific GitHub repository using Python.

It utilizes the pathlib, pandas, tarfile, and urllib.request modules to achieve these tasks. Then the data is stored in a Pandas DataFrame to ease processes of manipulation and analysis.

#### Once the compressed file is downloaded and extracted, just import the pandas library

In [None]:
import pandas as pd
from pathlib import Path

housing = pd.read_csv(Path("datasets/housing/housing.csv"))

## About Housing Data

In [None]:
housing.info()

The result above tells you how many attributes (e.g. longitude, latitude) characterise the dataset. The data type `float64` is numerical, so the summary also tells you how many attributes are not numerical: one, of type `object`.  

In [None]:
housing["ocean_proximity"].value_counts() # tells you what values the column `ocean_proximity` can take and how often each occurs

In [None]:
housing.hist(bins=50, figsize=(12, 8))  # plots a histogram for each numerical attribute of the data set
save_fig("attribute_histogram_plots")  # the save_fig function takes a fig_id parameter and saves the current Matplotlib figure as a PNG file with that name
plt.show()

In [None]:
housing.describe() #`housing.describe()`summarises the data set `housing`

It is important to stop exploring the data at this point, until the test data has been set aside, to prevent inadvertent bias creeping into the machine learning process.

# Task 2-2: Downloading Data: Example Image Data

In contrast to tabular data, image data sets are not always read in using pandas. Technically you can do this (as the commented-out line below suggests), but as there are no human-friendly features (such as median income), only pixel information, it does not always help to load it as a pandas dataframe unless the model requires it. 


In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

mnist = fetch_openml('mnist_784', as_frame=False, parser='auto')

#mnist_dataframe = pd.DataFrame(data=mnist.data, columns=mnist.feature_names)

In [None]:
type(mnist) # the command `type` shows what data type `mnist` is - this shows it is not a `pandas.core.frame.DataFrame`, dataframes are not the preferred data structure for image data

## MNIST Applications 

The command `mnist.info()` cannot be used to get information about the dataset content, as `mnist` is not a pandas dataframe. Since the MNIST dataset is loaded through `sklearn.datasets`, the same tools used in weeks 1-4 apply: 

- the key `DESCR`, as demonstrated in the first code cell below. 
- the `print` command, which can be used in conjunction with it to get some useful context on the dataset structure and origin. 

In [None]:
print(mnist.DESCR)

# Task 2-3: Review the Data Description

### What is the size of each image?
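Each image is 28 x 28 pixels, giving 784 pixel values per image. The original digits were size-normalised to fit in a 20 x 20 pixel box while preserving their aspect ratio (retaining their original shape), and then translated so that their centre of mass sits at the centre of the 28 x 28 field.  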
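
The link between the 28 x 28 grid and the 784 features per image can be sketched with a stand-in array. `flat_image` below is a hypothetical placeholder filled with zeros, not a real MNIST row.

```python
import numpy as np

# A stand-in for one MNIST row: 784 pixel intensities (all zeros here, not real data).
flat_image = np.zeros(784)

# Reshaping recovers the 28 x 28 grid of pixels, since 28 * 28 = 784.
grid_image = flat_image.reshape(28, 28)

print(flat_image.shape)  # (784,)
print(grid_image.shape)  # (28, 28)
```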
 

### Examine how LeCun, Cortes, and Burges reorganised the NIST data as MNIST. Note that they remixed the data in two ways to create a different training dataset and test dataset. What did they do? 

The MNIST database was constructed from NIST's Special Database 3 (SD-3) and Special Database 1 (SD-1). NIST originally designated SD-3 as their training set and SD-1 as their test set. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test set among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets. The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. The test set is likewise mixed, with 5,000 patterns from SD-3 and 5,000 patterns from SD-1. 


### Why do you think they did this? Was it justified? 

SD-3 is much cleaner and easier to recognise than SD-1, so mixing the data sets standardises the quality of the information to an extent. The decision was justified because SD-3 was collected among Census Bureau employees while SD-1 was collected among high-school students.


**`DESCR` is only one of the keys available for this dataset. You can use the command `mnist.keys()` to see the full list of keys.**

In [None]:
mnist.keys()

## Task 2-4: Identifying the Dimension of Images

- The keys listed above for `mnist.keys()` include `data` and `target`, which were used in labs weeks 1-4. 

- The `data` key will return the image pixel data, while the `target` key will return the labels (the categories or classification) assigned to each of these images. 

- The task is to create a code cell that uses these keys in Python, with the `shape` attribute, to verify the number of images in the dataset and how many features (e.g. pixels) represent each image, and to print out the shape and the target categories. 

The cell below assigns the data and target to the **variables** `images` and `categories`; the remaining lines should print the shape of the images and the list of assigned categories. 

In [None]:
# cell for python code 

images = mnist.data
categories = mnist.target

# lines below print the shape of images and the categories.

print(images.shape)
print(categories)

## Item one: one of the digits in the dataset 

In [None]:
#extra code to visualise the image of digits

import matplotlib.pyplot as plt

## the code below defines a function plot_digit. The initial keyword `def` stands for define, followed by the function name.
## the function takes one argument, image_data, in parentheses. This is followed by a colon. 
## Each line below that will be executed when the function is used. 
## This cell only defines the function. The next cell uses the function.

def plot_digit(image_data): # defines a function so that you need not type all the lines below every time you view an image
    image = image_data.reshape(28, 28) #reshapes the data into a 28 x 28 image - before it was a flat array of 784 numbers
    plt.imshow(image, cmap="binary") # show the image in black and white - binary.
    plt.axis("off") # ensures no x and y axes are displayed

In [None]:
# visualisation of the first image in the dataset (index 0) with the following code

some_digit = mnist.data[0]
plot_digit(some_digit)
plt.show()

# Task 3: Setting Aside the Test Data

## Why Shuffle

To set aside test data, you need to take shuffled and stratified samples. The dataset we are working with may be stored in a specific order, so taking a block of it could select data points from only specific classes; shuffling avoids this. 

`sklearn` provides a function that splits the data and shuffles it as part of the split. 
This function is called `train_test_split`.
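
A small sketch of why an unshuffled split can go wrong, using made-up labels rather than the housing data (the `labels` array is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset sorted by class: 8 examples of class "a", then 2 of class "b".
labels = np.array(["a"] * 8 + ["b"] * 2)

# Taking the last 20% without shuffling gives a test set containing only class "b".
unshuffled_test = labels[8:]
print(unshuffled_test)

# train_test_split shuffles by default (shuffle=True) before splitting.
train, test = train_test_split(labels, test_size=0.2, random_state=42)
print(test)
```

Shuffling alone does not guarantee that every class appears in the test set in the right proportion, which is what stratifying addresses.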

In [None]:
from sklearn.model_selection import train_test_split #using housing data

tratio = 0.2 #to get 20% for testing and 80% for training

train_set, test_set = train_test_split(housing, test_size=tratio, random_state=42) 
## assigning a number to random_state means that every time you run this you get the same split, unless you change the data.

## Why Stratify

If the dataset contains more samples of one kind than of others, it is skewed. Sampling randomly can then result in test data that does not represent the desired test population. When stratified sampling is applied, the training and test sets have the same percentage of the feature of interest as the original dataset.
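
As a minimal sketch of this, the example below stratifies a split on a deliberately skewed set of made-up labels. The arrays `data` and `labels` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical skewed labels: 90 examples of class 0 and 10 of class 1.
labels = np.array([0] * 90 + [1] * 10)
data = np.arange(100).reshape(-1, 1)

# stratify=labels asks for the class proportions to be preserved in both parts.
X_tr, X_te, y_tr, y_test_strat = train_test_split(
    data, labels, test_size=0.2, stratify=labels, random_state=42)

print(labels.mean())        # share of class 1 in the full dataset
print(y_test_strat.mean())  # share of class 1 in the stratified test set
```

Both shares come out at 10%, because the 20 test examples contain exactly 2 of the rare class.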

[An example of the estimated probability of getting a bad sample that does not reflect the actual population is provided below.]


In [None]:
# one way to estimate the probability of a bad sample: simulation

import numpy as np

sample_size = 1000
ratio_female = 0.511 #The US population ratio of females in the census is 51.1%. 

np.random.seed(42)

samples = (np.random.rand(100_000, sample_size) < ratio_female).sum(axis=1)
# Estimated probability of getting a sample with fewer than 48.5% or more than 53.5% females
# when taking a random sample without stratifying: approximately 10.71%
((samples < 485) | (samples > 535)).mean()
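
The simulation gives an estimate. The same probability can also be computed exactly from the binomial distribution. The sketch below does this with the standard library only; the helper name `binomial_pmf` is an assumption of this example, and working with logarithms is just one way to keep the very large and very small intermediate numbers manageable.

```python
from math import exp, lgamma, log

sample_size = 1000
ratio_female = 0.511

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    log_coefficient = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coefficient + k * log(p) + (n - k) * log(1 - p))

# Probability that the number of females falls inside the range 485 to 535 (inclusive).
proba_ok = sum(binomial_pmf(k, sample_size, ratio_female) for k in range(485, 536))

# Everything outside that range counts as a bad sample.
proba_bad = 1 - proba_ok
print(round(proba_bad, 4))
```

The exact value is close to the simulated estimate, at roughly one bad sample in ten.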

## Task 3.1: Stratified Sample: Housing Data

The following code adds a column to the `housing` data to create bins of data according to interval brackets of median income of districts. This is a first step to creating a stratified sample across different income brackets.

In [None]:
import numpy as np 
import pandas as pd #imported libraries

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf], # edges of the income brackets; stratified sampling will later draw 20% (test ratio = 0.2) from each bracket
                               labels=[1, 2, 3, 4, 5])
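
To see how `pd.cut` assigns values to brackets, here is a minimal sketch with the same bin edges applied to a few made-up incomes. The `incomes` values are hypothetical and not taken from the housing file.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes (in the dataset's units).
incomes = pd.Series([0.9, 1.5, 2.7, 4.6, 8.0])

categories = pd.cut(incomes,
                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                    labels=[1, 2, 3, 4, 5])

# Each interval includes its right edge, so 1.5 falls into category 1.
print(categories.tolist())  # [1, 1, 2, 4, 5]
```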

In [None]:
from sklearn.model_selection import train_test_split

tratio = 0.2 #to get 20% for testing and 80% for training

strat_train_set, strat_test_set = train_test_split(housing, test_size=tratio, stratify=housing["income_cat"], random_state=42)

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set) #Prints out in order of the highest proportion first.

The output above shows the proportion of each income category in the stratified test set.

Setting the parameter `random_state` to a specific number keeps the split the same every time you run the code. 
But it will not stay the same if you change the underlying dataset. 

A stratified sample based on median income is reasonable because: 
- it ensures that each income subgroup is present in the sample. 
- it helps obtain precise estimates of each group’s characteristics. 
- the income categories can be used to understand differences between subpopulations better. 


## Task 3.2: Setting Aside Test Set for Image Data 

In the case of `mnist`, the data is already cleaned, prepared, scaled and ordered:
- the training data is the first 60,000 images
- the test data is the last 10,000 images

So there is no need to shuffle, stratify, or use `train_test_split`; doing so would mix up the predefined training and test sets. 

The following code is used to set aside the test dataset.

In [None]:
type(mnist.data) #The data type of `mnist.data` is `numpy.ndarray` (you can verify this with the command `type`). 

In [None]:
X_train = mnist.data[:60000]
y_train = mnist.target[:60000]

X_test = mnist.data[60000:]
y_test = mnist.target[60000:]

- By using a colon and then 60000 in a square bracket after `mnist.data`, you are telling the computer that you want all the items up until the 60000th item (not including the 60000th) in the array `mnist.data`. We assign this to the **variable** `X_train`. 
- Likewise, the first 60000 categories corresponding to the first 60000 images are assigned to the **variable** `y_train`. 
- By using a colon after 60000, you are telling the computer you would like all the items from the 60000th onwards.

It is machine learning convention to use upper case `X` for variable names associated with data and lower case `y` in the variable names associated to labels (or categories/classes).
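
The slicing rule is easier to see on a short list. In the sketch below, `items` is a hypothetical stand-in for `mnist.data` with only seven entries, split after the first five.

```python
# A stand-in for mnist.data with only 7 items, split after the first 5.
items = ["a", "b", "c", "d", "e", "f", "g"]

first_part = items[:5]   # indices 0 to 4; index 5 is not included
second_part = items[5:]  # index 5 to the end

print(first_part)   # ['a', 'b', 'c', 'd', 'e']
print(second_part)  # ['f', 'g']
```

Because the two slices meet at the same index, every item lands in exactly one of the two parts.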

# Task 4: Selecting and Training a Model

In the following code:
- **linear regression** is for the prediction of district housing prices (using `Scikit-Learn`).
- **convolutional neural network** is for classification of handwritten digits (using the `tensorflow` library with `keras`).

Regardless of the model, the general flow is similar:  

STEP ONE: Import the model from the relevant library. 
STEP TWO: Create an untrained model instance. 
STEP THREE: Fit the model to your training data.
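
The three steps can be sketched on a toy problem as follows. The data is made up (`X_toy` and `y_toy` are hypothetical names) and follows y = 3x + 1 exactly, so the fitted coefficients are easy to check.

```python
# STEP ONE: import the model from the relevant library.
from sklearn.linear_model import LinearRegression

# Hypothetical training data following y = 3 * x + 1.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [1.0, 4.0, 7.0, 10.0]

# STEP TWO: create an untrained model instance.
model = LinearRegression()

# STEP THREE: fit the model to the training data.
model.fit(X_toy, y_toy)

print(model.coef_, model.intercept_)  # close to [3.] and 1.0
```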


## Task 4-1: Housing Data and Linear Regression

Linear regression needs data whose values are continuous, not discrete. 
It is not enough for the values to be numbers, because numbers can also stand for categories.

The feature `income_cat` is one such category expressed as a number. 

You should always work with a copy of the training data and never look at the test set, in case you inadvertently use information from the test set to improve the performance. This is exemplified in the following code.

In [None]:
housing = strat_train_set.copy()  #assigned a copy of the stratified training set we created earlier to the variable `housing`

## Step 1: Checking Correlations: Training Set

Linear regression works by picking up on correlations between features. It is useful to explore the correlations, in this case between the target value you are trying to predict, `median_house_value`, and other features in the dataset like `median_income`.

The training set we have is of type `pandas.DataFrame`. Pandas dataframes have the method `corr`, which calculates the correlations for you. This is exemplified in the code below:


In [None]:
corr_matrix = housing.corr(numeric_only=True) 
# the argument numeric_only=True means correlations are only calculated for numeric features.
# this calculates the correlations between all pairs of features and saves them in the variable `corr_matrix`. 

corr_matrix["median_house_value"].sort_values(ascending=False)
# selects the correlations with `median_house_value` 
# `sort_values(ascending=False)` sorts them in descending order (the most correlated features are listed first).

The values are sorted in descending order, with the most highly correlated features listed first.

- A correlation value close to 1 indicates a strong positive correlation, meaning as one variable increases, the other tends to increase as well.
- A correlation value close to -1 indicates a strong negative correlation, meaning as one variable increases, the other tends to decrease.
- A correlation value close to 0 indicates a weak or no linear correlation.
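
These three cases can be reproduced on made-up data. The sketch below uses NumPy's `corrcoef`, which computes the Pearson correlation coefficient (the same measure pandas `corr` uses by default); the arrays are hypothetical.

```python
import numpy as np

x = np.arange(-5, 6)    # -5, -4, ..., 5

y_positive = 2 * x + 1  # rises with x
y_negative = -3 * x     # falls as x rises
y_nonlinear = x ** 2    # depends on x, but not linearly

for name, y in [("positive", y_positive),
                ("negative", y_negative),
                ("nonlinear", y_nonlinear)]:
    r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
    print(name, round(r, 3))
```

The last case is a reminder that a correlation near 0 only rules out a linear relationship: here `y_nonlinear` is completely determined by `x`, yet the coefficient is zero.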

 "median_income" has the highest positive correlation with "median_house_value," suggesting that as median income increases, median house value tends to increase. 

## Step 2: Visualise the Correlations

Pandas can also visualise these correlations as graphs for you. The code below plots four selected features against each other in a 4 x 4 grid of graphs.

Returning to the correlation values above, "latitude" has the strongest negative correlation, indicating that as you move north (towards higher latitude values), median house value tends to decrease. The other features also show varying degrees of correlation with "median_house_value."

In [None]:
from pandas.plotting import scatter_matrix

features = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[features], figsize=(12, 8))
#save_fig("scatter_matrix_plot")  - can be used to save the image. To use this, make sure to run the code at the beginning of the notebook defining the save_fig function

plt.show()

## Step 3: Separate the Target Labels from Your Data

Machine learning tasks require the data and the target label to be provided separately to the machine learning algorithm. If they are not, the algorithm does not know which of the features is the target label. 

The label for the housing data is the `median_house_value`. When data is in a pandas dataframe format, you can drop the column with the label to get the data, and select that column on its own to get the labels.

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) 
housing_labels = strat_train_set["median_house_value"].copy() 

## Step 4: Look for Missing Values in the Data

With tabular data, it's common to find that some rows are missing values for some of the columns. 

You can check for them by running the `info` command for dataframes (see Task 2-1).


In [None]:
housing.info()

Data Analysis:
- Running the code shows the total number of entries. 
- Comparing that number to the number of non-null entries for each feature (e.g. `total_bedrooms`) shows whether there are missing values. 
- If there are no missing values, the numbers should be the same.
- Here the numbers differ for `total_bedrooms`: 168 values are missing in the training set (207 in the full dataset).
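
Instead of comparing counts by eye, the missing values can also be counted directly. The sketch below shows the idea on a hypothetical miniature table (`toy`), not on the housing data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with one missing entry in total_bedrooms.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# isna() marks missing entries; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same two calls, applied to `housing`, would give the number of missing values for every column at once.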

## Step 5: Handling Missing Values

It is necessary to put code in place to tell the machine what to do if there are missing values. Handling missing values is a crucial step in data preprocessing, and there are several methods to deal with them. The choice of method depends on the nature of the data, the amount of missing values, and the specific requirements of the analysis.

Here are three common ways of handling missing values:

**(Option 1) Drop the row with missing value. This causes you to lose data points. In our scenario with the housing data, 168 rows will be removed.**

In [None]:
housing_option1 = housing.copy() # makes a copy of the data to variable housing_option1, doesn't mess up the original data.

housing_option1.dropna(subset=["total_bedrooms"], inplace=True)  #dropping the rows where total_bedroom is missing values.

housing_option1.info() #look for missing values after rows have been dropped

Pros:
- Simple and straightforward.
- Doesn't introduce bias by changing existing data.

Cons:
- Loss of potentially valuable information.
- If many rows have missing values, it can significantly reduce the size of the dataset.


**(Option 2) Drop the column with missing values. This causes you to lose one of your features.**

In [None]:
housing_option2 = housing.copy() # makes a copy of the data to variable housing_option2, doesn't mess up the original data.

housing_option2.drop("total_bedrooms", axis=1, inplace=True)  # dropping the column associated with total_bedrooms

housing_option2.info() # checking for missing values in the new data after column has been dropped

Pros:
- Simple and easy to implement.
- Useful when the entire column is not relevant to the analysis.

Cons:
- Loss of potentially important information.
- If many columns have missing values, it can lead to a loss of significant features.

**(Option 3) Fill in the missing values with a value that makes sense, such as the median, the mean, or a fixed value. This is called imputing.**

In [None]:
housing_option3 = housing.copy() # makes a copy of the data to variable housing_option3, so that we don't mess up the original data.

median = housing["total_bedrooms"].median() # calculating the median of total_bedrooms to use in filling missing values
housing_option3["total_bedrooms"] = housing_option3["total_bedrooms"].fillna(median)  # option 3 - filling missing values with the median (assigning back avoids pandas' chained-assignment warning)

housing_option3.info()

Pros:
- Retains more data, especially when there are only a few missing values.
- Can be less biased than dropping rows or columns.
- Allows for the inclusion of potentially valuable information.

Cons:
- Imputed values may introduce bias, especially if the missing data is not missing completely at random.
- The choice of imputation method can impact results.
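
To make the trade-offs concrete, the sketch below applies all three options to a small made-up dataframe and compares what is left afterwards. The `toy` frame and its values are hypothetical; only the pandas calls mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 rows, 2 of them missing total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan],
})

option1 = toy.dropna(subset=["total_bedrooms"])     # option 1: drop rows
option2 = toy.drop("total_bedrooms", axis=1)        # option 2: drop the column
median = toy["total_bedrooms"].median()             # median() skips missing values
option3 = toy.fillna({"total_bedrooms": median})    # option 3: impute

# (rows, columns) after each option
print(option1.shape, option2.shape, option3.shape)
```

Option 1 loses rows, option 2 loses a feature, and option 3 keeps the full shape at the cost of inventing values.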

## Missing Values - Alternative Method
`SimpleImputer` from the `sklearn.impute` module can alternatively be used to fill missing values with the median.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median") # initialises the imputer

housing_num = housing.select_dtypes(include=[np.number]) # includes only numeric features in the data

imputer.fit(housing_num) # calculates the median for each numeric feature so that the imputer can use them

housing_num[:] = imputer.transform(housing_num) # the imputer uses the medians to fill the missing values; the result is written back into housing_num
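
As a rough illustration of what the imputer stores, the sketch below fits `SimpleImputer` on a small made-up dataframe (the `toy_num` frame is hypothetical) and inspects the learned medians in its `statistics_` attribute. It also shows one way of turning the transformed result back into a dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value
toy_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_num)

# The learned medians, one per column, are stored in statistics_
print(toy_imputer.statistics_)

# Wrapping the transformed result restores column names and the index
filled = pd.DataFrame(toy_imputer.transform(toy_num),
                      columns=toy_num.columns, index=toy_num.index)
```

Because the medians are stored on the imputer, the same fitted object can later be applied to new data with `transform` without recalculating them.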

## Step 6: Scaling Your Features

Machine learning algorithms learn better when similar scales are used across all the features. The numeric range of values for `total_rooms` will be different from `median_income`.

In [None]:
housing_num.describe() # compare the min and max of each feature to see how different their ranges are

All the features have very different ranges. 

Bringing these in alignment is called **feature scaling**.

Scikit-Learn provides `MinMaxScaler`, which scales the values to fit into a range defined by you. 

**Below, the code fits the values into the range from -1 to 1.**

In [None]:
from sklearn.preprocessing import MinMaxScaler # get the MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1)) # setup an instance of a scaler
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)# use the scaler to transform the data housing_num
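
To show what the scaler is doing, here is a minimal sketch that applies the min-max formula by hand and compares it with `MinMaxScaler`. The array `X` is made-up data, and `low` and `high` are names chosen for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features with very different ranges
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [4.0, 500.0]])

low, high = -1, 1

# Min-max scaling by hand, column by column:
# first map each column to 0..1, then stretch and shift it to low..high
col_min, col_max = X.min(axis=0), X.max(axis=0)
manual = (X - col_min) / (col_max - col_min) * (high - low) + low

# The same transformation with scikit-learn
sk = MinMaxScaler(feature_range=(low, high)).fit_transform(X)
print(manual)
```

After scaling, both columns run from -1 to 1 even though their original ranges differed by two orders of magnitude.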

Scikit-Learn also provides a class called `StandardScaler`.

This scaler:
- standardises the distribution of each feature
- calculates the mean and standard deviation of each feature
- subtracts the mean and divides by the standard deviation, so that the scaled values have mean 0 and standard deviation 1. 

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler() 
housing_num_std_scaled = std_scaler.fit_transform(housing_num) 
# every sklearn transformer's fit() just calculates the parameters and saves them as the object's internal state. 
#Afterwards, you can call its transform() method to apply the transformation to a particular set of examples.

In [None]:
housing_num[:]=std_scaler.fit_transform(housing_num) 
#fit_transform joins these two steps and is used for the initial fitting of parameters on the training set x and 
#returns a transformed x. It calls first fit() and then transform() internally on data x.
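
The difference between `fit`, `transform` and `fit_transform` can be sketched on made-up data. Here `X_toy_train` and `X_toy_new` are hypothetical arrays; the point is that new data is scaled with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data (mean 3) and one new example
X_toy_train = np.array([[1.0], [2.0], [3.0], [6.0]])
X_toy_new = np.array([[3.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(X_toy_train)                      # learns mean_ and scale_ only

# Standardisation by hand: subtract the mean, divide by the standard deviation
manual = (X_toy_train - X_toy_train.mean(axis=0)) / X_toy_train.std(axis=0)

train_scaled = toy_scaler.transform(X_toy_train)
new_scaled = toy_scaler.transform(X_toy_new)     # reuses the training mean and std
print(toy_scaler.mean_, toy_scaler.scale_)
```

Calling `fit_transform` on the new data instead would recompute the mean and standard deviation from that data, which is usually not what you want.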

## Step 7: Train a Linear Regression Model

The data in `housing_num` has now had its missing values filled by `SimpleImputer` (using the median as the strategy) and its features scaled by `StandardScaler`. 

Before training a Linear Regression model for predicting median housing prices for districts, we also need to apply the scaling to the target labels (the known median housing prices). 

In [None]:
from sklearn.preprocessing import StandardScaler  
# this import is not necessary if you ran it earlier in the notebook. 
# It is included here for completeness.

target_scaler = StandardScaler() #instance of Scaler
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame()) #calculate the mean and standard deviation and use it to transform the target labels.


## Training Step

In [None]:
from sklearn.linear_model import LinearRegression # import the LinearRegression class from sklearn.linear_model

model = LinearRegression() #an instance of the untrained model
model.fit(housing_num, scaled_labels)

#model.fit(housing[["median_income"]], scaled_labels) #fit it to your data
#some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data

#scaled_predictions = model.predict(some_new_data)
#predictions = target_scaler.inverse_transform(scaled_predictions)

In [None]:
some_new_data = housing_num.iloc[:5] #pretend this is new data
#some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)
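
Because the model was trained on scaled labels, its raw predictions are in scaled units, and `inverse_transform` maps them back. A minimal sketch on made-up data (all `toy_` names and values are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data where the label is exactly 100 times the feature
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([[100.0], [200.0], [300.0], [400.0]])  # 2D, as to_frame() would give

toy_target_scaler = StandardScaler()
y_toy_scaled = toy_target_scaler.fit_transform(y_toy)

toy_model = LinearRegression().fit(X_toy, y_toy_scaled)

toy_scaled_pred = toy_model.predict(np.array([[5.0]]))            # in standard-deviation units
toy_pred = toy_target_scaler.inverse_transform(toy_scaled_pred)   # back in original units
print(toy_scaled_pred, toy_pred)
```

The scaled prediction is a small number that is hard to interpret; only after the inverse transform does it read as a value on the original scale of the labels.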

In [None]:
print(predictions, housing_labels.iloc[:5])

In [None]:
# extra code – computes the error ratios discussed in the book
error_ratios = predictions[:5].ravel().round(-2) / housing_labels.iloc[:5].values - 1 # predictions comes from the cell above; ravel() flattens it from shape (5, 1) to (5,)
print(", ".join([f"{100 * ratio:.1f}%" for ratio in error_ratios]))

## Step 8: Cross Validation

As mentioned in Lecture 4, having one training set and one test set to check performance is limited in producing a robust AI model. What you really want to see is stable performance across many training sets and test sets. In the first instance, you want to test the model on the training set.

One way to evaluate your model before testing on the new data (the data you set aside as your test data) is cross validation. This is where you split your training data into many pieces, then leave one of the pieces out for testing, repeating this so that each piece is left out once.

The code below uses scikit-learn's `cross_val_score` to perform cross-validation and calculate root mean squared errors (RMSEs) for a given model. 

In [None]:
from sklearn.model_selection import cross_val_score

rmses = -cross_val_score(model, housing_num, scaled_labels,
                              scoring="neg_root_mean_squared_error", cv=10)

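The resulting `rmses` variable will contain an array of RMSE values, one for each fold in the cross-validation. Because the model was trained on the scaled labels, these RMSEs are in units of the labels' standard deviation rather than in dollars. The values can be used to assess the performance and variability of the model.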

In [None]:
pd.Series(rmses).describe()
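
What `cross_val_score` does behind the scenes can be sketched with an explicit loop. The example below uses randomly generated data (not the housing data) and scikit-learn's `KFold`; the variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: a linear relationship plus a little noise
rng = np.random.default_rng(42)
X_cv = rng.normal(size=(100, 3))
y_cv = X_cv @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5)
manual_rmses = []
for train_idx, val_idx in kfold.split(X_cv):
    # train on four pieces, test on the piece that was left out
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    errors = fold_model.predict(X_cv[val_idx]) - y_cv[val_idx]
    manual_rmses.append(np.sqrt(np.mean(errors ** 2)))

# The same evaluation in one call, using the same folds
auto_rmses = -cross_val_score(LinearRegression(), X_cv, y_cv,
                              scoring="neg_root_mean_squared_error", cv=kfold)
print(manual_rmses)
```

The minus sign is needed because scikit-learn scorers follow a "higher is better" convention, so the RMSE is reported as a negative number.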

## Task 4-2: Hand Written Digit Classification

In [None]:
from IPython.display import YouTubeVideo # import only the class that is used
YouTubeVideo("YRhxdVk_sIs")

Tensorflow and Keras are popular libraries recognised for their usefulness in building neural networks quickly. Although we already loaded the data from `sklearn`, the code below retrieves it again from `tensorflow.keras.datasets`. Loading the data this way makes the subsequent code easier to follow, because everything operates via tensorflow. 



The code for importing the libraries and getting the data has been included below. To get these to work, **you will need to have your environment installed with `tensorflow` and `keras`**. In the first line of the first code cell below, you will notice that `tensorflow` is imported as `tf`. This is a recognised convention in the machine learning community. Adopting this convention makes your code more readable for this community. Once you have imported the library that way, `tf` will be used subsequently instead of `tensorflow`.

### Step 1: Get the Data

In [None]:
import tensorflow as tf #importing the libraries
mnist = tf.keras.datasets.mnist.load_data() #the retrieval of data from another library

### Step 2: Review What the Data Looks Like  
`mnist` above is organised as a data type called **tuple** - formatted as `(a,b)`:


In [None]:
print(type(mnist))  #The `a` and `b` are tuples themselves, representing training and test data

### Step 3: How to Get the Data

To get the data out of `a` and `b`:

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = mnist 
# (X_train_full, y_train_full) is the 'tuple' related to `a` and (X_test, y_test) is the 'tuple' related to `b`.
# X_train_full is the full training data and y_train_full are the corresponding labels 
# - labels indicate what digit the image is of, for example 5 if it is an image of a handwritten 5.
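
The nested unpacking above can be tried on a small stand-in. In this sketch `fake_mnist` is a hypothetical tuple with the same nesting as the real data, filled with zeros so that nothing needs to be downloaded.

```python
import numpy as np

# Hypothetical stand-in with the same nesting as the MNIST tuple: ((X, y), (X, y))
fake_mnist = ((np.zeros((6, 28, 28)), np.arange(6)),
              (np.zeros((2, 28, 28)), np.arange(2)))

a, b = fake_mnist                      # a is the training pair, b is the test pair
(X_a, y_a), (X_b, y_b) = fake_mnist    # unpack both levels in one statement

print(X_a.shape, y_a.shape, X_b.shape, y_b.shape)
```

Unpacking in one statement gives every array its own name straight away, instead of indexing with `a[0]`, `a[1]` and so on.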

### Step 4: Scaling the Pixel Values (the features)

In dealing with images, there are three main considerations that most frequently arise: 
- 1) the input size of the image (height and width in terms of pixels)
- 2) whether you want to move the pixels so that the image is centred
- 3) scaling the value of the pixels to be in a specified range. 

The neural network we will use works best with pixel values between 0 and 1. Pixels in a greyscale image usually have values between 0 and 255. 

In [None]:
X_train_full = X_train_full / 255. # rescales pixels, dividing by 255
X_test = X_test / 255.
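
The effect of dividing by `255.` can be checked on a few made-up pixel values. The `pixels` array below is hypothetical; it uses the unsigned 8-bit integer type in which image pixels are commonly stored.

```python
import numpy as np

# Hypothetical greyscale pixels: darkest, mid-grey, brightest
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

scaled = pixels / 255.   # dividing by a float converts the result to floating point

print(scaled.dtype, scaled.min(), scaled.max())
```

The trailing dot in `255.` makes the divisor a float, and the result is a floating-point array with values between 0 and 1.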

### Step 5: Split the Training Data into Training and Validation Data

The validation data is split off from the training data, which has itself already been separated from the test data. Validation data is used to evaluate performance during training, whereas test data is new data, not seen during training or fine tuning, that is used for the final test before publishing the results. 


In [None]:
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:] # the last 5000 images are held out as validation data
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]
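
The negative slices can look cryptic, so here is a minimal sketch on made-up arrays. `X_full`, `y_full` and `n_valid` are hypothetical names; the slicing pattern is the same as in the cell above with 5000 replaced by 3.

```python
import numpy as np

# Hypothetical data: 10 samples with 2 features each, and 10 labels
X_full = np.arange(20).reshape(10, 2)
y_full = np.arange(10)
n_valid = 3

# [:-n] keeps everything except the last n items; [-n:] keeps only the last n
X_tr, X_val = X_full[:-n_valid], X_full[-n_valid:]
y_tr, y_val = y_full[:-n_valid], y_full[-n_valid:]

print(len(X_tr), len(X_val), y_val)
```

Using the same slice on the images and on the labels keeps each image paired with its own label.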

### Step 6: Increasing Dimension to Include Colour Channels


When presenting your images to the neural network, you need to add an extra dimension alongside height and width to your image representation, indicating the number of colour channels the images use. 

Typically:
- for a greyscale image this would be 1
- for an RGB colour image it would be 3. 

Overall the shape should be like `(N, H, W, C)`: `N` is the number of images, `H` the height of any one image, `W` the width of any one image, and `C` the number of channels. MNIST images are 28 by 28 pixels, so height and width are interchangeable here. 

"All your images are expected to be the same size as it enters the neural network." 
The filters in a Convolutional Neural Network can be applied to images of any size. Still, most frameworks and papers use a single image size for training, and changing the aspect ratio is almost always a bad idea.

The mnist dataset currently has a shape like `(N, H, W)`.

The code below demonstrates how the numpy library allows you to add the required extra dimension.


In [None]:
import numpy as np # you won't need to run this line if you ran it before in this notebook. But for completeness.

X_train = X_train[..., np.newaxis] #adds a dimension to the image training set - the three dots means keeping everything else the same.
X_valid = X_valid[..., np.newaxis]
X_test = X_test[..., np.newaxis]
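
The change in shape can be seen on a small stand-in. In the sketch below, `batch` is a hypothetical array of zeros shaped like five greyscale images; `np.expand_dims` is shown as an equivalent way of writing the same step.

```python
import numpy as np

batch = np.zeros((5, 28, 28))                 # hypothetical batch of 5 greyscale images

with_channel = batch[..., np.newaxis]         # append a channel axis of length 1
same_thing = np.expand_dims(batch, axis=-1)   # equivalent, more explicit spelling

print(batch.shape, with_channel.shape)
```

No pixel values change; only the shape gains a final axis of length 1 for the single greyscale channel.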

### Step 7: Build the Neural Network and Fit it to the Data

In [None]:
tf.keras.backend.clear_session()

tf.random.set_seed(42)
np.random.seed(42)

# the model is built by defining each layer of the neural network in order, which makes Keras more convenient than building the network in plain tensorflow
# Below, every time a tf.keras.layers class is called it adds another layer

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(128, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", 
              metrics=["accuracy"])

model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

In [None]:
model.summary() # not necessary for the machine learning task.

`model.summary()` helps to interpret the model further: the numbers at the bottom of the summary tell you how many parameters are learned in this model. The summary can be useful later when you get more used to neural networks. 
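
The parameter counts in the summary can be reproduced by hand. The sketch below assumes 28 by 28 greyscale inputs (one channel), that `padding="same"` keeps the image size unchanged, and that `MaxPool2D()` uses its default pool size of 2 and so halves height and width. The two helper functions are hypothetical names written for this illustration.

```python
def conv2d_params(kernel_size, in_channels, filters):
    # each filter has kernel_size x kernel_size weights per input channel, plus one bias
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    # each unit has one weight per input, plus one bias
    return inputs * units + units

conv1 = conv2d_params(3, 1, 32)        # first Conv2D layer
conv2 = conv2d_params(3, 32, 64)       # second Conv2D layer
flat = (28 // 2) * (28 // 2) * 64      # values per image after pooling and flattening
dense1 = dense_params(flat, 128)       # Dense layer with 128 units
dense2 = dense_params(128, 10)         # output layer with 10 units

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, dense1, dense2, total)
```

The pooling, flatten and dropout layers have no parameters of their own, so almost all of the parameters sit in the first dense layer.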

### Step 8: Train and Evaluate the Model 

In [None]:

model.evaluate(X_test, y_test)

For a Keras model, `model.evaluate(X_test, y_test)` returns the loss followed by the metrics that were passed to `compile`, so here it returns the test loss and the test accuracy. Other metrics such as precision, recall or F1 score are only reported if they were requested when compiling the model.
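
To make the two reported numbers less mysterious, the sketch below computes a sparse categorical cross-entropy loss and an accuracy by hand with numpy. The predicted probabilities in `probs` and the labels in `labels` are made up for three examples and three classes; this is an illustration of the definitions, not of Keras internals.

```python
import numpy as np

# Hypothetical softmax outputs: one row per example, one column per class
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])
labels = np.array([0, 1, 2])   # the true class of each example

# Loss: average of minus the log of the probability given to the true class
prob_of_true_class = probs[np.arange(len(labels)), labels]
loss = -np.mean(np.log(prob_of_true_class))

# Accuracy: fraction of examples where the most probable class is the true one
accuracy = np.mean(probs.argmax(axis=1) == labels)

print(loss, accuracy)
```

The third example is predicted as class 1 instead of class 2, so it lowers the accuracy and, because the true class received a low probability, raises the loss.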

### Comparing with Another Model

Below you are provided with code for using something called the **Stochastic Gradient Descent Classifier**. This model applies the stochastic gradient descent optimiser (cf. the **nadam** optimiser used with the CNN above) to one of several linear models; by default it trains a linear **Support Vector Machine**. 

In [None]:
# getting the data again from Scikit-Learn, so that we know the image dimensions fit the model!

from sklearn.datasets import fetch_openml
import pandas as pd

mnist = fetch_openml('mnist_784', as_frame=False, parser='auto')

# getting the data and the categories for the data
images = mnist.data
categories = mnist.target

**Normally, we would set aside the test data**. 

But in this experiment we will abbreviate and use the entire data and evaluate using cross validation, especially since we are not intending, on this occasion, to develop our model with the validation step.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

sgd_clf = SGDClassifier(random_state=42)

#cross validation on training data for fit accuracy

accuracy = cross_val_score(sgd_clf, images, categories, cv=10)

print(accuracy)

You can see that the accuracies across all the validation runs are far below that of the CNN test results above.
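
For a quicker experiment with the same classifier, the sketch below uses scikit-learn's small built-in digits dataset (1,797 images of 8 by 8 pixels, not MNIST) so that it runs in a few seconds without downloading anything. Scaling the pixels to the range 0 to 1 and using 5 folds are choices made for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

digits = load_digits()             # small built-in dataset, pixel values 0 to 16
X_small = digits.data / 16.0       # scale the pixels to the range 0 to 1
y_small = digits.target

# loss="hinge" is the default and corresponds to a linear Support Vector Machine
small_clf = SGDClassifier(loss="hinge", random_state=42)

small_scores = cross_val_score(small_clf, X_small, y_small, cv=5)
print(small_scores, small_scores.mean())
```

Since this is a different and much smaller dataset, the accuracies are not comparable with the MNIST results above; the sketch only shows the workflow.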

# Task 5: Reflection

## Task 5-1: Reflecting on the Machine Learning Workflow


## 1. What would you need to do for your code if you were to use your own data (for example, discuss survey data and photos):

### how does this change your model?
First it is necessary to select appropriate data. A comprehensive and optimised data model helps create a simplified, logical database that eliminates redundancy, reduces storage requirements, and enables efficient retrieval.
    
### how does this change scaling method?
Machine learning algorithms learn better when similar scales are used across all the features. Therefore, for my own data, the values (for example, the pixel values of photos) would need to be scaled into a specified range so that the model can function properly. 

 
### In dealing with images, there are four main considerations that most frequently arise:

- locating and formatting appropriate data

- classification 

- scaling

- processing 

### how does this change approach to handling missing data?
- Dropping the rows with missing values, losing data points.
- Dropping the column with missing values, losing one of your features.
- Filling in the missing values with some value such as the median (imputing).

    
## 2. What is the significance of cross validation?
Cross-validation is a statistical method used to estimate the accuracy of machine learning models. It is especially useful where the quantity of data is limited, and it helps to guard against overfitting in a predictive model because the model is always tested on data it was not trained on. 
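
The link to overfitting can be illustrated with a model that is able to memorise its training data. The sketch below uses randomly generated noisy data and an unrestricted decision tree, both chosen for this illustration and not taken from the tasks above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a sine curve with a lot of noise
rng = np.random.default_rng(0)
X_noisy = rng.uniform(-3, 3, size=(200, 1))
y_noisy = np.sin(X_noisy[:, 0]) + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(random_state=0)

# Score (R squared) on the same data the tree was trained on
train_score = tree.fit(X_noisy, y_noisy).score(X_noisy, y_noisy)

# Scores on folds that were held out during training
cv_scores = cross_val_score(tree, X_noisy, y_noisy, cv=5)

print(train_score, cv_scores.mean())
```

The score on the training data looks perfect because the tree has memorised the noise, while the cross-validation scores are clearly lower. That gap is the warning sign of overfitting.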



## Task 5-2: Introducing the Tensorflow Playground

Initial steps:
- Change the data type to "spiral" by clicking on the picture for spiral data on the lefthand side. 
- The orange points are one class and the blue points are another class. 
- As the neural network learns, the background colour of the image on the righthand side changes to indicate the class the neural network thinks the points in those regions belong to.  

In [None]:
from IPython.display import Image
Image ("tensorflow_initial.png")

## Task 5-2-1: Finding small networks that perform well.

Below is an attempt to produce the smallest network that brings the training loss down to 0.2 or less. 

- The training loss is 0.199, indicated on the right hand side, right underneath the label **Output**.

- This neural network involved a total of 3 layers, 2 of them hidden, and a total of 10 nodes.  

- This network ran for 152 epochs (shown as 000,152 in the top lefthand corner), advanced using the step button (the smaller black arrow). 

In [None]:
from IPython.display import Image
Image ("tensorflow_playground.png")

## Task 5-2-2: Examine the patterns displayed in the network nodes 

- The outputs from layer one feed into the second layer with varying weights, shown by the thickness of the connecting lines. 

- Noise at 50% means the model becomes more prone to overfitting, capturing noise in the training data as if it were a meaningful pattern.

- Batch size at 10 introduces more variability in each update to the model's weights. This can lead to faster convergence, but the training process might be noisier.

- After some experimentation I was able to reach a configuration that does not need any input features except 𝑋1 and 𝑋2. 

- First Layers (Early Layers) indicate three key aspects:
  - Edges and Textures: The first layers often learn low-level features like edges, corners, and textures. Neurons might respond to simple shapes or patterns in the input.
  - Basic Shapes: Some nodes may start to recognize basic shapes such as circles, squares, or lines.
  - Colour and Intensity: Nodes might be sensitive to variations in colour and intensity.


# Works Cited
Carter, D.S. and S. (no date) Tensorflow - Neural Network Playground, A Neural Network Playground. Available at: https://playground.tensorflow.org/ (Accessed: 12 December 2023). 

Codecademy (2023) Regression vs. classification, Codecademy. Available at: https://www.codecademy.com/article/regression-vs-classification (Accessed: 12 December 2023). 

deeplizard (2017) Convolutional Neural Networks (CNNs) explained, YouTube. Available at: https://www.youtube.com/ (Accessed: 12 December 2023). 

Géron, A. (2019) Hands-on machine learning with Scikit-Learn, Keras and TensorFlow: concepts, tools, and techniques to build intelligent systems. Beijing: O’Reilly. 

Hewlett Packard Enterprise (2023) HPE Machine Learning Data Management Software Overview Hewlett Packard Enterprise, YouTube. Available at: https://www.youtube.com/ (Accessed: 12 December 2023). 

Kangralkar, S. (2021) ‘Types of Biases in Data’, towardsdatascience.com. Medium, 26 August. Available at: https://towardsdatascience.com/types-of-biases-in-data-cafc4f2634fb (Accessed: 2023). 

Kontsewaya, Y., Antonov, E. and Artamonov, A. (2021) ‘Evaluating the effectiveness of machine learning methods for spam detection’, Procedia Computer Science, 190, pp. 479–486. doi:10.1016/j.procs.2021.06.056. 

nee Khemchandani, R.R. (2023) Machine Learning Workflow, Machine Learning and Statistical Inference. South Asian University. Available at: https://sau.int/machine-learning-and-statistical-inference/ (Accessed: 12 December 2023). 

youtube.com, Q. (2022) what is a machine learning workflow, YouTube. Available at: https://www.youtube.com/ (Accessed: 12 December 2023). 