## Machine Learning By Example Tasks

### Task 1-2:

In [None]:
import sys 

assert sys.version_info >= (3,7)

In [None]:
import matplotlib.pyplot as plt

plt.rc('font', size=14) #general font size
plt.rc('axes', labelsize=14, titlesize=14) #font size for the titles of x and y axes
plt.rc('legend', fontsize=14) # font size for legends
plt.rc('xtick', labelsize=10) # the font size of labels for intervals marked on the x axis
plt.rc('ytick', labelsize=10) # the font size of labels for intervals marked on the y axis

In [None]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "classification"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Task 1-4

#### How framing the problem affects data selection 
First, the articles show that framing the problem gives us a clear objective. A clear objective in turn defines the role of the model and helps us select a suitable machine learning algorithm for the task. The readings also suggest that a well-defined task guides the collection and preparation of data: if we do not define the problem and objectives, we cannot tell which data is meaningful to collect. Finally, understanding the task helps identify potential ethical considerations in the use of machine learning, as demonstrated in the AI facial recognition article.

#### Selecting Algorithm 
Regression is better suited to predicting median house prices because the output is a continuous quantity computed from the model's inputs.

Classification is better suited to handwritten digit recognition because each image must be assigned to one of ten discrete classes (the digits 0 to 9), which makes it a multi-class classification problem.

## Task 2-1

#### Downloading example tabular data

In [None]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data(): 
    tarball_path = Path("datasets/housing.tgz") 
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True) 
        url = "https://github.com/ageron/data/raw/main/housing.tgz" 
        urllib.request.urlretrieve(url, tarball_path) 
        with tarfile.open(tarball_path) as housing_tarball: 
            housing_tarball.extractall(path="datasets") 
    return pd.read_csv(Path("datasets/housing/housing.csv")) 

housing = load_housing_data() 

In [None]:
housing.info()

There are 10 attributes, and one of them, ocean_proximity, is not numerical.

In [None]:
housing["ocean_proximity"].value_counts()

In [None]:
housing.hist(bins=50, figsize=(12, 8))
save_fig("attribute_histogram_plots")  
plt.show()

In [None]:
housing.describe()

## Task 2-2

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

mnist = fetch_openml('mnist_784', as_frame=False, parser='auto')

In [None]:
print(mnist.DESCR)

## Task 2-3

***Mark Down cell for critique***

1. Data transformation: the original NIST black and white images were size-normalised, and the anti-aliasing used in that step introduced grey levels.
2. NIST originally designated SD-3 as the training set and SD-1 as the test set, but SD-3 is much cleaner and easier to recognise than SD-1, so the MNIST authors mixed the two subsets instead.
3. The MNIST training set was formed by combining 30,000 patterns from SD-3 and 30,000 patterns from SD-1; the test set comprised 5,000 patterns from each.
4. These decisions were justified because they address data quality, standardisation, and the independence of the training and test sets, all of which are needed for a reliable evaluation of machine learning models.

## Task 2-4

In [None]:
mnist.keys()


In [None]:
images = mnist.data
categories = mnist.target

print("Shape of Images:", images.shape)

print("Categories", categories.tolist())

In [None]:
import matplotlib.pyplot as plt 

def plot_digit(image_data): 
    image = image_data.reshape(28, 28) 
    plt.imshow(image, cmap="binary") 
    plt.axis("off") 

In [None]:
some_digit = mnist.data[0]
plot_digit(some_digit)
plt.show()

## Task 3: Setting aside test data

In [None]:
from sklearn.model_selection import train_test_split

tratio = 0.2 

train_set, test_set = train_test_split(housing, test_size=tratio, random_state=42)

In [None]:
import numpy as np

sample_size = 1000
ratio_female = 0.511

np.random.seed(42)

samples = (np.random.rand(100_000, sample_size) < ratio_female).sum(axis=1)
((samples < 485) | (samples > 535)).mean()
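
The simulation above estimates how often a random sample of 1,000 people is noticeably skewed. As a cross-check, the same probability can be computed exactly from the binomial distribution. The sketch below assumes independent draws with the same `ratio_female`; the helper `binom_pmf` is a hypothetical name introduced for illustration, and it works in log space to avoid floating-point overflow.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials (log-space)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

sample_size = 1000
ratio_female = 0.511

# Probability of fewer than 485 or more than 535 females in the sample.
prob_too_small = sum(binom_pmf(k, sample_size, ratio_female)
                     for k in range(0, 485))
prob_too_large = sum(binom_pmf(k, sample_size, ratio_female)
                     for k in range(536, sample_size + 1))
prob_skewed = prob_too_small + prob_too_large

print(f"Exact probability of a skewed sample: {prob_skewed:.4f}")
```

The exact value should land close to the simulated fraction, which is the argument for not relying on purely random sampling when the sample is small.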

In [None]:
import numpy as np
import pandas as pd

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
from sklearn.model_selection import train_test_split

tratio = 0.2 

strat_train_set, strat_test_set = train_test_split(housing, test_size=tratio, stratify=housing["income_cat"], random_state=42)

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

### Why is a stratified sample based on median income reasonable?


1. Stratified sampling based on median income ensures that each income stratum is adequately represented in the sample, which helps capture the economic diversity of the population.
2. Stratified sampling can lead to more precise estimates for each income group. This is particularly important when analysing subpopulations with specific income characteristics.
3. It provides proportional representation of each income stratum, so different socioeconomic groups are represented fairly.
4. The results from a stratified sample are more likely to generalise well to the overall population.
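
To make the idea concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements `stratify`; it only illustrates the principle of sampling the same fraction from every stratum. The incomes are randomly generated placeholders, and `income_category` and `stratified_split` are hypothetical helper names. The bin edges mirror those used for `income_cat` above.

```python
import random
from bisect import bisect_left
from collections import Counter, defaultdict

def income_category(median_income, edges=(1.5, 3.0, 4.5, 6.0)):
    """Map an income to a category 1-5 using right-inclusive bins."""
    return bisect_left(edges, median_income) + 1

def stratified_split(strata, test_ratio, seed=42):
    """Return (train_idx, test_idx), taking test_ratio of every stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)
    train_idx, test_idx = [], []
    for members in by_stratum.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_ratio)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Placeholder incomes, NOT taken from the housing dataset.
data_rng = random.Random(0)
incomes = [data_rng.uniform(0.5, 10.0) for _ in range(1000)]
strata = [income_category(x) for x in incomes]

train_idx, test_idx = stratified_split(strata, test_ratio=0.2)

# Compare each category's share in the full data and in the test set.
overall = Counter(strata)
in_test = Counter(strata[i] for i in test_idx)
for cat in sorted(overall):
    print(cat, overall[cat] / len(strata), in_test[cat] / len(test_idx))
```

The two printed proportions for each category should be nearly identical, which is the property the `value_counts()` check above verifies for the real split.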

### Task 3.2: Setting aside test set for image data

In [None]:
X_train = mnist.data[:60000]
y_train = mnist.target[:60000]

X_test = mnist.data[60000:]
y_test = mnist.target[60000:]

## Task 4

### Step 1 Checking correlations: training set

In [None]:
housing = strat_train_set.copy()

In [None]:
corr_matrix = housing.corr(numeric_only=True) 
corr_matrix["median_house_value"].sort_values(ascending=False)
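
By default, `corr()` computes the Pearson correlation coefficient, which measures linear association only. The sketch below writes out the formula in plain Python on made-up numbers (the `income` and `value` lists are illustrative, not drawn from the dataset), with `pearson_r` as a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only.
income = [1.0, 2.0, 3.0, 4.0, 5.0]
value = [110.0, 190.0, 320.0, 390.0, 510.0]

r = pearson_r(income, value)
print(f"r = {r:.4f}")
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 does not rule out a non-linear one, which is why the scatter plots in the next step are still worth inspecting.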

### Step 2: Visualise Correlations

In [None]:
from pandas.plotting import scatter_matrix

features = ["median_house_value", "median_income", "total_rooms",
            "housing_median_age"]
scatter_matrix(housing[features], figsize=(12, 8))
plt.show()

### Step 3: Separate the target labels from data 

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) 
housing_labels = strat_train_set["median_house_value"].copy() 

### Step 4: Look for missing values in data

In [None]:
housing.info()

In [None]:
missing_values = housing.isnull().sum()

In [None]:
missing_total_bedrooms = missing_values['total_bedrooms']

print("Number of missing values for 'total_bedrooms':", missing_total_bedrooms)

### Step 5: Handling missing Values

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median") 

housing_num = housing.select_dtypes(include=[np.number])

imputer.fit(housing_num) 

housing_num[:] = imputer.transform(housing_num) 

#### Comments
Using SimpleImputer to fill the missing values with the median is the most straightforward way to deal with the problem, but it potentially oversimplifies the data: every missing entry receives the same value, which understates the real spread of the feature.
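
The concern can be illustrated with a small pure-Python sketch of median imputation. The bedroom counts are made up, and `fit_median` and `impute` are hypothetical helpers that only mimic the fit/transform pattern; they are not the scikit-learn implementation.

```python
from statistics import median, pstdev

def fit_median(values):
    """Median of the observed (non-missing) values."""
    return median(v for v in values if v is not None)

def impute(values, fill_value):
    """Replace every missing entry with the same fill value."""
    return [fill_value if v is None else v for v in values]

# Made-up values; None marks a missing entry.
train_bedrooms = [3.0, None, 5.0, 2.0, None, 4.0]
test_bedrooms = [None, 6.0]

fill = fit_median(train_bedrooms)            # learned from training data only
train_filled = impute(train_bedrooms, fill)
test_filled = impute(test_bedrooms, fill)    # reuse the training median

# The spread shrinks after imputation because the fills sit at the centre.
spread_before = pstdev(v for v in train_bedrooms if v is not None)
spread_after = pstdev(train_filled)
print(fill, spread_before, spread_after)
```

Note that the fill value is learned on the training data and then reused for the test data, so that nothing from the test set leaks into preprocessing.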

### Step 6: Scaling features

In [None]:
housing_num.describe()

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

In [None]:
housing_num[:] = std_scaler.fit_transform(housing_num)
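
Standardisation subtracts the mean and divides by the standard deviation, so each feature ends up with mean 0 and unit variance. The sketch below spells out that arithmetic and its inverse in plain Python on made-up ages. `standardise` and `inverse_standardise` are hypothetical helpers; the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev

def standardise(values):
    """Return z-scores plus the mean and std needed to undo the scaling."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mean) / std for v in values], mean, std

def inverse_standardise(scaled, mean, std):
    """Map z-scores back to the original units."""
    return [z * std + mean for z in scaled]

ages = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values

scaled, mean, std = standardise(ages)
restored = inverse_standardise(scaled, mean, std)

scaled_mean = fmean(scaled)
scaled_std = pstdev(scaled)
print(scaled, scaled_mean, scaled_std)
```

The inverse step is the same operation that `inverse_transform` performs when predictions on scaled labels are converted back to house values.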

### Step 7 Train a linear regression model 

In [None]:
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

In [None]:
from sklearn.linear_model import LinearRegression

# Train on the single median_income feature; the targets are standardised.
model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)

In [None]:
# Predict for the first five districts, then map back to the original units.
some_new_data = housing[["median_income"]].iloc[:5]

scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)

In [None]:
print(predictions, housing_labels.iloc[:5])

### Step 8 Cross Validation 

In [None]:
from sklearn.model_selection import cross_val_score

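# The scorer returns negative RMSE, so flip the sign.
# The labels are standardised, so the RMSEs are in standard-deviation units.
rmses = -cross_val_score(model, housing_num, scaled_labels,
                         scoring="neg_root_mean_squared_error", cv=10)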

In [None]:
pd.Series(rmses).describe()
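
For intuition, the sketch below mimics what 10-fold cross-validation does: split the data into contiguous folds, hold one out at a time, and score on the held-out fold. It is a simplified stand-in, not the scikit-learn implementation. The targets are made up, the "model" is a baseline that predicts the training-fold mean, and `k_fold_indices` and `rmse` are hypothetical helper names.

```python
import math

def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs over contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

# Made-up targets; the baseline predicts the mean of the training fold.
y = [float(i % 7) for i in range(25)]

fold_rmses = []
for train_idx, valid_idx in k_fold_indices(len(y), n_splits=10):
    prediction = sum(y[i] for i in train_idx) / len(train_idx)
    held_out = [y[i] for i in valid_idx]
    fold_rmses.append(rmse(held_out, [prediction] * len(held_out)))

print(fold_rmses)
```

Each sample is used for validation exactly once, so the spread of the per-fold scores gives a rough idea of how stable the estimate is.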

## Task 4: Handwritten digit classification

### Step 1 Getting the data

In [None]:
import tensorflow as tf

mnist = tf.keras.datasets.mnist.load_data()

### Reviewing what the data looks like

In [None]:
print(type(mnist))

### Step 3 how to get the data

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = mnist

### Step 4 Scaling the pixel values (the features)

In [None]:
X_train_full = X_train_full / 255.
X_test = X_test / 255.

### Step 5 Split the training data into training and validation data 

In [None]:
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

### Increasing the dimension to include a colour channel

In [None]:
import numpy as np 

X_train = X_train[..., np.newaxis] 
X_valid = X_valid[..., np.newaxis]
X_test = X_test[..., np.newaxis]

### Step 6 building the neural network and fit it to the data 

In [None]:
tf.keras.backend.clear_session()

tf.random.set_seed(42)
np.random.seed(42)


model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Conv2D(64, kernel_size=3, padding="same", activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(128, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", 
              metrics=["accuracy"])

model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

In [None]:
model.summary()
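
The parameter counts reported by the summary can be checked by hand. The sketch below assumes 28x28 single-channel inputs, as prepared above, and the default 2x2 max pooling. `conv2d_params` and `dense_params` are hypothetical helper names; the pooling, flatten, and dropout layers have no trainable parameters.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights (k * k * in_channels per filter) plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

conv1 = conv2d_params(kernel_size=3, in_channels=1, filters=32)
conv2 = conv2d_params(kernel_size=3, in_channels=32, filters=64)

# "same" padding keeps 28x28; the 2x2 max pool halves each side to 14x14.
pooled_side = 28 // 2
flat = pooled_side * pooled_side * 64

dense1 = dense_params(inputs=flat, units=128)
dense2 = dense_params(inputs=128, units=10)

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, dense1, dense2, total)
```

Almost all of the parameters sit in the first dense layer, because it connects every one of the 12,544 flattened features to each of its 128 units.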

### Step 7 Training and evaluating the model 

In [None]:
model.evaluate(X_test, y_test)

### Data from scikit-learn

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

mnist = fetch_openml('mnist_784', as_frame=False, parser='auto')

images = mnist.data
categories = mnist.target

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

sgd_clf = SGDClassifier(random_state=42)

# 10-fold cross-validated accuracy on the full dataset
accuracy = cross_val_score(sgd_clf, images, categories, cv=10)

print(accuracy)