# Knowing your task and knowing your data

- While building your machine learning algorithm, keep the following questions in mind:
1. What question(s) am I trying to answer?
2. Do I think the data collected can answer that question?
3. What is the best way to phrase my question as a machine learning problem?
4. Have I collected enough data to represent the problem I want to solve?
5. What features did I extract? Will these enable the right predictions?
6. How will I measure success in my application?
7. How will the machine learning solution interact with other parts of my research or business product?

- In a larger context, algorithms and methods in machine learning are only one part of a greater process to solve a particular problem. 
- It is a good idea to keep the big picture in mind.
- Many people spend a lot of time building a complex model, only to find out it doesn't solve the right problem.
- When going deep into technical aspects of machine learning, it is easy to lose sight of the ultimate goals.
- Keep in mind all of the assumptions that you make, explicitly or implicitly, when you start building machine learning models.

## Why Python?
1. Combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages like MATLAB and R
2. Many libraries for data loading, visualisation, statistics, language processing, image processing, and more
3. Machine learning and data analysis are fundamentally iterative processes, in which the data drives the analysis.
4. It's essential for these processes to have tools that allow quick iteration and easy interaction.
5. Python also integrates easily into existing systems such as web services and databases.

## Scikit-learn
1. An open source project
2. Constantly being developed and improved, with a very active community
3. Contains a number of state-of-the-art machine learning algorithms
4. Documentation: https://scikit-learn.org/stable/user_guide.html


## Python Libraries
- NumPy: Library for numerical operations and array manipulations in Python.
- SciPy: Library for scientific computing, built on NumPy, offering additional functions.
- matplotlib: Plotting library for creating static, interactive, and animated visualizations.
- pandas: Library for data manipulation and analysis, especially with tabular data.
- IPython: Enhanced interactive Python shell providing powerful introspection, rich media, and more.
- Jupyter Notebook: Web-based interactive environment for creating and sharing documents with live code.
- scikit-learn: Machine learning library for Python, providing tools for data mining and analysis.

pip install numpy scipy matplotlib ipython scikit-learn pandas

### NumPy
- Fundamental package for scientific computing in Python. 
- Contains functionality for multi-dimensional arrays
- High-level linear algebra operations
- Fourier transform used for signal processing
- In scikit-learn, the NumPy array is the fundamental data structure
  - The class is ndarray
  - A multi-dimensional (n-dimensional) array
  - All elements must be of the same type

In [None]:
import numpy as np 
x = np.array([[1,2,3], [4,5,6]])
print(x)
display(x)
print(f"x:\n{x}")
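The properties listed above can be checked directly on an array. The following is a minimal sketch; the variable name `mixed` is illustrative and not part of the original notes.

```python
import numpy as np

# A 2x3 array built from nested lists; this is an instance of np.ndarray
x = np.array([[1, 2, 3], [4, 5, 6]])
print(type(x))    # the ndarray class
print(x.ndim)     # number of dimensions
print(x.shape)    # (rows, columns)

# All elements share one dtype: mixing ints with a float upcasts everything to float
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)
```

Because every element has the same type, NumPy can store the values in one contiguous block of memory, which is part of what makes array operations fast.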


### SciPy
- Collection of functions for scientific computing in Python.
- Advanced linear algebra routines
- Mathematical function optimization
- Signal processing
- Statistical distributions
- scikit-learn draws on SciPy's functions to implement its algorithms.
- The most important part of SciPy for this book is scipy.sparse
  - This provides sparse matrices
  - Sparse matrices are another representation that is used for data in scikit-learn
  - We use a sparse matrix when we want to store a 2D array that contains mostly zeros

In [20]:
from scipy import sparse
eye = np.eye(4)
print(eye)
sparse_matrix = sparse.csr_matrix(eye)
print(sparse_matrix)



[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
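To see why a sparse representation helps when most entries are zero, the sketch below compares the memory used by a dense identity matrix with its CSR counterpart. The size of 1000 is an arbitrary choice for illustration, and the byte count for the sparse matrix simply sums the three arrays that CSR stores.

```python
import numpy as np
from scipy import sparse

n = 1000  # arbitrary size chosen for illustration
dense = np.eye(n)                 # n*n float64 values, almost all of them zero
csr = sparse.csr_matrix(dense)    # keeps only the n non-zero entries

dense_bytes = dense.nbytes
# CSR keeps three arrays: the values, their column indices, and the row pointers
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"non-zero entries: {csr.nnz}")
print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
```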


- Creating a sparse matrix in COO format.
- COO stands for Coordinate format
- It stores a list of the non-zero elements in a matrix, along with their row and column indices.
- It uses three 1D arrays.
  - Row Index
  - Column Index
  - Data
- It's useful for constructing sparse matrices and converting to other formats like Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC).

In [25]:
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print(f"{eye_coo}")

  (0, 0)	1.0
  (1, 1)	1.0
  (2, 2)	1.0
  (3, 3)	1.0
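The three 1D arrays described above are exposed as attributes of the COO matrix, and the matrix can be converted to CSR once it has been built. A minimal sketch, repeating the construction from the cell above so it runs on its own:

```python
import numpy as np
from scipy import sparse

data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))

# The three 1D arrays that make up the COO representation
print(eye_coo.row)
print(eye_coo.col)
print(eye_coo.data)

# Convert to CSR, a format better suited to arithmetic and row slicing
eye_csr = eye_coo.tocsr()
print(eye_csr.format)

# Converting back to a dense array recovers the 4x4 identity matrix
print(eye_csr.toarray())
```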


### Matplotlib
- Plotting library for creating static, interactive, and animated visualizations.

## Meet the data

In [None]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print(iris_dataset.keys())
print(dir(iris_dataset))
# print(iris_dataset.DESCR)

In [None]:
print(iris_dataset["target_names"])
print(iris_dataset["target"])

In [None]:
print(iris_dataset["feature_names"])
print(iris_dataset["data"][:2])

In [None]:
print(iris_dataset['data'].shape)
print(iris_dataset['target'].shape)
print(iris_dataset['target_names'].shape)

We see that the array contains measurements for 150 different flowers. Remember
that the individual items are called samples in machine learning, and their properties
are called features. The shape of the data array is (number of samples, number of
features), which is (150, 4) here. This is a convention in scikit-learn, and your data will
always be assumed to be in this shape.
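As a small sketch of this convention, the shape can be unpacked into its two parts, and a single feature held as a 1D array can be reshaped into a column so that it matches the expected layout. The names `petal_length` and `as_column` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Rows are samples, columns are features
n_samples, n_features = iris_dataset["data"].shape
print(f"samples: {n_samples}, features: {n_features}")

# A single feature stored as a 1D array must be reshaped into a column
# of shape (n_samples, 1) before it matches the expected layout
petal_length = iris_dataset["data"][:, 2]   # 1D array, one value per flower
as_column = petal_length.reshape(-1, 1)     # 2D array with a single feature
print(petal_length.shape, as_column.shape)
```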

In [None]:
print(type(iris_dataset["data"]))
print(type(iris_dataset.data))
print(iris_dataset["data"])
print(iris_dataset.data)

In [None]:
print(iris_dataset["data"][:5])

In [None]:
print(iris_dataset['target'].shape)
print(iris_dataset.target.shape)

In [None]:
print(type(iris_dataset['target']))
print(type(iris_dataset.target))

In [None]:
print(iris_dataset["target"][:5])
print(iris_dataset.target[:5])

The meanings of the numbers are given by the iris_dataset['target_names'] array:
0 means setosa, 1 means versicolor, and 2 means virginica.
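Since the labels are integers that index into the names array, NumPy indexing can translate them in one step. The sketch below also counts how many samples carry each label; the variable names `species` and `counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Index the array of names with the array of integer labels
species = iris_dataset["target_names"][iris_dataset["target"]]
print(species[:3])

# Count how many samples carry each label
counts = np.bincount(iris_dataset["target"])
for name, count in zip(iris_dataset["target_names"], counts):
    print(f"{name}: {count}")
```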

In [None]:
print(iris_dataset.feature_names)
print(iris_dataset.data[1])
print(iris_dataset.target[1])
print(iris_dataset.target_names[1])



## Measuring Success: Training and Testing Data

scikit-learn provides the train_test_split function, which shuffles the dataset and splits it. By default, this function extracts 75% of the rows in the data as the
training set, together with the corresponding labels for this data. The remaining 25%
of the data, together with the remaining labels, is declared as the test set.

In scikit-learn, data is usually denoted with a capital X, while labels are denoted by
a lowercase y. This is inspired by the standard formulation f(x)=y in mathematics,
where x is the input to a function and y is the output. Following more conventions
from mathematics, we use a capital X because the data is a two-dimensional array (a
matrix) and a lowercase y because the target is a one-dimensional array (a vector).
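A short sketch of both points: the default split proportions and the dimensionality of X and y. The seed `random_state=0` is fixed here only so the sketch is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X = iris_dataset["data"]      # 2D array (matrix): samples x features
y = iris_dataset["target"]    # 1D array (vector): one label per sample

# No test_size is given, so the default 75% / 25% split applies
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X.ndim, y.ndim)
print(f"train fraction: {len(X_train) / len(X):.2f}")
print(f"test fraction:  {len(X_test) / len(X):.2f}")
```

With 150 samples the split cannot be exactly 75/25, so the test set is rounded up to 38 rows and the training set gets the remaining 112.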

In [None]:
from sklearn.model_selection import train_test_split
help(train_test_split)

### train_test_split() random_state parameter

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'])
print("X_train shape: {}".format(X_train.shape))
print(X_train[0])
print("y_train shape: {}".format(y_train.shape))
print(y_train)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape: {}".format(X_train.shape))
print(X_train[0])
print("y_train shape: {}".format(y_train.shape))
print(y_train)
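The two cells above differ only in the random_state argument. The following sketch, with illustrative variable names, checks that a fixed seed makes the shuffle reproducible, while calls without a seed typically give different splits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset["data"], iris_dataset["target"]

# Two calls with the same seed produce identical splits
split_a = train_test_split(X, y, random_state=0)
split_b = train_test_split(X, y, random_state=0)
same_seed_equal = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"same random_state, same split: {same_seed_equal}")

# Without a seed, the shuffle usually differs from one call to the next
split_c = train_test_split(X, y)
split_d = train_test_split(X, y)
print(f"no random_state, same split: {np.array_equal(split_c[0], split_d[0])}")
```

Fixing the seed is useful in notes like these because the printed shapes, samples, and scores stay the same every time the notebook is rerun.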

## First Things First: Look at Your Data

In [None]:
# Clone the repository (you only need to do this once)
# git clone https://github.com/amueller/introduction_to_ml_with_python.git

# Install the dependencies (you only need to do this once)
# pip install -r introduction_to_ml_with_python/requirements.txt

# Set up the path and import mglearn
import sys
sys.path.append('C:\\Users\\James\\Desktop\\RoboticsJourney\\Self_Study\\2.PatternReg.ML\\introduction_to_ml_with_python')
import mglearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
X_train: np.ndarray
X_test: np.ndarray
y_train: np.ndarray
y_test: np.ndarray

In [None]:
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns = iris_dataset.feature_names)

# create a scatter matrix from the dataframe, color by y_train
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

## Building Your First Model: k-Nearest Neighbors

There are many classification algorithms in scikit-learn that we could use.
Here we will use a k-nearest neighbors classifier, which is easy to understand.
Building this model only consists of
storing the training set. To make a prediction for a new data point, the algorithm
finds the point in the training set that is closest to the new point. Then it assigns the
label of this training point to the new data point.
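The prediction rule described above can be sketched in a few lines of NumPy. This is a hand-written illustration of the idea for a single neighbor, not how scikit-learn implements it; the function name `predict_1nn`, the toy data, and the choice of Euclidean distance are assumptions made for the example.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Label each row of X_new with the label of its closest training point."""
    predictions = []
    for point in X_new:
        # Euclidean distance from this point to every training point
        distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
        # The label of the closest training point becomes the prediction
        predictions.append(y_train[np.argmin(distances)])
    return np.array(predictions)

# Tiny made-up dataset: two clusters with labels 0 and 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]])
y_toy = np.array([0, 0, 1, 1])

new_points = np.array([[0.9, 1.1], [5.2, 4.9]])
toy_predictions = predict_1nn(X_toy, y_toy, new_points)
print(toy_predictions)
```

Note that "training" here does nothing beyond keeping `X_toy` and `y_toy` around, which matches the statement that building the model only consists of storing the training set.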

Models in scikit-learn are implemented in their own classes,
which are called Estimator classes. The k-nearest neighbors classification algorithm
is implemented in the KNeighborsClassifier class in the neighbors module.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
print(knn)
dir(knn)
help(knn)

To build the model on the training set, we call the fit method of the knn object,
which takes as arguments the NumPy array X_train containing the training data and
the NumPy array y_train of the corresponding training labels:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape: {}".format(X_train.shape))
print(X_train[0])
print("y_train shape: {}".format(y_train.shape))
print(y_train[0])

In [None]:
knn.fit(X_train, y_train)
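As a rough sketch of the estimator interface, the snippet below reads a constructor parameter back, shows that fit returns the estimator object itself, and inspects an attribute that only exists after fitting. It repeats the split from the cells above so it runs on its own; the name `fitted` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

knn = KNeighborsClassifier(n_neighbors=1)
# Parameters passed to the constructor can be read back before fitting
print(knn.get_params()["n_neighbors"])

# fit stores the training data and returns the estimator itself
fitted = knn.fit(X_train, y_train)
print(fitted is knn)

# Attributes learned from the data end with an underscore
print(knn.classes_)
```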

## Making Predictions

Imagine we found an iris in the wild with the following:  
- Sepal length = 5cm
- Sepal width = 2.9cm
- Petal length = 1cm
- Petal width = 0.2cm  

Note: 
- A two-dimensional array is a matrix  
- A one-dimensional array is a vector  
- scikit-learn always expects two-dimensional arrays for the data.

In [None]:
X_new = np.array([[5, 2.9, 1, 0.2]])
print(X_new.shape)
print(X_new)
type(X_new)
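To illustrate the note about two-dimensional input, the sketch below passes the same measurements once as a 1D vector and once reshaped into a single row. For brevity it fits the classifier on the full dataset, which is an assumption made for this example and differs from the train/test workflow used elsewhere in these notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)
# Fitted on all samples only to keep this sketch short
knn.fit(iris_dataset["data"], iris_dataset["target"])

flat_sample = np.array([5, 2.9, 1, 0.2])   # shape (4,): one-dimensional
row_sample = flat_sample.reshape(1, -1)    # shape (1, 4): one sample, four features

try:
    knn.predict(flat_sample)
    raised = False
except ValueError as error:
    # scikit-learn rejects 1D input for the data argument
    raised = True
    print(f"1D input rejected: {type(error).__name__}")

print(row_sample.shape)
print(knn.predict(row_sample))
```

Writing the sample with double brackets, as in `np.array([[5, 2.9, 1, 0.2]])`, produces the same (1, 4) shape without a separate reshape step.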

In [None]:

print(f"Printing the iris_dataset bunch object elements: {dir(iris_dataset)}")

prediction = knn.predict(X_new)
print(f"The prediction value is {prediction}")

print(f"Which matches the species: {iris_dataset.target_names[prediction]}")

This new iris belongs to the class 0, meaning its species is setosa. But how do we know whether we can trust our model?  
We don’t know the correct species of this sample, which is the whole point of building the model!

## Evaluating the Model

Make a prediction for each iris in the test data and compare it against its label (the known species).  
We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted.
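As a minimal sketch of this definition, accuracy can be computed by hand as the mean of an element-wise comparison, or with `accuracy_score` from `sklearn.metrics`. The labels below are made up for illustration and are not taken from the iris data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 5 flowers, 4 of them predicted correctly
y_true = np.array([0, 1, 2, 2, 1])
y_hat = np.array([0, 1, 2, 1, 1])

# Element-wise comparison gives booleans; their mean is the fraction correct
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```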

In [None]:
# npArray = np.array([[5, 2.9, 1, 0.2]])
# print(f"npArray type: {type(npArray)}")
# print(f"npArray shape: {npArray.shape}")

X_train: np.ndarray
X_test: np.ndarray
y_train: np.ndarray
y_test: np.ndarray
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0)

print(f"X_test class: {type(X_test)}")
print(f"X_test shape is a matrix: {X_test.shape}")
print(f"X_test first element is: {X_test[0]}\n")

print(f"y_test class: {type(y_test)}")
print(f"y_test is a vector of labels {y_test.shape}")
print(f"y_test first element is: {y_test[0]}")

### Sample Definition
A sample, observation or instance is a single data point or record in your dataset.

In [None]:
print(f"1. load_iris() returns a Bunch object (dictionary-like), so dir() lists its keys:\n\t{dir(iris_dataset)}\n")
print(f"2. The first sample, observation, instance or single data point is: {iris_dataset.data[0]} and these are the \"features\" of the sample\n")
print(f"3. The headings for these features are:\n\t{iris_dataset.feature_names}\n")
print(f"4. The first sample shown in step 2 has a \"class\" value, this is: {iris_dataset.target[0]} and is the \"target\" property\n")
print(f"5. The headings for the \"classes\" are:\n\t{iris_dataset.target_names}")

In [None]:
print(f"X_test is loaded into the model and predicted against. The output contains the \"target\" predictions, so we store them in y_pred\n")

y_pred = knn.predict(X_test)

print(f"{'X_test':<50} {'y_pred'}")
for x, y in zip(X_test, y_pred):
    print(f"{str(x):<50} {y}")


# df = pd.DataFrame(X_test, columns = iris_dataset['feature_names'])
# df['Predicted'] = y_pred
# print(df)

In [None]:
# knn.score(X, y) predicts on X and returns the fraction of predictions that match y
print(f"Accuracy on the training data: {knn.score(X_train, y_train)}")
print(f"Accuracy on the test data: {knn.score(X_test, y_test)}\n")

y_pred = knn.predict(X_test)
print(f"Same test accuracy computed by hand, np.mean(y_pred == y_test): {np.mean(y_pred == y_test)}\n")

# Scoring X_test against y_pred compares the model's predictions with themselves,
# so the result is always 1.0 and says nothing about model quality
print(f"Score of X_test against its own predictions y_pred: {knn.score(X_test, y_pred)}")

print(f"\nOkay I think I get it now")