# Learning Machine Learning

1. Asking the right question

   Example: "Predict if a person will develop diabetes."  
   The question must direct and validate the work and should answer the following questions:

   - What data do we gather?
   - How do we mold the data?
   - How do we interpret the solution?
   - What criteria validate the solution?

   A statement should be formed that defines an end goal, starting point and how to achieve the goal:

   - Define the scope (including data sources)
   - Define target performance measurements for the solution
   - Determine the context for using the solution
   - Define how the solution will be created

  **Scope and Data Sources**

  1. Understand the features in the data
  2. Identify critical features
  3. Focus on at-risk population
  4. Select data source

  Diabetes development is likely to be affected by a large number of factors, e.g. age (older people are more likely to develop it) and race (African Americans, American Indians and Asian Americans are more likely to develop it).  
  Race could therefore be used as a selector in this scenario, narrowing the data to one or more of the racial groups with higher diabetes risk.  
  The Pima Indian Diabetes study available from the UCI Machine Learning Repository would be a good data source for this purpose.  

  The statement at this point would read: "Using Pima Indian Diabetes data, predict which people will develop diabetes."  

  **Performance Targets**  

  1. Binary result (true or false)
  2. Genetic differences are a factor
  3. 70% accuracy is a common target (bottom of acceptable range)

  The statement is refined to "Using Pima Indian Diabetes data, predict with 70% or greater accuracy, which people will develop diabetes."  
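
A minimal sketch of how the 70% accuracy target could be checked, assuming accuracy means the fraction of predictions that match the actual outcome. The function name `accuracy` and the label lists are hypothetical and only serve as an illustration.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels: True means the person developed diabetes.
actual = [True, False, False, True, False, True, False, False, True, False]
predicted = [True, False, True, True, False, False, False, False, True, False]

score = accuracy(actual, predicted)
meets_target = score >= 0.70  # target taken from the solution statement
print(f"accuracy = {score:.2f}, meets 70% target: {meets_target}")
```
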

  **Context**

  1. Disease prediction (What does it mean to predict disease?)
  2. Medical research practices (What are common practices in this field?)  
  3. Unknown variations between people
  4. __Likelihood__ of disease is used

  "Using Pima Indian Diabetes data, predict with 70% or greater accuracy which people are likely to develop diabetes."  

  **Solution Creation**

  Using the Machine Learning Workflow:

  1. Process the Pima Indian Data
  2. Transform the data as required

  "Use the Machine Learning Workflow to process and transform Pima Indian data to create a prediction model. This model must predict which people are likely to develop diabetes with 70% or greater accuracy."
  
2. Preparing data

  1. Find the data we need
  2. Inspect and clean the data
  3. Explore the data
  4. Mold and Tidy the data
  
  Tidy datasets are easy to manipulate, model and visualize. They have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
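
As a rough illustration of the tidy structure, the sketch below reshapes "wide" records (one column per visit) into rows with one observation each. The column names, the values and the helper `to_tidy` are hypothetical and are not taken from the Pima data.

```python
# Hypothetical "wide" records: one row per patient, one column per visit.
wide_rows = [
    {"patient_id": 1, "glucose_visit1": 148, "glucose_visit2": 151},
    {"patient_id": 2, "glucose_visit1": 85, "glucose_visit2": 89},
]


def to_tidy(rows, id_key, prefix, value_name):
    """Reshape wide rows so that each (patient, visit) observation is one row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                tidy.append({
                    id_key: row[id_key],
                    "visit": int(key[len(prefix):]),  # visit number from the column name
                    value_name: value,
                })
    return tidy


tidy_rows = to_tidy(wide_rows, "patient_id", "glucose_visit", "glucose")
for tidy_row in tidy_rows:
    print(tidy_row)
```

After reshaping, each variable (`patient_id`, `visit`, `glucose`) is a column and each measurement is its own row.
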
  
  Typically 50-80% of the time on a Machine Learning project is spent getting, cleaning and organising data.
  
  **Sourcing the Data**
  
  Google: Quality can vary wildly. Controversial topics will likely have biased data unless sources are chosen carefully.  
  Government: Large volume of data, peer-reviewed, with good documentation of the meaning of the data (metadata).  
  Professional/Company: Professional societies provide datasets, Twitter provides access to tweets, and financial data can be retrieved through APIs from companies like Yahoo.  
  Own Company: Your own IT department can provide company-specific and department-specific data.  
  A mix of all the above may be required.  
  
  This example will use a modified version of the Pima Indian Diabetes data to replicate real-world data handling requirements.  
  It contains 9 feature columns, such as number of pregnancies, blood pressure, glucose and insulin level.  
  It contains 1 class column: diabetes, true or false.

  Data Rule 1: The closer the data is to what you are predicting, the better.  
  Data Rule 2: Data will never be in the format you need.  
  Data Rule 3: Accurately predicting rare events is difficult.  
  Data Rule 4: Track how you manipulate data.  
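
One way to see why Data Rule 3 matters is to check how the classes are balanced before training. The sketch below uses a hypothetical class column with 5 positive cases out of 100; the helper names are made up for this example.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the labels as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy achieved by always predicting the most common class."""
    return max(class_balance(labels).values())


# Hypothetical class column: 5 positive cases out of 100.
labels = [True] * 5 + [False] * 95

balance = class_balance(labels)
baseline = majority_baseline_accuracy(labels)
print(balance, baseline)
```

With data this skewed, a model that always predicts "no diabetes" would score 95% accuracy without ever finding a case, so accuracy alone can be misleading for rare events.
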
  
3. Selecting the algorithm

  1. Review the role of the algorithm
  2. Perform algorithm selection based on the requirements identified in the solution statement
  3. Assess the best candidate algorithms at a high level
  4. Select an initial algorithm; the workflow may be repeated in search of the best solution
  
  **The Role of the Algorithm**
  
  The algorithm is the driver of the process to create and use a trained model.  
  The training function call (fit()) triggers the algorithm to execute its logic and analyse the training data. This analysis evaluates the data with respect to the mathematical model and logic associated with the algorithm. The algorithm uses the results of this analysis to adjust its internal parameters, producing a model that best fits the features of the training data to the associated class result. This best fit is defined by evaluating a function specific to the particular algorithm. Once the fitted parameters are stored, the model is said to be "trained".  
  Later, the trained model is called via a prediction function (predict()). Real data is passed to the trained model, using only the features in the data. The trained model uses its code and the parameter values set during training to evaluate the features and predict the class result for this new data, e.g. diabetes or not diabetes.  
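
The fit()/predict() cycle can be sketched with a deliberately simple stand-in model. The nearest-centroid classifier below is an assumption chosen for brevity, not one of the candidate algorithms in these notes, and the feature values are hypothetical. Libraries such as scikit-learn expose the same fit/predict pattern.

```python
import math


class NearestCentroidClassifier:
    """Stand-in model: stores the mean feature vector (centroid) of each class."""

    def fit(self, features, labels):
        # Training: analyse the data and store the fitted parameters.
        sums, counts = {}, {}
        for row, label in zip(features, labels):
            if label not in sums:
                sums[label] = [0.0] * len(row)
                counts[label] = 0
            sums[label] = [s + x for s, x in zip(sums[label], row)]
            counts[label] += 1
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, features):
        # Prediction: only features are passed in; the stored centroids decide the class.
        return [
            min(self.centroids_,
                key=lambda label: math.dist(row, self.centroids_[label]))
            for row in features
        ]


# Hypothetical training rows: [glucose, bmi]; labels are the class column.
train_features = [[90, 22.0], [100, 24.5], [160, 33.0], [170, 35.5]]
train_labels = [False, False, True, True]

model = NearestCentroidClassifier().fit(train_features, train_labels)
predictions = model.predict([[95, 23.0], [165, 34.0]])
print(predictions)
```
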
  
  **Algorithm Selection**
  
  Compare algorithms on a number of factors. Which factors matter most is likely to be debatable.  
  For this example the main factors are: what learning type is supported, the result type the algorithm predicts, the complexity of the algorithm and whether the algorithm is basic or enhanced.  
  - A **prediction** model means supervised machine learning.
  
  A regression result type produces continuous values, e.g. a house price.  
  A classification result type produces discrete values, e.g. small/medium/large, 1-100/101-200 or true/false.  
  - Diabetes or not is a binary outcome (true/false), so the algorithm must support binary classification.
  
  As this is an initial algorithm, complexity should be kept low.
  - Eliminate "ensemble" algorithms (combine multiple algorithms to boost performance, difficult to debug)
  
  Enhanced algorithms are a variation of basic algorithms with performance improvements and additional functionality.
  - For an initial algorithm, choose **basic**: basic algorithms are easier to understand.  
  
  The main candidate algorithms based on these criteria are Naive Bayes, Logistic Regression and Decision Tree.  
  
  Naive Bayes: Based on likelihood and probability. It looks at the likelihood of diabetes in previous data, combined with the probability of diabetes for nearby feature values. It makes the naive assumption that all the features passed in are independent of each other and have an equal impact on the result, e.g. blood pressure is as important as BMI, number of pregnancies etc. This assumption enables fast convergence and means only a small amount of data is required to train.  
  Logistic Regression: Returns a binary result. It measures the relationship of each feature to the result and weights each feature by its impact. The resulting value is mapped onto a curve with 2 values: 1 and 0 (diabetes/no-diabetes).  
  Decision Tree: A binary tree structure in which each node makes a decision based on the value of a feature. At each node, the feature value determines which path is taken. More data is required to determine the nodes and their splits.  
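
The curve that Logistic Regression maps its weighted result onto is commonly the sigmoid function. The sketch below assumes that form; the weights and bias are hypothetical placeholders, since real values would come from training.

```python
import math


def sigmoid(z):
    """Map any real number onto the S-shaped curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_diabetes(features, weights, bias, threshold=0.5):
    """Weighted sum of features -> probability -> binary class."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, probability >= threshold


# Hypothetical weights for [glucose, bmi]; real weights come from training.
weights = [0.04, 0.08]
bias = -7.5

prob_low, class_low = predict_diabetes([90, 22.0], weights, bias)
prob_high, class_high = predict_diabetes([170, 35.0], weights, bias)
print(f"{prob_low:.3f} {class_low}, {prob_high:.3f} {class_high}")
```
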
  
  **Naive Bayes** is a suitable starting point. It uses how often a feature value is associated with each result to decide between diabetes and not-diabetes. Training is fast, up to 100X faster than other algorithms, which speeds up workflow cycles. The algorithm is also stable to data changes.
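
A minimal sketch of the Naive Bayes idea, assuming the Gaussian variant, which suits continuous features such as glucose or BMI. The class name, the training rows and the small variance floor are assumptions for illustration, not part of the original notes.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes for numeric features."""

    def fit(self, features, labels):
        grouped = defaultdict(list)
        for row, label in zip(features, labels):
            grouped[label].append(row)
        total = len(labels)
        self.priors_ = {}  # label -> share of the training rows with that label
        self.stats_ = {}   # label -> list of (mean, variance), one per feature
        for label, rows in grouped.items():
            self.priors_[label] = len(rows) / total
            stats = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                variance = sum((x - mean) ** 2 for x in column) / len(column)
                stats.append((mean, variance + 1e-9))  # floor avoids division by zero
            self.stats_[label] = stats
        return self

    def _log_posterior(self, row, label):
        # log P(class) + sum of log P(feature | class), treating features as independent.
        score = math.log(self.priors_[label])
        for x, (mean, variance) in zip(row, self.stats_[label]):
            score += -0.5 * math.log(2 * math.pi * variance)
            score += -((x - mean) ** 2) / (2 * variance)
        return score

    def predict(self, features):
        return [
            max(self.priors_, key=lambda label: self._log_posterior(row, label))
            for row in features
        ]


# Hypothetical rows: [glucose, bmi].
train_features = [[90, 22.0], [100, 24.5], [95, 23.0],
                  [160, 33.0], [170, 35.5], [165, 34.0]]
train_labels = [False, False, False, True, True, True]

model = GaussianNaiveBayes().fit(train_features, train_labels)
predictions = model.predict([[98, 24.0], [168, 34.5]])
print(predictions)
```

Training only computes a prior, a mean and a variance per class and feature, which is one reason this kind of model trains quickly.
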
  
  
  
4. Training the model
5. Testing the model
