# Defining the Problem

**Author: [Kevin Broløs](https://abzu.ai/team)**

The first part of any machine learning journey is defining the problem.

Often, this requires further data analysis to crystallize, so this notebook focuses on introducing the different data sets we'll be working with going forward.

Defining the problem in a machine-learning context requires part domain knowledge, part business knowledge and part algorithm knowledge.

That is:

## Domain knowledge: 
* What are we dealing with?
    * What do the input features mean or represent?
    * Are they important to our problem, and why or why not?
    * Are they representative of the truth or real world? (More on this when we get to analysis.)
    
## Business knowledge: 
* What is the business value?
    * What kind of problem can we phrase that drives a business value, or change?
    * Are we interested in classifying future samples using a model?
        * Examples of this could be:
            * Determining what kind of flower a sample is
            * Telling bird species apart from their features
            * Determining if a machine needs maintenance
            * Predicting if a patient will contract a disease
            * Alerting if a credit card transaction is fraudulent
            * Indicating if a borrower will default
    * Are we interested in predicting a value or metric?
        * Examples of this include:
            * Predicting the leaf size of a flower
            * The airspeed velocity of an unladen African swallow
            * The cost of maintenance for a machine over time
            * The number of people who will be hospitalized in a given period
            * The number of credit card transactions a bank needs to support on a given day
            * The return on investment for loaning out money

## Algorithm knowledge: 
* Which algorithms would be suited for which problem?
    * Classification problems are commonly solved with algorithms such as (of the ones we're looking into):
        * Logistic regressions
        * Decision Trees/Random Forests
        * Neural Networks (for large datasets - often with many input features)
    * Regression problems are commonly solved with:
        * Linear regressions
        * Regression Trees/Forests
        * Neural Networks (for large datasets - often with many input features)
    * Knowing how to process and prepare the data for the given problem
        * Balancing
        * Encoding (Sparse categorical or One-Hot)
        * Filtering
        * Normalization
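
As a rough illustration of the preparation steps listed above, here is a minimal sketch of normalization, one-hot encoding and filtering using scikit-learn and NumPy. The toy data (`temperature`, `machine_type`) and the filter condition are made up for this example and don't come from any of the datasets in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric feature and one categorical feature.
temperature = np.array([[20.0], [25.0], [30.0], [35.0]])
machine_type = np.array([["pump"], ["valve"], ["pump"], ["motor"]])

# Normalization: rescale the numeric feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(temperature)

# One-hot encoding: one binary column per category.
# The encoder returns a sparse matrix by default, so we convert it to dense.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(machine_type).toarray()

# Filtering: keep only the rows that satisfy a condition (assumed here).
mask = temperature[:, 0] > 20.0
filtered = temperature[mask]

print("Scaled feature:", scaled.ravel())
print("Categories:", encoder.categories_[0])
print("One-hot encoded:\n", one_hot)
print("Rows kept after filtering:", len(filtered))
```

Which of these steps you need depends on the algorithm and the data, so treat this as a starting point rather than a recipe.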

### Many ways to frame a problem

Most problems can be defined as either a regression or classification problem.
Sometimes, reframing the problem as a binary or binned classification problem can simplify it immensely, while still providing good value to the business case. It requires a good understanding of both the domain and the business angle to know when and how to make this tradeoff.

Good performance on regression problems can sometimes be hard to achieve because of their "exact" nature, so it's useful to know how to reframe a problem from one type to the other depending on the results, if you still want to extract signal, and ultimately value, from the data you have available.
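
To make the reframing idea concrete, here is a small sketch that turns a continuous target into a binary and a binned classification target with NumPy. The `cost` values, the `threshold` and the `bin_edges` are assumptions chosen purely for illustration; in practice they would come from the domain and the business case.

```python
import numpy as np

# Hypothetical continuous target, e.g. a maintenance cost per machine.
cost = np.array([120.0, 480.0, 950.0, 1500.0, 2300.0, 310.0])

# Binary framing: is the cost above a threshold the business cares about?
threshold = 1000.0  # assumed cut-off, for illustration only
is_expensive = (cost > threshold).astype(int)

# Binned framing: low / medium / high, using assumed bin edges.
# np.digitize returns, for each value, the index of the bin it falls into.
bin_edges = [500.0, 1500.0]
labels = np.array(["low", "medium", "high"])
cost_class = labels[np.digitize(cost, bin_edges)]

print("Binary target:", is_expensive)
print("Binned target:", cost_class)
```

Either of the new targets can then be fed to a classifier instead of a regressor.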

Data sets in the wild are typically less forgiving than what you'll find here, so keep that in mind as you learn the tools.

---

# The Datasets:


## The Iris Dataset

A common beginner's dataset.
The [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) consists of 150 samples equally distributed over three classes of the iris flower, namely 'Iris Setosa', 'Iris Versicolor' and 'Iris Virginica'.

Each flower contributes 50 samples, so it's a well-balanced dataset. Each sample consists of four continuous features measured on real-world flowers, as listed below:
1. Sepal Length (cm)
2. Sepal Width (cm)
3. Petal Length (cm)
4. Petal Width (cm)

For the non-botanically inclined: the petals are the individual leaves of the flower when in bloom. The sepals are the green, leafy parts at the base of the flower, near the stem, that protect the petals.

### Defining the problem
The classic example is to classify the kind of flower from the four features.

It is such a popular dataset because you can solve it with pretty much any machine learning algorithm.
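
As a minimal sketch of that classification task, the snippet below fits a logistic regression on the four features. The choice of model, the 70/30 split and the `random_state` are assumptions made for this example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 30% of the samples for testing, keeping the classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

# Fit a simple classifier on the four measurements.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Accuracy is the fraction of held-out flowers that were classified correctly.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```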

You can even attempt to predict any of the four "input" parameters using the class and the remaining parameters, as a regression exercise.
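
Here is one way that regression exercise could look, assuming we pick petal width as the value to predict and a plain linear regression as the model. Both choices, along with the split and `random_state`, are arbitrary and only meant as a sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

iris = load_iris()

# Use petal width (the fourth column) as the regression target.
y = iris.data[:, 3]

# Inputs: the remaining three measurements plus the class, one-hot encoded.
class_one_hot = np.eye(3)[iris.target]
X = np.hstack([iris.data[:, :3], class_one_hot])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)

# score() reports the coefficient of determination (R^2) on the held-out data.
r2 = reg.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.2f}")
```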

### Getting the dataset

It's really easy to get the dataset, as it comes with any installation of scikit-learn due to its popularity. Follow along with the Python code below.

### Load the Iris dataset from sklearn

In [1]:
from sklearn.datasets import load_iris

In [2]:
iris = load_iris()

### Extract the data, target classes and features from the iris dataset

In [3]:
data, feature_names = iris.data, iris.feature_names
target, target_names = iris.target, iris.target_names

In [4]:
print("Feature columns in the dataset:\n", feature_names)
print("Unique classes in the target column:\n", target_names)

Feature columns in the dataset:
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Unique classes in the target column:
 ['setosa' 'versicolor' 'virginica']
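
As a quick sanity check of the description above (150 samples, four features, 50 samples per class), a short sketch like the following can be run on the loaded data. It reloads the dataset so that it works on its own.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
data, target = iris.data, iris.target

# The data is a 2D array with one row per sample and one column per feature.
print("Shape of the data:", data.shape)

# Count how many samples belong to each class to check the balance.
samples_per_class = np.bincount(target)
for name, count in zip(iris.target_names, samples_per_class):
    print(f"{name}: {count} samples")
```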
