Many platforms, algorithms, and programming languages are used for machine learning. However, the Python ecosystem is one of the most dominant and fastest growing among them.

Given its popularity and high adoption rate, we will use Python as our main programming language. First, we will look at the details of the Python packages used for machine learning, followed by the model development steps in the Python framework.

## Why Python?

These are some of the reasons for Python's popularity:

- high-level syntax (compared to lower-level languages like C, Java, and C++). Applications can be developed with fewer lines of code, which makes Python attractive to both beginners and advanced programmers;
- efficient development life cycle;
- large collection of community-run and open-source libraries;
- strong portability.

Python's simplicity attracts many developers who create new libraries for machine learning, leading to its strong adoption.

## Python Libs for Machine Learning

<br>

**Data Manipulation and Transformation**

*NumPy* (https://numpy.org)

    - provides support for large and multidimensional arrays, as well as an extensive collection of mathematical functions;

*Pandas* (https://pandas.pydata.org)

    - a library for data manipulation and analysis. Among other features, it offers data structure to work with tables and tools to manipulate them;
    
<br>

**Machine Learning and Statistical Analysis**

*SciPy* (https://www.scipy.org)

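    - the combination of **NumPy**, **Pandas** and **Matplotlib** is commonly known as the **SciPy** ecosystem, a collection of Python libraries for mathematics, science and engineering. The *SciPy* library itself builds on NumPy and adds routines for optimization, integration, linear algebra and statistics;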
    
*Scikit-learn* (https://scikit-learn.org)

    - a machine learning library offering a wide range of algorithms and utilities;
    
*StatsModels* (https://www.statsmodels.org)

    - a Python module that provides classes and functions for estimating numerous statistical models, as well as for conducting statistical tests and exploring statistical data;
    
*TensorFlow* (https://www.tensorflow.org) and *Theano* (https://deeplearning.net/software/theano)

    - dataflow libraries that facilitate work with neural networks;
    
*Keras* (https://keras.io)

    - a library of artificial neural networks that can act as a simplified interface for the **TensorFlow**/**Theano** packages;
    
<br>

**Data Visualization**

*Matplotlib* (https://matplotlib.org)

    - a plotting library that allows you to create 2D graphs and plots;

*Seaborn* (https://seaborn.pydata.org)

    - a data visualization library based on Matplotlib. Provides a high-level interface for creating attractive and informative statistical graphs;
    
<br>

## Machine Learning CRISP-DM Model

The figure below shows a general idea of a simple machine learning template in seven steps, which can be used to start any machine learning model in Python.

<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1024/1*sicHaDLyHRGuJm9eZTaHNw.png" width="600">
    <figcaption>Step by step to develop a machine learning model</figcaption>
</figure>

### Model Development Blueprint

Now we will detail each step of the model development process:

**1. Problem definition**
The first step in any project is defining the problem. Powerful algorithms can be used to solve it, but the results will be of no use if the wrong problem is solved.
The following framework should be used to define the problem:

    1. describe the problem informally and formally. List the assumptions and any similar problems;
    2. list the motivation for solving the problem, the benefits brought by the resolution and how it will be used;
    3. describe how the problem would be solved using domain knowledge.


**2. Loading data and libs**
The second step gives you everything you need to start working on the problem. This includes loading libraries, packages, and functions required for model development.

**2.1. Loading the libs**
```Python
import pandas as pd
from matplotlib import pyplot
```
Details of the libraries and modules for specific functionality can be found on each library's page.

**2.2. Loading data**
Before the data is loaded, the following items must be checked and, where necessary, handled or removed:

    - column headers;
    - comments or characters;
    - delimiter.
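
The items above map onto parameters of `read_csv`. The sketch below is a minimal illustration, assuming a small in-memory file with one comment line, a header row, and a semicolon delimiter. The file content is invented for this example:

```python
from io import StringIO

import pandas as pd

# Hypothetical file content: a comment line, a header row, ';' as delimiter
raw = StringIO(
    "# exported from a hypothetical source system\n"
    "age;class\n"
    "25;A\n"
    "31;B\n"
)

# sep sets the delimiter, comment skips lines starting with '#',
# header=0 uses the first remaining line as column names
data = pd.read_csv(raw, sep=";", comment="#", header=0)
print(data)
print(data.dtypes)
```

If the file has no header row, passing `header=None` together with `names=[...]` supplies the column names instead.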

There are many ways to load data. Some of the most common are:

```Python
# Load a CSV file with Pandas
from pandas import read_csv
filename = 'xpto.csv'
names = ['age', 'class']  # column names (example values; adjust to your file)
data = read_csv(filename, names=names)
```

```Python
# Load files from a URL
url = 'https://goo.glvhm1eU'
names = ['age', 'class']
data = read_csv(url, names=names)
```

```Python
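# Load files using pandas_datareader
import pandas_datareader.data as web
# Stock tickers are example values (assumed; not given in the original text)
stk_tickers = ['MSFT', 'IBM', 'GOOGL']
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']

# Availability of each data source depends on the pandas_datareader version
stk_data = web.DataReader(stk_tickers, 'yahoo')
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')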
```

**3. Exploratory data analysis**
In this step, we analyze the data set.

**3.1. Descriptive statistics**
Understanding the dataset is one of the most important processes in model development. The steps to do this include:

    1. view the raw data;
    2. evaluate the dimensions of the set;
    3. evaluate the types of data attributes;
    4. summarize the distribution, descriptive statistics, and relationship between variables in the data set.
    
These steps are demonstrated below:

```Python
# View data
from pandas import set_option
set_option('display.width', 100)
dataset.head(1)
```

```Python
# Evaluate dataset dimensions
dataset.shape
```

```Python
# Evaluate data attribute types
set_option('display.max_rows', 500)
dataset.dtypes
```

```Python
# Summarize data using descriptive statistics
set_option('display.precision', 3)
dataset.describe()
```

**3.2. Data visualization**
The quickest way to learn more about data is to visualize it. Visualization involves independently understanding each attribute of the dataset.

Types of charts:

*Univariate*
- histograms and density plots;

*Multivariate*
- correlation matrix and scatter plot;

Below are Python code examples for univariate graph types:

```Python
# Univariate: histogram
from matplotlib import pyplot
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1, figsize=(10,4))
```

```Python
# Univariate: density plot
from matplotlib import pyplot
dataset.plot(kind='density', subplots=True, layout=(3, 3), sharex=False, legend=True, fontsize=1, figsize=(10,4))
pyplot.show()
```

Below are Python code examples for multivariate graph types:

```Python
# Multivariate: correlation matrix
import seaborn as sns
from matplotlib import pyplot
correlation = dataset.corr()
pyplot.figure(figsize=(5,5))
pyplot.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
```

```Python
# Multivariate: scatter plot
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
```

**4. Data preparation**
Data preparation is the preprocessing step in which data from one or more sources is cleaned and transformed to improve its quality before it is used.

**4.1. Data cleaning**
In machine learning modeling, bad data can be costly. Data cleaning involves checking the following:

*Validity*
- type, range, ...;

*Accuracy*
- the degree to which the data are close to true values;

*Completeness*
- the degree to which all necessary data are known;

*Uniformity*
- the degree to which data are specified using the same unit of measurement;
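
Some of the checks above can be sketched with Pandas. This is a minimal illustration on an invented toy table. The column names, the valid age range (0 to 120), and the "values below 3 are probably meters" rule are all assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age, one impossible age,
# and one height recorded in meters instead of centimeters
dataset = pd.DataFrame({
    "age": [25, 31, np.nan, 140],
    "height_cm": [170.0, 1.82, 165.0, 180.0],
})

# Completeness: count missing values per column
missing_per_column = dataset.isna().sum()

# Validity: flag ages outside an assumed plausible range
invalid_age = dataset[(dataset["age"] < 0) | (dataset["age"] > 120)]

# Uniformity: flag heights that look like they use a different unit
suspect_units = dataset[dataset["height_cm"] < 3]

print(missing_per_column)
print(invalid_age)
print(suspect_units)
```

Accuracy usually cannot be checked from the data alone, since it requires comparison against a trusted source.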

Different options for cleaning data include:

*Remove "NA" values from data*
```Python
dataset.dropna(axis=0)
```

*Fill "NA" with 0*
```Python
dataset.fillna(0)
```

*Fill in "NA" with the column average*
```Python
dataset['col'] = dataset['col'].fillna(dataset['col'].mean())
```

**4.2. Feature selection**
The features of the data used to train machine learning models have a huge influence on performance. Irrelevant or partially relevant features can negatively impact model performance. Feature selection is a process in which the features in the data that contribute most to the prediction of the variable or output are automatically selected.

The benefits of performing feature selection before modeling are:

*Reduces overfitting*
- less redundant data means fewer opportunities for the model to make decisions based on noise;

*Improves performance*
- less misleading data means better modeling performance;

*Reduces training time and memory usage*
- less data means faster training and lower memory usage;

The following example selects features using the *SelectKBest* class in **sklearn**. It scores each feature with an underlying function and then removes all but the *k* highest-ranked features. Here the five best features are kept and the two highest scores are printed:

```Python
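from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
# Score every feature with the chi-squared test and keep the 5 best.
# chi2 needs non-negative features; X is a DataFrame and Y the target.
bestfeatures = SelectKBest(score_func=chi2, k=5)
fit = bestfeatures.fit(X, Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']  # name columns so 'Score' can be ranked
print(featureScores.nlargest(2, 'Score'))  # two highest-scoring features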
```

When features are irrelevant, they should be removed. This is presented below:

```Python
# Removing old features
dataset.drop(['Feature1', 'Feature2', 'Feature3'], axis=1, inplace=True)
```

**4.3. Data transformation**
Many machine learning algorithms make assumptions about the data. It is good practice to prepare the data in a way that best exposes the structure of the problem to the algorithms. This can be accomplished by data transformation.

The different approaches to doing this are:

*Rescaling*
- when data comprises attributes with varying scales, many machine learning algorithms can benefit from *rescaling* all attributes to the same scale. Attributes are typically rescaled to the range between zero and one. This is useful for the optimization algorithms at the core of many machine learning algorithms and also helps speed up calculations:

```Python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = pd.DataFrame(scaler.fit_transform(X))
```

*Standardization*
- standardization is a useful technique for transforming attributes so that each has a mean of 0 and a standard deviation of 1. It shifts and scales the values without changing the shape of their distribution, and it is best suited for techniques that assume the input variables follow a normal distribution:

```Python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X)
StandardizedX = pd.DataFrame(scaler.transform(X))
```

*Normalization*
- normalization refers to rescaling each observation (row) to have a length of 1 (called a unit norm). This preprocessing method can be useful for sparse datasets with attributes of varying scales, especially when using algorithms that weight input values:

```Python
from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X)
NormalizedX = pd.DataFrame(scaler.fit_transform(X))
```
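
To make the difference between the three transformations concrete, the sketch below applies each of them to the same small array and prints the property that each one guarantees. The array `X_demo` is an invented example, not data from this text:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical data: two attributes on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [3.0, 400.0]])

rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)
standardized = StandardScaler().fit_transform(X_demo)
normalized = Normalizer().fit_transform(X_demo)

# Rescaling: every column spans the range [0, 1]
print(rescaled.min(axis=0), rescaled.max(axis=0))
# Standardization: every column has mean 0 and standard deviation 1
print(standardized.mean(axis=0), standardized.std(axis=0))
# Normalization: every row has Euclidean length 1
print(np.linalg.norm(normalized, axis=1))
```

Note that rescaling and standardization operate column by column, while normalization operates row by row.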

**5. Evaluate models**
Once we have estimated the performance of an algorithm, we can retrain the final model on the entire training data set and prepare it for operational use. The best way to estimate that performance is to evaluate the algorithm on data it has not seen before. Different machine learning techniques require different evaluation metrics. In addition to model performance, several other factors such as simplicity, interpretability and training time are considered when selecting a model.

**5.1. Train-test split**
The simplest method we can use to evaluate the performance of a machine learning algorithm is to use different datasets for training and testing. We can take our original dataset and divide it into two parts: train the algorithm on the first part, make predictions on the second, and evaluate the predictions against the expected results. The size of the split can depend on the size and specifics of the dataset, although it is common to use 80% of the data for training and 20% for testing. Differences between the training and testing data sets can lead to meaningful differences in the accuracy estimate. Data can be easily split into training and testing sets using the *train_test_split* function available in **sklearn**:
```Python
# Split the dataset
from sklearn.model_selection import train_test_split
validation_size = 0.2
seed = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = validation_size, random_state = seed)
```

**5.2. Identify evaluation metrics**
Choosing which metric to use to evaluate machine learning algorithms is very important. A fundamental property of an evaluation metric is its ability to discriminate among the results of different models.
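
As an illustration of what an evaluation metric measures, the sketch below computes two regression metrics used later in this section, RMSE and R², directly with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen only for this example. The formulas are the standard definitions, the same quantities that `mean_squared_error` and `r2_score` in **sklearn** compute:

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean squared prediction error
errors = y_true - y_pred
rmse_demo = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print("RMSE:", rmse_demo)
print("R^2:", r2_demo)
```

A lower RMSE and a higher R² both indicate a closer fit, but RMSE is expressed in the units of the target while R² is unitless.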

**5.3. Compare models and algorithms**
Selecting a machine learning model or algorithm is both an art and a science. There is no one-size-fits-all solution. Several factors beyond model performance can affect the choice of a machine learning algorithm.

Let's understand the model comparison process with a simple example. We define two variables, *X* and *Y*, and try to create a model to predict *Y* using *X*. As a first step, the data is divided into training and testing groups, as mentioned previously:

```Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
validation_size = 0.2
seed = 42
X = 2 - 3 * np.random.normal(0, 1, 20)
Y = X - 2 * (X ** 2) + 0.5 * (X ** 3) + np.exp(-X) + np.random.normal(-3, 3, 20)
X = X[:, np.newaxis]
Y = Y[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = validation_size, random_state = seed)
```

We have no idea which algorithm will do well on this problem, so let's design our test. We will use two models, a linear regression and a polynomial regression, to fit *Y* to *X*. We will evaluate the algorithms using the *Root Mean Squared Error (RMSE)* metric, which is one measure of model performance. RMSE gives a general idea of how large the prediction errors are:

```Python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_train)

rmse_linear = np.sqrt(mean_squared_error(y_train, y_pred))
r2_linear = r2_score(y_train, y_pred)
print("RMSE for Linear Regression:" , rmse_linear)

polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(X_train)

model = LinearRegression()
model.fit(x_poly, y_train)
y_poly_pred = model.predict(x_poly)

rmse = np.sqrt(mean_squared_error(y_train, y_poly_pred))
r2 = r2_score(y_train, y_poly_pred)
print("RMSE for Polynomial Regression: ", rmse)
```

Output:
    - RMSE for Linear Regression:     6.772942
    - RMSE for Polynomial Regression: 6.420495

We can see that the RMSE of polynomial regression is slightly better than that of linear regression. Therefore, it is the preferred model at this step.

**6. Model tuning**
Finding the best combination of hyperparameters of a model can be treated as a search problem. This search is commonly known as *model tuning* and is one of the most important steps in model development. It involves searching for the best model parameters using techniques such as *grid search*, in which you create a grid of all possible combinations of hyperparameter values and train the model with each of them. Besides grid search, there are several other techniques for model tuning, including randomized search, Bayesian optimization and Hyperband.
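
Grid search as described above can be sketched with the `GridSearchCV` class in **sklearn**, which trains the model for every combination in the grid and scores each one with cross-validation. This is a minimal sketch, not the procedure used in the rest of this section. The data (`X_demo`, `y_demo`), the pipeline step names (`poly`, `reg`) and the grid values are assumptions made for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free cubic data
X_demo = np.linspace(-2, 2, 40).reshape(-1, 1)
y_demo = (X_demo ** 3).ravel()

# Polynomial regression expressed as a two-step pipeline
pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("reg", LinearRegression()),
])

# The grid: one hyperparameter (the polynomial degree) with three values
param_grid = {"poly__degree": [1, 2, 3]}

# sklearn maximizes scores, so the MSE is negated
search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(np.sqrt(-search.best_score_))  # cross-validated RMSE of the best model
```

Because each candidate is scored on held-out folds rather than on the data it was trained on, this approach is less prone to favoring overly complex models than a comparison of training errors.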

Continuing our example, the polynomial regression is the best model so far. Next, we will perform a grid search for it, refitting the polynomial regression with different degrees. We will compare the RMSE results for all models:

```Python
Deg = [1, 2, 3, 6, 10]
results = []
names = []
for deg in Deg:
    polynomial_features = PolynomialFeatures(degree=deg)
    x_poly = polynomial_features.fit_transform(X_train)
    
    model = LinearRegression()
    model.fit(x_poly, y_train)
    y_poly_pred = model.predict(x_poly)
    
    rmse = np.sqrt(mean_squared_error(y_train, y_poly_pred))
    r2 = r2_score(y_train, y_poly_pred)
    results.append(rmse)
    names.append(deg)

plt.plot(names, results, 'o')
plt.suptitle("Algorithm Comparison")
```

The RMSE decreases as the degree increases, and the lowest RMSE is for the model with degree 10. However, models with degrees lower than 10 performed well, and the test set will be used to finalize the best model.

**7. Finalize the model**
Here, we will perform the final steps for model selection. First, we will run the predictions on the test dataset with the trained model. Then, we will try to understand the intuition of the model and save it for future use.
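
Saving a model for future use can be sketched with Python's built-in `pickle` module. This is one common option and an assumption here, since the text does not name a serialization method. The toy data and the file name `finalized_model.sav` are hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_demo, y_demo)

# Write the fitted model to disk, then read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model predicts without being refitted
print(restored.predict(np.array([[4.0]])))
```

Pickle files should only be loaded from trusted sources, and a saved model is generally expected to be loaded with the same library version that produced it.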

**7.1. Performance with the test data set**
The model selected during the training steps is evaluated again with the test set. This set allows us to compare different models in an unbiased way by basing our comparisons on data that was not used in any part of the training.

```Python
Deg = [1, 2, 3, 6, 8, 10]
results_test = []
names_test = []
for deg in Deg:
    polynomial_features = PolynomialFeatures(degree=deg)
    x_poly = polynomial_features.fit_transform(X_train)
    model = LinearRegression()
    model.fit(x_poly, y_train)
    # Reuse the transformer fitted on the training data for the test data
    x_poly_test = polynomial_features.transform(X_test)
    y_poly_pred_test = model.predict(x_poly_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_poly_pred_test))
    r2 = r2_score(y_test, y_poly_pred_test)
    results_test.append(rmse)
    names_test.append(deg)
    
plt.plot(names_test, results_test, 'o')
plt.suptitle("Algorithm Comparison")
```

In the training set, we saw that the RMSE decreases as the degree of the polynomial model increases, and the polynomial of degree 10 had the lowest RMSE. However, as the previous output shows, the polynomial of degree 10 achieves the best results on the training set but performs poorly on the test set. For the polynomial of degree 8, the RMSE on the test set is also relatively high. The polynomial of degree 6 shows the best result on the test set (although the difference is small compared with the polynomials of lower degree), as well as good results on the training set. For these reasons, it is the preferred model.
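
The comparison between training and test error can be sketched by collecting both RMSE values per degree side by side. This is a self-contained illustration on synthetic data, so its numbers will not match the outputs discussed above. The data-generating formula, the seed, and the names `X_demo`, `y_demo`, `train_rmse` and `test_rmse` are assumptions made for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy cubic data
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(60, 1))
y_demo = (0.5 * X_demo[:, 0] ** 3 - 2 * X_demo[:, 0] ** 2
          + rng.normal(0, 3, size=60))
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

train_rmse, test_rmse = {}, {}
for deg in [1, 2, 3, 6]:
    poly = PolynomialFeatures(degree=deg)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    # Training error: how well the model fits data it has seen
    pred_tr = model.predict(poly.transform(X_tr))
    train_rmse[deg] = np.sqrt(mean_squared_error(y_tr, pred_tr))
    # Test error: how well it generalizes to unseen data
    pred_te = model.predict(poly.transform(X_te))
    test_rmse[deg] = np.sqrt(mean_squared_error(y_te, pred_te))
    print(deg, train_rmse[deg], test_rmse[deg])
```

The training RMSE can only stay the same or fall as the degree grows, because each higher-degree model contains the lower-degree ones. The test RMSE has no such guarantee, which is why it is the one used to choose the final model.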

In addition to model performance, there are several other factors to consider when selecting a model, such as simplicity, interpretability and training time.