Many programming languages and platforms can be used for machine learning. However, the Python ecosystem is one of the most dominant and fastest growing.

Given its popularity and high adoption rate, we will use Python as our main programming language. First, we will look at the details of the Python packages used for machine learning, followed by the model development steps in the Python framework.

## Why Python?

These are some of the reasons for Python's popularity:

- high-level syntax (compared to lower-level languages like C, Java, and C++). Applications can be developed with fewer lines of code, making Python attractive to both beginners and advanced programmers;
- efficient development life cycle;
- large collection of community-run and open-source libraries;
- strong portability.

Python's simplicity attracts many developers who create new libraries for machine learning, leading to its strong adoption.

## Python Libs for Machine Learning

<br>

**Data Manipulation and Transformation**

*NumPy* (https://numpy.org)

    - provides support for large and multidimensional arrays, as well as an extensive collection of mathematical functions;

*Pandas* (https://pandas.pydata.org)

    - a library for data manipulation and analysis. Among other features, it offers data structures for working with tables and tools to manipulate them;
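
As a minimal sketch of how the two libraries fit together, the snippet below builds a small NumPy array and wraps it in a Pandas table. The variable names and values are illustrative assumptions, not part of any real dataset.

```Python
import numpy as np
import pandas as pd

# NumPy: a 2x3 array and a mathematical function applied per column
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
column_means = matrix.mean(axis=0)  # mean of each column

# Pandas: the same values as a labeled table
table = pd.DataFrame(matrix, columns=['a', 'b', 'c'])
table['total'] = table.sum(axis=1)  # derived column: sum of each row

print(column_means)
print(table)
```

NumPy handles the raw numerical work, while Pandas adds labels and table-oriented operations on top of it.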
    
<br>

**Machine Learning and Statistical Analysis**

*SciPy* (https://www.scipy.org)

    - a library of algorithms for mathematics, science, and engineering (optimization, integration, statistics, and more) built on NumPy. The name **SciPy** is also commonly used for the wider ecosystem of Python libraries that includes **NumPy**, **Pandas**, and **Matplotlib**;
    
*Scikit-learn* (https://scikit-learn.org)

    - a machine learning library offering a wide range of algorithms and utilities;
    
*StatsModels* (https://www.statsmodels.org)

    - a Python module that provides classes and functions for estimating numerous statistical models, as well as for conducting statistical tests and exploring statistical data;
    
*TensorFlow* (https://www.tensorflow.org) and *Theano* (https://deeplearning.net/software/theano)

    - dataflow libraries that facilitate work with neural networks;
    
*Keras* (https://keras.io)

    - a library of artificial neural networks that can act as a simplified interface for the **TensorFlow**/**Theano** packages;
    
<br>

**Data Visualization**

*Matplotlib* (https://matplotlib.org)

    - a plotting library that allows you to create 2D graphs and plots;

*Seaborn* (https://seaborn.pydata.org)

    - a data visualization library based on Matplotlib. Provides a high-level interface for creating attractive and informative statistical graphs;
    
<br>

## Machine Learning CRISP-DM Model

The figure below outlines a simple seven-step machine learning template, which can be used as a starting point for any machine learning model in Python.

<figure>
    <img src="https://miro.medium.com/v2/resize:fit:1024/1*sicHaDLyHRGuJm9eZTaHNw.png" width="600">
    <figcaption>Step by step to develop a machine learning model</figcaption>
</figure>

### Model Development Blueprint

Now we will detail each step of the model development process:

**1. Problem definition**
The first step in any project is defining the problem. Powerful algorithms can be used to solve it, but the results will be of no use if the wrong problem is solved.
The following framework should be used to define the problem:

    1. describe the problem informally and formally. List assumptions and similar problems;
    2. list the motivation for solving the problem, the benefits brought by the resolution and how it will be used;
    3. describe how the problem would be solved using domain knowledge.


**2. Loading data and libs**
The second step gives you everything you need to start working on the problem. This includes loading libraries, packages, and functions required for model development.

**2.1. Loading the libs**
```Python
import pandas as pd
from matplotlib import pyplot
```
Details on the libraries and modules needed for specific functionality can be found in each library's documentation.

**2.2. Loading data**
The following items should be checked, and handled where necessary, before data is loaded:

    - column headers;
    - comments or special characters;
    - delimiter.
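
Each of the items above corresponds to a `read_csv` parameter. The sketch below assumes a hypothetical file with no header row, `;` as the delimiter, and `#` marking comment lines; the contents are made up and read from memory so the example runs on its own.

```Python
from io import StringIO
from pandas import read_csv

# Hypothetical file contents: no header row, ';' delimiter, '#' comments
raw = StringIO(
    "# exported sample\n"
    "25;A\n"
    "31;B\n"
)
names = ['age', 'class']

# header=None: the file has no header row, so column names come from `names`
# sep=';': the delimiter used in the file
# comment='#': skip anything after a '#' character
data = read_csv(raw, header=None, names=names, sep=';', comment='#')
print(data)
```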

There are many ways to load data. Some of the most common are:

```Python
# Load CSV files with Pandas
from pandas import read_csv
filename = 'xpto.csv'
names = ['age', 'class']  # column names for a file without a header row
data = read_csv(filename, names=names)
```

```Python
# Load files from a URL
url = 'https://goo.glvhm1eU'
names = ['age', 'class']
data = read_csv(url, names=names)
```

```Python
# Load files using pandas_datareader
import pandas_datareader.data as web
# Example stock tickers (assumed; not defined in the original)
stk_tickers = ['MSFT', 'IBM', 'GOOGL']
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']

stk_data = web.DataReader(stk_tickers, 'yahoo')
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')
```

**3. Exploratory data analysis**
In this step, we analyze the data set.

**3.1. Descriptive statistics**
Understanding the dataset is one of the most important processes in model development. The steps to do this include:

    1. view the raw data;
    2. evaluate the dimensions of the set;
    3. evaluate the types of data attributes;
    4. summarize the distribution, descriptive statistics, and relationship between variables in the data set.
    
These steps are demonstrated below:

```Python
# View data
pd.set_option('display.width', 100)
dataset.head(1)
```

```Python
# Evaluate dataset dimensions
dataset.shape
```

```Python
# Evaluate data attribute types
pd.set_option('display.max_rows', 500)
dataset.dtypes
```

```Python
# Summarize data using descriptive statistics
pd.set_option('display.precision', 3)
dataset.describe()
```

**3.2. Data visualization**
The quickest way to learn more about data is to visualize it. Visualization involves independently understanding each attribute of the dataset.

Types of charts:

*Univariate*
- histograms and density plots;

*Multivariate*
- correlation matrices and scatter plots;

Below are Python code examples for univariate graph types:

```Python
# Univariate: histogram
from matplotlib import pyplot
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1, figsize=(10,4))
pyplot.show()
```

```Python
# Univariate: density plot
from matplotlib import pyplot
dataset.plot(kind='density', subplots=True, layout=(3, 3), sharex=False, legend=True, fontsize=1, figsize=(10,4))
pyplot.show()
```

Below are Python code examples for multivariate graph types:

```Python
# Multivariate: correlation matrix
import seaborn as sns
from matplotlib import pyplot
correlation = dataset.corr()
pyplot.figure(figsize=(5,5))
pyplot.title('Correlation Matrix')
sns.heatmap(correlation, vmax=1, square=True, annot=True, cmap='cubehelix')
pyplot.show()
```

```Python
# Multivariate: scatter plot
from matplotlib import pyplot
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
pyplot.show()
```

**4. Data preparation**
Data preparation is the preprocessing step in which data from one or more sources is cleaned and transformed to improve its quality before it is used.

**4.1. Data cleaning**
In machine learning modeling, bad data can be costly. Data cleansing involves checking the following:

*Validity*
- type, range, ...;

*Accuracy*
- the degree to which the data are close to true values;

*Completeness*
- the degree to which all necessary data are known;

*Uniformity*
- the degree to which data are specified using the same unit of measurement;
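
Some of these checks can be automated. The sketch below illustrates a completeness check and a simple validity check on a hypothetical dataset; the column names and the plausible age range of 0 to 120 are assumptions made for the example. Accuracy and uniformity usually require domain knowledge and are not covered here.

```Python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one out-of-range age
dataset = pd.DataFrame({
    'age': [25, 31, np.nan, 140],
    'height_cm': [170.0, 182.0, 165.0, 177.0],
})

# Completeness: number of missing values per column
missing_counts = dataset.isnull().sum()

# Validity: rows whose age falls outside the assumed range of 0 to 120
invalid_age = dataset[(dataset['age'] < 0) | (dataset['age'] > 120)]

print(missing_counts)
print(invalid_age)
```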

Different options for cleaning data include:

*Remove "NA" values from data*
```Python
# dropna returns a new DataFrame, so assign the result
dataset = dataset.dropna(axis=0)
```

*Fill "NA" with 0*
```Python
# fillna returns a new DataFrame, so assign the result
dataset = dataset.fillna(0)
```

*Fill in "NA" with the column average*
```Python
dataset['col'] = dataset['col'].fillna(dataset['col'].mean())
```
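
To compare the three options side by side, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions chosen only to make the effect of each call visible.

```Python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
dataset = pd.DataFrame({'col': [1.0, np.nan, 3.0], 'other': [4.0, 5.0, np.nan]})

dropped = dataset.dropna(axis=0)   # keeps only rows with no missing values
zero_filled = dataset.fillna(0)    # replaces every missing value with 0

# Fill one column with its own mean, working on a copy
mean_filled = dataset.copy()
mean_filled['col'] = mean_filled['col'].fillna(mean_filled['col'].mean())

print(dropped)
print(zero_filled)
print(mean_filled)
```

Dropping rows discards information from the other columns as well, so filling is often preferable when only a few values are missing.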

**4.2. Feature selection**
The features of the data used to train machine learning models have a huge influence on performance. Irrelevant or partially relevant features can negatively impact model performance. Feature selection is a process in which the features that contribute most to the prediction variable or output are automatically selected.

The benefits of performing feature selection before modeling are:

*Reduced overfitting*
- less redundant data means fewer opportunities for the model to make decisions based on noise;

*Improved performance*
- less misleading data means better modeling performance;

*Reduced training time and memory footprint*
- less data means faster training and a smaller memory footprint;

The following example demonstrates how the two best features are identified using the *SelectKBest* class in **sklearn**. It scores the features using an underlying scoring function and then removes all but the *k* highest-ranked features:

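```Python
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
# X: DataFrame of features (chi2 requires non-negative values); Y: target
bestfeatures = SelectKBest(score_func=chi2, k=5)
fit = bestfeatures.fit(X, Y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']  # name the two columns
print(featureScores.nlargest(2, 'Score'))  # the two highest-scoring features
```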
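
For a version that runs on its own, the sketch below applies *SelectKBest* to the Iris dataset bundled with **sklearn**. The choice of dataset, the `chi2` scoring function, and `k=2` are assumptions made for illustration.

```Python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris has four non-negative numeric features, as chi2 requires
iris = load_iris(as_frame=True)
X, Y = iris.data, iris.target

# Score every feature against the target and keep the two best
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, Y)

featureScores = pd.DataFrame({'Feature': X.columns, 'Score': selector.scores_})
selected = X.columns[selector.get_support()].tolist()

print(featureScores.nlargest(2, 'Score'))
print(selected)
```

`get_support()` returns a boolean mask over the input columns, which is a convenient way to recover the names of the selected features.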

When features are irrelevant, they should be removed. This is presented below:

```Python
# Removing old features
dataset.drop(['Feature1', 'Feature2', 'Feature3'], axis=1, inplace=True)
```
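
If the original table should be kept intact, `drop` can also return a new DataFrame instead of modifying the data in place. A minimal sketch, with made-up column names:

```Python
import pandas as pd

# Hypothetical dataset with two features to discard
dataset = pd.DataFrame({'Feature1': [1, 2], 'Feature2': [3, 4], 'Target': [0, 1]})

# Without inplace=True, drop returns a new DataFrame and leaves `dataset` unchanged
reduced = dataset.drop(columns=['Feature1', 'Feature2'])

print(reduced.columns.tolist())
```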