### NumPy

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. Its array-oriented style will feel familiar to MATLAB users.

In [1]:
import numpy as np
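
As a minimal sketch of what the array object offers, the cell below builds two small arrays and applies element-wise and matrix operations. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Two small 2x2 arrays (values are illustrative only).
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.arange(4).reshape(2, 2)  # [[0, 1], [2, 3]]

elementwise = a * b          # element-wise product
matmul = a @ b               # matrix product
col_means = a.mean(axis=0)   # mean of each column

print(elementwise)
print(matmul)
print(col_means)
```

Operations apply to whole arrays at once, so explicit Python loops are rarely needed.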

### LinAlg

`numpy.linalg` includes very useful linear algebra tools: `norm`, `inv`, `solve`, `det`, `eig`, `eigvals`, etc.

In [5]:
from numpy import linalg as la

In [7]:
la?
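
The tools listed above can be tried on a small system. This is a sketch with an arbitrary 2x2 matrix `A` and right-hand side `b` (both are assumptions for illustration).

```python
import numpy as np
from numpy import linalg as la

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = la.solve(A, b)                     # solves A @ x = b
det_A = la.det(A)                      # determinant of A
A_inv = la.inv(A)                      # inverse of A
norm_b = la.norm(b)                    # Euclidean norm of b
eigenvalues, eigenvectors = la.eig(A)  # columns of eigenvectors match eigenvalues

print(x, det_A, norm_b)
print(eigenvalues)
```

`la.solve` is usually preferable to multiplying by `la.inv(A)` when only the solution of the system is needed.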

### SciPy

SciPy is a library for scientific and technical computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.

In [2]:
import scipy.io as scio

In [3]:
scio?

In [20]:
from scipy.spatial.distance import pdist

In [21]:
pdist?

Help on function pdist in module scipy.spatial.distance:

pdist(X, metric='euclidean', *args, **kwargs)
    Pairwise distances between observations in n-dimensional space.
    
    See Notes for common calling conventions.
    
    Parameters
    ----------
    X : ndarray
        An m by n array of m original observations in an
        n-dimensional space.
    metric : str or function, optional
        The distance metric to use. The distance function can
        be 'braycurtis', 'canberra', 'chebyshev', 'cityblock',
        'correlation', 'cosine', 'dice', 'euclidean', 'hamming',
        'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching',
        'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
        'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.
    *args : tuple. Deprecated.
        Additional arguments should be passed as keyword arguments
    **kwargs : dict, optional
        Extra arguments to `metric`: refer to each metric documentation for a
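
A short sketch of `pdist` on three hand-picked points follows. `squareform`, which lives in the same module, expands the condensed result into a full distance matrix. The points are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three observations in 2-D space (illustrative values).
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: distances for the pairs (0,1), (0,2), (1,2).
condensed = pdist(X, metric="euclidean")

# Full symmetric matrix with zeros on the diagonal.
square = squareform(condensed)

print(condensed)
print(square)
```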

In [22]:
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

In [25]:
dendrogram?

Help on function dendrogram in module scipy.cluster.hierarchy:

dendrogram(Z, p=30, truncate_mode=None, color_threshold=None, get_leaves=True, orientation='top', labels=None, count_sort=False, distance_sort=False, show_leaf_counts=True, no_plot=False, no_labels=False, leaf_font_size=None, leaf_rotation=None, leaf_label_func=None, show_contracted=False, link_color_func=None, ax=None, above_threshold_color='C0')
    Plot the hierarchical clustering as a dendrogram.
    
    The dendrogram illustrates how each cluster is
    composed by drawing a U-shaped link between a non-singleton
    cluster and its children. The top of the U-link indicates a
    cluster merge. The two legs of the U-link indicate which clusters
    were merged. The length of the two legs of the U-link represents
    the distance between the child clusters. It is also the
    cophenetic distance between original observations in the two
    children clusters.
    
    Parameters
    ----------
    Z : ndarray
        Th

In [26]:
linkage?

Help on function linkage in module scipy.cluster.hierarchy:

linkage(y, method='single', metric='euclidean', optimal_ordering=False)
    Perform hierarchical/agglomerative clustering.
    
    The input y may be either a 1-D condensed distance matrix
    or a 2-D array of observation vectors.
    
    If y is a 1-D condensed distance matrix,
    then y must be a :math:`\binom{n}{2}` sized
    vector, where n is the number of original observations paired
    in the distance matrix. The behavior of this function is very
    similar to the MATLAB linkage function.
    
    A :math:`(n-1)` by 4 matrix ``Z`` is returned. At the
    :math:`i`-th iteration, clusters with indices ``Z[i, 0]`` and
    ``Z[i, 1]`` are combined to form cluster :math:`n + i`. A
    cluster with an index less than :math:`n` corresponds to one of
    the :math:`n` original observations. The distance between
    clusters ``Z[i, 0]`` and ``Z[i, 1]`` is given by ``Z[i, 2]``. The
    fourth value ``Z[i, 3]`` represents t

In [27]:
fcluster?

Help on function fcluster in module scipy.cluster.hierarchy:

fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
    Form flat clusters from the hierarchical clustering defined by
    the given linkage matrix.
    
    Parameters
    ----------
    Z : ndarray
        The hierarchical clustering encoded with the matrix returned
        by the `linkage` function.
    t : scalar
        For criteria 'inconsistent', 'distance' or 'monocrit',
         this is the threshold to apply when forming flat clusters.
        For 'maxclust' or 'maxclust_monocrit' criteria,
         this would be max number of clusters requested.
    criterion : str, optional
        The criterion to use in forming flat clusters. This can
        be any of the following values:
    
          ``inconsistent`` :
              If a cluster node and all its
              descendants have an inconsistent value less than or equal
              to `t`, then all its leaf descendants belong to the
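
The three functions above are typically used together. The sketch below assumes a toy dataset with two obvious groups, uses Ward linkage (one of several available methods), and passes `no_plot=True` so that `dendrogram` only returns its data structure instead of drawing.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Six points forming two well-separated groups (illustrative values).
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# Linkage matrix: one row per merge, so (n - 1) rows and 4 columns.
Z = linkage(X, method="ward")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

# Compute the dendrogram layout without rendering it.
tree = dendrogram(Z, no_plot=True)

print(Z.shape)
print(labels)
print(tree["leaves"])
```

Dropping `no_plot=True` draws the tree, which requires matplotlib.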
    

### Sklearn (Scikit-learn)

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and the diabetes dataset for regression.

In [10]:
from sklearn import datasets

In [11]:
datasets?

Help on package sklearn.datasets in sklearn:

NAME
    sklearn.datasets

DESCRIPTION
    The :mod:`sklearn.datasets` module includes utilities to load datasets,
    including methods to load and fetch popular reference datasets. It also
    features some artificial data generators.

PACKAGE CONTENTS
    _base
    _california_housing
    _covtype
    _kddcup99
    _lfw
    _olivetti_faces
    _openml
    _rcv1
    _samples_generator
    _species_distributions
    _svmlight_format_fast
    _svmlight_format_io
    _twenty_newsgroups
    setup
    tests (package)

FUNCTIONS
    clear_data_home(data_home=None)
        Delete all the content of the data home cache.
        
        Parameters
        ----------
        data_home : str, default=None
            The path to scikit-learn data directory. If `None`, the default path
            is `~/sklearn_learn_data`.
    
    dump_svmlight_file(X, y, f, *, zero_based=True, comment=None, query_id=None, multilabel=False)
        Dump the datase
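
As a quick sketch, one of the bundled datasets mentioned above can be loaded like this (iris is used here, but the other `load_*` functions follow the same pattern):

```python
from sklearn import datasets

# load_iris returns a Bunch: a dict-like object holding data, target and metadata.
iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (n_samples, n_features)
print(iris.feature_names)  # names of the measured features
print(iris.target_names)   # names of the classes
```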

In [15]:
from sklearn import model_selection

In [16]:
model_selection?

Help on package sklearn.model_selection in sklearn:

NAME
    sklearn.model_selection

PACKAGE CONTENTS
    _search
    _search_successive_halving
    _split
    _validation
    tests (package)

CLASSES
    builtins.object
        sklearn.model_selection._search.ParameterGrid
        sklearn.model_selection._search.ParameterSampler
        sklearn.model_selection._split.BaseCrossValidator
            sklearn.model_selection._split.LeaveOneGroupOut
            sklearn.model_selection._split.LeaveOneOut
            sklearn.model_selection._split.LeavePGroupsOut
            sklearn.model_selection._split.LeavePOut
            sklearn.model_selection._split.PredefinedSplit
    sklearn.model_selection._search.BaseSearchCV(sklearn.base.MetaEstimatorMixin, sklearn.base.BaseEstimator)
        sklearn.model_selection._search.GridSearchCV
        sklearn.model_selection._search.RandomizedSearchCV
    sklearn.model_selection._split.BaseShuffleSplit(builtins.object)
        sklearn.model_selection

`model_selection.train_test_split`: split arrays or matrices into random train and test subsets.

In [23]:
model_selection.train_test_split?

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to t
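
A minimal sketch of a split on toy data follows. The 80/20 proportion, the `random_state` value and the use of `stratify` are choices made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten samples with two features each, and balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs.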

In [24]:
from sklearn.linear_model import LinearRegression

LinearRegression fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

In [27]:
LinearRegression?

Help on class LinearRegression in module sklearn.linear_model._base:

class LinearRegression(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, LinearModel)
 |  LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None, positive=False)
 |  
 |  Ordinary least squares Linear Regression.
 |  
 |  LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
 |  to minimize the residual sum of squares between the observed targets in
 |  the dataset, and the targets predicted by the linear approximation.
 |  
 |  Parameters
 |  ----------
 |  fit_intercept : bool, default=True
 |      Whether to calculate the intercept for this model. If set
 |      to False, no intercept will be used in calculations
 |      (i.e. data is expected to be centered).
 |  
 |  normalize : bool, default=False
 |      This parameter is ignored when ``fit_intercept`` is set to False.
 |      If True, the regressors X will be normalized before regression by
 |      subtr
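
To see the fit in action, here is a sketch on noise-free data generated from the line y = 2x + 1 (an assumption chosen so that the recovered coefficients are easy to check).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free samples from y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

# The fitted slope and intercept should match the generating line.
print(model.coef_, model.intercept_)

pred = model.predict(np.array([[4.0]]))
print(pred)
```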

In [1]:
from sklearn import metrics

`metrics` includes score functions, performance metrics, pairwise metrics and distance computations. Some of these metrics (such as mean absolute error, mean squared error or root mean squared error) help us judge how well a model fits the data.
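
The three regression errors named above can be computed as in the sketch below. The target and prediction values are made up, and the root mean squared error is taken as the square root of the mean squared error with NumPy.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions; the errors are 1, 0, -2, -1.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error**2
rmse = float(np.sqrt(mse))                 # same units as the target

print(mae, mse, rmse)
```

Lower values are better for all three.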


In [1]:
from sklearn.metrics import *

In [2]:
metrics?

Help on package sklearn.metrics in sklearn:

NAME
    sklearn.metrics

DESCRIPTION
    The :mod:`sklearn.metrics` module includes score functions, performance metrics
    and pairwise metrics and distance computations.

PACKAGE CONTENTS
    _base
    _classification
    _pairwise_fast
    _plot (package)
    _ranking
    _regression
    _scorer
    cluster (package)
    pairwise
    setup
    tests (package)

CLASSES
    builtins.object
        sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay
        sklearn.metrics._plot.det_curve.DetCurveDisplay
        sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay
        sklearn.metrics._plot.roc_curve.RocCurveDisplay
    
    class ConfusionMatrixDisplay(builtins.object)
     |  ConfusionMatrixDisplay(confusion_matrix, *, display_labels=None)
     |  
     |  Confusion Matrix visualization.
     |  
     |  It is recommend to use :func:`~sklearn.metrics.plot_confusion_matrix` to
     |  create a :class:`Confusion

In [5]:
metrics.SCORERS

{'explained_variance': make_scorer(explained_variance_score),
 'r2': make_scorer(r2_score),
 'max_error': make_scorer(max_error, greater_is_better=False),
 'neg_median_absolute_error': make_scorer(median_absolute_error, greater_is_better=False),
 'neg_mean_absolute_error': make_scorer(mean_absolute_error, greater_is_better=False),
 'neg_mean_absolute_percentage_error': make_scorer(mean_absolute_percentage_error, greater_is_better=False),
 'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False),
 'neg_mean_squared_log_error': make_scorer(mean_squared_log_error, greater_is_better=False),
 'neg_root_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False, squared=False),
 'neg_mean_poisson_deviance': make_scorer(mean_poisson_deviance, greater_is_better=False),
 'neg_mean_gamma_deviance': make_scorer(mean_gamma_deviance, greater_is_better=False),
 'accuracy': make_scorer(accuracy_score),
 'top_k_accuracy': make_scorer(top_k_accuracy_score, ne

In [6]:
from sklearn import feature_selection

In [7]:
feature_selection?

Help on package sklearn.feature_selection in sklearn:

NAME
    sklearn.feature_selection

DESCRIPTION
    The :mod:`sklearn.feature_selection` module implements feature selection
    algorithms. It currently includes univariate filter selection methods and the
    recursive feature elimination algorithm.

PACKAGE CONTENTS
    _base
    _from_model
    _mutual_info
    _rfe
    _sequential
    _univariate_selection
    _variance_threshold
    tests (package)

CLASSES
    sklearn.base.MetaEstimatorMixin(builtins.object)
        sklearn.feature_selection._from_model.SelectFromModel(sklearn.base.MetaEstimatorMixin, sklearn.feature_selection._base.SelectorMixin, sklearn.base.BaseEstimator)
    sklearn.base.TransformerMixin(builtins.object)
        sklearn.feature_selection._base.SelectorMixin
            sklearn.feature_selection._from_model.SelectFromModel(sklearn.base.MetaEstimatorMixin, sklearn.feature_selection._base.SelectorMixin, sklearn.base.BaseEstimator)
            sklearn.featur
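
As an example of the univariate filter methods mentioned in the description, the sketch below keeps the two best features of the iris dataset. `SelectKBest` with the ANOVA F-test (`f_classif`) and `k=2` are choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently against the target and keep the top two.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)

print(X_selected.shape)
print(kept)
```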

In [8]:
from sklearn.tree import DecisionTreeRegressor

In [9]:
DecisionTreeRegressor?

Help on class DecisionTreeRegressor in module sklearn.tree._classes:

class DecisionTreeRegressor(sklearn.base.RegressorMixin, BaseDecisionTree)
 |  DecisionTreeRegressor(*, criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, ccp_alpha=0.0)
 |  
 |  A decision tree regressor.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : {"mse", "friedman_mse", "mae", "poisson"}, default="mse"
 |      The function to measure the quality of a split. Supported criteria
 |      are "mse" for the mean squared error, which is equal to variance
 |      reduction as feature selection criterion and minimizes the L2 loss
 |      using the mean of each terminal node, "friedman_mse", which uses mean
 |      squared error with Friedman's improvement score for potential splits,
 |  

In [1]:
from sklearn.neighbors import KNeighborsRegressor

In [2]:
KNeighborsRegressor?

Help on class KNeighborsRegressor in module sklearn.neighbors._regression:

class KNeighborsRegressor(sklearn.neighbors._base.KNeighborsMixin, sklearn.base.RegressorMixin, sklearn.neighbors._base.NeighborsBase)
 |  KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
 |  
 |  Regression based on k-nearest neighbors.
 |  
 |  The target is predicted by local interpolation of the targets
 |  associated of the nearest neighbors in the training set.
 |  
 |  Read more in the :ref:`User Guide <regression>`.
 |  
 |  .. versionadded:: 0.9
 |  
 |  Parameters
 |  ----------
 |  n_neighbors : int, default=5
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 |  
 |  weights : {'uniform', 'distance'} or callable, default='uniform'
 |      weight function used in prediction.  Possible values:
 |  
 |      - 'uniform' : uniform weights.  All points in each neighborhoo
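
Both regressors imported above share the same `fit`/`predict` interface. The sketch below fits them on a synthetic sine curve. The data, `max_depth=4` and `n_neighbors=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: y = sin(x) on [0, 5].
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

X_new = np.array([[1.0], [2.5], [4.0]])
tree_pred = tree.predict(X_new)
knn_pred = knn.predict(X_new)

# score() returns R^2, here computed on the training data.
print(tree.score(X, y), knn.score(X, y))
```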

In [1]:
from sklearn.linear_model import LogisticRegression

In [2]:
LogisticRegression?

Help on class LogisticRegression in module sklearn.linear_model._logistic:

class LogisticRegression(sklearn.linear_model._base.LinearClassifierMixin, sklearn.linear_model._base.SparseCoefMixin, sklearn.base.BaseEstimator)
 |  LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
 |  
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr', and uses the
 |  cross-entropy loss if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs',
 |  'sag', 'saga' and 'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  'liblinear' library, 'newton-cg', 's
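
A minimal classification sketch on the iris dataset follows. The split proportion and `max_iter=1000` (raised from the default to help the solver converge) are choices made here, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)            # fraction of correct predictions
probabilities = clf.predict_proba(X_test[:3])   # one column per class

print(accuracy)
print(probabilities)
```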

In [3]:
from sklearn.naive_bayes import GaussianNB

In [4]:
GaussianNB?

Help on class GaussianNB in module sklearn.naive_bayes:

class GaussianNB(_BaseNB)
 |  GaussianNB(*, priors=None, var_smoothing=1e-09)
 |  
 |  Gaussian Naive Bayes (GaussianNB)
 |  
 |  Can perform online updates to model parameters via :meth:`partial_fit`.
 |  For details on algorithm used to update feature means and variance online,
 |  see Stanford CS tech report STAN-CS-79-773 by Chan, Golub, and LeVeque:
 |  
 |      http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf
 |  
 |  Read more in the :ref:`User Guide <gaussian_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  priors : array-like of shape (n_classes,)
 |      Prior probabilities of the classes. If specified the priors are not
 |      adjusted according to the data.
 |  
 |  var_smoothing : float, default=1e-9
 |      Portion of the largest variance of all features that is added to
 |      variances for calculation stability.
 |  
 |      .. versionadded:: 0.20
 |  
 |  Attributes
 |  ----------
 |  c

In [5]:
from sklearn.neighbors import KNeighborsClassifier

In [6]:
KNeighborsClassifier?

Help on class KNeighborsClassifier in module sklearn.neighbors._classification:

class KNeighborsClassifier(sklearn.neighbors._base.KNeighborsMixin, sklearn.base.ClassifierMixin, sklearn.neighbors._base.NeighborsBase)
 |  KNeighborsClassifier(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None, **kwargs)
 |  
 |  Classifier implementing the k-nearest neighbors vote.
 |  
 |  Read more in the :ref:`User Guide <classification>`.
 |  
 |  Parameters
 |  ----------
 |  n_neighbors : int, default=5
 |      Number of neighbors to use by default for :meth:`kneighbors` queries.
 |  
 |  weights : {'uniform', 'distance'} or callable, default='uniform'
 |      weight function used in prediction.  Possible values:
 |  
 |      - 'uniform' : uniform weights.  All points in each neighborhood
 |        are weighted equally.
 |      - 'distance' : weight points by the inverse of their distance.
 |        in this case, closer ne

In [1]:
from sklearn.tree import DecisionTreeClassifier

In [2]:
DecisionTreeClassifier?

Help on class DecisionTreeClassifier in module sklearn.tree._classes:

class DecisionTreeClassifier(sklearn.base.ClassifierMixin, BaseDecisionTree)
 |  DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, ccp_alpha=0.0)
 |  
 |  A decision tree classifier.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : {"gini", "entropy"}, default="gini"
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |  
 |  splitter : {"best", "random"}, default="best"
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to choose the best split and "random" to choose
 |      the best ran

In [3]:
from sklearn.svm import SVC

In [4]:
SVC?

Help on class SVC in module sklearn.svm._classes:

class SVC(sklearn.svm._base.BaseSVC)
 |  SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
 |  
 |  C-Support Vector Classification.
 |  
 |  The implementation is based on libsvm. The fit time scales at least
 |  quadratically with the number of samples and may be impractical
 |  beyond tens of thousands of samples. For large datasets
 |  consider using :class:`~sklearn.svm.LinearSVC` or
 |  :class:`~sklearn.linear_model.SGDClassifier` instead, possibly after a
 |  :class:`~sklearn.kernel_approximation.Nystroem` transformer.
 |  
 |  The multiclass support is handled according to a one-vs-one scheme.
 |  
 |  For details on the precise mathematical formulation of the provided
 |  kernel functions and how `gamma`, `coef0` and `degree` affect each
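
The classifiers imported in the last few cells (`GaussianNB`, `KNeighborsClassifier`, `DecisionTreeClassifier`, `SVC`) all follow the same estimator interface, so they can be compared in a loop. This is a sketch on the iris dataset with hyperparameters picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "GaussianNB": GaussianNB(),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=3, random_state=0),
    "SVC": SVC(kernel="rbf", C=1.0),
}

# Fit each model on the training split and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```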
 

In [9]:
from sklearn import ensemble

In [10]:
ensemble?

Help on package sklearn.ensemble in sklearn:

NAME
    sklearn.ensemble

DESCRIPTION
    The :mod:`sklearn.ensemble` module includes ensemble-based methods for
    classification, regression and anomaly detection.

PACKAGE CONTENTS
    _bagging
    _base
    _forest
    _gb
    _gb_losses
    _gradient_boosting
    _hist_gradient_boosting (package)
    _iforest
    _stacking
    _voting
    _weight_boosting
    setup
    tests (package)

CLASSES
    sklearn.base.BaseEstimator(builtins.object)
        sklearn.ensemble._base.BaseEnsemble(sklearn.base.MetaEstimatorMixin, sklearn.base.BaseEstimator)
    sklearn.base.ClassifierMixin(builtins.object)
        sklearn.ensemble._bagging.BaggingClassifier(sklearn.base.ClassifierMixin, sklearn.ensemble._bagging.BaseBagging)
        sklearn.ensemble._gb.GradientBoostingClassifier(sklearn.base.ClassifierMixin, sklearn.ensemble._gb.BaseGradientBoosting)
        sklearn.ensemble._stacking.StackingClassifier(sklearn.base.ClassifierMixin, sklearn.ensem

In [5]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

In [6]:
RandomForestClassifier?

Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
 |  RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and uses averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  P

In [7]:
GradientBoostingClassifier?

Help on class GradientBoostingClassifier in module sklearn.ensemble._gb:

class GradientBoostingClassifier(sklearn.base.ClassifierMixin, BaseGradientBoosting)
 |  GradientBoostingClassifier(*, loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)
 |  
 |  Gradient Boosting for classification.
 |  
 |  GB builds an additive model in a
 |  forward stage-wise fashion; it allows for the optimization of
 |  arbitrary differentiable loss functions. In each stage ``n_classes_``
 |  regression trees are fit on the negative gradient of the
 |  binomial or multinomial deviance loss function. Binary classification
 |  is a special case where only a sin

In [8]:
AdaBoostClassifier?

Help on class AdaBoostClassifier in module sklearn.ensemble._weight_boosting:

class AdaBoostClassifier(sklearn.base.ClassifierMixin, BaseWeightBoosting)
 |  AdaBoostClassifier(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
 |  
 |  An AdaBoost classifier.
 |  
 |  An AdaBoost [1] classifier is a meta-estimator that begins by fitting a
 |  classifier on the original dataset and then fits additional copies of the
 |  classifier on the same dataset but where the weights of incorrectly
 |  classified instances are adjusted such that subsequent classifiers focus
 |  more on difficult cases.
 |  
 |  This class implements the algorithm known as AdaBoost-SAMME [2].
 |  
 |  Read more in the :ref:`User Guide <adaboost>`.
 |  
 |  .. versionadded:: 0.14
 |  
 |  Parameters
 |  ----------
 |  base_estimator : object, default=None
 |      The base estimator from which the boosted ensemble is built.
 |      Support for sample weighting is requi
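
The three ensemble classifiers can be fitted side by side. The sketch below assumes the breast cancer dataset bundled with scikit-learn and mostly default hyperparameters; it is meant as a usage illustration rather than a tuned benchmark.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

ensembles = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Fit each ensemble and record its accuracy on the held-out split.
ensemble_scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    ensemble_scores[name] = model.score(X_test, y_test)
    print(f"{name}: {ensemble_scores[name]:.3f}")
```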

In [14]:
from sklearn.cluster import KMeans

In [15]:
KMeans?

Help on class KMeans in module sklearn.cluster._kmeans:

class KMeans(sklearn.base.TransformerMixin, sklearn.base.ClusterMixin, sklearn.base.BaseEstimator)
 |  KMeans(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='deprecated', verbose=0, random_state=None, copy_x=True, n_jobs='deprecated', algorithm='auto')
 |  
 |  K-Means clustering.
 |  
 |  Read more in the :ref:`User Guide <k_means>`.
 |  
 |  Parameters
 |  ----------
 |  
 |  n_clusters : int, default=8
 |      The number of clusters to form as well as the number of
 |      centroids to generate.
 |  
 |  init : {'k-means++', 'random'}, callable or array-like of shape             (n_clusters, n_features), default='k-means++'
 |      Method for initialization:
 |  
 |      'k-means++' : selects initial cluster centers for k-mean
 |      clustering in a smart way to speed up convergence. See section
 |      Notes in k_init for more details.
 |  
 |      'random': choose `n_clusters` o
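
A short K-Means sketch on synthetic blobs follows. The three centers, the cluster spread and `n_init=10` are assumptions chosen so that the groups are easy to separate.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # cluster index assigned to each training sample

# Assign new points to the nearest learned centroid.
new_labels = kmeans.predict(np.array([[0.5, -0.5], [9.0, 11.0]]))

print(kmeans.cluster_centers_)
print(new_labels)
```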

In [18]:
!pip install yellowbrick
# Yellowbrick is a library for machine learning visualization



In [17]:
from yellowbrick.cluster import KElbowVisualizer

In [19]:
KElbowVisualizer?

Help on class KElbowVisualizer in module yellowbrick.cluster.elbow:

class KElbowVisualizer(yellowbrick.cluster.base.ClusteringScoreVisualizer)
 |  KElbowVisualizer(estimator, ax=None, k=10, metric='distortion', timings=True, locate_elbow=True, **kwargs)
 |  
 |  The K-Elbow Visualizer implements the "elbow" method of selecting the
 |  optimal number of clusters for K-means clustering. K-means is a simple
 |  unsupervised machine learning algorithm that groups data into a specified
 |  number (k) of clusters. Because the user must specify in advance what k to
 |  choose, the algorithm is somewhat naive -- it assigns all members to k
 |  clusters even if that is not the right k for the dataset.
 |  
 |  The elbow method runs k-means clustering on the dataset for a range of
 |  values for k (say from 1-10) and then for each value of k computes an
 |  average score for all clusters. By default, the ``distortion`` score is
 |  computed, the sum of square distances from each point to its as
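
The idea behind the elbow method can be sketched by hand with scikit-learn alone. Note that this is a manual stand-in and not the Yellowbrick visualizer: it records `KMeans.inertia_` (the sum of squared distances to the closest centroid) for a range of k values on synthetic blobs, which are an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three true groups.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

# Fit K-Means for each candidate k and store the resulting inertia.
ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply; on this data that should happen at k = 3.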

In [28]:
from sklearn.cluster import DBSCAN

In [None]:
DBSCAN?
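
Since the help output is not shown here, a small sketch may help. DBSCAN groups points that lie in dense regions and labels isolated points as noise (`-1`). The points, `eps=0.5` and `min_samples=3` below are chosen by hand for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points and one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [20.0, 20.0],
])

# eps: neighborhood radius; min_samples: points (including itself) needed for a core point.
dbscan = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = dbscan.labels_

print(labels)  # the isolated point gets the noise label -1
```

Unlike K-Means, the number of clusters is not specified in advance.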

In [1]:
from sklearn.decomposition import PCA

In [2]:
PCA?

Help on class PCA in module sklearn.decomposition._pca:

class PCA(sklearn.decomposition._base._BasePCA)
 |  PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
 |  
 |  Principal component analysis (PCA).
 |  
 |  Linear dimensionality reduction using Singular Value Decomposition of the
 |  data to project it to a lower dimensional space. The input data is centered
 |  but not scaled for each feature before applying the SVD.
 |  
 |  It uses the LAPACK implementation of the full SVD or a randomized truncated
 |  SVD by the method of Halko et al. 2009, depending on the shape of the input
 |  data and the number of components to extract.
 |  
 |  It can also use the scipy.sparse.linalg ARPACK implementation of the
 |  truncated SVD.
 |  
 |  Notice that this class does not support sparse input. See
 |  :class:`TruncatedSVD` for an alternative with sparse data.
 |  
 |  Read more in the :ref:`User Guide <PCA>`.
 |  
 | 
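
A minimal PCA sketch on the iris dataset, projecting the four features onto two components (the choice of dataset and `n_components=2` is for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # data is centered internally before the SVD

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(X_2d.shape)
```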

In [3]:
from sklearn.datasets import fetch_openml

In [4]:
fetch_openml?

Help on function fetch_openml in module sklearn.datasets._openml:

fetch_openml(name: Optional[str] = None, *, version: Union[str, int] = 'active', data_id: Optional[int] = None, data_home: Optional[str] = None, target_column: Union[str, List, NoneType] = 'default-target', cache: bool = True, return_X_y: bool = False, as_frame: Union[str, bool] = 'auto')
    Fetch dataset from openml by name or dataset id.
    
    Datasets are uniquely identified by either an integer ID or by a
    combination of name and version (i.e. there might be multiple
    versions of the 'iris' dataset). Please give either name or data_id
    (not both). In case a name is given, a version can also be
    provided.
    
    Read more in the :ref:`User Guide <openml>`.
    
    .. versionadded:: 0.20
    
    .. note:: EXPERIMENTAL
    
        The API is experimental (particularly the return value structure),
        and might have small backward-incompatible changes without notice
    
    Parameters
    -----

In [5]:
from sklearn.manifold import TSNE

In [6]:
TSNE?

Help on class TSNE in module sklearn.manifold._t_sne:

class TSNE(sklearn.base.BaseEstimator)
 |  TSNE(n_components=2, *, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5, n_jobs=None, square_distances='legacy')
 |  
 |  t-distributed Stochastic Neighbor Embedding.
 |  
 |  t-SNE [1] is a tool to visualize high-dimensional data. It converts
 |  similarities between data points to joint probabilities and tries
 |  to minimize the Kullback-Leibler divergence between the joint
 |  probabilities of the low-dimensional embedding and the
 |  high-dimensional data. t-SNE has a cost function that is not convex,
 |  i.e. with different initializations we can get different results.
 |  
 |  It is highly recommended to use another dimensionality reduction
 |  method (e.g. PCA for dense data or TruncatedSVD for sparse data)
 | 
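
The sketch below follows the recommendation in the help text and reduces the data with PCA before running t-SNE. The subset of 200 digit images, the 20 PCA components and the perplexity value are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_small = X[:200]  # small subset to keep the run time short

# First reduce 64 pixel features to 20 components with PCA.
X_reduced = PCA(n_components=20, random_state=0).fit_transform(X_small)

# Then embed into 2-D for visualization.
tsne = TSNE(n_components=2, perplexity=30.0, random_state=0)
embedding = tsne.fit_transform(X_reduced)

print(embedding.shape)
```

Because the cost function is not convex, changing `random_state` can change the embedding.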

In [1]:
from sklearn.dummy import DummyClassifier

In [2]:
DummyClassifier?

Help on class DummyClassifier in module sklearn.dummy:

class DummyClassifier(sklearn.base.MultiOutputMixin, sklearn.base.ClassifierMixin, sklearn.base.BaseEstimator)
 |  DummyClassifier(*, strategy='prior', random_state=None, constant=None)
 |  
 |  DummyClassifier is a classifier that makes predictions using simple rules.
 |  
 |  This classifier is useful as a simple baseline to compare with other
 |  (real) classifiers. Do not use it for real problems.
 |  
 |  Read more in the :ref:`User Guide <dummy_estimators>`.
 |  
 |  .. versionadded:: 0.13
 |  
 |  Parameters
 |  ----------
 |  strategy : {"stratified", "most_frequent", "prior", "uniform",             "constant"}, default="prior"
 |      Strategy to use to generate predictions.
 |  
 |      * "stratified": generates predictions by respecting the training
 |        set's class distribution.
 |      * "most_frequent": always predicts the most frequent label in the
 |        training set.
 |      * "prior": always predicts the 
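
A baseline sketch with the `most_frequent` strategy on made-up imbalanced labels follows. The features are all zeros because the dummy strategies ignore them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Imbalanced toy labels: eight samples of class 0, two of class 1.
X = np.zeros((10, 1))
y = np.array([0] * 8 + [1] * 2)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
predictions = baseline.predict(X)         # always the majority class
baseline_accuracy = baseline.score(X, y)  # 8 correct out of 10

print(predictions)
print(baseline_accuracy)
```

A real classifier should beat this accuracy to be considered useful on the same data.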

In [4]:
from sklearn.metrics import confusion_matrix

In [5]:
confusion_matrix?

Help on function confusion_matrix in module sklearn.metrics._classification:

confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)
    Compute confusion matrix to evaluate the accuracy of a classification.
    
    By definition a confusion matrix :math:`C` is such that :math:`C_{i, j}`
    is equal to the number of observations known to be in group :math:`i` and
    predicted to be in group :math:`j`.
    
    Thus in binary classification, the count of true negatives is
    :math:`C_{0,0}`, false negatives is :math:`C_{1,0}`, true positives is
    :math:`C_{1,1}` and false positives is :math:`C_{0,1}`.
    
    Read more in the :ref:`User Guide <confusion_matrix>`.
    
    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Ground truth (correct) target values.
    
    y_pred : array-like of shape (n_samples,)
        Estimated targets as returned by a classifier.
    
    labels : array-like of shape (n_classes), default=
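
The binary layout described above (rows are true classes, columns are predicted classes) can be checked with a tiny example. The labels below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for a binary problem.
y_true = [0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# cm[0, 0] = true negatives, cm[0, 1] = false positives,
# cm[1, 0] = false negatives, cm[1, 1] = true positives.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)   # 2 1 1 2
```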

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [11]:
Pipeline?

In [13]:
SimpleImputer?
# Imputation transformer for completing missing values.

In [14]:
from sklearn.model_selection import cross_val_score

In [15]:
cross_val_score?
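
One way `Pipeline`, `SimpleImputer` and `cross_val_score` might fit together is sketched below. The data is synthetic, and the choice of `LogisticRegression` as the final step is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (hypothetical): the label depends on the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.rand(100, 3) < 0.1] = np.nan   # blank out roughly 10% of the values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill NaN with column means
    ("model", LogisticRegression()),              # assumed estimator
])

# The imputer is re-fitted on the training part of every fold.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Putting the imputer inside the pipeline keeps the column means from leaking information out of the validation fold.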

In [31]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [27]:
f1_score?

In [20]:
accuracy_score?

In [21]:
precision_score?

In [22]:
recall_score?

In [23]:
roc_curve?

In [24]:
roc_auc_score?
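
A quick sketch of the metrics imported above, on invented labels and scores. The label-based metrics take hard predictions, while `roc_curve` and `roc_auc_score` take continuous scores.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)

# Hypothetical labels, hard predictions and predicted scores.
y_true = [0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.9, 0.6, 0.8, 0.4, 0.2]

acc = accuracy_score(y_true, y_pred)     # 4 correct out of 6
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 2 / 3
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)            # harmonic mean of the two = 2 / 3

auc = roc_auc_score(y_true, y_score)     # 8 of 9 positive/negative pairs ordered correctly
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(acc, prec, rec, f1, auc)
```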

In [30]:
import warnings
warnings.filterwarnings('ignore')

# Suppresses warning messages for the rest of the session, not only for this cell.

In [32]:
from sklearn.preprocessing import OneHotEncoder

In [33]:
OneHotEncoder?

Encode categorical features as a one-hot numeric array. The encoder transforms every column it receives, so to encode only the categorical columns inside a pipeline I wrap it in a column transformer with `remainder='passthrough'`; the other columns then pass through unchanged.

In [34]:
from sklearn.compose import ColumnTransformer

In [None]:
ColumnTransformer?

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input
to be transformed separately and the features generated by each transformer
will be concatenated to form a single feature space.
This is useful for heterogeneous or columnar data, to combine several
feature extraction mechanisms or transformations into a single transformer.
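
A minimal sketch of combining `OneHotEncoder` with `ColumnTransformer`. The column names `color` and `size` and the data are hypothetical.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.0, 2.0, 3.0]})

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["color"])],
    remainder="passthrough",   # leave "size" untouched
)
encoded = ct.fit_transform(df)

# Depending on density the result may be a sparse matrix; densify for display.
if sparse.issparse(encoded):
    encoded = encoded.toarray()

# Columns: color_blue, color_red, then the passed-through size.
print(encoded)
```

The transformed columns come first in the output, followed by the passthrough columns.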

In [1]:
from category_encoders import TargetEncoder


In [2]:
TargetEncoder?

In [1]:
from sklearn.preprocessing import StandardScaler

In [2]:
StandardScaler?

`StandardScaler` standardizes features by removing the mean and scaling to unit variance. This matters because a feature whose variance is orders of magnitude larger than that of the others might dominate the objective function and keep the estimator from learning correctly from the other features.
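
A small sketch of the effect, using two made-up features on very different scales. After scaling, each column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is 1000 times larger than the first.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)           # per-column means learned from X
print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```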

### Pandas

Data structures and data analysis tools, built around the `DataFrame` and `Series` objects.

In [7]:
import pandas as pd

In [32]:
pd?

Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic an

### MatPlotLib

In [2]:
import matplotlib.pyplot as plt

In [27]:
plt?

Help on module matplotlib.pyplot in matplotlib:

NAME
    matplotlib.pyplot

DESCRIPTION
    `matplotlib.pyplot` is a state-based interface to matplotlib. It provides
    a MATLAB-like way of plotting.
    
    pyplot is mainly intended for interactive plots and simple cases of
    programmatic plot generation::
    
        import numpy as np
        import matplotlib.pyplot as plt
    
        x = np.arange(0, 5, 0.1)
        y = np.sin(x)
        plt.plot(x, y)
    
    The object-oriented API is recommended for more complex plots.

FUNCTIONS
    acorr(x, *, data=None, **kwargs)
        Plot the autocorrelation of *x*.
        
        Parameters
        ----------
        x : array-like
        
        detrend : callable, default: `.mlab.detrend_none` (no detrending)
            A detrending function applied to *x*.  It must have the
            signature ::
        
                detrend(x: np.ndarray) -> np.ndarray
        
        normed : bool, default: True
            If `

In [11]:
from mpl_toolkits.mplot3d import Axes3D

In [12]:
Axes3D?

Help on class Axes3D in module mpl_toolkits.mplot3d.axes3d:

class Axes3D(matplotlib.axes._axes.Axes)
 |  Axes3D(fig, rect=None, *args, azim=-60, elev=30, sharez=None, proj_type='persp', box_aspect=None, **kwargs)
 |  
 |  3D axes object.
 |  
 |  Method resolution order:
 |      Axes3D
 |      matplotlib.axes._axes.Axes
 |      matplotlib.axes._base._AxesBase
 |      matplotlib.artist.Artist
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, fig, rect=None, *args, azim=-60, elev=30, sharez=None, proj_type='persp', box_aspect=None, **kwargs)
 |      Parameters
 |      ----------
 |      fig : Figure
 |          The parent figure.
 |      rect : (float, float, float, float)
 |          The ``(left, bottom, width, height)`` axes position.
 |      azim : float, default: -60
 |          Azimuthal viewing angle.
 |      elev : float, default: 30
 |          Elevation viewing angle.
 |      sharez : Axes3D, optional
 |          Other axes to share z-limits with.
 

### Seaborn

It provides a high-level interface for drawing attractive and informative statistical graphics.

In [30]:
import seaborn as sn

In [22]:
sn?

Help on package seaborn:

NAME
    seaborn - # Import seaborn objects

PACKAGE CONTENTS
    _core
    _decorators
    _docstrings
    _statistics
    _testing
    algorithms
    axisgrid
    categorical
    cm
    colors (package)
    conftest
    distributions
    external (package)
    matrix
    miscplot
    palettes
    rcmod
    regression
    relational
    tests (package)
    utils
    widgets

DATA
    crayons = {'Almond': '#EFDECD', 'Antique Brass': '#CD9575', 'Apricot':...
    xkcd_rgb = {'acid green': '#8ffe09', 'adobe': '#bd6c48', 'algae': '#54...

VERSION
    0.11.1

FILE
    /opt/anaconda3/envs/python_intro/lib/python3.8/site-packages/seaborn/__init__.py




### Zipfile

In [8]:
from zipfile import ZipFile

In [None]:
zf = ZipFile('path/to/file.zip')

zf.filelist
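
A self-contained sketch that does not need an existing archive: it writes a small zip into a temporary directory and reads it back. The file names are hypothetical.

```python
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "example.zip")

    # Create an archive containing a single text member.
    with ZipFile(archive_path, "w") as zf:
        zf.writestr("notes.txt", "hello")

    # Reopen it for reading and inspect the contents.
    with ZipFile(archive_path) as zf:
        names = zf.namelist()
        content = zf.read("notes.txt").decode("utf-8")

print(names, content)   # ['notes.txt'] hello
```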

In [None]:
ZipFile?

### Web Scraping

In [9]:
import requests

In [35]:
requests?

Help on package requests:

NAME
    requests

DESCRIPTION
    Requests HTTP Library
    ~~~~~~~~~~~~~~~~~~~~~
    
    Requests is an HTTP library, written in Python, for human beings.
    Basic GET usage:
    
       >>> import requests
       >>> r = requests.get('https://www.python.org')
       >>> r.status_code
       200
       >>> b'Python is a programming language' in r.content
       True
    
    ... or POST:
    
       >>> payload = dict(key1='value1', key2='value2')
       >>> r = requests.post('https://httpbin.org/post', data=payload)
       >>> print(r.text)
       {
         ...
         "form": {
           "key1": "value1",
           "key2": "value2"
         },
         ...
       }
    
    The other HTTP methods are supported - see `requests.api`. Full documentation
    is at <https://requests.readthedocs.io>.
    
    :copyright: (c) 2017 by Kenneth Reitz.
    :license: Apache 2.0, see LICENSE for more details.

PACKAGE CONTENTS
    __version__
    _internal_utils

#### Beautiful Soup

In [1]:
from bs4 import BeautifulSoup

In [2]:
BeautifulSoup?

#### IPython.display

For web scraping:
- _Image_ and _display_ show images extracted from websites inside the notebook.
- _IFrame_ renders a website inside the notebook.

In [38]:
from IPython.display import Image, display, IFrame

In [None]:
# Example

IFrame('https://en.wikipedia.org/',800,600)

In [43]:
Image?

Help on class Image in module IPython.core.display:

class Image(DisplayObject)
 |  Image(data=None, url=None, filename=None, format=None, embed=None, width=None, height=None, retina=False, unconfined=False, metadata=None)
 |  
 |  An object that wraps data to be displayed.
 |  
 |  Method resolution order:
 |      Image
 |      DisplayObject
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, data=None, url=None, filename=None, format=None, embed=None, width=None, height=None, retina=False, unconfined=False, metadata=None)
 |      Create a PNG/JPEG/GIF image object given raw data.
 |      
 |      When this object is returned by an input cell or passed to the
 |      display function, it will result in the image being displayed
 |      in the frontend.
 |      
 |      Parameters
 |      ----------
 |      data : unicode, str or bytes
 |          The raw image data or a URL or filename to load the data from.
 |          This always results in embedded image 

In [44]:
display?

Help on function display in module IPython.core.display:

display(*objs, include=None, exclude=None, metadata=None, transient=None, display_id=None, **kwargs)
    Display a Python object in all frontends.
    
    By default all representations will be computed and sent to the frontends.
    Frontends can decide which representation is used and how.
    
    In terminal IPython this will be similar to using :func:`print`, for use in richer
    frontends see Jupyter notebook examples with rich display logic.
    
    Parameters
    ----------
    objs : tuple of objects
        The Python objects to display.
    raw : bool, optional
        Are the objects to be displayed already mimetype-keyed dicts of raw display data,
        or Python objects that need to be formatted before display? [default: False]
    include : list, tuple or set, optional
        A list of format type strings (MIME types) to include in the
        format data dict. If this is set *only* the format types included

In [45]:
IFrame?

Help on class IFrame in module IPython.lib.display:

class IFrame(builtins.object)
 |  IFrame(src, width, height, **kwargs)
 |  
 |  Generic class to embed an iframe in an IPython notebook
 |  
 |  Methods defined here:
 |  
 |  __init__(self, src, width, height, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  iframe = '\n        <iframe\n            width="{width}"\n   ...      ...



#### BS4

It helps to read the content of a page and pull data out of it. We can create a `BeautifulSoup` object from the content and call `prettify()` to get a more readable, indented version of it.
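
A minimal sketch on an inline HTML string (made up for this example), using the parser from the standard library so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; in practice this could come from requests.get(...).content
html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()                             # indented, one tag per line
heading = soup.h1.get_text()                         # text of the first <h1>
intro = soup.find("p", class_="intro").get_text()    # look up by tag and CSS class

print(pretty)
print(heading, intro)
```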

In [12]:
from bs4 import BeautifulSoup

In [55]:
BeautifulSoup?

Help on class BeautifulSoup in module bs4:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(na

#### Flask

Flask is a popular Python web framework: a third-party library for developing web applications and APIs.
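
A minimal sketch of a Flask API that uses `Flask`, `jsonify` and `request` together. The `/echo` route and the `name` query parameter are hypothetical, and the built-in test client is used so that no server has to be started.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/echo", methods=["GET"])
def echo():
    # Read an optional query parameter, e.g. /echo?name=flask
    name = request.args.get("name", "world")
    return jsonify(message=f"hello {name}")


# Exercise the route in-process with the test client.
with app.test_client() as client:
    response = client.get("/echo?name=flask")
    payload = response.get_json()

print(response.status_code, payload)
```

To serve it for real, `app.run()` or the `flask run` command would start a development server.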

In [13]:
from flask import Flask, jsonify, request

In [57]:
Flask?

Help on class Flask in module flask.app:

class Flask(flask.helpers._PackageBoundObject)
 |  Flask(import_name, static_url_path=None, static_folder='static', static_host=None, host_matching=False, subdomain_matching=False, template_folder='templates', instance_path=None, instance_relative_config=False, root_path=None)
 |  
 |  The flask object implements a WSGI application and acts as the central
 |  object.  It is passed the name of the module or package of the
 |  application.  Once it is created it will act as a central registry for
 |  the view functions, the URL rules, template configuration and much more.
 |  
 |  The name of the package is used to resolve resources from inside the
 |  package or the folder the module is contained in depending on if the
 |  package parameter resolves to an actual python package (a folder with
 |  an :file:`__init__.py` file inside) or a standard module (just a ``.py`` file).
 |  
 |  For more information about resource loading, see :func:`open_re

In [58]:
jsonify?

Help on function jsonify in module flask.json:

jsonify(*args, **kwargs)
    This function wraps :func:`dumps` to add a few enhancements that make
    life easier.  It turns the JSON output into a :class:`~flask.Response`
    object with the :mimetype:`application/json` mimetype.  For convenience, it
    also converts multiple arguments into an array or multiple keyword arguments
    into a dict.  This means that both ``jsonify(1,2,3)`` and
    ``jsonify([1,2,3])`` serialize to ``[1,2,3]``.
    
    For clarity, the JSON serialization behavior has the following differences
    from :func:`dumps`:
    
    1. Single argument: Passed straight through to :func:`dumps`.
    2. Multiple arguments: Converted to an array before being passed to
       :func:`dumps`.
    3. Multiple keyword arguments: Converted to a dict before being passed to
       :func:`dumps`.
    4. Both args and kwargs: Behavior undefined and will throw an exception.
    
    Example usage::
    
        from flask impor

In [59]:
request?

Help on LocalProxy in module werkzeug.local object:

class LocalProxy(builtins.object)
 |  Acts as a proxy for a werkzeug local.  Forwards all operations to
 |  a proxied object.  The only operations not supported for forwarding
 |  are right handed operands and any kind of assignment.
 |  
 |  Example usage::
 |  
 |      from werkzeug.local import Local
 |      l = Local()
 |  
 |      # these are proxies
 |      request = l('request')
 |      user = l('user')
 |  
 |  
 |      from werkzeug.local import LocalStack
 |      _response_local = LocalStack()
 |  
 |      # this is a proxy
 |      response = _response_local()
 |  
 |  Whenever something is bound to l.user / l.request the proxy objects
 |  will forward all operations.  If no object is bound a :exc:`RuntimeError`
 |  will be raised.
 |  
 |  To create proxies to :class:`Local` or :class:`LocalStack` objects,
 |  call the object as shown above.  If you want to have a proxy to an
 |  object looked up by a function, you can (as

### NLP - Natural Language Processing

In [1]:
import nltk

In [2]:
nltk?

In [3]:
from nltk.corpus import brown

In [13]:
brown?

In [4]:
from nltk.corpus import gutenberg

In [14]:
gutenberg?

In [5]:
from __future__ import division

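Importing from `__future__` enables language features that become standard in later Python versions while running an older release. Here, `division` makes `/` perform true division in Python 2; in Python 3 that is already the default, so the import has no effect there.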
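
The `__future__` module also records when each feature was introduced and when it became the default, which can be inspected directly. A short sketch, run under Python 3:

```python
import __future__

feature = __future__.division
optional_in = feature.getOptionalRelease()     # first release where the import was accepted
mandatory_in = feature.getMandatoryRelease()   # release where it became the default

print(optional_in[:2], mandatory_in[:2])       # (2, 2) (3, 0)

# In Python 3, true division is therefore already the behavior of "/".
true_quotient = 7 / 2     # 3.5
floor_quotient = 7 // 2   # 3
print(true_quotient, floor_quotient)
```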

In [6]:
from nltk import ConditionalFreqDist

In [18]:
ConditionalFreqDist?

In [7]:
from nltk import FreqDist

In [20]:
FreqDist?

In [8]:
from nltk import DecisionTreeClassifier

In [21]:
DecisionTreeClassifier?

In [24]:
from nltk.classify import accuracy

In [25]:
accuracy?

In [10]:
import spacy

In [26]:
spacy?

In [11]:
from nltk.tree import Tree

In [27]:
Tree?

In [12]:
from nltk.corpus import wordnet as wn

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

In [1]:
from pywsd.lesk import simple_lesk

Warming up PyWSD (takes ~10 secs)... took 8.57317590713501 secs.


In [9]:
simple_lesk?

In [2]:
from pywsd import disambiguate

In [10]:
disambiguate?

In [3]:
from pywsd.similarity import max_similarity as maxsim

In [11]:
maxsim?

In [4]:
from pywsd.baseline import random_sense, first_sense

In [12]:
random_sense?

In [14]:
first_sense?

In [5]:
from pywsd.baseline import max_lemma_count as most_frequent_sense

In [16]:
most_frequent_sense?

In [6]:
from pywsd.lesk import adapted_lesk

In [17]:
adapted_lesk?

In [7]:
from pywsd.lesk import cosine_lesk

In [18]:
cosine_lesk?