# Libraries, Packages and Datasets

## Essential Tools and Libraries

To work through this machine learning training, you need basic knowledge of Python programming. In addition, several libraries and packages are commonly used for machine learning tasks; they are briefly described below.

### Python

Python has become the lingua franca for many data science (and ML) applications. It combines the power of general-purpose programming languages with the ease of use of domain-specific scripting languages like MATLAB or R.

Python has libraries for data loading, visualization, statistics, natural language processing, image processing, and more. This vast toolbox provides data scientists with a large array of general-purpose and special-purpose functionality.

Python also has many libraries used in ML work, such as NumPy, SciPy, Pandas, Scikit-learn, TensorFlow, Theano, Keras, and PyTorch, among many others.

<b>Python is open source, and so are most of the tools that form part of its ecosystem.</b>


### Scikit-learn

Scikit-learn is a very popular tool and the most prominent Python library for machine learning. It is widely used in industry and academia, and there is a wealth of tutorials and code snippets about scikit-learn available online. Scikit-learn works well with a number of other scientific Python tools, which are described below.
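As a rough illustration of what working with scikit-learn looks like, the sketch below trains a nearest-neighbors classifier on a tiny made-up dataset. The data values and the choice of `KNeighborsClassifier` are assumptions made for this example only; any supervised estimator would follow the same pattern.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data, made up for illustration: one feature, two well-separated classes.
X = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised estimators in scikit-learn share the same fit/predict pattern.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Predict the class of two new points, one near each cluster.
predictions = model.predict(np.array([[0.1], [2.1]]))
print(predictions)
```

The first new point sits among the class-0 samples and the second among the class-1 samples, so the classifier assigns them to those classes.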

### NumPy

NumPy is one of the fundamental packages for scientific computing in Python. It contains functionality for multidimensional arrays, high-level mathematical functions such as linear algebra operations and the Fourier transform, and pseudorandom number generators.

The NumPy array is the fundamental data structure in scikit-learn, which takes in data in the form of NumPy arrays. Any data you’re using will generally have to be converted to a NumPy array, or to a type that scikit-learn can convert for you, such as a Pandas DataFrame. The core of NumPy is the `ndarray` class, a multidimensional ($n$-dimensional) array whose elements must all be of the same type.
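A minimal sketch of these array properties, using made-up values: it builds a two-dimensional array, inspects its dimensions and element type, and shows that mixing integers and floats yields a single common type.

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns.
x = np.array([[1, 2, 3], [4, 5, 6]])
print(x.ndim)   # number of dimensions
print(x.shape)  # size along each dimension
print(x.dtype)  # the single element type shared by all entries

# Mixing ints and floats: NumPy upcasts everything to one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

The integer type reported for `x` depends on the platform and NumPy version, which is why no specific output is shown for it here.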

### SciPy

SciPy is a collection of functions for scientific computing in Python. It provides, among other functionality, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions and statistical distributions. Scikit-learn draws from SciPy’s collection of functions for implementing its algorithms.
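To give a flavor of two of these areas, the sketch below minimizes a simple function with `scipy.optimize` and evaluates a statistical distribution with `scipy.stats`. The function being minimized is chosen only for illustration.

```python
from scipy import optimize, stats

# Minimize f(x) = (x - 3)^2, whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)
print(result.x)

# Standard normal CDF at 0: half of the probability mass lies below the mean.
p = stats.norm.cdf(0.0)
print(p)
```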

### Matplotlib

Matplotlib is the primary scientific plotting library in Python. It provides functions for making publication-quality visualizations such as line charts, histograms, scatter plots, and so on.
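A short sketch of a line chart follows. The output file name `sine.png` is a hypothetical choice, and the `Agg` backend line is an assumption that lets the script run without a display; inside a Jupyter Notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# 100 evenly spaced points between 0 and 2*pi.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Write the figure to an image file (hypothetical file name).
fig.savefig("sine.png")
```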

### Seaborn 

Seaborn is another Python data visualization library, built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.


### Pandas

Pandas is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame, which is modeled after the R data frame. Simply put, a Pandas DataFrame is a table, similar to an Excel spreadsheet. Pandas provides a great range of methods to modify and operate on this table. In particular, it allows SQL-like queries and joins of tables. Another valuable feature of Pandas is its ability to ingest data from a great variety of file formats and databases, such as SQL databases, Excel files and comma-separated values (CSV) files.
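The SQL-like operations mentioned above can be sketched as follows. The two small tables and their column names are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [34, 27, 45],
    "city_id": [1, 2, 1],
})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Lagos", "Accra"]})

# SQL-like filter, roughly: SELECT * FROM people WHERE age > 30
over_30 = people[people["age"] > 30]
print(over_30)

# SQL-like join on a shared key column.
joined = people.merge(cities, on="city_id", how="left")
print(joined)
```

Reading from files works in a similar spirit: `pd.read_csv` and `pd.read_excel` each return a DataFrame built from the contents of the given file.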

### Jupyter Notebook

The Jupyter Notebook is an interactive environment for running code in the browser. It is a great tool for exploratory data analysis and is widely used by data scientists.

<br/>

## Installation

You can download and install Python on its own from www.python.org on your local machine. However, the recommended way to install Python together with the other scientific computing and machine learning packages is to use the <b>Anaconda</b> distribution.

### Anaconda

Anaconda is a Python distribution made for large-scale data processing, predictive analytics, and scientific computing. Anaconda comes with NumPy, SciPy, Matplotlib, IPython, Jupyter Notebook, and Scikit-learn. Anaconda is available for macOS, Windows, and Linux.

To download the free Anaconda Python distribution from Anaconda, Inc. (formerly Continuum Analytics), do the following:

Visit the Anaconda download page (https://www.anaconda.com/distribution/) and choose the installer that matches your operating system. Note that the installation process may take 15-20 minutes, as the installer contains Python, associated packages, a code editor, and some other files.
> Make sure to select the Python 3.x version instead of 2.x.

> Ensure that Anaconda’s Python distribution installs into a single directory, and does not affect other Python installations, if any, on your system.

Launch the Anaconda Navigator and click the "Launch" button under Notebook to launch the Jupyter Notebook.

<img src="images/anaconda-navigator.png" />
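Once the notebook is running, one way to confirm that the packages described in this section are available is a quick version check like the sketch below. The list of import names is an assumption based on the libraries covered here; note that scikit-learn is imported as `sklearn`.

```python
import importlib

# Import names of the packages described in this section.
packages = ["numpy", "scipy", "matplotlib", "pandas", "sklearn", "seaborn"]

versions = {}
for name in packages:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[name] = None  # not installed in this environment

# Report what was found.
for name, version in versions.items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Any package reported as missing can usually be added from the Anaconda Navigator or with `conda install`.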

<br/>

# Knowing your data

Quite possibly the most important part of the machine learning process is understanding the data you are working with. It will not be effective to randomly choose an algorithm and throw your data at it. It is necessary to understand what is going on in your dataset before you begin building a model. Each algorithm is different in terms of what data it works best for, what kinds of data it can handle, what kind of data it is optimized for, and so on.

Before you start building a model, it is important to know the answers to most of, if not all of, the following questions:
- How much data do I have? Do I need more?
- How many features do I have? Do I have too many? Do I have too few?
- Is there missing data? Should I discard the rows with missing data or handle them differently?
- What question(s) am I trying to answer? Do I think the data collected can answer that question?
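Several of the questions above can be answered with a few lines of Pandas. The sketch below uses a tiny made-up dataset; the column names and values are assumptions for illustration, and dropping incomplete rows is only one of several ways to handle missing data.

```python
import numpy as np
import pandas as pd

# Made-up dataset for illustration; in practice, load your own data.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0],
    "weight_kg": [65.0, np.nan, 72.0, 80.0],
    "label": ["a", "b", "a", "b"],
})

# How much data do I have, and how many columns?
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Is there missing data? Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# How many rows contain at least one missing value?
rows_with_missing = int(df.isna().any(axis=1).sum())
print(rows_with_missing)

# One option among several: discard the incomplete rows.
complete = df.dropna()
print(len(complete))
```

Whether discarding rows is acceptable depends on how much data remains afterwards and on whether the missing values are spread randomly across the dataset.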