# Introduction to Scikit-Learn, Pandas and NumPy

## Scikit-Learn

Scikit-Learn, also known as sklearn, is Python's go-to general-purpose machine learning library. It offers a high-level interface for many tasks, which lets you practice the entire machine learning workflow and keep the big picture in view.

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
The library is built on the SciPy (Scientific Python) stack; NumPy and SciPy in particular must be installed before you can use scikit-learn. The stack includes:

1. NumPy: Base n-dimensional array package
2. SciPy: Fundamental library for scientific computing
3. Matplotlib: Comprehensive 2D/3D plotting
4. IPython: Enhanced interactive console
5. SymPy: Symbolic mathematics
6. Pandas: Data structures and analysis


### What are the features?
The library is focused on modeling data, not on loading, manipulating and summarizing it; for those tasks we use NumPy and Pandas. Some popular groups of models and tools provided by scikit-learn include:

* Clustering: for grouping unlabeled data, for example with KMeans.
* Cross Validation: for estimating the performance of supervised models on unseen data.
* Datasets: for test datasets and for generating datasets with specific properties for investigating model behavior.
* Dimensionality Reduction: for reducing the number of attributes in data for summarization, visualization and feature selection, for example with Principal Component Analysis (PCA).
* Ensemble methods: for combining the predictions of multiple supervised models.
* Feature extraction: for defining attributes in image and text data.
* Feature selection: for identifying meaningful attributes from which to create supervised models.
* Parameter Tuning: for getting the most out of supervised models.
* Supervised Models: a wide range including, but not limited to, generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
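
As a rough illustration of the consistent interface and of two of the feature groups listed above (cross validation and parameter tuning), the sketch below scores three classifiers with the same loop and then tunes one of them. The choice of models, the 5-fold setting, and the `max_depth` grid are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit/predict API, so models can be swapped freely.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validation: train on 4/5 of the data, score on the held-out 1/5.
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")

# Parameter tuning: try several tree depths and keep the best cross-validated one.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, None]},
    cv=5,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Because the loop only relies on the shared estimator API, adding another model is a one-line change to the `models` dictionary.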


### Documentation
Here are some resources that will help you get on board quickly with scikit-learn.

1. Quick Start Tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html
2. User Guide http://scikit-learn.org/stable/user_guide.html
3. API Reference http://scikit-learn.org/stable/modules/classes.html
4. Example Gallery http://scikit-learn.org/stable/auto_examples/index.html

### Examples

Let's see how to use scikit-learn for a classification problem. Below is an example that uses a Classification and Regression Trees (CART) decision tree classifier
to model the Iris flower dataset. Note that the model is evaluated on the same data it was trained on.



In [35]:
from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

   micro avg       1.00      1.00      1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150

[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
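
The perfect scores above come from predicting on the same samples the tree was fitted to, so they say little about performance on unseen data. A minimal sketch of a more realistic evaluation, assuming a 70/30 split and fixed `random_state` values chosen only to make the run reproducible, could look like this:

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = datasets.load_iris()

# Hold out 30% of the samples; stratify keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data,
    dataset.target,
    test_size=0.3,
    stratify=dataset.target,
    random_state=42,
)

# Fit on the training part only.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen.
predicted = model.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, predicted)
print(metrics.classification_report(y_test, predicted))
print(metrics.confusion_matrix(y_test, predicted))
```

The held-out accuracy is usually a little below 1.0 on this dataset, which is a more honest estimate than the training-set report.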


# Pandas

## Introduction

The name Pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. The pandas package is one of the most widely used tools for data analysis in Python and is essentially your data’s home. Through pandas, one can easily clean, transform and aggregate a given dataset.

Pandas lets you load data from a CSV file stored on your computer into a DataFrame (essentially a table) and then do things like:

* Calculate statistics and answer questions about the data, like "What's the average, median, max, or min of each column?", "Does column A correlate with column B?"
* Clean the data by doing operations like removing missing values and filtering rows or columns based on some criteria
* Visualize the data with help from Matplotlib. Plot scatter, bars, lines, histograms, bubble charts, etc.
* Store the cleaned, transformed data back into a CSV, other file or database
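
The workflow in the list above can be sketched in a few lines. The CSV content here is made up for illustration and is read from an in-memory string so the example runs without any file on disk; in practice you would pass a file path to `pd.read_csv` and `to_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content; one grade is deliberately missing.
csv_text = """name,age,grade
James,20,67
Ryan,19,78
Jane,22,
Mary,21,12
"""

# Load the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Statistics: mean of the numeric columns (missing values are skipped by default).
print(df[["age", "grade"]].mean())

# Correlation between two columns.
print(df["age"].corr(df["grade"]))

# Cleaning: drop rows with missing values, then keep rows matching a condition.
clean = df.dropna()
passed = clean[clean["grade"] >= 50]
print(passed)

# Storing: write the cleaned table back out as CSV text.
csv_out = passed.to_csv(index=False)
print(csv_out)
```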

## Installation

Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

**conda install pandas**

OR

**pip install pandas**

Alternatively, you can install pandas in a Jupyter notebook like this:

**!pip install pandas**

The **!** at the beginning runs the command as if it were typed in a terminal.

## Examples

In [19]:
#Imports
import pandas as pd
import numpy as np

data = {
 'name': ['James','Ryan', 'Jane','Mary'],
 'age': [20, 19, 22, 21],
 'favorite_color': ['red', 'orange', 'green', 'purple'],
 'grade': [67, 78, 90, 12]}

# Creating a DataFrame from a dictionary of column lists
df = pd.DataFrame(data)
print(df)

    name  age favorite_color  grade
0  James   20            red     67
1   Ryan   19         orange     78
2   Jane   22          green     90
3   Mary   21         purple     12


In [21]:
#printing the features/columns of a dataframe
df.columns

Index(['name', 'age', 'favorite_color', 'grade'], dtype='object')

In [25]:
# To see the size of the dataset
print("No of Samples:", df.shape[0])
print("No of features:", df.shape[1])

No of Samples: 4
No of features: 4


In [26]:
# To see the datatypes
df.dtypes

name              object
age                int64
favorite_color    object
grade              int64
dtype: object

## Documentation

Here is the link to the official pandas documentation:
http://pandas.pydata.org/pandas-docs/stable/

Some links to get you started on pandas:

* https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
* https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d

# NumPy

## Introduction

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object and tools for working with these arrays. Its most important features include:

* A powerful N-dimensional array object
* Sophisticated (broadcasting) functions
* Useful linear algebra, Fourier transform, and random number capabilities
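
Each of the features listed above can be tried in a line or two. The arrays and the seed below are arbitrary values picked for this sketch.

```python
import numpy as np

# Broadcasting: the 1-D row is stretched across both rows of the 2-D array.
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row = np.array([10.0, 20.0, 30.0])
shifted = matrix + row
print(shifted)

# Linear algebra: solve the system A @ x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)

# Fourier transform: a constant signal has all its energy in the zero-frequency bin.
spectrum = np.fft.fft(np.ones(4))
print(spectrum)

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)
print(samples)
```

Broadcasting works here because the trailing dimension of `matrix` (3) matches the length of `row`, so no explicit loop or copy is needed.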

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined using NumPy, which allows it to integrate seamlessly and speedily with a wide variety of databases.
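
One way to define such a custom data type is a structured dtype, where each element has named fields similar to the columns of a database record. The field names and values below are invented for this sketch.

```python
import numpy as np

# A structured dtype with named fields: a 10-character string, a 32-bit int, a 64-bit float.
student_dtype = np.dtype([("name", "U10"), ("age", "i4"), ("grade", "f8")])

students = np.array(
    [("James", 20, 67.0), ("Ryan", 19, 78.0), ("Jane", 22, 90.0)],
    dtype=student_dtype,
)

# Fields are accessed by name, much like columns in a database table.
print(students["name"])
print(students["grade"].mean())

# Boolean masks work on fields too.
adults = students[students["age"] >= 20]
print(adults["name"])
```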

## Installation
Mac and Linux users can install NumPy with the pip command:

**pip install numpy**

## Documentation
Here are some links that offer good hands-on material for learning NumPy:

* http://www.numpy.org/
* http://cs231n.github.io/python-numpy-tutorial/
* https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html

## Examples

In [36]:
# Python program to demonstrate basic numpy array characteristics 
import numpy as np 
  
# Creating array object 
arr = np.array( [[ 1, 2, 3], 
                 [ 4, 2, 5]] ) 
  
# Printing type of arr object 
print("Array is of type: ", type(arr)) 
  
# Printing array dimensions (axes) 
print("No. of dimensions: ", arr.ndim) 
  
# Printing shape of array 
print("Shape of array: ", arr.shape) 
  
# Printing size (total number of elements) of array 
print("Size of array: ", arr.size) 
  
# Printing type of elements in array 
print("Array stores elements of type: ", arr.dtype) 


Array is of type:  <class 'numpy.ndarray'>
No. of dimensions:  2
Shape of array:  (2, 3)
Size of array:  6
Array stores elements of type:  int64


In [37]:
# Create a sequence of integers  
# from 0 to 30 with steps of 5 
f = np.arange(0, 30, 5) 
print ("\nA sequential array with steps of 5:\n", f) 
  
# Create a sequence of 10 values in range 0 to 5 
g = np.linspace(0, 5, 10) 
print("\nA sequential array with 10 values between 0 and 5:\n", g)

#Reshaping of arrays
grid = np.arange(1, 10).reshape((3, 3))
print(grid)


A sequential array with steps of 5:
 [ 0  5 10 15 20 25]

A sequential array with 10 values between 0 and 5:
 [0.         0.55555556 1.11111111 1.66666667 2.22222222 2.77777778
 3.33333333 3.88888889 4.44444444 5.        ]
[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [38]:
# Python program to demonstrate 
# indexing in numpy 
import numpy as np 
  
# An exemplar array 
arr = np.array([[-1, 2, 0, 4], 
                [4, -0.5, 6, 0], 
                [2.6, 0, 7, 8], 
                [3, -7, 4, 2.0]]) 
  
# Slicing array 
temp = arr[:2, ::2] 
print("Array with first 2 rows and alternate "
      "columns (0 and 2):\n", temp)

# Integer array indexing example
temp = arr[[0, 1, 2, 3], [3, 2, 1, 0]]
print("\nElements at indices (0, 3), (1, 2), (2, 1), "
      "(3, 0):\n", temp)

# boolean array indexing example
cond = arr > 0 # cond is a boolean array
temp = arr[cond]
print("\nElements greater than 0:\n", temp)

Array with first 2 rows and alternate columns (0 and 2):
 [[-1.  0.]
 [ 4.  6.]]

Elements at indices (0, 3), (1, 2), (2, 1), (3, 0):
 [4. 6. 0. 3.]

Elements greater than 0:
 [2.  4.  4.  6.  2.6 7.  8.  3.  4.  2. ]


# Note:

This is just an introduction. Go through the tutorial links mentioned under each section to practice and get familiar with more functionality.