# The Top 5 Machine Learning Libraries in Python

>Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

# Lesson 1: Intro to pandas data structures

> Our first top 5 library is Pandas. **Pandas** is an open source Python library for data analysis. It adds labeled, tabular data structures that make Python a practical tool for analysis work.

In [1]:
# Let's bring in pandas
import pandas as pd

We can import the entire library, but importing only the names we need is sometimes considered more Pythonic.

> In the code below we use the **from** keyword to import only what we need.

Interestingly, most tutorials import all of pandas.



In [2]:
from pandas import DataFrame

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
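To make the idea of labels versus integer positions concrete, here is a minimal sketch. The variable names and the day labels are made up for illustration; the city and state values are borrowed from the UFO data used later in this lesson.

```python
import pandas as pd

# A Series whose rows carry string labels instead of the default 0, 1, 2, ...
temps = pd.Series([61, 72, 77], index=["mon", "tue", "wed"])

by_label = temps.loc["tue"]     # look up by label
by_position = temps.iloc[1]     # look up by integer position

# A DataFrame labels both its rows (index) and its columns
df = pd.DataFrame(
    {"city": ["Ithaca", "Holyoke"], "state": ["NY", "CO"]},
    index=["first", "second"],
)
cell = df.loc["second", "state"]

print(by_label, by_position, cell)
```

Both lookups on the Series reach the same element; .loc goes through the label and .iloc through the position.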

There are **three** core Pandas data structures. They are: 

> Series - A pandas Series is a one-dimensional array of indexed data.

> DataFrame - The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

> Index -  This Index object is an interesting structure in itself, and it can be thought of as an ordered set. 
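The three structures listed above can be sketched side by side. This is only an illustration: the small DataFrame and the two Index objects are hypothetical, chosen to show the dictionary-like construction and the set-like behaviour.

```python
import pandas as pd

# Series: one-dimensional values plus an index
s = pd.Series([3, 5, 5, 9, 6])

# DataFrame: built here from a dictionary that maps column names to values
df = pd.DataFrame({"city": ["Ithaca", "Abilene"], "state": ["NY", "KS"]})

# Index: the labels themselves; set-like operations are available
left = pd.Index([1, 3, 5, 7])
right = pd.Index([3, 4, 5])
common = left.intersection(right)

print(type(s.index).__name__)   # the default index of a Series
print(list(df.columns))         # column labels are an Index too
print(list(common))
```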

In [3]:
# Let's create a series
# A series is a one dimensional array like object. 

s = pd.Series([3,5,5,9,6])
s

0    3
1    5
2    5
3    9
4    6
dtype: int64

In [4]:
s.head()

0    3
1    5
2    5
3    9
4    6
dtype: int64

Methods end with parentheses, while attributes don't:

> So head() in the example above is a method.
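A short sketch of the difference, reusing the same values as the Series above. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([3, 5, 5, 9, 6])

first_two = s.head(2)   # method: called with parentheses, can take arguments
n_items = s.size        # attribute: no parentheses, just a stored property
kind = s.dtype          # attribute

print(first_two.tolist(), n_items, kind)
```

Leaving the parentheses off a method, as in s.head, gives back the method object itself rather than calling it.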

Let's read some tabular data into our workspace using pandas and then manipulate it.

> When you hear the words tabular data, think Excel spreadsheet.

In [5]:
ufo = pd.read_table('http://bit.ly/uforeports', sep=',')

In [6]:
ufo.head(10)

|  | City | Colors Reported | Shape Reported | State | Time |
|---|---|---|---|---|---|
| 0 | Ithaca |  | TRIANGLE | NY | 6/1/1930 22:00 |
| 1 | Willingboro |  | OTHER | NJ | 6/30/1930 20:00 |
| 2 | Holyoke |  | OVAL | CO | 2/15/1931 14:00 |
| 3 | Abilene |  | DISK | KS | 6/1/1931 13:00 |
| 4 | New York Worlds Fair |  | LIGHT | NY | 4/18/1933 19:00 |
| 5 | Valley City |  | DISK | ND | 9/15/1934 15:30 |
| 6 | Crater Lake |  | CIRCLE | CA | 6/15/1935 0:00 |
| 7 | Alma |  | DISK | MI | 7/15/1936 0:00 |
| 8 | Eklutna |  | CIGAR | AK | 10/15/1936 17:00 |
| 9 | Hubbard |  | CYLINDER | OR | 6/15/1937 0:00 |


In [7]:
ufo['State']

0        NY
1        NJ
2        CO
3        KS
4        NY
5        ND
6        CA
7        MI
8        AK
9        OR
10       CA
11       AL
12       SC
13       IA
14       MI
15       CA
16       CA
17       GA
18       TN
19       AK
20       NE
21       LA
22       LA
23       KY
24       WV
25       CA
26       WV
27       NM
28       NM
29       UT
         ..
18211    MA
18212    CA
18213    CA
18214    TX
18215    TX
18216    CA
18217    CO
18218    TX
18219    CA
18220    CA
18221    NH
18222    PA
18223    SC
18224    OK
18225    CA
18226    CA
18227    CA
18228    TX
18229    IL
18230    CA
18231    CA
18232    WI
18233    AK
18234    CA
18235    AZ
18236    IL
18237    IA
18238    WI
18239    WI
18240    FL
Name: State, Length: 18241, dtype: object

In [8]:
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()

|  | City | Colors Reported | Shape Reported | State | Time | Location |
|---|---|---|---|---|---|---|
| 0 | Ithaca |  | TRIANGLE | NY | 6/1/1930 22:00 | Ithaca, NY |
| 1 | Willingboro |  | OTHER | NJ | 6/30/1930 20:00 | Willingboro, NJ |
| 2 | Holyoke |  | OVAL | CO | 2/15/1931 14:00 | Holyoke, CO |
| 3 | Abilene |  | DISK | KS | 6/1/1931 13:00 | Abilene, KS |
| 4 | New York Worlds Fair |  | LIGHT | NY | 4/18/1933 19:00 | New York Worlds Fair, NY |


In [9]:
ufo.describe()

|  | City | Colors Reported | Shape Reported | State | Time | Location |
|---|---|---|---|---|---|---|
| count | 18216 | 2882 | 15597 | 18241 | 18241 | 18216 |
| unique | 6476 | 27 | 27 | 52 | 16145 | 8029 |
| top | Seattle | RED | LIGHT | CA | 11/16/1999 19:00 | Seattle, WA |
| freq | 187 | 780 | 2803 | 2529 | 27 | 187 |


In [10]:
ufo.shape

(18241, 6)

In [11]:
ufo.dtypes

City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
Location           object
dtype: object

In [12]:
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time',
       'Location'],
      dtype='object')

In [None]:
ufo = pd.read_table('http://bit.ly/uforeports', sep=',')
ufo.head() 

In [None]:
ufo.drop('Colors Reported', axis=1, inplace=True)
ufo.head()

In [None]:
ufo.drop(['State','Time'], axis=1, inplace=True)
ufo.head()

In [None]:
ufo = pd.read_table('http://bit.ly/uforeports', sep=',')
ufo.head()

In [13]:
ufo.State.sort_values(ascending=False).head(25)

12079    WY
11490    WY
11333    WY
4866     WY
3326     WY
3328     WY
16594    WY
1177     WY
378      WY
5065     WY
7684     WY
6116     WY
10729    WY
12072    WY
16637    WY
14618    WY
7491     WY
12063    WY
14240    WY
14586    WY
15667    WY
7485     WY
14747    WY
1461     WY
1442     WY
Name: State, dtype: object

In [14]:
ufo.sort_values('City').head()

|  | City | Colors Reported | Shape Reported | State | Time | Location |
|---|---|---|---|---|---|---|
| 1761 | Abbeville |  | DISK | SC | 12/10/1968 0:30 | Abbeville, SC |
| 4553 | Aberdeen |  | CYLINDER | WA | 6/15/1981 22:00 | Aberdeen, WA |
| 16167 | Aberdeen |  | VARIOUS | OH | 3/29/2000 3:00 | Aberdeen, OH |
| 14703 | Aberdeen |  | TRIANGLE | WA | 9/30/1999 21:00 | Aberdeen, WA |
| 389 | Aberdeen | ORANGE | CIRCLE | SD | 11/15/1956 18:30 | Aberdeen, SD |


In [15]:
ufo.sort_values(['City','State']).head(25)

|  | City | Colors Reported | Shape Reported | State | Time | Location |
|---|---|---|---|---|---|---|
| 1761 | Abbeville |  | DISK | SC | 12/10/1968 0:30 | Abbeville, SC |
| 2297 | Aberdeen |  | TRIANGLE | MD | 8/18/1972 1:30 | Aberdeen, MD |
| 9404 | Aberdeen |  | DISK | MD | 6/15/1996 13:30 | Aberdeen, MD |
| 16167 | Aberdeen |  | VARIOUS | OH | 3/29/2000 3:00 | Aberdeen, OH |
| 389 | Aberdeen | ORANGE | CIRCLE | SD | 11/15/1956 18:30 | Aberdeen, SD |
| 4553 | Aberdeen |  | CYLINDER | WA | 6/15/1981 22:00 | Aberdeen, WA |
| 12294 | Aberdeen |  | FIREBALL | WA | 10/4/1998 4:42 | Aberdeen, WA |
| 14703 | Aberdeen |  | TRIANGLE | WA | 9/30/1999 21:00 | Aberdeen, WA |
| 17809 | Aberdeen | GREEN | FIREBALL | WA | 10/29/2000 17:25 | Aberdeen, WA |
| 3 | Abilene |  | DISK | KS | 6/1/1931 13:00 | Abilene, KS |


# Lesson 2: Intro to NumPy

> Our second top 5 library is NumPy. **NumPy**, short for Numerical Python, is an open source package for scientific computing in Python.

It provides support for multi-dimensional arrays and matrices, along with efficient mathematical functions for operating on them.

> An array is an ordered collection of elements that all share the same data type.
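One consequence of the shared data type can be seen in a small sketch: when the input values have mixed types, NumPy converts them all to one common type. The variable names are only for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # one float is enough to upcast every element

print(ints.dtype.kind)    # 'i' means an integer type
print(mixed.dtype)        # float64
print(mixed)
```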

Comments

The Python comment character is '#': anything after '#' on the line is ignored by the Python interpreter.

Multi-line strings can be used within code blocks to provide multi-line comments.

Multi-line strings are delimited by pairs of triple quotes (''' or """). Any newlines in the string will be represented as '\n' characters in the string.
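The points above can be tried out in a few lines. This is a minimal sketch; the variable names are arbitrary.

```python
# A comment: the interpreter ignores everything after the '#'
total = 1 + 2  # comments can also follow code on the same line

"""
A triple-quoted string that is not assigned to anything
is evaluated and discarded, so it can serve as a multi-line comment.
"""

text = """line one
line two"""

print(repr(text))   # the line break is stored as '\n'
```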

> Keep in mind I've left out most of the comments in the code for the course. You should be able to comment the code yourself after you run each cell. If you can't find the answer, use Google.

In [16]:
import numpy as np
x = np.array([1,2,3,4,5])
x

array([1, 2, 3, 4, 5])

In [17]:
type(x)

numpy.ndarray

In [18]:
x.ndim

1

In [19]:
x.shape

(5,)

In [20]:
len(x)

5

In [21]:
x.size

5

In [22]:
x.dtype

dtype('int64')

In [23]:
x

array([1, 2, 3, 4, 5])

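> Accessing an array is pretty straightforward. We access a specific location by putting its index inside square brackets, counting from 0. Our array is one-dimensional, so a single index is enough; for a two-dimensional table we would give a row and a column. The trailing comma in x[0,] is optional, so x[0] returns the same element.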

In [24]:
x[0,]

1

In [25]:
x[4,]

5

In the example below we slice the array: x[:3] returns the first three elements (index positions 0, 1 and 2).

In [26]:
x[:3]

array([1, 2, 3])
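Row and column access only comes into play once an array has two dimensions. The following sketch uses a hypothetical 3 x 4 array (not part of the course data) to show element access and slicing together.

```python
import numpy as np

# A hypothetical 2-D array: 3 rows, 4 columns
table = np.array([[ 1,  2,  3,  4],
                  [ 5,  6,  7,  8],
                  [ 9, 10, 11, 12]])

element = table[1, 2]        # row 1, column 2 (both counted from 0)
first_row = table[0]         # a whole row
second_col = table[:, 1]     # every row, column 1
top_left = table[:2, :2]     # first two rows and first two columns

print(element)
print(first_row)
print(second_col)
print(top_left)
```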

> Let's save and then load the saved array. 

**Note**: np.save writes to the current working directory unless the file name includes a path.


In [28]:
np.save('x', x)

In [29]:
np.load("x.npy")

array([1, 2, 3, 4, 5])
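If you would rather not leave files behind in the working directory, the same round trip can be sketched with a temporary folder. The use of tempfile here is an assumption made for illustration, not something the course requires.

```python
import os
import tempfile

import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Write into a temporary directory instead of the current working directory
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "x")   # np.save appends '.npy' when it is missing
    np.save(path, x)
    saved_files = os.listdir(folder)
    restored = np.load(path + ".npy")

print(saved_files)
print(np.array_equal(x, restored))
```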

# Lesson 3: Intro to SciKit-Learn

What is scikit-learn?

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.

> The library is focused on **modeling data.** It is not focused on loading, manipulating and summarizing data. For these features, refer to NumPy and Pandas.

Machine Learning Steps

>Define Problem

>Prepare Data

>Evaluate Algorithms

>Improve Results

>Present Results

In [48]:
from sklearn import datasets
from sklearn import metrics
from sklearn.svm import SVC
import sklearn.model_selection as cv # Cross-Validation

What exactly is a Support Vector Machine (SVM)? An SVM is a supervised learning model. Supervised means you need a dataset which has been labeled. In its basic form an SVM is a linear model. What does that mean?

> If your data is very simple and only has two dimensions, then the SVM will learn a line which will be able to separate the data.
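Here is a minimal sketch of that idea on six made-up 2-D points. Note that it sets kernel='linear' explicitly so the boundary is a straight line; the points, labels and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Six hypothetical 2-D points: class 0 near the origin, class 1 further out
points = np.array([[0, 0], [1, 0], [0, 1],
                   [3, 3], [4, 3], [3, 4]])
labels = np.array([0, 0, 0, 1, 1, 1])

# kernel='linear' asks for a straight-line boundary
model = SVC(kernel="linear")
model.fit(points, labels)

# The learned line is w[0]*x + w[1]*y + b = 0
w = model.coef_[0]
b = model.intercept_[0]
print("w =", w, "b =", b)

# One new point on each side of the line
new_labels = model.predict([[0.5, 0.5], [3.5, 3.5]])
print(new_labels)
```

The coef_ attribute is only available with a linear kernel, which is why the sketch does not use the default settings.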

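**Note:** In SciKit-Learn, SVC (Support Vector Classifier) is the SVM model used for classification. Models in SciKit-Learn are called estimators, and the ones that predict a class label, such as SVC, are classifiers. By default SVC uses the 'rbf' kernel rather than a linear one, so its decision boundary does not have to be a straight line.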




Notice we are loading the iris data set with a function. Does that mean it's included in SciKit-Learn?

> Exactly. There are a handful of data sets baked into SciKit-Learn that we can use to learn about the model building process.
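As a quick sketch of what a bundled data set looks like once loaded, the object returned by load_iris carries the measurements, the labels and their names. The variable name iris is just a choice made here.

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)          # (samples, features)
print(iris.feature_names)
print(list(iris.target_names))
print(iris.target[:5])          # class labels are stored as integers
```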

In [38]:
ds = datasets.load_iris()

In [41]:
print(ds.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [37]:
ds.data[:5]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

In [49]:
# Build the Support Vector Classifier
model = SVC()

# Hold-out split: separate data into training and testing sets (30% held out for testing)
(data_train, data_test, target_train, target_test) = cv.train_test_split(ds.data, ds.target, test_size=.3)

# Fit the model
model.fit(data_train, target_train)
print(model)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [52]:
expected = target_test
expected

array([2, 2, 2, 0, 0, 1, 1, 0, 2, 1, 2, 2, 1, 2, 0, 1, 0, 1, 2, 0, 0, 2, 1,
       2, 1, 1, 0, 1, 2, 0, 2, 0, 1, 2, 1, 1, 0, 2, 0, 1, 2, 0, 1, 1, 2])

In [53]:
predicted = model.predict(data_test)
predicted

array([2, 2, 2, 0, 0, 1, 1, 0, 2, 1, 2, 2, 1, 2, 0, 1, 0, 1, 2, 0, 0, 2, 1,
       2, 1, 2, 0, 1, 2, 0, 2, 0, 1, 2, 1, 1, 0, 2, 0, 1, 2, 0, 1, 1, 2])

In [59]:
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print('Accuracy %s' % metrics.accuracy_score(expected, predicted))
print('\nConfusion Matrix')
print(metrics.confusion_matrix(expected, predicted))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        13
          1       1.00      0.94      0.97        16
          2       0.94      1.00      0.97        16

avg / total       0.98      0.98      0.98        45

Accuracy 0.977777777778

Confusion Matrix
[[13  0  0]
 [ 0 15  1]
 [ 0  0 16]]
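The numbers in the report can be recomputed by hand from the confusion matrix printed above, which helps show what each metric means. This is a sketch using plain NumPy rather than the sklearn.metrics functions.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are the expected classes, columns the predicted classes
cm = np.array([[13,  0,  0],
               [ 0, 15,  1],
               [ 0,  0, 16]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / expected count
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted count

print(round(accuracy, 4))
print(recall.round(2))
print(precision.round(2))
```

The single off-diagonal entry is one class 1 flower predicted as class 2, which is why recall drops for class 1 and precision drops for class 2.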


# Lesson 4: Intro to MatPlotLib

> Matplotlib is a tool for data visualization. It is built on NumPy and is part of the wider SciPy ecosystem.

It was developed by John Hunter in 2002. Matplotlib is a library for making 2D plots of arrays in Python. It can create simple plots with just a few commands and has limited support for 3D graphics. It produces quality figures in interactive environments across platforms, and it can also be used for animations.

> The three basic plot types you will find are used the most often are **line plots**, **scatter plots** and **histograms**.

Some code for making these three types of plots is included in this section.


As you progress with Matplotlib, it might be useful to understand how it works fundamentally. This process is true with a lot of computer graphics processes. First, you have some data, then you "draw" that data to a canvas of some sort, but it is only in the computer's memory. Once you've drawn that data, you can then "show" that data. This is so the computer can first draw everything, and then perform the more laborious task of showing it on the screen.
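The draw-then-show split can be made visible with a small sketch. It assumes the off-screen Agg backend and sends the finished figure to an in-memory PNG instead of the screen, so that the two stages are clearly separate; in a notebook you would normally just call plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")            # off-screen backend: draw without opening a window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# 1. "Draw": the figure and its line exist only in memory at this point
fig, ax = plt.subplots()
ax.plot(x, x, label="linear")
ax.legend()

# 2. "Show": render the finished canvas; here it goes to an in-memory PNG
#    instead of the screen, which is where plt.show() would send it
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)

print(len(ax.lines), png_bytes[:4])
```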


In [None]:
# Import the necessary packages and modules
import matplotlib.pyplot as plt
import numpy as np

# Prepare the data
x = np.linspace(0, 10, 100)

# Plot the data
plt.plot(x, x, label='linear')

# Add a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
import matplotlib.pyplot as plt

X = [590,540,740,130,810,300,320,230,470,620,770,250]
Y = [32,36,39,52,61,72,77,75,68,57,48,48]

plt.scatter(X,Y)
plt.show()

In [None]:
import matplotlib.pyplot as plt

X = [590,540,740,130,810,300,320,230,470,620,770,250]
Y = [32,36,39,52,61,72,77,75,68,57,48,48]

plt.scatter(X,Y)
plt.title('Relationship Between Temperature and Mountain Dew Sales')
plt.show()

In [None]:
import matplotlib.pyplot as plt

X = [590,540,740,130,810,300,320,230,470,620,770,250]
Y = [32,36,39,52,61,72,77,75,68,57,48,48]

# Draw the points once, with larger green X-shaped markers
plt.scatter(X, Y, s=80, c='green', marker='X')
plt.title('Relationship Between Temperature and Mountain Dew Sales')
plt.xlabel('Cans of Mountain Dew Sold')
plt.ylabel('Temperature in Fahrenheit')
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# make an array of random numbers with a gaussian distribution with
# mean = 5.0
# standard deviation = 3.0
# number of points = 1000
data = np.random.normal(5.0, 3.0, 1000)
# make a histogram of the data array
plt.hist(data)
# make plot labels
plt.xlabel('data')
plt.show()

# Lesson 5: Intro to NLTK

> The idea of Natural Language Processing is to do some form of analysis, or processing, where the machine can understand, at least to some level, what the text means, says, or implies.

In [None]:
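# sent_tokenize needs the Punkt tokenizer data; if it is missing, run
# import nltk; nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab')
from nltk.tokenize import sent_tokenize
sents = "Thanks for taking my courses. NLTK rocks!"
sent_tokenize(sents)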

In [None]:
# Let's tokenize some words. 
from nltk.tokenize import word_tokenize
word_tokenize("I like Mikes courses.")

In [None]:
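# The stop word lists are a separate download; if they are missing, run
# import nltk; nltk.download('stopwords') once
from nltk.corpus import stopwords
stopwords.words("english")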