# Intro to Scikit Learn

## 1. What is Scikit Learn
- It is a free software machine learning library for the Python programming language.
- Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.

## 2. Installation
- The library is built on the SciPy (Scientific Python) stack, which must be installed before you can use scikit-learn.
- This stack includes:
>- NumPy
>- SciPy
>- Matplotlib
>- IPython
>- Sympy
>- Pandas

- `pip install scikit-learn`

## 3. Important Concepts/Functions

* A. Train Test Split
* B. Pipelining
* C. Function Transformer
* D. Column Transformer

### A. Train Test Split
- Split arrays or matrices into random train and test subsets
- **test_size** : float, int or None, optional (default=0.25). If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
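- **random_state** : int, RandomState instance or None. Seeds the shuffling applied before the split; pass an int to get the same split on every run. See https://stackoverflow.com/questions/28064634/random-state-pseudo-random-number-in-scikit-learn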
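
The effect of both parameters can be checked on a toy array. This is a minimal sketch: the array `X`, the labels `y`, and the variable names are made up for illustration and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size as a float: 20% of the 10 samples go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# test_size as an int: exactly 3 samples go to the test set.
_, X_te_int, _, _ = train_test_split(X, y, test_size=3, random_state=0)
print(X_te_int.shape)

# The same random_state reproduces the same split on every call.
_, X_te_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_te, X_te_again))
```

Leaving `random_state=None` would give a different split on each run, which is usually not what you want when comparing models.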

In [59]:
import pandas as pd

## Define path data
PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_week', 'native_country', 'label']

df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False)
df_train.head()

| | age | workclass | fnlwgt | education | education_num | marital | occupation | relationship | race | sex | capital_gain | capital_loss | hours_week | native_country | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [61]:
from sklearn.model_selection import train_test_split

# Feature columns: every column except the target 'label'.
features = [c for c in COLUMNS if c != 'label']

# Split the dataset 80/20: 80 percent for training, 20 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(df_train[features],
                                                    df_train.label,
                                                    test_size = 0.2,
                                                    random_state=0)
X_train.head(5)
print(X_train.shape, X_test.shape)

(26048, 14) (6513, 14)


### B. Pipelining
- Python scikit-learn provides a Pipeline utility to help automate machine learning workflows.
- Pipelines work by allowing for a linear sequence of data transforms to be chained together culminating in a modeling process that can be evaluated.
- Each stage of a pipeline is fed data processed from its preceding stage; that is, the output of a processing unit is supplied as the input to the next step. 
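
What "chaining" means can be made concrete by comparing a pipeline with the same steps applied by hand. This is a sketch under assumptions: the step names `scale` and `clf` and the `max_iter=1000` setting are illustrative choices, not taken from the cells in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline: fit() fits the scaler, transforms the data, then fits the classifier.
pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X_tr, y_tr)

# The same two stages applied manually.
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_tr), y_tr)

# Both routes should produce identical predictions on the test set.
same = np.array_equal(pipe.predict(X_te), clf.predict(scaler.transform(X_te)))
print(same)
```

The pipeline version also guarantees that the scaler is fitted on the training data only, which is easy to get wrong when the steps are written out by hand.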

In [65]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [133]:
# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size= 0.2,random_state=42)
X_train.shape

(120, 4)

In [109]:
# print(X_train)

In [114]:
pipe_lr = Pipeline([('stdscr', StandardScaler()),
 ('clf', LogisticRegression(solver='newton-cg', multi_class='ovr'))])

In [115]:
pipe_lr.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('stdscr', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False))])

In [116]:
score = pipe_lr.score(X_test, y_test)
print('Logistic Regression pipeline test accuracy: %.3f' % score)

Logistic Regression pipeline test accuracy: 0.967


In [117]:
# pipe_lr.named_steps['stdscr'].transform(X_train)

### C. Function Transformer
- A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function.
- Used in Data Pre-processing.
- Applies a user-defined function to the dataset.

**validate** : bool, optional (default=True in the release used here; the default changed to False in scikit-learn 0.22)

Indicates whether the input X array should be checked before calling func:

- If False, there is no input validation.
- If True, X is converted to a 2-dimensional NumPy array or sparse matrix. If the conversion is not possible, an exception is raised.
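
A small sketch of both behaviours, assuming a log transform as the user-defined function (the choice of `np.log1p`/`np.expm1` and the toy array are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain NumPy function so it can be used as a pipeline step.
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1, validate=True)

X = np.array([[0.0, 1.0], [3.0, 7.0]])
X_log = log_tf.fit_transform(X)           # applies np.log1p element-wise
X_back = log_tf.inverse_transform(X_log)  # applies np.expm1 element-wise
print(np.allclose(X_back, X))

# With validate=True, a 1-D input cannot be treated as a 2-D array,
# so the transformer raises instead of calling the function.
try:
    log_tf.transform(np.array([1.0, 2.0]))
    raised = False
except ValueError:
    raised = True
print(raised)
```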

In [202]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
import pandas as pd
housing_data = load_boston()
df = pd.DataFrame(housing_data.data)
df.columns = housing_data.feature_names
df['PRICE'] = housing_data.target
df.head()

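| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |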


The goal is to predict the PRICE variable given the other features. How is this variable distributed?

In [203]:
# Keep PRICE out of the feature matrix so the target never leaks into X.
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='PRICE'), df.PRICE, test_size=0.2, random_state=42)

In [204]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression

def just_RM_column(X):
    # RM (average number of rooms) is the sixth column, i.e. index 5.
    RM_col_index = 5
    return X[:, [RM_col_index]]

pipe = make_pipeline(
    FunctionTransformer(just_RM_column, validate=True),
    LinearRegression()
)


In [205]:
pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('functiontransformer', FunctionTransformer(accept_sparse=False, check_inverse=True,
          func=<function just_RM_column at 0x1a22f24c80>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])

In [206]:
# For regressors, score() returns the R^2 coefficient of determination, not accuracy.
score = pipe.score(X_test, y_test)
print('Linear Regression pipeline test R^2: %.3f' % score)

Linear Regression pipeline test R^2: 0.371


In [207]:
import numpy as np

def add_squared_col(X):
    # Append the element-wise square of every column as extra features.
    return np.hstack((X, X**2))

pipe = Pipeline([
    ('colone', FunctionTransformer(just_RM_column, validate=True)),
    ('coltwo', FunctionTransformer(add_squared_col, validate=True)),
    ('clf', LinearRegression())
])

In [208]:
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print('Linear Regression pipeline test R^2: %.3f' % score)

Linear Regression pipeline test R^2: 0.518


In [209]:
# df[['RM']]

In [210]:
# pipe.named_steps['colone'].transform(df)

In [219]:
x1_train = pipe.named_steps['colone'].transform(X_train)
x2_train = pipe.named_steps['coltwo'].transform(x1_train)
model = LinearRegression()
model.fit(x2_train,y_train)
x1_test = pipe.named_steps['colone'].transform(X_test)
x2_test = pipe.named_steps['coltwo'].transform(x1_test)
model.score(x2_test, y_test)

0.5176878620868068

In [211]:
from sklearn.tree import DecisionTreeRegressor
pipe = Pipeline([
    ('colone', FunctionTransformer(just_RM_column, validate=True)),
    ('coltwo', FunctionTransformer(add_squared_col, validate=True)),
    ('clf', DecisionTreeRegressor(max_depth=3))
])

In [212]:
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print('Decision Tree pipeline test R^2: %.3f' % score)

Decision Tree pipeline test R^2: 0.512


### D. Column Transformer
- Datasets often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

>- Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
>- Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

- The ColumnTransformer (added in scikit-learn 0.20) lets you choose which columns get which transformations.
- Categorical columns almost always need different transformations from continuous columns.

- The ColumnTransformer takes a list of three-item tuples:
>- the first value is a name that labels the transformer,
>- the second is an instantiated estimator,
>- the third is a list of the columns the transformation applies to.
- Each tuple looks like this:
>- `('name', SomeTransformer(parameters), columns)`
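
The tuple format can be tried on a tiny DataFrame first. This is a sketch with made-up data: the `toy` frame and the labels `num` and `cat` are hypothetical and only borrow two column names from the adult dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one continuous and one categorical column.
toy = pd.DataFrame({'age': [25, 35, 45, 55],
                    'sex': ['Male', 'Female', 'Female', 'Male']})

ct = ColumnTransformer([
    ('num', StandardScaler(), ['age']),   # scale the continuous column
    ('cat', OneHotEncoder(), ['sex']),    # one-hot encode the categorical column
])

out = ct.fit_transform(toy)
# One scaled column plus one column per category of 'sex'.
print(out.shape)
```

The outputs of the individual transformers are concatenated side by side, in the order the tuples are listed.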

### One Hot Encoder
- Encoding Categorical Data into Features

In [221]:
# Import dataset
import pandas as pd

## Define path data
COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_week', 'native_country', 'label']

### Define continuous list
CONTI_FEATURES  = ['age', 'fnlwgt','capital_gain', 'education_num', 'capital_loss', 'hours_week']

### Define categorical list
CATE_FEATURES = ['workclass', 'education', 'marital', 'occupation', 'relationship', 'race', 'sex', 'native_country']

## Prepare the data
features = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_week', 'native_country']

PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False)
df_train[CONTI_FEATURES] = df_train[CONTI_FEATURES].astype('float64')
df_train.describe()

| | age | fnlwgt | education_num | capital_gain | capital_loss | hours_week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [222]:
df_train['marital'].unique()

array(['Never-married', 'Married-civ-spouse', 'Divorced',
       'Married-spouse-absent', 'Separated', 'Married-AF-spouse',
       'Widowed'], dtype=object)

In [225]:
hs_train = df_train[['marital']].copy()
hs_train.ndim

2

In [228]:
from sklearn.preprocessing import OneHotEncoder
# Note: from scikit-learn 1.2 this argument is named sparse_output; sparse was removed in 1.4.
ohe = OneHotEncoder(sparse=False)
hs_train_transformed = ohe.fit_transform(hs_train)
hs_train_transformed

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])
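
To see which output column belongs to which category, and what the `handle_unknown='ignore'` option used in the preprocessing pipeline later in this section does, here is a minimal sketch on three made-up rows (the arrays `train` and `new` are illustrative, not taken from `df_train`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on three marital-status values only (hypothetical subset).
train = np.array([['Divorced'], ['Widowed'], ['Separated']], dtype=object)
enc = OneHotEncoder(handle_unknown='ignore').fit(train)

# categories_ lists the learned categories in sorted order;
# output column j corresponds to categories_[0][j].
print(enc.categories_[0])

# 'Never-married' was not seen during fit: with handle_unknown='ignore'
# it is encoded as a row of zeros instead of raising an error.
new = np.array([['Widowed'], ['Never-married']], dtype=object)
encoded = enc.transform(new).toarray()
print(encoded)
```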

In [231]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train[features],
                                                    df_train.label,
                                                    test_size = 0.2,
                                                    random_state=0)
X_train.head(5)
print(X_train.shape, X_test.shape)

(26048, 14) (6513, 14)


In [229]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

### Define continuous list
CONTI_FEATURES  = ['age', 'fnlwgt','capital_gain', 'education_num', 'capital_loss', 'hours_week']

### Define categorical list
CATE_FEATURES = ['workclass', 'education', 'marital', 'occupation', 'relationship', 'race', 'sex', 'native_country']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, CONTI_FEATURES),
        ('cat', categorical_transformer, CATE_FEATURES)])

In [233]:
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='newton-cg'))])

In [234]:
clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('num', Pipeline(memory=None,
     steps=[('imputer', SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbo...ty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False))])

In [235]:
clf.score(X_test, y_test)

0.8472286196837095

In [237]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

In [238]:
from sklearn.datasets import fetch_20newsgroups

## Explanation of the Dataset

In [239]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# shuffle=True is useful if you wish to select only a subset of samples to quickly train a model
# and get a first idea of the results before re-training on the complete dataset later.
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

In [240]:
'''The returned dataset is a scikit-learn “bunch”: a simple holder object with fields that can be both accessed as 
python dict keys or object attributes for convenience, for instance the target_names holds the list of the requested category
names:'''

twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [247]:
'The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:'
print (len(twenty_train.data))
#2257 is the number of files that we will be using for training
print (len(twenty_train.filenames))
#2257

2257
2257


In [248]:
'Let’s print the first lines of the first loaded file:'

print("\n".join(twenty_train.data[0].split("\n")[:3]))

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton


In [252]:
"""Supervised learning algorithms will require a category label for each document in the training set. In this case the 
category is the name of the newsgroup which also happens to be the name of the folder holding the individual documents.

For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to 
the index of the category name in the target_names list. The category integer id of each sample is stored in the target 
attribute:"""

print (twenty_train.target[:10])
#array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

print(twenty_train.target_names[twenty_train.target[0]])
#comp.graphics

[1 1 3 3 3 3 3 2 2 2]
comp.graphics


In [253]:
'It is possible to get back the category names as follows:'

for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
    
# twenty_train.target holds the integer category id of every document;
# target_names maps each id back to the newsgroup name.

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


# Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content **into numerical feature vectors.**

### Bags of words
The most intuitive way to do so is to use a **bags of words representation:**

1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.

The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.

If n_samples == 10000, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM which is barely manageable on today’s computers.

**Fortunately, most values in X will be zeros since for a given document less than a few thousand distinct words will be used. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.**

**scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.**
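
The memory argument can be checked on a scaled-down example. This is a rough sketch with synthetic data: the matrix size and the figure of 50 distinct words per document are arbitrary assumptions chosen to keep it fast.

```python
import numpy as np
from scipy import sparse

# Synthetic corpus: 1,000 documents, 10,000-word vocabulary,
# at most 50 non-zero entries per document.
n_samples, n_features, words_per_doc = 1000, 10000, 50
rng = np.random.default_rng(0)

rows = np.repeat(np.arange(n_samples), words_per_doc)
cols = rng.integers(0, n_features, size=rows.size)
vals = np.ones(rows.size, dtype=np.float32)

# CSR stores only the non-zero values plus their column indices and row offsets.
# Duplicate (row, col) pairs are summed during construction.
X_sparse = sparse.csr_matrix((vals, (rows, cols)), shape=(n_samples, n_features))

dense_bytes = n_samples * n_features * 4  # float32 dense array
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
```

The dense figure grows with `n_samples * n_features`, while the sparse figure grows only with the number of non-zero entries.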

### Tokenizing text with scikit-learn

Text preprocessing, tokenizing and filtering of stopwords are all included in CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors:

In [254]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
# print(X_train_counts[0])

# X_train_counts is a sparse document-term matrix with one row per document.
# Printing a row lists its non-zero entries as: (row index, word id)  count of that word

In [258]:
count_vect.vocabulary_

{'from': 14887,
 'sd345': 29022,
 'city': 8696,
 'ac': 4017,
 'uk': 33256,
 'michael': 21661,
 'collier': 9031,
 'subject': 31077,
 'converting': 9805,
 'images': 17366,
 'to': 32493,
 'hp': 16916,
 'laserjet': 19780,
 'iii': 17302,
 'nntp': 23122,
 'posting': 25663,
 'host': 16881,
 'hampton': 16082,
 'organization': 23915,
 'the': 32142,
 'university': 33597,
 'lines': 20253,
 '14': 587,
 'does': 12051,
 'anyone': 5201,
 'know': 19458,
 'of': 23610,
 'good': 15576,
 'way': 34755,
 'standard': 30623,
 'pc': 24651,
 'application': 5285,
 'pd': 24677,
 'utility': 33915,
 'convert': 9801,
 'tif': 32391,
 'img': 17389,
 'tga': 32116,
 'files': 14281,
 'into': 18268,
 'format': 14676,
 'we': 34775,
 'would': 35312,
 'also': 4808,
 'like': 20198,
 'do': 12014,
 'same': 28619,
 'hpgl': 16927,
 'plotter': 25361,
 'please': 25337,
 'email': 12833,
 'any': 5195,
 'response': 27836,
 'is': 18474,
 'this': 32270,
 'correct': 9932,
 'group': 15837,
 'thanks': 32135,
 'in': 17556,
 'advance': 4378,

In [259]:
# CountVectorizer supports counts of N-grams of words or consecutive characters.
# Once fitted, the vectorizer has built a dictionary mapping each term to its feature (column) index.
# Indices follow the alphabetical order of the terms; they do not encode frequency.

print(count_vect.vocabulary_.get('algorithm'))

4690


## From occurrences to frequencies (tf-idf)

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf for Term Frequencies.**

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for **“Term Frequency times Inverse Document Frequency”.**

**Both tf and tf–idf can be computed as follows using TfidfTransformer:**

In [260]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts) #sparse matrix of type '<class 'numpy.float64'>
print (X_train_tf.shape)

(2257, 35788)


In the example above, we first use the **fit(..)** method to fit our estimator to the data and then the **transform(..)** method to transform our count matrix to a tf representation (idf is switched off with use_idf=False). These two steps can be combined to achieve the same end result faster by skipping redundant processing. This is done with the **fit_transform(..)** method, shown here with the default tf-idf weighting:

In [262]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print (X_train_tfidf.shape)

(2257, 35788)
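
To make the weighting concrete, the following sketch recomputes the default `TfidfTransformer` output by hand on a toy count matrix. It assumes the default settings (`smooth_idf=True`, `norm='l2'`), under which the idf is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit Euclidean length rather than divided by the total word count. The `counts` matrix is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term counts: 6 documents (rows) x 3 terms (columns).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf

# Weight the raw counts by idf, then normalise every row to unit L2 norm.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

expected = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, expected))
```

The first term appears in every document, so its idf is exactly 1 and it carries the least weight relative to its count.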


## Training a classifier
Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a **naïve Bayes classifier**, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; **the one most suitable for word counts is the multinomial variant:**

In [268]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call transform instead of fit_transform on the transformers, since they have already been fit to the training set:

In [269]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


## Building a pipeline
In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [266]:
from sklearn.pipeline import Pipeline
import sklearn.preprocessing as pp

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

The names **vect**, **tfidf** and **clf (classifier)** are arbitrary. We will use them to perform **grid search** for suitable **hyperparameters** below. We can now train the model with a single command:

In [267]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

## Evaluation of the performance on the test set
Evaluating the predictive accuracy of the model is equally easy:

In [29]:
import numpy as np

twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

#fetching the test data set, categories are the same

docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target) 

# gives about 83% accuracy for the multinomial naive Bayes classifier

0.8348868175765646

We achieved 83.5% accuracy. Let’s see if we can do better with a **linear support vector machine (SVM)**, which is widely regarded as one of the best text classification algorithms (although it’s also a bit slower than naïve Bayes). We can change the learner by simply plugging a different classifier object into our pipeline:

In [30]:
from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

text_clf.fit(twenty_train.data, twenty_train.target)  

predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target) 

#accuracy increased for SVM

0.9127829560585885

We achieved 91.3% accuracy using the SVM. scikit-learn provides further utilities for more detailed performance analysis of the results:

In [31]:
from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502



In [32]:
metrics.confusion_matrix(twenty_test.target, predicted)
# Rows are true classes, columns are predicted classes.
# The largest off-diagonal entry is 35: alt.atheism posts classified as soc.religion.christian.

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]], dtype=int64)

As expected the confusion matrix shows that posts from the newsgroups on atheism and Christianity are more often confused for one another than with computer graphics.
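
The figures in the classification report can be recovered directly from this matrix, which is a useful check on how to read it. The sketch below copies the matrix printed above; only the variable names are new.

```python
import numpy as np

# Confusion matrix from the SVM run above: rows = true class, columns = predicted class.
cm = np.array([[258,  11,  15,  35],
               [  4, 379,   3,   3],
               [  5,  33, 355,   3],
               [  5,  10,   4, 379]])

correct = cm.diagonal()

recall = correct / cm.sum(axis=1)     # correct / all documents truly in the class
precision = correct / cm.sum(axis=0)  # correct / all documents predicted as the class
accuracy = correct.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 4))
```

The row sums (319, 389, 396, 398) are the `support` column of the report, and the overall accuracy matches the 91.3% computed earlier.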

## Parameter tuning using grid search
We’ve already encountered some parameters such as **use_idf in the TfidfTransformer**. Classifiers tend to have many parameters as well; e.g., **MultinomialNB includes a smoothing parameter alpha and SGDClassifier has a penalty parameter alpha** and configurable loss and penalty terms in the objective function (see the module documentation, or use the Python help function to get a description of these).

**Instead of tweaking the parameters of the various components of the chain, it is possible to run an exhaustive search of the best parameters on a grid of possible values. We try out all classifiers on either words or bigrams, with or without idf, and with a penalty parameter of either 0.01 or 0.001 for the linear SVM:**

In [37]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}

In [38]:
# Parallel execution is controlled by the n_jobs parameter: with n_jobs=-1, grid search
# detects how many cores are installed and uses them all.
# Note: the iid argument was deprecated in scikit-learn 0.22 and removed in 0.24; drop it on newer releases.

gs_clf = GridSearchCV(text_clf, parameters, cv=5, iid=False, n_jobs=-1)

In [39]:
#Let’s perform the search on a smaller subset of the training data to speed up the computation:
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])

In [40]:
#Prediction
twenty_train.target_names[gs_clf.predict(['God is love'])[0]]
#'soc.religion.christian'

'soc.religion.christian'

In [45]:
"""The object’s best_score_ and best_params_ attributes store the best mean score and the parameters setting corresponding
to that score:"""

print ("best score is {}".format(gs_clf.best_score_))

for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))




best score is 0.9151349867929058
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 2)
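
Beyond the single best setting, the fitted search object keeps the score of every combination in its `cv_results_` attribute. The sketch below shows one way to inspect it as a DataFrame. To stay fast and self-contained it uses the iris data and a scaler plus logistic regression pipeline rather than the text pipeline above; the step names and the grid values are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Keys follow the <step name>__<parameter name> convention.
grid = {'scale__with_mean': [True, False],
        'clf__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, grid, cv=5).fit(X, y)

# One row per parameter combination (2 x 3 = 6 rows).
results = pd.DataFrame(search.cv_results_)
print(results[['param_clf__C', 'param_scale__with_mean',
               'mean_test_score', 'rank_test_score']])
```

Sorting this frame by `rank_test_score` shows how close the runner-up settings are to the best one.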


## HOMEWORK


In [43]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data train-test (80-20) into training/testing sets

# Split the targets train-test (80-20) into training/testing sets

# Create linear regression object using .LinearRegression()

# Train the model using the training sets using .fit

# Make predictions using the testing set using .predict

# Calculate the mean squared error using mean_squared_error
