# Introduction to Scikit-Learn

Scikit-Learn is a Machine Learning library for Python that implements most well-known, well-performing algorithms, along with other useful tools for solving practical problems.

The Python package name is `sklearn`.

Note:
* Deep Learning uses its own libraries, built by major companies (Google's *tensorflow*, Amazon-backed *mxnet*, with Nvidia's *CUDA* and *cuDNN* providing GPU acceleration underneath) and universities (UC Berkeley's *Caffe*, the now-discontinued *Theano* from Deep Learning pioneers). *keras* is a high-level interface on top of such libraries.
* Tabular data is handled by a separate package called *Pandas*, with *Dask* as its parallel counterpart. They are data-frame libraries rather than databases. Actual data can be stored anywhere, including distributed file systems like HDFS.
* Low-level mathematical operations are done with *Numpy*, which relies on compiled C libraries suited to your computer.

In [1]:
import sklearn
from sklearn import linear_model

## Example usage of Scikit-Learn

In [2]:
from sklearn import datasets
iris = datasets.load_iris()

In [3]:
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [4]:
X = iris['data']
Y = iris['target']

#### We will use a test set to estimate generalization performance

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
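
The call above draws a different random split on every run, so the score reported later will vary slightly. A minimal sketch, assuming a reproducible split is wanted: `random_state` fixes the shuffle, `test_size` sets the held-out fraction (0.25 is the default), and `stratify` keeps the class proportions equal in both parts. The seed value is an arbitrary choice for illustration, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris['data'], iris['target']

# random_state=0 is an arbitrary seed chosen for this illustration
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y)

# 150 samples are divided into 112 for training and 38 for testing
print(X_train.shape, X_test.shape)
```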

#### Now train a linear model for classification

In [7]:
from sklearn.linear_model import LogisticRegression

In [8]:
model = LogisticRegression()
model.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Evaluate how closely the model's predictions match the real flower species

In [9]:
model.score(X_test, Y_test)

0.97368421052631582
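
For classifiers, `score` returns the mean accuracy: the fraction of test samples whose predicted class equals the true class. The sketch below recomputes that number by hand. The split seed and the raised `max_iter` are assumptions added for this illustration (the latter only to avoid convergence warnings in recent versions).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)  # seed is illustrative

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# Fraction of test samples where prediction and truth agree
accuracy_by_hand = np.mean(model.predict(X_test) == Y_test)
accuracy_from_score = model.score(X_test, Y_test)
print(accuracy_by_hand, accuracy_from_score)
```

With the default split, the test set has 38 samples, so the value 0.9737 shown above corresponds to 37 correct predictions out of 38.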

### Altogether

This code has all the parts needed to run Machine Learning.
1. Get a dataset with training and test parts.
2. Fit the model: find the values of the model parameters.
3. Check that the model's performance is good enough for the task.
4. Compute model predictions and connect them to real quantities.

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
model = LogisticRegression()
model.fit(X_train, Y_train)
score = model.score(X_test, Y_test)
y_hat = model.predict(X_test)

In [12]:
print('Estimate of generalization performance: ', score)
print('True class of a test sample: {} ({})'.format(Y_test[0], iris['target_names'][Y_test[0]]))
print('Predicted class of a test sample: {} ({})'.format(y_hat[0], iris['target_names'][y_hat[0]]))

Estimate of generalization performance:  0.973684210526
True class of a test sample: 1 (versicolor)
Predicted class of a test sample: 1 (versicolor)
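
Class indices can be mapped to species names for the whole test set at once, because `iris['target_names']` is a NumPy array and accepts an array of indices. A short sketch of that idea, which also lists any misclassified test samples (the seed and `max_iter` are illustrative assumptions):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)  # seed is illustrative

model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
y_hat = model.predict(X_test)

# Index the array of names with arrays of class indices
predicted_names = iris['target_names'][y_hat]
true_names = iris['target_names'][Y_test]

# Positions in the test set where prediction and truth disagree
mistakes = np.flatnonzero(y_hat != Y_test)
for i in mistakes:
    print(i, 'true:', true_names[i], 'predicted:', predicted_names[i])
```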


# 20 Newsgroups Dataset for homework

This dataset contains the text of newsgroup posts from 20 topics: about 18,800 posts in the version that Scikit-Learn downloads, 11,314 of them in the training part. Our goal is to predict the topic of a new post. The dataset is already split into training and test parts based on time (test posts were published later than training posts).

Scikit-Learn has a function to download the whole dataset: `sklearn.datasets.fetch_20newsgroups`

In [13]:
news = datasets.fetch_20newsgroups(subset='train')
news_test = datasets.fetch_20newsgroups(subset='test')

In [14]:
news.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [15]:
Text_train = news.data
Y_train = news.target

In [16]:
Text_test = news_test.data
Y_test = news_test.target

In [17]:
Text_train[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [18]:
news.target_names[Y_train[0]]

'rec.autos'

### Machine Learning can work only with numbers, not text.

Let's create a vocabulary of the 5,000 most common words in the posts. Then replace each piece of text with a vector of 5,000 values, each giving the number of times the corresponding vocabulary word is used in the text.

Scikit-Learn has a class for that, called `CountVectorizer`.
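
To make the word-count idea concrete before applying it to the real data, here is a small sketch on a made-up three-sentence corpus. The texts and the limit of 3 vocabulary words are assumptions for illustration. English stop words such as "the" and "is" are dropped, and the default tokenizer ignores single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, for illustration only
toy_texts = [
    "the car is a red car",
    "the bike is blue",
    "red bike, red car",
]

# Keep only the 3 most frequent words after removing stop words
toy_processor = CountVectorizer(max_features=3, stop_words='english')
toy_counts = toy_processor.fit_transform(toy_texts)

print(toy_processor.vocabulary_)   # word -> column index
print(toy_counts.toarray())        # one row per text, one column per word
```

The word "blue" occurs only once overall, so it falls outside the 3-word vocabulary and is not counted.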

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

In [20]:
processor = CountVectorizer(max_features=5000, stop_words='english')
processor.fit(Text_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [21]:
processor.vocabulary_

{'wam': 4826,
 'umd': 4656,
 'edu': 1594,
 'thing': 4500,
 'subject': 4331,
 'car': 894,
 'nntp': 3111,
 'posting': 3474,
 'host': 2228,
 'organization': 3239,
 'university': 4676,
 'maryland': 2822,
 'college': 1072,
 'park': 3302,
 'lines': 2683,
 '15': 50,
 'wondering': 4917,
 'saw': 3959,
 'day': 1346,
 'door': 1523,
 'sports': 4231,
 'looked': 2724,
 'late': 2595,
 'early': 1570,
 'called': 864,
 'doors': 1524,
 'really': 3705,
 'small': 4159,
 'addition': 346,
 'separate': 4044,
 'rest': 3823,
 'body': 757,
 'know': 2563,
 'model': 2959,
 'engine': 1638,
 'specs': 4216,
 'years': 4978,
 'production': 3552,
 'history': 2187,
 'info': 2353,
 'looking': 2725,
 'mail': 2775,
 'thanks': 4492,
 'il': 2291,
 'brought': 809,
 'carson': 911,
 'washington': 4837,
 'guy': 2089,
 'si': 4106,
 'clock': 1051,
 'final': 1835,
 'summary': 4356,
 'reports': 3794,
 'keywords': 2540,
 'upgrade': 4689,
 'article': 530,
 '11': 29,
 'fair': 1769,
 'number': 3145,
 'shared': 4080,
 'experiences': 1726,

#### Now transform the texts

Note that we fit the processor on the training data only, because the test data is used to estimate generalization performance on *future* data that we have not seen.

Both the training and the test sets are then transformed with that same fitted processor.
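
A consequence of fitting on training data only is that words first seen at test time have no column and are silently ignored. A minimal sketch on made-up texts (all names and sentences here are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["fast car", "slow car", "fast bike"]
test_texts = ["fast rocket car"]   # 'rocket' never appears in training

toy_processor = CountVectorizer()
toy_processor.fit(train_texts)     # vocabulary comes from training texts only

toy_train = toy_processor.transform(train_texts)
toy_test = toy_processor.transform(test_texts)

print(sorted(toy_processor.vocabulary_))
print(toy_test.toarray())          # same number of columns as toy_train
```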

In [22]:
X_train = processor.transform(Text_train)
X_test = processor.transform(Text_test)

In [23]:
X_train

<11314x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 874806 stored elements in Compressed Sparse Row format>

#### Let's check one data sample

In [24]:
Text_train[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

#### Sparse matrix

A sparse matrix stores only the non-zero elements. Our *Compressed Sparse Row* (CSR) matrix keeps three arrays: the non-zero values (`data`), the column index of each value (`indices`), and row pointers (`indptr`) that mark where each row's values start and end.

In [25]:
x1 = X_train[0]
x1.data

array([1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1])

In [26]:
x1.indices

array([  50,  346,  757,  809,  864,  894, 1072, 1346, 1523, 1524, 1570,
       1594, 1638, 2187, 2228, 2291, 2353, 2563, 2595, 2683, 2724, 2725,
       2775, 2822, 2959, 3111, 3239, 3302, 3474, 3552, 3705, 3823, 3959,
       4044, 4159, 4216, 4231, 4331, 4492, 4500, 4656, 4676, 4826, 4917,
       4978], dtype=int32)

In [27]:
x1.indptr

array([ 0, 45], dtype=int32)
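
The role of the three arrays is easier to see on a matrix with several rows. The sketch below builds a small CSR matrix by hand with SciPy; the numbers are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense view of the toy matrix:
# [[0, 2, 0, 1],
#  [0, 0, 0, 0],
#  [3, 0, 0, 4]]
data = np.array([2, 1, 3, 4])      # non-zero values, row by row
indices = np.array([1, 3, 0, 3])   # column index of each value
indptr = np.array([0, 2, 2, 4])    # row i owns data[indptr[i]:indptr[i+1]]

toy = csr_matrix((data, indices, indptr), shape=(3, 4))

# Recover the (column, value) pairs of every row from the raw arrays
for row in range(toy.shape[0]):
    start, end = toy.indptr[row], toy.indptr[row + 1]
    pairs = list(zip(toy.indices[start:end].tolist(),
                     toy.data[start:end].tolist()))
    print(row, pairs)
```

For the single-row matrix `x1` above, `indptr` is `[0, 45]`: the only row owns all 45 stored values.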

#### Which word is the most common one, with 5 occurrences, at index 894?

In [28]:
for word, index in processor.vocabulary_.items():
    if index == 894:
        print(word, index)

car 894
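
Scanning the whole dictionary for every lookup is slow when many indices need decoding. One option is to invert the mapping once. The sketch below uses a few word/index pairs and counts copied from the outputs above; the variable names are hypothetical.

```python
import numpy as np

# A few (word, index) pairs copied from the vocabulary output above
vocabulary_excerpt = {'car': 894, 'edu': 1594, 'umd': 4656, 'wam': 4826}

# Invert once: index -> word
index_to_word = {index: word for word, index in vocabulary_excerpt.items()}

# Matching entries taken from x1.indices and x1.data above
indices = np.array([894, 1594, 4656, 4826])
counts = np.array([5, 2, 2, 2])

for index, count in zip(indices.tolist(), counts.tolist()):
    print(index_to_word[index], count)
```

Recent Scikit-Learn versions also offer `get_feature_names_out()` on a fitted vectorizer, which returns the words ordered by column index. Whether it is available depends on the installed version.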


## Convert sparse matrix to dense

We can convert a sparse matrix to a dense one if there is enough memory. Estimate the required memory beforehand from the shape of the sparse matrix.

Many methods work directly with sparse matrices. Convert a sparse matrix to dense only if really needed.
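
As an illustration of a method that accepts sparse input directly, the sketch below fits `LogisticRegression` on the sparse output of `CountVectorizer` without any conversion. The four short texts and their labels are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up texts and labels: 0 = autos, 1 = hockey
toy_texts = ["fast car engine", "car engine oil",
             "hockey game score", "hockey team game"]
toy_labels = [0, 0, 1, 1]

toy_processor = CountVectorizer()
toy_sparse = toy_processor.fit_transform(toy_texts)   # stays sparse

toy_model = LogisticRegression()
toy_model.fit(toy_sparse, toy_labels)                 # no dense conversion

new_sparse = toy_processor.transform(["engine oil for my car"])
print(toy_model.predict(new_sparse))
```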

In [29]:
X_train

<11314x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 874806 stored elements in Compressed Sparse Row format>

In [30]:
memory_bytes = 11314 * 5000 * 8  # int64: 64 bits / 8 bits per byte = 8 bytes per number
print('Memory in GB: ', memory_bytes / 2**30)

Memory in GB:  0.42147934436798096
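
The same estimate can be computed from a matrix object instead of hard-coded numbers, and compared with what the sparse form occupies. A rough sketch: the two helper functions are hypothetical names, and the identity matrix is a stand-in for the real data.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(matrix):
    """Bytes used by the three arrays that back a CSR matrix."""
    return matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes

def dense_nbytes(matrix):
    """Bytes the same matrix would need as a dense array of the same dtype."""
    rows, cols = matrix.shape
    return rows * cols * matrix.dtype.itemsize

# Toy stand-in: 1000 x 1000 identity matrix, one stored value per row
toy = sparse.identity(1000, dtype=np.int64, format='csr')

print('sparse bytes:', csr_nbytes(toy))
print('dense bytes: ', dense_nbytes(toy))
```

Applied to the real `X_train`, `dense_nbytes` reproduces the 11314 * 5000 * 8 estimate above, while the sparse form only has to hold the 874806 stored values plus their index arrays.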


In [31]:
import numpy as np

In [32]:
# toarray() returns a dense NumPy array directly
D_train = X_train.toarray()
D_test = X_test.toarray()

### Altogether

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

In [3]:
news = fetch_20newsgroups(subset='train')
Text_train = news.data
Y_train = news.target

news_test = fetch_20newsgroups(subset='test')
Text_test = news_test.data
Y_test = news_test.target

In [6]:
news.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [35]:
processor = CountVectorizer(max_features=5000, stop_words='english')
processor.fit(Text_train)
Sparse_train = processor.transform(Text_train)
Sparse_test = processor.transform(Text_test)

In [36]:
# toarray() returns a dense NumPy array directly
X_train = Sparse_train.toarray()
X_test = Sparse_test.toarray()

In [37]:
labels = news.target_names
vocabulary = processor.vocabulary_

#### The prepared dataset consists of:

`X_train`, `X_test`, `Y_train`, `Y_test`, `labels`, `vocabulary`