# Working with HTML Pages

## Parsing XML and HTML

Simply extracting data from an XML file, as you did before, may not be enough. The following example extracts the XML data with the correct types so that you can manipulate the resulting DataFrame more easily.

In [None]:
# lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
from lxml import objectify
import pandas as pd
# distutils helps package Python modules and extensions for distribution (setup.py and so on):
# https://docs.python.org/2/library/distutils.html
# Only its strtobool helper is needed here. Note that distutils is deprecated since
# Python 3.10 and was removed in Python 3.12.
from distutils import util

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()
records, names = [], []

# iterate over every child record instead of hard-coding how many there are
for record in root.getchildren():
    obj = record.getchildren()
    # build a dict such as {'Number': 4, 'Boolean': False};
    # objectify data elements expose .pyval, which returns the value as a plain Python type
    records.append({'Number': obj[0].pyval,
                    'Boolean': bool(util.strtobool(obj[2].text))})
    # the text of the second child becomes the row label
    names.append(obj[1].text)

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(records, index=names, columns=['Number', 'Boolean'])

df

In [None]:
# print the types of the Number and Boolean values in the row labeled 'First'
# (.ix was removed from pandas; .loc selects by label)
print(type(df.loc['First', 'Number']))
print(type(df.loc['First', 'Boolean']))
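
To see the type conversion in isolation, the following sketch parses a small inline document with the standard library's `xml.etree.ElementTree` instead of lxml. The XML layout (`Record` elements with `Number`, `String`, and `Boolean` children) is an assumption inferred from the code in this section, the sample values are made up, and `to_bool` is a hypothetical stand-in for `distutils.util.strtobool`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real XMLData.xml may differ.
XML_TEXT = """
<MyDataset>
  <Record><Number>1</Number><String>First</String><Boolean>True</Boolean></Record>
  <Record><Number>2</Number><String>Second</String><Boolean>False</Boolean></Record>
</MyDataset>
"""

def to_bool(text):
    """Convert common truth strings to bool (stand-in for strtobool)."""
    value = text.strip().lower()
    if value in ('y', 'yes', 't', 'true', 'on', '1'):
        return True
    if value in ('n', 'no', 'f', 'false', 'off', '0'):
        return False
    raise ValueError('invalid truth value %r' % text)

root = ET.fromstring(XML_TEXT)

# Map each record's String value to its typed Number and Boolean values.
typed_rows = {}
for record in root.findall('Record'):
    typed_rows[record.findtext('String')] = {
        'Number': int(record.findtext('Number')),
        'Boolean': to_bool(record.findtext('Boolean')),
    }

print(typed_rows['First'])
print(type(typed_rows['First']['Number']), type(typed_rows['First']['Boolean']))
```

Converting at parse time means later arithmetic and boolean filtering work on real `int` and `bool` values rather than on strings.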

## Using XPath for data extraction

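XPath is a query language for selecting nodes in an XML document, so you can pull out all matching elements with one expression instead of walking the tree by hand. For an introduction, see http://www.w3schools.com/xsl/xpath_intro.asp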

In [None]:
from lxml import objectify
import pandas as pd
from distutils import util

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()

# map is a Python built-in that applies a function (1st argument) to every item of an
# iterable (2nd argument); in Python 3 it returns a lazy iterator, hence the list() calls
data = list(zip(map(int, root.xpath('Record/Number')),
                map(bool, map(util.strtobool,
                    map(str, root.xpath('Record/Boolean'))))))

df = pd.DataFrame(data,
                  columns=('Number', 'Boolean'),
                  index=list(map(str,
                        root.xpath('Record/String'))))

print(df)

In [None]:
# .ix was removed from pandas; .loc selects by label
print(type(df.loc['First', 'Number']))
print(type(df.loc['First', 'Boolean']))
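
Path expressions can also be tried without lxml. This sketch uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree`. The inline XML is hypothetical sample data that follows the `Record/Number`, `Record/String`, `Record/Boolean` layout the expressions in this section assume.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data with the layout the XPath expressions assume.
root = ET.fromstring(
    '<MyDataset>'
    '<Record><Number>1</Number><String>First</String><Boolean>True</Boolean></Record>'
    '<Record><Number>2</Number><String>Second</String><Boolean>False</Boolean></Record>'
    '</MyDataset>'
)

# A path selects every matching element, here each <Number> under a <Record>.
numbers = [int(el.text) for el in root.findall('Record/Number')]

# A predicate in square brackets filters on a child's text content.
second = root.find("Record[String='Second']/Number")

print(numbers)
print(second.text)
```

The predicate form is handy when you need one specific record rather than a whole column of values.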

# Working with Raw Text

## Stemming and removing stop words

Stemming is the process of reducing words to their stem (or root) word. For example, the words cats, catty, and catlike all have the stem cat. Stemming helps you analyze sentences because, once they are tokenized, different forms of the same word are counted as one term. Removing suffixes to create stem words and tokenizing sentences are only two parts of the process of creating something like a natural language interface. Languages also include a great number of glue words that don't mean much to a computer but have significant meaning to humans, such as a, as, the, that, and so on in English. These short, less useful words are stop words. To humans, sentences don't make sense without them, but for your computer they add little information and are typically filtered out before analysis. <br /><br />

This example requires the Natural Language Toolkit (NLTK), which Anaconda does not install by default. For installation instructions, see http://www.nltk.org/data.html <br />

After you install NLTK, you must also install the data packages associated with it, as described at http://www.nltk.org/data.html <br />

The following example demonstrates how to perform stemming and remove stop words from a sentence.



In [None]:
# CountVectorizer comes from scikit-learn, a machine learning library
# http://scikit-learn.org/stable/
import sklearn.feature_extraction.text as ext
from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer

# pick the stemmer
stemmer = PorterStemmer()

# takes the tokens and a stemmer and returns all the stems as a list
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

# takes the text, uses word_tokenize to split it into tokens,
# then calls stem_tokens to get the stems
def tokenize(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

# the text used for training
vocab = ['Sam loves swimming so he swims all the time']

# create CountVectorizer using tokenize function we defined and stop words for english. You can use stop words for 
# other languages too.
vect = ext.CountVectorizer(tokenizer=tokenize, 
                           stop_words='english')

# fit the CountVectorizer to the training text to learn the vocabulary
vec = vect.fit(vocab)

# transform new text and count how many of its stems match the training vocabulary
sentence1 = vec.transform(['George loves swimming too!'])

# print the stems learned from the training text
# (get_feature_names was removed in scikit-learn 1.2 in favor of get_feature_names_out)
print(vec.get_feature_names_out())
# print how often each training stem occurs in the new text
print(sentence1.toarray())
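
To make the two steps concrete without NLTK or scikit-learn, here is a minimal standard-library sketch. It is a toy under stated assumptions: `naive_stem` strips only a few suffixes and is not the Porter algorithm, and `STOP_WORDS` is a tiny hypothetical list rather than the English list scikit-learn ships.

```python
import re

# A tiny, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {'a', 'all', 'as', 'he', 'so', 'that', 'the', 'too'}

def naive_stem(word):
    """Strip a few common suffixes. This is NOT the Porter algorithm."""
    for suffix in ('ing', 'ed', 's'):
        stem = word[:-len(suffix)]
        # skip suffixes that do not apply, words such as 'class', and very short stems
        if not word.endswith(suffix) or word.endswith('ss') or len(stem) < 3:
            continue
        # collapse a doubled final consonant left by 'ing' or 'ed': 'swimm' -> 'swim'
        if suffix != 's' and stem[-1] == stem[-2] and stem[-1] not in 'aeiou':
            stem = stem[:-1]
        return stem
    return word

def stems_without_stop_words(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

train = stems_without_stop_words('Sam loves swimming so he swims all the time')
new = stems_without_stop_words('George loves swimming too!')
print(train)
# stems that the new sentence shares with the training sentence
print(sorted(set(train) & set(new)))
```

Because both swimming and swims reduce to the same stem, the two sentences overlap on that term even though the surface forms differ.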

## Introducing regular expressions

Regular expressions present the data scientist with an interesting array of tools for parsing raw text. At first, the way regular expressions work may seem daunting. However, sites such as https://regex101.com/#python let you experiment with regular expressions so that you can see how various expressions perform specific types of pattern matching.

#### Pattern-Matching Characters Used in Python

(re) &nbsp;&nbsp;&nbsp; matching re <br />
re?  &nbsp;&nbsp;&nbsp; matching 0 or 1 occurrence <br />
re*  &nbsp;&nbsp;&nbsp; matching 0 or more occurrences <br />
re+  &nbsp;&nbsp;&nbsp; matching 1 or more occurrences <br />
. &nbsp;&nbsp;&nbsp; matching any single character except a newline <br />
[^...] &nbsp;&nbsp; matching any single character or range not in brackets<br />
[...] &nbsp;&nbsp;  matching any single character in brackets <br />
re{n,m} &nbsp;&nbsp;&nbsp; matching at least n and at most m occurrences (no space inside the braces)<br />
\d &nbsp;&nbsp;&nbsp; matching a digit<br />
\s &nbsp;&nbsp;&nbsp; matching a whitespace character (space, \t, \n, \r, \f, \v)<br />
\w &nbsp;&nbsp;&nbsp; matching word character<br />
\Z &nbsp;&nbsp;&nbsp; matching the end of a string (a\Z means a at the end)<br />
\D &nbsp;&nbsp;&nbsp; matching nondigit<br />
\S &nbsp;&nbsp;&nbsp; matching nonwhitespace<br />
\W &nbsp;&nbsp;&nbsp; matching nonword characters<br />
re1|re2 &nbsp;&nbsp;&nbsp; matching re1 or re2<br />
re{n} &nbsp;&nbsp;&nbsp; matching exactly n times<br />
re{n,} &nbsp;&nbsp;&nbsp; matching at least n times<br />
$ &nbsp;&nbsp;&nbsp; matching the end of the line<br />


(?=foo)&nbsp;&nbsp;&nbsp;Lookahead&nbsp;&nbsp;&nbsp;Asserts that what immediately follows the current position in the string is foo <br />
(?<=foo)&nbsp;&nbsp;&nbsp;Lookbehind&nbsp;&nbsp;&nbsp;Asserts that what immediately precedes the current position in the string is foo<br />
(?!foo)&nbsp;&nbsp;&nbsp;Negative Lookahead&nbsp;&nbsp;&nbsp;Asserts that what immediately follows the current position in the string is not foo<br />
(?<!foo)&nbsp;&nbsp;&nbsp;Negative Lookbehind&nbsp;&nbsp;&nbsp;Asserts that what immediately precedes the current position in the string is not  foo<br /><br />

Example:<br />
(?&lt;=&lt;hello&gt;).*(?=&lt;/hello&gt;)<br />
returns test when applied to &lt;hello&gt;test&lt;/hello&gt;<br />
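
The lookaround example can be checked directly with Python's `re` module. This is a small sketch; the second pattern and its sample string are made up to illustrate the negative lookahead.

```python
import re

text = '<hello>test</hello>'

# Lookbehind and lookahead assert the surrounding tags without consuming them.
between_tags = re.search(r'(?<=<hello>).*(?=</hello>)', text)
print(between_tags.group())

# Negative lookahead: match 'foo' only when it is not followed by 'bar'.
not_before_bar = re.findall(r'foo(?!bar)\w*', 'foobar foobaz food')
print(not_before_bar)
```

Note that Python requires a lookbehind to match text of a fixed length, which the literal `<hello>` satisfies.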


In [None]:
import re

data1 = 'Welcome to data programming!'

# compiling the regular expression once avoids re-parsing the pattern every time you use it
# (Python's re module uses a backtracking matcher, not a DFA)
# the r prefix marks a raw string literal, so backslashes are not treated as escape characters
pattern = re.compile(r'a')

# search for the first match in the string
dmatch2 = pattern.search(data1)

# the matched text
print(dmatch2.group())
# the start index of the match
print(dmatch2.start())
# the end index of the match (one past the last matched character)
print(dmatch2.end())

# find all matching substrings
dmatch1 = pattern.findall(data1)
print(dmatch1)

# find the start index and text of every match
for match in pattern.finditer(data1):
    print("%s: %s" % (match.start(), match.group()))

In [None]:
import re

data1 = 'My phone number is: 800-555-1212.'
data2 = '800-555-1234 is my phone number.'

# compile once and reuse the pattern; the parentheses create three capture groups
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# search for the first match in each string
dmatch1 = pattern.search(data1)
dmatch2 = pattern.search(data2)

# the full matched text
print(dmatch1.group())
print(dmatch2.group())
# the start and end indices of the first match
print(dmatch1.start())
print(dmatch1.end())
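
The phone number pattern contains three pairs of parentheses, and each pair captures part of the match. As a short sketch of how to read those parts back, using only the standard `re` module (the variable names `area_code`, `exchange`, and `line_number` are illustrative):

```python
import re

pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
match = pattern.search('My phone number is: 800-555-1212.')

# group(0) is the whole match; groups() returns the captured parts as a tuple.
area_code, exchange, line_number = match.groups()
print(match.group(0))
print(area_code, exchange, line_number)

# span() returns the (start, end) indices in a single call.
print(match.span())
```

Capture groups let you validate the overall format and extract the individual fields in one pass.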

# Working With Text Data

## Loading the 20 newsgroups dataset

The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:

    The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn. Alternatively, it is possible to download the dataset manually from the web-site and use the sklearn.datasets.load_files function by pointing it to the 20news-bydate-train subfolder of the uncompressed archive folder.

In order to get faster execution times for this first example we will work on a partial dataset with only 4 categories out of the 20 available in the dataset:

In [None]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']

We can now load the list of files matching those categories as follows:

In [None]:
from sklearn.datasets import fetch_20newsgroups
# load the training subset; random_state seeds the random shuffling
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)

# the result is a Bunch, a simple holder object whose fields can be accessed
# either as Python dict keys or as object attributes
print(type(twenty_train))
# twenty_train.data and twenty_train['data'] refer to the same list of documents
twenty_train['data']

In [None]:
# shows the categories
twenty_train.target_names

In [None]:
# The files themselves are loaded in memory in the data attribute
# Shows the number of files loaded
# twenty_train.data is a list and each element is the whole file text
len(twenty_train.data)

In [None]:
# filenames holds the path of each loaded file, one per document
len(twenty_train.filenames)

In [None]:
# content of the first file
# first three lines
print("\n".join(twenty_train.data[0].split("\n")[:3]))

In [None]:
# target holds one integer per document; each integer stands for one of the four categories we selected
twenty_train.target

In [None]:
# the category name for the first file loaded into memory
twenty_train.target_names[twenty_train.target[0]]

Supervised learning algorithms will require a category label for each document in the training set. In this case the category is the name of the newsgroup which also happens to be the name of the folder holding the individual documents.

For speed and space efficiency reasons scikit-learn loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored in the target attribute:

In [None]:
twenty_train.target[:10]

In [None]:
# map the integer ids back to the category names
for i in twenty_train.target[:10]:
    print(twenty_train.target_names[i])

### Sparse Matrix 

A sparse matrix is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use specialized algorithms and data structures that take advantage of the sparse structure of the matrix.<br /><br />

One of these data structures is the CSR (Compressed Sparse Row) format.<br /><br />

Example:<br /><br />

[[1, 2, 0],<br />
 [0, 0, 3],<br />
 [4, 0, 5]]<br /><br />
 
We will use CSR to compress this sparse matrix. We end up with three arrays:<br /><br />

NZV = [1, 2, 3, 4, 5]<br />
C   = [0, 1, 2, 0, 2] <br />
RP  = [0, 2, 3, 5]<br /><br />

NZV = Nonzero values, listed row by row<br />
C = Column index of each nonzero value (one entry per NZV element)<br />
RP = Row pointer (used to find which row each element belongs to)<br /><br />

This is how to interpret RP:<br />
[0, 2, 3, 5] --> the last number is the total number of nonzero values (so it is 5)<br />
The remaining entries [0, 2, 3] give the position in NZV (whose indices are [0, 1, 2, 3, 4]) at which each row starts.<br />
So the elements at indices (0, 1) are in the first row, (2) is in the second row, and (3, 4) are in the third row.<br /><br />

With NZV, C, and RP you can reconstruct the complete sparse matrix.<br />

Another example:

[[1, 2, 0],<br />
 [0, 0, 0],<br />
 [4, 0, 5]]<br /><br />

NZV = [1, 2, 4, 5]<br />
C   = [0, 1, 0, 2] <br />
RP  = [0, 2, 2, 4]<br /><br />

RP: indices (0, 1) belong to the first row; the repeated 2 means that the second row starts and ends at the same position, so it holds no nonzero values; indices (2, 3) belong to the third row<br /><br />
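
The construction described above can be sketched in plain Python. The helper names `to_csr` and `from_csr` are hypothetical and exist only for illustration; in practice you would use `scipy.sparse.csr_matrix`.

```python
def to_csr(dense):
    """Return (NZV, C, RP) for a dense matrix given as a list of rows."""
    nzv, cols, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                nzv.append(value)
                cols.append(col)
        # after each row, record how many nonzero values have been seen so far
        row_ptr.append(len(nzv))
    return nzv, cols, row_ptr

def from_csr(nzv, cols, row_ptr, n_cols):
    """Rebuild the dense matrix from the three CSR arrays."""
    dense = []
    # consecutive row pointers delimit the slice of NZV that belongs to one row
    for start, end in zip(row_ptr, row_ptr[1:]):
        row = [0] * n_cols
        for k in range(start, end):
            row[cols[k]] = nzv[k]
        dense.append(row)
    return dense

matrix = [[1, 2, 0],
          [0, 0, 3],
          [4, 0, 5]]
nzv, cols, row_ptr = to_csr(matrix)
print(nzv, cols, row_ptr)
print(from_csr(nzv, cols, row_ptr, n_cols=3) == matrix)
```

The number of columns is passed to `from_csr` explicitly because the three arrays alone do not record trailing all-zero columns.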

http://www.bu.edu/pasi/files/2011/01/NathanBell1-10-1000.pdf

https://en.wikipedia.org/wiki/Sparse_matrix

http://docs.scipy.org/doc/scipy/reference/sparse.html

Example:


In [None]:
import numpy as np
from scipy.sparse import csr_matrix
#A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
A = csr_matrix([[1, 2, 3], [0, 0, 0], [0, 0, 5]])
# gives you NZV
A.data

In [None]:
# gives you the C array (column indices)
A.indices

In [None]:
# gives you the RP array (row pointers)
A.indptr

In [None]:
# shows you the sparse matrix 
A.toarray()

Now back to our text data example

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(['Data programming is cool', 'If you and only you pass the course!'])
# tokenizing the two texts yields 11 distinct tokens
# (get_feature_names was removed in scikit-learn 1.2 in favor of get_feature_names_out)
count_vect.get_feature_names_out()

In [None]:
type(X_train_counts) # as you can see this is CSR matrix

In [None]:
# the sparse matrix itself, shown as a dense array
X_train_counts.toarray()
# each row corresponds to one text in the list, and each column holds
# the number of times one of the 11 tokens occurs in that text

In [None]:
# NZV of Sparse matrix above
X_train_counts.data

In [None]:
# gives you the column index of a token in the vocabulary
count_vect.vocabulary_.get(u'only')
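
To see what the vectorizer computes, the same count matrix can be built by hand with the standard library. This is a sketch under one assumption: the regular expression below mirrors the default token pattern of `CountVectorizer` (lowercased words of two or more characters); everything else is plain counting.

```python
import re
from collections import Counter

docs = ['Data programming is cool', 'If you and only you pass the course!']

def tokenize(text):
    # lowercase and keep words of two or more characters
    return re.findall(r'\b\w\w+\b', text.lower())

# vocabulary: every distinct token, mapped to a column index in alphabetical order
vocabulary = {tok: i for i, tok in
              enumerate(sorted({t for d in docs for t in tokenize(d)}))}

# one row per document, one column per token, values are occurrence counts
counts = []
for doc in docs:
    tally = Counter(tokenize(doc))
    counts.append([tally.get(tok, 0) for tok in sorted(vocabulary)])

print(len(vocabulary), vocabulary['only'])
print(counts)
```

The column for you holds 2 in the second row, which shows that the matrix stores counts rather than mere presence.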

## Understanding the bag of words model

In [None]:
# the count matrix generated above is an instance of the bag of words model
from sklearn.datasets import fetch_20newsgroups
import sklearn.feature_extraction.text as ext
import pandas as pd

categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, 
                                  shuffle=True, 
                                  random_state=42)


count_vect = ext.CountVectorizer()
X_train_counts = count_vect.fit_transform(
    twenty_train.data)

# huge array but being saved as CSR matrix
X_train_counts.shape

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

In [None]:
#tf
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

In [None]:
# tf-idf
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
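
The weighting itself is simple arithmetic, which the following standard-library sketch works through on a made-up count matrix. It assumes the smoothed idf formula documented for scikit-learn's `TfidfTransformer`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it leaves out the L2 normalization that the transformer applies to each row by default, so the numbers will not match the transformer's output exactly.

```python
import math

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns).
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 2, 0]]

# tf: divide each count by the total number of words in its document
tf = [[c / sum(row) for c in row] for row in counts]

# document frequency: number of documents that contain each term
n_docs = len(counts)
doc_freq = [sum(1 for row in counts if row[j] > 0)
            for j in range(len(counts[0]))]

# smoothed idf: a term found in every document gets the minimum weight of 1.0
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# tf-idf: rarer terms are weighted up relative to common ones
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

print(doc_freq)
print([round(w, 3) for w in idf])
```

In the third document the second term occurs less often than the first, yet it ends up with the larger tf-idf weight because it appears in only one document.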

## Training a classifier

Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a naïve Bayes classifier, which provides a nice baseline for this task. scikit-learn includes several variants of this classifier; the one most suitable for word counts is the multinomial variant:

http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf

http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes


In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call transform instead of fit_transform on the transformers, since they have already been fit to the training set:

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
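
For intuition about what the multinomial classifier does internally, here is a minimal standard-library sketch trained on raw word counts. The toy training set, the labels, and the `predict` helper are all hypothetical; the smoothing parameter `alpha` plays the same role as additive (Laplace) smoothing in `MultinomialNB`, but this is an illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (text, label) pairs.
train = [('god love faith', 'religion'),
         ('church faith prayer', 'religion'),
         ('gpu render fast', 'graphics'),
         ('opengl gpu image', 'graphics')]

# number of documents per class and word counts per class
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counter in word_counts.values() for w in counter}

def predict(text, alpha=1.0):
    """Return the label with the highest log posterior (additive smoothing)."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        # start from the log prior of the class
        score = math.log(class_docs[label] / len(train))
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + alpha) /
                                  (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

for doc in ['god is love', 'opengl on the gpu is fast']:
    print('%r => %s' % (doc, predict(doc)))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long documents.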

## Working with n-grams

An n-gram is a contiguous sequence of items in the text you want to analyze. The items can be letters, words, and so on. The n in n-gram refers to the size. An n-gram that has a size of one, for example, is a unigram. The example in this section uses a size of three, making a trigram. You use n-grams in a probabilistic manner to perform tasks such as predicting the next item in a sequence, which wouldn't seem very useful until you start thinking about applications such as search engines that try to predict the word you want to type based on the letters you have already supplied. The technique also has all sorts of other applications, such as in DNA sequencing and data compression. The following example shows how to create n-grams from the 20 Newsgroups dataset.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import sklearn.feature_extraction.text as ext

categories = ['sci.space']

twenty_train = fetch_20newsgroups(subset='train', 
        categories=categories, 
        remove=('headers', 'footers', 'quotes'),
        shuffle=True, 
        random_state=42)

# analyzer can be 'word', 'char', or 'char_wb' (character n-grams inside word boundaries)
# a word boundary is where a word is separated from its neighbors, for example by a space
# ngram_range = (minimum n, maximum n)
# max_features = x keeps only the x most frequent n-grams
count_chars = ext.CountVectorizer(analyzer='char_wb', 
        ngram_range=(3,3), 
        max_features=10).fit(twenty_train['data'])

print(type(twenty_train))

# stop words being used
count_words = ext.CountVectorizer(analyzer='word', 
        ngram_range=(2,2),
        max_features=10,
        stop_words='english').fit(twenty_train['data'])

X = count_chars.transform(twenty_train.data)

# the top 10 character trigrams
# (get_feature_names was removed in scikit-learn 1.2 in favor of get_feature_names_out)
print(count_chars.get_feature_names_out())
# rows are documents
# columns are the top 10 features
# values are the frequency of each feature in each document
print(X.toarray())
# the top 10 word bigrams
print(count_words.get_feature_names_out())
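
The n-gram idea itself needs no library. The sketch below generates word bigrams and character trigrams from a made-up sentence; the `ngrams` helper is hypothetical, and padding each word with a space only approximates what `analyzer='char_wb'` does.

```python
from collections import Counter

def ngrams(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

text = 'the shuttle left the launch pad'

# word bigrams
word_bigrams = ngrams(text.split(), 2)

# character trigrams; each word is padded with a space on both sides
char_trigrams = []
for word in text.split():
    char_trigrams += [''.join(g) for g in ngrams(' ' + word + ' ', 3)]

print(word_bigrams[:2])
# the most frequent trigrams, similar in spirit to max_features
print(Counter(char_trigrams).most_common(2))
```

Counting the generated n-grams and keeping the most frequent ones is the same selection that `max_features` performs over the whole corpus.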

# Working with Graph Data

## Using NetworkX basics

### Creating the initial graph

Creating the adjacency matrix of a 10-node cycle graph

In [None]:
import networkx as nx

G = nx.cycle_graph(10)
A = nx.adjacency_matrix(G)

print(A.todense())
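
To clarify what the adjacency matrix encodes, this sketch builds the same kind of matrix by hand for an undirected cycle. The `cycle_adjacency` helper is hypothetical and uses plain lists rather than the sparse structure NetworkX returns.

```python
def cycle_adjacency(n):
    """Adjacency matrix of an undirected cycle on nodes 0..n-1."""
    matrix = [[0] * n for _ in range(n)]
    for node in range(n):
        neighbor = (node + 1) % n
        # an undirected edge appears in both directions
        matrix[node][neighbor] = 1
        matrix[neighbor][node] = 1
    return matrix

A_manual = cycle_adjacency(10)
for row in A_manual:
    print(row)
```

Every row sums to 2 because each node in a cycle has exactly two neighbors, and the matrix is symmetric because the edges have no direction.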

### Visualizing the graph

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
nx.draw_networkx(G)
plt.show()

### Adding to the Graph

In [None]:
G.add_edge(1,5)
nx.draw_networkx(G)
plt.show()