# Analysis of Spark PRs with spaCy and Scikit-Learn

This notebook was used to analyze the Spark commit messages and PRs with spaCy and Scikit-Learn. 

First, we extract the useful descriptions and comments from Spark issues (Section 1). 

Second, we use spaCy to do tokenization, remove stop words and reduce inflected words to their root stem (Section 2). 

Third, we use Scikit-Learn to analyze Spark issues: calculate the percentages of different kinds of words in all issues (Section 3.1), the TF-IDF of all issues (Section 3.2), the term frequency distributions (Section 3.3), the classification of all issues (Section 3.4) and the issue similarity matrix (Section 3.5).

## 0. Installation

If you use a Linux-like system (including, to greater or lesser degrees, Ubuntu, macOS, Cygwin, and Bash for Windows), you should be able to run the following command to install spaCy, Scikit-Learn, Pandas, and the other libraries this notebook imports. Ete3, a library for tree visualization, is optional.

    sudo pip install spacy scikit-learn pandas seaborn matplotlib ete3

Now download the spaCy English model with this command:

    python -m spacy download en_core_web_md

## 1. Extract the useful information from Spark issues

In [None]:
#coding=utf-8
import xml.dom.minidom
import re
import os
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

Strip HTML tags from the issue text and decode the HTML entities it contains.

In [None]:
def remove(data):
    data = re.sub(r'</?p>', "", data)
    data = re.sub(r'</?tt>', "", data)
    data = re.sub(r'<br/>', "", data)
    data = re.sub(r'\<a.*?\>', "", data)
    data = re.sub(r'</a>', "", data)
    data = re.sub(r'\<div.*?\>', "", data)
    data = re.sub(r'\</div\>', "", data)
    data = re.sub(r'\<pre.*?\>', "", data)
    data = re.sub(r'\</pre\>', "", data)
    data = re.sub(r'\<span.*?\>', "", data)
    data = re.sub(r'\</span\>', "", data)
    data = re.sub(r'\<ul.*?\>', "", data)
    data = re.sub(r'</ul\>', "", data)
    data = re.sub(r'\<table.*?\>', "", data)
    data = re.sub(r'\</table\>', "", data)
    data = re.sub(r'\<td.*?\>', "", data)
    data = re.sub(r'\</td\>', "", data)
    data = re.sub(r'\<th.*?\>', "", data)
    data = re.sub(r'\</th\>', "", data)
    data = re.sub(r'\</?del\>', "", data)
    data = re.sub(r'\</?em\>', "", data)
    data = re.sub(r'\</?h3\>', "", data)
    data = re.sub(r'\</?li\>', "", data)
    data = re.sub(r'</?ol>', "", data)
    data = re.sub(r'</?tr>', "", data)
    data = re.sub(r'</?tbody>', "", data)
    data = re.sub(r'\<img.*?\>', "", data)
    data = re.sub(r'\n', " ", data)
    data = re.sub(r'\&gt\;', ">", data)
    data = re.sub(r'\&lt\;', "<", data)
    data = re.sub(r'\&\#91\;', "[", data)
    data = re.sub(r'\&\#93\;', "]", data)
    data = re.sub(r'\&\#8211\;', "-", data)
    data = re.sub(r'\&amp\;', "&", data)
    data = re.sub(r'\<200c\>', "", data)
    data = re.sub(r'\<200b\>', "", data)
    return data
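
The chain of substitutions above can be condensed. As a minimal sketch (not the notebook's own code, and assuming Python 3 for `html.unescape`), one generic pattern strips anything that looks like a tag, and the standard library decodes the entities. The name `strip_markup` and the sample string are made up for illustration.

```python
import re
from html import unescape  # Python 3 standard library

# Matches anything of the form <...>, not only the tags listed in remove()
TAG_RE = re.compile(r'<[^>]+>')

def strip_markup(text):
    """Drop tag-like spans, decode entities, and flatten newlines to spaces."""
    without_tags = TAG_RE.sub('', text)
    return unescape(without_tags).replace('\n', ' ')

sample = '<p>Fix <tt>NullPointerException</tt> when x &lt; 0 &amp; y &gt; 0</p>'
cleaned = strip_markup(sample)
print(cleaned)  # Fix NullPointerException when x < 0 & y > 0
```

Tags are removed before the entities are decoded, so text such as `&lt;T&gt;` survives as `<T>` instead of being mistaken for a tag.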

Read titles, descriptions and comments from Spark issues.

In [None]:
def readInfoFromXML(root, toFile):
    with open(toFile, 'w') as fopen:
        # Titles: skip the first <title> element and strip the leading "[...] " prefix
        title = root.getElementsByTagName('title')
        for i, ti in enumerate(title):
            if i != 0:
                data = ti.firstChild.data
                data = re.sub(r'\[.*?\]\s', "", data)
                data = remove(data)
                fopen.write('%s\n' % (data.encode('utf-8')))
        # Descriptions: skip the first <description> element and any empty ones
        description = root.getElementsByTagName('description')
        for i, des in enumerate(description):
            if i != 0 and des.firstChild is not None:
                data = remove(des.firstChild.data)
                fopen.write('%s\n' % (data.encode('utf-8')))
        # Comments: skip empty <comment> elements, which have no text node
        comments = root.getElementsByTagName('comment')
        for com in comments:
            if com.firstChild is not None:
                data = remove(com.firstChild.data)
                fopen.write('%s\n' % (data.encode('utf-8')))
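
To see what `getElementsByTagName` returns, here is a small self-contained sketch written for Python 3. The XML layout (a feed-level `<title>` followed by one `<item>`) and the issue key `SPARK-0000` are assumptions made up for illustration, not taken from the real data.

```python
import re
import xml.dom.minidom

# Hypothetical miniature of an exported issue file (layout assumed for illustration)
SAMPLE = """<rss><channel>
<title>feed title</title>
<item>
<title>[SPARK-0000] Example issue</title>
<description>Something breaks</description>
<comments><comment>First comment</comment><comment/></comments>
</item>
</channel></rss>"""

root = xml.dom.minidom.parseString(SAMPLE).documentElement

# All <title> elements in document order; the first one is the feed-level title
titles = [t.firstChild.data for t in root.getElementsByTagName('title')]
# Drop the feed title, then strip the leading "[...] " prefix as readInfoFromXML does
issue_titles = [re.sub(r'\[.*?\]\s', '', t) for t in titles[1:]]

# Empty elements have no firstChild, so check for None before reading .data
comments = [c.firstChild.data
            for c in root.getElementsByTagName('comment')
            if c.firstChild is not None]

print(issue_titles)
print(comments)
```

`getElementsByTagName('comment')` matches on the exact tag name, so the enclosing `<comments>` element is not returned.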

Parse one XML file and write the extracted information to a file of the same name in the output directory.

In [None]:
def getInfoFromXML(fromDir, path):
    toDir = "../data/desAndCom/"
    fromFile = os.path.join(fromDir, path)
    toFile = os.path.join(toDir, path)
    #print fromFile, toFile, path
    dom = xml.dom.minidom.parse(fromFile)
    root = dom.documentElement
    readInfoFromXML(root, toFile)

Helper functions for listing, reading and writing files.

In [None]:
# Print all files of this dir 'filepath'
def eachFile(filepath):
    pathDir =  os.listdir(filepath)
    for allDir in pathDir:
        child = os.path.join('%s%s' % (filepath, allDir))
        print child.decode('gbk')

# Print the content of this file 'filename'
def readFile(filename):
    fopen = open(filename, 'r') 
    for eachLine in fopen:
        print "the content of this line", eachLine
    fopen.close()

# Write multiple lines to a specific file
def writeFile(filename):
    fopen = open(filename, 'w')
    while True:
        aLine = raw_input()
        if aLine != ".":
            fopen.write('%s%s' % (aLine, os.linesep))
        else:
            print "the file is saved"
            break
    fopen.close()

Extract the titles, descriptions and comments of all valid Spark issues, so that spaCy can process them.

In [None]:
def getUsefulInfo(filepath):
    pathDir =  os.listdir(filepath)
    subName = "SPARK"              # now only deal with the spark issue 
    invalidName = "invalid"
    numFiles = 0
    for allDir in pathDir:
        if subName in allDir:
            if invalidName in allDir:
                continue
            #print allDir
            getInfoFromXML(filepath, allDir)
            numFiles += 1
    return numFiles

filePath = "../data/spark-issues/"
#eachFile(filePath)
numFiles = getUsefulInfo(filePath)             #write all useful information to ../data/desAndCom
numFiles

## 2. Use spaCy for tokenization, lemmatization and stop-word removal

In [None]:
import spacy

This command loads the spaCy model, which might take a little while.

In [None]:
nlp = spacy.load('en_core_web_md')

Get all extracted files and print the number of files.

In [None]:
fileList =  os.listdir("../data/desAndCom/")
fileDir = "../data/desAndCom/"
files = []
for tempFile in fileList:
    if "ipynb" in tempFile:
        continue
    files.append(os.path.join('%s%s' % (fileDir, tempFile)))
print len(files)  
#print files

Parse the texts. These commands might take a little while. 

In [None]:
# Use spaCy to analyze these files
text_num = len(files)
text_array = []
for tempFile in files:
    with open(tempFile) as f:
        raw_data = f.read()
    text_array.append(nlp(raw_data.decode('utf-8')))
#print files[0]
#print text_array[0]

In [None]:
# Just for checking
#for token in text_array[0]:
#    print (token, token.lemma_, token.lemma, token.pos_, token.pos, token.is_stop)

Each spaCy document is already tokenized into words, which are accessible by iterating over the document. The next step prints one text after removing stop words, punctuation, brackets, etc. Note that capitalization and stemming are dealt with later.

In [None]:
for i, token in enumerate(text_array[0]):
    if token.is_punct or token.is_space or token.is_stop or token.pos_ == 'SYM':
        continue
    print (token, token.lemma_, token.lemma, token.pos_, token.pos, token.is_stop)

Write the intermediate results to files.

In [None]:
fileList = os.listdir("../data/desAndCom/")
fileDir = "../data/clearFiles/"
files = []
for tempFile in fileList:
    if "ipynb" in tempFile:
        continue
    files.append(os.path.join('%s%s' % (fileDir, tempFile)))

for j in range(text_num):
    fopen = open(files[j], 'w')
    for i, token in enumerate(text_array[j]):
        if token.is_punct or token.is_space or token.is_stop or token.pos_ == 'SYM':   # also skip symbols, such as "="
            continue
        fopen.write(token.lemma_.encode('utf8') + " ")   # reduce each word to its lemma (root form)
        #print token.lemma_
    fopen.close()

## 3. Use Scikit-Learn to analyze PRs and issues

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

from collections import Counter
from glob import glob
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline


Reload the cleaned files and use the spaCy model to analyze them.

In [None]:
text_num = len(files)
text_array = []
for tempFile in files:
    with open(tempFile) as f:
        raw_data = f.read()
    text_array.append(nlp(raw_data.decode('utf-8')))
#print files[0]
#print text_array[0]

### 3.1.  The percentage of different POS (Parts of Speech) in all issues

Each word already has a part of speech and a tag associated with it. It's fun to compare the distribution of parts of speech in all issues.

In [None]:
# Get the total number of all words
totalWords = 0
for i, tempFile in enumerate(files):
    totalWords += len(text_array[i])
#print totalWords

# Get the total number of different POS
totalType = {}
for i, tempFile in enumerate(files):
    typeMap = text_array[i].count_by(spacy.attrs.POS)
    for obj in typeMap.items():
        if obj[0] in totalType:
            totalType[obj[0]] += obj[1]
        else:
            totalType[obj[0]] = obj[1]

# Test
#for obj in totalType.items():       
#    print obj
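
The accumulation loop above can also be written with `collections.Counter`. The sketch below (Python 3) uses made-up per-document POS counts in place of `count_by` output. Note that `count_by` maps integer POS IDs to counts; string tags are used here only for readability.

```python
from collections import Counter

# Hypothetical per-document POS counts standing in for Doc.count_by(spacy.attrs.POS)
per_doc_counts = [
    {'NOUN': 6, 'VERB': 3, 'ADJ': 1},
    {'NOUN': 4, 'VERB': 2, 'NUM': 4},
]

total_type = Counter()
for counts in per_doc_counts:
    total_type.update(counts)   # adds the counts key by key

# Every token has exactly one POS, so the counts sum to the number of tokens
total_words = sum(total_type.values())
percentages = {pos: n / total_words for pos, n in total_type.items()}
print(percentages)
```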

The horizontal axis shows the parts of speech, and the vertical axis shows the percentage of words that belong to each of them. Nouns, adjectives, numerals and verbs account for the largest share.

In [None]:
# Set the data of the figure
textPOS = [] 
textPOS.append(pd.Series(totalType) / totalWords)     # the sequence of POS percentages
#print textPOS

# Set the tag in the X axis
tagDict = {}
for i, tempFile in enumerate(files):
    for w in text_array[i]:
        tagDict[w.pos] = w.pos_

rcParams['figure.figsize'] = 16, 8
df = pd.DataFrame([textPOS[0]], index=['Spark'])                        # the figure configuration
df.columns = [tagDict[column] for column in df.columns]                 # the columns configuration
df.T.plot(kind='bar')

In [None]:
# Use the percentage of different POS in two issues to draw a picture
#textPOS = [] * text_num
#for i, tempFile in enumerate(files):
#    textPOS.append(pd.Series(text_array[i].count_by(spacy.attrs.POS))/len(text_array[i]))     # the sequence of POS percentages
#rcParams['figure.figsize'] = 16, 8
#df = pd.DataFrame([textPOS[0], textPOS[1]], index=['firstText', 'secondText'])  # the figure configuration
#df.columns = [tagDict[column] for column in df.columns]                 # the columns configuration
#df.T.plot(kind='bar')

### 3.2.  The TF-IDF of all documents

This section uses a non-semantic technique for vectorizing documents: a plain bag-of-words model. We won't need any of the fancy features of spaCy for this, just scikit-learn. We vectorize the corpus using scikit-learn's TfidfVectorizer class, which creates a matrix with one row per document and one column per word, holding the TF-IDF weight of that word in that document.

In [None]:
# First, we'll vectorize the corpus using scikit-learn's TfidfVectorizer class.
tfidf = TfidfVectorizer(input='filename', decode_error='ignore', use_idf=True)
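
To make the weighting concrete, here is a pure-Python sketch (Python 3) of what `TfidfVectorizer` computes with its default settings: raw term counts, a smoothed IDF, and L2 normalization of each row. The three toy documents are invented, and tokenization is reduced to `str.split`, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical, already cleaned documents
toy_docs = [
    "executor lose task",
    "task fail executor restart",
    "shuffle file lose",
]
tokenized = [d.split() for d in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, scikit-learn's default (smooth_idf=True)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # L2-normalize so that each document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print(sorted(row.items(), key=lambda kv: kv[1], reverse=True))
```

In the second document, "fail" and "restart" occur in no other document and therefore receive a higher weight than "task" and "executor", which each appear in two documents.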

In [None]:
testFilenames = sorted(glob('../data/clearFiles/*'))
#print testFilenames

# While we're at it, let's make a list of the lengths, so we can use them to plot dot sizes. 
lengths = [len(open(filename).read())/100 for filename in testFilenames]
#print lengths

# A string of matplotlib color codes ('r' = red, 'b' = blue) for coloring the dots.
# Only its first entry is used in Section 3.4, so every dot is drawn in the same color.
parties = 'rrrbbrrrbbbbbrrbbrrbrrrbbrrbrrrrbbrrrbbbbbrrbbrrbrrrbbrrbrrrrbbrrrbbbbbrrbbrrbrrrbbrrbrrrrbbrrrbbbbbrrbbrrbrrrbbrrbr'

In [None]:
tfidfOut = tfidf.fit_transform(testFilenames)
tfidfOut

Because the vocabulary is large, we only print the top 5 TF-IDF words of each issue.

In [None]:
feature_names = tfidf.get_feature_names()
#print feature_names

#print the TFIDF of two articles
#print '\n------------------------ the TFIDF of all words of two issues ---------------------------------'
temp = 0
feature_index = tfidfOut[temp,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfidfOut[temp, x] for x in feature_index])
#for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
#    print w, s
#print '-----------------------------------------------------------------------'
temp = 1
feature_index = tfidfOut[temp,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfidfOut[temp, x] for x in feature_index])
#for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
#    print w, s

print '\n------------------------ the top 5 TF-IDF words of all issues ---------------------------------'
for i, tfid in enumerate(tfidfOut):
    feature_index = tfidfOut[i,:].nonzero()[1]
    tfidf_scores = zip(feature_index, [tfidfOut[i, x] for x in feature_index])
    sorted_l=sorted(tfidf_scores,key=lambda t:t[1], reverse=True)  
    print "the %dth Spark issue:" % i
    numOut = 0
    for w, s in [(feature_names[j], s) for (j, s) in sorted_l]:
        print (w, s), 
        numOut += 1
        if numOut >= 5:
            print "\n"
            break
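
Sorting every row in full is not required when only a few terms are wanted. A small sketch (Python 3) with `heapq.nlargest`, using an invented term-to-score mapping in place of one row of `tfidfOut`; the helper name `top_terms` is hypothetical.

```python
import heapq

# Hypothetical TF-IDF scores for one issue (term -> weight)
scores = {'executor': 0.41, 'shuffle': 0.62, 'task': 0.18,
          'memory': 0.35, 'driver': 0.27, 'partition': 0.48, 'stage': 0.09}

def top_terms(term_scores, k=5):
    """Return the k highest-scoring (term, score) pairs, best first."""
    return heapq.nlargest(k, term_scores.items(), key=lambda item: item[1])

for term, score in top_terms(scores):
    print(term, score)
```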

The shape of the TF-IDF matrix is (number of issues, number of distinct words). For example, (93, 3005) means 93 issues and 3005 words.

In [None]:
tfidfOut.shape

### 3.3.  The term frequency distributions

We're simply going to count the occurrences of words and divide by the total number of words in the document.

In [None]:
# Make labels by removing the directory name and .txt/.xml extension: 
labels = [filename.split('/')[3] for filename in testFilenames]
labels = [filename.split('.')[0] for filename in labels]
#print labels

# We're simply going to count the occurrences of words and divide by the total number of words in the document.
doc_raw = [open(doc).read() for doc in testFilenames]
inaugural = [nlp(doc.decode("utf-8")) for doc in doc_raw]

# Create a Pandas Data Frame with each word counted in each document, divided by the length of the document. 
inauguralSeries = [pd.Series(Counter([word.string.strip().lower() for word in doc])) / len(doc) for doc in inaugural]
seriesDict = {label: series for label, series in zip(labels, inauguralSeries)}
inauguralDf = pd.DataFrame(seriesDict).T.fillna(0)
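
The same computation without spaCy or Pandas, as a sketch (Python 3) on two invented documents: count each lower-cased word, divide by the document length, and treat words missing from a document as frequency 0, which is the role `fillna(0)` plays above.

```python
from collections import Counter

# Hypothetical cleaned documents, keyed by label
sample_docs = {
    'issue-a': 'Executor lose task executor',
    'issue-b': 'key task fail',
}

def relative_frequencies(text):
    words = text.lower().split()
    counts = Counter(words)
    return {word: n / len(words) for word, n in counts.items()}

freqs = {label: relative_frequencies(text) for label, text in sample_docs.items()}
vocabulary = sorted(set(w for f in freqs.values() for w in f))

# One row per document, one column per word; absent words get 0
table = {label: [f.get(w, 0) for w in vocabulary] for label, f in freqs.items()}
print(vocabulary)
for label, row in table.items():
    print(label, row)
```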

We can now inspect the frequency of each word in the first 5 documents.

In [None]:
inauguralDf.head()
# To see the frequency of each word in all documents, evaluate the whole frame:
# inauguralDf

We can easily slice this data frame with words we're interested in, and plot those words across the corpus. For example, let's look at the proportions of the words "important", "key" and "lose":

In [None]:
inauguralDf[['important', 'key', 'lose']].plot(kind='bar')

We can even compute, say, the ratio of uses of the word "master" to uses of the word "class".

In [None]:
#americaWorldRatio = inauguralDf['master']/inauguralDf['class']
#americaWorldRatio.plot(kind='bar')

### 3.4.  The issue classification

In [None]:
tfidfOut[0].shape

Because each document's TF-IDF vector is 3005-dimensional, it helps to reduce the dimensionality to the two most meaningful dimensions before plotting in 2D. We can use Scikit-Learn to perform truncated singular value decomposition (SVD), which is known as latent semantic analysis (LSA) when applied to a TF-IDF matrix.

In [None]:
lsa = TruncatedSVD(n_components=2)
# TruncatedSVD works on the sparse TF-IDF matrix directly, so no dense copy is needed
lsaOut = lsa.fit_transform(tfidfOut)

In [None]:
# Plot all documents in the 2D LSA space; dot size reflects document length
xs, ys = lsaOut[:,0], lsaOut[:,1]
for i in range(len(xs)): 
    plt.scatter(xs[i], ys[i], c=parties[0], s=lengths[i], alpha=0.5)
    plt.annotate(labels[i], (xs[i], ys[i]))

### 3.5.  The document similarity matrix

Using spaCy's .similarity() method, which compares documents by their word vectors, we can very easily compute the similarity between all pairs of documents in our corpus.

In [None]:
# Document Similarity Matrix
similarities = [ [doc.similarity(other) for other in inaugural] for doc in inaugural ]
similaritiesDf = pd.DataFrame(similarities, columns=labels, index=labels)
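
By default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' vectors, each of which is the average of its token vectors. The sketch below (Python 3) reproduces that computation in plain Python on invented 3-dimensional vectors, so it runs without a model. Real vectors come from `doc.vector` and have many more dimensions.

```python
import math

# Hypothetical document vectors (real ones come from doc.vector)
doc_vectors = {
    'issue-a': [1.0, 0.0, 1.0],
    'issue-b': [1.0, 0.0, 0.0],
    'issue-c': [0.0, 2.0, 0.0],
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

names = sorted(doc_vectors)
matrix = [[cosine(doc_vectors[r], doc_vectors[c]) for c in names] for r in names]
for name, row in zip(names, matrix):
    print(name, [round(x, 3) for x in row])
```

The matrix is symmetric with ones on the diagonal, which is why the heatmap of `similaritiesDf` mirrors itself across the diagonal.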

Both the horizontal axis and the vertical axis list all issues; the deeper the color of a cell, the more similar the two issues are.

In [None]:
# Requires the Seaborn library. 
rcParams['figure.figsize'] = 16, 8
seaborn.heatmap(similaritiesDf)

In [None]:
# Get the top PROPN words
#firstAdjs = [w for w in first if w.pos_ == 'PROPN']
#Counter([w.string.strip() for w in firstAdjs]).most_common(10)