%%HTML
<link rel="stylesheet" type="text/css" href="custom.css">
<link rel="stylesheet" type="text/css" href="pandas-table.css">

# Topic modelling with GitHub pull requests

Software development today is a complex social activity that generates a large amount of unstructured documentation. Over the past 15 years, there has been an explosion of empirical research in software engineering exploring questions about the software development process, partly motivated by the availability of data from sites like GitHub. Typical questions include:
 
- Who should fix this bug? 
- How do you find documentation about a bug? 

The documentation of a software project is generated dynamically, across multiple documents, as the project grows.
Pull requests are a very valuable source of documentation, written collaboratively by the development team when they want to integrate a new change into the project. A pull request is a __short, unstructured text__, which makes it difficult to manage.
One social network technique for managing this kind of __short documentation__ is to define a set of topics, each identified by a keyword, usually called a label or tag. Every __short document__ (a pull request, in this case) is then tagged with a label. The labels are chosen by the participants of the social network over the course of their interactions, and they apply one to each new interaction.

This work aims to label pull requests by applying topic modelling, an unsupervised machine learning technique.


## Researchers
 - César Ignacio García Osorio
 - Mario Juez Gil
 - Carlos López Nozal
 - Álvar Arnaiz González

## References 
- [Topic modeling evaluation](https://datascience.blog.wzb.eu/2017/11/09/topic-modeling-evaluation-in-python-with-tmtoolkit/): likelihood and perplexity; evaluating the density or divergence of the posterior distributions.
- [Perplexity To Evaluate Topic Models](http://qpleple.com/perplexity-to-evaluate-topic-models/)
- [Python tutorial topic model](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/): choosing a better value of K (the number of topics). The tutorial builds a basic topic model with Gensim's LDA and visualizes the topics with pyLDAvis, then builds a model with Mallet's LDA implementation. It shows how to find the optimal number of topics using coherence scores, and how to aggregate and present the results to generate more actionable insights.
- [Paper: Latent Dirichlet Allocation](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
- [Wikipedia: Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocatio)

## Index
1. [Pull request dataset](#dataset)
2. [Load data file dataset](#files)
3. [Dataset Tokenization](#tokenization)
4. [Dataset normalization, stemming and lemmatization](#normalization)
5. [LDAModel](#ldamodel)

<a id='dataset'></a>
## Pull Request dataset
### Definitions
- [What is a pull request in the software development process?](https://github.com/features)
- [An example of a commented pull request](https://github.com/google/WebFundamentals/pull/4136)
### GitHub repositories
Our dataset comes from several trending GitHub repositories and is current as of July 2017:
- ChartJS
- AngularJS
- CakePHP
- Play Framework
- WebFundamentals
- ElasticSearch

### Data set structure
We concatenate the following data attributes into a single attribute named `pull_request`:
- repository_owner
- repository_name
- repository_language
- pull_request_title
- pull_request_body 

<a id='files'></a>
## Load data file dataset

In [1]:
import os
import glob
import pandas as pd
# Read the CSV files
def loadCsvPullRequestFolder(path):
    """Load pull request data from every CSV file in `path` and return
    (number of files, number of pull requests, list of pull requests)."""
    _lprbt = []
    _totalfile = 0
    for filename in glob.glob(os.path.join(path, '*.csv')):
        print(filename)
        # on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False instead
        df2 = pd.read_csv(filename, on_bad_lines='skip', index_col=False, dtype='unicode')
        # Concatenate owner, name, language, body and title into one text per pull request
        df2["pull_request"] = df2["repository_owner"].map(str) + " " + \
        df2["repository_name"].map(str) + " " + df2["repository_language"].map(str) + " " + \
        df2["pull_request_body"].map(str) + " " + df2["pull_request_title"].map(str)
        _lprbt.extend(df2["pull_request"])
        _totalfile += 1
    return _totalfile, len(_lprbt), _lprbt

totalfiles,totalinstances,lprbt=loadCsvPullRequestFolder(path="./datasets/pullrequest")   
print("Number of files: {} Number of instances: {}".format(totalfiles,totalinstances))


./datasets/pullrequest/reviews_Leaflet_processed.csv
./datasets/pullrequest/reviews_playframework_processed.csv
./datasets/pullrequest/reviews_angular.js_processed.csv
./datasets/pullrequest/reviews_WebFundamentals_processed.csv
./datasets/pullrequest/reviews_appium_processed.csv
./datasets/pullrequest/reviews_Chart.js_processed.csv
./datasets/pullrequest/reviews_cakephp_processed.csv
./datasets/pullrequest/reviews_elasticsearch_processed.csv
Number of files: 8 Number of instances: 4303


<a id='tokenization'></a>
## Dataset Tokenization

In [2]:
totalfiles,totalinstances,lprbt=loadCsvPullRequestFolder(path="./datasets/pullrequest")
# Clean the text

import re
def textNormalization(lpr):
    # Collapse every run of whitespace (spaces, tabs, newlines) into a single space
    lpr = [re.sub(r'\s+', ' ', pr) for pr in lpr]
    #TODO remove \\n \\r
    #lpr=[re.sub('[/\\n/\\n]',' ',pr) for pr in lpr]
    return lpr

lprbt=textNormalization(lprbt)


    

./datasets/pullrequest/reviews_Leaflet_processed.csv
./datasets/pullrequest/reviews_playframework_processed.csv
./datasets/pullrequest/reviews_angular.js_processed.csv
./datasets/pullrequest/reviews_WebFundamentals_processed.csv
./datasets/pullrequest/reviews_appium_processed.csv
./datasets/pullrequest/reviews_Chart.js_processed.csv
./datasets/pullrequest/reviews_cakephp_processed.csv
./datasets/pullrequest/reviews_elasticsearch_processed.csv


In [3]:
import gensim
from gensim.utils import simple_preprocess

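def pr_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenizes and drops punctuation; deacc=True also strips accent marks
        yield simple_preprocess(str(sentence), deacc=True)

prwords = list(pr_to_words(lprbt))
# Note: len(prwords) counts tokenized pull requests (one token list each), not individual tokens
print("Numbers of tokens in pullrequest: {} ".format(len(prwords)))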

Numbers of tokens in pullrequest: 4303 


<a id='normalization'></a>
## Dataset stopwords, stemming and lemmatization

In [4]:
import gensim
from gensim.models import CoherenceModel
# Build the bigram and trigram models
bigram = gensim.models.Phrases(prwords, min_count=5, threshold=500) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[prwords], threshold=500)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[prwords[1]]])



['leaflet', 'leaflet', 'javascript', 'nan', 'fix', 'webpack', 'using', 'valid', 'image', 'file', 'for', 'default', 'icon', 'path']
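
The `Phrases` models above decide which adjacent word pairs to merge by scoring each pair and keeping only those that score above `threshold`. The sketch below illustrates that idea in plain Python, assuming the scoring formula that, to our understanding, Gensim applies by default: `(count(a, b) - min_count) * vocabulary_size / (count(a) * count(b))`. The toy corpus and the function name `score_bigrams` are hypothetical, and this is not Gensim's actual implementation.

```python
from collections import Counter

def score_bigrams(docs, min_count=1):
    """Score adjacent word pairs: (count(a,b) - min_count) * |V| / (count(a) * count(b))."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Hypothetical toy corpus: "pull request" always appears together,
# every other pair occurs only once
toy_docs = [
    ["fix", "pull", "request", "title"],
    ["merge", "pull", "request", "fix"],
    ["fix", "typo", "pull", "request"],
]
scores = score_bigrams(toy_docs, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 2))
```

Pairs that occur no more often than `min_count` score zero, and a higher `threshold` keeps fewer pairs.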


In [6]:
# NLTK Stop words
import nltk
from nltk.corpus import stopwords
import gensim

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    nltk.download('stopwords')
    stop_words = stopwords.words('english')
    stop_words.extend(['\\n\\n', '\\n\\r'])
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

# spacy for lemmatization
import spacy
def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Keep only the lemmas of tokens whose part-of-speech tag is in allowed_postags.
    Tag set: https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Remove Stop Words
data_words_nostops = remove_stopwords(prwords)
# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)
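# Initialize the spaCy English model, disabling the parser and NER (only the tagger is needed)
# Note: the 'en' shortcut works on spaCy 2.x; on spaCy 3+ load 'en_core_web_sm' instead
#!python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])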

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print("Numbers of tokens in pullrequest after normalization: {} ".format(len(data_lemmatized)))
#print(data_lemmatized[:1])

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/clopezno/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Numbers of tokens in pullrequest after normalization: 4303 


<a id='ldamodel'></a>
## Topic Models
[Gensim tutorial](https://radimrehurek.com/gensim/tut2.html)

### Vector Space Model algorithms
- __Term Frequency__: counts how many times each word occurs in a document (`doc2bow`).
- __Inverse Document Frequency__: Tf-Idf expects a bag-of-words (integer-valued) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality, except that features which were rare in the training corpus have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.
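
As a rough illustration of the two steps above, the following plain-Python sketch builds bag-of-words counts and then re-weights them with one common Tf-Idf variant, `tf * log2(N / df)`, followed by normalization to unit length. The toy documents and the helper names (`bag_of_words`, `tfidf`) are made up for the example, and Gensim's `TfidfModel` may differ in its details.

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Term frequency: how many times each word occurs in the document."""
    return Counter(doc)

def tfidf(docs):
    """Weight each term count by log2(N / document frequency), then L2-normalize."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        vec = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in bag_of_words(doc).items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        weighted.append({w: v / norm for w, v in vec.items()} if norm else vec)
    return weighted

# Hypothetical toy documents (already tokenized)
toy_docs = [
    ["fix", "icon", "path", "fix"],
    ["fix", "test"],
    ["update", "docs", "fix"],
]
weights = tfidf(toy_docs)
for vec in weights:
    print({w: round(v, 3) for w, v in vec.items()})
```

The word `fix` occurs in every toy document, so its weight drops to zero, while rarer words such as `icon` gain weight relative to it.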

### Generating the dictionary and corpus from the preprocessed pull requests

In [7]:
import gensim.corpora as corpora
import gensim.models.tfidfmodel as tfidmodel

def createCorpus(data_lemmatized):
    id2word = corpora.Dictionary(data_lemmatized)
    # Create Corpus
    texts = data_lemmatized
    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in texts]
    return corpus,id2word

def createCorpusTfid(corpus):
    # Fit a Tf-Idf model on the bag-of-words corpus and return the re-weighted corpus
    tfidf = tfidmodel.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    return corpus_tfidf




def printHumanCorpus(corpus, id2word, instance):
    """Print the corpus length and a human-readable view of one instance.
    instance: 1-based index (int) of the pull request"""
    print("Corpus length {}".format(len(corpus)))
    print("Instance corpus {}".format(corpus[instance-1:instance]))
    print("Instance corpus (human readable)")
    for cp in corpus[instance-1:instance]:
        for word_id, freq in cp:
            print(id2word[word_id], freq)


corpus,id2word= createCorpus(data_lemmatized)
#printHumanCorpus(corpus,id2word,3)
tfidf_corpus= createCorpusTfid(corpus)
#printHumanCorpus(tfidf_corpus,id2word,3)

#serialize corpus
corpora.MmCorpus.serialize("./models/prcorpus.mm", corpus)
corpora.MmCorpus.serialize("./models/tfidfprcorpus.mm", tfidf_corpus)

# Save the dictionary
dictionary = corpora.Dictionary(data_lemmatized)
# Relative path, like the other files saved under ./models
dictionary.save('./models/dict_pullrequets')
dictionary.save_as_text('./models/dict_pullrequets.txt')

# Making a library to preprocess data
All steps are wrapped in a single method, `preprocessData`:
- textNormalization (not implemented yet)
- stop word removal
- bigrams
- part-of-speech tagging, keeping ['NOUN', 'ADJ', 'VERB', 'ADV']
- lemmatization

In [1]:
import LibraryTopicModel as ltm
## process all csv in directory
path="./datasets/pullrequest"
## process only one csv
#path="./datasets/pullrequest/reviews_cakephp_processed.csv"
data_lemmatized=ltm.preprocessData(path)
corpus,id2word=ltm.createCorpus(data_lemmatized)

./datasets/pullrequest/reviews_Leaflet_processed.csv
./datasets/pullrequest/reviews_playframework_processed.csv
./datasets/pullrequest/reviews_angular.js_processed.csv
./datasets/pullrequest/reviews_WebFundamentals_processed.csv
./datasets/pullrequest/reviews_appium_processed.csv
./datasets/pullrequest/reviews_Chart.js_processed.csv
./datasets/pullrequest/reviews_cakephp_processed.csv
./datasets/pullrequest/reviews_elasticsearch_processed.csv
Number of files: 8 Number of instances: 4303
Numbers of tokens in pullrequest: 283706 




Numbers of tokens nostops in pullrequest: 191337 
Numbers of tokens in pullrequest after lemmatization: 177226 


### Creating topic models and saving them in the directory ./models

### Random Projections
Random Projections (RP) aim to reduce the dimensionality of the vector space. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. The recommended target dimensionality is in the hundreds or thousands, depending on your dataset.
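
To make the idea concrete, here is a small standard-library sketch of a random projection: each document vector is multiplied by a fixed random matrix with entries of ±1/sqrt(k), which maps it from the vocabulary-sized space down to k dimensions. This is a generic illustration, not Gensim's `RpModel` code; the function names, the seed and the toy vectors are assumptions.

```python
import math
import random

def make_projection(n_terms, k, seed=0):
    """Build a k x n_terms matrix whose entries are +1/sqrt(k) or -1/sqrt(k) at random."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.choice((-scale, scale)) for _ in range(n_terms)] for _ in range(k)]

def project(sparse_doc, projection):
    """Project a sparse (term_id, weight) document onto the k random directions."""
    return [sum(row[term_id] * weight for term_id, weight in sparse_doc) for row in projection]

# Hypothetical bag-of-words vectors over a 1000-term vocabulary
doc_a = [(3, 2.0), (17, 1.0), (256, 1.0)]
doc_b = [(3, 1.0), (17, 1.0), (900, 3.0)]

projection = make_projection(n_terms=1000, k=10)
low_a = project(doc_a, projection)
low_b = project(doc_b, projection)
print(len(low_a), len(low_b))
```

Both documents end up as dense vectors of length k, regardless of the vocabulary size, and the same matrix must be reused for every document so that distances stay comparable.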

In [8]:
import gensim.models.rpmodel as rpmodel
# Build Random Projections model
#rpmodelpr = rpmodel.RpModel(tfidf_corpus, num_topics=20)
rpmodelpr = rpmodel.RpModel(corpus, num_topics=10)
#rpmodelpr.print_topics(2)

rpmodelpr.save("./models/rpmodelpr")

### Latent Dirichlet Allocation
LDA is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA’s topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).
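
The "soft mixture" view can be illustrated with the generative story that LDA assumes: each document draws its own topic proportions from a Dirichlet distribution, and each word is produced by first picking a topic from those proportions and then picking a word from that topic's word distribution. The sketch below simulates this with the standard library only; the two topics, their word probabilities and the `alpha` value are invented for the example, and it does not perform the inference that `LdaModel` does.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document: for each word, pick a topic, then a word from that topic."""
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(len(topics)), weights=mixture)[0]
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

# Hypothetical topics: each one is a probability distribution over words
topics = [
    {"test": 0.5, "fix": 0.3, "bug": 0.2},
    {"docs": 0.6, "readme": 0.3, "typo": 0.1},
]
rng = random.Random(42)
mixture, words = generate_document(topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 2) for p in mixture], words)
```

Training an LDA model runs this story in reverse: given only the observed words, it estimates the per-topic word distributions and the per-document mixtures.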


In [9]:
# Build LDA model
import gensim.models.ldamodel as ldamodel
ldamodelpr = ldamodel.LdaModel(corpus, id2word=id2word, num_topics=10)
ldamodelpr.print_topics(2)
ldamodelpr.save("./models/ldamodelpr")

##lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
##                                           id2word=id2word,
##                                           num_topics=5, 
##                                           random_state=100,
##                                           update_every=1,
##                                           chunksize=100,
##                                           passes=10,
##                                           alpha='auto',
##                                           per_word_topics=True)

### Hierarchical Dirichlet Process
__HDP__ is a non-parametric Bayesian method (note that no number of topics is requested):

In [10]:
# Build Hdp model
import gensim.models.hdpmodel as hdpmodel
hpdmodelpr = hdpmodel.HdpModel(corpus, id2word=id2word)
hpdmodelpr.print_topics(2)
hpdmodelpr.save("./models/hpdmodelpr")

### Latent Semantic Indexing
LSI (or sometimes LSA) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality.
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!



In [11]:
 # Build LSI model
import gensim.models.lsimodel as lsimodel   
#lsimodelpr = lsimodel.LsiModel(tfidf_corpus, id2word=id2word, num_topics=20)
lsimodelpr = lsimodel.LsiModel(corpus, id2word=id2word, num_topics=10)
lsimodelpr.print_topics(2)
lsimodelpr.save("./models/lsimodelpr")

### LDA mallet version
So far you have seen Gensim’s inbuilt version of the LDA algorithm. Mallet’s version, however, often gives a better quality of topics.

Gensim provides a wrapper to run Mallet's LDA from within Gensim itself. You only need to download the zip file, unzip it and pass the path to the `mallet` binary in the unzipped directory to `gensim.models.wrappers.LdaMallet`. Note that the `gensim.models.wrappers` package is only available in Gensim versions before 4.0.
[Download Mallet 2.0.8](https://www.machinelearningplus.com/wp-content/uploads/2018/03/mallet-2.0.8.zip)

In [3]:
import gensim.models.wrappers.ldamallet as ldamallet

mallet_path="./mallet-2.0.8/bin/mallet"
ldamalletmodelpr = ldamallet.LdaMallet(mallet_path, corpus, num_topics=10, id2word=id2word)
ldamalletmodelpr.print_topics(2)
ldamalletmodelpr.save("./models/ldamalletmodelpr")