## GA Data Science Final Project - 3 - NLP

#### Import libraries

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

#### Import the data and convert datatypes to `str`

In [3]:
df = pd.read_csv('issue_comments_jupyter_copy.csv')
df['org'] = df['org'].astype('str')
df['repo'] = df['repo'].astype('str')
df['comments'] = df['comments'].astype('str')
df['user'] = df['user'].astype('str')

### Scikit Learn Count Vectorizer
##### First, save the Series 'comments' to the variable `comments`.

In [4]:
comments = df.comments

In [5]:
comments.head()

0                                           Thanks !\n
1    Oops. i got it. I have to uninstall ipython3 a...
2                                         same issue\n
3    FWIW a workaround is to share from Google Driv...
4    At some point, I'll probably hack on a Rethink...
Name: comments, dtype: object

*comments* is a long column with many rows. I want to use CountVectorizer on all of the comments to count every word used, so I'll concatenate the comments into one long string, using a space as the separator, then save that string to a text file and also store it in the variable **all_comments**.

CountVectorizer documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

which states: "Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix. If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data."
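As a rough illustration of what "a matrix of token counts" means, the sketch below counts tokens using only the standard library. It assumes the CountVectorizer defaults shown in this notebook (lowercasing and the token pattern `(?u)\b\w\w+\b`, i.e. tokens of two or more word characters). The `sample_comments` list and the `count_tokens` helper are made up for the example and are not part of scikit-learn.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: tokens are runs
# of two or more word characters, so single letters such as "a" are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(documents):
    """Count tokens across all documents, lowercasing first (hypothetical helper)."""
    counts = Counter()
    for doc in documents:
        counts.update(TOKEN_PATTERN.findall(doc.lower()))
    return counts


# Made-up comments standing in for the real `comments` Series.
sample_comments = ["Thanks !", "same issue", "I have the same issue, thanks"]
token_counts = count_tokens(sample_comments)
print(token_counts.most_common(3))
```

Note that the one-letter token "I" never reaches the counts, which is also why single characters are missing from the vocabulary that CountVectorizer learns by default.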

##### Second, concatenate all rows in `comments` into a single string.

In [6]:
# Join every comment into one space-separated string
all_comments = comments.str.cat(sep=' ')

# Save the combined text to disk
with open('all_comments.txt', 'wb') as fd:
    fd.write(all_comments)

In [7]:
cvec = CountVectorizer()

cvec.fit([all_comments])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [8]:
df2 = pd.DataFrame(cvec.transform([all_comments]).todense(), columns=cvec.get_feature_names())

In [9]:
df2.transpose().sort_values(0, ascending=False).head(25)

          0
the   37966
to    25787
it    15234
and   13353
that  13045
is    12484
in    12396
this  11209
of    10136
for    9383


The output shows a multitude of stop words that we can take out later.

#### Frequency Distribution Curve

In [10]:
import nltk
fdist1 = nltk.FreqDist(df2)
fdist1

vocabulary1 = fdist1.keys()
vocabulary1[:50]

[u'a6e74c7c',
 u'hanging',
 u'trawling',
 u'test_nbgrader_export',
 u'2294823',
 u'scoll',
 u'googlenet',
 u'_populate_',
 u'bab1ddb196e0',
 u'14kb',
 u'uncaughterror',
 u'kaspersky',
 u'_kw',
 u'bringing',
 u'externalshell',
 u'xdg_data_home',
 u'wednesday',
 u'330mb',
 u'48',
 u'tdzero',
 u'a7wv3fn9yvgwr02kd6azhgx4bqcdip6',
 u'270',
 u'271',
 u'272',
 u'273',
 u'274',
 u'275',
 u'276',
 u'277',
 u'278',
 u'279',
 u'fastcgi_temp_path',
 u'scraped',
 u'errors',
 u'dialogs',
 u'cooking',
 u'designing',
 u'17794439',
 u'numeral',
 u'cc2ec89ef65d82792adcab63f0300a8d',
 u'widget',
 u'11b87e3c',
 u'pdftexcmds',
 u'2f8032c541f5',
 u'njsmith',
 u'statupdelay',
 u'abnc9nxza1cptisenr2e7lu2i3vglgaoks5rrsmmgajpzm4lwwew',
 u'affiliated',
 u'chink',
 u'kids']

From Bird, Klein, and Loper (2009):
"When we first invoke FreqDist, we pass the name of the text as an argument. We can inspect the total number of words ('outcomes') that have been counted up. The expression keys() gives us a list of all the distinct types in the text, and we can look at the first 50 of these by slicing the list."

In [11]:
fdist1.plot(50, cumulative=True)



<matplotlib.figure.Figure at 0x117ce2950>

In [12]:
fdist1.hapaxes()

[u'a6e74c7c',
 u'hanging',
 u'trawling',
 u'test_nbgrader_export',
 u'2294823',
 u'scoll',
 u'googlenet',
 u'_populate_',
 u'bab1ddb196e0',
 u'14kb',
 u'uncaughterror',
 u'kaspersky',
 u'_kw',
 u'bringing',
 u'externalshell',
 u'xdg_data_home',
 u'wednesday',
 u'330mb',
 u'48',
 u'tdzero',
 u'a7wv3fn9yvgwr02kd6azhgx4bqcdip6',
 u'270',
 u'271',
 u'272',
 u'273',
 u'274',
 u'275',
 u'276',
 u'277',
 u'278',
 u'279',
 u'fastcgi_temp_path',
 u'scraped',
 u'errors',
 u'dialogs',
 u'cooking',
 u'designing',
 u'17794439',
 u'numeral',
 u'cc2ec89ef65d82792adcab63f0300a8d',
 u'widget',
 u'11b87e3c',
 u'pdftexcmds',
 u'2f8032c541f5',
 u'njsmith',
 u'statupdelay',
 u'abnc9nxza1cptisenr2e7lu2i3vglgaoks5rrsmmgajpzm4lwwew',
 u'affiliated',
 u'chink',
 u'kids',
 u'xkcd_plots',
 u'pytables',
 u'deferring',
 u'monospaced',
 u'localinterfaces',
 u'f8229c9fb6ab5a5135f67ea253e6833cebd761f7',
 u'appropriately',
 u'8e56',
 u'disable_check_xsrf',
 u'029722c8',
 u'6a0be7783599',
 u'506602',
 u'9bedf33b867d',


Neither the most frequent words (mostly stop words) nor the infrequent ones (hashes, numbers, and identifiers) are informative on their own, so we need to try something else. This is starting to look like a very noisy data set.
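One possible reason every word shows up as a hapax above: iterating over a DataFrame yields its column labels, so `nltk.FreqDist(df2)` appears to count each vocabulary word exactly once instead of using the counts stored in the row. The sketch below shows the difference with `collections.Counter` (which `FreqDist` subclasses in NLTK 3). The small `counts_row` dict is made-up data standing in for the single row of `df2`, and the `hapaxes` helper is hypothetical.

```python
from collections import Counter

# Made-up stand-in for the single row of `df2`: word -> count in the corpus.
counts_row = {"the": 5, "notebook": 3, "kernel": 1, "a6e74c7c": 1}

# Counting the *labels* (what iterating over a DataFrame's columns gives):
# every word is seen once, so every word looks like a hapax.
label_counts = Counter(counts_row.keys())

# Counting with the stored frequencies instead.
true_counts = Counter(counts_row)


def hapaxes(counter):
    """Return the words that occur exactly once (hypothetical helper)."""
    return sorted(word for word, n in counter.items() if n == 1)


print(hapaxes(label_counts))  # every word in the vocabulary
print(hapaxes(true_counts))   # only the words whose real count is 1
```

With the real data, something like `nltk.FreqDist(df2.iloc[0].to_dict())` might be one way to pass the actual counts in; that call is an untested suggestion, not something this notebook ran.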

### NLP Bag of Words
Using **segmentation** to identify sentences within `all_comments`

In [13]:
from nltk.tokenize import PunktSentenceTokenizer
sent_detector = PunktSentenceTokenizer()
sent_detector.sentences_from_text(all_comments.decode('utf8'))

[u'Thanks !',
 u'Oops.',
 u'i got it.',
 u'I have to uninstall ipython3 and then install back ipython2.3 even when i am in a brand new virtual env.',
 u'I guess you may want to add a note to that... anyway, great work... \n same issue\n FWIW a workaround is to share from Google Drive.',
 u'(Right-click the .ipynb file in Drive, and choose "Share".)',
 u"At some point, I'll probably hack on a RethinkDB based store which you could run internally.",
 u"No promises though, just something I've been thinking on.",
 u'If there are folks willing to fund such a thing, @captainsafia and others would love to collaborate on this.',
 u'Also, if you just want storage (not the realtime aspect), you can also use a content store like these:\n- [Postgres contents](https://github.com/quantopian/pgcontents) - awesome, used in production at Quantopian\n- [OpenStack Swift store](https://github.com/rgbkrk/bookstore) - note: not up to date, welcome to new maintainers\n Was looking for the realtime.',
 u':-) \
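Because `all_comments` joins every comment with a space, a sentence detector can run one comment into the next (the fifth sentence in the output above spans three comments). A possible alternative is to segment each comment on its own. The sketch below uses a deliberately naive regex splitter as a stand-in for `PunktSentenceTokenizer`; `naive_sentences` and the sample comments are assumptions made for illustration.

```python
import re


def naive_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (naive stand-in for Punkt)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Made-up comments modelled on the first rows of `comments`.
sample_comments = ["Thanks !\n", "Oops. i got it.", "same issue\n"]

# Segment each comment separately so sentences never cross comment boundaries.
sentences = [s for comment in sample_comments for s in naive_sentences(comment)]
print(sentences)
```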

### Lemmatization

In [32]:
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression

lemmatizer = WordNetLemmatizer()
# lemmatize() expects a single word (a string), not a DataFrame,
# so apply it to each vocabulary word in the columns of df2
lemmas = [lemmatizer.lemmatize(word) for word in df2.columns]

### Stemming

### Term Frequency - Inverse Document Frequency - TF-IDF

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english')
tvec.fit([all_comments])

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
tfidf_data = pd.DataFrame(tvec.transform([all_comments]).todense(), 
    columns=tvec.get_feature_names(), index=['all_comments'])

In [16]:
tfidf_data.transpose().sort_values('all_comments', ascending=False).head(10).transpose()

              notebook    github  jupyter       com    https     think     like    thanks       js      just
all_comments  0.372162  0.309472  0.30273  0.302074  0.25254  0.143172  0.13643  0.119622  0.11817  0.114144


"tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus." What this output means: once English stop words are removed, the words above carry the highest TF-IDF weights in the combined Jupyter GitHub comments.

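One caveat worth checking: the vectorizer above is fit on a single document (`[all_comments]`). Under scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, every term then has n = df = 1 and idf = 1, so the scores reduce to L2-normalized term counts. The sketch below reimplements that formula in plain Python to show the effect. `tfidf` is a hypothetical helper and the token lists are made up, so treat it as an illustration rather than scikit-learn's actual code.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed idf and L2 normalization for tokenized docs (sketch)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# One concatenated document: idf is 1 for every term, so only counts matter.
single = tfidf([["notebook", "notebook", "kernel", "thanks"]])[0]

# The same kind of tokens split into separate comments: a term found in fewer
# comments gets a higher idf, so "kernel" outweighs "notebook" in the second one.
per_comment = tfidf([["notebook", "thanks"], ["notebook", "kernel"]])
print(single)
print(per_comment[1])
```

Fitting on `comments` (one document per comment) instead of `[all_comments]` would be one way to let the inverse document frequency actually differ between terms.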
### Topic Modeling: LDA

#### Import libraries

In [17]:
from gensim import corpora, models, matutils
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
import pandas as pd

#### Instantiate CountVectorizer and fit data to model

In [18]:
#circling back to add stop words
stop_words = text.ENGLISH_STOP_WORDS.union(['jupyter', 'notebook', 'https', 'github', 'com', 'html', 'http', 'org','ellisonbg','don'])

In [19]:
vectorizer = CountVectorizer(stop_words=stop_words, min_df=3)
X = vectorizer.fit_transform(comments.dropna())
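Here each comment is its own document, and `min_df=3` keeps only tokens that appear in at least three comments. A small pure-Python sketch of that document-frequency filter follows; `filter_by_min_df` and the tokenized sample comments are hypothetical and only mirror the idea, not scikit-learn's implementation.

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    # set(doc) so a term repeated inside one document still counts once.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Made-up tokenized comments.
sample_docs = [
    ["kernel", "restart", "kernel"],
    ["kernel", "error"],
    ["kernel", "error", "traceback"],
    ["error", "a6e74c7c"],
]
kept_terms = filter_by_min_df(sample_docs, min_df=3)
print(kept_terms)
```

One-off tokens such as commit hashes fall below the threshold, which is the kind of noise this setting is meant to remove.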

#### Tokens that were saved after stopwords removed

In [27]:
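# vocabulary_ is a dict mapping token -> column index, so it has no .head();
# slice a list of its items to peek at a few entries instead
list(vectorizer.vocabulary_.items())[:5]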

#### Counts of tokens

In [21]:
docs = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names()).head()
docs

   00  000  0000  001  00ms  01  02  022  023  03  ...  zlib  zmq  zmq_version  zmqhandlers  zmqstream  zmqterminalinteractiveshell  zoom  zooming  zsh  écrit
0   0    0     0    0     0   0   0    0    0   0  ...     0    0            0            0          0                            0     0        0    0      0
1   0    0     0    0     0   0   0    0    0   0  ...     0    0            0            0          0                            0     0        0    0      0
2   0    0     0    0     0   0   0    0    0   0  ...     0    0            0            0          0                            0     0        0    0      0
3   0    0     0    0     0   0   0    0    0   0  ...     0    0            0            0          0                            0     0        0    0      0
4   0    0     0    0     0   0   0    0    0   0  ...     0    0            0            0          0                            0     0        0    0      0


In [22]:
docs.shape

(5, 9156)

#### Set up LDA model - set up the vocabulary

In [23]:
# Invert token -> index into index -> token, the id2word mapping passed to gensim below
vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
vocab

{0: u'00',
 1: u'000',
 2: u'0000',
 3: u'001',
 4: u'00ms',
 5: u'01',
 6: u'02',
 7: u'022',
 8: u'023',
 9: u'03',
 10: u'04',
 11: u'0440',
 12: u'048',
 13: u'04ms',
 14: u'05',
 15: u'056',
 16: u'06',
 17: u'07',
 18: u'08',
 19: u'0892',
 20: u'09',
 21: u'091',
 22: u'0b1',
 23: u'0b2',
 24: u'0ce510255304',
 25: u'0dev',
 26: u'0eb95a1',
 27: u'0em',
 28: u'0px',
 29: u'0rc1',
 30: u'0rc2',
 31: u'10',
 32: u'100',
 33: u'1000',
 34: u'10000',
 35: u'10000ms',
 36: u'1002',
 37: u'1003',
 38: u'100644',
 39: u'1008',
 40: u'100mb',
 41: u'100s',
 42: u'1010',
 43: u'1015',
 44: u'10182',
 45: u'102',
 46: u'1020496',
 47: u'1021',
 48: u'1024',
 49: u'102549803',
 50: u'102558889',
 51: u'103',
 52: u'1032',
 53: u'1037',
 54: u'104',
 55: u'1047',
 56: u'105',
 57: u'1053',
 58: u'10586',
 59: u'106',
 60: u'107',
 61: u'1072',
 62: u'108',
 63: u'10845504',
 64: u'109',
 65: u'10m',
 66: u'10px',
 67: u'11',
 68: u'110',
 69: u'111',
 70: u'112',
 71: u'1123',
 72: u'1124',

In [24]:
vectorizer

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset(['all', 'show', 'anyway', 'fifty', 'four', 'go', 'mill', 'find', 'seemed', 'whose', 're', 'herself', 'whoever', 'behind', 'should', 'to', 'only', 'under', 'herein', 'do', 'his', 'get', 'very', 'de', 'myself', 'cannot', 'every', 'yourselves', 'him', 'is', 'cry', 'beforehand', 'th..., 'eight', 'but', 'nothing', 'why', 'jupyter', 'noone', 'sometimes', 'together', 'serious', 'once']),
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

#### Set up the actual LDA model

In [25]:
lda = models.LdaModel(
    matutils.Sparse2Corpus(X, documents_columns=False),
    #corpus,
    num_topics = 5,
    passes = 20,
    id2word = vocab
)
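For orientation, gensim represents a corpus as a sequence of documents, each a list of `(token_id, count)` pairs, and `documents_columns=False` signals that the rows of `X` are the documents. The sketch below builds that structure by hand from a small dense count matrix. It is only a conceptual stand-in for `matutils.Sparse2Corpus`; `rows_to_bow`, the matrix, and the vocabulary are made up.

```python
def rows_to_bow(count_rows):
    """Turn each row of counts into a list of (token_id, count) pairs, skipping zeros."""
    return [
        [(token_id, count) for token_id, count in enumerate(row) if count > 0]
        for row in count_rows
    ]


# Made-up document-term counts: 3 documents (rows) x 4 tokens (columns).
count_rows = [
    [2, 0, 1, 0],
    [0, 0, 0, 3],
    [1, 1, 0, 0],
]
# Made-up id -> token mapping, playing the role of `vocab` / id2word.
id2word = {0: "cell", 1: "kernel", 2: "file", 3: "thanks"}

corpus = rows_to_bow(count_rows)
for doc in corpus:
    print([(id2word[token_id], count) for token_id, count in doc])
```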

*Third pass looking at the topics, with added stop words 'jupyter', 'notebook', 'https', 'github', 'com', 'html', 'http', 'org','ellisonbg','don'*

In [28]:
lda.print_topics(num_topics=5, num_words=5)

[(0,
  u'0.028*"notifications" + 0.027*"view" + 0.024*"directly" + 0.024*"reply" + 0.023*"wrote"'),
 (1,
  u'0.025*"file" + 0.024*"python" + 0.024*"py" + 0.019*"line" + 0.015*"packages"'),
 (2,
  u'0.024*"thanks" + 0.015*"issue" + 0.010*"pr" + 0.009*"think" + 0.009*"just"'),
 (3,
  u'0.040*"js" + 0.017*"notebookapp" + 0.013*"png" + 0.013*"assets" + 0.012*"cloud"'),
 (4,
  u'0.012*"think" + 0.011*"like" + 0.011*"cell" + 0.009*"use" + 0.008*"just"')]

*High-level labels*

In [None]:
#First pass: funny but not helpful topics:
topics_labels = {
    0: "Installing Python and Jupyter",
    1: "I think I just like the Jupyter Notebook",
    2: "JavaScript and .py Files",
    3: "Cells",
    4: "GitHub",
}


In [None]:
#Third pass topics
topics_labels = {
    0: "Viewing and writing notifications",
    1: "Python files and packages",
    2: "PR and issue closure",
    3: "User interface",
    4: "Thinking",
}

In [33]:
for ti, topic in enumerate(lda.show_topics(num_topics = 5)):
    print("Topic: %d" % (ti))
    print(topic)
    print("")  # blank line; a bare print() prints "()" under Python 2

Topic: 0
(0, u'0.028*"notifications" + 0.027*"view" + 0.024*"directly" + 0.024*"reply" + 0.023*"wrote" + 0.022*"email" + 0.021*"issuecomment" + 0.021*"pull" + 0.017*"issues" + 0.016*"brian"')

Topic: 1
(1, u'0.025*"file" + 0.024*"python" + 0.024*"py" + 0.019*"line" + 0.015*"packages" + 0.014*"lib" + 0.013*"ipython" + 0.013*"local" + 0.012*"usr" + 0.011*"users"')

Topic: 2
(2, u'0.024*"thanks" + 0.015*"issue" + 0.010*"pr" + 0.009*"think" + 0.009*"just" + 0.009*"ll" + 0.009*"like" + 0.008*"try" + 0.008*"work" + 0.007*"issues"')

Topic: 3
(3, u'0.040*"js" + 0.017*"notebookapp" + 0.013*"png" + 0.013*"assets" + 0.012*"cloud" + 0.012*"githubusercontent" + 0.012*"docker" + 0.012*"error" + 0.012*"17" + 0.009*"static"')

Topic: 4
(4, u'0.012*"think" + 0.011*"like" + 0.011*"cell" + 0.009*"use" + 0.008*"just" + 0.007*"kernel" + 0.007*"way" + 0.006*"code" + 0.006*"want" + 0.006*"make"')
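To make labelling easier, the topic strings printed above can be split into `(word, weight)` pairs. The sketch below parses the `weight*"word"` format with a regular expression and pairs each topic with a label. The two topic strings are shortened copies of the output above, the labels come from the third-pass `topics_labels`, and `parse_topic` is a hypothetical helper.

```python
import re

# Matches one term of a topic string, e.g. 0.028*"notifications"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Parse '0.028*"word" + ...' into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# Shortened copies of two topics from the output above.
topics = {
    0: '0.028*"notifications" + 0.027*"view" + 0.024*"directly"',
    1: '0.025*"file" + 0.024*"python" + 0.024*"py"',
}
topics_labels = {0: "Viewing and writing notifications", 1: "Python files and packages"}

for topic_id, topic_string in sorted(topics.items()):
    words = [word for word, _ in parse_topic(topic_string)]
    print("%s: %s" % (topics_labels[topic_id], ", ".join(words)))
```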
