## GA Data Science Final Project - 3a - NLP

#### Import libraries

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

#### Import the data and convert datatypes to `str`

In [25]:
df = pd.read_csv('issue_comments_jupyter_copy.csv')
df['org'] = df['org'].astype('str')
df['repo'] = df['repo'].astype('str')
df['comments'] = df['comments'].astype('str')
df['user'] = df['user'].astype('str')

### Scikit Learn Count Vectorizer
##### First, lowercase the Series 'comments' and save it to the variable `comments`.

In [26]:
comments = df.comments.str.lower()

In [27]:
comments.head()

0                                           thanks !\n
1    oops. i got it. i have to uninstall ipython3 a...
2                                         same issue\n
3    fwiw a workaround is to share from google driv...
4    at some point, i'll probably hack on a rethink...
Name: comments, dtype: object

*comments* is a long column with many rows, and I want to use CountVectorizer on all the comments to generate counts of all the words used. One way to do that is to convert all the comments to strings, concatenate them using a space as a separator, save the whole long string to a text file, and store it as the variable **all_comments**.

CountVectorizer documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

which states: "Convert a collection of text documents to a matrix of token counts  
This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data."
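
As a minimal sketch of what that description means in practice, the toy example below fits a CountVectorizer on three made-up comments (`toy_docs` is hypothetical data, not taken from the dataset). Each document becomes one row of a sparse matrix, and the number of columns equals the vocabulary size.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up comments standing in for rows of the `comments` Series.
toy_docs = ["thanks for the fix", "same issue here", "thanks, same fix works"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)  # scipy sparse matrix, one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
toy_terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

print(toy_counts.shape)      # (number of documents, vocabulary size)
print(toy_terms)
print(toy_counts.toarray())  # dense view, only sensible for tiny examples
```

Reading the column names from `vocabulary_` avoids depending on a particular scikit-learn release, since the feature-name accessor has been renamed over time.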

##### Second, convert all rows in `comments` to a string.

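Skip concatenating all documents into a single string for CountVectorizer, which expects one document per item, so the cell below is left commented out. The TF-IDF section later in this notebook still uses `all_comments`, so run these lines first to reproduce that section.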

In [28]:
#with open ('all_comments.txt',"wb") as fd:
#    all_comments = comments.str.cat(sep=' ')
#    fd.write (all_comments)

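Below: keep only the 1,000 most frequent features, counting both single words and two-word phrases (`ngram_range=(1,2)`), and record only whether a term appears in a comment (`binary=True`).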

In [31]:
cvec = CountVectorizer(max_features =1000,
                      ngram_range=(1,2),
                      stop_words='english',
                      binary=True)

cvec.fit(comments)

CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
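
To make the `ngram_range` and `binary` settings concrete, here is a small sketch on two made-up comments (`toy_docs` is hypothetical). It assumes nothing beyond the CountVectorizer parameters already used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["pip install notebook", "pip install pip"]  # made-up comments

# ngram_range=(1, 2) keeps single words and adjacent word pairs;
# binary=True records presence (1) instead of the number of occurrences.
bigram_vec = CountVectorizer(ngram_range=(1, 2), binary=True)
bigram_counts = bigram_vec.fit_transform(toy_docs).toarray()
bigram_terms = sorted(bigram_vec.vocabulary_, key=bigram_vec.vocabulary_.get)

print(bigram_terms)
print(bigram_counts)
```

Even though "pip" occurs twice in the second comment, its entry stays at 1 because of `binary=True`.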

In [35]:
df2 = pd.DataFrame(cvec.transform(comments).todense(), columns=cvec.get_feature_names())

In [36]:
df2.head()

Unnamed: 0,01,02,03,04,05,07,08,09,10,100,...,write,writing,written,wrong,wrote,www,yeah,yep,yes,zmq
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
df2.transpose().sort_values(0, ascending=False).head(25).transpose()

Unnamed: 0,thanks,01,port,physics data,ping,pip,pip install,place,plan,pm,...,poly state,possible,people,possibly,post,pr,prefer,prefix,present,pretty
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The output still contains many low-information terms (numbers, `pm`, and so on) that we can add to the stop word list later.
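
The cell above sorts by the values in row 0, so it mainly surfaces terms present in the first comment. To rank terms across every comment, one option is to sort the column sums instead. A minimal sketch on a hypothetical toy matrix (`toy_df` and its values are made up for illustration):

```python
import pandas as pd

# Made-up binary term matrix shaped like df2 (rows: comments, columns: terms).
toy_df = pd.DataFrame(
    {"thanks": [1, 0, 1], "pip": [0, 1, 0], "kernel": [1, 1, 1]}
)

# With binary counts, a column sum is the number of comments containing the term,
# so sorting the sums ranks terms over the whole corpus.
term_totals = toy_df.sum().sort_values(ascending=False)
print(term_totals.head(2))
```

Applied to the real data, the same idea would be `df2.sum().sort_values(ascending=False).head(25)`.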

### Term Frequency - Inverse Document Frequency - TF-IDF

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english')
tvec.fit([all_comments])

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
tfidf_data = pd.DataFrame(tvec.transform([all_comments]).todense(), 
    columns=tvec.get_feature_names(), index=['all_comments'])

In [16]:
tfidf_data.transpose().sort_values('all_comments', ascending=False).head(10).transpose()

Unnamed: 0,notebook,github,jupyter,com,https,think,like,thanks,js,just
all_comments,0.372162,0.309472,0.30273,0.302074,0.25254,0.143172,0.13643,0.119622,0.11817,0.114144


"tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus." What this output means: the words above carry the most weight across the Jupyter GitHub comments. Because the vectorizer was fit on a single concatenated document, every term gets the same inverse document frequency, so these scores are effectively normalized term frequencies. Note that this is a "crude" topic analysis, and I'll use other tools below to perform a more detailed analysis.

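The sketch below illustrates how the IDF part of TF-IDF depends on what counts as a document. It uses three made-up comments (`toy_docs` is hypothetical) and compares fitting on one concatenated string, as in the cells above, with fitting on the individual comments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["notebook kernel died", "notebook widget broken", "thanks"]  # made-up comments

# Fit on ONE concatenated document: every term has document frequency 1,
# so all IDF weights are identical and scores only reflect term counts.
single_vec = TfidfVectorizer()
single_vec.fit([" ".join(toy_docs)])
print(set(single_vec.idf_))

# Fit on the individual comments: terms found in fewer comments get a larger IDF.
per_doc_vec = TfidfVectorizer()
per_doc_vec.fit(toy_docs)
per_doc_terms = sorted(per_doc_vec.vocabulary_, key=per_doc_vec.vocabulary_.get)
idf_by_term = dict(zip(per_doc_terms, per_doc_vec.idf_))
print(idf_by_term["notebook"], idf_by_term["kernel"])
```

Fitting on `comments` directly, rather than on `[all_comments]`, is what lets the IDF term down-weight words that appear in most comments.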
### Topic Modeling: LDA

#### Import libraries

In [17]:
from gensim import corpora, models, matutils
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
import pandas as pd

#### Instantiate CountVectorizer and fit data to model

In [18]:
#circling back to add stop words
stop_words = text.ENGLISH_STOP_WORDS.union(['jupyter', 'notebook', 'https', 'github', 'com', 'html', 'http', 'org','ellisonbg','don'])

In [19]:
vectorizer = CountVectorizer(stop_words=stop_words, min_df=3)
X = vectorizer.fit_transform(comments.dropna())
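
As a small illustration of the two settings used above, the sketch below extends the built-in English stop word list and applies `min_df` to three made-up comments (`toy_docs` and `toy_stop_words` are hypothetical names). The `list(...)` wrapper is an assumption for compatibility: some newer scikit-learn releases appear to reject a frozenset for `stop_words`.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "jupyter kernel restart",
    "jupyter kernel crash",
    "the kernel logs",
]  # made-up comments

# Extend the built-in English list with a domain-specific term.
toy_stop_words = list(text.ENGLISH_STOP_WORDS.union(["jupyter"]))

# min_df=2 keeps only terms that occur in at least two documents.
toy_vec = CountVectorizer(stop_words=toy_stop_words, min_df=2)
toy_vec.fit(toy_docs)
print(sorted(toy_vec.vocabulary_))
```

Only "kernel" survives here: "jupyter" and "the" are removed as stop words, and the remaining terms each appear in a single comment.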

#### Tokens that were saved after stopwords removed

In [20]:
vectorizer.vocabulary_

{u'hanging': 4101,
 u'looking': 5230,
 u'59ms': 649,
 u'skipping': 7691,
 u'uncaughterror': 8586,
 u'differently': 2946,
 u'kaspersky': 4866,
 u'bringing': 1791,
 u'disturb': 3033,
 u'basics': 1600,
 u'xdg_data_home': 9083,
 u'wednesday': 8913,
 u'commented': 2267,
 u'rebuilding': 6847,
 u'271': 373,
 u'272': 374,
 u'275': 376,
 u'276': 377,
 u'sailed': 7334,
 u'errors': 3341,
 u'dialogs': 2927,
 u'designing': 2875,
 u'cac55037e43be04b586fc7936f12352ed95eb2a3': 1871,
 u'cull': 2652,
 u'widget': 8961,
 u'njsmith': 5788,
 u'elaborate': 3215,
 u'advantages': 1136,
 u'criticism': 2630,
 u'appropriately': 1329,
 u'wikimedia': 8969,
 u'replace': 7060,
 u'506602': 585,
 u'allow_origin': 1205,
 u'browse': 1803,
 u'sidebars': 7621,
 u'webpack': 8903,
 u'dns': 3052,
 u'paperwork': 6120,
 u'standardized': 7895,
 u'meeting': 5420,
 u'gregnordin': 4019,
 u'svurens': 8130,
 u'unpack': 8648,
 u'circumstances': 2092,
 u'_launch_kernel': 958,
 u'locked': 5202,
 u'40': 489,
 u'pursue': 6660,
 u'locket':

#### Counts of tokens

In [21]:
# slice the sparse matrix first so only the first five rows are densified
docs = pd.DataFrame(X[:5].toarray(), columns=vectorizer.get_feature_names())
docs

Unnamed: 0,00,000,0000,001,00ms,01,02,022,023,03,...,zlib,zmq,zmq_version,zmqhandlers,zmqstream,zmqterminalinteractiveshell,zoom,zooming,zsh,écrit
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
docs.shape

(5, 9156)

#### Set up LDA model - set up the vocabulary

In [23]:
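# invert token -> index into index -> token, the id2word mapping passed to gensim below
# (.items() works on both Python 2 and Python 3)
vocab = {v: k for k, v in vectorizer.vocabulary_.items()}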
vocab

{0: u'00',
 1: u'000',
 2: u'0000',
 3: u'001',
 4: u'00ms',
 5: u'01',
 6: u'02',
 7: u'022',
 8: u'023',
 9: u'03',
 10: u'04',
 11: u'0440',
 12: u'048',
 13: u'04ms',
 14: u'05',
 15: u'056',
 16: u'06',
 17: u'07',
 18: u'08',
 19: u'0892',
 20: u'09',
 21: u'091',
 22: u'0b1',
 23: u'0b2',
 24: u'0ce510255304',
 25: u'0dev',
 26: u'0eb95a1',
 27: u'0em',
 28: u'0px',
 29: u'0rc1',
 30: u'0rc2',
 31: u'10',
 32: u'100',
 33: u'1000',
 34: u'10000',
 35: u'10000ms',
 36: u'1002',
 37: u'1003',
 38: u'100644',
 39: u'1008',
 40: u'100mb',
 41: u'100s',
 42: u'1010',
 43: u'1015',
 44: u'10182',
 45: u'102',
 46: u'1020496',
 47: u'1021',
 48: u'1024',
 49: u'102549803',
 50: u'102558889',
 51: u'103',
 52: u'1032',
 53: u'1037',
 54: u'104',
 55: u'1047',
 56: u'105',
 57: u'1053',
 58: u'10586',
 59: u'106',
 60: u'107',
 61: u'1072',
 62: u'108',
 63: u'10845504',
 64: u'109',
 65: u'10m',
 66: u'10px',
 67: u'11',
 68: u'110',
 69: u'111',
 70: u'112',
 71: u'1123',
 72: u'1124',

In [None]:
%%time
vectorizer

#### Set up the actual LDA model

In [25]:
lda = models.LdaModel(
    matutils.Sparse2Corpus(X, documents_columns=False),
    #corpus,
    num_topics = 5,
    passes = 20,
    id2word = vocab
)

*Third pass looking at the topics, with added stop words 'jupyter', 'notebook', 'https', 'github', 'com', 'html', 'http', 'org','ellisonbg','don'*

In [26]:
lda.print_topics(num_topics=5, num_words=5)

[(0, u'0.025*"js" + 0.020*"file" + 0.019*"py" + 0.018*"line" + 0.014*"lib"'),
 (1,
  u'0.033*"thanks" + 0.015*"issue" + 0.013*"pr" + 0.012*"ll" + 0.011*"think"'),
 (2,
  u'0.023*"notifications" + 0.021*"view" + 0.021*"directly" + 0.020*"reply" + 0.019*"wrote"'),
 (3,
  u'0.013*"think" + 0.012*"cell" + 0.012*"like" + 0.008*"use" + 0.008*"just"'),
 (4,
  u'0.021*"ipython" + 0.019*"python" + 0.013*"install" + 0.012*"kernel" + 0.009*"using"')]

*High-level labels*

In [27]:
#First pass: funny but not helpful topics:
topics_labels = {
    0: "Installing Python and Jupyter",
    1: "I think I just like the Jupyter Notebook",
    2: "JavaScript and .py Files",
    3: "Cells",
    4: "GitHub",
}

In [None]:
#Third pass topics
topics_labels = {
    0: "Viewing and writing notifications",
    1: "Python files and packages",
    2: "PR and issue closure",
    3: "User interface",
    4: "Thinking",
}
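
One way such labels could be used is to tag each comment with its most probable topic. The sketch below assumes a hypothetical `doc_topic` array of per-comment topic probabilities; the values are made up, and in practice they would have to come from the fitted LDA model.

```python
import numpy as np

# Made-up topic probabilities (rows: comments, columns: the 5 topics).
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
])

labels = {
    0: "Viewing and writing notifications",
    1: "Python files and packages",
    2: "PR and issue closure",
    3: "User interface",
    4: "Thinking",
}

# argmax along each row picks the index of the most probable topic per comment.
dominant = doc_topic.argmax(axis=1)
dominant_labels = [labels[int(t)] for t in dominant]
print(dominant_labels)
```

Because LDA topic numbering can change between runs, the label dictionary should be checked against the printed topics each time the model is refit.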

In [None]:
for ti, topic in enumerate(lda.show_topics(num_topics = 5)):
    print("Topic: %d" % (ti))
    print(topic)
    print()

### K-Means

In [47]:
from sklearn.cluster import KMeans
#define the k-means clustering function
def k_means(df2, num_clusters=5):
    km = KMeans(n_clusters=num_clusters,
               max_iter=100)
    km.fit(df2)
    clusters = km.labels_
    return km, clusters

In [48]:
#set k=5: let's say we want 5 clusters from the comments
num_clusters = 5

In [49]:
#get clusters and assign the cluster labels to the comments
km_obj, clusters = k_means(df2=df2,
                          num_clusters=num_clusters)
df2['Cluster'] = clusters
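
A natural follow-up is to check how many comments land in each cluster and which terms dominate each centroid. The sketch below does this on a made-up binary term matrix (`toy_terms` is hypothetical); `random_state` and `n_init` are fixed only to make the example repeatable and are not part of the original function.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up binary term matrix: rows are comments, columns are terms.
toy_terms = pd.DataFrame(
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],
    columns=["install", "pip", "cell", "widget"],
)

toy_km = KMeans(n_clusters=2, max_iter=100, n_init=10, random_state=0).fit(toy_terms)

# Number of comments assigned to each cluster.
cluster_sizes = pd.Series(toy_km.labels_).value_counts().sort_index()
print(cluster_sizes)

# Terms with the largest centroid weight characterise each cluster.
for cluster_id, centroid in enumerate(toy_km.cluster_centers_):
    top = [toy_terms.columns[i] for i in np.argsort(centroid)[::-1][:2]]
    print(cluster_id, top)
```

On the real data, the same inspection should be run on the feature columns only, leaving out the added `Cluster` column.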