Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Added Jensen Shannon metric and dendrogram visualization #1484

Merged
merged 21 commits into from
Aug 23, 2017

Conversation

parulsethi
Copy link
Contributor

@parulsethi parulsethi commented Jul 13, 2017

Adds Jensen-Shannon distance metric and topic dendrogram-heatmap visualization notebook.

@piskvorky
Copy link
Owner

Hanging indent everywhere please -- vertical indent makes the code hard to read and hard to maintain.

@piskvorky
Copy link
Owner

@parulsethi It's Jensen-Shannon (after Johan Jensen and Claude Shannon).

@parulsethi
Copy link
Contributor Author

@piskvorky Can you please point to which line should be changed to hanging indent? As if i understand correctly hanging/vertical indent is used in case of long arguments? and there weren't any in this PR so I didn't use any line breaks in code

@piskvorky
Copy link
Owner

Unfortunately Github says the notebook is too large and its diff cannot be displayed, so I cannot review directly. But the indentation is broken around annotation_html =, update( and a few other places.

There are also a few minor code style inconsistencies, like spaces around , or :. @menshikh-iv will direct you :)

@menshikh-iv is there a way to somehow review the notebook? I'd like to point out some constructs which, while not wrong, could be improved a little for better readability (like map(int(), to be more "Pythonic".

@piskvorky piskvorky changed the title Added Jenson shannon metric [WIP] Added Jensen Shannon metric Jul 24, 2017
@parulsethi
Copy link
Contributor Author

@piskvorky I'll remove the cell outputs once the notebook is complete. It will then have a much smaller diff containing only the input code cells which could be displayed in Github for code review.

@piskvorky
Copy link
Owner

piskvorky commented Aug 3, 2017

@parulsethi let's fix that Jenson everywhere, before it propagates too far through the code/notebooks.

@parulsethi
Copy link
Contributor Author

I've removed the plotly code and cell outputs for smaller diff on github, will add it back after the code review of notebook is complete. If needed, the output visualization can be seen here in a previous commit.

@piskvorky corrected the name in a notebook comment it was being used in. Elsewhere it's the function name.

@piskvorky
Copy link
Owner

piskvorky commented Aug 4, 2017

The name is incorrect and needs to be fixed, that's what I mean.

@parulsethi
Copy link
Contributor Author

parulsethi commented Aug 4, 2017

Oh, the spelling. Sorry, misunderstood earlier

@menshikh-iv
Copy link
Contributor

@piskvorky About notebook - nbdime good tool for view diffs between notebooks, but this tool doesn't integrate with github.

Also, we can use nbconvert, but this again does not suit us. Probably it makes sense to talk with tech GitHub support and discuss their plans to improve the work with notebooks.

"outputs": [],
"source": [
"from gensim.models.ldamodel import LdaModel\n",
"from gensim.corpora import Dictionary, MmCorpus\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports: gensim.corpora.MmCorpus, scipy.spatial.distance.squareform, plotly.figure_factory

"texts = []\n",
"for line in df_fake.text:\n",
" lowered = line.lower()\n",
" words = re.findall(r'\\w+', lowered, flags = re.UNICODE | re.LOCALE)\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary spaces flags = re.UNICODE | re.LOCALE

"source": [
"from gensim.matutils import jensen_shannon\n",
"\n",
"from random import sample\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports: random.sample, scipy

"\n",
"# Plot dendrogram\n",
"dendro = create_dendrogram(topic_dist, distfun=js_dist, labels=range(1, 36), annotation=annotation)\n",
"dendro['layout'].update({'width':1000, 'height':600})\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces after :

"annotation = text_annotation(topic_dist, topic_terms, n_ann_terms)\n",
"\n",
"# Initialize figure by creating upper dendrogram\n",
"figure = create_dendrogram(topic_dist, distfun=js_dist, labels = range(1, 36), annotation=annotation)\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no needed spaces labels = range(1, 36)

"# Add Heatmap Data to Figure\n",
"figure['data'].extend(heatmap)\n",
"\n",
"dendro_leaves = [x+1 for x in dendro_leaves]\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing whitespace x+1

"dendro_leaves = [x+1 for x in dendro_leaves]\n",
"\n",
"# Edit Layout\n",
"figure['layout'].update({'width':800, 'height':800,\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing spaces after : (and below)

" 'showline': False,\n",
" \"showticklabels\": True, \n",
" \"tickmode\": \"array\",\n",
" \"ticktext\" : dendro_leaves,\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary space (before :) and below

" 'showline': False,\n",
" \"showticklabels\": True, \n",
" \"tickmode\": \"array\",\n",
" \"ticktext\" : dendro_leaves,\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary space (before :) and below

" 'showline': False,\n",
" 'zeroline': False,\n",
" 'showticklabels': False,\n",
" 'ticks':\"\"}})\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space

@menshikh-iv
Copy link
Contributor

@parulsethi please fix PEP things, return all images and resolve merge conflict. LGTM for me

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

"# no. of terms to display in annotation\n",
"n_ann_terms = 10\n",
"\n",
"# use Jenson-Shannon distance metric in dendrogram\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more "Jenson".

@parulsethi parulsethi changed the title [WIP] Added Jensen Shannon metric [MRG] Added Jensen Shannon metric Aug 23, 2017
@parulsethi parulsethi changed the title [MRG] Added Jensen Shannon metric [MRG] Added Jensen Shannon metric and dendrogram visualization Aug 23, 2017
@menshikh-iv
Copy link
Contributor

Good job @parulsethi 🥇

@menshikh-iv menshikh-iv merged commit de3e2cb into piskvorky:develop Aug 23, 2017
@parulsethi parulsethi deleted the jenson_shannon branch August 23, 2017 10:37
fabriciorsf pushed a commit to LINE-PESC/gensim that referenced this pull request Aug 23, 2017
* add notebook with js heatmap

* separate out vector conversion function

* modify notebook

* add cluster heatmap notebook

* print y-axis values

* correct y labels

* add nb details

* few more nb details

* add text labels

* print linkage/dendo data

* add annotations for all hierarchy levels

* make diff annotations symmetric

* add Plotly's dendrogram code

* train for more passes

* remove outputs

* remove plotly code and imports

* err.. re-run cells

* fix jensen spelling

* made requested changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants