Commit

Formula of the Kalinsky-Harabasz score and minor corrections
Orieus committed Nov 18, 2019
1 parent 3c4b75d commit b267c0c
Showing 7 changed files with 65,253 additions and 98 deletions.
22 changes: 11 additions & 11 deletions TM1.IntrodNLP/NLP_py2_wikitools/notebooks/TM1_NLP.ipynb
@@ -977,7 +977,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "** Exercise**: There are usually many tokens that appear with very low frequency in the corpus. Count the number of tokens appearing only once, and what is the proportion of them in the token list."
+ "**Exercise**: There are usually many tokens that appear with very low frequency in the corpus. Count the number of tokens appearing only once, and what is the proportion of them in the token list."
]
},
{
@@ -1006,7 +1006,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "** Exercise**: Represent graphically those 20 tokens that appear in the highest number of articles. Note that you can use the code above (headed by `# SORTED TOKEN FREQUENCIES`) with a very minor modification."
+ "**Exercise**: Represent graphically those 20 tokens that appear in the highest number of articles. Note that you can use the code above (headed by `# SORTED TOKEN FREQUENCIES`) with a very minor modification."
]
},
{
@@ -1056,7 +1056,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "** Exercise**: Count the number of tokens appearing only in a single article.\n"
+ "**Exercise**: Count the number of tokens appearing only in a single article.\n"
]
},
{
@@ -1074,7 +1074,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "** Exercise** (*All in one*): Note that, for pedagogical reasons, we have used a different `for` loop for each text processing step creating a new `corpus_xxx` variable after each step. For very large corpus, this could cause memory problems. \n",
+ "**Exercise** (*All in one*): Note that, for pedagogical reasons, we have used a different `for` loop for each text processing step creating a new `corpus_xxx` variable after each step. For very large corpus, this could cause memory problems. \n",
"\n",
"As a summary exercise, repeat the whole text processing, starting from corpus_text up to computing the bow, with the following modifications:\n",
"\n",
@@ -1099,7 +1099,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "** Exercise** (*Visualizing categories*): Repeat the previous exercise with a second wikipedia category. For instance, you can take \"communication\". \n",
+ "**Exercise** (*Visualizing categories*): Repeat the previous exercise with a second wikipedia category. For instance, you can take \"communication\". \n",
"\n",
"1. Save the result in variable `corpus_bow2`.\n",
"2. Determine the most frequent terms in `corpus_bow1` (`term1`) and `corpus_bow2` (`term2`).\n",
@@ -1123,7 +1123,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "** Exercise ** (bigrams): `nltk` provides an utility to compute n-grams from a list of tokens, in `nltk.util.ngrams`. Join all tokens in `corpus_clean` in a single list and compute the bigrams. Plot the 20 most frequent bigrams in the corpus."
+ "**Exercise** (bigrams): `nltk` provides an utility to compute n-grams from a list of tokens, in `nltk.util.ngrams`. Join all tokens in `corpus_clean` in a single list and compute the bigrams. Plot the 20 most frequent bigrams in the corpus."
]
},
{
@@ -1148,21 +1148,21 @@
"anaconda-cloud": {},
"celltoolbar": "Slideshow",
"kernelspec": {
- "display_name": "Python [conda root]",
+ "display_name": "Python 3",
"language": "python",
- "name": "conda-root-py"
+ "name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
- "version": 2
+ "version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
- "pygments_lexer": "ipython2",
- "version": "2.7.14"
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
}
},
"nbformat": 4,
138 changes: 95 additions & 43 deletions U1.KMeans/.ipynb_checkpoints/KMeans-checkpoint.ipynb

Large diffs are not rendered by default.
