{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Analysis for Topic Extraction Pipeline\n",
    "\n",
    "In this module we focus on the most widespread and pervasive media form: **text**.  \n",
    "Text is everywhere: in spoken language, books, websites, and so on.\n",
    "\n",
    "It is no surprise that the success of big search engines is based on the appropriate analysis of text. \n",
    "Deriving ‘meaning’ from text is far from trivial; indeed, it is a very difficult task.\n",
    "Consider that meaning in text is created by distributions of words in a specific language, following very specific and diverse grammar rules.\n",
    "These arrangements are further nuanced by cultural codes, shorthands, metaphors, analogies, irony, specific references, and so on.\n",
    "\n",
    "So, in this module, we are going to analyse this kind of media to extract topics with the help of some statistical tools that can be seen as semi-automated. \n",
    "It is only semi-automated because we are going to create a pipeline (a sequence of steps) that allows us to analyse some statistics about the corpus and derive some meaning from them, so the end result will always depend on human interpretation.\n",
    "\n",
    "In order to do so, we are going to use a family of techniques known as **Bag of Words**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. A Primer on the Bag of Words Pipeline\n",
    "\n",
    "The life cycle of statistical text processing can be seen as a core of four stages, illustrated in the figure below. \n",
    "\n",
    "The starting point of this process is the availability of a **corpus** - which is a collection of text documents.\n",
    "For example, we can have a corpus on political debates, cooking recipes, the news broadcast by a given agency in a time period, and so on. \n",
    "\n",
    "Corpora (plural of corpus) thus often gather many documents on a given theme, and our goal is to find the different topics that make up that theme."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![Bag Of Words](./imgs/bag_of_words_pipeline.png)\n",
    "Figure 1. Bag Of Words Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 It all starts with a corpus\n",
    "\n",
    "Any kind of text analysis starts with a corpus. Maybe that corpus is made of tweets, or maybe it is made of news articles.\n",
    "That corpus can come in many data formats: web pages that need to be scraped, text files that need to be parsed, or some binary format that needs to be interpreted.\n",
    "\n",
    "Let's work through an example with a text file that contains several news articles for a given time period, and let's parse the file!\n",
    "\n",
    "Why do we need to parse the file? Well, if we want to understand the words of a given document in the file, we need to be able to access the different documents as needed: if I want to access document 17, I must have a way to do so.\n",
    "Therefore, we need to create some kind of structure in memory to process the documents in the file!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Step 1 - Parse the Data\n",
    "\n",
    "To parse the data into memory, we can start by reading every document in the file as a string, and storing each document in a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
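    "document_list = []\n",
    "# This is the dataset that we have been working with in the classes.\n",
    "# Change the path of the file to where it is on your computer.\n",
    "with open('data/NYT_Corpus.txt', encoding=\"UTF-8\") as f:\n",
    "    for line in f:\n",
    "        # A line such as 'URL: http...html' marks the start of a new document\n",
    "        found_url = line.startswith(\"URL: http\") and line.endswith(\".html\\n\")\n",
    "        if found_url:\n",
    "            f.readline()  # skip the line that follows the URL line\n",
    "            document_list.append(\"\")\n",
    "        elif document_list:\n",
    "            # Append the line to the document currently being read;\n",
    "            # lines that come before the first URL line are ignored\n",
    "            document_list[-1] += line\n",
    "\n",
    "# Keep only the non-empty documents\n",
    "corpus = [doc for doc in document_list if len(doc) > 0]"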
   ]
  },
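  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the parsing loop does, here is a minimal sketch that runs the same logic on a tiny in-memory file. The two-document text, the `example.com` URLs and the helper name `parse_documents` are made up for illustration; the real layout of `NYT_Corpus.txt` is assumed to be a URL line, one line to skip, and then the text of the document, and it may differ in its details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "def parse_documents(lines):\n",
    "    # Hypothetical helper: same logic as the parsing loop, for any iterable of lines\n",
    "    lines = iter(lines)\n",
    "    documents = []\n",
    "    for line in lines:\n",
    "        if line.startswith('URL: http') and line.endswith('.html\\n'):\n",
    "            next(lines, None)  # skip the line right after the URL line\n",
    "            documents.append('')\n",
    "        elif documents:\n",
    "            documents[-1] += line\n",
    "    return [doc for doc in documents if doc]\n",
    "\n",
    "# Made-up content that mimics the assumed layout of the corpus file\n",
    "toy_file = io.StringIO(\n",
    "    'URL: http://example.com/a.html\\n'\n",
    "    '\\n'\n",
    "    'First story, line one.\\n'\n",
    "    'First story, line two.\\n'\n",
    "    'URL: http://example.com/b.html\\n'\n",
    "    '\\n'\n",
    "    'Second story.\\n'\n",
    ")\n",
    "\n",
    "toy_corpus = parse_documents(toy_file)\n",
    "print(len(toy_corpus))\n",
    "print(toy_corpus[1])"
   ]
  },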
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we have some data in memory with which we can start the pipeline described above!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Clean Documents\n",
    "\n",
    "Now that we have our corpus, the first stage is concerned with cleaning the documents. The main idea here is to remove ‘noise’ in the form of text that has little to say about what a document is about.  \n",
    "\n",
    "Statistical text analysis (STA) is concerned with the distributions of words, in the sense that a document that contains, e.g., many copies of the words ‘oven’, ‘cook’ and ‘onion’ is likely to be about recipes/food.  \n",
    "STA is not concerned with grammar rules of any kind. For this reason, connectors, punctuation marks and so on are not important to STA (and indeed they are eliminated, as we will see later).  \n",
    "\n",
    "Other methods for analysing text in the field of Deep Natural Language Processing are interested in (and use) grammar rules, but we do not study Deep NLP in this course."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is our goal in this step\n",
    "\n",
    "Now we have a lot of words in the documents, as well as some other characters (e.g. punctuation). At the end of this data cleaning step, we need to have:\n",
    "\n",
    "1. No punctuation characters\n",
    "2. \"Normalized\" words, in two senses:\n",
    "    - all the words are in the same case (lower case)\n",
    "    - words from the same family, such as mapping, mapper and map, are treated as the same word instead of different words\n",
    "3. No words that are not going to help us understand topics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Interpreting tokens in each document\n",
    "\n",
    "So, a first step is to decide what we are going to define as a valid token. As we go through every document, we are going to filter the tokens, letting through only those that we have defined as valid.\n",
    "\n",
    "In this exercise, let's define a **valid token** as any sequence of letters (upper or lower case) and/or digits.\n",
    "\n",
    "To do so, let's use the `regex` pattern `\\w+`, meaning that:\n",
    "1. We accept any sequence of at least one letter, upper or lower case\n",
    "2. We accept any sequence of at least one digit\n",
    "3. The catch of using this pattern is that it also allows the underscore character (`_`), but in exchange our regex pattern stays a small and readable string :)"
   ]
  },
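  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of what the `\\w+` pattern accepts, using Python's built-in `re` module instead of NLTK so that it runs on its own. The sample sentence is made up for illustration. Note how punctuation is dropped, digits are kept, and the underscore keeps `snake_case` together as a single token."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Made-up sample text with punctuation, digits and an underscore\n",
    "sample = 'Hello, world! In 2024, snake_case is still one token.'\n",
    "\n",
    "# Every maximal run of word characters (letters, digits, underscore) becomes a token\n",
    "tokens = re.findall(r'\\w+', sample)\n",
    "\n",
    "# Lower-casing can be applied after tokenization\n",
    "lowered = [token.lower() for token in tokens]\n",
    "\n",
    "print(tokens)\n",
    "print(lowered)"
   ]
  },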
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Putting the words in lower case\n",
    "\n",
    "This step can be done either before or after the tokenization, given that we are creating tokens out of both lower and upper case characters.\n",
    "So, this step is listed second not because it needs to be done here, but so that we do not forget that it needs to be done, be it before, during or after the tokenization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Stemming the words\n",
    "\n",
    "One of our goals in this step is to normalize words of the same family.\n",
    "\n",
    "For example, if I have the words mapping, mapper and map, I want these different words, which share the same meaning, to be treated as the same word.\n",
    "\n",
    "\n",
    "A process that can help us do that is called **stemming**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import RegexpTokenizer\n",
    "from nltk.stem import PorterStemmer\n",
    "\n",
    "tokenizer = RegexpTokenizer(r'\\w+')\n",
    "ps = PorterStemmer()\n",
    "\n",
    "def mytokeniser(s):\n",
    "    aux = [w.lower() for w in tokenizer.tokenize(s)]\n",
    "    return list(map(ps.stem, aux))\n",
    "\n",
    "tokenised_corpus = list(map(mytokeniser, corpus))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleaning words that will not help us understand topics\n",
    "\n",
    "Given that Statistical Text Analysis (STA) is about data distributions, events that are constant (e.g. words that appear in every document, such as stop words) and events that are rare (e.g. words that appear in only 3% of the documents of the corpus) are not going to be relevant for our topic analysis, are they? \n",
    "\n",
    "Think about this a little bit, and let it sink in."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Does it make sense now? Awesome!\n",
    "\n",
    "So, we now need a way to tell which words are which.\n",
    "\n",
    "The way we are going to do this is:\n",
    "1. Create a vocabulary: the set of all the terms in the corpus\n",
    "2. Score all the terms with a metric that helps us understand the incidence of a given term in the corpus.\n",
    "    - This score will put the rare events at one extreme and the constant events at the other (e.g. rare terms having the higher values, and constant terms having the lower values).\n",
    "    - It follows that, for topic analysis, we want the words that lie between those two extremes\n",
    "3. Based on this metric, filter the terms and keep only those relevant for topic analysis "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Creating a vocabulary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Given that this corpus has a lot of documents, I am going to work with a subset of the corpus and use only its first 100 documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "My vocabulary size is 7642\n"
     ]
    }
   ],
   "source": [
    "tokenized_corpus_sampled = tokenised_corpus[:100]\n",
    "\n",
    "vocab = set()\n",
    "\n",
    "for doc in tokenized_corpus_sampled:\n",
    "    vocab = vocab.union(set(doc))  \n",
    "\n",
    "print(f\"My vocabulary size is {len(vocab)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 Scoring extreme terms\n",
    "\n",
    "The metric we are going to use to score the terms, and therefore to identify the extreme events, is the *IDF* metric."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The IDF Metric\n",
    "\n",
    "Suppose, for example, that in our corpus the word ‘rice’ appears in every document.   \n",
    "Suppose that you are the librarian keeping this corpus, and somebody comes searching for a subset of documents on a given topic from your corpus.  \n",
    "Imagine this library visitor tells you ‘rice’. You go into the box to fetch all the documents that contain that word. Clearly you will come back with the entire box, because every document in it contains that term.  \n",
    "Was ‘rice’ a helpful term to support the library visitor’s needs? Not really. In fact, not helpful at all. \n",
    "\n",
    "This is where Inverse Document Frequency, or IDF, comes in handy. This number represents the importance of a term in a given corpus, and is calculated using the following formula:\n",
    "\n",
    "$IDF_t = \\log_{10}\\left(\\frac{N}{df_t}\\right)$\n",
    "\n",
    "Where:\n",
    "- $N$ is the number of documents in the corpus\n",
    "- $df_t$ is the number of documents that contain the term $t$\n",
    "\n",
    "The code below uses the base-10 logarithm, so with $N = 100$ documents the IDF ranges from 0 (term in every document) to 2 (term in a single document)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
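    "import math\n",
    "\n",
    "def idf(term, corpus):\n",
    "    # Number of documents that contain the term at least once\n",
    "    cnt = sum(1 for doc in corpus if term in doc)\n",
    "    # Base-10 logarithm of (total documents / documents containing the term);\n",
    "    # cnt is never 0 here because every term comes from the corpus vocabulary\n",
    "    return math.log10(len(corpus) / cnt)\n",
    "\n",
    "idfvocab = {}\n",
    "\n",
    "for term in vocab:\n",
    "    term_idf = idf(term, tokenized_corpus_sampled)\n",
    "    idfvocab[term] = term_idf\n"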
   ]
  },
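  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the formula, here is a minimal sketch on a made-up corpus of four tokenised documents. The toy documents and the helper name `toy_idf` are assumptions for illustration only. The term `rice` appears in every document, so its IDF is 0, while rarer terms get higher scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Made-up corpus: each document is already a list of (stemmed) tokens\n",
    "toy_docs = [\n",
    "    ['rice', 'oven', 'cook'],\n",
    "    ['rice', 'onion'],\n",
    "    ['rice', 'senat', 'vote'],\n",
    "    ['rice', 'oven'],\n",
    "]\n",
    "\n",
    "def toy_idf(term, docs):\n",
    "    # Hypothetical helper: log10 of (number of documents / documents containing the term)\n",
    "    df = sum(1 for doc in docs if term in doc)\n",
    "    return math.log10(len(docs) / df)\n",
    "\n",
    "for term in ['rice', 'oven', 'onion']:\n",
    "    print(term, round(toy_idf(term, toy_docs), 3))"
   ]
  },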
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.6 Let's filter out those extreme events\n",
    "\n",
    "So, we now have a score for each of our terms. \n",
    "There are several ways to filter out the extreme events. \n",
    "\n",
    "We could, for example, remove the upper 25% and the lower 25% of the tokens.\n",
    "\n",
    "We could also say that we want at most 200 tokens, taken from somewhere in the middle of the distribution, to pass this filter.\n",
    "\n",
    "The following example takes the latter approach."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Min is 0.0 and max is 2.0\n"
     ]
    }
   ],
   "source": [
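    "import numpy as np\n",
    "\n",
    "# List of (term, idf) pairs\n",
    "idfvocab_it = list(idfvocab.items())\n",
    "\n",
    "# Take min and max over the numeric IDF values directly; putting the pairs in a\n",
    "# NumPy array would turn the scores into strings and compare them as text\n",
    "low = float(min(idfvocab.values()))\n",
    "high = float(max(idfvocab.values()))\n",
    "\n",
    "print(f\"Min is {low} and max is {high}\")"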
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "150"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def keep_terms( lower, upper, threshold, step, idf_vocabulary ):\n",
    "    low = lower\n",
    "    up = upper\n",
    "    candidates = idf_vocabulary\n",
    "    while len(candidates) > threshold:\n",
    "        #print(f\"current vocabolary size is {len(candidates)}\")\n",
    "        low = low + step\n",
    "        up = up - step\n",
    "        candidates = [  term for term in idf_vocabulary if term[1] >= low and term[1] <= up  ]\n",
    "    return candidates\n",
    "\n",
    "\n",
    "cnd = keep_terms(low, high, 200, 0.005, idfvocab_it)\n",
    "len(cnd)"
   ]
  },
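  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, the first option mentioned above (dropping the upper 25% and the lower 25% of the tokens) could be sketched as follows. This is only one way to read that idea, cutting by rank after sorting by IDF; the helper name `keep_middle_half` and the toy scores are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def keep_middle_half(scored_terms):\n",
    "    # scored_terms is a list of (term, idf) pairs\n",
    "    ranked = sorted(scored_terms, key=lambda pair: pair[1])\n",
    "    cut = len(ranked) // 4  # number of terms to drop at each end\n",
    "    return ranked[cut:len(ranked) - cut]\n",
    "\n",
    "# Made-up scores, from very common (low IDF) to very rare (high IDF)\n",
    "toy_scores = [\n",
    "    ('the', 0.0), ('said', 0.1), ('health', 0.7), ('senat', 0.9),\n",
    "    ('polic', 1.0), ('lawyer', 1.2), ('zebra', 1.9), ('xylophon', 2.0),\n",
    "]\n",
    "\n",
    "middle = keep_middle_half(toy_scores)\n",
    "print([term for term, _ in middle])"
   ]
  },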
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By the end of this process, we now have our Bag of Words to continue this analysis! Notice that every document is going to be encoded with this bag of words in a vector space - we now have a unified (single) corpus representation!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Construction of the TF-IDF Matrix\n",
    "\n",
    "This step is the core of the STA life cycle. What is interesting here is that, while starting with a corpus made of disjoint elements (a collection of documents), we end up with a single corpus representation.  \n",
    "\n",
    "This single representation unifies the information we have about the documents it comprises. Therefore we can use this single data structure to reason about the entire corpus!  \n",
    "For this goal, working with a universal dictionary of terms is essential."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The TF Metric\n",
    "\n",
    "Term Frequency or simply TF is a numeric quantity used to express the importance of a term inside a document.  \n",
    "In its raw form, it is simply the count of times a term appears in a document.\n",
    "\n",
    "However, here we compute TF as a proportion, by dividing this count by the total number of tokens in the (stemmed) document.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "vc = np.array(cnd) #a matrix, with column 0 being terms and column 1 being idf\n",
    "vc_terms = vc[:,0] \n",
    "\n",
    "def normTFx(term,doc):\n",
    "    return doc.count(term)/len(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The TF-IDF Metric\n",
    "\n",
    "The final quantity that we will use to measure the importance of a term inside a document that belongs to a corpus is the standard TF.IDF measure, which is simply:\n",
    "\n",
    "$TF.IDF^d_t = TF^d_t \\times IDF_t$\n",
    "\n",
    "Here $t$ refers, as always, to the term, and $d$ to a specific document. \n",
    "\n",
    "What is the effect of multiplying the original normalised TF by the IDF? The IDF acts as a modulator. If the TF is high but the term is everywhere in the corpus, the IDF will be low, so the TF is brought down. If the TF is medium but the IDF is high, then its importance is modulated upwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tfidfmat(corpus, tl) :\n",
    "    mat =[]\n",
    "    for i in tl :\n",
    "        idft = idf(i,corpus)\n",
    "        row = []\n",
    "        for d in corpus :\n",
    "            tft = normTFx(i,d)\n",
    "            tf_idf_term_document = tft*idft\n",
    "            row.append(tf_idf_term_document)\n",
    "        mat.append(row)\n",
    "    return mat    \n",
    "            \n",
    "    \n",
    "\n",
    "tfidf_matrix = tfidfmat(tokenized_corpus_sampled, vc_terms) \n",
    "tfidf_matrix_np = np.array(tfidf_matrix)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 100)"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf_matrix_np.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The document with index 0 contains 833 words\n",
      "The term with index 0 is `parent`\n",
      "The importance of the term `parent` in the document with idx = 0 is 0.0\n"
     ]
    }
   ],
   "source": [
    "print(f\"The document with index 0 contains {len(tokenized_corpus_sampled[0])} words\")\n",
    "print(f\"The term with index 0 is `{vc_terms[0]}`\")\n",
    "\n",
    "print(f\"The importance of the term `{vc_terms[0]}` in the document with idx = 0 is {tfidf_matrix_np[0,0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have all we need to construct our single representation of the corpus as a matrix.   \n",
    "\n",
    "This matrix has rows representing the terms of the universal dictionary for the corpus, and columns representing the contained documents. Therefore a given cell $S_{t,d}$ of the matrix will contain the corresponding $TF.IDF^d_t$."
   ]
  },
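  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the matrix layout concrete, here is a minimal self-contained sketch that builds a tiny TF-IDF matrix with terms as rows and documents as columns. The toy documents, the term list and the helper names `toy_tf` and `toy_idf` are assumptions for illustration; the logic mirrors the functions defined earlier in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Made-up corpus of three tokenised documents and a small bag of words\n",
    "toy_docs = [\n",
    "    ['oven', 'cook', 'onion', 'cook'],\n",
    "    ['senat', 'vote', 'health'],\n",
    "    ['cook', 'health'],\n",
    "]\n",
    "toy_terms = ['cook', 'health', 'senat']\n",
    "\n",
    "def toy_tf(term, doc):\n",
    "    # Proportion of the tokens in the document that are this term\n",
    "    return doc.count(term) / len(doc)\n",
    "\n",
    "def toy_idf(term, docs):\n",
    "    # log10 of (number of documents / documents containing the term)\n",
    "    df = sum(1 for doc in docs if term in doc)\n",
    "    return math.log10(len(docs) / df)\n",
    "\n",
    "# One row per term, one column per document\n",
    "toy_matrix = [\n",
    "    [toy_tf(t, d) * toy_idf(t, toy_docs) for d in toy_docs]\n",
    "    for t in toy_terms\n",
    "]\n",
    "\n",
    "for term, row in zip(toy_terms, toy_matrix):\n",
    "    print(term, [round(value, 3) for value in row])"
   ]
  },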
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Factorizing the Matrix\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
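    "from sklearn.decomposition import NMF\n",
    "\n",
    "# Factorize the (terms x documents) matrix into 5 components (topics)\n",
    "model = NMF(n_components=5, init='random', random_state=0)\n",
    "W = model.fit_transform(tfidf_matrix_np)  # loadings: one row per term, one column per component\n",
    "H = model.components_  # scores: one row per component, one column per document"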
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(150, 5)"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "W.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(5, 100)"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "H.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_top_N_terms(matrix_slice, N):\n",
    "    return matrix_slice.argsort()[-N:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_terms_from_slice(loadings_matrix, idx, topN, bag_of_words, orientation=\"col\"):\n",
    "    '''\n",
    "        the parameter `orientation` can either be \"col\" or \"row\", so we can process a loadings matrix being it transposed or not\n",
    "    '''\n",
    "    k = None\n",
    "    if orientation == \"col\":\n",
    "        k = loadings_matrix[:,idx]\n",
    "    elif orientation == \"row\":\n",
    "        k = loadings_matrix[idx,:]\n",
    "    else:\n",
    "        raise Exception(\"Orientation not recognized\")\n",
    "    k_top5terms_idx = get_top_N_terms(k,topN)\n",
    "    return bag_of_words[k_top5terms_idx]\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The terms with more weight in the component 0 are: ['onlin' 'growth' 'video' 'experi' 'audienc' 'educ' 'publish']\n",
      "The terms with more weight in the component 1 are: ['threaten' 'citizen' 'senior' 'educ' 'health' 'crisi' 'senat']\n",
      "The terms with more weight in the component 2 are: ['investor' 'air' 'rate' 'educ' 'negoti' 'pay' 'attend']\n",
      "The terms with more weight in the component 3 are: ['goal' 'titl' 'franc' 'twitter' 'host' 'central' 'met']\n",
      "The terms with more weight in the component 4 are: ['citizen' 'review' 'health' 'refus' 'appeal' 'lawyer' 'polic']\n"
     ]
    }
   ],
   "source": [
    "for k in range(0,W.shape[1]):\n",
    "    # Get terms for the k-th characteristic / topic\n",
    "    print(f\"The terms with more weight in the component {k} are: {get_terms_from_slice(W, k, 7, vc_terms)}\")\n",
    "\n",
    "# here we are printing the top 7, but this choice is arbitrary - we can analyze as many terms as we need to understand the topics"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
