<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Other Considerations - Miscellaneous](18.15-mlpg-Other-Considerations-Miscellaneous.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Social Network Analysis – An Introduction](20.00-mlpg-Social-Network-Analysis–An-Introduction.ipynb) ]>

# 19. Text Analytics – An Introduction

## 19.1. What can be done with text data?
* `Information Extraction`: Parse text and find/identify/extract relevant information from it
* `Sentiment Analysis`: Is this movie review positive or negative?
* `Topic Identification`: Is this news article about Politics, Sports, or Technology?
* `Spam Detection`: Is this email spam or not?
* `Spelling Correction`: Weather or whether? Color or colour?

## 19.2. Basic functions to handle texts in Python
The following are some of the basic text-handling functions and methods in Python.
* Finding unique words:
  - `set()` function
* Word comparison functions:
  - `s.startswith(t)`; `s.endswith(t)`
  - `t in s`
  - `s.isupper()`; `s.islower()`; `s.istitle()`
  - `s.isalpha()`; `s.isdigit()`; `s.isalnum()`
* String Operations:
  - `s.lower()`; `s.upper()`; `s.title()`
  - `s.split(t)`; `s.splitlines()`
  - `s.join(t)` (joins the strings in the iterable `t`, using `s` as the separator)
  - `s.strip()`; `s.rstrip()`
  - `s.find(t)`; `s.rfind(t)`
  - `s.replace(u, v)`
* File Operations:
  - `f = open(filename, mode)`
  - `f.readline()`; `f.read()`; `f.read(n)`
  - `for line in f: doSomething(line)`
  - `f.seek(n)`
  - `f.write(message)`
  - `f.close()`
  - `f.closed`
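
The functions listed above can be tried out in a few lines. The following is a minimal sketch, not a complete reference; the sample text and the temporary file name `sample.txt` are assumptions made up for illustration.

```python
import os
import tempfile

# Sample text (invented for this example)
text = "  Natural Language Processing with Python\nText Analytics 101  "

# String operations
clean = text.strip()                       # remove leading/trailing whitespace
lines = clean.splitlines()                 # one string per line
words = clean.split()                      # split on any run of whitespace
unique_words = set(w.lower() for w in words)

# Word comparison functions
title_words = [w for w in words if w.istitle()]
numeric_words = [w for w in words if w.isdigit()]
joined = "-".join(["text", "analytics"])   # the separator string calls join()
position = clean.find("Language")          # index of the first match, or -1

# File operations (the "with" block closes the file automatically)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write(clean)
with open(path) as f:
    first_line = f.readline().rstrip()
    f.seek(0)                              # jump back to the start of the file
    all_lines = [line.rstrip() for line in f]
file_is_closed = f.closed

print(title_words, numeric_words, joined, position, first_line, file_is_closed)
```

Opening files with `with` is a common alternative to calling `f.close()` by hand, since the file is closed even if an error occurs inside the block.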

## 19.3. NLP and Basic NLTK tasks
**Natural Language Processing (NLP):**
* `Natural language`: Language used by humans for communication
* `NLP` (Natural Language Processing): Any computation or manipulation of natural language
* It’s a branch of AI that focuses on text and speech
* NLP is a collective term referring to the automatic computational processing of human languages. This includes both algorithms that take human-produced text as input and algorithms that produce natural-looking text as outputs
* Use-case examples:
  - **`Sentiment analysis`**: Analyze how people feel about a business or product(s) of a business
  - **`Monitoring system`**: Monitor the positive and negative comments about a company on social media
  - **`Chatbots and Virtual assistants`**: Can be used as a core to build chatbots and virtual assistants
  - **`Text extraction`**: This can be used to extract texts from documents

**Natural Language ToolKit (NLTK):**
* An open-source Python library that is widely used for text processing and NLP tasks
* It gives access to many corpora and handy tools
* Sentence splitting, tokenization, and lemmatization are important, non-trivial preprocessing tasks
* Some of the simple NLTK tasks are:
  - Count the number of words
    - `len(t)`
  - Count the number of unique words
    - `len(set(t))`
  - Frequency of words
    - `dist = FreqDist(t)`; `dist['word']` returns the count of `'word'` (`len(dist)` gives the number of unique words)
  - Return a list of the words in t, separating by `' '`
    - `t.split(' ')`
* List comprehension examples:
  - Words in `t` that are longer than 3 characters
    - `[w for w in t if len(w) > 3]`
  - Capitalized words in t
    - `[w for w in t if w.istitle()]`
  - Words in t that end in 's'
    - `[w for w in t if w.endswith('s')]`
  - Finding hashtags example
    - `[w for w in t if w.startswith('#')]`
  - Finding callouts example
    - `[w for w in t if w.startswith('@')]`
  - Regular expression example
    - `[w for w in t if re.search('@[A-Za-z0-9_]+', w)]`
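
The tasks above can be tried on a single tweet-like string. This is a minimal sketch: the sample sentence is invented, and `collections.Counter` from the standard library stands in for NLTK's `FreqDist` (which offers a similar counting interface) so that the example runs without installing NLTK.

```python
import re
from collections import Counter

# Invented sample text; t is a list of whitespace-separated words
tweet = "@nlp_fan Loving the new #NLP course and the #Python labs with @Bala"
t = tweet.split(' ')

n_words = len(t)                 # number of words
n_unique = len(set(t))           # number of unique words
freq = Counter(t)                # word -> count (stand-in for FreqDist)

long_words = [w for w in t if len(w) > 3]
ends_in_s = [w for w in t if w.endswith('s')]
hashtags = [w for w in t if w.startswith('#')]
callouts = [w for w in t if w.startswith('@')]
# The regular expression is stricter than startswith('@'):
# it requires at least one valid character after the '@'
regex_callouts = [w for w in t if re.search(r'@[A-Za-z0-9_]+', w)]

print(n_words, n_unique, freq['the'])
print(hashtags, callouts, regex_callouts)
```

Note that `w.istitle()` also returns `True` for tokens such as `'#Python'`, because the check ignores non-letter characters; stripping punctuation first gives cleaner results.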

## 19.4. Components of NLP and the techniques used in NLP
**NLP Components:**
* NLP does not apply only to English; it works for other languages as well
* Some of the examples where NLP is used are:
  - Chatbots, Recommender systems, Sentiment analysis
  - Cybersecurity (spam detection, etc)
  - Voice assistant from Google
  - Siri from Apple
  - Alexa from Amazon
* The components of NLP are:
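  1) **`NL Generation (NLG)`**: produces meaningful natural-language text from data or an internal representation
  2) **`NL Understanding (NLU)`**: extracts the meaning of natural-language input; it is generally considered more complex than NLG because human language is ambiguous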

**NLP techniques:**
* **Tokenization:**
  - Dividing a sentence into individual units (tokens); it is usually the first step of NLP, for example:<br>
    The lion killed the deer 🠊 `The` `lion` `killed` `the` `deer`

* **Stemming:**
  - The process of trimming affixes (usually suffixes, sometimes prefixes) from words to reduce them to their root form, for example:<br>
    Careful 🠊 `Care`<br>
    Doing 🠊 `Do`<br>
    Helping 🠊 `Help`<br>
    Unhappy 🠊 `happy`<br>
  - Less accurate, less computational cost, fast

* **Lemmatization:**
  - Finds the dictionary form (lemma) of a word from its different forms, including irregular ones, for example:<br>
    `Affected` 🠊 `affect`<br>
    `Good` 🠊 `good`, `Better` 🠊 `good`, `Best` 🠊 `good`<br>
    `Bad` 🠊 `bad`, `Worse` 🠊 `bad`, `Worst` 🠊 `bad`<br>
  - More accurate, more computational cost, slow

* **Stop Word:**
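  - Words that carry little or no value for the analysis and are usually filtered out (punctuation and whitespace are often removed in the same step), for example:<br>
    “`a`”, “`an`”, “`the`”, “`.`”, “`,`”, “`;`”, “` `”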

* **POS (Parts Of Speech) tagging:**
  - The process of tagging the parts of speech of the words of a particular sentence, for example:<br>
    `The lion killed the deer` &emsp;&emsp;&nbsp; `lion`, `deer` 🠊 `noun`<br>
    &nbsp; ↓ &emsp;&emsp; ↓ &emsp;&emsp; ↓ &emsp;&emsp; ↓ &emsp;&emsp; ↓ &emsp;&emsp;&emsp; `the`, `a`     🠊 `determiner`<br>
    `DT` &nbsp;&nbsp; `NN` &nbsp;&nbsp; `VB` &nbsp;&nbsp;&nbsp; `DT` &nbsp;&nbsp;&nbsp; `NN`   &emsp;&emsp; `killed`     🠊 `verb`<br>

* **Named Entity Recognition:**
  - Classifies the named entities (typically proper nouns) of a sentence into categories such as persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc., for example:<br>
  `Bala joined Microsoft as a Manager at the Chicago branch`<br>
  `Bala` 🠊 **`person`**, `Microsoft` 🠊 **`organization`**, `Chicago` 🠊 **`location`**
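
The first few techniques above can be chained into a small preprocessing pipeline. The sketch below uses only the standard library; the stop-word list and the suffix rules are toy assumptions for illustration, not the rules of a real stemming algorithm such as the ones shipped with NLTK.

```python
import re

STOP_WORDS = {"a", "an", "the"}           # tiny illustrative stop-word list
SUFFIXES = ("ing", "ed", "ful", "s")      # toy rules, not a real stemming algorithm


def tokenize(sentence):
    """Split a sentence into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


tokens = tokenize("The lion killed the deer.")
content = remove_stop_words(tokens)
stems = [toy_stem(tok) for tok in content]

print(tokens)    # tokens after tokenization
print(content)   # tokens after stop-word removal
print(stems)     # tokens after toy stemming
```

POS tagging and Named Entity Recognition need trained models or dictionaries, so they are left out of this sketch.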

## 19.5. Differences between `Stemming` and `Lemmatization`
* `Lemmatization` and `stemming` are special cases of **normalization** 
* They identify a canonical representative for a set of related word forms
* `Lemmatization` is closely related to `stemming`, but there are some differences
* `Stemmer` operates on a single word **`without`** knowledge of the context
* `Stemmer` cannot discriminate between words that have different meanings depending on the POS 
* Stemmers are typically easier to implement and run faster
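* When used within information retrieval systems, `stemming` tends to improve query **`recall`** (the **`true positive rate`**) compared with `lemmatization`, because more word forms match the query. However, `stemming` tends to reduce **`precision`** (the share of retrieved results that are actually relevant), because unrelated words can collapse to the same stem. The following examples show how the two approaches differ: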
  - The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up
  - The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization
  - The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context
* A `stemmer` returns the stem of a word, which needn't be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if the stem is not itself a valid root. A `lemmatizer`, by contrast, returns the dictionary form of a word, which must be a valid word
* In `lemmatization`, the part of speech of a word is determined first, because the normalization rules differ from one part of speech to another; a `stemmer` skips this step
* An example-driven explanation of the differences between lemmatization and stemming:
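  - `Lemmatization` handles matching “**car**” to “**cars**” and, because it is dictionary-based, irregular forms such as “**mouse**” to “**mice**”
  - `Stemming` handles matching “**car**” to “**cars**” only
  - Neither matches “**car**” to “**automobile**”; that is synonym matching, which needs a thesaurus-style resource in addition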
* `Stemming` simply strips the last few characters of a word, which often produces incorrect meanings and spellings. `Lemmatization` considers the context and converts the word to its meaningful base form, called the lemma. Because the same word can have different lemmas, the Part of Speech (POS) tag of the word in that specific context should be identified first. The following examples illustrate the differences and use cases:
  - If you `lemmatize` the word '**Caring**', it returns '**Care**'. A crude suffix-stripping `stemmer` may return '**Car**', which is erroneous
  - If you `lemmatize` the word '**Stripes**' in `verb context`, it returns '**Strip**'; in `noun context`, it returns '**Stripe**'. A `stemmer` ignores the context and always returns the same stem (for example, '**Strip**')
  - You would get the same results whether you `lemmatize or stem` words such as **walking**, **running**, **swimming**... to **walk**, **run**, **swim**, etc.
  - `Lemmatization` is _**computationally expensive**_ since it involves dictionary look-ups and part-of-speech analysis.
    - If you have a `large dataset and performance is an issue`, go with **`Stemming`**. You can also add your own rules to a stemmer.
    - If `accuracy is paramount and the dataset isn't humongous`, go with **`Lemmatization`**
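
The contrast between the two approaches can be sketched with a toy example. Everything here is an assumption for illustration: `LEMMAS` is a hypothetical mini-dictionary keyed by word and part of speech, and `toy_stem` is a crude suffix stripper, not the algorithm of any particular library.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma
LEMMAS = {
    ("caring", "verb"): "care",
    ("better", "adj"): "good",
    ("stripes", "noun"): "stripe",
    ("stripes", "verb"): "strip",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
    ("walking", "verb"): "walk",
}


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercase word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


def toy_stem(word):
    """Strip a common suffix without looking at context or a dictionary."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


examples = [
    ("caring", "verb"), ("stripes", "noun"), ("stripes", "verb"),
    ("better", "adj"), ("meeting", "noun"), ("meeting", "verb"),
    ("walking", "verb"),
]
comparison = [(w, pos, toy_stem(w), toy_lemmatize(w, pos)) for w, pos in examples]

for word, pos, stem, lemma in comparison:
    print(f"{word:8} {pos:5} stem={stem:8} lemma={lemma}")
```

The stemmer gives one answer per word regardless of the part of speech, while the lemmatizer's answer for "stripes" and "meeting" changes with it; the price is the dictionary and the POS information the lemmatizer needs.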

## 19.6. Information Extraction (IE)
* Most of the world's data is in unstructured or semi-structured text, examples are:
  - Social media analytics -> What are customers saying about my products? Is my marketing campaign successful?
  - Machine log analysis -> Why is my server failing?
  - Call center logs -> Why are customers leaving the company?
  - Email analysis -> Which emails are related to a merger/financial trade?
  - Financial services -> Automatically buy/sell shares based on news
* **Information Extraction (IE)**: Distill structured data from unstructured and semi-structured text
* **Goals of IE:**
  - Enable humans to sift through information faster
  - Enable the computer algorithms to exploit the text
* **Streaming**: Continuous ingestion and continuous analysis of data (**`Note:`** `streaming is not real-time`)
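
As a small illustration of distilling structured data from semi-structured text, the sketch below turns machine log lines into records. The log format, the host names, and the sample lines are all invented for this example; real logs would need a pattern written for their own layout.

```python
import re

# Assumed log layout: "<date> <time> <LEVEL> <host>: <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<host>[\w.-]+): "
    r"(?P<message>.*)"
)

# Invented sample input, including one line that does not fit the layout
raw_lines = [
    "2024-01-15 10:02:11 ERROR web-01: disk usage at 97%",
    "2024-01-15 10:02:15 INFO web-02: health check passed",
    "not a log line",
]


def extract_records(lines):
    """Return one dict per line that matches the assumed log layout."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records


records = extract_records(raw_lines)
failing_hosts = [r["host"] for r in records if r["level"] == "ERROR"]

print(records)
print(failing_hosts)
```

Once the text is in this structured form, questions such as "which server is failing?" become simple filters over the records.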

## 19.7. How does IE solve business problems?
* **Named Entity Recognition (NER):**
  - Extracts proper names of persons, organizations, places, and so on
* **Relation Extraction:**
  - Uncovers relationships between different entities mentioned in the text, for example, 
`the relationship between a company and the location where the company is headquartered at`
* **Event Extraction:**
  - Like Relation Extraction, Event Extraction finds relations between entities, but the relation also implies a change of state, for example, `who beat whom in the Australian Open`
* **Sentiment Extraction:**
  - Extracts opinion towards a particular entity or a particular aspect of an entity, along with the polarity of that sentiment (`whether it is positive or negative`)
* **Co-reference Resolution:**
  - Attempts to link different references in the text to the same entity, for example, 
`Co-reference Resolution indicates that the pronoun ‘he’ refers to ‘President Obama’; 
‘the president’ also refers to ‘President Obama’; but ‘it’ refers to ‘Russia’.`
* **Table Extraction:**
  - A lot of information on the web and in public records is available in HTML format, and it may or may not be formatted as tables. Table Extraction associates apparent cells with table boundaries, titles, row and column headings, and so on
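
To make Relation Extraction concrete, here is a deliberately simple, pattern-based sketch for the company/headquarters relation. It assumes that runs of capitalized words are entity mentions and that the relation is always phrased as "is headquartered in"; production systems use trained models or much richer rule sets. "Acme Corp" is a fictional company used only as sample input.

```python
import re

# Crude assumption: a run of capitalized words is an entity mention
ENTITY = r"[A-Z][a-zA-Z]*(?: [A-Z][a-zA-Z]*)*"
HQ_PATTERN = re.compile(
    rf"(?P<company>{ENTITY}) is headquartered in (?P<location>{ENTITY})"
)


def extract_headquarters(text):
    """Return (company, location) pairs for every match of the pattern."""
    return [
        (m.group("company"), m.group("location"))
        for m in HQ_PATTERN.finditer(text)
    ]


text = (
    "Microsoft is headquartered in Redmond. "
    "Bala joined Microsoft as a Manager. "
    "Acme Corp is headquartered in New York."
)
relations = extract_headquarters(text)
print(relations)
```

The second sentence mentions two entities but produces no pair, because it does not express the headquarters relation.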

## 19.8. Challenges and requirements for IE
* **Accuracy:**
  - The quality of extracted information drives the quality of an entire application; an accurate system produces high-quality extraction results, with very few false positives and false negatives
  - In practice, obtaining high accuracy for real-world data requires complex extraction programs that can express the nuances of human-readable data
* **Usability:**
  - The system should be easy to program; it shouldn’t take months of programming to get results
  - Every extraction task and data set has its particular quirks, so usability is key to efficiently analyzing a particular data set or customer need
* **Scalability:**
  - Scalability requires that the system can turn unstructured text into structured data at a high rate with relatively modest computational resources
  - Scalability determines whether our solution will be economical and practical
* **Expressivity:**
  - Expressivity means that the system can accurately capture the nuances of human-readable data, which is essential for high-quality analysis
* **Transparency:**
  - Transparency means that it’s easy to understand why the system produces a particular result and to change the behavior of the system if that result isn’t satisfactory
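
The accuracy requirement is usually quantified by comparing extracted items against a hand-labeled reference set. The sketch below computes precision, recall, and F1 from false positives and false negatives; the function name `extraction_scores` and both entity sets are assumptions made up for illustration.

```python
def extraction_scores(predicted, gold):
    """Return (precision, recall, f1) for extracted items vs. a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)     # extracted and correct
    false_pos = len(predicted - gold)    # extracted but wrong
    false_neg = len(gold - predicted)    # correct but missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-labeled reference entities (invented for this example)
gold = {
    ("Bala", "person"),
    ("Microsoft", "organization"),
    ("Chicago", "location"),
}
# Output of a hypothetical extractor: one wrong entity, one missed entity
predicted = {
    ("Bala", "person"),
    ("Microsoft", "organization"),
    ("Manager", "organization"),
}

precision, recall, f1 = extraction_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A system with very few false positives scores high on precision, and one with very few false negatives scores high on recall; F1 summarizes the two in a single number.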
