# <h1 style="text-align: center;" class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Natural Language Processing in Python Track</h1>

Gain the core Natural Language Processing (NLP) skills you need to convert unstructured data into valuable insights. You’ll learn to use Natural Language Processing in Python to automatically transcribe TED talks, extract information from articles, and identify whether a movie review is positive or negative. As you progress, you’ll discover some popular Python NLP libraries, including NLTK, scikit-learn, spaCy, and SpeechRecognition.

You’ll start this track by learning how to identify words and extract topics in text before building your very own chatbot that transforms human language into actionable instructions. By the end of the track, you'll understand how to transcribe audio files using natural language processing techniques and how to extract insights from real-world sources, including Wikipedia articles, online review sites, and data from a flight booking system.

# <h1 style="text-align: center;" class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Introduction to Natural Language Processing in Python</h1>

In this course, you'll learn natural language processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries which utilize deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.

<a id="toc"></a>

<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>
    
* [1. Regular expressions & word tokenization](#1)
    - Introduction to regular expressions
    - Introduction to tokenization
    - Advanced tokenization with NLTK and regex
    - Charting word length with NLTK

* [2. Simple topic identification](#2) 
    - Word counts with bag-of-words
    - Simple text preprocessing
    - Introduction to gensim
    - TF-IDF with gensim
    
* [3. Named-entity recognition](#3)
    - Named-entity recognition
    - Introduction to spaCy
    - Multilingual NER with polyglot
    
* [4. Building a "fake news" classifier](#4)
    - Classifying fake news using supervised learning with NLP
    - Building word count vectors with scikit-learn
    - Training and testing a classification model with scikit-learn
    - Simple NLP, complex problems

## Imports

In [1]:
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## <a id="1"></a>
<font color="lightseagreen" size=+2.5><b>1. Regular expressions & word tokenization</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and more difficult tokenization you might find.

### 1 01 Introduction to regular expressions

1. Introduction to regular expressions

Welcome to the course! In this video, you'll be learning about regular expressions.

2. What is Natural Language Processing?

![image.png](attachment:image.png)

Natural language processing is a massive field of study and actively used practice which aims to make sense of language using statistics and computers. In this course, you will learn some of the basics of NLP which will help you move from simple to more difficult and advanced topics. Even though this is the first course, you will still get some exposure to the challenges of the field such as topic identification and text classification. Some interesting NLP areas you might have heard about are: topic identification, chatbots, text classification, translation, sentiment analysis. There are also many more! You will learn the fundamentals of some of these topics as we move through the course.

3. What exactly are regular expressions?

![image-2.png](attachment:image-2.png)

Regular expressions are strings with a special syntax that lets you match patterns and find other strings. A pattern is a series of letters or symbols that can map to actual text, words, or punctuation. You can use regular expressions to find links in a webpage, parse email addresses, and remove unwanted strings or characters. Regular expressions are often referred to as regex and are available in Python through the `re` library. Here we have a simple import of the library. We can match a substring with the re.match method, which matches a pattern against the start of a string. It takes the pattern as the first argument and the string as the second, and returns a match object. Here we see it matched exactly what we expected: abc. We can also use special patterns that regex understands, like \w+, which will match a word. We can see from the match object representation that it has matched the first word it found: hi.
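The two matches described above can be reproduced with a short sketch. The input strings `"abcdef"` and `"hi there!"` are assumptions chosen to give the results mentioned in the text; the slide itself is not reproduced here.

```python
import re

# Match a literal substring at the start of the string
literal_match = re.match(r"abc", "abcdef")
print(literal_match.group())  # abc

# \w+ matches one or more word characters, so it stops at the first space
word_match = re.match(r"\w+", "hi there!")
print(word_match.group())  # hi
```

Calling `.group()` on the match object returns the matched text; `re.match` returns `None` when the pattern does not match at the start of the string.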

4. Common regex patterns

![image-3.png](attachment:image-3.png)

There are hundreds of characters and patterns you can learn and memorize with regular expressions, but to get started, I want to share a few common patterns. The first pattern, \w, we already saw; it matches a single word character (a letter, digit, or underscore). The \d pattern matches digits, which is useful when you need to find them and separate them in a string. The \s pattern matches whitespace, and the period is a wildcard character that matches any letter or symbol (except a newline by default). The + and * characters make a pattern greedy, grabbing repeats of single letters or whole patterns. For example, to match a full word rather than one character, we add the + symbol after the \w. Writing these character classes as capital letters negates them, so \S matches anything that is not whitespace. You can also create a group of characters you want by putting them inside square brackets, like our lowercase group.
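As a rough illustration of these patterns, the sketch below applies each one with `re.findall`. The sample string is hypothetical and not taken from the course slides.

```python
import re

# Hypothetical sample string, used only for illustration
sample = "Order 66 shipped to user_9 on Friday!"

digits = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)           # runs of word characters
whitespace = re.findall(r"\s", sample)       # single whitespace characters
non_space = re.findall(r"\S+", sample)       # runs of non-whitespace characters
lowercase = re.findall(r"[a-z]+", sample)    # runs of lowercase ASCII letters

print(digits)     # ['66', '9']
print(words)      # ['Order', '66', 'shipped', 'to', 'user_9', 'on', 'Friday']
print(non_space)  # ['Order', '66', 'shipped', 'to', 'user_9', 'on', 'Friday!']
print(lowercase)  # ['rder', 'shipped', 'to', 'user', 'on', 'riday']
```

Note that `\w` includes the underscore and digits, which is why `user_9` stays in one piece, while `[a-z]+` drops capital letters, digits, and the underscore.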

5. Common regex patterns (2)

![image-4.png](attachment:image-4.png)

6. Common regex patterns (3)

![image-5.png](attachment:image-5.png)

7. Common regex patterns (4)

![image-6.png](attachment:image-6.png)

8. Common regex patterns (5)

![image-7.png](attachment:image-7.png)

9. Common regex patterns (6)

![image-8.png](attachment:image-8.png)

10. Common regex patterns (7)

![image-9.png](attachment:image-9.png)

11. Python's re module

![image-10.png](attachment:image-10.png)

In the following exercises, you'll use the `re` module to perform some simple activities, like splitting on a pattern or finding all patterns in a string. In addition to split and findall, search and match are also quite popular. You saw a simple match at the beginning of this video, and search is similar but doesn't require you to match the pattern from the beginning of the string. The syntax for the regex library is always to pass the pattern first, and the string second. Depending on the method, it may return an iterator, a new string or a match object. Here we see the re.split method will take a pattern for spaces and a string with some spaces and return a list object with the results of splitting on spaces. This can be used for tokenization, so you can preprocess text using regex while doing natural language processing.
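A minimal sketch of the four functions mentioned above, all called with the pattern first and the string second. The sample string is an assumption based on the description of the slide.

```python
import re

text = "Split on spaces."

# re.split returns a list of the pieces between matches
tokens = re.split(r"\s+", text)
print(tokens)  # ['Split', 'on', 'spaces.']

# re.findall returns a list of every non-overlapping match
words = re.findall(r"\w+", text)
print(words)  # ['Split', 'on', 'spaces']

# re.search scans the whole string; re.match only tries the beginning
found = re.search(r"on", text)
print(found.span())            # (6, 8)
print(re.match(r"on", text))   # None
```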

12. Let's practice!

Now it's your turn! Get started writing your first Regex and I'll see you back here soon!

**Exercise**

**Which pattern?**

Which of the following Regex patterns results in the following text?

![image.png](attachment:image.png)

In the IPython Shell, try replacing PATTERN with one of the below options and observe the resulting output. The re module has been pre-imported for you and my_string is available in your namespace.

In [2]:
my_string = "Let's write RegEx!"

In [7]:
import re

re.split(r'\W+', my_string)

['Let', 's', 'write', 'RegEx', '']

In [9]:
re.findall(r"\s+", my_string)

[' ', ' ']

In [8]:
re.findall(r"\w+", my_string)

['Let', 's', 'write', 'RegEx']

In [10]:
re.findall("[a-z]", my_string)

['e', 't', 's', 'w', 'r', 'i', 't', 'e', 'e', 'g', 'x']

In [11]:
re.findall(r"\w", my_string)

['L', 'e', 't', 's', 'w', 'r', 'i', 't', 'e', 'R', 'e', 'g', 'E', 'x']

**Exercise**

**Practicing regular expressions: re.split() and re.findall()**

Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at my_string first by printing it in the IPython Shell, to determine how you might best match the different steps.

Note: It's important to prefix your regex patterns with r to ensure that they are interpreted the way you intend. Otherwise, you may run into problems with escape sequences in strings. For example, "\n" in Python indicates a new line, but with the r prefix it is interpreted as the raw string "\n" - that is, the character "\" followed by the character "n" - and not as a new line.
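The effect of the r prefix can be seen in a small sketch. The word-boundary example with `\b` is an added illustration, not part of the exercise: in a normal string `"\b"` is a backspace character, while in a raw string it reaches the regex engine as the word-boundary pattern.

```python
import re

# A normal string turns "\n" into one newline character; a raw string keeps two characters
plain = "\n"
raw = r"\n"
print(len(plain), len(raw))  # 1 2

# Without the r prefix, "\b" is a backspace character, so nothing matches
no_prefix = re.findall("\bcat\b", "a cat sat")
# With the r prefix, \b is the regex word boundary
with_prefix = re.findall(r"\bcat\b", "a cat sat")
print(no_prefix)    # []
print(with_prefix)  # ['cat']
```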

The regular expression module re has already been imported for you.

Remember from the video that the syntax for the regex library is always to pass the pattern first, and then the string second.

**Instructions**

- Split my_string on each sentence ending. To do this:
    - Write a pattern called sentence_endings to match sentence endings (.?!).
    - Use re.split() to split my_string on the pattern and print the result.
- Find and print all capitalized words in my_string by writing a pattern called capitalized_words and using re.findall().
    - Remember the [a-z] pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
- Write a pattern called spaces to match one or more spaces ("\s+") and then use re.split() to split my_string on this pattern, keeping all punctuation intact. Print the result.
- Find all digits in my_string by writing a pattern called digits ("\d+") and using re.findall(). Print the result.

In [12]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [13]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.!?]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r" +"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']
['Let', 'RegEx', 'Won', 'Can', 'Or']
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']
['4', '19']


### 1 02 Introduction to tokenization

1. Introduction to tokenization

In this video, we'll learn more about string tokenization!

2. What is tokenization?

![image.png](attachment:image.png)

Tokenization is the process of transforming a string or document into smaller chunks, which we call tokens. This is usually one step in the process of preparing a text for natural language processing. There are many different theories and rules regarding tokenization, and you can create your own tokenization rules using regular expressions. Normally, tokenization will break out words or sentences and often separate punctuation; you can even tokenize just parts of a string, such as separating all hashtags in a Tweet.
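To make the idea of custom rules concrete, here is a minimal sketch that uses only the `re` module. The tweet text and the `@datacamp` handle are hypothetical, and the word-level pattern is a rough approximation, not the algorithm any particular tokenizer library uses.

```python
import re

# Hypothetical tweet, used only for illustration
tweet = "Loving the new #NLP course, thanks @datacamp! #python"

# Rule 1: keep only the hashtags
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#NLP', '#python']

# Rule 2: split into words and individual punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", tweet)
print(rough_tokens)
```

The second rule treats every punctuation mark as its own token, so `#NLP` becomes `#` and `NLP`; which behavior is preferable depends on the task.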

3. nltk library

![image-2.png](attachment:image-2.png)

One library that is commonly used for simple tokenization is nltk, the Natural Language Toolkit library. Here is a short example of using the word_tokenize method to break down a string into tokens. We can see from the result that words are separated and punctuation marks are individual tokens as well.

4. Why tokenize?

![image-3.png](attachment:image-3.png)

Why bother with tokenization? Because it can help us with some simple text processing tasks like mapping parts of speech, matching common words, and perhaps removing unwanted tokens like common words or repeated words. Here, we have a good example. The sentence is: I don't like Sam's shoes. When we tokenize it, we can clearly see the negation in the n't token and the possession in the 's token. These indicators can help us determine meaning from simple text.

5. Other nltk tokenizers

![image-4.png](attachment:image-4.png)

Beyond just tokenizing words, NLTK has plenty of other tokenizers you can use, including these ones you'll be working with in this chapter. The sent_tokenize function will split a document into individual sentences. The regexp_tokenize function uses regular expressions to tokenize the string, giving you more granular control over the process. And the TweetTokenizer class does neat things like recognizing hashtags, mentions, and runs of too many punctuation symbols following a sentence. How convenient!!!

6. More regex practice

![image-5.png](attachment:image-5.png)

You'll be using more regex in this section as well, not only when you are tokenizing, but also when figuring out how to parse tokens and text. The regex module's re.match and re.search are essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they differ. When we use search and match with the same pattern and string, and the pattern is at the beginning of the string, we find identical matches. That is the case with matching and searching abcde with the pattern abc. When we use search for a pattern that appears later in the string, we get a result, but we don't get the same result using match. This is because match tries to match the pattern starting at the beginning of the string and stops when it cannot match any longer. Search will go through the ENTIRE string to look for match options. If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.
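The difference can be sketched with the string abcde mentioned above. The second pattern, `cd`, is an assumed example of a pattern that appears later in the string.

```python
import re

# Pattern at the beginning: match and search agree
at_start_match = re.match(r"abc", "abcde")
at_start_search = re.search(r"abc", "abcde")
print(at_start_match.group(), at_start_search.group())  # abc abc

# Pattern later in the string: only search finds it
later_match = re.match(r"cd", "abcde")
later_search = re.search(r"cd", "abcde")
print(later_match)           # None
print(later_search.span())   # (2, 4)
```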

7. Let's practice!

Now it's your turn to try some tokenization!

**Exercise**

**Word tokenization with NLTK**

Here, you'll be using the first scene of Monty Python's Holy Grail, which has been pre-loaded as scene_one. Feel free to check it out in the IPython Shell!

Your job in this exercise is to utilize word_tokenize and sent_tokenize from nltk.tokenize to tokenize both words and sentences from Python strings - in this case, the first scene of Monty Python's Holy Grail.

**Instructions**

- Import the sent_tokenize and word_tokenize functions from nltk.tokenize.
- Tokenize all the sentences in scene_one using the sent_tokenize() function.
- Tokenize the fourth sentence in sentences, which you can access as sentences[3], using the word_tokenize() function.
- Find the unique tokens in the entire scene by using word_tokenize() on scene_one and then converting it into a set using set().
- Print the unique tokens found. This has been done for you, so hit 'Submit Answer' to see the results!

In [16]:
scene_one = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

In [17]:
# Import necessary modules
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

{"'m", 'migrate', 'an', 'So', 'Are', 'Listen', 'times', 'suggesting', 'line', 'goes', 'sun', 'weight', 'clop', 'winter', 'ask', 'got', 'yet', 'That', '?', 'that', 'husk', 'wind', 'Found', 'beat', 'mean', 'Well', 'We', 'where', 'wings', 'from', 'and', '...', 'my', "'s", 'house', 'pound', 'Halt', 'temperate', 'by', 'tell', 'agree', 'they', 'Supposing', 'course', 'Wait', 'maintain', 'horse', 'The', 'minute', 'castle', 'other', '!', 'strangers', 'zone', 'Where', 'then', 'swallow', 'will', 'Saxons', ']', '.', 'one', 'tropical', 'every', 'knights', 'get', 'kingdom', 'plover', 'right', 'KING', 'forty-three', "'ve", 'carried', 'carry', 'King', 'But', 'strand', 'It', 'to', 'second', 'Please', 'servant', '#', 'but', 'A', 'go', 'In', 'Ridden', 'two', 'found', 'ARTHUR', 'here', 'guiding', 'swallows', 'am', 'are', 'five', 'join', 'interested', 'dorsal', 'under', 'have', 'of', 'using', 'Oh', 'or', 'these', 'son', '2', 'England', 'could', 'Will', 'our', 'sovereign', 'search', 'fly', 'covered', 'held'

**Exercise**

**More regex with re.search()**

In this exercise, you'll utilize re.search() and re.match() to find specific tokens. Both search and match expect regex patterns, similar to those you defined in an earlier exercise. You'll apply these regex library methods to the same Monty Python text from the nltk corpora.

You have both scene_one and sentences available from the last exercise; now you can use them with re.search() and re.match() to extract and match more text.

**Instructions**

- Use re.search() to search for the first occurrence of the word "coconuts" in scene_one. Store the result in match.
- Print the start and end indexes of match using its .start() and .end() methods, respectively.
----------
- Write a regular expression called pattern1 to find anything in square brackets.
- Use re.search() with the pattern to find the first text in square brackets in scene_one. Print the result.
----------
- Create a pattern to match the script notation (e.g. Character:), assigning the result to pattern2. Remember that you will want to match any words or spaces that precede the : (such as the space within SOLDIER #1:).
- Use re.match() with your new pattern to find and print the script notation in the fourth line. The tokenized sentences are available in your namespace as sentences.

In [18]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

580 588


In [21]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>


In [22]:
# Make the pattern non-greedy so it stops at the first closing bracket
pattern1 = r'\[(.*?)\]'
print(re.search(pattern1, scene_one))

<re.Match object; span=(9, 15), match='[wind]'>


In [23]:
sentences = ['SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!',
 '[clop clop clop] \nSOLDIER #1: Halt!',
 'Who goes there?',
 'ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.',
 'King of the Britons, defeator of the Saxons, sovereign of all England!',
 'SOLDIER #1: Pull the other one!',
 'ARTHUR: I am, ...  and this is my trusty servant Patsy.',
 'We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.',
 'I must speak with your lord and master.',
 'SOLDIER #1: What?',
 'Ridden on a horse?',
 'ARTHUR: Yes!',
 "SOLDIER #1: You're using coconuts!",
 'ARTHUR: What?',
 "SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.",
 'ARTHUR: So?',
 "We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?",
 'ARTHUR: We found them.',
 'SOLDIER #1: Found them?',
 'In Mercea?',
 "The coconut's tropical!",
 'ARTHUR: What do you mean?',
 'SOLDIER #1: Well, this is a temperate zone.',
 'ARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?',
 'SOLDIER #1: Are you suggesting coconuts migrate?',
 'ARTHUR: Not at all.',
 'They could be carried.',
 'SOLDIER #1: What?',
 'A swallow carrying a coconut?',
 'ARTHUR: It could grip it by the husk!',
 "SOLDIER #1: It's not a question of where he grips it!",
 "It's a simple question of weight ratios!",
 'A five ounce bird could not carry a one pound coconut.',
 "ARTHUR: Well, it doesn't matter.",
 'Will you go and tell your master that Arthur from the Court of Camelot is here.',
 'SOLDIER #1: Listen.',
 'In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?',
 'ARTHUR: Please!',
 'SOLDIER #1: Am I right?',
 "ARTHUR: I'm not interested!",
 'SOLDIER #2: It could be carried by an African swallow!',
 'SOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.',
 "That's my point.",
 'SOLDIER #2: Oh, yeah, I agree with that.',
 'ARTHUR: Will you ask your master if he wants to join my court at Camelot?!',
 'SOLDIER #1: But then of course a-- African swallows are non-migratory.',
 'SOLDIER #2: Oh, yeah...',
 "SOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!",
 'Supposing two swallows carried it together?',
 "SOLDIER #1: No, they'd have to have it on a line.",
 'SOLDIER #2: Well, simple!',
 "They'd just use a strand of creeper!",
 'SOLDIER #1: What, held under the dorsal guiding feathers?',
 'SOLDIER #2: Well, why not?']

In [25]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

<re.Match object; span=(0, 7), match='ARTHUR:'>


In [26]:
# Alternative: anchor the pattern at the start of the string and capture the name in a group
pattern2 = r'^(\b[\w\s]+):'
print(re.match(pattern2, sentences[3]))

<re.Match object; span=(0, 7), match='ARTHUR:'>
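One thing worth noting: the character class `[\w\s]` does not include `#`, so the pattern above matches `ARTHUR:` but not a label such as `SOLDIER #1:`. The sketch below shows one possible extension; the name `extended_pattern` is hypothetical and not part of the exercise solution.

```python
import re

# Two lines copied from the scene above
lines = [
    "ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.",
    "SOLDIER #1: Pull the other one!",
]

basic_pattern = r"[\w\s]+:"       # pattern used in the exercise
extended_pattern = r"[\w\s#]+:"   # hypothetical variant that also allows '#'

print(re.match(basic_pattern, lines[1]))             # None
print(re.match(extended_pattern, lines[1]).group())  # SOLDIER #1:
print(re.match(extended_pattern, lines[0]).group())  # ARTHUR:
```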


### 1 03 Advanced tokenization with regex

1. Advanced tokenization with regex

In this video, we'll take a look at doing more advanced tokenization with regex.

2. Regex groups using or "|"

![image.png](attachment:image.png)

One new regex pattern you will find useful for advanced tokenization is the OR operator. In regex, OR is represented by the pipe character. To use OR, you can define a group using parentheses. Groups can be either a pattern or a set of characters you want to match. You can also define explicit character classes using square brackets. We'll go into more depth on groups and ranges soon. Let's take an example in which we want to tokenize a string using regular expressions and find all digits and words. We define our pattern using a group with the OR symbol and make both alternatives greedy so they catch the full word or run of digits. Then, we can call findall using Python's re library and return our tokens. Notice that our pattern does not match punctuation but properly matches the words and digits.
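A minimal sketch of the OR group described above. The sample sentence is an assumption, since the slide content is not reproduced in these notes.

```python
import re

# Hypothetical sample sentence
text = "He has 11 cats."

# One group with two greedy alternatives: a run of digits OR a run of word characters
match_digits_and_words = r"(\d+|\w+)"
tokens = re.findall(match_digits_and_words, text)
print(tokens)  # ['He', 'has', '11', 'cats']
```

The trailing period is skipped because neither alternative matches punctuation.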

3. Regex ranges and groups

![image-2.png](attachment:image-2.png)

Let's take a look at another more advanced topic: defining groups and character ranges. Here we have another chart of patterns, and this time we are using ranges or character classes, marked by square brackets, and groups, marked by parentheses. We can see in this chart that we can use square brackets to define a new character class. For example, we can match all upper and lowercase English letters using uppercase A hyphen uppercase Z, which matches all uppercase letters, and then lowercase a hyphen lowercase z, which matches all lowercase letters. We can also make ranges to match all digits, 0 hyphen 9, or perhaps a more complex range like uppercase and lowercase English letters with the hyphen and period. Because the hyphen has a special meaning inside square brackets and the period is a wildcard elsewhere in regex, we must tell regex we mean an ACTUAL hyphen or period. To do so, we use what is called an escape character, which in regex means placing a backslash in front of the character so that regex looks for a literal hyphen or period. On the other hand, with groups, which are designated by parentheses, we can only match what we explicitly define in the group. So (a-z) matches only the literal sequence a, a hyphen, and z. Groups are useful when you want to define an explicit group, such as the final example, where we match spaces or commas.
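The ranges and groups described above can be tried out as follows. The sample strings are assumptions chosen to show each pattern; they are not copied from the chart.

```python
import re

# [A-Za-z]+ : runs of upper- or lowercase English letters
letters = re.findall(r"[A-Za-z]+", "ABCDEFghijk 123")
# [0-9] : a single digit
single_digits = re.findall(r"[0-9]", "Room 9, floor 3")
# [A-Za-z\-\.]+ : letters plus a literal hyphen and a literal period
domains = re.findall(r"[A-Za-z\-\.]+", "My-Website.com is up")
# (a-z) : a group, so it matches only the literal text a-z
literal = re.findall(r"(a-z)", "a-z or abc")
# (\s+|,) : a group matching either a run of whitespace or a comma
separators = re.findall(r"(\s+|,)", "a, b")

print(letters)        # ['ABCDEFghijk']
print(single_digits)  # ['9', '3']
print(domains)        # ['My-Website.com', 'is', 'up']
print(literal)        # ['a-z']
print(separators)     # [',', ' ']
```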

4. Character range with `re.match()`

![image-3.png](attachment:image-3.png)

In this code example, we use match with a character range to match all lowercase ASCII letters, any digits, and spaces. The pattern is greedy, as marked by the + after the range definition, but once it hits the comma, it can't match any more. This short example demonstrates that thinking about which regex method you use (such as search versus match) and whether you define a group or a range can have a large impact on the usefulness and readability of your patterns.
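A sketch of the behavior described above; the input string is an assumption, since the slide's exact string is not included in these notes.

```python
import re

# Hypothetical input string containing a comma
my_str = "match lowercase spaces nums like 12, but no commas"

# Lowercase ASCII letters, digits, and spaces, repeated greedily
result = re.match(r"[a-z0-9 ]+", my_str)
print(result.group())  # match lowercase spaces nums like 12
```

The match stops just before the comma because the comma is not part of the character range.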

5. Let's practice!

Now it's your turn to practice advanced regex techniques to help with tokenization!

## <a id="2"></a>
<font color="lightseagreen" size=+2.5><b>2. Simple topic identification</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and TF-IDF, using NLTK and a new library, Gensim.

## <a id="3"></a>
<font color="lightseagreen" size=+2.5><b>3. Named-entity recognition</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

This chapter will introduce a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries, polyglot and spaCy, to add to your NLP toolbox.

## <a id="4"></a>
<font color="lightseagreen" size=+2.5><b>4. Building a "fake news" classifier</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

You'll apply the basics of what you've learned along with some supervised machine learning to build a "fake news" detector. You'll begin by learning the basics of supervised machine learning, and then move forward by choosing a few important features and testing ideas to identify and classify fake news articles.