# <h1 style="text-align: center;" class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Natural Language Processing in Python Track</h1>

Gain the core Natural Language Processing (NLP) skills you need to convert unstructured data into valuable insights. You’ll learn to use Natural Language Processing in Python to automatically transcribe TED talks, extract information from articles, and identify whether a movie review is positive or negative. As you progress, you’ll discover some popular Python NLP libraries, including NLTK, scikit-learn, spaCy, and SpeechRecognition.

You’ll start this track by learning how to identify words and extract topics in text before building your very own chatbot that transforms human language into actionable instructions. By the end of the track, you'll understand how to transcribe audio files using natural language processing techniques and how to extract insights from real-world sources, including Wikipedia articles, online review sites, and data from a flight booking system.

# <h1 style="text-align: center;" class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Introduction to Natural Language Processing in Python</h1>

In this course, you'll learn natural language processing (NLP) basics, such as how to identify and separate words, how to extract topics in a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries that use deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.

<a id="toc"></a>

<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>
    
* [1. Regular expressions & word tokenization](#1)
    - Introduction to regular expressions
    - Introduction to tokenization
    - Advanced tokenization with NLTK and regex
    - Charting word length with NLTK

* [2. Simple topic identification](#2) 
    - Word counts with bag-of-words
    - Simple text preprocessing
    - Introduction to gensim
    - TF-IDF with gensim
    
* [3. Named-entity recognition](#3)
    - Named-entity recognition
    - Introduction to spaCy
    - Multilingual NER with polyglot
    
* [4. Building a "fake news" classifier](#4)
    - Classifying fake news using supervised learning with NLP
    - Building word count vectors with scikit-learn
    - Training and testing a classification model with scikit-learn
    - Simple NLP, complex problems

## Imports

In [1]:
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## <a id="1"></a>
<font color="lightseagreen" size=+2.5><b>1. Regular expressions & word tokenization</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions, to help parse text. You'll also learn how to handle non-English text and the more difficult tokenization cases you might encounter.

### 1 01 Introduction to regular expressions

1. Introduction to regular expressions

Welcome to the course! In this video, you'll be learning about regular expressions.

2. What is Natural Language Processing?

![image.png](attachment:image.png)

Natural language processing is a massive field of study and an actively used practice that aims to make sense of language using statistics and computers. In this course, you will learn some of the basics of NLP, which will help you move from simple to more difficult and advanced topics. Even though this is the first course, you will still get some exposure to the challenges of the field, such as topic identification and text classification. Some interesting NLP areas you might have heard about are topic identification, chatbots, text classification, translation, and sentiment analysis. There are also many more! You will learn the fundamentals of some of these topics as we move through the course.

3. What exactly are regular expressions?

![image-2.png](attachment:image-2.png)

Regular expressions are strings with a special syntax that lets you match patterns and find other strings. A pattern is a series of letters or symbols that can map to actual text, words, or punctuation. You can use regular expressions to find links in a webpage, parse email addresses, and remove unwanted strings or characters. Regular expressions are often referred to as regex and can be used easily in Python via the `re` library.

Here we have a simple import of the library. We can match a substring using the `re.match` method, which matches a pattern against the beginning of a string. It takes the pattern as the first argument and the string as the second, and returns a match object (or `None` if the pattern does not match). Here we see it matched exactly what we expected: abc. We can also use special patterns that regex understands, like `\w+`, which matches a word. We can see from the match object representation that it matched the first word it found: hi.
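The two matches described above can be sketched roughly as follows. The input strings `'abcdef'` and `'hi there!'` are assumptions chosen to reproduce the `abc` and `hi` results mentioned in the text; the slide's exact inputs are not shown here.

```python
import re

# Match a literal pattern at the beginning of a string.
# 'abcdef' is an assumed input for illustration.
literal_match = re.match(r"abc", "abcdef")
print(literal_match.group())  # abc

# \w+ matches one or more word characters, so it grabs the first word.
# 'hi there!' is likewise an assumed input.
word_match = re.match(r"\w+", "hi there!")
print(word_match.group())     # hi

# re.match returns None when the pattern does not match at the start.
no_match = re.match(r"abc", "xyz abc")
print(no_match)               # None
```

Printing a match object directly (for example `print(literal_match)`) shows its span and matched text, which is the representation the video refers to.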

4. Common regex patterns

![image-3.png](attachment:image-3.png)

There are hundreds of characters and patterns you can learn and memorize with regular expressions, but to get started, I want to share a few common ones. The first pattern, `\w`, we already saw; it matches a single word character (a letter, digit, or underscore). The `\d` pattern matches digits, which is useful when you need to find them and separate them in a string. The `\s` pattern matches whitespace such as spaces, tabs, and newlines. The period is a wildcard character that matches any letter or symbol except, by default, a newline. The `+` and `*` characters are greedy quantifiers that grab repeats of single characters or whole patterns: `+` matches one or more repeats and `*` matches zero or more. For example, to match a full word rather than one character, we add the `+` symbol after the `\w`. Writing these character classes as capital letters negates them, so `\S` matches anything that is not whitespace. You can also create a group of characters by putting them inside square brackets, like our lowercase group (for example, `[a-z]`).
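As a rough illustration of these patterns, the sketch below applies each one with `re.findall`. The `sample` string is hypothetical and was chosen only so that every pattern has something to match.

```python
import re

# Hypothetical sample string, chosen only to exercise each pattern.
sample = "User_42 paid 15 dollars on May 3rd!"

words = re.findall(r"\w+", sample)         # runs of word characters
digits = re.findall(r"\d+", sample)        # runs of digits
spaces = re.findall(r"\s", sample)         # individual whitespace characters
non_space = re.findall(r"\S+", sample)     # runs of anything that is not whitespace
lowercase = re.findall(r"[a-z]+", sample)  # runs of lowercase ASCII letters
wildcard = re.findall(r".", sample)[:3]    # the wildcard matches one character at a time

print(words)      # ['User_42', 'paid', '15', 'dollars', 'on', 'May', '3rd']
print(digits)     # ['42', '15', '3']
print(len(spaces))  # 6
print(non_space)  # ['User_42', 'paid', '15', 'dollars', 'on', 'May', '3rd!']
print(lowercase)  # ['ser', 'paid', 'dollars', 'on', 'ay', 'rd']
print(wildcard)   # ['U', 's', 'e']
```

Note how `\w+` drops the trailing `!` while `\S+` keeps it, and how the `[a-z]+` group skips capital letters, digits, and the underscore.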

5. Common regex patterns (2)

![image-4.png](attachment:image-4.png)

6. Common regex patterns (3)

![image-5.png](attachment:image-5.png)

7. Common regex patterns (4)

![image-6.png](attachment:image-6.png)

8. Common regex patterns (5)

![image-7.png](attachment:image-7.png)

9. Common regex patterns (6)

![image-8.png](attachment:image-8.png)

10. Common regex patterns (7)

![image-9.png](attachment:image-9.png)

11. Python's re module

![image-10.png](attachment:image-10.png)

In the following exercises, you'll use the `re` module to perform some simple activities, like splitting on a pattern or finding all patterns in a string. In addition to `split` and `findall`, `search` and `match` are also quite popular. You saw a simple match at the beginning of this video; `search` is similar but doesn't require the pattern to match from the beginning of the string. The syntax for these functions is always to pass the pattern first and the string second. Depending on the function, it may return a list, an iterator, a new string, or a match object. Here we see that `re.split` takes a pattern for spaces and a string with some spaces and returns a list with the results of splitting on spaces. This can be used for tokenization, so you can preprocess text using regex while doing natural language processing.
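A minimal sketch of the four functions mentioned above, using a made-up `text` string (an assumption, not the slide's example). Each call passes the pattern first and the string second.

```python
import re

# Hypothetical input string for illustration.
text = "Split on spaces. Then find 3 words, or 10!"

# re.split returns a list of the pieces between matches of the pattern.
tokens = re.split(r"\s+", text)
print(tokens)
# ['Split', 'on', 'spaces.', 'Then', 'find', '3', 'words,', 'or', '10!']

# re.findall returns a list of every non-overlapping match.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['3', '10']

# re.search scans the whole string and returns the first match object.
found = re.search(r"find", text)
print(found.group())  # find

# re.match only succeeds if the pattern matches at the start of the string.
not_at_start = re.match(r"find", text)
print(not_at_start)  # None
```

Splitting on whitespace is a very simple form of tokenization; notice that punctuation stays attached to the tokens (`'spaces.'`, `'10!'`), which is one reason more careful tokenizers are covered later in the chapter.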

12. Let's practice!

Now it's your turn! Get started writing your first regex and I'll see you back here soon!

**Exercise**

**Which pattern?**

Which of the following Regex patterns results in the following text?

![image.png](attachment:image.png)

In the IPython Shell, try replacing PATTERN with one of the below options and observe the resulting output. The re module has been pre-imported for you and my_string is available in your namespace.

In [1]:
my_string = "Let's write RegEx!"

## <a id="2"></a>
<font color="lightseagreen" size=+2.5><b>2. Simple topic identification</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and TF-IDF, using NLTK and a new library, gensim.
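To give a feel for term-frequency-based topic identification, here is a minimal bag-of-words sketch. It deliberately uses only the standard library (`re` and `collections.Counter`) rather than NLTK or gensim, and the `text` string is a made-up example, so treat it as an illustration of the idea and not as the course's own code.

```python
import re
from collections import Counter

# Hypothetical text for illustration.
text = ("The cat sat on the mat. The cat chased the mouse, "
        "and the mouse ran under the mat.")

# Lowercase the text and pull out runs of word characters as tokens.
tokens = re.findall(r"\w+", text.lower())

# A bag-of-words is just a count of how often each token appears.
bag_of_words = Counter(tokens)

print(bag_of_words["the"])          # 6
print(bag_of_words["cat"])          # 2
print(bag_of_words.most_common(1))  # [('the', 6)]
```

The most frequent token here is `the`, which says little about the topic. That is why simple preprocessing such as removing stop words, and weighting schemes such as TF-IDF, are useful follow-up steps.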

## <a id="3"></a>
<font color="lightseagreen" size=+2.5><b>3. Named-entity recognition</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

This chapter will introduce a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries, polyglot and spaCy, to add to your NLP toolbox.

## <a id="4"></a>
<font color="lightseagreen" size=+2.5><b>4. Building a "fake news" classifier</b></font>

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

You'll apply the basics of what you've learned along with some supervised machine learning to build a "fake news" detector. You'll begin by learning the basics of supervised machine learning, and then move forward by choosing a few important features and testing ideas to identify and classify fake news articles.