# Common regex patterns

There are hundreds of characters and patterns you can learn and memorize with regular expressions, but to get started, I want to share a few common ones.

The first pattern, **\w**, we have already seen: it matches a single word character (a letter, digit, or underscore).

The **\d** pattern matches a digit, which is useful when you need to find numbers and separate them from the rest of a string.
The **\s** pattern matches whitespace (spaces, tabs, newlines). The period **.** is a wildcard character.

The wildcard matches ANY letter or symbol except, by default, a newline.
The **+** and **\*** characters are greedy quantifiers: they grab repeats of single characters or whole patterns, with **+** matching one or more repeats and **\*** zero or more.

For example, to match a full word rather than one character, we add the + symbol after the \w.

Writing these character classes as capital letters negates them, so **\S** matches anything that is not whitespace. You can also create a group of characters you want by putting them inside square brackets, for example **[aeiou]**.
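As a quick check of the patterns above, here is a minimal sketch using Python's re module. The sample string is made up for illustration and is not part of the original material.

```python
import re

# Hypothetical sample string, chosen only to exercise each pattern
text = "Call 555 0199 now!"

words = re.findall(r"\w+", text)       # runs of word characters
digits = re.findall(r"\d+", text)      # runs of digits
spaces = re.findall(r"\s", text)       # single whitespace characters
non_spaces = re.findall(r"\S+", text)  # runs of anything that is not whitespace
wildcard = re.findall(r"n.w", text)    # "n", any one character, then "w"
vowels = re.findall(r"[aeiou]", text)  # any one character from the set

print(words)       # ['Call', '555', '0199', 'now']
print(digits)      # ['555', '0199']
print(non_spaces)  # ['Call', '555', '0199', 'now!']
print(wildcard)    # ['now']
print(vowels)      # ['a', 'o']
```

Note how \w+ drops the exclamation mark while \S+ keeps it: punctuation is not a word character, but it is a non-whitespace character.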

https://www.computerhope.com/unix/regex-quickref.htm

https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

# Split lines or split the line

In [None]:
# Split string at line boundaries
file_split = file.split("\n")

# Print file_split
print(file_split)

# Complete for-loop to split by commas
for substring in file_split:
    substring_split = substring.split(",")
    print(substring_split)

# Finding and replacing

In [None]:
for movie in movies:
    # Check whether "actor" occurs between indexes 37 and 41 inclusive
    if movie.find("actor", 37, 42) == -1:
        print("Word not found")
    # Count occurrences and replace two by one
    elif movie.count("actor") == 2:
        print(movie.replace("actor actor", "actor"))
    else:
        # Replace three occurrences by one
        print(movie.replace("actor actor actor", "actor"))

# Where's the word?

In [None]:
for movie in movies:
    try:
        # Find the first occurrence of the word within the slice [12:50]
        print(movie.index("movie", 12, 50))
    except ValueError:
        print("substring not found")

In [None]:
# Replace negations 
movies_no_negation = movies.replace("isn't", "is")

# Replace important
movies_antonym = movies_no_negation.replace("important", "insignificant")

# Print out
print(movies_antonym)

# Positional formatting

In [None]:
# Assign the substrings to the variables
first_pos = wikipedia_article[3:19].lower()
second_pos = wikipedia_article[21:44].lower()

# Create the list that will hold the format strings
my_list = []

# Define string with placeholders
my_list.append("The tool {} is used in {}")

# Define string with rearranged placeholders
my_list.append("The tool {1} is used in {0}")

# Use format to print strings
for my_string in my_list:
    print(my_string.format(first_pos, second_pos))

In [2]:
courses=['artificial intelligence', 'neural networks']

In [8]:
# Create a dictionary
plan = {
    "field": courses[0],
    "tool": courses[1]
}

# Complete the placeholders accessing elements of field and tool keys in the data dictionary
my_message = "If you are interested in {data[field]}, you can take the course related to {data[tool]}"

# Use the plan dictionary to replace placeholders
print(my_message.format(data=plan))

If you are interested in artificial intelligence, you can take the course related to neural networks


In [9]:
# Import datetime 
from datetime import datetime

# Assign date to get_date
get_date = datetime.now()

# Add named placeholders with format specifiers
message = "Good morning. Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

# Format date
print(message.format(today=get_date))

Good morning. Today is January 31, 2021. It's 20:30 ... time to work!


In [10]:
get_date

datetime.datetime(2021, 1, 31, 20, 30, 49, 993956)

# Formatted string literal

In [None]:
# Complete the f-string: !r shows the repr (with quotes), :d formats an integer
print(f"Data science is considered {field1!r} in the {fact1:d}st century")
# Complete the f-string: :e uses exponential notation
print(f"About {fact2:e} of {field2} in the world")

In [None]:
# Complete the f-string
print(f"{field3} create around {fact3:.2f}% of the data but only {fact4:.1f}% is analyzed")

In [None]:
# Include both variables and the result of dividing them 
print(f"{number1} tweets were downloaded in {number2} minutes indicating a speed of {number1/number2:.1f} tweets per min")

In [None]:
# Replace the substring https by an empty string
print(f"{string1.replace('https', '')}")

In [None]:
# Express the number of links as a percentage of 120 posts, rounded to two decimals
print(f"Only {(len(list_links)*100/120):.2f}% of the posts contain links")

In [None]:
# Access values of date and price in east dictionary
print(f"The price for a house in the east neighborhood was ${east['price']} in {east['date']:%m-%d-%Y}")

In [None]:
# Access values of date and price in west dictionary
print(f"The price for a house in the west neighborhood was ${west['price']} in {west['date']:%m-%d-%Y}.")

# Template method

In [None]:
# Import Template
from string import Template

# Create a template
wikipedia = Template("$tool is a $description")

# Substitute variables in template
print(wikipedia.substitute(tool=tool1, description=description1))
print(wikipedia.substitute(tool=tool2, description=description2))
print(wikipedia.substitute(tool=tool3, description=description3))

In [None]:
# Import template
from string import Template

# Select variables
our_tool = tools[0]
our_fee = tools[1]
our_pay = tools[2]

# Create template
course = Template("We are offering a 3-month beginner course on $tool just for $$ $fee ${pay}ly")

# Substitute identifiers with three variables
print(course.substitute(tool=our_tool, fee=our_fee, pay=our_pay))

In [None]:
# Import template
from string import Template

# Complete template string using identifiers
the_answers = Template("Check your answer 1: $answer1, and your answer 2: $answer2")

# Use substitute to replace identifiers
try:
    print(the_answers.substitute(answers))
except KeyError:
    print("Missing information")

# Introduction to regular expressions

In [None]:
# Import the re module
import re

# Write the regex
regex = r"@robot\d\W"

# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

In [None]:
# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

In [None]:
# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

In [None]:
# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"

# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)

# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

# Replace the regex_words and print the result
sentiment_final = re.sub(regex_words, " ", sentiment_sub)
print(sentiment_final)

# Repetitions

In [None]:
# Import re module
import re
for tweet in sentiment_analysis:
    # Write regex to match http links and print out result
    print(re.findall(r"http\S+", tweet))

    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))

In [None]:
# Complete the for loop with a regex to find dates such as "27 minutes ago"
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\s\w+\sago", date))

In [None]:
# Complete the for loop with a regex to find dates such as "23rd june 2018"
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}", date))

In [None]:
# Complete the for loop with a regex to find dates such as "1st september 2019 17:25"
for date in sentiment_analysis:
    print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}\s\d{1,2}:\d{2}", date))

In [None]:
# Write a regex matching the hashtag pattern
regex = r"#\w+"
# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)
# Get tokens by splitting text
print(re.split(r"\s+",no_hashtag))

# Regex metacharacters

In [None]:
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}.+txt"
for text in sentiment_analysis:
	# Find all matches of the regex
	print(re.findall(regex, text))
    
	# Replace all matches with empty string
	print(re.sub(regex, "", text))

In [None]:
# Write a regex to match a valid email address
regex = r"[A-Za-z0-9!#%&*\$\.]+@\w+\.com"

for example in emails:
    # Match the regex to the string
    if re.match(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))

In [None]:
# Write a regex to match a valid password
regex = r"[A-Za-z0-9!#%&*\$\.]{8,20}" 

for example in passwords:
    # Scan the strings to find a match
    if re.search(regex, example):
        # Complete the format method to print out the result
        print("The password {pass_example} is a valid password".format(pass_example=example))
    else:
        print("The password {pass_example} is invalid".format(pass_example=example))

# Greedy vs. non-greedy matching
**Remember that a greedy quantifier will try to match as much as possible, while a non-greedy (lazy) quantifier will match as few times as needed, expanding one character at a time until the rest of the pattern matches.**

In [3]:
# Import re
import re
string = "I want to see that <strong>amazing show</strong> again!"
# Write a regex to eliminate tags
string_notags = re.sub(r"<.*?>", "", string)
# Print out the result
print(string_notags)

I want to see that amazing show again!


In [2]:
# Import re
import re 
# Write a regex to eliminate tags
string_notags = re.sub(r"<.+?>", "", string)
# Print out the result
print(string_notags)

I want to see that amazing show again!


In [4]:
# Write a greedy regex and search for the first match
sentiment_analysis="I was born on April 24t"
numbers_found = re.search(r"\d+", sentiment_analysis)
# Print out the result
print(numbers_found)

<re.Match object; span=(20, 22), match='24'>


In [10]:
# Write a lazy regex expression 
numbers_found_lazy = re.findall(r"\d+?", sentiment_analysis)
# Print out the result
print(numbers_found_lazy)

['2', '4']


In [9]:
# Iterate over the matches of a greedy regex
numbers_found_greedy = re.finditer(r"\d+", sentiment_analysis)
matchings = [match.group() for match in numbers_found_greedy]
matchings

['24']

In [11]:
# Write a greedy regex expression 
numbers_found_greedy = re.findall(r"\d+", sentiment_analysis)
# Print out the result
print(numbers_found_greedy)

['24']


# Lazy approach


You have done some cleaning in your dataset but you are worried that there are sentences encased in parentheses that may cloud your analysis.

Again, a greedy or a lazy quantifier may lead to different results.

For example, if you want to extract a word starting with "a" and ending with "e" in the string "I like apple pie", you may think that applying the greedy regex **a.+e** will return "apple". However, your match will be "apple pie", because the greedy **.+** runs to the last "e" it can reach. A way to overcome this is to make the quantifier lazy by adding ? after it (**a.+?e**), which will return "apple".
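The apple pie example can be checked directly. This is a small sketch of the comparison described in the text, using only the re module:

```python
import re

sentence = "I like apple pie"

# Greedy: .+ extends to the last "e" in the string
greedy = re.findall(r"a.+e", sentence)

# Lazy: .+? stops at the first "e" that completes the match
lazy = re.findall(r"a.+?e", sentence)

print(greedy)  # ['apple pie']
print(lazy)    # ['apple']
```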

In [15]:
sentiment_analysis="Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). "
# Write a greedy regex expression to match 
sentences_found_greedy = re.findall(r"\(.*\)", sentiment_analysis)

# Print out the result
print(sentences_found_greedy)

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]


In [16]:
# Write a lazy regex expression
sentences_found_lazy = re.findall(r"\(.*?\)", sentiment_analysis)

# Print out the results
print(sentences_found_lazy)

['(They were so cute)', "(I'm crying)"]


In [19]:
s2='(They were so cute)'
s3=re.sub(r'\(|\)', '', s2)
s3

'They were so cute'

In [20]:
s2=sentiment_analysis
s3=re.sub(r'\(|\)', '', s2)
s3

"Put vacation photos online They were so cute a few yrs ago. PC crashed, and now I forget the name of the site I'm crying. "

# Groups 

In [None]:
# Write a regex that matches email
regex_email = r"([a-zA-Z0-9]+)@\S+"
for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email,tweet)
    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Here you have your boarding pass LA4214 AER-CDB 06NOV.

You need to extract the information about the flight:

    The two letters indicate the airline (e.g. LA).
    The four digits are the flight number (e.g. 4214).
    The three letters before the hyphen are the departure (e.g. AER).
    The three letters after the hyphen are the destination (e.g. CDB).
    The final group is the date of the flight (e.g. 06NOV).

In [21]:
# Import re
import re
flight="Here you have your boarding pass LA4214 AER-CDB 06NOV."
# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

# Find all matches of the flight information
flight_matches = re.findall(regex, flight)
    
#Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

Airline: LA Flight number: 4214
Departure: AER Destination: CDB
Date: 06NOV


In [22]:
flight_matches

[('LA', '4214', 'AER', 'CDB', '06NOV')]

# Alternation and non-capturing groups

In [None]:
# Write a regex that matches sentences with the optional words
regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    positive_matches = re.findall(regex_positive, tweet)
    
    # Complete format to print out the results
    print("Positive comments found {}".format(positive_matches))

Complete the regular expression to capture the words hate, dislike, or disapprove. Match but don't capture the words movie or concert. Match and capture anything appearing until the period (.).

Find all matches of the regex in each element of sentiment_analysis.

Assign them to negative_matches.

Complete the .format() method to print out the results contained in negative_matches for each element in sentiment_analysis.

In [None]:
# Write a regex that matches sentences with the optional words
regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    negative_matches = re.findall(regex_negative, tweet)
    # Complete format to print out the results
    print("Negative comments found {}".format(negative_matches))

# Backreferences
## Parsing PDF files 

In [29]:
# Sample contract text
contract = "Signed on 05/24/2016"
# Write regex and scan contract to capture the dates described
regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
dates = re.search(regex_dates, contract)

# Assign to each key the corresponding match
signature = {
	"day": dates.group(2),
	"month": dates.group(1),
	"year": dates.group(3)
}
# Complete the format method to print-out
print("Our first contract is dated back to {data[year]}. Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

Our first contract is dated back to 2016. Particularly, the day 24 of the month 05.


**Remember that each capturing group is assigned a number according to its position in the regex. .group() is available on the match objects returned by .search() and .match() (and yielded by .finditer()); .findall() returns plain strings or tuples instead.**
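To make the group numbering concrete, here is a short sketch that reuses the contract string from the cell above. It compares what a match object exposes with what .findall() returns for the same pattern.

```python
import re

contract = "Signed on 05/24/2016"
pattern = r"(\d{2})/(\d{2})/(\d{4})"

# .search() returns a match object, so groups can be read by number
match = re.search(pattern, contract)
whole = match.group(0)       # group zero is the entire match
month = match.group(1)       # first pair of parentheses
year = match.group(3)        # third pair of parentheses
all_groups = match.groups()  # every captured group as a tuple

# .findall() returns the captured groups directly, with no match object
found = re.findall(pattern, contract)

print(whole, month, year)  # 05/24/2016 05 2016
print(all_groups)          # ('05', '24', '2016')
print(found)               # [('05', '24', '2016')]
```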

In [None]:
for string in html_tags:
    # Complete the regex and find if it matches a closed HTML tag
    match_tag =  re.match(r"<(\w+)>.*?</\1>", string)
 
    if match_tag:
        # If it matches print the first group capture
        print("Your tag {} is closed".format(match_tag.group(1))) 
    else:
        # If it doesn't match capture only the tag 
        notmatch_tag = re.match(r"<(\w+)>", string)
        # Print the first group capture
        print("Close your {} tag!".format(notmatch_tag.group(1)))

**Backreferences are very helpful when you need to reuse part of the regex match inside the regex. Knowing when and how to use them will simplify many of your tasks!**
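As one more illustration of reusing a captured group inside the same regex, the sketch below finds accidentally doubled words. The sample sentence is hypothetical; the backreference \1 must match exactly what group 1 captured.

```python
import re

# Hypothetical text containing two doubled words
text = "This is is a test of the the parser"

# A word, whitespace, then the same word again
doubled = re.compile(r"\b(\w+)\s+\1\b")

repeated = doubled.findall(text)    # the captured word of each doubled pair
cleaned = doubled.sub(r"\1", text)  # \1 also works in the replacement string

print(repeated)  # ['is', 'the']
print(cleaned)   # This is a test of the parser
```

The word boundaries keep the pattern from pairing the "is" inside "This" with the following "is".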

In [None]:
# Complete the regex to match an elongated word
regex_elongated = r"\w*(\w)\1\w*"

for tweet in sentiment_analysis:
	# Find if there is a match in each tweet 
	match_elongated = re.search(regex_elongated, tweet)
    
	if match_elongated:
		# Assign the captured group zero 
		elongated_word = match_elongated.group(0)
        
		# Complete the format method to print the word
		print("Elongated word found: {word}".format(word=elongated_word))
	else:
		print("No elongated word found")	

**You should remember that the group zero stands for the entire expression matched. It's always helpful to keep that in mind.**
# Lookaround

In [None]:
# Positive lookahead
look_ahead = re.findall(r"\w+(?=\spython)", sentiment_analysis)

# Print out
print(look_ahead)

In [None]:
# Positive lookbehind
look_behind = re.findall(r"(?<=[Pp]ython\s)\w+", sentiment_analysis)

# Print out
print(look_behind)

# Filtering phone numbers

Now, you need to write a script for a cell-phone searcher. It should scan a list of phone numbers and return those that meet certain characteristics.

The phone numbers in the list have the structure:

    Optional area code: 3 digits
    Prefix: 4 digits
    Line number: 6 digits
    Optional extension: 2 digits

E.g. 654-8764-439434-01.

You decide to use .findall() together with the zero-width assertions negative lookahead (?!) and negative lookbehind (?<!), which check the surrounding text without including it in the match.

The list cellphones, containing three phone numbers, and the re module are loaded in your session. You can use print() to view the data in the IPython Shell.

In [None]:
for phone in cellphones:
	# Get all phone numbers not preceded by area code
	number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
	print(number)

In [None]:
for phone in cellphones:
	# Get all phone numbers not followed by optional extension
	number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
	print(number)
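Because the cellphones list is only available in the course session, here is a self-contained sketch of the same two filters. The first two numbers are invented for illustration; the third is the example number from the description above.

```python
import re

# Sample data: the first two entries are hypothetical
cellphones = ["4564-646464-01", "345-5785-544245", "654-8764-439434-01"]

# Numbers with an extension that are NOT preceded by an area code
no_area_code = [re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
                for phone in cellphones]

# Numbers with an area code that are NOT followed by an extension
no_extension = [re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
                for phone in cellphones]

print(no_area_code)  # [['4564-646464-01'], [], []]
print(no_extension)  # [[], ['345-5785-544245'], []]
```

The full number 654-8764-439434-01 is rejected by both filters, since it has both an area code and an extension.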

In [3]:
import re
my_string = "Let's write RegEx!"
PATTERN = r"\w+"
re.findall(PATTERN, my_string)
# ['Let', 's', 'write', 'RegEx']

['Let', 's', 'write', 'RegEx']

In [None]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

You'll be using more regex in this section as well, not only when you are tokenizing, but also when figuring out how to parse tokens and text. The re module's re.match and re.search are essential tools for Python string processing. Learning when to use search versus match can be challenging, so let's take a look at how they differ.

When we use search and match with the same pattern and string, and the pattern is at the beginning of the string, we find identical matches. That is the case when matching and searching abcde with the pattern abc. When we use search for a pattern that appears later in the string, we get a result, but we don't get the same result using match. This is because match only tries to match the pattern at the beginning of the string, while search goes through the ENTIRE string looking for the first place where the pattern matches.

If you need to find a pattern that might not be at the beginning of the string, you should use search. If you want to be specific about the composition of the entire string, or at least the initial pattern, then you should use match.
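The abcde comparison described above can be sketched as follows. The pattern cd for the "appears later" case is an assumption added for illustration.

```python
import re

text = "abcde"

# Pattern at the beginning: match and search agree
match_start = re.match(r"abc", text)
search_start = re.search(r"abc", text)

# Pattern later in the string: only search finds it
match_later = re.match(r"cd", text)
search_later = re.search(r"cd", text)

print(match_start.group(), search_start.group())  # abc abc
print(match_later)                                # None
print(search_later.span())                        # (2, 4)
```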

In [None]:
# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

In [None]:
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1,scene_one))

In [None]:
# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w]+:"
print(re.match(pattern2,sentences[3]))

#### Choosing a tokenizer

Given the following string, which of the below patterns is the best tokenizer? If possible, you want to retain sentence punctuation as separate tokens, but have '#1' remain a single token.

In [None]:
r"(\w+|#\d|\?|!)"
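The string for this exercise is not included here, so the sketch below uses a hypothetical one to show what the pattern keeps. It also uses re.findall as a stand-in for NLTK's regexp_tokenize, which is an assumption made to keep the example dependency-free.

```python
import re

# Hypothetical sample string, not the one from the exercise
sample = "SOLDIER #1: Found them? Yes!"

# Words, "#" followed by a digit, and the punctuation marks ? and !
tokens = re.findall(r"(\w+|#\d|\?|!)", sample)

print(tokens)  # ['SOLDIER', '#1', 'Found', 'them', '?', 'Yes', '!']
```

The colon is dropped because no alternative in the pattern matches it, while #1 survives as a single token.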

In [None]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

In [None]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
# Write a pattern that matches both mentions (@) and hashtags
pattern2 = r"([@#]\w+)"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

In [1]:
import re 
statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is an iterator that yields a match object for each address found
addresses = re.finditer(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address.group(0))

support@datacamp.com
xyz@datacamp.com


In [2]:
re.findall( r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats')
# Output: ['cats', 'dogs']



['cats', 'dogs']

In [6]:
[x.group() for x in re.finditer( r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats')]
# Output: ['all cats are', 'all dogs are']

['all cats are', 'all dogs are']

# Extract List of Keywords :


# Raw Matcher 
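### re.findall(pattern, string)

findall() returns all non-overlapping matches of pattern in string as a list of strings (or a list of tuples when the pattern has more than one capturing group).

### re.finditer(pattern, string)

finditer() returns an iterator that yields a match object for each non-overlapping match.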

In both functions, the string is scanned from left to right and matches are returned in order found.
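A short sketch of the practical difference, reusing the cats and dogs sentence from earlier in these notes: findall gives back only the captured text, while each match object from finditer also carries its position.

```python
import re

text = "all cats are smarter than dogs, all dogs are dumber than cats"
pattern = r"all (.*?) are"

# findall: a list of the captured strings
captured = re.findall(pattern, text)

# finditer: an iterator of match objects, read here for group 1 and its span
matches = re.finditer(pattern, text)
spans = [(m.group(1), m.span()) for m in matches]

# The iterator is consumed after one pass
leftover = list(matches)

print(captured)  # ['cats', 'dogs']
print(spans)
print(leftover)  # []
```

If the matches are needed more than once, storing them in a list first avoids the empty second pass.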

In [1]:
import logging
import re
class RawMatcher:

    def __init__(self, keywords_list, ignore_case=True):
        """Constructor

        Parameters
        ----------
        keywords_list : list
            list of keywords to be used for matching
        ignore_case : bool, optional
            Whether or not case should be ignored, by default True
        """

        self.keywords_list = list(set(keywords_list))
        self.keywords_patterns = self.create_patterns(ignore_case)

    def create_patterns(self, ignore_case):
        """Create a regex pattern from list of keywords

        Parameters
        ----------
        ignore_case : bool
            Whether matching should ignore case

        Returns
        -------
        re.Pattern
            Compiled regex to be used for matching
        """
        try:
            # Escape keywords so characters such as "." match literally, and
            # try longer keywords first so "Total SA" wins over "Total"
            alternatives = '|'.join(
                re.escape(k) for k in sorted(self.keywords_list, key=len, reverse=True))
            keywords_patterns = re.compile(
                r'\b(?:%s)\b' % alternatives, re.IGNORECASE if ignore_case else 0)
            return keywords_patterns
        except re.error as e:
            logging.error(f"{e}")

    def batch_match(self, text_list):
        """Take a list of sentences or string and iterate over each of them

        Parameters
        ----------
        text_list : list
            list of string

        Returns
        -------
        list[list]
            2d list of matches

        Example usage
        -------------
        >>> keywords = ["Total","Total S.A.", "Christophe de Margerie", "Ernest Mercier", "Elf Aquitaine", "Total SA"]
        >>> sentences = ["This total entity sells gasoil","Christophe de margerie was Total's CEO"]
        >>> RawMatcher(keywords).batch_match(sentences)
        [['total'], ['Christophe de margerie', 'Total']]
        """
        return [self.match(i) for i in text_list]

    def match(self, text):
        """Get matches from string or sentence

        Parameters
        ----------
        text : string
            sentence or string to be matched

        Returns
        -------
        list
            list of matches
        """
        matchings = [match.group() for match in self.keywords_patterns.finditer(text)]
        return list(matchings)


if __name__ == "__main__":  # pragma: no cover

    keywords = ["Total", "Total S.A.", "Christophe de Margerie",
                "Ernest Mercier", "Elf Aquitaine", "Total SA"]
    sentences = ["Total is a french oil company", "Christophe de Margerie was Total's CEO"]
    sentence = sentences[0]

    c = RawMatcher(keywords)
    batch_res = c.batch_match(sentences)
    res = c.match(sentence)

    print(res)
    print(batch_res)


['Total']
[['Total'], ['Christophe de Margerie', 'Total']]


# Preprocessing 
## Text preprocessing steps:
Lemmatization, lowercasing, removing unwanted tokens.
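Before the library-based cells below, here is a dependency-free sketch of two of these steps (lowercasing and removing unwanted tokens) followed by a word count. The sentence and the tiny stop-word set are hypothetical, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re
from collections import Counter

# Hypothetical input text and stop-word list
text = "The cat and the dog chased the other cat, 3 times!"
stop_words = {"the", "and", "a", "of"}

# Lowercase, then split into word tokens
lower_tokens = re.findall(r"\w+", text.lower())

# Keep alphabetic tokens only (drops "3")
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove stop words
no_stops = [t for t in alpha_only if t not in stop_words]

# Count what is left
bow = Counter(no_stops)
print(no_stops)            # ['cat', 'dog', 'chased', 'other', 'cat', 'times']
print(bow.most_common(1))  # [('cat', 2)]
```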
### Tokenization and Lemmatization
#### Tokenizing with spaCy

In [None]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

##  Word tokenization with NLTK

In [None]:
# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

## Non-ascii tokenization

In [None]:
# Import the tokenizers used below
from nltk.tokenize import word_tokenize, regexp_tokenize

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
# (the ranges sit directly inside one character class; quotes or | placed there would be matched literally)
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

In [None]:
# Import the modules used in this cell
import re
import matplotlib.pyplot as plt
from nltk.tokenize import regexp_tokenize

# Steps:
#   1. Split the script holy_grail into lines on the newline character.
#   2. Use re.sub() in a list comprehension to remove prompts such as ARTHUR: and SOLDIER #1:.
#   3. Tokenize each line with regexp_tokenize(), keeping only words (pattern \w+).
#   4. Build line_num_words, the number of words in each tokenized line.
#   5. Plot a histogram of line_num_words with plt.hist() and display it with plt.show().

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Replace all script lines for speaker
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()

In [None]:
# Import WordNetLemmatizer and Counter
from collections import Counter
from nltk.stem import WordNetLemmatizer
# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops ]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


# Sentence Splitter 

In [None]:
import nltk
from functools import reduce


class SentenceSplitter:

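    def __init__(self, keys_map=None, keep_raw=False):
        # Avoid a shared mutable default argument; fall back to the standard mapping
        self.keys_map = keys_map if keys_map is not None else {"text": 0, "title": 1, "thread.title": 2}
        self.keep_text = keep_raw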

    def deep_get(self, dictionary, keys, default=None):
        return reduce(lambda d, key: d.get(key, default) if isinstance(d, dict) else default, keys.split("."), dictionary)

    def split_document(self, doc, id_key="_id"):
        sentences = []
        doc_id = doc.get(id_key)
        for key in self.keys_map.keys():
            sentence = self.text_splitter(self.deep_get(
                doc, key, ''), self.keys_map[key], doc_id, id_key)
            sentences.extend(sentence)

        return {"sentences": sentences, "nb_sentences": len(sentences)}

    def text_splitter(self, text, type_key, doc_id, id_key):
        index_splitted_text = [x + (type_key, doc_id) for x in
                               enumerate(nltk.sent_tokenize(text), 0)]
        keys_list = ["sentence_id", "text", "type",  id_key]
        splitted_text = [dict(zip(keys_list, x)) for x in index_splitted_text]
        return splitted_text

    def batch_spliting(self, document_list, id_key="_id"):
        return list(map(lambda d: self.split_document(d, id_key), document_list))


if __name__ == "__main__":
    import jsonlines
    import json
    datapath = "sentence_splitter/example.json"
    with jsonlines.open(datapath) as reader:
        lines = [*reader]
    test_data = lines[28:30]
    s = SentenceSplitter()
    res = s.batch_spliting(test_data,'id')
    print(json.dumps(res, indent=3))


In [None]:
import nltk

class Aggregator():
    def __init__(self, text):
        """
        Parent Constructor

        :param text: text to evaluate
        """
        self.text = text
        self.sentences = nltk.sent_tokenize(self.text)


# Sentiment Aggregator

In [None]:
import pandas as pd
import numpy as np

import collections
from collections import defaultdict
from collections import OrderedDict

from aggregator.baseaggregator import Aggregator

import operator

class SentimentAggregator(Aggregator):
    _persist_methods = ['mean', 'max', 'sum']

    def __init__(self, sentences, language, learners):
        """
        Constructor

        :param sentences: sentence-tokenized text
        :param language: language of the corpus
        :param learners: model for sentiment analysis
        """
        super().__init__(sentences)
        self.language = language
        self.learners = learners
        self.summation_score = 0
        self.polarity_score = 0
        self.subjectivity_score = 0
        self.bermingham_score = 0

    def compute_polarity_score(self, method, value=None):
        """
        Compute the sentiment score of a text based on the polarity of each sentence

        :param method: method of aggregation, i.e. mean, sum, max, or quantile
        :param value: if quantile is chosen, value of the quantile between 0 and 1

        :returns: The Sentiment score of the text
        """
        final_res = []

        for sentence in self.sentences:
            d_result = self.get_sentence_predictions(sentence)
            score = self.get_sentence_polarity_score(d_result['webhose_valence3_single_positive'], d_result['webhose_valence3_single_negative'])
            final_res.append(score)
        
        if method == "quantile":
            self.polarity_score = np.quantile(np.array(final_res), value)
            return self.polarity_score
        elif method in self._persist_methods:
            self.polarity_score = getattr(np.array(final_res),method)()
            return self.polarity_score

    @staticmethod
    def get_sentence_polarity_score(positive_probability, negative_probability):
        """
        Get the sentence score based on the compute of the polarity ie (positive_probability - negative_probability)

        :param positive_probability: Positive probability returned by the sentiment classifier
        :param negative_probability: Negative probability returned by the sentiment classifier

        :returns: The sentence sentiment score 
        """
        polarity = positive_probability - negative_probability
        sentence_score = 100 / (1+ np.exp(-polarity))
        return sentence_score
    
    def compute_subjectivity_score(self, method, value=None):
        """
        Compute the sentiment score of a text based on the subjectivity of each sentence

        :param method: method of aggregation, i.e. mean, sum, max or quantile
        :param value: if quantile is chosen, value of the quantile between 0 and 1

        :returns: Sentiment score of the text
        """
        final_res = []

        for sentence in self.sentences:
            d_result = self.get_sentence_predictions(sentence)
            score = self.get_sentence_subjectivity_score(d_result['webhose_valence3_single_positive'], d_result['webhose_valence3_single_negative'])
            final_res.append(score)
        
        if method == "quantile":
            self.subjectivity_score = np.quantile(np.array(final_res), value)
            return self.subjectivity_score
        elif method in self._persist_methods:
            self.subjectivity_score = getattr(np.array(final_res),method)()
            return self.subjectivity_score
 
    @staticmethod
    def get_sentence_subjectivity_score(positive_probability, negative_probability):
        """
        Get the sentence score based on the compute of the subjectivity ie (1 - neutral)

        :param positive_probability: Positive probability returned by the sentiment classifier
        :param negative_probability: Negative probability returned by the sentiment classifier

        :returns : The sentence sentiment score
        """
        subjectivity = positive_probability + negative_probability
        sentence_score = 100 / (1+ np.exp(-subjectivity))
        return sentence_score

    def compute_bermingham_score(self, method, value=None):
        """
        Compute the sentiment score of a text based on sentiment ratio of each sentence

        :param method: method of aggregation, i.e. mean, sum, max or quantile
        :param value: if quantile is chosen, value of the quantile between 0 and 1

        :returns: Sentiment score of the text
        """
        final_res = []

        for sentence in self.sentences:
            d_result = self.get_sentence_predictions(sentence)
            score = self.get_sentence_ratio_score(d_result['webhose_valence3_single_positive'], d_result['webhose_valence3_single_negative'])
            final_res.append(score)

        if method == "quantile":
            self.bermingham_score = np.quantile(np.array(final_res), value)
            return self.bermingham_score
        elif method in self._persist_methods:
            self.bermingham_score = getattr(np.array(final_res),method)()
            return self.bermingham_score
    
    @staticmethod
    def get_sentence_ratio_score(positive_probability, negative_probability):
        """
        Get the sentence score based on the smoothed log-ratio, i.e. 100 * log((positive_probability + 1) / (negative_probability + 1))

        :param positive_probability: Positive probability returned by the sentiment classifier
        :param negative_probability: Negative probability returned by the sentiment classifier

        :returns : The sentence sentiment score  
        """
        sentiment_ratio = (positive_probability + 1) / (negative_probability + 1)
        sentence_score = 100 * np.log(sentiment_ratio)
        return sentence_score

    def compute_summation_score(self):
        """
        Compute the sentiment score of a text based on the difference of total positives and total negatives sentences

        :returns: Sentiment score of the text
        """
        # Reset the counter so repeated calls do not accumulate earlier totals
        self.summation_score = 0
        for sentence in self.sentences:
            d_result = self.get_sentence_predictions(sentence)
            try:
                d_result.pop('webhose_valence3_single_neutral')
            except KeyError:
                print("Key not found") 
            res = list(d_result.keys())[0] 
            if res == 'webhose_valence3_single_positive':
                self.summation_score += 1
            elif res == 'webhose_valence3_single_negative':
                self.summation_score -= 1
        self.summation_score = (100 * self.summation_score/len(self.sentences))
        return self.summation_score
    
    def get_sentence_predictions(self, sentence):
        """
        Get the predictions performed by the sentiment classifier model on one sentence

        :param sentence: The sentence to perform the sentiment analysis on

        :returns: A dictionary of three probabilities: positive (webhose_valence3_single_positive), neutral (webhose_valence3_single_neutral) 
                  and negative (webhose_valence3_single_negative)
        """
        result = self.learners.predict_sentence_affect(sentence, 
                                                       models=['news-sentiment'], 
                                                       language=self.language, 
                                                       models_preloaded=True)
        d_result = defaultdict(int, result)
        d_result = {k: v for k, v in sorted(d_result.items(), key=lambda item: item[1], reverse=True)}
        return d_result
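
As a quick sanity check on the three per-sentence formulas above, here is a minimal sketch that recomputes them with the standard library only. The function names `polarity_score`, `subjectivity_score` and `ratio_score` are hypothetical stand-ins for the static methods, and the probabilities are made-up inputs rather than real classifier output.

```python
import math

def polarity_score(p_pos, p_neg):
    # Logistic squashing of (positive - negative), scaled by 100
    return 100 / (1 + math.exp(-(p_pos - p_neg)))

def subjectivity_score(p_pos, p_neg):
    # Logistic squashing of (positive + negative), scaled by 100
    return 100 / (1 + math.exp(-(p_pos + p_neg)))

def ratio_score(p_pos, p_neg):
    # Log of the add-one smoothed ratio, scaled by 100
    return 100 * math.log((p_pos + 1) / (p_neg + 1))

# Made-up (positive, negative) probability pairs
for p_pos, p_neg in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    print(p_pos, p_neg,
          round(polarity_score(p_pos, p_neg), 1),
          round(subjectivity_score(p_pos, p_neg), 1),
          round(ratio_score(p_pos, p_neg), 1))
```

Because the inputs are probabilities, the difference lies in [-1, 1], so the polarity score stays roughly between 26.9 and 73.1 instead of spanning 0 to 100. The log-ratio score is symmetric around 0 and is bounded by about ±69.3.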
    
    


# Entity Aggregator

In [None]:
import pandas as pd
import numpy as np

import collections
from collections import defaultdict
from collections import OrderedDict

from aggregator.baseaggregator import Aggregator

import operator


class EntityAggregator(Aggregator):
    def __init__(self, sentences, entity_of_interest, ranker):
        """
        Constructor

        :param sentences: sentence-tokenized text
        :param entity_of_interest: the entity that we are interested in ie. Q312
        :param ranker: Entity Ranker object
        """
        super().__init__(sentences)
        self.entity_of_interest = entity_of_interest
        self.ranker = ranker
        self.results = np.array(self.ranker.bulk_filter_queries(self.sentences))
        self.half_elements = len(self.results)/2

    def compute_mean_score(self, noise_reducer=.5):
        """
        Compute the sentence entity ranker based on mean of sentences

        :param noise_reducer: thd / Slider to evaluate the aggregation method

        :returns: Bool(mean_sentences_entity_of_interest > noise_reducer) 
        """
        mean_score = self.get_entity_of_interest_mean(self.entity_of_interest, self.results)
        return (mean_score > noise_reducer)
    
    @staticmethod
    def get_entity_of_interest_mean(entity_of_interest, list_of_entities):
        """
        Compute the mean of all sentences linked to our entity of interest 

        :param list_of_entities: List of entities 

        :returns: The mean of all sentences that talks about the entity of interest
        """
        return (np.array(list_of_entities)==entity_of_interest).mean()

    def compute_majority_voting_score(self):
        """
        Compute the sentence entity ranker based on Majority Voting of sentences

        :returns: Bool(entity_of_interest is majority) 
        """
        my_value = self.get_num_sentences(self.entity_of_interest, self.results)
        return (my_value > self.half_elements)
    
    @staticmethod
    def get_num_sentences(entity_of_interest, list_of_entities):
        """
        Count the number of sentences that are linked with the entity of interest

        :param list_of_entities: List of entities

        :returns: The number of sentences that are linked with the entity of interest
        """
        counter = collections.Counter(np.array(list_of_entities))
        # Fall back to 0 when the entity of interest never appears
        return counter.get(entity_of_interest, 0)
    
    def compute_top_n_score(self, n=1):
        """
        Compute the sentence entity ranker based on top n entities by score

        :param n: Top n entities to retrieve

        :returns: Bool(entity_of_interest is in top n)
        """
        top_entities = self.get_top_entities(self.results, n)
        return (self.entity_of_interest in top_entities)
    
    @staticmethod
    def get_top_entities(list_of_entities, n=1):
        """
        Get top n entities by score. 

        :param list_of_entities: List of entities
        :param n: Top n entities to retrieve

        :returns: A list containing ordered top n entities
        """ 
        counter = collections.Counter(list_of_entities)
        top = sorted(counter, key=counter.get, reverse=True)[:n]
        return top
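
The three decision rules in `EntityAggregator` (mean share, majority voting, top n) can be tried out on a toy list without the ranker. This is only a sketch: the helper names `share_of`, `is_majority` and `in_top_n` are hypothetical, and the entity IDs are made-up values standing in for what `bulk_filter_queries` would return.

```python
from collections import Counter

def share_of(entity, entities):
    # Fraction of sentences linked to the entity
    return sum(e == entity for e in entities) / len(entities)

def is_majority(entity, entities):
    # True when the entity is linked to more than half of the sentences
    return Counter(entities)[entity] > len(entities) / 2

def in_top_n(entity, entities, n=1):
    # True when the entity is among the n most frequent entities
    top = [e for e, _ in Counter(entities).most_common(n)]
    return entity in top

# One made-up entity ID per sentence
entities = ["Q312", "Q95", "Q312", "Q2283", "Q312", "Q95", "Q312"]

print(share_of("Q312", entities) > 0.5)
print(is_majority("Q312", entities))
print(in_top_n("Q95", entities, n=1), in_top_n("Q95", entities, n=2))
```

Note that with a threshold of 0.5 the mean rule and the majority rule agree, since `count / n > 0.5` is the same condition as `count > n / 2`. The threshold only matters once it is moved away from 0.5.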


# Lemmatizing

In [None]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))

# Text cleaning

In [None]:
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))

In [None]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])

# Remove HTML tags

In [None]:
import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
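
The `?` in `<.*?>` is what keeps `cleanhtml` from deleting the text between tags. The sketch below compares the lazy pattern with a greedy one, and shows one limitation: `.` does not match a newline by default, so a tag that spans two lines survives unless `re.DOTALL` is passed. The sample strings are made up for illustration.

```python
import re

TAG_LAZY = re.compile(r'<.*?>')    # stops at the first '>'
TAG_GREEDY = re.compile(r'<.*>')   # runs on to the last '>'

html = '<p>Hello <b>world</b></p>'
print(repr(TAG_LAZY.sub('', html)))
print(repr(TAG_GREEDY.sub('', html)))

# A tag broken across two lines
multiline = '<a\nhref="x">link</a>'
print(repr(TAG_LAZY.sub('', multiline)))
print(repr(re.sub(r'<.*?>', '', multiline, flags=re.DOTALL)))
```

For anything beyond quick cleaning, a real HTML parser is usually the safer choice, since regular expressions do not understand comments, scripts or attribute values that contain `>`.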

# POS tagging

In [None]:
# Load the en_core_web_sm model
nlp = spacy.load("en_core_web_sm")

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

# Basic feature extraction
## Character count

In [None]:
# Create a feature char_count
tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
print(tweets['char_count'].mean())

##  Building a Counter with bag-of-words

In [None]:
# Import Counter and the word tokenizer
from collections import Counter
from nltk.tokenize import word_tokenize

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [w.lower() for w in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))


In [5]:
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
doc = nlp('Your text here')
# all tokens that aren't stop words or punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text for token in doc if not token.is_stop and not token.is_punct and token.pos_ == "NOUN"]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)

##  Word count

In [None]:
# Function that returns number of words in a string
def count_words(string):
    # Split the string into words
    words = string.split()
    
    # Return the number of words
    return len(words)

# Create a new feature word_count
ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
print(ted['word_count'].mean())

# Hashtags and mentions

In [None]:
import matplotlib.pyplot as plt

# Function that returns number of hashtags in a string
def count_hashtags(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are hashtags
    hashtags = [word for word in words if word.startswith('#')]
    
    # Return number of hashtags
    return(len(hashtags))

# Create a feature hashtag_count and display distribution
tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
tweets['hashtag_count'].hist()
plt.title('Hashtag count distribution')
plt.show()

In [None]:
# Function that returns number of mentions in a string
def count_mentions(string):
    # Split the string into words
    words = string.split()
    
    # Create a list of words that are mentions
    mentions = [word for word in words if word.startswith('@')]
    
    # Return number of mentions
    return(len(mentions))

# Create a feature mention_count and display distribution
tweets['mention_count'] = tweets['content'].apply(count_mentions)
tweets['mention_count'].hist()
plt.title('Mention count distribution')
plt.show()
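
Splitting on whitespace and testing `startswith` misses hashtags that are glued to punctuation, such as `(#python)`. A regex-based counter is one possible alternative; the sketch below is an assumption about what "hashtag" should mean (a `#` not preceded by a word character and followed by at least one word character), and the names `HASHTAG`, `count_hashtags_split` and `count_hashtags_regex` are hypothetical.

```python
import re

# '#' not preceded by a word character, followed by one or more word characters
HASHTAG = re.compile(r'(?<!\w)#\w+')

def count_hashtags_split(text):
    # Same idea as the cell above: whitespace split plus startswith
    return len([word for word in text.split() if word.startswith('#')])

def count_hashtags_regex(text):
    return len(HASHTAG.findall(text))

tweet = 'Loving #NLP (#python) and C#'
print(count_hashtags_split(tweet))
print(count_hashtags_regex(tweet))
print(HASHTAG.findall(tweet))
```

The same pattern with `@` in place of `#` would work for mentions. Which definition is right depends on the data, so it is worth comparing both counts on a sample of tweets first.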

## Named entity recognition

In [None]:
# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance 
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

## NER with NLTK

In [None]:
# Tokenize the article into sentences: sentences
print(article)
sentences = sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
# (materialized as a list so later cells can iterate over it more than once)
chunked_sentences = list(nltk.ne_chunk_sents(pos_sentences, binary=True))

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)


In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()

# spaCy NER Categories vs NLTK 

Which extra categories does spaCy use compared to NLTK in its named-entity recognition?

**NORP, CARDINAL, MONEY, WORK_OF_ART, LANGUAGE, EVENT**

##  Identifying people mentioned in a news article

In [None]:
def find_persons(text):
    # Create Doc object
    doc = nlp(text)
  
    # Identify the persons
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
    # Return persons
    return persons

print(find_persons(tc))

# Multilingual NER with polyglot
You have access to the full article string in article. Additionally, the Text class of polyglot has been imported from polyglot.text.

In [None]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
    
# Print the type of ent
print(type(ent))


In [None]:
# Initialize the count variable: count
count = 0

# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Márquez' or 'Gabo'
    if 'Márquez' in ent or 'Gabo' in ent:
        # Increment count
        count += 1

# Print count
print(count)

# Calculate the percentage of entities that refer to "Gabo": percentage
percentage = count / len(txt.entities)
print(percentage)


##  Counting nouns in a piece of text
### proper nouns

In [None]:
nlp = spacy.load('en_core_web_sm')
import numpy as np
# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

###  nouns

In [None]:
nlp = spacy.load('en_core_web_sm')

# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of other nouns
    return pos.count('NOUN')

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

# proper_nouns/Noun usage in fake news

In [None]:
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))

# URL extraction

In [None]:
import regex as re


class TextCleaner:

    def __init__(self):
        self.url_pattern = re.compile(r'(https?://[^\s]+)|^.*(\\\\?.*)$')

    def count_urls(self, text):
        """Count the URLs matched by url_pattern in the text."""
        urls = self.url_pattern.findall(text)
        return len(urls)


if __name__ == "__main__":
    pass
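
One thing to watch with the pattern above: it has two capture groups, so `findall` returns a list of tuples, and as far as I can tell its second alternative matches any single-line text that contains a backslash, even when no URL is present. The sketch below checks this with the standard `re` module and compares it with a simpler pattern. `URL_ONLY` and this standalone `count_urls` are hypothetical simplifications, not the class's own behaviour.

```python
import re

# Pattern copied from TextCleaner, compiled here with the standard re module
ORIGINAL = re.compile(r'(https?://[^\s]+)|^.*(\\\\?.*)$')
# Hypothetical simplification: keep only the http(s) alternative
URL_ONLY = re.compile(r'https?://\S+')

def count_urls(text, pattern=URL_ONLY):
    # Number of non-overlapping matches of the pattern in the text
    return len(pattern.findall(text))

text = 'Docs at http://a.com and https://b.org/x?y=1'
windows_path = r'C:\temp'

print(count_urls(text), count_urls(text, ORIGINAL))
print(count_urls(windows_path), count_urls(windows_path, ORIGINAL))
```

If the backslash alternative is intentional (for escaped links, say), keeping the original pattern is fine; the point is only that the count can include matches that are not `http(s)` URLs.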


# Remove html tags containing hypertext links

In [None]:
import regex as re

class Filter_html:
    """Remove HTML tags that contain hypertext links, plus all closing tags."""
    def __init__(self):
        self.re_http = re.compile('<[^>]*http[^>]*>')
        self.re_end_tag = re.compile('</[^>]*>')

    def filter(self, text):
        t = self.re_http.sub('',text)
        return self.re_end_tag.sub('', t)

# Building a bag of words model
## BoW model

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)

## Mapping feature indices with feature names

In [None]:
# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary 
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
print(bow_df)
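
To see what the bag-of-words matrix contains, here is a rough standard-library imitation of the default `CountVectorizer` behaviour: lowercase the text, keep tokens of two or more word characters, sort the vocabulary alphabetically, and count. The three-sentence corpus is a made-up example, and the sketch ignores everything else the real class does (sparse storage, stop words, n-grams).

```python
import re

# Made-up toy corpus
corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r'\b\w\w+\b', doc.lower())

# Alphabetically sorted vocabulary: one column per distinct token
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one count per vocabulary term
bow_matrix = [[tokenize(doc).count(term) for term in vocabulary]
              for doc in corpus]

print(vocabulary)
print(len(bow_matrix), len(vocabulary))
print(bow_matrix[0])
```

The single-character token `a` is dropped by the token pattern, which is why the second document contributes only five counts.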

# Modeling 

In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)

### Predicting the sentiment 

In [None]:
# Import and create a MultinomialNB object
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow,y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was terrible. The music was underwhelming and the acting mediocre."
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

# Building n-gram models

In [None]:
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))
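
The feature count grows with `ngram_range` because every extra `n` adds all word sequences of that length. A minimal sketch of the idea on a single tokenized sentence (the helper `ngrams` is hypothetical, and real corpora share n-grams across documents, so the growth there is in distinct n-grams):

```python
def ngrams(tokens, min_n, max_n):
    # Collect every contiguous run of n tokens for n in [min_n, max_n]
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = 'the movie was not good'.split()

for max_n in (1, 2, 3):
    print(max_n, len(ngrams(tokens, 1, max_n)))

print(ngrams(tokens, 2, 2))
```

Bigrams such as `not good` are what let a model pick up negation that single words miss, at the cost of a much wider feature matrix.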

In [None]:
# Define an instance of MultinomialNB
clf_ng = MultinomialNB()

# Fit the classifier
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy 
accuracy = clf_ng.score(X_test_ng,y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was not good. The plot had several holes and the acting lacked panache."
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))

## Comparing performance of n-gram models

In [None]:
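import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

start_time = time.time()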
# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(df['review'], df['sentiment'], test_size=0.5, random_state=42, stratify=df['sentiment'])

# Generating ngrams
vectorizer = CountVectorizer(ngram_range=(1, 3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." % (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))

# Tf-idf vectors 

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)
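
For intuition about the numbers inside `tfidf_matrix`, the sketch below recomputes tf-idf by hand on a made-up three-document corpus. It assumes the scikit-learn defaults: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names `idf` and `tfidf_row` are hypothetical.

```python
import math

# Made-up, already tokenized documents
docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'good', 'plot']]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Term count times idf, then scale the row to unit length
    raw = [doc.count(term) * idf(term) for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

tfidf_matrix = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in tfidf_matrix:
    print([round(x, 3) for x in row])
```

Terms that appear in fewer documents (`bad`, `plot`) get a larger idf than terms that appear in several (`good`, `movie`), and every row ends up with length 1.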

# Cosine similarity matrix of a corpus

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

# Comparing linear_kernel and cosine_similarity

In [None]:
import time
from sklearn.metrics.pairwise import cosine_similarity

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

In [None]:
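import time
from sklearn.metrics.pairwise import linear_kernel

# Record start time
start = time.time()

# Compute the similarity matrix with linear_kernel (a plain dot product, which
# equals cosine similarity because tf-idf rows are L2-normalized by default)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)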

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))
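
Why the two functions give the same matrix: cosine similarity divides the dot product by both vector lengths, and once the vectors already have length 1 that division does nothing. A small standard-library check on two made-up vectors (the helpers `dot`, `cosine` and `l2_normalize` are hypothetical names):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

u, v = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
u_hat, v_hat = l2_normalize(u), l2_normalize(v)

print(cosine(u, v))        # cosine similarity of the raw vectors
print(dot(u_hat, v_hat))   # plain dot product of the normalized vectors
```

This is why `linear_kernel` tends to be the faster choice on tf-idf output: it skips a normalization step that has already been done. On vectors that are not normalized (raw counts, for example) the two are not interchangeable.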

In [None]:
# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations
# (get_recommendations and indices are defined in "The recommender function" below)
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

# The recommender function

In [None]:
# Generate mapping between titles and index
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]

# Beyond n-grams: word embeddings
## Generating word vectors

In [None]:
# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

# Classifying fake news using supervised learning with NLP

In [None]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split 

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df['label']

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names_out()[0:10])

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10])

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.toarray()[:5])


In [None]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

In [None]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test,pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test,pred, labels=['FAKE','REAL'])
print(cm)


In [None]:
import numpy as np 
# Create the list of alphas: alphas
alphas = np.arange(0,1,0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test,pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
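
What `alpha` changes: in multinomial Naive Bayes the per-class word likelihood is estimated as `(count + alpha) / (class_total + alpha * vocab_size)`, so `alpha` controls how much probability mass is reserved for words never seen with a class. The sketch below plugs made-up counts into that formula; `smoothed_likelihood` is a hypothetical helper, not a scikit-learn function.

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha):
    # Additive (Laplace/Lidstone) smoothing of a per-class word likelihood
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Made-up counts: a word never seen in a class that has 10 tokens in total,
# with a vocabulary of 5 distinct words
for alpha in (0.0, 0.1, 1.0):
    print(alpha, smoothed_likelihood(0, 10, 5, alpha))
```

With `alpha = 0` an unseen word gets probability zero, which wipes out the whole product for that class. That is the reason very small values in the grid above deserve a second look before being picked on accuracy alone.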


# Sentiment Analysis 1

## How many positive and negative reviews are there?

In [None]:
# Find the number of positive and negative reviews
print('Number of positive and negative reviews: ', movies.label.value_counts())

# Find the proportion of positive and negative reviews
print('Proportion of positive and negative reviews: ', movies.label.value_counts()/ len(movies))

**The .value_counts() method is an easy way to gain a first impression about the contents of the label column.**
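
As a rough picture of what the two print statements above compute, here is the same counting done with the standard library on a made-up list of labels. In pandas itself, `value_counts(normalize=True)` returns the proportions directly, without the manual division.

```python
from collections import Counter

# Made-up labels: 1 = positive review, 0 = negative review
labels = [1, 0, 1, 1, 0, 1]

# Number of reviews per label
counts = Counter(labels)

# Share of reviews per label
proportions = {label: n / len(labels) for label, n in counts.items()}

print(counts)
print(proportions)
```

A strongly unbalanced split at this stage is worth noting, because plain accuracy becomes a misleading metric later on.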

# Longest and shortest reviews

In [None]:
length_reviews = movies.review.str.len()
# How long is the longest review
print(max(length_reviews))

In [None]:
length_reviews = movies.review.str.len()
# How long is the shortest review
print(min(length_reviews))


# Detecting the sentiment of Tale of Two Cities

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object  
blob_two_cities = TextBlob(two_cities)

# Print out the sentiment 
print(blob_two_cities.sentiment)

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object
blob_annak = TextBlob(annak)
blob_catcher = TextBlob(catcher)

# Print out the sentiment   
print('Sentiment of annak: ',blob_annak.sentiment)
print('Sentiment of catcher: ', blob_catcher.sentiment)

# What is the sentiment of a movie review?

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object  
blob_titanic = TextBlob(titanic)

# Print out its sentiment  
print(blob_titanic.sentiment)

# Your first word cloud

In [None]:
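# Import the word cloud class and the plotting library
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate the word cloud from the east_of_eden string
cloud_east_of_eden = WordCloud(background_color="white").generate(east_of_eden)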

# Create a figure of the generated cloud
plt.imshow(cloud_east_of_eden, interpolation='bilinear')  
plt.axis('off')
# Display the figure
plt.show()

# Which words are in the word cloud?

In [None]:
# Import the word cloud class
from wordcloud import WordCloud
# Create and generate a word cloud image
my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(descriptions)
# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")
# Don't forget to show the final image
plt.show()

## Bag-of-words

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 
# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)
# Transform the review column
X_review = vect.transform(reviews.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

# Getting granular with n-grams

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 
# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(movies.review)
# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
# Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)
# Transform the review
X_review = vect.transform(reviews.review)
# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
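
The ngram_range=(2, 2) argument above asks for bigrams only. As a rough sketch of what that means, the hypothetical helper below (ngrams is not part of scikit-learn) slides a window of n tokens over a tokenized sentence.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up example sentence
tokens = "i did not like the movie".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)
print(bigrams)
```

Bigrams such as "not like" keep the negation next to the word it modifies, which single-word features lose.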

In [None]:
# Import the required function
from nltk import word_tokenize
# Transform the GoT string to word tokens
print(word_tokenize(GoT))

In [None]:
# Import the word tokenizing function
from nltk import word_tokenize

# Tokenize each item in the avengers 
tokens_avengers = [word_tokenize(item) for item in avengers]

print(tokens_avengers)

In [None]:
# Compute the number of tokens in each review
# (word_tokens holds one list of tokens per review)
len_tokens = [len(tokens) for tokens in word_tokens]

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens


In [None]:
# Import the language detection function and package
from langdetect import detect_langs

# Detect the language of the foreign string
print(detect_langs(foreign))
# Detect the language of every sentence in a list
languages = []

# Loop over the sentences in the list and detect their language
for sentence in sentences:
    languages.append(detect_langs(sentence))

print('The detected languages are: ', languages)

In [None]:
from langdetect import detect_langs
languages = [] 

# Loop over the rows of the dataset and append  
for row in range(len(non_english_reviews)):
    languages.append(detect_langs(non_english_reviews.iloc[row, 1]))

# Clean the list by splitting     
languages = [str(lang).split(':')[0][1:] for lang in languages]

# Assign the list to a new feature 
non_english_reviews['language'] =languages

print(non_english_reviews.head())

# Word cloud of tweets

In [None]:
# Import the word cloud function and stop words list
from wordcloud import WordCloud, STOPWORDS 

# Define the stopwords: union() returns a new set, whereas update() returns None
my_stop_words = STOPWORDS.union(['airline', 'airplane'])

# Create and generate a word cloud image
my_cloud = WordCloud(stopwords=my_stop_words).generate(text_tweet)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")
# Don't forget to show the final image
plt.show()

In [None]:
# Import the stop words
from sklearn.feature_extraction.text import CountVectorizer,ENGLISH_STOP_WORDS

# Define the stop words (converted to a list, which recent scikit-learn versions require)
my_stop_words = list(ENGLISH_STOP_WORDS.union(['airline', 'airlines', '@']))

# Build and fit the vectorizer
vect = CountVectorizer(stop_words=my_stop_words)
vect.fit(tweets.text)

# Create the bow representation
X_review = vect.transform(tweets.text)
# Create the data frame
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

In [None]:
# Build and fit the vectorizer
vect = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]').fit(tweets.text)
vect.transform(tweets.text)
print('Length of vectorizer: ', len(vect.get_feature_names_out()))

In [None]:
# Import the word tokenizing package
from nltk import word_tokenize

# Tokenize the text column
word_tokens = [word_tokenize(review) for review in tweets.text]
print('Original tokens: ', word_tokens[0])

# Filter out non-letter characters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
print('Cleaned tokens: ', cleaned_tokens[0])

# String operators 

In [None]:
# Create a list of lists, containing the tokens from tweets_list
tokens = [word_tokenize(item) for item in tweets_list]
# Remove punctuation and digits, i.e. retain only letters
letters = [[word for word in item if word.isalpha()] for item in tokens]
# Remove punctuation, i.e. retain only letters and digits
let_digits = [[word for word in item if word.isalnum()] for item in tokens]
# Remove letters and punctuation, i.e. retain only digits
digits = [[word for word in item if word.isdigit()] for item in tokens]
# Print the last item in each list
print('Last item in alphabetic list: ', letters[2])
print('Last item in list of alphanumerics: ', let_digits[2])
print('Last item in the list of digits: ', digits[2])
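
The three string methods differ only in which tokens they let through. A small sketch on a made-up token list (no tokenizer needed) shows the difference.

```python
# Hypothetical tokens, as a tokenizer might return them for one tweet
tokens = ["flight", "delayed", "2", "hours", "#", "gate", "B12", "!"]

# Keep tokens made only of letters
letters = [t for t in tokens if t.isalpha()]
# Keep tokens made only of letters and/or digits
let_digits = [t for t in tokens if t.isalnum()]
# Keep tokens made only of digits
digits = [t for t in tokens if t.isdigit()]

print(letters)
print(let_digits)
print(digits)
```

Note that a mixed token such as "B12" passes isalnum() but neither isalpha() nor isdigit().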

# Stems and lemmas 

In [None]:
# Import the required packages from nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize
porter = PorterStemmer()
WNlemmatizer = WordNetLemmatizer()
# Tokenize the GoT string
tokens = word_tokenize(GoT) 

In [None]:
import time

# Log the start time
start_time = time.time()

# Build a stemmed list
stemmed_tokens = [porter.stem(token) for token in tokens] 

# Log the end time
end_time = time.time()

print('Time taken for stemming in seconds: ', end_time - start_time)
print('Stemmed tokens: ', stemmed_tokens)

In [None]:
import time

# Log the start time
start_time = time.time()

# Build a lemmatized list
lem_tokens = [WNlemmatizer.lemmatize(token) for token in tokens]

# Log the end time
end_time = time.time()

print('Time taken for lemmatizing in seconds: ', end_time - start_time)
print('Lemmatized tokens: ', lem_tokens) 

In [None]:
# Import the required packages
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

# Import the Spanish SnowballStemmer
SpanishStemmer = SnowballStemmer("spanish")

# Create a list of tokens
tokens = [word_tokenize(review) for review in non_english_reviews.review] 
# Stem the list of tokens
stemmed_tokens = [[SpanishStemmer.stem(word) for word in token] for token in tokens]

# Print the first item of the stemmed tokens
stemmed_tokens[0]

In [None]:
# Import the required vectorizer package and stop words list
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
# Define the vectorizer and specify the arguments
my_pattern = r'\b[^\d\W][^\d\W]+\b'
vect = TfidfVectorizer(ngram_range=(1,2), max_features=100, token_pattern=my_pattern, stop_words=list(ENGLISH_STOP_WORDS)).fit(tweets.text)
# Transform the vectorizer
X_txt = vect.transform(tweets.text)
# Transform to a data frame and specify the column names
X = pd.DataFrame(X_txt.toarray(), columns=vect.get_feature_names_out())
print('Top 5 rows of the DataFrame: ', X.head())

In [None]:
# Import the required packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
# Build a BOW and tfidf vectorizers from the review column and with max of 100 features
vect1 = CountVectorizer(max_features=100).fit(reviews.review)
vect2 = TfidfVectorizer(max_features=100).fit(reviews.review) 
# Transform the vectorizers
X1 = vect1.transform(reviews.review)
X2 = vect2.transform(reviews.review)
# Create DataFrames from the vectorizers 
X_df1 = pd.DataFrame(X1.toarray(), columns=vect1.get_feature_names_out())
X_df2 = pd.DataFrame(X2.toarray(), columns=vect2.get_feature_names_out())
print('Top 5 rows using BOW: \n', X_df1.head())
print('Top 5 rows using tfidf: \n', X_df2.head())
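
To see why the tf-idf values differ from the raw counts, here is a sketch of the textbook formula, tf(word, doc) * log(N / df(word)), on a made-up corpus. This is an assumption-laden simplification: by default TfidfVectorizer uses a smoothed idf and L2-normalizes each row, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical, not the course data)
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Textbook tf-idf: relative term frequency times log(N / df)."""
    tf = Counter(doc)
    return {word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

In the second document, "bad" gets a higher weight than "movie" because it appears in fewer documents, even though both occur once.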

## Logistic regression of movie reviews

In [None]:
# Import the logistic regression
from sklearn.linear_model import LogisticRegression
# Define the vector of targets and matrix of features
y = movies.label
X = movies.drop('label', axis=1)
# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X, y))

# Did we really predict the sentiment well?

In [None]:
# Import the required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Define the vector of labels and matrix of features
y = movies.label
X = movies.drop('label', axis=1)
# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build a logistic regression model and print out the accuracy
log_reg = LogisticRegression().fit(X_train,y_train)
print('Accuracy on train set: ', log_reg.score(X_train, y_train))
print('Accuracy on test set: ', log_reg.score(X_test,y_test))

### Performance metrics

In [None]:
# Import the performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123,stratify=y)
# Train a logistic regression
log_reg = LogisticRegression().fit(X_train,y_train )
# Make predictions on the test set
y_predicted = log_reg.predict(X_test)
# Print the performance metrics
print('Accuracy score test set: ', accuracy_score(y_test, y_predicted))
print('Confusion matrix test set: \n', confusion_matrix(y_test, y_predicted)/len(y_test))

###  Build and assess a model: product reviews data

In [None]:
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the probability of the 0 class
prob_0 = log_reg.predict_proba(X_test)[:, 0]
# Predict the probability of the 1 class
prob_1 = log_reg.predict_proba(X_test)[:, 1]

print("First 10 predicted probabilities of class 0: ", prob_0[:10])
print("First 10 predicted probabilities of class 1: ", prob_1[:10])

### Product reviews with regularization

In [None]:
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Train a logistic regression with C=1000 (C is the inverse of the regularization strength, so this is weak regularization)
log_reg1 = LogisticRegression(C=1000).fit(X_train, y_train)
# Train a logistic regression with C=0.001 (strong regularization)
log_reg2 = LogisticRegression(C=0.001).fit(X_train, y_train)

# Print the accuracies
print('Accuracy of model 1: ', log_reg1.score(X_test,y_test))
print('Accuracy of model 2: ', log_reg2.score(X_test, y_test))

# Recurrent Neural Networks for Language Modeling in Python

In [10]:
# Read Data :
import pandas as pd

sheldon_quotes=pd.read_fwf('/home/abderrazak/ALLINHERE/NLP/Datacamp/sampleDlNlp.txt', header=None)
sheldon_quotes

Unnamed: 0,0,1
0,its my job\té o meu trabalho,
1,wholl cook\tquem cozinhará,
2,help me\tajudemme,
3,i sat down\teu me sentei,
4,im worn out\testou exausto,
...,...,...
4995,call me later\tme liga depois,
4996,look alive\tse apresse,
4997,id buy that\teu compraria aquele,
4998,i have won\teu venci,


In [11]:
# Transform the list of sentences into a list of words
all_words = ' '.join(sheldon_quotes[0]).split(' ')

# Get number of unique words
unique_words = list(set(all_words))

# Dictionary of indexes as keys and words as values
index_to_word = {i:wd for i, wd in enumerate(sorted(unique_words))}

print(index_to_word)

# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for i, wd in enumerate(sorted(unique_words))}

print(word_to_index)

{0: '', 1: 'a', 2: 'abaixado', 3: 'abaixados', 4: 'abandon', 5: 'abençoe', 6: 'aberto', 7: 'aboard\ttodos', 8: 'about', 9: 'above\tvejam', 10: 'abrace', 11: 'abraçar', 12: 'abraçaram', 13: 'abraço', 14: 'abraçou', 15: 'abraçálo', 16: 'abrir', 17: 'absurd\tque', 18: 'absurdo', 19: 'acabou', 20: 'acalma', 21: 'acalme', 22: 'acalmese', 23: 'acenei', 24: 'acenou', 25: 'achamos', 26: 'achar', 27: 'achei', 28: 'acho', 29: 'acidente', 30: 'acima', 31: 'acontece', 32: 'acontecer', 33: 'acorda', 34: 'acordada', 35: 'acordado', 36: 'acordados', 37: 'acorde', 38: 'acordei', 39: 'acordo', 40: 'acordou', 41: 'acredito', 42: 'act', 43: 'act\tnós', 44: 'action\ttome', 45: 'action\ttomem', 46: 'actor\tsou', 47: 'addicted\testou', 48: 'adentro', 49: 'adeus', 50: 'adiantada', 51: 'adiantado', 52: 'adiantados', 53: 'adiante', 54: 'admire', 55: 'admiro', 56: 'admit', 57: 'admito', 58: 'adoeceu', 59: 'adopted\teu', 60: 'adora', 61: 'adorable\tele', 62: 'adormeceu', 63: 'adormecido', 64: 'adoro', 65

# Preparing text data for model input

In [None]:
# Create lists to keep the sentences and the next character
sentences = []   # ~ Training data
next_chars = []  # ~ Training labels

# Define hyperparameters
step = 2         # ~ Step to take when reading the texts in characters
chars_window = 10 # ~ Number of characters to use to predict the next one  

# Loop over the text: length `chars_window` per time with step equal to `step`
for i in range(0, len(sheldon_quotes) - chars_window, step):
    sentences.append(sheldon_quotes[i:i + chars_window])
    next_chars.append(sheldon_quotes[i + chars_window])

# Print 10 pairs
print_examples(sentences, next_chars, 10)
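
The windowing loop is easier to follow on a short string. The sketch below applies the same slicing to a made-up text, assuming the input is a single Python string, and collects each window of chars_window characters with the character that follows it.

```python
# Hypothetical short text, standing in for the full corpus
text = "its my job to cook"
chars_window = 10  # characters used to predict the next one
step = 2           # shift between consecutive windows

sentences = []
next_chars = []
for i in range(0, len(text) - chars_window, step):
    # Window of `chars_window` characters starting at position i
    sentences.append(text[i:i + chars_window])
    # The character right after the window is the label
    next_chars.append(text[i + chars_window])

for window, label in zip(sentences, next_chars):
    print(repr(window), "->", repr(label))
```

A smaller step produces more, heavily overlapping training pairs; a larger step produces fewer.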

# Transforming new text

In [None]:
# Loop through the sentences and get indexes
new_text_split = []
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        index = word_to_index.get(wd,0 )
        sent_split.append(index)
    new_text_split.append(sent_split)

# Print the first sentence's indexes
print(new_text_split[0])

# Print the sentence converted using the dictionary
print(' '.join([index_to_word[index] for index in new_text_split[0]]))
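
The cell above sends every unknown word to index 0 through word_to_index.get(wd, 0). One way to make that explicit is to reserve index 0 for a dedicated unknown token. The sketch below uses a made-up five-word vocabulary, and the names UNK, encode and decode are illustrative, not part of the course code.

```python
# Index 0 is reserved for words that are not in the vocabulary
UNK = "<unk>"
vocab = ["cook", "help", "i", "me", "sat"]  # hypothetical vocabulary

word_to_index = {wd: i for i, wd in enumerate(vocab, start=1)}
word_to_index[UNK] = 0
index_to_word = {i: wd for wd, i in word_to_index.items()}

def encode(sentence):
    """Map each word to its index, falling back to the UNK index."""
    return [word_to_index.get(wd, word_to_index[UNK]) for wd in sentence.split(" ")]

def decode(indexes):
    """Map indexes back to words."""
    return " ".join(index_to_word[i] for i in indexes)

encoded = encode("help me cook dinner")
print(encoded)
print(decode(encoded))
```

With a reserved token, an out-of-vocabulary word shows up as "<unk>" after decoding instead of silently turning into whatever word happens to sit at index 0.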

# Keras models
## Sequential

In [17]:
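# Import the model class and the layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

# Instantiate the class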
model = Sequential(name="sequential_model")

# One LSTM layer (defining the input shape because it is the 
# initial layer)
model.add(LSTM(128, input_shape=(None, 10), name="LSTM"))

# Add a dense layer with one unit
model.add(Dense(1, activation="sigmoid", name="output"))

# The summary shows the layers and the number of parameters 
# that will be trained
model.summary()

Model: "sequential_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________
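
The parameter counts in the summary can be checked by hand. An LSTM layer has four weight sets (input, forget and output gates plus the cell candidate), each with an input kernel, a recurrent kernel and a bias; a Dense layer has one weight per input plus a bias per unit. The helper names below are illustrative only.

```python
def lstm_params(input_dim, units):
    """4 weight sets, each with (input_dim + units) * units weights and units biases."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return input_dim * units + units

lstm = lstm_params(input_dim=10, units=128)
dense = dense_params(input_dim=128, units=1)

print(lstm, dense, lstm + dense)
```

The results agree with the summary: 71,168 for the LSTM layer, 129 for the output layer, 71,297 in total.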


## Model

In [18]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Input, LSTM
# Define the input layer
main_input =Input(shape=(None, 10), name="input")

# One LSTM layer (input shape is already defined)
lstm_layer = LSTM(128, name="LSTM")(main_input)

# Add a dense layer with one unit
main_output = Dense(1, activation="sigmoid", name="output")(lstm_layer)

# Instantiate the class at the end
model = Model(inputs=main_input, outputs=main_output, name="modelclass_model")

# Same amount of parameters to train as before (71,297)
model.summary()

Model: "modelclass_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, None, 10)]        0         
_________________________________________________________________
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________


# Keras preprocessing

In [21]:
# Import relevant classes/functions
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Build the dictionary of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sheldon_quotes[0])

# Change texts into sequence of indexes
texts_numeric = tokenizer.texts_to_sequences(sheldon_quotes[0])
print("Number of words in the sample texts: ({0}, {1})".format(len(texts_numeric[0]), len(texts_numeric[1])))

# Pad the sequences
texts_pad = pad_sequences(texts_numeric, 60, padding='post')
print("Now the texts have fixed length: 60. Let's see the first one: \n{0}".format(texts_pad[0]))

Number of words in the sample texts: (7, 4)
Now the texts have fixed length: 60. Let's see the first one: 
[ 23 106 488   5   7 113 273   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0]
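
As a rough illustration of what padding='post' does, the plain-Python sketch below appends zeros until every sequence has the same length. The function pad_post is hypothetical; it drops items from the front of sequences that are too long, which is my understanding of the default truncating behaviour of pad_sequences.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen` items."""
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` items of sequences that are too long
        seq = seq[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Made-up index sequences of different lengths
texts_numeric = [[23, 106, 488], [1, 2, 3, 4, 5, 6]]
texts_pad = pad_post(texts_numeric, maxlen=4)

for row in texts_pad:
    print(row)
```

Index 0 is safe to use as the padding value because the Keras Tokenizer starts numbering words at 1.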


## Your first RNN model

In [None]:
# Import the recurrent layer
from tensorflow.keras.layers import SimpleRNN
# Build model
model = Sequential()
model.add(SimpleRNN(units=128, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', 
              optimizer='adam',
              metrics=['accuracy'])
# Load pre-trained weights
model.load_weights('model_weights.h5')
# Method '.evaluate()' shows the loss and accuracy
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print("Loss: {0} \nAccuracy: {1}".format(loss, acc))

# Exploding gradient problem

In [None]:
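# Import the initializer and the optimizer
from tensorflow.keras.initializers import he_uniform
from tensorflow.keras.optimizers import SGD

# Create a Keras model with one hidden Dense layer
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer=he_uniform(seed=42)))
model.add(Dense(1, activation='linear'))

# Compile and fit the model; clipvalue limits each gradient component to [-3.0, 3.0]
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9, clipvalue=3.0))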
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)

# See Mean Square Error for train and test data
train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse= model.evaluate(X_test, y_test, verbose=0)

# Print the values of MSE
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
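
The clipvalue=3.0 argument is what keeps the gradients from exploding in this cell. Conceptually, clipping by value limits every gradient component to a fixed interval before the weight update. A minimal sketch with made-up gradient values (clip_by_value is an illustrative helper, not the Keras implementation):

```python
def clip_by_value(gradients, clipvalue):
    """Limit every gradient component to the interval [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in gradients]

# Hypothetical gradient components, two of them too large
grads = [0.5, -7.2, 3.0, 12.0]
clipped = clip_by_value(grads, clipvalue=3.0)

print(clipped)
```

Small components pass through unchanged; only the ones outside the interval are cut back to its boundary.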

# Vanishing gradient problem

In [None]:
# Create the model
model = Sequential()
model.add(SimpleRNN(units=600, input_shape=(None, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
# Load pre-trained weights
model.load_weights('model_weights.h5')
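# Plot the accuracy x epoch graph
# (history is assumed to be a preloaded training history; depending on the
# Keras version the keys may be 'accuracy' and 'val_accuracy' instead)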
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.legend(['train', 'val'], loc='upper left')
plt.show()

# Stacking RNN 

In [None]:
# Import the LSTM layer
from tensorflow.keras.layers import LSTM
# Build model
model = Sequential()
model.add(LSTM(units=128, input_shape=(None, 1), return_sequences=True))
model.add(LSTM(units=128, return_sequences=True))
model.add(LSTM(units=128, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Load pre-trained weights
model.load_weights('lstm_stack_model_weights.h5')
print("Loss: %0.04f\nAccuracy: %0.04f" % tuple(model.evaluate(X_test, y_test, verbose=0)))

# Transfer learning

In [None]:
# Load the glove pre-trained vectors
glove_matrix = load_glove('glove_200d.zip')
# Create a model with embeddings
model = Sequential(name="emb_model")
model.add(Embedding(input_dim=vocabulary_size + 1, output_dim=wordvec_dim, 
                    embeddings_initializer= Constant(glove_matrix), 
                    input_length=sentence_len, trainable=False))
model.add(GRU(128))
# Sigmoid output, so that binary_crossentropy receives probabilities
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Print the summaries of the model with embeddings
model.summary()

# Sentiment classification revisited
## Better sentiment classification

In [None]:
# Build and compile the model
model = Sequential()
model.add( Embedding(vocabulary_size, wordvec_dim, trainable=True, input_length=max_text_len))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.15))
model.add(LSTM(64, return_sequences=False, dropout=0.2, recurrent_dropout=0.15))
model.add(Dense(16))
model.add(Dropout(rate=0.25))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Load pre-trained weights
model.load_weights('model_weights.h5')
# Print the obtained loss and accuracy
print("Loss: {0}\nAccuracy: {1}".format(*model.evaluate(X_test, y_test, verbose=0)))

# Prepare label vector

In [None]:
# Get the numerical ids of column label
numerical_ids = df.label.cat.codes

# Print initial shape
print(numerical_ids.shape)

In [None]:
# Get the numerical ids of column label
numerical_ids = df.label.cat.codes
# Print initial shape
print(numerical_ids.shape)
# One-hot encode the indexes
Y = to_categorical(numerical_ids)
# Check the new shape of the variable
print(Y.shape)
# Print the first 5 rows
print(Y[0:5])
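
One-hot encoding turns each class id into a row with a single 1 in the position of that class. The plain-Python sketch below mimics the idea on made-up ids; one_hot is an illustrative helper, not the Keras function.

```python
def one_hot(ids, num_classes=None):
    """Return one row per id, with a 1.0 in the column of that id."""
    if num_classes is None:
        num_classes = max(ids) + 1
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]

# Hypothetical class ids for four documents and three classes
numerical_ids = [0, 2, 1, 2]
Y = one_hot(numerical_ids)

print(len(Y), len(Y[0]))
for row in Y:
    print(row)
```

The shape changes from (n_samples,) to (n_samples, n_classes), which is what the two shape printouts in the cell above show.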

In [None]:
# Create and fit tokenizer on the text of the documents
tokenizer = Tokenizer()
tokenizer.fit_on_texts(news_dataset.data)
# Prepare the data
prep_data = tokenizer.texts_to_sequences(news_dataset.data)
prep_data = pad_sequences(prep_data, maxlen=200)
# Prepare the labels
prep_labels = to_categorical(news_dataset.target)
# Print the shapes
print(prep_data.shape)
print(prep_labels.shape)

In [None]:
# Change text for numerical ids and pad
X_novel = tokenizer.texts_to_sequences(news_novel.data)
X_novel = pad_sequences(X_novel, maxlen=400)
# One-hot encode the labels
Y_novel = to_categorical(news_novel.target)
# Load the model pre-trained weights
model.load_weights('classify_news_weights.h5')
# Evaluate the model on the new dataset
loss, acc = model.evaluate(X_novel, Y_novel, batch_size=64)
# Print the loss and accuracy obtained
print("Loss:\t{0}\nAccuracy:\t{1}".format(loss, acc))

# Precision-Recall trade-off

When working with classification tasks, the term Precision-Recall trade-off often appears. Where does it come from?

Usually, a document is assigned to the class with the highest probability (obtained with the .predict_proba() method). But what if the maximum probability is only 0.1? Should you assign the document to that class when the model gives it just a 10% probability?

The answer depends on the problem at hand. You can add a minimum threshold that the probability must reach before the classification is accepted; as you change the threshold, precision and recall move in opposite directions.

The variable y_true and the model model are loaded. If the highest probability is lower than the threshold, the document is assigned to DEFAULT_CLASS (chosen to be class 2).
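
The thresholding rule can be sketched in a few lines of plain Python. The probabilities below are made up, and predict_with_threshold is a hypothetical helper; only DEFAULT_CLASS = 2 comes from the text.

```python
DEFAULT_CLASS = 2

def predict_with_threshold(probabilities, threshold):
    """Pick the most probable class, or DEFAULT_CLASS if it is below the threshold."""
    predictions = []
    for probs in probabilities:
        best = max(range(len(probs)), key=lambda k: probs[k])
        predictions.append(best if probs[best] >= threshold else DEFAULT_CLASS)
    return predictions

# Hypothetical predicted probabilities for three documents and three classes
probs = [[0.80, 0.15, 0.05],
         [0.40, 0.35, 0.25],
         [0.10, 0.55, 0.35]]

low = predict_with_threshold(probs, threshold=0.5)
high = predict_with_threshold(probs, threshold=0.9)
print(low)
print(high)
```

Raising the threshold moves more documents into the default class, so the remaining predictions for the other classes are more reliable but cover fewer of their documents.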

# Precision or Recall, that is the question

You learned about a few performance metrics and maybe you are asking, when should I use precision and when should I use recall? Those two metrics are calculated for each class, and sometimes it is difficult to understand when to focus on one and when to focus on the other.

Precision measures how trustworthy the predictions of a class are: of all documents predicted as that class, it is the fraction that truly belong to it. Recall measures how completely a class is found: of all documents that truly belong to the class, it is the fraction the model assigned to it. If precision is high for a class, you can trust your model when it predicts that class. If recall is high for a class, the model misses few of the documents in it.

Follow the instructions to see this comparison between precision and recall with an example. The functions precision_score() and recall_score() are loaded.


In [None]:
# Compute the precision of the sentiment model
prec_sentiment = precision_score(sentiment_y_true, sentiment_y_pred, average=None)
print(prec_sentiment)
# Compute the recall of the sentiment model
rec_sentiment = recall_score(sentiment_y_true, sentiment_y_pred, average=None)
print(rec_sentiment)
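
To see where the per-class numbers come from, the sketch below computes precision and recall by hand for made-up labels. The helper precision_recall is illustrative; with average=None the scikit-learn functions should report the same per-class values.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # predicted as `label`
    actual = sum(1 for t in y_true if t == label)     # truly `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical true and predicted sentiment labels
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

for label in (0, 1):
    print(label, precision_recall(y_true, y_pred, label))
```

Here class 1 is predicted three times and two of those are correct, while only two of its four true members are found, so its precision is higher than its recall.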

# Performance on multi-class classification

In [None]:
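# Import NumPy and the metrics
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# Use the model to predict on new data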
predicted = model.predict(X_test)
# Choose the class with higher probability 
y_pred = np.argmax(predicted, axis=1)
# Compute and print the confusion matrix
print(confusion_matrix(y_true, y_pred))
# Create the performance report
print(classification_report(y_true, y_pred, target_names=news_cat))

# Sequence to Sequence Models
## Preparing the data for training

In [None]:
# Instantiate the vectors
sentences = []
next_chars = []
# Loop for every sentence
for sentence in sheldon.split('\n'):
    # Get the previous chars_window chars and the next char; then shift by step
    for i in range(0, len(sentence) - chars_window, step):
        sentences.append(sentence[i:i + chars_window])
        next_chars.append(sentence[i + chars_window])
# Define a Data Frame with the vectors
df = pd.DataFrame({'sentence': sentences, 'next_char': next_chars})
# Print the initial rows
print(df.head())

In [None]:
# Instantiate the variables with zeros
numerical_sentences = np.zeros((num_seqs, chars_window, n_vocab), dtype=bool)
numerical_next_chars = np.zeros((num_seqs, n_vocab), dtype=bool)

# Loop for every sentence
for i, sentence in enumerate(sentences):
  # Loop for every character in sentence
  for t, char in enumerate(sentence):
    # Set position of the character to 1
    numerical_sentences[i, t, char_to_index[char]] = 1
  # Set next character to 1 (once per sentence, outside the inner loop)
  numerical_next_chars[i, char_to_index[next_chars[i]]] = 1

# Print the first position of each
print(numerical_sentences[0], numerical_next_chars[0], sep="\n")

# Creating the text generation model

In [None]:
# Instantiate the model
model = Sequential(name="LSTM_model")
# Add two LSTM layers
model.add(LSTM(64, input_shape=input_shape, dropout=0.15, recurrent_dropout=0.15, return_sequences=True, name="Input_layer"))
model.add(LSTM(64, dropout=0.15, recurrent_dropout=0.15, return_sequences=False, name="LSTM_hidden"))
# Add the output layer
model.add(Dense(n_vocab, activation='softmax', name="Output_layer"))
# Compile and load weights
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.load_weights('model_weights.h5')
# Summary
model.summary()

# Neural Machine Translation

In [None]:
# Get maximum length of the sentences
pt_length = max([len(sentence.split()) for sentence in pt_sentences])

# Transform text to sequence of numerical indexes
X = input_tokenizer.texts_to_sequences(pt_sentences)

# Pad the sequences
X = pad_sequences(X, maxlen=pt_length, padding='post')

# Print first sentence
print(pt_sentences[0])

# Print transformed sentence
print(X[0])

In [None]:
# Initialize the variable
Y = transform_text_to_sequences(en_sentences,output_tokenizer)

# Temporary list
ylist = list()
for sequence in Y:
    # One-hot encode sentence and append to list
    ylist.append(to_categorical(sequence, num_classes=en_vocab_size))

# Update the variable
Y = np.array(ylist).reshape(Y.shape[0], Y.shape[1], en_vocab_size)

# Print the raw sentence and its transformed version
print("Raw sentence: {0}\nTransformed: {1}".format(en_sentences[0], Y[0]))

# Multilabel:
https://shravan-kuchkula.github.io/mutli-class-multi-label-pipeline/#testing-the-improved-pipeline-on-a-holdout-dataset

In [None]:
# Asiig

In [None]:
import pandas as pd 
pd.