# Sentiment Analysis in Python

"Sentiment analysis", also called "Opinion Mining", is the process of understanding the opinion of an author about a subject.  
In a sentiment analysis system, depending on the context, we usually have 3 elements:  
1. The opinion / emotion. An opinion, also called "polarity", can be positive, neutral or negative. An emotion could be qualitative (like joy, surprise, or anger) or quantitative (like rating a movie on the scale from 1 to 10).
2. The subject that is being talked about, such as a book, a movie, or a product. Sometimes one opinion could discuss multiple aspects of the same subject. For example: "The camera on this phone is great but its battery life is rather disappointing."
3. The opinion holder, or entity, expressing the opinion.


Sentiment analysis has many practical applications. In social media monitoring, we don't just want to know if people are talking about a brand; we want to know how they are talking about it. Social media isn't our only source of information; we can also find sentiment on forums, blogs, and the news. Most brands analyze all of these sources to enrich their understanding of how customers interact with their brand, what they are happy or unhappy about, and what matters most to consumers. Sentiment analysis is thus very important in brand monitoring, and in fields such as customer and product analytics and market research and analysis.

### Counting the number of negative reviews

In [None]:
# Find the number of positive and negative reviews
print('Number of positive and negative reviews: ', movies.label.value_counts())

# Find the proportion of positive and negative reviews
print('Proportion of positive and negative reviews: ', movies.label.value_counts() / len(movies))

In [None]:
# Find the longest/shortest review
length_reviews = movies.review.str.len()

# How long is the longest review?
print(max(length_reviews))

# How long is the shortest review?
print(min(length_reviews))

# Sentiment Analysis types and approaches

Sentiment analysis tasks can be carried out at different levels of granularity.  
First is document level: this is when we look at the whole review of a product, for example.  
Second is the sentence level: we determine whether the opinion expressed in each sentence is positive, negative, or neutral.  
The last level of granularity is the aspect level: it refers to expressing opinions about different features of a product.  
For example, the sentence "The camera in this phone is pretty good but the battery life is disappointing." expresses both positive and negative opinions about a phone, and we might want to be able to say which features of the product clients like and which they don't.

The algorithms used for sentiment analysis could be split into 2 main categories.  
The first is rule-based or lexicon-based. Such methods most commonly rely on a predefined list of words, each with a valence score (nice could be +2, good +1, terrible -3, etc.). The algorithm matches the words from the lexicon to the words in the text; each matched word gets its score, and the total valence of the text is the sum or the average of those scores.  
The second category is automated systems, which are based on machine learning. The task is usually modeled as a classification problem where we need to predict the sentiment of a new piece of text using some historical data with known sentiment.
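
The lexicon-based idea can be sketched in a few lines of plain Python. This is only a minimal illustration: the `LEXICON` dictionary and the `valence` helper are hypothetical names, and the scores are the illustrative ones mentioned above, not values from a real sentiment dictionary.

```python
# Hypothetical valence lexicon; the scores are illustrative only
LEXICON = {"nice": 2, "good": 1, "terrible": -3}

def valence(text):
    """Sum the lexicon scores of all words in the text (unknown words score 0)."""
    words = text.lower().split()
    # Strip surrounding punctuation so that "good," still matches "good"
    words = [w.strip(".,!?;:\"'") for w in words]
    return sum(LEXICON.get(w, 0) for w in words)

score_positive = valence("The plot was good, the acting was nice.")
score_negative = valence("A terrible movie.")
print(score_positive)  # 3
print(score_negative)  # -3
```

Averaging instead of summing would only require dividing by the number of matched words.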

We can calculate the valence score of a text using the textblob library. A TextBlob object is like a Python string, which has obtained some natural language processing skills. We can call different attributes of the TextBlob object, like `.sentiment`.  
The sentiment attribute returns a tuple. Its first element is the polarity, measured on a scale from -1.0 to 1.0, where -1.0 is very negative, 0 is neutral and +1.0 is very positive. The second element is the subjectivity, measured from 0.0 to 1.0, where 0.0 is very objective and 1.0 is very subjective. A text with a polarity above 0 and a subjectivity close to 1.0 is therefore rather positive and subjective.

### Which method should one use? 
A machine learning sentiment analysis relies on having labeled historical data, whereas lexicon-based methods rely on having manually created rules or dictionaries. Lexicon-based methods fail at certain tasks because the polarity of words might change with the problem, which will not be reflected in a predefined dictionary. However, lexicon-based approaches can be quite fast, whereas machine learning models might take a while to train. At the same time, machine learning models can be quite powerful. So, the jury is still out on that one. Many people find that a hybrid approach tends to work best in many, usually complex, scenarios.

### Detecting a sentiment with TextBlob

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object  
blob_two_cities = TextBlob(two_cities)

# Print out the sentiment 
print(blob_two_cities.sentiment)

### Comparing sentiments

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object 
blob_annak = TextBlob(annak)
blob_catcher = TextBlob(catcher)

# Print out the sentiment   
print('Sentiment of annak: ', blob_annak.sentiment)
print('Sentiment of catcher: ', blob_catcher.sentiment)

### Sentiment of movie review

In [None]:
# Import the required packages
from textblob import TextBlob

# Create a textblob object  
blob_titanic = TextBlob(titanic)

# Print out its sentiment  
print(blob_titanic.sentiment)

# Word clouds

A word cloud is an image composed of words with different sizes and colors. They can be especially useful in sentiment analysis. Have you ever wondered how such an image is generated? In this video, we will learn how to create a word cloud in Python.

Word clouds (also called tag clouds) are used across different contexts. In the most common type of word clouds - and the one we will be using in this course - the size of the text corresponds to the frequency of the word. The more frequent a word is, the bigger and bolder it will appear on the word cloud.

Remember how we found the longest movie review? This word cloud is generated using only the words in one of the longest reviews. Which movie do you guess the review is talking about? I think we can agree it is about the Titanic!

Why are word clouds so popular? First of all, they can reveal the essential. We saw in our word cloud, the word Titanic really popped out. Second, unless told otherwise, they will plot all the words in a text, and a quick scan of the image can provide an overall sense of the text. Last but not least, they are easy to understand and quite fun. However, they have their drawbacks. Sometimes they tend to work less well. All the words plotted on the cloud might seem unrelated and it could be difficult to draw a conclusion based on a crowded word cloud. Secondly, if the text we work with is large, a word cloud might require quite a lot of preprocessing steps before it appears sensible and uncluttered.

Now let's create a word cloud in Python. To do that, we can use the WordCloud function from the wordcloud package. We will have to import matplotlib.pyplot as well, which will allow wordcloud to plot on its base. Let's define a string, called two_cities, which captures the first sentence of Dickens' A Tale of Two Cities. Note how the text carries many emotionally charged words.

After we have imported the package, we build the cloud by calling the WordCloud function, followed by the generate method, which takes as argument the text, in our case - the two_cities string. The WordCloud function has many arguments. We will not cover all of them here but if you want to learn what they are, just type ?WordCloud in the Shell. You can change things such as the background color, the size and font of the words, their scaling and others. One interesting argument you can specify is the stopwords, which will remove words such as 'the', 'and', 'to', 'from', and so on. We will cover what stopwords are in detail in a later video. The result cloud_two_cities is a wordcloud object.

If we want to display the generated word cloud object, we need to use some matplotlib functionality. We call plt.imshow(), specifying our cloud_two_cities as an argument. We also specify the interpolation to be bilinear. All this does is make the image appear smoother. We then specify that we don't want the image to display the x and y axes, and finally call the show() function. The imshow() function has created the figure but we need to call show() to display it. We see the word cloud we generated on this piece of text. Which words pop out the most?

### Word cloud example

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate the word cloud from the east_of_eden string
cloud_east_of_eden = WordCloud(background_color="white").generate(east_of_eden)

# Create a figure of the generated cloud
plt.imshow(cloud_east_of_eden, interpolation='bilinear')  
plt.axis('off')
# Display the figure
plt.show()

### Word Cloud on movie review

In [None]:
# Import the word cloud function  
from wordcloud import WordCloud

# Create and generate a word cloud image 
my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(descriptions)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")

# Don't forget to show the final image
plt.show()

# Numeric features

### Bag-of-words
The first step in performing a sentiment analysis task is transforming our text data to numeric form. This is required as a machine learning model cannot work with the text data directly, but rather with numeric features we create from the data.

We start with the "bag-of-words" approach, used to describe the occurrence, or frequency, of words within a document, or a collection of documents (called a corpus). It basically comes down to building a vocabulary of all the words occurring in the document and keeping track of their frequencies.

Before we continue with the discussion of BOW, we will introduce the data we will use throughout the chapter, namely reviews of Amazon products. The dataset consists of two columns: the first contains the score, which is 1 if positive and 0 if negative; the second column contains the actual review of the product.

Let’s see how BOW would work applied to an example review. Imagine you have the following string: "This is the best book ever. I loved the book and highly recommend it." The goal of a BOW approach would be to build the following dictionary-like output: 'This', occurs once in our string, so it has a count of 1, 'is' occurs once, 'the' occurs two times and so on. One thing to note is that we lose the word order and grammar rules, that’s why this approach is called a ‘bag’ of words, resembling dropping a bunch of items in a bag and losing any sense of their order. This sounds straightforward but sometimes deciding how to build the vocabulary can be complex. We discuss some of the trade-offs we need to consider in later chapters.
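
To make the dictionary-like output concrete, here is a minimal sketch that counts the words of the example string with the standard library. It lowercases the text and keeps only runs of letters as tokens, which is a simplifying assumption and not necessarily how a full tokenizer behaves.

```python
import re
from collections import Counter

text = "This is the best book ever. I loved the book and highly recommend it."

# Lowercase the text and keep runs of letters as tokens (a simplification)
tokens = re.findall(r"[a-z]+", text.lower())

# Counter maps every distinct token to the number of times it occurs
bow = Counter(tokens)
print(bow["this"], bow["is"], bow["the"], bow["book"])  # 1 1 2 2
```

Note that `bow` keeps no information about where in the text a word appeared, only how often.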

When we transform the text column with a BOW, the end result looks something like a table in which each column is a word (also called a token), each row is a review, and each cell holds how many times that word occurs in the respective review.

How do we execute a BOW process in Python? The simplest way is to use the CountVectorizer from the sklearn.feature_extraction.text submodule. For the moment we leave the default options of the CountVectorizer, except for the max_features argument, which keeps only the features with the highest term frequency; with max_features=1000, for example, it will pick the 1000 most frequent words across the corpus of reviews. We need to do that sometimes for memory's sake. We call the `fit()` method of the CountVectorizer on our text column. To create a BOW representation, we then call the `transform()` method, applied again to our text column.

The result is a sparse matrix. A sparse matrix only stores entries that are non-zero; its rows correspond to the rows in the dataset, and its columns to the BOW vocabulary.
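
As a small illustration of what "sparse" means here, the sketch below fits a CountVectorizer on a toy corpus (the three strings in `docs` are made up for this example) and compares the number of stored entries with the full size of the matrix.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = ["good book", "bad book", "good good story"]

vect = CountVectorizer()
vect.fit(docs)
X = vect.transform(docs)

print(issparse(X))  # True
print(X.shape)      # (3, 4): 3 documents, 4 vocabulary words
print(X.nnz)        # 6 stored non-zero entries out of 12 cells
```

On a real corpus with thousands of vocabulary words, the share of non-zero entries is typically far smaller, which is why the sparse format saves memory.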

To look at the actual contents of a sparse matrix, we need to perform an additional step to transform it back to a 'dense' NumPy array, using the `.toarray()` method. We can build a pandas DataFrame from the array, where the columns' names are obtained from the `.get_feature_names_out()` method of the vectorizer (called `.get_feature_names()` in scikit-learn versions before 1.0). This returns an array where every entry corresponds to one feature.

That was our introduction to BOW! Let's apply what we've learned in the exercises.

In [None]:
# Example of small BoW
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

### Build a BoW from reviews

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

### Granularity with n-grams

You might remember from an earlier video that with a bag-of-words approach the word order is discarded.

Imagine you have a sentence such as 'I am happy, not sad' and another one 'I am sad, not happy'. They will have the same representation with a BOW, even though the meanings are inverted. In this case, putting NOT in front of the word (which is also called negation) changes the whole meaning and demonstrates why context is important.
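
A quick check with the standard library confirms this. The `bag_of_words` helper below is a hypothetical stand-in for a BOW transformation, using a simplified letters-only tokenization.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word occurrences, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

bow_a = bag_of_words("I am happy, not sad")
bow_b = bag_of_words("I am sad, not happy")

# The two sentences have opposite meanings but identical word counts
print(bow_a == bow_b)  # True
```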


There is a way to capture the context when using a BOW by, for example, considering pairs or triples of tokens that appear next to each other. Let's define some terms. Single tokens are what we used so far and are also called 'unigrams'. Bigrams are pairs of tokens, trigrams are triples of tokens, and a sequence of n tokens is called an 'n-gram'.


Let's illustrate that with an example. Take the sentence 'The weather today is wonderful' and split it using unigrams, bigrams and trigrams. With unigrams we have single tokens, with bigrams pairs of neighboring tokens, and with trigrams triples of neighboring tokens.
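
The split can be reproduced with a short helper. The `ngrams` function is a hypothetical name introduced for this sketch, and whitespace splitting stands in for a proper tokenizer.

```python
def ngrams(tokens, n):
    """Return all sequences of n neighboring tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('The', 'weather'), ('weather', 'today'), ('today', 'is'), ('is', 'wonderful')]
print(trigrams)
# [('The', 'weather', 'today'), ('weather', 'today', 'is'), ('today', 'is', 'wonderful')]
```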


It is easy to implement n-grams with the CountVectorizer. To specify the n-grams, we use the ngram_range parameter. The ngram_range is a tuple where the first element is the minimum and the second element is the maximum length of the token sequences. For instance, ngram_range=(1, 1) means we will use only unigrams, (1, 2) means we will use unigrams and bigrams, and so on.


It's not easy to determine the optimal sequence length for your problem. If we use longer token sequences, this will result in more features. In principle, the number of possible bigrams is the number of unigrams squared, the number of possible trigrams is the number of unigrams to the power of 3, and so forth. In general, longer sequences capture more context and can result in more precise machine learning models, but they also increase the risk of overfitting. An approach to find the optimal sequence length would be to try different lengths in something like a grid search and see which results in the best model.
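
To get a feeling for how quickly the number of features grows, the sketch below fits a CountVectorizer with increasing maximum n-gram length on a toy corpus (the two sentences are invented for this example). It is not a grid search, only an illustration of the vocabulary size.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration
docs = ["the weather today is wonderful",
        "the weather yesterday was awful"]

sizes = []
for max_n in (1, 2, 3):
    # Use all n-grams from length 1 up to max_n
    vect = CountVectorizer(ngram_range=(1, max_n))
    vect.fit(docs)
    sizes.append(len(vect.vocabulary_))

print(sizes)  # [8, 15, 21]
```

Even on two short sentences the vocabulary more than doubles when bigrams and trigrams are added; on a real corpus the growth is much more pronounced.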

Determining the length of the token sequence is not the only way to control the size of the vocabulary. There are a few parameters in the CountVectorizer that can also do that. You might remember we set the max_features parameter. The max_features parameter tells the CountVectorizer to keep only the given number of most frequent tokens in the corpus. If it is set to None, all the words in the corpus will be included. So this parameter can remove rare words, which depending on the context may or may not be a good idea. Another parameter you can specify is max_df. If given, it tells the CountVectorizer to ignore terms that appear in more documents than the given threshold. We can specify it as an integer, which will be an absolute count of documents, or as a float, which will be a proportion of documents. The default value of max_df is 1.0, which means it does not ignore any terms. Very similar to max_df is min_df. It is used to remove terms that appear in too few documents. It again can be specified either as an integer, in which case it will be a count, or a float, in which case it will be a proportion. The default value is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
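
The effect of these parameters is easiest to see on a tiny corpus. In the sketch below, the four strings in `docs` and the `vocabulary` helper are made up for illustration; the helper simply fits a CountVectorizer with the given options and returns the resulting vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: 'movie' appears in 3 documents, 'great' in 2, all other words in 1
docs = [
    "great movie great cast",
    "boring movie",
    "great soundtrack",
    "movie was long",
]

def vocabulary(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    vect = CountVectorizer(**kwargs)
    vect.fit(docs)
    return sorted(vect.vocabulary_)

print(vocabulary())                      # all 7 distinct words
print(vocabulary(min_df=2))              # ['great', 'movie']
print(vocabulary(max_df=0.5))            # every word except 'movie'
print(vocabulary(min_df=2, max_df=0.5))  # ['great']
print(vocabulary(max_features=2))        # ['great', 'movie']
```

Here `max_df=0.5` means "ignore terms that appear in more than half of the documents", so 'movie' (3 of 4 documents) is dropped while 'great' (2 of 4) is kept.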

In [None]:
# Getting unigrams and bigrams of text
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify token sequence and fit
vect = CountVectorizer(ngram_range=(1,2)) # 1 and 2 means we will get unigrams and bigrams
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

In [None]:
# Size of vocabulary
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify size of vocabulary, min-max number of documents, and fit
vect = CountVectorizer(max_features=100, # limit the size of the vocabulary to 100
                       max_df=200, # ignore terms that appear in more than 200 documents
                       min_df=50) # ignore terms that appear in fewer than 50 documents
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

In [None]:
# BoW with n-grams and vocabulary size
#Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)

# Transform the review
X_review = vect.transform(reviews.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

### Build new features from text

When we have a sentiment analysis task to be solved with machine learning, having extra features usually results in a better model.

Some very predictive features say something about the complexity of the text column. For example, one could measure how long each review is, how many sentences it contains, or say something about the parts of speech involved, punctuation marks, etc.

We employed a BOW approach to transform each review to numeric features, counting how many times a word occurred in the respective review. Here, we stop one step earlier and only split the reviews into individual words (usually called tokens, though a token can be a whole sentence as well). We will work with the **nltk** package, and concretely the `word_tokenize()` function. Let's apply the word_tokenize function to our familiar anna_k string. The returned result is a list, where each item is a token from the string. Note that not only words but also punctuation marks are returned as tokens. The same would have been the case with digits, if we had any in our string.

Now we want to apply the same logic but to our column of reviews. One fast way to iterate over strings is by using list comprehension. A quick reminder on list comprehensions. They are like flattened-out for loops. The syntax is an operation we perform on each item in an iterable object (such as a list). In our case, a list comprehension will allow us to iterate over the review column, tokenizing every review. The result is going to be a list; if we explore the type of the first item, for example, we see it is also of type list. This means that our word_tokens is a list of lists. Each item stores the tokens from a single review.

Now that we have our word_tokens list, we only need to count how many tokens there are in each item of word_tokens. We start by creating an empty list, to which we will append the length of each review as we iterate over the word_tokens list. In the first line of the for loop, we find the number of items in the word_tokens list using the len() function. Since we want to iterate over this number, we need to wrap the len() call in the range() function. In the second line, we find the length of each item, and append that number to our empty list len_tokens. Lastly, we create a new feature for the length of each review.

Note that we did not address punctuation but you can exclude it if it suits your context better. You can even create a new feature that measures the number of punctuation signs. In our context, a review with more punctuation signs could signal a very emotionally charged opinion. It's also good to know that we can follow the same logic and create a feature that counts the number of sentences, where one token will be equal to a sentence and not to a single word.
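
A punctuation-count feature of this kind could be sketched as follows. The `sample_reviews` list is made up for illustration, and "punctuation" is taken to mean the ASCII characters in `string.punctuation`, which is an assumption rather than the only reasonable definition.

```python
import string

# Hypothetical reviews, invented for illustration
sample_reviews = ["Great product!!!", "It works.", "Terrible... never again!"]

# For each review, count the characters that are punctuation marks
n_punct = [sum(ch in string.punctuation for ch in review)
           for review in sample_reviews]

print(n_punct)  # [3, 1, 4]
```

With a DataFrame, the resulting list can be assigned to a new column in the same way as the length feature.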

If we check what the product reviews dataset looks like, we see the new column we created (named 'n_words' in the code below). It shows the number of tokens in each review.

In [None]:
# Tokenize a string from GoT
# Import the required function
from nltk import word_tokenize

# Transform the GoT string to word tokens
print(word_tokenize(GoT))

In [None]:
# Word tokens from the Avengers script
# Import the word tokenizing function
from nltk import word_tokenize

# Tokenize each item in the avengers 
tokens_avengers = [word_tokenize(item) for item in avengers]

print(tokens_avengers)

In [None]:
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column 
word_tokens = [word_tokenize(review) for review in reviews.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])

In [None]:
# Create an empty list to store the length of reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens

### Guessing a Language

Often in real applications not all documents carrying sentiment will be in English. We might want to detect what language is used or build specific features related to language.

In Python there are a few libraries that can detect the language of a string. In this course, we will use langdetect because it is one of the better performing packages. But you can follow the same structure using another package. We first import the detect_langs function from the langdetect package. Now imagine we have a string called foreign, which is a sentence in another language. Our goal is to identify its language. We apply the detect_langs function to our string. This function will return a list. Each item of the list contains a pair of a language and a number saying how likely it is that the string is in this particular language. In this case, we observe only 1 item in the list, namely Spanish. That's because the function is fairly certain the language is Spanish. In other cases we might get longer lists, where the most likely candidate languages will appear first, followed by less likely ones.

In real applications, we usually work not with a single string but with many strings, often contained in a column of a dataset. A common problem is to detect the language of each of the strings and capture the most likely language in a new column. How to do that? We again start by importing the detect_langs function from the langdetect package. We import our familiar dataset with product reviews.

The steps we follow next are quite similar to our approach when capturing the length of a review. First, we create an empty list, called languages. We want to iterate over the rows of our dataset using a for loop. In the first line of the loop, we apply the len() function to our dataset, which returns the number of rows. We still need to call the range() function since we want to iterate over the number of rows. In the second line of the loop, we apply the detect_langs function on the review column of the dataset, which is the second column in our case, while selecting one row at a time. We want to store each detected language as an item in a list, therefore we append the result of detect_langs to the empty list languages. When we print languages, we see that it is a list of lists, where each element contains the detected language of the respective row and how likely that language is. In some cases, the individual lists contain more than one item.

We have one more step before we create our language feature. We saw that languages is a list of lists. We want to extract the first element of each list within languages since the first item is always the most likely language. One fast way to do that is by list comprehension. Let's break down the command in steps. For example, let's take the first element of languages, convert it to a string, and split it on a colon sign. After that, we extract the first element of the resulting split, returning '[es'. Finally, since there is a left bracket before the language, we select everything from the 2nd character onwards, resulting in 'es' for Spanish.

To write the list comprehension, we put these steps together by iterating over each item in our list of lists. Lastly, we assign the cleaned list to a new feature, called language.
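
The individual cleaning steps can be followed on plain strings. In this sketch, `detected` and `detected_all` are hypothetical string forms of detection results, written out by hand so that the example runs without the langdetect package; the probabilities are invented.

```python
# Hypothetical string form of one detection result (probability invented)
detected = "[es:0.9999957]"

parts = detected.split(":")   # ['[es', '0.9999957]']
first = parts[0]              # '[es'
language = first[1:]          # 'es' (drop the leading bracket)
print(language)               # es

# The same steps combined in a list comprehension over several results
detected_all = ["[es:0.9999957]", "[fr:0.8571405, it:0.1428560]", "[en:0.9999981]"]
languages_clean = [item.split(":")[0][1:] for item in detected_all]
print(languages_clean)        # ['es', 'fr', 'en']
```

Because the split happens on the first colon, only the most likely language is kept even when a result lists several candidates.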

In [None]:
# Detect a language in one sentence
# Import the language detection function and package
from langdetect import detect_langs

# Detect the language of the foreign string
print(detect_langs(foreign))

In [None]:
# Detect languages in multiple sentences
from langdetect import detect_langs
languages = []

# Loop over the sentences in the list and detect their language
for sentence in sentences:
    languages.append(detect_langs(sentence))
    
print('The detected languages are: ', languages)

In [None]:
from langdetect import detect_langs
languages = [] 

# Loop over the rows of the dataset and append  
for row in range(len(non_english_reviews)):
    languages.append(detect_langs(non_english_reviews.iloc[row, 1]))

# Clean the list by splitting     
languages = [str(lang).split(':')[0][1:] for lang in languages]

# Assign the list to a new feature 
non_english_reviews['language'] = languages

print(non_english_reviews.head())

# Stop words

In every language, there are words that occur too frequently and are not very informative. Sometimes, it is useful to get rid of them before we build a machine learning model.

Words that occur too frequently and are not very informative are called stop words. But how do we know which words are not informative? In every language, there is a set of words that most practitioners agree are not useful and should be removed when performing a natural language processing task. For instance, in English the definite and indefinite articles ('the', 'a', 'an'), conjunctions ('and', 'but', 'for'), prepositions ('on', 'in', 'at'), etc. are stop words. Secondly, depending on the context, we might want to expand the standard set of stop words. For example, in the movie reviews dataset, we might want to exclude words such as 'film', 'movie', 'cinema', etc.

Maybe you recall from a previous video that we built word clouds using movie reviews. Here is an example of two word clouds using the movie reviews. In the picture on the left, the stop words have not been removed. Words that pop up are 'film' and 'br', which is an indication for a line break. In the cloud on the right side, stop words have been removed and now we see words such as 'character', 'see', 'good', 'story'.

How do we remove stop words when creating a word cloud? Let's start by reviewing how we built a word cloud. First, we import the WordCloud function from wordcloud. We also import the default list of STOPWORDS from wordcloud. To create our list of stop words, we can take a set of the default list. A set is like a list but with unique, not repeating items. We can update the set of stop words by calling update and providing a list to it. We pass our list of stopwords, called my_stopwords to the stopwords argument in the WordCloud function. Then we display it. So, the only new argument we added here is defining the list of stop words. Everything else stays the same.

Removing non-informative words when we are building a BOW transformation can also be very useful. This can easily be incorporated in the CountVectorizer. First, we need to import the list of default English stop words from the same feature_extraction.text module of scikit-learn. Let's assume we want to enrich this default list with movie-specific words. To do that, we call the union method on the default list. Remember that a union of two sets A and B consists of all elements of A and all elements of B such that no elements are repeated. In our case, the union will add the new words to the list of default stop words, if that word is not already there. To use the constructed set, we specify the stop_words argument in the CountVectorizer to be equal to our defined set (converted to a list, which recent scikit-learn versions require). Everything else stays the same and should look pretty familiar by now. One important thing to note is that using stop words will reduce the size of the vocabulary we built using a BOW or another approach.
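
The two ways of extending a stop word set behave differently, which is easy to overlook. The sketch below uses a tiny hand-made `default_stop_words` set as a stand-in for the default lists of wordcloud or scikit-learn, so it runs without either package.

```python
# Small stand-in for a default stop word list (the real lists are much longer)
default_stop_words = frozenset({"the", "and", "a"})

# union() returns a NEW set and leaves the original untouched
my_stop_words = default_stop_words.union(["film", "movie", "the"])
print(sorted(my_stop_words))   # ['a', 'and', 'film', 'movie', 'the']

# update() changes a set IN PLACE and returns None,
# so its result should not be assigned to a variable
mutable_stop_words = set(default_stop_words)
result = mutable_stop_words.update(["film", "movie"])
print(result)                      # None
print(sorted(mutable_stop_words))  # ['a', 'and', 'film', 'movie', 'the']
```

In both cases 'the' appears only once in the final set, since sets never hold repeated items.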

### Word Cloud of Tweets

In [None]:
# Import the word cloud function 
from wordcloud import WordCloud, STOPWORDS 

# Create and generate a word cloud image
my_cloud = WordCloud(background_color='white').generate(text_tweet)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")

# Don't forget to show the final image
plt.show()

In [None]:
# Import the word cloud function and stop words list
from wordcloud import WordCloud, STOPWORDS 

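# Define and update the list of stopwords
# (update() modifies the set in place and returns None, so copy first, then update)
my_stop_words = set(STOPWORDS)
my_stop_words.update(['airline', 'airplane'])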

# Create and generate a word cloud image
my_cloud = WordCloud(stopwords=my_stop_words).generate(text_tweet)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")
# Don't forget to show the final image
plt.show()

### Airline sentiment with stop words

In [None]:
# Import the stop words
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Define the stop words
my_stop_words = ENGLISH_STOP_WORDS.union(['airline', 'airlines', '@'])

# Build and fit the vectorizer (recent scikit-learn versions expect stop_words as a list)
vect = CountVectorizer(stop_words=list(my_stop_words))
vect.fit(tweets.text)

# Create the bow representation
X_review = vect.transform(tweets.text)
# Create the data frame
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

### Multiple Text columns

In [None]:
# Import the vectorizer and default English stop words list
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Define the stop words
my_stop_words = ENGLISH_STOP_WORDS.union(['airline', 'airlines', '@', 'am', 'pm'])
 
# Build and fit the vectorizers
vect1 = CountVectorizer(stop_words=list(my_stop_words))
vect2 = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
vect1.fit(tweets.text)
vect2.fit(tweets.negative_reason)

# Print the last 15 features from the first, and all from second vectorizer
print(vect1.get_feature_names_out()[-15:])
print(vect2.get_feature_names_out())

# Capturing a token pattern

You may have noticed while working with the airline sentiment data from Twitter that the text contains many digits and other characters. Sometimes we may want to exclude them from our numeric representation.

If we work with a string, how can we make sure we extract only certain characters? There are a few useful functionalities we will review here. We can use string methods, such as .isalpha(), which returns True if a string is composed only of letters and False otherwise; .isdigit() returns True if a string is composed only of digits; and finally .isalnum() returns True if a string is composed only of alphanumeric characters, i.e. letters and digits.
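
A short check on a few hand-picked strings (chosen only for illustration) shows how the three methods differ:

```python
# Hand-picked example tokens
examples = ["hello", "2023", "abc123", "don't"]

# Map each token to its (isalpha, isdigit, isalnum) results
checks = {token: (token.isalpha(), token.isdigit(), token.isalnum())
          for token in examples}

for token, result in checks.items():
    print(token, result)
# hello (True, False, True)
# 2023 (False, True, True)
# abc123 (False, False, True)
# don't (False, False, False)
```

The apostrophe in "don't" is neither a letter nor a digit, so all three methods return False for it.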

These string methods can improve some of the features we created earlier. As a reminder, in a previous video we used a list comprehension to iterate over each review of the product reviews dataset and create word tokens from each review. We can adjust our original code. If we want to retain only tokens consisting of letters, for example, we can use the .isalpha() method in a second list comprehension. Since the result of the first list comprehension is a list of lists, we first need to iterate over the items in each inner list, filtering out those tokens that do not consist solely of letters. This is what happens in the first part of the list comprehension, enclosed in the inner brackets. In the second part, we iterate over the lists, basically saying that we want to perform this filtering across all lists in the word_tokens list. When we compare the lengths of the first items of the word_tokens and cleaned_tokens lists, we see that the filtering decreased the number of tokens, as we might expect.

Regular expressions are a standard way to extract certain characters from a string. Python has a built-in package, called re, which allows you to work with regular expressions. We will not cover regular expressions in depth here, but here is a quick reminder of the syntax. We import the re package. Then imagine we have the string #Wonderfulday and we want to extract a hash (#) followed by any letter, capital or small. One standard way to do this is to call the search function, passing it the regular expression and our string. In our case, the pattern starts with a # and is followed by either an upper- or lower-case letter. When we print the result, we see that it is a match object showing where the match is located (in our case the span is (0, 2), i.e. the match is two characters long) and the exact characters that were matched.
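
The search described above can be sketched as follows. The variable names are illustrative only.

```python
import re

my_string = "#Wonderfulday"

# Look for a hash followed by a single upper- or lower-case letter
match = re.search(r"#[A-Za-z]", my_string)

print(match)          # <re.Match object; span=(0, 2), match='#W'>
print(match.span())   # start and end positions of the match
print(match.group())  # the matched characters
```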

Our familiar CountVectorizer takes a regular expression as an argument. The default pattern matches words that consist of at least two letters or digits (\w) and that are delimited by word boundaries (\b). It will ignore single-character words, and will split words such as 'don't' and 'haven't'. If we are fine with this default pattern, we don't need to change any arguments in the CountVectorizer. If we want to change it, we can specify the token_pattern argument. If we want the vectorizer to ignore digits and other characters and only consider words of two or more letters, we can use the token pattern r'\b[^\d\W][^\d\W]+\b'. There are multiple ways to specify this, so the one given here is not the only correct or best way to accomplish it. Feel free to experiment. Note, however, that we need to add an 'r' before the regular expression itself, so that Python treats it as a raw string and leaves the backslashes untouched.
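
To see the effect of token_pattern without the airline data, here is a small sketch on two made-up sentences (an assumption for illustration). It reads the learned tokens from the fitted vectorizer's vocabulary_ attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents containing digits and mixed tokens
docs = ["Flight 123 was delayed 2 hours", "Gate B7 changed at 10pm"]

# Default pattern: tokens of two or more word characters (letters, digits, underscore)
default_vect = CountVectorizer().fit(docs)
default_vocab = sorted(default_vect.vocabulary_)

# Custom pattern: tokens of two or more letters, no digits
letters_vect = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(docs)
letters_vocab = sorted(letters_vect.vocabulary_)

print('Default vocabulary: ', default_vocab)
print('Letters-only vocabulary: ', letters_vocab)
```

With the custom pattern, tokens such as '123', 'b7' and '10pm' no longer appear, because a word boundary cannot fall between a digit and a letter.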

### Specifying the token pattern with regex

In [None]:
# Build and fit the vectorizer
vect = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(tweets.text)
vect.transform(tweets.text)
print('Length of vectorizer: ', len(vect.get_feature_names_out()))

In [None]:
# Build the first vectorizer
vect1 = CountVectorizer().fit(tweets.text)
vect1.transform(tweets.text)

# Build the second vectorizer
vect2 = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(tweets.text)
vect2.transform(tweets.text)

# Print out the length of each vectorizer
print('Length of vectorizer 1: ', len(vect1.get_feature_names_out()))
print('Length of vectorizer 2: ', len(vect2.get_feature_names_out()))

### String operators with the Twitter data

In [None]:
# Example 1
# Import the word tokenizing package
from nltk import word_tokenize

# Tokenize the text column
word_tokens = [word_tokenize(review) for review in tweets.text]
print('Original tokens: ', word_tokens[0])

# Filter out non-letter characters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
print('Cleaned tokens: ', cleaned_tokens[0])

In [None]:
# Example 2
# Create a list of lists, containing the tokens from tweets_list
tokens = [word_tokenize(item) for item in tweets_list]

# Retain only tokens made up of letters
letters = [[word for word in item if word.isalpha()] for item in tokens]
# Retain only tokens made up of letters and digits
let_digits = [[word for word in item if word.isalnum()] for item in tokens]
# Retain only tokens made up of digits
digits = [[word for word in item if word.isdigit()] for item in tokens]

# Print the last item in each list
print('Last item in alphabetic list: ', letters[-1])
print('Last item in list of alphanumerics: ', let_digits[-1])
print('Last item in the list of digits: ', digits[-1])

# Stemming and lemmatization

In a language, words are often derived from other words, meaning words can share the same root. When we create a numeric transformation of a text feature, we might want to strip a word down to its root. This is the topic of this lesson.

This process is called stemming. More formally, stemming can be defined as the transformation of words to their root forms, even if the stem itself is not a valid word in the language. For example, staying, stays, stayed will be mapped to the root 'stay', and house, houses, housing will be mapped to the root 'hous'. In general, stemming will tend to chop off suffixes such as '-ed', '-ing', '-er', as well as plural or possessive forms.

Lemmatization is quite similar to stemming; the main difference is that with lemmatization the resulting roots are valid words in the language. Going back to our examples of words derived from 'stay', lemmatization reduces them to 'stay'; and words derived from 'house' are reduced to the noun 'house'.

You might wonder when to use stemming and when to use lemmatization. The main difference is in the obtained roots. With lemmatization they are actual words and with stemming they might not be. So if in your problem it's important to retain words, not only roots, lemmatization would be more suitable. However, if you use nltk - which is what we will use in this course - stemming follows a rule-based algorithm, which makes it faster than the lemmatization process in nltk. Furthermore, lemmatization depends on knowing the part of speech of the word you want to lemmatize, for example whether it is a noun, a verb, an adjective, etc.

One popular stemmer is the PorterStemmer in the nltk.stem package. The PorterStemmer is not the only stemmer in nltk but it's quite fast and easy to use, so it's often a standard choice. We create a PorterStemmer object and store it under the name porter. We can then call porter.stem on a string, for example, 'wonderful'. The result is 'wonder'.
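
A minimal sketch of the PorterStemmer on the example words mentioned so far. The word list is assembled here for illustration.

```python
from nltk.stem import PorterStemmer

# Create the stemmer object
porter = PorterStemmer()

# Example words from the discussion of stemming
words = ["staying", "stays", "stayed", "house", "houses", "housing", "wonderful"]

# Map each word to its stem
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(word, '->', stem)
```

The stems of the 'house' family come out as 'hous', which is not a valid English word. This is the behaviour that distinguishes stemming from lemmatization.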

Stemming is possible for other languages as well, such as Danish, Dutch, French, Spanish, German, etc. To use foreign language stemmers we need to use the SnowballStemmer class. When creating the stemmer, we specify the language we want to use. Then we apply the stem function on our string. For example, we have imported a Dutch stemmer and fed it a Dutch verb. The result is the root of the verb.
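
The foreign-language case could look like the sketch below. The Dutch verb 'beginnen' (to begin) is an assumed example, since the verb used in the video is not given in the text.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages supported by the SnowballStemmer
print(SnowballStemmer.languages)

# Create a Dutch stemmer
dutch_stemmer = SnowballStemmer("dutch")

# Stem a Dutch verb (hypothetical example word)
dutch_stem = dutch_stemmer.stem("beginnen")
print(dutch_stem)
```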

If you apply the PorterStemmer to a whole sentence, the result is essentially the original sentence. Apart from lowercasing, which nltk's stemmer applies by default, nothing has changed about our 'Today is a wonderful day!' sentence. We need to stem each word in the sentence separately. Therefore, as a first step, we need to transform the sentence into tokens using the familiar word_tokenize function. In the second step, we apply the stemming function on each word of the sentence, using a list comprehension.
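
The two steps can be sketched as follows. To keep the example runnable without downloading extra nltk data, it uses wordpunct_tokenize, a regex-based tokenizer from nltk.tokenize, in place of word_tokenize. That swap is an assumption made for this sketch only.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

porter = PorterStemmer()
sentence = "Today is a wonderful day!"

# Stemming the whole sentence treats it as one long "word"
print(porter.stem(sentence))

# Step 1: split the sentence into tokens
sentence_tokens = wordpunct_tokenize(sentence)

# Step 2: stem each token separately
stemmed_sentence = [porter.stem(token) for token in sentence_tokens]
print(stemmed_sentence)
```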

The lemmatization of strings is similar to stemming. We import the WordNetLemmatizer from the nltk.stem module. It uses the WordNet database to look up lemmas of words. We create a WordNetLemmatizer object and store it under the name WNlemmatizer. We can then call WNlemmatizer.lemmatize() on 'wonderful'. Note that we have specified a part of speech, given by the 'pos' argument. The default pos is noun, or 'n'. Here we specify an adjective, which is why pos = 'a'. The result is 'wonderful'. Recall that stemming returned 'wonder' as a result.

### Stems and lemmas from Game of Thrones

In [None]:
# Import the required packages from nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize

porter = PorterStemmer()
WNlemmatizer = WordNetLemmatizer()

# Tokenize the GoT string
tokens = word_tokenize(GoT)

In [None]:
# Stems
import time

# Log the start time
start_time = time.time()

# Build a stemmed list
stemmed_tokens = [porter.stem(token) for token in tokens] 

# Log the end time
end_time = time.time()

print('Time taken for stemming in seconds: ', end_time - start_time)
print('Stemmed tokens: ', stemmed_tokens) 

In [None]:
# Lemmas
import time

# Log the start time
start_time = time.time()

# Build a lemmatized list
lem_tokens = [WNlemmatizer.lemmatize(token) for token in tokens]

# Log the end time
end_time = time.time()

print('Time taken for lemmatizing in seconds: ', end_time - start_time)
print('Lemmatized tokens: ', lem_tokens) 

### Stem Spanish reviews

In [None]:
# Import the language detection package
import langdetect

# Loop over the rows of the dataset and append the detected languages
languages = []
for i in range(len(non_english_reviews)):
    languages.append(langdetect.detect_langs(non_english_reviews.iloc[i, 1]))

# Clean the list: keep only the language code, e.g. '[es:0.99]' -> 'es'
languages = [str(lang).split(':')[0][1:] for lang in languages]
# Assign the list to a new feature 
non_english_reviews['language'] = languages

# Select the Spanish ones
filtered_reviews = non_english_reviews[non_english_reviews.language == 'es']

In [None]:
# Import the required packages
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

# Import the Spanish SnowballStemmer
SpanishStemmer = SnowballStemmer("spanish")

# Create a list of tokens
tokens = [word_tokenize(review) for review in filtered_reviews.review] 
# Stem the list of tokens
stemmed_tokens = [[SpanishStemmer.stem(word) for word in token] for token in tokens]

# Print the first item of the stemmed tokens
print(stemmed_tokens[0])

### Stems from tweets

In [None]:
# Import the function to perform stemming
from nltk.stem import PorterStemmer
from nltk import word_tokenize

# Call the stemmer
porter = PorterStemmer()

# Transform the array of tweets to tokens
tokens = [word_tokenize(tweet) for tweet in tweets]
# Stem the list of tokens
stemmed_tokens = [[porter.stem(word) for word in tweet] for tweet in tokens] 
# Print the first element of the list
print(stemmed_tokens[0])

# Term frequency - inverse document frequency (Tf-Idf)

We have extensively worked with a BOW and applied it using a CountVectorizer in Python. As powerful as BOW can be, sometimes we might want to try slightly more sophisticated approaches. In this video we will talk about one of them, an approach called TfIdf - term frequency inverse document frequency.

The term frequency tells us how often a given word appears within a document in the corpus. Each word in a document has its own term frequency. The inverse document frequency is commonly defined as the log-ratio between the total number of documents and the number of documents that contain a specific word. What inverse document frequency means is that rare words will have a high inverse document frequency.

When we multiply the tf and the idf scores, we obtain the TfIdf score of a word in a document. With BOW, words could have different frequency counts across documents but we did not account for the length of a document, whereas the TfIdf score of a word incorporates the length of a document. TfIdf will also highlight words that are more interesting, i.e. words that are common in a document but not across all documents. However, note that interesting does not have to relate to a positive or a negative review. It is purely an unsupervised approach.
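
To make the arithmetic concrete, here is a hand-rolled sketch of the textbook definition on three tiny, made-up tokenized documents. The helper names tf, idf and tfidf are hypothetical. Note that sklearn's TfidfVectorizer by default uses a smoothed idf and normalizes each row, so its numbers will differ from this plain version.

```python
import math

# Three hypothetical tokenized documents
docs = [
    ["great", "flight", "great", "crew"],
    ["late", "flight"],
    ["lost", "bag", "late", "flight"],
]

def tf(term, doc):
    # Share of the document's tokens that are this term
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log-ratio of the number of documents to the number containing the term
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'flight' occurs in every document, so its idf (and TfIdf) is zero
print('flight: ', tfidf("flight", docs[0], docs))
# 'great' is frequent in the first document and absent elsewhere
print('great: ', tfidf("great", docs[0], docs))
print('crew: ', tfidf("crew", docs[0], docs))
```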

In our Twitter sentiment analysis, names of airline companies such as United and Virgin America are likely to have low TfIdf scores since they occur many times and across many documents, i.e. tweets. If a tweet talks a lot about the check-in service of a company and there are not many other tweets discussing the topic, words in this tweet are likely to have a high TfIdf score. Note that since TfIdf penalizes frequent words, there is less of a need to explicitly define stop words. We can still remove stop words, of course, to restrict the size of our vocabulary. Even though TfIdf is relatively simple, it is quite commonly used in information retrieval and search engines as a way to rank the relevance of the returned results.

In Python, you can apply TfIdf by importing the TfidfVectorizer from sklearn.feature_extraction.text. The TfidfVectorizer is similar to the CountVectorizer, and so are the arguments it takes. We can define the maximum number of features with max_features, the type of n-grams to use with ngram_range, and we can also set the stop_words, token_pattern, max_df and min_df arguments. We fit the TfidfVectorizer to the text column of the tweets dataset. Then we transform it, the same way we did with the CountVectorizer.

The TfidfVectorizer also returns a sparse matrix. If you recall, a sparse matrix is a matrix with mostly zero values, stored in a format that keeps only the non-zero values. We need to transform the sparse matrix to an array and specify the feature names, using the same syntax as with the CountVectorizer. Inspecting the top 5 rows of the newly created dataset, we see that the output is quite similar to a BOW. Each column is a feature and each row contains the TfIdf score of the feature in a given tweet. The values are floating-point numbers, and many of them are zero.

### Tf-Idf example

In [None]:
# Import the required function
from sklearn.feature_extraction.text import TfidfVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Call the vectorizer and fit it
anna_vect = TfidfVectorizer().fit(annak)

# Create the tfidf representation
anna_tfidf = anna_vect.transform(annak)

# Print the result 
print(anna_tfidf.toarray())

### Tf-Idf on Twitter airline sentiment data

In [None]:
# Import the required vectorizer package and stop words list
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Define the vectorizer and specify the arguments
my_pattern = r'\b[^\d\W][^\d\W]+\b'
vect = TfidfVectorizer(ngram_range=(1, 2), max_features=100, token_pattern=my_pattern, stop_words=list(ENGLISH_STOP_WORDS)).fit(tweets.text)

# Transform the vectorizer
X_txt = vect.transform(tweets.text)

# Transform to a data frame and specify the column names
X = pd.DataFrame(X_txt.toarray(), columns=vect.get_feature_names_out())
print('Top 5 rows of the DataFrame: ', X.head())

In [None]:
# Example with Tf-Idf and BoW
# Import the required packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Build a BOW and tfidf vectorizers from the review column and with max of 100 features
vect1 = CountVectorizer(max_features=100).fit(reviews.review)
vect2 = TfidfVectorizer(max_features=100).fit(reviews.review) 

# Transform the vectorizers
X1 = vect1.transform(reviews.review)
X2 = vect2.transform(reviews.review)
# Create DataFrames from the vectorizers 
X_df1 = pd.DataFrame(X1.toarray(), columns=vect1.get_feature_names_out())
X_df2 = pd.DataFrame(X2.toarray(), columns=vect2.get_feature_names_out())
print('Top 5 rows using BOW: \n', X_df1.head())
print('Top 5 rows using tfidf: \n', X_df2.head())

# Predict a sentiment

Imagine we are working with the product reviews. A supervised learning task will try to classify any new review as either positive or negative based on already labeled reviews. This is what we call a classification problem. In the case of the product and movie reviews, we have two classes - positive and negative. This is a binary classification problem. The airline sentiment Twitter data has three categories of sentiment: positive, neutral and negative. This is a multi-class classification problem.

One algorithm commonly applied in classification tasks is logistic regression. You might be familiar with linear regression, where we fit a straight line to approximate a relationship. With logistic regression, instead of fitting a line, we are fitting an S-shaped curve, called a sigmoid function. A property of this function is that for any value of x, y will be between 0 and 1.
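
The sigmoid function itself is short enough to write out. The sketch below uses NumPy and a few arbitrary input values.

```python
import numpy as np

def sigmoid(x):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1 / (1 + np.exp(-x))

# A few arbitrary inputs, from strongly negative to strongly positive
x_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y_values = sigmoid(x_values)

print(y_values)
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5.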

When performing linear regression, we are predicting a numeric outcome (say the sale price of a house). With logistic regression, we estimate the probability that the outcome (sentiment) belongs to a particular category (positive or negative) given the review. Since we are estimating a probability and want an output between 0 and 1, we model the X values using the sigmoid/logistic function. For more details on logistic regression, refer to other courses on DataCamp.

In Python, we import LogisticRegression from the sklearn.linear_model module. Keep in mind that the sklearn API works only with numeric features. It also requires either a DataFrame or an array as arguments and cannot handle missing data. Therefore, all transformation of the data needs to be completed beforehand. We call LogisticRegression to create a logistic classifier object. We fit it by specifying the X matrix, which is a NumPy array of our features or a pandas DataFrame, and the vector of targets y.

How do we know if the model is any good? We look at the discrepancy between the predicted label and what was the real label for each instance (observation) in our dataset. One common metric to use is the accuracy score. Though not appropriate in all contexts, it is still useful. Accuracy gives us the fraction of predictions that our model got right. The higher and closer it is to 1, the better. One way we can calculate the accuracy score of a logistic regression model is by calling the score method on the logistic regression object. It takes as arguments the X matrix and y vector.

Alternatively, we can use the accuracy_score function from sklearn.metrics. There is an accuracy_score function apart from the score method because different models have different default score metrics. Thus, the accuracy_score function always returns the accuracy but the score method might return other metrics if we use it to evaluate other models. Here, we need to explicitly calculate the predictions of the model, by calling predict on the matrix of features. The accuracy_score function takes as arguments the vector of true labels and the predicted labels. We see that in the case of logistic regression, both score and accuracy_score return a value of 0.9009.
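
Both ways of computing the accuracy can be compared on synthetic data. The dataset below is generated with make_classification purely for illustration and has nothing to do with the reviews data, so the accuracy value will differ from the one quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (assumption: stands in for numeric text features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit the classifier
demo_log_reg = LogisticRegression().fit(X_demo, y_demo)

# Option 1: the score method of the fitted model
acc_from_score = demo_log_reg.score(X_demo, y_demo)

# Option 2: explicit predictions passed to accuracy_score
demo_predictions = demo_log_reg.predict(X_demo)
acc_from_function = accuracy_score(y_demo, demo_predictions)

print('Accuracy via score: ', acc_from_score)
print('Accuracy via accuracy_score: ', acc_from_function)
```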

Can we trust such high accuracy? We should be careful in making strong conclusions just yet. In the next video, we will see how to check how robust the model performance is but before that, let's solve some exercises!

### Logistic regression of movie reviews

In [None]:
# Import the logistic regression
from sklearn.linear_model import LogisticRegression

# Define the vector of targets and matrix of features
y = movies.label
X = movies.drop('label', axis=1)

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X, y))

### Logistic regression with Twitter data

In [None]:
# Import the accuracy score
from sklearn.metrics import accuracy_score

# Define the vector of targets and matrix of features
y = tweets.airline_sentiment
X = tweets.drop('airline_sentiment', axis=1)

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X,y))

# Create an array of prediction
y_predict = log_reg.predict(X)

# Print the accuracy using accuracy score
print('Accuracy of logistic regression: ',  accuracy_score(y, y_predict))

# Did we predict the sentiment well? 

In the previous video, we used all of the available data to build a logistic regression model and assess its accuracy. However, we want to make sure our machine learning model generalizes and performs well on unseen data. How do we do that?

To get an idea of how well a model will perform on unseen data, we randomly split the dataset into 2 parts: one used for training (building the model) and one for testing (evaluating the performance of the model). In some cases, when we want to tune the parameters of our algorithm, we might have 3 sets: training, testing and validation, but this is out of scope for our course. The training set is usually around 70 or 80% of the whole dataset, and the rest is used for testing.

In Python, we can perform a random train-test split using the train_test_split function from the sklearn.model_selection module. It accepts arrays, lists, or DataFrames as arguments. The outputs of train_test_split are the X train and test matrices and the y train and test vectors. The first arguments we provide to the function are the features matrix X and the labels vector y. We can specify the proportion of the data going to testing with test_size; here, it is equal to 0.2. Another parameter is random_state, which is the seed of the random number generator used to make the split. Fixing it ensures that every time you perform the train-test split on the same data, you will get the same instances in each set. We can also specify the stratify argument: if we want to ensure that the train and test sets have similar proportions of each class, we set stratify equal to y.
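
The effect of stratify is easiest to see on an imbalanced toy dataset. The arrays below are made up for this sketch: 80 observations of class 0 and 20 of class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 observations with one feature, 20% of them in class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# Stratified 80/20 split with a fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print('Train size: ', len(X_tr), ' Test size: ', len(X_te))
print('Share of class 1 in train: ', y_tr.mean())
print('Share of class 1 in test: ', y_te.mean())
```

With stratify=y_toy, both parts keep the 20% share of class 1. Without it, the share in the smaller test set could drift away from 20% by chance.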

Let's revisit our logistic regression example, executed after a train-test split. We create the LogisticRegression object and fit it on the training set. We can calculate the accuracy on the training data, calling score on the logistic regression with arguments X_train and y_train. We can also calculate the accuracy score of the model using the test set - X_test and y_test. It is slightly lower than the accuracy on the training data, which is usually the case.

You may recall that another way to calculate the accuracy was to use the accuracy_score function from sklearn.metrics. After we have built the logistic regression model, we apply predict to the logistic regression, specifying X_test as an argument. In the last step, we call accuracy_score on the true and predicted labels. The value is identical to the accuracy produced by the score method.

Accuracy is a useful measure of a model's performance but it's not always the most informative. We can instead use something called a 'confusion matrix'. It cross-tabulates the true and predicted classes: in sklearn, each row corresponds to a true class and each column to a predicted class. A confusion matrix lets us read off from each cell how many observations of each class we have predicted correctly and how many we have confused with another class. For more details of when we would want to optimize for the different cells, refer to other DataCamp courses.

In Python, we import confusion_matrix from the sklearn.metrics module. After we have built our logistic regression and predicted the test set labels, we call confusion_matrix, giving as arguments the true and predicted labels. We divide the matrix by the length of the test label vector in order to obtain proportions in the cells of the matrix.
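
A tiny worked example, with hand-written labels invented for this sketch, shows how the cells relate to the accuracy.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for 8 observations
true_labels = [0, 0, 0, 1, 1, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm_counts = confusion_matrix(true_labels, pred_labels)
print(cm_counts)
# [[2 1]
#  [1 4]]

# Divide by the number of observations to get proportions
cm_props = cm_counts / len(true_labels)
print(cm_props)

# The diagonal holds the correct predictions, so it sums to the accuracy
print(accuracy_score(true_labels, pred_labels))
```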

### Build and assess a model

In [None]:
# Import the required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Define the vector of labels and matrix of features
y = movies.label
X = movies.drop('label', axis=1)

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a logistic regression model and print out the accuracy
log_reg = LogisticRegression().fit(X_train, y_train)
print('Accuracy on train set: ', log_reg.score(X_train, y_train))
print('Accuracy on test set: ', log_reg.score(X_test, y_test))

### Performance metrics of Twitter data

In [None]:
# Import the performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Make predictions on the test set
y_predicted = log_reg.predict(X_test)

# Print the performance metrics
print('Accuracy score test set: ', accuracy_score(y_test, y_predicted))
print('Confusion matrix test set: \n', confusion_matrix(y_test, y_predicted)/len(y_test))

### Assess model

In [None]:
# Import the accuracy and confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the labels 
y_predict = log_reg.predict(X_test)

# Print the performance metrics
print('Accuracy score of test data: ', accuracy_score(y_test, y_predict))
print('Confusion matrix of test data: \n', confusion_matrix(y_test, y_predict)/len(y_test))

# Logistic regression: revisited

Before we built a logistic regression using text features, we transformed the text fields to numeric columns. As a result, we might end up having hundreds or even thousands of features, which can make the model quite complex.

A complex model can occur in a few scenarios. If we use a very complicated function to explain the relationship of interest, we will inevitably fit the noise in the data. Such a model will not perform well when used to score unseen data. This is also called overfitting. A complex model could stem from including too many unnecessary features and parameters; especially with transformed text data, where we might create thousands of extra numeric columns. These two sources of complexity often go hand-in-hand. One way to artificially discourage complex models is by the use of regularization. When using regularization, we are penalizing, or restricting the function of the model.

Regularization is applied by default in the LogisticRegression class from sklearn. It uses the so-called L2 penalty; the details of it are outside of the scope of this course, but intuitively it's good to know that the L2 penalty shrinks all the coefficients towards zero, effectively reducing the impact of each feature. The strength of regularization is controlled by the parameter C, which takes a default value of 1. Higher values of C correspond to less regularization; in other words, the model will try to fit the training data as closely as possible. Small values of C correspond to stronger penalization (or regularization), meaning that the coefficients of the logistic regression will be closer to zero; the model will be less flexible because it will not fit the training data so well. How do we find the most appropriate value of C? Usually we need to test different values and see which one gives us the best performance on the test data.
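
The shrinking effect of C can be checked on synthetic data. The dataset and the two C values below are arbitrary choices for this sketch, and max_iter is raised only to help the solver converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with 20 numeric features (assumption: stands in for text features)
X_reg, y_reg = make_classification(n_samples=300, n_features=20, random_state=0)

# Large C: weak regularization
weak_reg = LogisticRegression(C=100, max_iter=1000).fit(X_reg, y_reg)
# Small C: strong regularization
strong_reg = LogisticRegression(C=0.01, max_iter=1000).fit(X_reg, y_reg)

# Compare the overall size of the coefficient vectors
weak_norm = np.linalg.norm(weak_reg.coef_)
strong_norm = np.linalg.norm(strong_reg.coef_)

print('Coefficient norm with C=100: ', weak_norm)
print('Coefficient norm with C=0.01: ', strong_norm)
```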

You should recall that when we trained a logistic regression model, we applied the predict function to the test set to predict the labels. The predict function predicts a class: 0 or 1 if we are working with a binary classifier. However, instead of a class, we can predict a probability using the predict_proba function. We again pass as an argument the test dataset.

This returns an array of probabilities, ordered by the label of classes - first the class 0 then the class 1. The probabilities for each observation are displayed on a separate row. The first value is the probability that the instance is of class 0, and the second that it is of class 1. Therefore, it is common when predicting the probabilities to extract only the probabilities of class 1 right away.

One important thing to know is that we cannot directly apply the accuracy score or confusion matrix to the predicted probabilities. If you do that in sklearn, you will get a ValueError. The reason is that the accuracy and the confusion matrix work directly with classes. If we have predicted probabilities, we need to encode them as classes. The default is that any probability higher than or equal to 0.5 is translated to class 1, otherwise to class 0. However, you can change that threshold depending on your problem. Imagine only 1% of the reviews are positive and you have built a model to predict whether a new review is positive or negative. In that context, you don't want to require a predicted probability higher than 0.5 before assigning class 1; the threshold should be much lower.
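
Converting probabilities to classes with a custom threshold could look like the sketch below. The imbalanced synthetic dataset and the threshold of 0.2 are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% of observations in class 1
X_imb, y_imb = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Each row holds [P(class 0), P(class 1)] and sums to 1
proba = clf.predict_proba(X_te)
prob_class1 = proba[:, 1]

# Encode probabilities as classes using two different thresholds
labels_default = (prob_class1 >= 0.5).astype(int)
labels_low = (prob_class1 >= 0.2).astype(int)

print('Predicted positives at 0.5: ', labels_default.sum())
print('Predicted positives at 0.2: ', labels_low.sum())
print('Accuracy at 0.5: ', accuracy_score(y_te, labels_default))
print('Accuracy at 0.2: ', accuracy_score(y_te, labels_low))
```

Lowering the threshold can only increase the number of observations assigned to class 1. Whether that trade-off is worthwhile depends on the problem.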

### Predict probabilities of movie reviews

In [None]:
# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict the probability of the 0 class
prob_0 = log_reg.predict_proba(X_test)[:, 0]
# Predict the probability of the 1 class
prob_1 = log_reg.predict_proba(X_test)[:, 1]

print("First 10 predicted probabilities of class 0: ", prob_0[:10])
print("First 10 predicted probabilities of class 1: ", prob_1[:10])

### Product review with regularization

In [None]:
# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Train a logistic regression with C=1000 (weak regularization)
log_reg1 = LogisticRegression(C=1000).fit(X_train, y_train)
# Train a logistic regression with C=0.001 (strong regularization)
log_reg2 = LogisticRegression(C=0.001).fit(X_train, y_train)

# Print the accuracies
print('Accuracy of model 1: ', log_reg1.score(X_test, y_test))
print('Accuracy of model 2: ', log_reg2.score(X_test, y_test))

### Regularization models with Twitter data

In [None]:
# Build a logistic regression with regularization parameter of 100
log_reg1 = LogisticRegression(C=100).fit(X_train, y_train)
# Build a logistic regression with regularization parameter of 0.1
log_reg2 = LogisticRegression(C=0.1).fit(X_train, y_train)

# Predict the labels for each model
y_predict1 = log_reg1.predict(X_test)
y_predict2 = log_reg2.predict(X_test)

# Print performance metrics for each model
print('Accuracy of model 1: ', accuracy_score(y_test, y_predict1))
print('Accuracy of model 2: ', accuracy_score(y_test, y_predict2))
print('Confusion matrix of model 1: \n' , confusion_matrix(y_test, y_predict1)/len(y_test))
print('Confusion matrix of model 2: \n', confusion_matrix(y_test, y_predict2)/len(y_test))

# Summary

Welcome back! In this video we will put together all the steps we have applied in this course on sentiment analysis. I find myself applying these same steps in my work as a data scientist.

We defined sentiment analysis as the process of understanding the opinion of an author about a subject. Throughout the course we worked with examples of movie and Amazon product reviews, Twitter airline sentiment data, and different emotionally charged literary examples. We went through various steps to transform the text column, which contained the review, to numeric features. We finished our analysis by training a logistic regression model and predicting the sentiment of a new review based on the words in the text. Let's go through these steps in more detail.

We started by exploring the review column in the movie reviews dataset. We found the shortest and longest reviews. We also plotted word clouds from the movie reviews, which allowed us to quickly see which words are mentioned most frequently in positive or negative reviews. Furthermore, we created features for the length of a review in terms of number of words and number of sentences, and we learned how to detect the language of a document.

We continued with numeric transformations of the review features. We transformed the text using a bag-of-words approach and a Tfidf vectorizer. The bag-of-words created features corresponding to the frequency count of a word in a respective review or tweet (also called document in NLP problems). The term frequency-inverse document frequency approach is similar to the bag-of-words but it accounts for how frequently a word occurs in a document with respect to the rest of the documents. So, we can capture 'important' words, whereas words that occur frequently across documents have a lower tfidf score. We used the CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text to construct each of the vectors. As a reminder of the syntax, we created the vectorizer, fit it to the text column in our data, and then used it to transform that column.

There are many arguments we specified in the vectorizers. We dealt with stop words: frequently occurring, non-informative words. We had a video on n-grams, which allowed us to use phrases of different lengths instead of single words. We learned how to limit the size of the vocabulary by setting any of several parameters: max_features (the maximum number of features), and max_df and min_df (which tell the vectorizer to ignore terms that appear in more or fewer documents than the specified threshold, respectively). We could capture only certain characters using the token_pattern argument. Last but not least, we learned about lemmas and stems and practiced lemmatizing and stemming tokens and strings. We could adjust all these arguments, with the exception of lemmas and stems, in both the CountVectorizer and the TfidfVectorizer.
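As a rough illustration of how these arguments shrink the vocabulary, the sketch below compares a default CountVectorizer with a restricted one. The four short documents are invented for this example, and the specific argument values are arbitrary choices rather than recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, made up for illustration
docs = [
    "I loved this phone in 2019",
    "I loved the camera",
    "the battery died after 2 days",
    "the screen is great",
]

# Default settings: every token of two or more characters is a feature
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Restricted settings
vect = CountVectorizer(
    stop_words="english",                   # drop common English stop words
    ngram_range=(1, 2),                     # single words and two-word phrases
    min_df=2,                               # keep terms found in at least 2 documents
    token_pattern=r"\b[^\d\W][^\d\W]+\b",   # tokens without digits, 2+ characters
)
small_vocab = sorted(vect.fit(docs).vocabulary_)

# max_features keeps only the most frequent terms across the corpus
top_vocab = sorted(CountVectorizer(max_features=3).fit(docs).vocabulary_)

print(len(default_vocab), default_vocab)
print(len(small_vocab), small_vocab)
print(len(top_vocab), top_vocab)
```

The same arguments can be passed to TfidfVectorizer, since both vectorizers share this part of their interface.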

In the final step, we used a logistic regression to train a classifier predicting the sentiment. We evaluated the performance of the model using metrics such as accuracy score and a confusion matrix. Since the goal is for our model to perform well on unseen data, we randomly split the data into a training and testing set; we used the training set to build the model and the test set to evaluate its performance.
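The final step can be sketched end to end on a toy dataset. The texts and labels below are invented for illustration (1 = positive, 0 = negative), and with so few examples the reported accuracy is not meaningful; the point is only the order of the calls. In this sketch the vectorizer is fit on the training texts only, which is one common way to keep the test set unseen.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1 = positive, 0 = negative
texts = [
    "great product, works perfectly",
    "absolutely love it",
    "excellent quality and fast delivery",
    "very happy with this purchase",
    "works great, highly recommend",
    "best purchase I have made",
    "terrible product, broke after a day",
    "absolutely hate it",
    "poor quality and slow delivery",
    "very unhappy with this purchase",
    "does not work, do not recommend",
    "worst purchase I have made",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Randomly split into training and test sets, keeping the class balance
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=456, stratify=labels
)

# Fit the vectorizer on the training texts, then transform both sets
vect = TfidfVectorizer().fit(X_train_txt)
X_train = vect.transform(X_train_txt)
X_test = vect.transform(X_test_txt)

# Train the classifier and predict labels for the test set
log_reg = LogisticRegression().fit(X_train, y_train)
y_predicted = log_reg.predict(X_test)

# Evaluate: accuracy and confusion matrix expressed as proportions
acc = accuracy_score(y_test, y_predicted)
cm = confusion_matrix(y_test, y_predicted, labels=[0, 1]) / len(y_test)
print('Accuracy on the test set: ', acc)
print(cm)

# Predict the sentiment of a new, unseen review
new_review = ["this works great"]
print(log_reg.predict(vect.transform(new_review)))
```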

These are all very valuable skills and essential in performing a sentiment analysis task. Let's perform some of these steps in the exercises.

In [None]:
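# Step 1: word cloud and feature creation
# Import the plotting and word cloud libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Create and generate a word cloud image
# (positive_reviews is assumed to be one string containing all positive reviews)
cloud_positives = WordCloud(background_color='white').generate(positive_reviews)

# Display the generated wordcloud image
plt.imshow(cloud_positives, interpolation='bilinear')
plt.axis("off")

# Don't forget to show the final image
plt.show()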

In [None]:
# Import the word tokenizer (requires the NLTK 'punkt' tokenizer data)
from nltk import word_tokenize

# Tokenize each item in the review column
word_tokens = [word_tokenize(review) for review in reviews.review]

# Determine the number of tokens in each review
len_tokens = [len(tokens) for tokens in word_tokens]

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens

In [None]:
# Step 2: Building a vectorizer
# Import pandas, the TfidfVectorizer and default list of English stop words
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Build the vectorizer (recent scikit-learn versions expect stop_words as a list)
vect = TfidfVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS),
    ngram_range=(1, 2),
    max_features=200,
    token_pattern=r'\b[^\d\W][^\d\W]+\b',
).fit(reviews.review)

# Create sparse matrix from the vectorizer
X = vect.transform(reviews.review)

# Create a DataFrame (get_feature_names_out replaces the removed get_feature_names)
reviews_transformed = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
print('Top 5 rows of the DataFrame: \n', reviews_transformed.head())

In [None]:
# Step 3: Building a classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Define X and y
# (assumes reviews_transformed also contains the 'score' label column)
y = reviews_transformed.score
X = reviews_transformed.drop('score', axis=1)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=456)

# Train a logistic regression
log_reg = LogisticRegression().fit(X_train, y_train)
# Predict the labels
y_predicted = log_reg.predict(X_test)

# Print accuracy score and confusion matrix (as proportions) on test set
print('Accuracy on the test set: ', accuracy_score(y_test, y_predicted))
print(confusion_matrix(y_test, y_predicted)/len(y_test))