```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§2 Sentiment Analysis in Python

§2.2 Numeric features from reviews
```

# Build new features from the text

## What is the goal of building new features from the text?

* Enrich the existing dataset with features related to the text column (capturing the sentiment).

## What could the features be from the review column?

* How long is each review?

* How many sentences does it contain?

* What parts of speech are involved?

* How many punctuation marks?
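
A couple of these questions can be answered with the standard library alone. The sketch below is only an illustration: `review_features` and `sample_review` are hypothetical names, not part of the course material. The sentence split is a rough regular-expression heuristic, and the word count is a plain whitespace split rather than NLTK tokenization.

```python
import re

def review_features(text):
    """Return a few simple numeric features for one review string."""
    # Rough heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    return {
        'n_chars': len(text),               # length in characters
        'n_words_approx': len(text.split()),  # whitespace-separated chunks
        'n_sentences': len(sentences),      # heuristic sentence count
    }

# Hypothetical sample review, used only for illustration
sample_review = "Amazing! This is my favorite. Would buy again."
features = review_features(sample_review)
print(features)
```

The heuristic miscounts abbreviations such as "e.g." or "Dr.", so a dedicated sentence tokenizer is likely the better choice on real reviews.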

## Code for tokenizing a string:

In [1]:
from nltk import word_tokenize

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

['Happy',
 'families',
 'are',
 'all',
 'alike',
 ',',
 'every',
 'unhappy',
 'family',
 'is',
 'unhappy',
 'in',
 'its',
 'own',
 'way',
 '.']

## Code for tokenizing a column:

```
# General form of list comprehension
[expression for item in iterable]
```
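
To make the general form concrete, here is a minimal sketch that maps each part of the pattern onto a tiny example. `sample_reviews` is a hypothetical list, and `str.split` stands in for `word_tokenize` only so that the snippet runs without NLTK.

```python
# Hypothetical mini "column" of review texts
sample_reviews = ["Amazing soundtrack", "The best ever"]

# expression -> text.split()
# item       -> text
# iterable   -> sample_reviews
tokens = [text.split() for text in sample_reviews]

# The same result written as an explicit loop
tokens_loop = []
for text in sample_reviews:
    tokens_loop.append(text.split())

print(tokens)
print(tokens == tokens_loop)
```

The result is a list of lists, one inner list per review, which is the same shape that tokenizing a DataFrame column produces.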

In [2]:
import pandas as pd

reviews = pd.read_csv('ref2. Amazon product reviews sample.csv')[[
    'score', 'review'
]]
reviews.head()

Unnamed: 0,score,review
0,1,Stuning even for the non-gamer: This sound tr...
1,1,The best soundtrack ever to anything.: I'm re...
2,1,Amazing!: This soundtrack is my favorite musi...
3,1,Excellent Soundtrack: I truly like this sound...
4,1,"Remember, Pull Your Jaw Off The Floor After H..."


In [3]:
word_tokens = [word_tokenize(review) for review in reviews.review]
type(word_tokens)

list

In [4]:
type(word_tokens[0])

list

In [5]:
len_tokens = []

# Iterate over the word_tokens list
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens

reviews.head()

Unnamed: 0,score,review,n_tokens
0,1,Stuning even for the non-gamer: This sound tr...,87
1,1,The best soundtrack ever to anything.: I'm re...,109
2,1,Amazing!: This soundtrack is my favorite musi...,165
3,1,Excellent Soundtrack: I truly like this sound...,145
4,1,"Remember, Pull Your Jaw Off The Floor After H...",109


## How to deal with punctuation?

* The token counts above include punctuation marks, because `word_tokenize` returns each mark as a separate token. Punctuation can be excluded from the count when only words should be measured.

* A feature that counts punctuation marks is still worth building, because a review with many punctuation marks could signal a very emotionally charged opinion.
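
Both ideas can be sketched on the token list produced earlier for the `anna_k` string. The helper `is_punct` is a hypothetical name, and treating a token as punctuation when all of its characters are in `string.punctuation` is an assumption that only covers ASCII punctuation.

```python
import string

# Tokens copied from the word_tokenize(anna_k) output above
tokens = ['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy',
          'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.']

def is_punct(token):
    """True when the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Feature 1: number of punctuation tokens
n_punct = sum(is_punct(tok) for tok in tokens)

# Feature 2: review length with punctuation excluded
words_only = [tok for tok in tokens if not is_punct(tok)]
n_words_no_punct = len(words_only)

print(n_punct, n_words_no_punct)
```

Under this rule an emoticon token such as `'^_^'` also counts as punctuation, which may or may not be desirable for a sentiment feature.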

## Practice exercises for building new features from the text:

$\blacktriangleright$ **Data pre-loading:**

In [6]:
GoT = 'Never forget what you are, for surely the world will not. \
Make it your strength. Then it can never be your weakness. \
Armour yourself in it, and it will never be used to hurt you.'

$\blacktriangleright$ **String tokenization practice:**

In [7]:
# Import the required function
from nltk import word_tokenize

# Transform the GoT string to word tokens
print(word_tokenize(GoT))

['Never', 'forget', 'what', 'you', 'are', ',', 'for', 'surely', 'the', 'world', 'will', 'not', '.', 'Make', 'it', 'your', 'strength', '.', 'Then', 'it', 'can', 'never', 'be', 'your', 'weakness', '.', 'Armour', 'yourself', 'in', 'it', ',', 'and', 'it', 'will', 'never', 'be', 'used', 'to', 'hurt', 'you', '.']


$\blacktriangleright$ **Data pre-loading:**

In [8]:
avengers = [
    "Cause if we can't protect the Earth, you can be d*** sure we'll avenge it",
    'There was an idea to bring together a group of remarkable people, \
    to see if we could become something more',
    "These guys come from legend, Captain. They're basically Gods."
]

$\blacktriangleright$ **Word tokens practice:**

In [9]:
# Import the word tokenizing function
from nltk import word_tokenize

# Tokenize each item in the avengers
tokens_avengers = [word_tokenize(item) for item in avengers]

print(tokens_avengers)

[['Cause', 'if', 'we', 'ca', "n't", 'protect', 'the', 'Earth', ',', 'you', 'can', 'be', 'd', '*', '*', '*', 'sure', 'we', "'ll", 'avenge', 'it'], ['There', 'was', 'an', 'idea', 'to', 'bring', 'together', 'a', 'group', 'of', 'remarkable', 'people', ',', 'to', 'see', 'if', 'we', 'could', 'become', 'something', 'more'], ['These', 'guys', 'come', 'from', 'legend', ',', 'Captain', '.', 'They', "'re", 'basically', 'Gods', '.']]


$\blacktriangleright$ **Package pre-loading:**

In [10]:
import pandas as pd

$\blacktriangleright$ **Data pre-loading:**

In [11]:
reviews_all = pd.read_csv('ref2. Amazon product reviews sample.csv')[[
    'score', 'review'
]]
reviews_all = reviews_all.iloc[:100, ]
reviews = reviews_all.copy()

$\blacktriangleright$ **The length feature practice:**

In [12]:
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column
word_tokens = [word_tokenize(review) for review in reviews.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])

['Stuning', 'even', 'for', 'the', 'non-gamer', ':', 'This', 'sound', 'track', 'was', 'beautiful', '!', 'It', 'paints', 'the', 'senery', 'in', 'your', 'mind', 'so', 'well', 'I', 'would', 'recomend', 'it', 'even', 'to', 'people', 'who', 'hate', 'vid', '.', 'game', 'music', '!', 'I', 'have', 'played', 'the', 'game', 'Chrono', 'Cross', 'but', 'out', 'of', 'all', 'of', 'the', 'games', 'I', 'have', 'ever', 'played', 'it', 'has', 'the', 'best', 'music', '!', 'It', 'backs', 'away', 'from', 'crude', 'keyboarding', 'and', 'takes', 'a', 'fresher', 'step', 'with', 'grate', 'guitars', 'and', 'soulful', 'orchestras', '.', 'It', 'would', 'impress', 'anyone', 'who', 'cares', 'to', 'listen', '!', '^_^']


In [13]:
# Create an empty list to store the length of reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens

## Version checking:

In [14]:
import sys
import nltk

print('The Python version is {}.'.format(sys.version.split()[0]))
print('The NLTK version is {}.'.format(nltk.__version__))
print('The pandas version is {}.'.format(pd.__version__))

The Python version is 3.7.9.
The NLTK version is 3.5.
The pandas version is 1.2.1.
