## Chapter 14: Mining Text and Images

As you may know, data mining is the process of examining datasets for meaningful patterns. Because much of the data on the Web is text or image based, the Web is ripe for data mining. Companies, for example, collect and mine Twitter posts, e-mail messages, and other forms of customer feedback. Further, government organizations make extensive use of image and facial recognition.

In general, text and image mining includes many of the data-mining operations you may have heard of:

    •	Clustering: Grouping related text fragments, documents, or images
    •	Classification: Categorizing text or images
    •	Prediction: Determining the next text in a sequence

Some of the scripts presented in this notebook use several Python libraries, which have been pre-installed for you. If you had to install these libraries on your own, you would issue the following commands:

```python
! pip install --user nltk
! pip install --user gensim
! pip install --user dlib
! pip install --user face_recognition
! pip install --user scikit-learn
! pip install --user matplotlib
! pip install --user textblob
! python -m textblob.download_corpora
! python -m nltk.downloader vader_lexicon
```

# Performing Sentiment Analysis

Sentiment analysis is the process of examining text to determine if it corresponds to negative, neutral, or positive feedback. To perform sentiment analysis, you can take advantage of several existing packages, which are readily available on the Web. 

TextBlob is a widely used text-processing library for Python. It supports a wide range of text-processing operations, including natural-language processing, language translation, and sentiment analysis.

The following Python script, SimpleSentiment.py, uses a Naïve Bayes classifier to determine the sentiment (negative or positive) of several text strings:

In [None]:
######################################
# Chapter 14 (Python) / Deliverable 1
######################################


from textblob.classifiers import NaiveBayesClassifier

trainingData = [
    ("The service was great!", "pos"),
    ("Our waiter was awesome!", "pos"),
    ("They have great appetizers.", "pos"),
    ("Happy hour was busy and fun.", "pos"),
    ("Great place for a quick meal.", "pos"),
    ("Our food took forever to arrive.", "neg"),
    ("The waiter was slow.", "neg"),
    ("The drinks were weak", "neg"),
    ("It was very crowded and noisy!", "neg"),
    ("My pasta was horrible.", "neg"),
    ("The cost was reasonable.", "pos"),
    ("The drinks were cold.", "pos"),
    ("The hostess was ditsy.", "neg"),
]
testingData = [
    ("The wine list was complete.", "pos"),
    ("There was no place to park.", "neg"),
    ("I really liked the bread.", "pos"),
    ("I want to come back!", "pos"),
    ("The food was not that good.", "neg"),
    ("The beer was great!", "pos"),
]
classifier = NaiveBayesClassifier(trainingData)
print("Accuracy: {0}".format(classifier.accuracy(testingData)))

# Classify some statements
print("The food was awesome.", classifier.classify("The food was awesome."))       # "pos"
print("I didn't like my pasta.", classifier.classify("I didn't like my pasta."))   # "neg"

classifier.show_informative_features(10)

The script creates the training and testing datasets by pairing each sentence with its corresponding sentiment: “pos” or “neg.” After training, the script calls the classifier's show_informative_features method to show which words most strongly influence the classifier's decisions.
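
To make the idea behind the classifier more concrete, the following is a minimal from-scratch sketch of Naive Bayes text classification with add-one smoothing. It is not TextBlob's implementation (TextBlob delegates to NLTK and uses word-presence features); the function names `tokenize`, `train`, and `classify` and the four-sentence dataset are assumptions chosen for illustration:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count how often each word appears under each label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start with the prior: how common the label is in the training data
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids taking the log of zero
            score += math.log((word_counts[label][word] + 1) /
                              (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical four-sentence training set
model = train([
    ("The service was great", "pos"),
    ("Our waiter was awesome", "pos"),
    ("The waiter was slow", "neg"),
    ("My pasta was horrible", "neg"),
])

print(classify("The service was great", model))  # pos
print(classify("My pasta was slow", model))      # neg
```

Words that appear under only one label (such as *great* or *slow*) dominate the score, which is the same effect that show_informative_features reports for the TextBlob classifier.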

As you might expect, the more data in the training set, the more accurate your sentiment results tend to be. Because the training dataset here is small, you can experiment with the script by adding positive or negative examples and watching how they directly influence the result. Alternatively, you can test new statements against the existing training data to expose the weaknesses of such a small sample set.

The following Python script, DecisionTreeText.py, uses a decision-tree classifier. To improve the results, several additional records have been added to the training dataset:

In [None]:
from textblob.classifiers import DecisionTreeClassifier
trainingData = [
    ("The service was great!", "pos"),
    ("Our waiter was awesome!", "pos"),
    ("They have great appetizers.", "pos"),
    ("Happy hour was busy and fun.", "pos"),
    ("Great place for a quick meal.", "pos"),
    ("Our food took forever to arrive.", "neg"),
    ("The waiter was slow.", "neg"),
    ("The drinks were weak", "neg"),
    ("It was very crowded and noisy!", "neg"),
    ("My pasta was horrible.", "neg"),
    ("My pasta was yummy.", "pos"),
    ("The cost was reasonable.", "pos"),
    ("The drinks were cold.", "pos"),
    ("The hostess was ditsy.", "neg"),
    ("Very good pasta.", "pos"),
    ("They didn't have dessert.", "neg"),
    ("They didn't want to help us.", "neg")
]
testingData = [
    ("The wine list was complete.", "pos"),
    ("There was no place to park.", "neg"),
    ("I really liked the bread.", "pos"),
    ("I want to come back!", "pos"),
    ("The food was not that good.", "neg"),
    ("The beer was great!", "pos")
]
classifier = DecisionTreeClassifier(trainingData)

# Classify new text
print("The food was awesome.", classifier.classify("The food was awesome."))     # "pos"
print("I didn't like my pasta.", classifier.classify("I didn't like my pasta.")) # "neg"

print("Accuracy: {0}".format(classifier.accuracy(testingData)))

# Display the decision tree that the classifier built
print(classifier.pprint())

# Using the Natural Language Toolkit (NLTK)

One way to move beyond a small hand-built training set is to use the Natural Language Toolkit (NLTK), a suite of libraries and tools for symbolic and statistical natural language processing. One tool within this kit is VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analyzer.

VADER uses a list of lexical features (e.g., words, emoticons) that humans have rated on their degree of polarity (how negative or positive a feature is), and it treats words not found in its list as neutral. By combining this list with a set of grammatical and syntactical rules, VADER can evaluate a large range of English sentiments with impressive accuracy, even accounting for intensifiers such as adverbs of degree and exclamation marks.
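
The lexicon-plus-rules idea can be sketched in a few lines. The following toy scorer is not VADER's code: the lexicon entries, the intensifier list, and the `BOOST` and `ALPHA` values are all made-up assumptions, and the real analyzer applies many more rules (negation, capitalization, punctuation, and so on). It only illustrates how rated words, an intensifier rule, and a normalization step can combine into a single score between -1 and +1:

```python
import math

# Made-up valence ratings for illustration; VADER's real lexicon is far larger
TOY_LEXICON = {"great": 3.0, "good": 2.0, "bad": -2.5, "horrible": -3.0}
TOY_BOOSTERS = {"very", "really", "extremely"}
BOOST = 0.3   # assumed amount an intensifier adds to the next word's valence
ALPHA = 15    # assumed normalization constant

def toy_compound(text):
    """Sum the word valences, then squash the total into the range -1 to +1."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)  # unknown words count as neutral
        # Rule: an intensifier just before a rated word strengthens its valence
        if valence and i > 0 and words[i - 1] in TOY_BOOSTERS:
            valence += BOOST if valence > 0 else -BOOST
        total += valence
    return total / math.sqrt(total * total + ALPHA)

for sentence in ["The food was good.",
                 "The food was very good.",
                 "The pasta was horrible!",
                 "The table was round."]:
    print(sentence, round(toy_compound(sentence), 3))
```

In this sketch, *very good* scores higher than *good*, and a sentence with no rated words scores exactly zero.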

The following Python script, AskSentiment.py, prompts the user to enter a response to a question regarding their meal. The script then prints VADER's scores for the response: the proportions of the text rated positive (pos), neutral (neu), and negative (neg), along with a normalized overall polarity score (compound) that ranges from -1 (most negative) to +1 (most positive):

In [None]:
######################################
# Chapter 14 (Python) / Deliverable 2
######################################

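import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon if it is not already installed
nltk.download('vader_lexicon', quiet=True)

feedback = input("How was your meal? ")

sia = SentimentIntensityAnalyzer()

# polarity_scores returns a dictionary with neg, neu, pos, and compound values
score = sia.polarity_scores(feedback)
print(', '.join('{0}: {1}'.format(name, value) for name, value in score.items()))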
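
The script prints the raw scores but does not turn them into a single label. A commonly used convention from the VADER documentation treats a compound score of 0.05 or higher as positive, -0.05 or lower as negative, and anything in between as neutral. The following sketch applies that convention; the function name `label_sentiment` and the sample dictionary are assumptions for illustration, not output from a real run:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in the range -1 to +1 onto a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical dictionary shaped like the result of polarity_scores()
sample_score = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.6}

print(label_sentiment(sample_score["compound"]))  # positive
print(label_sentiment(-0.3))                      # negative
print(label_sentiment(0.0))                       # neutral
```

You can adjust the threshold if your application needs a wider or narrower neutral band.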

# Clustering Related Text

As you may have learned, clustering groups together related data items. Text clustering is similar, in that the text processor will cluster similar terms or sentences. The following Python script, ClusterText.py, clusters similar text using a K-means clustering algorithm. The number prepended to each array of words corresponds to the index of the cluster to which it belongs:

In [None]:
from gensim.models import Word2Vec
from nltk.cluster import KMeansClusterer
import nltk
import numpy as np 

sentences = [['We', 'should', 'watch', 'a', 'movie'],
            ['Babe', 'Ruth',  'was', 'a', 'great', 'baseball', 'player'],
            ['Lou', 'Gehrig', 'played', 'baseball'],
            ['Do', 'not', 'discuss', 'politics', 'at', 'work'],
            ['Baseball', 'hotdogs', 'Apple', 'Pie', 'and', 'Chevrolet'],
            ['Data', 'mining', 'can', 'use', 'machine', 'learning'],
            ['Clustering', 'uses', 'unsupervised', 'machine', 'learning'],
            ['My', 'company', 'does', 'machine', 'learning'],
            ['Bill', 'Gates', 'was', 'a', 'programmer'],  
            ['The', 'movie', 'was', 'bad']]
  
# Train a Word2Vec model; min_count=1 keeps words that appear only once
model = Word2Vec(sentences, min_count=1)

# Represent each sentence as the average of its word vectors
sentence_vectors = [
    np.mean([model.wv[word] for word in sentence], axis=0)
    for sentence in sentences
]

# Group the sentence vectors into 5 clusters, repeating the randomized clustering 10 times
km = KMeansClusterer(5, nltk.cluster.util.euclidean_distance, repeats=10)
assigned_clusters = km.cluster(sentence_vectors, assign_clusters=True)

for index, sentence in enumerate(sentences):
    print(str(assigned_clusters[index]) + ":" + str(sentence))

Because Word2Vec builds its model from individual words, the script specifies each sentence as a list of words. You could instead specify strings and later parse them into individual words, but for simplicity, the script begins with this step already completed.
Since the K-means clustering algorithm works with numbers, not words, the script uses Word2Vec to convert each word into a numeric vector and then averages the word vectors in each sentence to produce one vector per sentence.
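
The script turns each sentence into a single vector by averaging the vectors of its words. The following sketch shows that step in isolation, using hypothetical three-dimensional word vectors (the values in `toy_vectors` are made up; real Word2Vec vectors have many more dimensions and come from training):

```python
import numpy as np

# Made-up 3-dimensional word vectors for illustration only
toy_vectors = {
    "machine":  np.array([0.9, 0.1, 0.0]),
    "learning": np.array([0.8, 0.2, 0.1]),
    "baseball": np.array([0.0, 0.9, 0.3]),
    "player":   np.array([0.1, 0.8, 0.4]),
}

def sentence_vector(words, vectors):
    """Average the word vectors to get one vector for the whole sentence."""
    return np.mean([vectors[word] for word in words], axis=0)

ml = sentence_vector(["machine", "learning"], toy_vectors)
learning_only = sentence_vector(["learning"], toy_vectors)
sport = sentence_vector(["baseball", "player"], toy_vectors)

# Sentences about the same topic end up closer together (smaller distance)
print("ml vs learning:", np.linalg.norm(ml - learning_only))
print("ml vs sport:   ", np.linalg.norm(ml - sport))
```

K-means then groups the sentences whose vectors lie close to one another, which is why sentences that share words tend to land in the same cluster.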

As you can see, some of the cluster groups are more accurate than others, and because both Word2Vec and K-means start from random values, the groupings can change from run to run. To improve the results, the following script, RevisedCluster.py, edits the sentences to use only keywords (eliminating words such as *was*, *a*, and *do*):

In [None]:
from gensim.models import Word2Vec
from nltk.cluster import KMeansClusterer
import nltk
import numpy as np 

sentences = [['watch', 'movie'],
            ['Ruth', 'baseball', 'player'],
            ['Gehrig', 'played', 'baseball'],
            ['politics',  'work'],
            ['Baseball', 'hotdogs', 'Apple', 'Pie', 'Chevrolet'],
            ['Data', 'mining', 'machine', 'learning'],
            ['Clustering', 'unsupervised', 'machine', 'learning'],
            ['machine', 'learning'],
            ['Gates', 'programmer'],  
            ['movie', 'bad']]

# Train a Word2Vec model; min_count=1 keeps words that appear only once
model = Word2Vec(sentences, min_count=1)

# Represent each sentence as the average of its word vectors
sentence_vectors = [
    np.mean([model.wv[word] for word in sentence], axis=0)
    for sentence in sentences
]

# Group the sentence vectors into 5 clusters, repeating the randomized clustering 10 times
km = KMeansClusterer(5, nltk.cluster.util.euclidean_distance, repeats=10)
assigned_clusters = km.cluster(sentence_vectors, assign_clusters=True)

for index, sentence in enumerate(sentences):
    print(str(assigned_clusters[index]) + ":" + str(sentence))
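
The keyword-only sentences in the script above were edited by hand. As a minimal sketch, you could instead filter out common words automatically. The `STOP_WORDS` set and the `keep_keywords` function below are assumptions for illustration (NLTK also ships a much larger stop-word list), and the filtered result will not necessarily match the hand-edited sentences word for word:

```python
# Small hypothetical stop-word list covering the sample sentences
STOP_WORDS = {"a", "and", "at", "can", "do", "does", "my", "not",
              "should", "the", "use", "uses", "was", "we"}

def keep_keywords(sentence, stop_words=STOP_WORDS):
    """Return the words in the sentence that are not stop words."""
    return [word for word in sentence if word.lower() not in stop_words]

raw_sentences = [['We', 'should', 'watch', 'a', 'movie'],
                 ['Bill', 'Gates', 'was', 'a', 'programmer'],
                 ['The', 'movie', 'was', 'bad']]

for sentence in raw_sentences:
    print(keep_keywords(sentence))
```

The filtered lists can be passed to Word2Vec in place of the original sentences.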

# Creating a Simple Facial Recognition Application

Image mining is the application of data-mining techniques to image data. Image mining includes facial recognition, object identification, image clustering, and image classification. Across the Web, applications make extensive use of image processing for a wide range of operations:

    •    Weather-image analysis
    •    Medical-image analysis
    •    National security
    •    Facial recognition
    •    And more

Facial recognition is a software process that identifies a person (or people) within a photo. The government, for example, makes use of facial recognition for national-security applications, tracking who is entering and exiting the country. Similarly, mobile-phone apps and many computer applications make use of facial recognition to authenticate users.

The following Python script, Recognize.py, leverages the face_recognition and dlib modules to create a simple facial-recognition solution:

In [None]:
######################################
# Chapter 14 (Python) / Deliverable 3
######################################

import os
import face_recognition

PHOTO_DIR = 'photos'

# Load the image to match and convert it into a feature vector (encoding).
# face_encodings returns one encoding per detected face; use the first.
image_to_be_matched = face_recognition.load_image_file('trump.jpg')
image_to_be_matched_encoded = face_recognition.face_encodings(image_to_be_matched)[0]

# Loop through the images, comparing each to the image to match
for image in os.listdir(PHOTO_DIR):
    current_image = face_recognition.load_image_file(os.path.join(PHOTO_DIR, image))
    encodings = face_recognition.face_encodings(current_image)
    if not encodings:
        # Skip photos in which no face was detected
        print("No face found in the image: " + image)
        continue
    result = face_recognition.compare_faces(
        [image_to_be_matched_encoded], encodings[0])
    if result[0]:
        print("Matched the image: " + image)
    else:
        print("Did not match the image: " + image)
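
Under the hood, compare_faces reports a match when the Euclidean distance between two face encodings (128-value vectors) falls within a tolerance, which defaults to 0.6. The following sketch reproduces that comparison with NumPy alone. The function name `is_match` is an assumption, and the vectors are random stand-ins rather than encodings of real photos:

```python
import numpy as np

def is_match(known_encoding, candidate_encoding, tolerance=0.6):
    """Return (matched, distance) for two face-encoding vectors."""
    distance = float(np.linalg.norm(known_encoding - candidate_encoding))
    return distance <= tolerance, distance

# Random stand-ins for 128-value face encodings (not real faces)
rng = np.random.default_rng(0)
known = rng.normal(size=128)
known /= np.linalg.norm(known)

# A slightly perturbed copy plays the role of another photo of the same person
same_person = known + rng.normal(scale=0.01, size=128)

# An unrelated random vector plays the role of a different person
stranger = rng.normal(size=128)
stranger /= np.linalg.norm(stranger)

print(is_match(known, same_person))
print(is_match(known, stranger))
```

Lowering the tolerance makes matching stricter (fewer false matches, more missed matches); raising it has the opposite effect.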

# Handwriting Classification

One of the most commonly used image-classification examples on the Web is the classification of hand-written digits (0-9), often using images from the Modified National Institute of Standards and Technology (MNIST) digits dataset. The MNIST dataset contains 70,000 images (60,000 for training and 10,000 for testing), each a 28x28 grayscale image with pixel values from 0 to 255. The scripts in this section use a smaller digits dataset that is built into scikit-learn, in which each image is stored in an 8x8 matrix (64 attributes) and each attribute value ranges from 0 to 16.

The following script, ShowDigit.py, loads the 1,797-image scikit-learn digits dataset and displays the first digit:

In [None]:
######################################
# Chapter 14 (Python) / Deliverable 4
######################################

from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()

plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

The following Python script, DigitAttributes.py, loads the dataset and displays the attribute values for the first image:

In [None]:
from sklearn import datasets

digits = datasets.load_digits()

print(digits.data[0])

The following Python script, DigitsClassify.py, uses K-nearest-neighbor (KNN) classification to classify the hand-written digits as a number from 0 to 9:

In [None]:
######################################
# Chapter 14 (Python) / Deliverable 5
######################################

from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print('\nModel accuracy score:', accuracy_score(y_test, pred))

# Predict a single hand-written digit; reshape the 1D array of 64 values
# into a 2D array with one row, as predict() expects
pred = knn.predict(digits.data[500].reshape(1, -1))

# Compare the prediction with the actual digit (digits.target[]) at the same index.
# Note: index 500 may have landed in the training set, so this is only a spot check.
print('Predicted:', pred[0], 'Actual:', digits.target[500])
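
To see what the K-nearest-neighbor classifier is doing, the following from-scratch sketch classifies a sample by finding the k closest training points and taking a majority vote among their labels. It is a simplified illustration rather than scikit-learn's implementation; the function name `knn_predict` and the two-dimensional toy points are assumptions (the digits dataset works the same way, only with 64 values per point):

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, sample, k=3):
    """Return the majority label among the k training points nearest to sample."""
    # Euclidean distance from the sample to every training point
    distances = np.linalg.norm(X_train - sample, axis=1)
    # Indexes of the k smallest distances
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training data: two well-separated groups of points
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # 1
```

Because every prediction compares the sample against the stored training points, KNN has no separate training phase beyond saving the data.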