Translation and sentiment analysis with ML

Translation is a very hard problem, compounded by the fact that there are thousands of languages, each of which can have very different grammar rules. One approach is to convert the formal grammar rules of one language, such as English, into a language-independent structure, and then translate by converting that structure into another language. With this approach, you would take the following steps:


    1 Identification. Identify or tag the words in the input language as nouns, verbs, etc.
    
    2 Create translation. Produce a direct translation of each word in the target language format.


Example sentence, English to Irish

In English, the sentence I feel happy is three words in this order:

    subject (I)

    verb (feel)

    adjective (happy)
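
The identification step for this sentence can be sketched as follows. This is a minimal illustration that assumes a tiny hand-written lexicon: `LEXICON` and `tag_sentence` are hypothetical names, not part of TextBlob or any other library, and a real tagger would be trained on annotated text. The pronoun I plays the role of the subject here.

```python
# Toy part-of-speech lexicon (hypothetical; real taggers learn this from data)
LEXICON = {
    "i": "pronoun",
    "feel": "verb",
    "happy": "adjective",
}


def tag_sentence(sentence):
    """Return (word, tag) pairs; words outside the lexicon are tagged 'unknown'."""
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in sentence.split()]


print(tag_sentence("I feel happy"))
# [('I', 'pronoun'), ('feel', 'verb'), ('happy', 'adjective')]
```

Any word missing from the lexicon comes back as 'unknown', which is one reason hand-written rules scale poorly.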
    


However, in Irish, the same sentence has a very different grammatical structure: emotions like "happy" or "sad" are expressed as being upon you.

The English phrase I feel happy in Irish would be Tá athas orm. A literal translation would be Happy is upon me.

An Irish speaker translating to English would say I feel happy, not Happy is upon me, because they understand the meaning of the sentence, even if the words and sentence structure are different.

The formal order for the sentence in Irish is:

    verb (Tá or is)

    adjective (athas, or happy)
    
    subject (orm, or upon me)


Translation

A naive translation program might translate words only, ignoring the sentence structure.

If you've learned a second (or third or more) language as an adult, you might have started by thinking in your native language, translating a concept word by word in your head to the second language, and then speaking out your translation. This is similar to what naive translation computer programs are doing. It's important to get past this phase to attain fluency!

Naive translation leads to bad (and sometimes hilarious) mistranslations: I feel happy translates literally to Mise bhraitheann athas in Irish. That means (literally) me feel happy and is not a valid Irish sentence. Even though English and Irish are languages spoken on two closely neighboring islands, they are very different languages with different grammar structures.
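
The mistranslation above can be reproduced with a small sketch. The word list and the phrase table below contain only the words from this example; `WORD_FOR_WORD`, `PHRASES`, `naive_translate` and `phrase_translate` are hypothetical names chosen for illustration.

```python
# Toy English-to-Irish word list, taken from the example sentence only
WORD_FOR_WORD = {"i": "Mise", "feel": "bhraitheann", "happy": "athas"}

# Toy phrase table: whole phrases mapped to their idiomatic translation
PHRASES = {"i feel happy": "Tá athas orm"}


def naive_translate(sentence):
    """Translate each word independently, keeping the English word order."""
    return " ".join(WORD_FOR_WORD.get(word.lower(), word) for word in sentence.split())


def phrase_translate(sentence):
    """Prefer a known whole-phrase translation; fall back to word by word."""
    return PHRASES.get(sentence.lower(), naive_translate(sentence))


print(naive_translate("I feel happy"))   # Mise bhraitheann athas
print(phrase_translate("I feel happy"))  # Tá athas orm
```

The phrase lookup only works for sentences it has already seen, which is why approaches that learn from large amounts of translated text are attractive.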

Machine learning approaches

So far, you've learned about the formal rules approach to natural language processing. Another approach is to ignore the meaning of the words, and instead use machine learning to detect patterns. This can work in translation if you have lots of text (a corpus) or texts (corpora) in both the origin and target languages.

For instance, consider the case of Pride and Prejudice, a well-known English novel written by Jane Austen in 1813. If you consult the book in English and a human translation of the book in French, you could detect phrases in one that are idiomatically translated into the other. You'll do that in a minute.

For example, when an English phrase such as I have no money is translated literally to French, it might become Je n'ai pas de monnaie. "Monnaie" is a tricky French 'false cognate', as 'money' and 'monnaie' are not synonymous. A better translation that a human might make would be Je n'ai pas d'argent, because it better conveys the meaning that you have no money (rather than 'loose change', which is the meaning of 'monnaie').

If an ML model has enough human translations to build a model on, it can improve the accuracy of translations by identifying common patterns in texts that have been previously translated by expert human speakers of both languages.
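
One way to picture this pattern detection is to count which target-language words keep appearing alongside a given source word in aligned sentence pairs. The sketch below is an assumption-laden toy, not how TextBlob or any production system works: the four sentence pairs in `PARALLEL` are made up for illustration, and `best_translation` scores candidates with a simple Dice coefficient.

```python
import re
from collections import Counter

# Tiny hypothetical parallel corpus: (English, French) sentence pairs
PARALLEL = [
    ("I have no money", "Je n'ai pas d'argent"),
    ("He has money", "Il a de l'argent"),
    ("I have no time", "Je n'ai pas de temps"),
    ("She has time", "Elle a du temps"),
]


def tokens(sentence):
    """Lowercase a sentence and return the set of its letter-only tokens."""
    return set(re.findall(r"[^\W\d_]+", sentence.lower()))


def best_translation(word, pairs):
    """Return the target word with the highest Dice score against `word`."""
    source_count = 0           # sentence pairs whose source side contains `word`
    target_counts = Counter()  # sentence pairs containing each target word
    together = Counter()       # sentence pairs containing both words
    for source, target in pairs:
        target_tokens = tokens(target)
        target_counts.update(target_tokens)
        if word in tokens(source):
            source_count += 1
            together.update(target_tokens)
    scores = {
        t: 2 * together[t] / (source_count + target_counts[t]) for t in together
    }
    return max(scores, key=scores.get) if scores else None


print(best_translation("money", PARALLEL))  # argent
print(best_translation("time", PARALLEL))   # temps
```

Because 'argent' appears in every human translation of a sentence containing 'money' and nowhere else, it wins over 'monnaie' without any dictionary being consulted.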

Exercise - translation

You can use TextBlob to translate sentences. Try the famous first line of Pride and Prejudice:

In [4]:
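from textblob import TextBlob

# translate() sends the text to an online translation service, so it needs a
# network connection and can fail, as in the output below, if the service or
# the installed TextBlob version no longer supports it.
blob = TextBlob(
    "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife!"
)
print(blob.translate(to="fr"))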


AttributeError: 'list' object has no attribute 'strip'

When the call succeeds, TextBlob does a pretty good job at the translation: "C'est une vérité universellement reconnue, qu'un homme célibataire en possession d'une bonne fortune doit avoir besoin d'une femme!".

It can be argued that TextBlob's translation is, in fact, far more exact than the 1932 French translation of the book by V. Leconte and Ch. Pressoir.

Sentiment analysis

Another area where machine learning can work very well is sentiment analysis. A non-ML approach to sentiment is to identify words and phrases which are 'positive' and 'negative'. Then, given a new piece of text, calculate the total value of the positive, negative and neutral words to identify the overall sentiment.

This approach is easily tricked, as you may have seen in the Marvin task. The sentence Great, that was a wonderful waste of time, I'm glad we are lost on this dark road is sarcastic and negative in sentiment, but the simple algorithm detects 'great', 'wonderful' and 'glad' as positive and 'waste', 'lost' and 'dark' as negative. The overall sentiment is swayed by these conflicting words.
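
A minimal sketch of that simple algorithm shows the problem. The word lists contain only the six words named above (a real lexicon is far larger), and `POSITIVE`, `NEGATIVE` and `lexicon_score` are hypothetical names.

```python
import re

# Word lists taken from the example sentence only
POSITIVE = {"great", "wonderful", "glad"}
NEGATIVE = {"waste", "lost", "dark"}


def lexicon_score(text):
    """Return the number of positive words minus the number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


sentence = (
    "Great, that was a wonderful waste of time, "
    "I'm glad we are lost on this dark road"
)
print(lexicon_score(sentence))  # 0
```

Three positive words cancel three negative ones, so the score of 0 reads as neutral even though a human hears the sentence as clearly negative.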

Stop a second and think about how we convey sarcasm as human speakers. Tone inflection plays a large role. Try to say the phrase "Well, that film was awesome" in different ways to discover how your voice conveys meaning.

ML approaches

The ML approach would be to manually gather negative and positive bodies of text - tweets, or movie reviews, or anything where the human has given a score and a written opinion. Then NLP techniques can be applied to opinions and scores, so that patterns emerge (e.g., positive movie reviews tend to have the phrase 'Oscar worthy' more than negative movie reviews, or positive restaurant reviews say 'gourmet' much more than 'disgusting').
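
The idea of letting patterns emerge from scored opinions can be sketched with word counts. Everything below is an illustrative assumption: the four reviews in `REVIEWS` are invented, and `train` and `predict` are hypothetical helpers that weight each word by a smoothed log-ratio of how often it appears in positive versus negative reviews. This is a simplification, not the method TextBlob uses.

```python
import math
import re
from collections import Counter

# Hypothetical labelled reviews: (text, label), where label is 'pos' or 'neg'
REVIEWS = [
    ("An Oscar worthy performance", "pos"),
    ("Truly gourmet food and great service", "pos"),
    ("The food was disgusting", "neg"),
    ("A boring film and a waste of time", "neg"),
]


def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(reviews):
    """Learn one weight per word: log of add-one-smoothed pos/neg counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in reviews:
        counts[label].update(words(text))
    vocabulary = set(counts["pos"]) | set(counts["neg"])
    return {
        w: math.log((counts["pos"][w] + 1) / (counts["neg"][w] + 1))
        for w in vocabulary
    }


def predict(weights, text):
    """Sum the learned weights; unseen words contribute nothing."""
    score = sum(weights.get(w, 0.0) for w in words(text))
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


weights = train(REVIEWS)
print(predict(weights, "gourmet food"))     # pos
print(predict(weights, "disgusting film"))  # neg
```

The word 'food' appears equally often in both classes, so its weight is zero and the decision rests on 'gourmet' or 'disgusting', which is the kind of pattern the paragraph above describes.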

Exercise - sentimental sentences

Sentiment is measured with a polarity from -1 to 1, where -1 is the most negative sentiment and 1 is the most positive. Sentiment is also measured with a subjectivity score from 0 to 1, where 0 is fully objective and 1 is fully subjective.

Take another look at Jane Austen's Pride and Prejudice. The text is available at Project Gutenberg. The sample below shows a short program which analyses the sentiment of the first and last sentences from the book and displays their sentiment polarity and subjectivity/objectivity scores.

You should use the TextBlob library (described above) to determine sentiment (you do not have to write your own sentiment calculator) in the following task.

In [5]:
from textblob import TextBlob

quote1 = """It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."""

quote2 = """Darcy, as well as Elizabeth, really loved them; and they were both ever sensible of the warmest gratitude towards the persons who, by bringing her into Derbyshire, had been the means of uniting them."""

sentiment1 = TextBlob(quote1).sentiment
sentiment2 = TextBlob(quote2).sentiment

print(quote1 + " has a sentiment of " + str(sentiment1))
print(quote2 + " has a sentiment of " + str(sentiment2))

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. has a sentiment of Sentiment(polarity=0.20952380952380953, subjectivity=0.27142857142857146)
Darcy, as well as Elizabeth, really loved them; and they were both ever sensible of the warmest gratitude towards the persons who, by bringing her into Derbyshire, had been the means of uniting them. has a sentiment of Sentiment(polarity=0.7, subjectivity=0.8)


In [9]:
import requests
from textblob import TextBlob

# Download the plain-text book from Project Gutenberg
url = "https://www.gutenberg.org/files/42671/42671.txt"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Save a local copy so that later cells can open it
filename = "pride_and_prejudice.txt"
with open(filename, "w", encoding="utf8") as file:
    file.write(response.text)
print("File downloaded successfully.")

# Remove the metadata from the book. Assumption: the file wraps the novel in
# "*** START OF ..." and "*** END OF ..." marker lines; if they are missing,
# the whole file is analysed.
text = response.text
start = text.find("*** START OF")
end = text.find("*** END OF")
if start != -1 and end != -1:
    text = text[text.find("\n", start) + 1 : end]

# Create a TextBlob object
blob = TextBlob(text)

# Collect the sentences at the extremes of polarity
positive_sentences = []
negative_sentences = []
for sentence in blob.sentences:
    polarity = sentence.sentiment.polarity
    if polarity == 1:
        positive_sentences.append(sentence)
    elif polarity == -1:
        negative_sentences.append(sentence)

# Print out the positive and negative sentences
print("Positive sentences:")
for sentence in positive_sentences:
    print(sentence)

print("Negative sentences:")
for sentence in negative_sentences:
    print(sentence)

# Print out the number of positive and negative sentences
print("Number of positive sentences:", len(positive_sentences))
print("Number of negative sentences:", len(negative_sentences))

In [7]:
from textblob import TextBlob


# Download a copy of Pride and Prejudice from Project Gutenberg as a .txt file.
# Remove the metadata at the start and end of the file, leaving only the original text.
# The file name matches the one opened in the next cell.
with open("pride_and_prejudice.txt", "r", encoding="utf8") as f:
    # Strip leftover *** and ** marker characters (longest marker first)
    text = f.read().replace("***", "").replace("**", "")

# Create a TextBlob using the book string.
blob = TextBlob(text)

# Analyze each sentence in the book in a loop.
positive_sentences = []
negative_sentences = []
for sentence in blob.sentences:
    polarity = sentence.sentiment.polarity
    if polarity == 1:
        positive_sentences.append(sentence)
    elif polarity == -1:
        negative_sentences.append(sentence)

# Print out all the positive sentences and negative sentences (separately) and the number of each.
print("Positive sentences:")
for sentence in positive_sentences:
    print(sentence)
print("Number of positive sentences:", len(positive_sentences))

print("Negative sentences:")
for sentence in negative_sentences:
    print(sentence)
print("Number of negative sentences:", len(negative_sentences))

In [2]:
# opening the file
from textblob import TextBlob

with open('pride_and_prejudice.txt', encoding="utf8") as f:
    content = f.read()
# making a TextBlob of the book
book = TextBlob(content)
# initializing the containers for the sentences
positive_sentiments = []
negative_sentiments = []

# assigning sentences based on extremes in polarity
for sentence in book.sentences:
    polarity = sentence.sentiment.polarity
    if polarity == 1:
        positive_sentiments.append(sentence)
    elif polarity == -1:
        negative_sentiments.append(sentence)

# finally printing the sentences, with line breaks and runs of spaces collapsed
print('The ' + str(len(positive_sentiments)) + ' most positive sentences: ')
for sentence in positive_sentiments:
    print('+ ' + ' '.join(str(sentence).split()))

print('The ' + str(len(negative_sentiments)) + ' most negative sentences: ')
for sentence in negative_sentiments:
    print('- ' + ' '.join(str(sentence).split()))