### CHONG WEI HAN
### SD20045
### Data Mining
### Lab Report 6

# Question 1: General Knowledge

Discuss the text mining concepts in data mining implementation related to the sentiment analysis applications discussed above. Give a reference or references.

### Voice of the customer
Voice of the customer analysis is the systematic examination of customer input, such as reviews and survey responses, to obtain practical, actionable insights. These insights give a deeper understanding of customer behaviour and support marketing-conversion gap analysis.


### Social media listening
Social media listening means monitoring posts and comments across platforms and mining that text for opinions about a brand, product, or topic. The social media landscape is extensive, with platforms tailored to different user profiles: Facebook tends to attract an older audience, whereas Twitch caters mainly to gamers. With the proliferation of articles on how to become famous quickly on YouTube, and celebrities increasingly favouring Instagram as their primary platform, social media has turned many aspects of life into a competition for popularity.

References:
- https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17 
- https://www.repustate.com/blog/text-sentiment-analysis/
- https://www.red-gate.com/simple-talk/databases/sql-server/bi-sql-server/text-mining-and-sentiment-analysis-introduction/

# Question 2: Python

Based on the above, develop a sentiment analyser for movie reviews with the following criteria:

In [None]:
#pip install nltk

In [9]:
# a. Import movie_reviews corpus from nltk.corpus.
import nltk
from nltk import FreqDist
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Download the movie_reviews corpus if not already downloaded
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to C:\Users\Cyrex
[nltk_data]     Chong\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [10]:
# b. Threshold setting = 0.9.

# Get the movie reviews documents
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
import random
random.shuffle(documents)

# Define a threshold for splitting into training and testing sets
threshold = 0.9
split = int(threshold * len(documents))

# Split the data into training and testing sets
train_set = documents[:split]
test_set = documents[split:]

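# Build the feature vocabulary. keys() yields words in first-seen order, so this
# takes the first 2000 distinct words, not the 2000 most frequent ones.
all_words = [word.lower() for word in movie_reviews.words()]
word_freq = FreqDist(all_words)
word_features = list(word_freq.keys())[:2000]

# document_features is called below but is not defined in this notebook export.
# This definition is assumed from the contains(...) feature names in the output.
def document_features(document):
    document_words = set(document)  # set lookup is faster than scanning a list
    return {f'contains({word})': (word in document_words)
            for word in word_features}

# Create the feature sets
training_features = [(document_features(d), c) for (d, c) in train_set]
testing_features = [(document_features(d), c) for (d, c) in test_set]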
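
To make the split and feature steps concrete, the following is a minimal sketch on toy data. The names `toy_vocabulary`, `toy_features`, and `toy_documents` are illustrative assumptions and not part of the lab code; only the 0.9 threshold and the `contains(word)` naming come from the notebook.

```python
# Toy illustration (hypothetical data) of the threshold split and the
# bag-of-words feature extraction used in this lab.

toy_vocabulary = ["great", "waste", "film"]  # stand-in for word_features


def toy_features(tokens, vocabulary=toy_vocabulary):
    """Map a token list to {contains(word): bool} over a fixed vocabulary."""
    token_set = set(tokens)
    return {f"contains({w})": (w in token_set) for w in vocabulary}


# Ten labelled toy documents (five copies of each pair).
toy_documents = [(["a", "great", "film"], "pos"), (["a", "waste"], "neg")] * 5

threshold = 0.9
split = int(threshold * len(toy_documents))  # index where the test set starts
toy_train, toy_test = toy_documents[:split], toy_documents[split:]

example = toy_features(["a", "great", "film"])
print(len(toy_train), len(toy_test))
print(example)
```

Every document becomes a dictionary with one boolean per vocabulary word, so words outside the vocabulary (here `a`) are ignored by the classifier.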

In [11]:
# c. Use Naïve Bayes classifier for model development and set the accuracy result.

# Train the Naïve Bayes classifier
classifier = NaiveBayesClassifier.train(training_features)

# Evaluate the accuracy of the classifier on the testing set
accuracy_result = accuracy(classifier, testing_features)
print(f'Accuracy: {accuracy_result:.2%}')

Accuracy: 85.00%
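
The accuracy figure is the fraction of test documents whose predicted label matches the true label. The sketch below reproduces that calculation with a hypothetical helper `simple_accuracy` and made-up labels; the labels are not the lab's results.

```python
def simple_accuracy(predicted, gold):
    """Return the share of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical labels for five test documents.
predicted_labels = ["pos", "neg", "neg", "pos", "neg"]
gold_labels = ["pos", "neg", "pos", "pos", "neg"]

score = simple_accuracy(predicted_labels, gold_labels)
print(f"Accuracy: {score:.2%}")
```

Assuming the corpus's 2000 reviews and the 0.9 split, the test set holds 200 documents, so 85.00% would correspond to 170 correct predictions. Because the documents are shuffled without a fixed seed, the figure can vary between runs.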


In [12]:
# d. Create 20 most informative words.

print("20 Most Informative Words:")
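# show_most_informative_features prints the table itself and returns None,
# so wrapping it in print() would add a stray "None" line.
classifier.show_most_informative_features(20)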

20 Most Informative Words:
Most Informative Features
        contains(turkey) = True              neg : pos    =     12.3 : 1.0
     contains(stretched) = True              neg : pos    =      8.4 : 1.0
    contains(schumacher) = True              neg : pos    =      6.6 : 1.0
          contains(mena) = True              neg : pos    =      6.4 : 1.0
        contains(shoddy) = True              neg : pos    =      6.4 : 1.0
        contains(suvari) = True              neg : pos    =      6.4 : 1.0
       contains(singers) = True              pos : neg    =      6.3 : 1.0
     contains(atrocious) = True              neg : pos    =      6.2 : 1.0
         contains(waste) = True              neg : pos    =      5.9 : 1.0
       contains(bronson) = True              neg : pos    =      5.7 : 1.0
        contains(canyon) = True              neg : pos    =      5.7 : 1.0
     contains(pregnancy) = True              neg : pos    =      5.7 : 1.0
        contains(sordid) = True              ne
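
Each row of this table compares how likely a feature value is under one label versus the other. The sketch below shows one way such a ratio can be computed from document counts. The add-0.5 smoothing is assumed here to approximate the estimator NLTK applies by default, and the counts and the helper names `smoothed_prob` and `likelihood_ratio` are invented for illustration.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label): (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | label a) to P(feature | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: a word appears in 12 of 900 neg training reviews
# and in 1 of 900 pos training reviews.
ratio = likelihood_ratio(12, 900, 1, 900)
print(f"neg : pos = {ratio:.1f} : 1.0")
```

Rare proper nouns can score highly under this measure simply because they occur in a handful of reviews of one class, which may explain entries such as names of actors or directors.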

In [13]:
# e. Test the input movie reviews with the reviews below:
# i. The costumes in this movie look real.
# ii. The movie is merely based on the book.
# iii. I think the story was terrible and the characters were very weak.
# iv. People say that the director of the movie is astounding.
# v. This is such a ridiculous movie. I will not recommend it to anyone.

# Test input movie reviews
input_reviews = [
    "The costumes in this movie look real.",
    "The movie is merely based on the book.",
    "I think the story was terrible and the characters were very weak.",
    "People say that the director of the movie is astounding.",
    "This is such a ridiculous movie. I will not recommend it to anyone."
]

In [14]:
# f. Predict the sentiment and probability of reviews in (e).
# Predict sentiment and probability for each input review
for review in input_reviews:
    features = document_features(review.split())
    sentiment = classifier.classify(features)
    probability = classifier.prob_classify(features).prob(sentiment)
    print(f"Review: '{review}'\nSentiment: {sentiment}\nProbability: {probability:.2%}\n")

Review: 'The costumes in this movie look real.'
Sentiment: neg
Probability: 99.96%

Review: 'The movie is merely based on the book.'
Sentiment: neg
Probability: 99.92%

Review: 'I think the story was terrible and the characters were very weak.'
Sentiment: neg
Probability: 99.97%

Review: 'People say that the director of the movie is astounding.'
Sentiment: neg
Probability: 99.60%

Review: 'This is such a ridiculous movie. I will not recommend it to anyone.'
Sentiment: neg
Probability: 99.71%
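
All five reviews come back as neg with near-certain probability, including ones that read as neutral or positive. One plausible contributor (an assumption, not verified against this run) is that `review.split()` keeps capital letters and attached punctuation, so tokens such as `movie.` do not match the lowercase vocabulary built from the corpus. The sketch below shows a normalising tokenizer; `normalize_review` is a hypothetical helper, not part of the lab code.

```python
import re


def normalize_review(text):
    """Lowercase the text and separate words from punctuation marks."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


sample = "This is such a ridiculous movie."
raw_tokens = sample.split()            # keeps "This" and "movie."
clean_tokens = normalize_review(sample)  # lowercased, punctuation split off

print(raw_tokens)
print(clean_tokens)
```

Passing `normalize_review(review)` to `document_features` instead of `review.split()` would be one way to check whether the predictions change. The outputs above were produced without this step.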

