<a href="https://colab.research.google.com/github/Prajaktahz/Uni_Colab_Work/blob/main/Week_7_Case_Study_Exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://www.nlab.org.uk/wp-content/uploads/nlabmain.png" style="width:40%; clear:both; margin-bottom:-20px" align=left>
<br style="clear:both;"/>

## Analytics Specializations & Applications 2 - Week 7

# Text Analytics - Case Study Exercises
----------
Dr Georgiana Nica-Avram - University of Nottingham
[mail](mailto:georgiana.nica-avram1@nottingham.ac.uk)
[web](http://www.neodemographics.org)

This set of exercises assumes that you have completed the accompanying "Text Analytics - Preparatory Exercises" Jupyter notebook. If you haven't, please find and run through that set of exercises first.

### Scenario
Now that we have the tools we need, let's consider the following case study scenario. We are a consumer research company, like IPSOS MORI, whose media outputs are now being reviewed and discussed on the web (particularly in the form of YouTube comments). We would like to build a text analytics solution that takes these reviews (which we assume to be unstructured text alone) and tells us whether each one is positive or negative; in other words, we need to perform sentiment analysis. Once we have a method of doing this, we can score the success of our outputs automatically (and potentially gauge how reactions towards them change over time).

The problem is that we currently have no basis for assessing reviews - our media outputs don't get "scored". This problem is called the "cold start" problem - we just don't have any ground truth which we can build a text analytics model against.

Luckily, we may be able to leverage some "transfer learning" - one of our partners has a dataset of movie reviews that **are** accompanied by a score - so we know if the text they include is broadly positive or negative.

By performing text analytics on this dataset and concentrating on sentiment (rather than movie actors, directors, genres, etc) we will be able to create a natural language model that will receive any review - such as those we get discussing our advertising campaigns - and tell us something about the author's reaction. This we can then document, and use in our future pitches.

### The dataset
Our transfer dataset consists of 25,000 written movie reviews from the Internet Movie Database, IMDb (www.imdb.com). No movie has more than 30 reviews, and the review text is accompanied by a binary score (with the value 1 if the IMDb rating for that review is greater than 6, and the value 0 if the rating is less than 5). From this data we will learn what constitutes a positive and a negative review in terms of text.
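To make the labelling rule concrete, here is a minimal sketch of it. The function name `binary_sentiment` is hypothetical, and the sketch assumes whole-number ratings from 1 to 10; the text above only gives the two thresholds, so treating ratings of 5 and 6 as unlabelled is an assumption.

```python
def binary_sentiment(rating):
    """Map a 1-10 star rating to the binary label described above."""
    if rating > 6:
        return 1      # positive review
    if rating < 5:
        return 0      # negative review
    return None       # 5 or 6: neither rule applies, so no label (assumption)

print([binary_sentiment(r) for r in (1, 4, 5, 6, 7, 10)])
```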

To analyse this text, so we can understand what constitutes a positive and a negative review in terms of language, we will implement the following:

* Data Collation
* Stripping / Case Folding
* Stemming
* Stopping
* Tokenization
* Vectorization (and TF-IDF)
* Testing (using Cosine Similarity)
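As a rough illustration of the text-preparation steps in this list, the sketch below chains case folding, tokenization, stopping and stemming in plain Python. Everything in it is an assumption made for illustration: the stop list is deliberately tiny, `crude_stem` is a hypothetical stand-in for a real stemmer (such as the Porter stemmer), and the steps are applied in the order that is easiest to code rather than the order listed.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}

def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    folded = text.lower()                                # case folding
    tokens = re.findall(r"[a-z']+", folded)              # tokenization (drops punctuation and digits)
    kept = [t for t in tokens if t not in STOP_WORDS]    # stopping
    return [crude_stem(t) for t in kept]                 # stemming

print(preprocess("The acting was AMAZING and it kept us watching!"))
# prints ['act', 'amaz', 'kept', 'us', 'watch']
```

Note that a stem such as "amaz" is not a real word; that is normal for stemmers, which only need to map related word forms to the same token.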


In [1]:
# This next line is only to be used on Google Colaboratory and will download the data archive (a zip file) for you
!wget -O week7_data.zip "https://drive.google.com/uc?export=download&id=1vxzKV3Z522JN67xnglQZJCniIQaypsY9"
!unzip week7_data.zip

# We can then check that the file is here by listing the content of the current directory
!ls

--2024-03-14 22:50:29--  https://drive.google.com/uc?export=download&id=1vxzKV3Z522JN67xnglQZJCniIQaypsY9
Resolving drive.google.com (drive.google.com)... 172.217.164.14, 2607:f8b0:4025:803::200e
Connecting to drive.google.com (drive.google.com)|172.217.164.14|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1vxzKV3Z522JN67xnglQZJCniIQaypsY9&export=download [following]
--2024-03-14 22:50:29--  https://drive.usercontent.google.com/download?id=1vxzKV3Z522JN67xnglQZJCniIQaypsY9&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 172.217.12.1, 2607:f8b0:4025:815::2001
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|172.217.12.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13859069 (13M) [application/octet-stream]
Saving to: ‘week7_data.zip’


2024-03-14 22:50:39 (11.7 MB/s) - ‘week7_data.zip’ saved [13859069/13859069]

Let's begin by loading in the data.

<span style="font-weight:bold; color:green;">&rarr; Load in and examine the first ten lines of the data </span>

In [None]:
import pandas

data = pandas.read_csv("movie_data.tsv", delimiter="\t")

#-- examine the first 10 lines of the data here
data.head(10)

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |
| 5 | 8196_8 | 1 | I dont know why people think this is such a ba... |
| 6 | 7166_2 | 0 | This movie could have been very good, but come... |
| 7 | 10633_1 | 0 | I watched this video at a friend's house. I'm ... |
| 8 | 319_1 | 0 | A friend of mine bought this film for £1, and ... |
| 9 | 8713_10 | 1 | <br /><br />This movie is full of references. ... |


Have a look at the last entry - see how it has HTML tags in it. We need to get rid of these (and let's lose the punctuation while we are at it), so let's first do some stripping. I've created a custom function to do this, which is out of scope for this course, so for now just run the code below:

In [None]:
import html_cleaner

data.review = html_cleaner.remove_html(data.review)

#-- examine the first 10 lines of the data again
data.head(10)

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |
| 5 | 8196_8 | 1 | I dont know why people think this is such a ba... |
| 6 | 7166_2 | 0 | This movie could have been very good, but come... |
| 7 | 10633_1 | 0 | I watched this video at a friend's house. I'm ... |
| 8 | 319_1 | 0 | A friend of mine bought this film for £1, and ... |
| 9 | 8713_10 | 1 | This movie is full of references. Like \Mad Ma... |
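The internals of `html_cleaner` are not shown in this notebook, so the following is only a guess at the kind of work such a helper does, not its actual implementation. The function name `strip_html` and the sample sentence are made up for illustration; the sketch uses only the standard library.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace HTML tags with spaces, decode entities, and tidy whitespace."""
    no_tags = TAG_PATTERN.sub(" ", text)
    decoded = html.unescape(no_tags)               # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", decoded).strip()    # collapse runs of whitespace

# A made-up review, not taken from the dataset
sample = "<br /><br />A made-up review: great cast &amp; <b>great</b> script."
print(strip_html(sample))
# prints: A made-up review: great cast & great script.
```

A function like this works on one string at a time, so with a pandas column it could be applied as `data.review.apply(strip_html)`.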


In [None]:
#-- describe the data here
data.describe()

| | sentiment |
|---|---|
| count | 25000.0 |
| mean | 0.5 |
| std | 0.50001 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


We are going to learn a model that can distinguish reviews with positive sentiment from those with negative sentiment. Start by splitting the data into test and training sets (use the first 20,000 items for the training data and the rest for the test data).

<span style="font-weight:bold; color:green;">&rarr; Split the data into test and training </span>

In [None]:
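#-- drop the id column (drop returns a new DataFrame, so reassign it)
data = data.drop(["id"], axis=1)

#-- first 20,000 rows for training, the remaining 5,000 for testing
train_data = data[:20000]
test_data = data[20000:]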
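A split by position like this is only fair if the file is not sorted by sentiment, so it is worth checking that both halves contain a similar mix of labels. The sketch below shows the idea on a made-up list of labels standing in for `data.sentiment`; the helper name `positional_split` is hypothetical.

```python
from collections import Counter

def positional_split(rows, n_train):
    """Split a sequence into (train, test) by position, like data[:n] and data[n:]."""
    return rows[:n_train], rows[n_train:]

# Toy labels standing in for data.sentiment (made up for illustration)
labels = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
train_labels, test_labels = positional_split(labels, 8)

# Count how many positive (1) and negative (0) labels landed in each part
print("train:", Counter(train_labels))
print("test: ", Counter(test_labels))
```

On the real data, `train_data.sentiment.value_counts()` and `test_data.sentiment.value_counts()` give the same check.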

OK, as before the next step is to vectorize our text data - let's do that next with a simple Count Vectorizer (and examine how much TF-IDF can improve things later).

<span style="font-weight:bold; color:green;">&rarr; Complete the following code </span>

In [None]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

#-- create our vectorizer object, ready to fit and
#-- transform our data into a vector space format
vectorizer = CountVectorizer()

#-- setup the model's feature space using our training data
vectorizer.fit(train_data.review)

#-- and then convert the training data set into vector format
train_features = vectorizer.transform(train_data.review)

#-- while we are here, convert our test dataset in the same way
test_features = vectorizer.transform(test_data.review)

print("Training and test data successfully vectorized")

Training and test data successfully vectorized
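To see why we fit on the training data only and then transform both sets, here is a hand-rolled sketch of the idea behind a count vectorizer. It is a simplification written for illustration, not scikit-learn's implementation: the function names are hypothetical, and the tokenizer only roughly mirrors the default behaviour (lowercase, tokens of two or more characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more letters/digits
    return re.findall(r"[a-z0-9]{2,}", text.lower())

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocabulary):
    """Turn each document into a list of counts, one per vocabulary column."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["good film good cast", "bad film"]
test_docs = ["good plot"]          # "plot" never appeared in training

vocabulary = fit_vocabulary(train_docs)
print(vocabulary)                          # {'bad': 0, 'cast': 1, 'film': 2, 'good': 3}
print(transform(train_docs, vocabulary))   # [[0, 1, 1, 2], [1, 0, 1, 0]]
print(transform(test_docs, vocabulary))    # [[0, 0, 0, 1]]
```

The word "plot" is silently dropped from the test document because it has no column: the feature space is fixed by the training data, which is why the test set must be transformed with the same fitted vectorizer.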


Now let's create a model that will understand how sentiment is constructed out of text in some way. For this job we could use any classifier, but given Naive Bayes models have historically been used in text analysis, let's maintain that tradition here:

In [None]:
#-- let's use a Bernoulli naive Bayes classifier, which models whether
#-- each word is present in a review rather than how often it occurs
from sklearn.naive_bayes import BernoulliNB

NB = BernoulliNB()

#-- fit the model to our training data - note in this step the model is
#-- finding the relationship between word presence and the sentiment
#-- of each review
NB.fit(train_features, train_data.sentiment)
print("Linguistic Model successfully created")

Linguistic Model successfully created
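The cell above uses `BernoulliNB`, which treats each feature as a yes/no flag ("is this word present?"). For intuition, here is a minimal hand-written sketch of that calculation on toy data. It is an illustration under simplifying assumptions (two classes, add-one smoothing, every class has at least one example), not scikit-learn's code, and the function names are hypothetical.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """X: rows of 0/1 word-presence flags; y: 0/1 labels."""
    n_features = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        # Smoothed probability that each word is present in a class-c document
        p_present = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (math.log(len(rows) / len(X)), p_present)
    return model

def predict_proba(model, x):
    """Return [P(negative | x), P(positive | x)] for one presence vector x."""
    log_joint = []
    for c in (0, 1):
        log_prior, p_present = model[c]
        # Absent words count as evidence too: that is the Bernoulli part
        ll = sum(math.log(p if xj else 1.0 - p) for xj, p in zip(x, p_present))
        log_joint.append(log_prior + ll)
    top = max(log_joint)                       # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Toy data: columns are presence flags for the words ["bad", "good"]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [0, 0, 1, 1]
model = fit_bernoulli_nb(X, y)
print(predict_proba(model, [0, 1]))   # roughly [0.1, 0.9]
```

A review containing "good" but not "bad" comes out about 90% positive here, because both the presence of "good" and the absence of "bad" point the same way.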


Now let's see how well our model works, by testing it on our holdout dataset (note that we would normally cross-validate here to get a more representative score, but a single holdout test is fine for now):

In [None]:
#-- generate some predictions
results = NB.predict_proba(test_features)
print(results)

[[5.78165233e-01 4.21834767e-01]
 [9.93375916e-01 6.62408434e-03]
 [1.25287276e-02 9.87471272e-01]
 ...
 [9.99879156e-01 1.20843866e-04]
 [3.06619208e-01 6.93380792e-01]
 [4.53248215e-05 9.99954675e-01]]


The results come in two columns for each review: the first column is the probability that the review is negative, and the second is the probability that it is positive. We can turn these into an actual prediction of whether the review contains positive sentiment by checking whether the second column is greater than 0.5 (our threshold):

In [None]:
#-- note the neat syntax here: first we index the second column of results using
#-- [:,1] and then we test if it is more than 0.5 and hence a positive review
predictions = results[:,1] > 0.5

#-- the reviews whose positive probability was more than 0.5 are designated as True
print(predictions)

[False False  True ... False  True  True]


In [None]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(test_data.sentiment, predictions)
print("We predicted the sentiment of {0:.01f}% of reviews correctly".format(acc*100))

We predicted the sentiment of 84.7% of reviews correctly
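Accuracy is a single number, and it can hide whether the mistakes are mostly false positives or false negatives. As a small sketch of what sits behind it, the code below counts the four outcome types by hand and derives accuracy, precision and recall. The labels and predictions are made up for illustration, and the helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 (or False/True) labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            counts["tp"] += 1      # predicted positive, really positive
        elif pred and not truth:
            counts["fp"] += 1      # predicted positive, really negative
        elif not pred and truth:
            counts["fn"] += 1      # predicted negative, really positive
        else:
            counts["tn"] += 1      # predicted negative, really negative
    return counts

y_true = [1, 0, 1, 1, 0, 0]                        # made-up labels
y_pred = [True, False, False, True, True, False]   # made-up predictions

c = confusion_counts(y_true, y_pred)
accuracy = (c["tp"] + c["tn"]) / len(y_true)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])
print(c, accuracy, precision, recall)
```

On the real predictions, `sklearn.metrics.confusion_matrix(test_data.sentiment, predictions)` reports the same four counts.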


84.7% is not bad at all, given we are using a simple and quick bag-of-words approach. It may well be good enough for the business task, so we could start applying the model to our own reviews. Let's try some:

In [None]:
#-- create some test "reviews"
test_reviews = [
    "I'm not sure about this advert - it is selling a bad brand!",
    "I love this advert - it is selling a good brand!",
    "This is excellent work",
    "What is this rubbish?",
    "Please save us from this nonsense",
    "I enjoyed watching this",
    "I wanted to say this advert is bad, but I can't - just the opposite in fact!"
]

#-- vectorize it
vec_test = vectorizer.transform( test_reviews )

#-- Run it through your NB sentiment analyser
results = NB.predict_proba(vec_test)

#-- Examine the sentiment the model detects
for t, r in zip(test_reviews, results):
    if r[1] > 0.5:
        print(t, "--> POSITIVE REVIEW")
    else:
        print(t, "--> NEGATIVE REVIEW")

I'm not sure about this advert - it is selling a bad brand! --> NEGATIVE REVIEW
I love this advert - it is selling a good brand! --> NEGATIVE REVIEW
This is excellent work --> POSITIVE REVIEW
What is this rubbish? --> POSITIVE REVIEW
Please save us from this nonsense --> NEGATIVE REVIEW
I enjoyed watching this --> NEGATIVE REVIEW
I wanted to say this advert is bad, but I can't - just the opposite in fact! --> NEGATIVE REVIEW


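The results are mixed. In the output above only three of the seven reviews are categorized correctly: "I love this advert" and "I enjoyed watching this" are labelled negative, "What is this rubbish?" is labelled positive, and the carefully worded last review tricks the model as intended. Very short texts give a bag-of-words model little evidence to work with. Nonetheless, even a simple bag-of-words approach gives the business a first working tool, which we can now refine and use to track the company's influence.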
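One reason carefully worded reviews can trick the model is that a bag of words throws away word order. The short sketch below, using two made-up sentences, shows a compliment and a complaint that produce exactly the same word counts, so any classifier built on those counts alone must give them the same score.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word, ignoring the order in which the words appear."""
    return Counter(text.lower().split())

praise = "this advert is not bad it is good"
complaint = "this advert is not good it is bad"

# Same words, same counts, opposite meaning
print(bag_of_words(praise) == bag_of_words(complaint))   # prints True
```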

However, we can do better, as we've omitted some useful steps. Your challenge is now to see how much you can improve the results of this model by implementing:
> * Stopping
> * Stemming
> * Case Folding
> * and a TfidfVectorizer()

Also consider:
> * What can you find out about the important features (i.e. which words are most influential?)
> * Can you design a query that fools the model? Tip: consider including negative words even though the review is good...

Good luck! And ask for help if you run out of ideas.
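If you would like to see what TF-IDF weighting does before swapping in a vectorizer, here is a hand-rolled sketch on three made-up sentences. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which matches scikit-learn's documented defaults as far as I am aware; the function names and the whitespace tokenizer are simplifications for illustration. It also includes cosine similarity, so you can compare two documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = ["the film was great", "the film was awful", "the cast was great"]
vocab, rows = tfidf_vectors(docs)
print(vocab)
print(round(cosine_similarity(rows[0], rows[1]), 3))
```

Words that appear in every document ("the", "was") receive the lowest weight, while rarer words such as "awful" receive the highest, which is the effect we hope will help the sentiment model.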

In [None]:
#-- Use the previous tutorials to do all of the
#-- above and try and build a better 'sentiment analyser'
#-- (i.e. one that manages to get any improvement at all on 84.7% accuracy!)

# GOOD LUCK