<img src="logo.png" alt="Drawing" style="width: 80px;"/>

# Exercise 4 - Language Models - By Omer Dembinsky

In this lesson we will read movie reviews and predict their sentiment, first with a classifier trained on tagged data and then using Generative AI.

Original data from: http://ai.stanford.edu/~amaas/data/sentiment/ (and modified CSV from https://github.com/rasbt/python-machine-learning-book-2nd-edition/tree/master/code/ch08/). 

This data was contributed by: 
Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher, Learning Word Vectors for Sentiment Analysis, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June, 2011, http://www.aclweb.org/anthology/P11-1015

A full code solution for the classification model is at: https://github.com/PacktPublishing/Python-Machine-Learning-Second-Edition//blob/master/Chapter08/ch08.ipynb

A good explanation is at: https://stackabuse.com/text-classification-with-python-and-scikit-learn/

### 1. Imports

In [2]:
import pandas as pd 
import numpy as np  
import matplotlib.pyplot as plt 

In [3]:
# Model, data-splitting and evaluation tools for the classification task
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report # Produces a table of precision, recall and F1 per class
from sklearn.metrics import confusion_matrix, precision_score, recall_score

from sklearn.feature_extraction.text import TfidfVectorizer

Packages for text analytics:

In [5]:
import re #regex package
import nltk
# nltk.download()  # Downloads text data sets, including stop words. This takes a very long time, so run it only if needed

### 2. Read the movie reviews data

In [7]:
df = pd.read_csv('movie_data.csv')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [9]:
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [10]:
df['sentiment'].value_counts()

1    25000
0    25000
Name: sentiment, dtype: int64

### 3. Split data to train and test
Split the data into train and test sets before creating the bag of words, so that the vocabulary is learned from the training data only.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(df.values[:,0], df.values[:,1], test_size=0.3, random_state=0)

In [13]:
#Verify y is diverse enough:
print("There are ",y_train.sum()," Positive sentiments out of ",y_train.shape[0])

There are  17342  Positive sentiments out of  35000
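The split above relies on random shuffling to keep the classes balanced. As a hedged aside, `train_test_split` also accepts a `stratify` argument that forces the same class ratio in both parts. The sketch below uses made-up toy arrays (`X_toy`, `y_toy`), not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 placeholder reviews, half positive
X_toy = np.array([f"review {i}" for i in range(100)], dtype=object)
y_toy = np.array([1] * 50 + [0] * 50)

# stratify=y_toy keeps the 50/50 class ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print("There are", y_tr.sum(), "positive sentiments out of", y_tr.shape[0])
print("There are", y_te.sum(), "positive sentiments out of", y_te.shape[0])
```

With a perfectly balanced data set such as this one the plain random split is usually close enough, so stratification is optional here.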


### 4. Step by step example

#### 4.1 Clean the data

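This stage is optional, as most of the cleaning can be done by TfidfVectorizer itself.
The cleaning can also be done before the split; here it is applied to `df` after the split, so `X_train` and `X_test` keep the original text.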
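A minimal sketch of letting the vectorizer do the cleaning, assuming two made-up reviews and an illustrative `token_pattern` (neither comes from the exercise data): `lowercase=True` folds the case, and the token pattern keeps only runs of two or more letters, so digits and punctuation never reach the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews containing capitals, digits, punctuation and an HTML remnant
toy_reviews = ["GREAT movie!!! 10/10", "Bad movie... <br /> really bad."]

# lowercase=True lowercases the text before tokenizing;
# token_pattern keeps only runs of at least two letters a-z
vec = TfidfVectorizer(lowercase=True, token_pattern=r"[a-z]{2,}")
vec.fit(toy_reviews)

vocabulary = sorted(vec.vocabulary_)
print(vocabulary)
```

Note that HTML remnants such as `br` still survive this pattern because they consist of letters; removing them would need an extra step.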

Demonstrated below are:
1. Change all text to lower case.
2. Remove characters that are not letters.

In [15]:
df['review']=df.review.str.lower() #Lowercase on entire text, so capitalized word is not counted separately
df.head()

Unnamed: 0,review,sentiment
0,"in 1974, the teenager martha moxley (maggie gr...",1
1,ok... so... i really like kris kristofferson a...,0
2,"***spoiler*** do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"i recently bought the dvd, forgetting just how...",0


In [16]:
df['review'] = df['review'].str.replace(r'[^a-z]', ' ', regex=True) # Replace every character that is not a lowercase letter with a space (regex=True is required in recent pandas versions)
df.head()


Unnamed: 0,review,sentiment
0,in the teenager martha moxley maggie gr...,1
1,ok so i really like kris kristofferson a...,0
2,spoiler do not read this if you think a...,0
3,hi for all the people who have seen this wonde...,1
4,i recently bought the dvd forgetting just how...,0


You can further annotate movie names (replace each specific movie name with a general tag), as well as people's names. A dedicated package for this is available at: https://imdbpy.sourceforge.io/.

Annotation requires a long, detailed mapping of words. However, open source annotations already exist for many domains.

#### 4.2. Create tf-idf bag of words
The bag of words is created with TfidfVectorizer, fitted on X_train only.

In [19]:
tf_idf_vec = TfidfVectorizer(ngram_range=(1,1), lowercase=True, stop_words=None)
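To illustrate what the `ngram_range` and `stop_words` arguments above control, here is a small sketch on two made-up sentences (the toy documents and the variant settings are assumptions, not the configuration used in this exercise). It compares the vocabulary produced by the settings above with a variant that removes scikit-learn's built-in English stop words and also adds word pairs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the movie was good", "the movie was bad"]  # made-up examples

# Same settings as above: single words only, no stop-word removal
vec_plain = TfidfVectorizer(ngram_range=(1, 1), lowercase=True, stop_words=None)
vec_plain.fit(toy_docs)
vocab_plain = sorted(vec_plain.vocabulary_)

# Variant: drop the built-in English stop words and add two-word sequences
vec_variant = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words="english")
vec_variant.fit(toy_docs)
vocab_variant = sorted(vec_variant.vocabulary_)

print("plain:  ", vocab_plain)
print("variant:", vocab_variant)
```

Stop words are removed before the word pairs are built, so a pair such as "movie good" can appear even though the two words were not adjacent in the original sentence.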

In [20]:
X_train_bag = tf_idf_vec.fit_transform(X_train)  # note we do both the fit and transform on train

In [21]:
print ("size of X_train_bag is:", X_train_bag.shape)
print ("there are ", X_train_bag.shape[0]*X_train_bag.shape[1], " elements in the matrix" )

size of X_train_bag is: (35000, 87866)
there are  3075310000  elements in the matrix


In [22]:
X_train_bag

<35000x87866 sparse matrix of type '<class 'numpy.float64'>'
	with 4771837 stored elements in Compressed Sparse Row format>
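The gap between the roughly 3 billion cells and the 4.77 million stored elements is why the bag of words is kept as a sparse matrix. A small sketch of the density calculation follows; the helper name `density` and the toy matrix are assumptions for illustration, while the last lines reuse the figures printed above.

```python
from scipy import sparse

def density(matrix):
    """Fraction of cells that are explicitly stored in a sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)

# Toy sparse matrix: 3 non-zero values in a 3x4 grid
toy = sparse.csr_matrix([[0.0, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0],
                         [0.3, 0.0, 0.0, 0.9]])
print(density(toy))

# Same calculation with the figures reported above for X_train_bag
reported_density = 4771837 / (35000 * 87866)
print(f"{reported_density:.4%}")
```

On the real data the same helper could be called as `density(X_train_bag)`; well under 1% of the cells hold a value.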

In [23]:
words = tf_idf_vec.get_feature_names_out()

In [24]:
words[6520:6530]

array(['bacula', 'bad', 'badalamenti', 'badalucco', 'badanov', 'badass',
       'badassdom', 'badasses', 'badassness', 'badat'], dtype=object)

In [25]:
#show overall "strength" (summed tf-idf weight) of some words
tfidftotal = np.zeros(shape=X_train_bag.shape[1])  # zeros, so entries that are not filled below hold no arbitrary values
for i in range(6520,6530):
    tfidftotal[i]=(X_train_bag[:,i].sum())
tfidftotal[6520:6530]

array([1.30021176e-01, 4.89991661e+02, 6.83669459e-01, 8.28327199e-02,
       2.20184840e-01, 2.59200002e+00, 8.05290509e-02, 1.21376691e-01,
       4.23959822e-02, 4.23848580e-02])

### 5. Run Model based on tf_idf 

In [27]:
gb_tfidf = ensemble.GradientBoostingClassifier()

In [28]:
#Small test to check model and data, as the full data can take a very long time to run
#y_train_small = y_train[0:200]
#X_train_bag_small = X_train_bag[0:200]
#gb_tfidf.fit(X_train_bag_small, y_train_small.astype('int'))

In [102]:
#Fit model on bag-of-words
gb_tfidf.fit(X_train_bag, y_train.astype('int')) 

GradientBoostingClassifier()

In [105]:
#Run prediction on Test
y_test_pred = gb_tfidf.predict(tf_idf_vec.transform(X_test)) # note we do ONLY transform on test data

In [107]:
print("Confusion Matrix : \n", confusion_matrix(y_test.astype('int'), y_test_pred),"\n")
print("The precision  is ",precision_score(y_test.astype('int'), y_test_pred)) 
print("The recall is ",recall_score(y_test.astype('int'), y_test_pred),"\n")  

Confusion Matrix : 
 [[5652 1690]
 [1068 6590]] 

The precision  is  0.7958937198067633
The recall is  0.8605379994776704 
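The two scores can be re-derived by hand from the confusion matrix printed above, which helps in reading it: rows are the real classes, columns are the predicted classes. The sketch below only reuses the four counts from that output; the accuracy line is an extra figure computed from the same counts.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: real 0 / real 1, columns: predicted 0 / predicted 1
cm = np.array([[5652, 1690],
               [1068, 6590]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of all reviews predicted positive, how many really are
recall = tp / (tp + fn)          # of all really positive reviews, how many were found
accuracy = (tp + tn) / cm.sum()  # share of all reviews classified correctly

print("precision:", precision)
print("recall:   ", recall)
print("accuracy: ", accuracy)
```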



In [32]:
#look at some examples
for i in range (10,20):
    print(X_test[i])
    print ("\n predicted sentiment is: ",y_test_pred[i])
    print ("\n real sentiment is: ",y_test[i],"\n\n")

This movie is about Tyrannus, a gladiator who is brought back from the dead to summon Tyrannus, a gladiator who must be brought back from the dead. Tyrannus, we learn after about an hour, is also called Demonicus. This adds much needed depth to the screenplay and calls into question our assumptions about identity, psychology and ourselves. <br /><br />The spirit of Tyrannus accomplishes his little to-do list (killing some people and saying repetitive phrases in Latin) by possessing the body of a college guy. He uses a magic mind-control helmet to do this, which the college boy willingly puts on his head, and then at several points in the movie, takes off and puts back on.<br /><br />Maria performs oral sex on a poor man's Sean Willian Scott, and Tyrannus wears the Rollerball glove. Tyrannus has his own green backlighting for no reason, and has apparently been sitting next to CG fire in an ancient concrete tunnel for centuries like this. Utter misfortune.<br /><br />This movie is empty 

## Now - Let's try to do the same on the original data with ChatGPT

In [47]:
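# Running on a sample of the data to avoid accidentally wasting too much money
df_for_chatgpt = df.head(20).copy()  # .copy() gives an independent DataFrame, so adding a column later does not raise SettingWithCopyWarning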

In [48]:
print(df_for_chatgpt)

                                               review  sentiment
0   in       the teenager martha moxley  maggie gr...          1
1   ok    so    i really like kris kristofferson a...          0
2      spoiler    do not read this  if you think a...          0
3   hi for all the people who have seen this wonde...          1
4   i recently bought the dvd  forgetting just how...          0
5   leave it to braik to put on a good show  final...          1
6   nathan detroit  frank sinatra  is the manager ...          1
7   to understand  crash course  in the right cont...          1
8   i ve been impressed with chavez s stance again...          1
9   this movie is directed by renny harlin the fin...          1
10  i once lived in the u p and let me tell you wh...          0
11  hidden frontier is notable for being the longe...          1
12  it s a while ago  that i have seen sleuth     ...          0
13  what is it about the french  first  they  appa...          0
14  this very strange mov

### The code below was obtained by asking ChatGPT and Bard (after initial errors when running it)

In [50]:
import openai  # Import the OpenAI package

# Set your OpenAI API key
openai.api_key = "ENTER_YOUR_KEY_HERE"

# Define the function to analyze sentiment using ChatGPT
def analyze_sentiment_chatgpt(review):
    response = openai.completions.create(
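        model="text-davinci-003",  # This model has since been retired by OpenAI; substitute a completions model that is currently available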
        prompt="Analyze the sentiment of the following text:\n" + review,
        max_tokens=15,  # Adjust as needed
        n=1,
        stop=None,
        temperature=0.5
    )

    sentiment_chatgpt = 0  # Default to negative sentiment
    if "positive" in response.choices[0].text.lower():
        sentiment_chatgpt = 1
    return sentiment_chatgpt

# Apply the sentiment analysis function to each review
df_for_chatgpt["sentiment_chatgpt"] = df_for_chatgpt["review"].apply(analyze_sentiment_chatgpt)

# Print the DataFrame with the new sentiment_chatgpt column
df_for_chatgpt


Unnamed: 0,review,sentiment,sentiment_chatgpt
0,in the teenager martha moxley maggie gr...,1,1
1,ok so i really like kris kristofferson a...,0,0
2,spoiler do not read this if you think a...,0,0
3,hi for all the people who have seen this wonde...,1,1
4,i recently bought the dvd forgetting just how...,0,0
5,leave it to braik to put on a good show final...,1,1
6,nathan detroit frank sinatra is the manager ...,1,1
7,to understand crash course in the right cont...,1,1
8,i ve been impressed with chavez s stance again...,1,1
9,this movie is directed by renny harlin the fin...,1,1
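A quick way to compare the two columns is to compute how often they agree. The sketch below rebuilds only the ten rows displayed above (the remaining rows of the sample are not shown, so they are left out) and uses `pd.crosstab` to tabulate real against predicted labels; the variable names are chosen for illustration.

```python
import pandas as pd

# The ten rows displayed above: real label and ChatGPT label
shown = pd.DataFrame({
    "sentiment":         [1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
    "sentiment_chatgpt": [1, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# Share of rows where both labels match
agreement = (shown["sentiment"] == shown["sentiment_chatgpt"]).mean()
print("agreement on the displayed rows:", agreement)

# Counts of real label (rows) against ChatGPT label (columns)
table = pd.crosstab(shown["sentiment"], shown["sentiment_chatgpt"])
print(table)
```

On the full sample the same two lines could be applied directly to `df_for_chatgpt`. Ten rows are far too few to draw conclusions about accuracy.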


In [117]:
# Let's predict these reviews' sentiments using the classification model and compare to the real sentiment and to ChatGPT

X_df_for_chatgpt = tf_idf_vec.transform(df_for_chatgpt["review"])
df_predict = gb_tfidf.predict(X_df_for_chatgpt)

for i in range(len(df_for_chatgpt)):  # go over every review in the sample, including the last one
    print(df_for_chatgpt["review"][i])
    print ("\n predicted sentiment is: ",df_predict[i])
    print ("\n ChatGPT predicted sentiment is: ",df_for_chatgpt["sentiment_chatgpt"][i])
    print ("\n real sentiment is: ",df_for_chatgpt["sentiment"][i],"\n\n")

in       the teenager martha moxley  maggie grace  moves to the high class area of belle haven  greenwich  connecticut  on the mischief night  eve of halloween  she was murdered in the backyard of her house and her murder remained unsolved  twenty two years later  the writer mark fuhrman  christopher meloni   who is a former la detective that has fallen in disgrace for perjury in o j  simpson trial and moved to idaho  decides to investigate the case with his partner stephen weeks  andrew mitchell  with the purpose of writing a book  the locals squirm and do not welcome them  but with the support of the retired detective steve carroll  robert forster  that was in charge of the investigation in the    s  they discover the criminal and a net of power and money to cover the murder  br    br    murder in greenwich  is a good tv movie  with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a kennedy  the powerful and rich family 