<img src="logo.png" alt="Drawing" style="width: 80px;"/>

# Exercise 4 - Language Models - Amir Melnikov

In this lesson we will read movie reviews and predict their sentiment in two ways: with a classifier trained on tagged data, and with Generative AI.

Original data from: http://ai.stanford.edu/~amaas/data/sentiment/ (and modified CSV from https://github.com/rasbt/python-machine-learning-book-2nd-edition/tree/master/code/ch08/). 

This data was contributed by:
Maas, Andrew L., Daly, Raymond E., Pham, Peter T., Huang, Dan, Ng, Andrew Y., and Potts, Christopher. Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June 2011. http://www.aclweb.org/anthology/P11-1015

A full code solution for the classification model is available at: https://github.com/PacktPublishing/Python-Machine-Learning-Second-Edition//blob/master/Chapter08/ch08.ipynb

A good explanation can be found at: https://stackabuse.com/text-classification-with-python-and-scikit-learn/

### 1. Imports

In [1]:
import pandas as pd 
import numpy as np  
import matplotlib.pyplot as plt 

In [2]:
# Classification model and evaluation utilities
from sklearn import ensemble
from sklearn.metrics import classification_report  # Produces a table of precision, recall and F1 per class
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer

Packages for text analytics:

In [3]:
import re  # Regular expressions package
import nltk
# nltk.download()  # Downloads text data sets, including stop words. This takes a very long time, so run it only if needed

### 2. Read the movie reviews data

In [4]:
df = pd.read_csv('movie_data.csv')

### 2.5 Shorten each review to its first three words

In [5]:
df["review"] = df["review"].str.split().str[:3].str.join(sep=" ")
df

Unnamed: 0,review,sentiment
0,"In 1974, the",1
1,OK... so... I,0
2,***SPOILER*** Do not,0
3,hi for all,1
4,I recently bought,0
...,...,...
49995,"OK, lets start",0
49996,The British 'heritage,0
49997,I don't even,0
49998,Richard Tyler is,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB


In [7]:
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the",1
1,OK... so... I,0
2,***SPOILER*** Do not,0
3,hi for all,1
4,I recently bought,0


In [8]:
df['sentiment'].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

### 3. Split the data into train and test sets
Split the data before creating the bag of words, so that the vocabulary and the idf weights are learned from the training data only.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(df.values[:,0], df.values[:,1], test_size=0.3, random_state=0)

In [10]:
# Verify that both classes are well represented in y_train:
print("There are ",y_train.sum()," Positive sentiments out of ",y_train.shape[0])

There are  17342  Positive sentiments out of  35000


### 4. Step by step example

#### 4.1 Clean the data

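This stage is optional, since most of the cleaning can be done by TfidfVectorizer itself (for example through its lowercase parameter).
The cleaning can also be done before the split. Here it is applied to df after the split, so X_train and X_test keep their original casing and punctuation.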

Demonstrated below are:
1. Change all text to lower case.
2. Replace every character that is not a letter with a space.

In [11]:
df['review'] = df['review'].str.lower()  # Lowercase the entire text, so that a capitalized word is not counted separately
df.head()

Unnamed: 0,review,sentiment
0,"in 1974, the",1
1,ok... so... i,0
2,***spoiler*** do not,0
3,hi for all,1
4,i recently bought,0


In [12]:
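# Replace every character that is not a lowercase letter with a space.
# regex=True is needed: since pandas 2.0 the pattern is otherwise matched literally and the text stays unchanged, as in the recorded output below.
df['review'] = df['review'].str.replace('[^a-z]', ' ', regex=True)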
df.head()

Unnamed: 0,review,sentiment
0,"in 1974, the",1
1,ok... so... i,0
2,***spoiler*** do not,0
3,hi for all,1
4,i recently bought,0


You can further annotate movie names (replace each specific movie name with a general tag), as well as people's names. A package that can help with this is available at: https://imdbpy.sourceforge.io/.

Annotation requires a long, detailed mapping of words. However, open-source annotations are available for many domains.

#### 4.2. Create tf-idf bag of words
The bag of words is created with TfidfVectorizer, which is fitted on X_train only.

In [13]:
tf_idf_vec = TfidfVectorizer(ngram_range=(1,1), lowercase=True, stop_words=None)

In [14]:
X_train_bag = tf_idf_vec.fit_transform(X_train)  # note we do both the fit and transform on train

In [15]:
print ("size of X_train_bag is:", X_train_bag.shape)
print ("there are ", X_train_bag.shape[0]*X_train_bag.shape[1], " elements in the matrix" )

size of X_train_bag is: (35000, 9231)
there are  323085000  elements in the matrix


In [16]:
X_train_bag

<35000x9231 sparse matrix of type '<class 'numpy.float64'>'
	with 93024 stored elements in Compressed Sparse Row format>
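
The two outputs above show why the matrix is kept in a sparse format: only a tiny fraction of its cells is non-zero. A quick check of that fraction, using the numbers copied from the outputs above (on the real matrix, the stored count is available as `X_train_bag.nnz`):

```python
# Numbers copied from the outputs above
n_rows, n_cols = 35000, 9231
stored = 93024  # non-zero entries reported for the sparse matrix

total_cells = n_rows * n_cols
density = stored / total_cells            # fraction of cells that are non-zero
avg_nonzero_per_review = stored / n_rows  # average number of distinct known words per review

print(total_cells)
print(f"{density:.6f}")
print(f"{avg_nonzero_per_review:.2f}")
```

An average below three non-zero entries per row is consistent with the reviews having been cut to three words each.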

In [17]:
words = tf_idf_vec.get_feature_names_out()

In [18]:
words[6520:6530]

array(['quit', 'quite', 'quitting', 'quiz', 'quo', 'quote', 'rabbit',
       'race', 'raced', 'racer'], dtype=object)

In [19]:
# Show the overall "strength" of some words: the sum of each word's tf-idf weight over all training reviews
tfidftotal = np.asarray(X_train_bag[:, 6520:6530].sum(axis=0)).ravel()
tfidftotal

array([ 0.86735186, 57.99713717,  0.76997323,  0.63161671,  0.75962458,
        7.03327058,  1.39447712,  1.85374631,  0.92437471,  0.69347334])

### 5. Run Model based on tf_idf 

In [20]:
gb_tfidf = ensemble.GradientBoostingClassifier()

In [21]:
# Small test to check the model and the data, since fitting on the full data can take a very long time
#y_train_small = y_train[0:200]
#X_train_bag_small = X_train_bag[0:200]
#gb_tfidf.fit(X_train_bag_small, y_train_small.astype('int'))

In [22]:
#Fit model on bag-of-words
gb_tfidf.fit(X_train_bag, y_train.astype('int')) 

In [23]:
#Run prediction on Test
y_test_pred = gb_tfidf.predict(tf_idf_vec.transform(X_test)) # note we do ONLY transform on test data

In [24]:
print("Confusion Matrix : \n", confusion_matrix(y_test.astype('int'), y_test_pred),"\n")
print("The precision  is ",precision_score(y_test.astype('int'), y_test_pred)) 
print("The recall is ",recall_score(y_test.astype('int'), y_test_pred),"\n")  

Confusion Matrix : 
 [[5840 1502]
 [5172 2486]] 

The precision  is  0.623370110330993
The recall is  0.32462784016714546 
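
The precision and recall above can be recomputed by hand from the confusion matrix, which makes their meaning explicit. A short sketch using the numbers copied from the output above (in scikit-learn's confusion matrix, rows are the real classes and columns are the predicted classes):

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are the real classes (0, 1), columns are the predicted classes (0, 1)
cm = np.array([[5840, 1502],
               [5172, 2486]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of the reviews predicted positive, how many really are positive
recall = tp / (tp + fn)          # of the really positive reviews, how many were found
accuracy = (tp + tn) / cm.sum()  # share of all test reviews that were classified correctly

print(precision, recall, accuracy)
```

The accuracy comes out at roughly 0.555, only slightly better than guessing on a balanced data set, which is plausible given that each review was reduced to three words.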



In [25]:
#look at some examples
for i in range (10,20):
    print(X_test[i])
    print ("\n predicted sentiment is: ",y_test_pred[i])
    print ("\n real sentiment is: ",y_test[i],"\n\n")

This movie is

 predicted sentiment is:  0

 real sentiment is:  0 


This is one

 predicted sentiment is:  1

 real sentiment is:  0 


this was one

 predicted sentiment is:  0

 real sentiment is:  0 


I went to

 predicted sentiment is:  0

 real sentiment is:  1 


One of the

 predicted sentiment is:  1

 real sentiment is:  1 


Quite liked Flesh

 predicted sentiment is:  0

 real sentiment is:  0 


As is the

 predicted sentiment is:  1

 real sentiment is:  1 


Amazing movie that,

 predicted sentiment is:  1

 real sentiment is:  1 


Okay, I can

 predicted sentiment is:  0

 real sentiment is:  0 


Shame on Julia

 predicted sentiment is:  0

 real sentiment is:  0 




## Now let's try to do the same on the original data with ChatGPT

In [26]:
# Run on a small sample of the data to avoid accidentally wasting too much money
df_for_chatgpt = df.head(20).copy()  # copy, so that a new column can be added later without a pandas warning

In [27]:
print(df_for_chatgpt)

                   review  sentiment
0            in 1974, the          1
1           ok... so... i          0
2    ***spoiler*** do not          0
3              hi for all          1
4       i recently bought          0
5             leave it to          1
6   nathan detroit (frank          1
7    to understand "crash          1
8     i've been impressed          1
9           this movie is          1
10           i once lived          0
11     hidden frontier is          1
12           it's a while          0
13             what is it          0
14      this very strange          1
15             i saw this          0
16         there are some          0
17             i was cast          1
18             i had high          0
19             set in and          1


### The code below was obtained by asking ChatGPT and Bard (after initial errors when running it)

In [28]:
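import os
import openai  # Import the OpenAI package

# Read your OpenAI API key from an environment variable instead of hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY", "")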

# Define the function to analyze sentiment using ChatGPT
def analyze_sentiment_chatgpt(review):
    response = openai.completions.create(
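        # Note: text-davinci-003 has since been retired by OpenAI; replace it with a currently available completions model
        model="text-davinci-003",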
        prompt="Analyze the sentiment of the following text:\n" + review,
        max_tokens=15,  # Adjust as needed
        n=1,
        stop=None,
        temperature=0.5
    )

    sentiment_chatgpt = 0  # Default to negative sentiment
    if "positive" in response.choices[0].text.lower():
        sentiment_chatgpt = 1
    return sentiment_chatgpt

# Apply the sentiment analysis function to each review
df_for_chatgpt["sentiment_chatgpt"] = df_for_chatgpt["review"].apply(analyze_sentiment_chatgpt)

# Print the DataFrame with the new sentiment_chatgpt column
df_for_chatgpt

APIConnectionError: Connection error.
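
The keyword check in the cell above treats any reply that contains "positive" as positive, including a reply such as "not positive". A slightly stricter parser could look like the sketch below. `parse_sentiment_reply` is a hypothetical helper written for illustration, not part of the openai package, and its list of negation words is an assumption that covers only the simplest cases:

```python
import re

def parse_sentiment_reply(reply):
    """Map a free-text model reply to 1 (positive), 0 (negative) or None (unclear).

    Hypothetical helper for illustration; it is not part of the openai package.
    """
    words = re.findall(r"[a-z]+", reply.lower())
    has_pos = "positive" in words
    has_neg = "negative" in words
    negated = any(w in ("not", "no", "neither") for w in words)
    if negated or has_pos == has_neg:
        return None  # ambiguous or missing label: leave it for manual review
    return 1 if has_pos else 0

print(parse_sentiment_reply("Positive."))
print(parse_sentiment_reply("The sentiment is negative"))
print(parse_sentiment_reply("not positive"))
```

Returning None for unclear replies, instead of defaulting to negative, keeps ambiguous answers from being silently counted as negative predictions.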

In [None]:
# Predict the sentiment of these reviews with the classification model, and compare it to the real sentiment and to ChatGPT

X_df_for_chatgpt = tf_idf_vec.transform(df_for_chatgpt["review"])
df_predict = gb_tfidf.predict(X_df_for_chatgpt)

for i in range(len(df_for_chatgpt)):  # go over all the sampled reviews
    print(df_for_chatgpt["review"][i])
    print ("\n predicted sentiment is: ",df_predict[i])
    print ("\n ChatGPT predicted sentiment is: ",df_for_chatgpt["sentiment_chatgpt"][i])
    print ("\n real sentiment is: ",df_for_chatgpt["sentiment"][i],"\n\n")
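
The side-by-side printout above can also be summarized as one accuracy figure per predictor. A minimal sketch, using made-up labels (they are not results from this notebook) and an assumed column name `sentiment_tfidf` for the classifier's predictions:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up values for illustration only; they are not results from this notebook
comparison = pd.DataFrame({
    "sentiment":         [1, 0, 0, 1, 0],  # real labels
    "sentiment_tfidf":   [1, 1, 0, 0, 0],  # assumed column holding the tf-idf model's predictions
    "sentiment_chatgpt": [1, 0, 0, 1, 1],  # column holding the ChatGPT predictions
})

scores = {}
for col in ["sentiment_tfidf", "sentiment_chatgpt"]:
    scores[col] = accuracy_score(comparison["sentiment"], comparison[col])
    print(col, "accuracy:", scores[col])
```

In the notebook itself, the same loop could be run on df_for_chatgpt after storing df_predict in a new column, once the ChatGPT cell has completed successfully.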