# Amazon Fine Food Reviews Analysis

Data set Source: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews data set consists of reviews of fine foods from Amazon.

- Total Data points: 568,454
- Total Unique Users: 256,059
- Total Products: 74,258
- Time Period: Oct 1999 - Oct 2012
- Number of Attributes/Columns in data: 10

Attributes/Features:
1. Id
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator
6. HelpfulnessDenominator
7. Score
8. Time
9. Summary
10. Text
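
The `Time` values in this data set are Unix timestamps in seconds (for example `1303862400`). A minimal sketch, using only the standard library, of turning two timestamps that appear in the tables further down into calendar dates. The helper name `to_date` is made up for this illustration.

```python
from datetime import datetime, timezone

def to_date(ts):
    """Convert a Unix timestamp in seconds to an ISO date string (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()

print(to_date(939340800))   # Time value of the first row in the cleaned data
print(to_date(1351209600))  # Time value of the last row shown in the cleaned data
```

Both dates fall inside the Oct 1999 - Oct 2012 range listed above.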

## Objective:

Given a review, determine whether the review is positive or negative.

For this project we use the 'Score' column as the basis for the class label: a rating of 4 or higher is categorized as a positive review, and a rating of 2 or lower as a negative review.
A rating of 3 is neutral and is ignored.
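
The labeling rule can be sketched as a small function. This is only an illustration: `label_from_score` is a hypothetical name, and unlike the notebook (which removes 3-star reviews in the SQL query before labeling) it handles the neutral case explicitly by returning `None`.

```python
def label_from_score(score):
    """Return 'Positive', 'Negative', or None for a neutral rating of 3."""
    if score >= 4:
        return "Positive"
    if score <= 2:
        return "Negative"
    return None  # neutral reviews are ignored

# Map every possible star rating to its label
labels = {s: label_from_score(s) for s in range(1, 6)}
print(labels)
```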

In [6]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import string
import nltk
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

In [7]:
conn = sqlite3.connect('database.sqlite')

In [8]:
raw_filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""",conn)

In [9]:
raw_filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [10]:
raw_filtered_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [11]:
def make_label(x):
    if x > 3:
        return 'Positive'
    else: 
        return 'Negative'

In [12]:
raw_filtered_data['Score'] = raw_filtered_data['Score'].map(make_label)
raw_filtered_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,Positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,Negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,Positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,Negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,Positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...
...,...,...,...,...,...,...,...,...,...,...
525809,568450,B001EO7N10,A28KG5XORO54AY,Lettie D. Carter,0,0,Positive,1299628800,Will not do without,Great for sesame chicken..this is a good if no...
525810,568451,B003S1WTCU,A3I8AFVPEE8KI5,R. Sawyer,0,0,Negative,1331251200,disappointed,I'm disappointed with the flavor. The chocolat...
525811,568452,B004I613EE,A121AA1GQV751Z,"pksd ""pk_007""",2,2,Positive,1329782400,Perfect for our maltipoo,"These stars are small, so you can give 10-15 o..."
525812,568453,B004I613EE,A3IBEVCTXKNOH,"Kathy A. Welch ""katwel""",1,1,Positive,1331596800,Favorite Training and reward treat,These are the BEST treats for training and rew...


## Data Cleaning

In [13]:
display = pd.read_sql_query("""
SELECT userID, productID, count(userID) as Common_Count
FROM Reviews
WHERE Score != 3 
GROUP BY userID, ProfileName , Summary , Text, Time
HAVING count(userID) > 1
ORDER BY ProductID
""", conn)
display

Unnamed: 0,UserId,ProductId,Common_Count
0,A1033RWNZWEMR5,B00004RYGX,3
1,A1048CYU0OV4O8,B00004RYGX,3
2,A10L8O1ZMUIMR2,B00004RYGX,3
3,A11QN7NDE3AROH,B00004RYGX,3
4,A157XTSMJH9XA4,B00004RYGX,3
...,...,...,...
53004,A8Y0K5BEEW2LP,B009NTCO4O,5
53005,AVFP6Y641X6CZ,B009NTCO4O,5
53006,A3RTG00GE3NXQU,B009OM66IU,4
53007,ABE4NMHQOP4W4,B009OM66IU,4


In [14]:
sorted_filtered_data = raw_filtered_data.sort_values('ProductId', ascending=True)

In [15]:
sorted_filtered_data.shape

(525814, 10)

In [16]:
print(f"Shape Before: {sorted_filtered_data.shape}")
final_filtered_data = sorted_filtered_data.drop_duplicates(subset=["UserId", "Summary", "Text", "Time"], keep='first', inplace=False)
print(f"Shape After:  {final_filtered_data.shape}")

Shape Before: (525814, 10)
Shape After:  (365293, 10)
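
What `drop_duplicates(subset=..., keep='first')` does can be sketched in plain Python on toy rows. The data and the helper `drop_duplicates_keep_first` below are invented for illustration. Rows that agree on every key field count as the same review, and only the first one encountered survives.

```python
def drop_duplicates_keep_first(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Toy data: the same user posted an identical review under two product ids
reviews = [
    {"ProductId": "P1", "UserId": "U1", "Summary": "Great", "Text": "Loved it", "Time": 100},
    {"ProductId": "P2", "UserId": "U1", "Summary": "Great", "Text": "Loved it", "Time": 100},
    {"ProductId": "P3", "UserId": "U2", "Summary": "Bad", "Text": "Stale", "Time": 200},
]
deduped = drop_duplicates_keep_first(reviews, ["UserId", "Summary", "Text", "Time"])
print([r["ProductId"] for r in deduped])
```

Because the first occurrence wins, sorting by `ProductId` beforehand decides which copy of a duplicated review is kept.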


In [17]:
print(f"Percentage of reviews retained after de-duplication: {final_filtered_data.shape[0]*100/sorted_filtered_data.shape[0]}")

Percentage of reviews retained after de-duplication: 69.47190451376342


In [18]:
display = pd.read_sql_query("""
SELECT * FROM Reviews
WHERE Score != 3 AND HelpfulnessNumerator > HelpfulnessDenominator
""",conn)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
1,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


In [19]:
final_filtered_data = final_filtered_data[final_filtered_data.HelpfulnessNumerator <= final_filtered_data.HelpfulnessDenominator]

In [20]:
final_filtered_data.shape

(365291, 10)

In [21]:
final_filtered_data['Score'].value_counts()

Score
Positive    307932
Negative     57359
Name: count, dtype: int64
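
The counts above show a strong class imbalance. A short sketch of the arithmetic, using the two counts exactly as reported; the variable names are illustrative. The share of the larger class is the accuracy that a classifier would reach by always predicting `Positive`, which is a useful reference point for later results.

```python
# Class counts as reported by value_counts() above
counts = {"Positive": 307932, "Negative": 57359}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(shares.values())

print(total)
print({label: round(share, 3) for label, share in shares.items()})
```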

## Bag of Words

In [22]:
count_vect = CountVectorizer() # To compute Bag of Words representation(from scikit-learn) 
final_counts = count_vect.fit_transform(final_filtered_data['Text'].values)

In [23]:
final_counts

<365291x115281 sparse matrix of type '<class 'numpy.int64'>'
	with 19418882 stored elements in Compressed Sparse Row format>
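
The matrix above has one row per review and one column per vocabulary word. A conceptual sketch of that representation on two toy sentences, using only the standard library. It splits on whitespace and stores dense lists, whereas `CountVectorizer` applies its own tokenization and returns a sparse matrix, so treat this as an illustration of the idea rather than of the library's internals.

```python
from collections import Counter

docs = ["great taffy great price", "taffy was stale"]

# Vocabulary: every distinct word, in sorted order (one column per word)
vocab = sorted({w for d in docs for w in d.split()})

# Count matrix: one row per document, one count per vocabulary word
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```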

## Text Preprocessing:
### - Stemming
### - Stop-words removal
### - Lemmatization

### Steps
- Remove HTML tags (e.g. `<br>`, `<a href>`)
- Remove punctuation marks and special characters (e.g. `.`, `,`, `#`)
- Check that each word consists only of English letters and is not alphanumeric junk (e.g. `a87sd87`)
- Keep only words longer than 2 letters (the assumption being that no adjective has 2 letters or fewer)
- Convert to lowercase
- Remove stop words
- Perform stemming (Snowball stemmer) to reduce related word forms (e.g. taste, tasty, tasteful) toward a common stem
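
The steps above can be sketched as a single function applied to one review. This is a simplified illustration, not the notebook's exact code: `preprocess` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny subset, and stemming is left as a pluggable `stem` argument (identity by default) so the example runs without NLTK. Passing `SnowballStemmer('english').stem` would add the stemming step. Note also that this sketch cleans punctuation before splitting into words, which is a choice of the sketch.

```python
import re

# Tiny illustrative stop-word set (hypothetical subset of a real list)
DEMO_STOP_WORDS = {"this", "the", "and", "not", "was", "for", "are"}

def preprocess(text, stop_words=DEMO_STOP_WORDS, stem=lambda w: w):
    """Apply the listed cleaning steps to one review and return the kept tokens."""
    text = re.sub(r"<.*?>", " ", text)            # remove HTML tags
    text = re.sub(r"[?!'\"#]", "", text)          # delete some punctuation outright
    text = re.sub(r"[.,)(|/]", " ", text)         # replace the rest with spaces
    tokens = []
    for word in text.split():
        if not word.isalpha() or len(word) <= 2:  # letters only, longer than 2
            continue
        word = word.lower()                       # convert to lowercase
        if word in stop_words:                    # remove stop words
            continue
        tokens.append(stem(word))                 # stem (identity by default)
    return tokens

demo = preprocess("This taffy was GREAT!<br />Not sticky, and the a87sd87 code is junk.")
print(demo)
```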

In [24]:
import re
import string 
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [25]:
stop_words = {'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'}
snow = nltk.stem.SnowballStemmer('english')

In [26]:
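def cleanhtml(sentence):
    # Replace anything that looks like an HTML tag with a space
    cleanr = re.compile(r'<.*?>')
    cleantext = re.sub(cleanr,' ',sentence)
    return cleantext

def clean_punctuation(sentence):
    # Inside [...] every character is literal, so no '|' separators are needed
    cleaned = re.sub(r'[?|!\'"#]',r'',sentence)   # delete ? | ! ' " #
    cleaned = re.sub(r'[.,)(|/]',r' ',cleaned)    # replace . , ) ( | / with a space
    return cleaned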

In [27]:
final_filtered_data['Text'].values[0]

"this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college"

In [28]:
all_positive_words = []
all_negative_words = []
final_strings = []
scores = final_filtered_data['Score'].values  # fetch the labels once, outside the loop
for sent, score in zip(final_filtered_data['Text'].values, scores):
    filtered_sentence = []
    sent = cleanhtml(sent)  # strip HTML tags
    for w in sent.split():
        cleaned_word = clean_punctuation(w)
        # keep purely alphabetic words longer than two characters
        if cleaned_word.isalpha() and len(cleaned_word) > 2:
            if cleaned_word.lower() not in stop_words:
                s = snow.stem(cleaned_word.lower()).encode('utf8')
                filtered_sentence.append(s)
                if score == 'Positive':
                    all_positive_words.append(s)
                else:
                    all_negative_words.append(s)
    # one cleaned, space-joined byte string per review
    final_strings.append(b" ".join(filtered_sentence))

## Add cleaned text to the dataframe

In [29]:
final_filtered_data['CleanedText'] = final_strings

In [30]:
final_filtered_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,Positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh recit car dr...
138688,150506,0006641040,A2IW4PEEKO2R0U,Tracy,1,1,Positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak watch realli rosi movi inco...
138689,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,Positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,b'fun way children learn month year learn poem...
138690,150508,0006641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,Positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,b'great littl book read nice rhythm well good ...
138691,150509,0006641040,A3CMRKGE0P909G,Teresa,3,4,Positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,b'book poetri month goe month cute littl poem ...
...,...,...,...,...,...,...,...,...,...,...,...
178145,193174,B009RSR8HO,A4P6AN2L435PV,romarc,0,0,Positive,1350432000,LOVE!! LOVE!!,"LOVE, LOVE this sweetener!! I use it in all m...",b'love sweeten use unsweeten flavor unsweeten ...
173675,188389,B009SF0TN6,A1L0GWGRK4BYPT,Bety Robinson,0,0,Positive,1350518400,Amazing!! Great sauce for everything!,You have to try this sauce to believe it! It s...,b'tri sauc believ start littl sweet honey tast...
204727,221795,B009SR4OQ2,A32A6X5KCP7ARG,sicamar,1,1,Positive,1350604800,Awesome Taste,I bought this Hazelnut Paste (Nocciola Spread)...,b'bought hazelnut past local shop palm tast ex...
5259,5703,B009WSNWC4,AMP7K1O84DH1T,ESTY,0,0,Positive,1351209600,DELICIOUS,Purchased this product at a local store in NY ...,b'purchas product local store kid love quick e...


## Create a new database with cleaned text for future use.

In [31]:
conn = sqlite3.connect('final.sqlite')
c = conn.cursor()
conn.text_factory = str
final_filtered_data.to_sql('Reviews', conn,schema = None, if_exists = 'replace')

365291

In [32]:
conn = sqlite3.connect('final.sqlite')
final_filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
""",conn)

## Observing the Cleaned text

In [33]:
final_filtered_data

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
0,138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,Positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh recit car dr...
1,138688,150506,0006641040,A2IW4PEEKO2R0U,Tracy,1,1,Positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak watch realli rosi movi inco...
2,138689,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,Positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,b'fun way children learn month year learn poem...
3,138690,150508,0006641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,Positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,b'great littl book read nice rhythm well good ...
4,138691,150509,0006641040,A3CMRKGE0P909G,Teresa,3,4,Positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,b'book poetri month goe month cute littl poem ...
...,...,...,...,...,...,...,...,...,...,...,...,...
365286,178145,193174,B009RSR8HO,A4P6AN2L435PV,romarc,0,0,Positive,1350432000,LOVE!! LOVE!!,"LOVE, LOVE this sweetener!! I use it in all m...",b'love sweeten use unsweeten flavor unsweeten ...
365287,173675,188389,B009SF0TN6,A1L0GWGRK4BYPT,Bety Robinson,0,0,Positive,1350518400,Amazing!! Great sauce for everything!,You have to try this sauce to believe it! It s...,b'tri sauc believ start littl sweet honey tast...
365288,204727,221795,B009SR4OQ2,A32A6X5KCP7ARG,sicamar,1,1,Positive,1350604800,Awesome Taste,I bought this Hazelnut Paste (Nocciola Spread)...,b'bought hazelnut past local shop palm tast ex...
365289,5259,5703,B009WSNWC4,AMP7K1O84DH1T,ESTY,0,0,Positive,1351209600,DELICIOUS,Purchased this product at a local store in NY ...,b'purchas product local store kid love quick e...


In [34]:
pd.read_sql_query("""
SELECT count(*)
FROM Reviews
WHERE Score == 'Positive'
""",conn)

Unnamed: 0,count(*)
0,307932


In [35]:
pd.read_sql_query("""
SELECT count(*)
FROM Reviews
WHERE Score == 'Negative'
""",conn)

Unnamed: 0,count(*)
0,57359


### We now have words categorized into positive and negative.

In [36]:
all_positive_words[::1000000]

[b'witti',
 b'purchas',
 b'youv',
 b'easili',
 b'eat',
 b'mani',
 b'old',
 b'shelf',
 b'assum',
 b'announc']

In [37]:
all_negative_words[::100000]

[b'one',
 b'mix',
 b'send',
 b'relas',
 b'nasti',
 b'buy',
 b'though',
 b'state',
 b'cigarett',
 b'process',
 b'hour',
 b'agre',
 b'box',
 b'assum',
 b'coffe',
 b'contact',
 b'eat',
 b'cooki',
 b'howev',
 b'get']

### Make a Frequency Distribution for the TF-IDF calculation.

In [38]:
freq_dist_positive = nltk.FreqDist(all_positive_words)
freq_dist_negative = nltk.FreqDist(all_negative_words)
freq_dist_positive

FreqDist({b'like': 133703, b'tast': 104680, b'love': 104420, b'use': 97996, b'great': 88889, b'good': 88643, b'one': 86499, b'tri': 77071, b'flavor': 74857, b'make': 72695, ...})

In [39]:
print("Most common positive words: ",freq_dist_positive.most_common()[:10])
print("Most common negative words: ",freq_dist_negative.most_common()[:10])

Most common positive words:  [(b'like', 133703), (b'tast', 104680), (b'love', 104420), (b'use', 97996), (b'great', 88889), (b'good', 88643), (b'one', 86499), (b'tri', 77071), (b'flavor', 74857), (b'make', 72695)]
Most common negative words:  [(b'like', 31019), (b'tast', 28727), (b'product', 20698), (b'one', 17873), (b'would', 17864), (b'tri', 16426), (b'use', 14469), (b'get', 13513), (b'buy', 13283), (b'flavor', 13238)]


## Observation
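- The most frequent words in positive and negative reviews largely overlap (e.g. `like`, `tast`, `one`, `tri`, `use`, `flavor`), so single words separate the two classes poorly
- Using bi-grams may help with this issue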
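
The overlap can be quantified directly from the two top-10 lists printed above. A small sketch with the words copied from that output; the variable names are illustrative.

```python
# Top-10 stems per class, copied from the output above
top_positive = ["like", "tast", "love", "use", "great", "good", "one", "tri", "flavor", "make"]
top_negative = ["like", "tast", "product", "one", "would", "tri", "use", "get", "buy", "flavor"]

negative_set = set(top_negative)
positive_set = set(top_positive)

shared = [w for w in top_positive if w in negative_set]
only_positive = [w for w in top_positive if w not in negative_set]
only_negative = [w for w in top_negative if w not in positive_set]

print("shared:", shared)
print("only positive:", only_positive)
print("only negative:", only_negative)
```

Six of the ten most frequent stems appear in both lists.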

# TF-IDF

In [40]:
tf_idf_vect = TfidfVectorizer(ngram_range = (1,2))
final_tf_idf = tf_idf_vect.fit_transform(final_filtered_data['CleanedText'].values)

In [41]:
final_tf_idf

<365291x2591202 sparse matrix of type '<class 'numpy.float64'>'
	with 20643824 stored elements in Compressed Sparse Row format>
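
For reference, a minimal sketch of the weighting behind each stored value, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts): the weight of a term is its count times `ln((1 + n) / (1 + df)) + 1`, and each document vector is then scaled to unit length. The function `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute smoothed, L2-normalized TF-IDF weights for tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = [["great", "taffi"], ["great", "price"], ["stale", "taffi"]]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In the second toy document, `price` occurs in fewer documents than `great` and therefore receives the larger weight.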

In [42]:
features = tf_idf_vect.get_feature_names_out()
len(features)

2591202

In [43]:
features[100000:100010]

array(['applesauc dont', 'applesauc dri', 'applesauc easili',
       'applesauc eat', 'applesauc egg', 'applesauc enjoy',
       'applesauc even', 'applesauc ever', 'applesauc everi',
       'applesauc everytim'], dtype=object)

In [44]:
print(len(final_tf_idf[3,:].toarray()[0]))

2591202


### Create a function that returns the uni-grams and bi-grams with the highest TF-IDF values, in rank order

- Because uni-grams alone gave ambiguous results, we also consider bi-grams. Both in this dataset and in everyday language, words carry more meaning when read in context.
- "like" is the most common word in both the positive and the negative word lists, but in the actual reviews its meaning is often altered by the word that precedes or follows it.
- e.g. the phrases "do like" and "don't like" become indistinguishable once they are reduced to the uni-gram "like".
- Therefore we use bi-grams, which offer a good balance between the size of the feature set and the quality of the output.
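
A short sketch of the idea, roughly what `ngram_range = (1,2)` asks the vectorizer to generate: every single token plus every pair of adjacent tokens. The helper `ngrams` and the two toy token lists are made up for illustration.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

positive = ngrams(["realli", "like", "flavor"])
negative = ngrams(["dont", "like", "flavor"])

print(positive)
print(negative)
# Features that occur in only one of the two toy reviews
print(sorted(set(positive) ^ set(negative)))
```

Both toy reviews share the uni-gram `like`, but only the second one produces the bi-gram `dont like`, which keeps the negation attached to the word it modifies.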

In [45]:
def top_tfidf_features(row,features,top_n=25):
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_features(final_tf_idf[1,:].toarray()[0], features,25)

In [46]:
top_tfidf

Unnamed: 0,feature,tfidf
0,cover paperback,0.206787
1,sendak watch,0.206787
2,rosi movi,0.206787
3,movi incorpor,0.206787
4,grew read,0.206787
5,flimsi take,0.206787
6,keep page,0.206787
7,paperback seem,0.206787
8,read sendak,0.206787
9,incorpor love,0.200394


In [47]:
final_filtered_data

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
0,138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,Positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh recit car dr...
1,138688,150506,0006641040,A2IW4PEEKO2R0U,Tracy,1,1,Positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak watch realli rosi movi inco...
2,138689,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,Positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,b'fun way children learn month year learn poem...
3,138690,150508,0006641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,Positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,b'great littl book read nice rhythm well good ...
4,138691,150509,0006641040,A3CMRKGE0P909G,Teresa,3,4,Positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,b'book poetri month goe month cute littl poem ...
...,...,...,...,...,...,...,...,...,...,...,...,...
365286,178145,193174,B009RSR8HO,A4P6AN2L435PV,romarc,0,0,Positive,1350432000,LOVE!! LOVE!!,"LOVE, LOVE this sweetener!! I use it in all m...",b'love sweeten use unsweeten flavor unsweeten ...
365287,173675,188389,B009SF0TN6,A1L0GWGRK4BYPT,Bety Robinson,0,0,Positive,1350518400,Amazing!! Great sauce for everything!,You have to try this sauce to believe it! It s...,b'tri sauc believ start littl sweet honey tast...
365288,204727,221795,B009SR4OQ2,A32A6X5KCP7ARG,sicamar,1,1,Positive,1350604800,Awesome Taste,I bought this Hazelnut Paste (Nocciola Spread)...,b'bought hazelnut past local shop palm tast ex...
365289,5259,5703,B009WSNWC4,AMP7K1O84DH1T,ESTY,0,0,Positive,1351209600,DELICIOUS,Purchased this product at a local store in NY ...,b'purchas product local store kid love quick e...


In [48]:
from sklearn.preprocessing import OrdinalEncoder
trf1 = OrdinalEncoder(categories=[['Positive', 'Negative']])  # Creating OrdinalEncoder object
y_final = trf1.fit_transform(final_filtered_data[['Score']])  # Fitting and transforming data

In [49]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

In [50]:
x_train, x_test, y_train, y_test = train_test_split(final_tf_idf,y_final,train_size = 0.7,random_state = 42) 

# Applying Boosted Decision Trees

## Use Optuna to fine-tune the hyperparameters

In [51]:
import optuna
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_validate


def objective(trial):
    max_depth = trial.suggest_int("max_depth", 1, 10)
    min_child_weight = trial.suggest_float("min_child_weight", 0, 10)
    gamma = trial.suggest_float("gamma", 0, 1)
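    # Note: XGBoost expects colsample_bytree and subsample in (0, 1]; values sampled close to 0 leave almost no data per tree
    colsample_bytree = trial.suggest_float("colsample_bytree", 0, 1)
    subsample = trial.suggest_float("subsample", 0, 1)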
    reg_alpha = trial.suggest_float("reg_alpha", 0, 1)
    n_estimators = trial.suggest_int("n_estimators", 64, 128)
    
    params = {
        "max_depth": max_depth,
        "min_child_weight": min_child_weight,
        "gamma": gamma,
        "colsample_bytree": colsample_bytree,
        "subsample": subsample,
        "reg_alpha": reg_alpha,
        "n_estimators": n_estimators,
        "num_class": 2  # Assuming binary classification
    }
    
    params["learning_rate"] = 0.1
    params["objective"] = "multi:softmax"
    params["nthread"] = -1
    
    model = XGBClassifier(**params)

    cv_results = cross_validate(model, x_train, y_train, cv=3, scoring='accuracy')
    validation_score = np.mean(cv_results['test_score'])
    
    return -validation_score  # Optimize for the negative of the validation score

sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)  # Minimize the negative validation score
study.optimize(objective, n_trials=200)
df_study = study.trials_dataframe()
df_study_best = df_study.sort_values(by='value', ascending=True)
best_params = study.best_params
print(best_params)


[I 2024-02-12 00:40:41,661] Trial 40 finished with value: 0.8908377224498026 and parameters: {'max_depth': 9, 'min_child_weight': 1.5024504039270885, 'gamma': 0.4967196723475804, 'colsample_bytree': 0.9142646046644476, 'subsample': 0.9707744203981367, 'reg_alpha': 0.8413627673548031, 'n_estimators': 120}. Best is trial 40 with value: 0.8908377224498026.

In [55]:
model = XGBClassifier(
    max_depth=9,
    min_child_weight=1.5024504039270885,
    gamma=0.4967196723475804,
    colsample_bytree=0.9142646046644476,
    subsample=0.9707744203981367,
    reg_alpha=0.8413627673548031,
    n_estimators=120,
    num_class=2,
    learning_rate=0.1,
    objective="multi:softmax",
    nthread=-1
)

In [56]:
model.fit(x_train,y_train)

In [57]:
predictions = model.predict(x_test)

In [59]:
accuracy = accuracy_score(y_test, predictions)  # signature is (y_true, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8907544621673906


In [61]:
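# Note: predictions are passed first, so rows are predicted labels and columns are true labels (0 = Positive, 1 = Negative)
confusion_matrix(predictions,y_test)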

array([[91456, 10948],
       [ 1024,  6160]], dtype=int64)
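
Accuracy alone hides how differently the two classes are treated. A small sketch that derives per-class figures from the matrix printed above, using only the reported numbers. It assumes rows are predicted labels and columns are true labels (because the predictions were passed as the first argument), with index 0 = Positive and 1 = Negative as set by the `OrdinalEncoder` categories. The variable names are illustrative.

```python
# Confusion matrix as printed above: rows = predicted, columns = true
cm = [[91456, 10948],
      [1024, 6160]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

truly_positive = cm[0][0] + cm[1][0]  # test reviews whose true label is Positive
truly_negative = cm[0][1] + cm[1][1]  # test reviews whose true label is Negative

recall_positive = cm[0][0] / truly_positive
recall_negative = cm[1][1] / truly_negative
precision_negative = cm[1][1] / (cm[1][0] + cm[1][1])

# Accuracy of always predicting Positive on this test split
majority_baseline = truly_positive / total

print(f"accuracy           {accuracy:.4f}")
print(f"recall (Positive)  {recall_positive:.4f}")
print(f"recall (Negative)  {recall_negative:.4f}")
print(f"precision (Neg.)   {precision_negative:.4f}")
print(f"majority baseline  {majority_baseline:.4f}")
```

Under these assumptions the model recovers about 99% of the positive reviews but only about 36% of the negative ones, and its accuracy of roughly 0.89 sits a few points above the roughly 0.84 obtained by always predicting Positive.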

In [62]:
import pickle

with open('fine_foods_xgboost_model.pkl', 'wb') as f:
    pickle.dump(model, f)
