# 1. The motivation behind the presented solutions
Initially, I used a Gaussian Naive Bayes model, but its accuracy was very low, so I switched to a multinomial Naive Bayes (MNB) model. The main difference between the two lies in what they assume about the features. The Gaussian model assumes that each feature is continuous and follows a Gaussian (normal) distribution within each class. The multinomial model assumes that the features are discrete, such as word frequencies in text or the presence or absence of words in a bag-of-words model.

The feature types in the dataset should therefore drive the choice: the Gaussian model for continuous numerical features, the multinomial model for discrete features. In text classification, the bag-of-words model is commonly used as the feature representation, where each feature records how often (or whether) a word occurs, so the multinomial model is the appropriate choice for this task.

# 2. Data preprocessing generally includes:
Data cleaning: dealing with issues such as missing values, outliers, duplicate values, and incorrect values.
Data reduction: dimensionality reduction, quantity reduction, and data compression. Common methods include PCA, feature selection, and regression models.
Data transformation and discretization: uniformly transforming features, for example by normalization or standardization. There are generally two ways to standardize data: 1) min-max normalization; 2) z-score normalization. In min-max normalization, data is scaled to the range between 0 and 1, while in z-score normalization, data is rescaled to have a mean of 0 and a standard deviation of 1. However, these operations are generally intended for continuous data, whereas the data in this task is text, which is usually discrete.
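The two scaling methods mentioned above can be sketched in plain Python. This is a minimal illustration rather than the code used in this notebook: the function names and the sample values are assumptions, and the z-score uses the population standard deviation.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Center values on their mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

checkin_hours = [13.0, 16.0, 17.0, 18.0, 19.0]  # illustrative values only
scaled_01 = min_max_scale(checkin_hours)
scaled_z = z_score_scale(checkin_hours)
```

Note that z-scores are negative for every value below the mean. scikit-learn's MultinomialNB rejects negative feature values, which is one practical reason standardization does not combine well with count features.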
Here is my data preprocessing flow:

2.1 Cleaning and text normalization (note that text normalization and data standardization are not the same concepts), including replacing punctuation with spaces, converting all text to lowercase, tokenization, removing stop words, lemmatization, recombining the processed sentences, etc. The purpose of text normalization is to reduce the dimensionality of text data, remove noise, reduce the risk of model overfitting, and improve the accuracy of the model. 
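The cleaning steps listed in 2.1 can be wrapped in one reusable function so that exactly the same processing is applied to training and test text. The sketch below uses only the standard library: the stop-word set is a hypothetical, deliberately tiny stand-in for a full list, whitespace splitting stands in for a real tokenizer, and lemmatization is left out because it needs an external library.

```python
import string

# Hypothetical, deliberately tiny stop-word list; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "is", "and", "it", "this", "on", "in"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def clean_text(sentence, stop_words=STOP_WORDS):
    """Replace punctuation with spaces, lowercase, split on whitespace, drop stop words."""
    lowered = sentence.translate(PUNCT_TO_SPACE).lower()
    tokens = lowered.split()
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(kept)

cleaned = clean_text("Ok, not a great location. But - the girl is NICE!")
```

Keeping the steps in a single function avoids the train and test pipelines drifting apart when one copy of the loop is edited and the other is not.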

2.2 Vectorizing text data. Machine learning models can only handle numerical data, and text is unstructured data that cannot be fed directly into a model for training and prediction. Therefore, the text must be converted into numerical vectors that the model can process. The vectorization process includes splitting the text into words and counting the number of times each word appears in the text, or using another feature representation. Finally, all text is converted into a matrix in which each row is the feature vector of one text and each column is one feature. I used a common text vectorization method, the bag-of-words model.
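To make the row/column layout concrete, here is a minimal bag-of-words sketch in plain Python. It is an illustration under simplifying assumptions (whitespace tokenization, hypothetical helper names, made-up documents) and is not the behavior of any particular library.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}

def to_count_matrix(documents, vocabulary):
    """Return one row per document and one column per vocabulary word."""
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts.get(word, 0) for word in vocabulary])
    return matrix

docs = ["great food great service", "slow service", "great coffee"]
vocab = build_vocabulary(docs)
X_counts = to_count_matrix(docs, vocab)
# Columns follow the sorted vocabulary: coffee, food, great, service, slow
```

Each row has as many entries as there are vocabulary words, and most entries are zero, which is why real vectorizers return sparse matrices.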

# 3. My idea and solution implementation for each of the two tasks listed above

First, baseline:

In this step, I only vectorized the text data and built an MNB model, using cross-validation to obtain an initial model, model1.

In the second step, I performed data preprocessing on the text data and improved the baseline model based only on the 'review' attribute for task 1. I built a second MNB model, model2, which differs from model1 only in the added preprocessing operations. Then, I applied Laplace smoothing: I added the alpha parameter to model2, searched over a range of values, and used the optimal alpha to build model3. I submitted it to Kaggle and found that the model's score reached the baseline score of 89.449.
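The effect of the alpha parameter can be illustrated with a small additive-smoothing sketch. The function name, vocabulary, and documents are assumptions made for illustration; the formula is the usual smoothed estimate (count + alpha) / (total + alpha * vocabulary size).

```python
from collections import Counter

def smoothed_word_probs(class_documents, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    counts = Counter(word for doc in class_documents for word in doc.split())
    total = sum(counts[word] for word in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denom for word in vocabulary}

vocabulary = ["beer", "burger", "shoes"]
restaurant_docs = ["burger burger beer", "burger"]  # made-up class documents

probs_alpha_1 = smoothed_word_probs(restaurant_docs, vocabulary, alpha=1.0)
probs_alpha_small = smoothed_word_probs(restaurant_docs, vocabulary, alpha=0.1)
```

With alpha = 1 the unseen word "shoes" gets probability 1/7 instead of 0, so a single unseen word cannot zero out the whole class likelihood. A smaller alpha gives unseen words less probability mass, which is the trade-off the alpha search explores.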

In the third step, I performed task 2, which was to improve the model by adding extra attributes. I chose to use the 'name' attribute together with 'review' and thought of two methods. I wanted to know which one is more appropriate:

The first method: preprocess and vectorize each feature column separately, which yields one matrix per column; convert each matrix to a dataframe, merge them (using concat), and convert the merged dataframe back to a matrix for the training set. The second method: merge the feature columns into a single text column first (using +), then preprocess and vectorize the result once.

For the second method I could only use the expression "X_combined = data1['review'] + ' ' + data1['name']". If I used concat, X_combined1 would become a dataframe of shape (2838, 2), and because cleaned_data was built column by column, preprocessing would produce a matrix with twice as many rows as y (with several feature columns, several times as many). In the end, I chose the second method because I found that method 1 was very cumbersome when vectorizing the test set later.
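The difference between the two methods can be seen on a toy example. The sketch below is plain Python with made-up values and a hypothetical helper, `count_rows`. With scikit-learn, method 1 would more likely be built with tools such as `scipy.sparse.hstack` or `ColumnTransformer`, which is an assumption about tooling, not what this notebook does.

```python
from collections import Counter

def count_rows(documents):
    """Return (sorted vocabulary, count rows) for whitespace-tokenized documents."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    rows = [[Counter(doc.lower().split()).get(w, 0) for w in vocab] for doc in documents]
    return vocab, rows

reviews = ["great subs", "cold beer"]        # illustrative 'review' values
names = ["Firehouse Subs", "Port Brewing"]   # illustrative 'name' values

# Method 1: vectorize each column separately, then join the rows side by side.
review_vocab, review_rows = count_rows(reviews)
name_vocab, name_rows = count_rows(names)
method1_rows = [r + n for r, n in zip(review_rows, name_rows)]

# Method 2: join the strings first, then vectorize once.
combined_text = [r + " " + n for r, n in zip(reviews, names)]
combined_vocab, method2_rows = count_rows(combined_text)
```

Both methods keep one row per sample as long as the per-column matrices are joined horizontally. They differ in the columns: method 1 keeps "subs" in the review and "subs" in the name as two separate features, while method 2 merges them into one count.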

# 4&5. The evaluation procedure and the training/validation results.
After adding the extra attributes, I split the dataset into training and testing sets. I trained the model using the cross-validation and optimal-parameter search described above. Finally, I obtained a result that exceeded the baseline. I expected that adding extra attributes for task 2 would significantly improve the model's accuracy, but in fact it improved only slightly. I think that the extra attributes contribute little to improving the model for task 2.

Here are some things I learned during the operation process that I wanted to record:

To train a model on the train.csv file, the text data needs to be preprocessed first and then vectorized using the vec.fit_transform function. However, I noticed that functions containing 'fit' should only be used on the training set, while the held-out test set only requires vec.transform. This led me to wonder what to do with the new text data in the test.csv file, given that the test split of the original dataset already uses the transform function. The answer is that new text data also needs only the transform function.
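The fit/transform distinction can be sketched with a tiny stand-in class. `TinyCountVectorizer` is hypothetical and only mimics the general shape of a count vectorizer's interface: `fit` learns the vocabulary, and `transform` reuses it without changing it.

```python
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical stand-in that mimics the fit/transform split of a count vectorizer."""

    def fit(self, documents):
        # Learn the vocabulary (the column layout) from the given documents only.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, documents):
        # Count only words already in the vocabulary; unseen words are ignored.
        rows = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            rows.append([counts.get(w, 0) for w in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

train_docs = ["good pizza", "bad service"]
new_docs = ["good sushi"]  # "sushi" never appeared in training

vec = TinyCountVectorizer()
X_train_rows = vec.fit_transform(train_docs)
X_new_rows = vec.transform(new_docs)
```

Because `transform` keeps the training columns, the new rows have the same width as the training rows, which is what the trained model expects. Calling `fit_transform` on the new data instead would produce a different column layout.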

Another question is when to perform vectorization. Before or after splitting the training and test sets? I tried two approaches to this. The first was to split the training and test sets first, then vectorize the text data for the training set using the 'fit_transform' function. Then, I used X_train and y_train to train a Naive Bayes model. Finally, I used the transform function to preprocess new text data in the test.csv file and applied the trained Naive Bayes model 1 to make predictions. This first attempt was applied in the baseline without any data preprocessing.

The second approach was to vectorize all text data in train.csv first using the 'fit_transform' function, then split the training and test sets, and train the Naive Bayes model using X_train and y_train, with cross-validation to avoid overfitting issues. Then, I applied the same data preprocessing operations to the text data in the test.csv file, followed by the transform function to vectorize the preprocessed text data, and used the trained Naive Bayes model 2 to make predictions.
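One concrete difference between the two approaches is which text the vocabulary is learned from. The sketch below uses made-up documents and a hypothetical helper to show that vectorizing before splitting lets words from the held-out part enter the vocabulary.

```python
def vocabulary_of(documents):
    """Collect the distinct lowercase words in a list of documents."""
    return {word for doc in documents for word in doc.lower().split()}

all_docs = ["great tacos", "friendly staff", "great beignets", "slow staff"]
train_part, held_out_part = all_docs[:3], all_docs[3:]

vocab_fit_on_all = vocabulary_of(all_docs)      # vectorize first, then split
vocab_fit_on_train = vocabulary_of(train_part)  # split first, then vectorize

# Words known to the feature space only because held-out text was seen during fitting.
leaked_words = vocab_fit_on_all - vocab_fit_on_train
```

Fitting on all of the data means the held-out split is not entirely unseen, so its score can be slightly optimistic compared with genuinely new text such as test.csv.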

After uploading to Kaggle, I found that the model's accuracy decreased by 0.01 compared to the baseline. There could be multiple reasons for this decrease:

Filtering out some important information during data cleaning and preprocessing. For example, filtering out some helpful vocabulary for the classification task in the stopword list or losing some important word forms in the lemmatization process.
Overfitting of the dataset. During training, the model learns the noise in the training set, which reduces its accuracy on new data.
Insufficient feature selection. Some important features may have been ignored, resulting in the model not fully utilizing the information in the dataset.
The presence of noise or outliers in the dataset. Noise and outliers may have a negative impact on the model's training, leading to a decrease in its accuracy.
Improper model selection. In some cases, the model selection may not be appropriate, resulting in a decrease in its performance.
Standardizing data on text data may destroy the semantic information in the original text data. Because text data is unstructured data, each word in it has a specific meaning and context. After vectorization, each word is converted into a numerical feature, but these numerical features may not have a direct relationship with the original meaning and context of the word. Therefore, standardizing data after vectorization may destroy the meaning and context information in the original text data, affecting the performance of the model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from scipy.stats import pearsonr
import random
import os
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection  import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
import string
from nltk.stem import PorterStemmer
import re
%matplotlib inline

In [2]:
data1 = pd.read_csv("train.csv") # Import the csv file and name it data1
dataset_test=pd.read_csv('test.csv')

In [3]:
data1

Unnamed: 0,ID,name,latitude,longitude,mean_checkin_time,review,category
0,1457,Splendid Pig,29.937778,-90.081429,17.0,Experienced the Splendid Pig pop-up on a trip ...,Restaurants
1,3526,Subway,29.932236,-90.004271,18.0,"Ok, not a great location. But - The girl behin...",Restaurants
2,1891,Firehouse Subs,29.920835,-90.012189,18.0,Soooo excited for this location and it did not...,Restaurants
3,3384,Port Orleans Brewing,29.917005,-90.098272,19.0,Just opened but feels like a hit. Big open spa...,Nightlife
4,1297,New Orleans Cake Café & Bakery,29.963752,-90.052964,16.0,In love with this hole in the wall. Def must g...,Restaurants
...,...,...,...,...,...,...,...
2833,1681,Bar Frances,29.935199,-90.104948,17.0,Picky friend and I went the other night and lo...,Restaurants
2834,1855,Chef D'Z Cafe,29.963803,-90.071958,18.0,Food is real good you get alot of food for the...,Restaurants
2835,3281,Eyes on Canal,29.975829,-90.102318,19.0,Doctor Jackson is great. As her practice grows...,Shopping
2836,1181,Slidershak Food Truck,29.921968,-90.107394,13.0,"Tucked beneath the highway, slider shack is us...",Restaurants


In [4]:
data1.columns

Index(['ID', 'name', 'latitude', 'longitude', 'mean_checkin_time', 'review',
       'category'],
      dtype='object')

In [5]:
X=data1['review'] # feature data in training dataset
y=data1['category'] # label data in training dataset
X_test_o=dataset_test['review']

baseline:

In [6]:
# Split into training and test sets: training set 0.8, test set 0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [7]:
# Transform the training text and the test.csv text into count vectors
vec = CountVectorizer()  # Text vectorization through counting word frequency.
X_train_v = vec.fit_transform(X_train)
X_test_v= vec.transform(X_test_o)

In [8]:
print(X_train_v.shape)
print(y_train.shape)

(2270, 11728)
(2270,)


In [9]:
# Perform text classification using Multinomial Naive Bayes.
model1 = MultinomialNB()
scores = cross_val_score(model1,X_train_v,y_train,cv=5)

In [10]:
print("score:",scores.mean())

score: 0.8599118942731279


In [11]:
model1.fit(X_train_v,y_train)
#y_pred = model1.predict(X_test)
#score_pre = metrics.accuracy_score(y_test, y_pred) # Return the accuracy of the prediction 
#print(score_pre)

MultinomialNB()

In [12]:
y_test_pred= model1.predict(X_test_v)


In [13]:
dataset_test['category'] = y_test_pred

In [14]:
dataset_test[['ID','category']].to_csv('predictions02.csv', index=False)

Perform data preprocessing:

In [15]:
# read in text data
X=data1['review'] # feature data in training dataset
y=data1['category'] # label data in training dataset

The first method:

In [16]:
X_combined1 = pd.concat([data1['review'], data1['name']], axis=1)

In [17]:
# Define stop words
stop_words = set(stopwords.words('english'))

# Define lemmatizer
lemmatizer = WordNetLemmatizer()

def clean_column(sentences):
    """Clean and normalize every sentence in one text column."""
    cleaned = []
    for sentence in sentences:
        # Replace punctuation with space
        for punctuation in string.punctuation:
            sentence = sentence.replace(punctuation, " ")
        # Convert all text to lowercase
        sentence_lower = sentence.lower()
        # Tokenize
        tokens = nltk.word_tokenize(sentence_lower)
        # Remove stop words
        filtered_tokens = [token for token in tokens if token not in stop_words]
        # Lemmatize
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
        # Reassemble the processed sentence
        cleaned.append(' '.join(lemmatized_tokens))
    return cleaned

# Clean the 'review' and 'name' columns separately
cleaned_data1 = clean_column(X_combined1.iloc[:, 0])
cleaned_data2 = clean_column(X_combined1.iloc[:, 1])


In [18]:
# Vectorizing text data
vec1 = CountVectorizer()  # Text vectorization through counting word frequency
X_vec1 = vec1.fit_transform(cleaned_data1)
vec2 = CountVectorizer()  
X_vec2 = vec2.fit_transform(cleaned_data2)

In [19]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_vec_df1 = pd.DataFrame(X_vec1.toarray(), columns=vec1.get_feature_names_out())



In [20]:
# Merge the vectorized matrices into one DataFrame
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
X_vec_df1 = pd.DataFrame(X_vec1.toarray(), columns=vec1.get_feature_names_out())
X_vec_df2 = pd.DataFrame(X_vec2.toarray(), columns=vec2.get_feature_names_out())
X_vec_combined = pd.concat([X_vec_df1, X_vec_df2], axis=1)

# Convert the merged DataFrame to a matrix
X_matrix = X_vec_combined.values



The second method:

In [21]:
X_combined= data1['review'] + ' ' + data1['name'] #Two columns are merged together, the first one is "review" and the second one is "name", so they are still in the same column.

In [22]:
X_numeric = data1['mean_checkin_time'].astype(float)

In [23]:
arr = np.array(X_numeric).reshape(-1, 1)
scaler = StandardScaler()
arr1 = scaler.fit_transform(arr)
X_num = arr1.reshape(-1)

In [24]:
# Define stop words
stop_words = set(stopwords.words('english'))
# Define lemmatizer
lemmatizer = WordNetLemmatizer()
# Clean and normalize text data
cleaned_data = []
for sentence in X_combined:
    # Replace punctuation with space
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, " ")   
    # Convert all text to lowercase
    sentence_lower = sentence.lower()
     # Tokenize
    tokens = nltk.word_tokenize(sentence_lower)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Reassemble the processed sentence
    cleaned_sentence = ' '.join(lemmatized_tokens)
    cleaned_data.append(cleaned_sentence)

In [25]:
# Vectorizing text data
vec = CountVectorizer() # Text vectorization through counting word frequency
X_vec = vec.fit_transform(cleaned_data)

In [26]:
print("Raw data:\n", X_combined)
print("Preprocessed data:\n", cleaned_data)
print("Vectorized data:\n", X_vec)

Raw data:
 0       Experienced the Splendid Pig pop-up on a trip ...
1       Ok, not a great location. But - The girl behin...
2       Soooo excited for this location and it did not...
3       Just opened but feels like a hit. Big open spa...
4       In love with this hole in the wall. Def must g...
                              ...                        
2833    Picky friend and I went the other night and lo...
2834    Food is real good you get alot of food for the...
2835    Doctor Jackson is great. As her practice grows...
2836    Tucked beneath the highway, slider shack is us...
2837    Tourist trap no doubt. Still a fun place early...
Length: 2838, dtype: object
Preprocessed data:
Vectorized data:
   (0, 4138)	1
  (0, 10582)	2
  (0, 8386)	2
  (0, 8556)	1
  (0, 11629)	1
  (0, 7631)	1
  (0, 4235)	1
  (0, 724)	1
  (0, 12446)	1
  (0, 12434)	1
  (0, 4215)	1
  (0, 8806)	1
  (0, 9988)	1
  (0, 2876)	1
  (0, 7087)	1
  (0, 3194)	1
  (0, 4541)	1
  (0, 8922)	1
  (0, 3165)	1
  (0, 12465)	1
  

In [27]:
print(data1.isnull().sum())

ID                   0
name                 0
latitude             0
longitude            0
mean_checkin_time    0
review               0
category             0
dtype: int64


In [28]:
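# Build the feature matrix as a DataFrame.
# Note: X_num is passed as the second positional argument of pd.DataFrame, i.e. as the
# row index, so it only labels the rows and is NOT added as a feature column.
Xc = pd.concat([pd.DataFrame(X_vec.toarray(), X_num)], axis=1)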

In [29]:
# Split into training and test sets: training set 0.7, test set 0.3
X_train, X_test, y_train, y_test = train_test_split(Xc, y, test_size = 0.3, random_state = 1)

In [30]:
# Using Multinomial Naive Bayes for text classification
model2 = MultinomialNB()
scores = cross_val_score(model2,X_train,y_train,cv=5)
model2.fit(X_train,y_train)

MultinomialNB()

In [31]:
print("score:",scores.mean())

score: 0.8736149260154678


Adjusting model parameters is usually done through grid search. Grid search is a technique for automatically adjusting model parameters that finds the best parameter settings by traversing the given parameter combinations.

In Scikit-learn, the GridSearchCV function can be used for grid search.

In [32]:
model3 = MultinomialNB()

# define the list of parameters
params = {'alpha': [0.62, 0.635, 0.6353, 0.6355, 0.65], 'fit_prior': [True, False]}

# Using GridSearchCV for cross-validation and hyperparameter tuning
grid_search = GridSearchCV(model3, params, cv=5)
grid_search.fit(X_train, y_train)

# Output the best parameters and the best score.
print('Best Parameters:', grid_search.best_params_)
print('Best Score:', grid_search.best_score_)

Best Parameters: {'alpha': 0.635, 'fit_prior': True}
Best Score: 0.8882181689302937


In [33]:
# Make predictions on the test set
y_pred3 = grid_search.predict(X_test)

# Calculate accuracy.
acc = accuracy_score(y_test, y_pred3)
print("Accuracy: ", acc)

Accuracy:  0.8943661971830986


Preprocess text data on the test set: Vectorize the text data in the test set using the vocabulary from the training set, which ensures that the test set uses the same vocabulary as the training set.

In [34]:
# Define stop words
stop_words = set(stopwords.words('english'))
# Define lemmatizer
lemmatizer = WordNetLemmatizer()
# Clean and normalize text data
cleaned_data_test = []
for sentence in X_test_o:
    # Replace punctuation with spaces
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, " ")   
    # Convert all text to lowercase
    sentence_lower = sentence.lower()
    # Tokenize
    tokens = nltk.word_tokenize(sentence_lower)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Lemmatize
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Recombine processed sentences
    cleaned_sentence = ' '.join(lemmatized_tokens)
    cleaned_data_test.append(cleaned_sentence)


In [35]:
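# Vectorizing text data
vec = CountVectorizer()
# Fit the vocabulary on the cleaned training text only
X_vec = vec.fit_transform(cleaned_data)
# Reuse that vocabulary for the test.csv text (transform only, no fit)
X_test_vec = vec.transform(cleaned_data_test)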

In [36]:
X_numeric_test = dataset_test['mean_checkin_time'].astype(float)
arr_test = np.array(X_numeric_test).reshape(-1, 1)
# Reuse the scaler fitted on the training data in In [23]; do not refit on the test set
arr1_test = scaler.transform(arr_test)
X_num_test = arr1_test.reshape(-1)

In [37]:
# Build the test feature matrix as a DataFrame.
# Note: X_num_test is passed as the row index (second positional argument of
# pd.DataFrame), so it is NOT added as a feature column.
Xc_test = pd.concat([pd.DataFrame(X_test_vec.toarray(), X_num_test)], axis=1)

In [38]:
X_test_vec.shape

(728, 12620)

In [39]:
y_test_pred= grid_search.predict(Xc_test)
dataset_test['category'] = y_test_pred
dataset_test[['ID','category']].to_csv('predictions11.csv', index=False)