### Sentiment Analysis
* Sentiment analysis involves determining the sentiment expressed in a piece of text.
* In this lab, you will use a hotel review data set that includes reviews and a rating.
 * There are other features that you can ignore, unless you want to use them to improve results.
* Your goal is to train a model that can predict the number of stars based on the text.
* This is the last programming assignment. We will use similar cleaning and discovery techniques as in other assignments
 * ... except we need to add the fun of stop words, stemming / lemmatizing, and similar exciting topics.
* Don't forget to save this as a copy in your Google Colab environment.



* **Student Name: Christian Blake**
* **Partner Name:**

### Get the data
* Either download the data and store it in your drive or use the Kaggle API to obtain the data from
 * https://www.kaggle.com/datasets/datafiniti/hotel-reviews

In [4]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import mean_squared_error, accuracy_score
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the data from the CSV file
data = pd.read_csv('/content/7282_1.csv')

# Display the first 5 rows of the data
data.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,address,categories,city,country,latitude,longitude,name,postalCode,province,reviews.date,reviews.dateAdded,reviews.doRecommend,reviews.id,reviews.rating,reviews.text,reviews.title,reviews.userCity,reviews.username,reviews.userProvince
0,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-09-22T00:00:00Z,2016-10-24T00:00:25Z,,,4.0,Pleasant 10 min walk along the sea front to th...,Good location away from the crouds,,Russ (kent),
1,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-04-03T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Really lovely hotel. Stayed on the very top fl...,Great hotel with Jacuzzi bath!,,A Traveler,
2,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2014-05-13T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,Ett mycket bra hotell. Det som drog ner betyge...,Lugnt läge,,Maud,
3,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2013-10-27T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,We stayed here for four nights in October. The...,Good location on the Lido.,,Julie,
4,Riviera San Nicol 11/a,Hotels,Mableton,US,45.421611,12.376187,Hotel Russo Palace,30126,GA,2015-03-05T00:00:00Z,2016-10-24T00:00:25Z,,,5.0,We stayed here for four nights in October. The...,[title unreadable],,sungchul,


### Explore and Clean the Data

In [9]:
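# Check for missing values (print so the result shows even mid-cell)
print(data.isnull().sum())

# Keep only the rating and review text; copy to avoid SettingWithCopyWarning
data = data[['reviews.rating', 'reviews.text']].copy()
# Drop rows with missing values before rounding (NaN cannot be rounded to an int)
data = data.dropna()
# Round ratings to the nearest whole star
data['reviews.rating'] = data['reviews.rating'].apply(round)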

# Clean the text data
def clean_text(text):
    text = re.sub(r'\W+', ' ', text)  # Remove non-word characters
    text = text.lower()  # Convert to lowercase
    return text

data['reviews.text'] = data['reviews.text'].apply(clean_text)

# Tokenize, remove stop words, and stem/lemmatize
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def process_text(text):
    tokens = word_tokenize(text)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens]
    return ' '.join(lemmatized_tokens)

data['reviews.text'] = data['reviews.text'].apply(process_text)
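
As a rough illustration of what the stop-word and stemming steps do to a single review, here is a minimal offline sketch. It is not the pipeline used above: `TOY_STOP_WORDS` is a small hand-written set standing in for NLTK's English list, a regular expression stands in for `word_tokenize`, and lemmatization is skipped so that no NLTK data download is needed. The example sentence is made up.

```python
import re
from nltk.stem import PorterStemmer

# Assumption: a tiny hand-written stop-word set, NOT NLTK's full English list.
TOY_STOP_WORDS = {"the", "were", "and", "we", "to", "a"}

stemmer = PorterStemmer()

def toy_process(text):
    """Lowercase, split on non-word characters, drop stop words, then stem."""
    tokens = re.sub(r"\W+", " ", text.lower()).split()
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return [stemmer.stem(t) for t in kept]

# Made-up review sentence, for illustration only.
print(toy_process("The rooms were clean, and we liked walking to the hotels."))
```

Plural and "-ing" forms such as "rooms", "walking", and "hotels" should come out as "room", "walk", and "hotel", which is how stemming shrinks the vocabulary the vectorizer has to handle.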


### Train the Model
* Train the model using 90% of the data
* You may use whichever modeling technique you choose

In [12]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(data['reviews.text'], data['reviews.rating'], test_size=0.1, random_state=42)

# Vectorize the text data
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the model (using a simple Linear Regression model)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_vec, y_train)


### Test the Model 
* Test the model using the remaining 10% of the data
* The testing results will depend on the model you use
 * If the rating is evaluated as a number, you need to look at values such as mean squared error
 * If you are using categories, then you can use accuracy, but you may want to collapse the categories from 1 to 5 to 3 categories such as bad, neutral, and good.

In [13]:
# Make predictions
y_pred = model.predict(X_test_vec)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Optional: Categorize predictions and calculate accuracy
def categorize(rating):
    if rating < 2:
        return "bad"
    elif rating < 4:
        return "neutral"
    else:
        return "good"

y_test_cat = y_test.apply(categorize)
y_pred_cat = [categorize(rating) for rating in y_pred]

accuracy = accuracy_score(y_test_cat, y_pred_cat)
print("Accuracy:", accuracy)


Mean Squared Error: 2.4050865644032586
Accuracy: 0.6254638880959178


### Provide an explanation of your model and results

* In this project, I used a simple Linear Regression model to predict the rating of hotel reviews based on their text. After cleaning the text data, I removed stop words and then stemmed and lemmatized the remaining words. This preprocessing is intended to reduce noise by keeping only the most relevant words in each review. Then, I used the TfidfVectorizer to convert the text data into numerical features. This vectorizer weighs each word by its frequency within a document and by its inverse document frequency, which lowers the weight of words that are common across all documents.
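
To make the weighting concrete, here is a small sketch on three made-up reviews (hypothetical text, not rows from the data set). With scikit-learn's default settings, a word that appears in every document ("hotel") should receive the lowest IDF, while a word that appears in only one ("jacuzzi") should receive a higher one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example reviews (assumption: for illustration only).
docs = [
    "great hotel great view",
    "hotel room was dirty",
    "lovely hotel with jacuzzi",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Map each term to its learned inverse document frequency.
idf = {term: vectorizer.idf_[i] for term, i in vectorizer.vocabulary_.items()}
print("idf(hotel)   =", round(idf["hotel"], 3))    # appears in all 3 documents
print("idf(jacuzzi) =", round(idf["jacuzzi"], 3))  # appears in 1 document

# TF-IDF weights for the first review, where "great" occurs twice.
row0 = X[0].toarray().ravel()
weights0 = {t: row0[i] for t, i in vectorizer.vocabulary_.items() if row0[i] > 0}
print(weights0)
```

In the first review, "great" ends up with the largest weight because it is both repeated within the review and absent from the others, while "hotel" gets the smallest.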

On the test set, the model's mean squared error is about 2.41. In addition, I categorized the actual and predicted ratings into three categories: bad (below 2), neutral (from 2 up to, but not including, 4), and good (4 and above). The accuracy of the model on these categories is 0.625, which means that it predicts the correct sentiment category in approximately 62.5% of cases.

While the Linear Regression model provides a starting point, there is significant room for improvement. Potential techniques to improve the model's performance include using more advanced machine learning algorithms, such as support vector machines or neural networks, and leveraging additional features from the dataset. Additionally, experimenting with different text preprocessing techniques, such as using bigrams or trigrams, could help improve the model's ability to capture more complex patterns in the text data.
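
As a sketch of the bigram idea, the snippet below compares a unigram-only vocabulary with a unigram-plus-bigram vocabulary on two made-up sentences (hypothetical text, not from the data set). It uses the `ngram_range` parameter of `TfidfVectorizer`; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example reviews (assumption: for illustration only).
docs = ["the room was not good", "the room was good"]

# Unigrams only (the default) versus unigrams plus bigrams.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

With bigrams, "not good" becomes its own feature, so the two sentences no longer differ by only the single token "not". One caveat: NLTK's English stop-word list contains negations such as "not", so the stop-word filtering used earlier would remove them before a bigram could capture them.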

To be honest, I also tried an SVM and a logistic regression model. The first iteration of logistic regression took about 8 minutes, and the SVM never finished one, so I gave up on both. I don't think they are worthwhile if training takes that long.
* **DO NOT FORGET TO DO THE ANALYSIS PART**

### Discuss techniques you could use to improve your model if you had more time

* To be honest, I would still like to use logistic regression, but it would require me to train on a smaller subset of the data.


Otherwise, I could use Stochastic Gradient Descent, which updates the model's weights using a randomly selected subset of the data at each iteration and can therefore converge faster.

Additionally, I could reduce the number of features by keeping only the more important ones, but I am still unsure whether this would make training fast enough.
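
One possible way to shrink the feature set is through the vectorizer itself. The sketch below uses the `min_df` and `max_features` parameters of `TfidfVectorizer` on made-up reviews (hypothetical text). Note that both parameters select by frequency, which is only a rough proxy for importance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example reviews (assumption: for illustration only).
docs = [
    "clean room friendly staff",
    "dirty room rude staff",
    "friendly staff great pool",
    "dirty bathroom broken shower",
]

full = TfidfVectorizer().fit(docs)
# min_df=2 keeps only terms that occur in at least two documents.
frequent = TfidfVectorizer(min_df=2).fit(docs)
# max_features=3 keeps the three terms with the highest corpus-wide counts.
capped = TfidfVectorizer(max_features=3).fit(docs)

print(len(full.vocabulary_), sorted(frequent.vocabulary_), len(capped.vocabulary_))
```

Fewer columns mean a smaller matrix for the model to fit, though whether that would be enough to make logistic regression or an SVM practical here is untested.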