<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling, sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for the way topic modeling works, the connection to the human meanings of documents. Based on the dataset from assignment three, write a python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

(1) Features (text representation) used for topic modeling.

(2) Top 10 clusters for topic modeling.

(3) Summarize and describe the topic for each cluster.


In [2]:
import pandas as pd

# Reading the dataset from a CSV file
dataset = pd.read_csv('reviews_sentiment.csv')

# Printing out the dimensions of the dataset
print(dataset.shape)
# Print the column names of the imdb_data DataFrame
print(dataset.columns)





(1000, 3)
Index(['document_id', 'clean_text', 'sentiment'], dtype='object')


In [3]:
import nltk   # Importing the Natural Language Toolkit

# Removing punctuation from the 'clean_text' column
dataset['clean_text'] = dataset['clean_text'].str.replace('[^\w\s]', '')

# Converting all text in 'clean_text' to lowercase
dataset['clean_text'] = dataset['clean_text'].apply(lambda x: ' '.join(word.lower() for word in x.split()))

# Importing and downloading stopwords from NLTK
from nltk.corpus import stopwords
nltk.download('stopwords')

# Creating a list of English stopwords
stop = stopwords.words('english')

# Removing stopwords from the 'clean_text' column
dataset['clean_text'] = dataset['clean_text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))


  dataset['clean_text'] = dataset['clean_text'].str.replace('[^\w\s]', '')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# List to store tokenized words
tokenized_words = []
from nltk.tokenize import RegexpTokenizer
from gensim import corpora, models

# Creating a tokenizer
text_tokenizer = RegexpTokenizer(r'\w+')

# Tokenizing each document in the 'clean_text' column of imdb_data
for document in pd.Series(dataset['clean_text']):
    tokens = text_tokenizer.tokenize(document)
    tokenized_words.append(tokens)

# Creating a dictionary from the tokenized words
word_dictionary = corpora.Dictionary(tokenized_words)

# Generating the bag-of-words corpus
bow_corpus = [word_dictionary.doc2bow(text) for text in tokenized_words]


In [5]:
import gensim  # Importing Gensim for LDA model

# Creating an LDA model with the corpus and dictionary
lda_model = gensim.models.ldamodel.LdaModel(
    bow_corpus,          # Using the bag-of-words corpus
    num_topics=10,       # Specifying the number of topics
    id2word=word_dictionary,  # Associating the LDA model with the word dictionary
    passes=20            # Number of passes through the corpus during training
)


In [6]:
# Printing the top 5 words from each of the 10 topics in the LDA model
print(lda_model.print_topics(num_topics=10, num_words=5))


[(0, '0.057*"back" + 0.020*"amazing" + 0.018*"disappointed" + 0.017*"wont" + 0.016*"go"'), (1, '0.025*"place" + 0.021*"dont" + 0.018*"delicious" + 0.017*"go" + 0.014*"back"'), (2, '0.023*"service" + 0.019*"great" + 0.019*"restaurant" + 0.018*"friendly" + 0.011*"staff"'), (3, '0.019*"food" + 0.015*"time" + 0.013*"go" + 0.011*"service" + 0.010*"place"'), (4, '0.017*"good" + 0.016*"great" + 0.012*"place" + 0.010*"could" + 0.010*"well"'), (5, '0.018*"like" + 0.013*"going" + 0.012*"soon" + 0.011*"definitely" + 0.010*"wont"'), (6, '0.035*"good" + 0.032*"food" + 0.023*"place" + 0.022*"great" + 0.022*"service"'), (7, '0.048*"food" + 0.024*"place" + 0.013*"never" + 0.012*"good" + 0.009*"bad"'), (8, '0.030*"good" + 0.027*"service" + 0.014*"food" + 0.013*"experience" + 0.013*"really"'), (9, '0.021*"food" + 0.017*"fantastic" + 0.014*"service" + 0.012*"tasty" + 0.011*"worst"')]


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis also known as opinion mining is a sub field within Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentimental polarities of opinions it contains, e.g., positive, negative, neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a python program to implement a sentiment classifier and evaluate its performance. Notice: **80% data for training and 20% data for testing**.  

(1) Features used for sentiment classification and explain why you select these features.

(2) Select two of the supervised learning algorithm from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build a sentiment classifier respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

(3) Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [7]:
# Write your code here

import pandas as pd
imdb_data=pd.read_csv('reviews_sentiment.csv')
print(imdb_data.shape)

(1000, 3)


In [8]:
imdb_data.head()

Unnamed: 0,document_id,clean_text,sentiment
0,1,I'd love to go back.,positive
1,2,I will come back here every time I'm in Vegas.,neutral
2,3,I checked out this place a couple years ago an...,negative
3,4,I don't think we'll be going back anytime soon.,neutral
4,5,I hate those things as much as cheap quality b...,negative


In [9]:
imdb_data['sentiment'].value_counts()



positive    514
negative    250
neutral     236
Name: sentiment, dtype: int64

In [10]:
import nltk
# Importing the Natural Language Toolkit (nltk) for text processing

# Removing punctuation from the 'clean_text' column of the imdb_data DataFrame
# This is done by replacing anything that is not a word character (\w) or a space (\s) with an empty string
imdb_data['clean_text'] = imdb_data['clean_text'].str.replace('[^\w\s]', '')

# Converting all text in the 'clean_text' column to lowercase
# Lowercasing is a common text preprocessing step to ensure uniformity
# Here, each string in 'clean_text' is split into words, converted to lowercase, and then joined back into a string
imdb_data['clean_text'] = imdb_data['clean_text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# Importing and downloading the list of stopwords from nltk
# Stopwords are common words like 'the', 'is', 'in', which are often removed in text processing
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')

# Removing stopwords from the 'clean_text' column
# This is done by splitting each string into words, removing words that are in the stopwords list, and joining them back
imdb_data['clean_text'] = imdb_data['clean_text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))


  imdb_data['clean_text'] = imdb_data['clean_text'].str.replace('[^\w\s]', '')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# Displaying the first few rows of the imdb_data DataFrame
# The head() function is used to get a quick overview of the DataFrame
# By default, it displays the first 5 rows
imdb_data.head()


Unnamed: 0,document_id,clean_text,sentiment
0,1,id love go back,positive
1,2,come back every time im vegas,neutral
2,3,checked place couple years ago impressed,negative
3,4,dont think well going back anytime soon,neutral
4,5,hate things much cheap quality black olives,negative


In [12]:
# Displaying detailed information about the imdb_data DataFrame
# The info() method provides a concise summary of the DataFrame
# This includes the number of entries, the column names, the number of non-null values in each column,
# and the data type of each column
imdb_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   document_id  1000 non-null   int64 
 1   clean_text   1000 non-null   object
 2   sentiment    1000 non-null   object
dtypes: int64(1), object(2)
memory usage: 23.6+ KB


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating a TF-IDF Vectorizer
# This vectorizer will convert the text data into a matrix of TF-IDF features
# It's set to consider unigrams and bigrams (ngram_range=(1,2)) and limit the features to 1000 most frequent ngrams
tf_idf = TfidfVectorizer(ngram_range=(1,2), max_features=1000)

# Fitting the vectorizer to the 'clean_text' column of the imdb_data
tf_idf.fit(imdb_data['clean_text'])

# Transforming the 'clean_text' column into TF-IDF features
x_values = tf_idf.transform(imdb_data['clean_text'])

# The target variable is the 'sentiment' column
y_values = imdb_data['sentiment']

# Splitting the dataset into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(x_values, y_values, test_size=0.2)

# This split reserves 20% of the data for validation (test_size=0.2)


In [14]:
from sklearn import metrics

def evaluation(y_pred, y_test):
    # Calculating the Accuracy score
    # Accuracy is the ratio of correctly predicted observations to the total observations
    Accuracy = metrics.accuracy_score(y_pred, y_test)

    # Calculating the Recall score
    # Recall (or Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class
    # The 'micro' setting calculates metrics globally by considering each element of the label indicator matrix as a label
    Recall = metrics.recall_score(y_pred=y_pred, y_true=y_test, pos_label='positive', average='micro')

    # Calculating the Precision score
    # Precision is the ratio of correctly predicted positive observations to the total predicted positive observations
    # Similar to Recall, 'micro' is used for calculating metrics globally
    Precision = metrics.precision_score(y_pred=y_pred, y_true=y_test, pos_label='positive', average='micro')

    # Calculating the F1 Score
    # F1 Score is the weighted average of Precision and Recall, taking both false positives and false negatives into account
    # It's useful for uneven class distribution
    F1 = 2 * (Precision * Recall) / (Precision + Recall) # Formula for F1 Score

    # Printing the scores
    print("Accuracy: ", Accuracy.round(4))
    print("Recall:", Recall.round(4))
    print("Precision:", Precision.round(4))
    print("F-1 score:", F1.round(4))


In [15]:
from sklearn import naive_bayes
from sklearn.model_selection import cross_val_score, KFold

# Creating an instance of the Multinomial Naive Bayes classifier
naive_bayes_implement = naive_bayes.MultinomialNB()

# Training the classifier on the training data
naive_bayes_implement.fit(x_train, y_train)

# Predicting the sentiments for the validation data
y_pred_valid = naive_bayes_implement.predict(x_valid)

# Evaluating the model performance on the validation set
evaluation(y_pred_valid, y_valid)

# Performing cross-validation to assess the model's effectiveness
# KFold is used to split the dataset into 10 different subsets
# shuffle=True ensures the data is shuffled before splitting into batches
# random_state=22 ensures reproducibility of the results
cv_scores = cross_val_score(naive_bayes_implement, x_valid, y_valid, cv=KFold(10, shuffle=True, random_state=22))

# Printing the cross-validation scores
print("Cross Validation Score:", cv_scores)


Accuracy:  0.705
Recall: 0.705
Precision: 0.705
F-1 score: 0.705
Cross Validation Score: [0.6  0.7  0.6  0.5  0.55 0.6  0.5  0.65 0.5  0.4 ]




In [16]:
from sklearn import svm
from sklearn.model_selection import cross_val_score, KFold

# Creating an instance of the SVM classifier
svm_implement = svm.SVC()

# Training the SVM classifier on the training data
svm_implement.fit(x_train, y_train)

# Predicting sentiments for the validation data using the trained SVM model
y_pred_valid = svm_implement.predict(x_valid)

# Evaluating the SVM model performance on the validation set using predefined evaluation function
evaluation(y_pred_valid, y_valid)

# Performing cross-validation to assess the effectiveness of the SVM model
# KFold is used to split the dataset into 10 different subsets, ensuring data is shuffled and the process is reproducible
cv_scores = cross_val_score(svm_implement, x_valid, y_valid, cv=KFold(10, shuffle=True, random_state=22))

# Calculating the mean of the cross-validation scores for a single, overall performance metric
# Rounding to 4 decimal places for readability
mean_cv_score = cv_scores.mean().round(4)

# Printing the mean cross-validation score
print("Cross Validation Score:", mean_cv_score)




Accuracy:  0.69
Recall: 0.69
Precision: 0.69
F-1 score: 0.69
Cross Validation Score: 0.565


# **Question 3: House price prediction**

In [17]:
import pandas as pd

# Loading the test dataset from a CSV file named "test.csv"
# This dataset is typically used for evaluating the model
# Common columns in such a dataset include features used for prediction
df_train = pd.read_csv("test.csv")

# Loading the training dataset from a CSV file named "train.csv"
# This dataset is used for training the model
# It usually contains features along with the target variable for supervised learning
df_test = pd.read_csv("train.csv")

# Note: Ensure that the file paths and names are correct and the files are located in your working directory.
# If your dataset has different column names or requires specific preprocessing, you should adjust the code accordingly.


(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, an supervised learning model. The training data, testing data, and data description files can be download from canvas. Here is an axample for implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.


In [18]:
# Assuming df_train is a pandas DataFrame that has been previously loaded

# The info() method is used to display a concise summary of the DataFrame
# This includes information like the number of entries, the column names,
# the number of non-null values in each column, and the data type of each column
# It's useful for getting an initial understanding of the dataset's structure,
# size, and data types, as well as identifying if there are any missing values
df_train.info()

# Note: Ensure that df_train is properly loaded from its corresponding CSV file
# and exists in the current working environment before executing this code.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

In [19]:
# Displaying a summary of the df_test DataFrame
# This includes column names, non-null counts, and data types for each column
df_test.info()

# This helps in understanding the dataset's structure and identifying missing values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [20]:
# Generating descriptive statistics for the df_train DataFrame
# This method provides a summary of central tendency, dispersion, and shape of the dataset’s distribution
# Typically includes statistics like mean, standard deviation, min, max, and quartiles
df_train.describe()

# Note: This is particularly useful for numerical columns to get a quick statistical overview


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,1459.0,1459.0,1232.0,1459.0,1459.0,1459.0,1459.0,1459.0,1444.0,1458.0,...,1458.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0
mean,2190.0,57.378341,68.580357,9819.161069,6.078821,5.553804,1971.357779,1983.662783,100.709141,439.203704,...,472.768861,93.174777,48.313914,24.243317,1.79438,17.064428,1.744345,58.167923,6.104181,2007.769705
std,421.321334,42.74688,22.376841,4955.517327,1.436812,1.11374,30.390071,21.130467,177.6259,455.268042,...,217.048611,127.744882,68.883364,67.227765,20.207842,56.609763,30.491646,630.806978,2.722432,1.30174
min,1461.0,20.0,21.0,1470.0,1.0,1.0,1879.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,1825.5,20.0,58.0,7391.0,5.0,5.0,1953.0,1963.0,0.0,0.0,...,318.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,2190.0,50.0,67.0,9399.0,6.0,5.0,1973.0,1992.0,0.0,350.5,...,480.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,2554.5,70.0,80.0,11517.5,7.0,6.0,2001.0,2004.0,164.0,753.5,...,576.0,168.0,72.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,2919.0,190.0,200.0,56600.0,10.0,9.0,2010.0,2010.0,1290.0,4010.0,...,1488.0,1424.0,742.0,1012.0,360.0,576.0,800.0,17000.0,12.0,2010.0


In [21]:
# Display summary statistics for the DataFrame df_test
df_test.describe()

# This will provide statistical information like count, mean, std, min, 25%, 50%, and 75% percentiles, and max
# for each numerical column in the DataFrame.


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [22]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Preprocessing for training and testing datasets
# Select numerical columns, interpolate missing values, and remove rows with any remaining missing values
df_train = df_train.select_dtypes(include=['number']).interpolate().dropna()
df_test = df_test.select_dtypes(include=['number']).interpolate().dropna()

# Split the data into training and testing sets
# x_training_data contains all features except 'SalePrice' and 'Id'
# y_training_data contains the natural logarithm of 'SalePrice'
x_training_data = df_test.drop(['SalePrice', 'Id'], axis=1)
y_training_data = np.log(df_test.SalePrice)

# Split the data into training and testing sets with a random state for reproducibility
# The test size is set to 20% of the data
x_train, x_test, y_train, y_test = train_test_split(x_training_data, y_training_data, random_state=21, test_size=0.2)

# Create a Linear Regression model
regression = LinearRegression()

# Fit the model to the training data
regression.fit(x_train, y_train)

# Make predictions on the test data
y_pred = regression.predict(x_test)


In [23]:
# Print the R-squared score of the linear regression model on the test data
print('Linear Regression R sq: %.5f' % regression.score(x_test, y_test))


Linear Regression R sq: 0.84932


In [24]:
# Import necessary libraries
import numpy as np
from sklearn.metrics import mean_squared_error

# Calculate the mean squared error (MSE) between the exponentiated predictions (back to original scale) and the true values (y_test)
lin = mean_squared_error(np.exp(y_pred), y_test)

# Calculate the square root of the mean squared error to get the root mean squared error (RMSE)
lin_r = np.sqrt(lin)

# Print the RMSE
print(lin_r)


195747.28015953404


In [25]:
# Create a dictionary 'results' containing the predicted prices and actual prices on the original scale
results = {"Predicted Prices": np.exp(y_pred), "Actual Prices": np.exp(y_test)}

# Create a DataFrame 'df_val' from the 'results' dictionary
df_val = pd.DataFrame(results)

# Calculate the percentage difference between predicted and actual prices and add it as a new column
df_val["Percentage Difference"] = round(abs((df_val["Predicted Prices"] - df_val["Actual Prices"]) / df_val["Actual Prices"]) * 100, 2)

# Display the DataFrame 'df_val' with columns for predicted prices, actual prices, and percentage difference
df_val


Unnamed: 0,Predicted Prices,Actual Prices,Percentage Difference
880,156677.762126,157000.0,0.21
605,229973.921577,205000.0,12.18
1166,245358.948864,245350.0,0.00
216,218109.586021,210000.0,3.86
970,86909.329260,135000.0,35.62
...,...,...,...
218,231419.797074,311500.0,25.71
1228,311321.804719,367294.0,15.24
1007,93234.996099,88000.0,5.95
575,109041.447359,118500.0,7.98


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **pre-trained Large Language Model (LLM) from the Hugging Face Repository** for your specific task using the data collected in Assignment 3. After creating an account on Hugging Face (https://huggingface.co/), choose a relevant LLM from their repository, such as GPT-3, BERT, or RoBERTa or any Meta based text analysis model. Provide a brief description of the selected LLM, including its original sources, significant parameters, and any task-specific fine-tuning if applied.

Perform a detailed analysis of the LLM's performance on your task, including key metrics, strengths, and limitations. Additionally, discuss any challenges encountered during the implementation and potential strategies for improvement. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [26]:
!pip install scikit-learn




In [28]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Load the pre-trained BERT model and tokenizer
model_name = "dslim/bert-base-NER"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Load the dataset from a CSV file
data_file_path = 'reviews_sentiment.csv'  # Replace with the path to your CSV file
df = pd.read_csv(data_file_path)

# Define a preprocessing function for text and labels
def preprocess_text_and_labels(texts, labels, tokenizer, max_len=512):
    input_ids, attention_masks, labels_out = [], [], []

    for text, label in zip(texts, labels):
        # Convert label 'negative' to 0, and 'positive' to 1
        label = 0 if label == 'negative' else 1

        # Tokenize and preprocess the text
        encoded_dict = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])
        labels_out.append(label)

    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels_out = torch.tensor(labels_out)

    return input_ids, attention_masks, labels_out

# Split the data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(df['clean_text'], df['sentiment'], test_size=0.3)

# Preprocess the training and testing data
train_inputs, train_masks, train_labels = preprocess_text_and_labels(train_texts, train_labels, tokenizer)
test_inputs, test_masks, test_labels = preprocess_text_and_labels(test_texts, test_labels, tokenizer)

# Create data loaders for training and testing
batch_size = 16  # Adjust the batch size as needed
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

test_data = TensorDataset(test_inputs, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

# Function to evaluate the model's accuracy
def evaluate_model(dataloader):
    model.eval()
    predictions, true_labels = [], []

    for batch in dataloader:
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        predictions.extend(logits)
        true_labels.extend(label_ids)

    # Convert lists to numpy arrays before flattening
    predictions = np.array(predictions)
    true_labels = np.array(true_labels)

    # Calculate accuracy
    pred_flat = np.argmax(predictions, axis=1).flatten()
    labels_flat = true_labels.flatten()
    return accuracy_score(labels_flat, pred_flat)

# Evaluate the model's accuracy on the test dataset
accuracy = evaluate_model(test_dataloader)
print("Accuracy:", accuracy)


tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Accuracy: 0.22333333333333333
