<a href="https://colab.research.google.com/github/NagillaUdayasree/Udayasree_INFO5731_Spring2024/blob/main/Nagilla_Udayasree_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how it connects to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [None]:
import pandas as pd  # pandas library for data manipulation and analysis
import gensim  # gensim library for topic modeling
from gensim import corpora  # corpora module from gensim for creating dictionaries and corpus
import nltk  # nltk library for natural language processing tasks
nltk.download('punkt')  #  the 'punkt' resource for tokenization
nltk.download('stopwords')  #  the 'stopwords' corpus for removing stopwords
nltk.download('wordnet')  #  the 'wordnet' corpus for lemmatization
from nltk.corpus import stopwords  #  stopwords corpus from nltk
from nltk.stem import WordNetLemmatizer  #  the WordNetLemmatizer class from nltk for lemmatization

# Load the dataset
data = pd.read_csv('classified_sentiment_output.csv')  # Read the CSV file into a pandas DataFrame

# Preprocessing functions
def tokenize_text(txt):
    return nltk.word_tokenize(txt.lower())  # Tokenize the text and convert to lowercase

# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))  # Set of English stopwords
LEMMATIZER = WordNetLemmatizer()  # Shared WordNetLemmatizer instance

def remove_stopwords(tokens):
    return [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords from the token list

def lemmatize_tokens(tokens):
    return [LEMMATIZER.lemmatize(word) for word in tokens]  # Lemmatize each token in the list

def preprocess(text):
    tokens = tokenize_text(text)  # Tokenize the text
    tokens = remove_stopwords(tokens)  # Remove stopwords from the tokens
    tokens = lemmatize_tokens(tokens)  # Lemmatize the tokens
    return tokens  # Return the preprocessed tokens

# Preprocess the text data
processed_texts = [preprocess(text) for text in data['clean_text']]  # Apply the preprocess function to each text in the 'clean_text' column

# Create a dictionary from the processed texts
dictionary = corpora.Dictionary(processed_texts)  # Create a dictionary from the processed texts

# Convert the texts to a document-term matrix
corpus = [dictionary.doc2bow(text) for text in processed_texts]  # Convert the processed texts to a document-term matrix

# Create an LDA model
num_topics = 10  # Number of topics to extract
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics)  # Create an LDA model with 10 topics

print("*" * 80)  # Print a separator line
# Print the top topics
print("Top {} topics:".format(num_topics))  # Print a header for the top topics
for idx, topic in lda_model.print_topics(-1):  # Iterate over the top topics
    print("Topic {}: {}".format(idx, topic))  # Print the topic index and keywords
print("*" * 80)  # Print a separator line

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


********************************************************************************
Top 10 topics:
Topic 0: 0.030*"film" + 0.016*"movie" + 0.012*"prison" + 0.011*"andy" + 0.010*"shawshank" + 0.008*"one" + 0.008*"redemption" + 0.007*"life" + 0.007*"freeman" + 0.006*"time"
Topic 1: 0.033*"movie" + 0.017*"film" + 0.010*"prison" + 0.010*"shawshank" + 0.009*"time" + 0.009*"one" + 0.009*"story" + 0.007*"life" + 0.007*"redemption" + 0.006*"freeman"
Topic 2: 0.022*"movie" + 0.017*"film" + 0.013*"prison" + 0.013*"andy" + 0.009*"shawshank" + 0.009*"one" + 0.009*"time" + 0.007*"hope" + 0.007*"life" + 0.007*"redemption"
Topic 3: 0.024*"movie" + 0.010*"film" + 0.007*"shawshank" + 0.007*"prison" + 0.006*"redemption" + 0.006*"like" + 0.006*"andy" + 0.005*"freeman" + 0.005*"best" + 0.005*"good"
Topic 4: 0.049*"movie" + 0.013*"one" + 0.011*"best" + 0.010*"film" + 0.009*"prison" + 0.009*"time" + 0.009*"hope" + 0.008*"story" + 0.007*"shawshank" + 0.007*"good"
Topic 5: 0.019*"film" + 0.018*"movie" + 0.011*"t
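
The topics printed above share most of their highest-weight words, which makes them hard to tell apart. One rough way to quantify this is the Jaccard overlap between the top-word sets of two topics. The sketch below uses the top words of Topic 0 and Topic 1 exactly as printed; the helper name `jaccard` is an illustrative assumption and not part of Gensim.

```python
# Top-10 words of Topic 0 and Topic 1, copied from the printed LDA output.
topic_0 = ["film", "movie", "prison", "andy", "shawshank",
           "one", "redemption", "life", "freeman", "time"]
topic_1 = ["movie", "film", "prison", "shawshank", "time",
           "one", "story", "life", "redemption", "freeman"]


def jaccard(words_a, words_b):
    """Return the size of the intersection divided by the size of the union."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)


overlap = jaccard(topic_0, topic_1)
print(f"Jaccard overlap between Topic 0 and Topic 1: {overlap:.2f}")
```

Nine of the eleven distinct words are shared, so the two topics are close to duplicates. Possible remedies to try include more training passes, removing corpus-wide words such as "movie" and "film", or asking for fewer topics.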

In [None]:
"""Topic modeling using the Natural Language Toolkit (NLTK) and Gensim, and the resulting output of the top 10 topics identified from an LDA model.

### Features (Text Representation) Used for Topic Modeling

- Tokenization: This breaks text into individual words using NLTK's tokenizer, ensuring that the text is processed at a word level.
- Case Normalization: All text is converted to lowercase to ensure that the same words in different cases are counted as the same word.
- Stop Word Removal: Common words that generally do not contribute to the meaning of the text, such as "the", "is", and "and", are removed using a predefined list from NLTK.
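- Lemmatization: Words are reduced to their dictionary base form (lemma) with NLTK's WordNetLemmatizer. Unlike stemming, lemmatization uses a lexical knowledge base (WordNet) to return valid base forms. Called without a part-of-speech tag, as it is here, the lemmatizer treats every token as a noun.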
- Vectorization (Document-Term Matrix): Using the Bag of Words model, texts are converted into a numerical format where each unique word is represented by an integer and each document is a vector of word counts.

### Top 10 Clusters for Topic Modeling

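The LDA model identified the following 10 topics as clusters of words that frequently co-occur across the documents. The topics share most of their highest-weight words ("movie", "film", "prison", "shawshank"), so the distinctions below rest on the lower-weight terms and should be read as tentative: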

Topic 0: Primarily related to movies and films, with a focus on "prison" themes.
Topic 1: Focuses on general movie discussions, highlighting the viewing experience ("time", "story").
Topic 2: Combines elements of movies and prison settings with emotional themes ("hope", "life").
Topic 3: A mixture of film critique and specific elements from "Shawshank Redemption" (like "redemption").
Topic 4: Revolves around the "best" aspects of movies, possibly award discussions or top movie lists.
Topic 5: Discusses narrative elements in films, including storytelling and plot summaries.
Topic 6: Encompasses a variety of film elements but seems to focus on personal stories within movies.
Topic 7: Appears to discuss film impact and narrative quality, possibly in reviews.
Topic 8: Heavy on specific references to "Shawshank Redemption", discussing characters and plot details.
Topic 9: Mixes general movie discussion with focus on prison life and philosophical themes like "hope".

### Summarize and Describe the Topic for Each Cluster

1. Topic 0 - (Prison Drama Insights): Explores dramatic narratives set in prison environments, specifically highlighting "Shawshank Redemption" and key characters like Andy and themes of life and time.

2. Topic 1 - (Cinematic Timing and Storytelling): Delves into how movies manage storytelling within their runtime, with frequent references to narrative pacing and life reflections.

3. Topic 2 - (Hope and Redemption in Film): Focuses on the themes of hope and redemption within the context of prison movies, likely drawing heavily from "Shawshank Redemption".

4. Topic 3 - (Critical Reception and Values): Discusses what makes movies like "Shawshank Redemption" critically acclaimed, touching on themes like redemption and performance.

5. Topic 4 - (Celebrating Cinematic Excellence): Centers on movies considered the "best", often discussing their hopeful messages and timeless stories.

6. Topic 5 - (Narrative and Plot Analysis): Analyzes the storytelling techniques and plot elements in movies, with a focus on how stories are seen and appreciated.

7. Topic 6 - (Personal and Emotional Film Narratives): Looks at personal and emotional aspects of film stories, particularly how characters like Andy influence the narrative.

8. Topic 7 - (Reviewing Film Impact): Reviews the impact of films on audiences, considering how stories are made and received in terms of quality and emotional engagement.

9. Topic 8 - (Character Deep Dives): Provides deep dives into specific characters and themes from "Shawshank Redemption", discussing their role in the film's success.

10. Topic 9 - (Philosophical and Life Themes in Movies): Explores philosophical questions and life themes presented in films, especially in prison settings, with a focus on hope and redemption narratives.

Each topic summary provides insight into the different aspects of how films, particularly "Shawshank Redemption", are discussed and analyzed."""
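
The bag-of-words vectorization listed among the features can be illustrated without Gensim. The sketch below is a hand-rolled equivalent of the idea behind `corpora.Dictionary` and `doc2bow`, not Gensim's actual implementation. The two toy documents and the first-seen id assignment are assumptions made for illustration.

```python
from collections import Counter

# Two toy documents, already tokenized (illustrative, not from the dataset).
docs = [["prison", "film", "hope", "film"],
        ["movie", "hope", "story"]]

# Assign each unique word an integer id in first-seen order.
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

# Represent each document as sorted (word_id, count) pairs.
bow_corpus = [sorted(Counter(word_to_id[w] for w in doc).items()) for doc in docs]

print(word_to_id)
print(bow_corpus)
```

Each document becomes a sparse list of `(word_id, count)` pairs, which is the kind of input an LDA model consumes. Word order is discarded, which is why LDA can only pick up on co-occurrence and not on phrasing.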

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% data for training and 20% data for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [None]:
import pandas as pd  #pandas library for data manipulation and analysis
from sklearn.model_selection import train_test_split  # train_test_split function for splitting the data into training and testing sets
from sklearn.feature_extraction.text import TfidfVectorizer  #TfidfVectorizer for creating TF-IDF features
from sklearn.naive_bayes import MultinomialNB  # MultinomialNB classifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier  #DecisionTreeClassifier from scikit-learn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score  #evaluation metrics from scikit-learn
from sklearn.model_selection import cross_val_score  #cross_val_score function for cross-validation

# Load the dataset
sentiment_data = pd.read_csv('classified_sentiment_output.csv')

# Split the data into training and testing sets
text_data = sentiment_data['clean_text']  # Extract the 'clean_text' column from the DataFrame
labels = sentiment_data['Sentiment']  # Extract the 'Sentiment' column from the DataFrame
train_text, test_text, train_labels, test_labels = train_test_split(text_data, labels, test_size=0.2, random_state=45)
# Split the data into training and testing sets (80% for training, 20% for testing) with a fixed random state

# TF-IDF features
vectorizer = TfidfVectorizer()  # Creating an instance of the TfidfVectorizer
train_features = vectorizer.fit_transform(train_text)  # Creating TF-IDF features for the training text data
test_features = vectorizer.transform(test_text)  # Creating TF-IDF features for the testing text data

# Naive Bayes Classifier
print("Naive Bayes Classifier")  # header for the Naive Bayes Classifier
nb_classifier = MultinomialNB()  # Creating an instance of the MultinomialNB classifier
nb_classifier.fit(train_features, train_labels)  # Training the Naive Bayes Classifier on the training data
nb_predictions = nb_classifier.predict(test_features)  # Making predictions on the testing data
nb_accuracy = accuracy_score(test_labels, nb_predictions)  # Calculating the accuracy score
nb_precision = precision_score(test_labels, nb_predictions, average='weighted', zero_division=0)  # Weighted precision; classes that are never predicted count as 0 without raising a warning
nb_recall = recall_score(test_labels, nb_predictions, average='weighted')  # Calculating the weighted recall score
nb_f1 = f1_score(test_labels, nb_predictions, average='weighted')  # Calculating the weighted F1 score
print(f"Accuracy: {nb_accuracy:.4f}")  # Printing the accuracy score
print(f"Precision: {nb_precision:.4f}")  # Printing the precision score
print(f"Recall: {nb_recall:.4f}")  # Printing the recall score
print(f"F1-score: {nb_f1:.4f}")  # Printing the F1 score
nb_cv_scores = cross_val_score(nb_classifier, train_features, train_labels, cv=5, scoring='accuracy')
# Performing 5-fold cross-validation and calculate the accuracy scores for each fold
print(f"Cross-validation scores: {nb_cv_scores}")  # Printing the cross-validation scores
print(f"Mean cross-validation score: {nb_cv_scores.mean():.4f}")  # Printing the mean cross-validation score

# Decision Tree Classifier
print("\nDecision Tree Classifier")  # header for the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()  # Creating an instance of the DecisionTreeClassifier
dt_classifier.fit(train_features, train_labels)  # Training the Decision Tree Classifier on the training data
dt_predictions = dt_classifier.predict(test_features)  # Making predictions on the testing data
dt_accuracy = accuracy_score(test_labels, dt_predictions)  # Calculating the accuracy score
dt_precision = precision_score(test_labels, dt_predictions, average='weighted', zero_division=0)  # Weighted precision; classes that are never predicted count as 0 without raising a warning
dt_recall = recall_score(test_labels, dt_predictions, average='weighted')  # Calculating the weighted recall score
dt_f1 = f1_score(test_labels, dt_predictions, average='weighted')  # Calculating the weighted F1 score
print(f"Accuracy: {dt_accuracy:.4f}")  # Printing the accuracy score
print(f"Precision: {dt_precision:.4f}")  # Printing the precision score
print(f"Recall: {dt_recall:.4f}")  # Printing the recall score
print(f"F1-score: {dt_f1:.4f}")  # Printing the F1 score
dt_cv_scores = cross_val_score(dt_classifier, train_features, train_labels, cv=5, scoring='accuracy')
# Performing 5-fold cross-validation and calculate the accuracy scores for each fold
print(f"Cross-validation scores: {dt_cv_scores}")  # Printing the cross-validation scores
print(f"Mean cross-validation score: {dt_cv_scores.mean():.4f}")  # Printing the mean cross-validation score

Naive Bayes Classifier
Accuracy: 0.8390
Precision: 0.7040
Recall: 0.8390
F1-score: 0.7656
Cross-validation scores: [0.81707317 0.81707317 0.81595092 0.81595092 0.81595092]
Mean cross-validation score: 0.8164

Decision Tree Classifier


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.7366
Precision: 0.7518
Recall: 0.7366
F1-score: 0.7437
Cross-validation scores: [0.74390244 0.76219512 0.79754601 0.77300613 0.79141104]
Mean cross-validation score: 0.7736
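
The Naive Bayes scores above follow a recognizable pattern: weighted precision (0.7040) is almost exactly accuracy squared (0.8390 squared is about 0.704). That is what a classifier produces when it predicts the same class for every test item. The sketch below checks the arithmetic for a hypothetical test set of 205 items with 172 in the majority class. These counts are an assumption chosen to match the reported accuracy, not values read from the dataset.

```python
# Hypothetical counts (assumption): 205 test items, 172 in the majority class.
n_total = 205
n_majority = 172

# A classifier that always predicts the majority class gets every majority
# item right and every other item wrong.
accuracy = n_majority / n_total

# Per-class scores for the majority class: it is predicted n_total times and
# is correct n_majority times. Other classes are never predicted, so their
# precision is undefined and counted as 0.
majority_precision = n_majority / n_total
majority_recall = 1.0
majority_f1 = (2 * majority_precision * majority_recall
               / (majority_precision + majority_recall))

# Weighted averages weight each class by its share of the true labels;
# only the majority class contributes a non-zero term.
weight = n_majority / n_total
weighted_precision = weight * majority_precision
weighted_recall = weight * majority_recall
weighted_f1 = weight * majority_f1

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {weighted_precision:.4f}")
print(f"Recall:    {weighted_recall:.4f}")
print(f"F1-score:  {weighted_f1:.4f}")
```

Under these assumed counts the sketch reproduces all four reported Naive Bayes figures. The near-constant cross-validation scores are consistent with the same behavior, so a confusion matrix or per-class report would be worth checking before reading the 0.8390 accuracy as evidence of learned sentiment.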


In [None]:
"""In the sentiment classification, TF-IDF (Term Frequency-Inverse Document Frequency) is used as the feature extraction method because of the following reasons:

TF-IDF as Feature Extraction Method:
  Emphasizes Important Words: TF-IDF highlights crucial words for sentiment analysis and diminishes the impact of common, less informative words.
  Adjusts Word Importance: It increases the weight of rare words across documents, focusing on terms that uniquely define sentiment in specific texts.

Compatibility with Classifiers:
  Naive Bayes: MultinomialNB is designed for word counts, but in practice it also works with fractional TF-IDF weights, which behave like rescaled counts.
  Decision Trees: Utilizes the numerical vectors from TF-IDF to make decisions based on quantitative thresholds, effectively distinguishing between different sentiment classes.

Benefits of Using TF-IDF:
  Sparse and Efficient: TF-IDF keeps the same vocabulary-sized feature space as raw counts (it does not reduce dimensionality), but the matrix is sparse, so training stays manageable.
  Can Improve Model Performance: Down-weighting words that appear everywhere often helps accuracy, precision, recall, and F1-score, although the gain over raw counts was not measured here.
  Stable Cross-Validation: The 5-fold accuracy varies little across folds (0.816 to 0.817 for Naive Bayes, 0.744 to 0.798 for the Decision Tree).

Performance comparison:
  Naive Bayes scores higher than the Decision Tree on accuracy (83.90% vs. 73.66%), weighted recall (83.90% vs. 73.66%), weighted F1-score (76.56% vs. 74.37%), and mean cross-validation accuracy (81.64% vs. 77.36%), with less variation across folds. The Decision Tree has the higher weighted precision (75.18% vs. 70.40%). The undefined-precision warning in the output means that at least one class was never predicted on the test set, which points to class imbalance. Note that the Naive Bayes weighted precision is almost exactly its accuracy squared (0.8390 squared is about 0.704), which is what a classifier that predicts only the majority class would produce, so its higher accuracy may reflect the majority-class share rather than better discrimination. On the reported metrics Naive Bayes is the stronger option for this dataset, but per-class scores or a confusion matrix should be checked before relying on it."""
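
How TF-IDF emphasizes rarer words can be seen on a toy corpus. The sketch below computes the weights by hand using the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the default described in scikit-learn's documentation for `TfidfVectorizer`. The three documents and the helper name `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Three toy documents, already tokenized (illustrative, not from the dataset).
docs = [["great", "movie", "great", "story"],
        ["boring", "movie"],
        ["movie", "story"]]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    raw = {word: count * idf[word] for word, count in Counter(doc).items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}: {weight:.3f}")
```

In the first document "movie" gets the lowest weight because it occurs in every document, while "great" gets the highest because it is both frequent in that document and rare elsewhere. The number of features still equals the vocabulary size, the same as with raw counts.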

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [21]:
import pandas as pd  # Import pandas library for data manipulation and analysis
import numpy as np  # Import numpy for numerical operations
from sklearn.model_selection import train_test_split  #function to split data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score #for metric evaluation

# Load the training and testing data from CSV files
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Print the structure of the training dataset and the first few rows to understand the data format
print(train_df.info())
print(train_df.head())

# Drop columns with a high percentage of missing values, as they may not provide reliable information
drop_cols = ['Alley', 'PoolQC', 'Fence', 'MiscFeature']
train_df.drop(columns=drop_cols, inplace=True)

# Impute missing values: fill missing numerical data with the median, and categorical data with the mode
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about
for col in train_df.columns:
    if train_df[col].dtype == 'object':  # Check if the column data type is object (categorical)
        train_df[col] = train_df[col].fillna(train_df[col].mode()[0])  # Impute with mode for categorical columns
    else:
        train_df[col] = train_df[col].fillna(train_df[col].median())  # Impute with median for numerical columns

# Confirm that all missing values have been filled by checking the maximum count of missing values across columns
print(train_df.isnull().sum().max())  # Outputs 0 if no null values remain

# Remove outliers from the `SalePrice` data using the Interquartile Range (IQR) method
Q1 = train_df['SalePrice'].quantile(0.25)  # Calculate the first quartile (25th percentile)
Q3 = train_df['SalePrice'].quantile(0.75)  # Calculate the third quartile (75th percentile)
IQR = Q3 - Q1  # Calculate the interquartile range (distance between 25th and 75th percentiles)
lower_bound = Q1 - 1.5 * IQR  # Define the lower bound for acceptable data points
upper_bound = Q3 + 1.5 * IQR  # Define the upper bound for acceptable data points
train_df = train_df[(train_df['SalePrice'] >= lower_bound) & (train_df['SalePrice'] <= upper_bound)]  # Filter out outliers

# Select only numeric columns for correlation calculation
numeric_features = train_df.select_dtypes(include=[np.number])
corr = numeric_features.corr()['SalePrice'].sort_values(ascending=False)  # Calculate and sort correlation with SalePrice
print(corr.head(10))  # Print the top 10 features most correlated with SalePrice

# Define a list of features to be used for the model, based on correlation and domain knowledge
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

# Prepare feature matrix (X) and target vector (y)
X = train_df[selected_features]  # Features matrix
y = train_df['SalePrice']  # Target vector

# Split the dataset into training and testing sets with 80% training and 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)  # Fit the model on the training data

# Predict the target variable for the test set
y_pred = model.predict(X_test)  # Model predictions for the test set

# Evaluate the model performance using Mean Squared Error (MSE) and R-squared metrics
mse = mean_squared_error(y_test, y_pred)  # Calculate MSE
rmse = np.sqrt(mse)  # Calculate Root Mean Squared Error (RMSE)
r2 = r2_score(y_test, y_pred)  # Calculate R-squared value

print("###############################################")
# Print evaluation metrics to assess model performance
print(f"Root Mean Squared Error: {rmse}")
print(f"R^2 Score: {r2}")
print("###############################################")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
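
The IQR outlier filter applied to `SalePrice` in the code above can be illustrated on a handful of made-up prices. The sketch below uses the standard library. Its `method="inclusive"` option interpolates linearly between data points, which corresponds to the default interpolation of pandas' `quantile()`. The price values are assumptions, not rows from `train.csv`.

```python
import statistics

# Made-up sale prices in thousands (illustrative, not from train.csv).
prices = [100, 120, 130, 150, 160, 170, 500]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the prices inside [lower_bound, upper_bound].
kept = [p for p in prices if lower_bound <= p <= upper_bound]
print(f"Q1={q1}, Q3={q3}, bounds=({lower_bound}, {upper_bound})")
print(f"kept: {kept}")
```

Here only the 500 value falls outside the bounds and is dropped. One design point to keep in mind: filtering on the target before the train/test split also removes the most expensive houses from the test set, so the reported error may be optimistic for such homes.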

In [None]:
""" The following characteristics were picked for the regression model's selection in order to forecast house prices:

1. OverallQual (Overall Material and Finish Quality): Better materials and finishes generally go with higher prices.

2. GrLivArea (Above Grade Living Area): Buyers place a higher value on larger living areas, which directly affects the market price of the property.

3. GarageCars (Garage Size in Car Capacity): A larger garage adds utility to a house, which in turn affects its price and desirability.

4. TotalBsmtSF (Total Basement Area): A larger basement usually means more usable space, which can raise a home's value.

5. FullBath (Full Bathrooms): More full bathrooms add convenience, which raises the usefulness and market value of the house.

6. YearBuilt (Construction Year): Newer homes often command higher prices because of modern conveniences and lower upkeep, while older homes may attract buyers looking for traditional features or lower prices.

These features were chosen based on their correlation with SalePrice and on domain knowledge. Because they reflect factors that prospective buyers usually weigh when judging a property's value, they are reasonable predictors of house prices."""
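
The correlation ranking that guided this feature selection can be sketched in plain Python. The helper `pearson` below is a hand-written implementation of the Pearson coefficient, the default method of pandas' `corr()`. The five houses and their values are made up for illustration and are not taken from `train.csv`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five houses (illustrative, not from train.csv).
sale_price = [120, 150, 180, 210, 260]
features = {
    "OverallQual": [4, 5, 6, 7, 9],
    "GrLivArea": [900, 1400, 1300, 1800, 2200],
    "YrSold": [2008, 2006, 2010, 2007, 2009],
}

# Rank features by correlation with the target, strongest first.
ranking = sorted(((pearson(values, sale_price), name)
                  for name, values in features.items()), reverse=True)
for r, name in ranking:
    print(f"{name:<12} {r:+.3f}")
```

Correlation only captures linear, one-feature-at-a-time relationships. Features that rank high may also be strongly correlated with each other, which is worth checking before reading too much into individual regression coefficients.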

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [17]:
# Import necessary libraries
import pandas as pd
from transformers import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset and preprocess it
data_df = pd.read_csv('classified_sentiment_output.csv')
data_df['Sentiment'] = data_df['Sentiment'].str.lower()  # Converting sentiment labels in the dataset to lowercase to ensure consistency
data_subset = data_df.head(10).copy()  # Copying the first 10 entries from the dataset for analysis

# Initializing the zero-shot classification model
sentiment_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")  # Loading the BART model pre-trained on MNLI

# Defining labels and perform sentiment analysis
labels = ["positive", "negative", "neutral"]  # Setting the target labels for sentiment classification
# Applying the classifier to each text in the 'clean_text' column, store the highest confidence label in lowercase
data_subset['predicted_sentiment'] = data_subset['clean_text'].apply(
    lambda x: sentiment_classifier(x, candidate_labels=labels)['labels'][0].lower()
)

# Saving the results and displaying the output
data_subset.to_csv('sentiment_predictions.csv', index=False)  # Saving the dataframe with predictions to a CSV file
print(data_subset[['clean_text', 'predicted_sentiment']])  # Printing the cleaned text and its predicted sentiment
print("Output saved to sentiment_predictions.csv")

# Calculating and displaying evaluation metrics
# Creating a dictionary to store the metrics calculated from the actual and predicted labels
metrics = {
    'Accuracy': accuracy_score(data_subset['Sentiment'], data_subset['predicted_sentiment']),
    'Precision': precision_score(data_subset['Sentiment'], data_subset['predicted_sentiment'], average='weighted', zero_division=1),
    'Recall': recall_score(data_subset['Sentiment'], data_subset['predicted_sentiment'], average='weighted', zero_division=1),
    'F1 Score': f1_score(data_subset['Sentiment'], data_subset['predicted_sentiment'], average='weighted', zero_division=1)
}
# Printing each metric
print("\nMetrics:")
for metric_name, metric_value in metrics.items():
    print(f"{metric_name}: {metric_value:.2f}")


                                          clean_text predicted_sentiment
0  shawshank redemption written directed frank da...            positive
1  wonder film high rating quite literally breath...            positive
2  im trying save money last film title consider ...            positive
3  movie ordinary hollywood flick great deep mess...            positive
4  oscar year shawshank redemption written direct...            negative
5  best movie history best ending entertainment b...            positive
6  one finest film made recent year poignant stor...            positive
7  ive lost count number time seen movie one best...            positive
8  misery stand best adaptation one add shawshank...            positive
9  shawshank redemption without doubt one best fi...            positive
Output saved to sentiment_predictions.csv

Metrics:
Accuracy: 0.90
Precision: 1.00
Recall: 0.90
F1 Score: 0.95
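
An accuracy of 0.90 alongside a precision of 1.00 can look contradictory. With `average='weighted'`, each class's score is weighted by how often the class appears in the true labels, so a class that is predicted but never occurs in the ground truth contributes nothing. The sketch below is a simplified plain-Python reimplementation of the weighted metrics, applied to the ten predictions printed above. It assumes that all ten ground-truth labels are "positive", which the source does not confirm. The function `weighted_scores` is a hypothetical helper, not a scikit-learn API.

```python
from collections import Counter


def weighted_scores(y_true, y_pred, zero_division=1.0):
    """Support-weighted precision, recall and F1 (simplified sketch)."""
    support = Counter(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in set(y_true) | set(y_pred):
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        n_true = support[label]
        precision = tp / n_pred if n_pred else zero_division
        recall = tp / n_true if n_true else zero_division
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n_true / len(y_true)  # zero for labels absent from y_true
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals


# Predictions copied from the printed table; the ground truth is an assumption.
y_pred = ["positive"] * 4 + ["negative"] + ["positive"] * 5
y_true = ["positive"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = weighted_scores(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption the sketch reproduces the reported 0.90, 1.00, 0.90, and 0.95. With only ten rows, and possibly a single class in the ground truth, these figures say little about how the model would perform on the full dataset.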


In [None]:
""" 1. Overview of the Chosen Pre-trained Language Model (PLM)

Model used: facebook/bart-large-mnli

Overview:
BART (Bidirectional and Auto-Regressive Transformers): Developed by Facebook AI, this model is tailored for both encoding and decoding tasks, making it flexible for a wide range of NLP applications such as text comprehension and generation.

Sources for Pretraining:
- BART was pretrained on the same corpus as RoBERTa: roughly 160 GB of news, books, stories, and web text, which exposes it to a wide range of language styles and contexts.

Model Size:
- The facebook/bart-large-mnli variant boasts around 406 million parameters, which equips it with the ability to model intricate language patterns but also makes it computationally demanding.

Specialized Fine-tuning:
- This model has been fine-tuned on the Multi-Genre Natural Language Inference (MNLI) dataset, which targets textual entailment. The zero-shot pipeline builds on this: each candidate label is turned into a hypothesis (by default "This example is {}.") and the label whose hypothesis the model finds most entailed by the text is returned first.

2. Pros, Cons, and Challenges faced

Pros:
Flexibility: The BART architecture handles tasks beyond classification, including text summarization and translation, although this particular checkpoint is tuned for entailment.
Advanced Understanding: Its fine-tuning on the MNLI dataset makes the model adept at judging whether one text entails another, which is what zero-shot classification relies on.
Effective Performance: Without any task-specific training, the model reached 0.90 accuracy on the 10 reviews evaluated here.

Cons:
High Resource Needs: The model’s vast number of parameters necessitates significant computational resources for operation, limiting its accessibility for users with constrained computational power.
Risk of Overfitting: A model this large can overfit small or homogeneous datasets if it is fine-tuned. No fine-tuning is done in the zero-shot setting used here, so the more relevant risk is a mismatch between the MNLI training data and movie-review text.
Inherent Biases: As with most large-scale models, BART may replicate biases inherent in its training data, potentially leading to skewed outputs.

Challenges faced:
Resource Constraints: Running this model for inference is resource-intensive, which becomes a bottleneck on larger datasets. The evaluation above covers only the first 10 rows, so the reported metrics rest on a very small sample.
Setup and Compatibility: Integrating the Hugging Face pipeline with specific applications may require additional adjustments to ensure compatibility and functionality.
Output Interpretation: Making sense of the model’s classifications and effectively using this information can be complex due to the abstract nature of its outputs.

