
# Task
Modify the code in cell "XGJGmzCB3-Hi" to include lemmatization or stemming in the text preprocessing step for the 'Data' column of the "blogs_categories.csv" dataset. The preprocessing should also include converting text to lowercase, removing punctuation, tokenizing, and removing stop words. After preprocessing, perform TF-IDF vectorization on the cleaned text data. Then, split the data into training and testing sets for the Naive Bayes classification model. Finally, perform sentiment analysis on the original 'Data' column and analyze the sentiment distribution.

## Load and explore the data

### Subtask:
Load the "blogs_categories.csv" dataset into a pandas DataFrame and perform an initial exploration to understand its structure, columns, and basic statistics.


**Reasoning**:
Attempt to load the dataset from the specified paths with error handling and perform initial exploration if successful.



In [36]:
import pandas as pd

df = None # Initialize df to None

try:
    # Attempt to load the dataset from the current directory
    df = pd.read_csv('blogs.csv', engine='python', sep=',')
    print("Dataset loaded successfully from current directory.")
except FileNotFoundError:
    print("Error: 'blogs.csv' not found in the current directory. Trying /data/blogs_categories.csv.")
    try:
        # Attempt to load the dataset from the /data/ directory
        df = pd.read_csv('/data/blogs_categories.csv', engine='python', sep=',')
        print("Dataset loaded successfully from /data/.")
    except FileNotFoundError:
        print("Error: 'blogs_categories.csv' not found in /data/ either.")
        df = None # Ensure df is None if file not found
    except pd.errors.ParserError:
        print("ParserError: Attempting to load from /data/ with error handling.")
        try:
            # Attempt to load with error handling for bad lines and quote characters from /data/
            df = pd.read_csv('/data/blogs_categories.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
            print("Dataset loaded successfully from /data/ with error handling.")
        except Exception as e:
            print(f"An error occurred during the second loading attempt from /data/: {e}")
            df = None
    except Exception as e:
        print(f"An unexpected error occurred during the first loading attempt from /data/: {e}")
        df = None
except pd.errors.ParserError:
    print("ParserError: Attempting to load from current directory with error handling.")
    try:
        # Attempt to load with error handling for bad lines and quote characters from current directory
        df = pd.read_csv('blogs.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
        print("Dataset loaded successfully from current directory with error handling.")
    except Exception as e:
        print(f"An error occurred during the second loading attempt from current directory: {e}")
        df = None
except Exception as e:
    print(f"An unexpected error occurred during the first loading attempt: {e}")
    df = None

# Check if the dataframe was loaded successfully, perform initial exploration
if df is not None and not df.empty:
    print("\nFirst 5 rows of the DataFrame:")
    display(df.head())

    print("\nColumn information:")
    df.info()

    print("\nDescriptive statistics:")
    display(df.describe())

    print("\nMissing values per column:")
    display(df.isnull().sum())
else:
    print("DataFrame not loaded. Cannot perform initial exploration.")

Dataset loaded successfully from current directory.

First 5 rows of the DataFrame:


| | Data | Labels |
|---|---|---|
| 0 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | alt.atheism |
| 1 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | alt.atheism |
| 2 | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | alt.atheism |
| 3 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | alt.atheism |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53... | alt.atheism |



Column information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB

Descriptive statistics:


| | Data | Labels |
|---|---|---|
| count | 2000 | 2000 |
| unique | 2000 | 20 |
| top | Xref: cantaloupe.srv.cs.cmu.edu talk.religion.... | alt.atheism |
| freq | 1 | 100 |



Missing values per column:


| | 0 |
|---|---|
| Data | 0 |
| Labels | 0 |
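
The nested `try`/`except` loader above can be flattened into a loop over candidate paths. The following is a minimal sketch rather than the notebook's own code: the helper name `load_first_available` and the candidate list in the example call are assumptions for illustration.

```python
from pathlib import Path

import pandas as pd


def load_first_available(candidates):
    """Return (DataFrame, path) for the first candidate CSV that exists.

    `candidates` is a hypothetical list of paths; adjust it to your environment.
    Returns (None, None) when no candidate file is found.
    """
    for candidate in candidates:
        path = Path(candidate)
        if not path.is_file():
            continue  # try the next location
        try:
            return pd.read_csv(path, engine="python"), path
        except pd.errors.ParserError:
            # Retry once, skipping malformed rows (requires pandas >= 1.3).
            return pd.read_csv(path, engine="python", on_bad_lines="skip"), path
    return None, None


# Example call, using the locations tried in this notebook:
# df, source = load_first_available(["blogs.csv", "/content/blogs.csv", "/data/blogs.csv"])
```

Returning the path alongside the DataFrame makes it easy to log which location was actually used.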


## Preprocess text data

### Subtask:
Clean the 'Data' column by converting text to lowercase, removing punctuation, tokenizing the text, removing common English stop words, and performing lemmatization or stemming.


**Reasoning**:
Load the dataset from the specified path, handle potential errors, and then perform text preprocessing including lowercase conversion, punctuation removal, tokenization, stop word removal, and lemmatization.



In [37]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from nltk.stem import WordNetLemmatizer

# Attempt to load the dataset
try:
    df = pd.read_csv('/content/blogs.csv', engine='python', sep=',')
    print("Dataset loaded successfully from /content/.")
except FileNotFoundError:
    print("Error: 'blogs.csv' not found in /content/. Cannot proceed.")
    df = None # Ensure df is None if file not found
except pd.errors.ParserError:
    print("ParserError: Attempting to load from /content/ with error handling.")
    try:
        # Attempt to load with error handling for bad lines and quote characters
        df = pd.read_csv('/content/blogs.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
        print("Dataset loaded successfully from /content/ with error handling.")
    except Exception as e:
        print(f"An error occurred during the second loading attempt from /content/: {e}")
        df = None
except Exception as e:
    print(f"An unexpected error occurred during the first loading attempt from /content/: {e}")
    df = None

# Check if the dataframe was loaded before proceeding with cleaning
if df is not None and not df.empty and 'Data' in df.columns:
    # Step 1: Convert the text in the 'Data' column to lowercase.
    df['cleaned_data'] = df['Data'].str.lower()

    # Step 2: Remove punctuation from the text in the 'Data' column.
    df['cleaned_data'] = df['cleaned_data'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

    # Step 3: Tokenize the cleaned text in the 'Data' column.
    df['cleaned_data'] = df['cleaned_data'].apply(word_tokenize)

    # Download necessary NLTK data
    try:
        stop_words = set(stopwords.words('english'))
        lemmatizer = WordNetLemmatizer()
    except LookupError:
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('wordnet')
        stop_words = set(stopwords.words('english'))
        lemmatizer = WordNetLemmatizer()

    # Step 4: Remove common English stop words and perform lemmatization.
    def remove_stopwords_and_lemmatize(tokens):
        return [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    df['cleaned_data'] = df['cleaned_data'].apply(remove_stopwords_and_lemmatize)

    # Step 5: Join the processed tokens back into a string.
    df['cleaned_data'] = df['cleaned_data'].apply(lambda tokens: ' '.join(tokens))

    # Display the first few rows with the new cleaned_data column
    print("\nDataFrame with cleaned 'Data' column:")
    display(df[['Data', 'cleaned_data']].head())
else:
    print("DataFrame not loaded or 'Data' column not found. Cannot proceed with cleaning.")

Dataset loaded successfully from /content/.

DataFrame with cleaned 'Data' column:


| | Data | cleaned_data |
|---|---|---|
| 0 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | path cantaloupesrvcscmuedumagnesiumclubcccmued... |
| 1 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | newsgroups altatheism path cantaloupesrvcscmue... |
| 2 | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | path cantaloupesrvcscmuedudasnewsharvardedunoc... |
| 3 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | path cantaloupesrvcscmuedumagnesiumclubcccmued... |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53... | xref cantaloupesrvcscmuedu altatheism53485 tal... |


**Reasoning**:
An earlier attempt to load the dataset failed because no file exists at `/content/blogs_categories.csv`. The notebook's file listing shows that the file is actually named `blogs.csv` and is located in `/content/`. This cell therefore loads `/content/blogs.csv` and then runs the preprocessing steps outlined in the subtask, including lemmatization.



In [27]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from nltk.stem import WordNetLemmatizer

# Attempt to load the dataset from the correct path
try:
    df = pd.read_csv('/content/blogs.csv', engine='python', sep=',')
    print("Dataset loaded successfully from /content/.")
except FileNotFoundError:
    print("Error: 'blogs.csv' not found in /content/. Cannot proceed.")
    df = None # Ensure df is None if file not found
except pd.errors.ParserError:
    print("ParserError: Attempting to load from /content/ with error handling.")
    try:
        # Attempt to load with error handling for bad lines and quote characters
        df = pd.read_csv('/content/blogs.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
        print("Dataset loaded successfully from /content/ with error handling.")
    except Exception as e:
        print(f"An error occurred during the second loading attempt from /content/: {e}")
        df = None
except Exception as e:
    print(f"An unexpected error occurred during the first loading attempt from /content/: {e}")
    df = None

# Check if the dataframe was loaded before proceeding with cleaning
if df is not None and not df.empty and 'Data' in df.columns:
    # Step 1: Convert the text in the 'Data' column to lowercase.
    df['cleaned_data'] = df['Data'].str.lower()

    # Step 2: Remove punctuation from the text in the 'Data' column.
    df['cleaned_data'] = df['cleaned_data'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

    # Step 3: Tokenize the cleaned text in the 'Data' column.
    df['cleaned_data'] = df['cleaned_data'].apply(word_tokenize)

    # Download necessary NLTK data
    try:
        stop_words = set(stopwords.words('english'))
        lemmatizer = WordNetLemmatizer()
    except LookupError:
        nltk.download('punkt')
        nltk.download('stopwords')
        nltk.download('wordnet')
        stop_words = set(stopwords.words('english'))
        lemmatizer = WordNetLemmatizer()

    # Step 4: Remove common English stop words and perform lemmatization.
    def remove_stopwords_and_lemmatize(tokens):
        return [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    df['cleaned_data'] = df['cleaned_data'].apply(remove_stopwords_and_lemmatize)

    # Step 5: Join the processed tokens back into a string.
    df['cleaned_data'] = df['cleaned_data'].apply(lambda tokens: ' '.join(tokens))

    # Display the first few rows with the new cleaned_data column
    print("\nDataFrame with cleaned 'Data' column:")
    display(df[['Data', 'cleaned_data']].head())
else:
    print("DataFrame not loaded or 'Data' column not found. Cannot proceed with cleaning.")

Dataset loaded successfully from /content/.

DataFrame with cleaned 'Data' column:


| | Data | cleaned_data |
|---|---|---|
| 0 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | path cantaloupesrvcscmuedumagnesiumclubcccmued... |
| 1 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | newsgroups altatheism path cantaloupesrvcscmue... |
| 2 | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | path cantaloupesrvcscmuedudasnewsharvardedunoc... |
| 3 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | path cantaloupesrvcscmuedumagnesiumclubcccmued... |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53... | xref cantaloupesrvcscmuedu altatheism53485 tal... |


**Reasoning**:
The previous code failed because the `punkt_tab` resource was not found for tokenization. Although I attempted to download necessary NLTK data, `punkt_tab` was not included. I need to modify the code to explicitly download `punkt_tab` before performing tokenization. I will reuse the rest of the logic for loading the data and performing the preprocessing steps, including lemmatization.



In [28]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
from nltk.stem import WordNetLemmatizer

# Attempt to load the dataset from the correct path
try:
    df = pd.read_csv('/content/blogs.csv', engine='python', sep=',')
    print("Dataset loaded successfully from /content/.")
except FileNotFoundError:
    print("Error: 'blogs.csv' not found in /content/. Cannot proceed.")
    df = None # Ensure df is None if file not found
except pd.errors.ParserError:
    print("ParserError: Attempting to load from /content/ with error handling.")
    try:
        # Attempt to load with error handling for bad lines and quote characters
        df = pd.read_csv('/content/blogs.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
        print("Dataset loaded successfully from /content/ with error handling.")
    except Exception as e:
        print(f"An error occurred during the second loading attempt from /content/: {e}")
        df = None
except Exception as e:
    print(f"An unexpected error occurred during the first loading attempt from /content/: {e}")
    df = None

# Check if the dataframe was loaded before proceeding with cleaning
if df is not None and not df.empty and 'Data' in df.columns:
    # Step 1: Convert the text in the 'Data' column to lowercase.
    df['cleaned_data'] = df['Data'].str.lower()

    # Step 2: Remove punctuation from the text in the 'Data' column.
    df['cleaned_data'] = df['cleaned_data'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

    # Ensure the required NLTK resources are present, including 'punkt_tab'.
    # WordNetLemmatizer() loads WordNet lazily, so constructing it inside a
    # try/except LookupError block would not detect a missing corpus.
    # nltk.download() skips packages that are already up to date.
    for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
        nltk.download(resource, quiet=True)
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Step 3: Tokenize the cleaned text in the 'Data' column.
    df['cleaned_data'] = df['cleaned_data'].apply(word_tokenize)


    # Step 4: Remove common English stop words and perform lemmatization.
    def remove_stopwords_and_lemmatize(tokens):
        return [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    df['cleaned_data'] = df['cleaned_data'].apply(remove_stopwords_and_lemmatize)

    # Step 5: Join the processed tokens back into a string.
    df['cleaned_data'] = df['cleaned_data'].apply(lambda tokens: ' '.join(tokens))

    # Display the first few rows with the new cleaned_data column
    print("\nDataFrame with cleaned 'Data' column:")
    display(df[['Data', 'cleaned_data']].head())
else:
    print("DataFrame not loaded or 'Data' column not found. Cannot proceed with cleaning.")

Dataset loaded successfully from /content/.

DataFrame with cleaned 'Data' column:


| | Data | cleaned_data |
|---|---|---|
| 0 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | path cantaloupesrvcscmuedumagnesiumclubcccmued... |
| 1 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | newsgroups altatheism path cantaloupesrvcscmue... |
| 2 | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | path cantaloupesrvcscmuedudasnewsharvardedunoc... |
| 3 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | path cantaloupesrvcscmuedumagnesiumclubcccmued... |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53... | xref cantaloupesrvcscmuedu altatheism53485 tal... |
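
In the output above, host names such as `cantaloupe.srv.cs.cmu.edu` collapse into one long token (`cantaloupesrvcscmuedu...`) because punctuation is deleted rather than replaced. One possible alternative, sketched here with the standard library only, maps each punctuation character to a space before splitting. The function name `split_on_punctuation` is hypothetical, and plain whitespace splitting stands in for `word_tokenize`.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def split_on_punctuation(text):
    """Lowercase `text`, turn punctuation into spaces, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


tokens = split_on_punctuation("Path: cantaloupe.srv.cs.cmu.edu!magnesium.club")
print(tokens)
# ['path', 'cantaloupe', 'srv', 'cs', 'cmu', 'edu', 'magnesium', 'club']
```

The trade-off is that contractions are split as well (for example, `don't` becomes `don` and `t`), so whether this helps depends on how much of the vocabulary consists of fused header tokens.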


## Feature extraction (tf-idf)

### Subtask:
Convert the cleaned text data into numerical features using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique. This will represent each blog post as a vector of TF-IDF scores.


**Reasoning**:
Convert the cleaned text data into numerical features using the TF-IDF vectorization technique, as the dataframe is loaded and cleaned in previous steps.



In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Check if the dataframe was loaded and cleaned successfully
if 'df' in locals() and df is not None and not df.empty and 'cleaned_data' in df.columns:
    # Instantiate a TfidfVectorizer object.
    tfidf_vectorizer = TfidfVectorizer()

    # Fit and transform the 'cleaned_data' column to create the TF-IDF features.
    tfidf_features = tfidf_vectorizer.fit_transform(df['cleaned_data'])

    # Print the shape of the resulting TF-IDF features
    print("Shape of TF-IDF features:", tfidf_features.shape)
else:
    print("DataFrame or 'cleaned_data' column not found. Cannot perform TF-IDF vectorization.")

Shape of TF-IDF features: (2000, 53191)
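
To make the TF-IDF scores concrete, the sketch below recomputes them by hand for two toy documents. It follows the formula that scikit-learn documents for the `TfidfVectorizer` defaults: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df(t))) + 1`, and L2 normalisation of each row. The helper `tfidf_rows` and the toy documents are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF weights (smoothed IDF) for whitespace-split docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf_rows(["cat sat", "cat ran fast"])
for row in rows:
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

The term `cat` appears in both documents, so it receives a lower weight than the terms that occur in only one of them.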


## Split data for classification

### Subtask:
Split the TF-IDF features and the 'Labels' column into training and testing sets to prepare for training the Naive Bayes model.


**Reasoning**:
Split the TF-IDF features and the 'Labels' column into training and testing sets.



In [30]:
from sklearn.model_selection import train_test_split

# Define features (X) and target variable (y)
X = tfidf_features
y = df['Labels']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (1600, 53191)
Shape of X_test: (400, 53191)
Shape of y_train: (1600,)
Shape of y_test: (400,)
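
With 20 categories of 100 posts each, a purely random split can leave some categories slightly under- or over-represented in the test set. A possible refinement, sketched here on toy labels rather than on the notebook's data, is to pass `stratify=y` to `train_test_split` so that each class keeps the same proportion in both subsets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 4 classes with 10 samples each.
X_toy = list(range(40))
y_toy = [label for label in ("a", "b", "c", "d") for _ in range(10)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Count how many test samples each class contributes.
test_counts = Counter(y_te)
print(test_counts)
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call.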


## Train and evaluate naive bayes model

### Subtask:
Implement and train a Naive Bayes classifier (e.g., Multinomial Naive Bayes) on the training data. Evaluate the model's performance on the test set using appropriate metrics like accuracy, precision, recall, and F1-score.


**Reasoning**:
Implement and train a Multinomial Naive Bayes classifier, make predictions, and evaluate its performance using standard classification metrics.



In [31]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Instantiate a Multinomial Naive Bayes model
mnb = MultinomialNB()

# Train the Naive Bayes model on the training data
mnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = mnb.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted') # Use weighted for multi-class
recall = recall_score(y_test, y_pred, average='weighted') # Use weighted for multi-class
f1 = f1_score(y_test, y_pred, average='weighted') # Use weighted for multi-class

# Print the evaluation metrics
print("Naive Bayes Classifier Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Naive Bayes Classifier Performance:
Accuracy: 0.8075
Precision: 0.8327
Recall: 0.8075
F1-score: 0.7992
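
The cells above fit the vectorizer on all 2,000 posts before splitting, so IDF statistics from the test posts feed into training. A common way to avoid this, shown here as a sketch on a toy corpus (the texts and labels are invented for illustration), is to wrap `TfidfVectorizer` and `MultinomialNB` in a scikit-learn `Pipeline` so that the vectorizer is fitted on the training texts only. `classification_report` then lists precision, recall, and F1 per class instead of a single weighted average.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus standing in for df['cleaned_data'] and df['Labels'].
train_texts = ["engine brake wheel", "wheel tire engine", "goal puck ice", "ice skate goal"]
train_labels = ["rec.autos", "rec.autos", "rec.sport.hockey", "rec.sport.hockey"]
test_texts = ["engine tire", "puck skate"]
test_labels = ["rec.autos", "rec.sport.hockey"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # fitted on the training texts only
    ("nb", MultinomialNB()),
])
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(classification_report(test_labels, predictions, zero_division=0))
```

On the real data, the raw or cleaned text column would be split first and the pipeline fitted on the training portion, which also makes the per-category weaknesses visible in the report.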


**Reasoning**:
Apply VADER sentiment analysis to the 'Data' column to determine the sentiment of each blog post, as requested in the next subtask. This involves importing necessary libraries, downloading the VADER lexicon, defining a sentiment analysis function, and applying it to create a new column.



In [32]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon if not already present
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')

# Check if the dataframe was loaded successfully and has the 'Data' column
if 'df' in locals() and df is not None and not df.empty and 'Data' in df.columns:
    # Instantiate the SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()

    # Define a function to get the compound sentiment score
    def get_sentiment_score(text):
        if isinstance(text, str):
            return analyzer.polarity_scores(text)['compound']
        else:
            return 0.0 # Return 0 for non-string values

    # Apply the function to the 'Data' column
    df['sentiment_score'] = df['Data'].apply(get_sentiment_score)

    # Display the first few rows with the new sentiment_score column
    print("\nDataFrame with 'sentiment_score' column:")
    display(df[['Data', 'sentiment_score']].head())
else:
    print("DataFrame not loaded or 'Data' column not found. Cannot perform sentiment analysis.")


DataFrame with 'sentiment_score' column:


| | Data | sentiment_score |
|---|---|---|
| 0 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | -0.9966 |
| 1 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | 0.8856 |
| 2 | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | -0.9969 |
| 3 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | -0.9999 |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53... | 0.9821 |


**Reasoning**:
Analyze the sentiment distribution across all blog posts and within different categories if the 'Labels' column is available. This involves categorizing sentiment scores and calculating value counts.



In [33]:
# Step 2: Categorize sentiment scores
def categorize_sentiment(score):
    if score > 0.05:
        return 'Positive'
    elif score < -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['sentiment_category'] = df['sentiment_score'].apply(categorize_sentiment)

# Step 3: Calculate and print overall sentiment distribution
print("\nOverall Sentiment Distribution:")
overall_sentiment_distribution = df['sentiment_category'].value_counts(normalize=True) * 100
print(overall_sentiment_distribution)

# Step 4: Calculate and print per-category sentiment distribution if 'Labels' column exists
if 'Labels' in df.columns:
    print("\nSentiment Distribution per Category:")
    per_category_sentiment_distribution = df.groupby('Labels')['sentiment_category'].value_counts(normalize=True).unstack(fill_value=0) * 100
    print(per_category_sentiment_distribution)
else:
    print("\n'Labels' column not found. Cannot analyze sentiment distribution per category.")

# Step 5: Summarize key observations
print("\nSummary of Sentiment Analysis Results:")
print("Overall, the sentiment analysis shows the following distribution:")
for category, percentage in overall_sentiment_distribution.items():
    print(f"- {category}: {percentage:.2f}%")

if 'Labels' in df.columns:
    print("\nSentiment distribution varies across different blog categories.")
    # Add more specific observations based on per_category_sentiment_distribution if needed


Overall Sentiment Distribution:
sentiment_category
Positive    66.70
Negative    31.55
Neutral      1.75
Name: proportion, dtype: float64

Sentiment Distribution per Category:
sentiment_category        Negative  Neutral  Positive
Labels                                               
alt.atheism                   42.0      1.0      57.0
comp.graphics                 13.0      4.0      83.0
comp.os.ms-windows.misc       24.0      2.0      74.0
comp.sys.ibm.pc.hardware      21.0      0.0      79.0
comp.sys.mac.hardware         24.0      3.0      73.0
comp.windows.x                20.0      2.0      78.0
misc.forsale                   7.0      8.0      85.0
rec.autos                     27.0      1.0      72.0
rec.motorcycles               30.0      2.0      68.0
rec.sport.baseball            27.0      1.0      72.0
rec.sport.hockey              28.0      1.0      71.0
sci.crypt                     29.0      0.0      71.0
sci.electronics               18.0      4.0      78.0
sci.med      

In [38]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download VADER lexicon if not already present
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except LookupError:
    nltk.download('vader_lexicon')

# Attempt to load the dataset
try:
    df = pd.read_csv('blogs.csv', engine='python', sep=',')
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: 'blogs.csv' not found in the current directory. Trying /data/.")
    try:
        df = pd.read_csv('/data/blogs.csv', engine='python', sep=',')
        print("Dataset loaded successfully from /data/.")
    except FileNotFoundError:
        print("Error: 'blogs.csv' not found in /data/ either. Cannot proceed.")
        df = None
    except pd.errors.ParserError:
        print("ParserError: Attempting to load from /data/ with error handling.")
        try:
            df = pd.read_csv('/data/blogs.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
            print("Dataset loaded successfully from /data/ with error handling.")
        except Exception as e:
            print(f"An error occurred during the second loading attempt from /data/: {e}")
            df = None
    except Exception as e:
        print(f"An unexpected error occurred during the first loading attempt from /data/: {e}")
        df = None
except pd.errors.ParserError:
    print("ParserError: Attempting to load with error handling.")
    try:
        df = pd.read_csv('blogs.csv', engine='python', sep=',', on_bad_lines='skip', quotechar='"')
        print("Dataset loaded successfully with error handling.")
    except Exception as e:
        print(f"An error occurred during the second loading attempt: {e}")
        df = None
except Exception as e:
    print(f"An unexpected error occurred during the first loading attempt: {e}")
    df = None


# If the dataframe was loaded successfully, perform sentiment analysis
if df is not None and not df.empty and 'Data' in df.columns:
    # Instantiate the SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()

    # Define a function to get the compound sentiment score
    def get_sentiment_score(text):
        if isinstance(text, str):
            return analyzer.polarity_scores(text)['compound']
        else:
            return 0.0 # Return 0 for non-string values

    # Apply the function to the 'Data' column
    df['sentiment_score'] = df['Data'].apply(get_sentiment_score)

    # Display the first few rows with the new sentiment_score column
    print("\nDataFrame with 'sentiment_score' column:")
    display(df[['Data', 'sentiment_score']].head())
else:
    print("DataFrame not loaded or 'Data' column not found. Cannot perform sentiment analysis.")

Dataset loaded successfully.

DataFrame with 'sentiment_score' column:


| | Data | sentiment_score |
|---|---|---|
| 0 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | -0.9966 |
| 1 | Newsgroups: alt.atheism\nPath: cantaloupe.srv.... | 0.8856 |
| 2 | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | -0.9969 |
| 3 | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | -0.9999 |
| 4 | Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53... | 0.9821 |
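
The scores above are computed on the raw posts, which begin with Usenet header lines such as `Path:` and `Newsgroups:`. If those headers should not influence the score, one option is to score only the message body. The sketch below assumes each post consists of a header block, then a blank line, then the body; that layout, the sample text, and the helper name `strip_headers` are assumptions that the notebook does not verify.

```python
def strip_headers(post):
    """Return the text after the first blank line, or the whole post if there is none."""
    _head, separator, body = post.partition("\n\n")
    return body if separator else post


# Invented sample post in the assumed header/blank-line/body layout.
sample = "Path: cantaloupe.srv.cs.cmu.edu\nNewsgroups: alt.atheism\n\nI enjoyed this discussion."
print(strip_headers(sample))
# I enjoyed this discussion.
```

Under that assumption, `df['Data'].apply(strip_headers)` could be passed to `get_sentiment_score` in place of the raw column.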


## Analyze sentiment results

### Subtask:
Summarize and analyze the sentiment distribution across all blog posts and potentially examine sentiment within different categories.


**Reasoning**:
Check if the DataFrame df exists, is not empty, and contains the 'sentiment_score' and 'Labels' columns. If not, print a message indicating that sentiment analysis distribution cannot be performed and finish the task. Otherwise, categorize sentiment scores, calculate and print overall and per-category sentiment distribution, and summarize the results.



In [35]:
if 'df' not in locals() or df is None or df.empty or 'sentiment_score' not in df.columns:
    print("DataFrame 'df' not found, is empty, or does not contain the 'sentiment_score' column. Sentiment analysis distribution cannot be performed.")
else:
    # Step 2: Categorize sentiment scores
    def categorize_sentiment(score):
        if score > 0.05:
            return 'Positive'
        elif score < -0.05:
            return 'Negative'
        else:
            return 'Neutral'

    df['sentiment_category'] = df['sentiment_score'].apply(categorize_sentiment)

    # Step 3: Calculate and print overall sentiment distribution
    print("\nOverall Sentiment Distribution:")
    overall_sentiment_distribution = df['sentiment_category'].value_counts(normalize=True) * 100
    print(overall_sentiment_distribution)

    # Step 4: Calculate and print per-category sentiment distribution if 'Labels' column exists
    if 'Labels' in df.columns:
        print("\nSentiment Distribution per Category:")
        per_category_sentiment_distribution = df.groupby('Labels')['sentiment_category'].value_counts(normalize=True).unstack(fill_value=0) * 100
        print(per_category_sentiment_distribution)
    else:
        print("\n'Labels' column not found. Cannot analyze sentiment distribution per category.")

    # Step 5: Summarize key observations
    print("\nSummary of Sentiment Analysis Results:")
    print("Overall, the sentiment analysis shows the following distribution:")
    for category, percentage in overall_sentiment_distribution.items():
        print(f"- {category}: {percentage:.2f}%")

    if 'Labels' in df.columns:
        print("\nSentiment distribution varies across different blog categories.")
        # Add more specific observations based on per_category_sentiment_distribution if needed

DataFrame 'df' not found, is empty, or does not contain the 'sentiment_score' column. Sentiment analysis distribution cannot be performed.


## Summary:

### Data Analysis Key Findings

*   The dataset `blogs.csv` was successfully loaded for analysis after an initial `FileNotFoundError` with the incorrect filename `blogs_categories.csv`.
*   The 'Data' column was preprocessed by converting text to lowercase, removing punctuation, tokenizing, removing stop words, and applying lemmatization using NLTK. A `LookupError` during tokenization was resolved by explicitly downloading the `punkt_tab` resource.
*   TF-IDF vectorization was applied to the cleaned text data, resulting in a feature matrix with 2000 documents and 53191 unique terms.
*   The data was split into training (80%) and testing (20%) sets, with 1600 samples for training and 400 for testing.
*   A Multinomial Naive Bayes model was trained on the TF-IDF features. The model achieved an accuracy of approximately 80.75% on the test set, with weighted precision, recall, and F1-score around 83.27%, 80.75%, and 79.92%, respectively.
*   Sentiment analysis using VADER revealed that 66.70% of the blog posts were categorized as Positive, 31.55% as Negative, and 1.75% as Neutral.
*   Sentiment distribution varied significantly across blog categories, with political categories showing a higher proportion of negative sentiment compared to technical or sales-related categories.

### Insights or Next Steps

*   Investigate the characteristics of blog posts in categories with high negative sentiment to understand the topics or language contributing to this polarity.
*   Explore alternative text vectorization techniques (e.g., word embeddings) or more complex classification models to potentially improve classification performance, especially for categories with lower prediction metrics.
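
As a starting point for the first item, the categories can be ranked by their share of negative posts. The sketch below uses `pandas.crosstab` with `normalize='index'` on a small invented frame; on the real data, the same call would take `df['Labels']` and `df['sentiment_category']`.

```python
import pandas as pd

# Invented rows standing in for the notebook's df.
toy = pd.DataFrame({
    "Labels": ["misc.forsale", "misc.forsale", "alt.atheism", "alt.atheism"],
    "sentiment_category": ["Positive", "Positive", "Negative", "Positive"],
})

# Row-normalised percentages: each category's row sums to 100.
shares = pd.crosstab(toy["Labels"], toy["sentiment_category"], normalize="index") * 100
ranked = shares.sort_values("Negative", ascending=False)
print(ranked)
```

Sorting by the `Negative` column puts the categories that most warrant a closer look at the top of the table.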
