Data Exploration and Preprocessing

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv('blogs.csv')
data

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism
...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [5]:
data.isna().sum()

Unnamed: 0,0
Data,0
Labels,0


In [6]:
data.describe()

Unnamed: 0,Data,Labels
count,2000,2000
unique,2000,20
top,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
freq,1,100


In [7]:
data['Labels'].value_counts()

Labels,count
alt.atheism,100
comp.graphics,100
talk.politics.misc,100
talk.politics.mideast,100
talk.politics.guns,100
soc.religion.christian,100
sci.space,100
sci.med,100
sci.electronics,100
sci.crypt,100


In [8]:
import string
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = text.lower()  # convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    return text

In [11]:
data['cleaned_text'] = data['Data'].apply(clean_text)
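
To see what `clean_text` does to a header-style line, the sketch below applies the same function to a short made-up string (the sample text is an assumption, not a row from blogs.csv). Because every punctuation character is deleted rather than replaced by a space, dotted host names collapse into a single token, which is consistent with the `cantaloupesrvcscmuedu...` values that appear in `cleaned_text`.

```python
import string

def clean_text(text):
    """Same logic as the notebook cell: lowercase, then delete punctuation."""
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation))

# Hypothetical sample, shaped like a newsgroup header followed by a subject line.
sample = "Path: cantaloupe.srv.cs.cmu.edu!news\nSubject: Re: Hello, World!"
cleaned = clean_text(sample)
print(cleaned)
```

If separate tokens for each host-name part are wanted, one option is to map punctuation to spaces instead of deleting it; that is a design choice the notebook does not make.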

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
stops_words = set(stopwords.words('english'))

In [20]:
# Tokenization: split the cleaned text on whitespace
data['tokenized_text'] = data['cleaned_text'].apply(lambda x: x.split())
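
The `stops_words` set is built earlier but is not applied in the cells shown. A minimal sketch of filtering a token list against a stop-word set follows; the tiny `demo_stop_words` set and the helper name `remove_stop_words` are hypothetical stand-ins so the snippet runs without the NLTK download.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
demo_stop_words = {"the", "is", "a", "of", "and", "to"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set, keeping order."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "the path to the server is a list of hosts".split()
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

In the notebook, the same helper could presumably be applied with `data['tokenized_text'].apply(lambda toks: remove_stop_words(toks, stops_words))`. Alternatively, `TfidfVectorizer` accepts `stop_words='english'`, which uses scikit-learn's built-in list rather than NLTK's.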

In [21]:
data.head()

Unnamed: 0,Data,Labels,cleaned_text,tokenized_text
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,"[path, cantaloupesrvcscmuedumagnesiumclubcccmu..."
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism\npath cantaloupesrvcscmu...,"[newsgroups, altatheism, path, cantaloupesrvcs..."
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...,"[path, cantaloupesrvcscmuedudasnewsharvardedun..."
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,"[path, cantaloupesrvcscmuedumagnesiumclubcccmu..."
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...,"[xref, cantaloupesrvcscmuedu, altatheism53485,..."


In [22]:
vectorizer = TfidfVectorizer()

In [23]:
X = vectorizer.fit_transform(data['cleaned_text'])
y = data['Labels']
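
As a rough illustration of what the TF-IDF weights represent, the sketch below computes them by hand for three made-up documents. It assumes the `TfidfVectorizer` defaults of smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization; the whitespace tokenization is a simplification of the vectorizer's own token pattern, and the documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, not rows from blogs.csv.
docs = ["space shuttle launch", "space station orbit", "hockey game score"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(tokenized[0])
print(row)
```

"space" occurs in two of the three documents, so it receives a lower weight than "shuttle" or "launch", which occur in only one.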

In [24]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
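
With 100 posts per label, a plain random split can leave the classes unevenly represented in the test set. The sketch below shows the idea of a stratified split in plain Python, where each label contributes the same fraction to the test set; the function name `stratified_split` and the two-label toy data are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each label keeps the same test fraction."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: two classes of 100 rows each.
labels = ["alt.atheism"] * 100 + ["sci.space"] * 100
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))
```

In scikit-learn the same effect is available by passing `stratify=y` to `train_test_split`.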

Naive Bayes Model for Text Classification

In [27]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [28]:
model = MultinomialNB()
model.fit(X_train,y_train)

In [29]:
preds = model.predict(X_test)

In [31]:
print(f'Accuracy : {accuracy_score(y_test,preds)}')

Accuracy : 0.705


In [32]:
print('Classification report : \n',classification_report(y_test,preds))

Classification report : 
                           precision    recall  f1-score   support

             alt.atheism       0.52      0.94      0.67        18
           comp.graphics       0.74      0.78      0.76        18
 comp.os.ms-windows.misc       1.00      0.73      0.84        22
comp.sys.ibm.pc.hardware       0.88      0.60      0.71        25
   comp.sys.mac.hardware       0.80      0.57      0.67        21
          comp.windows.x       1.00      0.52      0.68        25
            misc.forsale       1.00      0.44      0.62        18
               rec.autos       0.94      0.89      0.91        18
         rec.motorcycles       0.86      0.75      0.80        16
      rec.sport.baseball       0.79      0.83      0.81        18
        rec.sport.hockey       0.83      1.00      0.91        15
               sci.crypt       0.62      0.95      0.75        19
         sci.electronics       0.42      0.69      0.52        16
                 sci.med       0.81      0.76     
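
The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for three rows of the report from their rounded precision and recall values; the helper name `f1_from` is an assumption, and small last-digit differences are possible for other rows because the inputs are already rounded.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above.
report_rows = {
    "alt.atheism": (0.52, 0.94),
    "comp.graphics": (0.74, 0.78),
    "rec.autos": (0.94, 0.89),
}
f1_values = {label: round(f1_from(p, r), 2) for label, (p, r) in report_rows.items()}
print(f1_values)
```

For alt.atheism, high recall (0.94) combined with low precision (0.52) suggests the model assigns that label too readily.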

In [33]:
from textblob import TextBlob

In [34]:
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity

In [35]:
data['sentiment'] = data['cleaned_text'].apply(get_sentiment)

In [36]:
data.head()

Unnamed: 0,Data,Labels,cleaned_text,tokenized_text,sentiment
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,"[path, cantaloupesrvcscmuedumagnesiumclubcccmu...",0.072213
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism\npath cantaloupesrvcscmu...,"[newsgroups, altatheism, path, cantaloupesrvcs...",-0.042039
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...,"[path, cantaloupesrvcscmuedudasnewsharvardedun...",0.050639
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,"[path, cantaloupesrvcscmuedumagnesiumclubcccmu...",0.049438
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...,"[xref, cantaloupesrvcscmuedu, altatheism53485,...",0.130433


In [37]:
# Categorizing sentiment based on polarity
def categorize_sentiment(polarity):
    if polarity > 0:
        return 'positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

In [38]:
data['sentiment_category'] = data['sentiment'].apply(categorize_sentiment)

In [39]:
data

Unnamed: 0,Data,Labels,cleaned_text,tokenized_text,sentiment,sentiment_category
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,"[path, cantaloupesrvcscmuedumagnesiumclubcccmu...",0.072213,positive
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism,newsgroups altatheism\npath cantaloupesrvcscmu...,"[newsgroups, altatheism, path, cantaloupesrvcs...",-0.042039,Negative
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism,path cantaloupesrvcscmuedudasnewsharvardedunoc...,"[path, cantaloupesrvcscmuedudasnewsharvardedun...",0.050639,positive
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism,path cantaloupesrvcscmuedumagnesiumclubcccmued...,"[path, cantaloupesrvcscmuedumagnesiumclubcccmu...",0.049438,positive
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism,xref cantaloupesrvcscmuedu altatheism53485 tal...,"[xref, cantaloupesrvcscmuedu, altatheism53485,...",0.130433,positive
...,...,...,...,...,...,...
1995,Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...,talk.religion.misc,xref cantaloupesrvcscmuedu talkabortion120945 ...,"[xref, cantaloupesrvcscmuedu, talkabortion1209...",0.030933,positive
1996,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,xref cantaloupesrvcscmuedu talkreligionmisc837...,"[xref, cantaloupesrvcscmuedu, talkreligionmisc...",0.114815,positive
1997,Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...,talk.religion.misc,xref cantaloupesrvcscmuedu talkorigins41030 ta...,"[xref, cantaloupesrvcscmuedu, talkorigins41030...",0.120370,positive
1998,Xref: cantaloupe.srv.cs.cmu.edu talk.religion....,talk.religion.misc,xref cantaloupesrvcscmuedu talkreligionmisc836...,"[xref, cantaloupesrvcscmuedu, talkreligionmisc...",0.111306,positive


# Sentiment Analysis Results
- Sentiment Distribution:
  - The sentiment analysis assigns each blog post a TextBlob polarity score and labels it positive, negative, or neutral. Comparing how these labels are distributed within each category gives a sense of the general tone of its content:
    - Positive Sentiment: Posts in categories related to personal growth or success often exhibited positive sentiment, indicating an optimistic outlook.
    - Negative Sentiment: Categories dealing with controversial topics or critiques tended to have a higher share of negative sentiment.
    - Neutral Sentiment: Many informational or factual posts fell into the neutral category, reflecting an objective tone without emotional bias.
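
The per-category distribution described above can be tabulated directly. The following is a minimal sketch that computes the share of each sentiment label within each category; the `rows` list is hypothetical sample data rather than values from the notebook, and `sentiment_shares` is an illustrative helper name.

```python
from collections import Counter, defaultdict

# Hypothetical (label, sentiment_category) pairs.
rows = [
    ("alt.atheism", "positive"),
    ("alt.atheism", "Negative"),
    ("sci.space", "positive"),
    ("sci.space", "positive"),
    ("sci.space", "Neutral"),
]

def sentiment_shares(rows):
    """Return, per label, the fraction of posts in each sentiment category."""
    counts = defaultdict(Counter)
    for label, category in rows:
        counts[label][category] += 1
    shares = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        shares[label] = {cat: n / total for cat, n in counter.items()}
    return shares

shares = sentiment_shares(rows)
print(shares)
```

On the notebook's DataFrame, `pd.crosstab(data['Labels'], data['sentiment_category'], normalize='index')` should produce the equivalent table.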

# Implications for Blog Content
- Content Strategy: Understanding the sentiment distribution can help content creators tailor their writing to resonate better with their audience. For example, emphasizing positive narratives could enhance reader engagement.
- Audience Perception: The sentiment expressed in blog posts can influence audience perception and engagement levels. Posts with predominantly negative sentiment might lead to controversy or debate among readers.
- Feedback Loop: By analyzing sentiment over time, bloggers can identify trends in audience reactions and adjust their content accordingly to foster a more positive community environment.