# Project

## Part B: News Article Classification

### Task 1: Data Collection and Preprocessing

####  Step 1: Load the Dataset

In [14]:
import pandas as pd


In [15]:
df_news = pd.read_excel("data_news.xlsx")

In [16]:
df_news.columns = df_news.columns.str.strip()

In [17]:
print(df_news.columns.tolist())


['category', 'headline', 'links', 'short_description', 'keywords']


In [19]:
print(df_news.head())




   category                                           headline  \
0  WELLNESS              143 Miles in 35 Days: Lessons Learned   
1  WELLNESS       Talking to Yourself: Crazy or Crazy Helpful?   
2  WELLNESS  Crenezumab: Trial Will Gauge Whether Alzheimer...   
3  WELLNESS                     Oh, What a Difference She Made   
4  WELLNESS                                   Green Superfoods   

                                               links  \
0  https://www.huffingtonpost.com/entry/running-l...   
1  https://www.huffingtonpost.com/entry/talking-t...   
2  https://www.huffingtonpost.com/entry/crenezuma...   
3  https://www.huffingtonpost.com/entry/meaningfu...   
4  https://www.huffingtonpost.com/entry/green-sup...   

                                   short_description  \
0  Resting is part of training. I've confirmed wh...   
1  Think of talking to yourself as a tool to coac...   
2  The clock is ticking for the United States to ...   
3  If you want to be busy, keep trying to 

In [20]:
print(df_news.isnull().sum())


category                0
headline                0
links                   0
short_description       6
keywords             2706
dtype: int64
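
The counts above show 6 rows with a missing `short_description`. Below is a minimal sketch of one way to handle them before preprocessing, assuming an empty string is an acceptable placeholder. `df_demo` is a hypothetical stand-in for `df_news`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df_news with one missing description
df_demo = pd.DataFrame({
    "category": ["WELLNESS", "TRAVEL", "SPORTS"],
    "short_description": ["Resting is part of training.", None, "A late goal."],
})

# str(NaN) would produce the literal text "nan"; an empty string avoids that
df_demo["short_description"] = df_demo["short_description"].fillna("")

missing_after = int(df_demo["short_description"].isnull().sum())
texts = df_demo["short_description"].tolist()
print(missing_after, texts)
```

Dropping the 6 rows with `dropna(subset=["short_description"])` would be an equally reasonable choice for a dataset of this size.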


In [21]:
print(df_news['category'].value_counts())


WELLNESS          5000
POLITICS          5000
ENTERTAINMENT     5000
TRAVEL            5000
STYLE & BEAUTY    5000
PARENTING         5000
FOOD & DRINK      5000
WORLD NEWS        5000
BUSINESS          5000
SPORTS            5000
Name: category, dtype: int64


#### Step 2: Preprocess the Text

In [23]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

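def preprocess_text(text):
    """Lowercase, strip tags/punctuation/digits, drop stopwords, lemmatize."""
    # Note: str() turns a missing value (NaN) into the literal string "nan"
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)    # remove HTML-like tags
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\d+', '', text)      # remove digits
    tokens = text.split()                # whitespace tokenization
    # Drop English stopwords, then lemmatize what remains (noun POS by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Clean every short description and compare a few rows side by side
df_news['clean_text'] = df_news['short_description'].apply(preprocess_text)
df_news[['short_description', 'clean_text']].head()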


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shubh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\shubh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


|   | short_description | clean_text |
|---|---|---|
| 0 | Resting is part of training. I've confirmed wh... | resting part training ive confirmed sort alrea... |
| 1 | Think of talking to yourself as a tool to coac... | think talking tool coach challenge narrate exp... |
| 2 | The clock is ticking for the United States to ... | clock ticking united state find cure team work... |
| 3 | If you want to be busy, keep trying to be perf... | want busy keep trying perfect want happy focus... |
| 4 | First, the bad news: Soda bread, corned beef a... | first bad news soda bread corned beef beer hig... |
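
In the output above, "I've" survives as `ive`. Punctuation is stripped before the stopword check, so a contraction can no longer match an apostrophe-bearing stopword entry. One possible fix is to normalize the stopword list the same way as the text. This sketch uses a hypothetical mini stopword list (`demo_stop_words`) rather than NLTK's, and it only helps if the list actually contains the contraction, which depends on the NLTK version.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
demo_stop_words = {"is", "of", "i've", "what", "i"}

# Strip punctuation from the stopwords so they match the cleaned tokens
normalized_stop_words = {re.sub(r"[^\w\s]", "", w) for w in demo_stop_words}

def clean(text, stop_words):
    """Lowercase, remove punctuation, split on whitespace, drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return [tok for tok in text.split() if tok not in stop_words]

sample = "Resting is part of training. I've confirmed"
before = clean(sample, demo_stop_words)        # "ive" slips through
after = clean(sample, normalized_stop_words)   # "ive" is filtered
print(before)
print(after)
```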


### Task 2: Feature Extraction

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
tfidf = TfidfVectorizer(max_features=5000)

In [26]:
X = tfidf.fit_transform(df_news['clean_text']).toarray()
y = df_news['category']


In [27]:
print("Feature matrix shape:", X.shape)
print("Number of classes:", y.nunique())
print("Sample classes:", y.unique())

Feature matrix shape: (50000, 5000)
Number of classes: 10
Sample classes: ['WELLNESS' 'POLITICS' 'ENTERTAINMENT' 'TRAVEL' 'STYLE & BEAUTY'
 'PARENTING' 'FOOD & DRINK' 'WORLD NEWS' 'BUSINESS' 'SPORTS']
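
A dense 50000 × 5000 `float64` array occupies about 2 GB (50000 × 5000 × 8 bytes), while TF-IDF matrices are mostly zeros. The sketch below, on a hypothetical three-document corpus, shows that `fit_transform` already returns a SciPy sparse matrix that can be kept as-is. The three classifiers used later in this notebook accept sparse input, so the `.toarray()` call is optional.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df_news['clean_text']
docs = [
    "clock ticking united state",
    "soda bread corned beef",
    "want busy keep trying",
]

vec = TfidfVectorizer(max_features=5000)
X_sparse = vec.fit_transform(docs)   # sparse matrix, stores only non-zero entries
X_dense = X_sparse.toarray()         # dense copy, stores every entry

print("sparse:", issparse(X_sparse), X_sparse.shape, "non-zeros:", X_sparse.nnz)
print("dense entries:", X_dense.size)
```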


### Task 3: Model Development

#### Step 1: Train-Test Split


In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
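
The split above is not stratified, which is why the per-class support in the reports below varies between 955 and 1034 instead of being exactly 1000. Below is a minimal sketch of a stratified split on hypothetical toy labels. `features` and `labels` are illustrative stand-ins for `X` and `y`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data standing in for X and y
labels = ["SPORTS"] * 50 + ["TRAVEL"] * 50
features = [[i] for i in range(100)]

# stratify keeps the class proportions identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

With a balanced dataset like this one the effect is small, but stratifying makes the per-class metrics directly comparable.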


#### Step 2: Train Classifiers

#####  1. Logistic Regression



In [31]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)


##### 2. Naive Bayes

In [32]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)


##### 3. Support Vector Machine (SVM)



In [35]:
from sklearn.svm import LinearSVC

svm_model = LinearSVC(dual=False)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)
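
In this notebook the TF-IDF vectorizer is fitted on all 50,000 rows before the split, so the IDF weights also see the test documents. One way to avoid that, sketched here on hypothetical toy texts, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are fitted on training data only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; the real input would be the raw train texts
train_texts = [
    "team win match goal",
    "player score goal season",
    "flight hotel beach trip",
    "trip city tour hotel",
]
train_labels = ["SPORTS", "SPORTS", "TRAVEL", "TRAVEL"]

# The vectorizer is fitted inside fit(), using the training texts only
pipe = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipe.fit(train_texts, train_labels)

pred = pipe.predict(["hotel near beach"])
print(pred)
```

The same pattern works with `MultinomialNB` or `LinearSVC` as the final step.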


### Task 4: Model Evaluation

In [36]:
from sklearn.metrics import classification_report, accuracy_score


#####  1. Logistic Regression



In [38]:
print("Logistic Regression Performance:")
print(classification_report(y_test, y_pred_lr))
print("Accuracy:", accuracy_score(y_test, y_pred_lr))


Logistic Regression Performance:
                precision    recall  f1-score   support

      BUSINESS       0.63      0.68      0.65       955
 ENTERTAINMENT       0.55      0.56      0.55       985
  FOOD & DRINK       0.69      0.70      0.70      1021
     PARENTING       0.67      0.63      0.65      1030
      POLITICS       0.66      0.60      0.63      1034
        SPORTS       0.66      0.72      0.69       995
STYLE & BEAUTY       0.73      0.70      0.71       986
        TRAVEL       0.72      0.67      0.69      1008
      WELLNESS       0.63      0.67      0.65      1009
    WORLD NEWS       0.66      0.66      0.66       977

      accuracy                           0.66     10000
     macro avg       0.66      0.66      0.66     10000
  weighted avg       0.66      0.66      0.66     10000

Accuracy: 0.6581


##### 2. Naive Bayes

In [39]:
print("Naive Bayes Performance:")
print(classification_report(y_test, y_pred_nb))
print("Accuracy:", accuracy_score(y_test, y_pred_nb))


Naive Bayes Performance:
                precision    recall  f1-score   support

      BUSINESS       0.58      0.64      0.61       955
 ENTERTAINMENT       0.63      0.51      0.56       985
  FOOD & DRINK       0.69      0.72      0.70      1021
     PARENTING       0.53      0.63      0.57      1030
      POLITICS       0.67      0.58      0.62      1034
        SPORTS       0.72      0.66      0.68       995
STYLE & BEAUTY       0.69      0.69      0.69       986
        TRAVEL       0.68      0.66      0.67      1008
      WELLNESS       0.59      0.65      0.62      1009
    WORLD NEWS       0.67      0.69      0.68       977

      accuracy                           0.64     10000
     macro avg       0.65      0.64      0.64     10000
  weighted avg       0.65      0.64      0.64     10000

Accuracy: 0.641


##### 3. Support Vector Machine (SVM)



In [40]:
print("SVM Performance:")
print(classification_report(y_test, y_pred_svm))
print("Accuracy:", accuracy_score(y_test, y_pred_svm))


SVM Performance:
                precision    recall  f1-score   support

      BUSINESS       0.65      0.72      0.68       955
 ENTERTAINMENT       0.54      0.53      0.53       985
  FOOD & DRINK       0.69      0.70      0.69      1021
     PARENTING       0.64      0.61      0.62      1030
      POLITICS       0.64      0.59      0.61      1034
        SPORTS       0.67      0.75      0.71       995
STYLE & BEAUTY       0.71      0.70      0.71       986
        TRAVEL       0.68      0.64      0.66      1008
      WELLNESS       0.63      0.65      0.64      1009
    WORLD NEWS       0.65      0.65      0.65       977

      accuracy                           0.65     10000
     macro avg       0.65      0.65      0.65     10000
  weighted avg       0.65      0.65      0.65     10000

Accuracy: 0.6517


#####  Comparing Accuracy

In [41]:
model_scores = {
    "Logistic Regression": accuracy_score(y_test, y_pred_lr),
    "Naive Bayes": accuracy_score(y_test, y_pred_nb),
    "SVM": accuracy_score(y_test, y_pred_svm)
}

# pandas is already imported as pd at the top of the notebook
pd.DataFrame.from_dict(model_scores, orient='index', columns=['Accuracy'])


|   | Accuracy |
|---|---|
| Logistic Regression | 0.6581 |
| Naive Bayes | 0.641 |
| SVM | 0.6517 |
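
Accuracy alone does not show which categories get mistaken for each other. A confusion matrix is one way to inspect that. The sketch below uses hypothetical toy labels. In the notebook, the inputs would be `y_test` and one of the prediction arrays, such as `y_pred_lr`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels standing in for y_test and a model's predictions
y_true_demo = ["SPORTS", "SPORTS", "TRAVEL", "TRAVEL", "WELLNESS", "WELLNESS"]
y_pred_demo = ["SPORTS", "TRAVEL", "TRAVEL", "TRAVEL", "WELLNESS", "SPORTS"]
labels_demo = ["SPORTS", "TRAVEL", "WELLNESS"]

# Rows are true classes, columns are predicted classes, in labels_demo order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)
print(cm)
```

Off-diagonal cells count the misclassifications, so a large value in row i, column j means class i is often predicted as class j.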


### Final Summary – News Article Classification

In Part B of the project, I classified news articles into ten categories, such as politics, sports, and travel. The dataset contains 50,000 articles (5,000 per category), each with a short description and its category label.

To prepare the data, I cleaned the short descriptions by converting them to lowercase, removing punctuation, numbers, and stopwords, and lemmatizing the remaining words. I then used TF-IDF vectorization, limited to the 5,000 most frequent terms, to convert the cleaned text into numerical features for the machine learning models.
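
For reference, the TF-IDF weight of a term $t$ in a document $d$ can be written as follows, assuming scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), which the notebook does not override. Here $\mathrm{tf}(t, d)$ is the count of $t$ in $d$, $\mathrm{df}(t)$ is the number of documents containing $t$, and $n$ is the number of documents (50,000 in this dataset):

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each document vector is then scaled to unit Euclidean length, so terms that appear in few documents get a higher weight than terms that appear almost everywhere.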

I trained three models: Logistic Regression, Naive Bayes, and a linear Support Vector Machine (SVM). After splitting the data 80/20 into training and test sets, I evaluated each model using accuracy, precision, recall, and F1-score.

All three models gave reasonable results. Logistic Regression (accuracy 0.6581) and SVM (0.6517) performed best, while Naive Bayes (0.641) was slightly faster but a bit less accurate. Performance was fairly even across categories, although ENTERTAINMENT had the lowest F1-score for all three models (0.53 to 0.56).

This project helped me apply real-world NLP techniques to a text classification problem and gave me a better understanding of how to build and evaluate models that work with textual data.