In [19]:
import os
import pandas as pd

news_file_path = os.path.join('./mind_data/MINDsmall_train', 'news.tsv')

news_df = pd.read_csv(news_file_path, sep='\t')

news_df.head()

Unnamed: 0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By","Shop the notebooks, jackets, and more that the royals can't live without.",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"", ""Type"": ""P"", ""WikidataId"": ""Q80976"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [48], ""SurfaceForms"": [""Prince Philip""]}, {""Label"": ""Charles, Prince of Wales"", ""Type"": ""P"", ""WikidataId"": ""Q43274"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [28], ""SurfaceForms"": [""Prince Charles""]}, {""Label"": ""Elizabeth II"", ""Type"": ""P"", ""WikidataId"": ""Q9682"", ""Confidence"": 0.97, ""OccurrenceOffsets"": [11], ""SurfaceForms"": [""Queen Elizabeth""]}]",[]
0,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
1,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
2,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
3,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."
4,N2073,sports,football_nfl,Should NFL be able to fine players for critici...,Several fines came down against NFL players fo...,https://assets.msn.com/labs/mind/AAJ4lap.html,"[{""Label"": ""National Football League"", ""Type"":...","[{""Label"": ""National Football League"", ""Type"":..."


In [26]:
print("Shape of the DataFrame:", news_df.shape)

print("\nData types of each column:")
print(news_df.dtypes)

print("\nMissing values per column:")
print(news_df.isnull().sum())

Shape of the DataFrame: (51282, 8)

Data types of each column:
NewsID              object
Category            object
SubCategory         object
Title               object
Abstract            object
URL                 object
TitleEntities       object
AbstractEntities    object
dtype: object

Missing values per column:
NewsID              0
Category            0
SubCategory         0
Title               0
Abstract            0
URL                 0
TitleEntities       3
AbstractEntities    4
dtype: int64


**Reasoning**:
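The `head()` output shows the first data row (`N55528`, `lifestyle`, ...) in the header position, so `news.tsv` has no header row and `read_csv` consumed its first record as column names. The shape, dtype, and missing-value output above already shows the corrected column names, apparently because that cell (`In [26]`) was re-run after the reload. I need to reload the data with `header=None`, assign the column names manually, and then analyze the distribution of the target variable (news categories) and handle missing values.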
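
A minimal sketch of the same fix done in a single `read_csv` call, assuming the column order used in this notebook. The inline two-row TSV and the name `MIND_NEWS_COLUMNS` are illustrative stand-ins, not part of the dataset or the original code.

```python
import io

import pandas as pd

# Column names as assigned in this notebook; news.tsv itself has no header row.
MIND_NEWS_COLUMNS = [
    "NewsID", "Category", "SubCategory", "Title",
    "Abstract", "URL", "TitleEntities", "AbstractEntities",
]

# Two made-up rows standing in for news.tsv (illustrative data, not from MIND).
sample_tsv = (
    "N1\tsports\tfootball_nfl\tTitle one\tAbstract one\thttps://example.com/1\t[]\t[]\n"
    "N2\thealth\tmedical\tTitle two\t\thttps://example.com/2\t[]\t[]\n"
)

# Default behaviour: the first data row is consumed as the header.
wrong_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# header=None keeps every row as data; names= assigns the columns in the same call.
right_df = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", header=None, names=MIND_NEWS_COLUMNS
)

print(len(wrong_df), len(right_df))
print(right_df["Abstract"].isna().sum())
```

Passing `names=` avoids the separate `news_df.columns = [...]` assignment, and the empty `Abstract` field in the second sample row is read as a missing value, which is the case the later `fillna('')` step handles.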



In [21]:
news_file_path = os.path.join('./mind_data/MINDsmall_train', 'news.tsv')
news_df = pd.read_csv(news_file_path, sep='\t', header=None)

news_df.columns = ['NewsID', 'Category', 'SubCategory', 'Title', 'Abstract', 'URL', 'TitleEntities', 'AbstractEntities']

print("\nDistribution of news categories:")
print(news_df['Category'].value_counts())

print("\nMissing values per column after reloading:")
print(news_df.isnull().sum())

news_df['Abstract'] = news_df['Abstract'].fillna('')

print("\nMissing values per column after handling:")
print(news_df.isnull().sum())


Distribution of news categories:
Category
news             15774
sports           14510
finance           3107
foodanddrink      2551
lifestyle         2479
travel            2350
video             2068
weather           2048
health            1885
autos             1639
tv                 889
music              769
movies             606
entertainment      587
kids                17
middleeast           2
northamerica         1
Name: count, dtype: int64

Missing values per column after reloading:
NewsID                 0
Category               0
SubCategory            0
Title                  0
Abstract            2666
URL                    0
TitleEntities          3
AbstractEntities       4
dtype: int64

Missing values per column after handling:
NewsID              0
Category            0
SubCategory         0
Title               0
Abstract            0
URL                 0
TitleEntities       3
AbstractEntities    4
dtype: int64


**Reasoning**:
Missing values in 'Abstract' are now empty strings, so concatenating 'Title' and 'Abstract' will not produce NaN for articles without an abstract. I will select the relevant columns, combine the text from 'Title' and 'Abstract', and perform basic text preprocessing (lowercasing and punctuation removal) on the combined text.
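
The preprocessing described above can be sketched as a standalone function on plain strings. The helper name `preprocess_text` is hypothetical; the notebook itself applies the same steps through pandas `.str` methods.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_text(title: str, abstract: str = "") -> str:
    """Join title and abstract with a space, lowercase, strip ASCII punctuation."""
    combined = f"{title} {abstract}"
    return combined.lower().translate(_PUNCT_TABLE)


print(preprocess_text("50 Worst Habits For Belly Fat"))
print(preprocess_text("I Was An NBA Wife.", "Here's How"))
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes and dashes pass through unchanged, and an empty abstract leaves a trailing space after the title.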



In [22]:
import string

relevant_cols_df = news_df[['Category', 'Title', 'Abstract']].copy()

relevant_cols_df['combined_text'] = relevant_cols_df['Title'] + ' ' + relevant_cols_df['Abstract']

relevant_cols_df['combined_text'] = relevant_cols_df['combined_text'].str.lower()

relevant_cols_df['combined_text'] = relevant_cols_df['combined_text'].str.translate(str.maketrans('', '', string.punctuation))

display(relevant_cols_df.head())

Unnamed: 0,Category,Title,Abstract,combined_text
0,lifestyle,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",the brands queen elizabeth prince charles and ...
1,health,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,50 worst habits for belly fat these seemingly ...
2,news,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,the cost of trumps aid freeze in the trenches ...
3,health,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",i was an nba wife heres how it affected my men...
4,health,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",how to get rid of skin tags according to a der...


In [23]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

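# Stratified splitting needs at least two samples per class, so find the
# categories that occur only once (here only 'northamerica').
category_counts = relevant_cols_df['Category'].value_counts()
single_sample_categories = category_counts[category_counts < 2].index

# Drop those categories before splitting.
filtered_relevant_cols_df = relevant_cols_df[~relevant_cols_df['Category'].isin(single_sample_categories)].copy()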

X_train, X_test, y_train, y_test = train_test_split(
    filtered_relevant_cols_df['combined_text'],
    filtered_relevant_cols_df['Category'],
    test_size=0.2,
    random_state=42,
    stratify=filtered_relevant_cols_df['Category']
)

model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000))
])

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        autos       0.76      0.54      0.63       328
entertainment       0.91      0.36      0.52       117
      finance       0.66      0.51      0.58       622
 foodanddrink       0.77      0.71      0.74       510
       health       0.78      0.60      0.68       377
         kids       0.00      0.00      0.00         3
    lifestyle       0.55      0.43      0.48       496
       movies       0.70      0.45      0.55       121
        music       0.90      0.47      0.62       154
         news       0.65      0.87      0.74      3155
       sports       0.88      0.94      0.91      2902
       travel       0.62      0.38      0.47       470
           tv       0.65      0.29      0.40       178
        video       0.58      0.20      0.29       414
      weather       0.73      0.65      0.69       410

     accuracy                           0.73     10257
    macro avg       0.68      0.49      0.55     10257
 weighte



In [24]:
import pandas as pd

new_news_examples = [
    "Tech giant announces new smartphone with groundbreaking camera technology.",
    "Local sports team wins championship after thrilling overtime game.",
    "Experts discuss the impact of recent interest rate hikes on the global economy.",
    "New study reveals the health benefits of incorporating mindfulness into daily routine.",
    "Famous chef shares a quick and easy recipe for a weeknight dinner."
]

new_news_series = pd.Series(new_news_examples)

predicted_categories = model.predict(new_news_series)

print("Predictions for new news examples:")
for text, category in zip(new_news_examples, predicted_categories):
    print(f"Text: {text}")
    print(f"Predicted Category: {category}\n")

Predictions for new news examples:
Text: Tech giant announces new smartphone with groundbreaking camera technology.
Predicted Category: video

Text: Local sports team wins championship after thrilling overtime game.
Predicted Category: sports

Text: Experts discuss the impact of recent interest rate hikes on the global economy.
Predicted Category: finance

Text: New study reveals the health benefits of incorporating mindfulness into daily routine.
Predicted Category: health

Text: Famous chef shares a quick and easy recipe for a weeknight dinner.
Predicted Category: foodanddrink



In [25]:
import pickle

with open('news_category_model.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model saved to news_category_model.pkl")

Model saved to news_category_model.pkl


## Summary:

### Data Analysis Key Findings

*   The dataset was loaded from a tab-separated file (`news.tsv`) located in the `./mind_data/MINDsmall_train` directory.
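*   The initial load used the default `header` setting, so `read_csv` treated the first data row as column names. Reloading with `header=None` and assigning the column names manually (`NewsID`, `Category`, `SubCategory`, `Title`, `Abstract`, `URL`, `TitleEntities`, `AbstractEntities`) fixed this.
*   Missing values were found in `Abstract` (2,666 rows), `TitleEntities` (3), and `AbstractEntities` (4). Missing `Abstract` values were filled with empty strings; the two entity columns were left as-is because the model does not use them.
*   The dataset is heavily imbalanced: `news` (15,774) and `sports` (14,510) together make up about 59% of the 51,282 articles, while `kids` (17), `middleeast` (2), and `northamerica` (1) are nearly empty.
*   Categories with fewer than two samples (only `northamerica`) were filtered out before splitting, because a stratified split needs at least two samples per class.
*   A Logistic Regression model in a pipeline with TF-IDF vectorization (5,000 features, English stop words removed) was trained on an 80/20 stratified split. On the 10,257 test articles it reached 0.73 accuracy and a macro-averaged F1 of 0.55; `sports` scored highest (F1 0.91), while `video` (F1 0.29) and `kids` (F1 0.00 on 3 test samples) scored lowest.
*   On five hand-written examples, the model returned plausible categories for four (`sports`, `finance`, `health`, `foodanddrink`) and labeled the smartphone announcement as `video`.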
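
The rare-category filter from the findings above can be illustrated without pandas. This is a sketch under stated assumptions: the three counts are copied from the `value_counts()` output earlier in the notebook, while the `labels` list and the constant `MIN_SAMPLES_PER_CLASS` are hypothetical names introduced only for this example.

```python
from collections import Counter

# Counts for the three smallest categories, copied from the value_counts()
# output earlier in the notebook; the label list is rebuilt for illustration.
small_counts = {"kids": 17, "middleeast": 2, "northamerica": 1}
labels = [cat for cat, n in small_counts.items() for _ in range(n)]

# Same threshold as the notebook: categories with count < 2 are dropped.
MIN_SAMPLES_PER_CLASS = 2

counts = Counter(labels)
too_rare = {cat for cat, n in counts.items() if n < MIN_SAMPLES_PER_CLASS}
kept_labels = [cat for cat in labels if cat not in too_rare]

print(sorted(too_rare))
print(Counter(kept_labels))
```

With this threshold `middleeast` survives with its two samples, which satisfies the stratified split but is too few to say anything about the model's quality on that class; it does not appear in the classification report.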
