<h2>20 Newsgroups Dataset Overview</h2>
<p>This notebook loads the <strong>20 Newsgroups</strong> dataset, a popular text classification benchmark of roughly 20,000 newsgroup posts spread nearly evenly across 20 categories. The scikit-learn version loaded here contains 18,846 documents.</p>
<p>The first code cell performs the following:</p>
<ul>
  <li>Loads the entire dataset (both training and test parts) using <code>fetch_20newsgroups</code> from <code>sklearn.datasets</code>, removing headers, footers, and quotes for cleaner text data.</li>
  <li>Converts the loaded data into a <code>pandas DataFrame</code> with three columns:
    <ul>
      <li><code>text</code>: the raw newsgroup post content</li>
      <li><code>target</code>: the integer label representing the newsgroup category</li>
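      <li><code>Category</code>: the category name at index 4 of <code>target_names</code> (<code>comp.sys.mac.hardware</code>), assigned to every row as an example; it is a constant column, not the per-row label</li>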
    </ul>
  </li>
  <li>Prints the first 5 rows of the DataFrame to give a preview of the dataset structure.</li>
</ul>
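<p>Because the cell assigns the single string <code>newsgroups.target_names[4]</code> to the whole column, every row ends up with the same <code>Category</code> value. A minimal sketch of a per-row lookup is shown here; the short <code>target_names</code> and <code>targets</code> lists are hypothetical stand-ins, not the real 20-name list.</p>

```python
# Hypothetical stand-ins for newsgroups.target_names and newsgroups.target
target_names = ["alt.atheism", "comp.graphics", "sci.space"]
targets = [2, 0, 2, 1]

# Look up each row's own category name instead of one fixed index
categories = [target_names[i] for i in targets]
print(categories)

# In the notebook, the equivalent column could be built as:
# df['Category'] = [newsgroups.target_names[i] for i in newsgroups.target]
```

<p>With a lookup like this, the <code>Category</code> column would agree with <code>target</code> on every row.</p>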


In [2]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the dataset (all subsets: train + test)
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Convert to DataFrame
df = pd.DataFrame({
    'text': newsgroups.data,
    'target': newsgroups.target,
    'Category': newsgroups.target_names[4]
})

# Show first 5 rows
print(df.head())


                                                text  target  \
0  \n\nI am sure some bashers of Pens fans are pr...      10   
1  My brother is in the market for a high-perform...       3   
2  \n\n\n\n\tFinally you said what you dream abou...      17   
3  \nThink!\n\nIt's the SCSI card doing the DMA t...       3   
4  1)    I have an old Jasmine drive which I cann...       4   

                Category  
0  comp.sys.mac.hardware  
1  comp.sys.mac.hardware  
2  comp.sys.mac.hardware  
3  comp.sys.mac.hardware  
4  comp.sys.mac.hardware  


In [4]:
df.shape

(18846, 3)

In [30]:
# importing the necessary libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


<h2>Train / Test Split</h2>

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], test_size=0.2, random_state=42)

<h2>TF-IDF Vectorization of Text Data</h2>
<p>This cell uses <code>TfidfVectorizer</code> from <code>scikit-learn</code> to convert the raw text data into numerical feature vectors that can be fed into a machine learning model.</p>
<ul>
  <li><strong>max_features=20000</strong>: Limits the vocabulary to the 20,000 most frequent words.</li>
  <li><strong>stop_words='english'</strong>: Removes common English stopwords to focus on meaningful words.</li>
  <li><strong>lowercase=True</strong>: Converts all text to lowercase for uniformity.</li>
  <li><code>fit_transform</code> is applied on training data to learn the vocabulary and transform texts into TF-IDF features.</li>
  <li><code>transform</code> is used on test data to convert texts into the same feature space learned from training data.</li>
</ul>
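<p>The vectorizer returns a sparse matrix; calling <code>.toarray()</code> converts it to a dense NumPy array of TF-IDF features, which uses more memory but can be passed directly to the Keras model.</p>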
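<p>To make the weighting concrete, the following sketch computes TF-IDF by hand on a toy corpus. It assumes the defaults of <code>TfidfVectorizer</code> (raw term counts, smoothed IDF, L2 normalization) and skips tokenization and stopword removal; the corpus and the helper <code>tfidf_vector</code> are hypothetical.</p>

```python
import math
from collections import Counter

# Toy corpus, already lowercased and tokenized (hypothetical example data)
docs = [
    ["space", "rocket", "launch"],
    ["hockey", "game", "launch"],
    ["space", "space", "orbit"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

<p>Terms that appear in fewer documents (such as <code>rocket</code>) receive a larger IDF than terms shared across documents (such as <code>launch</code>), so they contribute more to the vector.</p>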


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=20000, stop_words='english', lowercase=True)

X_train_vec = vectorizer.fit_transform(X_train).toarray()
X_test_vec = vectorizer.transform(X_test).toarray()

<h2>Neural Network Model Architecture</h2>
<p>This cell defines a simple feedforward neural network using <code>TensorFlow Keras</code> with the following layers:</p>
<ul>
  <li><strong>Input Layer:</strong> Accepts the TF-IDF feature vectors of shape <code>(X_train_vec.shape[1],)</code>.</li>
  <li><strong>Dense Layer 1:</strong> 512 neurons with ReLU activation to learn complex patterns.</li>
  <li><strong>Dropout Layer 1:</strong> Applies 30% dropout to reduce overfitting.</li>
  <li><strong>Dense Layer 2:</strong> 256 neurons with ReLU activation.</li>
  <li><strong>Dropout Layer 2:</strong> Another 30% dropout layer.</li>
  <li><strong>Output Layer:</strong> 20 neurons with softmax activation corresponding to the 20 newsgroup categories, producing a probability distribution over classes.</li>
</ul>
<p>This architecture is designed to classify the text data into one of the 20 categories.</p>
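<p>As a rough size check, the number of trainable parameters can be worked out from the layer widths alone. This sketch assumes the vectorizer fills its full <code>max_features=20000</code> vocabulary, so the input width is 20,000; the helper <code>dense_params</code> is hypothetical.</p>

```python
# Layer widths from the architecture above; input_dim assumes the vectorizer
# produced exactly max_features=20000 columns (an assumption, not verified here)
input_dim = 20000
layer_units = [512, 256, 20]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

total = 0
n_in = input_dim
for n_out in layer_units:
    params = dense_params(n_in, n_out)
    print(f"Dense({n_out}): {params:,} parameters")
    total += params
    n_in = n_out  # Dropout layers add no parameters and keep the width

print(f"Total trainable parameters: {total:,}")
```

<p>Under that assumption, almost all of the roughly 10.4 million parameters sit in the first dense layer, which is large relative to a training set of about 15,000 documents.</p>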


In [16]:
# Model
model = Sequential()
# Declare the input shape with an explicit Input object; passing
# input_shape to the first Dense layer triggers a Keras UserWarning
model.add(tf.keras.Input(shape=(X_train_vec.shape[1],)))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(20, activation='softmax'))


In [17]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')

<h2>Model Compilation</h2>
<p>In this step, we configure the model for training by specifying:</p>
<ul>
  <li><strong>Optimizer:</strong> <code>Adam</code> with a base learning rate of 0.001. Adam scales each parameter's update using running estimates of the gradient's first and second moments, which usually speeds up convergence.</li>
  <li><strong>Loss Function:</strong> <code>sparse_categorical_crossentropy</code> is used because the target labels are integers representing categories (not one-hot encoded).</li>
  <li><strong>Metrics:</strong> We track <code>accuracy</code> to monitor the proportion of correctly classified samples during training and validation.</li>
</ul>
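<p>The difference between the sparse and one-hot forms of cross-entropy is only in how the label is supplied. A small sketch with a hypothetical 4-class prediction shows that both give the same loss value:</p>

```python
import math

# Hypothetical softmax output for one sample over 4 classes
probs = [0.1, 0.7, 0.15, 0.05]
true_class = 1  # integer label, as stored in y_train

# Sparse form: index directly into the predicted probabilities
sparse_loss = -math.log(probs[true_class])

# One-hot form: same value, but needs an explicit indicator vector
one_hot = [1.0 if i == true_class else 0.0 for i in range(len(probs))]
dense_loss = -sum(y * math.log(p) for y, p in zip(one_hot, probs))

print(f"sparse: {sparse_loss:.4f}, one-hot: {dense_loss:.4f}")
```

<p>Keeping integer labels avoids building a 20-column one-hot matrix for every sample.</p>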


In [18]:
# Compile
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)


<h2>Model Training</h2>
<p>This cell trains the neural network on the TF-IDF feature vectors with the following settings:</p>
<ul>
  <li><strong>Training data:</strong> <code>X_train_vec</code> and labels <code>y_train</code>.</li>
  <li><strong>Validation data:</strong> <code>X_test_vec</code> and labels <code>y_test</code> to monitor performance on unseen data.</li>
  <li><strong>Epochs:</strong> The model will train for 10 full passes over the training dataset.</li>
  <li><strong>Batch size:</strong> 32 samples per training update.</li>
</ul>
<p>The training process outputs a <code>history</code> object containing loss and accuracy metrics to analyze model performance.</p>
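<p>The step counts shown in the Keras progress bars follow from the split and batch size. This sketch reproduces them, assuming the 18,846 rows reported by <code>df.shape</code> and the way <code>train_test_split</code> rounds a fractional <code>test_size</code> up:</p>

```python
import math

n_samples = 18846   # rows in df, from df.shape
test_size = 0.2
batch_size = 32

# train_test_split rounds the test share up and gives the rest to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# The last batch of an epoch may be smaller, so round up
steps_per_epoch = math.ceil(n_train / batch_size)
eval_steps = math.ceil(n_test / batch_size)

print(f"train rows: {n_train}, test rows: {n_test}")
print(f"steps per epoch: {steps_per_epoch}, evaluation steps: {eval_steps}")
```

<p>These values correspond to the <code>472/472</code> and <code>118/118</code> counters in the training and evaluation output.</p>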


In [19]:
history = model.fit(
    X_train_vec, y_train,
    validation_data=(X_test_vec, y_test),
    epochs=10,
    batch_size=32
)

Epoch 1/10
472/472 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.4126 - loss: 2.1070 - val_accuracy: 0.7361 - val_loss: 0.9024
Epoch 2/10
472/472 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.8875 - loss: 0.4197 - val_accuracy: 0.7366 - val_loss: 0.9003
Epoch 3/10
472/472 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.9555 - loss: 0.1711 - val_accuracy: 0.7332 - val_loss: 1.0077
Epoch 4/10
472/472 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.9688 - loss: 0.1196 - val_accuracy: 0.7302 - val_loss: 1.0547
Epoch 5/10
472/472 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.9702 - loss: 0.1111 - val_accuracy: 0.7353 - val_loss: 1.1214
Epoch 6/10
472/472 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - accuracy: 0.9679 - loss: 0.1112 - val_accuracy: 0.7334 - val_loss: 1.1307
Epoch 7/10
472/472

<h2>Checking the Final Accuracy</h2>

In [20]:
loss, accuracy = model.evaluate(X_test_vec, y_test)
print(f"Test Accuracy: {accuracy:.2f}")

118/118 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7202 - loss: 1.3425
Test Accuracy: 0.73


<h2>Predicting the Category of New Text</h2>
<p>This cell demonstrates how to use the trained model to classify a new piece of text:</p>
<ul>
  <li>A sample paragraph about space exploration is defined.</li>
  <li>The paragraph is transformed into TF-IDF feature vectors using the previously trained <code>vectorizer</code>, ensuring consistent preprocessing.</li>
  <li>The model predicts probabilities for each of the 20 categories.</li>
  <li>The category with the highest predicted probability is selected as the model’s prediction.</li>
  <li>The predicted category index is mapped back to the actual newsgroup category name for readability.</li>
</ul>
<p>This shows how your model can be applied to classify unseen text data.</p>
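<p>The index-to-name step can be sketched without the model. The probabilities and the four category names below are hypothetical (the real model outputs 20 values); the sketch also ranks the runner-up classes, which can help judge how confident a prediction is.</p>

```python
# Hypothetical probabilities for a 4-class model (the real model outputs 20 values)
target_names = ["comp.graphics", "rec.autos", "sci.med", "sci.space"]
pred_probs = [0.05, 0.10, 0.15, 0.70]

# argmax: index of the largest probability
pred_class = max(range(len(pred_probs)), key=lambda i: pred_probs[i])
print("Predicted:", target_names[pred_class])

# Rank all classes by probability to see the closest alternatives
ranked = sorted(zip(target_names, pred_probs), key=lambda pair: pair[1], reverse=True)
for name, prob in ranked[:3]:
    print(f"{name}: {prob:.2f}")
```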


In [29]:
# Example new text
new_text = ["Space exploration has made remarkable progress in recent years. With the development of advanced rockets and satellites, scientists are now able to study distant planets and galaxies in unprecedented detail. Many countries are investing heavily in space programs to discover new resources and expand human presence beyond Earth."]

# Transform using the same vectorizer
new_vec = vectorizer.transform(new_text).toarray()

# Predict probabilities
pred_probs = model.predict(new_vec)

# Get predicted class index
pred_class = pred_probs.argmax(axis=1)[0]

# Map the class index back to its category name
pred_category = newsgroups.target_names[pred_class]

print(f"Predicted category: {pred_category}")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
Predicted category: sci.space
