
# **Developing Fundamental Concepts for Preprocessing Data**

In [1]:
documents = [
    'The canal is open to shipping.',
    'This is a ship shipping ship, shipping shipping ships.',
    'The ship locked into the new canal.'
]
print(documents)

['The canal is open to shipping.', 'This is a ship shipping ship, shipping shipping ships.', 'The ship locked into the new canal.']


**Next, we convert the sentences to lowercase.**

In [2]:
lowercase_documents = []

for doc in documents:
  lowercase_documents.append(doc.lower())

print(lowercase_documents)

['the canal is open to shipping.', 'this is a ship shipping ship, shipping shipping ships.', 'the ship locked into the new canal.']


**Punctuation marks add little value for predicting the class label, so we remove punctuation from each sentence.**

In [3]:
import string

punctuation_removed_documents = []

for doc in lowercase_documents:
  punctuation_removed_documents.append("".join(j for j in doc if j not in string.punctuation))

print(punctuation_removed_documents)

['the canal is open to shipping', 'this is a ship shipping ship shipping shipping ships', 'the ship locked into the new canal']


**Next, we break each sentence into words.**

In [4]:
preprocessed_documents = []

for doc in punctuation_removed_documents:
  preprocessed_documents.append(doc.split(' ')) # split on space

print(preprocessed_documents)

[['the', 'canal', 'is', 'open', 'to', 'shipping'], ['this', 'is', 'a', 'ship', 'shipping', 'ship', 'shipping', 'shipping', 'ships'], ['the', 'ship', 'locked', 'into', 'the', 'new', 'canal']]


**Finally, we count how often each word occurs in each sentence using `Counter` from the `collections` module.**

In [5]:
from collections import Counter

frequency_list = []

for doc in preprocessed_documents:
  frequency_counts = Counter(doc)
  frequency_list.append(frequency_counts)

print(frequency_list)

[Counter({'the': 1, 'canal': 1, 'is': 1, 'open': 1, 'to': 1, 'shipping': 1}), Counter({'shipping': 3, 'ship': 2, 'this': 1, 'is': 1, 'a': 1, 'ships': 1}), Counter({'the': 2, 'ship': 1, 'locked': 1, 'into': 1, 'new': 1, 'canal': 1})]
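The four steps above can also be folded into one small helper. The following is a minimal sketch using only the standard library; the function name `count_words` is introduced here for illustration and is not part of the original notebook. It uses `split()` without an argument, which, unlike `split(' ')`, also copes with repeated spaces.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def count_words(document):
    """Lowercase, strip punctuation, split on whitespace, and count words."""
    cleaned = document.lower().translate(_PUNCTUATION_TABLE)
    return Counter(cleaned.split())


documents = [
    'The canal is open to shipping.',
    'This is a ship shipping ship, shipping shipping ships.',
    'The ship locked into the new canal.'
]

frequency_list = [count_words(doc) for doc in documents]
print(frequency_list[1])
```

For the three example sentences this should produce the same counts as the step-by-step version.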


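**The output above shows the words in each sentence and their frequencies. These manual steps are somewhat lengthy; `CountVectorizer` from `sklearn.feature_extraction.text` does the same job in one step. By default it lowercases the text, strips punctuation, and keeps only tokens of two or more characters, which is why the single-letter word 'a' is missing from the vocabulary below.**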

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer() # create the vectorizer with default settings

count_vector.fit(documents) # learn the vocabulary from the documents
print(count_vector.get_feature_names_out()) # vocabulary terms, in column order

['canal' 'into' 'is' 'locked' 'new' 'open' 'ship' 'shipping' 'ships' 'the'
 'this' 'to']


**The output above shows the set of words occurring in the dataset. Next, we compute the frequency of each word in each document with the following code.**

In [7]:
doc_array = count_vector.transform(documents).toarray()
display(doc_array)

array([[1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
       [0, 0, 1, 0, 0, 0, 2, 3, 1, 0, 1, 0],
       [1, 1, 0, 1, 1, 0, 1, 0, 0, 2, 0, 0]])
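To see where these numbers come from, the sketch below rebuilds the same matrix with the standard library only. It assumes the default settings of `CountVectorizer`: lowercasing and the documented default token pattern `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters. The helper name `tokenize` is made up for this illustration; it approximates the vectorizer's behavior and is not scikit-learn's implementation.

```python
import re

# Default token pattern documented for CountVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase the text and extract tokens of two or more characters."""
    return TOKEN_PATTERN.findall(document.lower())


documents = [
    'The canal is open to shipping.',
    'This is a ship shipping ship, shipping shipping ships.',
    'The ship locked into the new canal.'
]

tokenized = [tokenize(doc) for doc in documents]

# Columns are the sorted unique tokens; rows are per-document counts.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
count_matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row matches the corresponding row of `doc_array`, and the single-letter word 'a' never enters the vocabulary.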

**Pandas makes tabular data easier to read. We wrap the array in a pandas DataFrame so that each column is labeled with its word.**

In [8]:
import pandas as pd

frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())

display(frequency_matrix)

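| | canal | into | is | locked | new | open | ship | shipping | ships | the | this | to |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 3 | 1 | 0 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |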



# **Implement Naïve Bayes Classifier in Scikit-Learn**
**The SMS Spam Collection v.1 is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam: 4,827 legitimate messages (86.6%) and 747 spam messages (13.4%).**

In [25]:
import pandas as pd
# Dataset from
# https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
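# -nc skips the download if the file already exists; -o overwrites without prompting on re-runs
!wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip -o smsspamcollection.zip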

df = pd.read_table('SMSSpamCollection',
                   sep='\t',
                   header=None,
                   names=['label', 'message'])

df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Once the data is loaded, it is time for some preprocessing. We focus on removing variation that is irrelevant to the task at hand. First, we convert the labels from strings to binary values for our classifier. Calling `map` on a DataFrame column with a dictionary replaces each value with the entry stored under that key (here, 'ham' becomes 0 and 'spam' becomes 1).**

In [11]:
df['label'] = df.label.map({'ham': 0, 'spam': 1})
display(df.head())

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
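One detail worth checking after mapping with a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below uses a made-up Series (the values are illustrative, not taken from the dataset) to show how a stray capitalization would surface as a missing label.

```python
import pandas as pd

# Hypothetical labels; "Spam" has unexpected capitalization.
labels = pd.Series(["ham", "spam", "ham", "Spam"])

# Values without a matching dictionary key are mapped to NaN.
encoded = labels.map({"ham": 0, "spam": 1})

unmapped = int(encoded.isna().sum())
print(unmapped)
```

A non-zero count here would indicate labels that need cleaning before training.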


**Now that the labels are encoded, it is time to build our model. We start by splitting the data into training and test sets (by default, `train_test_split` holds out 25% of the rows for testing):**

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['message'],
                                                    df['label'], random_state=1)

print("Original dataset contains", df.shape[0], "messages")
print("Training set contains", X_train.shape[0], "messages")
print("Testing set contains", X_test.shape[0], "messages")

Original dataset contains 5572 messages
Training set contains 4179 messages
Testing set contains 1393 messages


**Next, we transform the messages into word counts, which are the features we feed into our model. The vocabulary is learned from the training set only and then reused for the test set:**

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer() # create the vectorizer with default settings

train_features = count_vector.fit_transform(X_train)
test_features = count_vector.transform(X_test)

**Our spam classifier uses the multinomial Naïve Bayes method from `sklearn.naive_bayes`. This method is well suited to discrete inputs (like word counts), whereas the Gaussian Naïve Bayes classifier is intended for continuous inputs.**

In [14]:
from sklearn.naive_bayes import MultinomialNB

# create the classifier
naive_bayes = MultinomialNB()

# train the classifier on the training set
naive_bayes.fit(train_features, y_train)
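To make the idea behind the classifier concrete, here is a minimal sketch of multinomial Naïve Bayes in plain Python. It scores each class as the log prior plus the sum of log word likelihoods, with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of `MultinomialNB`. The function names and the toy messages are assumptions made for this illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(token_docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({token for doc in token_docs for token in doc})
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(token_docs, labels):
        word_counts[label].update(doc)

    model = {"vocab": set(vocab), "log_prior": {}, "log_likelihood": {}}
    for label, n_docs_in_class in class_doc_counts.items():
        model["log_prior"][label] = math.log(n_docs_in_class / len(labels))
        # Denominator: all word occurrences in the class plus alpha per vocabulary term.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, log_prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        # Words never seen in training are skipped.
        scores[label] = log_prior + sum(
            likelihood[token] for token in tokens if token in model["vocab"]
        )
    return max(scores, key=scores.get)


# Hypothetical toy messages, already tokenized.
train_docs = [
    ["free", "prize", "win"],
    ["win", "cash", "now"],
    ["see", "you", "at", "lunch"],
    ["lunch", "at", "noon"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, ["win", "free", "cash"]))
print(predict(model, ["lunch", "at", "noon"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.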

**Once the classifier is trained, we can evaluate its performance on the test set:**

In [15]:
import numpy as np

#predict using the model on the testing set
predictions = naive_bayes.predict(test_features)

print(np.mean(predictions == y_test))

0.9885139985642498


**Our simple Naïve Bayes classifier reaches about 98.9% accuracy on this specific test set! Accuracy alone is not enough, however, because the labels are imbalanced (86.6% legitimate versus 13.4% spam). A classifier could score well simply by favoring the legitimate class while missing much of the spam. To check, let's look at the confusion matrix:**

In [16]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predictions))

[[1203    5]
 [  11  174]]
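In scikit-learn's convention, rows of the confusion matrix are true labels and columns are predicted labels, so the matrix above reads: 1203 ham messages kept, 5 ham messages flagged as spam, 11 spam messages missed, and 174 spam messages caught. The short sketch below derives precision, recall, and F1 for the spam class from those four numbers with plain arithmetic; the variable names are chosen here for illustration.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
confusion = [[1203, 5],
             [11, 174]]

(tn, fp), (fn, tp) = confusion  # spam (label 1) is treated as the positive class

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Recall for spam comes out at about 0.94 and precision at about 0.97, so the classifier is not simply predicting the majority class.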


# **Lab Exercises**

1. Create a dataset containing two different types of news: sports and politics. (Collect 100 news items, 50 from each class.)

2. Apply a Naïve Bayes classifier to your dataset and report the classification results.
3. [Bonus] How would you extend your dataset and code for more than two classes?

## Dataset creation

### Subtask:
Generate or collect a dataset of 100 news articles (50 sports, 50 politics) and save it in a suitable format (e.g., CSV).


**Reasoning**:
Manually create the lists of sports and politics news, their corresponding labels, combine them, create a pandas DataFrame, and save it to a CSV file.



In [17]:
sports_news = [
    "LeBron James leads the Lakers to victory with a last-second shot.",
    "Manchester United wins the Premier League title.",
    "Serena Williams advances to the next round of Wimbledon.",
    "The Chicago Cubs win the World Series.",
    "Tom Brady announces his retirement from the NFL.",
    "Golden State Warriors dominate the NBA finals.",
    "Tiger Woods makes a comeback at the Masters.",
    "Usain Bolt breaks the world record in the 100m sprint.",
    "The New York Yankees sign a new pitcher.",
    "Liverpool secures a win against their rivals.",
    "Michael Jordan's greatest moments in basketball.",
    "Simone Biles wins gold at the Olympics.",
    "The FIFA World Cup kicks off in Qatar.",
    "Roger Federer retires from professional tennis.",
    "Lewis Hamilton wins the Formula 1 championship.",
    "The Boston Celtics defeat the Miami Heat.",
    "Naomi Osaka withdraws from the French Open.",
    "Concerns rise over player safety in the NFL.",
    "New rules implemented in Major League Baseball.",
    "College basketball tournament March Madness begins.",
    "The Ryder Cup golf tournament is underway.",
    "Cricket World Cup preparations are in full swing.",
    "Rugby Union Six Nations tournament results.",
    "Hockey playoffs are heating up.",
    "Boxing match ends in a knockout.",
    "Swimming records tumble at the championships.",
    "Athletics world championships highlights.",
    "Cycling Tour de France stage winners.",
    "Table tennis world rankings updated.",
    "Badminton star wins major title.",
    "Esports team claims victory in the grand final.",
    "Surfing competition held in Hawaii.",
    "Skateboarding added to the Olympic program.",
    "Gymnastics team qualifies for the Olympics.",
    "Wrestling championship results.",
    "Fencing tournament held in Paris.",
    "Archery team aims for gold.",
    "Sailing race concludes in strong winds.",
    "Equestrian event draws large crowds.",
    "Judo competition showcases top athletes.",
    "Karate introduced as an Olympic sport.",
    "Baseball season opener draws huge attendance.",
    "Football transfer window closes.",
    "Basketball team signs a new coach.",
    "Tennis player reaches career-high ranking.",
    "Golf course renovations completed.",
    "Athletics track and field events schedule.",
    "Swimming pool construction begins.",
    "Cycling route for the next race announced.",
    "Team celebrates championship win."
]

politics_news = [
    "President signs new healthcare bill into law.",
    "Congress debates infrastructure spending package.",
    "Supreme Court hears arguments on voting rights.",
    "Prime Minister addresses the nation on the economy.",
    "United Nations holds a climate change summit.",
    "Senator introduces legislation on technology regulation.",
    "Election polls show a tight race.",
    "Political protest held in the capital city.",
    "Diplomatic talks aim to resolve international conflict.",
    "New trade agreement signed between two countries.",
    "Cabinet reshuffle announced by the government.",
    "Investigation launched into political corruption.",
    "Opposition party leader criticizes government policy.",
    "Local elections results announced.",
    "Former President gives a public speech.",
    "Political analyst discusses current events.",
    "Embassy security measures increased.",
    "Summit of world leaders concludes.",
    "Debate on immigration policy continues.",
    "Town hall meeting held with constituents.",
    "Political campaign fundraising efforts intensify.",
    "Governor signs state budget into law.",
    "Mayor proposes new urban development plan.",
    "City council approves zoning changes.",
    "Public opinion survey on political issues released.",
    "International relations strained by recent events.",
    "Political rally draws thousands of supporters.",
    "Legislative session begins with a new agenda.",
    "Political commentary on the news cycle.",
    "Ambassador presents credentials to foreign government.",
    "Treaty ratification process underway.",
    "Economic sanctions imposed on a foreign nation.",
    "Political crisis escalates in the region.",
    "Human rights report highlights global concerns.",
    "Political asylum granted to refugees.",
    "Intelligence agency releases declassified documents.",
    "Cybersecurity threats to political institutions.",
    "Political cartoonist offers satirical view.",
    "Political scientist publishes new research.",
    "Youth engagement in politics discussed.",
    "Voting registration drive launched.",
    "Political advertising spending increases.",
    "Political strategists analyze election trends.",
    "Caucus meetings held to discuss party platform.",
    "Political think tank publishes policy recommendations.",
    "Election recount requested in disputed state.",
    "Inauguration ceremony for the new leader.",
    "Political memoir released by former official.",
    "Grassroots political movement gains traction.",
    "Political transition of power is smooth."
]

news_articles = sports_news + politics_news
labels = ['sports'] * len(sports_news) + ['politics'] * len(politics_news)

df = pd.DataFrame({'message': news_articles, 'label': labels})

df.to_csv('news_dataset.csv', index=False)

display(df.head())
display(df.tail())

| | message | label |
|---|---|---|
| 0 | LeBron James leads the Lakers to victory with ... | sports |
| 1 | Manchester United wins the Premier League title. | sports |
| 2 | Serena Williams advances to the next round of ... | sports |
| 3 | The Chicago Cubs win the World Series. | sports |
| 4 | Tom Brady announces his retirement from the NFL. | sports |


| | message | label |
|---|---|---|
| 95 | Election recount requested in disputed state. | politics |
| 96 | Inauguration ceremony for the new leader. | politics |
| 97 | Political memoir released by former official. | politics |
| 98 | Grassroots political movement gains traction. | politics |
| 99 | Political transition of power is smooth. | politics |


## Data loading and preparation

### Subtask:
Load the dataset into a pandas DataFrame and prepare it for classification, similar to how the SMS spam dataset was handled (mapping labels to numerical values).


**Reasoning**:
Load the news_dataset.csv file into a pandas DataFrame and map the 'label' column to numerical values.



In [18]:
df = pd.read_csv('news_dataset.csv')
df['label'] = df.label.map({'sports': 0, 'politics': 1})
display(df.head())

| | message | label |
|---|---|---|
| 0 | LeBron James leads the Lakers to victory with ... | 0 |
| 1 | Manchester United wins the Premier League title. | 0 |
| 2 | Serena Williams advances to the next round of ... | 0 |
| 3 | The Chicago Cubs win the World Series. | 0 |
| 4 | Tom Brady announces his retirement from the NFL. | 0 |


## Text preprocessing and vectorization

### Subtask:
Apply text preprocessing techniques (like lowercasing, removing punctuation) and vectorize the text data using a method like `CountVectorizer`.


**Reasoning**:
Vectorize the text with `CountVectorizer`, which lowercases the text and strips punctuation by default, so no separate preprocessing step is needed.



In [19]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
features = count_vector.fit_transform(df['message'])

print("Shape of the document-term matrix:", features.shape)

Shape of the document-term matrix: (100, 399)


## Data splitting

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Split the features and labels into training and testing sets and print their shapes.



In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    df['label'],
                                                    test_size=0.25,
                                                    random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (75, 399)
Shape of X_test: (25, 399)
Shape of y_train: (75,)
Shape of y_test: (25,)
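Note that in the cells above the vocabulary is learned from all 100 articles before the split, so words from the test articles shape the feature space. A common alternative, sketched below under stated assumptions, is to split the raw text first and let a pipeline fit the vectorizer on the training portion only. The eight example headlines and the variable names are illustrative; in the notebook, `df['message']` and `df['label']` would take their place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Small illustrative subset of headlines (stand-in for the full dataset).
texts = [
    "Manchester United wins the Premier League title.",
    "The Chicago Cubs win the World Series.",
    "Golden State Warriors dominate the NBA finals.",
    "Lewis Hamilton wins the Formula 1 championship.",
    "President signs new healthcare bill into law.",
    "Congress debates infrastructure spending package.",
    "Election polls show a tight race.",
    "Governor signs state budget into law.",
]
labels = ["sports"] * 4 + ["politics"] * 4

# Split the raw text first; stratify keeps the class balance in both parts.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on the training text only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train_text, y_train)

predictions = model.predict(X_test_text)
print(list(zip(X_test_text, predictions)))
```

With this arrangement, calling `predict` on new text applies the same vectorizer automatically, so the training and test features cannot drift apart.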


## Model training

### Subtask:
Train a Multinomial Naive Bayes classifier on the training data.


**Reasoning**:
Import the MultinomialNB classifier and train the model using the training data.



In [21]:
from sklearn.naive_bayes import MultinomialNB

# Instantiate a MultinomialNB object
naive_bayes = MultinomialNB()

# Train the classifier on the training set
naive_bayes.fit(X_train, y_train)

## Model evaluation

### Subtask:
Evaluate the performance of the classifier on the testing data using appropriate metrics (accuracy, confusion matrix, etc.).


**Reasoning**:
Use the trained Naive Bayes model to make predictions on the test set and evaluate its performance using accuracy and a confusion matrix.



In [22]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Make predictions on the test features
predictions = naive_bayes.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, predictions)

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, predictions)

# Print the results
print("Accuracy:", accuracy)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.8

Confusion Matrix:
 [[11  3]
 [ 2  9]]


## Reporting results

### Subtask:
Present the classification results clearly.


**Reasoning**:
Summarize the classification results by explaining the accuracy and the confusion matrix in the context of the problem.



In [23]:
# Summarize the accuracy
print(f"The accuracy of the Naive Bayes classifier on the test set is: {accuracy:.2f}\n")

# Explain the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)
print("\nInterpretation of the Confusion Matrix:")
print("The confusion matrix shows the following:")
print(f"- True Negatives (Top-Left): {conf_matrix[0, 0]} - Number of 'sports' news correctly classified as 'sports'.")
print(f"- False Positives (Top-Right): {conf_matrix[0, 1]} - Number of 'sports' news incorrectly classified as 'politics'.")
print(f"- False Negatives (Bottom-Left): {conf_matrix[1, 0]} - Number of 'politics' news incorrectly classified as 'sports'.")
print(f"- True Positives (Bottom-Right): {conf_matrix[1, 1]} - Number of 'politics' news correctly classified as 'politics'.")

# Discuss correctly and incorrectly classified instances
print("\nClassification Summary:")
print(f"Out of {conf_matrix[0, 0] + conf_matrix[0, 1]} actual 'sports' news articles:")
print(f"  - {conf_matrix[0, 0]} were correctly classified as 'sports'.")
print(f"  - {conf_matrix[0, 1]} were incorrectly classified as 'politics'.")
print(f"Out of {conf_matrix[1, 0] + conf_matrix[1, 1]} actual 'politics' news articles:")
print(f"  - {conf_matrix[1, 1]} were correctly classified as 'politics'.")
print(f"  - {conf_matrix[1, 0]} were incorrectly classified as 'sports'.")

# Briefly interpret the results
print("\nInterpretation of Results:")
print(f"The model achieved an accuracy of {accuracy:.2f}, meaning it correctly classified {int((conf_matrix[0,0] + conf_matrix[1,1]))} out of {len(y_test)} test articles.")
print("The confusion matrix provides a detailed view of the model's performance for each class.")
print(f"The model is reasonably good at distinguishing between sports and politics news on this small dataset, but there are still some misclassifications ({conf_matrix[0,1] + conf_matrix[1,0]} in total).")

The accuracy of the Naive Bayes classifier on the test set is: 0.80

Confusion Matrix:
[[11  3]
 [ 2  9]]

Interpretation of the Confusion Matrix:
The confusion matrix shows the following:
- True Negatives (Top-Left): 11 - Number of 'sports' news correctly classified as 'sports'.
- False Positives (Top-Right): 3 - Number of 'sports' news incorrectly classified as 'politics'.
- False Negatives (Bottom-Left): 2 - Number of 'politics' news incorrectly classified as 'sports'.
- True Positives (Bottom-Right): 9 - Number of 'politics' news correctly classified as 'politics'.

Classification Summary:
Out of 14 actual 'sports' news articles:
  - 11 were correctly classified as 'sports'.
  - 3 were incorrectly classified as 'politics'.
Out of 11 actual 'politics' news articles:
  - 9 were correctly classified as 'politics'.
  - 2 were incorrectly classified as 'sports'.

Interpretation of Results:
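The model achieved an accuracy of 0.80, meaning it correctly classified 20 out of 25 test articles.
The confusion matrix provides a detailed view of the model's performance for each class.
The model is reasonably good at distinguishing between sports and politics news on this small dataset, but there are still some misclassifications (5 in total).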

## Bonus: extension to multiple classes

### Subtask:
Discuss how to adapt the dataset and code for classification with more than two classes.


**Reasoning**:
Explain how to extend the dataset and code for multi-class classification as requested in the bonus part of the original task.



In [24]:
print("Extending to Multi-Class Classification")
print("----------------------------------------")

print("\n1. Dataset Structure Modification:")
print("To extend the dataset for more than two classes (e.g., sports, politics, entertainment, technology), you would need to collect news articles for each additional category.")
print("The dataset structure (like the pandas DataFrame 'df') would remain similar, with 'message' and 'label' columns.")
print("However, the 'label' column would now contain strings representing the new categories (e.g., 'entertainment', 'technology') in addition to 'sports' and 'politics'.")
print("For example:")
print("message                             | label")
print("------------------------------------|---------")
print("New movie breaks box office records.| entertainment")
print("Scientists discover new planet.    | technology")
print("... (and so on for other categories)")

print("\n2. Label Mapping Extension:")
print("The label mapping step needs to be extended to assign a unique numerical value to each new class.")
print("Instead of just mapping 'sports' to 0 and 'politics' to 1, you would expand the mapping dictionary:")
print("`df['label'] = df.label.map({'sports': 0, 'politics': 1, 'entertainment': 2, 'technology': 3, ...})`")
print("Each class must have a distinct integer label.")

print("\n3. CountVectorizer and Multinomial Naive Bayes Compatibility:")
print("Fortunately, the `CountVectorizer` from `sklearn.feature_extraction.text` and the `MultinomialNB` classifier from `sklearn.naive_bayes` are inherently designed to handle multi-class classification.")
print("- `CountVectorizer` will automatically build a vocabulary from all unique words across all classes in your training data, regardless of the number of classes.")
print("- `MultinomialNB` can train and predict on datasets with more than two output classes without requiring significant code changes to the model fitting and prediction steps (`naive_bayes.fit(X_train, y_train)` and `naive_bayes.predict(X_test)`).")
print("The number of classes is inferred directly from the unique values in the target variable (`y_train`).")

print("\n4. Evaluation Metrics:")
print("Evaluation metrics like accuracy (`accuracy_score`) are still valid for multi-class classification; it measures the overall proportion of correctly classified instances.")
print("The confusion matrix (`confusion_matrix`) is also applicable but will be larger.")
print("For 'n' classes, the confusion matrix will be an n x n matrix where each cell `[i, j]` represents the number of instances from class 'i' that were classified as class 'j'.")
print("You would need to interpret the larger confusion matrix to understand misclassifications between specific pairs of classes.")
print("Other multi-class specific metrics like F1-score (macro or weighted) or classification report (`sklearn.metrics.classification_report`) can provide more detailed performance insights per class.")

print("\n5. Challenges and Considerations:")
print("When moving to a multi-class scenario, consider:")
print("- **Class Imbalance:** If some news categories have significantly fewer examples than others, the model might perform poorly on the minority classes. Techniques like oversampling, undersampling, or using class weights might be necessary.")
print("- **Data Size:** You will likely need a larger dataset overall to adequately represent and train on more distinct categories.")
print("- **Category Similarity:** Some news categories (e.g., 'politics' and 'world news') might have overlapping vocabulary, making it harder for the model to distinguish between them. This can lead to higher confusion between similar classes in the confusion matrix.")
print("- **Vocabulary Size:** With more diverse text data, the vocabulary size from `CountVectorizer` will increase, potentially increasing memory usage and training time. You might consider using techniques like limiting the number of features (`max_features` in `CountVectorizer`) or using TF-IDF instead of raw counts.")

## Summary:

### Q&A

*   **How would you extend your dataset and code for more than two classes?**
    To extend the dataset and code for more than two classes, you would need to:
    1.  **Collect Data:** Gather news articles for each new category you want to include (e.g., entertainment, technology) and add them to your dataset. The dataset structure with 'message' and 'label' columns would remain similar, but the 'label' column would contain the new category names.
    2.  **Extend Label Mapping:** Update the label mapping dictionary to assign a unique numerical value to each new class in addition to the existing ones.
    3.  **Utilize Compatible Tools:** The `CountVectorizer` and `MultinomialNB` are already compatible with multi-class classification, so the core vectorization and model training/prediction code would not require significant changes. The number of classes is automatically inferred from the unique labels in the training data.
    4.  **Adapt Evaluation:** Accuracy is still a valid metric. The confusion matrix will be larger (an n x n matrix for 'n' classes) and requires interpreting misclassifications between specific classes. Consider using multi-class specific metrics like F1-score or a classification report for more detailed performance analysis per class.
    5.  **Address Challenges:** Be mindful of potential issues like class imbalance (unequal numbers of articles per category), the need for a larger dataset, difficulty distinguishing between similar categories, and managing a potentially larger vocabulary size.
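Steps 2 and 4 above can be sketched in a few lines of plain Python. The helper names `build_label_mapping` and `confusion_matrix_counts`, the category list, and the predictions are all hypothetical and only serve to show the shape of the result; in practice `sklearn.metrics.confusion_matrix` does the counting.

```python
def build_label_mapping(labels):
    """Assign a distinct integer to every class name, in alphabetical order."""
    return {name: idx for idx, name in enumerate(sorted(set(labels)))}


def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Cell [i][j] counts items of true class i that were predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_idx, pred_idx in zip(y_true, y_pred):
        matrix[true_idx][pred_idx] += 1
    return matrix


# Hypothetical true labels and predictions for six articles.
true_labels = ["sports", "politics", "entertainment", "technology", "sports", "politics"]
pred_labels = ["sports", "sports", "entertainment", "technology", "sports", "politics"]

mapping = build_label_mapping(true_labels)
y_true = [mapping[name] for name in true_labels]
y_pred = [mapping[name] for name in pred_labels]

matrix = confusion_matrix_counts(y_true, y_pred, len(mapping))
print(mapping)
for row in matrix:
    print(row)
```

Building the mapping from the data avoids having to edit a hand-written dictionary each time a category is added.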

### Data Analysis Key Findings

*   A dataset of 100 news articles was created, consisting of 50 sports articles and 50 politics articles.
*   The text data was vectorized using `CountVectorizer`, resulting in a document-term matrix with a shape of (100, 399), indicating 100 documents and 399 unique terms.
*   The dataset was split into training (75 samples) and testing (25 samples) sets.
*   A Multinomial Naive Bayes classifier was trained on the training data.
*   The trained classifier achieved an accuracy of 0.80 (80%) on the test set.
*   The confusion matrix showed that out of 25 test articles, the model correctly classified 11 sports articles and 9 politics articles.
*   The model incorrectly classified 3 sports articles as politics and 2 politics articles as sports, totaling 5 misclassifications.

### Insights or Next Steps

*   The Naive Bayes model shows reasonable performance (80% accuracy) on this small, manually created dataset for binary classification.
*   For a real-world application, using a larger, more diverse dataset would be crucial to improve model robustness and generalization. Exploring techniques like TF-IDF vectorization or more advanced NLP models could also be beneficial.
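To put a number on how uncertain an accuracy of 20 out of 25 is, one option is a Wilson score interval for a binomial proportion. This is an illustrative addition rather than part of the original analysis; the function name `wilson_interval` and the choice of a 95% level (`z=1.96`) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half_width, center + half_width


# 20 of the 25 test articles were classified correctly.
low, high = wilson_interval(20, 25)
print(f"{low:.2f} to {high:.2f}")
```

The interval spans roughly 0.61 to 0.91, which illustrates why a larger test set is needed before drawing firm conclusions.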
