The deadline for this homework is **28.11.2025 18:29** (right before the practice session). After completing the exercises, you should:

1. Download this file to your computer (`File` $\to$ `Download .ipynb`)

2. Name the file using the pattern *HWx_NameSurname* (for example, `HW1_NshanPotikyan.ipynb`)

3. Submit the file via the e-learning environment.

**Note**

* If you fail to meet any of the above conditions, your homework will not be graded.

* You do not need to send any dataset files or helper scripts that I provide with your homework (since I already have them).

**Problem.** During the practice session we tried to classify the titles of some news articles using the Naive Bayes algorithm with different data processing methods, but the result was not that good.

* In this homework, use the same dataset, but this time train the classifier on the article paragraph itself.

* Split the training dataset into train/val parts, so that you can evaluate which data processing approach results in better performance.

* Make use of sklearn [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html#sklearn.pipeline.make_pipeline) to construct the different data processing pipelines.

* Evaluate the model performance in terms of the accuracy score.

* Use the best data processing method to train a final model on the train+val dataset and report the accuracy score on the test dataset.
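
As a minimal sketch of what a pipeline built with `make_pipeline` looks like, the example below chains a vectorizer and a Naive Bayes classifier and fits them in one call. The toy texts and the `politics` label are invented for illustration only; they are not taken from the news dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented purely for illustration (not the homework data).
texts = [
    "the central bank raised interest rates",
    "inflation and exports grew this quarter",
    "the parliament passed a new election law",
    "the president met the opposition leader",
]
labels = ["economy", "economy", "politics", "politics"]

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
step_names = [name for name, _ in pipe.steps]

# fit() runs the vectorizer's fit_transform and then fits the classifier;
# predict() applies the fitted vectorizer before classifying.
pipe.fit(texts, labels)
prediction = pipe.predict(["the bank cut interest rates"])

print(step_names)
print(prediction)
```

Because the vectorizer lives inside the pipeline, it is fitted only on the data passed to `fit`, so the validation texts never influence the vocabulary.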

Run the command below to download the train/test splits of the news dataset.

In [30]:
!curl https://raw.githubusercontent.com/NshanPotikyan/Dasa1Doom/master/files/news_data.zip -o news_data.zip
!unzip news_data.zip

Archive:  news_data.zip
  inflating: train_news.csv          
  inflating: test_news.csv           


In [2]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

### Read and split data

In [31]:
train = pd.read_csv('train_news.csv')
test = pd.read_csv('test_news.csv')

In [4]:
# train['article'] = train['article_title'] +'\t' + train['article_paragraph']

# test_size is left at its default of 0.25, so a quarter of the rows go to validation.
X_train, X_val, y_train, y_val = train_test_split(
    train['article_paragraph'], train['type'], random_state=0)

print("Training dataset: ", X_train.shape[0])
print("Validation dataset: ", X_val.shape[0])



Training dataset:  209
Validation dataset:  70
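
The sizes above follow from the default `test_size=0.25` applied to 279 rows (209 + 70). The sketch below reproduces that arithmetic on stand-in row ids. The stratified variant at the end is an optional assumption of mine, not something the notebook uses, and its labels are made up.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 279 training rows (209 + 70); the real texts are not
# needed to check the split sizes.
row_ids = list(range(279))

# Default test_size=0.25: ceil(0.25 * 279) = 70 validation rows, 209 left for training.
train_ids, val_ids = train_test_split(row_ids, random_state=0)
print(len(train_ids), len(val_ids))  # prints: 209 70

# Optional variation (an assumption, not used in the notebook): stratify keeps
# the class proportions similar in both parts. The labels here are invented.
fake_labels = ["economy"] * 140 + ["other"] * 139
strat_train, strat_val = train_test_split(
    row_ids, stratify=fake_labels, random_state=0)
```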


In [5]:
y_train.describe()

count         209
unique          2
top       economy
freq          105
Name: type, dtype: object

In [6]:
# Pipelines whose validation accuracy exceeds 0.97 are collected in this list.
pipes_ensemble = []

### Create a baseline model on word frequencies (stop words removed)

In [7]:
pipe_freq = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
pipe_freq.fit(X_train, y_train)

train_preds = pipe_freq.predict(X_train)
val_preds = pipe_freq.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_freq)


Train accuracy score:  0.9952153110047847
Validation accuracy score:  1.0


### Use word frequencies, keeping stop words

In [8]:
pipe_freq_no_sw = make_pipeline(CountVectorizer(), MultinomialNB())
pipe_freq_no_sw.fit(X_train, y_train)

train_preds = pipe_freq_no_sw.predict(X_train)
val_preds = pipe_freq_no_sw.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_freq_no_sw)

Train accuracy score:  1.0
Validation accuracy score:  1.0


### Fitting on word occurrences instead of frequencies

In [9]:
pipe_occur = make_pipeline(CountVectorizer(stop_words='english', binary=True), MultinomialNB())
pipe_occur.fit(X_train, y_train)

train_preds = pipe_occur.predict(X_train)
val_preds = pipe_occur.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_occur)

Train accuracy score:  1.0
Validation accuracy score:  0.9857142857142858


In [None]:
with pd.option_context('max_colwidth', 1000):
    print(X_val[y_val != val_preds])

239     Armenian President Armen Sarkissian is in Syunik province on a two-day working visit, Armenpress correspondent reports from Meghri.Firstly, President Sarkissian visited the Meghri free economic zone, toured the area and got acquainted with the ongoing activities.Director of Meghri Free Economic Zone Ashot Zarbabyan said the Free Economic Zone has been established on December 15, 2017. At the moment they still face some problems, connected with lands and entry-exit regime. “As it is a border zone, entry and exit are certainly restricted, therefore, we face problems”, he said.The President asked whether these issues have not been discussed before 2017, whether they have not been solved before its launch, the staff of the Free Economic Zone stated that the project should have been implemented at two stages. The first stage was its launch and the next stage was to solve the remaining issues. The President was informed that a working group has been formed by the Prime Minister’s ins
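
To make the difference between frequencies and occurrences concrete, here is a small sketch on a single made-up sentence. With `binary=True`, every non-zero count is recorded as 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One invented document in which "tax" appears three times.
doc = ["tax tax tax budget"]

count_vec = CountVectorizer()            # word frequencies
binary_vec = CountVectorizer(binary=True)  # word occurrences (0/1 flags)

counts = count_vec.fit_transform(doc).toarray()
flags = binary_vec.fit_transform(doc).toarray()

# Recover the column order from the learned vocabulary (term -> column index).
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)   # ['budget', 'tax']
print(counts)  # [[1 3]]
print(flags)   # [[1 1]]
```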

## Fitting on tf-idf

### Use Bernoulli Naive Bayes

In [11]:
pipe_bernulli = make_pipeline(TfidfVectorizer(stop_words='english'), BernoulliNB())
pipe_bernulli.fit(X_train, y_train)

train_preds = pipe_bernulli.predict(X_train)
val_preds = pipe_bernulli.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_bernulli)

Train accuracy score:  0.9856459330143541
Validation accuracy score:  0.9285714285714286
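
One point worth keeping in mind when reading this result: `BernoulliNB` binarizes its input (its `binarize` parameter defaults to `0.0`), so the TF-IDF weights are reduced to present/absent flags before the model sees them. The sketch below, on invented toy texts with a hypothetical `politics` label, checks that the TF-IDF pipeline and a binary count pipeline give the same Bernoulli probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = [
    "the central bank raised interest rates",
    "inflation and exports grew this quarter",
    "the parliament passed a new election law",
    "the president met the opposition leader",
]
labels = ["economy", "economy", "politics", "politics"]
new_texts = ["bank interest rates", "election law and the president"]

tfidf_bnb = make_pipeline(TfidfVectorizer(stop_words="english"), BernoulliNB())
binary_bnb = make_pipeline(
    CountVectorizer(stop_words="english", binary=True), BernoulliNB())

tfidf_bnb.fit(texts, labels)
binary_bnb.fit(texts, labels)

# Any positive TF-IDF weight is thresholded to 1, exactly like a binary count,
# so both pipelines end up with the same fitted model.
same_probabilities = np.allclose(
    tfidf_bnb.predict_proba(new_texts), binary_bnb.predict_proba(new_texts))
print(same_probabilities)
```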


### Multinomial NB works better than Bernoulli NB for this problem

In [12]:
pipe_tf_idf = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
pipe_tf_idf.fit(X_train, y_train)

train_preds = pipe_tf_idf.predict(X_train)
val_preds = pipe_tf_idf.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_tf_idf)

Train accuracy score:  0.9952153110047847
Validation accuracy score:  1.0


### Use unigrams and bigrams, with stop words removed

In [13]:
pipe_2gram = make_pipeline(CountVectorizer(stop_words='english', ngram_range=(1,2)), MultinomialNB())
pipe_2gram.fit(X_train, y_train)

train_preds = pipe_2gram.predict(X_train)
val_preds = pipe_2gram.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_2gram)

Train accuracy score:  1.0
Validation accuracy score:  1.0


### Use unigrams and bigrams, keeping stop words

In [14]:
pipe_2gram_no_stop_words = make_pipeline(CountVectorizer(ngram_range=(1,2)), MultinomialNB())
pipe_2gram_no_stop_words.fit(X_train, y_train)

train_preds = pipe_2gram_no_stop_words.predict(X_train)
val_preds = pipe_2gram_no_stop_words.predict(X_val)

val_acc = accuracy_score(y_val, val_preds)
print("Train accuracy score: ", accuracy_score(y_train, train_preds))
print("Validation accuracy score: ", val_acc)

if val_acc > 0.97:
    pipes_ensemble.append(pipe_2gram_no_stop_words)


Train accuracy score:  1.0
Validation accuracy score:  0.9857142857142858
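
The cells above repeat the same fit, predict and score steps for every pipeline. As a hedged sketch, the comparison could also be written as one loop over a dictionary of candidates. The toy texts, the `politics` label and the names `candidates`, `val_scores` and `best_name` are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy train/validation data, invented for illustration only.
train_texts = [
    "the central bank raised interest rates",
    "inflation and exports grew this quarter",
    "the budget deficit and tax revenue increased",
    "the parliament passed a new election law",
    "the president met the opposition leader",
    "the minister resigned after the vote",
]
train_labels = ["economy"] * 3 + ["politics"] * 3
val_texts = ["bank interest rates fell", "the election vote in parliament"]
val_labels = ["economy", "politics"]

candidates = {
    "counts": make_pipeline(
        CountVectorizer(stop_words="english"), MultinomialNB()),
    "binary": make_pipeline(
        CountVectorizer(stop_words="english", binary=True), MultinomialNB()),
    "tfidf": make_pipeline(
        TfidfVectorizer(stop_words="english"), MultinomialNB()),
    "uni+bigrams": make_pipeline(
        CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
}

# Fit each candidate on the training part and score it on the validation part.
val_scores = {}
for name, pipe in candidates.items():
    pipe.fit(train_texts, train_labels)
    val_scores[name] = accuracy_score(val_labels, pipe.predict(val_texts))

# Pick the candidate with the highest validation accuracy (first one wins ties).
best_name = max(val_scores, key=val_scores.get)
print(val_scores)
print(best_name)
```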


In [25]:
for pipe in pipes_ensemble:
    test_preds = pipe.predict(test['article_paragraph'])
    print("Test Accuracy score: ", accuracy_score(test['type'], test_preds))

Test Accuracy score:  0.978494623655914
Test Accuracy score:  0.978494623655914
Test Accuracy score:  0.978494623655914
Test Accuracy score:  0.978494623655914
Test Accuracy score:  0.978494623655914
Test Accuracy score:  0.978494623655914
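
The pipelines scored above were fitted on the training part only, while the task asks for a final model trained on train+val. A minimal sketch of that refit step follows, using `sklearn.base.clone` to get an unfitted copy of the chosen pipeline. The toy series and the `politics` label are invented, and `best_pipe` stands for whichever pipeline is selected.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy train/validation parts, invented for illustration only.
X_train = pd.Series([
    "the central bank raised interest rates",
    "the parliament passed a new election law",
])
y_train = pd.Series(["economy", "politics"])
X_val = pd.Series([
    "inflation and exports grew this quarter",
    "the president met the opposition leader",
])
y_val = pd.Series(["economy", "politics"])

# Hypothetical choice of the best pipeline, fitted on the training part only.
best_pipe = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
best_pipe.fit(X_train, y_train)

# Merge both parts and refit a fresh, unfitted copy of the same pipeline.
X_full = pd.concat([X_train, X_val], ignore_index=True)
y_full = pd.concat([y_train, y_val], ignore_index=True)
final_model = clone(best_pipe).fit(X_full, y_full)

# The refitted vectorizer has seen more text, so its vocabulary can only grow.
vocab_before = len(best_pipe[0].vocabulary_)
vocab_after = len(final_model[0].vocabulary_)
print(vocab_before, vocab_after)
```

The test set is then scored once, with `final_model`, after the choice of pipeline has been made on the validation part.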


In [23]:
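from sklearn.ensemble import VotingClassifier

# Note: all seven pipelines are listed here, including pipe_bernulli, which did
# not pass the 0.97 validation threshold. VotingClassifier refits a clone of each
# estimator on X_train and uses hard (majority) voting by default.
ensemble = VotingClassifier(estimators=[
    ('pipe_freq', pipe_freq),
    ('pipe_freq_no_sw', pipe_freq_no_sw),
    ('pipe_occur', pipe_occur),
    ('pipe_bernulli', pipe_bernulli),
    ('pipe_tf_idf', pipe_tf_idf),
    ('pipe_2gram', pipe_2gram),
    ('pipe_2gram_no_stop_words', pipe_2gram_no_stop_words)
])
ensemble.fit(X_train, y_train)
ensemble.score(test['article_paragraph'], test['type'])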

0.978494623655914
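
As a hedged sketch of how hard voting combines pipelines, the example below builds a `VotingClassifier` from a dictionary of selected pipelines instead of listing them by hand. The toy texts, the `politics` label and the name `selected` are invented for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = [
    "the central bank raised interest rates",
    "inflation and exports grew this quarter",
    "the parliament passed a new election law",
    "the president met the opposition leader",
]
labels = ["economy", "economy", "politics", "politics"]

# Hypothetical set of pipelines that passed a validation threshold.
selected = {
    "counts": make_pipeline(CountVectorizer(), MultinomialNB()),
    "binary": make_pipeline(CountVectorizer(binary=True), MultinomialNB()),
    "tfidf": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

# voting="hard" is the default: each fitted pipeline casts one vote per sample
# and the majority label wins.
ensemble = VotingClassifier(estimators=list(selected.items()))
ensemble.fit(texts, labels)

prediction = ensemble.predict(["bank interest rates"])
print(ensemble.voting, len(ensemble.estimators_), prediction)
```

Since `fit` trains fresh clones stored in `estimators_`, the pipelines passed in do not need to be fitted beforehand.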

In [29]:
# test_preds holds the predictions of the last pipeline evaluated in the loop above.
with pd.option_context('max_colwidth', 1000):
    print(test[test['type'] != test_preds])

                                                                     article_title  \
61     Government to procure defense products from local manufacturers as priority   
77  PM to hand over Hero Of Our Times prize for the first time on Independence Day   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     