Importing libraries:
- os - for listing folders and building file paths
- pandas - for working with the dataset
- scikit-learn - for vectorizing the texts, training the models, and evaluating them
- string - for removing punctuation
- joblib - for saving the model and the vectorizer

In [1]:
import os 
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import string 
import joblib

Reading the files from the "astronomia" and "inne" folders and labeling each text with the name of its folder.

In [2]:
def read_files(folder_path, label):
    """Read every file in folder_path and pair each text with the given label."""
    texts = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        # Skip subfolders; only regular files are read.
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                texts.append(file.read())
    return texts, [label] * len(texts)

astronomy_folder = 'dane/astronomia'
others_folder = 'dane/inne'

astronomy_texts, astronomy_labels = read_files(astronomy_folder, 'astronomia')
others_texts, others_labels = read_files(others_folder, 'inne')
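
Note that `os.listdir` returns names in arbitrary order, so the row order of the resulting dataset (and with it any seeded split) can differ between machines. A minimal sketch of a deterministic variant follows; the name `read_files_sorted` and the temporary demo files are assumptions made for illustration and are not part of the original notebook.

```python
import os
import tempfile


def read_files_sorted(folder_path, label):
    """Hypothetical variant of read_files that visits files in sorted name order."""
    texts = []
    for filename in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                texts.append(file.read())
    return texts, [label] * len(texts)


# Demonstration on a temporary folder with made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in [('b.txt', 'drugi'), ('a.txt', 'pierwszy')]:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(content)
    demo_texts, demo_labels = read_files_sorted(tmp_dir, 'astronomia')

print(demo_texts, demo_labels)
```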

Creating a DataFrame and saving it to a .csv file.

In [3]:
texts = astronomy_texts + others_texts
labels = astronomy_labels + others_labels

data = {'text': texts, 'label': labels}
df = pd.DataFrame(data)

df.to_csv('dane_zbiorcze.csv', index=False)
print('Data saved.')

Data saved.


In [4]:
df.sample(10)

Unnamed: 0,text,label
89,(53) Kalypso - planetoida z pasa głównego plan...,astronomia
41,Gromada kulista - zazwyczaj sferycznie symetry...,astronomia
188,"Czerwona gwiazda - symbol, pięciopromienna gwi...",inne
178,Cyberiada - cykl opowiadań Stanisława Lema osa...,inne
162,Narcyz - obraz olejny przypisywany włoskiemu m...,inne
65,"Efekt Primakoffa - hipotetyczne zjawisko, pole...",astronomia
226,Samolot – załogowy bądź bezzałogowy statek pow...,inne
64,"Arktur (""a Boo"", ""Alfa Bootis"") - najjaśniejsz...",astronomia
133,"Messier 9 (M9, NGC 6333) - gromada kulista poł...",astronomia
18,Vademecum miłośnika astronomii - kwartalnik po...,astronomia


Cleaning the texts: lowercasing them and removing punctuation and Polish stopwords.

In [5]:
def load_stopwords(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        stopwords = file.read().splitlines()
    return set(stopwords)

stopwords_file = 'polish_stopwords.txt'

polish_stopwords = load_stopwords(stopwords_file)

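def preprocess_text(text):
    """Lowercase the text, strip ASCII punctuation, and drop Polish stopwords."""
    text = text.lower()
    # string.punctuation covers ASCII only, so marks such as '–' are kept.
    text = ''.join([char for char in text if char not in string.punctuation])
    words = text.split()
    words = [word for word in words if word not in polish_stopwords]
    return ' '.join(words)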

df['text'] = df['text'].apply(preprocess_text)
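
One caveat: `string.punctuation` contains only ASCII characters, so typographic marks such as the en dash (–) survive this cleaning step. A minimal sketch of a Unicode-aware alternative, assuming the standard-library `unicodedata` module; the function name `strip_punctuation_unicode` and the sample sentence are hypothetical and not part of the original notebook.

```python
import unicodedata


def strip_punctuation_unicode(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return ''.join(ch for ch in text if not unicodedata.category(ch).startswith('P'))


# Made-up sample containing an en dash, a comma, quotes, and a full stop.
sample = 'Samolot – załogowy bądź bezzałogowy statek, "cięższy" od powietrza.'

# split() + join() collapses the double space left where the dash was removed.
cleaned = ' '.join(strip_punctuation_unicode(sample.lower()).split())
print(cleaned)
```

Unicode places characters such as `+` or `$` in the symbol categories rather than in punctuation, so this variant keeps them, whereas `string.punctuation` removes them. If both behaviors are wanted, the two checks can be combined.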

In [6]:
df.sample(5)

Unnamed: 0,text,label
27,pulsar kraba psr b053121 stosunkowo młoda gwia...,astronomia
61,messier 2 m2 ngc 7089 gromada kulista znajdują...,astronomia
201,stade roland garros – kompleks sportowy paryżu...,inne
148,judyta obraz olejny namalowany ok 16261628 fra...,inne
40,układ planetarny system planetarny planety cia...,astronomia


Splitting the data into training and test sets (80% / 20%).

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

In [8]:
print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (224,)
X_test shape:  (56,)
y_train shape:  (224,)
y_test shape:  (56,)
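
The split above does not request stratification, so the class proportions in the test set depend on the random seed. A sketch of a stratified split, using the `stratify` parameter of `train_test_split`; the toy texts and labels below are made up and merely stand in for `df['text']` and `df['label']`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents per class.
toy_texts = [f'dokument {i}' for i in range(20)]
toy_labels = ['astronomia'] * 10 + ['inne'] * 10

# stratify keeps the class ratio identical in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(Counter(y_tr))
print(Counter(y_te))
```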


Vectorizing the texts with TF-IDF. The vectorizer is fitted on the training set only and then applied to the test set.

In [9]:
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
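
A small illustration of what this means in practice: the vocabulary is learned from the training texts, so a word that appears only in the test texts gets no column and is silently ignored. The toy sentences below are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, already cleaned.
train_docs = ['gromada kulista gwiazd', 'obraz olejny malarza']
test_docs = ['gromada gwiazd teleskop']  # 'teleskop' never occurs in training

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what allows a model trained on one to predict on the other.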

Training a Naive Bayes model, making predictions on the test set, and evaluating the model.

In [10]:
model_NB = MultinomialNB()
model_NB.fit(X_train_vec, y_train)

y_pred_NB = model_NB.predict(X_test_vec)
print('NB Accuracy:', accuracy_score(y_test, y_pred_NB))
print('NB Classification Report:\n', classification_report(y_test, y_pred_NB))

NB Accuracy: 0.8214285714285714
NB Classification Report:
               precision    recall  f1-score   support

  astronomia       0.75      0.96      0.84        28
        inne       0.95      0.68      0.79        28

    accuracy                           0.82        56
   macro avg       0.85      0.82      0.82        56
weighted avg       0.85      0.82      0.82        56



Training a Logistic Regression model, making predictions on the test set, and evaluating the model.

In [11]:
model_LR = LogisticRegression()
model_LR.fit(X_train_vec, y_train)

y_pred_LR = model_LR.predict(X_test_vec)
print('LR Accuracy:', accuracy_score(y_test, y_pred_LR))
print('LR Classification Report:\n', classification_report(y_test, y_pred_LR))

LR Accuracy: 0.9464285714285714
LR Classification Report:
               precision    recall  f1-score   support

  astronomia       1.00      0.89      0.94        28
        inne       0.90      1.00      0.95        28

    accuracy                           0.95        56
   macro avg       0.95      0.95      0.95        56
weighted avg       0.95      0.95      0.95        56



The Logistic Regression model performed better on the test set: accuracy 0.95 versus 0.82 for Naive Bayes.
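
Because the test set holds only 56 documents, a single split gives a rough estimate. One way to check the comparison, sketched here under the assumption that 5-fold cross-validation is acceptable for this dataset, is to wrap the vectorizer and each classifier in a pipeline and score it with `cross_val_score`. The toy corpus is made up; in the notebook it would be `df['text']` and `df['label']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 10 documents per class.
toy_texts = (['gwiazda planeta gromada kulista'] * 10
             + ['obraz olejny ulica samolot'] * 10)
toy_labels = ['astronomia'] * 10 + ['inne'] * 10

results = {}
for name, clf in [('NB', MultinomialNB()), ('LR', LogisticRegression())]:
    # The pipeline refits the vectorizer inside every fold,
    # so no fold sees vocabulary from its own held-out part.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=5)
    results[name] = scores.mean()
    print(name, 'mean accuracy:', results[name])
```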

In [12]:
y_pred_LR[0:5]

array(['astronomia', 'astronomia', 'inne', 'inne', 'inne'], dtype=object)

In [13]:
X_test.head()

33     korona południowa łac corona australis dop cor...
108    planetozymal planetezymal małe ciało niebieski...
240    mykoryza12 mikoryza345a mycorrhiza – powszechn...
259    hollywood boulevard – ulica hollywood los ange...
154    liczba stopni swobody df ang degrees of freedo...
Name: text, dtype: object

Saving the vectorizer and the Logistic Regression model.

In [14]:
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(model_LR, 'logistic_regression_model.pkl')

['logistic_regression_model.pkl']
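
To reuse the saved files, both objects have to be loaded, and any new text should go through the same cleaning as the training data before it is vectorized. The sketch below is self-contained: it trains a toy model and writes it to a temporary directory, so the toy data and the temporary paths are assumptions; only the two file names come from the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data, already lowercased and cleaned.
toy_texts = ['gwiazda planeta gromada', 'planeta orbita gwiazda',
             'obraz olejny malarz', 'ulica miasto samolot']
toy_labels = ['astronomia', 'astronomia', 'inne', 'inne']

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression().fit(
    toy_vectorizer.fit_transform(toy_texts), toy_labels
)

with tempfile.TemporaryDirectory() as tmp_dir:
    vec_path = os.path.join(tmp_dir, 'tfidf_vectorizer.pkl')
    model_path = os.path.join(tmp_dir, 'logistic_regression_model.pkl')
    joblib.dump(toy_vectorizer, vec_path)
    joblib.dump(toy_model, model_path)

    # Load both artifacts back, as a separate script would do.
    loaded_vectorizer = joblib.load(vec_path)
    loaded_model = joblib.load(model_path)

new_text = 'gwiazda planeta'  # must be cleaned the same way as the training texts
prediction = loaded_model.predict(loaded_vectorizer.transform([new_text]))
print(prediction[0])
```

The vectorizer and the model are saved as a pair because the model's coefficients only make sense for the column order of that particular fitted vectorizer.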