
Developing a system for automatic analysis and categorization of emails: classification by request type or content



In [42]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
import nltk
from nltk.corpus import stopwords
import joblib

In [43]:
# Download the tokenizer data required by nltk.word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [44]:
# Load the data
df = pd.read_csv('/content/messages.csv')

In [45]:
df

Unnamed: 0,subject,message,label
0,job posting - apple-iss research center,content - length : 3386 apple-iss research cen...,0
1,,"lang classification grimes , joseph e . and ba...",0
2,query : letter frequencies for text identifica...,i am posting this inquiry for sergei atamas ( ...,0
3,risk,a colleague and i are researching the differin...,0
4,request book information,earlier this morning i was on the phone with a...,0
...,...,...,...
2888,love your profile - ysuolvpv,hello thanks for stopping by ! ! we have taken...,1
2889,you have been asked to join kiddin,"the list owner of : "" kiddin "" has invited you...",1
2890,anglicization of composers ' names,"judging from the return post , i must have sou...",0
2891,"re : 6 . 797 , comparative method : n - ary co...",gotcha ! there are two separate fallacies in t...,0


In [46]:
# Text preprocessing
nltk.download('stopwords', quiet=True)  # the stop-word list must be available before it is loaded
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(words)
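
To make the effect of the filter in `preprocess_text` concrete, here is a minimal sketch that applies the same condition to a hand-written token list. It does not call NLTK: the tokens are typed out by hand (assuming the tokenizer leaves hyphenated words such as `apple-iss` intact), and `STOP_WORDS` and `filter_tokens` are hypothetical names, with a tiny set standing in for NLTK's much longer English stop-word list.

```python
# Hypothetical stand-in for NLTK's English stop-word list (the real list is much longer).
STOP_WORDS = {"the", "a", "of", "for", "this", "is"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Apply the same filter as preprocess_text to an already tokenized message."""
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stop_words]

# Tokens written out by hand; in the notebook they come from nltk.word_tokenize.
tokens = ["This", "is", "a", "job", "posting", "-", "apple-iss", "research", "center", "3386"]
kept = filter_tokens(tokens)
print(kept)
```

Because `str.isalpha()` is false for any token containing a digit, hyphen, or punctuation mark, numbers and hyphenated words are dropped entirely rather than split into parts.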

In [47]:
# Combine the 'subject' and 'message' columns into a single 'text' column
df['text'] = df['subject'].fillna('') + ' ' + df['message'].fillna('')
df['text'] = df['text'].apply(preprocess_text)

In [48]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

In [49]:
# Build a text classification model using TF-IDF vectorization and a linear SVM
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', SVC(kernel='linear'))
])
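
As an illustration of what the `tfidf` step feeds into the classifier, the sketch below recomputes the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) in plain Python. The three-document corpus is made up, the function name `tfidf` is hypothetical, and tokenization is simplified to whitespace splitting, so this is a sketch of the formula rather than a replacement for the vectorizer.

```python
import math
from collections import Counter

# Made-up toy corpus (assumption, not taken from messages.csv).
docs = [
    "free prize click",
    "linguistics conference paper",
    "free conference registration",
]

def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [d.split() for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in fewer documents ("prize") receive a larger weight than terms shared across documents ("free"), which is the property the linear SVM relies on to separate the classes.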

In [50]:
# Train the model
model.fit(X_train, y_train)

In [51]:
# Predict on the test set
y_pred = model.predict(X_test)

In [52]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.9948186528497409
Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       464
           1       1.00      0.97      0.99       115

    accuracy                           0.99       579
   macro avg       1.00      0.99      0.99       579
weighted avg       0.99      0.99      0.99       579
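
The figures in the report can be checked by hand. The sketch below recomputes them from confusion counts that the notebook does not print: they are inferred here as the only counts consistent with the reported support (464 and 115), the accuracy of 576/579, and the rounded recall of 0.97 for class 1, so treat them as an assumption. The majority-class baseline is added for comparison, since the classes are imbalanced.

```python
# Confusion counts inferred from the report (assumption, not printed by the notebook):
# 464 non-spam and 115 spam test emails, 3 errors in total, all of them missed spam.
tn, fp = 464, 0    # class 0 (not spam): correctly kept / wrongly flagged
fn, tp = 3, 112    # class 1 (spam): missed / correctly caught

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

# Baseline: always predicting the majority class (not spam).
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision_spam:.2f} "
      f"recall={recall_spam:.2f} f1={f1_spam:.2f} baseline={baseline_accuracy:.2f}")
```

A classifier that never flags spam would already reach roughly 0.80 accuracy on this test set, so the recall for class 1 says more about the model than the headline accuracy does.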



In [53]:
# Save the model
joblib.dump(model, 'ling_spam_model.pkl')

['ling_spam_model.pkl']

Using the model

In [54]:
import pandas as pd
import joblib
import nltk

In [55]:
# Load the trained model
model = joblib.load('ling_spam_model.pkl')

In [56]:
# Example of a new email to classify
new_email = "Congratulations! You've won a free vacation. Click here to claim your prize."

In [57]:
# Preprocess the new email the same way as the training data
nltk.download('punkt', quiet=True)  # tokenizer data used by nltk.word_tokenize
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

def preprocess_text(text):
    words = nltk.word_tokenize(text)
    words = [word.lower() for word in words if word.isalpha() and word.lower() not in stop_words]
    return ' '.join(words)

processed_email = preprocess_text(new_email)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [58]:
# Classify the new email
prediction = model.predict([processed_email])

In [59]:
# Print the result
if prediction[0] == 1:
    print("This is spam!")
else:
    print("This is not spam.")

This is spam!
