<a href="https://colab.research.google.com/github/RobertCall/FakeNewsNet/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News Classifier

## Neural network built on word2vec embeddings

Installing the required packages

In [1]:
!pip install pymorphy2
!pip install ufal.udpipe
!pip install corpy
!pip install -U pymorphy2-dicts-ru


Importing the packages

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import pymorphy2

import os
import gensim

import pandas as pd

import ufal.udpipe as udp
import corpy.udpipe as crp

from sklearn.model_selection import train_test_split
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, Flatten

Creating the tokenizer and the lemmatizer, and defining a function that preprocesses a string

In [3]:
nltk.download('stopwords')
stop_words = set(stopwords.words('russian'))
nltk_tokenizer = RegexpTokenizer(r'[а-яёa-z]+')

morph = pymorphy2.MorphAnalyzer()

def text_preprocessing(text):
  words = nltk_tokenizer.tokenize(text.lower())
  lem_text = [morph.parse(w)[0].normal_form for w in words if w not in stop_words]

  return lem_text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
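
To make the tokenization and stop-word filtering step concrete, here is a minimal sketch that uses only the standard library. `re.findall` stands in for `RegexpTokenizer` (same pattern), and `toy_stop_words` is a hypothetical, much shorter stand-in for the NLTK Russian list. Lemmatization with pymorphy2 is left out.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('russian')
toy_stop_words = {'в', 'и', 'на', 'от', 'до'}

def toy_tokenize(text):
    """Lowercase the text and keep only runs of Cyrillic/Latin letters."""
    return re.findall(r'[а-яёa-z]+', text.lower())

def toy_preprocess(text):
    """Tokenize and drop stop words (no lemmatization in this sketch)."""
    return [w for w in toy_tokenize(text) if w not in toy_stop_words]

headline = 'Число погибших от последствий зимнего шторма в США выросло до 28'
tokens = toy_preprocess(headline)
print(tokens)
```

Note that the pattern matches letters only, so numbers such as "28" never reach the model.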


Downloading and loading the word2vec model

In [4]:
if not os.path.isfile('220.zip'):
  !wget http://vectors.nlpl.eu/repository/20/220.zip
  !unzip 220.zip

w2v = gensim.models.KeyedVectors.load_word2vec_format('model.bin', binary=True)

--2022-12-26 16:26:45--  http://vectors.nlpl.eu/repository/20/220.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 638171816 (609M) [application/zip]
Saving to: ‘220.zip’


2022-12-26 16:27:10 (26.0 MB/s) - ‘220.zip’ saved [638171816/638171816]

Archive:  220.zip
  inflating: meta.json               
  inflating: model.bin               
  inflating: model.txt               
  inflating: README                  


Reading the data

In [5]:
df_news = pd.read_csv('train.tsv', sep='\t')
test = pd.read_csv('test.tsv', sep='\t')

Downloading and loading the part-of-speech tagging model

In [6]:
# Download the UDPipe model trained on Russian (skipped if the file is already present)
udp_model_url = r'https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3131/russian-syntagrus-ud-2.5-191206.udpipe'
udp_model_filename = 'russian-syntagrus-ud-2.5-191206.udpipe'
if not os.path.isfile(udp_model_filename):
  !wget -O {udp_model_filename} {udp_model_url}

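# Load the model through the corpy wrapper
# corpy_model = udp.Model.load(udp_model_filename)
corpy_model = crp.Model(udp_model_filename)
print('model', corpy_model)

# Tag every lemma with its universal part of speech, producing 'lemma_UPOS' strings
def udp_tagging(lem_text):
  # Each lemma is passed to UDPipe as its own one-word text
  sents = [list(corpy_model.process(w)) for w in lem_text]
  # words[0] is UDPipe's technical root node, so words[1] is the first real token
  tagged_words = [s[0].words[1].form + '_' + s[0].words[1].upostag for s in sents if s]

  return tagged_words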

model <corpy.udpipe.Model object at 0x7fdc918ef760>


Text preprocessing (lemmatization, tagging, and filtering)

In [7]:
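import pickle

# Lemmatize and tag every headline, then cache the result on disk
X_not_filtered_texts = df_news['title'].apply(lambda x: udp_tagging(text_preprocessing(x))).array
with open('data1.pickle', 'wb') as f:
    pickle.dump(X_not_filtered_texts, f)
# To reuse the cached result instead of recomputing it:
#with open('data1.pickle', 'rb') as f:
   #X_not_filtered_texts = pickle.load(f)

# Keep only the words present in the word2vec vocabulary
# (`w2v.vocab` is the gensim 3.x API; gensim 4+ exposes `w2v.key_to_index` instead)
X_filtered_texts = []
for text in X_not_filtered_texts:
  words = []
  for word in text:
    if word in w2v.vocab:
      words.append(word)
  if len(words) == 0:
    print('Warning: headline has no in-vocabulary words')
  X_filtered_texts.append(words)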
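
The filtering above only works because the tagged tokens have the same `lemma_UPOS` shape as the keys of the embedding vocabulary. A minimal sketch of that lookup, assuming a hypothetical four-entry `toy_vocab` in place of the real word2vec vocabulary (the helper names `tag` and `filter_known` are also made up for illustration):

```python
# Hypothetical miniature vocabulary: key -> row index in the embedding matrix.
# Keys follow the 'lemma_UPOS' pattern that udp_tagging constructs.
toy_vocab = {
    'школа_NOUN': 0,
    'россия_PROPN': 1,
    'установить_VERB': 2,
    'счётчик_NOUN': 3,
}

def tag(lemma, upos):
    """Join a lemma and its part-of-speech tag into a vocabulary key."""
    return f'{lemma}_{upos}'

def filter_known(tagged_words, vocab):
    """Drop every token that has no embedding."""
    return [w for w in tagged_words if w in vocab]

tagged = [
    tag('школа', 'NOUN'),
    tag('россия', 'PROPN'),
    tag('установить', 'VERB'),
    tag('госдолг', 'NOUN'),   # not in toy_vocab, so it is dropped
]
known = filter_known(tagged, toy_vocab)
indices = [toy_vocab[w] for w in known]
print(known, indices)
```

Out-of-vocabulary tokens are silently discarded, so a headline made mostly of rare words reaches the network as a very short sequence.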

In [8]:
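# Map every word to its row index in the word2vec matrix
X = [list(map(lambda x: float(w2v.vocab[x].index), t)) for t in X_filtered_texts]
# Left-pad (or truncate) every sequence to exactly 100 indices
X = pad_sequences(X, maxlen=100)
X = np.asarray(X).astype('float32')

# Labels as a column vector: 1 = fake, 0 = real
Y = np.array(df_news['is_fake']).astype('float32').reshape((-1,1))

# Default split: 75% for training, 25% for validation
x_train, x_test, y_train, y_test = train_test_split(X, Y)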
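
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence and fills with zeros. The pure-Python sketch below mimics that behavior on two made-up index lists; `pad_left` is a hypothetical helper, not part of Keras.

```python
def pad_left(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults (padding='pre', truncating='pre').

    Assumes maxlen > 0.
    """
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # fill the front with `value`

batch = [[5, 9], [7, 3, 8, 2, 6]]   # made-up word indices
padded = [pad_left(s, maxlen=4) for s in batch]
print(padded)
```

One thing to keep in mind: index 0 is also the row of the first vocabulary entry, so once the embedding layer is created with `mask_zero=True`, that word cannot be told apart from padding.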

Building the text classification model

In [9]:
seq_model = Sequential()

weights = w2v.vectors
layer = Embedding(
    input_dim=weights.shape[0],
    output_dim=weights.shape[1],
    weights=[weights],
    input_length=100,
    mask_zero=True,
    trainable=False,
)

seq_model.add(layer)
seq_model.add(Dense(50, activation='relu'))
seq_model.add(Flatten())
seq_model.add(Dropout(0.6))
seq_model.add(Dense(1, activation='sigmoid'))

seq_model.compile(loss='mean_squared_logarithmic_error',
                  optimizer='adam', metrics=['accuracy'])

seq_model.fit(x_train, y_train, epochs=20,
               validation_data=(x_test, y_test))

seq_model.summary()
# seq_model.save("model.keras")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 300)          74799900  
                                                                 
 dense (Dense)               (None, 100, 50)           15050     
                                                                 
 flatten (Flatten)           (None, 5000)              0         
                                                                 
 dropout (Dropout)           (None, 5000)              0         
                                                                 
 dense_1 (Dense)             (None, 1)                 5001      
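
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic from the numbers printed above; the vocabulary size is not shown in the summary and is derived here from the embedding row.

```python
# Numbers taken from the model summary and the layer definitions
embedding_params = 74_799_900
embedding_dim = 300
sequence_length = 100
hidden_units = 50

# Embedding: one 300-dimensional vector per vocabulary entry
vocab_size = embedding_params // embedding_dim

# Dense(50) applied to each time step: weights plus biases
dense_params = embedding_dim * hidden_units + hidden_units

# Flatten turns (100, 50) into a single vector
flatten_units = sequence_length * hidden_units

# Dense(1): one weight per flattened unit plus one bias
dense_1_params = flatten_units * 1 + 1

# The embedding is frozen (trainable=False), so only the dense layers train
trainable_params = dense_params + dense_1_params
print(vocab_size, dense_params, flatten_units, dense_1_params, trainable_params)
```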
                       

In [None]:
from tensorflow.keras.models import load_model
seq_model = load_model("model.keras")

A set of headlines used for testing (the 2nd and the 5th are fake)

In [10]:
fake_texts = ['Новак спрогнозировал новый мировой энергокризис через 5-10 лет',
              'Во всех школах России установят счетчик американского госдолга',
              'РФ продолжает рассматривать европу как потенциальный рынок для сбыта газа',
              'Число погибших от последствий зимнего шторма в США выросло до 28',
              'В магазинах появились дешевые яйца без желтков']

Processing the handcrafted set

In [11]:
fake_preproc = [udp_tagging(text_preprocessing(fake_text)) for fake_text in fake_texts]

fake_words_tagged = []
for text in fake_preproc:
  words = []
  for word in text:
    if word in w2v.vocab:
      words.append(word)
  fake_words_tagged.append(words)

fake_indexed = [list(map(lambda x: float(w2v.vocab[x].index), t)) for t in fake_words_tagged]
fake_indexed = pad_sequences(fake_indexed, maxlen=100)
fake_indexed = np.asarray(fake_indexed).astype('float32')

print(seq_model.predict(fake_indexed))

[[0.14209108]
 [0.93135417]
 [0.08578382]
 [0.49240673]
 [0.4518472 ]]
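
The network outputs a probability of "fake" per headline. A small sketch of how these numbers turn into labels, assuming the usual 0.5 cut-off (the threshold is an assumption; the notebook itself only prints the raw probabilities):

```python
# Probabilities printed above, and the labels stated for the handcrafted set
probs = [0.14209108, 0.93135417, 0.08578382, 0.49240673, 0.4518472]
expected = [0, 1, 0, 0, 1]   # the 2nd and the 5th headlines are fake

threshold = 0.5              # assumed decision threshold
predicted = [int(p >= threshold) for p in probs]
correct = sum(p == e for p, e in zip(predicted, expected))
print(predicted, f'{correct}/{len(expected)} correct')
```

Under this threshold the 2nd headline is caught with high confidence, while the 5th falls just below the cut-off and is missed.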


Evaluating the neural network on the test set

(As it turned out, the test file contains both fake and real news, but every item is labeled as real, so this evaluation is not valid.)

In [12]:
test = pd.read_csv('test.tsv', sep='\t')
fake_preproc = test['title'].apply(lambda x: udp_tagging(text_preprocessing(x))).array
fake_words_tagged = []
for text in fake_preproc:
  words = []
  for word in text:
    if word in w2v.vocab:
      words.append(word)
  fake_words_tagged.append(words)
Y = np.array(test['is_fake']).astype('float32').reshape((-1,1))

fake_indexed = [list(map(lambda x: float(w2v.vocab[x].index), t)) for t in fake_words_tagged]
fake_indexed = pad_sequences(fake_indexed, maxlen=100)
fake_indexed = np.asarray(fake_indexed).astype('float32')

print(seq_model.evaluate(fake_indexed, Y))


[0.19597852230072021, 0.5609999895095825]


## A working classifier (wow)

Lemmatize the text, vectorize it with TF-IDF, and feed the result to a scikit-learn classifier (`PassiveAggressiveClassifier`)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.linear_model import PassiveAggressiveClassifier

# Lemmatized headlines, joined back into strings for the vectorizer
fake_preproc = df_news['title'].apply(lambda x: " ".join(text_preprocessing(x))).array
# 1-D labels, as scikit-learn expects (a column vector triggers a DataConversionWarning)
Y = np.array(df_news['is_fake']).astype('float32')
x_train, x_test, y_train, y_test = train_test_split(fake_preproc, Y)
# Fit TF-IDF on the training part only, then reuse it for validation
tfidf = TfidfVectorizer()
vec_train = tfidf.fit_transform(x_train)
vec_val = tfidf.transform(x_test)

pac = PassiveAggressiveClassifier(C = 0.01)

pac.fit(vec_train, y_train)
val_pred = pac.predict(vec_val)
print(classification_report(y_test, val_pred))

# Evaluation on the test set
fake_preproc = test['title'].apply(lambda x: " ".join(text_preprocessing(x))).array
Y = np.array(test['is_fake']).astype('float32')
vec_val = tfidf.transform(fake_preproc)
val_pred = pac.predict(vec_val)
print(classification_report(Y, val_pred))




              precision    recall  f1-score   support

         0.0       0.85      0.82      0.84       722
         1.0       0.83      0.86      0.84       718

    accuracy                           0.84      1440
   macro avg       0.84      0.84      0.84      1440
weighted avg       0.84      0.84      0.84      1440

              precision    recall  f1-score   support

         0.0       1.00      0.48      0.65      1000
         1.0       0.00      0.00      0.00         0

    accuracy                           0.48      1000
   macro avg       0.50      0.24      0.32      1000
weighted avg       1.00      0.48      0.65      1000
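
For reference, the weighting that `TfidfVectorizer` applies with its default settings can be reproduced in a few lines: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The sketch below uses a made-up three-document corpus and only the standard library; it illustrates the formula and is not the notebook's pipeline.

```python
import math
from collections import Counter

# Made-up, already lemmatized toy corpus
docs = [['шторм', 'сша'], ['газ', 'рынок', 'газ'], ['сша', 'газ']]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (scikit-learn's default)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    """Raw count times idf, followed by L2 normalization."""
    counts = Counter(doc)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec = tfidf_vector(docs[0])
print(vec)
```

In the first document the rarer term "шторм" receives a larger weight than "сша", which appears in two of the three documents.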





Testing on the handcrafted set of headlines

In [14]:
fake_preproc = [" ".join(text_preprocessing(fake_text)) for fake_text in fake_texts]
vec_val = tfidf.transform(fake_preproc)
val_pred = pac.decision_function(vec_val)
val_pred0 = pac.predict(vec_val)
print(val_pred, val_pred0)

[-0.40157504  1.10566439 -0.69489376 -0.13854861  0.23341085] [0. 1. 0. 0. 1.]
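
The two arrays above are related in a simple way: for a binary linear classifier, `decision_function` returns a signed distance to the separating hyperplane, and `predict` reports the positive class whenever that score is greater than zero. A short check on the printed values:

```python
# Scores printed above by pac.decision_function
scores = [-0.40157504, 1.10566439, -0.69489376, -0.13854861, 0.23341085]

# A positive score means class 1.0 (fake), otherwise class 0.0 (real)
labels = [1.0 if s > 0 else 0.0 for s in scores]
print(labels)
```

The result matches the output of `pac.predict` and flags exactly the 2nd and the 5th headlines, the two that were marked as fake when the set was created.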
