# Bag-of-Words Model
The bag-of-words model is a popular and simple approach to representing text data in natural language processing. It treats a piece of text as an unordered collection, or "bag", of individual words, ignoring grammar, word order, and context. The model represents text by building a vocabulary of unique words and then counting how often each one occurs in a given document.

The bag-of-words model is useful because it offers a simple way to convert unstructured text into a structured numerical representation that machine learning algorithms can work with.
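To make the idea concrete, here is a minimal sketch in plain Python, using two made-up headlines (they are illustrative assumptions, not rows from the dataset). It builds the vocabulary and turns each document into a vector of word counts.

```python
from collections import Counter

# Two toy documents, already lowercased and stripped of punctuation.
docs = [
    "trump says russia probe will be fair",
    "trump lashes out at fbi over russia probe",
]

# The vocabulary is the sorted set of unique words across all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def to_count_vector(doc, vocabulary):
    """Return how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each vector has one position per vocabulary word, so word order is lost but the documents become directly comparable rows of numbers.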

## Building the model
The model for this project was built in the following steps:
1) Loading the required libraries and datasets. <br> 
2) Cleaning the text data by removing unnecessary characters, digits, punctuation, and any special symbols, and converting the text to lowercase. <br> 
3) Tokenization: splitting the cleaned text into individual words, or tokens. <br> 
4) Removing stopwords, i.e. common words that occur frequently in a language but contribute little to the overall meaning of the text (e.g. "the", "is", "and", "a", "an"). <br> 
5) Stemming: reducing words to their base or root form, known as the "stem". <br> 
6) Building the vocabulary: creating a set of unique words by collecting all tokens from the datasets. <br> 
7) Vectorization: converting the text data into numerical vectors that can be used as input to a machine learning model. CountVectorizer, TfidfVectorizer, and HashingVectorizer were tried for this purpose; these are three vectorizers that scikit-learn provides for building a bag-of-words model. <br> 
8) Splitting the data: dividing the dataset into a training set and a test set. One is used to train the model, the other to evaluate its performance. <br> 
9) Training the model: logistic regression is used. Logistic regression is a supervised learning algorithm that can classify text documents based on their features. <br> 
10) Evaluating the model: using the test set to assess the performance of the trained model. <br> 
11) Loading and preparing new data. <br> 
12) Using the model to generate predictions on the new data. <br> 
13) Evaluating the accuracy of the predictions. <br> 
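The text-preparation steps (2, 3, 4 and 6) can be sketched with the standard library alone. The titles and the tiny stopword set below are hypothetical stand-ins; the notebook itself relies on NLTK for tokenization, stopwords and stemming, and stemming is left out here because it needs a stemmer implementation.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list).
STOP_WORDS = {"the", "is", "and", "a", "an", "to", "on", "in"}


def preprocess(title):
    """Clean, tokenize and filter a single title (steps 2-4, without stemming)."""
    cleaned = re.sub(r"[^a-zA-Z\s]", "", title).lower()  # step 2: keep letters and whitespace only
    tokens = cleaned.split()                              # step 3: naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]     # step 4: drop stopwords


# Made-up titles used only for illustration.
titles = [
    "The Senate is voting on a new budget in 2018!",
    "An official statement: the budget vote is delayed",
]

token_lists = [preprocess(t) for t in titles]

# Step 6: the vocabulary is the set of unique tokens across all titles.
vocabulary = sorted({tok for tokens in token_lists for tok in tokens})

print(token_lists)
print(vocabulary)
```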

## What is this project about?
The goal of the project is to build a model that predicts, from an article's title alone, whether the article is fake news. The data comes from two datasets: one contains only true articles and the other only fake ones. Each set contains more than 20,000 records, but only four thousand were used in the project (the first thousand from each set for training and testing the algorithm, and a further thousand from each for generating new predictions).

Data source: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

# 1. Loading the required libraries and datasets

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
import numpy as np
import re

import nltk
nltk.download('punkt')
nltk.download('stopwords') #required by stopwords.words('english'); does nothing if already installed
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression


True_News = pd.read_csv("True.csv",sep=",", nrows=1000) #Read the first thousand records from the set of true articles
True_Text = True_News['title'] #Keep only the title column

Fake_News = pd.read_csv("Fake.csv",sep=",", nrows=1000) #Read the first thousand records from the set of fake articles
Fake_Text = Fake_News['title'] #Keep only the title column

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Bartosz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [50]:
True_News.head(10) #This is what the initial dataset looks like (true articles)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
5,"White House, Congress prepare for talks on spe...","WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T...",politicsNews,"December 29, 2017"
6,"Trump says Russia probe will be fair, but time...","WEST PALM BEACH, Fla (Reuters) - President Don...",politicsNews,"December 29, 2017"
7,Factbox: Trump on Twitter (Dec 29) - Approval ...,The following statements were posted to the ve...,politicsNews,"December 29, 2017"
8,Trump on Twitter (Dec 28) - Global Warming,The following statements were posted to the ve...,politicsNews,"December 29, 2017"
9,Alabama official to certify Senator-elect Jone...,WASHINGTON (Reuters) - Alabama Secretary of St...,politicsNews,"December 28, 2017"


In [45]:
Fake_News.head(10) #This is what the initial dataset looks like (fake articles)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017"
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017"
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017"
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017"
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017"


In [46]:
True_Text.head(10) #The title text alone (true articles)

0    As U.S. budget fight looms, Republicans flip t...
1    U.S. military to accept transgender recruits o...
2    Senior U.S. Republican senator: 'Let Mr. Muell...
3    FBI Russia probe helped by Australian diplomat...
4    Trump wants Postal Service to charge 'much mor...
5    White House, Congress prepare for talks on spe...
6    Trump says Russia probe will be fair, but time...
7    Factbox: Trump on Twitter (Dec 29) - Approval ...
8           Trump on Twitter (Dec 28) - Global Warming
9    Alabama official to certify Senator-elect Jone...
Name: title, dtype: object

In [47]:
Fake_Text.head(10) #The title text alone (fake articles)

0     Donald Trump Sends Out Embarrassing New Year’...
1     Drunk Bragging Trump Staffer Started Russian ...
2     Sheriff David Clarke Becomes An Internet Joke...
3     Trump Is So Obsessed He Even Has Obama’s Name...
4     Pope Francis Just Called Out Donald Trump Dur...
5     Racist Alabama Cops Brutalize Black Boy While...
6     Fresh Off The Golf Course, Trump Lashes Out A...
7     Trump Said Some INSANELY Racist Stuff Inside ...
8     Former CIA Director Slams Trump Over UN Bully...
9     WATCH: Brand-New Pro-Trump Ad Features So Muc...
Name: title, dtype: object

# (2, 3, 4, 5). Text cleaning, tokenization, stopword removal, stemming

In [6]:
def clean_text(text):
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text) #remove every character that is not a letter or whitespace
    cleaned_text = cleaned_text.lower()
    return cleaned_text
    
True_Text1 = True_Text.apply(clean_text)
Fake_Text1 = Fake_Text.apply(clean_text)

def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

True_Text2 = True_Text1.apply(tokenize_text)
Fake_Text2 = Fake_Text1.apply(tokenize_text)

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return filtered_tokens
    
True_Text3 = True_Text2.apply(remove_stopwords)
Fake_Text3 = Fake_Text2.apply(remove_stopwords)

stemmer = PorterStemmer()

def apply_stemming(tokens):
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

True_Text4 = True_Text3.apply(apply_stemming)
Fake_Text4 = Fake_Text3.apply(apply_stemming)

In [48]:
True_Text4.head(10) #Title text after preprocessing (true)

0    [us, budget, fight, loom, republican, flip, fi...
1    [us, militari, accept, transgend, recruit, mon...
2    [senior, us, republican, senat, let, mr, muell...
3    [fbi, russia, probe, help, australian, diploma...
4    [trump, want, postal, servic, charg, much, ama...
5    [white, hous, congress, prepar, talk, spend, i...
6    [trump, say, russia, probe, fair, timelin, unc...
7    [factbox, trump, twitter, dec, approv, rate, a...
8                  [trump, twitter, dec, global, warm]
9    [alabama, offici, certifi, senatorelect, jone,...
Name: title, dtype: object

In [49]:
Fake_Text4.head(10) #Title text after preprocessing (fake)

0    [donald, trump, send, embarrass, new, year, ev...
1    [drunk, brag, trump, staffer, start, russian, ...
2    [sheriff, david, clark, becom, internet, joke,...
3    [trump, obsess, even, obama, name, code, websi...
4    [pope, franci, call, donald, trump, christma, ...
5    [racist, alabama, cop, brutal, black, boy, han...
6    [fresh, golf, cours, trump, lash, fbi, deputi,...
7    [trump, said, insan, racist, stuff, insid, ova...
8    [former, cia, director, slam, trump, un, bulli...
9    [watch, brandnew, protrump, ad, featur, much, ...
Name: title, dtype: object

# 6. Building the vocabulary

In [51]:
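#Collect the token lists of all titles; the vocabulary of unique words is derived from them by vectorizer.fit() in step 7
all_filtered_tokens = list(Fake_Text4) + list(True_Text4)
#print(all_filtered_tokens)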

# 7. Vectorization
## CountVectorizer (number of occurrences of each word in the text)

In [13]:
preprocessed_texts = [' '.join(tokens) for tokens in all_filtered_tokens]


vectorizer = CountVectorizer() 
vectorizer.fit(preprocessed_texts)

X_fake = vectorizer.transform([' '.join(tokens) for tokens in Fake_Text4])
X_true = vectorizer.transform([' '.join(tokens) for tokens in True_Text4])

X_fake_df = pd.DataFrame(X_fake.toarray(), columns=vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
X_true_df = pd.DataFrame(X_true.toarray(), columns=vectorizer.get_feature_names_out())



In [14]:
print(X_true_df.head(10)) #Number of occurrences of each word in the text (true)

   abandon  abc  abduct  abe  abil  abl  abort  abroad  abrupt  abruptli  \
0        0    0       0    0     0    0      0       0       0         0   
1        0    0       0    0     0    0      0       0       0         0   
2        0    0       0    0     0    0      0       0       0         0   
3        0    0       0    0     0    0      0       0       0         0   
4        0    0       0    0     0    0      0       0       0         0   
5        0    0       0    0     0    0      0       0       0         0   
6        0    0       0    0     0    0      0       0       0         0   
7        0    0       0    0     0    0      0       0       0         0   
8        0    0       0    0     0    0      0       0       0         0   
9        0    0       0    0     0    0      0       0       0         0   

   absolut  abus  aca  accent  accept  access  accid  accident  accomplic  \
0        0     0    0       0       0       0      0         0          0   
1        

In [15]:
print(X_fake_df.head(10)) #Number of occurrences of each word in the text (fake)

   abandon  abc  abduct  abe  abil  abl  abort  abroad  abrupt  abruptli  \
0        0    0       0    0     0    0      0       0       0         0   
1        0    0       0    0     0    0      0       0       0         0   
2        0    0       0    0     0    0      0       0       0         0   
3        0    0       0    0     0    0      0       0       0         0   
4        0    0       0    0     0    0      0       0       0         0   
5        0    0       0    0     0    0      0       0       0         0   
6        0    0       0    0     0    0      0       0       0         0   
7        0    0       0    0     0    0      0       0       0         0   
8        0    0       0    0     0    0      0       0       0         0   
9        0    0       0    0     0    0      0       0       0         0   

   absolut  abus  aca  accent  accept  access  accid  accident  accomplic  \
0        0     0    0       0       0       0      0         0          0   
1        

In [54]:
#print(vectorizer.vocabulary_) #indices assigned to the individual words

## TfidfVectorizer (term frequency weighted by inverse document frequency)
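As a rough illustration of what the TF-IDF weights mean, the sketch below computes them by hand for three toy token lists (made up for this example). It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which to my understanding matches scikit-learn's default settings; treat it as an approximation of the library, not its implementation.

```python
import math
from collections import Counter

# Toy, already preprocessed documents (illustrative only).
docs = [
    ["trump", "russia", "probe"],
    ["trump", "twitter"],
    ["trump", "budget", "fight"],
]

n_docs = len(docs)
vocabulary = sorted({t for doc in docs for t in doc})
# Document frequency: in how many documents each term appears.
doc_freq = Counter(t for doc in docs for t in set(doc))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf_vector(doc):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[t] * idf(t) for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


vec = tfidf_vector(docs[1])
weights = dict(zip(vocabulary, vec))
print(weights)
```

A word such as "trump", which appears in every document, receives the lowest possible idf, so rarer words like "twitter" dominate the vector.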

In [18]:
vectorizer = TfidfVectorizer()
vectorizer.fit(preprocessed_texts)

X_fake = vectorizer.transform([' '.join(tokens) for tokens in Fake_Text4])
X_true = vectorizer.transform([' '.join(tokens) for tokens in True_Text4])

X_fake_df = pd.DataFrame(X_fake.toarray(), columns=vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
X_true_df = pd.DataFrame(X_true.toarray(), columns=vectorizer.get_feature_names_out())



In [19]:
print(X_true_df.head(10)) #TF-IDF weight of each word (true)

   abandon  abc  abduct  abe  abil  abl  abort  abroad  abrupt  abruptli  \
0      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
1      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
2      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
3      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
4      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
5      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
6      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
7      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
8      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
9      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   

   absolut  abus  aca  accent    accept  access  accid  accident  accomplic  \
0      0.0   0.0  0.0     0.0  0.000000     0.0    0.0       0.0        0.0   
1    

In [20]:
print(X_fake_df.head(10)) #TF-IDF weight of each word (fake)

   abandon  abc  abduct  abe  abil  abl  abort  abroad  abrupt  abruptli  \
0      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
1      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
2      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
3      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
4      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
5      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
6      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
7      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
8      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   
9      0.0  0.0     0.0  0.0   0.0  0.0    0.0     0.0     0.0       0.0   

   absolut  abus  aca  accent  accept  access  accid  accident  accomplic  \
0      0.0   0.0  0.0     0.0     0.0     0.0    0.0       0.0        0.0   
1      0.

## HashingVectorizer (maps each word to an index in a fixed-length vector using a hash function)
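The hashing trick can be sketched in a few lines. This is only an illustration: it uses MD5 from the standard library as a stand-in for the MurmurHash3 function that scikit-learn uses, so the indices will not match the notebook output. The sign taken from the hash and the final L2 normalization are the reason the printed values are signed fractions rather than whole counts.

```python
import hashlib
import math

N_FEATURES = 50  # same vector length as in the notebook cell


def stable_hash(token):
    """Deterministic signed 32-bit hash (stand-in for MurmurHash3)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little", signed=True)


def hash_vector(tokens, n_features=N_FEATURES):
    """Map tokens into a fixed-length vector without storing a vocabulary."""
    vec = [0.0] * n_features
    for token in tokens:
        h = stable_hash(token)
        index = abs(h) % n_features      # bucket chosen by the hash value
        sign = 1.0 if h >= 0 else -1.0   # alternating sign reduces the bias from collisions
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm if norm else 0.0 for v in vec]


vec = hash_vector(["trump", "twitter", "dec", "global", "warm"])
print([round(v, 3) for v in vec])
```

With only 50 buckets, different words frequently land on the same index, which is one plausible reason for a hashing-based model to lose accuracy compared with a full vocabulary.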

In [21]:
vectorizer = HashingVectorizer(n_features=50)
X_fake = vectorizer.transform([' '.join(tokens) for tokens in Fake_Text4])
X_true = vectorizer.transform([' '.join(tokens) for tokens in True_Text4])

X_fake_df = pd.DataFrame(X_fake.toarray())
X_true_df = pd.DataFrame(X_true.toarray())

In [22]:
print(X_fake_df.head(10))

         0         1    2         3         4         5         6         7   \
0  0.000000  0.000000  0.0  0.333333  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.0  0.000000 -0.353553  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
3 -0.316228  0.000000  0.0  0.000000  0.000000  0.000000  0.316228  0.000000   
4  0.000000  0.000000  0.0  0.000000 -0.377964  0.000000  0.000000  0.000000   
5  0.000000  0.000000  0.0  0.000000  0.333333 -0.333333  0.000000  0.000000   
6  0.000000  0.000000  0.0  0.000000  0.000000 -0.316228  0.000000  0.000000   
7  0.000000  0.000000  0.0  0.000000  0.000000  0.000000  0.000000  0.000000   
8  0.000000  0.000000  0.0  0.000000  0.000000 -0.250000  0.000000  0.000000   
9 -0.333333 -0.333333  0.0  0.000000  0.000000  0.000000  0.000000 -0.333333   

         8         9         10   11        12        13        14        15  \
0  0.000000  0.333333 -0.333333  0.0  0

In [23]:
print(X_true_df.head(10))

    0         1         2    3         4    5         6         7         8   \
0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.000000  0.000000  0.000000   
1  0.0  0.000000  0.377964  0.0  0.000000  0.0  0.000000  0.000000  0.000000   
2  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.316228  0.000000  0.000000   
3  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.000000  0.000000  0.000000   
4  0.0  0.000000  0.000000  0.0  0.000000  0.0 -0.408248  0.000000  0.000000   
5  0.0 -0.377964  0.000000  0.0  0.000000  0.0  0.000000  0.000000  0.377964   
6  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.000000  0.000000  0.000000   
7  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.000000  0.000000  0.000000   
8  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.000000  0.000000  0.000000   
9  0.0 -0.333333  0.000000  0.0  0.333333  0.0  0.000000 -0.333333  0.000000   

    9         10        11        12        13        14        15   16   17  \
0  0.0  0.000000  0.000000  0.000000  0

In [24]:
# The model based on CountVectorizer turned out to be the most accurate, so the vectorizer is rebuilt here
vectorizer = CountVectorizer() 
vectorizer.fit(preprocessed_texts)

X_fake = vectorizer.transform([' '.join(tokens) for tokens in Fake_Text4])
X_true = vectorizer.transform([' '.join(tokens) for tokens in True_Text4])

X_fake_df = pd.DataFrame(X_fake.toarray(), columns=vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
X_true_df = pd.DataFrame(X_true.toarray(), columns=vectorizer.get_feature_names_out())



In [52]:
X_fake_df.head(10) #Number of occurrences of each word in the text (fake)

Unnamed: 0,abandon,abc,abduct,abe,abil,abl,abort,abroad,abrupt,abruptli,absolut,abus,aca,accent,accept,access,accid,accident,accomplic,accomplish,accord,account,accus,acknowledg,across,act,action,activ,activist,actual,ad,add,addict,address,admin,administr,admiss,admit,adopt,adult,advanc,advertis,advic,advis,advisor,advisori,advocaci,ae,affair,affect,...,worthi,would,wouldnt,wouldv,wound,wrap,wray,wreck,write,writeoff,writer,wrong,wrongdo,wrote,wsj,wtf,wwiii,xenophob,xi,xinhua,xma,yall,yankuang,yate,yawn,yeah,year,yearend,yearold,yell,yellen,yemen,yesterday,yet,yiannopoulo,york,yorker,youd,youll,young,your,youth,youv,yuln,zealand,zeldin,zero,zhong,zilch,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [53]:
X_true_df.head(10) #Number of occurrences of each word in the text (true)

Unnamed: 0,abandon,abc,abduct,abe,abil,abl,abort,abroad,abrupt,abruptli,absolut,abus,aca,accent,accept,access,accid,accident,accomplic,accomplish,accord,account,accus,acknowledg,across,act,action,activ,activist,actual,ad,add,addict,address,admin,administr,admiss,admit,adopt,adult,advanc,advertis,advic,advis,advisor,advisori,advocaci,ae,affair,affect,...,worthi,would,wouldnt,wouldv,wound,wrap,wray,wreck,write,writeoff,writer,wrong,wrongdo,wrote,wsj,wtf,wwiii,xenophob,xi,xinhua,xma,yall,yankuang,yate,yawn,yeah,year,yearend,yearold,yell,yellen,yemen,yesterday,yet,yiannopoulo,york,yorker,youd,youll,young,your,youth,youv,yuln,zealand,zeldin,zero,zhong,zilch,zuckerberg
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# 8. Splitting the data

In [28]:
X = pd.concat([X_fake_df, X_true_df], axis=0) #concatenate both data frames

#create two label series, one for the fake articles and one for the true ones
y_fake = pd.Series([1] * len(X_fake_df)) #1 labels the fake articles
y_true = pd.Series([0] * len(X_true_df)) #0 labels the true articles
y = pd.concat([y_fake, y_true], axis=0) #concatenate the two series

#split the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (1600, 3373)
Testing set shape: (400, 3373)


# 9. Training the model

In [29]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

array([0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1,
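To show how a fitted logistic regression turns a count vector into a 0/1 prediction, here is a small hand-written sketch. The weights and bias are invented for illustration and are not the coefficients learned in the notebook; the point is only the mechanics: a weighted sum of word counts is passed through the sigmoid function and compared with a 0.5 threshold.

```python
import math

# Hypothetical weights for a few stemmed tokens: positive values push the
# score towards class 1 (fake), negative values towards class 0 (true).
weights = {"watch": 1.4, "racist": 1.1, "trump": 0.3, "senat": -0.9, "factbox": -1.6}
bias = -0.1


def predict_proba_fake(token_counts):
    """Probability of class 1 given a {token: count} mapping."""
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in token_counts.items())
    return 1.0 / (1.0 + math.exp(-z))


def predict(token_counts, threshold=0.5):
    """Return 1 (fake) if the probability reaches the threshold, else 0 (true)."""
    return 1 if predict_proba_fake(token_counts) >= threshold else 0


print(predict_proba_fake({"watch": 1, "trump": 1}), predict({"watch": 1, "trump": 1}))
print(predict_proba_fake({"factbox": 1, "trump": 1}), predict({"factbox": 1, "trump": 1}))
```

Tokens missing from the weight table contribute nothing, which mirrors how words outside the fitted vocabulary are ignored at prediction time.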

# 10. Evaluating the model

In [30]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

precision = precision_score(y_test, y_pred)
print("Precision:", precision)

recall = recall_score(y_test, y_pred)
print("Recall:", recall)

f1 = f1_score(y_test, y_pred)
print("F1-score:", f1)

#92% accuracy on the test set

Accuracy: 0.92
Precision: 0.9113300492610837
Recall: 0.9296482412060302
F1-score: 0.9203980099502488
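The four scores are all simple ratios of confusion-matrix counts. The notebook does not print the confusion matrix, so the counts below are an inference: they are the only whole numbers for a 400-row test set that I could find consistent with the reported precision and recall, with class 1 (fake) as the positive class.

```python
# Inferred counts (an assumption, not printed in the notebook).
tp, fp, fn, tn = 185, 18, 14, 183  # true/false positives, false/true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all titles classified correctly
precision = tp / (tp + fp)                   # of the titles flagged as fake, how many are fake
recall = tp / (tp + fn)                      # of the fake titles, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

Under these counts the values agree with the scores above up to floating-point rounding, and they show that the model makes slightly more false alarms (18) than misses (14).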


# 11. Loading and preparing new data

In [31]:
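True_News_New = pd.read_csv("True.csv",sep=",", skiprows=lambda x: x != 0 and x < (1000 - 1), nrows=1000) #keeps the header, skips file rows 1-998 and reads the next 1000 records (not the literal last 1000 of the file)
True_Text_New = True_News_New['title']

Fake_News_New = pd.read_csv("Fake.csv",sep=",", skiprows=lambda x: x != 0 and x < (1000 - 1), nrows=1000) #same selection for the fake articles; the first two records read this way appear to overlap with the training sample
Fake_Text_New = Fake_News_New['title']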

In [32]:
True_Text_New1 = True_Text_New.apply(clean_text)
Fake_Text_New1 = Fake_Text_New.apply(clean_text)

True_Text_New2 = True_Text_New1.apply(tokenize_text)
Fake_Text_New2 = Fake_Text_New1.apply(tokenize_text)

True_Text_New3 = True_Text_New2.apply(remove_stopwords)
Fake_Text_New3 = Fake_Text_New2.apply(remove_stopwords)

True_Text_New4 = True_Text_New3.apply(apply_stemming)
Fake_Text_New4 = Fake_Text_New3.apply(apply_stemming)

X_fake_New = vectorizer.transform([' '.join(tokens) for tokens in Fake_Text_New4])
X_true_New = vectorizer.transform([' '.join(tokens) for tokens in True_Text_New4])

X_fake_New_df = pd.DataFrame(X_fake_New.toarray(), columns=vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
X_true_New_df = pd.DataFrame(X_true_New.toarray(), columns=vectorizer.get_feature_names_out())




# 12. Using the model to generate predictions on new data

### Case_1 - The model is given a set containing only fake articles and generates predictions

In [33]:
new_predictions1 = model.predict(X_fake_New_df)

labels = ['true', 'fake']
new_labels = [labels[prediction] for prediction in new_predictions1]
X_fake_New_df['predicted_label'] = new_labels

In [55]:
results_df = pd.DataFrame({'Data': Fake_Text_New4 , 'Prediction' : X_fake_New_df['predicted_label']}) #use the stored column, because new_labels is reassigned in Case_2
results_df.head(10) #Predictions for individual titles

Unnamed: 0,Data,Prediction
0,"[absolut, cringeworthi, moment, trump, tri, fl...",fake
1,"[fed, report, stood, sarah, huckabe, smear, fr...",fake
2,"[stun, new, poll, reveal, global, opinion, don...",fake
3,"[former, gop, rep, throw, support, behind, oba...",fake
4,"[trump, moron, claim, entir, russia, investig,...",fake
5,"[watch, hit, trump, support, repeal, obamacar,...",fake
6,"[republican, ad, hate, obamacar, bill]",true
7,"[number, jon, ossoff, lose, georgia, elect, ac...",fake
8,"[gop, senat, lash, kellyann, conway, trumpcar,...",fake
9,"[cop, republican, senat, offic, violent, assau...",fake


In [35]:
new_predictions_1 = model.predict(X_fake_New)
print(new_predictions_1) #In this view, the vast majority of the articles were classified as fake (0 - true, 1 - fake)

[1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 1
 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 0
 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0
 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 0 1 1 
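Because every article in this batch is actually fake, accuracy on the batch is simply the share of predictions equal to 1. The sketch below applies that to the first printed row of the array above only (the full output is truncated), so the resulting number describes that row and not the whole batch.

```python
# First printed row of the prediction array for the fake-only batch.
first_row = "1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1"
predictions = [int(v) for v in first_row.split()]

# All articles in this batch carry the actual label 1 (fake).
actual = [1] * len(predictions)

correct = sum(p == a for p, a in zip(predictions, actual))
accuracy = correct / len(predictions)

print(f"{correct} of {len(predictions)} correct, accuracy on this row = {accuracy:.3f}")
```

The same comparison, run over all predictions of a batch, gives the per-case accuracy discussed in step 13.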



### Case_2 - The model is given a set containing only true articles and generates predictions

In [36]:
new_predictions2 = model.predict(X_true_New_df)

labels = ['true', 'fake']
new_labels = [labels[prediction] for prediction in new_predictions2]
X_true_New_df['predicted_label'] = new_labels

In [56]:
results_df = pd.DataFrame({'Data': True_Text_New4 , 'Prediction' : new_labels})
results_df.head(10) #Predictions for individual titles

Unnamed: 0,Data,Prediction
0,"[exclus, advis, trump, excia, chief, propos, p...",true
1,"[matti, visit, seoul, defens, talk, tension, c...",true
2,"[role, assad, syria, futur, tillerson]",true
3,"[us, veteran, trump, save, bank, custom, right...",true
4,"[hous, narrowli, pass, measur, pave, way, trum...",true
5,"[us, appoint, new, top, offici, havana, embass...",true
6,"[fatal, niger, oper, spark, call, public, hear...",true
7,"[trump, administr, tap, coal, consult, mine, o...",true
8,"[congress, watchdog, investig, trump, voter, f...",true
9,"[us, envoy, haley, make, emot, visit, congo, d...",true


In [39]:
new_predictions_2 = model.predict(X_true_New)
print(new_predictions_2) #In this view, the vast majority of the articles were classified as true (0 - true, 1 - fake)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0
 1 1 0 0 1 0 1 0 0 1 0 0 
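Reading the share of each class off the raw array is error-prone, so the predictions can be tallied instead. The following is a minimal sketch, not part of the original notebook: `example_predictions` is a short hypothetical list standing in for `new_predictions2`, and `label_shares` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical stand-in for new_predictions2 (0 = true, 1 = fake).
example_predictions = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# Same index-to-label mapping as in the notebook.
labels = ['true', 'fake']


def label_shares(predictions):
    """Return the fraction of predictions assigned to each label."""
    counts = Counter(int(p) for p in predictions)
    total = len(predictions)
    return {labels[k]: counts.get(k, 0) / total for k in range(len(labels))}


shares = label_shares(example_predictions)
print(shares)
```

The helper only relies on `len()` and iteration, so it should accept the NumPy array returned by `model.predict` as well as a plain list.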



# 13. Evaluating prediction accuracy

### Case_1 (fake articles)

In [57]:
X_fake_New_df['Actual_Label'] = "fake"
Actual_Label_Fake = X_fake_New_df['Actual_Label']
Predicted_Label_Fake = X_fake_New_df['predicted_label']

results_df = pd.DataFrame({'Data': Fake_Text_New4 , 'Prediction' : Predicted_Label_Fake, 'Actual': Actual_Label_Fake})
results_df.head(10)  # Compare predictions ('Prediction') with the actual labels ('Actual') for the individual articles

| | Data | Prediction | Actual |
|---|---|---|---|
| 0 | [absolut, cringeworthi, moment, trump, tri, fl...] | fake | fake |
| 1 | [fed, report, stood, sarah, huckabe, smear, fr...] | fake | fake |
| 2 | [stun, new, poll, reveal, global, opinion, don...] | fake | fake |
| 3 | [former, gop, rep, throw, support, behind, oba...] | fake | fake |
| 4 | [trump, moron, claim, entir, russia, investig,...] | fake | fake |
| 5 | [watch, hit, trump, support, repeal, obamacar,...] | fake | fake |
| 6 | [republican, ad, hate, obamacar, bill] | true | fake |
| 7 | [number, jon, ossoff, lose, georgia, elect, ac...] | fake | fake |
| 8 | [gop, senat, lash, kellyann, conway, trumpcar,...] | fake | fake |
| 9 | [cop, republican, senat, offic, violent, assau...] | fake | fake |


In [41]:
X_fake_New_df['Actual_Label'] = 1
Actual_Label_Fake = X_fake_New_df['Actual_Label']

accuracy_New = accuracy_score(Actual_Label_Fake, new_predictions1 )
print("Accuracy:", accuracy_New)

# 90% accuracy on the set of fake articles (compared with 92% on the test set)

Accuracy: 0.901
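Because every actual label in this case is 1 (fake), the accuracy reported here is simply the share of articles the model flagged as fake, which is the recall for the fake class. The sketch below illustrates that equivalence in plain Python with made-up predictions rather than the project's data; the hand-written `accuracy` helper is meant to mirror what `accuracy_score` returns with its default settings.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Hypothetical predictions for ten fake articles (1 = fake, 0 = true).
example_predictions = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
# Every article in this set is actually fake.
example_actual = [1] * len(example_predictions)

acc = accuracy(example_actual, example_predictions)
share_flagged_fake = sum(example_predictions) / len(example_predictions)

print(acc, share_flagged_fake)
```

Both numbers coincide, so a single-class accuracy says nothing about how often true articles are mislabeled as fake; that is what Case_2 measures.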


### Case_2 (true articles)

In [58]:
X_true_New_df['Actual_Label'] = "true"
Actual_Label_True = X_true_New_df['Actual_Label']
Predicted_Label_True = X_true_New_df['predicted_label']

results_df = pd.DataFrame({'Data': True_Text_New4 , 'Prediction' : Predicted_Label_True, 'Actual': Actual_Label_True})
results_df.head(10)  # Compare predictions ('Prediction') with the actual labels ('Actual') for the individual articles

| | Data | Prediction | Actual |
|---|---|---|---|
| 0 | [exclus, advis, trump, excia, chief, propos, p...] | true | true |
| 1 | [matti, visit, seoul, defens, talk, tension, c...] | true | true |
| 2 | [role, assad, syria, futur, tillerson] | true | true |
| 3 | [us, veteran, trump, save, bank, custom, right...] | true | true |
| 4 | [hous, narrowli, pass, measur, pave, way, trum...] | true | true |
| 5 | [us, appoint, new, top, offici, havana, embass...] | true | true |
| 6 | [fatal, niger, oper, spark, call, public, hear...] | true | true |
| 7 | [trump, administr, tap, coal, consult, mine, o...] | true | true |
| 8 | [congress, watchdog, investig, trump, voter, f...] | true | true |
| 9 | [us, envoy, haley, make, emot, visit, congo, d...] | true | true |


In [43]:
X_true_New_df['Actual_Label'] = 0
Actual_Label_True = X_true_New_df['Actual_Label']

accuracy_New = accuracy_score(Actual_Label_True, new_predictions2 )
print("Accuracy:", accuracy_New)

# About 86% accuracy on the set of true articles (compared with 92% on the test set)

Accuracy: 0.856
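The two single-class results can also be read together as one confusion matrix, which shows both kinds of error side by side. This is a rough sketch under stated assumptions: `fake_preds` and `true_preds` are short hypothetical lists standing in for `new_predictions1` and `new_predictions2`, and `confusion_counts` is an illustrative helper, not a function from the notebook.

```python
def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs for binary labels 0 = true, 1 = fake."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Hypothetical predictions for the two new sets.
fake_preds = [1, 1, 1, 0, 1]  # every article here is actually fake (1)
true_preds = [0, 0, 1, 0, 0]  # every article here is actually true (0)

# Stack both sets so that each prediction is paired with its actual label.
actual = [1] * len(fake_preds) + [0] * len(true_preds)
predicted = fake_preds + true_preds

cm = confusion_counts(actual, predicted)
# Correct predictions sit on the diagonal: (0, 0) and (1, 1).
overall_accuracy = (cm[(0, 0)] + cm[(1, 1)]) / len(actual)

print(cm)
print("Overall accuracy:", overall_accuracy)
```

The overall accuracy computed this way is the average of the two per-set accuracies weighted by set size, so it matches their plain mean only when both sets contain the same number of articles.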
