## Task: Detecting fake news

1. Using the dataset [dane/news_train.csv.gz](dane/news_train.csv.gz), build a classification model for detecting fake news.
The training set contains 6000 news items in English labeled ``'REAL'`` (true news) or ``'FAKE'`` (false news). You may use any vectorization method from the lab (``CountVectorizer``, ``TfidfVectorizer`` or ``HashingVectorizer``) and any classification method (e.g. kNN, SVM, a decision tree, logistic regression). Tune the model parameters so that it detects fake news with the highest possible accuracy.

In [1]:
import gzip
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the gzip-compressed training set (columns used below: 'text', 'label').
with gzip.open('dane/news_train.csv.gz', 'rt', encoding='utf-8') as f:
    df = pd.read_csv(f)

def clean_text(text):
    # Lowercase, then drop everything except ASCII letters and whitespace.
    return re.sub(r'[^a-zA-Z\s]', '', str(text).lower())

df['text_clean'] = df['text'].apply(clean_text)

# Hold out 20% of the training data as a validation split.
X_train, X_test, y_train, y_test = train_test_split(
    df['text_clean'], df['label'], test_size=0.2, random_state=42
)

# Fit TF-IDF on the training split only, so the validation split stays unseen.
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2), stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Logistic regression with weak regularization (large C).
model = LogisticRegression(C=10, max_iter=1000)
model.fit(X_train_vec, y_train)

# Accuracy on the validation split.
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.9450
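
The task asks for parameters tuned for the highest accuracy, while the cell above sets `C`, `max_features`, and `ngram_range` by hand. One possible way to search over such settings is a cross-validated grid search on a `Pipeline`, sketched below. The tiny corpus and the parameter grid are hypothetical placeholders chosen only so the snippet runs on its own; in the notebook one would presumably fit on `df['text_clean']` and `df['label']` instead, with a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (NOT the news dataset), six texts per class.
texts = [
    "scientists publish peer reviewed study on climate",
    "government releases official budget report",
    "city council approves new public transport plan",
    "central bank keeps interest rates unchanged",
    "university researchers announce vaccine trial results",
    "court publishes ruling in long running case",
    "shocking miracle cure doctors hate revealed",
    "aliens secretly control world leaders claims insider",
    "you will not believe this one weird trick",
    "secret plot exposed celebrities are lizards",
    "miracle pill melts fat overnight shocking truth",
    "insider reveals shocking secret moon landing hoax",
]
labels = ["REAL"] * 6 + ["FAKE"] * 6

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so validation folds never influence the vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

After fitting, `search.best_estimator_` is a pipeline refit on all the data passed to `fit`, so it can be used directly with raw (cleaned) text through `search.predict(...)`.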


2. Evaluate the quality of the classification model built in the previous step using the test set [dane/news_test.csv.gz](dane/news_test.csv.gz).  
Compute the classification accuracy and the confusion matrix.

In [None]:
# Load the separate test set and apply the same cleaning as for training.
with gzip.open('dane/news_test.csv.gz', 'rt', encoding='utf-8') as f:
    test_df = pd.read_csv(f)

test_df['text_clean'] = test_df['text'].apply(clean_text)

# Reuse the already fitted vectorizer (transform only, no refitting).
X_test_final = vectorizer.transform(test_df['text_clean'])
y_test_final = test_df['label']

y_pred_final = model.predict(X_test_final)

accuracy_final = accuracy_score(y_test_final, y_pred_final)

print(f"Test Accuracy: {accuracy_final:.4f}")


Test Accuracy: 0.9356
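
The task also asks for a confusion matrix, which the cell above does not compute. A minimal sketch using `sklearn.metrics.confusion_matrix` follows. The two label lists are made-up placeholders so the snippet runs on its own; in the notebook one would presumably pass `y_test_final` and `y_pred_final` instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (stand-ins for y_test_final / y_pred_final).
y_true = ["REAL", "FAKE", "FAKE", "REAL", "FAKE", "REAL"]
y_hat = ["REAL", "FAKE", "REAL", "REAL", "FAKE", "FAKE"]

# Fixing the label order makes the row/column meaning explicit.
class_labels = ["FAKE", "REAL"]

# Rows correspond to true classes, columns to predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=class_labels)

# Wrap in a DataFrame so the axes are labeled when printed.
cm_df = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in class_labels],
    columns=[f"pred {c}" for c in class_labels],
)
print(cm_df)
```

The diagonal holds the correctly classified items; the off-diagonal cells separate fake news mistaken for real from real news mistaken for fake, which a single accuracy number cannot distinguish.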


3. Take the latest news article from the CNN site https://lite.cnn.com/ and use the resulting model to answer the question: is the content of the article real or not?

In [None]:
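import requests
from bs4 import BeautifulSoup

# Download the article; a browser-like User-Agent avoids being rejected as a bot.
url = "https://lite.cnn.com/2025/05/27/middleeast/gaza-aid-distribution-chaos-ghf-intl-latam"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on an HTTP error instead of classifying an error page
soup = BeautifulSoup(response.content, 'html.parser')

# Use the text of all <p> elements as the article body.
paragraphs = soup.find_all('p')
article_text = ' '.join([p.get_text() for p in paragraphs])

print(article_text)

# Apply the same cleaning and the fitted vectorizer as for the training data.
cleaned_article = clean_text(article_text)
article_vectorized = vectorizer.transform([cleaned_article])
prediction = model.predict(article_vectorized)
probability = model.predict_proba(article_vectorized)

print(f"Prediction: {prediction[0]}")
# Columns of predict_proba follow model.classes_, so pair them explicitly.
for label, p in zip(model.classes_, probability[0]):
    print(f"{label}: {p:.4f}")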


  By Jeremy Diamond, Kareem Khadder, Abeer Salman and Mohammad Al Sawalhi, CNN
 
Updated: 
        1:50 PM EDT, Wed May 28, 2025
     
  Source: CNN
 
 
  An 11-week Israeli blockade on humanitarian aid has pushed the enclave’s population of more than 2 million Palestinians towards famine and into a deepening humanitarian crisis, with the first resumption of humanitarian aid trickling into the besieged enclave last week.
 
  Videos from the distribution site in Tel al-Sultan, run by the Gaza Humanitarian Foundation (GHF), showed large crowds rushing the facilities, tearing down some of the fencing and appearing to climb over barriers designed to control the flow of the crowd.
 
  On Wednesday, Palestinian health officials said one person had been shot dead and 48 wounded during the chaos. The person who was killed died of severe injuries at the Red Cross Field Hospital in Rafah, the officials said.
 
  The GHF said later Wednesday that a second distribution site opened and distributed