## Document Retrieval and Classification System

The dataset is available on Kaggle: https://www.kaggle.com/datasets/shubh0799/fake-news

In [24]:
import pandas as pd
import numpy as np

In [26]:
df = pd.read_csv("/kaggle/input/fake-news/news.csv")
df.head()

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [27]:
df.shape

(6335, 4)

In [28]:
df['label'].value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [31]:
df.isna().sum()

Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

In [32]:
df['text'][0]



In [34]:
df['text'][1]

'Google Pinterest Digg Linkedin Reddit Stumbleupon Print Delicious Pocket Tumblr \nThere are two fundamental truths in this world: Paul Ryan desperately wants to be president. And Paul Ryan will never be president. Today proved it. \nIn a particularly staggering example of political cowardice, Paul Ryan re-re-re-reversed course and announced that he was back on the Trump Train after all. This was an aboutface from where he was a few weeks ago. He had previously declared he would not be supporting or defending Trump after a tape was made public in which Trump bragged about assaulting women. Suddenly, Ryan was appearing at a pro-Trump rally and boldly declaring that he already sent in his vote to make him President of the United States. It was a surreal moment. The figurehead of the Republican Party dosed himself in gasoline, got up on a stage on a chilly afternoon in Wisconsin, and lit a match. . @SpeakerRyan says he voted for @realDonaldTrump : “Republicans, it is time to come home” ht

In [35]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [36]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [37]:
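# Build the stop-word set once; calling stopwords.words('english') for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase, strip punctuation, tokenize, and drop English stop words."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # keep only word characters and whitespace
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return " ".join(tokens)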
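
One side effect of the order of operations in `clean_text`: punctuation is stripped before stop words are filtered, so a contraction loses its apostrophe and then matches a stop-word entry only if that apostrophe-free spelling is itself in the list. The token `theres` in the retrieved document printed further down is consistent with this. The sketch below illustrates the effect with the standard library only; `TOY_STOP_WORDS`, `strip_punctuation`, and `toy_clean` are hypothetical stand-ins for the NLTK stop-word list and tokenizer, not the notebook's actual function.

```python
import re

# Hypothetical miniature stop-word list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "a", "is", "no", "don't", "there's"}

def strip_punctuation(text):
    """Lowercase and drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

def toy_clean(text, strip_first=True):
    """Whitespace-tokenize and filter stop words, stripping punctuation before or after filtering."""
    if strip_first:
        # Same order as clean_text: punctuation goes first, then stop words.
        tokens = strip_punctuation(text).split()
        tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    else:
        # Alternative order: filter stop words while apostrophes are still present.
        tokens = [t for t in text.lower().split() if t not in TOY_STOP_WORDS]
        tokens = strip_punctuation(" ".join(tokens)).split()
    return " ".join(tokens)

sample = "There's no denying the scandal"
print(toy_clean(sample, strip_first=True))   # theres denying scandal
print(toy_clean(sample, strip_first=False))  # denying scandal
```

Whether the leftover tokens matter depends on the task; TF-IDF tends to down-weight terms that appear in many documents, so the effect on retrieval may be small.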

In [38]:
df['cleaned_text'] = df['text'].apply(clean_text)
documents = df['cleaned_text'].tolist()

In [39]:
documents[0]



## Building the TF-IDF Matrix and Search

- Vectorization (`TfidfVectorizer`)
- Query processing
- Similarity computation
- Result ranking
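
The four steps listed above can be sketched in plain Python to show what happens numerically. This is an illustrative sketch, not scikit-learn's implementation: it assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and L2 normalization, which correspond to the documented `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). The helper names (`fit_idf`, `tfidf_vector`, `cosine`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF weights; terms unseen at fit time are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(u, v):
    """Dot product of two sparse vectors; equals cosine similarity when both have unit length."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical toy corpus, already cleaned and tokenized.
toy_docs = [
    "government corruption scandal grows".split(),
    "healthcare reform bill passes".split(),
    "economic recovery slows".split(),
]

# 1. Vectorization: learn IDF weights from the corpus and embed each document.
idf = fit_idf(toy_docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in toy_docs]

# 2. Query processing: embed the query with the same IDF weights.
query_vec = tfidf_vector("corruption scandal".split(), idf)

# 3. Similarity computation, then 4. ranking by descending score.
scores = [cosine(query_vec, vec) for vec in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking[0], round(scores[ranking[0]], 4))  # 0 0.7071
```

The query shares two of the four equally weighted terms of document 0, which gives a cosine of 2 × 0.5 × 0.7071 ≈ 0.7071; the other documents share no terms and score 0.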

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [46]:
# Vectorize the corpus
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

In [47]:
def search_documents(query):
    # Preprocess the query the same way as the corpus
    clean_query = clean_text(query)
    query_vec = vectorizer.transform([clean_query])

    # Cosine similarity between the query and every document
    similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()

    # Indices of the 3 most similar documents, best first
    top3_indices = np.argsort(similarities)[::-1][:3]

    return [(i, similarities[i]) for i in top3_indices]

In [69]:
# List of queries to run
queries = [
    "Government corruption scandal",
    "New healthcare reform",
    "Economic recovery after pandemic",
]

In [70]:
results = {}
for query in queries:
    results[query] = search_documents(query)

In [76]:
for query, docs in results.items():
    print(f"\nQuery: {query}")
    for idx, score in docs:
        print(f"Document index {idx} - Similarity: {score:.4f} - {df.iloc[idx]['label']}")
        print(documents[idx])



Query: Government corruption scandal
Document index 508 - Similarity: 0.2150 - FAKE
home economic american public longer deal limitless corruption government american public longer deal limitless corruption government 8 shares 102816 mary wilder federal government really dropping ball last decades time time prove completely untrustworthy care citizens united states best interests recent wikileaks emails proven beyond shadow doubt hold positions power within federal government owned corporations continue put financial gain individual freedom scary reality theres denying longer unfortunately americans deal decades nothing ever done american people felt though nothing could order stop however appears though lately gotten point unwilling put corruption government longer article published daily sheeple charles hugh smith writes ruling elite bamboozled conned misled bottom 95 decades phony facade political legitimacy rising tide raises boats cracked wide open machinery oppression looting 

## Classification

In [53]:

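# Note: this refits the vectorizer on the full corpus before the train/test split,
# so the vocabulary and IDF weights are computed with the test documents included.
X = vectorizer.fit_transform(df['cleaned_text'])
y = df['label']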

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

In [58]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
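
Since `X` was built by fitting the vectorizer on every document, the split above operates on features that already reflect the test set. One possible alternative, sketched here under assumptions, is to split the cleaned text first and let a scikit-learn pipeline fit the vectorizer on the training portion only. The toy texts, labels, and variable names below are hypothetical stand-ins for `df['cleaned_text']` and `df['label']`, and the scores reported in this notebook were not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['cleaned_text'] and df['label'].
toy_texts = [
    "shocking secret cure revealed", "aliens endorse candidate secret",
    "shocking hoax revealed online", "secret plot shocking claims",
    "senate passes budget bill", "court hears budget appeal",
    "senate committee hears testimony", "court ruling on bill",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

# Split the raw text first so the test documents never influence the TF-IDF fit.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TfidfVectorizer on the training texts only, then the classifier.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts_train, labels_train)

# At prediction time the already-fitted vocabulary and IDF weights are reused.
predictions = pipeline.predict(texts_test)
print(list(zip(texts_test, predictions)))
```

On a corpus of this size the difference in measured accuracy may be small, but the pipeline keeps the evaluation closer to how the model would see genuinely new articles.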

In [61]:
from sklearn.linear_model import LogisticRegression

In [62]:
# Logistic regression baseline. The 'multinomial' option often performs well on multiclass
# text problems; the labels here are binary (FAKE/REAL), so the defaults are kept.
model_log = LogisticRegression()
model_log.fit(X_train, y_train)

In [64]:

y_pred = model_log.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        FAKE       0.90      0.94      0.92       628
        REAL       0.94      0.89      0.92       639

    accuracy                           0.92      1267
   macro avg       0.92      0.92      0.92      1267
weighted avg       0.92      0.92      0.92      1267



In [59]:
# Let's change our model
model = SVC()
model.fit(X_train, y_train)

In [60]:
# Predictions and evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.936069455406472
              precision    recall  f1-score   support

        FAKE       0.91      0.96      0.94       628
        REAL       0.96      0.91      0.93       639

    accuracy                           0.94      1267
   macro avg       0.94      0.94      0.94      1267
weighted avg       0.94      0.94      0.94      1267

