# Fake News Detector
This is a deliberately naive baseline that scores barely above chance. You can definitely do better! Still, it should give you a first framework to build on. Here are some things to try:

## Feature Engineering:
Extract additional features from the text, such as:
- Number of words in the article.
- Number of sentences in the article.
- Average word length.
- Presence of specific keywords or phrases.
- Punctuation counts.
- Capitalization features (e.g., ratio of capitalized words).
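
As a rough illustration of the list above, here is a minimal sketch that computes a few of these features for a single article using only the standard library. The function name `extract_features`, the `KEYWORDS` tuple, and the regex-based word and sentence splitting are assumptions made for this example, not part of the baseline below.

```python
import re

# Hypothetical keyword list; pick terms that make sense for your own data.
KEYWORDS = ("shocking", "breaking", "exclusive")


def extract_features(text):
    """Return a dict of simple hand-crafted features for one article."""
    words = re.findall(r"[A-Za-z']+", text)
    # Rough sentence split: runs of . ! ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    n_words = len(words)
    lowered = text.lower()
    return {
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_exclaim": text.count("!"),
        "n_question": text.count("?"),
        # Share of words written entirely in capitals (single letters ignored).
        "caps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words if n_words else 0.0,
        "has_keyword": int(any(k in lowered for k in KEYWORDS)),
    }


features = extract_features("BREAKING news! Is this real? Probably not.")
print(features)
```

To build a feature table for the whole dataset, something like `pd.DataFrame(df_combined['article'].map(extract_features).tolist())` should work, assuming the column contains no missing values.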

## TF-IDF Vectorization:
Instead of a plain CountVectorizer, try TF-IDF vectorization, which down-weights terms that appear in most documents of the corpus and so gives more weight to distinctive ones.
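
A minimal sketch of the swap, using scikit-learn's `TfidfVectorizer` with default settings. The toy `train_docs` and `test_docs` lists are made up for illustration and stand in for `X_train` and `X_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this example only.
train_docs = [
    "the senator said the vote was fair",
    "aliens rigged the vote",
    "the vote was counted twice",
]
test_docs = ["the senator counted aliens"]

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_docs)  # learn vocabulary and IDF weights
X_test_tfidf = tfidf.transform(test_docs)        # reuse them on unseen text

# Terms found in every training document get the lowest IDF weight.
vote_idf = tfidf.idf_[tfidf.vocabulary_["vote"]]
aliens_idf = tfidf.idf_[tfidf.vocabulary_["aliens"]]
print(X_train_tfidf.shape, vote_idf < aliens_idf)
```

The resulting matrices are non-negative, so they can be passed to `MultinomialNB` in place of the raw counts.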
    
## Word Embeddings:
Use pre-trained word embeddings like Word2Vec, GloVe, or fastText to capture semantic relationships between words. This can provide a richer representation of your text compared to a simple bag-of-words model.
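
One simple way to turn word vectors into a document representation is to average the vectors of the words in each article. The sketch below uses a tiny made-up dictionary, `toy_vectors`, in place of real pre-trained embeddings; the `embed` helper and the whitespace tokenization are likewise assumptions for illustration.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for this example.
# In practice these would come from a pre-trained model such as GloVe.
toy_vectors = {
    "vote": np.array([1.0, 0.0, 0.0]),
    "senator": np.array([0.0, 1.0, 0.0]),
    "aliens": np.array([0.0, 0.0, 1.0]),
}
DIM = 3


def embed(text, vectors=toy_vectors, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    hits = [vectors[t] for t in text.lower().split() if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)


doc_vec = embed("Senator says aliens vote")
print(doc_vec)
```

Real embeddings usually contain negative values, which `MultinomialNB` does not accept, so a classifier such as logistic regression is probably a better fit for these features.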

## Deep Learning:
Implement a deep neural network, perhaps a recurrent neural network (RNN) or long short-term memory (LSTM) network, to capture sequential dependencies in the text.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Load the dataset
df = pd.read_csv("fake_news.csv")

In [None]:
# Create a dataframe with genuine news
df_genuine = df.copy()
df_genuine.drop('fake_news_article', axis=1, inplace=True)  # Drop the 'fake_news_article' column
df_genuine.rename(columns={'content': 'article'}, inplace=True)  # Rename 'content' to 'article'
df_genuine['is_fake'] = 0  # Set 'is_fake' to 0 for genuine news

In [None]:
# Create a dataframe with fake news
df_fake = df.copy()
df_fake.drop('content', axis=1, inplace=True)  # Drop the 'content' column
df_fake.rename(columns={'fake_news_article': 'article'}, inplace=True)  # Rename 'fake_news_article' to 'article'
df_fake['is_fake'] = 1  # Set 'is_fake' to 1 for fake news

In [None]:
# Concatenate the two dataframes
df_combined = pd.concat([df_genuine, df_fake], ignore_index=True)

In [None]:
# Define features (X) and target (y)
X = df_combined['article']
y = df_combined['is_fake']

In [None]:
# Split the data into training and testing sets; stratify=y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Create a CountVectorizer to convert a collection of text documents to a matrix of token counts
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

In [None]:
# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

In [None]:
# Make predictions on the test set
y_pred = clf.predict(X_test_counts)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

In [None]:
# Print the results
print(f"Accuracy: {accuracy}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(classification_rep)