
In [21]:
import pandas as pd

# Task
Load the file "/content/fake_news_dataset.xlsx" into a DataFrame and prepare the data for an internal ML model.

In [22]:
# read_excel already returns a DataFrame, so no extra wrapping is needed
df = pd.read_excel("/content/fake_news_dataset.xlsx")

## Prepare data for model

### Subtask:
Preprocess the data to be compatible with the internal ML model. This might involve selecting relevant columns, encoding categorical variables, scaling numerical features, etc.


**Reasoning**:
Analyze missing values and select relevant columns. The null counts printed by the next cell show that 'source' and 'author' each have 1,000 missing values and 'tone' has 20,000, while 'title', 'text', 'date', 'category', and 'label' are complete. For a fake news detection model, 'title' and 'text' are likely the most relevant features, and 'label' is the target variable. The remaining columns are either less directly predictive or have substantial missing data, so they are left out. Rows missing 'title', 'text', or 'label' are dropped as a safeguard, since these fields are essential to the model; with zero nulls in those columns, this step removes nothing on the current data.



In [23]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

print(df.isnull().sum())

# Drop rows with missing values in crucial columns
df.dropna(subset=['title', 'text', 'label'], inplace=True)

# Select relevant columns
df_processed = df[['title', 'text', 'label']].copy()

title           0
text            0
date            0
source       1000
author       1000
category        0
label           0
tone        20000
dtype: int64
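
Raw null counts are easier to judge as fractions of the column length. The following is a minimal sketch on a small made-up frame, not the real dataset: the toy values and the 0.5 cutoff are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "source": ["x", None, "y", "z"],
    "tone": [None, None, None, None],
})

# Fraction of missing values per column (0.0 = complete, 1.0 = empty).
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds an assumed cutoff of 0.5.
mostly_empty = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_empty)
```

Applied to `df`, the same two lines would show whether a column such as 'tone' is empty in every row or only in part of the data.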


**Reasoning**:
The previous step handled missing values and selected relevant columns. Now, we need to preprocess the text data and encode the target variable for the ML model. Since the problem is likely a binary classification task (real or fake), label encoding is suitable for the 'label' column. We will combine 'title' and 'text' columns as the input feature, as both contain textual information relevant to the content.



In [24]:
# Combine title and text for the feature
df_processed['text'] = df_processed['title'] + ' ' + df_processed['text']
df_processed.drop('title', axis=1, inplace=True)

# Encode the target variable
label_encoder = LabelEncoder()
df_processed['label'] = label_encoder.fit_transform(df_processed['label'])

display(df_processed.head())

| | text | label |
|---|---|---|
| 0 | Foreign Democrat final. more tax development b... | 1 |
| 1 | To offer down resource great point. probably g... | 0 |
| 2 | Himself church myself carry. them identify for... | 0 |
| 3 | You unit its should. phone which item yard Rep... | 0 |
| 4 | Billion believe employee summer how. wonder my... | 0 |
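
The encoded column alone does not say which original label became 1. `LabelEncoder` stores the classes in sorted order in `classes_`, and each class's position is its integer code, so the mapping can be recovered. A small sketch follows; the label strings "real" and "fake" are assumptions, since the notebook does not show the original values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values; the real dataset's strings may differ.
labels = pd.Series(["real", "fake", "fake", "real"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and each class's position is its integer code.
mapping = {str(cls): idx for idx, cls in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes.
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

In the notebook, the same check would be `label_encoder.classes_` run after the encoding cell.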


In [29]:
from textblob import TextBlob

# Define a function to get sentiment
def get_sentiment(text):
    sentiment_score = TextBlob(text).sentiment.polarity
    if sentiment_score > 0:
        return 'positive'
    elif sentiment_score < 0:
        return 'negative'
    else:
        return 'neutral'

# Apply the function to the text column
df_processed['sentiment'] = df_processed['text'].apply(get_sentiment)

display(df_processed.head())


| | text | label | sentiment |
|---|---|---|---|
| 0 | Foreign Democrat final. more tax development b... | 1 | positive |
| 1 | To offer down resource great point. probably g... | 0 | positive |
| 2 | Himself church myself carry. them identify for... | 0 | positive |
| 3 | You unit its should. phone which item yard Rep... | 0 | positive |
| 4 | Billion believe employee summer how. wonder my... | 0 | negative |
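
One quick way to see whether the new sentiment column relates to the label at all is a contingency table. This is a minimal sketch on made-up rows that only mimic the shape of `df_processed`; the counts say nothing about the real data.

```python
import pandas as pd

# Toy rows standing in for df_processed (values are made up).
toy = pd.DataFrame({
    "label": [1, 0, 0, 0, 1],
    "sentiment": ["positive", "positive", "negative", "neutral", "positive"],
})

# Rows are labels, columns are sentiment classes, cells are row counts.
counts = pd.crosstab(toy["label"], toy["sentiment"])
print(counts)
```

Running `pd.crosstab(df_processed['label'], df_processed['sentiment'])` in the notebook would give the corresponding table for the full dataset.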


In [32]:
# Save the DataFrame with sentiment column to a new Excel file
output_path = "df_with_sentiment.xlsx"  # you can change the filename/path if needed
df_processed.to_excel(output_path, index=False)

print(f"✅ DataFrame saved successfully to '{output_path}'")

✅ DataFrame saved successfully to 'df_with_sentiment.xlsx'
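
The notebook imports `train_test_split` but stops before splitting the data. As a hedged sketch of what that step might look like, the example below performs a stratified split on a toy frame; the 80/20 ratio, the `random_state` value, and the toy rows are all assumptions, not choices made in the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same columns as df_processed (values are made up).
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

# Stratifying on the label keeps the class balance equal in both parts.
train_df, test_df = train_test_split(
    toy,
    test_size=0.2,          # assumed 80/20 split
    stratify=toy["label"],
    random_state=42,        # fixed seed so the split is reproducible
)

print(len(train_df), len(test_df))
```

Stratification matters most when the classes are imbalanced, because a plain random split can leave the smaller class underrepresented in the test set.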
