<a href="https://colab.research.google.com/github/Kavin-Ramesh/Draw/blob/main/Real_vs_Fake_News_NLPModel_HaydenSamala%26KavinRamesh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News vs Real News Analysis


---
### Key Notes

In addition to this Google Colab notebook, we maintain a fully set up GitHub repository, if you would like to check it out. Every one of our codebases is heavily commented to explain our thought process and what each line does.

---

GitHub repository URL:

https://github.com/hsamala688/Fake-News-Analysis



# New Section

This code snippet performs a preliminary analysis for fake news detection, focusing on stylistic and structural features of the text. Here's a breakdown of what each part does:

### 1. Library Imports

*   `import pandas as pd`: Imports the pandas library, essential for data manipulation and analysis using DataFrames.
*   `import numpy as np`: Imports the NumPy library, used for numerical operations, especially with arrays.
*   `import string`: Provides access to a collection of string constants, used here to identify punctuation characters.
*   `import matplotlib.pyplot as plt`: Imports Matplotlib's pyplot module, used for creating static, interactive, and animated visualizations in Python.
*   `from textblob import TextBlob`: Imports the `TextBlob` class, a library for processing textual data, specifically used here for sentiment analysis (subjectivity and polarity).
*   `from sklearn.model_selection import train_test_split`: Imports a utility for splitting datasets into training and testing sets, crucial for machine learning model evaluation.
*   `from sklearn.ensemble import RandomForestClassifier`: Imports the Random Forest classifier, an ensemble machine learning algorithm used for classification tasks.
*   `from sklearn.metrics import classification_report`: Imports a function to generate a text report showing the main classification metrics.

### 2. Data Loading and Labeling

*   `fake_news_df = pd.read_csv('Fake.csv')`: Loads the fake news dataset into a pandas DataFrame named `fake_news_df`.
*   `real_news_df = pd.read_csv('True.csv')`: Loads the real news dataset into a pandas DataFrame named `real_news_df`.
*   `fake_news_df['label'] = 0`: Adds a new column 'label' to the `fake_news_df` and assigns the value `0` (representing fake news) to all rows.
*   `real_news_df['label'] = 1`: Adds a new column 'label' to the `real_news_df` and assigns the value `1` (representing real news) to all rows.

### 3. Data Concatenation

*   `df = pd.concat([fake_news_df, real_news_df], ignore_index=True)`: Combines the `fake_news_df` and `real_news_df` into a single DataFrame `df`. `ignore_index=True` ensures the new DataFrame has a continuous index.
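
As a minimal sketch of the labeling and concatenation steps above, the snippet below substitutes two tiny in-memory DataFrames for `Fake.csv` and `True.csv`. The rows are hypothetical placeholders rather than real data; only the column names `text` and `label` follow the walkthrough.

```python
import pandas as pd

# Toy stand-ins for Fake.csv and True.csv (hypothetical rows, not real data).
fake_news_df = pd.DataFrame({"text": ["Shocking secret exposed!!!", "You won't believe this"]})
real_news_df = pd.DataFrame({"text": ["Senate passes budget bill.", "Central bank holds rates."]})

# Label each source before combining: 0 = fake, 1 = real.
fake_news_df["label"] = 0
real_news_df["label"] = 1

# Each input frame is indexed 0, 1; ignore_index=True discards those
# indices and renumbers the combined rows consecutively from 0.
df = pd.concat([fake_news_df, real_news_df], ignore_index=True)

print(df.index.tolist())
print(df["label"].value_counts().sort_index().to_dict())
```

Without `ignore_index=True`, the combined frame would keep duplicate index values from the two source files, which can make later label-based row lookups ambiguous.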

### 4. Feature Engineering

This section creates several new features from the 'text' column, aiming to capture stylistic and structural characteristics that might differentiate fake from real news:

*   `df['Word_Count'] = df['text'].apply(lambda x: len(str(x).split()))`: Calculates the number of words in each text entry and stores it in the 'Word_Count' column.
*   `df['Subjectivity'] = df['text'].apply(lambda x: TextBlob(str(x)).sentiment.subjectivity)`: Uses `TextBlob` to calculate the subjectivity score of each text. Subjectivity ranges from 0 (objective) to 1 (subjective).
*   `df['Polarity'] = df['text'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)`: Uses `TextBlob` to calculate the polarity score of each text. Polarity ranges from -1 (negative) to 1 (positive).
*   `def count_punctuation(text): ...`: Defines a helper function to count the number of punctuation marks in a given text.
*   `df['Punctuation_Count'] = df['text'].apply(count_punctuation)`: Applies the `count_punctuation` function to the 'text' column to get the total punctuation count for each entry.
*   `df['Punctuation_Density'] = np.divide(df['Punctuation_Count'], df['Word_Count']).fillna(0).replace([np.inf, -np.inf], 0)`: Calculates the ratio of punctuation marks to words. `fillna(0)` replaces the `NaN` produced when both counts are zero (0/0, for example an empty text), and `.replace()` converts any infinity produced by dividing a nonzero count by zero into 0.
*   `df['Avg_Word_Length'] = df['text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]) if len(str(x).split()) > 0 else 0)`: Calculates the average word length for each text. It includes a check to prevent errors for empty text entries.
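
The structural features above can be sketched on a few toy rows. The body of `count_punctuation` is elided in the walkthrough, so the version below is an assumption (it counts characters found in `string.punctuation`), and the sample texts are hypothetical. The TextBlob-based features are left out to keep the sketch dependency-light.

```python
import string

import numpy as np
import pandas as pd


def count_punctuation(text):
    """Assumed definition: count characters that appear in string.punctuation."""
    return sum(1 for ch in str(text) if ch in string.punctuation)


# Hypothetical sample texts, including an empty entry to exercise the edge case.
toy = pd.DataFrame({"text": [
    "Breaking!!! You won't believe this...",
    "The senate passed the bill.",
    "",
]})

toy["Word_Count"] = toy["text"].apply(lambda x: len(str(x).split()))
toy["Punctuation_Count"] = toy["text"].apply(count_punctuation)

# 0 / 0 yields NaN for the empty row; map NaN and any infinity to 0.
density = toy["Punctuation_Count"] / toy["Word_Count"]
toy["Punctuation_Density"] = density.replace([np.inf, -np.inf], np.nan).fillna(0)

# Average word length, guarding against entries with no words.
toy["Avg_Word_Length"] = toy["text"].apply(
    lambda x: float(np.mean([len(w) for w in str(x).split()])) if str(x).split() else 0.0
)

print(toy[["Word_Count", "Punctuation_Count", "Punctuation_Density", "Avg_Word_Length"]])
```

Because words are split on whitespace, trailing punctuation counts toward word length here (for example, `bill.` has length 5), which matches the `str(x).split()` approach described in the bullets.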

### 5. Data Preparation for Modeling

*   `y = df['label']`: Assigns the 'label' column (our target variable, indicating fake or real news) to `y`.
*   `X = df[['Word_Count', 'Subjectivity', 'Polarity', 'Punctuation_Density', 'Avg_Word_Length']]`: Selects the engineered features as our independent variables and assigns them to `X`.

### 6. Train-Test Split

*   `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`: Splits the dataset into training and testing sets. 80% of the data is used for training (`X_train`, `y_train`), and 20% for testing (`X_test`, `y_test`). `random_state=42` ensures reproducibility of the split.
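
To illustrate the 80/20 split and what `random_state` buys, here is a small sketch on synthetic data. The feature values and labels are randomly generated placeholders, not the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature table (100 hypothetical rows).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Word_Count": rng.integers(50, 1000, size=100),
    "Punctuation_Density": rng.random(100),
})
y = pd.Series(rng.integers(0, 2, size=100), name="label")

# Same call as in the walkthrough: 20% of rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state selects the same rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
print(X_test.index.equals(X_test_again.index))
```

The features and labels are shuffled together, so `X_train` and `y_train` always share the same row index.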

### 7. Model Training

*   `model = RandomForestClassifier(n_estimators=100, random_state=42)`: Initializes a Random Forest Classifier model. `n_estimators=100` means it will build 100 decision trees. `random_state=42` ensures reproducibility.
*   `model.fit(X_train, y_train)`: Trains the Random Forest model using the training features (`X_train`) and their corresponding labels (`y_train`).
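
The walkthrough imports `classification_report` but does not describe its use. The sketch below shows one plausible way to fit the model and evaluate it on the held-out set. It runs on synthetic data with hypothetical feature names (`feat_a`, `feat_b`, `feat_c`), so the printed metrics say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on feat_a (an illustrative assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["feat_a", "feat_b", "feat_c"])
y = (X["feat_a"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same hyperparameters as in the walkthrough.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on rows the model has not seen during training.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, target_names=["fake", "real"])
print(report)
```

The report lists precision, recall, and F1 per class, which is more informative than accuracy alone when the two classes are not equally represented.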

### 8. Feature Importance Calculation

*   `feature_importances = pd.Series(model.feature_importances_, index=X.columns)`: Extracts the importance of each feature from the trained Random Forest model and labels each score with its column name. In scikit-learn these are impurity-based importances (mean decrease in impurity), normalized so that they sum to 1.
*   `feature_importances = feature_importances.sort_values(ascending=False)`: Sorts the feature importances in descending order.
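
As a rough illustration of how these scores behave, the sketch below fits a forest on synthetic data in which only one column (hypothetically named `signal`) determines the label. The column names and data are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only "signal" carries information about the label.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_1", "noise_2"])
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Pair each score with its column name, then sort from most to least important.
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)

print(feature_importances)
print(feature_importances.sum())
```

Since the scores are relative shares of a fixed total, a high score means a feature was used more than the others in this model, not that it is a strong predictor in absolute terms.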

### 9. Visualization of Feature Importance

*   `plt.figure(figsize=(10, 8))`: Creates a new figure for the plot with a specified size.
*   `feature_importances.plot(kind='bar', color = 'skyblue')`: Generates a bar plot of the sorted feature importances.
*   `plt.title('Feature Importance in Fake News Detection')`: Sets the title of the plot.
*   `plt.ylabel('Importance Score')`: Sets the label for the y-axis.
*   `plt.xlabel('Stylistic/Structural Feature')`: Sets the label for the x-axis.
*   `plt.xticks(rotation = 0)`: Keeps the x-axis labels horizontal. pandas bar plots rotate them 90 degrees by default, so this call undoes that rotation for readability.
*   `plt.grid(axis='y')`: Adds a grid to the y-axis for better readability.
*   `plt.savefig('feature_importance.png')`: Saves the generated plot as a PNG image named 'feature_importance.png'.
*   `plt.show()`: Displays the plot.
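
The plotting calls above can be tried in isolation with the sketch below. The importance values are made-up placeholders for illustration only, not results from this notebook. The `Agg` backend and the temporary output path are assumptions so the snippet can also run outside a notebook.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scores for illustration only; NOT results from this notebook.
feature_importances = pd.Series({"feature_a": 0.5, "feature_b": 0.3, "feature_c": 0.2})

plt.figure(figsize=(10, 8))
feature_importances.plot(kind="bar", color="skyblue")
plt.title("Feature Importance in Fake News Detection")
plt.ylabel("Importance Score")
plt.xlabel("Stylistic/Structural Feature")
plt.xticks(rotation=0)  # keep the tick labels horizontal
plt.grid(axis="y")

# Read back the rotation that was applied to the first tick label.
label_rotation = plt.gca().get_xticklabels()[0].get_rotation()

# Save before closing; a temporary directory avoids overwriting existing files.
out_path = os.path.join(tempfile.mkdtemp(), "feature_importance.png")
plt.savefig(out_path)
plt.close()

print(out_path)
```

Calling `plt.savefig` before `plt.show()` matters in scripts, because some backends clear the figure once it has been shown.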

In [4]:
!rm -rf Fake-News-Analysis
!git clone https://github.com/hsamala688/Fake-News-Analysis.git

Cloning into 'Fake-News-Analysis'...
remote: Enumerating objects: 150, done.
remote: Counting objects: 100% (86/86), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 150 (delta 45), reused 46 (delta 17), pack-reused 64 (from 1)
Receiving objects: 100% (150/150), 52.80 MiB | 20.30 MiB/s, done.
Resolving deltas: 100% (68/68), done.
Updating files: 100% (16/16), done.
