In [2]:
import re
import pandas as pd
import requests
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
from cleantext import clean
from io import StringIO


Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.


In [3]:
# URL of the CSV file
url = 'https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv'

# Fetch the content from the URL. raise_for_status() raises an HTTPError on a
# non-2xx response, so a failed download stops here instead of leaving df undefined.
response = requests.get(url, timeout=30)
response.raise_for_status()

# Read the CSV text into a DataFrame
df = pd.read_csv(StringIO(response.text))

# Display the first few rows of the DataFrame
#print(df.head())


In [4]:
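def clean_text_with_library(raw_text):
    # Mark ISO dates (YYYY-MM-DD) first: clean() below replaces digits and strips
    # punctuation, so a date pattern can no longer match afterwards.
    # 'datetoken' is an arbitrary letters-only placeholder chosen to pass through clean();
    # str() guards against non-string cells such as NaN.
    marked_text = re.sub(r'\b\d{4}-\d{2}-\d{2}\b', ' datetoken ', str(raw_text))

    # Use clean-text library for text cleaning
    cleaned_text = clean(
        marked_text,
        fix_unicode=True,
        to_ascii=True,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_numbers=True,
        no_digits=True,
        no_punct=True,
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_number="<NUM>",
        replace_with_digit="<NUM>"
    )
    # Swap the placeholder for the final <DATE> token
    cleaned_text = cleaned_text.replace('datetoken', '<DATE>')

    return cleaned_text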

In [7]:
df['content'] = df['content'].apply(clean_text_with_library)
df.to_csv('Cleaned_data.csv', index=False)  
df = df.dropna(subset = ['type'])
df = df[df['type'] != 'unknown']


In [8]:
article_types = df['type'].unique()
article_types

array(['unreliable', 'fake', 'clickbait', 'conspiracy', 'reliable',
       'bias', 'hate', 'junksci', 'political'], dtype=object)

In [9]:

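# Categorize into 'fake' and 'reliable'.
# Note: 'political' is not in the list below, so it is grouped with 'reliable'.
fake_types = ['unreliable', 'fake', 'clickbait', 'conspiracy', 'bias', 'hate', 'junksci']
df['category'] = df['type'].apply(lambda x: 'fake' if x in fake_types else 'reliable')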

# Examine the percentage distribution
distribution = df['category'].value_counts(normalize=True) * 100

print("Percentage distribution of article categories:")
print(distribution)


Percentage distribution of article categories:
fake        88.793103
reliable    11.206897
Name: category, dtype: float64


### Is the dataset balanced?
The percentage distribution shows that 'fake' articles make up approximately 88.79% of the dataset, while 'reliable' articles make up about 11.21%. The dataset is therefore highly imbalanced: 'fake' articles outnumber 'reliable' ones by roughly eight to one.

Class balance matters for machine learning and other statistical analyses. A balanced dataset is one where each class or category has roughly the same number of instances. In our case, the dataset is heavily skewed towards 'fake' articles. Here are some of my general considerations:

### Importance of a Balanced Dataset:

##### Model Performance: 
Imbalanced datasets can lead to biased model performance. Machine learning models trained on imbalanced data may struggle to correctly predict the minority class (in this case, 'reliable' articles) because most of the training examples, and therefore most of the training signal, come from the majority class ('fake' articles).
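
One common way to counter this bias is to weight each class inversely to its frequency during training. The sketch below is illustrative only: the label counts (103 'fake', 13 'reliable') are an assumption inferred from the printed percentages, and the formula `n_samples / (n_classes * count)` is the inverse-frequency heuristic that scikit-learn uses for `class_weight='balanced'`.

```python
from collections import Counter

# Assumed label counts: 103 'fake' and 13 'reliable' reproduce the
# 88.79% / 11.21% split printed above; the real counts may differ.
labels = ['fake'] * 103 + ['reliable'] * 13

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: rare classes get proportionally larger weights,
# so each class contributes the same total weight to the training loss.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label:>8}: count={counts[label]:3d}  weight={weight:.3f}")
```

With these weights, the total weight of the 13 'reliable' articles equals the total weight of the 103 'fake' ones, so a classifier that accepts per-class weights is no longer rewarded for ignoring the minority class.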

##### Evaluation Metrics:
Common evaluation metrics, such as accuracy, may not provide an accurate assessment of model performance on imbalanced datasets. For example, a model that always predicts 'fake' would reach about 88.8% accuracy on this dataset without ever identifying a single 'reliable' article.
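
The point about accuracy can be checked with a small, self-contained sketch. It assumes label counts of 103 'fake' and 13 'reliable' (inferred from the printed percentages, not read from the data) and compares plain accuracy with per-class recall and balanced accuracy for a trivial classifier that always predicts the majority class. The helper names `accuracy` and `recall` are illustrative.

```python
from collections import Counter

# Assumed label counts: 103 'fake' and 13 'reliable' reproduce the
# 88.79% / 11.21% split printed above; the real counts may differ.
y_true = ['fake'] * 103 + ['reliable'] * 13

# A trivial "model" that always predicts the majority class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the true `label` instances that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


acc = accuracy(y_true, y_pred)
recall_fake = recall(y_true, y_pred, 'fake')
recall_reliable = recall(y_true, y_pred, 'reliable')

# Balanced accuracy is the mean of the per-class recalls.
balanced_acc = (recall_fake + recall_reliable) / 2

print(f"accuracy:          {acc:.3f}")
print(f"recall (fake):     {recall_fake:.3f}")
print(f"recall (reliable): {recall_reliable:.3f}")
print(f"balanced accuracy: {balanced_acc:.3f}")
```

Accuracy looks strong here while recall on 'reliable' is zero, which is why per-class recall, F1, or balanced accuracy are usually more informative on skewed data.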

##### Generalization:
Models trained on imbalanced data might not generalize well to real-world scenarios, where the distribution of classes may be different. A balanced dataset helps ensure that models learn patterns from both classes and can generalize better.
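
One simple way to obtain a balanced training set is random undersampling of the majority class. The following is a minimal sketch using only the standard library; the `undersample` helper, the row layout, and the counts (103 'fake', 13 'reliable', inferred from the percentages above) are assumptions for illustration, not part of the analysis above.

```python
import random
from collections import Counter


def undersample(rows, label_key, seed=0):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    # Draw the same number of rows from every class, without replacement.
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))

    rng.shuffle(balanced)
    return balanced


# Hypothetical rows standing in for the articles.
rows = (
    [{'id': i, 'category': 'fake'} for i in range(103)]
    + [{'id': 103 + i, 'category': 'reliable'} for i in range(13)]
)

balanced_rows = undersample(rows, 'category')
balanced_counts = Counter(row['category'] for row in balanced_rows)
print(balanced_counts)
```

Undersampling discards most of the majority-class articles, so on a sample this small it trades bias for variance; class weighting or collecting more 'reliable' articles may be preferable.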