### Now try to explore the 995K FakeNewsCorpus subset

Download the 995K FakeNewsCorpus subset. Make at least three non-trivial observations/discoveries about the data. These observations could be related to outliers, artefacts, or even better: genuinely interesting patterns in the data that could potentially be used for fake-news detection. Examples of simple observations could be how many missing values there are in particular columns, or what the distribution over domains is. Be creative!

In [5]:
import pandas as pd

file_path = "995,000_rows.csv"
chunksize = 10000 

# Columns to analyze for missing metadata
metadata_cols = ['authors', 'meta_keywords', 'meta_description', 'tags', 'summary']

# Accumulators
total_rows = 0
missing_counts_acc = None
domain_counts_acc = {}
error_count_acc = 0
content_lengths = []

# Process the CSV in chunks
for chunk in pd.read_csv(file_path, chunksize=chunksize, low_memory=False):
    total_rows += len(chunk)
    
    # Observation 1: Missing values for metadata columns
    chunk_missing = chunk[metadata_cols].isnull().sum()
    if missing_counts_acc is None:
        missing_counts_acc = chunk_missing
    else:
        missing_counts_acc += chunk_missing

    # Observation 2: Domain distribution 
    chunk_domain_counts = chunk['domain'].value_counts()
    for domain, count in chunk_domain_counts.items():
        domain_counts_acc[domain] = domain_counts_acc.get(domain, 0) + count

    # Observation 3: Content Artifacts and Anomalies
    # Convert the 'content' column to string
    chunk['content'] = chunk['content'].astype(str)
    # Detect rows containing the whole word "error" (case-insensitive)
    error_mask = chunk['content'].str.contains(r"\berror\b", case=False, regex=True, na=False)
    error_count_acc += error_mask.sum()


# Results after processing
print("Total rows processed:", total_rows)

print("\nMissing values in metadata columns (count):")
print(missing_counts_acc)
print("\nMissing values in metadata columns (percentage):")
print((missing_counts_acc / total_rows * 100).round(2))

print("\nDomain distribution (top 10):")
domain_series = pd.Series(domain_counts_acc).sort_values(ascending=False)
print(domain_series.head(10))

print("\nTotal articles with explicitly 'error' in content:", error_count_acc)



Total rows processed: 995000

Missing values in metadata columns (count):
authors             442757
meta_keywords        38790
meta_description    525106
tags                764081
summary             995000
dtype: int64

Missing values in metadata columns (percentage):
authors              44.50
meta_keywords         3.90
meta_description     52.77
tags                 76.79
summary             100.00
dtype: float64

Domain distribution (top 10):
nytimes.com           176144
beforeitsnews.com      91468
dailykos.com           77640
express.co.uk          55983
nationalreview.com     37377
sputniknews.com        37229
abovetopsecret.com     27947
wikileaks.org          23699
www.newsmax.com        12688
www.ammoland.com       11129
dtype: int64

Total articles with explicitly 'error' in content: 78554


#### Describe how you ended up representing the FakeNewsCorpus dataset (for instance with a Pandas dataframe). Argue for why you chose this design

We represent the FakeNewsCorpus dataset as a Pandas DataFrame because the data already comes as a CSV file with columns such as id, domain, content, and a set of metadata fields. A DataFrame maps directly onto this tabular layout and provides built-in functions for cleaning, filtering, and analyzing the data, so we could compute statistics, check for missing values, and visualize distributions with very little code. Because the file has nearly a million rows and our computers struggle to hold all of it in memory at once, we did not load the whole table in one go. Instead we used the chunksize option of pd.read_csv, which lets us process the file 10,000 rows at a time and combine the per-chunk results afterwards. In short, we chose Pandas because it turns this messy CSV data into a table that is easier to understand and work with, and it has all the tools we needed.
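The chunk-and-combine idea described above can be sketched on a tiny in-memory CSV. The rows and the `csv_text` name are made up for illustration, and `collections.Counter` is one possible way to merge per-chunk counts, not necessarily the only sensible one.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical rows, not real data).
csv_text = """id,domain,authors
1,a.com,Alice
2,b.com,
3,a.com,Bob
4,a.com,
5,b.com,Carol
"""

total_rows = 0
missing_authors = 0
domain_counts = Counter()

# chunksize=2 forces several chunks so the merge step is actually exercised.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    missing_authors += int(chunk["authors"].isnull().sum())
    # Counter.update adds the per-chunk counts to the running totals.
    domain_counts.update(chunk["domain"].value_counts().to_dict())

print("rows:", total_rows, "missing authors:", missing_authors)
print(domain_counts.most_common())
```

Only small summaries (a few integers and a counter) survive each iteration, so memory use stays bounded by the chunk size rather than the file size.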


#### Did you discover any inherent problems with the data while working with it?

We initially attempted to load the entire dataset using pd.read_csv with low_memory=False to address the DtypeWarnings (columns 0 and 1 contain mixed data types), but this overwhelmed our system and we had to kill the process. Since our computer couldn't handle the full dataset at once, we switched to a chunk-based approach. The statistics in this report point to further problems in the data itself: the summary column is empty for every row, most rows lack tags, and many articles contain the word "error", which may indicate scraping failures.
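One way to sidestep mixed-type inference while still reading in chunks is to load the affected columns as text and convert them afterwards. The sketch below assumes a column named `id` that mixes numbers and text; the miniature CSV is hypothetical and only mimics the kind of problem the DtypeWarning reports.

```python
import io

import pandas as pd

# Hypothetical miniature CSV whose "id" column mixes numbers and text.
csv_text = "id,domain\n1,a.com\n2,b.com\nabc,c.com\n"

# dtype=str skips per-column type inference, so every value is kept as text.
reader = pd.read_csv(io.StringIO(csv_text), dtype=str, chunksize=2)
ids = pd.concat([chunk["id"] for chunk in reader])

# Convert afterwards; values that are not numeric become NaN instead of raising.
numeric_ids = pd.to_numeric(ids, errors="coerce")
n_bad = int(numeric_ids.isna().sum())
print("non-numeric ids:", n_bad)
```

Counting the coerced values gives a direct measure of how many rows are affected, which is itself a useful data-quality observation.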


#### Report key properties of the data set - for instance through statistics or visualization

Here we added some code to the chunk loop so that we could collect further statistics.

```python
# Inside the chunk loop: accumulate content lengths for later statistics
content_lengths.extend(chunk['content'].apply(len).tolist())

# After the loop: summarize the collected lengths
print("\nContent length statistics:")
print(pd.Series(content_lengths).describe())
```

In [7]:
import pandas as pd

file_path = "995,000_rows.csv"
chunksize = 10000  

# Columns to analyze for missing metadata
metadata_cols = ['authors', 'meta_keywords', 'meta_description', 'tags', 'summary']

# Accumulators
total_rows = 0
missing_counts_acc = None
domain_counts_acc = {}
error_count_acc = 0
content_lengths = []

# Process the CSV in chunks
for chunk in pd.read_csv(file_path, chunksize=chunksize, low_memory=False):
    total_rows += len(chunk)
    
    # Observation 1: Missing values for metadata columns 
    chunk_missing = chunk[metadata_cols].isnull().sum()
    if missing_counts_acc is None:
        missing_counts_acc = chunk_missing
    else:
        missing_counts_acc += chunk_missing

    # Observation 2: Domain distribution 
    chunk_domain_counts = chunk['domain'].value_counts()
    for domain, count in chunk_domain_counts.items():
        domain_counts_acc[domain] = domain_counts_acc.get(domain, 0) + count

    # Observation 3: Content Artifacts and Anomalies
    # Convert the 'content' column to string
    chunk['content'] = chunk['content'].astype(str)
    # Detect rows containing "error"
    error_mask = chunk['content'].str.contains(r"\berror\b", case=False, regex=True, na=False)
    error_count_acc += error_mask.sum()

    # Accumulate content lengths
    content_lengths.extend(chunk['content'].apply(len).tolist())

# Results after processing
print("Total rows processed:", total_rows)

print("\nMissing values in metadata columns (count):")
print(missing_counts_acc)
print("\nMissing values in metadata columns (percentage):")
print((missing_counts_acc / total_rows * 100).round(2))

print("\nDomain distribution (top 10):")
domain_series = pd.Series(domain_counts_acc).sort_values(ascending=False)
print(domain_series.head(10))

print("\nTotal articles with 'error' in content:", error_count_acc)

print("\nContent length statistics:")
print(pd.Series(content_lengths).describe())

Total rows processed: 995000

Missing values in metadata columns (count):
authors             442757
meta_keywords        38790
meta_description    525106
tags                764081
summary             995000
dtype: int64

Missing values in metadata columns (percentage):
authors              44.50
meta_keywords         3.90
meta_description     52.77
tags                 76.79
summary             100.00
dtype: float64

Domain distribution (top 10):
nytimes.com           176144
beforeitsnews.com      91468
dailykos.com           77640
express.co.uk          55983
nationalreview.com     37377
sputniknews.com        37229
abovetopsecret.com     27947
wikileaks.org          23699
www.newsmax.com        12688
www.ammoland.com       11129
dtype: int64

Total articles with 'error' in content: 78554

Content length statistics:
count    995000.000000
mean       2851.342791
std        4111.355649
min           3.000000
25%         655.000000
50%        1817.000000
75%        3690.000000
max      189025.000000

From these values we can see the following:

**Total rows processed**: 995000
    This means the dataset has 995,000 individual articles (rows).

Missing values in **metadata columns** (count and percentage):
    For each metadata column, the script counted how many articles were missing data.

**authors**: 442,757 articles (about 44.50%) are missing an author name.

**meta_keywords**: 38,790 articles (3.90%) are missing meta keywords. (These are words or phrases included in a webpage’s HTML header intended to represent the main topics covered in the content)

**meta_description**: 525,106 articles (52.77%) have no meta description. (This is a brief summary of the webpage’s content provided in the HTML header.)

**tags**: 764,081 articles (76.79%) are missing tags. (Tags are labels or categories assigned to content by the publisher or content management system.)

**summary**: 995,000 articles (100%) have no summary at all.

In plain terms, many articles lack additional descriptive information, which limits how useful these metadata columns can be for analysis. The summary column carries no information at all in this subset.

**Domain distribution** (top 10): This shows how many articles come from each website. For example:
    The New York Times (nytimes.com) appears 176,144 times.
    BeforeItsNews.com appears 91,468 times.
    And so on for the top 10 sites. The dataset is dominated by a few sources: nytimes.com alone accounts for roughly 17.7% of all rows.
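In the top-10 output, newsmax.com and ammoland.com appear with a `www.` prefix while the other domains do not. If the same site also occurs without the prefix elsewhere in the data (this is not verified here), its articles would be split across two keys. A minimal sketch of merging such variants follows; the `normalize_domain` helper and the extra `newsmax.com` entry are hypothetical.

```python
import pandas as pd

# Counts copied from the top-10 output, plus one hypothetical entry
# ("newsmax.com": 5) to show how two spellings of a domain would merge.
domain_counts = pd.Series({
    "nytimes.com": 176144,
    "www.newsmax.com": 12688,
    "www.ammoland.com": 11129,
    "newsmax.com": 5,  # hypothetical, not from the dataset
})


def normalize_domain(domain: str) -> str:
    """Lower-case the domain and drop a leading 'www.' if present."""
    domain = domain.lower()
    return domain[4:] if domain.startswith("www.") else domain


# groupby with a function applies it to each index label (the domain name).
merged = domain_counts.groupby(normalize_domain).sum().sort_values(ascending=False)
print(merged)
```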

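**Total articles** with 'error' in content: 78,554
    This means that 78,554 articles (about 7.9% of the dataset) contain the word "error" as a whole word, matched case-insensitively. Some of these may stem from scraping problems or technical errors on the source websites, such as a "Fatal error" message stored in place of the article text. However, the pattern also matches ordinary uses of the word, so the count is an upper bound on such artifacts.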
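To separate likely scraping failures from ordinary uses of the word, the broad pattern can be compared with a narrower one. The example texts and the phrases in the narrow pattern are assumptions chosen for illustration, not phrases confirmed to occur in the corpus.

```python
import pandas as pd

# Hypothetical article texts, not taken from the corpus.
content = pd.Series([
    "Fatal error: Call to undefined function on line 12",
    "The senator admitted the vote was an error of judgment.",
    "Markets rallied on Tuesday.",
    None,
])

# Same broad pattern as in the notebook: any whole-word "error".
broad = content.str.contains(r"\berror\b", case=False, regex=True, na=False)

# Narrower pattern (an assumption): phrases typical of failed page loads.
narrow = content.str.contains(
    r"\bfatal error\b|\berror 404\b", case=False, regex=True, na=False
)

broad_count = int(broad.sum())
narrow_count = int(narrow.sum())
print("broad matches:", broad_count, "narrow matches:", narrow_count)
```

The gap between the two counts gives a rough sense of how many matches are ordinary prose rather than technical artifacts.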

**Content length statistics**: These numbers summarize how long the article texts are (in characters):
    **mean**: On average, each article has about 2,851 characters.
    **std (standard deviation)**: Article lengths vary widely; the standard deviation (4,111 characters) is larger than the mean.
    **min**: The shortest article has 3 characters. This is possibly not a real article: the script converts the content column with astype(str), which can turn a missing value into the string "nan" (3 characters).
    **25%, 50%, 75%**: 25% of articles have 655 characters or fewer, the median (50%) is 1,817 characters, and 75% of articles have 3,690 characters or fewer.
    **max**: The longest article is 189,025 characters long.

This tells us that while most articles are a few thousand characters long, there are some extremely long outliers.
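Missing content can distort the lower end of these statistics, depending on how it is converted to text. The sketch below compares the notebook's `astype(str)` conversion with `fillna("")` on made-up data. How `astype(str)` treats missing values may differ between pandas versions, so no expected output is claimed for that line.

```python
import numpy as np
import pandas as pd

# Hypothetical content column with one missing value.
content = pd.Series(
    ["short text", np.nan, "x" * 50, "a somewhat longer article body"],
    dtype=object,
)

# As in the notebook: depending on the pandas version, the missing value
# may become the literal string "nan" and be counted as 3 characters.
lengths_astype = content.astype(str).str.len()

# Alternative: treat missing content as empty text, so its length is 0
# and it cannot be mistaken for a very short article.
lengths_fillna = content.fillna("").str.len()

print(lengths_astype.tolist())
print(lengths_fillna.tolist())  # [10, 0, 50, 30]
```

With the second variant, rows of length 0 can be counted or dropped explicitly before calling `describe()`.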