# **Task 1: Exploratory Data Analysis (EDA) for Global News Dataset**

This notebook walks through the initial steps of exploratory data analysis (EDA) for the Global News Dataset. It covers the following:

1. **Loading the Data**
2. **Top and Bottom Websites by Article Count**
3. **Traffic Data Analysis**
4. **Countries with the Most News Media Organizations**
5. **Sentiment Analysis**
6. **Content Metadata Comparison**
7. **Impact on Global Ranking**


# **1. Loading the Data**

First, we'll import the libraries and load the three CSV files.


In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the environment (set_theme is the preferred name; sns.set is an alias for it)
sns.set_theme(style="whitegrid")
%matplotlib inline

# Load the datasets
news_data = pd.read_csv('path_to_your_data/data.csv')
domains_location = pd.read_csv('path_to_your_data/domains_location.csv')
traffic_data = pd.read_csv('path_to_your_data/traffic_data.csv')

# Display the first few rows of the datasets
print("News Data Sample")
print(news_data.head())

print("Domain Location Data Sample")
print(domains_location.head())

print("Traffic Data Sample")
print(traffic_data.head())
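
Before going further, it can help to check each frame's size and missing values, since later steps count and group on these columns. The following is a minimal sketch: `summarize_frame` is a hypothetical helper, and the toy frame only stands in for `news_data` (its values are made up).

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'n_unique': df.nunique(dropna=True),
    })

# Toy stand-in for news_data (hypothetical values)
toy_news = pd.DataFrame({
    'source_name': ['Site A', 'Site A', 'Site B', None],
    'title': ['First story', None, 'Third story', 'Fourth story'],
})

summary = summarize_frame(toy_news)
print(toy_news.shape)   # (rows, columns)
print(summary)
```

The same helper could be called on `news_data`, `domains_location`, and `traffic_data` once they are loaded.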


# **2. Top and Bottom Websites by Article Count**

Let's identify the top and bottom websites based on the number of articles they have published.


In [None]:
# Count articles per source once, then take the 10 most and 10 least frequent sources
article_counts = news_data['source_name'].value_counts()
top_websites_by_article_count = article_counts.head(10)
bottom_websites_by_article_count = article_counts.tail(10)

# Plotting the top websites by article count
plt.figure(figsize=(10,6))
sns.barplot(x=top_websites_by_article_count.values, y=top_websites_by_article_count.index)
plt.title('Top 10 Websites by Article Count')
plt.xlabel('Number of Articles')
plt.ylabel('Website')
plt.show()

# Plotting the bottom websites by article count
plt.figure(figsize=(10,6))
sns.barplot(x=bottom_websites_by_article_count.values, y=bottom_websites_by_article_count.index)
plt.title('Bottom 10 Websites by Article Count')
plt.xlabel('Number of Articles')
plt.ylabel('Website')
plt.show()
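
Raw counts alone do not show how concentrated the dataset is. As a rough sketch, the share of all articles contributed by the largest sources can be derived from the same `value_counts()` result. The source names below are made up and only stand in for `news_data['source_name']`.

```python
import pandas as pd

# Hypothetical source names standing in for news_data['source_name']
sources = pd.Series(['A'] * 5 + ['B'] * 3 + ['C'] * 1 + ['D'] * 1)

article_counts = sources.value_counts()        # sorted from most to least frequent
share = article_counts / article_counts.sum()  # fraction of all articles per source
cumulative_share = share.cumsum()              # running total down the ranking

top2_share = cumulative_share.iloc[1]          # share covered by the two largest sources
print(cumulative_share)
```

Note that when several sources share the same count, which of them land in a `head(10)` or `tail(10)` slice depends on the sort order of the ties.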


# **3. Traffic Data Analysis**

Next, we'll analyze the traffic data to see which websites hold the best (numerically lowest) global rank.


In [None]:
# Select the 10 best-ranked websites (ascending sort: a lower GlobalRank is a better rank)
top_websites_by_traffic = traffic_data.sort_values(by='GlobalRank').head(10)

# Plotting the top websites by global rank; shorter bars indicate better-ranked sites
plt.figure(figsize=(10,6))
sns.barplot(x=top_websites_by_traffic['GlobalRank'], y=top_websites_by_traffic['Domain'])
plt.title('Top 10 Websites by Global Rank')
plt.xlabel('Global Rank (lower is better)')
plt.ylabel('Website')
plt.show()


# **4. Countries with the Most News Media Organizations**

Let's explore which countries have the most media organizations represented in the dataset.


In [None]:
# Analyzing countries with the most news media organizations
domains_location_count = domains_location['Country'].value_counts().head(10)

# Plotting countries with the most news media organizations
plt.figure(figsize=(10,6))
sns.barplot(x=domains_location_count.values, y=domains_location_count.index)
plt.title('Top 10 Countries with Most News Media Organizations')
plt.xlabel('Number of Media Organizations')
plt.ylabel('Country')
plt.show()


# **5. Sentiment Analysis**

We'll now compare the sentiment of article titles (the `title_sentiment` column) across different websites.


In [None]:
# Sentiment analysis: share of each title sentiment label per website
# (normalize=True yields proportions within each website rather than raw counts)
sentiment_distribution = news_data.groupby('source_name')['title_sentiment'].value_counts(normalize=True).unstack(fill_value=0)

# Plotting sentiment distribution for the top websites by article count
sentiment_distribution_top = sentiment_distribution.loc[top_websites_by_article_count.index]

sentiment_distribution_top.plot(kind='bar', stacked=True, figsize=(12,8))
plt.title('Sentiment Distribution Across Top 10 Websites by Article Count')
plt.xlabel('Website')
plt.ylabel('Proportion of Sentiments')
plt.show()
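
To make the shape of the sentiment table concrete, here is a small sketch of the same `groupby` / `value_counts(normalize=True)` / `unstack` chain on made-up data. The sentiment labels used here are assumptions about what the column contains.

```python
import pandas as pd

# Made-up rows; the label values are assumed, not taken from the dataset
toy = pd.DataFrame({
    'source_name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'title_sentiment': ['Positive', 'Positive', 'Negative', 'Neutral', 'Negative', 'Negative'],
})

distribution = (
    toy.groupby('source_name')['title_sentiment']
       .value_counts(normalize=True)   # proportions within each source
       .unstack(fill_value=0)          # labels become columns; absent labels get 0
)
print(distribution)
```

Each row sums to 1, which is what lets the stacked bars be compared across websites regardless of how many articles each one published.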


# **6. Content Metadata Comparison**

We'll compare content length (in characters) and title length (in words) across different websites to understand the variation in reporting styles.


In [None]:
# Content metadata comparison: character length of each article's content
# (missing content is treated as an empty string so it counts as 0, not as len("nan") == 3)
news_data['content_length'] = news_data['content'].fillna('').astype(str).str.len()

# Plotting the distribution of content lengths across top websites
plt.figure(figsize=(12,6))
sns.boxplot(x='source_name', y='content_length', data=news_data[news_data['source_name'].isin(top_websites_by_article_count.index)])
plt.xticks(rotation=90)
plt.title('Content Length Distribution Across Top 10 Websites by Article Count')
plt.xlabel('Website')
plt.ylabel('Content Length (Number of Characters)')
plt.show()

# Content metadata comparison: number of whitespace-separated words in each title
# (missing titles count as 0 words instead of the single word "nan")
news_data['title_length'] = news_data['title'].fillna('').astype(str).str.split().str.len()

# Plotting the distribution of title lengths across top websites
plt.figure(figsize=(12,6))
sns.boxplot(x='source_name', y='title_length', data=news_data[news_data['source_name'].isin(top_websites_by_article_count.index)])
plt.xticks(rotation=90)
plt.title('Title Length Distribution Across Top 10 Websites by Article Count')
plt.xlabel('Website')
plt.ylabel('Title Length (Number of Words)')
plt.show()
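
Missing values need care when computing lengths: converting a missing value with `str()` yields a short literal string rather than an empty one, which inflates the counts slightly. A small sketch on made-up titles shows the difference between the two approaches.

```python
import pandas as pd

# Made-up titles: one normal, one missing, one empty
titles = pd.Series(['Markets rally on rate cut', None, ''])

# Naive: str() turns the missing value into a one-word string
naive_word_counts = titles.apply(lambda x: len(str(x).split()))

# Safer: replace missing values with '' first, then split on whitespace
safe_word_counts = titles.fillna('').str.split().str.len()

print(naive_word_counts.tolist())
print(safe_word_counts.tolist())
```

Whether a missing title should count as zero words or be excluded from the box plot altogether is a judgment call; dropping those rows with `dropna` is an equally reasonable choice.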


# **7. Impact on Global Ranking**

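Finally, we'll explore how the number of articles a website publishes and the average sentiment of its titles relate to its global ranking. The scatter plot shows association only; it does not establish that either factor causes a change in rank.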


In [None]:
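# Scatter plot: reporting frequency and sentiment versus each website's global ranking
# Merging news data with traffic data for analysis
# NOTE: this inner join assumes 'source_name' values match 'Domain' values exactly;
# rows without a match are dropped, so check len(merged_data) after merging
merged_data = pd.merge(news_data, traffic_data, left_on='source_name', right_on='Domain')

# 'title_sentiment' holds labels (see section 5), so map them to scores before averaging
# (assumed labels: 'Positive', 'Neutral', 'Negative'; any other value becomes NaN)
sentiment_scores = {'Positive': 1, 'Neutral': 0, 'Negative': -1}
merged_data['sentiment_score'] = merged_data['title_sentiment'].map(sentiment_scores)

# Aggregating by source_name to get the total number of reports and average sentiment
reporting_frequency = merged_data.groupby('source_name').agg({
    'article_id': 'count',
    'sentiment_score': 'mean',
    'GlobalRank': 'min'
}).reset_index()

# Plotting the scatter plot
plt.figure(figsize=(10,6))
sns.scatterplot(x='article_id', y='GlobalRank', hue='sentiment_score', size='article_id', data=reporting_frequency, sizes=(20, 200), palette='coolwarm')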
plt.title('News Reporting Frequency vs Global Ranking')
plt.xlabel('Number of Articles')
plt.ylabel('Global Ranking')
plt.show()
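
The merge keeps only rows whose key matches a `Domain` value exactly. If the source names turn out not to be domains, one option is to derive a domain from each article's URL before merging. The sketch below is illustrative only: the `url` column, the `extract_domain` helper, and all values are assumptions, not part of the dataset as described here.

```python
from urllib.parse import urlparse

import pandas as pd

def extract_domain(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.' (hypothetical helper)."""
    host = urlparse(url).hostname or ''
    return host[4:] if host.startswith('www.') else host

# Made-up data; 'url' is an assumed column name
toy_news = pd.DataFrame({
    'article_id': [1, 2, 3],
    'url': ['https://www.example.com/a', 'https://example.com/b', 'https://news.example.org/c'],
})
toy_traffic = pd.DataFrame({'Domain': ['example.com'], 'GlobalRank': [42]})

toy_news['domain'] = toy_news['url'].apply(extract_domain)

# A left join with indicator=True keeps unmatched articles and labels them 'left_only'
merged = toy_news.merge(toy_traffic, how='left', left_on='domain', right_on='Domain', indicator=True)
unmatched = int((merged['_merge'] == 'left_only').sum())
print(merged[['article_id', 'domain', 'GlobalRank', '_merge']])
print('Unmatched articles:', unmatched)
```

Counting the unmatched rows this way gives a quick measure of how much data an inner join would silently drop.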
