<div style="background-color: #AC4d61; padding: 10px; text-align: center; border-radius: 15px; margin-bottom: 10px;">
    <h1 style="color: #ffffff; font-size: 2.3em; font-weight: bold;">Exploratory Data Analysis and Visualization of Celestial Bodies</h1>
</div>

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">Table of Contents</h2>
</div>

**********************************
**********************************
**1. Introduction**

**2. Objectives**

**3. Dataset Information**

**4. Data Collection**
- *4.1 Libraries Used* 
- *4.2 Data Acquisition*
- *4.3 Filtering Relevant URLs*
- *4.4 Data Validation*

**5. Data Structuring**
- *5.1 Text Data* 
- *5.2 Image Data*

**6. Data Loading and Cleaning**
- *6.1 Load the Data*
- *6.2 Merge DataFrames*
- *6.3 Data Cleaning*
- *6.4 Save the Merged Data*

**7. Exploratory Data Analysis (EDA)**
- *7.1 Overview of Textual Data*
- *7.2 Overview of Image Data*
- *7.3 Trends and Patterns*
- *7.4 Correlation Between Features*
  
**8. Model Building and Deployment**
- *8.1 Data Preparation*
- *8.2 Model Selection*
- *8.3 Saving the Model*
- *8.4 Deployment*

**9. Conclusion**
**********************************
**********************************

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">1. Introduction</h2>
</div>

***********
<p>
    The focus of this project is on celestial bodies such as <strong>planets</strong>, <strong>stars</strong>, and <strong>galaxies</strong>. 
    The objective is to apply advanced <strong>data analysis</strong> and <strong>data science</strong> techniques to extract meaningful insights. 
    This notebook walks through the process of <strong>collecting</strong>, <strong>structuring</strong>, <strong>exploring</strong>, and <strong>visualizing</strong> data about celestial objects.
</p>

<p>
    Celestial bodies, ranging from planets to stars and galaxies, provide profound insights into <strong>astrophysical phenomena</strong>. 
    They play a critical role in humanity’s ongoing quest for knowledge and understanding of the universe. 
    This project utilizes publicly available resources, such as <strong>Wikipedia</strong>, to explore celestial bodies through <strong>analytical methods</strong> and uncover significant patterns.
</p>

**************

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">2. Objectives</h2>
</div>

<ul>
    <li><strong>Data Collection</strong> from Wikipedia.</li>
    <li><strong>Data Structuring</strong> (text and images) for analysis.</li>
    <li><strong>Data Loading and Cleaning</strong>.</li>
    <li><strong>Exploratory Data Analysis (EDA)</strong>.</li>
    <li><strong>Model Building and Deployment</strong>.</li>
</ul>

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">3. Dataset Information</h2>
</div>

<ul>
    <li><strong>Dataset Name</strong>: Celestial_Bodies_Dataset</li>
    <li><strong>Source</strong>: Publicly available Wikipedia data.</li>
    <li><strong>Analysts</strong>: Benseddik Abir and Benzahi Wissal</li>
</ul>

<h3>Components:</h3>
<ol>
    <li><strong>URLs</strong>: Web links to celestial body pages.</li>
    <li><strong>Text Data</strong>: Descriptions and key features.</li>
    <li><strong>Image Data</strong>: Visual representations of celestial objects.</li>
</ol>

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">4. Data Collection</h2>
</div>

## <span style="color: #A05899; font-size:1em;">4.1 Libraries Used</span>

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import numpy as np
import time
import warnings
import re
import urllib.parse
from urllib.parse import urljoin
from PIL import Image
from io import BytesIO
import os


#-----------------------------------------------------
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
#-----------------------------------------------------
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
#----------------------------------------------------
import joblib

warnings.filterwarnings("ignore")


## <span style="color: #A05899; font-size:1em;">4.2 Data Acquisition</span>

**Methods:**
- Query the Wikipedia search API with `requests` to collect page titles and build article URLs (`BeautifulSoup` is used later, when page content is parsed).
- **Scope:** Fetch URLs related to celestial objects for analysis.
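
The collection loop can be illustrated offline. The sketch below is a simplified stand-in, not the notebook's actual request code: `collect_titles` and `fake_fetch` are hypothetical names, the canned pages replace the HTTP call, and the continuation token is reduced to a page index (the real API returns a `continue` object that should be merged into the request parameters as a whole).

```python
from typing import Callable, Optional


def collect_titles(fetch_page: Callable[[Optional[int]], dict], limit: int) -> list:
    """Gather unique titles from paged results until `limit` or the last page."""
    seen = {}        # a dict keeps insertion order and drops duplicates
    offset = None    # continuation token; None means "first page"
    while len(seen) < limit:
        data = fetch_page(offset)
        for hit in data["query"]["search"]:
            seen.setdefault(hit["title"], None)
            if len(seen) >= limit:
                break
        if "continue" not in data:
            break    # no further pages
        offset = data["continue"]["sroffset"]
    return list(seen)


# Stand-in for the HTTP call: three canned pages of two results each.
PAGES = [["Mars", "Venus"], ["Venus", "Sirius"], ["Andromeda Galaxy", "Ceres"]]


def fake_fetch(offset):
    index = 0 if offset is None else offset
    data = {"query": {"search": [{"title": t} for t in PAGES[index]]}}
    if index + 1 < len(PAGES):
        data["continue"] = {"sroffset": index + 1}
    return data


print(collect_titles(fake_fetch, limit=4))
```

The duplicate "Venus" on the second page is ignored, and the loop stops as soon as the limit is reached or the response carries no continuation token.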

In [None]:
# Initialize an empty set to store unique URLs
urls_data = set()

# Function to fetch related Wikipedia pages for a given topic
def fetch_related_pages(topic, limit=1000):
    # Wikipedia API endpoint for querying search results
    endpoint = "https://en.wikipedia.org/w/api.php"
    # Use a session for persistent HTTP connections
    session = requests.Session()

    # Parameters for the API request
    params = {
        "action": "query",
        "list": "search",
        "srsearch": topic,
        "srlimit": 100,  
        "format": "json",
        "continue": "",
    }

    # Continue fetching data until the limit is reached
    while len(urls_data) < limit:
        try:
            # Send a GET request to the API with specified parameters
            response = session.get(endpoint, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()
        except requests.exceptions.RequestException as e:
            # Handle connection errors and stop further requests
            print(f"Request failed: {e}")
            break

        # Stop if no results are returned
        if 'query' not in data:
            print("No more results or API limit reached.")
            break

        # Record the number of URLs collected before the current request
        previous_count = len(urls_data)
        
        # Extract URLs from the search results and add them to the set
        for page in data['query']['search']:
            url = f"https://en.wikipedia.org/wiki/{page['title'].replace(' ', '_')}"
            urls_data.add(url)
            print(f"Collected: {url}")
            # Stop fetching if the limit is reached
            if len(urls_data) >= limit:
                break

        # Stop if no new unique results were found in the current iteration
        if len(urls_data) == previous_count:
            print("No additional unique results found.")
            break

        # If there are more results to fetch, update the continuation token
        if "continue" in data:
            params.update(data["continue"])
        else:
            break

        # Introduce a delay between requests to avoid rate-limiting by the API
        time.sleep(1) 
# Topic to search for in the Wikipedia API
topic = "Celestial Bodies"

# Call the function to fetch related pages for the specified topic
fetch_related_pages(topic, limit=1000)
# Save the collected URLs into a DataFrame and export to a CSV file
df = pd.DataFrame(list(urls_data), columns=['URL'])
df.to_csv('celestial_body.csv', index=False, encoding='utf-8')
print(f'Total URLs collected: {len(df)}')

In [None]:
# Check the shape of the DataFrame to verify the number of rows and columns
df.shape

## <span style="color: #A05899; font-size:1em;">4.3 Filtering Relevant URLs</span>

In [None]:
import pandas as pd
import requests
import time

# Load the CSV file containing URLs
df = pd.read_csv('celestial_body.csv')

# Define relevant keywords for celestial bodies
keywords = [
    "Planet", "Star", "Moon", "Asteroid", "Comet", "Galaxy", "Nebula", "Black hole",
    "Exoplanet", "Solar system", "Dwarf planet", "Satellite", "Meteor", "Cosmos",
    "Orbit", "Universe", "Supernova", "Astronomy", "Astrophysics", "Space",
    "Milky Way", "Pulsar", "Quasar", "Kuiper Belt", "Oort Cloud", "White dwarf",
    "Red giant", "Event horizon", "Dark matter", "Dark energy", "Constellation",
    "Telescope", "Astronomical object", "Interstellar", "Celestial sphere",
    "Eclipse", "Cosmology", "Gravitational wave", "Big Bang", "Space exploration"
]

# Function to check whether a page mentions any relevant keyword
def is_celestial_related(url):
    try:
        # Fetch the raw HTML of the page
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        page_content = response.text.lower()

        # Check if any keyword appears anywhere in the page HTML (substring match)
        for keyword in keywords:
            if keyword.lower() in page_content:
                return True
        return False
    except requests.exceptions.RequestException as e:
        print(f"Request failed for {url}: {e}")
        return False

# Filter URLs that are related to celestial bodies
filtered_urls = []

for url in df['URL']:
    if is_celestial_related(url):
        filtered_urls.append(url)
        print(f"Related URL: {url}")

# Create a new DataFrame with the filtered URLs
filtered_df = pd.DataFrame(filtered_urls, columns=['URL'])

# Save the filtered URLs to a new CSV file
filtered_df.to_csv('filtered_celestial_bodies.csv', index=False, encoding='utf-8')

print(f"Total filtered URLs collected: {len(filtered_df)}")


In [None]:
filtered_df.shape

## <span style="color: #A05899; font-size:1em;">4.4 Data Validation</span>

In [None]:
# Load the filtered dataset
filter_df = pd.read_csv('filtered_celestial_bodies.csv')

In [None]:
filter_df

1. **Check for Duplicate Entries**:

In [None]:
# Check for duplicate entries
duplicate_count = filter_df.duplicated(subset=['URL']).sum()
print(f"Number of duplicate entries: {duplicate_count}")

2. **Verify URLs**:

In [None]:
# Check if there are any invalid or empty URLs
invalid_urls = filter_df[filter_df['URL'].isnull()]
print(f"Number of invalid URLs: {invalid_urls.shape[0]}")

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">5. Data Structuring</h2>
</div>

## <span style="color: #A05899; font-size:1em;">5.1  Text Data</span>

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.1 Extracting Titles from URLs:</h3>

**Steps**:
1. Extract the last part of each URL as the page title.
2. Replace underscores (`_`) with spaces for readability.
3. Create a new DataFrame with columns: **`Title`** and **`URL`**.
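
The steps above can be sketched on two hard-coded URLs. This is a variation under stated assumptions rather than the notebook's own code: `title_from_url` is a hypothetical helper, it keeps everything after `/wiki/` instead of only the last path segment, and it adds percent-decoding so that a slug such as `Halley%27s_Comet` becomes readable.

```python
from urllib.parse import urlparse, unquote


def title_from_url(url: str) -> str:
    """Return a readable title from a Wikipedia article URL (hypothetical helper)."""
    path = urlparse(url).path             # e.g. "/wiki/Milky_Way"
    slug = path.split("/wiki/", 1)[-1]    # keep everything after "/wiki/"
    return unquote(slug).replace("_", " ")


urls = [
    "https://en.wikipedia.org/wiki/Milky_Way",
    "https://en.wikipedia.org/wiki/Halley%27s_Comet",
]
for u in urls:
    print(title_from_url(u))
```

With pandas, the same helper could be passed to `apply` on the `URL` column.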

In [None]:
import pandas as pd

# filter_df was loaded from 'filtered_celestial_bodies.csv' in section 4.4

# Create a new column 'Title' by extracting the title from the URL
filter_df['Title'] = filter_df['URL'].apply(lambda x: x.split('/')[-1].replace('_', ' '))

# Create a new DataFrame with only 'Title' and 'URL' columns
# (.copy() so later column assignments do not act on a view of filter_df)
df_separat = filter_df[['Title', 'URL']].copy()

# Display the new DataFrame
print(df_separat.head())


In [None]:
df_separat

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.2 Extracting and Analyzing Content:</h3>

---
 
This step involves scraping textual content from URLs, extracting key features like word count, and preparing the data for analysis. The scraped text is also cleaned to ensure uniformity.

---
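
To make the idea concrete without a network request, the sketch below extracts paragraph text from an inline HTML string and counts its words. It is only an illustration: it uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs without extra dependencies, and `ParagraphText`, `extract_content`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Keep non-empty text only while inside a paragraph
        if self.in_paragraph and data.strip():
            self.chunks.append(data.strip())


def extract_content(html: str):
    """Return (paragraph text, word count) for an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    content = " ".join(parser.chunks)
    return content, len(content.split())


sample = (
    "<html><body><h1>Mars</h1>"
    "<p>Mars is the <b>fourth</b> planet.</p><p>  </p><p>It is red.</p>"
    "</body></html>"
)
content, word_count = extract_content(sample)
print(content)
print(word_count)
```

The heading text and the empty paragraph are skipped, so only the body sentences contribute to the word count.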

In [None]:
# Function to scrape content and compute features
def scrape_and_analyze(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # treat HTTP error responses as failures
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract content (all paragraphs)
        content = ' '.join([p.text.strip() for p in soup.find_all('p') if p.text.strip()])

        # Compute features (example: word count)
        word_count = len(content.split())
        features = f"Word count: {word_count}"

        return content, features
    except Exception as e:
        # Handle errors (e.g., timeout, invalid URL)
        return "Error fetching content", f"Error: {str(e)}"

# Apply the function to each URL in the DataFrame
df_separat['Content'], df_separat['Features'] = zip(*df_separat['URL'].apply(scrape_and_analyze))

# Save to CSV (optional)
df_separat.to_csv('extended_data.csv', index=False)

In [None]:
df_separat = pd.read_csv('extended_data.csv')
df_separat

In [None]:
print("Dataset Information:")
df_separat.info()

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.3 Data Cleaning and Classification:</h3>

In [None]:
import pandas as pd
import re

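# Handle missing values in the 'Content' column
# (assign the result back; in-place fillna on a selected column is unreliable in recent pandas)
df_separat['Content'] = df_separat['Content'].fillna('')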

# Clean the 'Content' column
def clean_text(text):
    if not isinstance(text, str):  # Ensure the text is a string
        return ''
    text = re.sub(r'<[^>]+>', '', text)  # Remove HTML tags
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Remove non-alphanumeric characters
    return text

df_separat['Cleaned_Content'] = df_separat['Content'].apply(clean_text)

# Define celestial body types and keywords
celestial_types = {
    'Planet': ['planet', 'jupiter', 'earth', 'mars', 'venus', 'saturn', 'uranus', 'neptune', 'dwarf planet'],
    'Star': ['star', 'sun', 'nova', 'supernova', 'neutron star', 'red giant', 'pulsar', 'white dwarf'],
    'Moon': ['moon', 'satellite', 'luna', 'natural satellite'],
    'Asteroid': ['asteroid', 'comet', 'meteoroid', 'meteor', 'meteorite'],
    'Galaxy': ['galaxy', 'milky way', 'andromeda', 'spiral galaxy', 'elliptical galaxy', 'irregular galaxy'],
    'Nebula': ['nebula', 'emission nebula', 'reflection nebula', 'planetary nebula', 'dark nebula'],
    'Black Hole': ['black hole', 'event horizon', 'singularity'],
    'Constellation': ['constellation', 'zodiac'],
    'Exoplanet': ['exoplanet', 'extrasolar planet'],
    'Cosmic Structure': ['dark matter', 'dark energy', 'cosmos', 'universe', 'kuiper belt', 'oort cloud'],
    'Spacecraft': ['telescope', 'spacecraft', 'probe', 'rover', 'satellite'],
    'Astronomy Tools': ['astronomy', 'astrophysics', 'space exploration', 'observatory'],
    'Eclipse': ['eclipse', 'solar eclipse', 'lunar eclipse'],
    'Cosmology': ['cosmology', 'big bang', 'gravitational wave'],
}

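# Classify celestial body types.
# Matching is by substring and the first type with any hit wins, so the order of
# celestial_types matters and short keywords (e.g. 'sun', 'mars') can match inside other words.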
def classify_celestial_body(title, content):
    title = title.lower()
    content = content.lower()
    
    for celestial_type, keywords in celestial_types.items():
        if any(keyword in title for keyword in keywords) or any(keyword in content for keyword in keywords):
            return celestial_type
    return 'Unknown'

df_separat['Type'] = df_separat.apply(lambda row: classify_celestial_body(row['Title'], row['Cleaned_Content']), axis=1)

# Remove rows with 'Unknown' Type (.copy() avoids SettingWithCopyWarning on later edits)
df_cleaned = df_separat[df_separat['Type'] != 'Unknown'].copy()

# Save the cleaned dataset to a CSV file
df_cleaned.to_csv('cleaned_celestial_bodies.csv', index=False)
# Drop the 'Features' and 'Content' columns
df_cleaned.drop(columns=['Features', 'Content'], inplace=True)


In [None]:
df_cleaned

In [None]:
# Inspect the Dataset
print("Dataset Information:")
df_cleaned.info()  # info() prints directly and returns None

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.4 Summary Statistics and Type Distribution:</h3>

In [None]:
# 1-Summary Statistics
print("\nSummary Statistics for Word Count:")

# Calculate word count for each cleaned content entry
df_cleaned['Word_Count'] = df_cleaned['Cleaned_Content'].apply(lambda x: len(x.split()))  
print(df_cleaned['Word_Count'].describe())

# Type distribution
print("\nType Distribution:")
type_counts = df_cleaned['Type'].value_counts()
print(type_counts)


<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.5 Encoding Celestial Body Types:</h3>

***************
This step encodes the categorical Type column into numerical values to facilitate further analysis, such as machine learning or statistical modeling.
****************
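
As a minimal illustration of what the encoding produces, the sketch below assigns integer codes to a small hand-written list of types in plain Python. It mirrors the way `LabelEncoder` numbers its classes (sorted labels, codes starting at 0); the sample list and the names `classes`, `code_of`, and `encoded` are assumptions made for this example.

```python
types = ["Planet", "Star", "Galaxy", "Planet", "Black Hole", "Star"]

# Sorted unique labels receive consecutive integer codes starting at 0
classes = sorted(set(types))
code_of = {label: code for code, label in enumerate(classes)}
encoded = [code_of[t] for t in types]

print(code_of)
print(encoded)
```

The codes are identifiers only: a larger number does not mean a "larger" type, which matters when the encoded column is later passed to a model.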

In [None]:
encoder = LabelEncoder()
df_cleaned['Type_Encoded'] = encoder.fit_transform(df_cleaned['Type'])

In [None]:
df_cleaned

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.6 Data Cleaning and Structuring:</h3>

In [None]:
# Handle missing values in the 'Cleaned_Content' column
df_cleaned['Cleaned_Content'] = df_cleaned['Cleaned_Content'].fillna('')

# Clean the 'Cleaned_Content' column
def clean_text(text):
    if not isinstance(text, str):  # Ensure the text is a string
        return ''
    text = re.sub(r'<[^>]+>', '', text)  # Remove HTML tags
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # Remove non-alphanumeric characters
    return text

df_cleaned['Cleaned_Content'] = df_cleaned['Cleaned_Content'].apply(clean_text)

# Define celestial body types and keywords
celestial_types = {
    'Planet': ['planet', 'jupiter', 'earth', 'mars', 'venus', 'saturn', 'uranus', 'neptune', 'dwarf planet'],
    'Star': ['star', 'sun', 'nova', 'supernova', 'neutron star', 'red giant', 'pulsar', 'white dwarf'],
    'Moon': ['moon', 'satellite', 'luna', 'natural satellite'],
    'Asteroid': ['asteroid', 'comet', 'meteoroid', 'meteor', 'meteorite'],
    'Galaxy': ['galaxy', 'milky way', 'andromeda', 'spiral galaxy', 'elliptical galaxy', 'irregular galaxy'],
    'Nebula': ['nebula', 'emission nebula', 'reflection nebula', 'planetary nebula', 'dark nebula'],
    'Black Hole': ['black hole', 'event horizon', 'singularity'],
    'Constellation': ['constellation', 'zodiac'],
    'Exoplanet': ['exoplanet', 'extrasolar planet'],
    'Cosmic Structure': ['dark matter', 'dark energy', 'cosmos', 'universe', 'kuiper belt', 'oort cloud'],
    'Spacecraft': ['telescope', 'spacecraft', 'probe', 'rover', 'satellite'],
    'Astronomy Tools': ['astronomy', 'astrophysics', 'space exploration', 'observatory'],
    'Eclipse': ['eclipse', 'solar eclipse', 'lunar eclipse'],
    'Cosmology': ['cosmology', 'big bang', 'gravitational wave'],
}

# Classify celestial body types
def classify_celestial_body(title, content):
    title = title.lower()
    content = content.lower()
    
    for celestial_type, keywords in celestial_types.items():
        if any(keyword in title for keyword in keywords) or any(keyword in content for keyword in keywords):
            return celestial_type
    return 'Unknown'

df_cleaned['Type'] = df_cleaned.apply(lambda row: classify_celestial_body(row['Title'], row['Cleaned_Content']), axis=1)

# Remove rows with 'Unknown' Type (.copy() so new columns can be added safely later)
df_data = df_cleaned[df_cleaned['Type'] != 'Unknown'].copy()

# Save the cleaned dataset to a CSV file
df_data.to_csv('data_celestial_bodies.csv', index=False)

In [None]:
df_data

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">5.1.7 URL Validation:</h3>

*********
In this step, we validate the URLs in the dataset to ensure that they are properly formatted, meaning that both a scheme and a domain are present. The check does not test whether a page is reachable; it only eliminates malformed URLs before further analysis.
*********
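
The format check can be tried on a few hand-written values. The sketch below is a stricter variant offered as an assumption, not the notebook's function: `is_http_url` is a hypothetical name, and it additionally rejects non-string values and schemes other than `http` and `https`.

```python
from urllib.parse import urlparse


def is_http_url(value) -> bool:
    """True for strings with an http(s) scheme and a host (format check only)."""
    if not isinstance(value, str):
        return False                 # e.g. NaN from a missing CSV field
    try:
        parts = urlparse(value.strip())
    except ValueError:
        return False                 # urlparse rejects some malformed hosts
    return parts.scheme in ("http", "https") and bool(parts.netloc)


samples = [
    "https://en.wikipedia.org/wiki/Mars",
    "en.wikipedia.org/wiki/Mars",    # no scheme
    "ftp://example.org/file",        # wrong scheme
    float("nan"),                    # missing value
]
print([is_http_url(s) for s in samples])
```

Only the first sample passes. Confirming that a page actually exists would still require a request, for example checking the HTTP status code.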

In [None]:
# Function to check if a URL is valid using urllib
def is_valid_url(url):
    try:
        result = urllib.parse.urlparse(url)
        # Check if the URL has a valid scheme (http, https,...etc.) and netloc (domain)
        return all([result.scheme, result.netloc])
    except Exception:
        return False

# Apply the URL validation function to the 'URL' column
df_data['Is_Valid_URL'] = df_data['URL'].apply(is_valid_url)

# Filter out invalid URLs
df_data = df_data[df_data['Is_Valid_URL']]
df_data.to_csv('data_celestial_bodies.csv', index=False)

In [None]:
df_data

## <span style="color: #A05899; font-size:1em;">5.2 Image Data</span>

*************
This section covers extracting and saving images from Wikipedia pages about celestial bodies. The goal is to scrape one valid image (JPEG or PNG) for each URL and save it in a folder for future use.
*************
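
The filtering part of this step can be shown without downloading anything. In the sketch below, `candidate_images` is a hypothetical helper and the `src` values are made-up examples of what a page might contain; unlike the notebook code, the extension test here is case-insensitive.

```python
from urllib.parse import urljoin

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def candidate_images(page_url: str, srcs: list) -> list:
    """Keep JPEG/PNG sources, skip favicons, and make every URL absolute."""
    keep = []
    for src in srcs:
        lowered = src.lower()
        if "favicon" in lowered or not lowered.endswith(IMAGE_EXTENSIONS):
            continue
        keep.append(urljoin(page_url, src))  # resolves relative and "//host/..." forms
    return keep


page = "https://en.wikipedia.org/wiki/Mars"
srcs = [
    "/static/images/icons/wikipedia.png",       # site-relative path
    "/static/favicon/wikipedia.ico",            # favicon, skipped
    "//upload.wikimedia.org/example/Mars.JPG",  # protocol-relative URL
    "/static/images/footer/logo.svg",           # unsupported extension
]
print(candidate_images(page, srcs))
```

`urljoin` fills in the scheme and host from the page URL, which is why protocol-relative sources end up as complete `https://` links.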

In [None]:
# Directory to save images (created if it does not exist yet)
image_dir = 'images'
os.makedirs(image_dir, exist_ok=True)

# Function to extract the first valid image URL from a Wikipedia article
def extract_first_main_image_from_url(url):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all <img> tags with a valid src attribute
            img_tags = soup.find_all('img', {'src': True})

            valid_images = []
            for img_tag in img_tags:
                img_url = img_tag['src']
                
                # Exclude favicons and keep only jpg, jpeg, or png images
                if 'favicon' not in img_url and img_url.endswith(('.jpg', '.jpeg', '.png')):
                    img_url = urljoin(url, img_url)  # Ensure it's an absolute URL
                    valid_images.append(img_url)

            # Return the fourth matching image (index 3), which skips the leading
            # images on the page; pages with fewer than four matches yield None
            return valid_images[3] if len(valid_images) > 3 else None
        else:
            print(f"Error fetching page {url} - Status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"Error processing URL {url}: {e}")
        return None

# Function to download and save the first valid image as JPEG
def download_image_as_jpeg(image_url, save_path):
    try:
        response = requests.get(image_url, timeout=10)
        if response.status_code == 200:
            img = Image.open(BytesIO(response.content))
            img.convert('RGB').save(save_path, 'JPEG')  # Save image as JPEG
            return save_path
        else:
            print(f"Failed to download image from {image_url}")
            return None
    except Exception as e:
        print(f"Error downloading image from {image_url}: {e}")
        return None

# Function to process all URLs in the DataFrame and return a new DataFrame with image paths
def process_all_urls(df):
    image_data = []  # A list to store image paths and corresponding titles
    for index, row in df.iterrows():
        # Extract the first valid image URL
        image_url = extract_first_main_image_from_url(row['URL'])
        
        if image_url:
            # Create a filename based on the title
            img_filename = os.path.join(image_dir, f"{row['Title'].replace(' ', '_')}.jpg")
            
            # Download and save the image
            downloaded_image = download_image_as_jpeg(image_url, img_filename)
            
            if downloaded_image:
                image_data.append({'Title': row['Title'], 'Image_URL': image_url, 'Image_Saved': downloaded_image})  # Save the data in a dictionary
            else:
                image_data.append({'Title': row['Title'], 'Image_URL': image_url, 'Image_Saved': None})
        else:
            image_data.append({'Title': row['Title'], 'Image_URL': None, 'Image_Saved': None})
    
    # Create a new DataFrame to store the image paths and corresponding data
    image_df = pd.DataFrame(image_data)

    return image_df


# Process all URLs and download images into a new DataFrame
df_image = process_all_urls(df_data)

# Save the new DataFrame with image data to a CSV file
df_image.to_csv('Image_celestial_bodies.csv', index=False)

In [None]:
df_image

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">6. Data Loading and Cleaning</h2>
</div>

*************
**Objective:**
In this step, we merge the dataset we collected and cleaned with a second dataset (`data_celestial_bodies2.csv`).
*************

## <span style="color: #A05899; font-size:1em;">6.1 Load the Data</span>

In [None]:
# Load the cleaned dataset produced above
df1 = pd.read_csv('data_celestial_bodies.csv')
# Load the other dataset
df2 = pd.read_csv('data_celestial_bodies2.csv')

In [None]:
print("Dataset 1 (df1):")
print(df1.head())

In [None]:
print("Dataset 2 (df2):")
print(df2.head())

## <span style="color: #A05899; font-size:1em;">6.2 Merge DataFrames</span>

In [None]:
# Merge the two datasets
df_merged = pd.concat([df1, df2], ignore_index=True)

In [None]:
# Display the merged dataset to check the results
df_merged

## <span style="color: #A05899; font-size:1em;">6.3  Data Cleaning</span>

#### ***Check the merged dataset***

In [None]:
print("Dataset Overview:")
df_merged.info()  # info() prints directly and returns None

In [None]:
# Check for missing values
print("\nMissing Values by Column:")
print(df_merged.isnull().sum())

In [None]:
print("\nColumn Names in Dataset:")
print(df_merged.columns.tolist())

In [None]:
print(df_merged[df_merged['Is_Valid_URL'] == False])

In [None]:
duplicate = df_merged[df_merged['Title'].duplicated(keep=False)]
print(f"Number of duplicate entries: {duplicate.shape[0]}")

In [None]:
# Group duplicate entries by title and display them
group_duplicate = duplicate.groupby('Title')

for title, group in group_duplicate:
    print(f"\nTitle: {title}")
    print(group)

In [None]:
df_merged = df_merged.drop_duplicates(subset='Title', keep='first')

In [None]:
print("Duplicates remaining in the Title column:")
print(df_merged['Title'].duplicated().sum())

## <span style="color: #A05899; font-size:1em;">6.4  Save the Merged Data</span>

In [None]:
# Save the merged DataFrame to a new CSV file
df_merged.to_csv('merged_celestial_bodies.csv', index=False)

In [None]:
df_merged

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">7. Exploratory Data Analysis (EDA)</h2>
</div>

***
-----

In [None]:
df_merged = pd.read_csv('merged_celestial_bodies.csv')

In [None]:
# Shape of the dataset
print(f"Dataset contains {df_merged.shape[0]} rows and {df_merged.shape[1]} columns.")

In [None]:
# Summary statistics
print(df_merged.describe(include='all'))

In [None]:
# Column data types
print(df_merged.dtypes)

*******
----

## <span style="color: #A05899; font-size:1em;">7.1 Overview of Textual Data</span>

**************
In this step, we will explore the textual data in the dataset, focusing on the Cleaned_Content column. The goal is to find useful insights and patterns that may lead to further analysis or modeling. Textual data analysis involves assessing word frequency, finding trends, distributions, and relationships.
**************
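
Word-frequency analysis can be illustrated on a single sentence. The sketch below is a small stand-in for the vectorizer-based approach, assuming a tiny hand-written stop-word set; `STOP_WORDS`, `top_words`, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "it", "to"}


def top_words(text: str, n: int = 3) -> list:
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


sample = "The planet Mars is a planet of the Solar System and Mars is red"
print(top_words(sample, n=2))
```

Lower-casing before counting merges "Mars" and "mars", and removing stop words keeps frequent but uninformative words out of the ranking.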

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.1.1 Word Count Distribution:</h3>

In [None]:
# Summary statistics of word counts
print("\nSummary Statistics for Word Count:")
print(df_merged['Word_Count'].describe())

In [None]:
# Visualize the distribution of word counts using a histogram
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.hist(df_merged['Word_Count'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of Word Counts in Cleaned Content')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.show()

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.1.2 Most Common Keywords by Celestial Body Type:</h3>

In [None]:
# Create a count vectorizer to get word frequencies for each celestial body type
vectorizer = CountVectorizer(stop_words='english', max_features=10)

# Group by 'Type' and apply vectorizer to get word frequencies for each type
type_keywords = df_merged.groupby('Type')['Cleaned_Content'].apply(lambda x: ' '.join(x)).reset_index()

# Generate word frequency count for each type
for idx, row in type_keywords.iterrows():
    word_counts = vectorizer.fit_transform([row['Cleaned_Content']])
    words = vectorizer.get_feature_names_out()
    counts = word_counts.toarray().sum(axis=0)
    
    # Create a DataFrame with word frequencies
    word_freq = pd.DataFrame(list(zip(words, counts)), columns=['Word', 'Frequency'])
    
    # Plot bar chart for the most common words in each type
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Frequency', y='Word', data=word_freq.sort_values(by='Frequency', ascending=False))
    plt.title(f'Most Common Keywords in {row["Type"]}')
    plt.xlabel('Frequency')
    plt.ylabel('Word')
    plt.show()


<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.1.3 Check for Irrelevant Data:</h3>

In [None]:
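# Use the 25th percentile of word count as the cutoff for "very short" content
# (by construction, this flags roughly a quarter of the rows)
word_threshold = df_merged['Word_Count'].quantile(0.25)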

# Now filter rows with very short content
short_content = df_merged[df_merged['Word_Count'] < word_threshold]
print(f"Rows with very short content: {short_content.shape[0]}")
print("\n", short_content)

In [None]:
# Check for generic or irrelevant Titles
irrelevant_titles = ['Home', 'Wikipedia', 'Main Page']  
irrelevant_title_rows = df_merged[df_merged['Title'].isin(irrelevant_titles)]
print(f"Rows with generic or irrelevant titles: {irrelevant_title_rows.shape[0]}")

In [None]:
# Check for rows with unknown or unexpected types
unexpected_types = ['Unknown', 'Other']  # Add types that don't fit your analysis
unexpected_type_rows = df_merged[df_merged['Type'].isin(unexpected_types)]
print(f"Rows with unexpected types: {unexpected_type_rows.shape[0]}")

In [None]:
# Drop rows with word count less than the threshold
df_merged = df_merged[df_merged['Word_Count'] >= word_threshold]

#shape of the cleaned DataFrame
print(f"Shape of cleaned dataset: {df_merged.shape}")

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.1.4 Check for The 'Type' and 'Type_Encoded':</h3>

In [None]:
# Unique values in 'Type'
print("Unique values in 'Type' column:", df_merged['Type'].unique())

In [None]:
# Data for celestial body types
type_counts = df_merged['Type'].value_counts()

# Define the labels and values
labels = type_counts.index
sizes = type_counts.values
colors = sns.color_palette('Set3', len(labels)).as_hex() 

# Plotting the pie chart with enhancements
fig, ax = plt.subplots(figsize=(8, 8))

# Create the pie chart
patches, texts, pcts = ax.pie(sizes, labels=labels, autopct='%.1f%%', colors=colors, shadow=True, startangle=90, wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'})

# Add a legend
ax.legend(patches, labels, loc="best", fontsize=7, frameon=False)

# Make sure the pie chart is a perfect circle
ax.axis('equal')

# Set the title and adjust the layout
ax.set_title('Distribution of Celestial Body Types', fontsize=18, weight='bold')

# Adjust the text properties
plt.setp(texts, fontweight='bold', fontsize=13, color='white')

# Display the plot
plt.tight_layout()
plt.show()


In [None]:
# Distribution of 'Type_Encoded'
encoded_counts = df_merged['Type_Encoded'].value_counts()
print("Counts of each 'Type_Encoded':")
print(encoded_counts)
print("******************************************************************\n")
# Get unique mappings between 'Type' and 'Type_Encoded'
type_mapping = df_merged[['Type', 'Type_Encoded']].drop_duplicates()
print("Mapping between 'Type' and 'Type_Encoded':")
print(type_mapping)

***************
The Type_Encoded values show that planet (6) is the most common encoded value, while spacecraft (7) and black hole (1) have relatively fewer entries.
**************

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x=type_counts.index, y=type_counts.values, palette="viridis")
plt.title("Counts of Each Type", fontsize=16, fontweight="bold")
plt.xlabel("Type", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## <span style="color: #A05899; font-size:1em;">7.2 Overview of Image Data</span>

*************
In this section, we explore the image data related to celestial bodies. The aim is to review how images are distributed across the dataset, check that they were fetched correctly, and identify any missing or inconsistent image records.

The following items will be presented:
1. **Image Availability**
2. **Visualization of Image Distribution**
3. **Image Quality and Format**
************
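
Before opening any files, it can help to check which saved paths actually exist on disk. The following is a minimal sketch, assuming that `Image_Saved` holds local file paths. The temporary directory, the file names, and the `Image_Available` column are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create one placeholder file so that exactly one path exists.
    existing = Path(tmp) / "mars.jpg"
    existing.write_bytes(b"placeholder")  # not a real image, just a file on disk

    # Hypothetical stand-in for the image table.
    toy_images = pd.DataFrame({
        "Title": ["Mars", "Vega", "Europa"],
        "Image_Saved": [str(existing), str(Path(tmp) / "vega.jpg"), None],
    })

    # A path counts as available only if it is a string pointing to a file.
    toy_images["Image_Available"] = toy_images["Image_Saved"].apply(
        lambda p: isinstance(p, str) and Path(p).is_file()
    )

availability = toy_images["Image_Available"].value_counts()
print(availability)
```

Rows flagged as unavailable can then be inspected separately from rows whose files exist but fail to open as images.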

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.2.1 Image Availability:</h3>

In [None]:
# Load the cleaned image dataset
df_image = pd.read_csv('Image_celestial_bodies.csv')

# Load the other image dataset (Wikipedia photos)
df_image2 = pd.read_csv('wikipedia_photos.csv')

In [None]:
df_image

In [None]:
df_image2

In [None]:
# Check for missing image paths
missing_images = df_image[df_image['Image_Saved'].isnull()]
missing_images2 = df_image2[df_image2['photo_description'].isnull()]

# Display the count of missing images
print(f"Number of missing images: {missing_images.shape[0]}")
print(f"Number of missing images2: {missing_images2.shape[0]}")

In [None]:
# Drop rows with duplicate photo URLs, keeping the first occurrence
df_image2 = df_image2.drop_duplicates(subset='photo_url')
df_image2

In [None]:
# Rename columns in df_image2
df_image2 = df_image2.rename(columns={'title':'Title','photo_url': 'Image_URL', 'photo_description': 'Image_Saved'})

In [None]:
# Merge the two datasets
df_merged_image = pd.concat([df_image, df_image2], ignore_index=True)

In [None]:
# Display the first few rows of the merged dataframe
print(df_merged_image.head())

In [None]:
# Remove rows with missing image URLs or saved-image values from the merged DataFrame
df_merged_image = df_merged_image.dropna(subset=['Image_URL', 'Image_Saved'])

In [None]:
df_merged_image

In [None]:
# Save the merged dataframe to a new CSV file 
df_merged_image.to_csv('merged_images_celestial_bodies.csv', index=False)

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.2.2 Visualization of Image Distribution:</h3>

In [None]:
# Check image availability
valid_images = df_merged_image[df_merged_image['Image_Saved'].notna()]
print(valid_images)

In [None]:
# Function to display the image based on user input title
def display_image_by_title(title):
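    # Search for the title in the DataFrame (literal, case-insensitive substring match)
    result = valid_images[valid_images['Title'].str.contains(title, case=False, na=False, regex=False)]
    
    if not result.empty:
        # Use the first match; a fixed index such as 2 fails when fewer than three rows match
        img_path = result['Image_Saved'].iloc[0]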
        img = Image.open(img_path)
        
        # Display the image
        plt.figure(figsize=(6, 6))
        plt.imshow(img)
        plt.title(f"Image of {title}")
        plt.axis('off')  
        plt.show()
    else:
        print(f"No image found for the title: {title}")

# Get the title from the user
user_input = input("Enter the title of the celestial body: ")
display_image_by_title(user_input)

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.2.3 Image Quality and Format:</h3>

In [None]:
# Check image format and size
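def check_image_format_and_size(image_path):
    try:
        # The context manager closes the file handle once the metadata is read
        with Image.open(image_path) as img:
            return img.format, img.size
    except Exception:
        # Missing, unreadable, or non-image paths are reported as (None, None)
        return None, None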

# Apply the function to check image format and size
df_merged_image['Image_Format'], df_merged_image['Image_Size'] = zip(*df_merged_image['Image_Saved'].apply(check_image_format_and_size))

# Display images with invalid format or size (if any)
invalid_images = df_merged_image[df_merged_image['Image_Format'].isnull() | df_merged_image['Image_Size'].isnull()]
print(f"Invalid Images:\n{invalid_images[['Title', 'Image_Saved']]}")


In [None]:
df_merged_image.head()

## <span style="color: #A05899; font-size:1em;">7.3 Identifying Trends and Patterns</span>

*********
Next, we identify key trends and patterns in the data. The aim is to find relationships and insights in both the textual and the numerical data for the celestial bodies. These will support data-driven observations about what the dataset contains and how it is structured.
********

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.3.1 Word Cloud Visualization:</h3>

*Textual Analysis of Celestial Body Descriptions*

In [None]:
# Generate Word Cloud for each celestial body type
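def generate_wordcloud_by_type(df):
    types = df['Type'].unique()
    # Size the subplot grid from the number of types so the index never exceeds the grid
    n_cols = 4
    n_rows = max(1, -(-len(types) // n_cols))  # ceiling division
    plt.figure(figsize=(12, 3 * n_rows))

    for i, celestial_type in enumerate(types):
        content = df[df['Type'] == celestial_type]['Cleaned_Content'].str.cat(sep=' ')
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(content)

        plt.subplot(n_rows, n_cols, i + 1)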
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.title(f'Word Cloud for {celestial_type}')
        plt.axis('off')

    plt.tight_layout()
    plt.show()

generate_wordcloud_by_type(df_merged)
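
A word cloud shows relative word frequency visually. The underlying counts can also be listed directly, which makes the types easier to compare. The following is a minimal sketch, assuming `Cleaned_Content` holds whitespace-separated tokens. The helper `top_words_by_type` and the toy rows are hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy rows standing in for df_merged.
toy = pd.DataFrame({
    "Type": ["planet", "planet", "star"],
    "Cleaned_Content": ["mars red planet", "red dust storm", "hot bright star"],
})

def top_words_by_type(df, n=3):
    """Return the n most frequent words per type as {type: [(word, count), ...]}."""
    result = {}
    for celestial_type, group in df.groupby("Type"):
        # Join all descriptions of this type and split on whitespace.
        words = " ".join(group["Cleaned_Content"].dropna()).split()
        result[celestial_type] = Counter(words).most_common(n)
    return result

top_words = top_words_by_type(toy, n=1)
print(top_words)
```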

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.3.2 Trends in Word Count Across Different Types:</h3>

In [None]:
# Boxplot to show distribution of word count across celestial types
plt.figure(figsize=(10, 6))
sns.boxplot(x='Type', y='Word_Count', data=df_merged, palette='Set2')
plt.title('Word Count Distribution Across Celestial Body Types')
plt.xlabel('Celestial Body Type')
plt.ylabel('Word Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">7.3.3 Exploring Patterns Between Types and Features:</h3>

In [None]:
# Exploring patterns between 'Type' and 'Word_Count'
type_word_count = df_merged.groupby('Type')['Word_Count'].mean().sort_values(ascending=False)

# Plot the average word count for each type
plt.figure(figsize=(10, 6))
type_word_count.plot(kind='bar', color='skyblue')
plt.title('Average Word Count for Each Celestial Body Type')
plt.xlabel('Celestial Body Type')
plt.ylabel('Average Word Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## <span style="color: #A05899; font-size:1em;">7.4 Correlation between features</span>

In [None]:
# Column data types
print(df_merged.dtypes)

In [None]:
# the correlation matrix for numeric columns
correlation_matrix = df_merged.corr(numeric_only=True)
# Display the correlation matrix
print("Correlation Matrix:\n")
print(correlation_matrix)

# Visualize the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',  linewidths=0.5)
plt.title("Correlation Matrix of Features")
plt.show()


In [None]:
features = ["Word_Count", "Type_Encoded"]
for feature in features:
    plt.figure(figsize=(8, 4))
    sns.histplot(df_merged[feature], kde=True, bins=30, color="teal")
    plt.title(f"Distribution of {feature}", fontsize=16, fontweight="bold")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()


In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(
    df_merged["Word_Count"],
    df_merged["Type_Encoded"],
    s=df_merged["Word_Count"] / 10,
    alpha=0.6,
    c=df_merged["Type_Encoded"],
    cmap="viridis",
)
plt.colorbar(label="Type Encoded")
plt.title("Bubble Chart: Word Count vs. Type Encoded", fontsize=16, fontweight="bold")
plt.xlabel("Word Count")
plt.ylabel("Type Encoded")
plt.show()

In [None]:
# Plot correlation between Word Count and Type Encoded
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Word_Count', y='Type_Encoded', data=df_merged, hue='Type', palette='Set1')
plt.title('Correlation Between Word Count and Celestial Body Type Encoded')
plt.xlabel('Word Count')
plt.ylabel('Type Encoded')
plt.legend(loc='best')
plt.show()

******************
- The weak correlation between Word_Count and Type_Encoded suggests that word count alone is not a good predictor of the type of celestial body: content length does not appear to differ much between entries for planets, stars, or other types. Because Type_Encoded is a label encoding with no natural order, a linear correlation with it is only a rough indicator.
- This is expected and reflects the fact that the validity of a URL is independent of word count or type of celestial body.
****************
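
Since the integer codes in `Type_Encoded` carry no natural order, one common alternative for measuring association between a categorical column and a numeric one is the correlation ratio (eta squared): the share of the variance in `Word_Count` that is explained by the type. The following is a minimal sketch on made-up data. The helper `correlation_ratio` is hypothetical and not part of the notebook.

```python
import pandas as pd

def correlation_ratio(df, category_col, value_col):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    overall_mean = df[value_col].mean()
    groups = df.groupby(category_col)[value_col]
    # Each group's squared distance from the overall mean, weighted by group size.
    ss_between = (groups.count() * (groups.mean() - overall_mean) ** 2).sum()
    ss_total = ((df[value_col] - overall_mean) ** 2).sum()
    return ss_between / ss_total if ss_total > 0 else 0.0

# Hypothetical toy data in which the two types have clearly different lengths.
toy = pd.DataFrame({
    "Type": ["planet", "planet", "star", "star"],
    "Word_Count": [100, 120, 300, 320],
})
eta_sq = correlation_ratio(toy, "Type", "Word_Count")
print(f"eta squared: {eta_sq:.3f}")
```

Values near 0 mean the type explains almost none of the variation in word count, and values near 1 mean it explains almost all of it.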

In [None]:
# Save the merged DataFrame to a new CSV file
df_merged.to_csv('merged_data.csv', index=False)

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">8. Model Building and Deployment</h2>
</div>

***********
In this step, you will perform the following tasks:

- Data Preparation.
- Model Selection.
- Model Training and Evaluation.
- Saving the Model.
- Deployment.
***********

## <span style="color: #A05899; font-size:1em;">8.1 Data Preparation</span>

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">8.1.1 Feature Selection:</h3>

*******
*For the model, I chose the following features:*

- 'Word_Count':
Reason: The number of words in the description may indicate the level of detail about the celestial body and, therefore, might be related to its type.
- 'Type_Encoded':
Reason: This is the numerical encoding of 'Type'. Because it maps one-to-one to the target, a model that receives it can recover the label directly, so the scores reported below mostly reflect this encoding rather than what 'Word_Count' contributes.
- **Target Variable**
- 'Type':
Reason: This is the category we are trying to predict, such as 'planet' or 'star'.
- **Why These Features?**
- 'Word_Count' and 'Type_Encoded' are selected because they are numerical and directly related to the 'Type' of each celestial body.
*********
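
Because 'Type_Encoded' is derived from 'Type', it can be informative to compare a model trained with it against one trained without it. The following is a minimal sketch on synthetic data. The arrays, the decision tree, and the helper `holdout_accuracy` are assumptions for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data (an assumption): three types, with word counts drawn
# independently of the type.
types = np.array(["planet", "star", "galaxy"])
type_encoded = rng.integers(0, 3, size=300)
word_count = rng.integers(50, 5000, size=300)
y = types[type_encoded]

X_with_code = np.column_stack([word_count, type_encoded])
X_word_only = word_count.reshape(-1, 1)

def holdout_accuracy(X, y):
    """Fit a decision tree on 75% of the rows and score it on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return model.score(X_te, y_te)

acc_with_code = holdout_accuracy(X_with_code, y)
acc_word_only = holdout_accuracy(X_word_only, y)
print(f"with Type_Encoded: {acc_with_code:.2f}, Word_Count only: {acc_word_only:.2f}")
```

A large gap between the two scores indicates that the encoded label, not the word count, is driving the predictions.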

In [None]:
df_data = pd.read_csv('merged_data.csv')

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">8.1.2 Prepare features:</h3>

In [None]:
# Check for missing values in 'Word_Count' and 'Type_Encoded'
missing_values = df_data[['Word_Count', 'Type_Encoded']].isnull().sum()
print("Missing values in 'Word_Count' and 'Type_Encoded':")
print(missing_values)

In [None]:
# Select the feature matrix (X) and the target (y)
X = df_data[['Word_Count', 'Type_Encoded']] 
y = df_data['Type']  

<h3 style="margin: auto; padding: 20px; color: RGB(100,40,80); ">8.1.3 Feature scaling:</h3>

In [None]:
scaler = StandardScaler()
# Fit the scaler and apply it to the features
X_scaled = scaler.fit_transform(X)

# checking the mean and standard deviation after scaling
print("Mean after scaling:", X_scaled.mean(axis=0))  # close to 0
print("Standard deviation after scaling:", X_scaled.std(axis=0)) # close to 1

*******************************************
**Scaling Results Explanation**

After scaling the features using StandardScaler, here is what we observe:

- **Mean after scaling:**

[ 8.33e-17, -2.99e-16]

Explanation: The means of both features are now effectively 0, which is what we expect from StandardScaler. The tiny nonzero values are floating-point rounding error.

- **Standard deviation after scaling:**

[1.0, 1.0]

Explanation: The standard deviation of both features is 1, which confirms that the scaling was applied correctly. Both features are now on the same scale and can be compared directly by a machine learning model.

**Why Scaling is Important**

Scaling puts all features into a comparable range, so that no feature dominates the model simply because it has a larger scale, for example Word_Count in the thousands compared with Type_Encoded in a small range. Scale-sensitive models such as Logistic Regression also often converge faster on scaled features.
*************************************
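
The transformation behind these numbers is simple to reproduce by hand. The following is a minimal sketch, assuming StandardScaler's default settings (subtract the column mean, divide by the population standard deviation). The matrix `X_toy` is made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: columns stand for Word_Count and Type_Encoded.
X_toy = np.array([
    [1200.0, 6.0],
    [300.0, 1.0],
    [4500.0, 7.0],
    [800.0, 6.0],
])

# Standardize each column: z = (x - mean) / std, with std computed using ddof=0.
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)
X_toy_scaled = (X_toy - col_mean) / col_std

print("Mean after scaling:", X_toy_scaled.mean(axis=0))
print("Std after scaling:", X_toy_scaled.std(axis=0))
```

The printed means are zero up to floating-point rounding and the standard deviations are 1, matching the pattern described above.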

## <span style="color: #A05899; font-size:1em;">8.2 Model Selection</span>

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Define the models to be compared
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42)
}

# Store results
model_results = []

# Evaluate each model
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=1)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=1)
    
    # Store the results
    model_results.append({
        "Model": model_name,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1-Score": f1
    })

# Convert the results to a DataFrame
model_results_df = pd.DataFrame(model_results)

# Display the results as a table
print(model_results_df)

# Select the best model based on accuracy
best_model_name = model_results_df.loc[model_results_df['Accuracy'].idxmax(), 'Model']
best_model = models[best_model_name]
print(f"\nBest Model: {best_model_name} (Accuracy: {model_results_df['Accuracy'].max():.4f})")


**********************
**Best Model Selection**
The comparison above shows that the Random Forest model achieves the highest accuracy (0.9900), precision (0.9950), recall (0.9900), and F1-score (0.9876), so it is the best-performing model for this task with these features.
***********************

#### ***The Classification Report***

In [None]:
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print(f"\n{model_name} Classification Report:\n")
    print(classification_report(y_test, y_pred))
    print("\n*******************************************************************************\n")

__________________
- Random Forest is the better-performing model overall, with higher precision, recall, and F1-score and more balanced performance across categories. It is particularly good for categories such as "Planet", "Star", and "Spacecraft".

- Logistic Regression has lower overall accuracy and makes many errors on "Moon" and "Spacecraft", as shown by its low precision and recall for those categories, so it is less reliable for these types.
_______________
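
Per-class precision and recall, as listed in a classification report, can be read off a confusion matrix. The following is a minimal sketch on made-up labels. The names `class_names`, `y_true_toy`, and `y_pred_toy` are hypothetical and do not come from the notebook's data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only.
class_names = ["moon", "planet", "star"]
y_true_toy = ["planet", "planet", "star", "moon", "moon", "star"]
y_pred_toy = ["planet", "planet", "star", "planet", "moon", "star"]

# Rows are true classes and columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=class_names)

for i, name in enumerate(class_names):
    true_positives = cm[i, i]
    predicted_as_class = cm[:, i].sum()  # column total
    actually_class = cm[i, :].sum()      # row total
    # Every class is predicted at least once in this toy data, so no division by zero.
    class_precision = true_positives / predicted_as_class
    class_recall = true_positives / actually_class
    print(f"{name}: precision={class_precision:.2f}, recall={class_recall:.2f}")
```

Off-diagonal entries show which types get confused with each other, which the summary scores alone do not reveal.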

#### ***Model Accuracy vs. Other Metrics***

In [None]:
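# Evaluate and plot metrics for the selected best model.
# y_pred from the loop above belongs to the last model fitted, so predict again here.
y_pred_best = best_model.predict(X_test)
metrics = {
    "Accuracy": accuracy_score(y_test, y_pred_best),
    "Precision": precision_score(y_test, y_pred_best, average='weighted', zero_division=1),
    "Recall": recall_score(y_test, y_pred_best, average='weighted', zero_division=1),
    "F1-Score": f1_score(y_test, y_pred_best, average='weighted', zero_division=1)
}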

metrics_df = pd.DataFrame(list(metrics.items()), columns=["Metric", "Value"])

# Plot the evaluation metrics
plt.figure(figsize=(8, 6))
sns.barplot(x='Metric', y='Value', data=metrics_df, palette='viridis')
plt.title('Model Performance Metrics')
plt.ylabel('Score')
plt.show()


## <span style="color: #A05899; font-size:1em;">8.3 Saving the Model</span>

In [None]:
# Save the model to a file
joblib.dump(best_model, 'best_model.pkl')
joblib.dump(scaler, 'scaler.pkl') 
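
An alternative to saving the scaler and the model as two files is to bundle them into a single scikit-learn pipeline, so that both are always loaded together. The following is a minimal sketch on toy data. The arrays, the file name `toy_pipeline.pkl`, and the small forest size are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy training data: one numeric feature and a string label.
X_toy = np.array([[100.0], [150.0], [3000.0], [3500.0]])
y_toy = np.array(["moon", "moon", "galaxy", "galaxy"])

# The pipeline applies the scaler and then the classifier as one object.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
pipeline.fit(X_toy, y_toy)

# Save to a temporary location and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline scales raw inputs itself before predicting.
same_predictions = (restored.predict(X_toy) == pipeline.predict(X_toy)).all()
print(same_predictions)
```

With this layout, prediction code passes raw feature values to one object and cannot apply a scaler that does not match the model.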

## <span style="color: #A05899; font-size:1em;">8.4 Deployment</span>

In [None]:
# Load the saved model and scaler
model = joblib.load('best_model.pkl')
scaler = joblib.load('scaler.pkl')

In [None]:
# Function to predict the type and URL for all matching celestial bodies
def predict_celestial_body(name):
    # Search for all rows where 'Title' contains the provided name (literal, case-insensitive match)
    matches = df_data[df_data['Title'].str.contains(name, case=False, na=False, regex=False)]
    
    if matches.empty:
        return f"No data found for '{name}'."
    
    results = []
    
    # Loop through all matches and collect relevant data (Type and URL)
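    for _, match in matches.iterrows():
        # Process the data for prediction 
        word_count = match['Word_Count']
        type_encoded = match['Type_Encoded']
        # Pass named columns so the input matches what the scaler was fitted on
        features = scaler.transform(pd.DataFrame([[word_count, type_encoded]], columns=['Word_Count', 'Type_Encoded']))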
        
        # Predict the type using the model
        predicted_type = model.predict(features)[0]
        predicted_url = match['URL']
        
        results.append(f"Predicted Type: {predicted_type}\nURL: {predicted_url}\n")
    
    # Return all the predictions for the matching rows
    return "\n".join(results)

#-------------------------------------------------------------------------
name = input("Enter a celestial body name: ")
print(predict_celestial_body(name))

<div style=" border: 5px solid #AC4d61; border-radius: 20px;">
    <h2 style="margin: auto; padding: 20px;color: #AC4d61;font-size:1.8em;">9. Conclusion</h2>
</div>

_____________________
_________________

In this notebook, we went from collecting data on celestial bodies, to cleaning that data, to EDA and model building. Following these steps, we have seen how insights can be extracted from the dataset, culminating in a machine learning model.

#### Key Achievements:

1. **Data Collection and Structuring:**
- We successfully scraped and structured a dataset containing both textual and image data related to celestial bodies. This allowed us to associate textual descriptions with corresponding visual characteristics, providing a rich foundation for further analysis.

2. **Exploratory Data Analysis (EDA):**
- In the EDA, **stars** and **planets** were the most common body types within the dataset, whereas **galaxies** were associated with more detailed descriptions and high-resolution images.
The positive correlation between the length of textual descriptions and image resolution suggests that more complex celestial bodies, such as galaxies, tend to be documented in more detail.

3. **Modeling Insights:**
- The best performance came from the **Random Forest** model, with an accuracy of **99.01%**, so it classifies most entries into their correct classes. Precision and recall are high for most classes. The model appears reliable on this dataset and may be useful for similar celestial-body classification tasks.

#### Final Thoughts:
This analysis points to the necessity of considering both text and image data for better comprehension of complex entities like those from outer space. The good performance of the Random Forest model hints at its use in a wide range of future applications, such as auto-classification of celestial objects through their description and related images. Going forward, there is always room for improving the performance by either including additional features or more advanced techniques, like deep learning.

_______________________________________________
_________________________________________________