# Introduction

In this notebook, we will perform clustering on a dataset of alien-colonized planets using the KMeans algorithm. The goal is to identify patterns and group similar planets together based on their features.

### Importing Libraries

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

We start by importing essential libraries such as pandas for data handling, scikit-learn for machine learning tasks, and matplotlib for plotting.

### Loading the Dataset

In [None]:
data = pd.read_csv("alien_galaxy.csv")
data.head()

We load the dataset and display the first few rows to understand the structure and contents of the data.

### Data Preprocessing

In [None]:
# Drop columns with more than 40% missing values
missing_fraction = data.isna().mean()
data = data.loc[:, missing_fraction <= 0.4].copy()

# Fill missing numerical values with the median of each column
for col in data.select_dtypes(include='number').columns:
    data[col] = data[col].fillna(data[col].median())

# Fill missing categorical values with the mode of each column
for col in data.select_dtypes(include=['object']).columns:
    data[col] = data[col].fillna(data[col].mode()[0])


We drop columns with more than 40% missing values, then fill the remaining gaps so the dataset is suitable for modeling. Numerical columns are filled with the median, which is less sensitive to outliers than the mean, while categorical columns are filled with the mode.
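
To make the steps above concrete, here is a minimal sketch on a small, made-up DataFrame. The column names (`gravity`, `atmosphere`, `mostly_empty`) and values are hypothetical and are not taken from `alien_galaxy.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "gravity": [1.0, np.nan, 3.0, 100.0, 2.0],               # 20% missing, one outlier
    "atmosphere": ["thin", None, "thin", "dense", "thin"],   # 20% missing
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0, np.nan],   # 80% missing
})

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# Keep only columns with at most 40% missing values
toy = toy.loc[:, missing_fraction <= 0.4].copy()

# Median for numeric columns, mode for all other columns
for col in toy.select_dtypes(include="number").columns:
    toy[col] = toy[col].fillna(toy[col].median())
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The outlier of 100.0 barely moves the median used for `gravity`, whereas a mean-based fill would be pulled strongly toward it.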

### Feature Engineering

In [None]:
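# Convert 'Discovery_Date' to datetime; unparseable values become NaT
data['Discovery_Date'] = pd.to_datetime(data['Discovery_Date'], errors='coerce')

# Extract the year, then fill years lost to unparseable dates with the median,
# because KMeans cannot handle NaN values
data['Discovery_Year'] = data['Discovery_Date'].dt.year
data['Discovery_Year'] = data['Discovery_Year'].fillna(data['Discovery_Year'].median())

# Drop the original 'Discovery_Date' column
data = data.drop(columns=['Discovery_Date'])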

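We replace the raw date with the discovery year, a numeric feature that might be relevant for clustering. Because of `errors='coerce'`, dates that cannot be parsed become missing values instead of raising an error.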
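
The effect of coercion is easy to miss, so here is a small sketch with made-up date strings (the values are hypothetical). It shows that an unparseable entry turns into a missing year, which then has to be filled before clustering. Filling with the median is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical date strings; the second one cannot be parsed
dates = pd.Series(["2150-03-14", "not a date", "2201-11-02"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# The year of a NaT entry is NaN
years = parsed.dt.year
n_unparsed = int(years.isna().sum())

# One possible fix: fill the gaps with the median year
years_filled = years.fillna(years.median())

print(years)
print(years_filled)
```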

### Separating Feature Types

In [None]:
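# Identify numerical and categorical columns.
# include='number' matches every numeric dtype, so 'Discovery_Year' is kept
# even when .dt.year returns int32 rather than int64.
numerical_features = data.select_dtypes(include='number').columns.tolist()
categorical_features = data.select_dtypes(include=['object']).columns.tolist()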

We separate the features to apply appropriate preprocessing steps to each type.

### Creating Preprocessing Pipelines

In [None]:
# Define the preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

We standardize numerical features so that no single feature dominates the Euclidean distances KMeans relies on, and we one-hot encode categorical features so they can be represented numerically.

### Clustering with KMeans

In [None]:
# Create a pipeline that includes preprocessing and KMeans clustering
kmeans_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('kmeans', KMeans(n_clusters=5, n_init=10, random_state=42))
])

# Fit the model and predict cluster labels
data['Cluster'] = kmeans_pipeline.fit_predict(data)


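The pipeline preprocesses the data, fits KMeans with five clusters, and stores each planet's label in a new `Cluster` column. The number of clusters is fixed by hand here rather than derived from the data.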
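
Since the number of clusters is chosen by hand, one common way to sanity-check it is to compare silhouette scores across several values of k. The sketch below is an assumption-laden example, not part of the original analysis: it uses synthetic data from `make_blobs` with four hand-placed centers instead of the planet dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated blobs (not the planet dataset)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6, random_state=42)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best-separated candidate
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

On the real data, `X` would be the output of the fitted preprocessor, and the best-scoring k may well differ from five.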

### Dimensionality Reduction with PCA

In [None]:
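# Reuse the preprocessor fitted inside the pipeline
features = kmeans_pipeline.named_steps['preprocessor'].transform(data)
# One-hot output may be sparse; older scikit-learn PCA needs a dense array
if hasattr(features, 'toarray'):
    features = features.toarray()

# Apply PCA to reduce data to 2 dimensions for visualization
pca = PCA(n_components=2)
data_pca = pca.fit_transform(features)

# Add PCA components to the dataset
data['PCA1'] = data_pca[:, 0]
data['PCA2'] = data_pca[:, 1]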

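We project the preprocessed features onto their first two principal components so the clusters can be drawn in a 2D plot. The clustering itself was computed in the full feature space, so clusters may overlap in this projection.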
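
How much a 2D projection can be trusted depends on how much variance the two components retain. The following sketch shows how to read that from `explained_variance_ratio_`. It runs on random synthetic data, so the numbers it prints say nothing about the planet dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: six features, two of them strongly correlated
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

pca_demo = PCA(n_components=2).fit(X)

# Share of total variance captured by each of the two components
ratio = pca_demo.explained_variance_ratio_
retained = float(ratio.sum())

print("per component:", ratio)
print("retained in 2D:", retained)
```

A low retained share means the scatter plot hides much of the structure that KMeans actually used.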

### Visualization

In [None]:
import os

# Set up the plot
plt.figure(figsize=(10, 7))
colors = ['red', 'blue', 'green', 'purple', 'orange']

# Plot each cluster, in sorted order so the legend is ordered too
for cluster in sorted(data['Cluster'].unique()):
    cluster_data = data[data['Cluster'] == cluster]
    plt.scatter(cluster_data['PCA1'], cluster_data['PCA2'],
                color=colors[cluster], label=f'Cluster {cluster}', alpha=0.6)

# Add titles and labels
plt.title("Clustering of Alien-Colonized Planets")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend(title='Cluster')
plt.grid(True)

# Save the figure to the images folder, creating the folder if needed
os.makedirs('images', exist_ok=True)
plt.savefig('images/cluster_visualization.png')

# Display the plot
plt.show()

## Conclusion

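KMeans partitions the planets into five clusters by construction, so the grouping alone does not show that the clusters are distinct. If the clusters appear well separated in the PCA plot, that suggests distinct groups of alien-colonized planets, which may warrant further investigation to understand the factors behind these groupings. A metric such as the silhouette score would help confirm the separation.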
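
As a possible starting point for such an investigation, each cluster can be profiled by its size, its mean numeric values, and its most frequent category. The sketch below uses a tiny made-up table. The column names and values are hypothetical, and only `Cluster` matches a column created in this notebook.

```python
import pandas as pd

# Hypothetical toy data with cluster labels already assigned
toy = pd.DataFrame({
    "gravity": [0.9, 1.1, 3.0, 3.2],
    "atmosphere": ["thin", "thin", "dense", "dense"],
    "Cluster": [0, 0, 1, 1],
})

# Number of planets per cluster
sizes = toy["Cluster"].value_counts().sort_index()

# Mean of each numeric feature per cluster
numeric_profile = toy.groupby("Cluster")[["gravity"]].mean()

# Most frequent category per cluster
top_category = toy.groupby("Cluster")["atmosphere"].agg(lambda s: s.mode().iloc[0])

print(sizes)
print(numeric_profile)
print(top_category)
```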