# <center> Star Type Prediction using Machine Learning
---

### 1) Dataset and Aim

* The objective of this project is to classify stars by type using features such as temperature, luminosity, radius, and absolute magnitude.
* Download [this dataset](https://drive.google.com/uc?id=12oc8SOpsbktDzmBXXZT3FNxKhCUQA_DA) and upload it to your Colab file storage.
  


<img src='https://i.pinimg.com/originals/84/55/14/84551418f7ad91c47d75046db7c42993.png' width=50%>

### 2) Import Libraries

In [19]:
# Libraries
import numpy as np
import pandas as pd

### 3) Explore the Star Data

In [20]:
# Create the DataFrame from csv data
df = pd.read_csv('star_type_.csv')

In [21]:
# Show 10 random rows from the dataset
df.sample(10)

| | Temperature (K) | Luminosity(L/Lo) | Radius(R/Ro) | Absolute magnitude(Mv) | Star type |
|---|---|---|---|---|---|
| 138 | 3324 | 0.0034 | 0.34 | 12.23 | Red Dwarf |
| 40 | 3826 | 200000.0 | 19.0 | -6.93 | Hypergiant |
| 45 | 3600 | 320000.0 | 29.0 | -6.6 | Hypergiant |
| 191 | 3257 | 0.0024 | 0.46 | 10.73 | Red Dwarf |
| 209 | 19360 | 0.00125 | 0.00998 | 11.62 | White Dwarf |
| 123 | 3146 | 0.00015 | 0.0932 | 16.92 | Brown Dwarf |
| 29 | 7230 | 8e-05 | 0.013 | 14.08 | White Dwarf |
| 81 | 10574 | 0.00014 | 0.0092 | 12.02 | White Dwarf |
| 223 | 23440 | 537430.0 | 81.0 | -5.975 | Hypergiant |
| 235 | 38940 | 374830.0 | 1356.0 | -9.93 | Supergiant |


In [22]:
# Targets
df['Star type'].value_counts()

Star type
Brown Dwarf      40
Red Dwarf        40
White Dwarf      40
Main Sequence    40
Hypergiant       40
Supergiant       40
Name: count, dtype: int64

In [23]:
# Show a summary of the DataFrame (columns, non-null counts, dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Temperature (K)         240 non-null    int64  
 1   Luminosity(L/Lo)        240 non-null    float64
 2   Radius(R/Ro)            240 non-null    float64
 3   Absolute magnitude(Mv)  240 non-null    float64
 4   Star type               240 non-null    object 
dtypes: float64(3), int64(1), object(1)
memory usage: 9.5+ KB


### 4) Prepare X_train and y_train

In [24]:
# Save input features in X and target output in y
X = df.iloc[:, :-1]
y = df['Star type']

In [25]:
# Splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

# Perform the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.values[0], X_test.values[0]

(array([3.541e+03, 1.300e-03, 2.560e-01, 1.433e+01]),
 array([1.650e+04, 1.300e-02, 1.400e-02, 1.189e+01]))
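The split above is purely random, so with only 40 stars per class the test set is not guaranteed to hold every class in equal numbers. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names, not the star dataset) with the same shape of 6 classes and 40 rows each.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the star data: 240 rows, 4 features, 6 balanced classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(240, 4))
y_demo = np.repeat(np.arange(6), 40)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many test rows each class received
test_counts = np.bincount(y_te)
print(test_counts)
```

With a stratified split, each class contributes the same share (here 20%) of its rows to the test set.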

### 5) Create Pipeline

In [26]:
# Create the Pipeline with Scaler and ML Model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),                        # Feature scaling
    ('classifier', LogisticRegression(random_state=42))  # Classification model
])
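The scaler matters here because the features span very different ranges: in the sample rows shown earlier, luminosity runs from about 0.001 to several hundred thousand, while radius can be below 0.01. A minimal sketch of what `StandardScaler` does, using four rows copied from that sample output (the variable names are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: Temperature (K), Luminosity(L/Lo), Radius(R/Ro), Absolute magnitude(Mv)
rows = np.array([
    [3324, 0.0034, 0.34, 12.23],
    [3826, 200000.0, 19.0, -6.93],
    [19360, 0.00125, 0.00998, 11.62],
    [38940, 374830.0, 1356.0, -9.93],
])

# Subtract each column's mean and divide by its standard deviation
scaled = StandardScaler().fit_transform(rows)

col_means = scaled.mean(axis=0)  # approximately 0 for every column
col_stds = scaled.std(axis=0)    # approximately 1 for every column
print(np.round(scaled, 2))
```

Inside the pipeline, the scaler is fitted on the training data only and then reused to transform the test data, which avoids leaking test statistics into training.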

### 6) Train and Test the Logistic Regression Model

In [27]:
# Train the Logistic Regression using Pipeline
pipeline.fit(X_train, y_train)

In [28]:
# Make the predictions
pred_labels = pipeline.predict(X_test) # predicted values on test data
actual_labels = y_test.values          # actual values

In [29]:
# Print the results
print(f'Here are the actual labels:-\n{actual_labels}')
print(f'\nHere are the predicted labels:-\n{pred_labels}')

Here are the actual labels:-
['White Dwarf' 'Brown Dwarf' 'Main Sequence' 'Hypergiant' 'Hypergiant'
 'Supergiant' 'Supergiant' 'White Dwarf' 'Brown Dwarf' 'White Dwarf'
 'Hypergiant' 'White Dwarf' 'Supergiant' 'Hypergiant' 'Supergiant'
 'Supergiant' 'Brown Dwarf' 'Red Dwarf' 'Main Sequence' 'Brown Dwarf'
 'Brown Dwarf' 'Red Dwarf' 'Supergiant' 'Main Sequence' 'Supergiant'
 'Main Sequence' 'Red Dwarf' 'White Dwarf' 'Supergiant' 'Main Sequence'
 'Main Sequence' 'Hypergiant' 'White Dwarf' 'Brown Dwarf' 'Red Dwarf'
 'Brown Dwarf' 'Red Dwarf' 'Supergiant' 'Red Dwarf' 'Supergiant'
 'Hypergiant' 'Supergiant' 'Hypergiant' 'Red Dwarf' 'Main Sequence'
 'Brown Dwarf' 'Hypergiant' 'Main Sequence']

Here are the predicted labels:-
['White Dwarf' 'Brown Dwarf' 'Red Dwarf' 'Hypergiant' 'Hypergiant'
 'Supergiant' 'Supergiant' 'White Dwarf' 'Brown Dwarf' 'White Dwarf'
 'Hypergiant' 'White Dwarf' 'Supergiant' 'Hypergiant' 'Supergiant'
 'Supergiant' 'Brown Dwarf' 'Red Dwarf' 'Main Sequence' 'Brown Dwarf'

In [30]:
# Find the indices where the prediction does not match the actual label
incorrect_index = np.where(actual_labels != pred_labels)[0]

# Print the actual and predicted label for each incorrect index
for idx in incorrect_index:
    print(f'Actual Label:- {actual_labels[idx]}')
    print(f'Predicted Label:- {pred_labels[idx]}')
    print()

Actual Label:- Main Sequence
Predicted Label:- Red Dwarf

Actual Label:- Main Sequence
Predicted Label:- Hypergiant



In [31]:
# Get the accuracy score
from sklearn.metrics import accuracy_score
print(accuracy_score(actual_labels, pred_labels))

0.9583333333333334
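This score follows from the counts above: the test set holds 48 stars (20% of 240) and two predictions were wrong, so 46 / 48 ≈ 0.958. A single accuracy number hides which classes get confused, so a confusion matrix can help. The sketch below uses small hypothetical label lists (not the notebook's real predictions) that mimic the two kinds of mistakes seen above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Accuracy arithmetic for the run above: 48 test rows, 2 mismatches
n_test, n_wrong = 48, 2
accuracy = (n_test - n_wrong) / n_test
print(round(accuracy, 4))

# Hypothetical labels for illustration only
labels = ['Main Sequence', 'Red Dwarf', 'Hypergiant']
y_true_demo = ['Main Sequence', 'Main Sequence', 'Main Sequence',
               'Red Dwarf', 'Hypergiant', 'Hypergiant']
y_pred_demo = ['Main Sequence', 'Red Dwarf', 'Hypergiant',
               'Red Dwarf', 'Hypergiant', 'Hypergiant']

# Rows are actual classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(cm)
```

Off-diagonal entries in the matrix show which actual class was mistaken for which predicted class.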


### 7) Download the Pipeline and Test it

In [32]:
# Save the model
from pickle import dump
with open('model.pkl', 'wb') as file:
    dump(pipeline, file)

In [33]:
# Load the model and test it
from pickle import load
with open('model.pkl', 'rb') as file:
    loaded_data = load(file)

In [34]:
# Get data from test set
X_test.iloc[1,:], y_test.values[1]

(Temperature (K)           2637.00000
 Luminosity(L/Lo)             0.00073
 Radius(R/Ro)                 0.12700
 Absolute magnitude(Mv)      17.22000
 Name: 6, dtype: float64,
 'Brown Dwarf')

In [35]:
# Feature list
features = X_train.columns.to_list()
features

['Temperature (K)',
 'Luminosity(L/Lo)',
 'Radius(R/Ro)',
 'Absolute magnitude(Mv)']

In [36]:
# Prediction
test_df = pd.DataFrame([[2637, 0.00073, 0.12700, 17.22000]], columns=features)
pred = loaded_data.predict(test_df)
print(pred[0])

Brown Dwarf
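Besides the predicted label, a restored pipeline can also report how confident it is through `predict_proba`. The sketch below is self-contained and runs on synthetic two-class data (`demo_pipeline`, `X_demo`, and the labels `'A'`/`'B'` are hypothetical), so the numbers it prints say nothing about the star model itself.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic clusters standing in for two star types
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
y_demo = np.array(['A'] * 50 + ['B'] * 50)

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

# Round-trip through pickle in memory instead of writing a file
restored = pickle.loads(pickle.dumps(demo_pipeline))

# Class probabilities for one sample, in the order of restored.classes_
proba = restored.predict_proba(X_demo[:1])[0]
for label, p in zip(restored.classes_, proba):
    print(f'{label}: {p:.3f}')
```

Note that a pickled pipeline should only be loaded from a trusted source and, ideally, with the same scikit-learn version that created it.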
