# STAR DATASET (2) ML

The first notebook on this topic can be found at the link [STAR DATASET (1) Analysis](https://www.kaggle.com/arthurchebotkov/star-dataset-1-analysis).

**Key findings:**

1) Dataset includes:
* **Star type** (target) - (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence, Supergiants, Hypergiants)
* Absolute Temperature (in K)
* Relative Luminosity (L/Lo)
* Relative Radius (R/Ro)
* Absolute Magnitude (Mv)
* Star Color (White, Red, Blue, Yellow, Yellow-Orange, etc.)
* Spectral Class (O, B, A, F, G, K, M)

2) Categorical variables: Star Type, Star Color, Spectral Class:
* Spectral Class and Star color are both correlated **with the target**, so each is a candidate feature for the model.
* Spectral Class and Star color are also correlated with each other, so we will need to exclude one of them from the model.

3) Categorical and numerical variables:
* **Star type** (target): All numerical features are correlated with **target**
* Spectral Class: All numerical features are correlated with Spectral Class
* Star color: All numerical features are correlated with Star color

4) Numerical variables:  
The numerical features are moderately, but not highly, correlated with each other. For now, let's keep all the features for the models.
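The association between two categorical variables mentioned in finding 2 can be checked with a chi-square test of independence. The sketch below is an illustration only: it assumes `scipy` is available, uses a small hypothetical table (`toy`) rather than the real dataset, and is not necessarily the test used in the first notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data, NOT the real star dataset
toy = pd.DataFrame({
    'Star color': ['Red', 'Red', 'Red', 'Blue', 'Blue', 'Blue', 'White', 'White'],
    'Spectral Class': ['M', 'M', 'M', 'O', 'O', 'B', 'A', 'A'],
})

# Contingency table: rows are colors, columns are spectral classes
table = pd.crosstab(toy['Star color'], toy['Spectral Class'])

# Chi-square test of independence between the two categorical variables
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print('chi2 =', round(chi2_stat, 2), 'p-value =', p_value, 'dof =', dof)
```

A small p-value suggests the two variables are not independent. On a table this small the expected counts are low, so the result only demonstrates the call, not a reliable test.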

## Data preparation

First, prepare the dataset for ML:

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv('6 class csv.csv')
data

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type,Star color,Spectral Class
0,3068,0.002400,0.1700,16.12,0,Red,M
1,3042,0.000500,0.1542,16.60,0,Red,M
2,2600,0.000300,0.1020,18.70,0,Red,M
3,2800,0.000200,0.1600,16.65,0,Red,M
4,1939,0.000138,0.1030,20.06,0,Red,M
...,...,...,...,...,...,...,...
235,38940,374830.000000,1356.0000,-9.93,5,Blue,O
236,30839,834042.000000,1194.0000,-10.63,5,Blue,O
237,8829,537493.000000,1423.0000,-10.73,5,White,A
238,9235,404940.000000,1112.0000,-11.23,5,White,A


The same star color is spelled in several different ways (letter case, spacing, hyphens). Let's make the values uniform using pandas:

In [2]:
# Normalize the different spellings of the same color.
# Assign the result back instead of calling replace(..., inplace=True) on a
# selected column, which newer pandas versions warn about.
data['Star color'] = data['Star color'].replace({
    'Blue White': 'Blue-White',
    'Blue white': 'Blue-White',
    'Blue-white': 'Blue-White',
    'Blue white ': 'Blue-White',
    'Yellowish White': 'Yellowish-White',
    'Pale yellow orange': 'Pale-Yellow-Orange',
    'yellow-white': 'Yellow-White',
    'white': 'White',
    'yellowish': 'Yellowish',
    'Blue ': 'Blue',
})

In [3]:
data['Star color'].unique()

array(['Red', 'Blue-White', 'White', 'Yellowish-White',
       'Pale-Yellow-Orange', 'Blue', 'Whitish', 'Yellow-White', 'Orange',
       'White-Yellow', 'Yellowish', 'Orange-Red'], dtype=object)
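As an alternative to listing every variant by hand, the same clean-up can be sketched with pandas string methods. This is a minimal sketch on a hypothetical sample (`raw`) of the spellings seen above, not the notebook's own method. It only unifies case, spacing and hyphens, so it would not merge labels such as `White-Yellow` and `Yellow-White`.

```python
import pandas as pd

# Hypothetical sample of raw labels, mimicking the variants seen above
raw = pd.Series(['Blue White', 'Blue white ', 'Blue-white', 'white',
                 'Pale yellow orange', 'Blue ', 'yellowish'])

normalized = (
    raw.str.strip()                          # drop leading/trailing spaces
       .str.lower()                          # unify letter case
       .str.replace(' ', '-', regex=False)   # inner spaces become hyphens
       .str.title()                          # capitalize each hyphen-separated part
)

print(normalized.unique())
```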

Our target, Star type, is a nominal variable: its values have no intrinsic ordering.
Let's replace the numeric codes with the class names so that the codes do not suggest an ordering.

In [4]:
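# Map the numeric codes to class names (assign back instead of inplace=True on a column)
star_type_names = {0: 'Red Dwarf', 1: 'White Dwarf', 2: 'Brown Dwarf', 3: 'Main Sequence', 4: 'Supergiants', 5: 'Hypergiants'}
data['Star type'] = data['Star type'].replace(star_type_names)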
data

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star type,Star color,Spectral Class
0,3068,0.002400,0.1700,16.12,Red Dwarf,Red,M
1,3042,0.000500,0.1542,16.60,Red Dwarf,Red,M
2,2600,0.000300,0.1020,18.70,Red Dwarf,Red,M
3,2800,0.000200,0.1600,16.65,Red Dwarf,Red,M
4,1939,0.000138,0.1030,20.06,Red Dwarf,Red,M
...,...,...,...,...,...,...,...
235,38940,374830.000000,1356.0000,-9.93,Hypergiants,Blue,O
236,30839,834042.000000,1194.0000,-10.63,Hypergiants,Blue,O
237,8829,537493.000000,1423.0000,-10.73,Hypergiants,White,A
238,9235,404940.000000,1112.0000,-11.23,Hypergiants,White,A


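Let's check the distribution of the target classes.
The target classes are balanced. Note that `nunique()` counts distinct values per column within each class, so the table gives a lower bound on the rows per class rather than exact class sizes: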

In [5]:
data.groupby(['Star type']).nunique()

Star type,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Star color,Spectral Class
Brown Dwarf,40,37,37,37,6,3
Hypergiants,39,37,37,38,5,6
Main Sequence,40,40,40,40,6,5
Red Dwarf,37,39,36,39,1,1
Supergiants,40,38,32,38,2,3
White Dwarf,37,37,35,40,1,1
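To read class sizes directly, counting rows per class is more direct than counting distinct values. A minimal sketch on a hypothetical toy target (`toy_target`), not the real data:

```python
import pandas as pd

# Hypothetical toy target column, NOT the real data
toy_target = pd.Series(
    ['Red Dwarf', 'White Dwarf', 'Red Dwarf',
     'Hypergiants', 'White Dwarf', 'Hypergiants'],
    name='Star type',
)

# Number of rows per class
class_counts = toy_target.value_counts()
print(class_counts)

# The classes are balanced when every class has the same number of rows
print('Balanced:', class_counts.nunique() == 1)
```

On the notebook's data the equivalent call would be `data['Star type'].value_counts()`.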


### Select a target and features

Next, select the target and the features into new pandas objects.

In [6]:
# Select a target

y = data['Star type']

Categorical features must be converted to numeric data before they can be used for ML. We use dummy coding for this.

In [7]:
star_color = data['Star color']
spectral_class = data['Spectral Class']

star_color = pd.get_dummies(star_color)
spectral_class = pd.get_dummies(spectral_class)

data_drop = data.drop(['Star type', 'Star color', 'Spectral Class'], axis=1)

X = pd.concat((data_drop, star_color,spectral_class), axis=1)

X

Unnamed: 0,Temperature (K),Luminosity(L/Lo),Radius(R/Ro),Absolute magnitude(Mv),Blue,Blue-White,Orange,Orange-Red,Pale-Yellow-Orange,Red,...,Yellow-White,Yellowish,Yellowish-White,A,B,F,G,K,M,O
0,3068,0.002400,0.1700,16.12,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,3042,0.000500,0.1542,16.60,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
2,2600,0.000300,0.1020,18.70,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
3,2800,0.000200,0.1600,16.65,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
4,1939,0.000138,0.1030,20.06,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,38940,374830.000000,1356.0000,-9.93,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
236,30839,834042.000000,1194.0000,-10.63,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
237,8829,537493.000000,1423.0000,-10.73,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
238,9235,404940.000000,1112.0000,-11.23,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
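What dummy coding does is easier to see on a tiny example. The sketch below uses a hypothetical three-row frame (`toy`). Passing `dtype=int` is an assumption about the desired display, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical feature
toy = pd.DataFrame({
    'Absolute magnitude(Mv)': [16.12, -9.93, -10.73],
    'Spectral Class': ['M', 'O', 'A'],
})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy['Spectral Class'], dtype=int)

# Replace the categorical column with its indicator columns
X_toy = pd.concat([toy.drop(columns='Spectral Class'), dummies], axis=1)
print(X_toy)
```

Each row has exactly one `1` among the indicator columns, marking the category that row belongs to.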


Next, split the data into two sets, train and test:

In [8]:
import sklearn
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 11)
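The split above is purely random, so the per-class counts in the test set can differ between classes. One option, shown here as a sketch on hypothetical toy data and not as what the notebook does, is to pass `stratify=` so that each class keeps the same share in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 classes with 10 rows each
y_toy = pd.Series(np.repeat(['c0', 'c1', 'c2', 'c3', 'c4', 'c5'], 10))
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})

# stratify=y_toy keeps the class proportions identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=11, stratify=y_toy
)

print(y_te.value_counts())
```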

## ML

### DECISION TREE

First, find the best hyperparameters using GridSearchCV:

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

tree_params = {'max_depth': range(1,10),'max_features': range(1,24)}
dtc = DecisionTreeClassifier(criterion='entropy', random_state=11)
tree_grid = GridSearchCV(dtc, tree_params, cv=5, n_jobs=-1, verbose=True)
tree_grid.fit(X_train, y_train)
print('Best cross-validation parameters:',tree_grid.best_params_)

Fitting 5 folds for each of 207 candidates, totalling 1035 fits
Best cross-validation parameters: {'max_depth': 3, 'max_features': 19}


Based on the validation results, set the tree parameters:
* Max depth = 3
* Max features = 19
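Besides `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of the winning combination and the full results table, which helps to see whether other settings perform almost as well. The sketch below uses the bundled iris data as a stand-in so that it runs on its own. The names `demo_grid` and `results` are illustrative and the printed values do not apply to the star dataset.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (iris), used only so the sketch is self-contained
X_demo, y_demo = load_iris(return_X_y=True)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=11),
    {'max_depth': range(1, 5), 'max_features': range(1, 5)},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the best parameter combination
print('Best CV accuracy:', demo_grid.best_score_)

# Best-ranked candidates from the full results table
results = pd.DataFrame(demo_grid.cv_results_)
cols = ['param_max_depth', 'param_max_features', 'mean_test_score', 'rank_test_score']
print(results.sort_values('rank_test_score')[cols].head())
```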

In [10]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, max_features = 19, random_state=11)
dtc.fit(X_train, y_train)

#  record predictions on the test set
y_test_predict = dtc.predict(X_test)

Look at the main metrics of the trained model:

In [11]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print('(1) TREE DECISION - all features')
print()

# Print Confusion matrix:
print('Confusion matrix:')
cmatrix = confusion_matrix(y_test, y_test_predict)
print(cmatrix)
print()

# Print classification_report:
print('Classification_report:')
print(classification_report(y_test, y_test_predict))

(1) TREE DECISION - all features

Confusion matrix:
[[14  0  0  0  0  0]
 [ 0  9  0  0  0  0]
 [ 0  0 13  0  0  0]
 [ 0  0  0 17  0  0]
 [ 0  0  0  0  9  0]
 [ 0  0  0  0  0 10]]

Classification_report:
               precision    recall  f1-score   support

  Brown Dwarf       1.00      1.00      1.00        14
  Hypergiants       1.00      1.00      1.00         9
Main Sequence       1.00      1.00      1.00        13
    Red Dwarf       1.00      1.00      1.00        17
  Supergiants       1.00      1.00      1.00         9
  White Dwarf       1.00      1.00      1.00        10

     accuracy                           1.00        72
    macro avg       1.00      1.00      1.00        72
 weighted avg       1.00      1.00      1.00        72



The model achieves **100% precision, recall, f1-score and accuracy** on the 72 test samples.
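A perfect score on a single 72-row test split can depend on the particular split. `cross_val_score`, imported earlier but not used, gives one score per fold and so shows how stable the result is. The sketch below is a minimal illustration on the bundled iris data as a stand-in. The fold setup (`StratifiedKFold` with shuffling) is an assumption, not the notebook's configuration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (iris), used only so the sketch is self-contained
X_demo, y_demo = load_iris(return_X_y=True)

model = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=11)

# 5 stratified folds; each fold is used once as the held-out set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

print('Fold accuracies:', scores)
print('Mean:', scores.mean(), 'Std:', scores.std())
```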

Next, let's draw the tree:

In [12]:
from sklearn.tree import export_graphviz
export_graphviz(dtc, feature_names=(X_train.columns), out_file='dtc_all_features.dot', filled=True)

import graphviz
with open("./dtc_all_features.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)

ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.sources.Source at 0x7fb411253fd0>
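Rendering with `graphviz.Source` needs the Graphviz `dot` executable, which is missing in the environment above. When it is not available, scikit-learn's `export_text` prints the tree's rules as plain text without external tools. A minimal sketch on the bundled iris data as a stand-in (`demo_tree` is an illustrative name):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (iris), used only so the sketch is self-contained
iris = load_iris()

demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=11)
demo_tree.fit(iris.data, iris.target)

# Text rendering of the split rules; no Graphviz executable required
tree_rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(tree_rules)
```

For the notebook's model the corresponding call would be `export_text(dtc, feature_names=list(X_train.columns))`.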

#### Only 2 features for prediction

The model uses only 3 features: Absolute magnitude, Luminosity and Red (the dummy column for the red star color).  
Now, using the findings from the previous analysis, [STAR DATASET (1) Analysis](https://www.kaggle.com/arthurchebotkov/star-dataset-1-analysis), let's try to build another model using only 2 features. Key findings:  
(1) Absolute magnitude is correlated with the target (F-score 1496.53)  
(2) Radius is correlated with the target (F-score 1113.91)  
(3) Absolute magnitude and Radius are correlated with each other (Pearson coefficient = -0.61, which is weaker than the correlation between Absolute magnitude and Luminosity)  
(4) Spectral Class (p-value: 6.36e-52) and Star color (p-value: 4.35e-46) are correlated with the target, but Spectral Class and Star color are also correlated with each other (p-value: 1.94e-123)
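F-scores between numerical features and a categorical target, like those quoted above, can be computed with a one-way ANOVA F-test. The sketch below uses scikit-learn's `f_classif` on the bundled iris data as a stand-in. Whether the first notebook used this exact function is an assumption, and the printed values are unrelated to the star dataset.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

# Stand-in data (iris), used only so the sketch is self-contained
iris = load_iris()

# One-way ANOVA F-statistic and p-value for each feature against the class label
f_scores, p_values = f_classif(iris.data, iris.target)

for name, f, p in zip(iris.feature_names, f_scores, p_values):
    print(f'{name}: F = {f:.2f}, p-value = {p:.2e}')
```

A larger F-score means the feature's mean differs more between classes relative to its spread within classes.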

Select 2 features: one numerical, Absolute magnitude(Mv), and one categorical, Spectral Class.

In [13]:
X = data.drop(['Star type', 'Radius(R/Ro)', 'Star color', 'Temperature (K)', 'Spectral Class','Luminosity(L/Lo)'], axis=1)

spectral_class = data['Spectral Class']
spectral_class = pd.get_dummies(spectral_class)

X = pd.concat((X, spectral_class), axis=1)
X

Unnamed: 0,Absolute magnitude(Mv),A,B,F,G,K,M,O
0,16.12,0,0,0,0,0,1,0
1,16.60,0,0,0,0,0,1,0
2,18.70,0,0,0,0,0,1,0
3,16.65,0,0,0,0,0,1,0
4,20.06,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...
235,-9.93,0,0,0,0,0,0,1
236,-10.63,0,0,0,0,0,0,1
237,-10.73,1,0,0,0,0,0,0
238,-11.23,1,0,0,0,0,0,0


In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

In [15]:
tree_params = {'max_depth': range(1,10),'max_features': range(1,9)}
dtc = DecisionTreeClassifier(criterion='entropy', random_state=10)
tree_grid = GridSearchCV(dtc, tree_params, cv=5, n_jobs=-1, verbose=True)
tree_grid.fit(X_train, y_train)
print('Best cross-validation parameters:',tree_grid.best_params_)

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best cross-validation parameters: {'max_depth': 3, 'max_features': 7}


In [16]:
dtc = DecisionTreeClassifier(criterion='entropy', max_depth=3, max_features = 7, random_state=10)
dtc.fit(X_train, y_train) 

#  record predictions on the test set
y_test_predict = dtc.predict(X_test)

print('TREE DECISION')
print()

# Print Confusion matrix:
print('Confusion matrix:')
cmatrix = confusion_matrix(y_test, y_test_predict)
print(cmatrix)
print()

# Print classification_report:
print('Classification_report:')
print(classification_report(y_test, y_test_predict))

TREE DECISION

Confusion matrix:
[[12  0  0  0  0  0]
 [ 0 13  0  0  0  0]
 [ 0  0 11  0  0  0]
 [ 0  0  0 15  0  0]
 [ 0  0  0  0  8  0]
 [ 0  0  0  0  0 13]]

Classification_report:
               precision    recall  f1-score   support

  Brown Dwarf       1.00      1.00      1.00        12
  Hypergiants       1.00      1.00      1.00        13
Main Sequence       1.00      1.00      1.00        11
    Red Dwarf       1.00      1.00      1.00        15
  Supergiants       1.00      1.00      1.00         8
  White Dwarf       1.00      1.00      1.00        13

     accuracy                           1.00        72
    macro avg       1.00      1.00      1.00        72
 weighted avg       1.00      1.00      1.00        72



In [17]:
from sklearn.tree import export_graphviz

export_graphviz(dtc, feature_names=(X_train.columns), out_file='dtc_2features.dot', filled=True)

In [19]:
import graphviz

with open("dtc_2features.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph)

ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.sources.Source at 0x7f96d9dfb460>

The model uses only the **Absolute magnitude** feature and membership of **Spectral Class M**.  
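Which columns a fitted tree actually splits on can also be read from its `feature_importances_` attribute: features that never appear in a split have an importance of zero. A minimal sketch on the bundled iris data as a stand-in (`demo_tree` and `used` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (iris), used only so the sketch is self-contained
iris = load_iris()

demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=10)
demo_tree.fit(iris.data, iris.target)

# Importance per feature; zero means the feature is not used in any split
importances = pd.Series(demo_tree.feature_importances_, index=iris.feature_names)
used = importances[importances > 0].sort_values(ascending=False)
print(used)
```

For the notebook's model the same idea would be `pd.Series(dtc.feature_importances_, index=X_train.columns)`.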
  
The model also achieves **100% precision, recall, f1-score and accuracy** with other pairs of features:
* Absolute magnitude and Radius
* Absolute magnitude and Star color
* Absolute magnitude and Luminosity

#### --- NEXT: APPLYING OTHER MACHINE LEARNING MODELS ---