# Intro

**Sloan Digital Sky Survey - DR18**

In this machine learning project we classify observations from Data Release 18 (DR18) of the Sloan Digital Sky Survey (SDSS). Each observation is described by 42 features and one class column that labels the observation as one of:
* a STAR
* a GALAXY
* a QSO (Quasi-Stellar Object), also called a quasar.

We will use **CatBoostClassifier** from catboost and **XGBClassifier** from xgboost.

# Load packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from catboost import CatBoostClassifier
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import accuracy_score, classification_report

%matplotlib inline

# Load the data

In [None]:
dataset = pd.read_csv('../input/sloan-digital-sky-survey-dr18/SDSS_DR18.csv')

# Data Exploration and Analysis

In [None]:
dataset.head()

In [None]:
# Show number of rows and columns (m, n)
dataset.shape

**Check for null or missing values in the data**

In [None]:
# Largest per-column count of missing values (0 means no column has any)
null = dataset.isnull().sum().max()

if null == 0:
    print('There are no missing values')
else:
    print('There are missing values')



Let's print a concise summary of the dataset using the info() method: the index dtype, the column dtypes, the non-null counts and the memory usage.

In [None]:
dataset.info()


Let’s get a quick statistical summary of the dataset using the describe() method. It computes basic statistics for each numeric column, such as the count of data points, mean, standard deviation, extreme values and quartiles. Missing (NaN) values are skipped automatically. describe() gives a good first picture of how the data is distributed.

In [None]:
dataset.describe()

In [None]:
dataset.columns.values

**Target Column**

In [None]:
dataset['class'].value_counts()  # returns a Series containing counts of unique values.

In [None]:
sns.countplot(x = dataset['class'])
plt.title('Class Categories')

Let's use **LabelEncoder** to encode the target labels as integers between 0 and n_classes-1. Here the target (class) has three unique values: GALAXY, QSO and STAR. LabelEncoder sorts the labels alphabetically, so they become 0, 1 and 2 respectively.

In [None]:
encoder = LabelEncoder()
dataset['class'] = encoder.fit_transform(dataset['class'])

In [None]:
dataset['class'].value_counts()

A **correlation matrix** is a table of correlation coefficients between variables: each cell holds the correlation between two variables, ranging from -1 (perfect negative) to 1 (perfect positive).

We will use a heatmap from seaborn to visualize the correlation between variables.
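
Each cell of the matrix is a Pearson coefficient (the default method of pandas' `corr()`). As a rough sketch of what one cell contains, the helper below (a hypothetical `pearson` function, written here only for illustration) computes it for two toy columns:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]   # increases exactly in step with a
c = [8.0, 6.0, 4.0, 2.0]   # decreases exactly in step with a

print(pearson(a, b))  # close to +1
print(pearson(a, c))  # close to -1
```

Values near +1 or -1 in the heatmap therefore flag pairs of features that carry almost the same information.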

In [None]:
corr_matrix = dataset.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, center=0, square=True, vmin=-1, vmax=1)
plt.title('Correlation Matrix')

# Data preprocessing

In [None]:
X = dataset.drop('class', axis=1)
y = dataset['class']

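In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, 
                                                    test_size=0.3, 
                                                    shuffle=True, 
                                                    random_state=44)

In [None]:
# Fit the scaler on the training split only, then reuse it on the test split,
# so that no information from the test data leaks into preprocessing
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)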

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
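
Min-max scaling maps each feature to the range [0, 1] using a minimum and maximum learned from data. The sketch below is a simplified pure-Python illustration, not scikit-learn's implementation. The helper names and toy rows are assumptions. It learns the statistics from the training rows and reuses them on the test rows:

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the given rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]

def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned statistics: (x - min) / (max - min)."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0   # constant column maps to 0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

train_rows = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]]   # toy values
test_rows = [[25.0, 6.0]]

mins, maxs = fit_min_max(train_rows)                    # statistics from training rows only
train_scaled = apply_min_max(train_rows, mins, maxs)
test_scaled = apply_min_max(test_rows, mins, maxs)

print(train_scaled)
print(test_scaled)
```

Test values that lie outside the training range end up outside [0, 1], which is expected when the statistics come from the training rows alone.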

# The Model

**CatBoost** (Categorical Boosting) is a high-performance open-source library for gradient boosting on decision trees. It can be used for regression and classification problems, including ones with a very large number of independent features.

In [None]:
model = CatBoostClassifier(iterations=150,
                           learning_rate=0.1,
                           depth=5)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred[:10], y_test[:10]

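Calculating the accuracy score. For a binary problem this is **(TP + TN) / (TP + TN + FP + FN)**. With three classes it generalizes to the number of correct predictions divided by the total number of predictions.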
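
With three classes, accuracy is simply the share of test objects whose predicted label matches the true one. A minimal sketch on toy labels (these are not results from the models in this notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true_toy = [0, 2, 1, 0, 2, 1]
y_pred_toy = [0, 2, 1, 1, 2, 0]   # the 4th and 6th predictions are wrong

print(accuracy(y_true_toy, y_pred_toy))
```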

In [None]:
Acc = accuracy_score(y_test, y_pred)
print(f'Accuracy score for CatBoostClassifier: {Acc:.4f}')

In [None]:
# classification_report expects the true labels first, then the predictions
print('Classification report for CatBoostClassifier model: \n', classification_report(y_test, y_pred))
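
The report lists precision and recall per class, and scikit-learn's `classification_report` takes the true labels as its first argument and the predictions as its second. The sketch below uses a hypothetical helper `precision_recall` on toy labels to show what the two metrics mean and why the argument order matters:

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed members of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 1]

print(precision_recall(y_true_toy, y_pred_toy, label=1))  # correct argument order
print(precision_recall(y_pred_toy, y_true_toy, label=1))  # swapped arguments
```

Swapping the two arguments exchanges precision and recall for every class, so the report would still look plausible while describing the wrong quantities.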

**XGBoost** ("Extreme Gradient Boosting") is an optimized, distributed gradient boosting library designed for efficient and scalable training of machine learning models. It is an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction, and it has become one of the most popular and widely used machine learning algorithms.
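
The idea of combining weak models can be sketched in a few lines. The code below is a deliberately simplified illustration of gradient boosting for squared loss on a single toy feature, with one-split "stumps" as weak learners. It is not XGBoost's actual algorithm, which adds regularization and second-order gradient information. All names and data here are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:          # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate):
    """Each round fits a stump to the current residuals and adds a damped copy of it."""
    base = sum(ys) / len(ys)                        # start from the mean prediction
    stumps = []
    current = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, current)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        current = [p + learning_rate * stump(x) for x, p in zip(xs, current)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]                 # toy feature
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]                 # toy target

weak = boost(xs, ys, n_rounds=1, learning_rate=0.1)
strong = boost(xs, ys, n_rounds=50, learning_rate=0.1)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

In this sketch `n_rounds` and `learning_rate` play roughly the roles of `n_estimators` and `learning_rate` in XGBClassifier: more rounds add more weak learners, and a smaller learning rate makes each one contribute less.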

In [None]:
xgbModel = XGBClassifier(n_estimators=50,            # Number of trees we want to build
                         max_depth=4,                # How deeply each tree is allowed to grow
                         learning_rate=0.1,          # Step size 
                         objective='multi:softprob') # Multiclass loss (the target has three classes)
xgbModel.fit(X_train, y_train)

In [None]:
preds = xgbModel.predict(X_test)

In [None]:
preds[:10], y_test[:10]

In [None]:
Acc = accuracy_score(y_test, preds)
print(f'Accuracy score for XGBClassifier: {Acc:.4f}')

In [None]:
# classification_report expects the true labels first, then the predictions
print('Classification report for XGBClassifier model: \n', classification_report(y_test, preds))

# Feature importance with XGBoost

In [None]:
# Let's plot top 10 most important features
plot_importance(xgbModel, max_num_features=10)
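
Because the model was trained on a NumPy array, XGBoost labels the features generically as f0, f1, and so on. The sketch below shows one way to map those labels back to column names. The `feature_names` list and the `scores` dictionary are hypothetical stand-ins. In the notebook the names would come from the original DataFrame columns and the scores from something like `xgbModel.get_booster().get_score(importance_type='weight')`.

```python
# Hypothetical column names and importance scores, for illustration only
feature_names = ["ra", "dec", "u", "g", "redshift"]
scores = {"f4": 120.0, "f2": 35.0, "f0": 8.0}   # stand-in for the booster's scores

# "f4" -> index 4 -> feature_names[4]
named = {feature_names[int(key[1:])]: value for key, value in scores.items()}

# Sort from most to least important
top = sorted(named.items(), key=lambda item: item[1], reverse=True)
print(top)
```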