### Data Source
##### Sloan Digital Sky Survey DR14: Classification of Stars, Galaxies and Quasars
##### https://www.kaggle.com/lucidlenn/sloan-digital-sky-survey

##### Description
10,000 observations of space taken by the Sloan Digital Sky Survey (SDSS). Each observation has 17 feature columns and 1 class column that identifies it as a star, galaxy, or quasar.

##### Variables/Columns
 objid = Object Identifier (PhotoObj table) [unique-drop]<br>
 ra = Right Ascension (PhotoObj table) [numerical]<br>
 dec = Declination (PhotoObj table) [numerical]<br>
 u, g, r, i, z = 5 bands of the telescope (per the Gunn-Thuan griz astronomical magnitude system) [numerical]<br>
 run = Run Number identifies the specific scan [categorical-23]<br>
 rerun = specifies how image was processed [unique-drop]<br>
 camcol = Camera Column (1 - 6) identifies scanline w/in the Run [categorical-6]<br>
 field = Field Number; typically starts at 11 (after an initial ramp-up time) and can be as large as 800 for longer runs [categorical-703]<br>
 specobjid = Object Identifier [categorical-6349]<br>
 class = Object Class [Classification Labels/categorical-3]<br>
 redshift = Final Redshift [numerical; 9637 unique values]<br>
 plate = Round aluminum plate with holes drilled at the positions of objects of interest, through which optical fibers pass [categorical-487]<br>
 mjd = Modified Julian Date (of Observation) [categorical-355]<br>
 fiberid = Optical Fiber ID  [categorical-892]
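
The bracketed annotations above (unique counts, columns to drop) can be checked with `DataFrame.nunique()`. The sketch below uses a small hypothetical frame with made-up values, not the SDSS file, to show how constant columns can be flagged as candidates for dropping.

```python
import pandas as pd

# Hypothetical toy frame standing in for the SDSS table (values are made up).
toy = pd.DataFrame({
    "rerun": [301, 301, 301, 301],
    "camcol": [1, 2, 2, 6],
    "redshift": [0.01, 0.20, 1.50, 0.02],
    "class": ["STAR", "GALAXY", "QSO", "STAR"],
})

# Number of distinct values per column.
unique_counts = toy.nunique()

# Columns with a single distinct value carry no information for a model.
constant_cols = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts)
print("Constant columns:", constant_cols)
```

Applied to the real data, the same two lines would run on `obsrv` once it has been loaded.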

### Load CSV Data/Dependencies

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings
warnings.simplefilter('ignore')
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [2]:
obsrv = pd.read_csv('./data/Skyserver_SQL2_27_2018 6_51_39 PM.csv')

### Review/Clean Data 
See obsv_model_InitDataAnalysis.ipynb for the initial data review.

In [3]:
df = pd.DataFrame(obsrv, columns=['ra','dec','u','g','r','i','z','class', 'redshift'])

### Preprocess Data
##### Update Class Column from STRING to INT

In [4]:
# Encode class names as integers. LabelEncoder sorts the labels, so
# GALAXY -> 0, QSO -> 1, STAR -> 2.
le = LabelEncoder()

df['class'] = le.fit_transform(df['class'])
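
Once the labels are integers, it helps to be able to map them back to class names. A minimal sketch on a hypothetical four-item label list (`le_demo` and `labels` are illustrative names, not part of the notebook):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels using the three class names from the dataset.
labels = ["STAR", "GALAXY", "QSO", "GALAXY"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the sorted class names; the position is the integer code.
mapping = {name: int(code) for code, name in enumerate(le_demo.classes_)}

# inverse_transform recovers the original strings from the integer codes.
decoded = le_demo.inverse_transform(encoded)

print(mapping)
print(encoded, decoded)
```

The same `classes_` and `inverse_transform` calls should work on the notebook's `le` after it has been fit.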

### Logistic Regression (Multinomial)

In [5]:
df.describe()

| | ra | dec | u | g | r | i | z | class | redshift |
|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 175.529987 | 14.836148 | 18.619355 | 17.371931 | 16.840963 | 16.583579 | 16.422833 | 0.9154 | 0.143726 |
| std | 47.783439 | 25.212207 | 0.828656 | 0.945457 | 1.067764 | 1.141805 | 1.203188 | 0.952856 | 0.388774 |
| min | 8.2351 | -5.382632 | 12.98897 | 12.79955 | 12.4316 | 11.94721 | 11.61041 | 0.0 | -0.004136 |
| 25% | 157.370946 | -0.539035 | 18.178035 | 16.8151 | 16.173333 | 15.853705 | 15.618285 | 0.0 | 8.1e-05 |
| 50% | 180.394514 | 0.404166 | 18.853095 | 17.495135 | 16.85877 | 16.554985 | 16.389945 | 1.0 | 0.042591 |
| 75% | 201.547279 | 35.649397 | 19.259232 | 18.010145 | 17.512675 | 17.25855 | 17.141447 | 2.0 | 0.092579 |
| max | 260.884382 | 68.542265 | 19.5999 | 19.91897 | 24.80204 | 28.17963 | 22.83306 | 2.0 | 5.353854 |


#### Assign Data to Variables
X is the Feature Matrix, y is the Response Vector

In [6]:
X = df[["u", "g","r","i","z"]]
y = df["class"]
print("Shape: ", X.shape, y.shape)

Shape:  (10000, 5) (10000,)


#### Create Model ("Classifier")

In [7]:
# from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

LogisticRegression()

#### Test/Train Split, Fit, Score

In [8]:
# from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
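
The split above is random and unseeded, so the score changes from run to run. One possible variation, sketched on hypothetical imbalanced data rather than the SDSS table, fixes `random_state` for reproducibility and uses `stratify` to keep class proportions the same in both subsets. The array names and class counts here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, three imbalanced classes (50/10/40).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 50 + [1] * 10 + [2] * 40)

# stratify keeps the 50/10/40 ratio in both subsets;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# A second call with the same seed returns the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts: ", np.bincount(y_te))
print("Same split on repeat:", np.array_equal(X_te, X_te2))
```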

In [9]:
classifier.fit(X_train, y_train)

LogisticRegression()

In [10]:
classifier.score(X_test, y_test)

0.9376666666666666

In [11]:
y_pred = classifier.predict(X_test)

In [12]:
# from sklearn import metrics
print("Accuracy of Logistic Regression model is:",
      metrics.accuracy_score(y_test, y_pred) * 100)

Accuracy of Logistic Regression model is: 93.76666666666667
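
A single accuracy figure does not show which classes are being confused. The sketch below uses `sklearn.metrics` on a small set of hand-written, hypothetical labels (not the model's actual predictions) to illustrate a confusion matrix and per-class recall.

```python
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only.
y_true_demo = ["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR", "STAR"]
y_pred_demo = ["GALAXY", "GALAXY", "STAR", "QSO", "GALAXY", "STAR", "STAR", "GALAXY"]
class_order = ["GALAXY", "QSO", "STAR"]

# Rows are true classes, columns are predicted classes, both in class_order.
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=class_order)

# Overall accuracy and recall for each class.
acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
recall = metrics.recall_score(
    y_true_demo, y_pred_demo, labels=class_order, average=None
)

print(cm)
print("Accuracy:", acc)
print(metrics.classification_report(y_true_demo, y_pred_demo, labels=class_order))
```

In the notebook, the same calls could be given `y_test` and `y_pred` in place of the demo lists.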


In [13]:
# Rebuild the frame from the raw data so 'class' keeps its original string labels
df2 = pd.DataFrame(obsrv, columns=['u','g','r','i','z','class'])

In [14]:
# Assign X (data) and y (target)
X = df2.drop("class", axis=1)
y = df2["class"]
print(X.shape, y.shape)

(10000, 5) (10000,)


In [15]:
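# Compare iteration limits and penalties. Each pass draws a new random
# train/test split (default 75/25), so part of the score variation below
# comes from the split rather than from the settings.
# Note: scikit-learn 1.2 deprecated the string 'none' in favor of penalty=None.
for num_iter in [50, 100, 200, 500, 1000]:
    for pen in ['none','l2']:
        print(num_iter, pen)
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        classifier = LogisticRegression(penalty=pen, max_iter=num_iter)
        classifier.fit(X_train, y_train)
        print(f"Training Data Score: {classifier.score(X_train, y_train)}")
        print(f"Testing Data Score: {classifier.score(X_test, y_test)}")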

50 none
Training Data Score: 0.9246666666666666
Testing Data Score: 0.9212
50 l2
Training Data Score: 0.9190666666666667
Testing Data Score: 0.9204
100 none
Training Data Score: 0.9394666666666667
Testing Data Score: 0.9376
100 l2
Training Data Score: 0.9308
Testing Data Score: 0.9432
200 none
Training Data Score: 0.942
Testing Data Score: 0.9436
200 l2
Training Data Score: 0.9332
Testing Data Score: 0.9436
500 none
Training Data Score: 0.9444
Testing Data Score: 0.9404
500 l2
Training Data Score: 0.936
Testing Data Score: 0.936
1000 none
Training Data Score: 0.9428
Testing Data Score: 0.94
1000 l2
Training Data Score: 0.9350666666666667
Testing Data Score: 0.9372
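
Because every configuration above is scored on a different random split, the differences between rows are hard to interpret. One alternative, which is not what the notebook does, is to score each setting with the same cross-validation folds. The sketch below runs on synthetic data from `make_classification`, and the `StandardScaler` step is an added assumption that usually helps the solver converge within fewer iterations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with 5 features, standing in for the u, g, r, i, z columns.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# The same seeded folds are reused for every setting, so scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for num_iter in [50, 200, 1000]:
    # Scale the features, then fit logistic regression with the given iteration cap.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=num_iter))
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    results[num_iter] = scores.mean()
    print(f"max_iter={num_iter}: mean accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")
```

With fixed folds, any remaining difference between settings reflects the setting itself rather than the luck of the split.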


In [16]:
# 'classifier' is the last model fit in the loop above (max_iter=1000, penalty='l2')
predictions = classifier.predict(X_test)
print(f"First 10 Predictions:   {predictions[:10]}")
print(f"First 10 Actual labels: {y_test[:10].tolist()}")

First 10 Predictions:   ['STAR' 'GALAXY' 'STAR' 'GALAXY' 'STAR' 'QSO' 'STAR' 'GALAXY' 'STAR'
 'GALAXY']
First 10 Actual labels: ['STAR', 'GALAXY', 'STAR', 'GALAXY', 'STAR', 'QSO', 'STAR', 'GALAXY', 'STAR', 'GALAXY']
