## About this file and all files related to my undergraduate senior research project

This is "Data Preparation.ipynb", the notebook that prepares the data used in my undergraduate senior research project, "Galaxy Morphology Classification via Machine Learning."

Summary of the files related to the senior research project, "Galaxy Morphology Classification via Machine Learning":
# 1. Data
・Galaxy Zoo --> "gz.csv"\
・SDSS DR7 --> "result.csv", "result (1).csv", "result (2).csv", "result (3).csv", "result (4).csv", "result (5).csv"
# 2. Code
・Data Preparation --> "Data Preparation.ipynb"\
・K-means clustering --> "K-means Clustering .ipynb"\
・SVM --> "SVM.ipynb"
# 3. Presentation
・Progress report used at colloquium on October 16th, 2019 --> "colloquium.key"
# 4. Paper
・Paper submitted on December 4th, 2019 --> "final paper.pdf"

If you have any questions, please contact me (kanta29.1996@gmail.com).


In [1]:
import pandas as pd
import numpy as np
from IPython.display import Image
%matplotlib inline

# Merge Galaxy Zoo data with SDSS DR7 data.
a0 = pd.read_csv("gz.csv")  # Read the Galaxy Zoo data.
a = a0.rename(columns={'OBJID': 'objID'})  # Rename OBJID to objID to match the SDSS DR7 column name.
b1 = pd.read_csv("result.csv")  # Randomly extracted galaxy data from SDSS DR7, split across 6 CSV files.
b2 = pd.read_csv("result (1).csv")
b3 = pd.read_csv("result (2).csv")
b4 = pd.read_csv("result (3).csv")
b5 = pd.read_csv("result (4).csv")
b6 = pd.read_csv("result (5).csv")
b = pd.concat([b1, b2, b3, b4, b5, b6])  # Concatenate "b1" through "b6" into "b".
# Merge a (Galaxy Zoo) with b (SDSS DR7) on objID. The outer join followed by dropna
# keeps only rows with no missing values, which excludes galaxies present in only one table.
data = pd.merge(a, b, how="outer", on="objID").dropna(axis=0)
data.set_index("objID", inplace=True)  # Use "objID" as the index.

# Display the original data size.
print("Initial data size") 
print(data.shape)

# Calculate "Color index" columns.
data["u-g"] = data["u"] - data["g"]
data["g-r"] = data["g"] - data["r"]
data["r-i"] = data["r"] - data["i"]
data["i-z"] = data["i"] - data["z"]

# Calculate "Concentration" columns.
data["concentration_u"] = data["petroR50_u"]/data["petroR90_u"]
data["concentration_g"] = data["petroR50_g"]/data["petroR90_g"]
data["concentration_r"] = data["petroR50_r"]/data["petroR90_r"]
data["concentration_i"] = data["petroR50_i"]/data["petroR90_i"]
data["concentration_z"] = data["petroR50_z"]/data["petroR90_z"]
data["concentration_avg"] = (data["concentration_u"] + data["concentration_g"] + data["concentration_r"] + data["concentration_i"] + data["concentration_z"]) / 5


# Add "class" column.
class1 = data["ELLIPTICAL"].tolist() # Put each flag column in a list.
class2 = data["SPIRAL"].tolist()
class3 = data["UNCERTAIN"].tolist()
# Map each flag to a label: "E" (elliptical), "S" (spiral), "U" (uncertain); unset flags become 0.
Class1 = ["E" if i == 1.0 else 0 for i in class1]
Class2 = ["S" if i == 1.0 else 0 for i in class2]
Class3 = ["U" if i == 1.0 else 0 for i in class3]
# Combine the three lists into "Class": keep "E", otherwise take "S", otherwise take "U".
Class = Class1
for index, item in enumerate(Class):
    if item == "E":
        continue
    else:
        Class[index] = Class2[index]
for index, item in enumerate(Class):
    if item == 0:
        Class[index] = Class3[index]
    else:
        continue
data['class'] = np.array(pd.Series(Class))

# Drop the uncertain galaxies, which are not used in this research.
data.drop(data[data['class'] == "U" ].index, inplace=True)

# Delete rows that contain outliers (the boundaries were chosen after examining the data).
data = data.drop(data[(data["u-g"] < 1) | (data["u-g"] > 2.5) |
                      (data["g-r"] > 1.6) | (data["g-r"] < 0.38) |(data["concentration_u"] < 0) |
                      (data["concentration_g"] < 0.28) | (data["concentration_g"] > 0.57)|
                      (data["concentration_r"] > 0.55) | (data["concentration_r"] < 0.25)].index)

# Delete unnecessary columns to minimize the data size.
def delete(a):
    for i in a:
        del data[i]
delcol = ['NVOTE', 'P_EL', 'P_CW', 'P_ACW', 'P_EDGE', 'P_DK',
         'P_MG', 'P_CS', 'P_EL_DEBIASED', 'P_CS_DEBIASED',
         'SPIRAL', 'ELLIPTICAL', 'UNCERTAIN', 'petroR50_u',
         'petroR50_g', 'petroR50_r', 'petroR50_i', 'petroR50_z',
         'petroR90_u', 'petroR90_g', 'petroR90_r', 'petroR90_i',
         'petroR90_z','u', 'g', 'r', 'i', 'z']
delete(delcol)

# Display the data size used for supervised machine learning (SVM).
print("Data size used for supervised machine learning (SVM)")
print(data.shape)

# Save the prepared data as "data_sl.csv", which is used for supervised machine learning (SVM).
# Note that index=False leaves the objID index out of the file.
data.to_csv("data_sl.csv", index=False)

# Galaxy Zoo data contain more spiral galaxies than elliptical galaxies,
# so randomly downsample each class to the size of the smallest class.
g = data.groupby('class')
data = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

# Display the data size used for unsupervised machine learning (k-means clustering).
print("Data size used for unsupervised machine learning (k-means clustering)")
print(data.shape)

# Save the prepared data as "data_ul.csv", which is used for unsupervised machine learning (k-means clustering).
data.to_csv("data_ul.csv", index=False)

Initial data size
(58614, 45)
Data size used for supervised machine learning (SVM)
(20862, 28)
Data size used for unsupervised machine learning (k-means clustering)
(10442, 28)
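
The class balancing at the end of the cell above can also be sketched with `DataFrameGroupBy.sample`, available in pandas 1.1 and later. This is a minimal illustration on made-up data, not the code used in the project: the helper name `balance_classes`, the `seed` parameter, and the toy values are assumptions. Fixing a seed makes the random downsampling repeatable, which the original cell does not do.

```python
import pandas as pd


def balance_classes(df: pd.DataFrame, label: str = "class", seed: int = 0) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    n_min = df[label].value_counts().min()  # size of the rarest class
    # Draw n_min rows at random from each class; random_state makes the draw repeatable.
    return df.groupby(label).sample(n=n_min, random_state=seed)


# Toy data (assumed values): five spirals and two ellipticals.
toy = pd.DataFrame({
    "class": ["S"] * 5 + ["E"] * 2,
    "u-g": [1.1, 1.2, 1.3, 1.4, 1.5, 1.8, 1.9],
})

balanced = balance_classes(toy)
print(balanced["class"].value_counts().to_dict())
```

Unlike the `apply`-based version, this keeps the original row index, so `objID` would stay attached to each sampled galaxy.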


In [5]:
a.tail()

Unnamed: 0,objID,RA,DEC,NVOTE,P_EL,P_CW,P_ACW,P_EDGE,P_DK,P_MG,P_CS,P_EL_DEBIASED,P_CS_DEBIASED,SPIRAL,ELLIPTICAL,UNCERTAIN
667939,587727226763870322,23:59:58.76,-09:41:34.7,35,0.171,0.8,0.0,0.029,0.0,0.0,0.829,0.057,0.943,1,0,0
667940,587730775499407475,23:59:58.78,+15:49:01.3,21,0.81,0.048,0.0,0.095,0.048,0.0,0.143,0.758,0.193,0,0,1
667941,587727223024124280,23:59:58.81,+15:39:49.4,28,0.286,0.0,0.071,0.393,0.179,0.071,0.464,0.099,0.603,0,0,1
667942,587730774425600239,23:59:59.02,+15:09:18.8,23,0.391,0.0,0.043,0.0,0.13,0.435,0.043,0.39,0.045,0,0,1
667943,587727177912615023,23:59:59.37,-11:11:31.5,54,0.556,0.0,0.037,0.333,0.074,0.0,0.37,0.153,0.722,1,0,0
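
The `SPIRAL`, `ELLIPTICAL`, and `UNCERTAIN` columns shown above are 0/1 flags, and the first cell turns them into a single `class` label with list loops. A vectorized alternative using `numpy.select` might look like the following sketch. The toy `flags` table and the `"unknown"` placeholder for rows with no flag set are assumptions for illustration; the original code leaves such rows as 0.

```python
import numpy as np
import pandas as pd

# Toy flag table (assumed values) mimicking the Galaxy Zoo flag columns.
flags = pd.DataFrame({
    "SPIRAL":     [1, 0, 0, 0],
    "ELLIPTICAL": [0, 1, 0, 0],
    "UNCERTAIN":  [0, 0, 1, 0],
})

# Conditions are checked in order, so "E" takes priority, then "S", then "U",
# matching the order used by the loops in the first cell.
conditions = [
    flags["ELLIPTICAL"] == 1,
    flags["SPIRAL"] == 1,
    flags["UNCERTAIN"] == 1,
]
labels = ["E", "S", "U"]

# Rows where no flag is set receive the placeholder label (an assumption).
flags["class"] = np.select(conditions, labels, default="unknown")
print(flags)
```

The uncertain galaxies could then be dropped with a boolean mask such as `flags[flags["class"] != "U"]`.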


In [6]:
b1.info

<bound method DataFrame.info of                     objID  petroR50_u  petroR50_g  petroR50_r  petroR50_i  \
0      587731187277627676    1.585448    1.580472    1.492296    1.556319   
1      587727223024189605    2.090823    1.754263    1.764493    1.688140   
2      587730774425665704    1.967605    1.832129    1.729350    1.651205   
3      587727178449485858    2.053708    1.599930    1.786578    1.981697   
4      587731187277693069    2.051601    1.677627    1.595021    1.666904   
...                   ...         ...         ...         ...         ...   
10367  587724241234559152    3.757246    2.501727    2.406140    2.363548   
10368  587727177938501705    2.169776    3.222019    2.946065    3.198834   
10369  587727178475438130    1.331477    1.289969    1.272132    1.271538   
10370  587724241234624712    4.898431    1.944715    2.026261    1.827387   
10371  587727179549245664    8.151965    3.103450    2.468091    2.439775   

       petroR50_z  petroR90_u  petroR90_g  