# Objective of the project 🚀

My objective is to build a machine learning model that classifies astronomical objects and then compare it with a convolutional neural network (CNN) trained on images. First, I will use Spark MLlib to train and evaluate the model. Then, I will build the CNN and compare both 🪐




# First Steps

## Download the data

The source of the data is: https://skyserver.sdss.org/CasJobs/
You need to register and log in to download the data. Then, you have to write a query specifying:
- Number of rows
- Columns
- Where to keep the csv
- The database

As I want to get as much data as possible, I will not set a maximum number of rows.

I also add a "where" clause so I only get data from galaxies and stars:
- type = 3: Galaxies
- type = 6: Stars

Then I downloaded the dataset to start working with it.
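The query itself is not shown here, so the following is only a sketch of what it might look like, built as a Python string. The table name `PhotoObj`, the output table `mydb.primaryObjs` and the reduced column list are assumptions for illustration, not the query that was actually run.

```python
# Hypothetical reconstruction of the CasJobs query.
# Table name, output name and column subset are assumptions.
COLUMNS = ["objID", "ra", "dec", "type", "petroRad_r", "modelMag_r", "psfMag_r"]


def build_query(columns, object_types, table="PhotoObj", output="mydb.primaryObjs"):
    """Build a SELECT ... INTO ... WHERE type IN (...) statement as text."""
    select_list = ", ".join(columns)
    type_list = ", ".join(str(t) for t in object_types)
    return (
        f"SELECT {select_list}\n"
        f"INTO {output}\n"
        f"FROM {table}\n"
        f"WHERE type IN ({type_list})"
    )


# type = 3 (galaxies) and type = 6 (stars), as listed above
query = build_query(COLUMNS, object_types=[3, 6])
print(query)
```

No row limit is added, which matches the decision to download as much data as possible.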

<img src="/home/haizeagonzalez/myproject/bigDataAstronomy/notebookImages/img1.png">

## Understanding the data

The columns we have are:
- objID: Unique identifier of the object → TYPE bigInt
- ra: Right ascension → TYPE float
- dec: Declination → TYPE float
- petroRad: Petrosian radius, used to measure the size of galaxies in astronomical images. It is the radius at which the local surface brightness drops to a fixed fraction of the mean surface brightness inside that radius. It is widely used because it is largely independent of distance and brightness. It is measured in each photometric filter → TYPE real
    - petroRad_u: Near-ultraviolet
    - petroRad_g: Blue-Green
    - petroRad_r: Red
    - petroRad_i: Near-infrared
    - petroRad_z: Deeper infrared

- modelMag: Brightness measure obtained by fitting a galaxy model. Typically used for galaxies. Available for all filters (u, g, r, i and z) → TYPE real
- psfMag: Brightness measure based on the point-source light profile. Typically used for stars. Available for all filters (u, g, r, i and z) → TYPE real
- u_g: (modelMag_u - modelMag_g)
- g_r: (modelMag_g - modelMag_r)
- r_i: (modelMag_r - modelMag_i)
- i_z: (modelMag_i - modelMag_z)
- fracDeV: Fraction of the object's light that is fitted by the de Vaucouleurs profile (between 0 and 1). Available for all filters (u, g, r, i and z) → TYPE real
- flags: Bit combination that encodes different characteristics of the object. Converting it to binary and checking the SDSS documentation gives the meaning of each bit → TYPE bigInt
- clean: Indicator that tells us whether the object's photometry is clean → TYPE int
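Reading the flags column bit by bit can be sketched in plain Python. The helper names below are hypothetical, and no SDSS flag names are assumed: the meaning of each position still has to be looked up in the SDSS documentation.

```python
def set_bits(flags: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits set in `flags`."""
    return [pos for pos in range(flags.bit_length()) if (flags >> pos) & 1]


def has_bit(flags: int, position: int) -> bool:
    """Check whether the bit at `position` is set."""
    return (flags >> position) & 1 == 1


example = 0b10010  # illustrative value, not taken from the dataset
print(set_bits(example))    # [1, 4]
print(has_bit(example, 4))  # True
```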



### What for?

PetroRad:
- Stars: Small and constant in all filters.
- Galaxies: Bigger, and it varies with the wavelength.

ModelMag and psfMag:
- In the red filter:
    - Stars: modelMag_r ≈ psfMag_r
    - Galaxies: modelMag_r < psfMag_r (the model fit collects more light than the point-source fit, and a brighter object has a smaller magnitude)
- In other filters:
    - Galaxies are usually redder (modelMag_g - modelMag_r is large).
    - Stars have different colors depending on their type.

fracDeV:
- Stars: fracDeV ≈ 0.
- Galaxies: fracDeV ≈ 1 (elliptical) or fracDeV < 1 (spiral).
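As an illustration of the magnitude comparison, here is a minimal sketch of a rule-based check. Magnitudes get smaller as objects get brighter, so an extended object shows a positive psfMag_r - modelMag_r difference. The function name is hypothetical, and the 0.145 threshold is an assumption borrowed from the SDSS star/galaxy separator (which uses cModelMag rather than modelMag), so treat it as illustrative only.

```python
def looks_extended(psf_mag_r: float, model_mag_r: float, threshold: float = 0.145) -> bool:
    """Return True when the point-source fit misses enough light to suggest a galaxy.

    The threshold is an assumed value, not one tuned on this dataset.
    """
    return (psf_mag_r - model_mag_r) > threshold


# Made-up magnitudes for illustration
print(looks_extended(psf_mag_r=18.0, model_mag_r=17.2))   # galaxy-like
print(looks_extended(psf_mag_r=18.0, model_mag_r=17.98))  # star-like
```

The machine learning model is expected to learn a boundary of this kind from the data instead of relying on a fixed threshold.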



# Spark

## Spark configuration

First, we need a Spark session: "getOrCreate" returns the existing session if there is one and creates a new one otherwise. I also set the log level to ERROR so that only errors are reported during the process.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigDataAstronomyProject").getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

your 131072x1 screen size is bogus. expect trouble
25/03/14 12:49:29 WARN Utils: Your hostname, SSMRS3-04899600 resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/03/14 12:49:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/14 12:49:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Then, we need to read the CSV data.

In [2]:
path = "/home/haizeagonzalez/bigDataProject/primaryObjs.csv"
path2 = "/home/haizeagonzalez/myproject/primaryObjs_reduced.csv"

df = spark.read.csv(path2, header=True)

Now, we are going to check if the data is correctly loaded.

In [3]:
df.show()

+-------------------+----------------+----------------+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---------+-----------+-------------+----------+---------+---------+----------+----------+---------+-----------------+-----+
|              objID|              ra|             dec|type|petroRad_u|petroRad_g|petroRad_r|petroRad_i|petroRad_z|modelMag_u|modelMag_g|modelMag_r|modelMag_i|modelMag_z|psfMag_u|psfMag_g|psfMag_r|psfMag_i|psfMag_z|      u_g|        g_r|          r_i|       i_z|fracDeV_u|fracDeV_g| fracDeV_r| fracDeV_i|fracDeV_z|            flags|clean|
+-------------------+----------------+----------------+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---------+-----------+-------------+----------+---------+---------+----------+----------+---------+--------

Next, we check the schema and the characteristics of the data.

In [4]:
df.printSchema()

root
 |-- objID: string (nullable = true)
 |-- ra: string (nullable = true)
 |-- dec: string (nullable = true)
 |-- type: string (nullable = true)
 |-- petroRad_u: string (nullable = true)
 |-- petroRad_g: string (nullable = true)
 |-- petroRad_r: string (nullable = true)
 |-- petroRad_i: string (nullable = true)
 |-- petroRad_z: string (nullable = true)
 |-- modelMag_u: string (nullable = true)
 |-- modelMag_g: string (nullable = true)
 |-- modelMag_r: string (nullable = true)
 |-- modelMag_i: string (nullable = true)
 |-- modelMag_z: string (nullable = true)
 |-- psfMag_u: string (nullable = true)
 |-- psfMag_g: string (nullable = true)
 |-- psfMag_r: string (nullable = true)
 |-- psfMag_i: string (nullable = true)
 |-- psfMag_z: string (nullable = true)
 |-- u_g: string (nullable = true)
 |-- g_r: string (nullable = true)
 |-- r_i: string (nullable = true)
 |-- i_z: string (nullable = true)
 |-- fracDeV_u: string (nullable = true)
 |-- fracDeV_g: string (nullable = true)
 |-- fracDe

As all columns are strings, we need to convert them to their proper types. For that:

In [5]:
from pyspark.sql.functions import col

df = df.withColumn("objID", col("objID").cast("long")) \
       .withColumn("ra", col("ra").cast("float")) \
       .withColumn("dec", col("dec").cast("float")) \
       .withColumn("petroRad_u", col("petroRad_u").cast("float")) \
       .withColumn("petroRad_g", col("petroRad_g").cast("float")) \
       .withColumn("petroRad_r", col("petroRad_r").cast("float")) \
       .withColumn("petroRad_i", col("petroRad_i").cast("float")) \
       .withColumn("petroRad_z", col("petroRad_z").cast("float")) \
       .withColumn("modelMag_u", col("modelMag_u").cast("float")) \
       .withColumn("modelMag_g", col("modelMag_g").cast("float")) \
       .withColumn("modelMag_r", col("modelMag_r").cast("float")) \
       .withColumn("modelMag_i", col("modelMag_i").cast("float")) \
       .withColumn("modelMag_z", col("modelMag_z").cast("float")) \
       .withColumn("psfMag_u", col("psfMag_u").cast("float")) \
       .withColumn("psfMag_g", col("psfMag_g").cast("float")) \
       .withColumn("psfMag_r", col("psfMag_r").cast("float")) \
       .withColumn("psfMag_i", col("psfMag_i").cast("float")) \
       .withColumn("psfMag_z", col("psfMag_z").cast("float")) \
       .withColumn("u_g", col("u_g").cast("float")) \
       .withColumn("g_r", col("g_r").cast("float")) \
       .withColumn("r_i", col("r_i").cast("float")) \
       .withColumn("i_z", col("i_z").cast("float")) \
       .withColumn("fracDeV_u", col("fracDeV_u").cast("float")) \
       .withColumn("fracDeV_g", col("fracDeV_g").cast("float")) \
       .withColumn("fracDeV_r", col("fracDeV_r").cast("float")) \
       .withColumn("fracDeV_i", col("fracDeV_i").cast("float")) \
       .withColumn("fracDeV_z", col("fracDeV_z").cast("float")) \
       .withColumn("flags", col("flags").cast("long")) \
       .withColumn("clean", col("clean").cast("int"))

df.printSchema()

root
 |-- objID: long (nullable = true)
 |-- ra: float (nullable = true)
 |-- dec: float (nullable = true)
 |-- type: string (nullable = true)
 |-- petroRad_u: float (nullable = true)
 |-- petroRad_g: float (nullable = true)
 |-- petroRad_r: float (nullable = true)
 |-- petroRad_i: float (nullable = true)
 |-- petroRad_z: float (nullable = true)
 |-- modelMag_u: float (nullable = true)
 |-- modelMag_g: float (nullable = true)
 |-- modelMag_r: float (nullable = true)
 |-- modelMag_i: float (nullable = true)
 |-- modelMag_z: float (nullable = true)
 |-- psfMag_u: float (nullable = true)
 |-- psfMag_g: float (nullable = true)
 |-- psfMag_r: float (nullable = true)
 |-- psfMag_i: float (nullable = true)
 |-- psfMag_z: float (nullable = true)
 |-- u_g: float (nullable = true)
 |-- g_r: float (nullable = true)
 |-- r_i: float (nullable = true)
 |-- i_z: float (nullable = true)
 |-- fracDeV_u: float (nullable = true)
 |-- fracDeV_g: float (nullable = true)
 |-- fracDeV_r: float (nullable = tr
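The chain of withColumn calls above applies one simple rule: objID and flags become long, clean becomes int, type is left alone and everything else becomes float. A minimal sketch of that rule as a lookup table, with hypothetical helper names, could look like this:

```python
# Hypothetical helper: derives the target type of each column from its name.
LONG_COLUMNS = {"objID", "flags"}
INT_COLUMNS = {"clean"}
UNTOUCHED = {"type"}  # left as read from the CSV, as in the cell above


def target_types(columns):
    """Map each column name to the Spark type name it should be cast to."""
    types = {}
    for name in columns:
        if name in UNTOUCHED:
            continue
        if name in LONG_COLUMNS:
            types[name] = "long"
        elif name in INT_COLUMNS:
            types[name] = "int"
        else:
            types[name] = "float"
    return types


sample_columns = ["objID", "ra", "dec", "type", "petroRad_u", "flags", "clean"]
print(target_types(sample_columns))
```

With such a mapping, the casts could be applied in a loop, for example with `df.withColumn(name, col(name).cast(t))` for each pair, instead of writing one line per column.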

Now that we have all the structure, we are going to explore and clean the data.

## Data cleaning

In principle, the data is already clean because it comes from CasJobs and the query applies filters to select good data. However, we are going to check whether there are any null values and count the galaxies and stars.

In [6]:
from pyspark.sql.functions import col, when, count

df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-----+---+---+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---+---+---+---+---------+---------+---------+---------+---------+-----+-----+
|objID| ra|dec|type|petroRad_u|petroRad_g|petroRad_r|petroRad_i|petroRad_z|modelMag_u|modelMag_g|modelMag_r|modelMag_i|modelMag_z|psfMag_u|psfMag_g|psfMag_r|psfMag_i|psfMag_z|u_g|g_r|r_i|i_z|fracDeV_u|fracDeV_g|fracDeV_r|fracDeV_i|fracDeV_z|flags|clean|
+-----+---+---+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---+---+---+---+---------+---------+---------+---------+---------+-----+-----+
|    0|  0|  0|   0|         0|         0|         0|         0|         0|         0|         0|         0|         0|         0|       0|       0|       0|       0|       0|  0|  0|  0|  0|        0|        0|        0|        0|       

In [7]:
df.groupBy("type").count().show()

+----+-----+
|type|count|
+----+-----+
|   3|53538|
|   6|46462|
+----+-----+



As we can see, there are no null values, and there are more galaxies (53,538) than stars (46,462). This imbalance can affect the model, so we are going to balance the data.

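First, as it is a binary classification, we relabel the classes: stars (type 6) become 1 and galaxies (type 3) become 0. Note that the df_stars and df_galaxies variables in the next cells filter on 0 and 1 respectively, so with this mapping their names are swapped relative to their contents.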

In [8]:
df = df.withColumn("type", when(col("type") == 6,1).otherwise(0))

In [9]:
df_stars = df.filter(df["type"] == 0)
df_galaxies = df.filter(df["type"] == 1)

# Print a few rows to check that the conversion worked
df_stars.show(5)
df_galaxies.show(5)

+-------------------+---------+---------+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---------+-----------+-----------+---------+---------+---------+---------+----------+---------+---------------+-----+
|              objID|       ra|      dec|type|petroRad_u|petroRad_g|petroRad_r|petroRad_i|petroRad_z|modelMag_u|modelMag_g|modelMag_r|modelMag_i|modelMag_z|psfMag_u|psfMag_g|psfMag_r|psfMag_i|psfMag_z|      u_g|        g_r|        r_i|      i_z|fracDeV_u|fracDeV_g|fracDeV_r| fracDeV_i|fracDeV_z|          flags|clean|
+-------------------+---------+---------+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---------+-----------+-----------+---------+---------+---------+---------+----------+---------+---------------+-----+
|1237645879562928258|16.020245|1.2676673|  

And now, we are going to balance the data.

In [10]:
from pyspark.sql.functions import col
from pyspark.sql import functions as F

# Get the number of rows in each class
count_stars = df.filter(col("type") == 0).count()
count_galaxies = df.filter(col("type") == 1).count()

# Select the size of the smaller class
min_count = min(count_stars, count_galaxies)

# Undersampling: keep roughly 'min_count' rows of each class
df_stars = df.filter(col("type") == 0).sample(fraction=min_count / count_stars, seed=1)
df_galaxies = df.filter(col("type") == 1).sample(fraction=min_count / count_galaxies, seed=1)

# Combine both classes into a single DataFrame
df_balanced = df_stars.union(df_galaxies)

#Check if everything is ok
df_balanced.groupBy("type").count().show()
df_balanced.show(5)


+----+-----+
|type|count|
+----+-----+
|   0|46507|
|   1|46462|
+----+-----+

+-------------------+---------+---------+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---------+-----------+-----------+---------+---------+---------+---------+----------+---------+---------------+-----+
|              objID|       ra|      dec|type|petroRad_u|petroRad_g|petroRad_r|petroRad_i|petroRad_z|modelMag_u|modelMag_g|modelMag_r|modelMag_i|modelMag_z|psfMag_u|psfMag_g|psfMag_r|psfMag_i|psfMag_z|      u_g|        g_r|        r_i|      i_z|fracDeV_u|fracDeV_g|fracDeV_r| fracDeV_i|fracDeV_z|          flags|clean|
+-------------------+---------+---------+----+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+--------+--------+--------+--------+--------+---------+-----------+-----------+---------+---------+---------+---------+--------

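We don't get exactly the same number of rows in each class because sample() does not take an exact number of rows: it keeps each row independently with a probability equal to the fraction, so the resulting count is only approximately min_count.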
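To see why fraction-based sampling gives an approximate count, here is a small plain-Python imitation. It is not Spark's sampler, only a sketch of the same idea under the assumption that each row is kept independently with the given probability; the function name is hypothetical.

```python
import random


def bernoulli_sample_count(n_rows: int, fraction: float, seed: int) -> int:
    """Count how many of n_rows survive when each is kept with probability `fraction`."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_rows) if rng.random() < fraction)


# Class sizes taken from the groupBy output above
n_larger, n_smaller = 53538, 46462
fraction = n_smaller / n_larger

kept = bernoulli_sample_count(n_larger, fraction, seed=1)
print(kept, "rows kept; the target was", n_smaller)
```

The kept count lands close to the target but rarely matches it exactly. When exact class sizes matter, an alternative is to order the rows randomly and take a fixed number of them.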

As the supervised model won't use the objID, ra, dec, flags and clean columns, we are going to remove them.

In [11]:
df_ml_model = df_balanced.select("type", "petroRad_u", "petroRad_g", "petroRad_r", "petroRad_i", "petroRad_z",
                        "modelMag_u", "modelMag_g", "modelMag_r", "modelMag_i", "modelMag_z",
                        "psfMag_u", "psfMag_g", "psfMag_r", "psfMag_i", "psfMag_z",
                        "u_g", "g_r", "r_i", "i_z",
                        "fracDeV_u", "fracDeV_g", "fracDeV_r", "fracDeV_i", "fracDeV_z")

df_ml_model.printSchema()

root
 |-- type: integer (nullable = false)
 |-- petroRad_u: float (nullable = true)
 |-- petroRad_g: float (nullable = true)
 |-- petroRad_r: float (nullable = true)
 |-- petroRad_i: float (nullable = true)
 |-- petroRad_z: float (nullable = true)
 |-- modelMag_u: float (nullable = true)
 |-- modelMag_g: float (nullable = true)
 |-- modelMag_r: float (nullable = true)
 |-- modelMag_i: float (nullable = true)
 |-- modelMag_z: float (nullable = true)
 |-- psfMag_u: float (nullable = true)
 |-- psfMag_g: float (nullable = true)
 |-- psfMag_r: float (nullable = true)
 |-- psfMag_i: float (nullable = true)
 |-- psfMag_z: float (nullable = true)
 |-- u_g: float (nullable = true)
 |-- g_r: float (nullable = true)
 |-- r_i: float (nullable = true)
 |-- i_z: float (nullable = true)
 |-- fracDeV_u: float (nullable = true)
 |-- fracDeV_g: float (nullable = true)
 |-- fracDeV_r: float (nullable = true)
 |-- fracDeV_i: float (nullable = true)
 |-- fracDeV_z: float (nullable = true)



## Spark ML

As our objective is to create a machine learning model, we need to convert the data into the format Spark ML expects: a single vector column that holds all the features.

In [12]:
from pyspark.ml.feature import VectorAssembler

features = df.columns[1:]  # Skips only objID (the first column); ra, dec, type, flags and clean are still included
assembler = VectorAssembler(inputCols = features, outputCol = "features")

In [13]:
df = assembler.transform(df).select("features", "Type")
df.show(5)

+--------------------+----+
|            features|Type|
+--------------------+----+
|[15.9845666885375...|   1|
|[16.0201206207275...|   1|
|[16.0202445983886...|   0|
|[16.0392494201660...|   1|
|[15.9751977920532...|   0|
+--------------------+----+
only showing top 5 rows
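Note that the first value of each feature vector above is close to 16, which is the right ascension: slicing `df.columns[1:]` drops only objID, so ra, dec, flags, clean and the label column itself end up among the features. A minimal sketch of a safer selection, using a hypothetical helper that removes the label and the identifier columns by name, could be:

```python
def feature_columns(all_columns, label="type",
                    excluded=("objID", "ra", "dec", "flags", "clean")):
    """Return the predictor columns: everything except the label and excluded names."""
    skip = {label, *excluded}
    return [name for name in all_columns if name not in skip]


# Shortened column list for illustration
columns = ["objID", "ra", "dec", "type", "petroRad_u", "modelMag_u",
           "psfMag_u", "u_g", "fracDeV_u", "flags", "clean"]
print(feature_columns(columns))
```

The resulting list could then be passed as `inputCols` to the VectorAssembler, so the label never reaches the feature vector.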



Now, we are going to split the dataset into train and test sets, so we can measure the accuracy of the model on data it has not seen.

In [14]:
train_data, test_data = df.randomSplit([0.8, 0.2], seed = 1)

We are going to try different models to check which is the best for our case.

In [15]:
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, LinearSVC

models = {
    "Logistic Regression": LogisticRegression(labelCol = "Type", featuresCol = "features"),
    "Decision Tree": DecisionTreeClassifier(labelCol="Type", featuresCol="features"),
    #"Random Forest": RandomForestClassifier(labelCol="Type", featuresCol="features", numTrees=100),
    #"Gradient Boosted Trees": GBTClassifier(labelCol="Type", featuresCol="features"),
    #"Linear SVM": LinearSVC(labelCol="Type", featuresCol="features")
}

In [16]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol = "Type", metricName = "areaUnderROC")

In [17]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="Type", predictionCol="prediction", metricName="accuracy")

In [18]:
for name, model in models.items():
    model_trained = model.fit(train_data)
    predictions = model_trained.transform(test_data)
    #auc = evaluator.evaluate(predictions)
    accuracy = evaluator.evaluate(predictions)
    #print(f"{name}: AUC = {auc:.4f}")
    print(f"{name}: Accuracy = {accuracy:.4f}")


Logistic Regression: Accuracy = 1.0000
Decision Tree: Accuracy = 1.0000


In [20]:
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

for name, model in models.items():
    # Train the model
    model_trained = model.fit(train_data)

    # Predict on the test set
    predictions = model_trained.transform(test_data)
    
    # Evaluate
    accuracy = evaluator.evaluate(predictions)
    print(f"{name}: Accuracy = {accuracy:.4f}")
    
    # Select the prediction and label columns
    predictionAndLabels = predictions.select(
        F.col("prediction").cast(FloatType()), 
        F.col("type").cast(FloatType())  # Make sure 'type' is a float
    )

    # Convert to an RDD of (prediction, label) tuples to use MulticlassMetrics
    metrics = MulticlassMetrics(predictionAndLabels.rdd.map(tuple))

    # Get the confusion matrix
    conf_matrix = metrics.confusionMatrix().toArray()

    # Show the confusion matrix
    print(f"{name}: Confusion Matrix")
    print(conf_matrix)


Logistic Regression: Accuracy = 1.0000




Logistic Regression: Confusion Matrix
[[10563.     0.]
 [    0.  9195.]]
Decision Tree: Accuracy = 1.0000




Decision Tree: Confusion Matrix
[[10563.     0.]
 [    0.  9195.]]
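To make the matrices easier to read, the following sketch derives accuracy, precision and recall from a 2x2 confusion matrix. It assumes rows are the true labels and columns the predicted labels, both ordered 0 then 1; the function name is hypothetical.

```python
def binary_metrics(matrix):
    """Compute accuracy, precision and recall for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Values taken from the confusion matrices printed above
metrics_summary = binary_metrics([[10563, 0], [0, 9195]])
print(metrics_summary)
```

With zeros off the diagonal, all three values are 1.0, which confirms that no test row is misclassified rather than the model simply favouring one class.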


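Next step: cross validation.

Confusion matrix → if it looks right, run cross validation; if it looks wrong, suspect overfitting.

The accuracy is 100%, which may be a sign of overfitting. One possible cause is class imbalance: when one class has more objects than the other, the model could simply predict the majority class. The confusion matrices above rule that out here, because both classes are predicted without errors. A more likely cause is that the feature vector was built from df.columns[1:], which still contains the type column, so the label leaks into the features. In any case, we are going to reduce the size of the larger class so that the model is trained on balanced data.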
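As a reminder of what cross validation does, here is a minimal plain-Python sketch of k-fold splitting: the rows are shuffled, divided into k folds, and each fold is used once for validation while the rest is used for training. The function name is hypothetical; in Spark, the `CrossValidator` class in `pyspark.ml.tuning` takes care of this.

```python
import random


def k_fold_indices(n_rows: int, k: int, seed: int = 1):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


for training, validation in k_fold_indices(n_rows=10, k=5):
    print(len(training), "training rows,", len(validation), "validation rows")
```

If the score stays perfect on every fold, overfitting to one particular split is unlikely and a problem in the features becomes the more plausible explanation.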

In [21]:
print(df_stars.count())
print(df_galaxies.count())

num_stars = df_stars.count()
num_galaxies = df_galaxies.count()


46507
46462


# Images

As I also need the images, I downloaded them from https://skyserver.sdss.org/dr18, specifying in each request:
- the location of the object (right ascension (RA) and declination (dec))
- the zoom of the picture (scale)
- the dimensions of the photo (width and height)
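A request of this kind can be sketched as a URL with those parameters in the query string. The endpoint path below and the default scale, width and height are assumptions for illustration, not necessarily the values used in this project; the coordinates come from the first row shown earlier.

```python
from urllib.parse import urlencode

# Assumed endpoint path; check the SkyServer documentation before relying on it.
BASE_URL = "https://skyserver.sdss.org/dr18/SkyServerWS/ImgCutout/getjpeg"


def cutout_url(ra: float, dec: float, scale: float = 0.2,
               width: int = 128, height: int = 128) -> str:
    """Build the image request URL for one object (defaults are placeholders)."""
    params = {"ra": ra, "dec": dec, "scale": scale, "width": width, "height": height}
    return f"{BASE_URL}?{urlencode(params)}"


url = cutout_url(ra=16.020245, dec=1.2676673)
print(url)
```

Building the URL separately from the download makes it easy to generate one request per row of the dataset.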