# Regression in Scala

<b>This project uses a Scala 2.12 kernel in Jupyter.</b><br>
<b>Regression is performed on a dataset of car models and their attributes.</b>

## Introduction

<b>Disabling Spark log output</b>

In [1]:
import $ivy.`org.apache.spark::spark-sql:2.4.0`
import org.apache.log4j.{Level, Logger}

// Set the log level to ERROR (or any other desired log level)
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("akka").setLevel(Level.ERROR)

import $ivy.$
import org.apache.log4j.{Level, Logger}

In [2]:
import $ivy.`org.apache.spark::spark-sql:2.4.0`

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

import $ivy.$
import org.apache.spark.sql._
spark: SparkSession = org.apache.spark.sql.SparkSession@7eb43cbd

<b>Loading data</b>

In [3]:
val dataPath = "../input/FuelConsumption (1).csv"
val df: DataFrame = spark.read.option("header", "true").csv(dataPath)

dataPath: String = "../input/FuelConsumption (1).csv"
df: DataFrame = [Year: string, MAKE: string ... 8 more fields]

<b>A DataFrame is a Spark Dataset (in Scala, an alias for Dataset[Row]).</b>

In [4]:
df.getClass

res4: Class[T] = class org.apache.spark.sql.Dataset

## Data Exploration

### Attributes and descriptive statistics

<b>This section shows some basic Apache Spark DataFrame functions.</b>

In [27]:
df.printSchema()

root
 |-- Year: string (nullable = true)
 |-- MAKE: string (nullable = true)
 |-- MODEL: string (nullable = true)
 |-- VEHICLE CLASS: string (nullable = true)
 |-- ENGINE SIZE: string (nullable = true)
 |-- CYLINDERS: string (nullable = true)
 |-- TRANSMISSION: string (nullable = true)
 |-- FUEL: string (nullable = true)
 |-- FUEL CONSUMPTION: string (nullable = true)
 |-- COEMISSIONS : string (nullable = true)



In [5]:
df.show()

+----+-----+------------------+--------------------+-----------+---------+------------+----+----------------+------------+
|Year| MAKE|             MODEL|       VEHICLE CLASS|ENGINE SIZE|CYLINDERS|TRANSMISSION|FUEL|FUEL CONSUMPTION|COEMISSIONS |
+----+-----+------------------+--------------------+-----------+---------+------------+----+----------------+------------+
|2000|ACURA|             1.6EL|             COMPACT|        1.6|        4|          A4|   X|            10.5|         216|
|2000|ACURA|             1.6EL|             COMPACT|        1.6|        4|          M5|   X|             9.8|         205|
|2000|ACURA|             3.2TL|            MID-SIZE|        3.2|        6|         AS5|   Z|            13.7|         265|
|2000|ACURA|             3.5RL|            MID-SIZE|        3.5|        6|          A4|   Z|              15|         301|
|2000|ACURA|           INTEGRA|          SUBCOMPACT|        1.8|        4|          A4|   X|            11.4|         230|
|2000|ACURA|    

In [6]:
df.head(5)

res6: Array[Row] = Array(
  [2000,ACURA,1.6EL,COMPACT,1.6,4,A4,X,10.5,216],
  [2000,ACURA,1.6EL,COMPACT,1.6,4,M5,X,9.8,205],
  [2000,ACURA,3.2TL,MID-SIZE,3.2,6,AS5,Z,13.7,265],
  [2000,ACURA,3.5RL,MID-SIZE,3.5,6,A4,Z,15,301],
  [2000,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,11.4,230]
)

In [7]:
df.columns

res7: Array[String] = Array(
  "Year",
  "MAKE",
  "MODEL",
  "VEHICLE CLASS",
  "ENGINE SIZE",
  "CYLINDERS",
  "TRANSMISSION",
  "FUEL",
  "FUEL CONSUMPTION",
  "COEMISSIONS "
)

In [16]:
df.dtypes

res16: Array[(String, String)] = Array(
  ("Year", "StringType"),
  ("MAKE", "StringType"),
  ("MODEL", "StringType"),
  ("VEHICLE CLASS", "StringType"),
  ("ENGINE SIZE", "StringType"),
  ("CYLINDERS", "StringType"),
  ("TRANSMISSION", "StringType"),
  ("FUEL", "StringType"),
  ("FUEL CONSUMPTION", "StringType"),
  ("COEMISSIONS ", "StringType")
)
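
Every column is reported as `StringType` because the CSV was read with only the `header` option set. As a minimal sketch of one alternative, the reader's `inferSchema` option asks Spark to guess column types from the data. The temporary file and its three columns below are hypothetical stand-ins for the real dataset, and the session is created with the plain `SparkSession` builder on the assumption that Spark is already on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical three-column sample standing in for the real CSV file.
val tmp = Files.createTempFile("fuel_sample", ".csv")
Files.write(tmp, "MAKE,ENGINE SIZE,CYLINDERS\nACURA,1.6,4\nACURA,3.2,6\n".getBytes("UTF-8"))

val typedDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // costs an extra pass over the data to guess types
  .csv(tmp.toString)

// MAKE stays a string; ENGINE SIZE becomes double and CYLINDERS integer.
typedDf.printSchema()
```

Schema inference is convenient for exploration. An explicit schema or explicit casts give more control over the resulting types.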

<b>Number of rows.</b>

In [9]:
df.count()

res9: Long = 639L

<b>Number of columns.</b>

In [10]:
df.columns.length

res10: Int = 10

<b>Descriptive statistics.</b>

In [11]:
df.describe().show()

+-------+------+-----+-----+---------------+------------------+------------------+------------+----+------------------+-----------------+
|summary|  Year| MAKE|MODEL|  VEHICLE CLASS|       ENGINE SIZE|         CYLINDERS|TRANSMISSION|FUEL|  FUEL CONSUMPTION|     COEMISSIONS |
+-------+------+-----+-----+---------------+------------------+------------------+------------+----+------------------+-----------------+
|  count|   639|  639|  639|            639|               639|               639|         639| 639|               639|              639|
|   mean|2000.0| null|626.0|           null|3.2657276995305202| 5.805946791862285|        null|null|14.713615023474212|296.8090766823161|
| stddev|   0.0| null|  0.0|           null|1.2310121715436397|1.6255876208780364|        null|null| 3.307043767251958|65.50417808775087|
|    min|  2000|ACURA|1.6EL|        COMPACT|                 1|                10|          A3|   D|                10|              104|
|    max|  2000|VOLVO|   Z8|VAN - 
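
The `min` row above reports `10` for CYLINDERS and `1` for ENGINE SIZE. The columns are still strings at this point, so `min` and `max` follow string ordering rather than numeric ordering. The sketch below reproduces the effect on a small hypothetical column of cylinder counts (the values are illustrative, not taken from the dataset) and shows how a cast changes the result.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, min}

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical cylinder counts stored as strings, as in the un-cast DataFrame.
val cyl = Seq("4", "6", "8", "10", "12").toDF("CYLINDERS")

// Strings are compared character by character, so "10" sorts before "4".
val asString = cyl.agg(min("CYLINDERS"), max("CYLINDERS")).head()

// Casting to an integer first restores the numeric ordering.
val asInt = cyl
  .select(col("CYLINDERS").cast("int").as("CYLINDERS"))
  .agg(min("CYLINDERS"), max("CYLINDERS"))
  .head()

println(s"string min/max: ${asString.getString(0)} / ${asString.getString(1)}") // 10 / 8
println(s"int min/max:    ${asInt.getInt(0)} / ${asInt.getInt(1)}")             // 4 / 12
```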

<b>Null values per column.</b>

In [21]:
val nullCounts = df.select(df.columns.map(c => sum(col(c).isNull.cast("int")).alias(c)): _*)

nullCounts.show()

+----+----+-----+-------------+-----------+---------+------------+----+----------------+------------+
|Year|MAKE|MODEL|VEHICLE CLASS|ENGINE SIZE|CYLINDERS|TRANSMISSION|FUEL|FUEL CONSUMPTION|COEMISSIONS |
+----+----+-----+-------------+-----------+---------+------------+----+----------------+------------+
|   0|   0|    0|            0|          0|        0|           0|   0|               0|           0|
+----+----+-----+-------------+-----------+---------+------------+----+----------------+------------+

nullCounts: DataFrame = [Year: bigint, MAKE: bigint ... 8 more fields]

### Feature Exploration | Year

<b>Year has a single value (2000) for every entry.</b>

In [19]:
df.select("Year").distinct().show()

+----+
|Year|
+----+
|2000|
+----+



In [30]:
df.drop(df("Year")).printSchema()

root
 |-- MAKE: string (nullable = true)
 |-- MODEL: string (nullable = true)
 |-- VEHICLE CLASS: string (nullable = true)
 |-- ENGINE SIZE: string (nullable = true)
 |-- CYLINDERS: string (nullable = true)
 |-- TRANSMISSION: string (nullable = true)
 |-- FUEL: string (nullable = true)
 |-- FUEL CONSUMPTION: string (nullable = true)
 |-- COEMISSIONS : string (nullable = true)
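
Note that `drop` does not modify `df`. It returns a new DataFrame, so the result has to be bound to a name to be used later. A minimal sketch on a hypothetical two-row sample (`cars` and `carsNoYear` are illustrative names, not part of the notebook):

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample with the same constant Year column.
val cars = Seq(("2000", "ACURA", "1.6"), ("2000", "ACURA", "3.2"))
  .toDF("Year", "MAKE", "ENGINE SIZE")

// drop returns a new DataFrame; `cars` itself is left unchanged.
val carsNoYear = cars.drop("Year")

println(cars.columns.mkString(", "))       // Year, MAKE, ENGINE SIZE
println(carsNoYear.columns.mkString(", ")) // MAKE, ENGINE SIZE
```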



In [29]:
df.show()

+----+-----+------------------+--------------------+-----------+---------+------------+----+----------------+------------+
|Year| MAKE|             MODEL|       VEHICLE CLASS|ENGINE SIZE|CYLINDERS|TRANSMISSION|FUEL|FUEL CONSUMPTION|COEMISSIONS |
+----+-----+------------------+--------------------+-----------+---------+------------+----+----------------+------------+
|2000|ACURA|             1.6EL|             COMPACT|        1.6|        4|          A4|   X|            10.5|         216|
|2000|ACURA|             1.6EL|             COMPACT|        1.6|        4|          M5|   X|             9.8|         205|
|2000|ACURA|             3.2TL|            MID-SIZE|        3.2|        6|         AS5|   Z|            13.7|         265|
|2000|ACURA|             3.5RL|            MID-SIZE|        3.5|        6|          A4|   Z|              15|         301|
|2000|ACURA|           INTEGRA|          SUBCOMPACT|        1.8|        4|          A4|   X|            11.4|         230|
|2000|ACURA|    

In [None]:
val sampledDF = df.sample(0.1)
sampledDF.show()
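
`sample(0.1)` keeps each row independently with probability 0.1, so the sample size is only approximately 10% of the input and changes from run to run. The sketch below assumes a reproducible sample is wanted and passes an explicit seed. The 1000-row `rows` DataFrame and the seed value 42 are hypothetical choices for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical stand-in for df: 1000 rows with a single id column.
val rows = spark.range(1000).toDF("id")

// Same input and same seed, so both calls should select the same rows.
val sampleA = rows.sample(withReplacement = false, fraction = 0.1, seed = 42L)
val sampleB = rows.sample(withReplacement = false, fraction = 0.1, seed = 42L)

val sameRows = sampleA.collect().toSet == sampleB.collect().toSet
println(s"sample size: ${sampleA.count()}, identical samples: $sameRows")
```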

### Data preparation

In [None]:
import org.apache.spark.sql.functions.col

// Every column was read as a string, so cast the numeric ones
// before computing statistics.
val dfWithCasts = df
  .withColumn("Year", col("Year").cast("integer"))
  .withColumn("ENGINE SIZE", col("ENGINE SIZE").cast("float"))
  .withColumn("CYLINDERS", col("CYLINDERS").cast("integer"))
  .withColumn("FUEL CONSUMPTION", col("FUEL CONSUMPTION").cast("float"))
  // The header of this column ends with a trailing space.
  .withColumn("COEMISSIONS ", col("COEMISSIONS ").cast("integer"))

// describe() now treats the cast columns as numeric.
val summary = dfWithCasts.describe()

// Show the summary statistics
summary.show()
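
The headers contain inner spaces, and `"COEMISSIONS "` ends with a trailing space, which is easy to mistype in later cells. As a sketch of one possible cleanup, the headers can be normalized before casting. The underscore naming, the target types, and the one-row `raw` sample are assumptions made for this illustration, not part of the original notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the running session if there is one; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical one-row sample using the original headers, including the
// trailing space in "COEMISSIONS ".
val raw = Seq(("1.6", "4", "10.5", "216"))
  .toDF("ENGINE SIZE", "CYLINDERS", "FUEL CONSUMPTION", "COEMISSIONS ")

// Trim each header and replace inner spaces with underscores.
val renamed = raw.columns.foldLeft(raw) { (acc, name) =>
  acc.withColumnRenamed(name, name.trim.replace(' ', '_'))
}

// Target type for each cleaned column (a choice made for this sketch).
val targetTypes = Seq(
  "ENGINE_SIZE" -> "double",
  "CYLINDERS" -> "int",
  "FUEL_CONSUMPTION" -> "double",
  "COEMISSIONS" -> "int"
)

// Apply the casts one column at a time.
val typed = targetTypes.foldLeft(renamed) { case (acc, (name, tpe)) =>
  acc.withColumn(name, col(name).cast(tpe))
}

typed.printSchema()
```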
