# MPG Cars

### Introduction:

The following exercise uses data from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=8e65b1bc69328a4f4be4914f01750e301118d1616d2ca467feb76c04b6cddd34
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType
from pyspark.sql.functions import expr, col, mean, when, sum, count, desc, min, max
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Import the datasets [cars1](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv) and [cars2](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv)

In [3]:
!wget -O cars1.csv https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv

--2024-04-11 11:36:46--  https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10163 (9.9K) [text/plain]
Saving to: ‘cars1.csv’


2024-04-11 11:36:46 (50.5 MB/s) - ‘cars1.csv’ saved [10163/10163]



In [4]:
!wget -O cars2.csv https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv

--2024-04-11 11:36:46--  https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9383 (9.2K) [text/plain]
Saving to: ‘cars2.csv’


2024-04-11 11:36:46 (54.1 MB/s) - ‘cars2.csv’ saved [9383/9383]



### Step 3. Assign each to a variable called cars1 and cars2

In [20]:
cars1 = spark.read.csv('cars1.csv', sep=',', header=True, inferSchema=True)
cars2 = spark.read.csv('cars2.csv', sep=',', header=True, inferSchema=True)

In [11]:
cars1.show(10)

+----+---------+------------+----------+------+------------+-----+------+--------------------+----+----+----+----+----+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car| _c9|_c10|_c11|_c12|_c13|
+----+---------+------------+----------+------+------------+-----+------+--------------------+----+----+----+----+----+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|NULL|NULL|NULL|NULL|NULL|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|NULL|NULL|NULL|NULL|NULL|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|NULL|NULL|NULL|NULL|NULL|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|NULL|NULL|NULL|NULL|NULL|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|NULL|NULL|NULL|NULL|NULL|
|15.0|        8|         429|       198|

In [12]:
cars2.show(10)

+----+---------+------------+----------+------+------------+-----+------+------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|               car|
+----+---------+------------+----------+------+------------+-----+------+------------------+
|33.0|        4|          91|        53|  1795|        17.4|   76|     3|       honda civic|
|20.0|        6|         225|       100|  3651|        17.7|   76|     1|    dodge aspen se|
|18.0|        6|         250|        78|  3574|        21.0|   76|     1| ford granada ghia|
|18.5|        6|         250|       110|  3645|        16.2|   76|     1|pontiac ventura sj|
|17.5|        6|         258|        95|  3193|        17.8|   76|     1|     amc pacer d/l|
|29.5|        4|          97|        71|  1825|        12.2|   76|     2| volkswagen rabbit|
|32.0|        4|          85|        70|  1990|        17.0|   76|     3|      datsun b-210|
|28.0|        4|          97|        75|  2155|        16.4|   76|    

### Step 4. Oops, it seems our first dataset has some unnamed blank columns, fix cars1

In [18]:
cars1 = cars1.drop('_c9', '_c10', '_c11', '_c12', '_c13')

In [19]:
cars1.show(10)

+----+---------+------------+----------+------+------------+-----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|
+----+---------+------------+----------+------+------------+-----+------+--------------------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevrolet impala|
|14.0|        8|         440|       215|  4312|   

#### Or dynamically

In [23]:
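cars1 = spark.read.csv('cars1.csv', sep=',', header=True, inferSchema=True)
total_rows = cars1.count()  # count once instead of on every iteration
# Drop every column that contains only nulls. The loop variable is not named
# `col`, so it does not shadow the imported pyspark.sql.functions.col.
for column_name in cars1.columns:
    if cars1.where(F.col(column_name).isNull()).count() == total_rows:
        cars1 = cars1.drop(column_name)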

In [24]:
cars1.show(10)

+----+---------+------------+----------+------+------------+-----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|
+----+---------+------------+----------+------+------------+-----+------+--------------------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevrolet impala|
|14.0|        8|         440|       215|  4312|   

### Step 5. What is the number of observations in each dataset?

In [25]:
cars1.count()

198

In [26]:
cars2.count()

200

### Step 6. Join cars1 and cars2 into a single DataFrame called cars

In [33]:
cars1.select('model').distinct().show()

+-----+
|model|
+-----+
|   76|
|   72|
|   73|
|   70|
|   75|
|   71|
|   74|
+-----+



In [34]:
cars2.select('model').distinct().show()

+-----+
|model|
+-----+
|   78|
|   81|
|   76|
|   77|
|   82|
|   80|
|   79|
+-----+



In [35]:
cars = cars1.union(cars2)

In [36]:
cars.show(10)

+----+---------+------------+----------+------+------------+-----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|
+----+---------+------------+----------+------+------------+-----+------+--------------------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevrolet impala|
|14.0|        8|         440|       215|  4312|   

### Step 7. Oops, there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.

In [37]:
cars = cars.withColumn("owners", (F.rand() * (73000 - 15000) + 15000).cast('int'))

In [38]:
cars.show(10)

+----+---------+------------+----------+------+------------+-----+------+--------------------+------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|owners|
+----+---------+------------+----------+------+------------+-----+------+--------------------+------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...| 51208|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320| 27656|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite| 59338|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst| 64583|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino| 46357|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500| 49453|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevr

### Step 8. Add the column owners to cars