# MPG Cars

### Introduction:

The following exercise uses data from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=bbe2901c49eeb8a66a5ab2a1434e436f1bf9ee3742b01bc9a1d29c0734e6cc81
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [23]:
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import IntegerType
from pyspark.files import SparkFiles

### Step 2. Import the two datasets, [cars1](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv) and [cars2](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv)

In [5]:
url1 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv"
url2 = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv"

spark = SparkSession.builder.appName("Exercise51").getOrCreate()
# Register the remote files with Spark; their local paths are later resolved with SparkFiles.get().
spark.sparkContext.addFile(url1)
spark.sparkContext.addFile(url2)

### Step 3. Assign them to variables called cars1 and cars2

In [9]:
cars1 = spark.read.csv("file://"+SparkFiles.get("cars1.csv"), header=True, inferSchema=True)
cars2 = spark.read.csv("file://"+SparkFiles.get("cars2.csv"), header=True, inferSchema=True)

### Step 4. Oops, it seems our first dataset has some unnamed blank columns. Fix cars1

In [13]:
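# Inspect the raw data: the trailing _c9 ... _c13 columns contain only nulls.
cars1.show()
# "car" is the last real column, so drop every column that comes after it.
idx_car = cars1.columns.index("car")
cars1 = cars1.drop(*cars1.columns[idx_car+1:])
cars1.show()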

+----+---------+------------+----------+------+------------+-----+------+--------------------+----+----+----+----+----+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car| _c9|_c10|_c11|_c12|_c13|
+----+---------+------------+----------+------+------------+-----+------+--------------------+----+----+----+----+----+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|null|null|null|null|null|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|null|null|null|null|null|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|null|null|null|null|null|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|null|null|null|null|null|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|null|null|null|null|null|
|15.0|        8|         429|       198|
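
The positional slice used in this step relies on `car` being the last real column. As a hedged alternative, the sketch below picks the columns to drop by name instead, assuming the blank headers always show up as `_c<position>` as in the output above. The helper name `unnamed_columns` is made up for illustration, and the logic is shown on a plain Python list so it runs without a Spark session.

```python
def unnamed_columns(columns, prefix="_c"):
    """Return the column names that look like auto-generated placeholders.

    Assumption: blank CSV headers appear as `_c<position>` (for example `_c9`).
    """
    return [c for c in columns if c.startswith(prefix) and c[len(prefix):].isdigit()]


# Column names as they appear in the cars1 output above.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "model", "origin", "car",
           "_c9", "_c10", "_c11", "_c12", "_c13"]

to_drop = unnamed_columns(columns)
print(to_drop)

# With a live DataFrame the helper could be used as:
#   cars1 = cars1.drop(*unnamed_columns(cars1.columns))
```

Selecting by name keeps working if the placeholder columns are not all at the end, at the cost of depending on the `_c` naming pattern.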

### Step 5. What is the number of observations in each dataset?

In [15]:
cars1.count(), cars2.count()

(198, 200)

### Step 6. Join cars1 and cars2 into a single DataFrame called cars

In [29]:
cars = cars1.union(cars2)
cars.show()

+----+---------+------------+----------+------+------------+-----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|
+----+---------+------------+----------+------+------------+-----+------+--------------------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevrolet impala|
|14.0|        8|         440|       215|  4312|   
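
`DataFrame.union` matches columns by position, not by name, so it is worth confirming that both DataFrames have the same column layout before combining them. The following is a minimal sketch of such a check on plain lists of column names; `layout_problems` is a hypothetical helper, and the second column list is assumed to mirror the first.

```python
def layout_problems(left, right):
    """Describe why two column lists cannot be safely unioned by position.

    Returns an empty list when names and order are identical.
    """
    problems = []
    if set(left) != set(right):
        problems.append("different column names: %s" % sorted(set(left) ^ set(right)))
    elif list(left) != list(right):
        problems.append("same names, different order")
    return problems


cars1_columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                 "acceleration", "model", "origin", "car"]
# Assumed to match cars1 after the cleanup in Step 4.
cars2_columns = list(cars1_columns)

print(layout_problems(cars1_columns, cars2_columns))

# With live DataFrames:
#   if not layout_problems(cars1.columns, cars2.columns):
#       cars = cars1.union(cars2)
#   When only the order differs, cars1.unionByName(cars2) matches columns by name.
```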

### Step 7. Oops, a column called owners is missing. Create a random number Series with values from 15,000 to 73,000

In [30]:
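from random import randint

# randint is inclusive on both ends, so owners falls in [15000, 73000].
# Mark the UDF as nondeterministic so Spark does not assume repeated calls return the same value.
randint_udf = f.udf(lambda: randint(15_000, 73_000), IntegerType()).asNondeterministic()

# Evaluate the UDF once per row to build the new column.
cars = cars.withColumn("owners", randint_udf())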
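
A Python UDF is one way to generate the values; another possibility is to scale a uniform draw in [0, 1) to the target integer range, which in Spark could be done with the built-in `rand` function. The sketch below shows only the scaling arithmetic in plain Python. The name `scale_to_range` and the seed value are assumptions made for illustration.

```python
import random

LOW, HIGH = 15_000, 73_000


def scale_to_range(u, low=LOW, high=HIGH):
    """Map a uniform draw u in [0, 1) to an integer in [low, high], inclusive."""
    return low + int(u * (high - low + 1))


rng = random.Random(0)  # seeded so the sample is reproducible
sample = [scale_to_range(rng.random()) for _ in range(5)]
print(sample)

# A possible Spark equivalent of the same arithmetic (not run here):
#   cars = cars.withColumn(
#       "owners",
#       (f.floor(f.rand(seed=0) * (HIGH - LOW + 1)) + LOW).cast("int"),
#   )
```

Using built-in column functions avoids the per-row Python call that a UDF requires, and passing a seed makes the generated column repeatable for a given partitioning of the data.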

### Step 8. Add the column owners to cars

In [31]:
cars.show()

+----+---------+------------+----------+------+------------+-----+------+--------------------+------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|owners|
+----+---------+------------+----------+------+------------+-----+------+--------------------+------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...| 47739|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320| 44051|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite| 71100|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst| 31585|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino| 18376|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500| 65649|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevr