# MPG Cars

### Introduction:

This exercise uses data from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

### Step 1. Import the necessary libraries

In [2]:
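# Standard library
import random

# Third-party
import numpy
import requests

# PySpark
from pyspark.sql.functions import *  # wildcard import shadows builtins such as sum, min, and max
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F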

In [3]:
spark = SparkSession.builder.master("local[1]").appName("cars").getOrCreate()

In [4]:
sc = spark.sparkContext

### Step 2. Import the datasets [cars1](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv) and [cars2](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv).

In [5]:
cars_1 = spark.read.options(header=True, inferSchema=True).csv("cars1.csv")
cars_2 = spark.read.options(header=True, inferSchema=True).csv("cars2.csv")
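The read calls above assume that `cars1.csv` and `cars2.csv` already exist in the working directory. A minimal sketch of one way to fetch them first, using only the standard library, might look like this. The helper names `local_name` and `download` are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs taken from the Step 2 heading.
BASE = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG"
CARS1_URL = f"{BASE}/cars1.csv"
CARS2_URL = f"{BASE}/cars2.csv"


def local_name(url: str) -> str:
    """Return the file name at the end of a URL path, e.g. 'cars1.csv'."""
    return Path(urlparse(url).path).name


def download(url: str, dest_dir: str = ".") -> Path:
    """Save the file at `url` into `dest_dir`, skipping files that already exist."""
    target = Path(dest_dir) / local_name(url)
    if not target.exists():
        urlretrieve(url, str(target))
    return target


# Usage (requires network access), hypothetical:
# download(CARS1_URL)
# download(CARS2_URL)
```

After both calls, the relative paths `"cars1.csv"` and `"cars2.csv"` passed to `spark.read` should resolve as long as the notebook runs from the same directory.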

### Step 3. Assign each to a variable called cars1 and cars2

### Step 4. Oops, it seems our first dataset has some unnamed blank columns. Fix cars1

In [6]:
cars_1.columns

['mpg',
 'cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'model',
 'origin',
 'car',
 '_c9',
 '_c10',
 '_c11',
 '_c12',
 '_c13']

In [7]:
cars_1 = cars_1.drop('_c9', '_c10', '_c11', '_c12','_c13')
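Typing the five names by hand works here, but the list can also be derived from the column names. The sketch below assumes that every column Spark auto-named with the `_c<index>` pattern is one of the blank columns, which matches the output above but may not hold for other files. `blank_columns` is a hypothetical name.

```python
import re

# Column names as reported by cars_1.columns in the cell above.
columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model', 'origin', 'car',
           '_c9', '_c10', '_c11', '_c12', '_c13']

# Keep only names of the form _c<digits>, the placeholder pattern seen above.
blank_columns = [c for c in columns if re.fullmatch(r"_c\d+", c)]
print(blank_columns)

# With the live DataFrame, the same list could be passed to drop:
# cars_1 = cars_1.drop(*blank_columns)
```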

In [8]:
cars_1.show()

+----+---------+------------+----------+------+------------+-----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|
+----+---------+------------+----------+------+------------+-----+------+--------------------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevrolet impala|
|14.0|        8|         440|       215|  4312|   

### Step 5. What is the number of observations in each dataset?

In [9]:
print("Count of cars1 dataset: {}".format(cars_1.count()))
print("Count of cars2 dataset: {}".format(cars_2.count()))

Count of cars1 dataset: 198
Count of cars2 dataset: 200


### Step 6. Join cars1 and cars2 into a single DataFrame called cars

In [10]:
print(cars_1.columns)
print(cars_2.columns)

['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model', 'origin', 'car']
['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model', 'origin', 'car']


In [11]:
total_cars = cars_1.union(cars_2)
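`DataFrame.union` matches columns by position rather than by name, which is why the two column lists were printed and compared first. A small sketch of doing that comparison in code instead of by eye follows; `column_mismatches` is a hypothetical helper, and the lists are copied from the output above.

```python
from itertools import zip_longest


def column_mismatches(left, right):
    """Return (position, left_name, right_name) for every position that differs."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip_longest(left, right))
        if a != b
    ]


cols_1 = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
          'acceleration', 'model', 'origin', 'car']
cols_2 = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
          'acceleration', 'model', 'origin', 'car']

# An empty list means a positional union lines the columns up correctly.
print(column_mismatches(cols_1, cols_2))
```

If the names matched but the order differed, `cars_1.unionByName(cars_2)` would be the safer call.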

In [12]:
total_cars.count()

398

### Step 7. Oops, there is a column missing, called owners. Create a Series of random numbers from 15,000 to 73,000.

In [13]:
# One random value per row of total_cars (398 rows); randint includes both endpoints.
l = [random.randint(15000, 73000) for _ in range(398)]
print(l)

[15298, 23826, 18233, 41082, 49197, 30916, 68243, 39010, 15138, 47722, 17453, 46775, 52016, 65597, 56224, 19807, 61071, 72827, 56736, 20743, 19340, 41415, 54732, 25416, 71020, 43716, 25641, 45119, 34557, 25444, 25118, 27871, 63760, 27291, 39642, 20631, 65526, 72080, 59588, 68561, 29729, 56358, 40105, 55655, 47448, 45037, 64951, 72915, 52333, 46939, 62047, 69902, 52387, 15269, 42151, 45861, 61699, 38602, 64302, 47325, 32481, 24603, 29642, 32072, 29922, 63449, 26183, 49356, 50263, 65729, 24904, 62913, 69017, 18416, 70854, 25828, 22976, 38993, 23082, 30143, 59483, 59566, 36910, 34633, 47278, 34284, 71172, 67318, 57312, 54042, 67205, 25337, 43913, 41186, 59864, 26345, 37011, 63731, 63551, 51003, 25322, 17660, 26297, 16702, 36792, 46861, 22828, 30464, 69792, 62169, 46994, 72319, 57336, 47293, 63020, 28865, 69486, 30234, 68994, 60028, 16044, 27274, 48341, 51087, 72605, 57758, 60287, 55747, 33290, 43055, 33370, 40841, 64303, 30656, 43840, 35233, 45621, 33712, 34654, 54639, 69484, 18875, 58861
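The cell above draws different values on every run, so the printed list cannot be reproduced. If repeatable values are wanted, one option is a seeded generator, sketched below. The seed `42` and the names `rng` and `owners` are arbitrary choices for illustration.

```python
import random

# A dedicated generator with a fixed seed gives the same sequence on every run
# without touching the global random state.
rng = random.Random(42)

# 398 values, one per row of the combined dataset; both bounds are inclusive.
owners = [rng.randint(15000, 73000) for _ in range(398)]

print(len(owners), min(owners) >= 15000, max(owners) <= 73000)
```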

In [15]:
total_cars.withColumn("owners",random.randint(15000,73000))

TypeError: col should be Column

In [59]:
total_cars.show()

+----+---------+------------+----------+------+------------+-----+------+--------------------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|model|origin|                 car|
+----+---------+------------+----------+------+------------+-----+------+--------------------+
|18.0|        8|         307|       130|  3504|        12.0|   70|     1|chevrolet chevell...|
|15.0|        8|         350|       165|  3693|        11.5|   70|     1|   buick skylark 320|
|18.0|        8|         318|       150|  3436|        11.0|   70|     1|  plymouth satellite|
|16.0|        8|         304|       150|  3433|        12.0|   70|     1|       amc rebel sst|
|17.0|        8|         302|       140|  3449|        10.5|   70|     1|         ford torino|
|15.0|        8|         429|       198|  4341|        10.0|   70|     1|    ford galaxie 500|
|14.0|        8|         454|       220|  4354|         9.0|   70|     1|    chevrolet impala|
|14.0|        8|         440|       215|  4312|   

In [None]:
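# createDataFrame needs data: wrap each value from the Step 7 list in a one-element tuple.
owners_df = spark.createDataFrame([(n,) for n in l], ["owners"])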

In [47]:
spark.createDataFrame?

Signature:
spark.createDataFrame(
    data: Union[pyspark.rdd.RDD[Any], Iterable[Any], ForwardRef('PandasDataFrameLike')],
    schema: Union[pyspark.sql.types.AtomicType, pyspark.sql.types.StructType, str, NoneType] = None,
    samplingRatio: Union[float, NoneType] = None,

In [46]:
total_cars.withColumn("owners")

TypeError: withColumn() missing 1 required positional argument: 'col'

In [37]:
total_cars.withColumn?

Signature: total_cars.withColumn(colName: str, col: pyspark.sql.column.Column) -> 'DataFrame'
Docstring:
Returns a new :class:`DataFrame` by adding a column or replacing the
existing column that has the same name.

The column expression must be an expression over this :class:`DataFrame`; attempting to add
a column from some other :class:`DataFrame` will raise an error.

.. versionadded:: 1.3.0

Parameters
----------
colName : str
    string, name of the new column.
col : :class:`Column`
    a :class:`Column` expression for the new column.

Notes
-----
This method introduces a projection internally. Therefore, calling it multiple
times, for instance, via loops in order to add multiple columns can generate big
plans which can cause performance issues 

### Step 8. Add the column owners to cars