# Problem Definition
#### How to determine the *price* of a used car?

## Contents

[Installation Setup](#Installation-Setup) <br>
+   [Environment Config](#Environment-Configuration) <br>
+   [Python Packages](#Loading-Packages) <br>
+   [Apache Spark](#Creating-SparkSession) <br>

[Extract, Transform, Load](#Extract-Data) <br>
This includes the various stages of the ETL Pipeline <br>
+   [Extract](#Extract-Data) <br>
    +   [Download from Kaggle](#Kaggle-Dataset) <br>
    +   [Cleaning Data](#Cleaning-Data) <br>
    +   [Cache Data](#Caching-Data-on-S3) <br>
+   [Transform](#Transform-Data) <br>
    +   [Sampling Data](#Sampling-Data) <br>
    +   [Exploratory Data Analysis using Pandas and Matplotlib](#Exploratory-Data-Analysis) <br>
    +   [Transformations](#Transformations) <br>
    +   [Cache Data](#Caching-Data-on-S3) <br>
+   [Load](#Load-Data) <br>
    
[Predicting Used Car Price](#Machine-Learning)
+   Implementing Linear Regression

# Installation Setup

## Tool Versions

```
Apache Spark - 2.4.3
Jupyter Notebook - 4.4.0
AirFlow - ?
```
    
## Environment Configuration

#### Configuring ~/.bash_profile

```
export PATH="/usr/local/bin:$PATH"
PATH="/Library/Frameworks/Python.framework/Versions/3.7/bin:${PATH}"
export PATH=/usr/local/scala/bin:$PATH
export PATH=/usr/local/spark/bin:$PATH
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
export PYSPARK_PYTHON=python3.7
```

#### Configuring ~/.bashrc

```
export PYSPARK_PYTHON=/usr/local/bin/python3.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.7
```



### Findspark

Use `findspark` to locate and import the **PySpark** module; it sets the required environment variables and adds PySpark to the Python path.

In [1]:
import traceback
import findspark
try:
    findspark.init('/usr/local/spark/')
except Exception:
    # Print the exception's own traceback rather than the current call stack
    print("Error:", traceback.format_exc())

Check the paths before starting the PySpark session:

In [3]:
import os
import sys
print("PATH: %s\n" % os.environ['PATH'])
print("SPARK_HOME: %s" % os.environ['SPARK_HOME'])
print("PYSPARK_PYTHON: %s" % os.environ['PYSPARK_PYTHON'])
print("PYSPARK_DRIVER_PYTHON: %s" % os.environ['PYSPARK_DRIVER_PYTHON'])

PATH: /Library/Frameworks/Python.framework/Versions/3.7/bin:/usr/local/spark/bin:/usr/local/scala/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/usr/local/bin:/usr/local/spark/bin:/usr/local/scala/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/usr/local/bin:/usr/local/spark/bin:/usr/local/scala/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/usr/local/bin:/usr/local/spark/bin:/usr/local/scala/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin

SPARK_HOME: /usr/local/spark/
PYSPARK_PYTHON: /usr/local/bin/python3.7
PYSPARK_DRIVER_PYTHON: /usr/local/bin/python3.7
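
Indexing `os.environ` directly raises `KeyError` when a variable is unset, which can happen if the shell that launched Jupyter never read `~/.bashrc`. A minimal sketch of a more forgiving check follows, assuming the same four variable names printed above. The helper name `report_env` is hypothetical.

```python
import os

def report_env(names, environ=None):
    """Return {name: value or None} for each requested environment variable."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}

required = ["PATH", "SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]
for name, value in report_env(required).items():
    # Flag unset variables instead of raising KeyError
    print("%s: %s" % (name, value if value is not None else "<not set>"))
```

Any variable reported as `<not set>` should be fixed before the Spark session is created.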


## Loading Packages 

In [4]:
#import libraries
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import matplotlib
import sklearn
import scipy
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab

### Package Versions

In [5]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

matplotlib: 3.0.3
sklearn: 0.19.2
scipy: 1.2.1
pandas: 0.24.2
numpy: 1.16.3
Python: 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) 
[Clang 6.0 (clang-600.0.57)]


## Creating SparkSession
Create a Spark session in **local mode**: `local[*]` runs Spark on a single machine, using as many worker threads as there are logical cores.

In [6]:
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("PySpark Craigslist") \
    .getOrCreate()

sc = spark.sparkContext

# Extract Data

## Kaggle Dataset

Available from [kaggle.com/austinreese/](https://www.kaggle.com/austinreese/craigslist-carstrucks-data)

In [7]:
vehicle_listings = spark.read.format("csv").option("header", "true").load("../data/craigslistVehiclesFull.csv")
type(vehicle_listings)

pyspark.sql.dataframe.DataFrame

In [8]:
vehicle_listings.printSchema()

root
 |-- url: string (nullable = true)
 |-- city: string (nullable = true)
 |-- price: string (nullable = true)
 |-- year: string (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- make: string (nullable = true)
 |-- condition: string (nullable = true)
 |-- cylinders: string (nullable = true)
 |-- fuel: string (nullable = true)
 |-- odometer: string (nullable = true)
 |-- title_status: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- vin: string (nullable = true)
 |-- drive: string (nullable = true)
 |-- size: string (nullable = true)
 |-- type: string (nullable = true)
 |-- paint_color: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- lat: string (nullable = true)
 |-- long: string (nullable = true)
 |-- county_fips: string (nullable = true)
 |-- county_name: string (nullable = true)
 |-- state_fips: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- state_name: string (nullable = true)
 |-- weather: 

In [149]:
print(vehicle_listings.count(),len(vehicle_listings.columns))

1723065 26


In [162]:
vehicle_listings.select("city","state_code","year","lat","long").describe().show()
vehicle_listings.select("manufacturer","make","price","condition","odometer").describe().show()
vehicle_listings.select("cylinders","fuel","transmission","drive","size","type").describe().show()
vehicle_listings.select("url","title_status","vin","paint_color","image_url").describe().show()
vehicle_listings.select("county_fips","state_fips","weather","state_code","state_name").describe().show()

+-------+----------+------------------+------------------+------------------+------------------+
|summary|      city|        state_code|              year|               lat|              long|
+-------+----------+------------------+------------------+------------------+------------------+
|  count|   1723065|           1664232|           1716750|           1723053|           1723063|
|   mean|      null|              28.4|2004.8408405417213|38.781324831980314|-93.60776401924767|
| stddev|      null|11.749231653365442|  12.0877162743775| 5.983397427783298| 67.11417140992036|
|    min|abbotsford|                17|             302.0|          -122.861|           -1000.0|
|    max|zanesville|                WY|            2019.0|              90.0|           51177.0|
+-------+----------+------------------+------------------+------------------+------------------+

+-------+------------+--------------------+--------------------+---------+------------------+
|summary|manufacturer|          

Generally, `.describe()` (or `.summary()`, which adds percentiles) is a good way to start exploring a dataset. This dataset has too many columns, some of them very unclean (for example, a minimum `year` of 302 and a maximum `long` of 51177), so it is hard to decipher much from the result above. <br>

## Cleaning Data

Cleaning is essential during the Extract Stage of the ETL pipeline. We want to be able to cache a cleaned dataset that can be used for multiple pipelines later on, without having to perform basic cleaning again.<br>

Some columns seem redundant, so we can remove them before moving forward.<br>

In [170]:
vehicle_listings=vehicle_listings.drop('url','vin', 'paint_color', 'image_url', \
                                       'lat', 'long', 'county_fips', 'state_fips','weather')
print("Remaining columns-",vehicle_listings.columns)

Remaining columns- ['city', 'price', 'year', 'manufacturer', 'make', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'size', 'type', 'county_name', 'state_code', 'state_name']


`lat` and `long` add little beyond what `city`, `county_name`, `state_code` and `state_name` already provide, so they are not needed.<br>
The other dropped columns are not helpful to the analysis.

The attributes vary in character. Three types of features are available in the data: <br>
**Categorical**, **Ordinal**, **Continuous**<br>
Only three attributes, **price, year, odometer**, can be converted directly to numerical (continuous) values; the others can be transformed to fit the categorical and ordinal types. The difference between the two is that ordinal values have a clear, fixed ordering. For example, *'condition'* is ordinal, ranging from excellent to poor.
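
To make the categorical/ordinal distinction concrete, here is a minimal sketch of encoding an ordinal attribute as integer ranks. The label names and their ordering are assumptions for illustration and should be checked against the values actually present in `condition`. `encode_ordinal` is a hypothetical helper.

```python
# Assumed ordering from worst to best; verify against the real labels.
CONDITION_ORDER = ["salvage", "fair", "good", "excellent", "like new", "new"]
CONDITION_RANK = {label: rank for rank, label in enumerate(CONDITION_ORDER)}

def encode_ordinal(value, ranks=CONDITION_RANK):
    """Map an ordinal label to its integer rank; unknown or missing -> None."""
    if value is None:
        return None
    return ranks.get(value.strip().lower())

print([encode_ordinal(v) for v in ["excellent", "Fair", None, "unknown"]])
# prints [3, 1, None, None]
```

A purely categorical attribute such as `fuel` has no such ordering, so it would typically be one-hot encoded rather than ranked.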

Convert the **numerical feature types** from string to float to obtain continuous features:

In [171]:
from functools import reduce

cols = ["price", "year", "odometer"]

# Fold over the column names, casting each one on the accumulated DataFrame
vehicle_listings = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, memo_df[col_name].cast("float")),
    cols,
    vehicle_listings,
)
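
The `reduce` call above threads one DataFrame through a sequence of column casts, passing the partially converted result along as `memo_df`. A plain-Python analogue may make the pattern easier to follow. It uses a dict of lists as a toy stand-in for a DataFrame (an assumption for illustration only) and turns unparseable strings into `None`, similar to how Spark's `cast` yields null under its default settings. `cast_column` is a hypothetical helper.

```python
from functools import reduce

def cast_column(table, col_name):
    """Return a copy of `table` with `col_name` converted to float (None on failure)."""
    def to_float(text):
        try:
            return float(text)
        except (TypeError, ValueError):
            return None
    converted = dict(table)  # shallow copy; other columns are shared, not modified
    converted[col_name] = [to_float(v) for v in table[col_name]]
    return converted

# Toy stand-in for a DataFrame: column name -> list of string values
toy = {"price": ["4500", "n/a"], "year": ["2007", "2013"], "city": ["whistler", "zanesville"]}

# reduce passes the accumulated table into each step, like memo_df above
result = reduce(cast_column, ["price", "year"], toy)
print(result["price"], result["year"], result["city"])
# prints [4500.0, None] [2007.0, 2013.0] ['whistler', 'zanesville']
```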

In [175]:
vehicle_listings.printSchema()
print("Data Size-",vehicle_listings.count(),"*",len(vehicle_listings.columns))

root
 |-- city: string (nullable = true)
 |-- price: float (nullable = true)
 |-- year: float (nullable = true)
 |-- manufacturer: string (nullable = true)
 |-- make: string (nullable = true)
 |-- condition: string (nullable = true)
 |-- cylinders: string (nullable = true)
 |-- fuel: string (nullable = true)
 |-- odometer: float (nullable = true)
 |-- title_status: string (nullable = true)
 |-- transmission: string (nullable = true)
 |-- drive: string (nullable = true)
 |-- size: string (nullable = true)
 |-- type: string (nullable = true)
 |-- county_name: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- state_name: string (nullable = true)

Data Size- 1723065 * 17


There are over 1.7 million records and 17 columns. As the `.describe()` output for the various attributes shows, the dataset is too large to explore comfortably in full. <br>
Some columns provide no useful data for the analysis and were removed here, in the Extract stage. The final data from this stage should be as clean as possible, with no redundant columns.<br>

After being cleaned, we store the data at the end of the Extract Stage as a parquet file, so we can return to this file without having to run the Extract pipeline again. <br>
With the new dataset, we sample the data and study it using a **Pandas** dataframe, in order to start the Transform stage of the pipeline with an *Exploratory Data Analysis*.<br>


# Transform Data

## Exploratory Data Analysis

To start the *EDA*, focus on the target variable, i.e. the variable we want our model to be able to predict - **price** 

In [155]:
vehicle_listings.where("price>100").select("price").describe().show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|           1696575|
|   mean|109233.15024799964|
| stddev|1.01224902731346E7|
|    min|             101.0|
|    max|      2.06862669E9|
+-------+------------------+



In [160]:
# Import under an alias so Spark's max/col do not shadow Python built-ins
from pyspark.sql import functions as F
vehicle_listings.agg(F.max(F.col("price"))).show()
                    

+------------+
|  max(price)|
+------------+
|2.06862669E9|
+------------+
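
Both outputs point to extreme outliers: the maximum price is about 2.07 billion, which drags the mean above 109,000. One possible way to tame such values is to keep only the prices between two percentiles. It is sketched here in plain Python as an assumption, not as the approach this notebook takes. The function names, the toy prices and the chosen percentiles are all illustrative.

```python
import statistics

def percentile_bounds(values, lower_pct=1, upper_pct=99):
    """Return (low, high) cut points at the given whole-number percentiles."""
    # n=100 yields 99 cut points; index p-1 holds the p-th percentile
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

def trim(values, low, high):
    """Keep only values within [low, high]."""
    return [v for v in values if low <= v <= high]

# Toy prices: mostly ordinary listings plus one absurd entry
prices = [3500.0, 7900.0, 15800.0, 4200.0, 9900.0, 2.0e9]
low, high = percentile_bounds(prices, lower_pct=5, upper_pct=80)
print(low, high, trim(prices, low, high))
```

On the full Spark DataFrame, comparable cut points could be obtained with `DataFrame.approxQuantile` instead of collecting the column.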



To explore the dataset, draw a random sample from the Spark DataFrame and convert it to a **pandas DataFrame**:

In [114]:
seed=63
pd_vehicle_listings=vehicle_listings.sample(False,0.001,seed).toPandas()
pd_vehicle_listings.describe(include='all')

Unnamed: 0,url,city,price,year,manufacturer,make,condition,cylinders,fuel,odometer,...,paint_color,image_url,lat,long,county_fips,county_name,state_fips,state_code,state_name,weather
count,1772,1772,1772.0,1769.0,1634,1707.0,1049,1064,1764,1252.0,...,1079,1772,1772.0,1772.0,1685.0,1685,1685.0,1685,1772,1685.0
unique,1772,364,,,39,963.0,6,7,5,,...,12,1771,,,,599,,50,51,
top,https://skagit.craigslist.org/cto/d/mechanics-...,whistler,,,ford,1500.0,excellent,6 cylinders,gas,,...,white,https://images.craigslist.org/00r0r_9HsbmQi8aC...,,,,Jefferson,,CA,California,
freq,1,16,,,291,34.0,469,380,1581,,...,254,2,,,,26,,143,143,
mean,,,11439.492188,2005.105713,,,,,,109663.1,...,,,38.828968,-93.589165,29143.734375,,29.0546,,,54.055786
std,,,11828.822266,12.365001,,,,,,102515.4,...,,,6.043419,17.245035,16240.077148,,16.21484,,,7.75519
min,,,1.0,1921.0,,,,,,0.0,...,,,-27.369877,-176.791992,1003.0,,1.0,,,29.0
25%,,,3500.0,2003.0,,,,,,52493.0,...,,,34.951149,-105.004202,12131.0,,12.0,,,48.0
50%,,,7927.5,2007.0,,,,,,102810.0,...,,,39.504667,-88.256954,29510.0,,29.0,,,53.0
75%,,,15820.25,2013.0,,,,,,148450.8,...,,,42.623626,-80.855953,42071.0,,42.0,,,59.0
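
The sample above has 1772 rows rather than exactly 0.1% of 1,723,065 (about 1723). That is expected: `sample` without replacement keeps each row independently with the given probability, so the requested fraction is only approximate. A plain-Python sketch of the idea follows. It is an illustration, not Spark's implementation, and `bernoulli_sample` is a hypothetical name.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # seeded so the same rows are chosen on every run
    return [row for row in rows if rng.random() < fraction]

rows = range(100_000)
sample = bernoulli_sample(rows, fraction=0.01, seed=63)
# The size is close to 1000 but usually not exactly 1000
print(len(sample))
```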


The pandas `DataFrame.describe` method gives us basic statistics, but it is hard to determine how well the sample is distributed across the dataset. One way to figure this out is to observe the distribution of a single column. As an example, let us use the *"city"* column.

*Random sampling* like the above works well for simple use cases. However, the car records are probably not evenly distributed across cities. <br>
Plotting a histogram of the city frequencies can help check that.
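
A minimal sketch of that check, using only the standard library: it counts how often each city occurs and reports each city's share, so the sample's shares can be compared with those of the full dataset. The toy city lists are made up for illustration and `city_shares` is a hypothetical helper.

```python
from collections import Counter

def city_shares(cities):
    """Return {city: fraction of records}, ignoring missing values."""
    counts = Counter(c for c in cities if c is not None)
    total = sum(counts.values())
    return {city: n / total for city, n in counts.items()}

# Toy data standing in for the full dataset and a random sample of it
population = ["whistler"] * 6 + ["zanesville"] * 3 + ["abbotsford"] * 1
sample = ["whistler", "whistler", "zanesville", "whistler"]

pop_shares = city_shares(population)
sample_shares = city_shares(sample)
for city in sorted(pop_shares):
    # A city missing from the sample has share 0.0 there
    print(city, pop_shares[city], sample_shares.get(city, 0.0))
```

With the pandas sample, `pd_vehicle_listings['city'].value_counts()` gives the same kind of counts and can be plotted as a bar chart.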

Separately, the correlation of each numeric column in the sample with *price*:

In [164]:
pd_vehicle_listings.corr()["price"].sort_values(ascending=False)

price          1.000000
year           0.282384
lat            0.102744
state_fips    -0.005703
county_fips   -0.005832
weather       -0.025649
long          -0.169885
odometer      -0.283637
Name: price, dtype: float64