# Working with rows

## Download and install Spark

In [1]:
!ls

Working_with_rows.ipynb


In [2]:
#!apt-get update
#!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#!wget -q http://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
#!tar xf spark-2.3.1-bin-hadoop2.7.tgz
#!pip install -q findspark

## Setup environment

In [3]:
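import os

# Locate the local Spark installation and make pyspark importable
import findspark
findspark.init()

# Reuse the running SparkContext if there is one, otherwise create it
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# SparkSession is the entry point for the DataFrame API
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark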

## Downloading and preprocessing Chicago's Reported Crime Data

In [4]:
#!wget https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
#!ls -l

In [5]:
#!mv rows.csv\?accessType\=DOWNLOAD reported-crimes.csv
#!ls -l

In [6]:
from pyspark.sql.functions import to_timestamp,col,lit

# Parse 'Date' as a timestamp and keep only records up to 2018-11-11
rc = (spark.read.csv('../../reported-crimes.csv',header=True)
      .withColumn('Date',to_timestamp(col('Date'),'MM/dd/yyyy hh:mm:ss a'))
      .filter(col('Date') <= lit('2018-11-11')))
rc.show(5)

+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+----+--------+------------+------------+----+-----------+-------------+--------------------+
|Case Number|               Date|               Block|IUCR|      Primary Type|         Description|Location Description|Arrest|Domestic|Beat|Ward|FBI Code|X Coordinate|Y Coordinate|Year|   Latitude|    Longitude|            Location|
+-----------+-------------------+--------------------+----+------------------+--------------------+--------------------+------+--------+----+----+--------+------------+------------+----+-----------+-------------+--------------------+
|   JA366925|2001-01-01 11:00:00|     016XX E 86TH PL|1153|DECEPTIVE PRACTICE|FINANCIAL IDENTIT...|           RESIDENCE| false|   false|0412|   8|      11|        null|        null|2001|       null|         null|                null|
|    G553545|2001-09-15 02:00:00|     013XX W POLK ST|0460|     

## Working with rows

**Add the reported crimes for an additional day, 12-Nov-2018, to our dataset.**

In [14]:
one_day = (spark.read.csv('../../reported-crimes.csv',header=True)
           .withColumn('Date',to_timestamp(col('Date'),'MM/dd/yyyy hh:mm:ss a'))
           .filter(col('Date') == lit('2018-11-12')))

In [15]:
one_day.count()

0

In [11]:
rc.union(one_day).orderBy("Date", ascending=False).show(5)

+-----------+-------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+----+--------+------------+------------+----+------------+-------------+--------------------+
|Case Number|               Date|               Block|IUCR|   Primary Type|         Description|Location Description|Arrest|Domestic|Beat|Ward|FBI Code|X Coordinate|Y Coordinate|Year|    Latitude|    Longitude|            Location|
+-----------+-------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+----+--------+------------+------------+----+------------+-------------+--------------------+
|   HH100082|2001-12-31 23:58:00|    0000X N OGDEN AV|0810|          THEFT|           OVER $500|              STREET| false|   false|1333|null|      06|     1166017|     1900248|2001| 41.88186382|-87.665846779|(41.88186382, -87...|
|   HH100172|2001-12-31 23:55:00|  043XX S PULASKI RD|1320|CRIMINAL DAMA

**What are the top 10 primary types by number of reported crimes, in descending order of occurrence?**