We are interested in a system that classifies crime descriptions into categories. Such a system could automatically assign a described crime to a category, which would help law enforcement assign the right officers to a crime, or even assign officers automatically based on the classification.

We are using the San Francisco Crime dataset from Kaggle. Our task is to train a model on the 39 predefined categories and test its accuracy.

To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in PySpark.

## Setup Spark and load other libraries

In [1]:
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("crimeClass").getOrCreate()

sc = spark.sparkContext

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.random.seed(60)

In [3]:
from pyspark.sql.functions import col, lower

## Data Extraction

In [4]:
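# Note: this pattern does not match values like '2015-05-13 23:53:00' ('yyyy-MM-dd HH:mm:ss' would), so Dates is read as a string
data = spark.read.csv('train/train.csv', inferSchema=True, header=True, timestampFormat='yyyy-mm-dd hh mm ss')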

In [5]:
data.head(1)

[Row(Dates='2015-05-13 23:53:00', Category='WARRANTS', Descript='WARRANT ARREST', DayOfWeek='Wednesday', PdDistrict='NORTHERN', Resolution='ARREST, BOOKED', Address='OAK ST / LAGUNA ST', X=-122.425891675136, Y=37.7745985956747)]

In [6]:
data.show()

+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-------------------+------------------+
|              Dates|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|             Address|                  X|                 Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-------------------+------------------+
|2015-05-13 23:53:00|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST|  -122.425891675136|  37.7745985956747|
|2015-05-13 23:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST|  -122.425891675136|  37.7745985956747|
|2015-05-13 23:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...|   -122.42436302145|  37.8004143219856|
|2015-05-13 23:30:00| LARCENY/THEFT|GRAND THEFT FROM ...|Wednesday|  NORTHER

In [7]:
data.printSchema()

root
 |-- Dates: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Descript: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- PdDistrict: string (nullable = true)
 |-- Resolution: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)



In [8]:
data.select('Category', 'Descript').show(truncate = False)

+--------------+-----------------------------------------+
|Category      |Descript                                 |
+--------------+-----------------------------------------+
|WARRANTS      |WARRANT ARREST                           |
|OTHER OFFENSES|TRAFFIC VIOLATION ARREST                 |
|OTHER OFFENSES|TRAFFIC VIOLATION ARREST                 |
|LARCENY/THEFT |GRAND THEFT FROM LOCKED AUTO             |
|LARCENY/THEFT |GRAND THEFT FROM LOCKED AUTO             |
|LARCENY/THEFT |GRAND THEFT FROM UNLOCKED AUTO           |
|VEHICLE THEFT |STOLEN AUTOMOBILE                        |
|VEHICLE THEFT |STOLEN AUTOMOBILE                        |
|LARCENY/THEFT |GRAND THEFT FROM LOCKED AUTO             |
|LARCENY/THEFT |GRAND THEFT FROM LOCKED AUTO             |
|LARCENY/THEFT |PETTY THEFT FROM LOCKED AUTO             |
|OTHER OFFENSES|MISCELLANEOUS INVESTIGATION              |
|VANDALISM     |MALICIOUS MISCHIEF, VANDALISM OF VEHICLES|
|LARCENY/THEFT |GRAND THEFT FROM LOCKED AUTO            

We can see that the Descript column gives some idea of how to classify the crimes into different categories.

In [9]:
#convert both columns to lower case and rename Descript to Description
new_data = data.select(lower(col('Category')).alias('Category'),
                       lower(col('Descript')).alias('Description'))

In [10]:
new_data.show()

+--------------+--------------------+
|      Category|         Description|
+--------------+--------------------+
|      warrants|      warrant arrest|
|other offenses|traffic violation...|
|other offenses|traffic violation...|
| larceny/theft|grand theft from ...|
| larceny/theft|grand theft from ...|
| larceny/theft|grand theft from ...|
| vehicle theft|   stolen automobile|
| vehicle theft|   stolen automobile|
| larceny/theft|grand theft from ...|
| larceny/theft|grand theft from ...|
| larceny/theft|petty theft from ...|
|other offenses|miscellaneous inv...|
|     vandalism|malicious mischie...|
| larceny/theft|grand theft from ...|
|  non-criminal|      found property|
|  non-criminal|      found property|
|       robbery|robbery, armed wi...|
|       assault|aggravated assaul...|
|other offenses|   traffic violation|
|  non-criminal|      found property|
+--------------+--------------------+
only showing top 20 rows



In [11]:
new_data.printSchema()

root
 |-- Category: string (nullable = true)
 |-- Description: string (nullable = true)



In [12]:
new_data.count()

878049

There are 878049 records in total in train.csv.

To familiarise ourselves with the dataset, we look at the most frequent crime categories and descriptions.

In [13]:
#total number of unique categories in given crime data
new_data.select('Category').distinct().count()

39

In [14]:
#top 10 categories present in data 
new_data.groupBy('Category').count().orderBy(col('count').desc()).show(10)

+--------------+------+
|      Category| count|
+--------------+------+
| larceny/theft|174900|
|other offenses|126182|
|  non-criminal| 92304|
|       assault| 76876|
| drug/narcotic| 53971|
| vehicle theft| 53781|
|     vandalism| 44725|
|      warrants| 42214|
|      burglary| 36755|
|suspicious occ| 31414|
+--------------+------+
only showing top 10 rows



In [15]:
#total number of unique Description in given crime data
new_data.select('Description').distinct().count()

879

In [16]:
#top 10 Descriptions present in data 
new_data.groupBy('Description').count().orderBy(col('count').desc()).show(10)

+--------------------+-----+
|         Description|count|
+--------------------+-----+
|grand theft from ...|60022|
|       lost property|31729|
|             battery|27441|
|   stolen automobile|26897|
|drivers license, ...|26839|
|      warrant arrest|23754|
|suspicious occurr...|21891|
|aided case, menta...|21497|
|petty theft from ...|19771|
|malicious mischie...|17789|
+--------------------+-----+
only showing top 10 rows



**The Category column will be our label (multi-class).**

## Splitting the dataset into Training and Test dataset

In [17]:
train_data, test_data = new_data.randomSplit([0.7,0.3], seed = 60)
print('Train data count - ', train_data.count())
print('Test data count - ', test_data.count())

Train data count -  614457
Test data count -  263592
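
The train count above is about 69.98% of the 878049 rows rather than exactly 70%, which is what we would expect if each row is assigned to a split independently at random. The following is a minimal plain-Python sketch of that idea, not Spark's actual implementation; `random_split` and the toy `rows` list are hypothetical names used only for illustration.

```python
import random

def random_split(rows, train_weight, seed):
    """Assign each row to train or test independently with a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row draws its own random number, so split sizes are only approximate
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(10_000))  # toy stand-in for the DataFrame rows
train, test = random_split(rows, 0.7, seed=60)
train_again, _ = random_split(rows, 0.7, seed=60)

print(len(train), len(test))
```

Reusing the same seed reproduces the same split, which is why the notebook fixes `seed = 60`.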


In [18]:
from pyspark.ml.feature import (RegexTokenizer, StopWordsRemover, CountVectorizer, OneHotEncoder, StringIndexer, 
                               VectorAssembler, HashingTF, IDF, Word2Vec)
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, NaiveBayes

We use the feature transformers imported above to work with the text data (the Description column). Machine learning algorithms operate on numbers, so we have to convert the text into a numeric representation.

Process:
- Description in text -> tokens -> stop-word removal -> count vector / TF-IDF -> Description as numbers
- For example:
- I am a boy, play cricket -> i, am, a, boy, play, cricket -> boy, play, cricket -> [1, 1, 1]
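
The steps above can be sketched in plain Python, without Spark. This is only an illustration of the idea: the stop-word list and the vocabulary are made up for the example, and the helper names (`tokenize`, `remove_stopwords`, `count_vector`) are hypothetical, not part of PySpark.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stopwords]

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical stop words and vocabulary, for illustration only
stopwords = {"i", "am", "a"}
vocabulary = ["boy", "cricket", "play"]

tokens = tokenize("I am a Boy, play cricket")
filtered = remove_stopwords(tokens, stopwords)
vector = count_vector(filtered, vocabulary)

print(tokens)    # ['i', 'am', 'a', 'boy', 'play', 'cricket']
print(filtered)  # ['boy', 'play', 'cricket']
print(vector)    # [1, 1, 1]
```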

In [19]:

#tokenizer with regextokenizer()
regex_tokenizer = RegexTokenizer(pattern='\\W')\
                  .setInputCol("Description")\
                  .setOutputCol("tokens")

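#stop-word removal with StopWordsRemover(); note that setStopWords replaces the default English list, so only the tokens below are removed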
extra_stopwords = ['http','amp','rt','t','c','the']
stopwords_remover = StopWordsRemover()\
                    .setInputCol('tokens')\
                    .setOutputCol('filtered_words')\
                    .setStopWords(extra_stopwords)
                    

#bags of words using countVectorizer()
count_vectors = CountVectorizer(vocabSize=10000, minDF=5)\
               .setInputCol("filtered_words")\
               .setOutputCol("features")


#TF-IDF to vectorise features instead of countVectoriser
hashingTf = HashingTF(numFeatures=10000)\
            .setInputCol("filtered_words")\
            .setOutputCol("raw_features")
            
#minDocFreq=5 gives an IDF weight of 0 to terms that appear in fewer than 5 documents
idf = IDF(minDocFreq=5)\
        .setInputCol("raw_features")\
        .setOutputCol("features")

#dense word embeddings using Word2Vec (defined here but not used in the pipelines below)
word2Vec = Word2Vec(vectorSize=1000, minCount=0)\
           .setInputCol("filtered_words")\
           .setOutputCol("features")

#Encode the Category variable into label using StringIndexer
label_string_idx = StringIndexer()\
                  .setInputCol("Category")\
                  .setOutputCol("label")

We use these feature transformers to build the pipelines for our models.
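
To make the TF-IDF step more concrete, here is a small plain-Python sketch of the weighting. It uses the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term. It keeps terms in a dictionary rather than hashing them into a fixed number of buckets as `HashingTF` does, and the toy documents and function names are assumptions made for illustration.

```python
import math
from collections import Counter

def idf_weights(docs, min_doc_freq=0):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((m + 1) / (count + 1)) if count >= min_doc_freq else 0.0
        for term, count in df.items()
    }

def tf_idf(doc, idf):
    """Scale raw term counts by the IDF weight of each term."""
    tf = Counter(doc)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}

# Toy tokenised descriptions (hypothetical)
docs = [
    ["grand", "theft", "from", "locked", "auto"],
    ["petty", "theft", "from", "locked", "auto"],
    ["stolen", "automobile"],
    ["warrant", "arrest"],
]

idf = idf_weights(docs)
weights = tf_idf(docs[0], idf)
print(weights)
```

In this toy corpus "theft" occurs in two documents and "grand" in one, so "grand" receives the larger weight: terms that are rarer across documents count for more.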

In [20]:
#logistic Regression classifier
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

#Naive Bayes classifier
nb = NaiveBayes(smoothing=1)
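
The `smoothing=1` parameter is add-one (Laplace) smoothing: every term count is incremented by one so that a term unseen in a class does not produce a zero probability. Below is a minimal textbook sketch of multinomial Naive Bayes in plain Python. It is not Spark's implementation (for instance, the class prior here is left unsmoothed), and the toy documents, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    """Fit log priors and smoothed log likelihoods for multinomial Naive Bayes."""
    vocab = {term for doc in docs for term in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        denom = total_terms + smoothing * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_lik = {
            term: math.log((term_counts[label][term] + smoothing) / denom)
            for term in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_lik = model[label]
        # Terms that never appeared in training are ignored
        return log_prior + sum(log_lik[t] for t in doc if t in log_lik)
    return max(model, key=score)

# Toy training data (hypothetical)
docs = [
    ["grand", "theft", "auto"],
    ["petty", "theft", "auto"],
    ["stolen", "automobile"],
    ["warrant", "arrest"],
]
labels = ["larceny/theft", "larceny/theft", "vehicle theft", "warrants"]

model = train_nb(docs, labels, smoothing=1.0)
print(predict_nb(model, ["theft", "auto"]))
print(predict_nb(model, ["warrant"]))
```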

### Logistic Regression with Count Vector Features

In [21]:
pipeline_cv_lr = Pipeline().setStages([regex_tokenizer,stopwords_remover,count_vectors,label_string_idx, lr])
model_cv_lr = pipeline_cv_lr.fit(train_data)
predictions_cv_lr = model_cv_lr.transform(test_data)

In [22]:
predictions_cv_lr.select('Description','Category',"probability","label","prediction").orderBy("probability", 
                                                                                              ascending=False).show(n=5)

+--------------------+-------------+--------------------+-----+----------+
|         Description|     Category|         probability|label|prediction|
+--------------------+-------------+--------------------+-----+----------+
|theft, bicycle, <...|larceny/theft|[0.87175291843817...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.87175291843817...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.87175291843817...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.87175291843817...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.87175291843817...|  0.0|       0.0|
+--------------------+-------------+--------------------+-----+----------+
only showing top 5 rows
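
The label 0.0 for larceny/theft in the output above is consistent with `StringIndexer`'s default ordering (`frequencyDesc`), which gives the most frequent category the lowest index. A plain-Python sketch of that mapping follows; the alphabetical tie-break is an assumption about recent Spark versions, and `fit_string_indexer` and the toy category list are hypothetical.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumed behaviour)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Toy category column (hypothetical counts)
categories = ["larceny/theft"] * 3 + ["other offenses"] * 2 + ["non-criminal"]

mapping = fit_string_indexer(categories)
print(mapping)
```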



In [23]:
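from pyspark.ml.evaluation import MulticlassClassificationEvaluator 
#note: metricName defaults to 'f1', so the value printed below is the weighted F1 score rather than plain accuracy
evaluator_cv_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_cv_lr)
print('Accuracy : ', evaluator_cv_lr)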

Accuracy :  0.9723579882349168
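
`MulticlassClassificationEvaluator` reports the weighted F1 score unless `metricName` is set to something else, so it helps to see how that differs from plain accuracy. The sketch below computes both in plain Python on a tiny made-up set of labels and predictions; the function names and data are hypothetical and this is not Spark's implementation.

```python
from collections import Counter

def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n / total
    return score

# Toy labels and predictions (hypothetical)
labels      = [0, 0, 0, 0, 1, 1, 2, 2]
predictions = [0, 0, 0, 1, 1, 1, 2, 0]

acc = accuracy(labels, predictions)
f1 = weighted_f1(labels, predictions)
print(acc, f1)
```

On this toy data the two metrics are close but not equal (0.75 versus roughly 0.742), which is why it matters which one a reported score refers to.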


### Naive Bayes with Count Vector Features

In [24]:
pipeline_cv_nb = Pipeline().setStages([regex_tokenizer,stopwords_remover,count_vectors,label_string_idx, nb])
model_cv_nb = pipeline_cv_nb.fit(train_data)
predictions_cv_nb = model_cv_nb.transform(test_data)

In [25]:
evaluator_cv_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_cv_nb)
print('Accuracy : ', evaluator_cv_nb)

Accuracy :  0.9935325400900984


### Logistic Regression Using TF-IDF Features

In [26]:
pipeline_idf_lr = Pipeline().setStages([regex_tokenizer,stopwords_remover,hashingTf, idf, label_string_idx, lr])
model_idf_lr = pipeline_idf_lr.fit(train_data)
predictions_idf_lr = model_idf_lr.transform(test_data)

In [27]:
predictions_idf_lr.select('Description','Category',"probability",
                          "label","prediction").orderBy("probability",ascending=False).show(n=5)

+--------------------+-------------+--------------------+-----+----------+
|         Description|     Category|         probability|label|prediction|
+--------------------+-------------+--------------------+-----+----------+
|theft, bicycle, <...|larceny/theft|[0.88418347520914...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.88418347520914...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.88418347520914...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.88418347520914...|  0.0|       0.0|
|theft, bicycle, <...|larceny/theft|[0.88418347520914...|  0.0|       0.0|
+--------------------+-------------+--------------------+-----+----------+
only showing top 5 rows



In [28]:
evaluator_idf_lr = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_idf_lr)
print('Accuracy : ', evaluator_idf_lr)

Accuracy :  0.972293366901647


### Naive Bayes with TF-IDF Features

In [29]:
pipeline_idf_nb = Pipeline().setStages([regex_tokenizer,stopwords_remover,hashingTf, idf, label_string_idx, nb])
model_idf_nb = pipeline_idf_nb.fit(train_data)
predictions_idf_nb = model_idf_nb.transform(test_data)

In [30]:
evaluator_idf_nb = MulticlassClassificationEvaluator().setPredictionCol("prediction").evaluate(predictions_idf_nb)
print('Accuracy : ', evaluator_idf_nb)

Accuracy :  0.9949302190940209


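As the scores show, Naive Bayes outperforms Logistic Regression with both vectorisers (0.9935 vs 0.9724 with count vectors, 0.9949 vs 0.9723 with TF-IDF). TF-IDF gives Naive Bayes a small gain, while Logistic Regression scores almost the same with either vectoriser, so the best combination on this dataset is Naive Bayes with TF-IDF features.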

Link - https://github.com/aakinlalu/Crime-Classification-using-PySpark/