## SF crime data analysis and modeling

### In this notebook, you will learn how to use Spark SQL for big data analysis on SF crime data (https://data.sfgov.org/Public-Safety/sf-data/skgt-fej3/data).
The first part of the homework is OLAP for crime data analysis (80 credits).  
The second part is unsupervised learning for spatial data analysis (20 credits).  
The optional part is time series data analysis (50 credits).  
**Note**: you can download a small sample (one month, e.g. 2018-10) for debugging, then download the data from 2013 to 2018 for testing and analysis.

### How to submit the report for grading?
Publish your notebook and send it to mike@laioffer.com with an email subject of the form: Laidata181128_Spark_Hw1_Yourname  
Your report must contain your data analysis insights.

In [3]:
from csv import reader
from pyspark.sql import Row 
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from ggplot import *
import warnings

import os
os.environ["PYSPARK_PYTHON"] = "python3"


In [4]:
# Read the raw CSV from Databricks storage.
# Upload your data to Databricks Community first.
crime_data_lines = sc.textFile('/FileStore/tables/sf_data31.csv')
# Parse each line with the csv reader so quoted fields that contain commas stay intact.
df_crimes = crime_data_lines.map(lambda line: [x.strip('"') for x in next(reader([line]))])
# The first record is the header.
header = df_crimes.first()
print(header)

# Drop the header row from the data.
crimes = df_crimes.filter(lambda x: x != header)

# Preview the first three records.
#display(crimes.take(3))

# Total number of records.
print(crimes.count())


### Solve big data issues via Spark
approach 1: use RDD (not recommended)  
approach 2: use DataFrame, converting the RDD to a DataFrame (recommended for DE)  
approach 3: use SQL (recommended for data analysis or DS, and for students with a weaker background)  
***note***: you only need to choose one of the approaches introduced above

#### We provide 3 options to transform distributed data into a DataFrame and SQL table; you can choose any one of them to practice

In [7]:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("crime analysis") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df_opt1 = spark.read.format("csv").option("header", "true").load("/FileStore/tables/sf_data.csv")
display(df_opt1)
df_opt1.createOrReplaceTempView("sf_crime")

IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId,SF Find Neighborhoods,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,:@computed_region_yftq_j783,:@computed_region_p5aj_wyqh,:@computed_region_rxqg_mtj9,:@computed_region_bh8s_q3mv,:@computed_region_fyvs_ahh9,:@computed_region_9dfj_4gjx,:@computed_region_n4xg_c4py,:@computed_region_4isq_27mq,:@computed_region_fcz8_est8,:@computed_region_pigm_ib2e,:@computed_region_9jxd_iqea,:@computed_region_6pnf_4xz7,:@computed_region_6ezc_tdp2,:@computed_region_h4ep_8xdi,:@computed_region_nqbw_i6c3,:@computed_region_2dwj_jsy4
170623778,"SEX OFFENSES, FORCIBLE","FORCIBLE RAPE, BODILY FORCE",Monday,07-31-2017,23:59:00,TENDERLOIN,NONE,300 Block of EDDY ST,-122.4129305,37.78383444,"(37.783834437414136, -122.4129305220591)",17100000000000.0,20.0,5,10,36,7,10,9,28852,36,17.0,1.0,18.0,,18.0,6.0,2,1.0,1.0,,
170623483,"SEX OFFENSES, FORCIBLE",SEXUAL BATTERY,Monday,07-31-2017,23:42:00,TENDERLOIN,NONE,200 Block of OFARRELL ST,-122.4091554,37.78632459,"(37.786324588551146, -122.40915538690159)",17100000000000.0,19.0,5,3,36,5,10,10,28852,36,,,,,,5.0,2,1.0,1.0,,
170623433,OTHER OFFENSES,MISCELLANEOUS INVESTIGATION,Monday,07-31-2017,23:30:00,MISSION,NONE,100 Block of COLLINGWOOD ST,-122.4359649,37.76001224,"(37.760012242987266, -122.43596491800183)",17100000000000.0,38.0,3,5,5,2,4,5,28862,3,,,,,,,1,,,5.0,
176203156,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Monday,07-31-2017,23:30:00,SOUTHERN,NONE,9TH ST / TEHAMA ST,-122.4126295,37.77456454,"(37.77456454387646, -122.41262954835369)",17600000000000.0,32.0,1,10,34,8,2,9,28853,34,,1.0,,1.0,,,2,,,1.0,
170623405,OTHER OFFENSES,TRAFFIC VIOLATION,Monday,07-31-2017,23:29:00,TARAVAL,"ARREST, BOOKED",JUDAH ST / 18TH AV,-122.4759235,37.76174718,"(37.76174718252863, -122.47592350542668)",17100000000000.0,109.0,10,7,14,1,8,3,56,12,,,,,,,1,,,,
170623370,MISSING PERSON,MISSING JUVENILE,Monday,07-31-2017,23:03:00,BAYVIEW,NONE,0 Block of REUEL CT,-122.3828694,37.73664286,"(37.736642863386365, -122.38286941924974)",17100000000000.0,86.0,2,9,1,10,3,8,58,1,,,,,,,2,,,,
170623411,NON-CRIMINAL,"DEATH REPORT, NATURAL CAUSES",Monday,07-31-2017,23:00:00,CENTRAL,NONE,700 Block of CLAY ST,-122.4059732,37.79425019,"(37.79425018881105, -122.40597315343355)",17100000000000.0,104.0,6,3,6,3,1,10,28857,4,5.0,,5.0,,5.0,,2,,,,
170624005,STOLEN PROPERTY,"STOLEN PROPERTY, POSSESSION WITH KNOWLEDGE, RECEIVING",Monday,07-31-2017,23:00:00,TENDERLOIN,NONE,100 Block of GOLDEN GATE AV,-122.413048,37.78191196,"(37.78191196282981, -122.41304797207766)",17100000000000.0,20.0,5,10,36,7,10,9,28852,36,17.0,1.0,18.0,1.0,18.0,6.0,2,1.0,1.0,1.0,
170624005,LARCENY/THEFT,"LOST PROPERTY, GRAND THEFT",Monday,07-31-2017,23:00:00,TENDERLOIN,NONE,100 Block of GOLDEN GATE AV,-122.413048,37.78191196,"(37.78191196282981, -122.41304797207766)",17100000000000.0,20.0,5,10,36,7,10,9,28852,36,17.0,1.0,18.0,1.0,18.0,6.0,2,1.0,1.0,1.0,
170624005,WARRANTS,WARRANT ARREST,Monday,07-31-2017,23:00:00,TENDERLOIN,NONE,100 Block of GOLDEN GATE AV,-122.413048,37.78191196,"(37.78191196282981, -122.41304797207766)",17100000000000.0,20.0,5,10,36,7,10,9,28852,36,17.0,1.0,18.0,1.0,18.0,6.0,2,1.0,1.0,1.0,


In [8]:

from pyspark.sql import Row

def createRow(keys, values):
  assert len(keys) == len(values)
  mapped = dict(zip(keys, values))
  return Row(**mapped)

rdd_rows = crimes.map(lambda x: createRow(header, x))

df_opt2 = spark.createDataFrame(rdd_rows)
df_opt2.createOrReplaceTempView("sf_crime")
display(df_opt2)

Address,Category,Date,Date2,Day,DayOfWeek,Descript,Hour,IncidntNum,Location,Month,PdDistrict,PdId,Resolution,Time,X,Y,Year
700 Block of TEHAMA ST,VEHICLE THEFT,05/15/2018,2018/5/15,15,Tuesday,STOLEN MOTORCYCLE,11:00,180362289,"(37.77520656149669, -122.41191202732877)",5,SOUTHERN,18000000000000.0,NONE,10:30,-122.411912,37.77520656,2018
MARKET ST / SOUTH VAN NESS AV,NON-CRIMINAL,05/15/2018,2018/5/15,15,Tuesday,"AIDED CASE, MENTAL DISTURBED",4:00,180360948,"(37.77514629165388, -122.41925789481357)",5,SOUTHERN,18000000000000.0,NONE,4:14,-122.4192579,37.77514629,2018
CAPP ST / 21ST ST,OTHER OFFENSES,05/15/2018,2018/5/15,15,Tuesday,PAROLE VIOLATION,2:00,180360879,"(37.757100579642824, -122.41781255878655)",5,MISSION,18000000000000.0,"ARREST, BOOKED",2:01,-122.4178126,37.75710058,2018
CAPP ST / 21ST ST,OTHER OFFENSES,05/15/2018,2018/5/15,15,Tuesday,TRAFFIC VIOLATION ARREST,2:00,180360879,"(37.757100579642824, -122.41781255878655)",5,MISSION,18000000000000.0,"ARREST, BOOKED",2:01,-122.4178126,37.75710058,2018
CAPP ST / 21ST ST,OTHER OFFENSES,05/15/2018,2018/5/15,15,Tuesday,TRAFFIC VIOLATION,2:00,180360879,"(37.757100579642824, -122.41781255878655)",5,MISSION,18000000000000.0,"ARREST, BOOKED",2:01,-122.4178126,37.75710058,2018
700 Block of SHOTWELL ST,OTHER OFFENSES,05/15/2018,2018/5/15,15,Tuesday,"DRIVERS LICENSE, SUSPENDED OR REVOKED",1:00,180360829,"(37.75641376904809, -122.41561725232026)",5,MISSION,18000000000000.0,NONE,1:27,-122.4156173,37.75641377,2018
0 Block of 6TH ST,ROBBERY,05/15/2018,2018/5/15,15,Tuesday,"ROBBERY, BODILY FORCE",1:00,180360835,"(37.781953653725715, -122.41004163181597)",5,SOUTHERN,18000000000000.0,"ARREST, BOOKED",1:25,-122.4100416,37.78195365,2018
0 Block of 6TH ST,DRUG/NARCOTIC,05/15/2018,2018/5/15,15,Tuesday,POSSESSION OF NARCOTICS PARAPHERNALIA,1:00,180360835,"(37.781953653725715, -122.41004163181597)",5,SOUTHERN,18000000000000.0,"ARREST, BOOKED",1:25,-122.4100416,37.78195365,2018
1500 Block of HAIGHT ST,LIQUOR LAWS,05/15/2018,2018/5/15,15,Tuesday,MISCELLANEOUS LIQOUR LAW VIOLATION,0:00,180360794,"(37.76984648754153, -122.44776112231955)",5,PARK,18000000000000.0,"ARREST, BOOKED",0:19,-122.4477611,37.76984649,2018
1500 Block of HAIGHT ST,WARRANTS,05/15/2018,2018/5/15,15,Tuesday,ENROUTE TO OUTSIDE JURISDICTION,0:00,180360794,"(37.76984648754153, -122.44776112231955)",5,PARK,18000000000000.0,"ARREST, BOOKED",0:19,-122.4477611,37.76984649,2018


In [9]:

df_opt3 = crimes.toDF(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y', 'Location', 'PdId','date2','year','month','day','hour'])
display(df_opt3)
df_opt3.createOrReplaceTempView("sf_crime")

IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId,date2,year,month,day,hour
180362289,VEHICLE THEFT,STOLEN MOTORCYCLE,Tuesday,05/15/2018,10:30,SOUTHERN,NONE,700 Block of TEHAMA ST,-122.411912,37.77520656,"(37.77520656149669, -122.41191202732877)",18000000000000.0,2018/5/15,2018,5,15,11:00
180360948,NON-CRIMINAL,"AIDED CASE, MENTAL DISTURBED",Tuesday,05/15/2018,4:14,SOUTHERN,NONE,MARKET ST / SOUTH VAN NESS AV,-122.4192579,37.77514629,"(37.77514629165388, -122.41925789481357)",18000000000000.0,2018/5/15,2018,5,15,4:00
180360879,OTHER OFFENSES,PAROLE VIOLATION,Tuesday,05/15/2018,2:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.4178126,37.75710058,"(37.757100579642824, -122.41781255878655)",18000000000000.0,2018/5/15,2018,5,15,2:00
180360879,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Tuesday,05/15/2018,2:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.4178126,37.75710058,"(37.757100579642824, -122.41781255878655)",18000000000000.0,2018/5/15,2018,5,15,2:00
180360879,OTHER OFFENSES,TRAFFIC VIOLATION,Tuesday,05/15/2018,2:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.4178126,37.75710058,"(37.757100579642824, -122.41781255878655)",18000000000000.0,2018/5/15,2018,5,15,2:00
180360829,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",Tuesday,05/15/2018,1:27,MISSION,NONE,700 Block of SHOTWELL ST,-122.4156173,37.75641377,"(37.75641376904809, -122.41561725232026)",18000000000000.0,2018/5/15,2018,5,15,1:00
180360835,ROBBERY,"ROBBERY, BODILY FORCE",Tuesday,05/15/2018,1:25,SOUTHERN,"ARREST, BOOKED",0 Block of 6TH ST,-122.4100416,37.78195365,"(37.781953653725715, -122.41004163181597)",18000000000000.0,2018/5/15,2018,5,15,1:00
180360835,DRUG/NARCOTIC,POSSESSION OF NARCOTICS PARAPHERNALIA,Tuesday,05/15/2018,1:25,SOUTHERN,"ARREST, BOOKED",0 Block of 6TH ST,-122.4100416,37.78195365,"(37.781953653725715, -122.41004163181597)",18000000000000.0,2018/5/15,2018,5,15,1:00
180360794,LIQUOR LAWS,MISCELLANEOUS LIQOUR LAW VIOLATION,Tuesday,05/15/2018,0:19,PARK,"ARREST, BOOKED",1500 Block of HAIGHT ST,-122.4477611,37.76984649,"(37.76984648754153, -122.44776112231955)",18000000000000.0,2018/5/15,2018,5,15,0:00
180360794,WARRANTS,ENROUTE TO OUTSIDE JURISDICTION,Tuesday,05/15/2018,0:19,PARK,"ARREST, BOOKED",1500 Block of HAIGHT ST,-122.4477611,37.76984649,"(37.76984648754153, -122.44776112231955)",18000000000000.0,2018/5/15,2018,5,15,0:00


#### Q1 question (OLAP):
##### Write a Spark program that counts the number of crimes in each category.

Below is example code that demonstrates how to use Spark RDD, DataFrame, and SQL to work with big data. You can follow this example to finish the other questions.

In [11]:

# RDD based: map each record to (category, 1), sum per category, then sort by count descending.
from operator import add
category_set_rdd = crimes.map(lambda item: (item[1], 1))
result = sorted(category_set_rdd.reduceByKey(add).collect(), key=lambda item: -item[1])
display(result)

_1,_2
LARCENY/THEFT,218390
OTHER OFFENSES,104175
NON-CRIMINAL,100199
ASSAULT,69954
VANDALISM,42917
VEHICLE THEFT,34921
WARRANTS,33515
BURGLARY,31960
SUSPICIOUS OCC,30042
DRUG/NARCOTIC,25470


In [12]:
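# DataFrame based. df_opt1 was loaded from sf_data.csv rather than sf_data31.csv, so its counts differ from the RDD and SQL results.
q1_result = df_opt1.groupBy('category').count().orderBy('count', ascending=False)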
display(q1_result)

category,count
LARCENY/THEFT,2805
OTHER OFFENSES,1002
NON-CRIMINAL,991
ASSAULT,780
VANDALISM,650
VEHICLE THEFT,353
SUSPICIOUS OCC,312
WARRANTS,312
BURGLARY,302
MISSING PERSON,265


In [13]:
#Spark SQL based
crimeCategory = spark.sql("SELECT  category, COUNT(*) AS Count FROM sf_crime GROUP BY category ORDER BY Count DESC")
display(crimeCategory)

category,Count
LARCENY/THEFT,218390
OTHER OFFENSES,104175
NON-CRIMINAL,100199
ASSAULT,69954
VANDALISM,42917
VEHICLE THEFT,34921
WARRANTS,33515
BURGLARY,31960
SUSPICIOUS OCC,30042
DRUG/NARCOTIC,25470


In [14]:
# Important hints:
## Step 1: use Spark DataFrame or SQL to compute the statistics.
## Step 2: export the result to a pandas DataFrame.

crimes_pd_df = crimeCategory.toPandas()

# Spark has no built-in plotting, so plot the pandas DataFrame with matplotlib (https://matplotlib.org/). In Databricks Community, call display() to show the figure.

#display(p)
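
The two hinted steps end with a pandas DataFrame and no figure yet. A minimal sketch of the plotting step follows, assuming the pandas DataFrame has the `category` and `Count` columns produced by the Spark SQL query above. The helper name `plot_category_counts` and its `top_n` parameter are illustrative, not part of the assignment.

```python
def plot_category_counts(pdf, top_n=10):
    """Draw a horizontal bar chart of the top_n categories by count.

    `pdf` is a pandas DataFrame with `category` and `Count` columns,
    for example the result of `crimeCategory.toPandas()`.
    """
    # Imported inside the function so the helper can be defined before
    # the plotting library is needed.
    import matplotlib.pyplot as plt

    top = pdf.sort_values("Count", ascending=False).head(top_n)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.barh(top["category"], top["Count"])
    ax.invert_yaxis()  # put the largest bar at the top
    ax.set_xlabel("Number of incidents")
    ax.set_title("Top %d crime categories" % top_n)
    fig.tight_layout()
    return fig

# In a Databricks notebook (assumes `crimes_pd_df` from the cell above):
# fig = plot_category_counts(crimes_pd_df)
# display(fig)
```

Returning the figure instead of calling `plt.show()` lets the same helper be passed to `display()` in Databricks or saved with `fig.savefig(...)` elsewhere.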

#### Q2 question (OLAP)
Count the number of crimes in each district and visualize your results.

In [16]:
q2_result = df_opt1.groupBy('PdDistrict').count().orderBy('count',ascending=False)
display(q2_result)

PdDistrict,count
SOUTHERN,1745
MISSION,1191
NORTHERN,1173
CENTRAL,1137
BAYVIEW,817
INGLESIDE,670
TARAVAL,662
TENDERLOIN,536
RICHMOND,527
PARK,519


#### Q3 question (OLAP)
Count the number of crimes on each Sunday in SF downtown.  
Hints: SF downtown is defined by a spatial range. For example, you can use a rectangle, or a circle with a center and a radius. You therefore need to write your own UDF to keep only the records located inside that range. You can follow the example here: https://changhsinlee.com/pyspark-udf/
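
As a rough sketch of the two shapes mentioned in the hint, the plain Python predicates below test whether a point falls inside a rectangle or a circle. The box corners, circle center, and radius are illustrative assumptions, not an official definition of SF downtown, and the function names are hypothetical.

```python
import math

# Assumed rectangle for "downtown" (X = longitude, Y = latitude).
DOWNTOWN_BOX = {"min_x": -122.42, "max_x": -122.39, "min_y": 37.77, "max_y": 37.80}

# Assumed circle: center as (x, y) and radius in kilometres.
DOWNTOWN_CENTER = (-122.4075, 37.7880)
DOWNTOWN_RADIUS_KM = 1.5

def in_rectangle(x, y, box=DOWNTOWN_BOX):
    """Return True if (x, y) lies inside the bounding box."""
    if x is None or y is None:
        return False
    x, y = float(x), float(y)  # X and Y are stored as strings in sf_crime
    return box["min_x"] <= x <= box["max_x"] and box["min_y"] <= y <= box["max_y"]

def haversine_km(x1, y1, x2, y2):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (x1, y1, x2, y2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def in_circle(x, y, center=DOWNTOWN_CENTER, radius_km=DOWNTOWN_RADIUS_KM):
    """Return True if (x, y) is within radius_km of the center."""
    if x is None or y is None:
        return False
    return haversine_km(float(x), float(y), center[0], center[1]) <= radius_km

# Registering a predicate as a Spark SQL UDF (run inside the notebook):
# from pyspark.sql.types import BooleanType
# spark.udf.register("in_downtown", in_rectangle, BooleanType())
# display(spark.sql(
#     "SELECT Date, COUNT(*) AS SundayCrimeCount FROM sf_crime "
#     "WHERE DayOfWeek = 'Sunday' AND in_downtown(X, Y) "
#     "GROUP BY Date ORDER BY Date"))
```

Keeping the geometry in ordinary Python functions makes it easy to check them on a few known points before wrapping them in a UDF.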

In [18]:
SundayCrime=spark.sql("SELECT Date,COUNT(*) AS SundayCrimeCount FROM sf_crime WHERE dayofweek='Sunday' and x>=-122.5 and x<=-122.2 and y>=37.8 and y<=37.9 GROUP BY Date ORDER BY Date")
display(SundayCrime)

Date,SundayCrimeCount


In [19]:
dayCrime=spark.sql("SELECT dayofweek,COUNT(*) AS count FROM sf_crime WHERE x>=-122.5 and x<=-122.2 and y>=37.8 and y<=37.9 GROUP BY dayofweek ORDER BY count DESC")
display(dayCrime)

#### Q4 question (OLAP)
Analyze the number of crimes in each month of 2015, 2016, 2017, and 2018. Then give your insights on the results. What is the business impact of your findings?

In [21]:
Crime2015=spark.sql("SELECT CAST(month AS int), COUNT(*) AS Count FROM sf_crime WHERE year='2015' GROUP BY month ORDER BY CAST(month AS int)")
display(Crime2015)

month,Count
1,13606
2,12329
3,13929
4,12959
5,13729
6,13304
7,13365
8,13730
9,12896
10,13147


In [22]:
Crime2016=spark.sql("SELECT CAST(month AS int), COUNT(*) AS Count FROM sf_crime WHERE year='2016' GROUP BY month ORDER BY CAST(month AS int)")
display(Crime2016)

month,Count
1,12967
2,12106
3,12380
4,12328
5,12732
6,12094
7,12191
8,12471
9,12499
10,13388


In [23]:
Crime2017=spark.sql("SELECT CAST(month AS int), COUNT(*) AS Count FROM sf_crime WHERE year='2017' GROUP BY month ORDER BY CAST(month AS int)")
display(Crime2017)

month,Count
1,13084
2,12192
3,13711
4,12941
5,13267
6,12605
7,13171
8,12872
9,12684
10,13355


In [24]:
Crime2018=spark.sql("SELECT CAST(month AS int), COUNT(*) AS Count FROM sf_crime WHERE year='2018' GROUP BY month ORDER BY CAST(month AS int)")
display(Crime2018)

month,Count
1,12031
2,9947
3,10740
4,10306
5,3644


#### Q5 question (OLAP)
Analyze the number of crimes by hour on certain days, such as 2015/12/15, 2016/12/15, and 2017/12/15. Then give your travel suggestions for visiting SF.
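
One way to cover several days in a single query is to build the SQL from a list of dates. This is a sketch under the assumption that `Date` is stored as `MM/DD/YYYY` strings and `hour` as strings like `12:00`, as in the sample output in this notebook. The helper name `hourly_crime_query` is hypothetical.

```python
def hourly_crime_query(dates, table="sf_crime"):
    """Build a Spark SQL query that counts incidents per (Date, hour).

    `dates` must be trusted strings in the table's own Date format,
    e.g. '12/15/2015'; they are inserted into the SQL as literals.
    """
    quoted = ", ".join("'%s'" % d for d in dates)
    return (
        "SELECT Date, hour, COUNT(*) AS Count "
        "FROM %s WHERE Date IN (%s) "
        "GROUP BY Date, hour ORDER BY Date, Count DESC" % (table, quoted)
    )

query = hourly_crime_query(["12/15/2015", "12/15/2016", "12/15/2017"])
print(query)

# In the notebook:
# display(spark.sql(query))
```

Grouping by both `Date` and `hour` keeps the three days separate, so their hourly profiles can be compared side by side.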

In [26]:
#crimeCategory = spark.sql("SELECT  category, COUNT(*) AS Count FROM sf_crime GROUP BY category ORDER BY Count DESC")
dailyCrime=spark.sql("SELECT hour, COUNT(*) AS Count From sf_crime WHERE Date='12/15/2015' GROUP BY hour ORDER BY Count DESC")
display(dailyCrime)

hour,Count
12:00,33
16:00,28
19:00,22
15:00,22
18:00,21
10:00,21
13:00,20
14:00,19
17:00,18
21:00,17


#### Q6 question (OLAP)
(1) Step 1: find the top 3 most dangerous districts  
(2) Step 2: find the crime events by category and time (hour) for the districts from step 1  
(3) Step 3: give your advice on how to distribute the police based on your analysis results.

In [28]:
#Step1
dangerDis=spark.sql("SELECT PdDistrict, COUNT(*) AS Count FROM sf_crime GROUP BY PdDistrict ORDER BY Count DESC ")
display(dangerDis)

PdDistrict,Count
SOUTHERN,153747
MISSION,106994
NORTHERN,104025
CENTRAL,93527
BAYVIEW,75834
INGLESIDE,66343
TARAVAL,60191
TENDERLOIN,56719
PARK,47894
RICHMOND,46509


In [29]:
#Step2
#SundayCrime=spark.sql("SELECT Date,COUNT(*) AS SundayCrimeCount FROM sf_crime WHERE dayofweek='Sunday' and x>=-122.5 and x<=-122.2 and y>=37.8 and y<=37.9 GROUP BY Date ORDER BY Date")
crimeCategory = spark.sql("SELECT Category, COUNT(*) AS count FROM sf_crime WHERE PdDistrict='SOUTHERN' GROUP BY Category ORDER BY count DESC LIMIT 3")
display(crimeCategory)

Category,count
LARCENY/THEFT,51719
NON-CRIMINAL,19609
OTHER OFFENSES,17084


In [30]:
#crimeCategory = spark.sql("SELECT Category, COUNT(*) AS count2 FROM sf_crime WHERE PdDistrict IN ('SOUTHERN','MISSION','NORTHERN') GROUP BY Category ORDER BY count2 DESC limit 3")
#timeCategory = spark.sql("SELECT hour, COUNT(*) AS count FROM sf_crime WHERE PdDistrict='SOUTHERN' GROUP BY hour ORDER BY count DESC LIMIT 3")
timeCategory = spark.sql("SELECT hour, COUNT(*) AS count FROM sf_crime WHERE PdDistrict IN ('SOUTHERN','MISSION','NORTHERN') GROUP BY hour ORDER BY count DESC LIMIT 3")
display(timeCategory)

hour,count
18:00,23548
19:00,23335
17:00,22034


#### Q7 question (OLAP)
For each crime category, find the percentage of incidents that were resolved. Based on the output, give your suggestions for adjusting policy.

In [32]:
resolution_crime=spark.sql("SELECT Category, cast(cast(SUM(case when Resolution='ARREST, BOOKED' then 1 else 0 end)*100.0 / COUNT(*) as decimal(18,2)) as varchar(5))||'%' count from sf_crime Group by Category")
display(resolution_crime)

Category,count
FRAUD,14.27%
SUICIDE,1.78%
LIQUOR LAWS,66.14%
SECONDARY CODES,26.54%
FAMILY OFFENSES,36.07%
MISSING PERSON,2.31%
OTHER OFFENSES,49.37%
DRIVING UNDER THE INFLUENCE,90.19%
WARRANTS,91.77%
ARSON,22.86%


#### Q8 question (Apply Spark ML clustering for spatial data analysis)
Extra: visualize the spatial distribution of crimes and run a k-means clustering algorithm (please use Spark ML KMeans)  
You can refer to the Spark ML k-means example: https://spark.apache.org/docs/latest/ml-clustering.html#k-means
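
A minimal sketch of the clustering step with Spark ML follows, assuming a DataFrame with `X` (longitude) and `Y` (latitude) columns such as `df_opt3`. The helper name `cluster_crime_locations` and the defaults `k=5` and `seed=1` are illustrative choices, not values prescribed by the assignment. It treats longitude and latitude as planar coordinates, which is only an approximation.

```python
def cluster_crime_locations(df, k=5, seed=1):
    """Fit Spark ML k-means on the X/Y columns of `df`.

    Returns (model, clustered_df), where clustered_df has a `cluster` column.
    """
    # Imported inside the function so that defining the helper does not
    # require a running Spark installation.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.functions import col

    # X and Y were loaded as strings: cast to double and drop unparsable rows.
    points = df.select(
        col("X").cast("double").alias("X"),
        col("Y").cast("double").alias("Y"),
    ).dropna()

    # Spark ML estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=["X", "Y"], outputCol="features")
    features = assembler.transform(points)

    kmeans = KMeans(k=k, seed=seed, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(features)
    return model, model.transform(features)

# In the notebook (assumes df_opt3 from the cells above):
# model, clustered = cluster_crime_locations(df_opt3, k=5)
# print(model.clusterCenters())
# display(clustered.groupBy("cluster").count())
```

To visualize the result, a sample of `clustered` can be exported with `toPandas()` and drawn as a scatter plot of `X` against `Y`, colored by `cluster`.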

### Conclusion
Use four sentences to summarize your work: what you did, how you did it, what the technical steps were, and what the business impact is.  
More details are appreciated. Think of this as a report for your manager, and use it to show that you have a strong background in big data analysis.  
Point 1: what is your story, and why did you do this work?  
Point 2: how did you do it? Keywords: Spark, Spark SQL, DataFrame, data cleaning, data visualization, data size, clustering, OLAP  
Point 3: what did you learn from the data? Keywords: crime, trend, advice, conclusion, runtime

### Optional part: Time series analysis
This part is not based on Spark; it uses only the pandas time series package.  
Note: I am not familiar with time series models; please refer to the ARIMA model introduced by the other teacher.  
Process:  
1. Visualize the time series  
2. Plot the ACF and find the optimal parameters  
3. Train ARIMA  
4. Prediction
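
The four steps above could be sketched with pandas and statsmodels roughly as follows. The helper names, the `(1, 1, 1)` order, and the 6-step horizon are placeholder assumptions; the actual order should be chosen from the ACF/PACF plots in step 2.

```python
def plot_diagnostics(series, lags=12):
    """Steps 1-2: plot the series with its ACF and PACF.

    `series` is a pandas Series of counts indexed by date. It needs
    more than twice as many observations as `lags` for the PACF plot.
    """
    # Imported inside the functions so the helpers can be defined
    # before the libraries are needed.
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    fig, axes = plt.subplots(3, 1, figsize=(8, 9))
    series.plot(ax=axes[0], title="Monthly crime count")
    plot_acf(series, lags=lags, ax=axes[1])
    plot_pacf(series, lags=lags, ax=axes[2])
    fig.tight_layout()
    return fig

def fit_and_forecast(series, order=(1, 1, 1), steps=6):
    """Steps 3-4: fit an ARIMA(p, d, q) model and forecast `steps` periods."""
    from statsmodels.tsa.arima.model import ARIMA

    fitted = ARIMA(series, order=order).fit()
    return fitted, fitted.forecast(steps=steps)

# Building the monthly series from the Spark table (run in the notebook):
# import pandas as pd
# monthly = spark.sql(
#     "SELECT year, month, COUNT(*) AS n FROM sf_crime GROUP BY year, month"
# ).toPandas()
# dates = pd.to_datetime(monthly[["year", "month"]].astype(int).assign(day=1))
# series = pd.Series(monthly["n"].values, index=dates).sort_index()
# display(plot_diagnostics(series))
# fitted, forecast = fit_and_forecast(series)
```

The last month in the data may be incomplete (for example, May 2018 in the output above), so it is worth dropping partial months before fitting.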

References:  
https://zhuanlan.zhihu.com/p/35282988  
https://zhuanlan.zhihu.com/p/35128342  
https://www.statsmodels.org/dev/examples/notebooks/generated/tsa_arma_0.html  
https://www.howtoing.com/a-guide-to-time-series-forecasting-with-arima-in-python-3  
https://www.joinquant.com/post/9576?tag=algorithm  
https://blog.csdn.net/u012052268/article/details/79452244