# United States - Crime Rates - 1960 - 2014

### Introduction:

This time you will create a data 

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [44]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
import requests

spark = SparkSession.builder\
                    .appName('crimess')\
                    .getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/US_Crime_Rates/US_Crime_Rates_1960_2014.csv). 

### Step 3. Assign it to a variable called crime.

In [45]:
url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/US_Crime_Rates/US_Crime_Rates_1960_2014.csv'

request_url = requests.get(url)

with open('data.csv', 'w', encoding='utf-8') as f:
    f.write(request_url.text)
    
crime = spark.read.csv('data.csv', sep=',', header=True)

df_crime = pd.read_csv('data.csv', sep=',', header=0)

### Step 4. What is the type of the columns?

## 🐍 **PySpark Solution**

In [46]:
crime.printSchema()

root
 |-- Year: string (nullable = true)
 |-- Population: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- Violent: string (nullable = true)
 |-- Property: string (nullable = true)
 |-- Murder: string (nullable = true)
 |-- Forcible_Rape: string (nullable = true)
 |-- Robbery: string (nullable = true)
 |-- Aggravated_assault: string (nullable = true)
 |-- Burglary: string (nullable = true)
 |-- Larceny_Theft: string (nullable = true)
 |-- Vehicle_Theft: string (nullable = true)



In [47]:
df_crime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     int64
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: int64(12)
memory usage: 5.3 KB


##### Have you noticed that the type of Year is int64? pandas has a dedicated type for working with time series. Let's see it now.

### Step 5. Convert the type of the column Year to datetime64

In [48]:
crime = crime.withColumn('Year', F.year(F.col('Year').cast(T.DateType())))
crime.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Population: string (nullable = true)
 |-- Total: string (nullable = true)
 |-- Violent: string (nullable = true)
 |-- Property: string (nullable = true)
 |-- Murder: string (nullable = true)
 |-- Forcible_Rape: string (nullable = true)
 |-- Robbery: string (nullable = true)
 |-- Aggravated_assault: string (nullable = true)
 |-- Burglary: string (nullable = true)
 |-- Larceny_Theft: string (nullable = true)
 |-- Vehicle_Theft: string (nullable = true)



In [49]:
df_crime['Year'] = pd.to_datetime(df_crime['Year'])

df_crime['Year']

0    1970-01-01 00:00:00.000001960
1    1970-01-01 00:00:00.000001961
2    1970-01-01 00:00:00.000001962
3    1970-01-01 00:00:00.000001963
4    1970-01-01 00:00:00.000001964
5    1970-01-01 00:00:00.000001965
6    1970-01-01 00:00:00.000001966
7    1970-01-01 00:00:00.000001967
8    1970-01-01 00:00:00.000001968
9    1970-01-01 00:00:00.000001969
10   1970-01-01 00:00:00.000001970
11   1970-01-01 00:00:00.000001971
12   1970-01-01 00:00:00.000001972
13   1970-01-01 00:00:00.000001973
14   1970-01-01 00:00:00.000001974
15   1970-01-01 00:00:00.000001975
16   1970-01-01 00:00:00.000001976
17   1970-01-01 00:00:00.000001977
18   1970-01-01 00:00:00.000001978
19   1970-01-01 00:00:00.000001979
20   1970-01-01 00:00:00.000001980
21   1970-01-01 00:00:00.000001981
22   1970-01-01 00:00:00.000001982
23   1970-01-01 00:00:00.000001983
24   1970-01-01 00:00:00.000001984
25   1970-01-01 00:00:00.000001985
26   1970-01-01 00:00:00.000001986
27   1970-01-01 00:00:00.000001987
28   1970-01-01 00:0
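
Every row in the pandas output above lands on 1970-01-01 because `pd.to_datetime` reads bare integers as nanoseconds since the Unix epoch. A minimal sketch of one way to parse the values as calendar years instead, assuming a small stand-in Series (`years` is hypothetical, not the full `df_crime['Year']` column):

```python
import pandas as pd

# Hypothetical three-row stand-in for the Year column of df_crime.
years = pd.Series([1960, 1961, 1962], name="Year")

# Without a format, integers are interpreted as nanoseconds since 1970-01-01.
as_nanoseconds = pd.to_datetime(years)

# Converting to strings and passing format="%Y" parses each value as a year.
as_years = pd.to_datetime(years.astype(str), format="%Y")

print(as_nanoseconds.dt.year.tolist())
print(as_years.dt.year.tolist())
```

The first list contains only 1970, while the second keeps the original years, which is what later steps that rely on `.dt.year` need.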

### Step 6. Set the Year column as the index of the dataframe

- A column cannot be used as an index in PySpark (its DataFrames have no index), so only the pandas solution is shown

In [50]:
df_crime.set_index(df_crime['Year'])

| Year (index) | Year | Population | Total | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1970-01-01 00:00:00.000001960 | 1970-01-01 00:00:00.000001960 | 179323175 | 3384200 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
| 1970-01-01 00:00:00.000001961 | 1970-01-01 00:00:00.000001961 | 182992000 | 3488000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
| 1970-01-01 00:00:00.000001962 | 1970-01-01 00:00:00.000001962 | 185771000 | 3752200 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
| 1970-01-01 00:00:00.000001963 | 1970-01-01 00:00:00.000001963 | 188483000 | 4109500 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
| 1970-01-01 00:00:00.000001964 | 1970-01-01 00:00:00.000001964 | 191141000 | 4564600 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
| 1970-01-01 00:00:00.000001965 | 1970-01-01 00:00:00.000001965 | 193526000 | 4739400 | 387390 | 4352000 | 9960 | 23410 | 138690 | 215330 | 1282500 | 2572600 | 496900 |
| 1970-01-01 00:00:00.000001966 | 1970-01-01 00:00:00.000001966 | 195576000 | 5223500 | 430180 | 4793300 | 11040 | 25820 | 157990 | 235330 | 1410100 | 2822000 | 561200 |
| 1970-01-01 00:00:00.000001967 | 1970-01-01 00:00:00.000001967 | 197457000 | 5903400 | 499930 | 5403500 | 12240 | 27620 | 202910 | 257160 | 1632100 | 3111600 | 659800 |
| 1970-01-01 00:00:00.000001968 | 1970-01-01 00:00:00.000001968 | 199399000 | 6720200 | 595010 | 6125200 | 13800 | 31670 | 262840 | 286700 | 1858900 | 3482700 | 783600 |
| 1970-01-01 00:00:00.000001969 | 1970-01-01 00:00:00.000001969 | 201385000 | 7410900 | 661870 | 6749000 | 14760 | 37170 | 298850 | 311090 | 1981900 | 3888600 | 878500 |
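
Note that `set_index` returns a new DataFrame rather than modifying the original, and passing a Series (as above) keeps `Year` as a regular column too. A minimal sketch of the more common form, passing the column name and keeping the result, assuming a tiny hypothetical frame (`frame` and `indexed` are illustrative names):

```python
import pandas as pd

# Hypothetical two-row frame standing in for df_crime.
frame = pd.DataFrame({
    "Year": pd.to_datetime(["1960", "1961"], format="%Y"),
    "Murder": [9110, 8740],
})

# Passing the column name moves Year into the index and drops it as a column.
# The result must be assigned, because set_index does not change frame itself.
indexed = frame.set_index("Year")

print(indexed.index.name)
print(list(indexed.columns))
print(list(frame.columns))
```

Assigning the result (or reassigning to the same variable) is what makes the new index persist for later steps.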


### Step 7. Delete the Total column

In [51]:
crime = crime.drop(F.col('Total'))

In [52]:
crime.columns

['Year',
 'Population',
 'Violent',
 'Property',
 'Murder',
 'Forcible_Rape',
 'Robbery',
 'Aggravated_assault',
 'Burglary',
 'Larceny_Theft',
 'Vehicle_Theft']

In [53]:
df_crime.drop(columns=['Total'], inplace=True)

df_crime.head()

| | Year | Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970-01-01 00:00:00.000001960 | 179323175 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
| 1 | 1970-01-01 00:00:00.000001961 | 182992000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
| 2 | 1970-01-01 00:00:00.000001962 | 185771000 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
| 3 | 1970-01-01 00:00:00.000001963 | 188483000 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
| 4 | 1970-01-01 00:00:00.000001964 | 191141000 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |


### Step 8. Group the year by decades and sum the values

#### Pay attention to the Population column: summing it is a mistake

In [54]:
crime = crime.withColumn("Decade", (F.floor(F.col("Year") / 10) * 10).cast("int"))



In [55]:
crime.columns


['Year',
 'Population',
 'Violent',
 'Property',
 'Murder',
 'Forcible_Rape',
 'Robbery',
 'Aggravated_assault',
 'Burglary',
 'Larceny_Theft',
 'Vehicle_Theft',
 'Decade']

In [56]:

crime.groupBy("Decade").agg(
    F.sum("Population").alias("suma_population"),
    F.sum("Violent").alias("suma_Violent"),
    F.sum("Property").alias("suma_Property"),
    F.sum("Murder").alias("suma_Murder"),
    F.sum("Forcible_Rape").alias("suma_Forcible_Rape"),
    F.sum("Robbery").alias("suma_Robbery"),
    F.sum("Aggravated_assault").alias("suma_Aggravated_assault"),
    F.sum("Burglary").alias("suma_Burglary"),
    F.sum("Larceny_Theft").alias("suma_Larceny_Theft"),
    F.sum("Vehicle_Theft").alias("suma_Vehicle_Theft")
).orderBy("Decade").show()


+------+---------------+------------+-------------+-----------+------------------+------------+-----------------------+-------------+------------------+------------------+
|Decade|suma_population|suma_Violent|suma_Property|suma_Murder|suma_Forcible_Rape|suma_Robbery|suma_Aggravated_assault|suma_Burglary|suma_Larceny_Theft|suma_Vehicle_Theft|
+------+---------------+------------+-------------+-----------+------------------+------------+-----------------------+-------------+------------------+------------------+
|  1960|  1.915053175E9|   4134930.0|    4.51609E7|   106180.0|          236720.0|   1633510.0|              2158520.0|    1.33211E7|         2.65477E7|         5292100.0|
|  1970|  2.121193298E9|   9607930.0|    9.13838E7|   192230.0|          554570.0|   4159020.0|              4702120.0|     2.8486E7|         5.31578E7|         9739900.0|
|  1980|  2.371370069E9| 1.4074328E7|   1.170489E8|   206439.0|          865639.0|   5383109.0|              7619130.0|  3.3073494E7|       

In [64]:
import numpy as np

df_crime['decadas'] = (np.floor(df_crime['Year'].dt.year / 10) * 10).astype(int)
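
Step 8 warns that Population should not be summed, since it is a yearly snapshot rather than a count of events. A minimal sketch of a pandas aggregation that sums the crime counts but takes the maximum population per decade, assuming integer years and made-up values (`sample` and all of its numbers are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical data: four years across two decades with placeholder values.
sample = pd.DataFrame({
    "Year": [1960, 1961, 1970, 1971],
    "Population": [100, 110, 200, 210],
    "Murder": [1, 2, 3, 4],
})

# Integer division by 10 floors each year to the start of its decade.
decade = ((sample["Year"] // 10) * 10).rename("Decade")

# Sum event counts, but keep the largest population seen in each decade.
by_decade = sample.groupby(decade).agg(
    Population=("Population", "max"),
    Murder=("Murder", "sum"),
)

print(by_decade)
```

Using `max` for Population is one reasonable choice; a mean or the last year's value would also avoid the double counting that a sum introduces.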

### Step 9. What is the most dangerous decade to live in the US?

- The 1990s, according to the data listed in Step 8
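
Rather than reading the answer off the table, the decade with the highest total for each crime type can be looked up with `idxmax`. A minimal sketch, assuming a per-decade totals frame indexed by decade (`decade_totals` and its numbers are hypothetical placeholders, not the real aggregates):

```python
import pandas as pd

# Hypothetical per-decade totals; the real values come from the Step 8 grouping.
decade_totals = pd.DataFrame(
    {"Violent": [10, 30, 20], "Property": [100, 300, 200]},
    index=pd.Index([1960, 1970, 1980], name="Decade"),
)

# idxmax returns, for each column, the index label where the maximum occurs.
most_dangerous = decade_totals.idxmax()

print(most_dangerous)
```

With the real aggregates, each entry of `most_dangerous` names the decade in which that crime category peaked.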