**<h1>VIOLENT CRIME ANALYSIS IN THE USA </h1>**

Harpreet Singh Ghotra, Abdelkader Habel

# Introduction 

<p><b>Crime analysis is the systematic study of patterns in crime and disorder. Information on these patterns can help law enforcement agencies deploy resources more effectively and assist detectives in their investigations.</b> Crime analysis also plays a vital role in formulating crime prevention strategies. Only through crime analysis can we see the bigger picture, and crucially an objective picture, of what is going on. Crime analysis also provides the objective information that decision makers can use to determine <b>"what to do", "where to do it" and "which resources to allocate to do it."</b></p>
<p>So, <b>to derive statistics about crime – to estimate its levels and trends, and inform law enforcement approaches to prevent it</b> – a conceptual framework for defining and thinking about crime is virtually a prerequisite. Developing and maintaining such a framework is no easy task, because the mechanics of crime are ever evolving and shifting. However, we can first analyze the violent crime data to find the patterns of crime occurring in the cities of a country, and then deploy stricter surveillance systems to guard those cities and lower their crime rates.</p>
<p>Here, we conduct our analysis to answer the following research questions.</p>
<p><b>1. Are violent crimes rising or falling in American cities?</b></p>
<p><b>2. Which is the most reported violent crime among all violent crimes?</b></p>
<p><b>3. Which cities of America have reported the highest numbers of violent crimes? </b></p>
<p><b>4. Time series analysis of reported violent crime cases in different cities </b></p>

# **<h2> Data Preprocessing </h2>**

**<h3>Data Sources </h3>**

<p>The <b>data source</b> for our analysis was <b>uploaded to Kaggle by the Marshall Project, a nonprofit organization that centers its work around criminal justice and the carceral system.</b> They named the dataset <b>"Crime in Context, 1975-2015"</b> (available at https://www.kaggle.com/marshallproject/crime-rates). This crime data was acquired from the <b>FBI Uniform Crime Reporting program's "Offenses Known and Clearances by Arrest" database for the year in question, held at the National Archives of Criminal Justice Data.</b> The data was compiled and <b>uploaded by Gabriel Dance, Tom Meagher, and Emily Hopkins of The Marshall Project.</b></p>
<p>The Marshall Project collected <b>more than 40 years of data on the four major crimes the FBI classifies as violent — homicide, rape, robbery and assault — in 68 police jurisdictions with populations of 250,000 or greater.</b></p>
<p> <b>The data is in CSV format and contains 2829 rows and 15 columns. The dataset file is named 'report.csv'.</b> </p>

**<h3>Loading data into memory</h3>**

**<h4>Importing necessary libraries</h4>**

<p><b>Pandas</b> for reading the CSV file and processing data frames</p>
<p><b>NumPy</b> for numerical operations in Python</p>
<p><b>Matplotlib</b> and <b>Seaborn</b> for visualizing the data</p>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


**<h4>Uploading data file in google colab </h4>**
<p>The <b>'report.csv'</b> file must be uploaded from your local computer. </p>

In [None]:
# from google.colab import files
# files.upload()

**<h4> Reading the csv files and printing the first five rows to get an understanding of data </h4>**

In [None]:
df = pd.read_csv('report.csv')
df.head()

**<h4> Knowing the datatypes of each column and checking the count of non-null values in each column</h4>**

In [None]:
df.info()

**<h4> Checking some statistical properties of the data to understand the data and its distribution</h4>**

In [None]:
df.describe()

**<h4> Here, we can see there are missing values in our datasets so let's quickly visualize the missing data with the help of a heatmap.</h4>**

In [None]:
sns.heatmap(df.isnull(), cbar=False)

## Missing Values

**<h4> Creating a dataframe for storing missing values for visualization purposes </h4>**

In [None]:
missing_values = df.isnull().sum()
missing_values = missing_values.to_frame()
missing_values.columns = ['count']
missing_values.index.names = ['Name']
missing_values['Name'] = missing_values.index
missing_values.reset_index(drop=True,inplace=True)
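The same per-column summary can be built in one chained expression, with a percentage column added for context. This is only a sketch on a small made-up frame (the `toy` data and the `percent` column are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crime data (values are made up for illustration)
toy = pd.DataFrame({
    "agency_code": ["A1", None, "B2", None],
    "population": [7_000_000, np.nan, 3_000_000, 600_000],
    "homicides": [1500, 20000, np.nan, 50],
})

# Count nulls per column and turn the result into a two-column frame
missing_summary = (
    toy.isnull().sum()
       .rename_axis("Name")
       .reset_index(name="count")
)

# Share of rows that are missing, as a percentage of all rows
missing_summary["percent"] = (100 * missing_summary["count"] / len(toy)).round(1)
print(missing_summary)
```

Replacing `toy` with `df` should give a frame with the same `Name` and `count` columns as the cell above.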

**<h4>Visualizing the number of missing data in each columns </h4>**

In [None]:
plt.figure(figsize=(30,15))
# One bar per column name; the bar height is the number of null values
ax = sns.barplot(x='Name',y='count',data = missing_values,color="y")
ax.set_title('Number of null values in the columns',fontsize=30)
ax.set_xlabel('Columns',fontsize=20)
ax.set_ylabel('Total number of null values',fontsize=20)
plt.show()

**<h3>Handling Missing Values </h3>**

In [None]:
df['agency_code'].value_counts()

In [None]:
df['agency_jurisdiction'].value_counts()

<h6> Here, we can see similar trends in the <b>'agency_code'</b> and <b>'agency_jurisdiction'</b> columns. But when we look at the unique values in each column, the <b>'agency_jurisdiction'</b> column has one more unique value than <b>'agency_code'</b>.</h6>
<h6><b>Note:</b> the <b>'agency_jurisdiction'</b> column does not have any missing values, but the <b>'agency_code'</b> column does. So, let's quickly look at the values in the <b>'agency_jurisdiction'</b> column for the rows where <b>'agency_code'</b> is missing.</h6>

In [None]:
df.loc[df['agency_code'].isnull(),:]

<h6> Since the <b>'agency_code'</b> and <b>'agency_jurisdiction'</b> columns follow the same pattern, they both identify the same thing. If we used both features during data modelling, they would be highly correlated, so we can drop either one. Here, we drop the <b>'agency_code'</b> column and store the remaining columns in another dataframe, <b>df2</b>.</h6>

In [None]:
df2 = df.drop('agency_code',axis=1)
df2.head()
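One way to back up the claim that the two columns are redundant is to count how many distinct codes each jurisdiction maps to. The sketch below uses a made-up frame; the agency codes in it are hypothetical placeholders, not values from 'report.csv':

```python
import pandas as pd

# Hypothetical codes and a few jurisdictions, for illustration only
toy = pd.DataFrame({
    "agency_code": ["X001", "X001", "Y002", None],
    "agency_jurisdiction": ["New York City, NY", "New York City, NY",
                            "Los Angeles, CA", "United States"],
})

# nunique() ignores missing values, so a jurisdiction without a code counts as 0
codes_per_jurisdiction = toy.groupby("agency_jurisdiction")["agency_code"].nunique()

# If no jurisdiction has more than one code, 'agency_code' adds no information
is_redundant = bool((codes_per_jurisdiction <= 1).all())
print(codes_per_jurisdiction)
print("agency_code redundant:", is_redundant)
```

Running the same two lines on `df` would show whether the assumption holds for the real data before the column is dropped.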

**<h4>Handling missing values in 'population' column</h4>**

In [None]:
df2.loc[df2['population'].isnull(),'agency_jurisdiction'].value_counts()

<h6>When we explore the rows with a missing population, we find that <b>41</b> of them have <b>'United States'</b> in the <b>'agency_jurisdiction'</b> column and <b>28</b> have <b>'Louisville, KY'</b>.</h6>

<h6> For <b>'Louisville, KY'</b>, the population values for <b>1975-2002</b> are missing, while the later years are available.</h6>

<h6> Note also that <b>'United States'</b> is not a city or state. These rows probably stand for all other cities and states <b>not covered in the 'agency_jurisdiction' column</b>. Since modelling needs clean, good-quality data, we remove the rows with missing values in the <b>'population'</b> column.</h6>

In [None]:
df2 = df2.loc[~df2['population'].isnull(),:]
df2.shape

**<h4>Handling missing values in four columns: 'homicides' , 'rapes' , 'assaults', 'robberies' columns at once </h4>**

In [None]:
df2.loc[df2['homicides'].isnull(),'agency_jurisdiction'].value_counts()

In [None]:
df2.loc[df2['rapes'].isnull(),'agency_jurisdiction'].value_counts()

In [None]:
df2.loc[df2['assaults'].isnull(),'agency_jurisdiction'].value_counts()

In [None]:
df2.loc[df2['robberies'].isnull(),'agency_jurisdiction'].value_counts()

<h6>When we explore the missing values in each of these columns, we find that all of them have <b>the same number of missing values</b> and that <b>a row with a missing value in any one of these columns is also missing the others</b>. Let's look at these rows in the dataframe.</h6>

**<h6> Filling missing values of 'homicides' column </h6>**

In [None]:
df2.loc[df2['homicides'].isnull(),:]

**<h5> Filling the missing values of 'homicides' column with the mean of the similar distribution</h5>**

In [None]:
print(round(df2.loc[(df2['agency_jurisdiction']=="Cincinnati, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'homicides'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Baltimore County, MD") & (df2['population'] > 800000.0) & (df2['population'] < 900000.0),'homicides'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Cleveland, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'homicides'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Portland, OR") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'homicides'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Tampa, FL") & (df2['population'] > 200000.0) & (df2['population'] < 300000.0),'homicides'].mean()))

<p> We could fill the missing values with the mean of the overall distribution as well, but here we follow a different approach.</p>
<p> The missing values in the <b>'homicides'</b> column come from <b>five different cities.</b> For each of these cities, we look at the <b>'population'</b> values of the rows with missing data, take the mean of <b>'homicides'</b> over that city's rows with a similar population, and use it to fill the missing values in the <b>'homicides'</b> column.</p>
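The per-city fill can also be written once as a small helper and applied in a loop. This is a minimal sketch, not the notebook's own code: `fill_with_band_mean` is a hypothetical helper name, the `toy` values are made up, and only the two population bands shown are taken from the cells in this section.

```python
import numpy as np
import pandas as pd

def fill_with_band_mean(frame, column, city, low, high):
    """Fill NaNs in `column` for `city` with the rounded mean of that city's
    rows whose population lies strictly between `low` and `high`."""
    is_city = frame["agency_jurisdiction"] == city
    in_band = is_city & (frame["population"] > low) & (frame["population"] < high)
    fill_value = round(frame.loc[in_band, column].mean())  # mean() skips NaN
    frame.loc[is_city & frame[column].isnull(), column] = fill_value
    return fill_value

# Made-up rows for two of the cities handled in this section
toy = pd.DataFrame({
    "agency_jurisdiction": ["Tampa, FL"] * 4 + ["Portland, OR"] * 2,
    "population": [250_000, 260_000, 270_000, 350_000, 650_000, 660_000],
    "homicides": [40.0, 50.0, np.nan, 90.0, 30.0, np.nan],
})

# Population band per city, as used in the notebook cells
bands = {"Tampa, FL": (200_000, 300_000), "Portland, OR": (600_000, 700_000)}
for city, (low, high) in bands.items():
    fill_with_band_mean(toy, "homicides", city, low, high)

print(toy)
```

The Tampa row with a population of 350,000 falls outside the band and so does not influence the fill value, which mirrors the population filter in the original expressions.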

In [None]:
df2.loc[(df2['homicides'].isnull()) & (df2['agency_jurisdiction']=="Cincinnati, OH"),'homicides'] = round(df2.loc[(df2['agency_jurisdiction']=="Cincinnati, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'homicides'].mean())
df2.loc[(df2['homicides'].isnull()) & (df2['agency_jurisdiction']=="Baltimore County, MD"),'homicides'] = round(df2.loc[(df2['agency_jurisdiction']=="Baltimore County, MD") & (df2['population'] > 800000.0) & (df2['population'] < 900000.0),'homicides'].mean())
df2.loc[(df2['homicides'].isnull()) & (df2['agency_jurisdiction']=="Cleveland, OH"),'homicides'] = round(df2.loc[(df2['agency_jurisdiction']=="Cleveland, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'homicides'].mean())
df2.loc[(df2['homicides'].isnull()) & (df2['agency_jurisdiction']=="Portland, OR"),'homicides'] = round(df2.loc[(df2['agency_jurisdiction']=="Portland, OR") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'homicides'].mean())
df2.loc[(df2['homicides'].isnull()) & (df2['agency_jurisdiction']=="Tampa, FL"),"homicides"] = round(df2.loc[(df2['agency_jurisdiction']=="Tampa, FL") & (df2['population'] > 200000.0) & (df2['population'] < 300000.0),'homicides'].mean())

**<h5>Filling the missing values of rapes column</h5>**

In [None]:
df2.loc[df2['rapes'].isnull(),:]

In [None]:
df2.loc[(df2['rapes'].isnull()) & (df2['agency_jurisdiction']=="Cincinnati, OH"),'rapes'] = round(df2.loc[(df2['agency_jurisdiction']=="Cincinnati, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'rapes'].mean())
df2.loc[(df2['rapes'].isnull()) & (df2['agency_jurisdiction']=="Baltimore County, MD"),'rapes'] = round(df2.loc[(df2['agency_jurisdiction']=="Baltimore County, MD") & (df2['population'] > 800000.0) & (df2['population'] < 900000.0),'rapes'].mean())
df2.loc[(df2['rapes'].isnull()) & (df2['agency_jurisdiction']=="Cleveland, OH"),'rapes'] = round(df2.loc[(df2['agency_jurisdiction']=="Cleveland, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'rapes'].mean())
df2.loc[(df2['rapes'].isnull()) & (df2['agency_jurisdiction']=="Portland, OR"),'rapes'] = round(df2.loc[(df2['agency_jurisdiction']=="Portland, OR") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'rapes'].mean())
df2.loc[(df2['rapes'].isnull()) & (df2['agency_jurisdiction']=="Tampa, FL"),"rapes"] = round(df2.loc[(df2['agency_jurisdiction']=="Tampa, FL") & (df2['population'] > 200000.0) & (df2['population'] < 300000.0),'rapes'].mean())

<p> The missing values in the <b>'rapes'</b> column come from <b>five different cities.</b> For each city, we look at the <b>'population'</b> values of the rows with missing data, take the mean of <b>'rapes'</b> over that city's rows with a similar population, and use it to fill the missing values in the <b>'rapes'</b> column.</p>

**<h5>Filling the missing values of assaults column </h5>**

In [None]:
df2.loc[df2['assaults'].isnull(),:]

In [None]:
df2.loc[(df2['assaults'].isnull()) & (df2['agency_jurisdiction']=="Cincinnati, OH"),'assaults'] = round(df2.loc[(df2['agency_jurisdiction']=="Cincinnati, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'assaults'].mean())
df2.loc[(df2['assaults'].isnull()) & (df2['agency_jurisdiction']=="Baltimore County, MD"),'assaults'] = round(df2.loc[(df2['agency_jurisdiction']=="Baltimore County, MD") & (df2['population'] > 800000.0) & (df2['population'] < 900000.0),'assaults'].mean())
df2.loc[(df2['assaults'].isnull()) & (df2['agency_jurisdiction']=="Cleveland, OH"),'assaults'] = round(df2.loc[(df2['agency_jurisdiction']=="Cleveland, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'assaults'].mean())
df2.loc[(df2['assaults'].isnull()) & (df2['agency_jurisdiction']=="Portland, OR"),'assaults'] = round(df2.loc[(df2['agency_jurisdiction']=="Portland, OR") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'assaults'].mean())
df2.loc[(df2['assaults'].isnull()) & (df2['agency_jurisdiction']=="Tampa, FL"),"assaults"] = round(df2.loc[(df2['agency_jurisdiction']=="Tampa, FL") & (df2['population'] > 200000.0) & (df2['population'] < 300000.0),'assaults'].mean())
df2.loc[(df2['assaults'].isnull()) & (df2['agency_jurisdiction']=="Jacksonville, FL"),"assaults"] = round(df2.loc[(df2['agency_jurisdiction']=="Jacksonville, FL") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'assaults'].mean())

<p> The missing values in the <b>'assaults'</b> column come from the same <b>five cities</b>, and the cell above also applies the fill to <b>'Jacksonville, FL'</b>. For each city, we look at the <b>'population'</b> values of the rows with missing data, take the mean of <b>'assaults'</b> over that city's rows with a similar population, and use it to fill the missing values in the <b>'assaults'</b> column.</p>

**<h5>Filling the missing values of 'robberies' column </h5>**

In [None]:
df2.loc[df2['robberies'].isnull(),:]

In [None]:
print(round(df2.loc[(df2['agency_jurisdiction']=="Cincinnati, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'robberies'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Baltimore County, MD") & (df2['population'] > 800000.0) & (df2['population'] < 900000.0),'robberies'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Cleveland, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'robberies'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Portland, OR") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'robberies'].mean()))
print(round(df2.loc[(df2['agency_jurisdiction']=="Tampa, FL") & (df2['population'] > 200000.0) & (df2['population'] < 300000.0),'robberies'].mean()))

<p> The missing values in the <b>'robberies'</b> column come from <b>five different cities.</b> For each city, we look at the <b>'population'</b> values of the rows with missing data, take the mean of <b>'robberies'</b> over that city's rows with a similar population, and use it to fill the missing values in the <b>'robberies'</b> column.</p>

In [None]:
df2.loc[(df2['robberies'].isnull()) & (df2['agency_jurisdiction']=="Cincinnati, OH"),'robberies'] = round(df2.loc[(df2['agency_jurisdiction']=="Cincinnati, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'robberies'].mean())
df2.loc[(df2['robberies'].isnull()) & (df2['agency_jurisdiction']=="Baltimore County, MD"),'robberies'] = round(df2.loc[(df2['agency_jurisdiction']=="Baltimore County, MD") & (df2['population'] > 800000.0) & (df2['population'] < 900000.0),'robberies'].mean())
df2.loc[(df2['robberies'].isnull()) & (df2['agency_jurisdiction']=="Cleveland, OH"),'robberies'] = round(df2.loc[(df2['agency_jurisdiction']=="Cleveland, OH") & (df2['population'] > 300000.0) & (df2['population'] < 400000.0),'robberies'].mean())
df2.loc[(df2['robberies'].isnull()) & (df2['agency_jurisdiction']=="Portland, OR"),'robberies'] = round(df2.loc[(df2['agency_jurisdiction']=="Portland, OR") & (df2['population'] > 600000.0) & (df2['population'] < 700000.0),'robberies'].mean())
df2.loc[(df2['robberies'].isnull()) & (df2['agency_jurisdiction']=="Tampa, FL"),'robberies'] = round(df2.loc[(df2['agency_jurisdiction']=="Tampa, FL") & (df2['population'] > 200000.0) & (df2['population'] < 300000.0),'robberies'].mean())

**<h5> Filling the missing values of 'violent_crimes' column </h5>**

In [None]:
df2.loc[df2['violent_crimes'] == (df2['homicides'] + df2['rapes'] + df2['assaults'] + df2['robberies']),:]

<h6> When we check whether <b>'violent_crimes'</b> is an independent column or one derived from the four crime types, we find that it is derived: in about <b>2753</b> rows, <b>'violent_crimes'</b> equals the sum of the four crime columns. Those four columns had <b>7 rows with missing values</b>, which we have now filled. So, let's first check how many values are missing in the <b>'violent_crimes'</b> column and then fill them with the sum of the other four columns.</h6>

In [None]:
df2.loc[df2['violent_crimes'].isnull(),:]

In [None]:
df2.loc[df2['violent_crimes'].isnull(),'violent_crimes'] = df2.loc[df2['violent_crimes'].isnull(),'homicides'] + df2.loc[df2['violent_crimes'].isnull(),'rapes'] + df2.loc[df2['violent_crimes'].isnull(),'assaults'] + df2.loc[df2['violent_crimes'].isnull(),'robberies']
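The check and the fill can be expressed a little more compactly with a row-wise sum. The following is a sketch on made-up numbers, assuming the same column names as the dataset; `component_sum` and `matches` are illustrative names:

```python
import numpy as np
import pandas as pd

parts = ["homicides", "rapes", "assaults", "robberies"]

# Made-up counts: row 0 matches the sum, row 1 is missing, row 2 is off by one
toy = pd.DataFrame({
    "homicides": [10.0, 5.0, 2.0],
    "rapes": [20.0, 8.0, 3.0],
    "assaults": [300.0, 100.0, 40.0],
    "robberies": [200.0, 60.0, 25.0],
    "violent_crimes": [530.0, np.nan, 71.0],
})

# Row-wise sum; min_count makes the result NaN if any component is missing
component_sum = toy[parts].sum(axis=1, min_count=len(parts))

# How many rows already equal the sum of their components?
matches = np.isclose(toy["violent_crimes"], component_sum)
print(int(matches.sum()), "of", len(toy), "rows match")

# Fill only the missing totals, leaving reported totals untouched
toy["violent_crimes"] = toy["violent_crimes"].fillna(component_sum)
print(toy)
```

Using `fillna` with an aligned Series keeps the reported totals as they are, including rows such as the third one where the reported total differs from the component sum.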

**<h5>Handling missing values in 'months_reported' column </h5>**

In [None]:
df2['months_reported'].value_counts()

In [None]:
df2.loc[df2['months_reported'].isnull(),:]

<h6> In this column, almost every row has the value <b>12</b>, <b>68 rows</b> are missing, and <b>very few rows have any other value</b>. The column appears to record how many months of data the agency reported in that year, so a value of '12' means a full year. Since nearly every row carries the same value, the column adds little information and we drop it.</h6>

In [None]:
df2 = df2.drop('months_reported',axis=1)
df2.head()

**<h5>Handling missing values in 'homicides_percapita' , 'rapes_percapita' , 'assaults_percapita' , 'robberies_percapita' columns at once </h5>**

In [None]:
df2.loc[df2['homicides_percapita'].isnull(),:]

In [None]:
df2.loc[df2['rapes_percapita'].isnull(),:]

In [None]:
df2.loc[df2['assaults_percapita'].isnull(),:]

In [None]:
df2.loc[df2['robberies_percapita'].isnull(),:]

In [None]:
df2.loc[df2['homicides_percapita'].isnull(),'homicides_percapita'] = round((df2.loc[df2['homicides_percapita'].isnull(),'homicides'] / df2.loc[df2['homicides_percapita'].isnull(),'population']) * 100000,2)
df2.loc[df2['rapes_percapita'].isnull(),'rapes_percapita'] = round((df2.loc[df2['rapes_percapita'].isnull(),'rapes'] / df2.loc[df2['rapes_percapita'].isnull(),'population']) * 100000,2)
df2.loc[df2['assaults_percapita'].isnull(),'assaults_percapita'] = round((df2.loc[df2['assaults_percapita'].isnull(),'assaults'] / df2.loc[df2['assaults_percapita'].isnull(),'population']) * 100000,2)
df2.loc[df2['robberies_percapita'].isnull(),'robberies_percapita'] = round((df2.loc[df2['robberies_percapita'].isnull(),'robberies'] / df2.loc[df2['robberies_percapita'].isnull(),'population']) * 100000,2)
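The fill above relies on the per-capita columns being rates per 100,000 residents, that is, count divided by population times 100,000. A tiny worked example on made-up numbers (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up populations and homicide counts
toy = pd.DataFrame({
    "population": [500_000, 1_250_000],
    "homicides": [50, 100],
})

# Rate per 100,000 residents, rounded to two decimals as in the cell above
toy["homicides_percapita"] = (
    toy["homicides"] / toy["population"] * 100_000
).round(2)
print(toy)
```

So 50 homicides in a city of 500,000 corresponds to a rate of 10 per 100,000, while 100 homicides in a city of 1,250,000 corresponds to a lower rate of 8 per 100,000.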


**<h6>Handling missing values in 'crimes_percapita' column. </h6>**

In [None]:
df2.loc[df2['crimes_percapita'].isnull(),:]

<h6> The <b>'crimes_percapita'</b> column is the sum of four columns: <b>'homicides_percapita'</b>, <b>'rapes_percapita'</b>, <b>'assaults_percapita'</b> and <b>'robberies_percapita'</b>. There are <b>7</b> missing values in <b>'crimes_percapita'</b>, so we replace them with the sum of the other four columns.</h6>

In [None]:
df2.loc[df2['crimes_percapita'].isnull(),'crimes_percapita'] = df2.loc[df2['crimes_percapita'].isnull(),'homicides_percapita'] + df2.loc[df2['crimes_percapita'].isnull(),'rapes_percapita'] + df2.loc[df2['crimes_percapita'].isnull(),'assaults_percapita'] + df2.loc[df2['crimes_percapita'].isnull(),'robberies_percapita'] 

# Basic Data Analysis And Visualization

In [None]:
df2.info()

In [None]:
df2.describe()

**<h6> Which cities of America have reported the most violent crime cases in the last 40 years? </h6>**

In [None]:
plt.figure(figsize=(10,22))
# Pass x and y by keyword: recent seaborn versions require it
ax = sns.barplot(x=df2["violent_crimes"],
                 y=df2["agency_jurisdiction"],estimator=sum,color="y")
ax.set_title('Total number of crimes in major cities from 1975 to 2015')
ax.set(xlabel='Total number of crimes', ylabel='City and state')

From the above bar graph, we can see that <b>'New York City'</b> reported the most violent crime cases over the period. More than <b>4 million cases</b> were reported in <b>New York City alone from 1975 to 2015.</b>

Second and third in our list are <b>'Los Angeles' and 'Chicago'</b>. Each of these cities reported around <b>2 million violent crime cases over the same period.</b>

<b>'Baltimore', 'Washington, DC', 'Philadelphia', 'Houston', 'Detroit', 'Dallas' and 'Miami-Dade County'</b> also rank among the jurisdictions with the most reported cases over the 40 years.
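The ranking read off the bar chart can also be produced as a table. The sketch below uses made-up city names and counts; on the real data the same expression would be applied to `df2`:

```python
import pandas as pd

# Made-up yearly counts for three placeholder cities
toy = pd.DataFrame({
    "agency_jurisdiction": ["City A", "City B", "City A", "City C", "City B"],
    "violent_crimes": [100.0, 40.0, 120.0, 300.0, 50.0],
})

# Total cases per jurisdiction, keeping only the two largest
totals = (
    toy.groupby("agency_jurisdiction")["violent_crimes"]
       .sum()
       .nlargest(2)
)
print(totals)
```

Raising the argument of `nlargest` (for example to 10) would list as many top jurisdictions as needed.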

**<h6>Which is the most reported violent crime among all violent crimes over the whole period?</h6>**

In [None]:
plt.figure(figsize=(15,15))
# Pass x and y by keyword: recent seaborn versions require it
ax = sns.barplot(x=['homicides','rapes','assaults','robberies'],
                 y=[df2["homicides"].sum(),df2['rapes'].sum(),df2['assaults'].sum(),df2['robberies'].sum()],color="b")
ax.set_title('Total number of reported cases by crime type from 1975 to 2015')
ax.set(xlabel='Crime type', ylabel='Total number of crimes')
# Annotate each bar with its total
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2,height + 5,'{}'.format(height),ha="center")

<p>From this bar graph, we can see that <b>'assaults'</b> top the list of violent crimes reported between 1975 and 2015 in the USA. <b>More than 12 million cases were reported as 'assaults' in the last 40 years.</b></p>
<p>Second in our list is <b>'robberies'</b>. <b>More than 11 million 'robberies'</b> were reported in the USA during the 40-year period covered by the dataset. <b>More than one million 'rapes'</b> were reported in the same period, and <b>'homicides'</b> have the lowest number of reported cases.</p>

# Time Series Visualisation and Analysis

**<h3> Let's plot the total violent crime reported in the USA from 1975 to 2015 as a time series </h3>**

**<h3>Time series analysis of reported violent crime cases in different cities in 1975-2015</h3>**

## General Crimes Over the Country

In [None]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))

In [None]:
import plotly.offline as py
configure_plotly_browser_state()
py.init_notebook_mode(connected=False)
import plotly.graph_objs as go

a = set(df2["agency_jurisdiction"])
# print(a)
a = list(a)
# print(a)

doubles = []
for i in range(0,len(a)):
    doubles.append(df2[df2['agency_jurisdiction'].str.contains(a[i])])

# trace = dict()
trace = []
for i in range(0,len(a)):
    trace.append(go.Scatter(x = doubles[i]['report_year'],y=doubles[i]['violent_crimes'],name = a[i],opacity = 0.8))

# Use every trace, however many jurisdictions remain after cleaning
data = trace

layout = go.Layout(title = "Total Crimes in major cities from 1975 to 2015",
              xaxis = dict(title = 'Year'),
              yaxis = dict(title = 'Reported violent crimes per year'))

# fig = dict(data=data, layout=layout)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig)
# fig.show()
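A `groupby` is an alternative way to build one trace per jurisdiction. It matches names exactly, whereas `str.contains` does substring and regular-expression matching, and it works for any number of jurisdictions. The sketch below uses made-up values and builds the traces as plain dicts, the same kind of dict-based figure that the later cells in this notebook pass to `py.iplot`; the plotting call itself is left out so the snippet runs without plotly.

```python
import pandas as pd

# Made-up yearly counts for two similarly named jurisdictions
toy = pd.DataFrame({
    "agency_jurisdiction": ["Baltimore, MD", "Baltimore County, MD",
                            "Baltimore, MD", "Baltimore County, MD"],
    "report_year": [1975, 1975, 1976, 1976],
    "violent_crimes": [16000.0, 3000.0, 15000.0, 3100.0],
})

# One scatter trace per jurisdiction, with years in ascending order
traces = [
    dict(type="scatter",
         x=group["report_year"].tolist(),
         y=group["violent_crimes"].tolist(),
         name=city,
         opacity=0.8)
    for city, group in toy.sort_values("report_year").groupby("agency_jurisdiction")
]

# Dict-based figure; title text is illustrative
figure = dict(data=traces,
              layout=dict(title="Violent crimes per year (toy data)"))
print([t["name"] for t in traces])
```

In the notebook, `figure` could then be handed to `py.iplot` in place of the hand-assembled `data` list.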

**<h5>PLEASE NOTE THAT THE GRAPH IS INTERACTIVE. IT IS POSSIBLE TO SELECT ONLY THE LINE FOR ONE CITY OR TO SELECT ONLY THE ONES THE VIEWER IS INTERESTED IN.  </h5>**

<p>The figure above shows the number of violent crimes reported in each city over the 40-year span. New York City tops the list with the most reported crimes, followed by Los Angeles and Chicago.</p>
<p>The time series shows that reported crime peaked in the 1990s. After that, it dropped significantly in the three cities mentioned.</p>
<p> Isolating the trace for Los Angeles also shows a temporary increase. The numbers were dropping from their highs at the beginning of the 1990s until 1997-2002, when they rose briefly before continuing to fall.</p>
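Peak years like the ones read off the interactive chart can also be computed directly. A minimal sketch with placeholder cities and made-up counts, assuming the same column names as `df2`:

```python
import pandas as pd

# Made-up yearly counts for two placeholder cities
toy = pd.DataFrame({
    "agency_jurisdiction": ["City A"] * 3 + ["City B"] * 3,
    "report_year": [1990, 1991, 1992] * 2,
    "violent_crimes": [100.0, 150.0, 120.0, 80.0, 70.0, 90.0],
})

# idxmax() returns, per city, the row label of its highest yearly count
peak_index = toy.groupby("agency_jurisdiction")["violent_crimes"].idxmax()

# Look those rows up to get the peak year and peak value for each city
peaks = (
    toy.loc[peak_index]
       .set_index("agency_jurisdiction")[["report_year", "violent_crimes"]]
)
print(peaks)
```

Swapping `"violent_crimes"` for `"rapes"`, `"homicides"` or `"assaults"` would give the peak year for each individual crime type.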

<p> The figure above visualizes overall violent crime for each jurisdiction.</p>

**<h5>Now, let's look at each specific crime reported in the USA over the last 40 years. First, we look at the number of rape cases reported over that period.</h5>**

In [None]:
configure_plotly_browser_state()
py.init_notebook_mode(connected=False)


a = set(df2["agency_jurisdiction"])
# print(a)
a = list(a)
# print(a)

doubles = dict()
for i in range(0,len(a)):
    doubles[i] = df2[df2['agency_jurisdiction'].str.contains(a[i])]

trace = dict()
for i in range(0,len(a)):
    trace[i] = go.Scatter(x = doubles[i]['report_year'],y=doubles[i]['rapes'],name = a[i],opacity = 0.8)

# Collect every trace in order, however many jurisdictions remain after cleaning
data = [trace[i] for i in range(len(a))]

layout = dict(title = "Rape cases in major cities from 1975 to 2015",
              xaxis = dict(title = 'Year'),
              yaxis = dict(title = 'Reported rape cases per year'),)

fig = dict(data=data, layout=layout)
py.iplot(fig)

<p>This figure shows the number of rape cases reported in each city over the 40-year span. New York City tops the list. 3880 rape cases were reported there in 1985, the highest figure at that time. After 1995, rape cases went down in New York City, but they started increasing again after 2013.</p>
 <p>After New York, Chicago reported the most rape cases. Cases began increasing in Chicago in 1982 and peaked in 1992, when 3754 were reported. Since 1992 the number of cases has been declining, but Chicago still ranked third for this crime as of 2015.</p>

<p> Los Angeles shows spikes as well. Its highest figure came in 1980, when about 2813 cases were reported. After that, cases declined steadily until 2013. From 2013 to 2015 they not only increased but increased dramatically, at a rate higher than at any other time in the 40-year period.</p>

**<h5>Time series analysis of homicide cases in different cities in last 40 years</h5>**

In [None]:
configure_plotly_browser_state()
py.init_notebook_mode(connected=False)

a = set(df2["agency_jurisdiction"])
# print(a)
a = list(a)
# print(a)

doubles = dict()
for i in range(0,len(a)):
    doubles[i] = df2[df2['agency_jurisdiction'].str.contains(a[i])]

trace = dict()
for i in range(0,len(a)):
    trace[i] = go.Scatter(x = doubles[i]['report_year'],y=doubles[i]['homicides'],name = a[i],opacity = 0.8)

# Collect every trace in order, however many jurisdictions remain after cleaning
data = [trace[i] for i in range(len(a))]

layout = dict(title = "Homicide cases in major cities from 1975 to 2015",
              xaxis = dict(title = 'Year'),
              yaxis = dict(title = 'Reported homicide cases per year'),)

fig = dict(data=data, layout=layout)
py.iplot(fig)

<p>This figure shows the number of homicide cases reported in each city over the 40-year span. New York City tops the list. 2245 homicide cases were reported there in 1990, the highest figure at that time. After 1990, the number of reported homicide cases dropped significantly.</p>

**<h5>Time series analysis of assault cases in different cities in last 40 years</h5>**

In [None]:
configure_plotly_browser_state()
py.init_notebook_mode(connected=False)

# Unique jurisdictions, one trace per city
a = list(set(df2["agency_jurisdiction"]))

doubles = dict()
for i in range(0,len(a)):
    # Exact match; str.contains would treat the name as a regex and also match substrings
    doubles[i] = df2[df2['agency_jurisdiction'] == a[i]]

trace = dict()
for i in range(0,len(a)):
    trace[i] = go.Scatter(x = doubles[i]['report_year'],y=doubles[i]['assaults'],name = a[i],opacity = 0.8)

# Collect every trace instead of hard-coding 68 indices
data = [trace[i] for i in range(len(a))]

layout = dict(title = "Assault cases in major cities from 1975 to 2015",
              xaxis = dict(title = 'Time Span'),
              yaxis = dict(title = 'Number of assaults'))

fig = dict(data=data, layout=layout)
py.iplot(fig)

<p>This figure shows the number of assault cases reported in major US cities over a span of 40 years. New York City again tops the list. About 71K assault cases were reported there in 1988, the city's highest yearly figure. After 1988, assault cases in New York City went down and reached their minimum in 2008. They have been rising since 2008, but remain below the 1988 level.</p>

<p>After New York, Los Angeles reported the most assault cases. They increased from the start of the data (1975) and peaked in 1991, when around 47 thousand cases were reported. After 1991, the number of cases declined.</p>

<p>We can also see a spike in Chicago. At first, assault cases were not that high, but they began increasing in 1981 and peaked in 1991, when around 42K cases were reported, the most in Chicago over the 40-year span. After 1991, the number of cases declined in Chicago.</p>

<p>It seems that towards the end of the 1980s and the very beginning of the 1990s, the United States had a serious violent crime problem: the peaks for the major cities all fall within that time frame.</p>

**<h5>Time series analysis of robbery cases in different cities from 1975 to 2015</h5>**

In [None]:
#@title Default title text
configure_plotly_browser_state()
py.init_notebook_mode(connected=False)

# Unique jurisdictions, one trace per city
a = list(set(df2["agency_jurisdiction"]))

doubles = dict()
for i in range(0,len(a)):
    # Exact match; str.contains would treat the name as a regex and also match substrings
    doubles[i] = df2[df2['agency_jurisdiction'] == a[i]]

trace = dict()
for i in range(0,len(a)):
    trace[i] = go.Scatter(x = doubles[i]['report_year'],y=doubles[i]['robberies'],name = a[i],opacity = 0.8)

# Collect every trace instead of hard-coding 68 indices
data = [trace[i] for i in range(len(a))]

layout = dict(title = "Robbery cases in major cities from 1975 to 2015",
              xaxis = dict(title = 'Time Span'),
              yaxis = dict(title = 'Number of robberies'))

fig = dict(data=data, layout=layout)
py.iplot(fig)

<p>This figure shows the number of robbery cases reported in major US cities over a span of 40 years. New York City tops the list here as well. About 107K cases were reported there in 1981, the city's highest yearly figure. After 1981, robberies in New York City went down until 1987, then rose again to more than 100K cases in 1990. Since then, the number of robbery cases in New York has gone down.</p>
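Peaks read off a chart can be cross-checked numerically. The sketch below is a minimal illustration in plain Python; the `peak_year` helper and the toy counts are hypothetical and are not taken from the dataset.

```python
def peak_year(counts_by_year):
    """Return (year, count) for the year with the highest count."""
    year = max(counts_by_year, key=counts_by_year.get)
    return year, counts_by_year[year]

# Toy yearly counts for one city (made-up values, for illustration only)
toy_counts = {1980: 50, 1981: 70, 1982: 60}
print(peak_year(toy_counts))
```

With the notebook's DataFrame, something along the lines of `df2.loc[df2.groupby('agency_jurisdiction')['robberies'].idxmax()]` may give the peak row for each city, assuming the column has no missing values.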

**<h5> We have seen that New York, Los Angeles and Chicago are the cities where the most crimes are recorded. So, we visualize the trend of the four violent crime types in those cities with bar charts. </h5>** 

## Homicide Bar Graphs

In [None]:
NYC = df2.loc[df2['agency_jurisdiction']=='New York City, NY',:]
LOS_ANGELS = df2.loc[df2['agency_jurisdiction']=='Los Angeles, CA',:]
CHIGAGO = df2.loc[df2['agency_jurisdiction']=='Chicago, IL',:]

**<h5> Bar diagram of homicide cases in New York city over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=NYC["report_year"], y=NYC["homicides"],palette="gist_earth")

plt.ylabel("Number of Homicides in New York")

<p>From this bar diagram, we can see that the number of homicide cases in New York City peaked around 1990, with more than 2,000 cases. After 1990, the number of cases declined.</p>
<p>Now, let's visualize the number of homicide cases in Los Angeles over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram of homicide cases in Los Angeles over a span of 40 years</h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=LOS_ANGELS["report_year"], y=LOS_ANGELS["homicides"],palette="gist_earth")

plt.ylabel("Number of Homicides in Los Angeles")

<p>From this bar diagram, we can see that homicide cases in Los Angeles peaked in 1992 and 1993, with more than 1,000 cases. After 1993, the number of cases declined. In 2002, there is a spike in homicide cases in Los Angeles, after which the number went down again.</p>
<p>Now, let's visualize the number of homicide cases in Chicago over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram of homicide cases in Chicago over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=CHIGAGO["report_year"], y=CHIGAGO["homicides"],palette="gist_earth")

plt.ylabel("Number of Homicides In Chicago")

<p>From this bar diagram, we can see that homicide cases in Chicago peaked in 1992, with more than 900 cases. After that, the number of cases declined.</p>


## Rape Bar Graphs

**<h5> Bar diagram of rape cases in New York over a span of 40 years</h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=NYC["report_year"], y=NYC["rapes"],palette="cool")

plt.ylabel("Number of Rape Cases in New York")

<p>From this bar diagram, we can see that the number of rape cases in New York was high from 1975 to 1985. After 1985, the number of cases declined. In 2014, however, we see a sudden spike, and the number of rape cases increased in both 2014 and 2015.</p>
<p>Now, let's visualize the number of rape cases in Los Angeles over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram of rape cases in Los Angeles over a span of 40 years </h5>**


In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=LOS_ANGELS["report_year"], y=LOS_ANGELS["rapes"],palette="cool")

plt.ylabel("Number of Rape Cases in Los Angeles")

<p>From this bar diagram, we can see that rape cases in Los Angeles peaked in 1980 and declined afterwards. In the last two years, however, we see a sudden spike, and in 2015 the number of rape cases went beyond 2,000.</p>
<p>Now, let's visualize the number of rape cases in Chicago over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram for rape cases in Chicago over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=CHIGAGO["report_year"], y=CHIGAGO["rapes"],palette="cool")

plt.ylabel("Number of rape cases in Chicago")

<p>From this bar diagram, we can see that rape cases in Chicago started increasing in 1983 and peaked in 1992. After 1992, the number of cases slowly went down.</p>


## Assault Cases Bar Graphs

**<h5> Bar diagram for assault cases in New York over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=NYC["report_year"], y=NYC["assaults"],palette="flare")
plt.ylabel("Number of Assault cases in New York")

<p>From this bar diagram, we can see that assault cases in New York started increasing in 1984 and peaked in 1988 and 1989. After 1989, the number of cases slowly went down until 2010, and it has been increasing slowly since then.</p>
<p>Now, let's visualize the number of assault cases in Los Angeles over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram for assault cases in Los Angeles over a span of 40 years</h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=LOS_ANGELS["report_year"], y=LOS_ANGELS["assaults"],palette="flare")

plt.ylabel("Number of Assault crimes in Los Angeles")

<p>From this bar diagram, we can see that assault cases in Los Angeles started increasing in 1985 and peaked in 1991. After 1991, the number of cases slowly went down until 2013. In 2014 and 2015, assault cases increased again, which is not a good sign for Los Angeles.</p>
<p>Now, let's visualize the number of assault cases in Chicago over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram for assault cases in Chicago over a span of 40 years</h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=CHIGAGO["report_year"], y=CHIGAGO["assaults"],palette="flare")

plt.ylabel("Number of Assault Cases in Chicago")

<p>From this bar diagram, we can see that assault cases in Chicago started increasing in 1983 and peaked in 1991. After 1991, the number of cases declined.</p>


## Robberies Bar Graphs

**<h5> Bar diagram for robberies in New York over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=NYC["report_year"], y=NYC["robberies"],palette="gist_rainbow")

plt.ylabel("Number of Robberies reported in New York")

<p>From this bar diagram, we can see that the number of robberies in New York was very high in the early 1980s and again around 1990. After that, the number of cases declined.</p>
<p>Now, let's visualize the number of robbery cases in Los Angeles over the span of 40 years in a bar diagram.</p>

**<h5> Bar diagram of robberies in Los Angeles over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=LOS_ANGELS["report_year"], y=LOS_ANGELS["robberies"],palette="gist_rainbow")

plt.ylabel("Number of robberies reported in Los Angeles")

Similarly, robbery numbers in LA were very high at the beginning of the 90s. 1991 was the year in which the city recorded its maximum, with close to 40 thousand robberies. Since then, the general trend has been a decrease in cases. 

**<h5> Bar diagram of robberies in Chicago over a span of 40 years </h5>**

In [None]:
ax=plt.figure(figsize=(15,10))
plt.xticks(rotation='vertical')
ax=sns.barplot(x=CHIGAGO["report_year"], y=CHIGAGO["robberies"],palette="gist_rainbow")

plt.ylabel("Number of robberies reported in Chicago")

<p>From this bar diagram, we can see that the number of robberies in Chicago was very high in 1975 and then went down until 1982. After 1982, there was a significant rise in robbery cases, which peaked in 1991. Since then, the number of cases has declined.</p>

# Increase or Decrease?

**<h5> Are violent crimes increasing or decreasing in the USA? </h5>**
<p>To answer this question, we first sum the violent crimes reported across all jurisdictions in the dataset for each year, and then plot the yearly totals over the 40-year span.</p>

In [None]:
report_years = df2['report_year'].unique().tolist()
total_violent_crimes = []
for year in report_years:
  total_violent_crimes.append(df2.loc[df2['report_year'] == year,'violent_crimes'].sum())
print(total_violent_crimes)
print(report_years)
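The loop above can also be written as a single grouped sum. This is a minimal sketch, assuming a DataFrame with the same `report_year` and `violent_crimes` columns as `df2`; the small `demo` frame and its numbers are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for df2 (two jurisdictions, two years)
demo = pd.DataFrame({
    "report_year": [1975, 1975, 1976, 1976],
    "violent_crimes": [10, 20, 30, 40],
})

# Sum violent crimes over all jurisdictions for each year
yearly_totals = demo.groupby("report_year")["violent_crimes"].sum()

report_years_demo = yearly_totals.index.tolist()
total_violent_crimes_demo = yearly_totals.tolist()
print(report_years_demo, total_violent_crimes_demo)
```

One difference to keep in mind: `groupby` returns the years in sorted order, whereas `unique()` keeps them in order of first appearance.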

**<h5> Visualizing the total number of violent crimes in the USA over a span of 40 years</h5>**

In [None]:
configure_plotly_browser_state()
py.init_notebook_mode(connected=False)


# iplot expects a list of traces under 'data', not a Figure object
data = [go.Scatter(x = report_years,y = total_violent_crimes,opacity = 0.8)]

layout = dict(title = "Total violent crimes in major US cities from 1975 to 2015",
              xaxis = dict(title = 'Time Span'),
              yaxis = dict(title = 'Number of violent crimes'))

fig = dict(data=data, layout=layout)
py.iplot(fig)

<p>Here, we can see that the number of violent crimes started from above 500K in 1975 and increased year after year. More than 900K cases were reported in 1991, which, as we have seen, is around the peak for most major cities. After that year, crime declined and fell below 450K. The total number of violent crimes reported in 2015 is 438K, which is less than the number reported in 1975.</p>
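The size of the decline can be expressed as a percentage change. The sketch below uses the rounded figures quoted in the paragraph above (about 900K at the 1991 peak and 438K in 2015), so the result is only approximate; `pct_change` is a hypothetical helper name.

```python
def pct_change(old, new):
    """Percentage change from old to new (negative means a decrease)."""
    return (new - old) / old * 100

# Rounded totals quoted in the text, in thousands of reported violent crimes
peak_1991 = 900   # the text says "more than 900K", so this is a lower bound
total_2015 = 438

print(round(pct_change(peak_1991, total_2015), 1))
```

Under these rounded inputs, the 2015 total is roughly half of the 1991 peak.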

**<h6>The answer to our research question is that violent crime was increasing from 1975 to the 1990s, reached its peak in 1991, and has been declining in the given cities since then. Because these cities have large populations, it is fair to say that, by extension, the country at large is also seeing a decrease in violent crimes from the levels seen in the 90s. </h6>**

# **<h2> Modelling</h2>**

**<h3>Feature Selection</h3>**

In [None]:
data = df2.loc[:,['report_year','agency_jurisdiction','population','violent_crimes']]
data

**<h4>One-Hot-Encoding categorical variables </h4>**

In [None]:
cities_dummies = pd.get_dummies(data["agency_jurisdiction"])
data = pd.concat([data,cities_dummies],axis=1)
data.drop(['agency_jurisdiction','Wichita, KS'],axis='columns',inplace=True)
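Dropping one dummy column, as done with 'Wichita, KS' above, makes that city the reference category and removes a column that the others already imply. Here is a minimal sketch on a made-up frame; the `demo` data is an illustrative assumption, not part of the dataset.

```python
import pandas as pd

# Made-up rows standing in for the modelling data
demo = pd.DataFrame({
    "agency_jurisdiction": ["Chicago, IL", "Wichita, KS", "Chicago, IL", "Boston, MA"],
    "population": [100, 200, 110, 300],
})

# One indicator column per city
dummies = pd.get_dummies(demo["agency_jurisdiction"])

# Drop the reference city: a row whose remaining indicators are all 0 means "Wichita, KS"
encoded = pd.concat(
    [demo.drop(columns="agency_jurisdiction"), dummies.drop(columns="Wichita, KS")],
    axis=1,
)
print(encoded.columns.tolist())
```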

**<h4>Splitting Training data and Testing data </h4>**
<p>We use the crime data from 1975 to 2007 as training data and the crime data from 2008 to 2015 as testing data. Since this is a time series problem, we cannot sample the training and testing sets at random from different time spans; the model must be tested on years that come after the ones it was trained on.</p>

In [None]:
# .copy() so that the columns added later do not trigger SettingWithCopyWarning
train_data = data.loc[data['report_year'] < 2008,:].copy()
test_data = data.loc[data['report_year'] >= 2008,:].copy()
print(train_data.shape)
print(test_data.shape)

In [None]:
train_data['time'] = train_data.loc[:,'report_year'] - 1974
train_data['time_square'] = (train_data['time'])**2
# train_data['time_cube'] = (train_data['time'])**3
train_data.drop('report_year',axis=1,inplace=True)
# train_data.head()

In [None]:
test_data['time'] = test_data.loc[:,'report_year'] - 1974
test_data['time_square'] = (test_data['time'])**2
# test_data['time_cube'] = (test_data['time'])**3
test_data.drop('report_year',axis=1,inplace=True)
# test_data.head()

In [None]:
violent_crime = train_data.pop('violent_crimes')
time = train_data.pop('time')
time_square = train_data.pop('time_square')

# Move the target and the time features to the front: insert(position, column_name, values)
train_data.insert(0, 'violent_crimes', violent_crime)
train_data.insert(1, 'time', time)
train_data.insert(2, 'time_square', time_square)
train_data.head()

In [None]:
violent_crime = test_data.pop('violent_crimes')
time = test_data.pop('time')
time_square = test_data.pop('time_square')

# Move the target and the time features to the front: insert(position, column_name, values)
test_data.insert(0, 'violent_crimes', violent_crime)
test_data.insert(1, 'time', time)
test_data.insert(2, 'time_square', time_square)
test_data.head()

**<h5> Separating feature vectors and labels from training sets and testing sets </h5>**

In [None]:
x_train, y_train = train_data.iloc[:,1:],train_data.iloc[:,0]
x_test, y_test = test_data.iloc[:,1:],test_data.iloc[:,0]

**<h5>Training a Random Forest Regressor for predicting the number of violent crimes from 2008 to 2015</h5>**

In [None]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
from sklearn import tree
clf = RandomForestRegressor(n_estimators = 120, random_state = 42)
clf.fit(x_train,y_train)

In [None]:
# Use the forest's predict method on the test data
y_pred = clf.predict(x_test)
# Calculate the absolute errors
errors = abs(y_pred - y_test)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'units.')

**<h5>Calculating the accuracy of the model </h5>**

In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / y_test)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
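The "accuracy" above is 100 minus the mean absolute percentage error (MAPE), which is undefined whenever an actual value is zero. The sketch below spells out both error measures in plain Python. The helper names and sample numbers are hypothetical, and skipping rows with a zero actual value is just one possible convention, not what the notebook does.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error over rows whose actual value is non-zero."""
    ratios = [abs(p - a) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(ratios) / len(ratios)

# Made-up values, for illustration only
actual = [100, 200, 400]
predicted = [110, 180, 400]

print("MAE:", mae(actual, predicted))
print("100 - MAPE:", round(100 - mape(actual, predicted), 2))
```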

**<h6>Let's look at the estimators that make up our model. We created a random forest regressor to predict violent crimes in the upcoming years, using 120 decision trees to form the forest. </h6>**

In [None]:
clf.estimators_

In [None]:
len(clf.estimators_)

## **<h2> Visualize the decision trees in Random Forest Regressor </h2>**

**<h6>Visualizing the first decision tree </h6>**

**<h6> We can visualize any decision tree by indexing clf.estimators_[?]. Here we have used 120 trees to form the random forest, so we can pass any index from 0 to 119. </h6>**

In [None]:
plt.figure(figsize=(15,10))
tree.plot_tree(clf.estimators_[0],filled=True)

In [None]:
predicted_df = pd.DataFrame({'Actual': y_test,'Predicted':y_pred})
predicted_df

**<h5> Visualize the actual vs predicted values in the graph </h5>** 

In [None]:
sns.relplot(x="Actual", y="Predicted", hue='Actual', data=predicted_df[:50])

**<h6>Let's see which features are most important for the prediction </h6>**
<p>For this, we read the <b>'feature_importances_'</b> attribute of the fitted model.</p>

In [None]:
importance =clf.feature_importances_
importance

In [None]:
columns = x_train.columns

**<h6>Let's see the features along with their importance scores</h6>**

In [None]:
rfgraph =  pd.Series(importance,columns)
rfgraph

**<h6> We can see that the 'population' feature is the most important when predicting the number of violent crimes </h6>**
<p>That generally means that the larger the population of a city, the more violent crimes are likely to be reported there.</p>

In [None]:
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

**<h6>Visualizing the importance of every feature for predicting the total number of violent crimes </h6>**

In [None]:
figure(figsize=(10,10))
rfgraph.sort_values().plot.barh(color='red')
plt.title('random forest model visualization')

**<h6>From this feature importance graph, we conclude that 'population' is the most important feature for predicting the number of violent crimes. The next feature is 'New York City'. This sounds fair because more crimes are reported in New York City than anywhere else, so this feature plays a greater part in the prediction model. The third and fourth most important features are time_square and time. The 'time' feature is the number of years between the report year and 1974, the year before the first report in our dataset, so 1975 corresponds to time = 1. 'time_square' is the square of 'time'. These two features are considered important because crime was very high in the early 1990s and has declined as time progressed, so we see that dependency in our model as well. </h6>** 
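For concreteness, the two time features can be computed as follows. This is a small sketch mirroring the `report_year - 1974` step used in the modelling cells; the `time_features` helper name is hypothetical.

```python
def time_features(report_year, base_year=1974):
    """Return (time, time_square) for a report year."""
    time = report_year - base_year
    return time, time ** 2

for year in (1975, 1991, 2015):
    print(year, time_features(year))
```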

# **<h2> Conclusion </h2>**

Our analysis of the dataset covering the 1975-2015 period in relation to crimes in American cities has allowed us to answer the questions we asked in the beginning. 

<p><b>1. Are violent crimes  rising or falling in American cities?</b></p>
<p><b>2. Which is the most reported violent crime among all violent crimes?</b></p>
<p><b>3. Which cities in the country have reported the highest numbers of violent crime cases? </b></p>

To question 1, we can positively say that <b>the number of crimes has been decreasing</b> since the beginning of the 90s. In fact, the number of crimes recorded in 2015 is lower than in 1975. 

Question 2 is answered in the "Basic Data Analysis and Visualization" section of this report. The <b>most reported violent crime is assault</b>, followed by robberies, rapes and homicides. 

For question 3, in the same section as question 2, we show that <b>New York City (New York), Los Angeles (California) and Chicago (Illinois)</b> are the three cities that report the highest numbers of crimes almost consistently, as we see in the different graphs in the time series section. 

These results come as no surprise, especially because these are the three most populated cities in the United States. Our analysis focused on the <b>number</b> of reported crimes, rather than the <b>rates per capita</b> at which they were reported. 
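Converting counts to per-capita rates is a short calculation. This is a minimal sketch, assuming the common "per 100,000 residents" scaling; the function name and the sample numbers are made up for illustration and are not taken from the dataset.

```python
def rate_per_100k(crime_count, population):
    """Reported crimes per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return crime_count * 100_000 / population

# Two made-up cities: the larger one reports more crimes but has the lower rate
print(rate_per_100k(5_000, 1_000_000))
print(rate_per_100k(1_000, 100_000))
```

With the notebook's data, the same idea would presumably be `df2['violent_crimes'] * 100_000 / df2['population']`, using the two columns already selected in the modelling section.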


From the above figures, we conclude that New York, Los Angeles and Chicago are the cities with the highest numbers of crimes. We have seen how the numbers evolved over the 1975-2015 period, and for most crimes there is a decrease. Of course, there are some years in which the numbers increase instead of decreasing, but in general the trend is downward. One alarming point we noticed is the increase in rape reports in Los Angeles in the last two years of the available data, from 2013 to 2015. This increase is faster than at any other point during the 40 years covered by our analysis. 

In the timeframe covered by the dataset, the various graphs show that crimes increased steadily between 1975 and the 90s. Starting around 25-30 years ago, the number of reported crimes decreased significantly, often with a stark contrast between the numbers reported at the peak and those reported a few years later. This phenomenon extends to the entire country and is called the "Great Crime Decline". 

Many theories have been proposed for the causes behind this drop in the number of crimes. Some say that the neighborhoods that were hotspots are being gentrified, which brings more economic activity to those neighborhoods and leaves people with less need or desire to commit crimes. Gentrification may also lead to increased security, with more police present as a deterrent. Another theory is that technology is keeping people inside: if people are busy in front of their computers and TVs, they will not be committing crimes. There are also politicians who want to take credit for this decline, like Bill Clinton after he signed his massive crime bill, which contained provisions that disproportionately affected people of color and minorities. The truth is that crime experts and researchers do not agree on one single answer, so we do not know what exactly caused the decrease. 


If we can create a model that predicts the number of violent crimes in each city, then law enforcement agencies can deploy resources more effectively and assist detectives in their investigations. With this in mind, we created a random forest model to predict the number of violent crimes in each city in the upcoming years. However, the model does not achieve good accuracy.

The reason behind this is the lack of sufficient data. In this dataset, the crime data is provided on a yearly basis. Our prediction model could give more accurate results if daily data on reported crimes in the different cities were available. LSTMs are often used for time series analysis, but here we treated the task as a regression problem and built the model with a random forest. This is the main reason why our model is not as accurate as we expected.
 
So, this report provides an analytical summary of whether crime is rising or falling in America, which crimes are reported most often, and in which cities crime is increasing or declining. This gives an overall view of what is going on in the country, which can help deploy the appropriate resources needed to counter crime, whether by deploying more police officers, by offering better mental health support to communities, or by applying what experts and academics recommend.