Since I was a teenager, I have been interested in police cases, crime and social behaviour outside the law. Do not get me wrong: my interest has always been analytical and social, and I am not in favor of these acts. Even so, I have always considered crime an aspect of our society that deserves attention and further analysis, because I believe that observing it can tell us a great deal about our socio-economic situation and, if we act on what we learn, help to build a better place for everyone.

I am also very fond of fiction, especially cinema and series. As you may guess from the above, I love mystery, crime and complex stories that somehow reflect how we really are as human beings. One show that really impressed me was The Wire (first aired on HBO in 2002), written by David Simon (who has also published several books on the topic). This series had a surprising impact on me, as it taught me that a work of fiction can portray social trends and specific groups through its narrative. In this series, the main plot ended up practically in the background in order to portray the city of Baltimore at the beginning of the 2000s, with the different social classes and behaviors that made it up.

However, as realistic as it may be, fiction is often assumed to exaggerate what it describes. Or does it? Well, this is the reason why we are here. In this notebook, we are going to work on this topic. By means of data analysis, we are going to explore fourteen years of arrests in a specific US city and dive into the most interesting topics we can find within them. By the end of this work, we will discuss how well fiction can portray a real situation, based on this exploration.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from statsmodels.tsa.arima_model import ARIMA
sns.set()

### The dataset

Before any exploration, discussion or even preprocessing, I think we should first know which data we are handling, right?
Much as I would have loved to, I could not find a good dataset about the city I would have liked to talk about. So I picked a historic dataset provided by the NYPD, which collects all the arrests made in the city between 2006 and 2019, well over a decade of records. We are going to load the dataset and explore a little of what it offers us.

I would also like to note that this is not the only dataset we will be working with. However, it is the core data source for this notebook. The rest of the data will be introduced when it becomes useful.

In [None]:
dataframe = pd.read_csv('../input/arrestsnypd/NYPD_Arrests_Data__Historic_.csv')

The first thing we have to do with the dataset is to see how much data we have in our hands.

In [None]:
len(dataframe)

So we have 5,012,956 entries in our dataset. That's quite a few. Even for a record spanning fourteen years, it is still a surprising number of arrests, which says a lot by itself. But let's not jump to conclusions just yet. Let's keep observing the information our dataset can provide.

The next thing to do is to see which columns our dataset contains, as they are the keys to continuing our research.

In [None]:
dataframe.keys()

As we can see, we have quite a lot of information in our hands. Of course, we are not going to use all of it, as I do not want to dive into the racial or gender implications of crime; that is outside the scope of this work. However, we can be sure that we have several kinds of data from which to extract relevant information about the matter we are concerned with.

After this brief look at the resources we have for this work, let's take a look at the data itself.

In [None]:
dataframe.head(10)

As we can observe, the data comes in many formats. First, we see that the ARREST_DATE column gives us the date of each arrest, which will be really beneficial for our purpose. There are also two descriptors for the crime types: PD_DESC, which I guess is the general description the police gave in the arrest report, and OFNS_DESC which, judging by its wording, may be the official offense for which the person was arrested. So, in this exploration, I think we should use this second descriptor, as it gives us a broad set of crime categories that eases grouping and further analysis. There is also some coordinate data, which we could use in some specific tasks, but it is not very relevant for this study.
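Since ARREST_DATE drives most of the analysis, it can help to parse it once into a proper datetime column. The sketch below uses a tiny hand-made frame, and the `MM/DD/YYYY` format string is an assumption about how the file stores its dates, so it should be checked against the output of `head()` first.

```python
import pandas as pd

# Toy stand-in for the arrests data; the date format is an assumption
sample = pd.DataFrame({"ARREST_DATE": ["01/15/2006", "12/31/2019", "07/04/2010"]})

# Parse the text once, then derive year and month from the datetime accessor
parsed = pd.to_datetime(sample["ARREST_DATE"], format="%m/%d/%Y")
sample["Year"] = parsed.dt.year
sample["Month"] = parsed.dt.month
print(sample)
```

Parsing with an explicit format also avoids any ambiguity between day-first and month-first dates.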

We also have to notice that there is some missing data within the dataframe (represented as NaN). We have to tackle this before jumping into the exploratory analysis of the dataset. I suggest first removing these rows and evaluating the impact of the removal.

In [None]:
arrests_dataset = dataframe.dropna()
len(dataframe) - len(arrests_dataset)

After dropping all the rows that contained missing data, we find that we have discarded 26,337 entries from our original dataframe. Although we could look into imputation methods to recover these entries, I think it is not worth the effort, as we only lose a tiny percentage of the original data (about 0.5%) and still have more than 4.9 million entries to work with. Another point is that most of the missing data is categorical, so it is harder to guess which category each value may belong to. We could discuss this classification problem further. For now, I believe this loss is not a problem for our work, so we can directly discard the rows containing NaN and continue working.
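Before dropping rows, it can be useful to see which columns the missing values come from and what share of the rows is lost. The following is a minimal sketch on a made-up frame (the values are invented; only the column names mirror the arrests dataset):

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "ARREST_DATE": ["01/15/2006", "02/03/2006", "03/09/2006", "04/21/2006"],
    "OFNS_DESC": ["DANGEROUS DRUGS", np.nan, "ROBBERY", "DANGEROUS DRUGS"],
    "AGE_GROUP": ["25-44", "18-24", np.nan, "25-44"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows lost when every row with at least one NaN is dropped
rows_dropped = len(toy) - len(toy.dropna())
share_dropped = 100.0 * rows_dropped / len(toy)

print(missing_per_column)
print(f"{rows_dropped} rows dropped ({share_dropped:.1f}%)")
```

Applied to the real data, the same ratio is 26,337 out of 5,012,956 rows, which is roughly 0.5%.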

## Exploratory data analysis

### History overview

After presenting the core dataset we are handling, I think it is time to dive into the data exploration and research sections. First of all, I think it would be interesting to take a general look at the data. As an approach, we can ask ourselves several questions about the dataset and try to answer them as precisely as possible.

The first question that arises could be: how have crime and arrests evolved over the past fourteen years? As the data is presented to us as a time series, we can plot a line chart to observe how the number of arrests is distributed over these years.

In [None]:
historic = arrests_dataset.ARREST_DATE
historic_per_year = pd.DatetimeIndex(historic).year
historic_per_year = pd.DataFrame({"Year":historic_per_year, "Arrests":1})
historic_per_year = historic_per_year.groupby(['Year'], as_index=False).sum()

In [None]:
sns.set(rc={'figure.figsize':(15,8.27)})
sns.lineplot(data=historic_per_year, x='Year', y='Arrests').set_title("Accumulated arrests timeline")

The first thing we see from this chart is that the number of arrests per year has decreased considerably over time. In other words, crime shows a downward trend. However, the numbers are still very high. We can check this directly in the numbers themselves.

In [None]:
historic_per_year.sort_values(by=['Arrests'], ascending=False)

So, as we can see, the number of arrests per year is really high, with 2010 being the year with the most arrests. This is interesting, and we will explore it further in the next sections. Continuing with the evolution and trend of these events, we can look at it in more depth. For example, it could be interesting to see how these incidents evolve over the course of a year. In other words, is there an established cycle that could help us predict the total criminality in the city over the course of a year?

To do this, we need to perform some grouping operations on the specific months of each year.

In [None]:
historic_arrests = pd.DataFrame({"Date":arrests_dataset.ARREST_DATE, "Year": pd.DatetimeIndex(arrests_dataset.ARREST_DATE).year})

In [None]:
years = np.sort(historic_arrests.Year.unique())
groupedYears = {year: historic_arrests[historic_arrests.Year==year] for year in years}

In [None]:
# Monthly arrest counts for each year, labelled with abbreviated month names
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
data_to_plot = []
# Iterate in chronological order so that data_to_plot[i] matches the i-th year
for year in sorted(groupedYears):
    yeardf = groupedYears[year]
    aux_df = pd.DataFrame({"Month": pd.DatetimeIndex(yeardf.Date).month, "Arrests": 1})
    aux_df = aux_df.groupby(['Month'], as_index=False).sum()
    aux_df = aux_df.sort_values(by=["Month"])
    # Map month numbers 1..12 to their names
    aux_df["Month"] = aux_df["Month"].map(lambda m: month_names[m - 1])
    data_to_plot.append(aux_df)

In [None]:
years = range(2006, 2020)
# Build one continuous monthly series, labelling each row as "01-MM-YYYY"
monthly_frames = []
for year in sorted(groupedYears):
    yeardf = groupedYears[year]
    aux_df = pd.DataFrame({"Month": pd.DatetimeIndex(yeardf.Date).month, "Arrests": 1})
    aux_df = aux_df.groupby(['Month'], as_index=False).sum()
    aux_df = aux_df.sort_values(by=["Month"])
    aux_df["Month"] = aux_df["Month"].map(lambda m: f"01-{m:02d}-{year}")
    monthly_frames.append(aux_df)
# Concatenate once (DataFrame.append was removed in pandas 2.0)
arrests_time_series = pd.concat(monthly_frames, ignore_index=True)

After performing these grouping operations, I plot the resulting data in several different charts for the sake of comprehension, as charts that are easy for me to understand may not be as easy for other readers.

In [None]:
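fig, axes = plt.subplots(5, 3)
fig.set_size_inches(20.5, 60)
# One panel per year; axes.flat walks the 5x3 grid row by row
for ax, item, year in zip(axes.flat, data_to_plot, years):
    sns.lineplot(ax=ax, data=item, x='Month', y='Arrests').set_title(str(year))
# There are 14 years on a 15-panel grid, so hide the unused panels
for ax in axes.flat[len(data_to_plot):]:
    ax.set_visible(False)
plt.show()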

In [None]:
sns.set(rc={'figure.figsize':(15,8.27)})
for i,item in enumerate(data_to_plot):
    sns.lineplot(data=item, x='Month', y='Arrests', label=str(years[i])).set_title("Accumulated arrests by month")

In [None]:
fig = plt.figure()
fig.set_size_inches(40, 15)
arrests_time_series.Arrests = arrests_time_series.Arrests.astype(float)
sns.lineplot(data=arrests_time_series, x='Month', y='Arrests')
rolling_meandf = pd.DataFrame({'Month': arrests_time_series.Month,'Mean_arrests':arrests_time_series['Arrests'].rolling(window=6).mean()})
sns.lineplot(data=rolling_meandf, x='Month', y='Mean_arrests')
fig.show()

The main idea we can draw from these charts is that there is indeed some sort of yearly cycle in arrests. I think the second chart shows it best, as the curves for the different years trace a similar shape when passing through the same months. For example, March and October stand out as the most volatile months, since they show the sharpest changes relative to the rest of the months. On the other hand, all charts show that the months around the Christmas period are the ones with the fewest arrests.

However, this last observation may be misleading. Christmas is a significant social and religious event in the USA, so during this period the police may be more focused on covering specific social gatherings and events and less active in problematic areas. We should add that personnel may be massively rearranged because of vacations. In fact, this ties in with the sharp rise in spring that we notice in all charts, which can be interpreted as a recovery that takes place once all officers are back in place and the judicial system (as some arrests may be made under court orders) is fully reactivated. The pattern can also be read the other way around, with the police trying to work at full capacity before the Christmas vacation.
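One way to put a number on this cycle, rather than only reading it off the charts, is to compute what share of each year's arrests falls in each month and then average those shares across years. This is a minimal sketch on invented dates, not the arrests dataset; the names `counts`, `shares` and `seasonal_profile` are only illustrative.

```python
import pandas as pd

# Invented dates, for illustration only
dates = pd.to_datetime([
    "2006-01-10", "2006-03-05", "2006-03-20", "2006-12-24",
    "2007-01-02", "2007-03-11", "2007-03-12", "2007-03-30",
])
toy = pd.DataFrame({"Year": dates.year, "Month": dates.month})

# Rows are years, columns are months, cells are arrest counts
counts = toy.groupby(["Year", "Month"]).size().unstack(fill_value=0)

# Divide each row by its total so that years with different volumes are comparable
shares = counts.div(counts.sum(axis=1), axis=0)

# Average share per month across years: a rough seasonal profile
seasonal_profile = shares.mean(axis=0)
print(seasonal_profile)
```

Because every year is normalised by its own total, the long-term decline in arrests does not distort the monthly comparison.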

### Crimes and population

We have analyzed how these crimes have evolved over recent years. However, this evolution does not show, as fiction sometimes does, how they relate to the population. In other words, we have to analyze how prevalent delinquency is within the population and how it is distributed. To explore this topic, we need the aid of a population dataset. Obtaining it has been somewhat complicated, because the local government does not publish a detailed historical dataset of its population census. So we have to do a little preprocessing on this data before jumping into the analysis. I will use a dataset publicly available on this platform, uploaded by the local NYC government.

In [None]:
population = pd.read_csv('../input/new-york-city-population/new-york-city-population-by-borough-1950-2040.csv')

This population dataset is a projection by the local government of the population by borough (a breakdown that is useful for our analysis) from 1950 to 2040. Obviously, from 2020 onwards this data is merely a prediction. However, this is not our main issue. Our biggest problem is that the figures are given in ten-year steps, which is not convenient for us. We can fill this gap through data imputation. Of course, this will not produce accurate results or a precise analysis. Even so, I believe it will be enough to draw some interesting ideas about the topic we are analyzing.

In [None]:
borough_interest = population[["Borough","2000", "2010", "2020"]]
borough_interest
population_dataframe = pd.DataFrame(columns=["Year","Total", "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"])

In [None]:
names = ["Total", "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
rows = []
for i in range(2000, 2021):
    row = {"Year": str(i)}
    for name in names:
        if i in (2000, 2010, 2020):
            # Borough labels in the file carry leading spaces; the total is stored as "NYC Total"
            label = "NYC Total" if name == "Total" else name
            match = borough_interest[borough_interest["Borough"].str.strip() == label]
            row[name] = match[str(i)].iloc[0]
        else:
            # Years between the ten-year steps are filled in later
            row[name] = np.nan
    rows.append(row)
# Build the frame from a list of rows (DataFrame.append was removed in pandas 2.0)
population_dataframe = pd.DataFrame(rows).set_index("Year")
population_dataframe = population_dataframe.apply(pd.to_numeric)
population_dataframe

So, my first attempt was to use a data imputation method to simulate the population growth. However, there are so few data points that these methods fail at the task. I include the code below in case you are interested in observing the failure.

In [None]:
imputer = KNNImputer(missing_values = np.nan, n_neighbors=2)
imputed_population = pd.DataFrame(imputer.fit_transform(population_dataframe))
imputed_population.columns = ["Total", "Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]
imputed_population["Year"] = range(2000, 2021) 
imputed_population

Given that sophisticated methods do not achieve their task, I decided to take a simpler approach: maybe the best option is just to interpolate the intermediate values to get a reasonably reliable population growth.

In [None]:
population_dataframe = population_dataframe.interpolate(method="linear")
population_dataframe

The result is more elegant and, even though it does not reflect the reality behind the actual data, it is still useful as an approximation for the topic we are discussing in this section.
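To make explicit what linear interpolation does here, consider a small worked example with hypothetical population figures (not the real census values). Each missing year receives the starting value plus an equal fraction of the total change between the two known points.

```python
import numpy as np
import pandas as pd

# Hypothetical figures: two known years with three missing years between them
pop = pd.Series(
    [8_000_000, np.nan, np.nan, np.nan, 8_200_000],
    index=[2000, 2001, 2002, 2003, 2004],
    dtype=float,
)

# method="linear" treats the rows as equally spaced and fills them evenly
filled = pop.interpolate(method="linear")
print(filled)
```

With a change of 200,000 over four steps, every year adds 50,000, so the value for 2002 sits exactly halfway between the two known figures.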

Now we are going to compute what percentage of the population has been arrested in each of these years. This is rather simple, as we already have the preprocessed data required for this calculation.

In [None]:
population_dataframe = population_dataframe.reset_index()
total_population = population_dataframe[population_dataframe.Year > "2005"]
total_population = total_population[total_population.Year < "2020"]

In [None]:
rows = []
for i in range(2006, 2020):
    total_pop = total_population[total_population["Year"] == str(i)].values[0,1]
    # Arrests recorded in year i
    arrests = historic_per_year[historic_per_year["Year"]==i].values[0,1]
    perc = (arrests/total_pop)*100.0
    rows.append({"Year": i, "Arrests": arrests, "TotalPop": total_pop, "Perc":perc})
# Build the table in one step (DataFrame.append was removed in pandas 2.0)
percentage_df = pd.DataFrame(rows, columns=["Year", "Arrests", "TotalPop", "Perc"])

percentage_df

As we can see from the table above, the arrests recorded each year amount to around 4.5% of the total population of New York City. This is really interesting: if we accumulate this percentage over the years, and keep in mind that some individuals are repeat offenders, we can roughly estimate that over 20% of the city's population has a criminal record. This interpretation gives us a clear idea: criminality is rather high in this city and, even with the downward trend we saw in the charts and figures above, it keeps a stable presence in terms of population percentage. This puts the earlier data into context, as it indicates that there is a constant segment of the population that engages in criminal behaviour.
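The same percentage can also be obtained by joining the two tables on the year instead of looking rows up one at a time. This is only a sketch with hypothetical figures; the column names follow the frames used above, but the numbers are invented.

```python
import pandas as pd

# Hypothetical yearly figures, for illustration only
arrests = pd.DataFrame({"Year": [2006, 2007], "Arrests": [400_000, 380_000]})
pop = pd.DataFrame({"Year": ["2006", "2007"], "Total": [8_000_000.0, 8_100_000.0]})

# The population frame stores years as strings, so convert them before joining
pop["Year"] = pop["Year"].astype(int)

# An inner join keeps only the years present in both tables
merged = arrests.merge(pop, on="Year", how="inner")
merged["Perc"] = 100.0 * merged["Arrests"] / merged["Total"]
print(merged)
```

Joining on the year label guarantees that each arrest count is divided by the population of the same year.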

We can go deeper into this theory by exploring the data for the different boroughs that make up NYC, which could help us understand why the data points towards the theory we are discussing. As we have already prepared our population dataframe for this, we only need to make a few tweaks to our original arrests dataset to look at how arrests and criminality are distributed across the boroughs of NYC.

In [None]:
# Work on a copy so the later assignment does not touch a view of arrests_dataset
boro_corrected_df = arrests_dataset[["ARREST_DATE", "ARREST_BORO"]].copy()

In [None]:
# Replace the one-letter borough codes with full names
boro_names = {"M": "Manhattan", "B": "Bronx", "K": "Brooklyn", "Q": "Queens", "S": "Staten Island"}
boro_corrected_df["ARREST_BORO"] = boro_corrected_df["ARREST_BORO"].replace(boro_names)

In [None]:
dataframe_years_boro = pd.DataFrame({"Year": pd.DatetimeIndex(boro_corrected_df.ARREST_DATE).year, "Boro": boro_corrected_df.ARREST_BORO})
years = dataframe_years_boro.Year.unique()
years = np.sort(years)
groupedYears = {year: dataframe_years_boro[dataframe_years_boro.Year==year] for year in years}

In [None]:
data_to_plot = []
for yeardf in groupedYears.items():
    aux_df = pd.DataFrame({"Boro": yeardf[1].Boro , "Arrests":1})
    aux_df = aux_df.groupby(['Boro'], as_index=False).sum()
    data_to_plot.append(aux_df)

In [None]:
fig, axes = plt.subplots(7, 2)
fig.set_size_inches(30, 50)
x = 0
y = 0
for i in range(len(years)):
    sns.barplot(ax=axes[x,y],data=data_to_plot[i], y="Boro", x="Arrests").set_title(str(years[i]))
    y+=1
    if y>1:
        y=0
        x+=1

plt.show()

As we can see in these plots, most of the arrests happen in Brooklyn. However, we are not yet measuring the arrest levels of each borough accurately, as we are not taking into account the total population of each borough. In other words, I believe the amount of crime per capita can give us a more accurate picture of how criminality is distributed over the city. So let's see what this quantity tells us.

In [None]:
data_to_plot = []
for yeardf in groupedYears.items():
    aux_df = pd.DataFrame({"Boro": yeardf[1].Boro , "Arrests":1})
    aux_df = aux_df.groupby(['Boro'], as_index=False).sum()
    aux_df2 = pd.DataFrame({"Boro": aux_df.Boro, "CPC": aux_df.Arrests/population_dataframe[population_dataframe.Year == str(yeardf[0])].values[0, range(2,7)]})
    data_to_plot.append(aux_df2)

In [None]:
fig, axes = plt.subplots(7, 2)
fig.set_size_inches(30, 50)
x = 0
y = 0
for i in range(len(years)):
    sns.barplot(ax=axes[x,y],data=data_to_plot[i], y="Boro", x="CPC").set_title(str(years[i]))
    y+=1
    if y>1:
        y=0
        x+=1

plt.show()

As we can see, if we look at the crime per capita, the interpretation changes radically. The Bronx and Manhattan are the boroughs where most delinquency is concentrated. The previous measure did not take into account the population residing in each borough, so we were comparing unscaled data. Now we can observe that the poorest borough (the Bronx) sits at the top of the delinquency metrics, alongside Manhattan, whose high ratio may be due to its high mobility and activity. These observations are logical: delinquency rises in places where poverty is greater and public resources are less available, which makes their population extremely susceptible to falling into delinquency when looking for ways to survive in a system as particular as the American one. This topic is also recurrent in fiction, where delinquent protagonists are usually portrayed as having a humble origin and somehow managing to get ahead in the system by risking their freedom through morally and legally reprehensible acts. And, judging by the data, these narrative constructions are highly credible, as the probability of this happening in reality is rather high.
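The effect of scaling by population can be seen in a small example with hypothetical numbers (invented for illustration, not taken from the datasets). When both quantities are stored as Series indexed by borough name, pandas aligns the division on the labels, so the order of the boroughs does not matter.

```python
import pandas as pd

# Hypothetical figures, deliberately listed in different orders
arrests = pd.Series({"Brooklyn": 90_000, "Bronx": 70_000, "Staten Island": 10_000})
population = pd.Series({"Bronx": 1_400_000, "Brooklyn": 2_500_000, "Staten Island": 470_000})

# Division aligns on the borough labels, not on position
per_1000 = (arrests / population * 1000).sort_values(ascending=False)
print(per_1000)
```

In this toy case Brooklyn has the most arrests in absolute terms, yet the Bronx comes first once the counts are expressed per 1,000 residents, which is the same kind of reversal seen in the charts above.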

### Analyzing the offenses

In the previous sections, we have described the history of criminality and arrests, analyzed its cycles, evaluated its presence in the population and determined where most of the crimes happen. However, we have not yet talked about which offenses are the most recurrent in our dataset, which is also relevant when reflecting on how fiction treats criminality.

The visualization in this case is rather simple: we accumulate all the registered offense descriptions in our dataset, which seem to be generic crime types, and plot their frequency. To avoid creating a massive chart, I'll plot the top twenty offenses by frequency.

In [None]:
accum_offenses = pd.DataFrame({"Offense":arrests_dataset.OFNS_DESC, "Arrests":1})
accum_offenses = accum_offenses.groupby(["Offense"], as_index=False).sum().sort_values(by=['Arrests'], ascending=False).head(20)
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=accum_offenses, y="Offense", x="Arrests")

This chart shows something that may seem obvious if we have watched or read any work of fiction about delinquency: the most recurrent arrest offense in the city of New York is related to drugs, and by a significant margin. So, in this case, we can see that fiction, where most situations or plot arcs begin in a drug-plagued neighborhood, does not stray far from reality, as the data shows that this is by far the most recurrent offense. Just to be clear, this data covers only the people who have been arrested over fourteen years; we cannot know how many have committed the same crimes and have not been caught. In other words, we are only observing a part of this environment, and we already see some crimes standing out significantly. I cannot imagine what we would see if we had the whole picture.

This chart is generic, so we can take a closer look by breaking the offenses down by the boroughs that make up NYC, to see which offenses are more frequent in each zone. Again, to avoid oversaturating the charts, I plot the top ten offenses.

In [None]:
offenses_by_borough = pd.DataFrame({"Borough": arrests_dataset.ARREST_BORO,"Offense":arrests_dataset.OFNS_DESC, "Arrests":1})

In [None]:
offenses_by_borough.loc[offenses_by_borough.Borough == "M", "Borough"] = "Manhattan"
offenses_by_borough.loc[offenses_by_borough.Borough == "B", "Borough"] = "Bronx"
offenses_by_borough.loc[offenses_by_borough.Borough == "K", "Borough"] = "Brooklyn"
offenses_by_borough.loc[offenses_by_borough.Borough == "Q", "Borough"] = "Queens"
offenses_by_borough.loc[offenses_by_borough.Borough == "S", "Borough"] = "Staten Island"

In [None]:
boroughs = offenses_by_borough.Borough.unique()
groupedBoroughs = {borough: offenses_by_borough[offenses_by_borough.Borough==borough] for borough in boroughs}

In [None]:
data_to_plot = []
for borodf in groupedBoroughs.items():
    aux_df = pd.DataFrame({"Offense": borodf[1].Offense , "Arrests":1})
    aux_df = aux_df.groupby(['Offense'], as_index=False).sum()
    aux_df = aux_df.sort_values(by="Arrests", ascending=False).head(10)
    data_to_plot.append(aux_df)

In [None]:
fig, axes = plt.subplots(5, 1)
fig.set_size_inches(30, 40)
x = 0
y = 0
for i in range(len(boroughs)):
    sns.barplot(ax=axes[x],data=data_to_plot[i], y="Offense", x="Arrests").set_title(str(boroughs[i]))
    x+=1

plt.show()

This visualisation reinforces the ideas drawn above, as we observe that drug-related crimes are the most recurrent ones by a significant margin. However, their impact is different in each borough. For example, in the poorest boroughs (Bronx and Brooklyn) the number of accumulated arrests is over 300,000, while in less troubled areas, such as Staten Island, the incidence is much lower. This is also relevant for our research, as it shows that fiction does not exaggerate when presenting poor boroughs as places where drug dealing and drug-related crimes are surprisingly frequent; it is indeed a fact supported by the data. Another aspect that has caught my attention is that Manhattan does not quite follow the trend of the other boroughs. This seems to be the place in NYC where most thefts occur, as assault and theft-related arrests are among its three most frequent offenses. This information could be relevant, for example, for tourism-related measures, as disoriented tourists are a frequent target of these delinquents. If, for example, the local government expects an increase in tourist visits to the city, it should reinforce the police presence in Manhattan, as it is the most likely place for visitors to be robbed.
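The per-borough ranking built above with a dictionary of sub-frames can also be expressed as a single grouped count. The following is a minimal sketch on invented rows; the offense labels are placeholders rather than the dataset's real categories.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "Borough": ["Bronx", "Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"],
    "Offense": ["DRUGS", "DRUGS", "ASSAULT", "LARCENY", "LARCENY", "DRUGS"],
})

# Count arrests for every (borough, offense) pair
counts = toy.groupby(["Borough", "Offense"]).size().reset_index(name="Arrests")

# Sort by frequency and keep the first row of each borough: its top offense
top = (
    counts.sort_values("Arrests", ascending=False)
    .groupby("Borough")
    .head(1)
    .sort_values("Borough")
)
print(top)
```

Replacing `head(1)` with `head(10)` would give the top ten offenses per borough, as in the charts above.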

Interesting as this data is, there is something that still bothers me: the profile of the people who have historically been arrested for such offenses. I thought this was a good question to answer by looking at the data, and the results were relatively surprising.

In [None]:
accumulated_ages = arrests_dataset[arrests_dataset.AGE_GROUP.isin(['45-64', '25-44', '18-24', '<18', '65+'])].AGE_GROUP
accumulated_ages = pd.DataFrame({"Age_Group":accumulated_ages, "Arrests":1})
accumulated_ages = accumulated_ages.groupby(["Age_Group"], as_index=False).sum()
accumulated_ages = accumulated_ages.reindex([4,0,1,2,3])
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=accumulated_ages, x="Age_Group", y="Arrests")

This result is interesting, as we can see that most arrests fall in the 25-44 age group. However, what is most interesting is that this distribution is not symmetric. In fact, it leans to the left, i.e. the younger age groups are the most frequent offenders.
Indeed, we can check this by computing the skewness of the data.

In [None]:
accumulated_ages["Arrests"].skew()
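As a complementary check, the skewness of the age distribution itself can be approximated from the grouped counts by assigning a representative age to each group and weighting it by the number of arrests. This is a rough sketch under stated assumptions: the representative ages (especially for the open-ended `<18` and `65+` groups) are my own guesses, and the counts are invented.

```python
import math

# Assumed representative ages for '<18', '18-24', '25-44', '45-64', '65+'
midpoints = [15.0, 21.0, 34.5, 54.5, 70.0]
# Invented arrest counts per group, for illustration only
counts = [100, 300, 450, 140, 10]

n = sum(counts)
mean = sum(m * c for m, c in zip(midpoints, counts)) / n

# Weighted second and third central moments
var = sum(c * (m - mean) ** 2 for m, c in zip(midpoints, counts)) / n
third = sum(c * (m - mean) ** 3 for m, c in zip(midpoints, counts)) / n

# Moment coefficient of skewness
skewness = third / math.pow(var, 1.5)
print(round(mean, 3), round(skewness, 3))
```

A positive value means the long tail points towards the older ages, that is, most of the mass sits at the younger end of the scale.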

The skewness confirms our theory: the younger population is arrested more frequently for criminal offenses.
But which crimes are the most frequent in these age groups? Let's see what the data says about it.

In [None]:
under18 = arrests_dataset[arrests_dataset.AGE_GROUP == '<18']
young = arrests_dataset[arrests_dataset.AGE_GROUP == '18-24']
interest_group_arrests = pd.concat([under18, young])

In [None]:
interest_groups = pd.DataFrame({'offense':interest_group_arrests.OFNS_DESC, 'arrests':1})
interest_groups = interest_groups.groupby(['offense'], as_index=False).sum().sort_values(by=['arrests'], ascending=False).head(20)
sns.set(rc={'figure.figsize':(15,8.27)})
sns.barplot(data=interest_groups, y="offense", x="arrests")

As we can observe in this chart, young people (from under eighteen up to 24 years old) tend to be arrested for the same reason as the overall population: drug-related crimes. This leads us to the last topic I want to explore in this work.

### Delinquency related to education

In the previous section, we noted that young people were the most frequently arrested group within the crime figures handled in this study. This reminds me of the fourth season of The Wire, the series mentioned at the beginning of this work. The theme of that season was how school failure and the neglect of the public school system ended up dragging many students from poor neighbourhoods into drugs and delinquency. I was surprised by the harshness with which it dealt with this issue. However, the series also shows how the delinquents themselves tried to prevent these young people from taking the same path. So I think it is interesting to check how this school failure correlates in the long run with the number of arrests in these population subgroups.

To do this, I picked a dataset published by the NYC government and publicly available on this platform, which has this information by year and borough. However, it only covers the years from 2005 to 2010. Even so, this limited time span is not a problem for us, as it still lets us explore possible consequences on criminality during the subsequent years.

First, we are going to pick the data of interest (number of school dropouts) from this dataset.

In [None]:
dropouts_dataset = pd.read_csv('../input/nycgraduation-outcomes-2002020/2020-graduation_rates_public_borough.csv', sep=";")

In [None]:
dropouts = dropouts_dataset[['Cohort Year', 'Borough', '# Dropout']]
dropouts.columns = ['Year', 'Borough', 'Dropout']

First, we are going to evaluate historically the impact of dropouts from 2006 till 2016. For the arrest history, we extend the curves to 2019 to evaluate whether there is a perceptible impact.

In [None]:
under18_historic = interest_group_arrests.ARREST_DATE
under18_historic = pd.DatetimeIndex(under18_historic).year
under18_historic = pd.DataFrame({"Year":under18_historic, "Arrests":1})
under18_historic = under18_historic.groupby(['Year'], as_index=False).sum()

In [None]:
historic_dropouts = dropouts[['Year', 'Dropout']]
historic_dropouts = historic_dropouts.groupby(by="Year", as_index=False).sum()

I'll make a quick plot of the dropout history, as it will be useful for the analysis that follows.

In [None]:
sns.set(rc={'figure.figsize':(15,8.27)})
sns.lineplot(data=historic_dropouts, x="Year", y="Dropout").set_title("Evolution of dropouts in NYC")

In [None]:
complete_historic = pd.DataFrame({"Year": under18_historic.Year, "Arrests": under18_historic.Arrests, "Dropout":historic_dropouts[historic_dropouts["Year"] > 2005].Dropout.reset_index().Dropout})
complete_historic = complete_historic.set_index("Year")
sns.set(rc={'figure.figsize':(15,8.27)})
sns.lineplot(data=complete_historic).set_title("Evolution of dropouts and arrests in NYC")

In this chart, we can perceive some relation between the dropout numbers and the number of arrests. From 2004 to 2006, dropouts are high and rising. This effect then shows up in the arrests of the youth group from 2006 to 2010, which increase steadily. In other words, the increase in youth delinquency appears with a delay of about two years. If we continue looking at the chart, from 2007 to 2014 the dropout numbers decrease steadily at a slow pace. This effect can also be seen two years later in the arrest figures, which start to drop steadily as well. However, this only suggests that school dropouts have some influence on arrests, as the curves have different shapes: the dropout curve is steep, while the arrests curve is smoother and also behaves somewhat like a cumulative curve. Even so, the chart indicates that there is some relationship between these two numbers.

To test how they are related, we can compute their correlation and show it in a heatmap.
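The two-year delay described above can also be checked numerically instead of by eye. The following is a minimal sketch, assuming two yearly tables shaped like `historic_dropouts` and `under18_historic`. The figures are invented for illustration, and `lagged_correlation` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd

# Illustrative yearly figures (invented for this sketch, not taken from the datasets)
demo_dropouts = pd.DataFrame({"Year": list(range(2004, 2013)),
                              "Dropout": [90, 100, 110, 105, 98, 92, 85, 80, 74]})
demo_arrests = pd.DataFrame({"Year": list(range(2006, 2015)),
                             "Arrests": [450, 500, 550, 525, 490, 460, 425, 400, 370]})

def lagged_correlation(dropouts, arrests, lag):
    """Pearson correlation between dropouts in year t and arrests in year t + lag."""
    # Move the dropout figures `lag` years forward in time
    shifted = dropouts.assign(Year=dropouts["Year"] + lag)
    # Align the two tables by year rather than by row position
    merged = shifted.merge(arrests, on="Year", how="inner")
    return merged["Dropout"].corr(merged["Arrests"])

for lag in range(0, 4):
    print(lag, round(lagged_correlation(demo_dropouts, demo_arrests, lag), 3))
```

With the real data there are only a handful of yearly points, so the correlation at each lag would be quite unstable and should be read as a hint rather than as evidence.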

In [None]:
test = pd.DataFrame({"Year": under18_historic[under18_historic.Year < 2017].Year, "Arrests": under18_historic[under18_historic.Year < 2017].Arrests, "Dropout":historic_dropouts[historic_dropouts["Year"] > 2006].Dropout})
test = test.set_index("Year")
correlation = test.corr()
sns.heatmap(correlation, annot=True)

As we expected, the arrests and the dropout numbers have a high correlation value (we should look at the off-diagonal cells of the chart, since the main diagonal is always 1). Even so, we have to be careful when interpreting this value, as correlation does not imply causation. In other words, the dropout rate appears to be related to the probability that a student gets involved in criminal activities, but we cannot say it is the main reason. As we have seen, for example, the borough and economic environment they belong to are also relevant. In fact, to get a clearer view of the effect of that variable, we are going to study it in the different boroughs of NYC.

In [None]:
# Map the single-letter borough codes to their full names
borough_names = {"M": "Manhattan", "B": "Bronx", "K": "Brooklyn", "Q": "Queens", "S": "Staten Island"}
under18_borough = pd.DataFrame({"Year": pd.DatetimeIndex(interest_group_arrests.ARREST_DATE).year, "Borough": interest_group_arrests.ARREST_BORO.replace(borough_names), "Arrest": 1})

In [None]:
under18_borough = under18_borough.groupby(["Year", "Borough"], as_index=False).sum()

In [None]:
dropouts_boro = dropouts.groupby(["Year", "Borough"], as_index=False).sum()
dropouts_boro = dropouts_boro[dropouts_boro.Borough.isin(["Bronx","Brooklyn", "Queens", "Manhattan", "Staten Island"])]
dropouts_boro = dropouts_boro[dropouts_boro.Year > 2005].reset_index()[["Year", "Borough", "Dropout"]]
complete_dataframe = pd.DataFrame({"Year":dropouts_boro.Year, "Borough": dropouts_boro.Borough, "Dropout": dropouts_boro.Dropout, "Arrests":under18_borough[under18_borough.Year < 2017].Arrest})
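
The combined table above is built by placing columns side by side, which relies on both tables having the same row order and the same index labels. A possible key-based alternative is sketched below with `merge`. The small tables are invented for illustration and only mimic the shape of `dropouts_boro` and `under18_borough`.

```python
import pandas as pd

demo_dropouts_boro = pd.DataFrame({
    "Year": [2006, 2006, 2007, 2007],
    "Borough": ["Bronx", "Queens", "Bronx", "Queens"],
    "Dropout": [120, 80, 110, 75],
})
# Deliberately in a different row order, with one extra year that has no dropout data
demo_arrests_boro = pd.DataFrame({
    "Year": [2007, 2006, 2006, 2007, 2008],
    "Borough": ["Queens", "Bronx", "Queens", "Bronx", "Bronx"],
    "Arrest": [300, 500, 320, 480, 450],
})

demo_complete = (
    demo_dropouts_boro
    # Pair rows that share the same year and borough; unmatched rows are dropped
    .merge(demo_arrests_boro, on=["Year", "Borough"], how="inner")
    .rename(columns={"Arrest": "Arrests"})
    .sort_values(["Borough", "Year"])
    .reset_index(drop=True)
)
print(demo_complete)
```

Because rows are matched on `Year` and `Borough`, the result does not depend on how either table happens to be sorted or indexed.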

In [None]:
fig, axes = plt.subplots(5, 1)
fig.set_size_inches(10, 40)
# One subplot per borough, pairing each borough with its own axis
for ax, borough in zip(axes, complete_dataframe.Borough.unique()):
    aux_df = complete_dataframe[complete_dataframe.Borough == borough][["Year", "Dropout", "Arrests"]].set_index("Year")
    sns.lineplot(ax=ax, data=aux_df).set_title(borough)

The results obtained in each borough support, to some extent, the idea we conveyed in the previous test: the school dropout rate exerts some influence on youth offending. However, it is not the only cause, and we cannot point to it as the main one. In fact, we observe that the dropout figures in each borough are very similar and describe a similar trend, except in Staten Island, which does have lower rates. In contrast, the youth arrests curve follows a different tendency in each borough. It is therefore reasonable to think that, for example, the economic situation of the young people (to some extent determined by the borough in which they live) is much more decisive than whether the students in question have dropped out of school. Therefore, we can plausibly attribute the phenomenon of delinquency in young people (and in adults as well) fundamentally to poverty and to the resources they have available to survive. In fact, this issue is also addressed in the aforementioned fourth season of The Wire, where, although the education system fought hard to prevent young people from taking this path, it was sometimes impossible given the situation of misery and violence in which they were involved.

### Conclusion

Throughout this work we have studied, with the data available to us, how crime works in New York City, how it is distributed among the population and which complaints are the most recurrent. In addition, we conducted a small investigation into whether there is a relationship between school dropout and juvenile delinquency, which is higher than the average delinquency when separated by age groups. We have concluded that, although dropout does seem to have some influence over a period of about two years, it is not the main cause of young people being pushed into crime. In fact, from everything we have studied in this work, we came to the conclusion that poverty is probably the main reason why people are motivated to commit crimes, which seems obvious, since it is after all a question of survival in many cases.
Still, the main thread of this work was to check with data whether what we often learn through fiction is real or not. The answer is a resounding yes.

Obviously, we have not reinvented the wheel: we have looked into the veracity of a literary genre known as non-fiction narrative, where a fictional story is set in a real environment in order to describe, anonymously, the atmosphere and social situation of a certain place at a certain time. The Wire, one of the references of this genre, grew out of the reporting and journalistic investigations of its author, David Simon, at The Baltimore Sun. Despite this, I think it was worth checking with data whether this is real, since it has also been interesting to observe different characteristics of crime in a city, as well as to confirm that, in this case, fiction is terrifyingly close to reality.

This work raises other topics that could be addressed through a predictive analysis of the dataset, such as the scenario proposed in the third season of the show, where a police officer decides to create Hamsterdam, a fictitious neighbourhood where drug dealing is completely legal, in order to reduce violence and street crime in normal neighbourhoods. Based on this premise, we could study, through population simulation, criminality data, and predictive models, whether this measure would really be effective in a city with a large amount of drug-related crime, such as the one we have just studied. It could be interesting as a way of bringing information and possible solutions to the table to improve coexistence among citizens in relation to drug trafficking. However, all of this remains for future work.

Thank you very much for your attention and I hope that this notebook has helped you, both theoretically and technically, to learn something in the field of data science.

See you in future work!

### EXTRA - Is the arrests curve predictable?

After exploring the main topic of this work, I mentioned that it could be extended with predictive methods to forecast, for example, future criminality rates and accumulated arrest counts. I wanted to experiment a little with forecasting this amount for future years. In this extra section, I approach the problem by using an ARIMA model as a simple forecaster for this metric. This requires a little preprocessing of the dataset.

In [None]:
# Build a monthly arrests series: one row per (year, month), labelled by the first day of the month
monthly_frames = []
for year, year_df in historic_arrests.groupby("Year"):
    aux_df = pd.DataFrame({"Month": pd.DatetimeIndex(year_df.Date).month, "Arrests": 1})
    aux_df = aux_df.groupby("Month", as_index=False).sum().sort_values(by="Month")
    # Turn the month number into a "YYYY-MM-01" label, using the year of the group itself
    aux_df["Month"] = aux_df.Month.map(lambda m: f"{year}-{m:02d}-01")
    monthly_frames.append(aux_df)
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
arrests_time_series = pd.concat(monthly_frames, ignore_index=True)

In [None]:
# Rename into a new frame (leaving arrests_time_series untouched) and index it by date
arrests_time_series_arima = arrests_time_series.rename(columns={"Month": "Date"})
arrests_time_series_arima["Date"] = pd.to_datetime(arrests_time_series_arima["Date"])
arrests_time_series_arima = arrests_time_series_arima.set_index("Date")
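
As a side note, a monthly series like this one can also be sketched with `resample`, which additionally yields one row per month at a regular month-start frequency (months without arrests appear as 0 instead of being skipped). The dates below are invented. In this notebook they would come from the `Date` column of `historic_arrests`, which is assumed here to hold one row per arrest.

```python
import pandas as pd

# Invented arrest dates, one entry per arrest (stand-in for historic_arrests.Date)
demo_dates = pd.to_datetime([
    "2006-01-03", "2006-01-20", "2006-02-11",
    "2006-04-05", "2006-04-28", "2006-04-30",
])

demo_monthly = (
    pd.Series(1, index=demo_dates)  # each arrest counts as 1
    .resample("MS")                 # bucket by month, labelled with the month start
    .sum()                          # empty months sum to 0
    .rename("Arrests")
)
demo_monthly.index.name = "Date"
print(demo_monthly)
```

A regular frequency on the index is convenient for time series models, since it lets them label forecast dates without guessing the spacing of the observations.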

In [None]:
pd.to_numeric(arrests_time_series_arima.Arrests)

In [None]:
model = ARIMA(pd.to_numeric(arrests_time_series_arima.Arrests), order=(1,1,2))

In [None]:
model_fit = model.fit()

In [None]:
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()

In this residual plot, we can observe a high variance, so we can expect inaccurate predictions for the coming years. Indeed, we can compute the variance of the series.

In [None]:
pd.to_numeric(arrests_time_series_arima.Arrests).var()

In [None]:
fig = model_fit.plot_predict(start='2007-01-01', end='2022-12-01')
fig.show()

As we can see in the plot, the model forecasts a slight rise in arrests in the following years, 2021 and 2022. However, the confidence interval for this forecast is too wide, which means that the forecast is potentially inaccurate and erratic. As we observed in the residuals, there is great variance in our dataset, which means that a simple predictive model is not able to forecast new data well because of the large differences found during the training phase of the model. This leads us to think about more sophisticated predictive models for regression on this time series.

Something else to add for future work!