# <span style = "color: #2E8B57">1. Analysing the Crimes in Boston Dataset </span>

## 🌔 <span style="color: #228B22"> 1.1: Libraries & Reading the Data </span>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:white;">
To start, we import the libraries we need and read the data with the `read_csv` function of the pandas library.

</p>
</div>

<div class="alert alert-block alert-info"> 📌 Content: Records begin on June 14, 2015 and continue to September 3, 2018.
</div>

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings("ignore")
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import missingno as msno
from pylab import rcParams

%matplotlib inline

In [None]:
PATH = "../input/crimes-in-boston/crime.csv"

crime = pd.read_csv(PATH, encoding = "latin-1")
crime.head()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:white;">
Once the data is loaded, it is common to check for null values, look at the data's shape, and get an overview of the whole dataset with the `info` and `describe` methods.

</p>
</div>

## 🎊 <span style="color: #228B22"> 1.2: Data Exploration </span>

In [None]:
crime.isna().sum()

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:white;">
The 'SHOOTING', 'STREET', 'Lat', and 'Long' columns seem to have far too many null values. Let's look at the shape of the data now, so we can put the number of null values in context and decide what to do with them.

</p>
</div>

In [None]:
crime.shape

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:white;">
🤔 Hmmm... We have more than 300,000 entries. Maybe, just maybe, we can ignore the null values in the 'STREET', 'Lat', and 'Long' columns. The 'SHOOTING' column, however, is almost entirely empty. Does that mean it holds no information we can use? Not necessarily: an NA value can mean that no shooting occurred. That tells us something, don't you think?
</p>
</div>
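
One way to check the reading that a missing 'SHOOTING' value means "no shooting" is to count the values with NaN kept as its own bucket. The sketch below uses a small made-up frame, not the real data, and assumes the column holds only 'Y' or NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({"SHOOTING": ["Y", np.nan, np.nan, np.nan, "Y", np.nan]})

# dropna=False keeps NaN as its own bucket in the counts
counts = toy["SHOOTING"].value_counts(dropna=False)
print(counts)

# If "Y" is the only non-null value, NaN is plausibly an implicit "N"
non_null_values = set(toy["SHOOTING"].dropna().unique())
print(non_null_values)
```

If the set of non-null values contained anything besides 'Y', filling every NaN with 'N' would be harder to justify.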

In [None]:
crime.info()

In [None]:
crime.describe()

In [None]:
msno.matrix(crime);

In [None]:
def missing_values_table(dataframe):
    
    # Take the columns that contain at least one null value
    variables_with_na = [col for col in dataframe.columns if dataframe[col].isnull().sum() > 0]
    
    # Sort them
    n_miss = dataframe[variables_with_na].isnull().sum().sort_values(ascending=False)
    
    # Calculate their ratio
    ratio = (dataframe[variables_with_na].isnull().sum() / dataframe.shape[0] * 100).sort_values(ascending=False)
    
    # Take the data types of the columns that have null values
    dtypes = dataframe.dtypes
    dtypesna = dtypes.loc[dataframe.isnull().sum() != 0]
    missing_df = pd.concat([n_miss, np.round(ratio, 2), dtypesna], axis=1, keys=['Null Values', 'Ratio', 'Dtype'])
    
    if len(missing_df)>0:
        print(missing_df)
        print("\nThere are {} columns with missing values\n".format(len(missing_df)))
    else:
        print("\nThere is no missing value")

In [None]:
missing_values_table(crime)
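
The helper above prints its summary. If the table is needed later (for filtering or plotting), a variant that returns a DataFrame may be handier. This is a minimal sketch on toy data; the name `missing_summary` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return one row per column that has at least one missing value."""
    n_miss = df.isna().sum()
    summary = pd.DataFrame({
        "Null Values": n_miss,
        "Ratio": (n_miss / len(df) * 100).round(2),
        "Dtype": df.dtypes.astype(str),
    })
    # Keep only columns with missing values, most affected first
    has_missing = summary[summary["Null Values"] > 0]
    return has_missing.sort_values("Null Values", ascending=False)

# Toy frame: "a" has two nulls, "b" has one, "c" has none
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

Returning the frame instead of printing it keeps the function usable in later cells, for example to select every column whose ratio exceeds some threshold.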

In [None]:
crime["SHOOTING"] = crime["SHOOTING"].fillna("N")

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:white;">
Since I decided that null shooting values mean there is no shooting, I replaced them with 'N'.
</p>
</div>

In [None]:
missing_values_table(crime)

In [None]:
null_columns = crime.columns[crime.isna().any()]
crime[crime.isna().any(axis=1)][null_columns].tail(20)

In [None]:
crime["DISTRICT"].value_counts()

In [None]:
def value_count(data):
    
    for col in data.columns:
        print(f"{col} value counts:\n{data[col].value_counts().head()}")
        print("########################################################")

In [None]:
value_count(crime)

In [None]:
crime.loc[crime["INCIDENT_NUMBER"] == 'I162030584']

In [None]:
crime.loc[crime['INCIDENT_NUMBER'] == 'I152080623']

<div class="alert alert-block alert-info"> 🧨 Rows that share an 'INCIDENT_NUMBER' have identical entries in every column except the ones that start with 'OFFENSE_'.
</div>
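
Instead of inspecting a couple of incident numbers by hand, the same observation can be checked for all of them at once. The sketch below uses a made-up frame with a few of the dataset's column names; it counts distinct values per column within each incident and lists the columns that vary.

```python
import pandas as pd

# Toy rows: incident "I1" appears twice with different offense codes
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I1", "I1", "I2"],
    "OFFENSE_CODE": [3115, 619, 1402],
    "DISTRICT": ["B2", "B2", "C11"],
    "HOUR": [13, 13, 0],
})

# Number of distinct values per column within each incident
per_incident = toy.groupby("INCIDENT_NUMBER").nunique()

# Columns that take more than one value inside at least one incident
varying = [col for col in per_incident.columns if (per_incident[col] > 1).any()]
print(varying)
```

On the real data, a result limited to the 'OFFENSE_' columns would support dropping duplicate incident numbers without losing location or time information.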

<font color = "#5642C5">📝**Takeaways:**
1. Most crimes happened in the B2 district (Roxbury), followed by the C11 district (Dorchester).
2. Most crimes (318054) don't involve a shooting.
3. We can't say much about dates, but midnight seems to be a busy hour.
4. We might expect crime to decrease over the years, but the number of crimes increases instead. 2017 was the year with the most crimes.
5. The summer months (August, July) have the highest crime counts.
6. Crime increases on Fridays, possibly because Friday is the last working day.
7. Crime also seems to be high in the evening hours and at 12 (noon).
8. UCR Part Three incidents are the most common. These are the lesser incidents, below the Part One offenses (such as larceny-theft) and the Part Two offenses (such as simple assault and vandalism).
9. Washington St. has the most crimes.</font>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
              color:white;">
The 'OCCURRED_ON_DATE' column contains dates, but its dtype is object. We should fix that.
</p>
</div>

In [None]:
crime["OCCURRED_ON_DATE"] = pd.to_datetime(crime["OCCURRED_ON_DATE"]) 
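
When the timestamp layout is known, passing it explicitly makes parsing stricter, and `errors="coerce"` shows how many entries fail to parse. The sketch below uses made-up strings; the `"%Y-%m-%d %H:%M:%S"` layout is an assumption and should be checked against the actual column first.

```python
import pandas as pd

# Made-up timestamps; the "%Y-%m-%d %H:%M:%S" layout is an assumption
raw = pd.Series(["2018-09-02 13:00:00", "2015-06-15 00:00:00", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())  # number of entries that failed to parse
```

A non-zero NaT count after conversion points at rows worth inspecting before any date-based features are built.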

<font color = "#5642C5">Let's check the number of unique entries:</font>

In [None]:
crime.apply(pd.Series.nunique)

In [None]:
# Delete duplicate rows: keep only the first row for each incident number

crime.drop_duplicates(subset = ["INCIDENT_NUMBER"], inplace = True)

In [None]:
crime.loc[crime['INCIDENT_NUMBER'] == 'I152080623']
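
Dropping duplicates on 'INCIDENT_NUMBER' keeps only the first offense of each incident. If the number of offenses per incident matters later, one option is to record it before the drop. This is a sketch on made-up rows; the `n_offenses` column is hypothetical and does not exist in the notebook.

```python
import pandas as pd

# Toy rows: incident "I1" has three offenses, "I2" has one
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I1", "I1", "I1", "I2"],
    "OFFENSE_CODE": [3115, 619, 3301, 1402],
})

# Count rows per incident before any rows are dropped
rows_per_incident = toy["INCIDENT_NUMBER"].value_counts()
toy["n_offenses"] = toy["INCIDENT_NUMBER"].map(rows_per_incident)

# Keep the first row of each incident, as drop_duplicates does by default
deduped = toy.drop_duplicates(subset=["INCIDENT_NUMBER"]).reset_index(drop=True)
print(deduped)
```

In the notebook this step would have to run before the `drop_duplicates` cell, since the extra rows are gone afterwards.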

In [None]:
# Rename the long, upper-case feature names

rename = {'OFFENSE_CODE_GROUP' : 'Group',
         'OFFENSE_DESCRIPTION' : 'Description',
         'DAY_OF_WEEK' : 'Day',
         'YEAR' : 'Year',
         'MONTH' : 'Month',
         'HOUR' : 'Hour',
         'STREET' : 'Street',
         'DISTRICT' : 'District',
         'SHOOTING' : 'Shooting',
         'OCCURRED_ON_DATE' : 'Date',
         'REPORTING_AREA' : 'Area',
         'OFFENSE_CODE' : 'Code'}

crime.rename(columns = rename, inplace = True)

In [None]:
crime.columns

<font color = "#5642C5">Some columns hold categorical values, so we can convert them to the `Categorical Data Type`.</font>

In [None]:
crime.Group = crime.Group.astype('category')
crime.Description = crime.Description.astype('category')
crime.Day = crime.Day.astype('category')
crime.UCR_PART = crime.UCR_PART.astype('category')
crime.District = crime.District.astype('category')

In [None]:
crime.info()
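
A practical reason for the conversion is memory: a category column stores each distinct label once plus a small integer code per row. The sketch below measures this on a made-up series of weekday names rather than on the real data.

```python
import pandas as pd

# 70,000 strings but only seven distinct values
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
as_text = pd.Series(days * 10_000)
as_category = as_text.astype("category")

# deep=True includes the memory taken by the strings themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)
```

The saving depends on how often values repeat; a column with mostly unique strings gains little from this conversion.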

<font color = "#5642C5">Create two new features, <b>quarter & weekofyear</b>.</font>

In [None]:
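# Creating two new features
crime["Quarter"] = crime["Date"].dt.quarter
# dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
crime["Weekofyear"] = crime["Date"].dt.isocalendar().week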

# Convert into categorical data type
crime["Quarter"] = crime["Quarter"].astype("category")
crime["Weekofyear"] = crime["Weekofyear"].astype("category")
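
Week numbers follow the ISO calendar, so the first days of January can belong to the last week of the previous year. The sketch below shows this on three hand-picked dates around New Year 2016.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-12-31", "2016-01-01", "2016-01-04"]))

# isocalendar() returns a frame with ISO "year", "week" and "day" columns
iso = dates.dt.isocalendar()
print(iso)

weeks = iso["week"].astype(int).tolist()
quarters = dates.dt.quarter.tolist()
print(weeks)     # 2016-01-01 still belongs to ISO week 53
print(quarters)
```

When weeks are compared across years, pairing the week with the ISO year from the same frame avoids mixing early January with late December.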

In [None]:
crime.loc[crime['Lat'] == -1]

In [None]:
# Specify columns to drop
drop = ["INCIDENT_NUMBER", 'Code', 'Description', 'Area', 'Location'] # Dropping Location since we have Lat and Long

# Drop the specified columns
crime.drop(drop, axis = 1, inplace = True)

# Replace -1 with NaN in the Lat and Long columns
crime["Lat"] = crime["Lat"].replace(-1, np.nan)
crime["Long"] = crime["Long"].replace(-1, np.nan)

In [None]:
missing_values_table(crime)

## 📊 <span style="color: #228B22"> 1.3: Data Visualization </span>

In [None]:
# Set plot parameters
sns.set_style("darkgrid")  # the "seaborn-darkgrid" style name was dropped from recent matplotlib releases
rcParams["figure.figsize"] = 20,9

sns.countplot(x = "Month", hue = "Year", data = crime)
plt.title("Total Number of Crimes For Each Month (2015-2018)", fontsize = 16, color = "#5642C5");

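<font color = "#5642C5">One thing to note here: the data does not cover the first five months of 2015 or the last three months of 2018, and June 2015 and September 2018 are only partly covered. Yearly totals for 2015 and 2018 are therefore not directly comparable with 2016 and 2017.</font>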
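
Because two of the years are incomplete, yearly totals can be put on a more comparable footing by dividing by the number of months that actually have records. The sketch below does this on made-up rows; it still treats partly covered months as full months, so it is only a rough correction.

```python
import pandas as pd

# Made-up records: 2015 is covered for 2 months, 2016 for 3 months
toy = pd.DataFrame({
    "Year":  [2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016],
    "Month": [11, 11, 12, 1, 1, 2, 2, 3, 3],
})

per_year = toy.groupby("Year").agg(
    crimes=("Month", "size"),             # rows recorded in that year
    months_covered=("Month", "nunique"),  # distinct months with any record
)
per_year["crimes_per_month"] = per_year["crimes"] / per_year["months_covered"]
print(per_year)
```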

In [None]:
order = crime["Group"].value_counts().head(5).index
sns.countplot(data = crime, x = "Group", hue = "District", order = order)
plt.title("Number of Crime Group For Each District", fontsize = 16, color = "#5642C5");

<font color = "#5642C5">Districts D4, A1, B2, and C6 have the highest crime counts in the Motor Vehicle Accident and Larceny groups.</font>

In [None]:
# Specifying the values to plot (year, month, day, hour, district, street)
noc_year = pd.DataFrame(data = crime["Year"].value_counts().reset_index().values,
                        columns = ["year", "noc"]).sort_values("year").reset_index(drop = True)
noc_month = pd.DataFrame(data = crime["Month"].value_counts().reset_index().values,
                        columns = ["month", "noc"]).sort_values("month").reset_index(drop = True)
noc_day = pd.DataFrame(data = crime["Day"].value_counts().reset_index().values,
                        columns = ["day", "noc"]).sort_values("day").reset_index(drop = True)
noc_hour = pd.DataFrame(data = crime["Hour"].value_counts().reset_index().values,
                        columns = ["hour", "noc"]).sort_values("hour").reset_index(drop = True)
noc_dist = pd.DataFrame(data = crime["District"].value_counts().reset_index().values,
                        columns = ["dist", "noc"])
noc_street = pd.DataFrame(data = crime["Street"].value_counts().reset_index().values,
                        columns = ["street", "noc"]).sort_values("noc", ascending = False).reset_index(drop = True).head(30)

# Create a subplot with 3 rows and 2 cols
fig = make_subplots(rows = 3, cols = 2,
                   specs = [[{"type" : "scatter"}, {"type" : "scatter"}], [{"type" : "scatter"},
                             {"type" : "scatter"}], [{"type" : "bar"}, {"type" : "bar"}]],
                   subplot_titles = ("NOC per Year", "NOC per Month", "NOC per Day", "NOC per Hour", "NOC per District", "NOC per Street"))

# Plot the values
fig.add_trace(go.Scatter(x = noc_year["year"],
                        y = noc_year["noc"]), row = 1, col = 1)
fig.add_trace(go.Scatter(x = noc_month["month"],
                        y = noc_month["noc"]), row = 1, col = 2)
fig.add_trace(go.Scatter(x = noc_day["day"],
                        y = noc_day["noc"]), row = 2, col = 1)
fig.add_trace(go.Scatter(x = noc_hour["hour"],
                        y = noc_hour["noc"]), row = 2, col = 2)
fig.add_trace(go.Bar(x = noc_dist["dist"],
                        y = noc_dist["noc"]), row = 3, col = 1)
fig.add_trace(go.Bar(x = noc_street["street"],
                        y = noc_street["noc"]), row = 3, col = 2)

# Update x axes parameters
fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Month", range=[0, 13], row=1, col=2)
fig.update_xaxes(title_text="Day", row=2, col=1)
fig.update_xaxes(title_text="Hour",row=2, col=2)
fig.update_xaxes(title_text="District", row=3, col=1)
fig.update_xaxes(title_text="Street", row=3, col=2)

# Update y axes parameters
fig.update_yaxes(title_text="Crime Count", row=1, col=1)
fig.update_yaxes(title_text="Crime Count",row=1, col=2)
fig.update_yaxes(title_text="Crime Count", row=2, col=1)
fig.update_yaxes(title_text="Crime Count", row=2, col=2)
fig.update_yaxes(title_text="Crime Count", row=3, col=1)
fig.update_yaxes(title_text="Crime Count", row=3, col=2)

fig.update_layout(showlegend=False,title_text="Distributions of Total NOC Between 2015-2018", height=900)

<font color = "#5642C5">📝 **Takeaways:**

1. 2017 has the highest number of crimes.
2. August has the highest count, and looking at the plot we can perhaps say that crime peaks in the summer season.
3. Wow, if we look at the day plot, the number of crimes increases on Fridays and drops sharply on Sundays. Friday is the last working day, which may be the reason, just as Sunday being a day off may make crime drop.
4. Crime increases at five o'clock in the evening (after work hours) and at 12 noon (lunch break).
5. The B2 Roxbury district has the most crimes, followed by the C11 Dorchester and D4 Back Bay/South End districts.
6. Washington St. has the most crimes, as the value counts earlier also showed.</font>
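
The day panel above sorts the weekday names with `sort_values("day")`, which orders them alphabetically. To read the Friday and Sunday pattern more easily, the counts can be put in calendar order instead. This is a sketch on made-up labels standing in for the 'Day' column.

```python
import pandas as pd

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up weekday labels standing in for the Day column
toy = pd.Series(["Friday", "Monday", "Friday", "Sunday", "Wednesday", "Friday", "Monday"])

# Count per day, then put the rows in calendar order (missing days become 0)
noc_day = (
    toy.value_counts()
       .reindex(day_order, fill_value=0)
       .rename_axis("day")
       .reset_index(name="noc")
)
print(noc_day)
```

The resulting frame has the same `day` and `noc` columns as the one used for the plot, so it could be passed to `go.Scatter` in the same way.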

In [None]:
ucr_year = pd.DataFrame(data = (crime.groupby(["Year","UCR_PART"]).count()[['Group']]).reset_index().values,
                        columns= ["year","ucr_part","noc"]).sort_values('year').reset_index(drop=True)

px.bar(ucr_year, x = "year", y = "noc", color = "ucr_part", title = "UCRs per Year (Figure 1)", text = "noc")
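
The count table can also be built without going through `.values`, which keeps integer dtypes intact. The sketch below uses `groupby(...).size()` on made-up rows; note that `size()` counts rows, whereas `count()` on a column counts only its non-null values, so the two differ when that column has gaps.

```python
import pandas as pd

# Made-up rows with the same two grouping columns as the notebook
toy = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017],
    "UCR_PART": ["Part One", "Part Three", "Part Three", "Part Two", "Part Three"],
})

# size() counts rows per group; reset_index(name=...) yields a tidy frame
ucr_year_toy = (
    toy.groupby(["Year", "UCR_PART"])
       .size()
       .reset_index(name="noc")
)
print(ucr_year_toy)
```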

In [None]:
fig = px.line(ucr_year, x = "year", y = "noc", color = "ucr_part",
              title = "UCRs per Year (Figure 2)",
              labels = {"noc" : "Number of Crime",
                        "year" : "Year",
                        "ucr_part" : "UCR Part"})

fig.update_layout(
    font_color="#5642C5",
    title_font_color="#5642C5",
    legend_title_font_color="#5642C5",
    font_size = 14
)

<font color = "#5642C5">Part One crimes (murder, manslaughter, sex offenses, robbery, aggravated assault, burglary, motor vehicle theft, and arson) are less common in every year than the other UCR groups, but the Part One group has more crimes in 2016 than in any other year, as Figure 2 shows.</font>

In [None]:
ucr_month = pd.DataFrame(data = (crime.groupby(["Month", "UCR_PART"]).count()["Group"]).reset_index().values,
                        columns = ["month", "ucr_part", "noc"]).sort_values("month").reset_index(drop = True)

fig2 = px.bar(ucr_month, x = "month", y = "noc", color = "ucr_part",
              title = "UCRs per Month (Figure 1)",
              labels = {"month" : "Month",
                        "ucr_part" : "UCR Part",
                        "noc" : "Number of Crime"},
              barmode = "group",  # change the bar mode
              text = "noc",
              color_discrete_sequence = ["red", "green", "blue", "magenta"])
fig2.update_traces(textposition = "outside")
fig2.update_layout(
    font_color="#5642C5",
    title_font_color="#5642C5",
    legend_title_font_color="#5642C5",
    font_size = 14)

In [None]:
fig3 = px.line(ucr_month, x = "month", y = "noc", color = "ucr_part",
               title = "UCRs per Month (Figure 2)",
               labels = {"noc" : "Number of Crime",
                         "month" : "Month",
                         "ucr_part" : "UCR Part"})

fig3.update_layout(
    font_color="#5642C5",
    title_font_color="#5642C5",
    legend_title_font_color="#5642C5",
    font_size = 14
)

In [None]:
ucr_day = pd.DataFrame(data = (crime.groupby(["Day", "UCR_PART"]).count()["Group"]).reset_index().values,
                      columns = ["day", "ucr_part", "noc"]).sort_values("day").reset_index(drop = True)

fig3 = px.bar(ucr_day, x = "day", y = "noc", color = "ucr_part",
              title = "UCRs per Day (Figure 1)",
              labels = {"day" : "Day",
                        "noc" : "Number of Crime",
                        "ucr_part" : "UCR Part"},
              text = "noc",
              color_discrete_sequence = ["red", "green", "blue", "goldenrod"])

fig3.update_traces(textposition = "outside")
fig3.update_layout(
    font_color="#5642C5",
    title_font_color="#5642C5",
    legend_title_font_color="#5642C5",
    font_size = 14)

In [None]:
px.line(ucr_day, x = "day", y = "noc", color = "ucr_part", title = "UCRs per Day with Line Chart (Figure 2)")

In [None]:
ucr_hour = pd.DataFrame(data = (crime.groupby(["Hour", "UCR_PART"]).count()["Group"]).reset_index().values,
                       columns = ["hour", "ucr part", "noc"]).sort_values("hour").reset_index(drop = True)

fig4 = px.bar(ucr_hour, x = "hour", y = "noc", color = "ucr part", title = "UCRs per Hour (Figure 1)",
             labels = {"noc" : "Number of Crime",
                       "hour" : "Hour",
                       "ucr part" : "UCR PART"}, text = "noc", color_discrete_sequence = ["red", "LightSeaGreen", "DarkCyan", "DarkSeaGreen"])

fig4.update_traces(textposition = "outside")
fig4.update_layout(
    font_color = "#5642C5",
    title_font_color = "#5642C5",
    legend_title_font_color = "#5642C5",
    font_size = 14)

In [None]:
fig5 = px.line(ucr_hour, x = "hour", y = "noc", color = "ucr part", title = "UCRs per Hour (Figure 2)",
              labels = {"noc" : "Number of Crime",
                        "hour" : "Hour",
                        "ucr part" : "UCR PART"})

fig5.update_layout(font_color = "#5642C5",
                   font_size = 14,
                   title_font_color = "#5642C5",
                   legend_title_font_color = "#5642C5")

In [None]:
fig6 = make_subplots(rows = 2, cols = 1, specs = [[{"type" : "bar"}], [{"type" : "bar"}]], 
                     subplot_titles = ("Number of Crime per Street", "Number of Crime per District"))

fig6.add_trace(go.Bar(x = noc_street["street"], y =  noc_street["noc"]), row = 1, col = 1)
fig6.add_trace(go.Bar(x = noc_dist["dist"], y =  noc_dist["noc"]), row = 2, col = 1)

In [None]:
crime.head()

In [None]:
sns.countplot(x = "Quarter", data = crime);

**The number of crimes is highest in the third quarter of the year, the summer months.**

In [None]:
sns.countplot(x = "Weekofyear", hue = "UCR_PART", data = crime);