<a href="https://www.kaggle.com/code/danielzaslavsky/data-cleaning-preprocessing-and-basic-testing?scriptVersionId=162223234" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# BGU Data Science Course

In [1]:
# imports:
import pandas as pd

### load the data:

In [2]:
theft_data = pd.read_csv("/kaggle/input/crimeprediction/theft_data_bgu.csv")
holidays_data_bgu = pd.read_csv("/kaggle/input/crimeprediction/holidays_data_bgu.csv")

In [3]:
# check the columns:
theft_data.columns

Index(['Date', 'District', 'ID', 'Case Number', 'Block', 'IUCR',
       'Primary Type', 'Description', 'Location Description', 'Arrest',
       'Domestic', 'Beat', 'Ward', 'Community Area', 'FBI Code',
       'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude',
       'Longitude', 'Location', 'Count'],
      dtype='object')

In [4]:
# consider only important columns:
# keep only the following columns in your dataframe: Date, District, Count (our label), Year

#### insert your code here:
theft_data = theft_data[['Date', 'District', 'Count', 'Year']].copy()  # copy so later column assignments do not act on a view
####

theft_data.columns

Index(['Date', 'District', 'Count', 'Year'], dtype='object')

### Feature extraction:

In [5]:
# change the Date from string to timestamp, in theft_data & holidays_data_bgu:
# hint - use dt.datetime.strptime
import datetime as dt

#### insert your code here:

# pd.to_datetime parses the whole column in one call, so there is no need to apply dt.datetime.strptime row by row
theft_data['Date'] = pd.to_datetime(theft_data['Date'], format = '%Y-%m-%d')
holidays_data_bgu['Date'] = pd.to_datetime(holidays_data_bgu['Date'], format = '%Y-%m-%d')
####
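
As a quick check of the comment in the cell above, here is a minimal sketch comparing `pd.to_datetime` with a row-wise `dt.datetime.strptime`. The `sample` series is hypothetical and only assumes the same `%Y-%m-%d` layout as the `Date` column.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample strings in the same '%Y-%m-%d' layout as the Date column.
sample = pd.Series(["2014-01-01", "2015-06-30", "2016-12-31"])

# Vectorized parsing: the whole column is converted in one call.
vectorized = pd.to_datetime(sample, format="%Y-%m-%d")

# Row-wise parsing, as the hint suggests: one strptime call per element.
row_wise = sample.apply(lambda s: dt.datetime.strptime(s, "%Y-%m-%d"))

# Both approaches yield the same timestamps.
same_result = bool((vectorized == row_wise).all())
print(same_result)
```

Both routes give the same values; the vectorized call simply avoids a Python-level loop over the rows.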


In [6]:
# add time information:

# create day of week column:
# name the column Week_day, it will contain a number for each day of week: 
# Monday=0, ...., Sunday=6
#### insert your code here:
theft_data['Week_day'] = theft_data['Date'].dt.weekday
####

# create season column:
# name the column Season, it will contain a string which tells us the season based on the Date: 
# hint - use the following package to indicate specific dates:
from datetime import date
# and write a function that takes a date and returns the corresponding season
#### insert your code here:
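def get_season(date):
    # Caution: this Winter check only matches December 21-31. For January 1 to March 20,
    # date.year is already the new year, so the first comparison fails and the date
    # falls through to the final else branch ('Spring').
    if date >= dt.datetime(date.year, 12, 21) and date <= dt.datetime(date.year+1, 3, 20):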
        return 'Winter'
    elif date >= dt.datetime(date.year, 9, 23) and date <= dt.datetime(date.year, 12, 20):
        return 'Autumn'
    elif date >= dt.datetime(date.year, 6, 21) and date <= dt.datetime(date.year, 9, 22):
        return 'Summer'
    else:
        return 'Spring'

theft_data['Season'] = theft_data['Date'].apply(get_season)
####
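
With `get_season` as written, dates from January 1 to March 20 are labelled 'Spring', because the Winter boundaries are built from the date's own year. One possible alternative, sketched here under the same cutoff dates, compares `(month, day)` pairs so no year arithmetic is needed. The name `season_by_month_day` is hypothetical, and this variant is not what produced the outputs recorded in this notebook.

```python
import datetime as dt

def season_by_month_day(d):
    """Return the season from (month, day) only, so January-March dates
    are handled without building boundaries in another year."""
    md = (d.month, d.day)
    # Winter wraps around the year end: Dec 21 - Dec 31 or Jan 1 - Mar 20.
    if md >= (12, 21) or md <= (3, 20):
        return 'Winter'
    if md >= (9, 23):
        return 'Autumn'
    if md >= (6, 21):
        return 'Summer'
    return 'Spring'

# Works for datetime objects and for pandas Timestamps alike,
# since both expose .month and .day.
print(season_by_month_day(dt.datetime(2015, 1, 15)))
```

It could be applied the same way as the original, for example `theft_data['Date'].apply(season_by_month_day)`.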

In [7]:
# add information from other sources
# holidays data
# create column is_holiday that will take 1 if it's a holiday and 0 otherwise.

#### insert your code here:
theft_data = pd.merge(theft_data, holidays_data_bgu, how = 'left', on = 'Date')
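# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
theft_data['is_holiday'] = theft_data['is_holiday'].fillna(0)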
####
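
To make the merge step concrete, here is a small sketch on toy frames. The frames `thefts` and `holidays` are hypothetical; the sketch assumes, as the column listing in the next cell suggests, that the holidays file carries a `Date` key and an `is_holiday` column.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two CSV files.
thefts = pd.DataFrame({
    "Date": pd.to_datetime(["2016-07-03", "2016-07-04", "2016-07-05"]),
    "Count": [10, 25, 12],
})
holidays = pd.DataFrame({
    "Date": pd.to_datetime(["2016-07-04"]),
    "is_holiday": [1],
})

# Left merge keeps every theft row; non-holiday dates get NaN in is_holiday.
merged = pd.merge(thefts, holidays, how="left", on="Date")

# Replace NaN with 0 and restore an integer dtype (NaN forces floats).
merged["is_holiday"] = merged["is_holiday"].fillna(0).astype(int)
print(merged)
```

The left join matters here: an inner join would silently drop every non-holiday row.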

In [8]:
# now you are supposed to have the following columns: Date, is_holiday, District, Count (our label), Year, Week_day, Season
theft_data.columns

Index(['Date', 'District', 'Count', 'Year', 'Week_day', 'Season',
       'is_holiday'],
      dtype='object')

In [9]:
# create dummy variables:
# hint - use the pandas function 'get_dummies'
#### insert your code here:
theft_data = pd.get_dummies(theft_data, columns = ['Week_day','Season','District'])
####
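
A short sketch of what `get_dummies` does to the column names, using a hypothetical toy frame. It also illustrates why the district dummies end in `.0`: the names are built from the string form of each value, so a float-typed `District` column yields names like `District_1.0`.

```python
import pandas as pd

# Hypothetical toy frame with a float-typed District column.
toy = pd.DataFrame({
    "District": [1.0, 2.0, 1.0],
    "Season": ["Winter", "Summer", "Winter"],
    "Count": [5, 7, 6],
})

# Untouched columns come first, then one dummy column per category.
encoded = pd.get_dummies(toy, columns=["Season", "District"])
print(list(encoded.columns))

# Casting District to int before encoding gives names without the '.0' suffix.
encoded_int = pd.get_dummies(toy.astype({"District": int}), columns=["District"])
int_dummies = [c for c in encoded_int.columns if c.startswith("District_")]
print(int_dummies)
```

Whether to cast is a matter of taste; the model is unaffected by the column names.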

In [10]:
# now you are supposed to have the following columns: Date, is_holiday, Count, Year,
# dummies for season, week_day and district
theft_data.columns

Index(['Date', 'Count', 'Year', 'is_holiday', 'Week_day_0', 'Week_day_1',
       'Week_day_2', 'Week_day_3', 'Week_day_4', 'Week_day_5', 'Week_day_6',
       'Season_Autumn', 'Season_Spring', 'Season_Summer', 'Season_Winter',
       'District_1.0', 'District_2.0', 'District_3.0', 'District_4.0',
       'District_5.0', 'District_6.0', 'District_7.0', 'District_8.0',
       'District_9.0', 'District_10.0', 'District_11.0', 'District_12.0',
       'District_14.0', 'District_15.0', 'District_16.0', 'District_17.0',
       'District_18.0', 'District_19.0', 'District_20.0', 'District_22.0',
       'District_24.0', 'District_25.0'],
      dtype='object')

In [11]:
# remove the original Date column (we won't use it for modelling):
#### insert your code here:
theft_data.drop('Date', inplace = True, axis = 1) #reminder to self, axis = 1 drops a column, axis = 0 drops a row
####

### Train-Test Splitting:

In [12]:
# choose years for train and test
train_start_year = 2014
train_end_year = 2015
test_year = 2016

In [13]:
# split the data into train/test
dataTrain = theft_data[(theft_data["Year"] >= train_start_year) & (theft_data["Year"] <= train_end_year)]
labelsTrain = dataTrain.Count
dataTrain = dataTrain.drop('Count', axis=1)

dataTest = theft_data[(theft_data["Year"] == test_year)]
labelsTest = dataTest.Count
dataTest = dataTest.drop('Count', axis=1)

# Remove unnecessary columns:
dataTrain = dataTrain.drop('Year', axis=1)
dataTest = dataTest.drop('Year', axis=1)

print("Train data shape: " , dataTrain.shape)
print("Test data shape: " , dataTest.shape)

Train data shape:  (16060, 34)
Test data shape:  (8052, 34)


In [14]:
# check for null values (should print 0)
print(theft_data.isnull().sum().sum())

0


### Modelling

In [15]:
from sklearn.linear_model import LinearRegression

mlModel = LinearRegression()

mlModel.fit(dataTrain, labelsTrain)

### Evaluation

In [16]:
predTest = mlModel.predict(dataTest)

# print the R-squared:
print("Test set R^2: ", mlModel.score(dataTest, labelsTest))

Test set R^2:  0.5638769210045331
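
For reference, `LinearRegression.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. Below is a minimal sketch of that formula in plain Python. The numbers are made up for illustration and are not the notebook's data, and the helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared errors of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations from the label mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up labels and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
r2 = r_squared(y_true, y_pred)

# A model that always predicts the mean of the labels scores 0.
r2_mean_baseline = r_squared(y_true, [6.0] * 4)
print(r2, r2_mean_baseline)
```

Read this way, a test R^2 of about 0.56 means the model's squared error is a bit under half of what always predicting the mean test count would give.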


### Questions:

In [17]:
# train the model on years 2013-2015 and test it on 2016.
# how do the model results change?
# try other combinations of train/test 

We change `train_start_year` (set in cell 12) to 2013.
Before this change, the test R^2 was 0.564; after it, the R^2 is 0.557.
Moving the start year even earlier (to 2010) doesn't change the R^2, which suggests that the years most relevant for predicting 2016 are the ones closest to it.
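
Trying several train/test combinations is easier with the split wrapped in a helper. The sketch below is one way to do that. The function `split_by_year` and the toy frame are hypothetical, and the toy values carry no meaning beyond showing the mechanics.

```python
import pandas as pd

def split_by_year(df, start_year, end_year, test_year, label="Count"):
    """Split df into train/test by the 'Year' column and separate the label."""
    train = df[(df["Year"] >= start_year) & (df["Year"] <= end_year)]
    test = df[df["Year"] == test_year]
    drop_cols = [label, "Year"]
    return (train.drop(columns=drop_cols), train[label],
            test.drop(columns=drop_cols), test[label])

# Hypothetical toy frame: one row per year with a single made-up feature.
toy = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016],
    "is_holiday": [0, 1, 0, 0, 1],
    "Count": [20, 35, 22, 21, 33],
})

# Try several training windows, all ending in 2015 and tested on 2016.
train_sizes = {}
for start in (2012, 2013, 2014):
    X_train, y_train, X_test, y_test = split_by_year(toy, start, 2015, 2016)
    train_sizes[start] = len(X_train)
    print(start, X_train.shape, X_test.shape)
```

With the real data, one would fit a fresh `LinearRegression` on each `X_train`, `y_train` pair inside the loop and compare `score(X_test, y_test)` across windows.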

In [18]:
# For the next analysis, return to the model that was trained on years 2014-2015 and tested on 2016.
# According to this model's output, which are the top-3 districts with the largest amount of crime?
# Hint: use the model attributes coef_ and intercept_
# In addition, examine the statistics of the raw data itself: do they align with the conclusions that were derived from the model?
# Present the code, results and short explanations for this analysis

In [19]:
coefficients = mlModel.coef_
feature_coefficients = dict(zip(dataTrain.columns, coefficients))

sorted_coefficients = sorted(feature_coefficients.items(), key=lambda x: abs(x[1]), reverse=True)

top_3_districts = [item[0] for item in sorted_coefficients if item[0].startswith('District_')][:3]
for district in top_3_districts:
    print(district)

District_1.0
District_18.0
District_19.0


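The model points to districts 1, 18 and 19 as having the highest theft counts. The raw data (*theft_data_bgu.csv* **without any changes**) agree: these three districts take up a much larger share of the table than the others.
In the code above, we paired each feature with its coefficient from `coef_`, sorted the pairs by absolute coefficient value in descending order, and kept the first three `District_` features. Because the sort uses absolute values, a strongly negative coefficient would also rank near the top, so the sign of each coefficient should be checked before reading the result as "highest crime".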
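
Two checks can back up this kind of conclusion, sketched below on invented numbers: a per-district mean of `Count` from the raw table, and a ranking of district coefficients by signed value rather than absolute value. The frame `raw` and the dictionary `coefs` are hypothetical stand-ins, not values from this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same 'District' and 'Count' columns as the raw file.
raw = pd.DataFrame({
    "District": [1.0, 1.0, 18.0, 18.0, 5.0, 5.0, 19.0, 19.0],
    "Count": [30, 34, 28, 26, 9, 11, 22, 24],
})

# Mean daily count per district, largest first.
mean_count = raw.groupby("District")["Count"].mean().sort_values(ascending=False)
top_3_raw = mean_count.head(3).index.tolist()
print(top_3_raw)

# Invented coefficients: District_5.0 has the largest magnitude but is negative.
coefs = {"District_1.0": 12.0, "District_5.0": -15.0,
         "District_18.0": 9.0, "District_19.0": 7.5}

# Sorting by absolute value puts the strongly negative district first...
by_abs = sorted(coefs, key=lambda k: abs(coefs[k]), reverse=True)[:3]
# ...while sorting by signed value keeps only the largest positive effects.
by_sign = sorted(coefs, key=coefs.get, reverse=True)[:3]
print(by_abs, by_sign)
```

If the signed ranking and the raw-data ranking name the same districts, the model's coefficients and the descriptive statistics agree.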

In [20]:
# Suggest 2 additional families of features (if additional information sources were available)
# that can be used to further improve the accuracy results of the regression model - describe them in a few sentences

1. Information about tax and payment due dates. People facing an upcoming payment may be more likely to commit theft as the due date approaches, out of financial desperation.
2. The number of recently released offenders. The more offenders are released, the more crime we might expect to see, since some of them reoffend.