### Importing Necessary Packages

Before starting any machine learning task, import the libraries the notebook relies on: pandas for data manipulation, matplotlib for plotting, and scikit-learn for encoding, dataset splitting, model training, and evaluation.



In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.neighbors import KNeighborsRegressor
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn import metrics
from sklearn.ensemble import ExtraTreesRegressor

In [3]:
crime_df = pd.read_excel('Regbcs, elkövetők, elkövetések 2009-2023.xlsx', sheet_name="Regbcs", engine='openpyxl')

# Preprocess the Excel File and Determine the Target Value

This section preprocesses the crime data: it merges in city populations, assigns each city a population category, and calculates a crime rate whose scale depends on that category. The final results are saved to a new Excel file.

## Steps:

1. **Import Libraries**:
   - Import necessary libraries for handling Excel files and data manipulation.

2. **Load Data**:
   - Load the Excel file containing city population data.

3. **Categorize Cities**:
   - Categorize cities based on their population into different categories.
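   - Drop rows with a null population (foreign countries), rows whose location is the placeholder `(üres)` ("empty"), and rows with a population below 1000.

4. **Calculate Crime Rate**:
   - Calculate the crime rate for each row as the number of registered crimes per fixed number of inhabitants. The base depends on the city category: per 100,000 for `Nagyváros`, per 1,000 for `Középváros` and `Kisváros`, and per 100 for `Nagyfalvak` and `Középfalvak`. Rates are therefore not directly comparable across categories.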

5. **Save Results**:
   - Save the categorized data along with the calculated crime rates to a new Excel file.
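
To make the per-category scaling in step 4 concrete, here is a minimal sketch. The helper name `crime_rate` and the dictionary `PER_CAPITA_BASE` are hypothetical and used only for illustration; the denominators and the sample populations are taken from the notebook's own code and output.

```python
# Hypothetical lookup table mirroring the per-category denominators
# used later in the notebook's calculate_crime_rate function.
PER_CAPITA_BASE = {
    "Nagyváros": 100_000,   # large city
    "Középváros": 1_000,    # medium-sized city
    "Kisváros": 1_000,      # small town
    "Nagyfalvak": 100,      # large villages
    "Középfalvak": 100,     # medium-sized villages
}


def crime_rate(crimes: int, population: int, category: str) -> float:
    """Return crimes per `base` inhabitants, rounded to 4 decimals."""
    base = PER_CAPITA_BASE[category]
    return round(crimes / (population / base), 4)


# One registered crime in three differently sized locations.
rate_city = crime_rate(1, 122973, "Nagyváros")
rate_town = crime_rate(1, 32625, "Középváros")
rate_village = crime_rate(1, 3849, "Nagyfalvak")
print(rate_city, rate_town, rate_village)
```

Because each category uses a different base, a value such as 0.8132 for a large city and 0.0307 for a medium-sized city are on different scales and should not be compared directly.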

In [4]:
import pandas as pd

# Load the population data
population_data = pd.read_excel('Population.xlsx', engine='openpyxl')

# Normalize location names in both datasets (remove extra spaces, standardize formats if needed)
crime_df['Elkövetés helye'] = crime_df['Elkövetés helye'].str.strip()
population_data['Location'] = population_data['Location'].str.strip()

# Drop duplicates from population_data to ensure no duplicate location entries
population_data = population_data.drop_duplicates(subset='Location')

# Merge the datasets on the location names
merged_data = pd.merge(crime_df, population_data, left_on='Elkövetés helye', right_on='Location', how='left')

# Dropping rows where population is null or City is (üres) or population is below 1000
merged_data = merged_data[merged_data['Population'].notna() & (merged_data['Elkövetés helye'] != '(üres)') & (merged_data['Population'] >= 1000)]

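# Assign population categories (Hungarian labels):
#   Nagyváros = large city, Középváros = medium-sized city, Kisváros = small town,
#   Nagyfalvak = large villages, Középfalvak = medium-sized villages.
# Note: the final else branch also catches any population above 500000.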
def categorize_population(population):
    if 100000 <= population <= 500000:
        return 'Nagyváros'
    elif 10000 <= population < 100000:
        return 'Középváros'
    elif 5000 <= population < 10000:
        return 'Kisváros'
    elif 2000 <= population < 5000:
        return 'Nagyfalvak'
    else:
        return 'Középfalvak'

merged_data['City Category'] = merged_data['Population'].apply(categorize_population)

# Drop the redundant 'Location' column from the population data
merged_data = merged_data.drop(columns=['Location'])

# Calculate crimes per fixed number of inhabitants; the base (100000, 1000, or 100) depends on the city category
def calculate_crime_rate(row):
    population = row['Population']
    crimes = row['Regisztrált bűncselekmények száma']

    if row['City Category'] == 'Nagyváros':
        return round((crimes / (population / 100000)), 4)
    elif row['City Category'] == 'Középváros':
        return round((crimes / (population / 1000)), 4)
    elif row['City Category'] == 'Kisváros':
        return round((crimes / (population / 1000)), 4)
    elif row['City Category'] == 'Nagyfalvak':
        return round((crimes / (population / 100)), 4)
    elif row['City Category'] == 'Középfalvak':
        return round((crimes / (population / 100)), 4)
    else:
        return None

# Apply the function to create the new 'Crime Rate' column
merged_data['Crime Rate'] = merged_data.apply(calculate_crime_rate, axis=1)

# Save the merged DataFrame to an Excel file
merged_data.to_excel('preprocessed_crime_data.xlsx', index=False, engine='openpyxl')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  merged_data['City Category'] = merged_data['Population'].apply(categorize_population)
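
The message above is the text of pandas' `SettingWithCopyWarning`. It appears because `merged_data` is the result of a boolean filter and a new column is then assigned to it. One common way to avoid it, shown here as a sketch on a made-up frame rather than the real data, is to take an explicit `.copy()` after filtering. The column names `Crimes` and `Is Large` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for merged_data; the values are made up for illustration.
df = pd.DataFrame({
    "Population": [500, 3849, 32625, None],
    "Crimes": [1, 1, 1, 2],
})

# An explicit copy makes the filtered result an independent DataFrame,
# so the column assignment below is unambiguous.
filtered = df[df["Population"].notna() & (df["Population"] >= 1000)].copy()
filtered["Is Large"] = filtered["Population"] >= 10000

print(filtered)
```

The original frame `df` is left untouched; only `filtered` gains the new column.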


In [5]:
new_dataset = pd.read_excel('preprocessed_crime_data.xlsx')

### Dropping Null Values

Handling missing data is a crucial step in data preprocessing. Null or missing values can adversely affect the performance of machine learning models and data analysis. One common approach to dealing with missing values is to drop them from the dataset.
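
Before dropping rows, it can help to check how many values are missing in each column, since `dropna()` removes a row if any column is null. The following is a small sketch on made-up values, not the notebook's data.

```python
import pandas as pd

# Made-up frame with one missing value in each column.
toy = pd.DataFrame({
    "Population": [122973, None, 32625],
    "Crime Rate": [0.8132, 0.5, None],
})

# Count missing values per column before deciding to drop rows.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only rows with no missing value in any column.
cleaned = toy.dropna()
print(f"Dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```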


In [6]:
new_dataset = new_dataset.dropna()

### Excluding Extreme Crime Rate Values

To limit the influence of outliers, rows whose crime rate falls below the 0.1st percentile or above the 99.9th percentile of the `Crime Rate` column are excluded.

In [7]:
min_treshold, max_treshold = new_dataset['Crime Rate'].quantile([0.001,0.9990])
min_treshold, max_treshold

(0.0105, 308.0323465000802)

In [8]:
new_dataset[new_dataset['Crime Rate']<min_treshold ].count()

Unnamed: 0,0
Bűncselemény,294
Elkövetés helye,294
Regisztrálás éve,294
Regisztrált bűncselekmények száma,294
Population,294
City Category,294
Crime Rate,294


In [9]:
new_dataset[new_dataset['Crime Rate']>max_treshold ].count()

Unnamed: 0,0
Bűncselemény,351
Elkövetés helye,351
Regisztrálás éve,351
Regisztrált bűncselekmények száma,351
Population,351
City Category,351
Crime Rate,351


In [10]:
new_dataset = new_dataset[(new_dataset['Crime Rate'] >= min_treshold) & (new_dataset['Crime Rate'] <= max_treshold)]
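
The quantile-based trimming above can be tried on a toy series. This sketch uses made-up values and deliberately wider quantiles (0.05 and 0.95) than the notebook's 0.001 and 0.999, so that the effect is visible on only ten values.

```python
import pandas as pd

# Made-up crime rates with one very small and one very large value.
rates = pd.Series([0.001, 0.5, 0.8, 1.2, 1.5, 2.0, 2.4, 3.1, 3.3, 900.0])

# quantile() with a list returns a Series; unpacking yields the two thresholds.
low, high = rates.quantile([0.05, 0.95])

# Keep only values inside the closed interval [low, high].
trimmed = rates[(rates >= low) & (rates <= high)]
print(f"kept {len(trimmed)} of {len(rates)} values")
```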

In [11]:
new_dataset.head(50)

Unnamed: 0,Bűncselemény,Elkövetés helye,Regisztrálás éve,Regisztrált bűncselekmények száma,Population,City Category,Crime Rate
0,_A költségvetési csaláshoz kapcsolódó felügyel...,Budapest XIII. Kerület,2019,1,122973,Nagyváros,0.8132
1,_A költségvetési csaláshoz kapcsolódó felügyel...,Budapest XIII. Kerület,2022,4,122973,Nagyváros,3.2527
2,_A költségvetési csaláshoz kapcsolódó felügyel...,Gödöllő,2019,1,32625,Középváros,0.0307
3,_Az Európai Közösségek pénzügyi érdekeinek meg...,Miskolc,2019,1,147533,Nagyváros,0.6778
4,_Az Európai Közösségek pénzügyi érdekeinek meg...,Szeged,2020,1,158797,Nagyváros,0.6297
5,_Csempészet,Biharkeresztes,2021,1,3849,Nagyfalvak,0.026
6,_Csempészet,Budaörs,2018,1,29024,Középváros,0.0345
7,_Csempészet,Budapest X. Kerület,2019,1,75628,Középváros,0.0132
8,_Erőszakos közösülés,Abony,2020,1,14841,Középváros,0.0674
9,_Erőszakos közösülés,Apostag,2019,1,2063,Nagyfalvak,0.0485


### Dropping an Unnecessary Column



In [12]:
new_dataset = new_dataset.drop(columns=['Regisztrált bűncselekmények száma'])

In [13]:
new_dataset.head()

Unnamed: 0,Bűncselemény,Elkövetés helye,Regisztrálás éve,Population,City Category,Crime Rate
0,_A költségvetési csaláshoz kapcsolódó felügyel...,Budapest XIII. Kerület,2019,122973,Nagyváros,0.8132
1,_A költségvetési csaláshoz kapcsolódó felügyel...,Budapest XIII. Kerület,2022,122973,Nagyváros,3.2527
2,_A költségvetési csaláshoz kapcsolódó felügyel...,Gödöllő,2019,32625,Középváros,0.0307
3,_Az Európai Közösségek pénzügyi érdekeinek meg...,Miskolc,2019,147533,Nagyváros,0.6778
4,_Az Európai Közösségek pénzügyi érdekeinek meg...,Szeged,2020,158797,Nagyváros,0.6297


### Encode Values for Numeric-Only Model

The models accept only numeric input, so the categorical columns (`Elkövetés helye`, `Bűncselemény`, and `City Category`) are label-encoded. Each label-to-code mapping is saved to a text file for later use within the web application.
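
As a minimal sketch of how label encoding and its mapping work, the example below encodes a short, made-up list of city names. The variable name `city_encoder` is an assumption for illustration; using one encoder per column is a design choice that keeps `inverse_transform` usable later, since re-fitting a shared encoder overwrites its `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of location names.
cities = ["Szeged", "Miskolc", "Gödöllő", "Szeged"]

city_encoder = LabelEncoder()
codes = city_encoder.fit_transform(cities)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(city_encoder.classes_)}
print(mapping)

# The fitted encoder can translate codes back into the original labels.
decoded = city_encoder.inverse_transform(codes)
print(list(decoded))
```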

In [14]:
le = LabelEncoder()

# Transform the 'Elkövetés helye' (place of offence) column
new_dataset['Elkövetés helye'] = le.fit_transform(new_dataset['Elkövetés helye'])

# Create a mapping of class labels to numerical values
mapping_city = dict(zip(le.classes_, range(len(le.classes_))))


In [15]:
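# Write the mapping as UTF-8 so the Hungarian characters are preserved;
# the with-block closes the file even if a write fails
with open('Mappings/City_Mapping.txt', 'wt', encoding='utf-8') as file:
    for key, value in mapping_city.items():
        file.write(f"{key}: {value}\n")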

In [16]:
# Transform the 'Bűncselemény' (crime type) column
new_dataset['Bűncselemény'] = le.fit_transform(new_dataset['Bűncselemény'])

# Create a mapping of class labels to numerical values
mapping_type = dict(zip(le.classes_, range(len(le.classes_))))

In [17]:
# Write the mapping as UTF-8 so the Hungarian characters are preserved
with open('Mappings/Crime_Type_Mapping.txt', 'wt', encoding='utf-8') as file:
    for key, value in mapping_type.items():
        file.write(f"{key}: {value}\n")

In [18]:
# Transform the 'City Category' column
new_dataset['City Category'] = le.fit_transform(new_dataset['City Category'])

# Create a mapping of class labels to numerical values
# (a separate name avoids overwriting the crime type mapping)
mapping_category = dict(zip(le.classes_, range(len(le.classes_))))

In [19]:
# Write the mapping as UTF-8 so the Hungarian characters are preserved
with open('Mappings/City_Category_Mapping.txt', 'wt', encoding='utf-8') as file:
    for key, value in mapping_category.items():
        file.write(f"{key}: {value}\n")

In [20]:
new_dataset.head()

Unnamed: 0,Bűncselemény,Elkövetés helye,Regisztrálás éve,Population,City Category,Crime Rate
0,229,140,2019,122973,4,0.8132
1,229,140,2022,122973,4,3.2527
2,229,398,2019,32625,2,0.0307
3,230,702,2019,147533,4,0.6778
4,230,1008,2020,158797,4,0.6297


### Model Training Phase

In this phase, the dataset is separated into features and the target variable as follows:

- **Features (X):** All columns in the dataset except `Crime Rate`.
- **Target (y):** The `Crime Rate` column.

The features (X) consist of various attributes or predictors used by the model to estimate or predict the target variable. The target variable 'crime rate' is the outcome that the model aims to predict based on these input features.
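
As an alternative sketch to slicing columns by position, the features can be selected by dropping the target column by name, which keeps working if the column order changes. The frame below reuses two rows from the encoded output shown above.

```python
import pandas as pd

# Two encoded rows copied from the notebook's output for illustration.
toy = pd.DataFrame({
    "Bűncselemény": [229, 230],
    "Elkövetés helye": [140, 702],
    "Regisztrálás éve": [2019, 2019],
    "Population": [122973, 147533],
    "City Category": [4, 4],
    "Crime Rate": [0.8132, 0.6778],
})

# Everything except the target becomes the feature matrix.
X = toy.drop(columns=["Crime Rate"]).values
y = toy["Crime Rate"].values
print(X.shape, y.shape)
```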

In [21]:
x = new_dataset[new_dataset.columns[0:5]].values
x

array([[   229,    140,   2019, 122973,      4],
       [   229,    140,   2022, 122973,      4],
       [   229,    398,   2019,  32625,      2],
       ...,
       [   226,   1282,   2011,   6510,      0],
       [   226,   1284,   2010,   1558,      1],
       [   226,   1285,   2016,   3373,      3]])

In [22]:
y = new_dataset['Crime Rate'].values
y

array([0.8132, 3.2527, 0.0307, ..., 0.4608, 0.0642, 0.0296])

In [23]:
new_dataset.to_excel('merged_crime_population_final.xlsx', index=False, engine='openpyxl')

### Dataset Splitting

The dataset has been divided into training and testing sets using an 80/20 split:

- **Training Set:** 80% of the dataset is used for training the model.
- **Testing Set:** 20% of the dataset is reserved for testing the model.

This split ensures that the model is trained on a substantial portion of the data while also being evaluated on a separate subset to assess its performance and generalization capabilities.
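
The effect of `test_size=0.2` can be checked on a small synthetic array. This is only a sketch; the data is made up, and the `random_state` matches the value used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic samples with five features each.
X = np.arange(50).reshape(10, 5)
y = np.arange(10, dtype=float)

# A fixed random_state makes the shuffle, and so the split, reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=48
)
print(X_train.shape, X_test.shape)
```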


In [24]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=48)

### KNeighborsRegressor

The `KNeighborsRegressor` is a machine learning algorithm used for regression tasks. It operates based on the concept of nearest neighbors, where the prediction for a given data point is made based on the values of its 'k' closest neighbors in the feature space.

- **Algorithm:** KNeighborsRegressor is a non-parametric algorithm that does not assume any underlying distribution of the data. Instead, it makes predictions by analyzing the distances between data points.  

- **Advantages:** It’s simple to implement and understand, and can work well when the relationship between features and target is complex and non-linear.

- **Disadvantages:** Prediction can be computationally expensive on large datasets, because each query has to search the training data for its neighbors. Performance is also sensitive to the choice of 'k', to the distance metric, and to the scale of the features.
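
Because KNN relies on distances, features with large numeric ranges can dominate the result. In this notebook the feature matrix mixes label-encoded identifiers, years, and raw population counts, so scaling may matter. The sketch below, on synthetic data, shows one common approach: a pipeline that standardizes features before the regressor. Whether this improves results on the actual dataset is not established here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (a year and a population).
X = np.column_stack([
    rng.integers(2009, 2024, 200),
    rng.integers(1000, 500000, 200),
])
# Synthetic target that depends only on the year.
y = (X[:, 0] - 2009) * 0.1 + rng.normal(0, 0.05, 200)

# StandardScaler rescales each feature to zero mean and unit variance,
# so no single column dominates the Euclidean distance.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X, y)

predictions = knn.predict(X[:3])
print(predictions)
```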




### Cross Validation

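Cross-validation can be used to identify suitable KNN parameters, such as `n_neighbors`, before training. The cell below imports `GridSearchCV`, but the search itself is not shown; the model is fitted with a fixed `n_neighbors=5`.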
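
A cross-validated parameter search for KNN could look like the following sketch. The parameter grid, the number of folds, and the scoring metric are illustrative assumptions, not the settings used for the notebook's results, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

# Candidate values are illustrative; the notebook does not state a grid.
param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}

# 5-fold cross-validation; scikit-learn maximizes scores, hence the
# negated mean absolute error.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a model refitted on all of the supplied data with the best parameter combination found.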

In [34]:
from sklearn.model_selection import GridSearchCV
model = KNeighborsRegressor(n_neighbors=5)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

In [35]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('R2 score:', metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 1.192394718977424
Mean Squared Error: 60.624553687063134
R2 score: 0.31493888604786846


### RandomForestRegressor

The `RandomForestRegressor` is an ensemble learning algorithm used for regression tasks. It builds multiple decision trees and merges their predictions to improve accuracy and control overfitting.

- **Algorithm:** RandomForestRegressor is based on the Random Forest algorithm, which constructs a multitude of decision trees during training and outputs the average prediction (for regression) of the individual trees.

- **Advantages:** It handles large datasets and complex relationships between features and target well. Averaging many trees makes it less prone to overfitting than a single decision tree, and it can provide estimates of feature importance.

- **Disadvantages:** It can be computationally intensive, particularly with a large number of trees. The model can also become less interpretable due to the complexity of the ensemble.
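
The feature importance estimates mentioned above are exposed through the fitted model's `feature_importances_` attribute. The sketch below uses synthetic data in which only the first feature carries signal; the feature names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Only the first feature influences the target.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
feature_names = ["informative", "noise_a", "noise_b"]
importances = forest.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.3f}")
```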


In [36]:
model2 = RandomForestRegressor(n_jobs=-1, random_state=574)
model2.fit(x_train, y_train)
y_pred = model2.predict(x_test)

In [37]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('R2 score:', metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 0.502419632997934
Mean Squared Error: 13.772615175535513
R2 score: 0.8443686176579677


### ExtraTreesRegressor

The `ExtraTreesRegressor` (Extremely Randomized Trees) is an ensemble learning method that builds multiple decision trees and averages their predictions to improve accuracy.

- **Algorithm:** Instead of searching for the optimal split threshold at each node, Extra Trees draws a random threshold for each candidate feature and keeps the best of these random splits. With scikit-learn's default settings, trees are grown to their maximum depth without pruning.

- **Advantages:** The randomness in splits reduces overfitting compared to individual decision trees. Training is often faster because no optimal threshold has to be searched for, and the trees can be built in parallel.

- **Disadvantages:** The ensemble of many trees makes the model less interpretable than a single decision tree, and it may require significant memory, especially with a large number of trees.
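
One further difference from a random forest in scikit-learn is that `ExtraTreesRegressor` does not bootstrap the training samples by default, while `RandomForestRegressor` does. The sketch below inspects that default and compares the two models with cross-validation on synthetic data; the scores say nothing about the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "ExtraTrees": ExtraTreesRegressor(n_estimators=50, random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    # The default scoring for regressors is the R2 score.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    bootstrap = model.get_params()["bootstrap"]
    print(f"{name}: bootstrap={bootstrap}, mean R2={scores[name]:.3f}")
```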


In [38]:
model3 = ExtraTreesRegressor(n_jobs=-1, random_state=42)
model3.fit(x_train, y_train)
y_pred = model3.predict(x_test)

In [39]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('R2 score:', metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 0.7241303014683805
Mean Squared Error: 27.160193010439322
R2 score: 0.6930881804931632


### Saving the Model to a `.pkl` File

After training, the model can be saved for later use, such as making predictions or deploying it in a production environment.

- In Python, a common way to save a trained model is the `pickle` module. The cells below save `model2`, the `RandomForestRegressor`, which achieved the highest R2 score of the three models (0.844).



In [None]:
import pickle

In [None]:
model_filename = "Model/model.pkl"
with open(model_filename, 'wb') as file:
    pickle.dump(model2, file)
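
Loading the model back is the mirror image of saving it. The sketch below performs a full save and load round trip with a small model trained on synthetic data; a temporary directory stands in for the notebook's `Model/` folder. Note that pickle files should only be loaded from trusted sources, and that a scikit-learn model is best unpickled with the same library version that created it.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# A temporary directory stands in for the notebook's Model/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("round trip preserved predictions:", same)
```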

In [None]:
!zip -r /content/Model/model.zip /content/Model/model.pkl

  adding: content/Model/model.pkl (deflated 77%)
