## Crime Data Processing and Prediction

In this notebook, we focus on processing crime data and predicting future crime trends up to the year 2027. The goal is to clean and structure the raw crime data, then use it to make accurate predictions that can inform decision-making in urban planning and development.




In [2]:
import pandas as pd

In [3]:
# Skip the first row of this dataset: it contains only the "year ending" label
crime_df = pd.read_csv("/root/project-2-group-real-estate-industry-project-34/data/landing/CI_LGAb.csv", encoding='utf-16', delimiter='\t',skiprows=1)

In [6]:
crime_df.columns
df = pd.read_excel('/root/project-2-group-real-estate-industry-project-34/data/landing/population_data.xlsx', sheet_name='Table 2.2')
row = df.iloc[7] 
null_columns = [idx for idx, val in enumerate(row) if pd.isnull(val)]
years = [2011, 2012, 2013, 2014,2015, 2016,2017,2018,2019,2020,2021,2022] 

In [7]:
df_first_row = df.iloc[5].copy()

# Step 1: Fill in the unnamed entries of the header row (index 5).
# get_loc returns an integer position, so assign through .iloc rather than by label.
df_first_row.iloc[df.columns.get_loc('Unnamed: 1')] = 'Suburb Name'

for i, year in enumerate(years):
    base_index = 2 + (i * 4)  # Start at 2 and increment by 4 for each year

    df_first_row.iloc[df.columns.get_loc(f"Unnamed: {base_index}")] = f"Estimated resident population_{year}"
    df_first_row.iloc[df.columns.get_loc(f"Unnamed: {base_index + 1}")] = f'Births_{year}'
    df_first_row.iloc[df.columns.get_loc(f'Unnamed: {base_index + 2}')] = f'Total_fertility_rate_{year}'


df.iloc[5] = df_first_row

# Step 2: Use the row at index 5 as the header
df.columns = df.iloc[5]
df = df.drop(5)  # Drop the row that has been used as the header

df = df.iloc[6:]
df = df.reset_index(drop = True)
df = df.dropna(axis = 1, how = 'all')
df = df.dropna()



In [8]:
df = df[["Place of Usual Residence","Suburb Name","Estimated resident population_2011","Estimated resident population_2012","Estimated resident population_2013","Estimated resident population_2014","Estimated resident population_2015","Estimated resident population_2016","Estimated resident population_2017","Estimated resident population_2018","Estimated resident population_2019","Estimated resident population_2020","Estimated resident population_2021","Estimated resident population_2022"]]
df.rename(columns=lambda x: x.split('_')[-1] if 'Estimated resident population' in x else x, inplace=True)
df.head()

5,Place of Usual Residence,Suburb Name,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,206011106,Brunswick East,8966,9208,9870,10439,11062,11716,12154,12392,12602,13064,12964,13296
1,206011107,Brunswick West,13864,13963,14057,14192,14344,14523,14556,14695,14854,15010,14497,14684
2,206011109,Pascoe Vale South,9860,9954,10038,10122,10251,10465,10698,10834,10873,10836,10463,10413
3,206011495,Brunswick - North,11981,12254,12548,12922,13225,13581,13728,13928,14022,14124,13077,13254
4,206011496,Brunswick - South,12006,12402,12836,13233,13574,13854,13984,14200,14336,14235,13208,13364


In [9]:
# Remove aggregate and non-geographic rows: 'Total', 'Unincorporated Vic', and justice/immigration facilities
crime_df = crime_df[crime_df['Local Government Area'].str.lower() != 'total']
crime_df = crime_df[crime_df['Local Government Area'] != ' Unincorporated Vic']
crime_df = crime_df[crime_df['Local Government Area'] != ' Justice Institutions and Immigration Facilities']

In [10]:
# List of year columns
year_columns = [str(year) for year in range(2015, 2025)]  # From 2015 to 2024

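# Strip thousands separators, convert to float, then divide by 100,000 to get a per-person rate
# (this assumes the source values are rates per 100,000 population)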
crime_df[year_columns] = crime_df[year_columns].replace({',': ''}, regex=True).astype(float)/100000
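
The conversion above can be tried on a small example. This is a minimal sketch on made-up values, assuming the source columns hold comma-formatted rates per 100,000 population; the frame `toy` and its numbers are illustrative and do not come from `CI_LGAb.csv`.

```python
import pandas as pd

# Toy stand-in for the crime table; the values are illustrative only
toy = pd.DataFrame({
    "Local Government Area": ["A", "B"],
    "2023": ["4,303.2", "6,491.8"],
    "2024": ["4,962.2", "6,871.3"],
})
year_cols = ["2023", "2024"]

# Strip thousands separators column by column, then parse the strings as numbers
cleaned = toy[year_cols].apply(
    lambda col: pd.to_numeric(col.str.replace(",", "", regex=False))
)

# Rate per 100,000 people -> expected incidents per person
per_person = cleaned / 100_000
```

Using `pd.to_numeric` rather than `astype(float)` makes it easy to add `errors="coerce"` later if some cells turn out to contain non-numeric text.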

## Training the Model and Making Predictions

In this part of the project, we use a **RandomForestRegressor** model to predict crime rates for the years 2025, 2026, and 2027. A random forest is an ensemble method that fits many decision trees on bootstrap samples and averages their outputs, which usually reduces overfitting compared with a single decision tree.

### Training the RandomForest Model

1. **Selecting the Training Data**:
   The model is trained on a rolling 5-year window of crime data. For example, to predict crime for 2025, the model is trained on the years 2020 through 2024. For 2026 and 2027 the window also includes the previously predicted years, so later forecasts build on earlier ones.

   - The input features (`X`) are the crime rates for the 5 years in the window.
   - The target variable (`y`) is the crime rate in the last year of the window. Because that year is also one of the input features, the fitted model largely reproduces the most recent year's rate, which is why the predicted values stay close to the 2024 figures.

   ```python
   train_years = list(range(target_year - 5, target_year))  # e.g., for 2025: [2020, 2021, 2022, 2023, 2024]
   train_columns = [str(year) for year in train_years]      # Convert years to string for DataFrame column selection

   X = crime_df[train_columns].values  # Input features: crime rates for the last 5 years
   y = crime_df[train_columns[-1]].values  # Target: crime rate of the last year in the training window (e.g., 2024)
   ```


In [11]:
from sklearn.ensemble import RandomForestRegressor

# Assuming crime_df has data from 2015 to 2024
# Columns should be named as years (2015, 2016, ..., 2024)
years = list(range(2015, 2025))  # Existing years in the dataset (2015-2024)

# Loop to predict for 2025, 2026, 2027
for target_year in [2025, 2026, 2027]:
    # Define the training window (previous 5 years)
    train_years = list(range(target_year - 5, target_year))  # Example: for 2025, train on 2020-2024
    
    # Ensure we are using only the existing columns in crime_df for training (including predictions)
    train_columns = [str(year) for year in train_years]
    
    # Prepare the training data
    X = crime_df[train_columns].values
    y = crime_df[train_columns[-1]].values  # Use the last year in the window as the target for training
    
    # Initialize and train the model
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X, y)
    
    # Apply the fitted model to the same 5-year window to produce the target-year values
    predicted_values = model.predict(X)
    
    # Add the predictions for the current year as a new column in the dataframe
    crime_df[str(target_year)] = predicted_values
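
In the loop above, the target column is also the last feature column. A common alternative, which is not what this notebook does, is to build lagged training pairs so that five consecutive years predict the year that follows them. The sketch below illustrates that layout on synthetic data; the helper `make_lagged_pairs` and the random rates are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def make_lagged_pairs(rates: pd.DataFrame, window: int = 5):
    """Stack every (window consecutive years -> following year) pair across all rows."""
    values = rates.to_numpy()
    X_parts, y_parts = [], []
    for start in range(values.shape[1] - window):
        X_parts.append(values[:, start:start + window])  # years t .. t+window-1
        y_parts.append(values[:, start + window])        # year t+window
    return np.vstack(X_parts), np.concatenate(y_parts)

# Synthetic per-person rates for 4 areas over 2015-2024 (illustrative only)
rng = np.random.default_rng(0)
cols = [str(y) for y in range(2015, 2025)]
rates = pd.DataFrame(rng.uniform(0.04, 0.08, size=(4, len(cols))), columns=cols)

X, y = make_lagged_pairs(rates, window=5)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

# Forecast 2025 from the five most recent observed years (2020-2024)
pred_2025 = model.predict(rates[cols[-5:]].to_numpy())
```

With this layout the target is never among the inputs, so the model has to learn how a 5-year history relates to the next year rather than copying a column.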



## Predicting Previous Years' Crime Rates

In addition to predicting future crime rates (2025, 2026, and 2027), we also perform **backward prediction** to estimate crime rates for 2014, 2013, 2012, and 2011. The crime dataset used here starts in 2015, while the population data starts in 2011, so these estimates fill in the earlier years before crime rates are combined with population counts.

For each target year, the training window is the following 5 years. For example, to estimate the crime rate for 2014, the model is trained on data from 2015 through 2019. For 2013 and earlier, the window also includes years that were themselves estimated in a previous step. A target year is processed only if all 5 columns of its window exist.

For each window we initialize a **RandomForestRegressor** with 200 estimators and train it on the crime rates of those 5 years. The input features (`X`) are the crime rates for the 5 years in the window, and the target variable (`y`) is the crime rate for the last year of the window (e.g., 2019). The predicted values are stored as a new column named after the target year.

If observed crime rates for 2011-2014 are available from another source, comparing them with these estimates gives a check on how well the approach captures past trends, and can guide adjustments to the model before relying on its forecasts.
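
One way to check forecasts against observed data is a hold-out test: hide a year the dataset does contain, forecast it from the remaining years, and measure the error. The sketch below shows the mechanics with two simple forecasts and scikit-learn's `mean_absolute_error`; the rates are made-up numbers and the two forecasting rules are illustrative stand-ins, not the notebook's model.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Illustrative per-person rates (made-up numbers), one row per area
rates = pd.DataFrame(
    {"2021": [0.050, 0.070], "2022": [0.040, 0.060],
     "2023": [0.044, 0.064], "2024": [0.050, 0.068]},
    index=["Area A", "Area B"],
)

held_out = "2024"                      # the year we pretend not to know
history = rates.drop(columns=held_out)

# Two simple forecasts for the held-out year
persistence = history["2023"]          # repeat the last known value
window_mean = history.mean(axis=1)     # average of the known years

# Mean absolute error of each forecast against the observed 2024 rates
mae_persistence = mean_absolute_error(rates[held_out], persistence)
mae_window_mean = mean_absolute_error(rates[held_out], window_mean)
```

The same pattern applies to any model: fit it without the held-out year, predict that year, and compare the error with a simple baseline such as persistence.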


In [12]:
# Assuming crime_df has data from 2015 to 2024
# Columns should be named as years (2015, 2016, ..., 2024)

# Loop to predict for 2014, 2013, 2012, 2011
for target_year in [2014, 2013, 2012, 2011]:
    # Define the training window (next 5 years)
    train_years = list(range(target_year + 1, target_year + 6))  # Example: for 2014, train on 2015-2019
    
    # Ensure we are using only the existing columns in crime_df for training
    train_columns = [str(year) for year in train_years if str(year) in crime_df.columns]
    
    if len(train_columns) == 5:  # Proceed only if all 5 years of the window are available
        # Prepare the training data
        X = crime_df[train_columns].values
        y = crime_df[train_columns[-1]].values  # Use the last year in the window as the target for training
        
        # Initialize and train the model
        model = RandomForestRegressor(n_estimators=200, random_state=42)
        model.fit(X, y)
        
        # Apply the fitted model to the same 5-year window to produce the target-year values
        predicted_values = model.predict(X)
        
        # Add the predictions for the current year as a new column in the dataframe
        crime_df[str(target_year)] = predicted_values


In [13]:
crime_df.head()

Unnamed: 0,Police Region,Local Government Area,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025,2026,2027,2014,2013,2012,2011
0,1 North West Metro,Banyule,0.053835,0.056847,0.058278,0.056254,0.053887,0.053095,0.048606,0.04069,0.043032,0.049622,0.049437,0.049295,0.049291,0.053762,0.056108,0.058351,0.057285
1,1 North West Metro,Brimbank,0.070815,0.07493,0.067515,0.068737,0.068482,0.076499,0.072817,0.061233,0.064918,0.068713,0.068737,0.068909,0.068797,0.068198,0.068571,0.068329,0.075038
2,1 North West Metro,Darebin,0.071815,0.080354,0.081163,0.072685,0.070974,0.074375,0.075351,0.060264,0.064503,0.072322,0.073133,0.073548,0.07362,0.071282,0.073133,0.080624,0.079543
3,1 North West Metro,Hobsons Bay,0.058239,0.058484,0.054071,0.050688,0.050773,0.049794,0.048024,0.051925,0.047921,0.052503,0.052634,0.052819,0.052985,0.050774,0.05062,0.054229,0.058818
4,1 North West Metro,Hume,0.070675,0.078506,0.075145,0.068444,0.066695,0.064081,0.061749,0.0465,0.050053,0.054794,0.054995,0.05505,0.054918,0.066806,0.068728,0.075339,0.078209


In [14]:
forecasted_df = pd.read_csv("/root/project-2-group-real-estate-industry-project-34/data/landing/forecasted_populations.csv")
forecasted_df.head()

Unnamed: 0.1,Unnamed: 0,Code,Suburb,2023_Forecast,2024_Forecast,2025_Forecast,2026_Forecast,2027_Forecast
0,0,206011106,Brunswick East,13690,14083,14477,14871,15264
1,1,206011107,Brunswick West,14684,14684,14684,14684,14684
2,2,206011109,Pascoe Vale South,10388,10375,10368,10364,10363
3,3,206011495,Brunswick - North,13254,13254,13254,13254,13254
4,4,206011496,Brunswick - South,13364,13364,13364,13364,13364


In [15]:
# Inner join on suburb name: 'Suburb Name' from df and 'Suburb' from forecasted_df
merged_df = pd.merge(df, forecasted_df[['Suburb', '2023_Forecast', '2024_Forecast', '2025_Forecast', '2026_Forecast', '2027_Forecast']],
                     left_on='Suburb Name', right_on='Suburb', how='inner')
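
Joining on names can silently duplicate or drop rows when a name is repeated or spelled differently. As a hedged alternative, the sketch below joins on the area code and uses the `validate` and `indicator` options of `DataFrame.merge` to surface such problems. The toy frames only mirror the column names seen above; the values, and the assumption that both code columns share one dtype, are illustrative.

```python
import pandas as pd

# Toy frames mirroring the column names used above (values are illustrative)
population = pd.DataFrame({
    "Place of Usual Residence": ["206011106", "206011107", "999999999"],
    "Suburb Name": ["Brunswick East", "Brunswick West", "Nowhere"],
    "2022": [13296, 14684, 100],
})
forecast = pd.DataFrame({
    "Code": ["206011106", "206011107"],
    "2023_Forecast": [13690, 14684],
})

checked = population.merge(
    forecast,
    left_on="Place of Usual Residence",
    right_on="Code",
    how="left",
    validate="one_to_one",   # raises MergeError if either key column has duplicates
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Areas that found no forecast row
unmatched = checked.loc[checked["_merge"] == "left_only", "Suburb Name"].tolist()
```

A left join with `indicator=True` keeps every population row, so unmatched areas can be inspected before deciding whether to drop them.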

In [17]:
# Rename the columns by extracting the year
merged_df.rename(columns={
    '2023_Forecast': '2023',
    '2024_Forecast': '2024',
    '2025_Forecast': '2025',
    '2026_Forecast': '2026',
    '2027_Forecast': '2027'
}, inplace=True)
merged_df.head()

Unnamed: 0,Place of Usual Residence,Suburb Name,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,Suburb,2023,2024,2025,2026,2027
0,206011106,Brunswick East,8966,9208,9870,10439,11062,11716,12154,12392,12602,13064,12964,13296,Brunswick East,13690,14083,14477,14871,15264
1,206011107,Brunswick West,13864,13963,14057,14192,14344,14523,14556,14695,14854,15010,14497,14684,Brunswick West,14684,14684,14684,14684,14684
2,206011109,Pascoe Vale South,9860,9954,10038,10122,10251,10465,10698,10834,10873,10836,10463,10413,Pascoe Vale South,10388,10375,10368,10364,10363
3,206011495,Brunswick - North,11981,12254,12548,12922,13225,13581,13728,13928,14022,14124,13077,13254,Brunswick - North,13254,13254,13254,13254,13254
4,206011496,Brunswick - South,12006,12402,12836,13233,13574,13854,13984,14200,14336,14235,13208,13364,Brunswick - South,13364,13364,13364,13364,13364


In [18]:
sa2_df = pd.read_csv("/root/project-2-group-real-estate-industry-project-34/data/landing/SA2_2021_AUST.csv",delimiter=';')

In [35]:
# Left join on suburb name: 'Suburb Name' from merged_df and 'SA2_NAME_2021' from sa2_df
merged_sa2_df = pd.merge(merged_df, sa2_df[['SA2_NAME_2021', 'SA3_NAME_2021']],
                     left_on='Suburb Name', right_on='SA2_NAME_2021', how='left')

In [36]:
# Update 'SA3_NAME_2021' with 'Suburb Name' where 'SA3_NAME_2021' is NaN
merged_sa2_df['SA3_NAME_2021'] = merged_sa2_df['SA3_NAME_2021'].fillna(merged_sa2_df['Suburb Name'])




In [37]:
len(merged_sa2_df['Place of Usual Residence'].unique())

608

In [None]:
# Split 'SA3_NAME_2021' on ' - ' and keep the first element
merged_sa2_df['SA3_NAME_2021_First'] = merged_sa2_df['SA3_NAME_2021'].str.split(' - ').str[0]
# Drop the columns that are no longer needed after the joins
merged_sa2_df.drop(columns=['SA2_NAME_2021','Suburb','SA3_NAME_2021'], inplace=True)
merged_sa2_df.head()

## Converting SA2 to SA3 for Missing SA3 Regions

In this section, we handle SA3 names that have no direct match in the crime data. Each SA2 region needs a region name that appears in the crime dataset, which is indexed by Local Government Area, so that a crime rate can be looked up for it. Where the SA3 name (or its first component) does not match, we manually assign the corresponding name based on known mappings.



In [46]:
manual_input = {
    "Brunswick" : "Merri-bek",
    "Essendon" : "Moonee Valley",
    "Melbourne City" : "Melbourne",
    "Keilor" : "Brimbank",
    "Moreland" : "Merri-bek",
    "Sunbury" : "Hume",
    "Tullamarine" : "Hume",
    "Dandenong" : "Greater Dandenong",
    "Creswick" : "Hepburn",
    "Maryborough" : "Central Goldfields",
    "Bendigo" : "Greater Bendigo",
    "Heathcote" : "Greater Dandenong",
    "Barwon" : "Greater Geelong",
    "Geelong" : "Greater Geelong",
    "Upper Goulburn Valley" : "Greater Shepparton",
    "Gippsland" : "East Gippsland",
    "Latrobe Valley" : "Latrobe",
    "Grampians" : "Southern Grampians",
    "Murray River" : "Gannawarra",
    "Shepparton" : "Greater Shepparton",
    "Colac" : "Colac-Otway",
    "Tuggeranong" : "Monash",
    "Canberra East" : "Hume",
    "Dubbo" : "Wellington",
    "Highett (East)" : "Kingston"
    
}

In [47]:
# Replace the values in 'SA3_NAME_2021_First' that appear as keys in the 'manual_input' dictionary
merged_sa2_df['SA3_NAME_2021_First'] = merged_sa2_df['SA3_NAME_2021_First'].apply(
    lambda x: manual_input[x] if x in manual_input else x
)
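
The dictionary lookup above can also be written without a lambda. This is a small sketch on a hypothetical subset of the mapping (`manual_subset`) and a made-up name column; `Series.replace` and `Series.map` are standard pandas methods.

```python
import pandas as pd

# Hypothetical subset of the mapping and of the name column
manual_subset = {"Brunswick": "Merri-bek", "Keilor": "Brimbank"}
names = pd.Series(["Brunswick", "Banyule", "Keilor"])

# Replace whole values found in the dict; anything else passes through unchanged
mapped = names.replace(manual_subset)

# Same result with map + fillna, which makes the fallback explicit
mapped_alt = names.map(manual_subset).fillna(names)
```

Both forms leave names that are not keys of the dictionary untouched, matching the behaviour of the `apply` call.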

In [None]:
# Set 'Local Government Area' as the index
crime_df.set_index('Local Government Area', inplace=True)
# Remove leading and trailing whitespace from the index of crime_df
crime_df.index = crime_df.index.str.strip()

In [49]:
# Check if each value in "SA3_NAME_2021_First" is in crime_df's index
merged_sa2_df['Exists_in_crime_df'] = merged_sa2_df['SA3_NAME_2021_First'].isin(crime_df.index)

# Count the number of True and False values in the 'Exists_in_crime_df' column
counts = merged_sa2_df['Exists_in_crime_df'].value_counts()

# Display the counts
print(counts)

Exists_in_crime_df
True     635
False      5
Name: count, dtype: int64


The rows without a matching crime rate are broad, aggregated areas such as "Greater Melbourne" or "Total Victoria". They are too coarse for crime analysis at the suburb or regional level, so they are dropped to keep the focus on detailed geographic data.


In [50]:
# Select the rows where 'Exists_in_crime_df' is False
false_rows = merged_sa2_df[~merged_sa2_df['Exists_in_crime_df']]

# Display the filtered rows
false_rows


Unnamed: 0,Place of Usual Residence,Suburb Name,2011,2012,2013,2014,2015,2016,2017,2018,...,2020,2021,2022,2023,2024,2025,2026,2027,SA3_NAME_2021_First,Exists_in_crime_df
420,2GMEL,Greater Melbourne,4169366,4265843,4370067,4476030,4586012,4714387,4820116,4916589,...,5061107,4975319,5035738,5114499,5193260,5272021,5350782,5429543,Greater Melbourne,False
596,215,North West,149634,150355,150981,151278,151422,151907,152885,153876,...,155797,155154,154859,155334,155809,156284,156759,157234,West Coast,False
637,217,Warrnambool and South West,122599,123046,123476,123747,124010,124491,125195,125971,...,127464,127631,127659,128119,128579,129039,129499,129959,Warrnambool and South West,False
638,2RVIC,Rest of Vic.,1368451,1385248,1402602,1418887,1436310,1458785,1482492,1506449,...,1553939,1572503,1590226,1610387,1630549,1650710,1670871,1691033,Rest of Vic.,False
639,2,Total Victoria,5537817,5651091,5772669,5894917,6022322,6173172,6302608,6423038,...,6615046,6547822,6625964,6724886,6823809,6922731,7021654,7120576,Total Victoria,False


In [51]:
# Keep only the rows where 'Exists_in_crime_df' is True
merged_sa2_df = merged_sa2_df[merged_sa2_df['Exists_in_crime_df']]


## Multiplying Crime Rate by Population Count

To estimate the number of crime occurrences in each SA2 region, we multiply its population count for each year by the per-person crime rate of the region it was matched to in the crime data. The result reflects both the crime rate and the population size.
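
The multiplication described above can also be done in one vectorised step instead of row by row. The sketch below uses toy frames with made-up numbers; the column name `lga` and the frames `rates_by_lga` and `pop_by_sa2` are hypothetical stand-ins for the notebook's data.

```python
import pandas as pd

# Toy inputs (illustrative numbers): per-person rates by LGA, populations by SA2
rates_by_lga = pd.DataFrame(
    {"2023": [0.05, 0.07], "2024": [0.06, 0.08]},
    index=["Merri-bek", "Brimbank"],
)
pop_by_sa2 = pd.DataFrame({
    "SA2_ID": ["a", "b", "c"],
    "lga": ["Merri-bek", "Merri-bek", "Brimbank"],
    "2023": [1000, 2000, 3000],
    "2024": [1100, 2100, 3100],
})
year_cols = ["2023", "2024"]

# Look up each SA2's LGA rates; to_numpy() drops the index so rows align by position
rates_per_row = rates_by_lga.loc[pop_by_sa2["lga"].tolist(), year_cols].to_numpy()

# Element-wise product: population x per-person rate = estimated incident count
estimated = pop_by_sa2.copy()
estimated[year_cols] = pop_by_sa2[year_cols].to_numpy() * rates_per_row
```

The label lookup raises a `KeyError` if any `lga` value is missing from the rate table, which is why unmatched rows need to be removed first.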




In [53]:
# List of years to process
years_columns = [str(year) for year in range(2011, 2028)]  # From 2011 to 2027

# Apply the logic: if Exists_in_crime_df is True, multiply corresponding year values
for year in years_columns:
    merged_sa2_df[year] = merged_sa2_df.apply(
        lambda row: row[year] * crime_df.loc[row['SA3_NAME_2021_First'], year]
        if row['Exists_in_crime_df'] else row[year],  # Multiply if Exists_in_crime_df is True
        axis=1
    )

# Display the updated dataframe
merged_sa2_df.head()


Unnamed: 0,Place of Usual Residence,Suburb Name,2011,2012,2013,2014,2015,2016,2017,2018,...,2020,2021,2022,2023,2024,2025,2026,2027,SA3_NAME_2021_First,Exists_in_crime_df
0,206011106,Brunswick East,613.762644,598.508904,621.211829,640.428161,686.50772,802.745172,790.836472,780.225104,...,799.960976,754.530728,675.197472,738.80823,797.478041,820.581478,843.299306,866.763845,Merri-bek,True
1,206011107,Brunswick West,949.052564,907.578175,884.739076,870.673097,890.18864,995.072391,947.129808,925.22659,...,919.12234,843.754394,745.682888,792.451428,831.510868,832.314597,832.694978,833.828636,Merri-bek,True
2,206011109,Pascoe Vale South,674.960926,646.998005,631.785647,620.980348,636.17706,717.030405,696.097464,682.130308,...,663.531624,608.967526,528.792966,560.609196,587.505125,587.676228,587.717975,588.461329,Merri-bek,True
3,206011495,Brunswick - North,820.152825,796.495234,789.763528,792.759144,820.7435,930.529377,893.253504,876.934736,...,864.869016,761.107554,673.064628,715.278618,750.534258,751.259716,751.603053,752.626311,Merri-bek,True
4,206011496,Brunswick - South,821.864187,806.115056,807.890074,811.83886,842.40244,949.234518,909.910912,894.0604,...,871.66599,768.732016,678.650648,721.214988,756.763228,757.494707,757.840894,758.872643,Merri-bek,True


In [54]:
# Drop the two helper columns that are no longer needed
merged_sa2_df = merged_sa2_df.drop(columns=['SA3_NAME_2021_First', 'Exists_in_crime_df'])

# Renaming "Place of Usual Residence" to "SA2_ID"
merged_sa2_df = merged_sa2_df.rename(columns={'Place of Usual Residence': 'SA2_ID'})


In [24]:
merged_sa2_df.head()

Unnamed: 0,SA2_ID,Suburb Name,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024,2025,2026,2027
0,206011106,Brunswick East,613.762644,598.508904,621.211829,640.428161,686.50772,802.745172,790.836472,780.225104,771.960714,799.960976,754.530728,675.197472,738.80823,797.478041,820.581478,843.299306,866.763845
1,206011107,Brunswick West,949.052564,907.578175,884.739076,870.673097,890.18864,995.072391,947.129808,925.22659,909.911478,919.12234,843.754394,745.682888,792.451428,831.510868,832.314597,832.694978,833.828636
2,206011109,Pascoe Vale South,674.960926,646.998005,631.785647,620.980348,636.17706,717.030405,696.097464,682.130308,666.047361,663.531624,608.967526,528.792966,560.609196,587.505125,587.676228,587.717975,588.461329
3,206021110,Alphington - Fairfield,680.971604,701.915417,645.547595,637.542004,655.95821,758.059636,768.857099,684.329275,671.910858,709.38875,683.358219,551.777184,590.589468,662.180232,669.605016,673.404046,674.061408
4,206021112,Thornbury,1452.304584,1490.260116,1372.931347,1353.279847,1382.72601,1573.170612,1600.53436,1432.91209,1400.246046,1471.955625,1443.197703,1155.682728,1236.974031,1386.918994,1402.470007,1410.426977,1411.803804


In [57]:
merged_sa2_df.to_parquet("crime_data_with_predictions.parquet", index=False)
