# Predictive Analysis

**Authors:** 
- Marc Villalonga Llobera
- Patxi Juaristi Pagegi

**Date:** 08/01/2024

---

This Jupyter Notebook covers the third task of the project for the Data Mining course of the Laurea Magistrale at the University of Pisa, which focuses on predictive analysis.

## Environment preparation and data reading

First of all, we will install all the required packages, and then import the libraries that we will use:


In [None]:
#%%capture
#!python -m pip install --upgrade pip
#!pip install pandas
#!pip install matplotlib

# Add other libraries here if we need them

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

After importing the required libraries, we will read the dataset that we exported in task 1, which contains the data filtered during the data preparation steps.

In [None]:
# Load the incidents dataset exported in task 1
incidents_dataset = pd.read_csv('../project_datasets/incidents_v2.csv', low_memory=False)

## New Feature Definition

First, we will define new features that will support the classification task used for the later predictions.

### Time related features

We will extract new features related to when the incident occurred, based on the `date` column.

- Extract month, day of the week, and year.
- Create a feature for weekends or weekdays.
- Create a feature for the season (spring, summer, autumn, winter).

In [None]:
# Convert 'date' column to datetime format
incidents_dataset['date'] = pd.to_datetime(incidents_dataset['date'])

# Extract month, day of the week, and year
incidents_dataset['month'] = incidents_dataset['date'].dt.month
incidents_dataset['day_of_week'] = incidents_dataset['date'].dt.dayofweek
incidents_dataset['year'] = incidents_dataset['date'].dt.year

# Create a feature for weekends or weekdays
incidents_dataset['is_weekend'] = (incidents_dataset['date'].dt.weekday >= 5).astype(int)

# Create a feature for the season
def get_season(month):
    if 3 <= month <= 5:
        return 'spring'
    elif 6 <= month <= 8:
        return 'summer'
    elif 9 <= month <= 11:
        return 'autumn'
    else:
        return 'winter'

incidents_dataset['season'] = incidents_dataset['month'].apply(get_season)

# Display the updated DataFrame
print(incidents_dataset[['date', 'month', 'day_of_week', 'year', 'is_weekend', 'season']].head())
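
The season mapping can also be written without a row-wise `apply`. The following is a minimal sketch on hypothetical toy dates (the `toy` frame and `MONTH_TO_SEASON` dictionary are illustrative names, not part of the project code), using the same month boundaries as `get_season`:

```python
import pandas as pd

# Hypothetical toy dates, chosen to cover every season and both day types
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-01-15", "2017-03-01", "2017-08-31", "2017-11-30", "2017-12-02"]
    )
})

# Month number -> season, with the same boundaries as get_season
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Vectorized lookup instead of calling a Python function per row
toy["season"] = toy["date"].dt.month.map(MONTH_TO_SEASON)

# dayofweek is 0 for Monday and 6 for Sunday, so 5 and 6 are the weekend
toy["is_weekend"] = (toy["date"].dt.dayofweek >= 5).astype(int)

print(toy)
```

On a large dataset the dictionary lookup is typically faster than `apply`, and it yields the same labels.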


### Geographical and participant features

Next, we will create several new features that combine the state and city of the incident with the participant features.

First of all, we will count the number of incidents per state and per city.

In [None]:
# City and State Incident Count
incidents_dataset['city_incident_count'] = incidents_dataset.groupby('city_or_county')['city_or_county'].transform('count')
incidents_dataset['state_incident_count'] = incidents_dataset.groupby('state')['state'].transform('count')

print(incidents_dataset[['city_or_county','state', 'city_incident_count', 'state_incident_count']].head())

Then, we will create two columns, one for the state and one for the city, holding an index that lets us analyze the severity of the incidents per area. For each incident, the index is the number of killed plus injured people divided by the number of incidents recorded in its city or state.

In [None]:
# City and State Severity Index
incidents_dataset['city_severity_index'] = (incidents_dataset['n_killed'] + incidents_dataset['n_injured']) / incidents_dataset['city_incident_count']
incidents_dataset['state_severity_index'] = (incidents_dataset['n_killed'] + incidents_dataset['n_injured']) / incidents_dataset['state_incident_count']

print(incidents_dataset[['city_or_county','state', 'city_severity_index', 'state_severity_index']].head())
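
The index above still varies from incident to incident within the same city. If a single value per area is preferred, one possible variant is the mean number of casualties per incident in each area. The sketch below compares both on hypothetical toy data; the `casualties`, `per_incident_index` and `city_mean_casualties` columns are illustrative names, not features used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical toy incidents: two in city "A", one in city "B"
toy = pd.DataFrame({
    "city_or_county": ["A", "A", "B"],
    "n_killed": [1, 0, 2],
    "n_injured": [1, 0, 0],
})

toy["casualties"] = toy["n_killed"] + toy["n_injured"]
by_city = toy.groupby("city_or_county")["casualties"]

# Same formula as the notebook: incident casualties / incidents in the city
toy["per_incident_index"] = toy["casualties"] / by_city.transform("count")

# Area-level variant: total city casualties / incidents in the city
toy["city_mean_casualties"] = by_city.transform("mean")

print(toy)
```

In the toy data, both incidents in city "A" share the same area-level value, while their per-incident indices differ.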

We will also add two columns with the average age of the participants per city and per state.

In [None]:
# City and State Average Age of Participants
incidents_dataset['city_avg_age'] = incidents_dataset.groupby('city_or_county')['avg_age_participants'].transform('mean')
incidents_dataset['state_avg_age'] = incidents_dataset.groupby('state')['avg_age_participants'].transform('mean')

print(incidents_dataset[['city_or_county', 'state', 'city_avg_age', 'state_avg_age']].head())

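To conclude, since female participation is considerably lower than male participation, we will also add two female participation columns. The city-level column is the percentage of female participants in each incident, and the state-level column is the number of female participants in the incident as a percentage of all participants recorded in its state.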

In [None]:
# City and State Female Participation Rate
incidents_dataset['city_female_participation_rate'] = (incidents_dataset['n_females'] / incidents_dataset['n_participants']) * 100
incidents_dataset['state_female_participation_rate'] = (incidents_dataset['n_females'] / incidents_dataset.groupby('state')['n_participants'].transform('sum')) * 100

print(incidents_dataset[['city_or_county', 'state', 'city_female_participation_rate', 'state_female_participation_rate']][4:12])
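
Both rates divide by a participant count, so incidents or areas with zero recorded participants would produce `inf` or `NaN`. The following sketch, on hypothetical toy data, shows one way to compute a state-wide female share (total females over total participants in the state) while guarding against a zero denominator; `state_female_share` is an illustrative name and not a column of the project dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy incidents; state "Y" has no recorded participants
toy = pd.DataFrame({
    "state": ["X", "X", "Y"],
    "n_females": [1, 0, 0],
    "n_participants": [2, 2, 0],
})

# Per-state totals, broadcast back to every incident of that state
state_totals = toy.groupby("state")[["n_females", "n_participants"]].transform("sum")

# Replace zero totals with NaN so the division cannot yield inf
denominator = state_totals["n_participants"].replace(0, np.nan)
toy["state_female_share"] = state_totals["n_females"] / denominator * 100

print(toy)
```

Any `NaN` left by this guard would then be handled by the imputation step in the preprocessing section.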

## Preprocessing

To start with the preprocessing, we will remove the columns that we will not use in the prediction.

In [None]:
# Drop unnecessary columns
columns_to_drop = ['date', 'address', 'notes', 'incident_characteristics2', 'latitude', 'longitude',
                   'min_age_participants', 'max_age_participants', 'congressional_district', 'state_house_district', 'state_senate_district']
incidents_dataset = incidents_dataset.drop(columns=columns_to_drop)

### Handle Categorical Variables

Then, we will impute the missing values, using the mean for numerical columns and the mode for categorical columns.

In [None]:
# Impute missing values in numerical columns with mean
numerical_columns = incidents_dataset.select_dtypes(include=['float64', 'int64']).columns
incidents_dataset[numerical_columns] = incidents_dataset[numerical_columns].fillna(incidents_dataset[numerical_columns].mean())

# Impute missing values in categorical columns with mode
categorical_columns = incidents_dataset.select_dtypes(include=['object']).columns
incidents_dataset[categorical_columns] = incidents_dataset[categorical_columns].fillna(incidents_dataset[categorical_columns].mode().iloc[0])
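
To make the effect of the two `fillna` calls concrete, here is a minimal sketch on a hypothetical two-column toy frame. It selects the categorical columns with `exclude="number"` rather than by the `object` dtype, which is an assumption made only to keep the sketch independent of how pandas stores strings:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "avg_age_participants": [20.0, np.nan, 40.0],
    "state": ["Texas", None, "Texas"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numerical columns: fill each column with its own mean
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical columns: mode() returns a DataFrame because ties are possible,
# so .iloc[0] picks the first mode of every column
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

The missing age becomes 30.0 (the mean of 20 and 40) and the missing state becomes "Texas".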

Since we are going to predict whether at least one person was killed in an incident, we will create a binary variable indicating whether each incident in the dataset involved any deaths. It is derived from the variable `n_killed`: if `n_killed` is greater than 0 the value is 1 (*True*), otherwise it is 0 (*False*). The name of the variable is `people_killed`.

In [None]:
# Create the binary target variable 'people_killed'
incidents_dataset['people_killed'] = (incidents_dataset['n_killed'] > 0).astype(int)
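
Once the target exists, it can be useful to look at how the two classes are distributed, since a strong imbalance affects how later metrics should be read. A minimal sketch on hypothetical toy counts (the `class_share` name is illustrative):

```python
import pandas as pd

# Hypothetical toy values of n_killed for five incidents
toy = pd.DataFrame({"n_killed": [0, 2, 0, 1, 0]})

# Same rule as the notebook: 1 if at least one person was killed, else 0
toy["people_killed"] = (toy["n_killed"] > 0).astype(int)

# Share of each class in the target
class_share = toy["people_killed"].value_counts(normalize=True)
print(class_share)
```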

We start by specifying a dictionary, `categorical_columns_to_encode`, that lists the categorical columns to be one-hot encoded along with their respective thresholds. Each threshold is the number of most frequent values retained for that column during one-hot encoding. We limit the categories to the most frequent values because including all of them makes the dataset too big, and we ran into memory problems during the analysis.

Moreover, analyzing incidents with very rare characteristics does not make sense. Columns such as state, gender, age group and season have no threshold, because we want to include all of their values. For example, there are only three age groups, two genders and four seasons, so there is no need to limit the distinct values.

Afterwards, we go through the categorical columns listed in the dictionary. For each one, we either keep all unique values or keep only the most frequent values, as many as the threshold indicates. Any value that is not among those kept is grouped into an "Other" category. Finally, we perform the one-hot encoding with the `get_dummies` function.

In [None]:
# Specify the columns for one-hot encoding along with their respective thresholds
categorical_columns_to_encode = {
    'state': None,  # Set to None to include all distinct values
    'city_or_county': 100,
    'participant_gender1': None,
    'participant_age_group1': None,
    'incident_characteristics1': 20,
    'season': None
}

# Perform one-hot encoding for categorical variables. Use sparse representation for one-hot encoding
for column, threshold in categorical_columns_to_encode.items():
    if threshold is None:
        top_values = incidents_dataset[column].unique()
    else:
        top_values = incidents_dataset[column].value_counts().nlargest(threshold).index
    incidents_dataset[column] = incidents_dataset[column].where(incidents_dataset[column].isin(top_values), 'Other')

incidents_dataset = pd.get_dummies(incidents_dataset, columns=list(categorical_columns_to_encode), sparse=True)

print(incidents_dataset.info())
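
The top-values-plus-"Other" step can be illustrated on a small example. The sketch below wraps it in a hypothetical helper, `keep_top_categories`, which is not part of the notebook, and applies it to made-up city names:

```python
import pandas as pd

def keep_top_categories(series, n_top, other_label="Other"):
    """Keep the n_top most frequent values and relabel the rest (hypothetical helper)."""
    top_values = series.value_counts().nlargest(n_top).index
    return series.where(series.isin(top_values), other_label)

# Hypothetical toy column: two frequent cities and two rare ones
toy = pd.DataFrame({
    "city_or_county": ["Chicago", "Chicago", "Houston", "Houston", "Pisa", "Lucca"]
})

toy["city_or_county"] = keep_top_categories(toy["city_or_county"], n_top=2)

# One indicator column per remaining category
encoded = pd.get_dummies(toy, columns=["city_or_county"])
print(encoded.columns.tolist())
```

With `n_top=2`, the four distinct cities collapse into three indicator columns, which is how the thresholds keep the encoded dataset small.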

### Feature Scaling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Extract target variable 'y'
y = incidents_dataset['people_killed']

# Extract features 'X'
X = incidents_dataset.drop(columns=['n_killed', 'people_killed'])

# Convert the dataframe to a dense array
X_array = X.to_numpy()

# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_array)


# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, shuffle = True)

# Check the shape of the resulting sets
print("\nTrain set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)
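
In the cell above the scaler is fitted on the full dataset before the split, so the means and standard deviations used for scaling also reflect the test rows. A common alternative is to split first and fit the scaler on the training portion only. The following is a minimal sketch on randomly generated toy data; the array and variable names are hypothetical and chosen so they do not clash with the notebook's own variables:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 100 samples, 3 numeric features, binary target
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first, so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, shuffle=True
)

# Fit on the training set only, then reuse the same statistics on the test set
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print("Train set shape:", X_tr_scaled.shape)
print("Test set shape:", X_te_scaled.shape)
```

With this ordering the training features have zero mean after scaling, while the test features are only approximately centered, which is the expected behaviour for unseen data.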

## Model Selection and Evaluation