# Predictive Analysis

**Authors:** 
- Marc Villalonga Llobera
- Patxi Juaristi Pagegi

**Date:** 08/01/2024

---

This Jupyter Notebook covers the third task of the project for the Data Mining course of the Laurea Magistrale at the University of Pisa, which focuses on predictive analysis.

## Environment preparation and data reading

First of all, we will install all the required packages, and then import the libraries that we will use:


In [None]:
#%%capture
#!python -m pip install --upgrade pip
#!pip install pandas
#!pip install matplotlib

# Add other libraries here if we need them

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

After importing the required libraries, we will read the dataset that we exported in task 1, which contains the data filtered during the data preparation tasks.

In [None]:
# Load the incidents dataset exported in task 1
incidents_dataset = pd.read_csv('../project_datasets/incidents_v2.csv', low_memory=False)

## New Feature Definition

First, we will define new features that will enable classification for later predictions.

### Time related features

We will extract new features related to the moment when the incident occurred, based on the `date` column.

- Extract month, day of the week, and year.
- Create a feature for weekends or weekdays.
- Create a feature for the season (spring, summer, autumn, winter).

In [None]:
incidents_dataset['date'] = pd.to_datetime(incidents_dataset['date'])  # Convert 'date' column to datetime format

# Extract month, day of the week, and year
incidents_dataset['month'] = incidents_dataset['date'].dt.month
incidents_dataset['day_of_week'] = incidents_dataset['date'].dt.dayofweek
incidents_dataset['year'] = incidents_dataset['date'].dt.year

# Create a feature for weekends or weekdays
incidents_dataset['is_weekend'] = (incidents_dataset['date'].dt.weekday >= 5).astype(int)

# Create a feature for the season
def get_season(month):
    if 3 <= month <= 5:
        return 'spring'
    elif 6 <= month <= 8:
        return 'summer'
    elif 9 <= month <= 11:
        return 'autumn'
    else:
        return 'winter'

incidents_dataset['season'] = incidents_dataset['month'].apply(get_season)

# Display the updated DataFrame
print(incidents_dataset[['date', 'month', 'day_of_week', 'year', 'is_weekend', 'season']].head())
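
The same features can also be derived without a row-wise `apply`. The following is a minimal sketch on a toy DataFrame, not the notebook's own method: the helper `add_time_features` and the lookup table `MONTH_TO_SEASON` are hypothetical names, and the season boundaries are assumed to be the same as in `get_season` above. Note that pandas numbers weekdays from 0 (Monday) to 6 (Sunday), so the values 5 and 6 mark the weekend.

```python
import pandas as pd

# Hypothetical lookup table: month number -> season (same boundaries as get_season)
MONTH_TO_SEASON = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

def add_time_features(df, date_col='date'):
    """Return a copy of df with month, day_of_week, year, is_weekend and season columns."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out['month'] = dates.dt.month
    out['day_of_week'] = dates.dt.dayofweek  # Monday=0, ..., Sunday=6
    out['year'] = dates.dt.year
    out['is_weekend'] = (dates.dt.dayofweek >= 5).astype(int)
    out['season'] = out['month'].map(MONTH_TO_SEASON)  # vectorized dictionary lookup
    return out

# Toy data for illustration only
toy = pd.DataFrame({'date': ['2017-01-07', '2017-07-04', '2017-10-30']})
toy_features = add_time_features(toy)
print(toy_features)
```

Using `map` with a dictionary keeps the month-to-season rule in one place, which makes it easier to change if a different season definition is preferred.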


### Geographical and participant features

Next, we will create several new features that combine the state and city of the incident with the participant features.

First of all, we will count the number of incidents per state and per city.

In [None]:
# City and State Incident Count
incidents_dataset['city_incident_count'] = incidents_dataset.groupby('city_or_county')['city_or_county'].transform('count')
incidents_dataset['state_incident_count'] = incidents_dataset.groupby('state')['state'].transform('count')

print(incidents_dataset[['city_or_county','state', 'city_incident_count', 'state_incident_count']].head())
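
One point worth checking is whether the same city name appears in more than one state, since grouping by `city_or_county` alone would then merge unrelated places. The sketch below uses invented toy data to show the difference; the column names `city_count_name_only` and `city_count_by_state` are hypothetical, and whether the effect matters for the real dataset is an assumption to verify.

```python
import pandas as pd

# Toy data for illustration only: "Springfield" exists in two states
toy = pd.DataFrame({
    'state': ['Illinois', 'Illinois', 'Missouri', 'Missouri', 'Missouri'],
    'city_or_county': ['Springfield', 'Chicago', 'Springfield', 'Springfield', 'St. Louis'],
})

# Grouping by city name alone merges the two different Springfields
toy['city_count_name_only'] = (
    toy.groupby('city_or_county')['city_or_county'].transform('count')
)

# Grouping by (state, city) keeps them apart
toy['city_count_by_state'] = (
    toy.groupby(['state', 'city_or_county'])['city_or_county'].transform('count')
)

# State counts are unaffected by the choice above
toy['state_incident_count'] = toy.groupby('state')['state'].transform('count')

print(toy)
```

`transform('count')` returns one value per original row, which is why the result can be assigned directly as a new column.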

Then, we will create two columns, one for the state and one for the city, containing an index that describes the severity of the incidents per area. For each incident, the index is the number of casualties (people killed plus people injured) divided by the number of incidents recorded in the same city or state.

In [None]:
# City and State Severity Index
incidents_dataset['city_severity_index'] = (incidents_dataset['n_killed'] + incidents_dataset['n_injured']) / incidents_dataset['city_incident_count']
incidents_dataset['state_severity_index'] = (incidents_dataset['n_killed'] + incidents_dataset['n_injured']) / incidents_dataset['state_incident_count']

print(incidents_dataset[['city_or_county','state', 'city_severity_index', 'state_severity_index']].head())
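
Because the index above is computed per incident, its values differ between rows of the same area. If a single value per area is wanted instead, one option is the mean number of casualties per incident in that area. The sketch below illustrates this on toy data; the columns `casualties` and `state_mean_casualties` are hypothetical names and this is an alternative formulation, not the notebook's definition.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B'],
    'n_killed': [1, 0, 2, 0],
    'n_injured': [0, 3, 0, 1],
})

toy['casualties'] = toy['n_killed'] + toy['n_injured']

# Per-incident index, computed as in the cell above
toy['state_incident_count'] = toy.groupby('state')['state'].transform('count')
toy['state_severity_index'] = toy['casualties'] / toy['state_incident_count']

# Area-level alternative: mean casualties per incident, identical for all rows of a state
toy['state_mean_casualties'] = toy.groupby('state')['casualties'].transform('mean')

# Summing the per-incident index within a state gives the same area-level value
index_sum_per_state = toy.groupby('state')['state_severity_index'].sum()

print(toy)
print(index_sum_per_state)
```

The two formulations are linked: the per-incident index values of an area add up to that area's mean casualties per incident.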

We will also add two columns with the average age of the participants per area, computed as the mean of `avg_age_participants` over all incidents in the same city or state.

In [None]:
# City and State Average Age of Participants
incidents_dataset['city_avg_age'] = incidents_dataset.groupby('city_or_county')['avg_age_participants'].transform('mean')
incidents_dataset['state_avg_age'] = incidents_dataset.groupby('state')['avg_age_participants'].transform('mean')

print(incidents_dataset[['city_or_county', 'state', 'city_avg_age', 'state_avg_age']].head())
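
Averaging `avg_age_participants` gives every incident the same weight, regardless of how many people took part in it. If the average age of all participants in an area is of interest, a participant-weighted mean is one possible alternative. The sketch below shows both on toy data; the columns `age_x_participants` and `state_weighted_avg_age` are hypothetical, and using `n_participants` as the weight is an assumption.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'state': ['A', 'A', 'B'],
    'avg_age_participants': [20.0, 40.0, 30.0],
    'n_participants': [1, 3, 2],
})

# Unweighted: mean of the per-incident averages (as in the cell above)
toy['state_avg_age'] = toy.groupby('state')['avg_age_participants'].transform('mean')

# Weighted: sum(age * participants) / sum(participants) within each state
toy['age_x_participants'] = toy['avg_age_participants'] * toy['n_participants']
grouped = toy.groupby('state')
toy['state_weighted_avg_age'] = (
    grouped['age_x_participants'].transform('sum')
    / grouped['n_participants'].transform('sum')
)

print(toy[['state', 'state_avg_age', 'state_weighted_avg_age']])
```

In state A the unweighted mean is 30, while the weighted mean is 35, because the incident with three participants has the higher average age.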

Finally, since female participation is considerably lower than male participation, we will also add columns that measure female participation, expressed as percentages.

In [None]:
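# City and State Female Participation Rate (in percent)
# City column: females as a share of the participants of the same incident
incidents_dataset['city_female_participation_rate'] = (incidents_dataset['n_females'] / incidents_dataset['n_participants']) * 100
# State column: females in the incident as a share of all participants recorded in the state
incidents_dataset['state_female_participation_rate'] = (incidents_dataset['n_females'] / incidents_dataset.groupby('state')['n_participants'].transform('sum')) * 100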

print(incidents_dataset[['city_or_county', 'state', 'city_female_participation_rate', 'state_female_participation_rate']][4:12])
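
If a single rate per area is preferred, both the numerator and the denominator can be aggregated over the area before dividing. The following is a minimal sketch on toy data under that assumption; the column name `state_female_rate` is hypothetical. It also replaces zero participant totals with `NaN`, so that the division yields a missing value rather than `inf`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; state B has no recorded participants
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'n_females': [1, 0, 0, 0],
    'n_participants': [2, 2, 0, 0],
})

grouped = toy.groupby('state')
females = grouped['n_females'].transform('sum')            # total females per state
participants = grouped['n_participants'].transform('sum')  # total participants per state

# Zero totals become NaN so the rate is missing instead of infinite or undefined
toy['state_female_rate'] = females / participants.replace(0, np.nan) * 100

print(toy)
```

The same pattern applies to cities by grouping on `city_or_county` (optionally together with `state`).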