# Midterm: Business Analytics and Machine Learning (IN2028)

<div style="text-align: center;">
    <img src="world_map.png" alt="The world map showing the events separated by size and color." style="width: 1020px;"/>
</div>


In this midterm exam, you will analyze two data sets, *events.csv* and *weather.csv*, to demonstrate your data science skills. Below is a detailed description of these files to guide your work. 

Your task involves three main components, each with a practical application in the field of data science:

1. Cleaning the data sets individually.
2. Performing an exploratory analysis.
3. Developing a predictive model to classify different types of earthquakes.

Your model's performance will be measured by the Balanced Accuracy (BAC) score, where all labels are equally important. The *events.csv* file contains the columns *id* and *mag*. A row with an entry in the *id* column has no entry in the *mag* column, and vice versa. Use every row with an entry in *mag* (and thus no *id*) to train your model. Remember that these labeled rows are also the only data you can use to evaluate your model. The remaining rows, which have an entry in the *id* column, form the data set you should predict. For further instructions on this part, see the Prediction section below.
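To make the metric concrete, here is a minimal sketch of Balanced Accuracy as the mean of the per-class recalls, which is what `balanced_accuracy_score` from scikit-learn computes with its default arguments. The labels and the helper name `balanced_accuracy` are made up for illustration and do not come from the exam data.

```python
import numpy as np

# Toy labels, made up for illustration: 8 normal events (0) and 2 strong ones (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls (hypothetical helper for illustration)."""
    classes = np.unique(y_true)
    # Recall of class c: share of the true-c samples that were predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))


bac = balanced_accuracy(y_true, y_pred)
accuracy = float(np.mean(y_true == y_pred))
print(f"accuracy={accuracy:.4f}, balanced accuracy={bac:.4f}")
# prints: accuracy=0.8000, balanced accuracy=0.6875
```

Plain accuracy looks good here because the majority class dominates, while BAC exposes that only half of the minority class was found. This is why class imbalance in the training rows deserves attention.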

In [112]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, confusion_matrix, classification_report, accuracy_score

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE


seed = 2025 # Use this seed for every random operation.
np.random.seed(seed)


submission_counter = 0

## Data Preparation

### Events

The *events.csv* data set contains the following features:

* *time*: Time when the event occurred.
* *latitude*: Decimal degrees latitude of the event location.
* *longitude*: Decimal degrees longitude of the event location.
* *depth*: Depth of the event in kilometers.
* *mag*: The magnitude of the event.
* *magType*: The method or algorithm used to calculate the preferred magnitude for the event. 
* *dmin*: Horizontal distance from the epicenter to the nearest station in degrees. One degree corresponds to 111.2 kilometers. 
* *net*: The unique identifier of a data contributor. 
* *id*: A unique identifier of the event.
* *type*: Type of the seismic event.
* *horizontalError*: Uncertainty of the reported location of the event. A "shallow" value means the error is less than 10 km. Otherwise, it is considered "deep".
* *depthError*: Uncertainty of the reported depth of the event in kilometers. 
* *magError*: The estimated standard error of the reported magnitude of the event.
* *is_country*: A binary variable indicating whether an event occurred at sea (False) or on land (True).

You can assume that there are no outliers in the data set.

In [157]:
events = pd.read_csv("events.csv")

#events.info()
print(events.isnull().sum())

time                   0
latitude               0
longitude              0
depth                  0
mag                 3554
dmin                1898
net                    0
id                 10662
type                   0
horizontalError        0
depthError             0
magError           13975
is_country             0
dtype: int64


In [158]:
events["time"] = pd.to_datetime(events['time'])

for col in ['magError', 'dmin']:
    events[col] = events[col].fillna(events[col].mode()[0])


events.rename(columns={
    'latitude': 'lat',
    'longitude': 'lng',
}, inplace=True)


In [159]:
print(events.isnull().sum())

time                   0
lat                    0
lng                    0
depth                  0
mag                 3554
dmin                   0
net                    0
id                 10662
type                   0
horizontalError        0
depthError             0
magError               0
is_country             0
dtype: int64


### Weather

The second data set, which is stored in the *weather.csv* file, contains information about the weather at the events' locations. It contains the following columns:

- _time_: The timestamp of the recorded data, indicating the specific date and time of the observation.
- _temperature_: The measured air temperature at the given time and location in degrees Celsius.
- _humidity_: The relative humidity at the given time and location, expressed as a percentage.
- _precipitation_: The amount of precipitation (rainfall and snowfall combined) recorded at the given time and location, typically in millimeters.
- _sealevelPressure_: The atmospheric pressure at sea level, recorded at the given time and location, typically in hPa (hectopascals).
- _surfacePressure_: The atmospheric pressure at the surface level, recorded at the given time and location, typically in hPa (hectopascals).
- _lat_: The latitude coordinate of the location where the event was observed.
- _lng_: The longitude coordinate of the location where the event was observed.
- _nst_: The minimum number of seismic stations used to determine an event at a specific location.

You can assume that each event can be uniquely identified by its date and its location, given by the latitude and longitude values.

**Important**: In contrast to the event data, the weather dataset provides hourly weather information for each location and date. Before merging this data with the event dataset, aggregate the weather data so that there is only one row per location and date instead of 24 rows.
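The aggregation and merge described above could be sketched as follows. This is only an illustration on made-up toy frames (`weather_demo`, `events_demo`), not the reference solution. The choice of aggregation functions (mean for temperature, sum for precipitation) is an assumption, as is the premise that the `lat` and `lng` values match exactly between both files.

```python
import pandas as pd

# Made-up stand-in for weather.csv: hourly rows for one location.
weather_demo = pd.DataFrame({
    "time": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-02 00:00"]
    ),
    "lat": [10.0, 10.0, 10.0],
    "lng": [20.0, 20.0, 20.0],
    "temperature": [1.0, 3.0, 5.0],
    "precipitation": [0.5, 1.5, 0.0],
})

# Reduce the timestamp to a calendar date so that hourly rows share one key.
weather_demo["date"] = weather_demo["time"].dt.date

# One row per (lat, lng, date); the aggregation functions are an assumption.
daily = (
    weather_demo
    .groupby(["lat", "lng", "date"], as_index=False)
    .agg(
        temperature=("temperature", "mean"),
        precipitation=("precipitation", "sum"),
    )
)

# Made-up stand-in for the cleaned events frame (columns already renamed).
events_demo = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 13:45"]),
    "lat": [10.0],
    "lng": [20.0],
    "depth": [33.0],
})
events_demo["date"] = events_demo["time"].dt.date

# Left merge keeps every event; validate guards against duplicated weather keys.
merged = events_demo.merge(
    daily, on=["lat", "lng", "date"], how="left", validate="many_to_one"
)
print(merged[["lat", "lng", "date", "temperature", "precipitation"]])
```

Using `validate="many_to_one"` makes pandas raise an error if the aggregated weather frame still contains more than one row per key, which is a cheap check that the aggregation worked before any event rows get duplicated by the merge.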

In [None]:
weather = pd.read_csv('weather.csv')

## Prediction

Now, we are going to focus on predicting whether an earthquake is classified as a strong earthquake (magnitude > 4.4) or a normal earthquake (magnitude <= 4.4).

Make sure that your label is as follows:

$is\_high\_magnitude = \begin{cases} 1 & \text{if } mag > 4.4 \\ 0 & \text{otherwise} \end{cases}$
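A minimal sketch of how this label could be derived, assuming a frame with the *id* and *mag* columns described earlier. The toy values and the names `events_demo`, `labeled`, and `to_predict` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: rows have either a magnitude or an id, never both.
events_demo = pd.DataFrame({
    "id": [np.nan, np.nan, np.nan, "ev1"],  # "ev1" is a hypothetical id
    "mag": [4.4, 4.5, 3.0, np.nan],
})

# Rows with a magnitude are the labeled training data.
labeled = events_demo[events_demo["mag"].notna()].copy()
# Rows without a magnitude are the ones to predict later.
to_predict = events_demo[events_demo["mag"].isna()].copy()

# Strictly greater than 4.4 gives 1, everything else gives 0.
labeled["is_high_magnitude"] = (labeled["mag"] > 4.4).astype(int)
print(labeled)
```

The split is done before the label is computed because a comparison with a missing value evaluates to `False` in pandas. Applying the threshold to the full frame would therefore silently label every row to be predicted as 0.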

## Upload Instructions

Once you have trained your final model, use it to predict the earthquakes that have an *id*. Remember, these are the rows in the original *events.csv* that do not have an entry in the *mag* column.

The final data frame that contains your predictions must have exactly the following structure:

- *id*: The id of an earthquake.
- *prediction*: The predicted label (0 or 1) indicating whether the earthquake has a high magnitude (1) or not (0).
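Before uploading, the structure above can be sanity-checked with a few assertions. The helper `check_submission` below is hypothetical and not part of the exam template. It assumes that the column order matters and that each *id* appears exactly once.

```python
import pandas as pd


def check_submission(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if df violates the required layout."""
    # Assumption: exactly these two columns, in this order.
    if list(df.columns) != ["id", "prediction"]:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    # Every row needs an id, and ids are unique identifiers of events.
    if df["id"].isna().any() or df["id"].duplicated().any():
        raise ValueError("Ids must be present and unique.")
    # Predictions are class labels, not probabilities.
    if not df["prediction"].isin([0, 1]).all():
        raise ValueError("Predictions must be 0 or 1.")


# Toy submission with made-up ids.
good = pd.DataFrame({"id": ["a", "b"], "prediction": [1, 0]})
check_submission(good)  # passes silently
```

A check like this catches the common mistake of submitting predicted probabilities instead of hard 0/1 labels.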

You can use the following *prepare_prediction* function to transform your data set into the correct format. The last cell demonstrates the use of the function and stores your submission in a CSV file named submission_x, where x is the number of the submission. Upload this CSV file and this Jupyter notebook to the acmanager platform: https://analytics-cup.dss.in.tum.de/acm/.

In [None]:
def prepare_prediction(df, id_column: str = 'id', prediction_column: str = 'is_high_magnitude') -> pd.DataFrame:
    """
    Prepare the prediction DataFrame for submission.

    Args:
    df (pd.DataFrame): The DataFrame containing the predictions.
    id_column (str): The name of the column containing the event IDs.
    prediction_column (str): The name of the column containing the predictions.

    Returns:
    pd.DataFrame: A DataFrame with the required format for submission.
    """
    # Check if the DataFrame contains the required columns
    if id_column not in df.columns or prediction_column not in df.columns:
        raise ValueError(f"Columns '{id_column}' and '{prediction_column}' are required in the DataFrame.")

    # Create a copy of the DataFrame with the required columns
    submission_df = df[[id_column, prediction_column]].copy()

    # Rename the prediction column to 'prediction'
    submission_df.rename(columns={prediction_column: 'prediction'}, inplace=True)

    return submission_df

In [122]:
# Save the data set
submission_df = prepare_prediction(
    df = , # Add your data set here
    id_column = , # Add the name of the column that contains the ids
    prediction_column = , # Add the name of the column that contains your prediction
)

submission_df.to_csv(f'submission_{submission_counter}', index=False)
submission_counter += 1