# Formula 1 2021 Weather Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#analysis">Data Analysis</a></li>
<li><a href="#ML">Modeling</a></li>
</ul>

<a id='intro'></a>
## Introduction
Formula 1 is one of the most competitive sports in the world. Engineers and technicians from every team use weather radar screens, provided by Ubimet, to track the current weather and make predictions during the race. Race engineers relay precise information to drivers, including:

- How many minutes until it starts raining
- Intensity of the rain
- Which corner will be hit first by the rain
- Duration of the rain

#### Dataset Description

> I will go through Formula 1 2021 Weather Data for the Formula AI Hackathon '22 - Challenge 1.

> Here are some of the important attributes:
- SESSION_UID: Unique identifier for the session
- SESSION_TIME: Session time in seconds
- TIMESTAMP: Unique for every second, per session UID and player car index.
    - date: Column derived from `TIMESTAMP`
- TRACK_TEMPERATURE: Track temperature in degrees Celsius
    - WEATHER_FORECAST_SAMPLES_M_TRACK_TEMPERATURE: The same quantity, reported for the weather forecast samples.
- AIR_TEMPERATURE: Air temperature in degrees Celsius
    - WEATHER_FORECAST_SAMPLES_M_AIR_TEMPERATURE: The same quantity, reported for the weather forecast samples.
- NUM_WEATHER_FORECAST_SAMPLES: Number of weather samples to follow
- SEASON_LINK_IDENTIFIER: Identifier for session - persists across saves
- SESSION_TYPE: 0 = unknown, 1 = P1, 2 = P2, 3 = P3, 4 = Short P, 5 = Q1, 6 = Q2, 7 = Q3, 8 = Short Q, 9 = OSQ, 10 = R, 11 = R2, 12 = Time Trial
    - For practice sessions (P1, P2, P3) the maximum session length is 1 hour.
    - For qualifying sessions (Q1, Q2, Q3) the maximum session length is 18 minutes.
    - WEATHER_FORECAST_SAMPLES_M_SESSION_TYPE: The same quantity, reported for the weather forecast samples.
- SESSION_TIME_LEFT: Time left in session in seconds
- SESSION_DURATION: Session duration in seconds
- WEATHER: 0 = clear, 1 = light cloud, 2 = overcast, 3 = light rain, 4 = heavy rain, 5 = storm
    - WEATHER_FORECAST_SAMPLES_M_WEATHER: The same quantity, reported for the weather forecast samples.
- TRACK_TEMPERATURE_CHANGE: 0 = up, 1 = down, 2 = no change
- AIR_TEMPERATURE_CHANGE: 0 = up, 1 = down, 2 = no change
- RAIN_PERCENTAGE: Rain percentage (0-100)
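
The integer codes listed above are easier to read in plots and printouts once they are mapped back to labels. The following is a minimal sketch of such a lookup; the dictionary names and the `decode` helper are hypothetical and not part of the dataset or the hackathon material.

```python
# Lookup tables copied from the attribute list above.
SESSION_TYPE_NAMES = {
    0: "unknown", 1: "P1", 2: "P2", 3: "P3", 4: "Short P",
    5: "Q1", 6: "Q2", 7: "Q3", 8: "Short Q", 9: "OSQ",
    10: "R", 11: "R2", 12: "Time Trial",
}
WEATHER_NAMES = {
    0: "clear", 1: "light cloud", 2: "overcast",
    3: "light rain", 4: "heavy rain", 5: "storm",
}
TEMPERATURE_CHANGE_NAMES = {0: "up", 1: "down", 2: "no change"}


def decode(code, names):
    """Return the label for an integer code, or a placeholder for unlisted codes."""
    return names.get(int(code), f"unrecognised ({code})")


print(decode(10, SESSION_TYPE_NAMES))
print(decode(3, WEATHER_NAMES))
print(decode(2, TEMPERATURE_CHANGE_NAMES))
```

With pandas, the same dictionaries can be passed to `Series.map` to label a whole column at once.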

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [2]:
df = pd.read_csv("/kaggle/input/formulaaihackathon2022/weather.csv", low_memory=False)

<a id='wrangling'></a>
## Data Wrangling



### General Properties

In [3]:
# Convert Unix timestamps (seconds) to UTC date strings, then sort chronologically
df['date'] = pd.to_datetime(df['TIMESTAMP'], unit='s').dt.strftime('%Y-%m-%d %H:%M:%S')
df.sort_values(by='date', inplace=True)

> The `date` column is derived from `TIMESTAMP`. The `TIMESTAMP` column is no longer needed, so let's drop it.

In [4]:
df.drop(columns=["TIMESTAMP"],inplace=True)

In [5]:
df['date']

> Dropping columns that will not help our analysis: columns that contain only a single value (no variety) and columns unrelated to the weather.
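
Single-valued columns can also be found programmatically rather than by inspection. The sketch below is one way to do that; `constant_columns` is a hypothetical helper, and the toy frame uses made-up values purely for illustration.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Names of columns holding at most one distinct value (NaN counts as a value)."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


# Toy frame with hypothetical values; only M_WEATHER varies.
toy = pd.DataFrame({
    "M_PACKET_FORMAT": [2021, 2021, 2021],
    "M_WEATHER": [0, 1, 0],
    "M_GAME_PAUSED": [0, 0, 0],
})

print(constant_columns(toy))
```

Columns that vary but are unrelated to the weather still have to be chosen by hand.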

In [6]:
df.drop(columns=["Unnamed: 58", "M_PACKET_FORMAT", "M_GAME_MAJOR_VERSION", "M_GAME_MINOR_VERSION", "M_PACKET_VERSION",
                "M_PACKET_ID", "M_SECONDARY_PLAYER_CAR_INDEX", "M_ZONE_FLAG", "M_PIT_STOP_WINDOW_IDEAL_LAP",
                "M_GAME_PAUSED", "GAMEHOST", "M_SLI_PRO_NATIVE_SUPPORT", "M_SAFETY_CAR_STATUS", "M_BRAKING_ASSIST",
                "M_PIT_RELEASE_ASSIST", "M_ZONE_START",
                "M_FORECAST_ACCURACY", "M_SPECTATOR_CAR_INDEX", "M_PIT_STOP_WINDOW_LATEST_LAP",
                "M_WEEKEND_LINK_IDENTIFIER", "M_DYNAMIC_RACING_LINE_TYPE", "M_PIT_STOP_REJOIN_POSITION", "M_AI_DIFFICULTY",
                "M_PIT_SPEED_LIMIT", "M_NETWORK_GAME", "M_TOTAL_LAPS", "M_STEERING_ASSIST", "M_IS_SPECTATING",
                "M_DYNAMIC_RACING_LINE", "M_DRSASSIST", "M_NUM_MARSHAL_ZONES"], inplace=True)

In [7]:
df = df[(df['M_NUM_WEATHER_FORECAST_SAMPLES']!=0) & (df['M_SESSION_TYPE']!=0)]

>### General information from the data:
- About one in three players uses `pit assist`.
- Most cars use manual gearbox assist (which makes sense).
- `DYNAMIC_RACING_LINE_TYPE`: most players use the 2D line type.
- The speed limit in pit and marshal areas must be under 80.
- Most players are offline players.
- Almost all races are 200-lap races.
- Players using steering assist are rare.

> A touch of beauty to column names

In [8]:
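# Strip "M_" only at the start of a name or right after an underscore, so "NUM_" stays intact
df.columns = df.columns.str.replace(r"(^|_)M_", r"\1", regex=True)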
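
A note on stripping the prefix: a plain substring replacement of `M_` also matches inside names such as `M_NUM_WEATHER_FORECAST_SAMPLES`, because `NUM_` ends in `M_`. The sketch below compares the plain replacement with an anchored regular expression on a few column names; `strip_m_prefix` is a hypothetical helper written for this illustration.

```python
import re


def strip_m_prefix(name: str) -> str:
    """Remove `M_` only at the start of the name or directly after an underscore."""
    return re.sub(r"(^|_)M_", r"\1", name)


names = [
    "M_NUM_WEATHER_FORECAST_SAMPLES",
    "M_WEATHER_FORECAST_SAMPLES_M_WEATHER",
    "M_RAIN_PERCENTAGE",
]

for name in names:
    # Left: plain substring replacement. Right: anchored regex.
    print(name.replace("M_", ""), "|", strip_m_prefix(name))
```

The two approaches agree on most columns and differ only where `M_` appears in the middle of a word.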

In [9]:
df.columns

In [10]:
df.head(5)

In [11]:
df.duplicated().sum()

In [12]:
df.drop_duplicates(inplace=True)

In [13]:
df.isna().sum() / len(df)*100

In [14]:
df.dropna(inplace=True)

In [15]:
df.info()

In [16]:
df.describe()

> Now our data is clean:
- suitable data types
- No null values
- No duplicates
- Reasonable ranges
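
The checks in this list can be expressed as a small helper, so they are easy to re-run after any later change to the cleaning steps. This is only a sketch: `check_clean` is a hypothetical function, the value ranges come from the attribute list in the introduction, and the toy frame holds made-up values.

```python
import pandas as pd


def check_clean(frame: pd.DataFrame) -> dict:
    """Summarise null values, duplicate rows and value ranges for a cleaned frame."""
    return {
        "null_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # RAIN_PERCENTAGE is documented as 0-100, WEATHER as codes 0-5.
        "rain_in_range": bool(frame["RAIN_PERCENTAGE"].between(0, 100).all()),
        "weather_codes_valid": bool(frame["WEATHER"].isin(range(6)).all()),
    }


# Toy frame with hypothetical values.
toy = pd.DataFrame({"WEATHER": [0, 1, 2], "RAIN_PERCENTAGE": [3, 10, 45]})
print(check_clean(toy))
```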


> We are left with 27 columns (cleaning removed almost half of them) and 690,000 rows.

<a id='analysis'></a>
## Data Analysis

In [17]:
# importing visualizing libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

rc_dic={'figure.figsize':(12,8),'font.size':20,'figure.titlesize':'medium','legend.fontsize':'small'}
sns.set(rc=rc_dic,style='ticks')

In [18]:
plt.figure(figsize=(10,5))
df.set_index('date')['RAIN_PERCENTAGE'].plot()
plt.xticks(rotation=90)
plt.show()

In [20]:
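weather_names = {0: "clear", 1: "light cloud", 2: "overcast", 3: "light rain", 4: "heavy rain", 5: "storm"}
weather_counts = df['WEATHER'].value_counts().sort_index()  # order slices by weather code, not by frequency
plt.pie(weather_counts, labels=list(weather_counts.index.map(weather_names)))
plt.show()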

> The weather is mostly clear or lightly clouded.

In [22]:
sns.boxenplot(x="WEATHER",y="RAIN_PERCENTAGE", color="g", scale="linear", data=df, width = .8)
plt.xticks((0,1,2), ("clear", "light cloud", "overcast"));

> As we can see, there is always some rain percentage reported, even when the weather is clear. We can also say that storms are rare, thank God.

In [23]:
df.columns

In [24]:
plt.figure(figsize=(5,5))
sns.histplot(data=df, x='RAIN_PERCENTAGE', kde=True, color='#94A4EC', linewidth=0);

>`RAIN_PERCENTAGE` usually doesn't exceed 20 percent
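
A reading taken from a histogram can be backed by a number: the share of rows at or below a threshold. The sketch below shows the calculation on a toy series with made-up values; `share_at_or_below` is a hypothetical helper, and on the real data it would be applied to `df['RAIN_PERCENTAGE']`.

```python
import pandas as pd


def share_at_or_below(values: pd.Series, threshold: float) -> float:
    """Fraction of entries that are less than or equal to the threshold."""
    return float((values <= threshold).mean())


# Toy values, not taken from the dataset.
toy = pd.Series([0, 2, 5, 8, 12, 18, 20, 35, 60, 4])
print(share_at_or_below(toy, 20))
```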

In [25]:
sns.boxplot(x="TRACK_TEMPERATURE_CHANGE",y="TRACK_TEMPERATURE", color="g", data=df, width = .8)
plt.xticks((0,1,2), ("Up", "Down", "No change"));

In [26]:
plt.figure(figsize=(7,7))
sns.scatterplot(data=df, x='TRACK_TEMPERATURE', y='AIR_TEMPERATURE')
plt.show()

<a id='ML'></a>
## Modeling

In [27]:
weather = df.loc[:,['date','TRACK_ID','FORMULA', 'SESSION_TYPE', 'TIME_OFFSET', 'WEATHER',
                    'TRACK_TEMPERATURE_CHANGE', 'WEATHER_FORECAST_SAMPLES_TRACK_TEMPERATURE',
                    'WEATHER_FORECAST_SAMPLES_AIR_TEMPERATURE',
                   'AIR_TEMPERATURE_CHANGE', 'RAIN_PERCENTAGE', 'WEATHER_FORECAST_SAMPLES_WEATHER',
                   'TRACK_TEMPERATURE','AIR_TEMPERATURE']]

In [28]:
weather['TIME_OFFSET'].value_counts()

In [29]:
weather.head()

In [30]:
dates=weather.groupby('date').median()
dates.reset_index(inplace=True)

In [31]:
splitting_date = dates['date'].str.split(expand=True)
dates['dated'],dates['time']=splitting_date[0],splitting_date[1]

In [32]:
months_days = dates['dated'].str.split('-',expand=True)
dates['month'],dates['day'] = months_days[1],months_days[2]

In [33]:
hours_minutes_seconds = dates['time'].str.split(':',expand=True)
dates['hour'],dates['minute'],dates['second']=hours_minutes_seconds[0],hours_minutes_seconds[1],hours_minutes_seconds[2]

In [34]:
dates.drop(columns=['dated','time','date'], inplace=True)

In [35]:
dates.head()

In [36]:
dates.drop(columns=['month','second'],inplace=True)

In [37]:
for i in ['day','hour','minute']:
    dates[i] = dates[i].astype('int')
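
The string splitting in the cells above can alternatively be sketched with the pandas datetime accessor, which parses the `date` strings once and exposes the day, hour and minute as integers. This is an alternative route, not the notebook's method, and the toy frame below contains made-up dates.

```python
import pandas as pd

# Toy frame using the same 'YYYY-MM-DD HH:MM:SS' layout as the `date` column.
toy = pd.DataFrame({"date": ["2022-01-15 13:05:42", "2022-01-16 09:30:00"]})

parsed = pd.to_datetime(toy["date"], format="%Y-%m-%d %H:%M:%S")
toy["day"] = parsed.dt.day
toy["hour"] = parsed.dt.hour
toy["minute"] = parsed.dt.minute

print(toy[["day", "hour", "minute"]])
```

This avoids the intermediate `dated` and `time` columns and the later cast to `int`.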

In [38]:
grouped_data=dates.groupby(['day','hour','minute']).median()
grouped_data.reset_index(inplace=True)

In [39]:
grouped_data.to_csv("Base_data.csv")

In [40]:
final = pd.read_csv('../input/finalll/Base_data.csv')

In [41]:
final.drop(columns=['Unnamed: 0'],inplace=True)
final.dropna(inplace=True)

In [42]:
final.drop(columns=['day','hour','minute'],inplace=True)

### Preparing the data for training the model

In [43]:
time_series = []
for i in range(final.shape[0]-1):
    time_series.append(final.values[i:i+1])

X = np.array(time_series)[:-1]

In [44]:
# The target for each window is the row that immediately follows it
y = final.values[1:len(X) + 1]

In [45]:
X.shape, y.shape
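
The windowing idea behind `X` and `y` can be written as one function: each sample is a block of consecutive rows, and its target is the row that comes right after the block. The sketch below generalises this to any window length; `make_windows` is a hypothetical helper, and with `window=1` it yields one more sample than the cells above because it keeps every row that has a successor.

```python
import numpy as np


def make_windows(values, window=1):
    """Build (samples, window, features) inputs and next-row targets."""
    values = np.asarray(values)
    n_samples = len(values) - window
    if n_samples <= 0:
        raise ValueError("need more rows than the window length")
    X = np.stack([values[i:i + window] for i in range(n_samples)])
    y = values[window:window + n_samples]
    return X, y


# Toy series: 6 time steps, 2 features.
toy = np.arange(12).reshape(6, 2)
X_toy, y_toy = make_windows(toy, window=1)
print(X_toy.shape, y_toy.shape)
print(X_toy[0], "->", y_toy[0])
```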

In [46]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split


import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.callbacks import Callback, ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

In [47]:
# Hold out 10% of the samples for validation (train_test_split is imported in the previous cell)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [48]:
X_train.shape, X_test.shape
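
`train_test_split` shuffles the samples by default. For time-ordered data, a common alternative is to hold out the most recent samples instead, so that validation only uses data that comes after the training period. The sketch below shows that alternative; it is not what this notebook does, and `chronological_split` is a hypothetical helper.

```python
import numpy as np


def chronological_split(X, y, test_fraction=0.1):
    """Keep the earliest samples for training and the latest for testing."""
    n_test = max(1, int(round(len(X) * test_fraction)))
    split = len(X) - n_test
    return X[:split], X[split:], y[:split], y[split:]


# Toy data: 20 ordered samples, 1 time step, 1 feature.
X_toy = np.arange(20).reshape(20, 1, 1)
y_toy = np.arange(1, 21).reshape(20, 1)

X_tr, X_te, y_tr, y_te = chronological_split(X_toy, y_toy, test_fraction=0.1)
print(X_tr.shape, X_te.shape)
```

Passing `shuffle=False` to `train_test_split` gives a comparable ordered split.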

In [52]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(1, 13)),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(13)
])

model.summary()

In [53]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), loss='mae')

In [54]:
save_weights_path = 'model_weights_1.hdf5'

checkpoint = ModelCheckpoint(save_weights_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min', save_freq='epoch')
early = EarlyStopping(monitor="val_loss", mode='min', patience=20, restore_best_weights=True)
redlr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10)
callbacks_list = [checkpoint, early, redlr]

In [55]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1500, verbose=1, callbacks=callbacks_list)

In [57]:
y_pred = model.predict(np.expand_dims(X_test[10], axis=0))

In [61]:
y_pred.round()

In [62]:
y_test[10]
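
Comparing a single prediction with its target gives only a rough impression. Since the model is trained with an MAE loss, the same error can be broken down per feature over a whole set of predictions. The sketch below uses toy arrays with made-up values; `mae_per_feature` is a hypothetical helper, and in the notebook it would be called with `y_test` and `model.predict(X_test)`.

```python
import numpy as np


def mae_per_feature(y_true, y_pred):
    """Mean absolute error of each output column, averaged over samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean(axis=0)


# Toy targets and predictions: 2 samples, 2 features.
y_true_toy = np.array([[1.0, 20.0], [3.0, 22.0]])
y_pred_toy = np.array([[2.0, 20.0], [3.0, 25.0]])

print(mae_per_feature(y_true_toy, y_pred_toy))
```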

In [63]:
model.save('weather_forecast_1.h5')

In [64]:
model = tf.keras.models.load_model('./weather_forecast_1.h5')

In [65]:
def predict_weather(model, example):
    """Roll the one-step model forward 12 times, starting from a raw dataset row (a pandas Series)."""
    # Positional indices of the raw-dataset columns that the model uses as inputs
    feature_idx = [26, 28, 32, 41, 48, 44, 43, 45, 46, 47, 42, 17, 22]
    current_weather = [example.iloc[i] for i in feature_idx]

    result = {}   # predictions keyed by minutes ahead, one step per 5 minutes
    for i in range(1, 13):
        next_weather = model.predict(np.expand_dims([current_weather], axis=0))
        next_weather = np.ceil(next_weather[0])
        result[i*5] = {'type': next_weather[4], 'rain_percentage': next_weather[-4]}
        current_weather = next_weather   # feed the prediction back in as the next input

    # keep only the 5, 10, 15, 30 and 60 minute horizons, in ascending order
    return {key: result[key] for key in (5, 10, 15, 30, 60)}
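
The function above is a recursive multi-step forecast: each prediction becomes the input for the next step, and only some horizons are reported. The sketch below isolates that pattern with a stand-in step function in place of the trained model; `rollout` and `toy_step` are hypothetical names, and the toy numbers carry no meteorological meaning.

```python
import numpy as np


def rollout(step_fn, start, n_steps, minutes_per_step=5):
    """Apply step_fn repeatedly, feeding each output back in as the next input."""
    current = np.asarray(start, dtype=float)
    forecasts = {}
    for i in range(1, n_steps + 1):
        current = step_fn(current)
        forecasts[i * minutes_per_step] = current
    return forecasts


def toy_step(state):
    """Hypothetical stand-in for model.predict: adds 1 to every feature."""
    return state + 1.0


forecasts = rollout(toy_step, [0.0, 10.0], n_steps=12)

# Report only the requested horizons, in ascending order.
selected = {h: forecasts[h] for h in (5, 10, 15, 30, 60)}
for horizon, state in selected.items():
    print(horizon, state)
```

Because every step builds on the previous prediction, errors accumulate with the horizon, so the 60 minute forecast is the least reliable.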

In [66]:
df = pd.read_csv("/kaggle/input/formulaaihackathon2022/weather.csv", low_memory=False)

> Since the input must be an example row from the initial dataset, we load the dataset again.

In [67]:
predict_weather(model, df.loc[100,:])