VISIBILITY DISTANCE PREDICTION

PROBLEM STATEMENT:
Air Traffic Control (ATC) requires accurate tracking of weather conditions to predict visibility distance, a critical parameter for safe flight operations. Because flight operations depend heavily on sufficient visibility, a reliable prediction model for visibility distance is essential for safety and efficiency in aviation.

DATA COLLECTION:
The dataset was collected from public meteorological repositories, government open-data platforms, aviation weather stations, and Kaggle weather datasets, which provide hourly weather parameters such as temperature, humidity, visibility, wind speed, and pressure. The raw data was merged and standardized into a single table; missing values and outliers were deliberately retained to reflect real-world conditions.

FEATURE INFORMATION:
DATE - Timestamp of the hourly observation.
VISIBILITY - Distance from which an object can be seen.
DRYBULBTEMPF - Dry bulb temperature (degrees Fahrenheit). Most commonly reported standard temperature.
WETBULBTEMPF - Wet bulb temperature (degrees Fahrenheit).
DewPointTempF - Dew point temperature (degrees Fahrenheit).
RelativeHumidity - Relative humidity (percent).
WindSpeed - Wind speed (miles per hour).
WindDirection - Wind direction measured from true north, in degrees (0 to 360).
StationPressure - Atmospheric pressure (inches of mercury, or "in Hg").
SeaLevelPressure - Sea level pressure (in Hg).
Precip - Total precipitation in the past hour (inches).
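
The three temperature features are physically related: for a valid observation the dew point cannot exceed the wet bulb temperature, which in turn cannot exceed the dry bulb temperature. A minimal sketch of a consistency check follows, assuming a small hand-made frame (`demo`) in place of the real data; the values and the `violations` name are illustrative only.

```python
import pandas as pd

# Synthetic stand-in for three columns of the dataset (not real observations).
demo = pd.DataFrame({
    "DRYBULBTEMPF": [80.0, 30.0, 37.0],
    "WETBULBTEMPF": [68.0, 69.0, 30.0],
    "DewPointTempF": [50.0, 78.0, -18.0],
})

# Compare only complete rows; a comparison against NaN is always False
# and would otherwise be reported as a violation.
complete = demo.dropna(subset=["DRYBULBTEMPF", "WETBULBTEMPF", "DewPointTempF"])

# Expected ordering: dew point <= wet bulb <= dry bulb.
consistent = (
    (complete["DewPointTempF"] <= complete["WETBULBTEMPF"])
    & (complete["WETBULBTEMPF"] <= complete["DRYBULBTEMPF"])
)
violations = complete[~consistent]
print(violations)
```

Rows flagged this way are candidates for review rather than automatic removal, since the dataset intentionally keeps noisy readings.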

In [73]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

df=pd.read_csv(r'Visibility_weather_data.csv')

In [74]:
df.head(5)

Unnamed: 0,DATE,VISIBILITY,DRYBULBTEMPF,WETBULBTEMPF,DewPointTempF,RelativeHumidity,WindSpeed,WindDirection,StationPressure,SeaLevelPressure,Precip
0,01-01-2010 00:00,9.993428,80.0,68.0,,7.0,4.626061,253.0,30.02,30.41,0.11
1,01-01-2010 01:00,8.723471,30.0,69.0,78.0,77.0,0.913084,179.0,29.48,30.03,0.13
2,01-01-2010 02:00,10.295377,,2.0,41.0,43.0,10.303187,7.0,29.87,30.24,0.08
3,01-01-2010 03:00,12.04606,,48.0,44.0,70.0,17.072654,188.0,29.99,30.05,0.07
4,01-01-2010 04:00,8.531693,37.0,,-18.0,77.0,9.17651,34.0,29.57,30.13,0.15


In [75]:
df.describe()

Unnamed: 0,VISIBILITY,DRYBULBTEMPF,WETBULBTEMPF,DewPointTempF,RelativeHumidity,WindSpeed,WindDirection,StationPressure,SeaLevelPressure,Precip
count,9800.0,9800.0,9800.0,9800.0,9800.0,9800.0,9800.0,9800.0,9800.0,9800.0
mean,8.9889,52.540102,44.503061,32.098163,51.74051,10.51515,178.90449,29.897365,29.997285,0.089835
std,1.99376,30.834047,26.119811,30.201077,27.910659,8.552105,104.445345,0.19842,0.199591,0.067523
min,1.155199,0.0,0.0,-20.0,5.0,0.000579,0.0,29.17,29.25,0.0
25%,7.654819,26.0,22.0,6.0,28.0,6.559318,88.0,29.76,29.86,0.04
50%,8.989333,53.0,44.0,32.0,51.5,9.967941,179.0,29.9,30.0,0.08
75%,10.339534,79.0,67.0,58.0,76.0,13.434202,268.0,30.03,30.13,0.13
max,14.0,195.0,89.0,84.0,100.0,214.417026,360.0,30.63,30.74,0.43
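
The summary above shows maxima far beyond the upper quartile for some columns (DRYBULBTEMPF peaks at 195 against a 75th percentile of 79, and WindSpeed at about 214 against 13.4), which is consistent with the retained outliers. One common way to flag such values is the interquartile-range (Tukey) rule. The sketch below is an assumption about how this could be done, not a step from the notebook; `iqr_outlier_mask` is a hypothetical helper and `sample` is synthetic.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Synthetic stand-in for the DRYBULBTEMPF column (not the real data).
sample = pd.Series([26.0, 40.0, 53.0, 60.0, 79.0, 195.0], name="DRYBULBTEMPF")

mask = iqr_outlier_mask(sample)
outliers = sample[mask]
print(outliers)
```

On the real frame the same helper could be applied per column, for example `df[numeric_cols].apply(iqr_outlier_mask).sum()`, where `numeric_cols` is whatever list of numeric columns is in use.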


In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DATE              10000 non-null  object 
 1   VISIBILITY        9800 non-null   float64
 2   DRYBULBTEMPF      9800 non-null   float64
 3   WETBULBTEMPF      9800 non-null   float64
 4   DewPointTempF     9800 non-null   float64
 5   RelativeHumidity  9800 non-null   float64
 6   WindSpeed         9800 non-null   float64
 7   WindDirection     9800 non-null   float64
 8   StationPressure   9800 non-null   float64
 9   SeaLevelPressure  9800 non-null   float64
 10  Precip            9800 non-null   float64
dtypes: float64(10), object(1)
memory usage: 859.5+ KB


In [77]:
df.shape

(10000, 11)

In [78]:
df['DATE']=pd.to_datetime(df['DATE'], format='%d-%m-%Y %H:%M')

In [79]:
df.isnull().sum()

DATE                  0
VISIBILITY          200
DRYBULBTEMPF        200
WETBULBTEMPF        200
DewPointTempF       200
RelativeHumidity    200
WindSpeed           200
WindDirection       200
StationPressure     200
SeaLevelPressure    200
Precip              200
dtype: int64

In [80]:
# Inspect rows with at least one missing value to see whether the gaps fall in the same rows across columns
df[df.isnull().any(axis=1)].head()

Unnamed: 0,DATE,VISIBILITY,DRYBULBTEMPF,WETBULBTEMPF,DewPointTempF,RelativeHumidity,WindSpeed,WindDirection,StationPressure,SeaLevelPressure,Precip
0,2010-01-01 00:00:00,9.993428,80.0,68.0,,7.0,4.626061,253.0,30.02,30.41,0.11
2,2010-01-01 02:00:00,10.295377,,2.0,41.0,43.0,10.303187,7.0,29.87,30.24,0.08
3,2010-01-01 03:00:00,12.04606,,48.0,44.0,70.0,17.072654,188.0,29.99,30.05,0.07
4,2010-01-01 04:00:00,8.531693,37.0,,-18.0,77.0,9.17651,34.0,29.57,30.13,0.15
13,2010-01-01 13:00:00,5.17344,20.0,43.0,75.0,20.0,,146.0,29.92,,
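
The rows above have gaps in different columns, which suggests the missing values are scattered rather than confined to the same 200 rows. Counting missing cells per row makes this explicit. The following is a minimal sketch on a synthetic frame (`demo`), with illustrative values; on the notebook's data the same two expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for df (values are illustrative only).
demo = pd.DataFrame({
    "DRYBULBTEMPF": [80.0, np.nan, 37.0, 20.0],
    "DewPointTempF": [np.nan, 41.0, -18.0, 75.0],
    "WindSpeed": [4.6, 10.3, 9.2, np.nan],
    "Precip": [0.11, 0.08, 0.15, np.nan],
})

# Number of missing cells in each row.
missing_per_row = demo.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing cells.
distribution = missing_per_row.value_counts().sort_index()

# Rows in which every column is missing (would indicate row-aligned gaps).
fully_missing_rows = int(demo.isnull().all(axis=1).sum())

print(distribution)
print(fully_missing_rows)
```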


In [81]:
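# Impute missing values with column medians; numeric_only=True skips the datetime DATE column
df.fillna(df.median(numeric_only=True), inplace=True)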

In [82]:
df.isnull().sum()

DATE                0
VISIBILITY          0
DRYBULBTEMPF        0
WETBULBTEMPF        0
DewPointTempF       0
RelativeHumidity    0
WindSpeed           0
WindDirection       0
StationPressure     0
SeaLevelPressure    0
Precip              0
dtype: int64
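
The median fill above also replaces missing VISIBILITY values, which means some target labels are imputed rather than observed. An alternative, shown here only as a sketch and not as the notebook's method, is to drop rows whose target is missing and impute the features alone. The frame `demo` and its values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df with gaps in both the target and the features.
demo = pd.DataFrame({
    "VISIBILITY": [9.9, np.nan, 10.3, 8.5],
    "DRYBULBTEMPF": [80.0, 30.0, np.nan, 37.0],
    "WindSpeed": [4.6, 0.9, 10.3, np.nan],
})

target = "VISIBILITY"

# Drop rows whose target is missing instead of inventing a label for them.
cleaned = demo.dropna(subset=[target]).copy()

# Fill feature gaps with per-column medians computed on the remaining rows.
feature_cols = [c for c in cleaned.columns if c != target]
medians = cleaned[feature_cols].median()
cleaned[feature_cols] = cleaned[feature_cols].fillna(medians)
print(cleaned)
```

If the data is later split into training and test sets, computing the medians on the training portion only would avoid leaking test-set statistics into the imputation.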

In [87]:
#Define numerical and categorical columns
columns = [column for column in df.columns if column != "VISIBILITY"]  # VISIBILITY is our target column

# DATE is datetime64 after parsing, so test for numeric dtypes explicitly instead of "not object"
numerical_features = [feature for feature in columns if pd.api.types.is_numeric_dtype(df[feature])]
categorical_features = [feature for feature in columns if df[feature].dtype == 'O']

print('We have {} numerical features : {}'.format(len(numerical_features), numerical_features))
print('We have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 9 numerical features : ['DRYBULBTEMPF', 'WETBULBTEMPF', 'DewPointTempF', 'RelativeHumidity', 'WindSpeed', 'WindDirection', 'StationPressure', 'SeaLevelPressure', 'Precip']
We have 0 categorical features : []
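
DATE is a datetime column rather than a measured quantity, so most models cannot use it directly. One option is to derive numeric calendar features from it. The sketch below assumes the same `%d-%m-%Y %H:%M` format parsed earlier; the derived column names (`hour`, `month`, `dayofyear`) are hypothetical and the timestamps are made up for illustration.

```python
import pandas as pd

# Synthetic timestamps in the same string format as the DATE column.
demo = pd.DataFrame(
    {"DATE": ["01-01-2010 00:00", "01-01-2010 13:00", "15-07-2010 04:00"]}
)
demo["DATE"] = pd.to_datetime(demo["DATE"], format="%d-%m-%Y %H:%M")

# Derive numeric calendar features through the .dt accessor.
demo["hour"] = demo["DATE"].dt.hour
demo["month"] = demo["DATE"].dt.month
demo["dayofyear"] = demo["DATE"].dt.dayofyear
print(demo)
```

Whether these features help depends on how strongly visibility follows daily or seasonal patterns in the data, which would need to be checked empirically.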
