## 📘 Overview

This project analyzes historical hourly weather data and uses supervised
machine learning to predict the weather condition (for example Clear,
Cloudy, or Rain) from measurements such as temperature, humidity, and
pressure. It walks through a complete data science workflow, from
preprocessing to model evaluation.

------------------------------------------------------------------------
## 🧩 Objectives

-   Load and explore the weather dataset
-   Clean, encode, and scale the data
-   Select the most relevant features
-   Train and evaluate classification models
-   Optimize model parameters and compare results
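
The objectives above map naturally onto a scikit-learn `Pipeline`. The following is a minimal sketch rather than the notebook's actual code: it uses synthetic data from `make_classification` as a stand-in for the six numeric weather features, and the choice of `MinMaxScaler`, `chi2`, and `k=4` is an illustrative assumption. Min-max scaling comes first because `chi2` requires non-negative inputs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the six numeric weather features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_pipeline = Pipeline([
    ("scale", MinMaxScaler()),            # chi2 needs non-negative inputs
    ("select", SelectKBest(chi2, k=4)),   # k=4 is an illustrative choice
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_tr, y_tr)

demo_accuracy = demo_pipeline.score(X_te, y_te)
n_selected = int(demo_pipeline.named_steps["select"].get_support().sum())
print("Selected features:", n_selected, "| test accuracy:", demo_accuracy)
```

Wrapping the steps in one pipeline means the scaler and selector are fitted on the training split only, which avoids leaking test-set statistics into preprocessing.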

In [28]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

In [29]:
df = pd.read_csv("Weather Data.csv")
df.head()

Unnamed: 0,Date/Time,Temp_C,Dew Point Temp_C,Rel Hum_%,Wind Speed_km/h,Visibility_km,Press_kPa,Weather
0,1/1/2012 0:00,-1.8,-3.9,86,4,8.0,101.24,Fog
1,1/1/2012 1:00,-1.8,-3.7,87,4,8.0,101.24,Fog
2,1/1/2012 2:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog"
3,1/1/2012 3:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog"
4,1/1/2012 4:00,-1.5,-3.3,88,7,4.8,101.23,Fog


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date/Time         8784 non-null   object 
 1   Temp_C            8784 non-null   float64
 2   Dew Point Temp_C  8784 non-null   float64
 3   Rel Hum_%         8784 non-null   int64  
 4   Wind Speed_km/h   8784 non-null   int64  
 5   Visibility_km     8784 non-null   float64
 6   Press_kPa         8784 non-null   float64
 7   Weather           8784 non-null   object 
dtypes: float64(4), int64(2), object(2)
memory usage: 549.1+ KB


In [31]:
df["Weather"].value_counts()

Weather
Mainly Clear                               2106
Mostly Cloudy                              2069
Cloudy                                     1728
Clear                                      1326
Snow                                        390
Rain                                        306
Rain Showers                                188
Fog                                         150
Rain,Fog                                    116
Drizzle,Fog                                  80
Snow Showers                                 60
Drizzle                                      41
Snow,Fog                                     37
Snow,Blowing Snow                            19
Rain,Snow                                    18
Thunderstorms,Rain Showers                   16
Haze                                         16
Drizzle,Snow,Fog                             15
Freezing Rain                                14
Freezing Drizzle,Snow                        11
Freezing Drizzle                

In [32]:
# Collapse the detailed labels into six coarse classes. Order matters:
# each row takes the first keyword it matches, so "Rain,Snow" becomes "Snow".
for keyword in ["Clear", "Cloudy", "Snow", "Rain", "Fog"]:
    df["Weather"] = df["Weather"].apply(lambda x, k=keyword: k if k in x else x)

# Group the remaining rare labels under "Other".
rare_labels = ["Drizzle", "Haze", "Freezing Drizzle", "Freezing Drizzle,Haze", "Thunderstorms", ""]
df["Weather"] = df["Weather"].apply(lambda x: "Other" if x in rare_labels else x)
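
To make the precedence of the grouping rules concrete, here is a small standalone sketch applied to a few of the raw labels listed earlier. The helper name `collapse_weather` is hypothetical, and unlike the cell above it sends every unmatched label to `Other` instead of checking an explicit list; that simplification is an assumption.

```python
def collapse_weather(label: str) -> str:
    """Map a raw Weather label to one coarse class; the first match wins."""
    for keyword in ("Clear", "Cloudy", "Snow", "Rain", "Fog"):
        if keyword in label:
            return keyword
    return "Other"  # assumption: anything unmatched is grouped as "Other"


samples = [
    "Mainly Clear",
    "Mostly Cloudy",
    "Rain,Snow",
    "Freezing Drizzle,Fog",
    "Thunderstorms,Rain Showers",
    "Drizzle,Snow,Fog",
    "Haze",
]
collapsed = {raw: collapse_weather(raw) for raw in samples}
for raw, coarse in collapsed.items():
    print(f"{raw!r:32} -> {coarse}")
```

Because `Snow` is checked before `Rain` and `Fog`, mixed labels such as `Rain,Snow` and `Drizzle,Snow,Fog` end up in the `Snow` class. The same function could be passed to `Series.apply`.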

In [33]:
df.Weather.value_counts()

Weather
Cloudy    3797
Clear     3432
Rain       662
Snow       583
Fog        241
Other       69
Name: count, dtype: int64

In [34]:
X = df.drop(["Date/Time", "Weather"], axis=1)

In [35]:
y = df.Weather

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
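
The split above is purely random. With a class as rare as `Other` (69 rows), one option is a stratified split, which keeps class proportions nearly identical in both partitions. The sketch below is an illustration, not the notebook's method: it rebuilds a label vector from the class counts reported above and uses placeholder features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output above.
counts = {"Cloudy": 3797, "Clear": 3432, "Rain": 662,
          "Snow": 583, "Fog": 241, "Other": 69}
labels = np.repeat(list(counts), list(counts.values()))
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

# stratify=labels preserves the class distribution in both partitions.
_, _, y_tr_s, y_te_s = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

other_share_train = np.mean(y_tr_s == "Other")
other_share_test = np.mean(y_te_s == "Other")
print("Share of 'Other' in train:", other_share_train)
print("Share of 'Other' in test: ", other_share_test)
```

In the notebook itself this would amount to passing `stratify=y` to the existing `train_test_split` call; the metrics reported below were produced without it.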

In [37]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [38]:
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

In [39]:
y_pred = model.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
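# zero_division=0 reports 0.0 without a warning for classes that are never predicted (here "Other")
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=model.classes_, zero_division=0))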
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.6107000569151964

Classification Report:
               precision    recall  f1-score   support

       Clear       0.61      0.65      0.63       665
      Cloudy       0.60      0.62      0.61       772
         Fog       0.67      0.54      0.60        54
       Other       0.00      0.00      0.00        10
        Rain       0.55      0.36      0.44       127
        Snow       0.69      0.65      0.67       129

    accuracy                           0.61      1757
   macro avg       0.52      0.47      0.49      1757
weighted avg       0.61      0.61      0.61      1757


Confusion Matrix:
 [[433 224   1   0   3   4]
 [263 481   1   0  16  11]
 [  0   0  29   0  13  12]
 [  0   5   0   0   5   0]
 [  3  60   8   0  46  10]
 [  6  34   4   0   1  84]]
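
The per-class numbers in the classification report can be recovered directly from the confusion matrix: recall is the diagonal divided by the row sums, and precision is the diagonal divided by the column sums. The sketch below uses the matrix printed above; the variable names are illustrative.

```python
import numpy as np

classes = ["Clear", "Cloudy", "Fog", "Other", "Rain", "Snow"]
# Rows are true classes, columns are predicted classes (values from the output above).
cm = np.array([
    [433, 224,  1, 0,  3,  4],
    [263, 481,  1, 0, 16, 11],
    [  0,   0, 29, 0, 13, 12],
    [  0,   5,  0, 0,  5,  0],
    [  3,  60,  8, 0, 46, 10],
    [  6,  34,  4, 0,  1, 84],
])

true_totals = cm.sum(axis=1)       # support per class
predicted_totals = cm.sum(axis=0)  # how often each class was predicted

recall = cm.diagonal() / true_totals
# "Other" is never predicted, so its precision is undefined; report it as 0.
precision = np.divide(
    cm.diagonal(), predicted_totals,
    out=np.zeros(len(classes)), where=predicted_totals > 0,
)
accuracy = cm.trace() / cm.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name:7} precision={p:.2f} recall={r:.2f}")
print("accuracy:", round(accuracy, 4))
```

The zero column for `Other` is what makes its precision undefined, and most of the off-diagonal mass sits in the Clear/Cloudy block.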




In [40]:
from sklearn.metrics import fbeta_score

f2 = fbeta_score(y_test, y_pred, beta=2, average='weighted')
print("F2 Score:", f2)



F2 Score: 0.6087227888121245
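
For reference, the F-beta score combines precision P and recall R as (1 + β²)·P·R / (β²·P + R), so β = 2 weights recall more heavily than precision. The sketch below applies that formula to the `Rain` class, using counts read off the confusion matrix above (46 correct, 84 predicted as Rain, 127 actual Rain samples). The helper name `fbeta` is hypothetical.

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rain class, counts taken from the confusion matrix above.
rain_precision = 46 / 84    # correct Rain predictions / all Rain predictions
rain_recall = 46 / 127      # correct Rain predictions / all true Rain samples

rain_f1 = fbeta(rain_precision, rain_recall, beta=1)
rain_f2 = fbeta(rain_precision, rain_recall, beta=2)
print("Rain F1:", round(rain_f1, 3))
print("Rain F2:", round(rain_f2, 3))
```

Because recall for `Rain` is lower than its precision, F2 comes out below F1 for that class. With `average='weighted'`, scikit-learn computes this per class and then averages the per-class scores weighted by support.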


In [41]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(max_depth=3)

In [42]:
dtc.fit(X_train,y_train)

In [43]:
def decisionTreeClassifier(depths):
    """Fit one tree per max_depth and return (train_accuracy, test_accuracy) pairs."""
    train = []
    test = []
    for d in depths:
        dtc = DecisionTreeClassifier(max_depth=d)
        dtc.fit(X_train, y_train)
        train.append(accuracy_score(y_train, dtc.predict(X_train)))
        test.append(accuracy_score(y_test, dtc.predict(X_test)))
    return list(zip(train, test))

In [44]:
decisionTreeClassifier([1,2,3,4,5,6,7,8,9])

[(0.4612210046961719, 0.4718269778030734),
 (0.5427636260139462, 0.5253272623790552),
 (0.5880176462217163, 0.5799658508821856),
 (0.5995446136331294, 0.5896414342629482),
 (0.6272947203643091, 0.6055776892430279),
 (0.6466486409563114, 0.6260671599317018),
 (0.6689910345809023, 0.6385885031303358),
 (0.6890564963711399, 0.6420034149117815),
 (0.7207912338124377, 0.6311895276038703)]
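
One way to read the sweep above is to pick the depth with the highest test accuracy and look at how the train/test gap grows with depth. The sketch below does that on the pairs printed above; the variable names are illustrative. Selecting a depth on the test set is optimistic, so a validation split or cross-validation would be a cleaner choice in practice.

```python
depths = list(range(1, 10))
# (train_accuracy, test_accuracy) pairs copied from the output above.
scores = [
    (0.4612210046961719, 0.4718269778030734),
    (0.5427636260139462, 0.5253272623790552),
    (0.5880176462217163, 0.5799658508821856),
    (0.5995446136331294, 0.5896414342629482),
    (0.6272947203643091, 0.6055776892430279),
    (0.6466486409563114, 0.6260671599317018),
    (0.6689910345809023, 0.6385885031303358),
    (0.6890564963711399, 0.6420034149117815),
    (0.7207912338124377, 0.6311895276038703),
]

# Depth with the highest test accuracy.
best_depth, (best_train, best_test) = max(
    zip(depths, scores), key=lambda item: item[1][1]
)
# Train minus test accuracy: a growing gap suggests overfitting.
gaps = {d: train - test for d, (train, test) in zip(depths, scores)}

print("Best depth by test accuracy:", best_depth)
for d in depths:
    print(f"depth={d} gap={gaps[d]:+.3f}")
```

The gap roughly doubles between depth 8 and depth 9 while test accuracy drops, which is the usual signature of a tree starting to overfit.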

## 📊 Results

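  ---------------------------------------------------------------------------------
  Model          Accuracy     Precision     Recall        F1-score     F2-score
  -------------- ------------ ------------- ------------- ------------ ------------
  Logistic       0.61         0.61          0.61          0.61         0.61
  Regression                  (weighted)    (weighted)    (weighted)   (weighted)

  Decision       0.64         not           not           not          not
  Tree                        computed      computed      computed     computed
  (depth 8)
  ---------------------------------------------------------------------------------

All values are test-set scores taken from the notebook outputs above. The
logistic regression macro averages are lower (precision 0.52, recall 0.47,
F1 0.49) because the rare classes are predicted poorly. The decision tree
was trained without a fixed `random_state`, so its accuracy may vary
slightly between runs.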

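-   **Confusion Matrix:**\
    Most errors are between Clear and Cloudy (224 Clear samples
    predicted as Cloudy, 263 Cloudy predicted as Clear). Rain and Snow
    are most often mistaken for Cloudy, and the `Other` class (10 test
    samples) is never predicted.
-   **F2 Score:**\
    Used to emphasize recall performance. The weighted F2 of logistic
    regression is 0.609, essentially equal to its weighted F1 (0.61),
    so precision and recall are balanced overall.
-   **Depth Tuning:**\
    Very shallow trees underfit (test accuracy 0.47 at depth 1). Test
    accuracy peaks at 0.642 at depth 8 and falls to 0.631 at depth 9
    while training accuracy keeps rising to 0.721, an early sign of
    overfitting.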
