<div style="text-align: center;">
  <img src="Images/Weather_Types.png" alt="Weather_Types Illustration" width="600"/>
</div>

## Hello!  

This project uses the **Weather-Type-Classification Database** to analyze and classify weather patterns. The aim is to understand the weather characteristics in the dataset and to build models that classify the weather type from the provided features.  

#### About the Dataset  
The dataset, available on [Kaggle](https://www.kaggle.com/datasets/nikhil7280/weather-type-classification), contains synthetically generated weather data designed for classification tasks. It includes a wide range of weather-related variables, offering opportunities to practice data preprocessing, feature engineering, and outlier detection. 

#### Dataset Overview  
The dataset consists of **13,200 rows** and **11 columns**, structured as follows:  

1. **Temperature**: Temperature in degrees Celsius, ranging from extreme cold to extreme heat.  
2. **Humidity**: Humidity percentage, including values above 100% to introduce outliers.  
3. **Wind Speed**: Wind speed in kilometers per hour, including unrealistically high values.  
4. **Precipitation (%)**: Precipitation percentage, with some outlier values.  
5. **Cloud Cover**: Description of cloud cover (categorical).  
6. **Atmospheric Pressure**: Atmospheric pressure in hPa, covering a wide range.  
7. **UV Index**: Strength of ultraviolet radiation (numeric).  
8. **Season**: Season during which the data was recorded (categorical).  
9. **Visibility (km)**: Visibility in kilometers, with very low or very high values.  
10. **Location**: The type of location where the data was recorded (categorical).  
11. **Weather Type**: Target variable, classifying weather as Rainy, Sunny, Cloudy, or Snowy.  
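
The column layout listed above can be checked programmatically before any cleaning starts. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names introduced here, and the small `sample` frame is made up so the example runs without the CSV file.

```python
import pandas as pd

# Column names as listed in the dataset overview.
EXPECTED_COLUMNS = [
    "Temperature", "Humidity", "Wind Speed", "Precipitation (%)",
    "Cloud Cover", "Atmospheric Pressure", "UV Index", "Season",
    "Visibility (km)", "Location", "Weather Type",
]


def check_schema(df: pd.DataFrame) -> list:
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Made-up frame that lacks most columns, for illustration only.
sample = pd.DataFrame({"Temperature": [14.0], "Humidity": [73]})
missing = check_schema(sample)
print(f"{len(missing)} expected column(s) missing: {missing}")
```

On the real data, an empty list from `check_schema(df)` would indicate that all eleven columns are present.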

#### Project Inspiration  
This project is inspired by the need to develop robust classification models for weather prediction. By leveraging the **Weather-Type-Classification Database**, we aim to identify key relationships among variables and improve the accuracy of weather categorization.  

#### Goals of the Project  
1. **Data Analysis**: Explore the dataset to uncover trends and patterns in weather features.  
2. **Classification Models**: Develop and evaluate machine learning models to classify weather types based on the provided features.  

This project will focus on utilizing the **Weather-Type-Classification Database** to provide valuable insights into weather data and enhance classification strategies for weather-related applications.

> ⚠️ *Note*: Although the data is synthetic, it serves as a useful proxy for learning data preprocessing, outlier detection, and classification modeling techniques.

# **Step 1: Data Wrangling**

This notebook covers the **data wrangling and preprocessing phase** of the Weather-Type Classification project. The goal is to clean, transform, and prepare the raw weather dataset for exploratory data analysis (EDA) and classification modeling.

---

### Objectives of This Notebook

1. [Import Libraries and Load the Dataset](#import)  
2. [Initial Inspection](#inspection)  
3. [Handle Missing Values](#missing)  
4. [Remove Duplicate Records](#duplicates)  
5. [Rename Columns for Consistency](#rename)  
6. [Feature Engineering](#features)  
7. [Save the Cleaned Dataset](#save)

--- 

### Next Steps

- Step 2: [Exploratory Data Analysis (EDA) – Visual](./02_eda_visualization.ipynb)
- Step 3: [EDA – SQL Queries](./03_eda_sql_queries.ipynb)
- Step 4: [Modeling & Prediction](./04_modeling_prediction.ipynb)

---

<a id="import"></a>

## **1.1 Import Libraries and Load the Dataset**

We start by importing the necessary Python libraries and loading the dataset into a DataFrame.

In [1]:
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd

# NumPy is a Python library that supports fast operations on large, multi-dimensional arrays and provides a wide range of mathematical functions.
import numpy as np

# Import display function to render DataFrames or outputs neatly in the notebook
from IPython.display import display

In [2]:
# Load the dataset
print("Previewing the raw dataset:")
df = pd.read_csv('weather_classification_data.csv')
display(df.head())

Previewing the raw dataset:


| | Temperature | Humidity | Wind Speed | Precipitation (%) | Cloud Cover | Atmospheric Pressure | UV Index | Season | Visibility (km) | Location | Weather Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 73 | 9.5 | 82.0 | partly cloudy | 1010.82 | 2 | Winter | 3.5 | inland | Rainy |
| 1 | 39.0 | 96 | 8.5 | 71.0 | partly cloudy | 1011.43 | 7 | Spring | 10.0 | inland | Cloudy |
| 2 | 30.0 | 64 | 7.0 | 16.0 | clear | 1018.72 | 5 | Spring | 5.5 | mountain | Sunny |
| 3 | 38.0 | 83 | 1.5 | 82.0 | clear | 1026.25 | 7 | Spring | 1.0 | coastal | Sunny |
| 4 | 27.0 | 74 | 17.0 | 66.0 | overcast | 990.67 | 1 | Winter | 2.5 | mountain | Rainy |


---

<a id="inspection"></a>

## **1.2 Initial Inspection**

We inspect the structure, data types, and basic info of the dataset.

In [3]:
# Display basic structure of the dataset
# df.info() prints its report directly and returns None, so it is not wrapped in display()
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Temperature           13200 non-null  float64
 1   Humidity              13200 non-null  int64  
 2   Wind Speed            13200 non-null  float64
 3   Precipitation (%)     13200 non-null  float64
 4   Cloud Cover           13200 non-null  object 
 5   Atmospheric Pressure  13200 non-null  float64
 6   UV Index              13200 non-null  int64  
 7   Season                13200 non-null  object 
 8   Visibility (km)       13200 non-null  float64
 9   Location              13200 non-null  object 
 10  Weather Type          13200 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB

In [4]:
# Summary statistics for numerical columns
print("Numerical Summary:")
display(df.describe())

Numerical Summary:


| | Temperature | Humidity | Wind Speed | Precipitation (%) | Atmospheric Pressure | UV Index | Visibility (km) |
|---|---|---|---|---|---|---|---|
| count | 13200.0 | 13200.0 | 13200.0 | 13200.0 | 13200.0 | 13200.0 | 13200.0 |
| mean | 19.127576 | 68.710833 | 9.832197 | 53.644394 | 1005.827896 | 4.005758 | 5.462917 |
| std | 17.386327 | 20.194248 | 6.908704 | 31.946541 | 37.199589 | 3.8566 | 3.371499 |
| min | -25.0 | 20.0 | 0.0 | 0.0 | 800.12 | 0.0 | 0.0 |
| 25% | 4.0 | 57.0 | 5.0 | 19.0 | 994.8 | 1.0 | 3.0 |
| 50% | 21.0 | 70.0 | 9.0 | 58.0 | 1007.65 | 3.0 | 5.0 |
| 75% | 31.0 | 84.0 | 13.5 | 82.0 | 1016.7725 | 7.0 | 7.5 |
| max | 109.0 | 109.0 | 48.5 | 109.0 | 1199.21 | 14.0 | 20.0 |
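
The quartiles in a numerical summary like this one can also be used to flag outliers with the common 1.5 × IQR rule. The sketch below illustrates that alternative technique; it is not what this notebook does (the notebook filters on fixed ranges later on), and the `temps` values are made up for the example.

```python
import pandas as pd

# Made-up temperatures for illustration; not taken from the real dataset.
temps = pd.Series([4.0, 14.0, 21.0, 27.0, 31.0, 109.0], name="temperature")

# First and third quartiles, as reported by describe() under 25% and 75%.
q1, q3 = temps.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = temps[(temps < lower) | (temps > upper)]

print(f"Fences: [{lower}, {upper}]")
print(outliers)
```

Unlike fixed physical bounds, IQR fences adapt to each column's distribution, which can be useful when no domain-based limit is available.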


In [5]:
# Summary statistics for categorical columns
print("Categorical Summary:")
display(df.describe(include=[object]))

Categorical Summary:


| | Cloud Cover | Season | Location | Weather Type |
|---|---|---|---|---|
| count | 13200 | 13200 | 13200 | 13200 |
| unique | 4 | 4 | 3 | 4 |
| top | overcast | Winter | inland | Rainy |
| freq | 6090 | 5610 | 4816 | 3300 |


In [6]:
# Checking the shape of the dataset (rows, columns)
print(f"The dataset contains {df.shape[0]:,} rows and {df.shape[1]} columns.")

The dataset contains 13,200 rows and 11 columns.


In [7]:
# List the column names of the DataFrame
print("The dataset columns include:")
display(df.columns)

The dataset columns include:


Index(['Temperature', 'Humidity', 'Wind Speed', 'Precipitation (%)',
       'Cloud Cover', 'Atmospheric Pressure', 'UV Index', 'Season',
       'Visibility (km)', 'Location', 'Weather Type'],
      dtype='object')

In [8]:
# Count the rows in each 'Weather Type' class
weather_counts = df['Weather Type'].value_counts()

weather_counts

Weather Type
Rainy     3300
Cloudy    3300
Sunny     3300
Snowy     3300
Name: count, dtype: int64
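
Class balance matters for the later modeling step, so it can help to express these counts as shares. The following is a minimal sketch on made-up labels (the `labels` series and the `imbalance_ratio` name are assumptions for illustration); on the real data the same calls would be applied to `df['Weather Type']`.

```python
import pandas as pd

# Made-up labels for illustration; each class appears three times.
labels = pd.Series(["Rainy", "Cloudy", "Sunny", "Snowy"] * 3, name="Weather Type")

# normalize=True returns each class's share of rows instead of raw counts.
shares = labels.value_counts(normalize=True)

# A ratio of 1.0 means the largest and smallest classes are the same size.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```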

---

<a id="missing"></a>

## **1.3 Handle Missing Values**

We count the missing values in each column. The dataset contains none, so no imputation or row removal is needed at this step.

In [9]:
# Check for missing values
print("Missing Values per Column:")
display(df.isnull().sum())

Missing Values per Column:


Temperature             0
Humidity                0
Wind Speed              0
Precipitation (%)       0
Cloud Cover             0
Atmospheric Pressure    0
UV Index                0
Season                  0
Visibility (km)         0
Location                0
Weather Type            0
dtype: int64
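
Because no values are missing, the notebook applies no imputation. If a future version of the data did contain gaps, one simple strategy would be to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below shows that idea on a made-up frame; it is an assumption about a reasonable default, not a step from the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps; the real dataset has no missing values.
sample = pd.DataFrame({
    "temperature": [14.0, np.nan, 30.0, 38.0],
    "season": ["Winter", "Spring", None, "Spring"],
})

filled = sample.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # The median is less sensitive to extreme values than the mean.
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # mode() can return several values; take the first one.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```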

---

<a id="duplicates"></a>

## **1.4 Remove Duplicate Records**

To ensure data quality, we check for duplicate rows. None are found, so nothing needs to be removed.

In [10]:
# Count fully duplicated rows
print(f"Duplicate Rows Found: {df.duplicated().sum()}")

Duplicate Rows Found: 0
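
Had duplicates been present, they could be dropped with `drop_duplicates()`. A minimal sketch on a made-up frame (the `sample` data is hypothetical, chosen so that one row repeats):

```python
import pandas as pd

# Made-up frame in which the first and third rows are identical.
sample = pd.DataFrame({
    "temperature": [14.0, 39.0, 14.0],
    "weather_type": ["Rainy", "Cloudy", "Rainy"],
})

# duplicated() marks every repeat after the first occurrence.
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduped)} remain.")
```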


---

<a id="rename"></a>

## **1.5 Rename Columns for Consistency**

Standardizing column names improves readability and downstream processing.

In [11]:
# Shorten the three longest column names and drop the unit suffixes
df.rename(columns={'Atmospheric Pressure': 'ATM_Pressure',
                   'Precipitation (%)': 'Precipitation',
                   'Visibility (km)': 'Visibility'}, inplace=True)

# Standardize column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
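
To see what the two renaming steps produce, they can be applied to a frame that carries only column names. This sketch uses an empty, made-up `demo` frame with a subset of the original columns, and uses `rename()` without `inplace=True`, which is a stylistic variation rather than the notebook's exact code.

```python
import pandas as pd

# Empty frame carrying a subset of the original column names.
demo = pd.DataFrame(columns=[
    "Temperature", "Wind Speed", "Precipitation (%)",
    "Atmospheric Pressure", "Visibility (km)", "Weather Type",
])

# Without inplace=True, rename() returns a new frame.
demo = demo.rename(columns={
    "Atmospheric Pressure": "ATM_Pressure",
    "Precipitation (%)": "Precipitation",
    "Visibility (km)": "Visibility",
})

# regex=False treats the space as a literal character.
demo.columns = demo.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(list(demo.columns))
```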

---

<a id="features"></a>

## **1.6 Feature Engineering**

Despite the section title, no new features are derived here. Instead, rows whose `Temperature`, `Atmospheric Pressure`, or `Wind Speed` values fall outside a plausible range are dropped.

In [12]:
# Keep only rows inside each plausible range (bounds are inclusive)
df = df[(df['temperature'] >= -50) & (df['temperature'] <= 50)]    # degrees Celsius
df = df[(df['atm_pressure'] >= 950) & (df['atm_pressure'] <= 1050)]  # hPa
df = df[(df['wind_speed'] >= 0) & (df['wind_speed'] <= 50)]        # km/h
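
It is often useful to know how many rows each range check rejects. The sketch below applies the same three bounds with `Series.between()` and reports the count per column. The `BOUNDS` dictionary and the `sample` rows are assumptions made for illustration; the bounds themselves are copied from the cell above.

```python
import pandas as pd

# Bounds copied from the filtering cell; the dictionary name is hypothetical.
BOUNDS = {
    "temperature": (-50, 50),
    "atm_pressure": (950, 1050),
    "wind_speed": (0, 50),
}

# Made-up rows: one has an extreme temperature, one an extreme pressure.
sample = pd.DataFrame({
    "temperature": [14.0, 109.0, 30.0, 27.0],
    "atm_pressure": [1010.82, 1011.43, 800.12, 990.67],
    "wind_speed": [9.5, 8.5, 7.0, 17.0],
})

mask = pd.Series(True, index=sample.index)
for col, (low, high) in BOUNDS.items():
    in_range = sample[col].between(low, high)  # inclusive on both ends by default
    print(f"{col}: {(~in_range).sum()} row(s) outside [{low}, {high}]")
    mask &= in_range

filtered = sample[mask]
print(f"Kept {len(filtered)} of {len(sample)} rows.")
```

Building a single mask and filtering once gives the same rows as three successive filters, while keeping the per-column counts visible.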

The inspection and filtering steps above lead to the following observations:

1. **Dataset Dimensions**:  
   The raw dataset contains **13,200 rows** and **11 columns**. Dropping rows with out-of-range **Temperature**, **Atmospheric Pressure**, or **Wind Speed** values leaves **11,944 rows** and **11 columns**, so 1,256 rows are removed.  

2. **Data Quality**:  
   - No missing values are present in the dataset across all columns.  
   - No duplicate rows exist in the dataset.  

3. **Numerical Features Summary**:  
   - **Temperature** ranges from **-25.0°C to 109.0°C**, with a mean of **19.13°C**. The filter keeps **-50°C to 50°C**, so only the upper tail is cut and the post-cleaning maximum is **50.0°C**.  
   - **Humidity** varies between **20% and 109%**, with an average of **68.71%**. Values above 100% are not filtered in this notebook and remain after cleaning.  
   - **Wind Speed** ranges from **0 to 48.5 km/h**, with a mean of **9.83 km/h**. Every value already lies within the **0 to 50 km/h** filter, so that filter removes no rows.  
   - **Precipitation (%)** ranges from **0% to 109%**, with a mean of **53.64%**.  
   - **Atmospheric Pressure** ranges from **800.12 to 1199.21 hPa**, with a mean of **1005.83 hPa**. After values outside **950 to 1050 hPa** are excluded, the remaining range is **950.17 to 1049.56 hPa**.  
   - **Visibility** (km) ranges from **0 to 20 km**, with a mean of **5.46 km**.  
   - **UV Index** varies between **0 and 14**, with an average of **4.01**.  

4. **Categorical Features Summary**:  
   - **Cloud Cover** has 4 unique categories, with "overcast" being the most frequent, appearing **6,090 times**.  
   - **Season** has 4 unique categories, with "Winter" being the most frequent, appearing **5,610 times**.  
   - **Location** has 3 unique categories, with "inland" being the most frequent, appearing **4,816 times**.  
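   - **Weather Type** has 4 unique categories, each appearing exactly **3,300 times** in the raw data, so the classes are perfectly balanced; `describe()` reports "Rainy" as the top value only as one of the four tied classes. After cleaning, the counts differ slightly, with "Snowy" the most frequent at **3,079**.  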
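
The percentage columns still contain values above 100% after cleaning. One possible follow-up, which is not performed in this notebook, is to count those values and cap them at 100. The sketch below shows the idea on made-up rows; `percent_cols` and `capped` are hypothetical names.

```python
import pandas as pd

# Made-up rows; 109 mirrors the maximum reported in the numerical summary.
sample = pd.DataFrame({
    "humidity": [73, 109, 64],
    "precipitation": [82.0, 71.0, 109.0],
})

percent_cols = ["humidity", "precipitation"]

# Count values above the physical maximum of 100%.
over_100 = (sample[percent_cols] > 100).sum()
print(over_100)

# One possible treatment: cap to [0, 100] instead of dropping the rows.
capped = sample.copy()
capped[percent_cols] = capped[percent_cols].clip(lower=0, upper=100)
print(capped)
```

Capping keeps the row count unchanged, whereas dropping the rows would shrink the dataset further. Which is preferable depends on the downstream model.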

---

<a id="save"></a>

## **1.7 Save the Cleaned Dataset**

After completing the data cleaning and preprocessing steps, save the cleaned dataset to a CSV file for future use and reproducibility.

In [13]:
# Copy the filtered DataFrame so that later steps do not modify `df`
df_clean = df.copy()

In [14]:
# Display the first and last few rows of the cleaned dataset to verify changes
display(df_clean)

| | temperature | humidity | wind_speed | precipitation | cloud_cover | atm_pressure | uv_index | season | visibility | location | weather_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.0 | 73 | 9.5 | 82.0 | partly cloudy | 1010.82 | 2 | Winter | 3.5 | inland | Rainy |
| 1 | 39.0 | 96 | 8.5 | 71.0 | partly cloudy | 1011.43 | 7 | Spring | 10.0 | inland | Cloudy |
| 2 | 30.0 | 64 | 7.0 | 16.0 | clear | 1018.72 | 5 | Spring | 5.5 | mountain | Sunny |
| 3 | 38.0 | 83 | 1.5 | 82.0 | clear | 1026.25 | 7 | Spring | 1.0 | coastal | Sunny |
| 4 | 27.0 | 74 | 17.0 | 66.0 | overcast | 990.67 | 1 | Winter | 2.5 | mountain | Rainy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13194 | 29.0 | 62 | 13.0 | 17.0 | overcast | 1002.81 | 2 | Spring | 5.0 | coastal | Cloudy |
| 13195 | 10.0 | 74 | 14.5 | 71.0 | overcast | 1003.15 | 1 | Summer | 1.0 | mountain | Rainy |
| 13197 | 30.0 | 77 | 5.5 | 28.0 | overcast | 1012.69 | 3 | Autumn | 9.0 | coastal | Cloudy |
| 13198 | 3.0 | 76 | 10.0 | 94.0 | overcast | 984.27 | 0 | Winter | 2.0 | inland | Snowy |


In [15]:
# Check the shape of the dataset after cleaning
print("Dataset shape after cleaning:")
print(df_clean.shape)

Dataset shape after cleaning:
(11944, 11)


In [16]:
# Recheck the DataFrame info to verify data types and non-null counts
# df_clean.info() prints its report directly and returns None, so it is not wrapped in display()
print("Dataset info:")
df_clean.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
Index: 11944 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   temperature    11944 non-null  float64
 1   humidity       11944 non-null  int64  
 2   wind_speed     11944 non-null  float64
 3   precipitation  11944 non-null  float64
 4   cloud_cover    11944 non-null  object 
 5   atm_pressure   11944 non-null  float64
 6   uv_index       11944 non-null  int64  
 7   season         11944 non-null  object 
 8   visibility     11944 non-null  float64
 9   location       11944 non-null  object 
 10  weather_type   11944 non-null  object 
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB

In [17]:
# List the column names of the cleaned DataFrame
print("The dataset columns include:")
display(df_clean.columns)

The dataset columns include:


Index(['temperature', 'humidity', 'wind_speed', 'precipitation', 'cloud_cover',
       'atm_pressure', 'uv_index', 'season', 'visibility', 'location',
       'weather_type'],
      dtype='object')

In [18]:
# Summary statistics for numerical columns
print("Numerical Summary:")
display(df_clean.describe())

Numerical Summary:


| | temperature | humidity | wind_speed | precipitation | atm_pressure | uv_index | visibility |
|---|---|---|---|---|---|---|---|
| count | 11944.0 | 11944.0 | 11944.0 | 11944.0 | 11944.0 | 11944.0 | 11944.0 |
| mean | 18.119642 | 69.539685 | 9.844022 | 52.455626 | 1005.77715 | 3.680593 | 5.138731 |
| std | 15.21078 | 19.576576 | 6.893475 | 32.110388 | 13.291386 | 3.642811 | 2.842095 |
| min | -25.0 | 20.0 | 0.0 | 0.0 | 950.17 | 0.0 | 0.0 |
| 25% | 4.0 | 59.0 | 5.0 | 19.0 | 995.3875 | 1.0 | 3.0 |
| 50% | 21.0 | 70.0 | 9.0 | 57.0 | 1007.315 | 2.0 | 5.0 |
| 75% | 30.0 | 84.0 | 13.5 | 81.0 | 1016.0625 | 6.0 | 7.125 |
| max | 50.0 | 109.0 | 48.5 | 109.0 | 1049.56 | 14.0 | 20.0 |


In [19]:
# Summary statistics for categorical columns
print("Categorical Summary:")
display(df_clean.describe(include=[object]))

Categorical Summary:


| | cloud_cover | season | location | weather_type |
|---|---|---|---|---|
| count | 11944 | 11944 | 11944 | 11944 |
| unique | 4 | 4 | 3 | 4 |
| top | overcast | Winter | inland | Snowy |
| freq | 5700 | 5291 | 4411 | 3079 |


In [20]:
# Save this lightly cleaned dataset for EDA and Dashboard
df_clean.to_csv('weather_classification_cleaned.csv', index=False)
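
A quick round trip can confirm that a saved file reloads as expected. The sketch below writes a made-up frame to a temporary directory and reads it back; the file name `cleaned_sample.csv` and the `sample` data are assumptions for illustration. It also shows one effect of `index=False`: the gaps left in the index by row filtering are not written, so the reloaded index restarts at 0.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-contiguous index, as left behind by row filtering.
sample = pd.DataFrame(
    {"temperature": [14.0, 39.0], "weather_type": ["Rainy", "Cloudy"]},
    index=[0, 5],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.csv")
    sample.to_csv(path, index=False)  # the index labels are not written
    reloaded = pd.read_csv(path)

# Compare the column values, ignoring the index labels.
same_values = reloaded.to_dict(orient="list") == sample.to_dict(orient="list")

print(f"Values preserved: {same_values}")
print(f"Reloaded index: {list(reloaded.index)}")
```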

#### With the cleaned dataset saved, we can now proceed to Exploratory Data Analysis (EDA).