Data preprocessing

The cell below imports the weather dataset from a CSV file (weather_data_EDA.csv) using pd.read_csv() and prints the first five rows with df.head(5). This gives a quick look at the columns in the dataset: date, year, month, month name, minimum, maximum, and average temperature, precipitation, wind, relative humidity, and solar radiation.

In [3]:
import pandas as pd
df = pd.read_csv('weather_data_EDA.csv')
print(df.head(5))

         Date  Year  Month Month_Name  Min Temperature(°C)  \
0  1979-01-01  1979      1    January               25.154   
1  1979-01-02  1979      1    January               24.853   
2  1979-01-03  1979      1    January               25.469   
3  1979-01-04  1979      1    January               24.851   
4  1979-01-05  1979      1    January               24.974   

   Max Temperature(°C)  Average Temperature(°C)  Precipitation(mm)  \
0               28.419                   26.786             13.125   
1               27.973                   26.413             23.355   
2               27.772                   26.620             39.468   
3               26.534                   25.692             23.830   
4               28.343                   26.658              9.089   

   Wind(km/h)  Relative Humidity  Solar(W/m²)  
0       6.932              0.824       21.584  
1       6.081              0.870       19.647  
2       6.015              0.874       23.400  
3       6.422 

Checking for null values

Checking for null values – Confirms whether the dataset has missing data that could cause errors or bias in analysis.

df.isnull() creates a DataFrame of True/False values that flags each missing entry.
.sum() counts the number of True values (nulls) for each column.

In [7]:
# Printing the number of null values per column

In [8]:
nulls_per_column = df.isnull().sum()
print("Nulls per column:\n", nulls_per_column)

Nulls per column:
 Date                       0
Year                       0
Month                      0
Month_Name                 0
Min Temperature(°C)        0
Max Temperature(°C)        0
Average Temperature(°C)    0
Precipitation(mm)          0
Wind(km/h)                 0
Relative Humidity          0
Solar(W/m²)                0
dtype: int64
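
Because the real dataset has no nulls, the count above is all zeros and does not show the mechanics. The sketch below uses a small hypothetical DataFrame (the values in `toy` are made up for illustration, not taken from the weather file) to show how the True/False mask turns into per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few deliberately missing values
toy = pd.DataFrame({
    "Wind(km/h)": [6.9, np.nan, 6.0],
    "Solar(W/m²)": [21.6, np.nan, np.nan],
})

mask = toy.isnull()                         # True wherever a value is missing
nulls_per_column = mask.sum()               # True counts as 1, so this sums nulls per column
total_nulls = int(nulls_per_column.sum())   # one number for the whole frame

print(mask)
print("Nulls per column:\n", nulls_per_column)
print("Total nulls:", total_nulls)
```

A single total like `total_nulls` can be handy as a quick pass/fail check before moving on to the next step.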


Checking for duplicates 

Checking for duplicates – Confirms that all records are unique, since repeated rows can skew analysis results or model training.

In [11]:
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

Number of duplicate rows: 0
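
As a small illustration of what df.duplicated() actually flags, the sketch below uses a hypothetical four-row DataFrame (made-up values). By default a row is marked only if it repeats an earlier row in every column; for daily data it may also be worth checking the date column alone, which is an extra check assumed here and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly,
# row 3 repeats the date of row 1 but with a different wind value
toy = pd.DataFrame({
    "Date": ["1979-01-01", "1979-01-02", "1979-01-01", "1979-01-02"],
    "Wind(km/h)": [6.9, 6.1, 6.9, 7.0],
})

# Default keep='first': the first occurrence is kept, later exact repeats are flagged
full_row_duplicates = int(toy.duplicated().sum())

# Restricting the comparison to 'Date' flags any repeated day
repeated_dates = int(toy.duplicated(subset=["Date"]).sum())

# drop_duplicates() removes the flagged full-row repeats
deduped = toy.drop_duplicates()

print("Full-row duplicates:", full_row_duplicates)
print("Repeated dates:", repeated_dates)
print(deduped)
```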


Displaying DataFrame structure and column information

df.info() – Checks column data types and confirms the data is as expected before processing.

df.info() shows:
 - Number of entries (rows)
 - Column names, non-null counts, and data types
 - Memory usage
This helps reveal columns with incorrect data types or missing values.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12868 entries, 0 to 12867
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Date                     12868 non-null  object 
 1   Year                     12868 non-null  int64  
 2   Month                    12868 non-null  int64  
 3   Month_Name               12868 non-null  object 
 4   Min Temperature(°C)      12868 non-null  float64
 5   Max Temperature(°C)      12868 non-null  float64
 6   Average Temperature(°C)  12868 non-null  float64
 7   Precipitation(mm)        12868 non-null  float64
 8   Wind(km/h)               12868 non-null  float64
 9   Relative Humidity        12868 non-null  float64
 10  Solar(W/m²)              12868 non-null  float64
dtypes: float64(7), int64(2), object(2)
memory usage: 1.1+ MB


Converting 'Date' column to datetime format

Currently, 'Date' is stored as an object (string). Converting it to datetime64 allows for easier time-based operations such as filtering by date, resampling, and plotting time series.

pd.to_datetime() – Standardizes date format for time-series operations and avoids errors in date calculations.

In [19]:
df['Date'] = pd.to_datetime(df['Date'])

Verifying the column type change 

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12868 entries, 0 to 12867
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Date                     12868 non-null  datetime64[ns]
 1   Year                     12868 non-null  int64         
 2   Month                    12868 non-null  int64         
 3   Month_Name               12868 non-null  object        
 4   Min Temperature(°C)      12868 non-null  float64       
 5   Max Temperature(°C)      12868 non-null  float64       
 6   Average Temperature(°C)  12868 non-null  float64       
 7   Precipitation(mm)        12868 non-null  float64       
 8   Wind(km/h)               12868 non-null  float64       
 9   Relative Humidity        12868 non-null  float64       
 10  Solar(W/m²)              12868 non-null  float64       
dtypes: datetime64[ns](1), float64(7), int64(2), object(1)
memory usage: 1.1+ MB
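
To make the benefit of the conversion concrete, here is a minimal sketch of two of the time-based operations mentioned above, filtering by date and monthly resampling. It runs on a hypothetical four-row DataFrame with made-up precipitation values; the column names mirror the dataset, but the numbers do not come from it.

```python
import pandas as pd

# Hypothetical toy data spanning a month boundary
toy = pd.DataFrame({
    "Date": ["1979-01-30", "1979-01-31", "1979-02-01", "1979-02-02"],
    "Precipitation(mm)": [10.0, 20.0, 2.0, 4.0],
})
toy["Date"] = pd.to_datetime(toy["Date"])

# Filtering by date: comparisons now follow calendar order
february = toy[toy["Date"] >= "1979-02-01"]

# Resampling: group rows by calendar month and average them
# ("MS" labels each group by the first day of its month)
monthly_mean = (
    toy.set_index("Date")["Precipitation(mm)"]
       .resample("MS")
       .mean()
)

print(february)
print(monthly_mean)
```

The same pattern should apply to the full DataFrame by replacing `toy` with `df`.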


Printing DataFrame – Provides a final visual confirmation before saving.

In [23]:
# Printing the full DataFrame to verify that the columns are in the correct order
print(df.to_string())

            Date  Year  Month Month_Name  Min Temperature(°C)  Max Temperature(°C)  Average Temperature(°C)  Precipitation(mm)  Wind(km/h)  Relative Humidity  Solar(W/m²)
0     1979-01-01  1979      1    January               25.154               28.419                   26.786             13.125       6.932              0.824       21.584
1     1979-01-02  1979      1    January               24.853               27.973                   26.413             23.355       6.081              0.870       19.647
2     1979-01-03  1979      1    January               25.469               27.772                   26.620             39.468       6.015              0.874       23.400
3     1979-01-04  1979      1    January               24.851               26.534                   25.692             23.830       6.422              0.886       13.279
4     1979-01-05  1979      1    January               24.974               28.343                   26.658              9.089       6.162       

Saving to CSV – Stores the cleaned, well-formatted dataset for later analysis, sharing, or modeling.

In [25]:
# Exporting the cleaned and formatted DataFrame to a CSV file
# index=False ensures that the DataFrame's index is not written as a separate column in the CSV.
# This makes the file cleaner and easier to use in other tools.

df.to_csv('weatherdata_cleaned.csv', index=False)
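
One thing to keep in mind when the saved file is loaded again: CSV stores dates as plain text, so the datetime type set earlier is not preserved on disk. The sketch below shows the round trip on a hypothetical two-row DataFrame, writing to an in-memory buffer instead of a real file, and uses the parse_dates argument of pd.read_csv() to restore the type on load.

```python
import io
import pandas as pd

# Hypothetical toy data with a proper datetime column
toy = pd.DataFrame({
    "Date": pd.to_datetime(["1979-01-01", "1979-01-02"]),
    "Wind(km/h)": [6.932, 6.081],
})

# Write to an in-memory buffer (stands in for a file on disk)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints: 'Date' comes back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading back with parse_dates: 'Date' is converted to datetime again
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print("Without parse_dates:", plain["Date"].dtype)
print("With parse_dates:   ", parsed["Date"].dtype)
```

For the file saved above, the equivalent call would be pd.read_csv('weatherdata_cleaned.csv', parse_dates=['Date']).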