Data preprocessing

The cell below imports the weather dataset from a CSV file (weatherdata--1111422(in).csv) using pd.read_csv() and prints the entire DataFrame using df.to_string(). This lets you inspect every row and column of the dataset, which includes date, location (longitude, latitude), elevation, temperatures, precipitation, wind, relative humidity, and solar radiation (which currently has missing values).

In [3]:
import pandas as pd
df = pd.read_csv('/Users/dani-myburgh/Downloads/weatherdata--1111422(in).csv')
print(df.to_string())

FileNotFoundError: [Errno 2] No such file or directory: '/Users/dani-myburgh/Downloads/weatherdata--1111422(in).csv'
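
The FileNotFoundError above means pd.read_csv() could not find the file at the given path. As a minimal sketch, the load can be wrapped in a helper that checks the path first and reports exactly where it looked. The helper name load_weather_csv is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_weather_csv(csv_path):
    """Read a CSV into a DataFrame, failing early with a clearer message.

    Hypothetical helper for illustration; the notebook itself calls
    pd.read_csv() directly.
    """
    path = Path(csv_path).expanduser()
    if not path.is_file():
        # Report the resolved path so a wrong folder or file name is easy to spot
        raise FileNotFoundError(f"CSV not found at {path}; check the folder and file name")
    return pd.read_csv(path)
```

Calling load_weather_csv() with the notebook's path would still fail until the file is actually in that folder, but the message shows the path that was checked.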

Dropping unnecessary columns

The columns 'Longitude', 'Latitude', and 'Elevation' are removed because they are not required for the current analysis or modeling. This reduces memory usage and keeps only the relevant data.

In [None]:
df = df.drop(columns=['Longitude','Latitude', 'Elevation'])
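
One practical detail: running the drop cell a second time raises a KeyError, because the columns are already gone. The sketch below uses a small made-up DataFrame (the values are invented for illustration) to show that passing errors="ignore" makes the step safe to re-run.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the weather dataset
sample = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "Longitude": [18.4, 18.4],
    "Latitude": [-33.9, -33.9],
    "Elevation": [25, 25],
    "Max Temperature": [27.1, 29.3],
})

location_columns = ["Longitude", "Latitude", "Elevation"]

# errors="ignore" skips any listed column that is not present
trimmed = sample.drop(columns=location_columns, errors="ignore")

# Dropping again is now a no-op instead of a KeyError
rerun = trimmed.drop(columns=location_columns, errors="ignore")
print(list(rerun.columns))
```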

Checking for null values

Checking for null values – Reveals missing data that could cause errors or bias in analysis.

df.isnull() creates a DataFrame of True/False flags, with True marking a missing value.
.sum() counts the True values (nulls) in each column.

In [None]:
# Printing the number of null values per column
nulls_per_column = df.isnull().sum()
print("Nulls per column:\n", nulls_per_column)
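
To make the two steps concrete, here is a small sketch on made-up data (the values are invented; only the column names come from the dataset). It also computes the share of missing values per column with .mean(), which is an addition to the notebook's approach and can help when deciding whether a column such as 'Solar' is usable at all.

```python
import numpy as np
import pandas as pd

# Made-up sample: one missing precipitation value, solar entirely missing
sample = pd.DataFrame({
    "Precipitation": [0.0, 1.2, np.nan],
    "Solar": [np.nan, np.nan, np.nan],
})

# True/False flags summed per column give the null counts
nulls_per_column = sample.isnull().sum()

# The mean of the same flags gives the fraction of rows that are missing
null_share = sample.isnull().mean()

print("Nulls per column:\n", nulls_per_column)
print("Share missing per column:\n", null_share)
```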

Checking for duplicates 

Checking for duplicates – Identifies repeated records that could skew analysis results or model training.

In [None]:
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)
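
The notebook only counts duplicates. If the count were non-zero, one possible follow-up (an assumption, not a step taken in the original) is to remove the repeated rows with drop_duplicates(). The sketch below shows both steps on made-up data.

```python
import pandas as pd

# Made-up sample in which the first two rows are identical
sample = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Wind": [3.1, 3.1, 2.4],
})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged
duplicate_count = sample.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

# Keep the first occurrence of each row and renumber the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```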

Displaying DataFrame structure and column information

df.info() shows:
 - Number of entries (rows)
 - Column names, non-null counts, and data types
 - Memory usage
This helps reveal columns with incorrect data types or missing values before processing.

In [None]:
df.info()

Converting 'Date' column to datetime format

Currently, 'Date' is stored as an object (string). Converting it to datetime64 allows for easier time-based operations such as filtering by date, resampling, and plotting time series.

pd.to_datetime() – Standardizes date format for time-series operations and avoids errors in date calculations.

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
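
If some date strings are malformed, pd.to_datetime() raises an error by default. The sketch below shows one way to handle that, assuming the dates are written as YYYY-MM-DD. The actual format in the CSV is not shown in this notebook, so the format string is an assumption to adjust.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
dates = pd.Series(["2020-01-01", "2020-01-02", "not a date"])

# format is an assumed layout; errors="coerce" turns unparseable values into NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Count how many values failed to parse before relying on the column
unparsed_count = parsed.isna().sum()
print(parsed)
print("Unparseable dates:", unparsed_count)
```

Checking the NaT count right after the conversion keeps bad dates from silently disappearing in later filtering or resampling.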

Verifying the column type change 

In [None]:
df.info()

Creating the average temperature column

The 'Average Temperature' column is calculated as the mean of 'Min Temperature' and 'Max Temperature'.
This helps in analysis where a single representative temperature is needed.

Creating Average Temperature – Adds a useful derived metric for easier analysis without losing min/max detail.

df.to_string() – Outputs the full dataset for inspection (although in practice, this can be slow for large datasets)

In [None]:
df['Average Temperature']= (df['Min Temperature'] + df['Max Temperature'])/2
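
As a small worked example on made-up values, the sketch below applies the same formula and compares it with a row-wise .mean(axis=1). The two differ only when a temperature is missing: the arithmetic version yields NaN, while .mean() skips the missing value by default. The comparison is an illustration, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Made-up temperatures; the third row has a missing minimum
sample = pd.DataFrame({
    "Min Temperature": [10.0, 12.5, np.nan],
    "Max Temperature": [20.0, 17.5, 30.0],
})

# Same formula as the notebook: NaN in either input gives NaN
sample["Average Temperature"] = (sample["Min Temperature"] + sample["Max Temperature"]) / 2

# Row-wise mean ignores NaN by default, so the third row becomes 30.0
row_mean = sample[["Min Temperature", "Max Temperature"]].mean(axis=1)

print(sample["Average Temperature"].tolist())  # [15.0, 15.0, nan]
print(row_mean.tolist())                       # [15.0, 15.0, 30.0]
```

The notebook's formula is arguably the safer choice here, since an "average" built from a single reading would be misleading.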

In [None]:
# Printing the full DataFrame to view all rows and columns
# df.to_string() prints the entire dataset without truncation.
print(df.to_string())

Reordering the columns in the DataFrame

Column reordering – Improves readability and ensures consistent structure, especially if the file will be shared or merged with other datasets.


'new_order' defines the desired sequence of columns.
This is useful for aligning the data with a standard format or preferred layout.

In [None]:
new_order = ['Date',
             'Min Temperature',
             'Max Temperature',
             'Average Temperature',
             'Precipitation',
             'Wind',
             'Relative Humidity',
             'Solar']
# Applying the new column order to the DataFrame
df = df[new_order]
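
Selecting with a list has two side effects worth knowing: a name in the list that is not in the DataFrame raises a KeyError, and any column left out of the list is silently dropped. The sketch below, using a made-up three-column frame, checks both cases before reordering.

```python
import pandas as pd

# Made-up frame with only a few of the dataset's columns
sample = pd.DataFrame({
    "Solar": [11.2],
    "Date": ["2020-01-01"],
    "Wind": [3.1],
    "Precipitation": [0.0],
})

new_order = ["Date", "Wind", "Solar"]

# Names that would raise a KeyError because they are not in the frame
missing = [col for col in new_order if col not in sample.columns]

# Columns that would be silently dropped because they are not listed
dropped = [col for col in sample.columns if col not in new_order]

print("Missing from DataFrame:", missing)
print("Would be dropped:", dropped)

# Reorder only when every requested column exists
if not missing:
    reordered = sample[new_order]
```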

Printing DataFrame – Provides a final visual confirmation before saving.

In [None]:
# Printing the full DataFrame to verify that the columns are in the correct order
print(df.to_string())

Saving to CSV – Stores the cleaned, well-formatted dataset for later analysis, sharing, or modeling.

In [None]:
# Exporting the cleaned and formatted DataFrame to a CSV file
# index=False ensures that the DataFrame's index is not written as a separate column in the CSV.
# This makes the file cleaner and easier to use in other tools.

df.to_csv('/Users/dani-myburgh/Downloads/weatherdata_cleaned.csv', index=False)
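
Note that CSV files store dates as plain text, so the datetime type set earlier is not preserved in the file. The sketch below writes a made-up frame to a temporary folder (the file name is hypothetical) and shows that passing parse_dates when reading restores the type.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up cleaned data with a proper datetime column
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "Average Temperature": [15.0, 16.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a throwaway folder
    out_path = Path(tmp_dir) / "weatherdata_cleaned_sample.csv"
    sample.to_csv(out_path, index=False)

    # Without parse_dates the 'Date' column comes back as text
    plain = pd.read_csv(out_path)

    # parse_dates converts the listed columns back to datetime on load
    reloaded = pd.read_csv(out_path, parse_dates=["Date"])

print(plain["Date"].dtype)
print(reloaded["Date"].dtype)
```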