# Data Processing
## Assignment 2
### Process Data
### **Step-by-Step Data Processing**
### Step 1: Load the Data

Load the dataset into a pandas DataFrame so it can be inspected and cleaned in the following steps.

In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv('climate_change_agriculture_dataset.csv')
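A quick sanity check right after loading can catch a wrong delimiter or an unexpected column layout early. The sketch below uses a tiny in-memory sample in place of the real file; the column names mirror the dataset, but the rows and the `Station A` device name are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file
sample_csv = io.StringIO(
    "Time,Device Name,Temperature (°C)\n"
    "2021-11-19 15:20:09+11:00,Station A,24.4\n"
    "2021-11-19 15:10:07+11:00,Station A,24.1\n"
)

sample = pd.read_csv(sample_csv)

# Basic checks: size, column names, and the dtypes pandas inferred
print(sample.shape)
print(sample.columns.tolist())
print(sample.dtypes)
```

The same three calls can be run on `df` once the real file is loaded.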

### Step 2: Remove Duplicate Rows

In [4]:
# Check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Remove duplicates
df = df.drop_duplicates()

Number of duplicate rows: 0
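A count of 0 only means that no two rows are identical in every column. If the same device could report the same timestamp twice with slightly different readings, a key-based check may be more informative. The following is a minimal sketch on toy data; treating `Time` plus `Device Name` as the key is an assumption, not something the dataset documents.

```python
import pandas as pd

# Toy data: rows 0 and 1 share a timestamp and device but differ in temperature
toy = pd.DataFrame({
    "Time": ["t1", "t1", "t2"],
    "Device Name": ["A", "A", "A"],
    "Temperature (°C)": [20.0, 20.5, 21.0],
})

# No fully identical rows, so the whole-row check finds nothing
full_dupes = toy.duplicated().sum()

# The key-based check flags the second reading for (t1, A)
key_dupes = toy.duplicated(subset=["Time", "Device Name"]).sum()

# Keep the first reading for each (Time, Device Name) pair
deduped = toy.drop_duplicates(subset=["Time", "Device Name"], keep="first")

print(full_dupes, key_dupes, len(deduped))
```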


### Step 3: Handle Missing Values

In [7]:
# Check for missing values in each column
print(df.isnull().sum())

# Option 1: Drop rows with missing values (result kept in a separate DataFrame)
df_clean = df.dropna()

# Option 2: Fill missing numeric values with the column mean (imputation)
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
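The two options behave differently: dropping shrinks the dataset, while mean imputation keeps every row but replaces gaps with an average. The sketch below shows both on a toy frame with one missing value; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with a single missing humidity reading
toy = pd.DataFrame({
    "Location": ["a", "b", "c", "d"],
    "Humidity (%)": [50.0, np.nan, 70.0, 60.0],
})

# Option 1: the row with the missing value is removed
dropped = toy.dropna()

# Option 2: the gap is filled with the mean of the observed values
numeric_cols = toy.select_dtypes(include=["number"]).columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

print(len(dropped))
print(filled["Humidity (%)"].tolist())
```

Which option fits depends on how many rows are affected and whether the analysis can tolerate imputed values.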

### Step 4: Correct Data Types

In [11]:
# Check data types of all columns
print(df.dtypes)

# Convert columns to their intended data types (unparseable values become NaT/NaN)
df['Time'] = pd.to_datetime(df['Time'], errors='coerce')  # Convert to datetime
df['Temperature (°C)'] = pd.to_numeric(df['Temperature (°C)'], errors='coerce')  # Convert to numeric

# Handle any missing values introduced by the conversion
df['Time'] = df['Time'].fillna(pd.Timestamp('1970-01-01'))  # Fill missing dates with a placeholder date
df['Temperature (°C)'] = df['Temperature (°C)'].fillna(df['Temperature (°C)'].mean())  # Fill missing values with the column mean

# Verify that the conversion was successful
print(df.dtypes)


Time                                object
Device Name                         object
Location                            object
Temperature (°C)                   float64
Atmospheric Pressure (kPa)         float64
Lightning Average Distance (km)    float64
Lightning Strike Count             float64
Maximum Wind Speed (m/s)           float64
Precipitation mm/h                 float64
Solar Radiation (W/m2)             float64
Vapor Pressure (kPa)               float64
Humidity (%)                       float64
Wind Direction (°)                 float64
Wind Speed (m/s)                   float64
dtype: object
Time                                object
Device Name                         object
Location                            object
Temperature (°C)                   float64
Atmospheric Pressure (kPa)         float64
Lightning Average Distance (km)    float64
Lightning Strike Count             float64
Maximum Wind Speed (m/s)           float64
Precipitation mm/h                 float
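Both printouts list `Time` as `object`, so the conversion did not produce a datetime column. One possible cause, assuming the column mixes UTC offsets (the sample rows show `+11:00`, and other offsets may appear elsewhere in the file), is that pandas cannot hold mixed offsets in a single datetime dtype without normalizing them; the exact behavior depends on the pandas version. A minimal sketch of one way around this, using made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps with two different UTC offsets
raw = pd.Series([
    "2021-11-19 15:20:09+11:00",
    "2021-06-19 15:20:09+10:00",
])

# utc=True converts every value to UTC, giving one timezone-aware datetime dtype
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

print(parsed.dtype)
print(parsed.isna().sum())
```

If local wall-clock time is needed afterwards, `parsed.dt.tz_convert(...)` can shift the values to a named time zone.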

### Step 6: Handle Outliers

In [15]:
%pip install scipy
from scipy.stats import zscore # type: ignore

# Calculate Z-scores for the temperature column
df['z_score'] = zscore(df['Temperature (°C)'])

# Keep only rows whose absolute Z-score is below the threshold of 3
df = df[df['z_score'].abs() < 3]

# Drop the z_score column after filtering
df = df.drop(columns=['z_score'])

# Display the cleaned DataFrame
print("\nFinal DataFrame after outlier removal:")
print(df.head())

Collecting scipy
  Downloading scipy-1.14.1-cp312-cp312-macosx_14_0_arm64.whl.metadata (60 kB)
Downloading scipy-1.14.1-cp312-cp312-macosx_14_0_arm64.whl (23.1 MB)
Installing collected packages: scipy
Successfully installed scipy-1.14.1
Note: you may need to restart the kernel to use updated packages.

Final DataFrame after outlier removal:
                        Time                       Device Name  \
0  2021-11-19 15:20:09+11:00  DLB ATM41 Charlestown Skate Park   
1  2021-11-19 15:10:07+11:00  DLB ATM41 Charlestown Skate Park   
2  2021-11-19 15:00:06+11:00  DLB ATM41 Charlestown Skate Park   
3  2021-11-19 14:50:09+11:00  DLB ATM41 Charlestown Skate Park   
4  2021-11-19 14:40:05+11:00  DLB ATM41 Charlestown Skate Park   

               Location  Temperature (°C)  Atmospheric Pressure (kPa)  \
0  -32.96599, 151.69513              24.4    
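The same filter can be written with pandas alone, which avoids both the SciPy dependency and the temporary `z_score` column. This is a sketch on toy values, not the notebook's actual data. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, whereas pandas' `std()` defaults to `ddof=1`, so `ddof` is set explicitly to match.

```python
import pandas as pd

# Toy temperatures with one obvious outlier (values are illustrative only)
temps = pd.Series([20.0] * 19 + [100.0], name="Temperature (°C)")

# Z-score using the population standard deviation, matching SciPy's default
z = (temps - temps.mean()) / temps.std(ddof=0)

# Boolean mask: True for readings within 3 standard deviations of the mean
mask = z.abs() < 3
kept = temps[mask]

print(len(temps), len(kept))
```

On a full DataFrame the same mask can be applied as `df[mask]`, provided the mask was computed from a column of that frame.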

### Step: Remove Irrelevant Data

In [18]:
if 'unnecessary_column' in df.columns:
    df = df.drop(columns=['unnecessary_column'])
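When several columns may or may not be present, the membership check can be replaced by the `errors="ignore"` option of `DataFrame.drop`. A small sketch on a toy frame follows; `unnecessary_column` is the placeholder name from the cell above, and dropping `Device Name` is purely illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Temperature (°C)": [24.4], "Device Name": ["A"]})

# 'unnecessary_column' is absent here; errors="ignore" skips it without raising
trimmed = toy.drop(columns=["unnecessary_column", "Device Name"], errors="ignore")

print(trimmed.columns.tolist())
```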


### Step: Save Clean Data

In [19]:
# Save cleaned data to a new CSV
df.to_csv('cleaned_data.csv', index=False)
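Reading the file back is a cheap way to confirm the export worked. The sketch below writes a toy frame to a temporary directory and compares shape and column names after reloading; the data is illustrative only.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "Temperature (°C)": [24.4, 24.1],
    "Humidity (%)": [60.0, 61.5],
})

# Write to a temporary directory so no real file is overwritten
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_data.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()

print(same_shape, same_columns)
```

CSV does not store dtypes, so datetime columns come back as strings unless they are parsed again on load, for example with the `parse_dates` argument of `pd.read_csv`.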
