# Data Processing
## Assignment 2
### Processing Climate Change Data on Crops
### Step 1: Import Libraries
Start by importing the necessary libraries:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### Step 2: Load the Dataset
Load the dataset into a pandas DataFrame:

In [13]:
# Load the dataset
df = pd.read_csv('climate_change_agriculture_dataset.csv') 
# Display the first few rows
df.head()

,Temperature,Precipitation,CO2 Levels,Crop Yield,Soil Health,Extreme Weather Events,Crop Disease Incidence,Water Availability,Food Security,Economic Impact
0,7,59,329,483,10,Drought,Low,High,Low,High
1,39,20,426,679,8,Heatwave,High,Low,High,Low
2,18,46,403,587,5,Flood,Low,Medium,Low,Medium
3,9,91,356,220,5,Heatwave,Medium,Medium,High,Medium
4,35,12,325,538,1,Storm,Medium,Medium,High,High


### Step 3: Explore the Dataset
Understand the structure of the data:

In [3]:
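# Get a summary of the dataset (info() prints its report directly)
df.info()

# Get a statistical summary of the numerical columns
# (not shown below: a notebook cell only displays its last expression)
df.describe()

# Check for missing values
df.isnull().sum()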

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Temperature             1000 non-null   int64 
 1   Precipitation           1000 non-null   int64 
 2   CO2 Levels              1000 non-null   int64 
 3   Crop Yield              1000 non-null   int64 
 4   Soil Health             1000 non-null   int64 
 5   Extreme Weather Events  1000 non-null   object
 6   Crop Disease Incidence  1000 non-null   object
 7   Water Availability      1000 non-null   object
 8   Food Security           1000 non-null   object
 9   Economic Impact         1000 non-null   object
dtypes: int64(5), object(5)
memory usage: 78.3+ KB


Temperature               0
Precipitation             0
CO2 Levels                0
Crop Yield                0
Soil Health               0
Extreme Weather Events    0
Crop Disease Incidence    0
Water Availability        0
Food Security             0
Economic Impact           0
dtype: int64
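
The checks above can be extended to the text columns. The following is a minimal sketch that assumes a small hypothetical DataFrame (named sample, with illustrative values and one deliberately missing entry) rather than the real dataset. It reports the share of missing values per column and the frequency of each category.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    "Temperature": [7, 39, 18, 9, 35],
    "Extreme Weather Events": ["Drought", "Heatwave", "Flood", "Heatwave", "Storm"],
    "Water Availability": ["High", "Low", "Medium", "Medium", None],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# Frequency of each category in every non-numeric column, counting missing entries too
for col in sample.select_dtypes(exclude=["number"]).columns:
    print(sample[col].value_counts(dropna=False))
```

Percentages are often easier to judge than raw counts when deciding whether to drop or fill a column.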

### Step 4: Handle Missing Values 
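If there are missing values, handle them by either dropping the affected rows or filling them with the mean or median. Step 3 found no missing values in this dataset, so the cells below leave the data unchanged; they are included to show the general approach.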

In [23]:
# Option 1: drop rows with missing values (df_clean is a separate DataFrame; df is unchanged)
df_clean = df.dropna()

# Option 2: fill missing values in df with the mean of each numeric column
numeric_cols = df.select_dtypes(include=['number']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

In [24]:
for col in df.select_dtypes(include=['object']).columns:
    # Fill categorical gaps with the most frequent value (the mode).
    # Assign the result back rather than calling fillna(..., inplace=True) on df[col]:
    # the chained inplace form raises a FutureWarning and will stop working in pandas 3.0.
    df[col] = df[col].fillna(df[col].mode()[0])
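
To illustrate the choice between mean and median mentioned at the start of this step, here is a small sketch on hypothetical toy data (the toy DataFrame and its values are assumptions, not taken from the dataset). It includes one outlier so that the two fill strategies give visibly different results.

```python
import pandas as pd

# Hypothetical toy data with one outlier and a gap in each column
toy = pd.DataFrame({
    "Crop Yield": [480.0, 500.0, None, 520.0, 5000.0],
    "Water Availability": ["High", None, "Medium", "Medium", "Low"],
})

# The outlier (5000) pulls the mean far above typical values; the median is robust to it
mean_fill = toy["Crop Yield"].fillna(toy["Crop Yield"].mean())
median_fill = toy["Crop Yield"].fillna(toy["Crop Yield"].median())

# Categorical gaps get the most frequent value (the mode)
mode_fill = toy["Water Availability"].fillna(toy["Water Availability"].mode()[0])

print(mean_fill[2], median_fill[2], mode_fill[1])
```

When a column is skewed or contains outliers, the median is usually the safer default; for roughly symmetric data the two are close.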


### Step 5: Normalize the Data
Standardize the numerical features so that each has zero mean and unit variance, which is what StandardScaler does. Putting features on a similar scale matters for many machine learning algorithms, such as K-Means or Logistic Regression.


In [25]:
from sklearn.preprocessing import StandardScaler

# Scale three feature columns into a separate NumPy array; df itself is not modified
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['Temperature', 'Precipitation', 'CO2 Levels']])


In [26]:
from sklearn.preprocessing import StandardScaler

# Select the numerical columns to normalize
numerical_columns = ['Temperature', 'Precipitation', 'CO2 Levels', 'Crop Yield', 'Soil Health']

# Initialize the scaler
scaler = StandardScaler()

# Apply the scaler to the numerical features
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Display the first few rows to verify normalization
df.head()

,Temperature,Precipitation,CO2 Levels,Crop Yield,Soil Health,Extreme Weather Events,Crop Disease Incidence,Water Availability,Food Security,Economic Impact
0,-1.215649,0.372264,-1.2776,-0.239774,1.575583,Drought,Low,High,Low,High
1,0.930126,-0.93628,0.420428,0.512194,0.872668,Heatwave,High,Low,High,Low
2,-0.478038,-0.063917,0.017803,0.159229,-0.181703,Flood,Low,Medium,Low,Medium
3,-1.081538,1.445941,-0.804953,-1.248793,-0.181703,Heatwave,Medium,Medium,High,Medium
4,0.661905,-1.204699,-1.347622,-0.028763,-1.587532,Storm,Medium,Medium,High,High
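
To make the transformation concrete, the sketch below recomputes the scaling by hand. It uses only the five Temperature values from the preview in Step 2 as toy input, so the resulting numbers will differ from the output above, where the mean and standard deviation come from all 1000 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Temperature values from the Step 2 preview, used here as toy input
x = np.array([[7.0], [39.0], [18.0], [9.0], [35.0]])

scaled = StandardScaler().fit_transform(x)

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

The scaled column has a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point error.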


### Step 6: Final Dataset Overview
Check the final structure of the cleaned and preprocessed dataset to ensure everything is ready for modeling.

In [27]:
# Get a summary of the final cleaned dataset
df.info()

# Display the first few rows of the cleaned dataset
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Temperature             1000 non-null   float64
 1   Precipitation           1000 non-null   float64
 2   CO2 Levels              1000 non-null   float64
 3   Crop Yield              1000 non-null   float64
 4   Soil Health             1000 non-null   float64
 5   Extreme Weather Events  1000 non-null   object 
 6   Crop Disease Incidence  1000 non-null   object 
 7   Water Availability      1000 non-null   object 
 8   Food Security           1000 non-null   object 
 9   Economic Impact         1000 non-null   object 
dtypes: float64(5), object(5)
memory usage: 78.3+ KB


,Temperature,Precipitation,CO2 Levels,Crop Yield,Soil Health,Extreme Weather Events,Crop Disease Incidence,Water Availability,Food Security,Economic Impact
0,-1.215649,0.372264,-1.2776,-0.239774,1.575583,Drought,Low,High,Low,High
1,0.930126,-0.93628,0.420428,0.512194,0.872668,Heatwave,High,Low,High,Low
2,-0.478038,-0.063917,0.017803,0.159229,-0.181703,Flood,Low,Medium,Low,Medium
3,-1.081538,1.445941,-0.804953,-1.248793,-0.181703,Heatwave,Medium,Medium,High,Medium
4,0.661905,-1.204699,-1.347622,-0.028763,-1.587532,Storm,Medium,Medium,High,High
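
Step 1 imports train_test_split, LinearRegression, and mean_squared_error, but the notebook does not use them. The following is one possible sketch of a later modeling step, not the notebook's own method. It runs on synthetic data generated with an assumed linear relationship, and it fits the scaler on the training rows only, a common way to keep test-set statistics out of preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real dataset and its relationships are not used here
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Temperature": rng.integers(0, 40, size=200),
    "Precipitation": rng.integers(0, 100, size=200),
    "CO2 Levels": rng.integers(300, 450, size=200),
})
# Assumed (made-up) relationship plus noise, purely for illustration
data["Crop Yield"] = (
    5 * data["Temperature"] + 2 * data["Precipitation"] + rng.normal(0, 10, size=200)
)

X = data[["Temperature", "Precipitation", "CO2 Levels"]]
y = data["Crop Yield"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then reuse its statistics on the test rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression().fit(X_train_scaled, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_scaled))
print(f"Test MSE: {mse:.2f}")
```

With the real data, the text columns would also need to be encoded as numbers before they could be passed to a linear model.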


In [29]:
# Save the cleaned and transformed dataset
df.to_csv('processed_climate_change_data.csv', index=False)
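
A quick way to confirm the export is to read the file back and compare it with what was written. The sketch below does this with a hypothetical two-row DataFrame and a temporary file (the name processed_sample.csv is an assumption) instead of the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the processed dataset
processed = pd.DataFrame({
    "Temperature": [-1.215649, 0.930126],
    "Extreme Weather Events": ["Drought", "Heatwave"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "processed_sample.csv")  # hypothetical file name
    processed.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False the reloaded frame has the same columns and no extra index column
print(list(reloaded.columns) == list(processed.columns))
print(reloaded.shape == processed.shape)
```

Without index=False, the row index would be written as an extra unnamed first column and would reappear as a regular column on reload.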
