# Crop Recommendation — Data Preprocessing

This Jupyter Notebook is the **first step** of the Crop Recommendation Machine Learning project.
Its goal is to clean the dataset and save it so that it can be used for model training.

## Tasks Performed in This Notebook

- Load the raw dataset  
- Inspect and clean the data  
- Encode crop labels using LabelEncoder  
- Apply feature scaling using StandardScaler  
- Split the data into training and testing sets  
- Save the processed data for future use

### Importing Required Libraries

Libraries used:
- Pandas and NumPy for data handling
- Scikit-Learn for label encoding, scaling, and splitting
- Pickle for saving preprocessing objects

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.model_selection import train_test_split
import pickle 
import os 

os.makedirs('../data/',exist_ok=True)
os.makedirs('../Models/',exist_ok=True)


In [4]:
df=pd.read_csv('../data/Crop_recommendation.csv')
print('Loaded Data Successfully')
print('Shape of Data',df.shape)
df

Loaded Data Successfully
Shape of Data (2200, 8)


|      | N   | P   | K   | temperature | humidity  | ph       | rainfall   | label  |
|------|-----|-----|-----|-------------|-----------|----------|------------|--------|
| 0    | 90  | 42  | 43  | 20.879744   | 82.002744 | 6.502985 | 202.935536 | rice   |
| 1    | 85  | 58  | 41  | 21.770462   | 80.319644 | 7.038096 | 226.655537 | rice   |
| 2    | 60  | 55  | 44  | 23.004459   | 82.320763 | 7.840207 | 263.964248 | rice   |
| 3    | 74  | 35  | 40  | 26.491096   | 80.158363 | 6.980401 | 242.864034 | rice   |
| 4    | 78  | 42  | 42  | 20.130175   | 81.604873 | 7.628473 | 262.717340 | rice   |
| ...  | ... | ... | ... | ...         | ...       | ...      | ...        | ...    |
| 2195 | 107 | 34  | 32  | 26.774637   | 66.413269 | 6.780064 | 177.774507 | coffee |
| 2196 | 99  | 15  | 27  | 27.417112   | 56.636362 | 6.086922 | 127.924610 | coffee |
| 2197 | 118 | 33  | 30  | 24.131797   | 67.225123 | 6.362608 | 173.322839 | coffee |
| 2198 | 117 | 32  | 34  | 26.272418   | 52.127394 | 6.758793 | 127.175293 | coffee |


### Data Cleaning

This step includes:

- Checking for missing values  
- Removing duplicate rows  

This ensures data quality before training.

In [7]:
# Check for missing values and remove duplicate rows
print("Missing Values per column")
print(df.isnull().sum())

before=df.shape[0]
df.drop_duplicates(inplace=True)
after=df.shape[0]
print(f'Removed {before-after} duplicates')

Missing Values per column
N              0
P              0
K              0
temperature    0
humidity       0
ph             0
rainfall       0
label          0
dtype: int64
Removed 0 duplicates


### Label Encoding

Crop names (labels) are categorical.

`LabelEncoder` converts crop names into integer codes. It sorts the class names alphabetically and numbers them from 0, so with the 22 crops in this dataset:

| Crop     | Encoded |
|----------|---------|
| chickpea | 3       |
| maize    | 11      |
| rice     | 20      |

The encoder is saved using Pickle so predictions can be decoded later.

In [9]:
le=LabelEncoder()
df['label']=le.fit_transform(df['label'])

print('The target labels are encoded successfully!!')
print("Encoded classes:",list(le.classes_))

with open('../Models/label_encoder.pkl','wb') as f:
    pickle.dump(le,f)

The target labels are encoded successfully!!
Encoded classes: ['apple', 'banana', 'blackgram', 'chickpea', 'coconut', 'coffee', 'cotton', 'grapes', 'jute', 'kidneybeans', 'lentil', 'maize', 'mango', 'mothbeans', 'mungbean', 'muskmelon', 'orange', 'papaya', 'pigeonpeas', 'pomegranate', 'rice', 'watermelon']
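
`LabelEncoder` sorts the class names and uses each name's position in that sorted list as its code. The plain-Python sketch below rebuilds that mapping from the class list printed above, without scikit-learn. The names `label_to_code` and `code_to_label` are illustrative and not part of the notebook.

```python
# Class names copied from the notebook output above (already in sorted order).
classes = [
    'apple', 'banana', 'blackgram', 'chickpea', 'coconut', 'coffee',
    'cotton', 'grapes', 'jute', 'kidneybeans', 'lentil', 'maize',
    'mango', 'mothbeans', 'mungbean', 'muskmelon', 'orange', 'papaya',
    'pigeonpeas', 'pomegranate', 'rice', 'watermelon',
]

# A label's code is its index in the sorted class list.
label_to_code = {name: code for code, name in enumerate(sorted(classes))}
# The reverse mapping turns predicted codes back into crop names.
code_to_label = {code: name for name, code in label_to_code.items()}

for crop in ('rice', 'maize', 'chickpea'):
    print(crop, '->', label_to_code[crop])
```

In the notebook itself, `le.inverse_transform` performs the reverse lookup on arrays of predicted codes.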


### Train-Test Split

The dataset is split into:

- **80% training data**
- **20% testing data**

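Holding out a test set lets the model be evaluated on data it never saw during training. The split is stratified on the label (`stratify=y`), so every crop appears in the same proportion in both sets, and `random_state=42` makes the split reproducible.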

In [15]:
x=df.drop('label',axis=1)
y=df['label']

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42,test_size=0.2,stratify=y)
print('Data split into training and testing ')
print('Train shape :',x_train.shape,'Test Shape:',x_test.shape)

Data split into training and testing 
Train shape : (1760, 7) Test Shape: (440, 7)
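
To illustrate what `stratify=y` does, here is a hand-rolled stratified split in plain Python. It is a simplified sketch, not scikit-learn's implementation: the helper `stratified_split` and the toy labels are hypothetical, and the shuffling will not reproduce the rows that `train_test_split` selects.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with about test_size of each class in the test set."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 3 classes with 10 rows each (made-up data, not the crop dataset).
labels = ['rice'] * 10 + ['maize'] * 10 + ['coffee'] * 10
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sorted(labels[i] for i in test_idx))
```

Because the test fraction is taken per class rather than from the whole dataset, no crop can end up missing from the test set by chance.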


### Feature Scaling

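All numerical features (N, P, K, temperature, humidity, ph, rainfall) are standardized using `StandardScaler`, which rescales each feature to zero mean and unit variance on the training data.

The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training. Scaling keeps features with large ranges (such as rainfall) from dominating features with small ranges (such as ph) in models that are sensitive to feature scale.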

In [16]:
scaler=StandardScaler()
x_train_scaler=scaler.fit_transform(x_train)
x_test_scaler=scaler.transform(x_test)

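# Convert the scaled arrays back to DataFrames, keeping the original row index
# so the features stay aligned with y_train and y_test
x_train=pd.DataFrame(x_train_scaler,columns=x.columns,index=x_train.index)
x_test=pd.DataFrame(x_test_scaler,columns=x.columns,index=x_test.index)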


with open('../Models/scaler.pkl',"wb") as f:
    pickle.dump(scaler,f)

print('Feature Scaling Completed and Saved Successfully')

Feature Scaling Completed and Saved Successfully
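
The standardization itself is a small formula: `z = (x - mean) / std`, with the mean and the population standard deviation taken from the training data only. The sketch below applies it by hand to one feature in plain Python. The five training values are the temperature readings from the rice rows in the dataset preview; the test value and the names `standardize`, `train_temps`, and `test_temps` are made up for illustration.

```python
import statistics

def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]

# Temperature values from the first five rows of the dataset preview.
train_temps = [20.879744, 21.770462, 23.004459, 26.491096, 20.130175]
# Hypothetical unseen value standing in for the test set.
test_temps = [25.0]

# "Fit": learn the statistics from the training data only.
mean = statistics.fmean(train_temps)
std = statistics.pstdev(train_temps)  # population std (ddof=0), as StandardScaler uses

# "Transform": reuse the training statistics for both sets.
train_scaled = standardize(train_temps, mean, std)
test_scaled = standardize(test_temps, mean, std)

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction, while the scaled test value only expresses how far it lies from the training mean.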


### Saving Processed Data

The notebook saves:

- Training features  
- Testing features  
- Encoded labels  
- Label Encoder  
- Scaler  

These files will be used in the model training notebook.

In [18]:
x_train.to_csv('../data/X_train.csv',index=False)
x_test.to_csv('../data/X_test.csv',index=False)
pd.DataFrame(y_test).to_csv('../data/Y_test.csv',index=False)
pd.DataFrame(y_train).to_csv('../data/Y_train.csv',index=False)
print('Train test data has been successfully saved in data folder!!!')

Train test data has been successfully saved in data folder!!!
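
A later notebook can restore the saved objects with `pickle.load`. The sketch below shows the round trip under stated assumptions: `save_pickle` and `load_pickle` are hypothetical helpers, and a plain list stands in for the fitted encoder so the example runs without the project files. Loading the real `label_encoder.pkl` or `scaler.pkl` additionally requires scikit-learn to be installed in the loading environment.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize obj to path, mirroring the pickle.dump calls in this notebook."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for a fitted preprocessing object (not a real LabelEncoder).
stand_in_classes = ['apple', 'banana', 'rice']

# A temporary directory replaces '../Models/' so nothing is written to the project.
with tempfile.TemporaryDirectory() as models_dir:
    path = os.path.join(models_dir, 'label_encoder.pkl')
    save_pickle(stand_in_classes, path)
    restored = load_pickle(path)

print(restored == stand_in_classes)
```

Pickle files can execute arbitrary code when loaded, so only load files that come from a trusted source.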


## Final Outcome

This notebook prepares clean, model-ready data by:

- Checking for missing values and dropping duplicate rows  
- Encoding crop labels as integers  
- Scaling numeric values  
- Splitting into train/test sets  
- Saving reusable objects

Next step: **Train the Machine Learning model.**