In [1]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy.stats import pearsonr
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



Exercises
We are provided with the `Crop_yield` dataset that contains various factors that could influence the yield of a particular crop across different regions.

In [2]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/Crop_yield.csv")
df.head(5)

Unnamed: 0,Region,Temperature,Rainfall,Soil_Type,Fertilizer_Usage,Pesticide_Usage,Irrigation,Crop_Variety,Yield
0,East,23.152156,803.362573,Clayey,204.792011,20.76759,1,Variety B,40.316318
1,West,19.382419,571.56767,Sandy,256.201737,49.290242,0,Variety A,26.846639
2,North,27.89589,-8.699637,Loamy,222.202626,25.316121,0,Variety C,-0.323558
3,East,26.741361,897.426194,Loamy,187.98409,17.115362,0,Variety C,45.440871
4,East,19.090286,649.384694,Loamy,110.459549,24.068804,1,Variety B,35.478118


In [3]:
# Show the DataFrame dimensions as (rows, columns)
df.shape

(1000, 9)

`Exercise 1`
Our dataset contains several categorical features: `Region`, `Soil_Type`, and `Crop_Variety`.

Use dummy variable encoding to convert these features into a numerical format suitable for model training. Verify the transformation by displaying the first five rows of the modified dataset.

How has the number of variables in our dataset changed?

In [4]:
# Apply dummy variable encoding to the categorical variables
df_encoded = pd.get_dummies(df, columns=["Region", "Soil_Type", "Crop_Variety"], dtype=int)

# Display the first few rows of the modified dataset to confirm the transformation
df_encoded.head()

Unnamed: 0,Temperature,Rainfall,Fertilizer_Usage,Pesticide_Usage,Irrigation,Yield,Region_East,Region_North,Region_South,Region_West,Soil_Type_Clayey,Soil_Type_Loamy,Soil_Type_Sandy,Crop_Variety_Variety A,Crop_Variety_Variety B,Crop_Variety_Variety C
0,23.152156,803.362573,204.792011,20.76759,1,40.316318,1,0,0,0,1,0,0,0,1,0
1,19.382419,571.56767,256.201737,49.290242,0,26.846639,0,0,0,1,0,0,1,1,0,0
2,27.89589,-8.699637,222.202626,25.316121,0,-0.323558,0,1,0,0,0,1,0,0,0,1
3,26.741361,897.426194,187.98409,17.115362,0,45.440871,1,0,0,0,0,1,0,0,0,1
4,19.090286,649.384694,110.459549,24.068804,1,35.478118,1,0,0,0,0,1,0,0,1,0


In [5]:
df_encoded.shape

(1000, 16)

Note that the number of columns increased from `9` to `16` after dummy variable encoding: the three categorical columns were replaced by ten indicator columns (four regions, three soil types, and three crop varieties).
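The column arithmetic can be checked on a small toy frame. The sketch below uses made-up values with the same category levels as the encoded output above. It also shows `drop_first=True`, an optional `pd.get_dummies` argument that the exercise does not use; it keeps one fewer indicator per feature, which avoids perfectly collinear indicator columns in a linear model.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same category levels as the dataset
toy = pd.DataFrame({
    "Region": ["East", "West", "North", "South"],
    "Soil_Type": ["Clayey", "Sandy", "Loamy", "Loamy"],
    "Crop_Variety": ["Variety B", "Variety A", "Variety C", "Variety C"],
    "Yield": [40.3, 26.8, -0.3, 45.4],
})
cat_cols = ["Region", "Soil_Type", "Crop_Variety"]

# One indicator per level: 4 + 3 + 3 = 10 columns replace the 3 originals
full = pd.get_dummies(toy, columns=cat_cols, dtype=int)

# drop_first=True keeps k-1 indicators per feature: 3 + 2 + 2 = 7 columns
reduced = pd.get_dummies(toy, columns=cat_cols, dtype=int, drop_first=True)

print("All levels kept:", full.shape[1], "columns")
print("First level dropped:", reduced.shape[1], "columns")
```

With `Yield` as the only other column, the toy frame grows from 4 columns to 11, or to 8 when the first level of each feature is dropped.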

`Exercise 2`
We want to determine which variables from the new dataset we will use for model training.

Write a function `variance_thresholding` that uses variance thresholding to filter out features based on a variance threshold. The function should accept two parameters: the DataFrame and the threshold value. It should return two DataFrames: one containing only the features that meet the variance threshold criterion, and one containing all the scaled features.

Hint: Scaling is crucial as it allows the variance thresholding to be applied uniformly across features. Read up on using the MinMaxScaler() function from the sklearn.preprocessing package.
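To illustrate the hint, the following sketch compares variances before and after min-max scaling on a toy frame. The values are made up, and only the column names are borrowed from the dataset. Population variance (`ddof=0`) is used here to match the variance that `VarianceThreshold` computes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy features (hypothetical values) measured on very different scales
toy = pd.DataFrame({
    "Rainfall": [200.0, 400.0, 600.0, 800.0],
    "Irrigation": [0, 1, 0, 1],
})

# Raw variances mostly reflect the unit of measurement
raw_var = toy.var(ddof=0)

# After min-max scaling every column lies in [0, 1], so variances are comparable
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_var = scaled.var(ddof=0)

print("Raw variances:")
print(raw_var)
print("Scaled variances:")
print(scaled_var)
```

On the raw data, a single threshold cannot treat both columns fairly because `Rainfall` has a variance thousands of times larger than `Irrigation`. After scaling, no feature can exceed a variance of 0.25, so one threshold applies to all of them.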

In [6]:
def variance_thresholding(df_encoded, threshold_value):
    
    # Separate the features from the target; only the features are scaled and filtered
    X = df_encoded.drop(columns=['Yield'])
    
    # Initialize and fit the scaler to the features only
    scaler = MinMaxScaler()
    scaled_features = scaler.fit_transform(X)
    
    # Convert the scaled features back to a DataFrame
    df_scaled = pd.DataFrame(scaled_features, columns=X.columns)
    
    # Initialize the VarianceThreshold object with the specified threshold value
    selector = VarianceThreshold(threshold=threshold_value)
    
    # Apply the selector to the scaled feature DataFrame
    df_filtered_values = selector.fit_transform(df_scaled)
    
    # Convert the array result into a DataFrame with only the selected features
    selected_columns = df_scaled.columns[selector.get_support(indices=True)]
    df_filtered = pd.DataFrame(df_filtered_values, columns=selected_columns)
    
    # Return the filtered features and the full set of scaled features
    return df_filtered, df_scaled

`Exercise 3`
Using the function we created in `Exercise 2`, apply variance threshold filtering to our encoded dataset, with a threshold of 0.03. Compare the number of features before and after applying the variance threshold.

In [7]:
# Call the variance_thresholding() function and pass the given threshold
df_filtered, df_scaled = variance_thresholding(df_encoded, 0.03)

# Compare the number of features before and after variance thresholding
print("Number of features before variance thresholding:", df_scaled.shape[1])
print("Number of features after variance thresholding:", df_filtered.shape[1])

Number of features before variance thresholding: 15
Number of features after variance thresholding: 13
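The counts show how many features were removed, but not which ones. One way to find out is to inspect the fitted selector, sketched below on a toy frame. The column names and values are hypothetical, and the toy data is assumed to be already scaled to [0, 1].

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy scaled features (hypothetical values): "nearly_constant" barely varies
toy_scaled = pd.DataFrame({
    "balanced_flag": [0, 1, 0, 1, 0, 1],
    "nearly_constant": [0.50, 0.51, 0.50, 0.49, 0.50, 0.50],
    "spread_out": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
})

selector = VarianceThreshold(threshold=0.03)
selector.fit(toy_scaled)

# get_support() returns a boolean mask aligned with the input columns
mask = selector.get_support()
kept = toy_scaled.columns[mask].tolist()
dropped = toy_scaled.columns[~mask].tolist()

# variances_ holds the per-feature variance compared against the threshold
report = pd.Series(selector.variances_, index=toy_scaled.columns)
print(report)
print("Kept:", kept)
print("Dropped:", dropped)
```

The same pattern could be applied to the real data by comparing `df_scaled.columns` with `df_filtered.columns`.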


`Exercise 4`
Train two linear regression models:

a) Using all the available features in our dummy encoded dataset from Exercise 1.

In [8]:
# Separate the features from the target variable
X_all = df_encoded.drop(columns=['Yield'])
y = df_encoded['Yield']

# Split the full feature set into training and testing sets
X_train_all, X_test_all, y_train_all, y_test_all = (
    train_test_split(X_all, y, test_size=0.2, random_state=42)
)

# Training the model using all available features
model_all = LinearRegression()
model_all.fit(X_train_all, y_train_all)

b) Using only the features selected through the variance thresholding process in `Exercise 3`.

In [9]:
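# Split the filtered features into training and testing sets, reusing the target `y`
# and the same random_state so the rows match the split in part (a)
X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered = (
    train_test_split(df_filtered, y, test_size=0.2, random_state=42)
)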

# Training the model using selected features
model_filtered = LinearRegression()
model_filtered.fit(X_train_filtered, y_train_filtered)
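Once both models are trained, a natural next step is to compare them on the held-out test sets. The sketch below shows one way to do this with the `metrics` module imported earlier. It runs on synthetic data rather than the crop dataset, so the variable names and the printed scores are illustrative only.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): the target depends on the first two of three features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Same test_size and random_state for both splits so the rows line up
X_tr_all, X_te_all, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr_sub, X_te_sub, _, _ = train_test_split(X[:, :2], y, test_size=0.2, random_state=42)

model_all = LinearRegression().fit(X_tr_all, y_tr)
model_sub = LinearRegression().fit(X_tr_sub, y_tr)

# Compare both models on the same held-out rows
for name, model, X_te in [("all", model_all, X_te_all), ("subset", model_sub, X_te_sub)]:
    pred = model.predict(X_te)
    mse = metrics.mean_squared_error(y_te, pred)
    r2 = metrics.r2_score(y_te, pred)
    print(f"{name}: MSE={mse:.4f}, R2={r2:.4f}")
```

For the notebook's own models, the equivalent calls would use `model_all` with `X_test_all` and `y_test_all`, and `model_filtered` with `X_test_filtered` and `y_test_filtered`.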