## Exercises

We are provided with the `Crop_yield` dataset, which records several factors that may influence the yield of a particular crop across different regions.

### Import libraries and dataset

In [10]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# Load dataset
df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/Python/Crop_yield.csv")
df.head(5)

|   | Region | Temperature | Rainfall | Soil_Type | Fertilizer_Usage | Pesticide_Usage | Irrigation | Crop_Variety | Yield |
|---|--------|-------------|----------|-----------|------------------|-----------------|------------|--------------|-------|
| 0 | East | 23.152156 | 803.362573 | Clayey | 204.792011 | 20.76759 | 1 | Variety B | 40.316318 |
| 1 | West | 19.382419 | 571.56767 | Sandy | 256.201737 | 49.290242 | 0 | Variety A | 26.846639 |
| 2 | North | 27.89589 | -8.699637 | Loamy | 222.202626 | 25.316121 | 0 | Variety C | -0.323558 |
| 3 | East | 26.741361 | 897.426194 | Loamy | 187.98409 | 17.115362 | 0 | Variety C | 45.440871 |
| 4 | East | 19.090286 | 649.384694 | Loamy | 110.459549 | 24.068804 | 1 | Variety B | 35.478118 |


In [3]:
df.shape

(1000, 9)

### Exercise 1

Our dataset contains several categorical features: `Region`, `Soil_Type`, and `Crop_Variety`.

Use dummy variable encoding to convert these features into a numerical format suitable for model training. Verify the transformation by displaying the first five rows of the modified dataset.

> How has the number of variables in our dataset changed?

In [5]:
# Perform dummy variable encoding
df_encoded = pd.get_dummies(df, columns=['Region', 'Soil_Type', 'Crop_Variety'], drop_first=True, dtype=int)

print("\nDataFrame after Dummy Variable Encoding:")
print(df_encoded)


DataFrame after Dummy Variable Encoding:
     Temperature    Rainfall  Fertilizer_Usage  Pesticide_Usage  Irrigation  \
0      23.152156  803.362573        204.792011        20.767590           1   
1      19.382419  571.567670        256.201737        49.290242           0   
2      27.895890   -8.699637        222.202626        25.316121           0   
3      26.741361  897.426194        187.984090        17.115362           0   
4      19.090286  649.384694        110.459549        24.068804           1   
..           ...         ...               ...              ...         ...   
995    19.540123  543.669287         99.907202        44.913209           1   
996    26.090378  549.562256        237.636041        42.582608           1   
997    25.104737  756.628061        236.604958        19.356913           1   
998    19.604622  607.894675         86.379325        18.025284           0   
999    27.945697  543.381289        269.494596        38.605471           1   
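
To answer the question about how the number of variables changes, one option is to compare column counts before and after encoding, and to use `head()` so that only the first five rows are shown. The sketch below is a minimal illustration on a hand-made frame (`toy` and its values are hypothetical, not the real dataset), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Region": ["East", "West", "North", "South", "East"],
    "Soil_Type": ["Clayey", "Sandy", "Loamy", "Loamy", "Sandy"],
    "Crop_Variety": ["Variety B", "Variety A", "Variety C", "Variety C", "Variety B"],
    "Rainfall": [800.0, 570.0, 610.0, 900.0, 650.0],
    "Yield": [40.3, 26.8, 30.1, 45.4, 35.5],
})

categorical = ["Region", "Soil_Type", "Crop_Variety"]
toy_encoded = pd.get_dummies(toy, columns=categorical, drop_first=True, dtype=int)

# Show only the first five rows rather than the whole frame
print(toy_encoded.head(5))

# With drop_first=True, a categorical column with k levels becomes k - 1 dummy columns
print("Columns before encoding:", toy.shape[1])
print("Columns after encoding:", toy_encoded.shape[1])
```

The same comparison on the real data (`df.shape[1]` against `df_encoded.shape[1]`) should show the original 9 columns growing to 13, which is consistent with the 12 features plus `Yield` reported in Exercise 3.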

         

### Exercise 2

We want to determine which variables from the new dataset we will use for model training.

Write a function `variance_thresholding` that filters out features whose variance does not exceed a given threshold. The function should accept two parameters: the DataFrame and the threshold value. It should return two DataFrames: one containing only the scaled features that meet the variance threshold criterion, and one containing all the scaled features.

**Hint:** Scaling is crucial because it puts every feature on the same range, so a single threshold can be applied uniformly across features. Read up on the `MinMaxScaler` class from the `sklearn.preprocessing` module.
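
To see why the hint matters, the sketch below compares variances before and after min-max scaling. It is only an illustration: the `toy` frame and its values are made up, and the population variance (`ddof=0`) is used because that is what `VarianceThreshold` computes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales (values are made up)
toy = pd.DataFrame({
    "Rainfall": [500.0, 600.0, 700.0, 800.0, 900.0],  # measured in hundreds of units
    "Irrigation": [0.0, 1.0, 0.0, 1.0, 1.0],          # binary flag
})

# Raw variances are dominated by the unit of measurement
raw_var = toy.var(ddof=0)

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min)
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_var = scaled.var(ddof=0)

print(pd.DataFrame({"raw": raw_var, "scaled": scaled_var}))
```

On the raw values, `Rainfall` has a far larger variance than `Irrigation` simply because of its units. After scaling, the two are directly comparable, and the binary flag actually has the larger variance.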

In [6]:
def variance_thresholding(df_encoded, threshold_value):
    """Scale the features to [0, 1] and keep those whose variance exceeds the threshold.

    Returns a tuple (df_filtered, df_scaled).
    """
    # Separate the features from the target; only the features are scaled and filtered
    X = df_encoded.drop(columns=['Yield'])

    # Initialise the scaler and fit it to the features only
    scaler = MinMaxScaler()
    scaled_features = scaler.fit_transform(X)

    # Convert the scaled features back to a DataFrame, keeping the original index
    df_scaled = pd.DataFrame(scaled_features, columns=X.columns, index=X.index)

    # Initialise the VarianceThreshold object with the specified threshold value
    selector = VarianceThreshold(threshold=threshold_value)

    # Apply the selector to the scaled feature DataFrame
    df_filtered_values = selector.fit_transform(df_scaled)

    # Convert the array result into a DataFrame with only the selected features
    selected_columns = df_scaled.columns[selector.get_support()]
    df_filtered = pd.DataFrame(df_filtered_values, columns=selected_columns, index=df_scaled.index)

    # Return both the filtered DataFrame and the full scaled DataFrame
    return df_filtered, df_scaled


### Exercise 3

Using the function we created in **Exercise 2**, apply variance threshold filtering to our encoded dataset, with a threshold of `0.03`. Compare the number of features before and after applying the variance threshold.

In [7]:
# The function returns two DataFrames, so unpack both before printing
df_filtered, df_scaled = variance_thresholding(df_encoded, threshold_value=0.03)
print("\nFiltered DataFrame:")
print(df_filtered)


Filtered DataFrame:
     Fertilizer_Usage  Pesticide_Usage  Irrigation  Region_North  \
0            0.618965         0.268771         1.0           0.0   
1            0.825228         0.982393         0.0           0.0   
2            0.688819         0.382573         0.0           1.0   
3            0.551529         0.177395         0.0           0.0   
4            0.240490         0.351366         1.0           0.0   
..                ...              ...         ...           ...   
995          0.198152         0.872882         1.0           1.0   
996          0.750740         0.814572         1.0           1.0   
997          0.746603         0.233477         1.0           0.0   
998          0.143876         0.200160         0.0           0.0   
999          0.878561         0.715066         1.0           0.0   

     Region_South  Region_West  Soil_Type_Loamy  Soil_Type_Sandy  \
0             0.0          0.0              0.0              0.0   
1             0.0        

In [8]:
# Call the variance_thresholding() function and pass the given threshold
df_filtered, df_scaled = variance_thresholding(df_encoded, 0.03)

# Compare the number of features before and after variance thresholding
print("Number of features before variance thresholding:", df_scaled.shape[1])
print("Number of features after variance thresholding:", df_filtered.shape[1])

Number of features before variance thresholding: 12
Number of features after variance thresholding: 10
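
Beyond counting features, it can help to see which ones were removed and what their variances were. The sketch below shows one way to do that with the selector's `get_support()` mask and `variances_` attribute. The `scaled` frame, its column names, and its values are hypothetical and assumed to be already scaled to [0, 1].

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical, already-scaled features (values are made up)
scaled = pd.DataFrame({
    "feature_a": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],        # high variance
    "feature_b": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50],  # nearly constant
    "feature_c": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],        # moderate variance
})

selector = VarianceThreshold(threshold=0.03)
selector.fit(scaled)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
kept = list(scaled.columns[mask])
dropped = list(scaled.columns[~mask])

# variances_ holds the per-feature variance the selector computed
report = pd.Series(selector.variances_, index=scaled.columns)
print(report)
print("Kept:", kept)
print("Dropped:", dropped)
```

In the notebook output above, the filtered DataFrame starts at `Fertilizer_Usage`, which suggests that `Temperature` and `Rainfall` were the two features removed. Comparing `df_scaled.columns` with `df_filtered.columns` would confirm this.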


### Exercise 4

Train two linear regression models:

**a)** Using all the available features in our dummy encoded dataset from **Exercise 1**.

In [16]:
# Define features (X) and target variable (y)
X = df_encoded.drop(columns=['Yield'])
y = df_encoded['Yield']

# Split the dataset into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)




**b)** Using only the features selected through the variance thresholding process in **Exercise 3**.

In [17]:

# Split the filtered features into training and testing sets (80% / 20%)
X_train_filtered, X_test_filtered, y_train_filtered, y_test_filtered = train_test_split(
    df_filtered, y, test_size=0.2, random_state=42
)

# Create and train the Linear Regression model on the filtered features only
model_filtered = LinearRegression()
model_filtered.fit(X_train_filtered, y_train_filtered)
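
A natural follow-up is to compare the two models on their held-out test sets. The sketch below shows one possible way to do this with `mean_squared_error` and `r2_score`. It uses synthetic data rather than the crop dataset, and the helper `fit_and_score` and the column names `x1`, `x2`, `x3` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the crop dataset): y depends on x1 and x2 only
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.uniform(0, 1, size=(200, 3)), columns=["x1", "x2", "x3"])
y = 3.0 * X["x1"] + 2.0 * X["x2"] + rng.normal(0, 0.1, size=200)

def fit_and_score(features, target):
    """Fit a linear regression on an 80/20 split; return (MSE, R^2) on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)

# Same split seed for both runs, so the two test sets contain the same rows
mse_all, r2_all = fit_and_score(X, y)
mse_sub, r2_sub = fit_and_score(X[["x1", "x2"]], y)

print(f"All features:    MSE={mse_all:.4f}, R2={r2_all:.4f}")
print(f"Subset features: MSE={mse_sub:.4f}, R2={r2_sub:.4f}")
```

Applied to the exercise, the same pattern would score the model from part **a)** on `X_test` and the model from part **b)** on `X_test_filtered`. Each model must be evaluated on test features that match the columns it was trained on.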

