### **1. Project Objective**

The central goal of this project is to explore and analyze a dataset of forty soybean cultivars harvested over two consecutive seasons. The aim is to understand the factors influencing thousand-seed weight (MHG) and grain yield (GY) among these cultivars, investigate seasonal changes in these metrics, and build predictive models for estimating MHG in new cultivars. The project also includes a cluster analysis to group similar cultivars based on their characteristics.

**Table of Contents**


1. Project Objective

2. Data Cleaning, Preprocessing and Feature Engineering  

3. Exploratory Data Analysis (EDA)

4. Most Important Factors for MHG
   - Feature importance refers to techniques that score each input feature by how useful it is for predicting a target variable. These scores show which features matter most in determining a model's output.
   - One such method is statistical correlation scoring: calculate a correlation coefficient (such as Pearson's) between each feature and the target variable, and treat features with higher absolute correlation values as more important.

5. Most Important Factors for GY    

6. Seasonal Variations in MHG and GY

7. Creating Synthetic Data for New Cultivar Prediction
   - Join tables, train on the 2 tables, test on the mean of the features.

8. Cluster Analysis

9. Model Selection and Tuning

10. Results

11. Conclusions
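
The correlation-based scoring mentioned under item 4 can be sketched as follows. This is a minimal, dependency-free illustration rather than the analysis carried out in this notebook: the helper names `pearson` and `rank_by_abs_correlation` are hypothetical, and the toy values are invented for the example and are not taken from the soybean dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

def rank_by_abs_correlation(features, target):
    """Return (name, r) pairs sorted by |r|, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy values for illustration only; they are not taken from the soybean dataset.
toy_features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the target
    "feature_b": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls as the target rises
    "feature_c": [2.0, 1.0, 2.0, 1.0, 2.0],  # no linear relation to the target
}
toy_target = [10.0, 20.0, 30.0, 40.0, 50.0]

ranking = rank_by_abs_correlation(toy_features, toy_target)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by the absolute value keeps strongly negative relationships near the top alongside strongly positive ones. Pearson's coefficient only captures linear association, so a feature with a low score may still matter to a non-linear model.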

### **2. Data Cleaning, Preprocessing and Feature Engineering**

In [1]:
import pandas as pd

data_path = "./data/data.csv"
data = pd.read_csv(data_path)

print(data.head())
print(data.describe(include='all'))
print(data.info())


   Season    Cultivar  Repetition     PH    IFP    NLP     NGP   NGL   NS  \
0       1  NEO 760 CE           1  58.80  15.20   98.2  177.80  1.81  5.2   
1       1  NEO 760 CE           2  58.60  13.40  102.0  195.00  1.85  7.2   
2       1  NEO 760 CE           3  63.40  17.20  100.4  203.00  2.02  6.8   
3       1  NEO 760 CE           4  60.27  15.27  100.2  191.93  1.89  6.4   
4       1   MANU IPRO           1  81.20  18.00   98.8  173.00  1.75  7.4   

      MHG       GY  
0  152.20  3232.82  
1  141.69  3517.36  
2  148.81  3391.46  
3  148.50  3312.58  
4  145.59  3230.99  
            Season    Cultivar  Repetition          PH       IFP         NLP  \
count   320.000000         320  320.000000  320.000000  320.0000  320.000000   
unique         NaN          40         NaN         NaN       NaN         NaN   
top            NaN  NEO 760 CE         NaN         NaN       NaN         NaN   
freq           NaN           8         NaN         NaN       NaN         NaN   
mean      1

In [2]:
cultivars_path = './data/cultivars-description.ods'
cultivars_data = pd.read_excel(cultivars_path, engine='odf')

print(cultivars_data.head())
print(cultivars_data.describe(include='all'))
print(cultivars_data.info())

       Cultivars  Maturation group  Seeds per meter/linear  \
0  FTR 3190 IPRO               9.0                    12.5   
1  FTR 4288 IPRO               8.8                    11.0   
2   NK 8770 IPRO               8.7                    16.0   
3      M 8606I2X               8.6                    10.0   
4    M 8644 IPRO               8.6                    11.0   

   Density per meter/linear  
0                    250000  
1                    220000  
2                    320000  
3                    200000  
4                    220000  
            Cultivars  Maturation group  Seeds per meter/linear  \
count              40         40.000000               40.000000   
unique             40               NaN                     NaN   
top     FTR 3190 IPRO               NaN                     NaN   
freq                1               NaN                     NaN   
mean              NaN          8.042500               15.260000   
std               NaN          0.461818      

To keep cultivar names consistent across both datasets, the next cell aligns the column name and data type, defines a function that finds the closest matching name, builds the list of candidate names from cultivars_data, and replaces each 'Cultivar' name in data with its closest match from cultivars_data.

In [3]:
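from fuzzywuzzy import process

# Align the key column name with the one used in `data`
cultivars_data.rename(columns={'Cultivars': 'Cultivar'}, inplace=True)

# Make sure both key columns are plain strings before matching
data['Cultivar'] = data['Cultivar'].astype(str)
cultivars_data['Cultivar'] = cultivars_data['Cultivar'].astype(str)

def find_closest_name(name, choices):
    # extractOne returns a (match, score) tuple; keep only the matched name
    return process.extractOne(name, choices)[0]

# Reference names to match against
choices = cultivars_data['Cultivar'].tolist()

# Replace each name in `data` with its closest reference name
data['Cultivar'] = data['Cultivar'].apply(lambda x: find_closest_name(x, choices))

# Number of distinct cultivar names after matching
len(data["Cultivar"].unique())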


40
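
With its default settings, `process.extractOne` returns the best-scoring candidate even when that candidate is a poor match, so a count of 40 unique names does not by itself show that every name was matched to the right cultivar. The sketch below illustrates one way to flag doubtful matches with a similarity cutoff. It uses the standard library's `difflib` as a stand-in for `fuzzywuzzy` (scores from the two are not comparable), and the function name `match_with_cutoff`, the 0.8 cutoff, and the input spellings are assumptions made for illustration.

```python
import difflib

def match_with_cutoff(name, choices, cutoff=0.8):
    """Return (best_match, ratio), or (None, ratio) when the best ratio is below cutoff."""
    best_name, best_ratio = None, 0.0
    for candidate in choices:
        ratio = difflib.SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = candidate, ratio
    if best_ratio < cutoff:
        return None, best_ratio
    return best_name, best_ratio

# Reference names taken from the cultivars_data.head() output above.
reference_names = ["FTR 3190 IPRO", "FTR 4288 IPRO", "NK 8770 IPRO", "M 8606I2X", "M 8644 IPRO"]

# Hypothetical input spellings, for illustration only.
raw_names = ["FTR 3190 IPRO", "NK 8770 IPR0", "UNKNOWN CULTIVAR"]

results = {raw: match_with_cutoff(raw, reference_names) for raw in raw_names}
for raw, (match, ratio) in results.items():
    status = match if match is not None else "NO CONFIDENT MATCH"
    print(f"{raw!r} -> {status} (ratio {ratio:.2f})")
```

Names that come back as `None` can then be reviewed by hand instead of being silently mapped to an unrelated cultivar.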

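Applying label encoding to the 'Cultivar' column in both datasets, using a single encoder fitted on cultivars_data so that the integer codes agree across the two tables.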

In [6]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_encoder = LabelEncoder()

label_encoder.fit(cultivars_data['Cultivar'])

data['Cultivar'] = label_encoder.transform(data['Cultivar'])
cultivars_data['Cultivar'] = label_encoder.transform(cultivars_data['Cultivar'])

print(cultivars_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Cultivar                  40 non-null     int64  
 1   Maturation group          40 non-null     float64
 2   Seeds per meter/linear    40 non-null     float64
 3   Density per meter/linear  40 non-null     int64  
dtypes: float64(2), int64(2)
memory usage: 1.4 KB
None
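
As a rough illustration of what the encoder does: `LabelEncoder` sorts the distinct labels and assigns each one its position in that sorted order, and fitting once and reusing the same mapping is what keeps the codes aligned between the two tables. The dependency-free sketch below mimics that behaviour. `fit_label_codes` is a hypothetical helper, not part of scikit-learn, and because it uses only the five names from the `cultivars_data.head()` output, its codes differ from those produced on the full 40-cultivar table.

```python
def fit_label_codes(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

# The five names from the cultivars_data.head() output (a subset, for illustration).
reference_names = ["FTR 3190 IPRO", "FTR 4288 IPRO", "NK 8770 IPRO", "M 8606I2X", "M 8644 IPRO"]

codes = fit_label_codes(reference_names)                  # "fit" on the reference table only
decode = {code: label for label, code in codes.items()}   # inverse mapping

# "Transform" names from a second table with the same mapping.
# An unseen name raises KeyError here, much as LabelEncoder.transform
# raises ValueError for labels it was not fitted on.
observations = ["M 8644 IPRO", "FTR 3190 IPRO", "M 8644 IPRO"]
encoded = [codes[name] for name in observations]

print(codes)
print(encoded)
print([decode[c] for c in encoded])
```

Because the codes simply follow sorted order, they work as identifiers; their numeric size carries no agronomic meaning.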


Applying one-hot encoding to 'Season' and 'Repetition', then concatenating the one-hot encoded columns back onto the original dataframe and dropping the original two columns.

In [5]:
# One-hot encode the two categorical design columns
one_hot_encoder = OneHotEncoder()
season_repetition_encoded = one_hot_encoder.fit_transform(data[['Season', 'Repetition']]).toarray()
season_repetition_encoded_df = pd.DataFrame(season_repetition_encoded, columns=one_hot_encoder.get_feature_names_out(['Season', 'Repetition']))

# Append the indicator columns and drop the originals
data = pd.concat([data, season_repetition_encoded_df], axis=1)
data.drop(['Season', 'Repetition'], axis=1, inplace=True)


print(data.head())
print(data.describe(include='all'))
print(data.info())

   Cultivar     PH    IFP    NLP     NGP   NGL   NS     MHG       GY  \
0        31  58.80  15.20   98.2  177.80  1.81  5.2  152.20  3232.82   
1        31  58.60  13.40  102.0  195.00  1.85  7.2  141.69  3517.36   
2        31  63.40  17.20  100.4  203.00  2.02  6.8  148.81  3391.46   
3        31  60.27  15.27  100.2  191.93  1.89  6.4  148.50  3312.58   
4        29  81.20  18.00   98.8  173.00  1.75  7.4  145.59  3230.99   

   Season_1  Season_2  Repetition_1  Repetition_2  Repetition_3  Repetition_4  
0       1.0       0.0           1.0           0.0           0.0           0.0  
1       1.0       0.0           0.0           1.0           0.0           0.0  
2       1.0       0.0           0.0           0.0           1.0           0.0  
3       1.0       0.0           0.0           0.0           0.0           1.0  
4       1.0       0.0           1.0           0.0           0.0           0.0  
         Cultivar          PH       IFP         NLP         NGP         NGL  \
count  3