### **1. Project Objective**

The central goal of this project is to explore and analyze a dataset comprising forty soybean cultivars harvested over two consecutive seasons. The aim is to understand which factors influence thousand-seed weight (MHG) and grain yield (GY) among these cultivars, to investigate seasonal changes in these metrics, and to build predictive models for estimating MHG in new cultivars. The project also includes a cluster analysis to group similar cultivars based on their characteristics.

**Table of Contents**


1. Project Objective

2. Data Cleaning and Preprocessing

3. Feature Engineering

4. Exploratory Data Analysis (EDA)

5. Most Important Factors for MHG

    Feature importance refers to techniques that assign a score to each input feature based on how useful it is for predicting a target variable. These scores show which features play a crucial role in determining the outcome of a predictive model. One method for calculating feature importance is a statistical correlation score:

    Calculate a correlation coefficient (such as Pearson's) between each feature and the target variable. Features with higher absolute correlation values are considered more important.

6. Most Important Factors for GY    

7. Seasonal Variations in MHG and GY

8. Creating Synthetic Data for New Cultivar Prediction (join the tables, train on the 2 tables, test on the mean of the features)

9. Cluster Analysis

10. Model Selection and Tuning

11. Results

12. Conclusions
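
The correlation-based scoring described under item 5 can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual analysis: `correlation_importance` is a hypothetical helper name, and the toy values below are made up for demonstration rather than taken from the dataset. Only the column names (`PH`, `NGL`, `NS`, `MHG`, `Cultivar`) mirror the real table.

```python
import pandas as pd


def correlation_importance(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by absolute Pearson correlation with `target`.

    Hypothetical helper for illustration only.
    """
    # Keep numeric columns only; text columns such as Cultivar are dropped.
    numeric = df.select_dtypes(include="number")
    # Pearson correlation of every numeric column with the target,
    # excluding the target's trivial correlation with itself.
    corr = numeric.corr(method="pearson")[target].drop(target)
    # Higher absolute value = stronger linear association.
    return corr.abs().sort_values(ascending=False)


# Toy frame with made-up values (NOT rows from the soybean dataset).
toy = pd.DataFrame({
    "Cultivar": ["A", "B", "C", "D"],
    "PH": [1.0, 2.0, 3.0, 4.0],
    "NGL": [4.0, 3.0, 2.0, 1.0],
    "NS": [1.0, 3.0, 2.0, 4.0],
    "MHG": [2.0, 4.0, 6.0, 8.0],
})

scores = correlation_importance(toy, "MHG")
print(scores)
```

On the real data, the same call would presumably be `correlation_importance(data, "MHG")` once `data` is loaded. Pearson correlation captures only linear association, so a low score does not rule out a nonlinear relationship with the target.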

### **2. Data Cleaning and Preprocessing**

In [1]:
import pandas as pd

data_path = "./data/data.csv"
data = pd.read_csv(data_path)

print(data.head())                   # first five rows
print(data.describe(include='all'))  # summary statistics for every column
data.info()                          # prints its report directly and returns None


   Season    Cultivar  Repetition     PH    IFP    NLP     NGP   NGL   NS  \
0       1  NEO 760 CE           1  58.80  15.20   98.2  177.80  1.81  5.2   
1       1  NEO 760 CE           2  58.60  13.40  102.0  195.00  1.85  7.2   
2       1  NEO 760 CE           3  63.40  17.20  100.4  203.00  2.02  6.8   
3       1  NEO 760 CE           4  60.27  15.27  100.2  191.93  1.89  6.4   
4       1   MANU IPRO           1  81.20  18.00   98.8  173.00  1.75  7.4   

      MHG       GY  
0  152.20  3232.82  
1  141.69  3517.36  
2  148.81  3391.46  
3  148.50  3312.58  
4  145.59  3230.99  
            Season    Cultivar  Repetition          PH       IFP         NLP  \
count   320.000000         320  320.000000  320.000000  320.0000  320.000000   
unique         NaN          40         NaN         NaN       NaN         NaN   
top            NaN  NEO 760 CE         NaN         NaN       NaN         NaN   
freq           NaN           8         NaN         NaN       NaN         NaN   
mean      1

In [3]:
cultivars_path = './data/cultivars-description.ods'
cultivars_data = pd.read_excel(cultivars_path, engine='odf')

print(cultivars_data.head())                   # first five rows
print(cultivars_data.describe(include='all'))  # summary statistics for every column
cultivars_data.info()                          # prints its report directly and returns None

       Cultivars  Maturation group  Seeds per meter/linear  \
0  FTR 3190 IPRO               9.0                    12.5   
1  FTR 4288 IPRO               8.8                    11.0   
2   NK 8770 IPRO               8.7                    16.0   
3      M 8606I2X               8.6                    10.0   
4    M 8644 IPRO               8.6                    11.0   

   Density per meter/linear  
0                    250000  
1                    220000  
2                    320000  
3                    200000  
4                    220000  
            Cultivars  Maturation group  Seeds per meter/linear  \
count              40         40.000000               40.000000   
unique             40               NaN                     NaN   
top     FTR 3190 IPRO               NaN                     NaN   
freq                1               NaN                     NaN   
mean              NaN          8.042500               15.260000   
std               NaN          0.461818      