# Report

## Setup

In [1]:
import pickle
import pandas as pd
import altair as alt

from sklearn.metrics import r2_score
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler

## 1 Introduction and data


### 1.1 Introduction and Motivation

For our project we have chosen a dataset about the release and establishment of the weevil Mecinus janthiniformis for biological control of Dalmatian toadflax in southern California. 

Dalmatian toadflax was introduced to North America in the 1800s, presumably as an ornamental plant and for use in fabrics and folk remedies. The species is now widespread in large parts of the USA and Canada. A lack of natural enemies is seen as one reason why plant species become invasive pests when they are introduced into areas outside their original range. Due to its extensive root system and prolific seed production, Dalmatian toadflax can thrive and spread rapidly in a variety of ecosystems, including grasslands and roadsides.

Invasive plants such as Dalmatian toadflax can cause several ecological problems. First, they can displace native plants and thereby reduce biodiversity. This can also affect the availability of habitat for native animals, as the dominance of Dalmatian toadflax can disrupt the structure and function of ecosystems. Second, invasive plants can alter the soil composition by removing nutrients, which can affect the native vegetation. Moreover, Dalmatian toadflax tends to accumulate dry material, which can increase the fire hazard.

Therefore, attempts are being made to control the spread of Dalmatian toadflax. During the 1950s and 1960s, studies with herbicides were conducted. Because of the variable responses, different herbicides were recommended, e.g. borate-chlorate mixtures, different acids or silvex. In addition, the use of adapted grasses in competition with toadflax was investigated.
However, the studies concluded that chemical control alone is not practical for large infestations and emphasized the need for integrated control strategies that take into account both chemical and ecological factors.
Other studies have investigated the effect of prescribed fires on the spread of Dalmatian toadflax. These studies concluded that fire does not reduce populations of Dalmatian toadflax.

Nowadays, biological control using herbivorous insects such as Mecinus janthiniformis is considered the most promising method of controlling invasive weeds in a long-term, cost-effective and sustainable way.
These insects can infest plants, seeds, roots, leaves and stems. Their advantages include a continuous effect and permanent control, good compatibility with other control methods, self-spreading populations, and a long-term, environmentally friendly mode of action.
However, studies and tests on the success of these natural control agents are very cost-intensive and time-consuming.
Nevertheless, the advantages outweigh the disadvantages, which is why the use of herbivorous insects to control invasive plants is widespread today.

In our case, weevil Mecinus janthiniformis populations were released in the investigated area of southern California.
We aim to find patterns and trends within the dataset and define factors and potential predictors associated with the growth of Dalmatian toadflax and the spread of weevil populations.

**References**

Jacobs, J. S., & Sheley, R. L. (2003). Prescribed fire effects on Dalmatian toadflax. Rangeland Ecology & Management/Journal of Range Management Archives, 56(2), 193-197. https://journals.uair.arizona.edu/index.php/jrm/article/download/9791/9403

Robocker, W. C. (1968). Control of Dalmatian Toadflax. Rangeland Ecology & Management/Journal of Range Management Archives, 21(2), 94-98. https://journals.uair.arizona.edu/index.php/jrm/article/viewFile/5580/5190

Sing, S. E., De Clerck-Floate, R. A., Hansen, R. W., Pearce, H., Randall, C. B., Toševski, I., & Ward, S. M. (2016). Biology and biological control of Dalmatian and yellow toadflax (p. 141). Morgantown, West Virginia: USDA Forest Service, Forest Health Technology Enterprise Team. https://www.fs.usda.gov/rm/pubs_journals/2016/rmrs_2016_sing_s001.pdf

Willden, S. A., & Evans, E. W. (2019). Summer development and survivorship of the weed biocontrol agent, Mecinus janthiniformis (Coleoptera: Curculionidae), within stems of its host, Dalmatian toadflax (Lamiales: Plantaginaceae), in Utah. Environmental entomology, 48(3), 533-539.

### 1.2 Research Questions and Hypothesis

**Linear Regression:**

We want to identify patterns and trends within our dataset and define factors or potential predictors associated with the growth of Dalmatian toadflax. Our model aims to predict the main stem length of Dalmatian toadflax based on different traits.

**Logistic Regression:**

For the logistic regression we want to predict whether a plant is infested by weevil populations or not. To do so, we use different predictor variables and try to identify which characteristics of a plant are most likely to indicate a possible infestation.

Our **hypothesis** is that the spread of weevil populations contributes to the reduction of Dalmatian toadflax vegetation in the area and reduces the size of the plants. 

If it turns out that weevil cultivation does indeed influence the containment of the spread, this could serve as a basis for the decision to increase the use of herbivorous insects such as Mecinus janthiniformis to control the growth of Dalmatian toadflax.

### 1.3 Data Origin

The dataset was originally collected by Lincoln Smith starting in 2008 and published by the Agricultural Research Service (Department of Agriculture). 

Every year an observational study was conducted on six different sites within the investigated area with approximately 7-78 observations per site and year. The plants were collected, examined, measured and dissected in the laboratory. Each observation in the dataset represents a plant in the investigated area.

Link to the data source: https://catalog.data.gov/dataset/data-from-release-and-establishment-of-the-weevil-mecinus-janthiniformis-for-biological-co

### 1.4 Data Dictionary

Let's first take a look at the metadata to get a brief overview of the data in our DataFrame df. There are 25 variables. The description, role, type and format of each variable can be taken from the following table:

In [2]:
meta = pd.read_excel('../data/raw/metadata_dictionary.xlsx')
meta

| | Name | Description | Role | Type | Format |
|---|---|---|---|---|---|
| 0 | year | year that stems were infested | - | numeric | int64 |
| 1 | diss date | date dissected in the laboratory | - | numeric | object |
| 2 | date | date collected in the field | - | numeric | object |
| 3 | site | six study sites at Hungry Valley study area | Predictor | nominal | object |
| 4 | trt | release or not in 2008 and 2014 | - | nominal | object |
| 5 | BC | 1 = early establishment, 0 = late establishment | - | nominal | int64 |
| 6 | stem # | stem ID | ID | numeric | int64 |
| 7 | stem diam bottom (mm) | diameter of stem at bottom | Predictor | numeric | float64 |
| 8 | main stem length (cm) | length of stem, excluding side branches | Predictor, response | numeric | float64 |
| 9 | total meja | sum of no. empty chambers, dead larvae, dead p... | Predictor | numeric | int64 |


### 1.5 Data Corrections

- Column names adjusted and standardized, spaces removed
- Tip of stem broken (cut, broken or empty) recoded as 1 and 0 --> categorical predictor variable
- Columns removed that are not relevant for the analysis or contain no data
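
The corrections listed above could look roughly like the following sketch. It is not the code used for this report: the helper names `clean_column_name` and `apply_corrections` are hypothetical, the toy column names are only modeled on the data dictionary, and it assumes that an intact stem tip is stored as a missing value in the raw data.

```python
import re

import pandas as pd


def clean_column_name(name: str) -> str:
    """Lowercase a column name and replace runs of other characters with '_'."""
    name = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    return name.strip("_")


def apply_corrections(raw: pd.DataFrame, tip_col: str = "tip of stem broken") -> pd.DataFrame:
    """Hypothetical helper: recode the tip column, drop empty columns, tidy names."""
    df = raw.copy()
    # Assumption: any entry (e.g. "cut" or "broken") means 1, a missing entry means 0
    df[tip_col] = df[tip_col].notna().astype(int)
    # Remove columns that contain no data at all
    df = df.dropna(axis="columns", how="all")
    df.columns = [clean_column_name(c) for c in df.columns]
    return df


# Toy data, not taken from the real dataset
raw = pd.DataFrame({
    "main stem length (cm)": [40.0, 55.5, 31.2],
    "tip of stem broken": ["cut", None, "broken"],
    "empty column": [None, None, None],
})
corrected = apply_corrections(raw)
print(corrected)
```

The generic renaming rule used here does not reproduce the `_in_cm` suffixes of the corrected file, so the actual renaming was presumably done with an explicit mapping.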

In [3]:
df = pd.read_csv('../data/interim/dissections_2012_HV_corrected.csv')
df.columns.to_list()

['Unnamed: 0',
 'main_stem_length_in_cm',
 'total_meja',
 'infested',
 'tip_of_stem_broken',
 'side_branches_in_cm',
 'total_number_adults',
 'stem_diam_bottom_in_cm',
 'diam_top_in_cm']

### 1.5 Definition of Key Variables


- Variables that describe the size of the plant and the number of weevils are selected
- All variables should be described in more detail
- All variables are chosen based on the characteristics they express --> later forward selection to choose the features for the model

From the dataset we chose the following variables:

- **Main stem length in cm:** Describes the length of the stem, excluding side branches.
- **Stem diam bottom in cm:** Describes the stem diameter at the bottom of the plant.
- **Diam top in cm:** Describes the stem diameter at the top of the plant.
- **Side branches in cm:** Describes the cumulative length of side branches that were examined.
- **Total meja:** Describes the sum of the number of empty chambers, dead larvae, dead pupae, live pupa, live larva and total number of adults.
- **Tip of stem broken:** Describes whether the stem tip is broken or cut (1 = broken/cut, 0 = intact stem tip).
- **Infested:** Describes whether a plant is infested or not (1 = infested by M. janthiniformis, 0 = not infested).
- **Total number adults:** Describes the number of live and dead adults.

Later we use forward selection to decide which variables are most suitable for our models.

#### 1.5.1 Linear Regression

The following variables were identified as possible predictor variables for linear regression. We define our possible predictor variables in a list called features_lin.

- stem_diam_bottom_in_cm
- total_meja
- diam_top_in_cm
- side_branches_in_cm
- tip_of_stem_broken
- infested
- total_number_adults

We want to use these variables to predict the main stem length of the plants and therefore define them as X. y is assigned a Series containing the values from the column main_stem_length_in_cm, whose name is stored in the y_label_lin variable.

In [4]:
y_label_lin = 'main_stem_length_in_cm'

features_lin = ['stem_diam_bottom_in_cm', 'diam_top_in_cm', 'side_branches_in_cm', 'total_meja', 'tip_of_stem_broken', 'infested', 'total_number_adults']
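
As a minimal sketch of how X and y can be derived from such a label and feature list: the names `example_label`, `example_features` and `toy` are placeholders chosen for this illustration, and the values are made up.

```python
import pandas as pd

# Placeholder names and made-up values, standing in for y_label_lin, features_lin and df
example_label = "main_stem_length_in_cm"
example_features = ["stem_diam_bottom_in_cm", "total_meja"]

toy = pd.DataFrame({
    "main_stem_length_in_cm": [40.0, 55.5, 31.2],
    "stem_diam_bottom_in_cm": [0.30, 0.45, 0.25],
    "total_meja": [2, 0, 5],
})

X = toy[example_features]  # DataFrame with one column per predictor
y = toy[example_label]     # Series holding the response variable
print(X.shape, y.name)
```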

#### 1.5.2 Logistic Regression

For the logistic regression analysis, we want to predict whether a plant is infested or not in order to analyze whether plants infested by weevils are more likely to remain smaller and if weevils are really useful and effective as natural control agents. Possible predictor variables are again defined in a features list:

- main_stem_length_in_cm
- stem_diam_bottom_in_cm
- side_branches_in_cm
- diam_top_in_cm
- tip_of_stem_broken

As we want to predict whether a plant is infested or not, we cannot use features that directly indicate an infestation, e.g. 'total_meja', 'total_number_adults', 'meja_per_100_cm'. Therefore, we use only features that do not directly indicate an infestation.

In [5]:
y_label_log = 'infested'

features_log = ['main_stem_length_in_cm', 'stem_diam_bottom_in_cm', 'side_branches_in_cm', 'diam_top_in_cm', 'tip_of_stem_broken']

### 1.6 Highlights from EDA


#### 1.6.1 Linear Regression

In [6]:
df = pd.read_csv('../data/interim/dissections_2012_HV_corrected.csv', index_col = 0)

We create a row of histograms using Altair. Each column of the grid shows the histogram of one variable, either the response variable or one of the features.
This diagram is a tool for an initial, visual exploration of the data. The histograms allow us to check the distribution of each variable and to identify patterns, outliers or characteristic shapes that may indicate certain properties of the data.

In [7]:
alt.Chart(df).mark_bar().encode(
    alt.X(alt.repeat("column"), type="quantitative", bin=True),
    y='count()',
).properties(
    width=150,
    height=150
).repeat(
    column=[y_label_lin] + features_lin
)

- Comments on the distributions --> left-skewed / right-skewed etc.

The following scatter plot matrix provides a visual overview of the relationship between our response variable main stem length and our possible predictor variables. By looking at the scatter plots, patterns such as linear relationships, clustering or outliers can be identified.

In [8]:
alt.Chart(df).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=150,
    height=150
).repeat(
    row=[y_label_lin],
    column=[y_label_lin] + features_lin
).interactive()

We can identify that stem diam bottom and main stem length have a linear relationship.

In the following code cell we calculate the correlation coefficients between the columns in the DataFrame df and the response variable (y_label_lin). The larger the absolute value of the correlation coefficient, the stronger the linear relationship between the two variables. A positive correlation indicates that as one variable increases, the other tends to increase, while a negative correlation indicates that as one variable increases, the other tends to decrease.

As we already recognized in the scatter plot matrix, stem diam bottom has the strongest linear relationship (0.66) with the response variable main stem length. This indicates that stem diam bottom will be a suitable predictor variable. However, we have to keep in mind that correlation does not imply causation.
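
For reference, the coefficient returned by `df.corr()` with its default settings is the Pearson correlation coefficient. For two columns $x$ and $y$ with $n$ jointly observed values it is defined as:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

pandas excludes missing values pairwise, so coefficients that involve sparsely filled columns are computed from fewer rows than the others and should be read with more caution.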

In [9]:
corr = df.corr()
corr[y_label_lin].sort_values(ascending=False)

main_stem_length_in_cm    1.000000
stem_diam_bottom_in_cm    0.660344
infested                  0.103211
total_meja                0.026979
tip_of_stem_broken       -0.071523
total_number_adults      -0.094173
diam_top_in_cm           -0.167899
side_branches_in_cm      -0.256441
Name: main_stem_length_in_cm, dtype: float64

By applying the background gradient, we can improve the visualization of the correlation matrix. This color gradient allows for a quick and intuitive visual assessment of the strength and direction of correlations in the matrix.

In [10]:
corr.style.background_gradient(cmap='Blues')

| | main_stem_length_in_cm | total_meja | infested | tip_of_stem_broken | side_branches_in_cm | total_number_adults | stem_diam_bottom_in_cm | diam_top_in_cm |
|---|---|---|---|---|---|---|---|---|
| main_stem_length_in_cm | 1.0 | 0.026979 | 0.103211 | -0.071523 | -0.256441 | -0.094173 | 0.660344 | -0.167899 |
| total_meja | 0.026979 | 1.0 | 0.630152 | -0.014358 | 0.070532 | 0.765378 | 0.175888 | 0.138283 |
| infested | 0.103211 | 0.630152 | 1.0 | -0.083126 | 0.025031 | 0.431988 | 0.214868 | 0.232745 |
| tip_of_stem_broken | -0.071523 | -0.014358 | -0.083126 | 1.0 | -0.111848 | 0.138073 | 0.211586 | 0.540899 |
| side_branches_in_cm | -0.256441 | 0.070532 | 0.025031 | -0.111848 | 1.0 | 0.18602 | 0.213961 | NaN |
| total_number_adults | -0.094173 | 0.765378 | 0.431988 | 0.138073 | 0.18602 | 1.0 | 0.058155 | 0.037933 |
| stem_diam_bottom_in_cm | 0.660344 | 0.175888 | 0.214868 | 0.211586 | 0.213961 | 0.058155 | 1.0 | 0.468958 |
| diam_top_in_cm | -0.167899 | 0.138283 | 0.232745 | 0.540899 | NaN | 0.037933 | 0.468958 | 1.0 |


#### 1.6.2 Logistic Regression

To get more detailed information, we generate a grid of area charts using Altair, where each chart shows the distribution of one of the quantitative variables listed in features_log in the DataFrame df. The areas are colored based on the values in the "infested" column.

In [11]:
alt.Chart(df).mark_area(
    opacity=0.5,
    interpolate='step'
).encode(
    alt.X(alt.repeat("column"), type="quantitative", bin=alt.Bin(maxbins=20)),
    alt.Y('count()', stack = None),
    alt.Color('infested:N'),
).properties(width=300).repeat(column=features_log)

To compare the main stem length and the top diameter between the categories "infested" and "not infested", we create boxplots showing the distribution of the variables "main_stem_length_in_cm" and "diam_top_in_cm" for the two categories of the variable "infested". The visualization allows a quick comparison of the distributions between the two categories. We can observe visible differences in the lengths between the two categories, but also some outliers within the category "infested".

In [12]:
alt.Chart(df).mark_boxplot(
    size=50,
    opacity=0.7
).encode(
    x=alt.X('main_stem_length_in_cm:Q', scale=alt.Scale(zero=True)),
    y='infested:N',  
).properties(width=300, height=300)

In [13]:
alt.Chart(df).mark_boxplot(
    size=50,
    opacity=0.7
).encode(
    x=alt.X('diam_top_in_cm:Q', scale=alt.Scale(zero=True)),
    y='infested:N',  
).properties(width=300, height=300)

- Describe the differences in more detail
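
The differences visible in the boxplots can also be summarized numerically. The following sketch groups a toy DataFrame by the "infested" column and reports the count, median and mean per group; the values are made up for illustration and are not results from our data.

```python
import pandas as pd

# Made-up values, standing in for df
toy = pd.DataFrame({
    "infested": [0, 0, 0, 1, 1, 1],
    "main_stem_length_in_cm": [30.0, 42.0, 36.0, 45.0, 50.0, 40.0],
})

# One row per category of "infested", one column per summary statistic
summary = (
    toy.groupby("infested")["main_stem_length_in_cm"]
    .agg(["count", "median", "mean"])
)
print(summary)
```

Applied to df, the same call gives the group medians that the boxplots display as the central line of each box.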

## 2 Methodology


### 2.1 Imputation

Using the df.info() method, we quickly recognized that the columns diam_top_in_cm and side_branches_in_cm contain far fewer values than the other columns.
See the table and chart below:

In [14]:
notna = pd.DataFrame(df.notna().sum()).rename(columns={0: "count"})
notna["%"] = (notna["count"] / len(df) * 100).round(1)
notna

| | count | % |
|---|---|---|
| main_stem_length_in_cm | 1066 | 100.0 |
| total_meja | 1066 | 100.0 |
| infested | 1066 | 100.0 |
| tip_of_stem_broken | 1066 | 100.0 |
| side_branches_in_cm | 84 | 7.9 |
| total_number_adults | 1066 | 100.0 |
| stem_diam_bottom_in_cm | 1066 | 100.0 |
| diam_top_in_cm | 370 | 34.7 |


In [15]:
alt.Chart(notna.reset_index()).mark_bar().encode(
    x="count",
    y="index",
).properties(height=500)

We can therefore consider filling in the missing values using the SimpleImputer class from scikit-learn.
There are three possible options to proceed:

**Option 1: method = "reduced"**
With option 1, we omit all rows with NaN values and use a reduced DataFrame.

**Option 2: method = "imputed_mean"**
Option 2 uses imputation to fill in the missing values in the two specified columns. As strategy we specify mean so the missing values are replaced with the mean of each column.

**Option 3: method = "imputed_median"**
With option 3, the missing values in the two specified columns are also filled in by imputation. As a strategy, we specify the median so that the missing values are replaced by the median of each column.

We implemented two functions: one reduces our DataFrame with respect to the columns that were identified as containing a high number of NaN values, and the other imputes the missing values in the specified columns with the mean or median.

In addition, we implement the function split_and_save to split our DataFrames with the applied method into train and test data. 
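
The three options can be sketched as follows. This is an illustration under assumptions, not the implementation from our draft analysis: the helper name `prepare` and the toy values are hypothetical, and the sketch returns the train/test split instead of saving it as split_and_save does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

SPARSE_COLS = ("diam_top_in_cm", "side_branches_in_cm")


def prepare(df: pd.DataFrame, method: str, cols=SPARSE_COLS) -> pd.DataFrame:
    """Hypothetical helper applying one of the three options to a copy of df."""
    if method == "reduced":
        # Option 1: keep only rows without any missing value
        return df.dropna().reset_index(drop=True)
    # Options 2 and 3: fill the sparse columns with their mean or median
    strategy = {"imputed_mean": "mean", "imputed_median": "median"}[method]
    out = df.copy()
    out[list(cols)] = SimpleImputer(strategy=strategy).fit_transform(out[list(cols)])
    return out


# Toy data with gaps in the two sparse columns
toy = pd.DataFrame({
    "main_stem_length_in_cm": [40.0, 55.0, 31.0, 62.0],
    "diam_top_in_cm": [0.2, np.nan, 0.4, 0.6],
    "side_branches_in_cm": [np.nan, 10.0, np.nan, 30.0],
})

reduced = prepare(toy, "reduced")
imputed = prepare(toy, "imputed_mean")
train, test = train_test_split(imputed, test_size=0.25, random_state=0)
print(len(reduced), len(train), len(test))
```

One design point to keep in mind: fitting the imputer on the full data before splitting lets information from the test rows influence the imputed values. Fitting it on the training split only and reusing it on the test split avoids this.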

### 2.2 Forward Selection

- Which predictor variables did we choose and why?
- How did we arrive at the final model?
- Describe the model selection and the variable selection
- Everything we adjusted and tried on the model

To identify the best linear and logistic regression models we implemented a function that evaluates both models.
We apply each method (reduced, imputed_mean, imputed_median) to each model and use forward selection to determine which variables are the most suitable for the respective model. The evaluation is based on the R² value for linear regression and the ROC AUC for logistic regression.

(See draft analysis for function definition.)
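
For readers without access to the draft analysis, the following is a minimal sketch of greedy forward selection. It is not the function from the draft: the name `forward_selection`, the stopping rule (stop when the cross-validated score no longer improves) and the synthetic data are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_selection(model, X, y, scoring="r2", cv=5):
    """Add one feature at a time, keeping the one that improves the CV score most."""
    remaining = list(X.columns)
    selected = []
    best_score = -np.inf
    while remaining:
        # Mean cross-validated score for each candidate added to the current set
        scores = {
            f: cross_val_score(model, X[selected + [f]], y, scoring=scoring, cv=cv).mean()
            for f in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no candidate improves the score any further
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


# Synthetic data: only "signal" is related to the response
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=200)

selected, score = forward_selection(LinearRegression(), X, y)
print(selected, round(score, 3))
```

For a classifier, the same loop can be run with `scoring="roc_auc"` and a logistic regression model.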

**Evaluation results:**

Based on the results we chose the imputed_mean method for our linear regression model. The best performance of the model can be reached using the following features to predict a plant's main stem length: 'total_meja', 'stem_diam_bottom_in_cm', 'tip_of_stem_broken', 'diam_top_in_cm'.

For our logistic regression model we chose the reduced method. The best performance can be reached using the features 'main_stem_length_in_cm', 'diam_top_in_cm' to predict whether a plant is infested or not.

We assume that a complete dataset would have yielded better results. 

### 2.2 Training and Validation

We are using cross-validation to evaluate the performance of our linear regression and logistic regression model on the training dataset. This approach provides a robust estimate of the model's performance by evaluating it on five different subsets of the training data.
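
A minimal sketch of five-fold cross-validation with scikit-learn on synthetic data. The scoring choices are assumptions: the stored linear regression scores look like mean squared errors and the logistic regression scores like accuracies, but the scoring used to produce them is not shown in this report. The sketch also uses LogisticRegression instead of LogisticRegressionCV to keep it short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, not related to the toadflax dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_lin = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
y_clf = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# scikit-learn scorers follow "higher is better", so the MSE is returned negated
mse_scores = -cross_val_score(
    LinearRegression(), X, y_lin, cv=5, scoring="neg_mean_squared_error"
)
acc_scores = cross_val_score(
    LogisticRegression(), X, y_clf, cv=5, scoring="accuracy"
)

print("MSE per fold:", mse_scores.round(3))
print("Accuracy per fold:", acc_scores.round(3))
```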

In [16]:
df_scores_lin = pd.read_csv('../data/interim/scores_linear_regression.csv', index_col = 0)
df_scores_clf = pd.read_csv('../data/interim/scores_logistic_regression.csv', index_col = 0)

#### 2.2.1 Linear Regression

In [17]:
alt.Chart(df_scores_lin.reset_index()).mark_line(
     point=alt.OverlayMarkDef()
).encode(
    x=alt.X("index", bin=False, title="Fold", axis=alt.Axis(tickCount=5)),
    y=alt.Y("lr", aggregate="mean", title="Mean squared error (MSE)")
)

In [18]:
df_scores_lin.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| lr | 5.0 | 295.066477 | 37.956283 | 233.055399 | 297.215897 | 302.709285 | 305.508943 | 336.842861 |


#### 2.2.2 Logistic Regression

In [19]:
df_scores_clf.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| lr | 5.0 | 0.800452 | 0.04781 | 0.745763 | 0.779661 | 0.779661 | 0.830508 | 0.866667 |


### 2.3 Oversampling

In Section 2.2 (Forward Selection) we chose the reduced method for our logistic regression model. Therefore, we have fewer observations in our remaining dataset. We recognized that the proportion of infested and not infested plants in our dataset was imbalanced, so we used oversampling to compensate for the difference between the value counts and balance our dataset.

We used RandomOverSampler from imblearn, but the balanced dataset did not lead to significant improvements of our model.
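
To illustrate the idea behind random oversampling, the sketch below uses plain pandas as a stand-in for RandomOverSampler: rows of the smaller class are drawn again with replacement until both classes have the same size. The function name `oversample_minority` and the toy data are hypothetical.

```python
import pandas as pd


def oversample_minority(df: pd.DataFrame, label: str, random_state: int = 0) -> pd.DataFrame:
    """Resample every smaller class with replacement up to the largest class size."""
    counts = df[label].value_counts()
    target = counts.max()
    parts = []
    for cls, n in counts.items():
        part = df[df[label] == cls]
        if n < target:
            # Draw the missing number of rows from this class, with replacement
            extra = part.sample(n=target - n, replace=True, random_state=random_state)
            part = pd.concat([part, extra])
        parts.append(part)
    # Shuffle so that the classes are not stored in blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy data: eight not infested plants, two infested plants
toy = pd.DataFrame({
    "main_stem_length_in_cm": [30.0, 35.0, 41.0, 38.0, 44.0, 29.0, 33.0, 47.0, 52.0, 49.0],
    "infested": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
balanced = oversample_minority(toy, "infested")
print(balanced["infested"].value_counts())
```

Oversampling is usually applied to the training data only, so that the test data keeps the original class proportions and contains no duplicated rows from the training set.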

## 3 Results



- Our final model with all relevant statistics and goodness-of-fit measures
- Results: use the model to answer the research questions and the hypothesis


### 3.1 Linear Regression

### 3.2 Logistic Regression

- Precision / Recall --> evaluate the model
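
A sketch of how precision, recall and the ROC AUC could be reported for a fitted classifier. The data is synthetic and the numbers it produces are not results of this project; LogisticRegression and the class names passed as `target_names` are choices made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data, not related to the toadflax dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression().fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard class labels for precision and recall
y_score = clf.predict_proba(X_test)[:, 1]  # probability of class 1 for the ROC AUC

report = classification_report(
    y_test, y_pred, target_names=["not infested", "infested"]
)
auc = roc_auc_score(y_test, y_score)
print(report)
print("ROC AUC:", round(auc, 3))
```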

## 4 Discussion and Conclusion



The study on the release and establishment of the weevil Mecinus janthiniformis for biological control of Dalmatian toadflax in Southern California provides valuable insights into the complex interplay between invasive plant species, biological control agents and the environment.

### 4.1 Achievements

- Handling of missing data --> imputation

After inspecting the data, we noticed that some rows were missing information. We decided to use imputation to fill in these gaps. We also made it possible to remove rows with missing data, which reduced our dataset by about half. We hypothesize that a complete dataset might have yielded more significant insights.


- Handling of the imbalanced dataset --> oversampling
- Feature selection
- Evaluation of the models


### 4.2 Limitations

- Potential improvements through complete data
- The data and results refer to one specific region only and may not be transferable
- The success of the weevils also depends on external factors such as weather and other environmental influences, which cannot be planned for or integrated into the model


### 4.3 Analysis of the Results

#### 4.3.1 Linear Regression

The investigation into factors predicting the total main stem length of Dalmatian toadflax provides an opportunity to understand the key indicators of the plant's growth. This knowledge can contribute to the development of targeted strategies for managing and controlling Dalmatian toadflax populations in southern California.

The most powerful predictor variables found with forward selection are:

- stem_diam_bottom_in_cm
- diam_top_in_cm
- side_branches_in_cm
- meja_per_100_cm
- total_meja

With these variables our model reaches a moderate performance with an R² value of 0.658.


- Which factor was the most influential?

#### 4.3.2 Logistic Regression

The hypothesis that weevil populations contribute to the reduction of Dalmatian toadflax vegetation seems to be supported by the research questions. If plants infested by weevils are found to be more likely to remain smaller, this indicates a possible link between the presence of weevils and the suppression of Dalmatian toadflax growth. This result is consistent with the biological control strategy, which aims to reduce the impact of the invasive plant on the ecosystem.

The most powerful predictor variables found with forward selection are:

- main_stem_length_in_cm
- stem_diam_bottom_in_cm
- side_branches_in_cm
- diam_top_in_cm
- tip_of_stem_broken

With these variables, our model performs well and correctly predicts infestations 88% of the time.