<html>
    <body>
        <h1 class="alert alert-info" style="text-align: center;">Wild Blueberry Yield: A Machine Learning Approach to Help Farmers Predict the Yield of Wild Blueberries</h1>
        <h2 id="contents">Table of Contents</h2>
        <ol>
            <a href="#section1"><li>Importing libraries and loading the dataset</li></a>
            <a href="#section2"><li>Exploring the dataset</li></a>
            <a href="#section3"><li>Data cleaning</li></a>
            <a href="#section4"><li>Exploratory data analysis</li></a>
            <ol>
                <a href="#sub_section1_1"><li type="i">Univariate analysis</li></a>
                <a href="#sub_section1_2"><li type="i">Bivariate analysis</li></a>
            </ol>        
            <a href="#section5"><li>Data Preprocessing</li></a>
            <a href="#section6"><li>Model Building and Evaluation</li></a>
            <ol>
                <a href="#sub_section2_1"><li type="i">Ridge Regression</li></a>
            </ol> 
        </ol>
    </body>
</html>

<div class="col-md-8">
    <h2 id="section1">1. Importing libraries and loading the dataset</h2>
    <p>Let's start by importing the necessary libraries and loading the dataset.</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate
import warnings
import os
import matplotlib as mpl

mpl.rcParams['agg.path.chunksize'] = 10000

color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

warnings.filterwarnings(action = 'ignore')

In [None]:
# Read the data
df_train = pd.read_csv('../input/playground-series-s3e14/train.csv')
df_test = pd.read_csv('../input/playground-series-s3e14/test.csv')
df_sample = pd.read_csv('../input/playground-series-s3e14/sample_submission.csv')

<div class="col-md-8">
    <h2 id="section2">2. Exploring the dataset</h2>
    <p>Let's explore the datasets:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Sample train data
df_train.head()

In [None]:
# Sample test data
df_test.head()

<p>Let's explore the dataset to get a better understanding of its structure and content:</p>

In [None]:
# Function to create a brief summary of the data
def summary(df):
    print("Shape of the data: ", df.shape)
    df_summ = pd.DataFrame(df.dtypes, columns=['DataType'])
    df_summ['#Missing'] = df.isnull().sum().values
    df_summ['%Missing'] = (df.isnull().sum().values / len(df)) * 100
    df_summ['#Unique'] = df.nunique().values
    # get description of variable in a dataframe
    df_desc = pd.DataFrame(df.describe(include="all").transpose())
    df_summ['Min'] = df_desc['min'].values
    df_summ['Max'] = df_desc['max'].values
    df_summ['Std'] = df_desc['std'].values
    df_summ['Mean'] = df_desc['mean'].values
    df_summ['25%'] = df_desc['25%'].values
    df_summ['50%'] = df_desc['50%'].values
    df_summ['75%'] = df_desc['75%'].values
    # Use positional indexing so this also works when the index is not 0..n-1
    df_summ['FirstValue'] = df.iloc[0].values
    df_summ['LastValue'] = df.iloc[-1].values

    return df_summ

In [None]:
# Train data summary
summary(df_train).style.background_gradient(cmap='crest', axis=0)

<div class="col-md-8">
    <h3 id="section3">3. Data cleaning</h3>
    <p>Nice! We have a dataset with <b>15289</b> rows and <b>18</b> columns.</p>
    <p>There are no missing values, so we can move on to the next step.</p>
    <p>Let's clean the dataset by checking for duplicates, dropping irrelevant columns, and converting data types where needed.</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Let's create a copy of the train and test data to perform data cleaning
df_wb_copy = df_train.copy()
df_wb_test_copy = df_test.copy()

In [None]:
# Check for duplicates in train data
df_wb_copy.duplicated().sum()

In [None]:
# Check for duplicates in test data
df_wb_test_copy.duplicated().sum()

<p>No duplicates in the dataset! Let's move on to the next step.</p>

<div class="col-md-8">
    <h3 id="section4">4. Exploratory data analysis</h3>
    <p>Let's perform exploratory data analysis to extract insights from the blueberry dataset:</p>
    <h4 id="sub_section1_1" >i. Univariate analysis</h4>
    <p>We will start by exploring the distribution of the numerical and categorical variables in the dataset:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Function for calculating descriptives of numeric variable and plotting the distribution
def plot_dist(df, col, x_label, y_label, plot_title):
    _min = df[col].min()
    _max = df[col].max()
    ran = df[col].max()-df[col].min()
    mean = df[col].mean()
    median = df[col].median()
    st_dev = df[col].std()
    skew = df[col].skew()
    kurt = df[col].kurtosis()

    # calculating points of standard deviation
    points = mean-st_dev, mean+st_dev
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.histplot(data=df, x=col, bins=30, kde=True, color='dodgerblue')
    sns.lineplot(x=points, y=[0,0], color = 'black', label = "std_dev")
    sns.scatterplot(x=[_min,_max], y=[0,0], color = 'orange', label = "min/max")
    sns.scatterplot(x=[mean], y=[0], color = 'red', label = "mean")
    sns.scatterplot(x=[median], y=[0], color = 'blue', label = "median")
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)

    # Creating a DataFrame for the descriptive statistics
    variable_stats = pd.DataFrame({'Statistics': ['Minimum Value', 'Maximum Value', 'Range', 'Mean', 
                                                    'Median', 'Standard Deviation', 'Skewness', 'Kurtosis'], 
                                        'Value': [_min, _max, ran, mean, median, st_dev, skew, kurt]})
    
    plt.show()

    display(tabulate(variable_stats, headers='keys', showindex=False, tablefmt='html'))


In [None]:
# Function for plotting the distribution of categorical variables
def plot_cat(df, col, x_label, y_label, plot_title):
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.countplot(data=df, x=col, color='dodgerblue')
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.show()

<p>Here the <code>yield</code> column is our target variable. Let's explore its distribution.</p>

In [None]:
# Plot distribution of the yield column
plot_dist(df_wb_copy, 'yield', 'Yield', 'Count', 'Distribution of yield column')

<p>Nice. The target variable looks approximately normally distributed; the skewness and kurtosis values in the table above help confirm this.</p>
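
<p>As a rough illustration of how skewness and kurtosis can support a "looks normal" reading, here is a minimal sketch on synthetic data. The sample, the name <code>yield_demo</code>, and the cut-offs of 0.5 and 1.0 are assumptions chosen for the example, not values taken from the competition data or a formal test.</p>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a target column (assumption: not the competition data)
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=6000, scale=1300, size=5000), name="yield_demo")

skewness = sample.skew()         # close to 0 for a symmetric distribution
excess_kurtosis = sample.kurt()  # pandas reports excess kurtosis; close to 0 for a normal distribution

# Hypothetical rule-of-thumb thresholds, used here only for illustration
looks_roughly_normal = abs(skewness) < 0.5 and abs(excess_kurtosis) < 1.0

print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
print("roughly normal:", looks_roughly_normal)
```

<p>The same two calls can be applied to <code>df_wb_copy['yield']</code>; the thresholds are only a heuristic and may need adjusting.</p>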

In [None]:
# Let's see the distribution of fruitset variable
plot_dist(df_wb_copy, 'fruitset', 'fruitset', 'Count', 'Distribution of fruitset column')

In [None]:
# Let's see the distribution of fruitmass variable
plot_dist(df_wb_copy, 'fruitmass', 'fruitmass', 'Count', 'Distribution of fruitmass column')

In [None]:
# Let's see the distribution of seeds variable
plot_dist(df_wb_copy, 'seeds', 'seeds', 'Count', 'Distribution of seeds column')

<p>The following variables have too few unique values to be treated as continuous:</p>
<ul>
    <li>clonesize</li>
    <li>honeybee</li>
    <li>bumbles</li>
    <li>andrena</li>
    <li>osmia</li>
    <li>MaxOfUpperTRange</li>
    <li>MinOfUpperTRange</li>
    <li>AverageOfUpperTRange</li>
    <li>MaxOfLowerTRange</li>
    <li>MinOfLowerTRange</li>
    <li>AverageOfLowerTRange</li>
    <li>RainingDays</li>
    <li>AverageRainingDays</li>
</ul>
<p>Let's check their distributions as categorical variables.</p>
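
<p>One way such a list could be produced programmatically is to count unique values per column and keep the columns below a threshold. The sketch below uses a small made-up frame; the values and the <code>max_unique</code> threshold are assumptions for illustration only.</p>

```python
import pandas as pd

# Toy frame with made-up values (assumption: not the competition data)
demo = pd.DataFrame({
    "clonesize": [12.5, 25.0, 25.0, 12.5, 37.5, 25.0],
    "honeybee": [0.25, 0.5, 0.5, 0.25, 0.75, 0.5],
    "fruitset": [0.41, 0.52, 0.48, 0.39, 0.61, 0.55],
})

max_unique = 3  # hypothetical threshold for "too few unique values"

# Count distinct values per column and keep the low-cardinality ones
unique_counts = demo.nunique()
low_cardinality = unique_counts[unique_counts <= max_unique].index.tolist()

print(unique_counts)
print(low_cardinality)
```

<p>On the real data, a suitable threshold would have to be chosen by looking at the <code>#Unique</code> column of the summary table.</p>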

In [None]:
# Plotting distribution of clonesize column
plot_cat(df_wb_copy, 'clonesize', 'clonesize', 'Count', 'Distribution of clonesize column')

In [None]:
# Plotting distribution of honeybee column
plot_cat(df_wb_copy, 'honeybee', 'honeybee', 'Count', 'Distribution of honeybee column')

In [None]:
# Plotting distribution of bumbles column
plot_cat(df_wb_copy, 'bumbles', 'bumbles', 'Count', 'Distribution of bumbles column')

In [None]:
# Plotting distribution of andrena column
plot_cat(df_wb_copy, 'andrena', 'andrena', 'Count', 'Distribution of andrena column')

In [None]:
# Plotting distribution of osmia column
plot_cat(df_wb_copy, 'osmia', 'osmia', 'Count', 'Distribution of osmia column')

In [None]:
# Plotting distribution of MaxOfUpperTRange column
plot_cat(df_wb_copy, 'MaxOfUpperTRange', 'MaxOfUpperTRange', 'Count', 'Distribution of MaxOfUpperTRange column')

In [None]:
# Plotting distribution of MinOfUpperTRange column
plot_cat(df_wb_copy, 'MinOfUpperTRange', 'MinOfUpperTRange', 'Count', 'Distribution of MinOfUpperTRange column')

In [None]:
# Plotting distribution of AverageOfUpperTRange column
plot_cat(df_wb_copy, 'AverageOfUpperTRange', 'AverageOfUpperTRange', 'Count', 'Distribution of AverageOfUpperTRange column')

In [None]:
# Plotting distribution of MaxOfLowerTRange column
plot_cat(df_wb_copy, 'MaxOfLowerTRange', 'MaxOfLowerTRange', 'Count', 'Distribution of MaxOfLowerTRange column')

In [None]:
# Plotting distribution of MinOfLowerTRange column
plot_cat(df_wb_copy, 'MinOfLowerTRange', 'MinOfLowerTRange', 'Count', 'Distribution of MinOfLowerTRange column')

In [None]:
# Plotting distribution of AverageOfLowerTRange column
plot_cat(df_wb_copy, 'AverageOfLowerTRange', 'AverageOfLowerTRange', 'Count', 'Distribution of AverageOfLowerTRange column')

In [None]:
# Plotting distribution of RainingDays column
plot_cat(df_wb_copy, 'RainingDays', 'RainingDays', 'Count', 'Distribution of RainingDays column')

In [None]:
# Plotting distribution of AverageRainingDays column
plot_cat(df_wb_copy, 'AverageRainingDays', 'AverageRainingDays', 'Count', 'Distribution of AverageRainingDays column')

<div class="col-md-8">
    <h4 id="sub_section1_2">ii. Bivariate analysis</h4>
    <p>Let's explore the relationship between the yield and the other variables in the dataset:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# Function for plotting the distribution of numeric variables against the target variable
# Here the target variable is continuous, so a scatter plot is used
def plot_num_vs_target(df, col, target, x_label, y_label, plot_title):
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.scatterplot(data=df, x=target, y=col, color='dodgerblue')
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.show()

In [None]:
# Relationship between yield and fruitset
plot_num_vs_target(df_wb_copy, 'fruitset', 'yield', 'yield', 'fruitset', 'Relationship between yield and fruitset')

In [None]:
# Relationship between yield and fruitmass
plot_num_vs_target(df_wb_copy, 'fruitmass', 'yield', 'yield', 'fruitmass', 'Relationship between yield and fruitmass')

In [None]:
# Relationship between yield and seeds
plot_num_vs_target(df_wb_copy, 'seeds', 'yield', 'yield', 'seeds', 'Relationship between yield and seeds')

In [None]:
# Function for plotting the distribution of categorical variables against the target variable
def plot_cat_vs_target(df, col, target, x_label, y_label, plot_title):
    sns.set_style('darkgrid')
    plt.figure(figsize=(12,8))
    sns.boxplot(data=df, x=col, y=target, palette='Set1')
    plt.title(plot_title, fontsize=14)
    plt.xlabel(x_label, fontsize=12)
    plt.ylabel(y_label, fontsize=12)
    plt.show()

In [None]:
# Relationship between yield and clonesize
plot_cat_vs_target(df_wb_copy, 'clonesize', 'yield', 'clonesize', 'yield', 'Relationship between yield and clonesize')

In [None]:
# Relationship between yield and honeybee
plot_cat_vs_target(df_wb_copy, 'honeybee', 'yield', 'honeybee', 'yield', 'Relationship between yield and honeybee')

In [None]:
# Relationship between yield and bumbles
plot_cat_vs_target(df_wb_copy, 'bumbles', 'yield', 'bumbles', 'yield', 'Relationship between yield and bumbles')

In [None]:
# Relationship between yield and andrena
plot_cat_vs_target(df_wb_copy, 'andrena', 'yield', 'andrena', 'yield', 'Relationship between yield and andrena')

In [None]:
# Relationship between yield and osmia
plot_cat_vs_target(df_wb_copy, 'osmia', 'yield', 'osmia', 'yield', 'Relationship between yield and osmia')

In [None]:
# Relationship between yield and MaxOfUpperTRange
plot_cat_vs_target(df_wb_copy, 'MaxOfUpperTRange', 'yield', 'MaxOfUpperTRange', 'yield', 'Relationship between yield and MaxOfUpperTRange')

In [None]:
# Relationship between yield and MinOfUpperTRange
plot_cat_vs_target(df_wb_copy, 'MinOfUpperTRange', 'yield', 'MinOfUpperTRange', 'yield', 'Relationship between yield and MinOfUpperTRange')

In [None]:
# Relationship between yield and AverageOfUpperTRange
plot_cat_vs_target(df_wb_copy, 'AverageOfUpperTRange', 'yield', 'AverageOfUpperTRange', 'yield', 'Relationship between yield and AverageOfUpperTRange')

In [None]:
# Relationship between yield and MaxOfLowerTRange
plot_cat_vs_target(df_wb_copy, 'MaxOfLowerTRange', 'yield', 'MaxOfLowerTRange', 'yield', 'Relationship between yield and MaxOfLowerTRange')

In [None]:
# Relationship between yield and MinOfLowerTRange
plot_cat_vs_target(df_wb_copy, 'MinOfLowerTRange', 'yield', 'MinOfLowerTRange', 'yield', 'Relationship between yield and MinOfLowerTRange')

In [None]:
# Relationship between yield and AverageOfLowerTRange
plot_cat_vs_target(df_wb_copy, 'AverageOfLowerTRange', 'yield', 'AverageOfLowerTRange', 'yield', 'Relationship between yield and AverageOfLowerTRange')

In [None]:
# Relationship between yield and RainingDays
plot_cat_vs_target(df_wb_copy, 'RainingDays', 'yield', 'RainingDays', 'yield', 'Relationship between yield and RainingDays')

In [None]:
# Relationship between yield and AverageRainingDays
plot_cat_vs_target(df_wb_copy, 'AverageRainingDays', 'yield', 'AverageRainingDays', 'yield', 'Relationship between yield and AverageRainingDays')

<div class="col-md-8">
    <h3 id="section5">5. Data Preprocessing</h3>
    <p>Before we use variables in our model, we need to preprocess them. We will perform the following steps:</p>
    <ul>
        <li>One-hot encode categorical variables</li>
        <li>Label encode categorical variables</li>
    </ul>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>
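
<p>To make the difference between the two encodings concrete, here is a minimal sketch on a made-up column. The <code>region</code> column and its values are hypothetical and do not appear in the blueberry dataset.</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column, used only to illustrate the two encodings
demo = pd.DataFrame({"region": ["north", "south", "east", "south"]})

# Label encoding: one integer per category (classes are sorted alphabetically)
label_encoded = LabelEncoder().fit_transform(demo["region"])

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(demo, columns=["region"], prefix=["region"])

print(label_encoded)
print(one_hot)
```

<p>Label encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that at the cost of extra columns.</p>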

In [None]:
# Function to encode categorical variables, we will use scikit-learn's LabelEncoder for label encoding and pandas get_dummies for one-hot encoding
from sklearn.preprocessing import LabelEncoder
def encode_cat(df, col, encoding_type):
    if encoding_type == 'label':
        label_encoder = LabelEncoder()
        df[col] = label_encoder.fit_transform(df[col])
    elif encoding_type == 'onehot':
        df = pd.get_dummies(df, columns=[col], prefix=[col])
    return df

<p>Since we have only continuous (float) variables, no encoding is needed. Let's check the correlation between them:</p>

In [None]:
# Function to plot correlation between variables
def plot_corr(df, size=10):
    corr = df.corr()
#     print(corr)
    fig, ax = plt.subplots(figsize=(size, size))
    sns.heatmap(corr, annot=True, linewidths=.5, ax=ax, cmap='crest')
    plt.show() 

In [None]:
# Correlation between variables in the training set
plot_corr(df_wb_copy.drop(['id'], axis=1))

<p>Let's check the correlation between the variables and the target variable:</p>

In [None]:
# Function to plot correlation of variables with the target variable as a barplot
def plot_corr_target(df, target, size=10):
    corr = df.corr()
    corr_target = corr[target]
    corr_target = corr_target.sort_values(ascending=False)
    corr_target = corr_target.drop(target)
    plt.figure(figsize=(size, size))
    corr_target.plot.barh()
    plt.show()

In [None]:
# Check correlation of variables with the target variable
plot_corr_target(df_wb_copy.drop(['id'], axis=1), 'yield')

<div class="col-md-8">
    <h3 id="section6">6. Model Building</h3>
    <p>Let's build a model to predict the yield of wild blueberries:</p>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>

In [None]:
# We will first separate the target variable from the features
y = df_wb_copy['yield']
x = df_wb_copy.drop(['yield', 'id'], axis=1)
x.shape, y.shape

<p>Let's scale the features using scikit-learn's MinMax scaler:</p>

In [None]:
## Importing the MinMax Scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)

# Reuse the scaler fitted on the training features so both sets share the same scale
df_wb_test_scaled = scaler.transform(df_wb_test_copy.drop(['id'], axis=1))

In [None]:
x = pd.DataFrame(x_scaled, columns = x.columns)

df_wb_test_final = pd.DataFrame(df_wb_test_scaled, columns = df_wb_test_copy.drop(['id'], axis=1).columns)

In [None]:
# Check train data after scaling
x.head()

In [None]:
# Check test data after scaling
df_wb_test_final.head()
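
<p>A minimal sketch of how <code>MinMaxScaler</code> behaves when it is fitted on one set and then applied to another. The tiny arrays are made up for illustration; the point is that values outside the fitted range map outside [0, 1].</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data (assumption: for illustration only)
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from the training data
test_scaled = scaler.transform(test)        # applies the same min/max to new data

print(train_scaled.ravel())
print(test_scaled.ravel())
```

<p>Fitting once on the training features keeps both sets on a common scale, which is what a model trained on the scaled training data expects at prediction time.</p>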

<p>Now, let's split the dataset into training and test sets:</p>

In [None]:
# Importing the train test split function
from sklearn.model_selection import train_test_split
# The target is continuous, so the split is not stratified
train_x,test_x,train_y,test_y = train_test_split(x,y, random_state = 50)


In [None]:
train_x.shape, train_y.shape, test_x.shape, test_y.shape

<div class="col-md-8">
    <h3 id="section7">7. Model Generation and Evaluation</h3>
    <p>We will build a regression model and evaluate it using the mean absolute error (MAE):</p>
    <h4 id="sub_section2_1">i. Ridge Regression</h4>
</div>
<div class="col-md-4">
    <a href="#contents">Back to top</a>
</div>
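
<p>The cells below score predictions with scikit-learn's <code>mean_absolute_error</code>. As a small sketch of what that metric computes, the example compares a manual calculation with the library call on made-up numbers.</p>

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions (assumption: for illustration only)
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])

# MAE is the mean of the absolute differences between targets and predictions
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

<p>MAE is expressed in the same units as the target, so for this dataset it reads directly as an average error in yield.</p>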

In [None]:
# Importing ridge from sklearn's linear_model module
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [None]:
#Set the different values of alpha to be tested
alpha_ridge = [0, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20, 25]

In [None]:
# Define a function that fits a ridge regression model and returns the train/test MAE, the intercept, and the coefficients
def ridge_regression(train_x, train_y, test_x, test_y, alpha):
    # Fit the model
    ridgereg = Ridge(alpha=alpha)
    ridgereg.fit(train_x, train_y)
    train_y_pred = ridgereg.predict(train_x)
    test_y_pred = ridgereg.predict(test_x)

    # Return the result in the pre-defined format: [mae_train, mae_test, intercept, coefficients...]
    mae_train = mean_absolute_error(train_y, train_y_pred)
    ret = [mae_train]

    mae_test = mean_absolute_error(test_y, test_y_pred)
    ret.extend([mae_test])

    ret.extend([ridgereg.intercept_])
    ret.extend(ridgereg.coef_)

    return ret

In [None]:
#Initialize the dataframe for storing coefficients.
col = ['mae_train','mae_test','intercept'] + ['coef_Var_%d'%i for i in range(1,17)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

In [None]:
#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(train_x, train_y, test_x, test_y, alpha_ridge[i])

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge

In [None]:
coef_matrix_ridge[['mae_train','mae_test']].plot()
plt.xlabel('Alpha Values')
plt.ylabel('MAE')
plt.legend(['train', 'test'])

In [None]:
#Printing number of zeros in each row of the coefficients dataset
coef_matrix_ridge.apply(lambda x: sum(x.values==0),axis=1)
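
<p>Ridge regression shrinks coefficients toward zero as alpha grows but typically does not set them exactly to zero, which is why a zero count like the one above is usually 0 for every row. A minimal sketch on synthetic data (the data, the coefficients, and the alpha grid are assumptions for illustration):</p>

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumption: not the competition data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

alphas = [1e-4, 1, 10, 100]
norms = []        # L2 norm of the fitted coefficients for each alpha
zero_counts = []  # number of coefficients that are exactly zero for each alpha

for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))
    zero_counts.append(int(np.sum(model.coef_ == 0)))

for alpha, norm, zeros in zip(alphas, norms, zero_counts):
    print(f"alpha={alpha:<8g} coef norm={norm:.4f} exact zeros={zeros}")
```

<p>If exact zeros are wanted for feature selection, an L1 penalty such as scikit-learn's <code>Lasso</code> is the usual alternative; that is a different technique from the one used in this notebook.</p>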

In [None]:
from sklearn.feature_selection import RFE

# Create the RFE object and rank each feature
model = Ridge(alpha=0.01)
rfe = RFE(estimator=model, n_features_to_select=1, step=1)
rfe.fit(x, y)

In [None]:
ranking_df = pd.DataFrame()
ranking_df['Feature_name'] = x.columns
ranking_df['Rank'] = rfe.ranking_

In [None]:
ranked = ranking_df.sort_values(by=['Rank'])
ranked
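
<p>To show how an RFE ranking with <code>n_features_to_select=1</code> can be read, here is a minimal sketch on synthetic data where one feature clearly dominates. The data and coefficients are assumptions for illustration.</p>

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Synthetic data: feature 0 drives the target strongly, feature 2 weakly,
# features 1 and 3 are pure noise (assumption: for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=0.1, size=300)

# With n_features_to_select=1 and step=1, one feature is dropped per round,
# so ranking_ assigns every feature a distinct rank (1 = kept the longest)
rfe = RFE(estimator=Ridge(alpha=0.01), n_features_to_select=1, step=1)
rfe.fit(X, y)

ranking = rfe.ranking_.tolist()
print(ranking)
```

<p>Because RFE with a linear model ranks by coefficient magnitude, the ranking is most meaningful when the features are on a comparable scale, as they are after MinMax scaling.</p>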

In [None]:
cols = ranked['Feature_name'][:16].values
# cols = np.delete(cols, 5)
cols

In [None]:
#Fit the model
ridgereg = Ridge(alpha=0.01)
ridgereg.fit(train_x[cols],train_y)
train_y_pred = ridgereg.predict(train_x[cols])
test_y_pred = ridgereg.predict(test_x[cols])


# Compute the MAE on the train and test splits
mae_train = mean_absolute_error(train_y_pred, train_y)
# rmsle_train = np.sqrt(msle_train)

mae_test = mean_absolute_error(test_y_pred, test_y)
# rmsle_test = np.sqrt(msle_test)

print('mae_train:   ', mae_train)
print('mae_test:   ', mae_test)

In [None]:
# Fit the model on the entire training data
# ridgereg = Ridge(alpha=1e-8,normalize=True)
ridgereg.fit(x[cols], y)
train_pred = ridgereg.predict(x[cols])
test_pred = ridgereg.predict(df_wb_test_final[cols])


# Compute the MAE on the full training data
mae_train = mean_absolute_error(y, train_pred)


print('mae_train:   ', mae_train)

In [None]:
df_sample.shape, test_pred.shape

In [None]:
df_sample.head()

In [None]:
df_sample['yield'] = test_pred

In [None]:
df_sample.to_csv('submission_wild_blue_berry_ridge.csv', index=False)