<h1 style="color: #e3db24;">00 | Libraries and Settings</h1>

In [2]:
# 📚 Basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# 🤖 Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import roc_curve, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

In [3]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
pd.set_option('display.float_format', '{:,.2f}'.format)
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #e3db24;">01 | Data Extraction</h1>

In [5]:
df3 = pd.read_csv('dataframes/df3_trees.csv')
df3

Unnamed: 0,lat,long,genus,species,alley_tree,height,trunk_circumference,crown_diameter,sponsorship,variety
0,54.06,12.10,acer,acer platanoides,0,8.00,0.69,6.00,0,0
1,54.05,12.10,acer,acer platanoides,0,8.00,0.69,6.00,0,0
2,54.07,12.12,quercus,quercus robur,0,18.00,1.10,2.00,0,0
3,54.07,12.12,quercus,quercus rubra,1,9.00,1.00,2.00,0,0
4,54.16,12.08,tilia,tilia cordata,1,7.00,0.38,3.00,0,0
...,...,...,...,...,...,...,...,...,...,...
69279,54.19,12.15,quercus,quercus robur,0,25.00,2.40,2.00,0,0
69280,54.19,12.15,quercus,quercus robur,0,15.00,2.23,2.00,0,0
69281,54.08,12.19,prunus,prunus avium,0,13.00,1.52,1.00,0,0
69282,54.20,12.15,pinus,pinus sylvestris,0,19.00,1.58,7.00,0,0


<h2 style="color: #ec7511;">Copy of the Dataframe</h2>

In [7]:
df6 = df3.copy()

<h2 style="color: #ec7511;">Separating the Target "sponsorship" from the Features</h2>

In [9]:
target = df6.pop('sponsorship')
df6.sample(3)

Unnamed: 0,lat,long,genus,species,alley_tree,height,trunk_circumference,crown_diameter,variety
68262,54.09,12.07,populus,populus,0,20.0,1.56,3.2,0
57822,54.17,12.2,picea,picea abies,0,22.0,1.35,1.0,0
37274,54.18,12.11,acer,acer pseudoplatanus,0,8.0,0.7,7.4,0


<h1 style="color: #e3db24;">02 | Some Extra EDA</h1>

Some additional EDA specific to the classification task.

<h2 style="color: #ec7511;">Multicollinearity</h2>

In [12]:
num = df6.select_dtypes(include='number')
num

Unnamed: 0,lat,long,alley_tree,height,trunk_circumference,crown_diameter
0,54.06,12.10,0,8.00,0.69,6.00
1,54.05,12.10,0,8.00,0.69,6.00
2,54.07,12.12,0,18.00,1.10,2.00
3,54.07,12.12,1,9.00,1.00,2.00
4,54.16,12.08,1,7.00,0.38,3.00
...,...,...,...,...,...,...
69279,54.19,12.15,0,25.00,2.40,2.00
69280,54.19,12.15,0,15.00,2.23,2.00
69281,54.08,12.19,0,13.00,1.52,1.00
69282,54.20,12.15,0,19.00,1.58,7.00


In [13]:
num_corr = num.corr().round(2)

In [14]:
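# 'sponsorship' was popped from df6 into `target` above, so it is no longer a column of `num`;
# correlate each numeric feature against the separate `target` Series instead
num.corrwith(target).round(2).sort_values(ascending=False)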

In [None]:
# heatmap correlation matrix
mask = np.zeros_like(num_corr)

f, ax = plt.subplots(figsize=(25, 15))
sns.set(font_scale=1.5)

ax = sns.heatmap(num_corr, mask=mask, annot=True, annot_kws={"size": 12}, linewidths=.5, cmap="coolwarm", fmt=".2f", ax=ax)
ax.set_title("Multicollinearity for Predicting 'Sponsorship'", fontsize=20)
plt.show()

In [None]:
# display because annot in heatmap doesn't work for me
num_corr

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #b8daff;
    border-radius: 4px;
    background-color: #0eece8;
    color: #004085;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Conclusions: dadadada</strong>
<p>TEXT</p>
    <ul>
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
    </ul>
</div>


<h1 style="color: #e3db24;">03 | Classification</h1>

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #f7b70d;
    border-radius: 4px;
    background-color: #e2ee1e;
    color: #060606;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Next Steps:</strong>
<p>for classification model etc..</p>
</div>


<h2 style="color: #ec7511;">Data Processing</h2>

<h3 style="color: #ec300e;">X-y Split</h3>
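
A minimal sketch of the X-y split, assuming the features are the numeric columns of `df6` and the label is the `target` Series popped earlier. The small frame below (`df6_demo`, `target_demo`) is a hypothetical stand-in with made-up labels so that the cell runs on its own; in the notebook, `df6` and `target` from the cells above would be used instead. The categorical columns (`genus`, `species`) would need encoding before they could be fed to the model and are left out here.

```python
import pandas as pd

# Hypothetical stand-in for df6 / target; in the notebook these already exist.
df6_demo = pd.DataFrame({
    "lat": [54.06, 54.05, 54.07, 54.16],
    "long": [12.10, 12.10, 12.12, 12.08],
    "genus": ["acer", "acer", "quercus", "tilia"],
    "height": [8.0, 8.0, 18.0, 7.0],
})
target_demo = pd.Series([0, 0, 1, 0], name="sponsorship")  # made-up labels

# X: numeric features only; y: the binary label
X = df6_demo.select_dtypes(include="number")
y = target_demo

print(X.shape, y.shape)
```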

<h2 style="color: #ec7511;">Modeling</h2>

<h3 style="color: #ec300e;">Train-Test Split</h3>
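
A possible train-test split, sketched under the assumption that 30% of the rows are held out for testing (the figure quoted in the validation cell further down). `X_demo` and `y_demo` are hypothetical stand-ins for the real `X` and `y`; `random_state=42` and `stratify` are assumptions added for reproducibility and to preserve the class ratio, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 numeric features, imbalanced binary label
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo,
    y_demo,
    test_size=0.30,    # hold out 30% of the rows for testing
    random_state=42,   # assumed seed, only for reproducibility
    stratify=y_demo,   # keep the class ratio similar in both parts
)

print(X_train.shape, X_test.shape)
```

Stratifying is worth considering when one class is rare, because a purely random split can leave very few positive examples in the test set.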

<h3 style="color: #ec300e;">Model: Logistic Regression</h3>

Logistic regression is one of the most widely used algorithms for classification problems. Because it is relatively simple and easy to implement, it is often used as a baseline model, although it can also perform well enough to be used in production. Here we focus on binomial logistic regression, which is used for binary outcomes. Multinomial logistic regression also exists and can be used for multiclass classification problems, but it is used less frequently and we will not cover it in this lesson.

Logistic regression is essentially a linear regression passed through a transformation. If we fit a plain linear regression to data with a binary outcome, the line predicts poorly everywhere except at the extremes: in the middle range it lies far from the points, and its output is not restricted to the interval between 0 and 1. To bring the function closer to the data, we transform it with a sigmoid function, which produces an S-shaped curve bounded between 0 and 1. The resulting curve fits binary data much better.
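
The effect of the sigmoid transformation can be illustrated with a short sketch. The intercept and slope below are hypothetical values chosen only for illustration; they are not coefficients fitted on the tree data.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients of the linear part, chosen only for illustration
intercept, slope = -4.0, 2.0
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

linear = intercept + slope * x   # unbounded, like a linear regression output
prob = sigmoid(linear)           # squeezed into (0, 1), readable as a probability

print(np.round(prob, 3))
```

The linear part can take any value, while the transformed output stays between 0 and 1 and crosses 0.5 exactly where the linear part equals zero.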

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

<h2 style="color: #ec7511;">Model Validation</h2>

In [None]:
predictions = model.predict(X_test)

In [None]:
print(f'Test set (30% of the data): {len(predictions)} predictions.')

<h3 style="color: #ec300e;">Metrics</h3>

In [None]:
print(classification_report(y_test, predictions))

In [None]:
print("Test data accuracy: ", model.score(X_test, y_test))
print("Train data accuracy: ", model.score(X_train, y_train))

In [None]:
cm = confusion_matrix(y_test, predictions)

In [None]:
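disp = ConfusionMatrixDisplay(confusion_matrix=cm)
# create the sized axes first and pass them in; otherwise disp.plot() opens
# its own figure and the figsize set via plt.figure() has no effect
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap='Oranges', ax=ax)
ax.grid(True)
plt.show()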
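
`roc_curve` is imported at the top of the notebook but not used yet. The following is a minimal sketch of how it could be applied, assuming a fitted logistic regression and a held-out test set. The data comes from `make_classification` and is purely synthetic; all names ending in `_demo`, as well as `clf`, are hypothetical, and `roc_auc_score` is an extra import not present in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the tree dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo)

# roc_curve needs scores for the positive class, not hard 0/1 predictions
proba_demo = clf.predict_proba(X_te_demo)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te_demo, proba_demo)
auc = roc_auc_score(y_te_demo, proba_demo)

print(f"ROC AUC: {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the ROC curve; a curve close to the diagonal would indicate a model that barely separates the two classes.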

<div style="
    padding: 15px;
    margin: 10px 0;
    border: 1px solid #b8daff;
    border-radius: 4px;
    background-color: #0eece8;
    color: #004085;
    font-size: 16px;
    line-height: 1.5;
    word-wrap: break-word;
    text-align: left;">
    <strong>Conclusions: dadada</strong>
<p>TEXT</p>
    <ul>
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
        <li>TEXT</li>
    </ul>
</div>


<h1 style="color: #e3db24;">06 | Improving the Model</h1>

<h1 style="color: #e3db24;">07 | Reporting</h1>