<h1>Selection process for a Data Science internship</h1>

<h3>Let's start by importing the libraries we will need and loading the dataframe</h3>

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

<h3>Understanding the data</h3>

In [52]:
df = pd.read_csv('Safra_2018-2019.csv')
df = df.drop(['Unnamed: 0'], axis = 1)
df2020 = pd.read_csv('Safra_2020.csv').drop(['Unnamed: 0'], axis = 1)

In [34]:
df.describe()

,Estimativa_de_Insetos,Tipo_de_Cultivo,Tipo_de_Solo,Categoria_Pesticida,Doses_Semana,Semanas_Utilizando,Semanas_Sem_Uso,Temporada,dano_na_plantacao
count,80000.0,80000.0,80000.0,80000.0,80000.0,71945.0,80000.0,80000.0,80000.0
mean,1400.020875,0.283338,0.45555,2.267587,25.84675,28.66448,9.549088,1.897575,0.192312
std,849.792471,0.450622,0.498023,0.463748,15.557246,12.424751,9.905547,0.702079,0.455912
min,150.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,731.0,0.0,0.0,2.0,15.0,20.0,0.0,1.0,0.0
50%,1212.0,0.0,0.0,2.0,20.0,28.0,7.0,2.0,0.0
75%,1898.0,1.0,1.0,3.0,40.0,37.0,16.0,2.0,0.0
max,4097.0,1.0,1.0,3.0,95.0,67.0,50.0,3.0,2.0


In [35]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80000 entries, 0 to 79999
Data columns (total 10 columns):
Identificador_Agricultor    80000 non-null object
Estimativa_de_Insetos       80000 non-null int64
Tipo_de_Cultivo             80000 non-null int64
Tipo_de_Solo                80000 non-null int64
Categoria_Pesticida         80000 non-null int64
Doses_Semana                80000 non-null int64
Semanas_Utilizando          71945 non-null float64
Semanas_Sem_Uso             80000 non-null int64
Temporada                   80000 non-null int64
dano_na_plantacao           80000 non-null int64
dtypes: float64(1), int64(8), object(1)
memory usage: 6.1+ MB
None


<h2>Preprocessing</h2>

<h5>A quick look shows that Semanas_Utilizando has null values in 8,055 of the 80,000 rows, just over 10% of our dataset, which is too many to simply throw away</h5>

In [53]:
dfnan = df.isna()
# Fraction of rows where Semanas_Utilizando is missing (mean of a boolean mask)
p = dfnan['Semanas_Utilizando'].mean()
print("Fraction of missing values in Semanas_Utilizando:", p)

Fraction of missing values in Semanas_Utilizando: 0.1006875


In [54]:
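def fill_nan_values(df):
    # Fill only Semanas_Utilizando with its own mean; other columns stay untouched
    df = df.copy()
    df['Semanas_Utilizando'] = df['Semanas_Utilizando'].fillna(df['Semanas_Utilizando'].mean())
    return df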

In [55]:
df = fill_nan_values(df)
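
<p>As a minimal sketch of what the imputation does, the toy frame below (hypothetical values, not taken from the dataset) fills the gaps with the column mean. The extra indicator column Semanas_Utilizando_missing is an assumption added for illustration: it is not part of the original notebook, but it keeps a record of which rows were imputed.</p>

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; only Semanas_Utilizando has gaps
toy = pd.DataFrame({
    'Semanas_Utilizando': [10.0, np.nan, 30.0, np.nan],
    'Doses_Semana': [5, 10, 15, 20],
})

# Assumption (not in the notebook): remember which rows were missing
toy['Semanas_Utilizando_missing'] = toy['Semanas_Utilizando'].isna().astype(int)

# The mean ignores NaN, so it is computed from 10 and 30 only
mean_weeks = toy['Semanas_Utilizando'].mean()
toy['Semanas_Utilizando'] = toy['Semanas_Utilizando'].fillna(mean_weeks)

print(toy)
```

<p>Mean imputation keeps the column average unchanged but shrinks its variance, which is consistent with the lower standard deviation of Semanas_Utilizando in the second describe() output.</p>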

<h4>To better understand the basis of the problem we are dealing with</h4>

In [56]:
def dummies(df):
    # Map the numeric pesticide codes to readable labels before encoding
    labels = {1: 'Nunca Usou', 2: 'Já Usou', 3: 'Esta usando'}
    df['Categoria_Pesticida'] = df['Categoria_Pesticida'].map(labels)
    # One-hot encode both categorical columns (pesticide category first, then season)
    df = pd.get_dummies(df, columns=['Categoria_Pesticida', 'Temporada'])
    return df

In [57]:
df = dummies(df)

<h4>With this we label whether the farmer has never used, has already used, or is currently using pesticide, and turn each label into its own indicator column, since the codes 1, 2 and 3 do not form a linear scale.</h4>
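
<p>A small sketch of the encoding on four hypothetical rows: each code becomes a label and each label becomes its own 0/1 column, so no ordering between the categories is implied. The toy frame is illustrative only.</p>

```python
import pandas as pd

# Four hypothetical rows using the same codes as Categoria_Pesticida
toy = pd.DataFrame({'Categoria_Pesticida': [1, 2, 3, 2]})

labels = {1: 'Nunca Usou', 2: 'Já Usou', 3: 'Esta usando'}
toy['Categoria_Pesticida'] = toy['Categoria_Pesticida'].map(labels)

# One indicator column per label, ordered alphabetically by label
encoded = pd.get_dummies(toy, columns=['Categoria_Pesticida'])

print(encoded.columns.tolist())
print(encoded.sum(axis=1).tolist())  # every row activates exactly one column
```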

In [59]:
def feature_engineering(df):
    # Imputed means are fractional; truncate so the column holds whole weeks
    df['Semanas_Utilizando'] = df['Semanas_Utilizando'].astype(int)
    # Total doses applied = weeks of use * doses per week
    df['Quant_total_de_Dose'] = df['Semanas_Utilizando'] * df['Doses_Semana']
    df['Total_de_Semanas'] = df['Semanas_Utilizando'] + df['Semanas_Sem_Uso']
    # Share of weeks with pesticide use; 0/0 yields NaN, which becomes 0
    df['Razao_Uso'] = (df['Semanas_Utilizando'] / df['Total_de_Semanas']).fillna(0)
    # Insects per dose, vectorized; rows with zero total dose get 0 instead of dividing by zero
    dose = df['Quant_total_de_Dose']
    df['Razao_de_inseto_por_dose'] = (df['Estimativa_de_Insetos'] / dose.where(dose != 0)).fillna(0)
    return df

In [60]:
df = feature_engineering(df)
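
<p>The zero-dose case is the part of the feature engineering that is easiest to get wrong. The sketch below, on three hypothetical rows, shows one way to compute the insects-per-dose ratio without a Python loop: zero totals are masked to NaN before dividing and set to 0 afterwards. The values are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical rows: one normal case, one with no weeks of use, one with no doses
toy = pd.DataFrame({
    'Estimativa_de_Insetos': [300, 600, 900],
    'Semanas_Utilizando': [10, 0, 5],
    'Doses_Semana': [3, 20, 0],
})

total_dose = toy['Semanas_Utilizando'] * toy['Doses_Semana']

# where() keeps non-zero totals and turns zeros into NaN, so no division by zero occurs
ratio = (toy['Estimativa_de_Insetos'] / total_dose.where(total_dose != 0)).fillna(0.0)

print(total_dose.tolist())
print(ratio.tolist())
```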

<h4>Let's split the insect estimate into bins</h4>

In [65]:
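# Bin edges; intervals are right-closed, so 150 falls in the first bin (0, 150]
boundaries = [0,150,300,450,600,750,900,1050,1200,1350,1500,1650,1800,1950,2100,2400,2700,2900,3200,3800,4100]
# labels=False returns the integer bin index (0 to 19) instead of an interval
df['Estimativa_de_Insetos'] = pd.cut(df['Estimativa_de_Insetos'], boundaries, labels=False)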
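
<p>To make the binning concrete, the sketch below applies the same boundaries to a few hand-picked counts. The value 5000 is a hypothetical out-of-range count, included to show that anything above the last edge becomes NaN. That would matter if another season contained counts above 4100, which is an assumption, not something observed here.</p>

```python
import pandas as pd

boundaries = [0, 150, 300, 450, 600, 750, 900, 1050, 1200, 1350, 1500, 1650,
              1800, 1950, 2100, 2400, 2700, 2900, 3200, 3800, 4100]

# 150, 1212 and 4097 are the min, median and max reported by describe(); 5000 is hypothetical
counts = pd.Series([150, 151, 1212, 4097, 5000])

# Right-closed intervals: 150 lands in bin 0, 151 in bin 1
binned = pd.cut(counts, boundaries, labels=False)

print(binned.tolist())
```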

In [62]:
df.describe()

,Estimativa_de_Insetos,Tipo_de_Cultivo,Tipo_de_Solo,Doses_Semana,Semanas_Utilizando,Semanas_Sem_Uso,dano_na_plantacao,Categoria_Pesticida_Esta usando,Categoria_Pesticida_Já Usou,Categoria_Pesticida_Nunca Usou,Temporada_1,Temporada_2,Temporada_3,Quant_total_de_Dose,Total_de_Semanas,Razao_Uso,Razao_de_inseto_por_dose
count,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0,80000.0
mean,1400.020875,0.283338,0.45555,25.84675,28.597575,9.549088,0.192312,0.277125,0.713337,0.009538,0.302912,0.4966,0.200488,760.685313,38.146662,0.757036,3.40224
std,849.792471,0.450622,0.498023,15.557246,11.784339,9.905547,0.455912,0.447582,0.452205,0.097194,0.459521,0.499992,0.400368,573.358734,11.741582,0.238095,5.369961
min,150.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,731.0,0.0,0.0,15.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,330.0,30.0,0.581395,0.97619
50%,1212.0,0.0,0.0,20.0,28.0,7.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,600.0,39.0,0.8,1.886667
75%,1898.0,1.0,1.0,40.0,36.0,16.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1040.0,47.0,1.0,3.773333
max,4097.0,1.0,1.0,95.0,67.0,50.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,5510.0,78.0,1.0,163.84


<h1>Choosing the model</h1>

<h2>With our features in place, we can start building our model</h2>

In [66]:
from sklearn.model_selection import cross_val_score
y = df['dano_na_plantacao']
X = df.drop(['dano_na_plantacao', 'Identificador_Agricultor'], axis =1)

<h4>Since this problem does not have a large amount of data, a neural network would be hard to train, so we will use ensemble methods from sklearn, with a KNN classifier for comparison</h4>

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

<h3>Random Forest Classifier</h3>

In [15]:
n_estimators = [5,10,15,20,30,50,100,200,300,400,500,750]
mean_score = []
for estim in n_estimators:
    clf = RandomForestClassifier(n_estimators=estim)
    scores = cross_val_score(clf, X, y, cv=5, n_jobs = -1, verbose = 1)
    mean_score.append(scores.mean())

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out

KeyboardInterrupt: 

In [None]:
# Plot only the settings that finished, since the loop above was interrupted
plt.plot(n_estimators[:len(mean_score)], mean_score)

In [None]:
clf = RandomForestClassifier(n_estimators=500)
scores = cross_val_score(clf, X, y, cv=5, n_jobs = -1, verbose = 1)

<h3>Ada Boost Classifier</h3>

In [None]:
n_estimators = [5,10,15,20,30,50,100,200,300,400,750]
mean_score = []
for estim in n_estimators:
    clf = AdaBoostClassifier(n_estimators=estim)
    scores = cross_val_score(clf, X, y, cv=5, n_jobs = -1, verbose = 1)
    mean_score.append(scores.mean())

In [None]:
plt.plot(n_estimators, mean_score)

<h3>KNN Classifier</h3>

In [None]:
# Candidate values for k (number of neighbours), not a number of estimators
n_neighbors_values = [3,5,10,15,20,30,50,100,200,300,500]
mean_score = []
for k in n_neighbors_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1, verbose=1)
    mean_score.append(scores.mean())

In [None]:
plt.plot(n_neighbors_values, mean_score)
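
<p>The three sweeps above share the same loop, so they could be expressed with one helper. The sketch below is one possible way to do that. The helper name sweep, the synthetic data from make_classification, the short value lists and the fixed random_state are all assumptions made to keep the example small and repeatable. It does not reproduce the notebook's scores.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def sweep(make_model, values, X, y, cv=5):
    """Return the mean cross-validated accuracy for each hyperparameter value."""
    return [cross_val_score(make_model(v), X, y, cv=cv).mean() for v in values]


# Synthetic stand-in for X and y (hypothetical data, not the crop dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

results = {
    'RandomForest': sweep(
        lambda n: RandomForestClassifier(n_estimators=n, random_state=0),
        [5, 10, 20], X_toy, y_toy),
    'AdaBoost': sweep(
        lambda n: AdaBoostClassifier(n_estimators=n, random_state=0),
        [5, 10, 20], X_toy, y_toy),
    'KNN': sweep(
        lambda k: KNeighborsClassifier(n_neighbors=k),
        [3, 5, 10], X_toy, y_toy),
}

for name, scores in results.items():
    print(name, [round(s, 3) for s in scores])
```

<p>Fixing random_state makes repeated runs comparable, which helps when the differences between settings are small.</p>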

<h3>The selected model was AdaBoost with n_estimators=100</h3>

In [None]:
clf = AdaBoostClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5, n_jobs = -1, verbose = 1)
scores.mean()
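
<p>The cells shown stop at cross-validation and never use the 2020 file loaded at the top. If the selected model were applied to a new season, the new data would need the same columns as the training matrix. The sketch below illustrates that step on tiny hypothetical frames: the new frame lacks one Temporada value, so its dummy columns are aligned with reindex before predicting. The frame names and values are illustrative assumptions, not the author's pipeline.</p>

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical training data with all three seasons
train = pd.DataFrame({
    'Temporada': [1, 2, 3, 1, 2, 3],
    'Doses_Semana': [10, 20, 30, 40, 50, 60],
    'dano_na_plantacao': [0, 0, 1, 0, 1, 1],
})
# Hypothetical new data in which season 3 never appears
new = pd.DataFrame({'Temporada': [1, 2], 'Doses_Semana': [15, 55]})

X_train = pd.get_dummies(train.drop(columns='dano_na_plantacao'),
                         columns=['Temporada'], dtype=int)
y_train = train['dano_na_plantacao']

# Align to the training columns; dummy columns absent from the new data are filled with 0
X_new = pd.get_dummies(new, columns=['Temporada'], dtype=int)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_new)

print(X_new.columns.tolist())
print(pred)
```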