In [1]:
import pandas as pd
import os

In [2]:

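# Build the path to the labeled Anatel dataset and load it.
# (Names chosen so that the builtin `dir` is not shadowed.)
data_dir = '../../data/labeled_csv_files'
file_name = 'Anatel_labeled.csv'
file_path = os.path.join(data_dir, file_name)
df = pd.read_csv(file_path)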

# Handle Missing Data

In [3]:
# Percentage of missing values per column, rounded to two decimals.
perc_null = (df.isnull().mean() * 100).round(2)


In [4]:
df = df.drop('CompartilhamentoInfraFisica_agg_non_none', axis = 1)

### Alternative 1 - Getting rid of all missing values

In [5]:
df_no_na = df.dropna()

In [6]:
print(f"Length of original dataframe: {len(df)}")
print(f"Length of non-nan dataframe: {len(df_no_na)}")

Length of original dataframe: 57328
Length of non-nan dataframe: 55006
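
Dropping every row with a missing value removed 2,322 of the 57,328 rows above (about 4.05%). The sketch below shows one way to quantify that cost before committing to `dropna()`. It runs on a small made-up DataFrame, not on the Anatel data; the column names and values (`height`, `power`, `state`) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; column names and values are made up.
toy = pd.DataFrame({
    "height": [10.0, np.nan, 30.0, 40.0],
    "power": [1.0, 2.0, np.nan, 4.0],
    "state": ["SP", "RJ", "MG", None],
})

# Percentage of missing values per column, largest first.
toy_perc_null = (toy.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Rows that survive when every row with at least one missing value is dropped.
toy_no_na = toy.dropna()
rows_lost = len(toy) - len(toy_no_na)
pct_lost = round(100 * rows_lost / len(toy), 2)

print(toy_perc_null)
print(f"Rows dropped: {rows_lost} ({pct_lost}%)")
```

In this toy case each column is only 25% missing, yet `dropna()` discards 75% of the rows because the gaps fall on different rows. Comparing the two numbers is a quick check on whether row-wise deletion is affordable.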


# Scaling/Normalizing

Scaling and normalizing are not strictly necessary for decision trees. Here's why:

**Decision trees are insensitive to monotonic transformations:**

* Decision trees split each feature on a threshold, so only the relative order of the values within that feature matters, not their absolute magnitudes.
* Scaling (e.g., MinMaxScaler) only changes the range of values, not their order.
* Therefore, scaling changes the numeric thresholds but not which rows end up on each side of a split.
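
The order-preservation argument above can be checked with a minimal sketch. It uses a hand-written single-split search based on Gini impurity, not any library's tree implementation, and the feature values and labels are made up. The helper names `gini` and `best_split_partition` are hypothetical.

```python
import math
import statistics


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_partition(values, labels):
    """Return the row indices sent to the left child by the best split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for k in range(1, len(order)):
        # A threshold can only sit between two distinct values.
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        score = (
            len(left) * gini([labels[i] for i in left])
            + len(right) * gini([labels[i] for i in right])
        ) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Made-up feature values and binary labels, for illustration only.
x = [5.0, 120.0, 33.0, 870.0, 64.0, 410.0]
y = [0, 1, 0, 1, 0, 1]

# Three order-preserving transformations of the same feature.
x_minmax = [(v - min(x)) / (max(x) - min(x)) for v in x]
mean, std = statistics.mean(x), statistics.pstdev(x)
x_standard = [(v - mean) / std for v in x]
x_log = [math.log(v) for v in x]

partitions = {
    "raw": best_split_partition(x, y),
    "min-max": best_split_partition(x_minmax, y),
    "standardized": best_split_partition(x_standard, y),
    "log": best_split_partition(x_log, y),
}
for name, left in partitions.items():
    print(name, sorted(left))
```

All four variants send rows 0, 2 and 4 to the left child. The threshold value differs in each case, but the partition of the rows is identical.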

**However, there are some situations where scaling might still be worthwhile:**

* **Shared pipelines:** If the same preprocessed data also feeds algorithms that are sensitive to feature scale (e.g., k-nearest neighbors, SVMs, or regularized linear models), scaling once up front can simplify the process.
* **Visualization and interpretation:** Features on a common scale are easier to compare in plots, especially when dealing with mixed units. The trade-off is that split thresholds are then expressed in scaled units rather than the original ones.

Scaling is not expected to improve the accuracy of the tree itself. Any difference between a tree trained on raw features and one trained on scaled features should be limited to tie-breaking or floating-point rounding.

**Standardization behaves the same way, but row-wise normalization requires caution:**

* Standardization (e.g., StandardScaler) subtracts the mean of each feature and divides by its standard deviation. This is also an order-preserving transformation, so it does not affect the splitting decisions.
* Row-wise normalization (e.g., Normalizer, which rescales each sample to unit norm) mixes information across features and can change the order of values within a feature. This can change the splits and may lead to a different, potentially worse, tree.

**In summary:**

* Feature-wise scaling and standardization are not necessary for decision trees, but they can be convenient when the data is shared with other algorithms.
* Transformations that do not preserve the order within each feature, such as row-wise normalization, should be used cautiously and with awareness of their potential impact.
* If in doubt, compare performance with and without the transformation in your specific case.


# Encoding Categorical Variables

If the dataset includes categorical variables, you'll need to encode them, because most decision-tree implementations (scikit-learn's included) expect numeric input and cannot split on raw strings.

In [7]:
for col in df.columns:
    print(f"Column '{col}': {df[col].dtype}")

Column 'NumEstacao_': float64
Column 'SiglaUf_max': object
Column 'CodMunicipio_max': int64
Column 'FreqTxMHz_min': float64
Column 'FreqTxMHz_max': float64
Column 'FreqRxMHz_min': float64
Column 'FreqRxMHz_max': float64
Column 'CodTipoClasseEstacao_max': object
Column 'ClassInfraFisica_agg_non_none': object
Column 'CodTipoAntena_max': int64
Column 'GanhoAntena_agg_non_none': float64
Column 'FrenteCostaAntena_max': float64
Column 'AnguloMeiaPotenciaAntena_max': float64
Column 'AnguloElevacao_min': float64
Column 'Polarizacao_max': object
Column 'AlturaAntena_max': float64
Column 'PotenciaTransmissorWatts_max': float64
Column 'CodDebitoTFI_max': object
Column 'LarguraFaixaNecessaria_max': float64
Column 'CaracteristicasBasicas_agg_non_none': object
Column 'LTE_max': bool
Column 'WCDMA_max': bool
Column 'GSM_max': bool
Column 'NR_NSA_max': bool
Column 'NR_SA-NSA_max': bool
Column 'DMR_max': bool
Column 'Digital_max': bool
Column 'DiasDesdeLicenciamento_max': float64
Column 'DiasDesdePrime

- [x] SiglaUf_max
- [x] CodMunicipio_max
- [x] CodTipoClasseEstacao_max
- [x] ClassInfraFisica_agg_non_none
- [x] Polarizacao_max
- [x] CodDebitoTFI_max
- [x] CaracteristicasBasicas_agg_non_none
- [x] LTE_max
- [x] WCDMA_max
- [x] GSM_max
- [x] NR_NSA_max
- [x] NR_SA-NSA_max
- [x] DMR_max
- [x] Digital_max

In [22]:
df['CaracteristicasBasicas_agg_non_none'].value_counts()

CaracteristicasBasicas_agg_non_none
G7W    51273
G9W     4212
D7W     1313
D9W      300
M7W      208
0G9       14
G7E        3
0G7        2
D7D        1
7W         1
F8W        1
Name: count, dtype: int64

In [14]:
z = pd.get_dummies(df, columns=['Polarizacao_max'], prefix='Polarizacao')


In [18]:
type(z)

pandas.core.frame.DataFrame

Decision trees and random forests can handle boolean variables without encoding. A boolean feature is treated as 0/1, so it already offers exactly one possible split: False on one side, True on the other.
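
The two points above, one-hot encoding the string columns and leaving the booleans alone, can be combined in a single `pd.get_dummies` call. The sketch below uses a made-up four-row DataFrame that borrows a few column names from the dataset; the values (including the polarization codes) are assumptions, not taken from the Anatel data.

```python
import pandas as pd

# Toy frame reusing some column names from the dataset; values are made up.
toy = pd.DataFrame({
    "Polarizacao_max": ["V", "H", "V", "X"],
    "SiglaUf_max": ["SP", "RJ", "SP", "MG"],
    "LTE_max": [True, False, True, True],
    "AlturaAntena_max": [30.0, 45.0, 12.5, 60.0],
})

# Only the listed string columns are expanded into indicator columns.
# Boolean and numeric columns pass through unchanged.
categorical_cols = ["Polarizacao_max", "SiglaUf_max"]
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each listed column is replaced by one indicator column per distinct value, so the toy frame grows from 4 to 8 columns. With high-cardinality columns such as a municipality code, the same call would add one column per distinct code, which is worth checking with `nunique()` before encoding.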