In [1]:
import pandas as pd
import os
from clean_data_for_decision_tree import *

In [2]:
dir = '../../data/labeled_csv_files'
file = 'Anatel_labeled.csv'
file_path = os.path.join(dir, file)
df = pd.read_csv(file_path)

In [3]:
get_rid_of_problematic_columns(df)
rename_anatel_cols(df)


# Handle Missing Data

In [4]:
perc_null = (df.isnull().mean() * 100).round(2)


### Alternative 1 - Getting rid of all missing values

In [5]:
df_no_na = df.dropna()

In [6]:
print(f"Length of original dataframe: {len(df)}")
print(f"Length of non-nan dataframe: {len(df_no_na)}")

df = df_no_na.copy()
del df_no_na

Length of original dataframe: 57328
Length of non-nan dataframe: 55006
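
With its default arguments, `dropna()` removes every row that contains at least one missing value, which is why the frame shrinks from 57328 to 55006 rows (2322 rows, about 4% of the data). The sketch below repeats the same two steps on a tiny made-up frame; the values are illustrative and are not taken from the Anatel file.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "AntennaGain": [11.2, np.nan, 16.6, 13.8],
    "AntennaHeight": [30.0, 25.0, np.nan, 40.0],
    "TransmitterPower": [20.0, 20.0, 40.0, 40.0],
})

# Percentage of missing values per column, as in the perc_null cell.
toy_perc_null = (toy.isnull().mean() * 100).round(2)

# Default dropna(): a row is dropped if ANY of its columns is NaN.
toy_no_na = toy.dropna()
rows_lost = len(toy) - len(toy_no_na)

print(toy_perc_null)
print(f"Rows lost: {rows_lost} of {len(toy)}")
```

Inspecting the per-column percentages first helps to judge whether dropping rows is acceptable or whether a single sparse column is responsible for most of the loss.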


# Scaling/Normalizing

Scaling and normalizing are not strictly necessary for decision trees. Here's why:

**Decision trees are insensitive to monotonic transformations:**

* Decision trees split on thresholds (e.g., `x <= t`), so only the relative order of the values within each feature matters, not their absolute magnitude.
* Scaling (e.g., MinMaxScaler) only changes the range of values, not their order.
* Therefore, scaling doesn't affect how the tree splits the data.
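
The argument above can be checked empirically. The following is a minimal sketch on synthetic data (the features, their scales, and the label rule are all assumptions made for the demonstration): one tree is fitted on the raw features and one on min-max scaled features, and their structure and predictions are compared.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features on very different scales (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = ((X[:, 0] > 0) ^ (X[:, 2] > 50)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = MinMaxScaler().fit(X_train)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

# Same feature chosen at every node, and same predictions on held-out rows.
same_structure = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
same_predictions = np.array_equal(
    tree_raw.predict(X_test), tree_scaled.predict(scaler.transform(X_test))
)
print(same_structure, same_predictions)
```

Only the threshold values stored in the nodes differ between the two trees, because they are expressed in the units of the data each tree was trained on.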

**However, there are some situations where scaling might still be useful:**

* **Visualization and interpretation:** Scaled features are easier to compare on a common axis, especially when dealing with mixed units. Keep in mind that a tree trained on scaled data reports its split thresholds in scaled units, which makes the tree itself harder to read.
* **Combining with other algorithms:** If you plan to compare or combine decision trees with algorithms that do require scaled data (e.g., k-NN, SVM, regularized logistic regression), scaling once up front keeps the pipeline uniform.

Two commonly cited benefits do not really apply to trees. Numerical stability is rarely a concern, because building a tree involves only comparisons and counts rather than distances or gradients. Predictive performance is also not expected to change, because the resulting partitions of the data are the same; any small difference observed is usually due to floating-point ties.

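**Standardization is order-preserving too:**

* Standardization (e.g., StandardScaler) subtracts the mean and divides by the standard deviation. Like MinMaxScaler, this is a linear, strictly increasing transformation of each feature, so it preserves the order of the values and leaves the splits unchanged.
* Caution is needed only for transformations that are not monotonic per feature or that mix features (e.g., the row-wise `Normalizer` or PCA), since these can change the tree structure and its performance.

**In summary:**

* Scaling and standardization are generally not necessary for decision trees and do not change how the tree splits the data.
* Transformations that reorder values or combine features should be used with awareness of their potential impact.
* It's recommended to experiment and compare performance with and without preprocessing in your specific case.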
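
One way to run such an experiment is to wrap each preprocessing option in a pipeline and cross-validate it. This is a minimal sketch under stated assumptions: the data comes from `make_classification` rather than the Anatel file, and the pipeline names (`raw`, `minmax`, `standard`) are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and labels (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Same tree, three preprocessing variants.
candidates = {
    "raw": make_pipeline(DecisionTreeClassifier(random_state=0)),
    "minmax": make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0)),
    "standard": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0)),
}

# Mean 5-fold accuracy per variant; the scaler is refit inside each fold.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name:>8}: {score:.3f}")
```

Putting the scaler inside the pipeline ensures it is fitted on the training folds only, so the comparison is not affected by information leaking from the validation fold.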


# Encoding Categorical Variables

If the dataset includes categorical variables stored as strings (dtype `object`), you'll need to encode them, because scikit-learn's decision trees only accept numeric input.

In [7]:
for col in df.columns:
    print(f"Column '{col}': {df[col].dtype}")

Column 'Station': float64
Column 'MinTxFreq': float64
Column 'MaxTxFreq': float64
Column 'MinRxFreq': float64
Column 'MaxRxFreq': float64
Column 'SiteType': object
Column 'AntennaCode': int64
Column 'AntennaGain': float64
Column 'FrontBackAntennaRation': float64
Column 'AnguloMeiaPotenciaAntena_max': float64
Column 'ElevationAngle': float64
Column 'Polarization': object
Column 'AntennaHeight': float64
Column 'TransmitterPower': float64
Column 'NecessaryBandwidth': float64
Column 'BasicFeatures': object
Column 'LTE': bool
Column 'WCDMA': bool
Column 'GSM': bool
Column 'NR_NSA': bool
Column 'NR_SA-NSA': bool
Column 'DMR': bool
Column 'Digital': bool
Column 'DaysSinceLicensing': float64
Column 'DaysSinceFirstLicensing': float64
Column 'DaysUntilExpiration': int64


- [x] SiglaUf_max
- [x] CodMunicipio_max
- [x] CodTipoClasseEstacao_max
- [x] ClassInfraFisica_agg_non_none
- [x] Polarizacao_max
- [x] CodDebitoTFI_max
- [x] CaracteristicasBasicas_agg_non_none
- [x] LTE_max
- [x] WCDMA_max
- [x] GSM_max
- [x] NR_NSA_max
- [x] NR_SA-NSA_max
- [x] DMR_max
- [x] Digital_max

In [8]:
# One-hot encode both categorical columns; each column name is used as the prefix by default.
df = pd.get_dummies(df, columns=['Polarization', 'BasicFeatures'])
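
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame. The category labels `"V"` and `"H"` are hypothetical and are not taken from the Anatel data.

```python
import pandas as pd

# Toy frame; the Polarization labels are hypothetical.
toy = pd.DataFrame({
    "AntennaGain": [11.2, 17.0, 16.6],
    "Polarization": ["V", "H", "V"],
})

# The original column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Polarization"], prefix="Polarization")
print(encoded.columns.tolist())
```

Each distinct category becomes its own indicator column named `<prefix>_<category>`, which is how columns such as `BasicFeatures_G7W` arise in the real frame.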


Decision trees and random forests can handle boolean columns (`LTE`, `WCDMA`, `GSM`, and so on) without further encoding. scikit-learn casts `True`/`False` to 1/0, and a tree can separate the two values with a single threshold.
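
A quick way to confirm this is to fit a tree directly on boolean columns. The sketch below uses a made-up frame and a made-up label rule (`LTE` and not `GSM`), chosen only so that the result is easy to verify.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy boolean features; values are made up for illustration.
X_bool = pd.DataFrame({
    "LTE": [True, True, False, False, True, False],
    "GSM": [True, False, True, False, False, True],
})
# Hypothetical label rule: positive when LTE is set and GSM is not.
y_toy = (X_bool["LTE"] & ~X_bool["GSM"]).astype(int)

# No manual encoding step: the bool columns are passed as they are.
clf = DecisionTreeClassifier(random_state=0).fit(X_bool, y_toy)
bool_preds = clf.predict(X_bool)
print(bool_preds)
```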

In [9]:
del dir, col, perc_null, file_path, file

# Build and Train the Decision Tree Model

In [10]:
df['Station'] = df['Station'].astype(int)
df.set_index('Station', inplace=True)

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [12]:
df.head()

Station,MinTxFreq,MaxTxFreq,MinRxFreq,MaxRxFreq,SiteType,AntennaCode,AntennaGain,FrontBackAntennaRation,AnguloMeiaPotenciaAntena_max,ElevationAngle,...,BasicFeatures_0G9,BasicFeatures_7W,BasicFeatures_D7D,BasicFeatures_D7W,BasicFeatures_D9W,BasicFeatures_F8W,BasicFeatures_G7E,BasicFeatures_G7W,BasicFeatures_G9W,BasicFeatures_M7W
64300,788.0,3450.0,733.0,3450.0,GREENFIELD,760,11.2,26.0,105.0,0.0,...,False,False,False,False,False,False,False,False,True,False
1064380,788.0,2680.0,733.0,2560.0,GREENFIELD,760,17.0,28.0,65.0,0.0,...,False,False,False,False,False,False,False,True,False,False
5180368,798.0,3350.0,743.0,3350.0,GREENFIELD,760,16.6,25.0,69.0,-2.0,...,False,False,False,False,False,False,False,True,False,False
5180376,798.0,2640.0,743.0,2520.0,GREENFIELD,760,13.82,28.0,71.92,-2.0,...,False,False,False,False,False,False,False,True,False,False
5180384,798.0,2640.0,743.0,2520.0,GREENFIELD,760,14.1,30.0,69.0,0.0,...,False,False,False,False,False,False,False,True,False,False
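
The imports above suggest the usual split, fit, and evaluate sequence. The following is a minimal sketch of that sequence on synthetic data, not the notebook's actual training code. It assumes that `SiteType` is the label column (it is the only `object` column left unencoded), and the second class name `"ROOFTOP"`, the label rule, and the hyperparameters are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned frame (values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "AntennaHeight": rng.uniform(10, 60, size=n),
    "AntennaGain": rng.uniform(10, 18, size=n),
    "LTE": rng.random(n) > 0.5,
})
# ASSUMPTION: SiteType is the label; "ROOFTOP" is a hypothetical second class.
toy["SiteType"] = np.where(toy["AntennaHeight"] > 35, "GREENFIELD", "ROOFTOP")

# Separate features from the label and hold out 25% of the rows for testing.
X = toy.drop(columns="SiteType")
y = toy["SiteType"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# max_depth=5 is an arbitrary choice for the sketch, not a tuned value.
model = DecisionTreeClassifier(max_depth=5, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {test_accuracy:.3f}")
print(classification_report(y_test, y_pred))
```

Stratifying the split keeps the class proportions similar in the training and test sets, which matters when one site type is much rarer than the others.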
