# Group 1 - Smog prediction
## Submission

This *notebook* generates the file required for the Kaggle challenge:
https://www.kaggle.com/competitions/cdaw-abid-smog-prediction

### Data analysis and cleaning

Two *dataframes* are used: df_train, which is used to train the model, and df_test, for which the *Smog* feature is predicted.

In [8]:
#General imports
import numpy as np
import pandas as pd
from pandas import Series
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

#Specific imports
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.metrics import classification_report, recall_score, precision_score, make_scorer
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from scipy.stats import sem

#Visualization
sns.set(color_codes=True)

%matplotlib inline

df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test_nolabel.csv')

# Split Transmission into two columns: the digits become Gears and the
# non-digit prefix stays in Transmission. Raw strings avoid invalid-escape warnings.
df_train['Gears'] = df_train['Transmission'].str.extract(r'(\d+)')
df_train['Gears'] = pd.to_numeric(df_train['Gears'], errors='coerce')
df_train['Transmission'] = df_train['Transmission'].str.extract(r'(\D+)')

df_test['Gears'] = df_test['Transmission'].str.extract(r'(\d+)')
df_test['Gears'] = pd.to_numeric(df_test['Gears'], errors='coerce')
df_test['Transmission'] = df_test['Transmission'].str.extract(r'(\D+)')

Note that in this case the test file test_nolabel.csv is used to generate the predictions that are later uploaded to the platform; the model itself is trained on train.csv.
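To illustrate the `Transmission` split performed in the cell above, the following sketch applies the same two regular expressions to a handful of made-up codes. The sample values (`AS8`, `M6`, `AV`, `A10`) and the name `sample` are assumptions for illustration, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from train.csv / test_nolabel.csv.
sample = pd.DataFrame({"Transmission": ["AS8", "M6", "AV", "A10"]})

# Digits -> number of gears (NaN when the code carries no digits).
sample["Gears"] = pd.to_numeric(
    sample["Transmission"].str.extract(r"(\d+)", expand=False), errors="coerce"
)
# Leading non-digit characters -> transmission type.
sample["Transmission"] = sample["Transmission"].str.extract(r"(\D+)", expand=False)

print(sample)
```

A code without digits, such as `AV` here, ends up with a missing `Gears` value, which is why the later cells have to handle nulls in that column.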

In [9]:
#Fuel Type
df_train.loc[df_train["Fuel Type"] == "X", "Fuel Type"] = 0
df_train.loc[df_train["Fuel Type"] == "Z", "Fuel Type"] = 1
df_train.loc[df_train["Fuel Type"] == "D", "Fuel Type"] = 2
df_train.loc[df_train["Fuel Type"] == "E", "Fuel Type"] = 3
df_train.loc[df_train["Fuel Type"] == "N", "Fuel Type"] = 4

#Transmission
df_train.loc[df_train["Transmission"] == "A", "Transmission"] = 0
df_train.loc[df_train["Transmission"] == "AM", "Transmission"] = 1
df_train.loc[df_train["Transmission"] == "AS", "Transmission"] = 2
df_train.loc[df_train["Transmission"] == "AV", "Transmission"] = 3
df_train.loc[df_train["Transmission"] == "M", "Transmission"] = 4


#Vehicle Class
df_train.loc[df_train["Vehicle Class"] == "Compact", "Vehicle Class"] = 0
df_train.loc[df_train["Vehicle Class"] == "Full-size", "Vehicle Class"] = 1
df_train.loc[df_train["Vehicle Class"] == "Mid-size", "Vehicle Class"] = 2
df_train.loc[df_train["Vehicle Class"] == "Minicompact", "Vehicle Class"] = 3
df_train.loc[df_train["Vehicle Class"] == "Minivan", "Vehicle Class"] = 4
# No-op: "Minicompact" was already mapped to 3 above, so code 5 is never assigned.
df_train.loc[df_train["Vehicle Class"] == "Minicompact", "Vehicle Class"] = 5
df_train.loc[df_train["Vehicle Class"] == "Pickup truck: Small", "Vehicle Class"] = 6
df_train.loc[df_train["Vehicle Class"] == "Pickup truck: Standard", "Vehicle Class"] = 7
df_train.loc[df_train["Vehicle Class"] == "SUV: Small", "Vehicle Class"] = 8
df_train.loc[df_train["Vehicle Class"] == "SUV: Standard", "Vehicle Class"] = 9
df_train.loc[df_train["Vehicle Class"] == "Special purpose vehicle", "Vehicle Class"] = 10
df_train.loc[df_train["Vehicle Class"] == "Station wagon: Mid-size", "Vehicle Class"] = 11
df_train.loc[df_train["Vehicle Class"] == "Station wagon: Small", "Vehicle Class"] = 12
df_train.loc[df_train["Vehicle Class"] == "Subcompact", "Vehicle Class"] = 13
df_train.loc[df_train["Vehicle Class"] == "Two-seater", "Vehicle Class"] = 14

#Gears
df_train = df_train.dropna(subset=['Gears'])

In [10]:
#Fuel Type
df_test.loc[df_test["Fuel Type"] == "X", "Fuel Type"] = 0
df_test.loc[df_test["Fuel Type"] == "Z", "Fuel Type"] = 1
df_test.loc[df_test["Fuel Type"] == "D", "Fuel Type"] = 2
df_test.loc[df_test["Fuel Type"] == "E", "Fuel Type"] = 3
df_test.loc[df_test["Fuel Type"] == "N", "Fuel Type"] = 4

#Transmission
df_test.loc[df_test["Transmission"] == "A", "Transmission"] = 0
df_test.loc[df_test["Transmission"] == "AM", "Transmission"] = 1
df_test.loc[df_test["Transmission"] == "AS", "Transmission"] = 2
df_test.loc[df_test["Transmission"] == "AV", "Transmission"] = 3
df_test.loc[df_test["Transmission"] == "M", "Transmission"] = 4


#Vehicle Class
df_test.loc[df_test["Vehicle Class"] == "Compact", "Vehicle Class"] = 0
df_test.loc[df_test["Vehicle Class"] == "Full-size", "Vehicle Class"] = 1
df_test.loc[df_test["Vehicle Class"] == "Mid-size", "Vehicle Class"] = 2
df_test.loc[df_test["Vehicle Class"] == "Minicompact", "Vehicle Class"] = 3
df_test.loc[df_test["Vehicle Class"] == "Minivan", "Vehicle Class"] = 4
# No-op: "Minicompact" was already mapped to 3 above, so code 5 is never assigned.
df_test.loc[df_test["Vehicle Class"] == "Minicompact", "Vehicle Class"] = 5
df_test.loc[df_test["Vehicle Class"] == "Pickup truck: Small", "Vehicle Class"] = 6
df_test.loc[df_test["Vehicle Class"] == "Pickup truck: Standard", "Vehicle Class"] = 7
df_test.loc[df_test["Vehicle Class"] == "SUV: Small", "Vehicle Class"] = 8
df_test.loc[df_test["Vehicle Class"] == "SUV: Standard", "Vehicle Class"] = 9
df_test.loc[df_test["Vehicle Class"] == "Special purpose vehicle", "Vehicle Class"] = 10
df_test.loc[df_test["Vehicle Class"] == "Station wagon: Mid-size", "Vehicle Class"] = 11
df_test.loc[df_test["Vehicle Class"] == "Station wagon: Small", "Vehicle Class"] = 12
df_test.loc[df_test["Vehicle Class"] == "Subcompact", "Vehicle Class"] = 13
df_test.loc[df_test["Vehicle Class"] == "Two-seater", "Vehicle Class"] = 14

#Gears
df_test['Gears'] = df_test['Gears'].fillna(df_test['Gears'].mean())

In df_test, rows with a null "Gears" value are not dropped, because all 390 rows are needed, each with its resulting prediction; the missing values are filled with the column mean instead.
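The difference between dropping and filling can be seen on a small made-up table. This is only a sketch: `toy_test` and its values are hypothetical and stand in for df_test.

```python
import numpy as np
import pandas as pd

# Hypothetical mini test set with one missing Gears value.
toy_test = pd.DataFrame({"id": [1, 2, 3, 4], "Gears": [6.0, np.nan, 8.0, 10.0]})

# Dropping the null row would lose one id that still needs a prediction.
rows_if_dropped = len(toy_test.dropna(subset=["Gears"]))

# Filling with the mean of the known values keeps every row.
toy_test["Gears"] = toy_test["Gears"].fillna(toy_test["Gears"].mean())
rows_after_fill = len(toy_test)

print(rows_if_dropped, rows_after_fill)
print(toy_test["Gears"].tolist())
```

Mean imputation is only one possible choice; the median or the most frequent value would keep the row count intact in the same way.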

In [11]:
df_train.drop("Model Year", axis=1, inplace=True)
df_train.drop("Make", axis=1, inplace=True)
df_train.drop("Model", axis=1, inplace=True)
df_train.drop("Comb (mpg)", axis=1, inplace=True)
df_train.drop("Fuel Consumption City (L/100 km)", axis=1, inplace=True)
df_train.drop("Hwy (L/100 km)", axis=1, inplace=True)

df_test.drop("Model Year", axis=1, inplace=True)
df_test.drop("Make", axis=1, inplace=True)
df_test.drop("Model", axis=1, inplace=True)
df_test.drop("Comb (mpg)", axis=1, inplace=True)
df_test.drop("Fuel Consumption City (L/100 km)", axis=1, inplace=True)
df_test.drop("Hwy (L/100 km)", axis=1, inplace=True)

Data preparation and cleaning are the same as in the previously developed models.
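Because the same encoding is applied to both dataframes, it could also be written once and reused. The sketch below is one possible way to do that with dictionaries and `Series.map`; the helper `encode_categoricals` and the `toy` dataframe are hypothetical and not part of the original notebook, while the codes themselves are copied from the cells above.

```python
import pandas as pd

# Integer codes copied from the encoding cells above.
FUEL_TYPE_CODES = {"X": 0, "Z": 1, "D": 2, "E": 3, "N": 4}
TRANSMISSION_CODES = {"A": 0, "AM": 1, "AS": 2, "AV": 3, "M": 4}


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Fuel Type and Transmission mapped to integer codes.

    Hypothetical helper. Values missing from a dictionary become NaN,
    whereas the .loc version leaves them unchanged.
    """
    out = df.copy()
    out["Fuel Type"] = out["Fuel Type"].map(FUEL_TYPE_CODES)
    out["Transmission"] = out["Transmission"].map(TRANSMISSION_CODES)
    return out


# Made-up rows standing in for df_train or df_test.
toy = pd.DataFrame({"Fuel Type": ["X", "D"], "Transmission": ["AS", "M"]})
encoded = encode_categoricals(toy)
print(encoded)
```

The same pattern would extend to "Vehicle Class" with a third dictionary, and calling the helper on both df_train and df_test keeps the two encodings from drifting apart.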

### Model training

In [12]:
features = ['Vehicle Class', 'Engine Size (L)', 'Cylinders', 'Transmission', 'Fuel Type', 'CO2 Emissions (g/km)', 'Gears']

x_train = df_train[features].values
y_train = df_train['Smog'].values

model = RandomForestClassifier(class_weight='balanced', max_features='log2', min_samples_leaf=1, min_samples_split=2, n_estimators=100, random_state=100)

model.fit(x_train, y_train)

RandomForestClassifier(class_weight='balanced', max_features='log2',
                       random_state=100)

### *Smog* prediction

This section computes the *Smog* prediction that will later be uploaded to the Kaggle platform.
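Since the notebook passes `.values` arrays to the model, column names are discarded and only the column order links training and prediction. The following sketch, on synthetic data, shows why selecting both dataframes with the same `features` list matters. Everything here is an assumption for illustration: the shortened feature list, the random values, the label range, and the names `toy_train`, `toy_test`, and `toy_model`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["Cylinders", "Gears"]  # shortened, hypothetical feature list

# Synthetic data standing in for df_train and df_test.
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "Cylinders": rng.integers(3, 9, size=40),
    "Gears": rng.integers(4, 11, size=40),
})
toy_train["Smog"] = rng.integers(1, 8, size=40)  # made-up labels

# The test columns are deliberately stored in a different order.
toy_test = pd.DataFrame({
    "Gears": rng.integers(4, 11, size=5),
    "Cylinders": rng.integers(3, 9, size=5),
})

toy_model = RandomForestClassifier(n_estimators=10, random_state=100)
toy_model.fit(toy_train[features].values, toy_train["Smog"].values)

# Selecting with the same list restores the training column order.
toy_prediction = toy_model.predict(toy_test[features].values)
print(len(toy_prediction) == len(toy_test))
```

Defining `features` once and reusing it for both x_train and x_test would remove the risk of the two lists diverging.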

In [13]:
features = ['Vehicle Class', 'Engine Size (L)', 'Cylinders', 'Transmission', 'Fuel Type', 'CO2 Emissions (g/km)', 'Gears']

x_test = df_test[features].values

prediction = model.predict(x_test)

### Data storage

This section stores the prediction data with the corresponding IDs in a file named results.csv.
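Before uploading, it can be useful to read the file back and check its shape. The sketch below does this on made-up ids and predictions, writing to an in-memory buffer rather than to results.csv; `toy_ids`, `toy_prediction`, and the specific checks are assumptions, not requirements stated by the competition.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for df_test['id'] and prediction.
toy_ids = pd.Series([101, 102, 103], name="id")
toy_prediction = np.array([3, 5, 7])

toy_result = pd.DataFrame({"id": toy_ids, "Predicted": toy_prediction})

# Write to an in-memory buffer instead of results.csv, then read it back.
buffer = io.StringIO()
toy_result.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Checks worth running before uploading: column names, row count, no gaps.
assert list(reloaded.columns) == ["id", "Predicted"]
assert len(reloaded) == len(toy_ids)
assert not reloaded["Predicted"].isna().any()
```

Passing `index=False`, as the notebook does, keeps the dataframe index out of the file so that only the two expected columns are written.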

In [14]:
result_df = pd.DataFrame({'id': df_test['id'], 'Predicted': prediction})

result_df.to_csv('results.csv', index=False)