# Data Set - Used Cars
## Columns 
- **numero** (row number)
- **id**
- **marca**
- **modelo**
- **año_modelo**
- **millaje**
- **tipo_combustible**
- **motor**
- **transmisión**
- **color_exterior**
- **color_interior**
- **accidente**
- **sin_daños**
- **precio**
## Sizes
- **Training Set** : 169768
- **Testing Set** : 18852


# Exploratory Analysis
### What I need to do
- Investigate which variables are the most influential
- Identify variables with null values or formatting problems
- Run a PCA or some other variance analysis
- Preprocess the training set
- Train a neural network and evaluate its metrics

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

In [2]:
# Load Data
data_path = '../resources/train.csv'
df = pd.read_csv(data_path)

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169679 entries, 0 to 169678
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   numero            169679 non-null  int64 
 1   id                169679 non-null  int64 
 2   marca             169679 non-null  object
 3   modelo            169679 non-null  object
 4   año_modelo        169679 non-null  int64 
 5   millaje           169679 non-null  int64 
 6   tipo_combustible  165082 non-null  object
 7   motor             169679 non-null  object
 8   transmisión       169679 non-null  object
 9   color_exterior    169679 non-null  object
 10  color_interior    169679 non-null  object
 11  accidente         167465 non-null  object
 12  sin_daños         150366 non-null  object
 13  precio            169679 non-null  int64 
dtypes: int64(5), object(9)
memory usage: 18.1+ MB
None


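For some reason, sin_daños has far fewer non-null values than the other columns (150366 of 169679). tipo_combustible (165082) and accidente (167465) also have missing values, though far fewer.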
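The missing counts can also be read directly rather than inferred from `df.info()`. The following is a minimal sketch on a small toy frame; the column names come from the data set, but the rows and the `toy` and `summary` names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame reusing column names from the data set; all values are invented.
toy = pd.DataFrame({
    "tipo_combustible": ["Gasoline", np.nan, "Hybrid", "Gasoline"],
    "accidente": ["No", "No", "No", np.nan],
    "sin_daños": ["Yes", np.nan, np.nan, "Yes"],
    "precio": [15000, 22000, 30000, 8000],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Applied to the real `df`, the same three lines after the toy frame would list the columns with gaps in one table.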

In [3]:
# Identify unique values that are neither "Yes" nor "No"
unique_values = df['sin_daños'].unique()
non_yes_no_values = [value for value in unique_values if value not in ["Yes", "No"]]

print("Unique values in 'sin_daños' that are neither 'Yes' nor 'No':", non_yes_no_values)

Unique values in 'sin_daños' that are neither 'Yes' nor 'No': [nan]


In [4]:
# Fill NaN with "Unknown". The column holds only "Yes" or NaN, so this behaves the same
# as filling with "No", which may be the more appropriate label.
df['sin_daños'] = df['sin_daños'].fillna("Unknown")

print(df['sin_daños'].value_counts())

sin_daños
Yes        150366
Unknown     19313
Name: count, dtype: int64
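Because the column contains only "Yes" and missing values, the choice of fill label should not change what a model sees once the column is encoded. A minimal sketch of that equivalence, using an invented four-row series (`col`, `as_unknown`, and `as_no` are illustrative names):

```python
import numpy as np
import pandas as pd

# Invented sample shaped like the real column: only "Yes" or NaN.
col = pd.Series(["Yes", np.nan, "Yes", np.nan], name="sin_daños")

# Two candidate fill labels for the missing entries.
as_unknown = col.fillna("Unknown")
as_no = col.fillna("No")

# Reduce each version to a binary indicator: 1 for "Yes", 0 otherwise.
indicator_unknown = (as_unknown == "Yes").astype(int)
indicator_no = (as_no == "Yes").astype(int)

# Both fill labels produce the same indicator column.
print(indicator_unknown.equals(indicator_no))
```

The equivalence holds only while the column has exactly two distinct values after filling; if real "No" entries were present, "Unknown" would form a third category.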


In [5]:
# Drop Less Informative Columns
columns_to_drop = ['numero', 'id']
df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])

# Separate features and target variable
X = df.drop(columns=['precio'])
y = df['precio']
print(X.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169679 entries, 0 to 169678
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   marca             169679 non-null  object
 1   modelo            169679 non-null  object
 2   año_modelo        169679 non-null  int64 
 3   millaje           169679 non-null  int64 
 4   tipo_combustible  165082 non-null  object
 5   motor             169679 non-null  object
 6   transmisión       169679 non-null  object
 7   color_exterior    169679 non-null  object
 8   color_interior    169679 non-null  object
 9   accidente         167465 non-null  object
 10  sin_daños         169679 non-null  object
dtypes: int64(2), object(9)
memory usage: 14.2+ MB
None
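With features and target separated, the preprocessing step from the plan could look roughly like the sketch below. It is an assumption about how the imported `ColumnTransformer`, `StandardScaler`, and `OneHotEncoder` might be combined, not the notebook's actual pipeline; the toy rows and the `numeric_cols`, `categorical_cols`, and `preprocessor` names are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows using the data set's column names; all values are invented.
toy = pd.DataFrame({
    "año_modelo": [2015, 2018, 2020, 2012],
    "millaje": [60000, 30000, 10000, 120000],
    "tipo_combustible": ["Gasoline", np.nan, "Hybrid", "Gasoline"],
    "transmisión": ["A/T", "M/T", "A/T", "A/T"],
})

numeric_cols = ["año_modelo", "millaje"]
categorical_cols = ["tipo_combustible", "transmisión"]

# Fill missing categories the same way the notebook fills sin_daños.
toy[categorical_cols] = toy[categorical_cols].fillna("Unknown")

# Scale numeric columns and one-hot encode categorical ones.
# sparse_threshold=0 forces a dense array, which is convenient for inspection.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

features = preprocessor.fit_transform(toy)
print(features.shape)
```

`handle_unknown="ignore"` is chosen here so that categories seen only in the testing set are encoded as all zeros instead of raising an error.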


In [None]:
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder

# Calculate Correlation for Numeric Features
# Only include numeric columns in correlation matrix
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
correlations = X[numeric_features].corrwith(y).abs()

# Calculate Mutual Information for Categorical Features
categorical_features = X.select_dtypes(include=['object']).columns
mutual_info_scores = {}

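# Encode each categorical feature and calculate its mutual information with the target.
# The codes go into a local array so X keeps its original string values.
# Note: with the default discrete_features='auto', dense input such as these
# integer codes is treated as continuous.
for col in categorical_features:
    le = LabelEncoder()
    encoded = le.fit_transform(X[col])
    mutual_info_scores[col] = mutual_info_regression(encoded.reshape(-1, 1), y)[0]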

# Sort both rankings in descending order
correlations = correlations.sort_values(ascending=False)
mutual_info_scores = dict(sorted(mutual_info_scores.items(), key=lambda item: item[1], reverse=True))

# Display the most influential features
print("Numeric Features by Correlation with Price:")
print(correlations)

print("\nCategorical Features by Mutual Information with Price:")
for feature, score in mutual_info_scores.items():
    print(f"{feature}: {score:.4f}")

Numeric Features by Correlation with Price:
millaje       0.282018
año_modelo    0.230902
dtype: float64

Categorical Features by Mutual Information with Price:
motor: 0.5966
transmisión: 0.4391
modelo: 0.4137
sin_daños: 0.1604
marca: 0.1512
color_exterior: 0.1298
color_interior: 0.1251
accidente: 0.0623
tipo_combustible: 0.0345
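The scores above come from label-encoded columns passed with the default settings, where integer codes are treated as continuous values. A variant worth comparing is to declare the features discrete and fix the random seed. The sketch below uses synthetic data only: the category labels, the price levels, and the `informative` and `noise` names are invented, and it is not a re-run of the analysis above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 300

# Synthetic categories: one drives the price, the other is unrelated to it.
informative = rng.choice(["I4", "V6", "V8"], size=n)
noise = rng.choice(["Black", "White", "Red"], size=n)

# Price depends on the informative category plus Gaussian noise.
base_price = {"I4": 10000.0, "V6": 20000.0, "V8": 40000.0}
price = np.array([base_price[m] for m in informative]) + rng.normal(0, 1000, size=n)

# Label-encode both columns into one integer matrix.
encoded = np.column_stack([
    LabelEncoder().fit_transform(informative),
    LabelEncoder().fit_transform(noise),
])

# discrete_features=True tells the estimator the codes are categories, not magnitudes;
# random_state makes the result reproducible.
scores = mutual_info_regression(encoded, price, discrete_features=True, random_state=0)
mi_by_feature = dict(zip(["motor", "color_exterior"], scores))

for feature, score in mi_by_feature.items():
    print(f"{feature}: {score:.4f}")
```

With this setup the informative column should score well above the unrelated one, which makes it a quick sanity check before applying the same call to the real features.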
