# Phase 2: Exploratory Data Analysis (EDA) with `CarDataAnalyzer`

**Objective:** Use our `CarDataAnalyzer` query engine to answer business questions and visualize the results.


In [11]:
# --- Imports ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# --- Set style ---
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# --- Load our CarDataAnalyzer class ---
# The %run magic executes the other notebook in this namespace,
# so the classes and variables it defines become available here
%run ./01-Data_Cleaning_and_Preprocessing.ipynb

# --- Check that the analyzer is available ---
# We assume the previous notebook defines `analyzer` and `datasets`
print("✅ CarDataAnalyzer imported and ready to use.")
print(f"📊 Available datasets: {list(datasets.keys())}")
print(f"🔧 Available methods: {len([m for m in dir(analyzer) if not m.startswith('_') and callable(getattr(analyzer, m))])}")


✅ Imports and setup completed
📊 Display options configured for better data viewing
📂 LOADING DATASETS SEPARATELY
Loading datasets...
✅ BASIC: Basic_table.csv - Shape: (1011, 4)
   Memory: 0.19 MB
✅ TRIM: Trim_table.csv - Shape: (335562, 9)
   Memory: 120.94 MB
✅ PRICE: Price_table.csv - Shape: (6333, 5)
   Memory: 1.23 MB
✅ SALES: Sales_table.csv - Shape: (773, 23)
   Memory: 0.26 MB

📊 SUMMARY:
   Total datasets loaded: 4
   Total memory usage: 122.63 MB

✅ All datasets loaded successfully!
🔧 STANDARDIZING COLUMN NAMES
✅ BASIC: No column name changes needed
✅ TRIM: Renamed 'Maker' to 'Automaker'
✅ PRICE: Renamed 'Maker' to 'Automaker'
✅ SALES: Renamed 'Maker' to 'Automaker'

📋 COLUMN STANDARDIZATION SUMMARY:
  BASIC: No changes needed
  TRIM: Maker → Automaker
  PRICE: Maker → Automaker
  SALES: Maker → Automaker

📊 FINAL COLUMNS BY DATASET:
  BASIC: ['Automaker', 'Automaker_ID', 'Genmodel', 'Genmodel_ID']
  TRIM: ['Genmodel_ID', 'Automaker', 'Genmodel', 'Trim', 'Year', 'Price', 'Gas_

Unnamed: 0,Automaker,Automaker_ID,Genmodel,Genmodel_ID
97,BMW,8,1 Series,8_1
98,BMW,8,2 Series,8_2
99,BMW,8,2 Series Active Tourer,8_3
100,BMW,8,2 Series Gran Tourer,8_4
101,BMW,8,3 Series,8_5



2️⃣ EXAMPLE: Price range by model
----------------------------------------
Price ranges for 1011 models:


Unnamed: 0,Automaker,Automaker_ID,Genmodel,Genmodel_ID,price_min,price_max,price_mean,price_entries
0,AC,1,Cobra,1_1,,,,
1,Abarth,2,124 Spider,2_1,26665.0,29515.0,28052.5,4.0
2,Abarth,2,500,2_2,13400.0,14325.0,13955.0,8.0
3,Abarth,2,500C,2_3,15775.0,17290.0,16394.571429,7.0
4,Abarth,2,595,2_4,14425.0,17675.0,15447.142857,7.0



3️⃣ EXAMPLE: Trim summary by model
----------------------------------------
Trim summary for 1011 models:


Unnamed: 0,Automaker,Automaker_ID,Genmodel,Genmodel_ID,trim_price_min,trim_price_max,trim_price_mean,trim_price_count,year_min,year_max,most_common_fuel,trim_count
0,AC,1,Cobra,1_1,,,,,,,,
1,Abarth,2,124 Spider,2_1,26665.0,35365.0,30524.090909,11.0,2016.0,2019.0,Petrol,11.0
2,Abarth,2,500,2_2,13400.0,15625.0,14542.578947,19.0,2009.0,2016.0,Petrol,19.0
3,Abarth,2,500C,2_3,15775.0,17658.0,16876.473684,19.0,2010.0,2016.0,Petrol,19.0
4,Abarth,2,595,2_4,14425.0,23805.0,19294.044586,157.0,2012.0,2018.0,Petrol,157.0



4️⃣ EXAMPLE: Sales summary by model
----------------------------------------
Sales summary for 1011 models:


Unnamed: 0,Automaker,Automaker_ID,Genmodel,Genmodel_ID,total_sales,avg_sales,max_sales,years_with_data
0,AC,1,Cobra,1_1,,,,
1,Abarth,2,124 Spider,2_1,1691.0,42.275,777.0,40.0
2,Abarth,2,500,2_2,5419.0,270.95,915.0,20.0
3,Abarth,2,500C,2_3,,,,
4,Abarth,2,595,2_4,18128.0,906.4,3907.0,20.0



5️⃣ EXAMPLE: Comprehensive info for Abarth 124 Spider
----------------------------------------
Model ID: 2_1 (Abarth 124 Spider)

BASIC_INFO:


Unnamed: 0,Automaker,Automaker_ID,Genmodel,Genmodel_ID
1,Abarth,2,124 Spider,2_1



TRIM_DETAILS:


Unnamed: 0,Genmodel_ID,Automaker,Genmodel,Trim,Year,Price,Gas_emission,Fuel_type,Engine_size
0,2_1,Abarth,124 spider,124 Spider1.4 Turbo MultiAir 170hp 2d,2016,29365,148,Petrol,1368
1,2_1,Abarth,124 spider,124 Spider1.4 Turbo MultiAir 170hp Sequenziale...,2016,31365,153,Petrol,1368
2,2_1,Abarth,124 spider,124 Spider1.4 Turbo MultiAir 170hp 2d,2017,29365,148,Petrol,1368
3,2_1,Abarth,124 spider,124 Spider1.4 Turbo MultiAir 170hp Sequenziale...,2017,31365,153,Petrol,1368
4,2_1,Abarth,124 spider,124 SpiderScorpione 1.4 Turbo MultiAir 170hp 2d,2017,26665,148,Petrol,1368



PRICE_HISTORY:


Unnamed: 0,Automaker,Genmodel,Genmodel_ID,Year,Entry_price
0,Abarth,124 Spider,2_1,2016,29365
1,Abarth,124 Spider,2_1,2017,26665
2,Abarth,124 Spider,2_1,2018,26665
3,Abarth,124 Spider,2_1,2019,29515



SALES_DATA:


Unnamed: 0,Automaker,Genmodel,Genmodel_ID,2020,2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001
0,ABARTH,ABARTH 124,2_1,0,19,27,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,ABARTH,ABARTH SPIDER,2_1,0,223,777,409,176,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0



✅ Demo completed successfully!
🔬 ADVANCED QUERY EXAMPLES

1️⃣ TOP 10 MOST EXPENSIVE MODELS
----------------------------------------
Top 10 most expensive models (by max price):


Unnamed: 0,Automaker,Genmodel,price_min,price_max,price_mean
786,Rolls-Royce,Phantom,252038.0,320120.0,282802.25
546,Maybach,62,281200.0,302725.0,291846.2
784,Rolls-Royce,Dawn,264000.0,275240.0,266810.0
545,Maybach,57,243600.0,266707.0,254985.8
466,Lamborghini,Aventador,256020.0,262860.0,260295.0
271,Ferrari,812 Superfast,260908.0,260908.0,260908.0
788,Rolls-Royce,Wraith,228800.0,251240.0,237584.0
151,Bentley,Brooklands,225100.0,241539.0,231659.75
275,Ferrari,F12berlinetta,238232.0,239908.0,239051.142857
154,Bentley,Mulsanne,220000.0,238700.0,228583.333333



2️⃣ MODELS WITH MOST TRIM VARIATIONS
----------------------------------------
Top 10 models with most trim variations:


Unnamed: 0,Automaker,Genmodel,trim_count,trim_price_min,trim_price_max
101,BMW,3 Series,12289.0,13650.0,85130.0
586,Mercedes-Benz,C Class,9861.0,17600.0,97710.0
924,Vauxhall,Astra,7490.0,9430.0,29945.0
53,Audi,A4,7060.0,16015.0,70920.0
51,Audi,A3,6602.0,14475.0,44820.0
97,BMW,1 Series,5832.0,15507.0,40175.0
339,Ford,Mondeo,5815.0,12810.0,35440.0
932,Vauxhall,Insignia,5801.0,15376.0,36525.0
326,Ford,Focus,5673.0,9830.0,39040.0
594,Mercedes-Benz,E Class,5378.0,23345.0,108325.0



3️⃣ BEST SELLING MODELS
----------------------------------------
Top 10 best selling models:


Unnamed: 0,Automaker,Genmodel,total_sales,avg_sales,max_sales
325,Ford,Fiesta,1505740.0,75287.0,125619.0
326,Ford,Focus,1166989.0,58349.45,77782.0
928,Vauxhall,Corsa,1043713.0,52185.65,86840.0
962,Volkswagen,Golf,1018866.0,50943.3,70714.0
924,Vauxhall,Astra,816382.0,40819.1,65907.0
969,Volkswagen,Polo,662837.0,33141.85,52347.0
528,MINI,Hatch,618637.0,10310.616667,38958.0
677,Nissan,Qashqai,555468.0,27773.4,58132.0
101,BMW,3 Series,547991.0,27399.55,38244.0
917,Toyota,Yaris,474954.0,23747.7,31025.0



4️⃣ FUEL TYPE ANALYSIS
----------------------------------------
Fuel type distribution:


Unnamed: 0,fuel_type,model_count,avg_price
3,Petrol,408,41447.351079
0,Diesel,215,28080.627042
2,Other,22,43503.461766
1,Electric Diesel REX,2,30730.0



5️⃣ YEAR RANGE ANALYSIS
----------------------------------------
Year range distribution:


Unnamed: 0,year_min,year_max,model_count
20,1998.0,2018.0,40
171,2016.0,2018.0,17
174,2017.0,2018.0,17
143,2010.0,2018.0,16
164,2014.0,2018.0,15
162,2013.0,2018.0,14
6,1998.0,2004.0,14
22,1998.0,2020.0,13
79,2003.0,2018.0,13
4,1998.0,2002.0,11



✅ Advanced query examples completed!
📋 SUMMARY AND NEXT STEPS

✅ COMPLETED TASKS:
  1. ✅ Loaded all datasets separately (no merging)
  2. ✅ Standardized column names (Maker → Automaker)
  3. ✅ Created SQL-like query functions
  4. ✅ Performed data quality assessment
  5. ✅ Demonstrated flexible analysis capabilities
  6. ✅ Preserved all original data detail

📊 DATASET STATUS:
  Total datasets: 4
  Total rows: 343,679
  Total memory: 122.63 MB
  Data preserved: 100% (no aggregation loss)

🎯 KEY BENEFITS ACHIEVED:
  ✅ No data explosion (avoided 5.3M row merge)
  ✅ Preserved all trim details
  ✅ Flexible query capabilities
  ✅ Easy to add new analysis types
  ✅ Memory efficient
  ✅ Scales well with large datasets

🔍 AVAILABLE ANALYSIS METHODS:
  📊 analyzer.get_basic_info() - Dataset overview
  🔍 analyzer.query_models_by_automaker('BMW') - Filter by automaker
  🚗 analyzer.query_trim_details('2_1') - Get trim details
  💰 analyzer.get_price_range_by_model() - Price analysis
  🏷️  analyzer.g
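
The "no data explosion" point in the summary above rests on aggregating each table to one row per `Genmodel_ID` before joining, rather than merging the raw tables. A minimal sketch of that pattern with toy frames follows; the frames and the helper name `summarize_prices` are illustrative assumptions, not the actual `CarDataAnalyzer` code.

```python
import pandas as pd

# Toy stand-ins for the BASIC and PRICE tables (values are illustrative).
basic = pd.DataFrame({
    "Automaker": ["AC", "Abarth", "Abarth"],
    "Genmodel": ["Cobra", "124 Spider", "500"],
    "Genmodel_ID": ["1_1", "2_1", "2_2"],
})
price = pd.DataFrame({
    "Genmodel_ID": ["2_1", "2_1", "2_2", "2_2", "2_2"],
    "Year": [2016, 2017, 2014, 2015, 2016],
    "Entry_price": [29365, 26665, 13400, 14000, 14325],
})


def summarize_prices(basic_df, price_df):
    """Hypothetical helper: one price summary row per model, joined to the catalog."""
    agg = (
        price_df.groupby("Genmodel_ID")["Entry_price"]
        .agg(price_min="min", price_max="max", price_mean="mean", price_entries="count")
        .reset_index()
    )
    # Left join keeps every catalog model; models without prices get NaN.
    return basic_df.merge(agg, on="Genmodel_ID", how="left", validate="one_to_one")


summary = summarize_prices(basic, price)
print(summary)
```

Because the aggregate has at most one row per model, the join cannot multiply rows, and `validate="one_to_one"` raises an error if that assumption is ever violated.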

## 1. Catalog Analysis: Who are the Key Players?

We will use the analyzer to get a summary of manufacturers and their models.


In [12]:
# --- Top 15 Manufacturers by Number of Models ---
print("🏭 MANUFACTURER ANALYSIS")
print("="*50)

# Get model count by manufacturer
automaker_counts = analyzer.basic['Automaker'].value_counts()
top_15_makers = automaker_counts.head(15)

print(f"Total manufacturers: {len(automaker_counts)}")
print(f"Total models: {len(analyzer.basic)}")
print(f"\nTop 15 manufacturers by number of models:")
print(top_15_makers)

# Create visualization with Plotly
fig_makers = px.bar(
    x=top_15_makers.index,
    y=top_15_makers.values,
    title='Top 15 Manufacturers by Number of Models',
    labels={'x': 'Manufacturer', 'y': 'Number of Models'},
    color=top_15_makers.values,
    color_continuous_scale='viridis'
)

fig_makers.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=500
)

fig_makers.show()

# Additional statistics
print(f"\n📊 STATISTICS:")
print(f"  Manufacturer with most models: {top_15_makers.index[0]} ({top_15_makers.iloc[0]} models)")
print(f"  Average models per manufacturer: {automaker_counts.mean():.1f}")
print(f"  Median models per manufacturer: {automaker_counts.median():.1f}")
print(f"  Top 5 account for: {(top_15_makers.head(5).sum() / len(analyzer.basic) * 100):.1f}% of the catalog")


🏭 MANUFACTURER ANALYSIS
Total manufacturers: 101
Total models: 1011

Top 15 manufacturers by number of models:
Automaker
BMW              50
Audi             48
Toyota           45
Peugeot          43
Nissan           41
Ford             41
Mercedes-Benz    40
Volkswagen       37
Fiat             30
Vauxhall         28
Hyundai          26
Citroen          26
Ferrari          23
Renault          23
Mitsubishi       22
Name: count, dtype: int64



📊 STATISTICS:
  Manufacturer with most models: BMW (50 models)
  Average models per manufacturer: 10.0
  Median models per manufacturer: 5.0
  Top 5 account for: 22.5% of the catalog
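
The "Top 5 account for" figure is a cumulative share of the catalog. A small sketch of how such a share can be computed for any N, using a toy `value_counts`-style Series (the helper `top_n_share` and the counts are illustrative assumptions):

```python
import pandas as pd


def top_n_share(counts: pd.Series, n: int) -> float:
    """Percentage of the total held by the n largest entries."""
    ordered = counts.sort_values(ascending=False)
    return float(ordered.head(n).sum() / ordered.sum() * 100)


# Toy model counts per manufacturer (not the real catalog).
toy_counts = pd.Series({"A": 50, "B": 30, "C": 10, "D": 5, "E": 5})

print(f"Top 2 share: {top_n_share(toy_counts, 2):.1f}%")
```

With the real data, `counts` would be `automaker_counts`, whose sum equals the number of rows in the basic table.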


## 2. Price Analysis: What is the positioning of each brand?

We will query the price data to see which brands are, on average, more expensive.


In [13]:
# --- Price Analysis by Manufacturer ---
print("💰 PRICE ANALYSIS BY BRAND")
print("="*50)

# Get price summary by model
price_summary = analyzer.get_price_range_by_model()
print(f"Models with price data: {len(price_summary.dropna(subset=['price_mean']))}")

# Calculate average price by manufacturer
avg_price_by_maker = price_summary.groupby('Automaker')['price_mean'].agg(['mean', 'count', 'min', 'max']).reset_index()
avg_price_by_maker = avg_price_by_maker[avg_price_by_maker['count'] >= 3]  # At least 3 models with price data
avg_price_by_maker = avg_price_by_maker.sort_values('mean', ascending=False).head(15)

print(f"\nTop 15 manufacturers by average price (min. 3 models):")
display(avg_price_by_maker)

# Create visualization
fig_price_maker = px.bar(
    avg_price_by_maker,
    x='Automaker',
    y='mean',
    title='Top 15 Manufacturers by Average Price',
    labels={'Automaker': 'Manufacturer', 'mean': 'Average Price (€)'},
    color='mean',
    color_continuous_scale='plasma',
    text='count'
)

fig_price_maker.update_traces(
    texttemplate='%{text} models',
    textposition='outside'
)

fig_price_maker.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=500
)

fig_price_maker.show()

# Price range analysis
print(f"\n📊 PRICE RANGE ANALYSIS:")
price_ranges = price_summary.groupby('Automaker').agg({
    'price_min': 'min',
    'price_max': 'max',
    'price_mean': 'mean'
}).reset_index()

price_ranges['price_range'] = price_ranges['price_max'] - price_ranges['price_min']
price_ranges['price_range_pct'] = (price_ranges['price_range'] / price_ranges['price_mean']) * 100

top_range_makers = price_ranges.nlargest(10, 'price_range_pct')
print("Top 10 manufacturers with highest price diversity:")
display(top_range_makers[['Automaker', 'price_min', 'price_max', 'price_range', 'price_range_pct']])


💰 PRICE ANALYSIS BY BRAND
Models with price data: 647

Top 15 manufacturers by average price (min. 3 models):


Unnamed: 0,Automaker,mean,count,min,max
78,Rolls-Royce,232294.631818,5,161000.0,282802.25
46,Lamborghini,182473.083333,4,128750.0,260295.0
28,Ferrari,173349.581491,14,92311.666667,260908.0
11,Bentley,169735.951567,6,124296.25,231659.75
59,McLaren,159270.833333,4,135000.0,190833.333333
7,Aston Martin,136875.55754,8,88134.714286,178419.857143
56,Maserati,68158.082846,8,49048.333333,98643.181818
72,Porsche,45929.649604,8,34198.052632,65499.26087
9,BMW,42296.080287,21,17766.4375,106407.857143
49,Lexus,41436.220212,9,22720.833333,74540.0



📊 PRICE RANGE ANALYSIS:
Top 10 manufacturers with highest price diversity:


Unnamed: 0,Automaker,price_min,price_max,price_range,price_range_pct
65,Nissan,6870.0,80470.0,73600.0,371.744151
8,Audi,12572.0,118345.0,105773.0,316.405691
37,Hyundai,4830.0,49530.0,44700.0,271.147013
95,Vauxhall,5850.0,54325.0,48475.0,263.427827
63,Mitsubishi,7256.0,50000.0,42744.0,261.239947
9,BMW,13650.0,115050.0,101400.0,239.738527
60,Mercedes-Benz,12545.0,109805.0,97260.0,236.354881
96,Volkswagen,6370.0,54860.0,48490.0,234.650289
93,Toyota,6632.0,47907.0,41275.0,232.429654
19,Citroen,5830.0,39660.0,33830.0,220.804046
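
The "price diversity" metric is the brand-wide price range expressed as a percentage of the brand's mean price. The sketch below reproduces the calculation on toy data (values are illustrative); note that `price_mean` here is an unweighted mean of per-model means, so models with many price entries count the same as models with few.

```python
import pandas as pd

# Toy per-model price summary (illustrative values only).
price_summary = pd.DataFrame({
    "Automaker": ["X", "X", "Y", "Y"],
    "price_min": [10000.0, 30000.0, 50000.0, 55000.0],
    "price_max": [15000.0, 50000.0, 60000.0, 65000.0],
    "price_mean": [12000.0, 40000.0, 55000.0, 60000.0],
})

by_maker = price_summary.groupby("Automaker").agg(
    price_min=("price_min", "min"),     # cheapest entry across the brand
    price_max=("price_max", "max"),     # most expensive entry across the brand
    price_mean=("price_mean", "mean"),  # unweighted mean of model means
)
by_maker["price_range"] = by_maker["price_max"] - by_maker["price_min"]
by_maker["price_range_pct"] = by_maker["price_range"] / by_maker["price_mean"] * 100
print(by_maker)
```

A brand with one cheap city car and one expensive flagship scores high on this metric even if most of its lineup sits in a narrow band, which is worth keeping in mind when reading the ranking.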


## 3. Sales Analysis: What are the best-selling models?

Now we use the sales summary function to identify market leaders.


In [14]:
# --- Top 20 Models by Total Sales ---
print("📈 SALES ANALYSIS")
print("="*50)

# Get sales summary
sales_summary = analyzer.get_sales_summary()
sales_with_data = sales_summary.dropna(subset=['total_sales'])
print(f"Models with sales data: {len(sales_with_data)}")

# Top 20 models by total sales
top_20_sales = sales_with_data.nlargest(20, 'total_sales')
print(f"\nTop 20 models by total sales:")
display(top_20_sales[['Automaker', 'Genmodel', 'total_sales', 'avg_sales', 'max_sales']])

# Create visualization
fig_sales = px.bar(
    top_20_sales,
    x='Genmodel',
    y='total_sales',
    color='Automaker',
    title='Top 20 Models by Total Sales',
    labels={'Genmodel': 'Model', 'total_sales': 'Total Sales'},
    hover_data=['avg_sales', 'max_sales']
)

fig_sales.update_layout(
    xaxis_tickangle=-45,
    height=600,
    showlegend=True
)

fig_sales.show()

# Analysis by manufacturer
print(f"\n🏭 SALES BY MANUFACTURER:")
sales_by_maker = sales_with_data.groupby('Automaker').agg({
    'total_sales': 'sum',
    'Genmodel_ID': 'count',
    'avg_sales': 'mean'
}).reset_index()

sales_by_maker.columns = ['Automaker', 'total_sales', 'model_count', 'avg_sales_per_model']
sales_by_maker = sales_by_maker.sort_values('total_sales', ascending=False).head(15)

print("Top 15 manufacturers by total sales:")
display(sales_by_maker)

# Create manufacturer sales chart
fig_maker_sales = px.bar(
    sales_by_maker,
    x='Automaker',
    y='total_sales',
    title='Top 15 Manufacturers by Total Sales',
    labels={'Automaker': 'Manufacturer', 'total_sales': 'Total Sales'},
    color='total_sales',
    color_continuous_scale='viridis',
    text='model_count'
)

fig_maker_sales.update_traces(
    texttemplate='%{text} models',
    textposition='outside'
)

fig_maker_sales.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=500
)

fig_maker_sales.show()

# Sales statistics
print(f"\n📊 SALES STATISTICS:")
print(f"  Best-selling model: {top_20_sales.iloc[0]['Automaker']} {top_20_sales.iloc[0]['Genmodel']} ({top_20_sales.iloc[0]['total_sales']:,} sales)")
print(f"  Leading manufacturer: {sales_by_maker.iloc[0]['Automaker']} ({sales_by_maker.iloc[0]['total_sales']:,} total sales)")
print(f"  Average sales per model: {sales_with_data['total_sales'].mean():.0f}")
print(f"  Median sales per model: {sales_with_data['total_sales'].median():.0f}")
print(f"  Top 5 models account for: {(top_20_sales.head(5)['total_sales'].sum() / sales_with_data['total_sales'].sum() * 100):.1f}% of sales")


📈 SALES ANALYSIS
Models with sales data: 734

Top 20 models by total sales:


Unnamed: 0,Automaker,Genmodel,total_sales,avg_sales,max_sales
325,Ford,Fiesta,1505740.0,75287.0,125619.0
326,Ford,Focus,1166989.0,58349.45,77782.0
928,Vauxhall,Corsa,1043713.0,52185.65,86840.0
962,Volkswagen,Golf,1018866.0,50943.3,70714.0
924,Vauxhall,Astra,816382.0,40819.1,65907.0
969,Volkswagen,Polo,662837.0,33141.85,52347.0
528,MINI,Hatch,618637.0,10310.616667,38958.0
677,Nissan,Qashqai,555468.0,27773.4,58132.0
101,BMW,3 Series,547991.0,27399.55,38244.0
917,Toyota,Yaris,474954.0,23747.7,31025.0



🏭 SALES BY MANUFACTURER:
Top 15 manufacturers by total sales:


Unnamed: 0,Automaker,total_sales,model_count,avg_sales_per_model
22,Ford,4065448.0,32,6141.8375
69,Vauxhall,3120997.0,25,6241.994
70,Volkswagen,2787149.0,26,5154.128365
7,BMW,1895259.0,26,3644.728846
6,Audi,1768509.0,36,2435.177778
42,Mercedes-Benz,1579690.0,31,2547.887097
46,Nissan,1529444.0,27,2830.198148
68,Toyota,1509601.0,28,2695.716071
50,Peugeot,1363372.0,29,2320.944828
23,Honda,1001884.0,16,2790.634375



📊 SALES STATISTICS:
  Best-selling model: Ford Fiesta (1,505,740.0 sales)
  Leading manufacturer: Ford (4,065,448.0 total sales)
  Average sales per model: 42973
  Median sales per model: 4874
  Top 5 models account for: 17.6% of sales
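
The sales table can hold more than one row per `Genmodel_ID`. The Abarth 124 Spider output earlier lists two, which appears to be why that model reports 40 `years_with_data`. One way to handle this, sketched under the assumption that rows sharing an ID should be summed year by year before computing per-model statistics:

```python
import pandas as pd

# Two sales rows that share one Genmodel_ID (values copied from the
# Abarth 124 Spider output; only four of the year columns are kept here).
sales = pd.DataFrame({
    "Genmodel": ["ABARTH 124", "ABARTH SPIDER"],
    "Genmodel_ID": ["2_1", "2_1"],
    "2019": [19, 223],
    "2018": [27, 777],
    "2017": [60, 409],
    "2016": [0, 176],
})
year_cols = [c for c in sales.columns if c.isdigit()]

# Collapse to one row per model by summing each year column.
per_model = sales.groupby("Genmodel_ID")[year_cols].sum()

# Yearly statistics are then computed on one series per model.
stats = pd.DataFrame({
    "total_sales": per_model.sum(axis=1),
    "avg_sales": per_model.mean(axis=1),
    "max_sales": per_model.max(axis=1),
})
print(stats)
```

Totals are unaffected by this step, but averages and maxima then describe the model as a whole rather than individual source rows.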


## 4. Trim and Configuration Analysis

We explore the diversity of configurations (trims) offered by each manufacturer.


In [15]:
# --- Trim and Configuration Analysis ---
print("🏷️ TRIM AND CONFIGURATION ANALYSIS")
print("="*50)

# Get trim summary by model
trim_summary = analyzer.get_trim_summary_by_model()
trim_with_data = trim_summary.dropna(subset=['trim_count'])
print(f"Models with trim data: {len(trim_with_data)}")

# Top 15 models with the most trim variations
top_15_trims = trim_with_data.nlargest(15, 'trim_count')
print(f"\nTop 15 models with the most trim variations:")
display(top_15_trims[['Automaker', 'Genmodel', 'trim_count', 'trim_price_min', 'trim_price_max']])

# Create visualization
fig_trims = px.bar(
    top_15_trims,
    x='Genmodel',
    y='trim_count',
    color='Automaker',
    title='Top 15 Models by Number of Trim Variations',
    labels={'Genmodel': 'Model', 'trim_count': 'Number of Trims'},
    hover_data=['trim_price_min', 'trim_price_max']
)

fig_trims.update_layout(
    xaxis_tickangle=-45,
    height=500,
    showlegend=True
)

fig_trims.show()

# Analysis by manufacturer
print(f"\n🏭 TRIMS BY MANUFACTURER:")
trims_by_maker = trim_with_data.groupby('Automaker').agg({
    'trim_count': ['sum', 'mean', 'count'],
    'trim_price_mean': 'mean'
}).reset_index()

# Flatten column names
trims_by_maker.columns = ['Automaker', 'total_trims', 'avg_trims_per_model', 'model_count', 'avg_price']
trims_by_maker = trims_by_maker.sort_values('total_trims', ascending=False).head(15)

print("Top 15 manufacturers by total number of trims:")
display(trims_by_maker)

# Create chart of trims by manufacturer
fig_maker_trims = px.bar(
    trims_by_maker,
    x='Automaker',
    y='total_trims',
    title='Top 15 Manufacturers by Total Number of Trims',
    labels={'Automaker': 'Manufacturer', 'total_trims': 'Total Trims'},
    color='total_trims',
    color_continuous_scale='viridis',
    text='model_count'
)

fig_maker_trims.update_traces(
    texttemplate='%{text} models',
    textposition='outside'
)

fig_maker_trims.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=500
)

fig_maker_trims.show()

# Trim statistics
print(f"\n📊 TRIM STATISTICS:")
print(f"  Model with the most trims: {top_15_trims.iloc[0]['Automaker']} {top_15_trims.iloc[0]['Genmodel']} ({top_15_trims.iloc[0]['trim_count']} trims)")
print(f"  Manufacturer with the most trims: {trims_by_maker.iloc[0]['Automaker']} ({trims_by_maker.iloc[0]['total_trims']} total trims)")
print(f"  Average trims per model: {trim_with_data['trim_count'].mean():.1f}")
print(f"  Median trims per model: {trim_with_data['trim_count'].median():.1f}")
print(f"  Total trim records in the dataset: {len(analyzer.trim)}")


🏷️ TRIM AND CONFIGURATION ANALYSIS
Models with trim data: 647

Top 15 models with the most trim variations:


Unnamed: 0,Automaker,Genmodel,trim_count,trim_price_min,trim_price_max
101,BMW,3 Series,12289.0,13650.0,85130.0
586,Mercedes-Benz,C Class,9861.0,17600.0,97710.0
924,Vauxhall,Astra,7490.0,9430.0,29945.0
53,Audi,A4,7060.0,16015.0,70920.0
51,Audi,A3,6602.0,14475.0,44820.0
97,BMW,1 Series,5832.0,15507.0,40175.0
339,Ford,Mondeo,5815.0,12810.0,35440.0
932,Vauxhall,Insignia,5801.0,15376.0,36525.0
326,Ford,Focus,5673.0,9830.0,39040.0
594,Mercedes-Benz,E Class,5378.0,23345.0,108325.0



🏭 TRIMS BY MANUFACTURER:
Top 15 manufacturers by total number of trims:


Unnamed: 0,Automaker,total_trims,avg_trims_per_model,model_count,avg_price
4,BMW,31002.0,1476.285714,21,49390.538778
57,Vauxhall,30791.0,1282.958333,24,21894.959543
37,Mercedes-Benz,28461.0,1138.44,25,52327.327254
3,Audi,26133.0,1742.2,15,41871.156988
18,Ford,20137.0,839.041667,24,20791.878107
59,Volvo,18225.0,1139.0625,16,30594.698706
58,Volkswagen,17669.0,736.208333,24,25482.115173
44,Renault,16371.0,963.0,17,18975.962397
32,MINI,15255.0,1695.0,9,22982.999028
56,Toyota,14265.0,648.409091,22,21809.285725



📊 TRIM STATISTICS:
  Model with the most trims: BMW 3 Series (12289.0 trims)
  Manufacturer with the most trims: BMW (31002.0 total trims)
  Average trims per model: 518.6
  Median trims per model: 153.0
  Total trim records in the dataset: 335562
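
The dataset total of 335,562 equals the number of rows in the trim table, so `trim_count` is most likely a count of trim records, where the same trim name repeats across model years. If distinct trim names are of interest, a sketch of the difference on toy rows (the shortened trim names are illustrative):

```python
import pandas as pd

# Toy trim rows: the same trim name repeats across model years.
trim = pd.DataFrame({
    "Genmodel_ID": ["2_1"] * 5,
    "Trim": ["1.4 Turbo 2d", "1.4 Turbo Sequenziale", "1.4 Turbo 2d",
             "1.4 Turbo Sequenziale", "Scorpione 1.4 Turbo 2d"],
    "Year": [2016, 2016, 2017, 2017, 2017],
})

counts = trim.groupby("Genmodel_ID").agg(
    trim_rows=("Trim", "size"),        # every (trim, year) record
    unique_trims=("Trim", "nunique"),  # distinct trim names only
)
print(counts)
```

Long-running models accumulate many records simply by being on sale for more years, so the two columns can rank models quite differently.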


## 5. Fuel and Technology Analysis

We explore the distribution of fuel types and technologies in the market.


In [16]:
# --- Fuel and Technology Analysis ---
print("⛽ FUEL AND TECHNOLOGY ANALYSIS")
print("="*50)

# Fuel type analysis
fuel_analysis = analyzer.trim.groupby('Fuel_type').agg({
    'Genmodel_ID': 'nunique',
    'Price': 'mean',
    'Engine_size': 'mean'
}).reset_index()

fuel_analysis.columns = ['fuel_type', 'unique_models', 'avg_price', 'avg_engine_size']
fuel_analysis = fuel_analysis.sort_values('unique_models', ascending=False)

print("Fuel type distribution:")
display(fuel_analysis)

# Create fuel type chart
fig_fuel = px.pie(
    fuel_analysis,
    values='unique_models',
    names='fuel_type',
    title='Distribution of Models by Fuel Type',
    color_discrete_sequence=px.colors.qualitative.Set3
)

fig_fuel.update_traces(
    textposition='inside',
    textinfo='percent+label'
)

fig_fuel.show()

# Price analysis by fuel type
fig_fuel_price = px.bar(
    fuel_analysis,
    x='fuel_type',
    y='avg_price',
    title='Average Price by Fuel Type',
    labels={'fuel_type': 'Fuel Type', 'avg_price': 'Average Price (€)'},
    color='avg_price',
    color_continuous_scale='viridis'
)

fig_fuel_price.update_layout(
    xaxis_tickangle=-45,
    showlegend=False,
    height=500
)

fig_fuel_price.show()

# Analysis by manufacturer and fuel type
print(f"\n🏭 FUEL TYPES BY MANUFACTURER:")
fuel_by_maker = analyzer.trim.groupby(['Automaker', 'Fuel_type']).size().reset_index(name='trim_count')
fuel_by_maker_pivot = fuel_by_maker.pivot(index='Automaker', columns='Fuel_type', values='trim_count').fillna(0)

# Top 10 manufacturers by total trim count
top_makers = fuel_by_maker_pivot.sum(axis=1).nlargest(10).index
fuel_by_maker_top = fuel_by_maker_pivot.loc[top_makers]

print("Top 10 manufacturers by fuel type distribution:")
display(fuel_by_maker_top)

# Create heatmap of fuel types by manufacturer
fig_heatmap = px.imshow(
    fuel_by_maker_top,
    title='Fuel Type Distribution by Manufacturer (Top 10)',
    labels=dict(x="Fuel Type", y="Manufacturer", color="Number of Trims"),
    color_continuous_scale='viridis'
)

fig_heatmap.update_layout(height=500)
fig_heatmap.show()

# Fuel statistics
print(f"\n📊 FUEL STATISTICS:")
print(f"  Most common fuel type: {fuel_analysis.iloc[0]['fuel_type']} ({fuel_analysis.iloc[0]['unique_models']} models)")
print(f"  Most expensive fuel type: {fuel_analysis.loc[fuel_analysis['avg_price'].idxmax(), 'fuel_type']} (€{fuel_analysis['avg_price'].max():,.0f})")
print(f"  Total fuel types: {len(fuel_analysis)}")
print(f"  Mean of average prices across fuel types: €{fuel_analysis['avg_price'].mean():,.0f}")

# Engine size analysis
print(f"\n🔧 ENGINE SIZE ANALYSIS:")
engine_stats = analyzer.trim.groupby('Automaker')['Engine_size'].agg(['mean', 'min', 'max', 'count']).reset_index()
engine_stats = engine_stats[engine_stats['count'] >= 5]  # At least 5 trim records
engine_stats = engine_stats.sort_values('mean', ascending=False).head(15)

print("Top 15 manufacturers by average engine size:")
display(engine_stats)


⛽ FUEL AND TECHNOLOGY ANALYSIS
Fuel type distribution:


Unnamed: 0,fuel_type,unique_models,avg_price,avg_engine_size
3,Petrol,606,24823.460108,1943.430765
0,Diesel,423,25788.461761,1987.275287
2,Other,112,32613.917749,1943.42449
1,Electric Diesel REX,10,41974.444444,1498.0



🏭 FUEL TYPES BY MANUFACTURER:
Top 10 manufacturers by fuel type distribution:


Automaker,Diesel,Electric Diesel REX,Other,Petrol
Bmw,15670.0,0.0,338.0,14985.0
Vauxhall,12812.0,0.0,972.0,17007.0
Mercedes-benz,13632.0,0.0,557.0,14258.0
Audi,12382.0,0.0,49.0,13702.0
Ford,10000.0,0.0,320.0,9817.0
Volvo,10497.0,0.0,597.0,7131.0
Volkswagen,9272.0,2.0,69.0,8326.0
Renault,7679.0,0.0,84.0,8608.0
Mini,5740.0,0.0,41.0,9474.0
Toyota,3985.0,1.0,1601.0,8678.0



📊 FUEL STATISTICS:
  Most common fuel type: Petrol (606 models)
  Most expensive fuel type: Electric Diesel REX (€41,974)
  Total fuel types: 4
  Mean of average prices across fuel types: €31,300

🔧 ENGINE SIZE ANALYSIS:
Top 15 manufacturers by average engine size:


Unnamed: 0,Automaker,mean,min,max,count
51,Rolls-royce,6566.849315,5379,6751,146
11,Corvette,6086.516129,5967,6162,62
36,Maybach,5694.081633,5513,5980,49
29,Lamborghini,5583.472727,4961,6498,220
5,Bentley,5561.581114,3993,6761,413
2,Aston martin,5378.736264,3239,5935,546
18,Ferrari,4611.493671,3496,6496,237
22,Hummer,4489.333333,3653,6162,54
64,Tvr,4065.814815,3605,4475,54
35,Maserati,3822.738035,2979,4691,397
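
The fuel heatmap plots absolute trim counts, so manufacturers with large catalogs dominate the color scale. Row-normalizing the pivot gives each manufacturer's fuel mix as percentages instead. A short sketch using two rows from the fuel-by-manufacturer table above:

```python
import pandas as pd

# Trim counts per fuel type for two manufacturers (taken from the table above).
counts = pd.DataFrame(
    {"Diesel": [15670, 3985], "Electric Diesel REX": [0, 1],
     "Other": [338, 1601], "Petrol": [14985, 8678]},
    index=pd.Index(["Bmw", "Toyota"], name="Automaker"),
)

# Divide each row by its total so every manufacturer sums to 100%.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

Passing `shares` rather than the raw counts to `px.imshow` would make the mixes directly comparable across brands of different sizes.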


## 6. Temporal Analysis and Market Evolution

We analyze temporal trends in prices, sales, and model launches.


In [17]:
# --- Temporal Analysis and Market Evolution ---
print("📅 TEMPORAL ANALYSIS AND MARKET EVOLUTION")
print("="*50)

# Analysis of years in the trim data
year_analysis = analyzer.trim.groupby('Year').agg({
    'Genmodel_ID': 'nunique',
    'Price': 'mean',
    'Engine_size': 'mean'
}).reset_index()

year_analysis.columns = ['year', 'unique_models', 'avg_price', 'avg_engine_size']
year_analysis = year_analysis.sort_values('year')

print("Annual market evolution:")
display(year_analysis)

# Create temporal evolution chart
fig_years = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Models per Year', 'Average Price per Year', 'Average Engine Size', 'Combined Evolution'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": True}]]
)

# Models per year
fig_years.add_trace(
    go.Scatter(x=year_analysis['year'], y=year_analysis['unique_models'],
               mode='lines+markers', name='Models', line=dict(color='blue')),
    row=1, col=1
)

# Average price per year
fig_years.add_trace(
    go.Scatter(x=year_analysis['year'], y=year_analysis['avg_price'],
               mode='lines+markers', name='Price', line=dict(color='red')),
    row=1, col=2
)

# Average engine size per year
fig_years.add_trace(
    go.Scatter(x=year_analysis['year'], y=year_analysis['avg_engine_size'],
               mode='lines+markers', name='Engine', line=dict(color='green')),
    row=2, col=1
)

# Combined evolution (min-max normalized to the 0-1 range)
year_analysis_norm = year_analysis.copy()
for col in ['unique_models', 'avg_price', 'avg_engine_size']:
    year_analysis_norm[col] = (year_analysis_norm[col] - year_analysis_norm[col].min()) / (year_analysis_norm[col].max() - year_analysis_norm[col].min())

fig_years.add_trace(
    go.Scatter(x=year_analysis_norm['year'], y=year_analysis_norm['unique_models'],
               mode='lines+markers', name='Models (norm)', line=dict(color='blue')),
    row=2, col=2
)

fig_years.add_trace(
    go.Scatter(x=year_analysis_norm['year'], y=year_analysis_norm['avg_price'],
               mode='lines+markers', name='Price (norm)', line=dict(color='red')),
    row=2, col=2
)

fig_years.add_trace(
    go.Scatter(x=year_analysis_norm['year'], y=year_analysis_norm['avg_engine_size'],
               mode='lines+markers', name='Engine (norm)', line=dict(color='green')),
    row=2, col=2
)

fig_years.update_layout(
    title_text="Temporal Evolution of the Automotive Market",
    showlegend=True,
    height=800
)

fig_years.show()

# Sales analysis by year (using the sales data)
print(f"\n📈 SALES ANALYSIS BY YEAR:")

# Reshape sales data from wide to long format for temporal analysis
year_columns = [col for col in analyzer.sales.columns if col.isdigit()]
sales_long = pd.melt(
    analyzer.sales,
    id_vars=['Automaker', 'Genmodel', 'Genmodel_ID'],
    value_vars=year_columns,
    var_name='Year',
    value_name='Sales_Volume'
)

sales_long['Year'] = sales_long['Year'].astype(int)
sales_by_year = sales_long.groupby('Year')['Sales_Volume'].sum().reset_index()

print("Total sales by year:")
display(sales_by_year)

# Create sales-by-year chart
fig_sales_year = px.line(
    sales_by_year,
    x='Year',
    y='Sales_Volume',
    title='Evolution of Total Sales by Year',
    labels={'Year': 'Year', 'Sales_Volume': 'Total Sales'},
    markers=True
)

fig_sales_year.update_layout(
    height=400,
    showlegend=False
)

fig_sales_year.show()

# Launch analysis by year
# Note: this counts the models listed in each year, so it matches `unique_models` above
print(f"\n🚗 LAUNCH ANALYSIS BY YEAR:")
launch_analysis = analyzer.trim.groupby('Year')['Genmodel_ID'].nunique().reset_index()
launch_analysis.columns = ['year', 'new_models']

print("New models launched per year:")
display(launch_analysis)

# Create launch chart
fig_launch = px.bar(
    launch_analysis,
    x='year',
    y='new_models',
    title='New Models Launched per Year',
    labels={'year': 'Year', 'new_models': 'New Models'},
    color='new_models',
    color_continuous_scale='viridis'
)

fig_launch.update_layout(
    height=400,
    showlegend=False
)

fig_launch.show()

# Temporal statistics
print(f"\n📊 TEMPORAL STATISTICS:")
print(f"  Year with the most models: {year_analysis.loc[year_analysis['unique_models'].idxmax(), 'year']} ({year_analysis['unique_models'].max()} models)")
print(f"  Year with the highest prices: {year_analysis.loc[year_analysis['avg_price'].idxmax(), 'year']} (€{year_analysis['avg_price'].max():,.0f})")
print(f"  Year with the most sales: {sales_by_year.loc[sales_by_year['Sales_Volume'].idxmax(), 'Year']} ({sales_by_year['Sales_Volume'].max():,} sales)")
print(f"  Year with the most launches: {launch_analysis.loc[launch_analysis['new_models'].idxmax(), 'year']} ({launch_analysis['new_models'].max()} models)")
print(f"  Year range analyzed: {year_analysis['year'].min()} - {year_analysis['year'].max()}")


📅 TEMPORAL ANALYSIS AND MARKET EVOLUTION
Annual market evolution:


Unnamed: 0,year,unique_models,avg_price,avg_engine_size
0,1998,167,19606.667315,2012.286235
1,1999,188,19616.93161,2008.653541
2,2000,215,19942.398273,2039.007602
3,2001,242,20004.284321,2067.637878
4,2002,248,19944.819539,2053.874297
5,2003,262,19906.270409,2040.28856
6,2004,276,20978.174795,2076.944575
7,2005,281,21345.599324,2076.888889
8,2006,284,22240.105016,2092.974089
9,2007,295,22839.089947,2104.668169



📈 SALES ANALYSIS BY YEAR:
Total sales by year:

| | Year | Sales_Volume |
|---|---|---|
| 0 | 2001 | 269887 |
| 1 | 2002 | 436447 |
| 2 | 2003 | 622347 |
| 3 | 2004 | 818247 |
| 4 | 2005 | 1024944 |
| 5 | 2006 | 1256941 |
| 6 | 2007 | 1527848 |
| 7 | 2008 | 1500885 |
| 8 | 2009 | 1578106 |
| 9 | 2010 | 1674012 |



🚗 LAUNCH ANALYSIS BY YEAR:
New models launched per year:

| | year | new_models |
|---|---|---|
| 0 | 1998 | 167 |
| 1 | 1999 | 188 |
| 2 | 2000 | 215 |
| 3 | 2001 | 242 |
| 4 | 2002 | 248 |
| 5 | 2003 | 262 |
| 6 | 2004 | 276 |
| 7 | 2005 | 281 |
| 8 | 2006 | 284 |
| 9 | 2007 | 295 |



📊 TEMPORAL STATISTICS:
  Year with most models: 2016 (347 models)
  Year with highest average price: 2021 (€54,573)
  Year with most sales: 2016 (2,476,613 sales)
  Year with most launches: 2016 (347 models)
  Range of years analyzed: 1998 - 2021
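
The launch table above has the same values as `unique_models` because both count the distinct `Genmodel_ID` values listed in each year. One possible way to approximate launches is to count each model only in the first year it appears. The sketch below shows this on a small, hypothetical stand-in for `analyzer.trim`; the toy rows are invented and the first recorded year is only a proxy for a real launch date.

```python
import pandas as pd

# Hypothetical toy stand-in for analyzer.trim (only the two columns used here)
trim = pd.DataFrame({
    "Genmodel_ID": ["1_1", "1_1", "1_2", "1_2", "2_1", "2_1", "2_1"],
    "Year":        [1998,  1999,  1999,  2000,  1998,  1999,  2000],
})

# Distinct models with at least one trim listed in each year
models_per_year = (
    trim.groupby("Year")["Genmodel_ID"].nunique().rename("models_listed")
)

# Earliest recorded year per model, then how many models start in each year
first_year = trim.groupby("Genmodel_ID")["Year"].min()
first_appearances = (
    first_year.value_counts().sort_index().rename("first_appearances")
)

# Side-by-side comparison; years with no first appearance get 0
comparison = (
    pd.concat([models_per_year, first_appearances], axis=1)
    .fillna(0)
    .astype(int)
)
print(comparison)
```

In real data the earliest year in the table is inflated under this definition, since every model already on sale counts as a first appearance there.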


## 7. Correlation and Relationship Analysis

We explore the model-level relationships between price, sales, number of trims, and the range of model years.
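
When per-model tables are combined with `how='left'`, models without a match keep `NaN` in the joined columns. `DataFrame.corr` excludes missing values pair by pair, so each coefficient can rest on a different number of models. The sketch below illustrates this with invented toy tables (the IDs and values are hypothetical) and counts the observations behind each pair.

```python
import pandas as pd

# Hypothetical per-model tables; only three of the five models have sales data
prices = pd.DataFrame({
    "Genmodel_ID": ["a", "b", "c", "d", "e"],
    "price_mean": [10.0, 20.0, 30.0, 40.0, 50.0],
})
sales = pd.DataFrame({
    "Genmodel_ID": ["a", "b", "c"],
    "total_sales": [300, 200, 100],
})

# Left merge keeps all five models; unmatched ones get NaN in total_sales
merged = prices.merge(sales, on="Genmodel_ID", how="left")

cols = ["price_mean", "total_sales"]
corr = merged[cols].corr()  # NaN rows are dropped pairwise

# Number of models with both values present, for every pair of columns
present = merged[cols].notna().astype(int)
pair_counts = present.T @ present

print(corr)
print(pair_counts)
```

Reporting `pair_counts` next to a correlation matrix makes it easier to judge how much data supports each coefficient.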


In [18]:
# --- Correlation and Relationship Analysis ---
print("🔗 CORRELATION AND RELATIONSHIP ANALYSIS")
print("="*50)

# Create a combined dataset for correlation analysis
combined_data = analyzer.get_price_range_by_model()
sales_data = analyzer.get_sales_summary()
trim_data = analyzer.get_trim_summary_by_model()

# Merge datasets
analysis_df = combined_data.merge(sales_data[['Genmodel_ID', 'total_sales', 'avg_sales']], on='Genmodel_ID', how='left')
analysis_df = analysis_df.merge(trim_data[['Genmodel_ID', 'trim_count', 'year_min', 'year_max', 'most_common_fuel']], on='Genmodel_ID', how='left')

# Select numeric columns for correlation
numeric_cols = ['price_mean', 'price_min', 'price_max', 'price_entries', 'total_sales', 'avg_sales', 'trim_count', 'year_min', 'year_max']
corr_data = analysis_df[numeric_cols].corr()

print("Correlation matrix:")
display(corr_data)

# Create correlation heatmap
fig_corr = px.imshow(
    corr_data,
    title='Correlation Matrix between Key Variables',
    labels=dict(color="Correlation"),
    color_continuous_scale='RdBu_r',
    zmin=-1, zmax=1,
    aspect="auto"
)

fig_corr.update_layout(height=600)
fig_corr.show()

# Price vs sales relationship analysis
print(f"\n💰📈 PRICE VS SALES RELATIONSHIP:")
price_vs_sales = analysis_df.dropna(subset=['price_mean', 'total_sales'])
print(f"Models with price and sales data: {len(price_vs_sales)}")

fig_scatter = px.scatter(
    price_vs_sales,
    x='price_mean',
    y='total_sales',
    color='Automaker',
    size='trim_count',
    hover_data=['Genmodel', 'price_mean', 'total_sales', 'trim_count'],
    title='Relationship between Average Price and Total Sales',
    labels={'price_mean': 'Average Price (€)', 'total_sales': 'Total Sales'},
    opacity=0.6
)

fig_scatter.update_layout(height=600)
fig_scatter.show()

# Price vs trim variations relationship analysis
print(f"\n💰🏷️ PRICE VS TRIM VARIATIONS RELATIONSHIP:")
price_vs_trim = analysis_df.dropna(subset=['price_mean', 'trim_count'])

fig_scatter_trim = px.scatter(
    price_vs_trim,
    x='trim_count',
    y='price_mean',
    color='Automaker',
    size='price_entries',
    hover_data=['Genmodel', 'price_mean', 'trim_count'],
    title='Relationship between Number of Trims and Average Price',
    labels={'trim_count': 'Number of Trims', 'price_mean': 'Average Price (€)'},
    opacity=0.6
)

fig_scatter_trim.update_layout(height=600)
fig_scatter_trim.show()

# Correlation statistics
print(f"\n📊 CORRELATION STATISTICS:")
print(f"  Price vs sales correlation: {corr_data.loc['price_mean', 'total_sales']:.3f}")
print(f"  Price vs trims correlation: {corr_data.loc['price_mean', 'trim_count']:.3f}")
print(f"  Sales vs trims correlation: {corr_data.loc['total_sales', 'trim_count']:.3f}")
print(f"  Price min vs max correlation: {corr_data.loc['price_min', 'price_max']:.3f}")


🔗 CORRELATION AND RELATIONSHIP ANALYSIS
Correlation matrix:


| | price_mean | price_min | price_max | price_entries | total_sales | avg_sales | trim_count | year_min | year_max |
|---|---|---|---|---|---|---|---|---|---|
| price_mean | 1.0 | 0.997746 | 0.994833 | -0.104471 | -0.139064 | -0.13656 | -0.13663 | 0.212673 | 0.123669 |
| price_min | 0.997746 | 1.0 | 0.986994 | -0.141769 | -0.148677 | -0.146205 | -0.147657 | 0.23466 | 0.110326 |
| price_max | 0.994833 | 0.986994 | 1.0 | -0.053694 | -0.125693 | -0.123193 | -0.118429 | 0.184612 | 0.144094 |
| price_entries | -0.104471 | -0.141769 | -0.053694 | 1.0 | 0.451649 | 0.443055 | 0.513093 | -0.549656 | 0.377698 |
| total_sales | -0.139064 | -0.148677 | -0.125693 | 0.451649 | 1.0 | 0.989895 | 0.667117 | -0.162289 | 0.266554 |
| avg_sales | -0.13656 | -0.146205 | -0.123193 | 0.443055 | 0.989895 | 1.0 | 0.650386 | -0.158602 | 0.262477 |
| trim_count | -0.13663 | -0.147657 | -0.118429 | 0.513093 | 0.667117 | 0.650386 | 1.0 | -0.250629 | 0.213904 |
| year_min | 0.212673 | 0.23466 | 0.184612 | -0.549656 | -0.162289 | -0.158602 | -0.250629 | 1.0 | 0.543981 |
| year_max | 0.123669 | 0.110326 | 0.144094 | 0.377698 | 0.266554 | 0.262477 | 0.213904 | 0.543981 | 1.0 |



💰📈 PRICE VS SALES RELATIONSHIP:
Models with price and sales data: 585



💰🏷️ PRICE VS TRIM VARIATIONS RELATIONSHIP:



📊 CORRELATION STATISTICS:
  Price vs sales correlation: -0.139
  Price vs trims correlation: -0.137
  Sales vs trims correlation: 0.667
  Price min vs max correlation: 0.987
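
The coefficients printed above are Pearson correlations, the default of `DataFrame.corr`, which measure linear association only. A common robustness check is the rank-based Spearman coefficient, available through `method="spearman"`. The sketch below uses synthetic data, not the notebook's `analysis_df`; the column names are reused only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: sales fall monotonically but non-linearly with price
price = np.arange(1, 11, dtype=float) * 10_000
toy = pd.DataFrame({
    "price_mean": price,
    "total_sales": 1e15 / price**3,
})

# Pearson captures linear association; Spearman captures monotonic association
pearson = toy.corr(method="pearson").loc["price_mean", "total_sales"]
spearman = toy.corr(method="spearman").loc["price_mean", "total_sales"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If the two coefficients differ noticeably on the real data, the relationship may be monotonic but non-linear, or driven by a few extreme models.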
