# Dataset Analysis

> Our dataset lists all bike models that were available on Bikez.com four years ago.
> We found this dataset on Kaggle, at https://www.kaggle.com/datasets/victormegir/bikes-from-bikezcom?select=urls.csv


### Library Imports and Setup

This cell loads all the required libraries for data analysis, visualization, and preprocessing:

- `pandas`, `numpy`: for data manipulation and numerical operations.
- `json`, `Path`, `re`, `sys`: for parsing, filesystem navigation, and general utilities.
- `plotly.express`, `plotly.graph_objects`, `make_subplots`: to create interactive plots and custom visualizations.
- `statsmodels.api`: to perform linear regression and statistical modeling.
- `plotly.io`: for configuring plot display settings.
- `cpi`: a library for adjusting prices for inflation, installed automatically in the Jupyter kernel via `sys`.


In [3]:
import pandas as pd
import json
from pathlib import Path
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import statsmodels.api as sm
import plotly.io as pio
import re


import sys
# Inflation-adjusted prices
!{sys.executable} -m pip install cpi
import cpi

print("✅ Import successful")


✅ Import successful


### Load and Normalize Raw Motorcycle Data from JSON

We load the raw motorcycle data from the JSON file downloaded from Kaggle. We started with the CSV version, but quickly noticed inconsistencies in nearly every column.

In this cell we load the data from the JSON file, normalize it, and turn it into a pandas DataFrame.




In [4]:
JSON_PATH = Path("data/raw.json")

# 1. Open JSON
with open(JSON_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

# 2. Accept a list of records, or a dict whose values are the records
if isinstance(data, dict):
    data = list(data.values())
elif not isinstance(data, list):
    raise ValueError("The JSON file seems to be empty or not a list of records.")

# 3. Convert to DataFrame, normalizing
df = pd.json_normalize(data)

# 4. Remove empty columns, or columns with all NaN values
df.columns = [col.strip() for col in df.columns]
df = df.loc[:, ~df.columns.str.fullmatch(r"Unnamed:.*|^$")]
df = df.dropna(axis=1, how="all")
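
To illustrate what the loading cell does, here is a minimal sketch on made-up records. The record layout (a dict keyed by an id, with a nested `Engine` field) is an assumption for illustration only and is not taken from `raw.json`.

```python
import pandas as pd

# Hypothetical records: a dict keyed by an id, with one nested field
raw = {
    "bike-1": {"Model": "Honda CB 500", "Year": 1994, "Engine": {"Displacement": "499.0 ccm"}},
    "bike-2": {"Model": "Ducati Monster 600", "Year": 1995, "Engine": {"Displacement": "583.0 ccm"}},
}

# Same rule as the cell above: a dict is reduced to the list of its values
records = list(raw.values()) if isinstance(raw, dict) else raw

# json_normalize flattens nested dicts into dotted column names
toy = pd.json_normalize(records)
toy_columns = sorted(toy.columns)
print(toy_columns)
```

Nested keys end up as columns such as `Engine.Displacement`, which is why a flat DataFrame comes out even when the records are not flat.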

### General info about the dataset

The dataset contains bikes going all the way back to 1894. Sadly, the data for these older models is very limited: we only have the model name and a few other columns.

Newer models come with more and more data, so most of our analyses focus on recent motorcycles.

In [5]:
print(f"The dataframe has {df.shape[1]} columns.")
print(f"The dataframe has {df.shape[0]} rows.")

print(f"The oldest motorcycle is from {df['Year'].min()}")
print(f"The newest motorcycle is from {df['Year'].max()}.")

# display(df.columns)

The dataframe has 77 columns.
The dataframe has 38624 rows.
The oldest motorcycle is from 1894
The newest motorcycle is from 2021.


### Define useful functions

We declare several helper functions for our analysis. They range from a simple conversion to float, used on columns that don't have a specific type, all the way to a function that adjusts the prices of our bikes taking inflation into account.

In [6]:
# Convert to numeric
def to_float(txt, unit=None):
    if pd.isna(txt): return None
    m = re.search(r"[-+]?\d*\.?\d+", str(txt))
    if not m: return None
    num = float(m.group())
    if unit=="in_to_mm":   num *= 25.4
    if unit=="lbft_to_Nm": num *= 1.35582
    return num

# We use this function to make the fonts bigger, as well as changing the background color
def set_theme(fig, font_size=20, bg_color="#141415", font_color="white", grid_color="#333333"):
    fig.update_layout(
        paper_bgcolor=bg_color,
        plot_bgcolor=bg_color,
        font=dict(size=font_size, color=font_color),
        xaxis=dict(gridcolor=grid_color, zerolinecolor=grid_color),
        yaxis=dict(gridcolor=grid_color, zerolinecolor=grid_color),
        legend=dict(bgcolor="rgba(0,0,0,0)")  # Transparent legend background
    )
    return fig

# Convert Price to numeric
def parse_price(value):
    if pd.isna(value):
        return np.nan
    
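    # Search for currency and amount
    # Example: "Euro 9990,00" or "US$ 9990.00"
    match = re.search(r'(Euro|US\$)\s*([\d,\.]+)', str(value))
    if not match:
        return np.nan
    
    currency, amount_str = match.groups()
    
    # Drop trailing punctuation, then treat a final separator followed by
    # exactly two digits as the decimal mark ("9990,00" -> 9990.00);
    # every other comma or period is a thousands separator
    amount_str = amount_str.rstrip(',.')
    decimals = re.search(r'[,\.](\d{2})$', amount_str)
    whole = amount_str[:decimals.start()] if decimals else amount_str
    whole = whole.replace(',', '').replace('.', '')
    if not whole:
        return np.nan
    amount = float(whole + '.' + (decimals.group(1) if decimals else '0'))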
    
    # Handle different currencies, we use a fixed conversion rate for simplicity
    if currency == 'Euro':
        return amount * 1.1 
    elif currency == 'US$':
        return amount
    else:
        return np.nan
    
# Extract the numeric part of the Power column
def extract_hp(value):
    match = re.search(r'\d+', str(value))
    return int(match.group()) if match else None
  
# Extract the float part of the rating column
def extract_rating(text):
    if isinstance(text, str):
        match = re.search(r"([0-9]+(?:\.[0-9]+)?)", text)
        if match:
            return float(match.group(1))
    return np.nan

# cpi returns the inflation-adjusted price; its data starts in 1913
def adjust_price_for_inflation(row):
    try:
        price = float(row["Price"])
        year = int(row["Year"])
        return cpi.inflate(price, year, to=2021)
    except Exception:
        # Missing values or years that cpi cannot handle yield NaN
        return np.nan



def show_col(col):
    # Returns a little sample of a column, but ignores NaNs
    return df[df[col].notna()].sample(3)[col]

def_col = ["#987434"]
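
As a quick check of how the number-extraction pattern in `to_float` behaves, here is a small sketch. `first_number` is a hypothetical stand-in that reuses the same regular expression, and two of the sample strings are invented for illustration.

```python
import re

# Same pattern as in to_float: optional sign, optional integer part, optional dot, digits
NUMBER = re.compile(r"[-+]?\d*\.?\d+")

def first_number(text):
    """Return the first number found in text as a float, or None (hypothetical helper)."""
    m = NUMBER.search(str(text))
    return float(m.group()) if m else None

a = first_number("649.0 ccm (39.60 cubic inches)")   # made-up sample
b = first_number("3.5  See the detailed rating of touring capab...")
c = first_number("Carburettor")
d = first_number("1,200 mm")                          # made-up sample
print(a, b, c, d)
```

Only the first number is kept, so a value written with a thousands separator such as `"1,200 mm"` is read as `1.0`. Columns that may use that notation need their separators stripped before conversion.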

### Particular columns to analyze

Since we have a large number of columns, we print a sample of some of them to see whether they can be useful.

In [7]:
# Useless columns, to drop

print("\nUSELESS COLUMNS\n\n")

display(show_col("Insurance costs"))

display(show_col("Ask questions"))

display(show_col("Rating"))

display(show_col("Comments"))

display(show_col("Modifications compared to previous model"))

df.drop(columns=["Insurance costs", "Ask questions", "Comments"], inplace=True)



# Interesting columns, perhaps to analyze
print("\nUSEFUL COLUMNS\n\n")

display(show_col("Greenhouse gases"))
# injection or carburettor
display(show_col("Fuel system"))

display(show_col("Model"))



USELESS COLUMNS




9228     Compare US insurance quotes from the nation's ...
17553    Compare US insurance quotes from the nation's ...
4710     Compare US insurance quotes from the nation's ...
Name: Insurance costs, dtype: object

12698    Join the 13 KTM 450 Rally Replica discussion g...
30209    Join the 90 Aprilia Pegaso 125 discussion grou...
37433    Join the 48 Ariel 4G Square Four 1000 discussi...
Name: Ask questions, dtype: object

7228     Do you know this bike?Click here to rate it. W...
6717      3.5  See the detailed rating of touring capab...
14764    Do you know this bike?Click here to rate it. W...
Name: Rating, dtype: object

9522                                  Bike made in India.
8340                                        Sold in Asia.
286     7´ Touchscreen. Windscreen. Cruise control. He...
Name: Comments, dtype: object

17355           New dual-mode Dynamic Power Steering (DPS)
13458                  New, high-grip, durable tyre design
32092    1st year production of two years\r\n\r\n11.79 ...
Name: Modifications compared to previous model, dtype: object


USEFUL COLUMNS




12880     62.6 CO2 g/km. (CO2 - Carbon dioxide emission) 
4867      99.8 CO2 g/km. (CO2 - Carbon dioxide emission) 
5486     104.9 CO2 g/km. (CO2 - Carbon dioxide emission) 
Name: Greenhouse gases, dtype: object

52       Injection. Electronic intake pipe fuel injecti...
2459                                        Carburettor. E
34884                                          Carburettor
Name: Fuel system, dtype: object

25316         Honda GL 1800 Gold Wing
26275        Moto Guzzi California EV
28016    Yamaha XVZ 13 TFL MM Limited
Name: Model, dtype: object

### Dataset cleaning and initialization

Before creating our charts we have to convert some columns to float, remove some extreme outliers, and create some new columns.

In [8]:

# Convert certain columns to numeric
df["Displacement_cc"]   = df["Displacement"].apply(to_float)
df["Torque_Nm"]         = df["Torque"].apply(lambda x: to_float(x, unit="lbft_to_Nm") or to_float(x))
df["Seat_height_mm"]    = df["Seat height"].apply(to_float)
df["Fuel_capacity_l"]   = df["Fuel capacity"].apply(to_float)
df["Year"]   = df["Year"].apply(to_float)

# Take the first word of each model name as the manufacturer
# (multi-word brands such as "Moto Guzzi" are truncated to "Moto")
df["Manufacturer"] = df["Model"].apply(lambda x: re.sub(r"\s.*", "", x) if isinstance(x, str) else x)

df["Price"] = df["Price as new"].apply(parse_price)
df["HP"] = df["Power"].apply(extract_hp)

df["Rating"] = df["Rating"].apply(extract_rating)

# palette & template
TEMPLATE = "presentation"
pio.templates.default = TEMPLATE

# Keep only motorcycles with a known HP below 500
# (rows with a missing HP are dropped as well, since NaN < 500 is False)
df = df[df["HP"] < 500]



# Adjust price for inflation

# Keep only valid rows (the CPI data starts in 1913)
mask = (df["Year"] >= 1913) & (df["Price"].notna())
df_valid = df.loc[mask].copy()

# Compute the inflation factor only once per year
unique_years = df_valid["Year"].unique()
inflation_factors = {
    year: cpi.inflate(1, int(year), to=2021) for year in unique_years
}

# Apply the factor in a vectorized way
df_valid["Price_adj"] = df_valid["Price"] * df_valid["Year"].map(inflation_factors)

# Assign back to the original dataframe (only where it makes sense)
df.loc[mask, "Price_adj"] = df_valid["Price_adj"]

# Show a random sample of 3 rows with the adjusted prices
print("\nADJUSTED PRICES\n\n")
display(show_col(["Year", "Price", "Price_adj"]))


ADJUSTED PRICES




         Year    Price     Price_adj
24887  2006.0   1550.0   2083.350694
18196  2010.0  71500.0  88850.364127
8170   2016.0      NaN           NaN
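
The per-year factor approach can be sketched without the `cpi` package. The multipliers below are invented placeholders standing in for `cpi.inflate(1, year, to=2021)`; they are not real CPI values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Year":  [2006.0, 2010.0, 2016.0, 1900.0],
    "Price": [1550.0, 71500.0, np.nan, 300.0],
})

# Hypothetical multipliers, one per year (placeholders, not real CPI data)
factors = {2006.0: 1.34, 2010.0: 1.24, 2016.0: 1.13}

# Only rows with a price and a year covered by the index get adjusted
mask = (toy["Year"] >= 1913) & toy["Price"].notna()

# One dictionary lookup per row via map, then a vectorized multiplication
toy.loc[mask, "Price_adj"] = toy.loc[mask, "Price"] * toy.loc[mask, "Year"].map(factors)
print(toy)
```

Rows outside the mask keep `NaN` in `Price_adj`, which matches the missing values in the sample printed above.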


### Analysis of duplicates
In 2020 we saw an incredible spike in the number of bike models, so we decided to analyze our dataset and search for duplicates.

We found a large number of duplicates spanning every year, with a concentration of them in 2020.

To find them we defined a set of key columns that determine whether one bike model is identical to another.

In [None]:

emission_cols = [
    "Emission details",        
    "Greenhouse gases",        
    "Fuel consumption"         
]

# Every column that is related to performance
performance_cols = [
    "Power",
    "Torque",
    "Weight incl. oil, gas, etc",
    "Dry weight",              
    "Power/weight ratio",
    "Top speed",
]

mechanical_cols = [
    "Engine type",
    "Displacement",
    "Fuel control", 
    "Clutch", 
    "Exhaust system"
]

# Group everything so we can easily drop duplicates
important_cols = [
    "Manufacturer", "Model", "Category", "Year"
] + emission_cols + performance_cols + mechanical_cols


dups = df.duplicated(subset=["Manufacturer", "Model", "Category", "Year"], keep=False)

# Only select duplicates found on more general columns
df_dups = df[dups]

# Drop the duplicates, now based on the important columns instead of the general ones
df_unique_tech = df_dups.drop_duplicates(subset=important_cols)

print(f"Total duplicates: {df_dups.shape[0]}")
print(f"Unique by technical specs: {df_unique_tech.shape[0]}")

print(f"Original rows: {df.shape[0]}")
df = df.drop_duplicates(subset=important_cols, keep="first")

print(f"Rows after deduplication: {df.shape[0]}")





Total duplicates: 4615
Unique by technical specs: 2289
Original rows: 26258
Rows after deduplication: 23932
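
The two-step logic (flag candidates on general columns, then deduplicate on the wider set of key columns) can be sketched on a toy frame. The rows and the reduced column lists below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Model": ["A 125", "A 125", "A 125", "B 600"],
    "Year":  [2020, 2020, 2020, 2019],
    "Power": ["15 HP", "15 HP", "11 HP", "60 HP"],
})

general = ["Model", "Year"]        # identifies a model entry
technical = general + ["Power"]    # adds a technical spec to the key

# keep=False marks every member of a duplicated group, not only the repeats
candidates = toy[toy.duplicated(subset=general, keep=False)]

# Rows that also agree on the technical columns collapse to their first occurrence
deduped = toy.drop_duplicates(subset=technical, keep="first")

print(len(candidates), len(deduped))
```

The two "A 125" rows with identical power collapse into one, while the 11 HP variant survives because it differs on a technical column.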


## Charts
 > Let's finally see some charts

### Models by category
 > Let's see the most common type of motorcycle, leaving out those that appear less in the dataset


In [10]:
# Count the models in each category
df_grouped = df.groupby("Category").size().reset_index(name="count")


# Keep categories with at least 1500 models; sum the rest into a single total
df_top_types = df_grouped[df_grouped["count"] >= 1500].sort_values(by="count", ascending=False)
other_types = df_grouped.loc[df_grouped["count"] < 1500, "count"].sum()

# Rebuild df_grouped as the top types plus "Other" (reused by the next chart)
df_grouped = pd.concat([df_top_types, pd.DataFrame([{'Category': 'Other', 'count': other_types}])], ignore_index=True)

# Same content, used for the pie chart
df_pie = pd.concat(
    [df_top_types, pd.DataFrame([{"Category": "Other", "count": other_types}])],
    ignore_index=True
)

# Fix the category order: top types by descending count, then "Other"
ordered_categories = list(df_top_types["Category"]) + ["Other"]
df_pie["Category"] = pd.Categorical(
    df_pie["Category"], categories=ordered_categories, ordered=True
)
gray_seq = ["gray"] * 3 + ["white"]
color_seq = px.colors.qualitative.Plotly[:4] + gray_seq  # white for Other


# Create the pie chart
fig1 = px.pie(
    df_pie,
    names="Category",
    values="count",
    title="<b>Models by Type of bike</b>",
    hole=0.35,
    color_discrete_sequence=color_seq,
    category_orders={"Category": ordered_categories},
)

# Rotate the pie chart, make it counterclockwise, don't automatically sort
fig1.update_traces(
    direction="counterclockwise",
    sort=False,
    rotation=-61
)


set_theme(fig1).show()
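
The "Other" bucket used for the pie chart can be sketched in isolation. The category counts below are invented; only the 1500 threshold comes from the cell above.

```python
import pandas as pd

# Made-up counts per category
counts = pd.Series(
    {"Sport": 5000, "Scooter": 4000, "Naked": 1600, "Trial": 300, "ATV": 200},
    name="count",
)
threshold = 1500

# Large categories keep their own slice, sorted by size
top = counts[counts >= threshold].sort_values(ascending=False)

# Everything below the threshold is summed into one slice
other_total = counts[counts < threshold].sum()

pie = pd.concat([top, pd.Series({"Other": other_total}, name="count")])
print(pie)
```

The total is preserved, so the pie still represents every model while the long tail of rare categories no longer clutters the legend.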

### How many models were created each year
 > Let's see for each category of bike, how many bikes were created

In [11]:
# Take the 7 largest rows of df_grouped; the synthetic "Other" row, if among them,
# matches no real category and drops out in the filter below
top6 = df_grouped.sort_values("count", ascending=False).head(7)["Category"].tolist()

# Filter the original dataframe down to those categories
df_top6 = df[df["Category"].isin(top6)]

# Keep only rows with 1980 <= Year < 2021
df_top6 = df_top6[df_top6["Year"] >= 1980]
df_top6 = df_top6[df_top6["Year"] < 2021]

# Group by year and category, count the bikes
year_cat_counts = (
    df_top6.groupby(["Year", "Category"])
    .size()
    .reset_index(name="count")
)

# Line chart
fig = px.line(
    year_cat_counts,
    x="Year",
    y="count",
    color="Category",
    title="<b>Models by Year</b>",
    markers=True,
    color_discrete_sequence=color_seq,
    category_orders={"Category": ordered_categories},
)
set_theme(fig).show()

# From 757,000 units in 1980 to 217,000 in 1993
# From 2020, the Euro 5 regulations become mandatory for every new model
#  To avoid losing type approvals, manufacturers register a burst of Euro 4 "final edition" versions
#  But some also get Euro 5 approval in the same year, ending up with two versions

dup_cols = ["Manufacturer","Model","Category","Year", "Greenhouse gases"]
dups = df.duplicated(subset=dup_cols, keep=False)


#### What happened in 1993 and '94? And, more importantly, what happened in 2020: is it an artifact of our dataset?
 > The duplicates analysis shows that our dataset did contain errors (duplicated rows), but even with those removed, 2020 still has a large number of new models

### Counts of bikes per cc
 > Let's figure out how many bikes were produced per cc category

In [12]:
bins   = [0, 125, 400, 700, 1000, 2000, np.inf]
labels = ["0–125 (A1)", "125–400 (A-lim)", "400–700 (A)", 
          "700–1000 (A)", "1000–2000 (A)", "2000+ (A)"]

# A new column with the categories in which every model falls
df["Disp_cat"] = pd.cut(df["Displacement_cc"], bins=bins, labels=labels, right=True)

# For each category, we calculate the count and the mode            
count_df  = (df["Disp_cat"]
             .value_counts()
             .reindex(labels)
             .rename("Count")
             .reset_index()
             .rename(columns={"Disp_cat": "Displacement Range (cc)"}))

# Calculate the mode for each category
mode_df = (
    df.groupby("Disp_cat", observed=False)["Displacement_cc"]
    .agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    .reindex(labels)
    .reset_index()
    .rename(columns={
        "Disp_cat": "Displacement Range (cc)",
        "Displacement_cc": "mode_cc"
    })
)

# Merge the two DataFrames
disp_summary = count_df.merge(mode_df, on="Displacement Range (cc)")
# Display the mode in a more readable format
disp_summary["mode_text"] = "Mode: " + disp_summary["mode_cc"].round(0).astype(int).astype(str)


fig2 = px.bar(
    disp_summary,
    x="Displacement Range (cc)",
    y="Count",
    #This will show the mode on the bar
    text="mode_text",

    title="<b>Count of Models by CC</b>",
    color_discrete_sequence=def_col,
    height=650
)

fig2.update_xaxes(categoryorder="array", categoryarray=labels)

fig2.update_traces(
    texttemplate="%{text}", 
    textposition="inside",
    insidetextanchor="middle", 
    textfont=dict(color="white", size=20)
)



set_theme(fig2).show()
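
Because the license classes depend on which side of a boundary a displacement falls, it helps to see how `pd.cut` treats the bin edges. A minimal sketch with made-up values and simplified labels:

```python
import pandas as pd

cc = pd.Series([125, 126, 400, 2000, 2500])
bins = [0, 125, 400, 2000, float("inf")]       # n edges
labels = ["0-125", "125-400", "400-2000", "2000+"]  # n - 1 labels

# right=True: intervals are (a, b], so 125 falls in the first bin
right_closed = pd.cut(cc, bins=bins, labels=labels, right=True)

# right=False: intervals are [a, b), so 125 falls in the second bin
left_closed = pd.cut(cc, bins=bins, labels=labels, right=False)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
```

With `right=True`, as used in the cell above, a 125 cc bike lands in the "0–125" class. The number of labels must always be one less than the number of edges.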


### Price in relation to year, displacement, and horsepower

In [13]:

# Keep only the rows that have every value this scatter needs
plot_df = df.dropna(subset=["Year", "Displacement_cc", "HP", "Price_adj"])

plot_df = plot_df[plot_df["Year"] > 1990]
plot_df = plot_df[plot_df["Displacement_cc"] < 3000]


fig3 = px.scatter(
    plot_df,
    x="Year",
    y="Displacement_cc",
    size="Price_adj",
    color="HP",
    hover_name="Model", # when hovering, show the model, really cute
    color_continuous_scale="YlOrRd",
    range_color=[0, 200],
    opacity=0.6,
    height=700,
    title="Motorcycle Prices by Displacement and Year",
    labels={
        "Year": "Year",
        "Displacement_cc": "Displacement (cc)",
        "Price_adj": "Price (scaled size)",
        "HP": "Horsepower"
    }
)
# Remove the borders from every point
fig3.update_traces(marker=dict(line=dict(width=0)))

# Fixed y-axis ticks at reference displacements, meant to mark the A1, A-lim, A2 and A license classes

fig3.update_yaxes(
    tickvals=[125, 250, 600, 1000, 2000],
    title="Displacement (cc)"
)

# X axis with fixed ticks (optional)
fig3.update_xaxes(
    tickvals=[1980, 1990, 2000, 2010, 2015, 2020],
    title="Year"
)

fig3.update_layout(
    coloraxis_colorbar=dict(title="Horsepower (HP)"),
)
set_theme(fig3)
fig3.show()


# report prices adjusted for inflation

# scatter plot of CC vs HP,
# maybe drop the time axis and focus more on price


### Average displacement (cc) per year

In [14]:

# Average displacement per year
df_disp_avg = (
    df.dropna(subset=["Year", "Displacement_cc"])
      .query("Displacement_cc < 3000")
      .query("Year > 1920")
      .groupby("Year")["Displacement_cc"]
      .mean()
      .reset_index()
)

x = df_disp_avg["Year"].to_numpy()
y = df_disp_avg["Displacement_cc"].to_numpy()
b, a = np.polyfit(x, y, deg=1)  

df_disp_avg["reg_line"] = a + b * x

fig_cc = px.bar(
    df_disp_avg,
    x="Year",
    y="Displacement_cc",
    title="Average Engine Displacement by Year",
    labels={
        "Year": "Year",
        "Displacement_cc": "Avg Displacement (cc)"
    },
    color_discrete_sequence=def_col,
    height=600
)

fig_cc.add_scatter(
    x=df_disp_avg["Year"],
    y=df_disp_avg["reg_line"],
    mode="lines",
    name="Linear Trend",
    line=dict(color="white", width=3, dash="dash"),
    showlegend=False
)

set_theme(fig_cc).show()
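
The dashed trend line comes from a degree-1 least-squares fit. A minimal sketch on made-up yearly averages shows the order in which `np.polyfit` returns its coefficients:

```python
import numpy as np

# Made-up yearly averages that grow by exactly 10 cc per year
years = np.array([2000, 2001, 2002, 2003], dtype=float)
avg_cc = np.array([500.0, 510.0, 520.0, 530.0])

# polyfit returns the highest-degree coefficient first: slope, then intercept
slope, intercept = np.polyfit(years, avg_cc, deg=1)

# Evaluate the fitted line at the same x values
trend = intercept + slope * years
print(round(slope, 3))
```

The slope reads directly as "cc gained per year", which is the quantity the chart above is meant to convey.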



### Price

In [15]:
df_filtered = df.dropna(subset=["Category", "HP", "Price_adj"])

df_filtered = df_filtered[~df_filtered["Category"].str.contains("Prototype")]
df_filtered = df_filtered[~df_filtered["Category"].str.contains("ATV")]

# Mean inflation-adjusted price and mean HP per category
df_grouped = df_filtered.groupby("Category", as_index=False).agg({
    "Price_adj": "mean",
    "HP": "mean"
})
fig = px.scatter(
    df_grouped,
    x="HP", y="Price_adj",
    color="Category",
    hover_name="Category",
    title="<b>Price vs. horsepower (mean per category)</b><br>",
    height=600
)

fig.update_traces(textposition='top center')
set_theme(fig).show()
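
Per-category means can be computed by passing a column-to-function dictionary to `agg`. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: two sport bikes and one scooter
toy = pd.DataFrame({
    "Category":  ["Sport", "Sport", "Scooter"],
    "HP":        [100, 200, 10],
    "Price_adj": [10000.0, 20000.0, 3000.0],
})

# One output row per category; as_index=False keeps Category as a column
per_category = toy.groupby("Category", as_index=False).agg(
    {"Price_adj": "mean", "HP": "mean"}
)
print(per_category)
```

Keeping `Category` as a regular column lets the result go straight into `px.scatter` with `color="Category"`.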

### Electric bikes: how many models, and how fast are they?

In [16]:
if "Top_speed_kmh" not in df.columns:
    df["Top_speed_kmh"] = df["Top speed"].str.extract(r"([\d\.]+)").astype(float)

# 2. Select the electric models that have a known top speed
elec_df = df[df["Engine type"] == "Electric"].dropna(subset=["Top_speed_kmh"])

# The bin edges must line up with the labels (one more edge than labels)
bins   = [0, 50, 100, 150, np.inf]
labels = ["0-50", "50-100", "100-150",">150"]

elec_df["speed_band"] = pd.cut(
    elec_df["Top_speed_kmh"],
    bins=bins,
    labels=labels,
    right=False
)

band_counts = (elec_df["speed_band"]
                 .value_counts()
                 .reindex(labels)
                 .rename("count")
                 .reset_index())

# 4. Bar chart
fig_band = px.bar(
    band_counts,
    x="speed_band",
    y="count",
    color_discrete_sequence=["#26a69a"],
    title="<b>Electric vehicles - top speed by band (km/h)</b>",
    height=500
)

fig_band.update_xaxes(title="Top speed (km/h)")
fig_band.update_yaxes(title="Number of models")
set_theme(fig_band).show()

### HP related to top speed

In [17]:
if "Top_speed_kmh" not in df.columns:
    df["Top_speed_kmh"] = df["Top speed"].str.extract(r"([\d\.]+)").astype(float)

filt_df = df.query("HP >= 1 and HP <= 300 and Top_speed_kmh <= 400")

fig_ts_lim = px.scatter(
    filt_df,
    x="HP",
    y="Top_speed_kmh",
    color="Category",
    hover_data=["Model", "Year"],
    title="<b>Power vs. top speed<br><sup>(HP 1-300, top speed ≤ 400 km/h)</sup></b>",
    height=600,
    opacity=0.4
)

fig_ts_lim.update_xaxes(title="Power (HP)")
fig_ts_lim.update_yaxes(title="Top speed (km/h)")

set_theme(fig_ts_lim).show()
