# Machine Learning Zoomcamp - Homework 01

This notebook contains the Homework 01 assignment for ML Zoomcamp: loading a car fuel-efficiency dataset, exploring it with pandas, and solving a small linear-algebra exercise with NumPy.

## 1. Import Required Libraries

Import the libraries used in this notebook: pandas, numpy, matplotlib, and seaborn.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 2. Load and Explore Data

Load the dataset and perform initial exploration including shape, info, and basic statistics.

In [2]:
pd.__version__

'2.3.2'

In [3]:
import urllib.request
import os

# Download the dataset
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv"
filename = "car_fuel_efficiency.csv"

print(f"Downloading {filename}...")
urllib.request.urlretrieve(url, filename)

# Verify the download
if os.path.exists(filename):
    file_size = os.path.getsize(filename)
    print("✅ Download successful!")
    print(f"File: {filename}")
    print(f"Size: {file_size} bytes")
else:
    print("❌ Download failed!")

Downloading car_fuel_efficiency.csv...
✅ Download successful!
File: car_fuel_efficiency.csv
Size: 874188 bytes


In [10]:
# Load the dataset
df = pd.read_csv(filename)

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

Dataset loaded successfully!
Dataset shape: (9704, 11)

First few rows:


| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [11]:
# Basic data exploration
print("Dataset Info:")
print(df.info())
print("\n" + "="*50 + "\n")

print("Basic Statistics:")
print(df.describe())
print("\n" + "="*50 + "\n")

print("Missing Values:")
print(df.isnull().sum())
print("\n" + "="*50 + "\n")

print("Data Types:")
print(df.dtypes)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB
None


Basic Statistics:
       engine_displacement  num_cylinders   horsepower  vehicle_weight  \
count          9704.000000    9222.000000  8996.000000     9704.00
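
The output above is cut off before the missing-value counts appear. As a minimal sketch of how such counts can be summarized, the toy DataFrame below (hypothetical values; only the column names mirror the dataset) reports the number and share of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "horsepower": [159.0, 97.0, np.nan, 140.0],
    "num_cylinders": [3.0, np.nan, np.nan, 1.0],
    "model_year": [2003, 2007, 2018, 2009],
})

# Count missing entries per column and express them as a share of all rows.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})

# Keep only the columns that actually have gaps, largest first.
missing = missing[missing["missing"] > 0].sort_values("missing", ascending=False)
print(missing)
```

Applied to the real `df`, the same two expressions would cover the four columns that `df.info()` reports with fewer than 9704 non-null entries.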

In [12]:
print("Fuel Type Analysis:")
print("="*50)

# Value counts
print("Value counts:")
print(df['fuel_type'].value_counts())

print("\nValue counts (with percentages):")
print(df['fuel_type'].value_counts(normalize=True) * 100)

print("\nUnique values:")
print(df['fuel_type'].unique())

print(f"\nNumber of unique fuel types: {df['fuel_type'].nunique()}")

# Check for missing values
print(f"\nMissing values in fuel_type: {df['fuel_type'].isnull().sum()}")


Fuel Type Analysis:
Value counts:
fuel_type
Gasoline    4898
Diesel      4806
Name: count, dtype: int64

Value counts (with percentages):
fuel_type
Gasoline    50.474031
Diesel      49.525969
Name: proportion, dtype: float64

Unique values:
['Gasoline' 'Diesel']

Number of unique fuel types: 2

Missing values in fuel_type: 0


In [13]:
# Find maximum fuel efficiency of cars from Asia
print("Maximum Fuel Efficiency Analysis:")
print("="*50)

# First, let's see what regions are available
print("Available regions:")
print(df['origin'].unique())
print()

# Filter cars from Asia
asia_cars = df[df['origin'] == 'Asia']
print(f"Number of cars from Asia: {len(asia_cars)}")

# Find maximum fuel efficiency for Asian cars
max_efficiency_asia = asia_cars['fuel_efficiency_mpg'].max()
print(f"Maximum fuel efficiency of cars from Asia: {max_efficiency_asia}")

# Get details of the car(s) with maximum efficiency from Asia
max_efficiency_cars = asia_cars[asia_cars['fuel_efficiency_mpg'] == max_efficiency_asia]
print(f"\nCars with maximum efficiency from Asia:")
print(max_efficiency_cars[['fuel_efficiency_mpg', 'fuel_type']])

# Compare with other regions
print(f"\nFuel efficiency comparison by region:")
region_efficiency = df.groupby('origin')['fuel_efficiency_mpg'].agg(['min', 'max', 'mean']).round(2)
print(region_efficiency)

Maximum Fuel Efficiency Analysis:
Available regions:
['Europe' 'USA' 'Asia']

Number of cars from Asia: 3247
Maximum fuel efficiency of cars from Asia: 23.759122836520497

Cars with maximum efficiency from Asia:
      fuel_efficiency_mpg fuel_type
9387            23.759123  Gasoline

Fuel efficiency comparison by region:
         min    max   mean
origin                    
Asia    6.89  23.76  14.97
Europe  6.20  25.97  14.94
USA     6.70  24.97  15.04
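
A more compact way to locate the most efficient car in every region at once is `groupby(...).idxmax()`, which returns the row label of each group's maximum. The sketch below runs on a toy frame with hypothetical values, not on the real dataset.

```python
import pandas as pd

# Toy data with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    "origin": ["Asia", "Asia", "Europe", "USA", "USA"],
    "fuel_type": ["Gasoline", "Diesel", "Diesel", "Gasoline", "Diesel"],
    "fuel_efficiency_mpg": [14.2, 23.7, 25.9, 12.1, 24.9],
})

# Row label of the most efficient car within each region.
best_idx = toy.groupby("origin")["fuel_efficiency_mpg"].idxmax()

# Look up the full rows for those labels and index them by region.
best_cars = toy.loc[best_idx].set_index("origin")
print(best_cars)
```

This avoids filtering each region separately, at the cost of returning only one row per region when several cars tie for the maximum.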


In [14]:
# Median value of horsepower
# Find the median value of horsepower column in the dataset.
# Next, calculate the most frequent value of the same horsepower column.
# Use fillna method to fill the missing values in horsepower column with the most frequent value from the previous step.
# Now, calculate the median value of horsepower once again.
print("horsepower Analysis:")
print("="*50)

# Optional: inspect the distinct horsepower values
# print("Available horsepower:")
# print(df['horsepower'].unique())
# print()

# Find the median value of horsepower
median_horsepower = df['horsepower'].median()
print(f"Median horsepower: {median_horsepower}")

# Find the most frequent value (mode) of horsepower
most_frequent_horsepower = df['horsepower'].mode()[0]
print(f"Most frequent horsepower: {most_frequent_horsepower}")

# Fill missing values in horsepower with the most frequent value.
# Assign the result back instead of calling fillna(..., inplace=True) on the
# selected column, which pandas flags as chained assignment.
df['horsepower'] = df['horsepower'].fillna(most_frequent_horsepower)

# Calculate the median value of horsepower again
new_median_horsepower = df['horsepower'].median()
print(f"New median horsepower (after filling missing values): {new_median_horsepower}")

horsepower Analysis:
Median horsepower: 149.0
Most frequent horsepower: 152.0
New median horsepower (after filling missing values): 152.0
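
Filling gaps with the mode can move the median, as the change from 149.0 to 152.0 shows. The sketch below reproduces that effect on a toy column (hypothetical values), using plain assignment rather than an in-place call on a selected column.

```python
import numpy as np
import pandas as pd

# Toy column with hypothetical values and three gaps.
toy = pd.DataFrame(
    {"horsepower": [100.0, 120.0, 150.0, 150.0, np.nan, np.nan, np.nan]}
)

# median() skips NaN, so this uses only the four known values.
median_before = toy["horsepower"].median()

# mode() returns a Series; [0] takes the first (smallest) most frequent value.
mode_value = toy["horsepower"].mode()[0]

# Assign the filled column back to the DataFrame.
toy["horsepower"] = toy["horsepower"].fillna(mode_value)
median_after = toy["horsepower"].median()

print(median_before, mode_value, median_after)
```

Every filled gap adds one more copy of the mode, so the median is pulled toward it whenever the mode lies on one side of the original median.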


In [None]:
# Sum of weights
print("Sum of weights Analysis:")
print("="*50)

# Select all the cars from Asia
asia_cars = df[df['origin'] == 'Asia']
# Select only the columns vehicle_weight and model_year
asia_cars_selected = asia_cars[['vehicle_weight', 'model_year']]
# Select the first 7 rows
asia_cars_first7 = asia_cars_selected.head(7)
# Get the underlying NumPy array X
X = asia_cars_first7.to_numpy()
print(X)

# Multiply the transpose of X by X to get XTX
XTX = X.T @ X
print("XTX")
print(XTX)

# Invert XTX
XTX_inv = np.linalg.inv(XTX)
print("XTX_inv")
print(XTX_inv)

# Create the target array y
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Multiply the inverse of XTX by the transpose of X, then by y, to get w
w = XTX_inv @ X.T @ y
print("w")
print(w)

# Sum all the elements of w
sum_w = w.sum()
print(f"Sum of all elements in w: {sum_w}")

# Repository: https://github.com/Sergjalo/ml-zoomcamp.git

Sum of weights Analysis:
[[2714.21930965 2016.        ]
 [2783.86897424 2010.        ]
 [3582.68736772 2007.        ]
 [2231.8081416  2011.        ]
 [2659.43145076 2016.        ]
 [2844.22753389 2014.        ]
 [3761.99403819 2019.        ]]
XTX
[[62248334.33150761 41431216.5073268 ]
 [41431216.5073268  28373339.        ]]
XTX_inv
[[ 5.71497081e-07 -8.34509443e-07]
 [-8.34509443e-07  1.25380877e-06]]
w
[0.01386421 0.5049067 ]
Sum of all elements in w: 0.5187709081074006
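
The steps above compute the normal-equation solution w = (XᵀX)⁻¹Xᵀy with an explicit inverse. As a hedged cross-check, the sketch below rebuilds `X` from the seven rows printed in the output and compares that result with two alternatives that avoid forming the inverse: `np.linalg.solve` and `np.linalg.lstsq`. The variable names `w_inv`, `w_solve`, and `w_lstsq` are introduced here for illustration only.

```python
import numpy as np

# The seven rows of X, copied from the printed output above.
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Normal equation with an explicit inverse, as in the cell above.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same linear system solved without forming the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv, w_solve, w_lstsq)
print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))
```

For a small, well-conditioned matrix like this one, all three should agree with the printed `w` to within floating-point tolerance. On larger or nearly singular problems, `solve` or `lstsq` is generally the safer choice than an explicit inverse.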
