This notebook cleans and explores an automobile dataset.
It removes unnecessary columns, handles missing values and duplicates, converts data types, and prepares the data for analysis.

The dataset provides insight into the types of vehicles, their manufacturers, and how price, horsepower, engine size and fuel efficiency (measured in miles per gallon, MPG) affect each other.

Examining how these variables interact makes it possible to analyse what drives the cost of a vehicle.

In [None]:
# Import libraries

import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt


In [None]:
# Load the automobile dataset

df = pd.read_csv("automobile.txt")
df.head(10)

### Data Cleansing
#### Clean the data

Identify columns that are redundant or unnecessary. Remove the columns `['normalized-losses', 'symboling']` from the dataset, as they will not be used in the analysis.

In [None]:
# Drop the redundant columns
# (drop() returns a new DataFrame; with inplace=True it would return None)

df_cleaned = df.drop(columns=['normalized-losses', 'symboling'])
print(df_cleaned.head())

#### Remove any duplicate rows

In [None]:
# Remove duplicate rows and keep the result in df_cleaned

df_cleaned = df_cleaned.drop_duplicates()
print(df_cleaned)
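
Counting the duplicates before dropping them makes the effect of this step visible. The sketch below uses a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the automobile dataset.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "bmw", "volvo"],
    "body-style": ["sedan", "sedan", "sedan", "wagon"],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame; the index is reset for readability.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed, {len(deduplicated)} rows remain")
```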

#### Remove rows with missing data

Some automobiles in the dataset have missing values, meaning that some of their information was not recorded. Discard such entries from the dataframe.

In [None]:
# Replace '?' with NaN and drop rows with missing values
# (the result is reassigned, because calls with inplace=True return None)

df_cleaned = df_cleaned.replace('?', np.nan).dropna()
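
As a minimal sketch of this step, the example below replaces a `'?'` placeholder with `NaN` and then drops the incomplete row. The frame `toy` and its values are invented for illustration. pandas methods called with `inplace=True` return `None`, so the sketch chains the calls and assigns the result instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using '?' as the missing-value placeholder.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": ["13950", "?", "16845"],
})

# replace() and dropna() each return a new frame, so they can be chained.
toy_clean = toy.replace("?", np.nan).dropna()

print(toy_clean)
```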

Convert the columns that hold numerical data to an integer data type using NumPy's `int64` dtype.

In [None]:
# Convert the MPG columns to numeric, then cast them to int64
# (rows with missing values were dropped earlier, so the integer cast should not meet NaN)

df_cleaned["city-mpg"] = pd.to_numeric(df_cleaned["city-mpg"]).astype(np.int64)
df_cleaned["highway-mpg"] = pd.to_numeric(df_cleaned["highway-mpg"]).astype(np.int64)
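
The conversion can be checked on a small example. This sketch assumes the values arrive as text, which is common when a column contained placeholders such as `'?'`. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: numbers stored as strings.
toy = pd.DataFrame({
    "city-mpg": ["21", "19", "24"],
    "highway-mpg": ["27", "26", "30"],
})

# Parse the text as numbers, then cast explicitly to NumPy's int64 dtype.
for column in ["city-mpg", "highway-mpg"]:
    toy[column] = pd.to_numeric(toy[column]).astype(np.int64)

print(toy.dtypes)
```

The cast to `int64` raises an error if any `NaN` remains, so missing rows need to be dropped first.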

### Finding Certain Categories
Locate all automobiles with the "hatchback" body style.

In [None]:
# Create a dataframe of hatchbacks

hatchbacks = df_cleaned[df_cleaned["body-style"] == "hatchback"]

### Exploration

#### Identify relationships between variables

Identify relationships between the variables that can inform the analysis.

#### Which are the 5 most expensive cars?

How do the most expensive and cheapest cars compare? Exploring the most expensive cars shows whether they are worth the money spent on them based on their fuel economy (mpg, or miles per gallon).

In [None]:
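# Top 5 most expensive cars
# (price is converted to numeric first, in case it was read as text because of '?' placeholders;
# sorting text would order the values alphabetically rather than by amount)

df_cleaned["price"] = pd.to_numeric(df_cleaned["price"], errors="coerce")
top_5_expensive = df_cleaned.sort_values(by='price', ascending=False).head(5)
print(top_5_expensive)

# The most expensive cars are less fuel-efficient than the cheapest ones.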
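
One way to compare the two groups directly is to put the mean price and fuel economy of the most expensive and the cheapest cars side by side. The sketch below does this with `nlargest` and `nsmallest` on invented data; the frame `toy`, its values, and the group size `n` are assumptions for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the automobile dataset.
toy = pd.DataFrame({
    "make": ["a", "b", "c", "d", "e", "f"],
    "price": [5000, 7000, 9000, 20000, 30000, 40000],
    "city-mpg": [40, 35, 30, 20, 16, 14],
    "highway-mpg": [46, 40, 36, 25, 22, 18],
})

n = 2  # size of each group; the notebook uses 5
columns = ["price", "city-mpg", "highway-mpg"]

most_expensive = toy.nlargest(n, "price")
cheapest = toy.nsmallest(n, "price")

# One column per group, one row per averaged variable.
comparison = pd.DataFrame({
    "most_expensive": most_expensive[columns].mean(),
    "cheapest": cheapest[columns].mean(),
})

print(comparison)
```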

#### Which manufacturer builds the most fuel efficient vehicles?

Compare the average mpg of each manufacturer's vehicles and create a bar plot.

In [None]:
# Calculate average mpg for each manufacturer

df_cleaned["average-mpg"] = df_cleaned[["city-mpg", "highway-mpg"]].mean(axis=1)
avg_mpg_by_make = df_cleaned.groupby("make")["average-mpg"].mean().sort_values(ascending=False)

# The average mpg plot

plt.figure(figsize=(12, 6))
avg_mpg_by_make.plot(kind="bar", color="skyblue")
plt.title("Average MPG by Vehicle Manufacturer")
plt.ylabel("Average MPG")
plt.xlabel("Manufacturer")
plt.grid(axis="y")
plt.show()

#### Which vehicles have the largest engine capacity?
Sort the dataframe based on the engine-size column.

In [None]:
# Vehicles with the largest engine capacity (using the cleaned dataframe)

largest_engines = df_cleaned.sort_values(by='engine-size', ascending=False).head(5)
print(largest_engines)

#### Which vehicle manufacturer has the most car models in the dataset?

In [None]:
# Manufacturer with the most car models (each row is counted as one model)

most_models = df_cleaned['make'].value_counts().head(1)
print(most_models)

In [None]:
# Correlation matrix

correlation_matrix = df_cleaned.corr(numeric_only=True)

# Correlation heatmap

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix of Numerical Features')
plt.tight_layout()
plt.show()
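
Since the notebook asks what drives the cost of a vehicle, it can be useful to pull the `price` column out of the correlation matrix and rank the other variables by the strength of their correlation with it. The sketch below shows one way to do that on invented data; the frame `toy` and its values are assumptions for illustration, and correlation alone does not establish that a variable causes a price change.

```python
import pandas as pd

# Hypothetical toy data, not taken from the automobile dataset.
toy = pd.DataFrame({
    "price": [5000, 7000, 9000, 20000, 30000],
    "engine-size": [90, 100, 110, 180, 250],
    "city-mpg": [40, 35, 30, 20, 15],
})

# Take the price column of the correlation matrix, drop the trivial
# self-correlation, and sort by absolute value so that strong negative
# correlations rank alongside strong positive ones.
price_corr = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)

print(price_corr)
```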