# Exploratory Data Analysis – Car Prices  
**Provider:** IBM (Data Analyst Professional Certificate, Coursera)  

As part of IBM’s Data Analyst curriculum, I completed an intensive **hands-on lab** focused on **Exploratory Data Analysis (EDA)**, a crucial step in understanding and preparing data for further analysis and modeling. The lab simulated real-world scenarios to explore patterns, summarize features, and extract actionable insights from data.  

## Key Objectives & Outcomes  
- **Data Exploration** – Investigated features and characteristics to predict car prices.  
- **Pattern Analysis** – Analyzed trends, distributions, and relationships using descriptive statistics.  
- **Data Grouping** – Organized data based on selected attributes and created pivot tables for deeper insights.  
- **Impact Assessment** – Evaluated the influence of independent variables on car prices.  

## Main/Business Question  
> What are the key characteristics that most significantly affect the price of a car?  

## Libraries Used  
For this lab, the following Python libraries were employed:  

* `pandas` – For efficient data manipulation and management.  
* `numpy` – For numerical computations and mathematical operations.  
* `scipy` – For statistical analysis.  
* `seaborn` – For statistical data visualization.  
* `matplotlib` – For the underlying plotting functions (imported as `matplotlib.pyplot`).  

*Note:* Specific library versions were used to match the lab environment.  


In [8]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [9]:
# Load dataset with first row as headers
filepath = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'
df = pd.read_csv(filepath, header=0)  # header=0 tells pandas to use the first row as column names

In [10]:
# Display the first few rows to verify
df.head()


|  | symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | Medium | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | Medium | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.84863 | ... | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | Medium | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.84863 | ... | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | Medium | 0 | 1 |
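
Before plotting anything, I find it useful to confirm the shape of the loaded data and check for missing values. The snippet below is only a minimal sketch: to stay runnable offline it inlines a handful of columns from the five rows shown above instead of downloading the full file, and the variable names (`sample`, `missing_per_column`) are my own.

```python
import io

import pandas as pd

# A few columns from the five rows displayed above, inlined so the
# sketch does not depend on network access to the lab's CSV file.
csv_text = """make,body-style,drive-wheels,horsepower,price
alfa-romero,convertible,rwd,111.0,13495.0
alfa-romero,convertible,rwd,111.0,16500.0
alfa-romero,hatchback,rwd,154.0,16500.0
audi,sedan,fwd,102.0,13950.0
audi,sedan,4wd,115.0,17450.0
"""

# header=0 uses the first line as column names, as in the lab
sample = pd.read_csv(io.StringIO(csv_text), header=0)

n_rows, n_cols = sample.shape              # (rows, columns)
missing_per_column = sample.isna().sum()   # count of NaN values per column

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
```

On the full dataset the same two calls, `df.shape` and `df.isna().sum()`, would apply unchanged.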


## Analyzing Individual Feature Patterns Using Visualization

In this step, we focus on **visualizing individual features** to understand their distributions, trends, and relationships. We use **Matplotlib** and **Seaborn**, two popular Python libraries for creating informative and attractive plots.

The `%matplotlib inline` command ensures that all plots **display directly within the Jupyter Notebook**, making it easier to explore and interpret the data interactively.


In [11]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

### How I Choose the Right Visualization Method

When I visualize individual variables, I first make sure I understand the **type of variable** I’m dealing with—whether it’s categorical or numerical. Knowing this helps me pick the **most appropriate visualization** to clearly represent the data.


In [12]:
# list the data types for each column
print(df.dtypes)

symboling              int64
normalized-losses      int64
make                  object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
city-L/100km         float64
horsepower-binned     object
diesel                 int64
gas                    int64
dtype: object


From this overview, I can see which columns are **numerical** (int64, float64) and which are **categorical** (object). This helps me decide which visualization techniques to use—like histograms or box plots for numerical variables and bar plots for categorical variables—so I can explore the data effectively.
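
The rule of thumb above can be sketched as a small helper. `suggest_plot` is a hypothetical function I am introducing purely for illustration (it is not part of the lab), and the three-row `sample` frame is made up from values that appear in the preview. It only suggests a starting point; the final choice of plot still depends on the question being asked.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_plot(series: pd.Series) -> str:
    """Hypothetical helper: map a column's dtype to a starting plot type."""
    if is_numeric_dtype(series):
        return "histogram or box plot"
    return "bar plot of value counts"


# Tiny illustrative frame: one numeric column and one categorical column
sample = pd.DataFrame({
    "horsepower": [111.0, 154.0, 102.0],
    "drive-wheels": ["rwd", "rwd", "fwd"],
})

suggestions = {col: suggest_plot(sample[col]) for col in sample.columns}
print(suggestions)
```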

### What is the data type of the column "peak-rpm"?

In [13]:
# Check the data type of the single column "peak-rpm"
df['peak-rpm'].dtype


dtype('float64')

Next, I will focus on the **numeric columns** in the dataset and calculate their **correlation matrix**. This helps me understand how strongly each numerical feature is related to the others.

In [14]:
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include=['float64', 'int64'])
numeric_df.corr()

|  | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| symboling | 1.0 | 0.466264 | -0.535987 | -0.365404 | -0.242423 | -0.55016 | -0.233118 | -0.110581 | -0.140019 | -0.008245 | -0.182196 | 0.075819 | 0.27974 | -0.035527 | 0.036233 | -0.082391 | 0.066171 | -0.196735 | 0.196735 |
| normalized-losses | 0.466264 | 1.0 | -0.056661 | 0.019424 | 0.086802 | -0.373737 | 0.099404 | 0.11236 | -0.029862 | 0.055563 | -0.114713 | 0.217299 | 0.239543 | -0.225016 | -0.181877 | 0.133999 | 0.238567 | -0.101546 | 0.101546 |
| wheel-base | -0.535987 | -0.056661 | 1.0 | 0.876024 | 0.814507 | 0.590742 | 0.782097 | 0.572027 | 0.493244 | 0.158502 | 0.250313 | 0.371147 | -0.360305 | -0.470606 | -0.543304 | 0.584642 | 0.476153 | 0.307237 | -0.307237 |
| length | -0.365404 | 0.019424 | 0.876024 | 1.0 | 0.85717 | 0.492063 | 0.880665 | 0.685025 | 0.608971 | 0.124139 | 0.159733 | 0.579821 | -0.28597 | -0.665192 | -0.698142 | 0.690628 | 0.657373 | 0.211187 | -0.211187 |
| width | -0.242423 | 0.086802 | 0.814507 | 0.85717 | 1.0 | 0.306002 | 0.866201 | 0.729436 | 0.544885 | 0.188829 | 0.189867 | 0.615077 | -0.2458 | -0.633531 | -0.680635 | 0.751265 | 0.673363 | 0.244356 | -0.244356 |
| height | -0.55016 | -0.373737 | 0.590742 | 0.492063 | 0.306002 | 1.0 | 0.307581 | 0.074694 | 0.180449 | -0.062704 | 0.259737 | -0.087027 | -0.309974 | -0.0498 | -0.104812 | 0.135486 | 0.003811 | 0.281578 | -0.281578 |
| curb-weight | -0.233118 | 0.099404 | 0.782097 | 0.880665 | 0.866201 | 0.307581 | 1.0 | 0.849072 | 0.64406 | 0.167562 | 0.156433 | 0.757976 | -0.279361 | -0.749543 | -0.794889 | 0.834415 | 0.785353 | 0.221046 | -0.221046 |
| engine-size | -0.110581 | 0.11236 | 0.572027 | 0.685025 | 0.729436 | 0.074694 | 0.849072 | 1.0 | 0.572609 | 0.209523 | 0.028889 | 0.822676 | -0.256733 | -0.650546 | -0.679571 | 0.872335 | 0.745059 | 0.070779 | -0.070779 |
| bore | -0.140019 | -0.029862 | 0.493244 | 0.608971 | 0.544885 | 0.180449 | 0.64406 | 0.572609 | 1.0 | -0.05539 | 0.001263 | 0.566936 | -0.267392 | -0.582027 | -0.591309 | 0.543155 | 0.55461 | 0.054458 | -0.054458 |
| stroke | -0.008245 | 0.055563 | 0.158502 | 0.124139 | 0.188829 | -0.062704 | 0.167562 | 0.209523 | -0.05539 | 1.0 | 0.187923 | 0.098462 | -0.065713 | -0.034696 | -0.035201 | 0.08231 | 0.0373 | 0.241303 | -0.241303 |


This output is the **correlation matrix** of all numeric columns in the dataset. Each number shows how strongly two variables are related:  

- Values close to **1** indicate a strong positive correlation (both increase together).  
- Values close to **-1** indicate a strong negative correlation (one increases while the other decreases).  
- Values near **0** indicate little or no linear relationship.  
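
To make these three cases concrete, here is a small sketch that computes the Pearson coefficient by hand with NumPy and compares it to pandas' `Series.corr`. The data are toy values I made up, and `pearson_r` is my own helper rather than anything from the lab. The third series is included because it depends perfectly on `x` but not linearly, which is why "near 0" only rules out a *linear* relationship.

```python
import numpy as np
import pandas as pd


def pearson_r(a, b) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denom = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float((a_centered * b_centered).sum() / denom)


x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2 * x + 1          # increases with x: r is approximately +1
falling = -3 * x + 10       # decreases as x increases: r is approximately -1
curved = (x - 3) ** 2 - 2   # symmetric U-shape: r is approximately 0

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_curved = pearson_r(x, curved)

print(r_rising, r_falling, r_curved)
print(x.corr(rising), x.corr(falling), x.corr(curved))  # pandas agrees
```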

For example, `engine-size` and `price` have a strong positive correlation (about 0.87 in the matrix above), meaning cars with bigger engines tend to cost more. Similarly, `city-mpg` and `price` have a negative correlation, showing that cars with higher fuel efficiency tend to be cheaper. This matrix helps identify which features are most closely related to price or to each other.
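
One way to read the matrix with the business question in mind is to rank features by the absolute value of their correlation with `price`. The sketch below hard-codes the `price` column for the ten rows displayed above so that it runs on its own; on the real data the starting point would presumably be `numeric_df.corr()['price']`. The names `price_corr` and `ranked` are mine.

```python
import pandas as pd

# Correlation of each feature with price, copied from the rows shown above
price_corr = pd.Series({
    "symboling": -0.082391,
    "normalized-losses": 0.133999,
    "wheel-base": 0.584642,
    "length": 0.690628,
    "width": 0.751265,
    "height": 0.135486,
    "curb-weight": 0.834415,
    "engine-size": 0.872335,
    "bore": 0.543155,
    "stroke": 0.08231,
})

# Order by strength of the relationship, ignoring sign, but keep the
# signed values so the direction of each relationship stays visible.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)

print(ranked)
```

Among the displayed rows, `engine-size`, `curb-weight`, and `width` come out on top. A ranking like this is only a first screen, since it captures linear association and says nothing about causation.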


### Finding the correlation between the following columns: bore, stroke, compression-ratio, and horsepower.
Next, I will focus on four specific numeric features—**bore, stroke, compression-ratio, and horsepower**—to see how strongly they are related to each other using correlation.


In [16]:
# Calculate the correlation matrix for the selected columns to see how they relate to each other
df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()

|  | bore | stroke | compression-ratio | horsepower |
|---|---|---|---|---|
| bore | 1.0 | -0.05539 | 0.001263 | 0.566936 |
| stroke | -0.05539 | 1.0 | 0.187923 | 0.098462 |
| compression-ratio | 0.001263 | 0.187923 | 1.0 | -0.214514 |
| horsepower | 0.566936 | 0.098462 | -0.214514 | 1.0 |
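
To pull the strongest relationship out of a small matrix like this one, I can look at each pair of distinct columns once and ignore the diagonal. The following is a sketch under the assumption that the four-by-four values above are typed in by hand; with the real data, `corr` would be the result of the `.corr()` call itself.

```python
from itertools import combinations

import pandas as pd

cols = ["bore", "stroke", "compression-ratio", "horsepower"]

# Values copied from the correlation matrix above
corr = pd.DataFrame(
    [
        [1.0, -0.05539, 0.001263, 0.566936],
        [-0.05539, 1.0, 0.187923, 0.098462],
        [0.001263, 0.187923, 1.0, -0.214514],
        [0.566936, 0.098462, -0.214514, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Each unordered pair appears once; the diagonal (always 1.0) is skipped
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(cols, 2)}

strongest_pair = max(pairs, key=lambda pair: abs(pairs[pair]))
strongest_value = pairs[strongest_pair]

print(strongest_pair, strongest_value)
```

Here the pair that stands out is `bore` and `horsepower` (about 0.57), while the remaining pairs are weakly related at most.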
