# Module 1: Importing and Inspecting Data with Pandas

This notebook walks through loading a dataset into a pandas DataFrame, assigning column headers, and inspecting the result. It is based on Module 1 of the IBM Data Analysis (DA0101EN) course.

---

## 1. Setup

First, we import **pandas**, a widely used Python library for data manipulation and analysis.

In [None]:
import pandas as pd

## 2. Importing the Dataset

We will load the "Automobile" dataset from the UCI Machine Learning Repository.

* We define the `url` where the data is located.
* We use `pd.read_csv()` to load the data.
* We specify `header=None` because the original file does not contain a header row. If we omit this, pandas will incorrectly use the first row of data as the headers.

---

In [None]:
# Define the URL for the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

# Read the CSV data into a DataFrame, specifying no header
df = pd.read_csv(url, header=None)
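
To see what `header=None` changes without downloading anything, the sketch below reads a tiny in-memory sample both ways. The `sample` string and the names `wrong` and `right` are made up for illustration; only the `pd.read_csv()` call and its `header` parameter come from the text above.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same headerless style as the dataset
sample = "3,?,alfa-romero,gas\n1,?,alfa-romero,gas\n"

# Without header=None, the first data row is consumed as the column names
wrong = pd.read_csv(io.StringIO(sample))

# With header=None, both rows stay in the data and columns are numbered 0, 1, 2, ...
right = pd.read_csv(io.StringIO(sample), header=None)

print(wrong.shape)           # one data row is lost to the header
print(right.shape)           # both rows are kept
print(list(right.columns))   # default integer labels
```

Comparing the two shapes is a quick check that no data row was silently turned into a header.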

## 3. Assigning Column Headers

Our DataFrame currently has default integer headers (0, 1, 2...). To make the data understandable, we need to assign meaningful column names.

* We create a list called `headers` containing the correct name for each column in order.
* We assign this list to the `df.columns` attribute.

---

In [None]:
# Create a list of headers
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors",
           "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width",
           "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]

# Assign the headers to the DataFrame
df.columns = headers
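
Assigning to `df.columns` only works when the list has exactly one name per column. The following is a minimal sketch, using a hypothetical four-column sample and the illustrative names `sample_headers`, `sample_df`, and `same_df`, that checks the length first and also shows the `names=` parameter of `pd.read_csv()` as one possible alternative for setting headers at load time.

```python
import io
import pandas as pd

# Hypothetical four-column sample; the real dataset has 26 columns
sample = "3,?,alfa-romero,gas\n1,?,alfa-romero,gas\n"
sample_headers = ["symboling", "normalized-losses", "make", "fuel-type"]

sample_df = pd.read_csv(io.StringIO(sample), header=None)

# A list of the wrong length would raise a ValueError on assignment,
# so compare the lengths before assigning
assert len(sample_headers) == sample_df.shape[1]
sample_df.columns = sample_headers

# Alternative: supply the names while reading instead of afterwards
same_df = pd.read_csv(io.StringIO(sample), header=None, names=sample_headers)

print(list(sample_df.columns))
print(same_df.equals(sample_df))
```

Both routes should produce the same DataFrame; which one to use is largely a matter of taste.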

## 4. Initial Data Inspection

After loading the data and assigning headers, the next step is to inspect the data to understand its structure, identify potential issues (like missing values or incorrect data types), and get a statistical overview.

### View First 5 Rows with `.head()`

The `.head()` method gives a quick visual confirmation that the data loaded correctly and the headers line up with the right columns. It returns the first five rows by default.

In [None]:
# Display the first 5 rows
df.head()
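
`.head()` also accepts a row count, and `.tail()` is its counterpart for the end of the DataFrame. Here is a small sketch on a hypothetical six-row frame (the values and the names `sample_df`, `first_three`, and `last_two` are illustrative only):

```python
import pandas as pd

# Hypothetical six-row frame standing in for the automobile data
sample_df = pd.DataFrame({
    "make": ["audi", "bmw", "dodge", "honda", "mazda", "volvo"],
    "num-of-doors": ["four", "two", "two", "four", "two", "four"],
})

first_three = sample_df.head(3)   # first 3 rows instead of the default 5
last_two = sample_df.tail(2)      # last 2 rows, original index preserved

print(first_three)
print(last_two)
```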

### Get a Concise Summary with `.info()`

The `.info()` method provides a high-level summary of the DataFrame. It's excellent for quickly checking:
* The total number of entries (rows).
* The number of columns.
* The data type (`Dtype`) of each column.
* The number of non-null (i.e., not missing) values for each column.
* Memory usage.

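**Key Observation:** Notice below that columns like `normalized-losses`, `horsepower`, and `price` have the `object` (string) dtype. This indicates they contain non-numeric characters (such as '?' standing in for missing values) and will need to be cleaned before we can perform calculations on them. Note that `.info()` still reports these columns as fully non-null, because pandas treats '?' as an ordinary string rather than a missing value.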

In [None]:
# Get a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non
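
Because '?' counts as a regular string, `.info()` cannot reveal how many placeholders a column holds. One possible way to count them, and to have pandas treat them as missing at load time, is sketched below on a hypothetical three-row sample. The sample values and the names `raw`, `clean`, and `placeholder_counts` are assumptions for illustration; `na_values` is a standard `pd.read_csv()` parameter, but this is not necessarily how the course handles cleaning.

```python
import io
import pandas as pd

# Hypothetical sample with '?' placeholders in two columns
sample = "3,?,alfa-romero,111\n1,164,audi,?\n2,158,audi,110\n"
cols = ["symboling", "normalized-losses", "make", "horsepower"]

raw = pd.read_csv(io.StringIO(sample), header=None, names=cols)

# '?' is an ordinary string, so isna() would miss it; compare explicitly
placeholder_counts = (raw == "?").sum()
print(placeholder_counts)

# Re-reading with na_values="?" converts the placeholders to NaN,
# which lets pandas infer a numeric dtype for those columns
clean = pd.read_csv(io.StringIO(sample), header=None, names=cols, na_values="?")
print(clean.dtypes)
print(clean.isna().sum())
```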

### Check Data Types with `.dtypes`

If you only want to see the data type of each column, the `.dtypes` attribute is more direct. It returns a Series that maps each column name to its dtype.

In [None]:
# Check the data types of each column
df.dtypes
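
When the goal is to split columns by type rather than just list the dtypes, `select_dtypes()` can help. This is a minimal sketch on a hypothetical three-column frame; the values and the names `sample_df`, `numeric_cols`, and `other_cols` are illustrative.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
sample_df = pd.DataFrame({
    "make": ["audi", "bmw"],
    "wheel-base": [99.8, 101.2],
    "curb-weight": [2337, 2395],
})

# "number" matches both integer and floating-point columns
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
other_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Columns that land in `other_cols` but are expected to be numeric are candidates for cleaning.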

### Statistical Summary with `.describe()`

The `.describe()` method provides a statistical summary of all **numerical** columns (`int64` and `float64`). It automatically calculates:
* **count:** The number of non-missing values.
* **mean:** The average value.
* **std:** The standard deviation.
* **min:** The minimum value.
* **25%:** The 25th percentile (1st quartile).
* **50%:** The 50th percentile (median).
* **75%:** The 75th percentile (3rd quartile).
* **max:** The maximum value.

In [None]:
# Get a statistical summary of the numerical columns
df.describe()
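
The rows of the `.describe()` output can be cross-checked against the individual Series methods. The sketch below does this for a single hypothetical column; the five values are made up, and `summary` and `col` are illustrative names.

```python
import pandas as pd

# Hypothetical values for one numeric column
sample_df = pd.DataFrame({"city-mpg": [21, 21, 19, 24, 18]})
summary = sample_df.describe()

col = sample_df["city-mpg"]

# Each summary row corresponds to a Series method
print(summary.loc["count", "city-mpg"], col.count())
print(summary.loc["mean", "city-mpg"], col.mean())
print(summary.loc["std", "city-mpg"], col.std())      # sample standard deviation
print(summary.loc["50%", "city-mpg"], col.median())   # 50th percentile is the median
```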

### Full Summary with `.describe(include="all")`

By default, `.describe()` ignores object/categorical columns. To get a summary of **all** columns, use `include="all"`.

For **categorical** columns (type `object`), it provides:
* **unique:** The number of distinct (unique) values.
* **top:** The most frequently occurring value.
* **freq:** The frequency (count) of the `top` value.

For **numerical** columns, it provides the same statistics as the default `.describe()`. Statistics that do not apply to a column's type are shown as `NaN`.

---

In [None]:
# Get a statistical summary of all columns (numerical and categorical)
df.describe(include="all")
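
To make the mixed output easier to read, here is a small sketch on a hypothetical two-column frame. It shows which cells get filled for each column type; the values and the names `sample_df` and `full` are illustrative.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numerical column
sample_df = pd.DataFrame({
    "make": ["audi", "audi", "bmw"],
    "city-mpg": [24, 18, 23],
})

full = sample_df.describe(include="all")
print(full)

# Categorical statistics are filled only for the text column
print(full.loc["unique", "make"], full.loc["top", "make"], full.loc["freq", "make"])

# Cells that do not apply to a column's type are NaN
print(pd.isna(full.loc["mean", "make"]))
print(pd.isna(full.loc["unique", "city-mpg"]))
```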

## 5. Exporting the Data (Optional)

After adding headers, you might want to save your progress to a new CSV file.

* We use `df.to_csv()` to save the DataFrame.
* We specify `index=False` to prevent pandas from writing the DataFrame's index (0, 1, 2...) as an extra column in the new file.

In [None]:
# Export the DataFrame with headers to a new CSV file
# index=False prevents writing the row indices as a new column
df.to_csv("automobile_data_with_headers.csv", index=False)
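
The effect of `index=False` is easiest to see by comparing the header line written with and without it. The sketch below writes a hypothetical two-row frame to in-memory buffers rather than to disk; the values and the names `with_index`, `without_index`, and `reread_cols` are illustrative.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the automobile data
sample_df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# Write to in-memory buffers so nothing touches the file system
with_index = io.StringIO()
without_index = io.StringIO()
sample_df.to_csv(with_index)                  # default: index is written
sample_df.to_csv(without_index, index=False)  # index is omitted

# First line of each output: note the leading empty field for the index
print(with_index.getvalue().splitlines()[0])
print(without_index.getvalue().splitlines()[0])

# Reading the indexed file back yields an extra, unnamed column
with_index.seek(0)
reread_cols = pd.read_csv(with_index).columns.tolist()
print(reread_cols)
```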