# 7. Exploring data

-----

This group of attributes and methods is your set of "first-look" tools. When you first load a dataset (e.g., from a CSV), you can't just look at all 2 million rows. These are your "doctor's checkup" tools: they let you take a quick "pulse" of the data to understand its **shape**, **size**, **data types**, and **content** *without* changing anything.

**How It Works in Memory**: These are (mostly) very fast and cheap. `.shape`, `.columns`, `.index`, and `.dtypes` are **attributes**: they are read from metadata stored *with* the DataFrame, so looking them up is effectively instant (O(1) with respect to the number of rows). `.head()` and `.tail()` are also fast, as they just grab the first or last few rows. `.info()` and `.describe()` are the only ones that have to *do work*: they scan your data to count non-null values or calculate statistics, which can take a moment on very large datasets.

**When to Use This**: You should use these **every single time** you load a new dataset. This is *always* Step 1.

  * Use `.head()` to see what your data looks like.
  * Use `.info()` to check for missing (`NaN`) values and wrong data types.
  * Use `.shape` to see how big your dataset is.
  * Use `.describe()` to check for "weird" data (like an `Age` of -5 or 500).

-----

### 0\. Syntax & Parameters (MUST COME FIRST)

These are a mix of **attributes (no `()`)** and **methods (with `()`)**.

#### Attributes (No Parentheses)

```python
# Returns a tuple (rows, columns)
dataframe.shape

# Returns the Index object for the rows
dataframe.index

# Returns the Index object for the columns
dataframe.columns

# Returns a Series listing the data type of each column
dataframe.dtypes
```

#### Methods (With Parentheses)

```python
# Returns the first n rows
dataframe.head(n=5)
```

  * **`n`**: The number of rows to return.
      * **Default:** `5`
      * **When to use:** Use `df.head(10)` to see more rows, or `df.head(2)` to see fewer.

```python
# Returns the last n rows
dataframe.tail(n=5)
```

  * **`n`**: The number of rows to return.
      * **Default:** `5`
      * **When to use:** Same as `.head()`. Useful for checking if data is sorted by date.
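
As a rough illustration of the sorted-by-date check, the sketch below compares `.head()` and `.tail()` on a small made-up `orders` DataFrame. The column names and values are hypothetical. Eyeballing the two ends is only a hint, so the last line adds `is_monotonic_increasing`, which checks every row.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration
orders = pd.DataFrame({
    'order_date': pd.to_datetime(
        ['2025-01-03', '2025-01-05', '2025-01-09', '2025-01-12', '2025-01-20']
    ),
    'amount': [120.0, 80.5, 42.0, 310.0, 15.25],
})

first_rows = orders.head(2)   # earliest dates, if the data is sorted ascending
last_rows = orders.tail(2)    # latest dates, if the data is sorted ascending
print(first_rows)
print(last_rows)

# head/tail only show the two ends; this verifies the whole column
is_sorted = orders['order_date'].is_monotonic_increasing
print(is_sorted)
```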

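```python
# Prints a concise summary of the DataFrame
dataframe.info(verbose=None, memory_usage=None, ...)
```

  * **`verbose`**: If `True`, prints the full per-column summary. If `False`, prints a short summary without the per-column table (handy for DataFrames with many columns).
      * **Default:** `None`, which follows the `display.max_info_columns` option: the full summary is printed unless the DataFrame has more columns than that limit.
  * **`memory_usage`**: Controls whether the memory estimate is shown.
      * **Default:** `None`, which follows the `display.memory_usage` option (normally `True`, giving a quick estimate).
      * **When to use:** Use `memory_usage='deep'` to get the *true* total memory, which is slower but much more accurate for columns with text (`object`).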
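
To make the `verbose` difference concrete, here is a small sketch that captures both summaries as text. It relies on the `buf` parameter of `.info()`, which is one of the parameters hidden behind the `...` in the signature above; `df_demo` is a made-up DataFrame.

```python
import io
import pandas as pd

# Hypothetical two-column DataFrame with one missing Age
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clara'],
    'Age': [25.0, 30.0, None],
})

# .info() prints to stdout by default; buf redirects the text to a buffer
full_buf = io.StringIO()
df_demo.info(verbose=True, buf=full_buf)
full_text = full_buf.getvalue()

short_buf = io.StringIO()
df_demo.info(verbose=False, buf=short_buf)
short_text = short_buf.getvalue()

print(full_text)    # includes the per-column "Non-Null Count" table
print(short_text)   # one "Columns: ..." line instead of the table
```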

```python
# Generates descriptive statistics
dataframe.describe(percentiles=None, include=None, exclude=None)
```

  * **`percentiles`**: A list of percentile values (between 0 and 1) to show.
      * **Default:** `[.25, .5, .75]` (shows 25th, 50th/median, 75th percentiles).
      * **When to use:** Use `percentiles=[.1, .5, .9]` to see the 10th, 50th, and 90th percentiles.
  * **`include`**: A list of data types to *include* in the summary.
      * **Default:** `None` (which means *only* numeric columns).
      * **When to use:** Use `include='object'` to summarize text columns (gives `unique`, `top`, `freq`). Use `include='all'` to show *all* columns.
  * **`exclude`**: A list of data types to *exclude*.
      * **Default:** `None`.
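
A minimal sketch of `percentiles` and `exclude` in action, assuming a small made-up DataFrame called `df_demo`. Passing `np.float64` to `exclude` is just one way to drop a dtype.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Age': [25, 30, 22, 35],
    'Score': [88.5, 92.0, 78.5, 85.0],
})

# Custom percentiles replace the default 25% / 50% / 75% rows
custom = df_demo.describe(percentiles=[.1, .5, .9])
print(custom)

# exclude drops whole dtypes; here the float column (Score) is left out
ints_only = df_demo.describe(exclude=[np.float64])
print(ints_only)
```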

-----

### 1\. Basic Example

Let's create a simple DataFrame and run all 8 commands to see what they do.

```python
import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Clara', 'David'],
    'Age': [25, 30, 22, 35],
    'Score': [88.5, 92.0, 78.5, 85.0]
}
df = pd.DataFrame(data)

print("--- The DataFrame ---")
print(df)

print("\n--- 1. .shape (Attribute) ---")
print(df.shape)

print("\n--- 2. .index (Attribute) ---")
print(df.index)

print("\n--- 3. .columns (Attribute) ---")
print(df.columns)

print("\n--- 4. .dtypes (Attribute) ---")
print(df.dtypes)

print("\n--- 5. .head(2) (Method) ---")
print(df.head(2))

print("\n--- 6. .tail(2) (Method) ---")
print(df.tail(2))

print("\n--- 7. .info() (Method) ---")
# .info() prints directly, it doesn't return anything
df.info()

print("\n--- 8. .describe() (Method) ---")
print(df.describe())
```

**Output:**

```
--- The DataFrame ---
    Name  Age  Score
0  Alice   25   88.5
1    Bob   30   92.0
2  Clara   22   78.5
3  David   35   85.0

--- 1. .shape (Attribute) ---
(4, 3)

--- 2. .index (Attribute) ---
RangeIndex(start=0, stop=4, step=1)

--- 3. .columns (Attribute) ---
Index(['Name', 'Age', 'Score'], dtype='object')

--- 4. .dtypes (Attribute) ---
Name      object
Age        int64
Score    float64
dtype: object

--- 5. .head(2) (Method) ---
    Name  Age  Score
0  Alice   25   88.5
1    Bob   30   92.0

--- 6. .tail(2) (Method) ---
    Name  Age  Score
2  Clara   22   78.5
3  David   35   85.0

--- 7. .info() (Method) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object 
 1   Age     4 non-null      int64  
 2   Score   4 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 224.0+ bytes

--- 8. .describe() (Method) ---
             Age      Score
count   4.000000   4.000000
mean   28.000000  86.000000
std     5.715476   5.759051
min    22.000000  78.500000
25%    24.250000  83.375000
50%    27.500000  86.750000
75%    31.250000  89.375000
max    35.000000  92.000000
```

**Explanation:**

  * **`.shape`**: Told us it's 4 rows, 3 columns.
  * **`.index`**: Showed the row labels are a `RangeIndex` (0 to 3).
  * **`.columns`**: Showed the column labels (`Name`, `Age`, `Score`).
  * **`.dtypes`**: Showed the type of each column (`object` is text).
  * **`.head(2)`**: Showed just the first 2 rows.
  * **`.tail(2)`**: Showed just the last 2 rows.
  * **`.info()`**: This is the best summary. It shows the index, columns, `Non-Null Count` (no missing data\!), and `Dtype` for *every column*.
  * **`.describe()`**: Gave us statistical summaries (count, mean, min, max, etc.) for the *numeric columns only* (`Age` and `Score`).

-----

### 2\. Intermediate Example

Now, let's use a "messier" dataset with missing values and mixed types to see how `.info()` and `.describe()` *really* shine.

**Example 9: `.info()` to find missing data**

```python
data_messy = {
    'Name': ['Alice', 'Bob', 'Clara', 'David', 'Eva'],
    'Age': [25, 30, np.nan, 35, 28], # One missing Age
    'City': ['NY', 'LA', 'NY', 'SF', 'LA']
}
df_messy = pd.DataFrame(data_messy)

print("--- Messy DataFrame ---")
print(df_messy)

print("\n--- .info() reveals the NaN ---")
df_messy.info()
```

**Output:**

```
--- Messy DataFrame ---
    Name   Age City
0  Alice  25.0   NY
1    Bob  30.0   LA
2  Clara   NaN   NY
3  David  35.0   SF
4    Eva  28.0   LA

--- .info() reveals the NaN ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    5 non-null      object 
 1   Age     4 non-null      float64
 2   City    5 non-null      object 
dtypes: float64(1), object(2)
memory usage: 248.0+ bytes
```

**Explanation:**
Look at the `.info()` output. The `RangeIndex` shows **5 entries**. The `Age` column shows **4 non-null**. This mismatch (5 vs. 4) instantly tells you that the `Age` column has *one missing value*. Also, note that the `Age` `Dtype` is `float64`. This is because `NaN` is a float, so it "upcast" the whole column from integer to float to hold the missing value.
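
The upcast is easy to reproduce on its own. The sketch below uses made-up Series; the last part shows pandas' nullable `Int64` dtype (capital I) as one optional way to keep whole numbers alongside a missing value. It is an aside, not something the examples in this section rely on.

```python
import numpy as np
import pandas as pd

ages_complete = pd.Series([25, 30, 35])
ages_missing = pd.Series([25, 30, np.nan, 35])

print(ages_complete.dtype)   # integer dtype: no missing values
print(ages_missing.dtype)    # float64: NaN forced the upcast

# Optional alternative: nullable integers store the gap as pd.NA instead of NaN
ages_nullable = pd.Series([25, 30, None, 35], dtype='Int64')
print(ages_nullable.dtype)
print(ages_nullable.isna().sum())
```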

**Example 10: `.describe()` on numeric vs. object columns**

```python
# Using the same messy DataFrame
print("--- .describe() (Default, numeric) ---")
print(df_messy.describe())

print("\n--- .describe(include='object') (Text) ---")
print(df_messy.describe(include='object'))
```

**Output:**

```
--- .describe() (Default, numeric) ---
             Age
count   4.000000
mean   29.500000
std     4.203173
min    25.000000
25%    27.250000
50%    29.000000
75%    31.250000
max    35.000000

--- .describe(include='object') (Text) ---
        Name City
count      5    5
unique     5    3
top    Alice   NY
freq       1    2
```

**Explanation:**

  * The default `.describe()` only showed the `Age` column (the only numeric one).
  * By specifying `include='object'`, we get a *different* summary for the `Name` and `City` columns.
      * **count**: Total non-null entries.
      * **unique**: Number of unique values (`LA`, `NY`, `SF` = 3 unique cities).
      * **top**: The most frequent value (`NY`). Here `NY` and `LA` are tied at 2 appearances each; `.describe()` reports only one of them.
      * **freq**: How many times the `top` value appeared (2).
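
Because `top` shows a single value, a tie is invisible in the `.describe()` output. One way to see the full picture, sketched here on the same city values, is `.value_counts()`:

```python
import pandas as pd

cities = pd.Series(['NY', 'LA', 'NY', 'SF', 'LA'], name='City')

# Full frequency table instead of just the single "top" value
counts = cities.value_counts()
print(counts)

# All values that share the highest count
max_count = counts.max()
tied = sorted(counts[counts == max_count].index)
print(tied)
```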

-----

### 3\. Advanced or Tricky Case

**Example 11: `.describe(include='all')`**

This parameter combines *both* summaries, showing `NaN` for stats that don't apply.

```python
# Using the same messy DataFrame
print("--- .describe(include='all') ---")
print(df_messy.describe(include='all'))
```

**Output:**

```
--- .describe(include='all') ---
        Name        Age City
count      5   4.000000    5
unique     5        NaN    3
top    Alice        NaN   NY
freq       1        NaN    2
mean     NaN  29.500000  NaN
std      NaN   4.203173  NaN
min      NaN  25.000000  NaN
25%      NaN  27.250000  NaN
50%      NaN  29.000000  NaN
75%      NaN  31.250000  NaN
max      NaN  35.000000  NaN
```

**Explanation:**
This is tricky but very thorough. For the `Name` and `City` columns, it shows the text stats (`unique`, `top`, `freq`) and `NaN` for the numeric stats (`mean`, `std`, etc.). For the `Age` column, it does the reverse.

**Example 12: `.info(memory_usage='deep')`**

This is an advanced trick to find the *real* memory usage. `object` columns (text) are tricky because Pandas just stores *pointers* to the text. `memory_usage='deep'` tells `.info()` to go and measure the *actual size of the text*, which can be much, much larger.

```python
print("\n--- .info() (Normal) ---")
df_messy.info()

print("\n--- .info(memory_usage='deep') (Accurate) ---")
df_messy.info(memory_usage='deep')
```

**Output:**

```
--- .info() (Normal) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
...
memory usage: 248.0+ bytes

--- .info(memory_usage='deep') (Accurate) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
...
memory usage: 604.0 bytes
```

**Explanation:**
The normal `.info()` reported about 248 bytes (pandas appends a `+` to this figure when `object` columns are only estimated). The `deep` scan showed 604.0 bytes because it went and *actually measured* the size of the strings ('Alice', 'Bob', 'NY', 'LA', etc.). The exact byte counts vary with your Python and pandas versions. On a dataset with millions of rows, this difference could be megabytes vs. gigabytes.
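
To see *which* columns account for the gap, `DataFrame.memory_usage()` gives the same numbers per column. This is a sketch on a made-up `df_demo`; the text columns are created with `dtype=object` explicitly so they behave as described here regardless of pandas version.

```python
import pandas as pd

# dtype=object is set explicitly so the text columns store pointers to Python strings
df_demo = pd.DataFrame({
    'Name': pd.Series(['Alice', 'Bob', 'Clara', 'David', 'Eva'], dtype=object),
    'Age': [25.0, 30.0, None, 35.0, 28.0],
    'City': pd.Series(['NY', 'LA', 'NY', 'SF', 'LA'], dtype=object),
})

shallow = df_demo.memory_usage()          # object columns: pointers only
deep = df_demo.memory_usage(deep=True)    # object columns: pointers plus the strings

print(shallow)
print(deep)
print(deep - shallow)   # extra bytes found by the deep scan, per column
```

The numeric `Age` column should show no difference, while the two text columns grow.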

-----

### 4\. Real-World Use Case

This is the **"Initial Data Triage"** workflow you will do 100% of the time after loading data.

**Scenario:** You just loaded a huge file: `df = pd.read_csv('sales_data_2025.csv')`

Your workflow should be:

**Example 13: The 5-Step Triage**

1.  **`df.info()`**

      * **Action:** Run it.
      * **Questions to ask:** "How many entries?" (e.g., 2.1M). "Are there any nulls?" (e.g., "Oh, `customer_email` has only 1.8M non-null values... lots missing\!"). "Are the `Dtypes` correct?" (e.g., "Wait, `Order_Date` is `object`? I need to convert that to datetime. `Revenue` is `object`? That's bad.")

2.  **`df.head()`**

      * **Action:** Run it.
      * **Questions to ask:** "What does the data *look* like?" (e.g., "Ah, `Revenue` is `object` because it has '$' and ',' in it, like '$1,200.50'. I'll need to clean that.")

3.  **`df.shape`**

      * **Action:** Check it.
      * **Questions to ask:** "How many rows and columns? 2.1M rows, 45 columns."

4.  **`df.describe()`**

      * **Action:** Run it.
      * **Questions to ask:** "Are there any *sanity* issues?" (e.g., "Look at `Quantity`: `min` is -10. That's a data error."). "Look at `Unit_Price`: `max` is 99,000. Is that real or an outlier?"

5.  **`df.describe(include='object')`**

      * **Action:** Run it.
      * **Questions to ask:** "How many unique categories?" (e.g., "Column `Country`: `unique` is 250. Good." or "Column `State`: `unique` is 75... wait, there are only 50 states. That means I have dirty data like 'NY' and 'New York' and 'n.y.'")

**Explanation:**
This 5-step process usually takes well under a minute and gives you a first "to-do" list for your data cleaning.
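
As a sketch of how the triage findings turn into fixes, the block below uses a tiny made-up `sales` DataFrame in place of the real file. The column names follow the scenario, and the values are invented. It is one possible cleanup, not the only one.

```python
import pandas as pd

# Hypothetical stand-in for the loaded sales file
sales = pd.DataFrame({
    'Order_Date': ['2025-01-03', '2025-01-05', '2025-02-11'],
    'Revenue': ['$1,200.50', '$89.99', '$15,000.00'],
    'Quantity': [3, -10, 12],
})

# Finding from .info()/.head(): Revenue is text because of '$' and ','
sales['Revenue'] = (
    sales['Revenue']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Finding from .info(): Order_Date should be a datetime
sales['Order_Date'] = pd.to_datetime(sales['Order_Date'], format='%Y-%m-%d')

# Finding from .describe(): negative quantities look like data errors
suspect_rows = sales[sales['Quantity'] < 0]

print(sales.dtypes)
print(suspect_rows)
```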

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 14: Attribute vs. Method (`.shape` vs `.shape()`)**

```python
# Wrong code: .shape is a tuple, so calling it raises TypeError
try:
    print(df_messy.shape())
except TypeError as e:
    print(f"Error: {e}")

# Correct code: no parentheses
print(f"Correct: {df_messy.shape}")
```

**Error/Wrong Output:**

```
Error: 'tuple' object is not callable
Correct: (5, 3)
```

**Why it happens:**
`.shape` is an **attribute** (a property), not a method (an action). You don't add `()` to it. The same is true for `.index`, `.columns`, and `.dtypes`.

**Mistake 15: Method vs. Attribute (`.info()` vs `.info`)**

```python
# Wrong code
print(df_messy.info)
```

**Output:**

```
<bound method DataFrame.info of     Name   Age City
0  Alice  25.0   NY
1    Bob  30.0   LA
2  Clara   NaN   NY
3  David  35.0   SF
4    Eva  28.0   LA>
```

**Why it happens:**
This is the reverse mistake. `.info()` is a **method** (an action), so it *requires* `()` to be called. Without them, you just get a description of the method itself, which isn't useful.

**Correction:** `df_messy.info()`
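
If you are ever unsure which kind you are dealing with, Python's built-in `callable()` can tell you. A quick sketch on a made-up `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

shape_is_callable = callable(df_demo.shape)   # attribute: a plain tuple
info_is_callable = callable(df_demo.info)     # method: must be called with ()
head_is_callable = callable(df_demo.head)     # method: must be called with ()

print(shape_is_callable)
print(info_is_callable)
print(head_is_callable)
```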

-----

### 6\. Key Terms (Explained Simply)

  * **Attribute:** A stored property of an object (e.g., `df.shape`). Accessed *without* `()`.
  * **Method:** An action or function an object can perform (e.g., `df.head()`). Accessed *with* `()`.
  * **Metadata:** Data *about* your data (e.g., number of rows, column names, data types).
  * **Descriptive Statistics:** Summaries of your data (e.g., mean, median, min, max).
  * **`dtype` (Data Type):** The type of data in a column (`int64` for whole numbers, `float64` for decimals, `object` for text).
  * **`NaN` (Not a Number):** Pandas' marker for a single missing value.
  * **Percentile/Quartile:** A measure of where data falls. The 25th percentile (or 1st quartile) means 25% of the data is *below* that value. The 50th percentile is the `median`.

-----

### 7\. Best Practices

  * **Always use these first:** Make it a reflex. `pd.read_csv(...)`, then `df.info()`, `df.head()`.
  * **Trust `.info()` for missing data:** It's the fastest way to check *all* columns at once.
  * **Check `dtypes` obsessively:** A `Revenue` column stored as `object` is a near-certain source of errors later, because arithmetic and aggregations fail or misbehave on text. Fix it *first*.
  * **Use `df.describe(include='object')`:** Don't just `describe` your numbers. Your text columns often have just as many data quality issues (like typos, which you'd see in the `unique` count).
  * **Don't trust `.head()`:** `.head()` only shows the first 5 rows. They might be perfectly clean, while row 50,000 is a mess. It's a *peek*, not a *proof*.
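
One way to look beyond the first rows is `.sample()`, followed by a check over the whole column. The sketch below uses a made-up `Revenue` column where the bad value sits past the rows `.head(3)` shows; `random_state=0` just makes the sample repeatable.

```python
import pandas as pd

# Hypothetical data: the first rows are clean, a later row is not
df_demo = pd.DataFrame({
    'Revenue': ['100', '250', '75', '310', 'N/A', '90', '120', '45'],
})

print(df_demo.head(3))   # looks clean

# Rows drawn from anywhere in the DataFrame, not just the top
random_rows = df_demo.sample(n=4, random_state=0)
print(random_rows)

# Only a check over every row is real proof: find values that are not numeric
as_numbers = pd.to_numeric(df_demo['Revenue'], errors='coerce')
bad_rows = df_demo[as_numbers.isna()]
print(bad_rows)
```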

-----

### 8\. Mini Summary

  * These 8 tools are for **inspection**, not modification.
  * **Attributes (no `()`)**:
      * `.shape`: (rows, cols)
      * `.index`: Row labels
      * `.columns`: Column labels
      * `.dtypes`: Data type of each column
  * **Methods (with `()`)**:
      * `.head()`/`.tail()`: Peek at first/last 5 rows.
      * `.info()`: Best summary for `NaN`s and `dtypes`.
      * `.describe()`: Statistical summary for numeric (default) or text (`include='object'`) columns.

-----

### 10\. Practice Tasks

**Data for Tasks:**

```python
data = {
    'Category': ['A', 'B', 'A', 'A', 'B', 'C'],
    'Value': [10, 15, 12, np.nan, 8, 10],
    'Notes': ['good', 'bad', 'ok', 'good', 'ok', 'bad']
}
df_practice = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])
```

**Task 16 (Easy):**
Using `df_practice`, print its shape and its column names.

**Task 17 (Medium):**
Print a summary of `df_practice` that shows you that the 'Value' column has a missing entry. Then, print the first 3 rows.

**Task 18 (Hard):**
Generate *two* summaries from `df_practice`:

1.  A statistical summary of the 'Value' column.
2.  A frequency summary of the 'Category' and 'Notes' columns.

-----

### 11\. Recommended Next Topic

You've now created a DataFrame and "explored" it to find problems. The next logical step in the roadmap is to start *fixing* those problems by changing the DataFrame's structure (like renaming or dropping columns).

**Recommended:** **Structure changes (Renaming: `.rename()`, Adding/removing columns: `.insert()`, `del`, `.drop()`)**

-----

### 12\. Quick Reference Card

| Command | Type | What It Shows |
| :--- | :--- | :--- |
| **`.shape`** | Attribute | `(rows, columns)` tuple. |
| **`.index`** | Attribute | The row labels. |
| **`.columns`** | Attribute | The column labels. |
| **`.dtypes`** | Attribute | `Series` of data types for each column. |
| **`.head(n=5)`** | Method | `DataFrame` of the *first* `n` rows. |
| **`.tail(n=5)`** | Method | `DataFrame` of the *last* `n` rows. |
| **`.info()`** | Method | Complete summary of index, columns, `NaN` counts, `dtypes`, and memory. |
| **`.describe()`** | Method | `DataFrame` with statistics (mean, std, min, max, quartiles) for numeric columns. |
| `.describe(include='object')` | Method | `DataFrame` with stats (count, unique, top, freq) for text columns. |

-----

### 13\. Common Interview Questions

1.  **You've just loaded a 2GB CSV. What are the *first* 3-5 things you do?**
      * `df.info()`: To check for nulls and `dtypes`.
      * `df.head()`: To visually inspect the data and see *why* a `dtype` might be wrong (e.g., '$' in a number).
      * `df.shape`: To confirm the size.
      * `df.describe()`: To check for outliers or data errors (e.g., `min = -1`).
2.  **How do you check for missing values in a DataFrame?**
      * The best and fastest way is `df.info()`. It gives you a `Non-Null Count` for every column at once.
      * (A more advanced way is `df.isna().sum()`, which gives a direct count of `NaN`s per column).
3.  **How do you get a summary of text-based (non-numeric) columns?**
      * You use `df.describe(include='object')`. This will show you the `count`, `unique` values, the `top` (most frequent) value, and its `freq` (frequency).
4.  **What is the difference between `df.shape` and `df.shape()`?**
      * `df.shape` is an **attribute** that returns a tuple (rows, columns). This is correct.
      * `df.shape()` is an **error**: it raises a `TypeError` because you are trying to *call* the tuple as a function.

-----

### 14\. Performance Considerations

  * **O(1) (Instant):**
      * `.shape`, `.index`, `.columns`, `.dtypes`. These are properties, so reading them is just a lookup.
  * **O(k) (Very Fast):**
      * `.head(k)`, `.tail(k)`. Time is proportional to `k`, not the size of the DataFrame.
  * **O(N\*M) or O(N) (Can be slow):**
      * `.info()`: The full summary counts non-null values in every column, so it costs roughly O(N \* M) on a DataFrame with N rows and *many columns* (M). The short summary (`verbose=False`) skips the per-column table.
      * `.describe()`: Must iterate over all *numeric* columns (M\_numeric) and all rows (N) to compute statistics. Time is roughly O(N \* M\_numeric).
      * `.info(memory_usage='deep')`: This is the slowest, as it has to iterate over *every single text element* in all `object` columns. Use it, but be aware it can take time.
  * **Memory:** All these methods are very light on memory. They return small new objects (summaries), not copies of your full data.
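
A rough way to see these differences yourself is to time the calls. The sketch below uses a made-up random DataFrame of 200,000 rows; the absolute numbers depend entirely on your machine, so only the relative ordering is meaningful.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical numeric DataFrame, large enough for the scan to be measurable
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((200_000, 5)), columns=list('abcde'))

# Total time for 5 runs of each operation
t_shape = timeit.timeit(lambda: big.shape, number=5)
t_head = timeit.timeit(lambda: big.head(), number=5)
t_describe = timeit.timeit(lambda: big.describe(), number=5)

print(f"shape:    {t_shape:.6f} s")
print(f"head:     {t_head:.6f} s")
print(f"describe: {t_describe:.6f} s")
```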

-----

### 15\. When NOT to Use This

  * **Don't use these to *change* data.** These are for *inspection only*.
  * **Don't use `.head()` to make assumptions.** Do not assume your whole 10-million-row dataset is clean just because the first 5 rows are.
  * **Don't use `.describe()` as your only statistical analysis.** It is a *summary*. It will not show you if your data has two different groups (bimodal distribution) or other complex patterns. It's a *starting point* for analysis, not the end.
  * **Don't use `.info()` for a precise `NaN` count in a script.** `df.info()` *prints* to the console. If you need to *use* the number of `NaN`s in a variable, use `df.isna().sum()` (which returns a Series) instead.
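
A minimal sketch of the script-friendly alternative, using a made-up `df_demo`. Because `df.isna().sum()` returns a Series, the counts can be stored and used in later logic.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Category': ['A', 'B', 'A', None],
    'Value': [10, np.nan, 12, np.nan],
})

# Series mapping each column name to its number of missing values
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# The result is ordinary data, so it can drive decisions in a script
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
total_missing = int(missing_per_column.sum())
print(columns_with_gaps, total_missing)
```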