<a href="https://colab.research.google.com/github/Harsh-Patel25/Python/blob/main/daily_lessons/Day_9_pandas_part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  **📚 <span style="color:red">Day‑9 Lesson: Python Pandas Tutorials – Part 1</span> 🚀**

> **Pandas** is a powerful library for data manipulation and analysis. It provides two primary data structures: **DataFrame** (a table-like structure) and **Series** (a one-dimensional labeled array). Pandas makes it easy to work with structured data.

---

## 1. Installation and Importing Pandas

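Before you can use Pandas, install it (if it is not already installed) and then import it. The `!` prefix below runs the shell command from a notebook cell (Jupyter or Colab); in a regular terminal, run `pip install pandas` without it.  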
> **Command:**  
```bash
!pip install pandas
```

> **In Python:**  
```python
import pandas as pd
import numpy as np  # We often use NumPy to create data for Pandas
```

---

## 2. Creating a DataFrame

### 2.1 Using NumPy Arrays

You can create a DataFrame from a NumPy array. For example, consider a simple array created by reshaping a range of numbers:

```python
# Create a NumPy array of shape (5,4)
data = np.arange(0, 20).reshape(5, 4)
print(data)
```

> **Output:**  
```
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
```

Now create a DataFrame with custom row and column labels:

```python
df = pd.DataFrame(data=data,
                  index=["Row1", "Row2", "Row3", "Row4", "Row5"],
                  columns=["Column1", "Column2", "Column3", "Column4"])
```

### **Extra Examples for DataFrame Creation:**

1. **From a Dictionary:**
   ```python
   data_dict = {
       "Name": ["Alice", "Bob", "Charlie"],
       "Age": [25, 30, 35],
       "City": ["New York", "Los Angeles", "Chicago"]
   }
   df_dict = pd.DataFrame(data_dict)
   print(df_dict)
   ```
2. **From a List of Dictionaries:**
   ```python
   data_list = [
       {"A": 1, "B": 2},
       {"A": 3, "B": 4},
       {"A": 5, "B": 6}
   ]
   df_list = pd.DataFrame(data_list)
   print(df_list)
   ```
3. **Empty DataFrame and then Adding Data:**
   ```python
   df_empty = pd.DataFrame(columns=["X", "Y", "Z"])
   df_empty.loc[0] = [10, 20, 30]
   df_empty.loc[1] = [40, 50, 60]
   print(df_empty)
   ```
4. **Using np.arange with reshape (Different dimensions):**
   ```python
   arr = np.arange(12).reshape(3, 4)
   df_from_arr = pd.DataFrame(arr, columns=["C1", "C2", "C3", "C4"])
   print(df_from_arr)
   ```
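
Adding rows one at a time with `.loc`, as in example 3, is fine for small demonstrations. A common alternative, sketched here with the hypothetical names `rows` and `df_rows`, is to collect the rows in a plain Python list and build the DataFrame once:

```python
import pandas as pd

# Collect rows first, then construct the DataFrame in a single call.
rows = [
    [10, 20, 30],
    [40, 50, 60],
]
df_rows = pd.DataFrame(rows, columns=["X", "Y", "Z"])

print(df_rows)
print(df_rows.dtypes)  # each column gets an integer dtype inferred from the data
```

Building the frame in one call lets pandas infer column dtypes from all the data at once.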

---

## 3. Viewing and Exploring the DataFrame

### 3.1 Using Head, Tail, Type, Info, and Describe

- **`df.head()`**: Shows the first 5 rows by default.
- **`df.tail()`**: Shows the last 5 rows by default.
- **`type(df)`**: Tells you the type of the object (should be `<class 'pandas.core.frame.DataFrame'>`).
- **`df.info()`**: Displays a concise summary including data types and non-null counts.
- **`df.describe()`**: Provides summary statistics for numeric columns.

```python
print("DataFrame Head:")
print(df.head())

print("\nDataFrame Tail:")
print(df.tail())

print("\nDataFrame Type:")
print(type(df))

print("\nDataFrame Info:")
df.info()

print("\nDataFrame Description:")
print(df.describe())
```

### **Extra Examples:**

1. **Using `df.shape` to check dimensions:**
   ```python
   print("Shape of DataFrame:", df.shape)
   ```
2. **Checking Column Data Types:**
   ```python
   print("Data types:\n", df.dtypes)
   ```
3. **Customizing `head()` (e.g., first 3 rows):**
   ```python
   print("First 3 rows:\n", df.head(3))
   ```
4. **Summary for Non-Numeric Data:**
   ```python
   # For a DataFrame with string data:
   df_str = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                          "City": ["NY", "LA", "CHI"]})
   print(df_str.describe(include=[object]))
   ```
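
`describe()` itself returns a DataFrame, so individual statistics can be looked up by label. The sketch below rebuilds the 5x4 frame from Section 2.1 and reads a few values for `Column1`; the name `stats` is introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

stats = df.describe()
print(list(stats.index))              # names of the summary rows
print(stats.loc["count", "Column1"])  # number of non-null values
print(stats.loc["mean", "Column1"])   # mean of 0, 4, 8, 12, 16
print(stats.loc["max", "Column1"])    # largest value in the column
```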

---

## 4. Indexing and Selecting Data

### 4.1 Column Selection

- **Single Column as a Series:**  
  ```python
  col1 = df['Column1']
  print("Type of df['Column1']:", type(col1))
  ```
  
- **Multiple Columns:**  
  ```python
  subset_df = df[['Column1', 'Column2', 'Column3']]
  print(subset_df)
  ```
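
The difference between single and double brackets is easy to miss: `df['Column1']` returns a Series, while `df[['Column1']]` (a list with one name) returns a one-column DataFrame. A small sketch, using the illustrative names `col_series` and `col_frame`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

col_series = df["Column1"]    # single brackets: a Series
col_frame = df[["Column1"]]   # double brackets: a DataFrame with one column

print(type(col_series).__name__, col_series.shape)
print(type(col_frame).__name__, col_frame.shape)
```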

### 4.2 Row Selection using `loc` and `iloc`

- **Using `loc` (Label-based):**
  ```python
  # Select rows with labels Row3 and Row4
  rows_loc = df.loc[['Row3', 'Row4']]
  print("Rows selected using loc:\n", rows_loc)
  ```

- **Using `iloc` (Integer-location based):**
  ```python
  # Select by integer position: rows 2 and 3 (the 3rd and 4th rows)
  # and columns 0 and 1; the end of each slice is excluded
  rows_iloc = df.iloc[2:4, 0:2]
  print("Rows selected using iloc:\n", rows_iloc)
  ```

### **Extra Examples for Indexing:**

1. **Selecting a Single Cell:**
   ```python
   cell_value = df.loc["Row3", "Column3"]
   print("Value at Row3, Column3:", cell_value)
   ```
2. **Boolean Indexing (Filtering):**
   ```python
   # Filter rows where Column2 is greater than 2
   filtered_df = df[df['Column2'] > 2]
   print("Filtered DataFrame:\n", filtered_df)
   ```
3. **Selecting all rows and the columns at positions 1 and 2 with `iloc`:**
   ```python
   subset_iloc = df.iloc[:, 1:3]
   print("Subset using iloc (columns 1-2):\n", subset_iloc)
   ```
4. **Using `loc` for range selection:**
   ```python
   # Label slices include both endpoints, so this returns Row2, Row3, and Row4:
   range_loc = df.loc["Row2":"Row4"]
   print("Rows from Row2 to Row4:\n", range_loc)
   ```
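
One detail worth checking by hand is how slices behave: `loc` slices include the end label, while `iloc` slices exclude the end position, as ordinary Python slices do. The sketch below compares the two on the frame from Section 2.1; the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

by_label = df.loc["Row2":"Row4"]   # end label included: Row2, Row3, Row4
by_position = df.iloc[1:4]         # end position excluded: positions 1, 2, 3
shorter = df.iloc[1:3]             # positions 1 and 2 only: Row2, Row3

print(list(by_label.index))
print(list(by_position.index))
print(list(shorter.index))
```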

---

## 5. Converting DataFrame into Arrays

You can convert all or part of a DataFrame into a NumPy array with the `.to_numpy()` method (available since pandas 0.24 and the approach the pandas documentation recommends) or with the older `.values` attribute.

```python
array_from_df = df.iloc[:, 1:].values
print("DataFrame converted to array:\n", array_from_df)
```

### **Extra Examples:**

1. **Converting a single column to array:**
   ```python
   col_array = df['Column1'].to_numpy()
   print("Column1 as array:", col_array)
   ```
2. **Converting the entire DataFrame:**
   ```python
   full_array = df.to_numpy()
   print("Entire DataFrame as array:\n", full_array)
   ```
3. **Reshaping converted array:**
   ```python
   reshaped_array = full_array.reshape(4, 5)  # Valid because 5 x 4 = 20 = 4 x 5 elements
   print("Reshaped array:\n", reshaped_array)
   ```
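4. **Using `.values` vs `.to_numpy()`:**  
   For a DataFrame like this one, both return the same array; `.to_numpy()` is the newer method and also accepts options such as `dtype` and `copy`. Example: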
   ```python
   array_via_values = df.values
   array_via_to_numpy = df.to_numpy()
   print("Using .values:\n", array_via_values)
   print("Using .to_numpy():\n", array_via_to_numpy)
   ```
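
A NumPy array has a single dtype, so the result of a conversion depends on the column types involved. The following sketch uses two small hypothetical frames, `numeric` and `mixed`, to show how mixed columns are combined and how the `dtype` argument of `.to_numpy()` can request a specific type:

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
mixed = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

numeric_arr = numeric.to_numpy()
mixed_arr = mixed.to_numpy()

print(numeric_arr.dtype)  # int and float columns are upcast to a common float dtype
print(mixed_arr.dtype)    # strings and ints have no common numeric dtype: object

# Request a specific dtype for a single numeric column.
ages = mixed["Age"].to_numpy(dtype=float)
print(ages)
```

An `object` array loses most of NumPy's speed advantages, so selecting only the numeric columns before converting is usually worthwhile.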

---

## 6. Basic Operations on DataFrames

### 6.1 Handling Missing Data and Nulls

- **Check for nulls:**
  ```python
  print("Null counts:\n", df.isnull().sum())
  ```
  
- **Boolean flag per column (`True` if the column has no nulls):**
  ```python
  print("Non-null indicator:\n", df.isnull().sum() == 0)
  ```

### **Extra Examples:**

1. **Filling missing values:**
   ```python
   df_filled = df.fillna(0)
   print("DataFrame after filling nulls:\n", df_filled)
   ```
2. **Dropping rows with missing values:**
   ```python
   df_dropped = df.dropna()
   print("DataFrame after dropping nulls:\n", df_dropped)
   ```
3. **Count missing values for each column in a DataFrame created with NaNs:**
   ```python
   df_nan = pd.DataFrame([[1, np.nan, 2], [1, 3, 4]],
                         index=["Row1", "Row2"],
                         columns=["Column1", "Column2", "Column3"])
   print("Null counts in df_nan:\n", df_nan.isnull().sum())
   ```
4. **Using boolean indexing to filter rows without nulls:**
   ```python
   df_no_nulls = df_nan[df_nan.notnull().all(axis=1)]
   print("Rows with no nulls:\n", df_no_nulls)
   ```
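
Because the original `df` contains no missing values, `fillna` and `dropna` leave it unchanged. Their effect is easier to see on `df_nan`. The sketch below fills each gap with its column mean and drops rows based on a single column; `filled_mean` and `dropped_subset` are illustrative names, and using the mean is only one possible fill strategy.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame(
    [[1, np.nan, 2], [1, 3, 4]],
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3"],
)

# Fill each missing value with the mean of its own column (NaNs are skipped by mean()).
filled_mean = df_nan.fillna(df_nan.mean())
print(filled_mean)

# Drop only the rows where Column2 is missing.
dropped_subset = df_nan.dropna(subset=["Column2"])
print(dropped_subset)

# Both methods return new objects; df_nan itself still has its NaN.
print(df_nan.isnull().sum())
```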

### 6.2 Aggregation and Value Counts

- **Value Counts:**
  ```python
  # For a given column in a DataFrame
  print("Value counts for Column3:")
  print(df['Column3'].value_counts())
  ```

- **Unique Values:**
  ```python
  print("Unique values in Column2:", df['Column2'].unique())
  ```

- **Filtering DataFrame with conditions:**
  ```python
  filtered_condition = df[df['Column2'] > 2]
  print("Rows where Column2 > 2:\n", filtered_condition)
  ```
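
In the frame from Section 2.1 every value in a column is distinct, so `value_counts()` reports a count of 1 for each. The methods are more informative on data with repeats. A minimal sketch with a hypothetical `cities` Series:

```python
import pandas as pd

cities = pd.Series(["NY", "LA", "NY", "CHI", "NY", "LA"], name="City")

counts = cities.value_counts()  # sorted from most to least frequent
print(counts)

print(cities.unique())   # distinct values in order of first appearance
print(cities.nunique())  # number of distinct values
```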

### **Extra Examples:**

1. **Aggregation: Mean, Median, Sum:**
   ```python
   print("Mean of Column1:", df['Column1'].mean())
   print("Median of Column1:", df['Column1'].median())
   print("Sum of Column1:", df['Column1'].sum())
   ```
2. **Count distinct values using `nunique()`:**
   ```python
   print("Distinct values in Column2:", df['Column2'].nunique())
   ```
3. **Grouping Data:**
   ```python
   # groupby needs a column of group labels; df has none,
   # so add a hypothetical 'Category' column for demonstration
   df['Category'] = ["A", "B", "A", "B", "A"]
   grouped = df.groupby("Category")["Column1"].mean()
   print("Average Column1 by Category:\n", grouped)
   ```
4. **Applying custom functions:**
   ```python
   # Apply a lambda function to modify a column
   df["Column1_plus_10"] = df["Column1"].apply(lambda x: x + 10)
   print("DataFrame after applying lambda:\n", df)
   ```
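
For simple arithmetic like the lambda in example 4, the same result can be obtained with a vectorized expression on the whole column, which avoids calling a Python function once per element. A short comparison, with `via_apply` and `via_vector` as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

via_apply = df["Column1"].apply(lambda x: x + 10)  # element by element
via_vector = df["Column1"] + 10                    # whole column at once

print(via_apply.tolist())
print(via_vector.tolist())
```

`apply` remains useful when the logic cannot be written as a column-wise expression.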

---

## Summary

- **Pandas DataFrames** allow you to work with structured, tabular data efficiently.
- A **Series** is a one-dimensional labeled array; each column of a DataFrame is a Series.
- Basic methods like **head(), tail(), info(), describe()** provide quick insights into your data.
- Indexing with **loc** (label-based) and **iloc** (integer-based) lets you access data in flexible ways.
- Converting DataFrames into NumPy arrays with `.to_numpy()` or `.values` is useful when passing data to numerical code that expects arrays.
- **Basic operations** including handling missing data, aggregation, and filtering are essential for Exploratory Data Analysis (EDA) and Machine Learning.

---
