# Introduction


---

### 📚 Table of Contents: Introduction to Pandas

1.  **What is Pandas?** 🤔
2.  **Why Pandas? The Superpowers of Data Analysis!** 💪
3.  **Getting Started: Installation & Import** 🛠️
4.  **A Glimpse at Pandas' Core: Series & DataFrames (The Building Blocks!)** 🧱
5.  **Pro Tips & Common Pitfalls (Learn from the Pros!)** 💡
6.  **Quick Quiz: Test Your Knowledge!** ❓




---

### 1. What is Pandas? 🤔

Imagine you have a huge pile of disorganized papers: some with numbers, some with names, some with dates. Trying to find specific information or make sense of it all would be a nightmare, right? 😫

**Pandas** is like a super-organized digital filing cabinet for your data in Python! 🗄️ It's a powerful open-source library that provides easy-to-use data structures and data analysis tools. Think of it as a specialized spreadsheet program, but instead of clicking around, you use Python code to manipulate and analyze your data quickly and flexibly.

The name "Pandas" actually comes from "Panel Data," an econometrics term for multi-dimensional data. Pretty cool, huh? 😎



### 2. Why Pandas? The Superpowers of Data Analysis! 💪

So, why bother learning Pandas when you could just use Excel or Google Sheets? Here's why Pandas is the go-to choice for data professionals:

* **Handles Large Datasets:** Pandas comfortably handles datasets that would overwhelm traditional spreadsheet software (an Excel sheet, for example, is capped at 1,048,576 rows). Millions of rows are routine, as long as the data fits in your machine's memory! 📈
* **Data Cleaning & Preparation:** Real-world data is rarely perfect. Pandas makes it much easier to clean messy data, deal with missing values, fix inconsistencies, and get your data ready for analysis. Say goodbye to manual scrubbing! 🧹
* **Powerful Data Manipulation:** Want to filter data, sort it, group it, combine different datasets, or perform complex calculations? Pandas has a function for almost everything you can imagine. It's like having a data superhero at your fingertips! ✨
* **Integration with Other Libraries:** Pandas plays nicely with other essential Python libraries like NumPy (for numerical operations), Matplotlib and Seaborn (for visualizations), and Scikit-learn (for machine learning). It's a team player! 🤝
* **Time-Saving & Efficient:** Automate repetitive tasks and analyze data much faster than manual methods. This means more time for insights and less time on tedious work! ⏳

In short, Pandas empowers you to **load, manipulate, clean, and analyze data** with Python, making it an indispensable tool for anyone working with data. 📊



### 3. Getting Started: Installation & Import 🛠️

Before we can unleash Pandas' power, we need to make sure it's installed and ready to go!

#### Installation (If you don't have it already):

If you're using **Anaconda** (which is highly recommended for data science in Python), Pandas comes pre-installed! 🎉

If you're using a standard Python installation, you can install it using `pip` in your terminal or command prompt:

```bash
pip install pandas
```

Or, directly in a Jupyter Notebook cell (add a `!` before the command to run shell commands):

```python
!pip install pandas
```



#### Importing Pandas:

Every time you want to use Pandas in a Python script or Jupyter Notebook, you need to import it. The standard convention is to import it as `pd`. This is like giving it a short nickname, so you don't have to type `pandas` every single time. 😉


In [1]:
# Import the pandas library, giving it the common alias 'pd'
import pandas as pd

print("Pandas imported successfully! You're ready to roll! 🎉")

Pandas imported successfully! You're ready to roll! 🎉




### 4. A Glimpse at Pandas' Core: Series & DataFrames (The Building Blocks!) 🧱

Pandas is built around two primary data structures that you'll be working with constantly:

  * **Series:** Think of a **Series** as a single column of data, like a list, but with a special "index" (labels for each item). It's one-dimensional.
  * **DataFrame:** A **DataFrame** is the real star! It's like a whole spreadsheet or a SQL table: a collection of Series (columns) that share the same row index. It's two-dimensional, allowing you to store and organize tabular data.

Don't worry too much about the details right now; we'll dive deep into Series and DataFrames in the next sections. For now, just know that these are the fundamental ways Pandas organizes your data. 🏗️

Let's quickly create a simple Series and DataFrame to see what they look like:


In [2]:
# Creating a simple Pandas Series
my_series = pd.Series([10, 20, 30, 40, 50])
# print(my_series) # Uncomment to see the output!

# Creating a simple Pandas DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
my_dataframe = pd.DataFrame(data)
# print(my_dataframe) # Uncomment to see the output!
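
To make the "labels" idea concrete, here is a minimal sketch with its own small, made-up data (the variable names `ages` and `people` are illustrative and not part of the cells above). It shows a Series with custom labels and that picking one column out of a DataFrame hands you back a Series:

```python
import pandas as pd

# A Series with explicit labels instead of the default 0, 1, 2, ... index
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")
print(ages["Bob"])  # look up a value by its label

# A DataFrame is a collection of Series (columns) that share one index
people = pd.DataFrame(
    {
        "Age": [25, 30, 35],
        "City": ["New York", "Los Angeles", "Chicago"],
    },
    index=["Alice", "Bob", "Charlie"],
)

age_column = people["Age"]  # selecting one column gives back a Series
print(type(age_column).__name__)
```

Both objects share the same idea of an index; the DataFrame simply lines up several Series against it.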


### 5. Pro Tips & Common Pitfalls (Learn from the Pros!) 💡

#### Pro Tips:

  * **Always Import as `pd`:** Stick to `import pandas as pd`. This is the universal standard and will make your code easier for others (and future you!) to understand.
  * **Jupyter Notebook is Your Friend:** Jupyter Notebooks (or JupyterLab) are perfect for learning and experimenting with Pandas. They allow you to run code in small chunks and see immediate results. 🧪
  * **Start Small:** When learning a new concept, always try it out with a small, simple dataset first. Once you understand it, apply it to larger, more complex data. 🤏
  * **Read the Documentation (Eventually!):** The official Pandas documentation is excellent. You don't need to read it all now, but knowing it exists as a reliable resource is key for later on! 📚

#### Common Pitfalls for Beginners:

  * **Forgetting to Import:** It's a common mistake! If you get a `NameError: name 'pd' is not defined`, you probably forgot to run `import pandas as pd`. 🤦‍♀️
  * **Typos:** Python is case-sensitive! `Pandas` is not the same as `pandas`. Always double-check your spelling, especially for function names. 🐛
  * **Not Understanding Data Types:** Pandas infers the data type of each column (numbers, text, etc.). Sometimes the inferred type is not the one you wanted, which can lead to unexpected behavior. We'll cover this more later, but keep it in mind! 🚫
  * **In-Place Operations vs. New Objects:** Most Pandas methods return a new object and leave the original untouched (e.g., `df.drop(columns='A')` or `df.head()`), while some can modify the data "in-place" when asked (e.g., `df.drop(columns='A', inplace=True)`). Pay attention to whether a call modifies your original DataFrame or gives you a new one. This is a common source of confusion! 🔄
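
The last pitfall is easier to see in action. The sketch below uses a tiny throwaway DataFrame (the column names `A` and `B` are made up for illustration) and calls `drop()` both ways:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# By default, drop() returns a NEW DataFrame and leaves df untouched
without_a = df.drop(columns="A")
print(list(df.columns))         # ['A', 'B']
print(list(without_a.columns))  # ['B']

# With inplace=True, df itself changes and the call returns None
result = df.drop(columns="A", inplace=True)
print(result)                   # None
print(list(df.columns))         # ['B']
```

A frequent beginner bug is writing `df = df.drop(..., inplace=True)`, which replaces `df` with `None`.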



### 6. Quick Quiz: Test Your Knowledge! ❓

Let's see how much you've absorbed! Choose the best answer for each question.

1.  **Which of the following is the standard alias for importing the Pandas library?**
    a) `import pandas as p`
    b) `import pandas as data`
    c) `import pandas as pd`
    d) `import pandas as panda`

    **Think about it!** 🤔

2.  **What are the two primary data structures in Pandas?**
    a) Lists and Dictionaries
    b) Arrays and Matrices
    c) Series and DataFrames
    d) Rows and Columns

    **Think about it!** 🤔

3.  **True or False: Pandas is primarily used for numerical computation, similar to NumPy, and doesn't handle text data well.**

    **Think about it!** 🤔



-----

### End of Introduction! 🎉

You've successfully completed your first step into the world of Pandas! Give yourself a pat on the back! 👏 In the next sections, we'll start getting hands-on with creating and understanding Series and DataFrames in much more detail.

---

<a id="mini-project"></a>  
## 🎯 **Mini-Project: Analyze COVID-19 Data 🦠**  

**Scenario:** Explore a COVID-19 dataset (simplified example).  


In [None]:
import numpy as np
import pandas as pd

# Sample data (in practice, load a real CSV)
data = {
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Cases": [120, 150, np.nan],
    "Deaths": [5, 8, 10]
}
covid_df = pd.DataFrame(data)

# Task 1: Handle missing values
covid_df["Cases"] = covid_df["Cases"].fillna(covid_df["Cases"].mean())

# Task 2: Calculate mortality rate
covid_df["Mortality Rate (%)"] = (covid_df["Deaths"] / covid_df["Cases"]) * 100

# Task 3: Save cleaned data
covid_df.to_csv("cleaned_covid_data.csv", index=False)

print("Processed Data:\n", covid_df)



---

<a id="pro-tips"></a>  
## 💡 **Pro Tips & Common Pitfalls**  

✅ **Always check data after reading:**  
```python
df.info()     # Data types & missing values  
df.head()     # First 5 rows  
df.describe() # Summary statistics  
```  

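❌ **Don't leave the old index behind when chaining:**  
```python
# Chaining is fine (each method returns a new DataFrame), but a plain
# reset_index() keeps the old index as an extra "index" column:
df = df.dropna().reset_index()

# Better: discard the old index while resetting
df_clean = df.dropna().reset_index(drop=True)
```  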

🔥 **Read large files efficiently:**  
```python
# Read in chunks of 10,000 rows; `process` is a placeholder for your own function
chunk_iter = pd.read_csv("big_file.csv", chunksize=10000)  
for chunk in chunk_iter:
    process(chunk)  
```
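
Since `process` above is only a placeholder, here is a small runnable sketch of the same pattern. It assumes a hypothetical column named `value` and uses an in-memory string (via `io.StringIO`) in place of a real large file, so nothing needs to exist on disk:

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows of in-memory CSV text
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 11))

total = 0
rows_seen = 0
# chunksize=4 yields DataFrames of at most 4 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    rows_seen += len(chunk)

print(rows_seen, total)  # 10 55
```

Only one chunk is held in memory at a time, which is the whole point: the running totals are tiny even if the file is not.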

---

## 🎉 **Key Takeaways**  
✔ **Pandas** handles tabular data effortlessly.  
✔ **DataFrames** > Excel (fight me 😉).  
✔ **Series** are single columns; **DataFrames** are full tables.  
✔ **Reading/Writing** is simple with `read_csv`, `to_excel`, etc.  

**Next Steps:**  
- Try loading a real dataset from [Kaggle](https://www.kaggle.com/).  
- Explore `pd.read_json()` for API data.  

**Happy data wrangling!** 🐼


---

# 📊 DataFrames & Essential Operations 🛠️

Welcome back, data explorer! 🗺️ In the last section, we got a taste of what DataFrames are. Now, it's time to truly understand them and learn the fundamental operations that will make you a Pandas pro! 🚀

A **DataFrame** is the most widely used data structure in Pandas. Think of it as a table with rows and columns, similar to a spreadsheet or a SQL table. Each column can have a different data type (e.g., numbers, text, dates), which makes DataFrames incredibly flexible for real-world data.
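
As a quick illustration of "each column can have a different data type", the sketch below builds a two-row table (the values are made up for this example) and prints the type Pandas assigned to each column:

```python
import pandas as pd

orders = pd.DataFrame({
    "Product": ["Laptop", "Mouse"],                             # text
    "Quantity": [1, 2],                                         # integers
    "UnitPrice": [1200.0, 19.99],                               # floats
    "OrderDate": pd.to_datetime(["2023-01-10", "2023-01-12"]),  # dates
})

# One dtype per column
print(orders.dtypes)
```

The exact dtype names printed can vary a little between Pandas versions (text columns in particular), but each column always carries its own type.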


---

### 📚 Table of Contents: DataFrames & Operations

1.  **Creating DataFrames** 🎨
    * From Dictionaries
    * From Lists of Dictionaries
    * From CSV Files (Your Everyday Data Source!)
2.  **Exploring Your DataFrame: First Look & Info** 👀
    * `head()` & `tail()`: Peeking at Your Data
    * `info()`: Getting a Summary
    * `shape`: Dimensions of Your Data
    * `columns` & `index`: Understanding Labels
    * `describe()`: Quick Statistics
3.  **Selecting Data: The Art of Subsetting** 🔍
    * Selecting Single Columns
    * Selecting Multiple Columns
    * Selecting Rows by Position (`.iloc[]`)
    * Selecting Rows by Label (`.loc[]`)
    * Boolean Indexing (Conditional Selection) ✨
4.  **Adding & Modifying Data: Making Changes** ➕
    * Adding New Columns
    * Modifying Existing Columns
    * Adding New Rows (Briefly)
5.  **Removing Data: Cleaning Up Your Table** 🗑️
    * Dropping Columns
    * Dropping Rows
6.  **Basic Operations: Calculations & Summaries** 🧮
    * Arithmetic Operations
    * Aggregating Data (`.sum()`, `.mean()`, `.min()`, `.max()`, etc.)
    * `value_counts()`: Counting Unique Values
7.  **Sorting Your Data** ↕️
8.  **Pro Tips & Common Pitfalls** 💡
9.  **Mini-Challenges: Put Your Skills to the Test!** 🧩



---

### 1. Creating DataFrames 🎨

There are many ways to create a DataFrame, but we'll focus on the most common and practical ones.

#### From Dictionaries: The Go-To Method for Small Data

Creating a DataFrame from a Python dictionary is super common, especially when you have data already structured as key-value pairs. Each key becomes a column name, and its corresponding value (a list or array) becomes the column's data.


In [1]:
import pandas as pd

# Let's create some data about our favorite fruits! 🍎🍌🍊
fruit_data = {
    'Fruit': ['Apple', 'Banana', 'Orange', 'Grape', 'Strawberry'],
    'Color': ['Red', 'Yellow', 'Orange', 'Purple', 'Red'],
    'Price_per_kg': [2.5, 1.2, 1.8, 3.0, 4.5],
    'Quantity_in_stock': [100, 150, 80, 50, 120]
}

# Create a DataFrame from the dictionary
fruits_df = pd.DataFrame(fruit_data)

# Let's see our first DataFrame!
print(fruits_df) 

        Fruit   Color  Price_per_kg  Quantity_in_stock
0       Apple     Red           2.5                100
1      Banana  Yellow           1.2                150
2      Orange  Orange           1.8                 80
3       Grape  Purple           3.0                 50
4  Strawberry     Red           4.5                120


In [2]:
from IPython.display import display

# Cleaner, table-style output via IPython
display(fruits_df)

|   | Fruit | Color | Price_per_kg | Quantity_in_stock |
|---|---|---|---|---|
| 0 | Apple | Red | 2.5 | 100 |
| 1 | Banana | Yellow | 1.2 | 150 |
| 2 | Orange | Orange | 1.8 | 80 |
| 3 | Grape | Purple | 3.0 | 50 |
| 4 | Strawberry | Red | 4.5 | 120 |



#### From Lists of Dictionaries: For Row-Oriented Data

Sometimes your data might come as a list where each element is a dictionary representing a row. Pandas handles this beautifully\!


In [3]:
# Data about students
student_records = [
    {'Name': 'Alice', 'Age': 22, 'Major': 'Computer Science', 'GPA': 3.8},
    {'Name': 'Bob', 'Age': 21, 'Major': 'Data Science', 'GPA': 3.5},
    {'Name': 'Charlie', 'Age': 23, 'Major': 'Mathematics', 'GPA': 3.9},
    {'Name': 'Diana', 'Age': 20, 'Major': 'Computer Science', 'GPA': 3.2}
]

# Create a DataFrame from a list of dictionaries
students_df = pd.DataFrame(student_records)

display(students_df)

|   | Name | Age | Major | GPA |
|---|---|---|---|---|
| 0 | Alice | 22 | Computer Science | 3.8 |
| 1 | Bob | 21 | Data Science | 3.5 |
| 2 | Charlie | 23 | Mathematics | 3.9 |
| 3 | Diana | 20 | Computer Science | 3.2 |



#### From CSV Files (Your Everyday Data Source!) 📂

In the real world, most of your data will come from files, especially Comma-Separated Values (CSV) files. Pandas makes reading them incredibly easy!

First, let's pretend we have a `sales_data.csv` file. To simulate this, we'll create a dummy CSV file. In a real scenario, this file would already exist on your computer.


In [8]:
# This cell is just to create a dummy CSV file for demonstration!
# You won't usually do this in a real project.
csv_content = """OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City
1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore
1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi
1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad
1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad
1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore
1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan
1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi
1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot
1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar
1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad
1011,Kamran Aziz,Laptop,Electronics,1,1250,2023-01-20,Quetta
1012,Nadia Ali,Mobile Phone,Electronics,2,750,2023-01-21,Gujranwala
1013,Osama Tariq,Mouse,Accessories,1,25,2023-01-22,Abbottabad
1014,Maria Iqbal,Book,Books,2,15,2023-01-23,Sargodha
1015,Talha Javed,Headphones,Accessories,2,55,2023-01-24,Sahiwal
1016,Fatima Riaz,Tablet,Electronics,1,420,2023-01-25,Gujrat
1017,Zeeshan Haider,Book,Books,3,15,2023-01-26,Bahawalpur
1018,Laiba Akram,Monitor,Electronics,1,310,2023-01-27,Dera Ghazi Khan
1019,Hashir Khan,Keyboard,Accessories,1,40,2023-01-28,Mirpur
1020,Hareem Asad,Book,Books,1,15,2023-01-29,Larkana
"""
with open('sales_data.csv', 'w') as f:
    f.write(csv_content)

print("Dummy 'sales_data.csv' created! 🎉")
# You can also check this CSV file in your folder.

Dummy 'sales_data.csv' created! 🎉


- **`with open('sales_data.csv', 'w') as f: f.write(csv_content)`**
   - This creates a new file called `sales_data.csv` in the current directory.
   - The `with` statement ensures the file is properly closed after writing.
   - `'w'` means write mode (it will overwrite the file if it already exists).
   - `f.write(csv_content)` writes the CSV string to the file.

> Use this approach only when you want to generate sample CSV files on the fly, for example in tutorials or tests.


Now, let's read it into a DataFrame\!

In [9]:
# Reading a CSV file into a DataFrame
sales_df = pd.read_csv('sales_data.csv')

display(sales_df)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 1 | 1002 | Sana Khan | Mobile Phone | Electronics | 1 | 800 | 2023-01-11 | Karachi |
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |
| 3 | 1004 | Ayesha Noor | Book | Books | 3 | 15 | 2023-01-13 | Faisalabad |
| 4 | 1005 | Bilal Aslam | Headphones | Accessories | 1 | 60 | 2023-01-14 | Lahore |
| 5 | 1006 | Zainab Malik | Tablet | Electronics | 1 | 400 | 2023-01-15 | Multan |
| 6 | 1007 | Umer Farooq | Book | Books | 2 | 15 | 2023-01-16 | Rawalpindi |
| 7 | 1008 | Hina Ahmed | Monitor | Electronics | 1 | 300 | 2023-01-17 | Sialkot |
| 8 | 1009 | Imran Qureshi | Keyboard | Accessories | 1 | 45 | 2023-01-18 | Peshawar |
| 9 | 1010 | Farah Yousaf | Book | Books | 1 | 15 | 2023-01-19 | Hyderabad |



**💡 Pro Tip:** Pandas can read many other file types too, like Excel (`pd.read_excel()`), JSON (`pd.read_json()`), and even data directly from SQL databases (`pd.read_sql()`)!
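
As one example of another format, the sketch below round-trips a tiny, made-up DataFrame through JSON. It keeps everything in memory with `io.StringIO` so no file is needed; with a real file you would pass its path to `pd.read_json()` instead:

```python
import io
import pandas as pd

original = pd.DataFrame({"Product": ["Laptop", "Mouse"], "UnitPrice": [1200, 20]})

# Serialize to JSON text, one object per row
json_text = original.to_json(orient="records")
print(json_text)

# Read the JSON text back into a DataFrame
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```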



-----

### **2. Exploring Your DataFrame: First Look & Info 👀**

Once you have a DataFrame, the first thing you'll want to do is get a sense of its contents and structure.

Let's use our `sales_df` for these examples.



#### **`head()`** & **`tail()`**: Peeking at Your Data 👀

One of the most commonly used methods for getting a quick overview of a DataFrame is `head()`.
* `.head(n)`: Shows the **first `n` rows** (default is 5). Great for a quick preview.




In [10]:
# Display the first 3 rows of sales_df
sales_df.head(3)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 1 | 1002 | Sana Khan | Mobile Phone | Electronics | 1 | 800 | 2023-01-11 | Karachi |
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |


There is also a **`tail()`** method for viewing the last rows of the DataFrame.
* `.tail(n)`: Shows the **last `n` rows** (default is 5). Useful for checking recently added data.

In [11]:
# Display the last 5 rows of sales_df
sales_df.tail()

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 15 | 1016 | Fatima Riaz | Tablet | Electronics | 1 | 420 | 2023-01-25 | Gujrat |
| 16 | 1017 | Zeeshan Haider | Book | Books | 3 | 15 | 2023-01-26 | Bahawalpur |
| 17 | 1018 | Laiba Akram | Monitor | Electronics | 1 | 310 | 2023-01-27 | Dera Ghazi Khan |
| 18 | 1019 | Hashir Khan | Keyboard | Accessories | 1 | 40 | 2023-01-28 | Mirpur |
| 19 | 1020 | Hareem Asad | Book | Books | 1 | 15 | 2023-01-29 | Larkana |



#### `info()`: Getting a Concise Summary ℹ️

`info()` is one of the most useful methods for a quick overview of a DataFrame's data types, missing values, and size. It tells you:

  * The number of entries (rows).
  * The number of columns.
  * Column names.
  * Number of non-null values in each column (handy for spotting missing data!).
  * Data type (`dtype`) of each column.
  * Memory usage.


In [12]:
# Get a summary of the DataFrame
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   OrderID       20 non-null     int64 
 1   CustomerName  20 non-null     object
 2   Product       20 non-null     object
 3   Category      20 non-null     object
 4   Quantity      20 non-null     int64 
 5   UnitPrice     20 non-null     int64 
 6   OrderDate     20 non-null     object
 7   City          20 non-null     object
dtypes: int64(3), object(5)
memory usage: 1.4+ KB


**Result Explained**
- The result tells us there are 20 rows and 8 columns:
   - `RangeIndex: 20 entries, 0 to 19`
   - `Data columns (total 8 columns):`

- It also lists the name of each column, with its non-null count and data type:

| # | Column | Non-Null Count | Dtype |
| --- | --- | --- | --- |
| 0 | OrderID | 20 non-null | int64 |
| 1 | CustomerName | 20 non-null | object |
| 2 | Product | 20 non-null | object |
| 3 | Category | 20 non-null | object |
| 4 | Quantity | 20 non-null | int64 |
| 5 | UnitPrice | 20 non-null | int64 |
| 6 | OrderDate | 20 non-null | object |
| 7 | City | 20 non-null | object |

- The final line, `dtypes: int64(3), object(5)`, counts the columns of each type: 3 integer columns and 5 `object` (text) columns.

#### **Null Values**
The `info()` method also tells us how many Non-Null values there are present in each column, and in our data set it seems like there are no null values.

Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data, and you will learn more about that in the next chapters.
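
Our sales data has no gaps, so here is a minimal sketch on a separate, made-up DataFrame with one missing price. It counts the missing values per column with `isna().sum()` and then drops the incomplete row with `dropna()`, which is one simple cleaning option among several:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Book"],
    "UnitPrice": [1200.0, np.nan, 15.0],  # the Mouse price is missing
})

# Count missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value (returns a new DataFrame)
cleaned = df.dropna()
print(cleaned)
```

Dropping rows throws information away, so on real data it is worth checking how many rows would be lost before doing it.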


---

#### **`shape`**: Dimensions of Your Data 📏

The number of rows and columns of a DataFrame can be read from its **`.shape`** attribute. It returns a **tuple `(rows, columns)`**, which you can index to get only the row count or only the column count.


In [None]:
sales_df.shape    # Get the number of rows and columns (Output: (20, 8))
sales_df.shape[0] # Get the number of rows only (Output: 20)
sales_df.shape[1] # Get the number of columns only (Output: 8)

8


#### **Getting All Column Names**

  * The **`.columns`** attribute of a DataFrame returns the column names as an `Index` object. As a reminder, a pandas index holds the labels of the rows or columns.

In [155]:
# Get column names
sales_df.columns

Index(['OrderID', 'CustomerName', 'Product', 'Category', 'Quantity',
       'UnitPrice', 'OrderDate', 'City'],
      dtype='object')

In [None]:
# To get a plain Python list of column names
list(sales_df.columns)

['OrderID',
 'CustomerName',
 'Product',
 'Category',
 'Quantity',
 'UnitPrice',
 'OrderDate',
 'City']


----

#### `describe()`: Quick Statistics for Numerical Columns 📈

`describe()` generates descriptive statistics of your numerical columns, including:

  * `count`: Number of non-null values.
  * `mean`: Average value.
  * `std`: Standard deviation (how spread out the data is).
  * `min`/`max`: Minimum and maximum values.
  * `25%`, `50%` (median), `75%`: Quartiles.



In [149]:
# To Get descriptive statistics for all numerical columns
sales_df.describe()

# Get descriptive statistics for specific column
#sales_df.UnitPrice.describe()

|   | OrderID | Quantity | UnitPrice |
|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 |
| mean | 1010.5 | 1.45 | 288.25 |
| std | 5.91608 | 0.686333 | 402.516018 |
| min | 1001.0 | 1.0 | 15.0 |
| 25% | 1005.75 | 1.0 | 15.0 |
| 50% | 1010.5 | 1.0 | 50.0 |
| 75% | 1015.25 | 2.0 | 405.0 |
| max | 1020.0 | 3.0 | 1250.0 |


> Practitioners often find these statistics easier to read after **transposing** them with the `.T` attribute.

In [148]:
sales_df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| OrderID | 20.0 | 1010.5 | 5.91608 | 1001.0 | 1005.75 | 1010.5 | 1015.25 | 1020.0 |
| Quantity | 20.0 | 1.45 | 0.686333 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| UnitPrice | 20.0 | 288.25 | 402.516018 | 15.0 | 15.0 | 50.0 | 405.0 | 1250.0 |
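
`describe()` is not limited to numbers. As a small sketch on made-up data, calling it on a text column returns a different set of statistics: the count, the number of unique values, the most frequent value (`top`), and how often that value appears (`freq`):

```python
import pandas as pd

sales = pd.DataFrame({
    "Category": ["Electronics", "Books", "Books", "Accessories"],
    "Quantity": [1, 3, 2, 1],
})

# Summary of a text column: count, unique, top, freq
category_summary = sales["Category"].describe()
print(category_summary)
```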



-----

### **3. Selecting Data: The Art of Subsetting 🔍**

Accessing specific parts of your DataFrame is crucial. Pandas offers powerful and flexible ways to select data.

#### **Selecting Single Columns:**

You can select a single column using bracket notation, similar to how you access values in a dictionary. This will return a **Pandas Series**.


In [32]:
# Select the 'Product' column
products = sales_df['Product']
print(products)
print(type(products)) # It's a Pandas Series!

0           Laptop
1     Mobile Phone
2            Mouse
3             Book
4       Headphones
5           Tablet
6             Book
7          Monitor
8         Keyboard
9             Book
10          Laptop
11    Mobile Phone
12           Mouse
13            Book
14      Headphones
15          Tablet
16            Book
17         Monitor
18        Keyboard
19            Book
Name: Product, dtype: object
<class 'pandas.core.series.Series'>



#### **Selecting Multiple Columns:**

To select multiple columns, pass a *list of column names* inside the brackets. This will return a **new DataFrame**.


In [33]:
# Select the 'Product' and 'UnitPrice' columns
product_prices = sales_df[['Product', 'UnitPrice']]
print(product_prices)
print(type(product_prices)) # It's a DataFrame!

         Product  UnitPrice
0         Laptop       1200
1   Mobile Phone        800
2          Mouse         20
3           Book         15
4     Headphones         60
5         Tablet        400
6           Book         15
7        Monitor        300
8       Keyboard         45
9           Book         15
10        Laptop       1250
11  Mobile Phone        750
12         Mouse         25
13          Book         15
14    Headphones         55
15        Tablet        420
16          Book         15
17       Monitor        310
18      Keyboard         40
19          Book         15
<class 'pandas.core.frame.DataFrame'>



> **⚠️ Common Pitfall:** Forgetting the double brackets `[[]]` when selecting multiple columns. `df['col1', 'col2']` raises a `KeyError`, because Pandas looks for a single column labeled `('col1', 'col2')`. Use `df[['col1', 'col2']]` instead.
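
Here is the pitfall reproduced on a tiny throwaway DataFrame (the column names `col1`, `col2`, and `col3` are placeholders), next to the correct form:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

try:
    df["col1", "col2"]  # treated as ONE key: the tuple ('col1', 'col2')
except KeyError as err:
    print("KeyError:", err)

subset = df[["col1", "col2"]]  # a list of names selects both columns
print(subset)
```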



----

#### **Selecting Rows by Position (`.iloc[]`)** 📍

`.iloc[]` (integer-location) is used to select rows (and/or columns) by their **integer position** (0-based index), just like slicing a Python list.

In [35]:
# Select the first row
sales_df.iloc[0]

OrderID                1001
CustomerName       Ali Raza
Product              Laptop
Category        Electronics
Quantity                  1
UnitPrice              1200
OrderDate        2023-01-10
City                 Lahore
Name: 0, dtype: object

In [36]:
# Select rows from index 1 up to (but not including) 4
sales_df.iloc[1:4]

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 1 | 1002 | Sana Khan | Mobile Phone | Electronics | 1 | 800 | 2023-01-11 | Karachi |
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |
| 3 | 1004 | Ayesha Noor | Book | Books | 3 | 15 | 2023-01-13 | Faisalabad |


In [37]:
# Select specific rows by their positions
sales_df.iloc[[0, 2, 5]]

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |
| 5 | 1006 | Zainab Malik | Tablet | Electronics | 1 | 400 | 2023-01-15 | Multan |


In [None]:
# Select the value at row 2, column 1
sales_df.iloc[2, 1] 

'Ahmed Shah'

In [44]:
# Select all rows for the columns at positions 1 and 3 ('CustomerName' and 'Category')
sales_df.iloc[:, [1, 3]]

|   | CustomerName | Category |
|---|---|---|
| 0 | Ali Raza | Electronics |
| 1 | Sana Khan | Electronics |
| 2 | Ahmed Shah | Accessories |
| 3 | Ayesha Noor | Books |
| 4 | Bilal Aslam | Accessories |
| 5 | Zainab Malik | Electronics |
| 6 | Umer Farooq | Books |
| 7 | Hina Ahmed | Electronics |
| 8 | Imran Qureshi | Accessories |
| 9 | Farah Yousaf | Books |



---

#### **Selecting Rows by Label (`.loc[]`)** 🏷️

`.loc[]` (label-location) is used to select rows (and/or columns) by their **labels**. If your DataFrame has a default integer index, you can still use those integers as labels.


In [45]:
# Select the row with index label 0 (which is the first row by default)
sales_df.loc[0]

OrderID                1001
CustomerName       Ali Raza
Product              Laptop
Category        Electronics
Quantity                  1
UnitPrice              1200
OrderDate        2023-01-10
City                 Lahore
Name: 0, dtype: object

In [46]:
# Select rows with index labels from 1 to 3 (inclusive)
# Note: .loc[] is inclusive for slicing!
sales_df.loc[1:3]

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 1 | 1002 | Sana Khan | Mobile Phone | Electronics | 1 | 800 | 2023-01-11 | Karachi |
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |
| 3 | 1004 | Ayesha Noor | Book | Books | 3 | 15 | 2023-01-13 | Faisalabad |


In [47]:
# Select specific rows by their labels
sales_df.loc[[0, 2, 5]]

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |
| 5 | 1006 | Zainab Malik | Tablet | Electronics | 1 | 400 | 2023-01-15 | Multan |


In [48]:
# Select the value at row label 1, column label 'Product'
sales_df.loc[1, 'Product'] # Output: 'Mobile Phone'

'Mobile Phone'

In [49]:
# Select all rows for the columns 'Product' and 'UnitPrice'
sales_df.loc[:, ['Product', 'UnitPrice']]

|   | Product | UnitPrice |
|---|---|---|
| 0 | Laptop | 1200 |
| 1 | Mobile Phone | 800 |
| 2 | Mouse | 20 |
| 3 | Book | 15 |
| 4 | Headphones | 60 |
| 5 | Tablet | 400 |
| 6 | Book | 15 |
| 7 | Monitor | 300 |
| 8 | Keyboard | 45 |
| 9 | Book | 15 |



**💡 Pro Tip:**

  * Use `.iloc[]` when you know the *position* (numerical index) of the rows/columns you want.
  * Use `.loc[]` when you know the *label* (name) of the rows/columns you want.



---

#### **Boolean Indexing (Conditional Selection) ✨**

This is where Pandas truly shines! You can select rows based on a condition applied to one or more columns. It's like asking your data a question!

In [56]:
# Select all sales where the Quantity is greater than 1
high_quantity_sales = sales_df[sales_df['Quantity'] > 1]
display(high_quantity_sales)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 2 | 1003 | Ahmed Shah | Mouse | Accessories | 2 | 20 | 2023-01-12 | Islamabad |
| 3 | 1004 | Ayesha Noor | Book | Books | 3 | 15 | 2023-01-13 | Faisalabad |
| 6 | 1007 | Umer Farooq | Book | Books | 2 | 15 | 2023-01-16 | Rawalpindi |
| 11 | 1012 | Nadia Ali | Mobile Phone | Electronics | 2 | 750 | 2023-01-21 | Gujranwala |
| 13 | 1014 | Maria Iqbal | Book | Books | 2 | 15 | 2023-01-23 | Sargodha |
| 14 | 1015 | Talha Javed | Headphones | Accessories | 2 | 55 | 2023-01-24 | Sahiwal |
| 16 | 1017 | Zeeshan Haider | Book | Books | 3 | 15 | 2023-01-26 | Bahawalpur |


In [55]:
# Select all sales for 'Laptop'
laptop_sales = sales_df[sales_df['Product'] == 'Laptop']
display(laptop_sales)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 10 | 1011 | Kamran Aziz | Laptop | Electronics | 1 | 1250 | 2023-01-20 | Quetta |


In [54]:
# Combine multiple conditions using & (AND) or | (OR)
# Select sales of 'Laptop' where the UnitPrice is greater than 1000
expensive_laptop_sales = sales_df[(sales_df['Product'] == 'Laptop') & (sales_df['UnitPrice'] > 1000)]
display(expensive_laptop_sales)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 10 | 1011 | Kamran Aziz | Laptop | Electronics | 1 | 1250 | 2023-01-20 | Quetta |


In [57]:
# Select sales where Quantity is 1 OR Product is 'Webcam'
one_item_or_webcam_sales = sales_df[(sales_df['Quantity'] == 1) | (sales_df['Product'] == 'Webcam')]
display(one_item_or_webcam_sales)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 1 | 1002 | Sana Khan | Mobile Phone | Electronics | 1 | 800 | 2023-01-11 | Karachi |
| 4 | 1005 | Bilal Aslam | Headphones | Accessories | 1 | 60 | 2023-01-14 | Lahore |
| 5 | 1006 | Zainab Malik | Tablet | Electronics | 1 | 400 | 2023-01-15 | Multan |
| 7 | 1008 | Hina Ahmed | Monitor | Electronics | 1 | 300 | 2023-01-17 | Sialkot |
| 8 | 1009 | Imran Qureshi | Keyboard | Accessories | 1 | 45 | 2023-01-18 | Peshawar |
| 9 | 1010 | Farah Yousaf | Book | Books | 1 | 15 | 2023-01-19 | Hyderabad |
| 10 | 1011 | Kamran Aziz | Laptop | Electronics | 1 | 1250 | 2023-01-20 | Quetta |
| 12 | 1013 | Osama Tariq | Mouse | Accessories | 1 | 25 | 2023-01-22 | Abbottabad |
| 15 | 1016 | Fatima Riaz | Tablet | Electronics | 1 | 420 | 2023-01-25 | Gujrat |


In [59]:
# Select sales where the Product is either 'Laptop' or 'Monitor' using .isin()
selected_products = sales_df[sales_df['Product'].isin(['Laptop', 'Monitor'])]
display(selected_products)

|   | OrderID | CustomerName | Product | Category | Quantity | UnitPrice | OrderDate | City |
|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Ali Raza | Laptop | Electronics | 1 | 1200 | 2023-01-10 | Lahore |
| 7 | 1008 | Hina Ahmed | Monitor | Electronics | 1 | 300 | 2023-01-17 | Sialkot |
| 10 | 1011 | Kamran Aziz | Laptop | Electronics | 1 | 1250 | 2023-01-20 | Quetta |
| 17 | 1018 | Laiba Akram | Monitor | Electronics | 1 | 310 | 2023-01-27 | Dera Ghazi Khan |



**⚠️ Common Pitfall:**

  * Using `and` or `or` instead of `&` or `|` to combine conditions is a common beginner's mistake. Python's `and`/`or` expect a single `True`/`False` value, so applying them to a Series raises a `ValueError`; `&`/`|` perform element-wise logical operations on Series.
  * Forgetting parentheses `()` around each condition when combining them. `&` and `|` have higher precedence than comparison operators, so `a == 1 & b > 2` is not parsed as `(a == 1) & (b > 2)`.
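To see the first pitfall in action, here is a minimal sketch. The three-row `demo_df` is a hypothetical stand-in for `sales_df`, made up purely for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for sales_df
demo_df = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Laptop'],
    'UnitPrice': [1200, 20, 900],
})

is_laptop = demo_df['Product'] == 'Laptop'
is_expensive = demo_df['UnitPrice'] > 1000

# Python's `and` asks each Series for one True/False answer, which pandas refuses
try:
    demo_df[is_laptop and is_expensive]
    and_error = None
except ValueError as exc:
    and_error = str(exc)

# `&` compares the two Series row by row and returns a boolean Series
expensive_laptops = demo_df[is_laptop & is_expensive]

print("Error from `and`:", and_error)
print(expensive_laptops)
```

Storing each condition in its own variable first, as done here, also sidesteps the parentheses problem.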



-----

### 4\. Adding & Modifying Data: Making Changes ➕

Data is rarely static. You'll often need to add new information or update existing entries.

Let's work with a copy of `sales_df` to avoid altering our original data for these examples.


In [None]:
import pandas as pd

pd.read_csv("sales_data.csv").to_csv("sales_data_copy.csv", index=False)

sales_df_copy = pd.read_csv("sales_data_copy.csv")
# display(sales_df_copy)


**🔍 Explanation**
- **`pd.read_csv("...")`**: Loads the CSV into a DataFrame.
- **`to_csv("...")`**: Saves a copy as a new CSV file.
- **`index=False`**: Prevents pandas from writing the row numbers (the index) into the file as an extra column.



#### Adding New Columns: Enriching Your Data 🌟

You can add a new column by simply assigning a Series or a list of values to a new column name. The length of the new data must match the number of rows in the DataFrame.
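The length rule is easy to check on a tiny example. This is a sketch on a hypothetical three-row `demo_df` (not the sales data). It also shows two related behaviours: a single scalar is repeated for every row, and a Series is matched up by index label rather than by position.

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only
demo_df = pd.DataFrame({'Quantity': [1, 2, 3], 'UnitPrice': [10, 20, 30]})

# A list with one value per row is assigned in order
demo_df['In_Stock'] = [True, False, True]

# A list with the wrong length raises ValueError and adds nothing
try:
    demo_df['Too_Short'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# A single scalar is broadcast to every row
demo_df['Currency'] = 'USD'

# A Series is aligned on the index; rows without a matching label get NaN
demo_df['Partial'] = pd.Series([9], index=[1])

print(length_error)
print(demo_df)
```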


In [71]:
# Add a 'Total_Price' column (Quantity * UnitPrice)
sales_df_copy['Total_Price'] = sales_df_copy['Quantity'] * sales_df_copy['UnitPrice']
print("DataFrame after adding 'Total_Price':")
display(sales_df_copy)

DataFrame after adding 'Total_Price':


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Total_Price
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore,1200
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi,800
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad,40
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad,15


In [74]:
# Add a 'Discount_Applied' column with a boolean value
sales_df_copy['Discount_Applied'] = [False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True ]
print("\nDataFrame after adding 'Discount_Applied':")
display(sales_df_copy)


DataFrame after adding 'Discount_Applied':


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Total_Price,Discount_Applied
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore,1200,False
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi,800,True
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad,40,False
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45,False
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore,60,True
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan,400,False
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30,False
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot,300,True
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar,45,False
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad,15,False



#### Modifying Existing Columns: Updating Your Information ✍️

You can modify an entire column or specific values within a column.


In [76]:
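# Let's say all prices increased by 10%
# Note: re-running this cell compounds the increase (1.10 * 1.10 = 1.21).
# The output below shows a 21% increase (1200 -> 1452.0), consistent with the cell having been run twice.
sales_df_copy['UnitPrice'] = sales_df_copy['UnitPrice'] * 1.10
print("\nDataFrame after updating 'UnitPrice' column:")
display(sales_df_copy)


DataFrame after updating 'UnitPrice' column: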


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Total_Price,Discount_Applied
0,1001,Ali Raza,Laptop,Electronics,1,1452.0,2023-01-10,Lahore,1200,False
1,1002,Sana Khan,Mobile Phone,Electronics,1,968.0,2023-01-11,Karachi,800,True
2,1003,Ahmed Shah,Mouse,Accessories,2,24.2,2023-01-12,Islamabad,40,False
3,1004,Ayesha Noor,Book,Books,3,18.15,2023-01-13,Faisalabad,45,False
4,1005,Bilal Aslam,Headphones,Accessories,1,72.6,2023-01-14,Lahore,60,True
5,1006,Zainab Malik,Tablet,Electronics,1,484.0,2023-01-15,Multan,400,False
6,1007,Umer Farooq,Book,Books,2,18.15,2023-01-16,Rawalpindi,30,False
7,1008,Hina Ahmed,Monitor,Electronics,1,363.0,2023-01-17,Sialkot,300,True
8,1009,Imran Qureshi,Keyboard,Accessories,1,54.45,2023-01-18,Peshawar,45,False
9,1010,Farah Yousaf,Book,Books,1,18.15,2023-01-19,Hyderabad,15,False


In [78]:
# Update the 'Quantity' for 'Book' sales to 5
sales_df_copy.loc[sales_df_copy['Product'] == 'Book', 'Quantity'] = 5
print("\nDataFrame after updating 'Book' quantity:")
display(sales_df_copy)


DataFrame after updating 'Book' quantity:


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Total_Price,Discount_Applied
0,1001,Ali Raza,Laptop,Electronics,1,1452.0,2023-01-10,Lahore,1200,False
1,1002,Sana Khan,Mobile Phone,Electronics,1,968.0,2023-01-11,Karachi,800,True
2,1003,Ahmed Shah,Mouse,Accessories,2,24.2,2023-01-12,Islamabad,40,False
3,1004,Ayesha Noor,Book,Books,5,18.15,2023-01-13,Faisalabad,45,False
4,1005,Bilal Aslam,Headphones,Accessories,1,72.6,2023-01-14,Lahore,60,True
5,1006,Zainab Malik,Tablet,Electronics,1,484.0,2023-01-15,Multan,400,False
6,1007,Umer Farooq,Book,Books,5,18.15,2023-01-16,Rawalpindi,30,False
7,1008,Hina Ahmed,Monitor,Electronics,1,363.0,2023-01-17,Sialkot,300,True
8,1009,Imran Qureshi,Keyboard,Accessories,1,54.45,2023-01-18,Peshawar,45,False
9,1010,Farah Yousaf,Book,Books,5,18.15,2023-01-19,Hyderabad,15,False



#### Adding New Rows (Briefly) ➡️

Adding rows directly to an existing DataFrame can be less straightforward and often less efficient, especially for large datasets. A common approach is to create a new DataFrame for the new row(s) and then use `pd.concat()`. We'll cover `pd.concat()` in more detail when we discuss combining DataFrames.


In [82]:
# Creating a new row as a DataFrame (keys must match the existing column names, e.g. 'OrderDate')
new_sale = pd.DataFrame([{'OrderID': 1021, 'CustomerName': 'Hanan Akram', 'Product': 'Speaker', 'Quantity': 1, 'UnitPrice': 280, 'City': 'Garha More', 'OrderDate': '2023-01-19'}])

# Concatenating the new row to the existing DataFrame
# Note: ignore_index=True renumbers the index from 0 after concatenation
# Columns missing from new_sale (such as 'Category' and 'Total_Price') are filled with NaN for the new row
updated_sales_df = pd.concat([sales_df_copy, new_sale], ignore_index=True)
print("\nDataFrame after adding a new row:")
display(updated_sales_df)


DataFrame after adding a new row:


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Total_Price,Discount_Applied
0,1001,Ali Raza,Laptop,Electronics,1,1452.0,2023-01-10,Lahore,1200.0,False
1,1002,Sana Khan,Mobile Phone,Electronics,1,968.0,2023-01-11,Karachi,800.0,True
2,1003,Ahmed Shah,Mouse,Accessories,2,24.2,2023-01-12,Islamabad,40.0,False
3,1004,Ayesha Noor,Book,Books,5,18.15,2023-01-13,Faisalabad,45.0,False
4,1005,Bilal Aslam,Headphones,Accessories,1,72.6,2023-01-14,Lahore,60.0,True
5,1006,Zainab Malik,Tablet,Electronics,1,484.0,2023-01-15,Multan,400.0,False
6,1007,Umer Farooq,Book,Books,5,18.15,2023-01-16,Rawalpindi,30.0,False
7,1008,Hina Ahmed,Monitor,Electronics,1,363.0,2023-01-17,Sialkot,300.0,True
8,1009,Imran Qureshi,Keyboard,Accessories,1,54.45,2023-01-18,Peshawar,45.0,False
9,1010,Farah Yousaf,Book,Books,5,18.15,2023-01-19,Hyderabad,15.0,False
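When several rows need to be added, one common pattern is to collect them in a plain Python list and call `pd.concat()` once, rather than concatenating inside a loop. The sketch below uses a hypothetical two-column `existing` frame to keep things short.

```python
import pandas as pd

# Hypothetical small frame standing in for the sales data
existing = pd.DataFrame({'OrderID': [1001, 1002], 'Product': ['Laptop', 'Mouse']})

# Collect the new rows as dictionaries first; keys match the column names exactly
new_rows = [
    {'OrderID': 1003, 'Product': 'Book'},
    {'OrderID': 1004, 'Product': 'Tablet'},
]

# Build one DataFrame from the list and concatenate a single time
combined = pd.concat([existing, pd.DataFrame(new_rows)], ignore_index=True)
print(combined)
```

Because the dictionary keys match the existing column names, no extra columns appear in `combined`.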



-----

### **5\. Removing Data: Cleaning Up Your Table 🗑️**

Sometimes you need to get rid of unnecessary columns or rows.

Let's use our `sales_df_copy` again.



#### **Dropping Columns: Say Goodbye to Unwanted Data 👋**

The `.drop()` method is used for removing rows or columns.

  * To drop columns, specify `axis=1` or `axis='columns'`.
  * `inplace=True` modifies the DataFrame directly; otherwise, it returns a new DataFrame with the column(s) dropped.
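Here is a small sketch of the non-`inplace` form, using a hypothetical three-column `demo_df`. It also uses the `columns=` keyword of `.drop()`, which is an equivalent way to name columns without passing `axis`.

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Without inplace=True, drop() returns a new DataFrame and leaves demo_df untouched
without_b = demo_df.drop('B', axis=1)

# columns= does the same job as axis=1 / axis='columns'
without_b_c = demo_df.drop(columns=['B', 'C'])

print(list(demo_df.columns))      # original still has all three columns
print(list(without_b.columns))
print(list(without_b_c.columns))
```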


In [None]:
# Drop a single column: 'Discount_Applied'
sales_df_copy.drop('Discount_Applied', axis=1, inplace=True)
print("\nDataFrame after dropping 'Discount_Applied':")
display(sales_df_copy)


DataFrame after dropping 'Discount_Applied':


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Total_Price
0,1001,Ali Raza,Laptop,Electronics,1,1452.0,2023-01-10,Lahore,1200
1,1002,Sana Khan,Mobile Phone,Electronics,1,968.0,2023-01-11,Karachi,800
2,1003,Ahmed Shah,Mouse,Accessories,2,24.2,2023-01-12,Islamabad,40
3,1004,Ayesha Noor,Book,Books,5,18.15,2023-01-13,Faisalabad,45
4,1005,Bilal Aslam,Headphones,Accessories,1,72.6,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,1,484.0,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,5,18.15,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,1,363.0,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,1,54.45,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,5,18.15,2023-01-19,Hyderabad,15


In [None]:
# Drop multiple columns: 'Quantity' and 'UnitPrice'
sales_df_copy.drop(['Quantity', 'UnitPrice'], axis='columns', inplace=True)
print("\nDataFrame after dropping 'Quantity' and 'UnitPrice':")
display(sales_df_copy)


DataFrame after dropping 'Quantity' and 'UnitPrice':


Unnamed: 0,OrderID,CustomerName,Product,Category,OrderDate,City,Total_Price
0,1001,Ali Raza,Laptop,Electronics,2023-01-10,Lahore,1200
1,1002,Sana Khan,Mobile Phone,Electronics,2023-01-11,Karachi,800
2,1003,Ahmed Shah,Mouse,Accessories,2023-01-12,Islamabad,40
3,1004,Ayesha Noor,Book,Books,2023-01-13,Faisalabad,45
4,1005,Bilal Aslam,Headphones,Accessories,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,2023-01-19,Hyderabad,15



#### **Dropping Rows: Removing Specific Records ❌**

To drop rows, specify `axis=0` or `axis='index'` and the index label(s) of the rows you want to remove.


In [None]:
# Drop the row with index label 0 (the first row)
sales_df_copy.drop(0, axis=0, inplace=True)
print("DataFrame after dropping row 0:")
display(sales_df_copy)

DataFrame after dropping row 0:


Unnamed: 0,OrderID,CustomerName,Product,Category,OrderDate,City,Total_Price
1,1002,Sana Khan,Mobile Phone,Electronics,2023-01-11,Karachi,800
2,1003,Ahmed Shah,Mouse,Accessories,2023-01-12,Islamabad,40
3,1004,Ayesha Noor,Book,Books,2023-01-13,Faisalabad,45
4,1005,Bilal Aslam,Headphones,Accessories,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,2023-01-19,Hyderabad,15
10,1011,Kamran Aziz,Laptop,Electronics,2023-01-20,Quetta,1250


In [93]:
# Drop rows with index labels 1 and 3
sales_df_copy.drop([1, 3], axis='index', inplace=True)
print("\nDataFrame after dropping rows 1 and 3:")
display(sales_df_copy)


DataFrame after dropping rows 1 and 3:


Unnamed: 0,OrderID,CustomerName,Product,Category,OrderDate,City,Total_Price
2,1003,Ahmed Shah,Mouse,Accessories,2023-01-12,Islamabad,40
4,1005,Bilal Aslam,Headphones,Accessories,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,2023-01-19,Hyderabad,15
10,1011,Kamran Aziz,Laptop,Electronics,2023-01-20,Quetta,1250
11,1012,Nadia Ali,Mobile Phone,Electronics,2023-01-21,Gujranwala,1500
12,1013,Osama Tariq,Mouse,Accessories,2023-01-22,Abbottabad,25


In [94]:
# You can also drop rows based on a condition!
# For example, drop all sales of 'Mouse'
# One way is a two-step process: first find the index labels of the matching rows, then drop them.
indices_to_drop = sales_df_copy[sales_df_copy['Product'] == 'Mouse'].index
sales_df_copy.drop(indices_to_drop, inplace=True)
print("\nDataFrame after dropping 'Mouse' sales:")
display(sales_df_copy)


DataFrame after dropping 'Mouse' sales:


Unnamed: 0,OrderID,CustomerName,Product,Category,OrderDate,City,Total_Price
4,1005,Bilal Aslam,Headphones,Accessories,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,2023-01-19,Hyderabad,15
10,1011,Kamran Aziz,Laptop,Electronics,2023-01-20,Quetta,1250
11,1012,Nadia Ali,Mobile Phone,Electronics,2023-01-21,Gujranwala,1500
13,1014,Maria Iqbal,Book,Books,2023-01-23,Sargodha,30
14,1015,Talha Javed,Headphones,Accessories,2023-01-24,Sahiwal,110




**💡 Pro Tip:** Be careful with `inplace=True`! It permanently modifies your DataFrame. If you want to keep the original DataFrame, assign the result to a new variable: `df_new = df.drop('col_to_drop', axis=1)`.
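Dropping rows by condition can also be written as a filter that keeps everything else, which avoids `inplace=True` altogether. A minimal sketch on a hypothetical four-row `demo_df`:

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo_df = pd.DataFrame({
    'Product': ['Mouse', 'Book', 'Mouse', 'Tablet'],
    'Total_Price': [40, 45, 25, 400],
})

# Keep every row whose Product is NOT 'Mouse'
no_mouse = demo_df[demo_df['Product'] != 'Mouse']

# Same result with ~ (element-wise NOT), handy when excluding several values
no_mouse_alt = demo_df[~demo_df['Product'].isin(['Mouse'])]

print(no_mouse)
```

Both results keep the original index labels of the surviving rows, and `demo_df` itself is left unchanged.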



-----

### **6\. Basic Operations: Calculations & Summaries 🧮**

Pandas allows you to perform mathematical operations on entire columns and summarize your data quickly.

Let's use our `sales_df` again (re-read it to ensure it's fresh for these examples).


In [96]:
sales_df = pd.read_csv('sales_data.csv')
print("Fresh sales_df for calculations:")
display(sales_df)

Fresh sales_df for calculations:


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad



#### Arithmetic Operations: Column-wise Math 🔢

You can perform standard arithmetic operations directly on columns (Series). Pandas will apply the operation element-wise.


In [98]:
# Calculate the total revenue for each sale (already done, but good to re-emphasize)
sales_df['Revenue'] = sales_df['Quantity'] * sales_df['UnitPrice']
print("\nDataFrame with 'Revenue' column:")
display(sales_df)


DataFrame with 'Revenue' column:


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Revenue
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore,1200
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi,800
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad,40
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore,60
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan,400
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot,300
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar,45
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad,15


In [106]:
# Add a tax of 5% to the price
sales_df['Price_After_Tax'] = sales_df['UnitPrice'] * 1.05
print("\nDataFrame with 'Price_After_Tax' column:")
display(sales_df)


DataFrame with 'Price_After_Tax' column:


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Revenue,Price_After_Tax
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore,1200,1260.0
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi,800,840.0
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad,40,21.0
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45,15.75
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore,60,63.0
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan,400,420.0
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30,15.75
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot,300,315.0
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar,45,47.25
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad,15,15.75



#### Aggregating Data: Getting Summary Statistics 📊

Pandas provides many built-in aggregation functions to quickly summarize your data.

  * `.sum()`: Sum of values.
  * `.mean()`: Average value.
  * `.median()`: Middle value.
  * `.min()`: Minimum value.
  * `.max()`: Maximum value.
  * `.std()`: Standard deviation.
  * `.count()`: Number of non-null values.
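Several of these summaries can be requested in one call. The sketch below uses `.agg()` with a list of function names and `.describe()` for a quick overview; the three-row `demo_df` is hypothetical.

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo_df = pd.DataFrame({'Quantity': [1, 2, 3], 'UnitPrice': [10.0, 20.0, 60.0]})

# .agg() accepts a list of aggregation names and returns one value per name
summary = demo_df['UnitPrice'].agg(['sum', 'mean', 'median', 'min', 'max'])

# .describe() reports count, mean, std, min, quartiles and max for numeric columns
overview = demo_df.describe()

print(summary)
print(overview)
```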



In [112]:
# Calculate the total quantity sold across all products
total_quantity_sold = sales_df['Quantity'].sum()
print(f"Total quantity sold: {total_quantity_sold} units")

Total quantity sold: 29 units


In [111]:
# Calculate the average price of products
average_price = sales_df['UnitPrice'].mean()
print(f"Average product price: ${average_price:.2f}")

Average product price: $288.25


In [113]:
# Find the maximum revenue from a single sale
max_revenue = sales_df['Revenue'].max()
print(f"Maximum revenue from a single sale: ${max_revenue:.2f}")

Maximum revenue from a single sale: $1500.00



#### **`value_counts()`: Counting Unique Values 🧐**

`value_counts()` is fantastic for understanding the distribution of categorical data. It returns a Series showing the counts of unique values in a column.


In [118]:
# Count how many times each product appears in the sales data
product_counts = sales_df['Product'].value_counts()
print("\nProduct sales counts:")
display(product_counts)


Product sales counts:


Product
Book            6
Laptop          2
Mobile Phone    2
Mouse           2
Headphones      2
Tablet          2
Monitor         2
Keyboard        2
Name: count, dtype: int64

In [119]:
# You can also get normalized frequencies (proportions between 0 and 1; multiply by 100 for percentages)
product_percentages = sales_df['Product'].value_counts(normalize=True)
print("\nProduct sales percentages:")
print(product_percentages)


Product sales percentages:
Product
Book            0.3
Laptop          0.1
Mobile Phone    0.1
Mouse           0.1
Headphones      0.1
Tablet          0.1
Monitor         0.1
Keyboard        0.1
Name: proportion, dtype: float64
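Since `normalize=True` returns fractions rather than percentages, a short follow-up step converts them. A sketch on a hypothetical four-item `products` Series:

```python
import pandas as pd

# Hypothetical product column, for illustration only
products = pd.Series(['Book', 'Book', 'Book', 'Laptop'], name='Product')

# Fractions of the total (they sum to 1)
shares = products.value_counts(normalize=True)

# Multiply by 100 and round to get readable percentages
percentages = (shares * 100).round(1)

print(percentages)
```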



-----

### **7\. Sorting Your Data ↕️**

Sorting your DataFrame is a common operation to organize data for better readability or analysis.

  * `sort_values()`: Sorts by the values in one or more columns.
  * `sort_index()`: Sorts by the DataFrame's index.
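The two methods pair up naturally: after `sort_values()` the index labels are out of order, and `sort_index()` puts the rows back. A sketch on a hypothetical three-row `demo_df`, with `reset_index(drop=True)` shown as the alternative when fresh 0, 1, 2 labels are wanted instead:

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo_df = pd.DataFrame({
    'Product': ['Tablet', 'Book', 'Mouse'],
    'UnitPrice': [400, 15, 20],
})

# Sorting by value keeps each row's original index label
by_price = demo_df.sort_values(by='UnitPrice')

# sort_index() restores the original row order
restored = by_price.sort_index()

# reset_index(drop=True) keeps the sorted order but renumbers the rows
renumbered = by_price.reset_index(drop=True)

print(by_price.index.tolist())
print(renumbered)
```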



In [122]:
# Sort by 'UnitPrice' in ascending order (default)
sales_df_sorted_price = sales_df.sort_values(by='UnitPrice')
print("\nSales data sorted by price (ascending):")
display(sales_df_sorted_price)


Sales data sorted by price (ascending):


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Revenue,Price_After_Tax
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45,15.75
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30,15.75
13,1014,Maria Iqbal,Book,Books,2,15,2023-01-23,Sargodha,30,15.75
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad,15,15.75
19,1020,Hareem Asad,Book,Books,1,15,2023-01-29,Larkana,15,15.75
16,1017,Zeeshan Haider,Book,Books,3,15,2023-01-26,Bahawalpur,45,15.75
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad,40,21.0
12,1013,Osama Tariq,Mouse,Accessories,1,25,2023-01-22,Abbottabad,25,26.25
18,1019,Hashir Khan,Keyboard,Accessories,1,40,2023-01-28,Mirpur,40,42.0
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar,45,47.25


In [123]:
# Sort by 'Quantity' in descending order
sales_df_sorted_quantity_desc = sales_df.sort_values(by='Quantity', ascending=False)
print("\nSales data sorted by quantity (descending):")
display(sales_df_sorted_quantity_desc)


Sales data sorted by quantity (descending):


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Revenue,Price_After_Tax
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45,15.75
16,1017,Zeeshan Haider,Book,Books,3,15,2023-01-26,Bahawalpur,45,15.75
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30,15.75
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad,40,21.0
14,1015,Talha Javed,Headphones,Accessories,2,55,2023-01-24,Sahiwal,110,57.75
11,1012,Nadia Ali,Mobile Phone,Electronics,2,750,2023-01-21,Gujranwala,1500,787.5
13,1014,Maria Iqbal,Book,Books,2,15,2023-01-23,Sargodha,30,15.75
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi,800,840.0
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore,1200,1260.0
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot,300,315.0


In [124]:
# Sort by multiple columns: first by 'Product' (ascending), then by 'Quantity' (descending)
sales_df_multi_sort = sales_df.sort_values(by=['Product', 'Quantity'], ascending=[True, False])
print("\nSales data sorted by Product (asc) then Quantity (desc):")
display(sales_df_multi_sort)


Sales data sorted by Product (asc) then Quantity (desc):


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City,Revenue,Price_After_Tax
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad,45,15.75
16,1017,Zeeshan Haider,Book,Books,3,15,2023-01-26,Bahawalpur,45,15.75
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi,30,15.75
13,1014,Maria Iqbal,Book,Books,2,15,2023-01-23,Sargodha,30,15.75
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad,15,15.75
19,1020,Hareem Asad,Book,Books,1,15,2023-01-29,Larkana,15,15.75
14,1015,Talha Javed,Headphones,Accessories,2,55,2023-01-24,Sahiwal,110,57.75
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore,60,63.0
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar,45,47.25
18,1019,Hashir Khan,Keyboard,Accessories,1,40,2023-01-28,Mirpur,40,42.0



-----

### 8\. Pro Tips & Common Pitfalls 💡

#### Pro Tips:

  * **Chaining Methods:** Many Pandas methods return a new DataFrame or Series, allowing you to chain operations together (e.g., `df.sort_values('UnitPrice').head()`). This makes your code more concise. Methods that return `None`, such as `.info()`, can only come last in a chain. 🔗
  * **`.copy()` is Your Friend:** Plain assignment (`df2 = df`) does not copy anything; both names refer to the same DataFrame. When you want to preserve the original, use `df.copy()` to explicitly create an independent copy and prevent unexpected modifications. 🩹
  * **Method vs. Attribute:** Remember `df.shape` is an attribute (no parentheses), while `df.head()` is a method (requires parentheses). Pay attention to this! 🤓
  * **Use `.` for Column Selection (sometimes):** For single-word column names that don't clash with DataFrame methods, `df.ColumnName` works as a shortcut for `df['ColumnName']`. However, `df['Column Name With Spaces']` or `df['min']` (if 'min' is also a method name) require the bracket notation. Stick to `df['ColumnName']` for consistency and to avoid errors.
  * **Explore with Tab Completion:** In Jupyter, after typing `df.` and pressing `Tab`, you'll see a list of available methods and attributes. Use this to discover new functionalities! ⌨️
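The `.copy()` tip is easiest to appreciate side by side with plain assignment. A minimal sketch, using hypothetical names `original`, `alias` and `independent`:

```python
import pandas as pd

# Hypothetical frame, for illustration only
original = pd.DataFrame({'UnitPrice': [100, 200]})

alias = original               # same object under a second name, not a copy
independent = original.copy()  # a separate DataFrame with its own data

alias.loc[0, 'UnitPrice'] = 999        # this also changes `original`
independent.loc[1, 'UnitPrice'] = -1   # this does not touch `original`

print(original)
```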

#### Common Pitfalls:

  * **Modifying Original vs. Copy:** As mentioned, if you forget `.copy()`, you might accidentally modify your original DataFrame.
  * **Boolean Indexing `&` vs. `and`:** This is a very common one! Always use `&` for element-wise AND and `|` for element-wise OR in boolean indexing.
  * **Missing Parentheses:** Forgetting `()` for methods (e.g., `df.info` instead of `df.info()`). This will often just show you the method object, not execute it.
  * **Case Sensitivity:** Column names and other labels are case-sensitive (`'product'` is not the same as `'Product'`).
  * **Slicing with `.loc` vs. `.iloc`:** Remember `.loc` slicing is *inclusive* of the end label, while `.iloc` slicing is *exclusive* of the end position (like standard Python slicing). This can trip you up! 🤯
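The slicing difference is quick to verify. A sketch on a hypothetical four-row `demo_df` with the default 0, 1, 2, 3 index:

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo_df = pd.DataFrame({'Product': ['Laptop', 'Mouse', 'Book', 'Tablet']})

# .loc slices by label and includes the end label: rows 0, 1 and 2
by_label = demo_df.loc[0:2]

# .iloc slices by position and excludes the end position: rows 0 and 1
by_position = demo_df.iloc[0:2]

print(len(by_label), len(by_position))
```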



-----

### 9\. Mini-Challenges: Put Your Skills to the Test! 🧩

It's time to practice what you've learned! Use the `sales_df` DataFrame (you can re-read it from `sales_data.csv` if needed).


In [126]:
sales_df_challenge = pd.read_csv('sales_data.csv')

print("Your challenge DataFrame:")
display(sales_df_challenge)

Your challenge DataFrame:


Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan
6,1007,Umer Farooq,Book,Books,2,15,2023-01-16,Rawalpindi
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot
8,1009,Imran Qureshi,Keyboard,Accessories,1,45,2023-01-18,Peshawar
9,1010,Farah Yousaf,Book,Books,1,15,2023-01-19,Hyderabad



**Challenge 1: High-Value Products** 💰
Select all sales where the 'UnitPrice' is greater than or equal to 100. Store the result in a new DataFrame called `high_value_products`.


In [127]:
high_value_products = sales_df_challenge[sales_df_challenge['UnitPrice'] >= 100]
display(high_value_products)

Unnamed: 0,OrderID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi
5,1006,Zainab Malik,Tablet,Electronics,1,400,2023-01-15,Multan
7,1008,Hina Ahmed,Monitor,Electronics,1,300,2023-01-17,Sialkot
10,1011,Kamran Aziz,Laptop,Electronics,1,1250,2023-01-20,Quetta
11,1012,Nadia Ali,Mobile Phone,Electronics,2,750,2023-01-21,Gujranwala
15,1016,Fatima Riaz,Tablet,Electronics,1,420,2023-01-25,Gujrat
17,1018,Laiba Akram,Monitor,Electronics,1,310,2023-01-27,Dera Ghazi Khan



**Challenge 2: Total Quantity of Laptops** 💻
What is the total `Quantity` of 'Laptop' sold?
(Finding the total `Quantity` for every unique `Product` at once needs `.groupby()`, which we'll learn in a later section. For now, stick to the operations covered so far.)


In [134]:
# Your code here for Revised Challenge 2
laptop_quantity = sales_df_challenge[sales_df_challenge['Product'] == 'Laptop']['Quantity'].sum()
print(f"Total quantity of Laptops sold: {laptop_quantity}") 

Total quantity of Laptops sold: 2



**Challenge 3: Rename a Column** ✍️
Rename the 'OrderID' column to 'TransactionID'.


In [136]:
# Your code here for Challenge 3
# Method 1: rename in place
sales_df_challenge.rename(columns={'OrderID': 'TransactionID'}, inplace=True)

# Method 2 (equivalent alternative): rename() returns a new DataFrame, which we assign back.
# After Method 1 has run this changes nothing, because 'OrderID' no longer exists.
sales_df_challenge = sales_df_challenge.rename(columns={'OrderID': 'TransactionID'})

display(sales_df_challenge.head()) # Check your answer!

Unnamed: 0,TransactionID,CustomerName,Product,Category,Quantity,UnitPrice,OrderDate,City
0,1001,Ali Raza,Laptop,Electronics,1,1200,2023-01-10,Lahore
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,2023-01-11,Karachi
2,1003,Ahmed Shah,Mouse,Accessories,2,20,2023-01-12,Islamabad
3,1004,Ayesha Noor,Book,Books,3,15,2023-01-13,Faisalabad
4,1005,Bilal Aslam,Headphones,Accessories,1,60,2023-01-14,Lahore



**Challenge 4: Drop a Column** 👋
Remove the 'OrderDate' column from the `sales_df_challenge` DataFrame.


In [None]:
# Your code here for Challenge 4
sales_df_challenge.drop('OrderDate', axis=1, inplace=True)
display(sales_df_challenge.head())

Unnamed: 0,TransactionID,CustomerName,Product,Category,Quantity,UnitPrice,City
0,1001,Ali Raza,Laptop,Electronics,1,1200,Lahore
1,1002,Sana Khan,Mobile Phone,Electronics,1,800,Karachi
2,1003,Ahmed Shah,Mouse,Accessories,2,20,Islamabad
3,1004,Ayesha Noor,Book,Books,3,15,Faisalabad
4,1005,Bilal Aslam,Headphones,Accessories,1,60,Lahore



-----

### End of DataFrames & Operations! 🎉

You've just learned how to manipulate DataFrames like a pro! From creating them to selecting, modifying, and cleaning data, you now have a solid foundation. These are the operations you'll be using constantly in any data analysis task. 💪

**What's next?** In the upcoming sections, we'll dive deeper into more advanced topics like handling missing data, grouping, merging, and applying custom functions. Get ready for more data magic! 🧙‍♂️

**Keep practicing!** The more you write Pandas code, the more intuitive it becomes. You're doing great! ✨


# 🧹 Data Cleaning & Preparation: Taming the Wild Data! 🦁

Welcome back, data wranglers! 🤠 You've learned how to create and manipulate DataFrames. Now, it's time for a super important skill: **Data Cleaning and Preparation**.

Imagine you're a chef, and you've just received a batch of ingredients. Some vegetables might be bruised, some fruits might be overripe, and some labels might be missing. You wouldn't cook with them as is, right? You'd clean, trim, and prepare them first! 🍳

Data is no different! Real-world data is almost *never* perfect. It often has:
* **Missing values:** Gaps where data should be. 🙈
* **Duplicate entries:** The same information repeated. 👯‍♀️
* **Incorrect data types:** Numbers stored as text, dates as random strings. 📝
* **Inconsistent formatting:** 'USA', 'U.S.A.', 'United States' for the same country. 🤯

This section will equip you with the essential Pandas tools to clean and prepare your data, making it ready for analysis and insights! Let's get our hands dirty (with data, of course)! 🧤

---

### 📚 Table of Contents: Data Cleaning & Preparation

1.  **Understanding the Mess: The Need for Cleaning** 🧐
2.  **Handling Missing Data** 🕵️‍♀️
    * Identifying Missing Values
    * Dropping Rows/Columns with Missing Values (`.dropna()`)
    * Filling Missing Values (`.fillna()`)
3.  **Managing Duplicate Data** 👯‍♂️
    * Identifying Duplicates (`.duplicated()`)
    * Removing Duplicates (`.drop_duplicates()`)
4.  **Correcting Data Types** 🔄
    * Checking Data Types (`.dtypes`, `.info()`)
    * Converting Data Types (`.astype()`, `pd.to_numeric()`, `pd.to_datetime()`)
5.  **Renaming Columns for Clarity** ✍️
6.  **Replacing Values for Consistency** 🎯
7.  **Pro Tips & Common Pitfalls** 💡
8.  **Mini-Challenges: Data Cleaning Gym!** 🏋️‍♀️



---

### Let's Create a Messy DataFrame! 😈

To practice, we need some messy data! We'll create a DataFrame that simulates common real-world issues.


In [156]:
import pandas as pd
import numpy as np # NumPy is often used with Pandas, especially for NaN (Not a Number)

# Our messy dataset!
messy_data = {
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 101], # Duplicate Customer ID
    'Product_Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop', 'Webcam', 'Monitor', 'Mouse', 'Speaker', 'Laptop'],
    'Price_USD': [1200.00, 25.50, 75.00, np.nan, 1200.00, 50.00, 300.00, 25.50, np.nan, 1200.00], # Missing prices
    'Quantity': [1, 2, 1, 1, 1, 3, 1, 2, 1, 1],
    'Order_Date': ['2023-01-15', '2023/01/15', '16-Jan-2023', '2023-01-17', '2023-01-18', '2023-01-18', '2023-01-17', '2023/01/15', '2023-01-19', '2023-01-15'], # Inconsistent date formats
    'Customer_Segment': ['Gold', 'Silver', 'Gold', 'Bronze', 'Gold', 'Silver', 'Bronze', 'Silver', 'Gold', 'Gold'],
    'Payment Method': ['Credit Card ', 'Debit Card', 'Credit Card', 'Cash', 'Credit Card', 'Debit Card', 'Cash', 'Debit Card ', 'Credit Card', 'Credit Card'] # Trailing spaces
}

df_messy = pd.DataFrame(messy_data)

print("Our Super Messy DataFrame:")
display(df_messy)

# Let's get a first look at its info
df_messy.info()

Our Super Messy DataFrame:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer_ID       10 non-null     int64  
 1   Product_Name      10 non-null     object 
 2   Price_USD         8 non-null      float64
 3   Quantity          10 non-null     int64  
 4   Order_Date        10 non-null     object 
 5   Customer_Segment  10 non-null     object 
 6   Payment Method    10 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 692.0+ bytes



**Look at the `df_messy.info()` output. Can you spot some issues? 🤔**

  * **`Price_USD`** has fewer non-null values than the other columns. (Missing data!)
  * **`Order_Date`** is an `object` (string) type, but it should be a date!
  * **`Payment Method`** is also an `object`, and its name contains a space. Are there hidden issues in its values?
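
The `.info()` summary can't show what is wrong *inside* the `Payment Method` strings. One way to look closer, sketched here on a standalone Series that copies that column's values from `messy_data` (the names `payment` and `has_extra_space` exist only for this illustration), is to list the raw labels and compare each value with its stripped version:

```python
import pandas as pd

# Same 'Payment Method' values as in messy_data (note the trailing spaces)
payment = pd.Series(
    ['Credit Card ', 'Debit Card', 'Credit Card', 'Cash', 'Credit Card',
     'Debit Card', 'Cash', 'Debit Card ', 'Credit Card', 'Credit Card'],
    name='Payment Method',
)

# A plain Python list prints each string in quotes, so stray spaces become visible
print(payment.unique().tolist())

# Flag entries whose text changes when surrounding whitespace is stripped
has_extra_space = payment != payment.str.strip()
print(payment[has_extra_space])

# Number of distinct labels before and after stripping
n_raw = payment.nunique()
n_clean = payment.str.strip().nunique()
print(n_raw, n_clean)
```

Five raw labels collapse to three once the trailing spaces are gone, which is exactly the kind of hidden issue the last bullet hints at.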



-----

### 1. Understanding the Mess: The Need for Cleaning 🧐

Why do we even have messy data?

  * **Human Error:** Typos, manual data entry mistakes. 👨‍💻
  * **Data Entry Systems:** Different systems might record data differently.
  * **Missing Information:** A customer didn't provide a detail, or a sensor failed.
  * **Data Merges:** Combining data from different sources often introduces inconsistencies.
  * **Scraping Issues:** Data collected from websites can be unstructured.

Cleaning ensures your analysis is accurate and reliable. Garbage in, garbage out! 🗑️➡️📊



-----

### **2. Handling Missing Data 🕵️‍♀️**

Pandas marks missing values as `NaN` (Not a Number, NumPy's `np.nan`) in numeric columns, as `None` or `NaN` in `object` columns, and as `NaT` in datetime columns. The detection methods below treat all of these as missing.

#### **Identifying Missing Values: Where are the Gaps? 📍**

The first step is always to find where the missing values are.

  * `.isna()` or `.isnull()`: Returns a boolean DataFrame of the same shape, indicating `True` where values are missing.
  * `.notna()` or `.notnull()`: The opposite, `True` where values are *not* missing.
  * `.sum()`: When applied after `.isna()`, it adds up the `True` values (each counts as 1) in each column, giving you the number of missing values per column.



In [158]:
# Check for missing values - returns a DataFrame of True/False
display(df_messy.isna())

Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False
8,False,False,True,False,False,False,False
9,False,False,False,False,False,False,False


In [159]:
# Count missing values per column - super useful!
print("\nMissing values per column:")
print(df_messy.isna().sum())


Missing values per column:
Customer_ID         0
Product_Name        0
Price_USD           2
Quantity            0
Order_Date          0
Customer_Segment    0
Payment Method      0
dtype: int64


In [161]:
# Check total missing values in the entire DataFrame
print(f"\nTotal missing values in DataFrame: {df_messy.isna().sum().sum()}")


Total missing values in DataFrame: 2



From the output of `df_messy.isna().sum()`, you should see that `Price_USD` has 2 missing values.
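
Raw counts are easier to judge next to the size of the column. As a small sketch (using a two-column cut-down of `df_messy`, so the variable names here are illustrative only), `.isna().mean()` gives the share of missing values per column, and a boolean mask pulls out the affected rows:

```python
import numpy as np
import pandas as pd

# Trimmed-down copy of df_messy: same IDs and prices, fewer columns
df = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 101],
    'Price_USD': [1200.00, 25.50, 75.00, np.nan, 1200.00,
                  50.00, 300.00, 25.50, np.nan, 1200.00],
})

# The mean of a boolean column is the fraction of True values
missing_share = df.isna().mean()
print(missing_share)

# Keep only the rows where at least one column is missing
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Seeing the rows themselves (here, index 3 and 8) often helps when deciding between dropping and filling.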




#### **Dropping Rows/Columns with Missing Values (`.dropna()`): The "Delete It!" Approach ✂️**

If you have a lot of data and only a few missing values, or if the gaps sit in a crucial field and can't be reasonably imputed, you might choose to drop the affected rows or columns.

  * `df.dropna(axis=0, how='any', inplace=False)`:
      * `axis=0` (default) or `'index'`: Drops rows.
      * `axis=1` or `'columns'`: Drops columns.
      * `how='any'` (default): Drops the row/column if *any* `NaN` is present.
      * `how='all'`: Drops the row/column only if *all* values are `NaN`.
      * `inplace=True`: Modifies the DataFrame directly and returns `None`.


In [185]:
df_cleaned_drop = df_messy.copy() # Work on a copy!

# Drop rows where ANY column has a missing value
print("DataFrame after dropping rows with any missing values:")
df_cleaned_drop.dropna(inplace=True) # Let's make it permanent for this copy
display(df_cleaned_drop)
print("\nMissing values after dropping rows:")
print(df_cleaned_drop.isna().sum())

DataFrame after dropping rows with any missing values:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card



Missing values after dropping rows:
Customer_ID         0
Product_Name        0
Price_USD           0
Quantity            0
Order_Date          0
Customer_Segment    0
Payment Method      0
dtype: int64



**⚠️ Pitfall:** Be careful with `dropna()`. If many rows each have just one or two missing values, you can accidentally throw away a lot of valuable data! Always check `isna().sum()` first.
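
`dropna()` also accepts `subset` and `thresh`, which can limit how much data gets dropped. The frame below is hypothetical (the `Notes` column does not exist in `df_messy`) and is only meant to show how the surviving row counts differ:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'Notes' is mostly empty, 'Price_USD' has one gap
df = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104],
    'Price_USD': [1200.0, np.nan, 75.0, 300.0],
    'Notes': [np.nan, np.nan, 'gift', np.nan],
})

# Default: a row is dropped if ANY column is missing
dropped_any = df.dropna()

# subset= restricts the check to the columns that matter
dropped_subset = df.dropna(subset=['Price_USD'])

# thresh= keeps rows that have at least that many non-missing values
dropped_thresh = df.dropna(thresh=2)

print(len(df), len(dropped_any), len(dropped_subset), len(dropped_thresh))
```

Here the default call keeps only 1 of 4 rows, while restricting the check to `Price_USD` keeps 3.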



#### **Filling Missing Values (`.fillna()`): The "Guess & Fill" Approach ✍️**

Often, dropping data isn't an option. Instead, you can fill missing values using various strategies. Each call returns a new object, so assign the result back to the column or DataFrame.

  * `df.fillna(value)`: Fill with a specific value (e.g., 0, 'Unknown').
  * `df.ffill()`: **Forward-fill** - Fills `NaN` with the *previous* valid observation. (The older spelling `df.fillna(method='ffill')` is deprecated since pandas 2.1.)
  * `df.bfill()`: **Backward-fill** - Fills `NaN` with the *next* valid observation. (Replaces `df.fillna(method='bfill')`.)
  * `df['column'].fillna(df['column'].mean())`: Fill with the mean (average) of the column.
  * `df['column'].fillna(df['column'].median())`: Fill with the median (middle value) of the column.
  * `df['column'].fillna(df['column'].mode()[0])`: Fill with the mode (most frequent value) of the column.

Let's work with `df_messy` again.


In [163]:
df_cleaned_fill = df_messy.copy()

# Option 1: Fill 'Price_USD' with the mean price
mean_price = df_cleaned_fill['Price_USD'].mean()
# Assign the result back instead of calling fillna(..., inplace=True) on the column:
# the chained inplace call triggers a FutureWarning and stops working in pandas 3.0.
df_cleaned_fill['Price_USD'] = df_cleaned_fill['Price_USD'].fillna(mean_price)
print("DataFrame after filling Price_USD with mean:")
display(df_cleaned_fill)
print("\nMissing values after mean fill:")
print(df_cleaned_fill.isna().sum()) # Should show 0 for Price_USD

DataFrame after filling Price_USD with mean:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,509.5,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,509.5,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card



Missing values after mean fill:
Customer_ID         0
Product_Name        0
Price_USD           0
Quantity            0
Order_Date          0
Customer_Segment    0
Payment Method      0
dtype: int64


In [None]:
# Let's reset for another fill example
df_cleaned_fill = df_messy.copy()

# Option 2: Fill 'Price_USD' with a specific value (e.g., 0 or a placeholder)
df_cleaned_fill['Price_USD'] = df_cleaned_fill['Price_USD'].fillna(0)
print("\nDataFrame after filling Price_USD with 0:")
display(df_cleaned_fill)


DataFrame after filling Price_USD with 0:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,0.0,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,0.0,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


In [165]:
# Let's reset for another fill example
df_cleaned_fill = df_messy.copy()

# Option 3: Fill 'Price_USD' with the median price (often more robust to outliers than mean)
median_price = df_cleaned_fill['Price_USD'].median()
df_cleaned_fill['Price_USD'] = df_cleaned_fill['Price_USD'].fillna(median_price)
print("\nDataFrame after filling Price_USD with median:")
display(df_cleaned_fill)


DataFrame after filling Price_USD with median:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,187.5,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,187.5,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


In [166]:
# Let's reset for another fill example
df_cleaned_fill = df_messy.copy()

# Option 4: For non-numerical data or sequential data, ffill/bfill can be useful.
# If 'Customer_Segment' had a missing value, this would carry the previous segment forward.
# (It has none in df_messy, so the output below is unchanged.)
df_cleaned_fill['Customer_Segment'] = df_cleaned_fill['Customer_Segment'].ffill()
print("\nDataFrame after forward-filling Customer_Segment (if it had NaNs):")
display(df_cleaned_fill)


DataFrame after forward-filling Customer_Segment (if it had NaNs):


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


> **💡 Pro Tip:** The choice of imputation method (mean, median, mode, specific value, ffill/bfill) depends heavily on the nature of your data and the reason for missingness. For skewed data, median is often better than mean. For categorical data, mode is common.
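
As one illustration of matching the fill strategy to the data, a missing price could be filled with the median price of the *same product* rather than the overall median. This is only a sketch, under the assumption that prices are comparable within a product; `Price_filled` and `product_median` are made-up names:

```python
import numpy as np
import pandas as pd

# Product and price columns copied from messy_data
df = pd.DataFrame({
    'Product_Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop',
                     'Webcam', 'Monitor', 'Mouse', 'Speaker', 'Laptop'],
    'Price_USD': [1200.00, 25.50, 75.00, np.nan, 1200.00,
                  50.00, 300.00, 25.50, np.nan, 1200.00],
})

# Median price of each product, broadcast back to every row of that product
product_median = df.groupby('Product_Name')['Price_USD'].transform('median')

# Fill each gap with the median of its own product group
df['Price_filled'] = df['Price_USD'].fillna(product_median)
print(df)
```

The Monitor gap is filled with 300.0 from the other Monitor row, while the Speaker gap stays `NaN` because there is no other Speaker price to borrow from. A fallback (such as the overall median) would still be needed for cases like that.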


#### **Quick Quiz ❓**

**Problem 1 (Beginner):**
Identify the number of missing values in each column of `df_missing_quiz`.

**Problem 2 (Intermediate):**
Create a new DataFrame called `df_missing_dropped` by dropping all rows from `df_missing_quiz` that contain **any** missing values. Print the new DataFrame and its missing value counts.

**Problem 3 (Advanced):**
Fill the missing `Price` values in `df_missing_quiz` with the median of the `Price` column. Then, fill the missing `CustomerName` values with the string 'Unknown'. Print the DataFrame after these operations and verify there are no missing values in these columns.
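
The quiz refers to `df_missing_quiz`, which is not defined in this section. If you don't have it handy, a small stand-in like the following should work for all three problems. The rows and the `OrderID` column are invented for practice; only the `Price` and `CustomerName` column names come from the quiz text:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_missing_quiz (values invented for practice)
df_missing_quiz = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 5],
    'CustomerName': ['Ana', None, 'Chen', 'Dev', None],
    'Price': [10.0, 20.0, np.nan, 40.0, np.nan],
})

print(df_missing_quiz)
```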

<details>
<summary>Click to show Answers</summary>

```python
# Problem 1 Answer
print("Missing values per column in df_missing_quiz:")
print(df_missing_quiz.isna().sum())

# Problem 2 Answer
df_missing_dropped = df_missing_quiz.copy()
df_missing_dropped.dropna(inplace=True)
print("\nDataFrame after dropping rows with any missing values:")
print(df_missing_dropped)
print("\nMissing values after dropping rows:")
print(df_missing_dropped.isna().sum())

# Problem 3 Answer
df_missing_filled = df_missing_quiz.copy()
median_price_quiz = df_missing_filled['Price'].median()
df_missing_filled['Price'] = df_missing_filled['Price'].fillna(median_price_quiz)
df_missing_filled['CustomerName'] = df_missing_filled['CustomerName'].fillna('Unknown')
print("\nDataFrame after filling Price with median and CustomerName with 'Unknown':")
print(df_missing_filled)
print("\nMissing values after filling:")
print(df_missing_filled.isna().sum())
```

</details>



-----

### **3. Managing Duplicate Data 👯‍♂️**

Duplicate rows can skew your analysis, leading to incorrect counts or aggregates.

#### **Identifying Duplicates (`.duplicated()`): Spotting Repeats 👀**

  * `df.duplicated()`: Returns a boolean Series that is `True` for rows that repeat an earlier row. By default, the *first* occurrence is left unmarked and only the later copies are flagged.
  * `df.duplicated(subset=['col1', 'col2'])`: Checks for duplicates based *only* on specific columns.
  * `df.duplicated(keep='first')` (default): Marks all but the first occurrence as `True`.
  * `df.duplicated(keep='last')`: Marks all but the last occurrence as `True`.
  * `df.duplicated(keep=False)`: Marks *all* occurrences (including the first/last) of a duplicate set as `True`.

In [None]:
df_duplicates = df_messy.copy()

# Find duplicate rows (entire row match)
print("Boolean Series indicating duplicate rows:")
print(df_duplicates.duplicated())

# Count how many duplicate rows there are
print(f"\nNumber of duplicate rows: {df_duplicates.duplicated().sum()}")

# Find duplicates based on a subset of columns (e.g., Customer_ID and Product_Name)
print("\nDuplicates based on Customer_ID and Product_Name:")
print(df_duplicates.duplicated(subset=['Customer_ID', 'Product_Name']))

Boolean Series indicating duplicate rows:


0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool


Number of duplicate rows: 0

Duplicates based on Customer_ID and Product_Name:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9     True
dtype: bool



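> Notice that the full-row check finds no duplicates, while the `Customer_ID`/`Product_Name` check flags row 9: `Customer_ID` 101 with `Product_Name` 'Laptop' appears in both row 0 and row 9. The two rows are not *exact* duplicates because row 0's `Payment Method` is `'Credit Card '` with a trailing space.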
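
To inspect *both* copies of a repeated `Customer_ID`/`Product_Name` pair side by side, `keep=False` is handy. A minimal sketch on a two-column cut-down of `df_messy` (the variable names are illustrative):

```python
import pandas as pd

# ID and product columns copied from messy_data
df = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 101],
    'Product_Name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop',
                     'Webcam', 'Monitor', 'Mouse', 'Speaker', 'Laptop'],
})

key = ['Customer_ID', 'Product_Name']

# keep=False flags every member of a duplicate group, not just the repeats
all_copies = df[df.duplicated(subset=key, keep=False)]
print(all_copies)

# keep='last' flags the earlier copies instead of the later ones
earlier_copies = df[df.duplicated(subset=key, keep='last')]
print(earlier_copies)
```

Using the boolean Series as a filter like this lets you review the suspects before deciding which copy to keep.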



#### Removing Duplicates (`.drop_duplicates()`): Cleaning Up! 🧹

Once identified, removing duplicates is straightforward.

  * `df.drop_duplicates(subset=None, keep='first', inplace=False)`:
      * `subset`: Columns to consider for identifying duplicates. If `None` (default), considers all columns.
      * `keep`: `'first'` (default) or `'last'` chooses which copy survives; `False` drops *every* row in a duplicate group.
      * `inplace=True`: Modifies the DataFrame directly.



In [169]:
df_no_duplicates = df_messy.copy()

# Drop duplicate rows (keeping the first occurrence by default)
print("DataFrame before dropping duplicates:")
display(df_no_duplicates)

DataFrame before dropping duplicates:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


In [188]:
df_no_duplicates.drop_duplicates(inplace=True)
print("\nDataFrame after dropping exact duplicate rows:")
display(df_no_duplicates)
print(f"Number of rows after dropping: {len(df_no_duplicates)}")


DataFrame after dropping exact duplicate rows:


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


Number of rows after dropping: 10

> All 10 rows remain: the full-row `.duplicated()` check earlier counted 0 exact duplicates in `df_messy`, so there is nothing for `drop_duplicates()` to remove here.
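
When rows differ only by stray whitespace, a full-row comparison will not treat them as duplicates. One possible approach, shown on a three-row sample modeled on rows 0, 1, and 9 of `df_messy` (a sketch, not the notebook's own flow), is to strip the text column first and then deduplicate:

```python
import pandas as pd

# Rows modeled on df_messy rows 0, 1 and 9 (only three columns kept)
df = pd.DataFrame({
    'Customer_ID': [101, 102, 101],
    'Product_Name': ['Laptop', 'Mouse', 'Laptop'],
    'Payment Method': ['Credit Card ', 'Debit Card', 'Credit Card'],
})

# Before cleaning, the trailing space hides the repeat
before = df.duplicated().sum()

# Strip surrounding whitespace so equal labels compare equal
df['Payment Method'] = df['Payment Method'].str.strip()
after = df.duplicated().sum()

# drop_duplicates() returns a new frame; the original is left untouched
deduped = df.drop_duplicates()
print(before, after, len(deduped))
```

The order matters: normalizing values first and deduplicating second tends to catch repeats that an exact comparison misses.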


In [189]:
# Reset to original messy df for another example
df_no_duplicates = df_messy.copy()

# Drop duplicates based on 'Customer_ID' and 'Product_Name', keeping the last one
# This is useful if the latest entry is considered the most accurate.
df_no_duplicates.drop_duplicates(subset=['Customer_ID', 'Product_Name'], keep='last', inplace=True)
print("\nDataFrame after dropping duplicates based on Customer_ID and Product_Name (keeping last):")
display(df_no_duplicates)


DataFrame after dropping duplicates based on Customer_ID and Product_Name (keeping last):


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50.0,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300.0,1,2023-01-17,Bronze,Cash
7,108,Mouse,25.5,2,2023/01/15,Silver,Debit Card
8,109,Speaker,,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card


#### **Quick Quiz ❓**

**Problem 1 (Beginner):**
Identify and print all rows in `df_duplicates_quiz` that are exact duplicates.

**Problem 2 (Intermediate):**
Create a new DataFrame called `df_no_student_duplicates` by removing duplicate entries based on the `StudentID` column, keeping only the first occurrence for each `StudentID`. Print the new DataFrame.

**Problem 3 (Advanced):**
Identify and print all rows that are duplicates based on the combination of `StudentID` and `Course`, including the first occurrence of such a duplicate pair. Then, create a DataFrame `df_unique_enrollments` by dropping these duplicates, ensuring only truly unique `StudentID`-`Course` pairs remain (i.e., if a student enrolled in the same course twice, one entry should be removed).
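
`df_duplicates_quiz` is not defined in this section either. A hypothetical stand-in such as the one below can be used to try the problems; the rows and the `Grade` column are invented, and only `StudentID` and `Course` are named in the quiz:

```python
import pandas as pd

# Hypothetical stand-in for df_duplicates_quiz (values invented for practice)
df_duplicates_quiz = pd.DataFrame({
    'StudentID': [1, 2, 2, 3, 1, 3],
    'Course': ['Math', 'Physics', 'Physics', 'Art', 'Biology', 'Art'],
    'Grade': ['A', 'B', 'B', 'C', 'A', 'B'],
})

print(df_duplicates_quiz)
```

It contains one exact duplicate row, repeated `StudentID` values, and a repeated `StudentID`-`Course` pair whose grades differ, so each problem has something to find.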

<details>
<summary>Click to show Answers</summary>

```python
# Problem 1 Answer
print("Exact duplicate rows in df_duplicates_quiz:")
print(df_duplicates_quiz[df_duplicates_quiz.duplicated()])

# Problem 2 Answer
df_no_student_duplicates = df_duplicates_quiz.copy()
df_no_student_duplicates.drop_duplicates(subset=['StudentID'], keep='first', inplace=True)
print("\nDataFrame after dropping duplicates based on StudentID (keeping first):")
print(df_no_student_duplicates)

# Problem 3 Answer
print("\nDuplicates based on StudentID and Course (including first occurrence):")
print(df_duplicates_quiz[df_duplicates_quiz.duplicated(subset=['StudentID', 'Course'], keep=False)])

df_unique_enrollments = df_duplicates_quiz.copy()
df_unique_enrollments.drop_duplicates(subset=['StudentID', 'Course'], inplace=True)
print("\nDataFrame after dropping duplicates based on StudentID and Course:")
print(df_unique_enrollments)
```

</details>



-----

### **4. Correcting Data Types 🔄**

Incorrect data types can prevent you from performing calculations (e.g., trying to sum text) or sorting correctly.

#### **Checking Data Types (`.dtypes`, `.info()`): What's Under the Hood? 🕵️‍♀️**

You've already used `.info()`. `.dtypes` is another quick way to see types.


In [172]:
print("Current data types:")
print(df_messy.dtypes)

Current data types:
Customer_ID           int64
Product_Name         object
Price_USD           float64
Quantity              int64
Order_Date           object
Customer_Segment     object
Payment Method       object
dtype: object


> Notice `Order_Date` and `Payment Method` are `object` (string). `Price_USD` is `float64`.



#### **Converting Data Types (`.astype()`, `pd.to_numeric()`, `pd.to_datetime()`):**

  * **`.astype(dtype)`:** A general method to cast a Series or DataFrame to a specified `dtype` (e.g., `int`, `float`, `str`, `bool`).
  * **`pd.to_numeric(series, errors='coerce')`:** Converts a Series to a numeric type. `errors='coerce'` is crucial: it turns non-convertible values into `NaN` instead of throwing an error.
  * **`pd.to_datetime(series, format=...)`:** Converts a Series to datetime objects. `format` takes a `strftime` pattern such as `'%Y-%m-%d'`; when a column mixes several formats, `format='mixed'` (pandas 2.0+) parses each value individually.

Let's use **`df_messy`** for conversions.


In [177]:
df_type_corrected = df_messy.copy()

# Convert 'Price_USD' to integer. NaNs must be filled first, because a plain int column cannot hold NaN.
# Let's first fill the NaNs in Price_USD with the median
median_price = df_type_corrected['Price_USD'].median()
df_type_corrected['Price_USD'] = df_type_corrected['Price_USD'].fillna(median_price)

# Now convert 'Price_USD' to integer (if you want whole numbers).
# Note: astype(int) truncates the decimals, so 25.5 becomes 25.
df_type_corrected['Price_USD'] = df_type_corrected['Price_USD'].astype(int)
print("\nDataFrame after converting Price_USD to int (after filling NaNs):")
print(df_type_corrected.dtypes)
display(df_type_corrected)


# Convert 'Quantity' to float if needed (just for demo)
df_type_corrected['Quantity'] = df_type_corrected['Quantity'].astype(float)
print("\nDataFrame after converting Quantity to float:")
print(df_type_corrected.dtypes)




DataFrame after converting Price_USD to int (after filling NaNs):
Customer_ID          int64
Product_Name        object
Price_USD            int64
Quantity             int64
Order_Date          object
Customer_Segment    object
Payment Method      object
dtype: object


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,187,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200,1,2023-01-18,Gold,Credit Card
5,106,Webcam,50,3,2023-01-18,Silver,Debit Card
6,107,Monitor,300,1,2023-01-17,Bronze,Cash
7,108,Mouse,25,2,2023/01/15,Silver,Debit Card
8,109,Speaker,187,1,2023-01-19,Gold,Credit Card
9,101,Laptop,1200,1,2023-01-15,Gold,Credit Card



DataFrame after converting Quantity to float:
Customer_ID           int64
Product_Name         object
Price_USD             int64
Quantity            float64
Order_Date           object
Customer_Segment     object
Payment Method       object
dtype: object
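
If a column should be integer *and* keep its gaps, pandas' nullable `Int64` dtype is an alternative to filling first. A hedged sketch on a standalone Series (not part of the notebook's flow; `prices` and `prices_int` are illustrative names):

```python
import numpy as np
import pandas as pd

prices = pd.Series([1200.0, 25.5, np.nan, 300.0], name='Price_USD')

# Plain int cannot hold NaN, so prices.astype(int) would raise here.
# Round first: the nullable dtype rejects fractional values such as 25.5.
rounded = prices.round()

# 'Int64' (capital I) is the nullable integer dtype; NaN is stored as <NA>
prices_int = rounded.astype('Int64')
print(prices_int)
```

Whether rounding is acceptable depends on the data; for prices you may prefer to keep `float64`, which is why the main example treats the integer conversion as optional.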


In [194]:
# Converting 'Order_Date' to datetime objects - this is critical!
# Notice the inconsistent formats: 'YYYY-MM-DD', 'YYYY/MM/DD', 'DD-Mon-YYYY'
# By default, pandas 2.x infers one format from the first value and expects every row to match it.
# format='mixed' (pandas 2.0+) parses each value on its own, which is what this column needs.
# The old infer_datetime_format argument is deprecated and can be dropped.
df_type_corrected['Order_Date'] = pd.to_datetime(df_type_corrected['Order_Date'], format='mixed')
print("\nDataFrame after converting Order_Date to datetime:")
print(df_type_corrected.dtypes)
display(df_type_corrected)


DataFrame after converting Order_Date to datetime:
Customer_ID                  int64
Product_Name                object
Price_USD                    int64
Quantity                   float64
Order_Date          datetime64[ns]
Customer_Segment            object
Payment Method              object
dtype: object


Unnamed: 0,Customer_ID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment Method
0,101,Laptop,1200,1.0,2023-01-15,Gold,Credit Card
1,102,Mouse,25,2.0,2023-01-15,Silver,Debit Card
2,103,Keyboard,75,1.0,2023-01-16,Gold,Credit Card
3,104,Monitor,187,1.0,2023-01-17,Bronze,Cash
4,105,Laptop,1200,1.0,2023-01-18,Gold,Credit Card
5,106,Webcam,50,3.0,2023-01-18,Silver,Debit Card
6,107,Monitor,300,1.0,2023-01-17,Bronze,Cash
7,108,Mouse,25,2.0,2023-01-15,Silver,Debit Card
8,109,Speaker,187,1.0,2023-01-19,Gold,Credit Card
9,101,Laptop,1200,1.0,2023-01-15,Gold,Credit Card


In [191]:
# What if 'Price_USD' had text like 'N/A'?
mixed_num_series = pd.Series([10, 20, 'N/A', 40, 50])
print("\nOriginal mixed series:")
display(mixed_num_series)
print(mixed_num_series.dtypes)

# Convert using pd.to_numeric with errors='coerce'
converted_num_series = pd.to_numeric(mixed_num_series, errors='coerce')
print("\nConverted numeric series with 'coerce':")
display(converted_num_series)
print(converted_num_series.dtypes) # Notice the NaN where 'N/A' was!


Original mixed series:


0     10
1     20
2    N/A
3     40
4     50
dtype: object

object

Converted numeric series with 'coerce':


0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

float64



**💡 Pro Tip:** Always convert date columns to datetime objects. This allows for powerful time-series analysis (e.g., filtering by month, calculating duration).
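
To make the tip concrete, here is a small sketch on a hypothetical three-row `orders` frame showing date filtering, the `.dt` accessor, and a duration calculation (the `Weekday` and `Days_Since_First` columns are invented for the example):

```python
import pandas as pd

# Hypothetical orders with an already-converted datetime column
orders = pd.DataFrame({
    'Order_Date': pd.to_datetime(['2023-01-15', '2023-01-16', '2023-01-19']),
    'Quantity': [1, 2, 1],
})

# Date comparisons work once the column is datetime64
late_orders = orders[orders['Order_Date'] >= pd.Timestamp('2023-01-16')]

# The .dt accessor exposes calendar parts such as the weekday name
orders['Weekday'] = orders['Order_Date'].dt.day_name()

# Subtracting datetimes gives Timedeltas; .dt.days extracts whole days
orders['Days_Since_First'] = (orders['Order_Date'] - orders['Order_Date'].min()).dt.days
print(orders)
```

None of these operations would work reliably on the original `object` strings, where `'16-Jan-2023'` and `'2023/01/15'` compare as plain text.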


#### **Quick Quiz ❓**


**Problem 1 (Beginner):**
Check and print the initial data types of all columns in `df_types_quiz`.

**Problem 2 (Intermediate):**
Convert the `Amount` column to a numeric data type. Handle any non-numeric values by coercing them into `NaN`. After conversion, fill any resulting `NaN` values in `Amount` with the mean of the column. Finally, convert `IsExpressDelivery` to boolean type. Print the data types and the `Amount` and `IsExpressDelivery` columns of the updated DataFrame.

**Problem 3 (Advanced):**
Convert the `DateString` column to a proper datetime data type. Ensure that Pandas can correctly parse the inconsistent date formats. Then, convert the `QuantityOrdered` column to an integer type, coercing any values that cannot be converted into `NaN`. After conversion, fill any `NaN` values in `QuantityOrdered` with the mode of the column. Print the data types and the `DateString` and `QuantityOrdered` columns of the final DataFrame.
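
As with the earlier quizzes, `df_types_quiz` is not defined in this section. The stand-in below is hypothetical: the four column names come from the quiz text, while every value is invented so that each problem has something to fix:

```python
import pandas as pd

# Hypothetical stand-in for df_types_quiz (values invented for practice)
df_types_quiz = pd.DataFrame({
    'Amount': ['100.5', 'N/A', '250', '75.25'],                  # numbers stored as text
    'IsExpressDelivery': [1, 0, 1, 0],                           # flags stored as 0/1
    'DateString': ['2023-01-05', '2023/01/06', '07-Jan-2023', '2023-01-08'],
    'QuantityOrdered': ['3', 'two', '5', '3'],                   # one non-numeric entry
})

print(df_types_quiz.dtypes)
```

One caveat if your own data stores the flag as text: `astype(bool)` turns every non-empty string into `True`, including `'False'`, so a 0/1 column like the one assumed here is the safe case for that conversion.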

<details>
<summary>Click to show Answers</summary>

```python
# Problem 1 Answer
print("Initial data types of df_types_quiz:")
print(df_types_quiz.dtypes)

# Problem 2 Answer
df_types_corrected_p2 = df_types_quiz.copy()

df_types_corrected_p2['Amount'] = pd.to_numeric(df_types_corrected_p2['Amount'], errors='coerce')
mean_amount = df_types_corrected_p2['Amount'].mean()
df_types_corrected_p2['Amount'] = df_types_corrected_p2['Amount'].fillna(mean_amount)

df_types_corrected_p2['IsExpressDelivery'] = df_types_corrected_p2['IsExpressDelivery'].astype(bool)

print("\nDataFrame after converting Amount to numeric (and filling NaN) and IsExpressDelivery to boolean:")
print(df_types_corrected_p2.dtypes)
print("\nAmount column:")
print(df_types_corrected_p2['Amount'])
print("\nIsExpressDelivery column:")
print(df_types_corrected_p2['IsExpressDelivery'])

# Problem 3 Answer
df_types_corrected_p3 = df_types_quiz.copy()

# format='mixed' (pandas 2.0+) parses each date string individually
df_types_corrected_p3['DateString'] = pd.to_datetime(df_types_corrected_p3['DateString'], format='mixed')

df_types_corrected_p3['QuantityOrdered'] = pd.to_numeric(df_types_corrected_p3['QuantityOrdered'], errors='coerce')
mode_quantity = df_types_corrected_p3['QuantityOrdered'].mode()[0]
df_types_corrected_p3['QuantityOrdered'] = df_types_corrected_p3['QuantityOrdered'].fillna(mode_quantity)
df_types_corrected_p3['QuantityOrdered'] = df_types_corrected_p3['QuantityOrdered'].astype(int)


print("\nDataFrame after converting DateString to datetime and QuantityOrdered to int (and filling NaN):")
print(df_types_corrected_p3.dtypes)
print("\nDateString column:")
print(df_types_corrected_p3['DateString'])
print("\nQuantityOrdered column:")
print(df_types_corrected_p3['QuantityOrdered'])
```

</details>



-----

### **5. Renaming Columns for Clarity ✍️**

Clear column names make your data understandable and your code readable.

#### **Using `.rename()`: The Recommended Way 👍**

  * `df.rename(columns={'old_name': 'new_name'}, inplace=False)`:
      * Pass a dictionary mapping old names to new names.
      * `inplace=True` for direct modification.


In [182]:
df_renamed = df_messy.copy()

# Rename 'Customer_ID' to 'CustomerID' and 'Payment Method' to 'Payment_Method'
df_renamed.rename(columns={
    'Customer_ID': 'CustomerID',
    'Payment Method': 'Payment_Method'
}, inplace=True)

print("DataFrame after renaming columns:")
display(df_renamed.head()) # Check the new column names!

DataFrame after renaming columns:


Unnamed: 0,CustomerID,Product_Name,Price_USD,Quantity,Order_Date,Customer_Segment,Payment_Method
0,101,Laptop,1200.0,1,2023-01-15,Gold,Credit Card
1,102,Mouse,25.5,2,2023/01/15,Silver,Debit Card
2,103,Keyboard,75.0,1,16-Jan-2023,Gold,Credit Card
3,104,Monitor,,1,2023-01-17,Bronze,Cash
4,105,Laptop,1200.0,1,2023-01-18,Gold,Credit Card



#### Assigning to `.columns`: For Renaming All Columns ✍️

If you want to rename *all* columns at once, you can assign a new list of names to `df.columns`. Make sure the new list has the *exact same number* of elements as there are columns!


In [183]:
df_all_renamed = df_messy.copy()

# Get current columns
print("Original columns:", df_all_renamed.columns.tolist())

# Assign a new list of column names (must match the order and number of columns)
new_column_names = ['C_ID', 'Prod_Name', 'Price_$', 'Qty', 'Order_Dt', 'C_Segment', 'Pay_Method']
df_all_renamed.columns = new_column_names

print("\nDataFrame after assigning new column list:")
print(df_all_renamed.columns)
display(df_all_renamed.head())

Original columns: ['Customer_ID', 'Product_Name', 'Price_USD', 'Quantity', 'Order_Date', 'Customer_Segment', 'Payment Method']

DataFrame after assigning new column list:
Index(['C_ID', 'Prod_Name', 'Price_$', 'Qty', 'Order_Dt', 'C_Segment',
       'Pay_Method'],
      dtype='object')


| | C_ID | Prod_Name | Price_$ | Qty | Order_Dt | C_Segment | Pay_Method |
|---|---|---|---|---|---|---|---|
| 0 | 101 | Laptop | 1200.0 | 1 | 2023-01-15 | Gold | Credit Card |
| 1 | 102 | Mouse | 25.5 | 2 | 2023/01/15 | Silver | Debit Card |
| 2 | 103 | Keyboard | 75.0 | 1 | 16-Jan-2023 | Gold | Credit Card |
| 3 | 104 | Monitor | NaN | 1 | 2023-01-17 | Bronze | Cash |
| 4 | 105 | Laptop | 1200.0 | 1 | 2023-01-18 | Gold | Credit Card |



**⚠️ Pitfall:** Using `df.columns = new_list` is risky if you're not absolutely sure about the order and count of columns. `rename()` is generally safer for specific changes.
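
To make that pitfall concrete, here is a minimal sketch. The three-column `df_demo` frame and its column names are made up purely for illustration; they are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical three-column frame, used only for illustration
df_demo = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# A list of the wrong length fails loudly with a ValueError...
try:
    df_demo.columns = ["x", "y"]
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print("Wrong length:", length_error)

# ...but a list of the right length in the wrong order is accepted silently.
df_swapped = df_demo.copy()
df_swapped.columns = ["c", "b", "a"]  # the column now called 'c' holds the old 'a' values
print(df_swapped)

# rename() only touches the names you list, whatever the column order is.
df_safe = df_demo.rename(columns={"a": "x"})
print(df_safe.columns.tolist())
```

The silent case is the dangerous one: nothing warns you that the labels no longer describe the data underneath them.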


#### **Quick Quiz ❓**

**Problem 1 (Beginner):**
Rename the column `product_id` to `ProductID` in `df_rename_quiz`. Print the updated column names.

**Problem 2 (Intermediate):**
Rename the column `product name` to `ProductName` and `unit_price_usd` to `PriceUSD` in `df_rename_quiz` in a single operation. Print the updated DataFrame head.

**Problem 3 (Advanced):**
Rename all columns in `df_rename_quiz` to be in PascalCase (e.g., `product_id` becomes `ProductId`, `stock_quantity` becomes `StockQuantity`). You should convert existing snake_case names to PascalCase programmatically. Print the DataFrame's column names after this operation.

<details>
<summary>Click to show Answers</summary>

```python
# Problem 1 Answer
df_rename_p1 = df_rename_quiz.copy()
df_rename_p1.rename(columns={'product_id': 'ProductID'}, inplace=True)
print("Column names after renaming 'product_id':")
print(df_rename_p1.columns.tolist())

# Problem 2 Answer
df_rename_p2 = df_rename_quiz.copy()
df_rename_p2.rename(columns={
    'product name': 'ProductName',
    'unit_price_usd': 'PriceUSD'
}, inplace=True)
print("\nDataFrame head after renaming 'product name' and 'unit_price_usd':")
print(df_rename_p2.head())

# Problem 3 Answer
df_rename_p3 = df_rename_quiz.copy()
# Helper function to convert snake_case (or space-separated) names to PascalCase
def to_pascal_case(snake_str):
    # Treat spaces like underscores so 'product name' splits into two words
    components = snake_str.replace(' ', '_').split('_')
    # Capitalize each word and skip empty pieces left by repeated separators
    return "".join(x.capitalize() for x in components if x)

new_columns_pascal = {col: to_pascal_case(col) for col in df_rename_p3.columns}
df_rename_p3.rename(columns=new_columns_pascal, inplace=True)
print("\nColumn names after converting all to PascalCase:")
print(df_rename_p3.columns.tolist())
```

</details>



-----

### **6. Replacing Values for Consistency 🎯**

Sometimes, a column might have different representations for the same logical category (e.g., 'M', 'Male', 'm' for male).

  * `df.replace(old_value, new_value)`: Replaces specific values.
  * `df.replace({'col1': {old1: new1}, 'col2': {old2: new2}})`: Replaces values in specific columns using a dictionary of dictionaries.
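
Here is a small sketch of the dictionary-of-dictionaries form. The `Gender` and `Plan` columns are invented for this illustration and do not come from `df_messy`.

```python
import pandas as pd

# Hypothetical data with inconsistent labels in two columns
df_demo = pd.DataFrame({
    "Gender": ["M", "Male", "m", "F"],
    "Plan": ["basic", "Basic", "pro", "basic"],
})

# Outer keys are column names; inner dictionaries map old values to new ones.
cleaned = df_demo.replace({
    "Gender": {"Male": "M", "m": "M"},
    "Plan": {"basic": "Basic", "pro": "Pro"},
})

print(cleaned)
```

Because each inner mapping is scoped to its own column, an `'m'` that happened to appear in `Plan` would be left alone.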

Let's clean up that `Payment Method` column with trailing spaces!


In [184]:
df_replaced = df_messy.copy()

# First, let's clean up whitespace in 'Payment Method'
# The .str accessor is used for string operations on Series
df_replaced['Payment Method'] = df_replaced['Payment Method'].str.strip()

print("Payment Method unique values after stripping whitespace:")
print(df_replaced['Payment Method'].unique()) # Check unique values to confirm!

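# Now, let's say we want to standardize 'Credit Card' to 'CC'.
# Assign the result back: replace(..., inplace=True) on a selected column acts on
# a temporary object and triggers a FutureWarning in pandas 2.x.
df_replaced['Payment Method'] = df_replaced['Payment Method'].replace('Credit Card', 'CC')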

print("\nDataFrame after replacing 'Credit Card' with 'CC':")
print(df_replaced['Payment Method'].unique()) # Check unique values again
display(df_replaced)

# Example: replace multiple values at once in a specific column
df_replaced['Product_Name'] = df_replaced['Product_Name'].replace({'Laptop': 'PC', 'Mouse': 'Pointing Device'})
print("\nDataFrame after replacing product names:")
display(df_replaced)

Payment Method unique values after stripping whitespace:
['Credit Card' 'Debit Card' 'Cash']

DataFrame after replacing 'Credit Card' with 'CC':
['CC' 'Debit Card' 'Cash']



| | Customer_ID | Product_Name | Price_USD | Quantity | Order_Date | Customer_Segment | Payment Method |
|---|---|---|---|---|---|---|---|
| 0 | 101 | Laptop | 1200.0 | 1 | 2023-01-15 | Gold | CC |
| 1 | 102 | Mouse | 25.5 | 2 | 2023/01/15 | Silver | Debit Card |
| 2 | 103 | Keyboard | 75.0 | 1 | 16-Jan-2023 | Gold | CC |
| 3 | 104 | Monitor | NaN | 1 | 2023-01-17 | Bronze | Cash |
| 4 | 105 | Laptop | 1200.0 | 1 | 2023-01-18 | Gold | CC |
| 5 | 106 | Webcam | 50.0 | 3 | 2023-01-18 | Silver | Debit Card |
| 6 | 107 | Monitor | 300.0 | 1 | 2023-01-17 | Bronze | Cash |
| 7 | 108 | Mouse | 25.5 | 2 | 2023/01/15 | Silver | Debit Card |
| 8 | 109 | Speaker | NaN | 1 | 2023-01-19 | Gold | CC |
| 9 | 101 | Laptop | 1200.0 | 1 | 2023-01-15 | Gold | CC |



DataFrame after replacing product names:



| | Customer_ID | Product_Name | Price_USD | Quantity | Order_Date | Customer_Segment | Payment Method |
|---|---|---|---|---|---|---|---|
| 0 | 101 | PC | 1200.0 | 1 | 2023-01-15 | Gold | CC |
| 1 | 102 | Pointing Device | 25.5 | 2 | 2023/01/15 | Silver | Debit Card |
| 2 | 103 | Keyboard | 75.0 | 1 | 16-Jan-2023 | Gold | CC |
| 3 | 104 | Monitor | NaN | 1 | 2023-01-17 | Bronze | Cash |
| 4 | 105 | PC | 1200.0 | 1 | 2023-01-18 | Gold | CC |
| 5 | 106 | Webcam | 50.0 | 3 | 2023-01-18 | Silver | Debit Card |
| 6 | 107 | Monitor | 300.0 | 1 | 2023-01-17 | Bronze | Cash |
| 7 | 108 | Pointing Device | 25.5 | 2 | 2023/01/15 | Silver | Debit Card |
| 8 | 109 | Speaker | NaN | 1 | 2023-01-19 | Gold | CC |
| 9 | 101 | PC | 1200.0 | 1 | 2023-01-15 | Gold | CC |


#### **Quick Quiz ❓**


**Problem 1 (Beginner):**
Replace all occurrences of 'bad ' (with a trailing space) in the `CustomerRating` column with 'Bad' (capitalized, no space). Print the unique values of `CustomerRating` after the replacement.

**Problem 2 (Intermediate):**
Clean up the `Region` column by removing any leading or trailing whitespace from all its values. Then, replace 'East' with 'Eastern' and 'West' with 'Western' in this column. Print the unique values of `Region` after these operations.

**Problem 3 (Advanced):**
Standardize the `Status` column. First, convert all values to lowercase and remove any leading/trailing whitespace. Then, replace 'in-active' with 'inactive' and 'pending' with 'On Hold'. Finally, print the DataFrame and the unique values of `Status` to verify the changes.

<details>
<summary>Click to show Answers</summary>

```python
# Problem 1 Answer
df_replace_p1 = df_replace_quiz.copy()
df_replace_p1['CustomerRating'] = df_replace_p1['CustomerRating'].replace('bad ', 'Bad')
print("Unique values of CustomerRating after replacement (P1):")
print(df_replace_p1['CustomerRating'].unique())

# Problem 2 Answer
df_replace_p2 = df_replace_quiz.copy()
df_replace_p2['Region'] = df_replace_p2['Region'].str.strip()
df_replace_p2['Region'] = df_replace_p2['Region'].replace({'East': 'Eastern', 'West': 'Western'})
print("\nUnique values of Region after stripping and replacement (P2):")
print(df_replace_p2['Region'].unique())

# Problem 3 Answer
df_replace_p3 = df_replace_quiz.copy()
df_replace_p3['Status'] = df_replace_p3['Status'].str.lower().str.strip()
df_replace_p3['Status'] = df_replace_p3['Status'].replace({'in-active': 'inactive', 'pending': 'On Hold'})
print("\nDataFrame after standardizing Status column (P3):")
print(df_replace_p3)
print("\nUnique values of Status after standardization (P3):")
print(df_replace_p3['Status'].unique())
```

</details>




-----

### **7. Pro Tips & Common Pitfalls 💡**

#### Pro Tips:

  * **Order Matters\!** The sequence of your cleaning steps can be important. For instance, filling NaNs before converting types is often necessary. Cleaning whitespace before checking unique values is smart.
  * **Work on Copies:** When performing complex cleaning, create a copy of your DataFrame (`df.copy()`) before making significant changes. This allows you to revert if something goes wrong.
  * **Check Unique Values (`.unique()`, `.value_counts()`):** Use these frequently, especially for categorical columns, to spot inconsistencies, typos, and variations.
  * **Automate:** For recurring datasets, put your cleaning steps into a function. ⚙️
  * **Visualize the Mess:** Sometimes, a quick histogram or bar plot can reveal data issues (e.g., outliers, skewed distributions).
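
As a rough sketch of the "Automate" and "Work on Copies" tips together, the steps from this chapter can be bundled into one function. The name `clean_orders` and the tiny `raw` frame are hypothetical; the column names mirror `df_messy`, but adapt the steps to your own data.

```python
import pandas as pd

def clean_orders(df):
    """Return a cleaned copy of df; the input frame is left untouched."""
    out = df.copy()
    # Strip stray whitespace before anything that compares strings
    out["Payment Method"] = out["Payment Method"].str.strip()
    # Fill missing prices with the median of the known prices
    out["Price_USD"] = out["Price_USD"].fillna(out["Price_USD"].median())
    # Exact duplicates only become visible after the steps above
    return out.drop_duplicates()

# Hypothetical messy input
raw = pd.DataFrame({
    "Payment Method": ["Cash ", "Cash", " Card"],
    "Price_USD": [10.0, 10.0, None],
})

cleaned = clean_orders(raw)
print(cleaned)
```

Note how the order matters here: the first two rows only count as duplicates once the trailing space has been stripped.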

#### Common Pitfalls:

  * **`inplace=True` Misuse:** Forgetting or overusing `inplace=True`. If you don't use it, remember to assign the result back: `df = df.dropna()`. If you use it, be aware that your original DataFrame is altered.
  * **Not Handling Missing Values Before Type Conversion:** Trying to convert a column with `NaN`s (float) to `int` will cause an error unless you handle the `NaN`s first.
  * **Ignoring `errors='coerce'` in `pd.to_numeric()`:** If a column you expect to be numeric contains non-numeric strings, `astype(int)` raises a `ValueError`. `pd.to_numeric(series, errors='coerce')` turns the unparseable entries into `NaN` instead.
  * **Not Cleaning Whitespace:** Trailing/leading spaces in string columns (`' Gold'` vs `'Gold'`) are a common cause of unexpected behavior (e.g., `value_counts()` showing two different categories). Always `.str.strip()`\!
  * **Date Format Nightmares:** Dates are tricky\! Always verify your date column's `dtype` after conversion. If it's still `object`, `pd.to_datetime` might need a specific `format` argument.
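
The two type-conversion pitfalls above can be seen in a few lines. This is a minimal sketch on a made-up Series; the nullable `Int64` dtype is shown as one possible alternative to filling, not as the tutorial's prescribed approach.

```python
import pandas as pd

# Hypothetical column with a stray string and a missing value
raw = pd.Series(["100", "50", "low", None])

# Unparseable entries become NaN instead of raising
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)

# Converting straight to int fails while NaN is still present
try:
    numbers.astype(int)
    cast_error = None
except ValueError as exc:
    cast_error = str(exc)
print("astype(int) error:", cast_error)

# Option 1: fill first, then convert
filled = numbers.fillna(0).astype(int)

# Option 2: keep the gaps by using pandas' nullable integer dtype
nullable = numbers.astype("Int64")
print(nullable)
```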



-----

### 8. Mini-Challenges: Data Cleaning Gym! 🏋️‍♀️

Let's take a fresh messy DataFrame and put your cleaning skills to the test!


In [None]:
import pandas as pd
import numpy as np

# A new messy dataset for challenges!
challenge_data = {
    'Item': ['Pen', 'Notebook', 'Pen', 'Eraser', 'Notebook', 'Pencil', 'Pen', 'Stapler', 'Notebook', 'Pen'],
    'Category': ['Stationery', 'Office Supplies ', 'Stationery', 'Stationery', 'Office Supplies', np.nan, 'Stationery', 'Office Supplies', 'Office Supplies', 'Stationery'],
    'Price': [2.5, 5.0, 2.5, 1.0, 5.0, 0.75, 2.5, 10.0, np.nan, 2.5],
    'Stock_Count': [100, 50, 100, 75, 50, 'low', 100, 30, 40, 100], # 'low' is problematic!
    'Last_Restock': ['2023-03-01', 'March 5, 2023', '2023-03-01', '2023/03/02', 'March 5, 2023', '2023-03-03', '2023-03-01', '2023-03-04', '2023-03-06', '2023-03-01']
}
df_challenge = pd.DataFrame(challenge_data)

print("Your Challenge DataFrame:")
print(df_challenge)
print("\nInitial info:")
df_challenge.info()


**Challenge 1: Identify and Fill Missing `Price` Values** 💰
Find how many missing values are in the 'Price' column, then fill them with the median price.


In [None]:
# Your code here for Challenge 1
missing_prices = df_challenge['Price'].isna().sum()
# print(f"Missing values in Price column: {missing_prices}")

median_price_challenge = df_challenge['Price'].median()
df_challenge['Price'] = df_challenge['Price'].fillna(median_price_challenge)

# print("\nDataFrame after filling Price NaNs:")
# print(df_challenge['Price'].isna().sum()) # Should be 0
# print(df_challenge)


**Challenge 2: Handle `Stock_Count` and Convert Type** 📦
The `Stock_Count` column has a non-numeric value (`'low'`). Convert this column to a numeric type, turning `'low'` into `NaN` because it cannot be converted to a number. Then, fill any resulting `NaN`s with the mean of the column. Finally, convert `Stock_Count` to integer type.


In [None]:
# Your code here for Challenge 2
# Convert 'Stock_Count' to numeric, coercing errors to NaN
df_challenge['Stock_Count'] = pd.to_numeric(df_challenge['Stock_Count'], errors='coerce')

# Fill resulting NaNs with the mean
mean_stock_count = df_challenge['Stock_Count'].mean()
df_challenge['Stock_Count'] = df_challenge['Stock_Count'].fillna(mean_stock_count)

# Convert to integer type
df_challenge['Stock_Count'] = df_challenge['Stock_Count'].astype(int)

# print("\nDataFrame after cleaning and converting Stock_Count:")
# print(df_challenge.dtypes)
# print(df_challenge)


**Challenge 3: Clean `Category` and Remove Duplicates** 🧼
Remove any leading/trailing whitespace from the 'Category' column. Then, identify and remove exact duplicate rows from the DataFrame, keeping only the first occurrence.


In [None]:
# Your code here for Challenge 3
# Clean whitespace in 'Category'
df_challenge['Category'] = df_challenge['Category'].str.strip()

# Print unique values to confirm strip worked (before and after)
# print("\nUnique categories before strip (if not done yet):", df_challenge['Category'].unique())
# df_challenge['Category'] = df_challenge['Category'].str.strip()
# print("Unique categories after strip:", df_challenge['Category'].unique())

# Drop exact duplicate rows
initial_rows = len(df_challenge)
df_challenge.drop_duplicates(inplace=True)
# print(f"\nNumber of rows before dropping duplicates: {initial_rows}")
# print(f"Number of rows after dropping duplicates: {len(df_challenge)}")
# print(df_challenge)


**Challenge 4: Standardize `Last_Restock` Date Format** 📅
Convert the `Last_Restock` column to a proper datetime format.


In [None]:
# Your code here for Challenge 4
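# The column mixes several date styles, so let pandas parse each value individually.
# format='mixed' needs pandas 2.0+, where infer_datetime_format is deprecated.
df_challenge['Last_Restock'] = pd.to_datetime(df_challenge['Last_Restock'], format='mixed')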

# print("\nDataFrame after converting Last_Restock to datetime:")
# print(df_challenge.dtypes)
# print(df_challenge)



-----

### End of Data Cleaning & Preparation! 🎉

Phew! You've successfully navigated the sometimes messy world of data cleaning. These skills are absolutely fundamental and will save you countless hours and headaches in your data analysis journey. You're becoming a true data detective! 🕵️‍♀️🔍

**What's next?** With clean data in hand, we're ready to perform more sophisticated analyses. Next up, we'll learn about grouping and aggregating data, a super powerful way to summarize your information! 📊

**Keep up the amazing work!** Your data skills are growing! 🌱



# **Data Cleaning & Transformation**  

### 📚 **Table of Contents**
1. [String Operations ✂️](#string-operations)
   - `str.replace()`, `str.contains()`
   - Case Conversion & Splitting
2. [Applying Functions 🛠️](#applying-functions)
   - `apply()`, `map()`, `applymap()`
3. [GroupBy Operations 📊](#groupby-operations)
   - `groupby()`, `agg()`, `transform()`
4. [Quick Quiz ❓](#quick-quiz)
5. [Mini-Project: Clean & Analyze Customer Data 👥](#mini-project)
6. [Pro Tips & Common Pitfalls 💡](#pro-tips)


---

<a id="string-operations"></a>  
## 1️⃣ **String Operations ✂️**

### **1. `str.replace()` - Replace Substrings**  
Clean messy string data.  


In [15]:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana"],
    "Email": ["alice@gmail.com", "bob@yahoo.com", "charlie@hotmail.com", "diana@gmail.com"]
})

# Replace domain (regex=False treats the pattern as literal text, so "." is not a wildcard)
df["Email"] = df["Email"].str.replace("@gmail.com", "@company.com", regex=False)
print("Updated Emails:\n", df["Email"])

Updated Emails:
 0      alice@company.com
1          bob@yahoo.com
2    charlie@hotmail.com
3      diana@company.com
Name: Email, dtype: object


### **2. `str.contains()` - Filter by Substrings**  
Find rows matching a pattern.  

In [16]:
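# Find Gmail users. The previous cell already rewrote every "@gmail.com"
# address to "@company.com", so no rows match any more.
gmail_users = df[df["Email"].str.contains("gmail")]
print("Gmail Users:\n", gmail_users)

Gmail Users:
 Empty DataFrame
Columns: [Name, Email]
Index: []

Searching for `"company"` instead of `"gmail"` returns the rows for Alice and Diana.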


**Other Useful Methods:**  
| Method          | Description                  | Example                     |  
|----------------|----------------------------|---------------------------|  
| `str.lower()`  | Convert to lowercase        | `df["Name"].str.lower()`  |  
| `str.split()`  | Split strings               | `df["Email"].str.split("@")` |  
| `str.len()`    | Get string length           | `df["Name"].str.len()`    |  
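
A quick sketch of the methods in the table, run on a made-up `emails` Series. The `expand=True` option is an addition here: it spreads the split pieces into separate columns instead of returning lists.

```python
import pandas as pd

# Hypothetical email addresses for illustration
emails = pd.Series(["alice@company.com", "bob@yahoo.com"])

# Without expand, each element becomes a list: ['alice', 'company.com']
as_lists = emails.str.split("@")

# With expand=True, the pieces land in their own columns
parts = emails.str.split("@", expand=True)
parts.columns = ["user", "domain"]
print(parts)

# Lower-casing and length work element by element
lowered = emails.str.lower()
lengths = emails.str.len()
print(lengths.tolist())
```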


---

<a id="applying-functions"></a>  
## 2️⃣ **Applying Functions 🛠️**

### **1. `apply()` - Row/Column-wise Operations**  
Run custom functions on DataFrames.  


In [17]:
# Calculate name lengths
df["Name Length"] = df["Name"].apply(lambda x: len(x))

# Uppercase all names
df["Name"] = df["Name"].apply(str.upper)
print("\nName Lengths:\n", df)


Name Lengths:
       Name                Email  Name Length
0    ALICE    alice@company.com            5
1      BOB        bob@yahoo.com            3
2  CHARLIE  charlie@hotmail.com            7
3    DIANA    diana@company.com            5


### **2. `map()` - Element-wise Replacement**
Map values using a dictionary.


In [18]:
# Keys must match the column values exactly. The names were upper-cased in the
# previous cell, so the keys are upper-case too; unmatched values would become NaN.
gender_map = {"ALICE": "F", "BOB": "M", "CHARLIE": "M", "DIANA": "F"}
df["Gender"] = df["Name"].map(gender_map)
print("\nWith Gender:\n", df)


With Gender:
       Name                Email  Name Length Gender
0    ALICE    alice@company.com            5      F
1      BOB        bob@yahoo.com            3      M
2  CHARLIE  charlie@hotmail.com            7      M
3    DIANA    diana@company.com            5      F


### **3. `applymap()` - Apply to All Elements**
Use for **element-wise** operations on entire DataFrames. Since pandas 2.1, `applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.


In [19]:
# Example: Create a dummy DataFrame
df_nums = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})

# Square all values (pandas >= 2.1; on older versions use df_nums.applymap(...))
squared = df_nums.map(lambda x: x ** 2)
print("\nSquared:\n", squared)


Squared:
    A   B
0  1  16
1  4  25
2  9  36


---

<a id="groupby-operations"></a>
## 3️⃣ **GroupBy Operations 📊**

### **1. `groupby()` - Split Data into Groups**
Group rows by a column's values.


In [20]:
# Group by gender
grouped = df.groupby("Gender")
print("Groups:", grouped.groups)  # Shows the row labels in each group

Groups: {'F': [0, 3], 'M': [1, 2]}


### **2. `agg()` - Aggregate Statistics**
Compute multiple stats per group.


In [21]:
# Name length statistics and head count by gender
agg_stats = grouped.agg({
    "Name Length": ["mean", "max", "min"],
    "Name": "count"  # Number of people
})
print("\nAggregated Stats:\n", agg_stats)


Aggregated Stats:
        Name Length          Name
              mean max min count
Gender                          
F              5.0   5   5     2
M              5.0   7   3     2


### **3. `transform()` - Group-wise Computations**
Return a result with the **same index** as the original, so it can be assigned straight back as a new column.


In [22]:
# Center name lengths by group mean
df["Centered Length"] = grouped["Name Length"].transform(
    lambda x: x - x.mean()
)
print("\nCentered Lengths:\n", df)


Centered Lengths:
       Name                Email  Name Length Gender  Centered Length
0    ALICE    alice@company.com            5      F              0.0
1      BOB        bob@yahoo.com            3      M             -2.0
2  CHARLIE  charlie@hotmail.com            7      M              2.0
3    DIANA    diana@company.com            5      F              0.0


---

<a id="quick-quiz"></a>  
## ❓ **Quick Quiz**

1. **How to filter rows where `Email` contains "yahoo"?**
   - A) `df[df["Email"] == "yahoo"]`
   - B) `df[df["Email"].str.contains("yahoo")]` ✅
   - C) `df.filter("yahoo")`

2. **What does `groupby().agg()` return?**
   - A) A modified DataFrame with new columns
   - B) A DataFrame with aggregated statistics per group ✅
   - C) A list of group names  


---

<a id="mini-project"></a>  
## 🎯 **Mini-Project: Clean & Analyze Customer Data 👥**

**Scenario:** Process a customer dataset with messy strings and group metrics.  


In [23]:
data = {
    "Customer": ["ALICE", "Bob", "CHARLIE", "diana"],
    "Spend": [120, 80, 200, 150],
    "City": ["NY", "Paris", "NY", "London"]
}
customers = pd.DataFrame(data)

# Task 1: Standardize names (capitalize first letter)
customers["Customer"] = customers["Customer"].str.capitalize()

# Task 2: Group by city and calculate stats
city_stats = customers.groupby("City").agg({
    "Spend": ["sum", "mean", "count"],
    "Customer": lambda x: ", ".join(x)  # List customers per city
})

# Task 3: Add spend rank per city
customers["Rank"] = customers.groupby("City")["Spend"].rank(ascending=False)

print("Cleaned Data:\n", customers)
print("\nCity Stats:\n", city_stats)

Cleaned Data:
   Customer  Spend    City  Rank
0    Alice    120      NY   2.0
1      Bob     80   Paris   1.0
2  Charlie    200      NY   1.0
3    Diana    150  London   1.0

City Stats:
        Spend                     Customer
         sum   mean count        <lambda>
City                                     
London   150  150.0     1           Diana
NY       320  160.0     2  Alice, Charlie
Paris     80   80.0     1             Bob



---

<a id="pro-tips"></a>  
## 💡 **Pro Tips & Common Pitfalls**

✅ **Use `pd.NamedAgg` for readable multi-column aggregations (pandas 0.25+):**
```python
city_stats = customers.groupby("City").agg(
    total_spend=pd.NamedAgg(column="Spend", aggfunc="sum"),
    avg_spend=pd.NamedAgg(column="Spend", aggfunc="mean")
)
```

❌ **Avoid `apply()` for simple operations (use vectorized methods like `str.upper()` instead).**
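
For instance, here is a small sketch on a made-up `names` Series showing that the vectorized `.str` methods give the same result as `apply()` without a Python-level function call per element:

```python
import pandas as pd

# Hypothetical names for illustration
names = pd.Series(["alice", "bob", "charlie"])

# Works, but calls a Python function once per element
via_apply = names.apply(str.upper)

# Same result through the vectorized string accessor
via_str = names.str.upper()

# Likewise, .str.len() replaces apply(lambda x: len(x))
lengths = names.str.len()

print(via_str.tolist())
print(lengths.tolist())
```

Keep `apply()` for logic that has no built-in vectorized equivalent.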

🔥 **Chain operations efficiently:**
```python
# Clean, filter, and group in one go
result = (
    customers.assign(Clean_Name=lambda x: x["Customer"].str.capitalize())
    .query("Spend > 100")
    .groupby("City")["Spend"]  # select the numeric column; mean() fails on text columns in pandas 2.x
    .mean()
)
```

---

## 🎉 **Key Takeaways**
✔ **String ops** clean text data (`str.replace()`, `str.contains()`).
✔ **`apply()`/`map()`** customize transformations.
✔ **`groupby()` + `agg()`/`transform()`** reveal group patterns.

**Next Steps:**  
- Try merging string columns with `df["Full_Name"] = df["First"] + " " + df["Last"]`.  
- Explore time-based grouping with `pd.Grouper(key="Date")`.  
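
Both next steps can be tried with a short sketch like the one below. The `people` and `sales` frames are invented for illustration, and the month-start frequency `"MS"` is just one choice of `freq`; `pd.Grouper` needs some frequency to bucket dates.

```python
import pandas as pd

# Merging string columns (hypothetical names)
people = pd.DataFrame({"First": ["Ada", "Alan"], "Last": ["Lovelace", "Turing"]})
people["Full_Name"] = people["First"] + " " + people["Last"]
print(people)

# Time-based grouping (hypothetical sales)
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03"]),
    "Amount": [100, 50, 75],
})

# Bucket rows by calendar month (labelled by month start) and total the amounts
monthly = sales.groupby(pd.Grouper(key="Date", freq="MS"))["Amount"].sum()
print(monthly)
```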

**Happy data wrangling!** 🐼✨


# **🔗 Merging & Reshaping Data**

> **Bringing Your Data Together & Changing Its View! 🔄**

Hello, data architects! 🏗️ You've mastered cleaning your data, which is fantastic! But what if your valuable information is spread across multiple tables or files? Or what if your data is organized in a way that's not ideal for your analysis?

That's where **Merging** and **Reshaping** come in!
* **Merging (or Joining):** This is like stitching together different pieces of a puzzle 🧩 to form a complete picture. You combine DataFrames based on common columns.
* **Reshaping:** This is about changing the layout of your data, like rotating a table or stacking columns into rows. It helps you get your data in the perfect form for plotting or analysis. 📐

Get ready to connect your datasets and twist them into new, insightful shapes! ✨

---

### 📚 Table of Contents: Merging & Reshaping Data

1.  **Merging DataFrames (SQL-Style Joins)** 🤝
    * What are Merges? (The Analogy)
    * Inner Merge (`pd.merge(..., how='inner')`)
    * Left Merge (`pd.merge(..., how='left')`)
    * Right Merge (`pd.merge(..., how='right')`)
    * Outer Merge (`pd.merge(..., how='outer')`)
    * **Practice Problems: Merging** 🧪
2.  **Reshaping DataFrames** 🔁
    * Pivoting (`.pivot_table()`)
    * Stacking & Unstacking (`.stack()`, `.unstack()`)
    * Melting (`.melt()`)
    * **Practice Problems: Reshaping** 🧪
3.  **Pro Tips & Common Pitfalls** 💡
4.  **Mini-Challenge: Integrated Data Transformation!** 🎯



---

### 1. Merging DataFrames (SQL-Style Joins) 🤝

Imagine you have two separate lists: one has customer IDs and their names, and another has customer IDs and their recent orders. To know which customer made which order, you need to *merge* these lists using the common 'Customer ID'.

The pandas `pd.merge()` function is incredibly powerful and works much like SQL JOINs.

Let's create two sample DataFrames:


In [None]:
import pandas as pd
import numpy as np # For potential NaN values

# DataFrame 1: Customer Information
customers_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Seattle']
})

# DataFrame 2: Order Information
orders_df = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [2, 4, 1, 6, 3, 2], # Customer ID 6 doesn't exist in customers_df
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Speaker', 'Webcam'],
    'amount': [1200, 25, 75, 300, 80, 50]
})

print("Customers DataFrame:")
display(customers_df)
print("\nOrders DataFrame:")
display(orders_df)



#### **Understanding `on`, `left_on`, `right_on`, and `how`**

  * **`on`**: The column(s) to join on. If columns have the same name in both DataFrames, use `on='column_name'`.
  * **`left_on` / `right_on`**: If the join column has *different* names in each DataFrame, use `left_on='name_in_left_df'` and `right_on='name_in_right_df'`.
  * **`how`**: This specifies the type of merge (join). This is the most crucial part\!
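
The sample DataFrames above share the key name `customer_id`, so here is a separate sketch of the `left_on` / `right_on` case. The frames and the `cust_id` column are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Alice", "Bob"]})
orders = pd.DataFrame({"order_id": [101, 102], "cust_id": [2, 1]})

merged = pd.merge(
    customers,
    orders,
    left_on="customer_id",   # key column in the left frame
    right_on="cust_id",      # key column in the right frame
    how="inner",
)

# Both key columns are kept in the result; drop one if you don't need it
print(merged)
tidy = merged.drop(columns="cust_id")
```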



---

#### **Inner Merge (`how='inner'`): Only the Matches! 🤝**

An inner merge returns only the rows where the join key exists in *both* DataFrames. It's like finding the common ground.


In [None]:
# Inner merge on 'customer_id'
inner_merged_df = pd.merge(customers_df, orders_df, on='customer_id', how='inner')

print("Inner Merged DataFrame:")
display(inner_merged_df)

# Notice customer_id 6 (from orders_df) and customer_id 5 (from customers_df) are not present.


**Explanation:** Only customers `1`, `2`, `3`, and `4` appear in both `customers_df` and `orders_df`, so only their combined records are in the `inner_merged_df`. Customer ID `6` from `orders_df` and Customer ID `5` from `customers_df` are excluded because they don't have a match in the other DataFrame.



---

#### **Left Merge (`how='left'`): Keep All from Left, Match from Right! ⬅️**

A left merge returns all rows from the *left* DataFrame and matching rows from the *right* DataFrame. If there's no match in the right DataFrame, `NaN` is filled for the columns from the right.


In [None]:
# Left merge on 'customer_id'
left_merged_df = pd.merge(customers_df, orders_df, on='customer_id', how='left')

print("Left Merged DataFrame:")
display(left_merged_df)

# Notice customer_id 5 from customers_df is included, but its order columns are NaN.


**Explanation:** All customers from `customers_df` (the left DataFrame) are included. For `customer_id` 5, since there's no corresponding order in `orders_df`, the columns from `orders_df` (`order_id`, `product`, `amount`) are filled with `NaN`.



----

#### **Right Merge (`how='right'`): Keep All from Right, Match from Left ➡️**

A right merge returns all rows from the *right* DataFrame and matching rows from the *left* DataFrame. If there's no match in the left DataFrame, `NaN` is filled for the columns from the left.


In [None]:
# Right merge on 'customer_id'
right_merged_df = pd.merge(customers_df, orders_df, on='customer_id', how='right')

print("Right Merged DataFrame:")
display(right_merged_df)

# Notice customer_id 6 from orders_df is included, but its customer columns are NaN.

**Explanation:** All orders from `orders_df` (the right DataFrame) are included. For `customer_id` 6, since there's no corresponding customer in `customers_df`, the columns from `customers_df` (`name`, `city`) are filled with `NaN`.


---

#### **Outer Merge (`how='outer'`): Keep Everything! 🌐**

An outer merge returns all rows from *both* DataFrames, combining them where there are matches and filling `NaN` where there are no matches. It's the "union" of both DataFrames.


In [None]:
# Outer merge on 'customer_id'
outer_merged_df = pd.merge(customers_df, orders_df, on='customer_id', how='outer')

print("Outer Merged DataFrame:")
display(outer_merged_df)

# Notice both customer_id 5 and customer_id 6 are included, with NaNs where no match exists.


**Explanation:** This merge includes all records that exist in either `customers_df` or `orders_df`. Where a `customer_id` exists in one but not the other, `NaN` values are used to fill the missing information.
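
If you want to see where each row of an outer merge came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses small made-up frames rather than the ones above.

```python
import pandas as pd

# Hypothetical frames: customer 4 has an order but no customer record
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"customer_id": [2, 4], "amount": [1200, 25]})

audit = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)
print(audit)

# _merge is 'both', 'left_only' or 'right_only' for every row
unmatched_orders = audit[audit["_merge"] == "right_only"]
print(unmatched_orders)
```

Filtering on `_merge` is a handy way to audit which keys failed to match before deciding between an inner, left, or right merge.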



-----

#### 🧪 Practice Problems: Merging

Let's work with new DataFrames.


In [None]:
# Practice DataFrames
products_df = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Webcam'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Peripherals']
})

inventory_df = pd.DataFrame({
    'item_id': [1, 2, 5, 3],  # inventory-only identifier; it does not appear in products_df
    'product_id': [1, 2, 1, 3],  # product_id 1 appears twice; product_id 4 (Webcam) has no stock record
    'stock_quantity': [100, 250, 50, 120],
    'warehouse': ['A', 'B', 'C', 'A']
})

print("Products DataFrame:")
print(products_df)
print("\nInventory DataFrame:")
print(inventory_df)


**Practice 1: Inner Merge** 🤔
Perform an **inner merge** between `products_df` and `inventory_df` on the `product_id` column. Store the result in `inner_product_inventory`. What products are included?

**Practice 2: Left Merge (Inventory View)** ⬅️
Perform a **left merge** where `inventory_df` is the left DataFrame and `products_df` is the right DataFrame, merging on `product_id`. Store the result in `inventory_with_details`. What happens to `item_id` 5?

**Practice 3: Right Merge (Products View)** ➡️
Perform a **right merge** where `products_df` is the right DataFrame and `inventory_df` is the left DataFrame, merging on `product_id`. Store the result in `products_with_inventory`. What happens to 'Webcam'?

<details>
<summary>Click to show Answers</summary>

```python
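# Practice 1 Answer
inner_product_inventory = pd.merge(products_df, inventory_df, on='product_id', how='inner')
# print(inner_product_inventory)
# Laptop (twice, once per inventory row), Mouse and Keyboard are included; Webcam has no inventory row.

# Practice 2 Answer
inventory_with_details = pd.merge(inventory_df, products_df, on='product_id', how='left')
# print(inventory_with_details)
# item_id 5 has product_id 1, so it is matched to Laptop like any other row.

# Practice 3 Answer
products_with_inventory = pd.merge(inventory_df, products_df, on='product_id', how='right')
# print(products_with_inventory)
# Webcam is kept, with NaN in the inventory columns (item_id, stock_quantity, warehouse).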
```

</details>



-----

### **2. Reshaping DataFrames 🔁**

Reshaping data means changing its structure, from wide to long or from long to wide. This is super useful for:

  * Making data ready for specific types of analysis (e.g., plotting changes over time).
  * Cleaning up "unpivotable" data.
  * Aggregating data in new ways.



---

#### **Pivoting (`.pivot_table()`): From Long to Wide ↔️**

`pivot_table()` is used to create a "pivot table" similar to those in Excel. It rearranges data from a "long" format (where values are stacked in one column) into a "wide" format (where values spread across multiple new columns). It also allows for aggregation\!

  * `values`: The column(s) whose values will populate the new table.
  * `index`: The column(s) to put on the new index (rows).
  * `columns`: The column(s) whose unique values will become new columns.
  * `aggfunc`: The aggregation function to apply, usually given as a string name such as `'sum'`, `'mean'`, or `'count'`. The default is `'mean'`.

Let's create some sales data that's "long":


In [None]:
# Sales data in a 'long' format
sales_long_df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
    'Region': ['East', 'West', 'East', 'West', 'Central'],
    'Product': ['A', 'B', 'A', 'C', 'A'],
    'Units_Sold': [10, 15, 8, 12, 5],
    'Revenue': [100, 200, 80, 150, 60]
})

# Convert 'Date' to datetime for good practice
sales_long_df['Date'] = pd.to_datetime(sales_long_df['Date'])

print("Sales Long Format DataFrame:")
display(sales_long_df)


Now, let's pivot it!


In [None]:
# Pivot to see total Units_Sold by Date and Product
# Here, 'Date' becomes index, 'Product' becomes columns, 'Units_Sold' are values, aggregated by sum.
pivot_units_sold = sales_long_df.pivot_table(
    values='Units_Sold',
    index='Date',
    columns='Product',
    aggfunc='sum'
)
print("\nPivoted Units Sold by Date and Product:")
display(pivot_units_sold)

# Pivot to see total Revenue by Region and Product (fill missing values with 0)
pivot_revenue_region = sales_long_df.pivot_table(
    values='Revenue',
    index='Region',
    columns='Product',
    aggfunc='sum',
    fill_value=0 # Fill NaNs (where a product wasn't sold in a region) with 0
)
print("\nPivoted Revenue by Region and Product (filled NaNs with 0):")
display(pivot_revenue_region)


**⚠️ Pitfall:** `pivot_table()` needs an aggregation function (`aggfunc`) because there might be multiple values for a given `index`-`columns` combination (e.g., two sales of Product A on the same date). If you don't specify one, it defaults to `mean`, which might not be what you want! If every combination is unique, you can use `.pivot()` instead, which takes no `aggfunc` and raises a `ValueError` when it finds duplicates.
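
To make the difference concrete, here is a minimal sketch. The `dup_sales` frame is made up for illustration; it contains two rows for Product A on the same date, so `.pivot()` refuses to reshape it while `.pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data: Product A appears twice on the same date
dup_sales = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'Product': ['A', 'A', 'B'],
    'Units_Sold': [10, 5, 8],
})

# .pivot() cannot decide which of the two values to keep, so it raises
try:
    dup_sales.pivot(index='Date', columns='Product', values='Units_Sold')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)

# .pivot_table() combines the duplicates with the chosen aggfunc
totals = dup_sales.pivot_table(index='Date', columns='Product',
                               values='Units_Sold', aggfunc='sum')
print(totals)
```

If the duplicates are unexpected, the error from `.pivot()` is often the more useful behavior, since it flags a data problem instead of hiding it behind an average.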



---

#### **Stacking & Unstacking (`.stack()`, `.unstack()`): Managing MultiIndex 📈**

These methods are great for transforming data between "tall" (stacked) and "wide" (unstacked) formats, especially when dealing with MultiIndex (hierarchical indexes).

  * `.stack()`: Pivots a level of the column labels to the row index, creating a MultiIndex on the rows. Makes the DataFrame "taller".
  * `.unstack()`: Pivots a level of the row index to the column labels, creating a MultiIndex on the columns. Makes the DataFrame "wider".

Let's use our `pivot_units_sold` DataFrame from above. It has a single-level index (`Date`) and single-level columns (`Product`), so stacking it will produce a MultiIndex on the rows.


In [None]:
print("Original pivoted DataFrame (pivot_units_sold):")
display(pivot_units_sold)

In [None]:

# Stack the 'Product' level from columns to rows
stacked_df = pivot_units_sold.stack()
print("\nStacked DataFrame:")
display(stacked_df)
print(stacked_df.index) # Notice the MultiIndex!

In [None]:
# Unstack the 'Product' level back from rows to columns
unstacked_df = stacked_df.unstack()
print("\nUnstacked DataFrame (back to original pivot form):")
display(unstacked_df)


In [None]:
# Both stack() and unstack() accept a `level` argument to choose which level to move.
# Let's create a DataFrame with MultiIndex columns for another example
multi_col_df = pd.DataFrame(np.random.rand(3, 4),
                            index=['A', 'B', 'C'],
                            columns=pd.MultiIndex.from_product([['Sales', 'Costs'], ['Q1', 'Q2']]))
print("\nDataFrame with MultiIndex Columns:")
display(multi_col_df)

In [None]:
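# Move the second level of the columns ('Q1', 'Q2') into the row index.
# Note: .stack() acts on column levels; .unstack() acts on row-index levels.
stacked_multi_col = multi_col_df.stack(level=1)
print("\nStacked MultiIndex Columns (quarters moved to rows):")
display(stacked_multi_col)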


**Explanation:**

  * `stack()` takes the innermost level of the column index (in `pivot_units_sold`, that's 'Product' A, B, C) and rotates it to become the innermost level of the row index. The result is a Series when the columns had only one level (as here), or a DataFrame when other column levels remain. Either way, the rows get a MultiIndex.
  * `unstack()` does the reverse, taking a level from the row index (by default the innermost) and rotating it to become the innermost level of the column index.
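
As a rough illustration of the round trip, the sketch below stacks and then unstacks a small frame with two column levels. The region names and figures are invented for the example.

```python
import pandas as pd

# Hypothetical quarterly figures with two column levels: metric and quarter
wide = pd.DataFrame(
    [[100, 120, 60, 70], [90, 95, 50, 55]],
    index=['North', 'South'],
    columns=pd.MultiIndex.from_product([['Sales', 'Costs'], ['Q1', 'Q2']],
                                       names=['Metric', 'Quarter']),
)

# stack() moves the innermost column level ('Quarter') into the row index.
# Some pandas 2.x releases may emit a FutureWarning about the stack implementation.
tall = wide.stack()
print(tall)

# unstack() moves the innermost row level ('Quarter') back into the columns
round_trip = tall.unstack()
print(round_trip)
```

After stacking, each row is identified by a (region, quarter) pair, which is the "taller" shape described above.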



---

#### **Melting (`.melt()`): From Wide to Long ↔️**

`pd.melt()` is the opposite of `pivot_table()`. It transforms a DataFrame from a "wide" format to a "long" format. This is incredibly useful when you have multiple columns that represent values of a single variable, and you want to stack them into one or more value columns.

  * `id_vars`: Column(s) to keep as identifier variables (these won't be melted).
  * `value_vars`: Column(s) to unpivot (these will be stacked into a single new column). If `None`, all non-`id_vars` columns are melted.
  * `var_name`: Name for the new column that will contain the original column names.
  * `value_name`: Name for the new column that will contain the values from the original melted columns.

Let's create a "wide" DataFrame:


In [None]:
# Data where each column is a metric for a year
yearly_sales_wide = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago'],
    '2020_Sales': [1000, 1500, 800],
    '2021_Sales': [1100, 1600, 850],
    '2022_Sales': [1200, 1700, 900]
})

print("Yearly Sales (Wide Format):")
display(yearly_sales_wide)

# Melt the DataFrame to a 'long' format
# 'City' is our identifier. The year columns will be melted.
melted_sales_df = pd.melt(yearly_sales_wide,
                          id_vars=['City'],
                          value_vars=['2020_Sales', '2021_Sales', '2022_Sales'],
                          var_name='Year_Metric', # New column for '2020_Sales', '2021_Sales', etc.
                          value_name='Sales_Amount') # New column for the actual sales values
print("\nMelted Sales (Long Format):")
display(melted_sales_df)



**Explanation:** The `2020_Sales`, `2021_Sales`, `2022_Sales` columns are "unpivoted." Their names (`2020_Sales`, etc.) go into the `Year_Metric` column, and their actual sales values go into the `Sales_Amount` column. The `City` column remains as an identifier. This makes it much easier to, for example, plot sales trends over years using a single `Sales_Amount` column.
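
One possible follow-up step, sketched here under the assumption that every melted label has the form `<year>_Sales`, is to pull the year out into its own numeric column so it can be sorted or plotted directly. The `Year` column name is an addition for this example.

```python
import pandas as pd

yearly_sales_wide = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago'],
    '2020_Sales': [1000, 1500, 800],
    '2021_Sales': [1100, 1600, 850],
    '2022_Sales': [1200, 1700, 900],
})

# value_vars is omitted, so every column except 'City' is melted
melted = pd.melt(yearly_sales_wide, id_vars=['City'],
                 var_name='Year_Metric', value_name='Sales_Amount')

# Pull the four-digit year out of labels such as '2020_Sales'
melted['Year'] = melted['Year_Metric'].str.split('_').str[0].astype(int)

# With a numeric Year column, per-city trends are a simple sort or groupby away
print(melted.sort_values(['City', 'Year']))
```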




-----

#### **🧪 Practice Problems: Reshaping**


In [None]:
# Practice Data for Reshaping
sensor_data = pd.DataFrame({
    'SensorID': ['A1', 'A2', 'B1', 'B2'],
    'Jan_Temp': [20.1, 22.5, 18.0, 21.3],
    'Jan_Humidity': [60, 65, 70, 62],
    'Feb_Temp': [21.5, 23.0, 19.5, 22.0],
    'Feb_Humidity': [61, 66, 72, 63],
    'Mar_Temp': [23.0, 24.5, 21.0, 23.5],
    'Mar_Humidity': [63, 68, 75, 65]
})

print("Original Sensor Data (Wide):")
# print(sensor_data)


**Practice 1: Melt Sensor Data** 🌡️
Melt the `sensor_data` DataFrame from wide to long format.

  * Keep `SensorID` as the identifier.
  * Melt all other columns.
  * Name the new variable column `Metric_Month` and the value column `Value`.
  * Store the result in `melted_sensor_data`.



**Practice 2: Pivot Back to Temperature & Humidity** 🎯
From `melted_sensor_data` (from Practice 1), try to pivot it to get temperature (`Temp`) and humidity (`Humidity`) as separate columns, indexed by `SensorID` and `Month`.
*Hint: You'll need to extract the month and the metric type (Temp/Humidity) from the `Metric_Month` column first.*


<details><summary>Click for Answers</summary>

```python
# Your code here for Practice 1
melted_sensor_data = pd.melt(sensor_data,
                             id_vars=['SensorID'],
                             var_name='Metric_Month',
                             value_name='Value')
# print(melted_sensor_data)

# Your code here for Practice 2

# Step 1: Split 'Metric_Month' into 'Month' and 'Metric_Type'
melted_sensor_data[['Month', 'Metric_Type']] = melted_sensor_data['Metric_Month'].str.split('_', expand=True)

# print("\nMelted Sensor Data with new columns:")
# print(melted_sensor_data.head())

# Step 2: Pivot the data
# index: SensorID, Month
# columns: Metric_Type (Temp, Humidity)
# values: Value
reshaped_sensor_data = melted_sensor_data.pivot_table(
    index=['SensorID', 'Month'],
    columns='Metric_Type',
    values='Value'
)
# print("\nReshaped Sensor Data (Temperature and Humidity as columns):")
# print(reshaped_sensor_data)

```

</details>


-----

### **3. Pro Tips & Common Pitfalls 💡**

#### Pro Tips:

  * **Choose the Right Merge Type:** Understand `inner`, `left`, `right`, and `outer`. A wrong `how` can lead to lost data or too many `NaN`s.
  * **Specify `on`:** Always explicitly state the `on` (or `left_on`/`right_on`) columns. Relying on Pandas to guess can lead to unexpected results if columns happen to have the same name but shouldn't be merged.
  * **Index Resets After Merge:** Merging usually resets the index. If your index was meaningful, consider making it a column before merging or using `set_index()` after.
  * **Pivoting for Aggregation:** `pivot_table()` is your go-to for summarizing data by categories and turning it into a cross-tabulation.
  * **Melting for Plotting:** Data in "long" format (melted) is often preferred for visualization libraries like Seaborn or Matplotlib, as it simplifies mapping variables to axes.
  * **Understand MultiIndex:** `stack()` and `unstack()` work wonders with MultiIndex. Practice creating and manipulating MultiIndex DataFrames.

#### Common Pitfalls:

  * **Forgetting `on` or `how` in `merge()`:** Without `on`, Pandas merges on every column name the two DataFrames share (and raises a `MergeError` if they share none), which may not be the key you intended. Without `how`, you get an inner join, which silently drops unmatched rows.
  * **Column Name Mismatch:** Typos or case sensitivity issues in column names when merging or pivoting (`CustomerID` != `customer_id`). Use `.columns` to verify.
  * **Data Type Mismatch in Merge Keys:** If `customer_id` is int in one DF and string in another, `merge` will either raise an error or find no matches. Convert types first!
  * **`pivot_table()` and Duplicates:** If your `index` and `columns` combination is not unique and you don't specify an `aggfunc`, `pivot_table` silently averages the duplicates (the default is `mean`), while `.pivot()` raises a `ValueError`.
  * **Not Handling `NaN`s in Pivoted Data:** Pivoting often introduces `NaN`s (e.g., a product wasn't sold in a particular region/date). Remember to `fillna()` if you want to replace them.
  * **`melt()` vs. Manual Loop:** Trying to manually loop through columns to reshape data is almost always less efficient and more error-prone than using `pd.melt()`.
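
A quick way to check whether a merge lost or failed to match rows is the `indicator` parameter of `pd.merge()`. The sketch below uses two invented tables (`customers` and `orders`) and counts how many rows matched on both sides.

```python
import pandas as pd

# Hypothetical tables: customer 3 has no orders, and order 103 points to an unknown customer
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Ben', 'Cy']})
orders = pd.DataFrame({'order_id': [101, 102, 103], 'customer_id': [1, 2, 9]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(audit)

# Count how many rows matched and how many came from only one side
match_counts = audit['_merge'].value_counts()
print(match_counts)
```

Running an outer merge with the indicator first, and only then choosing the final `how`, makes it easier to see what each join type would discard.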



-----

### **4. Mini-Challenge: Integrated Data Transformation! 🎯**

Let's put your merging and reshaping skills to the test with a mini-project!

You have two datasets from a fictional online store:

  * **`sales_data.csv`**: Contains daily sales information for different products.
  * **`product_details.csv`**: Contains additional details about each product.

Your goal is to:

1.  Load both datasets.
2.  Merge them to include product details with sales data.
3.  Reshape the merged data to show the total quantity sold for each product category on each date.


<details><summary>Click for Code</summary>

```python
# Create dummy CSV files for the challenge
sales_csv_content = """Date,ProductID,Quantity
2023-04-01,P001,5
2023-04-01,P002,3
2023-04-01,P003,2
2023-04-02,P001,7
2023-04-02,P002,4
2023-04-02,P004,1
2023-04-03,P001,2
2023-04-03,P003,6
2023-04-03,P005,3
"""
with open('sales_data.csv', 'w') as f:
    f.write(sales_csv_content)

product_details_csv_content = """ProductID,ProductName,Category
P001,Laptop,Electronics
P002,Mouse,Electronics
P003,Keyboard,Electronics
P004,Monitor,Peripherals
P005,Webcam,Peripherals
"""
with open('product_details.csv', 'w') as f:
    f.write(product_details_csv_content)

print("Dummy CSV files created for the challenge! 🎉")

# Import pandas in case this cell is run on its own
import pandas as pd
```

</details>

**Challenge Steps:**

**Step 1: Load Data** 📥
Load `sales_data.csv` into `sales_df` and `product_details.csv` into `product_df`.
Make sure `Date` in `sales_df` is a datetime object.

```python
# Your code here for Step 1
sales_df = pd.read_csv('sales_data.csv')
product_df = pd.read_csv('product_details.csv')

sales_df['Date'] = pd.to_datetime(sales_df['Date'])

# print("Sales DataFrame Head:")
# print(sales_df.head())
# print("\nProduct Details DataFrame Head:")
# print(product_df.head())
```

**Step 2: Merge DataFrames** 🤝
Perform a merge operation to combine `sales_df` and `product_df`. You want to keep all sales records and add their product details.
Store the result in `merged_sales_details`.

```python
# Your code here for Step 2
merged_sales_details = pd.merge(sales_df, product_df, on='ProductID', how='left')

# print("\nMerged Sales Details DataFrame Head:")
# print(merged_sales_details.head())
```

**Step 3: Reshape for Category Sales per Date** 📊
Create a pivot table from `merged_sales_details` that shows the total `Quantity` sold for each `Category` (`columns`) on each `Date` (`index`).
Fill any missing values (where a category had no sales on a specific date) with 0.
Store the result in `daily_category_sales`.

```python
# Your code here for Step 3
daily_category_sales = merged_sales_details.pivot_table(
    values='Quantity',
    index='Date',
    columns='Category',
    aggfunc='sum',
    fill_value=0
)

# print("\nDaily Category Sales Pivot Table:")
# print(daily_category_sales)
```


-----

### End of Merging & Reshaping! 🎉

You've just performed some truly powerful data transformations! Merging allows you to integrate disparate datasets, and reshaping gives you the flexibility to put your data in the form that best suits your analysis. These are core skills for any data professional. 🏗️

**Ready for the next challenge?** We'll now explore the fascinating world of **Time Series Analysis** with Pandas! ⏰

**Keep pushing forward!** You're building a robust data skill set! 💪

# **Merging & Reshaping Data**  


### 📚 **Table of Contents**  
1. [Concatenation 🧩](#concatenation)  
   - `pd.concat()` Basics  
   - Axis Argument (Rows vs. Columns)  
2. [Joins & Merges 🤝](#joins--merges)  
   - `pd.merge()` (Inner, Outer, Left, Right)  
   - `join()` Method  
3. [Pivoting & Melting 🔄](#pivoting--melting)  
   - `pivot_table()`  
   - `melt()`  
4. [Quick Quiz ❓](#quick-quiz)  
5. [Hands-on Project: COVID-19 Data Analysis 🦠](#hands-on-project)  
6. [Pro Tips & Common Pitfalls 💡](#pro-tips)  


---

<a id="concatenation"></a>  
## 1️⃣ **Concatenation 🧩**  

### **1. `pd.concat()` - Combine DataFrames**  
Stack DataFrames **vertically (rows)** or **horizontally (columns)**.  


In [24]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# Vertical concatenation (default: axis=0)
result_vertical = pd.concat([df1, df2], axis=0)
print("Vertical Concatenation:\n", result_vertical)

# Horizontal concatenation (axis=1)
result_horizontal = pd.concat([df1, df2], axis=1)
print("\nHorizontal Concatenation:\n", result_horizontal)

Vertical Concatenation:
    A  B
0  1  3
1  2  4
0  5  7
1  6  8

Horizontal Concatenation:
    A  B  A  B
0  1  3  5  7
1  2  4  6  8


**Key Parameters:**  
| Parameter | Description |  
|-----------|-------------|  
| `axis`    | `0` for rows, `1` for columns |  
| `ignore_index` | Reset index (avoids duplicate indices) |  
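
The effect of `ignore_index` is easiest to see next to the default output. The sketch below also shows `keys`, a `pd.concat()` parameter not listed in the table, as one alternative way to keep track of which frame each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# ignore_index=True discards the original labels and renumbers rows 0..n-1
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)

# keys= labels each source frame instead, producing a two-level row index
labeled = pd.concat([df1, df2], keys=["first", "second"])
print(labeled.loc["second"])
```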


---

<a id="joins--merges"></a>  
## 2️⃣ **Joins & Merges 🤝**  

### **1. `pd.merge()` - SQL-like Joins**  
Combine DataFrames based on **common columns (keys)**.  


In [25]:
# Sample DataFrames
employees = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"]
})

sales = pd.DataFrame({
    "ID": [1, 2, 4],
    "Revenue": [200, 150, 300]
})

# Inner join (default: only matching keys)
inner_join = pd.merge(employees, sales, on="ID")
print("Inner Join:\n", inner_join)

# Left join (all rows from left DF)
left_join = pd.merge(employees, sales, on="ID", how="left")
print("\nLeft Join:\n", left_join)

Inner Join:
    ID   Name  Revenue
0   1  Alice      200
1   2    Bob      150

Left Join:
    ID     Name  Revenue
0   1    Alice    200.0
1   2      Bob    150.0
2   3  Charlie      NaN


**Join Types (`how` parameter):**  
- `inner`: Only matching keys  
- `outer`: All keys (fills NaN for mismatches)  
- `left`: All left keys  
- `right`: All right keys  
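
The cell above only demonstrates `inner` and `left`. As a small companion sketch, the same two frames (recreated here so the block runs on its own) can be merged with `right` and `outer`:

```python
import pandas as pd

employees = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
sales = pd.DataFrame({"ID": [1, 2, 4], "Revenue": [200, 150, 300]})

# Right join: keep every row of `sales`; ID 4 has no employee, so Name is NaN
right_join = pd.merge(employees, sales, on="ID", how="right")
print("Right Join:\n", right_join)

# Outer join: keep every ID that appears in either frame (1, 2, 3 and 4)
outer_join = pd.merge(employees, sales, on="ID", how="outer")
print("\nOuter Join:\n", outer_join)
```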

### **2. `join()` Method**  
Similar to `merge()`, but uses **indices** by default.  


In [26]:
# Set index and join
employees.set_index("ID", inplace=True)
sales.set_index("ID", inplace=True)
joined = employees.join(sales, how="left")
print("\nIndex Join:\n", joined)


Index Join:
        Name  Revenue
ID                  
1     Alice    200.0
2       Bob    150.0
3   Charlie      NaN


---

<a id="pivoting--melting"></a>  
## 3️⃣ **Pivoting & Melting 🔄**  

### **1. `pivot_table()` - Summarize Data**  
Reshape data to **highlight relationships** (like Excel pivot tables).  


In [27]:
data = {
    "Date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "City": ["NY", "Paris", "NY"],
    "Temperature": [22, 18, 20]
}
weather = pd.DataFrame(data)

# Pivot: Cities as columns, Dates as rows
pivoted = weather.pivot_table(
    values="Temperature", 
    index="Date", 
    columns="City"
)
print("Pivoted Table:\n", pivoted)

Pivoted Table:
 City          NY  Paris
Date                   
2023-01-01  22.0   18.0
2023-01-02  20.0    NaN


### **2. `melt()` - Unpivot Data**  
Convert **wide** data to **long** format.  


In [28]:
melted = weather.melt(
    id_vars=["Date"],  # Columns to keep
    value_vars=["City", "Temperature"]  # Columns to unpivot
)
print("\nMelted Table:\n", melted)


Melted Table:
          Date     variable  value
0  2023-01-01         City     NY
1  2023-01-01         City  Paris
2  2023-01-02         City     NY
3  2023-01-01  Temperature     22
4  2023-01-01  Temperature     18
5  2023-01-02  Temperature     20
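
Note that `weather` is already in long format, so the example above mostly shows the mechanics of `melt()`. A more typical use, sketched here with a made-up wide table (the second Paris reading is invented), turns one column per city into one row per date and city:

```python
import pandas as pd

# A genuinely wide table: one row per date, one temperature column per city
wide_weather = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02"],
    "NY": [22, 20],
    "Paris": [18, 17],
})

# Each city column becomes rows of (Date, City, Temperature)
long_weather = wide_weather.melt(
    id_vars=["Date"],
    var_name="City",
    value_name="Temperature",
)
print(long_weather)
```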


---

<a id="quick-quiz"></a>  
## ❓ **Quick Quiz**  

1. **How to combine two DataFrames side-by-side?**  
   - A) `pd.concat([df1, df2], axis=0)`  
   - B) `pd.concat([df1, df2], axis=1)` ✅  
   - C) `pd.merge(df1, df2)`  

2. **Which join type keeps all rows from the left DataFrame?**  
   - A) `inner`  
   - B) `outer`  
   - C) `left` ✅  

---

<a id="hands-on-project"></a>  
## 🎯 **Hands-on Project: COVID-19 Data Analysis 🦠**  

**Scenario:** Analyze COVID-19 cases and deaths by country/date.  


In [29]:
# Sample data (replace with real data from WHO/Johns Hopkins)
cases = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Country": ["USA", "USA", "India"],
    "Cases": [1000, 1200, 800]
})

deaths = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02"],
    "Country": ["USA", "USA"],
    "Deaths": [50, 60]
})

# Task 1: Merge cases and deaths
merged = pd.merge(cases, deaths, on=["Date", "Country"], how="left")

# Task 2: Pivot by country
pivoted = merged.pivot_table(
    values=["Cases", "Deaths"],
    index="Date", 
    columns="Country",
    aggfunc="sum"
)

# Task 3: Calculate mortality rate
merged["Mortality Rate"] = (merged["Deaths"] / merged["Cases"]) * 100

print("Merged Data:\n", merged)
print("\nPivoted Data:\n", pivoted)

Merged Data:
          Date Country  Cases  Deaths  Mortality Rate
0  2023-01-01     USA   1000    50.0             5.0
1  2023-01-02     USA   1200    60.0             5.0
2  2023-01-03   India    800     NaN             NaN

Pivoted Data:
             Cases         Deaths      
Country     India     USA  India   USA
Date                                  
2023-01-01    NaN  1000.0    NaN  50.0
2023-01-02    NaN  1200.0    NaN  60.0
2023-01-03  800.0     NaN    0.0   NaN


---

<a id="pro-tips"></a>  
## 💡 **Pro Tips & Common Pitfalls**  

✅ **Use the `validate` parameter of `pd.merge()` to check for unexpected duplicates:**  
```python
pd.merge(df1, df2, on="key", validate="one_to_one")  # Raises error if duplicates  
```
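
To see what that check looks like in practice, here is a minimal sketch with two invented frames. The right-hand frame repeats a key, so a `one_to_one` validation fails with `pd.errors.MergeError`, while `one_to_many` passes.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"key": [1, 1, 2], "y": [10, 11, 20]})  # key 1 is duplicated

try:
    pd.merge(left, right, on="key", validate="one_to_one")
    validation_error = None
except pd.errors.MergeError as err:
    validation_error = err
    print("Validation failed:", err)

# The same data passes a one-to-many check, because only the right side repeats keys
checked = pd.merge(left, right, on="key", validate="one_to_many")
print(checked)
```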

❌ **Avoid chained indexing after `pivot_table()` (use `.reset_index()` first).**  

🔥 **Handle missing values in merges:**  
```python
merged.fillna({"Deaths": 0}, inplace=True)  # Replace NaN deaths with 0  
```

---

## 🎉 **Key Takeaways**  
✔ **`pd.concat()`** stacks DataFrames vertically/horizontally.  
✔ **`pd.merge()`** combines data like SQL joins.  
✔ **`pivot_table()`** summarizes; **`melt()`** unpivots data.  

**Next Steps:**  
- Try merging on multiple keys (e.g., `on=["Date", "Country"]`).  
- Explore `pd.crosstab()` for frequency tables.  

**Happy analyzing!** 📊🔍


# **⏰ Time Series Analysis:**
> **Unlocking Patterns Over Time! ⏳** 

Welcome, time travelers! 🚀 You've learned how to clean, merge, and reshape your data. Now, let's add another dimension: **time**! Many real-world datasets, like stock prices, sensor readings, weather data, or daily sales, have a time component. Analyzing these **time series** helps us understand trends, predict future values, and identify seasonality.

Pandas is incredibly powerful for working with time series data, largely because it was built with financial time series in mind. You'll learn to easily manipulate, aggregate, and visualize data that changes over time. Get ready to spot trends and make temporal insights! 📈

---

### 📚 Table of Contents: Time Series Analysis

1.  **What is Time Series Data?** 🤔
2.  **Creating Time Series Data in Pandas** 📅
    * The `datetime` Type
    * `pd.to_datetime()` Revisited
    * `pd.date_range()`
    * Setting a DateTimeIndex
3.  **Selecting and Slicing Time Series** ✂️
4.  **Resampling Time Series (Aggregating Over Time)** 📊
    * Upsampling vs. Downsampling
    * Common Frequency Aliases
5.  **Time-Based Shifting (`.shift()`)** ➡️
6.  **Rolling Windows (Moving Averages with `.rolling()`)** 🌊
7.  **Pro Tips & Common Pitfalls** 💡
8.  **Mini-Challenges: Time Series Troubleshooter!** 🕵️‍♀️

---

### Let's Create Some Time Series Data! 📈

To demonstrate time series operations, we'll create a synthetic dataset representing daily website traffic and user sign-ups over a period.


In [None]:
import pandas as pd
import numpy as np

# Create a date range for a few weeks
dates = pd.date_range(start='2024-01-01', periods=30, freq='D')

# Create some dummy data
np.random.seed(42) # for reproducibility
website_data = {
    'Date': dates,
    'Page_Views': np.random.randint(500, 2000, size=len(dates)),
    'Sign_Ups': np.random.randint(10, 50, size=len(dates)),
    'Bounce_Rate': np.round(np.random.uniform(0.3, 0.7, size=len(dates)), 2)
}

df_website = pd.DataFrame(website_data)

# Print initial info
print("Initial Website Data DataFrame:")
display(df_website.head())
print("\nDataFrame Info:")
df_website.info()


Notice the `Date` column is already a `datetime64` type, thanks to `pd.date_range()`. This is crucial!




-----

### **1. What is Time Series Data? 🤔**

Time series data is a sequence of data points indexed (or listed) in time order. Most commonly, it is a sequence taken at successive equally spaced points in time.

**Key characteristics:**

  * **Time-indexed:** Each data point has a timestamp associated with it.
  * **Ordered:** The order of observations matters.
  * **Temporal Dependency:** Past values can influence future values (e.g., yesterday's sales impact today's).

The most important thing for Pandas to treat data as a time series is that your time-related column must be a **`datetime` object**.



-----

### **2. Creating Time Series Data in Pandas 📅**

#### **The `datetime` Type: The Heart of Time Series 💖**

Pandas (like Python itself) has a special data type for dates and times. When your date column is of `datetime` type, Pandas unlocks many powerful time series functionalities. If your date column is just a string (`object` type), Pandas won't recognize it as time-based.

#### **`pd.to_datetime()` Revisited: Converting to the Right Type 🔄**

We saw this in data cleaning. It's so important it deserves a deeper look here.


In [None]:
# Example: Let's create a DataFrame with a string date column
sales_str_dates = pd.DataFrame({
    'SaleDate': ['2023-01-01', '2023/01/02', 'Jan 3, 2023'],
    'Revenue': [100, 150, 200]
})

print("Original sales_str_dates info:")
sales_str_dates.info()

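# Convert 'SaleDate' to datetime
# The strings use three different formats, so pandas 2.0+ needs format='mixed'
# (on older versions, drop the format argument).
sales_str_dates['SaleDate'] = pd.to_datetime(sales_str_dates['SaleDate'], format='mixed')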

print("\nSales_str_dates info after conversion:")
sales_str_dates.info()
display(sales_str_dates)


**Pro Tip:** Always use `pd.to_datetime()` with `errors='coerce'` if you suspect some dates might be unparseable. This will turn invalid date strings into `NaT` (Not a Time) instead of crashing your code.


In [None]:
# Example with an invalid date
invalid_dates_df = pd.DataFrame({
    'EventDate': ['2023-05-10', 'Not a Date!', '2023-05-12'],
    'Event': ['Meeting', 'Error', 'Launch']
})

# Convert with errors='coerce'
invalid_dates_df['EventDate'] = pd.to_datetime(invalid_dates_df['EventDate'], errors='coerce')

print("\nDataFrame with coerced dates:")
display(invalid_dates_df)
invalid_dates_df.info() # EventDate is datetime64[ns]; the invalid entry became NaT


#### **`pd.date_range()`: Generating Date Sequences 🗓️**

This function is fantastic for creating a sequence of dates, which is useful for setting up a time series or for filling in missing dates.

  * `start`, `end`: Define the start and end dates.
  * `periods`: Define the number of dates to generate.
  * `freq`: The frequency of the date interval (e.g., 'D' for daily, 'W' for weekly, 'M' for month end; pandas 2.2+ spells the last one 'ME').


In [None]:
# Generate daily dates for a week
daily_sequence = pd.date_range(start='2024-07-01', periods=7, freq='D')
print("Daily sequence:", daily_sequence)

# Generate monthly dates for 6 months
monthly_sequence = pd.date_range(start='2024-01-01', periods=6, freq='M') # 'M' for month end
print("\nMonthly sequence:", monthly_sequence)

# Generate hourly dates for a day
hourly_sequence = pd.date_range(start='2024-07-11 09:00', end='2024-07-11 15:00', freq='H')
print("\nHourly sequence:", hourly_sequence)


---

#### Setting a DateTimeIndex: Making Time the Ruler! 👑

For most time series operations, it's best to set your `datetime` column as the DataFrame's index. Pandas calls this kind of index a **`DatetimeIndex`**. It enables powerful time-based slicing and resampling.


In [None]:
# Let's use our df_website
df_website_indexed = df_website.set_index('Date')

print("\nWebsite Data with Date as Index (DateTimeIndex):")
display(df_website_indexed.head())
print("\nIndexed DataFrame Info:")
df_website_indexed.info()
print("Index type:", type(df_website_indexed.index))



Notice `Date` is now the `DatetimeIndex`.

**From now on, we will use `df_website_indexed` for our examples.**



-----

### **3. Selecting and Slicing Time Series ✂️**

Once your DataFrame has a DateTimeIndex, selecting data by date becomes incredibly intuitive!


In [None]:
# Select data for a specific date
print("Data for 2024-01-05:")
print(df_website_indexed.loc['2024-01-05'])

# Select data for a range of dates
print("\nData from 2024-01-10 to 2024-01-15 (inclusive):")
print(df_website_indexed.loc['2024-01-10':'2024-01-15'])

# Select data for a specific month (partial string indexing!)
print("\nAll data for January 2024:")
print(df_website_indexed.loc['2024-01'].head()) # Just head, as it's the whole month

# Select data for a specific year (if your data spans multiple years)
print("\nAll data for 2024:")
print(df_website_indexed.loc['2024'].head()) # Again, just head


**Pro Tip:** Partial string indexing is super cool! You can use 'YYYY', 'YYYY-MM', 'YYYY-MM-DD', 'YYYY-MM-DD HH', etc., to select time ranges, as long as your index is a `DatetimeIndex`.
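
Partial strings become more interesting when the index is finer than daily. The sketch below builds a hypothetical hourly frame (the `Reading` column is invented) and selects a whole day, then a three-hour slice.

```python
import pandas as pd
import numpy as np

# Hypothetical hourly readings covering two full days
hourly = pd.DataFrame(
    {"Reading": np.arange(48)},
    index=pd.date_range("2024-01-01", periods=48, freq="60min"),
)

# A date string matches every timestamp that falls on that day
one_day = hourly.loc["2024-01-02"]
print(len(one_day))

# Hour-level strings work in slices too; both endpoints are included
morning = hourly.loc["2024-01-01 09":"2024-01-01 11"]
print(morning)
```

Unlike ordinary Python slices, label-based slices on a `DatetimeIndex` include the end point, so the 11:00 row is part of the result.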




-----

### **4. Resampling Time Series (Aggregating Over Time) 📊**

Resampling involves changing the frequency of your time series data.

  * **Downsampling:** Aggregating data from a higher frequency to a lower frequency (e.g., daily to weekly, hourly to daily). This usually involves an aggregation function (sum, mean, max, etc.).
  * **Upsampling:** Converting data from a lower frequency to a higher frequency (e.g., weekly to daily). This often involves filling or interpolation of new values. (Less common for beginners, but good to know it exists!)

The `.resample()` method is your friend here. You apply it to a DateTimeIndexed Series or DataFrame, specify the new frequency, and then apply an aggregation function.
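
The downsampling examples follow below. For completeness, here is a minimal upsampling sketch on an invented weekly series; it shows two common ways to fill the new daily slots, forward fill and linear interpolation.

```python
import pandas as pd

# Hypothetical weekly totals, one value every 7 days
weekly = pd.Series(
    [10.0, 24.0, 38.0],
    index=pd.date_range("2024-01-07", periods=3, freq="7D"),
)

# Upsample to daily: the new days start out empty, so choose how to fill them
daily_ffill = weekly.resample("D").ffill()          # repeat the last known value
daily_interp = weekly.resample("D").interpolate()   # draw a straight line between points

print(daily_ffill.head(8))
print(daily_interp.head(8))
```

Which fill method is appropriate depends on what the numbers mean: forward fill suits values that stay constant until updated, while interpolation assumes a gradual change between observations.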


In [None]:
# Our DataFrame for resampling
print("Original df_website_indexed head:")
display(df_website_indexed.head())


#### **Common Frequency Aliases (a few examples):**

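  * `'D'`: Daily
  * `'W'`: Weekly (Sunday end)
  * `'M'`: Monthly (month end; `'ME'` in pandas 2.2+)
  * `'Q'`: Quarterly (quarter end; `'QE'` in pandas 2.2+)
  * `'A'` or `'Y'`: Annually (year end; `'YE'` in pandas 2.2+)
  * `'H'`: Hourly (`'h'` in pandas 2.2+)
  * `'min'` or `'T'`: Minutely (prefer `'min'`; `'T'` is deprecated in pandas 2.2+)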

#### **Downsampling Examples:**

In [None]:
# Calculate weekly total page views (downsample from daily to weekly)
weekly_page_views = df_website_indexed['Page_Views'].resample('W').sum()
print("\nWeekly total page views:")
print(weekly_page_views)

# Calculate monthly average sign-ups and bounce rate
monthly_summary = df_website_indexed[['Sign_Ups', 'Bounce_Rate']].resample('M').mean()
print("\nMonthly average sign-ups and bounce rate:")
print(monthly_summary)

# Calculate daily max page views (if we had finer data like hourly)
# Let's pretend we have hourly data for a day
hourly_data = pd.DataFrame({
    'Time': pd.date_range(start='2024-01-01 00:00', periods=24, freq='H'),
    'Value': np.random.randint(10, 100, size=24)
}).set_index('Time')

print("\nHourly data head:")
print(hourly_data.head())

daily_max_value = hourly_data['Value'].resample('D').max()
print("\nDaily max value from hourly data:")
print(daily_max_value)



-----

### **5. Time-Based Shifting (`.shift()`): Comparing Past and Present ➡️**

The `.shift()` method is used to shift the data by a given number of periods (rows) forward or backward. This is useful for:

  * Comparing current values to previous values (e.g., day-over-day change).
  * Creating lagged features for forecasting.

Its main parameters are:

  * `periods`: The number of periods to shift (positive for forward, negative for backward).
  * `freq`: Optional. Shifts the index (dates) by a frequency offset instead of moving the data.


In [None]:
# Calculate the day-over-day change in Page_Views
# A shift of 1 means comparing today's views with yesterday's views.
df_website_indexed['Page_Views_Prev_Day'] = df_website_indexed['Page_Views'].shift(1)
df_website_indexed['Page_Views_Change'] = df_website_indexed['Page_Views'] - df_website_indexed['Page_Views_Prev_Day']

print("\nDataFrame with shifted page views and daily change:")
print(df_website_indexed.head())

# Notice the first row of Page_Views_Prev_Day is NaN because there's no previous day.


**Explanation:** `shift(1)` moves all values down by one row. So, for '2024-01-02', `Page_Views_Prev_Day` will show the `Page_Views` from '2024-01-01'.
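
The difference between shifting the data and shifting the index can be seen on a tiny series. This sketch uses three invented values and also shows `.diff()`, which computes the same day-over-day change in one step.

```python
import pandas as pd

views = pd.Series(
    [100, 120, 90],
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
    name="Page_Views",
)

# shift(1) moves the values down one row; the dates stay where they are
prev_day = views.shift(1)

# diff() is shorthand for views - views.shift(1)
change = views.diff()

# shift(freq=...) moves the dates instead and leaves the values untouched
moved_index = views.shift(1, freq="D")

print(pd.DataFrame({"views": views, "prev_day": prev_day, "change": change}))
print(moved_index)
```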



-----

### **6. Rolling Windows (Moving Averages with `.rolling()`): Smoothing Trends 🌊**

Rolling windows (or moving averages) are used to smooth out short-term fluctuations in time series data and highlight longer-term trends. You calculate a statistic (like mean, sum, std) over a *moving window* of a specified size.

  * `.rolling(window)`: Defines the size of the rolling window.
  * `.rolling(window='7D')`: You can also specify time-based windows.



In [None]:
# Calculate a 7-day rolling average of Page_Views
df_website_indexed['Page_Views_7D_MA'] = df_website_indexed['Page_Views'].rolling(window=7).mean()

# Calculate a 3-day rolling sum of Sign_Ups
df_website_indexed['Sign_Ups_3D_Sum'] = df_website_indexed['Sign_Ups'].rolling(window=3).sum()

print("\nDataFrame with rolling averages/sums:")
display(df_website_indexed.head(10)) # Look at more rows to see the rolling effect


**Explanation:** For `Page_Views_7D_MA`, the value for January 7th is the average of `Page_Views` from Jan 1st to Jan 7th. The value for January 8th is the average of `Page_Views` from Jan 2nd to Jan 8th, and so on. The first `window-1` values will be `NaN` because there aren't enough preceding values to fill the window.
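
If those leading `NaN`s are unwanted, `min_periods` relaxes the requirement, and a time-based window behaves similarly by default. The following sketch compares the three variants on a short invented series.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Fixed window of 3 rows: the first two results are NaN
strict = s.rolling(window=3).mean()

# min_periods=1 returns a value as soon as one observation is available
lenient = s.rolling(window=3, min_periods=1).mean()

# A time-based window ('3D') needs a DatetimeIndex and covers the last 3 days
by_time = s.rolling(window="3D").mean()

print(pd.DataFrame({"strict": strict, "lenient": lenient, "by_time": by_time}))
```

Keep in mind that averages over a partially filled window are based on fewer points, so the first few values are noisier than the rest.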



-----

### **7. Pro Tips & Common Pitfalls 💡**

#### Pro Tips:

  * **Always `pd.to_datetime()`:** This is the golden rule for time series in Pandas. Make sure your date/time column is a `datetime` object.
  * **Set DateTimeIndex:** For powerful slicing and resampling, set your datetime column as the DataFrame's index.
  * **Explore Frequencies:** Pandas has many frequency aliases (`'D'`, `'W'`, `'M'`, `'H'`, `'S'`, etc.). Experiment to find the right one for your resampling needs.
  * **Understand `resample()` Aggregations:** After `.resample()`, you *must* apply an aggregation function (`.sum()`, `.mean()`, `.max()`, `.min()`, `.count()`, etc.) to tell Pandas how to combine the data within each new time bin.
  * **Visualize!** Time series data is best understood visually. Once you've processed your data, plot it to see trends and patterns! 📈
    ```python
    # Example for plotting (requires matplotlib)
    # import matplotlib.pyplot as plt
    # df_website_indexed['Page_Views'].plot(title='Daily Page Views')
    # df_website_indexed['Page_Views_7D_MA'].plot(title='7-Day Rolling Average Page Views')
    # plt.show()
    ```
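To make the `resample()` tip concrete, here is a small sketch on invented data (the frame `df` and its values are assumptions for illustration). It shows that `.resample()` alone only defines the time bins, and that `.agg()` can apply a different function per column.

```python
import pandas as pd

# Hypothetical two weeks of daily data; 2024-01-01 is a Monday
idx = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame(
    {"Page_Views": range(1, 15), "Bounce_Rate": [0.5] * 14},
    index=idx,
)

# Defines weekly bins (ending on Sunday) but computes nothing yet
weekly_bins = df.resample("W")

# The aggregation step turns the bins into an actual DataFrame
weekly = weekly_bins.agg({"Page_Views": "sum", "Bounce_Rate": "mean"})
print(weekly)
```

Summing counts while averaging rates is a common pairing, since a summed bounce rate would not be meaningful.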

#### Common Pitfalls:

  * **Date Column Not `datetime` Type:** The most common error! If `df.info()` shows your date column as `object`, you can't do time series operations effectively. Convert it!
  * **Not Setting Index:** Trying to use `.resample()` or partial string indexing without a `DatetimeIndex` will lead to errors.
  * **`resample()` Without Aggregation:** `df.resample('W')` on its own returns a `Resampler` object rather than aggregated data; you need `df.resample('W').sum()` or `df.resample('W').mean()`.
  * **Misinterpreting `shift()`:** Remember `shift(1)` brings the *previous* value into the *current* row. If you want the *next* value, use `shift(-1)`.
  * **`NaN` Values in Rolling/Shifting:** The first few values of a rolling window or shifted column will often be `NaN` because there isn't enough data to fill the window/shift. Remember to handle these `NaN`s if they're problematic for further analysis.
  * **Timezones:** If your data has time zone information, be mindful of it. `pd.to_datetime` accepts `utc=True` to produce timezone-aware UTC timestamps, and `.tz_localize()` / `.tz_convert()` attach or change a time zone. For most beginner analysis, you might ignore it, but it's a deep rabbit hole! 🌍
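As a brief illustration of the timezone point, the sketch below parses two made-up timestamp strings as UTC and converts them to another zone. The sample strings and the choice of `America/New_York` are assumptions for the example.

```python
import pandas as pd

# Hypothetical raw timestamps without any zone information
raw = pd.Series(["2024-01-01 12:00", "2024-01-02 12:00"])

# utc=True interprets the naive strings as UTC and makes the result timezone-aware
stamps_utc = pd.to_datetime(raw, utc=True)

# Convert the same instants to New York wall-clock time (UTC-5 in January)
stamps_ny = stamps_utc.dt.tz_convert("America/New_York")

print(stamps_utc)
print(stamps_ny)
```

The instants are unchanged by `tz_convert`; only the displayed wall-clock time and offset differ.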



-----

### **8. Mini-Challenges: Time Series Troubleshooter! 🕵️‍♀️**

Let's test your time series skills! The cell below re-creates the website data as `df_website_challenge`, so you start from a fresh copy.

In [None]:
import pandas as pd
import numpy as np

# Re-create original df_website for challenges
dates = pd.date_range(start='2024-01-01', periods=30, freq='D')
np.random.seed(42) # for reproducibility
website_data_challenge = {
    'Date': dates,
    'Page_Views': np.random.randint(500, 2000, size=len(dates)),
    'Sign_Ups': np.random.randint(10, 50, size=len(dates)),
    'Bounce_Rate': np.round(np.random.uniform(0.3, 0.7, size=len(dates)), 2)
}
df_website_challenge = pd.DataFrame(website_data_challenge)

print("Challenge DataFrame Head:")
display(df_website_challenge.head())
print("\nChallenge DataFrame Info:")
df_website_challenge.info()


**Challenge 1: Set Index and Select Data** 📅
Set the `'Date'` column as the index of `df_website_challenge` and store the result in `df_website_challenge_indexed`. Then, select all data for the first two weeks of January 2024 (from '2024-01-01' to '2024-01-14'). Store this in `first_two_weeks_df`.

**Challenge 2: Weekly Average Page Views** 📈
Using `df_website_challenge_indexed`, calculate the **weekly average** of `'Page_Views'`. Store the result in `weekly_avg_views`.

**Challenge 3: Calculate Previous Day's Sign-Ups** 🔙
Add a new column to `df_website_challenge_indexed` called `'Sign_Ups_Yesterday'` that contains the `'Sign_Ups'` value from the previous day.

**Challenge 4: 5-Day Rolling Mean of Bounce Rate** 🌊
Calculate a **5-day rolling mean** for the `'Bounce_Rate'` column in `df_website_challenge_indexed`. Store it in a new column called `'Bounce_Rate_5D_MA'`.

<details><summary>Click for Code</summary>

```python
# Your code here for Challenge 1
df_website_challenge_indexed = df_website_challenge.set_index('Date')
first_two_weeks_df = df_website_challenge_indexed.loc['2024-01-01':'2024-01-14']

# print("\nFirst Two Weeks Data:")
# print(first_two_weeks_df)

# Your code here for Challenge 2
weekly_avg_views = df_website_challenge_indexed['Page_Views'].resample('W').mean()

# print("\nWeekly Average Page Views:")
# print(weekly_avg_views)

# Your code here for Challenge 3
df_website_challenge_indexed['Sign_Ups_Yesterday'] = df_website_challenge_indexed['Sign_Ups'].shift(1)

# print("\nDataFrame with Sign_Ups_Yesterday:")
# print(df_website_challenge_indexed.head())

# Your code here for Challenge 4
df_website_challenge_indexed['Bounce_Rate_5D_MA'] = df_website_challenge_indexed['Bounce_Rate'].rolling(window=5).mean()

# print("\nDataFrame with 5-Day Rolling Mean of Bounce Rate:")
# print(df_website_challenge_indexed.head(10))
```

</details>




-----

### End of Time Series Analysis! 🎉

You've just added a crucial skill to your data analysis toolkit! Understanding and manipulating time series data is fundamental in many fields, from finance to marketing to environmental science. Pandas makes these complex operations surprisingly straightforward. ⏰

**What's next?** We've covered a vast amount of Pandas already! The next step might be a comprehensive real-world project, or perhaps a deeper dive into visualization or specific statistical methods.

**Keep practicing these time series techniques!** They are incredibly valuable. You're becoming a true data master! 🚀

---

# **Time Series Analysis**  


### 📚 **Table of Contents**  
1. [DateTime Indexing ⏰](#datetime-indexing)  
   - `pd.to_datetime()`  
   - `resample()`  
2. [Rolling & Expanding Windows 📈](#rolling--expanding-windows)  
3. [Time Zone Handling 🌐](#time-zone-handling)  
4. [Quick Quiz ❓](#quick-quiz)  
5. [Mini-Project: Stock Price Analysis 💹](#mini-project)  
6. [Pro Tips & Common Pitfalls 💡](#pro-tips)  


---

<a id="datetime-indexing"></a>  
## 1️⃣ **DateTime Indexing ⏰**  

### **1. `pd.to_datetime()` - Convert to DateTime**  
Transform strings or timestamps into Pandas DateTime format.  


In [30]:
import pandas as pd

# Sample data with date strings
data = {
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Price": [100, 105, 103]
}
df = pd.DataFrame(data)

# Convert to DateTime
df["Date"] = pd.to_datetime(df["Date"])
df.set_index("Date", inplace=True)  # Set as index for time-based operations
print("DateTime Index:\n", df)

DateTime Index:
             Price
Date             
2023-01-01    100
2023-01-02    105
2023-01-03    103


### **2. `resample()` - Time-Based Aggregation**  
Upsample or downsample time series data.  


In [31]:
# Daily to monthly resampling (mean price)
# "ME" (month end) is the pandas 2.2+ spelling; older versions use "M"
monthly = df.resample("ME").mean()
print("\nMonthly Average:\n", monthly)


Monthly Average:
                  Price
Date                  
2023-01-31  102.666667




**Common Resampling Frequencies:**  
| Alias | Description |  
|-------|-------------|  
| `D`   | Daily |  
| `W`   | Weekly (bins end on Sunday by default) |  
| `ME`  | Month end (`M` before pandas 2.2) |  
| `QE`  | Quarter end (`Q` before pandas 2.2) |  
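The examples so far downsample (daily to monthly). For the upsampling direction, the sketch below fills a missing day in a made-up two-point series; the data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical prices with a gap: 2023-01-02 is missing
daily = pd.Series(
    [100, 103],
    index=pd.to_datetime(["2023-01-01", "2023-01-03"]),
)

# Upsample to a full daily grid and carry the last known value forward
filled = daily.resample("D").ffill()

# Alternatively, estimate the missing day by linear interpolation
interpolated = daily.resample("D").interpolate()

print(filled)
print(interpolated)
```

Forward-filling suits values that stay in effect until they change (such as a listed price), while interpolation suits quantities that vary smoothly.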

---

<a id="rolling--expanding-windows"></a>  
## 2️⃣ **Rolling & Expanding Windows 📈**  

### **1. Rolling Windows**  
Compute metrics over a **fixed-size moving window**.  


In [32]:
# 2-day rolling average
df["Rolling_Avg"] = df["Price"].rolling(window=2).mean()
print("Rolling Average:\n", df)

Rolling Average:
             Price  Rolling_Avg
Date                          
2023-01-01    100          NaN
2023-01-02    105        102.5
2023-01-03    103        104.0


### **2. Expanding Windows**  
Compute metrics over **all prior data**.  


In [33]:
# Cumulative average
df["Expanding_Avg"] = df["Price"].expanding().mean()
print("\nExpanding Average:\n", df)


Expanding Average:
             Price  Rolling_Avg  Expanding_Avg
Date                                         
2023-01-01    100          NaN     100.000000
2023-01-02    105        102.5     102.500000
2023-01-03    103        104.0     102.666667


---

<a id="time-zone-handling"></a>  
## 3️⃣ **Time Zone Handling 🌐**  

### **1. Localize & Convert Time Zones**

In [34]:
# Localize the naive (timezone-less) index to UTC
df = df.tz_localize("UTC")

# Convert to US/Eastern
df = df.tz_convert("US/Eastern")
print("\nUS/Eastern Time:\n", df)


US/Eastern Time:
                            Price  Rolling_Avg  Expanding_Avg
Date                                                        
2022-12-31 19:00:00-05:00    100          NaN     100.000000
2023-01-01 19:00:00-05:00    105        102.5     102.500000
2023-01-02 19:00:00-05:00    103        104.0     102.666667


### **2. Handle Daylight Savings**  
Pandas automatically adjusts for DST transitions.  


In [35]:
# Example with DST transition (March 12, 2023)
dst_dates = pd.to_datetime(["2023-03-11", "2023-03-12", "2023-03-13"])
dst_series = pd.Series([1, 2, 3], index=dst_dates).tz_localize("US/Eastern")
print("\nDST Handling:\n", dst_series)


DST Handling:
 2023-03-11 00:00:00-05:00    1
2023-03-12 00:00:00-05:00    2
2023-03-13 00:00:00-04:00    3
dtype: int64
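Midnight timestamps sidestep the trickiest DST case: wall-clock times that do not exist because the clock jumps forward. The sketch below, using an invented 02:30 timestamp on the transition day, shows one way to handle it with the `nonexistent` argument of `tz_localize`.

```python
import pandas as pd

# On 2023-03-12, US/Eastern clocks jump from 02:00 straight to 03:00,
# so 02:30 local time never occurs
gap = pd.DatetimeIndex(["2023-03-12 02:30"])

# By default pandas refuses to localize a nonexistent wall-clock time
try:
    gap.tz_localize("US/Eastern")
    raised = False
except Exception:
    raised = True

# "shift_forward" moves it to the first valid instant after the gap
shifted = gap.tz_localize("US/Eastern", nonexistent="shift_forward")

print("Default raised an error:", raised)
print(shifted)
```

Other accepted values include `"shift_backward"` and `"NaT"`; which one is appropriate depends on what the timestamps represent.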


---

<a id="quick-quiz"></a>  
## ❓ **Quick Quiz**  

1. **How to resample daily data to quarterly?**  
   - A) `resample("Q")` (spelled `"QE"` in pandas 2.2+) ✅  
   - B) `resample("3M")`  
   - C) `resample("90D")`  

2. **What does `rolling(window=7).mean()` compute?**  
   - A) 7-day moving average ✅  
   - B) Total over 7 days  
   - C) Median of 7 days  


---

<a id="mini-project"></a>  
## 🎯 **Mini-Project: Stock Price Analysis 💹**  

**Scenario:** Analyze AAPL stock prices with time series techniques.  


In [36]:
# Sample data (replace with real data from yfinance)
data = {
    "Date": pd.date_range("2023-01-01", periods=5),
    "Close": [142.5, 145.3, 144.8, 146.2, 147.0]
}
stocks = pd.DataFrame(data).set_index("Date")

# Task 1: 2-day rolling volatility (std dev)
stocks["Rolling_Volatility"] = stocks["Close"].rolling(2).std()

# Task 2: Weekly resampling (last closing price)
weekly = stocks.resample("W").last()

# Task 3: Timezone-aware analysis
stocks = stocks.tz_localize("UTC").tz_convert("US/Eastern")

print("Daily Data:\n", stocks)
print("\nWeekly Data:\n", weekly)

Daily Data:
                            Close  Rolling_Volatility
Date                                                
2022-12-31 19:00:00-05:00  142.5                 NaN
2023-01-01 19:00:00-05:00  145.3            1.979899
2023-01-02 19:00:00-05:00  144.8            0.353553
2023-01-03 19:00:00-05:00  146.2            0.989949
2023-01-04 19:00:00-05:00  147.0            0.565685

Weekly Data:
             Close  Rolling_Volatility
Date                                 
2023-01-01  142.5                 NaN
2023-01-08  147.0            0.565685




---

<a id="pro-tips"></a>  
## 💡 **Pro Tips & Common Pitfalls**  

✅ **Use `min_periods` in `rolling()` to handle early NaN values:**  
```python
df.rolling(window=5, min_periods=1).mean()  # Starts averaging immediately  
```

❌ **Avoid mixing timezones without conversion (can cause errors).**  

🔥 **Combine resampling with aggregation:**  
```python
# Open-High-Low-Close (OHLC) resampling
ohlc = df["Price"].resample("D").ohlc()  
```

---

## 🎉 **Key Takeaways**  
✔ **`pd.to_datetime()`** converts strings to DateTime objects.  
✔ **`resample()`** aggregates time series data.  
✔ **Rolling/expanding windows** reveal trends.  
✔ **Timezone handling** ensures global compatibility.  

**Next Steps:**  
- Fetch real stock data with `yfinance` library.  
- Explore `pd.Timedelta` for time-based arithmetic.  

**Happy time traveling!** ⏳📊

#  **Efficient Data Handling**  

### 📚 **Table of Contents**  
1. [Optimizing Memory Usage 🧠](#optimizing-memory-usage)  
   - Categorical Data (`astype("category")`)  
   - Downcasting Numeric Types  
2. [Chunking Large Datasets 🗂️](#chunking-large-datasets)  
3. [Speeding Up Operations ⚡](#speeding-up-operations)  
   - `eval()`  
   - `query()`  
4. [Quick Quiz ❓](#quick-quiz)  
5. [Mini-Project: Process 1GB Sales Data Efficiently 💰](#mini-project)  
6. [Pro Tips & Common Pitfalls 💡](#pro-tips)  


---

<a id="optimizing-memory-usage"></a>  
## 1️⃣ **Optimizing Memory Usage 🧠**  

### **1. Categorical Data (`astype("category")`)**  
Reduce memory for **low-cardinality columns** (e.g., gender, country).  


In [3]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Country": ["USA", "UK", "USA"]  # Only 2 unique values
})

# Convert to category
df["Country"] = df["Country"].astype("category")
print("Memory Usage (MB):\n", df.memory_usage(deep=True) / 1024**2)

Memory Usage (MB):
 Index      0.000126
User_ID    0.000023
Country    0.000204
dtype: float64


**When to Use:**  
- Columns with **<50% unique values** (e.g., `gender`, `product_category`).  
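With only three rows, the saving from `category` is not visible. The sketch below repeats a few made-up country names 100,000 times to show the effect at a more realistic scale; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct values across 100,000 rows
countries = pd.Series(["USA", "UK", "USA", "India"] * 25_000)

bytes_as_strings = countries.memory_usage(deep=True)
bytes_as_category = countries.astype("category").memory_usage(deep=True)

print(f"strings:  {bytes_as_strings / 1024**2:.2f} MB")
print(f"category: {bytes_as_category / 1024**2:.2f} MB")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the benefit grows with the number of repeated values.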

### **2. Downcasting Numeric Types**  
Shrink numeric columns to smallest feasible type.  


In [4]:
df["User_ID"] = pd.to_numeric(df["User_ID"], downcast="unsigned")
print("\nOptimized Types:\n", df.dtypes)


Optimized Types:
 User_ID       uint8
Country    category
dtype: object


**Supported Downcasts:**  
| Type          | Description              |  
|---------------|--------------------------|  
| `integer`     | 8/16/32/64-bit ints      |  
| `unsigned`    | Unsigned ints            |  
| `float`       | 32/64-bit floats         |  
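The dtype that `downcast` picks depends on the value range. A small sketch with invented values (the series below are assumptions for illustration):

```python
import pandas as pd

# 70,000 does not fit in 16 bits, so both integer downcasts stop at 32 bits
ids = pd.Series([1, 200, 70_000])
unsigned_dtype = pd.to_numeric(ids, downcast="unsigned").dtype
signed_dtype = pd.to_numeric(ids, downcast="integer").dtype

# Floats are downcast to float32 at the smallest
prices = pd.Series([1.5, 2.5])
float_dtype = pd.to_numeric(prices, downcast="float").dtype

print(unsigned_dtype, signed_dtype, float_dtype)
```

If later data could exceed the downcast range (for example, IDs that keep growing), it may be safer to choose the dtype explicitly rather than rely on the current maximum.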



---

<a id="chunking-large-datasets"></a>  
## 2️⃣ **Chunking Large Datasets 🗂️**  

Process **GBs of data** without loading everything into memory.  


In [None]:
# Process CSV in chunks
chunk_size = 10_000  # Rows per chunk
total_rows = 0

for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    total_rows += len(chunk)
    # Process each chunk (e.g., filter, aggregate)
    chunk_filtered = chunk[chunk["price"] > 100]
    
print(f"Total rows processed: {total_rows:,}")

**Key Parameters:**  
| Parameter     | Description                |  
|--------------|---------------------------|  
| `chunksize`  | Rows per chunk            |  
| `iterator`   | Returns iterator if `True` |  
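The usual pattern is to reduce each chunk to a small result and combine those results, rather than keep the chunks themselves. The sketch below is self-contained: it uses an in-memory CSV built with `io.StringIO` as a stand-in for a large file, and the `price` column is an assumption carried over from the example above.

```python
import io

import pandas as pd

# Stand-in for a large file: 250 rows with prices 1..250 (hypothetical data)
csv_text = "price\n" + "\n".join(str(p) for p in range(1, 251))

total_revenue = 0
expensive_rows = 0
n_chunks = 0

# Only one 100-row chunk is held in memory at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    n_chunks += 1
    total_revenue += chunk["price"].sum()
    expensive_rows += (chunk["price"] > 100).sum()

print(f"{n_chunks} chunks, revenue={total_revenue}, rows over 100: {expensive_rows}")
```

Sums and counts combine trivially across chunks; statistics such as the median need a different approach because they cannot be merged from per-chunk results.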


---

<a id="speeding-up-operations"></a>  
## 3️⃣ **Speeding Up Operations ⚡**  

### **1. `eval()` - Fast Expressions**  
Compute operations **without intermediate variables**.  


In [None]:
df = pd.DataFrame({"A": range(1, 100_000), "B": range(100_000, 1, -1)})

# Standard method (slower)
%timeit df["A"] + df["B"] / 2

# eval() method (faster)
%timeit df.eval("A + B / 2")

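**Typical Speedup:** Varies. Gains (sometimes around 2x) show up mainly on large DataFrames with the optional `numexpr` package installed; on small frames `eval()` can be slower, so measure with `%timeit` on your own data.  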

### **2. `query()` - Fast Filtering**  
Filter rows **with string expressions**.  


In [None]:
# Standard filtering
%timeit df[df["A"] > 50_000]

# query() method
%timeit df.query("A > 50000")

**Bonus:** Use **external variables** with `@`:  

In [None]:
threshold = 50_000
df.query("A > @threshold")

---

<a id="quick-quiz"></a>  
## ❓ **Quick Quiz**  

1. **Which dtype reduces memory for "Country" columns with 10 unique values?**  
   - A) `float32`  
   - B) `category` ✅  
   - C) `object`  

2. **How to process a 10GB CSV file on a laptop with 8GB RAM?**  
   - A) Load it all with `pd.read_csv()`  
   - B) Use `chunksize` in `read_csv()` ✅  
   - C) Convert to Excel first  


---

<a id="mini-project"></a>  
## 🎯 **Mini-Project: Process 1GB Sales Data Efficiently 💰**  

**Scenario:** Analyze large sales data with memory constraints.  


In [None]:
# Step 1: Read in chunks and preprocess
chunk_iter = pd.read_csv("sales_1gb.csv", chunksize=50_000)

# Step 2: Optimize memory per chunk
dfs_processed = []
for chunk in chunk_iter:
    # Downcast numerics
    chunk["customer_id"] = pd.to_numeric(chunk["customer_id"], downcast="unsigned")
    chunk["price"] = pd.to_numeric(chunk["price"], downcast="float")
    
    # Convert categories
    chunk["product_type"] = chunk["product_type"].astype("category")
    
    # Filter (keep only rows with price > 100)
    chunk_filtered = chunk.query("price > 100")
    dfs_processed.append(chunk_filtered)

# Step 3: Combine results
final_df = pd.concat(dfs_processed)
print(f"Final DataFrame shape: {final_df.shape}")
print("\nMemory Usage (MB):\n", final_df.memory_usage(deep=True).sum() / 1024**2)
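One caveat with converting categories per chunk: when chunks contain different sets of labels, `pd.concat` cannot keep the categorical dtype. The sketch below reproduces this with two tiny made-up frames and restores the dtype afterwards; the frames `a` and `b` are illustrative only.

```python
import pandas as pd

# Two hypothetical chunks whose category sets differ
a = pd.DataFrame({"product_type": pd.Categorical(["toy", "book"])})
b = pd.DataFrame({"product_type": pd.Categorical(["book", "food"])})

combined = pd.concat([a, b], ignore_index=True)
kept_category = isinstance(combined["product_type"].dtype, pd.CategoricalDtype)
print("Still categorical after concat:", kept_category)

# Re-apply the conversion once, on the combined result
combined["product_type"] = combined["product_type"].astype("category")
print(combined["product_type"].cat.categories.tolist())
```

Converting once after the final `concat` is often simpler than converting every chunk, at the cost of a temporarily larger intermediate result.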




---

<a id="pro-tips"></a>  
## 💡 **Pro Tips & Common Pitfalls**  

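✅ **Check for sparse columns with `isinstance(df[col].dtype, pd.SparseDtype)` (sparse storage suits mostly-NaN data); the older `pd.api.types.is_sparse()` is deprecated in recent pandas versions.**  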
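A minimal sketch of what sparse storage buys, using an invented series that is 99.9% NaN (the data and names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column: 10 real values among 10,000 rows
dense = pd.Series([np.nan] * 9_990 + [1.0] * 10)

# Store only the non-NaN entries; NaN becomes the implicit fill value
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

is_sparse_column = isinstance(sparse.dtype, pd.SparseDtype)
density = sparse.sparse.density  # share of values actually stored

print("sparse dtype:", is_sparse_column, "| density:", density)
print("dense bytes: ", dense.memory_usage())
print("sparse bytes:", sparse.memory_usage())
```

The saving shrinks as density rises, so this is mainly worthwhile when the fill value dominates the column.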

❌ **Avoid `eval()`/`query()` for small DataFrames (overhead outweighs benefits).**  

🔥 **Parallelize chunk processing with `multiprocessing`:**  
```python
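from multiprocessing import Pool

import pandas as pd

def process_chunk(chunk):
    return chunk.query("price > 100")

if __name__ == "__main__":  # guard needed where workers are spawned (Windows, macOS)
    with Pool(4) as p:  # 4 worker processes
        dfs = p.map(process_chunk, pd.read_csv("big_file.csv", chunksize=50_000))
    result = pd.concat(dfs)  # combine the filtered chunks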
```

---

## 🎉 **Key Takeaways**  
✔ **Categoricals/downcasting** slash memory usage.  
✔ **Chunking** enables out-of-core processing.  
✔ **`eval()`/`query()`** speed up computations.  

**Next Steps:**  
- Explore `dask` for **distributed DataFrames** larger than RAM.  
- Try sparse dtypes (`astype(pd.SparseDtype(...))`) for datasets with many zeros/NaNs; the old `to_sparse()` method was removed in pandas 1.0.  

**Happy optimizing!** 🚀

# 📊 Advanced Data Analysis with Python: 
*Mastering MultiIndex, Aggregations & Window Functions*


## 🎯 Table of Contents
1. 📚 Introduction to Advanced Data Analysis
2. 🏗️ MultiIndex DataFrames
   - 2.1 Creating MultiIndex DataFrames
   - 2.2 Indexing and Slicing
   - 2.3 Stacking and Unstacking
3. 🧮 Custom Aggregations with agg()
   - 3.1 Built-in Aggregation Functions
   - 3.2 Custom Aggregation Functions
   - 3.3 Aggregating with Multiple Functions
4. ⏳ Window Functions
   - 4.1 Rolling Windows
   - 4.2 Expanding Windows
   - 4.3 Exponentially Weighted Windows
5. 🚀 Hands-on Project: Stock Market Analysis
   - 5.1 Loading and Preparing Data
   - 5.2 MultiIndex Analysis
   - 5.3 Custom Aggregations
   - 5.4 Window Function Applications
6. 🧩 Challenge Problems
7. 📝 Conclusion & Next Steps


## 1. 📚 Introduction to Advanced Data Analysis

Welcome, data explorer! In this notebook, we'll unlock powerful pandas techniques to analyze complex datasets.

**Why learn this?**
- Real-world data is often hierarchical (MultiIndex)
- Standard aggregations sometimes aren't enough (agg())
- Time-series analysis requires special tools (window functions)


In [None]:
# First, let's import our tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setup for pretty visuals
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
sns.set_palette("husl")
%matplotlib inline

## 2. 🏗️ MultiIndex DataFrames

### 2.1 Creating MultiIndex DataFrames

A MultiIndex (hierarchical index) lets you store and manipulate data with multiple dimensions.

In [None]:
# Method 1: From tuples
index_tuples = [('California', 2020), ('California', 2021),
                ('New York', 2020), ('New York', 2021)]
populations = [39538223, 39613493, 20201249, 20215751]
# Build the MultiIndex explicitly: set_index() on a column of tuples
# typically yields a flat Index of tuples, so assigning two level
# names to it would fail
index = pd.MultiIndex.from_tuples(index_tuples, names=['State', 'Year'])
pop_df = pd.DataFrame({'Population': populations}, index=index)

# Method 2: Using pd.MultiIndex.from_product
states = ['California', 'New York']
years = [2020, 2021]
index = pd.MultiIndex.from_product([states, years], names=['State', 'Year'])
pop_data = [39538223, 39613493, 20201249, 20215751]
pop_df = pd.DataFrame({'Population': pop_data}, index=index)

# Method 3: Setting columns as indexes
sales_data = {
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['Widget', 'Gadget', 'Widget', 'Gadget'],
    'Sales': [1200, 800, 1500, 900]
}
sales_df = pd.DataFrame(sales_data)
multi_sales = sales_df.set_index(['Region', 'Product'])

**💡 Pro Tip:** MultiIndex is great for panel data (multiple observations over time for different entities).

### 2.2 Indexing and Slicing


In [None]:
# Selecting a single row (returns a Series with that row's column values)
ca_2020 = pop_df.loc[('California', 2020)]

# Selecting all years for California
ca_all_years = pop_df.loc['California']

# Cross-section (select all data for year 2020)
year_2020 = pop_df.xs(2020, level='Year')

# Using slicers
idx = pd.IndexSlice
recent_data = pop_df.loc[idx[:, 2021:], :]

**❓ Mini-Challenge:** Create a MultiIndex DataFrame with cities and temperature readings for different dates, then extract all temperatures for a specific city.

### 2.3 Stacking and Unstacking


In [None]:
# Unstacking moves inner index level to columns
unstacked_sales = multi_sales.unstack()

# Stacking does the reverse
restacked_sales = unstacked_sales.stack()

# Unstack specific levels
sales_by_product = multi_sales.unstack(level='Product')

**⚠️ Common Pitfall:** Unstacking can create NaN values if not all index combinations exist.
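A short sketch of this pitfall, using an invented sales table where one Region/Product combination is missing. The data is an assumption for illustration; `fill_value` is one way to avoid the resulting NaN.

```python
import pandas as pd

# Hypothetical data: there is no ('South', 'Gadget') row
sales = pd.DataFrame({
    "Region": ["North", "North", "South"],
    "Product": ["Widget", "Gadget", "Widget"],
    "Sales": [1200, 800, 1500],
}).set_index(["Region", "Product"])

# The missing combination shows up as NaN, and the column becomes float
wide = sales["Sales"].unstack()

# Supplying a fill value keeps the table complete and integer-typed
wide_filled = sales["Sales"].unstack(fill_value=0)

print(wide)
print(wide_filled)
```

Whether 0 is an honest substitute depends on the data: for sales counts it may be, but for measurements such as temperature a NaN is usually the more truthful value.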

## 3. 🧮 Custom Aggregations with agg()

### 3.1 Built-in Aggregation Functions


In [None]:
# Sample data
np.random.seed(42)
sales = pd.DataFrame({
    'Region': ['North']*5 + ['South']*5,
    'Product': ['Widget', 'Gadget']*5,
    'Sales': np.random.randint(100, 1000, 10),
    'Profit': np.random.randint(10, 200, 10)
})

# Basic aggregations
sales.groupby('Region').agg({'Sales': 'sum', 'Profit': 'mean'})

# Multiple aggregations
sales.groupby('Product').agg({
    'Sales': ['min', 'max', 'mean', 'std'],
    'Profit': ['sum', 'mean']
})

### 3.2 Custom Aggregation Functions

In [None]:
# Define custom functions
def range_agg(series):
    return series.max() - series.min()

def percent_over(series, threshold=500):
    return (series > threshold).mean() * 100

# Apply custom functions
sales.groupby('Region').agg({
    'Sales': [range_agg, lambda x: percent_over(x, 600)]
})

**💡 Pro Tip:** Use lambda functions for simple one-off aggregations and named functions for complex logic you'll reuse.

### 3.3 Aggregating with Multiple Functions


In [None]:
# Different aggregations per column
agg_dict = {
    'Sales': ['sum', 'mean', range_agg],
    'Profit': ['median', percent_over]
}
results = sales.groupby('Product').agg(agg_dict)

# Flatten multi-level column names
results.columns = ['_'.join(col).strip() for col in results.columns.values]
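An alternative to flattening the column names afterwards is named aggregation, where each output column is declared up front as `name=(column, function)`. The sketch below uses a small invented table; the output column names are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Widget", "Gadget"],
    "Sales": [100, 300, 500, 700],
    "Profit": [10, 30, 50, 70],
})

# Each keyword becomes a flat output column: name=(source column, function)
summary = sales.groupby("Product").agg(
    sales_total=("Sales", "sum"),
    sales_range=("Sales", lambda s: s.max() - s.min()),
    profit_median=("Profit", "median"),
)

print(summary)
```

This form also sidesteps the `<lambda>` labels that anonymous functions otherwise produce in the column index.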

**❓ Mini-Challenge:** Create a custom aggregation function that calculates the median absolute deviation and apply it to both Sales and Profit by Region.

## 4. ⏳ Window Functions

### 4.1 Rolling Windows


In [None]:
# Sample time series data
dates = pd.date_range('2023-01-01', periods=100)
ts_data = pd.DataFrame({
    'Date': dates,
    'Value': np.sin(np.linspace(0, 10, 100)) * 50 + 100 + np.random.normal(0, 5, 100)
}).set_index('Date')

# 7-day rolling average
ts_data['7_day_avg'] = ts_data['Value'].rolling(window=7).mean()

# 14-day rolling minimum with min_periods
ts_data['14_day_min'] = ts_data['Value'].rolling(window=14, min_periods=5).min()

# Plotting
fig, ax = plt.subplots(figsize=(12, 6))
ts_data['Value'].plot(ax=ax, label='Daily Value')
ts_data['7_day_avg'].plot(ax=ax, label='7-day Avg')
ax.legend()

### 4.2 Expanding Windows


In [None]:
# Expanding calculations
ts_data['Expanding_Avg'] = ts_data['Value'].expanding().mean()
ts_data['Expanding_Max'] = ts_data['Value'].expanding().max()

# Cumulative sum
ts_data['Cumulative_Sum'] = ts_data['Value'].cumsum()

### 4.3 Exponentially Weighted Windows

In [None]:
# Exponentially weighted moving average
ts_data['EWMA_7'] = ts_data['Value'].ewm(span=7).mean()

# Comparing window functions
fig, ax = plt.subplots(figsize=(12, 6))
ts_data['Value'].plot(ax=ax, alpha=0.3, label='Daily Value')
ts_data['7_day_avg'].plot(ax=ax, label='7-day Rolling')
ts_data['EWMA_7'].plot(ax=ax, label='7-day EWMA')
ax.legend()
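For intuition about what `span` controls: it maps to a smoothing factor `alpha = 2 / (span + 1)`. The sketch below checks this against a hand-written recursion on three made-up values. It uses `adjust=False`, which gives the simple recursive form; the cell above uses the default `adjust=True`, whose early values are weighted slightly differently.

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0])  # hypothetical data
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

# pandas result using the recursive definition
ewma = values.ewm(span=span, adjust=False).mean()

# Same calculation by hand: new = alpha * value + (1 - alpha) * previous
manual = [values.iloc[0]]
for v in values.iloc[1:]:
    manual.append(alpha * v + (1 - alpha) * manual[-1])

print(ewma.tolist())
print(manual)
```

A larger `span` means a smaller `alpha`, so each new observation moves the average less and the curve is smoother.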

**⚠️ Common Pitfall:** Forgetting to sort time-series data before applying window functions can lead to incorrect results.
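To see why sorting matters, the sketch below applies the same 2-row rolling mean to an invented series whose dates are out of order, with and without `sort_index()`. Count-based windows follow row order, not date order.

```python
import pandas as pd

# Hypothetical readings stored out of chronological order
idx = pd.to_datetime(["2023-01-03", "2023-01-01", "2023-01-02"])
readings = pd.Series([30, 10, 20], index=idx)

# Windows pair up neighbouring rows as stored: (30, 10) and (10, 20)
unsorted_roll = readings.rolling(2).mean()

# After sorting, windows pair consecutive days: (10, 20) and (20, 30)
sorted_roll = readings.sort_index().rolling(2).mean()

print(unsorted_roll)
print(sorted_roll)
```

The unsorted version runs without any warning, which is what makes this mistake easy to miss. Time-based windows such as `rolling("2D")` are stricter and raise an error when the index is not monotonic.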

## 5. 🚀 Hands-on Project: Stock Market Analysis

Let's analyze Apple's stock data with our new skills!

### 5.1 Loading and Preparing Data


In [None]:
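# Load stock data
# Note: this Yahoo Finance endpoint may reject unauthenticated requests;
# if it does, load a locally saved CSV with the same columns instead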
apple = pd.read_csv('https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=345427200&period2=1689984000&interval=1d&events=history')

# Convert to datetime and set index
apple['Date'] = pd.to_datetime(apple['Date'])
apple = apple.set_index('Date').sort_index()

# Add some features
apple['Daily_Return'] = apple['Close'].pct_change()
apple['Log_Return'] = np.log(apple['Close']/apple['Close'].shift(1))

### 5.2 MultiIndex Analysis

In [None]:
# Create a MultiIndex with Year-Month
apple['Year'] = apple.index.year
apple['Month'] = apple.index.month
apple_multi = apple.set_index(['Year', 'Month'], append=True)

# Analyze by year-month
monthly_stats = apple_multi.groupby(['Year', 'Month'])['Close'].agg(['mean', 'min', 'max'])

### 5.3 Custom Aggregations

In [None]:
# Custom aggregation: Volatility (std of daily returns)
def volatility(returns):
    return returns.std() * np.sqrt(252)  # Annualized

monthly_volatility = apple_multi.groupby(['Year', 'Month'])['Daily_Return'].agg(volatility)

### 5.4 Window Function Applications

In [None]:
# 50-day and 200-day moving averages
apple['MA_50'] = apple['Close'].rolling(50).mean()
apple['MA_200'] = apple['Close'].rolling(200).mean()

# Bollinger Bands
apple['Upper_Band'] = apple['MA_50'] + 2*apple['Close'].rolling(50).std()
apple['Lower_Band'] = apple['MA_50'] - 2*apple['Close'].rolling(50).std()

# Plotting
fig, ax = plt.subplots(figsize=(14, 7))
apple['Close'].plot(ax=ax, label='Close Price')
apple['MA_50'].plot(ax=ax, label='50-day MA')
apple['MA_200'].plot(ax=ax, label='200-day MA')
apple['Upper_Band'].plot(ax=ax, linestyle='--', color='g', alpha=0.7)
apple['Lower_Band'].plot(ax=ax, linestyle='--', color='r', alpha=0.7)
ax.legend()


## 6. 🧩 Challenge Problems

**Challenge 1:** Create a MultiIndex DataFrame with simulated sensor data from multiple devices and calculate rolling averages per device.

**Challenge 2:** Write a custom aggregation function that calculates the percentage of days where the stock price increased from the previous day, grouped by month.

**Challenge 3:** Implement a trading strategy using window functions (e.g., buy when 50-day MA crosses above 200-day MA) and calculate hypothetical returns.

## 7. 📝 Conclusion & Next Steps

**What we've covered:**
- Mastered MultiIndex DataFrames for hierarchical data
- Created custom aggregations for specialized metrics
- Applied powerful window functions for time-series analysis

**Next steps to level up:**
1. Explore `pd.Grouper` for time-based grouping
2. Learn about `resample()` for time-series aggregation
3. Dive into `pd.cut()` for binning analysis
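As a taste of the first suggestion, `pd.Grouper` lets a time frequency sit alongside ordinary columns in a single `groupby`. The sketch below uses a small invented table; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily sales for two regions; 2023-01-01 is a Sunday
df = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=6, freq="D"),
    "Region": ["N", "S"] * 3,
    "Sales": [1, 2, 3, 4, 5, 6],
})

# Group by calendar week (ending Sunday) and by region in one step,
# without first setting Date as the index
weekly = df.groupby([pd.Grouper(key="Date", freq="W"), "Region"])["Sales"].sum()

print(weekly)
```

This is roughly what `resample()` does for a single time axis, extended to work together with other grouping keys.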

```python
# 🎉 Congratulations on completing this advanced data analysis module!
# Keep coding and exploring data's hidden stories! 📈✨
```

**Final Pro Tip:** When working with large MultiIndex DataFrames, consider using `.swaplevel()` and `.sort_index()` to optimize performance for your specific access patterns.

**Happy analyzing!** 🤓💻

# 📊 Phase 4: Real-World Applications & Integration
## Module 12: Integration with Other Libraries

```python
# 🎯 Table of Contents
"""
1. 📚 Introduction to Library Integration
2. 📊 Visualization Power Combo: Pandas + Matplotlib/Seaborn
   - 2.1 Direct Plotting from DataFrames
   - 2.2 Customizing Visualizations
   - 2.3 Advanced Plot Types
3. 🤖 Machine Learning Pipeline: Pandas + Scikit-Learn
   - 3.1 Data Preprocessing Patterns
   - 3.2 Feature Engineering Techniques
   - 3.3 Building ML-Ready Datasets
4. 🔄 The Full Workflow: From Data to Deployment
5. 🧩 Real-World Integration Challenge
6. 📝 Conclusion & Next Steps
"""
```

## 1. 📚 Introduction to Library Integration

Welcome to the power zone! ⚡ Here we'll combine pandas with other essential Python libraries to create professional data workflows.

**Why this matters:**
- Real data science is multi-library
- Each library has superpowers we can combine
- Professional workflows require integration


In [None]:
# Import our full toolkit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Setup visualization defaults
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
sns.set_palette("husl")
%matplotlib inline

## 2. 📊 Visualization Power Combo: Pandas + Matplotlib/Seaborn

### 2.1 Direct Plotting from DataFrames


In [None]:
# Load sample dataset
mpg = sns.load_dataset('mpg')

# Basic pandas plotting
mpg['mpg'].plot(kind='hist', bins=20, title='MPG Distribution')

# Seaborn + pandas integration
sns.boxplot(x='origin', y='mpg', data=mpg)

# FacetGrid with pandas data
g = sns.FacetGrid(mpg, col='origin')
g.map(sns.scatterplot, 'horsepower', 'mpg', alpha=0.7)

**💡 Pro Tip:** Use `pd.plotting.scatter_matrix()` for quick pairwise relationships visualization.

### 2.2 Customizing Visualizations


In [None]:
# Create a figure and axes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Distribution with pandas
mpg['mpg'].plot(kind='kde', ax=ax1, title='MPG Density')
ax1.set_xlabel('Miles per Gallon')

# Plot 2: Relationship with seaborn
sns.regplot(x='horsepower', y='mpg', data=mpg, ax=ax2, scatter_kws={'alpha':0.3})
ax2.set_title('MPG vs Horsepower')

# Add overall title
fig.suptitle('Automobile Data Analysis', fontsize=16)
plt.tight_layout()

### 2.3 Advanced Plot Types

In [None]:
# Correlation heatmap
corr = mpg.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')

# Pairplot with hue
sns.pairplot(mpg, hue='origin', vars=['mpg', 'horsepower', 'weight'])

# Time-series plot (if we had date data)
if 'date' in mpg.columns:
    mpg.set_index('date')['mpg'].plot(style='-o', figsize=(12, 6))

**❓ Mini-Challenge:** Create a customized visualization showing the relationship between weight and mpg, faceted by number of cylinders, with different colors for each origin.

## 3. 🤖 Machine Learning Pipeline: Pandas + Scikit-Learn

### 3.1 Data Preprocessing Patterns


In [None]:
# Load data
titanic = sns.load_dataset('titanic').dropna(subset=['age', 'embarked'])

# Define features and target
X = titanic[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = titanic['survived']

# Train-test split with pandas
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

### 3.2 Feature Engineering Techniques

In [None]:
# Create new features with pandas
titanic['family_size'] = titanic['sibsp'] + titanic['parch']
titanic['age_group'] = pd.cut(titanic['age'], 
                             bins=[0, 18, 30, 50, 100],
                             labels=['child', 'young', 'adult', 'senior'])

# Binning continuous variables
titanic['fare_bin'] = pd.qcut(titanic['fare'], q=4, labels=['low', 'medium', 'high', 'very_high'])

# Interaction features
titanic['age_class'] = titanic['age'] * titanic['pclass']

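**⚠️ Common Pitfall:** Row-wise features such as `family_size` are safe to create before the train-test split, but any step that learns from the data (scaling, imputation, quantile bins from `pd.qcut`) should be fit on the training set only and then applied to the test set. Fitting such steps on the full dataset leaks information from the test set into training.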
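A pandas-only sketch of the leakage-safe pattern for one learned step, standardization. The tiny `fare` table and the positional split are invented for illustration; in the notebook's pipeline, `StandardScaler` inside a `Pipeline` plays this role.

```python
import pandas as pd

# Hypothetical fares; the last row is held out as the "test set"
df = pd.DataFrame({"fare": [10.0, 20.0, 30.0, 40.0, 1000.0]})
train, test = df.iloc[:4], df.iloc[4:]

# Learn the scaling parameters from the training rows only
train_mean = train["fare"].mean()
train_std = train["fare"].std()

# Apply the same, already-learned parameters to both sets
train_scaled = (train["fare"] - train_mean) / train_std
test_scaled = (test["fare"] - train_mean) / train_std

print(f"mean learned from train: {train_mean}")
print(test_scaled)
```

Had the mean and standard deviation been computed on all five rows, the extreme test value would have shaped the scaling of the training data, which is the leakage the pitfall describes.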

### 3.3 Building ML-Ready Datasets


In [None]:
# Full pipeline example
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=42)

# Create full pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model)])

# Fit the model
clf.fit(X_train, y_train)

# Get feature names after preprocessing
preprocessor.fit(X_train)
feature_names = (numeric_features + 
                list(preprocessor.named_transformers_['cat']
                    .named_steps['onehot']
                    .get_feature_names_out(categorical_features)))

**💡 Pro Tip:** Use `pd.DataFrame(model.feature_importances_, index=feature_names)` to create interpretable feature importance tables.

## 4. 🔄 The Full Workflow: From Data to Deployment


In [None]:
# Complete workflow example
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error  # regression metric for a continuous target

# 1. Load and prepare data
housing = pd.read_csv('https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv')

# 2. Feature engineering
housing['rooms_per_household'] = housing['total_rooms']/housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms']/housing['total_rooms']
housing['population_per_household'] = housing['population']/housing['households']

# 3. Handle missing values (assign back instead of inplace=True on a column)
housing['total_bedrooms'] = housing['total_bedrooms'].fillna(housing['total_bedrooms'].median())

# 4. Prepare for ML
X = housing.drop('median_house_value', axis=1)
y = housing['median_house_value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Build preprocessing pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # also fills NaNs left in bedrooms_per_room
    ('std_scaler', StandardScaler()),
])

cat_attributes = ['ocean_proximity']
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, list(X.select_dtypes(include=[np.number]))),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_attributes),
])

# 6. Train model
lin_reg = Pipeline([
    ('preprocessing', full_pipeline),
    ('linear_regression', LinearRegression())
])
lin_reg.fit(X_train, y_train)

# 7. Evaluate with root mean squared error on the held-out test set
predictions = lin_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Test RMSE: {rmse:,.0f}")

## 5. 🧩 Real-World Integration Challenge

**Challenge 1:** Visualize Feature Importance
- Train a RandomForest on the Titanic dataset
- Create a horizontal bar plot of feature importances using Seaborn
- Bonus: Color-code by feature type (numeric vs categorical)

**Challenge 2:** Complete ML Pipeline
- Load the diamonds dataset (`sns.load_dataset('diamonds')`)
- Build a full pipeline predicting price
- Include feature engineering, preprocessing, and modeling
- Visualize residuals vs predictions

**Challenge 3:** Interactive Dashboard Prep
- Create a function that takes a DataFrame and:
  - Generates summary statistics
  - Creates a correlation heatmap
  - Produces pairplots for numeric columns
- Bonus: Use `ipywidgets` to make it interactive




## 6. 📝 Conclusion & Next Steps

**Key Takeaways:**
- Pandas integrates seamlessly with visualization libraries
- Scikit-learn pipelines handle pandas DataFrames beautifully
- The full data science workflow can be streamlined with these integrations

**Where to Go Next:**
1. Explore `plotly` for interactive visualizations
2. Learn `featuretools` for automated feature engineering
3. Dive into `mlflow` for experiment tracking

```python
# 🚀 You're now equipped to build professional data science workflows!
# Keep integrating and innovating! 🌟
```

**Final Pro Tip:** Create reusable preprocessing pipelines for common data types (time-series, text, etc.) to accelerate your future projects.
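
A reusable preprocessor can be as simple as a factory function that takes column lists. The sketch below is one possible shape for tabular data; the function name `make_tabular_preprocessor` and the sample columns are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_tabular_preprocessor(numeric_cols, categorical_cols):
    """Build a fresh, unfitted preprocessor for the given column lists."""
    numeric = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ])
    categorical = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ])
    return ColumnTransformer([
        ('num', numeric, numeric_cols),
        ('cat', categorical, categorical_cols),
    ])

# Small made-up frame to show the call
passengers = pd.DataFrame({
    'age': [22.0, None, 35.0, 58.0],
    'fare': [7.25, 71.28, 8.05, 26.55],
    'sex': ['male', 'female', 'female', 'male'],
})
tabular_prep = make_tabular_preprocessor(['age', 'fare'], ['sex'])
features = tabular_prep.fit_transform(passengers)
print(features.shape)
```

Because the factory returns an unfitted object each time, the same definition can be dropped into different `Pipeline` objects without sharing state between projects.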

**Happy coding!** 🎨🤖📈

# ⚡ Module 13: Performance Benchmarking: Pandas vs. NumPy vs. Pure Python

```python
# 🎯 Table of Contents
"""
1. 📚 Introduction to Performance Benchmarking
2. ⏱️ Benchmarking Methodology
3. 🏎️ Speed Showdown: Common Operations
   - 3.1 Mathematical Operations
   - 3.2 Data Filtering
   - 3.3 Aggregations
4. 📊 Memory Usage Comparison
5. 🧠 Decision Framework: When to Use Which Tool
6. 🛠️ Optimization Techniques
7. 🧩 Performance Challenge
8. 📝 Conclusion & Next Steps
"""
```

## 1. 📚 Introduction to Performance Benchmarking

Welcome to the Python Performance Olympics! 🏆 Today we'll compare:

- 🐼 Pandas (high-level data manipulation)
- 🔢 NumPy (numerical computing)
- 🐍 Pure Python (base language)

In [None]:
import pandas as pd
import numpy as np
import timeit
import sys
from memory_profiler import memory_usage
import matplotlib.pyplot as plt

# Setup benchmarking
plt.style.use('seaborn-v0_8')  # the plain 'seaborn' style name was removed in Matplotlib 3.8
%matplotlib inline

# Create large dataset for testing
SIZE = 1_000_000
data = np.random.randn(SIZE)
py_list = data.tolist()  # plain Python floats; list(data) would keep NumPy scalars and skew the comparison
np_array = np.array(data)
pd_series = pd.Series(data)

## 2. ⏱️ Benchmarking Methodology

We'll use two key metrics:
1. Execution time, measured with the `timeit` module
2. Memory usage, measured with `memory_profiler`

**💡 Pro Tip:** Always benchmark with realistic dataset sizes - small datasets won't reveal true performance differences!
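
To see how dataset size changes the picture, the sketch below times the same operation at two sizes and keeps the fastest of several repeats, which is a common way to reduce noise from other processes. The helper name `best_time` and the chosen sizes are assumptions for illustration.

```python
import timeit
import numpy as np

def best_time(stmt, repeat=3, number=20):
    """Fastest average seconds per call across several repeats."""
    return min(timeit.repeat(stmt, repeat=repeat, number=number)) / number

timings = {}
for size in (10, 100_000):
    values = np.random.default_rng(0).standard_normal(size)
    as_list = values.tolist()
    timings[size] = {
        'Pure Python': best_time(lambda: sum(as_list)),
        'NumPy': best_time(lambda: values.sum()),
    }

for size, row in timings.items():
    print(size, {name: f'{seconds:.2e}s' for name, seconds in row.items()})
```

The absolute numbers depend on the machine, so compare the ratio between methods at each size rather than the raw times.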


In [None]:
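def benchmark(func, *args, number=10, **kwargs):
    """Return (average seconds per call, peak process memory in MiB) for func"""
    # Time benchmark: timeit returns the total time for `number` calls
    total_time = timeit.timeit(lambda: func(*args, **kwargs), number=number)
    
    # Memory benchmark: memory_usage samples the whole process while func runs
    mem_usage = memory_usage((func, args, kwargs))
    
    return total_time / number, max(mem_usage)

def compare_methods(operations):
    """Compare multiple implementations.

    `operations` maps a label to a zero-argument callable, so bind the data
    first, e.g. {'NumPy': lambda: np_square(np_array)}.
    """
    results = []
    for name, func in operations.items():
        elapsed, peak_mem = benchmark(func)
        results.append({'Method': name, 'Time (s)': elapsed, 'Memory (MB)': peak_mem})
    return pd.DataFrame(results)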

## 3. 🏎️ Speed Showdown: Common Operations

### 3.1 Mathematical Operations


In [None]:
# Define operations
def py_square(lst):
    return [x**2 for x in lst]

def np_square(arr):
    return arr**2

def pd_square(ser):
    return ser**2

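# Benchmark (each entry binds its own data, because compare_methods calls it with no arguments)
operations = {
    'Pure Python': lambda: py_square(py_list),
    'NumPy': lambda: np_square(np_array),
    'Pandas': lambda: pd_square(pd_series)
}
math_results = compare_methods(operations)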

**Expected Outcome:**  
NumPy will likely be fastest, followed by Pandas, with pure Python trailing.

### 3.2 Data Filtering


In [None]:
threshold = 0.5

def py_filter(lst):
    return [x for x in lst if x > threshold]

def np_filter(arr):
    return arr[arr > threshold]

def pd_filter(ser):
    return ser[ser > threshold]

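# Bind each function to its data; compare_methods calls the entries with no arguments
filter_results = compare_methods({
    'Pure Python': lambda: py_filter(py_list),
    'NumPy': lambda: np_filter(np_array),
    'Pandas': lambda: pd_filter(pd_series)
})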

**⚠️ Common Pitfall:** For small datasets, the overhead of NumPy/Pandas might make pure Python faster!

### 3.3 Aggregations


In [None]:
def py_mean(lst):
    return sum(lst)/len(lst)

def np_mean(arr):
    return arr.mean()

def pd_mean(ser):
    return ser.mean()

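# Bind each function to its data; compare_methods calls the entries with no arguments
agg_results = compare_methods({
    'Pure Python': lambda: py_mean(py_list),
    'NumPy': lambda: np_mean(np_array),
    'Pandas': lambda: pd_mean(pd_series)
})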

**❓ Mini-Challenge:** Benchmark standard deviation calculations across all three methods.
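
Before timing anything for this challenge, it helps to confirm that the three implementations compute the same quantity. The sketch below is one possible starting point: pandas defaults to the sample standard deviation (`ddof=1`) while NumPy defaults to the population version (`ddof=0`), so the results differ unless `ddof` is matched.

```python
import statistics
import numpy as np
import pandas as pd

values = np.random.default_rng(1).standard_normal(10_000)

py_std = statistics.pstdev(values.tolist())      # population std (divides by n)
np_std = values.std()                            # NumPy default: ddof=0 (population)
pd_std_default = pd.Series(values).std()         # pandas default: ddof=1 (sample)
pd_std_matched = pd.Series(values).std(ddof=0)   # matched to NumPy

print(py_std, np_std, pd_std_default, pd_std_matched)
```

Once the definitions agree, the timing comparison measures the libraries rather than two different formulas.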

## 4. 📊 Memory Usage Comparison


In [None]:
def memory_comparison():
    """Compare memory usage of data structures"""
    MB = 1024 * 1024
    mem_sizes = {
        # sys.getsizeof(list) counts only the pointer array, so add the float objects it references
        'Python List': (sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)) / MB,
        'NumPy Array': np_array.nbytes / MB,  # raw data buffer
        'Pandas Series': pd_series.memory_usage(deep=True) / MB  # data plus index
    }
    
    return pd.DataFrame.from_dict(mem_sizes, orient='index', columns=['Size (MB)'])

mem_df = memory_comparison()

**💡 Pro Tip:** Pandas has more overhead than NumPy due to its rich functionality - choose NumPy when memory is critical.
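
Much of that overhead comes from the index. The sketch below, using a zero-filled array as a stand-in dataset, compares a Series with the default `RangeIndex` against one with a materialised integer index.

```python
import numpy as np
import pandas as pd

arr = np.zeros(1_000_000)                     # float64: 8 bytes per element
ser = pd.Series(arr)                          # default RangeIndex

array_bytes = arr.nbytes
series_bytes = ser.memory_usage(index=True, deep=True)
index_overhead = series_bytes - array_bytes   # cost of the RangeIndex

# An explicit integer index stores one label per row
labelled = pd.Series(arr, index=np.arange(1_000_000))
labelled_bytes = labelled.memory_usage(index=True, deep=True)

print(array_bytes, series_bytes, labelled_bytes)
```

With the default index the Series costs only slightly more than the array, so the memory argument for NumPy matters most when a large explicit index is involved.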

## 5. 🧠 Decision Framework: When to Use Which Tool

**Use Pure Python When:**
- Working with small datasets (< 1,000 elements)
- Need maximum flexibility with data types
- Writing complex custom algorithms

**Use NumPy When:**
- Working with numerical data only
- Need maximum performance for math operations
- Memory efficiency is critical

**Use Pandas When:**
- Working with tabular/heterogeneous data
- Need labeled axes and metadata
- Doing data cleaning/wrangling
- Need time-series functionality

```python
# Decision Tree Visualization
decision_tree = """
START
│
├── Working with numerical arrays only? → NumPy
│
├── Working with tabular/mixed data? → Pandas
│
├── Dataset size < 1,000? → Consider Pure Python
│
└── Need special algorithms? → Pure Python + NumPy/Pandas combo
"""
```

## 6. 🛠️ Optimization Techniques

### Pandas-Specific Optimizations


In [None]:
# 1. Use vectorized operations
# Slow:
df['new_col'] = df['col'].apply(lambda x: x*2)
# Fast:
df['new_col'] = df['col'] * 2

# 2. Use proper data types
df['int_col'] = df['int_col'].astype('int32')  # Saves memory vs int64

# 3. Use categoricals for low-cardinality strings
df['category_col'] = df['category_col'].astype('category')
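
The effect of the dtype changes above can be checked with `memory_usage(deep=True)`. The sketch below uses a made-up frame with a low-cardinality `city` column and a small-range `visits` column; both names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'city': rng.choice(['Lagos', 'Lima', 'Oslo'], size=100_000),
    'visits': rng.integers(0, 365, size=100_000),
})

before = df.memory_usage(deep=True)

df_small = df.assign(
    city=df['city'].astype('category'),                      # codes plus 3 labels
    visits=pd.to_numeric(df['visits'], downcast='integer'),  # smallest int type that fits
)
after = df_small.memory_usage(deep=True)

print(pd.DataFrame({'before': before, 'after': after}))
```

`downcast='integer'` picks the narrowest type that holds the current values, so it is worth re-checking if later data could exceed that range.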

### NumPy Optimization Tricks


In [None]:
# 1. Use views not copies
arr_view = arr[10:20]  # View (no memory copy)
arr_copy = arr[10:20].copy()  # Explicit copy

# 2. Use in-place operations
arr *= 2  # In-place
arr = arr * 2  # Creates new array

# 3. Pre-allocate arrays
result = np.empty_like(arr)  # Pre-allocation
np.multiply(arr, 2, out=result)  # No temporary arrays
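
Whether an indexing expression returns a view or a copy can be verified with `np.shares_memory`. The sketch below uses a small example array to show that basic slices share the buffer while `.copy()` and fancy indexing do not.

```python
import numpy as np

arr = np.arange(10, dtype=np.float64)

view = arr[2:5]            # basic slice: shares the buffer
copy = arr[2:5].copy()     # independent data
fancy = arr[[2, 3, 4]]     # fancy indexing: always copies

view[0] = 99.0             # writes through to arr[2]
copy[1] = -1.0             # arr[3] is unaffected

print(np.shares_memory(arr, view), np.shares_memory(arr, copy), np.shares_memory(arr, fancy))
print(arr[2], arr[3])
```

Views save memory, but the write-through behaviour means an in-place change to a slice also changes the original array.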

## 7. 🧩 Performance Challenge

**Challenge 1:** Benchmarking Sorting
- Implement sorting in pure Python, NumPy, and Pandas
- Compare performance across different array sizes (1K, 100K, 1M elements)
- Visualize results with matplotlib

**Challenge 2:** Memory-Efficient Data Processing
- Create a 10M element dataset
- Compare memory usage of different groupby operations
- Implement the most memory-efficient version

**Challenge 3:** Real-World Optimization
- Take a slow pandas operation from your own work
- Optimize it using vectorization and proper dtypes
- Measure the speedup achieved


## 8. 📝 Conclusion & Next Steps

**Key Takeaways:**
- NumPy excels at numerical operations
- Pandas adds convenience at a small performance cost
- Pure Python is flexible but slower for bulk operations

**Where to Go Next:**
1. Explore `numba` for further speedups
2. Learn about Dask for out-of-core computations
3. Investigate `polars` as a pandas alternative

```python
# 🚀 Performance optimization is a journey, not a destination!
# Keep measuring and improving! 🏎️💨
```

**Final Pro Tip:** Always profile before optimizing - you might be surprised where the real bottlenecks are!
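
A minimal way to act on that tip with only the standard library is `cProfile`. In the sketch below, `slow_sum` and `workload` are made-up functions standing in for real code.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately slow pure-Python loop (hypothetical hot spot)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload():
    return slow_sum(200_000) + sum(range(1000))

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the five most expensive entries by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

Reading the report top-down shows which function dominates the runtime before any optimization effort is spent.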

**Happy benchmarking!** ⏱️📊

# 📊 Module 14: Case Studies & Best Practices in Data Analysis

```python
# 🎯 Table of Contents
"""
1. 📚 Introduction to Production-Grade Data Analysis
2. 🏗️ Optimizing Data Pipelines
   - 2.1 Pipeline Design Patterns
   - 2.2 Performance Optimization
   - 2.3 Maintainability Best Practices
3. 🐛 Common Pitfalls & Debugging Tips
   - 3.1 Data Quality Issues
   - 3.2 Performance Bottlenecks
   - 3.3 Unexpected Results
4. 🏆 Capstone Project: Customer Segmentation
   - 4.1 Data Collection & Cleaning
   - 4.2 Feature Engineering
   - 4.3 Clustering Analysis
   - 4.4 Results Interpretation
5. 📝 Conclusion & Career Next Steps
"""
```

## 1. 📚 Introduction to Production-Grade Data Analysis

Welcome to professional data analysis! 🎓 Today we'll bridge the gap between academic exercises and real-world applications.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import seaborn as sns

# Setup
plt.style.use('seaborn-v0_8')  # the plain 'seaborn' style name was removed in Matplotlib 3.8
sns.set_palette("husl")
%matplotlib inline
pd.set_option('display.max_columns', 100)

## 2. 🏗️ Optimizing Data Pipelines

### 2.1 Pipeline Design Patterns

**Modular Pipeline Architecture:**

In [None]:
def data_loader(path):
    """Load and validate raw data"""
    df = pd.read_csv(path)
    assert not df.empty, "Empty DataFrame loaded"
    return df

def data_cleaner(df):
    """Handle missing values and outliers"""
    df = df.dropna(subset=['essential_columns'])
    df = remove_outliers(df)
    return df

def feature_engineerer(df):
    """Create derived features"""
    df['feature_ratio'] = df['feature1'] / df['feature2']
    return df

# Main pipeline
def run_pipeline(data_path):
    """End-to-end data processing"""
    data = data_loader(data_path)
    clean_data = data_cleaner(data)
    final_data = feature_engineerer(clean_data)
    return final_data
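
The pipeline above calls `remove_outliers` without defining it. One possible implementation, assuming a z-score rule and an optional `columns` argument (both are assumptions, not the author's definition), could look like this:

```python
import numpy as np
import pandas as pd

def remove_outliers(df, columns=None, threshold=3):
    """Drop rows whose absolute z-score exceeds `threshold` in any of `columns`.

    Hypothetical helper: the z-score rule and the defaults are assumptions.
    """
    if columns is None:
        columns = df.select_dtypes(include='number').columns
    subset = df[columns]
    # Constant columns have zero spread; NaN avoids a division by zero
    std = subset.std(ddof=0).replace(0, np.nan)
    z_scores = (subset - subset.mean()) / std
    keep = ~(z_scores.abs() > threshold).any(axis=1)
    return df[keep]

# Twenty ordinary rows plus one extreme value
sample = pd.DataFrame({'spend': [10.0] * 20 + [1000.0], 'visits': range(21)})
cleaned = remove_outliers(sample, columns=['spend'], threshold=3)
print(len(sample), len(cleaned))
```

Because the first argument is the DataFrame, the helper also works with `.pipe(remove_outliers, columns=[...], threshold=3)`.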

**💡 Pro Tip:** Use Python's `logging` module to track pipeline execution instead of print statements.
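
A minimal sketch of that tip, assuming a logger named `pipeline` and a cleaning step that reports how many rows it dropped (the function name `logged_cleaner` and the sample columns are hypothetical):

```python
import logging
import pandas as pd

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
logger = logging.getLogger('pipeline')

def logged_cleaner(df, required_columns):
    """Drop rows missing any required column and log the row counts."""
    cleaned = df.dropna(subset=required_columns)
    logger.info('data_cleaner: %d -> %d rows (dropped %d)',
                len(df), len(cleaned), len(df) - len(cleaned))
    return cleaned

raw = pd.DataFrame({'income': [50_000, None, 72_000], 'age': [34, 41, None]})
clean = logged_cleaner(raw, required_columns=['income', 'age'])
```

Unlike `print`, log messages carry a timestamp and level, and can be redirected to a file or silenced without touching the pipeline code.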

### 2.2 Performance Optimization


In [None]:
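# Optimized operations
def optimized_operations(df, allowed_categories):
    # Vectorized operations (assign returns a new frame, leaving the input untouched)
    df = df.assign(new_feature=df['feature1'] * 0.8 + df['feature2'] * 0.2)
    
    # Efficient filtering (@ refers to the local variable allowed_categories)
    filtered = df.query('value > 0.5 and category in @allowed_categories').copy()
    
    # Memory reduction: downcast float64 columns to float32
    for col in filtered.select_dtypes(include=['float64']):
        filtered[col] = pd.to_numeric(filtered[col], downcast='float')
    
    return filtered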

### 2.3 Maintainability Best Practices

In [None]:
# Configuration-driven pipeline
CONFIG = {
    'data_cleaning': {
        'drop_na_columns': ['income', 'age'],
        'outlier_thresholds': {
            'spend': (0, 10000),
            'visits': (0, 365)
        }
    },
    'features': {
        'ratios': ['spend/visits', 'online/in_store'],
        'bins': {
            'age': [0, 18, 35, 55, 100]
        }
    }
}

def configurable_pipeline(df, config):
    """Maintainable configuration-driven pipeline"""
    # Data cleaning
    df = df.dropna(subset=config['data_cleaning']['drop_na_columns'])
    
    # Feature engineering
    for ratio in config['features']['ratios']:
        num, den = ratio.split('/')
        df[ratio] = df[num] / df[den]
    
    return df
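
The `outlier_thresholds` and `bins` entries in `CONFIG` are not used by the function above. A possible extension, sketched here with a trimmed copy of the config and a hypothetical helper name, applies both:

```python
import pandas as pd

# Trimmed copy of the CONFIG above, limited to the two keys used here
CONFIG = {
    'data_cleaning': {
        'outlier_thresholds': {'spend': (0, 10000), 'visits': (0, 365)},
    },
    'features': {
        'bins': {'age': [0, 18, 35, 55, 100]},
    },
}

def apply_thresholds_and_bins(df, config):
    """Keep rows inside the configured ranges, then add one binned column per entry."""
    for col, (low, high) in config['data_cleaning']['outlier_thresholds'].items():
        df = df[df[col].between(low, high)]
    binned = {
        f'{col}_bin': pd.cut(df[col], bins=edges)
        for col, edges in config['features']['bins'].items()
    }
    return df.assign(**binned)

sample = pd.DataFrame({
    'spend': [120, 25000, 300],
    'visits': [12, 40, 400],
    'age': [17, 42, 60],
})
result = apply_thresholds_and_bins(sample, CONFIG)
print(result)
```

Keeping the ranges and bin edges in the config means they can be tuned without editing the function.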

## 3. 🐛 Common Pitfalls & Debugging Tips

### 3.1 Data Quality Issues

**Debugging Checklist:**
1. Verify null values: `df.isna().sum()`
2. Check for duplicates: `df.duplicated().sum()`
3. Validate ranges: `df.describe()`
4. Verify categorical values: `df['category'].value_counts()`


In [None]:
def data_quality_report(df):
    """Generate comprehensive data quality report"""
    report = {
        'missing_values': df.isna().sum(),
        'duplicates': df.duplicated().sum(),
        'data_types': df.dtypes,
        'numeric_stats': df.describe(),
        'categorical_counts': {col: df[col].value_counts() 
                              for col in df.select_dtypes(include=['object'])}
    }
    return report
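
To make the checklist concrete, the sketch below runs the individual checks on a tiny made-up frame that contains one missing value and one duplicated row.

```python
import pandas as pd

sample = pd.DataFrame({
    'category': ['a', 'b', 'b', None],
    'value': [1.0, 2.0, 2.0, 4.0],
})

missing = sample.isna().sum()                    # nulls per column
duplicate_rows = sample.duplicated().sum()       # fully repeated rows
value_range = sample['value'].agg(['min', 'max'])
category_counts = sample['category'].value_counts(dropna=False)

print(missing, duplicate_rows, value_range, category_counts, sep='\n\n')
```

Passing `dropna=False` to `value_counts` keeps missing categories visible, which the default call would hide.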

### 3.2 Performance Bottlenecks

**Diagnosing Slow Code:**

In [None]:
# Using line_profiler
%load_ext line_profiler
%lprun -f slow_function slow_function(df)

# Memory profiling
from memory_profiler import profile
@profile
def memory_intensive_operation():
    # Your code here
    pass  # placeholder so the function body is valid Python

### 3.3 Unexpected Results

**Debugging Framework:**
1. Isolate the problem - create minimal test case
2. Check intermediate results
3. Verify assumptions about the data
4. Use visualization to spot anomalies

In [None]:
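def debug_analysis(df):
    """Visual debugging helper"""
    # Correlations and histograms only make sense for numeric columns
    numeric = df.select_dtypes(include='number')
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Plot distributions (all numeric columns overlaid on one axis)
    numeric.plot.hist(bins=30, alpha=0.5, ax=axes[0])
    axes[0].set_title('Distributions')
    
    # Plot correlations
    sns.heatmap(numeric.corr(), annot=True, ax=axes[1])
    axes[1].set_title('Correlations')
    fig.tight_layout()
    
    # Plot scatter matrix (it builds its own grid of axes, so it gets a separate figure)
    pd.plotting.scatter_matrix(numeric, alpha=0.2, figsize=(12, 10))
    plt.tight_layout()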
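
Steps 1 to 3 of the debugging framework can also be expressed as assertions on a minimal test case. The sketch below uses hypothetical `spend` and `visits` columns and checks the assumptions around a single ratio feature.

```python
import pandas as pd

def check_ratio_step(df):
    """Recompute a ratio feature and verify assumptions at each stage."""
    # Assumption about the input data
    assert (df['visits'] > 0).all(), 'visits must be positive before dividing'

    out = df.assign(spend_per_visit=df['spend'] / df['visits'])

    # Intermediate results
    assert out['spend_per_visit'].notna().all(), 'ratio produced missing values'
    assert len(out) == len(df), 'row count changed unexpectedly'
    return out

# Minimal test case: two rows whose expected result is easy to compute by hand
minimal_case = pd.DataFrame({'spend': [100.0, 250.0], 'visits': [4, 5]})
checked = check_ratio_step(minimal_case)
print(checked)
```

When an assertion fails, its message points at the violated assumption, which narrows the search before any plotting is needed.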

## 4. 🏆 Capstone Project: Customer Segmentation

### 4.1 Data Collection & Cleaning

In [None]:
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"
customers = pd.read_csv(url)

# Clean data
customers_clean = (customers
                   .dropna()
                   .pipe(remove_outliers, 
                        columns=['Fresh', 'Milk', 'Grocery'],
                        threshold=3)
                   .reset_index(drop=True))

### 4.2 Feature Engineering


In [None]:
# Create meaningful features
customers_featured = customers_clean.assign(
    perishable_ratio=lambda x: x['Fresh'] / (x['Grocery'] + 1e-6),
    daily_consumption=lambda x: x['Milk'] / 30,
    product_diversity=lambda x: x[['Grocery', 'Milk', 'Fresh']].std(axis=1)
)

# Scale features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(customers_featured)

### 4.3 Clustering Analysis

In [None]:
# Determine optimal clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Apply K-means
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
clusters = kmeans.fit_predict(scaled_data)

# Visualize with PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
plt.scatter(principal_components[:,0], principal_components[:,1], c=clusters)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Customer Segments')
plt.show()
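
The elbow in a WCSS plot can be hard to read, so a second criterion is often useful. The sketch below uses the silhouette score on synthetic blobs with three known centers (an assumption; real customer data is rarely this clean) to pick the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated blobs with three known centers
points, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.8,
    random_state=42,
)

# Silhouette needs at least two clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher silhouette score means tighter, better-separated clusters, so the `k` with the highest score is a reasonable candidate to compare against the elbow.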

### 4.4 Results Interpretation

In [None]:
# Analyze cluster characteristics
customers_featured['cluster'] = clusters
cluster_profiles = customers_featured.groupby('cluster').mean()

# Visualize cluster profiles
plt.figure(figsize=(12, 6))
sns.heatmap(cluster_profiles.T, cmap='YlGnBu', annot=True, fmt='.1f')
plt.title('Average Values by Cluster')
plt.show()

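# Generate business insights
# (illustrative labels only: K-means cluster numbers are arbitrary, so match
#  each description to the cluster_profiles table before using it)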
insights = {
    0: "High-volume grocery buyers - Target with bulk discounts",
    1: "Fresh product focused - Highlight quality and freshness",
    2: "Milk and dairy heavy - Cross-sell other perishables",
    3: "Low-volume diverse buyers - Focus on convenience"
}
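
Because cluster numbers carry no meaning on their own, descriptions like these are best derived from the profile table. The sketch below uses a tiny made-up frame to show one approach: compare each cluster's mean to the overall mean and report the column where it stands out most.

```python
import pandas as pd

# Toy data with two obvious groups (hypothetical values)
toy = pd.DataFrame({
    'Fresh': [100, 120, 10, 12],
    'Grocery': [10, 14, 90, 110],
    'cluster': [0, 0, 1, 1],
})

profiles = toy.groupby('cluster').mean()

# Ratio to the overall mean: values above 1 mean above-average spending
relative = profiles / toy.drop(columns='cluster').mean()

# Column in which each cluster is most over-represented
dominant = relative.idxmax(axis=1)
print(relative.round(2))
print(dominant)
```

On real data, the `dominant` column gives a first hint for naming each segment, which should then be checked against the full profile.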



## 5. 📝 Conclusion & Career Next Steps

**Key Takeaways:**
- Professional pipelines require robustness and maintainability
- Systematic debugging saves hours of frustration
- End-to-end projects demonstrate real-world value

**Career Next Steps:**
1. Build a portfolio of complete projects
2. Learn workflow tools (Airflow, Prefect)
3. Master version control for data science
4. Practice communicating insights

```python
# 🚀 You're now ready for professional data analysis work!
# Go build something amazing! 🌟
```

**Final Pro Tip:** Document your projects with:
1. Clear problem statement
2. Data dictionary
3. Key assumptions
4. Business impact

**Happy analyzing!** 📈🔍