# Introduction to Pandas

## 1. What is Pandas?

Pandas is a powerful Python library for data analysis, providing efficient data structures and functions for data manipulation.

In [1]:
import numpy as np
import pandas as pd

Think of Pandas and NumPy as two sets of building blocks. Both are useful for building amazing structures, but each has its own special abilities.

**Pandas** is like having blocks with labels and drawers. You can store your toys and easily find them later because everything is well-organized and labeled. It helps you handle data that's more like a spreadsheet, with rows and columns, just like your school homework.

**NumPy** is like having blocks that are great for math and calculations. Imagine you want to quickly count how many red and blue blocks you have, or do some cool math tricks with them. NumPy is perfect for that because it's super fast with numbers.

### Key Differences Between Pandas and NumPy:

| Feature              | Pandas                           | NumPy                         |
|----------------------|----------------------------------|-------------------------------|
| Data Structure       | Labeled rows and columns (DataFrame) | Multi-dimensional arrays |
| Ease of Use          | User-friendly for data analysis  | More technical and math-focused |
| Data Types           | Mixed data types                 | Mostly numerical data         |
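The rows of the table can be seen side by side in a small sketch. The names and numbers below are made up for illustration; the point is only that a NumPy array holds one dtype and is addressed by position, while a DataFrame can mix dtypes and is addressed by label.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype and is addressed by integer position.
heights = np.array([150.0, 162.5, 171.0])
print(heights.mean())   # vectorized math over the whole array
print(heights[1])       # element at position 1

# A DataFrame can hold a different dtype per column and is addressed by label.
# The row labels "p1".."p3" are hypothetical.
people = pd.DataFrame(
    {"name": ["An", "Binh", "Chi"], "height_cm": heights},
    index=["p1", "p2", "p3"],
)
print(people.dtypes)                   # one dtype per column
print(people.loc["p2", "height_cm"])   # lookup by row label and column label
```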

### Why Do We Need Pandas in Data Analysis?

Pandas is like a superhero for data analysis! When you have a big mess of information and need to clean it up, organize it, and make sense of it, Pandas can swoop in and save the day. 

**Example:**
Imagine you're a detective trying to solve a mystery. You have a notebook full of clues (data) about different suspects. Pandas helps you sort through those clues, find patterns, and figure out who the mystery person is. For example, you can easily filter out suspects based on their height, age, or location. 

With Pandas, you can do things like:

1. **Organize Data:** Keep track of your suspects and their details in a neat table.
2. **Filter Information:** Quickly find suspects who match certain criteria (like being taller than 5 feet).
3. **Analyze Patterns:** See if there's a pattern in the clues that points to the culprit.
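
The three steps above can be sketched with a tiny, invented suspect table. The column names (`name`, `height_ft`, `city`) and all values are assumptions chosen to match the detective story, not real data.

```python
import pandas as pd

# Hypothetical clue table; every value here is made up for illustration.
suspects = pd.DataFrame(
    {
        "name": ["Alex", "Blake", "Casey", "Drew"],
        "height_ft": [5.4, 4.9, 6.1, 5.8],
        "city": ["Hanoi", "Hue", "Hanoi", "Da Nang"],
    }
)

# 1. Organize: the DataFrame itself is the neat table.
# 2. Filter: keep only suspects taller than 5 feet.
tall = suspects[suspects["height_ft"] > 5]

# 3. Analyze: count how many tall suspects come from each city.
tall_per_city = tall.groupby("city").size()

print(tall)
print(tall_per_city)
```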

So, in a nutshell, Pandas helps you turn a messy pile of information into something neat and easy to understand, making it super helpful for data analysis! 📊🔍


### **Exercise:** Install pandas using `pip install pandas` (if not installed). Import pandas and print its version.


            ### **AI Prompt: Understanding Pandas**
            - Explain Pandas as if you're teaching a 10-year-old.
            - What are the key differences between Pandas and NumPy?
            - Why do we need Pandas in data analysis? Provide an example.
            

## 2. Pandas Objects - Series

A Series is a one-dimensional array of values in which each value is paired with an index label.

In [2]:
series1 = pd.Series([1, 2], index=['a', 'b'])
series2 = pd.Series({"a": 1, "b": 2})
print(series1)
print(series2)

a    1
b    2
dtype: int64
a    1
b    2
dtype: int64


In [None]:
series1.index
# index holds the labels of the rows

Index(['a', 'b'], dtype='object')

In [None]:
series2.values
# values holds the underlying data as a NumPy array

array([1, 2])
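
Because a Series carries labels as well as values, arithmetic between two Series matches entries by label rather than by position. The sketch below uses invented stock figures to show this; it is one illustration of how a Series differs from a plain list, not a complete comparison.

```python
import pandas as pd

# Two Series with overlapping labels in a different order (made-up data).
stock = pd.Series({"apple": 10, "banana": 5, "cherry": 7})
sold = pd.Series({"banana": 2, "apple": 4})

# Subtraction matches values by label, not by position.
# "cherry" has no partner in `sold`, so the result for it is NaN.
remaining = stock - sold
print(remaining)

# sub() with fill_value treats a missing label as 0 instead.
remaining_filled = stock.sub(sold, fill_value=0)
print(remaining_filled)
```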

### **Exercise:** Create a Series containing 5 city names as values and use index labels as country names.


            ### **AI Prompt: Exploring Pandas Series**
            - How does a Pandas Series differ from a Python list?
            - Can you provide a real-world example where using a Series is beneficial?
            - Generate additional exercises that explore Series operations.
            

## 3. Pandas Objects - DataFrame

A DataFrame is a two-dimensional table with labeled rows and columns.

In [5]:
df1 = pd.DataFrame({'state': ['Ohio', 'California'], 'year': [2000, 2010]})
print(df1)

df2 = pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'])
print(df2)

        state  year
0        Ohio  2000
1  California  2010
        foo       bar
0  0.772922  0.300910
1  0.731532  0.375881
2  0.443576  0.589051


In [6]:
from pprint import pprint
pprint(df1)

        state  year
0        Ohio  2000
1  California  2010


In [None]:
df1.columns


Index(['state', 'year'], dtype='object')

In [8]:
df1.values


array([['Ohio', 2000],
       ['California', 2010]], dtype=object)

In [9]:
df1.index

RangeIndex(start=0, stop=2, step=1)

In [11]:
df3 = pd.DataFrame(np.random.rand(3, 2), columns=["too", "far"], index=["01", "02", "03"])
pprint(df3)

         too       far
01  0.200150  0.201454
02  0.111765  0.339657
03  0.034390  0.636285
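
The pieces of a DataFrame shown in this section (columns, index, values) can be inspected together. This is a small sketch using the same state/year data; the row labels `r1` and `r2` are hypothetical and added only to make the row index visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"state": ["Ohio", "California"], "year": [2000, 2010]},
    index=["r1", "r2"],  # hypothetical row labels
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(list(df.index))    # row labels

# Selecting one column returns a Series that shares the row index.
years = df["year"]
print(type(years).__name__)
print(years.loc["r2"])
```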


### **Exercise:** Create a DataFrame with student names, their subjects, and corresponding grades.

In [19]:
student = {"name": ["Linh", "ngan", "Khue"], "math": [6, 9, 7], "physics": [8, 8, 8]}
stu = pd.DataFrame(student)
stu

   name  math  physics
0  Linh     6        8
1  ngan     9        8
2  Khue     7        8



            ### **AI Prompt: Working with DataFrames**
            - Compare a DataFrame with an SQL table. How are they similar and different?
            - What are some common operations performed on a DataFrame?
            - Describe a scenario where a DataFrame is more useful than a Series.
            

In [29]:
dulieucosan = {'cot': ["r1", "r2", "r3"], 'A': [1, 2, 3], 'b': [2, 3, 5]}
them = pd.DataFrame(dulieucosan)
pprint(them)

  cot  A  b
0  r1  1  2
1  r2  2  3
2  r3  3  5


## 4. Indexing and Selection

Pandas provides two main indexers for selecting data: `.loc[]` selects by label, and `.iloc[]` selects by integer position (counting from 0).

In [None]:
s1 = pd.Series(range(10, 14), index=list("abcd"))
pprint(s1)
print(s1.loc['b'])
print(s1.iloc[1])  # the implicit index is numbered from 0; this reads the value at position 1 of the Series

a    10
b    11
c    12
d    13
dtype: int64
11
11
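
When the labels are strings, `.loc[]` and `.iloc[]` are easy to tell apart. The difference matters more when the labels are themselves integers. The sketch below uses a made-up Series whose labels (10, 20, 30) do not match the positions (0, 1, 2), and also shows that label slices include the end while position slices exclude it.

```python
import pandas as pd

# Integer labels that differ from the positions make the contrast visible.
s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

print(s.loc[20])   # the element whose label is 20
print(s.iloc[1])   # the element at position 1 (the same element here)

# Slices differ too: .loc includes the end label,
# .iloc excludes the end position.
by_label = s.loc[10:20]
by_position = s.iloc[0:1]
print(by_label.tolist())
print(by_position.tolist())
```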



### **Exercise:** Given a Series of five countries and their populations, retrieve the population of a specific country using `.loc[]`.


            ### **AI Prompt: Mastering Indexing**
            - What is the difference between `.loc[]` and `.iloc[]` in Pandas?
            - How can incorrect indexing lead to errors in data analysis?
            - Generate a challenging indexing problem and explain how to solve it.
            

### Combining Series


In [None]:
ser = pd.Series(dtype="float64")  # empty Series; an explicit dtype avoids the warning some pandas versions raise
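
One common way to join Series end to end is `pd.concat`. The cell above is only a stub, so the following is a minimal sketch of what such a step might look like, using two invented Series rather than anything from the original notebook.

```python
import pandas as pd

first = pd.Series([1, 2], index=["a", "b"])
second = pd.Series([3, 4], index=["c", "d"])

# pd.concat stacks the Series end to end and keeps each original label.
combined = pd.concat([first, second])
print(combined)

# ignore_index=True discards the labels and renumbers from 0.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered)
```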

## 5. Handling Missing Data

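Pandas marks missing values as `NaN` (a `None` in numeric data is converted to `NaN`). `fillna()` replaces missing values with a value you choose, and `dropna()` removes the rows or columns that contain them.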

In [25]:
df_missing = pd.DataFrame({'A': [1, 2, None], 'B': [None, 3, 4]})
df_filled = df_missing.fillna(0)
print(df_filled)

     A    B
0  1.0  0.0
1  2.0  3.0
2  0.0  4.0
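
Filling with a single constant is only one option. As a rough sketch on the same `df_missing` data, the cell below drops incomplete rows, fills each column with its own value, and interpolates a gap in a small made-up Series (`readings`, which is not part of the original notebook).

```python
import pandas as pd

df_missing = pd.DataFrame({"A": [1, 2, None], "B": [None, 3, 4]})

# dropna() removes every row that has at least one missing value.
dropped = df_missing.dropna()
print(dropped)

# fillna() accepts a dict to use a different fill value per column.
filled = df_missing.fillna({"A": 0, "B": -1})
print(filled)

# interpolate() estimates a gap from its neighbours (linear by default).
readings = pd.Series([1.0, None, 3.0])
interpolated = readings.interpolate()
print(interpolated)
```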


### **Exercise:** Create a DataFrame with missing values and replace them with the column mean.


            ### **AI Prompt: Handling Missing Data**
            - Why is handling missing data important in real-world datasets?
            - Compare the differences between `.fillna()`, `.dropna()`, and `.interpolate()`.
            - What are some strategies for handling missing categorical values?
            

## 6. MultiIndex (Hierarchical Indexing)

A MultiIndex stores two or more index levels on one axis, so hierarchical data such as (state, year) pairs can be labeled and selected level by level.

In [None]:
index = [('California', 2000), ('California', 2010), ('New York', 2000)]
populations = [33871648, 37253956, 18976457]
ind = pd.MultiIndex.from_tuples(index, names=('State', 'Year'))
df_multi = pd.DataFrame({'Population': populations}, index=ind)
print(df_multi)
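
Once the MultiIndex is built, rows can be selected by the outer level, by a full tuple, or by the inner level alone. The sketch below rebuilds the same population table and shows one way to do each; `xs()` is used for the inner level, though other approaches exist.

```python
import pandas as pd

index = [("California", 2000), ("California", 2010), ("New York", 2000)]
populations = [33871648, 37253956, 18976457]
ind = pd.MultiIndex.from_tuples(index, names=["State", "Year"])
df_multi = pd.DataFrame({"Population": populations}, index=ind)

# Outer level: all rows for one state (the result is indexed by Year).
california = df_multi.loc["California"]
print(california)

# Both levels: a single value addressed by a (State, Year) tuple.
ca_2010 = df_multi.loc[("California", 2010), "Population"]
print(ca_2010)

# Inner level only: xs() takes a cross-section for one year.
year_2000 = df_multi.xs(2000, level="Year")
print(year_2000)
```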

### **Exercise:** Create a MultiIndex DataFrame for sales data across multiple years and retrieve sales for a specific year.


            ### **AI Prompt: Exploring MultiIndex**
            - When should you use MultiIndex instead of a simple index?
            - How does MultiIndex impact performance in large datasets?
            - Can you create a dataset where MultiIndexing is necessary?
            

## Final Comprehensive Exercise

- Create a dataset of 10 employees with **Name, Age, Department, Salary, and Year of Joining**.
- Retrieve employees in a specific department.
- Fill missing salaries with the department average.
- Create a MultiIndex grouping employees by **Department and Year of Joining**.
- Display the department with the highest average salary.