# 🐼 Pandas: Data Analysis Library in Python

Pandas is a powerful, fast, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.

---

## 📚 What is Pandas?

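The name pandas is derived from “panel data,” an econometrics term for multidimensional structured data sets, and is also a play on “Python data analysis.” The library provides data structures such as Series and DataFrame that are well suited to structured (tabular) data. It is widely used for data preprocessing, analysis, and basic visualization.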

---

## 🧱 Core Data Structures

### 1. Series
- A one-dimensional labeled array capable of holding any data type.
- Think of it as a column in an Excel sheet.

```python
import pandas as pd
s = pd.Series([10, 20, 30, 40])
```

### 2. DataFrame

* A two-dimensional labeled data structure with columns of potentially different types.
* Similar to a spreadsheet or SQL table.

```python
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
```

---

## 🔧 Key Features of Pandas

* ✅ Handling of missing data
* ✅ Size mutability (adding/removing columns/rows)
* ✅ Powerful group by functionality
* ✅ Data alignment and integrated handling of time series
* ✅ Built-in methods for reading/writing data from files (CSV, Excel, SQL, JSON, etc.)

---

## 🛠️ Commonly Used Functions

| Function        | Description                             |
| --------------- | --------------------------------------- |
| `pd.read_csv()` | Load data from a CSV file               |
| `df.head()`     | Show the first 5 rows (default)         |
| `df.describe()` | Summary statistics                      |
| `df.info()`     | General information about the DataFrame |
| `df.drop()`     | Drop rows or columns                    |
| `df.groupby()`  | Group data based on a column            |
| `df.fillna()`   | Fill missing values                     |
| `df.isnull()`   | Identify missing values                 |
| `df.merge()`    | Combine DataFrames                      |
| `df.apply()`    | Apply function to rows/columns          |
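
Several of the functions in the table can be combined in a few lines. The following is a minimal sketch, not a canonical workflow: the `orders` and `customers` frames, their column names, and the 10% tax rate are all hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical data: names, columns, and values are invented for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [100.0, None, 50.0, 75.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# isnull() flags missing values; sum() counts them per column.
missing_per_column = orders.isnull().sum()

# fillna() returns a new DataFrame; here missing amounts become 0.
orders_filled = orders.fillna({"amount": 0.0})

# merge() joins the two frames on their shared key column.
merged = orders_filled.merge(customers, on="customer_id", how="left")

# groupby() splits rows by city, then sum() aggregates each group.
total_by_city = merged.groupby("city")["amount"].sum()

# apply() runs a function on each value of a column (a Series).
merged["amount_with_tax"] = merged["amount"].apply(lambda x: round(x * 1.1, 2))

print(missing_per_column)
print(total_by_city)
print(merged)
```

Filling missing amounts with 0 is only one possible choice; depending on the data, the mean or dropping the row may be more appropriate.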

---

## 📊 Use Cases

* Data cleaning and preprocessing
* Exploratory data analysis (EDA)
* Feature engineering for machine learning
* Time-series analysis
* Automation of data workflows

---

🎯 Pandas is an essential tool for any Data Scientist, Data Engineer, or ML Engineer. Mastering it will drastically improve your productivity and efficiency when working with data in Python.


In [2]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-macosx_10_9_x86_64.whl.metadata (89 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp312-cp312-macosx_10_9_x86_64.whl (12.5 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.3 pytz-2025.2 tzdata-2025.2


In [2]:
import pandas as pd
data = [1,2,3,4,5]
series=pd.Series(data)
print(series)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [6]:
# Create a Series from a dictionary (the keys become the index labels)
data = {'a':1,'b':2,'c':3}
series_dict=pd.Series(data)
print(series_dict)

a    1
b    2
c    3
dtype: int64


In [7]:
data=[10,20,30]
index=['a','b','c']
pd.Series(data,index=index)

a    10
b    20
c    30
dtype: int64
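
Once a Series has custom labels, values can be looked up by label or by position, and arithmetic between two Series aligns on those labels. The sketch below reuses the Series from the cell above; the second Series, `other`, is a hypothetical addition for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]        # label-based lookup
by_position = s.iloc[0]  # position-based lookup
subset = s[["a", "c"]]   # select several labels at once

# Arithmetic aligns on index labels, not on positions.
# "other" is a hypothetical second Series with partly different labels.
other = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Labels "a" and "d" appear in only one Series, so their result is NaN.
total = s + other

print(by_label, by_position)
print(subset)
print(total)
```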

In [9]:
# DataFrame
# Create a DataFrame from a dictionary of lists

data= {
    'Name':['Anmol','John','Jack'],
    "Age":[25,30,45],
    "City":['Bangalore','New York','Florida']
}
df=pd.DataFrame(data)
print(df)
print(type(df))

    Name  Age       City
0  Anmol   25  Bangalore
1   John   30   New York
2   Jack   45    Florida
<class 'pandas.core.frame.DataFrame'>


In [12]:
# Create a DataFrame from a list of dictionaries (one dictionary per row)
data= [
{'Name':'Anmol',"Age":25,"City":'Bangalore'},
{'Name':'Anmol',"Age":25,"City":'Bangalore'},
{'Name':'Anmol',"Age":25,"City":'Bangalore'},
{'Name':'Anmol',"Age":25,"City":'Bangalore'},
{'Name':'Anmol',"Age":25,"City":'Bangalore'}
]
df = pd.DataFrame(data)
print(df)
print(type(df))

    Name  Age       City
0  Anmol   25  Bangalore
1  Anmol   25  Bangalore
2  Anmol   25  Bangalore
3  Anmol   25  Bangalore
4  Anmol   25  Bangalore
<class 'pandas.core.frame.DataFrame'>


In [7]:
df = pd.read_csv("sales_data.csv")
df.head()

|  | ORDERNUMBER | QUANTITYORDERED | PRICEEACH | ORDERLINENUMBER | SALES | ORDERDATE | STATUS | QTR_ID | MONTH_ID | YEAR_ID | ... | ADDRESSLINE1 | ADDRESSLINE2 | CITY | STATE | POSTALCODE | COUNTRY | TERRITORY | CONTACTLASTNAME | CONTACTFIRSTNAME | DEALSIZE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10107 | 30 | 95.7 | 2 | 2871.0 | 2/24/2003 0:00 | Shipped | 1 | 2 | 2003 | ... | 897 Long Airport Avenue |  | NYC | NY | 10022.0 | USA |  | Yu | Kwai | Small |
| 1 | 10121 | 34 | 81.35 | 5 | 2765.9 | 5/7/2003 0:00 | Shipped | 2 | 5 | 2003 | ... | 59 rue de l'Abbaye |  | Reims |  | 51100.0 | France | EMEA | Henriot | Paul | Small |
| 2 | 10134 | 41 | 94.74 | 2 | 3884.34 | 7/1/2003 0:00 | Shipped | 3 | 7 | 2003 | ... | 27 rue du Colonel Pierre Avia |  | Paris |  | 75508.0 | France | EMEA | Da Cunha | Daniel | Medium |
| 3 | 10145 | 45 | 83.26 | 6 | 3746.7 | 8/25/2003 0:00 | Shipped | 3 | 8 | 2003 | ... | 78934 Hillside Dr. |  | Pasadena | CA | 90003.0 | USA |  | Young | Julie | Medium |
| 4 | 10159 | 49 | 100.0 | 14 | 5205.27 | 10/10/2003 0:00 | Shipped | 4 | 10 | 2003 | ... | 7734 Strong St. |  | San Francisco | CA |  | USA |  | Brown | Julie | Medium |


In [9]:
df.tail()

|  | ORDERNUMBER | QUANTITYORDERED | PRICEEACH | ORDERLINENUMBER | SALES | ORDERDATE | STATUS | QTR_ID | MONTH_ID | YEAR_ID | ... | ADDRESSLINE1 | ADDRESSLINE2 | CITY | STATE | POSTALCODE | COUNTRY | TERRITORY | CONTACTLASTNAME | CONTACTFIRSTNAME | DEALSIZE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 94 | 10275 | 36 | 100.0 | 3 | 6901.92 | 7/23/2004 0:00 | Shipped | 3 | 7 | 2004 | ... | 67, rue des Cinquante Otages |  | Nantes |  | 44000 | France | EMEA | Labrune | Janine | Medium |
| 95 | 10285 | 27 | 100.0 | 8 | 5438.07 | 8/27/2004 0:00 | Shipped | 3 | 8 | 2004 | ... | 39323 Spinnaker Dr. |  | Cambridge | MA | 51247 | USA |  | Hernandez | Marta | Medium |
| 96 | 10299 | 29 | 100.0 | 11 | 6683.34 | 9/30/2004 0:00 | Shipped | 3 | 9 | 2004 | ... | Keskuskatu 45 |  | Helsinki |  | 21240 | Finland | EMEA | Karttunen | Matti | Medium |
| 97 | 10308 | 20 | 100.0 | 1 | 4570.4 | 10/15/2004 0:00 | Shipped | 4 | 10 | 2004 | ... | 3758 North Pendale Street |  | White Plains | NY | 24067 | USA |  | Frick | Steve | Medium |
| 98 | 10318 | 37 | 100.0 | 3 | 7667.14 | 11/2/2004 0:00 | Shipped | 4 | 11 | 2004 | ... | 7586 Pompton St. |  | Allentown | PA | 70267 | USA |  | Yu | Kyung | Large |
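
The `read_csv`, `head`, and `tail` calls above can be tried without the `sales_data.csv` file. The sketch below is an assumption-laden stand-in: it reads a small in-memory CSV through `io.StringIO`, using four columns and four rows copied from the output above. The variable names `csv_text` and `sales` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
# The values are a small subset of the rows shown in the output above.
csv_text = """ORDERNUMBER,QUANTITYORDERED,PRICEEACH,STATUS
10107,30,95.70,Shipped
10121,34,81.35,Shipped
10134,41,94.74,Shipped
10145,45,83.26,Shipped
"""

# read_csv accepts a file path or any file-like object.
sales = pd.read_csv(io.StringIO(csv_text))

# head(n) and tail(n) take an optional row count; the default is 5.
first_two = sales.head(2)
last_two = sales.tail(2)

print(sales.shape)   # (rows, columns)
print(sales.dtypes)  # inferred type of each column
print(first_two)
print(last_two)
```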


In [13]:
# Accessing data from a DataFrame (first, inspect the list it was built from)
data

[{'Name': 'Anmol', 'Age': 25, 'City': 'Bangalore'},
 {'Name': 'Anmol', 'Age': 25, 'City': 'Bangalore'},
 {'Name': 'Anmol', 'Age': 25, 'City': 'Bangalore'},
 {'Name': 'Anmol', 'Age': 25, 'City': 'Bangalore'},
 {'Name': 'Anmol', 'Age': 25, 'City': 'Bangalore'}]

In [15]:
df['Name']

0    Anmol
1    Anmol
2    Anmol
3    Anmol
4    Anmol
Name: Name, dtype: object

In [16]:
df.iloc[0]

Name        Anmol
Age            25
City    Bangalore
Name: 0, dtype: object

In [17]:
# iat: access a single value by integer row and column position
df.iat[2,2]

'Bangalore'
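
The accessors used above have label-based counterparts. The sketch below compares `loc`, `iloc`, `at`, and `iat` on the three-person DataFrame from earlier. The string row labels `r1` to `r3` are a hypothetical addition, chosen so that labels and positions are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Anmol", "John", "Jack"],
        "Age": [25, 30, 45],
        "City": ["Bangalore", "New York", "Florida"],
    },
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

row_by_label = df.loc["r2"]          # whole row, selected by label
row_by_position = df.iloc[1]         # the same row, selected by position
cell_by_label = df.at["r3", "City"]  # single value, by row and column label
cell_by_position = df.iat[2, 2]      # single value, by integer positions

# loc also accepts a boolean mask plus a list of column labels.
older = df.loc[df["Age"] > 28, ["Name", "Age"]]

print(row_by_label)
print(cell_by_label, cell_by_position)
print(older)
```

As a rule of thumb, `at` and `iat` are meant for single values, while `loc` and `iloc` can also return rows, columns, or sub-tables.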

In [18]:
# Data manipulation with dataframe
df

|  | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Anmol | 25 | Bangalore |
| 1 | Anmol | 25 | Bangalore |
| 2 | Anmol | 25 | Bangalore |
| 3 | Anmol | 25 | Bangalore |
| 4 | Anmol | 25 | Bangalore |


In [20]:
df['Salary']=[50000,60000,70000,80000,90000]
df

|  | Name | Age | City | Salary |
| --- | --- | --- | --- | --- |
| 0 | Anmol | 25 | Bangalore | 50000 |
| 1 | Anmol | 25 | Bangalore | 60000 |
| 2 | Anmol | 25 | Bangalore | 70000 |
| 3 | Anmol | 25 | Bangalore | 80000 |
| 4 | Anmol | 25 | Bangalore | 90000 |


In [21]:
df.drop("Salary",axis=1,inplace=True)

In [22]:
df

|  | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Anmol | 25 | Bangalore |
| 1 | Anmol | 25 | Bangalore |
| 2 | Anmol | 25 | Bangalore |
| 3 | Anmol | 25 | Bangalore |
| 4 | Anmol | 25 | Bangalore |


In [25]:
# Increase every value in the Age column by 1
df['Age']=df['Age']+1
df

|  | Name | Age | City |
| --- | --- | --- | --- |
| 0 | Anmol | 26 | Bangalore |
| 1 | Anmol | 26 | Bangalore |
| 2 | Anmol | 26 | Bangalore |
| 3 | Anmol | 26 | Bangalore |
| 4 | Anmol | 26 | Bangalore |


In [27]:
df.drop(0,inplace=True)
df

|  | Name | Age | City |
| --- | --- | --- | --- |
| 1 | Anmol | 26 | Bangalore |
| 2 | Anmol | 26 | Bangalore |
| 3 | Anmol | 26 | Bangalore |
| 4 | Anmol | 26 | Bangalore |
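
After the row with label 0 is dropped, the index starts at 1. A possible alternative to `inplace=True`, sketched below on a rebuilt copy of the same data, is to keep the result of `drop()` in a new variable and renumber it with `reset_index(drop=True)`. The names `trimmed` and `renumbered` are hypothetical.

```python
import pandas as pd

# Rebuild the DataFrame from the cells above so the sketch is self-contained.
df = pd.DataFrame({
    "Name": ["Anmol"] * 5,
    "Age": [26] * 5,
    "City": ["Bangalore"] * 5,
})

# drop() returns a new DataFrame by default; df itself is left unchanged.
trimmed = df.drop(index=0)

# The remaining labels are 1..4; reset_index(drop=True) renumbers them from 0
# and discards the old labels instead of keeping them as a column.
renumbered = trimmed.reset_index(drop=True)

print(len(df), len(trimmed))
print(trimmed.index.tolist())
print(renumbered.index.tolist())
```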


In [29]:
df.describe()

|  | Age |
| --- | --- |
| count | 4.0 |
| mean | 26.0 |
| std | 0.0 |
| min | 26.0 |
| 25% | 26.0 |
| 50% | 26.0 |
| 75% | 26.0 |
| max | 26.0 |


# 📝 Pandas Practice Questions

These questions will help you solidify your understanding of Pandas operations and data manipulation techniques.
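
The questions assume a dataset with columns such as `Age`, `Salary`, and `Department`, but none is provided. The sketch below builds a small hypothetical one to practice on. Every value in it is invented, and it covers only some of the columns the questions mention. It includes one missing `Salary` and one duplicated row on purpose, so the cleaning questions have something to act on.

```python
import pandas as pd

# Hypothetical practice data: every value here is invented for illustration.
practice_df = pd.DataFrame({
    "EmployeeID": [101, 102, 103, 104, 105, 103],
    "Name": ["asha", "ben", "chen", "dina", "eli", "chen"],
    "Age": [28, 35, 52, 41, 30, 52],
    "Gender": ["F", "M", "M", "F", "M", "M"],
    "Department": ["HR", "IT", "IT", "Sales", "HR", "IT"],
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"],
    "Score": [88, 76, 91, 67, 82, 91],
    # None becomes NaN, giving the cleaning questions a missing value.
    "Salary": [52000.0, None, 98000.0, 61000.0, 55000.0, 98000.0],
    # Dates are stored as plain strings so they can be converted later.
    "Date": ["2024-01-15", "2024-02-01", "2024-02-20",
             "2024-03-05", "2024-03-18", "2024-02-20"],
})

print(practice_df.shape)
print(practice_df.isnull().sum())
print(practice_df.duplicated().sum())  # the last row repeats the third row
```

To practice the loading questions, the frame can first be saved with `practice_df.to_csv("data.csv", index=False)`.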

---

## 📁 Data Loading & Inspection

1. Load a CSV file named `data.csv` into a DataFrame.
2. Display the first 10 rows of the DataFrame.
3. Show the shape, columns, and data types of the DataFrame.
4. Check for missing values and count them column-wise.
5. Rename a column `OldName` to `NewName`.

---

## 🔍 Data Selection & Filtering

6. Select a specific column `Age` from the DataFrame.
7. Filter rows where `Age` is greater than 30.
8. Select rows where `Gender` is 'Female' and `Score` > 80.
9. Use `.loc` to select the value in row index 3 and column `Name`.
10. Use `.iloc` to select the 2nd row and 4th column.

---

## 🧼 Data Cleaning

11. Drop rows with any missing values.
12. Fill missing values in the `Salary` column with the mean.
13. Replace all values in the `Gender` column: 'M' with 'Male', 'F' with 'Female'.
14. Remove duplicate rows.
15. Convert the `Date` column to datetime format.

---

## 🧮 Aggregation and Grouping

16. Find the average `Score` by `Department`.
17. Count the number of entries per unique value in the `City` column.
18. Get the maximum `Salary` per `Job Role`.
19. Group by `Country` and `City` and calculate mean of numeric columns.
20. Create a pivot table with `Department` as index and average `Score`.

---

## 🔗 Merging and Joining

21. Merge two DataFrames `df1` and `df2` on the column `EmployeeID`.
22. Perform a left join of `sales_df` and `product_df` on `ProductID`.
23. Concatenate two DataFrames vertically.
24. Concatenate two DataFrames horizontally using axis=1.
25. Merge three DataFrames using multiple joins.

---

## ✏️ Apply and Lambda

26. Create a new column `Bonus = Salary * 0.1`.
27. Use `apply()` to double the values in the `Score` column.
28. Apply a lambda function to capitalize all entries in `Name` column.
29. Create a new column `IsSenior` where `Age` > 50 is `True`, else `False`.
30. Use `map()` to map values in `Grade` to numeric scale.

---

## 📌 Indexing and Sorting

31. Set `EmployeeID` as the index.
32. Sort the DataFrame by `Salary` in descending order.
33. Reset the index and drop the previous index column.
34. Sort by `Department` and then by `Age`.
35. Retrieve the row with the highest `Score`.

---

## 🧾 File I/O

36. Read data from an Excel file.
37. Write the DataFrame to a new CSV file.
38. Export DataFrame to Excel with multiple sheets.
39. Load data from a JSON file.
40. Save only selected columns to a CSV.

---

## 🧠 Advanced Exercises

41. Find the top 5 highest-paid employees.
42. Count unique values in each column.
43. Identify correlation between numerical columns.
44. Create a histogram of `Salary`.
45. Plot a bar chart of average `Score` per `Department`.
46. Add a rolling mean column for a time series.
47. Detect and remove outliers in the `Salary` column.
48. Normalize the `Age` column using min-max scaling.
49. One-hot encode the `Department` column.
50. Create a multi-index DataFrame using `City` and `Job`.

---

💡 Tip: Consult the pandas documentation, or run `?function_name` in Jupyter (for example, `?pd.read_csv`), for help with any function.

🚀 Happy Practicing!