# Detailed Introduction to Pandas for Data Science

Pandas is a Python library for data manipulation and analysis. It provides two primary data structures, `Series` and `DataFrame`, for handling structured (tabular or labeled) data. This notebook walks through creating these structures and then inspecting, selecting, cleaning, aggregating, and combining data.

## 1. Importing Pandas

Import Pandas before using it. By convention, it is imported under the alias `pd`.

In [20]:
import pandas as pd

## 2. Creating Pandas Data Structures

This section shows how to create a `Series` and a `DataFrame`. Every later operation in this notebook builds on these two structures.

### 2.1. Series
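A `Series` is a one-dimensional labeled array capable of holding any data type. It can be created from lists, arrays, or dictionaries. A list gets a default integer index starting at 0, while the keys of a dictionary become the index labels.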

In [21]:
# Creating a Pandas Series from a list
series_from_list = pd.Series([10, 20, 30, 40, 50])
print("Series from list:")
print(series_from_list)

# Creating a Pandas Series from a dictionary
series_from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})
print("\nSeries from dictionary:")
print(series_from_dict)

Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Series from dictionary:
a    1
b    2
c    3
dtype: int64
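
The index labels can also be supplied explicitly. The following is a minimal sketch, assuming illustrative labels `'x'`, `'y'`, `'z'` and variable names that do not appear elsewhere in this notebook. It shows label-based access, position-based access, and element-wise arithmetic.

```python
import pandas as pd

# Hypothetical example data: the labels 'x', 'y', 'z' are illustrative only
series_custom = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Access a value by its label and by its integer position
value_by_label = series_custom['y']        # label-based lookup
value_by_position = series_custom.iloc[0]  # position-based lookup

# Arithmetic is applied element-wise and keeps the index labels
doubled = series_custom * 2

print(value_by_label, value_by_position)
print(doubled)
```

Using `.iloc` for positional access avoids ambiguity when the labels themselves are integers.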


### 2.2. DataFrame
A `DataFrame` is a two-dimensional labeled data structure whose columns can have different types. It can be created from a dictionary of lists or Series, from a list of records, or from another DataFrame.

In [22]:
# Creating a DataFrame from a dictionary of lists
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print("DataFrame from dictionary of lists:")
print(df)

DataFrame from dictionary of lists:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
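
Two of the other construction routes mentioned above can be sketched as follows. The variable names (`records`, `df_from_records`, `df_from_series`) are assumptions made for illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names
records = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'Age': 30},
]
df_from_records = pd.DataFrame(records)

# A dictionary of Series aligns the rows on the Series index labels
df_from_series = pd.DataFrame({
    'Age': pd.Series([25, 30], index=['Alice', 'Bob']),
    'City': pd.Series(['New York', 'Los Angeles'], index=['Alice', 'Bob']),
})

print(df_from_records)
print(df_from_series)
```

In the second case the names serve as the row index rather than as a column, which is convenient when rows are looked up by name.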


## 3. Inspecting Data

Pandas provides various methods to inspect and understand the structure of your data. These methods are crucial for data exploration and analysis.

### 3.1. Viewing the First and Last Rows
Use the `head()` and `tail()` methods to view the first and last rows of a DataFrame. Both return 5 rows by default and accept the number of rows as an argument.

In [23]:
# Viewing the first 3 rows
print("First 3 rows:")
print(df.head(3))

# Viewing the last 2 rows
print("\nLast 2 rows:")
print(df.tail(2))

First 3 rows:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

Last 2 rows:
      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


### 3.2. Summary Statistics
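The `describe()` method provides summary statistics of a DataFrame. By default it covers only the numerical columns and reports the count, mean, standard deviation, minimum, quartiles, and maximum.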

In [24]:
# Summary statistics of numerical columns
print("Summary statistics:")
print(df.describe())

Summary statistics:
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
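
To summarize the text columns as well, `describe()` accepts `include='all'`. The sketch below rebuilds the example DataFrame under the assumed name `df_example` so that it runs on its own.

```python
import pandas as pd

df_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# include='all' adds count/unique/top/freq rows for non-numeric columns
full_summary = df_example.describe(include='all')
print(full_summary)

# shape gives (number of rows, number of columns)
print(df_example.shape)
```

Statistics that do not apply to a column, such as the mean of `Name`, appear as `NaN` in the combined table.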


### 3.3. Data Types
Use the `dtypes` attribute to view the data type of each column in a DataFrame. In the output below, `object` is the dtype Pandas reports for the text columns.

In [25]:
# Data types of each column
print("Data types of columns:")
print(df.dtypes)

Data types of columns:
Name    object
Age      int64
City    object
dtype: object
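
When a column has the wrong type, `astype()` can convert it. This is a minimal sketch, assuming a hypothetical DataFrame `df_raw` in which ages were read in as strings.

```python
import pandas as pd

# Hypothetical data: ages stored as strings, as can happen after reading text files
df_raw = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': ['25', '30']})

# astype returns a new DataFrame; the original is left unchanged
df_typed = df_raw.astype({'Age': 'int64'})

print(df_typed.dtypes)
print(df_typed['Age'].sum())
```

Converting first matters because arithmetic on a string column concatenates or fails instead of adding numbers.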


## 4. Selecting Data

Selecting data from a DataFrame is a crucial operation for data analysis. Pandas provides various ways to select data based on labels or positions.

### 4.1. Selecting Columns
You can select columns from a DataFrame by using column names.

In [26]:
# Selecting a single column
print("Name column:")
print(df['Name'])

# Selecting multiple columns
print("\nName and City columns:")
print(df[['Name', 'City']])

Name column:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Name and City columns:
      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago


### 4.2. Selecting Rows
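You can select rows using the `loc` and `iloc` indexers, which are used with square brackets rather than called like methods. `loc` is label-based, while `iloc` is based on integer position. With the default index (0, 1, 2, ...) the label and the position of a row are the same number.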

In [27]:
# Selecting a row by label
print("Row with label 1:")
print(df.loc[1])

# Selecting a row by index position
print("\nRow with index position 2:")
print(df.iloc[2])

Row with label 1:
Name            Bob
Age              30
City    Los Angeles
Name: 1, dtype: object

Row with index position 2:
Name    Charlie
Age          35
City    Chicago
Name: 2, dtype: object
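
The difference between `loc` and `iloc` is easier to see when labels and positions do not coincide. The sketch below assumes a hypothetical DataFrame `df_labeled` with index labels 10, 20, and 30.

```python
import pandas as pd

# Hypothetical index labels chosen so that labels and positions differ
df_labeled = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],
)

row_by_label = df_labeled.loc[20]      # row whose index label is 20
row_by_position = df_labeled.iloc[0]   # first row, whatever its label

# loc slices include the end label; iloc slices exclude the end position
loc_slice = df_labeled.loc[10:20]
iloc_slice = df_labeled.iloc[0:1]

print(row_by_label['Name'], row_by_position['Name'])
print(len(loc_slice), len(iloc_slice))
```

Note the slicing asymmetry: `loc[10:20]` returns two rows, while `iloc[0:1]` returns one.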


### 4.3. Conditional Selection
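You can select rows based on conditions applied to column values. The condition produces a boolean Series, and indexing the DataFrame with it keeps only the rows where the value is `True`.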

In [28]:
# Selecting rows where Age > 25
print("Rows where Age > 25:")
print(df[df['Age'] > 25])

Rows where Age > 25:
      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
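
Conditions can also be combined. The following sketch, using an assumed copy of the example data named `df_people`, shows the `&` operator, `isin()`, and filtering rows while selecting a column with `loc`.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
over_25_not_chicago = df_people[
    (df_people['Age'] > 25) & (df_people['City'] != 'Chicago')
]

# isin tests membership in a list of values
in_selected_cities = df_people[df_people['City'].isin(['New York', 'Chicago'])]

# loc can filter rows and pick a column in one step
names_over_25 = df_people.loc[df_people['Age'] > 25, 'Name']

print(over_25_not_chicago)
print(in_selected_cities)
print(names_over_25)
```

The Python keywords `and` and `or` do not work element-wise on a Series, which is why the bitwise operators are used here.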


## 5. Data Cleaning

Data cleaning is an essential step in data analysis. Pandas provides various methods to handle missing values, duplicate entries, and other data issues.

### 5.1. Handling Missing Values
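You can detect and handle missing values using methods like `isna()`, `dropna()`, and `fillna()`. `dropna()` and `fillna()` return new DataFrames and leave the original unchanged. In the output below, `Age` is displayed as a float because a missing value (`NaN`) forces the integer column to `float64`.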

In [29]:
# Creating a DataFrame with missing values
df_with_nan = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
})
print("DataFrame with NaN values:")
print(df_with_nan)

# Detecting missing values
print("\nMissing values in DataFrame:")
print(df_with_nan.isna())

# Dropping rows with missing values
df_dropped = df_with_nan.dropna()
print("\nDataFrame after dropping missing values:")
print(df_dropped)

# Filling missing values
df_filled = df_with_nan.fillna({'Age': df_with_nan['Age'].mean(), 'City': 'Unknown'})
print("\nDataFrame after filling missing values:")
print(df_filled)

DataFrame with NaN values:
      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0         None
3     None  40.0      Chicago

Missing values in DataFrame:
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False

DataFrame after dropping missing values:
    Name   Age      City
0  Alice  25.0  New York

DataFrame after filling missing values:
      Name        Age         City
0    Alice  25.000000     New York
1      Bob  33.333333  Los Angeles
2  Charlie  35.000000      Unknown
3     None  40.000000      Chicago
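
Dropping every row that has any missing value can discard a lot of data, as the single surviving row above shows. A gentler approach, sketched here with an assumed copy of the data named `df_missing`, is to count the gaps per column and drop rows only when a specific column is missing.

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
    'City': ['New York', 'Los Angeles', None, 'Chicago'],
})

# Count the missing values in each column
missing_counts = df_missing.isna().sum()

# Drop a row only if its 'Name' is missing
named_only = df_missing.dropna(subset=['Name'])

print(missing_counts)
print(named_only)
```

Here only the last row is removed, and the remaining gaps in `Age` and `City` can then be filled with `fillna()`.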


### 5.2. Removing Duplicates
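You can remove duplicate rows using the `drop_duplicates()` method. By default it compares all columns, keeps the first occurrence, and preserves the original index labels, which is why the result below is labeled 0, 1, 3.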

In [30]:
# Creating a DataFrame with duplicate rows
df_with_duplicates = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago']
})
print("DataFrame with duplicates:")
print(df_with_duplicates)

# Removing duplicate rows
df_no_duplicates = df_with_duplicates.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

DataFrame with duplicates:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2    Alice   25     New York
3  Charlie   35      Chicago

DataFrame after removing duplicates:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
3  Charlie   35      Chicago
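
Duplicates can also be defined on a subset of columns. The sketch below assumes a hypothetical DataFrame `df_dupes` in which the second `Alice` row has a different age, so the two rows are duplicates by name only.

```python
import pandas as pd

df_dupes = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 26, 35],  # hypothetical: the second Alice row differs in Age
})

# duplicated() flags rows that repeat an earlier row on the chosen columns
flags = df_dupes.duplicated(subset=['Name'])

# keep='last' retains the final occurrence of each Name
last_per_name = df_dupes.drop_duplicates(subset=['Name'], keep='last')

# reset_index(drop=True) renumbers the rows from 0
renumbered = last_per_name.reset_index(drop=True)

print(flags)
print(renumbered)
```

Checking `duplicated()` before dropping is a cheap way to see which rows would be removed.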


## 6. Data Aggregation

Data aggregation involves summarizing and grouping data based on specific criteria. Pandas provides powerful methods for aggregation and grouping.

### 6.1. Grouping Data
You can group data using the `groupby()` method and then perform aggregate functions like sum, mean, and count.

In [31]:
# Creating a DataFrame for grouping
df_grouping = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
})
print("DataFrame for grouping:")
print(df_grouping)

# Grouping by 'Name' and 'City' and calculating mean age
grouped = df_grouping.groupby(['Name', 'City']).mean()
print("\nGrouped Data (mean age by Name and City):")
print(grouped)

DataFrame for grouping:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2    Alice   25     New York
3  Charlie   35      Chicago
4      Bob   30  Los Angeles

Grouped Data (mean age by Name and City):
                      Age
Name    City             
Alice   New York     25.0
Bob     Los Angeles  30.0
Charlie Chicago      35.0
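
Grouping by a single key works the same way, but it helps to select the column to aggregate first. In pandas 2.0 and later, calling `mean()` on groups that still contain a text column is expected to raise a `TypeError`. The sketch below uses an assumed copy of the data named `df_groups`.

```python
import pandas as pd

df_groups = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})

# Selecting 'Age' before aggregating keeps text columns out of the calculation
mean_age_by_name = df_groups.groupby('Name')['Age'].mean()

# as_index=False returns the group keys as ordinary columns
flat = df_groups.groupby(['Name', 'City'], as_index=False)['Age'].mean()

print(mean_age_by_name)
print(flat)
```

The `as_index=False` form is often easier to pass on to later steps because the result has no multi-level index.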


### 6.2. Aggregation Functions
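You can apply a single aggregation such as `sum()`, `mean()`, or `count()` to grouped data, or use `agg()` to apply several at once. Passing a list of functions for a column produces a two-level column header, as in the output below.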

In [32]:
# Aggregating data with multiple functions
aggregation = df_grouping.groupby('City').agg({
    'Age': ['mean', 'sum', 'count']
})
print("\nAggregated Data (mean, sum, count of Age by City):")
print(aggregation)


Aggregated Data (mean, sum, count of Age by City):
              Age          
             mean sum count
City                       
Chicago      35.0  35     1
Los Angeles  30.0  60     2
New York     25.0  50     2
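
If a flat header is preferred, `agg()` also supports named aggregation, where each keyword becomes an output column. This is a minimal sketch. The output names `mean_age`, `total_age`, and `people` and the variable `df_groups` are assumptions chosen for illustration.

```python
import pandas as pd

df_groups = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [25, 30, 25, 35, 30],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles'],
})

# Named aggregation: output_column=(input_column, function)
agg_named = df_groups.groupby('City').agg(
    mean_age=('Age', 'mean'),
    total_age=('Age', 'sum'),
    people=('Age', 'count'),
)

print(agg_named)
```

The values match the table above, but each column has a single descriptive name instead of an (`Age`, function) pair.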


## 7. Merging and Joining DataFrames

Merging and joining DataFrames are essential operations for combining datasets based on common columns or indices.

### 7.1. Merging DataFrames
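You can merge DataFrames using the `pd.merge()` function (or the equivalent `DataFrame.merge()` method), which works like a SQL join. With `how='inner'`, only keys present in both DataFrames are kept, so IDs 3 and 4 do not appear in the result below.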

In [33]:
# Creating DataFrames for merging
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 40]
})
print("DataFrame df1:")
print(df1)
print("\nDataFrame df2:")
print(df2)

# Merging DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print("\nMerged DataFrame:")
print(merged_df)

DataFrame df1:
   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie

DataFrame df2:
   ID  Age
0   1   25
1   2   30
2   4   40

Merged DataFrame:
   ID   Name  Age
0   1  Alice   25
1   2    Bob   30


### 7.2. Joining DataFrames
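DataFrames can be joined with the `join()` method, which matches rows on the index by default. In the example below, `how='left'` keeps every row of `df3`, so ID 3 appears with a missing `Salary` and ID 4 is dropped. `Salary` becomes a float column because `NaN` cannot be stored in a default integer column.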

In [34]:
# Creating DataFrames for joining
df3 = pd.DataFrame({
    'ID': [1, 2, 3],
    'City': ['New York', 'Los Angeles', 'Chicago']
}).set_index('ID')
df4 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Salary': [70000, 80000, 90000]
}).set_index('ID')
print("DataFrame df3:")
print(df3)
print("\nDataFrame df4:")
print(df4)

# Joining DataFrames on index
joined_df = df3.join(df4, how='left')
print("\nJoined DataFrame:")
print(joined_df)

DataFrame df3:
           City
ID             
1      New York
2   Los Angeles
3       Chicago

DataFrame df4:
    Salary
ID        
1    70000
2    80000
4    90000

Joined DataFrame:
           City   Salary
ID                      
1      New York  70000.0
2   Los Angeles  80000.0
3       Chicago      NaN
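
An inner join on the index avoids the missing value. The sketch below uses assumed names `cities` and `salaries` for the two tables and also shows the same operation written with `pd.merge()` on the indices.

```python
import pandas as pd

cities = pd.DataFrame({
    'ID': [1, 2, 3],
    'City': ['New York', 'Los Angeles', 'Chicago'],
}).set_index('ID')
salaries = pd.DataFrame({
    'ID': [1, 2, 4],
    'Salary': [70000, 80000, 90000],
}).set_index('ID')

# how='inner' keeps only index labels present in both DataFrames
inner_joined = cities.join(salaries, how='inner')

# The same result expressed with merge on the two indices
via_merge = pd.merge(cities, salaries, left_index=True, right_index=True, how='inner')

print(inner_joined)
print(via_merge)
```

Because no values are missing after the inner join, `Salary` keeps its integer values.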


## Conclusion

This notebook has covered the basics of Pandas for data manipulation and analysis: creating `Series` and `DataFrame` objects, then inspecting, selecting, cleaning, aggregating, merging, and joining data. These skills are a foundation for data science work and can be extended with more complex operations and functions.