
# Chapter 1: Introduction to Pandas

## Welcome to Pandas

Pandas is a powerful Python library for data analysis and manipulation. It's built on top of the NumPy library and provides easy-to-use data structures and data analysis tools. This chapter will introduce you to the basics of Pandas, making it perfect for beginners. Let's dive in!

### What is Pandas?

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools. It's particularly suited for working with structured data (similar to SQL tables or Excel spreadsheets) where we want to perform complex data manipulations and analysis easily.

### Why Use Pandas?

- **Data manipulation**: Easily transform and manipulate large datasets.
- **Data analysis**: Perform complex data analysis tasks with simple commands.
- **Data cleaning**: Quickly clean messy datasets, filling missing values, or dropping rows or columns.
- **Data visualization**: Although Pandas is not primarily a data visualization library, it seamlessly integrates with Matplotlib for basic plotting.

### Installation and Setup

Pandas is included in the standard Anaconda installation. If you need to install it separately, you can do so with pip. Run the following command in a Jupyter notebook cell (the leading `!` passes the command to the shell):



In [None]:
!pip install pandas

## Importing Pandas

To start using Pandas, we first need to import it into our Jupyter notebook. It's conventional to import Pandas with the alias `pd`:


In [None]:
import pandas as pd

Additionally, we'll import NumPy since it's often used alongside Pandas for numerical computations:

In [None]:
import numpy as np

# Chapter 2: Data Structures in Pandas

Pandas is built on two fundamental data structures: Series and DataFrame. Understanding these structures is key to mastering data manipulation and analysis with Pandas. In this chapter, we'll explore both Series and DataFrames in detail, covering creation, operations, and practical examples.

## Series

A Series is a one-dimensional array-like object containing a sequence of values (similar to a Python list) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.

### Creating Series

A Series can be created in several ways, including from lists, NumPy arrays, and dictionaries.


In [None]:
# From a list
ser1 = pd.Series([1, 2, 3, 4])
print(ser1)

In [None]:
# From a numpy array
arr = np.array([1, 2, 3, 4])
ser2 = pd.Series(arr)
print(ser2)

In [None]:
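# From a dictionary (keys become the index labels)
# Named `data_dict` to avoid shadowing Python's built-in `dict`
data_dict = {'a': 1, 'b': 2, 'c': 3}
ser3 = pd.Series(data_dict)
print(ser3)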
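
The index described above can also be supplied explicitly when the Series is created. The following is a minimal sketch; the name `temps` and the city labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the labels and values are illustrative only
temps = pd.Series([21.5, 18.0, 25.3], index=['Paris', 'Oslo', 'Cairo'])

print(temps.index)     # the index labels
print(temps.values)    # the underlying NumPy array of values
print(temps['Oslo'])   # look up a value by its label
```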


### Indexing and Selecting Data

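You can access individual elements of a Series with square brackets, much like a NumPy array. Note that the brackets look up index labels: `ser1[0]` works because the default index labels are the integers 0, 1, 2, and so on.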


In [None]:
# Accessing the first element
print(ser1[0])

In [None]:
# Accessing elements by custom index when created from a dictionary
print(ser3['a'])
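
When the index labels are not integers, it can help to say explicitly whether you mean a label or a position. A small sketch using the `.loc` (label) and `.iloc` (position) accessors; the Series `ser` here is a stand-in defined only for this example.

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = ser.loc['b']     # explicit lookup by label
by_position = ser.iloc[1]   # explicit lookup by integer position
print(by_label, by_position)
```

Both lookups reach the same element here, but only because label `'b'` happens to sit at position 1.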


### Operations with Series

Series supports many operations, which can be broadly classified into arithmetic operations, aggregation operations, and boolean operations.


In [None]:
# Arithmetic operations
ser = pd.Series([1, 2, 3, 4])
print(ser + 10)


In [None]:
# Aggregation operations
print(ser.sum())
print(ser.mean())


In [None]:
# Boolean operations
print(ser > 2)
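
The boolean result of a comparison is itself a Series, and it can be passed back into the square brackets to keep only the matching elements. A minimal sketch (the names `mask` and `filtered` are illustrative):

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4])

mask = ser > 2          # boolean Series: False, False, True, True
filtered = ser[mask]    # keeps only the elements where the mask is True
print(filtered)
```

Note that the surviving elements keep their original index labels.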


## DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used pandas object for data analysis.

### Creating DataFrames

DataFrames can be created from various data structures like lists, dictionaries, Series, and even NumPy arrays.


In [None]:
# From a dictionary of series
df1 = pd.DataFrame({'A': ser1, 'B': ser2})
df1

In [None]:
# From a list of dictionaries
lst_of_dicts = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df2 = pd.DataFrame(lst_of_dicts)
df2

In [None]:
# From a numpy array
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])
df3 = pd.DataFrame(numpy_array, columns=['A', 'B', 'C'])
df3

### Data Selection and Indexing

Selecting and indexing in DataFrames allows you to retrieve individual data or subsets of your data.


In [None]:
df1

In [None]:
# Selecting a single column
df1['A']

In [None]:
# Selecting multiple columns
df1[['A', 'B']]


### DataFrame Operations

DataFrames support a wide range of operations, from arithmetic operations to complex aggregations and transformations.


In [None]:
# Arithmetic operations
df1 + 10


In [None]:
# Aggregation
df1.mean()

In [None]:
# Applying functions
df1.apply(np.sqrt)

## Comparison between Series and DataFrame

- **Dimensionality**: Series is one-dimensional, while DataFrame is two-dimensional.
- **Data structure**: Series can be considered as a single column of data, while DataFrame is more like a collection of Series objects put together to form a table.
- **Use case**: Use Series when you need a single array of data with an index, and use DataFrame when you need to represent and manipulate tabular data with rows and columns.


# Chapter 3: Basic Operations with DataFrames

DataFrames are central to data manipulation and analysis in Pandas. This chapter dives deep into the various operations you can perform with DataFrames, starting from viewing and selecting data, to adding and dropping rows or columns, and much more. Understanding these operations is crucial for any data analysis task.

## Viewing Data

One of the first steps in data analysis is to understand the structure and content of your dataset. Pandas provides several methods to get a quick overview of your DataFrame.

### Head and Tail

- `head(n)`: This method returns the first `n` rows of the DataFrame. By default, `n=5`.
- `tail(n)`: Similarly, `tail(n)` returns the last `n` rows.


In [None]:
# Sample DataFrame
df = pd.DataFrame({'A': range(1, 11), 'B': range(11, 21)})


In [None]:
# Viewing the first 3 rows
df.head(3)


In [None]:
# Viewing the last 3 rows
print(df.tail(3))


### Descriptive Statistics

- `describe()`: This method provides a quick overview of the statistical distribution of numerical columns – count, mean, std (standard deviation), min, quartiles, and max.


In [None]:
# Descriptive statistics
print(df.describe())


### Info

- `info()`: Provides a concise summary of the DataFrame, including the number of non-null entries in each column, the data type of each column, and the memory usage.


In [None]:
df.info()

## Data Selection

Selecting specific subsets of data is a common task in data analysis. Pandas provides multiple ways to select and index rows and columns in DataFrames.

### Column Selection

You can select a column using its label, returning a Series, or pass a list of labels to select multiple columns, returning a DataFrame.


In [None]:
# Selecting a single column
series_a = df['A']
print(series_a)


In [None]:
# Selecting multiple columns
df_subset = df[['A', 'B']]
print(df_subset)

### Row Selection

Rows can be selected by position using `iloc` or by label using `loc`.

- `iloc`: Integer-location based indexing for selection by position.
- `loc`: Label-location based indexer for selection by label.


In [None]:
# Selecting a single row by position
row_by_position = df.iloc[0]
print(row_by_position)

In [None]:
# Selecting a row by label
# Note: If your index is integer-based, this will look similar to iloc, but it's label-based.
row_by_label = df.loc[0]
print(row_by_label)
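
The difference between `iloc` and `loc` becomes visible once the index labels no longer match the row positions. The sketch below uses a hypothetical DataFrame, `df_labeled`, with labels 10, 20, and 30:

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30) differ from positions (0, 1, 2)
df_labeled = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

first_by_position = df_labeled.iloc[0]   # first row, regardless of its label
row_by_label = df_labeled.loc[20]        # the row whose label is 20

try:
    df_labeled.loc[0]                    # no row carries the label 0
except KeyError:
    print("loc[0] raises KeyError: 0 is a position here, not a label")
```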

### Slicing

Pandas supports slicing rows or columns using `iloc` and `loc`.


In [None]:
# Slicing rows
rows_slice = df.iloc[1:4]
print(rows_slice)

In [None]:
# Slicing columns
# Note: For `iloc`, columns are also indexed by their integer positions.
columns_slice = df.iloc[:, 0:2]
print(columns_slice)
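
One detail worth knowing when slicing: `iloc` slices exclude the end position (like Python lists), whereas `loc` slices include the end label. A short sketch, rebuilding the sample DataFrame so the example is self-contained:

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': range(11, 21)})

by_position = df.iloc[1:4]        # positions 1, 2, 3 (end excluded)
by_label = df.loc[1:4, 'A':'B']   # labels 1, 2, 3, 4 and columns 'A' through 'B' (ends included)

print(len(by_position), len(by_label))
```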

## Adding and Dropping Columns

### Adding Columns

You can add a new column to a DataFrame simply by assigning it to the DataFrame with a new column label.


In [None]:
# Adding a new column
df['C'] = range(21, 31)


### Dropping Columns

Columns can be removed with the `drop` method by specifying `axis=1`. `drop` returns a new DataFrame and leaves the original unchanged.



In [None]:
# Dropping a column
df_dropped = df.drop('C', axis=1)
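
As a variation, `drop` also accepts a `columns=` keyword, which some readers find clearer than `axis=1`. The sketch below uses a small stand-in DataFrame and checks that the original is left untouched:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Equivalent to df.drop(['B', 'C'], axis=1)
df_dropped = df.drop(columns=['B', 'C'])

print(df.columns.tolist())          # original still has all three columns
print(df_dropped.columns.tolist())  # the copy has only 'A'
```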


## Conditional Selection

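Pandas supports conditional selection: comparing a column to a value produces a boolean Series, which can then be used to keep only the rows where it is `True`.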



In [None]:
# Conditional selection
condition = df['A'] > 5
print(condition)

In [None]:
filtered_df = df[condition]
print(filtered_df)
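
Several conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on the same kind of sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': range(11, 21)})

both = df[(df['A'] > 5) & (df['B'] < 19)]    # rows satisfying both conditions
either = df[(df['A'] < 3) | (df['A'] > 8)]   # rows satisfying at least one
not_small = df[~(df['A'] < 3)]               # rows where the condition is False

print(both)
```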

## Setting and Resetting Index

### Setting Index

You can set one of the columns as an index using `set_index`.


In [None]:
# Setting 'A' as the index
df_with_new_index = df.set_index('A')
df_with_new_index

### Resetting Index

To revert to the default integer index, use `reset_index`.


In [None]:
# Resetting index
df_reset = df_with_new_index.reset_index()
df_reset

# Chapter 4: Data Manipulation and Cleaning with Pandas

In this chapter, we explore data manipulation and cleaning with Pandas, a core part of the daily workflow of data scientists and analysts. The integrity and usability of data directly influence the outcomes of analyses and models, so Pandas provides tools for common data issues: handling missing data, converting data types, renaming and sorting, filtering, and applying functions to prepare data for analysis.

### Handling Missing Data
Missing data is a common occurrence in datasets and needs to be addressed to prevent skewed analyses and inaccurate results. Pandas offers several methods for detecting, removing, and imputing missing values in DataFrames and Series.

#### Detecting Null Values

The first step in handling missing data is detecting its presence. Pandas provides two methods for this: `.isnull()` and `.notnull()`. Both return a boolean mask over the data indicating the presence or absence of missing data.


In [None]:
# Create a DataFrame with missing values
df = pd.DataFrame({'Name': ['Alice', 'Bob', np.nan], 'Age': [24, np.nan, 30], 'Salary': [50000, 60000, 45000]})

# Detect missing values
print(df.isnull())
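
A boolean mask is hard to read on larger datasets, so a common follow-up is to count the missing values per column. Since `True` counts as 1 when summed, chaining `.sum()` does this. A sketch using the same sample data (the names `missing_per_column` and `total_missing` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', np.nan],
                   'Age': [24, np.nan, 30],
                   'Salary': [50000, 60000, 45000]})

missing_per_column = df.isnull().sum()          # one count per column
total_missing = int(missing_per_column.sum())   # count over the whole DataFrame

print(missing_per_column)
print(total_missing)
```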


#### Dropping Null Values

The `.dropna()` method removes rows or columns that contain missing data. Its parameters control which axis is dropped (`axis`) and whether a row or column is dropped when any value is missing or only when all values are missing (`how`).




In [None]:
# Drop any rows with missing values
df_dropna_rows = df.dropna()
print(df_dropna_rows)

In [None]:
# Drop columns with any missing values
df_dropna_cols = df.dropna(axis=1)
print(df_dropna_cols)

In [None]:
# Drop rows where all cells are missing
df_dropna_all = df.dropna(how='all')
print(df_dropna_all)
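
Two further `dropna` parameters may be useful for partially missing data: `subset` restricts the check to certain columns, and `thresh` keeps rows that have at least a given number of non-missing values. A sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', np.nan],
                   'Age': [24, np.nan, 30],
                   'Salary': [50000, 60000, 45000]})

# Drop rows only if 'Age' is missing; other columns are ignored
age_present = df.dropna(subset=['Age'])

# Keep rows that have at least 3 non-missing values
at_least_three = df.dropna(thresh=3)

print(age_present)
print(at_least_three)
```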

#### Filling Null Values

Rather than removing missing data, another approach is to impute missing values using the `.fillna()` method. This can be a specific value, a computed value (e.g., mean, median), or a method like forward-fill or back-fill.


In [None]:
# Fill missing values with a specific value
df_filled = df.fillna(value=0)
print(df_filled)

In [None]:
# Fill missing values in 'Age' with the column mean (the result is a Series)
age_filled_mean = df['Age'].fillna(value=df['Age'].mean())
print(age_filled_mean)

In [None]:
# Forward-fill to propagate the last valid observation forward
# (df.fillna(method='ffill') is deprecated in recent pandas versions)
df_ffill = df.ffill()
print(df_ffill)

In [None]:
# Back-fill to propagate the next valid observation backward
df_bfill = df.bfill()
print(df_bfill)
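
Different columns often call for different fill values. One option, sketched here, is to pass `fillna` a dictionary that maps column names to fill values; the placeholder `'Unknown'` and the choice of the median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', np.nan],
                   'Age': [24, np.nan, 30],
                   'Salary': [50000, 60000, 45000]})

# One fill value per column: a placeholder for names, the median for ages
fill_values = {'Name': 'Unknown', 'Age': df['Age'].median()}
df_filled = df.fillna(value=fill_values)

print(df_filled)
```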

### Type Conversion

Ensuring data is of the correct type is crucial for analyses and computations. Pandas allows for explicit type conversion using the `.astype()` method, catering to the need for numerical computations, string manipulations, or categorical data handling.


In [None]:
# Convert column to string
df['Salary'] = df['Salary'].astype(str)
df

In [None]:
# Convert the column back to integer
df['Salary'] = df['Salary'].astype(int)

In [None]:
# Convert column to category
df['Name'] = df['Name'].astype('category')
df
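
`.astype(int)` raises an error if any value cannot be parsed as a number. For messy input, `pd.to_numeric` with `errors='coerce'` is a more forgiving alternative that turns unparseable entries into missing values. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical raw column containing one entry that is not a number
raw = pd.Series(['50000', '60000', 'n/a'])

numeric = pd.to_numeric(raw, errors='coerce')   # 'n/a' becomes NaN
print(numeric)
```

The missing entry can then be handled with `dropna()` or `fillna()` as described earlier in this chapter.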

### Renaming Columns

Renaming columns in Pandas is straightforward with the `.rename()` method, which accepts a dictionary (or a function) mapping old column names to new ones. By default it returns a new DataFrame; pass `inplace=True` to modify the original instead.


In [None]:
# Rename columns using a dictionary
df_renamed = df.rename(columns={'Age': 'Employee Age', 'Salary': 'Employee Salary'})
print(df_renamed)

In [None]:
# Rename columns in-place
df.rename(columns={'Name': 'Employee Name'}, inplace=True)
print(df)

### Sorting Data

Sorting data can aid in understanding datasets, identifying patterns, or preparing for analysis. The `.sort_values()` method sorts by one or more columns, in ascending (the default) or descending order. Missing values are placed last by default.


In [None]:
# Sort by 'Age' in ascending order
df_sorted_age = df.sort_values(by='Age')
df_sorted_age
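
Sorting in descending order, or by more than one column, uses the `ascending` parameter. The following sketch uses a small hypothetical DataFrame with a `Dept` column that does not appear elsewhere in this chapter:

```python
import pandas as pd

df = pd.DataFrame({'Dept': ['X', 'Y', 'X', 'Y'],
                   'Age': [30, 25, 22, 41]})

# Oldest first
by_age_desc = df.sort_values(by='Age', ascending=False)

# Department A-Z, then oldest first within each department
by_dept_then_age = df.sort_values(by=['Dept', 'Age'], ascending=[True, False])

print(by_age_desc)
print(by_dept_then_age)
```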

### Filtering Data

Filtering is selecting subsets of data based on criteria. This is commonly performed using boolean indexing in Pandas, which involves passing a boolean array to the DataFrame to filter rows.


In [None]:
# Filter rows where Age is greater than 25
df_filtered = df[df['Age'] > 25]
df_filtered

#### Understanding Functions in Pandas
- Functions used with pandas fall into three groups: built-in functions, user-defined functions, and lambda functions.
- Element-wise functions return one result per input value, while aggregation functions reduce many values to a single result.

#### Applying Functions to Series and DataFrame
- `apply()`: Applies a function along an axis of a DataFrame or to the values of a Series.
- `applymap()`: Applies a function element-wise on a DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map()`).
- `map()`: Applies a function element-wise on a Series.
- `agg()`: Applies one or more aggregation functions to a DataFrame or Series.
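
A side-by-side sketch of three of the methods listed above may make the differences concrete. The DataFrame and the variable names are illustrative; `applymap()` is left out because newer pandas versions deprecate it.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# apply(): the function receives each column as a Series and returns one value per column
col_ranges = df.apply(lambda col: col.max() - col.min())

# map(): the function receives each element of a single Series
a_squared = df['A'].map(lambda x: x ** 2)

# agg(): several aggregation functions at once, one result row per function
summary = df.agg(['sum', 'mean'])

print(col_ranges)
print(a_squared)
print(summary)
```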


In [None]:
df

In [None]:
# Defining a simple function to double the value
def double_value(x):
    return x * 2

# Applying function to a column
df['doubled_column'] = df['Age'].apply(double_value)
df.head()


### Introduction to Lambda Functions

Lambda functions in Python are small anonymous functions defined with the `lambda` keyword. They can take any number of arguments but can only have one expression. When working with data manipulation libraries like pandas, lambda functions are useful for applying quick, concise operations over DataFrame columns or rows without writing a separate named function.

A lambda function's syntax is:
```python
lambda arguments: expression
```
The `expression` is evaluated and its result is returned.

### Why Use Lambda Functions in Pandas?

- **Conciseness**: They make your code more concise and readable, especially for simple operations.
- **No Defining Functions**: They eliminate the need for defining a standard function for short, one-off operations.
- **Flexibility**: They are highly flexible and can be used in a variety of situations with pandas objects.


In [None]:
import pandas as pd

# Sample data
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 40],
    'Salary': [50000, 62000, 58000, 75000]
}

# Creating DataFrame
df = pd.DataFrame(data)


Suppose you want to categorize employees based on their salary:

In [None]:
df['Salary_Category'] = df['Salary'].apply(lambda x: 'High' if x > 60000 else 'Medium' if x > 55000 else 'Low')
df

The `map()` function transforms each value in a Series. Paired with a lambda function, it can derive a new column in a single line:

In [None]:
# Label each age as '30+' or 'Under 30'
df['Age_Group'] = df['Age'].map(lambda x: '30+' if x >= 30 else 'Under 30')
df

You can use a lambda function with `apply()` across rows by specifying `axis=1`. This is useful for operations that need values from multiple columns:

In [None]:
df['Custom_Calculation'] = df.apply(lambda row: row['Salary'] / row['Age'], axis=1)
df

# Chapter 5: Advanced DataFrame Operations

### Grouping and Aggregating Data

#### GroupBy Mechanics

GroupBy operations are pivotal for summarizing datasets, enabling you to perform calculations over subsets of data. This process involves one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
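
The three steps above can be sketched by hand to show what `groupby` does behind the scenes. This is only an illustration of the idea on made-up data, not how pandas implements it internally:

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Values': [10, 20, 15, 25]})

results = {}
for name, group in df.groupby('Category'):   # split: one sub-DataFrame per category
    results[name] = group['Values'].sum()    # apply: compute a sum for each group
combined = pd.Series(results)                # combine: gather the results in one Series

print(combined)
```

In practice the single call `df.groupby('Category')['Values'].sum()` produces the same numbers.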


In [None]:
import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Values': [10, 20, 15, 25, 5, 15, 20, 10]
}
df = pd.DataFrame(data)
df

In [None]:
# Grouping by 'Category' and calculating the sum of 'Values'
grouped = df.groupby('Category').sum()
print(grouped)

#### Aggregate Functions

Pandas provides several built-in methods for aggregating data, like `sum()`, `mean()`, `max()`, and `min()`. You can apply multiple aggregation functions at once by using the `.agg()` method.


In [None]:
# Using agg() to apply multiple aggregation functions
agg_functions = df.groupby('Category').agg({'Values': ['mean', 'min', 'max']})

print(agg_functions)
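
Passing a dictionary of lists to `.agg()` produces two-level column names. If flat, descriptive column names are preferred, named aggregation is one alternative. In the sketch below, the output names `total` and `average` are chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Values': [10, 20, 15, 25, 5, 15, 20, 10]
})

# Each keyword is an output column: name=(input column, aggregation function)
summary = df.groupby('Category').agg(
    total=('Values', 'sum'),
    average=('Values', 'mean'),
)

print(summary)
```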


### Pivot Tables and Cross-tabulations

A pivot table summarizes data by grouping rows on one or more keys and aggregating a value column for each group. A cross-tabulation, on the other hand, counts how often each combination of values from two or more variables occurs.

#### Pivot Tables


In [None]:
# Creating a pivot table
pivot_table = df.pivot_table(values='Values', index='Category', aggfunc='mean')

print(pivot_table)
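
Pivot tables become more useful with a second key spread across the columns. The sketch below uses a hypothetical `sales` DataFrame (the regions, quarters, and revenue figures are invented) and the `columns` parameter of `pivot_table`:

```python
import pandas as pd

# Hypothetical sales data, for illustration only
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'Revenue': [100, 150, 80, 120]
})

# One row per region, one column per quarter, revenue summed in each cell
table = sales.pivot_table(values='Revenue', index='Region',
                          columns='Quarter', aggfunc='sum')

print(table)
```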


#### Cross-tabulations

In [None]:
# Cross-tabulation of two factors
cross_tab = pd.crosstab(df['Category'], df['Values'])

print(cross_tab)


### Merging, Joining, and Concatenating DataFrames

Combining datasets is a common operation in data manipulation, and Pandas offers several functions to accomplish this, such as `merge()`, `join()`, and `concat()`.

#### Merging DataFrames


In [None]:
df1 = pd.DataFrame({'Key': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'Key': ['K0', 'K1', 'K2', 'K3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})


In [None]:
df1

In [None]:
df2

In [None]:
# Merging df1 and df2 on 'Key'
merged_df = pd.merge(df1, df2, on='Key')

print(merged_df)
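
In the example above every key appears in both DataFrames, so the type of join makes no difference. When the keys only partly overlap, the `how` parameter decides which rows survive. A sketch with two small hypothetical frames, `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'Key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'Key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

inner = pd.merge(left, right, on='Key')                    # default how='inner': keys in both
outer = pd.merge(left, right, on='Key', how='outer')       # keys from either side
left_join = pd.merge(left, right, on='Key', how='left')    # every key from `left`

print(inner)
print(outer)
print(left_join)
```

Rows without a match on the other side are filled with missing values.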

#### Joining DataFrames

In [None]:
# Joining df1's 'Key' column against df2's index (df2 is re-indexed by 'Key')
joined_df = df1.join(df2.set_index('Key'), on='Key')

print(joined_df)
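
#### Concatenating DataFrames

`concat()` stacks DataFrames along an axis rather than matching rows on a key. The following is a minimal sketch with two small made-up frames, `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Stack rows; ignore_index=True renumbers the result 0..n-1
stacked = pd.concat([top, bottom], ignore_index=True)

# Place the frames side by side, aligning on the index
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked)
print(side_by_side)
```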
