## Pandas

- Pandas is a widely used open-source Python library for data manipulation and analysis.
- It is particularly well suited to structured data such as tables, spreadsheets, and time series.

## Key Features of Pandas

**1. Data Structures:**

- Series: One-dimensional labeled array capable of holding any data type.
- DataFrame: Two-dimensional labeled data structure with columns of potentially different types.
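
As a minimal sketch of the two structures above (the names `ages` and `people` and the values are made up for illustration), a Series carries one labeled column of values, while a DataFrame holds several columns that can each have their own dtype:

```python
import pandas as pd

# A Series is a one-dimensional array with an index (labels).
ages = pd.Series([25, 30, 35], index=['alice', 'bob', 'charlie'], name='age')

# A DataFrame is a table whose columns can have different dtypes.
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],   # strings
    'age': [25, 30, 35],                   # integers
    'score': [88.5, 92.0, 79.5],           # floats
})

print(ages['bob'])          # label-based lookup on a Series
print(people.dtypes)        # one dtype per column
print(type(people['age']))  # each column of a DataFrame is itself a Series
```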

**2. Data Handling:**

- Data Loading: Read data from CSV, Excel, SQL databases, JSON, and more.
- Data Cleaning: Handle missing data, duplicate rows, and data type conversions.
- Data Transformation: Merge, join, and concatenate datasets; reshape data using pivot tables and stack/unstack functions.
- Data Analysis: Perform group operations using groupby; apply statistical functions like mean, median, and mode.
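
The cleaning and analysis steps listed above can be sketched on a small hypothetical table (`sales`, with invented cities and prices). This is one possible pipeline under those assumptions, not the only way to combine the calls:

```python
import pandas as pd

# Hypothetical data: one duplicate row, one missing price, prices stored as text.
sales = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Rome', 'Rome', 'Rome'],
    'price': ['10', '10', '7', None, '9'],
})

cleaned = (
    sales.drop_duplicates()            # remove the repeated Paris row
         .dropna(subset=['price'])     # remove the row with no price
         .astype({'price': 'int64'})   # convert the text prices to integers
)

# Group by city and compute the mean price within each group.
mean_price = cleaned.groupby('city')['price'].mean()
print(mean_price)
```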

**3. Time Series:**

- Time-series functionality such as resampling, frequency conversion, moving-window statistics, and date-range generation.
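
A short sketch of date-range generation, resampling, and a moving-window statistic, using an invented start date and six made-up daily values:

```python
import pandas as pd

# Six daily observations starting on a hypothetical date.
idx = pd.date_range('2024-01-01', periods=6, freq='D')
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Resample to 2-day bins and sum the values inside each bin.
two_day = ts.resample('2D').sum()

# Moving-window mean over 3 observations; the first two results are NaN
# because the window is not yet full.
rolling = ts.rolling(window=3).mean()

print(two_day)
print(rolling)
```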

**4. Visualization:**

- Built-in plotting through `DataFrame.plot` and `Series.plot`, which use Matplotlib by default.

## Common Pandas Operations

### Importing pandas

In [1]:
import pandas as pd
import numpy as np

### Creating a DataFrame

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}
df = pd.DataFrame(data)


In [3]:
df

|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | San Francisco |
| 2 | Charlie | 35 | Los Angeles |


### Reading data

In [4]:
# df = pd.read_csv('file.csv')
# df = pd.read_excel('file.xlsx')
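
The calls above need a file on disk. As a self-contained sketch, the same `pd.read_csv` call can read from an in-memory text buffer instead (the CSV text here is made up). Writing with `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the file is read back:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'file.csv'.
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,San Francisco\n"

# read_csv accepts any file-like object, not only a path.
df_csv = pd.read_csv(io.StringIO(csv_text))

# Write the frame back out without the row index.
buffer = io.StringIO()
df_csv.to_csv(buffer, index=False)

# Reading the written text again gives back the same table.
df_roundtrip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(df_roundtrip)
```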


### Inspecting data

In [5]:
# df.head()       # Display the first 5 rows
# df.tail()       # Display the last 5 rows
# df.info()       # Display a summary of the DataFrame
# df.describe()   # Generate descriptive statistics


In [6]:
df.head()       # Display the first 5 rows

|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | San Francisco |
| 2 | Charlie | 35 | Los Angeles |


In [7]:
df.tail()       # Display the last 5 rows

|   | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | San Francisco |
| 2 | Charlie | 35 | Los Angeles |


In [8]:
df.info()       # Display a summary of the DataFrame

df.describe()   # Generate descriptive statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes


|   | Age |
|---|---|
| count | 3.0 |
| mean | 30.0 |
| std | 5.0 |
| min | 25.0 |
| 25% | 27.5 |
| 50% | 30.0 |
| 75% | 32.5 |
| max | 35.0 |


### Selecting data

In [9]:
# df['Age']         # Select a single column
# df[['Name', 'City']]  # Select multiple columns
# df.iloc[0]        # Select the first row by integer position
# df.loc[0]         # Select the first row by label
# df[df['Age'] > 30]  # Select rows based on a condition


In [10]:
df['Age']         # Select a single column

0    25
1    30
2    35
Name: Age, dtype: int64

In [11]:
df[['Name', 'City']]  # Select multiple columns


|   | Name | City |
|---|---|---|
| 0 | Alice | New York |
| 1 | Bob | San Francisco |
| 2 | Charlie | Los Angeles |


In [12]:
df.iloc[0]        # Select the first row by integer position


Name       Alice
Age           25
City    New York
Name: 0, dtype: object

In [13]:
df.loc[0]         # Select the first row by label


Name       Alice
Age           25
City    New York
Name: 0, dtype: object
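
`iloc[0]` and `loc[0]` return the same row here only because the default index labels happen to be 0, 1, 2. A small sketch with a hypothetical string-labeled frame (`df_named`) shows where the two differ:

```python
import pandas as pd

df_named = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['alice', 'bob', 'charlie'],   # string labels instead of 0, 1, 2
)

by_position = df_named.iloc[0]      # first row, whatever its label is
by_label = df_named.loc['alice']    # the row whose label is 'alice'

print(by_position.equals(by_label))

# With string labels, loc[0] no longer works: there is no label 0.
try:
    df_named.loc[0]
except KeyError:
    print("loc[0] raises KeyError because 0 is not a label in this index")
```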

In [14]:
df[df['Age'] > 30]  # Select rows based on a condition


|   | Name | Age | City |
|---|---|---|---|
| 2 | Charlie | 35 | Los Angeles |
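
Conditions can also be combined. The sketch below rebuilds the same three-row table under a separate name (`df_sel`, chosen here so the example stands alone) and shows a few common filtering patterns:

```python
import pandas as pd

df_sel = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
over_25_not_la = df_sel[(df_sel['Age'] > 25) & (df_sel['City'] != 'Los Angeles')]

# .loc takes a boolean mask for rows and a column label in one step.
names_30_plus = df_sel.loc[df_sel['Age'] >= 30, 'Name']

# isin() tests membership in a list of allowed values.
ny_or_la = df_sel[df_sel['City'].isin(['New York', 'Los Angeles'])]

print(over_25_not_la)
print(names_30_plus)
print(ny_or_la)
```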


### Modifying data

In [15]:
# df['Age'] += 1   # Increment all ages by 1
# df['New_Column'] = df['Age'] * 2  # Create a new column based on existing data
# df.drop(columns=['New_Column'], inplace=True)  # Drop a column
# df.rename(columns={'Age': 'Years'}, inplace=True)  # Rename a column


In [16]:
df['Age'] += 1   # Increment all ages by 1


In [17]:
df['New_Column'] = df['Age'] * 2  # Create a new column based on existing data


In [18]:
df.drop(columns=['New_Column'], inplace=True)  # Drop a column


In [19]:
df.rename(columns={'Age': 'Years'}, inplace=True)  # Rename a column


In [20]:
df

|   | Name | Years | City |
|---|---|---|---|
| 0 | Alice | 26 | New York |
| 1 | Bob | 31 | San Francisco |
| 2 | Charlie | 36 | Los Angeles |
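
The cells above change `df` in place. An alternative, sketched here on a hypothetical two-row frame named `original`, is to chain methods that return a new DataFrame and leave the input untouched:

```python
import pandas as pd

original = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

updated = (
    original.assign(Age=original['Age'] + 1)     # new frame with incremented ages
            .rename(columns={'Age': 'Years'})    # new frame with the column renamed
)

print(original)   # still has the 'Age' column with the old values
print(updated)    # has 'Years' with the incremented values
```

This style makes it easier to rerun a notebook cell, since running it twice does not increment the ages twice.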


### Handling missing data

In [21]:
# df.dropna()      # Remove rows with missing values
# df.fillna(0)     # Fill missing values with 0
# df['Years'] = df['Years'].fillna(df['Years'].mean())  # Fill missing values with the column mean ('Age' was renamed to 'Years' above)


In [22]:
df.dropna()      # Remove rows with missing values


|   | Name | Years | City |
|---|---|---|---|
| 0 | Alice | 26 | New York |
| 1 | Bob | 31 | San Francisco |
| 2 | Charlie | 36 | Los Angeles |


In [23]:
df.fillna(0)     # Fill missing values with 0


|   | Name | Years | City |
|---|---|---|---|
| 0 | Alice | 26 | New York |
| 1 | Bob | 31 | San Francisco |
| 2 | Charlie | 36 | Los Angeles |


In [24]:
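# 'Age' was renamed to 'Years' above; assign the result back instead of calling fillna in place on a column
df['Years'] = df['Years'].fillna(df['Years'].mean())  # Fill missing values with the column mean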
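
The frame used so far has no missing values, so the calls above return it unchanged. A sketch with a hypothetical frame (`df_na`) that does contain a `NaN` makes the effect of each call visible:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Years': [26.0, np.nan, 36.0],     # Bob's value is missing
})

missing_counts = df_na.isna().sum()    # number of missing values per column
dropped = df_na.dropna()               # new frame without Bob's row
filled_zero = df_na.fillna(0)          # new frame with the gap replaced by 0

# Replace the gap with the mean of the non-missing values, assigning back.
df_na['Years'] = df_na['Years'].fillna(df_na['Years'].mean())

print(missing_counts)
print(dropped)
print(filled_zero)
print(df_na)
```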


### Merging data

In [25]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
df_merged = pd.merge(df1, df2, on='key', how='inner')  # Merge on 'key' column
df_merged

|   | key | value1 | value2 |
|---|---|---|---|
| 0 | A | 1 | 4 |
| 1 | B | 2 | 5 |