
Import Pandas

What is Pandas?
Pandas is an open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work on structured data seamlessly and efficiently. The two primary data structures in Pandas are Series (one-dimensional) and DataFrame (two-dimensional).
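
As a small illustration of the two structures described above, the sketch below builds one Series and one DataFrame. The variable names (`ages`, `people`) and the sample values are assumptions chosen for this example only.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([28, 24, 35], index=['John', 'Anna', 'Peter'], name='Age')
print(ages)

# A DataFrame is a two-dimensional table; each column is itself a Series.
people = pd.DataFrame({'Age': [28, 24, 35],
                       'City': ['New York', 'Paris', 'Berlin']},
                      index=['John', 'Anna', 'Peter'])
print(people)

# Selecting one column from a DataFrame returns a Series.
age_column = people['Age']
print(type(age_column).__name__)

# Look up a single value in the Series by its label.
print(ages.loc['Anna'])
```

Both structures carry an index, so values can be looked up by label as well as by position.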


In [None]:
# Core Concepts
# DataFrame: A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
# Series: A one-dimensional array-like structure whose labels are called the index.
# Indexing: The process of selecting subsets of data.
# Aggregation: Summarizing data, e.g., calculating mean, median, sum.
# Transformation: Applying functions to modify data, e.g., normalization, scaling.
# Filtering: Extracting data based on specific conditions.
# Merging and Joining: Combining data from multiple DataFrames.

# Use Cases of Pandas in Real-World Applications
# 1. Data Cleaning
# Use Case: A company collects customer feedback forms that contain missing values, inconsistent formats, and unnecessary information.

# 2. Data Analysis
# Use Case: A retail company wants to analyze sales data to understand trends and patterns.

# 3. Time Series Analysis
# Use Case: A financial analyst needs to analyze stock prices over time to forecast future prices.

# Summary:
# Pandas is a powerful tool for data manipulation and analysis in Python. Its ability to handle diverse datasets and
# perform complex operations with minimal code makes it valuable in many real-world applications, from cleaning and
# preparing data to analyzing trends and making predictions.
# These concepts and use cases provide a foundation for data science and analytics.

import pandas as pd

# Create a simple DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)

print(df)


# Create an empty DataFrame with defined columns
empty_df = pd.DataFrame(columns=['Name', 'Age', 'City'])

# Add rows using pd.concat
new_row_1 = pd.DataFrame([{'Name': 'John', 'Age': 28, 'City': 'New York'}])
new_row_2 = pd.DataFrame([{'Name': 'Anna', 'Age': 24, 'City': 'Paris'}])

empty_df = pd.concat([empty_df, new_row_1], ignore_index=True)
empty_df = pd.concat([empty_df, new_row_2], ignore_index=True)

# Add more data using direct assignment
empty_df.loc[2] = ['Peter', 35, 'Berlin']
empty_df.loc[3] = ['Linda', 32, 'London']

print(empty_df)


#pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f'])
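
When all rows are known up front, one alternative to growing a DataFrame step by step is to collect the rows in a plain Python list and build the DataFrame once, since each `pd.concat` call creates a new object. The sketch below also passes custom row labels through the `index` parameter. The names `records` and `people_df` are assumptions for this example.

```python
import pandas as pd

# Collect rows as plain dictionaries first...
records = [
    {'Name': 'John', 'Age': 28, 'City': 'New York'},
    {'Name': 'Anna', 'Age': 24, 'City': 'Paris'},
    {'Name': 'Peter', 'Age': 35, 'City': 'Berlin'},
    {'Name': 'Linda', 'Age': 32, 'City': 'London'},
]

# ...then build the DataFrame once, with custom row labels.
people_df = pd.DataFrame(records, index=['a', 'b', 'c', 'd'])
print(people_df)

# Inspect the data type pandas inferred for each column.
print(people_df.dtypes)
```

The number of index labels must match the number of rows, otherwise pandas raises an error.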



Viewing Data

In [None]:
# Display the first few rows of the DataFrame
print(df.head())

# Display the last few rows of the DataFrame
print(df.tail())

# Display the DataFrame's columns
print(df.columns.tolist())

# Display summary statistics
print(df.describe())


shape = df.shape

print(shape)
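
A few more inspection calls are often useful alongside `head()`, `tail()`, and `describe()`. The sketch below is self-contained and rebuilds the same sample data under the assumed name `df_view`.

```python
import pandas as pd

df_view = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                        'Age': [28, 24, 35, 32],
                        'City': ['New York', 'Paris', 'Berlin', 'London']})

# Data type of each column
print(df_view.dtypes)

# Concise summary: index range, column names, non-null counts, memory usage
df_view.info()

# describe() covers only numeric columns by default;
# include='all' adds count/unique/top/freq for text columns
summary = df_view.describe(include='all')
print(summary)

# head() and tail() accept the number of rows to show
print(df_view.head(2))
```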

Selection and Indexing

In [None]:
# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])

# Select rows by position
print(df.iloc[0:2])

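# Select rows by label
# (df has the default integer index here, so relabel a copy with 'a'..'d' first)
labeled_df = df.set_axis(['a', 'b', 'c', 'd'], axis=0)
print(labeled_df.loc['a':'c', ['Name', 'Age']])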
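
The difference between position-based and label-based slicing is easy to miss: `iloc` excludes the end position, while `loc` includes the end label. The sketch below shows both on a small frame. The name `labeled` and its row labels are assumptions for this example.

```python
import pandas as pd

labeled = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                        'Age': [28, 24, 35, 32]},
                       index=['a', 'b', 'c', 'd'])

# iloc slices by position and excludes the end position: rows 0 and 1
by_position = labeled.iloc[0:2]
print(by_position)

# loc slices by label and includes the end label: rows 'a', 'b', and 'c'
by_label = labeled.loc['a':'c']
print(by_label)

# Single-cell access by label (loc) and by position (iloc)
print(labeled.loc['b', 'Age'])
print(labeled.iloc[1, 1])
```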


Filter

In [None]:
# Filter rows based on a condition
print(df[df['Age'] > 30])

# Filter to keep only the 'Name' and 'Age' columns
filtered_df = df.filter(items=['Name', 'Age'])

print(filtered_df)


# Filter columns whose names contain the substring 'a' (case-sensitive, so only 'Name' matches)
filtered_df = df.filter(like='a', axis=1)

print(filtered_df)

data1 = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Laura'],
    'Age': [28, 24, 35, 32, 45, 22],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Chicago', 'Boston']
}
df = pd.DataFrame(data1, index=['a', 'b', 'c', 'd', 'e', 'f'])

print("Original DataFrame:")
print(df)

# Filter to keep only the rows with index 'b' and 'd'
filtered_df = df.filter(items=['b', 'd'], axis=0)

print("\nFiltered DataFrame:")
print(filtered_df)
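
Row filters can also combine several conditions. The sketch below uses boolean operators, `isin()`, and `query()` on the same six-row sample data, rebuilt here under the assumed name `people` so the example runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'James', 'Laura'],
    'Age': [28, 24, 35, 32, 45, 22],
    'City': ['New York', 'Paris', 'Berlin', 'London', 'Chicago', 'Boston'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
in_range = people[(people['Age'] > 25) & (people['Age'] < 40)]
print(in_range)

# Keep rows whose value appears in a list of allowed values
selected_cities = people[people['City'].isin(['Paris', 'Berlin', 'London'])]
print(selected_cities)

# query() expresses the same kind of condition as a string
over_30 = people.query('Age > 30')
print(over_30)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.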



Transformation

In [None]:
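# Add a new column (df now has six rows, so six values are required;
# the last two salaries are illustrative)
df['Salary'] = [70000, 80000, 120000, 90000, 110000, 60000]
print(df)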

# Drop a column
df = df.drop('Salary', axis=1)
print(df)

# Rename columns
df = df.rename(columns={'Name': 'Full Name'})
print(df)

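# Group data by a column and calculate summary statistics
# (select the numeric 'Age' column; averaging the text column 'Full Name' fails in pandas 2.0 and later)
group_by_city = df.groupby('City')['Age'].mean()
print(group_by_city)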

# Apply multiple aggregation functions
agg_df = df.groupby('City').agg({'Age': ['mean', 'min', 'max']})
print(agg_df)
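
Grouping is easier to see when a group holds more than one row. The following sketch uses hypothetical data (the `staff` frame and its values are assumptions, not part of the dataset above) in which cities repeat.

```python
import pandas as pd

# Hypothetical data in which cities repeat
staff = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda', 'James'],
                      'Age': [28, 24, 35, 32, 45],
                      'City': ['Paris', 'Paris', 'Berlin', 'Berlin', 'Berlin']})

# Mean age per city
mean_age = staff.groupby('City')['Age'].mean()
print(mean_age)

# Several statistics per city at once
stats = staff.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(stats)
```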

# Detect missing values
print(df.isnull())
print(df.isnull().sum())

# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Drop rows with missing values
df = df.dropna()
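
The missing-value calls are easier to follow on data that actually contains gaps. The sketch below uses a hypothetical frame named `raw` with two missing ages and one missing city. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages and one missing city
raw = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                    'Age': [28, np.nan, 35, np.nan],
                    'City': ['New York', 'Paris', None, 'London']})

# Count missing values per column
missing_counts = raw.isnull().sum()
print(missing_counts)

# Fill missing ages with the mean of the known ages: (28 + 35) / 2 = 31.5
filled = raw.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())
print(filled)

# Drop any row that still has a missing value (Peter has no city)
cleaned = filled.dropna()
print(cleaned)
```

Whether to fill or drop depends on the analysis: filling keeps the row but introduces an estimated value, while dropping discards the row's other columns too.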

# Merging
df1 = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                    'Age': [28, 24, 35, 32]})
df2 = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                    'City': ['New York', 'Paris', 'Berlin', 'London']})

# Join the two DataFrames on their shared 'Name' column
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)

# Concatenation
df3 = pd.DataFrame({'Name': ['Laura', 'Tom'],
                    'Age': [22, 34],
                    'City': ['San Francisco', 'Boston']})

# Stack df3 under merged_df; both have the columns Name, Age, and City
concatenated_df = pd.concat([merged_df, df3], ignore_index=True)
print(concatenated_df)
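
When the two DataFrames do not contain exactly the same keys, the `how` parameter of `pd.merge` decides which rows are kept. The sketch below compares inner, left, and outer joins. The frames `ages` and `cities` are assumptions made up for this example.

```python
import pandas as pd

ages = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                     'Age': [28, 24, 35]})
cities = pd.DataFrame({'Name': ['Anna', 'Peter', 'Linda'],
                       'City': ['Paris', 'Berlin', 'London']})

# Inner join (the default): only names present in both frames
inner = pd.merge(ages, cities, on='Name', how='inner')
print(inner)

# Left join: every row from `ages`; City is missing where there is no match
left = pd.merge(ages, cities, on='Name', how='left')
print(left)

# Outer join: all names from either frame
outer = pd.merge(ages, cities, on='Name', how='outer')
print(outer)
```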
