# Pandas Introduction

Pandas is a Python library for data manipulation and analysis. Its two core data structures, Series and DataFrame, hold one-dimensional and two-dimensional labeled data.

In [2]:
import pandas as pd
import numpy as np

print("Pandas imported successfully!")
print(f"Pandas version: {pd.__version__}")

Pandas imported successfully!
Pandas version: 2.3.0


## Pandas Series

A Series is a one-dimensional labeled array that can hold any data type.

In [3]:
# Creating a Series
s = pd.Series([1, 3, 5, 6, 8])
print("Series:")
print(s)
print(f"Type: {type(s)}")

# Series with custom index
s_custom = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("\nSeries with custom index:")
print(s_custom)

# Accessing elements by index label (s uses the default integer labels 0..4)
print(f"s[0]: {s[0]}")
print(f"s_custom['b']: {s_custom['b']}")

# Basic operations
print(f"Mean: {s.mean()}")
print(f"Sum: {s.sum()}")
print(f"Max: {s.max()}")

Series:
0    1
1    3
2    5
3    6
4    8
dtype: int64
Type: <class 'pandas.core.series.Series'>

Series with custom index:
a    10
b    20
c    30
dtype: int64
s[0]: 1
s_custom['b']: 20
Mean: 4.6
Sum: 23
Max: 8
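
The cell above accesses elements with plain square brackets. As a minimal sketch of what "labeled" means in practice, the example below contrasts label-based and position-based access and shows that arithmetic aligns on labels. The Series `other` and its values are made up for illustration.

```python
import pandas as pd

# A Series with string labels, mirroring s_custom above
s_custom = pd.Series([10, 20, 30], index=["a", "b", "c"])

# .loc selects by label, .iloc selects by integer position
by_label = s_custom.loc["b"]
by_position = s_custom.iloc[1]
print(by_label, by_position)

# Arithmetic aligns on labels; a label missing from either side yields NaN
other = pd.Series([1, 2, 3], index=["b", "c", "d"])  # hypothetical data
total = s_custom + other
print(total)
```

Because of the NaN entries, the result is stored as floating point even though both inputs hold integers.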


## Pandas DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

In [4]:
# Creating a DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
}
df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print(f"Type: {type(df)}")

# Basic info
print(f"\nShape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Index: {list(df.index)}")

# Accessing columns
print(f"\nNames: {df['Name'].tolist()}")
print(f"Ages: {df['Age'].tolist()}")

# Accessing rows: iloc selects by integer position, loc selects by index label
print(f"\nFirst row:\n{df.iloc[0]}")
print(f"Row with index 1:\n{df.loc[1]}")

# Summary statistics
print(f"\nAge statistics:\n{df['Age'].describe()}")

DataFrame:
      Name  Age     City
0    Alice   25      NYC
1      Bob   30       LA
2  Charlie   35  Chicago
Type: <class 'pandas.core.frame.DataFrame'>

Shape: (3, 3)
Columns: ['Name', 'Age', 'City']
Index: [0, 1, 2]

Names: ['Alice', 'Bob', 'Charlie']
Ages: [25, 30, 35]

First row:
Name    Alice
Age        25
City      NYC
Name: 0, dtype: object
Row with index 1:
Name    Bob
Age      30
City     LA
Name: 1, dtype: object

Age statistics:
count     3.0
mean     30.0
std       5.0
min      25.0
25%      27.5
50%      30.0
75%      32.5
max      35.0
Name: Age, dtype: float64
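
Beyond selecting whole rows or columns, a common next step is selecting rows by condition. The following is a small sketch using the same three-row DataFrame; the threshold of 30 and the column name `IsOver30` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["NYC", "LA", "Chicago"],
})

# Boolean mask: keep rows where Age is at least 30
older = df[df["Age"] >= 30]
print(older)

# .loc with a mask and a column label selects rows and a column together
names_of_older = df.loc[df["Age"] >= 30, "Name"].tolist()
print(names_of_older)

# assign returns a new DataFrame with the extra column; df itself is unchanged
df_with_flag = df.assign(IsOver30=df["Age"] > 30)
print(df_with_flag)
```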


## Reading Data

Pandas can read data from file formats such as CSV, Excel, and JSON, as well as from SQL databases.

In [7]:
# Reading CSV file
df_csv = pd.read_csv('../data/sample_data.csv')
print("Data from CSV:")
print(df_csv)
print(f"\nData types:\n{df_csv.dtypes}")

# Reading with options
df_csv_custom = pd.read_csv('../data/sample_data.csv', index_col=0)
print("\nCSV with custom index:")
print(df_csv_custom)

# Creating sample JSON data
import json
sample_data = [
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Bob", "age": 30, "city": "LA"}
]
with open('../data/sample.json', 'w') as f:
    json.dump(sample_data, f)

# Reading JSON
df_json = pd.read_json('../data/sample.json')
print("\nData from JSON:")
print(df_json)

Data from CSV:
      name  age         city  salary
0    Alice   25     New York   50000
1      Bob   30  Los Angeles   60000
2  Charlie   35      Chicago   70000
3    Diana   28      Houston   55000
4      Eve   32      Phoenix   65000

Data types:
name      object
age        int64
city      object
salary     int64
dtype: object

CSV with custom index:
         age         city  salary
name                             
Alice     25     New York   50000
Bob       30  Los Angeles   60000
Charlie   35      Chicago   70000
Diana     28      Houston   55000
Eve       32      Phoenix   65000

Data from JSON:
    name  age city
0  Alice   25  NYC
1    Bob   30   LA
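
The cell above depends on files under `../data/`. For experimenting without those files, the readers also accept file-like objects. The sketch below assumes a two-row CSV string (`csv_text`, made up to resemble the sample data) and reads it from memory.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the contents of sample_data.csv
csv_text = """name,age,city,salary
Alice,25,New York,50000
Bob,30,Los Angeles,60000
"""

# read_csv accepts any file-like object, not only a path
df_mem = pd.read_csv(io.StringIO(csv_text))
print(df_mem)

# usecols limits which columns are parsed; index_col picks the row labels
df_subset = pd.read_csv(
    io.StringIO(csv_text), usecols=["name", "salary"], index_col="name"
)
print(df_subset)

# Round trip through JSON text: one JSON object per row
json_text = df_mem.to_json(orient="records")
df_back = pd.read_json(io.StringIO(json_text), orient="records")
print(df_back)
```

Wrapping the text in `io.StringIO` keeps the example self-contained and avoids writing temporary files.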


## Data Cleaning

Data cleaning involves handling missing values, duplicates, and data type conversions to prepare data for analysis.

In [8]:
# Data Cleaning Examples
import pandas as pd
import numpy as np

# Create a DataFrame with missing values and duplicates
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Eve'],
    'Age': [25, np.nan, 35, 25, 32],
    'Salary': [50000, 60000, np.nan, 50000, 65000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Check for missing values
print(f"\nMissing values:\n{df.isnull().sum()}")

# Fill missing values
df_filled = df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].median()})
print("\nDataFrame after filling missing values:")
print(df_filled)

# Remove duplicates; .copy() makes the result an independent DataFrame
df_no_duplicates = df_filled.drop_duplicates().copy()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

# Convert Age to int; astype(int) truncates the fraction (29.25 becomes 29)
df_no_duplicates['Age'] = df_no_duplicates['Age'].astype(int)
print("\nDataFrame with Age as int:")
print(df_no_duplicates)
print(f"\nData types:\n{df_no_duplicates.dtypes}")

Original DataFrame:
      Name   Age   Salary
0    Alice  25.0  50000.0
1      Bob   NaN  60000.0
2  Charlie  35.0      NaN
3    Alice  25.0  50000.0
4      Eve  32.0  65000.0

Missing values:
Name      0
Age       1
Salary    1
dtype: int64

DataFrame after filling missing values:
      Name    Age   Salary
0    Alice  25.00  50000.0
1      Bob  29.25  60000.0
2  Charlie  35.00  55000.0
3    Alice  25.00  50000.0
4      Eve  32.00  65000.0

DataFrame after removing duplicates:
      Name    Age   Salary
0    Alice  25.00  50000.0
1      Bob  29.25  60000.0
2  Charlie  35.00  55000.0
4      Eve  32.00  65000.0

DataFrame with Age as int:
      Name  Age   Salary
0    Alice   25  50000.0
1      Bob   29  60000.0
2  Charlie   35  55000.0
4      Eve   32  65000.0

Data types:
Name       object
Age         int64
Salary    float64
dtype: object


Note: without an explicit `.copy()` after `drop_duplicates()`, the column assignment `df_no_duplicates['Age'] = df_no_duplicates['Age'].astype(int)` emits a `SettingWithCopyWarning`:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
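
One way to sidestep assignment on a derived DataFrame altogether is to chain the cleaning steps, since each method returns a new object. This is a sketch of that style on the same data, not the only reasonable approach. The final `reset_index` call is an addition that the cell above does not perform.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Eve"],
    "Age": [25, np.nan, 35, 25, 32],
    "Salary": [50000, 60000, np.nan, 50000, 65000],
})

# Each step returns a new DataFrame, so no column is assigned on a derived frame
cleaned = (
    df.fillna({"Age": df["Age"].mean(), "Salary": df["Salary"].median()})
    .drop_duplicates()
    .astype({"Age": int})  # truncates the filled value 29.25 to 29
    .reset_index(drop=True)  # renumber rows 0..3 after dropping the duplicate
)
print(cleaned)
print(cleaned.dtypes)
```

If truncation is not wanted, calling `.round()` on the column before the conversion is one alternative.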
