## Understanding the Pandas Library

Pandas is an open-source Python library for data manipulation and analysis. Its two core data structures, `Series` (a one-dimensional labeled array) and `DataFrame` (a two-dimensional labeled table), are designed for handling structured data efficiently.

In [1]:
# Import the pandas library
import pandas as pd
import numpy as np
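
To make the two structures concrete, here is a minimal sketch of how a `Series` relates to a `DataFrame`. The names `s` and `frame` and the sample values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# A Series is a one-dimensional array with an index of labels
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='score')

# A DataFrame is a table whose columns are Series sharing one index
frame = pd.DataFrame({'score': s, 'passed': s >= 20})

print(s['b'])                # label-based access on the Series
print(type(frame['score']))  # selecting one column returns a Series
print(frame)
```

Selecting a single column from a DataFrame gives back a `Series`, which is why many methods work the same way on both.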

### Basic Pandas Functions

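Let's start with some basic functions for inspecting a pandas DataFrame. Note that `describe()` summarizes only numeric columns by default, which is why `col2` does not appear in its output.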

In [2]:
# Creating a simple DataFrame
data = {'col1': [1, 2, 3, 4],
        'col2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Get information about the DataFrame
print("\nDataFrame Info:")
df.info()

# Get descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())

# Get the shape of the DataFrame
print("\nShape of the DataFrame:", df.shape)

# Get the columns of the DataFrame
print("\nColumns of the DataFrame:", df.columns)

Original DataFrame:
   col1 col2
0     1    A
1     2    B
2     3    C
3     4    D

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   col1    4 non-null      int64 
 1   col2    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 196.0+ bytes

Descriptive Statistics:
           col1
count  4.000000
mean   2.500000
std    1.290994
min    1.000000
25%    1.750000
50%    2.500000
75%    3.250000
max    4.000000

Shape of the DataFrame: (4, 2)

Columns of the DataFrame: Index(['col1', 'col2'], dtype='object')
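
The descriptive statistics above cover only `col1`, because `describe()` skips non-numeric columns by default. As a small sketch (the name `df_demo` is an illustrative copy of the DataFrame above), passing `include='all'` asks for a summary of every column, and `dtypes` and `head()` give two more quick views:

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                        'col2': ['A', 'B', 'C', 'D']})

# Data type of each column
print(df_demo.dtypes)

# First two rows only
first_two = df_demo.head(2)
print(first_two)

# Summary of all columns; entries that do not apply to a column are NaN
summary = df_demo.describe(include='all')
print(summary)
```

For the text column, the summary reports values such as the number of unique entries instead of a mean or standard deviation.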


### Converting Arrays to DataFrames

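A two-dimensional NumPy array converts directly into a pandas DataFrame: each row of the array becomes a row of the DataFrame, and the `columns` argument supplies the column labels.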

In [3]:
# Create a NumPy array
np_array = np.array([[10, 20, 30], [40, 50, 60]])

# Convert NumPy array to DataFrame
df_from_array = pd.DataFrame(np_array, columns=['ColA', 'ColB', 'ColC'])

# Display the new DataFrame
print("DataFrame from NumPy array:")
print(df_from_array)

DataFrame from NumPy array:
   ColA  ColB  ColC
0    10    20    30
1    40    50    60
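
A few variations on the conversion, sketched under the assumption that the same 2x3 array is used. The row labels `r1` and `r2` are invented for the example.

```python
import numpy as np
import pandas as pd

arr = np.array([[10, 20, 30], [40, 50, 60]])

# Without `columns`, pandas assigns integer column labels 0, 1, 2
df_default = pd.DataFrame(arr)
print(df_default)

# Row labels can be supplied through `index`
df_labeled = pd.DataFrame(arr, columns=['ColA', 'ColB', 'ColC'],
                          index=['r1', 'r2'])
print(df_labeled)

# to_numpy() converts the DataFrame back into a NumPy array
round_trip = df_labeled.to_numpy()
print(round_trip)
```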


### Indexing and Slicing in DataFrames

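Pandas selects data by label with `.loc` and by integer position with `.iloc`. With the default `RangeIndex` used here, row labels and positions happen to coincide, so the two can look interchangeable; they differ as soon as the index is reordered or uses non-integer labels.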

In [4]:
# Select a single column
print("Selecting a single column ('col1'):")
print(df['col1'])

# Select multiple columns
print("\nSelecting multiple columns ('col1', 'col2'):")
print(df[['col1', 'col2']])

# Select rows by label using .loc
print("\nSelecting rows by label (index 0 and 2) using .loc:")
print(df.loc[[0, 2]])

# Select rows by integer position using .iloc
print("\nSelecting rows by integer position (index 1 and 3) using .iloc:")
print(df.iloc[[1, 3]])

# Select a specific cell using .loc
print("\nSelecting a specific cell (row 1, col 'col2') using .loc:")
print(df.loc[1, 'col2'])

# Select a specific cell using .iloc
print("\nSelecting a specific cell (row 0, col 0) using .iloc:")
print(df.iloc[0, 0])

# Slicing rows by position (the end of the slice is exclusive, so 1:4 returns rows 1 to 3)
print("\nSlicing rows (from index 1 to 3):")
print(df[1:4])

Selecting a single column ('col1'):
0    1
1    2
2    3
3    4
Name: col1, dtype: int64

Selecting multiple columns ('col1', 'col2'):
   col1 col2
0     1    A
1     2    B
2     3    C
3     4    D

Selecting rows by label (index 0 and 2) using .loc:
   col1 col2
0     1    A
2     3    C

Selecting rows by integer position (index 1 and 3) using .iloc:
   col1 col2
1     2    B
3     4    D

Selecting a specific cell (row 1, col 'col2') using .loc:
B

Selecting a specific cell (row 0, col 0) using .iloc:
1

Slicing rows (from index 1 to 3):
   col1 col2
1     2    B
2     3    C
3     4    D
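
One detail worth seeing in code is that slices behave differently in `.loc` and `.iloc`: label slices include the end label, while position slices exclude the end position. The sketch below also shows boolean-mask filtering, which the cell above does not cover. `df_demo` is an illustrative copy of the DataFrame used above.

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                        'col2': ['A', 'B', 'C', 'D']})

# Label slice: both endpoints are included (labels 1, 2 and 3)
by_label = df_demo.loc[1:3]

# Position slice: the end is excluded (positions 1 and 2)
by_position = df_demo.iloc[1:3]

# Boolean mask: keep rows where col1 is greater than 2, return col2
mask = df_demo['col1'] > 2
filtered = df_demo.loc[mask, 'col2']

print(by_label)
print(by_position)
print(filtered)
```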


### Data Cleaning (Handling Missing Values)

Dealing with missing data (represented as `NaN`) is a common task. Pandas provides `isnull()` to identify missing values, `dropna()` to remove them, and `fillna()` to replace them.

In [5]:
# Create a DataFrame with missing values
data_with_nan = {'col1': [1, 2, np.nan, 4],
                 'col2': ['A', np.nan, 'C', 'D'],
                 'col3': [True, False, True, np.nan]}
df_nan = pd.DataFrame(data_with_nan)

print("DataFrame with missing values:")
print(df_nan)

# Check for missing values
print("\nMissing values per column:")
print(df_nan.isnull().sum())

# Drop rows with any missing values
df_dropped_rows = df_nan.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped_rows)

# Drop columns with any missing values
df_dropped_cols = df_nan.dropna(axis=1)
print("\nDataFrame after dropping columns with missing values:")
print(df_dropped_cols)

# Fill missing values with a specific value
df_filled_value = df_nan.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled_value)

# Fill missing values with the mean of 'col1' (caution: a scalar passed to fillna is written into every column, not only 'col1')
df_filled_mean = df_nan.fillna(df_nan['col1'].mean())
print("\nDataFrame after filling missing values with mean of 'col1':")
print(df_filled_mean)

DataFrame with missing values:
   col1 col2   col3
0   1.0    A   True
1   2.0  NaN  False
2   NaN    C   True
3   4.0    D    NaN

Missing values per column:
col1    1
col2    1
col3    1
dtype: int64

DataFrame after dropping rows with missing values:
   col1 col2  col3
0   1.0    A  True

DataFrame after dropping columns with missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

DataFrame after filling missing values with 0:
   col1 col2   col3
0   1.0    A   True
1   2.0    0  False
2   0.0    C   True
3   4.0    D      0

DataFrame after filling missing values with mean of 'col1':
       col1      col2      col3
0  1.000000         A      True
1  2.000000  2.333333     False
2  2.333333         C      True
3  4.000000         D  2.333333
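
In the last output above, the mean of `col1` was written into every column, including the text column `col2`. A common alternative, sketched here on a reduced two-column copy of the data, is to pass `fillna` a dictionary that maps each column to its own fill value. The placeholder string `'missing'` is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2, np.nan, 4],
                        'col2': ['A', np.nan, 'C', 'D']})

# One fill value per column: the column mean for numbers,
# a placeholder string for text
fill_values = {'col1': df_demo['col1'].mean(), 'col2': 'missing'}
df_filled = df_demo.fillna(fill_values)

print(df_filled)
```

Columns that are not listed in the dictionary are left untouched, so their missing values stay as `NaN`.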


### Data Manipulation (Columns, Sorting, Grouping)

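The cell below adds and drops a column, sorts rows by a column, and aggregates values with `groupby`. Assigning to `df['col3']` modifies `df` in place, whereas `drop()` and `sort_values()` return new DataFrames and leave `df` unchanged.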

In [6]:
# Add a new column
df['col3'] = [True, False, True, False]
print("DataFrame after adding 'col3':")
print(df)

# Drop a column
df_dropped_col = df.drop('col3', axis=1)
print("\nDataFrame after dropping 'col3':")
print(df_dropped_col)

# Sort by values in a column
df_sorted = df.sort_values(by='col1', ascending=False)
print("\nDataFrame sorted by 'col1' descending:")
print(df_sorted)

# Group by a column and calculate the mean
data_for_grouping = {'Category': ['A', 'B', 'A', 'B', 'A'],
                     'Value': [10, 15, 12, 18, 11]}
df_group = pd.DataFrame(data_for_grouping)
print("\nDataFrame for grouping:")
print(df_group)

grouped_data = df_group.groupby('Category')['Value'].mean()
print("\nMean 'Value' grouped by 'Category':")
print(grouped_data)

DataFrame after adding 'col3':
   col1 col2   col3
0     1    A   True
1     2    B  False
2     3    C   True
3     4    D  False

DataFrame after dropping 'col3':
   col1 col2
0     1    A
1     2    B
2     3    C
3     4    D

DataFrame sorted by 'col1' descending:
   col1 col2   col3
3     4    D  False
2     3    C   True
1     2    B  False
0     1    A   True

DataFrame for grouping:
  Category  Value
0        A     10
1        B     15
2        A     12
3        B     18
4        A     11

Mean 'Value' grouped by 'Category':
Category
A    11.0
B    16.5
Name: Value, dtype: float64
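
Grouping is not limited to a single statistic. As a short sketch using the same grouping data (copied into the illustrative name `df_demo`), `agg` accepts a list of function names and returns one column per statistic:

```python
import pandas as pd

df_demo = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'],
                        'Value': [10, 15, 12, 18, 11]})

# Several aggregations per group in one call
summary = df_demo.groupby('Category')['Value'].agg(['mean', 'min', 'max', 'count'])

print(summary)
```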


### Synthetic Data Generation for Practice

Generating synthetic data is useful for practicing pandas operations without needing a real dataset. Fixing the random seed makes the generated values repeatable between runs.

In [7]:
# Generate synthetic data
np.random.seed(42) # for reproducibility
synthetic_data = {
    'ID': range(1, 101),
    'Age': np.random.randint(18, 65, 100),
    'Salary': np.random.randint(30000, 100000, 100),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], 100),
    'Rating': np.random.uniform(1, 5, 100).round(1)
}
df_synthetic = pd.DataFrame(synthetic_data)

print("First 5 rows of Synthetic DataFrame:")
print(df_synthetic.head())

print("\nInfo of Synthetic DataFrame:")
df_synthetic.info()

First 5 rows of Synthetic DataFrame:
   ID  Age  Salary         City  Rating
0   1   56   32695      Houston     2.5
1   2   46   78190      Chicago     4.4
2   3   32   35258      Houston     2.3
3   4   60   69504      Chicago     1.7
4   5   25   63159  Los Angeles     3.2

Info of Synthetic DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      100 non-null    int64  
 1   Age     100 non-null    int64  
 2   Salary  100 non-null    int64  
 3   City    100 non-null    object 
 4   Rating  100 non-null    float64
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB
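
Once a synthetic dataset exists, the operations from the earlier sections can be practiced on it. The sketch below builds a smaller dataset of its own so that it runs independently. It uses NumPy's `default_rng` generator instead of `np.random.seed`, which is a substitution made for this example, so its values will not match the output above. The names `df_practice`, `mean_salary_by_city`, and `over_40` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # local generator, seeded for repeatability
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']

df_practice = pd.DataFrame({
    'Age': rng.integers(18, 65, 100),            # upper bound is exclusive: 18 to 64
    'Salary': rng.integers(30000, 100000, 100),
    'City': rng.choice(cities, 100),
})

# Mean salary per city, highest first
mean_salary_by_city = (df_practice.groupby('City')['Salary']
                       .mean()
                       .sort_values(ascending=False))

# Rows for people older than 40
over_40 = df_practice[df_practice['Age'] > 40]

print(mean_salary_by_city)
print(len(over_40), "rows with Age > 40")
```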
