# Introduction to Pandas

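Pandas is a Python library for data manipulation and analysis. Its two core data structures, `Series` (a labeled one-dimensional array) and `DataFrame` (a labeled two-dimensional table), come with functions for loading, cleaning, filtering, and summarizing structured data.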

## Installation

Install pandas using pip:
```bash
pip install pandas
```

Or using conda:
```bash
conda install pandas
```

## Import Pandas and Load Data

In [1]:
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv('../datasets/titanic/titanic_full.csv')
print("Dataset loaded successfully!")
print(f"Shape: {df.shape}")

Dataset loaded successfully!
Shape: (1309, 21)
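
To try the loading step without the Titanic file on disk, `pd.read_csv` also accepts a file-like object. The sketch below is a minimal illustration: the three rows in `csv_text` are made-up sample values (only the column names mirror the dataset), and `sample` is a hypothetical variable name.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration.
csv_text = """PassengerId,Name,Age,Fare
1,Passenger A,22,7.25
2,Passenger B,,71.28
3,Passenger C,26,
"""

# read_csv takes a path or any file-like object, so StringIO works for quick tests.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # empty fields become NaN, which makes Age and Fare float64
```

Empty CSV fields are read as `NaN`, which is why a column of whole-number ages ends up as `float64`, as it does in the full dataset.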


## View First and Last Rows

In [2]:
# First 5 rows
print("First 5 rows:")
print(df.head())

# Last 5 rows
print("\nLast 5 rows:")
print(df.tail())

First 5 rows:
   PassengerId  Survived  Pclass  \
0            1       0.0       3   
1            2       1.0       1   
2            3       1.0       3   
3            4       1.0       1   
4            5       0.0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare  ... Embarked WikiId  \
0      0         A/5 21171   7.2500  ...        S  691.0   
1      0          PC 17599  71.2833  ...        C   90.0   
2      0  STON/O2. 3101282   7.9250  ...        S  865.0   
3      0            113803  53.1000  ...        S  127.0   
4     

## Dataset Info

In [3]:
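# Get dataset information (info() prints its report directly and returns None,
# so wrapping it in print() would add a stray "None" line)
print("Dataset Info:")
df.info()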

# Statistical summary
print("\nStatistical Summary:")
print(df.describe())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  WikiId       1304 non-null   float64
 13  Name_wiki    1304 non-null   object 
 14  Age_wiki     1302 non-null   float64
 15  Hometown     1304 non-null   object 
 16  Boarded      1304 non-null   object 
 17  Destination  1304 non-null   object 
 18  Lifeboat     502 non-null    object 

## Column Selection

In [4]:
# Select single column
print("Age column (first 10):")
print(df['Age'].head(10))

# Select multiple columns
print("\nName and Fare columns (first 5):")
print(df[['Name', 'Fare']].head())

Age column (first 10):
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

Name and Fare columns (first 5):
                                                Name     Fare
0                            Braund, Mr. Owen Harris   7.2500
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  71.2833
2                             Heikkinen, Miss. Laina   7.9250
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  53.1000
4                           Allen, Mr. William Henry   8.0500
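
The two selections above return different types: single brackets give a `Series`, while a list of names gives a `DataFrame`. The following sketch shows the difference on a small made-up frame (`toy` and its values are assumptions for illustration) and adds `.loc`, which selects rows and columns in one step.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 7.93],
})

age_series = toy["Age"]    # single brackets -> Series
age_frame = toy[["Age"]]   # list with one name -> one-column DataFrame

print(type(age_series).__name__)
print(type(age_frame).__name__)

# .loc takes a row selector (here a boolean mask) and a list of column labels.
subset = toy.loc[toy["Fare"] > 7.5, ["Name", "Fare"]]
print(subset)
```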


## Missing Values

In [5]:
# Check missing values
print("Missing values per column:")
print(df.isnull().sum())

# Percentage of missing values
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df) * 100).round(2))

Missing values per column:
PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
WikiId            5
Name_wiki         5
Age_wiki          7
Hometown          5
Boarded           5
Destination       5
Lifeboat        807
Body           1179
Class             5
dtype: int64

Percentage of missing values:
PassengerId     0.00
Survived       31.93
Pclass          0.00
Name            0.00
Sex             0.00
Age            20.09
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.08
Cabin          77.46
Embarked        0.15
WikiId          0.38
Name_wiki       0.38
Age_wiki        0.53
Hometown        0.38
Boarded         0.38
Destination     0.38
Lifeboat       61.65
Body           90.07
Class           0.38
dtype: float64
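
With many columns, a shorter report that lists only the affected columns, worst first, can be easier to read. This is one possible sketch on a made-up frame (`toy`, `missing_pct`, and `report` are hypothetical names); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isna() is an alias of isnull(); the column mean of the mask is the missing fraction.
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only columns that have missing values, sorted from most to least missing.
report = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(report)
```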


## Handling Missing Values

In [6]:
# Fill missing Age and Fare values with each column's median
# (columns not listed in the dict are left untouched)
df_filled = df.fillna({
    'Age': df['Age'].median(),
    'Fare': df['Fare'].median(),
})

print("Missing values after filling:")
print(df_filled.isnull().sum())

Missing values after filling:
PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age               0
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin          1014
Embarked          2
WikiId            5
Name_wiki         5
Age_wiki          7
Hometown          5
Boarded           5
Destination       5
Lifeboat        807
Body           1179
Class             5
dtype: int64
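
A single global median ignores differences between groups. As an alternative (not what the cell above does), the sketch below fills each missing age with the median of the passenger's own class, using `groupby(...).transform`. The frame `toy` and the column `Age_filled` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, None, 20.0, 30.0, None],
})

# transform("median") returns a Series aligned with the original rows,
# holding the median Age of each row's Pclass group (NaN values are skipped).
class_median = toy.groupby("Pclass")["Age"].transform("median")

# fillna with an aligned Series fills each gap with the value at the same index.
toy["Age_filled"] = toy["Age"].fillna(class_median)
print(toy)
```

Whether a per-group median is the better choice depends on the analysis; it is shown here only to illustrate the mechanics.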


## Filtering Data

In [7]:
# Filter passengers over 30 years old
over_30 = df_filled[df_filled['Age'] > 30]
print(f"Passengers over 30: {len(over_30)}")

# Filter by multiple conditions
young_high_fare = df_filled[(df_filled['Age'] < 20) & (df_filled['Fare'] > 100)]
print(f"Young passengers (< 20) with high fare (> 100): {len(young_high_fare)}")

Passengers over 30: 437
Young passengers (< 20) with high fare (> 100): 13
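
Boolean masks need parentheses around each condition because `&` binds more tightly than the comparison operators. The sketch below, on a made-up frame (`toy` and its values are assumptions), shows the same filter written with `DataFrame.query`, plus two helpers that often replace chained comparisons: `between` and `isin`.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [15.0, 45.0, 19.0, 30.0],
    "Fare": [120.0, 200.0, 8.0, 150.0],
    "Embarked": ["S", "C", "Q", "S"],
})

# Same filter written two ways.
mask_result = toy[(toy["Age"] < 20) & (toy["Fare"] > 100)]
query_result = toy.query("Age < 20 and Fare > 100")
print(mask_result.equals(query_result))

# between() is inclusive on both ends by default.
in_range = toy[toy["Age"].between(18, 30)]

# isin() keeps rows whose value appears in the given list.
from_ports = toy[toy["Embarked"].isin(["S", "C"])]

print(in_range)
print(from_ports)
```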


## Grouping and Aggregation

In [8]:
# Group by class and calculate mean fare
print("Mean fare by passenger class:")
print(df_filled.groupby('Pclass')['Fare'].mean())

# Multiple aggregations
print("\nAge statistics by passenger class:")
print(df_filled.groupby('Pclass')['Age'].agg(['mean', 'min', 'max', 'count']))

Mean fare by passenger class:
Pclass
1    87.508992
2    21.179196
3    13.304513
Name: Fare, dtype: float64

Age statistics by passenger class:
             mean   min   max  count
Pclass                              
1       37.812446  0.92  80.0    323
2       29.419675  0.67  70.0    277
3       25.750353  0.17  74.0    709
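
To aggregate several columns at once and control the output column names, `agg` also accepts keyword arguments of the form `new_name=(column, function)`. A minimal sketch on made-up data follows; `toy`, `summary`, and the result column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, 20.0, 30.0, 25.0],
    "Fare": [80.0, 100.0, 8.0, 10.0, 12.0],
})

# Named aggregation: each keyword becomes a column in the result.
summary = toy.groupby("Pclass").agg(
    mean_age=("Age", "mean"),
    max_fare=("Fare", "max"),
    passengers=("Age", "size"),  # size counts rows; count would skip NaN
)
print(summary)
```

Note that `count` only tallies non-missing values. In the output above it matches the class sizes because the ages were filled beforehand.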


## Sorting

In [9]:
# Five youngest passengers (nsmallest returns them in ascending Age order)
print("Youngest passengers:")
print(df_filled.nsmallest(5, 'Age')[['Name', 'Age']])

# Five highest fares (nlargest returns them in descending Fare order)
print("\nHighest fares:")
print(df_filled.nlargest(5, 'Fare')[['Name', 'Fare']])

Youngest passengers:
                                         Name   Age
1245  Dean, Miss. Elizabeth Gladys Millvina""  0.17
1092  Danbom, Master. Gilbert Sigvard Emanuel  0.33
803           Thomas, Master. Assad Alexander  0.42
755                 Hamalainen, Master. Viljo  0.67
469             Baclini, Miss. Helene Barbara  0.75

Highest fares:
                                                   Name      Fare
258                                    Ward, Miss. Anna  512.3292
679                  Cardeza, Mr. Thomas Drake Martinez  512.3292
737                              Lesurer, Mr. Gustave J  512.3292
1234  Cardeza, Mrs. James Warburton Martinez (Charlo...  512.3292
27                       Fortune, Mr. Charles Alexander  263.0000
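
`nsmallest` and `nlargest` return only the top rows. For a full ordering, or for sorting on more than one column, `sort_values` is the general tool. The sketch below uses a made-up frame (`toy` and its values are assumptions) to sort by class ascending and fare descending.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Pclass": [3, 1, 1, 3],
    "Fare": [8.0, 100.0, 80.0, 12.0],
})

# One ascending flag per sort key: Pclass ascending, then Fare descending.
by_class_then_fare = toy.sort_values(["Pclass", "Fare"], ascending=[True, False])
print(by_class_then_fare)

# For a single key, sort_values(...).head(n) selects the same rows as nlargest(n, ...).
top_two = toy.sort_values("Fare", ascending=False).head(2)
print(top_two["Name"].tolist() == toy.nlargest(2, "Fare")["Name"].tolist())
```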


## Data Types and Conversion

In [10]:
# Check data types
print("Data types:")
print(df.dtypes)

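# Convert Age from float to int. astype(int) drops the fractional part and
# fails on NaN, so the conversion uses df_filled, where Age has no missing values.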
df_filled['Age_int'] = df_filled['Age'].astype(int)
print(f"\nAge converted to int: {df_filled['Age_int'].dtype}")

Data types:
PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
WikiId         float64
Name_wiki       object
Age_wiki       float64
Hometown        object
Boarded         object
Destination     object
Lifeboat        object
Body            object
Class          float64
dtype: object

Age converted to int: int64
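
The conversion above works only because the missing ages were filled first. When the gaps should be kept, one option is pandas' nullable integer dtype `"Int64"` (capital I). The sketch below uses a made-up `ages` series to show the failure and the alternative; rounding before the cast is an assumption about how fractional ages should be handled.

```python
import pandas as pd

# Hypothetical ages, including a missing value and fractional values.
ages = pd.Series([22.0, None, 0.75, 35.4], name="Age")

# Plain int cannot represent NaN, so this conversion raises.
try:
    ages.astype(int)
except ValueError as exc:
    print(f"astype(int) failed: {exc}")

# Round to whole numbers first, then cast to the nullable integer dtype,
# which stores the missing entry as <NA> instead of failing.
ages_int = ages.round().astype("Int64")
print(ages_int)
```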


## Creating New Columns

In [11]:
# Create age group column. Bins are right-inclusive by default:
# (0, 18], (18, 35], (35, 60], (60, 100]
df_filled['Age_Group'] = pd.cut(df_filled['Age'], bins=[0, 18, 35, 60, 100], 
                                  labels=['Child', 'Young Adult', 'Adult', 'Senior'])

print("Age groups distribution:")
print(df_filled['Age_Group'].value_counts())

Age groups distribution:
Age_Group
Young Adult    794
Adult          289
Child          193
Senior          33
Name: count, dtype: int64
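
How `pd.cut` treats values that fall exactly on a bin edge depends on the `right` parameter. The sketch below reuses the same bins and labels on a few made-up boundary ages (`ages` is a hypothetical series) and compares the default right-closed intervals with `right=False`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges.
ages = pd.Series([0, 18, 18.5, 35, 60, 80])
bins = [0, 18, 35, 60, 100]
labels = ["Child", "Young Adult", "Adult", "Senior"]

# Default: intervals are (0, 18], (18, 35], ... so 18 is a Child and 0 falls outside.
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 18), [18, 35), ... so 18 is a Young Adult.
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({
    "age": ages,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

With the default setting, an age of exactly 0 is not assigned to any bin and becomes `NaN`; passing `include_lowest=True` would include the lowest edge.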


## Basic Statistics