# Train Dataset Analysis

## **Introduction**
In this project, I explore the Train dataset using Pandas and NumPy to gain fundamental insights into passenger demographics and ticket pricing. The aim is to understand trends in passenger age distribution, ticket prices, and travel class.

The analysis involves various data manipulation techniques, statistical summaries, and indexing methods to uncover key patterns and relationships within the data.

### Dataset Structure  

#### **1. Train Dataset**  
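This dataset (`train.csv`) contains information about passengers, including their survival status, personal details, and travel information. The sample rows shown later match the training split of the well-known Titanic passenger data, so "Train" here refers to the file name rather than a mode of transport. It includes the following fields:  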

- **PassengerId**: A unique identifier for each passenger.  
- **Survived**: A binary indicator (0 or 1) where 0 means the passenger did not survive and 1 means the passenger survived.  
- **Pclass**: The passenger's class on the ship, where 1 is first class, 2 is second class, and 3 is third class.  
- **Name**: The full name of the passenger.  
- **Sex**: The gender of the passenger (male or female).  
- **Age**: The age of the passenger in years.  
- **SibSp**: The number of siblings or spouses the passenger was traveling with.  
- **Parch**: The number of parents or children the passenger was traveling with.  
- **Ticket**: The ticket number assigned to the passenger.  
- **Fare**: The fare paid by the passenger for the trip.  
- **Cabin**: The cabin number the passenger was assigned to (if available).  
- **Embarked**: The port at which the passenger boarded, represented as 'C' for Cherbourg, 'Q' for Queenstown, and 'S' for Southampton.  

#### **Importing Required Libraries**

Before starting the analysis, I first import the necessary libraries to perform the required operations. This includes `Pandas` for data manipulation and `NumPy` for numerical calculations.

In [43]:
# Importing Required Libraries
import numpy as np
import pandas as pd

#### **Loading the Dataset**

The dataset containing various passenger details is loaded using `pd.read_csv()`. This allows me to work with the data in a structured format, which will enable further exploration.


In [44]:
# Loading the Dataset (this absolute path is specific to the author's machine)
dataframe = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\Train-Dataset-Insights\Data\train.csv")
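
The absolute Windows path above only resolves on the author's machine. A minimal sketch of a more portable loader follows, assuming the notebook runs from the repository root and that the CSV sits at `Data/train.csv` (this relative layout is inferred from the path above, and `load_train` is a hypothetical helper name):

```python
from pathlib import Path

import pandas as pd

# Assumed location relative to the repository root (not confirmed by the source).
DATA_PATH = Path("Data") / "train.csv"


def load_train(path: Path = DATA_PATH) -> pd.DataFrame:
    """Read the passenger CSV, failing early with a clear message if it is missing."""
    if not path.exists():
        raise FileNotFoundError(f"Expected the dataset at {path.resolve()}")
    return pd.read_csv(path)
```

With this in place, `dataframe = load_train()` would stand in for the hard-coded call, provided the working directory matches the assumption.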

## Exploring the Dataset
To better understand the structure of the dataset, I display the first 25 records with `head(25)`; only the first 10 rows of that output are reproduced here.


In [45]:
dataframe.head(25)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


## Statistical Analysis of Age Column
Next, I perform basic statistical operations to gain insights into the distribution of ages among passengers. These include calculating the mean, maximum, minimum, and standard deviation of the age column.


In [46]:
# Mean Age
print("Mean Age:", dataframe['Age'].mean())

Mean Age: 29.69911764705882


In [47]:
# Maximum Age
print("Max Age:", dataframe['Age'].max())


Max Age: 80.0


In [48]:
# Minimum Age
print("Min Age:", dataframe['Age'].min())


Min Age: 0.42


In [49]:
# Standard Deviation of Age
print("Standard Deviation of Age:", dataframe['Age'].std())

Standard Deviation of Age: 14.526497332334042
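
The four statistics above can also be computed in one call. Two details are worth keeping in mind: Pandas skips missing ages by default, and `Series.std()` returns the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population version (`ddof=0`). A small sketch on a hypothetical five-value series (`ages` is illustrative data, not the full column):

```python
import numpy as np
import pandas as pd

# Illustrative ages with one missing value (not the full dataset).
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan], name="Age")

# One call for all four statistics; NaN values are skipped by default.
summary = ages.agg(["mean", "max", "min", "std"])
print(summary)

sample_std = ages.std()                             # ddof=1 (Pandas default)
population_std = np.std(ages.dropna().to_numpy())   # ddof=0 (NumPy default)
print(sample_std, population_std)
```

The two standard deviations differ only in the divisor (n - 1 versus n), so the gap shrinks as the number of non-missing values grows.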


## Passenger Class Distribution
To understand the distribution of passengers across different classes, I use `value_counts()` on the `Pclass` column. This helps me see how many passengers belong to each class.

In [50]:
dataframe['Pclass'].value_counts()


Pclass
3    491
1    216
2    184
Name: count, dtype: int64
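
Raw counts are easier to compare as proportions. The sketch below rebuilds a class column from the three counts reported above (the series `pclass` is reconstructed for illustration, not read from the file) and uses the `normalize=True` option of `value_counts()`:

```python
import pandas as pd

# Rebuild a class column with the counts reported above (491, 216, 184).
pclass = pd.Series([3] * 491 + [1] * 216 + [2] * 184, name="Pclass")

# normalize=True returns each class's share of all passengers instead of raw counts.
shares = pclass.value_counts(normalize=True).sort_index()
print(shares.round(3))
```

On these counts, third class accounts for a little over half of the 891 passengers.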

## Summary Statistics
Using `describe()`, I generate a comprehensive summary of all numerical columns in the dataset. This gives me an overview of important statistics like mean, standard deviation, min, max, and quartiles.

In [51]:
dataframe.describe()


,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
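
The `count` row doubles as a missing-value check: `Age` has 714 entries against 891 rows, so 177 ages are missing. A minimal sketch of checking this directly with `isna()`, using a small hypothetical frame (`sample`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps in Age and Cabin.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 8.4583, 7.925, 53.1],
    "Cabin": [None, None, None, "C123"],
})

missing = sample.isna().sum()            # missing values per column
non_null = sample.shape[0] - missing     # same numbers that count() reports
print(missing)
```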


## Identifying Elderly Passengers in Third Class
I filter passengers who are older than 60 years and traveled in third class. This provides insight into the elderly passenger demographic in lower-class sections.


In [52]:
dataframe[(dataframe['Age'] > 60) & (dataframe['Pclass'] == 3)]


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
280,281,0,3,"Duane, Mr. Frank",male,65.0,0,0,336439,7.75,,Q
326,327,0,3,"Nysveen, Mr. Johan Hansen",male,61.0,0,0,345364,6.2375,,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
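
The filter above works by combining two boolean masks. The sketch below shows the same pattern step by step on a hypothetical four-row frame (`sample`, with names and ages taken from outputs in this notebook), alongside an equivalent `query()` call:

```python
import pandas as pd

# Hypothetical subset of rows taken from the outputs shown in this notebook.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "McCarthy, Mr. Timothy J",
             "Connors, Mr. Patrick", "Turkula, Mrs. (Hedwig)"],
    "Age": [22.0, 54.0, 70.5, 63.0],
    "Pclass": [3, 1, 3, 3],
})

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > and ==.
mask = (sample["Age"] > 60) & (sample["Pclass"] == 3)
elderly_third = sample[mask]

# Equivalent filter written as a query string.
elderly_third_q = sample.query("Age > 60 and Pclass == 3")
print(elderly_third)
```

Rows with a missing `Age` never satisfy `Age > 60`, so they are silently excluded by either form.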


## Adjusting Fare to 2025 Rates
I estimate each fare in 2025 terms by multiplying the original fare by an inflation factor of 146.14. The factor is taken as given here; its source is not documented in this analysis.

In [53]:
dataframe['2025_Fare'] = dataframe['Fare'] * 146.14
dataframe.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2025_Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1059.515
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1158.1595
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,7760.034
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1176.427
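
As a worked check, the first row gives 7.25 × 146.14 = 1059.515, which matches the output. A minimal sketch of the same calculation with the factor held in a named constant and the result rounded to two decimals (`INFLATION_FACTOR` and `fares` are illustrative names, and the rounding is an added assumption, not part of the original cell):

```python
import pandas as pd

# Factor taken from the text above; its source is not documented there.
INFLATION_FACTOR = 146.14

fares = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925]})

# The multiplication is applied element-wise to the whole column.
fares["2025_Fare"] = (fares["Fare"] * INFLATION_FACTOR).round(2)
print(fares)
```

Keeping the factor in one place makes it straightforward to swap in a different value if a better-sourced figure becomes available.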


## Accessing Specific Data Points
I demonstrate how to access a specific data point by position using `iloc`. Because positions are zero-based, `iloc[1, 3]` retrieves the value in the fourth column (`Name`) of the second row.

In [54]:

# Retrieving the fourth column (index 3) value of the second row (index 1)
dataframe.iloc[1, 3]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'
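
Counting column positions by hand is error-prone, so it can help to look the position up or to address the cell by label instead. A short sketch on a hypothetical two-row frame (`sample`) with the same leading columns:

```python
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"],
})

name_position = sample.columns.get_loc("Name")   # zero-based position of the column
by_position = sample.iloc[1, name_position]      # second row, fourth column
by_label = sample.loc[1, "Name"]                 # same cell, addressed by labels
print(name_position, by_position)
```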

## Grouping Data by Gender and Passenger Class
I group the data by gender and passenger class to analyze the average fare for each category. This helps in understanding payment trends across different groups.

In [55]:
fare = dataframe.groupby(['Sex', 'Pclass']).agg({'Fare': ['count', 'sum']})
fare['fare_avg'] = fare['Fare']['sum'] / fare['Fare']['count']
fare

Sex,Pclass,Fare count,Fare sum,fare_avg
female,1,94,9975.825,106.125798
female,2,76,1669.7292,21.970121
female,3,144,2321.1086,16.11881
male,1,122,8201.5875,67.226127
male,2,108,2132.1125,19.741782
male,3,347,4393.5865,12.661633
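
Dividing the sum by the count is the definition of the mean, so the same table can be produced with a built-in aggregation. The sketch below uses named aggregation on a hypothetical six-row frame (`sample`; the output column names `count`, `total`, and `fare_avg` are my own choices) and confirms that both routes agree:

```python
import numpy as np
import pandas as pd

# Hypothetical handful of passengers, not the full dataset.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 3, 1, 1],
    "Fare": [7.25, 71.2833, 7.925, 8.05, 53.1, 51.8625],
})

grouped = sample.groupby(["Sex", "Pclass"])["Fare"]

# Named aggregation gives flat column names instead of a two-level header.
fare_stats = grouped.agg(count="count", total="sum", fare_avg="mean")
print(fare_stats)

# The built-in mean matches sum / count computed by hand.
manual_avg = fare_stats["total"] / fare_stats["count"]
print(np.allclose(manual_avg, fare_stats["fare_avg"]))
```

Flat column names also avoid the chained `fare['Fare']['sum']` lookups needed with a two-level header.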


## Creating DataFrames from NumPy Arrays
To demonstrate various methods of creating Pandas DataFrames, I first show how to create a DataFrame from a dictionary. Then, I demonstrate creating a DataFrame from a NumPy array.

In [56]:
# Creating a DataFrame from a dictionary
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=data)
df

,col1,col2
0,1,3
1,2,4


In [57]:
# Creating a DataFrame from a NumPy array
data1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df1 = pd.DataFrame(data=data1, columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
df1

,a,b,c
x,1,2,3
y,4,5,6
z,7,8,9


## Alternative DataFrame Creation Method
The same DataFrame can be built in a single expression by passing the NumPy array directly to `pd.DataFrame`, without first storing it in a variable. This further illustrates the flexibility of Pandas for data manipulation.

In [58]:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
print(df2)

   a  b  c
x  1  2  3
y  4  5  6
z  7  8  9
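
What the DataFrame adds over the raw array is the row and column labels. A brief sketch of using those labels and of converting back to a plain array with `to_numpy()` (`values`, `frame`, `centre`, and `round_trip` are illustrative names):

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
frame = pd.DataFrame(values, columns=["a", "b", "c"], index=["x", "y", "z"])

# Labels allow lookups by name rather than by position.
centre = frame.loc["y", "b"]

# to_numpy() drops the labels and returns the values as an array.
round_trip = frame.to_numpy()
print(centre, round_trip.shape)
```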


## Understanding Data Types
To better understand how Pandas represents the data, I check the type of a single row and of a single column. Both are returned as `Series` objects.



In [59]:
# Checking type of a single row
print(type(dataframe.iloc[0]))

# Checking type of the 'Name' column
print(type(dataframe['Name']))  # or dataframe.Name

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
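
Note that `type()` reports the container, not the type of the values inside it; the per-column element types are available through `dtypes`. A minimal sketch of the distinction on a hypothetical two-row frame (`sample`):

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
    "Pclass": [3, 3],
})

row = sample.iloc[0]              # one row -> Series indexed by column names
column = sample["Age"]            # one column -> Series indexed by row labels
one_col_frame = sample[["Age"]]   # list of names -> DataFrame

print(type(row).__name__, type(column).__name__, type(one_col_frame).__name__)
print(sample.dtypes)              # element types per column
```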


## Working with the Train Dataset
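I assign the DataFrame to a new name, `train`, for further exploration; this creates a second reference to the same object rather than a copy. I then display the first few rows, check the column names and dimensions, and print general information about the dataset. In a notebook cell only printed output and the last expression are shown, so the output below comes from `train.info()`.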



In [60]:
train = dataframe
train.head()

train.columns  # Viewing column names
train.shape    # Checking dataset dimensions
train.info()   # Getting dataset info


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  2025_Fare    891 non-null    float64
dtypes: float64(3), int64(5), object(5)
memory usage: 90.6+ KB
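
Because `train = dataframe` binds a second name to the same object, any column later added through `train` also appears in `dataframe`. The sketch below contrasts this with `.copy()` on a hypothetical two-row frame (`original`, `alias`, and `independent` are illustrative names):

```python
import pandas as pd

original = pd.DataFrame({"Age": [22.0, 38.0]})

alias = original                # second name for the same object
independent = original.copy()   # separate object with its own data

alias["Age_plus_100"] = alias["Age"] + 100

print(alias is original)                       # True
print("Age_plus_100" in original.columns)      # True: the change is visible via both names
print("Age_plus_100" in independent.columns)   # False: the copy is unaffected
```

Using `.copy()` is the safer choice when the original frame should stay untouched.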


## Extracting Specific Columns
I demonstrate different ways of extracting columns: bracket and attribute access for the `Age` column, and a list of names to select `Name` and `Age` together. This provides insight into how to handle individual columns in a dataset.


In [61]:

train['Age']
train.Age
train[['Name', 'Age']]


,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
...,...,...
886,"Montvila, Rev. Juozas",27.0
887,"Graham, Miss. Margaret Edith",19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",26.0
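
The access styles are not fully interchangeable. Attribute access only works when the column name is a valid Python identifier, which rules out `2025_Fare`. A short sketch on a hypothetical two-row frame (`sample`):

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
    "2025_Fare": [1059.515, 1158.1595],
})

age_a = sample["Age"]      # bracket access works for any column name
age_b = sample.Age         # attribute access works only for valid Python identifiers

# "2025_Fare" starts with a digit, so sample.2025_Fare would be a syntax error;
# bracket access is the only option for this column.
fare_2025 = sample["2025_Fare"]

subset = sample[["Name", "Age"]]   # a list of names returns a DataFrame
print(subset.shape)
```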


## Accessing Rows and Columns
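Using both `iloc` (position-based) and `loc` (label-based), I show how to access specific rows and values. Note that `iloc` slices exclude the end position while `loc` slices include the end label, which is why the final output has four rows. This is helpful for precise data extraction.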

In [62]:
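# Extracting a single row as a Series
train.iloc[0]

# Extracting rows as a DataFrame: one row, then the first three rows
train.iloc[[0]]
train.iloc[0:3]

# Extracting the columns at positions 3 and 4 for the first three rows
train.iloc[0:3, [3, 4]]
# Extracting the column at position 3 for all rows
train.iloc[:, [3]]
# Extracting a single value (first row, column at position 3)
train.iloc[0, 3]
# Label-based slice: loc includes the end label, so this returns rows 0 through 3
train.loc[0:3, ['Name']]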

,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
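
The difference in slice endpoints is easy to verify on a small example. A minimal sketch on a hypothetical five-row frame (`sample`) with the default integer index:

```python
import pandas as pd

sample = pd.DataFrame(
    {"Name": ["Braund", "Cumings", "Heikkinen", "Futrelle", "Allen"]}
)

by_position = sample.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = sample.loc[0:3]       # labels 0, 1, 2, 3 (end included)
print(len(by_position), len(by_label))
```

With a default `RangeIndex`, labels and positions coincide, so the endpoint rule is the only visible difference between the two slices.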


## Creating a New Column
I add a new column, `Age_plus_100`, which adds 100 years to each passenger's age for a hypothetical scenario. This shows how to create and manipulate new columns in the dataset.

In [63]:

train['Age_plus_100'] = train['Age'] + 100
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,2025_Fare,Age_plus_100
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1059.515,122.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,10417.341462,138.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1158.1595,126.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,7760.034,135.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1176.427,135.0
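
Two points are worth noting about derived columns: a missing `Age` stays missing after the addition, and `assign()` offers a way to add the column without modifying the original frame. A small sketch on a hypothetical three-row frame (`sample` and `extended` are illustrative names):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})

# assign() returns a new frame; `sample` itself is left unchanged.
extended = sample.assign(Age_plus_100=sample["Age"] + 100)

# Arithmetic is element-wise, and a missing Age yields a missing result.
print(extended)
print(extended["Age_plus_100"].isna().sum())
```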


## **Conclusion**

Through the analysis of the Train dataset, I explored passenger demographics, travel class distribution, and ticket pricing. Using Pandas for data manipulation and statistical summaries, I found that the mean recorded age is about 29.7 years (714 of 891 passengers have an age on record), that third class is the largest group with 491 passengers, and that average fares fall from first to third class for both sexes. I also scaled fares by an inflation factor to express them in 2025 terms. This project highlighted the importance of data wrangling and statistical summaries in extracting meaningful insights.

The combination of Pandas for data processing and NumPy for numerical computations enabled efficient analysis, providing a deeper understanding of passengers' characteristics. These insights can inform service improvements, fare adjustments, and targeted marketing strategies.

## **Application of Insights**

The insights from this analysis of the Train dataset have practical applications in the travel industry, economics, and customer segmentation. By understanding passenger distribution and fare correlations, businesses can optimize pricing, tailor marketing, and enhance services for specific groups. This also informs operational decisions, such as prioritizing upgrades for elderly travelers in third class, and highlights opportunities for inclusive pricing models.

This analysis lays the foundation for further exploration of passenger behavior, travel trends, and fare structures, guiding future modeling and enabling data-driven decisions to refine strategies and improve services.

## **Next Steps**

- **Deeper Demographic Analysis**: Explore additional factors like marital status or family size and their impact on ticket pricing.
- **Expand Dataset**: Include data from similar travel datasets (e.g., other ships or trains) for comparative insights.
- **Predictive Modeling**: Develop models to forecast fare pricing based on features like class and age, optimizing demand forecasting.
- **Visualizing Trends**: Create visualizations for age distribution, fare trends, and class demographics to simplify data interpretation.
  
By addressing these steps, the analysis can be expanded to provide more actionable insights for businesses and stakeholders.