# Train Dataset Analysis

## **Introduction**
In this project, I explore the Train dataset using Pandas and NumPy to gain fundamental insights into passenger demographics and ticket pricing. By analyzing the dataset, I aim to understand trends in passenger age distribution, ticket prices, and travel class.

The analysis involves various data manipulation techniques, statistical summaries, and indexing methods to uncover key patterns and relationships within the data.

### Dataset Structure  

#### **1. Train Dataset**  
This dataset contains information about passengers aboard the Train, including their survival status, personal details, and travel information. It includes the following fields:  

- **PassengerId**: A unique identifier for each passenger.  
- **Survived**: A binary indicator (0 or 1) where 0 means the passenger did not survive and 1 means the passenger survived.  
- **Pclass**: The passenger's class on the train, where 1 is first class, 2 is second class, and 3 is third class.  
- **Name**: The full name of the passenger.  
- **Sex**: The gender of the passenger (male or female).  
- **Age**: The age of the passenger in years.  
- **SibSp**: The number of siblings or spouses the passenger was traveling with.  
- **Parch**: The number of parents or children the passenger was traveling with.  
- **Ticket**: The ticket number assigned to the passenger.  
- **Fare**: The fare paid by the passenger for the trip.  
- **Cabin**: The cabin number the passenger was assigned to (if available).  
- **Embarked**: The port at which the passenger boarded the Train, represented as 'C' for Cherbourg, 'Q' for Queenstown, and 'S' for Southampton.  

#### **Importing Required Libraries**

Before starting the analysis, I first import the necessary libraries to perform the required operations. This includes `Pandas` for data manipulation and `NumPy` for numerical calculations.

In [None]:
# Importing Required Libraries
import numpy as np
import pandas as pd

#### **Loading the Dataset**

The dataset containing various passenger details is loaded using `pd.read_csv()`. This allows me to work with the data in a structured format, which will enable further exploration.


In [None]:
# Loading the Dataset
# Note: this is an absolute local path; adjust it to wherever train.csv is stored
dataframe = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\Train-Dataset-Insights\Data\train.csv")

## Exploring the Dataset
To better understand the structure of the dataset, I display the first 25 records, which provide an overview of the data.


In [None]:
dataframe.head(25)

## Statistical Analysis of Age Column
Next, I perform basic statistical operations to gain insights into the distribution of ages among passengers. These include calculating the mean, maximum, minimum, and standard deviation of the age column.


In [None]:
# Mean Age
print("Mean Age:", dataframe['Age'].mean())

In [None]:
# Maximum Age
print("Max Age:", dataframe['Age'].max())


In [None]:
# Minimum Age
print("Min Age:", dataframe['Age'].min())


In [None]:
# Standard Deviation of Age
print("Standard Deviation of Age:", dataframe['Age'].std())
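The four statistics above can also be computed in a single call. The following is a minimal sketch on a small made-up `ages` Series (hypothetical values, not the real data) that also makes two pandas defaults explicit: missing values are skipped, and `std()` returns the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for dataframe['Age']; NaN marks a missing age
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 54.0], name="Age")

# One call returns all four statistics; NaN values are skipped by default
age_stats = ages.agg(["mean", "max", "min", "std"])
print(age_stats)

# std() uses ddof=1 (sample standard deviation); ddof=0 gives the population version
population_std = ages.std(ddof=0)
print("Population std:", population_std)

# Number of missing ages that the statistics above ignored
missing_ages = ages.isna().sum()
print("Missing ages:", missing_ages)
```

Checking `isna().sum()` alongside the summary makes it clear how many rows the statistics are actually based on.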

## Passenger Class Distribution
To understand the distribution of passengers across different classes, I use `value_counts()` on the `Pclass` column. This helps me see how many passengers belong to each class.

In [None]:
dataframe['Pclass'].value_counts()
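Raw counts are easier to compare when they are ordered by class and expressed as shares. The sketch below uses a hypothetical `pclass` Series (made-up values) to show `sort_index()` and the `normalize=True` option of `value_counts()`.

```python
import pandas as pd

# Hypothetical toy data standing in for dataframe['Pclass']
pclass = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

# Raw counts, ordered by class label rather than by frequency
class_counts = pclass.value_counts().sort_index()

# Share of passengers in each class (fractions that sum to 1)
class_shares = pclass.value_counts(normalize=True).sort_index()

print(class_counts)
print(class_shares)
```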


## Summary Statistics
Using `describe()`, I generate a comprehensive summary of all numerical columns in the dataset. This gives me an overview of important statistics like mean, standard deviation, min, max, and quartiles.

In [None]:
dataframe.describe()


## Identifying Elderly Passengers in Third Class
I filter passengers who are older than 60 years and traveled in third class. This provides insight into the elderly passenger demographic in lower-class sections.


In [None]:
dataframe[(dataframe['Age'] > 60) & (dataframe['Pclass'] == 3)]
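The filter above combines two boolean masks. As a small sketch on hypothetical rows (the `passengers` frame and its values are made up), naming each mask shows why the parentheses are needed and how missing ages behave.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset described above
passengers = pd.DataFrame({
    "Age": [70.5, 61.0, np.nan, 25.0, 74.0],
    "Pclass": [3, 1, 3, 3, 3],
})

# Each comparison yields a boolean Series. In the one-line form, parentheses
# are required because & binds more tightly than > and ==.
is_elderly = passengers["Age"] > 60        # a missing age compares as False
is_third_class = passengers["Pclass"] == 3

# Keep only rows where both conditions hold
elderly_third = passengers[is_elderly & is_third_class]
print(elderly_third)
```

Passengers with a missing age are silently excluded by this filter, so the result is a lower bound on the number of elderly third-class passengers.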


## Adjusting Fare to 2025 Rates
I estimate each fare at 2025 prices by multiplying the original fare by an inflation factor of 146.14.

In [None]:
dataframe['2025_Fare'] = dataframe['Fare'] * 146.14
dataframe.head()
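One possible variation, sketched on hypothetical fares, keeps the factor in a named constant and uses `assign()` so the original frame is left untouched. The constant name `INFLATION_FACTOR`, the `fares` frame, and the rounding to two decimals are my assumptions for illustration.

```python
import pandas as pd

# Factor taken from the text above; named here for readability (assumed name)
INFLATION_FACTOR = 146.14

# Hypothetical toy fares standing in for dataframe['Fare']
fares = pd.DataFrame({"Fare": [7.25, 71.2833, 8.05]})

# assign() returns a new DataFrame and leaves `fares` unchanged.
# The ** unpacking is needed because '2025_Fare' is not a valid keyword name.
adjusted = fares.assign(**{"2025_Fare": (fares["Fare"] * INFLATION_FACTOR).round(2)})
print(adjusted)
```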


## Accessing Specific Data Points
I demonstrate how to access a specific data point by position using `iloc`. Because `iloc` is zero-based, `dataframe.iloc[1, 3]` returns the value in the second row and fourth column (`Name`).

In [None]:

# Retrieving the fourth column value (Name) of the second row
dataframe.iloc[1, 3]
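To contrast position-based and label-based access, here is a minimal sketch on a hypothetical `people` frame (made-up names and a deliberately non-default index, so that positions and labels differ).

```python
import pandas as pd

# Hypothetical toy frame; the index labels 10, 20, 30 differ from positions 0, 1, 2
people = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Name": ["Ann", "Bob", "Cy"], "Age": [30, 41, 25]},
    index=[10, 20, 30],
)

# iloc is purely positional: second row (position 1), second column (position 1)
by_position = people.iloc[1, 1]

# loc is label-based: row labelled 20, column labelled 'Name'
by_label = people.loc[20, "Name"]

print(by_position, by_label)
```

With the default integer index, row labels and row positions coincide, which is why the difference is easy to overlook.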

## Grouping Data by Gender and Passenger Class
I group the data by gender and passenger class to analyze the average fare for each category. This helps in understanding payment trends across different groups.

In [None]:
fare = dataframe.groupby(['Sex', 'Pclass']).agg({'Fare': ['count', 'sum']})
fare['fare_avg'] = fare['Fare']['sum'] / fare['Fare']['count']
fare
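The same average can be obtained without dividing the sum by the count manually. The sketch below, on hypothetical rows, uses pandas named aggregation so the result has flat column names; the names `tickets`, `passengers`, `fare_total`, and `fare_avg` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset described above
tickets = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Pclass": [3, 1, 1, 3, 3],
    "Fare": [7.25, 71.0, 53.0, 8.05, 7.75],
})

# Named aggregation: output_column=(input_column, function)
fare_summary = tickets.groupby(["Sex", "Pclass"]).agg(
    passengers=("Fare", "count"),
    fare_total=("Fare", "sum"),
    fare_avg=("Fare", "mean"),
)
print(fare_summary)
```

Both `count` and `mean` ignore missing fares, so `fare_avg` matches `fare_total / passengers` for every group.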

## Creating DataFrames from Dictionaries and NumPy Arrays
To demonstrate two common ways of creating Pandas DataFrames, I first create a DataFrame from a dictionary, where each key becomes a column. Then, I create a DataFrame from a NumPy array, supplying the column names and index labels explicitly.

In [None]:
# Creating a DataFrame from a dictionary
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=data)
df

In [None]:
# Creating a DataFrame from a NumPy array
data1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df1 = pd.DataFrame(data=data1, columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
df1

## Alternative DataFrame Creation Method
The same DataFrame can also be built in a single expression by passing the NumPy array directly to `pd.DataFrame` instead of storing it in a variable first.

In [None]:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
print(df2)

## Understanding Data Types
To better understand the dataset, I check the Python types of the objects that indexing returns. A single row and a single column are both Pandas `Series` objects, which determines the methods available on them.



In [None]:
# Checking type of a single row
print(type(dataframe.iloc[0]))

# Checking type of the 'Name' column
print(type(dataframe['Name']))  # or dataframe.Name
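`type()` reports the container type, while the `dtypes` attribute reports how each column's values are stored. As a sketch on a hypothetical `sample` frame (made-up values), both checks can be combined:

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset described above
sample = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Age": [30.0, 41.5],
    "Pclass": [1, 3],
})

# A single row and a single column are both returned as Series objects
row = sample.iloc[0]
column = sample["Name"]
print(type(row).__name__, type(column).__name__)

# dtypes shows the storage type of every column
print(sample.dtypes)

# Helper from pandas.api.types to confirm that a column is numeric
age_is_numeric = pd.api.types.is_numeric_dtype(sample["Age"])
name_is_numeric = pd.api.types.is_numeric_dtype(sample["Name"])
print(age_is_numeric, name_is_numeric)
```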

## Working with the Train Dataset
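I bind the DataFrame to the shorter name `train` for further exploration. This creates a second name for the same object, not a copy. I then display the first few rows, check the column names and dimensions, and print general information about the dataset.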



In [None]:
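train = dataframe  # Alias: both names refer to the same DataFrame
print(train.head())

# A notebook cell only displays its last expression, so print each result
print(train.columns)  # Viewing column names
print(train.shape)    # Checking dataset dimensions
train.info()          # Getting dataset info (prints directly)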
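Because plain assignment only creates a second name for the same DataFrame, later changes made through `train` also appear in `dataframe`. The sketch below, on a hypothetical frame with made-up column names, contrasts an alias with an independent copy made via `.copy()`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the loaded dataset
original = pd.DataFrame({"Age": [22.0, 38.0]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

alias["Flag"] = True           # also visible through `original`
independent["Other"] = 0       # does not affect `original`

print(original.columns.tolist())
print(independent.columns.tolist())
```

Using `.copy()` is a reasonable choice when the raw data should stay unmodified for later comparison.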


## Extracting Specific Columns
I demonstrate two equivalent ways of extracting the `Age` column, bracket notation and attribute access, and then select several columns at once by passing a list of names.


In [None]:

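train['Age']            # Bracket notation returns a Series
train.Age               # Attribute access returns the same Series
train[['Name', 'Age']]  # A list of column names returns a DataFrame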


## Accessing Rows and Columns
Using both `iloc` and `loc`, I show how to access specific rows and values. This is helpful for precise data extraction.

In [None]:
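# Extracting a single row (returns a Series)
train.iloc[0]

# Extracting rows as a DataFrame
train.iloc[[0]]   # First row only
train.iloc[0:3]   # First three rows (end position excluded)

# Extracting specific columns by position
train.iloc[0:3, [3, 4]]   # Name and Sex for the first three rows
train.iloc[:, [3]]        # Name for all rows
train.iloc[0, 3]          # Name of the first passenger (a single value)
train.loc[0:3, ['Name']]  # Label-based slice: rows 0 through 3 inclusive (four rows)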
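Two details in the cell above are easy to miss: slice endpoints behave differently in `iloc` and `loc`, and wrapping a position in a list changes the return type. The following sketch on a hypothetical `rows` frame (made-up names) makes both visible.

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..4
rows = pd.DataFrame({"Name": ["Ann", "Bob", "Cy", "Dee", "Eve"]})

# iloc slices exclude the end position: positions 0, 1, 2
first_three = rows.iloc[0:3]

# loc slices include the end label: labels 0, 1, 2, 3
labels_zero_to_three = rows.loc[0:3, ["Name"]]

# A scalar position returns a Series; a list of positions returns a DataFrame
as_series = rows.iloc[0]
as_frame = rows.iloc[[0]]

print(len(first_three), len(labels_zero_to_three))
print(type(as_series).__name__, type(as_frame).__name__)
```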

## Creating a New Column
I add a new column, `Age_plus_100`, which adds 100 years to each passenger's age for a hypothetical scenario. This shows how to create and manipulate new columns in the dataset.

In [None]:

train['Age_plus_100'] = train['Age'] + 100
train.head()

## **Conclusion**

In this analysis of the Train dataset, I explored passenger demographics, travel class distribution, ticket pricing, and fare trends. Using Pandas for data manipulation and statistical summaries, I examined the age distribution, the average fare by gender and passenger class, and ticket costs adjusted for inflation. This project highlighted the importance of data wrangling and statistical summaries in extracting meaningful insights.

The combination of Pandas for data processing and NumPy for numerical computations enabled efficient analysis, providing a deeper understanding of passengers' characteristics. These insights can inform service improvements, fare adjustments, and targeted marketing strategies.

## **Application of Insights**

The insights from this analysis of the Train dataset have practical applications in the travel industry, economics, and customer segmentation. By understanding passenger distribution and fare correlations, businesses can optimize pricing, tailor marketing, and enhance services for specific groups. This also informs operational decisions, such as prioritizing upgrades for elderly travelers in third class, and highlights opportunities for inclusive pricing models.

This analysis lays the foundation for further exploration of passenger behavior, travel trends, and fare structures, guiding future modeling and enabling data-driven decisions to refine strategies and improve services.

## **Next Steps**

- **Deeper Demographic Analysis**: Explore additional factors like marital status or family size and their impact on ticket pricing.
- **Expand Dataset**: Include data from similar travel datasets (e.g., other trains) for comparative insights.
- **Predictive Modeling**: Develop models to forecast fare pricing based on features like class and age, optimizing demand forecasting.
- **Visualizing Trends**: Create visualizations for age distribution, fare trends, and class demographics to simplify data interpretation.
  
By addressing these steps, the analysis can be expanded to provide more actionable insights for businesses and stakeholders.