**Importing necessary libraries and loading the dataset from a specified path to begin data analysis**

In [21]:
# Importing libraries

import pandas as pd


# Loading dataset
# Change the path to the path of your dataset

df = pd.read_csv('/content/train.csv')

**Displaying the first five rows of the dataset to get an initial look at the data structure and contents**

In [22]:
df.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**Checking for missing values in each column of the dataset to identify any data gaps or incomplete entries**

In [5]:
# Checking for missing values

print(df.isnull().sum())



PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
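
Raw counts are easier to judge as a share of all rows. With 891 rows in this dataset, the 687 missing 'Cabin' entries amount to roughly 77% of the column, while the 177 missing 'Age' entries are about 20%. A minimal sketch of computing that share, using a small hypothetical frame (`toy`) in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() yields booleans, so mean() is the fraction of missing entries per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a very high missing share, such as 'Cabin' here, are usually poor candidates for simple imputation.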


**Displaying the data types of each column to understand the structure of the dataset and identify any necessary type conversions.**

In [6]:
# Checking data types

print(df.dtypes)



PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


**Generating summary statistics for the numerical columns in the dataset: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.**

In [7]:
# Summarizing data

print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


**Filling missing values in the 'Age' column with the median age and in the 'Embarked' column with the most frequent value (mode) to handle incomplete data.**

In [8]:
# Filling in missing values

# Assign the result back to the column; calling fillna(inplace=True) on a
# selected column triggers a FutureWarning and will stop working in pandas 3.0
df['Age'] = df['Age'].fillna(df['Age'].median())

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
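
`DataFrame.fillna` also accepts a dictionary that maps column names to fill values, which fills several columns in one call without selecting a single column first. A minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Compute one fill value per column: median for numeric, mode for categorical
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}

# fillna with a dict returns a new DataFrame; assign it back to keep the result
toy = toy.fillna(fill_values)
print(toy)
```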


**Converting the 'Pclass' column to a categorical data type to optimize memory usage and better represent its nature as a categorical feature.**

In [9]:
# Converting data types

df['Pclass'] = df['Pclass'].astype('category')
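
The memory effect of the conversion can be checked with `Series.memory_usage`. The sketch below uses a hypothetical column of 5,000 repeated class labels; the exact byte counts depend on the pandas version, so only the comparison matters:

```python
import pandas as pd

# Hypothetical column with the three passenger classes repeated many times
toy = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2] * 1000})

as_int = toy["Pclass"].memory_usage(deep=True)

# A categorical stores one small integer code per row plus the list of categories
as_category = toy["Pclass"].astype("category")
as_cat = as_category.memory_usage(deep=True)

print(f"int64: {as_int} bytes, category: {as_cat} bytes")
print(list(as_category.cat.categories))
```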

**Removing duplicate rows from the dataset to ensure each entry is unique and prevent redundancy in data analysis.**

In [10]:
# Removing duplicates

df.drop_duplicates(inplace=True)
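
Counting duplicates before dropping them shows whether the step changes anything. In this dataset the row count is 891 both before and after cleaning, which is consistent with no duplicate rows being present. A minimal sketch on a hypothetical frame that does contain one duplicate:

```python
import pandas as pd

# Hypothetical toy frame in which the last row repeats the second one
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2],
    "Name": ["A", "B", "B"],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())

deduplicated = toy.drop_duplicates()
print(n_duplicates, len(toy), len(deduplicated))
```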

**Correcting data inconsistencies by setting any negative values in the 'Age' column to the median age, ensuring all age values are realistic.**

In [11]:
# Correcting inconsistencies

df.loc[df['Age'] < 0, 'Age'] = df['Age'].median()

**Re-checking for missing values, data types, and summary statistics after data cleaning to confirm the dataset's integrity and readiness for analysis.**

In [12]:
# Checking for missing values

print(df.isnull().sum())


# Checking data types

print(df.dtypes)


# Summarizing data

print(df.describe())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
PassengerId       int64
Survived          int64
Pclass         category
Name             object
Sex              object
Age             float64
SibSp             int64
Parch             int64
Ticket           object
Fare            float64
Cabin            object
Embarked         object
dtype: object
       PassengerId    Survived         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838   29.361582    0.523008    0.381594   32.204208
std     257.353842    0.486592   13.019697    1.102743    0.806057   49.693429
min       1.000000    0.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000   22.000000    0.000000    0.00

**Standardizing the numeric features 'Age' and 'Fare' with StandardScaler so that each has a mean of 0 and a standard deviation of 1, which puts them on a comparable scale for analysis.**

In [15]:
# Data Standardization
# Selecting numeric features to standardize
from sklearn.preprocessing import StandardScaler
numeric_features = ['Age', 'Fare']  # Example of numeric features
scaler = StandardScaler()
df[numeric_features] = scaler.fit_transform(df[numeric_features])
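
StandardScaler divides by the population standard deviation (ddof=0), whereas `df.describe()` reports the sample standard deviation (ddof=1). As a rough check, the scaled 'Age' of the first passenger can be reproduced from the summary statistics printed after cleaning (mean 29.361582, std 13.019697, 891 rows). The variable names below are illustrative:

```python
import math

# Figures copied from the df.describe() output after cleaning
n = 891
mean_age = 29.361582
sample_std_age = 13.019697  # describe() uses ddof=1

# Convert to the population standard deviation (ddof=0) used by StandardScaler
population_std_age = sample_std_age * math.sqrt((n - 1) / n)

# z-score for the 22-year-old passenger in the first row
z = (22.0 - mean_age) / population_std_age
print(round(z, 4))
```

The result agrees, up to rounding of the inputs, with the -0.565736 shown for that passenger in the cleansed `df.head()` output.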

**Applying a logarithmic transformation to the 'Fare' column using np.log1p to reduce skewness. Note that 'Fare' was already standardized in the previous step, so the transformation operates on the scaled values rather than on the raw fares.**

In [16]:
# Data Transformation
# Example of a logarithmic transformation to reduce skewness
import numpy as np  # required for np.log1p

# log1p computes log(1 + x), which avoids log(0) issues; it requires x > -1
df['Fare'] = np.log1p(df['Fare'])
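
A log transformation is more commonly applied to the raw, non-negative values before any scaling. The sketch below illustrates that alternative ordering on a few raw fares taken from the first rows of the dataset, plus the minimum fare of 0 reported in the summary statistics. It is an illustration under those assumptions, not the procedure used in this notebook:

```python
import numpy as np

# Raw fares from the first rows of the dataset, plus the minimum fare of 0
raw_fares = np.array([0.0, 7.25, 71.2833, 7.925, 53.1, 8.05])

# log1p is log(1 + x), so a fare of 0 maps to 0 instead of -inf
logged = np.log1p(raw_fares)

# expm1 is the exact inverse, exp(y) - 1, and recovers the original fares
restored = np.expm1(logged)

print(logged.round(3))
```

The transformation compresses the large fares far more than the small ones, which is what reduces the right skew.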


**Aggregating the dataset by 'Pclass' to calculate the mean 'Age' and 'Fare' for each passenger class, providing insights into the average characteristics of different classes.**

In [17]:
# Aggregation
# Example of aggregating by 'Pclass' and calculating mean 'Age' and 'Fare'
# 'Pclass' is categorical, so pass observed explicitly to keep the current behavior
aggregated_df = df.groupby('Pclass', observed=False).agg({'Age': 'mean', 'Fare': 'mean'}).reset_index()
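
For comparison, the same aggregation on untransformed values produces means in the original units. A minimal sketch using the first five rows of the dataset as a hypothetical toy frame (class 2 does not occur in these rows):

```python
import pandas as pd

# First five rows of the dataset, with raw 'Age' and 'Fare' values
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# One row per class, with the mean of each selected column
summary = toy.groupby("Pclass").agg({"Age": "mean", "Fare": "mean"}).reset_index()
print(summary)
```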


**Displaying the first five rows of the cleansed dataset to verify the changes made during data cleaning and transformation.**

In [19]:
# Checking the cleansed data
print(df.head())


   PassengerId  Survived Pclass  \
0            1         0      3   
1            2         1      1   
2            3         1      3   
3            4         1      1   
4            5         0      3   

                                                Name     Sex       Age  SibSp  \
0                            Braund, Mr. Owen Harris    male -0.565736      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  0.663861      1   
2                             Heikkinen, Miss. Laina  female -0.258337      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  0.433312      1   
4                           Allen, Mr. William Henry    male  0.433312      0   

   Parch            Ticket      Fare Cabin Embarked  
0      0         A/5 21171 -0.698050   NaN        S  
1      0          PC 17599  0.580452   C85        C  
2      0  STON/O2. 3101282 -0.671101   NaN        S  
3      0            113803  0.351171  C123        S  
4      0            373450 -0.66

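**Displaying the aggregated DataFrame, which shows the mean 'Age' and 'Fare' for each passenger class. Because both columns were transformed earlier, the values are in standardized (and, for 'Fare', log-transformed) units rather than years or currency; first class has the highest mean on both.**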

In [20]:
print(aggregated_df)

  Pclass       Age      Fare
0      1  0.572573  0.512561
1      2  0.031032 -0.313948
2      3 -0.263515 -0.515259
