# 📊 Exploratory Data Analysis (EDA)

### 📌 What is EDA?
Exploratory Data Analysis (EDA) is the process of analyzing datasets to:
- Understand the structure of the data
- Identify patterns and relationships
- Detect missing values and outliers
- Prepare data for further analysis or modeling


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('train.csv')

## 🔹 Part 1: Basic Questions

### 📦 1. How Big Is the Dataset?

#### 🧠 Code Used
- `df.shape`: returns the number of rows and columns as a `(rows, columns)` tuple.

#### 📊 Output
- **Rows:** 891
- **Columns:** 12

#### 📝 Explanation
The dataset contains **891 passenger records** and **12 columns**, including the target variable `Survived`.
This is a small dataset that fits easily in memory, which makes it convenient for both analysis and simple machine learning models.


In [3]:
df.shape

(891, 12)

### 👀 2. How Does the Data Look?

#### 🧠 Code Used
- `df.sample(5)`: displays 5 random rows, which avoids the ordering bias of inspecting only the first rows.

#### 📊 Output (Sample)
The dataset includes columns such as:
- PassengerId
- Survived
- Pclass
- Name
- Sex
- Age
- SibSp
- Parch
- Ticket
- Fare
- Cabin
- Embarked

#### 📝 Explanation
This step helps us understand:
- The type of information stored
- The mix of categorical and numerical features
- Where missing values show up in practice (for example, `Age` and `Cabin`)


In [4]:
df.sample(5)

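|     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-----|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 780 | 781 | 1 | 3 | Ayoub, Miss. Banoura | female | 13.0 | 0 | 0 | 2687 | 7.2292 | NaN | C |
| 52 | 53 | 1 | 1 | Harper, Mrs. Henry Sleeper (Myna Haxtun) | female | 49.0 | 1 | 0 | PC 17572 | 76.7292 | D33 | C |
| 375 | 376 | 1 | 1 | Meyer, Mrs. Edgar Joseph (Leila Saks) | female | NaN | 1 | 0 | PC 17604 | 82.1708 | NaN | C |
| 444 | 445 | 1 | 3 | Johannesen-Bratthammer, Mr. Bernt | male | NaN | 0 | 0 | 65306 | 8.1125 | NaN | S |
| 497 | 498 | 0 | 3 | Shellard, Mr. Frederick William | male | NaN | 0 | 0 | C.A. 6212 | 15.1 | NaN | S |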
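
Because `df.sample(5)` draws different rows on every run, the rows shown here will not reappear when the notebook is re-executed. The sketch below shows one way to make the draw repeatable with the `random_state` parameter. The `toy` frame is a made-up stand-in for `train.csv`, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

# A fixed seed makes the random draw repeatable across runs.
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)

# Both draws contain exactly the same rows in the same order.
print(first_draw.equals(second_draw))
```

On the real data the same idea would be `df.sample(5, random_state=42)`.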


### 🏷️ 3. What Are the Data Types of Columns?

#### 🧠 Code Used
- `df.info()`: prints each column's name, non-null count, and dtype, plus the frame's memory usage.

#### 📊 Output Summary
- **Numerical Columns (int/float):**
  - PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
- **Categorical Columns (object):**
  - Name, Sex, Ticket, Cabin, Embarked

#### 📝 Explanation
- `Age`, `Cabin`, and `Embarked` contain missing values (their non-null counts are below 891)
- Columns stored as `object` need encoding before model training
- No incorrect data types detected, although `Survived` and `Pclass` are stored as integers while representing categories, and `PassengerId` is an identifier rather than a measurement


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
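
As a rough illustration of the encoding step mentioned in the explanation, the sketch below splits columns by dtype and one-hot encodes the text columns with `pd.get_dummies`. The `toy` frame is a made-up stand-in for `train.csv`, and one-hot encoding is only one of several possible encodings, not necessarily the one used later in this analysis.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Embarked": ["S", "C", "S", None],
})

# Split columns by dtype instead of listing them by hand.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the text columns; a missing value becomes all-zero indicators.
encoded = pd.get_dummies(toy, columns=text_cols)

print(numeric_cols)
print(text_cols)
print(encoded.columns.tolist())
```

High-cardinality columns such as `Name` and `Ticket` would produce one indicator column per distinct value, so they are usually handled differently from `Sex` and `Embarked`.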


### ❓ 4. Are There Any Missing Values?

#### 🧠 Code Used
- `df.isnull().sum()`: counts missing values in each column.

#### 📊 Output
- Age → 177 missing values
- Cabin → 687 missing values
- Embarked → 2 missing values
- All other columns → 0 missing values

#### 📝 Explanation
- **Cabin** is missing in 687 of 891 rows (about 77%) and may be dropped
- **Age** is missing in 177 rows (about 20%) and can be filled using the mean or median
- **Embarked** is missing in only 2 rows and can be filled using the mode


In [6]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
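
The handling suggested in the explanation can be sketched as follows. The `toy` frame is a made-up stand-in for `train.csv`, and the choice of median for `Age` is an assumption made here for illustration; the mean would work the same way with `.mean()`.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean() * 100

# Drop the mostly empty column, then fill the remaining gaps.
cleaned = toy.drop(columns=["Cabin"])
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

print(missing_pct.round(1))
print(cleaned.isnull().sum())
```

Working on a new frame (`cleaned`) keeps the raw data available for comparison while the imputation choices are still being evaluated.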

### 📐 5. How Does the Data Look Mathematically?

#### 🧠 Code Used
- `df.describe()`: generates descriptive statistics for numerical columns.

#### 📊 Key Observations
- Mean Age ≈ 29.7 years
- Average Fare ≈ 32.2
- Minimum Fare = 0
- Survival Rate ≈ 38%

#### 📝 Explanation
- The wide fare range (0 to 512.33, with a median of only 14.45) indicates possible outliers
- The mean of `Survived` (0.38) shows a moderate class imbalance: about 38% survived and 62% did not
- Half of the recorded ages fall between about 20 and 38 years, so young and middle-aged adults dominate


In [7]:
df.describe()

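|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|-------|-------------|----------|--------|-----|-------|-------|------|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |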
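
One common way to turn "possible outliers" into a concrete threshold is the 1.5 × IQR rule. This rule is a convention chosen here for illustration, not something the analysis above prescribes. The sketch uses the `Fare` quartiles reported by `describe()`; the helper `iqr_fences` and the short `fares` series are hypothetical.

```python
import pandas as pd

def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of Fare taken from the describe() output (25% and 75% rows).
lower, upper = iqr_fences(7.9104, 31.0)

# Hypothetical fares, including the reported minimum (0) and maximum (512.3292).
fares = pd.Series([0.0, 7.25, 14.4542, 31.0, 76.7292, 512.3292], name="Fare")

# Keep only the values that fall outside the fences.
flagged = fares[(fares < lower) | (fares > upper)]

print(round(lower, 4), round(upper, 4))
print(flagged.tolist())
```

With these quartiles the upper fence is about 65.63, so any fare above that would be flagged. The lower fence is negative, so no fare, including the zero fares, is flagged on the low side.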


### ♻️ 6. Are There Duplicate Values?

#### 🧠 Code Used
- `df.duplicated().sum()`: counts rows that exactly repeat an earlier row.

#### 📊 Output
- **Duplicate rows:** 0

#### 📝 Explanation
No duplicate records are present, so no action is required in this step.


In [8]:
df.duplicated().sum()

np.int64(0)
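
One caveat worth illustrating: `df.duplicated()` compares whole rows, so if `PassengerId` is a unique identifier (as its 1 to 891 range in the `describe()` output suggests), two otherwise identical records would never count as duplicates. The sketch below, on a made-up frame with invented names and tickets, compares rows with and without the identifier column.

```python
import pandas as pd

# Hypothetical rows: passengers 2 and 3 differ only in PassengerId.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Doe, Mr. A", "Roe, Miss. B", "Roe, Miss. B"],
    "Ticket": ["111", "222", "222"],
})

# Whole-row comparison: the unique identifier hides the repeat.
full_row_dupes = int(toy.duplicated().sum())

# Compare on every column except the identifier.
content_cols = [col for col in toy.columns if col != "PassengerId"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(full_row_dupes, content_dupes)
```

Whether such content-level repeats should be removed is a judgment call, since two different passengers can legitimately share many attribute values.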

### 🔗 7. How Is the Correlation Between Columns?

#### 🧠 Code Used
- `df.corr(numeric_only=True)['Survived']`: calculates the Pearson correlation of every numerical feature with the target variable.

#### 📊 Output (Important Correlations)
- Pclass → -0.34 (Negative correlation)
- Fare → +0.26 (Positive correlation)
- SibSp → -0.04 (Weak)
- Parch → +0.08 (Weak)
- Age → -0.08 (Weak)

#### 📝 Explanation
- `Pclass` is coded 1 for first class and 3 for third, so the negative correlation means first-class passengers had better survival chances
- Higher fares are associated with a higher survival probability
- Age and family size (`SibSp`, `Parch`) show only weak linear correlation with survival


In [9]:
# Plain df.corr() raises "ValueError: could not convert string to float:
# 'Braund, Mr. Owen Harris'" on recent pandas, because the frame has text columns.
df.corr(numeric_only=True)['Survived']
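
Text columns such as `Name` cannot be correlated directly, which is why the correlation has to be restricted to numeric columns. The sketch below shows that restriction and, as an optional extra step not taken in this analysis, encodes `Sex` as 0/1 so it can be included. The `toy` frame and the `IsFemale` column name are hypothetical, and `numeric_only` assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Name": ["A", "B", "C", "D", "E", "F"],
})

# Restrict the correlation to numeric columns so text columns are skipped.
numeric_corr = toy.corr(numeric_only=True)["Survived"]

# Encode Sex as 0/1 so it can take part in the correlation as well.
with_sex = toy.assign(IsFemale=(toy["Sex"] == "female").astype(int))
sex_corr = with_sex.corr(numeric_only=True)["Survived"]

print(numeric_corr.index.tolist())
print(sex_corr.index.tolist())
```

In this toy frame `IsFemale` matches `Survived` row for row, so their correlation is 1 by construction; it says nothing about the real data.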