# 📊 Understanding Your Data (Before EDA)

---

# 📌 Introduction

When you receive a new dataset, **do NOT directly start modeling**.

First step in Machine Learning:

```
Understand the Data
```

This phase includes:

- Asking basic questions
- Checking structure
- Identifying problems
- Understanding patterns

Example Dataset Used:
Titanic Dataset (Kaggle Competition Dataset)

---

# 🧠 First Step: Ask Basic Questions

When you get a dataset, ask:

1. How big is the data?
2. What does it look like?
3. What are the data types?
4. Are there missing values?
5. What is the statistical summary?
6. Are there duplicate rows?
7. Is there correlation between variables?

---

# 1️⃣ How Big is the Dataset?

Use:

```python
df.shape
```

Output:
```
(rows, columns)
```

Example:
```
(891, 12)
```

Meaning:
- 891 rows
- 12 columns

✔ Helps understand dataset size  
✔ Important for memory planning  
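
As a rough sketch of the size check, the snippet below reads the shape and the per-column memory of a small stand-in frame. The three rows are made-up values, not the Titanic file, and `deep=True` is passed so that string columns are measured by their actual contents.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Fare": [7.25, 71.28, 7.92],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")

# Bytes used per column (plus the index); deep=True inspects string contents too
mem_bytes = toy.memory_usage(deep=True)
print(mem_bytes)
print(f"total: {mem_bytes.sum() / 1024:.2f} KB")
```

On a real dataset, replace `toy` with your own `df`.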

---

# 2️⃣ What Does the Data Look Like?

Use:

```python
df.head()
```

Shows the first 5 rows.

⚠ Problem:
The first rows may not be representative, for example when the file is sorted by some column.

Better approach:

```python
df.sample(5)
```

✔ Shows 5 random rows  
✔ Gives a better overall picture  
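
A small illustration of why random rows can be more informative, assuming a toy frame that happens to be sorted by class (the values are invented). Passing `random_state` is optional; it is used here only so the same rows come back on every run.

```python
import pandas as pd

# Toy frame sorted by class, so head() only ever shows class 1 (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [1] * 5 + [2] * 5 + [3] * 5,
    "row": list(range(15)),
})

first_rows = toy.head()                      # first 5 rows: all Pclass == 1
random_rows = toy.sample(5, random_state=0)  # 5 random rows, same ones on every run

print(first_rows["Pclass"].unique())
print(random_rows)
```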

---

# 3️⃣ What Are the Data Types?

Use:

```python
df.info()
```

This shows:

- Column names
- Non-null counts
- Data types
- Memory usage

### Data Types:
- int64 → Integer
- float64 → Decimal
- object → String (newer pandas versions report `str`)
- bool → Boolean

⚠ Important:
Sometimes numeric columns are stored as object (text).
Convert them to a numeric dtype so that statistics work and memory use drops.

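Example:
```python
# Age has missing and fractional values, so astype("int") would raise; parse to numeric instead
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
```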
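
The following sketch shows the conversion on a made-up frame where a numeric column was read as text because of one stray entry. The column contents are assumptions for illustration; `pd.to_numeric` and `astype` are standard pandas calls.

```python
import pandas as pd

# Hypothetical column that was read as text because of a stray "unknown" entry
raw = pd.DataFrame({
    "Age": ["22", "38", "unknown", "0.42"],
    "Survived": [0, 1, 1, 0],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

# Small integer codes fit in int8, which uses less memory than int64
raw["Survived"] = raw["Survived"].astype("int8")

print(raw.dtypes)
```

Coercing hides bad entries as NaN, so it is worth checking `isnull().sum()` again after the conversion.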

---

# 4️⃣ Are There Missing Values?

Very important step.

### Method 1:
Check using `info()`

### Method 2 (Best Way):

```python
df.isnull().sum()
```

Output example:
```
Age        177
Cabin      687
Embarked     2
```

✔ Shows number of missing values per column  
✔ Helps decide:
- Fill values?
- Drop column?
- Drop rows?
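
To support the fill-or-drop decision, it often helps to look at the share of missing values rather than the raw count. For the counts shown above, 687 of 891 Cabin values is about 77% missing, while 177 of 891 Age values is about 20%. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100  # mean of booleans = fraction missing

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report.sort_values("percent", ascending=False))
```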

---

# 5️⃣ Statistical Summary (Numerical Columns)

Use:

```python
df.describe()
```

Gives:

- Count
- Mean
- Std deviation
- Min
- 25%
- 50% (Median)
- 75%
- Max

Example insights:
- Average Age ≈ 29.7
- Minimum Age = 0.42
- Maximum Age = 80

✔ Helps detect outliers  
✔ Helps understand distribution  
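
One common way to turn the 25% and 75% values into an outlier check is the 1.5 × IQR rule of thumb. This is a convention rather than something `describe()` does for you, and the fares below are invented for illustration.

```python
import pandas as pd

# Toy fares with one extreme value (hypothetical numbers)
fare = pd.Series([7.25, 7.92, 8.05, 13.0, 26.0, 31.0, 53.1, 512.33])

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # common 1.5 * IQR rule of thumb

# Values outside [lower, upper] are flagged as candidate outliers
outliers = fare[(fare < lower) | (fare > upper)]
print(outliers)
```

Flagged values are candidates to inspect, not automatically errors.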

---

# 6️⃣ Are There Duplicate Rows?

Duplicates can harm ML models: repeated rows get extra weight during training and can leak between training and test splits.

Check using:

```python
df.duplicated().sum()
```

If result = 0  
→ No duplicates

If > 0  
→ Remove using:

```python
df.drop_duplicates(inplace=True)
```
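
Before dropping anything, it can be useful to look at the duplicated rows themselves. The sketch below uses a made-up frame; `keep=False` and `subset=` are standard options of `duplicated()`.

```python
import pandas as pd

# Toy frame where rows 0 and 2 are identical (hypothetical values)
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Ticket": ["A1", "B2", "A1", "A1"],
})

n_dupes = toy.duplicated().sum()                       # repeats after the first occurrence
all_copies = toy[toy.duplicated(keep=False)]           # every copy, including the first
same_ticket = toy.duplicated(subset=["Ticket"]).sum()  # compare on one column only

# Non-inplace variant: returns a new frame and leaves toy unchanged
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(all_copies), same_ticket, len(deduped))
```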

---

# 7️⃣ Correlation Between Variables

Correlation tells:

Do two variables move together? (It does not show that one causes the other.)

Use:

```python
df.corr(numeric_only=True)
```

This calculates the **Pearson correlation** (strength of the linear relationship)  
Range: -1 to +1

- +1 → Perfect positive correlation  
- -1 → Perfect negative correlation  
- 0 → No linear correlation  

Example insights (Titanic Dataset):

- Pclass ↔ Survived → Negative correlation (≈ -0.34)  
  (Higher class number = lower survival)

- Fare ↔ Survived → Positive correlation (≈ 0.26)  
  (Higher fare = higher survival chance)

- PassengerId ↔ Survived → Almost no correlation (≈ -0.005); it is only a row identifier

✔ Helps remove unnecessary columns  
✔ Helps feature selection  
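
To make the number less of a black box, the sketch below writes out Pearson's r by hand and compares it with the pandas result, then ranks each feature against the target. The six rows are invented, so the values will not match the Titanic figures.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not the Titanic file
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2],
    "Fare":     [7.25, 71.28, 7.92, 53.1, 8.05, 13.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Pearson r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x, y = toy["Fare"], toy["Survived"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# Same number from pandas, plus every feature ranked against the target
corr = toy.corr(numeric_only=True)
print(round(r_manual, 4), round(corr.loc["Fare", "Survived"], 4))
print(corr["Survived"].drop("Survived").sort_values())
```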

---

# 📊 Correlation Concept

```
-1  ←──────── 0 ────────→ +1
Negative      No relation      Positive
```

---

# 🔥 Complete Workflow

```
Get Dataset
      ↓
df.shape
      ↓
df.head() / df.sample()
      ↓
df.info()
      ↓
df.isnull().sum()
      ↓
df.describe()
      ↓
df.duplicated().sum()
      ↓
df.corr(numeric_only=True)
```
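
The steps above can be bundled into one helper. `quick_overview` is a hypothetical name introduced here for illustration, not a pandas function, and the frame it is called on is made up.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> dict:
    """Collect the basic checks in one dictionary (hypothetical helper)."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        "summary": df.describe(),
        "correlation": df.corr(numeric_only=True),
    }

# Toy frame (hypothetical values) to show the call
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 22.0],
    "Fare": [7.25, 71.28, 7.92, 7.25],
    "Sex": ["male", "female", "female", "male"],
})
overview = quick_overview(toy)
print(overview["shape"], overview["missing"], overview["duplicates"])
```

`df.info()` is left out because it prints its report instead of returning a value.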

---

# 🎯 Next Step After This

After understanding the basic structure →  
Start:

## 📈 Exploratory Data Analysis (EDA)

EDA includes:
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
- Visualization (plots)

---

# 🚀 Important Rule

Never jump directly to model building.

Always follow:

```
Understand Data → Clean Data → Analyze → Then Model
```

---


In [41]:
import pandas as pd

In [42]:
df = pd.read_csv(r"C:\Users\chhij\Downloads\titanic.csv.crdownload")

In [43]:
df.shape

(891, 12)

In [44]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [45]:
df.tail(2)

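| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |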


In [46]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    str    
 4   Sex          891 non-null    str    
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    str    
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    str    
 11  Embarked     889 non-null    str    
dtypes: float64(2), int64(5), str(5)
memory usage: 83.7 KB


In [47]:
df.isnull()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | False | False | False | False | False | False | False | False | False | False | True | False |
| 887 | False | False | False | False | False | False | False | False | False | False | False | False |
| 888 | False | False | False | False | False | True | False | False | False | False | True | False |
| 889 | False | False | False | False | False | False | False | False | False | False | False | False |


In [48]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [49]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [50]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

In [51]:
df.duplicated().sum()

np.int64(0)

In [52]:
df.drop_duplicates(inplace=True)

In [55]:
df.corr(numeric_only=True)

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 |
