# Understanding Data

### Import Pandas Library
- Read the Titanic dataset into a DataFrame

In [None]:
import pandas as pd
df = pd.read_csv('titanic_dataset_train.csv')
df

### 1. How Big is the Dataset?

In [None]:
df.shape

### 2. What Does the Data Look Like?

In [None]:
df.head()   # Display first 5 rows
df.tail()   # Display last 5 rows
df.sample(5) # Display 5 random rows

### 3. What Are the Data Types of Each Feature?
- (A feature is a column in the dataset.)

In [None]:
df.info()

### 4. Are there Missing Values in the Dataset?

In [None]:
df.isnull() # Displays True/False for missing values. 
df.isnull().sum() # Sum of missing values per column.
#df.isnull().mean() # Proportion of missing values per column.
#df.isnull().mean().sort_values(ascending=False) # Sorted proportion of missing values per column.
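
To see how the count and the proportion relate, here is a minimal sketch on a hypothetical toy DataFrame (the column names echo the Titanic data, but the values are made up for illustration). The variable names `toy`, `missing_count`, and `missing_share` are assumptions of this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file: 4 rows with a few known gaps.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column

print(missing_count)
print(missing_share.sort_values(ascending=False))  # worst column first
```

Because `True` counts as 1 and `False` as 0, `.sum()` gives the number of missing values and `.mean()` gives their share of the rows.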

### 5. Summary Statistics of *Numerical Features* (Descriptive Statistics): How Does the Data Look Mathematically?

In [None]:
df.describe() # only for numerical features (int and float)

### 6. Summary Statistics of *Categorical Features* 

In [None]:
df.describe(include=['O']) # only for categorical features (object)
# top - most frequent value
# freq - frequency of the most frequent value
# unique - number of unique values
# count - number of non-missing values



#df.describe(include='all') # for all features 
#df['Age'].describe() # describe for a specific column
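
The four statistics named in the comments above can be checked by hand on a small example. This is a sketch on hypothetical toy data (the values are invented), using `describe()` on a single text column.

```python
import pandas as pd

# Hypothetical toy data: one text column with a missing value.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46, 53.10],
})

# For a text column, describe() reports count, unique, top, and freq.
summary = toy["Embarked"].describe()
print(summary)
```

Here the column has 4 non-missing values, 2 distinct values, and "S" is the most frequent, appearing 3 times.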

### 7. Are there Duplicated Rows in the Dataset?
For example, `Duplicated_Rows = df.duplicated()` followed by `print("Number of Duplicated Rows: ", Duplicated_Rows.sum())` stores the boolean mask and prints the number of duplicated rows.

In [None]:
df.duplicated() # Returns a boolean Series denoting duplicate rows.
df.duplicated().sum() # Number of duplicated rows
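
As a small illustration of what the boolean Series means, the sketch below uses a hypothetical four-row DataFrame (names and ages are made up) in which one row repeats an earlier one.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 0.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cara"],
    "Age": [30, 41, 30, 25],
})

dup_mask = toy.duplicated()          # first occurrence is False, repeats are True
n_duplicates = int(dup_mask.sum())   # True counts as 1

print(dup_mask)
print("Number of duplicated rows:", n_duplicates)
print(toy[dup_mask])                 # show only the repeated rows
```

By default only the later copies are flagged, so the first "Ann" row is not counted as a duplicate.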

### 8. Correlation between Numerical Features

- What is correlation? 
Correlation is a statistical measure of how strongly two variables change together. The Pearson correlation coefficient indicates the strength and direction of the linear relationship between two numerical features and ranges from -1 to +1, where:
  - +1 indicates a perfect positive correlation (as one variable increases, the other also increases).
  - 0 indicates no linear correlation.
  - -1 indicates a perfect negative correlation (as one variable increases, the other decreases).
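
The two extremes can be reproduced with a minimal sketch. The toy columns `x`, `up`, and `down` are hypothetical and constructed so that the relationships are exactly linear; the manual calculation at the end is one standard way of writing the Pearson coefficient, included only for illustration.

```python
import pandas as pd

# Hypothetical toy data: "up" rises with x, "down" falls as x rises.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["up"] = 2 * toy["x"] + 1
toy["down"] = 10 - 3 * toy["x"]

corr = toy.corr()  # Pearson by default; all columns here are numeric
print(corr.round(2))

# The same coefficient for x and "down", computed from its definition:
# the summed product of deviations divided by the product of their magnitudes.
dx = toy["x"] - toy["x"].mean()
dy = toy["down"] - toy["down"].mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5
print(round(r_manual, 2))
```

In the resulting matrix, `x` against `up` sits at the positive extreme and `x` against `down` at the negative extreme, and the manual value agrees with the matrix entry.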

- What is the `method` parameter in `df.corr()`?
The `method` parameter specifies how the correlation between numerical features is calculated. The available methods are:
  - 'pearson': The default. Calculates the Pearson correlation coefficient, which measures the linear relationship between two continuous variables.
  - 'spearman': Calculates the Spearman rank correlation coefficient, which assesses how well the relationship between two variables can be described by a monotonic function. It is suitable for ordinal data or for monotonic but non-linear relationships.
  - 'kendall': Calculates the Kendall tau correlation coefficient, which measures the ordinal association between two variables. It is also suitable for ordinal data and is less sensitive to outliers than Pearson correlation.
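
The difference between a linear and a monotonic measure shows up on a relationship that always increases but is not a straight line. This sketch uses a hypothetical toy column `y = x ** 3`; only 'pearson' and 'spearman' are compared to keep it short.

```python
import pandas as pd

# Hypothetical toy data: y always increases with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print("pearson: ", round(pearson, 3))
print("spearman:", round(spearman, 3))
```

Spearman works on ranks, and the ranks of `x` and `y` are identical here, so it reports a perfect relationship while Pearson stays below 1.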

- What is the `numeric_only` parameter in `df.corr()`?
The `numeric_only` parameter tells pandas whether to use only numeric columns when calculating correlation:
  - `numeric_only=True` → use only numeric columns (ignore text, dates, etc.)
  - `numeric_only=False` → try to include all columns (may raise an error if non-numeric data exists)

Setting it to `True` avoids errors when your DataFrame has non-numeric data.
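
A quick way to see the effect is a DataFrame with one text column. The sketch below uses hypothetical toy data and assumes a pandas version in which `corr()` accepts `numeric_only` (1.5 or newer). What happens with `numeric_only=False` depends on the pandas version, so the outcome is caught and printed rather than assumed.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column.
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

corr_numeric = toy.corr(numeric_only=True)  # "Sex" is left out
print(corr_numeric)

# With numeric_only=False pandas tries to use every column.
try:
    toy.corr(numeric_only=False)
    outcome = "no error"
except (ValueError, TypeError) as exc:
    outcome = type(exc).__name__
print("numeric_only=False ->", outcome)
```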

**color-coded correlation table (heatmap style)**

`.style.background_gradient(...)` → `.style` is an accessor that returns a Styler object for the table, and `background_gradient()` is a Styler method that colors each cell's background according to its value.

`cmap='coolwarm'` → a parameter that specifies the color map used for the gradient. 'coolwarm' is a popular choice that runs from cool colors (blue) for low values to warm colors (red) for high values.

- Red = strong positive correlation

- Blue = strong negative correlation

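- White/light = weak or no correlation

Note that, by default, `background_gradient()` scales the colors to each column's own minimum and maximum, so the colors are relative within a column rather than fixed to -1, 0, and +1.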

Together, it shows a color-coded correlation table (heatmap style) that makes relationships easy to see at a glance.
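
The red/blue/white reading holds exactly only when the color scale is pinned to the full correlation range, because by default `background_gradient()` scales each column to its own minimum and maximum. One possible way to pin it is sketched below. The helper name `correlation_heatmap` is hypothetical; `axis`, `vmin`, and `vmax` are existing `background_gradient()` parameters. Rendering the result needs the optional `jinja2` and `matplotlib` packages.

```python
import pandas as pd


def correlation_heatmap(frame: pd.DataFrame):
    """Return a styled correlation table with the color scale pinned to [-1, 1]."""
    corr = frame.corr(numeric_only=True)
    # axis=None applies one shared scale to the whole table;
    # vmin/vmax fix blue at -1 and red at +1, so the midpoint is 0.
    return corr.style.background_gradient(
        cmap="coolwarm", axis=None, vmin=-1, vmax=1
    )


# In a notebook cell, the last expression is rendered as the colored table:
# correlation_heatmap(df)
```

With the scale fixed this way, the same color means the same correlation value in every column.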

In [None]:
df.corr(numeric_only=True) # Correlation between numerical features
df.corr(method='pearson', numeric_only=True) # Pearson correlation (default)
df.corr(method='spearman', numeric_only=True) # Spearman correlation
df.corr(method='kendall', numeric_only=True) # Kendall correlation

df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm') # Heatmap style correlation matrix