# Lab 2: Iris Dataset Analysis: Data - Inspect, Explore, Split 
**Author:** Mhamed  
**Date:** 03, 21, 2025 
 
**Objective:** The objective of this project is to perform a comprehensive analysis of the Iris dataset with the goal of building a predictive model for .............


## Introduction
This project analyzes the Titanic dataset to predict passenger survival based on various features. It involves importing and inspecting the data, performing exploratory data analysis, cleaning the data, engineering new features, and selecting relevant input features for modeling. The project also explores methods for splitting the data into training and testing sets to evaluate the performance of a machine learning model.

## Section 1. Import and Inspect the Data
In the code cell below, import the necessary Python libraries for this notebook.  

In [19]:
# This is a Python cell
# All imports should be at the top of the notebook
# This cell will be executed when the notebook is loaded

# Import pandas for data manipulation and analysis (we might want to do more with it)
import pandas as pd

# Import numpy for numerical operations on arrays
import numpy as np

# Import matplotlib for creating static visualizations
import matplotlib.pyplot as plt

# Import seaborn for statistical data visualization (built on matplotlib)
import seaborn as sns

# Import the California housing dataset from sklearn
from sklearn.datasets import fetch_california_housing

# Import train_test_split for splitting data into training and test sets
from sklearn.model_selection import train_test_split

# Import LinearRegression for building a linear regression model
from sklearn.linear_model import LinearRegression

# Import performance metrics for model evaluation
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

# Import scatter_matrix for pairwise scatter plots of numeric columns
from pandas.plotting import scatter_matrix

In [20]:
# Section 1. Import and Inspect the Data
# 1.1 Load the dataset and display the first 10 rows

# Load Iris dataset
df = sns.load_dataset('iris')

# Show the first 10 rows
df.head(10)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa


In [21]:
# Section 1. Import and Inspect the Data
# 1.1 (continued) Inspect structure, missing values, summary statistics, and correlations

# Reload the Iris dataset so this cell can run on its own
df = sns.load_dataset('iris')

# If a command is not the last statement in a Python cell, wrap it in print() to display its output.
print('Info:')
print(df.info())
print('First 10 Rows:')
print(df.head(10))
print('Missing Values:')
print(df.isnull().sum())
print('Summary Statistics:')
print(df.describe())
print('Numeric Correlations:')
print(df.corr(numeric_only=True))

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
First 10 Rows:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7    


### 1.2 Check for missing values and display summary statistics

In the cell below:
1. Use `info()` to check data types and missing values.
2. Use `describe()` to see summary statistics.
3. Use `isnull().sum()` to identify missing values in each column.

Example code:

data_frame.info()

data_frame.describe()

data_frame.isnull().sum()
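
As a minimal sketch of what the three calls above report, the snippet below runs them on a small hypothetical frame (`toy`, with made-up columns `length`, `width`, and `label`; none of this is the lab data) that contains one missing number and one missing label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the lab data) with two missing entries
toy = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 4.6],
    "width": [3.5, 3.0, 3.2, 3.1],
    "label": ["a", "b", None, "a"],
})

# info() prints dtypes and non-null counts directly
toy.info()

# isnull().sum() counts missing values per column
missing_per_column = toy.isnull().sum()

# describe() summarizes numeric columns only by default
summary = toy.describe()

print(missing_per_column)
print(summary)
```

Note that `describe()` silently skips the non-numeric `label` column, so `isnull().sum()` is the more reliable way to spot gaps in categorical data.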

In [22]:
# Python
# Check data types and missing values

print("Data Info:")
df.info()


# Summary statistics
print("Summary Statistics:")
print(df.describe())


# Check for missing values in each column
print("Missing values:")
df.isnull().sum()

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Summary Statistics:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [23]:
# Check for correlations using the corr() method

print("\nCorrelation Matrix (numeric features only):")
print(df.corr(numeric_only=True))


Correlation Matrix (numeric features only):
              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000


## Reflection 1
1) How many data instances are there? 891 rows
2) How many features are there? 15 features (columns)
3) What are the names? survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, alone
4) Are there any missing values? Yes: age (177), embarked (2), embark_town (2), deck (688).
5) Are there any non-numeric features? Yes: sex, embarked, class, who, embark_town, alive
6) Are the data instances sorted on any of the attributes? No
7) What are the units of age? Years
8) What are the minimum, median and max age? Min is 0.42, median is 28, and max is 80
9) What two different features have the highest correlation? Parch and Sibsp at 0.414838
10) Are there any categorical features that might be useful for prediction? Yes, sex might be a useful feature.
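
Question 9 can be answered by reading the correlation matrix by eye, but it can also be computed. The sketch below is one possible approach, using a hypothetical helper `top_correlated_pair` and made-up toy columns `a`, `b`, `c`, and `label` (assumptions for illustration, not part of the lab). It keeps only the upper triangle of the matrix so that each pair appears once and the diagonal of 1.0 values is ignored.

```python
import numpy as np
import pandas as pd

def top_correlated_pair(frame: pd.DataFrame):
    """Return (feature_a, feature_b, r) for the strongest absolute correlation."""
    corr = frame.corr(numeric_only=True)
    # Upper triangle only: k=1 excludes the diagonal and mirrored duplicates
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    best = pairs.abs().idxmax()
    return best[0], best[1], pairs.loc[best]

# Hypothetical toy data (not the lab data)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
    "label": ["x", "y", "x", "y", "x"],  # ignored by numeric_only=True
})

feature_a, feature_b, r = top_correlated_pair(toy)
print(feature_a, feature_b, round(r, 4))
```

Taking the absolute value before `idxmax()` means a strong negative correlation would also be reported; the returned `r` keeps its sign.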

## Section 2. Data Exploration and Preparation
Next, we explore the dataset with charts.

### 2.1 Explore Data Patterns and Distributions
Create a scatter matrix.
Since the Titanic dataset contains both numeric and categorical variables, we'll use only numeric values here.

Important: Use only numeric attributes for the scatter matrix. If you want to explore categorical data, use count plots and bar plots instead.

In [None]:
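# Sections 2 to 4 use Titanic columns (age, fare, pclass, ...), so load the Titanic dataset here
df = sns.load_dataset('titanic')

# Select only numeric features
attributes = ['age', 'fare', 'pclass']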
# Create scatter matrix
scatter_matrix(df[attributes], figsize=(10, 10), color='darkgreen')

# Title
plt.suptitle("Scatter Matrix: Age, Fare, Pclass")
# Show the plot
plt.tight_layout()
plt.show()

#### 2.1.1 Create a scatter plot of age vs fare, colored by gender

In [None]:
import matplotlib.pyplot as plt

# Create the scatter plot
plt.figure(figsize=(10, 6))

# Plot male passengers ('sex' is still a string column at this point)
male_data = df[df['sex'] == 'male']
plt.scatter(
    male_data['age'], 
    male_data['fare'], 
    c='blue', 
    label='Male', 
    alpha=0.6
)

# Plot female passengers
female_data = df[df['sex'] == 'female']
plt.scatter(
    female_data['age'], 
    female_data['fare'], 
    c='red', 
    label='Female', 
    alpha=0.6
)

# Labels and title
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs Fare by Gender')

# Add a legend
plt.legend(title="Gender")

# Add grid
plt.grid(True)

# Show the plot
plt.show()

#### 2.1.2 Create a histogram of age ("Age Distribution")

In [None]:
# Histogram of Age
# Plot histogram with KDE (Kernel Density Estimate)
sns.histplot(df['age'], kde=True, color='red')

# Add title and display
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.grid(True)
plt.show()

#### 2.1.3 Create a count plot for class and survival

In [None]:
# Count Plot - Class Distribution by Survival
# Count plot: Passenger class with survival hue
sns.countplot(x='class', hue='survived', data=df)

# Add title and display
plt.title('Class Distribution by Survival')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.legend(title='Survived', labels=['Not Survived', 'Survived'])
plt.grid(True)
plt.show()

### Reflection 2.1:

* What patterns or anomalies do you notice?
    - First-class passengers pay higher fares, while third-class passengers pay lower fares.
    - Age Distribution: The distribution skews toward younger passengers, possibly indicating more children or young adults.
    - Age has 177 missing values and deck has 688 missing values.
    - Third-class passengers dominate the dataset, with lower survival rates

* Do any features stand out as potential predictors?
    - Class: Strongly correlated with fare and survival, with survival rates varying across classes.
    - Sex: Women had a higher survival rate, making it an important predictor.

* Are there any visible class imbalances?
    - There is a clear imbalance between the survival rates in different classes, with third-class passengers having a significantly lower survival rate compared to those in first or second class.    
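
The reflection above mixes two different quantities: how many passengers are in each class, and what fraction of each class survived. A minimal sketch of how the two could be computed separately is shown here on a hypothetical eight-row frame (`toy`, made-up values, not the lab data); the column names `pclass` and `survived` mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data (not the lab data)
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Share of passengers in each class (class imbalance in the feature)
class_share = toy["pclass"].value_counts(normalize=True).sort_index()

# Mean of a 0/1 column is the survival rate within each class
survival_rate = toy.groupby("pclass")["survived"].mean()

print(class_share)
print(survival_rate)
```

On the real data, the same two lines applied to `df` would give numbers to back up what the count plot suggests visually.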

### 2.2 Handle Missing Values and Clean Data

In [None]:
# Fill missing values in 'age' with the median age
df['age'] = df['age'].fillna(df['age'].median())

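# Fill missing values in 'embark_town' with the most frequent value (mode)
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

# 'embarked' is the coded form of 'embark_town' and has the same 2 gaps, so fill it the same way
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])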
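
To see what median and mode imputation do, the following sketch applies the same two `fillna` patterns to a hypothetical four-row frame (`toy`, made-up values, not the lab data) and compares the missing-value counts before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the lab data) with one gap per column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "embark_town": ["Southampton", None, "Southampton", "Cherbourg"],
})

before = toy.isnull().sum()

# Numeric column: the median is robust to outliers
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: mode() returns a Series, so take its first entry
toy["embark_town"] = toy["embark_town"].fillna(toy["embark_town"].mode()[0])

after = toy.isnull().sum()

print("Missing before:\n", before)
print("Missing after:\n", after)
```

Checking `isnull().sum()` again after imputation is a cheap way to confirm that no gaps were overlooked.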

### 2.3 Feature Engineering

In [None]:
# 1. Create a new feature: Family size
df['family_size'] = df['sibsp'] + df['parch'] + 1

# 2. Convert categorical variables to numeric
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['embarked'] = df['embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# 3. Convert the existing boolean 'alone' column to 0/1 integers
df['alone'] = df['alone'].astype(int)

# Print outcome
print(df[['sex', 'embarked', 'family_size', 'alone']].head())

### Reflection 2.3

- Why might family size be a useful feature for predicting survival? Family size could influence Titanic survival chances, as people traveling with family might have had a better chance of survival because family groups were more likely to stay together.

- Why convert categorical data to numeric? Converting categorical data to numeric format is a common step in preparing data for machine learning algorithms. The sex and embarked columns are transformed into numerical values so they can be directly used in predictive models.
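
Mapping `embarked` to 0, 1, 2 gives the ports an order that may not mean anything to a model. One common alternative, sketched below as an assumption rather than as the method this lab requires, is one-hot encoding with `pd.get_dummies`. The frame `toy` is hypothetical (made-up rows, not the lab data).

```python
import pandas as pd

# Hypothetical toy data (not the lab data)
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["C", "Q", "S", "S"],
})

# Ordinal mapping, as in the cell above: implies C < Q < S
ordinal = toy["embarked"].map({"C": 0, "Q": 1, "S": 2})

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(toy, columns=["embarked"], dtype=int)

print(ordinal.tolist())
print(one_hot)
```

For a two-valued column such as `sex`, a single 0/1 column already carries all the information, so the `map` approach is sufficient there.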

## Section 3. Feature Selection and Justification

### 3.1 Feature Selection and Target Variable

- Select two or more input features (numerical for regression, numerical and/or categorical for classification).
- Select a target variable (as applicable).

For classification, we’ll use survived as the target variable.

- Input features: age, fare, pclass, sex, family_size
- Target: survived

### 3.2 Define X and y

- Assign input features to X
- Assign target variable to y (as applicable)

In [None]:
X = df[['age', 'fare', 'pclass', 'sex', 'family_size']]
y = df['survived']
print("X shape:", X.shape)
print("y shape:", y.shape)

### Reflection 3:

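- Why are these features selected? These features were selected for their relevance to predicting Titanic survival: the earlier exploration pointed to class and sex as potential predictors, and family_size was engineered for this purpose.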
- Are there any features that are likely to be highly predictive of survival? Yes, certain features are more predictive of survival, such as sex, pclass, and family_size.

## Section 4. Splitting

In [None]:
# Section 4. Splitting
# Section 4.1: Basic Train/Test Split

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Section 4.2: Stratified Train/Test Split
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)

for train_indices, test_indices in splitter.split(X, y):
    X_train_strat = X.iloc[train_indices]
    X_test_strat = X.iloc[test_indices]
    y_train_strat = y.iloc[train_indices]
    y_test_strat = y.iloc[test_indices]

# Section 4.3: Compare Class Distributions
print("Original Class Distribution:\n", y.value_counts(normalize=True))
print("\nBasic Train Set Class Distribution:\n", y_train.value_counts(normalize=True))
print("Basic Test Set Class Distribution:\n", y_test.value_counts(normalize=True))
print("\nStratified Train Set Class Distribution:\n", y_train_strat.value_counts(normalize=True))
print("Stratified Test Set Class Distribution:\n", y_test_strat.value_counts(normalize=True))


## Reflection 4:

1. Why might stratification improve model performance? Stratification improves model performance by ensuring that the class distributions in both the training and test sets are more representative of the overall dataset.

2. How close are the training and test distributions to the original dataset? Based on the outputs, both split methods produce train/test distributions that are close to the original dataset. However, stratification maintains a more accurate balance, especially in the test set, where the distribution more closely mirrors the original class proportions.
    
3. Which split method produced better class balance? The stratified split method maintained a class distribution closer to the original dataset, offering a more accurate representation in both the training and test sets.
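
The comparison in this reflection can also be reproduced with `train_test_split` alone, since it accepts a `stratify` argument. The sketch below uses a hypothetical imbalanced target (`y_toy`, 80 zeros and 20 ones) and a placeholder feature frame (`X_toy`); both are made up for illustration and are not the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target (not the lab data): 80 zeros and 20 ones
y_toy = pd.Series([0] * 80 + [1] * 20, name="target")
X_toy = pd.DataFrame({"feature": np.arange(100)})

# Plain random split: class proportions can drift from the original
_, _, y_tr_basic, y_te_basic = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

# Stratified split: proportions of y_toy are preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)

print("Original positive rate:  ", y_toy.mean())
print("Basic test positive rate:", y_te_basic.mean())
print("Strat test positive rate:", y_te_strat.mean())
```

With only one split needed, `stratify=y` is a shorter route than `StratifiedShuffleSplit`; the latter becomes more useful when several stratified splits are required.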