# ai03dTasks
# Machine Learning: Decision Trees
## Training the Model

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to your GitHub "Machine Learning" repo under the appropriate folder

---
## Setup: Prepare the Data - we'll start fresh with the original dataset

Run this cell to load and prepare the data (repeating steps from previous lessons).

In [3]:
import pandas as pd     

df = pd.read_csv("Titanic Dataset.csv")     # Load the dataset into a DataFrame
df = df[['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]   # Keep only selected columns

df['age'] = df['age'].fillna(df['age'].median())       # Fill missing ages with median age
df['fare'] = df['fare'].fillna(df['fare'].median())    # Fill missing fares with median fare
df.dropna(subset=['embarked'], inplace=True)           # Drop rows that have no 'embarked' value

df = pd.get_dummies(df, columns=['sex'], drop_first=True)        # Convert 'sex' to numbers (one-hot encode)
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)   # Convert 'embarked' to numbers (one-hot encode)

X = df.drop('survived', axis=1)    # X = all features except the target column
y = df['survived']                 # y = target column we want to predict

print("✓ Data loaded and prepared!")      # Confirmation message
print(f"X shape: {X.shape}")              # Show rows and number of feature columns
print(f"y shape: {y.shape}")              # Show number of rows in target


✓ Data loaded and prepared!
X shape: (1307, 8)
y shape: (1307,)
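
The one-hot encoding step in the setup cell can be hard to picture. The sketch below runs `pd.get_dummies` with `drop_first=True` on three made-up rows (the `demo` DataFrame is an assumption for illustration, not the Titanic data) to show where column names like `sex_male` and `embarked_Q` come from.

```python
import pandas as pd

# Made-up rows for illustration (not the actual Titanic data)
demo = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first category of each column (alphabetically),
# so 'female' and 'C' become the implied baseline
encoded = pd.get_dummies(demo, columns=["sex", "embarked"], drop_first=True)

print(encoded.columns.tolist())   # ['sex_male', 'embarked_Q', 'embarked_S']
print(encoded)
```

A row where every `embarked_*` column is False therefore stands for the dropped category, `C`.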


---
## Task 1: Import Required Libraries

Import the libraries we'll need for training our model.

In [1]:
# TODO: Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# TODO: Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

print("✓ Libraries imported!")

✓ Libraries imported!


---
## Task 2: Understand the Data Before Splitting

Let's check our data before we split it.

In [4]:
# Check the size of X and y
print(f"Total passengers: {len(X)}")               # Number of rows in the dataset
print(f"Number of features: {X.shape[1]}")         # How many input columns we have
print(f"Feature names: {X.columns.tolist()}")      # List of all feature column names

# Check survival rate
survival_rate = y.mean()                           # Mean of 0/1 values gives the fraction who survived
print(f"\nSurvival rate: {survival_rate:.2%}")     # Print survival rate as a percentage
print(f"Survivors: {y.sum()}")                     # Count of '1' values (people who survived)
print(f"Non-survivors: {(y == 0).sum()}")          # Count of '0' values (people who died)


Total passengers: 1307
Number of features: 8
Feature names: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_male', 'embarked_Q', 'embarked_S']

Survival rate: 38.10%
Survivors: 498
Non-survivors: 809


**Q: How many total passengers are in our dataset?**

A: 1307

**Q: What percentage survived?**

A: 38.1%
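
Why does `y.mean()` give the survival rate? A small sketch on made-up labels (the `labels` Series is an assumption for illustration) shows that the mean of 0/1 values is the same as survivors divided by total.

```python
import pandas as pd

labels = pd.Series([1, 0, 0, 1, 0])   # made-up labels: 2 survivors out of 5

rate_from_mean = labels.mean()                  # mean of the 0/1 values
rate_from_counts = labels.sum() / len(labels)   # survivors divided by total

print(rate_from_mean, rate_from_counts)
print(labels.value_counts(normalize=True))      # share of each class
```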

---
## Task 3: Split the Data

Split the data into training (80%) and testing (20%) sets.

### 3a. Perform the train/test split

In [7]:
# TODO: Use train_test_split to split the data
# Fill in the blanks with the given answers/values
X_train, X_test, y_train, y_test = train_test_split(
    X, y,      # X and y
    test_size=0.2,      # 20% for testing (0.2)
    random_state=42    # Use 42 for reproducibility
)

print("✓ Data split complete!")

✓ Data split complete!


### 3b. Verify the split sizes

In [8]:
# Check the sizes of each set
print(f"Training set size: {X_train.shape[0]} passengers")
print(f"Testing set size: {X_test.shape[0]} passengers")
print(f"\nTraining percentage: {X_train.shape[0] / len(X) * 100:.1f}%")
print(f"Testing percentage: {X_test.shape[0] / len(X) * 100:.1f}%")

Training set size: 1045 passengers
Testing set size: 262 passengers

Training percentage: 80.0%
Testing percentage: 20.0%


**Q: How many passengers are in the training set?**

A: 1045

**Q: How many passengers are in the testing set?**

A: 262

**Q: Is the split approximately 80/20?**

A: yes
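
A plain random split does not guarantee that the training and testing sets end up with the same survival rate. One optional refinement, sketched here on synthetic stand-in data (`X_demo` and `y_demo` are made up, not the Titanic data), is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the Titanic data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 380 + [0] * 620)   # 38% positive, similar to the survival rate

# stratify=y_demo keeps the class proportions similar in both sets
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Overall positive rate:  {y_demo.mean():.2%}")
print(f"Training positive rate: {yd_train.mean():.2%}")
print(f"Testing positive rate:  {yd_test.mean():.2%}")
```

The lesson itself uses the unstratified split, so this is a possible variation and not a required step.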

### 3c. Check that features match

In [9]:
# Verify X_train and X_test have the same columns
print(f"X_train columns: {X_train.shape[1]}")      # Number of feature columns in training data
print(f"X_test columns: {X_test.shape[1]}")        # Number of feature columns in testing data
print(f"\nColumns match: {X_train.shape[1] == X_test.shape[1]}")   # Check if the counts are the same

X_train columns: 8
X_test columns: 8

Columns match: True


---
## Task 4: Examine the Training Data

Let's look at some examples from our training set.

In [10]:
# Display first few rows of training features
print("First 5 rows of X_train:")
print(X_train.head())

print("\nFirst 10 values of y_train:")
print(y_train.head(10).tolist())

First 5 rows of X_train:
      pclass   age  sibsp  parch     fare  sex_male  embarked_Q  embarked_S
1294       3  28.5      0      0   16.100      True       False        True
545        2  30.0      3      0   21.000     False       False        True
291        1  39.0      1      1   79.650     False       False        True
10         1  47.0      1      0  227.525      True       False       False
147        1  28.0      0      0   42.400      True       False        True

First 10 values of y_train:
[0, 1, 1, 0, 0, 1, 0, 0, 1, 0]


**Q: Do the row indices look sequential or random?**

A: random

**Q: Why is it good that they're random?**

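A: Random selection keeps the training and testing sets representative of the whole dataset, so neither set is biased by the order of the rows in the original file.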
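
To see why shuffling matters, here is a small sketch on a made-up dataset that is sorted so all the 0 labels come first (`X_demo` and `y_demo` are assumptions for illustration). With `shuffle=False`, the test set is simply the last rows and contains only one class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up dataset, ordered so that all the 0s come first (illustration only)
X_demo = pd.DataFrame({"value": range(10)})
y_demo = pd.Series([0] * 5 + [1] * 5)

# shuffle=False keeps the original order: the last 20% of rows become the test set
_, _, _, y_test_ordered = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# The default (shuffle=True) mixes the rows before splitting
X_train_shuffled, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Test labels without shuffling:", y_test_ordered.tolist())
print("Training row indices with shuffling:", X_train_shuffled.index.tolist())
```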

---
## Task 5: Create the Decision Tree Model

Create an instance of DecisionTreeClassifier.

In [13]:
# TODO: Create a DecisionTreeClassifier
# Set max_depth=5 and random_state=42
model = DecisionTreeClassifier(
    max_depth=5,
    random_state=42
)

print("✓ Model created!")
print(f"Model type: {type(model)}")
print(f"Max depth: {model.max_depth}")

✓ Model created!
Model type: <class 'sklearn.tree._classes.DecisionTreeClassifier'>
Max depth: 5


**Q: What does max_depth=5 mean?**

A: The tree can grow at most 5 levels of splits deep, from the root down to any leaf.

**Q: Why do we set random_state?**

A: It fixes the seed for the random choices made during training, so the same tree is built on every run and the results are reproducible.
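
A minimal sketch of what a fixed seed does, shown with `train_test_split` on made-up numbers because the effect is easy to see there; the same idea applies to `DecisionTreeClassifier(random_state=42)`. The array `data` and the seed `7` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)   # made-up data for illustration

# Same random_state -> the same shuffle, so the same split every time
_, test_a = train_test_split(data, test_size=0.2, random_state=42)
_, test_b = train_test_split(data, test_size=0.2, random_state=42)

# A different seed generally gives a different split
_, test_c = train_test_split(data, test_size=0.2, random_state=7)

print("Seed 42, run 1:", test_a.tolist())
print("Seed 42, run 2:", test_b.tolist())
print("Seed 7:        ", test_c.tolist())
print("Runs with seed 42 match:", np.array_equal(test_a, test_b))
```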

---
## Task 6: Train the Model

Now train the model on the training data.

In [14]:
# TODO: Train the model using .fit()
# Pass X_train and y_train as arguments
model.fit(X_train, y_train)

print("✓ Model trained successfully!")
print(f"Tree depth: {model.get_depth()}")
print(f"Number of leaves: {model.get_n_leaves()}")

✓ Model trained successfully!
Tree depth: 5
Number of leaves: 29


**Q: How deep is the trained tree?**

A: 5

**Q: Is it at the maximum depth we allowed (5)?**

A: yes
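
To get a feel for how much `max_depth` restrains a tree, the sketch below trains one limited and one unlimited tree on synthetic, noisy stand-in data (generated with `make_classification`; an assumption for illustration, not the Titanic data) and compares their depth and leaf counts.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, flip_y=0.2, random_state=42
)

limited = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_demo, y_demo)
unlimited = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)   # max_depth=None

print(f"max_depth=5    -> depth {limited.get_depth()}, leaves {limited.get_n_leaves()}")
print(f"max_depth=None -> depth {unlimited.get_depth()}, leaves {unlimited.get_n_leaves()}")
```

On noisy data the unlimited tree usually grows far deeper, because it keeps splitting until every training row is classified correctly.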

---
## Task 7: Verify the Model is Trained

Let's check that our model learned something.

In [15]:
# Check model attributes after training
print("Model attributes:")

print(f"Number of features: {model.n_features_in_}")    # How many input features the model received
print(f"Feature names: {model.feature_names_in_}")      # The exact names of those features

print(f"Number of outputs: {model.n_outputs_}")         # Number of target variables (1 in our case: `survived`)
print(f"Number of classes: {model.n_classes_}")         # How many categories the model predicts (two: 0 and 1)
print(f"Classes: {model.classes_}")                     # Shows the actual class labels the model learned

Model attributes:
Number of features: 8
Feature names: ['pclass' 'age' 'sibsp' 'parch' 'fare' 'sex_male' 'embarked_Q'
 'embarked_S']
Number of outputs: 1
Number of classes: 2
Classes: [0 1]


**Q: How many features did the model use?**

A: 8

**Q: What are the two classes (possible outcomes)?**

A: 0 = did not survive, 1 = survived
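
Another way to check that a tree learned something is to print its rules as text. This sketch uses `export_text` from `sklearn.tree` on a small synthetic dataset; the data and the feature names `f0` to `f3` are hypothetical, chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names (illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each indented line is one split (a feature compared with a threshold), and each `class:` line is a leaf with the prediction for rows that reach it.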

---
## Task 8: Make a Quick Test Prediction

Test that the model can make predictions (we'll evaluate properly in the next lesson).

### How `.fit()` and `.predict()` work

**`fit()`** → trains the model
  The model looks at the training features (X) and the correct answers (y) to learn patterns.

**`predict()`** → makes guesses
  After learning, the model uses those patterns to predict outcomes for new X data.

**Simple idea:**
`fit = learn`
`predict = guess`
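
The learn-then-guess idea can be sketched on a tiny made-up dataset (the "hours studied" numbers below are an assumption for illustration, not part of the lesson data):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: one feature (hours studied) and a pass/fail label
X_toy = [[1], [2], [3], [8], [9], [10]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = DecisionTreeClassifier(random_state=42)
toy_model.fit(X_toy, y_toy)            # learn: find a split that separates the 0s from the 1s

new_students = [[2.5], [9.5]]          # inputs the model has not seen
toy_predictions = toy_model.predict(new_students)   # guess: apply the learned split

print(toy_predictions)                 # prints [0 1]
```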

In [17]:
# Make predictions on the first 5 rows in the test set
sample_predictions = model.predict(X_test.head(5))      

# Get the actual survival values for those same 5 passengers
actual_values = y_test.head(5).tolist()                

print("Sample predictions vs actual:")

# Loop through each pair of predicted vs actual values
for i, (pred, actual) in enumerate(zip(sample_predictions, actual_values)):
    result = "✓" if pred == actual else "✗"            # Check if the prediction was correct
    print(f"Passenger {i+1}: Predicted={pred}, Actual={actual} {result}")   # Show comparison for each passenger


Sample predictions vs actual:
Passenger 1: Predicted=0, Actual=0 ✓
Passenger 2: Predicted=0, Actual=1 ✗
Passenger 3: Predicted=0, Actual=0 ✓
Passenger 4: Predicted=0, Actual=0 ✓
Passenger 5: Predicted=0, Actual=0 ✓


**Q: Did the model get any predictions correct?**

A: Yes, 4 of the 5 predictions were correct.

**Q: Can you tell from just 5 examples how good the model is?**

A: No. It got 4 of 5 correct, but 5 passengers is far too small a sample to estimate accuracy reliably. The full test set is needed for a meaningful score.
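
To see how unreliable a 5-row check can be, the sketch below scores a tree on several 5-row slices of a test set and then on the whole test set. It uses synthetic stand-in data from `make_classification` (an assumption for illustration, not the Titanic data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, flip_y=0.2, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(Xd_train, yd_train)

# Accuracy on ten separate 5-row slices of the test set
small_scores = [
    demo_model.score(Xd_test[i:i + 5], yd_test[i:i + 5]) for i in range(0, 50, 5)
]
# Accuracy on the whole test set
full_score = demo_model.score(Xd_test, yd_test)

print("Accuracy on 5-row samples:", small_scores)
print(f"Accuracy on all {len(Xd_test)} test rows: {full_score:.3f}")
```

With only 5 rows, each score can only be 0.0, 0.2, 0.4, 0.6, 0.8, or 1.0, so small samples tend to jump around the full-set value.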

---
## Reflection Questions

Answer these questions based on your work:

**1. Why do we split data into training and testing sets?**

Answer: So that the model learns from one portion of the real data and is then tested on a separate portion it has never seen. That shows how well it generalizes, instead of relying on made-up scenarios.

**2. What would happen if we trained on 100% of the data and tested on the same 100%?**

Answer: The score would be misleadingly high, because the model would be tested on the same rows it learned from. It would not show how the model performs on new data.

**3. What does the .fit() method do?**

Answer: trains the model on the specified data

**4. What is max_depth and why is it important?**

Answer: It is the maximum number of levels of splits the decision tree may have. Limiting it stops the tree from splitting until it memorizes the training data (overfitting).

**5. Explain in your own words what 'training a model' means.**

Answer: Giving the model example inputs together with their known outcomes so it can learn patterns, which it then uses to predict outcomes for new, similar data.
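
Reflection questions 2 and 4 can be explored with a small experiment. The sketch below uses synthetic, noisy stand-in data (an assumption for illustration, not the Titanic data) and compares accuracy on the training rows with accuracy on held-out rows, for an unlimited tree and for one with `max_depth=5`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, flip_y=0.2, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 5):   # None lets the tree grow until it fits the training rows
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(Xd_train, yd_train)
    results[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train accuracy {results[depth][0]:.3f}, "
          f"test accuracy {results[depth][1]:.3f}")
```

The unlimited tree typically scores near-perfectly on the rows it trained on while doing noticeably worse on the held-out rows, which is why a score measured on training data says little about real performance.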

---
## Lesson Complete! 

You've successfully trained your first machine learning model!

**Summary of what you did:**
- Split data into training (80%) and testing (20%) sets
- Created a DecisionTreeClassifier with max_depth=5
- Trained the model using .fit() on training data
- Model learned patterns from 1,045 training examples
- Model is ready to make predictions!

Save this notebook and push to GitHub.

**Next lesson**: Evaluate how well your model performs!