# ai03dTasks
# Machine Learning: Decision Trees
## Training the Model

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to your GitHub "Machine Learning" Repo under the appropriate folder.

---
## Setup: Prepare the Data - we'll start fresh with the original dataset

Run this cell to load and prepare the data (repeating steps from previous lessons).

In [None]:
import pandas as pd     

df = pd.read_csv("Titanic Dataset.csv")     # Load the dataset into a DataFrame
df = df[['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]   # Keep only selected columns

df['age'] = df['age'].fillna(df['age'].median())       # Fill missing ages with median age
df['fare'] = df['fare'].fillna(df['fare'].median())    # Fill missing fares with median fare
df = df.dropna(subset=['embarked'])                    # Drop rows that have no 'embarked' value

df = pd.get_dummies(df, columns=['sex'], drop_first=True)        # Convert 'sex' to numbers (one-hot encode)
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)   # Convert 'embarked' to numbers (one-hot encode)

X = df.drop('survived', axis=1)    # X = all features except the target column
y = df['survived']                 # y = target column we want to predict

print("✓ Data loaded and prepared!")      # Confirmation message
print(f"X shape: {X.shape}")              # Show rows and number of feature columns
print(f"y shape: {y.shape}")              # Show number of rows in target
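
To see what the fill and one-hot encoding steps do, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are hypothetical and are not rows from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data (not taken from the Titanic dataset)
toy = pd.DataFrame({
    "age": [22.0, None, 30.0, 40.0],
    "sex": ["male", "female", "female", "male"],
})

# Fill the missing age with the median of the known ages (22, 30, 40)
toy["age"] = toy["age"].fillna(toy["age"].median())

# drop_first=True drops the first category ('female'),
# so a single 'sex_male' column is enough to encode both values
toy = pd.get_dummies(toy, columns=["sex"], drop_first=True)

print(toy)
```

The single `sex_male` column carries all the information: a true value means male and a false value means female.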


---
## Task 1: Import Required Libraries

Import the libraries we'll need for training our model.

In [None]:
# TODO: Import train_test_split from sklearn.model_selection
from sklearn.model_selection import ________

# TODO: Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import ________

print("✓ Libraries imported!")

---
## Task 2: Understand the Data Before Splitting

Let's check our data before we split it.

In [None]:
# Check the size of X and y
print(f"Total passengers: {len(X)}")               # Number of rows in the dataset
print(f"Number of features: {X.shape[1]}")         # How many input columns we have
print(f"Feature names: {X.columns.tolist()}")      # List of all feature column names

# Check survival rate
survival_rate = y.mean()                           # Mean of the 0/1 values gives the fraction who survived
print(f"\nSurvival rate: {survival_rate:.2%}")     # Print survival rate as a percentage
print(f"Survivors: {y.sum()}")                     # Count of '1' values (people who survived)
print(f"Non-survivors: {(y == 0).sum()}")          # Count of '0' values (people who died)
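
Why does the mean of a 0/1 column give a survival rate? A small sketch with made-up labels may help; `toy_y` is a hypothetical Series, not the real target column.

```python
import pandas as pd

# Hypothetical labels: 1 = survived, 0 = did not survive
toy_y = pd.Series([1, 0, 0, 1, 0])

# The sum counts the 1s, so mean = (number of 1s) / (number of rows)
fraction = toy_y.mean()

# The :.2% format multiplies by 100 and appends a percent sign
print(f"Survival rate: {fraction:.2%}")
```

Two survivors out of five rows gives a fraction of 0.4, which the format string displays as a percentage.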


**Q: How many total passengers are in our dataset?**

A: 

**Q: What percentage survived?**

A: 

---
## Task 3: Split the Data

Split the data into training (80%) and testing (20%) sets.

### 3a. Perform the train/test split

In [None]:
# TODO: Use train_test_split to split the data
# Fill in the blanks with the given answers/values
X_train, X_test, y_train, y_test = train_test_split(
    ________, ________,      # X and y
    test_size=________,      # 20% for testing (0.2)
    random_state=________    # Use 42 for reproducibility
)

print("✓ Data split complete!")
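
The sketch below shows how `train_test_split` behaves on a small made-up array. The data, the variable names, and the values `test_size=0.3` and `random_state=0` are chosen only for illustration and are not the values this task asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, plus 10 labels
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# 30% of 10 rows go to the test set, the remaining rows to training
a_train, a_test, b_train, b_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0
)
print(a_train.shape, a_test.shape)

# Repeating the call with the same random_state reproduces the same split
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0
)
print((a_test == a_test2).all())
```

The rows are shuffled before being divided, and fixing `random_state` makes that shuffle repeatable from one run to the next.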

### 3b. Verify the split sizes

In [None]:
# Check the sizes of each set
print(f"Training set size: {X_train.shape[0]} passengers")
print(f"Testing set size: {X_test.shape[0]} passengers")
print(f"\nTraining percentage: {X_train.shape[0] / len(X) * 100:.1f}%")
print(f"Testing percentage: {X_test.shape[0] / len(X) * 100:.1f}%")

**Q: How many passengers are in the training set?**

A: 

**Q: How many passengers are in the testing set?**

A: 

**Q: Is the split approximately 80/20?**

A: 

### 3c. Check that features match

In [None]:
# Verify X_train and X_test have the same number of columns
print(f"X_train columns: {X_train.shape[1]}")      # Number of feature columns in training data
print(f"X_test columns: {X_test.shape[1]}")        # Number of feature columns in testing data
print(f"\nColumns match: {X_train.shape[1] == X_test.shape[1]}")   # Check if the counts are the same
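
Matching column counts do not guarantee matching column names. As an optional extra check, the names can be compared directly. This sketch uses two hypothetical DataFrames, `left` and `right`, to show the difference.

```python
import pandas as pd

# Hypothetical frames: same number of columns, different names
left = pd.DataFrame({"age": [30], "fare": [7.25]})
right = pd.DataFrame({"age": [41], "sibsp": [1]})

same_count = left.shape[1] == right.shape[1]              # compares counts only
same_names = list(left.columns) == list(right.columns)    # compares names and order

print(f"Same count: {same_count}")
print(f"Same names: {same_names}")
```

Because `train_test_split` divides rows rather than columns, `X_train` and `X_test` should pass both checks.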

---
## Task 4: Examine the Training Data

Let's look at some examples from our training set.

In [None]:
# Display first few rows of training features
print("First 5 rows of X_train:")
print(X_train.head())

print("\nFirst 10 values of y_train:")
print(y_train.head(10).tolist())

**Q: Do the row indices look sequential or random?**

A: 

**Q: Why is it good that they're random?**

A: 

---
## Task 5: Create the Decision Tree Model

Create an instance of DecisionTreeClassifier.

In [None]:
# TODO: Create a DecisionTreeClassifier
# Set max_depth=5 and random_state=42
model = DecisionTreeClassifier(
    max_depth=________,
    random_state=________
)

print("✓ Model created!")
print(f"Model type: {type(model)}")
print(f"Max depth: {model.max_depth}")
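
To get a feel for how `max_depth` limits a tree, here is a rough sketch on made-up data where the labels alternate. The arrays and the names `shallow` and `deep` are hypothetical, and the depth values are chosen only for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one feature, labels alternate so many splits are needed
toy_X = [[0], [1], [2], [3], [4], [5], [6], [7]]
toy_y = [0, 1, 0, 1, 0, 1, 0, 1]

# A tree limited to a single level of questions
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(toy_X, toy_y)

# A tree with no depth limit keeps splitting until every leaf is pure
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(toy_X, toy_y)

print(f"Shallow: depth={shallow.get_depth()}, leaves={shallow.get_n_leaves()}")
print(f"Deep:    depth={deep.get_depth()}, leaves={deep.get_n_leaves()}")
```

The unlimited tree grows until it has memorized every training row, while the limited tree has to stop early.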

**Q: What does max_depth=5 mean?**

A: 

**Q: Why do we set random_state?**

A: 

---
## Task 6: Train the Model

Now train the model on the training data.

In [None]:
# TODO: Train the model using .fit()
# Pass X_train and y_train as arguments
model.________(________, ________)

print("✓ Model trained successfully!")
print(f"Tree depth: {model.get_depth()}")
print(f"Number of leaves: {model.get_n_leaves()}")

**Q: How deep is the trained tree?**

A: 

**Q: Is it at the maximum depth we allowed (5)?**

A: 

---
## Task 7: Verify the Model is Trained

Let's check that our model learned something.

In [None]:
# Check model attributes after training
print("Model attributes:")

print(f"Number of features: {model.n_features_in_}")    # How many input features the model received
print(f"Feature names: {model.feature_names_in_}")      # The exact names of those features

print(f"Number of outputs: {model.n_outputs_}")         # Number of target variables (1 in our case: `survived`)
print(f"Number of classes: {model.n_classes_}")         # How many categories the model predicts (0 and 1 → two classes: died or survived)
print(f"Classes: {model.classes_}")                     # Shows the actual class labels the model learned

**Q: How many features did the model use?**

A: 

**Q: What are the two classes (possible outcomes)?**

A: 

---
## Task 8: Make a Quick Test Prediction

Test that the model can make predictions (we'll evaluate properly in the next lesson).

### How `.fit()` and `.predict()` work

**`fit()`** → trains the model
  The model looks at the training features (X) and the correct answers (y) to learn patterns.

**`predict()`** → makes guesses
  After learning, the model uses those patterns to predict outcomes for new X data.

**Simple idea:**
`fit = learn`
`predict = guess`
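
The learn-then-guess idea can be sketched on a tiny made-up dataset. Everything here (`toy_X`, `toy_y`, `toy_model`, and the "hours studied" framing) is hypothetical and separate from the Titanic model.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: one feature (hours studied) and a label (0 = fail, 1 = pass)
toy_X = [[1], [2], [8], [9]]
toy_y = [0, 0, 1, 1]

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_X, toy_y)           # learn: the tree finds a split between the low and high values

# New, unseen inputs: one below the split and one above it
new_X = [[0], [10]]
guesses = toy_model.predict(new_X)    # guess: apply the learned split to the new rows
print(guesses)
```

Note that `predict()` only receives features. It never sees the correct answers, which is why its output can be compared against `y_test` afterwards.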

In [None]:
# Make predictions on the first 5 rows in the test set
sample_predictions = model.predict(X_test.head(5))      

# Get the actual survival values for those same 5 passengers
actual_values = y_test.head(5).tolist()                

print("Sample predictions vs actual:")

# Loop through each pair of predicted vs actual values
for i, (pred, actual) in enumerate(zip(sample_predictions, actual_values)):
    result = "✓" if pred == actual else "✗"            # Check if the prediction was correct
    print(f"Passenger {i+1}: Predicted={pred}, Actual={actual} {result}")   # Show comparison for each passenger


**Q: Did the model get any predictions correct?**

A: 

**Q: Can you tell from just 5 examples how good the model is?**

A: 

---
## Reflection Questions

Answer these questions based on your work:

**1. Why do we split data into training and testing sets?**

Answer: 

**2. What would happen if we trained on 100% of the data and tested on the same 100%?**

Answer: 

**3. What does the .fit() method do?**

Answer: 

**4. What is max_depth and why is it important?**

Answer: 

**5. Explain in your own words what 'training a model' means.**

Answer: 

---
## Lesson Complete! 

You've successfully trained your first machine learning model!

**Summary of what you did:**
- Split data into training (80%) and testing (20%) sets
- Created a DecisionTreeClassifier with max_depth=5
- Trained the model using .fit() on training data
- Model learned patterns from 1000+ training examples
- Model is ready to make predictions!

Save this notebook and push to GitHub.

**Next lesson**: Evaluate how well your model performs!