<a href="https://colab.research.google.com/github/213815/IPD_2526_FAKE/blob/main/Titanic_Revised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic Data Analysis & Visualization  
### AP Computer Science Principles – Mini Unit

## Essential Question
**What factors appear to influence survival on the Titanic, and how can we use data visualizations to support our claims?**


## Before We Start: Make Predictions

Before looking at any graphs, answer the following questions:

- What factors do you think affected a passenger’s chance of survival?
- Which groups of people do you predict were more likely to survive?
- Which variables in the dataset might help us investigate this?

Write down **at least two predictions**.


# Do
Prediction 1:

Prediction 2:

In [None]:
#Setting up imports and loading the dataset
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

sns.set_style('whitegrid')

titanic = sns.load_dataset("titanic")


## Part 1: Exploring the Dataset

This dataset contains information about passengers aboard the Titanic.

Each **row** represents one passenger.  
Each **column** represents a characteristic or outcome.

As you explore the dataset, answer:

- How many rows and columns are there?
- Which columns contain numerical data?
- Which columns contain categories or labels?
- Which columns might be useful for predicting survival?


```
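#useful code snippets
titanic.head()     # first five rows (a method, so it needs parentheses)
titanic.shape      # (rows, columns); an attribute, so no parentheses
titanic.columns    # column names; also an attribute
titanic.dtypes     # data type of each column; also an attribute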
```

In [None]:
#Code Goes Here

# Do
You may look back at the penguin activity to help answer the questions.
- How many rows and columns are there?:
- Which columns contain numerical data?:
- Which columns contain categories or labels?:
- Which columns might be useful for predicting survival?:

## Part 2: Real Data Is Messy

Unlike the penguins dataset, this dataset contains **missing values**.

Before creating graphs, we need to check for missing values.
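
To see what missing values look like in practice, here is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the Titanic data). It counts the missing entries in each column, shows that `mean()` quietly skips them, and then drops the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [20.0, np.nan, 40.0, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# mean() skips missing values by default, so only 20 and 40 are averaged
mean_age = toy["age"].mean()
print(mean_age)

# Keep only the rows that have an age recorded
cleaned = toy.dropna(subset=["age"])
print(cleaned.shape)
```

Because `mean()` ignores missing entries by default, a summary can look complete even when it is based on only some of the rows.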


In [None]:
titanic.isna().sum()


survived      0
pclass        0
sex           0
age         177
sibsp         0
parch         0
fare          0
embarked      2
class         0
who           0


In [None]:
titanic.shape


In [None]:
titanic = titanic.dropna(subset=['age', 'fare', 'sex', 'class', 'survived'])


In [None]:
titanic.shape


(714, 15)

# Do
- Which columns contain missing values?:
- Why might missing data cause problems in graphs or calculations?:
- Why is it reasonable to remove some rows before analysis?:

## Part 3: Looking at One Variable at a Time

We will begin by examining individual variables on their own.

As you view each graph, ask:

- What values are most common?
- Are there any unusual values?
- What does this graph tell us about the passengers overall?

After each graph, write **one observation**.

## Survivability

In [None]:
sns.countplot(x='survived', data=titanic)
plt.xlabel("Survived (0 = No, 1 = Yes)")
plt.title("Overall Survival Count")
plt.show()

Reflection:
Was survival common or rare among passengers?

## Age Distribution

In [None]:
sns.histplot(x='age', data=titanic, bins=20)
plt.title("Age Distribution of Passengers")
plt.xlabel("Age")
plt.show()

Reflection: How would you describe the age groups on the Titanic?

## DO: Fare Distribution

In the code cell below, make a histogram of passenger fare.

## Part 4: Comparing Survival Across Groups

Now we will begin comparing **survival** across different groups of people.

For each visualization, answer:

- Which group had a higher survival rate?
- How confident are you in that conclusion?
- Does the graph support or contradict your original predictions?

Be prepared to explain your reasoning using evidence from the graph.
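
Counts and rates can tell different stories when groups differ in size. The sketch below uses a made-up table (the group labels `X` and `Y` and all values are hypothetical) to compare the number of survivors in each group with the fraction who survived. Because `survived` is coded as 0 or 1, its mean within a group is that group's survival rate.

```python
import pandas as pd

# Hypothetical toy data: group X has 8 people, group Y has 2
toy = pd.DataFrame({
    "group": ["X"] * 8 + ["Y"] * 2,
    "survived": [1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
})

# Number of survivors in each group (sum of the 0/1 values)
counts = toy.groupby("group")["survived"].sum()
print(counts)

# Survival rate in each group (mean of the 0/1 values)
rates = toy.groupby("group")["survived"].mean()
print(rates)
```

In this toy table, group X has more survivors, but group Y has the higher survival rate. Keep this in mind when reading count plots for groups of different sizes.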


In [None]:
#survival by sex
sns.countplot(x='sex', hue='survived', data=titanic)
plt.title("Survival by Sex")
plt.show()

## DO: Claim prompt
Which sex appears more likely to survive? How can you tell?

In [None]:
#Survival by ticket class
sns.countplot(x='class', hue='survived', data=titanic)
plt.title("Survival by Ticket Class")
plt.show()


## Creating New Columns (Data Transformation)

Just like converting grams to pounds in the penguins dataset, we can create new columns to help answer questions.

In [None]:
titanic['is_child'] = titanic['age'] < 18
titanic.head(3)


In [None]:
sns.countplot(x='is_child', hue='survived', data=titanic)
plt.title("Survival: Children vs Adults")
plt.show()

## DO: Create Your Own Boolean Column

Create a new column based on age, fare, or class that divides passengers into two groups.

Examples:

- High fare vs low fare
- Younger vs older adults
- First class vs others

In [None]:
#Your Code Here to make the additional column

In [None]:
#Then make a graph showing how survival differs between your two groups.

## Grouping Ages into Ranges (Binning)

Sometimes it helps to group numerical values into ranges.
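
As a small illustration, the sketch below applies `pd.cut` to a few made-up ages (hypothetical values, not from the dataset). By default, each bin includes its right edge but not its left edge, so an age of exactly 10 falls in the 0 to 10 bin, and an age of exactly 0 would not fall in any bin.

```python
import pandas as pd

# Hypothetical ages chosen to land on and around the bin edges
ages = pd.Series([5, 10, 11, 35, 80])
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]

# Each age is replaced by the interval (bin) that contains it
age_bin = pd.cut(ages, bins=bins)
print(age_bin)

# An age of exactly 0 is outside the first bin, so it becomes missing (NaN)
zero_bin = pd.cut(pd.Series([0]), bins=bins)
print(zero_bin.isna().iloc[0])
```

If a dataset contains values that sit exactly on the lowest edge, `pd.cut` accepts `include_lowest=True` to keep them in the first bin.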

In [None]:
titanic['age_bin'] = pd.cut(
    titanic['age'],
    bins=[0,10,20,30,40,50,60,70,80]
)


In [None]:
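survival_by_age = (
    titanic
    # observed=False keeps every age bin and avoids a FutureWarning in recent pandas versions
    .groupby('age_bin', observed=False)['survived']
    .mean()  # mean of the 0/1 values = survival rate for each bin
    .reset_index()
)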


In [None]:
sns.barplot(x='age_bin', y='survived', data=survival_by_age)
plt.xticks(rotation=45)
plt.ylabel("Survival Rate")
plt.title("Survival Rate by Age Group")
plt.show()


## DO: Reflection
Which age range had the highest survival rate? Why might that be?

## Looking at Multiple Variables Together
Age vs Fare Colored by Survival

In [None]:
sns.scatterplot(
    x='age',
    y='fare',
    hue='survived',
    data=titanic,
    alpha=0.7
)
plt.title("Fare vs Age Colored by Survival")
plt.show()

## Final Task: Make a Data-Supported Claim

Write 2–3 sentences responding to the prompt below:

Name a factor that strongly influenced survival on the Titanic.

Your response must:

- Reference two new graphs that you made
- Clearly explain your reasoning