Locate a dataset that you are interested in working with. The data should be sufficiently complex that you can ask lots of questions about it and engage in creative design techniques, but not so complex that you need specialized hardware or algorithmic approaches to analyze it. While you are welcome to use any data you’d like, I recommend that your dataset be tabular (e.g., CSV, TSV, SQL), contain 5,000 or fewer datapoints (on the order of one hundred or so tends to be sufficiently interesting without causing lag in Altair), and be data that you’re comfortable discussing as part of the course (e.g., avoid data that is overly private or classified).

Discuss your dataset, including 
- the data’s source, 
- key attributes/dimensions of the data, and 
- your goals for working with that data (i.e., what are the key questions you want to answer). 
- Identify existing relevant visualizations for working with that data (ones that use the same data, show the same concepts, or simply provide some inspiration) and
    - critique those visualizations based on the practices from this module.
    - What works well?
    - What might need to improve or change to answer your target questions?

---

### Key attributes/dimensions of the Titanic dataset:

- Pclass: The class in which the passenger traveled (1st, 2nd, or 3rd). This is a categorical variable, stored as the integers 1, 2, and 3.

- Name: The name of the passenger.

- Sex: The gender of the passenger (male or female).

- Age: The age of the passenger in years. It is recorded for only 714 of the 891 passengers in `train.csv`.

- SibSp: The number of siblings or spouses the passenger had aboard the Titanic.

- Parch: The number of parents or children the passenger had aboard the Titanic.

- Ticket: The ticket number.

- Fare: The fare paid by the passenger.

- Cabin: The cabin number. It is recorded for only 204 of the 891 passengers in `train.csv`.

- Embarked: The port at which the passenger boarded the ship (C - Cherbourg, Q - Queenstown, S - Southampton).

- Survived: This is the target variable, indicating whether the passenger survived (1) or did not survive (0).
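
Before charting, it can help to note how each attribute might be encoded. The sketch below is one possible mapping, not the only reasonable one: the type assignments are assumptions based on the descriptions above (for example, treating `Pclass` as ordinal), and `ENCODING_TYPES` and `shorthand` are hypothetical names introduced for illustration. The `:N`, `:O`, and `:Q` suffixes follow Altair's encoding shorthand for nominal, ordinal, and quantitative fields.

```python
# Assumed measurement type for each attribute
# (N = nominal, O = ordinal, Q = quantitative).
# These assignments are a judgment call, not something stored in the data.
ENCODING_TYPES = {
    "Pclass": "O",
    "Name": "N",
    "Sex": "N",
    "Age": "Q",
    "SibSp": "Q",
    "Parch": "Q",
    "Ticket": "N",
    "Fare": "Q",
    "Cabin": "N",
    "Embarked": "N",
    "Survived": "N",
}


def shorthand(column):
    """Return an Altair-style shorthand string such as 'Age:Q'."""
    return f"{column}:{ENCODING_TYPES[column]}"


print([shorthand(c) for c in ("Pclass", "Sex", "Age")])
```

A string such as `shorthand("Fare")` could then be passed to an Altair encoding channel, for example `alt.X(...)`.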

### Goals for working with the Titanic dataset:

- Survival Prediction: The key question is whether a passenger’s survival can be predicted from the other attributes.

- Exploratory Data Analysis (EDA)

- Data Visualization: Visualize the data to better understand the distribution of passengers, survival rates, and any patterns or correlations in the data.

### Existing relevant visualizations and critique:

- Survival by Passenger Class: A common visualization is a bar chart showing the survival rate for each passenger class (1st, 2nd, 3rd). This works well to quickly understand the impact of class on survival. However, it could be improved by including absolute counts along with percentages to provide a clearer picture.

- Survival by Gender: A pie chart or bar chart showing how survivors split between male and female passengers is simple and easy to read. On its own, however, it shows each group’s share of the survivors, not each group’s chance of surviving, so it would be more informative when combined with survival rates (e.g., survival rates for males vs. females).

- Correlation Heatmap: To understand the relationships between numerical attributes (e.g., age, fare, siblings/spouses, parents/children) and survival, a correlation heatmap can be created. This visualization works well for identifying potential correlations but should be complemented with more in-depth statistical analysis.
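
The first two critiques above both ask for counts and rates side by side. A minimal sketch of that summary follows, assuming a DataFrame with a 0/1 `Survived` column as described earlier. The helper name `survival_summary` is hypothetical, and the `toy` rows are invented purely to show the output shape; they are not taken from the Titanic data.

```python
import pandas as pd


def survival_summary(df, by):
    """Passenger count, survivor count, and survival rate per group."""
    return (
        df.groupby(by)["Survived"]
        .agg(passengers="count", survivors="sum", survival_rate="mean")
        .reset_index()
    )


# Invented rows for illustration only (not real passengers).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male",
            "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

by_class = survival_summary(toy, "Pclass")
by_sex = survival_summary(toy, "Sex")
print(by_class)
print(by_sex)
```

Once the real data is loaded, calling `survival_summary(df, "Pclass")` would give one row per class, which could feed a bar chart of `survival_rate` with `passengers` shown as a label or tooltip.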

# Titanic Dataset

In [5]:

```python
import pandas as pd
import numpy as np
```

In [15]:

```python
df = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')
gen = pd.read_csv('titanic/gender_submission.csv')
```

In [19]:

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
```
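
The non-null counts above show that three columns are incomplete. As a rough sketch, the missing counts and percentages can be derived from those figures directly. The numbers below are copied from the `df.info()` output, and the names `non_null`, `summary`, and `incomplete` are illustrative.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above.
N_ROWS = 891
non_null = pd.Series({
    "PassengerId": 891, "Survived": 891, "Pclass": 891, "Name": 891,
    "Sex": 891, "Age": 714, "SibSp": 891, "Parch": 891,
    "Ticket": 891, "Fare": 891, "Cabin": 204, "Embarked": 889,
})

# Missing values per column, as a count and as a percentage of all rows.
missing = N_ROWS - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / N_ROWS * 100).round(1),
})

# Keep only the columns that have at least one missing value.
incomplete = summary[summary["missing"] > 0]
print(incomplete)
```

With the loaded frame, `df.isna().sum()` should report the same missing counts. Any chart that uses `Age` or `Cabin` will cover only the passengers for whom those values were recorded, which is worth stating alongside the visualization.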
