# Titanic Dataset

## Overview
The Titanic dataset is one of the most famous datasets for practicing binary classification and data exploration. It provides information about the passengers of the RMS Titanic, which sank on its maiden voyage in 1912.

## Dataset Information
The dataset includes the following key features:

| Column Name       | Description                                                                                 |
|-------------------|---------------------------------------------------------------------------------------------|
| **PassengerId**   | Unique identifier for each passenger.                                                      |
| **Survived**      | Binary indicator of survival: `1` = survived, `0` = did not survive (target variable).      |
| **Pclass**        | Passenger's class: `1` = 1st, `2` = 2nd, `3` = 3rd class.                                  |
| **Name**          | Full name of the passenger.                                                                |
| **Sex**           | Gender of the passenger (`male` or `female`).                                              |
| **Age**           | Age of the passenger (in years).                                                           |
| **SibSp**         | Number of siblings/spouses aboard the Titanic with the passenger.                          |
| **Parch**         | Number of parents/children aboard the Titanic with the passenger.                          |
| **Ticket**        | Ticket number.                                                                             |
| **Fare**          | Ticket price (in British pounds).                                                          |
| **Cabin**         | Cabin number (if known).                                                                   |
| **Embarked**      | Port of embarkation: `C` = Cherbourg, `Q` = Queenstown, `S` = Southampton.                 |

## Key Insights for Analysis
1. **Survival Rate Analysis**:
   - What factors influenced survival? E.g., gender, age, class.
   - Women and children are hypothesized to have higher survival rates ("Women and children first").

2. **Socioeconomic Status**:
   - Does passenger class (`Pclass`) correlate with survival?
   - Higher-class passengers often had better access to lifeboats.

3. **Family Connections**:
   - Does traveling with family (siblings, spouses, parents, children) improve survival odds?

4. **Embarkation and Fare**:
   - Were survival rates affected by the port of embarkation or the ticket fare?
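
The questions above can be explored with group-wise survival rates. The following is a minimal sketch on a small invented table that mimics the `train.csv` columns. The helper `survival_rate_by` and the derived `FamilySize` and `TravelGroup` columns are hypothetical names used for illustration, not part of the dataset.

```python
import pandas as pd

# Toy rows that mimic the train.csv schema; the values are invented
# for illustration and are NOT taken from the real dataset.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Pclass":   [1, 3, 1, 3, 2, 3, 2, 1],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "SibSp":    [1, 0, 0, 1, 0, 0, 0, 1],
    "Parch":    [0, 0, 2, 1, 0, 0, 0, 0],
})


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the mean of `Survived` (the survival rate) per group."""
    return df.groupby(column)["Survived"].mean()


# Hypothetical derived features: relatives aboard and a with/without flag.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]
toy["TravelGroup"] = toy["FamilySize"].gt(0).map(
    {True: "with family", False: "alone"}
)

rate_by_sex = survival_rate_by(toy, "Sex")
rate_by_class = survival_rate_by(toy, "Pclass")
rate_by_family = survival_rate_by(toy, "TravelGroup")
print(rate_by_sex, rate_by_class, rate_by_family, sep="\n\n")
```

On the real data, the same helper could be applied to the loaded training DataFrame, for example `survival_rate_by(training, "Sex")`.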

## Challenges
- Missing data, mainly in the `Age` and `Cabin` columns; `test.csv` also has one missing `Fare` value.
- Need to preprocess data (e.g., encode categorical variables like `Sex` and `Embarked`).
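
One possible way to handle both challenges is sketched below, assuming median imputation for `Age`, a known/unknown indicator in place of `Cabin`, and simple encodings for `Sex` and `Embarked`. The function name `preprocess` and the `HasCabin` column are hypothetical, the rows are invented, and other imputation or encoding choices may work better on the real data.

```python
import pandas as pd

# Toy rows with the same column names as train.csv; values are invented.
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male"],
    "Age":      [22.0, None, 30.0, None],
    "Cabin":    [None, "C85", None, None],
    "Embarked": ["S", "C", "Q", "S"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute Age, flag Cabin availability, and encode Sex and Embarked."""
    out = df.copy()
    # Median imputation is one simple choice among many for Age.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Keep only a known/unknown indicator for Cabin
    # (HasCabin is a hypothetical column name).
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    out = out.drop(columns="Cabin")
    # Binary encoding for Sex, one-hot encoding for Embarked.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], prefix="Embarked", dtype=int)
    return out


clean = preprocess(toy)
print(clean)
```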

## Useful Links
- Kaggle Titanic Dataset: [https://www.kaggle.com/c/titanic/data](https://www.kaggle.com/c/titanic/data)

---

# Breast Cancer Dataset

## Overview
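The Breast Cancer dataset (Breast Cancer Wisconsin, Diagnostic) is widely used for practicing binary classification. The task is to predict whether a tumor is **malignant** or **benign** from features computed on digitized images of fine needle aspirate (FNA) samples of breast masses; the features describe the cell nuclei visible in each image.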

## Dataset Information
The dataset includes the following key features:

| Column Name           | Description                                                                                          |
|-----------------------|------------------------------------------------------------------------------------------------------|
| **ID**                | Unique identifier for each sample.                                                                  |
| **Diagnosis**         | Target variable: `M` = malignant, `B` = benign.                                                     |
| **Radius_mean**       | Mean of the distances from the center to points on the perimeter.                                   |
| **Texture_mean**      | Standard deviation of gray-scale values.                                                            |
| **Perimeter_mean**    | Mean perimeter of the cell nuclei.                                                                  |
| **Area_mean**         | Mean area of the cell nuclei.                                                                       |
| **Smoothness_mean**   | Mean smoothness (local variation in radius lengths).                                                |
| **Compactness_mean**  | Mean compactness (perimeter² / area - 1.0).                                                         |
| **Concavity_mean**    | Mean severity of concave portions of the nucleus contour.                                           |
| **Symmetry_mean**     | Mean symmetry of the cell nuclei.                                                                   |
| **Fractal_dimension_mean** | Mean of fractal dimension ("coastline approximation" - 1).                                     |
| **...**               | The same features are also provided as standard error (`_se`) and worst measurements (`_worst`).    |

## Key Insights for Analysis
1. **Predictive Modeling**:
   - Can we accurately classify tumors as benign or malignant based on the features?
   - Focus on high-importance features like `Radius_mean`, `Area_mean`, and `Compactness_mean`.

2. **Exploratory Analysis**:
   - Visualize differences in features between benign and malignant tumors.
   - Check for correlations between features.

3. **Feature Engineering**:
   - Normalize or scale the data for algorithms sensitive to feature magnitudes.
   - Reduce dimensionality using techniques like PCA (Principal Component Analysis).

4. **Evaluation Metrics**:
   - Use precision, recall, F1-score, and ROC-AUC to evaluate model performance.
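
The feature engineering and evaluation steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the function names `standardize`, `pca_project`, and `precision_recall_f1` are hypothetical, all numbers and labels are invented, and ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score each column: subtract the mean, divide by the std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca_project(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project centered data onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def precision_recall_f1(y_true, y_pred, positive="M"):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented measurements on very different scales (think radius vs. area).
X = np.array([[10.0, 300.0], [12.0, 450.0], [14.0, 600.0], [20.0, 1200.0]])
X_scaled = standardize(X)
X_reduced = pca_project(X_scaled, n_components=1)

# Invented labels and predictions, only to exercise the metric function.
y_true = ["M", "M", "B", "B", "M", "B"]
y_pred = ["M", "B", "B", "M", "M", "B"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(X_reduced.shape, precision, recall, f1)
```

Treating `M` as the positive class is a choice made here so that recall measures how many malignant cases are caught.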

## Challenges
- The dataset is already clean but may require feature selection or dimensionality reduction.
- The classes (`M` and `B`) may be imbalanced, in which case they need to be rebalanced or weighted.
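
Whether the classes are imbalanced can be checked before modeling. A minimal sketch, assuming the labels are available as a pandas Series; the labels below are invented and `imbalance_ratio` is a hypothetical name:

```python
import pandas as pd

# Invented labels; in practice this would be the Diagnosis column of the CSV.
diagnosis = pd.Series(["B", "M", "B", "B", "M", "B", "B", "B"], name="Diagnosis")

counts = diagnosis.value_counts()
shares = diagnosis.value_counts(normalize=True)
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(counts, shares, imbalance_ratio, sep="\n")
```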

## Useful Links
- Kaggle Breast Cancer Dataset: [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)


In [1]:
import pandas as pd


In [4]:
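# Load the Kaggle Titanic files (expected in the working directory).
# The comma is the default separator, so sep="," is not needed.
genders = pd.read_csv("gender_submission.csv")
tests = pd.read_csv("test.csv")
training = pd.read_csv("train.csv")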

In [7]:
# info() prints its summary itself and returns None, so print() is not needed.
genders.info()
tests.info()
training.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
[output truncated]

In [8]:
training.shape

(891, 12)
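
The `tests.info()` output above reports non-null counts rather than missing counts. As a small worked example, the missing values per column can be derived from those figures; the variable names here are hypothetical:

```python
# Non-null counts copied from the tests.info() output above (418 rows).
total_rows = 418
non_null = {"Age": 332, "Fare": 417, "Cabin": 91}

# Missing = total rows minus non-null entries, also as a percentage.
missing = {col: total_rows - n for col, n in non_null.items()}
missing_pct = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}

# With the DataFrame loaded, tests.isna().sum() gives the same counts.
print(missing)
print(missing_pct)
```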