# Business Understanding
## 1 Determining Business Objectives
### Finding Business Objectives
In this project, the Titanic passenger dataset is used to analyze which characteristics most influenced survival during the sinking. The dataset includes passenger attributes that are used to build a machine learning model to predict whether a passenger would survive.

Objective: Build a machine learning model that accurately predicts which passengers survived the Titanic disaster.

Tip: Identifying meaningful patterns in the features will help improve the model’s accuracy.


### 1.1 Business Background
#### Determine Organizational Structure
This project is carried out by me alone. I am responsible for all project activities, including data preparation, analysis, modeling, evaluation, and documentation.

The primary stakeholders and benefiting units are:
* the academic program and instructor, who use the project to assess practical data science and CRISP-DM skills;
* future students and data science learners, who can use the project as a reference example of predictive modeling and a structured data mining workflow.

#### Describe Problem Area
The problem was that there were not enough lifeboats on the Titanic, which resulted in the death of 1502 out of 2224 people.
The motivation behind this project is to apply CRISP-DM and build a model that predicts Titanic survival from passenger data.

#### Describe Current Solution
Currently, there is no existing solution.


### 1.2 Defining Business Objectives
The objective of this project is to use data mining techniques to analyze historical passenger data from the Titanic disaster and build a predictive model that estimates the probability of passenger survival. Given a set of passenger attributes such as class, age, sex, family relationships, and fare, the problem is to identify patterns and relationships that influenced survival outcomes and to use these patterns to predict whether a passenger would have survived. 

Business questions (precisely defined)
* Which passenger characteristics (such as sex, class, age, fare, and family relationships) have the strongest influence on survival?
* Can a predictive model accurately classify whether a passenger survived based on available features?
* Which variables contribute most to the model’s predictions?
* How well do different classification models perform on this problem?
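
As a minimal sketch of how the first question could be explored, the snippet below computes the survival rate per group with `pandas`. The rows are made-up illustration data, not values from `titanic1.csv`, and `survival_rate_by` is a hypothetical helper. The encoded `Sex` values are deliberately left uninterpreted because their meaning still has to be verified.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from titanic1.csv.
passengers = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 3],
    "2urvived": [0, 1, 1, 0, 0, 1, 0, 0],
})


def survival_rate_by(df, feature, target="2urvived"):
    """Hypothetical helper: mean of the 0/1 target per value of `feature`."""
    return df.groupby(feature)[target].mean()


rate_by_sex = survival_rate_by(passengers, "Sex")
rate_by_pclass = survival_rate_by(passengers, "Pclass")
print(rate_by_sex)
print(rate_by_pclass)
```

Large differences between groups would suggest that a feature is related to survival, but a grouped mean alone does not show causation or account for interactions between features.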

Other business (project) requirements
* The project must follow the CRISP-DM methodology and be structured accordingly
* All steps must be clearly documented and justified
* The workflow must be reproducible
* Proper model validation and evaluation metrics must be used
* Visualizations and tables must support analytical conclusions
* Results must be consistent with the stated objective and supported by data

### 1.3 Business Success Criteria
* Predictive performance: The final model achieves at least 0.75 accuracy on a held-out test set.
* Transparent reporting: Results are reported with a confusion matrix and key metrics (accuracy, precision, recall, F1, ROC-AUC) so false positives/false negatives are visible.
* Explainability: The project identifies and explains the main survival drivers using feature importance / coefficients plus supporting plots or tables.
* Reproducibility: The full workflow runs end-to-end in a single notebook with fixed random seeds, documented assumptions, and all steps needed to reproduce results.
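
To make the reporting criterion concrete, the following sketch shows how accuracy, precision, recall, and F1 follow from the four confusion-matrix counts. It uses plain Python and made-up labels; `binary_metrics` is a hypothetical helper, and in the notebook the equivalent functions from `scikit-learn` would normally be used. ROC-AUC is left out because it needs predicted probabilities rather than hard 0/1 labels.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survived, predicted survived
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # died, predicted died
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # died, predicted survived
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survived, predicted died

    def safe_div(num, den):
        # Avoid division by zero when a class is never predicted or never present.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting the raw counts next to the ratios keeps false positives and false negatives visible even when accuracy looks acceptable.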

Subjective quality criteria
* Clarity: Conclusions are clearly stated and directly supported by the reported metrics and visualizations.
* Process quality: The notebook follows CRISP-DM phases with clear documentation of decisions (data cleaning, feature engineering, model choice, evaluation).

## 2 Assessing the Situation
Data
The data available for this project is a single structured dataset in CSV format named `titanic1.csv`. Each row represents one passenger. The dataset is labeled, which makes it suitable for supervised machine learning, specifically a binary classification task (survived vs. not survived).

The dataset contains the following main columns:
* `Passengerid`: unique passenger ID
* `Age`: passenger age (numeric)
* `Fare`: ticket price (numeric)
* `Sex`: encoded as 0 or 1
* `sibsp`: number of siblings/spouses aboard
* `Parch`: number of parents/children aboard
* `Pclass`: passenger class (1, 2, 3)
* `Embarked`: encoded embarkation port
* `2urvived`: target variable (0 or 1)
* `zero` / `zero.*`: columns filled with 0 values

Some variables, such as Sex and Embarked, are already encoded, so their meaning needs to be verified during Data Understanding. The dataset also includes multiple “zero” columns, which will be checked and removed during Data Preparation if they contain no useful information.
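
One possible way to verify the “zero” columns before dropping them is sketched below. The small DataFrame only mimics the column names listed above with made-up values, and `find_all_zero_columns` is a hypothetical helper, not part of `pandas`.

```python
import pandas as pd


def find_all_zero_columns(df):
    """Hypothetical helper: names of columns whose every value equals 0."""
    # A missing value compares unequal to 0, so columns with NaN are not flagged.
    return [col for col in df.columns if (df[col] == 0).all()]


# Made-up rows that only mimic the column layout of the dataset.
sample = pd.DataFrame({
    "Passengerid": [1, 2, 3],
    "Age": [22.0, 38.0, 26.0],
    "zero": [0, 0, 0],
    "zero.1": [0, 0, 0],
    "Sex": [0, 1, 1],
    "2urvived": [0, 1, 1],
})

zero_cols = find_all_zero_columns(sample)
cleaned = sample.drop(columns=zero_cols)
print("Dropped:", zero_cols)
print("Remaining:", list(cleaned.columns))
```

Checking the values instead of matching on the column name avoids dropping a column that happens to be called `zero` but actually carries information.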

The project is carried out by one student. Basic Python and machine learning knowledge is assumed from the course, and guidance is available through course materials and the instructor.

Risks
The main risks are missing or inconsistent values, which will need to be verified; non-informative columns such as the “zero” columns; and overfitting due to the limited dataset size. There is also a risk of misinterpreting encoded variables if no clear explanation of the encoding is available.

Contingency plan
To handle these risks, the dataset will be checked for missing values and consistency, and non-informative columns will be removed after verification.
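
The missing-value check could look roughly like the sketch below. The sample values are invented for illustration, and `missing_value_report` is a hypothetical helper built on `DataFrame.isna`.

```python
import numpy as np
import pandas as pd


def missing_value_report(df):
    """Hypothetical helper: count and share of missing values per column."""
    counts = df.isna().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    # Keep only columns that actually have gaps, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Invented values for illustration only.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 3],
})

report = missing_value_report(sample)
print(report)
```

A report like this makes it easier to decide per column whether to impute, drop rows, or drop the column entirely.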

### 2.1 Resource Inventory
This project is executed on a personal computer. Because it involves only basic machine learning, no specific machine is required. The primary data source is a single CSV file (`titanic1.csv`) provided as part of the course assignment. The dataset is stored locally and loaded directly in the Jupyter Notebook using Python. All data preparation, modeling, evaluation, and documentation are performed within the same Jupyter Notebook.


### 2.2 Requirements, Assumptions, and Constraints

#### Determine Requirements
* The project must follow the CRISP-DM methodology and be structured accordingly
* All steps must be clearly documented and justified
* The workflow must be reproducible
* Proper model validation and evaluation metrics must be used
* Visualizations and tables must support analytical conclusions
* Results must be consistent with the stated objective and supported by data

#### Clarify Assumptions

Data assumptions
* The survival label is correct: `2urvived` correctly indicates whether the passenger survived (0/1).
* The columns represent what they claim: Age is age in years, Fare is ticket price, and Pclass is passenger class.
* Missing values are not completely random, but are manageable.
* The available features are sufficient for the classification task.
* The “zero” columns are non-informative placeholders: I assume columns full of zeros do not contain useful signal and can be removed after verification.

I assume the dataset is mostly correct and complete enough to perform analysis. I also assume the survival label is reliable, and that missing values and non-informative columns can be handled during data preparation.

#### Constraints
* Deep learning techniques are not permitted.
* The project must be completed in a single Jupyter Notebook
* Deployment

### 2.3 Risks and Contingencies
* Scheduling: The project takes 5 weeks. At the end of week 5, the related documents mentioned before must be submitted.
* Financial: There are no financial risks.
* Data: The data is not expected to be of poor quality, because it was prepared for educational purposes by the professors.
* Results: ______

### 2.4 Terminology
Survived (2urvived) - Binary target variable indicating whether a passenger survived (1) or did not survive (0).

Binary classification - A machine learning task where the outcome has two possible values, here 0 or 1.

Missing values - Data entries that are not present in the dataset and must be handled during data preparation.

Encoding - The process of converting categorical variables into numerical form, so they can be used by machine learning models.

CRISP-DM - A standard data mining methodology used to structure the project from business understanding to evaluation.

### 2.5 Cost/Benefit Analysis
There are no costs. The benefit is a better understanding of the data, in particular statistical insight into which passengers are more likely to survive when a ship sinks.

## 3 Determining Data Mining Goals

### Data Mining Goals
The task is a supervised classification problem, where the goal is to predict a binary outcome: whether a passenger survived or did not survive the Titanic disaster.

Technical goals/Data Mining goals:
* Build at least one machine learning classification model that predicts survival based on passenger attributes
* Achieve a minimum accuracy of approximately 75% on the test set, while also reporting complementary metrics

### Data Mining Success Criteria
Model performance is evaluated using a train/test split. The trained model's performance is measured primarily using accuracy, supported by a confusion matrix to analyze classification errors.

Quantitative success benchmarks:
* The data mining task is considered successful if the model achieves approximately 75% or higher accuracy on the test set while maintaining balanced performance across classes.
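
The benchmark above could be checked along the lines of the following sketch. It runs on synthetic data from `make_classification` rather than on `titanic1.csv`, so the printed numbers say nothing about the real dataset. The names `RANDOM_STATE` and `ACCURACY_TARGET`, the 80/20 split, and the choice of Logistic Regression are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42        # assumed fixed seed for reproducibility
ACCURACY_TARGET = 0.75   # threshold taken from the success criterion

# Synthetic stand-in data; the real notebook would load titanic1.csv instead.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=RANDOM_STATE
)

# Stratify so both classes keep the same proportion in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
per_class_recall = cm.diagonal() / cm.sum(axis=1)
meets_target = accuracy >= ACCURACY_TARGET

print("accuracy:", accuracy)
print("confusion matrix:\n", cm)
print("recall per class:", per_class_recall)
print("meets accuracy target:", meets_target)
```

Comparing the recall of both classes is one simple way to judge whether performance is balanced across classes rather than driven by the majority class.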

Subjective success criteria:
* Results and evaluation metrics are clearly explained and easy to interpret
* The modeling decisions and outcomes are logically consistent with the project objective
* Key survival factors are interpretable and supported by data analysis and visualizations
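
For the interpretability criterion, one option is to rank features by the importance scores of a tree ensemble, as sketched below. The data is synthetic and the feature names are generic placeholders, so the ranking carries no meaning for the Titanic data. Using a Random Forest here is an assumption, and Logistic Regression coefficients would be an alternative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

RANDOM_STATE = 42  # assumed fixed seed

# Synthetic stand-in data with placeholder feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=RANDOM_STATE,
)

forest = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read together with the supporting plots and tables rather than on their own.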

## 4 Project Plan
### 4.1 Project Plan
#### Week 1 — Business Understanding
- Define the project objective: predict Titanic passenger survival using historical data.
- Formulate business questions and data mining questions.
- Define data mining goals and success criteria (e.g., target accuracy, evaluation metrics).
- Identify requirements, assumptions, and constraints.
- Document the planned CRISP-DM structure of the notebook.

#### Week 2 — Data Understanding
- Load the provided `titanic1.csv` dataset in the Jupyter Notebook.
- Describe the dataset structure, features, and target variable.
- Perform exploratory data analysis (EDA) using summary statistics and visualizations.
- Identify data quality issues (missing values, class imbalance, outliers).
- Decide which features are relevant for modeling.

#### Week 3 — Data Preparation
- Clean the data by handling missing values and inconsistencies.
- Encode categorical variables and scale numerical features if required.
- Engineer additional features where appropriate.
- Split the data into training and test sets.
- Ensure preprocessing steps are reproducible and clearly documented.
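
A reproducible preparation step might be organized as in the sketch below, which splits first and then fits the preprocessing on the training part only. The rows are invented, and median imputation plus standard scaling are assumed choices for illustration, not decisions already made in this project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42  # assumed fixed seed

# Invented rows for illustration only.
data = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 8.05, 11.13],
    "2urvived": [0, 1, 1, 1, 0, 0, 0, 1],
})
features = data[["Age", "Fare"]]
target = data["2urvived"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=RANDOM_STATE
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # zero mean, unit variance
])

# Fit on the training data only, then reuse the fitted steps on the test data.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the imputer and scaler on the training split alone keeps information from the test set out of the preprocessing, which would otherwise make the evaluation look better than it is.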

#### Week 4 — Modeling
- Train a baseline classification model using a standard ML library.
- Train and compare additional classical machine learning models.
- Tune key hyperparameters where necessary.
- Explain how the selected model(s) work and justify the final choice.

#### Week 5 — Evaluation
- Evaluate the final model on the test set.
- Report accuracy and a confusion matrix (and other relevant metrics).
- Analyze model errors and limitations.
- Assess whether the data mining success criteria are met.
- Summarize findings and conclusions.


### 4.2 Assessing Tools and Techniques
Based on the structured, tabular nature of the data and the binary target variable, classical supervised classification techniques such as Logistic Regression, Decision Tree–based models, and ensemble methods are considered the most suitable candidates.

- **Tools:**  
  The project is implemented in **Python using a Jupyter Notebook**. Core libraries include `pandas` and `numpy` for data handling, `matplotlib`/`seaborn` for visualization, and `scikit-learn` for modeling and evaluation.

- **Selected techniques:**  
  Since the objective is to predict passenger survival, the task is treated as a **supervised classification** problem. Classical machine learning algorithms are considered, including:
  - Logistic Regression (baseline and interpretable model)
  - Decision Tree–based models (to capture non-linear relationships)
  - Ensemble methods such as Random Forest or Gradient Boosting (for improved performance)

- **Rationale for technique selection:**  
  These techniques are well-suited for structured tabular data, support binary classification, and are allowed under the project constraints (no deep learning). They also provide mechanisms for model evaluation and interpretability.

- **Evaluation support:**  
  `scikit-learn` provides built-in tools for **train/test splitting**, **cross-validation**, and **performance metrics** (accuracy, confusion matrix, precision, recall), enabling transparent and reproducible assessment of model performance.
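
As a rough sketch of how the candidate techniques could be compared with cross-validation, the example below scores three `scikit-learn` classifiers on synthetic data. The hyperparameters, the 5-fold setting, and the names `candidates` and `cv_results` are assumptions for illustration, and the scores do not reflect results on `titanic1.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 42  # assumed fixed seed

# Synthetic stand-in data; the real notebook would use the prepared Titanic features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=RANDOM_STATE
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=RANDOM_STATE),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
}

cv_results = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy for each candidate model.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in cv_results.items():
    print(f"{name}: mean accuracy {mean_acc:.3f} (std {std_acc:.3f})")
```

Looking at the spread across folds as well as the mean helps to judge whether a difference between two models is larger than the fold-to-fold variation.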
