##### Instructions
- Keep the original structure. You may add code cells and/or Markdown cells for clarity, legibility, and/or structure.
- Add the required descriptions, explanations, and justifications to the Markdown cells. You can find more Markdown tips and tricks online, for example [here](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) and [here](https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet).

# EXAM03: Data Science Group Assignment - Iteration 1

**Group name:** [Enter Group Number]

**Student names & numbers:**
* [Damian van der Sluis] - []
* [Achraf El Azzouzi] - [101674]
* [Saeed Alhasan] - []


---

## 0. Iteration setup

**Import libraries**

In [3]:
import pandas as pd

**Load dataset(s)**

In [6]:
df = pd.read_csv('ships_inventory_iter1.csv')

---

## 1. Business Understanding
*Rubric: LO 6.4D (Reflection on Process)*

**Situation description**

*Describe the Nebula Brokerage pricing problem. Why is their current "gut feeling" approach a risk?*

**Business objective(s)**

*Justify why a data-driven baseline is needed*

**Data mining goal(s)**

*Explain what type of modeling task this is and why.*

**Success criteria**

*Determine success criteria for this iteration (the benchmark)*

---

## 2. Data Understanding
*Rubric: LO 7.3Q (Visuals) & LO 6.4C (Process)*

**Data exploration**

*Include summary statistics and descriptions of data types below. Describe your findings.*

In [22]:
# Summary statistics
display(df.describe())

# Data types and basic information
# df.info() prints its report and returns None, so it is called directly instead of being wrapped in display()
df.info()

|  | Ship_ID | Galactic_Credits | Model_Cycle |
| :--- | :--- | :--- | :--- |
| count | 368814.0 | 368814.0 | 361408.0 |
| mean | 7311485000.0 | 19453.536818 | 7511.264529 |
| std | 4381124.0 | 15540.472943 | 9.078571 |
| min | 7301583000.0 | 501.0 | 7400.0 |
| 25% | 7308105000.0 | 7950.0 | 7508.0 |
| 50% | 7312604000.0 | 15990.0 | 7513.0 |
| 75% | 7315245000.0 | 27990.0 | 7517.0 |
| max | 7317101000.0 | 777777.0 | 7522.0 |


<class 'pandas.DataFrame'>
RangeIndex: 368814 entries, 0 to 368813
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Ship_ID            368814 non-null  int64  
 1   Galactic_Credits   368814 non-null  int64  
 2   Model_Cycle        361408 non-null  float64
 3   Ship_Manufacturer  368814 non-null  str    
 4   Sector             368814 non-null  str    
dtypes: float64(1), int64(2), str(2)
memory usage: 14.1 MB

In [21]:

print(f"Dataset shape: {df.shape[0]} rows and {df.shape[1]} columns")

Dataset shape: 368814 rows and 5 columns

**Visualizations and patterns**

*Discover patterns in the data by creating visualizations. Create at least a histogram of Galactic_Credits. Describe your observations.*

In [None]:
# CODE CELL: Generate visualizations (e.g., scatter plots, histograms)
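
A minimal sketch of the required histogram, assuming matplotlib is installed. To stay runnable on its own, it uses a small synthetic, right-skewed sample (`demo`, hypothetical) in place of `df`; the bin count and the extra log-scaled panel are illustrative choices, not findings from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df["Galactic_Credits"]: synthetic right-skewed prices
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"Galactic_Credits": rng.lognormal(mean=9.5, sigma=0.8, size=5_000).round()}
)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Raw scale: makes a long right tail visible
axes[0].hist(demo["Galactic_Credits"], bins=50)
axes[0].set_title("Galactic_Credits (raw scale)")
axes[0].set_xlabel("Galactic_Credits")
axes[0].set_ylabel("Number of ships")

# Log scale: spreads out the bulk of the prices
axes[1].hist(np.log10(demo["Galactic_Credits"]), bins=50)
axes[1].set_title("Galactic_Credits (log10 scale)")
axes[1].set_xlabel("log10(Galactic_Credits)")
axes[1].set_ylabel("Number of ships")

fig.tight_layout()

# A positive skewness value supports a "right-skewed" description in the write-up
skewness = demo["Galactic_Credits"].skew()
print(f"Skewness: {skewness:.2f}")
```

To apply this to the real data, replace `demo` with `df` and call `plt.show()` in the notebook.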

**Data insights and data quality**
* **Insights:** What are the key trends? What does the distribution look like? What does that mean? 
* **Quality issues:** Document missing values, duplicates, outliers, etc.
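
The quality checks listed above could be documented with a short audit like the following sketch. The miniature `demo` table and its values are invented for illustration, and treating a repeated `Ship_ID` as a possible duplicate is an assumption about the data, not something the assignment states.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ships table (values invented for illustration)
demo = pd.DataFrame({
    "Ship_ID": [1, 2, 2, 3, 4],
    "Galactic_Credits": [4950, 18999, 18999, 777777, 6495],
    "Model_Cycle": [7505.0, 7518.0, 7518.0, np.nan, 7511.0],
    "Ship_Manufacturer": ["Galactic Motors", "Galactic Motors", "Galactic Motors",
                          "Nebula Industries", "Nebula Industries"],
})

# Missing values per column, as counts and as a percentage of all rows
missing = pd.DataFrame({
    "missing": demo.isna().sum(),
    "missing_pct": (demo.isna().mean() * 100).round(2),
})

# Fully identical rows versus rows that only share an identifier
duplicate_rows = int(demo.duplicated().sum())
duplicate_ids = int(demo["Ship_ID"].duplicated().sum())

print(missing)
print(f"Duplicate rows: {duplicate_rows}")
print(f"Repeated Ship_ID values: {duplicate_ids}")
```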

---

## 3. Data Preparation
*Rubric: LO 6.4C (Data Science Steps)*

**Cleaning and preprocessing**

*Describe and justify steps taken (e.g., imputation, handling outliers, fixing other errors).*

The following data cleaning steps are applied:

1. **Missing Values Handling**: Columns with missing values are imputed according to their type. In this dataset only `Model_Cycle` has missing values (361,408 non-null values out of 368,814 rows, so 7,406 are missing).
   - For numerical columns: median imputation (robust to outliers)
   - For categorical columns: mode imputation (most frequent value)

2. **Duplicate Records**: Duplicate rows are removed to avoid biasing the model.

3. **Outlier Detection**: The IQR method (values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR) is applied to `Galactic_Credits` and the other numerical columns. Outliers are counted but not removed, as they may represent legitimate rare ships.

4. **Data Type Corrections**: Columns are checked for appropriate data types (e.g., categorical features should be stored as a string or category type).

5. **Feature Scaling**: Numerical features are standardized (z-score) so that they are on a comparable scale for scale-sensitive models.

In [7]:
# CODE CELL: Data cleaning and preprocessing steps

# Load data if not loaded yet
if 'df' not in globals():
    df = pd.read_csv('ships_inventory_iter1.csv')

# Work on a copy so the raw data stays untouched
df_clean = df.copy()

# Column groups by type
numeric_cols = df_clean.select_dtypes(include='number').columns
# 'string' is listed explicitly because pandas 3 stores text columns with the new str dtype;
# selecting only 'object' raises a deprecation warning there (see the pandas string migration guide)
categorical_cols = df_clean.select_dtypes(include=['object', 'string', 'category']).columns

# Missing values: median imputation for numeric columns
for col in numeric_cols:
    if df_clean[col].isna().any():
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Missing values: mode imputation for categorical columns
for col in categorical_cols:
    if df_clean[col].isna().any():
        df_clean[col] = df_clean[col].fillna(df_clean[col].mode(dropna=True).iloc[0])

# Duplicates
df_clean = df_clean.drop_duplicates()

# Outlier counts (IQR): values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
for col in numeric_cols:
    q1 = df_clean[col].quantile(0.25)
    q3 = df_clean[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = df_clean[(df_clean[col] < lower) | (df_clean[col] > upper)]
    print(f"{col}: {len(outliers)} outliers detected")


Ship_ID: 0 outliers detected
Galactic_Credits: 7245 outliers detected
Model_Cycle: 12464 outliers detected
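
The cell above only counts outliers. If they should also be marked per row without being removed, one option is a boolean flag column, sketched below. The helper `iqr_outlier_mask`, the column name `Galactic_Credits_outlier`, and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical sample in place of df_clean
demo = pd.DataFrame(
    {"Galactic_Credits": [4000, 4950, 6495, 15990, 18999, 27990, 777777]}
)

# Keep every row; the flag lets later steps include or exclude outliers explicitly
demo["Galactic_Credits_outlier"] = iqr_outlier_mask(demo["Galactic_Credits"])

print(demo)
print(f"Flagged rows: {int(demo['Galactic_Credits_outlier'].sum())}")
```

Keeping the flag instead of dropping rows makes it possible to compare model results with and without the extreme prices later on.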


**Adjusting dataset (optional)**
*If you adjusted the dataset for modeling in additional ways, describe that here*

In [8]:
# OPTIONAL CODE CELL: Additional preprocessing steps

# If df_clean doesn't exist yet, create it quickly
if 'df_clean' not in globals():
    if 'df' not in globals():
        df = pd.read_csv('ships_inventory_iter1.csv')
    df_clean = df.copy()

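numeric_cols = df_clean.select_dtypes(include='number').columns

# Standardize (z-score) the numeric features only: the identifier (Ship_ID) and the
# target (Galactic_Credits) are left on their original scale
feature_cols = numeric_cols.difference(['Ship_ID', 'Galactic_Credits'])

df_scaled = df_clean.copy()
means = df_scaled[feature_cols].mean()
stds = df_scaled[feature_cols].std(ddof=0).replace(0, 1)  # avoid division by zero for constant columns
df_scaled[feature_cols] = (df_scaled[feature_cols] - means) / stds

# Preview of the cleaned (unscaled) data; the standardized copy is kept in df_scaled
display(df_clean.head())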

|  | Ship_ID | Galactic_Credits | Model_Cycle | Ship_Manufacturer | Sector |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 7316160254 | 4950 | 7505.0 | Galactic Motors | Mon Cala Ocean Worlds |
| 1 | 7316115206 | 18999 | 7518.0 | Galactic Motors | Thraxos Blockade |
| 2 | 7315865657 | 4000 | 7486.0 | Republic Aerospace | Indoumodo Sector |
| 3 | 7314772431 | 6495 | 7511.0 | Nebula Industries | Pantora Moon |
| 4 | 7311539325 | 3995 | 7499.0 | Corellian Engineering | Malastare Narrows |


---

## 4. Modeling
*Rubric: LO 6.4C (Data Science Steps)*

**Model setup**
*Describe and justify the creation of your simple benchmark model to predict Galactic_Credits*

In [None]:
# CODE CELL: Model training and setup code
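
One possible shape for a simple benchmark is a model that ignores almost all features and predicts a typical price. The sketch below is an assumption about what such a baseline might look like, not the required solution: it builds a synthetic `demo` table (invented values), makes an 80/20 shuffled split, and derives two benchmarks, a global median and a per-manufacturer median.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data in place of df_clean (values invented for illustration)
rng = np.random.default_rng(42)
n = 1_000
makers = rng.choice(
    ["Galactic Motors", "Nebula Industries", "Republic Aerospace"], size=n
)
base_price = pd.Series(makers).map(
    {"Galactic Motors": 10_000, "Nebula Industries": 20_000, "Republic Aerospace": 30_000}
)
demo = pd.DataFrame({
    "Ship_Manufacturer": makers,
    "Galactic_Credits": (base_price + rng.normal(0, 3_000, n)).round().clip(lower=501),
})

# 80/20 shuffled train/test split so the benchmark is judged on unseen rows
shuffled = demo.sample(frac=1.0, random_state=42)
split = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:split], shuffled.iloc[split:]

# Benchmark 1: predict the training median for every ship
global_median = train["Galactic_Credits"].median()
pred_global = pd.Series(global_median, index=test.index)

# Benchmark 2: predict the training median of the ship's manufacturer,
# falling back to the global median for manufacturers not seen in training
group_medians = train.groupby("Ship_Manufacturer")["Galactic_Credits"].median()
pred_group = test["Ship_Manufacturer"].map(group_medians).fillna(global_median)

print(f"Global median: {global_median:.0f}")
print(group_medians)
```

The medians are computed on the training rows only, so the test rows do not leak into the benchmark.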

**Testing and performance**
*Describe how you tested the model and interpret the metrics. Make sure to present the metrics in a clear overview.*

In [1]:
# CODE CELL: Model evaluation code
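
A clear metric overview could be built as in the following sketch, which compares a mean and a median baseline using MAE and RMSE. The helper functions `mae` and `rmse` and all prices are hypothetical; the choice of these two metrics is an assumption, since the assignment does not prescribe them.

```python
import numpy as np
import pandas as pd

def mae(y_true, y_pred) -> float:
    """Mean absolute error: average size of the miss, in credits."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: penalizes large misses more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Hypothetical prices (invented for illustration)
y_train = np.array([3_000.0, 5_000.0, 7_000.0, 9_000.0, 76_000.0])
y_test = np.array([4_000.0, 6_000.0, 8_000.0, 50_000.0])

# Each baseline predicts one constant learned from the training prices
predictions = {
    "mean baseline": np.full(len(y_test), y_train.mean()),
    "median baseline": np.full(len(y_test), np.median(y_train)),
}

overview = pd.DataFrame(
    [
        {"model": name, "MAE": mae(y_test, pred), "RMSE": rmse(y_test, pred)}
        for name, pred in predictions.items()
    ]
).set_index("model")

print(overview.round(2))
```

In this toy example the median baseline has the lower MAE but the higher RMSE, because its single large miss on the expensive ship is squared. A gap between the two metrics is therefore a hint that a few large errors dominate.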

---

## 5. Evaluation
*Rubric: LO 6.4C (Results vs. Objectives)*

**Assessment against success criteria**
*What is the difference between the metrics? What does this mean? Did you meet the goals set in the Business Understanding?*

**Key findings and limitations**
*What did you learn? What are the limitations of this current model?*

---

## 6. Personal Contribution
*Rubric: LO 7.3P (Equal Contribution)*

| Student name | Contribution | Personal lessons learned |
| :--- | :--- | :--- |
| Damian van der Sluis | *Contribution description* | *Personal lessons learned this iteration* |
| Saeed Alhasan | *Contribution description* | *Personal lessons learned this iteration* |
| Achraf El Azzouzi | *Contribution description* | *Personal lessons learned this iteration* |