# **(ADD THE NOTEBOOK NAME HERE)**

## Objectives

* Write your notebook objective here, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write down which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [13]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\M4x1m\\OneDrive\\Dokumente\\VS Code\\Code Institude'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [9]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [10]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\M4x1m\\OneDrive\\Dokumente\\VS Code\\Code Institude'
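
Running the `os.chdir` cell more than once moves up one further level each time. A minimal sketch of a guard against that, assuming the notebooks live in a subfolder named `jupyter_notebooks` (a hypothetical name, not taken from this project; adjust it to the actual folder):

```python
import os
from pathlib import Path


def ensure_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while the cwd is the notebook subfolder.

    `notebook_folder` is an assumed name, not taken from the project.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `ensure_project_root()` a second time leaves the working directory unchanged, so the cell can be re-run safely.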

# Loading the dataset and initial checks


In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the dataset:
Use pandas to read the raw car data CSV file into a DataFrame.
Check the shape:
Display the number of rows and columns in the loaded DataFrame to understand the dataset size.

In [12]:
df = pd.read_csv('D:/vscode-projects/car-price-analysis-hackathon-team4/Data/raw/cars.csv')
df.shape

FileNotFoundError: [Errno 2] No such file or directory: 'D:/vscode-projects/car-price-analysis-hackathon-team4/Data/raw/cars.csv'
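
The error above comes from an absolute path that only exists on one machine. One possible way around it, sketched under the assumption that the CSV sits at `Data/raw/cars.csv` below the project root and that the working directory has already been set to that root; `resolve_data_file` is a hypothetical helper, not part of pandas:

```python
from pathlib import Path
from typing import Optional


def resolve_data_file(relative_path: str, root: Optional[Path] = None) -> Path:
    """Return root/relative_path, raising a readable error if it is missing."""
    base = Path(root) if root is not None else Path.cwd()
    candidate = base / relative_path
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Expected data file at {candidate}; check the working directory"
        )
    return candidate


# Usage inside the notebook (assumed layout: <project root>/Data/raw/cars.csv):
# df = pd.read_csv(resolve_data_file("Data/raw/cars.csv"))
```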

View DataFrame Information:
Display a concise summary of the DataFrame, including column names, data types, and non-null counts. This helps to quickly assess the structure and completeness of the dataset.

In [None]:
df.info()

Preview the Data:
Display the first few rows of the DataFrame to get an initial look at the dataset and verify that it loaded correctly.

In [None]:
df.head()

Summary Statistics:
Generate descriptive statistics for the numerical columns in the DataFrame, such as mean, standard deviation, min, max, and quartiles. This provides a quick overview of the data distribution and helps identify potential outliers or anomalies.

In [None]:
df.describe()

List DataFrame Columns:
Display all column names in the DataFrame to review the available features and ensure the expected structure.

In [None]:
df.columns

---

# Cleaning

Check for Duplicate Rows:
Count the number of duplicate rows in the DataFrame to identify potential data quality issues and ensure the uniqueness of records.

In [None]:
df.duplicated().sum()

List DataFrame Columns:
Display all column names in the DataFrame to review the available features and ensure the expected structure.

In [None]:
df.columns

Extract Brand and Model Information:
Split the CarName column into separate brand and model columns for more granular analysis.
Remove the original CarName column to avoid redundancy, and preview the updated DataFrame to verify the changes.

In [None]:
# Extract brand and model from CarName column
df[['brand', 'model']] = df['CarName'].str.split(' ', n=1, expand=True)
# drop CarName column
df = df.drop('CarName', axis=1)
df.head()
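
To see what the split does on its own, here is a small sketch on made-up names (not rows from cars.csv). It illustrates why `n=1` matters and where missing model values come from:

```python
import pandas as pd

# Toy names, invented for illustration
names = pd.Series(["toyota corolla", "audi 100 ls", "subaru"])

# n=1 splits on the first space only, so multi-word models stay intact
parts = names.str.split(" ", n=1, expand=True)
parts.columns = ["brand", "model"]
print(parts)
```

A name without a space, such as "subaru" here, yields a missing value in the model column.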

Reorder Columns for Clarity:
Move the brand and model columns to immediately follow car_ID in the DataFrame.
This improves readability and makes it easier to analyze car attributes together.
Preview the updated DataFrame to confirm the new column order.

In [None]:
# Reorder columns to place 'brand' and 'model' right after 'car_ID'
cols = list(df.columns)
car_id_index = cols.index('car_ID')

# Remove brand and model from their current positions
cols.remove('brand')
cols.remove('model')

# Insert brand and model right after car_ID
cols[car_id_index+1:car_id_index+1] = ['brand', 'model']
df = df[cols]
df.head()

Standardize Brand Names:
Replace typos and inconsistencies in the brand column to ensure all car brands are labeled consistently for analysis and visualization.
Display the unique brand names after correction to verify the changes.

In [None]:
# Standardize brand names to correct typos and inconsistencies
# This ensures all brands are consistently labeled for analysis and visualization
brand_corrections = {
    'maxda': 'mazda',
    'Nissan': 'nissan',
    'porcshce': 'porsche',
    'toyouta': 'toyota',
    'vokswagen': 'volkswagen',
    'vw': 'volkswagen',
    'alfa-romero': 'alfa-romeo'
}
df['brand'] = df['brand'].replace(brand_corrections)
df['brand'].unique()
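
As a possible variation, lowercasing the column before applying the dictionary removes the need for case-only entries such as 'Nissan'. A sketch on toy values (the series and the shortened dictionary below are illustrative, not the notebook's data):

```python
import pandas as pd

brands = pd.Series(["Nissan", "maxda", "toyota", "vw"])
corrections = {"maxda": "mazda", "vw": "volkswagen"}

# Lowercasing first means case variants need no dictionary entry of their own
cleaned = brands.str.lower().str.strip().replace(corrections)
print(sorted(cleaned.unique()))
```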

Check for Missing Values:
Count the number of missing (null) values in each column to identify incomplete data and guide further cleaning steps.

In [None]:
df.isnull().sum() # Check for missing values

Handle Missing Model Names:
Fill any missing values in the model column with 'unknown' to maintain data integrity.
Recheck for missing values in all columns to confirm that the issue has been addressed.

In [None]:
df['model'] = df['model'].fillna('unknown') # Fill missing model names with 'unknown'
df.isnull().sum()

Check Data Types:
Display the data type of each column in the DataFrame to ensure correct formats for analysis and identify any columns that may need conversion.

In [None]:
df.dtypes

Explore Unique Values in Each Column:
Print the unique values for every column in the DataFrame to understand the range and categories of data present, and to identify potential issues or patterns for further analysis.

In [None]:
# Display unique values for every column
for col in df.columns:
    print(f'Unique values in {col}:')
    print(df[col].unique())
    print('-'*40)

# Transforming

Convert Textual Numbers to Integers:
Map the text values in the doornumber and cylindernumber columns to their corresponding integer values.
This conversion makes the data more suitable for mathematical operations, comparisons, and visualizations.
Preview the updated columns to verify the changes.

In [None]:
# Convert 'doornumber' and 'cylindernumber' from text to integers
# By mapping these words to their corresponding integer values, the data is more suitable for mathematical operations, comparisons, and visualizations.

number_map = {'two': 2, 'four': 4, 'three': 3, 'five': 5, 'six': 6, 'eight': 8, 'twelve': 12}
df['doornumber'] = df['doornumber'].map(number_map)
df['cylindernumber'] = df['cylindernumber'].map(number_map)

df[['doornumber', 'cylindernumber']].head()
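
`Series.map` silently turns any value missing from the dictionary into NaN, so it can be worth checking for unmapped words after the conversion. A small sketch on toy data (the series and the deliberate typo are made up):

```python
import pandas as pd

number_words = {"two": 2, "four": 4, "six": 6}
doors = pd.Series(["two", "four", "for"])  # "for" is a deliberate typo

mapped = doors.map(number_words)

# Any original value whose mapped result is NaN was not in the dictionary
unmapped = doors[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every value was covered by the mapping.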

Convert Price Values to Integers:
Round the values in the price column and convert them to integers to ensure consistency and facilitate numerical analysis.

In [None]:
# Convert all price values to integers
df['price'] = df['price'].round().astype(int)

Convert Engine Size to Liters:

Transform the enginesize column from cubic inches to liters by dividing each value by 61.0237 (one liter is about 61.0237 cubic inches). Then, round the results to two decimal places for consistency and readability. This ensures all engine size values are in a standard metric unit suitable for analysis.

In [None]:
df['enginesize'] = df['enginesize'] / 61.0237 # Convert enginesize from cubic inches to liters (1 liter is about 61.0237 cubic inches)
df['enginesize'] = df['enginesize'].round(2)
df.head()

Create Outlier Summary Table:

Generate a summary table that lists, for each numeric feature, the indices and values of detected outliers. This helps to quickly identify which rows (cars) have unusually high or low values for specific features, supporting further investigation or data cleaning. The table provides a clear overview of outlier distribution across the dataset.



In [None]:
# Calculate outlier flags using Z-score for each numeric column
from scipy.stats import zscore

# include='number' also catches other integer widths (astype(int) may yield int32 on some platforms)
numeric_cols = df.select_dtypes(include='number').columns
z_scores = df[numeric_cols].apply(zscore)
outlier_flags = z_scores.abs() > 3

# Create a table showing outliers for every column
outlier_table = []
for col in outlier_flags.columns:
    outlier_indices = df.index[outlier_flags[col]].tolist()
    outlier_values = df.loc[outlier_flags[col], col].tolist()
    outlier_table.append({
        'Feature': col,
        'Outlier_Indices': outlier_indices,
        'Outlier_Values': outlier_values
    })

outlier_table_df = pd.DataFrame(outlier_table)
outlier_table_df
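
The Z-score rule itself can be illustrated on a single toy column. The sketch below uses plain pandas with the population standard deviation (`ddof=0`), which is the default of `scipy.stats.zscore`; the values are invented:

```python
import pandas as pd

# Twenty ordinary values and one extreme one
values = pd.Series([10.0] * 20 + [100.0])

# Population z-score (ddof=0), matching scipy.stats.zscore's default
z = (values - values.mean()) / values.std(ddof=0)
outlier_index = values.index[z.abs() > 3].tolist()
print(outlier_index)
```

Only the extreme value at index 20 is flagged.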

Visualize Outliers with Box Plots:

Display box plots for the cylindernumber and enginesize features to identify the distribution and spot potential outliers. Box plots provide a clear summary of the data’s spread, median, and extreme values for each feature. 

In [None]:
features = ['cylindernumber', 'enginesize']
plt.figure(figsize=(12, 6))
df[features].boxplot()
plt.title('Box Plot of Selected Features')
plt.ylabel('Value')
plt.xticks(rotation=30)
plt.show()

Conclusion: Outliers in the dataset are not random; they cluster in certain features (wheelbase, engine size, cylinder number, price). This is expected in car data, as rare or luxury cars often have extreme values in multiple attributes. They are considered valid, so they will not be removed.

# New features

Power-to-weight ratio - provides a better indicator of a car's real-world performance than horsepower alone. Dividing horsepower by curb weight accounts for both engine power and vehicle mass. This ratio helps compare cars of different sizes and weights, showing which vehicles are likely to accelerate faster and feel more responsive. It is a key feature for performance analysis.

In [None]:
df['power_to_weight'] = df['horsepower'] / df['curbweight'] # Create power-to-weight ratio feature


Car volume calculation - provides an estimate of the vehicle's overall physical size. This feature helps compare cars in terms of interior space, comfort, and cargo capacity. It is useful for understanding how spacious or practical a car is, which can be important for buyers interested in family cars, SUVs, or vehicles with more storage. Including car volume allows for a more comprehensive analysis beyond just performance metrics.

In [None]:
df['car_volume'] = df['carlength'] * df['carwidth'] * df['carheight'] # Create car volume feature

---

Mpg ratio - helps compare a car’s fuel efficiency in city versus highway driving conditions. By dividing city mpg by highway mpg, we can identify vehicles that perform similarly or differently across these environments. A ratio close to 1 means the car is about equally efficient in both settings, while a lower ratio indicates much better highway efficiency. This feature is useful for understanding real-world fuel consumption patterns and for recommending cars based on typical driving habits.

In [None]:
df['mpg_ratio'] = df['citympg'] / df['highwaympg'] # Create mpg ratio feature

Luxury car flag - identifies which cars in the dataset are considered luxury based on their price. By setting a threshold at the 75th percentile, we classify cars priced above that threshold (roughly the top 25% most expensive) as luxury. This feature is useful for segmenting the market, analyzing trends, and comparing characteristics between luxury and non-luxury vehicles. It supports targeted analysis and can be valuable for marketing, pricing strategies, or customer recommendations.

In [None]:
luxury_threshold = df['price'].quantile(0.75) # Determine 75th percentile price threshold
df['is_luxury'] = (df['price'] > luxury_threshold).astype(int) # Flag cars above threshold as luxury (1) or not (0)
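
How the percentile threshold behaves is easiest to see on a handful of values. A sketch with invented prices (not taken from the dataset):

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40, 50])

# 75th percentile with pandas' default linear interpolation
threshold = prices.quantile(0.75)

# Strictly greater than the threshold, as in the cell above
is_luxury = (prices > threshold).astype(int)
print(threshold, is_luxury.tolist())
```

Because the comparison is strict, a car priced exactly at the threshold is not flagged.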

# Standardizing

In [None]:
comp = df.copy()
comp.drop(columns=['brand',
                    'model',
                    'carbody',
                    'fueltype',
                    'enginetype',
                    'aspiration',
                    'drivewheel',
                    'enginelocation',
                    'fuelsystem'], inplace=True)
comp.head(20)

### Scoring system for family / city car

symboling - the higher, the worse the safety - so a 0 is scored as 3 and a 1 as -3

doornumber - the more, the safer

wheelbase - the bigger, the worse

carwidth - the wider, the worse

cylindernumber - the higher, the worse

enginesize - the bigger, the worse
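
One way such a direction list could be turned into a number is sketched below. It is only an illustration: min-max scaling, equal weights, and the names `FAMILY_DIRECTIONS` and `profile_score` are assumptions, and the specific symboling-to-score rule mentioned in the list is not reproduced.

```python
import pandas as pd

# Direction per feature for the family / city car profile:
# +1 means "more is better", -1 means "more is worse" (read off the list above)
FAMILY_DIRECTIONS = {
    "symboling": -1,
    "doornumber": 1,
    "wheelbase": -1,
    "carwidth": -1,
    "cylindernumber": -1,
    "enginesize": -1,
}


def profile_score(frame: pd.DataFrame, directions: dict) -> pd.Series:
    """Sum of min-max scaled features, each signed by its direction."""
    cols = list(directions)
    # Replace a zero range with 1 so constant columns do not divide by zero
    spread = (frame[cols].max() - frame[cols].min()).replace(0, 1)
    scaled = (frame[cols] - frame[cols].min()) / spread
    return (scaled * pd.Series(directions)).sum(axis=1)


# Two invented cars: a small four-door and a large two-door
toy = pd.DataFrame({
    "symboling": [0, 3],
    "doornumber": [4, 2],
    "wheelbase": [95.0, 110.0],
    "carwidth": [64.0, 70.0],
    "cylindernumber": [4, 8],
    "enginesize": [1.5, 4.0],
})
scores = profile_score(toy, FAMILY_DIRECTIONS)
print(scores.tolist())
```

The same function could be reused for the other profiles by passing a different direction dictionary.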


### Scoring system for off-road
symboling - the higher, the worse the safety - so a 0 is scored as 3 and a 1 as -3

doornumber - the more, the better

wheelbase - the bigger, the better

cylindernumber - 

enginesize - 


### Scoring system for sports car
symboling - the higher, the worse the safety - so a 0 is scored as 3 and a 1 as -3

doornumber - the more, the worse

wheelbase - depends (spectrum slider)

cylindernumber - the higher, the better





NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down (do not create a flow in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
