<h2 style="text-align:center;">Electric Vehicle Data Analysis Project</h2>

<h3 style="text-align:center;">Python-Final Project</h3>

---

<h4>Dataset Review and Observations:</h4>

Before proceeding with data analysis, we conducted an initial review of the dataset to ensure data quality and minimize potential issues. The following key observations were noted:  

1. **Handling Missing Values**  
   - Identified columns containing missing values.  
   - Assessed the extent of missing data to determine an appropriate handling strategy (e.g., imputation or removal).  

2. **Brake Type Extraction**  
   - Extracted both front and rear brake types separately to facilitate detailed analysis.  

3. **Drive Type and Position Separation**  
   - Split the `Drive type` column to isolate the drive type (e.g., 4WD, 2WD).  
   - Extracted the drive position (front or rear) where applicable.  
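
The first observation, identifying the columns with missing values and how much is missing, can be sketched as follows. This is a minimal illustration rather than the code used in the project: the helper name `missing_summary` and the sample rows are made up for the example, and only the column names come from the dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative rows only (not taken from the project dataset)
sample = pd.DataFrame({
    "Car full name": ["A", "B", "C", "D"],
    "Type of brakes": ["disc (front + rear)", None, "disc (front) + drum (rear)", None],
    "Acceleration 0-100 kph [s]": [5.7, 6.8, None, 4.5],
})

report = missing_summary(sample)
print(report)
```

A column with only a small share of missing values is usually a candidate for imputation, while a column that is mostly empty may be better dropped.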

---

<h4>Explanation of Why Each Method Was Used:</h4>

1. **`Type of brakes` (Categorical Column)**:
   - **Method Used**: **Mode** (most frequent value).
   - **Reason**: Since this is a categorical column, the mode is the only measure of central tendency that applies. Filling missing values with the most common brake type keeps the imputed values consistent with the existing categories.

2. **`Permissable gross weight [kg]` (Numerical Column)**:
   - **Method Used**: **Median**.
   - **Reason**: The median is robust to outliers. If there are extreme values in this column (e.g., very high or very low weights), the median will provide a more representative value than the mean.

3. **`Maximum load capacity [kg]` (Numerical Column)**:
   - **Method Used**: **Median**.
   - **Reason**: Similar to `Permissable gross weight [kg]`, this column may have outliers. The median ensures that the imputed value is not skewed by extreme values.

4. **`Acceleration 0-100 kph [s]` (Numerical Column)**:
   - **Method Used**: **Mean**.
   - **Reason**: Acceleration times are assumed to be roughly symmetric, without extreme outliers, so the mean gives a balanced estimate of the central tendency. If the distribution turns out to be skewed, the median is the safer choice.

5. **`Boot capacity (VDA) [l]` (Numerical Column)**:
   - **Method Used**: **Median**.
   - **Reason**: Boot capacity may have some outliers (e.g., very large or very small capacities). The median is a better choice to avoid skewing the imputed values.

6. **`mean - Energy consumption [kWh/100 km]` (Numerical Column)**:
   - **Method Used**: **Mean**.
   - **Reason**: Energy consumption is assumed to be approximately normally distributed, so the mean is a reasonable estimate of the consumption for cars with missing values.
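
The choice between mean and median above rests on whether a column is skewed or contains outliers. One way to check this is to look at the sample skewness. The sketch below is an assumption-laden illustration: the helper `suggest_fill` and the threshold of 1.0 are hypothetical rules of thumb, not part of the original analysis, and the sample values are made up.

```python
import pandas as pd


def suggest_fill(series: pd.Series, skew_threshold: float = 1.0) -> str:
    """Suggest 'mean' for roughly symmetric data and 'median' for skewed data.

    The threshold is a rule of thumb chosen for this example.
    """
    skewness = series.dropna().skew()
    return "median" if abs(skewness) > skew_threshold else "mean"


# Illustrative values only (not taken from the project dataset)
symmetric = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0])
with_outlier = pd.Series([300, 310, 320, 330, 2500])

print(suggest_fill(symmetric))
print(suggest_fill(with_outlier))
```

A histogram or box plot of each column gives the same information visually and is worth inspecting before settling on a strategy.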


---

<h4>A. Python Code to Handle Missing Values</h4>

Here’s how you can implement the above strategies using Python and the `pandas` library:

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('auta_elektryczne.csv')

# Fill missing values for each column
# 1. Type of brakes (categorical): fill with the MODE
df['Type of brakes'] = df['Type of brakes'].fillna(
    df['Type of brakes'].mode()[0]
)

# 2. Permissable gross weight [kg] (numerical): fill with the MEDIAN
df['Permissable gross weight [kg]'] = df['Permissable gross weight [kg]'].fillna(
    round(df['Permissable gross weight [kg]'].median(), 2)
)

# 3. Maximum load capacity [kg] (numerical): fill with the MEDIAN
df['Maximum load capacity [kg]'] = df['Maximum load capacity [kg]'].fillna(
    round(df['Maximum load capacity [kg]'].median(), 2)
)

# 4. Acceleration 0-100 kph [s] (numerical): fill with the MEAN
df['Acceleration 0-100 kph [s]'] = df['Acceleration 0-100 kph [s]'].fillna(
    round(df['Acceleration 0-100 kph [s]'].mean(), 2)
)

# 5. Boot capacity (VDA) [l] (numerical): fill with the MEDIAN
df['Boot capacity (VDA) [l]'] = df['Boot capacity (VDA) [l]'].fillna(
    round(df['Boot capacity (VDA) [l]'].median(), 2)
)

# 6. mean - Energy consumption [kWh/100 km] (numerical): fill with the MEAN
df['mean - Energy consumption [kWh/100 km]'] = df['mean - Energy consumption [kWh/100 km]'].fillna(
    round(df['mean - Energy consumption [kWh/100 km]'].mean(), 2)
)

# Save the cleaned dataset
df.to_csv('auta_elektryczne_missing_data_fix.csv', 
          index=False)

print("Missing values have been filled and the dataset has been saved.")

Missing values have been filled and the dataset has been saved.
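
Because the six fills follow the same pattern, they could also be driven from a single mapping of column name to strategy. The following is a sketch of that idea, not the code that produced the output above: the names `FILL_STRATEGIES` and `fill_missing` are hypothetical, and the demo rows are made up.

```python
import pandas as pd

# Column -> strategy, mirroring the choices explained earlier
FILL_STRATEGIES = {
    "Type of brakes": "mode",
    "Permissable gross weight [kg]": "median",
    "Maximum load capacity [kg]": "median",
    "Acceleration 0-100 kph [s]": "mean",
    "Boot capacity (VDA) [l]": "median",
    "mean - Energy consumption [kWh/100 km]": "mean",
}


def fill_missing(df: pd.DataFrame, strategies: dict) -> pd.DataFrame:
    """Return a copy of df with missing values filled per column strategy."""
    result = df.copy()
    for column, strategy in strategies.items():
        if column not in result.columns:
            continue  # Skip columns that are absent from this frame
        if strategy == "mode":
            modes = result[column].mode()
            if modes.empty:
                continue  # Column is entirely missing; nothing to fill with
            value = modes.iloc[0]
        elif strategy == "median":
            value = round(result[column].median(), 2)
        elif strategy == "mean":
            value = round(result[column].mean(), 2)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        result[column] = result[column].fillna(value)
    return result


# Illustrative rows only (not taken from the project dataset)
demo = pd.DataFrame({
    "Type of brakes": ["disc (front + rear)", "disc (front + rear)", None],
    "Acceleration 0-100 kph [s]": [4.0, None, 6.0],
    "Boot capacity (VDA) [l]": [300.0, 400.0, None],
})

filled = fill_missing(demo, FILL_STRATEGIES)
print(filled.isna().sum())
```

Keeping the strategies in one place makes it easier to change a single column from mean to median without touching the filling logic.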


---

<h4>B. Brake Type Extraction</h4>

The **`Type of brakes`** column describes the braking system in one of two formats:
1. **`disc (front + rear)`**
2. **`disc (front) + drum (rear)`**

We can **split the column into two separate columns**: **`Front Brakes`** and **`Rear Brakes`**. This will make the data more structured and easier to analyze.

In [3]:
import pandas as pd

# Load the dataset
df = pd.read_csv('auta_elektryczne_missing_data_fix.csv')

# Function to extract front and rear brake types
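def extract_brakes(brake_string):
    # Guard against non-string values (e.g., NaN) so the `in` checks cannot fail
    if not isinstance(brake_string, str):
        return None, None
    if 'disc (front + rear)' in brake_string:
        return 'disc', 'disc'
    elif 'disc (front) + drum (rear)' in brake_string:
        return 'disc', 'drum'
    else:
        return None, None  # Handle unexpected formats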

# Apply the function to create new columns
df[['Front Brakes', 'Rear Brakes']] = (
    df['Type of brakes']
    .apply(extract_brakes)
    .apply(pd.Series)
    )

# Drop the original 'Type of brakes' column (optional)
df.drop(columns=['Type of brakes'], 
        inplace=True)

# Save the updated dataset
df.to_csv('auta_elektryczne_brake_extraction.csv', 
          index=False)

# Display the first few rows to verify
print(
    df[['Car full name', 'Front Brakes', 'Rear Brakes']]
    .head()
    )

                      Car full name Front Brakes Rear Brakes
0            Audi e-tron 55 quattro         disc        disc
1            Audi e-tron 50 quattro         disc        disc
2             Audi e-tron S quattro         disc        disc
3  Audi e-tron Sportback 50 quattro         disc        disc
4  Audi e-tron Sportback 55 quattro         disc        disc
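
As an alternative to the row-by-row function, the same split could be expressed with vectorized string methods. The sketch below assumes the column contains only the two formats listed above; the helper name `split_brakes` and the sample values are hypothetical. Any other value is left as missing, which makes unexpected formats easy to spot.

```python
import pandas as pd


def split_brakes(series: pd.Series) -> pd.DataFrame:
    """Split a brake description into front and rear brake types."""
    # Format 1: one brake type shared by both axles, e.g. "disc (front + rear)"
    shared = series.str.extract(r"(\w+) \(front \+ rear\)", expand=False)
    # Format 2: separate types, e.g. "disc (front) + drum (rear)"
    separate = series.str.extract(r"(\w+) \(front\) \+ (\w+) \(rear\)")
    return pd.DataFrame({
        "Front Brakes": shared.fillna(separate[0]),
        "Rear Brakes": shared.fillna(separate[1]),
    })


# Illustrative values only; "unknown" stands in for an unexpected format
brakes = pd.Series([
    "disc (front + rear)",
    "disc (front) + drum (rear)",
    "unknown",
    None,
])

split = split_brakes(brakes)
print(split)
```

Rows where both new columns are missing did not match either format and can be reviewed with `split[split.isna().all(axis=1)]`.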


---

<h4>C. Drive Type and Position Separation</h4>

To handle the **`Drive type`** column, we need to:
1. **Extract the drive type** (e.g., `4WD` or `2WD`).
2. **Extract the drive position** (e.g., `front` or `rear`) when applicable.

Here’s how you can achieve this in Python using pandas:

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv('auta_elektryczne_brake_extraction.csv')

# Function to extract drive type and drive position
def extract_drive_info(drive_string):
    if '4WD' in drive_string:
        return '4WD', None  # 4WD doesn't have a specific position
    elif '2WD' in drive_string:
        if '(front)' in drive_string:
            return '2WD', 'front'
        elif '(rear)' in drive_string:
            return '2WD', 'rear'
        return '2WD', None  # 2WD with no position given: keep the drive type
    return None, None  # Handle unexpected formats

# Apply the function to create new columns
df[['Drive Type', 'Drive Position']] = (
    df['Drive type']
    .apply(extract_drive_info)
    .apply(pd.Series)
    )

# Drop the original 'Drive type' column (optional)
df.drop(columns=['Drive type'], inplace=True)

# Replace blank (NaN) cells with 'None' in the 'Drive Position' column
df['Drive Position'] = df['Drive Position'].fillna('None')

# Save the updated dataset
df.to_csv('auta_elektryczne_drive_extracted.csv', index=False)
df.to_csv('auta_elektryczne_cleaned.csv', index=False)

# Display a random sample of rows to verify
print(df[['Car full name', 'Drive Type', 
          'Drive Position']].sample(6))

                      Car full name Drive Type Drive Position
48            Volkswagen ID.3 Pro S        2WD           rear
0            Audi e-tron 55 quattro        4WD           None
30  Porsche Taycan 4S (Performance)        4WD           None
12                  Honda e Advance        2WD           rear
22                Mercedes-Benz EQC        4WD           None
15      Hyundai Kona electric 64kWh        2WD          front
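
The same separation could also be written with vectorized string methods instead of a row-wise function. This is a sketch under the assumption that the column holds values such as `2WD (front)`, `2WD (rear)`, and `4WD`; the helper name `split_drive` and the sample values are hypothetical.

```python
import pandas as pd


def split_drive(series: pd.Series) -> pd.DataFrame:
    """Split a drive description into drive type and drive position."""
    drive = series.str.extract(r"(4WD|2WD)", expand=False)
    position = series.str.extract(r"\((front|rear)\)", expand=False)
    # Only 2WD cars carry a position; everything else gets the string 'None'
    is_2wd = series.str.contains("2WD", regex=False, na=False)
    position = position.where(is_2wd).fillna("None")
    return pd.DataFrame({"Drive Type": drive, "Drive Position": position})


# Illustrative values only; "other" stands in for an unexpected format
drives = pd.Series(["2WD (front)", "2WD (rear)", "4WD", "other"])

drive_info = split_drive(drives)
print(drive_info)
```

Values that match neither `4WD` nor `2WD` end up with a missing `Drive Type`, so they can be found with `drive_info["Drive Type"].isna()`.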


---