# Electric Vehicle Population - Exploratory Data Analysis (EDA)

This notebook provides a comprehensive framework for performing EDA on Electric Vehicle Population datasets.

**Tools Used:** Python (Pandas, Matplotlib, Seaborn, NumPy)

**Author:** HexSoftwares

---

## 1. Import Libraries

First, let's import all necessary libraries for our analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All libraries imported successfully!")

## 2. Load the Dataset

**PROMPT:** Load your Electric Vehicle Population dataset here.

Replace `'electric_vehicle_population.csv'` in the cell below with the actual path to your dataset.

In [None]:
# Load the dataset
data_path = 'electric_vehicle_population.csv'  # Update this path

try:
    df = pd.read_csv(data_path)
    print("✓ Data loaded successfully!")
    print(f"  Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
except FileNotFoundError:
    print(f"✗ Error: File not found at {data_path}")
    print("\nPROMPT: Please update the 'data_path' variable with the correct file path.")
except Exception as e:
    print(f"✗ Error loading data: {str(e)}")
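
Code-like columns can lose information when `pd.read_csv` infers their type. The sketch below is illustrative only: it reads a small in-memory CSV rather than the real file, and the column names `Postal Code` and `Model Year` are assumptions about the dataset. It shows how the `dtype` argument keeps a code column as text.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file; the column names are assumptions
csv_text = "Postal Code,Model Year\n02134,2020\n98101,2021\n"

# Default parsing infers integers, so the leading zero is dropped
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as text keeps the code exactly as written
typed = pd.read_csv(io.StringIO(csv_text), dtype={"Postal Code": "string"})

print(inferred["Postal Code"].tolist())
print(typed["Postal Code"].tolist())
```

If your file has similar code columns, the same `dtype` mapping can be passed to the `pd.read_csv(data_path)` call above.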

## 3. Initial Data Exploration

Let's get a first look at our dataset structure and contents.

### 3.1 Display First Few Rows

In [None]:
# Display first 10 rows
df.head(10)

### 3.2 Display Last Few Rows

In [None]:
# Display last 10 rows
df.tail(10)

### 3.3 Dataset Information

In [None]:
# Get dataset info
print("Dataset Information:")
print("="*60)
print(f"Total rows: {df.shape[0]:,}")
print(f"Total columns: {df.shape[1]}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nColumn Details:")
df.info()

### 3.4 Identify Column Types

In [None]:
# Identify numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f"Numeric Columns ({len(numeric_cols)}):")
print(numeric_cols)
print(f"\nCategorical Columns ({len(categorical_cols)}):")
print(categorical_cols)
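
A numeric dtype does not always mean a measurable quantity: identifiers and codes are often stored as integers, and their mean or correlation is not meaningful. The following is a minimal sketch of separating the two groups. The helper `split_numeric_columns` and the column names `Postal Code` and `Electric Range` are hypothetical and should be adapted to your dataset.

```python
import numpy as np
import pandas as pd


def split_numeric_columns(frame, id_like=()):
    """Split numeric columns into measures and identifier-like codes.

    `id_like` is a user-supplied list of column names to treat as codes.
    """
    id_like = set(id_like)
    numeric = frame.select_dtypes(include=[np.number]).columns.tolist()
    measures = [c for c in numeric if c not in id_like]
    codes = [c for c in numeric if c in id_like]
    return measures, codes


# Toy frame; the column names are assumptions, not taken from the real file
toy = pd.DataFrame({
    "Postal Code": [98101, 98052, 98004],
    "Electric Range": [215, 30, 291],
    "Make": ["TESLA", "TOYOTA", "TESLA"],
})

measures, codes = split_numeric_columns(toy, id_like=["Postal Code"])
print("Measures:", measures)
print("Codes:", codes)
```

One option is to pass only the `measures` list to the statistical and outlier steps later in the notebook.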

### 3.5 Statistical Summary

In [None]:
# Statistical summary for numeric columns
df.describe()

In [None]:
# Statistical summary for categorical columns
# display() is needed because an expression inside an if block is not auto-rendered
if len(categorical_cols) > 0:
    display(df[categorical_cols].describe())

## 4. Missing Values Analysis

**PROMPT:** Identify and analyze missing values in the dataset.

### 4.1 Check Missing Values

In [None]:
# Create missing values summary
missing_df = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2),
    'Data_Type': df.dtypes
})

missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values(
    'Missing_Percentage', ascending=False
).reset_index(drop=True)

if len(missing_df) == 0:
    print("✓ No missing values found in the dataset!")
else:
    print(f"⚠ Found missing values in {len(missing_df)} columns:\n")
    display(missing_df)

### 4.2 Visualize Missing Values

In [None]:
# Visualize missing values
if len(missing_df) > 0:
    plt.figure(figsize=(10, 6))
    plt.barh(missing_df['Column'], missing_df['Missing_Percentage'], color='coral')
    plt.xlabel('Missing Percentage (%)', fontsize=12)
    plt.ylabel('Columns', fontsize=12)
    plt.title('Missing Values by Column', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

### 4.3 Handle Missing Values

**PROMPT:** Choose a strategy to handle missing values:
- Fill numeric columns with median
- Fill categorical columns with mode
- Drop rows with missing values
- Custom strategy per column

In [None]:
# Handle missing values - Auto strategy
# Fill numeric with median, categorical with mode

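for col in numeric_cols:
    if df[col].isnull().any():
        # Assign the result back: inplace=True on a column selection may act on a copy
        df[col] = df[col].fillna(df[col].median())
        print(f"✓ Filled '{col}' with median")

for col in categorical_cols:
    if df[col].isnull().any():
        modes = df[col].mode()
        if modes.empty:
            # An entirely missing column has no mode to fill with
            print(f"⚠ Skipped '{col}': no non-missing values")
            continue
        df[col] = df[col].fillna(modes.iloc[0])
        print(f"✓ Filled '{col}' with mode")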

print(f"\n✓ Missing values handled. Remaining: {df.isnull().sum().sum()}")
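
The strategy list in this section also mentions dropping rows and choosing a custom strategy per column. The sketch below shows one way to do that on a toy frame. The `strategy` mapping, the helper `apply_strategy`, and the column names are hypothetical, so treat it as a starting point rather than a recommended policy.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are assumptions about the dataset
toy = pd.DataFrame({
    "Electric Range": [215.0, np.nan, 291.0, 30.0],
    "Make": ["TESLA", None, "TESLA", "TOYOTA"],
    "County": ["King", "King", None, "Pierce"],
})

# Hypothetical per-column plan: "median", "mode", or "drop"
strategy = {"Electric Range": "median", "Make": "mode", "County": "drop"}


def apply_strategy(frame, plan):
    """Return a copy of `frame` with each column handled as `plan` says."""
    out = frame.copy()
    for col, how in plan.items():
        if how == "median":
            out[col] = out[col].fillna(out[col].median())
        elif how == "mode":
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
        elif how == "drop":
            # Remove rows where this column is missing
            out = out.dropna(subset=[col])
        else:
            raise ValueError(f"Unknown strategy {how!r} for column {col!r}")
    return out.reset_index(drop=True)


cleaned = apply_strategy(toy, strategy)
print(cleaned)
```

Working on a copy keeps the original frame available for comparison before and after cleaning.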

## 5. Data Quality Checks

### 5.1 Check for Duplicates

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates:,} ({duplicates/len(df)*100:.2f}%)")

if duplicates > 0:
    print("\nPROMPT: Consider removing duplicates using: df.drop_duplicates(inplace=True)")

### 5.2 Unique Values Count

In [None]:
# Count unique values per column
unique_df = pd.DataFrame({
    'Column': df.columns,
    'Unique_Count': [df[col].nunique() for col in df.columns],
    'Unique_Percentage': [(df[col].nunique() / len(df)) * 100 for col in df.columns]
})

unique_df = unique_df.sort_values('Unique_Count', ascending=False).reset_index(drop=True)
unique_df['Unique_Percentage'] = unique_df['Unique_Percentage'].round(2)

display(unique_df)

## 6. Statistical Analysis

### 6.1 Distribution Metrics

In [None]:
# Analyze distribution metrics for numeric columns
print("Distribution Metrics (Numeric Columns):")
print("="*80)

for col in numeric_cols:
    skew = df[col].skew()
    kurt = df[col].kurtosis()
    print(f"\n{col}:")
    print(f"  • Mean: {df[col].mean():.2f}")
    print(f"  • Median: {df[col].median():.2f}")
    print(f"  • Std Dev: {df[col].std():.2f}")
    print(f"  • Min: {df[col].min():.2f}")
    print(f"  • Max: {df[col].max():.2f}")
    print(f"  • Skewness: {skew:.2f} {'(Right-skewed)' if skew > 0 else '(Left-skewed)' if skew < 0 else '(Symmetric)'}")
    print(f"  • Kurtosis: {kurt:.2f}")

### 6.2 Correlation Analysis

In [None]:
# Calculate correlation matrix
if len(numeric_cols) > 1:
    corr_matrix = df[numeric_cols].corr()
    print("Correlation Matrix:")
    display(corr_matrix.round(3))
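
A full correlation matrix is hard to scan once there are more than a few numeric columns. As a rough sketch, the hypothetical helper `top_correlations` below lists each column pair once, ordered by absolute correlation. The toy columns `a`, `b`, and `c` are placeholders.

```python
import numpy as np
import pandas as pd


def top_correlations(frame, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once, without the diagonal
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


# Toy data: b is exactly 2 * a, c is only loosely related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, 3, 2, 5],
})

ranked = top_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top as well.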

## 7. Outlier Detection

**PROMPT:** Detect outliers using the IQR (Interquartile Range) method.

In [None]:
# Detect outliers using IQR method
print("Outlier Detection (IQR Method):")
print("="*60)

outliers_dict = {}

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    
    if len(outliers) > 0:
        outliers_dict[col] = len(outliers)
        print(f"⚠ '{col}': {len(outliers):,} outliers ({len(outliers)/len(df)*100:.2f}%)")

if len(outliers_dict) == 0:
    print("✓ No significant outliers detected")
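
A small worked example makes the IQR fences concrete. For the toy series below, pandas' default linear interpolation gives Q1 = 2.25 and Q3 = 4.75, so IQR = 2.5 and the fences are 2.25 - 1.5 × 2.5 = -1.5 and 4.75 + 1.5 × 2.5 = 8.5. The helper name `iqr_bounds` is hypothetical.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule with multiplier k."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


toy = pd.Series([1, 2, 3, 4, 5, 100])
lower, upper = iqr_bounds(toy)

# Values outside the fences are flagged as outliers
flagged = toy[(toy < lower) | (toy > upper)].tolist()
print(f"Fences: [{lower}, {upper}]  Flagged: {flagged}")
```

Note that when Q1 equals Q3 (for example, a column that is mostly one value), the IQR is zero and every other value is flagged, so the counts above should be read with the column's distribution in mind.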

## 8. Data Visualizations

### 8.1 Distribution Plots

In [None]:
# Create distribution plots for numeric columns
if len(numeric_cols) > 0:
    n_cols = len(numeric_cols)
    n_rows = (n_cols + 2) // 3
    
    fig, axes = plt.subplots(n_rows, 3, figsize=(15, 5*n_rows))
    # plt.subplots with 3 columns always returns an array, so flatten unconditionally
    axes = np.atleast_1d(axes).ravel()
    
    for idx, col in enumerate(numeric_cols):
        if idx < len(axes):
            axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7, color='skyblue')
            axes[idx].set_title(f'Distribution of {col}', fontweight='bold')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
            axes[idx].grid(alpha=0.3)
    
    # Hide extra subplots
    for idx in range(n_cols, len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()

### 8.2 Correlation Heatmap

In [None]:
# Create correlation heatmap
if len(numeric_cols) > 1:
    plt.figure(figsize=(12, 10))
    corr_matrix = df[numeric_cols].corr()
    
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
                square=True, linewidths=1, fmt='.2f', cbar_kws={"shrink": 0.8})
    plt.title('Correlation Matrix - Electric Vehicle Population', fontsize=16, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.show()

### 8.3 Box Plots (Outlier Visualization)

In [None]:
# Create box plots for outlier detection
if len(numeric_cols) > 0:
    n_cols = len(numeric_cols)
    n_rows = (n_cols + 2) // 3
    
    fig, axes = plt.subplots(n_rows, 3, figsize=(15, 5*n_rows))
    # plt.subplots with 3 columns always returns an array, so flatten unconditionally
    axes = np.atleast_1d(axes).ravel()
    
    for idx, col in enumerate(numeric_cols):
        if idx < len(axes):
            axes[idx].boxplot(df[col].dropna(), patch_artist=True,
                            boxprops=dict(facecolor='lightblue'))
            axes[idx].set_title(f'Box Plot: {col}', fontweight='bold')
            axes[idx].set_ylabel(col)
            axes[idx].grid(alpha=0.3)
    
    # Hide extra subplots
    for idx in range(n_cols, len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()

### 8.4 Categorical Variables Analysis

In [None]:
# Visualize categorical variables
if len(categorical_cols) > 0:
    top_n = 10
    n_cols = len(categorical_cols)
    n_rows = (n_cols + 1) // 2
    
    fig, axes = plt.subplots(n_rows, 2, figsize=(15, 5*n_rows))
    # plt.subplots with 2 columns always returns an array, so flatten unconditionally
    axes = np.atleast_1d(axes).ravel()
    
    for idx, col in enumerate(categorical_cols):
        if idx < len(axes):
            value_counts = df[col].value_counts().head(top_n)
            axes[idx].barh(range(len(value_counts)), value_counts.values, color='steelblue')
            axes[idx].set_yticks(range(len(value_counts)))
            axes[idx].set_yticklabels(value_counts.index)
            axes[idx].set_title(f'Top {top_n} Categories: {col}', fontweight='bold')
            axes[idx].set_xlabel('Count')
            axes[idx].invert_yaxis()
            
            # Add value labels
            for i, v in enumerate(value_counts.values):
                axes[idx].text(v, i, f' {v:,}', va='center')
    
    # Hide extra subplots
    for idx in range(n_cols, len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()

## 9. Key Insights and Patterns

**PROMPT:** Based on the analysis above, document key insights about the Electric Vehicle population:

### Questions to Answer:

1. **Distribution Patterns:**
   - What is the distribution of electric vehicles by make and model?
   - Which EV types are most common (BEV vs PHEV)?

2. **Geographic Insights:**
   - Which counties/cities have the highest EV adoption?
   - Are there geographic clusters?

3. **Temporal Trends:**
   - How has EV adoption changed over the years?
   - What is the distribution of model years?

4. **Technical Specifications:**
   - What is the average electric range?
   - How does range vary by manufacturer?

5. **Data Quality:**
   - Which features have the most missing data?
   - Are there any data quality issues to address?
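
Several of the questions above reduce to a `value_counts` or a `groupby`. The sketch below runs on a toy frame. The column names `Electric Vehicle Type` and `Electric Range` are assumptions about the dataset (only `Make` appears elsewhere in this notebook), and the values are made up for illustration.

```python
import pandas as pd

# Toy frame with made-up values; adjust the column names to your dataset
toy = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "NISSAN", "TOYOTA", "NISSAN"],
    "Electric Vehicle Type": ["BEV", "BEV", "BEV", "PHEV", "BEV"],
    "Electric Range": [215, 291, 150, 25, 84],
})

# Share of each EV type, in percent
type_share = (
    toy["Electric Vehicle Type"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)

# Range statistics per manufacturer, highest mean first
range_by_make = (
    toy.groupby("Make")["Electric Range"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)

print(type_share)
print(range_by_make)
```

If your dataset encodes an unknown range as 0, consider filtering those rows before averaging, since they would pull the mean down.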

## 10. Advanced Analysis (Optional)

### 10.1 Time Series Analysis

In [None]:
# Example: Analyze EV adoption over time (if Model Year column exists)
# Uncomment and modify based on your dataset columns

# if 'Model Year' in df.columns:
#     yearly_counts = df['Model Year'].value_counts().sort_index()
#     
#     plt.figure(figsize=(14, 6))
#     plt.plot(yearly_counts.index, yearly_counts.values, marker='o', linewidth=2, markersize=8)
#     plt.xlabel('Model Year', fontsize=12)
#     plt.ylabel('Number of Vehicles', fontsize=12)
#     plt.title('Electric Vehicle Adoption Over Time', fontsize=14, fontweight='bold')
#     plt.grid(alpha=0.3)
#     plt.tight_layout()
#     plt.show()

### 10.2 Geographic Analysis

In [None]:
# Example: Top cities/counties with most EVs
# Uncomment and modify based on your dataset columns

# if 'County' in df.columns:
#     top_counties = df['County'].value_counts().head(15)
#     
#     plt.figure(figsize=(12, 8))
#     plt.barh(range(len(top_counties)), top_counties.values, color='green', alpha=0.7)
#     plt.yticks(range(len(top_counties)), top_counties.index)
#     plt.xlabel('Number of Electric Vehicles', fontsize=12)
#     plt.title('Top 15 Counties by EV Population', fontsize=14, fontweight='bold')
#     plt.gca().invert_yaxis()
#     plt.grid(alpha=0.3, axis='x')
#     plt.tight_layout()
#     plt.show()

### 10.3 Manufacturer Analysis

In [None]:
# Example: Top EV manufacturers
# Uncomment and modify based on your dataset columns

# if 'Make' in df.columns:
#     top_makes = df['Make'].value_counts().head(10)
#     
#     plt.figure(figsize=(12, 6))
#     plt.bar(range(len(top_makes)), top_makes.values, color='steelblue', alpha=0.8)
#     plt.xticks(range(len(top_makes)), top_makes.index, rotation=45, ha='right')
#     plt.xlabel('Manufacturer', fontsize=12)
#     plt.ylabel('Number of Vehicles', fontsize=12)
#     plt.title('Top 10 EV Manufacturers', fontsize=14, fontweight='bold')
#     plt.grid(alpha=0.3, axis='y')
#     plt.tight_layout()
#     plt.show()

## 11. Export Results

### 11.1 Save Cleaned Dataset

In [None]:
# Save cleaned dataset
df.to_csv('ev_population_cleaned.csv', index=False)
print("✓ Cleaned dataset saved as 'ev_population_cleaned.csv'")

### 11.2 Generate Summary Report

In [None]:
# Generate comprehensive summary report
report = f"""
Electric Vehicle Population - EDA Summary Report
{'='*80}

Dataset Overview:
-----------------
• Total Records: {df.shape[0]:,}
• Total Features: {df.shape[1]}
• Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB
• Numeric Columns: {len(numeric_cols)}
• Categorical Columns: {len(categorical_cols)}

Data Quality:
-------------
• Missing Values: {df.isnull().sum().sum():,} ({df.isnull().sum().sum() / (df.shape[0] * df.shape[1]) * 100:.2f}%)
• Duplicate Rows: {df.duplicated().sum():,}

Column Information:
-------------------
"""

for col in df.columns:
    report += f"\n{col}:"
    report += f"\n  Type: {df[col].dtype}"
    report += f"\n  Unique Values: {df[col].nunique():,}"
    if col in numeric_cols:
        report += f"\n  Mean: {df[col].mean():.2f}"
        report += f"\n  Median: {df[col].median():.2f}"
        report += f"\n  Std: {df[col].std():.2f}"
    report += "\n"

print(report)

# Save report to file
with open('eda_summary_report.txt', 'w', encoding='utf-8') as f:
    f.write(report)
print("\n✓ Summary report saved as 'eda_summary_report.txt'")

## 12. Conclusion

**PROMPT:** Summarize your key findings here:

### Key Findings:

1. **Dataset Characteristics:**
   - [Add your findings]

2. **Data Quality Issues:**
   - [Add your findings]

3. **Distribution Patterns:**
   - [Add your findings]

4. **Correlations:**
   - [Add your findings]

5. **Actionable Insights:**
   - [Add your recommendations]

---

### Next Steps:

1. Further cleaning and preprocessing
2. Feature engineering for modeling
3. Predictive modeling (if applicable)
4. Dashboard creation for stakeholders

---

**End of EDA Report**