# Fuel Consumption Analysis

**References:**
+ https://www.kaggle.com/datasets/krupadharamshi/fuelconsumption
+ https://github.com/florence-bockting/python-class-25

**Content:**
  + Data cleaning
     + Finding missing values
     + Handling missing values
     + Finding duplicate rows
     + Removing duplicate rows
     + Data cleaning for the real dataset
  + Data analysis
  + Data visualization

### Data Cleaning
+ **Data cleaning** is the process of identifying and correcting errors and missing values in datasets.
+ It ensures the dataset is accurate, reliable, and suitable for analysis.
+ It improves data quality and consistency and ensures datasets are complete and structured properly.

#### Finding missing values
+ Generate a sample dataset with missing values.
+ `import pandas as pd`
+ `import numpy as np`
+ Use `pd.DataFrame` to create a sample dataset.
+ Use `np.nan` to create the missing values.
+ `from fuel_consumption_analysis import DataCleaning`
+ Use the `missing_values()` function to count missing values in the dataset.
+ We first identify the number of missing values in each column.
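
For readers without the package installed, the counting step can be sketched with plain pandas. How `missing_values()` works internally is not shown here, so treat this as an assumed equivalent rather than the package's implementation; the `sample` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the fuel dataset)
sample = pd.DataFrame(
    {
        "Age": [24, np.nan, 19],
        "Department": ["Finance", np.nan, np.nan],
    }
)

# isna() marks each missing cell as True; sum() then counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

The result is a `Series` indexed by column name, so a single column's count can be read with `missing_per_column["Age"]`.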

In [8]:
import pandas as pd
import numpy as np
from fuel_consumption_analysis import DataCleaning   # Import the DataCleaning class from the package (module name assumed to use underscores; hyphens are not valid in an import)

# create a sample dataframe
data = pd.DataFrame(
    {"Name": [
        "Tara",
        "Alice",
        "Ali",
        "Sara",
        "Fara"
    ],
     "Age": [24, 33, np.nan, 19, 26],
     "Salary": [2000, np.nan, 3500, 4000, 2700],
     "Department": ["Finance", np.nan, "IT", np.nan, "IT"]
    }
)

data

# Count the number of missing values in each column
print(DataCleaning(data).missing_values())

| | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | Tara | 24.0 | 2000.0 | Finance |
| 1 | Alice | 33.0 | NaN | NaN |
| 2 | Ali | NaN | 3500.0 | IT |
| 3 | Sara | 19.0 | 4000.0 | NaN |
| 4 | Fara | 26.0 | 2700.0 | IT |


#### Handling missing values
+ Handling missing values is crucial to avoid biased results and incorrect conclusions.
+ Methods for handling missing values:
     + **Removing missing data:** If a column has too many missing values, it may be best to drop it.
     + **Replacing missing values:**
          + *For categorical data*: Replace missing values with the most frequent category (the mode).
          + *For numerical data*: Replace missing values with the mean or median of the corresponding column.
+ The `fuel-consumption-analysis` package uses the **replacing missing values** technique.
+ Use the `replace_missing_values()` function to handle missing values.
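
The replacement strategy described above can be sketched with plain pandas. This is a minimal illustration, not the package's code: whether `replace_missing_values()` uses the mean or the median for numerical columns is not stated, so the median is an assumption here, and the `sample` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numerical and one categorical column
sample = pd.DataFrame(
    {
        "Salary": [2000.0, np.nan, 3500.0, 4000.0],
        "Department": ["IT", np.nan, "IT", "Finance"],
    }
)

filled = sample.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        # Numerical column: fill with the median (the mean is the alternative)
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        # Categorical column: fill with the most frequent value (the mode)
        filled[column] = filled[column].fillna(filled[column].mode().iloc[0])

print(filled)
```

The median is less sensitive to outliers than the mean, which is one common reason to prefer it for skewed columns such as salaries.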

In [None]:
data_without_missing_values = DataCleaning(data).replace_missing_values()   # DataFrame without missing values

# Check that no missing values remain
print(DataCleaning(data_without_missing_values).missing_values())

#### Finding duplicate rows
+ Duplicate records can distort analysis results by over-representing certain values, leading to incorrect insights.
+ Create a new row that duplicates the fifth row of the sample dataset.
+ Use `pd.concat()` to append the new row to the dataset.
+ Check the dataset for duplicate rows.
+ Use the `find_duplicates()` function to show the duplicate rows.
+ The result is a `DataFrame`.
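
As a rough plain-pandas counterpart to the duplicate check, the sketch below flags repeated rows with `DataFrame.duplicated`. It is an assumed equivalent, since the internals of `find_duplicates()` are not shown; the `sample` frame is hypothetical, and `keep=False` is a choice made here so that every copy of a repeated row is listed.

```python
import pandas as pd

# Hypothetical frame in which the last two rows are identical
sample = pd.DataFrame(
    {
        "Name": ["Sara", "Fara", "Fara"],
        "Age": [19, 26, 26],
    }
)

# duplicated(keep=False) flags every member of a duplicate group,
# so both copies of the repeated row are selected
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)
```

With the default `keep="first"`, only the second and later copies would be flagged, which is the form usually wanted when counting how many rows will be removed.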

In [None]:
# Find duplicate rows in sample dataset
print(DataCleaning(data).find_duplicates())   # No duplicates

new_row = pd.DataFrame({"Name": ["Fara"], "Age": [26], "Salary": [2700], "Department": ["IT"]})   # Create a row identical to the fifth row
new_data = pd.concat([data, new_row], ignore_index=True)   # Append the new row to the sample dataset
print(new_data)

# Find duplicate rows in the new dataset, which now contains one
print(DataCleaning(new_data).find_duplicates())

#### Removing duplicate rows
+ Removing duplicates ensures that each observation in the dataset is unique and accurate.
+ Use the `remove_duplicates()` function to remove duplicates in the dataset.
+ The result is a `DataFrame`.
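
Removing duplicates can likewise be sketched with `DataFrame.drop_duplicates`. This is an illustration under the assumption that `remove_duplicates()` keeps the first occurrence of each row; the `sample` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame in which the last two rows are identical
sample = pd.DataFrame({"Name": ["Sara", "Fara", "Fara"], "Age": [19, 26, 26]})

# Keep the first occurrence of each row and renumber the index from 0
unique_rows = sample.drop_duplicates().reset_index(drop=True)
print(unique_rows)
```

Resetting the index is optional; it only avoids gaps in the row labels after rows have been dropped.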

In [None]:
data_without_duplicates = DataCleaning(new_data).remove_duplicates()

# Check that no duplicate rows remain
print(DataCleaning(data_without_duplicates).find_duplicates())

#### Data cleaning for the real dataset
+ Load the **fuel consumption** dataset.
+ Use `pd.read_csv()` to read the dataset.
+ Show dataset information:
     + Use `df.head(n = ...)` to show the first n rows of the dataset.
     + Use `df.dtypes` to show the data type of each column.

In [16]:
dataset = pd.read_csv("C:/Users/taher/Desktop/TU Dortmund/Courses/fuel consumption.zip")
dataset.head(n = 7)   # Display the first 7 rows of the dataset
dataset.dtypes    # Display the type of each column

Year                  int64
MAKE                 object
MODEL                object
VEHICLE CLASS        object
ENGINE SIZE         float64
CYLINDERS             int64
TRANSMISSION         object
FUEL                 object
FUEL CONSUMPTION    float64
COEMISSIONS           int64
dtype: object
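
The loading and inspection steps can be tried without the data file by reading a CSV from an in-memory string. The sketch below borrows three column names from the output above, but the two rows are hypothetical placeholders and not values from the fuel consumption dataset.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = "Year,MAKE,ENGINE SIZE\n2000,MakeA,1.6\n2001,MakeB,3.2\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(n=1))   # First row only
print(df.dtypes)      # Inferred type of each column
```

Whole numbers are inferred as `int64`, decimals as `float64`, and text as a string-like type, which matches the pattern in the output shown above.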

In [None]:
print(DataCleaning(dataset).missing_values())  # Check missing values

cleaned_data = DataCleaning(dataset).replace_missing_values()  # Handle missing values

print(DataCleaning(cleaned_data).find_duplicates())   # Check duplicate rows

cleaned_data = DataCleaning(cleaned_data).remove_duplicates()  # Remove duplicates

print(cleaned_data.head())  # Display cleaned dataset

### Data Analysis