# Dataset Description

# Upload Dataset: https://www.kaggle.com/datasets/avikasliwal/used-cars-price-prediction/data

## The dataset contains information about used-car prices, with the following columns:

`index`: The row index (loaded by pandas as `Unnamed: 0`).

`Name`: The brand and model of the car.

`Location`: The location in which the car is being sold or is available for purchase.

`Year`: The year or edition of the model.

`Kilometers_Driven`: The total distance driven by the previous owner(s), in km.

`Fuel_Type`: The type of fuel used by the car. (Petrol / Diesel / Electric / CNG / LPG)

`Transmission`: The type of transmission used by the car. (Automatic / Manual)

`Owner_Type`: Whether the car is being sold by its first owner, second owner, or a later owner.

`Mileage`: The standard mileage quoted by the car company, in kmpl or km/kg.

`Engine`: The displacement volume of the engine in cc.


# Tasks

## 1. Data Cleaning

### Read the dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Used_Cars.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         6019 non-null   int64  
 1   Name               6019 non-null   object 
 2   Location           6019 non-null   object 
 3   Year               6019 non-null   int64  
 4   Kilometers_Driven  6019 non-null   int64  
 5   Fuel_Type          6019 non-null   object 
 6   Transmission       6019 non-null   object 
 7   Owner_Type         6019 non-null   object 
 8   Mileage            6017 non-null   object 
 9   Engine             5983 non-null   object 
 10  Power              5983 non-null   object 
 11  Seats              5977 non-null   float64
 12  New_Price          824 non-null    object 
 13  Price              6019 non-null   float64
dtypes: float64(2), int64(3), object(9)
memory usage: 658.5+ KB


In [2]:
df.head()

| | Unnamed: 0 | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.0 | NaN | 1.75 |
| 1 | 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.0 | NaN | 12.5 |
| 2 | 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.0 | 8.61 Lakh | 4.5 |
| 3 | 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.0 | NaN | 6.0 |
| 4 | 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.0 | NaN | 17.74 |


### Handling Missing Values

In [3]:
print("Before handling:")
print(df.isnull().sum())

# Drop the rows that are missing Mileage, Engine, Power, or Seats.
df = df.dropna(subset=["Mileage", "Engine", "Power", "Seats"])
# New_Price is present in only 824 of 6019 rows, so drop the whole column.
df = df.drop(columns=['New_Price'])

print("\nAfter handling:")
df.isnull().sum()

Before handling:
Unnamed: 0              0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  42
New_Price            5195
Price                   0
dtype: int64

After handling:


Unnamed: 0           0
Name                 0
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
Price                0
dtype: int64

### Correct any inconsistent data entries.

In [4]:
df.info()

# Strip the unit suffixes, then convert each column to a numeric dtype.
# Mileage carries two units (kmpl and km/kg), so both must be removed before
# converting; stripping only ' km/kg' raises
# "ValueError: could not convert string to float: '19.67 kmpl'".
# errors='coerce' turns any leftover non-numeric value into NaN instead of raising.
df['Mileage'] = pd.to_numeric(df['Mileage'].str.replace(r'\s*(?:kmpl|km/kg)$', '', regex=True), errors='coerce')
df['Engine'] = pd.to_numeric(df['Engine'].str.replace(' CC', '', regex=False), errors='coerce')
df['Power'] = pd.to_numeric(df['Power'].str.replace(' bhp', '', regex=False), errors='coerce')


<class 'pandas.core.frame.DataFrame'>
Index: 5975 entries, 0 to 6018
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         5975 non-null   int64  
 1   Name               5975 non-null   object 
 2   Location           5975 non-null   object 
 3   Year               5975 non-null   int64  
 4   Kilometers_Driven  5975 non-null   int64  
 5   Fuel_Type          5975 non-null   object 
 6   Transmission       5975 non-null   object 
 7   Owner_Type         5975 non-null   object 
 8   Mileage            5975 non-null   object 
 9   Engine             5975 non-null   object 
 10  Power              5975 non-null   object 
 11  Seats              5975 non-null   float64
 12  Price              5975 non-null   float64
dtypes: float64(2), int64(3), object(8)
memory usage: 653.5+ KB

### Ensure data types are appropriate for each column.
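
One possible way to approach this step is sketched below. It uses three rows taken from the `df.head()` output as sample data, and `coerce_types` is a hypothetical helper name, not part of the original notebook. The sketch assumes the unit always follows the number, and that storing `Seats` as a nullable integer and `Fuel_Type` as a category is acceptable for the analysis.

```python
import pandas as pd

# Sample rows copied from the df.head() output; the measurements are raw strings.
sample = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI CNG", "Hyundai Creta 1.6 CRDi SX Option", "Honda Jazz V"],
    "Fuel_Type": ["CNG", "Diesel", "Petrol"],
    "Mileage": ["26.6 km/kg", "19.67 kmpl", "18.2 kmpl"],
    "Engine": ["998 CC", "1582 CC", "1199 CC"],
    "Power": ["58.16 bhp", "126.2 bhp", "88.7 bhp"],
    "Seats": [5.0, 5.0, 5.0],
})


def coerce_types(df):
    """Hypothetical helper: return a copy with numeric and categorical dtypes."""
    out = df.copy()
    for col in ["Mileage", "Engine", "Power"]:
        # Keep the leading number and drop the unit; unparseable values become NaN.
        number = out[col].astype(str).str.extract(r"^\s*([0-9]*\.?[0-9]+)", expand=False)
        out[col] = pd.to_numeric(number, errors="coerce")
    out["Seats"] = out["Seats"].astype("Int64")  # nullable integer dtype
    out["Fuel_Type"] = out["Fuel_Type"].astype("category")
    return out


typed = coerce_types(sample)
print(typed.dtypes)
```

On the full dataset the same helper could be applied to `df`; checking `typed.isnull().sum()` afterwards shows how many values, if any, could not be parsed.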

---

## 2. Exploratory Data Analysis (EDA)

### Perform summary statistics on the dataset.
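
A minimal sketch of this step, using the five rows shown by `df.head()` as sample data (the variable names are illustrative). `describe()` covers the numeric columns and `value_counts()` covers a categorical one; on the real dataset the same calls would run on `df`.

```python
import pandas as pd

# Five sample rows copied from the df.head() output.
sample = pd.DataFrame({
    "Year": [2010, 2015, 2011, 2012, 2013],
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", "Diesel"],
    "Transmission": ["Manual", "Manual", "Manual", "Manual", "Automatic"],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74],
})

# count, mean, std, min, quartiles, and max for every numeric column.
numeric_summary = sample.describe()
print(numeric_summary)

# Frequency of each category in a non-numeric column.
fuel_counts = sample["Fuel_Type"].value_counts()
print(fuel_counts)
```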

### Identify and analyze patterns in the data.
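
Grouping is one common way to look for patterns. The sketch below, again on the five sample rows from `df.head()`, derives a `Brand` column from the first word of `Name` (an assumption about how the names are formatted) and compares the median price per fuel type. With only five rows the numbers are illustrative, not findings.

```python
import pandas as pd

# Five sample rows copied from the df.head() output.
sample = pd.DataFrame({
    "Name": [
        "Maruti Wagon R LXI CNG",
        "Hyundai Creta 1.6 CRDi SX Option",
        "Honda Jazz V",
        "Maruti Ertiga VDI",
        "Audi A4 New 2.0 TDI Multitronic",
    ],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", "Diesel"],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74],
})

# Assumption: the first word of Name is the brand.
sample["Brand"] = sample["Name"].str.split().str[0]

# Median price per fuel type, highest first.
median_price_by_fuel = (
    sample.groupby("Fuel_Type")["Price"].median().sort_values(ascending=False)
)
print(median_price_by_fuel)

# Number of listings per brand.
listings_per_brand = sample["Brand"].value_counts()
print(listings_per_brand)
```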

### Visualize the distribution of key variables.
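
Histograms are a simple way to view the distribution of a numeric column. The sketch below plots `Price` and `Kilometers_Driven` for the five sample rows from `df.head()`. The bin count and figure size are arbitrary choices, and the `Agg` backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Five sample rows copied from the df.head() output.
sample = pd.DataFrame({
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
hist_counts = {}
for ax, col in zip(axes, ["Price", "Kilometers_Driven"]):
    # ax.hist returns the bin counts, the bin edges, and the drawn bars.
    counts, edges, _ = ax.hist(sample[col], bins=5)
    hist_counts[col] = counts
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Number of cars")
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, `fig.savefig(...)` or `plt.show()` would be needed.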

### Explore relationships between variables.
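
A correlation matrix gives a first look at linear relationships between numeric columns. The sketch below assumes the unit suffixes have already been stripped from `Mileage`, `Engine`, and `Power`, and uses the five rows from `df.head()` as sample data, so the coefficients themselves are only illustrative.

```python
import pandas as pd

# Five sample rows from the df.head() output, with units already stripped.
# Note: Mileage mixes km/kg (first row) and kmpl (the rest).
sample = pd.DataFrame({
    "Year": [2010, 2015, 2011, 2012, 2013],
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670],
    "Mileage": [26.6, 19.67, 18.2, 20.77, 15.2],
    "Engine": [998, 1582, 1199, 1248, 1968],
    "Power": [58.16, 126.2, 88.7, 88.76, 140.8],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr()

# Correlation of each feature with Price, strongest (by magnitude) first.
price_corr = corr["Price"].drop("Price").sort_values(key=abs, ascending=False)
print(price_corr)
```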


## 3. Data Visualization

* Ensure the visualizations are clear and informative.

### Create visualizations to illustrate the findings from the EDA.


### Use appropriate plots such as histograms, bar charts, pie charts, scatter plots, and heatmaps.
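
As a rough sketch, three of the plot types listed above can be drawn with matplotlib alone: a bar chart of listings per fuel type, a scatter plot of price against kilometres driven, and a heatmap of the correlation matrix. The sample data is the five rows from `df.head()`; the column selection, colour map, and layout are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Five sample rows from the df.head() output, with units already stripped.
sample = pd.DataFrame({
    "Year": [2010, 2015, 2011, 2012, 2013],
    "Kilometers_Driven": [72000, 41000, 46000, 87000, 40670],
    "Fuel_Type": ["CNG", "Diesel", "Petrol", "Diesel", "Diesel"],
    "Engine": [998, 1582, 1199, 1248, 1968],
    "Price": [1.75, 12.5, 4.5, 6.0, 17.74],
})

fig, (ax_bar, ax_scatter, ax_heat) = plt.subplots(1, 3, figsize=(15, 4))

# Bar chart: number of listings per fuel type.
fuel_counts = sample["Fuel_Type"].value_counts()
ax_bar.bar(list(fuel_counts.index), list(fuel_counts.values))
ax_bar.set_title("Listings per fuel type")
ax_bar.set_ylabel("Number of cars")

# Scatter plot: price against kilometres driven.
ax_scatter.scatter(sample["Kilometers_Driven"], sample["Price"])
ax_scatter.set_title("Price vs. kilometres driven")
ax_scatter.set_xlabel("Kilometers_Driven")
ax_scatter.set_ylabel("Price")

# Heatmap: correlation matrix of the numeric columns.
corr = sample[["Year", "Kilometers_Driven", "Engine", "Price"]].corr()
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(list(corr.columns), rotation=45, ha="right")
ax_heat.set_yticks(range(len(corr.index)))
ax_heat.set_yticklabels(list(corr.index))
ax_heat.set_title("Correlation heatmap")
fig.colorbar(image, ax=ax_heat)

fig.tight_layout()
```

Since the notebook already imports seaborn, `sns.heatmap(corr, annot=True)` would be a shorter alternative for the third panel.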

## 4. Insights and Conclusions

* <h3>Summarize the key insights gained from the data analysis.</h3>
* <h3>Draw conclusions based on the patterns observed in the data.</h3>