
In [1]:
%pip install pandas



In [6]:
import pandas as pd
data = pd.read_csv('/content/train.csv')

In [7]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

Unnamed: 0              0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  38
New_Price            5032
Price                   0
dtype: int64


#a) Look for the missing values in all the columns and either impute them (replace with mean, median, or mode) or drop them.

In [16]:
# Strip unit text (e.g. ' kmpl', ' CC', ' bhp') and convert to numeric.
# pd.to_numeric(errors='coerce') maps the empty strings left behind by
# missing or non-numeric entries to NaN; .astype(float) would raise on them.
for col in ['Mileage', 'Engine', 'Power']:
    digits_only = data[col].astype(str).str.replace(r'[^0-9.]', '', regex=True)
    data[col] = pd.to_numeric(digits_only, errors='coerce')

# Impute missing values without inplace=True to avoid FutureWarnings
data['Mileage'] = data['Mileage'].fillna(data['Mileage'].median())
data['Engine'] = data['Engine'].fillna(data['Engine'].mean())
data['Power'] = data['Power'].fillna(data['Power'].mean())
data['Seats'] = data['Seats'].fillna(data['Seats'].mode()[0])

# Drop New_Price column if it exists
if 'New_Price' in data.columns:
    data = data.drop(columns=['New_Price'])

# Verify all missing values are handled
print(data.isnull().sum())

Unnamed: 0           0
Name                 0
Location             0
Year                 0
Kilometers_Driven    0
Fuel_Type            0
Transmission         0
Owner_Type           0
Mileage              0
Engine               0
Power                0
Seats                0
Price                0
dtype: int64


Mileage (median): Mileage can contain outliers because fuel efficiency varies widely across fuel types and models. The median is robust to those extremes, so the imputed value stays close to that of a typical car.

Engine (mean): Engine displacement is a continuous variable, and only 36 of its values are missing. Imputing the mean leaves the column average unchanged, although the median would be the safer choice if the distribution turns out to be skewed.

Power (mean): Like engine displacement, power is a continuous variable. Imputing the mean keeps the column average unchanged, which suits analyses that compare cars on performance.

Seats (mode): Seat count is a discrete variable with only a few common values. The mode fills the gaps with the most frequent configuration and avoids impossible values such as a fractional number of seats.
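
The choice between mean and median can be checked against the data rather than assumed. The following is a minimal sketch on toy values, not on train.csv. The helper name `choose_imputer` and the skewness cutoff of 1.0 are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def choose_imputer(series, skew_cutoff=1.0):
    """Return 'median' for strongly skewed columns, otherwise 'mean'.

    Hypothetical helper: the cutoff of 1.0 is a rule of thumb, not a standard.
    """
    skewness = series.dropna().skew()
    return "median" if abs(skewness) > skew_cutoff else "mean"

# Toy data: one symmetric column and one with a single large outlier
toy = pd.DataFrame({
    "symmetric": [10.0, 11.0, 12.0, 13.0, 14.0, None],
    "skewed": [1.0, 1.0, 2.0, 2.0, 100.0, None],
})

for col in toy.columns:
    strategy = choose_imputer(toy[col])
    fill_value = toy[col].median() if strategy == "median" else toy[col].mean()
    toy[col] = toy[col].fillna(fill_value)
    print(col, strategy, fill_value)
```

On the real data, the same check could be run on Mileage, Engine, and Power before fixing the imputation strategy for each column.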

#b) Remove the units from some of the attributes and only keep the numerical values.


In [17]:
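# Remove the unit suffixes and convert to numeric; entries that cannot be parsed become NaN (errors='coerce').
# These columns were already cleaned in part (a), so on this data the cell leaves their values unchanged.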
data['Mileage'] = pd.to_numeric(data['Mileage'].astype(str).str.replace(' kmpl', ''), errors='coerce')
data['Engine'] = pd.to_numeric(data['Engine'].astype(str).str.replace(' CC', ''), errors='coerce')
data['Power'] = pd.to_numeric(data['Power'].astype(str).str.replace(' bhp', ''), errors='coerce')

# Optional: convert 'New_Price' if it is still present (it was dropped in part (a), so this branch is skipped here)
if 'New_Price' in data.columns:
    data['New_Price'] = pd.to_numeric(data['New_Price'].astype(str).str.replace(' lakh', ''), errors='coerce')
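
Replacing one fixed suffix only works when every row uses the same unit. As a hedged alternative, the sketch below extracts the first number in each entry with a regular expression, so other suffixes and placeholder text end up as NaN. The sample strings and the helper name `strip_units` are illustrative assumptions, not values taken from train.csv.

```python
import pandas as pd

def strip_units(series):
    """Extract the first integer or decimal number from each entry.

    Entries without any digits (missing values, placeholder text) become NaN.
    Hypothetical helper for illustration.
    """
    number = series.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(number, errors="coerce")

# Made-up entries covering different suffixes, placeholder text, and a missing value
raw = pd.Series(["19.67 kmpl", "26.6 km/kg", "null bhp", None, "1582 CC"])
clean = strip_units(raw)
print(clean)
```

Unlike deleting every non-digit character, extraction keeps only the first number, so an entry containing two numbers would not be merged into one value.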


#c) Change the categorical variables (“Fuel_Type” and “Transmission”) into numerical one-hot encoded values.

In [25]:
# Check the column names
print(data.columns)

# Perform one-hot encoding on categorical variables if columns exist
if 'Fuel_Type' in data.columns and 'Transmission' in data.columns:
    data = pd.get_dummies(data, columns=['Fuel_Type', 'Transmission'], drop_first=True)
else:
    print("One or both columns not found.")


Index(['Unnamed: 0', 'Name', 'Location', 'Year', 'Kilometers_Driven',
       'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats', 'Price',
       'Fuel_Type_Electric', 'Fuel_Type_Petrol', 'Transmission_Manual'],
      dtype='object')
One or both columns not found.


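The "not found" message appears because this cell had already been run once. On the earlier run, Fuel_Type and Transmission were one-hot encoded and replaced by Fuel_Type_Electric, Fuel_Type_Petrol, and Transmission_Manual, as the column index above shows. With drop_first=True, the first category of each variable is dropped and serves as the baseline.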
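
One way to make the encoding cell safe to rerun is to encode only the columns that are still present. This is a minimal sketch on toy rows. The helper name `one_hot_once` is a hypothetical addition, not part of the original notebook.

```python
import pandas as pd

def one_hot_once(df, columns):
    """One-hot encode only those listed columns that still exist in df."""
    remaining = [col for col in columns if col in df.columns]
    if not remaining:
        return df  # everything was encoded on an earlier run
    return pd.get_dummies(df, columns=remaining, drop_first=True)

# Toy rows for illustration only
toy = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "Electric", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
    "Price": [12.5, 4.5, 13.0, 6.0],
})

encoded = one_hot_once(toy, ["Fuel_Type", "Transmission"])
encoded_again = one_hot_once(encoded, ["Fuel_Type", "Transmission"])  # rerun is a no-op
print(list(encoded_again.columns))
```

Because the helper checks each column separately, it also handles the case where only one of the two columns has been encoded so far.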

#d) Create one more feature and add this column to the dataset (you can use the mutate function in R for this). For example, you can calculate the current age of the car by subtracting the “Year” value from the current year.

In [26]:
from datetime import datetime

# Calculate the current year
current_year = datetime.now().year

# Create a new 'Car_Age' column
data['Car_Age'] = current_year - data['Year']
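
Because `datetime.now().year` changes over time, rerunning the notebook in a later year shifts every `Car_Age` value. Below is a minimal sketch of a reproducible variant on toy rows. The `REFERENCE_YEAR` constant is a hypothetical addition. The value 2024 is chosen because it matches the `Car_Age` values printed in this notebook's output (a 2015 car has age 9).

```python
import pandas as pd

REFERENCE_YEAR = 2024  # assumption: pinned so results do not change between runs

# Toy rows for illustration only
toy = pd.DataFrame({
    "Name": ["car_a", "car_b", "car_c"],
    "Year": [2015, 2011, 2018],
})

# assign() returns a new DataFrame with the extra column, much like mutate() in R
toy = toy.assign(Car_Age=REFERENCE_YEAR - toy["Year"])
print(toy)
```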


In [27]:
# 1. Select specific columns
selected_data = data[['Name', 'Year', 'Mileage', 'Engine', 'Power', 'Price', 'Car_Age']]

# 2. Filter rows (e.g., select cars manufactured after 2015)
filtered_data = selected_data[selected_data['Year'] > 2015]

# 3. Rename columns (e.g., rename 'Power' to 'Horsepower')
renamed_data = filtered_data.rename(columns={'Power': 'Horsepower'})

# 4. Mutate (add a new column, e.g., Price per Horsepower)
renamed_data['Price_per_Horsepower'] = renamed_data['Price'] / renamed_data['Horsepower']

# 5. Arrange (sort by 'Price' in descending order)
arranged_data = renamed_data.sort_values(by='Price', ascending=False)

# 6. Summarize with group by (e.g., average price by year)
summary = arranged_data.groupby('Year')['Price'].mean().reset_index()

# Display each intermediate DataFrame and the summary
print("Selected Data:\n", selected_data.head())
print("\nFiltered Data:\n", filtered_data.head())
print("\nRenamed Data:\n", renamed_data.head())
print("\nArranged Data:\n", arranged_data.head())
print("\nSummary (Average Price by Year):\n", summary)

Selected Data:
                                Name  Year  Mileage  Engine   Power  Price  \
0  Hyundai Creta 1.6 CRDi SX Option  2015    19.67  1582.0  126.20  12.50   
1                      Honda Jazz V  2011    13.00  1199.0   88.70   4.50   
2                 Maruti Ertiga VDI  2012    20.77  1248.0   88.76   6.00   
3   Audi A4 New 2.0 TDI Multitronic  2013    15.20  1968.0  140.80  17.74   
4            Nissan Micra Diesel XV  2013    23.08  1461.0   63.10   3.50   

   Car_Age  
0        9  
1       13  
2       12  
3       11  
4       11  

Filtered Data:
                                  Name  Year  Mileage  Engine   Power  Price  \
5   Toyota Innova Crysta 2.8 GX AT 8S  2016    11.36  2755.0  171.50  17.50   
8                    Maruti Ciaz Zeta  2018    21.56  1462.0  103.25   9.95   
14              Honda Amaze S i-Dtech  2016    25.80  1498.0   98.60   5.40   
15              Maruti Swift DDiS VDI  2017    28.40  1248.0   74.00   5.99   
26                Honda WRV i-V