<a href="https://colab.research.google.com/github/SamIv23/Data-Analyst/blob/main/Data%20Analyst%20-%20Project%202%20Cars%20Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analyst - Project 2 Cars Dataset

This project demonstrates an Exploratory Data Analysis (EDA) on a car dataset using the Python Pandas library. The analysis covers several critical stages of the data science lifecycle, including data cleaning, descriptive analysis, filtering, and data transformation. The primary goal is to extract initial insights from the data and to showcase proficiency in effectively manipulating Pandas DataFrames.

-----

Project Objectives:
* Data Cleaning: Identify and handle null values to ensure data integrity.
* Frequency Analysis: Analyze the distribution of categorical data to understand the dataset's composition.
* Data Filtering: Select data subsets based on specific criteria for a more focused analysis.
* Data Manipulation: Remove irrelevant records (such as outliers or data outside the analysis scope).
* Data Transformation: Apply functions to a data column to modify its values as required by the analysis.





## 1. Data Load and Data Cleaning

In this initial stage, the dataset is read from an external CSV file with the Pandas library and loaded into a DataFrame, ready for processing.

In [1]:
import pandas as pd

In [3]:
car = pd.read_csv(r"/content/Project+2+-+Cars+Dataset.csv")

In [None]:
car.head(5)

Unnamed: 0,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
0,Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6.0,265.0,17.0,23.0,4451.0,106.0,189.0
1,Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2.0,4.0,200.0,24.0,31.0,2778.0,101.0,172.0
2,Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4.0,200.0,22.0,29.0,3230.0,105.0,183.0
3,Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",3.2,6.0,270.0,20.0,28.0,3575.0,108.0,186.0
4,Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6.0,225.0,18.0,24.0,3880.0,115.0,197.0
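
The "Unnamed: 0" column in the preview is the name pandas typically gives to a header cell that is empty in the file, which usually means a saved row index. As a minimal sketch, assuming that is the case here, the snippet below compares the default load with one that passes index_col=0. The two-row CSV text is a hypothetical excerpt for illustration, not the project file.

```python
import io

import pandas as pd

# Hypothetical excerpt in the same layout as the preview above.
# The first header cell is empty, so pandas labels it "Unnamed: 0".
csv_text = """,Make,Model,MSRP
0,Acura,MDX,"$36,945"
1,Acura,TSX 4dr,"$26,990"
"""

# Default load: the saved index comes back as an ordinary column.
default_load = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index.
indexed_load = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_load.columns))
print(list(indexed_load.columns))
```

Whether to keep the column is a judgment call; it carries no information about the cars, so reading it as the index keeps it out of later column-wise summaries.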


In [4]:
car.isnull().sum()

Unnamed: 0,0
Make,4
Model,4
Type,4
Origin,4
DriveTrain,4
MSRP,4
Invoice,4
EngineSize,4
Cylinders,6
Horsepower,4


The Cylinders column has two more null values than the other columns shown (6 versus 4). These are filled with the column mean, which keeps the affected rows in the dataset and leaves the column's mean unchanged.

In [10]:
car['Cylinders'] = car['Cylinders'].fillna(car['Cylinders'].mean())
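
To make the effect of mean imputation concrete, here is a small sketch on toy values (not taken from the dataset). It shows that the mean is computed over the non-null entries only, and that filling with it keeps the mean but narrows the spread of the column.

```python
import pandas as pd

# Toy values for illustration only.
cylinders = pd.Series([4.0, 6.0, None, 8.0, None])

# mean() skips NaN by default: (4 + 6 + 8) / 3
mean_value = cylinders.mean()
filled = cylinders.fillna(mean_value)

print(filled.tolist())
# The mean is preserved, but the standard deviation shrinks
# because the filled values sit exactly at the mean.
print(cylinders.std(), filled.std())
```

For a count-like column such as Cylinders, the mean may not be a whole number, so the median or the most frequent value could be a reasonable alternative depending on the analysis.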

In [11]:
car.isnull().sum()

Unnamed: 0,0
Make,4
Model,4
Type,4
Origin,4
DriveTrain,4
MSRP,4
Invoice,4
EngineSize,4
Cylinders,0
Horsepower,4


The remaining rows that contain a null value in any column are then dropped with dropna().

In [12]:
car.dropna(inplace=True)

In [13]:
car.isnull().sum()

Unnamed: 0,0
Make,0
Model,0
Type,0
Origin,0
DriveTrain,0
MSRP,0
Invoice,0
EngineSize,0
Cylinders,0
Horsepower,0
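
As a brief sketch of what dropna() does by default, the toy frame below (hypothetical values) contrasts the default how="any", which drops a row if any column is null, with how="all", which drops only rows that are entirely null. Neither call modifies the original frame unless inplace=True is passed.

```python
import pandas as pd

# Toy frame: one complete row, one fully empty row, one partially empty row.
toy = pd.DataFrame({
    "Make": ["Acura", None, "Audi"],
    "Horsepower": [265.0, None, None],
})

complete_rows = toy.dropna()            # default how="any"
not_empty_rows = toy.dropna(how="all")  # drop only all-null rows

print(len(toy), len(complete_rows), len(not_empty_rows))
```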


## 2. Analysis of Car Makes

To understand which car makes were present and their frequency, I used the value_counts() function on the Make column. This provided a quick overview of the most common makes in the dataset.

In [None]:
make_counts = car['Make'].value_counts()
print(make_counts)

Make
Toyota           28
Chevrolet        27
Mercedes-Benz    26
Ford             23
BMW              20
Audi             19
Nissan           17
Honda            17
Chrysler         15
Volkswagen       15
Mitsubishi       13
Dodge            13
Volvo            12
Hyundai          12
Jaguar           12
Subaru           11
Kia              11
Pontiac          11
Mazda            11
Lexus            11
Buick             9
Mercury           9
Lincoln           9
Cadillac          8
GMC               8
Saturn            8
Suzuki            8
Infiniti          8
Acura             7
Saab              7
Porsche           7
Land Rover        3
Oldsmobile        3
Jeep              3
Isuzu             2
MINI              2
Scion             2
Hummer            1
Name: count, dtype: int64
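
If relative shares are more useful than raw counts, value_counts() also accepts normalize=True. The sketch below uses a toy Series rather than the project data, so the numbers are illustrative only.

```python
import pandas as pd

# Toy data for illustration only.
makes = pd.Series(["Toyota", "Toyota", "Ford", "BMW"], name="Make")

counts = makes.value_counts()                # absolute frequencies, sorted descending
shares = makes.value_counts(normalize=True)  # fractions that sum to 1

print(counts.index[0], counts.iloc[0])
print(shares["Toyota"])
```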


## 3. Filtering by Car Origin

Analysis often requires focusing on specific segments. Here, I filtered the dataset to display only cars originating from Asia or Europe. This was accomplished efficiently using the isin() function.

In [None]:
filtered_origin = car[car['Origin'].isin(['Asia', 'Europe'])]
print(filtered_origin.head())

    Make           Model   Type Origin DriveTrain      MSRP   Invoice  \
0  Acura             MDX    SUV   Asia        All  $36,945   $33,337    
1  Acura  RSX Type S 2dr  Sedan   Asia      Front  $23,820   $21,761    
2  Acura         TSX 4dr  Sedan   Asia      Front  $26,990   $24,647    
3  Acura          TL 4dr  Sedan   Asia      Front  $33,195   $30,299    
4  Acura      3.5 RL 4dr  Sedan   Asia      Front  $43,755   $39,014    

   EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
0         3.5        6.0       265.0      17.0         23.0  4451.0   
1         2.0        4.0       200.0      24.0         31.0  2778.0   
2         2.4        4.0       200.0      22.0         29.0  3230.0   
3         3.2        6.0       270.0      20.0         28.0  3575.0   
4         3.5        6.0       225.0      18.0         24.0  3880.0   

   Wheelbase  Length  
0      106.0   189.0  
1      101.0   172.0  
2      105.0   183.0  
3      108.0   186.0  
4      115.0   197.0  
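
For comparison, the isin() filter is equivalent to chaining equality checks with the | operator. The sketch below shows both forms on a toy frame; the "USA" label is an assumed third category used only for illustration.

```python
import pandas as pd

# Toy frame; "USA" is a hypothetical origin value, not confirmed by the output above.
toy = pd.DataFrame({
    "Make": ["Acura", "Audi", "Ford"],
    "Origin": ["Asia", "Europe", "USA"],
})

with_isin = toy[toy["Origin"].isin(["Asia", "Europe"])]
with_or = toy[(toy["Origin"] == "Asia") | (toy["Origin"] == "Europe")]

# Both selections contain the same rows.
print(with_isin.equals(with_or))
```

isin() tends to stay readable as the list of accepted values grows, whereas the chained form needs one extra comparison per value.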

## 4. Removing Records by Weight

To filter out potential outliers or records irrelevant to a specific analysis segment (e.g., very large vehicles), I removed all rows where Weight exceeded 4000. The ~ operator negates the boolean condition, so only the rows that do not match it are kept in the new car_cleaned DataFrame.

In [None]:
car_cleaned = car[~(car['Weight'] > 4000)]
print(f"Original data: {car.shape[0]} rows, Cleaned data: {car_cleaned.shape[0]} rows")

Original data: 432 rows, Cleaned data: 329 rows
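
One detail worth knowing about the negated form: ~(Weight > 4000) and Weight <= 4000 select the same rows only when the column has no missing values. The sketch below illustrates the difference on a toy Series with one NaN; it is not taken from the dataset.

```python
import pandas as pd

# Toy weights, including one missing value.
weights = pd.Series([3500.0, 4451.0, float("nan")], name="Weight")

# NaN > 4000 is False, so negating it keeps the NaN row.
kept_by_negation = weights[~(weights > 4000)]

# NaN <= 4000 is also False, so the NaN row is dropped.
kept_by_comparison = weights[weights <= 4000]

print(len(kept_by_negation), len(kept_by_comparison))
```

When rows with nulls have already been dropped, as in Section 1, the two forms should give the same result.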


## 5. Data Transformation with the apply Function

As an example of data transformation, I increased all values in the MPG_City column by 3 points. This could simulate a scenario such as "what if all cars received a fuel efficiency upgrade?". The apply() function with a lambda was used for this operation.

In [None]:
car['MPG_City'] = car['MPG_City'].apply(lambda x: x + 3)
print(car[['Make', 'MPG_City']].head())

    Make  MPG_City
0  Acura      20.0
1  Acura      27.0
2  Acura      25.0
3  Acura      23.0
4  Acura      21.0
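
For a simple arithmetic change like this one, plain vectorized addition gives the same result as apply() with a lambda and is usually faster on large columns. A minimal sketch on toy values:

```python
import pandas as pd

# Toy values matching the first rows shown earlier in the notebook.
mpg_city = pd.Series([17.0, 24.0, 22.0], name="MPG_City")

via_apply = mpg_city.apply(lambda x: x + 3)  # element-by-element Python call
via_vectorized = mpg_city + 3                # single vectorized operation

print(via_apply.equals(via_vectorized))
```

Note that the cell above overwrites car['MPG_City'] in place, so running it more than once adds 3 each time. Writing the result to a new column is one way to avoid that.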
