# Car Sales Dataset

## Background & Context

Demand for used cars is consistently high. As new car sales have slowed in recent years, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a growing tech start-up in North America that aims to find a good strategy in this market.

In 2018-19, new car sales were recorded at 3.6 million units, while around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. With this in mind, the pricing of used cars becomes important for growing in this market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and help the business devise profitable strategies using differential pricing. For example, if the business knows the market price of a car, it will never sell that car below it.

### **The objectives (for now):**
* Explore basic stats and visualize the dataset.
* Generate a set of insights and recommendations that will help the business.

### **The key questions:**
* Which factors would affect the price of used cars?


### **Data Dictionary**

**Manufacturer** : Brand name of the car

**Model** : Model name of the car

**Type** : Size category of the car, one of 'Small', 'Midsize', 'Compact', 'Large', 'Sporty', 'Van' (6 unique types).

**Year** : Manufacturing year of the car

**MPG.City** : Miles the car travels per gallon of fuel in city driving.

**MPG.Highway** : Miles the car travels per gallon of fuel in highway driving.

**Transmission** : The type of transmission used by the car. (Automatic / Manual)

**Drive** : Type of drive (all-wheel drive or front-wheel drive).

**Cylinders** : Number of cylinders in the engine.

**Passengers** : Number of seats in the car.

**Price** : The price of the used car in thousands of USD (**Target Variable**)

In [1]:
# import necessary libraries


In [2]:
# Read the car_sales.csv data and store it as a dataframe


In [3]:
# View the top 10 rows of the dataframe


In [4]:
# get basic info


In [5]:
# difference between df.MPG Highway and df['MPG Highway']
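
A minimal sketch of the comparison, assuming the column label contains a space as written in the cell comment (the data dictionary spells it `MPG.Highway`, so the label in the actual file may differ). The toy rows below are made up for illustration and are not taken from `car_sales.csv`.

```python
import pandas as pd

# Hypothetical toy rows; the real data would come from car_sales.csv.
df = pd.DataFrame({"MPG Highway": [31, 25, 26], "Price": [15.9, 33.9, 29.1]})

# Bracket access works for any column label, including labels with spaces.
highway = df["MPG Highway"]

# Attribute access works only when the label is a valid Python identifier.
price = df.Price

# `df.MPG Highway` is not valid Python, so it fails at parse time,
# before pandas is involved at all.
try:
    compile("df.MPG Highway", "<example>", "eval")
    attribute_form_parses = True
except SyntaxError:
    attribute_form_parses = False

print(attribute_form_parses)
```

If the label were `MPG.Highway` instead, `df.MPG.Highway` would parse but would look up a column named `MPG`, so bracket access is the safer form in either case.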


In [6]:
# get stats for both numerical and categorical (object) columns
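
One possible way to get both kinds of statistics in a single table is `describe(include="all")`. The sketch below uses a small made-up dataframe rather than the real file; the column names follow the data dictionary, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataframe.
df = pd.DataFrame({
    "Type": ["Small", "Small", "Van"],
    "Price": [10.0, 20.0, 30.0],
})

# include="all" reports count/unique/top/freq for object columns
# and mean/std/min/quartiles/max for numeric columns.
stats = df.describe(include="all")
print(stats)
```

Cells that do not apply to a column's type (for example `mean` for `Type`) are filled with `NaN`.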


In [7]:
# Find the unique classes in the Type column


In [8]:
# Plot a histogram for the Passengers feature and identify the type of distribution


In [9]:
# check the datatype of variable Cylinders


In [10]:
# find the average price of the cars in this dataset


In [11]:
# Get the five-number summary of the 'Passengers' variable


In [12]:
# Plot a box plot for the Passengers variable


The lower whisker of a box plot extends to the lowest data value that lies within 1.5 times the interquartile range (IQR) below the lower quartile (Q1), that is, the lowest value that is not considered an outlier. It coincides with the minimum of the data only when there are no low outliers.

The upper whisker extends to the highest data value that lies within 1.5 times the IQR above the upper quartile (Q3), that is, the highest value that is not considered an outlier. It coincides with the maximum of the data only when there are no high outliers.

The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. Outliers are defined as values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. In a box plot, outliers are represented by individual points outside the whiskers.
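
The whisker and outlier rules above can be checked by hand. The following is a minimal sketch on made-up passenger counts (not values from the dataset), using pandas' default linear interpolation for the quartiles.

```python
import pandas as pd

# Hypothetical passenger counts, chosen so that each tail has one outlier.
passengers = pd.Series([2, 4, 5, 5, 5, 5, 6, 6, 6, 7, 8])

q1 = passengers.quantile(0.25)
q3 = passengers.quantile(0.75)
iqr = q3 - q1

# Fences: values beyond these limits are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data values that are still inside the fences.
inside = passengers[(passengers >= lower_fence) & (passengers <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

outliers = passengers[(passengers < lower_fence) | (passengers > upper_fence)]

print(q1, q3, iqr)
print(lower_whisker, upper_whisker)
print(outliers.tolist())
```

Note that the whiskers stop at actual data values (4 and 7 here), not at the fences themselves (3.5 and 7.5).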

In [13]:
# Plot the correlation matrix
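
A rough sketch of how the matrix itself could be computed before plotting, assuming only the numeric columns should enter the correlation. The rows are hypothetical and deliberately perfectly linear, so the result is easy to verify.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the data dictionary.
df = pd.DataFrame({
    "Type": ["Small", "Midsize", "Large", "Van"],
    "MPG.City": [35, 30, 25, 20],
    "Price": [10.0, 20.0, 30.0, 40.0],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
# (Pearson is the default method of DataFrame.corr).
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting square dataframe can then be handed to a plotting function of your choice, for example seaborn's `heatmap`, to draw the matrix.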


In [14]:
# Plot a chart of 'Type' vs 'Price'


In [15]:
# Plot a chart of 'Drive' vs 'Price'
