# Exploratory Data Analysis of Craigslist Car Listings

This notebook explores a dataset of car listings from Craigslist, originally published on Kaggle by Austin Reese.
The dataset contains over 400k listings, with details such as the vehicle's make, model, year, price, location, and description.

The data was collected using the Craigslist API and by scraping Craigslist's website.
The project was taken offline around December 2020, yet the posting dates in this file seem to range from April 2021 to May 2021 (TBC).

The dataset is available for download on Kaggle at the following link: [Craigslist Car and Truck Listings](https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data)

The dataset is in CSV format, with one row per listing and one column per variable.

The goal of this notebook is to explore the dataset. We will use data visualization and statistical analysis to answer questions such as:

- What are the most popular vehicle makes and models on Craigslist?
- etc.

<!-- variable TBC
- `id`: a unique identifier for each listing
- `url`: the URL of the listing on Craigslist
- `region`: the region where the listing was posted (e.g. "atlanta", "boston", etc.)
- `region_url`: the URL of the region on Craigslist
- `price`: the price of the vehicle, in USD
- `year`: the year of the vehicle
- `manufacturer`: the manufacturer of the vehicle (e.g. "ford", "honda", etc.)
- `model`: the model of the vehicle (e.g. "focus", "civic", etc.)
- `condition`: the condition of the vehicle (e.g. "new", "like new", "excellent", etc.)
- `cylinders`: the number of cylinders in the vehicle's engine (e.g. "4 cylinders", "6 cylinders", etc.)
- `fuel`: the type of fuel used by the vehicle (e.g. "gas", "diesel", etc.)
- `odometer`: the mileage of the vehicle, in miles
- `title_status`: the title status of the vehicle (e.g. "clean", "salvage", etc.)
- `transmission`: the type of transmission used by the vehicle (e.g. "automatic", "manual", etc.)
- `VIN`: the vehicle identification number (VIN) of the vehicle
- `drive`: the type of drive used by the vehicle (e.g. "4wd", "fwd", etc.)
- `size`: the size of the vehicle (e.g. "full-size", "mid-size", etc.)
- `type`: the type of vehicle (e.g. "sedan", "SUV", etc.)
- `paint_color`: the paint color of the vehicle (e.g. "white", "black", etc.)
- `image_url`: the URL of an image of the vehicle
- `description`: a description of the vehicle, provided by the seller
- `county`: the county where the listing was posted
- `state`: the state where the listing was posted
- `lat`: the latitude of the listing's location
- `long`: the longitude of the listing's location
- `posting_date`: the date when the listing was posted on Craigslist -->

# 0 - Imports

## 0.1 - Libs imports

In [1]:
import pandas as pd

## 0.2 - Data import

In [2]:
df = pd.read_csv('../data/vehicles.csv', encoding="utf-8")
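
With more than 400k rows and a free-text `description` column, loading everything can be heavy. A minimal sketch of a lighter load, assuming only a few columns are needed; the in-memory CSV and the `wanted` list are illustrative stand-ins, not the real file or a recommended column set:

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for '../data/vehicles.csv' (made-up rows).
csv_text = io.StringIO(
    "id,price,manufacturer,description\n"
    "1,6000,ford,long free text\n"
    "2,11900,toyota,more free text\n"
)

# Hypothetical column subset; adjust it to the questions being asked.
wanted = ["id", "price", "manufacturer"]

# usecols skips the unwanted columns at parse time;
# the category dtype stores repeated strings compactly.
small = pd.read_csv(
    csv_text,
    usecols=wanted,
    dtype={"manufacturer": "category"},
)
print(small.dtypes)
```

For the real file, the same `usecols` and `dtype` arguments can be passed alongside the path used in the cell above.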

# 1 - Exploration

## 1.1 - Basic info and first look at the data

In [3]:
df.head()

| | id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ... | size | type | paint_color | image_url | description | county | state | lat | long | posting_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7222695916 | https://prescott.craigslist.org/cto/d/prescott... | prescott | https://prescott.craigslist.org | 6000 | | | | | | ... | | | | | | | az | | | |
| 1 | 7218891961 | https://fayar.craigslist.org/ctd/d/bentonville... | fayetteville | https://fayar.craigslist.org | 11900 | | | | | | ... | | | | | | | ar | | | |
| 2 | 7221797935 | https://keys.craigslist.org/cto/d/summerland-k... | florida keys | https://keys.craigslist.org | 21000 | | | | | | ... | | | | | | | fl | | | |
| 3 | 7222270760 | https://worcester.craigslist.org/cto/d/west-br... | worcester / central MA | https://worcester.craigslist.org | 1500 | | | | | | ... | | | | | | | ma | | | |
| 4 | 7210384030 | https://greensboro.craigslist.org/cto/d/trinit... | greensboro | https://greensboro.craigslist.org | 4900 | | | | | | ... | | | | | | | nc | | | |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  pain

> Notes:
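> - 426,880 entries!
> - Some data are missing, keep an eye on it: `size` has only 120,519 non-null values (about 28% of rows), and `condition`, `cylinders` and `VIN` are each filled for roughly 60% of rows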
> - "county" is always missing 
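
To keep an eye on the gaps, a per-column summary is easier to read than the raw `info()` output. A minimal sketch, assuming a helper named `missing_summary` (hypothetical, not part of pandas) and a toy frame with made-up values standing in for `df`:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, most incomplete first."""
    summary = pd.DataFrame(
        {
            "missing": frame.isna().sum(),
            "missing_pct": (frame.isna().mean() * 100).round(1),
        }
    )
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame(
    {
        "price": [6000, 11900, 21000, 1500],
        "size": [None, "full-size", None, None],
        "county": [None, None, None, None],
    }
)

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the full dataset should surface columns such as `county` and `size` at the top of the report.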

In [5]:
df["posting_date"].describe()
# TODO: convert to datetime, e.g. pd.to_datetime(df["posting_date"], utc=True)

count                       426812
unique                      381536
top       2021-04-23T22:13:05-0400
freq                            12
Name: posting_date, dtype: object
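
`posting_date` is stored as strings with a UTC offset (the `top` value ends in `-0400`), so the date range can only be checked after parsing. A minimal sketch, assuming offsets may differ between listings, which is why everything is converted to UTC; the helper name `posting_date_range` is hypothetical and the second sample timestamp is made up:

```python
import pandas as pd


def posting_date_range(dates: pd.Series):
    """Parse offset-aware timestamp strings to UTC and return (earliest, latest)."""
    # utc=True puts mixed offsets on one timeline;
    # errors="coerce" turns unparseable or missing values into NaT.
    parsed = pd.to_datetime(dates, utc=True, errors="coerce")
    return parsed.min(), parsed.max()


sample = pd.Series(
    [
        "2021-04-23T22:13:05-0400",  # the `top` value reported by describe()
        "2021-05-04T10:00:00-0500",  # made-up value for illustration
        None,  # missing posting date
    ]
)

first, last = posting_date_range(sample)
print(first, last)
```

Running `posting_date_range(df["posting_date"])` on the full column would confirm or correct the April to May 2021 range.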

In [6]:
df['manufacturer'].value_counts().head(10)

manufacturer
ford         70985
chevrolet    55064
toyota       34202
honda        21269
nissan       19067
jeep         19014
ram          18342
gmc          16785
bmw          14699
dodge        13707
Name: count, dtype: int64

> Notes: 
> - Ford is the most popular manufacturer, not very surprising haha
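
Raw counts are easier to compare as shares of all listings. A minimal sketch, assuming missing manufacturers should stay in the denominator; the helper name `top_shares` is hypothetical and the toy series is made up:

```python
import pandas as pd


def top_shares(column: pd.Series, n: int = 10) -> pd.Series:
    """Percentage of rows per value, with missing values kept in the denominator."""
    shares = column.value_counts(normalize=True, dropna=False)
    return shares.head(n).mul(100).round(1)


# Toy series standing in for df["manufacturer"] (made-up values).
toy_makes = pd.Series(["ford", "ford", "chevrolet", None, "toyota", "ford"])

shares = top_shares(toy_makes, n=3)
print(shares)
```

Dropping `dropna=False` would instead report shares among listings that state a manufacturer.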

In [7]:
df['model'].value_counts().head(5)

model
f-150             8009
silverado 1500    5140
1500              4211
camry             3135
silverado         3023
Name: count, dtype: int64

> Notes:
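> - And of course, the most listed model is the F-150 (from Ford)
> - Model labels are not normalized: `silverado 1500`, `1500` and `silverado` are counted separately, so per-model counts are only approximate
> - Ford F-Series sales by year: 896,526 (2019), 787,372 (2020) and 726,003 (2021) [Source](https://www.goodcarbadcar.net/ford-f-series-sales-figures/). Yeah, it's huge haha!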
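
A label such as `1500` in the counts above does not say which manufacturer it belongs to. Counting (manufacturer, model) pairs is one way to disambiguate. A minimal sketch, assuming a hypothetical helper `top_make_model` and a toy frame with made-up rows:

```python
import pandas as pd


def top_make_model(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Count listings per (manufacturer, model) pair, most frequent first."""
    # groupby drops rows where either key is missing (pandas default).
    pairs = frame.groupby(["manufacturer", "model"]).size()
    return pairs.sort_values(ascending=False).head(n)


# Toy frame standing in for df[["manufacturer", "model"]] (made-up rows).
toy_cars = pd.DataFrame(
    {
        "manufacturer": ["ford", "ford", "ford", "ram", "chevrolet", None],
        "model": ["f-150", "f-150", "f-150", "1500", "silverado 1500", "1500"],
    }
)

pairs = top_make_model(toy_cars)
print(pairs)
```

Rows without a manufacturer drop out of this count, so it complements rather than replaces the plain `value_counts()` on `model`.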