# Scraping NamCars.net for car data analysis
  
NamCars.net has no objections to scraping and using the data for personal or non-commercial use.  
  
### Goals 
1. Get the data: scrape all entries in the "Bakkie Double Cab" and "4x4" categories. Overlaps between the two are ignored.  
2. Store the data. Just a CSV file 
3. Some analysis and presentation  
  
### Motivation 
Cars and 4x4s are fun.
  

### Scraping pseudocode

start with the base page
while there is a "next page":   
- get DOM of current page  
- scrape  
    - get data for all cars 
    - extract the URL of the next page   
- wait a while, not to overwhelm the server or seem suspicious enough to be banned  

save data to CSV

In [1]:
from bs4 import BeautifulSoup
import requests 
import time
import csv

In [2]:
def log_data(field_names, data, path):
    '''Write data to a CSV file.
       Needs: header names, rows of data and the path with file name'''
    # The with block closes the file on exit, so no explicit close() is needed
    with open(str(path), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(field_names)
        writer.writerows(data)

    print("Good data log")

In [None]:
def scrape_cars(nextURL): 
    '''
    Scrape all entries for a type of car (the website has several pages per car type).
    Provide the URL of the first page of a car type; this function iterates over all
    subsequent pages and extracts the details.
    Relies on the global baseURL, which must be defined before the call.
    '''
    counter = 0
    carData = []

    # Loop while there is a next page; stop after 100 pages as a fail-safe (in case of an error)
    while nextURL and counter < 100: 
        print(nextURL)
        counter += 1 # fail-safe 

        # Load page DOM
        res = requests.get(nextURL, timeout=30) 
        res.raise_for_status() # stop on an HTTP error instead of parsing an error page
        soup = BeautifulSoup(res.content, 'html5lib')

        # Container for all cars
        carsContainer = soup.find('div', attrs = {'class': 'itemsContainer'})
        allCars = carsContainer.find_all('div', attrs = {'class': 'item_box nomobileInline'})

        # Get data for every car on the current page
        for car in allCars:    
            individual_url = baseURL + car.div.a['href']
            make = car.div.a.strong.span.get_text()
            year = car.div.a.strong.get_text().strip()[:4]
            model = car.div.a.find('span', attrs = {'itemprop': 'model'}).get_text()
            if model.find(' ') == -1: # cases with only model name and no details
                model_name = model
                model_details = '-'
            else: 
                model_name = model[:model.find(' ')].lower()
                model_details = model[model.find(' '):]
            # The detail spans are read by position: price, kms, transmission
            carInfo = car.find('div', attrs = {'class': 'detail'}).find_all('span')
            price = carInfo[0].get_text().strip()[3:] # be aware of mil 
            kms = carInfo[2].get_text().strip()[:-3]
            transmission = carInfo[4].get_text().strip()
            # Collect all data into a list
            carData.append([make, model_name, model_details, price, kms, year, transmission, individual_url])

        # Get next URL; guard against a page without pagination links
        navigation = soup.find('div', attrs = {'class': 'pagination'})
        navURLs = navigation.find_all('a') if navigation else []
        if navURLs and navURLs[-1].get_text() == "Next page >>>": # ensure 1. we have the "next" button 2. stop when we reach the end
            nextURL = baseURL + navURLs[-1]['href']
        else: 
            break

        time.sleep(3) # trying to not DoS the server 
        print(f'counter: {counter}')
        print('-------')
        
    print('Collected data') 
    return carData 

**4x4 run**

In [5]:
baseURL = "https://www.namcars.net"
headers = ['make', 'model_name', 'model_details', 'price', 'kms', 'year', 'transmission', 'detailsLink']


nextURL = "https://www.namcars.net/4x4"
log_data(headers, scrape_cars(nextURL), './4x4_data.csv')

https://www.namcars.net/4x4
counter: 1
-------
https://www.namcars.net/4x4/sale-2
counter: 2
-------
https://www.namcars.net/4x4/sale-3
counter: 3
-------
https://www.namcars.net/4x4/sale-4
counter: 4
-------
https://www.namcars.net/4x4/sale-5
counter: 5
-------
https://www.namcars.net/4x4/sale-6
counter: 6
-------
https://www.namcars.net/4x4/sale-7
counter: 7
-------
https://www.namcars.net/4x4/sale-8
counter: 8
-------
https://www.namcars.net/4x4/sale-9
counter: 9
-------
https://www.namcars.net/4x4/sale-10
counter: 10
-------
https://www.namcars.net/4x4/sale-11
counter: 11
-------
https://www.namcars.net/4x4/sale-12
counter: 12
-------
https://www.namcars.net/4x4/sale-13
counter: 13
-------
https://www.namcars.net/4x4/sale-14
counter: 14
-------
https://www.namcars.net/4x4/sale-15
counter: 15
-------
https://www.namcars.net/4x4/sale-16
counter: 16
-------
https://www.namcars.net/4x4/sale-17
counter: 17
-------
https://www.namcars.net/4x4/sale-18
counter: 18
-------
https://www.namca

**Now Double cab bakkies**

In [6]:
nextURL = "https://www.namcars.net/Bakkie-Double-Cab"
log_data(headers, scrape_cars(nextURL), './BDC_data.csv')

https://www.namcars.net/Bakkie-Double-Cab
counter: 1
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-2
counter: 2
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-3
counter: 3
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-4
counter: 4
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-5
counter: 5
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-6
counter: 6
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-7
counter: 7
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-8
counter: 8
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-9
counter: 9
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-10
counter: 10
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-11
counter: 11
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-12
counter: 12
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-13
counter: 13
-------
https://www.namcars.net/Bakkie-Double-Cab/sale-14
counter: 14
-------
Collected data
Good data log


**Next, all SUVs**

In [7]:
nextURL = "https://www.namcars.net/SUV"
log_data(headers, scrape_cars(nextURL), './SUV_data.csv')

https://www.namcars.net/SUV
counter: 1
-------
https://www.namcars.net/SUV/sale-2
counter: 2
-------
https://www.namcars.net/SUV/sale-3
counter: 3
-------
https://www.namcars.net/SUV/sale-4
counter: 4
-------
https://www.namcars.net/SUV/sale-5
counter: 5
-------
https://www.namcars.net/SUV/sale-6
counter: 6
-------
https://www.namcars.net/SUV/sale-7
counter: 7
-------
https://www.namcars.net/SUV/sale-8
counter: 8
-------
https://www.namcars.net/SUV/sale-9
counter: 9
-------
https://www.namcars.net/SUV/sale-10
counter: 10
-------
https://www.namcars.net/SUV/sale-11
counter: 11
-------
https://www.namcars.net/SUV/sale-12
counter: 12
-------
https://www.namcars.net/SUV/sale-13
counter: 13
-------
https://www.namcars.net/SUV/sale-14
counter: 14
-------
https://www.namcars.net/SUV/sale-15
counter: 15
-------
https://www.namcars.net/SUV/sale-16
counter: 16
-------
https://www.namcars.net/SUV/sale-17
counter: 17
-------
https://www.namcars.net/SUV/sale-18
counter: 18
-------
https://www.namca

With Bakkie, SUV and 4x4, we have 84% coverage. There may be overlaps, but hopefully nothing too serious.  
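
A listing can appear in more than one category, so one way to handle the overlaps is to merge the scraped rows and keep only the first occurrence of each details link. This is a minimal sketch, not part of the original run: `dedupe_rows` is a hypothetical helper, the sample rows are made up, and it assumes the last column (`detailsLink`) uniquely identifies a listing, which has not been verified against the site.

```python
def dedupe_rows(rows, key_index=-1):
    """Return rows without duplicates, keeping the first occurrence.

    Assumption: the column at key_index (the details link, which is the
    last column in this notebook) uniquely identifies a listing.
    """
    seen = set()
    unique = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows for illustration only, not scraped data
rows_4x4 = [["toyota", "hilux", "/car/1"], ["ford", "ranger", "/car/2"]]
rows_bakkie = [["ford", "ranger", "/car/2"], ["isuzu", "kb", "/car/3"]]

combined = dedupe_rows(rows_4x4 + rows_bakkie)
print(len(combined))
```

In the notebook, the same helper could be applied to the concatenated results of several `scrape_cars` calls before passing them to `log_data`.
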
**Finally, a scrape of all cars**

In [11]:
nextURL = "https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=1&show_filters=false&filters=no"
log_data(headers, scrape_cars(nextURL), './allCars_data.csv')

https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=1&show_filters=false&filters=no
counter: 1
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=2
counter: 2
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=3
counter: 3
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=4
counter: 4
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=5
counter: 5
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=6
counter: 6
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=7
counter: 7
-------
https://www.namcars.net/index.php?module=cars&action=search&order_by=mileage&order_to=desc&page=8
counter: 8
-------
https://www.namcars.net/index.php?

Now we have all data from the website in CSV format. 
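
As a possible first step towards the analysis goal, the CSV files can be read back with the standard library and summarised per make. The sketch below is an assumption about how that step might look, not the analysis actually performed: `count_by_make` is a hypothetical helper, and the sample rows are made up in the same column layout as `headers`.

```python
import csv
import io
from collections import Counter


def count_by_make(csv_file):
    """Count listings per make from an open CSV file with a 'make' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["make"] for row in reader)


# Made-up sample in the same column layout as the scraped files
sample = io.StringIO(
    "make,model_name,model_details,price,kms,year,transmission,detailsLink\n"
    "Toyota,hilux,-,0,0,2015,Manual,/a\n"
    "Toyota,fortuner,-,0,0,2018,Automatic,/b\n"
    "Ford,ranger,-,0,0,2017,Manual,/c\n"
)
counts = count_by_make(sample)
print(counts.most_common())
```

For the real data, the same function would take a file opened with `open('./allCars_data.csv', newline='')` instead of the in-memory sample.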