ID: An identification number or code assigned to each car listing in the dataset. It is a unique identifier that distinguishes one listing from another.

Price: The asking price of the used car, as set by the seller.

Year: The year the vehicle was manufactured, from which its age can be derived.

Manufacturer: The brand or manufacturer of the car, such as Toyota, Honda, Ford, etc.

Model: The specific model or variant of the car, such as Camry, Civic, Mustang, etc.

Condition: Describes the condition of the car, which can range from new to excellent, good, fair, or salvage. It provides an indication of the overall state of the vehicle.

Cylinders: The number of cylinders in the car's engine. It indicates the engine's configuration and affects the performance of the vehicle.

Fuel: The type of fuel used by the car, such as gasoline, diesel, electric, hybrid, or other alternative fuels.

Odometer: The recorded mileage or distance traveled by the vehicle. It gives an indication of the car's usage and can influence its value.

Title Status: Refers to the status of the car's title or ownership documents, such as clean, salvage, rebuilt, or lien. It provides information about the legal status of the vehicle.

Transmission: Specifies the type of transmission system in the car, such as manual, automatic, or continuously variable transmission (CVT).

Drive: Indicates the drivetrain configuration of the vehicle, such as front-wheel drive (FWD), rear-wheel drive (RWD), all-wheel drive (AWD), or four-wheel drive (4WD).

Size: Describes the size or class of the car, such as compact, mid-size, full-size, SUV, truck, etc.

Type: Refers to the body type or style of the car, such as sedan, coupe, hatchback, SUV, truck, convertible, etc.

Paint Color: Specifies the exterior color of the car, such as black, white, red, blue, etc.

State: Indicates the state or location where the car is listed for sale.

Latitude (Lat) and Longitude (Long): The geographical coordinates of the location where the car is listed. They allow for mapping and geospatial analysis.

Posting Date: The date and time when the car listing was posted or made available for sale.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print('Finished loading libraries')

Finished loading libraries


In [None]:
file_path = "/content/drive/MyDrive/Machine_learning/cars/Cars.csv"
data = pd.read_csv(file_path)

In [None]:
print(data.index)
print(data.dtypes)

RangeIndex(start=0, stop=290129, step=1)
Unnamed: 0        int64
id                int64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
drive            object
size             object
type             object
paint_color      object
state            object
lat             float64
long            float64
posting_date     object
price             int64
dtype: object
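
The `Unnamed: 0` column in the dtypes above usually appears when a CSV was saved together with its DataFrame index. A minimal sketch of that behaviour follows, using a hypothetical two-row CSV held in memory (the `id` values and the `csv_text` name are made up for illustration, not taken from Cars.csv):

```python
import io

import pandas as pd

# Hypothetical two-row CSV that was saved with its index, so the first
# header field is empty. The values are illustrative only.
csv_text = ",id,price\n121610,1001,7995\n395646,1002,14999\n"

# Default parsing turns the unnamed first column into "Unnamed: 0".
with_extra_column = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra_column.columns))

# index_col=0 uses that first column as the row index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(with_index.columns))
```

Whether the stored index is worth keeping depends on the analysis; here it is not used as a feature in any later cell.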


In [None]:
summary_stats = data.describe()
print(summary_stats)

          Unnamed: 0            id           year      odometer  \
count  290129.000000  2.901290e+05  290129.000000  2.901290e+05   
mean   207301.718108  7.311503e+09    2011.359082  9.764241e+04   
std    119595.649164  4.378450e+06       9.149422  2.058970e+05   
min         0.000000  7.301583e+09    1900.000000  0.000000e+00   
25%    103622.000000  7.308154e+09    2008.000000  3.800000e+04   
50%    207440.000000  7.312664e+09    2014.000000  8.561500e+04   
75%    310804.000000  7.315255e+09    2017.000000  1.334360e+05   
max    414469.000000  7.317101e+09    2022.000000  1.000000e+07   

                 lat           long         price  
count  285726.000000  285726.000000  2.901290e+05  
mean       38.505649     -94.616042  5.193300e+04  
std         5.830007      18.319158  9.591680e+06  
min       -84.122245    -159.827728  0.000000e+00  
25%        34.600000    -111.924900  5.991000e+03  
50%        39.170000     -88.212494  1.399000e+04  
75%        42.408400     -80.830
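
The summary above hints at implausible rows: the minimum price is 0, the price standard deviation (about 9.6 million) dwarfs the median (13,990), the odometer reaches 10,000,000, and the earliest year is 1900. One possible way to screen such rows is sketched below. The helper name and every threshold are assumptions chosen for illustration, not values derived from this dataset:

```python
import pandas as pd


def drop_implausible_rows(df, min_price=500, max_price=150_000,
                          max_odometer=500_000, min_year=1950):
    """Keep rows whose price, odometer and year fall inside assumed bounds."""
    mask = (
        df["price"].between(min_price, max_price)
        & (df["odometer"] <= max_odometer)
        & (df["year"] >= min_year)
    )
    return df.loc[mask].copy()


# Small made-up sample that mimics the extremes seen in describe().
sample = pd.DataFrame({
    "price": [0, 7995, 14999, 3_000_000_000],
    "odometer": [41000.0, 214740.0, 10_000_000.0, 85000.0],
    "year": [2017.0, 2007.0, 2008.0, 1900.0],
})
filtered = drop_implausible_rows(sample)
print(filtered)
```

Only the second sample row survives: the others fail the price, odometer, or year bound.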

In [None]:
missing_values = data.isnull().sum()
print("missing values:\n", missing_values)
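# Flag text columns containing any character other than letters, digits, or whitespace.
# The raw string avoids an invalid-escape warning for \s; na=False treats missing values as no match.
weird_symbols = data.select_dtypes(include='object').apply(lambda x: x.str.contains(r'[^A-Za-z0-9\s]', regex=True, na=False).any())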
print("weird symbols:\n", weird_symbols)

missing values:
 Unnamed: 0           0
id                   0
year                 0
manufacturer     11342
model                0
condition       116104
cylinders       119300
fuel              1453
odometer             0
title_status      5066
transmission         0
drive            88087
size            207684
type             62596
paint_color      87113
state                0
lat               4403
long              4403
posting_date         0
price                0
dtype: int64
weird symbols:
 manufacturer     True
model            True
condition       False
cylinders       False
fuel            False
title_status    False
transmission    False
drive           False
size             True
type             True
paint_color     False
state           False
posting_date     True
dtype: bool
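
Raw counts are easier to judge as shares of the 290,129 rows: `size`, for example, is missing in 207,684 of them, roughly 72%. A small sketch of that calculation on a made-up frame (the `missing_share` helper is a hypothetical name):

```python
import pandas as pd


def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Made-up sample; the real notebook would pass `data` instead.
sample = pd.DataFrame({
    "size": [None, None, None, "full-size"],
    "drive": ["rwd", None, "fwd", "4wd"],
    "fuel": ["gas", "gas", "gas", "gas"],
})
share = missing_share(sample)
print(share)
```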


In [None]:
data = data.drop(['id', 'state', 'posting_date', 'lat', 'long'], axis=1)
print(data.head())

   Unnamed: 0    year manufacturer     model  condition    cylinders fuel  \
0      121610  2007.0     infiniti       m45  excellent          NaN  gas   
1      395646  2008.0     cadillac  escalade        NaN          NaN  gas   
2      236366  2003.0        lexus        es        NaN          NaN  gas   
3       66001  2017.0          bmw      320i  excellent          NaN  gas   
4      320855  2008.0       toyota      rav4       good  6 cylinders  gas   

   odometer title_status transmission drive       size   type paint_color  \
0  214740.0        clean    automatic   rwd        NaN  sedan       black   
1  170276.0        clean    automatic   NaN        NaN    SUV       black   
2  176910.0        clean    automatic   fwd        NaN  sedan         NaN   
3   41000.0        clean    automatic   4wd        NaN    SUV         NaN   
4  138900.0        clean    automatic   4wd  full-size    SUV      silver   

   price  
0   7995  
1  14999  
2   5995  
3      0  
4   7995  


In [None]:
print(data.index)
print(data.dtypes)

RangeIndex(start=0, stop=290129, step=1)
Unnamed: 0        int64
year            float64
manufacturer     object
model            object
condition        object
cylinders        object
fuel             object
odometer        float64
title_status     object
transmission     object
drive            object
size             object
type             object
paint_color      object
price             int64
dtype: object


In [None]:
# Drop every row that has at least one missing value.
clean_data = data.dropna()
print(clean_data.head())
print(clean_data.index)
print(clean_data.dtypes)
# Confirm that no missing values remain in the cleaned frame.
missing_values = clean_data.isnull().sum()
print("missing values:\n", missing_values)
print("\ncleaned data = ", clean_data.shape[0])

    Unnamed: 0    year   manufacturer          model  condition    cylinders  \
4       320855  2008.0         toyota           rav4       good  6 cylinders   
9       115665  2016.0           ford  f-150 xlt fx4   like new  6 cylinders   
10      126073  2005.0           ford         escape       fair  4 cylinders   
16       89487  2013.0  mercedes-benz       gl-class  excellent  8 cylinders   
17        4140  2018.0         nissan       frontier       good  4 cylinders   

   fuel  odometer title_status transmission drive       size   type  \
4   gas  138900.0        clean    automatic   4wd  full-size    SUV   
9   gas   40000.0        clean    automatic   4wd  full-size  truck   
10  gas  180000.0        clean    automatic   4wd   mid-size    SUV   
16  gas   79856.0        clean        other   4wd    compact  wagon   
17  gas   14570.0        clean    automatic   rwd  full-size  truck   

   paint_color  price  
4       silver   7995  
9         grey  43000  
10        blue    70
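
`dropna()` discards any row with even one missing field, which is costly when some columns are mostly empty. An alternative, shown here only as a sketch, is to keep the rows and mark missing categories with a placeholder. The helper name and the `"unknown"` label are assumptions:

```python
import pandas as pd


def fill_missing_categories(df, columns, placeholder="unknown"):
    """Return a copy of df with missing values in `columns` replaced."""
    out = df.copy()
    out[columns] = out[columns].fillna(placeholder)
    return out


# Made-up sample with gaps in two categorical columns.
sample = pd.DataFrame({
    "condition": ["excellent", None, "good"],
    "drive": ["rwd", None, "4wd"],
    "price": [7995, 14999, 7995],
})
filled = fill_missing_categories(sample, ["condition", "drive"])
print(filled)
```

All three sample rows are retained, whereas `dropna()` would have kept two.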

In [None]:
data.groupby('manufacturer').size().reset_index(name='count')

   manufacturer  count
0         acura   4108
1    alfa-romeo    616
2  aston-martin     21
3          audi   5185
4           bmw  10174
5         buick   3792
6      cadillac   4817
7     chevrolet  37324
8      chrysler   4121
9        datsun     42


In [None]:
# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning.
clean_data = clean_data.copy()

# Replace each categorical column with its integer category codes.
encoded_columns = ['manufacturer', 'condition', 'cylinders', 'fuel', 'title_status',
                   'drive', 'size', 'type', 'paint_color']
for column in encoded_columns:
    clean_data[column] = clean_data[column].astype('category').cat.codes
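
For reference, `cat.codes` numbers the categories in sorted order, so for string labels the integers carry an alphabetical ordering that has no real meaning. The sketch below shows the codes for a made-up `condition` column, how to recover the code-to-label mapping, and one-hot encoding with `pd.get_dummies` as a possible alternative for linear models:

```python
import pandas as pd

# Made-up values standing in for the `condition` column.
conditions = pd.Series(["good", "excellent", "fair", "good"], dtype="category")

# Integer codes follow the sorted category order: excellent, fair, good.
codes = conditions.cat.codes
mapping = dict(enumerate(conditions.cat.categories))
print(codes.tolist())
print(mapping)

# One-hot encoding avoids implying an order between the categories.
one_hot = pd.get_dummies(conditions, prefix="condition")
print(list(one_hot.columns))
```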

In [None]:
categorical_features = ['manufacturer', 'condition', 'cylinders', 'fuel', 'title_status', 'drive', 'size', 'type', 'paint_color']
X = clean_data[categorical_features]
y = clean_data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
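# Standardize features using statistics from the training split only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Note: price is a continuous target, so LogisticRegression treats every
# distinct price as a separate class, and accuracy counts only exact matches.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)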

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy: 0.048196248196248195
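
The low accuracy is expected: `LogisticRegression` is a classifier, while price is a continuous quantity, so a prediction only counts as correct when it matches the listed price exactly. A regression model scored with an error metric is the more usual setup. The sketch below uses `LinearRegression` on synthetic data (the features, coefficients, and noise level are invented for illustration), so its scores say nothing about how well this dataset can be modelled:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and price column.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 15000 + X @ np.array([3000.0, -2000.0, 500.0]) + rng.normal(scale=100.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a regressor and score it with regression metrics instead of accuracy.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("R^2:", r2)
```

With the notebook's data, `X` and `y` would be the encoded features and `clean_data['price']`; a tree-based regressor may suit integer-coded categories better than a linear model, though that is untested here.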
