## For this project, I will analyze a dataset on car sales advertisements:

### My aim is to conduct a comprehensive data analysis, exploring the following topics:

### Do customers have any clear preferences for certain types of vehicles?
### Do customers have any preferences for the year of manufacture of the vehicle?
### Do customers have any preferences for the condition of the vehicle?

## The first step is to explore and clean the data.

### With the import command, I imported the pandas library and aliased it as pd.

In [2]:
import pandas as pd


### To read the CSV file, I used the pandas read_csv function. My goal was to find out what information I had.

In [3]:
data=pd.read_csv('C://Users/anuto/.anaconda/vehicles_us.csv')
data.head()

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28


### I wanted to have a quick look at the file structure and content. This output is valuable because it quickly provides an overview of the dataset's size, structure, and completeness. It makes it possible to identify potential issues such as missing values and helps with decisions about data cleaning.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


### The dataset includes information on 51,525 vehicles with 13 different attributes.

### Several columns have missing values, particularly 'model_year', 'cylinders', 'odometer', 'paint_color', and 'is_4wd'. It's crucial to address these missing values before performing detailed analysis or modeling.

### The 'price' column provides information on the pricing of the vehicles. Other columns such as 'model', 'condition', 'fuel', 'transmission', 'type', and 'paint_color' offer categorical information about the vehicles.

In [5]:
print(data.columns)

Index(['price', 'model_year', 'model', 'condition', 'cylinders', 'fuel',
       'odometer', 'transmission', 'type', 'paint_color', 'is_4wd',
       'date_posted', 'days_listed'],
      dtype='object')


### I wanted to know the column names, because they are needed to reference and manipulate specific columns during data analysis. Knowing them also helps to understand the available variables and plan the analysis.


In [6]:
print(data.isna().sum())

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
dtype: int64


### I used this command to count and display the number of missing (NaN) values in each column of the DataFrame.
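Raw counts are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a small made-up frame (the `toy` data and the name `missing_share` are assumptions for illustration, not part of the notebook); the same expression could be applied to `data`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, None, 2013.0, 2003.0],
    "is_4wd": [1.0, 1.0, None, None],
})

# isna() marks missing cells as True; the mean of booleans is the missing fraction.
missing_share = toy.isna().mean().mul(100).round(1)

# Show the columns with the largest share of missing values first.
print(missing_share.sort_values(ascending=False))
```

In the real dataset, 'is_4wd' is missing in 25,953 of 51,525 rows, which is roughly half of all listings.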

In [7]:
columns_rename ={
    'price':'Price',
    'model_year':'Model year',
    'model':'Model',
    'condition':'Condition',
    'cylinders':'Cylinders',
    'fuel' :'Fuel',
    'odometer':'Odometer',
    'transmission':'Transmission',
    'type':'Type',
    'paint_color':'Paint color',
    'is_4wd':'Is 4wb',
    'date_posted': 'Date posted',
    'days_listed':'Days listed'
}
data.rename(columns=columns_rename, inplace=True)
print(columns_rename)

{'price': 'Price', 'model_year': 'Model year', 'model': 'Model', 'condition': 'Condition', 'cylinders': 'Cylinders', 'fuel': 'Fuel', 'odometer': 'Odometer', 'transmission': 'Transmission', 'type': 'Type', 'paint_color': 'Paint color', 'is_4wd': 'Is 4wb', 'date_posted': 'Date posted', 'days_listed': 'Days listed'}


### With this command, I renamed the columns to make them easier to read: underscores were replaced with spaces and the first letter was capitalized.
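The same renaming pattern can also be generated instead of typed by hand. This is only a sketch under the assumption that every column should follow the "underscore to space, first letter capitalized" rule; the empty `toy` frame and the name `pretty` are hypothetical.

```python
import pandas as pd

# Empty toy frame with a few of the original column names.
toy = pd.DataFrame(columns=["price", "model_year", "paint_color", "is_4wd"])

# rename() accepts a function that is applied to every column name.
# It returns a new frame; the original column names stay untouched.
pretty = toy.rename(columns=lambda name: name.replace("_", " ").capitalize())

print(list(pretty.columns))
```

Note that this rule produces 'Is 4wd', whereas the hand-written mapping spells that label 'Is 4wb', so the two approaches differ for that one column.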

In [8]:
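# The columns were already renamed in place in the previous cell,
# so this second call changes nothing; it only confirms the new names.
data = data.rename(columns=columns_rename)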
print(data.columns)

Index(['Price', 'Model year', 'Model', 'Condition', 'Cylinders', 'Fuel',
       'Odometer', 'Transmission', 'Type', 'Paint color', 'Is 4wb',
       'Date posted', 'Days listed'],
      dtype='object')


In [9]:
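# Median odometer reading of comparable vehicles, aligned row by row with the data.
group_median = data.groupby(['Model', 'Model year', 'Type', 'Condition'])['Odometer'].transform('median')
data['Odometer'] = data['Odometer'].fillna(group_median)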
print(data)

       Price  Model year           Model  Condition  Cylinders Fuel  Odometer  \
0       9400      2011.0          bmw x5       good        6.0  gas  145000.0   
1      25500         NaN      ford f-150       good        6.0  gas   88705.0   
2       5500      2013.0  hyundai sonata   like new        4.0  gas  110000.0   
3       1500      2003.0      ford f-150       fair        8.0  gas  233000.0   
4      14900      2017.0    chrysler 200  excellent        4.0  gas   80903.0   
...      ...         ...             ...        ...        ...  ...       ...   
51520   9249      2013.0   nissan maxima   like new        6.0  gas   88136.0   
51521   2700      2002.0     honda civic    salvage        4.0  gas  181500.0   
51522   3950      2009.0  hyundai sonata  excellent        4.0  gas  128000.0   
51523   7455      2013.0  toyota corolla       good        4.0  gas  139573.0   
51524   6300      2014.0   nissan altima       good        4.0  gas  100355.0   

      Transmission    Type 

### The purpose of this operation is to impute missing values in the 'Odometer' column with representative values: each gap is filled with the median odometer reading of vehicles that share the same 'Model', 'Model year', 'Type', and 'Condition'.
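To make the group-median imputation easier to follow, here is a minimal sketch on a made-up frame with only two grouping columns (the `toy` data is an assumption for illustration). `transform('median')` returns one value per row, so it lines up with the original column and can be passed straight to `fillna`.

```python
import pandas as pd

# Toy listings: two models, each with one missing odometer reading.
toy = pd.DataFrame({
    "Model": ["ford f-150", "ford f-150", "ford f-150", "bmw x5", "bmw x5"],
    "Condition": ["good", "good", "good", "good", "good"],
    "Odometer": [100000.0, 120000.0, None, 90000.0, None],
})

# Median per (Model, Condition) group, repeated for every row of that group.
group_median = toy.groupby(["Model", "Condition"])["Odometer"].transform("median")

# Only the missing cells are replaced; existing readings are kept.
toy["Odometer"] = toy["Odometer"].fillna(group_median)
print(toy)
```

A gap can only be filled when its group contains at least one known reading. Rows whose grouping keys are themselves missing (for example a missing 'Model year') may not receive a group median either, which would explain why some 'Odometer' values are still empty after this step.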

In [10]:
# Any odometer gaps left after the group-median step get 0 as a placeholder.
data['Odometer'] = data['Odometer'].fillna(0).round().astype(int)
print(data.head())

   Price  Model year           Model  Condition  Cylinders Fuel  Odometer  \
0   9400      2011.0          bmw x5       good        6.0  gas    145000   
1  25500         NaN      ford f-150       good        6.0  gas     88705   
2   5500      2013.0  hyundai sonata   like new        4.0  gas    110000   
3   1500      2003.0      ford f-150       fair        8.0  gas    233000   
4  14900      2017.0    chrysler 200  excellent        4.0  gas     80903   

  Transmission    Type Paint color  Is 4wb Date posted  Days listed  
0    automatic     SUV         NaN     1.0  2018-06-23           19  
1    automatic  pickup       white     1.0  2018-10-19           50  
2    automatic   sedan         red     NaN  2019-02-07           79  
3    automatic  pickup         NaN     NaN  2019-03-22            9  
4    automatic   sedan       black     NaN  2019-04-02           28  


In [11]:
# Missing model years get 0 as a placeholder so the column can be stored as integers.
data['Model year'] = data['Model year'].fillna(0).round().astype(int)
print(data.head())

   Price  Model year           Model  Condition  Cylinders Fuel  Odometer  \
0   9400        2011          bmw x5       good        6.0  gas    145000   
1  25500           0      ford f-150       good        6.0  gas     88705   
2   5500        2013  hyundai sonata   like new        4.0  gas    110000   
3   1500        2003      ford f-150       fair        8.0  gas    233000   
4  14900        2017    chrysler 200  excellent        4.0  gas     80903   

  Transmission    Type Paint color  Is 4wb Date posted  Days listed  
0    automatic     SUV         NaN     1.0  2018-06-23           19  
1    automatic  pickup       white     1.0  2018-10-19           50  
2    automatic   sedan         red     NaN  2019-02-07           79  
3    automatic  pickup         NaN     NaN  2019-03-22            9  
4    automatic   sedan       black     NaN  2019-04-02           28  


### The fillna(0) method fills the remaining missing values in the 'Odometer' and 'Model year' columns with zeros.

### The round() method then rounds the values in both columns to the nearest whole number, and astype(int) converts them to integers. I wanted every cell in these columns to hold a value, even if that value is the placeholder 0.
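A placeholder of 0 is counted as a real value in later statistics. As a hedged illustration (not what the notebook does), the sketch below compares the 0 placeholder with pandas' nullable integer dtype `Int64`, which keeps the gap as `<NA>` while still storing whole numbers. The `years` series is made up.

```python
import pandas as pd

# Toy column with one missing model year.
years = pd.Series([2011.0, None, 2013.0], name="Model year")

# Approach used in the notebook: 0 stands in for the missing value.
with_zero = years.fillna(0).round().astype(int)

# Alternative: nullable integers keep the value missing.
nullable = years.round().astype("Int64")

print(with_zero.mean())  # the 0 pulls the average down
print(nullable.mean())   # the missing value is skipped
```

Which option fits better depends on the later analysis; with the 0 placeholder, rows where 'Model year' or 'Odometer' equals 0 would need to be filtered out before computing averages or plotting.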

In [13]:
print(data.notna().sum())

Price           51525
Model year      51525
Model           51525
Condition       51525
Cylinders       46265
Fuel            51525
Odometer        51525
Transmission    51525
Type            51525
Paint color     42258
Is 4wb          25572
Date posted     51525
Days listed     51525
dtype: int64


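### After executing the commands, I checked whether the previously empty cells in the columns I wanted to fill now held a value. The output summarizes the non-null values per column: 'Odometer' and 'Model year' are now complete with 51,525 entries each, while 'Cylinders', 'Paint color', and 'Is 4wb' still contain missing values.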
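Because the filled cells now look like ordinary values, `notna()` can no longer tell them apart. A small sketch, on made-up data, of how the number of 0 placeholders per column could be counted (the `toy` frame and `zero_counts` are assumptions for illustration):

```python
import pandas as pd

# Toy frame after the fill step: 0 marks a value that was originally missing.
toy = pd.DataFrame({
    "Odometer": [145000, 0, 110000],
    "Model year": [2011, 0, 0],
})

# Comparing with 0 gives booleans; summing them counts the placeholders per column.
zero_counts = (toy[["Odometer", "Model year"]] == 0).sum()
print(zero_counts)
```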

In [14]:
column_name = 'Transmission'
column_info = data[column_name].describe()

# Use the variable in the message so the label always matches the described column.
print(f"Info about the '{column_name}' column:")
print(column_info)

Info about the 'Transmission' column:
count         51525
unique            3
top       automatic
freq          46902
Name: Transmission, dtype: object


### I wanted to extract and print descriptive statistics about the 'Transmission' column in the DataFrame. 
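For a text column, `describe()` reports only the most frequent value. To see every category, `value_counts()` may be a useful complement. The sketch below uses a made-up series; the category names other than 'automatic' are assumptions, since the notebook output only shows that there are three unique values.

```python
import pandas as pd

# Toy transmission column (categories besides 'automatic' are hypothetical).
transmission = pd.Series(
    ["automatic", "automatic", "manual", "automatic", "other"],
    name="Transmission",
)

# Absolute number of listings per category, most frequent first.
counts = transmission.value_counts()

# Same breakdown as a share of all listings.
shares = transmission.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```

In the real data, 'automatic' appears in 46,902 of 51,525 listings, which is about 91%.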

## Here is a brief list of conclusions:
### The data has gone through initial cleaning and preprocessing steps, which prepares it for further analysis. 'Cylinders', 'Paint color', and 'Is 4wb' still contain missing values.
### The dataset relates to vehicle listings and covers 51,525 advertisements with features such as price, model, condition, and transmission.
### Missing 'Odometer' values were filled with the median of comparable vehicles (same model, model year, type, and condition); any remaining gaps in 'Odometer' and 'Model year' were set to the placeholder 0.
### Descriptive statistics for the 'Transmission' column show three distinct values, with 'automatic' by far the most common (46,902 of 51,525 listings).