# CAR SALES ADVERTISEMENT APP DEVELOPMENT

***

#### Purpose of the project: to develop a web application about car sales advertisements and deploy it to a cloud service so that it is accessible to the public.

#### In this project I want to test two hypotheses:
#### - Car price is related to the odometer reading: the most used cars are cheaper than less used ones.
#### - Recent cars are less used than older cars.

#### To test them and support further analysis, I will follow the steps below in sequence.

***

### 1. IMPORT THE LIBRARIES AND DATA

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np


: 

In [None]:
df = pd.read_csv('vehicles_us.csv') #open the data set
print(df)

df.info()

: 

The first step is to import the necessary libraries and open the data file.
The following steps are dedicated to processing the data: cleaning missing values, checking for duplicates and preparing the data frame for further analysis.

***

### 2. PRE-PROCESSING DATA AND CLEANING

In [None]:
print(df['model_year'].isna().sum()) #to count missing values in 'model_year'

: 

In [None]:
df['cylinders'].fillna(0) #to preview 'cylinders' with missing values shown as 0 (the result is not stored)

: 

In [None]:
df.info() #to display what columns have missing values

: 

From the above it is clear that some of the columns have missing values, which will be treated separately in the following cells:
- model_year
- cylinders
- odometer
- paint_color
- is_4wd
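As a complement to `df.info()`, the missing values can be summarized per column as a count and a share of all rows. The following is a minimal sketch on a hypothetical toy data frame (the `toy` rows and the `summary` name are illustrative, not taken from `vehicles_us.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for vehicles_us.csv
toy = pd.DataFrame({
    "price": [9400, 25500, 5500, 1500],
    "model_year": [2011.0, np.nan, 2013.0, 2003.0],
    "paint_color": [None, "white", "red", None],
    "is_4wd": [1.0, 1.0, np.nan, np.nan],
})

# Count of missing values per column and their share of all rows
missing = toy.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "share": (missing / len(toy)).round(2),
})

# Keep only the columns that actually have missing values
summary = summary[summary["missing"] > 0]
print(summary)
```

The same two lines (`isna().sum()` and the division by `len(...)`) should work on the real `df` without changes.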

In [None]:
print(df['cylinders'].isna().sum()) #to count missing values in 'cylinders'

: 

In [None]:
df['cylinders'] = df['cylinders'].fillna(0) #to fill missing values with 0 and store the result


: 

In [None]:
print(df['cylinders'].isna().sum()) #to count the missing values remaining in 'cylinders'

: 

In [None]:
print(df['cylinders']) #to display the 'cylinders' column and check it for remaining missing values

: 

In [None]:
display(df.isnull().sum()) # to display rest of columns with missing values

: 

In [None]:
df.columns #to display if there are typos or other errors on column titles

: 

Here I check whether the column titles need cleaning; in this case they are already clean.

In [None]:
df[df['model_year'].isna()] #to display the rows with missing values in column 'model_year'


: 

In [None]:
columns_to_replace = ['model_year', 'cylinders', 'odometer', 'paint_color', 'is_4wd'] #select columns to modify
for col in columns_to_replace:
    df[col] = df[col].fillna('unknown') #replace missing values with 'unknown'
print(df) #display the data frame once, after all columns are processed

: 

In [None]:
df.isnull().sum() #display of missing values after processing


: 

The missing values in these columns have been replaced.
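Filling every column with the text 'unknown' puts strings into numeric columns such as odometer, which is why later cells have to replace 'unknown' with 0 again. One possible alternative, shown here only as a sketch on hypothetical toy data, is to use a text placeholder for categorical columns and a numeric placeholder for numeric ones (the column split below is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with one numeric and one categorical column
toy = pd.DataFrame({
    "odometer": [145000.0, np.nan, 88705.0],
    "paint_color": ["white", None, "red"],
})

cleaned = toy.copy()

# Text placeholder only where the column holds text
cleaned["paint_color"] = cleaned["paint_color"].fillna("unknown")

# Numeric placeholder (0, as used later in this notebook) for numeric columns
cleaned["odometer"] = cleaned["odometer"].fillna(0)

print(cleaned)
```

With this split the numeric column stays numeric, so no second replacement step is needed before converting it to integers.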

In [None]:
df.duplicated() #checking for duplicate rows in the data frame


: 

In [None]:
df[df.duplicated()] #to display the duplicated rows, if any


: 

In [None]:
df.duplicated().sum() #total number of duplicate rows


: 

There are no duplicate rows in the data frame.

In [None]:
df.model.nunique() #number of distinct model names, as a first check for hidden duplicates


: 

In the next steps I check for non-obvious duplicates in some columns, such as spelling variations of the same car model or brand.

In [None]:
df.model.unique() #checking for hidden duplicates in column 'model'


: 

In [None]:
sorted(df.model.unique()) #sorted list to scan for duplicates in the 'model' column


: 

In [None]:
df.condition.unique() #checking for duplicates in the 'condition' column


: 

In [None]:
df.type.unique() #checking for duplicates in the 'type' column


: 

In [None]:
df.paint_color.unique() #checking for duplicates in the 'paint_color' column


: 

In [None]:
df.fuel.unique() #checking for duplicates in the 'fuel' column


: 

In [None]:
df.transmission.unique() #checking for duplicates in the 'transmission' column


: 

From the above steps I conclude that there are no duplicate values, so the data frame is clean in this respect and I can proceed with the analysis.
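Scanning the unique values by eye can miss variants that differ only in case, spacing or punctuation. A rough way to check for them is to normalize the strings and compare the number of distinct values before and after. The sketch below uses hypothetical model names and one assumed normalization rule (lowercase, strip, drop hyphens); the real column may need different rules:

```python
import pandas as pd

# Hypothetical model names containing spelling variants
models = pd.Series(["ford f-150", "Ford F150", "ford f150 ", "bmw x5"])

# Normalize case, surrounding whitespace and hyphens before comparing
normalized = (
    models.str.lower()
          .str.strip()
          .str.replace("-", "", regex=False)
)

raw_count = models.nunique()           # distinct raw spellings
clean_count = normalized.nunique()     # distinct names after normalization
print(raw_count, clean_count)
```

If the two counts differ, some values in the column are implicit duplicates of each other.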

In [None]:
df.info() #to display the result of handling duplicates and missing values


: 

In [None]:
print(df.model_year.to_string(index=False)) #to inspect the 'model_year' column, which still shows trailing zeros and decimals

: 

In [None]:
df['model_year'] = df['model_year'].replace('unknown', 0) #to replace the string 'unknown' with the integer 0
print(df['model_year'])

: 

In [None]:
df['model_year'] = df['model_year'].astype(int) #to convert all values to integers, removing the trailing zeros and decimals
print(df['model_year'])

: 

In [None]:
print(df.odometer.to_string(index=False)) #to inspect the 'odometer' column before replacing 'unknown' with 0


: 

In [None]:
df['odometer'] = df['odometer'].replace('unknown', 0) #to replace the 'unknown' placeholders with 0
print(df['odometer'])

: 

In [None]:
df['odometer'] = df['odometer'].astype(int) #to convert the column to integers, removing the trailing zeros and decimals
print(df['odometer'])

: 

Now we can see above that the columns have been cleaned and their values are shown in a consistent format.
In short:
- Missing values in some columns have been replaced by "unknown".
- Missing values in model_year, cylinders and odometer have been replaced by zeros.
- The model_year format has been cleaned by removing the trailing zero and the decimal point from the year.
- The odometer format has been cleaned by removing the trailing zero and the decimal point from the number of km.
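Replacing missing years with 0 makes the integer conversion possible, but it also introduces artificial values that later have to be filtered out. As an alternative sketch (not the approach used in this notebook), pandas offers the nullable integer dtype `Int64`, which removes the decimals while keeping missing entries marked as missing. The `years` series below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical model years as they appear after loading: floats with a gap
years = pd.Series([2011.0, np.nan, 2013.0], name="model_year")

# Nullable integer dtype: no ".0" suffix, and the gap stays a missing value
years_int = years.astype("Int64")
print(years_int)
```

With this dtype, comparisons such as `years_int >= 1970` ignore the missing entries rather than treating them as year 0.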

In the next steps I will work on the 'price' column so that I can use it as a parameter for the analysis. To do that, I will remove the outlier values and keep a logical range of prices.

In [None]:
print(df['price'].value_counts()[1]) #to count the rows where 'price' equals 1 (outliers)


: 

The output above shows 798 outlier rows where 'price' equals 1.

Next I will look at the minimum and maximum price in the column.

In [None]:
print(df['price'].max()) #to display the maximum price
print(df['price'].min()) #to display the minimum price


: 

In [None]:
df.sort_values("price") #to sort the data frame by 'price' in ascending order


: 

In [None]:
df = df.drop(df[df['price'] == 1].index) #to remove the rows where 'price' equals 1; the result still shows more outliers
print(df)


: 

Even though I removed the outliers with value 1, I can still see more, so I will remove all values below 200, which I take as the minimum price for the analysis.

In [None]:
df.sort_values("price") #to display the data frame sorted by 'price' in ascending order and spot further outliers


: 

In [None]:
df = df.drop(df[df['price'] < 200].index) #to remove the rows with 'price' below 200
print(df)

: 

In [None]:
df.sort_values("price") #to display the filtered data frame sorted by 'price' in ascending order


: 

Now the 'price' column ranges from a minimum of 200 to a maximum of 375000.
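The two `drop` calls above can also be expressed as a single boolean mask, which some readers find easier to follow. This is a minimal sketch on hypothetical prices; `MIN_PRICE` is an illustrative name for the threshold of 200 chosen in this notebook:

```python
import pandas as pd

# Hypothetical prices, including the kinds of outliers seen above
toy = pd.DataFrame({"price": [1, 150, 200, 9400, 375000]})

MIN_PRICE = 200  # minimum price accepted for the analysis

# Keep only the rows at or above the minimum price
filtered = toy[toy["price"] >= MIN_PRICE].reset_index(drop=True)
print(filtered)
```

Because every price equal to 1 is also below 200, one mask covers both removal steps.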

***

### 3. ANALYSIS OF DATA

In [None]:
df.groupby("condition").size() #to count the cars in each 'condition' group


: 

Most of the cars for sale are in good or excellent condition.

Now I will explore the model_year column to see if there are any meaningful results when compared with price and condition.

In [None]:
df.groupby("model_year").size() #to count the cars for each 'model_year'


: 

I see there are outliers in model_year: some models are more than 100 years old. I will keep only cars from 1970 onward to clean the column. This also drops the rows where a missing model_year was replaced by 0.

In [None]:
df_filtered = df[df['model_year'] >= 1970] #filter the 'model_year' column to disregard outliers
new_df = df_filtered.copy() #independent copy used as the final clean data frame
print(new_df)


: 

FINAL CLEAN DATA FRAME

In [None]:
print(new_df) #to display the clean data frame after processing


: 

***

### 4. VISUALIZATIONS

In [None]:
df_filtered = df[df['model_year'] >= 1960] #keep only models from 1960 onward
#scatter plot comparing the model year with the km run (odometer) for each car
fig3 = px.scatter(df_filtered, x='model_year', y='odometer', title="Comparison of model year (for 1960 and later cars) and odometer")
fig3.show()


: 

VISUALIZATION #1: The plot shows that cars from 1990 to 2010 have the most km run. They are more used than the rest.
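A scatter plot with many points can be hard to read, so one way to back up this observation is to compare the median odometer reading per model year. The sketch below uses hypothetical rows and excludes odometer values of 0, on the assumption that they are the placeholders introduced during cleaning rather than real readings:

```python
import pandas as pd

# Hypothetical rows; odometer 0 marks a missing reading, as in the cleaning step
toy = pd.DataFrame({
    "model_year": [1995, 1995, 2005, 2005, 2018, 2018, 2018],
    "odometer": [210000, 190000, 150000, 170000, 20000, 30000, 0],
})

# Exclude the 0 placeholders so they do not pull the medians down
known = toy[toy["odometer"] > 0]

# Median odometer reading per model year
median_by_year = known.groupby("model_year")["odometer"].median()
print(median_by_year)
```

The median is used here instead of the mean because it is less sensitive to a few extreme readings.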

In [None]:
df_type = df.groupby('type').size() #count the cars of each 'type'
print(df_type)


: 

Top 3 car types are: SUV, sedan and truck.

In [None]:
df_type = df.groupby('fuel').size() #count the cars for each 'fuel' type
print(df_type)


: 

The most common fuel is gas.

In [None]:
df_type = df.groupby('odometer').size() #count the cars for each odometer value (km run)
print(df_type)


: 

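The most frequent odometer value is 0. Note that this group includes the rows whose missing odometer values were replaced by 0 during cleaning.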

In [None]:
df_type = df.groupby('transmission').size()
print(df_type)


: 

In [None]:
# histogram by groups with different colors comparing 'type' of cars and 'condition' of the car
fig = px.histogram(new_df, x="type", color='condition') 
fig.show()

: 

VISUALIZATION #2: Histogram showing the distribution of car types by condition. You can click on the colors in the legend (multiple selection is possible) to choose which conditions are shown in the graph.

In [None]:
# histogram to visualize groups by 'type' of car compared with the 'fuel' they use and the count added up for each one
fig = px.histogram(new_df, x="type", color='fuel') 
fig.show()

: 

VISUALIZATION #3: Histogram showing the distribution of car types by fuel. You can click on the colors in the legend (multiple selection is possible) to choose which fuels are shown in the graph. It is clear that the majority of cars are fueled by gas.
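The counts behind a stacked histogram like this one can also be read as a table, which makes rankings such as "top types per fuel" easier to verify. A minimal sketch with `pd.crosstab` on hypothetical rows (the values are illustrative, not results from the data set):

```python
import pandas as pd

# Hypothetical rows with the two columns used in the histogram
toy = pd.DataFrame({
    "type": ["SUV", "SUV", "sedan", "truck", "truck", "sedan"],
    "fuel": ["gas", "gas", "hybrid", "diesel", "gas", "gas"],
})

# Rows are car types, columns are fuels, cells are the number of cars
counts = pd.crosstab(toy["type"], toy["fuel"])
print(counts)
```

Sorting a single column of this table, for example `counts["gas"].sort_values(ascending=False)`, gives the ranking of car types for that fuel.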

***

### 5. SUMMARY AND CONCLUSIONS

##### After the analysis, I reached the following conclusions about the initial hypotheses:

##### 1. Cars with a high odometer reading are not necessarily old cars: cars from 1990 to 2010 have the most kilometers run. Therefore, they are more used than cars from other years.
##### 2. The odometer is not a crucial factor in determining the price of a car. Price is determined by several parameters together.

##### - Cars from 1990 to 2010 have the most km run (highest odometer values).
##### - Top 4 types of cars in the best condition to sell are: SUV, sedan, pickup and truck.
##### - Top 4 types of cars running on gas are: SUV, sedan, pickup and truck.
##### - Top types of cars running on diesel are: truck and pickup.
##### - Top types of hybrid cars are: sedan and hatchback.
##### - Top types of electric cars are: sedan, truck and hatchback.

##### Most cars are fueled by gas.
##### The most frequent odometer value is 0 (this includes rows whose missing odometer values were replaced by 0).
##### Most cars are automatic.
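The relationship between price and odometer could also be quantified with a correlation coefficient. The sketch below is only an illustration on hypothetical values, not a result from this data set; it uses the Pearson correlation that `Series.corr` computes by default:

```python
import pandas as pd

# Hypothetical prices and odometer readings for five cars
toy = pd.DataFrame({
    "price": [30000, 22000, 15000, 9000, 4000],
    "odometer": [10000, 40000, 80000, 120000, 200000],
})

# Pearson correlation: close to -1 or 1 means a strong linear relationship,
# close to 0 means a weak one
corr = toy["price"].corr(toy["odometer"])
print(round(corr, 2))
```

On the real data, rows with an odometer value of 0 would probably need to be excluded first, since they include the placeholders for missing readings.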




--END OF PROJECT--