# Data Preparation

## Objectives
1. Practice identifying variable types in the dataset, determining the target variable defined by the task, and examining variable distributions.
2. Become familiar with the task that will be solved at the modeling stage.
3. Begin exploring the data for the modeling phase.





## Components of the Practical Work
1. Load the dataset and become acquainted with it.
2. Explore the variables within the dataset.
3. Determine the variable types in the dataset.
4. Identify the target variable within the data and examine variable distributions.


## Key Emphasis:
- The program delivers correct answers based on the given dataset.
- Reasons for the chosen solutions are described when necessary.
- Code readability is prioritized: meaningful variable names are used, indentation and spacing rules are adhered to.
- The project repository contains meaningful commits, documenting specific implemented features; branches are named according to their purpose, and unnecessary files are not stored in the repository.



## Task
Starting with this lesson, I begin exploring and preparing the data for the modeling stage.

I will be working with a small sample from a collection of used cars for sale in the United States, presented in the file data/vehicles_dataset.csv. Using this data, I will build the first classification model that determines the price category of a used car based on its characteristics.
In this practical work, I will load the dataset and begin its exploration.

## Dataset Description:
 - id: Record identifier.
 - url: URL of the sales record.
 - region: Region.
 - region_url: Region URL.
 - price: Price.
 - year: Year of manufacture.
 - manufacturer: Manufacturer.
 - model: Model.
 - condition: Condition.
 - cylinders: Number of cylinders.
 - fuel: Fuel type.
 - odometer: Mileage.
 - title_status: Title status.
 - transmission: Transmission type.
 - VIN: Vehicle identification number.
 - drive: Drive type.
 - size: Size.
 - type: Body type.
 - paint_color: Paint color.
 - image_url: Image URL.
 - description: Description.
 - county: County.
 - state: State.
 - lat: Latitude.
 - long: Longitude.
 - posting_date: Posting date of the sales advertisement.
 - price_category: Price category.

In [3]:
# Import necessary libraries
import pandas as pd


# Task 1. Loading the dataset and getting acquainted with it
## What needs to be done

First, it is necessary to load the dataset and become familiar with its characteristics.

1. Load the dataset from 'data/vehicles_dataset.csv' and display it.



In [4]:
# Relative path from the task statement; adjust it if the file is stored elsewhere
df = pd.read_csv('data/vehicles_dataset.csv')
df.head()


|  | id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | ... | type | paint_color | image_url | description | county | state | lat | long | posting_date | price_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7308295377 | https://chattanooga.craigslist.org/ctd/d/chatt... | chattanooga | https://chattanooga.craigslist.org | 54990 | 2020.0 | ram | 2500 crew cab big horn | good |  | ... | pickup | silver | https://images.craigslist.org/00N0N_1xMPvfxRAI... | Carvana is the safer way to buy a car During t... |  | tn | 35.06 | -85.25 | 2021-04-17T12:30:50-0400 | high |
| 1 | 7316380095 | https://newjersey.craigslist.org/ctd/d/carlsta... | north jersey | https://newjersey.craigslist.org | 16942 | 2016.0 | ford | explorer 4wd 4dr xlt |  | 6 cylinders | ... | SUV | black | https://images.craigslist.org/00x0x_26jl9F0cnL... | ***Call Us for more information at: 201-635-14... |  | nj | 40.821805 | -74.061962 | 2021-05-03T15:40:21-0400 | medium |
| 2 | 7313733749 | https://reno.craigslist.org/ctd/d/atlanta-2017... | reno / tahoe | https://reno.craigslist.org | 35590 | 2017.0 | volkswagen | golf r hatchback | good |  | ... | sedan |  | https://images.craigslist.org/00y0y_eeZjWeiSfb... | Carvana is the safer way to buy a car During t... |  | ca | 33.779214 | -84.411811 | 2021-04-28T03:52:20-0700 | high |
| 3 | 7308210929 | https://fayetteville.craigslist.org/ctd/d/rale... | fayetteville | https://fayetteville.craigslist.org | 14500 | 2013.0 | toyota | rav4 |  |  | ... | wagon | white | https://images.craigslist.org/00606_iGe5iXidib... | 2013 Toyota RAV4 XLE 4dr SUV Offered by: R... |  | nc | 35.715954 | -78.655304 | 2021-04-17T10:08:57-0400 | medium |
| 4 | 7316474668 | https://newyork.craigslist.org/lgi/cto/d/baldw... | new york city | https://newyork.craigslist.org | 21800 | 2021.0 | nissan | altima | new | 4 cylinders | ... |  |  | https://images.craigslist.org/00V0V_3pSOiPZ3Sd... | 2021 Nissan Altima Sv with Only 8 K Miles Titl... |  | ny | 40.6548 | -73.6097 | 2021-05-03T18:32:06-0400 | medium |


2. Output the size of the dataset.


In [6]:
print(df.shape)

(10050, 27)
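The shape gives the row count but not whether every row is a distinct listing. A minimal sketch of how that could be checked is shown here on a small made-up table (the `toy` frame and its values are assumptions for illustration, not rows from the dataset); the same calls can be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the listings table (values are made up for illustration).
toy = pd.DataFrame({
    'id': [101, 102, 102, 103, 103],
    'price': [5000, 7000, 7000, 9000, 9500],
})

n_rows = len(toy)
n_unique_ids = toy['id'].nunique()
# Rows that repeat an earlier row in every column.
n_exact_duplicates = int(toy.duplicated().sum())
# Rows whose id already appeared earlier, even if other columns differ.
n_repeated_ids = int(toy.duplicated(subset='id').sum())

print(n_rows, n_unique_ids, n_exact_duplicates, n_repeated_ids)
```

If the row count exceeds the number of distinct `id` values, some listings appear more than once, which is worth knowing before modeling.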


3. Output the list of dataset columns.

In [7]:
df.columns.tolist()

['id',
 'url',
 'region',
 'region_url',
 'price',
 'year',
 'manufacturer',
 'model',
 'condition',
 'cylinders',
 'fuel',
 'odometer',
 'title_status',
 'transmission',
 'VIN',
 'drive',
 'size',
 'type',
 'paint_color',
 'image_url',
 'description',
 'county',
 'state',
 'lat',
 'long',
 'posting_date',
 'price_category']

In [8]:
df.columns[1]

'url'

4. Output descriptive statistics for the entire dataset (be sure to specify the correct parameter for this).

In [9]:
df.describe()

|  | id | price | year | odometer | county | lat | long |
|---|---|---|---|---|---|---|---|
| count | 10050.0 | 10050.0 | 10014.0 | 10007.0 | 0.0 | 9951.0 | 9951.0 |
| mean | 7311544000.0 | 20684.29 | 2010.917815 | 95657.19 |  | 38.590164 | -94.161564 |
| std | 4475414.0 | 124321.6 | 9.697849 | 86579.48 |  | 5.844756 | 18.123096 |
| min | 7208550000.0 | 500.0 | 1915.0 | 0.0 |  | -67.144243 | -158.0693 |
| 25% | 7308193000.0 | 7900.0 | 2008.0 | 38994.5 |  | 34.83 | -110.44715 |
| 50% | 7312756000.0 | 15749.5 | 2013.0 | 88377.0 |  | 39.2851 | -87.9991 |
| 75% | 7315275000.0 | 27990.0 | 2017.0 | 137000.0 |  | 42.42759 | -80.83 |
| max | 7317090000.0 | 12345680.0 | 2022.0 | 3245000.0 |  | 64.9475 | 173.885502 |
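Called without arguments, `describe()` summarizes only the numeric columns, which is why the table above has seven columns. The task asks for statistics on the entire dataset, and the parameter that likely satisfies it is `include='all'`. The sketch below illustrates the difference on a small made-up frame (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy frame with one numeric and one text column (made-up values).
toy = pd.DataFrame({
    'price': [5000, 7000, 9000],
    'fuel': ['gas', 'diesel', 'gas'],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include='all')  # every column

print(list(numeric_summary.columns))
print(list(full_summary.columns))
# For text columns the summary reports count, unique, top, and freq.
print(full_summary.loc[['count', 'unique', 'top', 'freq'], 'fuel'])
```

On the real data the equivalent call would be `df.describe(include='all')`.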


# Task 2. Exploration of Dataset Variables
## What needs to be done

After familiarizing yourself with the dataset, examine the values that each variable takes.

In a loop, print for each column its name, the number of unique values, and the list of its values with their occurrence counts in the dataset.

When displaying information for each characteristic, adhere to the template.



```
Characteristic: id
Number of unique values: 10000
List of values:
7303629857    2
7315995136    2
7316719393    2
7309842734    2
7307971804    2
             ..
7303843163    1
7315223900    1
7309940769    1
7309251820    1
7316428067    1
Name: id, Length: 10000, dtype: int64
```

In [10]:
for column in df.columns:
    print('Characteristic:', column)
    print('Number of unique values:', df[column].nunique())
    print('List of values:')
    print(df[column].value_counts(), '\n')

Characteristic: id
Number of unique values: 10000
List of values:
7316028281    2
7310693445    2
7310816094    2
7307639785    2
7312525382    2
             ..
7308265563    1
7309504220    1
7316238007    1
7311587823    1
7311960763    1
Name: id, Length: 10000, dtype: int64 

Characteristic: url
Number of unique values: 10000
List of values:
https://roswell.craigslist.org/cto/d/artesia-1999-ford-f250-super-duty-super/7316028281.html         2
https://pueblo.craigslist.org/ctd/d/tempe-2017-ford-450-f450-450-drw-lariat/7310693445.html          2
https://monroe.craigslist.org/ctd/d/monroe-2019-infiniti-qx60-luxe/7310816094.html                   2
https://flint.craigslist.org/ctd/d/davison-2010-ford-150-xlt-supercrew/7307639785.html               2
https://charlotte.craigslist.org/cto/d/myrtle-beach-1947-mercury-hotrod-street/7312525382.html       2
                                                                                                    ..
https://mendocino.craigslist.org/
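By default `value_counts()` leaves missing values out, so a column with many gaps can look cleaner than it is. A possible variant of the loop, sketched on a made-up frame, counts missing values explicitly and limits the printed list to the most frequent entries (`toy` and the `TOP_N` cut-off are assumptions for illustration).

```python
import pandas as pd

# Toy frame with gaps in one column (made-up values).
toy = pd.DataFrame({
    'condition': ['good', None, 'good', 'new', None],
    'fuel': ['gas', 'gas', 'diesel', 'gas', 'gas'],
})

TOP_N = 3  # hypothetical cut-off for how many values to print per column

for column in toy.columns:
    # dropna=False keeps missing values as their own row in the counts.
    counts = toy[column].value_counts(dropna=False)
    print('Characteristic:', column)
    print('Number of unique values:', toy[column].nunique())
    print('Missing values:', int(toy[column].isna().sum()))
    print('List of values:')
    print(counts.head(TOP_N), '\n')

# Missing values per column in a single call.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```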

# Task 3. Determining Variable Types in the Dataset
## What needs to be done

After familiarizing yourself with the characteristics, fill in the table indicating the types of some variables. To do this, at the intersection of the variable name and the variable type, mark "X".



| Variable       | Discrete   | Continuous  | Categorical  |
|----------------|------------|-------------|--------------|
| id             | X          |             |              |
| region         |            |             |              |
| year           |            |             |              |
| manufacturer   |            |             |              |
| condition      |            |             |              |
| fuel           |            |             |              |
| odometer       |            |             |              |
| title_status   |            |             |              |
| transmission   |            |             |              |
| VIN            |            |             |              |
| drive          |            |             |              |
| paint_color    |            |             |              |
| state          |            |             |              |
| price_category |            |             |              |


In [11]:
# Classify each column by its pandas dtype:
# integer -> discrete, float -> continuous, string/object -> qualitative.
# This is only a dtype heuristic: 'year' is stored as float because it has
# missing values, so it is marked continuous although it is discrete by nature.
variable_type_dict = {'Variable': [], 'Discrete': [], 'Continuous': [], 'Qualitative': []}
for column in df.columns:
    dtype = df[column].dtype
    if pd.api.types.is_integer_dtype(dtype):
        discrete, continuous, qualitative = 'X', ' ', ' '
    elif pd.api.types.is_string_dtype(dtype):
        discrete, continuous, qualitative = ' ', ' ', 'X'
    elif pd.api.types.is_float_dtype(dtype):
        discrete, continuous, qualitative = ' ', 'X', ' '
    else:
        continue  # columns of any other dtype are left out of the table
    variable_type_dict['Variable'].append(column)
    variable_type_dict['Discrete'].append(discrete)
    variable_type_dict['Continuous'].append(continuous)
    variable_type_dict['Qualitative'].append(qualitative)

df2 = pd.DataFrame(variable_type_dict)
df2.set_index('Variable', inplace=True)
df2

| Variable | Discrete | Continuous | Qualitative |
|---|---|---|---|
| id | X |  |  |
| url |  |  | X |
| region |  |  | X |
| region_url |  |  | X |
| price | X |  |  |
| year |  | X |  |
| manufacturer |  |  | X |
| model |  |  | X |
| condition |  |  | X |
| cylinders |  |  | X |
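The dtype-based table marks `year` as continuous only because pandas stores integer columns with missing values as floats. One possible refinement, sketched here under my own assumptions, treats a float column as discrete when all of its present values are whole numbers. The helper `classify_variable` and the `toy` frame are hypothetical and not part of the assignment.

```python
import pandas as pd

def classify_variable(series: pd.Series) -> str:
    """Return 'Discrete', 'Continuous', or 'Qualitative' (hypothetical helper)."""
    if not pd.api.types.is_numeric_dtype(series):
        return 'Qualitative'
    if pd.api.types.is_integer_dtype(series):
        return 'Discrete'
    values = series.dropna()
    # A float column whose present values are all whole numbers is most
    # likely an integer column that became float because of missing values.
    if len(values) > 0 and (values % 1 == 0).all():
        return 'Discrete'
    return 'Continuous'

# Toy frame (made-up values) mirroring a few column kinds from the dataset.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'year': [2020.0, None, 2013.0],
    'lat': [35.06, 40.82, 33.78],
    'fuel': ['gas', 'diesel', 'gas'],
})

result = {column: classify_variable(toy[column]) for column in toy.columns}
print(result)
```

The rule is still a heuristic: a measurement recorded in whole units would also be labeled discrete, so the final table is best checked by hand against the meaning of each variable.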


# Task 4. Determining the Target Variable in the Data and Variable Distributions
## What needs to be done

1. Based on the task, determine which column contains the target variable.

**Answer:** The target variable is `price_category`.

2. Output the proportion of occurrence in the sample for each value of the target variable.


In [7]:
# Share of each price category among the non-missing values, rounded to 4 decimals
category_shares = (
    df['price_category']
    .value_counts(normalize=True)
    .round(4)
    .rename_axis('Category')
    .reset_index(name='Share')
)
category_shares

|  | Category | Share |
|---|---|---|
| 0 | high | 0.3497 |
| 1 | medium | 0.3278 |
| 2 | low | 0.3226 |


3. Once again, examine the proportion or count of occurrences of each value of the target variable in the sample and write down what distribution it represents. It is possible to determine the distribution of the target variable values in this case even without a graph.

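**Answer:** The distribution is approximately uniform: the three categories have nearly equal shares (about 35%, 33%, and 32%), so the classes are balanced.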
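The conclusion can be backed with a quick numeric check. The sketch below takes the three shares printed above and compares them with a perfectly uniform split; the two measures used (largest deviation from the uniform share and the ratio of the largest to the smallest class) are my own choice of illustration, not something the task prescribes.

```python
import pandas as pd

# Shares copied from the output above.
shares = pd.Series({'high': 0.3497, 'medium': 0.3278, 'low': 0.3226})

# Under a perfectly uniform distribution every class would have this share.
uniform_share = 1 / len(shares)
# Largest absolute gap between an observed share and the uniform share.
max_deviation = (shares - uniform_share).abs().max()
# Ratio of the most frequent class to the least frequent one.
imbalance_ratio = shares.max() / shares.min()

print(round(uniform_share, 4))
print(round(max_deviation, 4))
print(round(imbalance_ratio, 3))
```

No class is more than about two percentage points away from one third, and the largest class is only about 8% bigger than the smallest.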