# Summarizing Automobile Evaluation Data

## Overview
In this project we apply techniques for summarizing categorical data to a sample from a popular open-source dataset. The dataset contains information on the cost and physical attributes of several thousand cars; the sample analyzed here has 1,000 rows.


## Project Goals
The main goal of this project is to practice methods for analyzing categorical data. It addresses the following questions:
- Which variables in the dataset are categorical?
- For nominal categorical data, what are the frequency distributions and proportions?
- For ordinal categorical data, how are the values encoded, and what are the spread and central tendency?

## Actions

- analyze the data;
- clean up the dataset;
- apply methods for analyzing categorical data;
- draw conclusions from the analysis.

## Data

One dataset is provided for this project.

`car_eval_dataset.csv` contains information on the cost and physical attributes of several thousand cars.

The car evaluation dataset comes from the UCI Machine Learning Repository and has been slightly modified for this project. Specifically, one additional field, `manufacturer_country`, has been simulated for illustrative purposes. You can read more about the details, features, and original research uses of this dataset on the [UCI data description page](https://archive.ics.uci.edu/dataset/19/car+evaluation).

## Analysis

The analysis covers the main methods and calculations for categorical data, namely:
- Distribution of categorical data
- Frequency distribution and proportions for nominal categorical data
- Encoding of ordinal categorical data, distribution and central tendency


In [3]:
# importing necessary libraries
import pandas as pd
import numpy as np

# setting options
pd.set_option('display.max_columns', None)
pd.set_option("display.float_format", "{:.2f}".format)
# None disables column-width truncation
pd.set_option('display.max_colwidth', None)

In [4]:
# load and view data
cars = pd.read_csv('car_eval_dataset.csv')
cars.head()

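|   | buying_cost | maintenance_cost | doors | capacity | luggage | safety | acceptability | manufacturer_country |
|---|-------------|------------------|-------|----------|---------|--------|---------------|----------------------|
| 0 | vhigh | low | 4 | 4 | small | med | unacc | China |
| 1 | vhigh | med | 3 | 4 | small | high | acc | France |
| 2 | med | high | 3 | 2 | med | high | unacc | United States |
| 3 | low | med | 4 | more | big | low | unacc | United States |
| 4 | low | high | 2 | more | med | high | acc | South Korea |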


In [5]:
# check types of each column
cars.dtypes

buying_cost             object
maintenance_cost        object
doors                   object
capacity                object
luggage                 object
safety                  object
acceptability           object
manufacturer_country    object
dtype: object

In [8]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   buying_cost           1000 non-null   object
 1   maintenance_cost      1000 non-null   object
 2   doors                 1000 non-null   object
 3   capacity              1000 non-null   object
 4   luggage               1000 non-null   object
 5   safety                1000 non-null   object
 6   acceptability         1000 non-null   object
 7   manufacturer_country  1000 non-null   object
dtypes: object(8)
memory usage: 62.6+ KB


The dataset has eight columns, all loaded as `object` (string) type with no missing values:
- buying_cost - buying price
- maintenance_cost - cost of maintenance
- doors - number of doors
- capacity - number of people the car can carry
- luggage - size of the luggage boot
- safety - estimated safety of the car
- acceptability - evaluation level (unacceptable, acceptable, good, very good)
- manufacturer_country - country of the manufacturer
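To decide which columns are categorical, and whether they are nominal or ordinal, it can help to list the distinct values in each column. The sketch below is only illustrative: it uses a stand-in DataFrame named `sample` (a hypothetical name) built from the five rows of the `cars.head()` output rather than the full CSV, so the counts differ from those of the real data.

```python
import pandas as pd

# Five rows copied from the cars.head() output; the full CSV is not loaded here.
sample = pd.DataFrame({
    "buying_cost": ["vhigh", "vhigh", "med", "low", "low"],
    "maintenance_cost": ["low", "med", "high", "med", "high"],
    "doors": ["4", "3", "3", "4", "2"],
    "capacity": ["4", "4", "2", "more", "more"],
    "luggage": ["small", "small", "med", "big", "med"],
    "safety": ["med", "high", "high", "low", "high"],
    "acceptability": ["unacc", "acc", "unacc", "unacc", "acc"],
    "manufacturer_country": ["China", "France", "United States",
                             "United States", "South Korea"],
})

# Number of distinct values per column.
unique_counts = sample.nunique()
print(unique_counts)

# Distinct values per column, to judge whether they have a natural order.
for column in sample.columns:
    print(column, sorted(sample[column].unique()))
```

On the real data the same two calls can be run on `cars`. Values such as `low`/`med`/`high` suggest an ordinal variable, while country names have no inherent order and are nominal.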


### Summarizing Manufacturing Country

`manufacturer_country` is a nominal categorical variable that indicates the country of the manufacturer of each car reviewed.

#### Create a table of frequencies of all the cars reviewed by manufacturer country

In [6]:
cars['manufacturer_country'].value_counts()

manufacturer_country
Japan            228
Germany          218
South Korea      159
United States    138
Italy            97 
France           87 
China            73 
Name: count, dtype: int64

Japanese cars are the most frequently reviewed (228 of the 1,000 cars). Italian cars rank fifth, with 97 reviews.

#### Calculate a table of proportions for countries

In [7]:
cars['manufacturer_country'].value_counts(normalize=True)

manufacturer_country
Japan           0.23
Germany         0.22
South Korea     0.16
United States   0.14
Italy           0.10
France          0.09
China           0.07
Name: proportion, dtype: float64

Japanese cars account for about 23% of the dataset.
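Counts and proportions are often easier to read side by side. The following is a minimal sketch of one way to combine them. It rebuilds a stand-in column from the counts reported above instead of reading the CSV, and the names `reported_counts`, `countries`, and `summary` are introduced here purely for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
reported_counts = {
    "Japan": 228, "Germany": 218, "South Korea": 159,
    "United States": 138, "Italy": 97, "France": 87, "China": 73,
}

# Stand-in column with one entry per reviewed car.
countries = pd.Series(
    [country for country, n in reported_counts.items() for _ in range(n)],
    name="manufacturer_country",
)

# One table holding both the frequency and the proportion of each country.
summary = pd.DataFrame({
    "count": countries.value_counts(),
    "proportion": countries.value_counts(normalize=True),
})
summary = summary.sort_values("count", ascending=False)
print(summary)
```

With the real data, `countries` would simply be `cars['manufacturer_country']`.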

### Summarizing Buying Costs

`buying_cost` is a categorical variable that describes the purchase cost of each car in the dataset.

#### The list of the possible values of buying cost

In [12]:
cars['buying_cost'].unique().tolist()

['vhigh', 'med', 'low', 'high']

`buying_cost` is an ordinal categorical variable: its values have a natural order, so we can encode them numerically and compute order-based statistics such as the median. Create a new list, `buying_cost_categories`, that contains the unique values in `buying_cost`, ordered from lowest to highest.

In [13]:
buying_cost_categories = ['low', 'med', 'high', 'vhigh']

Convert `buying_cost` to an ordered `category` type, using `buying_cost_categories` as the order.

In [16]:
cars['buying_cost'] = pd.Categorical(cars['buying_cost'], categories=buying_cost_categories, ordered=True)
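After the conversion, each value is stored as an integer code that follows the declared category order. The sketch below shows that mapping on a small stand-in Series (the five `buying_cost` values from the `cars.head()` output). The names `buying_cost` and `code_by_category` are illustrative only.

```python
import pandas as pd

buying_cost_categories = ["low", "med", "high", "vhigh"]

# buying_cost values from the five rows of cars.head().
buying_cost = pd.Series(
    pd.Categorical(
        ["vhigh", "vhigh", "med", "low", "low"],
        categories=buying_cost_categories,
        ordered=True,
    )
)

# Codes are assigned by position in the category list: low=0, ..., vhigh=3.
code_by_category = {
    category: code for code, category in enumerate(buying_cost.cat.categories)
}
print(code_by_category)
print(buying_cost.cat.codes.tolist())

# Ordered categoricals also support comparisons against a category label.
print((buying_cost > "med").tolist())
```

Any value that is not in the category list becomes missing and receives the code `-1`, so it is worth checking for typos in the labels before converting.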

Calculate the median category of the `buying_cost` variable and print it out

In [22]:
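# median of the integer codes behind the ordered categories
median_buying_cost = np.median(cars['buying_cost'].cat.codes)
print(median_buying_cost)

# map the median code back to its category label
median_buying_cost_category = buying_cost_categories[int(median_buying_cost)]
print(median_buying_cost_category)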

1.0
med
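One caveat about this approach: with an even number of rows, `np.median` averages the two middle codes, so it can return a value such as 1.5 that lies between two categories, and `int()` then truncates it. Here the result is a whole number, so the issue does not arise. The sketch below shows one possible alternative, assuming that taking the lower of the two middle categories is acceptable. It uses a hypothetical four-value sample chosen to trigger the half-code case.

```python
import numpy as np
import pandas as pd

buying_cost_categories = ["low", "med", "high", "vhigh"]

# Hypothetical sample in which the two middle codes differ (1 and 2).
values = pd.Series(
    pd.Categorical(
        ["low", "med", "high", "vhigh"],
        categories=buying_cost_categories,
        ordered=True,
    )
)
codes = values.cat.codes

# np.median averages the two middle codes.
print(np.median(codes))  # 1.5

# Drop missing values (code -1), then take the lower of the two middle codes.
valid_codes = codes[codes >= 0]
median_code = int(valid_codes.quantile(0.5, interpolation="lower"))
median_category = buying_cost_categories[median_code]
print(median_category)
```

Using `interpolation="higher"` instead would pick the upper middle category. Which convention to use is a judgment call rather than a fixed rule.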


### Summarizing Luggage Capacity
