# Getting Basic Insights of Data

## Loading the Data and Getting an Overview

In [1]:
# Importing Necessary Libraries
import pandas as pd
import numpy as np

In [2]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('car_prices2.csv')

### Understanding the Data
These steps were covered previously, so they are repeated here without further explanation.

In [3]:
# Show the first five rows
df.head()

| | Unnamed: 0 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak- rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |


In [4]:
df.shape

(205, 27)

In [5]:
df.columns

Index(['Unnamed: 0', 'symboling', 'normalized-losses', 'make', 'fuel-type',
       'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
       'engine-location', 'wheel-base', 'length', 'width', 'height',
       'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size',
       'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower',
       'peak- rpm', 'city-mpg', 'highway-mpg', 'price'],
      dtype='object')

### Basic Insights of Data
- Check data types: inspect the dtype of each column.
- Check data distribution: compute summary statistics for the numerical columns in the DataFrame.

In [6]:
df.dtypes

Unnamed: 0             int64
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak- rpm             object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object
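
Several columns that look numeric in `df.head()` (for example `price`, `horsepower`, and `bore`) are reported as `object`, which is consistent with the `?` placeholders visible in the data. The following is a minimal sketch of one way to convert such columns, not a step taken in this notebook. The toy frame and the helper name `to_numeric_columns` are illustrative assumptions, and `errors="coerce"` turns every unparseable string (not only `?`) into NaN.

```python
import pandas as pd

# Toy frame mimicking the '?' placeholders seen in the car data (values are illustrative).
toy = pd.DataFrame({
    "price": ["13495", "?", "16500"],
    "horsepower": ["111", "154", "?"],
    "make": ["audi", "bmw", "audi"],
})

def to_numeric_columns(frame, columns):
    """Return a copy of `frame` with `columns` parsed as numbers (hypothetical helper)."""
    out = frame.copy()
    for col in columns:
        # Strings that cannot be parsed, such as '?', become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

clean = to_numeric_columns(toy, ["price", "horsepower"])
print(clean.dtypes)
print(clean.isna().sum())
```

After a conversion like this, the affected columns would also appear in the numeric summary produced by `describe`.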

For quick summary statistics, use the describe method. For each numerical column it returns the number of non-null values (count), the mean, the standard deviation (std), the minimum and maximum, and the quartile boundaries (25%, 50%, 75%).

In [7]:
df.describe()

| | Unnamed: 0 | symboling | wheel-base | length | width | height | curb-weight | engine-size | compression-ratio | city-mpg | highway-mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 102.0 | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 10.142537 | 25.219512 | 30.75122 |
| std | 59.322565 | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 3.97204 | 6.542142 | 6.886443 |
| min | 0.0 | -2.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 7.0 | 13.0 | 16.0 |
| 25% | 51.0 | 0.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 8.6 | 19.0 | 25.0 |
| 50% | 102.0 | 1.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 9.0 | 24.0 | 30.0 |
| 75% | 153.0 | 2.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 9.4 | 30.0 | 34.0 |
| max | 204.0 | 3.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 23.0 | 49.0 | 54.0 |
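
To make the rows of this summary concrete, the sketch below rebuilds each statistic with the corresponding Series method and compares it with `describe`. The sample values are illustrative and are not taken from the car dataset.

```python
import pandas as pd

# Illustrative values only; not taken from the car dataset.
s = pd.Series([13.0, 19.0, 24.0, 30.0, 49.0], name="city-mpg")

summary = s.describe()

# Each entry of describe() matches a dedicated Series method.
manual = pd.Series({
    "count": s.count(),        # number of non-null values
    "mean": s.mean(),
    "std": s.std(),            # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),   # first quartile
    "50%": s.quantile(0.50),   # median
    "75%": s.quantile(0.75),   # third quartile
    "max": s.max(),
})

# Side-by-side comparison of the two results.
print(pd.DataFrame({"describe": summary, "manual": manual}))
```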


By default, the describe method skips columns that do not contain numbers. To summarize the object-type columns as well, pass the argument include='all'. The output then covers all 27 columns, including the object-typed ones.

In [8]:
df.describe(include='all')

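| | Unnamed: 0 | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak- rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | ... | 205.0 | 205 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205 |
| unique | NaN | NaN | 52 | 22 | 2 | 2 | 3 | 5 | 3 | 2 | ... | NaN | 8 | 39.0 | 37.0 | NaN | 60.0 | 24.0 | NaN | NaN | 187 |
| top | NaN | NaN | ? | toyota | gas | std | four | sedan | fwd | front | ... | NaN | mpfi | 3.62 | 3.4 | NaN | 68.0 | 5500.0 | NaN | NaN | ? |
| freq | NaN | NaN | 41 | 32 | 185 | 168 | 114 | 96 | 120 | 202 | ... | NaN | 94 | 23.0 | 20.0 | NaN | 19.0 | 37.0 | NaN | NaN | 4 |
| mean | 102.0 | 0.834146 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 126.907317 | NaN | NaN | NaN | 10.142537 | NaN | NaN | 25.219512 | 30.75122 | NaN |
| std | 59.322565 | 1.245307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 41.642693 | NaN | NaN | NaN | 3.97204 | NaN | NaN | 6.542142 | 6.886443 | NaN |
| min | 0.0 | -2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 61.0 | NaN | NaN | NaN | 7.0 | NaN | NaN | 13.0 | 16.0 | NaN |
| 25% | 51.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 97.0 | NaN | NaN | NaN | 8.6 | NaN | NaN | 19.0 | 25.0 | NaN |
| 50% | 102.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 120.0 | NaN | NaN | NaN | 9.0 | NaN | NaN | 24.0 | 30.0 | NaN |
| 75% | 153.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 141.0 | NaN | NaN | NaN | 9.4 | NaN | NaN | 30.0 | 34.0 | NaN |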


For object-type columns a different set of statistics is reported: unique, top, and freq. unique is the number of distinct values in the column, top is the most frequently occurring value, and freq is the number of times that value appears. Some cells are shown as NaN, which stands for "not a number", because that statistic cannot be calculated for that column's data type.
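
The same behaviour can be seen on a small frame where the answers are easy to check by eye. This is an illustrative sketch with made-up values; `exclude=["number"]` is shown as one way to restrict the summary to the non-numeric columns.

```python
import pandas as pd

# Made-up values, small enough to verify by hand.
toy = pd.DataFrame({
    "make": ["toyota", "audi", "toyota", "bmw"],
    "city-mpg": [30, 24, 31, 21],
})

# Summary of every column: numeric statistics are NaN for 'make',
# and unique/top/freq are NaN for 'city-mpg'.
full = toy.describe(include="all")
print(full)

# Summary of the non-numeric columns only: count, unique, top, freq.
text_only = toy.describe(exclude=["number"])
print(text_only)
```

Here `make` has three distinct values, `toyota` is the most frequent one, and it appears twice.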

- Another way to summarize the DataFrame is the info method. It prints a concise summary: the index range, the name, non-null count, and dtype of each column, and the memory usage.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         205 non-null    int64  
 1   symboling          205 non-null    int64  
 2   normalized-losses  205 non-null    object 
 3   make               205 non-null    object 
 4   fuel-type          205 non-null    object 
 5   aspiration         205 non-null    object 
 6   num-of-doors       205 non-null    object 
 7   body-style         205 non-null    object 
 8   drive-wheels       205 non-null    object 
 9   engine-location    205 non-null    object 
 10  wheel-base         205 non-null    float64
 11  length             205 non-null    float64
 12  width              205 non-null    float64
 13  height             205 non-null    float64
 14  curb-weight        205 non-null    int64  
 15  engine-type        205 non-null    object 
 16  num-of-cylinders   205 non
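
The Non-Null Count column reports 205 for every column shown, including `normalized-losses`, even though `df.head()` displays `?` entries there. A `?` string is a regular non-null value, so `info` does not flag it. The sketch below, on an illustrative toy frame rather than the real data, shows one way to count such placeholders per column, assuming `?` is the marker used for missing entries.

```python
import pandas as pd

# Toy frame with '?' placeholders (values are illustrative).
toy = pd.DataFrame({
    "normalized-losses": ["?", "164", "?", "158"],
    "price": ["13495", "16500", "?", "17450"],
    "make": ["alfa-romero", "audi", "audi", "audi"],
})

# Same counts that info() reports: every cell holds a string, so none is null.
non_null = toy.notna().sum()

# Count the '?' placeholders per column instead.
placeholder_counts = (toy == "?").sum()

print(pd.DataFrame({"non_null": non_null, "placeholder": placeholder_counts}))
```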

## Performing EDA

### Variable Identification

#### 1. Identify Predictor and Target

In [10]:
# Assumption: 'price' is the target; 'Unnamed: 0' is a leftover index column, not a predictor
predictor_variables = df.drop(columns=['Unnamed: 0', 'price'])
target_variable = df['price']

print(predictor_variables.columns)
print(target_variable.head())

#### 2. Identify the Data Type and Category of the Variables
- The info method shows the stored dtype of each column.
- The dtype alone does not tell us whether a variable is categorical or numerical, so this step relies on reading the data and on domain knowledge:
    - Read the problem and identify the variables described. Note key properties of the variables, such as what types of values they can take.
    - Among the variables identified in step 1, those that take values from a limited set of possibilities with no particular ordering are categorical.
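
Code cannot make this decision on its own, but it can shortlist columns for manual review. The sketch below lists columns with only a few distinct values. The helper name `categorical_candidates` and the threshold `max_unique` are hypothetical choices for illustration, and the toy values are made up. A column such as `symboling`, stored as int64 but limited to a handful of values, is a reminder that the dtype and the variable category can differ.

```python
import pandas as pd

# Made-up values in the style of the car data.
toy = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas", "gas", "gas", "diesel"],
    "num-of-doors": ["two", "four", "four", "two", "four", "four"],
    "curb-weight": [2548, 2823, 2337, 2824, 2507, 3086],
})

def categorical_candidates(frame, max_unique=3):
    """List columns with at most `max_unique` distinct values (hypothetical threshold)."""
    counts = frame.nunique()
    return counts[counts <= max_unique].index.tolist()

# Columns worth checking by hand; the final call still needs domain knowledge.
candidates = categorical_candidates(toy)
print(candidates)
```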

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         205 non-null    int64  
 1   symboling          205 non-null    int64  
 2   normalized-losses  205 non-null    object 
 3   make               205 non-null    object 
 4   fuel-type          205 non-null    object 
 5   aspiration         205 non-null    object 
 6   num-of-doors       205 non-null    object 
 7   body-style         205 non-null    object 
 8   drive-wheels       205 non-null    object 
 9   engine-location    205 non-null    object 
 10  wheel-base         205 non-null    float64
 11  length             205 non-null    float64
 12  width              205 non-null    float64
 13  height             205 non-null    float64
 14  curb-weight        205 non-null    int64  
 15  engine-type        205 non-null    object 
 16  num-of-cylinders   205 non