# Car Data Analysis
Pandas semantics refer to the underlying meaning and behavior of operations and data structures within the pandas library, particularly concerning how data is handled, accessed, and modified.

### Pandas `Series` and `DataFrame`

`Series` and `DataFrame` are the fundamental data structures in the pandas library; they differ in dimensionality and structure.

**Pandas Series:**
- One-dimensional: A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, Python objects, etc.).
- Homogeneous data: All elements within a Series typically have the same data type.
- Index-labeled: Each element in a Series is associated with a label, called an index, which can be custom-defined (integers, strings, etc.) or automatically generated.
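- Size: The pandas documentation has traditionally described a Series as size-immutable. Its values can be changed in place, and operations that add or remove elements generally return a new Series.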
- Analogous to a single column: A Series can be thought of as a single column of data from a spreadsheet or a SQL table.

**Pandas DataFrame:**
- Two-dimensional: A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It resembles a spreadsheet or a SQL table.
- Heterogeneous data: Columns within a DataFrame can have different data types.
- Rows and columns: DataFrames are organized into rows and columns, each with its own labels (index for rows, column names for columns).
- Mutable: DataFrames are mutable, meaning their size and content can be modified after creation.
- Collection of Series: A DataFrame can be understood as a collection of Series objects, where each column in the DataFrame is a distinct Series.
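
The relationship between the two structures can be sketched on a few made-up rows. The names `prices` and `df` are illustrative only and are not part of the car dataset used later:

```python
import pandas as pd

# A Series: one-dimensional, labeled values (toy data for illustration)
prices = pd.Series([36945, 23820, 26990], index=["MDX", "RSX", "TSX"], name="MSRP")

# A DataFrame: two-dimensional, and each column may have its own dtype
df = pd.DataFrame({
    "Model": ["MDX", "RSX", "TSX"],
    "Cylinders": [6, 4, 4],
    "EngineSize": [3.5, 2.0, 2.4],
})

print(prices.ndim, df.ndim)    # number of dimensions of each object
print(df.dtypes)               # one dtype per column
print(type(df["Cylinders"]))   # selecting a single column yields a Series
```

Selecting one column with `df["Cylinders"]` returns a Series, which is what "a DataFrame is a collection of Series" means in practice.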

In [None]:
# Load the car_data CSV file saved on the local machine.
import pandas as pd

# pd.read_csv() reads the CSV file and returns a DataFrame, stored here in `car`
car = pd.read_csv(r"x:REDACTED\02 - Pandas Project\car_data.csv")

# from google.colab import files
# uploaded = files.upload()
# car = pd.read_csv('car_data.csv')

car.head(7)



Unnamed: 0,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
0,Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6.0,265.0,17.0,23.0,4451.0,106.0,189.0
1,Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2.0,4.0,200.0,24.0,31.0,2778.0,101.0,172.0
2,Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4.0,200.0,22.0,29.0,3230.0,105.0,183.0
3,Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",3.2,6.0,270.0,20.0,28.0,3575.0,108.0,186.0
4,Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6.0,225.0,18.0,24.0,3880.0,115.0,197.0
5,Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,"$46,100","$41,100",3.5,6.0,225.0,18.0,24.0,3893.0,115.0,197.0
6,Acura,NSX coupe 2dr manual S,Sports,Asia,Rear,"$89,765","$79,978",3.2,6.0,290.0,17.0,24.0,3153.0,100.0,174.0


The `.head()` method returns the first n rows of a DataFrame or Series. It is a quick way to inspect the structure and first few records of a large dataset.

- Default Behavior: When called without any arguments, `df.head()` returns the first 5 rows of the DataFrame `df`.
- Specifying Number of Rows: To display a specific number of rows, an integer n can be passed as an argument to the method, like `df.head(n)`. For example, `df.head(10)` will display the first 10 rows.
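
A small sketch of both call styles on a hypothetical 12-row frame (the column name `n` is made up). It also shows a negative `n`, which returns everything except the last `|n|` rows:

```python
import pandas as pd

# Toy frame with 12 rows (hypothetical data)
df = pd.DataFrame({"n": range(12)})

first_five = df.head()           # default: first 5 rows
first_three = df.head(3)         # explicit n
all_but_last_two = df.head(-2)   # negative n: all rows except the last 2

print(len(first_five), len(first_three), len(all_but_last_two))  # 5 3 10
```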

The `.shape` attribute of a DataFrame or Series returns a tuple representing its dimensionality.
- **For a DataFrame:** It returns the tuple `(number_of_rows, number_of_columns)`.

- **For a Series:** It returns the one-element tuple `(number_of_elements,)`, because a Series is one-dimensional.
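
The difference is easy to see on a toy frame (the data and column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.shape)        # (3, 2): rows, columns
print(df["a"].shape)   # (3,): a one-element tuple for a Series

# The tuple can be unpacked directly
rows, cols = df.shape
```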

In [4]:
car.shape
# rows, columns

(432, 15)

Checking for null values

- `.isnull()` detects missing values (`NaN`, `None`, or `NaT`) in a pandas Series or DataFrame. It returns a boolean object (Series or DataFrame) of the same shape as the input, where True marks a missing value and False marks a valid one.

- `.sum()`, when chained after `.isnull()`, adds up the boolean values. True counts as 1 and False as 0 during summation.
    - Applied to a DataFrame, `df.isnull().sum()` returns a Series whose index holds the column names and whose values are the number of missing values in each column.

    - Applied to a Series, `series.isnull().sum()` returns a single integer: the total number of missing values in that Series.

In [5]:
# car.isnull()
car.isnull().sum()

Unnamed: 0,0
Make,4
Model,4
Type,4
Origin,4
DriveTrain,4
MSRP,4
Invoice,4
EngineSize,4
Cylinders,6
Horsepower,4


`isnull()` is a fundamental tool for data preprocessing in pandas, enabling tasks such as:
- Identifying missing data: Quickly locate and understand the extent of missing values in your dataset.
- Filtering data: Select or exclude rows/columns based on the presence or absence of missing values.
- Handling missing values: Prepare data for imputation or removal of missing values.
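
As a rough illustration of the identifying and filtering tasks listed above, the following sketch uses a three-row toy frame. The data is made up, and `notnull()` is the complement of `isnull()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Make": ["Acura", None, "BMW"],
    "Cylinders": [6.0, np.nan, np.nan],
})

# Identify: number of missing values per column
missing_per_column = df.isnull().sum()

# Filter: rows with at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

# Filter: rows with no missing values at all
complete_rows = df[df.notnull().all(axis=1)]

print(missing_per_column)
print(rows_with_missing.index.tolist())  # [1, 2]
print(complete_rows.index.tolist())      # [0]
```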

Because the output above shows null values, let's drop every row that contains one.

- The `.dropna()` method in Pandas is used to remove missing values (**NaN**, **None**, or **NaT**) from a DataFrame or Series. It is a crucial tool for data cleaning and preparation. By default, `.dropna()` removes any row that contains at least one missing value.

- `axis`: Specifies whether to drop rows or columns.
    - `axis=0` (default): Drops rows.
    - `axis=1`: Drops columns.

- `inplace`: If True, modifies the original DataFrame directly and returns None; if False (default), returns a new DataFrame.
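
A minimal sketch of these options on invented data. It also uses `how="all"`, an additional `dropna()` parameter not discussed above, which drops a row only when every value in it is missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Make": ["Acura", "Audi", None],
    "Cylinders": [6.0, np.nan, np.nan],
    "Origin": ["Asia", "Europe", "Europe"],
})

rows_dropped = df.dropna()           # axis=0: only row 0 has no missing value
cols_dropped = df.dropna(axis=1)     # axis=1: only "Origin" has no missing value
all_missing = df.dropna(how="all")   # no row is entirely missing, so nothing is dropped

print(rows_dropped.shape)            # (1, 3)
print(cols_dropped.columns.tolist()) # ['Origin']
print(df.shape)                      # (3, 3): without inplace=True the original is untouched
```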

In [6]:
car.dropna(inplace=True)
car.isnull().sum()

Unnamed: 0,0
Make,0
Model,0
Type,0
Origin,0
DriveTrain,0
MSRP,0
Invoice,0
EngineSize,0
Cylinders,0
Horsepower,0


In [7]:
car.shape
# car.head()

(426, 15)

The `.value_counts()` method returns a Series containing the counts of unique values. It is useful for analyzing the distribution of categorical or discrete data. When called on a DataFrame, it counts unique rows, that is, unique combinations of the column values.

Key features and usage:
- Counts unique values: It counts the occurrences of each distinct value within a Pandas Series (e.g., a column of a DataFrame).
- Returns a Series: The output is a new Pandas Series where the index represents the unique values from the original Series, and the values are their corresponding counts.
- Sorted by default: By default, the resulting Series is sorted in descending order, with the most frequent value appearing first.
- Handles missing values: By default, NaN (Not a Number) values are excluded from the counts. This behavior can be changed using the dropna argument.
- Normalization: The normalize=True argument can be used to return the relative frequencies (proportions) of unique values instead of their raw counts.
- Binning for numeric data: For numeric data, the bins argument can be used to group values into equal-width bins, providing a frequency distribution across ranges instead of individual values.
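
The options listed above can be tried on a small made-up Series. The names `types` and `weights` are illustrative and separate from the `car` DataFrame:

```python
import numpy as np
import pandas as pd

types = pd.Series(["Sedan", "SUV", "Sedan", np.nan, "Sedan", "SUV", "Wagon"])

counts = types.value_counts()                 # NaN excluded, most frequent first
with_nan = types.value_counts(dropna=False)   # NaN gets its own entry
shares = types.value_counts(normalize=True)   # proportions of the non-missing values

weights = pd.Series([2778, 3230, 3575, 3880, 4451])
binned = weights.value_counts(bins=2)         # counts per equal-width bin

print(counts)
print(shares["Sedan"])   # 0.5 (3 of the 6 non-missing values)
print(binned)
```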

In [8]:
# Count how often each distinct value occurs in a column.
# Columns are selected by label, e.g. car['Type'].

# car['Make'].value_counts()

car['Type'].value_counts()

Type,count
Sedan,262
SUV,60
Sports,47
Wagon,30
Truck,24
Hybrid,3


In [9]:
car[['Type','Cylinders']].value_counts() # count each unique (Type, Cylinders) combination

Type,Cylinders,count
Sedan,6.0,120
Sedan,4.0,96
Sedan,8.0,38
SUV,6.0,30
SUV,8.0,22
Sports,6.0,20
Sports,8.0,14
Wagon,4.0,14
Sports,4.0,11
Wagon,6.0,11


In [10]:
# car[['Make','Origin']].value_counts()

car[['Make','Type','Horsepower']].value_counts()

Make,Type,Horsepower,count
Audi,Sedan,220.0,7
BMW,Sedan,225.0,5
BMW,Sedan,184.0,5
Mercedes-Benz,Sedan,215.0,5
Saturn,Sedan,140.0,5
...,...,...,...
Infiniti,Sedan,255.0,1
Infiniti,Sedan,280.0,1
Infiniti,Wagon,280.0,1
Infiniti,Wagon,315.0,1


The `.unique()` method extracts the distinct values from a Series or an Index and returns them in the order of their first appearance.

Key characteristics of `.unique()`:
- Returns a NumPy array (or Categorical/Index): When applied to a Pandas Series or a NumPy array, it returns a numpy.ndarray. If applied to a Categorical dtype, it returns a Categorical object, and if applied to an Index, it returns an Index object.
- Order of appearance: The unique values are returned in the order they first appear in the original Series, not in sorted order.
- Includes **NaN** values: If the Series contains missing values (**NaN**), `.unique()` will include **NaN** in the returned array of unique values.
- Efficiency: It is generally more efficient than numpy.unique for longer sequences, especially when dealing with Pandas Series.
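
A short sketch of the ordering and NaN behavior on invented data. `nunique()` is shown alongside for comparison; by default it excludes NaN from its count:

```python
import numpy as np
import pandas as pd

origin = pd.Series(["Europe", "Asia", "Europe", np.nan, "USA", "Asia"])

distinct = origin.unique()      # order of first appearance; NaN is kept
n_distinct = origin.nunique()   # number of distinct values, NaN excluded by default

print(distinct)
print(len(distinct), n_distinct)  # 4 3
```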

In [11]:
# Select the rows whose Origin is either Asia or Europe
Asia_Europe = car[(car['Origin'] == 'Asia') | (car['Origin'] == 'Europe')]  # Origin is Asia OR Europe
# Asia_Europe is a new, filtered DataFrame

Asia_Europe['Origin'].unique()
# Asia_Europe['Origin'].value_counts()

array(['Asia', 'Europe'], dtype=object)
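
When a column is compared against several allowed values, `Series.isin()` is a common alternative to chaining `==` comparisons with `|`. The sketch below uses a made-up frame named `cars` (not the notebook's `car`) to show that both forms select the same rows:

```python
import pandas as pd

cars = pd.DataFrame({
    "Make": ["Acura", "Audi", "Ford", "BMW"],
    "Origin": ["Asia", "Europe", "USA", "Europe"],
})

# Two equivalent ways to keep rows whose Origin is Asia or Europe
by_or = cars[(cars["Origin"] == "Asia") | (cars["Origin"] == "Europe")]
by_isin = cars[cars["Origin"].isin(["Asia", "Europe"])]

print(by_or.equals(by_isin))     # True
print(by_isin["Make"].tolist())  # ['Acura', 'Audi', 'BMW']
```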

In [12]:
V8_USA = car[(car['Cylinders'] >= 8) & (car['Origin'] == 'USA')]  # 8 or more cylinders AND from the USA

V8_USA['Make'].unique() # unique USA brands with 8 or more cylinders

array(['Cadillac', 'Chevrolet', 'Dodge', 'Ford', 'GMC', 'Hummer',
       'Lincoln', 'Mercury', 'Pontiac'], dtype=object)

In [13]:
car = car[~(car['Weight'] > 4000)]
# Removes every row whose 'Weight' value is greater than 4000 (~ negates the boolean mask).
car['Weight'].max() # maximum value in the Weight column of the filtered DataFrame

3992.0
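
One detail of the negated mask is worth knowing: `~(x > 4000)` and `x <= 4000` treat missing values differently. In this notebook the nulls were already dropped, so both give the same result, but on data that still contains NaN they do not. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

weights = pd.Series([2778.0, 4451.0, np.nan, 3992.0])

# NaN > 4000 is False, so negating the mask keeps the NaN row
kept_by_negation = weights[~(weights > 4000)]

# NaN <= 4000 is also False, so the NaN row is dropped
kept_by_le = weights[weights <= 4000]

print(len(kept_by_negation), len(kept_by_le))  # 3 2
```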

In [14]:
car.shape

(323, 15)

In [15]:
car.head()

Unnamed: 0,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
1,Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2.0,4.0,200.0,24.0,31.0,2778.0,101.0,172.0
2,Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4.0,200.0,22.0,29.0,3230.0,105.0,183.0
3,Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",3.2,6.0,270.0,20.0,28.0,3575.0,108.0,186.0
4,Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6.0,225.0,18.0,24.0,3880.0,115.0,197.0
5,Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,"$46,100","$41,100",3.5,6.0,225.0,18.0,24.0,3893.0,115.0,197.0


Adding **3** to every value in the *MPG_City* column

In [16]:
car['MPG_City'] = car['MPG_City'] + 3 # adds 3 to each element in the 'MPG_City' column
car.head()

Unnamed: 0,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
1,Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2.0,4.0,200.0,27.0,31.0,2778.0,101.0,172.0
2,Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4.0,200.0,25.0,29.0,3230.0,105.0,183.0
3,Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",3.2,6.0,270.0,23.0,28.0,3575.0,108.0,186.0
4,Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6.0,225.0,21.0,24.0,3880.0,115.0,197.0
5,Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,"$46,100","$41,100",3.5,6.0,225.0,21.0,24.0,3893.0,115.0,197.0
