# Lab 6 Tasks

In this notebook we will work with a dataset containing records of car sales from dealerships in different parts of Ireland during a one-year period. Each row in the data represents the sale of a single car, described by the following features:

- *date*: the date when the car sale occurred
- *make*: the make or manufacturer of the car
- *model*: the specific model name of the car
- *year*: the year of the car, which indicates its age
- *transmission*: indicates if the car is manual, automatic, or semi-automatic
- *fuel_type*: specifies the type of fuel used by the car
- *mileage*: the distance (in kilometres) that the car has previously been driven
- *region*: the province in Ireland where the car sale took place
- *sale_amount*: the amount (in euros) for which the car was sold

In [None]:
import pandas as pd

## Task 1 - Data Loading

Load the CSV file "car-sales.csv" into a Pandas DataFrame. Check the number of rows and the column names in the DataFrame.

In [None]:
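# load the CSV file as a Pandas DataFrame
df = pd.read_csv("car-sales.csv")
# check the number of rows and columns, and the column names
print("Data: %d rows, %d columns" % df.shape)
print("Columns:", df.columns.tolist())
# preview the first 10 rows
df.head(10)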

## Task 2 - Handling Missing Values

Check which features in the dataset have missing values, and how many values are missing in each.

In [None]:
# get the number and percentage of missing values
num_missing = df.isnull().sum()
per_missing = num_missing*100.0/len(df)
# turn this into a DataFrame
pd.DataFrame({"# Missing":num_missing, "% Missing": per_missing})
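
To see what this summary reports, here is a minimal sketch on a small toy DataFrame. The values are made up for illustration and are not taken from "car-sales.csv". It also uses `isnull().mean()`, which should give the same percentage as dividing the count by `len(df)`.

```python
import pandas as pd

# toy data (hypothetical values, not from car-sales.csv)
toy = pd.DataFrame({
    "make": ["Ford", None, "Toyota", "Ford"],
    "mileage": [52000, 81000, 30500, 12000],
})

# count the missing values in each column
num_missing = toy.isnull().sum()
# the mean of a boolean mask is the fraction of True values
per_missing = toy.isnull().mean() * 100.0

summary = pd.DataFrame({"# Missing": num_missing, "% Missing": per_missing})
print(summary)
```

Here the *make* column has one missing value out of four rows (25%), while *mileage* has none.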

Apply appropriate data preprocessing to address any issues with missing values.

In [None]:
# fix the columns by filling in the most common value (i.e. the mode)
for col in ["make", "transmission", "fuel_type"]:
    mode = df[col].mode()[0]
    print("Filling in '%s' for missing values in %s" % (mode, col))
    df[col] = df[col].fillna(mode)
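
The mode-based fill above can be illustrated on toy data. This is a minimal sketch with made-up values: `mode()` returns a Series (there can be ties), so `[0]` selects the first most common value, and missing values are ignored when the mode is computed.

```python
import pandas as pd

# toy data (hypothetical values, not from car-sales.csv)
toy = pd.DataFrame({
    "fuel_type": ["Petrol", "Diesel", None, "Petrol", None],
})

# mode() returns a Series because ties are possible; [0] takes the first entry
most_common = toy["fuel_type"].mode()[0]
# replace every missing value with the most common one
toy["fuel_type"] = toy["fuel_type"].fillna(most_common)

print(toy["fuel_type"].tolist())
```

Note that filling with the mode only makes sense for categorical features. For a numeric feature, a different strategy such as the median is usually considered instead.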

In [None]:
# check missing values again to make sure everything is fixed
num_missing = df.isnull().sum()
per_missing = num_missing*100.0/len(df)
# turn this into a DataFrame
pd.DataFrame({"# Missing":num_missing, "% Missing": per_missing})

## Task 3 - Handling Irregular Cardinality

Check which categorical features in the dataset have irregular cardinality, such as several different labels that refer to the same category.

In [None]:
# find the categorical columns
df.info()

In [None]:
# check the value counts for each categorical column
for col in ['make', 'model', 'transmission', 'fuel_type', 'region']:
    display(df[col].value_counts())

Apply appropriate data preprocessing to address any issues with irregular cardinality.

In [None]:
# fix the irregular cardinality for the 'fuel_type' feature
df.loc[df['fuel_type'] == 'P','fuel_type'] = "Petrol"
df.loc[df['fuel_type'] == 'D','fuel_type'] = "Diesel"
df.loc[df['fuel_type'] == 'E','fuel_type'] = "Electric"
df["fuel_type"].value_counts()

In [None]:
# fix the irregular cardinality for the 'transmission' feature
df.loc[df['transmission'] == 'Auto','transmission'] = "Automatic"
df["transmission"].value_counts()
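
As an alternative to one `df.loc` assignment per label, the same clean-up can be sketched with `Series.replace` and a dictionary. The toy labels and the name `fuel_map` below are assumptions chosen for illustration. Values that do not appear as keys in the dictionary are left unchanged.

```python
import pandas as pd

# toy data (hypothetical labels mirroring the abbreviations fixed above)
toy_fuel = pd.Series(["Petrol", "P", "Diesel", "D", "E", "Electric"])

# map each abbreviation to its full label
fuel_map = {"P": "Petrol", "D": "Diesel", "E": "Electric"}
toy_fuel = toy_fuel.replace(fuel_map)

# each category should now appear under a single label
print(toy_fuel.value_counts())
```

Keeping the mapping in one dictionary makes it easier to review which labels are being merged.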

## Task 4 - Data Aggregation

Use data aggregation to analyse how the *total sale amount* for cars sold in the full dataset relates to the *region* in which the sale took place. Sort the regions from highest to lowest total value.

In [None]:
# group the rows by region
region_groups = df.groupby("region")
# calculate the sum for each group
region_totals = region_groups["sale_amount"].sum()
# sort the results
region_totals.sort_values(ascending=False)

Next, use data aggregation to analyse how the *total sale amount* for cars sold in the full dataset relates to car *model*. Sort the models from highest to lowest total value.

In [None]:
# group the rows by model
model_groups = df.groupby("model")
# calculate the sum for each group
model_totals = model_groups["sale_amount"].sum()
# sort the results
model_totals.sort_values(ascending=False)
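
The group, sum, and sort steps used in this task can also be chained into a single expression. Below is a minimal sketch on toy data. The sale amounts are made up for illustration.

```python
import pandas as pd

# toy data (hypothetical sale amounts, not from car-sales.csv)
toy = pd.DataFrame({
    "region": ["Leinster", "Munster", "Leinster", "Connacht", "Munster"],
    "sale_amount": [20000, 15000, 12000, 9000, 8000],
})

# group by region, sum the sale amounts, then sort from highest to lowest
totals = toy.groupby("region")["sale_amount"].sum().sort_values(ascending=False)
print(totals)
```

In this toy example Leinster comes first with a total of 32000, followed by Munster with 23000 and Connacht with 9000.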

## Task 5 - Cross Tabulation

Use cross tabulation to examine the relationship between the *model* and *region* categorical variables.

In [None]:
# compare this pair of categorical variables
pd.crosstab(df["model"], df["region"])

Next, use cross tabulation to examine the relationship between the *transmission* and *fuel_type* categorical variables.

In this case, normalise the values in the cross-tabulation by row.

In [None]:
# compare this pair of categorical variables
pd.crosstab(df["transmission"], df["fuel_type"], normalize="index")
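
To make the effect of `normalize="index"` concrete, here is a minimal sketch on toy data (made-up values) that compares the raw counts with the row-normalised version. With row normalisation, each row is divided by its own total, so the values in every row sum to 1.

```python
import pandas as pd

# toy data (hypothetical values, not from car-sales.csv)
toy = pd.DataFrame({
    "transmission": ["Manual", "Manual", "Manual", "Manual", "Automatic", "Automatic"],
    "fuel_type": ["Petrol", "Petrol", "Petrol", "Diesel", "Diesel", "Electric"],
})

# raw counts for each (transmission, fuel_type) pair
counts = pd.crosstab(toy["transmission"], toy["fuel_type"])
# proportions within each transmission type (each row sums to 1)
shares = pd.crosstab(toy["transmission"], toy["fuel_type"], normalize="index")

print(counts)
print(shares)
```

In this toy example, 3 of the 4 manual cars are petrol, so the "Manual" row shows 0.75 for "Petrol" and 0.25 for "Diesel".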