# Data Preprocessing

## 1. Load in the dataset and remove unnecessary columns

In [1]:
import pandas as pd


def load_data(path):
    df = pd.read_csv(path)
    df.drop(["listing_id", "indicative_price", "eco_category"], axis=1, inplace=True)
    return df


df_train = load_data("data/train.csv")
df_test = load_data("data/test.csv")

**Description:**

While loading the dataset, we drop the columns that are not useful for the model: `listing_id`, `indicative_price`, and `eco_category`. Typical candidates for removal are columns that hold the same value in every row and columns that hold a distinct value in every row (such as an identifier), since neither helps the model tell listings apart in a meaningful way.
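One way to spot such columns is to count the distinct values per column. The following is a minimal sketch, not the procedure used to choose the columns above; the helper name `find_uninformative_columns` and the toy column names and values are hypothetical.

```python
import pandas as pd


def find_uninformative_columns(df):
    """Return (constant, all_unique) column name lists for a DataFrame."""
    n_rows = len(df)
    # Count distinct values per column, treating NaN as a value of its own
    counts = df.nunique(dropna=False)
    constant = counts[counts <= 1].index.tolist()
    all_unique = counts[counts == n_rows].index.tolist()
    return constant, all_unique


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "row_id": [101, 102, 103],
    "constant_flag": ["x", "x", "x"],
    "make": ["bmw", "bmw", "toyota"],
})

constant, all_unique = find_uninformative_columns(toy)
print("constant:", constant)
print("all unique:", all_unique)
```

Treat the result as a list of candidates only: a continuous numeric column can also be distinct in every row and still be informative.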

## 2. Fill the missing "make" values

In [2]:
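def fill_make(df):
    # Rows whose "make" is missing and has to be inferred from "title"
    missing_make = df["make"].isnull()

    # Known makes; drop NaN first so the string "nan" never becomes a candidate
    make_list = [str(make) for make in df["make"].dropna().unique()]

    def extract_make(title):
        # Lower-cased first word of the title, e.g. "BMW 320i" -> "bmw"
        potential_make = title.split(" ")[0].lower()
        # Return the first known make that contains this word as a substring
        for item in make_list:
            if potential_make in item:
                return item
        # No known make matched: leave the value missing
        return None

    df.loc[missing_make, "make"] = df.loc[missing_make, "title"].apply(extract_make)

    return df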

df_train = fill_make(df_train)
df_test = fill_make(df_test)

**Description:**

To fill the missing "make" values, we rely on the observation that the first one or two words of the "title" column are the make. We first collect all unique, non-missing "make" values in the dataset. For each row with a missing "make", we take the first word of its "title" in lower case and compare it with the known makes. The first known make that contains this word as a substring is used to fill the missing value. Matching on a substring lets a single first word identify a two-word make. If no known make matches, the value stays missing. Because the title word is lower-cased before the comparison, this approach assumes the "make" values are stored in lower case.
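The matching rule can be tried on a small example. This is a sketch under stated assumptions: the toy titles and makes are invented for illustration, and `first_word_to_make` is a hypothetical stand-in for the inner `extract_make` helper.

```python
import pandas as pd

# Hypothetical toy data: two rows with a known make, two rows without
toy = pd.DataFrame({
    "title": ["BMW 320i", "Land Rover Discovery", "Toyota Vios", "Land Rover Defender"],
    "make": ["bmw", "land rover", None, None],
})

# Known makes, excluding missing values
known_makes = [str(m) for m in toy["make"].dropna().unique()]


def first_word_to_make(title):
    """Return the first known make containing the title's first word, else None."""
    first_word = title.split(" ")[0].lower()
    for make in known_makes:
        if first_word in make:
            return make
    return None


missing = toy["make"].isnull()
toy.loc[missing, "make"] = toy.loc[missing, "title"].apply(first_word_to_make)
print(toy)
```

Here "Land Rover Defender" is filled with "land rover", because its first word "land" is a substring of that known make. "Toyota Vios" stays missing, because no row in the toy data has "toyota" as a known make. The substring test is permissive: a very short first word could match an unrelated make, so it is worth inspecting the filled rows.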