# Asset Evaluation with Machine Learning
Machine learning models are useful when predicting the price of an asset you haven't seen before. Let's start by checking out restaurant tips data (http://dlsun.github.io/pods/data/tips.csv).

In [None]:
import pandas as pd

tips_df = pd.read_csv("http://dlsun.github.io/pods/data/tips.csv")
tips_df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


Now, let's predict the `tip` from the rest of the data we have access to. Since scikit-learn models such as linear regression expect numerical inputs, we'll need to convert our categorical variables (`sex`, `smoker`, `day`, `time`) into numerical columns. We'll use `sklearn`'s `OneHotEncoder` inside a `ColumnTransformer` (built with `make_column_transformer`) for this.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

X_train = tips_df.drop(columns=["tip"]).copy()   # we need to make sure that we don't include the target variable in our X_train
y_train = tips_df["tip"]

# technically, we don't need a column transformer yet, but it will come in handy soon
ct = make_column_transformer(
    (OneHotEncoder(sparse_output=False), ["sex", "smoker", "day", "time"]),  # give OneHotEncoder our categorical variables
    remainder="passthrough"   # we want the column transformer to retain the data we haven't encoded
).set_output(transform="pandas")

ct

Now that we have our data in a usable format, let's train a `LinearRegression` model on our data. As we decided before, we're going to predict `tip` using the rest of the data at our disposal.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# once again, technically we don't need a pipeline, but it will be handy soon
pipeline = make_pipeline(
    ct,
    LinearRegression()
)

pipeline.fit(X_train, y_train)

You don't need to learn this code, but it's interesting to see the coefficients our model assigned to each encoded feature.

In [None]:
# Get the coefficients from the LinearRegression model in the pipeline
coefficients = pipeline.named_steps['linearregression'].coef_

# Get the feature names after the column transformation
feature_names = pipeline.named_steps['columntransformer'].get_feature_names_out()

# Combine the feature names and coefficients into a DataFrame for easier viewing
coefficients_df = pd.DataFrame(coefficients, index=feature_names, columns=['Coefficient'])

# Display the coefficients
coefficients_df

,Coefficient
onehotencoder__sex_Female,0.01622
onehotencoder__sex_Male,-0.01622
onehotencoder__smoker_No,0.043204
onehotencoder__smoker_Yes,-0.043204
onehotencoder__day_Fri,0.0773
onehotencoder__day_Sat,-0.044159
onehotencoder__day_Sun,0.051819
onehotencoder__day_Thur,-0.08496
onehotencoder__time_Dinner,-0.034064
onehotencoder__time_Lunch,0.034064


Now let's predict how much a female non-smoker in a group of 3 at Sunday dinner will tip on a $17.92 total bill.

In [None]:
# A single new observation; it holds features, so we name it X_new rather than y_test
X_new = pd.DataFrame({
    'total_bill': [17.92],
    'sex': ['Female'],
    'smoker': ['No'],
    'day': ['Sun'],
    'time': ['Dinner'],
    'size': [3]
})
pipeline.predict(X_new)

array([2.99951978])

In [None]:
# score() returns R^2. Measured on the training data itself,
# our model explains about 47% of the variation in tips.
pipeline.score(X_train, y_train)

0.47007812322060794
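
For a regressor, `score` reports R², which compares the model's squared errors to those of always predicting the mean. Here is a minimal sketch of that calculation; the `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))  # both should be about 0.98
```

Keep in mind that a score computed on the training data tends to be optimistic, since the model has already seen those rows.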



---



Now we'll use the Ames Housing dataset (http://dlsun.github.io/pods/data/AmesHousing.txt), predicting `SalePrice` from home features. This time, we're going to use a different regression model: K-nearest neighbors. This model finds the training observations most similar to a new data point and uses the average of their target values as the prediction. Because similarity is measured by distance, scaling our data becomes an important step: think of age (tens of years) vs. income (tens of thousands of dollars), where unscaled income differences would swamp the age differences. We'll also check out cross-validation.

For this model, we'll use `Gr Liv Area`, `Bedroom AbvGr`, and `Full Bath` to predict `SalePrice`.
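
To make the idea concrete, here is a rough NumPy sketch of K-nearest neighbors and of why scaling matters. The four "homes" and the helper `knn_predict` are hypothetical, invented for this illustration; in practice we'd use `KNeighborsRegressor`.

```python
import numpy as np

# Hypothetical training data: [living area in sq ft, bedrooms]
X_small = np.array([
    [1000.0, 2.0],
    [1050.0, 5.0],
    [1500.0, 3.0],
    [2000.0, 4.0],
])
y_small = np.array([100_000.0, 180_000.0, 150_000.0, 200_000.0])
new_home = np.array([1400.0, 5.0])

def knn_predict(X, y, x_new, k):
    """Average the targets of the k training rows closest to x_new."""
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]
    return y[nearest].mean()

# Unscaled: square footage dominates the distance, bedrooms barely count
raw_pred = knn_predict(X_small, y_small, new_home, k=1)

# Standardized: each feature has mean 0 and standard deviation 1
mean, std = X_small.mean(axis=0), X_small.std(axis=0)
scaled_pred = knn_predict(
    (X_small - mean) / std, y_small, (new_home - mean) / std, k=1
)

print(raw_pred, scaled_pred)  # 150000.0 180000.0
```

Without scaling, the nearest neighbor is the 1,500 sq ft home with 3 bedrooms. After scaling, the 5-bedroom home becomes the closest match, so the prediction changes.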

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

df = pd.read_csv("http://dlsun.github.io/pods/data/AmesHousing.txt", delimiter="\t")
df["SalePrice"]  # preview the target column



0       215000
1       105000
2       172000
3       244000
4       189900
         ...  
2925    142500
2926    131000
2927    132000
2928    170000
2929    188000
Name: SalePrice, Length: 2930, dtype: int64
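
One possible way to put the pieces together is sketched below: a pipeline that scales the features and then fits K-nearest neighbors, evaluated with 5-fold cross-validation. To keep the sketch runnable offline it uses a synthetic stand-in frame (`demo`) with the same column names; the synthetic values, `n_neighbors=5`, `cv=5`, and the RMSE scoring choice are all assumptions, not settings from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features = ["Gr Liv Area", "Bedroom AbvGr", "Full Bath"]

# Synthetic stand-in for the Ames data (made-up numbers).
# With the real data, use X_demo = df[features] and y_demo = df["SalePrice"].
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Gr Liv Area": rng.integers(600, 3000, size=n),
    "Bedroom AbvGr": rng.integers(1, 6, size=n),
    "Full Bath": rng.integers(1, 4, size=n),
})
demo["SalePrice"] = (
    100 * demo["Gr Liv Area"]
    + 5_000 * demo["Full Bath"]
    + rng.normal(0, 10_000, size=n)
)

X_demo = demo[features]
y_demo = demo["SalePrice"]

knn_pipeline = make_pipeline(
    StandardScaler(),                    # put all features on the same scale
    KNeighborsRegressor(n_neighbors=5),  # k=5 is an arbitrary starting point
)

# scikit-learn maximizes scores, so RMSE is reported as a negative number
scores = cross_val_score(
    knn_pipeline, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
print(rmse_per_fold.mean())
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so the held-out fold never influences the scaling.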



---

Now, on your own, try to come up with the best model you can for the dataset. Try different combinations of variables, different models, etc.

Hint: If your model includes categorical variables, you'll want to use the `ColumnTransformer` from the first exercise.